GridPP Operations Meeting, Tuesday, 24 June 2014
================================================
http://indico.cern.ch/event/326647/
Minutes: Andrew McNab

Experiment problems/issues
--------------------------
Review of weekly issues by experiment/VO

LHCb: running smoothly. There is a problem with SAM tests of ARC CEs; the workaround is to use a WMS. This would put a small load on the RAL WMS if we go down this route (all LHCb ARC sites are in the UK).

CMS: NTR

ATLAS: Digi+Reco 8 TeV release 19 is close to being validated, with 350 million events to be done. 13 TeV simulation: 270 million events to be done, expected to take 3-4 weeks. Sites shouldn't see a difference. Production system 2 is in test. The Panda/JEDI update is ready for user analysis. LFC->Rucio: the LFC is no longer used, so please report to cloud support any error messages due to LFC references. Reminder: please upgrade cvmfs to 2.1.19; WLCG tickets will follow after 1st July. Checks of HTTP/WebDAV access at UK sites turned up some problems (e.g. QMUL due to StoRM; issues were also identified at UCL, Birmingham and Cambridge - possibly httpd dying on SL5?). Table in the slides: http://indico.cern.ch/event/326647/contribution/1/material/slides/0.pdf (a minimal probe sketch is included further below).

Other:

Meetings & updates
------------------
See comments in https://www.gridpp.ac.uk/wiki/Operations_Bulletin_230614

General updates: GridPP33 registration is open (the meeting is in August 2014).
WLCG ops coordination: we should use the T1/T2 slots to raise issues. Shoal vs other proxy discovery methods: it is not yet clear which direction things are going in.
Tier-1 status
Storage and data management
Accounting
Documentation
Interoperation
Monitoring
On-duty
Rollout
Security: there will be a new release of the CA RPMs next week.
Services: perfSONAR boxes need to be rebooted after the yum update so that the new kernel is used.
Tickets
Tools
VOs
Site updates

Operational plans & changes over coming years
---------------------------------------------
All EGI NGIs have received a request to present a summary of plans for the coming 1-2 years at an operations management meeting this Thursday. This should cover plans for the infrastructure and service deployments. Please could you help add to this list:
* Gradual migration to IPv6 (or dual stack)
* Increasing usage of DIRAC (if it meets VO needs) - with a probable consequent decrease in WMS use
* Enablement of more resources under cloud/VM interfaces (and federation)
* Move away from Torque/Maui (probably also to ARC CE)
* Resistance to adding more services, and a desire to remove existing ones (e.g. the local APEL DB)

DIRAC progress
--------------
* See also the Tools section of the Bulletin
* Who has used it?
* Feedback/experiences
* Getting more people involved: it still needs more people to be involved. Several sites are interested, but have not yet started pursuing it as they intend to.

AOB
---
* Dissemination updates (see Bulletin)
* Reminder: HEPSYSMAN security challenge debrief this afternoon (4pm): https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=mdZ6gy1wDiWq
* Durham is likely to have ~2 weeks of downtime in the next 2-3 weeks, as the cluster is being moved between machine rooms.

Chat log
--------
Jeremy Coles: (24/06/2014 10:59) Andrew is taking minutes today. We will start in the next few minutes....
Matt Doidge: (11:06 AM) There are a lot of CMS tickets - but all seem to be being handled.
Daniela Bauer: (11:09 AM) I can't hear anyone. This might take longer.
Jeremy Coles: (11:10 AM) Present: Alessandra, Andy, Andrew L, Andrew M, Chris, Dan, David C, Elena, Ewan S, Gang, Gareth R, Gareth S, Govind, Jeremy, John B, John H, Kashif, Mark S, Matt D, Matt RB, Matt W, Raja, Robert, Rob, Sam, rf? + Daniela. and Ewan M... Vidyo doesn't make this easy.
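On the ATLAS HTTP/WebDAV checks mentioned in the minutes above: a minimal probe of a site's WebDAV door is a single PROPFIND request made with a grid proxy. The sketch below is illustrative only - the endpoint URL, proxy path and CA directory are placeholder assumptions, not the targets or the tool ATLAS actually used.

    # Hedged sketch: probe an HTTP/WebDAV storage endpoint with PROPFIND.
    # All paths and URLs are hypothetical placeholders.
    import requests

    PROXY = "/tmp/x509up_u12345"                      # assumed grid proxy location
    URL = "https://se.example.ac.uk/dpm/example.ac.uk/home/atlas/"  # placeholder endpoint

    resp = requests.request(
        "PROPFIND",
        URL,
        cert=(PROXY, PROXY),                          # proxy file holds both cert and key
        verify="/etc/grid-security/certificates",     # CA directory (assumed present)
        headers={"Depth": "1"},
        timeout=30,
    )
    # 207 Multi-Status means the WebDAV interface answered; a 4xx/5xx or a
    # connection error is the sort of problem the ATLAS table flags.
    print(resp.status_code)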
Daniela Bauer: (11:12 AM) I can hear Elena now.
Christopher John Walker: (11:12 AM) How urgent is the WebDAV testing? I can push the StoRM developers a bit.
Daniela Bauer: (11:12 AM) Whether my microphone works remains to be seen. I don't think CMS has anything to report.
Samuel Cadellin Skipsey: (11:13 AM) (So the issue with the reliability of the WebDAV service on DPM is known and with the developers) (Wahid pushed it, again, earlier this week)
Christopher John Walker: (11:18 AM) I don't think you even need the restart.
Elena Korolkova: (11:19 AM) I'll check Glasgow Frontier.
John Bland: (11:20 AM) The logs for Liverpool Frontier are pretty hefty as well (a gig or two a day).
Ewan Mac Mahon: (11:20 AM) I believe the main points were that you have to be 'fully' on a 2.1.x by the deadline. It is 'recommended' to have the latest for general bugfixes and bits, but it's not a requirement for the change CERN are making.
Christopher John Walker: (11:20 AM) QMUL is on 2.1.19.
Daniela Bauer: (11:21 AM) We are on 2.1.19.
raul: (11:21 AM) Brunel on 2.1.19.
Elena Korolkova: (11:21 AM) we are on 2.1.19.
John Bland: (11:21 AM) Liverpool on 2.1.19.
John Hill: (11:21 AM) Cambridge on 2.1.19.
Ewan Steele: (11:21 AM) We're updated.
Govind: (11:21 AM) RHUL on 2.1.19.
Matt Doidge: (11:22 AM) Lancaster upgraded.
Ewan Mac Mahon: (11:22 AM) As Alessandra says, the scripts do a 'reload', but I believe (and this is the bit that's still slightly sketchy) that it won't be fully using the new code until the filesystems are unmounted and remounted (cf. upgrading glibc on a running system). But if you're on 2.1.x and you do the hot patch upgrade to 2.1.19, and you still have some nodes that haven't done the unmount/remount by the deadline, that's still actually fine. The only critical problem would be still running 2.0.x, I think.
Elena Korolkova: (11:25 AM) @Dave Crooks: http://wlcg-squid-monitor.cern.ch/snmpstats/mrtgatlas/UKI-SCOTGRID-GLASGOW/index.html doesn't show a high increase in load. Can you send me a link, please?
Christopher John Walker: (11:28 AM) If you can remind us (me?) of the meeting I might come along.
David Crooks: (11:29 AM) Elena: I'll drop you an email after the meeting; we don't have a direct link. Thanks.
Elena Korolkova: (11:29 AM) ok. Thanks
Ewan Mac Mahon: (11:42 AM) We tend to do reviews of other meetings after they've happened; maybe we should do a bit more previewing - so in each Tuesday ops meeting, consider the other meetings coming up in the week and whether there's anything we want to raise at them.
Ewan Steele: (11:43 AM) Durham are surprised, as we thought our accounting was now working.
Christopher John Walker: (11:44 AM) Which table?
Jeremy Coles: (11:44 AM) https://www.egi.eu/earlyAdopters/table
Govind: (11:45 AM) RHUL accounting looks like it has not updated due to last week's downtime. I will be looking into this.
Jeremy Coles: (11:46 AM) Thanks
John Hill: (11:47 AM) Doing a rolling update even as we meet...
Matt Raso-Barnett: (11:47 AM) To anyone with Lustre: does Lustre build on the latest kernel, or is a patch required? I haven't managed to look at this yet... :(
John Bland: (11:47 AM) sorry, was just away - what's the big update?
Christopher John Walker: (11:47 AM) The patchless client builds just fine.
Matt Raso-Barnett: (11:47 AM) great, thanks
Christopher John Walker: (11:47 AM) I recommend a patch for a bug I submitted.
Ewan Mac Mahon: (11:48 AM) https://access.redhat.com/security/cve/CVE-2014-3153 ^ RH advisory for the kernel bug.
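On the CVE-2014-3153 kernel update being discussed here (and the perfSONAR reboots mentioned later): a quick way to see whether a node still needs its reboot is to compare the running kernel with the newest installed kernel package. The snippet below is a hedged sketch of that check for an RPM-based node, not an existing GridPP script.

    # Hedged sketch: is the running kernel the newest one installed?
    # If not, the "yum update + reboot" is only half done.
    import platform
    import subprocess

    running = platform.release()          # e.g. "2.6.32-431.20.3.el6.x86_64"

    out = subprocess.check_output(
        ["rpm", "-q", "kernel", "--qf", "%{VERSION}-%{RELEASE}.%{ARCH}\n"]
    )
    # Naive lexical sort of installed kernel versions; good enough for a sketch.
    newest = sorted(out.decode().split())[-1]

    if running == newest:
        print("OK: running newest installed kernel", running)
    else:
        print("Reboot pending: running", running, "but", newest, "is installed")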
Matt Doidge: (11:48 AM) I might not be on that list - I'll get onto signing up.
Ewan Mac Mahon: (11:49 AM) Update available for 6; 5 not vulnerable.
Matt Doidge: (11:49 AM) Oh - scratch that, I am.
Ewan Mac Mahon: (11:49 AM) Given that it's a kernel update, it's a yum update and a reboot.
John Bland: (11:49 AM) thanks, Ewan
Duncan Rand: (11:50 AM) https://maddash.aglt2.org/WLCGperfSONAR/check_mk/index.py?start_url=%2FWLCGperfSONAR%2Fcheck_mk%2Fview.py%3Fview_name%3Dhostgroup%26hostgroup%3DUK
Ewan Mac Mahon: (11:50 AM) The reboot will also deal with any lingering cvmfs mounts too, so that's quite nice.
David Crooks: (11:52 AM) http://stackoverflow.com/questions/20407292/centos-another-mysql-daemon-already-running-with-the-same-unix-socket
Ewan Mac Mahon: (11:57 AM) It's interesting that IC's cloud instances went direct, not via a shoal-advertised squid, as the ones at Oxford did. The ones at Oxford picked a wildly inappropriate shoal-advertised squid, but they did use one. Incidentally, and not that I've seen (or looked for) any evidence of this, but the Oxford squid is probably now the network-local shoal-advertised squid for Imperial too. Which might have an impact.
Duncan Rand: (11:58 AM) Come to the ATLAS Thursday meeting.
Matt Doidge: (11:58 AM) From the ticket I think Shoal wasn't yet installed - could be wrong though.
Ewan Mac Mahon: (11:59 AM) There are two bits of shoal though - the problem at Oxford was that the ATLAS images had the 'look for a squid' bit, but the squid didn't have the 'advertise a squid' bit. I'd imagine (?) that the ATLAS images are the same at both sites.
Matt Doidge: (12:01 PM) Just a quick Grid Engine head count - Lancaster, Sussex, Edinburgh, IC and QM?
Christopher John Walker: (12:02 PM) http://aipanda024.cern.ch:25880/2014-06-24/UKI-SOUTHGRID-SUSX_SL6-10539/4360002.0.log CREAM error: reason=255 - which I think means the job has been killed by the batch system.
Daniela Bauer: (12:03 PM) I'm going to wait until the real patch comes out.
Duncan Rand: (12:04 PM) http://aipanda023.cern.ch:25880/2014-06-24/UKI-SOUTHGRID-SUSX_SL6-10539/4142961.0.log
Daniela Bauer: (12:04 PM) I've got a temporary hack which should work until then.
Duncan Rand: (12:08 PM) Matt, can you find any trace of the job Chris or I listed in your batch system...
Matt Raso-Barnett: (12:09 PM) hi Duncan, yes, I can see an error now: job 3279600 exceeds job soft limit "s_vmem" of queue "grid.q@node101.cm.cluster" (4892622848.00000 > limit:4194304000.00000) - sending SIGXCPU. That isn't the same job, but they all seem to be reporting this. I'm just looking at why this limit is being hit, as nothing to my knowledge has changed with any of our queue configuration in months...
Duncan Rand: (12:12 PM) I think that's the problem Chris had..
Matt Raso-Barnett: (12:12 PM) sorry if you are typing stuff to me in chat directly, I can't see the private chat window -- there is something up with my Vidyo client.
Matt Doidge: (12:12 PM) Did your /usr/libexec/sge_local_submit_attributes.sh change?
Christopher John Walker: (12:13 PM) yes - I'll dig it out.
Duncan Rand: (12:13 PM) I'm using this public one now..
Matt Doidge: (12:13 PM) Or perhaps your publishing, which is attracting memory-hungry jobs like locusts.
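For context on the numbers in the SGE message above (just unit arithmetic on the quoted figures, not a claim about the Sussex configuration beyond what is quoted): the s_vmem soft limit of 4194304000 bytes is exactly 4000 MiB, while the job's reported vmem of 4892622848 bytes is about 4666 MiB, so a job sized against a nominal 4 GB memory figure can still trip that limit.

    # Hedged sketch: convert the figures quoted in the SGE error to MiB/GB.
    MIB = 1024.0 ** 2

    limit_bytes = 4194304000       # s_vmem soft limit from the SGE message
    job_vmem_bytes = 4892622848    # vmem reported for the failing job

    print("limit : %.0f MiB (%.2f GB)" % (limit_bytes / MIB, limit_bytes / 1e9))
    print("job   : %.0f MiB (%.2f GB)" % (job_vmem_bytes / MIB, job_vmem_bytes / 1e9))
    print("over  : %.0f MiB" % ((job_vmem_bytes - limit_bytes) / MIB))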
Duncan Rand: (12:13 PM) http://panda.cern.ch/server/pandamon/query?tp=queue&id=UKI-SOUTHGRID-SUSX_SL6 - 4 GB is being advertised.
Matt Raso-Barnett: (12:14 PM) hi Matt, I think it would have, since this is a completely rebuilt CREAM node. 4 GB is correct for our nodes, so that is good.
Samuel Cadellin Skipsey: (12:15 PM) the SNOPLUS problem is solved with CMVFS, isn't it?
Matt Raso-Barnett: (12:15 PM) the SGE queue has no limits on vmem at the queue level.
Samuel Cadellin Skipsey: (12:15 PM) ...CVMFS, even. (The point is that we really don't need cloud images to solve most VOs' problems with environment.)
Duncan Rand: (12:16 PM) cream_attributes = CERequirements = "other.GlueHostMainMemoryRAMSize > 4000 && other.GlueHostPolicyMaxWallClockTime >= 4320";
Andrew McNab: (12:16 PM) What Ewan said.
John Bland: (12:16 PM) better hope cvmfs doesn't break then.
Elena Korolkova: (12:17 PM) http://aipanda017.cern.ch:25880/2014-06-24/UKI-SOUTHGRID-SUSX_SL6-10539/3603246.0.log
    000 (3603246.000.000) 06/24 12:45:13 Job submitted from host: ...
    027 (3603246.000.000) 06/24 12:59:03 Job submitted to grid resource
        GridResource: cream grid-cream-02.hpc.susx.ac.uk:8443/ce-cream/services/CREAM2 sge grid.q
        GridJobId: cream https://grid-cream-02.hpc.susx.ac.uk:8443/ce-cream/services/CREAM2 CREAM240682095
    ...
    001 (3603246.000.000) 06/24 13:00:46 Job executing on host: cream grid-cream-02.hpc.susx.ac.uk:8443/ce-cream/services/CREAM2 sge grid.q
    ...
    009 (3603246.000.000) 06/24 13:05:10 Job was aborted by the user.
        CREAM error: reason=255
    ...
Samuel Cadellin Skipsey: (12:17 PM) (and, as people repeatedly ignore: there are a) overheads to VMs that cannot be entirely removed, and b) no one serious, even in cloud services, is using VMs anymore.)
Duncan Rand: (12:17 PM) at QMUL the memory is set to memory=3500
Christopher John Walker: (12:18 PM) "Everything in CVMFS". Is there a way of deploying RPMs into cvmfs, or does someone have an entire SL distribution in CVMFS so the VOs don't need to do that? Duncan: at QMUL the memory limit is ignored by the batch system.
Matt Doidge: (12:20 PM) Chris - there's scope for something along those lines with the WN-in-cvmfs stuff.
Duncan Rand: (12:21 PM) If anyone has rebooted their perfSONAR hosts please let me know.
John Hill: (12:21 PM) I did a few minutes ago.
Ewan Mac Mahon: (12:21 PM) Mine both were a day or so ago.
Christopher John Walker: (12:21 PM) Matt: I think that recompiling all of SL would be a lot of effort - for very little gain, I suspect.
Matt Raso-Barnett: (12:21 PM) I did when you mentioned it earlier in this meeting.
John Bland: (12:22 PM) Liverpool PS rebooted about half an hour ago and seemingly working.
Duncan Rand: (12:22 PM) Yes, all those sites are OK..
Samuel Cadellin Skipsey: (12:22 PM) Chris: well, you probably couldn't "simply" unpack an RPM into a cvmfs, but you could cpio the RPM files into a directory structure for a CVMFS repo, and you'd only have to do it once.
Duncan Rand: (12:23 PM) Sussex: OK.
Andrew McNab: (12:24 PM) CernVM3 works this way. It's SL6 RPMs via cvmfs, with a copy-on-write partition so you can install extras inside the VM - i.e. the root partition via cvmfs works that way.
Jeremy Coles: (12:27 PM) https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=mdZ6gy1wDiWq
Matt Doidge: (12:29 PM) When Lancaster moved we used professional movers and they were worth every penny (luckily we didn't have to pay for them).
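Following up Sam's suggestion above about cpio-ing RPM contents into a directory structure for a CVMFS repo: a rough sketch of that unpacking step is below. The repository path and package name are hypothetical, and the surrounding cvmfs_server transaction/publish steps are not shown - this is an illustration of the idea, not a tested recipe.

    # Hedged sketch: unpack RPM payloads into a directory tree that could be
    # published as (part of) a CVMFS repository. Paths/packages are examples.
    import os
    import subprocess

    REPO_DIR = "/cvmfs/example.gridpp.ac.uk/sl6"    # hypothetical repository area
    RPMS = ["glibc-2.12-1.132.el6.x86_64.rpm"]      # example package list

    if not os.path.isdir(REPO_DIR):
        os.makedirs(REPO_DIR)

    for rpm in RPMS:
        # rpm2cpio writes the package payload as a cpio archive to stdout;
        # cpio -idm unpacks it under REPO_DIR, creating directories as needed.
        rpm2cpio = subprocess.Popen(["rpm2cpio", os.path.abspath(rpm)],
                                    stdout=subprocess.PIPE)
        subprocess.check_call(["cpio", "-idm", "--quiet"],
                              stdin=rpm2cpio.stdout, cwd=REPO_DIR)
        rpm2cpio.stdout.close()
        rpm2cpio.wait()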