GridPP Operations Meeting, Tuesday, 24 June 2014
================================================
http://indico.cern.ch/event/326647/
Minutes: Andrew McNab

Experiment problems/issues
--------------------------
Review of weekly issues by experiment/VO

LHCb: running smoothly. There is a problem with SAM tests of ARC CEs; the workaround is to use a WMS. This would put a small load on the RAL WMS if we go down this route (all LHCb ARC sites are in the UK).

CMS: NTR

ATLAS: Digi+Reco 8 TeV release 19 is close to being validated, with 350 million events to be done. 13 TeV simulation: 270 million events to be done, expected to take 3-4 weeks. Sites shouldn't see a difference. Production system 2 is in test. The Panda/JEDI update is ready for user analysis. LFC->Rucio: the LFC is no longer used, so please report to cloud support any error messages due to LFC references. Reminder: please upgrade cvmfs to 2.1.19; WLCG tickets will follow after 1st July. Checks of HTTP/WebDAV access at UK sites turned up some problems (e.g. QMUL due to StoRM; issues were also identified at UCL, Birmingham and Cambridge - possibly httpd dying on SL5?). Table in the slides: http://indico.cern.ch/event/326647/contribution/1/material/slides/0.pdf (a minimal probe sketch is included further below).

Other:

Meetings & updates
------------------
See comments in https://www.gridpp.ac.uk/wiki/Operations_Bulletin_230614

General updates: GridPP33 registration is open (the meeting is in August 2014).
WLCG ops coordination: we should use the T1/T2 slots to raise issues. Shoal vs other proxy discovery methods: it is not yet clear which direction things are going in.
Tier-1 status
Storage and data management
Accounting
Documentation
Interoperation
Monitoring
On-duty
Rollout
Security: there will be a new release of the CA RPMs next week.
Services: perfSONAR boxes need to be rebooted after the yum update so that the new kernel is used.
Tickets
Tools
VOs
Site updates

Operational plans & changes over coming years
---------------------------------------------
All EGI NGIs have received a request to present a summary of plans for the coming 1-2 years at an operations management meeting this Thursday. This should cover plans for the infrastructure and service deployments. Please could you help add to this list:
* Gradual migration to IPv6 (or dual stack)
* Increasing usage of DIRAC (if it meets VO needs) - with a probable consequent decrease in WMS use
* Enablement of more resources under cloud/VM interfaces (and federation)
* Move away from Torque/Maui (probably also to ARC CE)
* Resistance to adding more services, and a desire to remove existing ones (e.g. the local APEL DB)

DIRAC progress
--------------
* See also the Tools section of the Bulletin
* Who has used it?
* Feedback/experiences
* Getting more people involved: it still needs more people to be involved. Several sites are interested, but have not yet started pursuing it as they intend to.

AOB
---
* Dissemination updates (see Bulletin)
* Reminder: HEPSYSMAN security challenge debrief this afternoon (4pm): https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=mdZ6gy1wDiWq
* Durham is likely to have ~2 weeks of downtime in the next 2-3 weeks, as the cluster is being moved between machine rooms.

Chat log
--------
Jeremy Coles: (24/06/2014 10:59) Andrew is taking minutes today. We will start in the next few minutes....
Matt Doidge: (11:06 AM) There are a lot of CMS tickets - but all seem to be being handled.
Daniela Bauer: (11:09 AM) I can't hear anyone. This might take longer.
Jeremy Coles: (11:10 AM) Present: Alessandra, Andy, Andrew L, Andrew M, Chris, Dan, David C, Elena, Ewan S, Gang, Gareth R, Gareth S, Govind, Jeremy, John B, John H, Kashif, Mark S, Matt D, Matt RB, Matt W, Raja, Robert, Rob, Sam, rf? + Daniela. and Ewan M... Vidyo doesn't make this easy.
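On the ATLAS HTTP/WebDAV checks mentioned in the minutes above: a minimal probe of a site's WebDAV door is a single PROPFIND request made with a grid proxy. The sketch below is illustrative only - the endpoint URL, proxy path and CA directory are placeholder assumptions, not the targets or the tool ATLAS actually used.

    # Hedged sketch: probe an HTTP/WebDAV storage endpoint with PROPFIND.
    # All paths and URLs are hypothetical placeholders.
    import requests

    PROXY = "/tmp/x509up_u12345"                      # assumed grid proxy location
    URL = "https://se.example.ac.uk/dpm/example.ac.uk/home/atlas/"  # placeholder endpoint

    resp = requests.request(
        "PROPFIND",
        URL,
        cert=(PROXY, PROXY),                          # proxy file holds both cert and key
        verify="/etc/grid-security/certificates",     # CA directory (assumed present)
        headers={"Depth": "1"},
        timeout=30,
    )
    # 207 Multi-Status means the WebDAV interface answered; a 4xx/5xx or a
    # connection error is the sort of problem the ATLAS table flags.
    print(resp.status_code)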
Daniela Bauer: (11:12 AM) I can hear Elena now.
Christopher John Walker: (11:12 AM) How urgent is the WebDAV testing? I can push the StoRM developers a bit.
Daniela Bauer: (11:12 AM) Whether my microphone works remains to be seen. I don't think CMS has anything to report.
Samuel Cadellin Skipsey: (11:13 AM) (So the issue with the reliability of the WebDAV service on DPM is known and with the developers) (Wahid pushed it, again, earlier this week)
Christopher John Walker: (11:18 AM) I don't think you even need the restart.
Elena Korolkova: (11:19 AM) I'll check Glasgow Frontier.
John Bland: (11:20 AM) The logs for Liverpool Frontier are pretty hefty as well (a gig or two a day).
Ewan Mac Mahon: (11:20 AM) I believe the main points were that you have to be 'fully' on a 2.1.x by the deadline. It is 'recommended' to have the latest for general bugfixes and bits, but it's not a requirement for the change CERN are making.
Christopher John Walker: (11:20 AM) QMUL is on 2.1.19.
Daniela Bauer: (11:21 AM) We are on 2.1.19.
raul: (11:21 AM) Brunel on 2.1.19.
Elena Korolkova: (11:21 AM) we are on 2.1.19.
John Bland: (11:21 AM) Liverpool on 2.1.19.
John Hill: (11:21 AM) Cambridge on 2.1.19.
Ewan Steele: (11:21 AM) We're updated.
Govind: (11:21 AM) RHUL on 2.1.19.
Matt Doidge: (11:22 AM) Lancaster upgraded.
Ewan Mac Mahon: (11:22 AM) As Alessandra says, the scripts do a 'reload', but I believe (and this is the bit that's still slightly sketchy) that it won't be fully using the new code until the filesystems are unmounted and remounted (cf. upgrading glibc on a running system). But if you're on 2.1.x and you do the hot patch upgrade to 2.1.19, and you still have some nodes that haven't done the unmount/remount by the deadline, that's still actually fine. The only critical problem would be still running 2.0.x, I think.
Elena Korolkova: (11:25 AM) @Dave Crooks: http://wlcg-squid-monitor.cern.ch/snmpstats/mrtgatlas/UKI-SCOTGRID-GLASGOW/index.html doesn't show a high increase in load. Can you send me a link, please?
Christopher John Walker: (11:28 AM) If you can remind us (me?) of the meeting I might come along.
David Crooks: (11:29 AM) Elena: I'll drop you an email after the meeting; we don't have a direct link. Thanks.
Elena Korolkova: (11:29 AM) ok. Thanks
Ewan Mac Mahon: (11:42 AM) We tend to do reviews of other meetings after they've happened; maybe we should do a bit more previewing - so in each Tuesday ops meeting, consider the other meetings coming up in the week and whether there's anything we want to raise at them.
Ewan Steele: (11:43 AM) Durham are surprised, as we thought our accounting was now working.
Christopher John Walker: (11:44 AM) Which table?
Jeremy Coles: (11:44 AM) https://www.egi.eu/earlyAdopters/table
Govind: (11:45 AM) RHUL accounting looks like it has not updated due to last week's downtime. I will be looking into this.
Jeremy Coles: (11:46 AM) Thanks
John Hill: (11:47 AM) Doing a rolling update even as we meet...
Matt Raso-Barnett: (11:47 AM) To anyone with Lustre: does Lustre build on the latest kernel, or is a patch required? I haven't managed to look at this yet... :(
John Bland: (11:47 AM) sorry, was just away - what's the big update?
Christopher John Walker: (11:47 AM) The patchless client builds just fine.
Matt Raso-Barnett: (11:47 AM) great, thanks
Christopher John Walker: (11:47 AM) I recommend a patch for a bug I submitted.
Ewan Mac Mahon: (11:48 AM) https://access.redhat.com/security/cve/CVE-2014-3153 ^ RH advisory for the kernel bug.
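On the CVE-2014-3153 kernel update being discussed here (and the perfSONAR reboots mentioned later): a quick way to see whether a node still needs its reboot is to compare the running kernel with the newest installed kernel package. The snippet below is a hedged sketch of that check for an RPM-based node, not an existing GridPP script.

    # Hedged sketch: is the running kernel the newest one installed?
    # If not, the "yum update + reboot" is only half done.
    import platform
    import subprocess

    running = platform.release()          # e.g. "2.6.32-431.20.3.el6.x86_64"

    out = subprocess.check_output(
        ["rpm", "-q", "kernel", "--qf", "%{VERSION}-%{RELEASE}.%{ARCH}\n"]
    )
    # Naive lexical sort of installed kernel versions; good enough for a sketch.
    newest = sorted(out.decode().split())[-1]

    if running == newest:
        print("OK: running newest installed kernel", running)
    else:
        print("Reboot pending: running", running, "but", newest, "is installed")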
Matt Doidge: (11:48 AM) I might not be on that list - I'll get onto signing up.
Ewan Mac Mahon: (11:49 AM) Update available for 6; 5 not vulnerable.
Matt Doidge: (11:49 AM) Oh - scratch that, I am.
Ewan Mac Mahon: (11:49 AM) Given that it's a kernel update, it's a yum update and a reboot.
John Bland: (11:49 AM) thanks, Ewan
Duncan Rand: (11:50 AM) https://maddash.aglt2.org/WLCGperfSONAR/check_mk/index.py?start_url=%2FWLCGperfSONAR%2Fcheck_mk%2Fview.py%3Fview_name%3Dhostgroup%26hostgroup%3DUK
Ewan Mac Mahon: (11:50 AM) The reboot will also deal with any lingering cvmfs mounts too, so that's quite nice.
David Crooks: (11:52 AM) http://stackoverflow.com/questions/20407292/centos-another-mysql-daemon-already-running-with-the-same-unix-socket
Ewan Mac Mahon: (11:57 AM) It's interesting that IC's cloud instances went direct, not via a shoal-advertised squid, as the ones at Oxford did. The ones at Oxford picked a wildly inappropriate shoal-advertised squid, but they did use one. Incidentally, and not that I've seen (or looked for) any evidence of this, but the Oxford squid is probably now the network-local shoal-advertised squid for Imperial too. Which might have an impact.
Duncan Rand: (11:58 AM) Come to the ATLAS Thursday meeting.
Matt Doidge: (11:58 AM) From the ticket I think Shoal wasn't yet installed - could be wrong though.
Ewan Mac Mahon: (11:59 AM) There are two bits of shoal though - the problem at Oxford was that the ATLAS images had the 'look for a squid' bit, but the squid didn't have the 'advertise a squid' bit. I'd imagine (?) that the ATLAS images are the same at both sites.
Matt Doidge: (12:01 PM) Just a quick Grid Engine head count - Lancaster, Sussex, Edinburgh, IC and QM?
Christopher John Walker: (12:02 PM) http://aipanda024.cern.ch:25880/2014-06-24/UKI-SOUTHGRID-SUSX_SL6-10539/4360002.0.log CREAM error: reason=255 - which I think means the job has been killed by the batch system.
Daniela Bauer: (12:03 PM) I'm going to wait until the real patch comes out.
Duncan Rand: (12:04 PM) http://aipanda023.cern.ch:25880/2014-06-24/UKI-SOUTHGRID-SUSX_SL6-10539/4142961.0.log
Daniela Bauer: (12:04 PM) I've got a temporary hack which should work until then.
Duncan Rand: (12:08 PM) Matt, can you find any trace of the job Chris or I listed in your batch system...
Matt Raso-Barnett: (12:09 PM) hi Duncan, yes, I can see an error now: job 3279600 exceeds job soft limit "s_vmem" of queue "grid.q@node101.cm.cluster" (4892622848.00000 > limit:4194304000.00000) - sending SIGXCPU. That isn't the same job, but they all seem to be reporting this. I'm just looking at why this limit is being hit, as nothing to my knowledge has changed with any of our queue configuration in months...
Duncan Rand: (12:12 PM) I think that's the problem Chris had..
Matt Raso-Barnett: (12:12 PM) sorry if you are typing stuff to me in chat directly, I can't see the private chat window -- there is something up with my Vidyo client.
Matt Doidge: (12:12 PM) Did your /usr/libexec/sge_local_submit_attributes.sh change?
Christopher John Walker: (12:13 PM) yes - I'll dig it out.
Duncan Rand: (12:13 PM) I'm using this public one now..
Matt Doidge: (12:13 PM) Or perhaps your publishing, which is attracting memory-hungry jobs like locusts.
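For context on the numbers in the SGE message above (just unit arithmetic on the quoted figures, not a claim about the Sussex configuration beyond what is quoted): the s_vmem soft limit of 4194304000 bytes is exactly 4000 MiB, while the job's reported vmem of 4892622848 bytes is about 4666 MiB, so a job sized against a nominal 4 GB memory figure can still trip that limit.

    # Hedged sketch: convert the figures quoted in the SGE error to MiB/GB.
    MIB = 1024.0 ** 2

    limit_bytes = 4194304000       # s_vmem soft limit from the SGE message
    job_vmem_bytes = 4892622848    # vmem reported for the failing job

    print("limit : %.0f MiB (%.2f GB)" % (limit_bytes / MIB, limit_bytes / 1e9))
    print("job   : %.0f MiB (%.2f GB)" % (job_vmem_bytes / MIB, job_vmem_bytes / 1e9))
    print("over  : %.0f MiB" % ((job_vmem_bytes - limit_bytes) / MIB))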
Duncan Rand: (12:13 PM) http://panda.cern.ch/server/pandamon/query?tp=queue&id=UKI-SOUTHGRID-SUSX_SL6 - 4 GB is being advertised.
Matt Raso-Barnett: (12:14 PM) hi Matt, I think it would have, since this is a completely rebuilt CREAM node. 4 GB is correct for our nodes, so that is good.
Samuel Cadellin Skipsey: (12:15 PM) the SNOPLUS problem is solved with CMVFS, isn't it?
Matt Raso-Barnett: (12:15 PM) the SGE queue has no limits on vmem at the queue level.
Samuel Cadellin Skipsey: (12:15 PM) ...CVMFS, even. (The point is that we really don't need cloud images to solve most VOs' problems with environment.)
Duncan Rand: (12:16 PM) cream_attributes = CERequirements = "other.GlueHostMainMemoryRAMSize > 4000 && other.GlueHostPolicyMaxWallClockTime >= 4320";
Andrew McNab: (12:16 PM) What Ewan said.
John Bland: (12:16 PM) better hope cvmfs doesn't break then.
Elena Korolkova: (12:17 PM) http://aipanda017.cern.ch:25880/2014-06-24/UKI-SOUTHGRID-SUSX_SL6-10539/3603246.0.log
    000 (3603246.000.000) 06/24 12:45:13 Job submitted from host: ...
    027 (3603246.000.000) 06/24 12:59:03 Job submitted to grid resource
        GridResource: cream grid-cream-02.hpc.susx.ac.uk:8443/ce-cream/services/CREAM2 sge grid.q
        GridJobId: cream https://grid-cream-02.hpc.susx.ac.uk:8443/ce-cream/services/CREAM2 CREAM240682095
    ...
    001 (3603246.000.000) 06/24 13:00:46 Job executing on host: cream grid-cream-02.hpc.susx.ac.uk:8443/ce-cream/services/CREAM2 sge grid.q
    ...
    009 (3603246.000.000) 06/24 13:05:10 Job was aborted by the user.
        CREAM error: reason=255
    ...
Samuel Cadellin Skipsey: (12:17 PM) (and, as people repeatedly ignore: there are a) overheads to VMs that cannot be entirely removed, and b) no one serious, even in cloud services, is using VMs anymore.)
Duncan Rand: (12:17 PM) at QMUL the memory is set to memory=3500
Christopher John Walker: (12:18 PM) "Everything in CVMFS". Is there a way of deploying RPMs into cvmfs, or does someone have an entire SL distribution in CVMFS so the VOs don't need to do that? Duncan: at QMUL the memory limit is ignored by the batch system.
Matt Doidge: (12:20 PM) Chris - there's scope for something along those lines with the WN-in-cvmfs stuff.
Duncan Rand: (12:21 PM) If anyone has rebooted their perfSONAR hosts please let me know.
John Hill: (12:21 PM) I did a few minutes ago.
Ewan Mac Mahon: (12:21 PM) Mine both were a day or so ago.
Christopher John Walker: (12:21 PM) Matt: I think that recompiling all of SL would be a lot of effort - for very little gain, I suspect.
Matt Raso-Barnett: (12:21 PM) I did when you mentioned it earlier in this meeting.
John Bland: (12:22 PM) Liverpool PS rebooted about half an hour ago and seemingly working.
Duncan Rand: (12:22 PM) Yes, all those sites are OK..
Samuel Cadellin Skipsey: (12:22 PM) Chris: well, you probably couldn't "simply" unpack an RPM into a cvmfs, but you could cpio the RPM files into a directory structure for a CVMFS repo, and you'd only have to do it once.
Duncan Rand: (12:23 PM) Sussex: OK.
Andrew McNab: (12:24 PM) CernVM3 works this way. It's SL6 RPMs via cvmfs, with a copy-on-write partition so you can install extras inside the VM - i.e. the root partition via cvmfs works that way.
Jeremy Coles: (12:27 PM) https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=mdZ6gy1wDiWq
Matt Doidge: (12:29 PM) When Lancaster moved we used professional movers and they were worth every penny (luckily we didn't have to pay for them).
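Following up Sam's suggestion above about cpio-ing RPM contents into a directory structure for a CVMFS repo: a rough sketch of that unpacking step is below. The repository path and package name are hypothetical, and the surrounding cvmfs_server transaction/publish steps are not shown - this is an illustration of the idea, not a tested recipe.

    # Hedged sketch: unpack RPM payloads into a directory tree that could be
    # published as (part of) a CVMFS repository. Paths/packages are examples.
    import os
    import subprocess

    REPO_DIR = "/cvmfs/example.gridpp.ac.uk/sl6"    # hypothetical repository area
    RPMS = ["glibc-2.12-1.132.el6.x86_64.rpm"]      # example package list

    if not os.path.isdir(REPO_DIR):
        os.makedirs(REPO_DIR)

    for rpm in RPMS:
        # rpm2cpio writes the package payload as a cpio archive to stdout;
        # cpio -idm unpacks it under REPO_DIR, creating directories as needed.
        rpm2cpio = subprocess.Popen(["rpm2cpio", os.path.abspath(rpm)],
                                    stdout=subprocess.PIPE)
        subprocess.check_call(["cpio", "-idm", "--quiet"],
                              stdin=rpm2cpio.stdout, cwd=REPO_DIR)
        rpm2cpio.stdout.close()
        rpm2cpio.wait()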