GridPP Ops meeting 22nd May 2012
================================

http://indico.cern.ch/conferenceDisplay.py?confId=190949

Lower attendance expected due to CHEP.

Santanu is leaving Cambridge and wanted to thank everyone.

Meetings & updates
------------------
(With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest)

- Tier-1 status
  * Various issues and new disk servers deployed for ATLAS
  * Issues with CERN CRL update close to expiry date
  * LHCb sees some interruption to conditions database etc, cleared now though
- Storage and Data Management meeting ( https://indico.cern.ch/conferenceDisplay.py?confId=190949 )
  * Outcomes from GridPP and HEP SYSMAN were discussed
- Accounting
  * PMB agreed on 3% of total for Other disk resources weight
  * HEPSPEC06 benchmarking: please put new kit on the Wiki page and indicate date
  * TB-Support discussions indicate gstat results still not correct
- Documentation
  * CIC portal XML doesn't correspond exactly to page on Wiki. Updating obvious discrepancies
- On-duty
  * Some tickets expired in April; this is being looked into
- Security
  * Transition to Rob, Ewan, Linda, Alessandra rota team when Mingchao leaves
  * Nothing to report this week apart from ongoing Italian/CERN incident investigation
- Services
  * PMB decision: PC will survey sites about connectivity to feed into JANET
- Tickets
  * Tickets bulletin snapshot: 20 Open UK tickets this week.

NGI
https://ggus.eu/ws/ticket_info.php?ticket=80259
Positive progress being made with the new neuroscience VO neurogrid.incf.org.

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=82100
SNO+ are having difficulties using srm-snoplus.gridpp.rl.ac.uk. RAL are having trouble getting the DEFAULT_SE value to publish.

Brunel
https://ggus.eu/ws/ticket_info.php?ticket=82341
Brunel are being hit by a torque bug affecting lhcb jobs and are implementing a workaround.

RHUL
https://ggus.eu/ws/ticket_info.php?ticket=82320
An ATLAS user's jobs are suffering a 50% failure rate. After a very good job postmortem by Duncan, it appears that the failed jobs aren't setting up properly (incorrect/incomplete paths). - SOLVED

BIRMINGHAM
https://ggus.eu/ws/ticket_info.php?ticket=82284
ATLAS are seeing library problems: some libraries cannot be preloaded. Seems similar to previous problems with libraries.

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=82191
na62 transfers to Glasgow were failing due to the srmv2 interface not publishing na62 support. Sam fixed it, and it looks like the ticket can be closed. - SOLVED

SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=81784
The 12 Tasks of Emyr. He's currently trying to tame his CREAM CE (see his mail today to TB-SUPPORT); any help would be appreciated.

Solved Case File:
https://ggus.eu/ws/ticket_info.php?ticket=82081
QMUL got ticketed due to their "test" SE failing Ops jobs. It would be wise to prevent this from happening again (would removing it from the GOCDB allow it to avoid tests but still operate fully for testing?).

From the UK with Love:
Following Stephen Burke's suggestion, searching tickets via DN to track tickets submitted by UKers seems to reveal some good results, but sadly no EMI tickets. In future weeks I'll start trying to get a handle on the list of open EMI UK-submitted tickets to see if I can catch any relevant tidbits. FYI EMI tickets are at: http://tinyurl.com/cu424oa

Of Interest to the Ops team (particularly Chris W):
na62 relevant deployment tickets:
https://ggus.eu/ws/ticket_info.php?ticket=82327
Documents the validation progress
https://ggus.eu/ws/ticket_info.php?ticket=81669
Documents the setting up of the fts channels at CERN (bounced from RAL).

Experiments
-----------
LHCb - Nothing major to report (apart from RAL issue above).

CMS - No report, but apparently similar RAL related problems for CMS. Also "yesterday someone messed up the central CMS CVMFS repository at CERN. It took quite some time that it was noticed and even longer to get it fixed."

ATLAS - No major issues. The scratch space requirement is 20 GB per job-slot/core; sites are being asked to check (email sent to tb-support). Some confusion as a 50 GB/core figure had also been mentioned.

EMI-2
-----
EMI-2 released end of last week. Want to check which sites will do component testing.

EMI-1 offers:
https://www.egi.eu/earlyAdopters/table (see NGI_UK sites at bottom)
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html (also relevant)

glexec
------
Increasing in priority for WLCG. Rollout across summer.
7 sites running glexec according to dashboard. 7 more with no information.
Tarball built at in2p3 but not working yet (lack of proper instructions; Daniela is trying to work it all out from yaim config files).
Aim for definitive comments by June GDB to escalate this.

Actions
-------
No concrete updates from previous week(s).

Chat window log
---------------
[11:00:55] raul lopes joined
[11:00:56] Pete Gronbech joined
[11:00:59] Jeremy Coles joined
[11:00:59] Ian Collier joined
[11:01:00] David Crooks joined
[11:01:01] Emyr James joined
[11:01:03] Matthew Doidge joined
[11:01:03] Gareth Roy joined
[11:01:04] Mark Norman joined
[11:01:24] Jeremy Coles We'll start when a few more people have joined! Low numbers today because of CHEP most likely.
[11:01:39] Stephen Jones joined
[11:01:47] Emyr James hello
[11:02:00] Brian Davies joined
[11:02:14] Jeremy Coles Skipping the first item for now as no expt. reps.
[11:02:25] Mark Slater joined
[11:02:28] John Hill joined
[11:03:05] Duncan Rand joined
[11:03:08] John Kelly joined
[11:03:20] Catalin Condurache joined
[11:03:51] Ewan Mac Mahon joined
[11:04:03] Raja Nandakumar joined
[11:05:37] Elena Korolkova joined
[11:06:26] Elena Korolkova can you give a link to agenda, please
[11:06:38] Jeremy Coles https://indico.cern.ch/conferenceDisplay.py?confId=190949
[11:06:44] Elena Korolkova thanks
[11:07:27] Mark Slater Has it all gone quiet or is it just me?
[11:07:41] Brian Davies its gone quiet for me
[11:07:51] Mark Slater That's OK then
[11:07:53] Elena Korolkova and for me
[11:08:53] Mingchao Ma joined
[11:09:53] John Hill OK for me
[11:10:56] Daniela Bauer joined
[11:10:58] John Bland joined
[11:11:05] Daniela Bauer sorry i am late
[11:25:13] Ewan Mac Mahon No.
[11:25:34] Emyr James ...and testing my mic :-D
[11:27:43] Elena Korolkova I'll check the ticket and the errors
[11:28:34] Duncan Rand left
[11:28:40] Ewan Mac Mahon Mark - what does ls -ld ${ATLAS_LOCAL_AREA}/lib64 on one of your WNs show, something or nothing?
[11:30:07] Duncan Rand joined
[11:30:26] Mark Slater It shows the dir and the links and files inside seem OK so I guess that's not the problem
[11:32:05] Daniela Bauer https://wiki.italiangrid.it/twiki/bin/view/CREAM/ServiceReferenceCard#Open_ports
[11:32:26] Daniela Bauer linked from
[11:32:28] Daniela Bauer https://wiki.italiangrid.it/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1
[11:32:41] Emyr James yeah...9091 *IS* there....oops
[11:32:56] Emyr James I'll check the other ports in my cream firewall
[11:33:38] Daniela Bauer yes, but it's a lengthy document with lots of links, and it might not have been there when you first installed it (speaking from experience...)
[11:34:36] raul lopes cms: yesterday someone messed up the central CMS CVMFS repository at CERN. It took quite some time that it was noticed and even longer to get it fixed.
[11:34:53] Daniela Bauer (heplnx208.pp.rl.ac.uk
[11:37:21] Mark Slater A couple of hours
[11:39:02] Ewan Mac Mahon You've got to wonder how this takes very long to fix - you'd think it'd just be a repository rebuild from the input files and you'd be done.
[11:39:28] Jeremy Coles left
[11:39:48] raul lopes that's what the CMS guy said. he couldn't believe the thing was still broken after 24h
[11:40:27] Daniela Bauer But CVMFS is the solution to all our problems
[11:40:37] Jeremy Coles joined
[11:42:08] Jeremy Coles It seems EVO was defaulting to the US for me which had a rubbish connection. Apologies for dropping out.
[11:43:02] Ewan Mac Mahon Elena - I'll email you some numbers for Oxford after the meeting (we've got a few nodes that are 50GB)
[11:43:44] Ewan Mac Mahon AIUI Elena needs one email per ATLAS supporting site.
[11:43:46] Elena Korolkova Thanks, Ewan
[11:43:58] Ewan Mac Mahon With the amount of scratch per core that each site has,
[11:44:22] Ewan Mac Mahon Or, I suppose, the amount that can be relied upon, i.e. the amount on the 'worst' node.
[11:45:11] Elena Korolkova I need one number - the lowest one
[11:45:35] Stephen Jones Emyr: CREAM Ports: https://wiki.italiangrid.it/twiki/bin/view/CREAM/ServiceReferenceCard
[11:47:02] Ewan Mac Mahon Also, just for the record, Elena's quite right about the ATLAS VO ID card, it says:
[11:47:05] Ewan Mac Mahon Max size of scratch space used by jobs (MB) : 20000
[11:47:13] Jeremy Coles CHEP workshop: https://indico.cern.ch/conferenceDisplay.py?confId=146547
[11:47:28] Jeremy Coles We'll go through the main points next week!
[11:48:29] Jeremy Coles http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
[11:49:32] Daniela Bauer https://www.egi.eu/earlyAdopters/table
[11:53:21] Brian Davies left
[11:55:11] Elena Korolkova I've installed glexec a year ago on half WN's but as there was no interest wasn't back to it
[11:55:19] Elena Korolkova I need to check
[11:55:21] Govind Songara joined
[11:55:28] Duncan Rand ditto at rhul
[11:57:15] John Kelly left
[11:58:51] Duncan Rand i think chris is happy to install the glexec rpm directly on the WN
[11:59:13] Duncan Rand but he will clarify later