GridPP Ops Meeting
Tuesday, 11 August 2015

Alessandra Forti, Andrew McNab (minutes), Andrew Washbrook, Andrew Lahiff, Brian Davies, Catalin Condurache, "Dan and terry", Daniela Bauer, David Colling, David Crooks, Elena Korolkova, Ewan Mac Mahon, Federico Melaccio, Gang Qin, Gareth Roy, Gordon Stewart, Govind, Ian Loader, Ian Neilson, Jeremy Coles, John Kelly, Liam Skinner, Matt Doidge, Matt Raso-Barnett, Oliver Smith, Paige Winslowe Lacesso, Peter Clarke, Peter Gronbech, Raja Nandakumar, Sam Skipsey, Steve Jones, Tom Whyntie

Apologies from John Hill

Experiment problems/issues
==========================
Review of weekly issues by experiment/VO

LHCb
----
Running fine in the UK. Very few jobs at the moment. No jobs at RAL T2, but probably a DIRAC problem.

Will follow up on QMUL ticket: https://ggus.eu/?mode=ticket_info&ticket_id=114573

CMS
---
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel

See comment last week: https://twiki.cern.ch/twiki/bin/view/CMSPublic/SpaceMonSiteAdmin

Most of CMS is on holiday and all sites are green; nothing further to report.

ATLAS
-----
Problem with low number of pilots (noticed Sunday evening). Not all sites; QMUL and Manchester in particular. Limit due to Python thread libraries? Workaround yesterday: factory restarts disabled.

Another problem with high number of file transfers to ECDF; Andy had to put ECDF in downtime due to the resulting disruption to the network at the institute. Mitigated by setting a limit on the FTS servers. Working with ATLAS to understand why this is happening.

Q. Shouldn't the network people at the site deal with this by throttling somehow, rather than expecting users to limit their usage? Everyone else does this somehow. Glasgow has a cap to avoid bringing down ClydeNet. Need something like that.

(All these flows caused by recovery at Taiwan?)
Relevant info about other sites: https://www.gridpp.ac.uk/wiki/Protected_Site_networking

DiRAC
-----
Lydia at Durham has produced a guide for sites extending things.

LIGO
----
Using 2 CernVM instances on a machine. Able to install the base setup using RPMs + bits of Python.

LIGO cvmfs repo at OSG site: /cvmfs/oasis.opensciencegrid.org/ligo

Looking at which to use going forward.

LOFAR
-----
-

LSST
----
NTR

LZ
--
- (Last week agreed to enable at more sites.)

UKQCD
-----
Craig able to run jobs (stopped production for now). Hope for a talk at GridPP35.

UCLan/GalDyn
------------
NTR

PRaVDA
------
- (Understand that Tony is away, but had made some progress. Hope to run jobs by end of month.)

GridPP DIRAC status
-------------------
http://www.gridpp.ac.uk/php/gridpp-dirac-sam.php?action=view

All good, except: Cambridge in a general downtime for machine room move, and still sorting out Brunel DIRAC configuration.

Meetings & updates
==================
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

- General updates
  glexec discussion on tb-support prompted by GridPP DIRAC use by small VOs.
  Tom's request for info on any more GridPP-related GitHub projects.

- WLCG ops coordination
  New portal: please look if you've not done so already.

- Tier-1 status
  Network problems compounded by core switch problems.
  More SL6 upgrades planned before end of September. Some downtimes.

- Storage and data management
- Accounting
- Documentation
- Interoperation
- Monitoring
- On-duty
- Rollout

- Security
  EGI security advisories. No UK sites on dashboard yesterday, but two today. Looking into whether this is a false positive.
  One user banned at RAL Tier1. Details in a GGUS ticket: https://ggus.eu/?mode=ticket_info&ticket_id=115512

- Services
- Tickets
- Tools
- VOs
- Site updates

Discussion topics
=================
1. Technical discussions session at GridPP35.
Friday afternoon session of http://indico.cern.ch/event/404077/

First day theme is working with new communities, but want to have more hard-core technical talks during the second day.

Possible topics: IPv6 rollout, multicore, glexec + security model, Cloud/VM rollout, other VOs, helping new users, GridPP DIRAC, local monitoring solutions (Graphite,...), cloud security/traceability

How about dedicating the session to security and traceability, with glexec and Clouds & VM? Also including an hour on IPv6?

Previous IPv6 talk: https://indico.cern.ch/event/321806/session/12/contribution/38/attachments/621913/855717/An_IPv6_Addressing_Plan.pdf

(Also plan to have a 2 day meeting at Birmingham for new user communities, with Ganga people there. Explaining how to use the current set of tools.)

Possibility of overview tables to touch on different topics. IPv6,...

2. Informing new users of infrastructure issues.

Should they be on the GridPP-Users list!?

Actions & AOB
=============
* https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items

SL5 status page from David: https://www.gridpp.ac.uk/wiki/SL5_status

Chat window
===========
Matt Doidge: (11:04 AM) https://ggus.eu/?mode=ticket_info&ticket_id=114573
Federico Melaccio: (11:04 AM) hi, I heard RALPP on joining
Matt Doidge: (11:05 AM) Thanks Raja!
Federico Melaccio: (11:05 AM) ok sorry
Ewan Mac Mahon: (11:05 AM) @Federico - short version was that you might not be getting jobs from LHCb right at the moment, but Raja thinks it's them, not you. So you don't need to do anything,
Federico Melaccio: (11:06 AM) thanks Ewan, we noticed that job shortage and I thought it was not our problem
Ewan Mac Mahon: (11:09 AM) The issue seems to be a network team who panic if someone uses their network. The FTS is working as designed.
Daniela Bauer: (11:09 AM) My Vidyo keeps crashing, but most of CMS is on holiday and all sites are green, so I don't think there's anything to report.
Jeremy Coles: (11:10 AM) Thanks.
Ewan Mac Mahon: (11:11 AM) All of it.
Andrew McNab: (11:15 AM) The network already has a way of telling the application to slow down: it drops packets.
Jeremy Coles: (11:16 AM) Sorry my Vidyo stopped!
Ewan Mac Mahon: (11:18 AM) Hmm. Pete Clarke?
Jeremy Coles: (11:19 AM) https://www.gridpp.ac.uk/wiki/Protected_Site_networking
Ewan Mac Mahon: (11:21 AM) Yes.
Elena Korolkova: (11:25 AM) There was a low number of pilots for many sites since Sunday evening. For UK QM and Manchester were affected. The problem was mitigated yesterday afternoon by disabling factory restarts.
Jeremy Coles: (11:25 AM) Andrew M - please note John Hill gave apologies today. Attending an induction event for the new machine room!
Elena Korolkova: (11:25 AM) This is a short summary for Jeremy
Catalin Condurache - RAL: (11:26 AM) /cvmfs/oasis.opensciencegrid.org/ligo it should be accessible by any node with standard CVMFS configuration
Tom Whyntie: (11:27 AM) Cool, thanks
Ewan Mac Mahon: (11:28 AM) The default modern config is everything from egi.eu, cern.ch and opensciencegrid.org
Not everyone is using that current-style config yet necessarily, but it should be fairly common and only get more so.
Jeremy Coles: (11:32 AM) http://indico.cern.ch/event/404077/
Peter Gronbech: (11:32 AM) https://indico.cern.ch/event/404077/
Ewan Mac Mahon: (11:35 AM) I think this is an email discussion a bit; it needs more consideration/thinking time.
Alessandra Forti: (11:36 AM) I thought we did that already
Matt Doidge: (11:38 AM) isn't this something we could cover on a Friday meeting?
Ewan Mac Mahon: (11:40 AM) There are probably no topics that we could only cover at a physical GridPP, but there are some discussions that can make better progress face-to-face; it's easier to kick stuff around in a GridPP meeting, then follow up by email/Vidyo meetings.
Federico Melaccio: (11:44 AM) what about cloud security/traceability?
Ewan Mac Mahon: (11:56 AM) I think getting IPv6 working is moderately pressing now.
We're fast approaching the point where people are going to want to use it for real. I think in less than a year I'm going to want to have production IPv6-only resources. I might not get it, but I think I'm going to want it.
Samuel Cadellin Skipsey: (12:04 PM) (for those looking for the IPv6 talks previously, they were in GridPP33, the last Ambleside one)
Peter Gronbech: (12:04 PM) https://indico.cern.ch/event/321806/session/12/contribution/38/attachments/621913/855717/An_IPv6_Addressing_Plan.pdf
Ewan Mac Mahon: (12:06 PM) That may be it - if you've got unsupportive site networking people, 'make' them go to networkshop. Possibly also go with them. Yup.
Alessandra Forti: (12:09 PM) I've just got dropped out of gridpp-users because CERN rejects emails from it. Has it happened to others?
Jeremy Coles: (12:10 PM) https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
Andrew McNab: (12:15 PM) Yes me too Alessandra
David Crooks: (12:28 PM) https://www.gridpp.ac.uk/wiki/SL5_status