GridPP Operations team meeting
Tuesday, 15 December 2015 from 11:00 to 12:30

Apologies: None

Present: Gareth Roy, Raja Nandakumar, Winnie Lacesso, Ian Neilson, Matt Raso-Barnett, Ian Loader, Andrew Lahiffe, Steve Jones (mins), Alessandra Forti, Chris Brew, Daniela Bauer, Daniel Peter Traynor, David Crooks, Elena Korelkova, Ewan Mac Mahon, Federico Melaccio, Gareth Smith, Govind Songara, Jeremy Coles, Kashif Mohammed, Matt Doidge, Paige Winslowe Lacesso, Peter Gronbech, Raul Lopes, Samuel Cadellin Skipsey, Tom Whyntie.

Experiment problems/issues
Review of weekly issues by experiment/VO

LHCb
Raja: Smooth running, some T1 problems. Will reprocess CS data, with grid proc data from 2011/12 3 staging prior, issues at RAL due to network. 2 x disk servers down, one for RAID battery, another due to Castor bug/feature on Monday. Network problems at RAL left worker nodes cut off from Dirac. This affected random jobs, which got stuck in START. Not yet severe. T2s all OK.

CMS
Daniela: Not much to say, all sites green.
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel

ATLAS
Elena: Tickets issued to many sites asking to implement dump feature. Several sites still to respond. Please use gridpp storage list to request help. Glasgow queried.
Sam: Glasgow waiting for ATLAS to respond.
Elena: Will push ATLAS via Cloud Support.
Elena: Disk server broken at RHUL. Govind managed to get data back and save the server.

Other VOs: Moved to a monthly update (or on request) from last week. Updates should be recorded in https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator.
Daniela: LZ scientists visited at Imperial. They wish to expand to Physics departments which have a stake in LZ. Sites which have LZ presence should prepare cluster for use by LZ. No storage for now. VO Card exists but uncertain who manages.

Meetings & updates

DPM Workshop Review
Luke gave a presentation on their work at Bristol. Core team: performance improvements and efforts to remove legacy DPM components. DPM rest. Focus on HTTP support. Team seem fairly optimistic. Sam gave feedback about Puppet. Because of our funding model we are less willing to change config management. Overall good meeting. Second day was to demo Puppet.

A short discussion was held on the relative merits for 'other VOs' of using LFC or Dirac FC. Belle II actively want to use LFC, AF approved of LFC for such uses. Sam says LFC has been deserted by the big experiments, and there are not many left. Also discussed such plans for snoplus and t2k.

Another short discussion was held about plans for DPM. DPM to merge with Castor. Development to be done by IT-DSS (a CERN software group). Support for DPM + LFC will continue for now. Future is not so certain.

JC: Lancaster and Sheffield have issues with Dirac SAM tests.

With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

General updates
Items discussed were switch monitoring, WMS queries, APEL/SL5, recent JANET flakiness.

Tier-1 status
GS: Packet storm on Thursday caused problems for those accessing RAL based services. Tapes to be upgraded from type C to type D.
RN: What is the network status over the last 2 days?
GS: OK now, after packet storm. Small problem on Friday. HW boards of UK routers have been changed due to fault in interface. Things may improve.
GS: Over XMAS, T1 to close 3PM Christmas Eve, to reopen 4th Jan. Some support over holidays. Will monitor on call. Will write up in the Blog.

Storage and data management
During a recent incident at RHUL, there was some uncertainty about the proper process for declaring a data loss to ATLAS. Sam to clarify.

Interoperation
DC: Gave meeting overview. Meeting calendar agreed, 2nd Monday of each month. UMDPreview to replace EMI. UMD4 Centos only. Volunteers needed for Centos7 staged rollout.
Plans announced for decommissioning of dCache 2.6, SL5 (April 2016).

Monitoring
SAM: Work on-going to find right approach for DPM monitoring.

On-duty
JANET flakiness raised many issues last week, inc. extended GGUS outage.

Security
IN: Sites should make sure they respond to tickets about security that arise from CSIRT, to avoid suspension.

Tickets
Matt Doidge read through the tickets. Matt Raso-Barnett trying to clear his desk prior to leaving soon.

Review of the GDB Agenda: https://indico.cern.ch/event/319754/

Actions & AOB
https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items
Elena and/or Daniela to update bulletin with list of sites that should support LZ.
Catalin to provide status of RAL WMSes for JC.
Sam to clarify process for declaring data loss to ATLAS.
Peter Gronbech announced that Hepsysman agenda and web page are out, and details will be circulated. http://hepwww.rl.ac.uk/sysman/jan2016/main.html

Chat Window
Matt Doidge: (15/12/2015 11:02) Thanks Steve!
Jeremy Coles: (11:02 AM) Steve is taking minutes. Thank you!
Daniela Bauer: (11:04 AM) I just realized I might not have a microphone as I am at home and I am not sure my laptop can handle this.
Chris Brew: (11:10 AM) Daniela, we just enabled LZ at RALPP.
Jeremy Coles: (11:11 AM) https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator
Ewan Mac Mahon: (11:11 AM) I think we're done. Unless they want the storage too?
Chris Brew: (11:13 AM) We did comet as well
Ewan Mac Mahon: (11:13 AM) Ta. No storage for now then.
elena: (11:13 AM) Sites with groups involved in LZ:
Matt Doidge: (11:14 AM) Should I enable all the incubatated VOs in the grid.cern.ch cvmfs UI?
elena: (11:15 AM) IC, ECDF, Oxford, Sheffield, Liverpool and UCL Thanks to Ewan and Chris
Govind: (11:16 AM) How do I find which VO need storage ?
Ewan Mac Mahon: (11:16 AM) You refuse to give them any storage and see if they object :-)
Govind: (11:17 AM) :-)
Alessandra Forti: (11:18 AM) yes IT-DSS a bit
Jeremy Coles: (11:19 AM) Yes Steve.
David Crooks: (11:19 AM) Hi Steve, we can hear you :-)
Steve Jones: (11:19 AM) No sound. Jeremy, pls keep track for mins. Is anyone talking
Jeremy Coles: (11:19 AM) Yes Sam. Okay.
Steve Jones: (11:21 AM) Back on line. Jeremy, pls snap chat at end for me - mine's gone.
Alessandra Forti: (11:24 AM) nothing happened while you were gone in the chat
Kashif: (11:26 AM) https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_WMS&style=detail wms list
Daniela Bauer: (11:27 AM) sorry, I need to bow out for a couple of min
Matt Doidge: (11:27 AM) On the Lancaster Dirac status - none of our vac boxes were vaccing. I forgot to chkconfig the vacd on.
Kashif: (11:27 AM) Total 7 wms in UK
Daniela Bauer: (11:28 AM) I'm back
Ewan Mac Mahon: (11:30 AM) I'm not sure that answering the question is any worse than pointing at documentation that answers the question.
Samuel Cadellin Skipsey: (11:44 AM) We also did the Centos7 tests for DPM (At Glasgow)
Matt Doidge: (11:46 AM) I assume if you're in danger of being suspended you'll have an inbox full of warnings and ticket notifications.
Daniela Bauer: (11:58 AM) sorry, child is whinging again
Ewan Mac Mahon: (12:01 PM) Or just let them know you're dropping biomed support from the SE and that they should replicate anything they care about first. Then nuke the lot. Obviously, that doesn't get you to quite the same place. But it's tidy.
Govind: (12:07 PM) sorry I frogot to look into this..
Gareth Smith: (12:08 PM) I'm sorry I have to leave the meeting now.
Daniel Peter Traynor: (12:08 PM) not top of my list on hold
Jeremy Coles: (12:14 PM) https://confluence.ska-sdp.org/display/PRESDPCMT/SDP+MT64+-+01-Dec-15 https://indico.cern.ch/event/319754/
raul: (12:20 PM) i have
Tom Whyntie: (12:21 PM) Have to leave, sorry - thanks, bye
raul: (12:21 PM) you need a lot of hacks report next week next meeting sorry! vidyo problems
Matt Doidge: (12:26 PM) Is there a HEPSYSMAN registration page?
Federico Melaccio: (12:28 PM) https://indico.cern.ch/event/465560/ you can register on indico I guess and the "top level" webpage is at http://hepwww.rl.ac.uk/sysman/jan2016/main.html
Matt Doidge: (12:29 PM) Thanks, I lost that website.
Alessandra Forti: (12:34 PM) I think the ganga workshop agenda is not ready yet for registration
Paige Winslowe Lacesso: (12:35 PM) Sorry, must leave now!
Matt Doidge: (12:35 PM) I don't think it is either - couldn't see the option to register.
Ewan Mac Mahon: (12:39 PM) I'm not sure that 'scaling well with HS06' is a good thing; I think that could reasonably be taken to mean 'just as bad as HS06'.
Peter Gronbech: (12:41 PM) HEPSYSMAN details here http://hepwww.rl.ac.uk/sysman/jan2016/main.html
Daniela Bauer: (12:41 PM) Bye