Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the weekly GridPP ops & sites meeting.
- The intention is to run the meeting in Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6
-- The PIN is 1234.
-- To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone for dial-in numbers.
-- The London (UK) service is on +442030510622.
-- The meeting extension is 9308582.

Apologies: Ian Neilson; Matt Doidge

Minutes:
    • 11:00 11:01
      Ops meeting minutes 1m
      * This is a reminder that this is an important task. The minute taker gives access to the discussions for those not present and provides a reference for others to refer back to afterwards.
      * The team composition has been changing. If everybody contributes then the task comes around less often.
      * From the start of GridPP4+ those in fully funded GridPP positions will be expected to contribute. Others are welcome to volunteer!
      * The minutes should contain a list of who attended; apologies; note who took the minutes and highlight actions.
      * A count is maintained at https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items.
      * After uploading minutes to the agenda page the minute taker is expected to:
      ** Update the list of ops actions.
      ** Update their 'count' so the task can be shared fairly.
      Thank you for your support!

      OPS Meeting : 8th Dec 2015

      List of attendees:

      A. Forti

      A. Dibbo

      A. McNab

      B. Davies

      C. Condurache

      D. Traynor

      D. Bauer

      D. Crooks

      D. Rand

      E. Korolkova

      E. MacMahon

      F. Melaccio

      G. Roy

      G. Smith

      J. Coles

      J. Hill

      J. Kelly

      K. Mohammad

      O. Smith

      P. Gronbech

      R. Nandakumar

      R. Frank

      S. Jones

      T. Whyntie

       

      LHCb: Low activity. No issues with the Tier-1 or any of the Tier-2s.

      CMS: Nothing to report

      ATLAS: Elena (thanks for sending the report separately)

      There was a problem with the certificates for the services that submit production functional tests and that manage the SSB board (the service which sets storage and PanDA queues online/offline according to downtime (DT)). Alessandra manually set Lancaster offline (because of flooding); Sheffield was put into test mode on Saturday and was manually set online on Sunday.

      There are two GGUS tickets requesting removal of the PRODDISK space token and the files under PRODDISK: RALPP and Sussex.

      Most of the tickets are ATLAS requests for storage consistency checks.

      Lancaster and Liverpool have the dumps in place and cron jobs running.

      All the tickets are in progress. Cambridge and Glasgow have done their part, but there are problems on the ATLAS side with checking the dumps.

      The ATLAS person responsible for storage dumps will be contacted.

       

      Other VOs: Moving to a monthly update, or on request.

      HPC DiRAC: Brian: In the early stages of investigating how to tar files on the HPC DiRAC server before transfer. Moved around 4M files totalling 100 TB.

      LIGO: Andrew Lahiff is working with Paul. Nothing else to report.

      LSST: Alessandra: Opened a ticket about a VOMS issue with LSST. A proxy cannot be created using voms-proxy-init for the lsst VO. https://ggus.eu/index.php?mode=ticket_info&ticket_id=114044

      LZ: Elena: David Colling is to update the GridPP VO Incubator page for LZ. Brunel is ready to support the LZ VO.

      UCLan: GalDyn has updated the GridPP VO Incubator page.

       

      GridPP DIRAC status: Started to pick up VM-based sites for SAM monitoring. VMs are now working with the new dirac.gridpp.ac.uk service. Sites need to change the name of the DIRAC server.

      Action: Update the wiki about VM-based sites.

       

      Meetings and updates: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

      General Update

      Janet networking issue: Discussed the DDoS attack on Janet. It was mentioned that Janet has stopped posting status updates on Twitter and its website, as that information is also used by the attacker. Jeremy asked whether GridPP gets status information through a different channel. Ewan commented that Janet sees each site as an end user and does not see GridPP as a separate entity. Someone should talk to Janet about changing this relationship.

      IPv6 address for the GridPP website: Andrew said that it should be quite straightforward and he is already testing.

      WLCG Operation Co-ordination:

       

      Security: Advisory about OpenSSL. EGI assesses that it is a routine security update, and sites should handle it as a normal security update.

      Duncan asked sites to check http://grid-monitoring.cern.ch/perfsonar_report.txt

       

      ARGO update: Kashif mentioned that there is no clarity about the move to ARGO. It seems that it is going to be a centralised ARGO, as opposed to the current distributed configuration. There is no clarity about the VO Nagios either.

       

      HepSysMan: Alessandra asked how many people are interested in a Ganga workshop. If there are enough people then a workshop can be organised. Mark, Matt and Tom need to be there to make it useful. Alessandra is going to contact them.

       

      John Hill: (08/12/2015 11:01:16)

      The bells, the bells!

      Very Christmassy!

      Jeremy Coles: (11:02 AM)

      Kashif is taking minutes. Thanks!

      Ewan Mac Mahon: (11:08 AM)

      You're fine for me, Jeremy, but Elena does have the occasional drop-out. It's not bad quality breaking up-age, it's fine for a while then goes for several seconds.

       

      David Crooks: (11:09 AM)

      Yeah, I hear the same

      Marcus Ebert: (11:10 AM)

      here, all audio is dropping out for some seconds too, like now

      Brian Davies @ RAL: (11:10 AM)

      DiRAC update: still investgating how to tar before transfering

      Ewan Mac Mahon: (11:12 AM)

      Where is this discussion taking place? Any chance of shunting it to an archived list?

      Brian Davies @ RAL: (11:12 AM)

      early on at the moment , 100Tb 4M files

      yes

      jerermy you have droppped

      Tom Whyntie: (11:14 AM)

      It's a really interesting story, though, so I look forward to writing it up!

      John Hill: (11:14 AM)

      I lost him for a while also

      Ewan Mac Mahon: (11:14 AM)

      I could hear Jeremy, sounds like we lost all the RAL folk for ~10s there.

      I think the thing is that STFC have a tape machine.

      The point here is to enable them to use it.

      Peter Gronbech: (11:25 AM)

      We are not sure what sounded very positive as we lost sound again

      Ewan Mac Mahon: (11:25 AM)

      Oxford VAC will be along shortly, I just need to tweak the config - it's high on the todo list.

      Peter Gronbech: (11:26 AM)

      gone

      Federico Melaccio: (11:26 AM)

      yeah for me as well

      Jeremy Coles: (11:27 AM)

      https://www.gridpp.ac.uk/wiki/Cloud_%26_VM_status

      Ewan Mac Mahon: (11:29 AM)

      That sounds ideal.

      I would quite like the possibility of a wildcard 'any' option as well as the explicit list if possible, but it's good to have the ability to limit it if need be.

      But it would be nice to be able to add VO support to actual resources just by adding it to the dirac centrally and without needing the sites to make any changes.

      John Bland: (11:32 AM)

      I can hear both of you

      Tom Whyntie: (11:32 AM)

      Yes, I can hear both

      Federico Melaccio: (11:32 AM)

      I could not hear Ewan...

      John Bland: (11:33 AM)

      it would be nice to know the extent of the problem and likely timelines

      Ewan Mac Mahon: (11:33 AM)

      So, to recap - AIUI we don't have a particularly good route into Janet for this,

      Perhaps once the dust settles we should explore the possibility of having a more direct customer relationship with Janet rather than just being considered as downstream users of our respective institutions.

      Tom Whyntie: (11:35 AM)

      And when we do it'll make a good News Item :-)

      Ewan Mac Mahon: (11:36 AM)

      And that also sounds ideal. Might be worth posting that back to tb-support for the benefit of the folks not here.

      Tom Whyntie: (11:36 AM)

      £100/day for hotels

      Have to leave - thanks, bye

      Ewan Mac Mahon: (11:38 AM)

      Oh, and on the specifics of the Janet situation, my general feeling is that there are no timelines to give because it's an ongoing shifting threat, not a technical fault - they've applied sucessful fixes, only to see an effectively new attack start. It's very hard to say when it'll be over.

      David Crooks: (11:44 AM)

      Sorry, lost Jeremy at an inopportune time :-)

      Ewan Mac Mahon: (11:47 AM)

      Routine SSL update - no special significance for us, no special EGI handling, but the general assumption that sites will be keeping their nodes generally updated stands, so the reccommendation is to instal the updates as part of normal patching.

      Except Edinburgh, of course, who's idea of normal patching is not to do it until they get yelled at.

      Duncan Rand: (11:48 AM)

      Please could sites check perfsonar here: http://grid-monitoring.cern.ch/perfsonar_report.txt

      Federico Melaccio: (11:50 AM)

      wasn't there a sort of dashboard for that?

      John Hill: (11:51 AM)

      Most UK sites seem to fail!

      Duncan Rand: (11:51 AM)

      Yes, I will post that in a minute.

      Ewan Mac Mahon: (11:51 AM)

      *does the all OK dance*

      John Hill: (11:51 AM)

      Since I think our configuration is correct I have no idea what to fix

      Federico Melaccio: (11:51 AM)

      RALPP is all OK yaaa

      Andrew McNab: (11:52 AM)

      want we want is the failure history to see it

      Ewan Mac Mahon: (11:54 AM)

      IMO there's no reason to object to the services being run centrally, but we need to mke it clear that one of those services is the VO nagios.

      Duncan Rand: (11:56 AM)

      https://maddash.aglt2.org/

      https://maddash.aglt2.org/WLCGperfSONAR/check_mk/index.py?start_url=%2FWLCGperfSONAR%2Fcheck_mk%2Fdashboard.py

      Ewan Mac Mahon: (11:57 AM)

      I haven't done anything with the RIPE atlas noes in ages and ages.

      Er, nodes.

      Duncan Rand: (11:57 AM)

      http://maddash.aglt2.org/maddash-webui/index.cgi?dashboard=Latency%20tests%20between%20all%20WLCG%20hosts

       

      yes

      Steve Jones: (12:01 PM)

      Sure thing, why not?

      Ewan Mac Mahon: (12:01 PM)

      Right, in principle, I'm interested then :-)

      Marcus Ebert: (12:01 PM)

      I'm interested too

      Federico Melaccio: (12:03 PM)

      In principle I'm interested

      Ewan Mac Mahon: (12:04 PM)

      And it gets you a trip to sunny Manchester and there will probably be food and beer/beer equivalents.

      Jeremy Coles: (12:05 PM)

      https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items

    • 11:01 11:20
      Experiment problems/issues 19m
      Review of weekly issues by experiment/VO
      - LHCb
      - CMS https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel
      - ATLAS
      - Other: we will change to a monthly update (or on request) from this week. Updates should be recorded in https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator.
      -- DEAP3600: Pending
      -- HPC DiRAC: Jens / Daniela
      -- LIGO: Catalin
      -- LOFAR: George
      --- Alex Dibbo has taken over the management of our cloud...
      -- LSST: Alessandra
      -- LZ: David / Elena
      -- UKQCD: Jeremy
      -- UCLan/GalDyn: Tom
      -- PRaVDA: Mark/Matt
      - GridPP DIRAC status [Andrew McNab]
      -- http://www.gridpp.ac.uk/php/gridpp-dirac-sam.php?action=view
      - Status of pilot enabling across sites.
    • 11:20 11:40
      Meetings & updates 20m
      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
      - General updates
      - WLCG ops coordination
      - Tier-1 status
      - Storage and data management
      - Tier-2 Evolution
      - Accounting
      - Documentation
      - Interoperation
      - Monitoring
      - On-duty
      - Rollout
      - Security
      - Services
      - Tickets
      - Tools
      - VOs
      - Site updates
    • 11:40 12:00
      Free discussion 20m
    • 12:00 12:05
      Actions & AOB 5m
      * https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items