Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description

- This is the weekly GridPP ops & sites meeting

- The intention is to run the meeting in VidyoConnect: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6

-- The PIN is 1234. To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone.

-- The London (UK) service is on +442030510622.

-- The meeting extension is 109308582. PIN 1234

Chair:  Matt

Minutes: Matt

Apologies: Darren, David

Videoconference Rooms
Name: GridPP-Operations
Description:
- This is the weekly GridPP ops & sites meeting
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the Janet(UK) Community area. Direct link http://evo.caltech.edu/evoNext/koala.jnlp?meeting=MDMaM82v2nD2Du999sD99D
- The phone bridge number is +44 (0)161 306 6802. The phone bridge ID is 1001002 with code: 4880.
Extension: 109308582
Owner: Alessandra Forti

Attending: People (missed the attendee list, sorry).
Apologies: Darren, David.

LHCb running smoothly.
Few tickets open - most bog standard.
Manchester ticket 141430 - data access problems. Alessandra might need support from LHCb (Andrew?).
Tier 1 - few issues. Short staffed and taking things slowly.

CMS - All quiet on the CMS front. Brunel still has some lost files and is working through them. Today all green!

ATLAS - several tickets
RAL - Frontier service problems (141549). The ticket looked like it was closed.
Lancaster - transfer error tickets. The site looks green, so these can be closed.
Durham - lost heartbeat ticket. Looks fixed now. The remaining failures are a different type of problem and at an acceptable level.
Durham ticket 2 - squid problems, checking with the network teams at Glasgow/Durham. Still waiting on the Glasgow team to see if those ports are blocked.
Oxford ticket - should be fixed, Elena is checking it.

CentOS7 deployment page:
https://twiki.cern.ch/twiki/bin/view/AtlasComputing/CentOS7Deployment
If you disagree with your site status on this page please email cloud support.

Checking that page - Birmingham and Cambridge on VAC, John mentions that once they decommission their CREAM they're all VAC.

Other VOs:
Pete Clarke - LSST encouraged to come to these meetings, but very happy at the moment

IRIS VO - is just a placeholder, for use as a means to submit tickets - a virtual virtual organisation. Sites should NOT enable it.

Bulletin:
General Updates:
 The Security Day + HEPSYSMAN was on t'other week: https://indico.cern.ch/event/721692/
Please can sites review their GOCDB information: https://ggus.eu/?mode=ticket_info&ticket_id=141296
iris.ac.uk VO - Andrew explained this.
New(-ish) HEP_OSlibs release - 7.2.9 https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineTable
Gareth's query about the WS interface to ARC on TB-SUPPORT
 - Gareth is setting up his new ARC with both interfaces to allow devs to test.

Operation Coordination meeting next Thursday - Matt will attend.

Tier 1
04 June 2019 Report for the Experiments Liaison Report (03/06/2019) is here.

    Ongoing: we are seeing high outbound packet loss over IPv6. Central networking performed a firmware update to the border routers but this didn't resolve the issue. The plan is to move connections to the new border routers in mid-June, and to do this before trying to debug any further.
    The old LHCb Castor instance lost three disk servers over the weekend!! We don't intend to spend much effort recovering them. The old LHCb Castor instance will be decommissioned (no files will be recoverable) on Friday 7th June.

LHCb has moved to ECHO, so the disk server loss didn't impact production.

Storage:
Matt notes the bug/feature in gsiftp noticed by DIRAC (CTA) users trying to access Lancaster storage - gsiftp was returning an IPv6 address to the IPv4-only client. Daniela was not impressed with the DPM devs' responses.

Interoperation
10th June meeting

Security prompted discussion.
Podman works really well
Glasgow moving away from Docker
VAC support through Docker, but no one's using it.
Mark was thinking about VAC containers, but not likely to go down that road now.
Singularity's ability to be nested
Red Hat effectively dumped Docker for Podman. Some say Docker's days are numbered.

Tickets
LCGDM retirement - these tickets are LHCb trying to get the lay of the land now that SRM is on its way out. Sam notes it's useful to be pushed.


GOC round table:
All sites okay.

John notes he is getting no GOC downtime emails (the same problem that Simon noticed) - Elena noted that for Simon it was a wrong address.
Downtimes announced differently?
Drop a mail to GOCDB support to make sure this is as intended.

Rob - ECDF is getting a new BDII soon - is it just the BDII?
Chris suggested an alias, but ECDF can't do that.

External to site security officers - should be okay, but will review with David.

Saving review of HEPSYSMAN/SECURITY DAY to next week.


Chat window was sadly lost - not having a good day!

 

    • 11:00 - 11:01
      Ops meeting minutes 1m
      • This is a reminder that this is an important task. The minute taker gives access to the discussions for those not present and provides a reference for others to refer back to afterwards.

      • The team composition has been changing. If everybody contributes then the task comes around less often.

      • Please extract actions from the meeting and add them to our table here: https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items#Action_list.

      • Recent allocations: See above link. The page should be updated each week by the minute taker (if they don't the task will keep coming to them!).

      • Upcoming allocations:

    • 11:01 - 11:20
      Experiment problems/issues 19m

      Review of weekly issues by experiment/VO

      • LHCb

      • CMS
        T1: https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
        T2: https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel

      • ATLAS

      • Other: Updates should be recorded in https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator.

      • GridPP DIRAC status [Andrew McNab]
        -- https://www.gridpp.ac.uk/gridpp-dirac-sam

    • 11:20 - 11:40
      Meetings & updates 20m

      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

      • General updates
      • WLCG ops coordination
      • Tier-1 status
      • Storage and data management
      • Tier-2 Evolution
      • Accounting
      • Documentation
      • Interoperation
      • Monitoring
      • On-duty
      • Security
      • Services
      • Tickets
      • Tools
      • VOs
      • Site updates
    • 11:40 - 12:20
      Discussion topics 40m

      -HEPSYSMAN + SECURITY DAY feedback.
      -GOCDB Information Round Up
      -Sites ready for LCGDM retirement?

    • 12:20 - 12:25
      Actions & AOB 5m