Operations team & Sites

Name: Operations team & Sites
Start: 2019-01-14T11:00:00+00:00
End: 2019-01-14T12:30:00+00:00
Location: EVO - GridPP Operations team meeting

Monday 14 Jan 2019, 11:00 → 12:30 Europe/London

Description

- This is the weekly GridPP ops & sites meeting

- The intention is to run the meeting in VidyoConnect: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6

-- The PIN is 1234. To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone.

-- The London (UK) service is on +442030510622.

-- The meeting extension is 109308582. PIN 1234

Chair: Kashif M

Minutes:

Apologies:

Present:
Andrew McNab
Daniela Bauer
Darren Moore
Elena Korolkova
Emanuele (new at Glasgow)
Kashif M
Matt Doidge
Gordon Stewart
John Hill
Raja K
Rob Currie
Duncan Rand
Mark Slater
Winnie Lacesso
Sam Skipsey
Vip
Robert Frank
David Crooks
Alessandra Forti
Chris Brew
Steve Jones
Leo Rojas (joined late)

*** Review of weekly issues by experiment/VO

    * LHCb (Raja): major problem with DIRAC infrastructure and VAC and vcycle
      after update to DIRAC. The pilot version was downgraded, which seemed to
      fix it. Still working on it. Some minor problems at UK sites, but all
      under control.

    * CMS (Daniela)
    All sites green on monitoring. Brunel in Waiting Room after misdeclared
    downtime, but should be out now. Smooth running over Christmas.

    * ATLAS (Elena)
    Two open tickets:
    RAL-LCG2 and singularity: https://ggus.eu/?mode=ticket_info&ticket_id=138033
    RALPP and ipv6 transfers:
    https://ggus.eu/?mode=ticket_info&ticket_id=139127
    Chris: ipv6: No external firewall, only affects Atlas, CMS seems to
    work. Will have another look.
    Chris (in chat window):
    Got it! Suddenly occurred to me as I was speaking that a problem affecting Atlas and not CMS might be down to a problem on the pool node, not the GridFTP mover or higher up the network stack. And looking at the Pools I find a couple on SL6 nodes that only appeared to have working IPv6. Restarting the network on those seems to have solved the problem.
    Kashif: Local data disk is full, what to do? Elena: It's a known problem,
    Atlas working on it.

    * T2K and the missing checksums
    Some discussion about DPM without DOME. Conclusion: Dump all available
    checksums from database and ask T2K if they are still interested in the
    files without checksums. Attach this list to your t2k ticket, please.

Jeremy's activity webpage, please fill it out:
https://www.gridpp.ac.uk/wiki/Engagements_and_commitments

GDB 16th: ipv6, storage accounting

Tier1: All quiet.

EGI ops meeting (Kashif): No new releases that affect UK sites. Preparing
IPv6 report; Kashif updated UK status.

*** Security
    David Crooks: systemd vulnerability, updates from Red Hat and CentOS
    available, another advisory will be out, no reboot required.
    Please attend workshop if you can:
    WLCG Security Operations Center WG Workshop/Hackathon
    https://indico.cern.ch/event/775579/

*** Tickets (Matt):
40 Open UK Tickets this week.

T2K DFC Migration on DPMs
Liverpool: https://ggus.eu/?mode=ticket_info&ticket_id=138648
Oxford: https://ggus.eu/?mode=ticket_info&ticket_id=138647
Sheffield: https://ggus.eu/?mode=ticket_info&ticket_id=138649
Lancaster: https://ggus.eu/?mode=ticket_info&ticket_id=138365
[Already discussed.]

A quick summing up of these tickets: to provide the information T2K need (namely adler32 checksums for files that don't already have them) it appears your DPM needs to be DOME'd. We at Lancaster seem to be having the most luck with this so far, so please feel free to prod me about it.
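
For anyone wanting to cross-check a checksum outside the DPM tooling, here is a minimal sketch of computing an adler32 checksum by streaming a file through Python's standard zlib.adler32. The function name, the chunk size and the zero-padded 8-hex-digit output format are assumptions made for illustration; this is not the DPM/DOME mechanism or an agreed T2K procedure.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the adler32 checksum of a file as 8 lowercase hex digits.

    The file is read in chunks so large files never need to fit in memory.
    (Hypothetical helper; name and output format are illustrative only.)
    """
    checksum = 1  # adler32 is defined to start from 1
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
    # Mask to 32 bits and zero-pad to a fixed width.
    return format(checksum & 0xFFFFFFFF, "08x")
```

Calling adler32_of_file("/path/to/file") returns a hex string that can be compared against whatever value the catalogue or database dump holds for the same file.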

v6-looking transfer problems
Liverpool (lhcb): https://ggus.eu/?mode=ticket_info&ticket_id=138943 (19/12) (fixed)
RALPP (atlas): https://ggus.eu/?mode=ticket_info&ticket_id=139127 (10/11)
(discussed earlier)

Bristol LHCB Ticket
https://ggus.eu/?mode=ticket_info&ticket_id=138402 (21/11/18)
Are the issues described in this ticket still happening? That might be a
question for the VO rather than the site. (6/12/18)
Now works after Bristol disabled SL7 workernodes for LHCb. Bristol still
working on it.

Last Year's Tier 1 Tickets:
https://ggus.eu/?mode=ticket_info&ticket_id=138665 (LFC access issues)
https://ggus.eu/?mode=ticket_info&ticket_id=138500 (CMS transfer failures)
(needs an update)
https://ggus.eu/?mode=ticket_info&ticket_id=138361 (T2K DFC migration) (under
control -- Daniela)

Matt intends to go over all the ipv6 tickets next week, so please update them!

Site round table:
Manchester: (Alessandra) Storage upgrade, (Andrew): Vac and VCycle wrt IRIS
RALPP: (Chris): more nodes onto ipv6, nothing urgent happening, storage is dual
stack
Imperial (Daniela): Getting IRIS storage and compute into the racks up and running.
RAL-LCG2 (Darren): All good.
Sheffield (Elena): CentOS7.
Cambridge (John): perfSONAR, new CPU
Oxford (Kashif): move to CentOS7
Birmingham (Mark): ipv6 - dual-stacking perfSONAR, orders for new storage, last
DPM user evicted, should be able to decommission DPM. Goal: just EOS and VAC
Lancaster (Matt): Racking up new kit. Looking at HTCondorCEs. Updating systemd
:-)
Bristol (Winnie): Debug why LHCb doesn't work on SL7, replacing some hardware.
Edinburgh (Rob): DPM 1.11, Cloud, ipv6
Glasgow (Sam): waiting for DPM 1.11 to be stable, hope this will help with a
variety of problems, considering DOME
Liverpool (Steve): can't do anything on ipv6, HTCondorCE
Sussex (Leo): Nothing to report, Physics looking for someone to replace Leo,
buying kit for 10k, ipv6 works
[Raja had to leave at 11:45]

There are minutes attached to this event.

- 11:00 → 11:01
  Ops meeting minutes 1m
  - This is a reminder that this is an important task. The minute taker gives access to the discussions for those not present and provides a reference for others to refer back to afterwards.
  - The team composition has been changing. If everybody contributes then the task comes around less often.
  - Please extract actions from the meeting and add them to our table here: https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items#Action_list.
  - Recent allocations: See above link. The page should be updated each week by the minute taker (if they don't, the task will keep coming to them!).
  - Upcoming allocations:
- 11:01 → 11:20
  Experiment problems/issues 19m
  Review of weekly issues by experiment/VO
  - LHCb
  - CMS
    T1: https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
    T2: https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel
  Please see attached notes.
  - ATLAS
  - Other: Updates should be recorded in https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator.
  - GridPP DIRAC status [Andrew McNab]
    -- https://www.gridpp.ac.uk/gridpp-dirac-sam
- 11:20 → 11:40
  Meetings & updates 20m
  With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
  - General updates
  - WLCG ops coordination
  - Tier-1 status
  - Storage and data management
  - Tier-2 Evolution
  - Accounting
  - Documentation
  - Interoperation
  - Monitoring
  - On-duty
  - Security
  - Services
  - Tickets
  - Tools
  - VOs
  - Site updates
- 11:40 → 12:20
  Discussion topics 40m
  - Site 'round table'
- 12:20 → 12:25
  Actions & AOB 5m