ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 148946 UKI-LT2-QMUL less urgent in progress 2020-10-07 10:34:00 Failovers from jobs running at UKI-LT2-QMUL queue

    • WNs available with IPv6
  • 148908 UKI-NORTHGRID-LANCS-HEP less urgent waiting for reply 2020-10-07 16:56:00 UKI-NORTH-LANCS-HEP jobs failing due to “lost heartbeat”

    • Downtime for improvements to the shared FS is done.
    • ZFS failed files; checks ongoing.
    • Current HammerCloud (HC) failures with root://fal-pygrid-30.lancs.ac.uk:1094//dpm/lancs.ac.uk/home/atlas/atlasdatadisk/rucio/data18_13TeV/96/e0/data18_13TeV.00349263.physics_Main.merge.AOD.f937_m1972._lb0150._0003.1
      • JW to declare the HC file lost so that HC passes again (see the Rucio sketch after this list)
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-10-07 18:15:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures

    • SS apologised for absence; report sent via email:
      • Disk cleanup/deletions on the DPM are being handled locally for files on disk063, which appear to be dark data.
    • Pick-up in failure rate overnight for Ceph:
      • Initially, it looks like putting the 40GB/s connection into the Ceph cluster might have caused some load spikes. Traffic shaping will be looked at later today; write traffic appears to be the only thing seriously affected. (A hedged traffic-shaping sketch follows this list.)
  • 146651 RAL-LCG2 urgent on hold 2020-08-10 10:59:00 singularity and user NS setup at RAL

    • Update requested from the Grid Services team on the timeline
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-09-11 13:35:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • No update
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • No update
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • No update
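
Note on ticket 148908: a minimal sketch, assuming a configured Rucio client environment (rucio.cfg, valid X.509 proxy, sufficient privileges), of how the lost HC file could be declared. declare_bad_file_replicas is the standard Rucio Python client method; the reason string is illustrative, not the wording actually used.

```python
# Minimal sketch, assuming a configured Rucio client environment.
# The PFN is the failing HammerCloud file from ticket 148908;
# the reason string is illustrative.
from rucio.client import Client

client = Client()

pfn = ("root://fal-pygrid-30.lancs.ac.uk:1094//dpm/lancs.ac.uk/home/atlas/"
       "atlasdatadisk/rucio/data18_13TeV/96/e0/"
       "data18_13TeV.00349263.physics_Main.merge.AOD."
       "f937_m1972._lb0150._0003.1")

# Declaring the replica bad lets Rucio recover it from another copy,
# or mark the file lost if no other replica exists.
unknown = client.declare_bad_file_replicas(
    [pfn], reason="HC file lost at UKI-NORTHGRID-LANCS-HEP")
print(unknown)  # PFNs Rucio could not match to a known replica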
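```

Note on ticket 148342: purely as an illustration of the traffic shaping mentioned in the Glasgow report, one possible approach is a token-bucket qdisc on the link into the Ceph cluster. The interface name and rate cap below are assumptions, not values from the report, and this is only one of several shaping options.

```python
# Hedged illustration only: capping traffic into the Ceph cluster with a
# token-bucket qdisc, driven from Python via tc. Interface and rate are
# hypothetical placeholders, not values from the site report.
import subprocess

IFACE = "eth0"     # hypothetical interface carrying Ceph write traffic
RATE = "20gbit"    # hypothetical cap, chosen below link capacity

# "replace" installs the qdisc, or updates it if one is already present.
subprocess.run(
    ["tc", "qdisc", "replace", "dev", IFACE, "root",
     "tbf", "rate", RATE, "burst", "64mb", "latency", "400ms"],
    check=True,
)
```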

CPU

  • RAL
    • Running above pledge; problems on the CMS side freed up slots
  • Northgrid
    • LANCS issues (see ticket above), and a recent drop for MAN
  • London

  • SouthGrid

  • Scotgrid
    • GLA Ceph related issues noted above

Other new issues

Ongoing issues

  • CentOS7 - Sussex
    • on hold
  • Grand Unified queues
    • on hold

News round-table

  • Vip
    • 26-27th possible downtime?
    • Try to find a time to discuss storageless tests and plans
  • Dan
    • NTR; asked for relevant information from the ATLAS S&C week to be passed back to the T2s.
    • JW mentioned the move of the data-carousel model into production.
  • Matt
    • Expecting more disks to arrive
  • Peter
    • Raised interest in Covid working arrangements at other sites
  • Sam
    • Sent apologies
  • JW
    • HTTP-TPC tests reveal an issue with pulls (RAL as destination) when writing data (a hedged reproduction sketch follows).
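
A hedged sketch of reproducing an HTTP-TPC pull (the destination, e.g. RAL, issues the COPY and writes the data itself) with the gfal2 Python bindings, not the actual test harness used. The endpoints are placeholders, and the DEFAULT_COPY_MODE option is assumed to be exposed by the installed gfal2 HTTP plugin.

```python
# Hedged sketch, not the actual test harness: an HTTP-TPC *pull*, where
# the destination fetches from the source and writes the data locally.
# Endpoints are placeholders; DEFAULT_COPY_MODE is assumed available in
# the installed gfal2 HTTP plugin.
import gfal2

ctx = gfal2.creat_context()
# Force third-party-copy pull mode so the destination drives the transfer.
ctx.set_opt_string("HTTP PLUGIN", "DEFAULT_COPY_MODE", "3rd pull")

params = ctx.transfer_parameters()
params.timeout = 300        # seconds
params.overwrite = True     # replace any existing destination file

src = "https://source-se.example.org:443/path/to/testfile"  # placeholder
dst = "https://dest-se.example.org:443/path/to/testfile"    # placeholder

ctx.filecopy(params, src, dst)
```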

AOB
