ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-10-22T10:00:00+01:00
End: 2020-10-22T11:00:00+01:00
Location: Vidyo

Thursday 22 Oct 2020, 10:00 → 11:00 Europe/London

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

149095 UKI-SOUTHGRID-OX-HEP less urgent in progress 2020-10-17 07:37:00 UKI-SOUTHGRID-OX-HEP: unstable transfer
- Should be fixed; JW to check and close
148968 UKI-NORTHGRID-LANCS-HEP less urgent waiting for reply 2020-10-21 13:49:00 UKI-NORTHGRID-LANCS-HEP: deletion and transfer failures
- Should now be fine; JW to check and close
148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-10-19 13:51:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
- Power fluctuations (cable strikes in and off campus)
  - Odd networking state for some racks; rebooting appears to have cleared this
- Looking ok? JW to check and close if so.
- Other cvmfs issues also seem to be resolved
146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
- on-hold
146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-09-11 13:35:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
- on-hold
144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
- on-hold
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- on-hold

CPU

RAL
- Firewall upgrade at RAL (morning of 22 Oct) killed most jobs; now recovering
Northgrid
- LANCS now recovering on slots. Unclear why jobs still ran with missing files when declared temporarily unavailable.
London
- Small issue at QMUL, resolved quickly
SouthGrid
Scotgrid
- Durham: some unexpected downtime. One disk server identified with lost ATLAS data; list of files is in preparation.

Other new issues

Ongoing issues

CentOS7 - Sussex
- on-hold
Grand Unified queues
- (awaiting Sheffield)
TPC via http
- Ceph: fix available for testing for alignment issues in the EC pool in xrootd
  - Appears to be working at RAL, although some failure modes are still observed

News round-table

Vip
- Approx 1/3 of Cs to be drained from Saturday, for work on Tuesday.
Dan
- Weekend memory problem with StoRM / Argus; requires a restart (open ticket with StoRM devs).
- Aiming to improve the automation of restarts
Matt
- Rebuilding of servers ongoing.
Peter
- NTR
Sam
- AOD -> DAOD jobs still show up as source of failures. (JW to also follow this).
JW
- NTR (see tpc-http info above).
Rob
- Will have a discussion with ATLAS experts regarding QoS developments with ECDF storage

AOB

- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
    
  - CPU 5m
    
    Monit: Site-oriented dashboard
    
    UK Cloud jobs over last week
    
  - Other new issues 5m
- 10:20 → 10:40
  Ongoing issues 20m
  - CentOS7 - Sussex 5m
    
    Centos 7 deployment Twiki
  - Grand Unified queues 5m
    
    ADCDPA-235
    
    Migration plans
  - TPC with http 5m
- 10:40 → 10:50
  News round-table 10m
- 10:50 → 11:00
  AOB 10m
  NTR