ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-09-17T10:00:00+01:00
End: 2020-09-17T11:00:00+01:00
Location: Vidyo

Thursday 17 Sept 2020, 10:00 → 11:00 Europe/London

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

148589 UKI-LT2-UCL-HEP less urgent in progress 2020-09-15 08:59:00 Failovers from UKI-LT2-UCL-HEP to CERN backup proxy
- Waiting for reply from site; squid now monitored through GOCDB.
148401 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-09-16 10:40:00 UKI-NORTHGRID-LANCS-HEP: globus_ftp_client failures
- 17 further files declared lost; zfs scrubbing continuing
148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-09-15 12:59:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
- Deletions inside DPM ongoing.
146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-09-05 18:57:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”
- Action JW - to check and close.
146651 RAL-LCG2 urgent on hold 2020-08-10 10:59:00 singularity and user NS setup at RAL
- On hold
146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-09-11 13:35:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
- On hold
144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
- On hold
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- On hold

CPU and transfers

UK accounting page not working.
- JW to follow-up.
RAL
- no issues.
Northgrid
- Manchester move to ARC 6; one of the ‘hacks’ (applied under ARC 5 to remove some memory limits) reverted.
London
- QMUL out of downtime; additional storage needs to be added asap
  - affects gridFTP and SRM, but not xrootd.
  - With new storage SCRATCHDISK will be enabled (set to reasonable size again)
SouthGrid
Scotgrid

Other new issues

Ongoing issues

TPC http
- NTR

News round-table

Vip
- DPM upgrade in a couple of weeks; will enter downtime.
Dan
- (Update added to CPU section on QMUL lustre migration)
Matt
- MD off next week.
Alessandra
- NTR
Sam
- NTR
Gareth
- Question on moving to storageless:
  - Are there values for required sizes of caches (storage per job slot)?
    - e.g. for xcache; for example, per-site requirements.
  - example: BHAM (80TB) => OX (200TB) from HS06 scaling
  - AF: Not much from UK using xcache, and CMS a good place to look.
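A minimal sketch of the HS06 scaling mentioned in the example above, assuming the cache size is taken to grow linearly with a site's HS06 capacity. The function name and the HS06 figures are hypothetical; they are chosen only so that the ratio reproduces the 80TB => 200TB example, and are not the sites' real values.

```python
def scale_cache_tb(ref_cache_tb: float, ref_hs06: float, target_hs06: float) -> float:
    """Scale a reference cache size linearly with HS06 capacity (assumed model)."""
    if ref_hs06 <= 0:
        raise ValueError("reference HS06 must be positive")
    return ref_cache_tb * target_hs06 / ref_hs06


# Hypothetical HS06 capacities, NOT real site figures: only the ratio (2.5) matters.
bham_hs06 = 10_000.0
ox_hs06 = 25_000.0

# 80 TB reference cache (BHAM) scaled to the larger site (OX).
ox_cache_tb = scale_cache_tb(80.0, bham_hs06, ox_hs06)
print(f"{ox_cache_tb:.0f} TB")
```

A linear model is the simplest reading of "from HS06 scaling"; a real estimate would also depend on the workload mix and on how often cached data is reused.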
Tim
- Noted AF closed a number of Jira tickets.
JW
- NTR

AOB

- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
    
  - CPU 5m
    
    Monit: Site-oriented dashboard
    
    UK Cloud jobs over last week
    
  - Other new issues 5m
- 10:20 → 10:40
  Ongoing issues 20m
  - CentOS7 - Sussex 5m
    
    Centos 7 deployment Twiki
  - Grand Unified queues 5m
    
    ADCDPA-235
    
    Migration plans
  - TPC with http 20m
- 10:40 → 10:50
  News round-table 10m
- 10:50 → 11:00
  
  AOB 10m
