ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-06-11T10:00:00+01:00
End: 2020-06-11T11:00:00+01:00
Location: Vidyo

Thursday 11 Jun 2020, 10:00 → 11:00 Europe/London

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

147390 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-06-09 10:48:00 Failovers from jobs running at UKI-SCOTGRID-GLASGOW_CEPH to CERN backup proxy
- Routing issue between the two sides of the DC; will attempt some static routing, but physical access will be needed to resolve it fully
147361 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-06-08 07:43:00 Deletion errors at UKI-SCOTGRID-GLASGOW
- Check whether the lost files are missing from our DPM DB; if so, do they need to be removed on the ATLAS side?
- Tricky to delete multiple replicas; risk of deleting the whole object, not just the copy on disk039.
- JW to ask the DDM Ops people.
146918 UKI-SCOTGRID-ECDF less urgent in progress 2020-06-09 10:03:00 Failovers from jobs running at UKI-SCOTGRID-ECDF_CLOUD to CERN backup proxy
- Still with other pressing priorities
146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-06-07 16:11:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”
- Problematic xroot with DPM, plan still to upgrade to centos 7.
146651 RAL-LCG2 urgent in progress 2020-05-27 10:43:00 singularity and user NS setup at RAL
- In todo list
146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-06-06 23:57:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
- Queue set to TEST, progress being made
145688 UKI-NORTHGRID-MAN-HEP less urgent on hold 2020-04-02 09:20:00 Very old version of squids at UKI-NORTHGRID-MAN-HEP
- On hold
145510 RAL-LCG2 urgent on hold 2020-05-13 13:07:00 RAL-LCG2: timeouts on stage-in/outs
- To close -> DirectIO comparisons
144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
- As previously; needs to change the HW.
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- Updated; physical access required to manage migration
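
For the Glasgow deletion ticket (147361), the question of which lost files are still known on the ATLAS side amounts to comparing two listings. A minimal sketch, assuming both the catalogue dump and the storage dump are available as plain lists of paths; the function name and the example paths are hypothetical and not taken from any DDM tool:

```python
def compare_listings(catalogue_paths, storage_paths):
    """Compare a catalogue listing with a storage listing.

    Returns (lost, dark):
      lost: paths the catalogue knows about but the storage does not hold
      dark: paths the storage holds but the catalogue does not know about
    """
    catalogue = set(catalogue_paths)
    storage = set(storage_paths)
    lost = sorted(catalogue - storage)
    dark = sorted(storage - catalogue)
    return lost, dark


# Hypothetical example listings, for illustration only.
catalogue_dump = ["/atlas/data/a.root", "/atlas/data/b.root", "/atlas/data/c.root"]
storage_dump = ["/atlas/data/b.root", "/atlas/data/c.root", "/atlas/data/d.root"]

lost, dark = compare_listings(catalogue_dump, storage_dump)
print("lost:", lost)
print("dark:", dark)
```

Only the "lost" list would need declaring on the ATLAS side; how that is done is for DDM Ops to confirm.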

CPU

Pledge lines - only visible in 30day mode now

Other new issues

Ongoing issues

CentOS7 DPM Lancs
- NTR
CentOS7 - Sussex
- As mentioned above
Glasgow Ceph storage
- Not a bandwidth problem; unlikely to be in the Ceph cluster
- Problem seems to be in the disk cache. If there is a problem getting a file, is the truncated file stored? Is the cache then poisoned by the corrupt copies?
- Using xrootd 4.12, compiled.
  - Try perhaps a 4.11? (Can use the exact version that RAL uses,
- CEPH itself appears more stable after configurations
Grand Unified queues
- Awaiting SHEF
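
On the Glasgow disk-cache question above: one general way to keep truncated copies out of a cache is to fetch into a temporary file and publish it only once its size matches the expected size. A minimal sketch of that idea, not a description of how xrootd behaves; `admit_to_cache`, the `fetch` callable and `expected_size` are hypothetical names introduced for illustration:

```python
import os
import tempfile


def admit_to_cache(fetch, cache_path, expected_size):
    """Fetch a file and add it to the cache only if it arrived complete.

    fetch: callable that writes the file content to a binary file object
           (a stand-in for the real transfer).
    Returns True if the file was published to cache_path, False otherwise.
    """
    cache_dir = os.path.dirname(cache_path) or "."
    # Write next to the final location so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            fetch(tmp)
        if os.path.getsize(tmp_path) != expected_size:
            return False  # truncated or oversized: never becomes visible
        os.replace(tmp_path, cache_path)  # atomic publish
        return True
    finally:
        # Remove the partial file if it was not published.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Readers of the cache then only ever see complete files; a checksum comparison could be added alongside the size check if the expected checksum is known.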

News round-table

Vip
- NTR
Dan
- Migration to centos7 for several services in progress
Matt
- NTR
Peter
- School closures continue to interrupt normal work
Alessandra
- DPM 1.14 in testing; needed for TPC tests in production; contains puppet and memory libraries (to avoid full mem)
  - Petr, RAL off RAL-FTS (on to CERN), to have the TPC capabilities
Sam
- NTR
Gareth
- NTR
Tim
- TPC: test transfers (xrootd) have checksum issues; too slow for the stress test. Can it be improved by checksumming close to the storage?
  - Can also reduce the number of simultaneous connections?
- Petr pushing to look at http (may be the eventual preferred protocol)
- Current issues are with the xrootd server, not the protocol
JW
- NTR
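
On Tim's question about checksumming close to the storage: one option is to compute the checksum while the data is being written, so the file is not read a second time afterwards. A minimal sketch, assuming Adler-32 is the checksum in use; `copy_with_adler32` is a hypothetical helper, not part of xrootd or DPM:

```python
import io
import zlib


def copy_with_adler32(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in chunks and return the Adler-32 as 8 hex digits.

    The checksum is updated as each chunk passes through, so no second
    read of the stored file is needed.
    """
    checksum = 1  # Adler-32 starting value
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        checksum = zlib.adler32(chunk, checksum)
        dst.write(chunk)
    return format(checksum & 0xFFFFFFFF, "08x")


# Usage with in-memory streams standing in for the transfer and the storage.
source = io.BytesIO(b"Wikipedia")
stored = io.BytesIO()
print(copy_with_adler32(source, stored, chunk_size=4))
```

Whether this helps in practice depends on where the server computes the checksum, which is still to be established.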

- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
  - CPU 5m
    
    UK Cloud jobs over last week
    
    Pledge lines - appears only to be visible in 30day mode currently
  - Other new issues 5m
- 10:20 → 10:40
  Ongoing issues 20m
  - LANCS DPM centos 7 upgrade 5m
    
    Jira: ATLDDMOPS-5527
  - CentOS7 - Sussex 5m
    
    Centos 7 deployment Twiki
  - Glasgow Ceph storage 5m
    
    ADCINFR-152: Glasgow Ceph
  - Grand Unified queues 5m
    
    ADCDPA-235
    
    Migration plans
- 10:40 → 10:50
  News round-table 10m
- 10:50 → 11:00
  
  AOB 10m