ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-09-10T10:00:00+01:00
End: 2020-09-10T11:00:00+01:00
Location: Vidyo

Thursday 10 Sept 2020, 10:00 → 11:00 Europe/London

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

148589 UKI-LT2-UCL-HEP less urgent in progress 2020-09-09 10:50:00 Failovers from UKI-LT2-UCL-HEP to CERN backup proxy
- In progress; a local user is causing issues. Squid is not set to be monitored in GOCDB.
148578 UKI-NORTHGRID-LANCS-HEP urgent in progress 2020-09-09 15:04:00 cannot download files from UKI-NORTHGRID-LANCS-HEP_LOCALGROUPDISK
- Some files lost, others recovered on LOCALGROUPDISK
- ZFS needs time to complete the list.
- JW - to delete the 7 LGD files.
148544 UKI-SCOTGRID-ECDF less urgent in progress 2020-09-07 13:09:00 UKI-SCOTGRID-ECDF failed jobs
- Possible checksum timeouts for large files.
- Upgrades under way.
148401 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-09-09 19:20:00 UKI-NORTHGRID-LANCS-HEP: globus_ftp_client failures
- Initial filelist declared lost
- Disk server finally failed the disk; list of lost files being prepared.
148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-09-04 05:40:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
- Work on capacity in the Ceph pool is in progress.
146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-09-05 18:57:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”
- Awaiting completion of upgrades.
146651 RAL-LCG2 urgent on hold 2020-08-10 10:59:00 singularity and user NS setup at RAL
- on hold
146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-07-22 14:53:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
- on hold
144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
- on hold
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- on hold

CPU

RAL
- Echo downtime for network firmware upgrades; a manual downtime needed to be set.
Northgrid
- Issues with Lancs (as described above)
London
- QMUL in downtime for Lustre updates; will extend into next week.
SouthGrid
Scotgrid

Other new issues

Active http TPC endpoints
- LANCS upgraded; works fine
- ECDF upgraded, but with some certificate issues.
- Durham (with macaroon): upload and download only, not TPC; not in functional tests.

Ongoing issues

CentOS7 - Sussex
- No update
Grand Unified queues
- Awaiting SHEF

News round-table

Vip
- Paul to send Vip instructions for DPM upgrades.
  - Some manual changes needed for TPC.
- Still working on residual fallout from previous DC power issues
Dan
- In downtime for upgrades; hardware done.
- One difficult ATLAS directory (many files to verify checksums); downtime to next week to finish migration.
- Additionally, AC situation will be improved by next week.
- Change to mountpoints needed; to confirm via email.
Matt
- Working on the storage issues.
- More jobs running from other VOs.
Peter
- ATLAS will switch to python3 from cvmfs; should be transparent.
Alessandra
- NTR
Sam
- ATLAS to move to 40G Ceph.
- Starting to look at XRootD 5 for some new features.
JW
- Work on TPC with HTTP for Ceph ongoing.

AOB

- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
  - CPU 5m
    
    Monit: Site-oriented dashboard
    
    UK Cloud jobs over last week
    
  - Other new issues 5m
    
    Status of TPC with HTTP:
    dCache (>5.2.18 and 6.2.x) and DPM (1.14) sites:
    UKI-SCOTGRID-ECDF
    UKI-SOUTHGRID-OX-HEP
    
- 10:20 → 10:40
  Ongoing issues 20m
  - CentOS7 - Sussex 5m
    
    Centos 7 deployment Twiki
    
  - Grand Unified queues 5m
    
    ADCDPA-235
    
    Migration plans
- 10:40 → 10:50
  News round-table 10m
- 10:50 → 11:00
  
  AOB 10m
