ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2022-01-13T10:00:00+00:00
End: 2022-01-13T11:00:00+00:00
Location: Zoom

Thursday 13 Jan 2022, 10:00 → 11:00 Europe/London


Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Description

https://cern.zoom.us/j/98434450232

Password protected (same as (new) OPs Mtg)


● Outstanding tickets

- 155473 TEAM atlas RAL-LCG2 less urgent NGI_UK in progress 2022-01-11 09:20:00 BU_ATLAS_Tier2 transfer and deletion errors EGI
  - IPv4 connectivity issues on the new webdav aliased hosts
- 155460 USER atlas UKI-SOUTHGRID-CAM-HEP less urgent NGI_UK in progress 2022-01-12 15:51:00 Failovers from Cambridge to CERN backup proxy EGI
  - Active discussions from site admins
- 155430 TEAM atlas UKI-SCOTGRID-ECDF less urgent NGI_UK in progress 2022-01-12 16:37:00 UKI-SCOTGRID-ECDF transfer and deletion errors EGI
  - Data at risk, due to problems over Christmas
- 155141 TEAM atlas UKI-LT2-Brunel less urgent NGI_UK in progress 2021-12-24 08:39:00 Transfers from UKI-LT2-Brunel fail with “Internal Server Error” EGI
  - JW to progress to a solution
- 154806 TEAM atlas UKI-LT2-QMUL less urgent NGI_UK in progress 2021-12-25 04:28:00 UKI-LT2-QMUL SOURCE transfer failures: [13] Result (Neon): SSL handshake failed EGI
  - Server fell over on Christmas day
  - Moving towards adding more ‘oomph’; however, this is not the highest-priority item
- 154543 TEAM atlas UKI-SCOTGRID-ECDF urgent NGI_UK in progress 2021-12-08 12:35:00 DPM storage ACL configuration EGI
  - Other urgent issues are delaying this
- 154436 TEAM atlas RAL-LCG2 very urgent NGI_UK on hold 2021-12-08 13:25:00 RAL Echo Davs developments EGI
  - New webdavs endpoint with new gateways created. Available for more aggressive optimisation tuning and improvements
- 153367 TEAM atlas RAL-LCG2 urgent NGI_UK on hold 2021-12-01 15:37:00 HTTPS on RAL CTA EGI
  - Needs to be tested
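
The ticket summary lines above follow a regular layout, so they can be split into fields for sorting or filtering. The following is a minimal sketch only, assuming the column order seen in these minutes (ID, type, VO, site, priority, NGI, status, timestamp, subject, scope) and only the priority and status values that appear here; `parse_ticket` and the field names are hypothetical, not part of GGUS or any ATLAS tool.

```python
import re
from datetime import datetime

# Pattern for one ticket summary line as it appears in these minutes.
# Assumption: only the priorities and statuses seen above are listed;
# other values would need to be added to the alternations.
TICKET_RE = re.compile(
    r"^(?P<id>\d+)\s+"
    r"(?P<type>TEAM|USER)\s+"
    r"(?P<vo>\S+)\s+"
    r"(?P<site>\S+)\s+"
    r"(?P<priority>less urgent|very urgent|urgent)\s+"
    r"(?P<ngi>\S+)\s+"
    r"(?P<status>in progress|on hold)\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<subject>.*?)\s+"
    r"(?P<scope>\S+)$"
)


def parse_ticket(line):
    """Return a dict of fields for a ticket line, or None for any other line."""
    match = TICKET_RE.match(line.strip().lstrip("- "))
    if match is None:
        return None
    fields = match.groupdict()
    fields["id"] = int(fields["id"])
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%Y-%m-%d %H:%M:%S"
    )
    return fields


if __name__ == "__main__":
    example = (
        "- 154436 TEAM atlas RAL-LCG2 very urgent NGI_UK on hold "
        "2021-12-08 13:25:00 RAL Echo Davs developments EGI"
    )
    ticket = parse_ticket(example)
    print(ticket["id"], ticket["site"], ticket["priority"], ticket["status"])
```

Comment lines under each ticket (for example "Needs to be tested") do not match the pattern and return `None`, so the same function can be run over the whole list.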

● CPU

- RAL
  - Remains low; partly due to job scheduling when a large number of FTS transfers are in the queue.
  - Also due to contention from other VOs
- Northgrid
  - Largely ok
- London
  - Some brief issues with QMUL
- SouthGrid
  - BHAM: outage of a few days (?), but back now.
  - Sussex: running well, but could be running more slots at the site
- Scotgrid
  - Durham: cooling and power issues over Christmas. SRR was not readable, leading to overfilling of the storage.
    - Once the SRR was accessible, jobs started running and the stored data dropped back below the total.
  - Gla: disk controller appears to have died; expected to be back online shortly.

● Ongoing Items

TPC with http
- Davs optimisation at RAL to take priority, with a new webdav alias available
Storageless Site test (Oxford)
- Seeing TLS errors on the Xcache via xrootd; cache is passing through the data
LANCS Storage migration
- JW to ensure endpoint is configured in CRIC
- Site awaiting one last switch change to begin real testing

● News round-table

Alessandra
- NTR
Dan
- Storage for ATLAS by the end of the month
- Refurbishment remains some way off
Gerard
- NTR
Matt
- NTR
Patrick
- NTR; attempting to work out how to get the full number of slots running at the site.
Peter
- NTR
Sam
- GLA now restarted.
Stephen
- NTR
Vip
- To arrange a discussion to track down Xcache problems


- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
    
  - CPU 5m
    
    New link for the site-oriented dashboard
    
    Monit: Site-oriented dashboard
    
    UK Cloud jobs over last week
    
  - Other new issues / tasks 5m
    
    Re-enabling GPU queue for QMUL
    
    Analysis facilities: understand the status and E&D for UK; feedback to Alessandra.
    
    Multihop failures RAL; no overwrite of failed intermediate steps
    
    RAL complete Echo rebalancing; Stop greedy deletions for Disk
- 10:20 → 10:40
  Ongoing Items 20m
  - TPC with http 5m
    
    RAL as DST: DAVS
    
    RAL as SRC: DAVS
    
    TPC at RAL: ADCINFR-195
  - Storageless Site test (Oxford) 5m
    
    ADCINFR-185
  - LANCS Storage migration 20m
    
    Commissioning of new Storage: ATLDDMOPS-5587
    
    DPM Decommissioning: ATLDDMOPS-5588
- 10:40 → 10:50
  News round-table 10m
- 10:50 → 11:00
  
  AOB 10m
  
  ATLAS Monitoring (UK Cloud)
