ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-12-10T10:00:00+00:00
End: 2020-12-10T11:00:00+00:00
Location: Zoom

Thursday 10 Dec 2020, 10:00 → 11:00 Europe/London

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Description

Meeting to be held via Zoom (https://ukri.zoom.us/j/97404730356)
Password protected (same as the Ops meeting)

Outstanding tickets

149842 UKI-SCOTGRID-ECDF less urgent assigned 2020-12-09 11:15:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
- ECDF davs/https transfers: head nodes possibly overloaded, compared with other protocols (Sam's interpretation)
- Rob looking into this
149811 UKI-LT2-QMUL less urgent in progress 2020-12-09 16:16:00 Transfer and deletion errors from UKI-LT2-QMUL as dst site
- Storage back online; several systems need rebuilding for the compute nodes
- Proxmox cluster taken down. The HP SSDs running the journals have an uptime bug that bricks them after x hours. 2 out of 3 SSDs taken out.
  - Positive comments made about Proxmox; runs on Debian/Ubuntu
- Downtime next week for power work
149750 UKI-SOUTHGRID-RALPP less urgent in progress 2020-12-09 11:50:00 UKI-SOUTHGRID-RALPP: unable to connect to host
- IPv4 problems reaching the site for FTS transfers via Rucio.
- Site will attempt a router reboot to fix this.
- Also exposed a bug in Rucio's default IP version (IPv4 vs IPv6) when none is specified in the RSE.
  - The RSE default looks to have been updated, so transfers are now succeeding over IPv6.
149738 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-12-09 14:16:00 UKI-NORTHGRID-LANCS-HEP: deletion errors
- Two sets of files declared lost.
- Recovery of the remaining unique set is still being attempted; attempts will stop by Monday.
149362 UKI-SOUTHGRID-RALPP urgent in progress 2020-12-04 10:14:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
- No progress; may, however, be related to IPv4/IPv6 differences; to be followed up.
148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-11-27 10:00:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
- Received file list from disk 40. Some files might be recoverable, but this is unlikely.
- To be declared lost once cleaned from the namespace.
- JW to create a Jira ticket and get the unique files.
146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
- On hold
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-11-05 10:52:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- ARC now working correctly. LDAP issue: not started. Adding more nodes, but network failures in the DC need to be fixed.
- Final nodes need provisioning; aim to finish early next year.

CPU

RAL
- HC test failures (due to an updated ROOT version in one of the tests) caused sites to go into test. Recovery of lost slots is taking time.
Northgrid
- Lancs: misconfiguration of the submission directory on the NFS mounts; should now be fixed
London
- QMUL issues (as reported above)
SouthGrid
- Oxford observed an HC dip similar to RAL's
Scotgrid
- Durham: problematic disk server over the weekend.
- Glasgow: some additional cores added; running with 40 kHS06.

Other new issues

Glasgow Site Avail/Rel
- ETF information appears to be correct, but the interpretation from the ATLAS topology enrichment via VOFeed still needs to be understood and updated.

Ongoing issues

CentOS7 - Sussex
- (described above)
TPC with http
- No update
Storageless Site tests (Oxford)
- No progress; discussions ongoing on how to configure the ARC-CE queues
ECDF volatile storage
- Ticket updated; a number of config changes are needed on the ATLAS side; JW to follow up.
Glasgow DPM Decommissioning
- LOCALGROUPDISK still needs to be set up on Ceph. Discussion on pool naming vs endpoint naming.

News round-table

Vip
- NTR
Dan
- NTR
Matt
- NTR
Peter
- NTR
Sam
- NTR
Gareth
- NTR
JW
- NTR
Patrick
- NTR

AOB

Future meetings to use the new CERN-hosted Zoom room, integrated into Indico.
Next week (17th) is the last Cloud Support meeting of the year. Expect to restart on the 7th.

- 1
  Status
  - a) Outstanding tickets
    
    Open ATLAS UK GGUS tickets
    
  - b) CPU
    
    Monit: Site-oriented dashboard
    
    UK Cloud jobs over last week
    
  - c) Other new issues / tasks
  - d) Glasgow: Site Availability/Reliability Config
    
    ADCMONITOR-491
    
    RQF1708406
    
- 2
  Ongoing Items
  - a) CentOS7 - Sussex
    
    Centos 7 deployment Twiki
  - b) TPC with http
  - c) Storageless Site tests (Oxford)
    
    ADCINFR-185
  - d) ECDF volatile storage
    
    ADCINFR-184
  - e) Glasgow DPM Decommissioning
    
    LOCALGROUPDISK and DATADISK decommissioning
    
    ADCINFR-152
- 3
  News round-table
- 4
  AOB
  - Set up new CERN Zoom room for next week.