ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-06-18T10:00:00+01:00
End: 2020-06-18T11:00:00+01:00
Location: Vidyo

Thursday 18 Jun 2020, 10:00 → 11:00 Europe/London

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

147436 UKI-SOUTHGRID-RALPP less urgent in progress 2020-06-15 14:58:00 UK UKI-SOUTHGRID-RALPP failing deletions
- Argus server down from power problems; also disk server out of action; needs physical access (this week).
147390 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-06-09 10:48:00 Failovers from jobs running at UKI-SCOTGRID-GLASGOW_CEPH to CERN backup proxy
- Progress made; test machine working; need to plan how to roll back the change once a permanent solution is found.
147361 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-06-15 15:13:00 Deletion errors at UKI-SCOTGRID-GLASGOW
- Site needs to delete files from the namespace; general strategy:
  - Have the site clean the namespace from any leftovers.
  - Have the site produce storage dumps.
  - Run a consistency check.
  - Declare any missing files as lost.
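A minimal sketch of the consistency-check step, assuming both the storage dump and the catalogue dump have already been reduced to plain lists of file paths (one per line). The function name `compare_dumps` and the dump format are illustrative assumptions, not the actual tooling used by the site or by ATLAS DDM. Files in the catalogue but absent from storage are candidates to be declared lost; files on storage but unknown to the catalogue are leftovers to clean. The sketch ignores files written or deleted while the dumps were being taken.

```python
def compare_dumps(catalogue_paths, storage_paths):
    """Compare a catalogue dump with a storage dump (hypothetical helper).

    Both arguments are iterables of path strings. Blank lines and
    surrounding whitespace are ignored.
    Returns (lost, dark):
      lost: in the catalogue but missing from storage
      dark: on storage but not known to the catalogue
    """
    catalogue = {p.strip() for p in catalogue_paths if p.strip()}
    storage = {p.strip() for p in storage_paths if p.strip()}
    lost = sorted(catalogue - storage)
    dark = sorted(storage - catalogue)
    return lost, dark


if __name__ == "__main__":
    # Made-up paths, for illustration only.
    catalogue_dump = ["/atlas/data/file_a\n", "/atlas/data/file_b\n"]
    storage_dump = ["/atlas/data/file_b\n", "/atlas/data/file_c\n", "\n"]
    lost, dark = compare_dumps(catalogue_dump, storage_dump)
    print("declare lost:", lost)
    print("clean from storage:", dark)
```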
146771 UKI-SCOTGRID-ECDF less urgent on hold 2020-06-16 15:41:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”
- Power outage; migration of DPM to CentOS7 brought forward; will monitor the situation.
146651 RAL-LCG2 urgent in progress 2020-05-27 10:43:00 singularity and user NS setup at RAL
- no update
146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-06-06 23:57:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
- on hold for ce6; Elena working hard to make progress
145688 UKI-NORTHGRID-MAN-HEP less urgent on hold 2020-04-02 09:20:00 Very old version of squids at UKI-NORTHGRID-MAN-HEP
- on hold
145510 RAL-LCG2 urgent in progress 2020-06-18 05:50:00 RAL-LCG2: timeouts on stage-in/outs
- Ticket updated. Currently no spike in timeouts with the switch to direct-io for user jobs; should quantify the error rate.
- Set to in progress; aim to close the ticket once the direct-io studies are done.
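One possible way to quantify the error rate, sketched under the assumption that the number of timed-out transfers and the total number of transfers can be counted for a given period. The function name `timeout_rate` and the example counts are hypothetical. A Wilson score interval is used so that periods with few transfers, or with zero timeouts, still give a sensible uncertainty.

```python
import math


def timeout_rate(n_timeouts, n_transfers, z=1.96):
    """Return (rate, low, high) for the fraction of transfers that timed out.

    low/high are the Wilson score interval bounds; z=1.96 corresponds
    to roughly 95% confidence.
    """
    if n_transfers <= 0:
        raise ValueError("n_transfers must be positive")
    if not 0 <= n_timeouts <= n_transfers:
        raise ValueError("n_timeouts must lie between 0 and n_transfers")
    p = n_timeouts / n_transfers
    denom = 1 + z * z / n_transfers
    centre = (p + z * z / (2 * n_transfers)) / denom
    half = z * math.sqrt(
        p * (1 - p) / n_transfers + z * z / (4 * n_transfers ** 2)
    ) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    rate, low, high = timeout_rate(5, 50)
    print(f"timeout rate {rate:.3f} (interval {low:.3f} to {high:.3f})")
```

Comparing the intervals before and after the switch to direct-io would show whether any difference is larger than the statistical noise.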
144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
- Keep open until move is complete
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- on hold

CPU

RAL
Northgrid
London
SouthGrid
Scotgrid
- ECDF recovered from downtime; still deletion errors
- Glasgow AC recovered - up to 25k jobs

Other new issues

Ongoing issues

CentOS7 DPM Lancs
- Does not look like our SL6 CE will be able to talk to the new filesystem.
- Next week is downtime for the cluster, during which our worker nodes will all be reinstalled and a new back-end shared filesystem will be put in place.
- Plan is to set up an ARC CE by lunchtime tomorrow, as the old SL6 CREAM CE will likely stop being able to talk to our cluster after next week’s downtime.
- Please can ATLAS UK be ready to add the new endpoints to AGIS for us.
CentOS7 - Sussex
- Awaiting updating
Glasgow Ceph storage
- RAL xrootd was tried
  - Identified that some performance tuning options, under high load, caused too many concurrent threads and truncation of the cached files.
- Moved back to on-disk cache (on SSDs).
- Getting data to the jobs looks much better now.
- Staging back from jobs and redirection are next, to work across all three caches (one currently running).
- Ceph tuning for timeouts; stability improving; awaiting updated Nautilus release for fixes to some current workarounds.
- Bandwidth looks good and is maxing out the gridFTP box.
- Different versions of xrootd on the various services: gateway 4.12.2, cache 4.11.3; aim to upgrade when possible.
Grand Unified queues
- Awaiting SHEF

News round-table

Vip
- NTR; passed on information that the ATLAS timeline for site decommissioning to storageless is typically 3-6 months.
Matt
- Microphone problems; comments above passed by chat
Sam
- NTR
Gareth
- NTR
Tim
- RAL should try to get TPC running; it is an ATLAS priority.
JW
- Will concentrate on direct-io tests to close the RAL ticket.

- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
    
  - CPU 5m
    
    UK Cloud jobs over last week
    
  - Other new issues 5m
- 10:20 → 10:40
  Ongoing issues 20m
  - LANCS DPM centos 7 upgrade 5m
    
    Jira: ATLDDMOPS-5527
  - CentOS7 - Sussex 5m
    
    Centos 7 deployment Twiki
  - Glasgow Ceph storage 5m
    
    ADCINFR-152: Glasgow Ceph
  - Grand Unified queues 5m
    
    ADCDPA-235
    
    Migration plans
- 10:40 → 10:50
  News round-table 10m
- 10:50 → 11:00
  
  AOB 10m