ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-09-03T10:00:00+01:00
End: 2020-09-03T11:00:00+01:00
Location: Vidyo

Thursday 3 Sept 2020, 10:00 → 11:00 Europe/London

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

148474 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-09-01 09:24:00 UKI-NORTHGRID-LANCS-HEP : Low deletion efficiency
- Similar status to last week; combination of aging servers, some full, and empty ones that become overloaded
- On-site access yesterday; some older hardware will need OS upgrades.
148401 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-09-02 15:37:00 UKI-NORTHGRID-LANCS-HEP: globus_ftp_client failures
- as above
148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-09-02 15:53:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
- Consistency check returns files that have zero replicas in DPM. AF to check for any scripts that might help.
- SS to check the database for the zero-replica entries.
146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-08-20 14:44:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”
- JW to follow up.
146651 RAL-LCG2 urgent on hold 2020-08-10 10:59:00 singularity and user NS setup at RAL
- on hold
146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-07-22 14:53:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
- no update
144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
- on hold
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- on hold

CPU

RAL
- Stable; below pledge in Monit, but consistent with internal figures (and with the pledge) when scaled by the correct corepower.
Northgrid
- LANCS: disk problems (described above); in test. May also have some pilot issues.
London
- QMUL: From the 3rd, switched to running production-only jobs (to stop jobs using scratchdisk).
SouthGrid
- OX: Recovered from power issues; 3 WNs (older tranche) not recoverable.
- RALPP: Some reduction due to dCache upgrades.
Scotgrid
- GLA: Running below full capacity; some from DPM, awaiting decommissioning of DPM and relocation, others in new DC.

Other new issues

QMUL upgrade
- JW to confirm that other sites dependent on QMUL storage are also in downtime.

Ongoing issues

Sussex
- on hold
Grand Unified queues
- on hold

News round-table

(NTR)

Vip
- Data center power and air-conditioning issues now recovered. Lost 3 old WNs (approx. 190 cores).
Dan
- Check that dependent sites (e.g. Cambridge) will transition correctly
Matt
- It appears that some pilots are dying at LANCS; lower priority than the disk failures at the moment.
Alessandra
- JW to add the TPC items that need to be done to the agenda page.
Gareth
- https://twiki.cern.ch/twiki/bin/view/AtlasComputing/SitesSetupAndConfiguration#Recommended_CPU_Storage_and_Netw
  - Is the information up to date? (The last update to the page was recent.)
Tim
- Lost files at MAN; AF to redeclare the files as lost.
JW
- NTR

AOB

- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
    
  - CPU 5m
    
    Monit: Site-oriented dashboard
    
    UK Cloud jobs over last week
    
  - Other new issues 5m
    
    QMUL Scratchdisk migration
    
- 10:20 → 10:40
  Ongoing issues 20m
  - CentOS7 - Sussex 5m
    
    CentOS 7 deployment Twiki
  - Grand Unified queues 5m
    
    ADCDPA-235
    
    Migration plans
- 10:40 → 10:50
  News round-table 10m
- 10:50 → 11:00
  
  AOB 10m