ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-06-04T10:00:00+01:00
End: 2020-06-04T11:00:00+01:00
Location: Vidyo

Thursday 4 Jun 2020, 10:00 → 11:00 Europe/London

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

147299 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-06-03 23:12:00 UKI-NORTHGRID-LANCS-HEP: deletion errors
- Heading on-site to understand the problem; possibly the disk has died (~10TB data loss)
146918 UKI-SCOTGRID-ECDF less urgent in progress 2020-06-02 10:46:00 Failovers from jobs running at UKI-SCOTGRID-ECDF_CLOUD to CERN backup proxy
- Pushed back due to other Edinburgh priorities
146771 UKI-SCOTGRID-ECDF less urgent on hold 2020-06-02 10:30:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”
- Pushed back due to other Edinburgh priorities
146651 RAL-LCG2 urgent in progress 2020-05-27 10:43:00 singularity and user NS setup at RAL
- Work ongoing to use unprivileged mode.
146525 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-05-15 16:12:00 UKI-NORTHGRID-SHEF-HEP: evicted jobs
- Active interactions with NorduGrid mailing lists; discussion on the deprecation of LCMAPS and its possible replacements
146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-05-15 16:11:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
- As above
145688 UKI-NORTHGRID-MAN-HEP less urgent on hold 2020-04-02 09:20:00 Very old version of squids at UKI-NORTHGRID-MAN-HEP
- On hold
145510 RAL-LCG2 urgent on hold 2020-05-13 13:07:00 RAL-LCG2: timeouts on stage-in/outs
- Will aim to close this week
144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-02-17 09:51:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
- On hold
142329 UKI-SOUTHGRID-SUSX top priority reopened 2020-06-01 08:27:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- To be put on hold; awaiting rollout of all nodes; might require physical access

CPU

RAL
Northgrid
- MAN - walltime units changed from seconds to minutes (within ATLAS); change reverted.
- Lancaster’s drop was due to an IPv6 problem over the weekend
London
SouthGrid
Scotgrid

Other new issues

Request sent from ATLAS to restart squids due to residual issues with DB / frontier problems from previous weeks

Ongoing issues

CentOS7 DPM Lancs
- No change to plans
CentOS7 - Sussex
- As in GGUS discussion
Glasgow Ceph storage
- xroot message and troubleshooting tricky.
- External - should be OK (gridFTP, maybe also external xrootd).
- Internal - bandwidth: 30GB/s (3x 10GB links).
Grand Unified queues
- Awaiting Sheffield

News round-table

Vip
- Problem job - https://aipanda024.cern.ch/condor_logs_2/20-06-04_05/grid.18774108.3.out
- panda id – 4748952562: The GSI XIO driver failed to establish a connection via the underlying protocol
Dan
- LCMAPS will become deprecated, what will be the solution?
- Updated mount points - perhaps higher rates of failures
Matt
- NTR
Peter
- Questions about sites re-opening; lots of online teaching; re-opening will be cautious
Alessandra
- NTR
Sam
- NTR
Gareth
- NTR
Tim
- TPC: initially running on the wrong server; now on test (more allowed connections)
- RAL as source is fine, RAL as destination fails; two transfers try to access the same file
- When RAL is not the destination it is not the active party: transfers use pull mode, where the destination fetches from the source
JW
- NTR

- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
    
  - CPU 5m
    
    UK Cloud jobs over last week
    
  - Other new issues 5m
    
    As written last Friday (29 May), due to the squid version (4.11-2.1) problem, could you please restart your local site squids, if you have not done so already, to mitigate the job failures we are seeing with the latest squid version?
    
    We still have many site squids to restart, as seen in the plot (thanks Michal); the object counts drop upon restart:
    
    http://wlcg-squid-monitor.cern.ch/snmpstats/mrtgatlas2/weeklyObjects.html
    
- 10:20 → 10:40
  Ongoing issues 20m
  - LANCS DPM centos 7 upgrade 5m
    
    Jira: ATLDDMOPS-5527
  - CentOS7 - Sussex 5m
    
    Centos 7 deployment Twiki
  - Glasgow Ceph storage 5m
    
    ADCINFR-152: Glasgow Ceph
  - Grand Unified queues 5m
    
    ADCDPA-235
    
    Migration plans
- 10:40 → 10:50
  News round-table 10m
- 10:50 → 11:00
  
  AOB 10m