ATLAS UK Cloud Support

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 147841 UKI-SCOTGRID-GLASGOW less urgent waiting for reply 2020-07-14 14:27:00 UKI-SCOTGRID-GLASGOW: deletion problems

    • Problem with these deletions resolved. A final set of work remains to solve the underlying problem with the files in the namespace
  • 147792 UKI-NORTHGRID-MAN-HEP less urgent in progress 2020-07-13 08:41:00 UKI-NORTHGRID-MAN-HEP deletion errors with message: DavPosix::unlink Authentication error

    • Files likely to be declared lost; to follow up on the ticket.
  • 147770 UKI-NORTHGRID-LANCS-HEP very urgent in progress 2020-07-13 09:11:00 UKI-NORTHGRID-LANCS-HEP: stage-out failures

    • Still seeing periods of overloaded disk servers; ZFS tuning is planned later in the week to help ease the disk server problems
    • O(3 PB) of new servers with empty space; the system preferentially writes to the empty machines and hence overloads them; similar issues seen at GLA and SHEF
    • Discussion on storage solutions and the interaction between experiments and sites followed; some notes:
      • Infrastructure and maintenance are difficult to fund
      • Experiments change their minds based on current needs … planning for the wrong eventualities
    • Sam: a number of hurdles for new technologies, and assumptions that established code will just work
      • Experience with setting up a DPM xrootd cache: works well
      • A cache can be used to ‘change type’, e.g. from direct I/O to locally staged access
      • GLA: a proxy cache in front of DPM for local users, set up with existing hardware (a rough access sketch follows after the ticket list)
  • 146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-07-14 10:00:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”

    • Restarting services in DPM; in progress.
  • 146651 RAL-LCG2 urgent in progress 2020-07-14 09:27:00 singularity and user NS setup at RAL

    • Ticket was updated; largely on hold until other priority upgrades are rolled out.
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-06-24 16:18:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • on hold
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • on hold
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • Access to the site to re-rack the machines is being planned; this requires agreement from the University for non-emergency access.
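
A rough illustration of the proxy-cache access pattern discussed under ticket 147770 (an assumption about how such a setup is used, not the actual GLA configuration): local users read through a cache host rather than the DPM head node directly, so repeated reads are served from the cache's local disk. The hostnames and file path below are placeholders, and the sketch assumes the XRootD Python bindings are installed.

    # Illustrative sketch only: read a file via a local xrootd proxy cache
    # instead of contacting the DPM head node directly.
    from XRootD import client
    from XRootD.client.flags import OpenFlags

    # Placeholder endpoint: swap in the real cache hostname and file path.
    CACHE_URL = "root://xcache.example.ac.uk:1094//dpm/example.ac.uk/home/atlas/some/file"

    def read_first_block(url, nbytes=1024 * 1024):
        """Open the file read-only and return its first nbytes."""
        f = client.File()
        status, _ = f.open(url, OpenFlags.READ)
        if not status.ok:
            raise RuntimeError(status.message)
        status, data = f.read(offset=0, size=nbytes)
        f.close()
        return data

    # Local users point at the cache; the cache fetches from DPM on a miss.
    print(len(read_first_block(CACHE_URL)), "bytes read via the proxy cache")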

CPU

  • RAL

    • Largely recovered from the quota change and some network issues
  • Northgrid

    • Lancs: no jobs for the last couple of days
  • London

  • SouthGrid

  • Scotgrid

    • ECDF: few jobs
    • GLA down to 50% of nominal capacity:
      • O(600) cores offline due to air-conditioning (AC) problems and the associated downtime
      • Ceph queue issues for xrootd transfers:
        • Rebuild the xrootd plugins using RAL's updates
        • Sam to update the Jira ticket
        • If not resolved by Tuesday, fall back to GridFTP

Other new issues

  • New site monitoring MONIT page
  • Lancaster
    • Jobs failing; they appear to hit memory limits, especially (and unusually?) during voms-proxy-init
    • An example multicore job that finished successfully: https://bigpanda.cern.ch/job?pandaid=4788609755 - a clue, since multicore jobs carry an effective 8x multiplier on their memory request, and it supports James' theory that the jobs cannot obtain enough memory
    • Nothing obvious in the queue's AGIS entry (http://atlas-agis.cern.ch/agis/pandaqueue/detail/UKI-NORTHGRID-LANCS-HEP/full/), but the pilot JDLs request memory=1500 (e.g. https://aipanda157.cern.ch/condor_logs_2/20-07-14_11/grid.1305756.1.jdl), whereas a Glasgow JDL requests 2000 (https://aipanda157.cern.ch/condor_logs_2/20-07-16_07/grid.1325252.2.jdl); it is not clear where this is set, but Lancs would like at least 3000 memory units (a comparison sketch follows after this list)
    • Changed the configuration to allow for softer memory limits; needs to be picked up by Harvester
    • Noted that ARC runtime environment scripts for modifying the job environment are useful
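
As a quick way to reproduce the JDL comparison above, the sketch below pulls the two pilot JDLs quoted in the minutes and prints their memory settings. This is an illustration only: it assumes the aipanda log URLs are still reachable and readable without authentication, which may no longer be the case.

    # Hedged sketch: compare the 'memory' setting in the pilot JDLs quoted above.
    import re
    import urllib.request

    JDLS = {
        "UKI-NORTHGRID-LANCS-HEP": "https://aipanda157.cern.ch/condor_logs_2/20-07-14_11/grid.1305756.1.jdl",
        "UKI-SCOTGRID-GLASGOW": "https://aipanda157.cern.ch/condor_logs_2/20-07-16_07/grid.1325252.2.jdl",
    }

    for site, url in JDLS.items():
        with urllib.request.urlopen(url) as resp:
            jdl = resp.read().decode("utf-8", errors="replace")
        match = re.search(r"memory\s*=\s*(\d+)", jdl, re.IGNORECASE)
        print(site, match.group(1) if match else "no memory setting found")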

Ongoing issues

  • CentOS7 - Sussex

    • Awaiting access
  • Grand Unified queues

    • Awaiting Sheffield

News round-table

  • Vip

    • OX working OK; problems with the Condor upgrade to 8.8.X (not working for ATLAS); went back to 8.6
    • Following up on the question of Sheffield
  • Matt

    • NTR
  • Peter

    • Bad disk server; how to identify the device name?
      • Matt and Sam to dig out the recipes (sent via the mailing list); a rough sketch of one approach follows after this round-table section
  • Sam

    • NTR
  • Gareth

    • Cores offline from AC issues; access restrictions make things challenging
  • JW

    • Work on TPC ongoing
  • Patrick

    • Access to DC still needed
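
Pending the recipes Matt and Sam will circulate, below is a minimal sketch of one common approach (an assumption, not their recipe): match the serial number reported by a SMART or RAID-controller alert against lsblk's JSON output on the disk server to recover the Linux device name.

    # Minimal sketch (not the circulated recipe): map a drive serial number,
    # e.g. taken from a SMART alert, to its device name via lsblk (util-linux).
    import json
    import subprocess
    import sys

    def device_for_serial(serial):
        out = subprocess.run(
            ["lsblk", "--json", "--nodeps", "-o", "NAME,SERIAL,MODEL,SIZE"],
            check=True, capture_output=True, text=True,
        ).stdout
        for dev in json.loads(out)["blockdevices"]:
            if (dev.get("serial") or "").strip() == serial:
                return "/dev/" + dev["name"]
        return None

    if __name__ == "__main__":
        print(device_for_serial(sys.argv[1]) or "serial not found")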

 

 
