ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

https://cern.zoom.us/j/98434450232

Password protected (same as OPs Mtg, but repeated)


Outstanding tickets

  • 149842 UKI-SCOTGRID-ECDF less urgent in progress 2021-01-03 13:08:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
    • Manual blacklisting of WAN transfers was set over the New Year period;
      • A test of whitelisting shows no improvement since the new year.
      • Needs input from the site to understand the situation.
  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2021-01-05 12:35:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • Ongoing; Peter to look at it from the apfmon side.
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2021-01-04 12:01:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • DPM transfer failures; Sam to check whether they come from old files that should be cleared from the namespace.
    • Ceph still running well, but the current job mix is stressing the caching:
      • On Ceph, the internal xrootd cache is filling up due to intensive sets of jobs.
      • Purging old files from the xrootd cache to understand this better; it may be that all files in the cache are in active use.
    • Still need to move the final compute capacity; this requires on-site work.
  • 146651 RAL-LCG2 urgent on hold 2020-12-18 10:48:00 singularity and user NS setup at RAL
    • Still awaiting updates to the underlying software stack; no date given by Grid Services.
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-12-18 09:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • Matt: some discussions on certificates held with the site; will try to get a recent update.

CPU

  • RAL
  • Failure of one CE over Christmas resulted in drained slots. Other VOs were running fine, so it took ~2 days to fully reclaim the slots.
    • Gareth mentioned that one CE might be primary; if it fails, all jobs drop.
    • It is not obvious in AGIS/CRIC whether one is set as primary.
  • Northgrid
  • LANCS: fairshare issues; other VOs are taking a number of slots.
    • Might need to reduce the time window SGE uses to allocate slots; there is also a drop-off parameter to change the weighting of older jobs.
  • London
  • QMUL: reinstalled the batch system; it is crashing every 10 minutes; possibly related to the (new) ARC accounting?
    • The legacy accounting had issues with processing into the SQLite database.
    • A major upgrade to the DC is planned for early in the year, but the timing is still to be finalised.
  • SouthGrid
  • Scotgrid
    • A 10 Gb/s-to-1 Gb/s auto-negotiation issue in the networking caused a drop in jobs over the new year.

Other new issues

Ongoing issues

  • CentOS7 - Sussex

    • No update
  • TPC with http

    • No update
  • Storageless Site tests (Oxford)

    • Progress reported in the Storage meeting; the new ARC is almost ready.
    • ATLAS has stopped writing new data to Oxford.
    • Hope to keep LOCALGROUPDISK at Oxford; needs confirmation. An update to SL7 is needed.
    • Sheffield (input from Duncan):
      • perfSONAR looks OK, but IPv6 only.
      • Use of the NAT might be the bottleneck and may not scale up to full production/analysis loads.
      • Steve’s tests are useful; needs some work from Sheffield to update a few tests.
      • Could perhaps make nodes dual stack?
  • ECDF volatile storage

    • No update; JW to work on the actions in the Jira ticket; other issues at the site have higher priority.
  • Glasgow DPM Decommissioning

    • To check final deletions.
    • LOCALGROUPDISK naming is now settled; will give the green light once the space is increased.

News round-table

  • Dan
    • NTR (left before end)
  • Matt
    • NTR (left before end)
  • Peter
    • NTR
  • Sam
    • NTR
  • Gareth
    • Q/R’s needed by month-end
  • JW
    • NTR
  • Duncan
    • Confirmed that Sheffield has no storage set up.
    • Discussion of IO demands, e.g. 5k cores (Glasgow) × 0.5 MB/s/core, for future UK and Glasgow requirements.

AOB

Timetable

  • 10:00-10:20 Status (20m)
    • Outstanding tickets (10m)
    • CPU (5m; new link for the site-oriented dashboard)
    • Other new issues / tasks (5m)
  • 10:20-10:40 Ongoing Items (20m)
  • 10:40-10:50 News round-table (10m)
  • 10:50-11:00 AOB (10m)