ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2021-01-14T10:00:00+00:00
End: 2021-01-14T11:00:00+00:00
Location: Zoom

Thursday 14 Jan 2021, 10:00 → 11:00 Europe/London

Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Description

https://cern.zoom.us/j/98434450232

Password protected (same as (new) OPs Mtg)

Videoconference

ATLAS UK Cloud Support

Zoom Meeting ID: 98434450232
Host: James William Walder

Outstanding tickets

150048 UKI-LT2-QMUL less urgent in progress 2021-01-12 17:28:00 Transfers and deletion at UKI-LT2-QMUL fails with “Connection reset by peer”
- TRIUMF very far away; no perfSONAR to see exactly what’s happening.
  - Different IP address space between SEs might be contributing?
  - Maybe related to a full link?
- Additional comments from Duncan in Round Table.
149842 UKI-SCOTGRID-ECDF very urgent in progress 2021-01-12 17:59:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
- Low deletion efficiency (many initial deletion requests).
- JW - To test a few files to check for data inconsistency.
149362 UKI-SOUTHGRID-RALPP urgent in progress 2021-01-05 12:35:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
- Important to find out exactly where in the chain it is failing.
- Logs show the job in executing status; it is evicted 2s later.
- Condor history -> check for X’s (removed jobs), not C’s (completed jobs).
148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2021-01-07 09:49:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
- JW - Add Ceph localgroup disk as a proper resource in CRIC connected to site
- Local users to consider Ceph as the primary storage.
146651 RAL-LCG2 urgent on hold 2020-12-18 10:48:00 singularity and user NS setup at RAL
- Remains stuck behind updates further down the stack.
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-12-18 09:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- 3 WNs in. Network switch remains a problem; aiming for solution by end of the month.
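
For the RALPP CE failures (ticket 149362), the Condor history check noted above (looking for X rather than C) could be sketched as follows. This is a minimal sketch, assuming the history has been dumped with something like `condor_history -af ClusterId ProcId JobStatus`, and relying on the numeric HTCondor JobStatus codes (3 = removed, shown as X; 4 = completed, shown as C). The function name and the sample lines are hypothetical and are not taken from the site's logs.

```python
from collections import Counter

# HTCondor JobStatus codes mapped to the one-letter form used in history listings.
STATUS_LETTER = {1: "I", 2: "R", 3: "X", 4: "C", 5: "H"}


def count_final_states(history_lines):
    """Count jobs per status letter from 'ClusterId ProcId JobStatus' lines."""
    counts = Counter()
    for line in history_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        try:
            status = int(fields[2])
        except ValueError:
            continue  # e.g. 'undefined' when the attribute is missing
        counts[STATUS_LETTER.get(status, "?")] += 1
    return counts


# Hypothetical sample, for illustration only.
sample = ["101 0 4", "102 0 3", "103 0 3", "", "104 0 undefined"]
counts = count_final_states(sample)
print(f"removed (X): {counts['X']}, completed (C): {counts['C']}")
```

A high X count relative to C would point towards jobs being removed after starting, which matches the eviction seen in the logs.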

CPU

RAL
- Ok, recent additional slots from CMS (which is now recovering)
Northgrid
- LANCS; gridFTP stuck yesterday, needed restarting; jobs recovered.
London
- QMUL; largely recovered; work ongoing.
SouthGrid
- OX; jobs drained overnight; no conclusive reason found; possibilities are: assigned jobs requiring input from tape; potential bug in FTS with transfers.
Scotgrid
- Glasgow; inefficient transfers, i.e. very high rate of input of very small files. Maxed out gridFTP connections.

Other new issues

Ongoing issues

CentOS7 - Sussex
- 3 nodes currently in; continue to have issues with network switch infrastructure.
TPC with http
- To follow up with Sam regarding http for FT in ATLAS on Glasgow test endpoint.
- Some issues in the past with rate limiting, and protections from ingenious users.
- From Gareth: a discussion with ATLAS on write-backs and data access patterns with respect to Ceph would be useful.
- Alessandra keen to move internal LAN transfers away from gridFTP.
- Test queue at Glasgow, currently pointing to old XrdCeph; Sam to update c02 to latest version.
Storageless Site test / storage decommissioning (Oxford)
- Oxford Jira for decommissioning now set up.
- Will need to wait for Glasgow decommissioning to complete.
ECDF volatile storage
- JW to start actions from the Jira.
Glasgow DPM Decommissioning
- Ongoing; final part most difficult due to the problems of last year
ATLAS: Site Availability/Reliability reports: Glasgow
- Push for VOFeed to CRIC; expected timescale being sought.

News round-table

Vip
- Needed to leave
Dan
- Needed to go to the next meeting. Plans: install 2nd gridFTP node; update WNs to latest OS/Lustre/Slurm on drained nodes; maintain stability otherwise.
Matt
- Disk servers continuing to need attention (e.g. weighting issues).
Peter
- Had to leave
Sam
- NTR
Gareth
- Q/R needed
JW
- NTR
Duncan
- QMUL -> TRIUMF; transfers were OK 16:00 -> 02:00.
- Routes: until 4pm yesterday UK routes via London, then via GEANT/Amsterdam.
- Traceroute data: failing via London; running via Amsterdam.
- Could it be IPv6 / routing / QMUL config related?
- perfSONAR would certainly help diagnose these cases.
Patrick
- NTR
Rob
- NTR
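
Duncan's comparison of the failing route (via London) with the working one (via GEANT/Amsterdam) amounts to finding where two traceroutes diverge. A minimal sketch, assuming each traceroute has already been reduced to an ordered list of hop names; the function name and the hop names below are hypothetical placeholders, not real traceroute data from QMUL or TRIUMF.

```python
from itertools import zip_longest


def first_divergence(route_a, route_b):
    """Return (index, hop_a, hop_b) for the first differing hop, or None if identical."""
    for i, (a, b) in enumerate(zip_longest(route_a, route_b)):
        if a != b:
            return i, a, b
    return None


# Hypothetical hop lists, for illustration only.
failing = ["qmul-gw", "janet-core", "london-exchange", "transatlantic-a", "triumf-gw"]
working = ["qmul-gw", "janet-core", "geant-amsterdam", "transatlantic-b", "triumf-gw"]

result = first_divergence(failing, working)
print(result)
```

The first differing hop is a natural starting point when asking whether the failures are routing related, although it does not by itself say which link is at fault.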

AOB


- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
    
  - CPU 5m
    
    New link for the site-oriented dashboard
    
    Monit: Site-oriented dashboard
    
    UK Cloud jobs over last week
    
  - Other new issues / tasks 5m
- 10:20 → 10:40
  Ongoing Items 20m
  
  - CentOS7 - Sussex 5m
    
    Centos 7 deployment Twiki
  - TPC with http 5m
  - Storageless Site test / storage decommissioning (Oxford) 5m
    
    ADCINFR-185
    
    ADCINFR-194
  - ECDF volatile storage 5m
    
    ADCINFR-184
  - Glasgow DPM Decommissioning 20m
    
    LOCALGROUPDISK and DATADISK decommissioning
    
    ADCINFR-152
    
    Storage permissions
  - ATLAS: Site Availability/Reliability reports: Glasgow 5m
    
    ADCMONITOR-491
    
    Cric VOfeed
    
    SNOW:RQF1708406
- 10:40 → 10:50
  News round-table 10m
  
- 10:50 → 11:00
  
  AOB 10m
