Outstanding tickets
- 150048 UKI-LT2-QMUL less urgent in progress 2021-01-12 17:28:00 Transfers and deletion at UKI-LT2-QMUL fails with “Connection reset by peer”
- TRIUMF is very far away; no perfSONAR to see exactly what’s happening.
- Different IP address spaces between SEs might be contributing?
- Maybe related to a full link?
- Additional comments from Duncan in Round Table.
- 149842 UKI-SCOTGRID-ECDF very urgent in progress 2021-01-12 17:59:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
- Low deletion efficiency (many initial deletion requests).
- JW - To test a few files to check for data inconsistency.
- 149362 UKI-SOUTHGRID-RALPP urgent in progress 2021-01-05 12:35:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
- Important to find out exactly where in the chain it is failing.
- Job shows executing status in the logs; it is evicted 2 s later.
- Condor history -> check for X’s (removed), not C’s (completed).
- 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2021-01-07 09:49:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
- JW - Add Ceph localgroup disk as a proper resource in CRIC connected to site
- Local users to consider Ceph as the primary storage.
- 146651 RAL-LCG2 urgent on hold 2020-12-18 10:48:00 singularity and user NS setup at RAL
- Remains stuck behind updates further down the stack.
- 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-12-18 09:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- 3 WNs in. Network switch remains a problem; aiming for solution by end of the month.
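For ticket 149362, the "check for X’s not C’s" step on the Condor history could be sketched roughly as below. This is only an illustration, not the procedure used at the site: it assumes the history has been dumped as three whitespace-separated fields per job (ClusterId, ProcId, JobStatus), for example with something like `condor_history -af ClusterId ProcId JobStatus`, and the function name and sample data are hypothetical.

```python
from collections import Counter

# HTCondor JobStatus codes and the letters shown in the ST column:
# 1 = Idle (I), 2 = Running (R), 3 = Removed (X), 4 = Completed (C), 5 = Held (H).
STATUS_LETTERS = {1: "I", 2: "R", 3: "X", 4: "C", 5: "H"}


def tally_job_status(text):
    """Count jobs per status letter.

    Assumes each line is "<ClusterId> <ProcId> <JobStatus>"; lines that do
    not match this shape are skipped. Unknown codes are counted under "?".
    """
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or not fields[2].isdigit():
            continue
        counts[STATUS_LETTERS.get(int(fields[2]), "?")] += 1
    return counts


# Hypothetical sample data, for illustration only.
sample = """\
1001 0 4
1002 0 3
1003 0 3
1004 0 4
1005 0 3
"""

counts = tally_job_status(sample)
print("removed (X):", counts["X"], "completed (C):", counts["C"])
```

A high share of X’s relative to C’s would point at jobs being removed rather than finishing, which fits the eviction seen in the logs.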
CPU
RAL
- OK; recent additional slots from CMS (which is now recovering).
Northgrid
- LANCS; gridFTP got stuck yesterday and needed restarting; jobs recovered.
London
- QMUL; largely recovered; work ongoing.
SouthGrid
- OX; jobs drained overnight; no conclusive reason found; possibilities are: assigned jobs requiring input from tape; potential bug in FTS with transfers.
Scotgrid
- Glasgow; inefficient transfers, i.e. a very high rate of input of very small files. Maxed out gridFTP connections.
Other new issues
Ongoing issues
CentOS7 - Sussex
- 3 nodes currently in; continue to have issues with network switch infrastructure.
TPC with http
- To follow up with Sam regarding http for FT in ATLAS on Glasgow test endpoint.
- Some issues in the past with rate limiting, and protections from ingenious users.
- From Gareth; a discussion with ATLAS on write-backs and data access patterns with respect to Ceph would be useful.
- Alessandra keen to move internal LAN transfers away from gridFTP.
- Test queue at Glasgow, currently pointing to old XrdCeph; Sam to update c02 to latest version.
Storageless Site test / storage decommissioning (Oxford)
- Oxford Jira for decommissioning now set up.
- Will need to wait for Glasgow decommissioning to complete.
-
ECDF volatile storage
- JW to start actions from the Jira.
Glasgow DPM Decommissioning
- Ongoing; final part most difficult due to the problems of last year
ATLAS: Site Availability/Reliability reports: Glasgow
- Push for VOFeed to CRIC; expected timescale being sought.
News round-table
- Vip
- Dan
- Needed to go to the next meeting. Plans: install 2nd gridFTP node; update WNs to latest OS/Lustre/Slurm on drained nodes; maintain stability otherwise.
- Matt
- Disk servers continuing to need attention (e.g. weighting issues).
- Peter
- Sam
- Gareth
- JW
- Duncan
- QMUL -> TRIUMF; transfers were OK from 16:00 to 02:00.
- Routes: until 4pm yesterday UK routes via London, then via Geant/Amsterdam.
- Traceroute data: Failing via London; Running via Amsterdam;
- Could it be IPv6 / routing / QMUL config related?
- perfSONAR would certainly help identify the cause in these cases.
- Patrick
- Rob
AOB