150308 UKI-SCOTGRID-DURHAM less urgent in progress 2021-01-22 22:21:00 Jobs at UKI-SCOTGRID-DURHAM_SL7_UCORE fail with “Server error: no such file or directory”
150304 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2021-01-25 17:24:00 MWT2: low transfer efficiency as dest. from UKI-NORTHGRID-LANCS-HEP
Script finished; 2/3 files on server are corrupted …
Prepare a list of the 400k files; aim to just declare them as lost.
Can ZFS report bad files, e.g. via scrubbing? First pass is metadata consistency; second pass does checksum-level checking.
Simple script to probe each file might be best approach.
Recovery of data highly unlikely.
Matt to investigate other servers once this list is prepared.
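A minimal sketch of the "simple script to probe each file" idea, in Python. It assumes a catalogue dump mapping each local file path to its expected Adler-32 checksum is available; the catalogue format and the function names below are hypothetical and not taken from the site's actual tooling. Each file is read in full, so a read error or a checksum mismatch marks it as a candidate for the lost-files list.

```python
import zlib
from pathlib import Path


def adler32_of(path, chunk_size=1 << 20):
    """Return the Adler-32 of a file as an 8-digit lowercase hex string."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        while chunk := handle.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def probe(path, expected):
    """Classify one file as 'ok', 'missing', 'unreadable' or 'corrupt'."""
    path = Path(path)
    if not path.is_file():
        return "missing"
    try:
        actual = adler32_of(path)
    except OSError:
        # I/O errors from the underlying file system end up here.
        return "unreadable"
    return "ok" if actual == expected.lower() else "corrupt"


def find_bad_files(catalogue):
    """catalogue: dict of path -> expected checksum (hypothetical format).

    Returns a dict of path -> status for every file that is not 'ok'.
    """
    bad = {}
    for path, expected in catalogue.items():
        status = probe(path, expected)
        if status != "ok":
            bad[str(path)] = status
    return bad
```

The keys of the returned dict could then be written out one per line as the input for declaring files lost; how that list is submitted is left to DDM support.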
149842 UKI-SCOTGRID-ECDF very urgent waiting for reply 2021-01-26 14:29:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
Similar issue with lost files. Needs DDM support to declare files as lost; JW to follow up.
149362 UKI-SOUTHGRID-RALPP urgent in progress 2021-01-28 09:45:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
Need to check the ansible scripts, to see if something is affecting them.
148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2021-01-23 06:49:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
Largely a placeholder ticket to follow / track Decommissioning; Sam / JW to update the associated Jira to probe status.
146651 RAL-LCG2 urgent on hold 2021-01-19 10:05:00 singularity and user NS setup at RAL
Increasing requests from other VOs on status of upgrade.
142329 UKI-SOUTHGRID-SUSX top priority on hold 2021-01-20 20:29:00 CentOS7 migration UKI-SOUTHGRID-SUSX
On hold pending changes to the immediate situations
● CPU
RAL
Recovery from CE problem at start of week. Some loss of jobs during the week - not well understood.
Northgrid
Lancs: File system related issues.
London
QMUL: Main college link down; on backup.
(Update: Set to offline on Friday until QMUL IT decide network is stable)
SouthGrid
RALPP: back online.
BHAM: issues (external reverse lookup on grid subnet switched off by IT).
CAM: also offline (but it seems clear this is unrelated to the BHAM issue).
Scotgrid
Possible brokerage issues seen; to follow up by identifying how jobs are (not) assigned to Glasgow.
ECDF and Durham: effects of file-related issues (followed up in GGUS).
● Ongoing Items
CentOS7 - Sussex
NTR
TPC with http
NTR
Storageless site test / storage decommissioning (Oxford)
Gateway to point at: JW
JW: test gateway
ECDF volatile storage
JW -> prod DDM experts on how they wish to commission the RSEs
Glasgow DPM Decommissioning
Sam to prod ticket.
ATLAS: Site Availability/Reliability reports: Glasgow