● Outstanding tickets
- 150308 UKI-SCOTGRID-DURHAM less urgent in progress 2021-01-22 22:21:00 Jobs at UKI-SCOTGRID-DURHAM_SL7_UCORE fail with “Server error: no such file or directory”
- 150304 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2021-01-25 17:24:00 MWT2: low transfer efficiency as dest. from UKI-NORTHGRID-LANCS-HEP
- Script finished; 2/3 files on server are corrupted …
- Prepare a list of the 400k files; aim to simply declare them as lost.
- Can ZFS report bad files, e.g. via scrubbing? First pass is metadata consistency; second pass to do checksum-level checking.
- A simple script to probe each file might be the best approach.
- Recovery of data highly unlikely.
- Matt to investigate other servers once this list is prepared.
- 149842 UKI-SCOTGRID-ECDF very urgent waiting for reply 2021-01-26 14:29:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
- Similar issue with lost files. Needs DDM support to declare files as lost; JW to follow up.
- 149362 UKI-SOUTHGRID-RALPP urgent in progress 2021-01-28 09:45:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
- Need to check the ansible scripts, to see if something is affecting them.
- 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2021-01-23 06:49:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
- Largely a placeholder ticket to follow / track Decommissioning; Sam / JW to update the associated Jira to probe status.
- 146651 RAL-LCG2 urgent on hold 2021-01-19 10:05:00 singularity and user NS setup at RAL
- Increasing requests from other VOs on status of upgrade.
- 142329 UKI-SOUTHGRID-SUSX top priority on hold 2021-01-20 20:29:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- On hold pending changes to the immediate situation.
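For the Lancs ticket (150304), the "simple script to probe each file" idea could look roughly like the sketch below. It is only an illustration under stated assumptions, not the script that was actually run: the function names `adler32_of` and `probe_tree`, the `expected` catalogue mapping, and the choice of Adler-32 as the checksum are all hypothetical. It follows the two passes noted above: a cheap metadata check first, then a full read with a checksum comparison.

```python
import os
import zlib
from pathlib import Path

CHUNK = 1024 * 1024  # read in 1 MiB chunks to keep memory use flat


def adler32_of(path):
    """Read the whole file and return its Adler-32 as 8 hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(CHUNK)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def probe_tree(root, expected=None):
    """Return {relative_path: reason} for every file that fails a check.

    `expected` is a hypothetical catalogue dump of the form
    {relative_path: (size_in_bytes, adler32_hex)}; files without an
    entry are only tested for readability.
    """
    expected = expected or {}
    bad = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = Path(dirpath) / name
            rel = path.relative_to(root).as_posix()
            want = expected.get(rel)
            # Pass 1: metadata only (cheap, no data blocks read).
            try:
                size = path.stat().st_size
            except OSError as err:
                bad[rel] = f"stat failed: {err}"
                continue
            if want and size != want[0]:
                bad[rel] = f"size {size} != expected {want[0]}"
                continue
            # Pass 2: read every byte; I/O errors surface here.
            try:
                checksum = adler32_of(path)
            except OSError as err:
                bad[rel] = f"read failed: {err}"
                continue
            if want and checksum != want[1]:
                bad[rel] = f"adler32 {checksum} != expected {want[1]}"
    return bad
```

The keys of the returned dictionary would be the candidate list of files to declare as lost; the reasons give a rough split between unreadable files and files that read cleanly but no longer match the catalogue.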
● CPU
- RAL
- Recovery from CE problem at start of week. Some loss of jobs during the week - not well understood.
- Northgrid
- Lancs: File system related issues.
- London
- QMUL: Main college link down; on backup.
- (Update: Set to offline on Friday until QMUL IT decide network is stable)
- SouthGrid
- RALPP back online.
- BHAM issues (external reverse lookup on grid subnet switched off by IT).
- CAM: also offline (but seems clearly unrelated to the BHAM issue).
- Scotgrid
- Possible brokerage issues seen; to follow up by identifying how jobs are (not) assigned to Glasgow.
- ECDF and Durham: affected by file-related issues (followed up in GGUS).
● Ongoing Items
- CentOS7 - Sussex
- TPC with http
- Storageless Site test / storage decommissioning (Oxford)
- Gateway to point at: JW
- JW: test gateway
- ECDF volatile storage
- JW -> prod DDM experts on how they wish to commission the RSEs
- Glasgow DPM Decommissioning
- ATLAS: Site Availability/Reliability reports: Glasgow
- JW to try and move / access a timeline
● News round-table
- Vip
- Dan
- Network problems anticipated.
- Matt
- Peter
- Sam
- Gareth
- JW
- Rob