Outstanding tickets
- 149095 UKI-SOUTHGRID-OX-HEP less urgent in progress 2020-10-17 07:37:00 UKI-SOUTHGRID-OX-HEP: unstable transfer
- Should be fixed; JW - to check and close
- 148968 UKI-NORTHGRID-LANCS-HEP less urgent waiting for reply 2020-10-21 13:49:00 UKI-NORTHGRID-LANCS-HEP: deletion and transfer failures
- Should now be fine; JW - to check and close
- 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-10-19 13:51:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
- Power fluctuations (cable strikes in and off campus)
- Odd networking state for some racks; rebooting appears to have cleared this
- Looking ok? JW to check and close if so.
- Other cvmfs issues also seem to be resolved
- 146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
- 146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-09-11 13:35:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
- 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
- 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
CPU
-
RAL
- Firewall upgrade at RAL (morning of 22 Oct) killed most jobs; now recovering
-
-
Northgrid
- LANCS now recovering on slots. Unclear why jobs still ran with missing files when declared temporarily unavailable.
-
London
- Small issue at QMUL, resolved quickly
-
SouthGrid
-
Scotgrid
- Durham: some unexpected downtime. One disk server identified with lost ATLAS data; a list of the affected files is in preparation.
Other new issues
Ongoing issues
- CentOS7 - Sussex
- Grand Unified queues
- TPC via http
- Ceph: fix available for testing for alignment issues in EC pool in xrootd
- Appears to be working at RAL; although still some failure modes observed
News round-table
- Vip
- Approx 1/3 of Cs to be drained from Saturday, for work on Tuesday.
- Dan
- Weekend memory problem (StoRM / Argus); requires a restart. Ticket open with the StoRM devs.
- Aiming to improve the automation of restarts
- Matt
- Rebuilding of servers ongoing.
- Peter
- Sam
- AOD -> DAOD jobs still show up as a source of failures (JW to also follow this up).
- JW
- NTR (see tpc-http info above).
- Rob
- Will have a discussion with ATLAS experts regarding QoS developments with ECDF storage
AOB