ATLAS UK Cloud Support
Vidyo
Outstanding tickets
-
147770 UKI-NORTHGRID-LANCS-HEP very urgent in progress 2020-07-08 14:29:00 UKI-NORTHGRID-LANCS-HEP: stage-out failures
- Disk server for DPM overloaded; work to be done on improvements
-
147744 UKI-LT2-QMUL urgent in progress 2020-07-08 11:59:00 Inaccessible files at UKI-LT2-QMUL_DATADISK
- Work ongoing; next version of StoRM should be better here
-
146651 RAL-LCG2 urgent in progress 2020-05-27 10:43:00 singularity and user NS setup at RAL
- JW to get ticket updated.
-
146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-06-24 16:18:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
- on hold
-
144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
- on hold
-
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- on hold
New issues
CPU
-
RAL
- Change to ATLAS sub-group quota; CMS took extra slots, ATLAS pledge down; now reverted and recovering slots
-
Northgrid
-
London
- QMUL: ATLAS job failures on certain node sets.
-
SouthGrid
-
Scotgrid
- Xrootd with different versions of voms-xroot to avoid both sets of problems
Other new issues
- Problem with HC / ATLAS this morning; mass HC-blacklisting.
- All sites should have been manually set back online.
- Storage downtimes
- Site admins are requested to declare downtime for ALL published access protocols at the site. If not all protocols are declared ‘stopped’, data access will be attempted through allowed protocols
- https://twiki.cern.ch/twiki/bin/view/AtlasComputing/SitesSetupAndConfiguration#Site_blacklisting
- OX Arc6 upgrade
- Discussion on current status and current issues
- Status: GLASGOW - CEPH
-
Work needed on gfal2 and/or xrootd-ceph plugin discussed; would also affect RAL
-
Update of current status; plans for gridFTP, xroot and redirection brought up
-
gridFTP works OK, but will be deprecated (for 12 months) (external)
- Can be good enough for production work for now
-
gridFTP is the external protocol
-
will aim to get the xrootd write-back
-
and caching is in development
-
If necessary, can consider setting internal access to use xrdcp (rather than rucio copy)
- QMUL:
- The job failures, I believe, are all related to the error “generate got a SIGKILL signal (exit code 137)”, which I think is an out-of-memory error resulting in the OS killing the job
- it is always arcproxy that gets killed
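Note on exit code 137: by shell convention an exit status above 128 means the process was terminated by signal (status - 128), so 137 corresponds to signal 9 (SIGKILL). A minimal sketch of that decoding follows; the helper name describe_exit_code is hypothetical and not part of any ATLAS or ARC tooling. The exit code alone does not show that the kill came from the out-of-memory killer; that would need confirming from the kernel log on the affected nodes.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a shell-style exit status (hypothetical helper, illustration only).

    Shells report 128 + N when a process is terminated by signal N.
    """
    if code > 128:
        signum = code - 128
        try:
            # Map the signal number to its symbolic name on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


# 137 - 128 = 9, which is SIGKILL on POSIX systems.
print(describe_exit_code(137))
```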
Ongoing issues
- CentOS7 - Sussex
- on-hold
- Grand Unified queues
- on-hold
News round-table
(NTR)
-
Vip
- Discussion on arc-ce6 and possibilities of mapping-errors; followed-up on TB-support
-
Dan
- NTR
-
Matt
- ARC-6 being updated
-
Peter
- Will update AGIS to get the LANCS test queue targeting the CE
-
Alessandra
- NTR
-
Sam
- NTR
-
Gareth
- NTR
-
Tim
- NTR
-
JW
- Working on HTTP TPC with small updates to work around the current configuration setup