ATLAS UK Cloud Support



Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Weekly ATLAS UK Cloud Support Meeting

Outstanding tickets

  • 147194 UKI-LT2-RHUL less urgent in progress 2020-05-28 UKI-LT2-RHUL: deletion errors
    • A couple of problems with disabled storage. Currently upgrading our storage to CentOS7. These problems should be fixed now.
  • 147189 UKI-NORTHGRID-MAN-HEP less urgent in progress 2020-05-28 UKI-NORTHGRID-MAN-HEP deletion errors …
    • Services restarted, looking better; with Puppet off, needs manual intervention; was a cron issue (needs DPM version 1.14)
  • 147082 UKI-NORTHGRID-MAN-HEP urgent waiting for reply 2020-05-21 File not accessible at …
    • Race condition; waiting for 1.14 to fix issue
    • Problem resolved; no response from DAST; to close
  • 146918 UKI-SCOTGRID-ECDF less urgent in progress 2020-05-19 Failovers from jobs running at …
    • No new update
  • 146771 UKI-SCOTGRID-ECDF less urgent on hold 2020-05-21 UKI-SCOTGRID-ECDF deletion failures …
    • Site needs to disable IPv6; if needed, JW to respond.
  • 146651 RAL-LCG2 urgent involved in progress 2020-05-27 singularity and user NS setup at RAL
    • Roll back the change to max_user_namespaces, as it has negatively impacted LHCb; will pursue enabling unprivileged Singularity instead.
  • 146525 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-05-15 UKI-NORTHGRID-SHEF-HEP: evicted jobs
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-05-15 ATLAS pilot jobs idle on …
    • GR - reply in TB support
  • 146159 UKI-SCOTGRID-GLASGOW very urgent in progress 2020-05-19 Unaccessible files at …
    • Priority to go into production; read_lan needs to be fixed in ATLAS
    • xrootd shows instability in external connection (latest version 4.12.1 running, as plugin)
    • Why is RAL not affected (4.11), if the problem is with the ceph-xrootd plugin?
    • DPM: eventually stop the queues, and physically move useful nodes, as and when restrictions allow, to be added to Ceph.
    • See if this ticket can be closed, and follow up in decommissioning Jira
  • 145688 UKI-NORTHGRID-MAN-HEP less urgent on hold 2020-04-02 Very old version of squids at …
    • On hold
  • 145510 RAL-LCG2 urgent on hold 2020-05-13 RAL-LCG2: timeouts on stage-in/outs
    • On hold, moving to close
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-02-17 High traffic from UKI-SCOTGRID-GLASGOW …
    • Complex arrangement between data centre and older systems.
    • Final solution only likely with move to DC, new hardware
    • Initial problem seems to be solved, but keep ticket open to follow progress
  • 142329 UKI-SOUTHGRID-SUSX top priority reopened 2020-05-24 CentOS7 migration UKI-SOUTHGRID-SUSX
    • Test jobs running; ask to enable all nodes now


Major DB outage at CERN; knock-on effect on Frontier launchpads; CERN Frontier still an issue in the morning; in recovery.
HC tests set many sites offline.

  • RAL
    • Config issue over the weekend; fixed on the Monday

  • Northgrid

  • London

  • SouthGrid

  • Scotgrid

Other new issues

ECDF to 8-core jobs

Ongoing issues

  • LANCS DPM CentOS 7 upgrade

    • Circa June 24th; no extraordinary actions need be taken prior to move
  • CentOS7 - Sussex

  • Glasgow Ceph storage

    • Non-DC cores; reduction in capacity
  • Grand Unified queues

News round-table

  • Vip
    • OX is set offline; to follow up with HC
  • Dan
    • NTR
  • Matt
    • NTR
  • Peter
    • NTR
  • Alessandra
    • NTR
  • Sam
    • NTR
  • Gareth
    • NTR
  • Tim
    • NTR
  • JW
    • NTR


TA -> AF: TPC on smoke test; RAL has gone into stress test as destination; missing source, which indicates a problem.
GR noted some issues with Firefox on protected (certificate) ATLAS pages, but these were not seen by other members present.
