
ATLAS UK Cloud Support



Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 148474 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-09-01 09:24:00 UKI-NORTHGRID-LANCS-HEP : Low deletion efficiency

    • Similar status to last week; combination of aging servers, some full, and empty ones that become overloaded
    • On-site access yesterday; some older hardware will need OS upgrades.
  • 148401 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-09-02 15:37:00 UKI-NORTHGRID-LANCS-HEP: globus_ftp_client failures

    • as above
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-09-02 15:53:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures

    • Consistency check returns files that have zero replicas in DPM. AF to check for any scripts that might help.
    • SS to check the database for the zero-replica entries.
  • 146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-08-20 14:44:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”

    • JW to follow up.
  • 146651 RAL-LCG2 urgent on hold 2020-08-10 10:59:00 singularity and user NS setup at RAL

    • on hold
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-07-22 14:53:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • no update
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • on hold
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • on hold

CPU


  • RAL

    • Stable; below pledge in Monit, but consistent with internal figures (and with the pledge) if scaled by the correct corepower.
  • Northgrid

    • LANCS: disk problems (described above); in test. May also have some pilot issues.
  • London

    • QMUL: From 3rd. Switched to run prod-only jobs (to stop jobs using scratchdisk).
  • SouthGrid

    • OX: Recovered from power issues; 3 WNs (older tranche) not recoverable.
    • RALPP: Some reduction due to dCache upgrades.
  • Scotgrid

    • GLA: Running below full capacity; some from DPM, awaiting decommissioning of DPM and relocation, others in new DC.

Other new issues

  • QMUL upgrade
    • JW to confirm that other sites dependent on QMUL storage are also in downtime.

Ongoing issues

  • Sussex
    • on hold
  • Grand Unified queues
    • on hold

News round-table


  • Vip

    • Data center power / air-conditioning issues now recovered. Lost 3 old WNs (approx. 190 cores).
  • Dan

    • Check that dependent sites (e.g. Cambridge) will transition correctly
  • Matt

    • It appears that some pilots are dying at LANCS; lower priority than the disk failures at the moment.
  • Alessandra

    • JW to add to the agenda page the TPC items that need to be done.
  • Gareth

  • Tim

    • Lost files at MAN; AF to redeclare them as lost.
  • JW

    • NTR



    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m
      • Other new issues 5m
    • 10:20 10:40
      Ongoing issues 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m