
ATLAS UK Cloud Support

Europe/London
Vidyo


Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 148474 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-09-01 09:24:00 UKI-NORTHGRID-LANCS-HEP : Low deletion efficiency

    • Similar status to last week; combination of aging servers, some full, and empty ones that become overloaded
    • On site access yesterday; some older hardware will need OS upgrades.
  • 148401 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-09-02 15:37:00 UKI-NORTHGRID-LANCS-HEP: globus_ftp_client failures

    • as above
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-09-02 15:53:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures

    • Consistency check returns files that have zero replicas in DPM. AF to check for any scripts that might help.
    • SS to check the database for the 0 replica entries
  • 146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-08-20 14:44:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”

    • JW to follow up.
  • 146651 RAL-LCG2 urgent on hold 2020-08-10 10:59:00 singularity and user NS setup at RAL

    • on hold
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-07-22 14:53:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • no update
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • on hold
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • on hold

CPU

  • RAL

    • Stable; below pledge in Monit, but consistent with internal (and pledge) if scaled by correct corepower.
  • Northgrid

    • LANCS: disk problems (described above); in test. May also have some pilot issues
  • London

    • QMUL: From 3rd. Switched to run only-prod jobs. (To stop jobs using scratchdisk).
  • SouthGrid

    • OX: Recovered from power issues; 3 WNs (older tranche) not recoverable.
    • RALPP: Some reduction due to dCache upgrades.
  • Scotgrid

    • GLA: Running below full capacity; some from DPM, awaiting decommissioning of DPM and relocation, others in new DC.
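
The RAL note above says delivered CPU looks below pledge in Monit but agrees with the pledge once scaled by the correct corepower. A minimal sketch of that arithmetic follows, assuming corepower is a per-core HS06 factor and the pledge is expressed in HS06. The function names and all numbers are hypothetical illustrations, not values from this meeting.

```python
def scaled_hs06(cores: float, corepower: float) -> float:
    """Convert a core count into HS06 using a per-core power factor (assumed unit: HS06/core)."""
    return cores * corepower


def pledge_fraction(cores: float, corepower: float, pledge_hs06: float) -> float:
    """Fraction of the pledge delivered for a given core count and corepower."""
    return scaled_hs06(cores, corepower) / pledge_hs06


if __name__ == "__main__":
    # Hypothetical numbers for illustration only; not taken from the minutes.
    cores_in_use = 10_000
    pledge = 120_000          # HS06
    monitoring_factor = 10.0  # factor assumed by the monitoring
    site_factor = 12.0        # factor the site considers correct

    print("with monitoring factor:", pledge_fraction(cores_in_use, monitoring_factor, pledge))
    print("with site factor:      ", pledge_fraction(cores_in_use, site_factor, pledge))
```

With the same core count, the apparent shortfall comes entirely from the factor used in the conversion, which is why the monitoring and internal figures can disagree while both being self-consistent.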

Other new issues

  • QMUL upgrade
    • JW to confirm that other sites dependent on QMUL storage are also in downtime.

Ongoing issues

  • Sussex
    • on hold
  • Grand Unified queues
    • on hold

News round-table

(NTR)

  • Vip

    • Data centre power and air-conditioning issues now recovered. Lost 3 old WNs (approx. 190 cores).
  • Dan

    • Check that dependent sites (e.g. Cambridge) will transition correctly
  • Matt

    • It appears that some pilots are dying at LANCS; lower priority than the disk failures at the moment.
  • Alessandra

    • JW to add the TPC items that need to be done to the agenda page.
  • Gareth

  • Tim

    • Lost files at MAN; AF to redeclare things as lost.
  • JW

    • NTR

AOB

 

    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m
      • Other new issues 5m
    • 10:20 10:40
      Ongoing issues 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m