ATLAS UK Cloud Support

Europe/London
Vidyo


Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 148589 UKI-LT2-UCL-HEP less urgent in progress 2020-09-09 10:50:00 Failovers from UKI-LT2-UCL-HEP to CERN backup proxy

    • In progress; a local user is causing the issues. The squid is not set to be monitored in GOCDB.
  • 148578 UKI-NORTHGRID-LANCS-HEP urgent in progress 2020-09-09 15:04:00 cannot download files from UKI-NORTHGRID-LANCS-HEP_LOCALGROUPDISK

    • Some files lost, others recovered on LOCALGROUPDISK
    • ZFS needs time to complete the list.
    • JW - to delete the 7 LGD files.
  • 148544 UKI-SCOTGRID-ECDF less urgent in progress 2020-09-07 13:09:00 UKI-SCOTGRID-ECDF failed jobs

    • Possible checksum timeouts for large files.
    • Upgrades in progress.
  • 148401 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-09-09 19:20:00 UKI-NORTHGRID-LANCS-HEP: globus_ftp_client failures

    • Initial file list declared lost.
    • The disk server finally failed the disk; a list of lost files is being prepared.
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-09-04 05:40:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures

    • Work on capacity in the Ceph pool is in progress.
  • 146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-09-05 18:57:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”

    • Waiting for upgrades to finish.
  • 146651 RAL-LCG2 urgent on hold 2020-08-10 10:59:00 singularity and user NS setup at RAL

    • on hold
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-07-22 14:53:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • on hold
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • on hold
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • on hold

CPU

  • RAL

    • Echo downtime for network firmware upgrades; the downtime had to be set manually.
  • Northgrid

    • Issues with Lancs (as described above)
  • London

    • QMUL in downtime for Lustre updates; will extend into next week.
  • SouthGrid

  • Scotgrid

Other new issues

  • Active http TPC endpoints
    • LANCS upgraded; works fine
    • ECDF upgraded, but some certificate issues remain.
    • Durham (with macaroons): upload and download only, no TPC; not in functional tests.

Ongoing issues

  • CentOS7 - Sussex
    • No update
  • Grand Unified queues
    • Awaiting SHEF

News round-table

  • Vip

    • Paul to send Vip instructions for DPM upgrades.
      • Some manual changes needed for TPC.
    • Still working on residual fallout from previous DC power issues
  • Dan

    • In Downtime for upgrades; hardware done
    • One difficult ATLAS directory (many files to verify checksums); downtime to next week to finish migration.
    • Additionally, AC situation will be improved by next week.
    • Change to mountpoints needed; to confirm via email.
  • Matt

    • Working on the Storage issues
    • More jobs running from other VOs.
  • Peter

    • ATLAS will switch to python3 from cvmfs; should be transparent.
  • Alessandra

    • NTR
  • Sam

    • ATLAS to move to 40G Ceph.
    • Starting to look at XRootD 5 for some new features.
  • JW

    • Work on TPC with HTTP for Ceph ongoing.

AOB

 

    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m
      • Other new issues 5m

        Status of TPC with HTTP:
        dCache (>5.2.18 and 6.2.x) and DPM (1.14) sites:
        UKI-SCOTGRID-ECDF
        UKI-SOUTHGRID-OX-HEP

    • 10:20 10:40
      Ongoing issues 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m