ATLAS UK Cloud Support

Europe/London
Vidyo


Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 148719 UKI-LT2-IC-HEP less urgent in progress 2020-09-22 19:43:00 Failovers from UKI-LT2-IC-HEP to CERN CVMFS backup proxy
    • Active discussion on ticket
  • 148401 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-09-22 12:47:00 UKI-NORTHGRID-LANCS-HEP: globus_ftp_client failures
    • Files declared lost (again, with typo fixed); few residual files to be investigated once Matt is back.
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-09-15 12:59:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • Internal deletions complete; Sam to update ticket
  • 146651 RAL-LCG2 urgent on hold 2020-08-10 10:59:00 singularity and user NS setup at RAL
    • On hold
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-09-11 13:35:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
    • On hold
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
    • On hold
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • No update

CPU

  • A number of ATLAS / CERN issues are affecting sites.
    • New Pilot version misreporting corecount:

      • Affected scaling values used for accounting (e.g. wallclock and slots used)
      • Jobs were killed incorrectly because of miscalculated mem-limit values.
      • Fix deployed yesterday, and rolling out
    • Update to VOMS server yesterday introduced issues:

      • The upgraded VOMS server issued a VOMS extension that could not be validated by existing (and supported) VOMS C/C++ libraries.
      • The problem was observed on XRootD, since XRootD links against the VOMS libraries, but any C/C++ software linking against the VOMS library would be affected (e.g. the StoRM frontend server).
      • The change has been rolled back, but ATLAS may still see some lingering effects?
    • Harvester_Central_B stopped submitting jobs this morning - under investigation

    • All StoRM sites blacklisted since VOMS incident + pilot update (may be related to the VOMS issue?):

      • “pilot, 1324: Service not available at the moment”
  • RAL

    • Small drop in jobs due to pilot problems; now slowly claiming back jobs from other VOs
    • Seemingly not affected by the other issues.
  • Northgrid

    • All jobs dropped off.
  • London

    • All jobs dropped off.
    • QMUL briefly back up to 20kHS06 before new issues arose
  • SouthGrid

    • Most sites gone; BHAM not affected
  • Scotgrid

    • Most sites gone; ECDF not affected

Other new issues

  • GLASGOW:
    • CEPH_DATADISK no longer in TEST (set to DATADISK in AGIS)
    • DPM DATADISK now set as test
    • PQ set offline for DPM queues
  • QMUL:
    • Space reporting now ok
    • Additional space for ATLAS (with some further space coming)

Ongoing issues

  • CentOS7 - Sussex

    • No update
  • TPC with http

    • No update

News round-table

  • Dan

    • 1/2 PB further to add for ATLAS

      • ATLAS to propose spacetoken split
  • Peter

    • Learning arc-ce
  • Sam

    • Reported on discussion in the Storage meeting on future planning,

      • e.g. moving to storageless sites (even if storage is not initially decommissioned).
    • To hold off final commissioning of new cephcXX until VOMS / related issues are resolved.

  • Gareth

    • Noted general problems due to the VOMS issues
  • JW

    • NTR

AOB

  • Move to Zoom?
    • No strong preference in either direction;
      • Noted that the additional (organisational) overhead on the Host may be the deciding factor.
    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m
      • Other new issues 5m

        New Pilot version misreporting corecount:
        - Affected scaling values used for accounting (e.g. wallclock and slots used)
        - Jobs were killed incorrectly because of miscalculated mem-limit values.
        - fix deployed yesterday

        Update to VOMS server yesterday introduced issues (UK significantly affected; issue with v2).
        - ATLAS uses both v2 and v3 in various places.
        - https://cern.service-now.com/service-portal?id=outage&n=OTG0059138

        Harvester_Central_B stopped submitting jobs this morning - under investigation

        All StoRM sites blacklisted since VOMS incident + pilot update:
        - "pilot, 1324: Service not available at the moment"

        GLASGOW:
        - CEPH_DATADISK no longer in TEST
        - PQ set offline for DPM queues

        QMUL:
        Lustre migration:
        Space reporting now ok?
        https://monit-grafana.cern.ch/d/mHqFLAbik/wlcg-storage-space-accounting?from=now-7d&orgId=20&to=now&var-area=ATLASDATADISK&var-binning=1h&var-country=All&var-federation=All&var-groupby=vo&var-medium=Disk&var-service=All&var-site=UKI-LT2-QMUL&var-tier=All&var-vo=ALICE&var-vo=ATLAS&var-vo=LHCb

    • 10:20 10:40
      Ongoing issues 20m
    • 10:40 10:50
      News round-table 10m

       

    • 10:50 11:00
      AOB 10m

      Zoom ?
