ATLAS UK Cloud Support

Europe/London
Vidyo


Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 147792 UKI-NORTHGRID-MAN-HEP less urgent on hold 2020-07-21 10:09:00 UKI-NORTHGRID-MAN-HEP deletion errors with message: DavPosix::unlink Authentication error

  • 147770 UKI-NORTHGRID-LANCS-HEP very urgent in progress 2020-07-23 07:46:00 UKI-NORTHGRID-LANCS-HEP: stage-out failures

    • Files declared lost; deletion errors persist, as the files need to be removed from the DPM DB.
    • Need to find out the best-practice method for doing this.
  • 146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-07-22 13:47:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”

    • Discussions ongoing; move to SL7 did not help
    • Site uses recent xroot versions; will try falling back to xroot-4.11.3
    • Can't reproduce the issue; not reproducible with gfal
      • files stuck with particular credentials
      • worth checking whether there is an issue with VOMS roles or old certs
  • 146651 RAL-LCG2 urgent in progress 2020-07-14 09:27:00 singularity and user NS setup at RAL

    • -> on hold
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-07-22 14:53:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • (via discussion) Hope to try some of the OX configuration to get back online
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • on hold
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • on hold

CPU

  • Some missing points (in the historical plots) due to CERN DNS problems on Sunday (these sent sites into HC test; manually reverted)

  • RAL

    • ok
  • Northgrid

    • Small spike from SHEF, now nothing
    • LANCS: no real production jobs (followed up via email; gridengine restarts and, perhaps unrelated, an IPv6 storage fix); jobs now running
    • Change of queue length to 72 hours (48 for production)

  • London

  • SouthGrid

  • Scotgrid

    • GLA
      • gridFTP endpoint set up on cephc02 (has cache, unlike c04); 04 for external and 02 for internal
      • xrootd to be tested next (followed by a test of the 4.11.3 downgrade) in the sanctuary of the test queue

Site overview

  • Job failures understood; RAL errors highest - to be investigated.

Other new issues

Ongoing issues

  • CentOS7 - Sussex

    • On hold
  • Grand Unified queues

    • Awaiting SHEF

News round-table

  • Vip

    • Discussion about SHEF and using the OX configuration to help.
  • Matt

    • NTR
  • Alessandra

    • Should have a postmortem on GLA, when possible
    • Some work reported to have started on using pure Apache HTTP with WLCG tokens (within doma-tpc)
    • Some differences between JWT and WLCG tokens (question from Sam, for DUNE)
      • TPC is using JWT with its own schema
  • Sam

    • NTR
  • Gareth

    • Will try to turn on more nodes, but still with AC instabilities
  • Tim

    • NTR
  • JW

    • NTR
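
On the JWT versus WLCG token question raised in the round-table: a WLCG token is itself a JWT, so the practical difference lies in which claims the payload carries. The following is a minimal sketch, using only the Python standard library, of decoding a token payload (without verifying the signature) to inspect its claims. The sample claim values and the "own schema" claim name are hypothetical and chosen only for illustration; the check on `wlcg.ver` assumes the token follows the WLCG Common JWT Profile, which uses that claim as its version marker.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Return the claims of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    # base64url encoding drops '=' padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def make_unsigned_token(claims: dict) -> str:
    """Build a demo token (alg 'none', empty signature). Illustration only."""
    def b64(obj: dict) -> str:
        raw = json.dumps(obj, separators=(",", ":")).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return f"{b64({'alg': 'none', 'typ': 'JWT'})}.{b64(claims)}."


def looks_like_wlcg_profile(claims: dict) -> bool:
    """Assumption: WLCG-profile tokens carry a 'wlcg.ver' claim."""
    return "wlcg.ver" in claims


# Hypothetical payloads: one in the WLCG style, one with a made-up schema.
WLCG_STYLE = {"sub": "hypothetical-user", "wlcg.ver": "1.0", "scope": "storage.read:/"}
OWN_SCHEMA = {"sub": "hypothetical-user", "authz": "DOWNLOAD"}

wlcg_claims = decode_jwt_payload(make_unsigned_token(WLCG_STYLE))
own_claims = decode_jwt_payload(make_unsigned_token(OWN_SCHEMA))
```

Decoding without signature verification is only suitable for inspecting tokens during debugging; a storage endpoint would need to validate the signature and issuer before trusting any claim.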

AOB

From the chat window
    - Back on the ECDF ticket - I checked the links and their deletion efficiency for the last 24 hours is 81%
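
For reference, the deletion efficiency quoted in the chat is assumed here to be the fraction of attempted deletions that succeeded over the time window. A minimal sketch of that arithmetic follows; the counts are hypothetical and are not taken from the ECDF monitoring.

```python
def deletion_efficiency(successful: int, attempted: int) -> float:
    """Percentage of attempted deletions that succeeded."""
    if attempted <= 0:
        raise ValueError("no deletions attempted in this window")
    if not 0 <= successful <= attempted:
        raise ValueError("successful must be between 0 and attempted")
    return 100.0 * successful / attempted


# Hypothetical counts: 81 successes out of 100 attempts.
example_efficiency = deletion_efficiency(81, 100)
```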
    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m

      • CPU 5m
      • Other new issues 5m
        • Job failures generally understood
          • RAL errors highest; to be investigated (is it still the staging issues?)

         

    • 10:20 10:40
      Ongoing issues 20m
    • 10:40 10:50
      News round-table 10m

    • 10:50 11:00
      AOB 10m