ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 149095 UKI-SOUTHGRID-OX-HEP less urgent waiting for reply 2020-10-28 14:02:00 UKI-SOUTHGRID-OX-HEP: unstable transfer
    • Closed prior to meeting; looks like an issue at the other end.
  • 148968 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-10-27 15:20:00 UKI-NORTHGRID-LANCS-HEP: deletion and transfer failures
    • Followed up; failing files had already been declared as lost.
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-10-27 10:59:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • Awaiting DDM Experts to manually update DB.
  • 146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
    • On hold
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-09-11 13:35:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
    • On hold
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
    • Decommissioning and physical migration of hardware to new DC.
    • No CEs in old DC.
    • New nat5 to DC soon; ticket can then be progressed.
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • Response from Patrick; Host certificate problems

CPU

  • RAL
    • Running slightly below fairshare
  • Northgrid
    • LANCS: Reduced ATLAS allocation; reduced slots to LHCb
  • London
    • Generally fine
  • SouthGrid
    • OX system back within 3 hrs of Tuesday's downtime
  • Scotgrid
    • GLA slowly increasing cores; so far no bottlenecks seen
    • To follow up with Durham re. reduction in slots.

Other new issues

  • Unconfirmed vulnerability discussed; to be brought up at the UK Security meeting

Ongoing issues

  • CentOS7 - Sussex
    • On hold
  • TPC with http
    • NTR
  • Storageless Site tests
    • Oxford agreement for StorageLess tests.
    • O(node(s)) to be allocated for test
    • ~ 2-3 weeks to identify and implement hardware
    • ECDF - remote xrootd monitoring to be updated
    • For ATLAS:
      • OX - RAL will be endpoint
      • What workloads to start with?
      • Timescale
        • 2-3 weeks
      • Design
        • To be documented
      • Hardware
        • CEs
          • No plan to create new CE; use existing implementation
          • Consider sending a new queue to ARC?
          • For HTCondor backend; need some mapping
        • XCache
          • Size: estimated 36TB.
      • New Panda Site / Queue likely needed
      • Jira to be created
      • Job mix to be discussed.

News round-table

  • Vip
    • NTR
  • Dan
    • NTR
  • Matt
    • NTR
  • Alessandra
    • Expressed a wish to move to Zoom
  • Sam
    • NTR
  • Gareth
    • NTR
  • JW
    • NTR

AOB

  • Proposal to move to Zoom; aim for the next meeting.
    • JW to work out how to pass on security information.
    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m
      • Other new issues / tasks 5m

    • 10:20 10:40
      Ongoing Items 20m
      • CentOS7 - Sussex 5m
      • Grand Unified queues 5m
      • TPC with http 5m

        "Switch off HTTP-TPC at sites that do not support macaroons at the end of the year"
        (Currently would be UKI-SOUTHGRID-OX-HEP for the UK).

      • Storageless Site tests (Oxford) 5m

        (See also Minutes from: GridPP Storage Meeting minutes 28 Oct 2020)

    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m