ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 148946 UKI-LT2-QMUL less urgent in progress 2020-10-07 10:34:00 Failovers from jobs running at UKI-LT2-QMUL queue

    • WNs available with IPv6
  • 148908 UKI-NORTHGRID-LANCS-HEP less urgent waiting for reply 2020-10-07 16:56:00 UKI-NORTH-LANCS-HEP jobs failing due to “lost heartbeat”

    • Downtime for improvements to the shared FS is done.
    • ZFS failed files; checks ongoing.
    • Current HammerCloud (HC) failures with root://fal-pygrid-30.lancs.ac.uk:1094//dpm/lancs.ac.uk/home/atlas/atlasdatadisk/rucio/data18_13TeV/96/e0/data18_13TeV.00349263.physics_Main.merge.AOD.f937_m1972._lb0150._0003.1
      • JW to declare the HC file lost so that HC passes again (see the Rucio sketch after this list)
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-10-07 18:15:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures

    • SS apologised for absence; report sent via email:
      • Disk cleanup/deletions on the DPM are being handled locally for files on disk063, which appear to be dark data.
    • Pick-up in failure rate overnight for Ceph:
      • Initially, it looks like putting the 40GB/s connection into the Ceph cluster might have caused some load spikes. Traffic shaping will be looked at later today; write traffic appears to be the only thing seriously affected. (A hedged traffic-shaping sketch follows this list.)
  • 146651 RAL-LCG2 urgent on hold 2020-08-10 10:59:00 singularity and user NS setup at RAL

    • Update requested from the Grid Services team on the timeline
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-09-11 13:35:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • No update
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • No update
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • No update
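
Note on ticket 148908: a minimal sketch, assuming a configured Rucio client environment (rucio.cfg, valid X.509 proxy, sufficient privileges), of how the lost HC file could be declared. declare_bad_file_replicas is the standard Rucio Python client method; the reason string is illustrative, not the wording actually used.

```python
# Minimal sketch, assuming a configured Rucio client environment.
# The PFN is the failing HammerCloud file from ticket 148908;
# the reason string is illustrative.
from rucio.client import Client

client = Client()

pfn = ("root://fal-pygrid-30.lancs.ac.uk:1094//dpm/lancs.ac.uk/home/atlas/"
       "atlasdatadisk/rucio/data18_13TeV/96/e0/"
       "data18_13TeV.00349263.physics_Main.merge.AOD."
       "f937_m1972._lb0150._0003.1")

# Declaring the replica bad lets Rucio recover it from another copy,
# or mark the file lost if no other replica exists.
unknown = client.declare_bad_file_replicas(
    [pfn], reason="HC file lost at UKI-NORTHGRID-LANCS-HEP")
print(unknown)  # PFNs Rucio could not match to a known replica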
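```

Note on ticket 148342: purely as an illustration of the traffic shaping mentioned in the Glasgow report, one possible approach is a token-bucket qdisc on the link into the Ceph cluster. The interface name and rate cap below are assumptions, not values from the report, and this is only one of several shaping options.

```python
# Hedged illustration only: capping traffic into the Ceph cluster with a
# token-bucket qdisc, driven from Python via tc. Interface and rate are
# hypothetical placeholders, not values from the site report.
import subprocess

IFACE = "eth0"     # hypothetical interface carrying Ceph write traffic
RATE = "20gbit"    # hypothetical cap, chosen below link capacity

# "replace" installs the qdisc, or updates it if one is already present.
subprocess.run(
    ["tc", "qdisc", "replace", "dev", IFACE, "root",
     "tbf", "rate", RATE, "burst", "64mb", "latency", "400ms"],
    check=True,
)
```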

CPU

  • RAL
    • Running above pledge; problems on the CMS side freed up slots
  • Northgrid
    • LANCS issues (see ticket above), and a recent drop for MAN
  • London

  • SouthGrid

  • Scotgrid
    • GLA Ceph related issues noted above

Other new issues

Ongoing issues

  • CentOS7 - Sussex
    • on hold
  • Grand Unified queues
    • on hold

News round-table

  • Vip
    • 26-27th possible downtime?
    • Try to find a time to discuss storageless tests and plans
  • Dan
    • NTR; asked for relevant information from the ATLAS S&C week to be passed back to the T2s.
    • JW mentioned the move of the data-carousel model into production.
  • Matt
    • Expecting more disks to arrive
  • Peter
    • Raised interest in Covid working arrangements at other sites
  • Sam
    • Sent apologies
  • JW
    • HTTP-TPC tests reveal an issue with pulls (RAL as destination) when writing data (a hedged reproduction sketch follows).
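
A hedged sketch of reproducing an HTTP-TPC pull (the destination, e.g. RAL, issues the COPY and writes the data itself) with the gfal2 Python bindings, not the actual test harness used. The endpoints are placeholders, and the DEFAULT_COPY_MODE option is assumed to be exposed by the installed gfal2 HTTP plugin.

```python
# Hedged sketch, not the actual test harness: an HTTP-TPC *pull*, where
# the destination fetches from the source and writes the data locally.
# Endpoints are placeholders; DEFAULT_COPY_MODE is assumed available in
# the installed gfal2 HTTP plugin.
import gfal2

ctx = gfal2.creat_context()
# Force third-party-copy pull mode so the destination drives the transfer.
ctx.set_opt_string("HTTP PLUGIN", "DEFAULT_COPY_MODE", "3rd pull")

params = ctx.transfer_parameters()
params.timeout = 300        # seconds
params.overwrite = True     # replace any existing destination file

src = "https://source-se.example.org:443/path/to/testfile"  # placeholder
dst = "https://dest-se.example.org:443/path/to/testfile"    # placeholder

ctx.filecopy(params, src, dst)
```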

AOB
