ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

https://cern.zoom.us/j/98434450232

Password protected (same as (new) OPs Mtg)

● Outstanding tickets

    • 155460 USER atlas UKI-SOUTHGRID-CAM-HEP less urgent NGI_UK assigned 2022-01-05 11:26:00 Failovers from Cambridge to CERN backup proxy EGI

      • ‘Rogue’ user jobs; admins are contacting users
    • 155430 TEAM atlas UKI-SCOTGRID-ECDF less urgent NGI_UK in progress 2022-01-05 22:18:00 UKI-SCOTGRID-ECDF transfer and deletion errors EGI

      • Major issues with one of the storage nodes, awaiting site prognosis
    • 155410 TEAM atlas RAL-LCG2 less urgent NGI_UK in progress 2022-01-04 07:07:00 RAL-LCG2 jobs failed due to transfer timeout EGI

      • Large number of FTS transfers queued; transfers are timing out because of the time spent in the queue (not the transfer time itself), leading to compute job failures.
      • Many mitigating steps taken: reduced number of running jobs; new WebDAV endpoint, with two machines dedicated to ATLAS on that endpoint.
      • Backlog largely cleared and work ongoing for davs improvements
      • 4-5 GB/s davs reads are sustainable
    • 155141 TEAM atlas UKI-LT2-Brunel less urgent NGI_UK in progress 2021-12-24 08:39:00 Transfers from UKI-LT2-Brunel fail with “Internal Server Error” EGI

      • Aim to resolve tickets shortly. HC test files should have been returned to site (to check); also to check on any outstanding issues
    • 154806 TEAM atlas UKI-LT2-QMUL less urgent NGI_UK in progress 2021-12-25 04:28:00 UKI-LT2-QMUL SOURCE transfer failures: [13] Result (Neon): SSL handshake failed EGI

      • Needs resolving (somehow)
    • 154543 TEAM atlas UKI-SCOTGRID-ECDF urgent NGI_UK in progress 2021-12-08 12:35:00 DPM storage ACL configuration EGI

      • To prod site on how to resolve this
    • 154436 TEAM atlas RAL-LCG2 very urgent NGI_UK on hold 2021-12-08 13:25:00 RAL Echo Davs developments EGI

      • Work in 155410 to help dev work
    • 153367 TEAM atlas RAL-LCG2 urgent NGI_UK on hold 2021-12-01 15:37:00 HTTPS on RAL CTA EGI

      • To be continued …

● CPU

    • RAL

      • Poor performance, due to 155410
    • Northgrid

      • Lancs: power issues on Christmas Eve; resolved by or on Christmas Day
    • London

      • Largely ok; some QMUL blips
    • SouthGrid

      • OK
    • Scotgrid

      • Some issues for Durham; Glasgow CPU efficiency is low


 


● Ongoing Items

  • TPC with http

    • Active work now restarting for davs deployment
  • Storageless Site test (Oxford)

    • Some new Xcache hardware may be available for loan
    • Working on finding out why little Xcache traffic
    • Site will upgrade to 5.4.0
  • LANCS Storage migration

    • Aliases exist; JW to configure CRIC side

 

 


● News round-table

  • Gerard

    • NTR
  • Matt

    • NTR
  • Patrick

    • NTR
  • Peter

    • NTR
  • Stephen

    • NTR
  • Vip

    • NTR

 

 

    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m

        New link for the site-oriented dashboard


      • Other new issues / tasks 5m

        Re-enabling GPU queue for QMUL

        Analysis facilities: understand the status and E&D for UK; feedback to Alessandra.

        Multihop failures at RAL: failed intermediate steps are not overwritten

    • 10:20 10:40
      Ongoing Items 20m

    • 10:40 10:50
      News round-table 10m

    • 10:50 11:00