ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

https://cern.zoom.us/j/98434450232

Password protected (same as (new) OPs Mtg)

Videoconference: ATLAS UK Cloud Support
Zoom Meeting ID: 98434450232
Host: James William Walder

● Status

 

    • 155430 TEAM atlas UKI-SCOTGRID-ECDF less urgent NGI_UK in progress 2022-01-13 14:27:00 UKI-SCOTGRID-ECDF transfer and deletion errors EGI

      • Close to final query; urgent other items
    • 155141 TEAM atlas UKI-LT2-Brunel less urgent NGI_UK in progress 2022-01-25 16:47:00 Transfers from UKI-LT2-Brunel fail with “Internal Server Error” EGI

      • Why weren’t these spotted before?
    • 154806 TEAM atlas UKI-LT2-QMUL less urgent NGI_UK in progress 2022-01-23 14:03:00 UKI-LT2-QMUL SOURCE transfer failures EGI

      • Last update related to RAL multihop transfer failures
    • 154543 TEAM atlas UKI-SCOTGRID-ECDF urgent NGI_UK in progress 2021-12-08 12:35:00 DPM storage ACL configuration EGI

      • Awaiting …
    • 154436 TEAM atlas RAL-LCG2 very urgent NGI_UK in progress 2022-01-27 08:44:00 RAL Echo Davs developments EGI

      • Failures overnight on one host due to “terminated handshake not received” errors
    • 153367 TEAM atlas RAL-LCG2 urgent NGI_UK on hold 2021-12-01 15:37:00 HTTPS on RAL CTA EGI

      • Cleanup ongoing in Castor for preparation for CTA migration

● CPU

 

    • RAL

      • Some volatility; ARC Priority setting for HC jobs does not appear to propagate to submission priority
        • Consider non-aCT pull mode if this can’t be solved?
    • Northgrid

      • LANCS: Some unexplained drop, but slots are being regained; several HC failures (as at other sites) due to Rucio 503 errors
    • London

      • QMUL: Issues with ARC-CEs; adding memory
    • SouthGrid

      • NTR
    • Scotgrid

      • GLA HC (Rucio) failures, and Xcache server going down

● Other new issues / tasks

  • HC jobs running on GPU queue; Dan to check

 


● Ongoing Items

  • TPC with http

    • Updates running on webdav alias, GGUS above updated with latest info
  • Storageless Site test (Oxford)

    • To extract more info regarding Xcaches and performance (as described in Storage Mtg)
    • New server may still become available for testing
  • LANCS Storage migration

    • All the pieces there

 


● News round-table

  • Alessandra

    • NTR
  • Dan

    • Preparing for new storage, with a doubling of space.
  • Gerard

    • NTR
  • Matt

    • NTR
  • Peter

    • NTR
  • Sam

    • Preparing for LHCb and cephfs migration
  • Stephen

    • NTR
  • Vip

    • Possibility of an Xcache server is getting closer

 


● AOB

There are minutes attached to this event.
    • 10:00 AM 10:20 AM
      Status 20m

       

      • Outstanding tickets 10m
      • CPU 5m

        New link for the site-oriented dashboard

         

      • Other new issues / tasks 5m

        Mitigations for recent security patches may stop Singularity.

        Sites with ARC-CE and aCT: is the "ARC priority" considered/applied in the scheduling?

        Re-enabling GPU queue for QMUL:
        - HC tests now running successfully

        Multihop failures at RAL: failed intermediate steps are not overwritten;
        request that the XrootD devs add an 'autorm' feature for http-TPC.


         

    • 10:20 AM 10:40 AM
      Ongoing Items 20m

       

    • 10:40 AM 10:50 AM
      News round-table 10m

       

    • 10:50 AM 11:00 AM