RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Zoom Meeting ID
66811541532
Host
Alastair Dewhurst
    • 13:30 13:31
      Experiment Operational Issues 1m
    • 13:35 13:40
      ATLAS Operations Report 5m
      Speakers: Brij Kishor Jashal (Rutherford Appleton Laboratory), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
    • 13:40 13:45
      CMS Operations Report 5m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      CMS cosmic data taking starting March 7th; machine commissioning starting Apr 5th

      Migration of VOMS proxy and token issuing from OpenStack IAM to Kubernetes IAM continues; plan to switch off old service within a week!

      Issues within CMS on Monday with token tests affecting all test endpoints. This was gradually fixed during the day; Katy restarted AAA services to fix the consequently failing 'federation' test. 

      No obvious problem was observed during the legacy network/OPN scream test on Tuesday. There was a spike of job failures coincident with the network change, but I can't connect it to the network.

      CMS has started a tape deletion campaign, but I don't believe anything has happened at RAL yet. 

      Planned for deletion: ~3.4 PB + ~250 TB already obsolete = ~3.65 PB

    • 13:45 13:50
      LHCb Operations Report 5m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      Operational issues:

      • xrootd nproc limit issue appeared again
        • This time the problem happens when LHCb pilots create a gfal context
          • The GridFTP plugin is initialized after xrootd
          • The plugin creates new threads
          • Since xrootd has already set the limit at this point, thread creation may fail on a busy WN, causing the pilot to fail as well (see attached plot).
        • We may want to mitigate it, e.g. by mapping the pilot DN to a pool of users rather than a single one
          • GSTSM-327 has been opened to track this.
      • Job failures due to frequent gateway restarts
        • The writable WN sandbox fixed a bug in the xrd-ceph buffer size (8 bytes changed to 8 MiB)
        • That increased gateway memory consumption significantly, due to xcache-gateway interaction peculiarities, sometimes causing OOM kills
        • The sandbox was rolled back on the prod farm
        • New version of the xrd-ceph plugin with write-only buffers (i.e. buffering that is applied only to write operations) is being tested on the LHCb-only WN (so far without any jobs, just manual tests).
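
To illustrate the nproc issue above, here is a minimal Python sketch, assuming the limit in question is the per-user RLIMIT_NPROC. The helper names `nproc_headroom` and `try_start_thread` are hypothetical and are not part of the pilot, gfal or xrootd code. The limit is counted per user, so every pilot mapped to the same account draws on the same budget, which is why a pool of accounts may help.

```python
import resource
import threading
from typing import Optional


def nproc_headroom(soft_limit: int, tasks_in_use: int) -> Optional[int]:
    """Return how many more tasks (threads or processes) a user may create.

    Returns None when the limit is unlimited (RLIM_INFINITY).
    """
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return max(soft_limit - tasks_in_use, 0)


def try_start_thread() -> bool:
    """Start a trivial thread; return False if the OS refuses to create it."""
    worker = threading.Thread(target=lambda: None)
    try:
        worker.start()
    except RuntimeError:
        # CPython raises RuntimeError ("can't start new thread") on failure.
        return False
    worker.join()
    return True


if __name__ == "__main__":
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    print("soft RLIMIT_NPROC:", soft)
    print("thread started:", try_start_thread())
```

The numbers passed to `nproc_headroom` would have to come from the WN itself; the sketch only shows why a late thread creation can fail once the account is close to its limit.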

       

      News:

      • New LHCb reprocessing workflow is coming to RAL (and other Tier-1 sites)
        • Running jobs will download data from CERN storage
        • This workflow is also used by CERN for jumbo frame testing
          • RAL may join the test by enabling jumbo frames on (some) WNs (e.g. preprod farm), see GSTSM-328.

       

    • 13:50 13:55
      ALICE Operations Report 5m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      NTR.

    • 13:55 14:00
      LSST Operations Report 5m
      Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))

      Rubin Data Management and System Performance meeting last week

      Some key takeaways:

      • Will be zipping files for production to help with:
        • Archiving at IN2P3
        • Movement of data using FTS
        • Butler can read zipped files
        • S3 can pull out files from zip when needed for analysis
      • I made friends with Jen Adelman-McCarthy and Brian Yanny, so we are working to get RAL into the multisite testing
        • This is more pressing as Lancs is currently down, so UKDF needs some compute
      • First data output has been pushed back 6 months (now all 1 year apart), which gives us more time to resolve issues in code and infrastructure
        • Though mainly for the Rubin side
      • Question from the CM team: how do they get tickets to RAL if Tim is out? GGUS was suggested again by myself and Jen

       

      Jobs are still running successfully at RAL. As soon as the data is sorted with Jen / Brian, the job mix will change and increase in volume.

      Data movement from Rubin is at the regular testing volume, though it is an interesting test mix to have Lancs as the destination for all tests.

    • 14:00 14:01
      Tier-1 Projects 1m
    • 14:05 14:15
      Preparing for mini UK Data Challenge 10m
      Speakers: Mr James Adams, Katy Ellis (Science and Technology Facilities Council STFC (GB))
    • 14:15 14:25
      Deployment of new EOS nodes 10m
      Speaker: Thomas Byrne
    • 14:25 14:35
      XRootD Development 10m
      Speakers: Alexander Rogovskiy (Rutherford Appleton Laboratory), Jyothish Thomas (STFC)

      Writable WNs:

      • The change was deployed to the prod farm last week
      • It turned out that the gateway memory consumption increases significantly under heavy I/O load, causing occasional OOM kills
        • Buffer size was fixed -- from 8 bytes to 8 MiB
        • Problems are triggered mostly by reads, due to xcache-gateway interaction peculiarities
          • Xcache can use multiple threads to fetch file blocks, and each thread is treated as a separate client, so a separate buffer is created
        • The change was rolled back
        • New version of xrd-ceph where buffering is only applied to write operations is being tested on the LHCb-only WN.
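
A rough worked example of why the buffer fix raised gateway memory use, assuming one buffer is allocated per client and that each xcache thread counts as a client, as described above. The file and thread counts are hypothetical illustration values, not measurements from the RAL gateways.

```python
MIB = 1024 * 1024


def buffer_memory_bytes(clients: int, buffer_size_bytes: int) -> int:
    """Total buffer memory if every client gets its own buffer."""
    return clients * buffer_size_bytes


# Hypothetical load: 50 open files, each fetched by 8 xcache threads.
open_files = 50
threads_per_file = 8
clients = open_files * threads_per_file

before = buffer_memory_bytes(clients, 8)       # old, buggy 8-byte buffers
after = buffer_memory_bytes(clients, 8 * MIB)  # fixed 8 MiB buffers

print(f"clients: {clients}")
print(f"8 B buffers:   {before} bytes")
print(f"8 MiB buffers: {after / MIB:.0f} MiB")
```

Under these assumed numbers the total goes from 3200 bytes to 3200 MiB for the same 400 clients, which is the kind of jump that could lead to OOM kills; restricting buffering to writes removes the read-side multiplier.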
    • 14:35 14:45
      Utilizing GPUs 10m
      Speakers: Jyoti Prakash Biswal (Rutherford Appleton Laboratory), Thomas Birkett
    • 14:45 14:46
      AOB 1m
    • 14:46 14:55
      Summary of Operational Status and Issues 9m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
    • 14:55 15:00
      Any other Business 5m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore