RAL Tier1 Experiments Liaison Meeting
Access Grid
RAL R89
-
-
13:30
→
13:31
Experiment Operational Issues 1m
-
13:35
→
13:40
ATLAS Operations Report 5m
Speakers: Brij Kishor Jashal (Rutherford Appleton Laboratory), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
-
13:40
→
13:45
CMS Operations Report 5m
Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
CMS cosmic data taking starting March 7th; machine commissioning starting Apr 5th
Migration of VOMS proxy and token issuing from OpenStack IAM to Kubernetes IAM continues; plan to switch off old service within a week!
Issues within CMS on Monday with token tests affecting all test endpoints. This was gradually fixed during the day; Katy restarted AAA services to fix the consequently failing 'federation' test.
No obvious problem was observed during the legacy network/OPN scream test on Tuesday. There was a spike of job failures coincident with the network change, but I can't connect it to the network.
CMS has started a tape deletion campaign, but I don't believe anything has happened at RAL yet.
Planned for deletion: ~3.4 PB + ~250 TB already obsolete = ~3.65 PB
-
13:45
→
13:50
LHCb Operations Report 5m
Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
Operational issues:
- xrootd nproc limit issue appeared again
- This time the problem happens when LHCb pilots create a gfal context
- The GridFTP plugin is initialized after the xrootd one
- The plugin creates new threads
- Since xrootd has already set the limit at this point, thread creation may fail on a busy WN, causing the pilot to fail as well (see attached plot).
- We may want to mitigate it, e.g. by mapping pilot DN to a pool of users rather than a single one
- GSTSM-327 has been opened to track this.
- Job failures due to frequent gateway restarts
- The writable WN sandbox fixed a bug in the xrd-ceph buffer size (8 bytes changed to 8 MiB)
- That significantly increased gateway memory consumption, due to xcache-gateway interaction peculiarities, sometimes causing OOM kills
- The sandbox was rolled back on the prod farm
- New version of the xrd-ceph plugin with write-only buffers (i.e. buffering that is applied only to write operations) is being tested on the LHCb-only WN (so far without any jobs, just manual tests).
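To illustrate the nproc issue reported above: on Linux, RLIMIT_NPROC is counted per user, so a plugin that starts threads after the limit has been lowered can fail on a busy node. The following is a minimal Python sketch of that failure mode, not the actual xrootd or gfal code; the helper names `nproc_limits` and `try_start_threads` are hypothetical.

```python
import resource
import threading


def nproc_limits():
    """Return the (soft, hard) RLIMIT_NPROC values of this process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def try_start_threads(count):
    """Try to start `count` idle threads and return how many actually started."""
    stop = threading.Event()
    started = []
    try:
        for _ in range(count):
            worker = threading.Thread(target=stop.wait)
            # start() raises RuntimeError when the OS refuses a new thread,
            # e.g. because the per-user nproc limit has been reached.
            worker.start()
            started.append(worker)
    except RuntimeError:
        pass
    finally:
        stop.set()
        for worker in started:
            worker.join()
    return len(started)


soft, hard = nproc_limits()
print("RLIMIT_NPROC soft/hard:", soft, hard)
print("threads started:", try_start_threads(4))
```

Because the limit is counted per user, mapping the pilot DN to a pool of users (the mitigation suggested above) would spread the threads over several per-user counts.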
News:
- New LHCb reprocessing workflow is coming to RAL (and other Tier-1 sites)
- Running jobs will download data from CERN storage
- This workflow is also used by CERN for jumbo frame testing
- RAL may join the test by enabling jumbo frames on (some) WNs (e.g. preprod farm), see GSTSM-328.
- 13:50 → 13:55
-
13:55
→
14:00
LSST Operations Report 5m
Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
Rubin Data Management and System Performance meeting last week
Some key takeaways:
- Will be zipping files for production to help with:
- Archiving at IN2P3
- Movement of data using FTS
- Butler can read zipped files
- S3 can pull out files from zip when needed for analysis
- I made contact with Jen Adelman-Mcarthy and Brian Yanny and am working with them to get RAL into the multisite testing
- This is more urgent as Lancs is currently down, so the UKDF needs some compute
- 1st data output pushed back 6 months (outputs are now all 1 year apart), which gives us more time to resolve issues in code and infrastructure
- Though mainly for the Rubin side
- Question from the CM team: how do tickets get to RAL if Tim is out? GGUS was suggested again by myself and Jen
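As an illustration of why zipped files can still be used for analysis: a zip archive keeps an index of its members, so a single file can be read without unpacking the rest. This is a small sketch using only the Python standard library. It is not how Butler or the S3 layer implement this, and the member names and the helper `read_member` are hypothetical.

```python
import io
import zipfile

# Build a small in-memory archive standing in for a zipped production output.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(3):
        zf.writestr(f"exposure_{i}.txt", f"payload {i}")


def read_member(fileobj, name):
    """Read one member of the archive without extracting the others."""
    with zipfile.ZipFile(fileobj) as zf:
        return zf.read(name)


print(read_member(archive, "exposure_1.txt"))
```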
Jobs are still running successfully at RAL. As soon as the data is sorted out with Jen / Brian, the job mix will change and increase in volume

Data movement testing from Rubin continues at the regular volume, though it is an interesting test mix to have Lancs as the destination for all tests
-
14:00
→
14:01
Tier-1 Projects 1m
-
14:05
→
14:15
Preparing for mini UK Data Challenge 10m
Speakers: Mr James Adams, Katy Ellis (Science and Technology Facilities Council STFC (GB))
-
14:15
→
14:25
Deployment of new EOS nodes 10m
Speaker: Thomas Byrne
-
14:25
→
14:35
XRootD Development 10m
Speakers: Alexander Rogovskiy (Rutherford Appleton Laboratory), Jyothish Thomas (STFC)
Writable WNs:
- The change was deployed to the prod farm last week
- It turned out that gateway memory consumption increases significantly under heavy I/O load, causing occasional OOM kills
- Buffer size was fixed -- from 8 bytes to 8MiB
- Problems are triggered mostly by reads, due to xcache-gateway interaction peculiarities
- Xcache can use multiple threads to fetch file blocks, and each thread is treated as a separate client, so a separate buffer is created
- The change was rolled back
- New version of xrd-ceph where buffering is only applied to write operations is being tested on the LHCb-only WN.
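The memory effect described above can be estimated with simple arithmetic: if every Xcache fetch thread counts as a separate client with its own buffer, the buffer memory scales with (open files) x (fetch threads) x (buffer size). The sketch below is only a back-of-the-envelope estimate. The buffer sizes (8 bytes and 8 MiB) come from the notes, while the file and thread counts are assumed values chosen for illustration.

```python
MIB = 1024 * 1024


def gateway_buffer_bytes(open_files, threads_per_file, buffer_bytes):
    """Total buffer memory if each fetch thread gets its own client buffer."""
    clients = open_files * threads_per_file
    return clients * buffer_bytes


# Assumed load: 100 files open at once, 10 Xcache fetch threads per file.
old = gateway_buffer_bytes(100, 10, 8)        # buggy 8-byte buffers
new = gateway_buffer_bytes(100, 10, 8 * MIB)  # fixed 8 MiB buffers

print(f"old: {old} bytes, new: {new / MIB:.0f} MiB")
```

Under these assumed numbers the buffers grow from a few kilobytes to several gigabytes, which is consistent with the occasional OOM kills noted above and with restricting buffering to write operations.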
-
14:35
→
14:45
Utilizing GPUs 10m
Speakers: Jyoti Prakash Biswal (Rutherford Appleton Laboratory), Thomas Birkett
-
14:45
→
14:46
AOB 1m
-
14:46
→
14:55
Summary of Operational Status and Issues 9m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
-
14:55
→
15:00
Any other Business 5m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore