Operational issues:
- The xrootd nproc limit issue has appeared again
- This time the problem occurs when LHCb pilots create a gfal context
- The GridFTP plugin is initialized after the xrootd one
- The plugin creates new threads
- Since xrootd has already set the limit at this point, thread creation may fail on a busy WN, which in turn causes the pilot to fail (see attached plot).
- We may want to mitigate it, e.g. by mapping the pilot DN to a pool of users rather than a single one
- GSTSM-327 has been opened to track this.
- Job failures due to frequent gateway restarts
- The writable WN sandbox fixed a bug in the xrd-ceph buffer size (8 bytes changed to 8 MiB)
- This significantly increased gateway memory consumption, due to peculiarities of the xcache-gateway interaction, and occasionally caused OOM kills
- The sandbox was rolled back on the prod farm
- A new version of the xrd-ceph plugin with write-only buffers (i.e. buffering that is applied only to write operations) is being tested on the LHCb-only WN (so far without any jobs, just manual tests).
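For the nproc issue above, a minimal diagnostic sketch may help when checking a busy WN by hand. It assumes a Linux host with /proc mounted; the helper names (count_user_tasks, nproc_headroom) are hypothetical and not part of xrootd, gfal or the pilot code. It reads the current RLIMIT_NPROC and estimates how many threads the current user can still create. Ownership is taken from the /proc entry, which only approximates the per-real-UID accounting the kernel uses.

```python
import os
import resource


def count_user_tasks(uid):
    """Count threads owned by uid by walking /proc (Linux only, approximate)."""
    total = 0
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            if os.stat(f"/proc/{entry}").st_uid != uid:
                continue
            # Each entry under task/ is one thread of that process.
            total += len(os.listdir(f"/proc/{entry}/task"))
        except OSError:
            continue  # process exited or is not accessible
    return total


def nproc_headroom(soft_limit, tasks):
    """Return the remaining thread slots, or None if the limit is unlimited."""
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return max(soft_limit - tasks, 0)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    tasks = count_user_tasks(os.getuid())
    print(f"soft={soft} hard={hard} tasks={tasks} "
          f"headroom={nproc_headroom(soft, tasks)}")
```

Running this as the pilot user before and after the gfal context is created would show whether the limit has been lowered and how close the user is to it.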
News:
- A new LHCb reprocessing workflow is coming to RAL (and other Tier-1 sites)
- Running jobs will download data from CERN storage
- This workflow is also used by CERN for jumbo frame testing
- RAL may join the test by enabling jumbo frames on (some) WNs (e.g. preprod farm), see GSTSM-328.
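If jumbo frames are enabled on some WNs, a quick way to confirm the setting on a given node is to read the interface MTUs. The sketch below assumes a Linux host exposing /sys/class/net and treats any MTU above the standard Ethernet value of 1500 bytes as jumbo; the helper names are hypothetical, and it does not check whether the full network path to CERN supports the larger frames.

```python
import os

STANDARD_ETHERNET_MTU = 1500  # bytes; jumbo frames are commonly configured as 9000


def is_jumbo(mtu):
    """Return True if the MTU exceeds the standard Ethernet MTU."""
    return mtu > STANDARD_ETHERNET_MTU


def read_mtus(sys_net="/sys/class/net"):
    """Return {interface: mtu} for interfaces found under sys_net (Linux only)."""
    mtus = {}
    if not os.path.isdir(sys_net):
        return mtus
    for iface in sorted(os.listdir(sys_net)):
        try:
            with open(os.path.join(sys_net, iface, "mtu")) as f:
                mtus[iface] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # not an interface directory, or unreadable value
    return mtus


if __name__ == "__main__":
    for iface, mtu in read_mtus().items():
        label = "jumbo" if is_jumbo(mtu) else "standard"
        print(f"{iface}: mtu={mtu} ({label})")
```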