RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)


Zoom Meeting ID: 66811541532
Host: Alastair Dewhurst
    • 13:30–13:31
      Experiment Operational Issues 1m
    • 13:35–13:45
      VO Liaison ATLAS 10m
      Speakers: Brij Kishor Jashal (TIFR, RAL, IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
    • 13:45–13:55
      VO Liaison CMS 10m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      A quiet Christmas period. 

      In December there was an issue with AAA when under load; OOM errors were observed. To be followed up.

      The number of failures on the ARC-CEs increased markedly over the holiday and has stayed high since. The SAM status never goes red, though, because all 5 CEs would have to fail simultaneously, so we have been lucky so far. Tom Birkett is investigating.

      The number of jobs being run by CMS is suspiciously variable, despite there being available work in the system. Could this be related to the CE failures above?

      Job performance is as good as, or better than, at the other CMS T1s.

      Tier 2 mini-DC testing was done in the week of 9th December. Tier 1 tests are to be planned.

    • 13:55–14:05
      VO Liaison LHCb 10m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      The main problem that affected RAL during the break, and is still affecting it now, is the nproc limit and its consequences (tracked in GSTSM-277); see the attached job plot. The limit is set by xrootd whenever the client is imported (note that gfal context creation implies importing the xrootd client, and this is how LHCb pilots set the limit for themselves). A GitHub issue has been opened, but it is not progressing much. Furthermore, even an upstream fix may not resolve the problem fully, since some versions of LHCb software are strictly tied to particular xrootd versions, and this linkage cannot be broken (e.g. no changes can be made to the Run 1/2 data-processing software). Therefore, we should think about possible mitigations.
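      As a hedged, illustrative way to observe this effect locally (not from the meeting; it assumes gfal2-python with its xrootd plugin is installed), one can print RLIMIT_NPROC before and after creating a gfal2 context:

```python
# Diagnostic sketch (assumption: gfal2-python and the xrootd client
# plugin are installed). Prints the process nproc limit before and
# after gfal2 context creation, the step that imports the xrootd client.
import resource

def nproc():
    return resource.getrlimit(resource.RLIMIT_NPROC)  # (soft, hard)

print("before gfal2:", nproc())

import gfal2  # noqa: E402 -- import placed late on purpose

ctx = gfal2.creat_context()  # plugin loading pulls in the xrootd client
print("after gfal2 context:", nproc())
```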

      The most straightforward one has already been applied this morning: the maximum number of LHCb jobs has been reduced to 10k. 10k is probably too low (below the pledge, even); can we use something like 30k? I believe we have been running 30k jobs without any problems before.

      As for more "proper" mitigations, maybe we can "randomize" users, e.g. map the LHCb pilot DN to multiple local users at random? In that case the processes should be spread ~evenly among the users, and hopefully the limit will not be hit. Priority reduction for particular users should not be an issue, since we run pilot jobs and do not care much if some pilots get stuck in the queue as long as we can run other ones.
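      Purely as an illustration of the idea (the account names and pool size below are made up, and in practice the mapping would live in the CE/authorisation layer, not in Python):

```python
# Illustration only: spread pilots from a single DN over a pool of
# hypothetical local accounts, so the per-user nproc limit is shared out.
import random

POOL_ACCOUNTS = [f"lhcbplt{i:02d}" for i in range(1, 21)]  # hypothetical names

def map_pilot_to_account(pilot_dn: str) -> str:
    # The DN value is deliberately ignored: every new pilot gets a random
    # account, so processes end up ~evenly spread across the pool.
    return random.choice(POOL_ACCOUNTS)

print(map_pilot_to_account("/DC=example/CN=lhcb-pilot"))  # fake DN
```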

      In the GitHub issue it was proposed to use LD_PRELOAD, but that is probably not very reliable.

      Other suggestions welcome.

      On a different topic: a new certificate for the LHCb VO-box has been requested, so that it contains a SAN for the vobox alias.

      UPD: Chris H has just added the limit reset in DIRAC, so hopefully the impact of the issue will decrease soon. That does not mean we should not put mitigations in place, though (see the direct access description above).
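      For reference, a minimal sketch of what such a reset could look like, assuming the fix restores the soft nproc limit to the hard limit after the xrootd client import (this is not the actual DIRAC code):

```python
# Sketch, not the DIRAC implementation: restore the nproc soft limit
# after the xrootd client import has (possibly) lowered it.
import resource

from XRootD import client  # noqa: F401 -- the import that can lower the limit

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
if soft != hard:
    resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))
    print(f"nproc soft limit restored: {soft} -> {hard}")
```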

    • 14:10–14:20
      VO Liaison ALICE 10m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
    • 14:20–14:30
      VO Liaison LSST 10m
      Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
    • 14:30–14:40
      VO Liaison APEL 10m
      Speaker: Thomas Dack
    • 14:45–14:55
      WP-D - GPU, Data Management, Other 10m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
    • 15:00–15:01
      Major Incidents Changes 1m
    • 15:05–15:15
      Summary of Operational Status and Issues 10m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
    • 15:20–15:21
      AOB 1m
    • 15:22–15:32
      Any other Business 10m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore