RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)


Zoom Meeting ID: 66811541532
Host: Alastair Dewhurst
    • 13:30–13:31
      Experiment Operational Issues 1m
    • 13:35–13:45
      VO Liaison ATLAS 10m
      Speakers: Brij Kishor Jashal (TIFR, RAL, IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
    • 13:45–13:55
      VO Liaison CMS 10m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      A quiet Christmas period. 

      In December there was an issue with AAA when under load; OOM errors were observed. To be followed up.

      The number of failures on the ARC-CEs increased markedly over the holiday and has stayed high since. The SAM status never goes red, though, because all 5 CEs would have to fail simultaneously, so we have been lucky so far. Tom Birkett is investigating.

      The number of jobs being run by CMS is suspiciously variable, despite there being available work in the system. Could this be related to the CE failures above?

      Job performance is as good as, or better than, at the other CMS T1s.

      Tier 2 mini-DC testing was done in the week of 9th December. Tier 1 tests are to be planned.

    • 13:55–14:05
      VO Liaison LHCb 10m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      The main problem that affected RAL during the break, and is still affecting it now, is the nproc limit and its consequences (tracked in GSTSM-277); see the attached job plot. The limit is set by xrootd whenever the client is imported (note that gfal context creation implies importing the xrootd client, and this is how LHCb pilots set the limit for themselves). A GitHub issue has been opened, but it is not progressing much. Furthermore, even an upstream fix may not resolve the problem fully, since some versions of LHCb software are strictly tied to particular xrootd versions, and this linkage cannot be broken (e.g. no changes can be made to the Run 1/2 data-processing software). Therefore, we should think about possible mitigations.
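      As a hedged, illustrative way to observe this effect locally (not from the meeting; it assumes gfal2-python with its xrootd plugin is installed), one can print RLIMIT_NPROC before and after creating a gfal2 context:

```python
# Diagnostic sketch (assumption: gfal2-python and the xrootd client
# plugin are installed). Prints the process nproc limit before and
# after gfal2 context creation, the step that imports the xrootd client.
import resource

def nproc():
    return resource.getrlimit(resource.RLIMIT_NPROC)  # (soft, hard)

print("before gfal2:", nproc())

import gfal2  # noqa: E402 -- import placed late on purpose

ctx = gfal2.creat_context()  # plugin loading pulls in the xrootd client
print("after gfal2 context:", nproc())
```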

      The most straightforward one has already been applied this morning: the maximum number of LHCb jobs has been reduced to 10k. 10k is probably too low (below the pledge, even); can we use something like 30k? I believe we have been running 30k jobs without any problems before.

      As for more "proper" mitigations, maybe we can "randomize" users, e.g. map the LHCb pilot DN to multiple local users at random? In that case the processes should be spread ~evenly among the users, and hopefully the limit will not be hit. Priority reduction for particular users should not be an issue, since we run pilot jobs and do not care much if some pilots get stuck in the queue as long as we can run other ones.
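      Purely as an illustration of the idea (the account names and pool size below are made up, and in practice the mapping would live in the CE/authorisation layer, not in Python):

```python
# Illustration only: spread pilots from a single DN over a pool of
# hypothetical local accounts, so the per-user nproc limit is shared out.
import random

POOL_ACCOUNTS = [f"lhcbplt{i:02d}" for i in range(1, 21)]  # hypothetical names

def map_pilot_to_account(pilot_dn: str) -> str:
    # The DN value is deliberately ignored: every new pilot gets a random
    # account, so processes end up ~evenly spread across the pool.
    return random.choice(POOL_ACCOUNTS)

print(map_pilot_to_account("/DC=example/CN=lhcb-pilot"))  # fake DN
```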

      In the GitHub issue it was proposed to use LD_PRELOAD, but that is probably not very reliable.

      Other suggestions welcome.

      On a different topic: a new certificate for the LHCb VO-box has been requested, so that it contains a SAN for the vobox alias.

      UPD: Chris H has just added the limit reset in DIRAC, so hopefully the impact of the issue will decrease soon. That does not mean we should not put mitigations in place, though (see the direct access description above).
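      For reference, a minimal sketch of what such a reset could look like, assuming the fix restores the soft nproc limit to the hard limit after the xrootd client import (this is not the actual DIRAC code):

```python
# Sketch, not the DIRAC implementation: restore the nproc soft limit
# after the xrootd client import has (possibly) lowered it.
import resource

from XRootD import client  # noqa: F401 -- the import that can lower the limit

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
if soft != hard:
    resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))
    print(f"nproc soft limit restored: {soft} -> {hard}")
```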

    • 14:10–14:20
      VO Liaison ALICE 10m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
    • 14:20–14:30
      VO Liaison LSST 10m
      Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
    • 14:30–14:40
      VO Liaison APEL 10m
      Speaker: Thomas Dack
    • 14:45–14:55
      WP-D - GPU, Data Management, Other 10m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
    • 15:00–15:01
      Major Incidents Changes 1m
    • 15:05–15:15
      Summary of Operational Status and Issues 10m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
    • 15:20–15:21
      AOB 1m
    • 15:22–15:32
      Any other Business 10m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore