A quiet Christmas period.
In December there was an issue with AAA when under load; OOM errors were observed. To be followed up.
The number of failures on the ARC-CEs increased markedly over the holiday and since then. The SAM status never goes red, though, because all 5 CEs would have to fail simultaneously, so we have been lucky so far. Tom Birkett is investigating.
The number of jobs being run by CMS is suspiciously variable, despite there being available work in the system. Could this be related to the CE failures above?
Job performance is as good or better than other CMS T1s.
Tier 2 mini-DC testing was done in week of 9th December. Tier 1 tests to be planned.
The main problem that affected RAL during the break, and is still affecting it now, is the nproc limit and its consequences (tracked in GSTSM-277); see the attached job plot. The limit is set by xrootd whenever the client is imported (note that gfal context creation implies importing the xrootd client, and this is how the LHCb pilots set the limit for themselves). A GitHub issue is open, but it is not progressing much. Furthermore, even an upstream fix may not resolve the problem fully, since some versions of the LHCb software are strictly linked to certain xrootd versions and this linkage cannot be broken (e.g. no changes can be made to the Run-[12] data-processing software). Therefore, we should think about possible mitigations.
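To confirm on a worker node whether the import really lowers the limit, one can compare RLIMIT_NPROC before and after importing the client. A minimal diagnostic sketch (the `XRootD` Python module is the assumed trigger here and may not be installed, hence the guard):

```python
import resource

# Soft/hard nproc limits before any xrootd-related import.
before = resource.getrlimit(resource.RLIMIT_NPROC)

try:
    # Importing the XRootD Python client is the suspected trigger
    # (gfal context creation pulls in the same client library).
    from XRootD import client  # noqa: F401
except ImportError:
    client = None  # module not installed; nothing to compare on this host

after = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"nproc limit before import: {before}, after: {after}")
```

Running this inside a pilot environment should show the soft limit dropping after the import if the node is affected.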
The most straightforward one has already been applied this morning: the maximum number of LHCb jobs has been reduced to 10k. 10k is probably too low (below the pledge, even); can we use something like 30k? I believe we have run 30k jobs without any problems before.
As for more "proper" mitigations, maybe we can "randomize" users, e.g. map the LHCb pilot DN to multiple users at random? In that case the processes should be spread roughly evenly among the users, and hopefully the limit will not be hit. Priority reduction for particular users should not be an issue, since we run pilot jobs and do not care much if some pilots get stuck in the queue as long as we can run other ones.
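The random-mapping idea could look something like the sketch below. The account names and pool size are made up for illustration; a real deployment would do this in the site's DN-to-account mapping layer:

```python
import random

# Hypothetical pool of local accounts for the LHCb pilot DN.
POOL_ACCOUNTS = [f"lhcbpilot{i:02d}" for i in range(1, 11)]

def map_dn_to_account(dn: str) -> str:
    """Map a pilot DN to a randomly chosen pool account, so that
    processes (and hence the per-user nproc count) spread evenly
    across accounts instead of piling up under a single user."""
    return random.choice(POOL_ACCOUNTS)

account = map_dn_to_account("/DC=ch/DC=cern/OU=computers/CN=lhcb-pilot")
print(account)
```

With N accounts, each account sees roughly 1/N of the pilot processes, so the per-user nproc limit would be hit N times later.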
In the GitHub issue it was proposed to use LD_PRELOAD, but that is probably not very reliable.
Other suggestions welcome.
On a different topic: a new certificate for the LHCb VO-box has been requested, so that it contains a SAN for the vobox alias.
UPD: Chris H has just added the limit reset in DIRAC, so hopefully the impact of the issue will decrease soon. That does not mean we should not put any mitigations in place, though (see the direct access description above).
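For reference, a limit reset of this kind presumably amounts to raising the soft nproc limit back up to the hard limit, which an unprivileged process is always allowed to do (the exact DIRAC change may differ):

```python
import resource

def reset_nproc_limit() -> tuple:
    """Raise the soft RLIMIT_NPROC back to the hard limit.
    Raising soft up to hard requires no privileges, so a pilot can
    undo a limit set as a side effect of the xrootd import."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NPROC)

print(reset_nproc_limit())
```

Note this only helps processes that run the reset; anything forked before it still inherits the lowered limit, so the other mitigations remain relevant.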