RAL Tier1 Experiments Liaison Meeting
Access Grid
Experiment Operational Issues 1m
VO-Liaison ATLAS 10m
Speakers: Brij Kishor Jashal (TIFR, RAL, IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
VO Liaison CMS 10m
Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
On Monday the AAA machines all went from completely green SAM tests to constantly red. The 'federation' test was failing, with an error showing it was unable to connect to the global redirector. ARC-CE tests also failed constantly in the same period because the associated 'xrootd-access' test, which uses the AAA machines, was failing. This came on top of the intermittent ARC-CE SAM test failures at submission, which have been ongoing for some months but became worse over Christmas / New Year / January.
Tom Birkett has been following up a suspected network/firewall problem with DI. It was suspected that this could be causing the intermittent ARC-CE test failures, along with many other observed problems, such as a variable number of CMS jobs running despite work being available, a lack of ATLAS jobs running, and general slowness on Tier 1 machines.
On Tuesday morning around 10:30 DI made a change, removing one port from a network component. After this, many or all of the above problems seemed miraculously fixed or improved immediately!
AAA tests went green; the ARC-CE xrootd-access test went green; intermittent submission failures are looking much better.
UPDATE, Wed morning: AAA tests went red again last night. Jyothish did some clean-up and restarts, and tests are going green again.
Note: where CMS jobs have run, performance has in general been good, except that from Monday night into Tuesday there was a spike of Production job failures, with jobs attempting to read remotely and getting a FileOpen error.
AAA OOM errors under high load are still to be followed up, as is the problem with svc20 continuously dropping its monitoring in Vande.
CMS / CERN IT jumbo frames testing ongoing all week.
VO Liaison LHCb 10m
Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
/* Sorry, I am on leave on 15.01, so the data below may be outdated; valid as of the evening of 14.01 */
News:
- Data reprocessing campaign hopefully ends mid-February (except for CNAF, which had many problems, though that is not relevant for us)
- Therefore UK DC at the end of February/beginning of March should be fine
Operational issues:
- nproc limit issue is fixed
  - Pilot restores the original limit after gfal context creation
- Lots of failed WGProduction (direct access) jobs on Tuesday morning (see job plots)
  - Jobs used xrootd-5.3.1 for streaming; this version has a bug that causes all vector reads with more than one chunk in the request to fail (see this ticket for details)
  - So, not our fault; LHCb is informed and will update the application linkage to a newer xrootd version
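As an illustration of the nproc fix noted above, the idea of restoring the original limit after context creation could be sketched roughly as follows. This is only a sketch, not the pilot's actual code: it assumes the context creation changes the process limit as a side effect (which the notes imply but do not spell out), and both `preserve_nproc_limit` and `fake_context_creation` are hypothetical names made up for this example. It uses only Python's standard `resource` module and runs on Unix-like systems.

```python
import contextlib
import resource


@contextlib.contextmanager
def preserve_nproc_limit():
    """Save RLIMIT_NPROC on entry and put it back on exit (hypothetical helper)."""
    original = resource.getrlimit(resource.RLIMIT_NPROC)
    try:
        yield original
    finally:
        # Only touch the limit if something changed it in the meantime.
        # Note: an unprivileged process cannot raise a hard limit that was
        # lowered, so this restore assumes only the soft limit was altered.
        if resource.getrlimit(resource.RLIMIT_NPROC) != original:
            resource.setrlimit(resource.RLIMIT_NPROC, original)


def fake_context_creation():
    """Hypothetical stand-in for a library call that lowers the soft nproc limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    if soft == resource.RLIM_INFINITY:
        lowered = 1024
    elif soft > 1:
        lowered = soft // 2
    else:
        lowered = soft
    # Keep the hard limit unchanged so the soft limit can be raised again later.
    resource.setrlimit(resource.RLIMIT_NPROC, (lowered, hard))


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NPROC))
    with preserve_nproc_limit():
        fake_context_creation()
        print("inside:", resource.getrlimit(resource.RLIMIT_NPROC))
    print("after: ", resource.getrlimit(resource.RLIMIT_NPROC))
```

Wrapping the call in a context manager means the limit is restored even if the wrapped call raises an exception.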
VO Liaison ALICE 10m
Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
VO Liaison LSST 10m
Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
Low activity over Christmas
Networking between the Butler DB and the BatchFarm is an issue. As changes are being made to the network soon, LSST was moved to the 2020 BF nodes on the new network last night so that jobs can access the Butler.
After draining old jobs this morning and disabling job types that were not affected, only the "DC2" jobs remain. These are currently long-running and none has finished so far (despite running for nearly 2 hours), because they remain on older nodes rather than the new ones specified.
VO Liaison APEL 10m
Speaker: Thomas Dack
WP-D - GPU, Data Management, Other 10m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
Major Incidents Changes 1m
Summary of Operational Status and Issues 10m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
AOB 1m
Any other Business 10m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore