RAL Tier1 Experiments Liaison Meeting

Name: RAL Tier1 Experiments Liaison Meeting
Start: 2025-01-15T13:30:00+00:00
End: 2025-01-15T15:50:00+00:00
Location: RAL R89

Wednesday 15 Jan 2025, 13:30 → 15:50 Europe/London

Access Grid (RAL R89)

66811541532

Alastair Dewhurst

- 13:30 → 13:31
  
  Experiment Operational Issues 1m
- 13:35 → 13:45
  
  VO-Liaison ATLAS 10m
  
  Speakers: Brij Kishor Jashal (TIFR, RAL, IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
  
  ATLAS liaison meeting notes Brij
- 13:45 → 13:55
  
  VO Liaison CMS 10m
  
  Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
  
  Screenshot 2025-01-14 at 16.49.46.png
  
  Screenshot 2025-01-14 at 16.56.50.png
  
  On Monday, the AAA machines all went from completely green SAM tests to constantly red. The 'federation' test was failing, with the error showing that it was unable to connect to the global redirector. ARC-CE tests also failed constantly in the same period because of failures of the associated 'xrootd-access' test, which uses the AAA machines. This came on top of the intermittent ARC-CE SAM test failures at submission, which have been ongoing for some months but became worse over Christmas / New Year / January.
  
  Tom Birkett has been following up a suspected network/firewall problem with DI. There was a suspicion that the intermittent ARC-CE test failures could be caused by this, along with many other observed problems, such as a variable number of CMS jobs running despite work being available, a lack of ATLAS jobs running, general slowness in Tier 1 machines, etc.
  
  On Tuesday morning, around 10:30, DI made a change by removing one port from a network component. After this, many or all of the above problems seemed miraculously fixed or improved immediately!
  
  AAA tests went green; the ARC-CE xrootd-access test went green; intermittent submission failures are looking much better.
  
  UPDATE, Wed morning: AAA tests went red again last night. Jyothish did some clean-up and restarts, and the tests are going green again.
  
  Note: where CMS jobs have run, performance has generally been good, except that from Monday night into Tuesday there was a spike of Production job failures, in which jobs attempted to read remotely and got a FileOpen error.
  
  AAA OOM errors under high load are still to be followed up, as is the problem with svc20 continuously dropping its monitoring in Vande.
  
  CMS / CERN IT jumbo frames testing ongoing all week.
- 13:55 → 14:05
  
  VO Liaison LHCb 10m
  
  Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
  
  jobs.png
  
  transfers_from.png
  
  transfers_to.png
  
  Sorry, I am on leave on 15.01, so the data below may be outdated; it was valid as of the evening of 14.01.
  
  News:
  
  The data reprocessing campaign hopefully ends in mid-February (except for CNAF, which had many problems, though that is not relevant for us)
  
  Therefore UK DC at the end of February/beginning of March should be fine
  
  Operational issues:
  
  nproc limit issue is fixed
  
  Pilot restores the original limit after gfal context creation
  
  Lots of failed WGProduction (direct access) jobs on Tuesday morning (see job plots)
  
  Jobs used xrootd-5.3.1 for streaming; this version has a bug that causes all vector reads with more than one chunk in the request to fail (see this ticket for details)
  
  So, not our fault; LHCb is informed and will update the application linkage to a newer xrootd version
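
The nproc item above says the pilot now restores the original limit after gfal context creation. As a rough illustration only, the save-and-restore pattern could look like the sketch below. It uses Python's standard `resource` module; the names `preserve_nproc_limit` and `lower_soft_nproc_limit` are hypothetical, and the second function is merely a stand-in for whatever library call changes the limit. It is not the pilot's actual code.

```python
import contextlib
import resource


@contextlib.contextmanager
def preserve_nproc_limit():
    """Record RLIMIT_NPROC on entry and put it back on exit (hypothetical helper)."""
    original = resource.getrlimit(resource.RLIMIT_NPROC)  # (soft, hard)
    try:
        yield original
    finally:
        # Raising the soft limit back is allowed for an unprivileged process
        # as long as the hard limit was left untouched.
        resource.setrlimit(resource.RLIMIT_NPROC, original)


def lower_soft_nproc_limit():
    """Stand-in for a library call that reduces the soft limit as a side effect.

    This is an assumption made for the example, not real gfal behaviour.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    if soft == resource.RLIM_INFINITY:
        lowered = 4096
    elif soft > 1:
        lowered = soft // 2
    else:
        lowered = soft  # nothing sensible to lower to
    resource.setrlimit(resource.RLIMIT_NPROC, (lowered, hard))
    return lowered


if __name__ == "__main__":
    before = resource.getrlimit(resource.RLIMIT_NPROC)
    with preserve_nproc_limit():
        lower_soft_nproc_limit()
    after = resource.getrlimit(resource.RLIMIT_NPROC)
    print("restored:", before == after)
```

The sketch only works if the intervening call leaves the hard limit alone; a lowered hard limit cannot be raised again without privileges.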
- 14:10 → 14:20
  
  VO Liaison ALICE 10m
  
  Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
- 14:20 → 14:30
  
  VO Liaison LSST 10m
  
  Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
  
  Low activity over Christmas
  
  Networking between the Butler DB and the BatchFarm is an issue. As changes are being made to the network soon, LSST was moved to 2020 BF nodes on the new network last night so that they may have access to the Butler
  
  After draining old jobs this morning and disabling job types that were not affected, only the "DC2" jobs remain. These are currently long-running, and none have finished so far (despite running for nearly 2 hours) because they remain on older nodes rather than the new ones specified
- 14:30 → 14:40
  
  VO Liaison APEL 10m
  
  Speaker: Thomas Dack
- 14:45 → 14:55
  
  WP-D - GPU, Data Management, Other 10m
  
  Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
- 15:00 → 15:01
  
  Major Incidents Changes 1m
- 15:05 → 15:15
  
  Summary of Operational Status and Issues 10m
  
  Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
  
  Weekly Report 15 January 2025.docx
  
  Weekly Report 15 January 2025.pdf
- 15:20 → 15:21
  
  AOB 1m
- 15:22 → 15:32
  
  Any other Business 10m
  
  Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
