RAL Tier1 Experiments Liaison Meeting

Start: 2024-07-24T13:30:00+01:00
End: 2024-07-24T15:00:00+01:00
Location: RAL R89

Wednesday 24 Jul 2024, 13:30 → 15:00 Europe/London

Access Grid (RAL R89)

66811541532

Alastair Dewhurst

- 14:00 → 14:01
  
  Major Incidents Changes 1m
- 14:05 → 14:06
  
  Summary of Operational Status and Issues 1m
  
  Speakers: Brian Davies (Lancaster University (GB)), Darren Moore, Kieran Howlett (STFC RAL)
  
  Weekly Report 24 July 2024.docx
  
  Weekly Report 24 July 2024.pdf
- 14:10 → 14:11
  
  Experiment Operational Issues 1m
- 14:15 → 14:16
  VO Liaison CMS 1m
  
  Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
  The 'fetch-crl' command went missing (https://stfc.atlassian.net/l/cp/01CAs1Gv) and many CRLs expired on Sunday at about 5pm, resulting in failures in the SAM tests on all CEs, Echo webdav (not xrootd), Antares xrootd (not webdav) and all the AAA machines. Echo webdav was partially fixed around midnight on Sunday, with intermittent failures after that. Everything was fixed before lunch on Monday.
  
  About 750TB of CMS data for Antares was in backlog due to a CMS WM bug. Rules that should have been distributed over the previous month were created for approval on Wednesday afternoon. About 2000 of them were approved on Thursday, with the data starting to hit Antares from Thursday night, and the remaining 2500 were approved on Friday afternoon. Some issues, for the record:
  
  Errors of the type 'MGM is down' on Thursday night/Friday morning. George realised that a test setup had been accidentally deployed on prod. This was quickly reverted and those errors stopped. The rate to the Antares buffer went up to 15-20GB/s! (There were worries about the buffer filling up, but George said it never did.)
  
  A large number of 'file exists' errors - these usually happen as a result of previous problems.
  
  Friday evening - CMS DM reported that Echo was in danger of going over pledge. Deletions were too slow to keep up with Rucio submissions for data multihopping through Echo (which most of it was doing). A bespoke Rucio-reaper was deployed to work only on RAL, as we did during DC24, and this hugely improved the deletion rate. At the same time, the number of inbound transfers to Echo in FTS was reduced from 1000 to 200. Space used on Echo stabilised. The rate to Antares reduced greatly - possibly some of the 200 transfers to Echo were not also going to tape.
  
  Transfers to Echo stopped on Sunday evening due to the fetch-crl problem mentioned above.
  
  Monday morning - the tape robot arms broke; they were fixed on Tuesday afternoon. Katy is waiting for the backlog from multiple VOs to subside before trying to increase the FTS inbound transfers to Echo again.
  
  As a result of the above, SAM status was red on Friday, Sunday, Monday and Tuesday. Katy removed CMS from drain on Monday after the CRL problem was fixed. Interestingly, because other VOs started draining before CMS, CMS picked up many slots (35k at max!) before draining from that high value. Job performance remained good during this time. CMS went into drain again this morning (Wed); Katy removed the drain status before lunch.
  
  The 'production' Shoveler instance has been switched from Cloud to VMware, which is more resilient for a production service.
  
  IPv6 connectivity of the AAA machines: a ticket has been sent to DI. Katy cannot view it but has asked the service desk to progress it.
- 14:20 → 14:21
  
  VO Liaison ATLAS 1m
  
  Speakers: Brij Kishor Jashal (RAL, TIFR and IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
  
  ATLAS Liaison notes
- 14:25 → 14:26
  VO Liaison LHCb 1m
  
  Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
  
  gw_issues.pdf
  
  Tickets:
  - CRL update issue: solved
  - WN GW issues: see slides
  
  Operational issues:
  - Antares downtime due to robot breakdown
  - Xrootd bug follow-up?
- 14:30 → 14:33
  
  VO Liaison LSST 3m
  
  Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
- 14:35 → 14:36
  
  VO Liaison APEL 1m
  
  Speaker: Thomas Dack
- 14:39 → 14:40
  
  VO Liaison Others 1m
  
  Speakers: Alexander Rogovskiy (Rutherford Appleton Laboratory), Brij Kishor Jashal (RAL, TIFR and IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory), Katy Ellis (Science and Technology Facilities Council STFC (GB))
- 14:45 → 14:46
  
  AOB 1m
- 14:50 → 14:51
  
  Any other Business 1m
  
  Speakers: Brian Davies (Lancaster University (GB)), Darren Moore