RAL Tier1 Experiments Liaison Meeting
Access Grid
RAL R89
-
14:00
→
14:01
Major Incidents Changes 1m
-
14:05
→
14:06
Summary of Operational Status and Issues 1m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore, Kieran Howlett (STFC RAL)
-
14:10
→
14:11
Experiment Operational Issues 1m
-
14:15
→
14:16
VO Liaison CMS 1m
Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
The 'fetch-crl' command went missing (https://stfc.atlassian.net/l/cp/01CAs1Gv) and many CRLs expired on Sunday at about 5pm. SAM tests then failed on all CEs, Echo webdav (not xrootd), Antares xrootd (not webdav) and all the AAA machines. Echo webdav was partially fixed around midnight on Sunday, with intermittent failures after that. Everything was fixed before lunch on Monday.
About 750TB of CMS data for Antares was in backlog due to a CMS WM bug. Rules that should have been distributed over the previous month were created for approval on Wednesday afternoon. About 2000 of them were approved on Thursday, with the data starting to hit Antares from Thursday night. The remaining 2500 rules were approved on Friday afternoon. Some issues, for the record:
- Errors of the type 'MGM is down' on Thursday night/Friday morning. George realised that a test setup had been accidentally deployed on prod. This was quickly reverted and those errors stopped. The rate to the Antares buffer went up to 15-20GB/s! (There were worries about the buffer filling up, but George said it never did.)
- A large number of 'file exists' errors - these usually happen as a result of previous problems.
- Friday evening - CMS DM reported that Echo was in danger of going over pledge. Deletions were too slow to keep up with Rucio submissions for data multihopping through Echo (most of the data was doing that). A bespoke Rucio-reaper was deployed to work only on RAL, as we did during DC24; this hugely improved the deletion rate. At the same time, the FTS limit on inbound transfers to Echo was reduced from 1000 to 200. Space used on Echo stabilised. The rate to Antares reduced greatly - possibly some of the 200 transfers to Echo were not also going to tape.
- Transfers to Echo stopped on Sunday evening due to the fetch-crl problem mentioned above.
- Monday morning - tape robot arms were broken; they were fixed on Tuesday afternoon. Katy is waiting for the backlog from multiple VOs to subside before trying to increase the FTS inbound transfers to Echo again.
As a result of the above, SAM status was red on Friday, Sunday, Monday and Tuesday. Katy removed CMS from drain on Monday after the CRL problem was fixed. Interestingly, because other VOs started draining before CMS, CMS picked up many slots (35k at max!) before draining from that high value. Job performance remained good during this time. CMS went into drain again this morning (Wed); Katy removed the drain status before lunch.
The 'production' Shoveler instance has been switched from Cloud to VMWare, which is more resilient for a production service.
IPv6 connectivity of the AAA machines: a ticket has been sent to DI. Katy cannot view it, but has asked the service desk to progress it.
-
14:20
→
14:21
VO Liaison ATLAS 1m
Speakers: Brij Kishor Jashal (RAL, TIFR and IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
-
14:25
→
14:26
VO Liaison LHCb 1m
Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
-
14:30
→
14:33
VO Liaison LSST 3m
Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
-
14:35
→
14:36
VO Liaison APEL 1m
Speaker: Thomas Dack
-
14:39
→
14:40
VO Liaison Others 1m
Speakers: Alexander Rogovskiy (Rutherford Appleton Laboratory), Brij Kishor Jashal (RAL, TIFR and IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory), Katy Ellis (Science and Technology Facilities Council STFC (GB))
-
14:45
→
14:46
AOB 1m
-
14:50
→
14:51
Any other Business 1m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore