RAL Tier1 Experiments Liaison Meeting
Access Grid
RAL R89
-
14:00
→
14:01
Major Incidents Changes 1m
-
14:05
→
14:06
Summary of Operational Status and Issues 1m Speakers: Brian Davies (Lancaster University (GB)), Darren Moore, Kieran Howlett (STFC RAL)
-
14:10
→
14:11
Experiment Operational Issues 1m
-
14:15
→
14:16
VO Liaison CMS 1m Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
AAA machines lost the ability to authenticate with certificates, for reasons that remain unclear. Jyothish fixed it today. Certificate SAM tests failed for a couple of days, while token tests remained green.
Problems with Echo gateways since last Wednesday and over the weekend. Particular gateways were seen to be failing and timing out. These gateways were removed to mitigate the problem. SAM tests had a red overall status on Wednesday, Friday and Saturday.
CMS went into production drain. However, we kept our slots because of 'Tier 0' jobs that are (still) not respecting the site status. In this case everything was fine: the jobs performed excellently.
I believe the IPv6 inaccessibility problem with AAA was fixed by DI. The problem also affected other machines not using LHCONE or LHCOPN.
Seeing a 'glitch' most days in SAM tests, affecting CE tests and sometimes others as well. Possible network disconnection? Error is:
"Job completed but failed to get job output"
-
14:20
→
14:21
VO Liaison ATLAS 1m Speakers: Brij Kishor Jashal (RAL, TIFR and IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
-
14:25
→
14:26
VO Liaison LHCb 1m Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
Tickets:
- Failed uploads to ECHO (GGUS 167781)
  - Gateways were unstable throughout the weekend
  - Some files are lost as a result
    - 219 files identified during consistency check; most of them had only 1 replica at RAL, so they are lost for good
  - Some files are corrupted
    - Need to retrieve checksums of all LHCb files on ECHO
      - Is there anything to consider from an operational perspective before doing this?
- Failed downloads/direct access requests from ECHO (GGUS 167617)
  - New restart script was deployed to preprod farm last week
  - Sometimes jobs are still failing with "Cannot allocate memory" error
    - Makes sense, since turning on pgRead does not affect direct access requests
    - GitHub issue is to be opened
Operational issues:
- XRootD bug follow-up?
-
14:30
→
14:33
VO Liaison LSST 3m Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
-
14:35
→
14:36
VO Liaison APEL 1m Speaker: Thomas Dack
-
14:39
→
14:40
VO Liaison Others 1m Speakers: Alexander Rogovskiy (Rutherford Appleton Laboratory), Brij Kishor Jashal (RAL, TIFR and IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory), Katy Ellis (Science and Technology Facilities Council STFC (GB))
-
14:45
→
14:46
AOB 1m
-
14:50
→
14:51
Any other Business 1m Speakers: Brian Davies (Lancaster University (GB)), Darren Moore