RAL Tier1 Experiments Liaison Meeting
Access Grid, RAL R89
-
13:00 → 13:01  Major Incidents Changes 1m
-
13:01 → 13:02  Summary of Operational Status and Issues 1m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB)), Kieran Howlett (STFC RAL)
-
13:02 → 13:03  GGUS/RT Tickets 1m
https://tinyurl.com/T1-GGUS-Open
https://tinyurl.com/T1-GGUS-Closed
-
13:04 → 13:05  Site Availability 1m
https://lcgwww.gridpp.rl.ac.uk/utils/availchart/
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden
-
13:05 → 13:06  Experiment Operational Issues 1m
-
13:15 → 13:16  VO Liaison CMS 1m
Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
Katy was at CHEP for the last 2 meetings.
Echo problems from Friday until yesterday. These were originally thought to be related to the reweighting of new disk hardware, but were later also attributed to the vRead change hitting Echo with more requests than normal; the number of IOPS was too high. SAM tests were red on Friday and Saturday. Katy put CMS into drain as jobs were failing at a high rate (many more stage-out errors), and transfers were also failing. On Sunday the tests were green because the load had been removed, so Katy put CMS back into production.
On Monday and Tuesday the SAM tests failed again and CMS went back into drain automatically. On Tuesday afternoon the WN-xrootd-access test (accessing Echo) continued to fail, while all other tests were green after the vRead changes were removed; the xrootd-access test files were themselves accessible. The xrootd-access tests only started passing again about 5 hours after the other tests went green. This delay in passing tests after the end of an incident has been observed several times before; there is a suspicion that it is related to the AAA redirector being blacklisted for too long (a known issue?).
Batch farm upgrades have been ongoing for the last week and a half, with several half-batch-farm drains. CMS are currently (still) capped at 8k cores due to the suspected pressure on the network in recent weeks. This cap should be released when LHCONE is moved off Janet.
To Do: test Tape REST API
-
13:16 → 13:17  VO Liaison ATLAS 1m
Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
ATLAS recovered from the weekend's issue with Echo
- SHEF and OX were also affected
- Some cleanup of residual files may be needed
Ran the first test of the Tape REST API this morning with (test) production ATLAS traffic:
Writes were run (e.g. https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/02904e96-f495-11ed-8ea4-fa163e5a92fb) and archiveinfo API calls were observed in the eso logs (a sketch of such a call is given below).
Will continue with read tests.
- Once confirmed, ATLAS will be keen to use this for production, and may also wish to try to remove multihop (discussions ongoing).
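For reference, the following is a minimal sketch of what an archiveinfo query against a WLCG Tape REST API endpoint looks like; the base URL, file path and credential locations below are placeholders and not the real RAL values.

    # Sketch of a WLCG Tape REST API archiveinfo call (placeholder endpoint).
    import requests

    TAPE_API = "https://tape-endpoint.example.ac.uk:8444/api/v1"  # placeholder, not the RAL endpoint

    resp = requests.post(
        f"{TAPE_API}/archiveinfo",
        json={"paths": ["/atlas/tape/some/test/file.root"]},   # placeholder path
        cert=("usercert.pem", "userkey.pem"),                   # X.509 auth; tokens are also possible
        verify="/etc/grid-security/certificates",
    )
    resp.raise_for_status()

    # Each entry reports whether the file has reached tape (e.g. locality
    # TAPE or DISK_AND_TAPE) or carries an error for that path.
    for entry in resp.json():
        print(entry["path"], entry.get("locality"), entry.get("error"))

Calls of this shape are, presumably, what FTS issues after a write to poll whether a file has been safely archived, which would explain why they show up in the logs during write tests.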
-
13:20 → 13:21  VO Liaison LHCb 1m
Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
- Echo problems due to increased IOPS rate after the vector read patch was applied
  - Fixed by rolling back the patch
  - Several corrupted files as a result
- Problems with uploads to Antares
  - Fixed
- Request to replace the service certificate with a host certificate on the vobox
  - Security implications should be considered
- Vector read
  - See slides attached (and the illustrative sketch below)
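For context, a vector read bundles many byte-range requests into a single client call. The sketch below, assuming the standard XRootD Python bindings, is purely illustrative (the URL and byte ranges are invented); it shows how one vector_read() call fans out into many small, scattered reads on the storage backend, which is consistent with the increased IOPS reported on Echo.

    # Illustrative vector read from the client side (invented URL and ranges).
    from XRootD import client
    from XRootD.client.flags import OpenFlags

    f = client.File()
    status, _ = f.open("root://gateway.example//store/test/file.root", OpenFlags.READ)
    assert status.ok, status.message

    # One request, many (offset, length) chunks: each chunk becomes a
    # separate small read against the underlying object store.
    chunks = [(0, 4096), (1_048_576, 4096), (5_242_880, 4096), (9_437_184, 4096)]
    status, vec_info = f.vector_read(chunks=chunks)
    assert status.ok, status.message

    for chunk in vec_info.chunks:
        print(chunk.offset, chunk.length)

    f.close()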
-
13:25 → 13:28  VO Liaison LSST 3m
Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
-
13:30 → 13:31  VO Liaison Others 1m
-
13:31 → 13:32  AOB 1m
-
13:32 → 13:33  Any Other Business 1m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))