RAL Tier1 Experiments Liaison Meeting
Access Grid, RAL R89
-
13:00 → 13:01  Major Incidents Changes 1m
-
13:01 → 13:02  Summary of Operational Status and Issues 1m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB)), Kieran Howlett (STFC RAL)
-
13:02 → 13:03  GGUS/RT Tickets 1m
https://tinyurl.com/T1-GGUS-Open
https://tinyurl.com/T1-GGUS-Closed
-
13:04 → 13:05  Site Availability 1m
https://lcgwww.gridpp.rl.ac.uk/utils/availchart/
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden
-
13:05 → 13:06  Experiment Operational Issues 1m
-
13:15 → 13:16  VO Liaison CMS 1m
Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
Katy was at CHEP for the last 2 meetings.
Echo problems from Friday until yesterday. These were originally thought to be related to the reweighting of new disk hardware, but were later also attributed to the vRead change hitting Echo with more requests than normal; the number of IOPS was too high. SAM tests were red on Friday and Saturday. Katy put CMS into drain as jobs were failing at a high rate (many more stage-out errors), and transfers were also failing. On Sunday the tests were green because the load had been removed, so Katy put CMS back into production.
On Monday and Tuesday the SAM tests failed again and CMS went back into drain automatically. On Tuesday afternoon the WN-xrootd-access test (accessing Echo) continued to fail, while all other tests were green after the vRead changes were removed; the xrootd-access test files were themselves accessible. The xrootd-access tests only started passing again about 5 hours after the other tests went green. This delay in passing tests after the end of an incident has been observed several times before; there is a suspicion that it is related to the AAA redirector being blacklisted for too long (a known issue?).
Batch farm upgrades have been ongoing for the last week and a half, with several half-batch-farm drains. CMS are currently (still) capped at 8k cores due to the suspected pressure on the network in recent weeks. This cap should be released when LHCONE is moved off Janet.
To Do: test Tape REST API
-
13:16 → 13:17  VO Liaison ATLAS 1m
Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
ATLAS recovered from the weekend's issue with Echo
- SHEF and OX were also affected
- Some cleanup of residual files may be needed
Ran the first test of the Tape REST API this morning with (test) production ATLAS traffic:
Writes were run (e.g. https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/02904e96-f495-11ed-8ea4-fa163e5a92fb) and archiveinfo API calls were observed in the eso logs (a sketch of such a call is given below).
Will continue with read tests.
- Once confirmed, ATLAS will be keen to use this for production, and may also wish to try to remove multihop (discussions ongoing).
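For reference, the following is a minimal sketch of what an archiveinfo query against a WLCG Tape REST API endpoint looks like; the base URL, file path and credential locations below are placeholders and not the real RAL values.

    # Sketch of a WLCG Tape REST API archiveinfo call (placeholder endpoint).
    import requests

    TAPE_API = "https://tape-endpoint.example.ac.uk:8444/api/v1"  # placeholder, not the RAL endpoint

    resp = requests.post(
        f"{TAPE_API}/archiveinfo",
        json={"paths": ["/atlas/tape/some/test/file.root"]},   # placeholder path
        cert=("usercert.pem", "userkey.pem"),                   # X.509 auth; tokens are also possible
        verify="/etc/grid-security/certificates",
    )
    resp.raise_for_status()

    # Each entry reports whether the file has reached tape (e.g. locality
    # TAPE or DISK_AND_TAPE) or carries an error for that path.
    for entry in resp.json():
        print(entry["path"], entry.get("locality"), entry.get("error"))

Calls of this shape are, presumably, what FTS issues after a write to poll whether a file has been safely archived, which would explain why they show up in the logs during write tests.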
-
13:20 → 13:21  VO Liaison LHCb 1m
Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
- Echo problems due to increased IOPS rate after the vector read patch was applied
  - Fixed by rolling back the patch
  - Several corrupted files as a result
- Problems with uploads to Antares
  - Fixed
- Request to replace the service certificate with a host certificate on the vobox
  - Security implications should be considered
- Vector read
  - See slides attached (and the illustrative sketch below)
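For context, a vector read bundles many byte-range requests into a single client call. The sketch below, assuming the standard XRootD Python bindings, is purely illustrative (the URL and byte ranges are invented); it shows how one vector_read() call fans out into many small, scattered reads on the storage backend, which is consistent with the increased IOPS reported on Echo.

    # Illustrative vector read from the client side (invented URL and ranges).
    from XRootD import client
    from XRootD.client.flags import OpenFlags

    f = client.File()
    status, _ = f.open("root://gateway.example//store/test/file.root", OpenFlags.READ)
    assert status.ok, status.message

    # One request, many (offset, length) chunks: each chunk becomes a
    # separate small read against the underlying object store.
    chunks = [(0, 4096), (1_048_576, 4096), (5_242_880, 4096), (9_437_184, 4096)]
    status, vec_info = f.vector_read(chunks=chunks)
    assert status.ok, status.message

    for chunk in vec_info.chunks:
        print(chunk.offset, chunk.length)

    f.close()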
-
13:25 → 13:28  VO Liaison LSST 3m
Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
-
13:30 → 13:31  VO Liaison Others 1m
-
13:31 → 13:32  AOB 1m
-
13:32 → 13:33  Any Other Business 1m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))