RAL Tier1 Experiments Liaison Meeting

Name: RAL Tier1 Experiments Liaison Meeting
Start: 2023-02-01T12:30:00+00:00
End: 2023-02-01T14:30:00+00:00
Location: RAL R89

Wednesday 1 Feb 2023, 12:30 → 14:30 Europe/London

Access Grid (RAL R89)

66811541532

Alastair Dewhurst

- 13:00 → 13:01
  
  Major Incidents Changes 1m
- 13:01 → 13:02
  
  Summary of Operational Status and Issues 1m
  
  Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
  
  Weekly Report 01 February 2023.docx
  
  Weekly Report 01 February 2023.pdf
- 13:02 → 13:03
  
  GGUS/RT Tickets 1m
  
  https://tinyurl.com/T1-GGUS-Open
  https://tinyurl.com/T1-GGUS-Closed
- 13:04 → 13:05
  
  Site Availability 1m
  
  https://lcgwww.gridpp.rl.ac.uk/utils/availchart/
  
  https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
  
  http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden
- 13:05 → 13:06
  
  Experiment Operational Issues 1m
- 13:15 → 13:16
  
  VO Liaison CMS 1m
  
  Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
  
  Screenshot 2023-02-01 at 13.08.43.png
  
  Screenshot 2023-02-01 at 13.09.19.png
  
  Images show new tables I have made in Kibana/OpenSearch showing the number of failures per worker node (WN) over recent days. I added one image for a period of 15 days and another for 16 days, because there seems to have been a huge number of failures on one WN 16 days ago. Looking at just the last 15 days, there is no particular problem with any one WN. A few WNs show more than 20% error. A few show more than 50% error, but these are running relatively few jobs, possibly all SAM test jobs running on ML cores rather than multicore.
  
  Another occurrence of the DNS issue this morning (third apparent appearance in 2 weeks). However, today this could be attributed to some work being done by DI, e.g. https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=477872
  
  SAM tests have been failing due to the above, but have been going green in the last few hours. Likewise, production transfer efficiency is coming back.
  
  Farm is low on capacity due to WN firewall updates.
  
  Job failure rate and efficiency have been good over the last week.
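
The per-worker-node failure tables described in the CMS notes above could be cross-checked offline along the following lines. This is only a minimal sketch, not the actual Kibana/OpenSearch query: it assumes job records have already been exported as `(worker_node, succeeded)` pairs, the hostnames are hypothetical, and the `min_jobs` cut-off is an assumed parameter added to filter out nodes that ran very few jobs. The 20% and 50% thresholds are the ones quoted in the notes.

```python
from collections import defaultdict


def failure_rates(jobs):
    """Aggregate (worker_node, succeeded) pairs into per-node statistics.

    Returns {worker_node: (total_jobs, failed_jobs, failure_fraction)}.
    """
    counts = defaultdict(lambda: [0, 0])  # node -> [total, failed]
    for node, succeeded in jobs:
        counts[node][0] += 1
        if not succeeded:
            counts[node][1] += 1
    return {
        node: (total, failed, failed / total)
        for node, (total, failed) in counts.items()
    }


def flag_nodes(rates, threshold=0.2, min_jobs=10):
    """Return the sorted nodes whose failure fraction exceeds `threshold`.

    `min_jobs` (an assumed cut-off) skips nodes with too few jobs for the
    fraction to be meaningful.
    """
    return sorted(
        node
        for node, (total, _failed, fraction) in rates.items()
        if total >= min_jobs and fraction > threshold
    )


if __name__ == "__main__":
    # Hypothetical records: wn-a is healthy, wn-b is failing at volume,
    # wn-c has a high failure fraction but only three jobs.
    sample = (
        [("wn-a", True)] * 18 + [("wn-a", False)] * 2
        + [("wn-b", True)] * 12 + [("wn-b", False)] * 8
        + [("wn-c", True)] * 1 + [("wn-c", False)] * 2
    )
    rates = failure_rates(sample)
    print("over 20% with at least 10 jobs:", flag_nodes(rates))
    print("over 50% with any job count:", flag_nodes(rates, 0.5, 1))
```

Separating the job-count cut-off from the failure threshold makes it easy to tell a node that is genuinely unhealthy apart from one that only looks bad because it ran a handful of test jobs.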
- 13:16 → 13:17
  VO Liaison ATLAS 1m
  
  Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
  ATLAS-wide Rucio issues on Monday affected job stage-in and stage-out. Believed to be due to overloading of servers with presigning of URLs on non-standard storage.
  DNS name resolution failed on Wednesday morning, ~ 30
- 13:20 → 13:21
  
  VO Liaison LHCb 1m
  
  Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
  
  1f3bf2d6aee092f919ee7305b8e2af53.png
  
  6f154ece7fd9c89b89048d3b13b4991a(1).png
  
  a925e2cd59d6f8701c62dde2fba5150a.png
  
  pres_liaisons.pdf
  
  Very low number of running jobs, due to lack of production requests from LHCb.
  
  Vector read patch production testing has started. The patch does not resolve the problem.
- 13:25 → 13:28
  
  VO Liaison LSST 3m
  
  Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
- 13:30 → 13:31
  
  VO Liaison Others 1m
- 13:31 → 13:32
  
  AOB 1m
- 13:32 → 13:33
  
  Any other Business 1m
  
  Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))