RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

    • 13:38 13:39
      Major Incidents / Changes 1m
    • 13:39 13:40
      Summary of Operational Status and Issues 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
    • 13:40 13:41
      GGUS/RT Tickets 1m

      https://tinyurl.com/T1-GGUS-Open
      https://tinyurl.com/T1-GGUS-Closed

    • 13:41 13:42
      Site Availability 1m

      https://lcgwww.gridpp.rl.ac.uk/utils/availchart/

      https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL

      http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden

    • 13:42 13:43
      Experiment Operational Issues 1m
    • 13:44 13:45
      VO Liaison ATLAS 1m
      Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Dr Tim Adye (Science and Technology Facilities Council STFC (GB))

      ATLAS back to 98% fairshare (from Vande).

       

    • 13:46 13:47
      VO Liaison CMS 1m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      Echo downtime yesterday seemed to go fine: SAM tests and PhEDEx transfers are green. CMS never dropped below its normal number of cores on the batch farm, and in fact picked up more than normal when cores were released by e.g. ATLAS. I speculate this was for two reasons:

      1. Some CMS jobs (pilots) last a really long time, and the downtime was only ~6 hours; the drain was started 24 hours in advance. Some input data may have been accessed via AAA where necessary.

      2. It appeared that only 'Production' jobs were put into drain. RAL picked up a lot of User Analysis jobs, many of which failed, but by no means all. 

      More failures than normal were seen in the FileOpen category. In normal running we do see some of these, but usually far more fall into the FileRead category. I don't believe there was any problem staging out data from completed jobs - none of those errors appeared. I didn't check explicitly whether this happened, but the design is to stage out to RALPP if the local storage is unavailable (a rough sketch of this fallback idea follows).
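
      The same try-local-then-fall-back pattern underlies both the AAA read access mentioned above and the stage-out-to-RALPP design. Purely as an illustrative sketch, not the actual CMS machinery (which is driven by the site configuration): the local endpoint name below is a made-up placeholder, while cms-xrd-global.cern.ch is the well-known AAA global redirector.

      import subprocess

      # Placeholder for the site's local xrootd door; NOT a real hostname.
      LOCAL_PREFIX = "root://local-echo-gateway.example//"
      # AAA global redirector (the usual CMS federation entry point).
      AAA_PREFIX = "root://cms-xrd-global.cern.ch//"

      def fetch(lfn, dest):
          """Try local storage first; fall back to AAA if the local copy fails."""
          for prefix in (LOCAL_PREFIX, AAA_PREFIX):
              result = subprocess.run(["xrdcp", prefix + lfn.lstrip("/"), dest])
              if result.returncode == 0:
                  return prefix  # report which source actually served the file
          raise RuntimeError("could not fetch %s from local storage or AAA" % lfn)

      # e.g. fetch("/store/data/example/file.root", "/tmp/file.root")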

      Job efficiency actually reacted positively at times! I believe this was due to all the Analysis jobs running, which typically have lower I/O requirements.

      I am making a document to provide evidence of all of this.

       

      Other stuff: IPv4 slowness is being investigated by DI. SAM tests for AAA have been failing intermittently - possibly network-related.

    • 13:48 13:49
      VO Liaison LHCb 1m
      Speaker: Raja Nandakumar (Science and Technology Facilities Council STFC (GB))

      LHCb

      1. Normal operations resumed automatically after yesterday's downtime
      2. Streaming issue from ECHO
        • Waiting for a fix to xrootd vector reads (see the sketch after this list for what a vector read involves)
        • Some fine-tuning of the proxy configuration in the production system has been done, based on our improved understanding
          • I understand this has helped ATLAS error rates
          • No improvement seen in LHCb jobs
        • Next: working on understanding the remaining failures and their breakdown by generation
          • For example, the "file name too long" error which has been seen
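
      For context on the vector-read item above: a vector read bundles several (offset, length) chunks into a single request instead of issuing one round trip per chunk. A minimal sketch with the XRootD Python bindings might look like the following; the URL is a placeholder and the exact binding names should be checked against the installed pyxrootd version.

      from XRootD import client

      f = client.File()
      status, _ = f.open("root://example-gateway//some/file.root")  # placeholder URL
      if not status.ok:
          raise RuntimeError(status.message)

      # One vectored request for three scattered byte ranges, as (offset, length) pairs.
      status, response = f.vector_read(chunks=[(0, 4096), (1 << 20, 4096), (10 << 20, 4096)])
      if status.ok:
          for chunk in response.chunks:
              print(chunk.offset, len(chunk.buffer))
      f.close()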

      DUNE

      1. Moved the ETF tests to using CRIC
      2. Still a low level of running jobs on the grid
    • 13:52 13:53
      VO Liaison Others 1m
    • 13:53 13:54
      Experiment Planning 1m
    • 13:54 13:55
      DUNE/ProtoDUNE 1m
    • 13:55 13:56
      Euclid 1m
    • 13:56 13:57
      SKA 1m
    • 13:57 13:58
      AOB 1m
    • 13:58 13:59
      Any Other Business 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))