RAL Tier1 Experiments Liaison Meeting

Name: RAL Tier1 Experiments Liaison Meeting
Start: 2021-03-03T13:30:00+00:00
End: 2021-03-03T14:30:00+00:00
Location: RAL R89

Wednesday 3 Mar 2021, 13:30 → 14:30 Europe/London

Access Grid (RAL R89)

Description

Please attend via the following Zoom meeting:

https://ukri.zoom.us/j/98562731547?pwd=UU9Wb2xCL05tWmROT1h6SUlWdUJ3dz09

- 13:38 → 13:39
  
  Major Incidents Changes 1m
- 13:39 → 13:40
  
  Summary of Operational Status and Issues 1m
  
  Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
  
  RT1EL-20210303.docx
  
  RT1EL-20210303.pdf
- 13:40 → 13:41
  
  GGUS/RT Tickets 1m
  
  https://tinyurl.com/T1-GGUS-Open
  https://tinyurl.com/T1-GGUS-Closed
- 13:41 → 13:42
  
  Site Availability 1m
  
  https://lcgwww.gridpp.rl.ac.uk/utils/availchart/
  
  https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
  
  http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden
- 13:42 → 13:43
  
  Experiment Operational Issues 1m
- 13:44 → 13:45
  
  VO-Liaison ATLAS 1m
  
  Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Dr Tim Adye (Science and Technology Facilities Council STFC (GB))
  
  20210303_ipv46_update.pdf
  
  ATLAS needs to run more single-core analysis jobs
  - https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=397775
  - These will use direct I/O, so vectored reads are needed
  
  We noticed that 100% on Vande no longer corresponds to 100% × 11.7/10 on the ATLAS monitoring (the factor accounts for the corepower difference). The cause is obscured by the current changes.
  - A recent change to the batch workers?
  - A change to the absolute fairshare values?
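
The expected relationship is a simple scaling by the corepower ratio. A minimal sketch, assuming 11.7 and 10 are the two corepower figures being compared; the notes do not say which side each belongs to, so the function and parameter names here are hypothetical:

```python
def expected_atlas_percent(vande_percent, power_num=11.7, power_den=10.0):
    """Scale a utilisation percentage by the ratio of the two corepower figures."""
    return vande_percent * power_num / power_den

# What 100% on Vande should correspond to on the ATLAS monitoring
# if only the corepower difference were in play.
print(round(expected_atlas_percent(100.0), 1))
```

A reading on the ATLAS side that differs from this value points at something other than the corepower factor, such as the two candidate causes listed above.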
  
  Echo Read access for Oxford ATLAS XCache
  - https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=397191
  
  TPC-http
  - Bespoke checksum script on Test Gateway to return checksum
  - Return of the '//' macaroon path normalisation issue.
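
To illustrate why a doubled slash matters, here is a minimal sketch, assuming the macaroon carries a path-prefix caveat that is compared against the request path as plain strings. The helper names and example paths are hypothetical, and this is not the gateway's actual code:

```python
import re

def normalise_path(path):
    """Collapse any run of '/' into a single '/'."""
    return re.sub(r"/{2,}", "/", path)

def path_allowed(request_path, caveat_prefix):
    """Check the request path against a caveat prefix after normalising both."""
    req = normalise_path(request_path)
    prefix = normalise_path(caveat_prefix).rstrip("/") + "/"
    # Allow the prefix directory itself or anything below it.
    return req == prefix.rstrip("/") or req.startswith(prefix)

# Naive string comparison on the raw path versus the normalised check.
print("//atlas/data/file.root".startswith("/atlas/data/"))
print(path_allowed("//atlas/data/file.root", "/atlas/data"))
```

A regular expression is used here rather than `posixpath.normpath`, because `normpath` deliberately keeps exactly two leading slashes. The sketch does not handle `..` components.
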
- 13:46 → 13:47
  
  VO Liaison CMS 1m
  
  Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
  
  CMS is running 'at pledge' because it has been limited while the LHCb problem is fixed; LHCb is now running at 200% of its pledge. Most CMS-only nodes are empty.
  
  SAM tests are looking much better this week, although no change or fix was applied. However, I see a large number of job failures and very low efficiency. The failures are mostly FileOpen or FileRead. I have an example 'step chain' job (i.e. a multi-step job) that I want to try on one of the empty CMS-only nodes, hopefully this week.
  
  After talking to Chris Brew, we think there is a problem with the /etc/hosts file in the CMS docker config. He says the same IP address cannot be repeated on separate lines like this:
  
  172.28.1.1 xrootd.echo.stfc.ac.uk
  
  172.28.1.1 ceph-gw10.gridpp.rl.ac.uk
  
  172.28.1.1 ceph-gw11.gridpp.rl.ac.uk
  
  He said I should ask for a change to:
  
  172.28.1.1 xrootd.echo.stfc.ac.uk ceph-gw10.gridpp.rl.ac.uk ceph-gw11.gridpp.rl.ac.uk
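
The requested change amounts to merging every line that shares an IP address into a single line. A minimal sketch of that merge, assuming plain `IP name [name ...]` entries; the function name is hypothetical and any comments in the file are simply dropped:

```python
def merge_hosts_entries(lines):
    """Merge hosts-file lines that share an IP into one line per IP."""
    merged = {}  # IP -> list of hostnames, in first-seen order
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # skip blank, comment-only, or malformed lines
        ip, names = fields[0], fields[1:]
        known = merged.setdefault(ip, [])
        for name in names:
            if name not in known:
                known.append(name)
    return [" ".join([ip] + names) for ip, names in merged.items()]

current = [
    "172.28.1.1 xrootd.echo.stfc.ac.uk",
    "172.28.1.1 ceph-gw10.gridpp.rl.ac.uk",
    "172.28.1.1 ceph-gw11.gridpp.rl.ac.uk",
]
for entry in merge_hosts_entries(current):
    print(entry)
```

Run on the three lines above, this prints the single combined line that Chris suggested.
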
- 13:48 → 13:49
  
  VO Liaison LHCb 1m
  
  Speaker: Raja Nandakumar (Science and Technology Facilities Council STFC (GB))
  
  LHCb
  
  Low number of running jobs
  - https://ggus.eu/?mode=ticket_info&ticket_id=150679
  - Seems fixed after limits were put on CMS and ATLAS
  - Not a permanent solution, but it seems to have allowed LHCb jobs to be picked up by the batch system (???)
  
  ECHO streaming issue
  - Waiting for release of the fix to vector reads
  - Timescale?
  
  Trying to understand the discrepancy between storage used as reported by RAL vs DIRAC
  - Currently a 20% discrepancy; it has been large since 2019 (LHCb move to ECHO)
  
  Date : DIRAC vs RAL (Grafana)
  
  31/12/2020: 5.61 vs 6.46PB
  31/12/2019: 5.62 vs 6.42PB
  31/12/2018: 4.55 vs 4.54PB
  08/02/2018: 4.13 vs 4.10PB
  31/12/2016: 3.61 vs 3.09PB
  18/08/2016: 3.17 vs 3.22PB
  31/12/2015: 3.13 vs 3.12PB
  25/08/2015: 2.30 vs 2.34PB
  18/01/2015: 2.23 vs 2.28PB
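
The size of the discrepancy at each date can be derived from the figures above. A minimal sketch, assuming the difference is expressed relative to the DIRAC value (using the RAL value as the denominator gives somewhat different percentages); the variable and function names are hypothetical:

```python
# (date, DIRAC PB, RAL PB) copied from the list above
usage = [
    ("31/12/2020", 5.61, 6.46),
    ("31/12/2019", 5.62, 6.42),
    ("31/12/2018", 4.55, 4.54),
    ("08/02/2018", 4.13, 4.10),
    ("31/12/2016", 3.61, 3.09),
    ("18/08/2016", 3.17, 3.22),
    ("31/12/2015", 3.13, 3.12),
    ("25/08/2015", 2.30, 2.34),
    ("18/01/2015", 2.23, 2.28),
]

def relative_discrepancy(dirac_pb, ral_pb):
    """Return (RAL - DIRAC) / DIRAC as a percentage."""
    return 100.0 * (ral_pb - dirac_pb) / dirac_pb

for date, dirac_pb, ral_pb in usage:
    print(f"{date}: {relative_discrepancy(dirac_pb, ral_pb):+.1f}%")
```

A positive value means RAL reports more storage used than DIRAC does for that date.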
  
  DUNE
  
  Normal operations
  
  Testing dynafed access to RAL storage to transfer data between RAL and Fermilab
  - Is dynafed supported?
  - Or are other protocols that support http(s) available?
- 13:52 → 13:53
  
  VO Liaison Others 1m
- 13:53 → 13:54
  
  Experiment Planning 1m
- 13:54 → 13:55
  
  DUNE/ProtoDUNE 1m
- 13:55 → 13:56
  
  Euclid 1m
- 13:56 → 13:57
  
  SKA 1m
- 13:57 → 13:58
  
  AOB 1m
- 13:58 → 13:59
  
  Any other Business 1m
  
  Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))