RAL Tier1 Experiments Liaison Meeting
Access Grid
RAL R89
-
13:00 → 13:01
Major Incidents Changes 1m
-
13:01 → 13:02
Summary of Operational Status and Issues 1m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
-
13:02 → 13:03
GGUS/RT Tickets 1m
https://tinyurl.com/T1-GGUS-Open
https://tinyurl.com/T1-GGUS-Closed
-
13:04 → 13:05
Site Availability 1m
https://lcgwww.gridpp.rl.ac.uk/utils/availchart/
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden
-
13:05 → 13:06
Experiment Operational Issues 1m
-
13:15 → 13:16
VO Liaison CMS 1m
Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
Investigating failing tape transfers from RAL and elsewhere for CMS. The current suspect is the FNAL FTS instance, which was recently upgraded. I did a test with CERN FTS on a subset of the same data, and that has successful transfers where the FNAL FTS has none. The data is staging successfully from tape but is then not being instructed by FTS to move to Echo (at least this is the current working theory; a cross-check sketch follows these notes). Steve Murray is looking at it; he says that the FNAL FTS is misconfigured for Antares.
Also on tape failures: a few CMS tapes were 'disabled' this week (they were re-enabled by the script, but still caused significant failures). Is this happening more often than normal?
Intermittent webdav SAM test failures in the last 2 days, coincident with critical status on a number of gateways, mainly svc01/02 and gw14/15.
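Below is a minimal sketch of how the FNAL-vs-CERN FTS cross-check on a common subset of files could be reproduced with the FTS3 Python "easy" REST bindings. The endpoints, source/destination URLs and the bring_online value are placeholders, not the ones used in the actual test, and the real check may have been done via the FTS CLI or web interface instead.

```python
# Hypothetical sketch: submit the same subset of tape-resident files to two FTS
# instances and compare outcomes, to separate an FTS configuration problem from
# a staging problem at the source. Endpoints and URLs below are placeholders.
import fts3.rest.client.easy as fts3

SOURCES = [
    "root://antares.stfc.ac.uk//cms/store/test/file1.root",  # placeholder source URL
]
DEST_PREFIX = "root://dest.example.org//cms/store/test/"      # placeholder destination

def submit_job(endpoint):
    context = fts3.Context(endpoint)  # uses the local grid proxy / X509 credentials
    transfers = [
        fts3.new_transfer(src, DEST_PREFIX + src.rsplit("/", 1)[-1])
        for src in SOURCES
    ]
    # bring_online > 0 asks FTS to stage the files from tape before copying.
    job = fts3.new_job(transfers, bring_online=28800)
    return fts3.submit(context, job)

for endpoint in ("https://fnal-fts.example.gov:8446",   # placeholder for the FNAL instance
                 "https://fts-cern.example.ch:8446"):   # placeholder for the CERN instance
    job_id = submit_job(endpoint)
    state = fts3.get_job_status(fts3.Context(endpoint), job_id)["job_state"]
    print(endpoint, job_id, state)
```

If the jobs on one instance stall after staging while the other completes, that points at the FTS configuration rather than the tape system.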
-
13:16 → 13:17
VO Liaison ATLAS 1m
Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
RAL in HC test overnight:
- Stage-out failures (svc02) and a rack power-off triggered HC test failures. One of the HC tests stopped running, so RAL was not put back online.
- The site has been forced online and is being followed up; experts have now reinjected the tests.
BNL -> RAL (and CNAF) transfers over the OPN have been very slow for ~1 week. The problem appears, however, to be on the BNL side.
Accounting differences observed between the VO monitoring and WLCG accounting figures, starting ~September. See attached plot.
DNS issues reappeared on Sunday morning. Possibly due to the TTL changes to the webdav alias, fewer transfer failures were observed during this period.
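As a quick sanity check on the TTL theory, one can query the alias directly and look at the TTL being served. A minimal sketch with dnspython (>= 2.0) is below; the hostname is a placeholder for the actual webdav alias.

```python
# Check the TTL currently served for the webdav alias and the addresses it
# resolves to. The hostname is a placeholder; substitute the real alias.
import dns.resolver  # dnspython >= 2.0

answers = dns.resolver.resolve("webdav.echo.example.ac.uk", "A")  # placeholder alias
print("TTL (s):", answers.rrset.ttl)
for rr in answers:
    print("A:", rr.address)
```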
-
13:20 → 13:21
VO Liaison LHCb 1m
Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
- Vector read status:
  - A new patch was developed and applied on lcg2270, with several features:
    - atomic cache reads
    - caching layer
    - timeout increase for readv operations
    - async read operations disabled
  - So far it looks good, but only 11 user jobs have been executed there.
  - The old patch has the following results: 5 user jobs failed due to read errors and 792 executed successfully (a 0.6% failure rate). On the whole farm the failure rate was approximately 1.7% over the same time period.
- Dark data:
  - The size of the dark data has been identified: 877 TB.
  - Discussion is ongoing on how to delete this data; it may be better to do it from the site's side (see the sketch after this list).
- DNS issue:
  - Reappeared last Sunday and affected LHCb significantly.
- Upload failures:
  - Multiple peaks of failed uploads since yesterday afternoon; this seems to be related to gateway overload.
- Low number of running jobs:
  - The number of running LHCb jobs was low throughout the weekend, due to filesystem tuning.
  - Recovered now.
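For context, a minimal sketch of how dark data is typically identified: diff a storage namespace dump against a catalogue dump. The file names and formats below are placeholders and the actual RAL/LHCb procedure may differ.

```python
# Hypothetical sketch: identify "dark data" (files present on storage but unknown
# to the catalogue) by diffing a storage namespace dump against a catalogue dump.
# Dump file names and formats are placeholders.

def load_paths(dump_file):
    # One file name per line, whitespace-stripped, empty lines ignored.
    with open(dump_file) as f:
        return {line.strip() for line in f if line.strip()}

storage = load_paths("echo_lhcb_namespace_dump.txt")   # placeholder dump from the storage element
catalogue = load_paths("lhcb_catalogue_replica_dump.txt")  # placeholder dump from the catalogue

dark = storage - catalogue   # on disk but not in the catalogue: candidates for deletion
lost = catalogue - storage   # in the catalogue but missing on disk: a different problem

print(f"dark files: {len(dark)}, lost files: {len(lost)}")
```

Summing the sizes of the dark entries (from the storage dump) is what yields a figure such as the 877 TB quoted above.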
-
13:25 → 13:28
VO Liaison LSST 3m
Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
-
13:30 → 13:31
VO Liaison Others 1m
-
13:31 → 13:32
AOB 1m
-
13:32 → 13:33
Any other Business 1m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
-