RAL Tier1 Experiments Liaison Meeting
Access Grid
RAL R89
-
13:00 → 13:01
Major Incidents Changes 1m
-
13:01 → 13:02
Summary of Operational Status and Issues 1m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
-
13:02 → 13:03
GGUS/RT Tickets 1m
https://tinyurl.com/T1-GGUS-Open
https://tinyurl.com/T1-GGUS-Closed
-
13:04 → 13:05
Site Availability 1m
https://lcgwww.gridpp.rl.ac.uk/utils/availchart/
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden
-
13:05 → 13:06
Experiment Operational Issues 1m
-
13:15 → 13:16
VO Liaison CMS 1m
Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
Investigating failing tape transfers from RAL and elsewhere for CMS. The current suspect is the FNAL FTS instance, which was recently upgraded. I did a test with CERN FTS on a subset of the same data, and that has successful transfers where the FNAL FTS has none. The data is staging successfully from tape but is then not being instructed by FTS to move to Echo (at least this is the current working theory; a cross-check sketch follows these notes). Steve Murray is looking at it; he says that the FNAL FTS is misconfigured for Antares.
Also on tape failures: a few CMS tapes were 'disabled' this week (they were re-enabled by the script, but still caused significant failures). Is this happening more often than normal?
Intermittent webdav SAM test failures in the last 2 days, coincident with critical status on a number of gateways, mainly svc01/02 and gw14/15.
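Below is a minimal sketch of how the FNAL-vs-CERN FTS cross-check on a common subset of files could be reproduced with the FTS3 Python "easy" REST bindings. The endpoints, source/destination URLs and the bring_online value are placeholders, not the ones used in the actual test, and the real check may have been done via the FTS CLI or web interface instead.

```python
# Hypothetical sketch: submit the same subset of tape-resident files to two FTS
# instances and compare outcomes, to separate an FTS configuration problem from
# a staging problem at the source. Endpoints and URLs below are placeholders.
import fts3.rest.client.easy as fts3

SOURCES = [
    "root://antares.stfc.ac.uk//cms/store/test/file1.root",  # placeholder source URL
]
DEST_PREFIX = "root://dest.example.org//cms/store/test/"      # placeholder destination

def submit_job(endpoint):
    context = fts3.Context(endpoint)  # uses the local grid proxy / X509 credentials
    transfers = [
        fts3.new_transfer(src, DEST_PREFIX + src.rsplit("/", 1)[-1])
        for src in SOURCES
    ]
    # bring_online > 0 asks FTS to stage the files from tape before copying.
    job = fts3.new_job(transfers, bring_online=28800)
    return fts3.submit(context, job)

for endpoint in ("https://fnal-fts.example.gov:8446",   # placeholder for the FNAL instance
                 "https://fts-cern.example.ch:8446"):   # placeholder for the CERN instance
    job_id = submit_job(endpoint)
    state = fts3.get_job_status(fts3.Context(endpoint), job_id)["job_state"]
    print(endpoint, job_id, state)
```

If the jobs on one instance stall after staging while the other completes, that points at the FTS configuration rather than the tape system.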
-
13:16 → 13:17
VO Liaison ATLAS 1m
Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
RAL in HC test overnight:
- Stage-out failures (svc02) and a rack power-off triggered HC test failures. One of the HC tests stopped running, so RAL was not put back online.
- The site has been forced online and is being followed up; experts have now reinjected the tests.
BNL -> RAL (and CNAF) transfers over the OPN have been very slow for ~1 week. The problem appears, however, to be on the BNL side.
Accounting differences observed between the VO monitoring and WLCG accounting figures, starting ~September. See attached plot.
DNS issues reappeared on Sunday morning. Possibly due to the TTL changes to the webdav alias, fewer transfer failures were observed during this period.
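As a quick sanity check on the TTL theory, one can query the alias directly and look at the TTL being served. A minimal sketch with dnspython (>= 2.0) is below; the hostname is a placeholder for the actual webdav alias.

```python
# Check the TTL currently served for the webdav alias and the addresses it
# resolves to. The hostname is a placeholder; substitute the real alias.
import dns.resolver  # dnspython >= 2.0

answers = dns.resolver.resolve("webdav.echo.example.ac.uk", "A")  # placeholder alias
print("TTL (s):", answers.rrset.ttl)
for rr in answers:
    print("A:", rr.address)
```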
-
13:20 → 13:21
VO Liaison LHCb 1m
Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
- Vector read status:
  - A new patch was developed and applied on lcg2270, with several features:
    - atomic cache reads
    - caching layer
    - timeout increase for readv operations
    - async read operations disabled
  - So far it looks good, but only 11 user jobs have been executed there.
  - The old patch has the following results: 5 user jobs failed due to read errors and 792 executed successfully (a 0.6% failure rate). On the whole farm the failure rate was approximately 1.7% over the same time period.
- Dark data:
  - The size of the dark data has been identified: 877 TB.
  - Discussion is ongoing on how to delete this data; it may be better to do it from the site's side (see the sketch after this list).
- DNS issue:
  - Reappeared last Sunday and affected LHCb significantly.
- Upload failures:
  - Multiple peaks of failed uploads since yesterday afternoon; this seems to be related to gateway overload.
- Low number of running jobs:
  - The number of running LHCb jobs was low throughout the weekend, due to filesystem tuning.
  - Recovered now.
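For context, a minimal sketch of how dark data is typically identified: diff a storage namespace dump against a catalogue dump. The file names and formats below are placeholders and the actual RAL/LHCb procedure may differ.

```python
# Hypothetical sketch: identify "dark data" (files present on storage but unknown
# to the catalogue) by diffing a storage namespace dump against a catalogue dump.
# Dump file names and formats are placeholders.

def load_paths(dump_file):
    # One file name per line, whitespace-stripped, empty lines ignored.
    with open(dump_file) as f:
        return {line.strip() for line in f if line.strip()}

storage = load_paths("echo_lhcb_namespace_dump.txt")   # placeholder dump from the storage element
catalogue = load_paths("lhcb_catalogue_replica_dump.txt")  # placeholder dump from the catalogue

dark = storage - catalogue   # on disk but not in the catalogue: candidates for deletion
lost = catalogue - storage   # in the catalogue but missing on disk: a different problem

print(f"dark files: {len(dark)}, lost files: {len(lost)}")
```

Summing the sizes of the dark entries (from the storage dump) is what yields a figure such as the 877 TB quoted above.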
-
13:25 → 13:28
VO Liaison LSST 3m
Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
-
13:30 → 13:31
VO Liaison Others 1m
-
13:31 → 13:32
AOB 1m
-
13:32 → 13:33
Any other Business 1m
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
-