RAL Tier1 Experiments Liaison Meeting
Access Grid, RAL R89
13:00 → 13:01  Major Incidents Changes (1m)
13:01 → 13:02  Summary of Operational Status and Issues (1m)
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
13:02 → 13:03  GGUS / RT Tickets (1m)
https://tinyurl.com/T1-GGUS-Open
https://tinyurl.com/T1-GGUS-Closed
13:04 → 13:05  Site Availability (1m)
https://lcgwww.gridpp.rl.ac.uk/utils/availchart/
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden
13:05 → 13:06  Experiment Operational Issues (1m)
13:15 → 13:16  VO Liaison CMS (1m)
Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
Katy is planning to remove the gsiftp tests from the contribution to the SAM status.
CMS saw spikes in failures on writes to Echo on the 18th and 23rd, during the DNS issues. The SAM status also failed on those days because of the storage tests (gsiftp and webdav).
Large numbers of (Processing-type) jobs are failing, but this is also seen at other sites.
13:16 → 13:17  VO Liaison ATLAS (1m)
Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
GGUS: 160156
The last DNS outage 'reset' most transfers. As of this morning there were 33k submitted transfers to write into Echo, plus 12k files being recalled from Antares (via Echo).
Very few failures (O(100)) in the last 24 hours where the source file had been evicted prior to transfer; we can (hopefully) resolve the ticket this afternoon if no further issues arise.
DNS: failed name resolution from external hosts:
Last Wednesday morning and Monday evening; ~220k transfers failed (not started).
GOCDB was also affected (were any other ancillary services affected?).
13:20 → 13:21  VO Liaison LHCb (1m)
Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
- The reading part of the tape challenge finished last week. Results look promising: the expected throughput was 1.93 GB/s, and we achieved roughly 1 GB/s more than this (~2.9 GB/s).
- There was a major LHCb Dirac update on Monday, which introduced some issues. It was recovered within several hours, but there were a lot of failed jobs due to this.
- Low number of running LHCb jobs due to an insufficient number of production requests.
- A consistency check identified some dark and lost data. The dark data was removed and the lost files were re-replicated (all data operations were done by the LHCb Computing team).
Tickets:
- Slow checksums (stats):
  - Still waiting
- Deletion problems:
  - Solved
- Problems with simultaneous access to the same file on Echo:
  - On hold; tests are ongoing at Glasgow
- Vector read:
  - One more test: what happens to the LHCb application if a vector read request returns "wrong" data (i.e. not the data that was requested). This was tested (with the same patch, but once in 1000 vector reads it shifts one of the requested chunks by 1 byte), and it seems the application crashes; see the sketch after this list.
  - A dedicated patched WN for production LHCb jobs is being prepared.
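As a rough illustration only (not the actual patch or storage gateway code), here is a minimal Python sketch of the fault-injection idea described in the vector-read item above; faulty_vector_read, read_chunk and FAULT_RATE are hypothetical names.

    # Hypothetical sketch of the fault injection described above, not the real patch.
    import random

    FAULT_RATE = 1.0 / 1000  # roughly once in 1000 vector reads

    def faulty_vector_read(read_chunk, requests):
        """Serve a vector read given as [(offset, length), ...] via read_chunk(offset, length).

        With probability FAULT_RATE, one requested chunk is shifted by 1 byte,
        so the caller receives data it did not ask for."""
        requests = list(requests)
        if requests and random.random() < FAULT_RATE:
            i = random.randrange(len(requests))
            offset, length = requests[i]
            requests[i] = (offset + 1, length)  # shift this chunk by one byte
        return [read_chunk(offset, length) for offset, length in requests]

    # Example: read two chunks from a local file standing in for the storage back end.
    if __name__ == "__main__":
        with open("example.dat", "rb") as f:
            def read_chunk(offset, length):
                f.seek(offset)
                return f.read(length)
            chunks = faulty_vector_read(read_chunk, [(0, 64), (1024, 128)])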
13:25 → 13:28  VO Liaison LSST (3m)
Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
13:30 → 13:31  VO Liaison Others (1m)
13:31 → 13:32  AOB (1m)
13:32 → 13:33  Any other Business (1m)
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))