https://tinyurl.com/T1-GGUS-Open
https://tinyurl.com/T1-GGUS-Closed
https://lcgwww.gridpp.rl.ac.uk/utils/availchart/
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden
Echo downtime yesterday seemed to go fine. SAM tests and PhEDEx transfers are green. CMS never went below its normal number of cores on the batch farm, and in fact picked up more than normal when cores were released by e.g. ATLAS. I speculate this was for two reasons:
1. Some CMS jobs (pilots) last a very long time, and the downtime was only ~6 hours, while the drain was started 24 hours in advance. Some input data may have been accessed via AAA where necessary (see the sketch after this list).
2. It appeared that only 'Production' jobs were put into drain. RAL picked up a lot of User Analysis jobs, many of which failed, but by no means all.
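As a rough illustration of the AAA point above, a minimal sketch of the read-fallback idea is below: try the site-local replica first, and retry over the WAN via the global xrootd redirector if that fails. The local endpoint name and the open_file() helper are illustrative assumptions, not the actual CMS job wrapper; cms-xrd-global.cern.ch is the usual global AAA redirector.

```python
AAA_REDIRECTOR = "root://cms-xrd-global.cern.ch"   # global AAA redirector
LOCAL_ENDPOINT = "root://echo.example.local"       # placeholder for the local Echo endpoint


def open_with_fallback(lfn, open_file):
    """Try the site-local replica, then fall back to a remote read via AAA.

    `open_file` is a hypothetical callable that opens a file by URL and
    raises IOError on failure (stands in for whatever I/O layer the job uses).
    `lfn` is a logical file name such as "/store/...".
    """
    try:
        return open_file(f"{LOCAL_ENDPOINT}/{lfn}")
    except IOError:
        # Local storage unavailable (e.g. during an Echo downtime):
        # read the file over the WAN through the AAA federation instead.
        return open_file(f"{AAA_REDIRECTOR}/{lfn}")
```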
More failures than normal were seen in the FileOpen category. In normal running we do see some of these, but usually far more fall in the FileRead category. I don't believe there was any problem staging out data from completed jobs; none of those errors appeared. I didn't check explicitly whether it happened, but the design is to stage out to RALPP if the local storage is unavailable.
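To make the comparison against normal running concrete, a minimal tallying sketch is below. The input files, column names, and category labels are hypothetical placeholders for whatever export the monitoring dashboard provides; the real numbers come from the CMS dashboards.

```python
# Minimal sketch: tally failed jobs per error category (e.g. FileOpen,
# FileRead) for a downtime window and a baseline window, so the two can be
# compared side by side.
import csv
from collections import Counter


def tally_failures(path: str) -> Counter:
    """Count failed jobs per error category from a hypothetical CSV export."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("status") == "failed":
                counts[row.get("error_category", "Unknown")] += 1
    return counts


if __name__ == "__main__":
    downtime = tally_failures("job_errors_downtime.csv")
    baseline = tally_failures("job_errors_baseline.csv")
    for category in sorted(set(downtime) | set(baseline)):
        print(f"{category:12s} downtime={downtime[category]:5d} "
              f"baseline={baseline[category]:5d}")
```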
Job efficiency actually reacted positively at times! I believe this was due to all the Analysis jobs running, which typically have lower I/O requirements.
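For reference, job efficiency here is the usual CPU-time over wall-time ratio, so I/O-light analysis jobs pull the average up. A minimal sketch of computing it per job and averaging is below; the job records are made-up examples.

```python
# Minimal sketch: job efficiency = CPU time / wall time, averaged over jobs.
from typing import Iterable, Mapping


def job_efficiency(cpu_seconds: float, wall_seconds: float) -> float:
    """CPU time divided by wall-clock time; 1.0 means no idle or I/O-wait time."""
    return cpu_seconds / wall_seconds if wall_seconds > 0 else 0.0


def mean_efficiency(jobs: Iterable[Mapping[str, float]]) -> float:
    effs = [job_efficiency(j["cpu_s"], j["wall_s"]) for j in jobs]
    return sum(effs) / len(effs) if effs else 0.0


if __name__ == "__main__":
    jobs = [
        {"cpu_s": 3400.0, "wall_s": 3600.0},  # analysis-like, little I/O wait
        {"cpu_s": 1800.0, "wall_s": 3600.0},  # production-like, heavy staging
    ]
    print(f"mean efficiency: {mean_efficiency(jobs):.2f}")
```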
I am making a document to provide evidence of all of this.
Other stuff: IPv4 slowness being investigated by DI. SAM tests for AAA have been failing intermittently - possibly network related.
LHCb
DUNE