https://tinyurl.com/T1-GGUS-Open
https://tinyurl.com/T1-GGUS-Closed
https://lcgwww.gridpp.rl.ac.uk/utils/availchart/
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden
There were a lot of problems on the CMS side this week, following the expiry of top-level certificates at CERN on Saturday.
Rucio did not have a valid CMS certificate, so no transfers completed successfully between Saturday lunchtime and midnight Monday/Tuesday.
RAL Tier 1 went into drain for production work on Sunday. SAM tests were green and I could not see a good reason for the drain, so I forced us back into production (other T1s were running jobs at normal capacity). Failure rate good; efficiency variable.
HammerCloud tests stopped (across the whole CMS grid) from Saturday to Tuesday.
On Friday I again saw a high failure rate for analysis jobs (as reported a couple of times now). These jobs were again reading mostly across the trans-Atlantic link. They all belonged to one user who was running large numbers of CRAB jobs across the grid with the ‘IgnoreLocality’ option, which does not match jobs to the location of their input data. I wrote a polite email; he has now killed those jobs and will hopefully let the system assign jobs to a better location.
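For reference, a minimal sketch of the relevant CRAB3 configuration fragment (not the user's actual config); Data.ignoreLocality is the standard CRAB3 parameter, and the whitelist values are purely illustrative:

from WMCore.Configuration import Configuration

config = Configuration()
config.section_('Data')
# Default behaviour: run jobs at sites hosting the input data.
config.Data.ignoreLocality = False
# Setting ignoreLocality = True lets jobs run anywhere on the grid, which is
# what led to the heavy trans-Atlantic reads seen here; if it is used, a
# site whitelist near the data would normally be given too, e.g. (hypothetical):
# config.section_('Site')
# config.Site.whitelist = ['T1_UK_RAL', 'T2_UK_*']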
Failure rates on particular WNs at the RAL T1 batch farm: The liaisons have been investigating notably high failure rates split by worker node. My analysis showed a surprising number of the most recent (2022) nodes having ‘significant’ failure rates, my definition being >10% failures over the period 10-20th April, against a farm-wide average of 4%.
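For illustration, a minimal sketch of the per-worker-node check described above, assuming a hypothetical CSV export of job records with columns 'worker_node' and 'status'; thresholds follow the text (>10% flagged, ~4% farm average):

import pandas as pd

# Hypothetical export of job records for 10-20 April; column names assumed.
jobs = pd.read_csv("job_records_april.csv")

per_node = (
    jobs.assign(failed=jobs["status"].eq("failed"))
        .groupby("worker_node")["failed"]
        .agg(total="size", failures="sum")
)
per_node["failure_rate"] = per_node["failures"] / per_node["total"]

farm_average = jobs["status"].eq("failed").mean()    # ~4% in the period quoted
flagged = per_node[per_node["failure_rate"] > 0.10]  # 'significant' threshold

print(f"Farm-wide failure rate: {farm_average:.1%}")
print(flagged.sort_values("failure_rate", ascending=False))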
AAA at RAL T1: There was an Echo downtime for reboots on Thursday, and the RAL-based redirector was also rebooted. After the downtime the AAA manager continued to fail SAM tests, so we got a red status for the day. I fixed it by the end of the working day.
Had the 2023 data tape families created. Made an adjustment to the requested families based on the 2022 experience.
Tape deletions (~1.5 PB) were running yesterday and appear to have finished. ~2600 files were reported 'not found', which is usually fine.
Looking at monitoring job read rates:
https://monit-grafana.cern.ch/d/BZfBLpE4k/user-kellis-average-data-input-over-read-time?orgId=11&from=now-4y&to=now&viewPanel=3&editPanel=3
You can see RAL is very low in the first years but closer to the other sites in recent months (still not amazing, but perhaps that is not surprising given we are further from e.g. CERN than IN2P3 or CNAF are).
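As a rough illustration of the quantity that panel plots (average data input over read time), a minimal sketch assuming per-job monitoring records with hypothetical columns 'site', 'read_bytes' and 'read_time_s'; this is not the actual MONIT/Grafana query, just the same ratio computed per site:

import pandas as pd

records = pd.read_csv("job_io_records.csv")  # hypothetical export

# Sum input bytes and read time per site, then take the ratio as a read rate.
sums = records.groupby("site")[["read_bytes", "read_time_s"]].sum()
read_rate_mb_s = (sums["read_bytes"] / sums["read_time_s"]) / 1e6

print(read_rate_mb_s.sort_values().round(2))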
FTS: we currently have a backlog of 154k transfers from RAL, likely related to the CERN certificate issues and to a number of CERN FTS hosts that stopped working.
Tickets:
Operational issues: