https://tinyurl.com/T1-GGUS-Open
https://tinyurl.com/T1-GGUS-Closed
https://lcgwww.gridpp.rl.ac.uk/utils/availchart/
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden
2022 Pledges:
- TAPE already 'provided';
- DISK awaiting hardware;
- CPU awaiting TB.
Tape challenge: a date for the T0 Export repeat test is still to be confirmed.
Antares:
- antares-tpc01 doesn't appear to work for TPC transfers (reason unknown), resulting in "Operation Expired" errors for Archiving to Antares;
- ~ 75% overall transfer efficiency
- xrootd gateways can trigger "Operation Expired" errors for Recalls from Antares to Echo
- The BNL FTS has been updated to push through more transfers (and to try to reduce the pre-transfer staging eviction states).
WebDAV tests are still failing when the load is higher; there have been a few green days recently when the load was light.
I talked to the Rucio developer who works on multihop transfers, and also had 2.5 hours of discussion with CERN CTA and FTS colleagues (including Steve Murray), some of it about Antares specifically.
For Rucio, multihop handling is given a significant redesign in the forthcoming version 1.28 (CMS is currently on 1.27.x). The new design keeps the two legs of the transfer tied together throughout the lifetime of the job, even after a resubmission. Previously, once Rucio had determined a multihop path and submitted it to FTS, it no longer had any memory of it being a multihop transfer.
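As a rough illustration of the difference (a minimal sketch, not Rucio's actual data model or API; the request and endpoint names are hypothetical), the redesign amounts to keeping both legs attached to one parent request, so that a resubmission still sees the full path rather than two unrelated single hops:

    # Illustrative sketch only, assuming nothing about Rucio internals.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Hop:
        source: str
        destination: str
        attempts: int = 0

    @dataclass
    class MultihopRequest:
        request_id: str
        hops: List[Hop] = field(default_factory=list)

        def resubmit(self) -> List[Hop]:
            # Pre-1.28 behaviour (roughly): after submission to FTS the
            # multihop relationship was forgotten, so a retry could treat
            # each leg as an independent transfer. Keeping the hops on the
            # parent request lets the whole path be resubmitted together.
            for hop in self.hops:
                hop.attempts += 1
            return self.hops

    req = MultihopRequest("req-001", [Hop("ECHO", "gateway"), Hop("gateway", "ANTARES")])
    print([f"{h.source}->{h.destination} (attempt {h.attempts})" for h in req.resubmit()])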
Between us we also figured out that the component of CMS-Rucio that handles FTS cancellations had an expired credential, which is why nothing ever got cancelled. This affected staging, because the same files ended up with multiple FTS jobs associated with them. The FTS developer told me that this used to be prohibited by design, but the check caused a problem for database queries and was removed last year.
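A minimal sketch of the failure mode (hypothetical bookkeeping, not the real CMS-Rucio or FTS code): with cancellations never happening, nothing ever cleared the old entry for a file, and since FTS itself no longer forbids duplicates, extra jobs for the same file simply piled up. A working guard would look roughly like this:

    # Hypothetical bookkeeping: file name -> active FTS job id.
    active_jobs = {}

    def submit_staging(file_name, job_id):
        if file_name in active_jobs:
            # With working cancellation the stale job would have been
            # removed before we got here; with the expired credential
            # nothing was ever cancelled, so duplicates accumulated.
            print(f"{file_name} already has active job {active_jobs[file_name]}; skipping")
            return None
        active_jobs[file_name] = job_id
        return job_id

    submit_staging("/store/data/fileA.root", "fts-123")
    submit_staging("/store/data/fileA.root", "fts-456")  # duplicated pre-fix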
I was already aware that an update to the EOS version would fix another problem, where a bulk FTS request containing one genuinely missing file fails the entire bulk request with a 'file missing' type error. I probably mentioned this in my GridPP talk, and I have now written an internal ticket to try to get this done ASAP.
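A toy sketch of the bulk-request behaviour (the real fix is server-side in EOS; the file names here are invented): the desired handling fails only the genuinely missing entries rather than the whole bulk:

    # Illustrative per-file handling, assuming a simple exists() check.
    def process_bulk(files, exists):
        results = {}
        for f in files:
            # Broken behaviour: any missing file failed the entire bulk.
            # Desired behaviour: mark only the missing entries as failed
            # and stage the rest.
            results[f] = "STAGED" if exists(f) else "MISSING"
        return results

    on_tape = {"a.root", "b.root"}
    print(process_bulk(["a.root", "b.root", "lost.root"], lambda f: f in on_tape))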
There is also an update coming for CTA itself; v4.6.0-1 is recommended. I forget exactly what the change was for, but I have the release notes to read.
I observed a monitoring problem in CMS-Rucio during the tape challenge, and I can see that Eric Vaandering is following up on that.
I think these are quite significant changes that should help the system run much more smoothly under load and when things go wrong. A lot of them are the result of the work we did last summer, when CMS was attempting to recall 8 PB of 'B-parking' data from CERN-CTA. I was planning to give my Oct 2021 CMS Computing Week talk, which explained the B-parking problems, to the T1 storage team next week; I'll now be able to update it with the information in this email.
In my opinion, these changes will fix a lot of the errors we were seeing during staging (particularly after the Rucio clusters were deleted by CERN-IT). They will benefit ATLAS too. My remaining worry is the Echo side when it is busy.