ATLAS
- dCache
- Swiss CA & dCache restarts due to CaNL bug (dCache < 7.x)
- GGUS tickets to remaining 6 sites on 12 April
- done: FZK, DESY-ZN, SARA-MATRIX, RRC-KI-T1, RU-Protvino-IHEP
- scheduled: DESY-HH (GGUS:151317)
- in the past, sites usually restarted only WebDAV doors (or WebDAV + gPlazma), not pools
- this issue is currently hidden for CMS, because their FTS is configured with pull -> push fallback
- dCache WLCG SRR issue with missing storageshares
- dCache developers waiting for CERN/WLCG response since November(?)
- several other related, really old dCache tickets are still open
- what is the current status? what is minimal dCache version with working SRR?
- WLCG SRR is a bit too flexible: usually generated by a cron script (huge potential for failures, see notes for LCGDM-2744), no standard location, and we need to clean up old/invalid locations in different CRICs (WLCG, ATLAS, DUNE, ...); not really an improvement compared to the SRM used in the past
- dCache documentation recommends periodic CRL updates, but once you run fetch-crl then updates are mandatory
- different commands to enable periodic updates on SLC6, CentOS7, CentOS8
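Since the SRR file is usually produced by a site cron script, a basic sanity check before publishing could catch the missing-storageshares case. A minimal sketch in Python, assuming the JSON has a top-level `storageservice` object with a `storageshares` list whose entries carry `name`, `totalsize` and `usedsize`. These key names and the helper `check_srr` are assumptions for illustration and should be checked against the actual SRR specification.

```python
import json


def check_srr(text):
    """Return a list of problems found in an SRR JSON document (empty if none).

    The key names used here are assumptions about the SRR layout.
    """
    try:
        doc = json.loads(text)
    except ValueError as exc:
        return ["not valid JSON: %s" % exc]

    service = doc.get("storageservice") if isinstance(doc, dict) else None
    if not isinstance(service, dict):
        return ["missing 'storageservice' object"]

    shares = service.get("storageshares")
    if not isinstance(shares, list) or not shares:
        return ["missing or empty 'storageshares'"]

    problems = []
    for index, share in enumerate(shares):
        if not isinstance(share, dict):
            problems.append("share #%d: not an object" % index)
            continue
        label = share.get("name", "#%d" % index)
        for key in ("totalsize", "usedsize"):
            value = share.get(key)
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(value, bool) or not isinstance(value, int) or value < 0:
                problems.append("share %s: bad %s" % (label, key))
    return problems
```

A cron wrapper could refuse to overwrite the published file whenever `check_srr` returns a non-empty list.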
- StoRM
- official documentation should provide details on how to update space occupancy for non-GPFS backends
- it would be nice to include examples for the most common filesystems (e.g. Lustre, CEPH(?))
- what happened to the related STOR-1356?
- DPM
- Observing production transfer failures
- mainly affects long distance transfers (EU <-> US / Asia / Australia)
- security model implemented by DPM for storage-issued tokens (macaroons) differs from dCache, StoRM and XRootD (LCGDM-2972)
- new fixed DPM release necessary - in progress
- scheduled for the end of May
- ECHO RAL/Glasgow status?
- would it be possible / easier to use CephFS + XRootD without any special plugin?
- ATLAS 30day transfer average with RAL destination 4Gb/s (hourly peaks < 30Gb/s)
- ATLAS 30day transfer average with RAL source 4Gb/s (hourly peaks < 30Gb/s)
- XRootD status of production release for US sites?
- 100k small files stress tests for DPM (FTS, Kibana) and XRootD (FTS, Kibana)
- race creating (duplicate) parent directories
- DPM (449 failed transfers)
- 3x DESTINATION MAKE_PARENT HTTP 403 : Permission refused
- 19x DESTINATION OVERWRITE HTTP 403 : Permission refused
- 404x problem with TRIUMF source
- XRootD (5952 failed transfers)
- 32x DESTINATION MAKE_PARENT (Neon): Could not read status line: Connection reset by peer
- quite a lot of "Connection issues" - destination XRootD overloaded(?)
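The parent-directory race seen in the stress tests can be illustrated with a small simulation. This is only a sketch of tolerant client-side logic, assuming a WebDAV-like endpoint that answers MKCOL with 201 when the directory is created, 405 when it already exists and 409 when the parent is missing. `FakeStore`, `ensure_parents` and `run_concurrent` are hypothetical names, and the failures listed above (HTTP 403, connection reset) show that real storages may respond differently.

```python
import threading


class FakeStore:
    """Hypothetical in-memory stand-in for a WebDAV endpoint (not a real storage API)."""

    def __init__(self):
        self._dirs = {"/"}
        self._lock = threading.Lock()

    def mkcol(self, path):
        """Mimic MKCOL status codes: 201 created, 405 exists, 409 parent missing."""
        parent = path.rsplit("/", 1)[0] or "/"
        with self._lock:
            if path in self._dirs:
                return 405
            if parent not in self._dirs:
                return 409
            self._dirs.add(path)
            return 201


def ensure_parents(store, file_path):
    """Create every missing parent of file_path, tolerating concurrent creators."""
    parts = [p for p in file_path.split("/") if p][:-1]
    current = ""
    for part in parts:
        current += "/" + part
        status = store.mkcol(current)
        # 405 means another client won the race; the directory exists, so carry on.
        if status not in (201, 405):
            raise RuntimeError("MKCOL %s failed with HTTP %d" % (current, status))


def run_concurrent(n_clients=8):
    """Let several clients create the same parents at once; return collected errors."""
    store = FakeStore()
    errors = []

    def worker(i):
        try:
            ensure_parents(store, "/atlas/test/small/file%d" % i)
        except RuntimeError as exc:
            errors.append(str(exc))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```

In this model a client that treats "already exists" as success never fails, whatever the interleaving; a storage that reports the lost race as a generic error gives the client no safe way to tell it apart from a real permission problem.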
FTS related
- Concurrent creation of parent directories causes HTTP-TPC transfer failures, at least for DPM and XRootD
- Change key size for delegated proxy to 2048 (FTS-1700)
- CentOS8 by default doesn't support 1024-bit keys (observed at SARA & dCache)
- Logging Authorization headers (FTS-1663)
- FTS log level for HTTPS/DAVS/... should not be set to 3 for any production storage
- XRootD storage (UNL) even advertises Bearer for failed transfers with log level 0
- does this come from a bad XRootD configuration or is it necessary to fix XRootD sources?
- GridFTP issues (we don't really care)
- Protocol translation from SRM+GridFTP to HTTP doesn't work for files > 4MB
- GridFTP transfer succeeds but gfal incorrectly closes the connection, which leads to "Aborting transfer due to session termination" in logs
- Davix+libneon vs. Davix+libcurl future plans
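For the Authorization header logging item above, one possible mitigation is to mask credentials before a line reaches the log. A minimal sketch, assuming log lines contain headers in the usual `Name: scheme credential` form; the helper name `redact_auth` is hypothetical and this is not how FTS, gfal or XRootD handle it.

```python
import re

# Matches "...Authorization: Bearer <token>" and "...Authorization: Basic <token>".
# There is deliberately no word boundary in front, so prefixed header names that
# end in "Authorization" are covered as well.
_AUTH_RE = re.compile(r"(authorization\s*:\s*)(bearer|basic)\s+\S+", re.IGNORECASE)


def redact_auth(line):
    """Return the log line with any Authorization credential replaced."""
    return _AUTH_RE.sub(
        lambda m: "%s%s <redacted>" % (m.group(1), m.group(2)), line
    )
```

The scheme name is kept so the log still shows which kind of credential was used; only the secret itself is dropped.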
CMS
Current status
| Total sites (T1_Disk + T2s) | 56 |
| Notified sites | 56 (100%) |
| Reporting a davs endpoint | 47 (83%) |
| Passed manual tests | 41 (73%) |
| Passed loadTests (davs) | 13 (23%) |
| Passed loadTests (srm) | 12 (21%) |
| Enabled to fetch changes in TFC | 0 (0%) |
Current issues
1. Corner cases in the TFC parser
The TFC parser uses regular expressions to parse all the possible URL formats used in the LFN-to-PFN rules. We have found cases not covered by the set of regexps, as well as pieces of code with missing regexps. I'm currently refactoring these pieces of code.
2. WebDAV in the TFC
Either the site admins do not define their WebDAV endpoint in their TFC or they define it incorrectly.
3. Missing permission
We have found many sites where Rucio is missing the permissions to write the LoadTest file.
4. The ASO problem.
We found out that ASO relies on the Rucio configuration of the RSEs to schedule its TPC transfers. It always fetches the preferred protocols for read and write from the sites involved but doesn't check whether they're compatible or not. We had to change our TFC parser to keep read/write preferred protocols fixed to srm.
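The compatibility check that ASO skips could look roughly like the sketch below. It assumes each site exposes ordered lists of preferred protocols for read and write, and it lets the destination's write preference decide; the function name `pick_tpc_protocol` and the tie-breaking rule are assumptions for illustration, not ASO or Rucio behaviour.

```python
def pick_tpc_protocol(src_read, dst_write):
    """Return the first destination write protocol the source can also read.

    src_read:  protocols the source site offers for reading, most preferred first
    dst_write: protocols the destination offers for writing, most preferred first
    Returns None when the two sites share no protocol.
    """
    readable = set(src_read)
    for proto in dst_write:
        if proto in readable:
            return proto
    return None
```

For example, `pick_tpc_protocol(["davs", "srm"], ["srm"])` yields `"srm"`, while a pair of sites with no overlap yields `None`, so the transfer could be rejected up front instead of failing later.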