WLCG DOMA BDT Meeting
→
Europe/Zurich
Brian Paul Bockelman
(University of Wisconsin Madison (US)),
Maria Arsuaga Rios
(CERN),
Petr Vokac
(Czech Technical University in Prague (CZ))
Description
Topic: WLCG DOMA BDT Meeting (twiki)
-
-
16:30
→
16:35
News 5m
You can add your contribution to the existing section, or if you have a different or bigger topic to discuss, please let us know before this meeting and we can create a dedicated slot for your contribution.
Small topics that might be discussed in this meeting
- dCache 9.2.x and xrootd.root door configuration behavior - fixed in 9.2.8
- WLCG Ops: EGI GOCDB downtime for TAPE service (GGUS:165354)
- SRM.nearline no longer makes sense for WebDAV endpoints with TAPE REST
- we really don't want to keep SRM.nearline just to be able to declare downtime for TAPE
- EGI asked to create new service type (GGUS:165354)
- Service types for WebDAV endpoints with TAPE REST
- EGI GOCDB service type
- disk: webdav
- tape: wlcg.webdav.tape
- OSG Topology
- disk: WebDAV or WebDAV.disk
- tape: WebDAV.tape
- CRIC downtime synchronization implemented CRIC-258
- validated by BNL for OSG Topology downtime import for WebDAV.tape
- ATLAS still needs a different hostname (alias) because FTS configuration uses scheme://fqdn
- tape transfers should be separated in FTS
- tape (buffer) may need different limits than disks
- Alma9 xrd client & sites with Root CA signed by SHA1
- we concluded in issue#2150 that XRootD should make CA validation compatible with other applications (ignore the signature on a self-signed root CA)
- new SHA1 issues with CRLs - unable to validate / load CRLs signed with SHA1
- software: fetch-crl (issue#4), dCache (announcement in user-forum mailing list)
- some SHA1 CAs at least publish CRLs signed with SHA256 => these cause no trouble for fetch-crl and dCache
- ATLAS asked WLCG to talk with IGTF and provide timeline (WLCG MB#316 Service report)
- longer term goal: all grid CAs signed by OS trusted CAs
- ultimate goal: get rid of "globus" /etc/grid-security/certificates
- FTS optimizer values (different for ATLAS vs. CMS vs. Pilot instance, but same behavior for HTTP-TPC)
- The FTS Optimizer has two step sizes when increasing connections on the link
- A conservative step size (default 1, configurable, "OptimizerIncreaseStep" setting), when the Optimizer is set to mode "1"
- An aggressive step size (default 2, configurable, "OptimizerAggressiveIncreaseStep" setting), when the Optimizer is set to mode "2"
- Then, the difference between mode "2" and mode "3" is how the number of actives is calculated:
- In mode "2", number of actives on the link = number of transfers on the link
- In mode "3", number of actives on the link = number of TCP streams on the link. This was valid in the times of GridFTP, but for HTTP, every transfer only uses one stream
- In essence, not much difference between mode "2" and mode "3".
- A lot of functionality applicable only to GridFTP => no longer useful
- Full TAPE buffer & FTS failing transfers
- could FTS do better and not push new transfers when the buffer is full? Use WLCG SRR to detect free space in the buffer?
- do we have enough information for a smarter decision than waiting for failures and relying on the slow optimizer to stop transfers?
- FTS developers already thought about improvements not to hit buffer size limit
- FTS knows the total size of staged files
- dCache TAPE REST and non-default webdav.root issue
- fixed in dCache 9.2.14 with new configuration option frontend.root
- current implementation requires a different frontend service for each VO with their own doors dcache#7506
- Performance markers missing for several dCache sites GGUS:165469, dCacheRT#10596
- FZK, NDGF, BNL
- Switch gfal default HTTP library from libneon to libcurl - done at CERN and BNL
- dCache: removing internal dependencies on SRM space manager -> transition to dCache quotas
- ATLAS - currently relies on multiple spacetokens from each site - to be discussed internally how to deal with a single quota space
- future question:
- make quota calculation more realtime
- automatic WLCG SRR with quota
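The FTS Optimizer item in the list above can be illustrated with a minimal sketch. It only encodes what the notes state: step size 1 in mode "1", step size 2 in mode "2", and transfers versus TCP streams as the number of actives in modes "2" and "3". The names `increase_step`, `count_actives` and `Transfer` are hypothetical, and the treatment of mode "3" step size and mode "1" actives is an assumption; this is not FTS code.

```python
from dataclasses import dataclass

# Defaults quoted in the notes; in FTS these are configurable through the
# "OptimizerIncreaseStep" and "OptimizerAggressiveIncreaseStep" settings.
CONSERVATIVE_STEP = 1
AGGRESSIVE_STEP = 2


@dataclass
class Transfer:
    streams: int = 1  # an HTTP transfer uses a single TCP stream


def increase_step(mode: int) -> int:
    """Step used when the optimizer raises the connection limit on a link."""
    # Assumption: mode 3 shares the aggressive step with mode 2.
    return CONSERVATIVE_STEP if mode == 1 else AGGRESSIVE_STEP


def count_actives(mode: int, transfers: list[Transfer]) -> int:
    """Number of actives on a link as described for modes 2 and 3."""
    if mode == 3:
        # Mode 3 counts TCP streams (meaningful for multi-stream GridFTP).
        return sum(t.streams for t in transfers)
    # Mode 2 counts transfers; doing the same for mode 1 is an assumption.
    return len(transfers)


http_link = [Transfer() for _ in range(5)]
gridftp_link = [Transfer(streams=4) for _ in range(5)]

# HTTP: both modes give the same number of actives.
print(count_actives(2, http_link), count_actives(3, http_link))        # 5 5
# Multi-stream GridFTP: mode 3 counts streams instead of transfers.
print(count_actives(2, gridftp_link), count_actives(3, gridftp_link))  # 5 20
```

With one stream per HTTP transfer the two counts coincide, which is why modes "2" and "3" behave almost identically for HTTP-TPC.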
Topics for next BDT meeting
- storage.stage implementation
- add reference to previous discussion and conclusions
- https://issues.infn.it/jira/browse/STOR-1605
- WLCG: stage is not a superset of read, users still need read to download data
- compliance tests
- XRootD (EOS) allows explicit configuration of authorization strategy (xrootd#2121) only since 5.7.0 (xrootd#2205)
- EOS can't be used safely with tokens (unless you are fine giving DESTROY privileges to all VO members)
- fixed in EOS 5.2.26(?) which supports scitoken with authorization_strategy = capability
- new problems with storage.create which doesn't work correctly (xrootd#2364)
- what are the "optimal" FTS staging queue size limits for different implementations?
- should FTS consider buffer size & file size in the queue?
- StoRM seems to struggle with a long queue (100k+)
- is there a maximum for dCache & CTA?
- should we still set a staging queue limit in FTS for dCache? Or can dCache deal internally with "infinite" staging queue size?
- Discussion about more robust FTS behavior with transfers from TAPE in case of full buffer
- it doesn't make sense to schedule new transfers when buffer in front of TAPE is full
- FTS would need more details in WLCG SRR to make more reasonable decisions
- P.V. note - more details in private email thread "control write throughput to tape buffer"
- DC27 archival metadata requirement
- TAPE family /dev/null for "Data Challenge" activity - we need more realistic T0 Export simulation
- e.g. for sites with complex topology like RAL
- other sites also use a disk buffer in front of tape which is not served by the same disknodes & filesystems as DATADISK
- topics with lower priority:
- XRootD client libraries don't implement Happy Eyeballs
- XRootD bug: case-sensitive HTTP header parsing (RFC2616) is not compatible with StoRM HTTP/2 support (HTTP headers in lowercase, RFC7540)
- HTTP/2 transferheaderauthorization is sent to the passive party as authorization header and XRootD doesn't recognize this header (returns authorization failure) - related to FTS upgrade to EL9 which comes with curl (used by gfal) with HTTP/2 support
- workaround - disable HTTP/2 support on the HTTP-TPC active party (StoRM)
- HTTP digest handling changed with RFC 9530 which obsoletes RFC 3230 xrootd#2211
- disable grid proxy delegation during FTS HTTP-TPC (by default?)
- Do we need condor_test_token equivalent for storage?
- dCache: authzdb -> multimap + omnisession migration issue#6607
- reminders about existing (HTTP-TPC) related issues
- dCache: HTTP-TPC performance markers and RemoteConnections dCache#7441
- different performance markers types? (start, connection established, connection closed, finish, ...)
- dCache: issue with xroot-tpc and new default XRootD SHA256 signatures dCache#7599
- StoRM: no support for RemoteConnections in performance markers (ticket?)
- StoRM: Forbidden TPC push transfers on gclouds platform (STOR-1563)
- StoRM: support for "stat" with storage.create and storage.modify STOR-1600
- StoRM: does this storage rely on sufficiently recent CaNL (GGUS:167085)
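The HTTP/2 header item in the list above comes down to header-name matching: HTTP header field names are case-insensitive, and HTTP/2 transmits them in lowercase. A minimal sketch of a case-insensitive lookup follows. The helper name `get_header` and the sample token value are hypothetical, and this is not XRootD or StoRM code.

```python
from typing import Optional


def get_header(headers: dict[str, str], name: str) -> Optional[str]:
    """Return the value of header `name`, ignoring the case of field names."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


# HTTP/1.1 style capitalisation versus HTTP/2 style lowercase field names.
http1_headers = {"Authorization": "Bearer example-token"}
http2_headers = {"authorization": "Bearer example-token"}

# A case-sensitive lookup misses the lowercase HTTP/2 header ...
assert http2_headers.get("Authorization") is None
# ... while the case-insensitive lookup handles both forms.
assert get_header(http1_headers, "Authorization") == "Bearer example-token"
assert get_header(http2_headers, "Authorization") == "Bearer example-token"
```

A server that normalises field names this way accepts the header regardless of whether the client negotiated HTTP/1.1 or HTTP/2.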
-
16:35
→
16:55
Transfers with tokens 20m
Speakers: Petr Vokac (Czech Technical University in Prague (CZ)), Francesco Giacomini (INFN CNAF)
-
16:55
→
17:05
Tape REST access 10m
Speaker: Mihai PATRASCOIU (CERN)
TAPE REST & Tokens
- WLCG JWT profile storage.stage discussion
- status of storage.stage:/ implementations?
ATLAS progress
- Found an issue with non-default webdav.root (issue#7506) - affected sites: PIC, IN2P3-CC, NDGF
- Status:
- sites with TAPE REST in production
- CTA: CERN, RAL
- dCache: FZK, DESY-HH (T2), BNL-OSG2
- sites with TAPE REST available
- dCache (but see issue#7506): IN2P3-CC, NDGF-T1, PIC
- StoRM: INFN-T1
- sites without configured TAPE REST
- dCache: SARA-MATRIX, TRIUMF-LCG2
- sites with old SE
- dCache 7.x: RRC-KI-T1
- we would like to move forward with TAPE REST deployment after DC24, details
- CERN-PROD (REST) - production (April 2023)
- BNL-OSG2 (REST) - production - still in test (February 2024)
- FZK-LCG2 (REST) - production (March 2023)
- IN2P3-CC (REST) - uses prefix /pnfs/in2p3.fr/data/atlas and needs dCache 9.2.14
- INFN-T1 (REST) - ready to be tested
- NDGF-T1 (REST) - they would like to test ENDIT with SRM before moving to REST (January 2024), NDGF uses prefix /pnfs/ndgf.org/data and needs dCache 9.2.14
- PIC (REST) - uses prefix /pnfs/pic.es/data/atlas and needs dCache 9.2.14, complicated site shared with IFAE T2 (multiple gplazma) and they'll also use multiple frontend services
- RAL-LCG2 (REST) - production (~ May 2023)
- RRC-KI-T1 (REST) - old dCache 7.x
- SARA-MATRIX (REST) - REST JSON not yet configured
- TRIUMF-LCG2 (REST) - REST JSON not yet configured, uses prefix /pnfs/triumf.ca/data and needs dCache 9.2.14
-
17:05
→
17:15
Packet marking 10m
Speakers: Marian Babik (CERN), Shawn Mc Kee (University of Michigan (US))
-
17:15
→
17:25
WebDAV Error Message Improvement Project & unified error message format 10m
Discuss with experts improvements in the error messages produced by failed transfers.
https://twiki.cern.ch/twiki/bin/view/LCG/WebdavErrorImprovement
Speaker: Stephan Lammel (Fermi National Accelerator Lab. (US))
-
17:25
→
17:30
AOB 5m