Post-mortem from ATLAS about the tape issue in November
- Recalls from tape were behaving slower than expected
- Three possible use cases that involve tapes
- Tape => Disk (buffer)
- Tape => Disk on a different space token
- Tape => Other site
- Access is ~random, since it is up to the physicists to decide which datasets to access
- "Fair share" not completely balanced, since CERN always has a replica of the raw files
- Timeouts made requests bounce between sites
- Additionally, some of these requests were "forgotten" after the timeout was triggered
- [FTS-806] Issue an SRM abort even if the bring online get status fails (see the sketch after this list)
- Backlog increases, degrading performance even more
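A minimal sketch of the behaviour tracked in FTS-806, assuming hypothetical helper names (poll_bring_online, srm_abort) rather than the real FTS internals: the point is that the SRM abort is issued even when the bring-online status poll itself fails, so requests are not left "forgotten" on the tape system:

    # Hedged sketch of the FTS-806 behaviour: always abort the staging request,
    # even if the status poll itself raises an error. Function names are
    # hypothetical stand-ins for the real FTS/SRM calls.

    def poll_bring_online(surl, token):
        """Placeholder: poll the tape system for the staging status of one file."""
        raise TimeoutError("get-status timed out")  # simulate the failure mode seen in the post-mortem

    def srm_abort(surl, token):
        """Placeholder: issue an SRM abort for a staging request."""
        print("ABORT %s (token=%s)" % (surl, token))

    def finalize_staging(surl, token):
        try:
            return poll_bring_online(surl, token)
        except Exception:
            # Before FTS-806 the request could be dropped here and stay
            # "forgotten" on the tape system; the fix is to abort it anyway.
            srm_abort(surl, token)
            raise

    if __name__ == "__main__":
        try:
            finalize_staging("srm://tape-endpoint.example.org/atlas/raw/file.root", "req-123")
        except TimeoutError:
            pass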
In summary, this combination of factors raises the questions:
- What can be expected?
- How is the overall system behaving? (i.e. files/hour, GB/hour, distribution of staging times...) The answers would make it possible to
- Identify issues
- Manage expectations
- Throttle requests
- This is particularly important as experiments are being pushed to reduce their disk usage and to rely more on tape
Metrics FTS can provide
- Specific metrics of the tape system do not provide comparable results
- Difficult to differentiate between VOs
- The interesting metrics are the ones mentioned before: throughput (in files or bytes) and waiting times
- FTS + ES/Kibana stack can probably answer these with some enrichment of the FTS messages (see the sketch at the end of this section)
- [FTS-828] will group actions to be taken by the FTS team
- Can wait for FTS 3.6
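As an illustration of the kind of aggregation the enriched FTS messages could feed into ES/Kibana, a small self-contained sketch that derives files/hour, GB/hour and a staging-time distribution from a list of records; the field names (submitted, staged, size_bytes) are assumptions, not the actual FTS message schema:

    from collections import Counter
    from datetime import datetime

    # Hypothetical enriched staging records; field names are assumptions.
    records = [
        {"submitted": "2016-11-07T10:00:00", "staged": "2016-11-07T11:30:00", "size_bytes": 4 * 1024**3},
        {"submitted": "2016-11-07T10:05:00", "staged": "2016-11-07T14:20:00", "size_bytes": 2 * 1024**3},
        {"submitted": "2016-11-07T10:10:00", "staged": "2016-11-08T02:00:00", "size_bytes": 8 * 1024**3},
    ]

    def parse(ts):
        return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")

    # Waiting time per file and overall observation window, both in hours
    wait_hours = [(parse(r["staged"]) - parse(r["submitted"])).total_seconds() / 3600.0 for r in records]
    window_hours = (max(parse(r["staged"]) for r in records) -
                    min(parse(r["submitted"]) for r in records)).total_seconds() / 3600.0

    files_per_hour = len(records) / window_hours
    gb_per_hour = sum(r["size_bytes"] for r in records) / 1024**3 / window_hours

    # Distribution of staging times, binned per hour
    distribution = Counter(int(h) for h in wait_hours)

    print("files/hour: %.2f, GB/hour: %.2f" % (files_per_hour, gb_per_hour))
    print("staging time distribution (hours -> files):", dict(distribution))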
CTA
- ATLAS and LHCb do not expect issues using XRootD for staging
- Both use FTS for staging, so FTS will hide the details anyway
- CMS needs to double-check whether their workflow may be affected
- Interesting question from David Cameron: how to combine XRootD for staging, and GridFTP for transfers?
- Likely by splitting staging and transfer jobs, since it is not possible to have an FTS job switch protocols depending on the task
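A possible shape of the split, sketched with the FTS3 REST "easy" Python bindings as remembered here (new_transfer, new_job, submit and the bring_online/copy_pin_lifetime parameters should be checked against the installed bindings; endpoint and URLs are made up): one staging-only job over xroot, then a separate transfer job over GridFTP once the file is on the disk buffer:

    import fts3.rest.client.easy as fts3

    # Hypothetical endpoint and URLs, for illustration only
    context = fts3.Context("https://fts3-example.cern.ch:8446")

    src_xroot   = "root://tape-endpoint.example.org//atlas/raw/file.root"
    src_gridftp = "gsiftp://tape-endpoint.example.org//atlas/raw/file.root"
    dst_gridftp = "gsiftp://disk-endpoint.example.org//atlas/raw/file.root"

    # Job 1: staging only. Source and destination are the same SURL; bring_online
    # and copy_pin_lifetime ask FTS to recall the file from tape and pin it.
    staging_job = fts3.new_job([fts3.new_transfer(src_xroot, src_xroot)],
                               bring_online=28800, copy_pin_lifetime=3600)
    staging_id = fts3.submit(context, staging_job)

    # Job 2: the actual transfer over GridFTP, to be submitted once the staging
    # job has finished (e.g. after polling fts3.get_job_status(context, staging_id)).
    transfer_job = fts3.new_job([fts3.new_transfer(src_gridftp, dst_gridftp)])
    transfer_id = fts3.submit(context, transfer_job)

    print(staging_id, transfer_id)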
AOB
- Current staging bulk size (1k) is too small for tape systems
- This limit was introduced due to an explicit CASTOR request
- GGUS #115757, comment 9
- Ideally it would be nice to have bulk requests of about 1k files; this will reduce the processing time on our side and avoid triggering timeouts and errors.
- [FTS-309] Reduce bring online bulk request to 1k
- dCache agreed on the bulk size of 1k
- In any case, FTS 3.5 allows configuring all these parameters, so there is no need to do a new release to increase the bulk size/concurrent requests (FTS-422)
- Agree with the storage systems on a reasonable value for these (FTS-829)
- Grouping staging requests by dataset was suggested to ATLAS
- It was also noted that requesting more bytes for staging than are available on the disk buffer would cause thrashing, so this is something to take into account
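A small self-contained sketch of the request shaping discussed above, with made-up data structures and buffer size: group staging requests by dataset, cut them into bulks of a configurable size, and stop queuing once the requested bytes would exceed the free space on the disk buffer, to avoid thrashing:

    from itertools import islice

    BULK_SIZE = 1000                  # current bring-online bulk size (configurable since FTS 3.5)
    DISK_BUFFER_BYTES = 50 * 1024**4  # hypothetical free space on the disk buffer

    # Hypothetical staging queue: (dataset, file SURL, size in bytes)
    queue = [
        ("data16.raw.0001", "srm://tape.example.org/atlas/raw/a1", 4 * 1024**3),
        ("data16.raw.0001", "srm://tape.example.org/atlas/raw/a2", 4 * 1024**3),
        ("data16.raw.0002", "srm://tape.example.org/atlas/raw/b1", 8 * 1024**3),
    ]

    def build_bulks(queue, bulk_size=BULK_SIZE, buffer_bytes=DISK_BUFFER_BYTES):
        """Group by dataset, cut into bulks, stop before the disk buffer would overflow."""
        by_dataset = {}
        for dataset, surl, size in queue:
            by_dataset.setdefault(dataset, []).append((surl, size))

        bulks, requested = [], 0
        for dataset in sorted(by_dataset):
            files = iter(by_dataset[dataset])
            while True:
                chunk = list(islice(files, bulk_size))
                if not chunk:
                    break
                chunk_bytes = sum(size for _, size in chunk)
                if requested + chunk_bytes > buffer_bytes:
                    return bulks  # do not request more than fits on the buffer
                requested += chunk_bytes
                bulks.append((dataset, [surl for surl, _ in chunk]))
        return bulks

    for dataset, surls in build_bulks(queue):
        print(dataset, len(surls), "files")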