Post-mortem from ATLAS about the tape issue in November
- Recalls from tape were behaving slower than expected
- Three possible use cases that involve tapes
- Tape => Disk (buffer)
- Tape => Disk on a different space token
- Tape => Other site
- Access is ~random, since it is up to the physicists to decide which datasets to access
- "Fair share" not completely balanced, since CERN always has a replica of the raw files
- Timeouts made requests bounce between sites
- Additionally, some of these requests were "forgotten" after the timeout was triggered
- [FTS-806] Issue an SRM abort even if the bring online get status fails (see the sketch after this list)
- Backlog increases, degrading performance even more
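A minimal sketch of the behaviour tracked in FTS-806, assuming hypothetical helper names (poll_bring_online, srm_abort) rather than the real FTS internals: the point is that the SRM abort is issued even when the bring-online status poll itself fails, so requests are not left "forgotten" on the tape system:

    # Hedged sketch of the FTS-806 behaviour: always abort the staging request,
    # even if the status poll itself raises an error. Function names are
    # hypothetical stand-ins for the real FTS/SRM calls.

    def poll_bring_online(surl, token):
        """Placeholder: poll the tape system for the staging status of one file."""
        raise TimeoutError("get-status timed out")  # simulate the failure mode seen in the post-mortem

    def srm_abort(surl, token):
        """Placeholder: issue an SRM abort for a staging request."""
        print("ABORT %s (token=%s)" % (surl, token))

    def finalize_staging(surl, token):
        try:
            return poll_bring_online(surl, token)
        except Exception:
            # Before FTS-806 the request could be dropped here and stay
            # "forgotten" on the tape system; the fix is to abort it anyway.
            srm_abort(surl, token)
            raise

    if __name__ == "__main__":
        try:
            finalize_staging("srm://tape-endpoint.example.org/atlas/raw/file.root", "req-123")
        except TimeoutError:
            pass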
In summary, this combination of factors raises the questions:
- What can be expected?
- How is the overall system behaving? (i.e. files/hour, GB/hour, distribution of staging times...) The answers would make it possible to
- Identify issues
- Manage expectations
- Throttle requests
- This is particularly important as experiments are being pushed to reduce their disk usage and to rely more on tape
Metrics FTS can provide
- Specific metrics of the tape system do not provide comparable results
- Difficult to differentiate between VOs
- The interesting metrics are the ones mentioned before: throughput (in files or bytes) and waiting times
- FTS + ES/Kibana stack can probably answer these with some enrichment of the FTS messages (see the sketch at the end of this section)
- [FTS-828] will group actions to be taken by the FTS team
- Can wait for FTS 3.6
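As an illustration of the kind of aggregation the enriched FTS messages could feed into ES/Kibana, a small self-contained sketch that derives files/hour, GB/hour and a staging-time distribution from a list of records; the field names (submitted, staged, size_bytes) are assumptions, not the actual FTS message schema:

    from collections import Counter
    from datetime import datetime

    # Hypothetical enriched staging records; field names are assumptions.
    records = [
        {"submitted": "2016-11-07T10:00:00", "staged": "2016-11-07T11:30:00", "size_bytes": 4 * 1024**3},
        {"submitted": "2016-11-07T10:05:00", "staged": "2016-11-07T14:20:00", "size_bytes": 2 * 1024**3},
        {"submitted": "2016-11-07T10:10:00", "staged": "2016-11-08T02:00:00", "size_bytes": 8 * 1024**3},
    ]

    def parse(ts):
        return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")

    # Waiting time per file and overall observation window, both in hours
    wait_hours = [(parse(r["staged"]) - parse(r["submitted"])).total_seconds() / 3600.0 for r in records]
    window_hours = (max(parse(r["staged"]) for r in records) -
                    min(parse(r["submitted"]) for r in records)).total_seconds() / 3600.0

    files_per_hour = len(records) / window_hours
    gb_per_hour = sum(r["size_bytes"] for r in records) / 1024**3 / window_hours

    # Distribution of staging times, binned per hour
    distribution = Counter(int(h) for h in wait_hours)

    print("files/hour: %.2f, GB/hour: %.2f" % (files_per_hour, gb_per_hour))
    print("staging time distribution (hours -> files):", dict(distribution))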
CTA
- ATLAS and LHCb do not expect issues using XRootD for staging
- Both use FTS for staging, so FTS will hide the details anyway
- CMS needs to double-check whether their workflow may be affected
- Interesting question from David Cameron: how to combine XRootD for staging, and GridFTP for transfers?
- Likely by splitting staging and transfer jobs, since it is not possible to have an FTS job switch protocols depending on the task
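A possible shape of the split, sketched with the FTS3 REST "easy" Python bindings as remembered here (new_transfer, new_job, submit and the bring_online/copy_pin_lifetime parameters should be checked against the installed bindings; endpoint and URLs are made up): one staging-only job over xroot, then a separate transfer job over GridFTP once the file is on the disk buffer:

    import fts3.rest.client.easy as fts3

    # Hypothetical endpoint and URLs, for illustration only
    context = fts3.Context("https://fts3-example.cern.ch:8446")

    src_xroot   = "root://tape-endpoint.example.org//atlas/raw/file.root"
    src_gridftp = "gsiftp://tape-endpoint.example.org//atlas/raw/file.root"
    dst_gridftp = "gsiftp://disk-endpoint.example.org//atlas/raw/file.root"

    # Job 1: staging only. Source and destination are the same SURL; bring_online
    # and copy_pin_lifetime ask FTS to recall the file from tape and pin it.
    staging_job = fts3.new_job([fts3.new_transfer(src_xroot, src_xroot)],
                               bring_online=28800, copy_pin_lifetime=3600)
    staging_id = fts3.submit(context, staging_job)

    # Job 2: the actual transfer over GridFTP, to be submitted once the staging
    # job has finished (e.g. after polling fts3.get_job_status(context, staging_id)).
    transfer_job = fts3.new_job([fts3.new_transfer(src_gridftp, dst_gridftp)])
    transfer_id = fts3.submit(context, transfer_job)

    print(staging_id, transfer_id)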
AOB
- Current staging bulk size (1k) is too small for tape systems
- This limit was introduced due to an explicit CASTOR request
- GGUS #115757, comment 9
- Ideally it would be nice to have bulk requests of about 1k files; this will reduce the processing time on our side and avoid triggering timeouts and errors.
- [FTS-309] Reduce bring online bulk request to 1k
- dCache agreed on the bulk size of 1k
- In any case, FTS 3.5 allows configuring all these parameters, so there is no need to do a new release to increase the bulk size/concurrent requests (FTS-422)
- Agree with the storage systems on a reasonable value for these (FTS-829)
- Grouping staging requests by dataset was suggested to ATLAS
- It was also noted that requesting more bytes for staging than are available on the disk buffer would cause thrashing, so this is something to take into account
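A small self-contained sketch of the request shaping discussed above, with made-up data structures and buffer size: group staging requests by dataset, cut them into bulks of a configurable size, and stop queuing once the requested bytes would exceed the free space on the disk buffer, to avoid thrashing:

    from itertools import islice

    BULK_SIZE = 1000                  # current bring-online bulk size (configurable since FTS 3.5)
    DISK_BUFFER_BYTES = 50 * 1024**4  # hypothetical free space on the disk buffer

    # Hypothetical staging queue: (dataset, file SURL, size in bytes)
    queue = [
        ("data16.raw.0001", "srm://tape.example.org/atlas/raw/a1", 4 * 1024**3),
        ("data16.raw.0001", "srm://tape.example.org/atlas/raw/a2", 4 * 1024**3),
        ("data16.raw.0002", "srm://tape.example.org/atlas/raw/b1", 8 * 1024**3),
    ]

    def build_bulks(queue, bulk_size=BULK_SIZE, buffer_bytes=DISK_BUFFER_BYTES):
        """Group by dataset, cut into bulks, stop before the disk buffer would overflow."""
        by_dataset = {}
        for dataset, surl, size in queue:
            by_dataset.setdefault(dataset, []).append((surl, size))

        bulks, requested = [], 0
        for dataset in sorted(by_dataset):
            files = iter(by_dataset[dataset])
            while True:
                chunk = list(islice(files, bulk_size))
                if not chunk:
                    break
                chunk_bytes = sum(size for _, size in chunk)
                if requested + chunk_bytes > buffer_bytes:
                    return bulks  # do not request more than fits on the buffer
                requested += chunk_bytes
                bulks.append((dataset, [surl for surl, _ in chunk]))
        return bulks

    for dataset, surls in build_bulks(queue):
        print(dataset, len(surls), "files")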