Fallout from ATLAS 2018 reprocessing
Due to a Rucio misconfiguration, a number of external sites were requesting CTAPPS files via other FTS instances. These sites were blacklisted on Tuesday.
Processing of FTS cancel requests resulted in "permission denied" errors from CTA. In addition, the CTA front-end logs cancel requests, but the cancel logic is not yet implemented in the tape servers. Unless FTS resubmits the files immediately, this can cause the disk buffer to overflow quickly. It is therefore important to implement cancel functionality in CTA (cf. https://gitlab.cern.ch/cta/CTA/issues/252).
FTS polling and gc'd files: Currently, FTS is not aware of files that it requested but that were garbage collected before they could be transferred out of EOSCTA. These files appear to FTS as still not staged. To overcome this problem, FTS needs to identify whether a staging request is still ongoing by examining the sys.retrieve.req_id extended attribute. If no staging is ongoing, FTS should re-submit the staging request. This change, however, requires adaptations in the state machine logic of FTS, as polling and submission are handled by separate components (daemons).
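The polling decision described above can be sketched as follows. This is a minimal illustration, not FTS code: the function names, the returned state strings, and the `resubmit` callback are hypothetical, and the sketch assumes that sys.retrieve.req_id is absent or empty when no retrieve request is outstanding. How the attribute is actually read from EOSCTA is not shown.

```python
from typing import Callable, Mapping

# Attribute name taken from the minutes; everything else is illustrative.
RETRIEVE_REQ_ATTR = "sys.retrieve.req_id"


def staging_in_progress(xattrs: Mapping[str, str]) -> bool:
    """Return True if the file carries a non-empty retrieve request id.

    Assumption: the attribute is missing or empty once no request is pending.
    """
    return bool(xattrs.get(RETRIEVE_REQ_ATTR, "").strip())


def poll_file(on_disk: bool,
              xattrs: Mapping[str, str],
              resubmit: Callable[[], None]) -> str:
    """Decide what a poller should do for one requested file."""
    if on_disk:
        # The file is in the disk buffer and can be transferred out.
        return "staged"
    if staging_in_progress(xattrs):
        # A retrieve request is still queued: keep polling.
        return "staging"
    # Not on disk and no request pending: the file was garbage collected
    # or the request was lost, so the staging request is submitted again.
    resubmit()
    return "resubmitted"
```

In FTS the `resubmit` step would have to cross from the polling daemon to the submission daemon, which is where the state machine changes mentioned above come in.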
Grey data: The lack of cancelling exacerbates the problem of "grey" data, i.e. files in the process of being recalled but in which the client has lost interest: they will be neither read nor evicted by any client. Grey data can build up quickly and end up occupying most of the comparatively small CTA disk buffer. FST garbage collection is currently only triggered when both occupancy and file age exceed defined thresholds, which means that grey data is cleaned up only slowly. After discussion, the team agreed that the FST GC should be configurable to clean up files based solely on age (RFE to be created - Julien).
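The difference between the current trigger and the proposed age-only mode can be sketched as below. This is an illustration only: `GcPolicy`, `should_evict`, and the threshold values are hypothetical and do not reflect the actual FST GC configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GcPolicy:
    """Hypothetical GC settings; the field names are illustrative."""
    max_age_s: float                      # files older than this are candidates
    occupancy_threshold: Optional[float]  # None selects the age-only mode


def should_evict(file_age_s: float, occupancy: float, policy: GcPolicy) -> bool:
    """Return True if a file should be garbage collected under the policy."""
    old_enough = file_age_s > policy.max_age_s
    if policy.occupancy_threshold is None:
        # Proposed mode: age alone decides.
        return old_enough
    # Current behaviour: both age and buffer occupancy must exceed thresholds.
    return old_enough and occupancy > policy.occupancy_threshold


current = GcPolicy(max_age_s=3600, occupancy_threshold=0.9)
age_only = GcPolicy(max_age_s=3600, occupancy_threshold=None)

# A two-hour-old grey file in a half-full buffer:
print(should_evict(7200, 0.5, current))   # False: occupancy is below threshold
print(should_evict(7200, 0.5, age_only))  # True: age alone is sufficient
```

With the age-only policy, grey files are removed as soon as they are old enough, regardless of how full the buffer is.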
Addressing inconsistencies due to lost file staging requests: Staging requests may be lost within CTA in the case of object store problems, cleanups (as done by Julien to alleviate grey data occupancy), or software bugs. This leads to inconsistencies between EOS and CTA, as EOS is not aware of lost CTA requests. Re-submitting prepare requests does not help, because repeated prepare requests for the same file are not propagated to CTA (until an expiration threshold, typically one week). In principle, prepare requests could be changed to always be propagated to CTA. However, for query requests, this is not desirable given the polling load generated by e.g. FTS. Should a request table be kept inside EOS or CTA? Eric will consider options and make a proposal.
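The suppression of repeated prepare requests, and why it hides lost requests, can be sketched as follows. This is a simplified model under stated assumptions: `should_propagate` and the timestamp bookkeeping are hypothetical, and only the one-week expiration value comes from the discussion above.

```python
from datetime import datetime, timedelta
from typing import Optional

# Typical expiration threshold quoted in the minutes.
EXPIRATION = timedelta(weeks=1)


def should_propagate(last_propagated: Optional[datetime],
                     now: datetime,
                     expiration: timedelta = EXPIRATION) -> bool:
    """Return True if a prepare request for a file is forwarded to CTA.

    Assumption: only the time of the last forwarded request is recorded,
    so there is no way to tell whether CTA still holds that request.
    """
    if last_propagated is None:
        return True  # first request for this file
    return now - last_propagated >= expiration


first = datetime(2019, 1, 1, 12, 0)

# CTA loses the request; the client retries one day later.
print(should_propagate(first, first + timedelta(days=1)))  # False: suppressed
# Only after the threshold has passed does a retry reach CTA again.
print(should_propagate(first, first + timedelta(days=8)))  # True
```

A request table, whether kept in EOS or in CTA, would replace the timestamp check with a lookup of whether the request still exists.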