Meeting with LHCb
LHCb has a requirement to transfer data from CTA to Tier-1s (reprocessing use case). These transfers are orchestrated by LHCb.
This would require 4-6 T1s to upgrade their infrastructure. The following options are on the table, ordered from most to least desirable from LHCb's point of view:
- Option 0: Rely on the DOMA timetable for infrastructure upgrades. Advantage: easy to implement; no changes to Dirac/FTS/gfal. Disadvantage: no guarantee that the upgrades will be done before Run-3.
- Option 1: Set up gridFTP gateways to talk to Tier-1s that don't support XRootD TPC with delegation. Advantage: no changes required at the T1s. Disadvantage: needs FTS development, as this makes the transfer multi-protocol.
- Option 2: Hard-coded multi-hop through big EOS instance. Advantage: does not rely on gridFTP. Disadvantage: needs FTS development. LHCb would have to track 2 transfers (unless this could be managed by FTS).
Eddie is part of a working group tracking the schedule for dCache upgrades at the T1s. He reports that the following T1s have already upgraded: fzk, ndgf, in2p3-cc, sara. All T1 sites are obliged to provide an alternative to gridFTP by the end of 2019.
The new version of gfal (with "query prepare") has been released and is in pilot; it will be deployed to production in around one week. The "check m-bit" feature will follow in a future release.
- Michael: determine how much buffer/bandwidth LHCb requires for the CTA to T1 use case. Can it be done with SSDs, or do we need to look at spinning disks as for ALICE? (In the October meeting this was stated as 1.5 GB/s.)
- Eddie will keep us updated on T1 upgrade schedule.
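The 1.5 GB/s figure from the October meeting can be turned into a rough buffer-sizing estimate. The sketch below is illustrative only: the retention window is an assumption to be confirmed with LHCb, not a number from the meeting.

```python
# Rough sizing of the CTA -> T1 export buffer (illustrative figures only).
RATE_GB_PER_S = 1.5      # sustained rate quoted in the October meeting
RETENTION_HOURS = 24     # assumed time files must stay on the buffer (to confirm)

buffer_tb = RATE_GB_PER_S * RETENTION_HOURS * 3600 / 1000
print(f"Buffer capacity for {RETENTION_HOURS}h at {RATE_GB_PER_S} GB/s: {buffer_tb:.1f} TB")
```

Whether this fits on SSDs or needs spinning disks then follows directly from the resulting capacity and the per-device throughput.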
CTA test status
The bug that stopped repack was fixed.
If the EOS instance runs out of disk space, the file being written is truncated during the flush to disk, but the CLOSEW event is executed anyway. In this case the file is written to tape but the archival subsequently fails because the size/checksum do not match. Ideally we would prevent the file from being queued for tape in the first place.
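The failure mode above is a mismatch between the recorded size/checksum and what actually reached disk. A minimal sketch of that consistency check, using adler32 (the checksum family EOS commonly uses); the function name is hypothetical, not the actual CTA code:

```python
import zlib

def verify_copy(data: bytes, expected_size: int, expected_adler32: int) -> bool:
    """Illustrative check: a truncated flush fails on size and/or checksum."""
    return len(data) == expected_size and zlib.adler32(data) == expected_adler32

original = b"event data" * 1000
size, cksum = len(original), zlib.adler32(original)

truncated = original[:1234]                       # disk filled up mid-flush
assert verify_copy(original, size, cksum)         # intact copy passes
assert not verify_copy(truncated, size, cksum)    # archival would fail here
```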
- Vlado: re-launch repack test
- Michael: follow up on disk full/CLOSEW issue with EOS developers
DB Schema Versioning
The DB schema for CTA v1.0 will be finalised on 6 Dec 2019. This includes the columns needed for future features like RAO+LTO.
Cédric is working on DB schema versioning.
The CTA catalogue will be wiped after the ATLAS recall test in January. The migration will be done onto a fresh install of the DB schema.
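One common pattern for schema versioning is a version table that the software checks at startup and refuses to run against a mismatched schema. A minimal sketch under that assumption, using SQLite and made-up table/column names (the real CTA catalogue schema and versioning scheme may differ):

```python
import sqlite3

EXPECTED_SCHEMA_VERSION = 1   # hypothetical; CTA's actual scheme may differ

def init_schema(conn):
    """Fresh install: create the version table and stamp it."""
    conn.execute("CREATE TABLE CATALOGUE_VERSION (SCHEMA_VERSION INTEGER)")
    conn.execute("INSERT INTO CATALOGUE_VERSION VALUES (?)",
                 (EXPECTED_SCHEMA_VERSION,))

def check_schema(conn):
    """Startup check: fail fast if the DB schema does not match the software."""
    (version,) = conn.execute(
        "SELECT SCHEMA_VERSION FROM CATALOGUE_VERSION").fetchone()
    if version != EXPECTED_SCHEMA_VERSION:
        raise RuntimeError(
            f"schema v{version}, software expects v{EXPECTED_SCHEMA_VERSION}")
    return version

conn = sqlite3.connect(":memory:")
init_schema(conn)
assert check_schema(conn) == 1
```

Wiping the catalogue and migrating onto a fresh install, as planned for January, corresponds to the `init_schema` path; in-place upgrades would need versioned update scripts on top.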
- Cédric: take a look at the DB schema versioning/update procedures in CASTOR
CTA version 1.0
Everything to be included in CTA version 1.0 should be committed to master by Friday 6 December. The process of tagging/creating RPMs etc. will be done w/c 9 December.
Testing: Oliver mentioned that various suites of test scripts and tools exist for DPM and gfal.
- Michael: rename CTA_* extended attributes to sys.archive.*
- Michael: go through all old tickets containing tests that should be done and create a test schedule to make sure we cover all corner cases (immutable files were mentioned, but there are others).
- Steve: create ticket for "delete file if there is a CTA error during the CLOSEW event" (to allow client retries).
- Julien: identify any missing EOS log messages needed to track the full lifecycle of each archive and retrieval request.
- Oliver: run test scripts against an EOSCTA endpoint to see what happens.
Documentation
Documentation has been reorganised into developer docs (LaTeX/PDF) and operator docs (mkdocs).
- Vlado: progressively update operator procedures
- Michael: follow up with Mélissa on CTA logo
- Michael: update CTA website. Include links to monitoring for management (see below).
- Julien: update tapeops website with links to monitoring for tape operators (see below).
Monitoring
There are 3 main use cases for monitoring:
- Management information: historical info, e.g. evolution of the size of data/number of files stored on tape
- Day-to-day operational information (tape operators)
- Performance management (developers)
The data collection for the first use case (management information) needs to be in place before we start taking physics data, i.e. before we migrate ATLAS.
There are several things missing from our current monitoring:
- Evolution of number of files/data size over time: we have this for CASTOR but need a synthesised plot that includes both CASTOR and CTA. There are two parts: (1) usage/activity, summarised per transfer by parsing logfiles; (2) total aggregated data on tape
- Evolution of media
- Occupancy of the tape pools
- Summary of how many tape mounts per day, by whom, etc.
- Weekly summaries: weekly dump of all tape info
- A dashboard like the one for CASTOR Stager activity
- Queue information
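Part (1) of the first missing item, usage summarised per transfer by parsing logfiles, amounts to aggregating key=value log lines. A minimal sketch with a made-up log format (the real EOS/CTA log fields differ):

```python
import re
from collections import defaultdict

# Hypothetical key=value log lines; real EOS/CTA logs have different fields.
LOG = """\
ts=2019-12-02T10:00:01 op=archive user=atlas bytes=1073741824
ts=2019-12-02T10:00:05 op=retrieve user=lhcb bytes=536870912
ts=2019-12-02T10:01:12 op=archive user=atlas bytes=2147483648
"""

# Aggregate transferred bytes per (operation, user).
totals = defaultdict(int)
for line in LOG.splitlines():
    fields = dict(re.findall(r"(\w+)=(\S+)", line))
    totals[(fields["op"], fields["user"])] += int(fields["bytes"])

assert totals[("archive", "atlas")] == 3 * 1024**3    # 3 GiB archived
assert totals[("retrieve", "lhcb")] == 512 * 1024**2  # 512 MiB retrieved
```

The second part (total aggregated data on tape) would instead come from periodic catalogue queries rather than log parsing.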
Daniele Lanza's monitoring (which goes to HDFS?) and Aurelian's collectd sensor exist for CASTOR; do these need to be redone for CTA?
- Michael: detail what monitoring we are missing (by use case)
- Michael: detail what scripts are running on which days (check the Rundeck tape instance) and what data sources they are using
- Julien: separate our current dashboards into the "tape operator" use case and "developer" use case and consolidate
- Julien: check access restrictions on Grafana: are they needed? Remove if not.
- Vlado: adapt tsmod-daily-report to report on CTA
Backup services
It was agreed that the CASTOR backup hardware decommissioning will not influence the migration schedule for CTA.
There are several things which need to be done before we can migrate the backup services to CTA:
- Management of encryption keys
- Scope bandwidth and space requirements for each backup service
- Migrate existing backup data (including splitting into several storage classes and repacking if not all backup users are to be migrated at once)
To be revisited next year.
- Section lunch will be Tue 17 Dec
- Michael: present CTA migration plan to the GDB on Wed 11 Dec: https://indico.cern.ch/event/739885/
There are minutes attached to this event.