CTA deployment meeting
Repack Status and plans for September production:
- Cédric's slides are attached to Indico.
- The metadata-bound performance problem starts to have an impact when expanding tapes with O(100K) files. Work to fix this is ongoing, but it is not seen as a blocker for deploying repack.
- Production validation for September is now the priority:
- While work is ongoing for backstop, a simple submission wrapper should be used for production validation of repacking multiple tapes in parallel. The wrapper should just loop and periodically check for free space on the Repack disk pool, refraining from submitting new tapes while a given occupancy threshold is crossed. A simplified version of Daniele Kruse's CASTOR wrapper is more than enough for this.
- Cédric and Vlado should sit together and set up a reasonably sized Repack SSD EOS tape server disk pool, write the wrapper, and run a scale test involving multiple tape drives in parallel.
- Tape repair workflow:
- Inserting retrieved files via low-level recover still applies to CTA and will be done as previously agreed (creating the file in the repack EOS pool). This has lower priority than production validation described above.
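The submission wrapper described above could look roughly like the following. This is a minimal sketch, not Daniele Kruse's CASTOR wrapper: the function name, the 0.8 threshold, the 300 s polling period, and the two callables (`get_occupancy`, returning the fraction of the Repack pool in use, and `submit_repack`, submitting one tape) are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, List


def submit_with_backoff(
    tapes: Iterable[str],
    get_occupancy: Callable[[], float],    # hypothetical: pool fraction in use, 0.0 to 1.0
    submit_repack: Callable[[str], None],  # hypothetical: submits one tape for repack
    threshold: float = 0.8,                # assumed occupancy threshold
    poll_interval_s: float = 300.0,        # assumed polling period
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Submit tapes one by one, pausing while the pool is at or above the threshold."""
    submitted: List[str] = []
    for vid in tapes:
        # Refrain from submitting new tapes while the occupancy threshold is crossed.
        while get_occupancy() >= threshold:
            sleep(poll_interval_s)
        submit_repack(vid)
        submitted.append(vid)
    return submitted
```

In practice `get_occupancy` would query the EOS pool and `submit_repack` would call the CTA repack submission command; both are left as callables here so that no command syntax is assumed. Since a freshly submitted tape does not fill the pool immediately, the threshold would need enough headroom for the tapes already in flight.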
FTS:
- Input from Andrea:
- We have contacted the ATLAS Rucio team to discuss how to pass StorageClasses and Hints. For the Hints they would like more freedom (not limited to 2 hints and a 64-character length); discussion is ongoing.
- We implemented a first version of the xrootd prepare evict in the gfal2 xrootd plugin.
- Agreeing on the format for StorageClass information and handing it over to CTA is critical, while there is still room for discussion on Hints. As suggested by Eric, a specific hint table (with archive files pointing to it) could be created in the CTA catalogue once the exact requirements for hints are agreed upon.
- Regarding XROOT prepare, a bug on the xrootd server (OFS) affects non-EOS backends (as EOS treats it differently); Steve will follow up with Andy. In terms of "prepare" functionality, we have the following in CTA:
- prepare without arguments: NOOP for CTA
- prepare -s: stage " "
- prepare -c: cancel " "
- prepare -a: abort " "
- FTS "m" bit support should be ready in September so that the Rucio team can integrate this functionality in their workflow.
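The CTA "prepare" semantics listed above amount to a small lookup table. The sketch below only restates that list; the enum, the table, and the function name are hypothetical and do not correspond to any CTA or XRootD API.

```python
from enum import Enum
from typing import Optional


class PrepareAction(Enum):
    NOOP = "noop"
    STAGE = "stage"
    CANCEL = "cancel"
    ABORT = "abort"


# Flag-to-action table as recorded in the minutes (None means no argument).
PREPARE_ACTIONS = {
    None: PrepareAction.NOOP,
    "-s": PrepareAction.STAGE,
    "-c": PrepareAction.CANCEL,
    "-a": PrepareAction.ABORT,
}


def cta_prepare_action(flag: Optional[str] = None) -> PrepareAction:
    """Return the action CTA takes for a given prepare flag."""
    try:
        return PREPARE_ACTIONS[flag]
    except KeyError:
        raise ValueError(f"prepare flag not covered by the minutes: {flag!r}")
```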
Input from Michael:
My updates to action list:
1. Namespace split-up: agreed with Cédric 05/06/2019. See summary in last week's slides. Giuseppe to validate that files under /castor/cern.ch/atlas/atlascerngroupdisk are safely stored in EOS and don't need to be migrated. DONE
2. Georgios is doing a preliminary analysis that should eventually yield a set of metrics allowing us to build an economic model of colocation (measuring the cost/benefit of different optimisation strategies).
New items:
1. Compile EOS version with required changes: gRPC API, new checksum protobuf format. This is done in EOS v4.5.0. (Also includes XRootD 4.10 and prepare request tracking, though these features are not required for migration) DONE
2. Merge CTA schema changes into master. I have rebased my branch on master, made required changes and will complete testing today. Will coordinate with Julien to merge back into master as he plans to do a release before the merge. Aim to have this done by Monday 1/7/2019.
3. Review final DB schema for migration. Deadline 3/7/2019 (before Giuseppe goes on holiday).
4. Create DB migration tools for ATLAS instance: alter schema and convert checksums and uid/gid to new format.
5. Update EOS namespace injection tools to use new gRPC API.
6. Small-scale metadata migration to validate all tools and workflow for the migration, including handling failure modes.
7. Milestone: CASTOR DB to be moved to new hardware. Propose to move CTA ATLAS DB during the same maintenance window. Date to be set by the DB team; I believe it will be around 15 July. Giuseppe is coordinating with the DB team.
8. Milestone: Week 22-26 July: Full-scale ATLAS migration test (metadata only, no tapes). This is a functional and performance test. It will allow us to accurately estimate the time needed to do the real migration and to consider if we need to make any further optimisations.
Update from Eric on backstop/backpressure status (timelines for September to be added):
Feature | Area | Status |
Disk system list (C++ struct): description of a disk system (name, regex to match file URLs, URL to query the free space). | Catalogue | Preliminary |
Disk system list management: storing and managing the disk system list. | Catalogue, frontend | To be done, pending the definitive C++ struct |
Support in retrieve request: attach the file system name. | Objectstore | Preliminary |
Support in retrieve queue: keep track of the file system name for queued requests. | Objectstore | To do |
Space allocation tracking object: keep track of the space committed but not yet used, per disk system. | Objectstore | To do |
Support in queuing: classify requests, add info in queue. | Scheduler | Preliminary |
Support in popping (the main part): integrate the querying of the space tracker, possibly the disk system, and requeue the requests in case of failure. | Scheduler | To do |
Support in retrieve mounts: keep track of (temporarily) full disk file systems. | Objectstore+scheduler | To do |
Support in mount scheduling: skip mounts for which we found no space (sleep the mount 15 minutes). | Objectstore+scheduler | To do |
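To make the pieces in the table concrete, here is an illustrative Python sketch of how they could fit together. It is not the CTA C++ code. The names `DiskSystem`, `SpaceTracker`, `classify` and `pop_requests`, the `query_free_bytes` callable, and the choice to let requests that match no disk system bypass backpressure are all assumptions; only the struct fields, the committed-space tracking, the requeue on failure, and the 15-minute sleep come from the table.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

MOUNT_SLEEP_S = 15 * 60  # "sleep the mount 15 minutes"


@dataclass(frozen=True)
class DiskSystem:
    name: str
    file_regex: str            # regex matched against destination file URLs
    free_space_query_url: str  # URL to query for the free space


@dataclass
class SpaceTracker:
    """Space committed but not used yet, per disk system (bytes)."""
    committed: Dict[str, int] = field(default_factory=dict)

    def reserve(self, name: str, nbytes: int) -> None:
        self.committed[name] = self.committed.get(name, 0) + nbytes


def classify(dst_url: str, disk_systems: List[DiskSystem]) -> Optional[str]:
    """Return the name of the first disk system whose regex matches the URL."""
    for ds in disk_systems:
        if re.search(ds.file_regex, dst_url):
            return ds.name
    return None


Request = Tuple[str, int]  # (destination URL, file size in bytes)


def pop_requests(
    requests: List[Request],
    disk_systems: List[DiskSystem],
    tracker: SpaceTracker,
    query_free_bytes: Callable[[str], int],  # hypothetical free-space query
    now: float,
) -> Tuple[List[Request], List[Request], Dict[str, float]]:
    """Split requests into popped and requeued; record sleeping disk systems."""
    by_name = {ds.name: ds for ds in disk_systems}
    popped: List[Request] = []
    requeued: List[Request] = []
    sleep_until: Dict[str, float] = {}
    for dst_url, size in requests:
        name = classify(dst_url, disk_systems)
        if name is None:
            popped.append((dst_url, size))  # assumption: no backpressure applied
            continue
        if name in sleep_until:
            requeued.append((dst_url, size))  # disk system already found full
            continue
        free = query_free_bytes(by_name[name].free_space_query_url)
        if size <= free - tracker.committed.get(name, 0):
            tracker.reserve(name, size)
            popped.append((dst_url, size))
        else:
            requeued.append((dst_url, size))
            sleep_until[name] = now + MOUNT_SLEEP_S
    return popped, requeued, sleep_until
```

Subtracting the committed space from the queried free space is what keeps several mounts from each claiming the same free bytes before any of them has written a file.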
Action list:
who | what | by when |
Eric | Agree with ATLAS on list of "activities" and configure via cta-admin. Deploy "activities" on ATLAS | 27/5 |
Cédric | Implement repacking taking into account disabled tapes and drive dedications | 30/5 |
Julien | Ensure CTA team is copied in exchanges with ATLAS and other experiments. | 24/5 |
Julien | talk to procurement and network people (to ensure all network infrastructure is in place when nodes arrive) | 30/5 |
Michael | Ensure that Georgios gets in touch with Luc to advance discussions on modelling collocation hints and assessing their usefulness. | 30/5 |
Julien/Andrea | Explicit stager_rm follow-up | 13/6 |
Andrea | Agree Rucio->FTS metadata format for collocation hints and storage classes | 13/6 |
Eric | Propose and discuss with the FTS team the format for receiving collocation hints (in addition to storage classes and activities) from FTS. | 13/6 |
Julien | Identify what is the right hardware to run migration | 13/6 |