CTA deployment meeting

Europe/Zurich
600/R-002 (CERN)

Repack status and plans for September production:

  • Cédric's slides are attached to Indico.
  • The metadata-bound performance problem starts to have an impact when expanding tapes containing O(100K) files. Work to fix this is ongoing, but it is not seen as a blocker for deploying repack.
  • Production validation for September is now the priority:
    • While work on the backstop is ongoing, a simple submission wrapper should be used for production validation of repacking multiple tapes in parallel. The wrapper should simply loop, periodically checking the free space on the Repack disk pool and refraining from submitting new tapes while a given occupancy threshold is exceeded. A simplified version of Daniele Kruse's CASTOR wrapper is more than enough for this (see the sketch after this list).
    • Cédric and Vlado should sit together to set up a reasonably sized SSD-backed EOS Repack disk pool on the tape servers, write the wrapper, and run a scale test involving multiple tape drives in parallel.
  • Tape repair workflow:
    • Inserting retrieved files via low-level recover still applies to CTA and will be done as previously agreed (creating the file in the repack EOS pool). This has lower priority than the production validation described above.
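
A minimal sketch of such a submission wrapper is below. All specifics are assumptions rather than agreed details: the occupancy query must be wired to the actual EOS monitoring of the Repack pool, and the cta-admin submission syntax is illustrative only.

    #!/usr/bin/env python3
    """Sketch of the repack submission wrapper discussed above."""
    import subprocess
    import sys
    import time

    OCCUPANCY_THRESHOLD = 0.80  # do not submit new tapes above this pool occupancy
    POLL_INTERVAL = 60          # seconds between occupancy checks

    def get_pool_occupancy(pool):
        """Return the used fraction [0.0, 1.0] of the repack disk pool.

        Placeholder: wire this to the EOS instance's monitoring output
        for the given pool.
        """
        raise NotImplementedError("query the EOS repack pool occupancy here")

    def submit_repack(vid):
        """Submit one tape for repacking (assumed cta-admin syntax)."""
        subprocess.run(["cta-admin", "repack", "add", "--vid", vid], check=True)

    def main(vids, pool="repack"):
        """Loop over the tapes, pausing whenever the pool is too full."""
        for vid in vids:
            while get_pool_occupancy(pool) > OCCUPANCY_THRESHOLD:
                time.sleep(POLL_INTERVAL)  # back off until space is freed
            submit_repack(vid)

    if __name__ == "__main__":
        main(sys.argv[1:])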

FTS:

  • Input from Andrea:
    • We have contacted the ATLAS Rucio team to discuss how to pass StorageClasses and Hints. For the Hints they would like more freedom (not limited to 2 hints of 64 characters each); the discussion is ongoing.
    • We implemented a first version of the xrootd prepare evict in the gfal2 xrootd plugin.
  • Agreeing on the format for StorageClass information, and handing it over to CTA, is critical, while there is still room for discussion on Hints. As suggested by Eric, a specific hint table (with archive files pointing to it) could be created in the CTA catalogue once the exact requirements for hints are agreed upon.
  • Regarding XROOT prepare, a bug in the xrootd server (OFS) affects non-EOS backends (EOS treats it differently); Steve will follow up with Andy. In terms of "prepare" functionality, CTA behaves as follows:
    • prepare without arguments: NOOP
    • prepare -s: stage
    • prepare -c: cancel
    • prepare -a: abort
  • FTS "m" bit support should be ready in September so that the Rucio team can integrate this functionality in their workflow.


Input from Michael:

My updates to the action list:

    1. Namespace split-up: agreed with Cédric on 05/06/2019. See summary in last week's slides. Giuseppe to validate that files under /castor/cern.ch/atlas/atlascerngroupdisk are safely stored in EOS and don't need to be migrated. DONE

    2. Georgios is doing a preliminary analysis which will eventually yield a set of metrics allowing us to create an economic model of collocation (measuring the cost/benefit of different optimisation strategies).

New items:

    1. Compile an EOS version with the required changes: gRPC API, new checksum protobuf format. This is done in EOS v4.5.0. (It also includes XRootD 4.10 and prepare request tracking, though these features are not required for the migration.) DONE

    2. Merge the CTA schema changes into master. I have rebased my branch on master, made the required changes, and will complete testing today. I will coordinate with Julien to merge back into master, as he plans to do a release before the merge. Aim to have this done by Monday 1/7/2019.

    3. Review final DB schema for migration. Deadline 3/7/2019 (before Giuseppe goes on holiday).

    4. Create DB migration tools for the ATLAS instance: alter the schema and convert checksums and uid/gid to the new format.

    5. Update EOS namespace injection tools to use new gRPC API.

    6. Small-scale metadata migration to validate all tools and the workflow for the migration, including handling of failure modes.

    7. Milestone: the CASTOR DB is to be moved to new hardware. Propose to move the CTA ATLAS DB during the same maintenance window. The date is to be set by the DB team (I believe it will be around 15 July); Giuseppe is coordinating with them.

    8. Milestone: week of 22-26 July: full-scale ATLAS migration test (metadata only, no tapes). This is a functional and performance test. It will allow us to accurately estimate the time needed for the real migration and to consider whether we need any further optimisations.



Update from Eric on backstop/backpressure status (timelines for September to be added):

Feature | Area | Status
Disk system list (C++ struct): description of a disk system: name, regex to match file URLs, URL to query the free space | Catalogue | Preliminary
Disk system list management: storing and managing the disk system list | Catalogue, frontend | To be done, pending the definitive C++ struct
Support in retrieve request: attach the disk system name | Objectstore | Preliminary
Support in retrieve queue: keep track of the disk system name for queued requests | Objectstore | To do
Space allocation tracking object: keep track of the space committed but not yet used, per disk system | Objectstore | To do
Support in queuing: classify requests, add the info to the queue | Scheduler | Preliminary
Support in popping (the main part): integrate the querying of the space tracker, and possibly of the disk system, and requeue the requests in case of failure | Scheduler | To do
Support in retrieve mounts: keep track of (temporarily) full disk systems | Objectstore + scheduler | To do
Support in mount scheduling: skip mounts for which no space was found (sleep the mount for 15 minutes) | Objectstore + scheduler | To do
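
To make the intended behaviour concrete, here is a rough Python sketch of the popping logic and its interaction with the space tracker and disk system list. All names are hypothetical (the real implementation lives in the C++ catalogue, objectstore and scheduler code); only the control flow follows the table.

    import time
    from dataclasses import dataclass, field

    SLEEP_ON_FULL = 15 * 60  # seconds to sleep mounts to a full disk system

    @dataclass
    class DiskSystem:
        """Mirrors the fields listed for the planned C++ struct."""
        name: str                   # unique name of the disk system
        file_regexp: str            # regex matching the file URLs it serves
        free_space_query_url: str   # URL to query for the current free space
        full_until: float = 0.0     # epoch time until which it is considered full

        def query_free_space(self):
            """Placeholder: query free_space_query_url and return free bytes."""
            raise NotImplementedError

    @dataclass
    class SpaceTracker:
        """Space committed to queued retrieves but not yet used, per disk system."""
        committed: dict = field(default_factory=dict)

        def commit(self, ds_name, size):
            self.committed[ds_name] = self.committed.get(ds_name, 0) + size

    def pop_for_mount(queue, tracker, disk_systems, batch_size):
        """Pop retrieve requests while honouring per-disk-system free space."""
        popped = []
        for request in queue.pop_batch(batch_size):
            ds = disk_systems[request.disk_system_name]
            if time.time() < ds.full_until:
                queue.requeue(request)   # system recently found full: skip it
                continue
            free = ds.query_free_space() - tracker.committed.get(ds.name, 0)
            if free < request.file_size:
                queue.requeue(request)   # no room: put the request back...
                ds.full_until = time.time() + SLEEP_ON_FULL  # ...and sleep the mount
                continue
            tracker.commit(ds.name, request.file_size)
            popped.append(request)
        return popped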


Action list:

Who | What | By when
Eric | Agree with ATLAS on the list of "activities" and configure it via cta-admin. Deploy "activities" on ATLAS. | 27/5
Cédric | Implement repacking taking into account disabled tapes and drive dedications. | 30/5
Julien | Ensure the CTA team is copied on exchanges with ATLAS and other experiments. | 24/5
Julien | Talk to procurement and network people (to ensure all network infrastructure is in place when the nodes arrive). | 30/5
Michael | Ensure that Georgios gets in touch with Luc to advance discussions on modelling collocation hints and assessing their usefulness. | 30/5
Julien/Andrea | Explicit stager_rm follow-up. | 13/6
Andrea | Agree the Rucio->FTS metadata format for collocation hints and storage classes. | 13/6
Eric | Propose and discuss with the FTS team the format for receiving collocation hints (in addition to storage classes and activities) from FTS. | 13/6
Julien | Identify the right hardware to run the migration. | 13/6


Agenda:
    • 14:00 - 14:20: Repack status and plans for September production (20m)
      Speakers: Cédric Caffy (CERN), Eric Cano (CERN)
    • 14:20 - 14:40: Status of deployment for September (20m)
      • Buffer management: backstop/backpressure, garbage collection, XROOT eviction, FTS integration
      • APIs for ATLAS (FTS and EOSCTA integration): Storage Classes, Activities, Hints
      • Anything missing? What releases will we produce, and when, between now and September?
    • 14:40 - 15:00: Review of actions, AOB, items for next meeting (20m)