CTA deployment meeting

Europe/Zurich
600/R-001 (CERN)

Michael Davis (CERN)

CTA Development Priorities

Discussion on Fri 9 October. Present: Cédric, Julien, Michael, Steve.

We agreed the following priorities for the coming weeks for tasks related to repack:

  1. Injection of recovered files when repacking broken tapes. Change mount policy on-the-fly.

These features have been implemented by Cédric. They should be deployed and tested.

    2. Tape Server: Add configuration options for maintenance process.

        To give greater control over the highly distributed maintenance process:

a. Add an option to disable repack operations on a per-tapeserver basis, so we can control which tape servers are handling repack jobs.

b. Add an option to disable the maintenance process entirely, so we can run maintenance jobs on a smaller cluster of tape servers. This means that when we are tracing errors, we only have to check the logs of one or two dozen tape servers instead of a couple of hundred.

    3. Remove "superseded" files in favour of the recycle bin.

This is a simplification: instead of two ways of handling deleted and repacked files, we settle on one mechanism, the recycle bin. "Superseded" should be removed.

Files in the recycle bin will not appear in tape listings using the normal tools ("cta-admin tapefile ls" etc.) and there will be no option to list them.

This task includes separate operator tools to list files in the recycle bin and to delete them. (Q. Should operators be able to manually delete files from the recycle bin or should this be done automatically when tapes are reclaimed?)

In future we will want a tool to allow reinjection of deleted files from the recycle bin, but this is not an immediate priority, as it can be done manually by a developer if necessary. We will develop the tool when the need arises.

    4. Investigate if we can run a separate repack instance sharing the same catalogue.

There are a number of problems caused by mixing repack with normal operations, because there is no separation of repack queues from the normal retrieve queues. It would be a big development effort to add separate repack queues.

An alternative proposal is to run a separate repack instance, which has its own object store and dedicated tape drives. Only the catalogue and the tapes would need to be shared with the "main" production CTA instance.

This proposal needs to be investigated to confirm it is feasible, to propose a solution to contention for tapes between the main instance and repack instance, and to identify any other problems that may arise.

In parallel, Steve will work on solving the garbage collector problems identified in this week's tests.

 

Q. Should operators be able to manually delete files from the recycle bin or should this be done automatically when tapes are reclaimed?

This is already the case: if a user deletes all the files from a tape and reclaims it, the files are deleted from the recycle bin. Cédric believes we should do the same for repacked tapes.

In this case we don't need a tool for operators to delete files from the recycle bin manually. (At least, let's not invest effort in that until someone comes up with a use case. Just listing the files in the recycle bin is enough for now.)
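
The reclaim behaviour described above can be pictured with a small sketch. It only illustrates the proposed policy and is not CTA code: the `Catalogue` class, its fields and the `reclaim` method are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Catalogue:
    """Hypothetical in-memory stand-in for the catalogue (not the real CTA schema)."""
    # tape VID -> set of archive file IDs with a live copy on that tape
    live_files: dict = field(default_factory=dict)
    # tape VID -> set of archive file IDs held in the recycle bin for that tape
    recycle_bin: dict = field(default_factory=dict)

    def reclaim(self, vid: str) -> int:
        """Reclaim a tape and return the number of recycle-bin entries purged."""
        if self.live_files.get(vid):
            # A tape that still holds live files cannot be reclaimed.
            raise ValueError(f"tape {vid} still holds live files")
        # Reclaiming the tape also drops its recycle-bin entries,
        # so no separate operator delete tool is needed.
        purged = self.recycle_bin.pop(vid, set())
        return len(purged)


# Example with made-up tape and file identifiers.
catalogue = Catalogue(
    live_files={"V00001": set()},
    recycle_bin={"V00001": {101, 102, 103}},
)
print(catalogue.reclaim("V00001"))  # 3
```

Under this assumption the same code path would serve both user-emptied tapes and repacked tapes, which is the behaviour Cédric suggests.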

    • 14:00 14:10
      ALICE Post-migration Issues 10m
      • ALICE went into production on schedule on Monday 12 October. Well done everyone!
      • Michael: Namespace reconciliation (1.8 million files in Alien are not in CTA)
      • David: ALICE probe: ALICE authentication relies on the xrootd-alicetokenacc RPM and secret magic sauce from Andreas, see #105 and #666.
      • Steve: Implicit prepare
      • Steve: Garbage collector, see #905

      ALICE schedule: Fill buffer to 2.5 PB, carefully fill the rest without evicting anything until the first bunch is fully reconstructed. Then stage a bit more and the GC will have to kick in (mid-November).

    • 14:10 14:20
      Getting LHCb into Production 10m
      • See Putting EOSCTALHCB into Production (CodiMD)
      • Successful 200 TB test with 10 tape drives and one buffer server for 2.5 GB/s of constant archival throughput. Chris will now recall the files.
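
For orientation, the test figures imply the following per-drive rate and duration. This is a back-of-the-envelope sketch, assuming decimal units (1 TB = 1000 GB) and that the full 200 TB was written at the quoted constant rate; neither assumption is stated in the minutes.

```python
# Figures taken from the minutes; decimal units are an assumption.
total_tb = 200          # data archived in the test
throughput_gb_s = 2.5   # aggregate archival throughput
drives = 10             # tape drives used

per_drive_mb_s = throughput_gb_s * 1000 / drives
duration_h = total_tb * 1000 / throughput_gb_s / 3600

print(f"{per_drive_mb_s:.0f} MB/s per drive, about {duration_h:.1f} hours in total")
# prints: 250 MB/s per drive, about 22.2 hours in total
```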

      TO DO/Schedule

      • TPC test from EOS LHCb to T1s
      • DAQ test should be possible in November (tentatively 9 Nov 2020)
    • 14:20 14:30
      Getting CMS into Production 10m

      TO DO

      • Michael/Giuseppe: clean up namespace, identify any tapes which need to be repacked
      • Recall test
      • Test of the FTS "file safely on tape" check (not a blocker)

      Schedule

      • 20 Oct 2020: Recall test (not a scale test, recall some datasets from various eras of the experiment)
      • w/c 26 Oct 2020? Put CASTOR into read-only
      • 02 Nov 2020: Migrate to CTA
      • 09 Nov 2020: CMS in production
      • 18 Nov 2020: MWGR#4 with T0 using Rucio
      • 29 Nov 2020: CMS retires PhEDEx
    • 14:30 14:40
      Getting PUBLIC into Production 10m

      NA62

      • NA62 repack is proceeding
      • NA62 offline integration: Vova has created the VM with AFS+EOS to allow the tests to proceed
      • NA62 migration test onto EOSCTA PPS, see issue #72
      • NA62 recall tests (30 TB) on EOSCTA PPS after migration test

      COMPASS

      • Metadata check checksum bulk query
      • DAQ tests in November on CTA
      • EOSCTA instance for these tests, see #69 COMPASS tests on EOSCTAPUBLIC PPS

      AMS

      • Meeting with AMS took place 29/09/2020, need to schedule a test

      NA61/SHINE

      • Trying to schedule a meeting

      r_public_user tapepool

      • Repack of r_public_user tapepool: many experiments have data under their part of the namespace in the public_user fileclass. This prevents us from migrating small experiments one-by-one. Can we split it as we did for grid?
    • 14:40 14:50
      Repack 10m

      Repack priorities

      1. Deploy and test: (a) Injection of recovered files when repacking broken tapes, (b) Change mount policy on-the-fly.
      2. Tape Server: Add configuration options to give greater control over the highly distributed maintenance process.
      3. Remove "superseded" files in favour of the recycle bin.
      4. Investigate if we can run a separate repack instance sharing the same catalogue.
    • 14:50 14:55
      AOB 5m
      • Repacking and reclaiming tapes imported from CASTOR (i.e., removing the FROM_CASTOR flag from tapes in CTA)
      • CTA website deployment still broken