CTA deployment meeting

Europe/Zurich
600/R-001 (CERN)
Michael Davis (CERN)

Some notes carried over from previous meetings.

ALICE Post-migration

  • David configured the ALICE probe so that it does not affect SLS until it is reporting correctly (#105 and #666).
  • Implicit Prepare is implemented as a separate workflow event in EOS and can be switched on or off in the configuration.
  • The Garbage Collector requires the fix to EOS space reporting. The rate appears fast enough; to be re-tested in production (#905).
  • We need to re-test the ALICE instance with >1,000 clients (Latchezar said 10,000); a load-generation sketch follows this list.
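
A rough load-generation sketch for the client-scaling re-test; the endpoint, file list and concurrency level are placeholders, not an agreed test plan.

    # Illustrative only; endpoint, file list and concurrency are
    # placeholders. Each line of filelist.txt is an absolute file path
    # on the instance; each file is read concurrently and discarded.
    xargs -a filelist.txt -P 1000 -I{} \
        xrdcp -s -f 'root://eosctaalice.cern.ch/{}' /dev/null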

Repack To Do

Remove "superseded" files in favour of the recycle bin (*).

This is a simplification: instead of two separate mechanisms for handling deleted and repacked files, we settle on one, the recycle bin. The "superseded" state should be removed.

Files in the recycle bin will not appear in tape listings using the normal tools ("cta-admin tapefile ls" etc.) and there will be no option to list them.

This task includes separate operator tools to list files in the recycle bin and to delete them. (Q. Should operators be able to manually delete files from the recycle bin, or should this be done automatically when tapes are reclaimed?)
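
The operator tooling does not exist yet; the sketch below illustrates what it might look like. The subcommand and option names are hypothetical, not part of the current cta-admin interface.

    # Hypothetical operator workflow; "recyclefile" and its options are
    # placeholder names, not an existing cta-admin subcommand.

    # List recycle-bin entries for a given tape:
    cta-admin recyclefile ls --vid V01234

    # Delete the recycle-bin entries for a tape, e.g. once it has been
    # reclaimed (could equally be triggered automatically by reclaim):
    cta-admin recyclefile rm --vid V01234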

In future we will want a tool to allow reinjection of deleted files from the recycle bin, but this is not an immediate priority, as it can be done manually by a developer if necessary. We will develop the tool when the need arises.

Investigate whether we can run a separate repack instance sharing the same catalogue.

Mixing repack with normal operations causes a number of problems, because there is no separation of repack queues from the normal retrieve queues. Adding separate repack queues would be a significant development effort.

An alternative proposal is to run a separate repack instance with its own object store and dedicated tape drives. Only the catalogue and the tapes would need to be shared with the "main" production CTA instance.

This proposal needs to be investigated to confirm that it is feasible, to propose a solution to tape contention between the main and repack instances, and to identify any other problems that may arise.
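
A minimal sketch of the proposed separation, assuming the usual CTA configuration layout: both instances point at the same catalogue database but at different object stores. All paths, keys and names below are illustrative.

    # Shared catalogue: both instances carry the same connection string
    # in /etc/cta/cta-catalogue.conf (illustrative):
    oracle:cta/<password>@cta_db

    # Main instance object store (e.g. in its cta-taped.conf):
    ObjectStore BackendPath rados://cta@ceph_main:cta

    # Repack instance object store, on a separate pool:
    ObjectStore BackendPath rados://cta@ceph_repack:cta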

Minutes
    • 14:00–14:10
      ALICE Post-migration 10m
      • Namespace reconciliation: Michael provided Latchezar & Costin with a list of 10,965 files in /alice/raw/global which are not in CTA (one way to produce such a list is sketched after this list).
      • Instance full and garbage collector status (see #905)
      • Implicit prepare: deployed in production (v3.1-8). How will it be tested?
      • ALICE probe status (#105 and #666).
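      One way such a missing-file list can be produced (a sketch, assuming sorted path dumps of the EOS namespace and of the CTA catalogue; file names are illustrative):

        # Paths present in the EOS namespace dump but absent from the
        # CTA catalogue dump:
        sort eos_alice_raw_global.txt > eos.sorted
        sort cta_alice_raw_global.txt > cta.sorted
        comm -23 eos.sorted cta.sorted > missing_from_cta.txt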
    • 14:10–14:20
      Getting CMS into Production 10m

      TO DO

      • Reconciliation Rucio ←→ CTA
      • Test EOS Backup (Elvin's daemon and user backup tools on EOS CMS) (#113)
      • Repack tapes in cms_user (following the list of CASTOR /user directories split across multiple file classes). Not a blocker if this is not finished prior to migration: we can remove the tapes needing repack from r_cms_user and migrate them after repack.
      • Test archive monitoring (either FTS or homebrew using xrdfs query prepare; see the sketch after this list)
      • OTGs: CASTOR and CTA
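
      A sketch of the "homebrew" option using xrdfs query prepare; the endpoint, request ID and path below are placeholders. The response reports the file's prepare/archive status, which can be polled until the file is confirmed on tape.

        # Query the archive/prepare status of a file on the EOSCTA
        # endpoint (host, request ID and path are placeholders):
        xrdfs eosctacms.cern.ch query prepare <request_id> /eos/ctacms/archive/file.root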

      Schedule

      • 9 Nov 2020: Pre-MWGR end-to-end functional "replay" tests at T0 will be done with CASTOR
      • 18 Nov 2020: MWGR#4 will be done with CASTOR
      • w/c 23 Nov 2020: Put CASTOR CMS into recall-only mode
      • 30 Nov 2020: Migrate to CTA
      • 07 Dec 2020: CMS in production
      • End Dec 2020 (?): CMS retires PhEDEx
    • 14:20–14:30
      Getting LHCb into Production 10m
      • See Putting EOSCTALHCB into Production (CodiMD)
      • Successful 200 TB test with 10 tape drives and one buffer server, for 2.5 GB/s of constant archival throughput.
      • Successful transfer using Dirac to move a file from EOS to GridKa via HTTP (FTS monitoring)!

      TO DO/Schedule

      • Mon 2 Nov: 200 TB recall test
      • Wed 4 Nov: IT/LHCb meeting
      • Test HTTP TPC with CTA (FTS multi-hop with one "hop" as QoS change, one hop as the transfer) to be agreed with Christophe
      • DAQ test : waiting for network connectivity
    • 14:30–14:40
      Getting PUBLIC into Production 10m

      COMPASS

      • Metadata check / bulk checksum query: they will do it the slow way for now (see the sketch after this list)
      • Wed 4 Nov 2020: DAQ tests on CTA.
      • For the EOSCTA instance for these tests, see #69 COMPASS tests on EOSCTAPUBLIC PPS
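
      Assuming "the slow way" means one checksum query per file rather than a single bulk request, a sketch (endpoint and file list are placeholders):

        # One xrdfs checksum query per file; each line of
        # compass_files.txt is an absolute path on the instance.
        while read -r path; do
            xrdfs eosctapublic.cern.ch query checksum "$path"
        done < compass_files.txt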

      NA62

      • NA62 offline integration: Vova has created the VM with AFS+EOS to allow the tests to proceed
      • NA62 has been migrated onto EOSCTA PPS (#72)
      • w/c 9 Nov: NA62 recall tests (30 TB) on EOSCTA PPS

      AMS

      • Meeting with AMS took place on 29 Sep 2020
      • They are busy with SLC6 → CC7 migration and have no time to test CTA

      NA61/SHINE

      • Tue 3 Nov: NA61 TDAQ Upgrade Meeting

      Identifying hot/cold data

      • Vova is looking at CASTOR logfiles to try to build a profile of when data was last accessed and by whom.

      Repack of public_user

      • Fileclasses in public_user have been reorganised to split small experiments from user data and legacy data. Some new tape pools were created, and some data needs to be relocated into the corresponding experiment tape pools (e.g. totem, ilc).
      • Around 1,500 tapes need to be repacked.
      • Highest priority is to separate data belonging to CMS.
    • 14:40–14:50
      CTA Repack 10m

      Status Update

      • v3.1-8 deployed: (a) Injection of recovered files when repacking broken tapes, (b) Change of mount policy on-the-fly, (c) Switching off the maintenance process on a per-tape-server basis.
      • How will the above be tested?

      TO DO

      1. Remove "superseded" files in favour of the recycle bin.
      2. Investigate whether we can run a separate repack instance sharing the same catalogue.
    • 14:50–15:00
      Tape Lifecycle 10m
      • Tape Lifecycle was documented by Vlado
      • How is tape inventory to be managed between CASTOR and CTA, and between CTA production and non-production instances?
      • Details around repack are still quite vague. There are at least three distinct use cases, each of which should be explained: broken tape, low-occupancy tape, change of media generation.
      • What else is missing?
    • 15:00–15:10
      RAO for LTO 10m
      • Deployed in production with v3.1-8
      • Will it be enabled for LTO drives in IBM libraries?
      • Can we measure the improvement of RAO over linear order in production?
    • 15:10–15:15
      AOB 5m
      • CTA development priorities for the next 12 months