CTA deployment meeting

Europe/Zurich
31/S-023 (CERN)

Backpressure status:

  • Under test (Eric). A CI test is available in Eric's branch, which is still to be merged with main.
  • To trigger the backpressure functionality, EOS only needs to be configured so that it appears to have a small buffer. This is best exercised with small files in order to have an appropriate granularity. Initially, it will be exercised with 2 tape servers (input + output).
  • The next step is to deploy it on ATLAS to repack some dummy data.
    • As suggested by Vlado outside the meeting, test tapes from the test_vlado CASTOR tape pool could be imported for this purpose.
  • Later on, change Cédric's repack CI script to handle multiple input/output drives and therefore concurrent tape repacks.

Cancel status:

  • Coding is progressing and should be completed soon. A CI test will be available next week, integrated in the standard pipeline. It essentially consists of the following workflow: Evict -> drive down -> Prepare -> Cancel -> drive up.
  • Julien will check how Evict is currently being tested.

Repack:

  • Repacking of disabled tapes (ticket) needs to be carefully thought through before implementation. Cédric needs to discuss with Vlado how to handle disabled tapes during mass repacks, as tapes may be disabled for good reasons (due to media failures) while such a repack is running.

FTS polling of GC'd files:

  • After discussion between Andrea and Steve, a simpler approach was chosen to address this problem in the medium term. Steve will change GC to add an attribute on file deletion. FTS will recognise this attribute and treat it as an error, which will cause Rucio to retry the request. However, with backpressure and cancel enabled, the impact of GC should be minimal. In the longer term, the state engine of FTS will be changed to include retries.
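The medium-term approach can be sketched roughly as follows. This is only an illustration of the idea: the attribute name `sys.gc.deleted`, the functions `poll_file` and `handle_poll`, and the plain dictionary standing in for extended attributes are all hypothetical, and do not reflect the actual EOS, FTS or Rucio interfaces.

```python
from enum import Enum

# Hypothetical attribute name; the minutes do not specify what GC will set.
GC_ATTR = "sys.gc.deleted"


class PollResult(Enum):
    STAGING = "staging"
    ONLINE = "online"
    ERROR = "error"


def poll_file(xattrs: dict, on_disk: bool) -> PollResult:
    """Decide the poll outcome for one file from its (assumed) attributes."""
    if xattrs.get(GC_ATTR):
        # The disk copy was garbage-collected: report an error
        # instead of continuing to poll for a file that is gone.
        return PollResult.ERROR
    return PollResult.ONLINE if on_disk else PollResult.STAGING


def handle_poll(result: PollResult, retries_left: int) -> str:
    """Stand-in for the client side (e.g. Rucio) reacting to a poll result."""
    if result is PollResult.ERROR:
        return "retry" if retries_left > 0 else "failed"
    return "done" if result is PollResult.ONLINE else "wait"
```

In this sketch the retry decision sits with the client, which matches the medium-term plan; moving retries into the FTS state engine would replace `handle_poll`.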

Index for lost file staging requests:

  • Eric/Steve are discussing with Georgios which options are available for including such an index directly in the namespace back-end.

Next steps for ATLAS:

  • CTAPPS instance is available again after successful ATLAS test completion.
  • CTA SSD disk server hardware will be received mid-September. The racks are cabled (3 racks @ 10 nodes in vault). Testing by CF may take a couple of weeks.
  • The new hardware will be used for setting up a new instance (EOSCTAATLAS) that, as agreed with ATLAS, will import CASTOR ATLAS metadata (no merge with EOSCTAPPSATLAS, which is a separate instance and will eventually be decommissioned).
  • Julien will contact Georgios to enable EOS namespace dumps on CTA instances (as requested by ATLAS for EOS and CASTOR).
  • ATLAS plans a full reprocessing of Run-2 data in mid-October. We need to check with Alessandro when exactly this will happen. By then, it is critical to have the above-mentioned functionalities (backpressure, cancel, FTS polling/GC fix) completed, tested and deployed to ensure smooth running.
    • Storage classes and activities are in place (Eric). For collocation hints, we should sort out the interfacing via Rucio/FTS. CTA handling will come later (as this is just for optimisation).

Migration status:

  • (Michael) A number of issues have been fixed/completed, including correct reporting of checksums, handling of EOS IDs to avoid crossovers with CASTOR, improved gRPC error reporting, handling of long file/dir names, extended attribute propagation in EOS and "permission denied" errors (requires new EOS release). Work is ongoing on checking parent directory existence, "moving" (renamed) CASTOR directories, idempotent inserts and putting failures into an error table for operators.
    • 15:00 15:20
      Status, next steps, timelines 20m

      FTS
      - FTS polling and gc'd files
      - stage cancel

      CTA:
      - back-pressure deployment
      - repack status

      ATLAS deployment timelines:
      - ATLAS plans for Run-2 reprocessing vs. migration timelines

    • 15:20 15:40
      Metadata migration: current status 20m
      • Lessons learned during ATLAS import: error handling and reporting, incremental (re-)imports, atomicity, idempotence of imports, avoiding namespace collisions
      Speaker: Michael Davis (CERN)
    • 15:40 16:00
      Review of actions, AOB, items for next meeting 20m