CTA deployment meeting

Europe/Zurich
31/S-027 (CERN)

FTS

Multi-hop

In principle LHCb can do multi-hop with what is currently implemented in FTS. At CHEP, Christophe discussed with Eddie an extra feature that would allow them to do multi-hop without explicitly specifying the intermediate "hop" in each transfer request (it would be specified only once, in a configuration file). In either case LHCb will need to make changes to their software.
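
A rough sketch of the explicit-hop case, using the fts3-rest Python "easy" bindings: the endpoint, the file paths and the exact keyword names (in particular multihop) are illustrative assumptions and should be checked against the FTS documentation before use.

    # Sketch only: submit a two-hop FTS job where both hops are given explicitly.
    # Requires a valid grid proxy (X509_USER_PROXY) for the Context.
    import fts3.rest.client.easy as fts3

    endpoint = "https://fts3-lhcb.cern.ch:8446"   # hypothetical FTS endpoint
    context = fts3.Context(endpoint)

    # Hop 1: source storage element -> EOSCTA disk buffer (hypothetical paths)
    hop1 = fts3.new_transfer("root://source-se.example//lhcb/data/file1",
                             "root://eosctalhcb.cern.ch//eos/ctalhcb/buffer/file1")
    # Hop 2: EOSCTA disk buffer -> final tape-backed destination
    hop2 = fts3.new_transfer("root://eosctalhcb.cern.ch//eos/ctalhcb/buffer/file1",
                             "root://tape-endpoint.example//lhcb/tape/file1")

    # multihop=True asks FTS to run the hops in order as a single job.
    job = fts3.new_job(transfers=[hop1, hop2], multihop=True)
    print("submitted job", fts3.submit(context, job))

The config-file variant discussed with Eddie would remove the need to spell out the intermediate hop in each submission.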

Getting rid of SRM: we discussed support for XRootD TPC with delegation of credentials at the T1s (specifically the LHCb T1s, which run dCache). We cannot depend on DUNE to solve this problem for us, as not all LHCb T1s are DUNE T1s.

ATLAS have done a functional test of Rucio with FTS multi-hop (only one file, and not at the T0). Cédric Serfon will be back at CERN next week and will do a proper multi-hop test with Julien.

Backpressure

The current backpressure mechanism for archivals is to return a "no space" error to FTS when the buffer is full. This is fine for ATLAS: the archival will fail and Rucio will schedule a retry.
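
As a conceptual illustration (this is not CTA code; the path and threshold below are made up), the mechanism amounts to refusing new archive files with a "no space" error once the buffer's free space drops below a limit; FTS then records a failed transfer and Rucio retries it later.

    # Conceptual illustration only: refuse archive requests when the disk
    # buffer is (nearly) full, so that FTS fails the transfer and Rucio retries.
    import errno
    import os

    BUFFER_PATH = "/eos/ctaatlas/archive"   # hypothetical buffer mount point
    MIN_FREE_BYTES = 10 * 1024**4           # hypothetical threshold: 10 TiB

    def accept_archive_file(path: str) -> None:
        """Raise ENOSPC if the buffer cannot take another archive file."""
        stats = os.statvfs(BUFFER_PATH)
        free_bytes = stats.f_bavail * stats.f_frsize
        if free_bytes < MIN_FREE_BYTES:
            # Seen by FTS as a failed transfer; Rucio schedules a retry.
            raise OSError(errno.ENOSPC, "No space left on device", path)
        # ...otherwise the transfer into the buffer proceeds as normal...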

FTS will implement a "throttling" feature so that there is a graceful degradation as the buffer fills up, rather than an abrupt stop. From our point of view the behaviour is exactly the same; this has no impact on CTA.

The FTS "m-bit" feature (a file is not considered to be in CTA until it is on tape) is implemented but not yet merged into FTS master. ATLAS do not need this feature; CMS do, so it is not a blocker for us.
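
A toy illustration of the intended semantics only (not how FTS actually implements the m-bit): the archival is not reported as successful until the file has a copy on tape. The has_tape_copy helper is hypothetical.

    # Toy illustration of the m-bit semantics, not FTS's implementation:
    # success is only reported once the file has a tape copy.
    import time

    def has_tape_copy(path: str) -> bool:
        """Hypothetical helper: query EOSCTA for a tape replica of `path`."""
        raise NotImplementedError

    def wait_until_on_tape(path: str, timeout_s=86400, poll_s=300) -> bool:
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            if has_tape_copy(path):
                return True      # only now does the archival count as done
            time.sleep(poll_s)
        return False             # still only in the disk buffer: not archived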

Backpressure for retrievals is fully managed by CTA. This is working fine; there are no known issues.

Summary

  • FTS does everything we need for ATLAS. We need to make sure the latest version is deployed everywhere.
  • The m-bit feature needs to be deployed and tested before we migrate CMS (issue #386)

Actions

  • FTS team are meeting with LHCb on Wednesday 27 November. Feedback at next week's meeting.
  • Michael: discuss with Michal the schedule for the official FTS release and deployment
  • Michael: check the status of all LHCb T1s with respect to support for XRootD TPC (issue #674)

CTA Test Status

Everything is ready for the 40-tape repack test.

Actions

  • Julien: do a simple test to ensure the EOS rate limiter will not break us, e.g. set the rate limit to 1 Hz and retrieve more than one file.
  • Julien: full-scale test with ATLAS including FTS multi-hop test (see above)
  • Vlado/Julien: kick off 40-tape repack test

Monitoring

Refer back to Vlado & David's CTA Operations Update presentation, pp. 3–7, and the minutes of the meeting on 4 September 2019.

Actions

The following actions from 4 Sept. need to be followed up on:

  • Implement tsmod-daily-report as a dashboard rather than sending mails (issue #677)
  • Move some of the checks done by tsmod-daily-report to CTA (issue #678)
  • Alarm system: disabling functionality (tapes, drives and libraries for CTA) is not yet implemented. Disabling libraries can wait, but for tapes and drives this should be provided as a priority.
  • Alarm system: the new system runs as a daemon (rather than being invoked at regular intervals, as is the case for the CASTOR alarm system). This requires handling the expiring authentication tokens (KRB5) needed to invoke CTA and CASTOR commands. A possible solution is to de-daemonize it and invoke it via Rundeck (see the sketch after this list). David/Vlado to work out a solution.
  • Dashboards: it is felt that there are currently far too many dashboards, which have grown organically. These need to be cleaned up and structured by functionality and audience. For operations, the following would be needed:
    • EOSCTA instances
    • tapeserver cluster
    • CTA tape back-end
    • Julien needs to work on a cleanup proposal, gathering input from tape operations.
  • German, Vlado, Julien: provide detail on what monitoring is missing compared to what is currently available in CASTOR.
  • Review of what we have in the EOS control tower and what is missing
  • Michael: gather everything into one document in order to discuss/prioritize what we need to go into production.
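
To make the Rundeck option above concrete, here is a minimal sketch of the de-daemonized approach: each run obtains a fresh Kerberos ticket from a keytab before invoking the CTA/CASTOR commands, so no long-lived token has to be kept alive. The keytab path, principal and the exact cta-admin invocation are assumptions.

    # Sketch of a short-lived alarm check invoked by Rundeck/cron instead of a
    # daemon: a fresh KRB5 ticket is obtained from a keytab on every run.
    import subprocess

    KEYTAB = "/etc/cta/ctaops.keytab"    # hypothetical keytab
    PRINCIPAL = "ctaops@CERN.CH"         # hypothetical principal

    def run_alarm_checks() -> None:
        subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)
        # Commands run below now have a valid ticket; the exact cta-admin
        # syntax should be checked against the installed version.
        result = subprocess.run(["cta-admin", "--json", "drive", "ls"],
                                check=True, capture_output=True, text=True)
        # ...parse result.stdout and raise alarms for down/disabled drives...

    if __name__ == "__main__":
        run_alarm_checks()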

Documentation

We agreed that there will be the following sources for documentation:

  1. Developers/integration specialists: One PDF to rule them all
    • This documentation will cover CTA architecture, design and technical details.
    • It will not include CERN-specific configuration and installation instructions.
    • It will be mainly updated by CTA developers. It is not expected to be updated by operators.
    • After CTA is deployed, the contents of this document are not expected to change much. Changes will mainly be the addition of new features.
    • The source of this documentation will be the LaTeX files under CTA/doc/latex.
  2. HOWTO/cookbook: mkdocs
    • This will be a world-readable guide for EOSCTA operators.
    • Installation; configuration; how to use cta-admin and other operational tools.
    • Besides the Tier-0, the target audience includes Tier-1s and any other site that may run EOSCTA.
    • Collaborative documentation which can be updated by operators.
  3. CERN-specific configuration: mkdocs
    • Configuration details of CERN EOSCTA instances; CERN-specific operational procedures.
    • Visible to and editable by CERN T0 staff/contractors.
  4. Conference papers: these should be put together in a common directory.
  5. Old/obsolete documentation: this should be moved into a directory that makes it clear that it is kept for historical interest only.

In addition, we need to have a simple website which explains what CTA is and links to the other documentation and to monitoring dashboards.

Actions

  • Michael: ask Melissa about designing a CTA logo
  • Michael: update the simple CTA website
  • Julien: create a second mkdocs site, so that we have one for use case (2) and one for use case (3) above
  • Michael: create basic structure for mkdocs sites and move existing operator documentation there (CTA Admin guide, twiki, etc.)
  • Eric: consolidate the existing LaTeX documentation (in particular, integrate the CASTOR tapeserver documentation into the CTA documentation)

AOB

Production Hardware Update

Julien went to the CF weekly meeting and reports:

Good news (finally): our production hardware is waiting for shipment. All the servers are ready with no missing hardware, but I want to see the hardware in our racks before I am fully relieved.

Basically our machines will be usable so that we can deploy the reprocessing instance on at least 10 machines before the end of 2019.

I discussed a few possible shortcuts if we are pressed for time...

It will be tough, but I will make sure we have 10 GB/s of throughput for the next ATLAS recall exercise.

I will keep you updated. I also have emergency backup plans in case we don't get any hardware (shipment destroyed, stolen, DOA...).

Database Schema Versioning

We need to be able to monitor which version(s) of the DB schema are deployed in production and to provide tools to migrate between DB versions.
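
As an illustration of one way to do this (the table, column and statement names below are hypothetical, not the actual CTA catalogue schema): the database records its own schema version, tools refuse to run against an unexpected version, and migrations are applied as ordered, versioned steps.

    # Illustrative sketch only: hypothetical SCHEMA_VERSION table, not the
    # real CTA catalogue. sqlite3 stands in for the production database.
    import sqlite3

    EXPECTED_VERSION = (1, 0)

    MIGRATIONS = {
        # from-version -> statements that bring the schema to the next version
        (0, 9): [
            "ALTER TABLE EXAMPLE_TABLE ADD COLUMN EXAMPLE_COL INTEGER",
            "UPDATE SCHEMA_VERSION SET MAJOR = 1, MINOR = 0",
        ],
    }

    def deployed_version(conn):
        row = conn.execute("SELECT MAJOR, MINOR FROM SCHEMA_VERSION").fetchone()
        return (row[0], row[1])

    def check_or_migrate(conn):
        version = deployed_version(conn)
        while version != EXPECTED_VERSION:
            steps = MIGRATIONS.get(version)
            if steps is None:
                raise RuntimeError("no migration path from schema %s" % (version,))
            for statement in steps:
                conn.execute(statement)
            conn.commit()
            version = deployed_version(conn)

    if __name__ == "__main__":
        check_or_migrate(sqlite3.connect("cta_catalogue_copy.db"))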

Actions

  • Cédric: Analyse the problem of DB Schema versioning. Create a plan to implement versioning and to migrate between DB versions.
  • Michael/Steve: review the current DB schema and agree what should be in CTA DB Schema version 1.0.

Project Plan

The start date for the ATLAS reprocessing campaign is scheduled to fall between 6 and 15 January.

The current plan is that by the end of December we will:

  1. Release CTA version 1.0, the release which will be used in production
  2. Get everything else ready that we need for the real ATLAS migration (hardware, monitoring, operator procedures & documentation, etc.)
  3. Migrate ATLAS data for the reprocessing campaign to EOSCTAATLAS (but do not disable the tapes in CASTOR).

In January 2020:

  1. Do the reprocessing campaign on a system as close as possible to the final production system. Use it as a final validation to give ATLAS confidence that CTA is production-ready and to give tape operators experience of working with CTA tools.
  2. As soon as possible after the reprocessing campaign, do the real migration and disable the tapes in CASTOR. The migration date needs to be agreed with ATLAS.

Agenda

    • 14:00–14:10: FTS and multi-hop (10 min)
      • FTS and multi-hop for LHCb; see ticket FTS-1483
      • Update from Andrea and Eddie on the impact on FTS of implementing this
    • 14:10–14:20: Backpressure (10 min)
      • Andrea: present how backpressure works in FTS
      • Identify what problems remain to be solved
    • 14:20–14:30: CTA test status (10 min)
      • Status of stress tests
      • Status of the 40-tape repack test
    • 14:30–14:40: Monitoring (10 min)

      We need to monitor both EOS and CTA in order to follow the full workflow: for example, the number of files written in EOS should match the number of archive requests in CTA (a minimal sketch of such a check appears after the agenda). We also need to be able to monitor what the garbage collector is doing in the disk buffer.

      • What are we currently able to monitor (disk/tape)?
      • What things do we need to monitor that we are not currently monitoring?
    • 14:40–14:50: Documentation (10 min)
      • Eric: brief summary of the state of our documentation
      • Agree the different audiences for documentation: CTA developers; Integration/data workflow specialists (e.g. FTS/XRootD developers, experiment data managers, dCache developers, ...); Installers (e.g. T1s like RAL); Operators (configuration, day-to-day operations); ...?
      • What are the documentation requirements of the CTA service (as opposed to developers)?
      • For each audience:
        • What documentation should be available?
        • Where should they be able to find it (in the CTA repo, on a website, ...)?
        • In what format (PDF manual, CERNdoc website, ...)?
    • 14:50–15:00: ALICE/JAlien integration (10 min)
      • Michael/Steve: Feedback from this morning's meeting with Costin
      • Timescale for ALICE tests
      • Identify what needs to be done to integrate JAlien with EOSCTA
    • 15:00–15:10: AOB (10 min)
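
Referring to the monitoring item above: a minimal sketch of the EOS-vs-CTA consistency check, with the two counter sources left as hypothetical placeholders (in practice they would come from the existing EOS and CTA monitoring feeds).

    # Minimal sketch: compare files written into an EOSCTA instance with the
    # archive requests queued in CTA over the same window. The counter sources
    # are hypothetical placeholders.
    def files_written_to_eos(instance: str, since: str) -> int:
        """Hypothetical: number of new files in the EOS namespace since `since`."""
        raise NotImplementedError

    def archive_requests_in_cta(instance: str, since: str) -> int:
        """Hypothetical: number of archive requests queued in CTA since `since`."""
        raise NotImplementedError

    def check_archive_consistency(instance: str, since: str, tolerance: int = 0):
        eos_count = files_written_to_eos(instance, since)
        cta_count = archive_requests_in_cta(instance, since)
        if abs(eos_count - cta_count) > tolerance:
            # Files are appearing in EOS without a matching archive request
            # in CTA (or vice versa): raise an alarm.
            print("[ALARM] %s: EOS=%d CTA=%d since %s"
                  % (instance, eos_count, cta_count, since))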