CTA deployment meeting
Design specifications and operational procedures
Quick brainstorm: are there parts of the project that we do not fully understand, or where we are not all on the same page? What needs to be agreed and/or documented?
- ALICE/JAlien workflows
- LHCb workflows
- Backpressure (for ALICE/FTS)
- Archive workflow: do we want bulk requests and query interface?
- EOSCTA installation and configuration : essentially we need to document what happens in CI scripts
- Operational procedures and documentation
- Tape lifecycle management
- Data recovery : tools, procedures (also, tape drive dedications should be implemented)
- Policy for distributing software to external entities like RAL (in particular, how to handle CERN-specific dependencies)
Other technical problems to solve:
- Load balancing between archives and retrieves
Julien confirms that we currently have the following instances:
- EOSCTAPPS: currently being used by Michael for migration testing
- EOSCTAATLASPPS
- EOSCTACMSPPS
- EOSCTAALICEPPS
- Repack instance: this is an EOS-only instance for repack
What needs to be done to migrate ATLAS?
ATLAS will do their Run-2 exercises in January. Ideally we would like to do the full ATLAS migration immediately after the end-of-year shutdown so that the Run-2 exercises are done on CTA.
What do we need to make this happen?
- Drives for all tape types : we have these; they just need to be installed
- ATLAS buffer server : the production hardware has not been delivered and the schedule is still not known.
- Agree migration schedule with ATLAS
- Ensure all operational tools are available
What needs to be done to set up ALICE instance with two spaces?
- Configuration of hardware (SSD+spinners)
- Space-aware garbage collector. The existing FST GC will suffice in the short term.
- ALICE authentication mechanism needs to be fixed by Andreas and puppetized. Julien has opened issue 666 to track this on our side (Document, fix and puppetize `xrootd-alicetokenacc`)
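The space-aware garbage collector mentioned above (see also the action on the space-aware LRU GC below) could follow a simple least-recently-used policy: evict the coldest disk replicas from a space until its free fraction rises above a threshold. A toy sketch of that idea; the class, sizes and eviction interface are illustrative, not the actual EOS/CTA API:

```python
from collections import OrderedDict

class SpaceAwareLruGc:
    """Toy LRU garbage collector for one EOS space (illustrative only)."""

    def __init__(self, capacity_bytes, min_free_fraction=0.2):
        self.capacity = capacity_bytes
        self.min_free = min_free_fraction
        self.files = OrderedDict()  # path -> size, least recently used first
        self.used = 0

    def add(self, path, size):
        """Register a new disk replica."""
        self.files[path] = size
        self.used += size

    def touch(self, path):
        """Record an access: move the replica to the most-recent end."""
        self.files.move_to_end(path)

    def collect(self):
        """Evict LRU replicas until the free fraction meets the threshold."""
        evicted = []
        while self.capacity - self.used < self.min_free * self.capacity and self.files:
            path, size = self.files.popitem(last=False)  # pop the coldest replica
            self.used -= size
            evicted.append(path)
        return evicted
```

For example, with a 100-byte space holding files "a" (60 bytes, recently accessed) and "b" (30 bytes), a collection pass evicts "b" first because it is the least recently used.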
Stress tests
The new "query prepare" logic is in EOS dev branch. This should be deployed and the stress tests repeated.
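For the test scripts, the "query prepare" reply is a JSON document with one entry per file. A small sketch of how a script might summarise such a reply; the field names below are assumptions to be checked against the output of the deployed EOS version:

```python
import json

# Illustrative reply shape; the field names ("on_tape", "online",
# "requested", ...) are assumptions, not a confirmed EOS schema.
sample = json.loads("""
{
  "request_id": "42",
  "responses": [
    {"path": "/eos/ctaeos/file1", "on_tape": true, "online": false,
     "requested": true, "error_text": ""},
    {"path": "/eos/ctaeos/file2", "on_tape": true, "online": true,
     "requested": false, "error_text": ""}
  ]
}
""")

def summarise(reply):
    """Split files into those already online and those still staging."""
    online = [r["path"] for r in reply["responses"] if r["online"]]
    staging = [r["path"] for r in reply["responses"]
               if r["requested"] and not r["online"]]
    return online, staging

online, staging = summarise(sample)
```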
Actions
- Oliver/Julien : go to CFCCM meeting on Wednesday to get a status update on procurement and ensure that the priority of hardware for CTA is understood. (Vlado offered to provide someone from DCS to help with installation if necessary).
- Julien : document configuration of headnodes and CTA instances (see Documentation under AOB)
- Julien : ensure that operators are able to run cta-admin on all tapeservers
- Julien : (low priority) configure ALICE instance with 2 spaces
- Vlado : evaluate what functionality from TOMS is missing from CTA
- Steve/Michael : meet Costin to understand details of ALICE workflows. (What is the purpose of the ALICE probe?)
- Steve : implement space-aware LRU Garbage Collector
- Michael : finish functional tests of "query prepare", cherry-pick commits on top of our EOS version and create a new tag by end of this week
- Julien : change the test scripts to use "query prepare" and submit prepare requests in batches of 200 to mimic the behaviour of FTS. Add a metric to measure the rate at which prepare requests are queued. Re-launch stress tests.
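The batching and rate metric in the last action could be sketched as follows; the submit function is a stub standing in for the real prepare call issued by the test scripts:

```python
import time

BATCH_SIZE = 200  # mimic FTS, which submits prepare requests in batches of 200

def batches(paths, size=BATCH_SIZE):
    """Split the full list of paths into FTS-sized batches."""
    for i in range(0, len(paths), size):
        yield paths[i:i + size]

def submit_prepare(batch):
    """Stub: in the real test this would queue one prepare request
    covering the whole batch against the EOSCTA endpoint."""
    pass

def run(paths):
    """Submit all paths in batches; return the queueing rate (files/s)."""
    start = time.monotonic()
    for batch in batches(paths):
        submit_prepare(batch)
    elapsed = time.monotonic() - start
    return len(paths) / elapsed if elapsed > 0 else float("inf")
```

With 450 files this produces batches of 200, 200 and 50; the returned rate is the metric to compare across stress-test runs.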
Repack
Thanks to Cédric for documenting the repack tools and procedure for repacking a single tape (see EOSCTA.pdf chapter 4). Eric is working on the system-level documentation.
Eric confirmed that most of the functionality in Daniele's repack scripts is not required by CTA. However, the tape lifecycle management parts of the scripts are still required.
During the test to repack 2 tapes, Eric identified an issue with the scheduling. There is a race: when an existing mount goes to fetch more work, a newly starting mount can snatch that set of jobs first. The existing mount then finds no work and finishes, and is replaced by a new archive mount which continues to work on a different tape. This problem is not specific to repack. For the time being we can live with it; at some point in the future Eric will review the scheduling logic.
Actions
- Vlado : separate out tape lifecycle management parts of Daniele's scripts and add the scripts to CTA.
- Julien : reinstall drives for lib3
- Vlado : Once lib3 is ready, repack 40 tapes
- Vlado : use repack test to validate monitoring
- Vlado : use repack test to train operators from Data Conversion Services (DCS) to work with CTA. Aurelian should also participate to gain experience with CTA.
- Michael : review all open tickets for operator tools (cta-admin)
AOB
Documentation
There are two CTA website addresses:
- eoscta.web.cern.ch: General information about the CTA project
- eoscta.docs.cern.ch: CERN MkDocs documentation
The general info site will link to the documentation. The MkDocs site is intended for operational documentation such as operator procedures and the current configuration of headnodes/EOSCTA instances. Developer documentation will be kept in LaTeX/PDF format as before.
Eric pointed out that our current documentation is rather scattered: Tapeserver technical documentation is still under CTA/doc/castor, there is an obsolete document about the Tape Storage Element, etc.
Actions
- Eric : review existing CTA documentation and propose how it should be consolidated and updated.
Database development and version control
We need to do the following work in the DB :
- Create a schema versioning system and a set of tools/procedures to update between versions (high priority)
- Remove the PL/SQL code for the CASTOR-style monitoring. We need to understand exactly what this code does and decide what it should be replaced with.
- Migration to Postgres (not a current priority)
We could use some additional resources to do this work (fellow, or possibly technical student).
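A schema versioning system of the kind described can be as simple as a version table plus an ordered list of migration scripts. A minimal sketch using SQLite for illustration; the real system would target Oracle, and the table and column names here are invented:

```python
import sqlite3

# Ordered migrations: (version, SQL). Illustrative schema, not CTA's.
MIGRATIONS = [
    (1, "CREATE TABLE TAPE (VID TEXT PRIMARY KEY)"),
    (2, "ALTER TABLE TAPE ADD COLUMN IS_FULL INTEGER DEFAULT 0"),
]

def current_version(conn):
    """Read the highest applied schema version (0 if none)."""
    conn.execute("CREATE TABLE IF NOT EXISTS SCHEMA_VERSION (VERSION INTEGER)")
    row = conn.execute("SELECT MAX(VERSION) FROM SCHEMA_VERSION").fetchone()
    return row[0] or 0

def upgrade(conn):
    """Apply, in order, every migration newer than the stored version."""
    v = current_version(conn)
    for version, sql in MIGRATIONS:
        if version > v:
            conn.execute(sql)
            conn.execute("INSERT INTO SCHEMA_VERSION VALUES (?)", (version,))
    conn.commit()
    return current_version(conn)
```

Because `upgrade` only applies migrations above the stored version, it is safe to run repeatedly and on databases created at any earlier schema version.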
RPMs: Oracle OCCI dependency
The upgrade from Oracle 12 to 19 caused us all some pain due to tangled dependencies: oracle-instantclient19.3-meta is the only package that has a CERN version. Steve explained that this can be resolved by fixing the Requires: lines in the spec file for our own RPMs.
Actions
- Steve : fix OCCI dependency in CTA spec file