CTA deployment meeting
Design specifications and operational procedures
Quick brainstorm: are there parts of the project that we do not fully understand, or where we are not all on the same page? What needs to be agreed and/or documented?
- ALICE/JAlien workflows
- LHCb workflows
- Backpressure (for ALICE/FTS)
- Archive workflow: do we want bulk requests and query interface?
- EOSCTA installation and configuration : essentially we need to document what happens in CI scripts
- Operational procedures and documentation
- Tape lifecycle management
- Data recovery : tools, procedures (also, tape drive dedications should be implemented)
- Policy for distributing software to external entities like RAL (in particular, how to handle CERN-specific dependencies)
Other technical problems to solve:
- Load balancing between archives and retrieves
Julien confirms that we currently have the following instances:
- EOSCTAPPS: currently being used by Michael for migration testing
- EOSCTAATLASPPS
- EOSCTACMSPPS
- EOSCTAALICEPPS
- Repack instance: this is an EOS-only instance for repack
What needs to be done to migrate ATLAS?
ATLAS will do their Run-2 exercises in January. Ideally we would like to do the full ATLAS migration immediately after the end-of-year shutdown so that the Run-2 exercises are done on CTA.
What do we need to make this happen?
- Drives for all tape types : we have these; they just need to be installed
- ATLAS buffer server : the production hardware has not been delivered and the schedule is still not known.
- Agree migration schedule with ATLAS
- Ensure all operational tools are available
What needs to be done to set up ALICE instance with two spaces?
- Configuration of hardware (SSD+spinners)
- Space-aware garbage collector. The existing FST GC will suffice in the short term.
- ALICE authentication mechanism needs to be fixed by Andreas and puppetized. Julien has opened issue 666 to track this on our side (Document, fix and puppetize `xrootd-alicetokenacc`)
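The space-aware garbage collector mentioned above (see also the action on the space-aware LRU GC below) could follow a simple least-recently-used policy: evict the coldest disk replicas from a space until its free fraction rises above a threshold. A toy sketch of that idea; the class, sizes and eviction interface are illustrative, not the actual EOS/CTA API:

```python
from collections import OrderedDict

class SpaceAwareLruGc:
    """Toy LRU garbage collector for one EOS space (illustrative only)."""

    def __init__(self, capacity_bytes, min_free_fraction=0.2):
        self.capacity = capacity_bytes
        self.min_free = min_free_fraction
        self.files = OrderedDict()  # path -> size, least recently used first
        self.used = 0

    def add(self, path, size):
        """Register a new disk replica."""
        self.files[path] = size
        self.used += size

    def touch(self, path):
        """Record an access: move the replica to the most-recent end."""
        self.files.move_to_end(path)

    def collect(self):
        """Evict LRU replicas until the free fraction meets the threshold."""
        evicted = []
        while self.capacity - self.used < self.min_free * self.capacity and self.files:
            path, size = self.files.popitem(last=False)  # pop the coldest replica
            self.used -= size
            evicted.append(path)
        return evicted
```

For example, with a 100-byte space holding files "a" (60 bytes, recently accessed) and "b" (30 bytes), a collection pass evicts "b" first because it is the least recently used.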
Stress tests
The new "query prepare" logic is in EOS dev branch. This should be deployed and the stress tests repeated.
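For the test scripts, the "query prepare" reply is a JSON document with one entry per file. A small sketch of how a script might summarise such a reply; the field names below are assumptions to be checked against the output of the deployed EOS version:

```python
import json

# Illustrative reply shape; the field names ("on_tape", "online",
# "requested", ...) are assumptions, not a confirmed EOS schema.
sample = json.loads("""
{
  "request_id": "42",
  "responses": [
    {"path": "/eos/ctaeos/file1", "on_tape": true, "online": false,
     "requested": true, "error_text": ""},
    {"path": "/eos/ctaeos/file2", "on_tape": true, "online": true,
     "requested": false, "error_text": ""}
  ]
}
""")

def summarise(reply):
    """Split files into those already online and those still staging."""
    online = [r["path"] for r in reply["responses"] if r["online"]]
    staging = [r["path"] for r in reply["responses"]
               if r["requested"] and not r["online"]]
    return online, staging

online, staging = summarise(sample)
```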
Actions
- Oliver/Julien : go to CFCCM meeting on Wednesday to get a status update on procurement and ensure that the priority of hardware for CTA is understood. (Vlado offered to provide someone from DCS to help with installation if necessary).
- Julien : document configuration of headnodes and CTA instances (see Documentation under AOB)
- Julien : ensure that operators are able to run cta-admin on all tapeservers
- Julien : (low priority) configure ALICE instance with 2 spaces
- Vlado : evaluate what functionality from TOMS is missing from CTA
- Steve/Michael : meet Costin to understand details of ALICE workflows. (What is the purpose of the ALICE probe?)
- Steve : implement space-aware LRU Garbage Collector
- Michael : finish functional tests of "query prepare", cherry-pick commits on top of our EOS version and create a new tag by end of this week
- Julien : change the test scripts to use "query prepare" and submit prepare requests in batches of 200 to mimic the behaviour of FTS. Add a metric to measure the rate at which prepare requests are queued. Re-launch stress tests.
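The batching and rate metric in the last action could be sketched as follows; the submit function is a stub standing in for the real prepare call issued by the test scripts:

```python
import time

BATCH_SIZE = 200  # mimic FTS, which submits prepare requests in batches of 200

def batches(paths, size=BATCH_SIZE):
    """Split the full list of paths into FTS-sized batches."""
    for i in range(0, len(paths), size):
        yield paths[i:i + size]

def submit_prepare(batch):
    """Stub: in the real test this would queue one prepare request
    covering the whole batch against the EOSCTA endpoint."""
    pass

def run(paths):
    """Submit all paths in batches; return the queueing rate (files/s)."""
    start = time.monotonic()
    for batch in batches(paths):
        submit_prepare(batch)
    elapsed = time.monotonic() - start
    return len(paths) / elapsed if elapsed > 0 else float("inf")
```

With 450 files this produces batches of 200, 200 and 50; the returned rate is the metric to compare across stress-test runs.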
Repack
Thanks to Cédric for documenting the repack tools and procedure for repacking a single tape (see EOSCTA.pdf chapter 4). Eric is working on the system-level documentation.
Eric confirmed that most of the functionality in Daniele's repack scripts is not required by CTA. However, the tape lifecycle management parts of the scripts are still required.
During the test to repack 2 tapes, Eric identified an issue with the scheduling. There is a race: when an existing mount goes to fetch more work, a newly starting mount can snatch that set of jobs first. The existing mount then finds no work and finishes, and is replaced by a new archive mount which continues to work on a different tape. This problem is not specific to repack. For the time being we can live with it; at some point in the future Eric will review the scheduling logic.
Actions
- Vlado : separate out tape lifecycle management parts of Daniele's scripts and add the scripts to CTA.
- Julien : reinstall drives for lib3
- Vlado : Once lib3 is ready, repack 40 tapes
- Vlado : use repack test to validate monitoring
- Vlado : use repack test to train operators from Data Conversion Services (DCS) to work with CTA. Aurelian should also participate to gain experience with CTA.
- Michael : review all open tickets for operator tools (cta-admin)
AOB
Documentation
There are two CTA website addresses:
- eoscta.web.cern.ch: General information about the CTA project
- eoscta.docs.cern.ch: CERN MkDocs documentation
The general info site will link to the documentation. The MkDocs site is intended for operational documentation such as operator procedures and the current configuration of headnodes/EOSCTA instances. Developer documentation will be kept in LaTeX/PDF format as before.
Eric pointed out that our current documentation is rather scattered: Tapeserver technical documentation is still under CTA/doc/castor, there is an obsolete document about the Tape Storage Element, etc.
Actions
- Eric : review existing CTA documentation and propose how it should be consolidated and updated.
Database development and version control
We need to do the following work in the DB :
- Create a schema versioning system and a set of tools/procedures to update between versions (high priority)
- Remove the PL/SQL code for the CASTOR-style monitoring. We need to understand exactly what this code does and decide what it should be replaced with.
- Migration to Postgres (not a current priority)
We could use some additional resources to do this work (fellow, or possibly technical student).
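A schema versioning system of the kind described can be as simple as a version table plus an ordered list of migration scripts. A minimal sketch using SQLite for illustration; the real system would target Oracle, and the table and column names here are invented:

```python
import sqlite3

# Ordered migrations: (version, SQL). Illustrative schema, not CTA's.
MIGRATIONS = [
    (1, "CREATE TABLE TAPE (VID TEXT PRIMARY KEY)"),
    (2, "ALTER TABLE TAPE ADD COLUMN IS_FULL INTEGER DEFAULT 0"),
]

def current_version(conn):
    """Read the highest applied schema version (0 if none)."""
    conn.execute("CREATE TABLE IF NOT EXISTS SCHEMA_VERSION (VERSION INTEGER)")
    row = conn.execute("SELECT MAX(VERSION) FROM SCHEMA_VERSION").fetchone()
    return row[0] or 0

def upgrade(conn):
    """Apply, in order, every migration newer than the stored version."""
    v = current_version(conn)
    for version, sql in MIGRATIONS:
        if version > v:
            conn.execute(sql)
            conn.execute("INSERT INTO SCHEMA_VERSION VALUES (?)", (version,))
    conn.commit()
    return current_version(conn)
```

Because `upgrade` only applies migrations above the stored version, it is safe to run repeatedly and on databases created at any earlier schema version.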
RPMs: Oracle OCCI dependency
The upgrade from Oracle 12 to 19 caused us all some pain due to tangled dependencies: oracle-instantclient19.3-meta is the only package that has a CERN version. Steve explained that this can be resolved by fixing the Requires: lines in the spec file for our own RPMs.
Actions
- Steve : fix OCCI dependency in CTA spec file