CTA deployment meeting
Europe/Zurich
31-1-012 (CERN)
Prime time preparations:
- HW infrastructure:
- Hardware (SSD-based converged disk servers) will be made available by Eric B. between 15/10 and 22/10. Julien will talk to Eric to see if the hardware can be delivered progressively, so that he can start setting up the core services (MGM, initial FSTs, etc.) as soon as possible. Deploying additional FSTs will take little time as all provisioning procedures are puppetised.
- Vlado will identify tape servers / drives to be reallocated to CTA (~30 drives). 200 LTO tapes will be added to the CTA supply pools. A significant fraction of the tape servers should come from the CASTOR SSD repack pool, which is now only serving Oracle repacking (~10x250 MB/s), so not all of its capacity is needed.
- Repack:
- Repack software is ready.
- A repack head node is available, and an EOS cluster will be installed on the tape servers in order to run a scale repack test (including backstop functionality).
- Around 50 tapes should be filled and repacked using 5+5 drives. Backstop functionality will be exercised, e.g. by changing buffer settings and/or bringing drives down.
- EOS namespace backup/recovery testing:
- Not yet done; Julien will talk to FDO to understand their current procedure and apply the corresponding Puppet scripts for backing up the namespace and the persistent configuration.
- It is paramount to exercise both backup AND recovery in order to validate functionality.
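Exercising recovery, not just backup, means checking the restored namespace against the original. A minimal sketch of such a check, assuming both the pre-backup and post-recovery namespaces can be dumped to text files (the dump format and paths are hypothetical, not from the minutes):

```python
import hashlib

def digest(dump_path: str) -> str:
    """Hash a namespace dump line by line, sorted first,
    so two dumps with the same entries compare equal regardless of order."""
    h = hashlib.sha256()
    with open(dump_path, "rb") as f:
        for line in sorted(f):
            h.update(line)
    return h.hexdigest()

def recovery_ok(before_dump: str, after_dump: str) -> bool:
    # True only if the restored namespace matches the original exactly.
    return digest(before_dump) == digest(after_dump)
```

This only validates the metadata content; a full test would also confirm the restored instance actually serves requests.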
- Import:
- Incremental tests will be run on ctapps, followed by a full ATLAS import.
- Metadata of broken tapes will not be imported. For ATLAS, this represents a handful of STK tapes with few files on them that should be cleaned up beforehand.
- A namespace cross-compare between QuarkDB and Rucio will allow us to understand whether there are inconsistencies.
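The cross-compare reduces to a set difference between the two catalogues. A minimal sketch, assuming each side can be dumped to a set of logical file names (the function and key names are illustrative):

```python
def cross_compare(ns_files: set[str], rucio_files: set[str]) -> dict[str, set[str]]:
    """Return files present on one side only; two empty sets mean consistency."""
    return {
        "only_in_namespace": ns_files - rucio_files,  # unknown to Rucio
        "only_in_rucio": rucio_files - ns_files,      # missing from the namespace
    }
```

In practice the dumps would be streamed rather than held fully in memory, but the comparison logic is the same.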
- Operations:
- The alarm system is well advanced and will be completed by the end of the month.
- Software:
- CTA: all ready (backpressure, cancel, deletions); waiting for a release to be produced early next week.
- Storage classes cannot be easily renamed, as they are referenced in the DB and the object store, and cached by the tape servers.
- RAO information can be added later (4 fields for each tape file / segment: LPOS beginning/end, wrap beginning/end).
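The four per-segment RAO fields above map naturally onto a small record. A sketch of what storing them could look like; only the four quantities come from the minutes, the field and method names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class RaoInfo:
    """Positioning data per tape file / segment, for Recommended Access Order."""
    lpos_begin: int  # longitudinal position at segment start
    lpos_end: int    # longitudinal position at segment end
    wrap_begin: int  # wrap number at segment start
    wrap_end: int    # wrap number at segment end

    def spans_wrap_turn(self) -> bool:
        # A segment crossing a wrap boundary changes read direction mid-file.
        return self.wrap_begin != self.wrap_end
```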
- FTS:
- The "m" bit is not there yet, but it is not critical for ATLAS testing. Needs further discussion with Andrea.
- Multi-hop will be tested next week with the next Rucio version (C. Serfon).
- With the new FTS version, Rucio will throttle the number of concurrent staging requests in order to avoid overloading the FTS scheduler: instead of submitting up to 2M staging requests at once, FTS will process them in chunks of 10K files. While this protects FTS, the chunk size should ideally be configurable so that it can be increased for CTA: we want to minimise tape remounting, and therefore to collect as many requests as possible and serve them in as few tape mounts as possible.
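The chunking behaviour described above can be illustrated with a simple generator; a sketch with the chunk size exposed as a parameter, as argued for above (names and defaults are illustrative, not FTS code):

```python
from itertools import islice
from typing import Iterable, Iterator

def chunked(requests: Iterable[str], chunk_size: int = 10_000) -> Iterator[list[str]]:
    """Yield staging requests in fixed-size chunks.
    A larger chunk_size lets the tape backend batch more requests per mount."""
    it = iter(requests)
    while chunk := list(islice(it, chunk_size)):
        yield chunk
```

With the default of 10K this matches the behaviour described above; raising chunk_size is the knob CTA would want.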
- Upgrading:
- This still requires additional work to follow EOS best practices (MGM, QuarkDB) and to set up redundant CTA front-ends. Ceph and DB redundancy also need to be understood further.
- Vlado will populate GitLab #83 with actual use cases for upgrading components, based on production experience.
There are minutes attached to this event.