CTA deployment meeting

Europe/Zurich
31-1-012 (CERN)

Prime time preparations:

  • HW infrastructure:
    • Hardware (SSD-based converged disk servers) will be made available by Eric B. between 15/10 and 22/10. Julien will talk to Eric to see whether the hardware can be delivered progressively so that he can start setting up the core services (MGM, initial FSTs, etc.) as soon as possible. Deploying additional FSTs afterwards will take little time, as all provisioning procedures are puppetised.
    • Vlado will identify tape servers / drives to be reallocated to CTA (~30 drives). 200 LTO tapes will be added to the CTA supply pools. A significant fraction of the tape servers should come from the CASTOR SSD repack pool, which is now down to serving Oracle repacking (~10x250 MB/s) and therefore does not need all of its capacity.
  • Repack:
    • Repack software is ready.
    • A repack head node is available, and an EOS cluster will be installed on the tape servers in order to run a scale repack test (including backstop functionality).
    • Around 50 tapes should be filled and repacked using 5+5 drives. Backstop functionality will be exercised by e.g. changing buffer settings and/or bringing drives down. 
  • EOS namespace backup/recovery testing:
    • Not yet done; Julien will talk to FDO to understand their current procedure and apply the corresponding Puppet scripts for backing up the namespace and persistent configuration (see the backup/recovery drill sketch after this list).
    • It is paramount to exercise both backup AND recovery in order to validate functionality.
  • Import:
    • Incremental imports will first be exercised on ctapps, followed by a full ATLAS import.
    • Metadata of broken tapes will not be imported. For ATLAS, this represents a handful of STK tapes with few files on them that should be cleaned up beforehand.
    • A namespace cross-compare between QuarkDB and Rucio will make it possible to detect any inconsistencies (see the cross-compare sketch after this list).
  • Operations:
    • The alarm system is well advanced and will be completed by the end of the month.
  • Software:
    • CTA: all ready (backpressure, cancel, deletions); waiting for a release to be produced early next week.
    • Storage classes cannot be easily renamed, as they are referenced in the DB and the object store, and cached by the tape server.
    • RAO information can be added later (4 fields for each tape file / segment: LPOS beginning/end, wrap beginning/end); see the illustrative field layout after this list.
    • FTS: 
      • "m" bit not there yet but not critical for ATLAS testing. Needs further discussions with Andrea.
      • Multi-hop will be tested next week with the next Rucio version (C. Serfon).
      • With the new FTS version, Rucio will throttle the number of concurrent staging requests to avoid overloading the FTS scheduler: instead of receiving up to 2M staging requests at once, FTS will process them in chunks of 10K files. While this protects FTS, the chunk size should ideally be configurable so it can be increased for CTA: we want to collect as many requests as possible and serve them in as few tape mounts as possible, minimising tape remounting (see the chunked-submission sketch after this list).
  • Upgrading:
    • This still requires additional work to follow EOS best practices (MGM, QuarkDB) and to set up redundant CTA front-ends. Ceph and DB redundancy also need to be understood further.
    • Vlado will populate GitLab issue #83 with actual use cases, based on production experience, for upgrading components.
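
The namespace backup/recovery point above stresses exercising both halves of the procedure. A minimal sketch of such a drill follows; the backup/restore commands, the sample path and the namespace check are placeholders, since the real procedure will come from FDO's Puppet-managed scripts.

```python
#!/usr/bin/env python3
"""Minimal sketch of a namespace backup/recovery drill (placeholder commands)."""
import subprocess

BACKUP_CMD  = ["echo", "backup-namespace"]     # placeholder for the real backup command
RESTORE_CMD = ["echo", "restore-namespace"]    # placeholder for the real restore command
CHECK_PATHS = ["/eos/ctaatlas/archive/test1"]  # hypothetical sample entries to verify

def run(cmd):
    """Run one step of the drill and fail loudly if it does not succeed."""
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

def namespace_has(path):
    """Placeholder check: in a real drill this would query the restored instance."""
    return True

def main():
    run(BACKUP_CMD)    # 1. back up the namespace and persistent configuration
    run(RESTORE_CMD)   # 2. restore onto a scratch instance (the half often left untested)
    missing = [p for p in CHECK_PATHS if not namespace_has(p)]
    if missing:
        raise SystemExit(f"recovery drill FAILED, missing entries: {missing}")
    print("recovery drill passed: backup and restore both exercised")

if __name__ == "__main__":
    main()
```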
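
For the QuarkDB/Rucio cross-compare mentioned under Import, the sketch below shows the kind of comparison intended, assuming flat text dumps of logical file names from each side; the dump file names and format are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch of a namespace cross-compare between an EOS/QuarkDB dump and a Rucio
replica dump. Inputs are assumed to hold one logical file name per line."""

def load_names(path):
    """Load one file name per line, ignoring blank lines."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def main():
    quarkdb = load_names("quarkdb_namespace_dump.txt")  # hypothetical EOS/QuarkDB dump
    rucio   = load_names("rucio_replica_dump.txt")      # hypothetical Rucio dump

    only_in_quarkdb = quarkdb - rucio  # files EOS knows about but Rucio does not
    only_in_rucio   = rucio - quarkdb  # files Rucio expects but missing from the namespace

    print(f"files only in QuarkDB: {len(only_in_quarkdb)}")
    print(f"files only in Rucio:   {len(only_in_rucio)}")
    for name in sorted(only_in_rucio)[:20]:  # sample of the most worrying case
        print("missing from namespace:", name)

if __name__ == "__main__":
    main()
```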
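
The RAO note above lists the four per-file positioning fields. As an illustration only (the actual CTA catalogue schema and field names are not specified in these minutes), they could be carried as:

```python
from dataclasses import dataclass

@dataclass
class TapeFileRaoInfo:
    """Per tape file / segment data needed for Recommended Access Order (RAO).
    Field names are illustrative; the real CTA columns may differ."""
    lpos_begin: int  # longitudinal position at the start of the file
    lpos_end: int    # longitudinal position at the end of the file
    wrap_begin: int  # wrap on which the file starts
    wrap_end: int    # wrap on which the file ends
```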
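
To illustrate the FTS chunking point (this is not the real Rucio or FTS API), batching staging requests with a configurable chunk size looks like the sketch below: a small chunk (10K) protects the FTS scheduler, while a larger chunk lets more requests for the same tapes arrive together and be served in fewer mounts.

```python
#!/usr/bin/env python3
"""Illustration of chunked submission of staging requests (not the Rucio/FTS API)."""
from typing import Iterable, Iterator, List

def chunked(requests: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield staging requests in chunks of at most chunk_size."""
    chunk: List[str] = []
    for req in requests:
        chunk.append(req)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def submit_staging(requests: Iterable[str], chunk_size: int = 10_000) -> None:
    """Hand requests to the transfer service chunk by chunk (placeholder print)."""
    for i, chunk in enumerate(chunked(requests, chunk_size)):
        # In reality each chunk would become one bulk staging submission.
        print(f"submitting chunk {i}: {len(chunk)} files")

if __name__ == "__main__":
    # Example: 25K pending requests, default 10K chunks -> 3 submissions.
    pending = (f"file_{n}" for n in range(25_000))
    submit_staging(pending, chunk_size=10_000)
```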

 

Agenda:
    • 14:00-15:20  Prime Time approaching! (1h 20m)

      Timelines for preparing final Infrastructure, ATLAS and Repack production instances by October 1st
      * HW Infrastructure: new SSD disk servers including connectivity, 30-40 tape drives/servers across all libraries
      * Service deployment: ATLAS instance on disk servers, Repack instance on SSD tape servers. EOS namespace backup/recovery testing
      * Software: EOS/CTA with agreed functionality (i.e. backpressure/cancel, GC enhancements, FTS enhancements). Repack fully exercised in a large-scale test (~100 tapes)
      * Import: including error-tolerant and incremental imports
      * Operations: Alarm system fully deployed for CTA. Monitoring dashboard clean-up
      * Anything else?

      From Vlado:

      "German,
      I would add the following 2 points:
      1/ Can we have new release deployed on ctafrontend instance (after Open Days)?
      2/ At some point, upgrade of various components needs to be tested while the system is running. This is outlined in the ticket #83 from 2 years ago: https://gitlab.cern.ch/cta/CTA/issues/83"

    • 15:20-15:40  AOB (20m)
      • Meetings with CMS (+LHCb and ALICE)
      • Meeting(s) with RAL on possible EOSCTA adoption (week 30/9 to 4/10)