CTA deployment meeting

Europe/Zurich
31/S-023 (CERN)

Michael Davis (CERN)

Versioning

Cédric expects to finish the schema validation tool by end of next week.

Actions

  • Cédric: update ticket for cta-admin version command

 

Website

Melissa Loyse Perrey (IR-ECO-DVS) will come back to us with some ideas for the logo.

Actions

  • Julien: move operator monitoring links from CTA website to operator docs website
  • Michael: update CTA front page once we have (a) the logo and (b) management monitoring in place

 

Monitoring

The management monitoring must be in place before we go to production. Other operations monitoring: we have the essentials, more monitoring can be added later according to demand/use cases.

Management monitoring consists of 2 plots:

  • Volume sent to tape with data in Grafana / InfluxDB both for CASTOR and CTA.

  • Data in CASTOR. This shows the amount of ‘live’ (undeleted) data in CASTOR, extracted from the Castor NS. An equivalent plot needs to be created for CTA (extracting information from the CTA catalogue), and as above, we also need an overview plot with the sum of CASTOR + CTA data.

We agreed that the statistics for the "Data in CASTOR" plot should not be part of the CTA schema, because non-CERN sites will do monitoring differently to us. The statistics will be stored in the existing MySQL DB which is already used for CTA monitoring. (Currently that DB contains only one table, populated by David's code.)

(Note from ITMM Minutes: Regarding retention periods, note that all log data sent to the Monitoring Service will be deleted after 13 months unless there is an explicit request to retain.)

Actions

  • Julien: update Data Volume dashboard with combined CASTOR+CTA plot (issue #722)

The following actions need to be allocated:

  • Provide a function in the CTA Frontend that returns a snapshot of how much data has been written to tape/deleted from tape in a given time window. (This function will replace Giuseppe's PL/SQL code.)
  • Add an option to cta-admin to poll this data
  • Create a Rundeck job to populate the MySQL DB
  • Create a Grafana dashboard to view the data
  • Create a method to aggregate the totals from CTA and CASTOR, without double-counting the data which has been migrated from CASTOR to CTA.
  • Clean up the CTA Schema by removing the CASTOR statistics table.
  • Do the same for tape read/write/mount statistics.
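The aggregation action above can be sketched as follows. This is only an illustration under a stated assumption: that data migrated from CASTOR to CTA is still counted in both per-system totals, so it has to be subtracted once. The function name and its parameters are hypothetical and not part of CTA or CASTOR.

```python
def combined_bytes(castor_bytes: int, cta_bytes: int, migrated_bytes: int) -> int:
    """Return the CASTOR + CTA total, counting migrated data only once.

    Assumption (not from the minutes): migrated data is included in both
    castor_bytes and cta_bytes. If the CASTOR total already excludes
    migrated files, pass migrated_bytes=0.
    """
    if min(castor_bytes, cta_bytes, migrated_bytes) < 0:
        raise ValueError("byte counts must be non-negative")
    if migrated_bytes > min(castor_bytes, cta_bytes):
        raise ValueError("migrated volume cannot exceed either system's total")
    # Inclusion-exclusion: the overlap is present in both totals, remove it once.
    return castor_bytes + cta_bytes - migrated_bytes


# Made-up numbers: 100 units in CASTOR, 40 in CTA, of which 30 were migrated.
example_total = combined_bytes(100, 40, 30)
```

Whether the overlap is subtracted at query time or when the totals are stored is an open design choice; the sketch only shows the arithmetic.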

 

Testing

We reviewed, prioritized and allocated the "testing" tickets.

 

Repack

Bottom line: we have to repack 42 PB of data before Run-3. The ATLAS portion is ~10 PB (1500 × 7 TB tapes).
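As a quick check of these figures, the tape count can be converted to a volume. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and that every tape is filled to its nominal 7 TB, neither of which is stated in the minutes.

```python
TB = 10**12  # assumption: decimal terabyte
PB = 10**15  # assumption: decimal petabyte

atlas_tapes = 1500
tape_capacity_bytes = 7 * TB
total_repack_bytes = 42 * PB

# Upper bound on the ATLAS volume if all 1500 tapes are full.
atlas_bytes = atlas_tapes * tape_capacity_bytes
atlas_pb = atlas_bytes / PB                        # 10.5
atlas_share = atlas_bytes / total_repack_bytes     # 0.25

print(f"ATLAS: {atlas_pb:.1f} PB, {atlas_share:.0%} of the total repack volume")
```

Under these assumptions the ATLAS tapes hold at most 10.5 PB, consistent with the ~10 PB quoted, or about a quarter of the 42 PB total.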

We should not immediately reclaim CASTOR tapes which are repacked in CTA, as this removes the possibility of rolling back the files to CASTOR. After the migration there will be a moratorium on reclaiming tapes for at least a few months.

There are no issues in CTA which are blocking us from restarting the repack tests, but we do need the EOS CLOSEW fixes (see below).

Actions

  • Vlado: for CASTOR repack, prioritize tapes which will be part of the ATLAS recall campaign.

 

Other known issues for v1.1 milestone

eosreport: Mihai is working on it; it will be deployed in time for the ATLAS recall exercise.

Zero-length files must be allowed in EOSCTA as the experiments use them for tagging directories with metadata. Giuseppe added a procedure to import zero-length files from CASTOR (not yet tested).

FTS should report zero-length files as "safely on tape" to avoid workflow errors. We need to define what a valid zero-length file is: there is currently a difference between files created with CREAT and files created with touch (ADLER checksum can be zero or 1).
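The two checksum values can be illustrated with Python's standard zlib module, where the Adler-32 checksum of empty input is 1. The acceptance rule below is only one possible definition of a "valid" zero-length file, which the minutes leave open; the function name is hypothetical.

```python
import zlib

# Adler-32 starts from the value 1, so the checksum of no data at all is 1.
EMPTY_ADLER32 = zlib.adler32(b"")


def is_valid_zero_length(size: int, adler32: int) -> bool:
    """Hypothetical rule: accept a zero-length file with checksum 0 or 1.

    0 covers entries where no checksum was ever computed; 1 is the
    Adler-32 of empty content. This is an assumption, not an agreed policy.
    """
    return size == 0 and adler32 in (0, EMPTY_ADLER32)
```

A rule like this would let both kinds of zero-length file be reported as safely on tape, at the cost of treating an unset checksum as valid.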

 

Actions

  • Steve: follow up with Andreas about CLOSEW delete on error/do not execute workflow in case of error/"has tape backend" configuration
  • Michael: follow up with FTS team and EOS team about zero-length file issues
  • Michael: review and prioritize all v1.1 tickets tagged "Operations" and "CTA Frontend"
  • Eric: review and prioritize all v1.1 tickets tagged "Tape Server"

 

AOB

Tier-1 dCache upgrades

Eddie reports that all WLCG sites have been requested to upgrade their dCache instances to version 5.2.* and to enable Storage Resource Reporting (SRR) before the end of March 2020.

From the WLCG Ops minutes:

    17 sites are already running version 5.2.*, 25 to be upgraded
    SRR still to be enabled at all sites. Only JINR enabled it.
    This week all sites will be ticketed, either for upgrade to 5.2.* or for SRR.

Section lunch

Reminder: section lunch is Tue 17 Dec 12.00

 

    • 14:05 14:15
      Versioning 10m

      CTA v1.0-1 released!

      Next target is v1.1 (31 January 2020)

      • Cédric: DB Schema versioning status
      • Review CASTOR's DB Schema versioning and update procedures
      • cta-admin version command
    • 14:15 14:25
      CTA Website 10m

      We need several elements to update the website:

      • Management monitoring (see below)
      • Logo: Melissa Loyse Perrey (IR-ECO-DVS) will come back to us with some ideas
      • Michael: introductory text about CTA
    • 14:25 14:35
      Monitoring 10m

      Management reporting ("Frédéric plot")

      We need a common way to store tape usage information for CASTOR and CTA, so we can have a dashboard for each as well as a combined dashboard.

      The easiest solution is to use a DB, but this is not part of core CTA functionality so should not be in the CTA schema. So, a separate DB for tracking this (preferable to manage this ourselves rather than relying on the monitoring service)?

      Change from PL/SQL DB job (Oracle dependency) to external polling script (Rundeck?)

      Comments from Germán

      For monitoring, there are actually two “Frédéric plots” (also used by Ian, Bernd, Alberto and others):

      • Volume sent to tape with data in Grafana / InfluxDB both for CASTOR and CTA. What needs to be done on that page is to create an additional plot showing the sum of CASTOR+CTA traffic. This shouldn’t take much effort.

      • Data in CASTOR. This shows the amount of ‘live’ (undeleted) data in CASTOR, extracted from the Castor NS. An equivalent plot needs to be created for CTA (extracting information from the CTA catalogue), and as above, we also need an overview plot with the sum of CASTOR + CTA data.

      Comments from Giuseppe

      For the total namespace-extracted data volume plot, I'm agnostic as to what technology you prefer, but I suggest that the output of the GROUP BY "reduction" query - which one way or another has to be executed against the CTA catalogue with the logic currently in catalogue/oracle_catalogue_usage_stats.sql - be stored in a SQL database (the CTA catalogue itself being the most natural choice, a separate one is also an option).

      In other words, I wouldn't rely on the storage offered by the monitoring systems - yesterday it was a custom text-based format, today it is Grafana's internal format + InfluxDB, tomorrow it will be something else, and we have historical data from 2001...

      BTW, the data pushed to https://filer-carbon.cern.ch/grafana/d/000000067/castor-dashboard?orgId=1 is also pushed to InfluxDB (work done by D. Lanza).

      Germán: good and important point. Equally applicable to the tape read/write/mount statistics - we still have an Oracle logging DB that has kept all mount events for a dozen years, having survived 3 different IT monitoring systems...
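The idea of keeping the output of the GROUP BY reduction in a SQL table can be sketched as below. This is a minimal illustration using an in-memory SQLite database as a stand-in; the table and column names (archive_file, usage_stats, vo, size_in_bytes) are hypothetical and do not reflect the real CTA catalogue schema or the logic in catalogue/oracle_catalogue_usage_stats.sql.

```python
import sqlite3


def store_usage_snapshot(conn: sqlite3.Connection, snapshot_time: str) -> None:
    """Reduce the per-file table to per-VO totals and append them as a snapshot."""
    conn.execute(
        "INSERT INTO usage_stats (snapshot_time, vo, n_files, total_bytes) "
        "SELECT ?, vo, COUNT(*), SUM(size_in_bytes) "
        "FROM archive_file GROUP BY vo",
        (snapshot_time,),
    )
    conn.commit()


# Stand-in database with made-up rows, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE archive_file (vo TEXT, size_in_bytes INTEGER);
    CREATE TABLE usage_stats (
        snapshot_time TEXT, vo TEXT, n_files INTEGER, total_bytes INTEGER
    );
    """
)
conn.executemany(
    "INSERT INTO archive_file VALUES (?, ?)",
    [("atlas", 100), ("atlas", 50), ("cms", 70)],
)
store_usage_snapshot(conn, "2019-12-01T00:00:00")
rows = conn.execute(
    "SELECT vo, n_files, total_bytes FROM usage_stats ORDER BY vo"
).fetchall()
```

Because each run appends rows keyed by snapshot time, the history stays in plain SQL regardless of which dashboard tool reads it.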

      Review of previous Action Points

      Access restrictions on Grafana were put in place as historically the graphs were expensive to generate. Julien will check with monitoring team whether this is still the case and whether we can remove the restrictions.

      EOS logging: some improvements were made in EOS 4.6.6, some additional logging can go into the next EOS release (Julien will report)

      Separating/consolidating current dashboards into the "tape operator" and "developer" use cases is targeted for v1.1.

      • Michael: detail what monitoring is missing (by use case)
      • Michael: detail what scripts are running on which days (check the Rundeck tape instance) and what data sources they are using

      Dashboards (for reference)

      We currently have the following dashboards:

    • 14:35 14:45
      Testing 10m

      There are about 15 tickets tagged with "Testing" which need to be done.

      Can everyone pick one and implement the test? If we all do 3 tests each they will be covered.

    • 14:45 14:55
      Repack 10m

      ATLAS has to be repacked and it's probably not possible to do it before migration to CTA (1500 tapes, Vlado estimates 3 months)

      We will take a 2-pronged approach:

      1. Vlado to repack as much of ATLAS as possible in CASTOR within the constraints of the reprocessing campaign. ATLAS recall exercise must achieve 10 GB/s for a sustained period (min. 24 hours/max 5 days) so no repack can be done during this stress test.
      2. Fix CTA repack:

        • EOS CLOSEW fixes (see below)
        • Are there any CTA repack issues which are blocking?
        • Repeat CTA repack tests
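To give a sense of scale for the recall exercise in item 1, the target rate can be converted into a data volume for the shortest and longest test windows. The sketch assumes decimal units (1 GB = 10^9 bytes) and a constant 10 GB/s throughout, which is an idealisation.

```python
GB = 10**9   # assumption: decimal gigabyte
TB = 10**12
PB = 10**15

recall_rate_bytes_per_s = 10 * GB


def recalled_bytes(hours: int) -> int:
    """Volume recalled at the constant target rate over the given duration."""
    return recall_rate_bytes_per_s * hours * 3600


min_window_bytes = recalled_bytes(24)       # 24 hours
max_window_bytes = recalled_bytes(5 * 24)   # 5 days

print(f"24 hours: {min_window_bytes / TB:.0f} TB")
print(f"5 days:   {max_window_bytes / PB:.2f} PB")
```

Under these assumptions the exercise moves between 864 TB and 4.32 PB.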

      Comments from Germán

      I recommend not entangling IBM 7TB tape repacking with ATLAS testing/migration, as it’s a potential source of inconsistencies and headaches for CASTOR-CTA metadata and tape consistency.

      There is no operational hurry to start CASTOR repacking now as we have sufficient free IBM enterprise slots (corresponding to 180PB of additional data - not counting free LTO slots on LIB1 nor on the new Spectralogic library).

      I suggest that 7TB repacking is postponed until after the switchover and used as a field test for mass CTA repacking. There will be sufficient time to complete 7TB repacking before Run-3 (and it could even stretch into Run-3 if really necessary).

    • 14:55 15:05
      Other known issues for v1.1 milestone 10m

      Outstanding issues on EOS:

      • Improvements to eosreport
      • Do not execute CLOSEW in case of error flushing file to disk
      • Delete file if CTA CLOSEW event returns an error
      • "has a tape backend" configuration option
      • Refactoring of FTS close write code

      Outstanding CTA issues:

      • Issues flagged up by testing
      • Issues tagged "Tape Server"
      • Issues tagged "Operator Tools/Frontend"
    • 15:05 15:15
      AOB 10m
      • Date of section lunch: Tue 17 Dec