CTA deployment meeting

31/S-023 (CERN)
Michael Davis (CERN)

Versioning

Cédric expects to finish the schema validation tool by end of next week.

Actions

  • Cédric: update ticket for cta-admin version command

 

Website

Melissa Loyse Perrey (IR-ECO-DVS) will come back to us with some ideas for the logo.

Actions

  • Julien: move operator monitoring links from CTA website to operator docs website
  • Michael: update CTA front page once we have (a) the logo and (b) management monitoring in place

 

Monitoring

The management monitoring must be in place before we go to production. For other operations monitoring we have the essentials; more monitoring can be added later according to demand/use cases.

Management monitoring consists of 2 plots:

  • Volume sent to tape, with data in Grafana/InfluxDB for both CASTOR and CTA.

  • Data in CASTOR. This shows the amount of ‘live’ (undeleted) data in CASTOR, extracted from the Castor NS. An equivalent plot needs to be created for CTA (extracting information from the CTA catalogue), and as above, we also need an overview plot with the sum of CASTOR + CTA data.

We agreed that the statistics for the "Data in CASTOR" plot should not be part of the CTA schema, because non-CERN sites will do monitoring differently from us. These statistics will be stored in the existing MySQL DB which is already used for CTA monitoring. (Currently this DB contains only one table, populated by David's code.)
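
As an illustration only, a minimal sketch of what such a statistics table could look like; the table, column and credential names below are hypothetical, not the actual schema of the existing monitoring DB:

    # Hypothetical sketch: a per-snapshot data-volume table in the existing
    # monitoring MySQL DB. Table, column and credential names are made up.
    import mysql.connector

    DDL = """
    CREATE TABLE IF NOT EXISTS data_volume_snapshot (
        snapshot_time DATETIME        NOT NULL,  -- when the snapshot was taken
        source        VARCHAR(16)     NOT NULL,  -- 'CASTOR' or 'CTA'
        live_bytes    BIGINT UNSIGNED NOT NULL,  -- 'live' (undeleted) data
        PRIMARY KEY (snapshot_time, source)
    )
    """

    conn = mysql.connector.connect(host="localhost", user="ctamon",
                                   password="secret", database="cta_monitoring")
    cur = conn.cursor()
    cur.execute(DDL)
    conn.commit()
    conn.close()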

(Note from ITMM Minutes: Regarding retention periods, note that all log data sent to the Monitoring Service will be deleted after 13 months unless there is an explicit request to retain.)

Actions

  • Julien: update Data Volume dashboard with combined CASTOR+CTA plot (issue #722)

The following actions need to be allocated:

  • Provide a function in the CTA Frontend that returns a snapshot of how much data has been written to tape/deleted from tape in a given time window. (This function will replace Giuseppe's PL/SQL code.)
  • Add an option to cta-admin to poll this data
  • Create a Rundeck job to populate the MySQL DB
  • Create a Grafana dashboard to view the data
  • Create a method to aggregate the totals from CTA and CASTOR, without double-counting the data which has been migrated from CASTOR to CTA (see the sketch after this list).
  • Clean up the CTA Schema by removing the CASTOR statistics table.
  • Do the same for tape read/write/mount statistics.
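
A minimal sketch of the aggregation step referenced above, assuming hypothetical inputs: the live bytes reported independently by CASTOR and by CTA, plus the bytes already migrated from CASTOR to CTA (still counted on unreclaimed CASTOR tapes), which would otherwise be counted twice. All figures are illustrative.

    # Hypothetical aggregation of CASTOR + CTA totals without double counting.
    # The real inputs would come from the monitoring DB / catalogues.
    def combined_live_bytes(castor_live: int, cta_live: int,
                            migrated_from_castor: int) -> int:
        """Total live data across CASTOR and CTA.

        Files already migrated from CASTOR to CTA still appear in the CASTOR
        inventory until their tapes are reclaimed, so subtract them once.
        """
        return castor_live + cta_live - migrated_from_castor

    PB = 10**15
    total = combined_live_bytes(castor_live=300 * PB, cta_live=50 * PB,
                                migrated_from_castor=42 * PB)
    print(f"Combined live data: {total / PB:.1f} PB")  # -> 308.0 PB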

 

Testing

We reviewed, prioritized and allocated the "testing" tickets.

 

Repack

Bottom line: we have to repack 42 PB of data before Run-3. The ATLAS portion is ~10 PB (1500 × 7 TB tapes).

We should not immediately reclaim CASTOR tapes which are repacked in CTA, as this removes the possibility of rolling back the files to CASTOR. After the migration there will be a moratorium on reclaiming tapes for at least a few months.

There are no issues in CTA which are blocking us from restarting the repack tests, but we do need the EOS CLOSEW fixes (see below).

Actions

  • Vlado: for CASTOR repack, prioritize tapes which will be part of the ATLAS recall campaign.

 

Other known issues for v1.1 milestone

eosreport: Mihai is working on it; it will be deployed in time for the ATLAS recall exercise.

Zero-length files must be allowed in EOSCTA as the experiments use them for tagging directories with metadata. Giuseppe added a procedure to import zero-length files from CASTOR (not yet tested).

FTS should report zero-length files as "safely on tape" to avoid workflow errors. We need to define what a valid zero-length file is: there is currently a difference between files created with CREAT and files created with touch (ADLER checksum can be zero or 1).
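
For reference, the distinction is easy to reproduce: the Adler-32 checksum of zero bytes of data is defined to be 1, so a zero-length file whose checksum was actually computed should store 1, whereas a recorded value of 0 presumably means no checksum was ever computed. A quick check in Python:

    import zlib

    # Adler-32 of an empty byte string is 1 (0x00000001) by definition,
    # so a zero-length file with a properly computed checksum stores 1,
    # while an entry whose checksum was never computed may show 0.
    print(hex(zlib.adler32(b"")))  # -> 0x1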

 

Actions

  • Steve: follow up with Andreas about CLOSEW delete on error/do not execute workflow in case of error/"has tape backend" configuration
  • Michael: follow up with FTS team and EOS team about zero-length file issues
  • Michael: review and prioritize all v1.1 tickets tagged "Operations" and "CTA Frontend"
  • Eric: review and prioritize all v1.1 tickets tagged "Tape Server"

 

AOB

Tier-1 dCache upgrades

Eddie reports that all WLCG sites have been requested to upgrade their dCache instances to version 5.2.* and to enable Storage Resource Reporting (SRR) before the end of March 2020.

From the WLCG Ops minutes:

    17 sites are already running version 5.2.*; 25 remain to be upgraded.
    SRR is still to be enabled at all sites; only JINR has enabled it so far.
    This week all sites will be ticketed, either for the upgrade to 5.2.* or for SRR.

Section lunch

Reminder: section lunch is Tue 17 Dec 12.00

 
