CTA deployment meeting

31/S-023 (CERN)
Michael Davis (CERN)

Versioning

Cédric expects to finish the schema validation tool by end of next week.

Actions

  • Cédric: update ticket for cta-admin version command

 

Website

Melissa Loyse Perrey (IR-ECO-DVS) will come back to us with some ideas for the logo.

Actions

  • Julien: move operator monitoring links from CTA website to operator docs website
  • Michael: update CTA front page once we have (a) the logo and (b) management monitoring in place

 

Monitoring

The management monitoring must be in place before we go to production. For other operations monitoring we have the essentials; more monitoring can be added later according to demand/use cases.

Management monitoring consists of 2 plots:

  • Volume sent to tape, with data in Grafana/InfluxDB for both CASTOR and CTA.

  • Data in CASTOR. This shows the amount of ‘live’ (undeleted) data in CASTOR, extracted from the Castor NS. An equivalent plot needs to be created for CTA (extracting information from the CTA catalogue), and as above, we also need an overview plot with the sum of CASTOR + CTA data.

We agreed that the statistics for the "Data in CASTOR" plot should not be part of the CTA schema, because non-CERN sites will do monitoring differently from us. These statistics will be stored in the existing MySQL DB which is already used for CTA monitoring. (Currently this DB contains only one table, populated by David's code.)
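
As an illustration only, a minimal sketch of what such a statistics table could look like; the table, column and credential names below are hypothetical, not the actual schema of the existing monitoring DB:

    # Hypothetical sketch: a per-snapshot data-volume table in the existing
    # monitoring MySQL DB. Table, column and credential names are made up.
    import mysql.connector

    DDL = """
    CREATE TABLE IF NOT EXISTS data_volume_snapshot (
        snapshot_time DATETIME        NOT NULL,  -- when the snapshot was taken
        source        VARCHAR(16)     NOT NULL,  -- 'CASTOR' or 'CTA'
        live_bytes    BIGINT UNSIGNED NOT NULL,  -- 'live' (undeleted) data
        PRIMARY KEY (snapshot_time, source)
    )
    """

    conn = mysql.connector.connect(host="localhost", user="ctamon",
                                   password="secret", database="cta_monitoring")
    cur = conn.cursor()
    cur.execute(DDL)
    conn.commit()
    conn.close()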

(Note from ITMM Minutes: Regarding retention periods, note that all log data sent to the Monitoring Service will be deleted after 13 months unless there is an explicit request to retain.)

Actions

  • Julien: update Data Volume dashboard with combined CASTOR+CTA plot (issue #722)

The following actions need to be allocated:

  • Provide a function in the CTA Frontend that returns a snapshot of how much data has been written to tape/deleted from tape in a given time window. (This function will replace Giuseppe's PL/SQL code.)
  • Add an option to cta-admin to poll this data
  • Create a Rundeck job to populate the MySQL DB
  • Create a Grafana dashboard to view the data
  • Create a method to aggregate the totals from CTA and CASTOR, without double-counting the data which has been migrated from CASTOR to CTA (see the sketch after this list).
  • Clean up the CTA Schema by removing the CASTOR statistics table.
  • Do the same for tape read/write/mount statistics.
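
A minimal sketch of the aggregation step referenced above, assuming hypothetical inputs: the live bytes reported independently by CASTOR and by CTA, plus the bytes already migrated from CASTOR to CTA (still counted on unreclaimed CASTOR tapes), which would otherwise be counted twice. All figures are illustrative.

    # Hypothetical aggregation of CASTOR + CTA totals without double counting.
    # The real inputs would come from the monitoring DB / catalogues.
    def combined_live_bytes(castor_live: int, cta_live: int,
                            migrated_from_castor: int) -> int:
        """Total live data across CASTOR and CTA.

        Files already migrated from CASTOR to CTA still appear in the CASTOR
        inventory until their tapes are reclaimed, so subtract them once.
        """
        return castor_live + cta_live - migrated_from_castor

    PB = 10**15
    total = combined_live_bytes(castor_live=300 * PB, cta_live=50 * PB,
                                migrated_from_castor=42 * PB)
    print(f"Combined live data: {total / PB:.1f} PB")  # -> 308.0 PB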

 

Testing

We reviewed, prioritized and allocated the "testing" tickets.

 

Repack

Bottom line: we have to repack 42 PB of data before Run-3. The ATLAS portion is ~10 PB (1500 × 7 TB tapes).

We should not immediately reclaim CASTOR tapes which are repacked in CTA, as this removes the possibility of rolling back the files to CASTOR. After the migration there will be a moratorium on reclaiming tapes for at least a few months.

There are no issues in CTA which are blocking us from restarting the repack tests, but we do need the EOS CLOSEW fixes (see below).

Actions

  • Vlado: for CASTOR repack, prioritize tapes which will be part of the ATLAS recall campaign.

 

Other known issues for v1.1 milestone

eosreport: Mihai is working on it; it will be deployed in time for the ATLAS recall exercise.

Zero-length files must be allowed in EOSCTA as the experiments use them for tagging directories with metadata. Giuseppe added a procedure to import zero-length files from CASTOR (not yet tested).

FTS should report zero-length files as "safely on tape" to avoid workflow errors. We need to define what a valid zero-length file is: there is currently a difference between files created with CREAT and files created with touch (ADLER checksum can be zero or 1).
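
For reference, the distinction is easy to reproduce: the Adler-32 checksum of zero bytes of data is defined to be 1, so a zero-length file whose checksum was actually computed should store 1, whereas a recorded value of 0 presumably means no checksum was ever computed. A quick check in Python:

    import zlib

    # Adler-32 of an empty byte string is 1 (0x00000001) by definition,
    # so a zero-length file with a properly computed checksum stores 1,
    # while an entry whose checksum was never computed may show 0.
    print(hex(zlib.adler32(b"")))  # -> 0x1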

 

Actions

  • Steve: follow up with Andreas about CLOSEW delete on error/do not execute workflow in case of error/"has tape backend" configuration
  • Michael: follow up with FTS team and EOS team about zero-length file issues
  • Michael: review and prioritize all v1.1 tickets tagged "Operations" and "CTA Frontend"
  • Eric: review and prioritize all v1.1 tickets tagged "Tape Server"

 

AOB

Tier-1 dCache upgrades

Eddie reports that all WLCG sites have been requested to upgrade their dCache instances to version 5.2.* and to enable Storage Resource Reporting (SRR) before the end of March 2020.

From the WLCG Ops minutes:

    17 sites are already running version 5.2.*; 25 remain to be upgraded.
    SRR is still to be enabled at all sites; only JINR has enabled it so far.
    This week all sites will be ticketed, either for the upgrade to 5.2.* or for SRR.

Section lunch

Reminder: section lunch is Tue 17 Dec 12.00

 
