CTA deployment meeting

Name: CTA deployment meeting
Start: 2019-11-28T14:00:00+01:00
End: 2019-11-28T16:00:00+01:00
Location: CERN

Thursday 28 Nov 2019, 14:00 → 16:00 Europe/Zurich

31/S-023 (CERN)

Michael Davis (CERN)

FTS update

Meeting with LHCb

LHCb has a requirement to transfer data from CTA to Tier-1s (reprocessing use case). These transfers are orchestrated by LHCb.

This would require 4-6 T1s to upgrade their infrastructure. The following options are on the table, ordered from most to least desirable from LHCb's POV:

Option 0: Rely on the DOMA timetable for infrastructure upgrades. Advantage: easy to implement; no changes to Dirac/FTS/gfal. Disadvantage: no guarantee that the upgrades will be done before Run-3.
Option 1: Set up gridFTP gateways to talk to Tier-1s which don't support XRootD TPC with delegation. Advantage: no changes required in T1s. Disadvantage: needs FTS development as it is multi-protocol.
Option 2: Hard-coded multi-hop through big EOS instance. Advantage: does not rely on gridFTP. Disadvantage: needs FTS development. LHCb would have to track 2 transfers (unless this could be managed by FTS).

Eddie is part of a working group to track the schedule for upgrades of dCache in the T1s. He reports that the following T1s have upgraded already: fzk, ndgf, in2p3-cc, sara. All T1 sites are obliged to provide an alternative to gridFTP by end of 2019.

The new version of gfal (with "query prepare") has been released and is in pilot. It will be deployed in production in around one week. The "check m-bit" feature will be in a future release.

Actions

Michael: Determine how much buffer/bandwidth LHCb requires for the CTA to T1 use case. Can it be done with SSDs or do we need to look at spinners like ALICE? (In the October meeting it was stated as 1.5 GB/s.)
Eddie will keep us updated on T1 upgrade schedule.

CTA test status

Repack test

The bug that stopped repack was fixed.

If the EOS instance runs out of disk space, the file to be written is truncated during the flush to disk, but the CLOSEW event is executed anyway. In this case, the file is written to tape but then fails afterwards because the size/checksum don't match. Ideally we would like to prevent the file being written in the first place.
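The kind of check that would prevent a truncated file from being queued can be sketched as follows. This is only an illustration, not EOS or CTA code: the ClosedFile record, its fields and the use of Adler-32 are assumptions made for the example.

```python
import zlib
from dataclasses import dataclass


@dataclass
class ClosedFile:
    """Hypothetical record of a file at CLOSEW time (not a real EOS/CTA type)."""
    expected_size: int     # bytes the client intended to write
    expected_adler32: int  # checksum supplied by the client (assumed Adler-32)
    data: bytes            # what actually reached the disk buffer


def safe_to_archive(f: ClosedFile) -> bool:
    """Return True only if the on-disk copy matches what the client sent."""
    if len(f.data) != f.expected_size:
        # Truncated copy, e.g. the disk filled up during the flush.
        return False
    return (zlib.adler32(f.data) & 0xFFFFFFFF) == f.expected_adler32


payload = b"x" * 1024
checksum = zlib.adler32(payload) & 0xFFFFFFFF
good = ClosedFile(len(payload), checksum, payload)
truncated = ClosedFile(len(payload), checksum, payload[:512])
```

With such a check in front of the archive queue, the truncated copy would be rejected before it is written to tape, rather than failing on size/checksum afterwards.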

Actions

Vlado: re-launch repack test
Michael: follow up on disk full/CLOSEW issue with EOS developers

DB Schema Versioning

The DB schema for CTA v1.0 will be finalised on 6 Dec 2019. This includes the columns needed for future features like RAO+LTO.

Cédric is working on DB schema versioning.

The CTA catalogue will be wiped after the ATLAS recall test in January. The migration will be done onto a fresh install of the DB schema.

Actions

Cédric: take a look at the DB schema versioning/update procedures in CASTOR

CTA version 1.0

Everything to be included in CTA version 1.0 should be committed to master by Friday 6 December. The process of tagging/creating RPMs etc. will be done w/c 9 December.

Testing: Oliver mentioned that various suites of test scripts and tools exist for DPM and gfal.

Actions

Michael: rename CTA_* extended attributes to sys.archive.*
Michael: go through all old tickets containing tests that should be done and create a test schedule to make sure we have covered all corner cases. (Immutable files were mentioned, but there are others.)
Steve: create ticket for "delete file if there is a CTA error during the CLOSEW event" (to allow client retries).
Julien: identify any missing EOS log messages needed to track the full lifecycle of each archive and retrieval request.
Oliver: run test scripts against an EOSCTA endpoint to see what happens.

Documentation

Documentation has been reorganised into developer docs (LaTeX/PDF) and operator docs (mkdocs).

Actions

Vlado: progressively update operator procedures
Michael: follow up with Mélissa on CTA logo
Michael: update CTA website. Include links to monitoring for management (see below).
Julien: update tapeops website with links to monitoring for tape operators (see below).

Monitoring

There are 3 main use cases for monitoring:

1. Management information: historical info, e.g. evolution of the size of data/number of files stored on tape
2. Day-to-day operational information (tape operators)
3. Performance management (developers)

The data collection for case (1) needs to be in place before we start taking physics data, i.e. before we migrate ATLAS.

There are several things missing from our current monitoring:

Evolution of number of files/data size over time: we have it for CASTOR but need a synthesised plot that includes both CASTOR and CTA. There are 2 parts: (1) usage/activity, summarized per transfer by parsing logfiles, (2) total aggregated data on tape
Evolution of media
Occupancy of the tape pools
Summary of how many tape mounts per day, by whom, etc.
Weekly summaries: weekly dump of all tape info
A dashboard like the one for CASTOR Stager activity
Queue information
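The two parts of the "evolution over time" item (per-transfer activity and total aggregated data on tape) can be illustrated with a small sketch. The record layout and the sample values are invented for the example; in practice the records would come from the parsed logfiles, and deletions would also have to be accounted for.

```python
from collections import defaultdict
from datetime import date
from itertools import accumulate

# Hypothetical per-transfer records: (day, system, bytes written to tape).
transfers = [
    (date(2019, 11, 26), "CASTOR", 4_000),
    (date(2019, 11, 26), "CTA", 1_000),
    (date(2019, 11, 27), "CTA", 2_500),
    (date(2019, 11, 28), "CASTOR", 500),
]


def daily_activity(records):
    """Part (1): bytes written per day, summed over CASTOR and CTA."""
    per_day = defaultdict(int)
    for day, _system, nbytes in records:
        per_day[day] += nbytes
    return dict(sorted(per_day.items()))


def cumulative_on_tape(activity):
    """Part (2): running total of data on tape (ignores deletions)."""
    return dict(zip(activity.keys(), accumulate(activity.values())))


activity = daily_activity(transfers)
totals = cumulative_on_tape(activity)
```

Keeping the system name in each record means the same input could also produce separate CASTOR and CTA curves alongside the combined one.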

Daniele Lanza's monitoring (goes to HDFS?) and Aurelian's collectd sensor: these exist for CASTOR. Do they need to be done for CTA as well?

Actions

Michael: detail what monitoring we are missing (by use case)
Michael: detail what scripts are running on which days (check the Rundeck tape instance) and what data sources they are using
Julien: separate our current dashboards into the "tape operator" use case and "developer" use case and consolidate
Julien: check access restrictions on Grafana: are they needed? Remove if not.
Vlado: adapt tsmod-daily-report to report on CTA

CASTOR PUBLIC

It was agreed that the CASTOR backup hardware decommissioning will not influence the migration schedule for CTA.

There are several things which need to be done before we can migrate the backup services to CTA:

Management of encryption keys
Scope bandwidth and space requirements for each backup service
Migrate existing backup data (including splitting into several storage classes and repacking if not all backup users are to be migrated at once)

To be revisited next year.

AOB

Section lunch will be Tue 17 Dec
Michael: present CTA migration plan to the GDB on Wed 11 Dec: https://indico.cern.ch/event/739885/

- 14:00 → 14:10
  FTS update 10m
  - FTS team: feedback on:
    - meeting with LHCb on 27 Nov
    - schedule for release of FTS/gfal ("query prepare" feature and "check m-bit" feature)
  - Michael: feedback on status of LHCb T1s wrt. support for XRootD TPC
- 14:10 → 14:20
  CTA test status 10m
  - The repack test was launched with 4 tapes and 6 drives in order to have max bandwidth. The test failed as the buffer ran out of space. Cédric has done some analysis (issue #681) and identified three issues.
  - Julien: feedback on simple rate limiter test
  - Julien: feedback on full-scale test with ATLAS/Rucio/FTS multi-hop
- 14:20 → 14:30
  DB Schema Versioning 10m
  - Steve: Update on CTA DB schema v1.0
  - Cédric: Update on tools for DB schema versioning/validation/update schema
- 14:30 → 14:40
  CTA version 1.0 10m
  - I propose we freeze all new CTA features for version 1.0. (No more commits to master which add features to core CTA software).
  - Are there any bugs which need to be fixed for v1.0?
  - EOS version: is there anything we need which is not in our current EOS version?
    - delete on CLOSEW
  - Release schedule (v1.0 branch, tagging, RPMs, etc.)
- 14:40 → 14:50
  Documentation 10m
  - CTA/doc/ directory has been reorganised into ConferencePapers, Presentations, DevelopmentNotes, CASTOR, plus the developer documentation.
  - The developer documentation is mainly split across 3 PDFs (cta.pdf, EOSCTA.pdf, CASTOR TapeServer.pdf), which Eric will synthesise into One PDF to Rule Them All.
  - The operational documentation (Twiki, CTA Admin guide, some MD docs) has been consolidated on the internal mkdocs site. Vlado will progressively update this.
  - In future the non-CERN specific parts of the documentation will be moved to the external mkdocs site
  - Michael: Update on CTA logo and website
- 14:50 → 15:00
  Monitoring 10m
  - tsmod-daily-report: Moving checks to CTA (issue #678) is a prerequisite for ATLAS migration. Implementing it as a dashboard instead of e-mails (issue #677) can be done as a later refinement. For ATLAS migration I propose we implement the CTA daily report as an e-mail.
  - Alarm system: Disabling tapes and drives is done. Disabling libraries is not a requirement for migration. The alarm system was de-daemonized and runs via Rundeck, i.e. the alarm system is ready for ATLAS migration.
  Dashboards
  
  We currently have the following dashboards:
  - meter-cta.web.cern.ch: empty dashboard with no data sources
  - monit-grafana.cern.ch/d/000000116/host-metrics: host metrics for tape servers
  - monit-grafana.cern.ch/d/19Z7vUMmk/timings-in-house: CTA timings and metrics
  - monit-grafana.cern.ch/d/5tAh216mk/read-write-throughput-cta: Realtime tape I/O performance
  - filer-carbon.cern.ch/grafana/d/000000036/eos-control-tower: EOS Control Tower (filter by EOSCTA instance)
  - monit-grafana.cern.ch/d/dfpOzRnmz/data-volume: Data Volume (including "Frédéric's plot"), both CASTOR + CTA (scroll down for CTA)
  - monit-grafana.cern.ch/d/rgcoo0vmk/tape-stats-by-density-history-castor-cta: Tape stats by density, both CASTOR + CTA
  - monit-grafana.cern.ch/d/000000786/infrastructure-castor-and-cta: Infrastructure, both CASTOR + CTA
  - monit-grafana.cern.ch/d/obWK_dAik/tape-pools-statistics: CTA tape pool statistics
  (Anything else not on this list?)
  
  Questions:
  1. Who are the audiences for these various dashboards (CTA devops, tape operators, management, ...?)
  2. Is there anything missing which we need to go into production?
     - A dashboard like the one for CASTOR Stager activity
     - anything else?
  3. How should we reorganise the dashboards?
- 15:00 → 15:10
  CASTOR PUBLIC 10m
  - Handover of CASTOR to TAB section is planned for January 2020.
  - CASTOR ‘backup’ disk pool will lose 14 servers in March 2020, leaving 4 servers.
  - There are 7 users of the ‘backup’ service class (= ‘backup’ disk pool by configuration). The biggest users are internal to IT (EOS, AFS, HADOOP).
  - The bandwidth is irregular, with spikes at 2 GB/s
  Proposal
  
  Migrate the 3 biggest users (EOS, AFS, HADOOP) to a new CTA backup instance before the end of March. We can then decommission the 14 oldest disk servers and still serve the other customers with the remaining 4 servers.
  
  Would existing backup data be migrated or would it be left to expire?
  - If migrated, need to separate/repack tapes to be migrated vs. those that will stay (~14 PB)
  - If expired, clients will need to access both the old and the new endpoint. For old backups, sufficient disk capacity for retrieval would have to be kept on CASTOR. And expiry may be far in the future or not exist at all (this is controlled by the clients).
  Problems to be solved
  - Management of encryption keys
- 15:10 → 15:20
  Space-aware GC 10m
  - Steve: update on space-aware garbage collector for ALICE
- 15:20 → 15:30
  AOB 10m
  - Date of section lunch: Tue 17 Dec
