CTA deployment meeting

Name: CTA deployment meeting
Start: 2020-02-06T14:00:00+01:00
End: 2020-02-06T15:35:00+01:00
Location: CERN

Thursday 6 Feb 2020, 14:00 → 15:35 Europe/Zurich

600/R-001 (CERN)

Michael Davis (CERN)

Timeline for the next few weeks

The timeline will be adjusted as events unfold, but this is the current plan:

week commencing 10 Feb

Big picture: prepare everything for next round of ATLAS recall tests.

EOS namespace backup/restore test [Julien]
Start write stress test on PPS instance [Julien]
Update DB schema with extra columns required by CTA v1.1 [Cédric]
Release CTA v1.1 and deploy on EOSCTAATLAS. This will include:
- Release Notes/Changelog (MD file + RPM spec file + GitLab tag). [Cédric]
- "cta-admin tapefile ls". ("archivefile ls" will stay in for now until the new tape summary functionality is ready. It will be removed in an upcoming release.)
- Schema migration tool
- Numerous bug fixes, clean-ups and better instrumentation
Close the CTA v1.1 milestone
Deploy new version of EOS on EOSCTAATLAS [Julien]
- 4.6.8 gives us "delete on close"
- 4.6.9 (if available) will give us XRootD 4.11.2, but we are happy to stick with 4.11.0 for the recall tests, so not a blocker.
- EOS-3913 is not done yet but Julien has a workaround, so not a blocker.
Repack ATLAS tapes (in CASTOR) until Wed 12th [Vlado]
Thu 13th: wipe and re-migrate CASTOR ATLAS to take account of repacked tapes [Michael]
Thu 13th: ATLAS Software Week Tape Carousel session: Eddie will report on the recall campaign so far, including an explanation of the gfal bug.
Fri 14th: Plenary session at ATLAS Software Week to present the CTA timeline to production

w/c 17 Feb

ATLAS recall will resume on Monday 17 Feb (provisional date, subject to confirmation by ATLAS DDMs)

w/c 24 Feb

"Official" write stress test with ATLAS
- Fix any problems identified by testing from w/c 10 Feb
- Simultaneously run our own write/recall/delete tests on the instance
- Test dual-copy tape pools
Tier-1 export test
- Test with XRootD 4.11.2 and EOS 4.6.9, to allow testing of TPC proxy delegation

w/c 2 Mar

Integration test with ATLAS online:
- Writing without FTS
- The "is file safely on tape" test
- "What happens when the buffer is full" test (in particular, ensure that online writes cannot be blocked by failed writes from elsewhere or by recalls filling the buffer)
Note that the date is provisional: it needs to be scheduled with ATLAS SFO and must avoid the TDAQ milestone stress test weeks.

w/c 9 Mar

One-week "cool off" period with no writes to CASTOR, to ensure all files have made it to tape and to check that no further data is being written

w/c 16 Mar

Provisional date for EOSCTAATLAS to go into production.

Monitoring

David presented the currently-available monitoring dashboards.
We identified two main use cases (see below).

Management Dashboard

Total volume on tape/total number of files on tape: this exists for CASTOR but not for CTA. Cédric is working on calculating the tape summary statistics. When this is done we can create the CTA graph and a combined CASTOR+CTA graph.
Data transferred to tape by tape session by day: this exists for CASTOR and CTA. No combined graph exists, and one is probably not necessary. (It can be created later if needed.)

These two dashboards are all that is needed. We probably want a way to display an up-to-date snapshot of the first graph on the webpage.
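
As an illustration of what the combined CASTOR+CTA graph would need to compute, here is a minimal sketch. It assumes each system can provide a cumulative "bytes on tape" value per day. The function name, the data layout and the numbers are hypothetical placeholders, not the real monitoring data or the way the dashboards are actually built.

```python
from datetime import date


def combine_totals(castor, cta):
    """Return {day: castor_total + cta_total} for every day in either series.

    Each input maps a date to the cumulative bytes on tape on that date.
    A day missing from one series reuses that series' most recent earlier
    value (0 if there is none), because a cumulative total persists
    between samples.
    """
    combined = {}
    last_castor = 0
    last_cta = 0
    for day in sorted(set(castor) | set(cta)):
        last_castor = castor.get(day, last_castor)
        last_cta = cta.get(day, last_cta)
        combined[day] = last_castor + last_cta
    return combined


# Illustrative placeholder values only, not real measurements.
castor_bytes = {date(2020, 2, 10): 100, date(2020, 2, 11): 100}
cta_bytes = {date(2020, 2, 11): 5, date(2020, 2, 12): 8}

combined = combine_totals(castor_bytes, cta_bytes)
for day, total in combined.items():
    print(day.isoformat(), total)
```

Carrying the last known value forward, rather than treating a missing day as zero, avoids artificial dips in the combined curve when one system reports less often than the other.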

Actions

Julien raised a question about how testing data is represented on the second graph. How should we separate monitoring of test and production instances? Do we need to clean all the statistics for testing before we go to production? To be clarified.

Operations/Developer Dashboards

All the available graphs are displayed on the main Grafana landing page.
There is also a "monitoring the monitoring system" dashboard. No changes to it are required.