Timeline for the next few weeks
The timeline will be adjusted as events unfold, but this is the current plan:
week commencing 10 Feb
Big picture: prepare everything for next round of ATLAS recall tests.
- EOS namespace backup/restore test [Julien]
- Start write stress test on PPS instance [Julien]
- Update DB schema with extra columns required by CTA v1.1 [Cédric]
- Release CTA v1.1 and deploy on EOSCTAATLAS. This will include:
  - Release Notes/Changelog (MD file + RPM spec file + GitLab tag) [Cédric]
  - "cta-admin tapefile ls" ("archivefile ls" will stay in for now, until the new tape summary functionality is ready; it will be removed in an upcoming release)
  - Schema migration tool
  - Numerous bug fixes, clean-ups and better instrumentation
- Close the CTA v1.1 milestone
- Deploy new version of EOS on EOSCTAATLAS [Julien]
  - 4.6.8 gives us "delete on close"
  - 4.6.9 (if available) will give us XRootD 4.11.2, but we are happy to stick with 4.11.0 for the recall tests, so not a blocker.
  - EOS-3913 is not done yet but Julien has a workaround, so not a blocker.
- Repack ATLAS tapes (in CASTOR) until Wed 12th [Vlado]
- Thu 13th: wipe and re-migrate CASTOR ATLAS to take account of repacked tapes [Michael]
- Thu 13th: ATLAS Software Week Tape Carousel session: Eddie will report on the recall campaign so far, including an explanation of the gfal bug.
- Fri 14th: Plenary session at ATLAS Software Week to present CTA timeline to production
w/c 17 Feb
- ATLAS recall will resume on Monday (provisional date, subject to confirmation by ATLAS DDMs)
w/c 24 Feb
- "Official" write stress test with ATLAS
- Fix any problems identified by testing from w/c 10 Feb
- Simultaneously run our own write/recall/delete tests on the instance
- Test dual-copy tape pools
- Tier-1 export test
- Test with XRootD 4.11.2 and EOS 4.6.9, to allow testing of TPC proxy delegation
w/c 2 Mar
- Integration test with ATLAS online:
  - Writing without FTS
  - The "is file safely on tape" test
  - "What happens when the buffer is full" test (in particular, ensure that online writes cannot be blocked by failed writes from elsewhere or by recalls filling the buffer)
- Note that the date is provisional: it needs to be scheduled with ATLAS SFO and must avoid the TDAQ milestone stress test weeks.
w/c 9 Mar
- One-week "cool off" period with no writes to CASTOR, to ensure all files have made it to tape and to check that no further data is being written
w/c 16 Mar
- Provisional date for EOSCTAATLAS to go into production.
Monitoring
- David presented the currently-available monitoring dashboards.
- We identified two main use cases (see below).
Management Dashboard
- Total volume on tape/total number of files on tape: this exists for CASTOR but not for CTA. Cédric is working on calculating the tape summary statistics. When this is done we can create the CTA graph and a combined CASTOR+CTA graph.
- Data transferred to tape by tape session by day: this exists for CASTOR and CTA. No combined graph exists but this is probably not necessary. (It can be created later if necessary.)
These two dashboards are all that is needed. We probably want a way to display an up-to-date snapshot of the first graph on the webpage.
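As an illustration of the combined CASTOR+CTA graph mentioned above, the per-day totals from the two systems could be summed along the following lines. This is only a sketch: the function name, the input layout (a mapping from a day label to a bytes/files pair) and the sample numbers are all hypothetical, and the real inputs will depend on how the CTA tape summary statistics end up being exposed.

```python
from collections import defaultdict


def combine_totals(castor, cta):
    """Sum two {day: (bytes_on_tape, files_on_tape)} mappings day by day.

    Hypothetical helper: the input layout is an assumption, not the
    format produced by CASTOR or CTA.
    """
    combined = defaultdict(lambda: (0, 0))
    for series in (castor, cta):
        for day, (nbytes, nfiles) in series.items():
            total_bytes, total_files = combined[day]
            combined[day] = (total_bytes + nbytes, total_files + nfiles)
    # Return the days in sorted order so the result can be plotted directly.
    return dict(sorted(combined.items()))


# Made-up sample data, for illustration only.
castor = {"day-1": (100, 10), "day-2": (120, 12)}
cta = {"day-2": (30, 3)}
combined = combine_totals(castor, cta)
```

A day that appears in only one of the two inputs is carried over unchanged, so the combined series stays correct while CTA statistics are still missing for earlier dates.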
Actions
- Julien raised a question about how testing data is represented on the second graph. How should we separate monitoring of test and production instances? Do we need to clean all the statistics for testing before we go to production? To be clarified.
Operations/Developer Dashboards
- All the available graphs are displayed on the main Grafana landing page.
- There is also a "monitoring the monitoring system" dashboard. No change required to this.
Actions
- David: go through the graphs and delete the ones which are obviously junk (e.g. created for a specific incident with hard-coded tape IDs) or which don't work.
  - I assume we can also delete https://meter-cta.web.cern.ch/, which is empty?
- Vlado/Oliver: take a look at the list of charts available for CASTOR. For any chart you need, check if there is an equivalent chart available for CTA (some are, some are not). Flag any that are missing so we can create them.
- Michael: create a website ("one landing page to rule them all") with introductory text and links to the monitoring dashboards, operations website and documentation.
Backpressure
Cédric presented the backpressure mechanism implemented in the tapeserver.
The discussion was cut short as we ran out of time. We will schedule a technical meeting to finish the discussion.