CTA deployment meeting

Europe/Zurich
600/R-001 (CERN)

Michael Davis (CERN)

Timeline for the next few weeks

The timeline will be adjusted as events unfold, but this is the current plan:

week commencing 10 Feb

Big picture: prepare everything for the next round of ATLAS recall tests.

  • EOS namespace backup/restore test [Julien]
  • Start write stress test on PPS instance [Julien]
  • Update DB schema with extra columns required by CTA v1.1 [Cédric]
  • Release CTA v1.1 and deploy on EOSCTAATLAS. This will include:
    • Release Notes/Changelog (MD file + RPM spec file + GitLab tag). [Cédric]
    • "cta-admin tapefile ls". ("archivefile ls" will stay in for now until the new tape summary functionality is ready. It will be removed in an upcoming release.)
    • Schema migration tool
    • Numerous bug fixes, clean-ups and better instrumentation
  • Close the CTA v1.1 milestone
  • Deploy new version of EOS on EOSCTAATLAS [Julien]
    • 4.6.8 gives us "delete on close"
    • 4.6.9 (if available) will give us XRootD 4.11.2, but we are happy to stick with 4.11.0 for the recall tests, so not a blocker.
    • EOS-3913 is not done yet but Julien has a workaround, so not a blocker.
  • Repack ATLAS tapes (in CASTOR) until Wed 12th [Vlado]
  • Thu 13th: wipe and re-migrate CASTOR ATLAS to take account of repacked tapes [Michael]
  • Thu 13th: ATLAS Software Week Tape Carousel session: Eddie will report on the recall campaign so far, including an explanation of the gfal bug.
  • Fri 14th: Plenary session at ATLAS Software Week to present the CTA timeline to production

w/c 17 Feb

  • ATLAS recall will resume on Monday (provisional date, subject to confirmation by ATLAS DDMs)

w/c 24 Feb

  • "Official" write stress test with ATLAS
    • Fix any problems identified by testing from w/c 10 Feb
    • Simultaneously run our own write/recall/delete tests on the instance
    • Test dual-copy tape pools
  • Tier-1 export test
    • Test with XRootD 4.11.2 and EOS 4.6.9, to allow testing of TPC proxy delegation

w/c 2 Mar

  • Integration test with ATLAS online:
    • Writing without FTS
    • The "is file safely on tape" test
    • "What happens when the buffer is full" test (in particular, ensure that online writes cannot be blocked by failed writes from elsewhere or by recalls filling the buffer)
  • Note that the date is provisional: it needs to be scheduled with ATLAS SFO and must avoid the TDAQ milestone stress test weeks.

w/c 9 Mar

  • One week "cool off" period with no writes to CASTOR, to ensure all files have made it to tape and to check that no further data is being written

w/c 16 Mar

  • Provisional date for EOSCTAATLAS to go into production.

 

Monitoring

  • David presented the currently-available monitoring dashboards.
  • We identified two main use cases (see below).

Management Dashboard

  • Total volume on tape/total number of files on tape: this exists for CASTOR but not for CTA. Cédric is working on calculating the tape summary statistics. When this is done we can create the CTA graph and a combined CASTOR+CTA graph.
  • Data transferred to tape by tape session by day: this exists for CASTOR and CTA. No combined graph exists but this is probably not necessary. (It can be created later if necessary.)

These two dashboards are all that is needed. We probably want a way to display an up-to-date snapshot of the first graph on the webpage.

Actions

  • Julien raised a question about how testing data is represented on the second graph. How should we separate monitoring of test and production instances? Do we need to clean all the statistics for testing before we go to production? To be clarified.

Operations/Developer Dashboards

  • All the available graphs are displayed on the main Grafana landing page.
  • There is also a "monitoring the monitoring system" dashboard. No change required to this.

Actions

  • David: go through the graphs and delete the ones which are obviously junk (e.g. created for a specific incident with hard-coded tape IDs) or which don't work.
  • I assume we can also delete https://meter-cta.web.cern.ch/ which is empty?
  • Vlado/Oliver: take a look at the list of charts available for CASTOR. For any chart you need, check if there is an equivalent chart available for CTA (some are, some are not). Flag any that are missing so we can create them.
  • Michael: Create a website ("one landing page to rule them all") with introductory text and links to the monitoring dashboards, operations website and documentation.

 

Backpressure

Cédric presented the backpressure mechanism implemented in the tapeserver.

The discussion was cut short as we ran out of time. We will schedule a technical meeting to finish the discussion.
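
As background for the follow-up meeting, the general idea of backpressure on a disk buffer can be sketched as below. This is not the tapeserver mechanism that Cédric presented, which is not detailed in these minutes. The `Buffer` class, the `admit_recalls` function and the `reserve_bytes` parameter are hypothetical names, and the policy shown (defer recalls once they would eat into space reserved for incoming writes) is only one assumed way of stopping recalls from filling the buffer.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    """Hypothetical model of the disk buffer in front of tape."""
    capacity_bytes: int
    used_bytes: int = 0


def admit_recalls(buffer, queued_sizes, reserve_bytes):
    """Admit queued recalls in FIFO order while the reserve stays free.

    reserve_bytes is the space kept free for incoming writes (an assumed
    policy knob). Returns (admitted, deferred) lists of request sizes.
    """
    admitted, deferred = [], []
    used = buffer.used_bytes
    blocked = False
    for size in queued_sizes:
        free_after = buffer.capacity_bytes - (used + size)
        if not blocked and free_after >= reserve_bytes:
            admitted.append(size)
            used += size
        else:
            # Once one request is deferred, defer the rest to preserve order.
            blocked = True
            deferred.append(size)
    return admitted, deferred


if __name__ == "__main__":
    buf = Buffer(capacity_bytes=100, used_bytes=50)
    print(admit_recalls(buf, [20, 20, 5], reserve_bytes=20))
```

A real mechanism would also have to account for space being freed as recalled files are read out and evicted, which this sketch ignores.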

 

Unresolved problems

  • How to ensure that DAQ writes cannot be blocked by other writes or recalls filling the buffer.
  • We need to have a policy for what to do with zero-length files.
  • How to handle ALICE "probe" (copies a small file to a tape-backed directory, reads it back, deletes the file before it makes it to tape).
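
The probe sequence described in the last item can be reproduced for testing with a sketch like the following. It uses a local temporary directory as a stand-in for the tape-backed directory. The function name `run_probe`, the file name and the payload are hypothetical, and the real probe presumably goes through the storage protocol rather than the local filesystem.

```python
import tempfile
from pathlib import Path


def run_probe(directory: Path, payload: bytes = b"probe") -> bool:
    """Write a small file, read it back, then delete it.

    Returns True if the bytes read back match the bytes written.
    """
    probe_file = directory / "probe.tmp"
    probe_file.write_bytes(payload)                # 1. copy a small file in
    matched = probe_file.read_bytes() == payload   # 2. read it back
    probe_file.unlink()                            # 3. delete it straight away
    return matched


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_probe(Path(tmp)))
```

The point of interest for CTA is step 3: the delete arrives while an archive request for the same file may still be queued.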

Agenda

    • 14:00-14:15: ATLAS recall exercise and next tests (15m)
      • Julien e-mailed round a graph from the DDM team about the 2018 data recall. BNL was first to finish but we reached the fastest recall speed.
      • Next week is ATLAS Software Week.
      • Recall exercise will resume the week after (17 Feb): reprocessing of 2017 data.

      Action points to clarify

      • Make a new release of CTA (Monday OK?). This will add new features including:
        • cta-admin tapefile ls. Should we remove "archivefile ls"?
        • Any last-minute features that need to be in?
        • Procedure for Release Notes
      • Updates to DB?
      • EOS 4.6.8 has delete on close but not reporting (4.6.9?)
      • gfal bug fix, is new FTS deployed?
      • Everything to be deployed by end of next week in time for next recall tests
    • 14:15-14:30: Monitoring Status Update (15m)
      • Cédric is working on the tape summary feature and long-term logging of data in CTA
      • David will present on consolidation of monitoring dashboards
    • 14:30-14:45: Next milestones (15m)
      • 31 January milestone
        • Review of outstanding items and close the milestone
        • Review of issues raised by Julien
      • CTA v1.0-4 next week to complete recall test
      • Write Stress Test -> CTA v1.1 (24 February)
        • Backpressure
        • Do we need multi-hop for this? To be checked with Cédric Serfon.
        • XRootD v4.11.2 for TPC proxy delegation transfers
      • Integration test with ATLAS Online (2 March)
      • One week "cool off" period with no writes to CASTOR, to ensure all files have made it to tape and to check that no further data is being written
      • ATLAS goes into production (provisionally 16 March, need to check this does not clash with ATLAS TDAQ milestone tests)
    • 14:45-15:15: Technical discussion about Backpressure (30m)

      We need to take a look at all of the backpressure/rate-limiting mechanisms in play in all of the system components (Rucio/FTS/XRootD/EOS/CTA) and come up with a plan as to how we will implement a workable backpressure mechanism for retrieval.

      • Cédric will present the backpressure mechanism as it is implemented in the tapeserver.
      • Clarify what problems have already been solved and which are the open problems.
      • Discussion of solutions.
    • 15:15-15:20: Other unresolved problems (5m)

      List issues that still need to be resolved (not for discussion today)

      • Zero-length files