CTA deployment meeting

Europe/Zurich
600/R-001 (CERN)

Michael Davis (CERN)

CTA Software Issues

Repack

Several issues were encountered related to repack:

  • There was a critical bug which could take out the tape servers while scheduling (fixed)
  • Tape servers could not write to disk on the repack instance (configuration problem, fixed)
  • Limitations of XRootD SSS keys: Julien worked around this problem (how?)
  • The eosctarepack instance performs slowly. The problem has been traced as far as the disk I/O but is not yet fully understood (to be investigated)
  • Vlado is missing some essential functionality for Repack:
    • Skip bad areas on tape
    • Disable RAO (as RAO will create an ordering which includes the bad areas)
    • Dedicate a tape to a specific drive
  • It is not possible to change the priority of an existing request. Julien has an idea for a "téléski queue" that would let requests bypass our current "télécabine queue" model, in which no one can get out until the cabin reaches the top. (to be discussed)
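
The priority-change limitation in the last item can be illustrated with a small sketch. This is not CTA code: the class `RequestQueue`, its methods and the request IDs are hypothetical, and Python's `heapq` merely stands in for whatever structure the scheduler actually uses. It shows one common way to let an already queued request change priority (the old heap entry is invalidated and a new one is pushed). This is only one possible reading of the "téléski queue" idea, which is still to be discussed.

```python
import heapq
import itertools


class RequestQueue:
    """Hypothetical priority queue whose entries can be re-prioritised."""

    def __init__(self):
        self._heap = []                # entries: [priority, seq, request_id]
        self._entries = {}             # request_id -> live heap entry
        self._seq = itertools.count()  # tie-breaker: FIFO order within a priority

    def push(self, request_id, priority):
        # Lower number = served earlier. Re-pushing an ID replaces its old entry.
        if request_id in self._entries:
            self._entries[request_id][2] = None  # mark the old entry as stale
        entry = [priority, next(self._seq), request_id]
        self._entries[request_id] = entry
        heapq.heappush(self._heap, entry)

    def change_priority(self, request_id, priority):
        # Only requests that are still queued can be re-prioritised.
        if request_id not in self._entries:
            raise KeyError(request_id)
        self.push(request_id, priority)

    def pop(self):
        # Skip stale entries left behind by priority changes.
        while self._heap:
            _, _, request_id = heapq.heappop(self._heap)
            if request_id is not None:
                del self._entries[request_id]
                return request_id
        raise IndexError("pop from empty queue")


queue = RequestQueue()
for name in ("repack-A", "repack-B", "repack-C"):
    queue.push(name, priority=5)
queue.change_priority("repack-C", 1)    # repack-C jumps ahead of the others
order = [queue.pop() for _ in range(3)]
print(order)  # ['repack-C', 'repack-A', 'repack-B']
```

Requests with equal priority keep their arrival order, so a priority change moves only the request concerned.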

Scheduler

We have not yet agreed how tape drives should be shared between VOs in CTA.

  • Steve implemented the CASTOR scheduler logic and will prepare a document summarising it so that we all understand it
  • Cedric has reverse-engineered the CTA scheduler logic so we can see how it works in practice
  • By comparing these documents we can see the difference between what CASTOR does and what CTA does
  • Ask Julien to document his ideas about how drives should be shared between VOs
  • See also discussion on issue #817
  • Once we have this information we can agree on how scheduling should work in CTA and what work needs to be done

Object Store

  • cta-admin showqueues does not show up-to-date information
  • Cedric created the create-missing-repack-index tool to repair the Object Store
  • Cedric will document the current functionality and behaviour of the Object Store
  • When Julien is back, we need to discuss his ideas about how to manage the Object Store
  • Discussion to be had: "How do we see the evolution of the set of functions provided by the Object Store over the next two years?"

CTA Operational Procedures

  • Vlado will prepare a presentation to explain the supply pool logic, the alert messages we receive and other operational issues which are important for everyone to understand. Some e-mails can be disabled in SNOW; others, such as TSMOD alerts, need to be interpreted.
  • Vlado proposes to have Aurelian spend part of his time on CTA operations.
  • Ask Julien to send the Grafana monitoring alerts to a channel other than CTA_priv.
  • Steve and Vlado to send Michael links to any documentation not already in the CTA tapeoperations mkdocs site. Michael to update the site so that all existing operations documentation can be found from a single entry point.
  • Discussion to be had when Julien is back: what changes does he want to make to operational procedures compared to how things were done in CASTOR?
  • Monday section meeting: discuss how we will organise ourselves for CTA Operations, e.g. a rota and a weekly(?) Operations meeting.

Getting ALICE into Production

  • Costin's presentation on the read tests and Steve's notes are here. Need to check with Costin how the reprocessing campaign is going.
  • Vova reported on the write test (42K files, 62 TB). All failures understood except for one file. Vova will send us the filename so we can investigate.

AOB

  • Implementing RAO for LTO has been pushed back due to more pressing operational concerns, but we should aim to get this functionality into CTA by the end of the summer.
    • 14:00-14:10
      CTA Software Issues 10m

      Review of software issues (bugs, architectural problems) which have emerged since going to production:

      • Repack
      • Scheduling problems in general (drive fair share, drive dedications, ...)
      • ObjectStore
      • Other issues?
    • 14:10-14:20
      CTA Operational Procedures 10m

      Preliminary discussion. We need to ask Julien what he plans in terms of CTA operational procedures, so we will discuss in detail when he is back from holiday.

      • Supply pool logic
      • Where to find CASTOR operational procedures
      • Updating CASTOR operational procedures for CTA
      • SNOW tickets
      • Rota for CTA Operations
    • 14:20-14:30
      Getting ALICE into Production 10m
      • Status of write tests
      • To Do list is here
    • 14:30-14:35
      AOB 5m
      • RAO for LTO