CTA deployment meeting

Europe/Zurich
600/R-001 (CERN)

Michael Davis (CERN)

Monitoring

  • The current monitoring/Grafana issues and the to-do list are summarised here.
  • David's summary of the 43 metrics which are used to determine CASTOR Service Availability is here.
  • David: contact Germán to ask his opinion on which metrics we should use to determine when the tape service is available/degraded/unavailable.

Review of RAIN/converter issues from last week

  • RAIN performance on SSDs is fine. To get a complete picture for HDDs, we need to repeat the tests on good hardware.
  • The d10::t1 reporting on the RAIN layout is normal behaviour for EOS and has nothing to do with recalls from tape. It does not affect our query prepare reporting of disk/tape residency, as we check only the XRDSFS_HASBKUP and XRDSFS_OFFLINE bits and do not care about the number of copies. Conclusion: this is not a problem.
  • fs statistics: always reports the raw space.
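
A minimal sketch of why the number of copies does not matter for residency reporting: only two flag bits are inspected. The bit values, the `Residency` enum and the `residency` function are hypothetical and for illustration only; real code should take XRDSFS_HASBKUP and XRDSFS_OFFLINE from the XRootD headers. HASBKUP is read here as "a copy exists on tape" and OFFLINE as "no copy is available on disk".

```python
from enum import Enum

# Placeholder bit values (assumption): real code should use the
# XRDSFS_HASBKUP and XRDSFS_OFFLINE constants from the XRootD headers.
HASBKUP = 0x1  # a backup copy exists (file is on tape)
OFFLINE = 0x2  # file is not available on disk


class Residency(Enum):
    DISK_ONLY = "on disk, not on tape"
    DISK_AND_TAPE = "on disk and on tape"
    TAPE_ONLY = "on tape, not on disk"
    NONE = "neither on disk nor on tape"


def residency(flags: int) -> Residency:
    """Classify a file from the two flag bits alone.

    The number of disk replicas or RAIN stripes plays no part here,
    and any other bits set in `flags` are ignored.
    """
    on_tape = bool(flags & HASBKUP)
    on_disk = not (flags & OFFLINE)
    if on_disk:
        return Residency.DISK_AND_TAPE if on_tape else Residency.DISK_ONLY
    return Residency.TAPE_ONLY if on_tape else Residency.NONE
```

For example, `residency(HASBKUP | OFFLINE)` classifies a file as tape-only whatever its disk layout was before eviction.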

Problems to be Solved

  • Unintentional deletion of files. The case where the converter triggers a delete workflow when converting a file was solved for our specific case, but further investigations by Steve have shown that a misconfigured system could still set the delete workflow on directories in /proc. See issue #781.
  • There may still be other corner cases of unintentional deletion that we haven't spotted yet or which may occur in future. The most important thing is to be able to spot when this happens and to be able to recover unintentionally deleted files. See section on file deletion below.
  • For the ALICERO tests, Julien will recall files to SSD (single replica) then convert them to RAIN on HDD. He has a temporary solution for this using ActiveMQ, but the long-term solution is to have a conversion workflow which is triggered automatically by the MGM. Julien: create a ticket for Andreas to specify under what conditions the converter should automatically start. (The details of what we need are documented in CodiMD.)
  • EOS-4200: File metadata queries using gRPC sometimes return an empty result
  • EOS-4211: Preserve EOS fileID during conversion when we are only changing the space
  • EOS-4221: List failed converter jobs

Unanswered Questions

  • Will Mihai's changes to the converter change how we monitor what conversions are going on?
  • We don't yet have a complete picture of which EOS operations can cause the disk file ID to change. Will group rebalancing change the disk file ID?

EOSCTAALICE reprocessing campaign

  • Disk layout for the recalls will be single copy layout on SSD and RAIN layout on HDD. Files will be recalled to SSD and then converted to HDD.
  • Julien is configuring the probe with Costin in order to test ALICE authentication. As setting up ALICE authentication is somewhat cryptic, Julien will document how to do it for the future.
  • ALICE have not yet told us what data they want to recall or the revised schedule for the recall campaign.
  • There are tight time constraints on what is possible due to the decommissioning of the hardware.

Putting EOSCTAATLAS into production

  • Maria is coordinating SFO/EOS stress tests to start on 25 May (without CTA)
  • Status of FTS "file safely on tape" feature: A bug-fix release of FTS for HTTP TPC is due this week. Mihai will start working on the m-bit next week.
  • We need to finalise the disk layout based on the experiences with ALICE/RAIN
  • What else do we need to consider before ATLAS can go to production?

File Lifecycle/File Deletion Workflow/Recovering Deleted Data

  • The ALICE tests have highlighted the possibility of data being inadvertently deleted from tape. We need to test that we can recover data which has been accidentally deleted.
  • Cédric: document the file deletion workflow (work in progress on CodiMD, final home will be in eoscta.docs) and the procedure for recovering files which are unintentionally deleted from CTA.
  • The "is_deleted" and "superseded" mechanisms are two contradictory and unfinished methods of tracking deleted data. This needs to be rationalised so that we have one clear method of tracking deleted files.
  • Comments from Julien: "superseded" is there for repacked files but is unfinished. It is unusable because there is no tool to revert superseded files. Proposal: get rid of "superseded" and replace it with an "is_deleted" flag on tf files. When the tape is reclaimed, double-check in the EOS namespace that the file has been deleted before finally deleting it from tape.
  • Cédric: review the rest of the file lifecycle documentation. The workflows for Archive, Retrieve and Repack are reasonably complete but may be out-of-date in some places.
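
The double-check proposed above for tape reclaim could look roughly like the sketch below. Everything in it is hypothetical and only illustrates the idea: the `TapeFile` record, the `check_reclaim` function and the set of IDs still present in the EOS namespace are stand-ins, not the CTA catalogue schema or any existing tool.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class TapeFile:
    """Hypothetical record for one file stored on a tape."""
    archive_file_id: int
    is_deleted: bool


def check_reclaim(tape_files: Iterable[TapeFile],
                  live_ids: Set[int]) -> Tuple[bool, List[TapeFile]]:
    """Decide whether a tape may be reclaimed.

    `live_ids` stands in for a lookup of the IDs that still exist in the
    EOS namespace (assumption). A file blocks the reclaim if it is not
    flagged as deleted, or if it is flagged but still present in EOS.
    """
    blockers = [
        tf for tf in tape_files
        if not tf.is_deleted or tf.archive_file_id in live_ids
    ]
    return (not blockers, blockers)
```

Returning the blocking files rather than a bare yes/no means that a file flagged as deleted but still present in the namespace is reported for investigation instead of being silently removed from tape.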

AOB

  • X-Section communication: Michael: prepare a 10-minute presentation on the technical details of the CTA Archive workflow.
    • 14:00 - 14:10
      Monitoring Update 10m
      • Temporary Grafana v6.5 instance has been deployed
      • The dashboard still does not work in v6.7 with the modified version of the bar charts plugin. We would like to avoid using hacked/unsupported versions of official plugins in future.
      • Two possible ways forward: (a) use Flux, (b) upgrade Grafana Bar Chart to v1.8 which has the feature we need.
      • Update on the criteria for the ServiceNow availability screen. What are the consequences of the upgrade to the new Service Portal?
    • 14:10 - 14:20
      Review of RAIN/converter issues from last week 10m
      • Which problems still need to be investigated or solved?
      • Delete workflow
    • 14:20 - 14:30
      EOSCTAALICE recall campaign 10m
      • Have we settled on a final disk layout?
      • Do we know yet what data ALICE want to recall?
      • Revised schedule for this campaign. (Implications of hardware end-of-life).
    • 14:30 - 14:40
      Putting EOSCTAATLAS into production 10m
    • 14:40 - 14:45
      AOB 5m
      • X-Section communication
      • Fellow interviews tomorrow:

      10.00-10.45: Rajula Vineet Reddy
      11.30-12.15: Javier López-Gómez
      12.45-13.30: Paweł Gomulak
      14.00-14.45: Diogo Guerra (not confirmed)