CTA deployment meeting

Europe/Zurich
600/R-001 (CERN)

Michael Davis (CERN)

EOSCTA Post-Migration Issues

LHCb

  • Failing XRootD TPC transfers to RAL are causing noise which makes it difficult for Julien to evaluate his solution for evict after HTTP prepare.
  • Oliver will ask Chris if he can switch off checksum validation temporarily, at least until Julien has checked that his solution works satisfactorily.

NA62

  • Vova: check if !d works correctly on directories with no CTA workflow enabled.
  • Vova: follow up with EOS devs to ensure that p and !d permissions work correctly when there is no w permission.
  • The problem with transfers from T2s (INC2748013) did not occur in CASTOR because CASTOR was more permissive (its firewall was open to sites outside LHCONE). The user is happy to write the code to transfer the files in 2 hops via EOS PUBLIC, but says there is not enough storage. Solution: ask how much storage they need for this and ask the EOS team to provide space for these intermediate hops. Note: this use case will no doubt come up again with other experiments.
  • Querying whether a file is on tape using the gfal2 Python API: resolved by upgrading to the latest packages.
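
For reference, the "is this file on tape?" query can be sketched roughly as follows. This is a minimal sketch, not the code NA62 actually runs. It assumes the endpoint exposes the file's residency through the `user.status` extended attribute that gfal2 reports (values such as `ONLINE`, `NEARLINE`, `ONLINE_AND_NEARLINE`). The helper names and the URL in the usage note are hypothetical.

```python
# Assumption: residency is reported through the "user.status" extended
# attribute, using the status strings listed below.
TAPE_STATES = {"NEARLINE", "ONLINE_AND_NEARLINE"}
DISK_STATES = {"ONLINE", "ONLINE_AND_NEARLINE"}


def classify_status(status):
    """Return (on_disk, on_tape) for a status string (hypothetical helper)."""
    # Some back-ends pad the value with whitespace or NUL bytes.
    cleaned = status.strip(" \t\r\n\x00").upper()
    return cleaned in DISK_STATES, cleaned in TAPE_STATES


def file_is_on_tape(url):
    """Ask the storage endpoint whether `url` has a tape copy."""
    # Imported lazily so that classify_status() works without gfal2 installed.
    import gfal2

    ctx = gfal2.creat_context()
    status = ctx.getxattr(url, "user.status")
    return classify_status(status)[1]
```

Usage would look like `file_is_on_tape("root://<endpoint>//<path>")`, where the endpoint and path are placeholders. Any other status value (for example `UNKNOWN` or `LOST`) is treated here as "not on tape", which may be too coarse for production use.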

CTA PUBLIC Migration from CASTOR

DUNE

  • Ready to migrate after Easter.
  • Vlado repacked the 6 tapes.
  • Steve and Vova will write OTGs for CASTOR/CTA.

n_TOF

  • Vova: check up on the exact use case for "nsfind"/"eos find". We need to know roughly how many files are in the directories scanned and how frequently this operation will be performed.
  • We don't have a solution to the "early-file review" problem, where n_TOF examine the detector signals of DAQ files (like in offline analysis). Ideally n_TOF would like a copy of archived raw data to stay around in the CTA spinner space, but this is not possible without development effort. n_TOF have two alternative solutions, which they plan to evaluate later this year:
    • Modify FileDisplay to re-construct files directly from the crates, providing a really early aggregated overview of acquired data.
    • Implement FTS back-end in RawFileMerger (instead of Xroot) to upload the file in 2 places with a single stream.
  • Sylvain said, "We can schedule the evaluation for the first solution sooner, it'll give a better idea of the workload on our side."
  • FTS multi-hop was also proposed to them but they don't want to have to manage clean-up of the intermediate file when the early file review is finished.
  • Given this blocker, it seems unlikely that it will be possible to migrate n_TOF before their physics schedule starts in the summer.
  • Michael: arrange a meeting with n_TOF representatives to discuss the likely schedule and ensure that they are aware of the consequences of not migrating (recalls will be slow due to limited tape drives).
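
The clean-up concern n_TOF raised about multi-hop transfers can be illustrated with a short plain-Python sketch. It does not use the FTS API. It only shows that every hop between the source and the final destination leaves an intermediate copy that somebody has to remove once it is no longer needed. The function name and the example locations are hypothetical.

```python
def plan_hops(locations):
    """Split a transfer through `locations` into consecutive hops.

    `locations` is an ordered list: source, intermediate stop(s), destination.
    Returns (hops, cleanup): `hops` is the list of (source, destination)
    pairs to run in order, and `cleanup` lists the intermediate copies left
    behind after the last hop has finished.
    """
    if len(locations) < 2:
        raise ValueError("need at least a source and a destination")
    hops = list(zip(locations, locations[1:]))
    cleanup = list(locations[1:-1])
    return hops, cleanup


# Hypothetical example: DAQ -> intermediate disk area -> tape.
hops, cleanup = plan_hops(["daq:/raw/run1", "disk:/tmp/run1", "tape:/raw/run1"])
```

In this example `cleanup` contains only the intermediate disk copy, which is the file n_TOF would have to delete once the early-file review is finished.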

Status Updates: Experiments

  • Vova has pinged COMPASS to ask if they can start testing recalls to the spinner space.
  • Vova will contact CLIC in April.
  • AMS are blocked on authentication problems. To be followed up.
  • Michael will respond to RQF1772720 and follow up on ILC. Migration of ILC depends on the repack of public_user.

Status Updates: Backup

  • Ideally we want to migrate CASTOR backup use cases by September, as Steve will have to merge the CASTOR disk pools and Giuseppe advised not to merge backup disk pools with the other ones.
  • Repack/migration of data from CASTOR to CTA is not a blocker as we could migrate the users and leave most of the files in CASTOR until they expire. In principle only a small amount of data actually has to be migrated.
  • Encryption is a prerequisite, however.

Status Updates: Other Use Cases

  • ~30 tapes with LEP tapepool data still to be repacked.
  • Migration of LEP tapepool can be scheduled when Vlado is back after Easter holidays.

CTA PUBLIC New Use Cases

  • References to FTS Pilot in KB articles should be replaced with FTS Public (Michael/Vlado)
  • We will pass Vlado's Codi document to Spacal and FASER and let them test the instructions. We will then edit it according to their feedback and publish the final version in the KB.

RAL

  • Michael will follow up on Alastair's question about NA62 and FTS multi-hop.
  • We would like to have a schedule from RAL on the critical milestones and where they will need help from us.

CTA Software

  • Michael will start performance tests on Postgres, initially using the instance that the DB team provided to Julien. When we reach the limit of performance we will set up our own test instance which we can tune ourselves.
  • We will test queue sizes of at least 100 million entries.
    • 14:00-14:10
      EOSCTA Post-Migration Issues 10m

      LHCb

      • XRootD TPC transfers to RAL failing (checksum issue at RAL endpoint): see separate RAL item on agenda below
      • Eviction for stage+transfer with HTTP
      • Add more tape drives
      • DAQ integration and testing, including FTS Archive Monitoring
      • Mid-April: LHCb will stage small amounts of data for validation prior to reprocessing, then start pre-staging for their reprocessing campaign which will begin in May.

      NA62

      • 3 use cases represented by 3 service accounts (na62prod: all permissions, na62e001: write only, <new account>: recall only)
      • The p (PREPARE) permission doesn't work without w (workaround in place until fixed in EOS).
      • The !d permission is not preventing DELETE.
      • INC2748013: UK T2 sites are not able to transfer to CTA (outside firewall). (Why did this work in CASTOR?)
      • Requirement to query if file is on tape using gfal2 Python API
    • 14:10-14:20
      CTA PUBLIC Migration from CASTOR 10m

      DUNE

      • #213 and #293 Migrate DUNE from CASTOR to CTA
      • Tue 6 April: CASTOR /neutplatform to recall-only
      • Thu 8 April: disable access to CASTOR DUNE, migrate data
      • Mon 12 April: in production
      • OTGs
      • Data challenge 3Q2021

      n_TOF

      • See issue #143
      • nsfind: what is the use case exactly?
      • Recall to spinner space immediately after archiving
      • Any other blockers?
      • Set up a meeting to set migration schedule

      Status Updates: Experiments

      • COMPASS (spinner space) #69
      • AMS: auth issues #82
      • NA61/SHINE: setting up their DAQ system + storage. No schedule for testing yet.
      • TOTEM #277. 24 tapes from public_user need to be repacked. 0.5 PB of files in EOS need a tape copy. VM with SLC5 needs to access data on tape.
      • CLIC (Dirac) - will set up a meeting in April
      • ILC - see RQF1772720

      Status Updates: Backup

      • Status of encryption for CASTOR backup use cases
      • /afsmigration #282.

      Status Updates: Other Use Cases

      • Repack of LEP-era data. Half-way through the 247 tapes that need to be repacked (engineering and nomad_delayed) #240
      • UNOSAT: data in CASTOR is not needed. Set up OTG and arrange to delete unwanted data (Michael)
    • 14:20-14:30
      CTA PUBLIC New Use Cases 10m
      • Review Vlado's Codi document (future KB article) on how to store data in CTA
      • Policy for small experiments who ask for tape storage. What are the criteria to decide if tape is the appropriate solution?
      • #271 Spacal
      • #274 FASER experiment on CTA
      • AMBER (replacement for COMPASS at NA66)

      Discussion points for Codi doc:

      • No explanations on how to transfer files to/from local storage. Do we need this?
      • No details of FTS multi-hop, just a pointer to FTS docs.
    • 14:30-14:40
      RAL 10m
      • Julien should not be involved in debugging XRootD TPC issues at RAL (or T1s in general)
      • Disable checksum checking until RAL TPC issue is fixed
      • Alastair's request about FTS multi-hop?
      • RAL CASTOR to CTA migration preparations
      • Codi doc on CTA deployment at RAL
    • 14:40-14:50
      CTA Software 10m
      • #976 Move storage of persistent drive states to the DB
      • Review of tickets tagged Tape Server
    • 14:50-14:55
      AOB 5m