CTA deployment meeting

Europe/Zurich
600/R-001 (CERN)

Michael Davis (CERN)

Getting ALICE into Production

  • Cleaning up: 6 tapes to repack, Giuseppe is cleaning up ALICE /user namespace
  • We will deliver a 4.8 PB instance to ALICE. This will consist of 2.4 PB from ALICEDAQ, 1.3 PB from T0ALICE (CASTOR) and 1.1 PB "on loan" so that we can test the instance during the next ALICE recall exercise. The 1.1 PB will later have to be accounted from elsewhere in the ALICE pledge (to be decided) or removed.
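As a quick cross-check of the capacity figures above, the three contributions can be summed. This is only a sketch: the variable names are illustrative, and `without_loan_pb` is simply the total minus the "on loan" part, not an agreed pledge figure.

```python
# Capacity contributions quoted in the minutes, in PB
components_pb = {
    "ALICEDAQ": 2.4,
    "T0ALICE (CASTOR)": 1.3,
    "on loan": 1.1,
}

# Total size of the instance to be delivered to ALICE
total_pb = round(sum(components_pb.values()), 1)

# Capacity left if the loaned part is removed rather than re-accounted
without_loan_pb = round(total_pb - components_pb["on loan"], 1)

print(f"total: {total_pb} PB, without loan: {without_loan_pb} PB")
```

The sum reproduces the 4.8 PB quoted above.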

Actions

  • Michael: re-import the ALICE tapepools to check everything is OK
  • Eric: do OTGs for CASTOR
  • Julien: do OTG for CTA; EOSCTA ALICE will be put into production on Monday 12 October #70
  • Julien: Configure maximum file size for transfers to CTA as 20 GB #70
  • Michael: Migrate ALICE to CTA w/c 5 October
  • Julien: Garbage collector test w/c 5 October (mid/end of the week) #71

ATLAS Cleanup

  • Luca's files have been deleted from EOSCTA ATLAS (26,991 files deleted in parallel at 175 Hz; 54 TB total)
  • grid_atlas and broken tapes will be imported after ALICE migration
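The deletion figures above imply a rough duration and average file size. The sketch below assumes that 175 Hz was the sustained average rate and that TB means decimal terabytes; neither is stated in the minutes.

```python
# Figures quoted in the minutes for the EOSCTA ATLAS cleanup
files_deleted = 26_991
deletion_rate_hz = 175   # assumed to be the sustained average rate
total_tb = 54            # assumed decimal units (1 TB = 1000 GB)

# Time needed to delete all files at the quoted rate
duration_s = files_deleted / deletion_rate_hz

# Average size of a deleted file
avg_file_gb = total_tb * 1000 / files_deleted

print(f"duration: {duration_s:.0f} s, average file size: {avg_file_gb:.2f} GB")
```

Under these assumptions the deletion took about two and a half minutes, and the files averaged about 2 GB each.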

Getting LHCb into Production

  • Christophe says they will hop files from DAQ to EOS to EOSCTA and distribute from EOS to T1s, so the archival workflow looks OK. However, he has not yet decided how the intermediate hop will be implemented (Dirac or FTS). Either way, LHCb will be responsible for cleaning up intermediate files.
  • m-bit check will use the same mechanism as CMS and ATLAS (FTS check file on tape)
  • Oliver pointed out that the T1s will support HTTP TPC with tokens, but not all T1s (StoRM) will support delegation
  • RAL say they will support HTTP in Echo and are "hopeful of it working without many caveats by Christmas." However, LHCb will also need to do transfers from CTA T0 to CASTOR T1, and CASTOR does not support HTTP. We do not yet have a solution for this.
  • We will cross that bridge when we come to it: the first step is to test HTTP TPC with X.509 delegation which should be possible with the T1s running dCache
  • In the first instance, we will ask the EOS team to test if they can transfer from EOS LHCb to one of the LHCb dCache sites.
  • In the meantime we can ramp up testing of the DAQ workflow. We will propose a write test (200 TB) orchestrated by Dirac, followed by a test from DAQ. Proposed date for these tests: 12/19 October.

Getting PUBLIC into Production

NA62

  • NA62 online tests: waiting for FTS to have "file safely on tape" feature in production
  • NA62 offline tests: Vova has provided Barbara with changes to T0M to allow it to work with CTA, waiting for her to test her workflow
  • NA62 dual class copy is about 50% complete
  • Repack of NA62 files from public_user→na62 not started yet
  • Julien: set up the EOSCTAPPSRO instance for the NA62 migration test (after ALICE migration tests have finished and dual class copy is complete, w/c 5 October 2020) #72

COMPASS

  • COMPASS DAQ tests will take place in November/December
  • COMPASS use CDR+xrdcp for DAQ workflow (slight difference from NA62 who use CDR+FTS)
  • COMPASS do not use grid certificates; they authenticate with Kerberos. Note that krb5 cannot be delegated to FTS, so this seems to preclude using FTS for COMPASS data taking. They do not plan to use FTS for data taking, so this is not a blocker.
  • Julien: prepare an EOSCTA instance for COMPASS tests #69

RAO

  • Cedric reports that RAO testing is proceeding without problems and should be concluded by next week. All being well it will be merged to master for the next CTA release.

AOB

  • CTA CI stress tests will be configured as a new GitLab runner which can be triggered manually, to allow developers to run the stress test directly and make a release.
  • Julien: document how to tag and make a release #73
  • Student interns: we need multiple project descriptions for Danish + ISIMA student interns next year. Send proposals to Oliver.
  • Data privacy: we need a policy for how *_user tapepools will be handled in CTA instances. During the ATLAS migration, Giuseppe moved files in /user to /atlas/castoruser and is doing the same for ALICE. This has the unintended effect that the experiment data mover account can see all user files.
  • As a first approach to deal with this, we can move the user accounts further up the tree in the namespace and prohibit access to everyone. Access can be requested on a case-by-case basis through the usual channels. #74
    • 14:00 - 14:10
      Getting ALICE into Production 10m
      • See Putting EOSCTAALICE into Production
      • See #55 ALICE instance in production

      • Test migration complete, 7 tapes to repack

      • Giuseppe cleaning up /user directories, Michael will re-test migration of r_alice_user when this is done
      • Wed 30 Sept. eosctaalicero instance will be switched off and retired
      • Thu 1 Oct. Switch off write access to CASTOR ALICE
      • Mon 5 Oct. Block all access to CASTOR ALICE namespace, begin migration to CTA
      • Mon 12 Oct. EOSCTA ALICE in production
      • CASTOR T0ALICE will be returned to the ALICE pledge

      TO DO

      1. CASTOR & CTA OTGs. Eric will take care of CASTOR ones. Julien: CTA OTG.
      2. Decision about how the extra 1.3 PB should be accounted
      3. Garbage collector test after migration (w/c 5 Oct.)
      4. CTA Frontend maximum file size (20 GB?)
      5. Wipe eosctaalice, re-test migration of r_alice_user
    • 14:10 - 14:20
      ATLAS Cleanup 10m
      • Delete Luca's files
      • Import of grid_atlas and broken tapes (after ALICE migration)
      • EOSATLAS → EOSCTAATLAS throughput test (28/9). Coordinated by Maria.
    • 14:20 - 14:30
      Getting LHCb into Production 10m

      Meeting with Christophe Haen 23/09/2020

      • Thanks to help from Julien, he can do all the basic T0 workflows
      • Archival workflow, he will do an intermediate hop through EOS LHCb and will distribute to T1s from there using Dirac (Dirac should clean up files after export).
      • Retrieve workflow, problematic one is EOS CTA ←→ T1 as he does not want a matrix of protocols between endpoints. But looks like HTTP TPC will be supported everywhere. See notes on RAL below.
      • Can we set up an HTTP TPC gateway on EOSCTAPPS LHCB to test TPC transfers?

      RAL/ECHO

      ECHO is an XRootD site. RAL say: "We are also aiming to primarily support HTTP TPC. TPC for Echo has been extensively tested and the summary is that it works but with a lot of caveats currently. EOS to/from Echo hasn’t been tested yet."

      "In the UK there are a few developers working to fix the various issues with the XrdCeph plugin, so we would be hopeful of it working without many caveats by Christmas."

      RAL/CASTOR

      What about transfers to CASTOR T1?

      Schedule for commissioning tests (Oct./Nov.)

      • TPC test
      • DAQ test
      • m-bit integration
      • Tests at scale
      • ??
    • 14:30 - 14:40
      Getting PUBLIC into Production 10m

      NA62

      • NA62 online integration (all done except m-bit test)
      • NA62 offline integration (T0M with xrdfs done, FTS to do)
      • NA62 migration test onto EOSCTA PPS (w/c 5 October, liberated after ALICE tests are finished) will include migration of public_user as files are not yet separated
      • NA62 recall tests on EOSCTA PPS immediately after migration test

      NA62 Repack

      • NA62 dual copy (na62dual fileclass). Last week: repack ongoing, out of ~330 tapes, 100 are done. This will take another 3 weeks or so. As NA62 is recalling files on the CASTOR PUBLIC stager, this automatically creates 2nd copy as well.
      • NA62 single copy (public_userna62 fileclass), /grid namespace: file classes assigned, tapes identified. Will be repacked at the same time as public_user.
      • As we have to repack public_user in any case for NA62, we will try to clean up the /user part of the namespace beforehand as much as possible. See #878.

      COMPASS

      • Meeting with COMPASS 23/09/2020
      • They use CDR for data taking (like NA62)
      • DAQ tests in November on CTA
    • 14:40 - 14:50
      RAO 10m
      • Cedric: Status update
    • 14:50 - 14:55
      AOB 5m
      • CTA releases: how can developers do the stress test?
      • Danish intern spring 2021
      • ISIMA students from April 2021 to September 2021