CTA deployment meeting

Europe/Zurich
600/R-001 (CERN)

Michael Davis (CERN)

CTA Migration

We did a quick recap of the To Do lists and CTA Status Update. We are making steady progress. Details are on the individual tickets. Main points for this week:

  • Julien will set up the gridFTP gateway for CAST in the EOS hostgroup to make use of their Puppet profiles.
  • Vova will test CAST and if all goes well we will move their production endpoint to CTA this week. There is no urgency to migrate their data from CASTOR as they have no offline workflow and no plans to do recalls, so this can even be done after they are in production in CTA.
  • Several SMEs need the spinner space for their next tests, but this week CAST has priority.
  • LHCb HTTP tests require some configuration fixes on the EOS side. As long as we keep making steady progress, there is no urgent time pressure: LHCb won't be ready to do final commissioning tests until the end of January.

Database Discussion

Summary by Cédric.

Here is a summary of the meeting Steve and I had this morning (18/01/2021) with Sebastien Masson (DB).

1. Preparation for CTA schema 4.0

I have asked how I can run my migration scripts against a database that is a copy of the production one, to ensure that there are no major issues with these scripts.

Sebastien proposed doing this by creating an empty database schema and doing a data pump over the network. Our database data is around 100 GB, so this is easily doable.

To do that, we will use the castorint database, and a PL/SQL command has to be run to perform the task (Sebastien will give me the instructions).
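
For illustration only, a network data pump in PL/SQL could look roughly like the sketch below; the database link and schema names are placeholders, and the real procedure is whatever instructions Sebastien provides.

    -- Hypothetical sketch: copy the production schema over a database link.
    -- CTA_PROD_LINK, CTA_PROD and CTA_TEST are placeholder names.
    DECLARE
      h     NUMBER;
      state VARCHAR2(30);
    BEGIN
      h := DBMS_DATAPUMP.OPEN(operation   => 'IMPORT',
                              job_mode    => 'SCHEMA',
                              remote_link => 'CTA_PROD_LINK');
      DBMS_DATAPUMP.METADATA_FILTER(h, 'SCHEMA_EXPR', 'IN (''CTA_PROD'')');
      DBMS_DATAPUMP.METADATA_REMAP(h, 'REMAP_SCHEMA', 'CTA_PROD', 'CTA_TEST');
      DBMS_DATAPUMP.START_JOB(h);
      DBMS_DATAPUMP.WAIT_FOR_JOB(h, state);
    END;
    /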

I have to request a new schema for castorint on the CERN resource portal. The question is which service account should own the new schema.

Update during the meeting: Julien will create the service account and will set up a private area to keep track of service account passwords. In the meantime, Cédric can create the new schema on his own account and transfer it to the service account later.

2. Slowness of the query used to list the content of a tape

The query used to list the files contained in a tape (old_getTapeContent.sql) uses a hint to tell Oracle to use a specific index.
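
For context, an index hint of this kind looks roughly like the following; the column and index names here are illustrative assumptions, not the actual contents of old_getTapeContent.sql.

    -- Illustrative only: column and index names are assumptions.
    SELECT /*+ INDEX(TAPE_FILE TAPE_FILE_VID_IDX) */
           ARCHIVE_FILE_ID, FSEQ
      FROM TAPE_FILE
     WHERE VID = :tape_vid
     ORDER BY FSEQ;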

After some discussion about the problem, it appears that we may have to partition the TAPE_FILE table to have, for example, one partition per tape. But we want to be sure that this works. Steve pointed out that there are two different use cases for listing files:

  • Operators are interested in listing tape content: which files are located on this tape? (repack, tape verification)
  • Users are not interested in tapes, only in files: where are the files I want? (file retrieval)

We have to be sure that this solution works for both of these use cases.
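
In query terms, the two access patterns would look roughly like this (column names are assumptions):

    -- Operator use case: everything on one tape (would map to a single partition).
    SELECT * FROM TAPE_FILE WHERE VID = :tape_vid;

    -- User use case: locate the tape copies of a given file (crosses partitions,
    -- so it needs a global index or an efficient cross-partition lookup).
    SELECT * FROM TAPE_FILE WHERE ARCHIVE_FILE_ID = :archive_file_id;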

Sebastien said that partitioning an existing table with many rows in it takes time, but this can be done in online mode: the database can continue running while the operation is in progress. He nevertheless suggested that we do it offline: we stop production while the partitioning process runs.

New indexes can be created after we have partitioned the TAPE_FILE table. They can be local to a partition or global to the table.
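
As a rough sketch of what this could look like in Oracle (the partition key, the AUTOMATIC list partitioning and all names are assumptions, not the agreed design):

    -- Hypothetical sketch: convert TAPE_FILE to one list partition per tape VID,
    -- online, then add a local index. Names are placeholders.
    ALTER TABLE TAPE_FILE MODIFY
      PARTITION BY LIST (VID) AUTOMATIC
      (PARTITION P_INIT VALUES ('NONE'))
      ONLINE;

    CREATE INDEX TAPE_FILE_VID_FSEQ_I ON TAPE_FILE (VID, FSEQ) LOCAL;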

According to Sebastien, the partitioning concept also exists in PostgreSQL, but it might not be implemented or used in the same way as in Oracle.

To sum up, the slowness of the query used to list the content of a tape needs more investigation before we can solve it properly.

3. Other business

Sebastien spotted that some tables on the CTA production database have the PARALLEL option set (I found out that these are the TAPE_FILE and ARCHIVE_FILE tables). We will discuss this during the CTA development meeting this afternoon.

This is an artefact of the migration process. There is a script to clean up after the migration which removes the PARALLEL option, but this script was not executed after the CMS migration as there were several other tapes that had to be migrated immediately afterwards.
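
For reference, the parallel degree can be checked and reset with something like the following; the post-migration cleanup script remains the authoritative procedure.

    -- Check which tables have a parallel degree set (Oracle pads DEGREE with spaces).
    SELECT table_name, degree
      FROM user_tables
     WHERE TRIM(degree) NOT IN ('1', 'DEFAULT');

    -- Reset the two affected tables to serial execution.
    ALTER TABLE TAPE_FILE NOPARALLEL;
    ALTER TABLE ARCHIVE_FILE NOPARALLEL;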

CTA Tape Server Configuration

The CASTOR disk attached to tape servers is used only for repack. If we have FSTs running on tape server machines, they can be used for repack or for the retrieve space, but should not be used for archive space, to avoid the risk of interrupting data taking if we need to reboot a tape server.

Our current hardware consists of 64 machines, each with 32 TB SSDs = 2 PB total. Bandwidth is 2.5 GB/s × 64 = 160 GB/s total. We do not anticipate needing more than this in the near future. Vlado will therefore order 100 new tape servers without SSDs. We have the option of adding them (NVMe devices) later if we decide it's necessary.

    • 16:00 → 16:10
      Getting LHCb into Production 10m

      Schedule

      • w/c 18 Jan: continue HTTP integration and testing (FTS multi-hop with QoS change+transfer, orchestrated by Dirac). See #209 Test HTTP TPC to CTA for archival workflow
      • (Jan?) DAQ test
      • SARA: tested migration to new dCache last week, should be ready by end of Jan
      • IN2P3: needs to fix some dCache config issues, should be ready by end of Jan
      • (Feb) Publish OTGs
      • CASTOR LHCb to recall-only mode
      • CASTOR LHCb disabled
      • Migrate to CTA
      • EOSCTA LHCb in production
      • (Mar?) Data flow tests
    • 16:10 → 16:20
      Getting PUBLIC into Production 10m

      Ongoing Tests (Vova)

      • AMS: waiting for Baosong
      • CAST: set up gridFTP gateway and test
      • COMPASS: Recall testing after the spinner space is configured.
      • DUNE: see #213
      • NA61/SHINE: will they use DTO+FTS for their DAQ? Progress on integration work?
      • NA62: Recall tests completed. Re-test with spinner space, see #72.
      • n_TOF: test offline workflow with spinner space

      TO DO

      • CAST tests and migration to CTA, see #184 CAST tests on EOSCTAPUBLIC PPS and #198 CAST issues transferring to EOSCTAPUBLICPPS
      • Finish repack of public_user
      • "ALICE-like" spinner space on EOS PUBLIC PPS, to test with NA62 and n_TOF (#161)
      • Knowledge Base article for users to access files in CTA, see #214 Test and document workflow for retrieving user data from CTA
      • Migrate data from legacy experiments (Aleph, Chorus, Delphi, Nomad, Opal, ...).
      • Bartek (NA61) is following up on the issue of experiment data in the CASTOR /user part of the namespace

      Schedule

      • This week: test and migrate CAST to CTA if possible
      • 22 Feb: migrate NA62 to CTA
    • 16:20 → 16:30
      CTA Status Update 10m
      • This week: Deploy CTA 3.1-14
      • Next: CTA 4.0-1 with v4.0 schema update. Main features of CTA v4.0 are: superseding superseded (#922), tape lifecycle (#186, #943), max mounts per VO.
      • Remove "prepare" handling without "-s" from EOS MGM, done in #953
      • Tape verification in CTA (#883), in progress
    • 16:30 → 16:40
      Database discussion update 10m
      • Cedric/Steve feedback from meeting with Sebastien Masson
    • 16:40 → 16:50
      CTA Tape Server configuration 10m
      • Do we plan to use disk space attached to tape servers?
    • 16:50 → 16:55
      AOB 5m
      • We will write a joint CHEP paper with EOS+FTS teams. Abstract has been written.