CTA deployment meeting, 10/10/2019

Follow-up from ALICE meeting:

  • See the attached screenshot of the discussed potato-stick diagram.
  • Migrations:
    • Data taking from the pit will be at 100GB/s to O2, which acts as an online data buffer. Around 50PB will be taken during HI (~3-4 weeks). Data will be copied from O2 to EOSCTA at 10GB/s (this number needs to be confirmed, as previous discussions with Massimo targeted 20-30GB/s).
      • At 10GB/s, transferring 50PB will take ~60 days; 20GB/s brings this down to ~30 days (see the sketch after this list).
    • ALICE will not explicitly check migration status (aka the "m" bit on CASTOR); they expect us to provide sufficient reliability and checks to ensure that data does not get silently lost.
  • Recalls:
    • ALICE wants to continue with the HSM-based approach currently provided by CASTOR, with automated migration, recall and garbage collection. No FTS and no explicit file transfer management.
    • Data reading from EOSCTA will be done outside the HI period, at rates of up to 10GB/s.
    • A 5PB disk read cache is requested. Files recalled from tape will live in this cache. Each file will be copied 2-3 times to LXBATCH, where it will be processed; results are written to EOSALICE.
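
A quick sanity check of the transfer times quoted above (illustrative arithmetic only; the 50PB volume and the 10-30GB/s rates are the figures from the meeting):

    # Back-of-the-envelope transfer times for copying the HI dataset from O2 to EOSCTA.
    PETABYTE = 1000 ** 5  # bytes (decimal units, matching the GB/s rates above)
    GIGABYTE = 1000 ** 3  # bytes

    data_volume = 50 * PETABYTE  # ~50PB taken during the HI period

    for rate_gbps in (10, 20, 30):
        seconds = data_volume / (rate_gbps * GIGABYTE)
        print(f"{rate_gbps} GB/s -> {seconds / 86400:.1f} days")

    # 10 GB/s -> 57.9 days
    # 20 GB/s -> 28.9 days
    # 30 GB/s -> 19.3 days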

Possible implementation:

  • Provide two separate EOS spaces within EOSCTAALICE for migrations (using SSDs) and recalls (using HDDs). This requires EOSCTA internal adaptations in several places, such as:
    • Inbound/outbound "routes" (cf. also Steve's mail below):
      • Inbound data (i.e. data written to EOS that does not originate from tape servers) needs to be routed to the SSD space.
      • Outbound data (data written to EOS from tape servers) needs to be routed to the HDDs.
      • A special URL is needed for tape server retrieves so that they write to the HDD space, not the SSD space.
        • We need to change the EOS MGM to generate this special URL on open(write). This should be appropriately configurable. (Eric / Steve to create an RFE ticket)
        • tapeserverd needs to know the "write" space name, also for correctly checking the free space in the context of backpressure. (Ticket to be created by Eric)
    • Currently, the "default" space name is hardcoded in multiple places in the EOS MGM and FST; this needs to be changed. (Ticket to be created by Steve)

    • An LRU GC needs to be implemented and activated on the recall (HDD) space, and "strict" garbage collection activated on the migration (SSD) space (see the sketch after this list).
    • Recall backpressure needs to work against the recall space.
    • The SSD space needed for migration needs to be clarified: initially 400TB, but will this have to grow to 800TB?
    • Can we offer single-replica HDDs for the read space, as all data on it is by definition available on tape? This would improve performance and cost efficiency at the expense of higher operational cost (checking for broken HDDs, replacements).
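
To make the LRU garbage-collection point above concrete, here is a minimal, illustrative sketch of an LRU eviction policy for the recall (HDD) space. The class and watermark values are hypothetical and not part of EOSCTA; they only show the intended behaviour (recalled files can always be evicted, since a tape copy exists):

    from collections import OrderedDict

    class LruRecallCache:
        """Illustrative LRU eviction for the disk cache holding recalled files."""

        def __init__(self, capacity_bytes, high_watermark=0.95, low_watermark=0.85):
            self.high = high_watermark * capacity_bytes  # start evicting above this
            self.low = low_watermark * capacity_bytes    # stop evicting below this
            self.used = 0
            self.files = OrderedDict()                   # path -> size, oldest first

        def touch(self, path, size=None):
            """Record a recall (with size) or a read (refreshes the LRU position)."""
            if path in self.files:
                self.files.move_to_end(path)
            elif size is not None:
                self.files[path] = size
                self.used += size
                self._collect()

        def _collect(self):
            """Evict least recently used files; safe, as every file has a tape copy."""
            if self.used <= self.high:
                return
            while self.used > self.low and self.files:
                path, size = self.files.popitem(last=False)
                self.used -= size
                print(f"evict {path} ({size} bytes)")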

Open points, to be clarified with ALICE:

  • Confirm rate assumptions and read access model mentioned above.
  • Confirm capacity for read space.
  • Confirm whether implicit prepares are needed. (Current CASTOR jobs show that not all read files are "prepared" beforehand.)
  • Confirm whether EOSCTAALICE requires access from outside CERN (the assumption is no; T1 exports go via O2).
  • We have not yet resolved the potential migration backpressure problem. If the SSD buffer is full, writes to EOS will fail with ENOSPC. ALICE would need to back off until space becomes available again and then retry. For this, a "df"-equivalent command could be provided to them (see the sketch after this list).
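
A minimal sketch of the client-side back-off this last point implies. query_free_space() stands in for the proposed "df"-equivalent command and is purely hypothetical; the retry/back-off scheme itself is only an illustration:

    import errno
    import random
    import time

    def write_with_backoff(write_fn, query_free_space, file_size, max_wait=600):
        """Retry a write that failed with ENOSPC once the SSD buffer has drained."""
        wait = 5
        while True:
            try:
                return write_fn()  # perform the actual transfer into EOSCTAALICE
            except OSError as exc:
                if exc.errno != errno.ENOSPC:
                    raise
                # Buffer full: poll the hypothetical df-equivalent until there is room.
                while query_free_space() < file_size:
                    time.sleep(wait + random.uniform(0, wait))  # jitter avoids synchronised retries
                    wait = min(wait * 2, max_wait)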

Additional points from the ALICE meeting, noted by Eric:

  • File size is typically 2GB.
  • The processing of data is expected to happen in cycles of 3 weeks: prepare, read the files (typically 2-3 times), then prepare for the next cycle. [There is no global eviction; it was felt that this could be adequate, but ALICE does not want to do it.] The cycle data size would be the 5PB of the disk buffer. Processing is done on local CPUs (a worked figure follows this list).
  • EOSALICE would be 25PB: 20PB for storing derived data on d1t0 and 5PB as a garbage-collected area for the aforementioned cycle. ALICE would agree to carve the 5PB out of EOSALICE and provide it to EOSCTAALICE.
  • It was claimed that CTA would not serve data outside of EOSALICE or O2 (so no T1/Grid access).
  • In Run 2, ALICE globally keeps 2 tape replicas of files. They will keep only 1 replica in Run 3. That is significant, because data loss will actually occur (rather than the usual "please re-import the files").
  • There are currently no plans for Run 4, but the big jump in data rate is now, not Run 4.
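
For scale, and purely as illustrative arithmetic based on the figures above: one 5PB processing cycle at the typical 2GB file size corresponds to roughly 2.5 million files, each read 2-3 times:

    cycle_bytes = 5 * 1000 ** 5   # 5PB disk buffer per 3-week cycle
    file_bytes = 2 * 1000 ** 3    # typical ALICE file size of 2GB
    files_per_cycle = cycle_bytes / file_bytes
    print(f"{files_per_cycle:,.0f} files, ~{files_per_cycle * 2.5:,.0f} reads per cycle")
    # 2,500,000 files, ~6,250,000 reads per cycle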

 

————————————————————————————————————

Further information (Steve's discussion with Andreas):

There were two main discussions: firstly, how to modify/configure EOS to use SSDs to write to tape and spinners to read; secondly, how EOS could automatically move files from spinners to SSDs and then to tape (to be explained later).

In order to write to tape using SSDs but do everything else with spinners, we would need two development tasks:

  1. Modify EOS to allow us to override the space to be used when a tape server opens a file for writing (this is development work for the EOS team - please correct me if I am wrong, Andreas).
  2. Modify the EOS MGM (for myself or Eric - really easy) to add the new "override" opaque data to the URL given to tape servers to open EOS files for writing (when we retrieve files from tape).

Operationally we would then do the following:

  1. Configure EOS to have an SSD space and a spinner space.
  2. Configure EOS to use the spinner space by default.
  3. Configure all tape-backed directories to use SSDs by default.

When anyone writes to a non-tape directory, they will use spinners.  When a user writes to a tape-backed directory, they will use SSDs.  When a tape server writes back to the same tape-backed directories, it will override the SSD space setting and use spinners instead.
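
A minimal sketch of the resulting space-selection logic, under the assumptions above. The function and space names are illustrative only, not actual EOS identifiers or configuration keys:

    def select_space(path_is_tape_backed, writer_is_tape_server,
                     default_space="spinners", tape_write_space="ssd"):
        """Decide which EOS space an open-for-write should land in.

        - writes to non-tape directories use the default (spinner) space
        - user writes to tape-backed directories use the SSD space (fast path to tape)
        - tape servers writing recalled files back override the SSD setting and use
          spinners, so recalls do not compete with migrations for SSD space
        """
        if writer_is_tape_server:
            return default_space        # override carried as opaque data in the URL
        if path_is_tape_backed:
            return tape_write_space
        return default_space

    # Example: a tape server retrieving a file into a tape-backed directory.
    assert select_space(path_is_tape_backed=True, writer_is_tape_server=True) == "spinners"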

 

 

 

Agenda:

  • 14:00 - 14:45  Feedback from experiments and impact on CTA deployment (45m)
    • ALICE
    • CMS
    • LHCb
      • eventual news from ATLAS
  • 14:45 - 15:05  AOB (20m)
    • SW releases