CTA deployment meeting, 17/10/2019
- PhEDEx->Rucio migration is planned for summer 2020, meaning CMS won’t migrate to CTA beforehand.
- Julien needs to enquire with Katie/Eric about obtaining a path for the 200TB (Katie and Eric, with copy to Dima and Danilo)
- Michael: analysis of namespace needs to be completed. ETA - 2w. Michael will follow up with Dima.
- Once this is done and the CTA HW is available, the next step is to set up a Tier-0 read-only instance.
- Fileclass split-up is required to avoid a big-bang Oracle transaction during the metadata import, but this can be done internally without affecting CMS. Try to split by some semantic meaning.
- CMS is more interested in getting file classes right for Run-3 rather than fixing Run-2 data.
- Fileclass to tape pool mapping: as the import scripts operate per file class, we need to check whether different file classes are mixed on the same tape (otherwise tapes could end up half-in, half-out once one of the file classes has been repacked). To be verified (Michael).
- Check with Eric W. whether we can do test recalls once a RO namespace import is available.
- Local CASTOR users with legacy (pre-2013) data: No need for supporting access; this will be handled by CMS on a case-by-case basis or via dedicated scripts.
- See below for notes from the LHCb meeting.
- Once HW is delivered, set up LHCb instance and do a RO namespace import
- Then, work with Christophe on adapting the GFAL scripts: make them work with XROOT 3pc / sss authentication, testing EOSLHCb to EOSCTALHCb transfers.
- Next step is then to test GFAL with FTS multi-hop. Multi-hop is required for CTA export of calibration data to the T1s via EOSLHCb. This also requires multi-protocol support, as XROOT 3pc won’t be available at all T1s, so EOSCTALHCb -> XROOT -> EOSLHCb -> GridFTP -> T1 needs to be supported.
- Set up a functional test of GFAL + multi-hop
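The multi-hop chain discussed above can be sketched as an ordered list of legs. This is a hedged, stdlib-only illustration of the planning step, not the actual GFAL/FTS code; all endpoint URLs are hypothetical placeholders.

```python
# Hypothetical endpoints illustrating the EOSCTALHCb -> EOSLHCb -> T1 chain;
# the first leg uses XROOT 3pc, the last falls back to GridFTP for T1s
# that do not support XROOT 3pc.
HOPS = [
    "root://eosctalhcb.example.cern.ch//eos/ctalhcb/archive/file.raw",
    "root://eoslhcb.example.cern.ch//eos/lhcb/export/file.raw",
    "gsiftp://tier1.example.org/lhcb/raw/file.raw",
]

def plan_multihop(hops):
    """Turn an ordered list of hop URLs into (source, destination) copy legs."""
    return [(hops[i], hops[i + 1]) for i in range(len(hops) - 1)]

for src, dst in plan_multihop(HOPS):
    print(f"copy {src} -> {dst}")
```

In the real setup, each leg would be a separate FTS/GFAL transfer, with the intermediate EOSLHCb copy acting as the cache for the T1 export.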
- Michael: analysis of namespace needs to be completed.
- Participation in the Rucio coding camp: Rucio multi-hop support is being completed, including EOS cache handling for T1 transfers as well as FTS integration. Goal is to have everything done by the beginning of November. Andrea will be helping there.
- Run-2 reprocessing challenge timelines: no news yet (after the meeting: Alessandro proposes running a data challenge once CTA hardware is in place, see Alessandro's answer below)
- Michael will do a test migration of the complete namespace
Status of ongoing developments:
- Too many archive mounts during repack (link) - starvation effect.
- will require some adaptations to scheduling.
- Repack-tape-repair workflow (injecting files directly) is finished; will be pushed to master this afternoon
CERN SW RAO for LTO
- See presentation link for parameters
- Check difference between reading and writing - and be pragmatic
Answer from Alessandro:
I discussed with Julien a few days ago; I think we need to set up a stress test independently, when you're ready.
E.g. target 6GB/s: we take e.g. 600TB of RAW from tape to disk and then we delete them (in theory one day).
Please keep Eddie in the loop; I'm very happy to keep on helping with this, it's just that I might be much slower in answering. Eddie, please discuss this with Mario and David Cameron.
I think ATLAS could help, if we just plan, in making a few stress tests (even up to e.g. 15-20GB/s), which could then open the possibility for a full chain with all experiments....
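As a sanity check on the sizing in Alessandro's proposal, recalling 600TB at a sustained 6GB/s does indeed take roughly one day:

```python
# Back-of-the-envelope check of the proposed stress test:
# 600 TB recalled at a sustained 6 GB/s.
volume_tb = 600
rate_gb_per_s = 6

seconds = volume_tb * 1000 / rate_gb_per_s  # 600000 GB / 6 GB/s
days = seconds / 86400

print(f"{seconds:.0f} s = {days:.2f} days")
```

Any efficiency below 100% (drive contention, mount overheads) stretches this proportionally, so "in theory one day" is the best case.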
Additional notes from LHCb meeting
Indico agenda: https://indico.cern.ch/event/855015/
- assuming efficiency of 50%
- 4-5 days of buffer in pit
- Run2: was 0.7GB/s
- Run3: several parameters scaling up (cf slides)
- would reach 17.4GB/s without changes
- however, a higher fraction of physics (around 73%) is moving to “turbo” - which is a kind of lossy compression - so the data needs to be extracted right the first time, as no later re-construction is possible
- thus going down to 10GB/s (cf slides)
- 10GB/s nominal, requiring headroom to absorb peaks following issues
- Online farm will be several thousand nodes - 30PB of “working area”. Not SSDs, as capacity is not yet good enough
- Will use tape as “cheap storage” for parking
- 2 copies of ALL detector data: one at CERN + one at a Tier-1
- 10GB/s to tape + 10GB/s to Tier-1s + 3.5GB/s to “disk at CERN” (EOS analysis)
- Tier-1 export: 10GB/s, RAL taking a “bigger share”
- At least +20PB on Tier-0 but this is the baseline scenario, not the contingency one! Can go up to +50PB
- Reprocessing during the end-of-year shutdown requires massively reading data back from tape: ~3GB/s recall rate [(3.62+1.81)/2 for Tier-0] + margin, to do it in 4 months (during the EOY shutdown). There will be overlaps, so this needs to be added on top of the 10GB/s!
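The quoted ~3GB/s is the average of the two Tier-0 recall rates from the slides, and since reprocessing overlaps with nominal data taking, the recalls stack on top of the 10GB/s write stream:

```python
# Sanity check on the recall-rate figure above.
recall_rates = (3.62, 1.81)               # GB/s, the two Tier-0 rates from the slides
avg_recall = sum(recall_rates) / 2        # ~2.7 GB/s, quoted as ~3 GB/s with margin
combined = 10 + avg_recall                # recalls overlap the 10 GB/s nominal writes

print(f"average recall {avg_recall:.2f} GB/s, combined load {combined:.2f} GB/s")
```

So the tape system has to sustain roughly 13GB/s at Tier-0 during the shutdown, before any margin.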
- Between pit and EOS: can use XROOT
- EOS->CTA needs to be the same software for data movement as for Tier-1.
- Stuck on dCache (maybe ECHO as well) as XROOT 3pc doesn’t work on these?
- Risk for LHCb collaboration to be dependent on this??
- Root problem seems to be a global variable in DIRAC?
- Luca: LHCb could add another protocol, “xroot-3pc” - which is a kind of hack but could work out
- dCache 5.2 should support 3pc for XROOT, but there is latency in sites upgrading
- We can continue with BestMan for some time, but it doesn’t work on CentOS7
- Luca: envisaging cross-experiment stress test early next year
- a test of pit->EOS->tape would already be good
- T0 export to T1’s is round-robin
- Still assuming files are sent from CTA to the T1s. Can use multi-hop to copy from CTA to EOS and then to the Tier-1.
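The round-robin T0 export mentioned above can be sketched as follows. This is an assumed illustration, not the actual DIRAC/FTS logic, and the Tier-1 site list is purely illustrative (it ignores, e.g., RAL's bigger share).

```python
from itertools import cycle

# Illustrative Tier-1 site list; the real distribution is weighted
# (e.g. RAL takes a bigger share), which plain round-robin does not model.
T1_SITES = ["RAL", "CNAF", "IN2P3", "GRIDKA", "PIC", "SARA", "RRCKI"]

def round_robin(files, sites=T1_SITES):
    """Assign each file to the next Tier-1 site in a repeating cycle."""
    return list(zip(files, cycle(sites)))

assignments = round_robin([f"run{n}.raw" for n in range(9)])
for fname, site in assignments:
    print(f"{fname} -> {site}")
```

A weighted variant would simply repeat a site in the cycle proportionally to its share.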
- Throughput of 1.5GB/s recalls
- functional test of multi-hop will be done by ATLAS in November
- need to foresee a T0 export cache space in EOS. According to Christophe, LHCb can manage the contents of that cache themselves
- We need a functional test of the whole chain
- Christophe: could we keep the CTA test instance after the migration? A: yes of course