DOMA / ACCESS Meeting

Europe/Zurich
513/1-024 (CERN)

Frank Wuerthwein (Univ. of California San Diego (US)), Ilija Vukotic (University of Chicago (US)), Markus Schulz (CERN), Stephane Jezequel (LAPP-Annecy CNRS/USMB (FR)), Xavier Espinal (CERN)

People on Vidyo: Andreas Petzold, Daniele Spiga, David Smith, Diego Ciangottini, Elizabeth Sexton-Kennedy, Eric Fede, Frank Wuerthwein, Gonzalo Merino, Johannes Elmsheuser, Kaushik De, Laurent Duflot, Markus Schulz, Oxana Smirnova, Horst Severini, Riccardo Di Maria, Sabine Chrystel Crepe, Stephane Jezequel, Xavier Espinal

 

* Frank Wuerthwein (Univ. of California San Diego (US)) - Proposed New Scope of DOMA Access

- discussed at https://indico.cern.ch/event/932079/

- reported here again; please have a look at the slides

 

* Stephane Jezequel (LAPP-Annecy CNRS/USMB (FR)) - Discussion on proto datalake

- follow-up to the document presented by Xavier (https://docs.google.com/document/d/1ZzyycM6Sli6cFQelF3VfEs9OEDbnEbaSmIOpaZ5fOgY)

- this proposal has not yet been presented or discussed within the experiments, so it is not endorsed by any of them

- request from the last meeting to have a more realistic timescale

- important to quantify the gain from the proposed datalake organisation

- today's presentation aims at a first step: converging on a proposal from the DOMA ACCESS community for concrete tests to run before the Run-3 restart (early 2022)

- goal: maintain or reduce the manpower needed to operate the Grid storage infrastructure at HL-LHC scale, including data transfers

- datalake notion: a storage infrastructure providing a full replica of the analysis input data; production and analysis workflows should work with the datalake organisation

- need to find a solution for isolated sites (Chile, Australia, East Asia)

- this proposal assumes that RUCIO is already deployed everywhere for everything, and that the HammerCloud infrastructure is available for performance measurements

- this proposal fits perfectly with the activity being carried out in the ESCAPE project

- if a testbed is deployed in DOMA, it is necessary to define the manpower, maximise reuse of existing tools and monitoring, and minimise interference with production activity

- is it possible to complete the exercises before LHC data-taking restarts in early 2022?

 

STEP 1

- 3 or more sites providing storage in the EU and US are needed (proposals already received from ES, IT, FR and DE)

- use only xrootd/http protocols for disk

-> this is already a call for sites to contribute

 

Timescale (relies on the availability of Rucio experts):

- Build the list of volunteering sites during summer 2020

-> 2020-2021: can only be a small set of sites (10-20?)

- Build the datalake during autumn 2020

-> new sites can be integrated on the fly over 2020, since the data are temporary

 

STEP 1b

- use the RUCIO production instance of DOMA (or ATLAS/CMS)

- asynchronous transfers relying on the DOMA-TPC FTS infrastructure

- datalake monitoring + functional tests: a pre-prototype exists for the ESCAPE storage infrastructure, similar to what WLCG has for FTS, perfSONAR, etc.

 

STEP 1c

- 3 or more sites/federations with state-less storage (caching)

- 3 or more sites/federations accessing data remotely (storage-less sites)

- calibrated jobs (analysis and production?) triggered by the HC infrastructure of each experiment -> success matrix to be defined

- demonstrate the feasibility of plugging in external resources (HPC, cloud)
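
The success matrix is still to be defined. Purely as an illustration, one possible form is a table of success fractions per site and job type, built from the outcomes of the calibrated jobs. The input format (tuples of site, job type, success flag) and the site names are assumptions made for this sketch; they are not a HammerCloud output format.

```python
from collections import defaultdict


def success_matrix(results):
    """Build {site: {job_type: success_fraction}} from job outcomes.

    `results` is an iterable of (site, job_type, succeeded) tuples.
    This input layout is hypothetical, chosen only for the sketch.
    """
    # (site, job_type) -> [number of successes, number of jobs]
    counts = defaultdict(lambda: [0, 0])
    for site, job_type, succeeded in results:
        entry = counts[(site, job_type)]
        entry[1] += 1
        if succeeded:
            entry[0] += 1

    matrix = defaultdict(dict)
    for (site, job_type), (ok, total) in counts.items():
        matrix[site][job_type] = ok / total
    return {site: dict(row) for site, row in matrix.items()}


# Hypothetical outcomes for two placeholder sites.
example_results = [
    ("SITE_A", "analysis", True),
    ("SITE_A", "analysis", False),
    ("SITE_A", "production", True),
    ("SITE_B", "analysis", True),
]
example_matrix = success_matrix(example_results)
```

A table of this kind would make it easy to compare caching sites, storage-less sites and external resources on the same footing, once the actual success criteria are agreed.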

 

Timescale (ATLAS, CMS): late 2020 - spring 2021

- Start the first HC tests accessing the datalake in late 2020

- Measure performance for calibrated jobs

- HPC/commercial cloud: no later than mid-2021

 

STEP 2

- injection of “primary” data to the lake

- run the required processing to produce analysis-like datasets; exercise the file workflow: after the data are produced, they are moved to a different QoS

- distribution of the analysis-like datasets to the caches of the processing sites

 

Timescale : Spring-Summer 2021

 

STEP 3

- based on the defined workflows, collect metrics and compare them with the current infrastructure: overall completion time, storage used, bandwidth used, CPU efficiency, operational burden (sites and experiments)
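
As a minimal sketch of how such a comparison could be expressed, the snippet below computes the fractional change of each numeric metric between the current infrastructure and the datalake. The class, field names and units are assumptions made for illustration. CPU efficiency is taken as CPU time divided by wall-clock time, and the operational burden is left out because it is not readily reduced to a single number.

```python
from dataclasses import dataclass


@dataclass
class WorkflowMetrics:
    """Metrics for one workflow run (field names and units are assumed)."""
    completion_hours: float  # overall time for completion
    storage_tb: float        # storage used
    bandwidth_tb: float      # data moved over the network
    cpu_seconds: float       # CPU time consumed by the jobs
    wall_seconds: float      # wall-clock time consumed by the jobs

    @property
    def cpu_efficiency(self) -> float:
        return self.cpu_seconds / self.wall_seconds


def relative_change(baseline: WorkflowMetrics, candidate: WorkflowMetrics) -> dict:
    """Return (candidate - baseline) / baseline for each metric."""
    pairs = {
        "completion_hours": (baseline.completion_hours, candidate.completion_hours),
        "storage_tb": (baseline.storage_tb, candidate.storage_tb),
        "bandwidth_tb": (baseline.bandwidth_tb, candidate.bandwidth_tb),
        "cpu_efficiency": (baseline.cpu_efficiency, candidate.cpu_efficiency),
    }
    return {name: (c - b) / b for name, (b, c) in pairs.items()}


# Made-up numbers, only to show the call; not measurements.
current = WorkflowMetrics(10.0, 100.0, 50.0, 800.0, 1000.0)
datalake = WorkflowMetrics(12.0, 50.0, 75.0, 700.0, 1000.0)
changes = relative_change(current, datalake)
```

Negative values mean the datalake run used less of that quantity than the current infrastructure. For CPU efficiency a negative value is a loss, so the sign has to be read per metric.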

 

Timescale : Autumn 2021 → Summer 2022

 

 

Open questions -> please see slides

 

Conclusion

- feedback on today’s presentation needs to be collected

- translate the ideas into a document

 

Discussion

- the experiments are under pressure and it is not straightforward to get manpower from them; how much work is needed to prototype this proposal?

- this should not be postponed, but there is a fear that it will last more than 1.5 years; the feeling is that the prototype is closer to a pre-production service

- US T1s could join the discussion as well, since a similar topic has already been discussed internally

- shared namespaces for ATLAS and CMS, when the same caches and origins are used, are a non-trivial topic

- the structural differences between ATLAS and CMS should be understood

- a better analysis of the manpower and expertise needed for this project is necessary to understand whether it is feasible

- RUCIO multi-VO support should be the priority

- making HC less experiment-dependent would be a plus

 