11.7.2023 — Online Meeting Data Generation WG
Meeting coordinates: zoom room online, 11 July 2023, 16:00-17:30
Present: Adnan Ghribi (CEA/CNRS), Adrian Oeftiger (GSI), Barbara Dalena (CEA/IRFU), Chenran Xu (KIT), David Bouvet (CCIN2P3), Thomas Kachelhoffer (CCIN2P3)
Minute Taking & Moderation: Adrian Oeftiger
Discussion aspects:
David presents Introduction to Grid Computing
- Adrian: how close can data be stored to computing resources? For the grid, is there a way to control this from the user's perspective? It would be interesting for sustainability reasons & for reducing network load.
- David: yes, users can specify that data be stored close to the computing resources.
- Adrian: can the data be published from the same space (i.e. get a DOI) where they are stored for internal analysis?
- Adnan: what is the workflow to publish data, e.g. for the LHC experiments? It would be nice to avoid input/output bottlenecks, since we want our algorithms to run on large batches of data at the same time.
- David: LHC experiments only published example cases and very limited data sets; distance to computing resources was not an issue. Tasks evaluating large amounts of data, or AI accessing a lot of data simultaneously, have not been an issue so far.
- Adrian: thinking of publishing data along with a container with the corresponding software to analyse the data, e.g. a jupyter notebook container with a script. [This is relevant for reproducible research, both for simulations & experimental data.]
- David: Astrophysics is looking at the mentioned use case of publishing data along with evaluation / analysis software. Astrophysics and astronomy research is also looking more into federated computing on distributed storage than the HEP community is. There it is important to have data close to the computing resources. This is a use case being worked on right now, with lots of technology, also involving jupyter notebook technology. R&D uses container technology as well, sometimes with the link to the data in the analysis workflow. This does exist.
- Adrian: Does this exist only within collaborations or also for published work? E.g. clicking on a link in a paper and opening a running kernel at some computing centre (be it behind an authentication gate such as eduGAIN etc.) with the data located close by, to see the analysis in detail?
- David: I have seen some of this. I guess one can do it. Most of the containers rely on computing software from the experiments, so one would have to adapt them with one's own [custom] software.
- Adrian: this would be a good step towards uniting data sets from separate studies in the future; being ready for this is one potentially relevant goal within our initiative. The difficulty is that today analysis scripts are very heterogeneous between institutes, as well as between researchers within a single institute.
- David: HEP experiments work in a more centralised way. Individual analysis scripts are built on joint containers but are not usually published.
- Barbara: I used to work on legacy experiments in 2008. It is good to see that the HEP community is now working on unified authentication & computing systems; at the time I was working on this, it was very heterogeneous, too.
- David: yes, everything has converged now, with services such as CVMFS having been developed.
- Adnan: I'm not aware of openly accessible data publication workflows in the astro community. It happens more at the level of small groups or within the community. A lot is closed. There needs to be more work towards open source and open access. Something to explore more...
- [...] David: data lakes...
- Adrian: pragmatic question: Mohammad Al-Turany from GSI is collaborating within ESCAPE on data lake implementation. How much of the data lake solution is already finished, as in ready to use, from your side?
- David: several sites are involved in this new infrastructure. We are part of it, but it is ongoing work; the cache is implemented and one can use it.
- [...] David shows a few examples of data published in federated data infrastructure that registered users can analyse; very interesting.
- Adrian: for us as the accelerator community it would be important not to come up with new solutions but to join the much larger HEP & astro communities and their ongoing projects to also solve our use case.
- [...] David presents DIRAC.
- Adnan: very interesting, is there documentation about how to write extensions and plugins for it?
- David: yes, it's also very interesting to us as service developers. I will find a presentation of what I've done with it, for example in CTA.
- technical discussion follows on caching, HPC centres available for running large training jobs, energy-efficient use of GPUs for more sustainability.
- Chenran: a lot of research in accelerator physics also doesn't need large data sets and/or large grid-style federated computing infrastructure?
- Adrian: it may apply to large projects like FCC with many different institutes contributing to the study, or DA scans for LHC which were run on BOINC infrastructure as a volunteer network with several 100k user machines available for computing.
- Chenran: distributed data management will be important for the future; I have more doubts about the distributed computing part.
- Adrian: for the INFRA call it will be interesting to compare a few different solutions to the data management question in order to arrive at reasonable recommendations for future distributed data management. The execution of such projects relying on distributed data management will likely not be within our proposal.
- Adnan: yes the demonstrator part is important for the future.
- Adrian: and the important bit is that we as a community start to transparently publish the data along with the research papers based on them.
- Adnan: ... and not reinvent the wheel over this.
- Adnan: some large data sets are already part of our collaboration (mentioning plasma simulation data of several TB in size) and we will see how to ideally place this. The discussion with experts like David & Thomas will be very important. Thanks for being on board. Thomas will be joining the workshop next week.