2nd WG meeting (online)
16.6.2023 — Online Meeting Data Generation WG
Meeting coordinates: Zoom room (online), 16 June 2023, 14:00-15:30
Present: Adnan Ghribi (CEA/CNRS), Adrian Oeftiger (GSI), Andrea Santamaria Garcia (LAS/KIT), Andrew Mistry (GSI), Barbara Dalena (CEA/IRFU), Chenran Xu (KIT), Francis Osswald (CNRS/IPHC), Gianluca Valentino (UM), Hayg Guler (CNRS / IJCLab), Kevin Cassou (CNRS), Simon Hirlaender (PLUS)
Minute Taking: Adnan Ghribi
Moderation: Adrian Oeftiger
Discussion aspects:
Andrew Mistry joins us from the nuclear physics data community and introduces us to the way they manage metadata.
Presentation by Andrew (to put in the box)
Highlights/questions:
Data publication: interim solution vs final solution at the European level?
Zenodo as an interim solution to publish data sets <50 GB with a DOI (max <200 GB for paid solutions, according to Kevin)
Kevin: they have large PIC simulation data sets from the plasma wakefield community (facing this challenge for the first time) and are looking to publish at the >TB level
Adnan: portal to explore data before downloading them, storage for 5 years. Some links to projects working on this:
Adrian: CERN has integrated data storage with small evaluation kernel capabilities (NXCALS); this was implemented a couple of years ago already and CERN is gathering experience with it. Could this be an interesting solution to explore at the international level?
Adnan: feature engineering just as important as metadata labeling (becomes dynamic during evaluation of data-driven models)
Adnan: discipline-oriented databases: do we have to create our own in the accelerator community?
Adnan: note the HMC booklet, a guide to better metadata: https://oceanrep.geomar.de/id/eprint/55270/7/2022_HMC-metadata_in_briefs_1_web.pdf
Adnan: Could be useful to have that table of metadata structure for information
Andrea: in contact with MT DMA (data management) on the openPMD standard. Also in contact with the Helmholtz metadata initiative (they also have project calls: https://helmholtz-metadaten.de/en/projects/hmc-project-calls). Ongoing project: B2SHARE provided by EUDAT, local instance of sat repo. Search for metadata and find data locally: https://b2share.eudat.eu/
Adrian: our initiative was included in a slide/presentation at the JENA workshop and was well received.
Chenran: KIT uses a solution that first "publishes" the data internally in a local copy of the database and only then provides it externally by connecting to the public database (same software!), which then tags it with a DOI
Adrian: suggestion to use a hackathon format to organise ourselves along the study cases during the programme over the next years: one group implements data publication/management strategies and another implements active learning strategies, with all groups working together on the same study case.
Kevin: IDRIS/Nvidia hackathon, deadline June 30th, starting in September. https://www.ins2i.cnrs.fr/fr/cnrsinfo/appel-projets-pour-beneficier-de-laccompagnement-dingenieurs-en-intelligence-artificielle
Adrian: a hackathon that we can organize as an output of the white paper and within the call
Adrian: reviews the white paper. Reminder to please fill in your study case and institute.
Adnan: add whether your data catalogues are ready or whether you need to generate them.
Adrian: Review of the study case "Exploring Resonance Diagrams"
Francis: Review of the study case "Enhanced emittance evaluation"
Gianluca: Review of the study case "Surrogate modeling of beam losses in the LHC collimation hierarchy"
Chenran: Review of the study case "Surrogate Modelling of low-energy linac"
Kevin, Andrea, Adnan, Barbara and Hayg will add their study cases soon
Adrian: discussion of some workshop organisation.