Data Preservation in High Energy Physics (DPHEP)
This is the fourth DPHEP Collaboration Workshop.
It will also host a Collaboration Board meeting.
The goals of this workshop are:
Please note that the agenda may still be adjusted to include new proposals and speakers. Proposals are welcome.
The migration of LEP data to the common event data model EDM4HEP, developed for future collider physics studies and used for FCC physics potential studies, is crucial for several reasons. First, it helps data preservation by ensuring the data remains accessible for future physics analyses. Additionally, it serves as a critical test of EDM4HEP, as this will be the first time the format is applied to real data. This data will also be invaluable to the FCC-ee community, enabling the training of new software on actual e+e- physics events.
A project to convert as much LEP data as possible into EDM4HEP has been initiated at CERN, starting with the ALEPH data. Considerable effort has been dedicated to recovering the documentation and software used during ALEPH's operation. A series of programs have been developed to extract data from the original computing environments (Linux SLC4 and SLC6) and convert it into EDM4HEP structures, using an intermediate text file as an exchange format during the process.
This contribution will present the current status of this ongoing effort, along with the lessons learned so far and the future outlook for the project.
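The conversion chain itself is not spelled out in the abstract; as a purely illustrative sketch of its final step, the snippet below parses one hypothetical line of the intermediate text file and fills an EDM4HEP collection through the podio Python bindings. The text-file layout, file names and collection name are invented for illustration, and the podio/edm4hep calls follow the publicly documented bindings rather than the actual ALEPH conversion programs.

    # Purely illustrative: read a hypothetical intermediate text file
    # ("px py pz E charge" per particle, one event per file) and fill an
    # EDM4HEP collection via the podio Python bindings.
    import podio
    from podio import root_io
    import edm4hep

    def parse_line(line):
        # Assumed layout of the exchange format: five floats per particle.
        px, py, pz, energy, charge = (float(x) for x in line.split())
        return px, py, pz, energy, charge

    writer = root_io.Writer("aleph_event.edm4hep.root")    # output name invented

    particles = edm4hep.ReconstructedParticleCollection()
    with open("event_000001.txt") as exchange_file:         # file name invented
        for line in exchange_file:
            px, py, pz, energy, charge = parse_line(line)
            p = particles.create()
            p.setMomentum(edm4hep.Vector3f(px, py, pz))
            p.setEnergy(energy)
            p.setCharge(charge)

    frame = podio.Frame()
    frame.put(particles, "ReconstructedParticles")
    writer.write_frame(frame, "events")

In the real project the same pattern would presumably be repeated per event and per collection type (tracks, calorimeter objects, MC truth), with frame categories and collection names chosen to match the EDM4HEP conventions.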
The importance of CERNLIB for the preservation of pre-LHC experiments has been discussed for a long time. As a result of this discussion, a community effort was started in 2022 with the aim of consolidating various bug fixes, improvements and third-party forks, and of porting CERNLIB to modern architectures. This activity has resulted in a revived version of CERNLIB.
The presented version is based on CERNLIB version 2006, with numerous patches for compatibility with modern compilers and operating systems. The code is available in the CERN GitLab repository with the full development history going back to the early 1990s. The updates also include a re-implementation of the build system in CMake, to make CERNLIB compliant with current best practices and to increase the chances of keeping the code in a compilable state for decades to come.
The revived CERNLIB project also includes updated documentation, which we believe is a cornerstone for any preserved software that depends on it. Since 2022 the revived CERNLIB has received a lot of positive feedback from long-term CERNLIB users across the physics community.
During the last workshop in June 2021, issues and threads for preserving LEP data were discussed. In the case of DELPHI, several of these threads have been addressed since then: a working port of CERNLIB to recent 64-bit operating systems, a dependency on a commercial 3D package for the DELPHI event display, and a restrictive data access policy that prevented DELPHI from opening the data and following FAIR principles.
The presentation will summarize the current status of DELPHI data preservation, covering bit preservation, software, documentation and analysis preservation, and data access policies, as well as FAIR principles and open data access.
OPAL has always envisioned keeping its software stack working on recent operating systems.
At the last DP workshop, the main risks identified for this approach were:
- the non-availability of CERNLIB for recent OS versions
- the future of 32-bit compilers
- uncertainty about the compatibility of the full software stack with 64-bit compilers
I will report on the current status of these issues, recent developments, and the foreseen next steps.
The event displays of DELPHI and OPAL were built on top of a commercial, closed-source PHIGS implementation. When the company providing the toolkit ceased to exist and the existing binaries stopped working on recent operating systems, these tools were essentially given up.
In this presentation, we will show how we nevertheless managed to revive the event displays, based on an open-source prototype implementation called OpenPHIGS, which was published a couple of years ago.
The data of the ZEUS experiment at HERA, and their usage, were converted to "preservation mode" in 2012, and new physics results have continuously been published from these data since then. A brief update will be given on the latest status of results, data and software access (e.g. the switch to linux9), other related issues, and future plans.
The H1 experiment at HERA took data from 1992 to 2007. A long-term data preservation system was set up in 2015 at DESY, Hamburg. The system permits the analysis of H1 data on the National Analysis Facility (NAF) at DESY. This talk summarizes the H1 data analysis model, recent physics results, and recent developments related to data preservation.
The JADE experiment was located at DESY in Hamburg and collected data in $e^+e^-$ collisions in the PETRA storage ring from 1979 to 1986 at center-of-mass energies between 12 and 46.6 GeV. Most notably, the JADE collaboration was responsible for the (co-)discovery of the gluon, as well as for establishing jet physics and testing quantum chromodynamics (QCD).
Preservation of the unique JADE data, software and documentation is important for scientific, educational and cultural reasons. In collaboration with the CERN Open Data project, we provide the software and the documentation necessary to analyze and understand the JADE data. The preserved data, computing notes, software, and original logbooks listing notable events and important parameters over the whole data-taking period are openly available to the public and can be used in modern analyses.
The PHENIX Collaboration has actively pursued a Data and Analysis Preservation program since 2019, the first such dedicated effort at RHIC. We have successfully leveraged the Zenodo platform at CERN for knowledge management purposes, and the HEPData portal to reliably preserve the vast majority of the numerical data used in PHENIX publications. A particularly challenging endeavor is the preservation of complex physics analyses, selected for their scientific importance and the value of the specific techniques developed as part of the research. For this, we have chosen two of the most impactful PHENIX results: (a) the joint study of direct photons and neutral pions in d+Au collisions and (b) the study of J/ψ production via the di-muon decay channel. To ensure reproducibility of these analyses going forward, the general strategy is to carefully partition them into self-contained tasks. This is supplemented by a combination of containerization techniques, code management, and robust documentation. We also leverage REANA as one of the preferred ways to run the required software. We present our experience based on these examples, and outline our future plans for analysis preservation.
BaBar's support at SLAC ended at the beginning of 2021. However, since the collaboration is still active, it needs to remain possible to start new analyses. The presentation reports on the status of the preservation of BaBar's data and the ability to do new analyses. Issues that have come up, as well as issues the effort may face in the near future, will be detailed.
The data from the Belle experiment, which finished collecting data in 2010, was passed on to its successor, the Belle II experiment. The Belle II experiment is currently accumulating data, but the Belle data will be used only by Belle II collaborators, at least until the amount of Belle II data exceeds that of the Belle experiment.
To continue the analysis using the Belle data, the Belle II experiment has independently developed tools that can handle the Belle data within the Belle II Software framework, and is still actively producing physics results based on the Belle data.
On the other hand, some of the software libraries used by the Belle experiment are no longer compatible with new operating systems and computing environments.
In this talk, we will report on the current status of the Belle II experiment and the preservation of the Belle data and analysis environment.
BESIII is an experiment running on the Beijing Electron-Positron Collider II (BEPCII), with plans to continue operations until 2030. While BESIII had made some preliminary studies of long-term data preservation and its applications, it recently decided to adopt the DPHEP level 4 model. A working group, consisting of members from the BESIII collaboration and the Computing Center at the Institute of High Energy Physics, has been formed to work on the strategies and technologies for the long-term preservation of BESIII data. According to the level 4 model, all experimental raw data, databases, and legacy data analysis tools will be preserved, ensuring that data analysis can be conducted after the experiment concludes. In addition to legacy data analysis, the group is considering introducing machine learning methods to potentially replace traditional event reconstruction processes, providing an alternative pathway for data analysis for those unfamiliar with the BESIII experiment. This report will briefly introduce the current status and future plans for this effort.
The CERN Open Data portal is a digital repository that focuses on disseminating event-level experimental particle physics data from LHC as well as non-LHC experiments. It provides more than 5 petabytes of collision data, simulated data, and derived data, together with accompanying configuration files, software tools, container images, and illustrative data usage examples. This talk presents a status update on the CERN Open Data portal service, covering the evolution of the content as well as the latest developments of the platform, with a particular focus on data preservation topics, the evolution of storage, and data access monitoring.
REANA is a platform for reusable and reproducible data analyses. REANA allows researchers to use declarative analysis workflows (CWL, Snakemake, Yadage) and run them on containerised compute clouds (Kubernetes, HTCondor, Slurm). In this talk we present a status update on REANA, covering the latest developments, with a particular focus on data-preservation oriented use cases. We demonstrate how REANA can offer an "analysis engine" to complement data preservation activities in view of verifying data provenance information or to ensure the validity of data usage examples for future data reuse by means of actionable "continuous reuse" workflows.
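As a concrete, though hypothetical, illustration of such an actionable "continuous reuse" check, the sketch below drives the reana-client command-line tool from Python. It assumes a reana.yaml specification already exists in the preserved analysis repository and that the REANA_SERVER_URL and REANA_ACCESS_TOKEN environment variables are set; the workflow name is invented, and the flags should be checked against the current reana-client documentation.

    # Hypothetical driver for an actionable "continuous reuse" check:
    # re-run a preserved analysis workflow on REANA via the reana-client CLI.
    import subprocess

    WORKFLOW = "preserved-analysis-rerun"   # workflow name invented for this sketch

    def reana(*args):
        # Run one reana-client command and abort on failure.
        cmd = ["reana-client", *args]
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Create the workflow from the declarative specification, upload the
    # inputs it lists, and start it on the REANA cluster.
    reana("create", "-f", "reana.yaml", "-n", WORKFLOW)
    reana("upload", "-w", WORKFLOW)
    reana("start", "-w", WORKFLOW)

    # Check the status once; a real reuse check would poll until completion
    # and compare the produced outputs against reference results.
    reana("status", "-w", WORKFLOW)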
CERN’s Large Hadron Collider (LHC) generates a wealth of scientific information, creating a complex challenge in managing, preserving, and reusing every element of the research process. The CERN Analysis Preservation (CAP) platform addresses these challenges by offering a comprehensive solution for capturing and preserving all components of an analysis, including data, code, computational environments, and documentation. CAP facilitates the seamless sharing and long-term preservation of research workflows, ensuring that valuable scientific knowledge is retained for future use and collaboration.
In this presentation, we will share the latest updates on the CAP platform for preservation and working group managers, enhancements in data sharing capabilities, and recent developments such as the integration of our react-formule package, which supports more flexible and structured data modelling.
In 2020 CERN released its Open Data policy document, which was endorsed by the four large LHC experiments at CERN. The policy was written to balance the experiments' concerns about loss of ownership of their data and about resources against the desire to be as open as possible with the data. The policy foresees all of the experiments releasing a substantial part of their data with a latency of five years after the data was collected. This talk will briefly discuss the policy and its implementation challenges, including the datasets that have been publicly released since the policy was endorsed, and offer some thoughts on how this will work in the future.
The status of data preservation in the ALICE collaboration will be presented, focusing on efforts to preserve Run 1 and Run 2 data using the new data format and software framework developed for Runs 3 and 4 under the ALICE O2 project. The conversion of the old AOD and ESD formats to the new AO2D format significantly reduces data size and introduces a flat data model optimized for fast I/O. Additionally, the O2 analysis framework efficiently handles the increased data volume of Runs 3 and 4, ensuring the long-term accessibility of both legacy and future data.
This year, ATLAS released Open Data for Research for the first time, providing all 2015 and 2016 proton–proton collisions to the public for scientific use. These data join the myriad bespoke datasets that have been released for specific purposes, including fast calorimeter simulation training, top quark jet tagging, Standard Model measurements in final states with a Z boson, and BSM searches. Accompanying this latest data release is significant new documentation providing metadata, software, example analyses, and more. This marks the beginning of a series of open data releases planned over the next 12 months, including the first release of heavy ion open data from ATLAS.
The LHCb experiment offers an excellent environment to study a broad variety of modern physics topics. The data recorded by LHCb during the major physics campaigns (Run 1 and Run 2) at the LHC has led to over 600 scientific publications, making it increasingly important to preserve analysis workflows to facilitate both reusability and reinterpretation of the results. LHCb encourages preservation of data and analysis workflows from the point the data is read out by the detector to the final results shown in publications, with options to produce Ntuples in a way that preserves the data provenance, and with extensive use of workflow management systems such as Snakemake.
Such valuable and complex data merits careful thought on how to preserve and provide open access to a broader community of researchers. In accordance with the CERN Open Data Policy, LHCb announced the release of its full Run 1 proton-proton collision dataset, amounting to approximately 800 terabytes, made public on the CERN Open Data portal in 2023. However, due to the large amount of data collected during Run 2, it is no longer feasible to make the reconstructed data accessible to the public in the same way. This prompted the development of an innovative approach to publishing open data by means of a dedicated LHCb Ntupling Service, allowing third-party users to query the data collected by LHCb and request custom samples in the form of Ntuples. These Ntuples can be individually customized in the web interface of the Ntupling Service application using LHCb standard tools for saving measured or derived quantities of interest. The procedure of requesting and subsequently analyzing an Ntuple requires no specific knowledge of the LHCb software stack.
The LHCb Ntupling Service was developed as a collaborative effort by LHCb and the CERN Open Data team from the CERN Department of Information Technology. The service consists of a web-interface frontend that allows users to create and review Ntuple production requests, and a backend application that processes the user requests, stores them in GitLab repositories, offers vetting capabilities to the LHCb Open Data team, and automatically dispatches approved requests to the LHCb Ntuple production systems. The produced Ntuples are then collected and delivered back to the users in the frontend web interface.
The ANTARES water Cherenkov neutrino telescope took data in the Mediterranean Sea for more than 15 years, until 2022. Having contributed valuable research to astroparticle and particle physics, the Collaboration is completing its final analyses, and the data is ready to be handed over to the KM3NeT Collaboration operating the next-generation detectors. The ANTARES data will serve as a use case for the development of KM3NeT data management policies and its open science system. In this contribution, the challenges for long-term maintenance and the opportunities for the development of future interfaces will be discussed.
PUNCH4NFDI is a German DFG-funded initiative to promote FAIR and open data management - and the development of the relevant tools - for the fields of particle physics, astrophysics, astroparticle physics, nuclear and hadron physics, and lattice QCD.
In particular, following the FAIR principles, it promotes and works towards
cross-community access to data, wherever possible.
A series of PUNCH use cases is currently being released that gives first insights into the PUNCH capabilities. One of these is a HEP demonstrator use case on open data from both CMS and ATLAS, covering both LHC Run 1 and Run 2. It is based on unified research-level data access, akin to that actually in use within one of the collaborations themselves, within a single workflow.
The workflow is exemplified with an analysis of Higgs-to-four-lepton
production, publicly documented on the PUNCH4NFDI portal. The full workflow
on initially unfiltered open data and MC samples runs within a few hours and
can be adapted by users to their own physics cases.
Limited only by the somewhat condensed information content of the input files, of order 50% of all potential LHC physics topics can be covered with this workflow.
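The actual workflow is documented on the PUNCH4NFDI portal; purely to illustrate the kind of selection step a Higgs-to-four-lepton analysis on NanoAOD-like open data involves, the sketch below applies a crude four-muon skim with uproot and awkward. The file name and branch names are assumptions, not the PUNCH4NFDI code.

    # Illustrative only: crude H -> ZZ* -> four-muon skim on a NanoAOD-like
    # open-data file (file and branch names are assumptions).
    import uproot
    import awkward as ak

    events = uproot.open("open_data_sample.root:Events").arrays(
        ["Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass", "Muon_charge"]
    )

    # Require at least four muons and zero net muon charge in the event;
    # a real analysis would build Z candidates and apply kinematic cuts.
    has_four_muons = ak.num(events["Muon_pt"]) >= 4
    net_charge_zero = ak.sum(events["Muon_charge"], axis=1) == 0
    selected = events[has_four_muons & net_charge_zero]

    print(f"selected {len(selected)} four-muon candidate events")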
In this talk, CMS will share its work in 2024 to release Run 2 legacy data and to support preservation and open-science usage via new documentation and workshops. We will also present the release of our statistical analysis package and a new platform for releasing likelihoods from CMS papers.
The ICFA Data Lifecycle Panel is a follow-up to DPHEP (and other initiatives) within ICFA. DPHEP will continue as a Collaboration, and we take this occasion to discuss how tasks will be shared and how the two entities will communicate.
We will go through the input collected in the survey and prepare the first step in developing a comprehensive set of recommendations and best practices that address critical aspects of the data lifecycle.
FAIROS-HEP is an NSF-funded Research Coordination Network that aims to connect groups of researchers thinking about FAIR and Open Source principles in HEP and other related fields. The network envisions a more cohesive infrastructure around data and publications in HEP. By focusing on FAIR data practices and how data and software can be linked to physics results, we hope to build a network of researchers thinking about how we can create a “living publication” to preserve and extend physics results. The project includes some funding for building infrastructure as well as future workshops connecting groups.