Data Preservation in High Energy Physics (DPHEP)
This is the fourth DPHEP Collaboration Workshop.
It will also host a Collaboration Board meeting.
The goals of this workshop are:
Please note that the agenda may still be adjusted to include new proposals and speakers. Proposals are welcome.
The migration of LEP data to the common event data model EDM4HEP, developed for future collider physics studies and used for FCC physics potential studies, is crucial for several reasons. First, it helps data preservation by ensuring the data remains accessible for future physics analyses. Additionally, it serves as a critical test of EDM4HEP, as this will be the first time the format is applied to real data. This data will also be invaluable to the FCC-ee community, enabling the training of new software on actual e+e- physics events.
A project to convert as much LEP data as possible into EDM4HEP has been initiated at CERN, starting with the ALEPH data. Considerable effort has been dedicated to recovering the documentation and software used during ALEPH's operation. A series of programs have been developed to extract data from the original computing environments (Linux SLC4 and SLC6) and convert it into EDM4HEP structures, using an intermediate text file as an exchange format during the process.
This contribution will present the current status of this ongoing effort, along with the lessons learned so far and the future outlook for the project.
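The conversion chain itself is not spelled out in the abstract; as a purely illustrative sketch of its final step, the snippet below parses one hypothetical line of the intermediate text file and fills an EDM4HEP collection through the podio Python bindings. The text-file layout, file names and collection name are invented for illustration, and the podio/edm4hep calls follow the publicly documented bindings rather than the actual ALEPH conversion programs.

    # Purely illustrative: read a hypothetical intermediate text file
    # ("px py pz E charge" per particle, one event per file) and fill an
    # EDM4HEP collection via the podio Python bindings.
    import podio
    from podio import root_io
    import edm4hep

    def parse_line(line):
        # Assumed layout of the exchange format: five floats per particle.
        px, py, pz, energy, charge = (float(x) for x in line.split())
        return px, py, pz, energy, charge

    writer = root_io.Writer("aleph_event.edm4hep.root")    # output name invented

    particles = edm4hep.ReconstructedParticleCollection()
    with open("event_000001.txt") as exchange_file:         # file name invented
        for line in exchange_file:
            px, py, pz, energy, charge = parse_line(line)
            p = particles.create()
            p.setMomentum(edm4hep.Vector3f(px, py, pz))
            p.setEnergy(energy)
            p.setCharge(charge)

    frame = podio.Frame()
    frame.put(particles, "ReconstructedParticles")
    writer.write_frame(frame, "events")

In the real project the same pattern would presumably be repeated per event and per collection type (tracks, calorimeter objects, MC truth), with frame categories and collection names chosen to match the EDM4HEP conventions.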
The importance of CERNLIB for the preservation of pre-LHC experiments has been discussed for a long time. As a result of this discussion, a community effort was started in 2022 with the aim of consolidating various bug fixes, improvements and third-party forks, and of porting CERNLIB to modern architectures. This activity has resulted in a revived version of CERNLIB.
The presented version is based on CERNLIB version 2006, with numerous patches for compatibility with modern compilers and operating systems. The code is available in the CERN GitLab repository with the full development history going back to the early 1990s. The updates also include a re-implementation of the build system in CMake, to make CERNLIB compliant with current best practices and to increase the chances of keeping the code in a compilable state for decades to come.
The revived CERNLIB project also includes updated documentation, which we believe is a cornerstone for any preserved software that depends on it. Since 2022 the revived CERNLIB has received a lot of positive feedback from long-term CERNLIB users across the physics community.
During the last workshop in June 2021, issues and threads for preserving LEP data were discussed. In the case of DELPHI, several of these threads have been addressed since then: a working port of CERNLIB to recent 64-bit operating systems, a dependency on a commercial 3D package for the DELPHI event display, and a restrictive data access policy that prevented DELPHI from opening the data and following FAIR principles.
The presentation will summarize the current status of DELPHI data preservation, covering bit preservation, software, documentation and analysis preservation, and data access policies, as well as FAIR principles and open data access.
OPAL has always envisioned keeping its software stack working on recent operating systems.
At the last DP workshop, the main risks identified for this approach were:
- the non-availability of CERNLIB for recent OS versions
- the future of 32-bit compilers
- uncertainty about the compatibility of the full software stack with 64-bit compilers
I will report on the current status of these issues, recent developments, and the foreseen next steps.
The event displays of DELPHI and OPAL were built on top of a commercial, closed-source PHIGS implementation. When the company providing the toolkit ceased to exist and the existing binaries stopped working on recent operating systems, these tools were essentially given up.
In this presentation, we will show how we nevertheless managed to revive the event displays, based on an open-source prototype implementation called OpenPHIGS, which was published a couple of years ago.
The data of the ZEUS experiment at HERA, and their usage, were converted to "preservation mode" in 2012, and new physics results have continuously been published from these data since then. A brief update will be given on the latest status of results, data and software access (e.g. the switch to linux9), other related issues, and future plans.
The H1 experiment at HERA took data from 1992 to 2007. A long-term data preservation system was set up in 2015 at DESY, Hamburg. The system permits the analysis of H1 data on the National Analysis Facility (NAF) at DESY. This talk summarizes the H1 data analysis model, recent physics results, and recent developments related to data preservation.
The JADE experiment was located at DESY in Hamburg and collected data in $e^+e^-$ collisions in the PETRA storage ring from 1979 to 1986 at center-of-mass energies between 12 and 46.6 GeV. Most notably, the JADE collaboration was responsible for the (co-)discovery of the gluon, as well as for establishing jet physics and testing quantum chromodynamics (QCD).
Preservation of the unique JADE data, software and documentation is important for scientific, educational and cultural reasons. In collaboration with the CERN Open Data project, we provide the software and the documentation necessary to analyze and understand the JADE data. The preserved data, computing notes, software, and original logbooks listing notable events and important parameters over the whole data-taking period are openly available to the public and can be used in modern analyses.
The PHENIX Collaboration has actively pursued a Data and Analysis Preservation program since 2019, the first such dedicated effort at RHIC. We have successfully leveraged the Zenodo platform at CERN for knowledge management purposes, and the HEPData portal to reliably preserve the vast majority of the numerical data used in PHENIX publications. A particularly challenging endeavor is the preservation of complex physics analyses, selected for their scientific importance and the value of the specific techniques developed as part of the research. For this, we have chosen two of the most impactful PHENIX results: (a) the joint study of direct photons and neutral pions in d+Au collisions and (b) the study of J/ψ production via the di-muon decay channel. To ensure reproducibility of these analyses going forward, the general strategy is to carefully partition them into self-contained tasks. This is supplemented by a combination of containerization techniques, code management, and robust documentation. We also leverage REANA as one of the preferred ways to run the required software. We present our experience based on these examples, and outline our future plans for analysis preservation.
BaBar's support at SLAC ended at the beginning of 2021. However, since the collaboration is still active, it needs to remain possible to start new analyses. The presentation reports on the status of the preservation of BaBar's data and the ability to do new analyses. Issues that have come up, as well as issues the effort may face in the near future, will be detailed.
The data from the Belle experiment, which finished collecting data in 2010, was passed on to its successor, the Belle II experiment. The Belle II experiment is currently accumulating data, but the Belle data will be used only by Belle II collaborators, at least until the amount of Belle II data exceeds that of the Belle experiment.
To continue the analysis using the Belle data, the Belle II experiment has independently developed tools that can handle the Belle data within the Belle II Software framework, and is still actively producing physics results based on the Belle data.
On the other hand, some of the software libraries used by the Belle experiment are no longer compatible with new operating systems and computing environments.
In this talk, we will report on the current status of the Belle II experiment and the preservation of the Belle data and analysis environment.
BESIII is an experiment running on the Beijing Electron-Positron Collider II (BEPCII), with plans to continue operations until 2030. While BESIII had made some preliminary studies of long-term data preservation and its applications, it recently decided to adopt the DPHEP level 4 model. A working group, consisting of members from the BESIII collaboration and the Computing Center at the Institute of High Energy Physics, has been formed to work on the strategies and technologies for the long-term preservation of BESIII data. According to the level 4 model, all experimental raw data, databases, and legacy data analysis tools will be preserved, ensuring that data analysis can be conducted after the experiment concludes. In addition to legacy data analysis, the group is considering introducing machine learning methods to potentially replace traditional event reconstruction processes, providing an alternative pathway for data analysis for those unfamiliar with the BESIII experiment. This report will briefly introduce the current status and future plans for this effort.
The CERN Open Data portal is a digital repository that focuses on disseminating event-level experimental particle physics data from LHC as well as non-LHC experiments. It provides more than 5 petabytes of collision data, simulated data, and derived data, together with accompanying configuration files, software tools, container images, and illustrative data usage examples. This talk presents a status update on the CERN Open Data portal service, covering the evolution of the content as well as the latest developments of the platform, with a particular focus on data preservation topics, the evolution of storage, and data access monitoring.
REANA is a platform for reusable and reproducible data analyses. REANA allows researchers to use declarative analysis workflows (CWL, Snakemake, Yadage) and run them on containerised compute clouds (Kubernetes, HTCondor, Slurm). In this talk we present a status update on REANA, covering the latest developments, with a particular focus on data-preservation oriented use cases. We demonstrate how REANA can offer an "analysis engine" to complement data preservation activities in view of verifying data provenance information or to ensure the validity of data usage examples for future data reuse by means of actionable "continuous reuse" workflows.
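As a concrete, though hypothetical, illustration of such an actionable "continuous reuse" check, the sketch below drives the reana-client command-line tool from Python. It assumes a reana.yaml specification already exists in the preserved analysis repository and that the REANA_SERVER_URL and REANA_ACCESS_TOKEN environment variables are set; the workflow name is invented, and the flags should be checked against the current reana-client documentation.

    # Hypothetical driver for an actionable "continuous reuse" check:
    # re-run a preserved analysis workflow on REANA via the reana-client CLI.
    import subprocess

    WORKFLOW = "preserved-analysis-rerun"   # workflow name invented for this sketch

    def reana(*args):
        # Run one reana-client command and abort on failure.
        cmd = ["reana-client", *args]
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Create the workflow from the declarative specification, upload the
    # inputs it lists, and start it on the REANA cluster.
    reana("create", "-f", "reana.yaml", "-n", WORKFLOW)
    reana("upload", "-w", WORKFLOW)
    reana("start", "-w", WORKFLOW)

    # Check the status once; a real reuse check would poll until completion
    # and compare the produced outputs against reference results.
    reana("status", "-w", WORKFLOW)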
CERN’s Large Hadron Collider (LHC) generates a wealth of scientific information, creating a complex challenge in managing, preserving, and reusing every element of the research process. The CERN Analysis Preservation (CAP) platform addresses these challenges by offering a comprehensive solution for capturing and preserving all components of an analysis, including data, code, computational environments, and documentation. CAP facilitates the seamless sharing and long-term preservation of research workflows, ensuring that valuable scientific knowledge is retained for future use and collaboration.
In this presentation, we will share the latest updates on the CAP platform for preservation and working group managers, enhancements in data sharing capabilities, and recent developments such as the integration of our react-formule package, which supports more flexible and structured data modelling.
In 2020 CERN released its Open Data policy document, which was endorsed by the four large LHC experiments at CERN. The policy was written to balance the experiments' concerns about loss of ownership of their data and about resources against the desire to be as open as possible with the data. The policy foresees all of the experiments releasing a substantial part of their data with a latency of five years after the data was collected. This talk will briefly discuss the policy and its implementation challenges, including the datasets that have been publicly released since the policy was endorsed, and offer some thoughts on how this will work in the future.
The status of data preservation in the ALICE collaboration will be presented, focusing on efforts to preserve Run 1 and Run 2 data using the new data format and software framework developed for Runs 3 and 4 under the ALICE O2 project. The conversion of the old AOD and ESD formats to the new AO2D format significantly reduces data size and introduces a flat data model optimized for fast I/O. Additionally, the O2 analysis framework efficiently handles the increased data volume of Runs 3 and 4, ensuring the long-term accessibility of both legacy and future data.
This year, ATLAS released Open Data for Research for the first time, providing all 2015 and 2016 proton–proton collisions to the public for scientific use. These data join the myriad bespoke datasets that have been released for specific purposes, including fast calorimeter simulation training, top quark jet tagging, Standard Model measurements in final states with a Z boson, and BSM searches. Accompanying this latest data release is significant new documentation providing metadata, software, example analyses, and more. This marks the beginning of a series of open data releases planned over the next 12 months, including the first release of heavy ion open data from ATLAS.
The LHCb experiment offers an excellent environment to study a broad variety of modern physics topics. The data recorded by LHCb during the major physics campaigns (Run 1 and Run 2) at the LHC has led to over 600 scientific publications, making it increasingly important to preserve analysis workflows to facilitate both reusability and reinterpretation of the results. LHCb encourages preservation of data and analysis workflows from the point the data is read out by the detector to the final results shown in publications, with options to produce Ntuples in a way that preserves the data provenance, and with extensive use of workflow management systems such as Snakemake.
Such valuable and complex data merits careful thought on how to preserve and provide open access to a broader community of researchers. In accordance with the CERN Open Data Policy, LHCb announced the release of its full Run 1 proton-proton collision dataset, amounting to approximately 800 terabytes, made public on the CERN Open Data portal in 2023. However, due to the large amount of data collected during Run 2, it is no longer feasible to make the reconstructed data accessible to the public in the same way. This prompted the development of an innovative approach to publishing open data by means of a dedicated LHCb Ntupling Service, allowing third-party users to query the data collected by LHCb and request custom samples in the form of Ntuples. These Ntuples can be individually customized in the web interface of the Ntupling Service application using LHCb standard tools for saving measured or derived quantities of interest. The procedure of requesting and subsequently analyzing an Ntuple requires no specific knowledge of the LHCb software stack.
The LHCb Ntupling Service was developed as a collaborative effort by LHCb and the CERN Open Data team from the CERN Department of Information Technology. The service consists of a web-interface frontend that allows users to create and review Ntuple production requests, and a backend application that processes the user requests, stores them in GitLab repositories, offers vetting capabilities to the LHCb Open Data team, and automatically dispatches approved requests to the LHCb Ntuple production systems. The produced Ntuples are then collected and delivered back to the users in the frontend web interface.
The ANTARES water Cherenkov neutrino telescope took data in the Mediterranean Sea for more than 15 years, until 2022. Having contributed valuable research to astroparticle and particle physics, the Collaboration is completing its final analyses, and the data is ready to be handed over to the KM3NeT Collaboration operating the next-generation detectors. The ANTARES data will serve as a use case for the development of KM3NeT data management policies and its open science system. In this contribution, the challenges for long-term maintenance and the opportunities for the development of future interfaces will be discussed.
PUNCH4NFDI is a German DFG-funded initiative to promote FAIR and open data management - and the development of the relevant tools - for the fields of particle physics, astrophysics, astroparticle physics, nuclear and hadron physics, and lattice QCD.
In particular, following the FAIR principles, it promotes and works towards
cross-community access to data, wherever possible.
A series of PUNCH use cases is currently being released that gives first insights into the PUNCH capabilities. One of these is a HEP demonstrator use case on open data from both CMS and ATLAS, covering both LHC Run 1 and Run 2. It is based on unified research-level data access, akin to that actually in use within one of the collaborations themselves, within a single workflow.
The workflow is exemplified with an analysis of Higgs-to-four-lepton
production, publicly documented on the PUNCH4NFDI portal. The full workflow
on initially unfiltered open data and MC samples runs within a few hours and
can be adapted by users to their own physics cases.
Limited only by the somewhat condensed information content of the input files, of order 50% of all potential LHC physics topics can be covered with this workflow.
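The actual workflow is documented on the PUNCH4NFDI portal; purely to illustrate the kind of selection step a Higgs-to-four-lepton analysis on NanoAOD-like open data involves, the sketch below applies a crude four-muon skim with uproot and awkward. The file name and branch names are assumptions, not the PUNCH4NFDI code.

    # Illustrative only: crude H -> ZZ* -> four-muon skim on a NanoAOD-like
    # open-data file (file and branch names are assumptions).
    import uproot
    import awkward as ak

    events = uproot.open("open_data_sample.root:Events").arrays(
        ["Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass", "Muon_charge"]
    )

    # Require at least four muons and zero net muon charge in the event;
    # a real analysis would build Z candidates and apply kinematic cuts.
    has_four_muons = ak.num(events["Muon_pt"]) >= 4
    net_charge_zero = ak.sum(events["Muon_charge"], axis=1) == 0
    selected = events[has_four_muons & net_charge_zero]

    print(f"selected {len(selected)} four-muon candidate events")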
In this talk, CMS will share its work in 2024 to release Run 2 legacy data and to support preservation and open-science usage via new documentation and workshops. We will also present the release of our statistical analysis package and a new platform for releasing likelihoods from CMS papers.
The ICFA Data Lifecycle Panel is a follow-up to DPHEP (and other initiatives) within ICFA. DPHEP will continue as a Collaboration, and we take this occasion to discuss how tasks will be shared and how the two entities will communicate.
We will go through the input collected in the survey and prepare the first step in developing a comprehensive set of recommendations and best practices that address critical aspects of the data lifecycle.
FAIROS-HEP is an NSF-funded Research Coordination Network that aims to connect groups of researchers thinking about FAIR and Open Source principles in HEP and other related fields. The network envisions a more cohesive infrastructure around data and publications in HEP. By focusing on FAIR data practices and how data and software can be linked to physics results, we hope to build a network of researchers thinking about how we can create a “living publication” to preserve and extend physics results. The project includes some funding for building infrastructure as well as future workshops connecting groups.