5th DPHEP Collaboration Workshop

Europe/Zurich
13/2-005 (CERN)

The event will be held as a hybrid workshop at CERN and via Zoom.
Cristinel Diaconu (CPPM, Aix-Marseille Université, CNRS/IN2P3 (FR)), Dirk Duellmann (CERN), Ulrich Schwickerath (CERN)
Description

Data, Analysis, Software, Hardware, Knowledge Preservation in HEP and beyond

The goals of this workshop are to review the DP status and actions, and to explore other aspects such as AI technologies for information preservation, DP for outreach, and cognitive sciences input for long-term knowledge preservation in large collaborations.

  1. Perform a site-experiment round-table to capture the current situation including common problems & solutions HEP-wide
  2. Intensive surveys of the data sets in HEP: the landscape and the caves. 
  3. Address the hardware preservation, applying the DPHEP concepts
    • define "hardware preservation" (keynote talk on logistics in industry and at CERN)
    • hardware data (using DPHEP definitions: files, software, documentation, publications)
    • use cases (training, demo, spare, maintenance): what should be preserved and how
    • knowledge/skills preservation
  4. Knowledge and memory (keynote talk on cognitive science)
  5. Data Preservation for outreach
  6. Impact of AI, AI projects for data (and therefore knowledge) preservation.

Please note that the agenda has to be adjusted to include new proposals/speakers. Proposals welcome.

16/02/2026: The agenda is preliminary: the content/time can still change.

16/02/2026: Note that the meeting room location will change for the second day, March 6.

Does scientific data have intrinsic value, particularly data from large-scale experiments? Does the value or accuracy of these data change over time? Could these data contain 'hidden treasures'? Is data collected more than 25 years ago at the LEP collider still relevant and usable? Could these data help prepare the future FCC collider? How do the BaBar and HERA experiments manage to produce scientific papers so long after the data-taking phase has ended? Will preserved RHIC data facilitate a scientific transition to the EIC? How is LHC data prepared for long-term preservation and open access? What prevents us from analysing preserved data: technology or knowledge? Are there generic solutions available? Can AI help us preserve knowledge and technology for the future?

    • 2:00 PM 2:20 PM
      The larger landscape: Introduction 13/2-005

    • 2:20 PM 4:00 PM
      Experiments and sites: / ROOM 13/2-005

      • 2:20 PM
        DPHEP greetings from the past: news around LEP 20m

        This presentation is a wrap-up of where we are with data preservation for the
        LEP experiments, 25 years after the end of data taking. It will focus on
        recent developments since the last workshop, and cover the status of supporting
        tools like CERNLIB and OpenPHIGS, which are common to at least some of the
        experiments. While LEP data is mostly available today, with the exception of
        L3, an eye must be kept on the ever-changing IT infrastructure and related
        upcoming risks.

        Speaker: Dr Ulrich Schwickerath (CERN)
      • 2:40 PM
        A New Repository for DELPHI Data 20m

        EOSC-CZ activities include the creation of several data repositories for various scientific fields. The repository for High Energy Physics and Astroparticle Physics is currently under construction. Data from the LEP experiment DELPHI, already preserved in the CERN Open Data Portal, serve as the first use case for the new repository. The motivation is to maintain an additional copy of these valuable data in a geographically distinct location, to test the performance and limits of data access at the current site, and to compare them with those of the new repository.

        The new prototype data repository is based on the Invenio framework. This prototype explores an experiment-agnostic approach to publishing preserved HEP (and other) data. The repository technology is conceptually aligned with but independent from the CERN Open Data portal. We discuss the current status of the upload (work in progress), metadata modelling choices, PID options, and lessons learned when adapting a general-purpose digital repository to large-scale HEP datasets.

        Speaker: Michal Lukeš (FZU)
      • 3:00 PM
        Advanced b-tagging with archived ALEPH data 20m

        Recently, the 1994 dataset from the ALEPH experiment was converted into EDM4HEP format. Using this archived and converted ALEPH data, we apply modern software intended for FCC studies to process the data and to train and employ state-of-the-art, deep-learning-based jet-tagging techniques. We obtain significant improvements in heavy-flavour tagging performance with respect to the legacy algorithms, and observe good agreement between simulation and data as a function of the tagger output score. These results open promising prospects for enhancing the precision of various legacy electroweak measurements by re-analyzing LEP data with improved analysis methods. Furthermore, these studies form a useful testbed for the development of software for future electron-positron colliders with real data.

        Speaker: Luka Lambrecht (Brown University (US))
      • 3:20 PM
        Converting DELPHI Legacy Open Data to the EDM4HEP Format 20m

        The DELPHI experiment at CERN has released its complete Open Data legacy in SDST format, traditionally analyzed using the Fortran-based SKELANA tool. This limits accessibility and integration with modern analysis frameworks. We present a conversion pipeline that transforms DELPHI's SDST data—both real and simulated—into the EDM4HEP standard, enabling analysis within modern Python and C++ ecosystems.

        The conversion preserves essential physics information: reconstructed particles, vertices, and Monte Carlo truth data. This bridges legacy Zebra structures to a standardized data model, ensuring DELPHI data remains accessible and usable with contemporary tools. Future work will extend this to include specialized detector subsystems — like the Microvertex detector and RICH detectors — enabling more sophisticated physics analyses within the modern framework.

        Speaker: Dietrich Liko (Austrian Academy of Sciences (AT))
      • 3:40 PM
        26 years on: Experience with OPAL Data Re-analysis 20m

        Recently, I have been working on finding out how feasible it is to do analysis with the OPAL data again. I have been amazed and really excited by the possibilities that are at hand. After quickly being able to reproduce the results on radiative neutrino counting from a dataset we published in 2000, it is planned to complete and publish this analysis for the whole LEP2 dataset of OPAL. I also have several other ideas of topics that make sense for scientific exploitation of this legacy data.

        It is not just that the data are at hand, but that the whole software stack can run locally on a laptop (via CVMFS), including event simulation, event reconstruction, and event display, which also enables development of and fixes to the software.

        I had shown interest in working on re-analysis of OPAL data, but had been leery of a number of issues. But now a number of developments have made this much, much easier: the software stack is supported across a number of hardware architectures, the data is in eos, CERNLIB lives again, and with the Open-PHIGS solution for the event display, there are really no show-stoppers. By taking advantage of 26 years of Moore's law, together with rather well documented software, one can adopt methods for data analysis that go far beyond what was computationally feasible last century. While the code base is in ANSI-compliant F77, Fortran-callable C++ functions have been used for targeted self-contained new developments. In the context of the radiative neutrino counting analysis, it has been straightforward to generate large samples of simulated data from instrumental backgrounds arising from cosmic-rays and beam-halo muon events, and to make improvements to the event display and simulation.

        Speaker: Graham Wilson (The University of Kansas (US))
    • 4:00 PM 4:15 PM
      Coffee 15m 13/2-005

    • 4:15 PM 6:15 PM
      Experiments and sites: / ROOM 13/2-005

      • 4:15 PM
        CMS Data Preservation and Open Access: status and plans 20m

        Since the first release of open data from an LHC experiment by CMS in 2014, the CMS Open Data program has matured into a sustained pipeline of data to the public. These data (nearly 5 PB of real and simulated collisions) have found extensive use in educational programs as well as in the larger research community. This talk will describe the content of CMS Open Data and how it has been used for education and research so far. It will also explore the challenges and opportunities ahead for data preservation and open access in CMS.

        Speaker: Thomas McCauley (University of Notre Dame (US))
      • 4:35 PM
        Status of Open and Preserved Data in ATLAS 20m

        ATLAS has significantly expanded its Open Data offerings since the last DPHEP Workshop, with new releases of Open Data for Education based on the Open Data for Research that was released in 2024 as well as the first releases of Open Heavy Ion Data and Open Event Generation Data. The first tutorial on ATLAS Open Data was held in November of 2025: a broad audience engaged with the latest materials, which were extended significantly for the event. The documentation of these datasets continues to grow in order to make the data accessible to a larger user base, and a new ‘public by default’ approach to collaboration documentation will eventually make significant expert documentation available to the community. The collaboration has identified the Open Data as its primary mechanism for preserving data from Run 2 and Run 3 in a new Data Preservation policy. A new software preservation policy has also been approved to ensure wider availability and accessibility of all experiment software.

        Speaker: Zach Marshall (Lawrence Berkeley National Lab. (US))
      • 4:55 PM
        LHCb Data Preservation and Open Data 20m

        The LHCb collaboration at the LHC has accumulated ${\sim}700$ scientific publications, making it increasingly important to preserve analysis workflows to facilitate both reusability and reinterpretation of the results and collected data. Data and analysis preservation are important steps towards producing publications at LHCb, with Run 3 ntuple production for user analysis centralized in a way that preserves the data provenance, extensive use of workflow management systems like Snakemake, and requirements to upload published results to HEPData.

        Such valuable and complex data merit careful thought on how to preserve them and provide open access to a broader community of researchers. In accordance with the CERN Open Data Policy, LHCb announced the release of the full Run 1 dataset gathered from proton-proton collisions, amounting to approximately 800 terabytes made public on the CERN Open Data portal in 2023. Due to the sheer volume of the Run 2 data and beyond, LHCb has developed a different release mechanism in collaboration with the CERN Department of Information Technology: the LHCb Ntupling Service. The service consists of a web interface frontend that allows users to create and review ntuple production requests, a backend application that processes the user requests and stores them in GitLab repositories (offering vetting capabilities to the LHCb Open Data team), and automatic dispatch of user requests to the LHCb ntuple production systems after approval. The produced ntuples are then collected and delivered back to the users through the frontend web interface. Requesting and subsequently analyzing an ntuple requires no specific knowledge of the LHCb software stack.
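        The request lifecycle described above (submission, vetting, dispatch, delivery) can be pictured with a minimal Python sketch. This is only an illustration under stated assumptions: the names `NtupleRequest`, `Status`, `review`, `dispatch` and `fake_production` are hypothetical and do not correspond to the actual LHCb Ntupling Service code or API.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Status(Enum):
    SUBMITTED = auto()   # created by a user through the web frontend
    APPROVED = auto()    # vetted by the open data team
    REJECTED = auto()
    DELIVERED = auto()   # ntuples produced and returned to the user


@dataclass
class NtupleRequest:
    """Hypothetical record of one ntuple production request."""
    user: str
    dataset: str
    variables: list
    status: Status = Status.SUBMITTED
    outputs: list = field(default_factory=list)


def review(request, approve):
    """Vetting step: only freshly submitted requests can be reviewed."""
    if request.status is not Status.SUBMITTED:
        raise ValueError("only submitted requests can be reviewed")
    request.status = Status.APPROVED if approve else Status.REJECTED
    return request


def dispatch(request, produce):
    """Send an approved request to a production function and deliver the result."""
    if request.status is not Status.APPROVED:
        raise ValueError("request must be approved before production")
    request.outputs = produce(request)
    request.status = Status.DELIVERED
    return request


def fake_production(request):
    """Stand-in for the real production system: returns made-up file names."""
    return [f"{request.dataset}_{request.user}.root"]


if __name__ == "__main__":
    req = NtupleRequest("alice", "run2_pp", ["pt", "eta"])
    dispatch(review(req, approve=True), fake_production)
    print(req.status.name, req.outputs)
```

        The point of the sketch is the ordering constraint: production can only start from an approved request, which mirrors the vetting step placed between submission and dispatch.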

        The official release of the LHCb Ntupling Service was announced this year. It is accessible directly through the CERN Open Data Portal, and provides public access to both Run 1 and, for the first time, Run 2 $pp$ data collected at LHCb, amounting to over 4 PB of data to explore!

        Speaker: Dillon Fitzgerald (University of Michigan (US))
      • 5:15 PM
        Long-term data preservation in ALICE: status and plans 20m

        The current status of data preservation efforts in the ALICE collaboration will be discussed, with emphasis on the preservation of Run 1 and 2 data using the data format developed for Run 3 and 4 in the context of the ALICE O2 project. The change to this new format and the associated framework brings about a significant data size reduction and provides a flat data structure to ensure fast I/O. The O2 analysis framework and the new data format present a good option for data preservation in general for ALICE, for internal use as well as open data, for both legacy data and future data from Run 3 and 4.

        Speaker: David Dobrigkeit Chinellato (Austrian Academy of Sciences (AT))
      • 5:35 PM
        Data preservation in the H1 experiment at HERA 20m

        The H1 experiment at HERA took data from 1992 to 2007. A long-term data preservation system was set up in 2015 at DESY, Hamburg. The system permits the analysis of H1 data on the National Analysis Facility (NAF) at DESY. This talk summarizes the H1 data analysis model, recent physics results, and recent developments related to data preservation.

        Speaker: Daniel Britzger (Max-Planck-Institut für Physik München)
      • 5:55 PM
        Status of BaBar's Data and Analysis Preservation Efforts 20m

        BaBar has been operating its current long-term data and analysis system since 2021. The presentation reports on the status of the preservation of BaBar's data and the ability to do new analyses, the challenges faced, and the future steps needed to continue data preservation.

        Speaker: Dr Marcus Ebert (University of Victoria)
    • 9:00 AM 10:00 AM
      The larger landscape: / ROOM 222/R-001

      222/R-001

      CERN

      • 9:00 AM
        CERN Open Data: Policy to implementation 20m

        In 2020 CERN released its Open Data policy document, which was endorsed by the four large LHC experiments at CERN. The policy was written to balance the experiments' concerns about loss of ownership of their data and about resources against the desire to be as open as possible with the data. Under the policy, all of the experiments release a substantial part of their data with a latency of 5 years after the data was collected. Since then the smaller LHC experiments have also endorsed this policy, and discussions are ongoing with other experiments at CERN. This talk will briefly discuss the policy and its implementation challenges, including the datasets that have been publicly released since the policy was endorsed, and offer some thoughts on how this will work in the future.

        Speaker: Jamie Boyd (CERN)
      • 9:20 AM
        Progress and Prospects for Data Preservation at IHEP 20m

        The Institute of High Energy Physics (IHEP) constructs and operates several large-scale scientific facilities, including BESIII, JUNO, Daya Bay (DYB), LHAASO and others. The data generated by these experiments are crucial for driving discoveries and innovations in high-energy physics and related fields. IHEP has been actively engaged in advancing the long-term preservation of data, software, and knowledge associated with these facilities.
        In particular, for the BESIII experiment, we have established a comprehensive data preservation framework covering raw and processed data, analysis software, documentation, and metadata. Beyond preservation, we are developing a BESIII Data Ecosystem aimed at fully exploiting the scientific potential of the BESIII datasets. This effort is overseen by the BESIII Data Committee, which is responsible for data curation, validation, and managed external releases.
        Furthermore, IHEP promotes open data initiatives across other experiments. Selected datasets from the Daya Bay and LHAASO experiments, corresponding to published results, have been released to the scientific community. These steps reinforce our commitment to sustainable data stewardship, collaborative knowledge integration, and the long-term reuse of HEP data.
        This presentation will review IHEP’s progress in data, software, and knowledge preservation, discuss ongoing challenges, and outline future strategies within the global DPHEP context.

        Speaker: Hao Hu (Institute of High Energy Physics)
      • 9:40 AM
        Analysis Preservation as a Pillar of Long-Term Knowledge Reuse in HEP 20m

        Discussions of long-term preservation in high-energy physics often emphasise event data, software environments, and documentation. Equally important, and increasingly mature, is the preservation of analyses: the executable logic that connects experimental data to published measurements through selections, derived observables, statistical procedures, and – more recently – machine-learning models. Over the past decades, community-driven efforts have demonstrated that analysis-level preservation is both technically feasible and scientifically impactful, enabling reinterpretation of legacy measurements, cross-experiment validation, and educational reuse.

        This contribution highlights analysis preservation as a complementary pillar within the broader DPHEP landscape. We outline the ecosystem formed by structured public result repositories, portable analysis frameworks (e.g. Rivet, CheckMATE), and lightweight model- and metadata-preservation tools (e.g. petrifyML, Contur) that archive machine-learning classifiers in stable, dependency-minimal or standards-based representations. Together, these components support “executable publications,” in which not only numerical results but also the procedures that produced them remain reusable across software generations.

        Analysis preservation provides a direct bridge between data preservation and knowledge preservation by encoding expert intent in forms that are both human-readable and machine-actionable. It also creates natural interfaces for AI-assisted documentation, automated validation workflows, and outreach-oriented reproducibility. While significant progress has been made – supported by broad collaboration-level engagement – coverage remains incomplete and sustainability often depends on limited dedicated resources. Recognising analysis preservation explicitly alongside data, software, and hardware preservation will help consolidate existing successes into a durable, community-supported infrastructure for future research and public engagement.

        Speaker: Tomasz Procter (Jagiellonian University (PL))
    • 10:00 AM 10:20 AM
      Transverse Projects 222/R-001

      • 10:00 AM
        EOSC EDEN 20m

        Long-term digital preservation is a common challenge for many research infrastructures. Data volumes are growing, technologies change, and research data must remain usable and trustworthy over long periods of time. The EOSC-EDEN project addresses these challenges by developing a shared approach to long-term digital preservation across research domains.

        This presentation gives a brief overview of EOSC-EDEN and its current work. It explains how discipline and data-type requirements were collected across several scientific areas, including High-Energy Physics, to identify shared preservation needs and challenges. It then introduces the Core Preservation Processes, which treat preservation as a continuous lifecycle activity rather than a single ingest step. The talk also presents the first version of the EOSC-EDEN specifications and requirements, led by CERN, which translate existing preservation principles such as OAIS, FAIR, and TRUST into practical, interoperable service specifications.

        The presentation highlights the added value of EOSC-EDEN in aligning preservation practices across communities, supporting shared solutions, and strengthening long-term sustainability.

        Speaker: Wesley Middelbos (CERN)

        Link to visual diagram tool for the Core Preservation Processes: https://cpp.fd-dev.csc.fi/

    • 10:20 AM 10:40 AM
      Preserved Coffee Break 20m 222/R-001

    • 10:40 AM 12:00 PM
      Transverse Projects: / ROOM 222/R-001

      • 10:40 AM
        K4GeneratorsConfig: A Unified Approach to MC Generator Benchmarking 20m

        The next generation of electron-positron colliders will require
        unprecedented precision in both theory and experiment. Sophisticated
        software frameworks are essential to evaluate detector concepts,
        optimize designs, and simulate physical processes. In this context,
        Monte Carlo (MC) event generators play a central role, enabling
        realistic simulations of Standard Model processes and providing the
        basis for physics studies. However, technical consistency across
        different generators remains critical, particularly in domains where
        agreement is expected. To address this need, we present
        K4GeneratorsConfig, a Python-based package that automates the
        benchmarking process for MC generators. The tool translates universal
        physics inputs into generator-specific configurations, ensuring
        consistency, reproducibility, and reduced human error. Its modular
        design allows for straightforward integration of additional generators
        and provides compatibility with the Key4hep software stack.
        While so far the focus has been on the configuration of the
        generator, we are now aiming to integrate the production aspect of
        running generators with an emphasis on reproducibility and referencing
        for well defined MC generator versions.

        Speaker: Alan Price
      • 11:00 AM
        CERN Preserve Platform 20m

        This presentation focuses on the CERN Preserve Platform, which implements OAIS-compliant long-term preservation pipelines and applies the OAIS reference model in a large-scale scientific environment.

        The platform supports a broad range of research-relevant content, including documentation, theses, publications, and images. Support for mailboxes and internal websites is scheduled for later this year.

        The goal of the platform is to provide Preservation as a Service for CERN repositories, as well as for selected CERN-related content hosted on platforms such as Zenodo. In 2024, we established a service level agreement with CDS ensuring that its complete content is preserved through the Preserve Platform. The presentation demonstrates the architecture of this service within CERN’s multi-repository environment and outlines different ways to establish additional SLAs based on repository needs.

        Speaker: Panna Liptak (CERN)
      • 11:20 AM
        Providing Cold Storage in the CERN Open Data portal 20m

        The CERN Open Data portal provides open access to high-energy physics data for research, education, and outreach. As the volume of hosted data surpasses 5 PB, the need for a sustainable management strategy becomes critical to ensure long-term preservation. Balancing high-performance access for popular datasets with cost-effective storage for rarely accessed data is essential for the continued growth of the repository.

        To address these challenges, a cold storage system was moved into production in June 2025. By leveraging tape archives for secondary storage, the portal can preserve massive volumes of data while freeing up primary disk resources. A central feature of this implementation is the self-service staging functionality: unauthenticated users can request the restoration of archived datasets directly through the web interface, with the process handled by an automated queue.
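        The self-service staging idea described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the portal's implementation: the class `ColdStorageCatalogue`, the `Location` states and the method names are hypothetical, and a real system would talk to a tape archive rather than flip a flag.

```python
from collections import deque
from enum import Enum


class Location(Enum):
    DISK = "disk"        # directly downloadable
    TAPE = "tape"        # archived, must be staged before access
    STAGING = "staging"  # restoration requested, waiting in the queue


class ColdStorageCatalogue:
    """Hypothetical catalogue tracking where each dataset currently lives."""

    def __init__(self, locations):
        self.locations = dict(locations)
        self.queue = deque()  # first-in, first-out staging requests

    def request_staging(self, dataset):
        """Called on behalf of a user; queues a tape-only dataset once."""
        if self.locations[dataset] is Location.TAPE:
            self.locations[dataset] = Location.STAGING
            self.queue.append(dataset)
        return self.locations[dataset]

    def process_next(self):
        """Automated worker step: restore the oldest queued dataset to disk."""
        if not self.queue:
            return None
        dataset = self.queue.popleft()
        self.locations[dataset] = Location.DISK
        return dataset


if __name__ == "__main__":
    catalogue = ColdStorageCatalogue({"old-sample": Location.TAPE,
                                      "popular-sample": Location.DISK})
    print(catalogue.request_staging("old-sample"))
    print(catalogue.process_next(), catalogue.locations["old-sample"])
```

        Marking a dataset as staging at request time keeps repeated requests for the same dataset from filling the queue, which matters when requests come from unauthenticated users.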

        This contribution discusses the integration of cold storage into the Open Data infrastructure and the specific functionalities developed to maintain public access to archived content. Operational insights and usage statistics from the first year of production are also shared, reflecting on how this storage model supports the long-term goals of data preservation in high energy physics.

        Speaker: Diana Rand (CERN)
      • 11:40 AM
        Facilitating Open Data Reuse with REANA: From Scalable Workflows to AI-Assisted Analysis Authoring 20m

        REANA is a platform for reusable and reproducible data analyses. REANA allows researchers to use declarative analysis workflows (CWL, Snakemake, Yadage) and run them on supported containerised compute clouds (Kubernetes, HTCondor, Slurm).

        In this talk we present the latest developments in the REANA ecosystem focusing on making preserved data more easily reusable.

        • We notably discuss the recently added support for Dask computational workflows where REANA allows different users to instantiate different Dask clusters as necessary to reinterpret original analyses.

        • We also present efforts in prototyping federated scientific workflow execution across the European Open Science Cloud federation where different parts of the user analysis can be sent to different nodes close to where the data sits. This, together with the availability of the nascent EOSC Federation computing resources to theoretical physicists, machine-learning scientists, university teachers and students in Europe, could lead to easy-to-use computational patterns for studying and reinterpreting preserved open data from experimental particle physics.

        • We finally mention nascent efforts in the REANA ecosystem trying to take advantage of the progress of Large Language Models (LLMs) to assist researchers with workflow authoring by means of AI agents talking to the REANA backend for idea verification in order to reduce hallucinations when assisting the user in writing analysis workflows.

        These recent developments in the REANA ecosystem show how data preservation efforts can further facilitate open data reuse in the near future: capturing data together with detailed auxiliary information about data provenance and use from publications and other supplementary material, providing runnable examples and actionable workflows that illustrate how to use the data, and offering automated workflow authoring assistance.

        Speaker: Tibor Simko (CERN)
    • 12:00 PM 1:30 PM
      Lunch 1h 30m
    • 1:30 PM 2:10 PM
      Transverse Projects: / ROOM 222/R-001

      • 1:30 PM
        Using generative AI to extract dataset information from journal articles 20m

        Open access to particle physics datasets has the potential to yield new scientific results and new approaches to both education and data analysis. However, the cost is non-zero, so it is important for the curators to understand the usage and needs of the community. One way to assess this is to examine journal articles that make use of these datasets. It is straightforward to extract citations of specific datasets, but these may not always include more granular information, such as whether the entire dataset was used or just subsets. We present some exploratory work making use of the more popular LLMs (Large Language Models) from OpenAI and Anthropic to extract information from these publications. The current status of this work will be presented.

        Speaker: Emily Rensch (Siena University/Cornell University)
      • 1:50 PM
        From tcl to awkward: analyzing old data with new tools 20m

        As more experiments move toward data preservation and open datasets, the door opens not only to new physics questions, but also to attacking these questions with new computational approaches that may not have existed when the data were recorded. In this talk, I will detail our experiences analyzing almost 20-year-old data from the BaBar experiment with the latest experiment-agnostic computational tools like uproot and awkward. I will discuss the pressure points and challenges of working with older datasets and the still-necessary legacy computing infrastructure, and highlight some lessons for current and future experiments.

        Speaker: Matthew Bellis (Cornell University/Siena College (US))
    • 2:10 PM 3:10 PM
      Experiments and sites: / ROOM 222/R-001

      • 2:10 PM
        A Holistic Approach to Data, Analysis, and Knowledge Preservation at BNL 20m

        The RHIC Data and Analysis Preservation Program (DAPP) at Brookhaven National Laboratory is a comprehensive initiative in nuclear physics, safeguarding 25 years of data, software, and institutional knowledge from the PHENIX, STAR, and sPHENIX experiments. As the community transitions to the Electron-Ion Collider (EIC) era, DAPP offers a complete model for preserving scientific value beyond the lifetime of active experiments. We present an integrated overview of the program's three complementary pillars, with relevance to the broader HEP/NP community preparing for HL-LHC, the EIC, and future facilities.

        The DAPP is built on a comprehensive strategy. It first establishes the core preservation infrastructure for long-term data and software retention, utilizing enhanced metadata to enable result reanalysis. Second, it introduces SciBot, a locally deployed AI assistant using Retrieval-Augmented Generation (RAG) and locally hosted LLMs, which provides secure, natural-language access to preserved RHIC knowledge and critically guarantees data sovereignty for private documentation. Finally, the program presents CRISP, an integrated institutional knowledge and document management system that leverages automated workflows and federated identity to ensure digital preservation and long-term accessibility for future projects.

        Together, these efforts provide a practitioner’s perspective on the full preservation stack—from raw data and analysis workflows to the organizational knowledge that gives them context and meaning. We will highlight cross-cutting lessons on metadata standards, access-control federation, AI integration, and the institutional structures needed to sustain preservation throughout experiment lifecycles, with direct relevance to the broader HEP and nuclear physics communities preparing for next-generation facilities.

        Speaker: Dr Jerome LAURET (Brookhaven National Laboratory)
      • 2:30 PM
        MINERvA neutrino experiment open data 20m

        The MINERvA experiment specializes in studying neutrino interactions and producing neutrino cross section measurements, especially to support current and future oscillation experiments. After several years of preparation, we released our open data product (minerva.fnal.gov/opendata) and announced it to the most interested community at a conference in October 2025. It includes data and simulated events from our entire run period, which ended in 2019. We also provide a code release with examples and documentation. In fact, we refactored and released the code we are actively using now and moved current analyses to the revised set, which will improve the code and documentation and our ability to support uses of the data in the next few years. We are rapidly gaining insight and experience with potential "customers" and use cases now.

        Speaker: Rik Gran (University of Minnesota Duluth)
      • 2:50 PM
        Digital preservation and reanalysis of raw photographic data from the CERN 2m bubble chamber 20m

        Many of the defining advances in particle physics during the mid twentieth century were made at bubble chamber experiments. The particle physics group of the University of Birmingham have a rich history of involvement in bubble chamber experiments dating back to the 1950s. As part of this legacy, the group hold an extensive collection of photographic film (tens of thousands of frames) recorded by experiments at the CERN 2m Hydrogen Bubble Chamber (HBC), which operated between 1965 and 1976. These photographic records of particle interactions in the chamber volume were the primary raw data format of such experiments; particle trajectories and momenta were then reconstructed through careful measurements of the film. Remarkably, nearly 60 years after the chamber was first commissioned, the basic technical information required to reconstruct particle interactions from measurements of 2m HBC film has been comprehensively preserved by CERN and is publicly available. This talk will describe an effort to digitally preserve this large collection of photographic film in high fidelity and explore the feasibility of reviving its quantitative scientific exploitation. More information can be found on the project’s website: https://bubblechamber.web.cern.ch

        Speaker: Andrew Stephen Chisholm (University of Birmingham (GB))
    • 3:15 PM 3:50 PM
      The larger landscape 222/R-001

      CERN

      200
      • 3:15 PM
        ICFA recommendations on data preservation and open science and their assessment 20m

        In 2025, the ICFA Data Lifecycle panel published recommendations for best practices for data preservation and open science in HEP, addressing the important issue of FAIR (findable, accessible, interoperable, reusable) and open, and thus more sustainable, data, software, and analysis workflows. The panel plans to oversee an assessment of how these recommendations are being implemented at different actor levels, followed by regular follow-up assessments every 1–2 years.

        This contribution presents the recommendations and outlines the design of the assessment process, guided by impact, reuse, and assessment tools with good user experience.

        Speaker: Kati Lassila-Perini (Helsinki Institute of Physics (FI))
      • 3:35 PM
        Closing 15m
        Speakers: Cristinel Diaconu (CPPM, Aix-Marseille Université, CNRS/IN2P3 (FR)), Dr Ulrich Schwickerath (CERN)
    • 5:00 PM 6:00 PM
      Post-workshop contributions 222/R-001

      CERN

      200
      • 5:00 PM
        Data ORchestration Agent (DORA) for AI-Ready Scientific Data in Large-Scale Facilities 20m

        Scientific discovery increasingly relies on AI models, which require high-quality, AI-ready datasets. Yet, managing and preserving complex data from large-scale facilities remains a bottleneck. This report introduces the Data ORchestration Agent (DORA), an AI-driven framework designed to automate and optimize the entire data lifecycle—from processing to preservation and provisioning—ensuring data is AI-ready.
        We will report how DORA employs intelligent agents to execute dynamic workflows for data ingestion, completion, annotation, and preservation. It follows a dual paradigm: AI for Data, where agents automate data orchestration tasks, and Data for AI, where outputs are structured, annotated, and preserved as FAIR-compliant, traceable datasets optimized for AI training. We will further present application cases of DORA at the High-Energy Photon Source (HEPS) and the China Spallation Neutron Source (CSNS), demonstrating its capabilities.
        DORA enhances data usability, ensures reproducibility, and accelerates scientific discovery. It represents a shift toward autonomous, scalable, and sustainable data ecosystems for large-scale research infrastructures.
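
        To make the idea of a "traceable" dataset concrete, the sketch below attaches a checksum and an ordered list of processing steps to a dataset record, so that a later user can confirm the preserved bytes are unchanged and see how they were produced. This is only an assumption about what such a record might contain, not DORA's data model: the `DatasetRecord` fields, the identifier, and the step names are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata for a preserved dataset."""
    identifier: str          # persistent identifier (findable)
    title: str               # human-readable description
    licence_id: str          # reuse conditions (reusable)
    sha256: str              # checksum of the preserved payload
    provenance: list[str] = field(default_factory=list)  # ordered processing steps


def make_record(identifier: str, title: str, licence_id: str,
                payload: bytes, steps: list[str]) -> DatasetRecord:
    """Create a record whose checksum is computed from the payload bytes."""
    digest = hashlib.sha256(payload).hexdigest()
    return DatasetRecord(identifier, title, licence_id, digest, list(steps))


def verify(record: DatasetRecord, payload: bytes) -> bool:
    """Check that the payload still matches the checksum stored in the record."""
    return hashlib.sha256(payload).hexdigest() == record.sha256


if __name__ == "__main__":
    data = b"detector frame 0001"
    record = make_record(
        "example:0001",              # placeholder identifier
        "Example detector frame",
        "CC-BY-4.0",
        data,
        ["ingestion", "annotation", "preservation"],
    )
    # Serialize the record so it can be stored next to the data.
    print(json.dumps(asdict(record), indent=2))
    print(verify(record, data))
```

        Storing the record alongside the data keeps the integrity check independent of the system that produced it.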

        Speaker: Zhengde Zhang (中国科学院高能物理研究所)