8–12 Sept 2025
Hamburg, Germany
Europe/Berlin timezone

ColliderML: The First Release of an OpenDataDetector High-Luminosity Physics Benchmark Dataset

Not scheduled
30m

Poster Track 2: Data Analysis - Algorithms and Tools Poster session with coffee break

Speaker

Daniel Thomas Murnane (Niels Bohr Institute, University of Copenhagen)

Description

Particle physics is a field hungry for high-quality simulation that matches the precision with which data are gathered at collider experiments such as the Large Hadron Collider (LHC). The computational demands of full detector simulation often lead to the use of faster but less realistic parameterizations, potentially compromising the sensitivity, generalizability, and robustness of downstream machine learning (ML) models. To address this, we introduce the OpenDataDetector High-Luminosity Physics Benchmark Dataset 2025, also known as “ColliderML”. It includes O(1 million) realistically simulated and digitized high-pileup collision events across O(10) important Standard Model (SM) and beyond-the-Standard-Model (BSM) channels. A variety of objects is available, from energy deposits in the tracker and calorimeters up to reconstructed tracks and jets, along with a large dataset of particle gun simulations. The OpenDataDetector geometry itself provides a realistic combination of several next-generation detector technologies.

To demonstrate ColliderML's utility, we showcase multiple machine learning benchmarks that rigorously evaluate the performance and behavior of ML models trained under diverse collider conditions. These evaluations examine critical ML aspects such as generalizability between fast and full simulation and across physics channels, the benefits of low-level and full-detector features, and robustness in handling complex and noisy collider data. Additionally, we provide an intuitive accompanying software library that streamlines dataset access and manipulation. As we find large ML models plateauing in performance on high-level physics objects, we propose ColliderML as an essential tool for exploring the next generation of ML on low-level collider data.
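As a rough illustration of what a fast-to-full-simulation generalizability check measures, the toy sketch below trains a one-dimensional threshold classifier on a "fast" sample and evaluates it on a more heavily smeared "full" sample. Everything in it is an assumption made for illustration: the Gaussian toy feature, the smearing widths, and the helper names `simulate`, `fit_threshold`, and `accuracy` are hypothetical and are not taken from ColliderML or its software library.

```python
import random

def simulate(n_per_class, smear, rng):
    """Toy 1D feature: signal centred at +1, background at -1, Gaussian smearing.

    This is a stand-in for detector simulation, not a physics model.
    """
    events = [(rng.gauss(+1.0, smear), 1) for _ in range(n_per_class)]
    events += [(rng.gauss(-1.0, smear), 0) for _ in range(n_per_class)]
    return events

def fit_threshold(events):
    """Place the cut midway between the signal and background means."""
    sig = [x for x, y in events if y == 1]
    bkg = [x for x, y in events if y == 0]
    return 0.5 * (sum(sig) / len(sig) + sum(bkg) / len(bkg))

def accuracy(events, threshold):
    """Fraction of events whose label matches the side of the cut they fall on."""
    correct = sum((x > threshold) == (y == 1) for x, y in events)
    return correct / len(events)

rng = random.Random(0)  # fixed seed so the toy study is reproducible

# Assumed smearing widths: the "fast" toy is idealized, the "full" toy is noisier.
fast_train = simulate(5000, 0.5, rng)
fast_test = simulate(5000, 0.5, rng)
full_test = simulate(5000, 1.5, rng)

threshold = fit_threshold(fast_train)
acc_fast = accuracy(fast_test, threshold)
acc_full = accuracy(full_test, threshold)
gap = acc_fast - acc_full  # generalization gap between the two domains

print(f"fast: {acc_fast:.3f}  full: {acc_full:.3f}  gap: {gap:.3f}")
```

The quantity of interest is `gap`: a model tuned on idealized simulation loses accuracy once the evaluation sample carries more realistic smearing. A real benchmark would replace the toy feature with detector-level inputs and the threshold with a trained ML model.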

Significance

The largest full-simulation dataset of experiment-agnostic low-level data was previously TrackML (https://www.kaggle.com/competitions/trackml-particle-identification), released in 2018 with 10k events. We aim to improve on this with 100x more data, full detector coverage (calorimeter and tracker), better digitization, and reconstructed objects. We believe this is a major milestone for low-level ML studies on open data.

References

https://iopscience.iop.org/article/10.1088/1742-6596/2438/1/012110/pdf

Authors

Andreas Salzburger (CERN); Anna Zaborowska (CERN); Daniel Thomas Murnane (Niels Bohr Institute, University of Copenhagen); Minh-Tuan Pham (University of Wisconsin Madison (US)); Paul Gessinger (CERN)
