Description
Particle physics needs high-quality simulation to match the precision with which data is gathered at collider experiments such as the Large Hadron Collider (LHC). Because full detector simulation is computationally demanding, faster but less realistic parameterizations are often used instead, potentially compromising the sensitivity, generalizability, and robustness of downstream machine learning (ML) models. To address this, we introduce the OpenDataDetector High-Luminosity Physics Benchmark Dataset 2025, also known as “ColliderML”. It includes O(1 million) realistically simulated and digitized high-pileup collision events across O(10) important Standard Model (SM) and beyond-the-Standard-Model (BSM) channels. A variety of objects are available, from energy deposits in the tracker and calorimeters up to reconstructed tracks and jets, as well as a large dataset of particle gun simulations. The OpenDataDetector geometry itself provides a realistic combination of several next-generation detector technologies.
To demonstrate ColliderML's utility, we showcase multiple ML benchmarks that rigorously evaluate the performance and behavior of models trained under diverse collider conditions. These evaluations examine generalizability between fast and full simulation and across physics channels, the benefits of low-level and full-detector features, and robustness to complex and noisy collider data. We also provide an intuitive accompanying software library that streamlines dataset access and manipulation. As we find large ML models plateauing in performance on high-level physics objects, we propose ColliderML as an essential tool for exploring the next generation of ML on low-level collider data.
Significance
Previously, the largest full-simulation dataset of experiment-agnostic low-level data was TrackML (https://www.kaggle.com/competitions/trackml-particle-identification), released in 2018 with 10k events. We intend to improve on this with 100x more data, full detector coverage (calorimeter + tracker), better digitization, and reconstructed objects. We believe this is a major milestone for ML studies on low-level open data.
References
https://iopscience.iop.org/article/10.1088/1742-6596/2438/1/012110/pdf