25–29 May 2026
Chulalongkorn University
Asia/Bangkok timezone

COLLIDE-2V - 750 Million Dual-View LHC Event Dataset for Low-Latency ML

26 May 2026, 14:57
18m
MHMK M02

Oral Presentation Track 2 - Online and real-time computing

Speaker

Eric Anton Moreno (Massachusetts Institute of Technology (US))

Description

Modern foundation models (FMs) have pushed the frontiers of language, vision, and multimodal tasks by training ever-larger neural networks (NNs) on unprecedented volumes of data. FMs have yet to be established in collider physics, which lacks both a comparably sized, general-purpose dataset on which to pre-train universal event representations and a clearly demonstrated need. Real-time event identification presents one possible need, since it requires fast classification and selection across all possible collisions at the LHC. We therefore construct COLLIDE-2V, a dual-view LHC collision dataset: a 50 TB public dataset comprising ~750 million proton-proton events generated with MadGraph + Pythia + Delphes under High-Luminosity LHC conditions (⟨μ⟩ = 200). Spanning minimum-bias and γ+jets through top, Higgs, di-boson, multi-boson, exotic long-lived signatures and dark showers, the sample covers 50+ distinct processes and >99% of the CMS Run-3 trigger menu in a single coherent format. To allow for effective real-time event interpretation, each event is provided twice, as Parquet files that retain physics-critical content:

  • Offline: a full CMS-like reconstruction emulated by a tuned Delphes card
  • L1T: a low-latency, lower-resolution view obtained via a custom Level-1 Trigger (L1T) parameterisation (degraded vertex, track and calorimeter performance, altered PUPPI, |η| ≤ 2.5 tracking, pT thresholds, etc.)
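
To make the relationship between the two views concrete, the following is a minimal sketch of how an offline-like track list could be degraded into an L1T-like one. It is only an illustration, not the dataset's actual parameterisation: the |η| ≤ 2.5 acceptance comes from the description above, while the `Track` class, the `to_l1t_view` function, the 2 GeV pT threshold and the 10% Gaussian pT smearing are assumptions chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Track:
    pt: float   # transverse momentum in GeV
    eta: float  # pseudorapidity
    phi: float  # azimuthal angle in radians


ETA_MAX = 2.5        # tracking acceptance quoted in the abstract
PT_MIN = 2.0         # assumed pT threshold in GeV (hypothetical value)
PT_REL_SMEAR = 0.10  # assumed relative pT resolution (hypothetical value)


def to_l1t_view(tracks, rng, eta_max=ETA_MAX, pt_min=PT_MIN,
                pt_rel_smear=PT_REL_SMEAR):
    """Return a degraded copy of `tracks`; the input list is left untouched."""
    degraded = []
    for track in tracks:
        # Acceptance cut: drop tracks outside the tracking coverage.
        if abs(track.eta) > eta_max:
            continue
        # Resolution: smear pT with a Gaussian of fixed relative width.
        smeared_pt = track.pt * rng.gauss(1.0, pt_rel_smear)
        # Threshold: drop tracks that fall below the pT cut after smearing.
        if smeared_pt < pt_min:
            continue
        degraded.append(Track(smeared_pt, track.eta, track.phi))
    return degraded


# Toy "offline" event with four tracks.
offline = [
    Track(35.0, 0.4, 1.2),
    Track(12.0, -2.9, 0.3),  # outside |eta| <= 2.5
    Track(0.8, 1.1, -2.0),   # far below the assumed pT threshold
    Track(5.0, 2.5, 3.0),    # exactly on the acceptance edge, kept
]
l1t = to_l1t_view(offline, random.Random(0))
print(f"offline tracks: {len(offline)}, L1T-like tracks: {len(l1t)}")
```

Passing the random generator in explicitly keeps the degradation reproducible, which matters when the same event has to be compared across the two views.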

As a proof of concept, COLLIDE-2V supports a wide spectrum of research applications, from few-shot transfer learning, fine-tuning, pileup mitigation, detector-level generative modelling and cross-experiment benchmarking to fast-simulation surrogates, real-time trigger inference and entirely novel anomaly detection, thereby accelerating the shift from handcrafted topology cuts to data-driven decision making throughout the HL-LHC program.

Authors

Eric Anton Moreno (Massachusetts Institute of Technology (US))
Philip Oliver Ploner (ETH Zurich (CH))
Ranit Das (Rutgers University)
Mr Ryan Liu (Lawrence Berkeley National Lab. (US))

Co-authors

Abhijith Gandrakota (Fermi National Accelerator Lab. (US))
Alex Tapper (Imperial College London)
Benedikt Maier (Imperial College (GB))
David Shih
Javier Mauricio Duarte (Univ. of California San Diego (US))
Jennifer Ngadiuba (FNAL)
Maciej Mikolaj Glowacki (CERN)
Miaoyuan Liu (Purdue University (US))
Philip Coleman Harris (Massachusetts Inst. of Technology (US))
Shiqi Kuang (Purdue University (US))
Thea Aarrestad (ETH Zurich (CH))
