Sep 1 – 5, 2025
ETH Zurich
Europe/Zurich timezone

COLLIDE-2V - 750 Million Dual-View LHC Event Dataset for Low-Latency ML

Sep 3, 2025, 4:20 PM
20m
ETH Zurich

HIT E 51, Siemens Auditorium, ETH Zurich, Hönggerberg campus, 8093 Zurich, Switzerland
Standard Talk Contributed talks

Speaker

Eric Anton Moreno (Massachusetts Institute of Technology (US))

Description

Modern foundation models (FMs) have pushed the frontiers of language, vision, and multi-modal tasks by training ever-larger neural networks (NNs) on unprecedented volumes of data. The use of FMs has yet to be established in collider physics, which lacks both a comparably sized, general-purpose dataset on which to pre-train universal event representations and a clearly demonstrable need. Real-time event identification presents a possible need, because all collisions at the LHC must undergo fast classification and selection. To address the missing dataset, we construct a dual-view LHC collision dataset (COLLIDE-2V), a 50 TB public dataset comprising ~750 million proton-proton events generated with MadGraph + Pythia + Delphes under High-Luminosity LHC conditions (<μ> = 200). Spanning everything from minimum-bias and γ+jets to top, Higgs, di-boson, multi-boson, exotic long-lived signatures and dark showers, the sample covers 50+ distinct processes and >99% of the CMS Run-3 trigger menu in a single coherent format. To allow for effective real-time event interpretation, each event is provided twice, as Parquet files that retain physics-critical content:


  • Offline - a full CMS-like reconstruction emulated by a tuned Delphes card
  • L1T - a low-latency, lower-resolution view obtained via a custom Level-1 Trigger (L1T) parameterisation (degraded vertex, track and calorimeter performance, altered PUPPI, |η| ≤ 2.5 tracking, pT thresholds, etc.)
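
The relationship between the two views can be illustrated with a toy sketch. This is not the Delphes parameterisation used for COLLIDE-2V: the Track class, the pT threshold and the smearing width below are hypothetical values chosen for illustration, and only the |η| ≤ 2.5 tracking acceptance comes from the description above. The sketch derives a degraded "L1T-like" track list from an "offline-like" one and stores both under the same event identifier.

```python
import random
from dataclasses import dataclass


@dataclass
class Track:
    pt: float   # transverse momentum in GeV
    eta: float  # pseudorapidity


# Hypothetical settings for illustration; not taken from the COLLIDE-2V cards.
PT_MIN_GEV = 2.0       # assumed L1T track pT threshold
PT_RESOLUTION = 0.05   # assumed relative Gaussian pT smearing
ETA_MAX = 2.5          # tracking acceptance quoted in the abstract


def l1t_view(offline_tracks, rng):
    """Return a degraded copy of the offline tracks (toy L1T emulation)."""
    degraded = []
    for trk in offline_tracks:
        # Drop tracks outside the L1T tracking acceptance.
        if abs(trk.eta) > ETA_MAX:
            continue
        # Worsen the momentum resolution with a Gaussian smear.
        smeared_pt = trk.pt * rng.gauss(1.0, PT_RESOLUTION)
        # Apply the pT threshold after smearing.
        if smeared_pt < PT_MIN_GEV:
            continue
        degraded.append(Track(pt=smeared_pt, eta=trk.eta))
    return degraded


def make_dual_view(event_id, offline_tracks, rng):
    """Bundle both views of one event under a shared identifier."""
    return {
        "event_id": event_id,
        "offline": list(offline_tracks),
        "l1t": l1t_view(offline_tracks, rng),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    tracks = [Track(10.0, 0.5), Track(50.0, 3.0), Track(0.5, 1.0)]
    event = make_dual_view(1, tracks, rng)
    print(len(event["offline"]), "offline tracks,", len(event["l1t"]), "L1T tracks")
```

Keeping both views under one event identifier is what would let a model be trained on the lower-resolution input against targets defined in the full reconstruction.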

As a proof of concept, COLLIDE-2V supports a wide spectrum of research applications, including few-shot transfer learning, fine-tuning, pileup mitigation, detector-level generative modelling, cross-experiment benchmarking, fast simulation surrogates, real-time trigger inference, and entirely novel anomaly-detection approaches, thereby accelerating the shift from handcrafted topology cuts to data-driven decision making throughout the HL-LHC program.

Authors

Eric Anton Moreno (Massachusetts Institute of Technology (US)); Philip Oliver Ploner (ETH Zurich (CH)); Ranit Das (Rutgers University)

Co-authors

Abhijith Gandrakota (Fermi National Accelerator Lab. (US)); Alex Tapper (Imperial College London); Benedikt Maier (Imperial College (GB)); David Shih; Javier Mauricio Duarte (Univ. of California San Diego (US)); Jennifer Ngadiuba (FNAL); Maciej Mikolaj Glowacki (CERN); Mia Liu (Purdue University); Philip Coleman Harris (Massachusetts Inst. of Technology (US)); Shiqi Kuang (Purdue University (US)); Thea Aarrestad (ETH Zurich (CH))