Minutes & Key Points

1. Opening and Meeting Goals

First topical meeting dedicated to Anomaly Detection (AD) within the Prompt BSM WG.
Goals of the topical meeting series:

Brainstorm what has been done so far, next steps, and perspectives.
Form task forces to agree on:

Data preservation & reinterpretation
Uncertainties
Results presentation
Benchmarks and validation tests

Collect inputs and define guidelines to be summarized in a public document.

Upcoming topical meetings:

Oct 7, 2025 – Heavy Resonances
Later: pMSSM SUSY, VLF, LQ, HNL.

Action Items:

Contact conveners if interested in leading task forces on specific topics.

2. Theoretical Overview – David Shih

Topic: ML-powered model-agnostic anomaly detection searches at the LHC.

Key Points:

Motivation: The LHC experiments have produced thousands of model-specific searches, yet new physics remains elusive. Model-agnostic ML searches likely offer untapped discovery potential.
Types of anomalies:

Outliers – rare, extreme deviations (autoencoders effective here).
Overdensities – excesses over smooth backgrounds (weak supervision, density estimation). Both approaches are complementary.

Autoencoders (AE):

Learn to reconstruct background events → large reconstruction error indicates anomaly.
Demonstrated sensitivity in separating anomalous top and gluino jets from QCD jets.
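
Illustrative sketch (not from the talk) of the reconstruction-error idea. It assumes a toy background that lives close to a 2-dimensional subspace of 5 features, and it uses a linear autoencoder (equivalent to PCA) in place of a neural network; all names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy background: 5 features lying close to a 2-dimensional subspace.
n_features, n_latent = 5, 2
mixing = rng.normal(size=(n_latent, n_features))

def sample_background(n):
    latent = rng.normal(size=(n, n_latent))
    return latent @ mixing + 0.05 * rng.normal(size=(n, n_features))

background = sample_background(5000)

# "Training" on background only: a linear autoencoder, fitted via SVD (PCA).
mean = background.mean(axis=0)
_, _, vt = np.linalg.svd(background - mean, full_matrices=False)
components = vt[:n_latent]  # encoder rows; the decoder is the transpose

def anomaly_score(x):
    """Mean squared reconstruction error per event."""
    centered = x - mean
    reconstructed = centered @ components.T @ components
    return np.mean((centered - reconstructed) ** 2, axis=1)

# Background-like test events vs. events that do not follow the background structure.
bkg_scores = anomaly_score(sample_background(1000))
sig_scores = anomaly_score(rng.normal(size=(1000, n_features)))

print("median score, background:", np.median(bkg_scores))
print("median score, anomalies: ", np.median(sig_scores))
```

Events that the background-trained model cannot reconstruct get a large score; a real analysis would replace the PCA step with a trained network and physics inputs.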

Overdensity methods:

Learn the ratio R(x) = p_data(x) / p_bg(x).
Techniques: CWoLa, ANODE, SALAD, CATHODE, etc.
Proof-of-concept results: e.g., CATHODE enhances dijet anomalies from ~2σ to ~30σ significance.
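
Illustrative sketch (not from the talk) of what R(x) looks like in one dimension. The methods listed above learn the ratio with classifiers or density estimators; here a binned histogram ratio stands in for that step, and the toy spectrum, signal position, and sample sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Samples standing in for p_bg(x): a smooth, falling background.
background_template = rng.exponential(scale=1.0, size=200_000)

# "Data": the same background plus a small localized overdensity near x = 2.
data = np.concatenate([
    rng.exponential(scale=1.0, size=100_000),
    rng.normal(loc=2.0, scale=0.1, size=2_000),
])

bins = np.linspace(0.0, 3.0, 31)
p_data, _ = np.histogram(data, bins=bins, density=True)
p_bg, _ = np.histogram(background_template, bins=bins, density=True)

ratio = p_data / p_bg  # binned estimate of R(x) = p_data(x) / p_bg(x)
centers = 0.5 * (bins[:-1] + bins[1:])
most_anomalous_x = centers[np.argmax(ratio)]

print("largest R(x) found near x =", most_anomalous_x)
```

R(x) stays close to 1 where data and background agree and rises above 1 at the overdensity, which is what makes it usable as an anomaly score.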

Resonant AD: Combining anomaly scores with bump hunts (already applied in CMS/ATLAS dijet analyses).
Non-resonant AD:

More challenging; requires robust background estimation (ABCD with decorrelated autoencoders, latent space overdensity scores).
Early proof-of-concepts (e.g., CONRAD, dual autoencoders) show promise.
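
Illustrative sketch (not from the talk) of the ABCD estimate with two anomaly scores. It assumes the two scores are statistically independent for background, which is the property that decorrelated autoencoders aim for; the uniform toy scores and the 0.95 working points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy background: two independent anomaly scores per event.
n_bkg = 1_000_000
score1 = rng.uniform(size=n_bkg)
score2 = rng.uniform(size=n_bkg)

cut1, cut2 = 0.95, 0.95  # hypothetical working points
pass1 = score1 > cut1
pass2 = score2 > cut2

n_a = np.sum(pass1 & pass2)    # region A: anomalous in both scores (signal region)
n_b = np.sum(pass1 & ~pass2)   # region B: anomalous in score 1 only
n_c = np.sum(~pass1 & pass2)   # region C: anomalous in score 2 only
n_d = np.sum(~pass1 & ~pass2)  # region D: anomalous in neither

# ABCD estimate of the background in A, valid if the scores are independent.
predicted_a = n_b * n_c / n_d

print("observed in A:", n_a, " predicted from B, C, D:", predicted_a)
```

Any residual correlation between the two scores on background biases the prediction, so the closure of this estimate is itself something to validate.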

Trigger-level AD:

Fast autoencoder-based triggers (CICADA, AXOL1TL, GELATO).
Potential complementary approaches with online generative modeling.

Suggestions for Reinterpretation:

AE-based searches: publish anomaly score function → theorists can inject signals and reweight.
Overdensity-based searches: more difficult; require publishing background models/events and compressed data features so theorists can retrain anomaly scores.

Q&A Highlights

Mario Campanelli: Asked for clarification on the generative model before the trigger.

Response (D. Shih): The idea is to train a generative model on buffered data pre-trigger, then generate synthetic events for offline searches or scouting-like analyses.

Javier Jiménez Peña: Asked about “double independent autoencoders.”

Response: With two decorrelated autoencoders, an anomaly that manifests in multiple features can be flagged by both, while the two scores remain independent for background → enabling ABCD background estimation.

Jack Harrison (ATLAS): Mentioned ATLAS recently published a non-resonant AD search in multilepton final states; this should be added to references.
Vilius Čepaitis: Raised concern about the computational cost of retraining overdensity models for reinterpretations.

Response (D. Shih): Agreed this is an important challenge; possible need for heuristics or surrogate models to reduce computational overhead.

3. ATLAS Anomaly Detection Overview – Vilius Čepaitis (on behalf of ATLAS)

Key Points:

ATLAS has completed six public AD analyses with Run-2 data; no significant excess observed.
Covered a spectrum of techniques: unsupervised (autoencoders, normalizing flows), weakly-supervised (CWoLa), semi-supervised (ANTELOPE), and dedicated AD triggers (GELATO).
Examples presented:

Y→XH analysis: VRNN-AE anomaly score alongside dedicated Higgs tagging regions.
jet+X states: AE on the rapidity–mass matrix; selection of the 1% most anomalous events; ADFilter tool released for public reinterpretation.
Multilepton anomalies: Normalizing flow with kinematic features; 16 anomaly regions.
Semi-visible jets (SVJ): Semi-supervised with ANTELOPE.
CWoLa round 1 & 2: Iterative improvements using CURTAINS and SALAD for background templates.

Feature sensitivity: Input choice strongly affects performance; BDTs may be more robust than NNs in some cases.
Validation strategies: ATLAS uses combinations of MC validation, topological control regions, low-anomaly CRs, and pseudo-data with generative models.
Benchmarking: No single optimal AD method; suggests using “standard candle” BSM signals or mixed validation sets to benchmark new techniques.
Result presentation: Different combinations used (BumpHunter p-values, model-dependent and model-independent limits). Model-independent results and public tools (e.g., ADFilter) are especially valuable.
Uncertainties: Besides the usual systematic uncertainties, AD methods introduce a stochastic uncertainty from the training itself. Ensembles (multiple trainings with different seeds) can quantify this.
ATLAS is preparing internal AD guidelines covering scope, validation, reinterpretation, and result presentation.
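
Illustrative sketch (not from the talk) of quantifying the stochastic training uncertainty with an ensemble of seeds. The function train_scorer is a hypothetical stand-in: the seed controls a bootstrap resample and a Gaussian fit, in place of the random initialization and batch ordering of a real network. The threshold and sample sizes are assumptions.

```python
import numpy as np

def train_scorer(training_events, seed):
    """Hypothetical stand-in for one stochastic training.

    Fits a Gaussian to a seed-dependent bootstrap resample and returns a
    function that scores events by squared Mahalanobis distance.
    """
    rng = np.random.default_rng(seed)
    indices = rng.integers(0, len(training_events), len(training_events))
    sample = training_events[indices]
    mean = sample.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(sample, rowvar=False))

    def score(x):
        d = x - mean
        return np.einsum("ij,jk,ik->i", d, inv_cov, d)

    return score

data_rng = np.random.default_rng(3)
training_events = data_rng.normal(size=(2000, 4))
test_events = data_rng.normal(size=(20000, 4))
threshold = 13.28  # roughly the 99th percentile of a chi-square with 4 d.o.f.

# One training per seed; the yield above a fixed threshold differs between trainings.
yields = np.array([
    np.sum(train_scorer(training_events, seed)(test_events) > threshold)
    for seed in range(10)
])

central = yields.mean()
training_spread = yields.std(ddof=1)  # seed-to-seed (stochastic) uncertainty

print("yield per seed:", yields)
print("ensemble mean:", central, " spread:", training_spread)
```

The spread across seeds can then be quoted next to the other systematic uncertainties, or reduced by averaging the ensemble scores.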

4. CMS Anomaly Detection Overview – Louis Moureaux (on behalf of CMS)

Key Points:

Scope of CMS AD efforts:

Data quality monitoring: ECAL autoencoder flags local detector anomalies (not physics).
Triggers:

AXOL1TL (global trigger objects) and CICADA (calorimeter towers), both autoencoder-based. Running at Level-1 with ~µs latency, sensitive across benchmarks.
By end of 2025, expected ~200 fb⁻¹ (AXOL1TL) and ~100 fb⁻¹ (CICADA). Next question: how to analyze these datasets.

Offline analyses:

Dijet resonance anomaly search (Run-2, 138 fb⁻¹) [2412.03747]: applied multiple AD methods (CWoLa Hunting, TNT, CATHODE(-b), VAE-QR, QUAK).

Strategy: retain ~1% most anomalous events, bump-hunt mjj.
Results: No significant excess. Limits improve over inclusive fits; dedicated searches still stronger.
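
Illustrative sketch (not CMS code) of the selection step in this strategy: keep the ~1% most anomalous events, then inspect the mjj spectrum. The toy spectrum, resonance mass, and score distributions are assumptions, and the bump-hunt fit itself is not included.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy sample: falling mjj background plus a narrow resonance with higher anomaly scores.
n_bkg, n_sig = 200_000, 400
mjj = np.concatenate([
    1500.0 + rng.exponential(scale=600.0, size=n_bkg),  # GeV, hypothetical spectrum
    rng.normal(loc=3000.0, scale=100.0, size=n_sig),    # hypothetical resonance
])
score = np.concatenate([
    rng.normal(loc=0.0, scale=1.0, size=n_bkg),
    rng.normal(loc=3.0, scale=1.0, size=n_sig),
])

# Keep the 1% most anomalous events.
threshold = np.quantile(score, 0.99)
selected = score > threshold

# mjj spectra before and after the selection (100 GeV bins), as input to a bump hunt.
bins = np.linspace(1500.0, 5000.0, 36)
inclusive_counts, _ = np.histogram(mjj, bins=bins)
selected_counts, _ = np.histogram(mjj[selected], bins=bins)

# Signal fraction in a window around the resonance, before and after the cut.
window = (mjj > 2800.0) & (mjj < 3200.0)
is_signal = np.arange(mjj.size) >= n_bkg
purity_before = np.mean(is_signal[window])
purity_after = np.mean(is_signal[window & selected])

print("signal fraction in window before cut:", purity_before)
print("signal fraction in window after cut: ", purity_after)
```

The cut removes most of the background while keeping a large part of the anomalous events, which is what lets the subsequent bump hunt improve over an inclusive fit.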

Methodology details:

VAE-QR: quantile regression to remove mjj sculpting.
QUAK: hybrid flows with signal priors, complementary to others.
Weak supervision requires signal injection to determine the signal efficiency; the retraining this involves is expensive.
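
Illustrative sketch (not CMS code) of why an mjj-dependent threshold removes sculpting. A per-bin quantile is used as a simple stand-in for quantile regression; the toy score that drifts with mjj, the binning, and the 10% working point are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy background whose anomaly score drifts upward with mjj.
n_events = 200_000
mjj = 1500.0 + rng.exponential(scale=600.0, size=n_events)  # GeV, hypothetical
score = rng.normal(size=n_events) + (mjj - 1500.0) / 1000.0

bins = np.linspace(1500.0, 4500.0, 16)  # 200 GeV bins
n_bins = len(bins) - 1
bin_index = np.digitize(mjj, bins) - 1
in_range = (bin_index >= 0) & (bin_index < n_bins)

# Flat cut: one global threshold, so the efficiency rises with mjj (sculpting).
flat_pass = score > np.quantile(score[in_range], 0.90)

# Binned stand-in for quantile regression: the threshold follows mjj.
thresholds = np.array([
    np.quantile(score[bin_index == i], 0.90) for i in range(n_bins)
])
qr_pass = np.zeros(n_events, dtype=bool)
qr_pass[in_range] = score[in_range] > thresholds[bin_index[in_range]]

def efficiency_per_bin(passed):
    """Fraction of events passing the selection in each mjj bin."""
    return np.array([passed[bin_index == i].mean() for i in range(n_bins)])

flat_eff = efficiency_per_bin(flat_pass)
qr_eff = efficiency_per_bin(qr_pass)

print("flat cut efficiency, first and last bin:", flat_eff[0], flat_eff[-1])
print("quantile cut efficiency, first and last bin:", qr_eff[0], qr_eff[-1])
```

With the mjj-dependent threshold the background efficiency is flat by construction, so the selected mjj spectrum keeps the shape of the inclusive one.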

Other studies: Boosted top quarks found with weak supervision; H(bb)+anomalous selection with ParticleNet.

Complementarity: The different AD methods are only weakly correlated with each other and are therefore complementary.

Open Issues Highlighted by CMS:

Non-resonant AD: Not yet tried in CMS; a natural link to EFTs was suggested.
Reinterpretation:

Limits from weak supervision might be easier to provide.
Key question: what information should CMS publish if an excess is found?
How to evaluate performance without benchmark models?

Methodology & uncertainties:

Best input features not yet clear.
Background estimation uncertainties critical.
Can weak supervision be extended to triggers?