11–15 Mar 2024
Charles B. Wang Center, Stony Brook University
US/Eastern timezone

Boosting statistical anomaly detection via multiple test with NPLM

13 Mar 2024, 15:10
20m
Lecture Hall 2 ( Charles B. Wang Center, Stony Brook University )

100 Circle Rd, Stony Brook, NY 11794
Oral, Track 2: Data Analysis - Algorithms and Tools

Speaker

Dr Gaia Grosso (IAIFI, MIT)

Description

Statistical anomaly detection empowered by AI is a subject of growing interest at collider experiments, as it provides multidimensional and highly automated solutions for signal-agnostic data quality monitoring, data validation and new physics searches.
AI-based anomaly detection techniques mainly rely on unsupervised or semi-supervised machine learning tasks. One of the most crucial and still unaddressed challenges of these applications is how to optimize the chances of detecting unexpected anomalies when no prior knowledge about their nature is available.
In this presentation we show how to exploit multiple tests to improve sensitivity to rare anomalies of different natures. We focus on a kernel-methods-based implementation of the NPLM algorithm, a signal-agnostic goodness-of-fit test based on an ML approximation of the likelihood ratio test [1, 2].
First, we show how performing multiple tests with different model configurations on the same data allows us to work around the problem of hyperparameter tuning while also improving the algorithm’s chance of discovery. Second, we show how multiple samples of streamed data can be optimally exploited to increase sensitivity to rare signals.
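As an illustration of the first point, the sketch below combines the outcomes of several tests run on the same data with different hyperparameter configurations. It is a minimal example under stated assumptions, not the method presented in the talk: it uses a generic min-p combination calibrated on reference-only toys, the function names `tail_pvalue` and `min_p_combination` are hypothetical, and the test statistics are drawn from a chi-square distribution as stand-ins for real NPLM outputs.

```python
import numpy as np


def tail_pvalue(null, t, exclude_self=False):
    """Right-tail empirical p-value of statistic(s) t against a null sample.

    If exclude_self is True, each t is assumed to be one of the entries of
    `null`, so it is compared with the other entries plus itself.
    """
    null = np.sort(np.asarray(null, dtype=float))
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # Number of null entries greater than or equal to each t.
    n_ge = null.size - np.searchsorted(null, t, side="left")
    if exclude_self:
        return n_ge / null.size
    return (n_ge + 1.0) / (null.size + 1.0)


def min_p_combination(t_ref, t_obs):
    """Combine tests from several configurations via the minimum p-value.

    t_ref: array (n_toys, n_configs), statistics from reference-only toys;
           row i holds all configurations evaluated on the same toy i.
    t_obs: array (n_configs,), statistics observed on the data.
    Returns the per-configuration p-values and the combined p-value.
    """
    t_ref = np.asarray(t_ref, dtype=float)
    t_obs = np.asarray(t_obs, dtype=float)
    n_toys, n_configs = t_ref.shape

    # Per-configuration p-values for the data and for each toy.
    p_obs = np.array(
        [tail_pvalue(t_ref[:, j], t_obs[j])[0] for j in range(n_configs)]
    )
    p_ref = np.column_stack(
        [tail_pvalue(t_ref[:, j], t_ref[:, j], exclude_self=True)
         for j in range(n_configs)]
    )

    # Calibrate the minimum p-value on the toys, so that correlations
    # between configurations are accounted for empirically.
    min_ref = p_ref.min(axis=1)
    p_global = (np.sum(min_ref <= p_obs.min()) + 1.0) / (n_toys + 1.0)
    return p_obs, p_global


# Stand-in example: 1000 toys, 3 configurations (all values are synthetic).
rng = np.random.default_rng(0)
toys = rng.chisquare(df=10, size=(1000, 3))
observed = np.array([12.0, 30.0, 9.0])
p_per_config, p_combined = min_p_combination(toys, observed)
print(p_per_config, p_combined)
```

Because the combined p-value is calibrated on toys where all configurations see the same pseudo-data, no single configuration has to be chosen in advance, and the look-elsewhere effect of trying several is absorbed into the calibration.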
These findings enable fast, efficient, and more sensitive applications of the NPLM algorithm to larger and potentially more inclusive datasets, both offline and quasi-online.
For low-dimensional problems, we show that this tool acts as a powerful diagnostic and compression algorithm. Furthermore, we find that the agnostic nature of the strategy becomes especially relevant when the input data representation results from unsupervised ML algorithms, whose response to anomalies cannot be predicted.

References

Previous work related to the topic:
https://link.springer.com/article/10.1140/epjc/s10052-022-10830-y
https://arxiv.org/abs/2305.14137
https://iopscience.iop.org/article/10.1088/2632-2153/acebb7

Significance

The proposed strategies are new developments of the algorithm that have not been published yet. The tests carried out for this work show improved results over a set of benchmarks with respect to the previous implementation of the algorithm.

Experiment context, if any: CMS

Primary authors

Dr Gaia Grosso (IAIFI, MIT)
Dr Marco Letizia
Philip Coleman Harris (Massachusetts Inst. of Technology (US))
