Description
The operational control layer of the LHCb Experiment Control System (ECS) is built on WinCC Open Architecture (OA), which generates large volumes of logs. Currently, operators and shifters examine these logs manually to identify system errors. This process is time-consuming and tedious, and it requires expert knowledge.
Patterns in the logs are not easily discernible, making it difficult to identify events that lead to system failures. Machine learning can uncover such patterns, flag potential errors before they occur, and streamline error identification and communication to shifters.
We therefore propose a real-time anomaly detection system for LHCb operational logs that will support:
- Error identification and prediction
- Root-cause tracing
- Alarms and notifications to shifters
This system will reduce operator workload, improve reliability, and enable proactive responses to potential failures.
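As a rough illustration of the unsupervised side of such a system, the sketch below scores log messages by how rare their template is in historical data. It is a minimal sketch under stated assumptions, not the project's actual method: the masking rules, the `RarityScorer` class, and the sample messages are all hypothetical, and real WinCC OA log lines would need their own parsing.

```python
import math
import re
from collections import Counter

# Hypothetical masking rules: variable fields are replaced by placeholders
# so that messages differing only in IDs or counters share one template.
# The hex rule runs first so that "0x1F" is not split by the number rule.
_MASKS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def to_template(message: str) -> str:
    """Reduce a raw log message to a template with variable fields masked."""
    for pattern, token in _MASKS:
        message = pattern.sub(token, message)
    return message


class RarityScorer:
    """Scores a message by the negative log-probability of its template."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.total = 0

    def fit(self, messages) -> "RarityScorer":
        """Count template occurrences in historical (assumed normal) logs."""
        for message in messages:
            self.counts[to_template(message)] += 1
            self.total += 1
        return self

    def score(self, message: str) -> float:
        """Higher score means a rarer template; unseen templates score highest."""
        count = self.counts.get(to_template(message), 0)
        # Add-one smoothing keeps the score finite for unseen templates.
        vocabulary = len(self.counts) + 1
        probability = (count + 1) / (self.total + vocabulary)
        return -math.log(probability)

    def is_anomalous(self, message: str, threshold: float) -> bool:
        """Flag a message whose rarity score exceeds the given threshold."""
        return self.score(message) > threshold


# Hypothetical history; real input would come from the log stream.
history = [f"Heartbeat {i} ok" for i in range(50)]
history += [f"Connection to dp {i} established" for i in range(50)]
scorer = RarityScorer().fit(history)
```

A streaming service could call `scorer.is_anomalous(line, threshold)` on each incoming line and raise a notification for shifters when it returns `True`. The threshold would have to be tuned on labeled data, and a production system would likely replace this frequency model with the supervised and unsupervised models the project plans to develop.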
CERN group/Experiment
LHCb
| Working area | Area 7: Experimental Technologies |
|---|---|
| Project goals | Intermediate goals: implement log data preprocessing (PCA, clustering, exploration), labeling, and event mapping to shift DB logs; develop supervised and unsupervised ML models for anomaly detection; and build supporting software such as a monitoring UI, a log streaming API, and MLOps hooks. Final goals: integrate the system into the LHCb framework for testing and validation, and provide a real-time anomaly detection system that supports error identification, prediction, and root-cause tracing, with alarms and notifications for shifters. |
| Timeline | Months 0–12: data preprocessing and ML model development. Months 12–24: backend and frontend software development. Months 24–36: system integration, testing, and validation. |
| Available person power | 0.5 FTE (Doctoral Student) |
| Additional person power request | No additional person power |
| Is this an already ongoing activity? | Yes |
| Indicative hardware resource needs | The project will require access to LHCb Online computing resources: CPU nodes for log streaming and preprocessing, and GPU resources for ML model training and testing, with the scale depending on model complexity. |