5–9 Sept 2011
Europe/London timezone

The AAL project: Automated monitoring and intelligent AnaLysis for the ATLAS data taking infrastructure

5 Sept 2011, 16:30
25m
Parallel talk
Track 1: Computing Technology for Physics Research
Monday 05th - Computing Technology for Physics Research

Speaker

Mr Luca Magnoni (Conseil Europeen Recherche Nucl. (CERN))

Description

The Trigger and Data Acquisition (TDAQ) system of the ATLAS experiment at CERN is the infrastructure responsible for filtering and transferring ATLAS experimental data from the detectors to the mass storage system. It relies on a large, distributed computing environment comprising thousands of computing nodes with thousands of applications running concurrently. In such a complex environment, information analysis is fundamental for controlling application behavior, error reporting and operational monitoring. During data-taking runs, streams of messages sent by applications via the message reporting system, together with data published by applications via information services, are the main sources of knowledge about the correctness of running operations. The huge flow of data produced (with an average rate of O(1-10 kHz)) is constantly monitored by experts to detect problems or misbehavior. This requires strong competence and experience in understanding and discovering problems and their root causes, and often the meaningful information is not in a single message or update but in the aggregated behavior over a certain time span. The AAL project aims to reduce manpower needs and to ensure a constant high quality of problem detection by automating most of the monitoring tasks and providing real-time correlation of data-taking and system metrics. The project combines technologies from different disciplines: it relies on an Event Driven Architecture to unify the flow of data from the ATLAS infrastructure, on a Complex Event Processing (CEP) engine for correlation of events, and on a machine learning module to detect anomalies and problems that cannot be defined in advance.
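The point that the meaningful information often lies in aggregated behavior over time, rather than in a single message, can be illustrated with a minimal sliding-window rule. This is only a sketch under stated assumptions: the `Message` fields, the `WindowRule` class and the thresholds are hypothetical, and it is not the CEP engine or query language used by the project.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical message layout, chosen for illustration only.
    timestamp: float  # seconds since the start of the run
    application: str
    severity: str


class WindowRule:
    """Fires when one application emits at least `threshold` ERROR
    messages within the last `window_s` seconds."""

    def __init__(self, window_s, threshold):
        self.window_s = window_s
        self.threshold = threshold
        self._recent = {}  # application name -> deque of ERROR timestamps

    def feed(self, msg):
        """Process one message; return True if the rule fires on it."""
        if msg.severity != "ERROR":
            return False
        times = self._recent.setdefault(msg.application, deque())
        times.append(msg.timestamp)
        # Discard timestamps that have fallen out of the time window.
        while times and msg.timestamp - times[0] > self.window_s:
            times.popleft()
        return len(times) >= self.threshold


rule = WindowRule(window_s=10.0, threshold=3)
stream = [
    Message(0.0, "app-a", "ERROR"),
    Message(1.0, "app-b", "ERROR"),
    Message(4.0, "app-a", "ERROR"),
    Message(20.0, "app-a", "ERROR"),
    Message(22.0, "app-a", "ERROR"),
    Message(25.0, "app-a", "ERROR"),
]
alerts = [m.timestamp for m in stream if rule.feed(m)]
print(alerts)  # [25.0]
```

No single message in this stream is alarming on its own; the rule fires only on the third error from the same application inside a 10-second window.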
The project is composed of three main components: a core processing engine, responsible for correlation of events through expert-defined queries; a machine learning module to detect anomalies in an unsupervised manner; and a web-based front-end to present real-time information and interact with the system. All components work in a loosely coupled, event-based architecture, with a message broker centralizing all communication between modules. The result is an intelligent system able to extract and compute relevant information from the flow of operational data and to provide real-time feedback to human experts, who can promptly react when needed. The paper presents the design and implementation of the AAL project, together with the results of its usage as an automated monitoring assistant for the ATLAS data-taking infrastructure.
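The loosely coupled arrangement described above, in which modules talk only through a broker, might be sketched as follows. Everything here is an assumption made for illustration: the topic names, the class names, the in-process synchronous `Broker`, and the running-mean check that stands in for the machine learning module. None of it is taken from the AAL implementation.

```python
from collections import defaultdict


class Broker:
    """Toy in-process publish/subscribe broker (hypothetical)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of handlers

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        # Deliver synchronously to every handler registered for the topic.
        for handler in list(self._subscribers[topic]):
            handler(payload)


class CorrelationEngine:
    """Stand-in for expert-defined queries: turns FATAL messages into alerts."""

    def __init__(self, broker):
        self._broker = broker
        broker.subscribe("messages", self.on_message)

    def on_message(self, msg):
        if msg["severity"] == "FATAL":
            self._broker.publish(
                "alerts", f"fatal message from {msg['application']}"
            )


class AnomalyDetector:
    """Stand-in for the learning module: flags a rate that exceeds
    `factor` times the running mean of the rates seen so far."""

    def __init__(self, broker, factor=3.0):
        self._broker = broker
        self._factor = factor
        self._count = 0
        self._total = 0.0
        broker.subscribe("metrics", self.on_metric)

    def on_metric(self, rate):
        if self._count > 0 and rate > self._factor * (self._total / self._count):
            self._broker.publish("alerts", f"unusual rate {rate}")
        self._count += 1
        self._total += rate


class FrontEnd:
    """Stand-in for the web front-end: collects alerts for display."""

    def __init__(self, broker):
        self.shown = []
        broker.subscribe("alerts", self.shown.append)


broker = Broker()
engine = CorrelationEngine(broker)
detector = AnomalyDetector(broker)
front_end = FrontEnd(broker)

broker.publish("messages", {"application": "app-a", "severity": "INFO"})
broker.publish("messages", {"application": "app-b", "severity": "FATAL"})
for rate in (100.0, 110.0, 90.0, 900.0):
    broker.publish("metrics", rate)

print(front_end.shown)
# ['fatal message from app-b', 'unusual rate 900.0']
```

No component holds a reference to another; each knows only the broker and the topics it uses, so a module can be replaced or added without touching the rest. A production broker would typically be a separate, asynchronous service rather than a synchronous in-process object.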

Primary author

Mr Luca Magnoni (Conseil Europeen Recherche Nucl. (CERN))

Co-authors

Mr Andrei Kazarov (St. Petersburg, INP)
Dr Giovanna Lehmann Miotto (Conseil Europeen Recherche Nucl. (CERN))
