Speaker
Dr
Giuseppe Avolio
(University of California Irvine (US))
Description
The Trigger and DAQ (TDAQ) system of the ATLAS experiment is a very
complex distributed computing system, composed of O(10000)
applications running on more than 2000 computers. The TDAQ Controls
system has to guarantee the smooth and synchronous operation of all
TDAQ components and has to provide the means to minimize the downtime
of the system caused by runtime failures, which are inevitable for a
system of such scale and complexity.
During data-taking runs, streams of information messages sent or
published by TDAQ applications are the main source of knowledge about
the correctness of running operations. The huge flow of operational
monitoring data produced (with an average rate of O(1-10 kHz)) is
constantly monitored by experts to detect problems or misbehavior.
Given the scale of the system and the rates of data to be analyzed,
the automation of the Controls system functionality in the areas of
operational monitoring, system verification, error detection and
recovery is a strong requirement. It reduces the manpower needed for
operations and ensures a consistently high quality of problem
detection and subsequent recovery.
To accomplish its objective, the Controls system includes some
high-level components which are based on advanced software
technologies, namely rule-based expert system (ES) and complex
event processing (CEP) engines. The chosen techniques make it possible
to formalize, store and reuse the TDAQ experts' knowledge in the
Controls framework and thus to assist the TDAQ shift crew in
accomplishing its task.
The DVS (Diagnostics and Verification System) and Online Recovery
components are responsible for the automation of system testing and
verification, diagnostics of failures and recovery procedures. These
components are built on top of a common forward-chaining ES framework
(based on the CLIPS expert system shell), which makes it possible to
program the behavior of a system in terms of “if-then” rules and to
easily extend or modify the knowledge base.
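
As an illustration of the forward-chaining "if-then" style described
above, here is a minimal sketch in plain Python. It is not the CLIPS
knowledge base used by DVS or Online Recovery: the fact names
("test-failed", "died", "suspect", "restart"), the two rules and the
function names are all hypothetical, chosen only to show how one rule's
conclusion can trigger another rule.

```python
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple

# A fact is a (predicate, subject) pair, e.g. ("died", "app-1").
# All predicates used here are hypothetical examples.
Fact = Tuple[str, str]
Rule = Callable[[FrozenSet[Fact]], List[Fact]]


def suspect_if_test_failed(facts: FrozenSet[Fact]) -> List[Fact]:
    """IF a test failed for X THEN X is suspect."""
    return [("suspect", subj) for pred, subj in facts if pred == "test-failed"]


def restart_if_suspect_and_died(facts: FrozenSet[Fact]) -> List[Fact]:
    """IF X is suspect AND X died THEN schedule a restart of X."""
    return [
        ("restart", subj)
        for pred, subj in facts
        if pred == "suspect" and ("died", subj) in facts
    ]


RULES: List[Rule] = [restart_if_suspect_and_died, suspect_if_test_failed]


def forward_chain(rules: Iterable[Rule], initial: Iterable[Fact]) -> Set[Fact]:
    """Apply rules repeatedly until no rule derives a new fact."""
    rules = list(rules)
    facts: Set[Fact] = set(initial)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            derived = set(rule(frozenset(facts))) - facts
            if derived:
                facts |= derived
                changed = True
    return facts


if __name__ == "__main__":
    known = {("test-failed", "app-1"), ("died", "app-1"), ("test-failed", "app-2")}
    for fact in sorted(forward_chain(RULES, known)):
        print(fact)
```

Extending the behavior means appending another rule function to the
list; the inference loop itself does not change, which is the property
the rule-based approach is chosen for.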
The core of the AAL (Automated monitoring and AnaLysis) component is a
CEP engine implemented using ESPER in Java. The engine is loaded with a
set of directives; it correlates and analyzes operational messages and
events and produces operator-friendly alerts, helping TDAQ operators
to react promptly in case of problems or to perform important routine
tasks. The component is known to shifters as the "Shifter Assistant"
(SA), and the introduction of the SA made it possible to reduce the
number of shifters in the ATLAS control room. The design foresees a
machine learning module to detect anomalies and problems that cannot
be defined in advance.
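
The kind of correlation a CEP directive expresses can be sketched as a
sliding time window over the message stream. The example below is plain
Python, not ESPER or its query language, and it is not one of the AAL
directives: the class name, the event fields (timestamp, source,
severity) and the thresholds are assumptions made for illustration. It
raises one alert when a single source emits a given number of ERROR
messages within a time window.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class ErrorBurstDetector:
    """Alert when one source emits `threshold` ERRORs within `window_s` seconds.

    Hypothetical stand-in for a CEP directive; field names are assumptions.
    """

    def __init__(self, window_s: float, threshold: int) -> None:
        self.window_s = window_s
        self.threshold = threshold
        # Per-source timestamps of recent ERROR messages, oldest first.
        self._recent: Dict[str, Deque[float]] = defaultdict(deque)

    def on_event(self, timestamp: float, source: str, severity: str) -> Optional[str]:
        """Feed one message; return an alert string or None."""
        if severity != "ERROR":
            return None
        times = self._recent[source]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while times and timestamp - times[0] > self.window_s:
            times.popleft()
        if len(times) >= self.threshold:
            times.clear()  # avoid re-alerting on the same burst
            return f"ALERT: repeated errors from {source}"
        return None


if __name__ == "__main__":
    detector = ErrorBurstDetector(window_s=5.0, threshold=3)
    stream = [
        (0.0, "app-1", "ERROR"),
        (1.0, "app-1", "INFO"),
        (2.0, "app-1", "ERROR"),
        (3.0, "app-2", "ERROR"),
        (4.0, "app-1", "ERROR"),
    ]
    for event in stream:
        alert = detector.on_event(*event)
        if alert:
            print(alert)
```

Condensing many low-level messages into a single operator-facing alert
is the point of the sketch; a real engine would evaluate many such
directives concurrently over the same stream.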
The described components are constantly used for the ATLAS Trigger-DAQ
system operations, and the knowledge base is growing as more expertise
is acquired. By the end of 2011 the size of the knowledge base used
for TDAQ operations was about 300 rules.
The paper presents the design and current implementation of the
components and also the experience of their use in the real operational
environment of the ATLAS experiment.
Summary
The paper presents the design and implementation of some intelligent, expert-system-based TDAQ Controls components and also the experience of their use in the real operational environment of the ATLAS experiment.
Primary author
Andrei Kazarov
(B.P. Konstantinov Petersburg Nuclear Physics Institute - PNPI)
Co-authors
Alina Corso Radu
(University of California Irvine (US))
Giovanna Lehmann Miotto
(CERN)
Dr
Giuseppe Avolio
(University of California Irvine (US))
Luca Magnoni
(CERN)