Dr Giuseppe Avolio (University of California Irvine (US))
The Trigger and DAQ (TDAQ) system of the ATLAS experiment is a very complex distributed computing system, composed of O(10000) applications running on more than 2000 computers. The TDAQ Controls system has to guarantee the smooth and synchronous operation of all TDAQ components and has to provide the means to minimize the downtime of the system caused by runtime failures, which are inevitable for a system of such scale and complexity. During data-taking runs, streams of information messages sent or published by TDAQ applications are the main source of knowledge about the correctness of running operations. The huge flow of operational monitoring data produced (with an average rate of O(1-10 kHz)) is constantly monitored by experts to detect problems or misbehavior. Given the scale of the system and the rates of data to be analyzed, automation of the Controls system functionality in the areas of operational monitoring, system verification, error detection and recovery is a strong requirement. It reduces the manpower needed for operations and ensures consistently high-quality problem detection and subsequent recovery. To accomplish its objective, the Controls system includes some high-level components based on advanced software technologies, namely rule-based expert system (ES) and complex event processing (CEP) engines. The chosen techniques make it possible to formalize, store and reuse the TDAQ experts' knowledge in the Controls framework and thus to assist the TDAQ shift crew in accomplishing its tasks. The DVS (Diagnostics and Verification System) and Online Recovery components are responsible for the automation of system testing and verification, diagnostics of failures and recovery procedures. These components are built on a common forward-chaining ES framework (based on the CLIPS expert system shell), which allows the behavior of a system to be programmed in terms of "if-then" rules and the knowledge base to be easily extended or modified.
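As a rough illustration of the forward-chaining "if-then" approach described above, the following Python sketch repeatedly fires rules whose conditions are satisfied by the known facts until nothing new can be derived. It is not CLIPS and not the DVS or Online Recovery code; the fact names, the rule names, the `Rule` class and the `forward_chain` helper are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Rule:
    """One hypothetical 'if-then' rule: if all conditions hold, assert the conclusion."""
    name: str
    conditions: FrozenSet[str]
    conclusion: str


def forward_chain(facts: Iterable[str], rules: List[Rule]) -> Tuple[Set[str], List[str]]:
    """Apply rules until no rule can add a new fact; return facts and fired rule names."""
    known = set(facts)
    fired: List[str] = []
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # A rule fires when all its conditions are known and it adds something new.
            if rule.conditions <= known and rule.conclusion not in known:
                known.add(rule.conclusion)
                fired.append(rule.name)
                changed = True
    return known, fired


# Hypothetical knowledge base: a diagnosis rule followed by a recovery rule.
RULES = [
    Rule("diagnose", frozenset({"app_not_responding", "host_reachable"}), "app_crashed"),
    Rule("recover", frozenset({"app_crashed"}), "restart_app"),
]

known, fired = forward_chain({"app_not_responding", "host_reachable"}, RULES)
print(sorted(known))
print(fired)
```

Extending the knowledge base in this sketch means appending another `Rule` to the list, which mirrors the point made above that rules can be added or modified without touching the inference loop.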
The core of the AAL (Automated monitoring and AnaLysis) component is a CEP (Complex Event Processing) engine implemented using ESPER in Java. The engine is loaded with a set of directives; it correlates and analyzes operational messages and events and produces operator-friendly alerts, helping TDAQ operators to react promptly in case of problems or to perform important routine tasks. The component is known to shifters as the "Shifter Assistant" (SA), and its introduction made it possible to reduce the number of shifters in the ATLAS control room. The design foresees a machine learning module to detect anomalies and problems that cannot be defined in advance. The described components are constantly used for ATLAS Trigger-DAQ system operations, and the knowledge base is growing as more expertise is acquired. By the end of 2011 the knowledge base used for TDAQ operations contained about 300 rules. The paper presents the design and current implementation of the components and also the experience of their use in the real operational environment of the ATLAS experiment.
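The kind of correlation a CEP directive expresses can be sketched, under simplifying assumptions, as a sliding time window over an event stream. The Python example below raises an alert when one source reports a given number of errors within a time window. It does not use ESPER or its query language, and the `RateAlert` class, the source name `ros-1`, the threshold and the window length are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class RateAlert:
    """Hypothetical directive: alert when a source emits `threshold` errors within `window_s` seconds."""

    def __init__(self, threshold: int, window_s: float) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._times: Dict[str, Deque[float]] = defaultdict(deque)

    def on_event(self, source: str, timestamp: float) -> Optional[str]:
        """Process one error event; return an alert message or None."""
        times = self._times[source]
        times.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window_s:
            times.popleft()
        if len(times) >= self.threshold:
            times.clear()  # start counting afresh after alerting
            return f"ALERT: {source} reported {self.threshold} errors within {self.window_s} s"
        return None


detector = RateAlert(threshold=3, window_s=10.0)
alerts = [detector.on_event("ros-1", t) for t in (0.0, 4.0, 8.0)]
print(alerts[-1])
```

Events that are spread further apart than the window never accumulate to the threshold, so isolated errors produce no alert in this sketch.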
The paper presents the design and implementation of some intelligent, expert-system-based TDAQ Controls components and also the experience of their use in the real operational environment of the ATLAS experiment.
Andrei Kazarov (B.P. Konstantinov Petersburg Nuclear Physics Institute - PNPI)