The ATLAS experiment at the Large Hadron Collider (LHC) at CERN relies on a complex and highly distributed Trigger and Data Acquisition (TDAQ) system to gather and select particle collision data produced at unprecedented energies and rates. The TDAQ Controls system is the component that guarantees the smooth and synchronous operation of all TDAQ components and provides the means to minimize system downtime caused by runtime failures.
Given the scale and complexity of the TDAQ system and the rates of data to be analysed, automating error detection and recovery is a strong requirement. For this reason, the Central Hint and Information Processor (CHIP) service was introduced in Run 2; it can be regarded as the “brain” of the TDAQ Controls system. CHIP is an intelligent system able to supervise ATLAS data taking, take operational decisions and handle abnormal conditions. It is based on ESPER, an open-source Complex Event Processing (CEP) engine. Currently, CHIP’s knowledge base comprises more than 300 rules organized in about 30 different contexts.
This paper will focus on the experience gained with CHIP throughout LHC Run 2. Particular attention will be paid to demonstrating how the use of CHIP for automation and error recovery proved to be a valuable asset in optimizing data-taking efficiency, reducing operational mistakes, handling complex scenarios efficiently and reducing the latency in reacting to abnormal situations. Additionally, the substantial benefits brought by the CEP engine, in terms of both flexibility and simplification of the knowledge base, will be reported.