Description
Data centers play a key role in High Energy Physics (HEP) experiments, which need to collect, process, and store large quantities of data. Given the scale and complexity of these computing infrastructures, spotting failures of any kind is not trivial. Traditional rule-based monitoring systems work well, but they can struggle in large, heterogeneous, and dynamic environments. It is important to identify issues as quickly as possible in order to react and minimize downtime, which can negatively affect data acquisition. In this work, we present an Artificial Intelligence-based technique for accurate, context-aware anomaly detection in the LHCb computing farm. We describe the methodologies, design choices, and tools adopted, and present results that highlight the benefits and limitations of this approach.