The ongoing integration of clouds into the WLCG raises the need for detailed health and performance monitoring of virtual resources, in order to prevent degraded service and interruptions caused by undetected failures. When working at scale, the diversity of existing monitoring tools can lead to a metric overflow: operators must manually collect and correlate data from several monitoring tools and frameworks, which leaves tens of different metrics per virtual machine to be interpreted and analysed constantly.
In this paper we present an ESPER-based standalone application that processes complex monitoring events coming from various sources and automatically interprets the data in order to issue alarms on the status of the resources, without interfering with the resources themselves or with the data sources. We describe how this application has been used with both commercial and non-commercial cloud activities, allowing operators to be alerted quickly and to react to VMs and clusters running with low CPU load and low network traffic, among other anomalies, which then results in either the recycling of the misbehaving VMs or fixes to the submission of the LHC experiments' workflows. Finally, we also present the pattern analysis mechanisms in use, as well as the surrounding Elastic and REST API interfaces through which the alarms are collected and served to users.
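To illustrate the kind of rule described above, the following is a minimal sketch of a sliding-window check that raises an alarm when a VM shows both low CPU load and low network traffic. It is a plain-Python stand-in, not the ESPER implementation used by the application: the `Sample` fields, the `IdleVmDetector` class, the alarm dictionary layout and all thresholds are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One monitoring event for a VM (hypothetical field names)."""
    vm: str
    timestamp: float   # seconds
    cpu_load: float    # fraction of CPU in use, 0.0 to 1.0
    net_kbps: float    # network traffic in kB/s


class IdleVmDetector:
    """Flags a VM whose windowed average CPU load and network traffic are both low.

    Thresholds and window length are illustrative assumptions.
    Samples are assumed to arrive in timestamp order for each VM.
    """

    def __init__(self, window_s=600.0, min_samples=3, cpu_max=0.05, net_max=10.0):
        self.window_s = window_s
        self.min_samples = min_samples
        self.cpu_max = cpu_max
        self.net_max = net_max
        self._windows = defaultdict(deque)  # one time window per VM

    def push(self, sample):
        """Add a sample; return an alarm dict if the VM looks idle, else None."""
        window = self._windows[sample.vm]
        window.append(sample)
        # Evict samples that have fallen out of the time window.
        while window and sample.timestamp - window[0].timestamp > self.window_s:
            window.popleft()
        # Too few samples to judge: avoid alarming on a single reading.
        if len(window) < self.min_samples:
            return None
        avg_cpu = mean(s.cpu_load for s in window)
        avg_net = mean(s.net_kbps for s in window)
        if avg_cpu < self.cpu_max and avg_net < self.net_max:
            return {
                "vm": sample.vm,
                "alarm": "low_cpu_and_network",
                "avg_cpu": avg_cpu,
                "avg_net_kbps": avg_net,
                "timestamp": sample.timestamp,
            }
        return None


if __name__ == "__main__":
    detector = IdleVmDetector()
    events = [
        Sample("vm-a", 0.0, 0.01, 2.0),
        Sample("vm-a", 120.0, 0.02, 1.0),
        Sample("vm-a", 240.0, 0.01, 3.0),
        Sample("vm-b", 240.0, 0.90, 500.0),
    ]
    for event in events:
        alarm = detector.push(event)
        if alarm is not None:
            print(alarm)
```

The detector only reads the event stream and keeps its own per-VM state, which mirrors the requirement that the analysis must not interfere with the monitored resources or the data sources. In a setup like the one described, the returned alarm dictionary would be the document forwarded to the alarm store and exposed to users.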
|Primary Keyword (Mandatory)||Artificial intelligence/Machine learning|
|Secondary Keyword (Optional)||Monitoring|