Pedro Andrade (CERN)
Computing centres are currently facing a massive rise in virtualization and cloud computing. The Agile Infrastructure (AI) project is working to deliver new solutions to ease the management of the CERN Computing Centres. Part of the solution consists of a new common monitoring infrastructure which collects and manages monitoring data from all computing centre servers and associated software, as well as additional environment and facilities data (e.g. temperature and power consumption). The new monitoring system addresses requirements for a very large scale. Performance measurement will be implemented by gathering metric data from the entire Computing Centre. Linux host data will be collected by improving the Lemon (LHC Era Monitoring System) client so that it forwards metric data to a messaging layer responsible for data transport. Data from other metric sources (Windows servers, network data, non-host data) will be collected over the same messaging channel, and different visualization and data analytics solutions will be built on top of the collected data. Given the architectural similarities with the WLCG grid monitoring tools, such as the Service Availability Monitoring (SAM) system, the same AI monitoring model and technologies can also be applied to monitor grid resources. Another important function of the new monitoring system is to notify system administrators and service managers directly about errors and problems. These situations are handled and processed by a new operations workflow, the General Notification Infrastructure (GNI). Using messaging technology for the transport of monitoring messages, GNI allows multiple entities (currently Lemon for Linux servers, SCOM for Windows servers, and other isolated clients) to produce monitoring notifications, which are processed by an extensible set of notification consumers.
Today GNI provides a gateway to the CERN event management system, part of the CERN IT service management implemented in the Service-Now framework, and a notifications dashboard. In the future, a notifications analysis framework and other consumers (e.g. email, SMS) will be added. In this article, a high-level architecture overview of the new monitoring infrastructure is provided. The GNI operational tools developed and deployed to monitor the CERN Computing Centres (Meyrin and Wigner) are presented, as well as the future plans towards large-scale data analytics for the CERN Computing Centres and the WLCG grid infrastructure.
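
The producer/consumer pattern described for GNI can be illustrated with a minimal sketch. This is not the GNI implementation: the in-memory broker stands in for the real messaging layer, and all names here (`Notification`, `InMemoryBroker`, the topic name, the host names) are hypothetical and chosen only for illustration. It shows how several producers can publish notifications to one channel while an extensible set of consumers (here a dashboard and a ticket gateway) each receive every message.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class Notification:
    source: str   # producer name, e.g. "lemon" or "scom" (illustrative)
    host: str     # hypothetical host name
    message: str


class InMemoryBroker:
    """Stand-in for a messaging layer: fans each message out to all topic subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[
            str, List[Callable[[Notification], None]]
        ] = defaultdict(list)

    def subscribe(self, topic: str, consumer: Callable[[Notification], None]) -> None:
        # New consumers can be added without changing any producer.
        self._subscribers[topic].append(consumer)

    def publish(self, topic: str, notification: Notification) -> int:
        # Deliver to every subscriber and return how many received the message.
        consumers = self._subscribers[topic]
        for consumer in consumers:
            consumer(notification)
        return len(consumers)


broker = InMemoryBroker()

# Two illustrative consumers: a dashboard that stores notifications,
# and a gateway that turns them into ticket summaries.
dashboard: List[Notification] = []
tickets: List[str] = []
broker.subscribe("notifications", dashboard.append)
broker.subscribe(
    "notifications",
    lambda n: tickets.append(f"[{n.source}] {n.host}: {n.message}"),
)

# Two illustrative producers publishing to the same channel.
delivered = broker.publish(
    "notifications", Notification("lemon", "node001", "disk full")
)
broker.publish(
    "notifications", Notification("scom", "winsrv01", "service stopped")
)
```

Because producers only know the topic and not the consumers, adding a further consumer such as an email or SMS sender would amount to one more `subscribe` call in this sketch.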