Alessandro Di Girolamo (CERN); Fernando Harald Barreiro Megino (CERN IT ES)
The computing infrastructure of the LHC experiments is distributed across the computing centers of the Worldwide LHC Computing Grid and must run with high reliability. It is therefore crucial to offer shifters, who are generally not experts in the individual services, a unified view that lets them follow the status of resources and the health of critical systems, and alert the experts whenever a system becomes unavailable. Several experiments have chosen to build their service monitoring on top of the flexible Service Level Status (SLS) framework developed by CERN IT. Based on examples from ATLAS, CMS and LHCb, this contribution will describe the complete development process of a service monitoring instance and explain the deployment models that can be adopted. We will also describe the software package used in ATLAS Distributed Computing to send health reports through the MSG messaging system and publish them to SLS on a lightweight web server.
Alessandro Di Girolamo (CERN); Diego Da Silva Gomes (Universidade do Estado do Rio de Janeiro (BR)); Fernando Harald Barreiro Megino (CERN IT ES); José Flix; Peter Kreuzer (Rheinisch-Westfaelische Tech. Hoch. (DE)); Dr Stefan Roiser (CERN); Vincent Roger Yvan Bernardoff (Univ. P. et Marie Curie (Paris VI) (FR))