Site Status Board: WLCG monitoring from the experiment perspective
Presented by Jacobo TARRAGÓN CROS on 12 Apr 2010 from 16:03 to 16:18
Session: Infrastructure Tools and Services
Track: End-user environments, scientific gateways and portal technologies
Now that the LHC has started, the experiments require a high standard of reliability and performance in their computing activities. Monitoring these activities is not a trivial task, for two main reasons: first, determining whether a site behaves properly depends heavily on the software model of each experiment; second, the number of sites taking part in WLCG has increased drastically compared to previous HEP experiments.
The Site Status Board (SSB) web application, developed under the Experiment Dashboard framework, has been designed to provide an overall view of site performance from the experiment perspective. Designed originally for the LHC VOs, it allows each experiment to define a set of activities, also known as views. For each view, the experiment administrator can define the metrics that have to be collected. For instance, CMS currently has five different views ('computing shifters', 'site commissioning', 'space monitoring', ...). For the first view, the metrics include the number of running jobs, transfer status, availability of software on the site, etc. The SSB collects the status of the metrics over time and presents it in several formats. The SSB will also include pointers describing the possible errors and solutions if this information is provided. Thanks to the SSB, the organizations can analyse site statuses easily and, at the same time, keep track of how the metric results evolve.
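The view and metric model described above can be illustrated with a minimal Python sketch. It is not the SSB implementation or its API: the class names (`View`, `MetricSample`, `StatusBoard`), the status labels, the `hint_url` field and the site name `T2_Example` are all hypothetical, chosen only to show how views group metrics and how a status history yields the current state of each site.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class MetricSample:
    """One observation of a metric at a site (hypothetical record layout)."""
    site: str
    metric: str
    time: datetime
    status: str                     # assumed labels, e.g. "ok" or "error"
    hint_url: Optional[str] = None  # optional pointer to error/solution notes


@dataclass
class View:
    """An experiment-defined activity and the metrics collected for it."""
    name: str
    metrics: List[str]


@dataclass
class StatusBoard:
    """Keeps the views and the full history of metric samples."""
    views: Dict[str, View] = field(default_factory=dict)
    history: List[MetricSample] = field(default_factory=list)

    def add_view(self, view: View) -> None:
        self.views[view.name] = view

    def record(self, sample: MetricSample) -> None:
        self.history.append(sample)

    def latest(self, view_name: str) -> Dict[Tuple[str, str], MetricSample]:
        """Return the most recent sample per (site, metric) in a view."""
        wanted = set(self.views[view_name].metrics)
        result: Dict[Tuple[str, str], MetricSample] = {}
        for sample in self.history:
            if sample.metric not in wanted:
                continue  # metric belongs to another view
            key = (sample.site, sample.metric)
            if key not in result or sample.time > result[key].time:
                result[key] = sample
        return result


# Usage with made-up data: one view, one site, two samples of one metric.
board = StatusBoard()
board.add_view(View("computing shifters", ["running jobs", "transfer status"]))
board.record(MetricSample("T2_Example", "running jobs",
                          datetime(2010, 4, 12, 9, 0), "ok"))
board.record(MetricSample("T2_Example", "running jobs",
                          datetime(2010, 4, 12, 10, 0), "error"))
current = board.latest("computing shifters")
print(current[("T2_Example", "running jobs")].status)
```

Keeping every sample rather than only the latest one mirrors the abstract's point that metric statuses are collected over time, so the same history can serve both the current overview and historical browsing.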
The SSB is being widely used by CMS and LHCb for several activities: computing shifts, site commissioning and space monitoring in the case of CMS, and job and space monitoring in LHCb. ATLAS and ALICE are also evaluating the SSB.
Production-level services are being built using the Site Status Board. At the same time, the application is constantly evolving, since it needs to adapt to the experiments' growing needs. Future changes will focus on faster browsing of historical data, more reliable information gathering, and more flexible metric definitions.
grid, monitoring, site, status