Dr Raja Nandakumar (Rutherford Appleton Laboratory)
DIRAC, the LHCb community Grid solution, is intended to reliably run large data mining activities. The DIRAC system consists of services, which wait to be contacted to perform actions, and agents, which carry out periodic activities, that together direct jobs as required. An important part of ensuring the reliability of the infrastructure is the monitoring and logging of these distributed DIRAC systems. The monitoring collects information from two sources: the first is pinging the services and tracking the regular heartbeats of the agents, and the second is the analysis of error messages that are generated by both agents and services and collected by a logging system. This allows us to ensure that the components are running properly and to collect useful information about their operation. The process status is displayed using the SLS sensor mechanism, which also automatically plots various quantities and keeps a history of the system. A dedicated GridMap interface (ServiceMap) gives production shifters and experts an immediate, high-impact view of the status of all LHCb critical services, while offering the possibility to refer to the details of the SLS and SAM sensors. Error types and statistics provided by the logging service can be accessed via dedicated web interfaces on the DIRAC portal, or programmatically via the Python-based API and CLI.
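The two monitoring sources described above can be illustrated with a minimal Python sketch. It is not DIRAC code and does not use the DIRAC API: the `Heartbeat` class, the `agent_status` and `error_statistics` functions, the status labels, the grace factor, and the agent and service names are all hypothetical, chosen only to show the idea of classifying agents by heartbeat age and counting collected error messages per component.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Heartbeat:
    agent: str        # hypothetical agent name
    last_seen: float  # time of the last heartbeat, in seconds
    period: float     # expected interval between heartbeats, in seconds


def agent_status(hb: Heartbeat, now: float, grace: float = 2.0) -> str:
    """Classify an agent by the age of its last heartbeat (assumed thresholds)."""
    age = now - hb.last_seen
    if age <= hb.period:
        return "ok"
    if age <= grace * hb.period:
        return "late"
    return "stalled"


def error_statistics(log_records):
    """Count ERROR messages per (component, message) pair.

    Each record is an assumed (component, level, message) tuple.
    """
    return Counter(
        (component, message)
        for component, level, message in log_records
        if level == "ERROR"
    )


# Usage with made-up data.
now = 1000.0
beats = [
    Heartbeat("AgentA", last_seen=950.0, period=120.0),
    Heartbeat("AgentB", last_seen=400.0, period=120.0),
]
statuses = {hb.agent: agent_status(hb, now) for hb in beats}

records = [
    ("ServiceX", "ERROR", "connection refused"),
    ("ServiceX", "ERROR", "connection refused"),
    ("AgentA", "INFO", "cycle complete"),
]
stats = error_statistics(records)
```

In a real deployment the heartbeat times and log records would come from the monitoring and logging systems rather than from in-memory lists, and the thresholds would be tuned per component.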
Presentation type (oral | poster): poster