12-16 April 2010
Uppsala University
Europe/Stockholm timezone

LHCb operations: organizations, procedures, tools and critical services.

Apr 14, 2010, 11:00 AM
Room X (Uppsala University)


Oral | Support services and tools for user communities | High Energy Physics


Dr Roberto Santinelli (CERN/IT/GD)


The proliferation of tools for monitoring both activities and the status of critical services, together with the pressing need for prompt reactions to problems affecting data taking, user analysis and scheduled activities (e.g. MC simulation), creates a need to better organize the huge amount of information available. The monitoring system for LHCb Grid computing relies on many heterogeneous and independent sources of information, which offer different views for a better understanding of problems, while an operations team and defined procedures have been put in place to handle them.

Conclusions and Future Work

This activity is a continuous work in progress. The evolution of the various components inevitably leads to new, unforeseen situations that have to be handled with new procedures or new monitoring tools. Nonetheless, the system is becoming more and more stable and the learning curve is reaching its plateau. We cannot exclude that DIRAC, which is far from being an ideal system, and/or the grid middleware it interfaces to will evolve, perhaps dramatically. The organization of LHCb operations must nevertheless guarantee, throughout the lifetime of the LHC, the right quality of access to the Grid.


All of the above aspects can really be sorted out only after years of experience running a system as complex as DIRAC interacting with the EGEE (or other) middleware stacks. The learning process that brought LHCb Grid operations to its current organization is in itself valuable and certainly worth sharing with other EGEE communities. This is even more important considering that it was achieved after numerous interactions with other HEP experiments facing similar problems. A well-trained and well-organized operations system for grid computing is crucial for LHCb: it allows the whole community to access the LHC data smoothly on the distributed computing infrastructures and to run their analyses on top of them. The procedures defined so far, the alarm mechanisms in place, the multi-layered support structure, the various tracking systems adopted, the organization of internal meetings and the adequate channeling of information to and from service and resource providers might together offer a useful operations-oriented platform for even more general use.

Detailed analysis

With the LHC taking real data, the "expert" Grid user has given way to a large number of "normal" users eager to use the LHCb infrastructure to access data on the Grid. This suddenly introduced a pressing need, on the one hand, to minimize service unavailability by proactively monitoring the services and, on the other hand, to make all relevant information for debugging problems immediately available to the first-line user support or to the shifter crew. Defining such a monitoring system is by far trickier than implementing it, and it necessarily has to start with a careful analysis of the whole system in its complexity in order to isolate the relevant aspects. This abstract summarizes the state of the art of LHCb Grid operations, emphasizing the reasons behind the various choices and the tools currently used to run our daily activities: the most common problems experienced over years of activity on the WLCG infrastructure, the services and their criticality, the procedures in place (some of them so well exercised and trustworthy that they can be made fully automatic), the relevant metrics to watch, the tools available and what is still missing.

Keywords: LHCb operations monitoring
URL for further information: http://lhcbweb.pic.es/DIRAC/info/general/diracOverview

Primary author

Dr Roberto Santinelli (CERN/IT/GD)
