Conclusions and Future Work
This activity is a continuous work in progress. The evolution of the various components inevitably leads to new, unforeseen situations that have to be handled with new procedures or new monitoring tools. Nonetheless, the system is becoming more and more stable and the learning curve is reaching its plateau. We cannot exclude that DIRAC and/or the grid middleware it interfaces with, being far from an ideal system, will evolve, perhaps dramatically. The LHCb operations organization must nevertheless guarantee the right quality of access to the Grid throughout the lifetime of the LHC.
All of the above aspects can only really be sorted out after years of experience running a system as complex as DIRAC interacting with EGEE (or other) middleware stacks. The learning process that brought LHCb Grid Operations to its current organization is in itself valuable and can certainly be shared with other EGEE communities. This is even more important considering that it was also achieved after numerous interactions with other HEP experiments facing similar problems. The impact on LHCb of such a well-trained and well-organized operations system for grid computing is crucial: it allows the whole community to smoothly access the LHC data on the distributed computing infrastructures and to run their analyses on top of them. The procedures defined so far, the alarm mechanisms in place, the multi-layered support structure, the various tracking systems adopted, the organization of internal meetings, and the adequate channeling of information to and from service and resource providers might together offer a useful operations-oriented platform for even more general use.
With the LHC taking real data, the "expert" user of the Grid has given way to a large number of "normal" users eager to use the LHCb infrastructure to access data on the Grid. This suddenly introduced a pressing need to minimize service unavailability by proactively monitoring the services on the one hand, and by making all relevant information for debugging problems immediately available to the first-line user support or to the shifter crew on the other. Defining such a monitoring system is by far trickier than implementing it, and must start with a careful analysis of the whole system in all its complexity in order to isolate the relevant aspects. This abstract summarizes the state of the art of LHCb Grid operations, emphasizing the reasons behind the various choices and the tools currently in use to run our daily activities: the most common problems experienced over years of activity on the WLCG infrastructure, the services and their criticality, the procedures in place (some of them so well exercised and trusted that they can be made fully automatic), the relevant metrics to watch, the tools available, and what is still missing.
|Keywords||LHCb operations monitoring|
|URL for further information||http://lhcbweb.pic.es/DIRAC/info/general/diracOverview|