Mr Costin Grigoras (CERN)
A complex software environment such as the ALICE Computing Grid infrastructure requires permanent control and management for the large set of services involved. Automating control procedures reduces the human interaction with the various components of the system and yields better availability of the overall system. In this paper we will present how we used the MonALISA framework to gather, store and display the relevant metrics in the entire system from central and remote site services. We will also show the automatic local and global procedures that are triggered by the monitored values. Decision-taking agents are used to restart remote services, alert the operators in case of problems that cannot be automatically solved, submit production jobs, replicate and analyze raw data, resource load-balance and other control mechanisms that optimize the overall work flow and simplify day-to-day operations. Synthetic graphical views for all operational parameters, correlations, state of services and applications as well as the full history of all monitoring metrics are available for the entire system that now encompasses 80 sites all over the world, more than 10000 CPUs and 10PB of storage.
|Presentation type (oral | poster)||oral|