Speaker
Description
Conclusions and Future Work
Having started as an application that provides a handy interface for digging into information (in order to spot misbehaving sites, services, data files, etc.), the system has become more operational: after a year of constant usage by shifters, some tasks have been selected for automation. In particular, the system will be extended with an alert system that is triggered when a suspicious pattern is identified and that collects the information needed to investigate it. This is the main topic of current development.
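As an illustration only, the kind of alert rule described above could be sketched as follows. The record layout, the site names, the failure-rate threshold and the function name find_alerts are all assumptions made for this example; they are not taken from the actual ATLAS monitoring code. The sketch flags a site when its failure rate is suspiciously high and gathers the failed job identifiers and error codes a shifter would need to start investigating.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical job record; field names are assumptions for this sketch."""
    job_id: int
    site: str
    status: str          # "finished" or "failed"
    error_code: str = ""


def find_alerts(jobs, min_jobs=20, max_failure_rate=0.5):
    """Return one alert per site whose failure rate exceeds the threshold.

    Sites with fewer than `min_jobs` jobs are skipped so that a handful of
    failures on a nearly idle site does not raise an alert.
    """
    by_site = defaultdict(list)
    for job in jobs:
        by_site[job.site].append(job)

    alerts = []
    for site, site_jobs in sorted(by_site.items()):
        if len(site_jobs) < min_jobs:
            continue
        failed = [j for j in site_jobs if j.status == "failed"]
        rate = len(failed) / len(site_jobs)
        if rate > max_failure_rate:
            # Collect the information needed to investigate the problem.
            alerts.append({
                "site": site,
                "failure_rate": rate,
                "failed_job_ids": [j.job_id for j in failed],
                "error_counts": dict(Counter(j.error_code for j in failed)),
            })
    return alerts
```

A real rule would probably look at more than a single ratio (for example, the same error repeating on one data file), but the structure would be similar: detect the pattern, then attach the evidence to the alert.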
Impact
The production system monitoring is now the central point for several communities: people on shift, production managers, and certain site administrators who want to follow ATLAS activities at their site closely and react when they notice a problem. The system collects fresh information every minute and produces optimized displays of short-period statistics (the last few hours) and of problems spotted automatically. Shifters instantly get the overview they need as the starting point for their daily work. Moreover, the monitoring system integrates with external tools: links to external systems let shifters check whether tickets have already been submitted, search the dedicated Elog, check the software installation system, get to job log files, etc. Managers and site administrators can follow how the activity has been performing over a longer time scale, and plots and statistics can be exported for weekly reports.
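To make the idea of short-period statistics concrete, here is a minimal sketch of how per-site job counts over the last few hours might be computed from freshly collected records. The tuple layout (timestamp, site, status), the four-hour default window and the function name recent_site_summary are assumptions chosen for the example, not a description of the real implementation.

```python
from collections import Counter
from datetime import datetime, timedelta


def recent_site_summary(records, now, window=timedelta(hours=4)):
    """Count jobs per site and status within the last `window` before `now`.

    `records` is an iterable of (timestamp, site, status) tuples; this layout
    is a hypothetical stand-in for whatever the collectors actually store.
    """
    cutoff = now - window
    summary = {}
    for timestamp, site, status in records:
        # Ignore records outside the display window.
        if timestamp < cutoff or timestamp > now:
            continue
        summary.setdefault(site, Counter())[status] += 1
    return summary


if __name__ == "__main__":
    now = datetime(2010, 1, 1, 12, 0)
    records = [
        (datetime(2010, 1, 1, 11, 0), "SITE_A", "finished"),
        (datetime(2010, 1, 1, 10, 0), "SITE_A", "failed"),
        (datetime(2010, 1, 1, 5, 0), "SITE_A", "failed"),   # outside the window
    ]
    for site, counts in recent_site_summary(records, now).items():
        print(site, dict(counts))
```

Recomputing such a summary each time new information arrives (every minute, in the system described here) keeps the display focused on recent activity, while the same records can be aggregated over longer windows for weekly reports.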
URL for further information
http://dashb-atlas-prodsys-test.cern.ch/dashboard/request.py/shifters-display
Keywords
Monitoring, Grid job execution, distributed Monte Carlo Production
Detailed analysis
Now that real LHC data is coming, the operation of the production systems of the LHC experiments is becoming critical. They have to run efficiently and optimally 24 hours a day. ATLAS production is run by a distributed system that uses EGEE, NorduGrid and OSG resources at the same time. Because of the heterogeneity of the underlying systems and tools, and the scale of the computation (100K jobs per day on 350 sites), monitoring such a system is a challenge.
In this paper we present the monitoring system that is in place today and is used by ATLAS shifters and experts around the world to ensure the stability and efficiency of the production system. We describe how information is collected from various sources and how specific displays are produced, giving an immediate overview of the status of the system and making it possible to identify misbehaving components.