2–6 Mar 2009
Le Ciminiere, Catania, Sicily, Italy
Europe/Rome timezone

Monitoring the ATLAS distributed production

4 Mar 2009, 11:40
20m
Raffaello (80) (Le Ciminiere, Catania, Sicily, Italy)

Viale Africa 95100 Catania
Oral; End-user environments and portal technologies; High Energy Physics

Speaker

Benjamin Gaidioz (CERN)

Description

The fact that the LHC will soon produce real data underlines the importance of efficient monitoring tools for LHC experiments. We present the monitoring system of the distributed production system of ATLAS (a CERN LHC experiment). It collects information in near real time about the performance of each job execution and presents people on shift with dedicated displays that let them quickly identify weak components and take the necessary actions.

Conclusions and Future Work

The system began as an application providing a handy interface for digging into information (in order to spot misbehaving sites, services, data files, etc.). It has since become more operational and, after a year of constant usage by shifters, some tasks have been selected for automation. In particular, the system will be extended with an alert system that is triggered when a suspicious pattern is identified and that collects the information needed to investigate it. This is the main topic of current development.
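
As an illustration of what such an alert rule might look like, here is a minimal Python sketch. It is not the dashboard's actual implementation: the record fields (`site`, `error`, `job_id`), the function name `find_alerts`, and the choice of "the same error repeated at the same site" as the suspicious pattern are all assumptions made for this example.

```python
from collections import defaultdict


def find_alerts(failed_jobs, min_count=3):
    """Group failed jobs by (site, error) and flag groups of at least min_count.

    Hypothetical rule: a repeated identical error at one site is treated as
    a suspicious pattern. Each alert carries the job ids needed to start
    an investigation.
    """
    groups = defaultdict(list)
    for job in failed_jobs:
        groups[(job["site"], job["error"])].append(job["job_id"])

    # Sort the groups so the alert order is deterministic.
    return [
        {"site": site, "error": error, "job_ids": ids}
        for (site, error), ids in sorted(groups.items())
        if len(ids) >= min_count
    ]


if __name__ == "__main__":
    # Made-up records, for demonstration only.
    sample = [
        {"job_id": 1, "site": "SITE_A", "error": "STAGEIN_FAILED"},
        {"job_id": 2, "site": "SITE_A", "error": "STAGEIN_FAILED"},
        {"job_id": 3, "site": "SITE_A", "error": "STAGEIN_FAILED"},
        {"job_id": 4, "site": "SITE_B", "error": "TIMEOUT"},
    ]
    for alert in find_alerts(sample):
        print(alert)
```

Attaching the job identifiers to each alert reflects the stated goal that an alert should arrive together with the information needed to investigate it.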

Impact

The production system monitoring is now the central point for several communities: people on shift, production managers, and certain site administrators who want to closely follow ATLAS activities at their site and react when they notice a problem. The system collects fresh information every minute and produces optimized displays of short-period statistics (the last few hours) and of problems spotted automatically. Shifters instantly get the overview they need as the starting point for their daily work. Moreover, the monitoring system integrates with external tools: links to external systems let shifters check whether tickets have already been submitted, search the dedicated Elog, check the software installation system, get to job log files, etc. Managers and site admins can follow how the activity has performed over a longer time scale, and plots and statistics can be exported for weekly reports.
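
To make the idea of short-period statistics concrete, the following Python sketch computes per-site failure rates over a recent time window. It is only an illustration under stated assumptions: the tuple layout `(site, end_time, status)`, the status string `"failed"`, the three-hour default window, and the function name `site_failure_rates` are hypothetical and are not taken from the actual dashboard.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def site_failure_rates(jobs, now, window=timedelta(hours=3)):
    """Return {site: (total, failed, failure_rate)} for recently ended jobs.

    jobs is an iterable of (site, end_time, status) tuples; only jobs whose
    end_time lies within `window` before `now` are counted.
    """
    counts = defaultdict(lambda: [0, 0])  # site -> [total, failed]
    for site, end_time, status in jobs:
        if now - window <= end_time <= now:
            counts[site][0] += 1
            if status == "failed":
                counts[site][1] += 1

    return {
        site: (total, failed, failed / total)
        for site, (total, failed) in counts.items()
    }


if __name__ == "__main__":
    # Made-up records, for demonstration only.
    now = datetime(2009, 3, 4, 12, 0)
    jobs = [
        ("SITE_A", now - timedelta(hours=1), "failed"),
        ("SITE_A", now - timedelta(hours=2), "finished"),
        ("SITE_B", now - timedelta(hours=5), "failed"),  # outside the window
        ("SITE_B", now - timedelta(minutes=30), "finished"),
    ]
    for site, stats in sorted(site_failure_rates(jobs, now).items()):
        print(site, stats)
```

Recomputing such a summary each time fresh data arrives would give a display that always reflects the last few hours; longer-term views for managers could reuse the same function with a larger `window`.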

URL for further information

http://dashb-atlas-prodsys-test.cern.ch/dashboard/request.py/shifters-display

Keywords

Monitoring, Grid job execution, distributed Monte Carlo Production

Detailed analysis

Now that real LHC data is coming, the operation of the production systems of the LHC experiments is becoming critical. A production system has to run efficiently and optimally 24 hours a day. The ATLAS production is run by a distributed system that uses EGEE, NorduGrid and OSG resources at the same time. Due to the heterogeneity of the underlying systems and tools, and the scale of the computation (100K jobs per day on 350 sites), monitoring such a system is a challenge.
In this paper we present the monitoring system that is in place today and is used by ATLAS shifters and experts around the world to ensure the stability and efficiency of the production system. We describe how information is collected from various places and how specific displays are produced that give an immediate overview of the status of the system and help identify the misbehaving components.

Author

Benjamin Gaidioz (CERN)

Co-authors

Mr Alexander Read (University of Oslo), Mr Ricardo Rocha (CERN), Simone Campana (CERN), Mr Xavi Espinal (PIC/IFAE)