4th EGEE User Forum/OGF 25 and OGF Europe's 2nd International Event

Name: 4th EGEE User Forum/OGF 25 and OGF Europe's 2nd International Event
Start: 2009-03-02T09:00:00+01:00
End: 2009-03-06T17:30:00+01:00
Location: Le Ciminiere, Catania, Sicily, Italy

2–6 Mar 2009

Le Ciminiere, Catania, Sicily, Italy

Europe/Rome timezone

Support

Kristina.Ulrika.Gunne@cern.ch

Monitoring the ATLAS distributed production

4 Mar 2009, 11:40

20m

Raffaello (80) (Le Ciminiere, Catania, Sicily, Italy)

Viale Africa 95100 Catania

Oral End-user environments and portal technologies High Energy Physics

Benjamin Gaidioz (CERN)

The LHC will soon produce real data, which makes efficient monitoring tools essential for the LHC experiments. We present the monitoring system of the distributed production system of ATLAS (a CERN LHC experiment). It collects information in near real time about the performance of each job execution and gives people on shift dedicated displays that let them quickly identify weak components and take the necessary actions.

Conclusions and Future Work

The system started as an application providing a handy interface for digging into information (in order to spot misbehaving sites, services, data files, etc.). It has since become more operational and, after a year of constant use by shifters, some tasks have been selected for automation. In particular, the system will be extended with an alert system that is triggered when a suspicious pattern is identified and that collects the information needed to investigate it. This is the main topic of current development.
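As an illustration of the kind of alert rule mentioned above, here is a minimal sketch. It is not the ATLAS dashboard implementation: the class names `SustainedFailureRule` and `Alert`, the threshold, the number of consecutive samples, and the error labels are all hypothetical. The rule fires when a site's failure rate stays above a threshold for several consecutive samples, and it attaches the most frequent error labels as context for the investigation.

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical alert payload; fields are illustrative only.
    site: str
    failure_rates: list
    top_errors: list  # [(error_label, count), ...]


class SustainedFailureRule:
    """Fire when a site's failure rate stays above `threshold`
    for `patience` consecutive samples (an assumed, simplified rule)."""

    def __init__(self, threshold=0.5, patience=3):
        self.threshold = threshold
        self.patience = patience
        self._history = {}  # site -> deque of the most recent failure rates

    def observe(self, site, failure_rate, error_labels):
        """Record one sample; return an Alert if the rule fires, else None."""
        hist = self._history.setdefault(site, deque(maxlen=self.patience))
        hist.append(failure_rate)
        if len(hist) == self.patience and all(r > self.threshold for r in hist):
            # Collect context for whoever investigates the alert.
            top = Counter(error_labels).most_common(3)
            return Alert(site, list(hist), top)
        return None
```

With `patience=3`, two bad samples in a row produce no alert and the third one does. The rule keeps firing on every later sample for as long as the condition holds, so a real system would probably also need some form of de-duplication.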

Detailed analysis

Now that real LHC data is coming, the execution of the production systems of the LHC experiments is becoming critical. They have to run efficiently and optimally 24 hours a day. The ATLAS production is run by a distributed system that uses EGEE, NorduGrid and OSG resources at the same time. Because of the heterogeneity of the underlying systems and tools, and the scale of the computation (100K jobs per day on 350 sites), monitoring such a system is a challenge.

In this paper we present the monitoring system that is in place today and is used by ATLAS shifters and experts around the world to ensure the stability and efficiency of the production system. We describe how information is collected from various places and how specific displays are produced that give an immediate overview of the status of the system and help identify misbehaving components.

URL for further information

http://dashb-atlas-prodsys-test.cern.ch/dashboard/request.py/shifters-display

Impact

The production system monitoring is now the central point for several communities: people on shift, production managers, and certain site administrators who want to follow ATLAS activities at their site closely and react when they notice a problem. The system collects fresh information every minute and produces optimized displays of short-period statistics (the last few hours) and of problems spotted automatically. Shifters instantly get the overview they need as the starting point for their daily work. Moreover, the monitoring system integrates with external tools: links to external systems let shifters check whether tickets have already been submitted, search the dedicated Elog, check the software installation system, get to job log files, etc. Managers and site administrators can follow how the activity has been performing on a larger time scale, and plots and statistics can be exported for weekly reports.
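The short-period statistics described above can be sketched as a per-site aggregation over a recent time window. This is only an illustration under stated assumptions: the `JobRecord` shape, the three-hour window, and the `min_jobs` cut-off are hypothetical and do not come from the actual dashboard.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobRecord:
    # Hypothetical record shape; field names are illustrative only.
    site: str
    finished_at: datetime
    succeeded: bool


def site_summary(records, now, window=timedelta(hours=3)):
    """Return {site: (total_jobs, failed_jobs, failure_rate)} for jobs
    that finished within `window` before `now`."""
    counts = defaultdict(lambda: [0, 0])  # site -> [total, failed]
    cutoff = now - window
    for rec in records:
        if cutoff <= rec.finished_at <= now:
            counts[rec.site][0] += 1
            if not rec.succeeded:
                counts[rec.site][1] += 1
    return {
        site: (total, failed, failed / total)
        for site, (total, failed) in counts.items()
    }


def worst_sites(summary, min_jobs=10, top=5):
    """Rank sites by failure rate, ignoring sites with too few jobs."""
    eligible = [(s, v) for s, v in summary.items() if v[0] >= min_jobs]
    eligible.sort(key=lambda item: item[1][2], reverse=True)
    return [s for s, _ in eligible[:top]]
```

Recomputing `site_summary` each time fresh data arrives and showing `worst_sites` first is one simple way to surface the sites that most need attention. The `min_jobs` cut-off keeps sites with only a handful of jobs from dominating the ranking.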

Keywords

Monitoring, Grid job execution, distributed Monte Carlo Production

Benjamin Gaidioz (CERN)

Mr Alexander Read (University of Oslo) Mr Ricardo Rocha (CERN) Simone Campana (CERN) Mr Xavi Espinal (PIC/IFAE)

Slides

bgaidioz.pdf
