Conclusions and Future Work
We will present the most common problems encountered and the expected future evolution towards more efficient use of data, resources and manpower, and improved communication between the sites and the experiment. One of the targets is automation at all levels, starting from the monitoring and alarming systems and extending to automated actions triggered in the production and data management systems. A set of functional tests is now in place that can serve as a reference for this work in the future.
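As a purely illustrative sketch of the kind of alarm-driven automation aimed for above, the following Python fragment turns functional-test results into alarms and automated actions. All names (FunctionalTestResult, exclude_site, notify_shifters) and the threshold value are hypothetical placeholders, not the actual ATLAS operations tooling.

from dataclasses import dataclass

# Illustrative threshold below which a site is considered unhealthy.
EFFICIENCY_THRESHOLD = 0.80


@dataclass
class FunctionalTestResult:
    site: str          # site name, e.g. a Tier-1 or Tier-2
    efficiency: float  # fraction of successful test jobs/transfers


def notify_shifters(site: str, efficiency: float) -> None:
    """Placeholder for raising an alarm to the daily operations shift."""
    print(f"[alarm] {site} efficiency {efficiency:.0%} below threshold")


def exclude_site(site: str) -> None:
    """Placeholder for an automated action towards the production system."""
    print(f"[action] excluding {site} from production brokering")


def process_results(results: list[FunctionalTestResult]) -> None:
    """Turn functional-test results into alarms and automated actions."""
    for result in results:
        if result.efficiency < EFFICIENCY_THRESHOLD:
            notify_shifters(result.site, result.efficiency)
            exclude_site(result.site)


if __name__ == "__main__":
    process_results([
        FunctionalTestResult("SITE_A", 0.95),
        FunctionalTestResult("SITE_B", 0.40),
    ])

In such a scheme the functional tests provide the reference measurement, and the automated action replaces a manual intervention by the shifter.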
Detailed analysis
Reliable performance of the whole ATLAS distributed computing system is of crucial importance for meeting the ambitious physics goals of the ATLAS experiment, and the distributed computing software and monitoring tools are evolving continuously to achieve this target. The world-wide daily operations shift group is the first responder to all faults, alarms and outages. The shifters are responsible for finding, reporting and following up problems at almost every level of a complex distributed infrastructure and processing model. A detailed report of the most critical issues found during the last year of operations at the EGEE sites will be provided; five categories turned out to dominate: storage stability, grid middleware, batch system misconfiguration, ATLAS software-related problems and data corruption.
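For illustration only, the short Python sketch below tallies shift reports against these five dominant categories; the report data and helper name are hypothetical and do not reflect the actual shift reporting tools.

from collections import Counter

# The five dominant issue categories named in the text above.
CATEGORIES = (
    "storage stability",
    "grid middleware",
    "batch system misconfiguration",
    "ATLAS software",
    "data corruption",
)


def summarize(reports: list[str]) -> Counter:
    """Count shift reports per category, ignoring anything outside the five."""
    return Counter(r for r in reports if r in CATEGORIES)


if __name__ == "__main__":
    reports = ["storage stability", "grid middleware", "storage stability"]
    for category, count in summarize(reports).most_common():
        print(f"{category}: {count}")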
Impact
The ATLAS distributed computing operations affect the whole collaboration of more than 2,000 members. The distributed computing infrastructure must cope with the distribution, storage and physics analysis of ~10 PB of data per year. Data must be correctly steered from CERN to the Tier-1s and, in a second step, on to the Tier-2s. Providing a good quality of service to the ATLAS computing community is of crucial importance for the future analysis of the LHC data: physicists from all over the world need a stable and reliable system on which to analyze the data. The main targets are a stable data replication system (from the Tier-0 down to the Tier-2s and finally to the worker nodes), a correct environment at the batch systems to run these jobs, and an efficient way to store and retrieve the outputs. The work of the daily operations team is essential to ensure the correct behaviour of the system at every one of these steps.
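As a back-of-the-envelope illustration of the data flow described above, the Python sketch below estimates the average export rate implied by ~10 PB per year and walks a hypothetical Tier-0 to Tier-1 to Tier-2 replication chain; the site names and the fan-out are illustrative, not the actual ATLAS computing model parameters.

PETABYTE = 1e15                      # bytes
SECONDS_PER_YEAR = 365 * 24 * 3600

# ~10 PB of data per year leaving the Tier-0 for its first replication step.
yearly_volume_bytes = 10 * PETABYTE
average_rate_mb_s = yearly_volume_bytes / SECONDS_PER_YEAR / 1e6
print(f"average Tier-0 export rate: ~{average_rate_mb_s:.0f} MB/s")  # ~317 MB/s

# Hypothetical replication chain: the Tier-0 feeds the Tier-1s,
# and each Tier-1 fans out to its associated Tier-2s.
replication_chain = {
    "Tier-0 (CERN)": ["Tier-1_A", "Tier-1_B"],
    "Tier-1_A": ["Tier-2_A1", "Tier-2_A2"],
    "Tier-1_B": ["Tier-2_B1"],
}


def print_chain(source: str, indent: int = 0) -> None:
    """Walk the replication chain from the Tier-0 down to the Tier-2s."""
    print(" " * indent + source)
    for destination in replication_chain.get(source, []):
        print_chain(destination, indent + 2)


print_chain("Tier-0 (CERN)")

The ~317 MB/s figure is only the sustained average for the first copy out of CERN; subsequent Tier-1 to Tier-2 replication and the delivery of data to the worker nodes add further load on top of it.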
Keywords
ATLAS, Grid Computing, Monte Carlo production
URL for further information
https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS