Conclusions and Future Work
We will present the most common problems encountered and the expected future evolution towards more efficient use of data, resources and manpower, and improved communication between the sites and the experiment. One of the targets is automation at all levels, starting from the monitoring and alarming systems and extending to automated actions triggered in the production and data management systems. A set of functional tests is now in place that can serve as a reference for this work in the future.
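As a purely illustrative sketch of the kind of alarm-driven automation aimed for above, the following Python fragment turns functional-test results into alarms and automated actions. All names (FunctionalTestResult, exclude_site, notify_shifters) and the threshold value are hypothetical placeholders, not the actual ATLAS operations tooling.

from dataclasses import dataclass

# Illustrative threshold below which a site is considered unhealthy.
EFFICIENCY_THRESHOLD = 0.80


@dataclass
class FunctionalTestResult:
    site: str          # site name, e.g. a Tier-1 or Tier-2
    efficiency: float  # fraction of successful test jobs/transfers


def notify_shifters(site: str, efficiency: float) -> None:
    """Placeholder for raising an alarm to the daily operations shift."""
    print(f"[alarm] {site} efficiency {efficiency:.0%} below threshold")


def exclude_site(site: str) -> None:
    """Placeholder for an automated action towards the production system."""
    print(f"[action] excluding {site} from production brokering")


def process_results(results: list[FunctionalTestResult]) -> None:
    """Turn functional-test results into alarms and automated actions."""
    for result in results:
        if result.efficiency < EFFICIENCY_THRESHOLD:
            notify_shifters(result.site, result.efficiency)
            exclude_site(result.site)


if __name__ == "__main__":
    process_results([
        FunctionalTestResult("SITE_A", 0.95),
        FunctionalTestResult("SITE_B", 0.40),
    ])

In such a scheme the functional tests provide the reference measurement, and the automated action replaces a manual intervention by the shifter.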
Detailed analysis
Reliable performance of the whole ATLAS distributed computing system is of crucial importance for meeting the ambitious physics goals of the ATLAS experiment, and the distributed computing software and monitoring tools are evolving continuously to achieve this target. The world-wide daily operations shift group is the first responder to all faults, alarms and outages. The shifters are responsible for finding, reporting and following up problems at almost every level of a complex distributed infrastructure and processing model. A detailed report of the most critical issues found during the last year of operations at the EGEE sites will be provided; five categories turned out to dominate: storage stability, grid middleware, batch system misconfiguration, ATLAS software-related problems and data corruption.
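For illustration only, the short Python sketch below tallies shift reports against these five dominant categories; the report data and helper name are hypothetical and do not reflect the actual shift reporting tools.

from collections import Counter

# The five dominant issue categories named in the text above.
CATEGORIES = (
    "storage stability",
    "grid middleware",
    "batch system misconfiguration",
    "ATLAS software",
    "data corruption",
)


def summarize(reports: list[str]) -> Counter:
    """Count shift reports per category, ignoring anything outside the five."""
    return Counter(r for r in reports if r in CATEGORIES)


if __name__ == "__main__":
    reports = ["storage stability", "grid middleware", "storage stability"]
    for category, count in summarize(reports).most_common():
        print(f"{category}: {count}")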
Impact
The ATLAS distributed computing operations affect the whole collaboration of more than 2,000 members. The distributed computing infrastructure must cope with the distribution, storage and physics analysis of ~10 PB of data per year. Data must be correctly steered from CERN to the Tier-1s and, in a second step, on to the Tier-2s. Providing a good quality of service to the ATLAS computing community is of crucial importance for the future analysis of the LHC data: physicists from all over the world need a stable and reliable system on which to analyze the data. The main targets are a stable data replication system (from the Tier-0 down to the Tier-2s and finally to the worker nodes), a correct environment at the batch systems to run these jobs, and an efficient way to store and retrieve the outputs. The work of the daily operations team is essential to ensure the correct behaviour of the system at every one of these steps.
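As a back-of-the-envelope illustration of the data flow described above, the Python sketch below estimates the average export rate implied by ~10 PB per year and walks a hypothetical Tier-0 to Tier-1 to Tier-2 replication chain; the site names and the fan-out are illustrative, not the actual ATLAS computing model parameters.

PETABYTE = 1e15                      # bytes
SECONDS_PER_YEAR = 365 * 24 * 3600

# ~10 PB of data per year leaving the Tier-0 for its first replication step.
yearly_volume_bytes = 10 * PETABYTE
average_rate_mb_s = yearly_volume_bytes / SECONDS_PER_YEAR / 1e6
print(f"average Tier-0 export rate: ~{average_rate_mb_s:.0f} MB/s")  # ~317 MB/s

# Hypothetical replication chain: the Tier-0 feeds the Tier-1s,
# and each Tier-1 fans out to its associated Tier-2s.
replication_chain = {
    "Tier-0 (CERN)": ["Tier-1_A", "Tier-1_B"],
    "Tier-1_A": ["Tier-2_A1", "Tier-2_A2"],
    "Tier-1_B": ["Tier-2_B1"],
}


def print_chain(source: str, indent: int = 0) -> None:
    """Walk the replication chain from the Tier-0 down to the Tier-2s."""
    print(" " * indent + source)
    for destination in replication_chain.get(source, []):
        print_chain(destination, indent + 2)


print_chain("Tier-0 (CERN)")

The ~317 MB/s figure is only the sustained average for the first copy out of CERN; subsequent Tier-1 to Tier-2 replication and the delivery of data to the worker nodes add further load on top of it.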
Keywords
ATLAS, Grid Computing, Monte Carlo production
URL for further information
https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS