Speaker
Dr
Xavier Espinal
(PIC/IFAE)
Description
The ATLAS distributed computing activities involve about 200 computing centers distributed world-wide and need people on shift covering 24 hours per day. Data distribution, data reprocessing, user analysis and Monte Carlo event simulation runs continuously. Reliable performance of the whole ATLAS computing community is of crucial importance to meet the ambitious physics goals of the ATLAS experiment. Distributed computing software and monitoring tools are evolving continuously to achieve this target. The world-wide daily operations shift group are the first responders to all faults, alarms and outages. The shifters are responsible to find, report and follow problems at almost every level of a complex distributed infrastructure, and complex processing model. In this paper we present the operations model followed by the experiences of running the world-wide daily operations group for the past year. We will present the most common problems encountered, and the expected future evolution to provide efficient usage of data, resources, manpower and improve communication between sites and the experiment.
Presentation type (oral | poster) | oral |
---|
Author
Dr
Xavier Espinal
(PIC/IFAE)
Co-author
Dr
Kaushik De
(UTA)