Monitoring the ATLAS Production System

Sep 6, 2007, 2:20 PM
Victoria, Canada

Dr John Kennedy (LMU Munich)


The ATLAS production system is responsible for the distribution of O(100,000) jobs per day to over 100 sites worldwide. Tracking and correlating errors and resource usage within such a large distributed system is of extreme importance. The monitoring system presented here is designed to abstract the monitoring information away from the central database of jobs. This approach ensures that monitoring does not interfere with production itself, and it provides faster responses to monitoring queries. The design and functionality of the system are discussed, and the possible future development of monitoring tools for the ATLAS Production System is explored.

