Description
ATLAS Distributed Computing (ADC) uses the pilot model to submit jobs to Grid computing resources. This model isolates the resource from the workload management system (WMS) and helps to avoid running jobs on faulty resources. A minor side-effect of this isolation is that faulty resources are neglected and not brought back into production, because their problems are not visible to the WMS. In this paper we describe a method to analyse logs from the ADC resource provisioning system (AutoPyFactory) and provide monitoring views that target poorly performing resources and help diagnose issues promptly. Central to this analysis is the use of Amazon Web Services (AWS) as an inexpensive and stable analytics platform. In particular, we use the AWS Athena service as an SQL query interface to logging data stored in the AWS S3 service. We describe the data handling pipeline and the services involved, which lead to a summary of key metrics suitable for ADC operations.
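As an illustration of the Athena-over-S3 approach described above, the following is a minimal sketch in Python, not the pipeline used in the paper. The table name `apf_logs`, the columns `queue` and `exit_code`, and the helper `run_athena_query` are hypothetical, since the abstract does not give the log schema. The function expects an Athena client object such as the one returned by `boto3.client("athena")` and uses only the `start_query_execution`, `get_query_execution` and `get_query_results` calls.

```python
import time

# Hypothetical schema: the table and column names below are assumptions
# made for illustration; the actual AutoPyFactory log layout is not given.
QUERY = """
SELECT queue,
       COUNT(*) AS total_jobs,
       SUM(CASE WHEN exit_code <> 0 THEN 1 ELSE 0 END) AS failed_jobs
FROM apf_logs
GROUP BY queue
ORDER BY failed_jobs DESC
"""


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Run `sql` through Athena and return the first page of rows as dicts.

    `client` is an Athena client, e.g. boto3.client("athena").
    `output_location` is an S3 URI where Athena writes its result files.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena queries are asynchronous, so poll until a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    # Only the first page of results is read here; larger result sets
    # would need pagination via the NextToken field.
    rows = client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    if not rows:
        return []

    # For SELECT queries the first row holds the column names.
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]
```

All values come back from Athena as strings, so a monitoring view built on this sketch would convert the counts to integers before ranking queues by failure rate.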