Description
With the change of the ATLAS computing model from hierarchical to dynamic, processing tasks are dispatched to sites based not only on the availability of resources but also on network conditions along the path between compute and storage, which may be topologically and/or geographically distant from each other. We describe a system developed to collect, store, analyze, and provide timely access to network conditions for ATLAS sites, which has also been generalized for broader use. We describe the data we collect from four different sources, each giving an orthogonal view of network performance and utilization. The pre-existing ATLAS Distributed Computing Analytics platform is used for data transport and storage. The platform provides interactive monitoring dashboards and serves as the backend for an alarm and alert system we have developed for site operators. A co-located Jupyter service is used to perform in-depth interactive data analysis, train different machine learning algorithms, and test models on historical data. We discuss how the derived knowledge is used by ATLAS for network anomaly detection, job scheduling, and data brokering.
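To illustrate the kind of analysis the co-located Jupyter service supports, the following is a minimal, hypothetical sketch of rolling z-score anomaly detection on a link-throughput time series. The synthetic data, window size, and threshold are illustrative assumptions, not the actual ATLAS pipeline or its data sources.

```python
import numpy as np
import pandas as pd

def flag_anomalies(throughput: pd.Series, window: int = 24,
                   z_thresh: float = 3.0) -> pd.Series:
    """Flag samples whose deviation from a rolling baseline exceeds z_thresh.

    A simple stand-in for the kind of model one might prototype on
    historical monitoring data; not the production detection method.
    """
    baseline = throughput.rolling(window, min_periods=window).mean()
    spread = throughput.rolling(window, min_periods=window).std()
    z = (throughput - baseline) / spread
    return z.abs() > z_thresh

# Hypothetical hourly throughput (MB/s) on one site-to-site link,
# with an injected degradation to illustrate detection.
rng = np.random.default_rng(0)
idx = pd.date_range("2017-01-01", periods=24 * 14, freq="h")
rate = pd.Series(rng.normal(800, 40, len(idx)), index=idx)
rate.iloc[200:212] -= 400  # simulated 12-hour throughput drop

alarms = flag_anomalies(rate)
print(rate[alarms].head())  # timestamps flagged as anomalous
```

In a setting like the one described above, flags from such a model would feed an alarm and alert system for site operators, while the historical series used to train and validate it would come from the analytics platform's storage.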