Description
We describe an ongoing analytics project at the Canadian ATLAS Tier-1 centre. Its objective is to gather, process, analyze and visualize metrics and logs from the hardware and software infrastructure that makes up the Tier-1 site, in order to help monitor the site's health and state.
The project started in 2020, with most of the initial work focused on identifying which data to capture; how to process, store and visualize it; and which hardware and software to use. We will briefly describe the heterogeneous data-collection infrastructure, focusing on the technologies introduced with this project: the Elasticsearch suite of tools (the Elastic Stack) as the main workhorse, with Beats, Logstash and Elasticsearch capturing, processing and storing the data, respectively; Grafana for visualization; and InfluxDB for tape library metrics. We will also outline how the system is set up and show example dashboards for the main datasets: dCache, HTCondor, Linux system and security logs, and tape library events.
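As a rough illustration of the capture, process and store chain described above, the following Python sketch mimics what a Logstash-style processing stage does: it parses a raw log line into a structured document and serializes a batch in the newline-delimited JSON format accepted by the Elasticsearch `_bulk` API. The log line layout, the field names, the `parse_line` and `to_bulk_body` helpers, and the index name `syslog-demo` are all assumptions made for this example; they are not taken from the site's actual configuration, which relies on Beats and Logstash rather than custom code.

```python
import json
import re
from datetime import datetime, timezone

# Hypothetical syslog-style layout: "<ISO timestamp> <host> <program>[<pid>]: <message>".
# Real dCache, HTCondor or tape library logs will use different formats.
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+) (?P<host>\S+) "
    r"(?P<program>[^\[:]+)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)$"
)


def parse_line(line):
    """Turn one raw log line into a structured document, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    doc = match.groupdict()
    # Normalize the timestamp to UTC and store it under the conventional "@timestamp" key.
    ts = datetime.fromisoformat(doc.pop("timestamp"))
    doc["@timestamp"] = ts.astimezone(timezone.utc).isoformat()
    # The pid is optional in the line; keep it as an integer only when present.
    if doc["pid"] is None:
        del doc["pid"]
    else:
        doc["pid"] = int(doc["pid"])
    return doc


def to_bulk_body(docs, index):
    """Serialize documents as an Elasticsearch bulk request body (one action line per document)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk format requires every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    raw = "2022-11-03T07:05:09-07:00 node01 sshd[4242]: Failed password for invalid user admin"
    parsed = parse_line(raw)
    print(to_bulk_body([parsed], index="syslog-demo"))
```

The resulting body could be sent to a cluster's `_bulk` endpoint with the `Content-Type: application/x-ndjson` header. In a production pipeline, the shipping and parsing steps would normally be handled by Beats and Logstash.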
We will also describe the hardware purchased and installed in 2022, as well as current and future work. The eventual objective is to apply machine learning methods to these datasets in order to gain more insight into the workings of our infrastructure, build automated alerting mechanisms based on predictive models, and find correlations across the different systems that help identify sources of inefficiency.