Speaker
Emanuele Simili
(University of Glasgow)
Description
Our Tier2 cluster (ScotGrid, Glasgow) uses HTCondor as its batch system, combined with ARC-CE as the front end for job submission and ARGUS for authentication and user mapping.
On top of this, we have built a central monitoring system based on Prometheus that collects and aggregates metrics and displays them on custom Grafana dashboards. In particular, we extract job information by regularly parsing the output of 'condor_status' on the condor_manager, scheduler, and worker nodes.
A collection of graphs gives a quick overview of cluster performance and helps identify emerging issues. Logs from all nodes and services are also collected on a central Loki server and retained over time.
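As an illustration of the 'condor_status' parsing step mentioned above, the following is a minimal sketch and not the actual ScotGrid code. It assumes the command is run as `condor_status -af State Activity`, which prints one line per slot, and that the resulting counts are rendered in the Prometheus text exposition format. The metric name `condor_slots`, the helper function names, and the sample output are all hypothetical.

```python
from collections import Counter


def count_slot_states(condor_status_output: str) -> Counter:
    """Count slots per (State, Activity) pair.

    Assumes the input is the text printed by
    `condor_status -af State Activity` (one slot per line).
    """
    counts = Counter()
    for line in condor_status_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        counts[(fields[0], fields[1])] += 1
    return counts


def to_prometheus_text(counts: Counter, metric: str = "condor_slots") -> str:
    """Render the counts as a gauge in the Prometheus text exposition format.

    The metric name is a hypothetical choice made for this sketch.
    """
    lines = [
        f"# HELP {metric} Number of HTCondor slots by state and activity.",
        f"# TYPE {metric} gauge",
    ]
    for (state, activity), n in sorted(counts.items()):
        lines.append(f'{metric}{{state="{state}",activity="{activity}"}} {n}')
    return "\n".join(lines) + "\n"


# Hypothetical sample of what the command might print on a small pool.
SAMPLE_OUTPUT = """\
Claimed Busy
Claimed Busy
Unclaimed Idle
"""

if __name__ == "__main__":
    print(to_prometheus_text(count_slot_states(SAMPLE_OUTPUT)), end="")
```

In a real deployment the input text could come from something like `subprocess.run(["condor_status", "-af", "State", "Activity"], capture_output=True, text=True).stdout`, run periodically, with the rendered text written to a file that a Prometheus exporter picks up. How the metrics actually reach Prometheus in the setup described here is not stated, so that part is an assumption.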
Field | Value |
---|---|
Desired slot length | 15 |
Speaker release | Yes |
Author
Emanuele Simili
(University of Glasgow)
Co-authors
David Britton
Samuel Cadellin Skipsey
Gordon Stewart
(University of Glasgow)
Gareth Douglas Roy
(University of Glasgow (GB))