Our Tier-2 cluster (ScotGrid, Glasgow) uses HTCondor as its batch system, combined with ARC-CE as the front-end for job submission and ARGUS for authentication and user mapping.
On top of this, we have built a central monitoring system based on Prometheus that collects, aggregates and displays metrics on custom Grafana dashboards. In particular, we extract job information by regularly parsing the output of 'condor_status' on the condor_manager, scheduler, and worker nodes.
A collection of graphs gives a quick overview of cluster performance and helps identify emerging issues. Logs from all nodes and services are also collected on a central Loki server and retained over time.
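
As an illustration of the 'condor_status' parsing step described above, the following is a minimal sketch and not the collector actually deployed at Glasgow. It assumes slot states are read with `condor_status -af State Activity` and rendered in the Prometheus text exposition format; the metric name `condor_slots` and the function names are hypothetical.

```python
import subprocess
from collections import Counter


def parse_slot_states(text):
    """Count slots per (State, Activity) pair.

    Expects one slot per line with two whitespace-separated fields,
    as produced by `condor_status -af State Activity`.
    """
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        counts[(fields[0], fields[1])] += 1
    return counts


def to_prometheus(counts, metric="condor_slots"):
    """Render the counts as Prometheus text-format gauge samples.

    The metric name is a placeholder chosen for this sketch.
    """
    lines = [f"# TYPE {metric} gauge"]
    for (state, activity), n in sorted(counts.items()):
        lines.append(f'{metric}{{state="{state}",activity="{activity}"}} {n}')
    return "\n".join(lines) + "\n"


def collect():
    """Run condor_status and return the metrics text (requires HTCondor)."""
    result = subprocess.run(
        ["condor_status", "-af", "State", "Activity"],
        capture_output=True, text=True, check=True,
    )
    return to_prometheus(parse_slot_states(result.stdout))
```

One way to use such a sketch would be to call `collect()` periodically and write the result to a file that Prometheus scrapes indirectly, for example through the node_exporter textfile collector; whether the production setup works this way is an assumption.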