Description
Monitoring the status of a high-throughput computing cluster running computationally intensive production jobs is a crucial yet challenging system administration task due to the complexity of such systems. To this end, we train autoencoders using the Linux kernel CPU metrics of the cluster. Additionally, we explore assisting these models with graph neural networks to share information across threads within a compute node. The models are compared in terms of their ability to: 1) produce a compressed latent representation that captures the salient features of the input, 2) detect anomalous activity, and 3) distinguish between the different kinds of jobs run at Jefferson Lab. The goal is a robust encoder whose compressed embeddings can be used for several downstream tasks. We extend this study by deploying these models in a human-in-the-loop production setting for the anomaly detection task and discuss the associated implementation aspects, such as continual learning and the criterion for generating alarms. This study represents a first step towards building self-supervised, large-scale foundation models for computing centers.
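
To make the setup concrete, below is a minimal sketch, not the authors' implementation, of the kind of pipeline the abstract describes: a per-thread autoencoder over CPU metrics, a crude mean-aggregation step standing in for the GNN message passing across threads of a compute node, and a reconstruction-error threshold as the alarm criterion. The metric count, latent dimension, mixing weight, and threshold are illustrative assumptions.

# Hypothetical sketch of a per-thread autoencoder with node-wide embedding
# sharing and a reconstruction-error alarm; all dimensions and thresholds
# are assumptions, not values from the study.
import torch
import torch.nn as nn

N_METRICS = 10    # number of kernel CPU metrics per thread (assumed)
LATENT_DIM = 4    # size of the compressed embedding (assumed)

class ThreadAutoencoder(nn.Module):
    def __init__(self, n_metrics=N_METRICS, latent_dim=LATENT_DIM):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_metrics, 32), nn.ReLU(),
            nn.Linear(32, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, n_metrics),
        )

    def forward(self, x):
        # x: (n_threads, n_metrics) for one compute node
        z = self.encoder(x)
        # Stand-in for GNN message passing: mix each thread's embedding with
        # the node-wide mean so threads share information (assumed weighting).
        z = 0.5 * z + 0.5 * z.mean(dim=0, keepdim=True)
        return self.decoder(z), z

def alarm(model, x, threshold=0.05):
    """Flag a node if its mean reconstruction error exceeds a threshold."""
    with torch.no_grad():
        recon, _ = model(x)
        error = torch.mean((recon - x) ** 2).item()
    return error > threshold, error

# Usage: one metrics snapshot for a hypothetical 16-thread node.
model = ThreadAutoencoder()
snapshot = torch.rand(16, N_METRICS)
flag, err = alarm(model, snapshot)
print(f"alarm={flag}, reconstruction_error={err:.4f}")

In a human-in-the-loop deployment such as the one described, the threshold and the retraining schedule (continual learning) would be tuned from operator feedback rather than fixed as above.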