19–25 Oct 2024
Europe/Zurich timezone

Decode the Workload: Training Deep Learning Models for Efficient Compute Cluster Representation

TUE 12
22 Oct 2024, 15:18
57m
Exhibition Hall

Poster Track 7 - Computing Infrastructure Poster session

Speaker

Torri Jeske

Description

Monitoring the status of a high-throughput computing cluster running computationally intensive production jobs is a crucial yet challenging system administration task because of the complexity of such systems. To this end, we train autoencoders on the Linux kernel CPU metrics of the cluster. Additionally, we explore assisting these models with graph neural networks to share information across threads within a compute node. The models are compared in terms of their ability to: 1) produce a compressed latent representation that captures the salient features of the input, 2) detect anomalous activity, and 3) distinguish between the different kinds of jobs run at Jefferson Lab. The goal is a robust encoder whose compressed embeddings can be used for several downstream tasks. We extend this study by deploying these models in a human-in-the-loop production setting for the anomaly detection task and discuss the associated implementation aspects, such as continual learning and the criterion for generating alarms. This study represents a first step towards building self-supervised large-scale foundation models for computing centers.
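The general idea of encoding metrics into a compressed representation and flagging samples that reconstruct poorly can be illustrated with a minimal sketch. It is not the authors' model: it assumes a small linear autoencoder written in NumPy, synthetic zero-mean data standing in for CPU metrics, and a 99th-percentile reconstruction-error threshold as the alarm criterion. The abstract does not specify the architecture, the input features, or the alarm rule, so all names and values below are hypothetical.

```python
import numpy as np


def train_linear_autoencoder(x, latent_dim=2, lr=0.05, epochs=2000, seed=0):
    """Fit x -> z -> x_hat by gradient descent on the mean squared error.

    Assumption: a linear encoder/decoder without biases, which is only
    adequate here because the synthetic data is zero-mean.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = x.shape
    w_enc = rng.normal(scale=0.1, size=(n_features, latent_dim))
    w_dec = rng.normal(scale=0.1, size=(latent_dim, n_features))
    for _ in range(epochs):
        z = x @ w_enc                      # compressed latent representation
        err = z @ w_dec - x                # reconstruction residual
        grad_dec = z.T @ err / n_samples
        grad_enc = x.T @ (err @ w_dec.T) / n_samples
        w_enc -= lr * grad_enc
        w_dec -= lr * grad_dec
    return w_enc, w_dec


def reconstruction_error(x, w_enc, w_dec):
    """Per-sample mean squared reconstruction error."""
    return np.mean((x @ w_enc @ w_dec - x) ** 2, axis=1)


# Synthetic stand-in for CPU metrics (hypothetical): six features driven by
# two hidden load factors, so normal samples lie close to a 2-D subspace.
rng = np.random.default_rng(1)
load = rng.normal(size=(500, 2))
mix = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
normal = load @ mix + rng.normal(scale=0.05, size=(500, 6))
# Anomalous samples ignore that structure entirely.
anomalous = rng.normal(scale=3.0, size=(20, 6))

w_enc, w_dec = train_linear_autoencoder(normal)
embeddings = normal @ w_enc                # inputs for downstream tasks

# Assumed alarm criterion: error above the 99th percentile of training error.
threshold = np.percentile(reconstruction_error(normal, w_enc, w_dec), 99)
flags_normal = reconstruction_error(normal, w_enc, w_dec) > threshold
flags_anomalous = reconstruction_error(anomalous, w_enc, w_dec) > threshold
```

In this sketch `embeddings` plays the role of the compressed representation reused for downstream tasks, while `flags_anomalous` marks the samples that would raise an alarm. A deployed system would presumably replace the fixed percentile with a criterion tuned together with operators and retrain as the workload drifts.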

Primary author

Dr Ahmed Mohammed (Thomas Jefferson National Accelerator Facility)

Co-authors

Mr Bryan Hess (Thomas Jefferson National Accelerator Facility)
Mrs Diana McSpadden (Thomas Jefferson National Accelerator Facility)
Kishansingh Rajput (Thomas Jefferson National Accelerator Facility)
Ms Laura Hild (Thomas Jefferson National Accelerator Facility)
Dr Malachi Schram (Thomas Jefferson National Accelerator Facility)
Mr Mark Jones (Thomas Jefferson National Accelerator Facility)
Mr Wesley Moore (Thomas Jefferson National Accelerator Facility)
Dr Zhenyu Dai (Thomas Jefferson National Accelerator Facility)
