Description
Monitoring the status of a high-throughput computing cluster running computationally intensive production jobs is a crucial yet challenging system administration task due to the complexity of such systems. To this end, we train autoencoders using the Linux kernel CPU metrics of the cluster. Additionally, we explore assisting these models with graph neural networks to share information across threads within a compute node. The models are compared in terms of their ability to: 1) produce a compressed latent representation that captures the salient features of the input, 2) detect anomalous activity, and 3) distinguish between the different kinds of jobs run at Jefferson Lab. The goal is a robust encoder whose compressed embeddings can be used for several downstream tasks. We extend this study by deploying these models in a human-in-the-loop production setting for the anomaly detection task and discuss the associated implementation aspects, such as continual learning and the criterion for generating alarms. This study represents a first step towards building self-supervised, large-scale foundation models for computing centers.
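
To make the setup concrete, below is a minimal sketch, not the authors' implementation, of the kind of pipeline the abstract describes: a per-thread autoencoder over CPU metrics, a crude mean-aggregation step standing in for the GNN message passing across threads of a compute node, and a reconstruction-error threshold as the alarm criterion. The metric count, latent dimension, mixing weight, and threshold are illustrative assumptions.

# Hypothetical sketch of a per-thread autoencoder with node-wide embedding
# sharing and a reconstruction-error alarm; all dimensions and thresholds
# are assumptions, not values from the study.
import torch
import torch.nn as nn

N_METRICS = 10    # number of kernel CPU metrics per thread (assumed)
LATENT_DIM = 4    # size of the compressed embedding (assumed)

class ThreadAutoencoder(nn.Module):
    def __init__(self, n_metrics=N_METRICS, latent_dim=LATENT_DIM):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_metrics, 32), nn.ReLU(),
            nn.Linear(32, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, n_metrics),
        )

    def forward(self, x):
        # x: (n_threads, n_metrics) for one compute node
        z = self.encoder(x)
        # Stand-in for GNN message passing: mix each thread's embedding with
        # the node-wide mean so threads share information (assumed weighting).
        z = 0.5 * z + 0.5 * z.mean(dim=0, keepdim=True)
        return self.decoder(z), z

def alarm(model, x, threshold=0.05):
    """Flag a node if its mean reconstruction error exceeds a threshold."""
    with torch.no_grad():
        recon, _ = model(x)
        error = torch.mean((recon - x) ** 2).item()
    return error > threshold, error

# Usage: one metrics snapshot for a hypothetical 16-thread node.
model = ThreadAutoencoder()
snapshot = torch.rand(16, N_METRICS)
flag, err = alarm(model, snapshot)
print(f"alarm={flag}, reconstruction_error={err:.4f}")

In a human-in-the-loop deployment such as the one described, the threshold and the retraining schedule (continual learning) would be tuned from operator feedback rather than fixed as above.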