Speakers
Dr
Malachi Schram
(Pacific Northwest National Laboratory)
Description
We investigate novel Deep Learning (DL) approaches for the efficient execution of workflows on distributed resources. Specifically, we study the use of DL for job performance prediction, performance classification, and anomaly detection to improve the utilization of computing resources.
- Performance prediction:
  - Capture the performance of workflows on multiple resources
  - Consider intra-node task assignment
- Performance classification: prediction of job success/failure
  - Predict at regular intervals whether a job will succeed or fail (site reliability)
  - Long short-term memory (LSTM) neural networks
- Performance anomaly detection:
  - Example: functions that consume unexpectedly large or small amounts of time
We used the Belle II distributed computing workflow and modifications to the DIRAC system for these studies.
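To make the anomaly-detection item above concrete, here is a minimal sketch of flagging function calls whose run time is unexpectedly large or small. It uses a robust z-score (median and median absolute deviation) as a simple statistical stand-in; it is not the DL method studied in this contribution. The function name `flag_runtime_anomalies`, the threshold of 3.5, and the sample timings are all illustrative assumptions.

```python
from statistics import median


def flag_runtime_anomalies(durations, threshold=3.5):
    """Return indices of durations whose robust z-score exceeds `threshold`.

    Hypothetical helper for illustration only; a simple statistical
    baseline, not the deep-learning approach described in the abstract.
    """
    med = median(durations)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(d - med) for d in durations)
    if mad == 0:
        return []  # no measurable spread, so nothing can be flagged
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data are approximately normal.
    return [
        i
        for i, d in enumerate(durations)
        if abs(0.6745 * (d - med) / mad) > threshold
    ]


# Made-up wall-clock times (seconds) for repeated calls to one function.
durations = [1.02, 0.98, 1.05, 1.00, 0.97, 9.80, 1.03, 0.01]
anomalies = flag_runtime_anomalies(durations)
print(anomalies)  # indices of the unusually slow and unusually fast calls
```

The median-based score is used here because a single very slow call would inflate a mean and standard deviation enough to hide itself; both the unusually long call and the unusually short one are flagged in this sample.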
Primary authors
Dr
Malachi Schram
Dr
Nathan Tallent
(Pacific Northwest National Laboratory)
Dr
Ryan Friese
(Pacific Northwest National Laboratory)
Alok Singh
(University of California San Diego)