We investigated novel approaches that use Deep Learning (DL) to execute workflows efficiently on distributed resources. Specifically, we studied the use of DL for job performance prediction, performance classification, and anomaly detection to improve the utilization of computing resources.
- Performance prediction:
  - Capture the performance of workflows on multiple resources
  - Consider intra-node task assignment
- Performance classification: prediction of job success/failure
  - Predict at regular intervals whether a job will succeed or fail (site reliability)
  - Long short-term memory (LSTM) neural networks
- Performance anomaly detection:
  - Example: functions that consume unexpectedly large or small amounts of time
We used the Belle II distributed computing workflow and modifications to the DIRAC system for these studies.
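
As a rough illustration of the anomaly-detection item above, the sketch below flags function timings that are unexpectedly large or small. It deliberately uses a simple median/MAD rule from robust statistics as a stand-in, not the deep-learning models studied here. The function names `robust_z_scores` and `flag_anomalies`, the threshold of 3.5, and the sample timings are all hypothetical and chosen only for illustration.

```python
from statistics import median


def robust_z_scores(durations):
    """Return a robust z-score per duration, based on the median and MAD."""
    med = median(durations)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(d - med) for d in durations)
    if mad == 0:
        # All values (or most of them) are identical; nothing stands out.
        return [0.0 for _ in durations]
    # The factor 1.4826 makes the MAD comparable to a standard deviation
    # for normally distributed data.
    return [(d - med) / (1.4826 * mad) for d in durations]


def flag_anomalies(durations, threshold=3.5):
    """Return indices of durations that are unexpectedly large or small."""
    scores = robust_z_scores(durations)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Hypothetical wall-clock times (seconds) for repeated calls of one function.
timings = [1.02, 0.98, 1.05, 1.01, 0.97, 9.80, 1.03, 0.05, 1.00]
anomalous = flag_anomalies(timings)
print(anomalous)  # [5, 7]: one call far too slow, one far too fast
```

A rule like this could serve as a baseline against which a learned detector is compared, since it catches both directions of deviation without any training data.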