Speaker
Description
Problematic I/O patterns are the major cause of low-efficiency HEP jobs. When a computing cluster is partially occupied by jobs with problematic I/O patterns, the overall CPU efficiency drops dramatically. In a cluster with thousands of users, locating the source of an anomalous workload is not an easy task. Automatic anomaly detection of I/O behavior can largely alleviate the impact of these situations and reduce the manpower spent on problem diagnosis. A job's I/O behavior mainly comprises tens of metadata operations, such as open, close and getattr, and tens of data operations, such as read and write. Manually setting a threshold for each operation cannot adapt to the diversity and variability of the cluster.
This paper presents a data-driven method to solve this problem. First, we collect the I/O behavior information of each job from the job statistics monitoring file of the Lustre file system through collectD and insert it into an Elasticsearch database. Then we search, aggregate and assemble these items into data samples that can be used by machine learning algorithms. After that, we train unsupervised models on weekly and daily data samples. Finally, we perform near-real-time anomaly detection using the anomaly score that the pre-trained models generate for each new data sample. Currently, the unsupervised model we use is Isolation Forest, a very efficient and scalable algorithm for point anomaly detection in a high-dimensional space. In the future, we can leverage more comprehensive models such as LSTM to detect anomalies in a job's behavior as a sequence of I/O patterns over its lifetime.
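The scoring step can be illustrated with a minimal, self-contained sketch of an isolation forest written in plain Python. It is not the deployed implementation: the five feature names are taken from the operations mentioned above, while the synthetic operation counts, the tree and subsample sizes, and all function names are assumptions made for illustration. A production setup would more likely rely on an existing library implementation.

```python
import math
import random

# Assumed feature subset: per-job operation counts named in the abstract.
FEATURES = ["open", "close", "getattr", "read", "write"]


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random value until points are isolated."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # simplification: stop if the chosen feature cannot be split
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(node, x, depth=0):
    """Depth at which x lands in a leaf, adjusted for the points left unsplit there."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(child, x, depth + 1)


def fit_forest(samples, n_trees=100, subsample=64, seed=0):
    """Train each tree on a random subsample; returns (trees, subsample size)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    rng = random.Random(seed)
    subsample = min(subsample, len(samples))
    max_depth = math.ceil(math.log2(subsample))
    trees = [
        build_tree(rng.sample(samples, subsample), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    return trees, subsample


def anomaly_score(forest, x):
    """Score in (0, 1); short average paths (easy isolation) give scores near 1."""
    trees, subsample = forest
    mean_path = sum(path_length(t, x) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c_factor(subsample))


# Synthetic "normal" jobs (made-up counts, for illustration only).
data_rng = random.Random(42)
normal_jobs = [
    (
        data_rng.gauss(200, 30),    # open
        data_rng.gauss(200, 30),    # close
        data_rng.gauss(500, 80),    # getattr
        data_rng.gauss(5000, 800),  # read
        data_rng.gauss(2000, 300),  # write
    )
    for _ in range(500)
]
forest = fit_forest(normal_jobs)

# A hypothetical new sample: a metadata storm with almost no data traffic.
storm_job = (50000, 50000, 200000, 10, 5)
normal_scores = [anomaly_score(forest, job) for job in normal_jobs]
storm_score = anomaly_score(forest, storm_job)
print("mean score of normal jobs:", sum(normal_scores) / len(normal_scores))
print("score of the storm job:", storm_score)
```

Because each tree only needs random splits on a small subsample, training and scoring stay cheap as the number of features and samples grows, and no per-operation threshold has to be chosen by hand; only a cut on the final score is needed.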
This tool has been deployed in our production system. It collects tens of thousands of samples per day and hundreds of thousands of samples per week, which provides a sufficient statistical basis for building an isolation forest. Visualization and sorting tools on a web page are also provided to facilitate problem diagnosis and the validation of our idea.
| Speaker time zone | Compatible with Asia |
|---|---|