29 November 2021 to 3 December 2021
Virtual and IBS Science Culture Center, Daejeon, South Korea
Asia/Seoul timezone

Anomaly Detection of I/O behaviors in HEP computing cluster based on unsupervised machine learning

contribution ID 544
29 Nov 2021, 17:20
20m
S221-A (Virtual and IBS Science Culture Center)

55 EXPO-ro Yuseong-gu Daejeon, South Korea email: library@ibs.re.kr +82 42 878 8299
Oral, Track 1: Computing Technology for Physics Research

Speaker

Lu Wang (Computing Center, Institute of High Energy Physics, CAS)

Description

Problematic I/O patterns are the major cause of low-efficiency HEP jobs. When a computing cluster is partially occupied by jobs with problematic I/O patterns, the overall CPU efficiency drops dramatically. In a cluster with thousands of users, locating the source of an anomalous workload is not an easy task. Automatic anomaly detection of I/O behavior can largely alleviate the impact of these situations and reduce the manpower spent on problem diagnosis. A job's I/O behavior mainly comprises tens of metadata operations, such as open, close and getattr, and tens of data operations, such as read and write. Manually set thresholds for each operation cannot adapt to the diversity and variability of the cluster.

This paper presents a data-driven method to solve this problem. First, we collect the I/O behavior information of each job from the job statistics monitoring file of the Lustre file system through collectD and insert it into an Elasticsearch database. Then we search, aggregate and assemble these items into data samples that can be used by machine learning algorithms. After that, we train unsupervised models on the data samples of each week and each day. Finally, we perform near-real-time anomaly detection using the anomaly score that the pre-trained models generate for each new data sample. Currently, the unsupervised model we use is Isolation Forest, a very efficient and scalable algorithm for point anomaly detection in a high-dimensional space. In the future, we can leverage more comprehensive models such as LSTM to detect anomalies in a job's behavior as a sequence of I/O patterns over its lifetime.
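
The aggregation and scoring steps can be illustrated with a minimal pure-Python sketch of the Isolation Forest idea. It is not the production code: the record layout (`job_id`, `op`, `count`), the five-operation feature list `OPS`, the synthetic job data and every function name are assumptions made for illustration only. Each tree splits on a random feature at a random value, and a sample that is isolated after few splits receives a score close to 1.

```python
import math
import random
from collections import defaultdict

# Assumed subset of operations; the abstract mentions tens of them.
OPS = ["open", "close", "getattr", "read", "write"]

def assemble_samples(records):
    """Sum per-operation counters into one feature vector per job."""
    per_job = defaultdict(lambda: [0.0] * len(OPS))
    for rec in records:
        per_job[rec["job_id"]][OPS.index(rec["op"])] += rec["count"]
    job_ids = sorted(per_job)
    return job_ids, [per_job[j] for j in job_ids]

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random non-constant feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dims = [d for d in range(len(points[0]))
            if min(p[d] for p in points) < max(p[d] for p in points)]
    if not dims:
        return {"size": len(points)}
    dim = rng.choice(dims)
    split = rng.uniform(min(p[dim] for p in points), max(p[dim] for p in points))
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(node, x, depth=0):
    """Depth at which x lands in a leaf, plus a correction for unsplit points."""
    if "dim" not in node:
        return depth + c_factor(node["size"])
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(child, x, depth + 1)

def fit_forest(samples, n_trees=100, sample_size=256, seed=0):
    """Build n_trees isolation trees, each on a random subsample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    rng = random.Random(seed)
    size = min(sample_size, len(samples))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(samples, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, size

def anomaly_score(forest, x):
    """Score in (0, 1]; values near 1 mean x is isolated after very few splits."""
    trees, size = forest
    mean_path = sum(path_length(t, x) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c_factor(size))

# Synthetic data: 200 ordinary jobs plus one hypothetical metadata-heavy job.
data_rng = random.Random(42)
records = []
for i in range(200):
    for op, typical in zip(OPS, [100, 100, 300, 2000, 500]):
        records.append({"job_id": f"job-{i:03d}", "op": op,
                        "count": typical * data_rng.uniform(0.5, 1.5)})
for op, count in zip(OPS, [50000, 50000, 200000, 2000, 500]):
    records.append({"job_id": "job-metadata-storm", "op": op, "count": count})

job_ids, samples = assemble_samples(records)
forest = fit_forest(samples)
scores = [anomaly_score(forest, x) for x in samples]
ranked = sorted(zip(job_ids, scores), key=lambda pair: pair[1], reverse=True)
print(ranked[0])
```

Sorting jobs by score, as in the last lines, mirrors the kind of ranking that helps an operator decide which job to inspect first. A real deployment would more likely rely on an established library implementation than on hand-written trees.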

This tool has been deployed in our production system. It collects tens of thousands of samples per day and hundreds of thousands of samples per week, which provides a sufficient statistical basis to build an isolation forest. Visualization and sorting tools on a web page are also provided to facilitate problem diagnosis and validation of our idea.

Speaker time zone Compatible with Asia

Author

Lu Wang (Computing Center, Institute of High Energy Physics, CAS)

Co-authors

Mr Qingbao Hu (IHEP) Ms Juan Chen (IHEP)
