The IHEP local cluster is a medium-sized HEP data center consisting of 20,000 CPU slots, hundreds of data servers, 20 PB of disk storage and 10 PB of tape storage. Once the JUNO and LHAASO experiments begin taking data, the volume processed at this center will approach 10 PB per year. At this scale, anomaly detection is a non-trivial part of daily maintenance. Traditional methods, such as static thresholds on performance metrics and keyword searches in system logs, require expert knowledge of specific software systems and are hard to port to other systems. They also adapt poorly to changes in workload and hardware configuration. Anomalies are data points that either differ from the majority of other points or, in a time series, differ from the expectation of a reliable prediction model. Given a sufficiently large training dataset, machine learning-based anomaly detection that leverages these statistical characteristics can largely avoid the disadvantages of traditional methods. The Ganglia monitoring system at IHEP collects billions of timestamped monitoring data points from the cluster every year, which provides enough samples to train machine learning models. In this presentation, we first describe a generic anomaly detection framework that we developed to support different detection tasks. It facilitates common tasks in machine learning-based anomaly detection, such as building data samples, retagging and visualization, model invocation, deviation measurement and performance measurement. Then, for the mass storage system, we developed and trained a spatial anomaly detection model based on the Isolation Forest algorithm and a time series anomaly detection model based on LSTM recurrent neural networks to validate our idea. An initial performance comparison between our methods and traditional methods will be given at the end of the presentation.
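The contrast between a static threshold and deviation from a prediction can be illustrated with a minimal sketch. This is not the framework or the LSTM model described in the abstract: a trailing median stands in for the trained predictor, and the function names, the sample values and the parameters `window`, `k` and `min_scale` are all assumptions chosen for illustration.

```python
from statistics import median


def static_threshold(series, limit):
    """Flag indices whose value exceeds a fixed, hand-tuned limit."""
    return [i for i, x in enumerate(series) if x > limit]


def deviation_anomalies(series, window=5, k=5.0, min_scale=1.0):
    """Flag indices whose value deviates strongly from a prediction.

    Assumption: the prediction is the median of the previous `window`
    points (a stand-in for a trained model), and the tolerance is `k`
    times the median absolute deviation of that window, floored at
    `min_scale` so a perfectly flat history does not flag tiny changes.
    """
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        predicted = median(history)
        mad = median([abs(x - predicted) for x in history])
        scale = max(mad, min_scale)
        if abs(series[i] - predicted) > k * scale:
            flagged.append(i)
    return flagged


# Hypothetical metric samples: a steady load with one spike at index 8.
series = [10, 11, 10, 12, 11, 10, 11, 10, 30, 11, 10, 12, 11, 10]
# The same pattern after a workload change raises the baseline by 100.
shifted = [x + 100 for x in series]

print(static_threshold(series, limit=50))    # limit tuned too high for this load
print(static_threshold(shifted, limit=20))   # limit tuned for the old baseline
print(deviation_anomalies(series))
print(deviation_anomalies(shifted))
```

In this toy data the fixed limit either misses the spike or flags every sample once the baseline moves, while the deviation-based check flags the same single point in both series, because it compares each sample with its own recent history.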