Automation of Large-scale Computer Cluster Monitoring Information Analysis

Not scheduled
15m
OIST

1919-1 Tancha, Onna-son, Kunigami-gun Okinawa, Japan 904-0495
Poster presentation, Track 6: Facilities, Infrastructure, Network

Speaker

Mr Erekle Magradze (Georg-August-Universitaet Goettingen (DE))

Description

High-throughput computing platforms consist of complex infrastructure and provide a number of services that are prone to failure. To mitigate the impact of failures on the quality of the provided services, constant monitoring and timely reaction are required, which is impossible without automating the system administration processes. This paper introduces a way to automate the analysis of monitoring information in order to provide long- and short-term predictions of the service response time (SRT) of the mass storage and batch systems, and to identify the status of a service at a given time. The SRT predictions are based on an Adaptive Neuro-Fuzzy Inference System (ANFIS), while the K-means clustering algorithm is employed for service status identification. The approaches are evaluated on real monitoring data from the WLCG Tier 2 center GoeGrid. Ten-fold cross-validation results demonstrate the high efficiency of both approaches in comparison to known methods.
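
To make the status-identification idea concrete, the following is a minimal sketch of K-means clustering (Lloyd's algorithm) applied to monitoring samples. It is not the authors' implementation. The feature choice (response time in seconds, fraction of failed requests), the sample values, the status names "normal" and "degraded", and the deterministic initialisation from sorted samples are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumed initialisation: evenly spaced picks from the sorted samples,
    # chosen here only so that the example is reproducible.
    ordered = sorted(points)
    step = (len(ordered) - 1) // max(k - 1, 1)
    centroids = [ordered[i * step] for i in range(k)]
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # New centroid is the per-feature mean of the cluster members.
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Keep the previous centroid if a cluster received no points.
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical monitoring samples: (response time in s, fraction of failed requests).
samples = [
    (0.20, 0.00), (0.30, 0.01), (0.25, 0.00), (0.35, 0.02),
    (2.00, 0.30), (2.40, 0.35), (1.90, 0.25), (2.20, 0.40),
]
centroids, labels = kmeans(samples, k=2)

# Name each cluster by its mean response time (hypothetical status names).
fastest = min(range(len(centroids)), key=lambda c: centroids[c][0])
statuses = ["normal" if label == fastest else "degraded" for label in labels]
print(statuses)
```

In practice the features would likely need to be scaled to comparable ranges before clustering, since K-means relies on Euclidean distance, and the number of clusters would have to be chosen to match the service states of interest.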

Primary author

Mr Erekle Magradze (Georg-August-Universitaet Goettingen (DE))

Co-authors

Arnulf Quadt (Georg-August-Universitaet Goettingen (DE))
Gen Kawamura (Georg-August-Universitaet Goettingen (DE))
Haykuhi Musheghyan (Georg-August-Universitaet Goettingen (DE))
Jordi Nadal Serrano (Georg-August-Universitaet Goettingen (DE))
