A Grid computing site consists of various services including Grid middlewares, such as Computing Element, Storage Element and so on. Ensuring a safe and stable operation of the services is a key role of site administrators. Logs produced by the services provide useful information for understanding the status of the site. However, it is a time-consuming task for site administrators to monitor and analyze the service logs everyday. Therefore, a support framework (gridalert), which detects anomaly logs and alerts to site administrators, has been developed using Machine Learning techniques.
Typical classifications using Machine Learning require pre-defined labels. It is difficult to collect a large amount of anomaly logs to build a Machine Learning model that covers all possible pre-defined anomalies. Therefore, Unsupervised Machine Learning based on clustering algorithms is used in the gridalert to detect anomaly logs. Several clustering algorithms, such as k-means, DBSCAN and IsolationForest, and its parameters have been compared in order to maximize the performance of the anomaly detection for Grid computing site operations. The gridalert has been deployed to Tokyo Tier2 site, which is one of the Worldwide LHC Computing Gird sites, and is used in operation. In this presentation, studies about Machine Learning algorithms for the anomaly detection and our operational experiences of the gridalert will be reported.
|Consider for promotion||No|