18–22 Jan 2016
UTFSM, Valparaíso (Chile)
Chile/Continental timezone

A scalable architecture for online anomaly detection of WLCG batch jobs

21 Jan 2016, 14:50
25m
UTFSM, Valparaíso (Chile)

Avenida España 1680, Valparaíso Chile
Oral
Computing Technology for Physics Research
Track 1

Speaker

Manuel Giffels (KIT - Karlsruhe Institute of Technology (DE))

Description

For data centres it is increasingly important to monitor network usage and to learn from network usage patterns. In particular, configuration issues or misbehaving jobs that prevent smooth operation need to be detected as early as possible. At the GridKa Tier 1 centre we therefore operate a tool that monitors the traffic data and characteristics of WLCG jobs and pilots locally on different worker nodes. On the one hand, local information alone is not sufficient to detect anomalies for several reasons, e.g. the underlying job distribution on a single worker node might change, or there might be a local misconfiguration. On the other hand, a centralised anomaly detection approach does not scale with respect to either network communication or computational cost. We therefore propose a scalable architecture based on the concepts of a super-peer network. The contribution discusses several issues regarding the optimisation of computational cost, network overhead, and accuracy of anomaly detection. Based on simulations, we will show the influence of different parameters, e.g. network size and location of computation, as well as the characteristics of WLCG batch jobs. The simulations are based on real batch job network traffic data collected over several months.

Author

Eileen Kuhn (KIT - Karlsruhe Institute of Technology (DE))

Co-authors

Andreas Petzold (KIT - Karlsruhe Institute of Technology (DE))
Christopher Jung
Manuel Giffels (KIT - Karlsruhe Institute of Technology (DE))
Max Fischer (KIT - Karlsruhe Institute of Technology (DE))
