14–18 Oct 2013
Amsterdam, Beurs van Berlage
Europe/Amsterdam timezone

Phronesis, a diagnosis and recovery tool for system administrators

14 Oct 2013, 15:00
45m
Grote zaal (Amsterdam, Beurs van Berlage)

Poster presentation
Facilities, Production Infrastructures, Networking and Collaborative Tools
Poster presentations

Speaker

Christophe Haen (Univ. Blaise Pascal Clermont-Fe. II (FR))

Description

The backbone of the LHCb experiment is the Online system, a very large and heterogeneous computing center. Ensuring the proper behavior of the many different tasks running on more than 2000 servers is a 24/7 job and represents a huge workload for the small team of expert operators. At CHEP 2012, we presented a prototype of a framework designed to support these experts. Its main objective is to provide them with continually improving diagnosis and recovery solutions when a service misbehaves, without having to modify the original applications. The framework is based on adapted principles of the Autonomic Computing model, on reinforcement learning algorithms, and on innovative concepts such as Shared Experience. While the CHEP 2012 presentation demonstrated the validity of the prototype in simulations, here we present a version with improved algorithms and manipulation tools, and report on our experience running it in the LHCb Online system.

Primary author

Christophe Haen (Univ. Blaise Pascal Clermont-Fe. II (FR))

Co-authors

Niko Neufeld (CERN)
Prof. Vincent Barra (LIMOS, UMR 6158 CNRS, Univ. Blaise Pascal)
