10–12 Mar 2020
Fermilab
US/Central timezone

JobsBuster

11 Mar 2020, 14:40
25m
Racetrack (WH7XO) (Fermilab)

Speaker

Siarhei Padolski (BNL)

Description

Reliable automation of root cause analysis is an essential prerequisite for deploying Operational Intelligence. This kind of data processing is important as an input for automatic decision making, and it is also valuable in its own right as a tool for offloading shifters' operations. In distributed computing, for instance in the ATLAS experiment, the failure rate is on the order of tens of thousands of jobs a day, so identifying problems manually requires significant effort. We created a prototype system, called Jobs Buster, which finds the least common denominator of computational job failures. In this talk, we provide an overview of the system, its current status, and its development plans.
