Mar 10 – 12, 2020
US/Central timezone


Mar 11, 2020, 2:40 PM
Racetrack (WH7XO) (Fermilab)




Siarhei Padolski (BNL)


Reliable automation of root cause analysis is an essential prerequisite for deploying Operational Intelligence. This kind of data processing serves as input for automatic decision making and is also valuable in its own right as a tool for offloading shifter operations. The failure rate in distributed computing, for instance in the ATLAS experiment, is on the order of tens of thousands of jobs a day, which is why manual problem identification requires significant effort. We have created a prototype system, called Jobs Buster, that finds the least common denominator of computational job failures. In this talk, we provide an overview of the system, its current status, and development plans.
