10–14 Oct 2016
San Francisco Marriott Marquis
America/Los_Angeles timezone

Towards automation of data quality system for CERN CMS experiment

13 Oct 2016, 15:30
1h 15m
San Francisco Marriott Marquis

Poster
Track 7: Middleware, Monitoring and Accounting
Posters B / Break

Speaker

Maxim Borisyak (National Research University Higher School of Economics (HSE) (RU); Yandex School of Data Analysis (RU))

Description

Daily operation of a large-scale experimental setup is a challenging task in terms of both maintenance and monitoring. In this work we describe an approach to an automated data quality system. Based on machine learning methods, it can be trained online on data manually labeled by human experts. The trained model can assist data quality managers by filtering out obvious cases (both good and bad) and requesting further assessment only for the fraction of datasets that it recognizes poorly.
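The filtering step described above can be illustrated with a minimal sketch. It assumes the trained model outputs, for each dataset, a score between 0 and 1 estimating the probability that the data is good. The function name `triage` and the two thresholds `low` and `high` are hypothetical and are not taken from the authors' system.

```python
from typing import List


def triage(scores: List[float], low: float, high: float) -> List[str]:
    """Assign each dataset to 'good', 'bad', or 'review'.

    scores: hypothetical model outputs, read as P(dataset is good).
    low, high: assumed decision thresholds with low < high.
    """
    if not low < high:
        raise ValueError("low must be smaller than high")
    decisions = []
    for p in scores:
        if p >= high:
            decisions.append("good")    # obvious good case, labeled automatically
        elif p <= low:
            decisions.append("bad")     # obvious bad case, labeled automatically
        else:
            decisions.append("review")  # poorly recognizable, sent to an expert
    return decisions


# Usage: only the middle dataset would be passed on to a human expert.
print(triage([0.99, 0.50, 0.01], low=0.1, high=0.9))
```

Widening the gap between the two thresholds sends more datasets to experts and makes automatic decisions safer; narrowing it automates more at the cost of more mistakes.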

The system is trained on data published by the CMS experiment on the CERN Open Data Portal. We demonstrate that our system can save at least 20% of person power without increasing the pollution (false positive) or loss (false negative) rates. In addition, for data that is not labeled automatically, the system provides its estimates and hints at possible sources of anomalies, which leads to an overall speed-up of data quality estimation and higher purity of the collected data.
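As a rough illustration of how the three quantities mentioned above relate, the sketch below computes them for a two-threshold decision rule on expert-labeled data. The exact definitions are assumptions made here for illustration: the saved fraction is the share of datasets decided automatically, pollution is the share of automatically accepted datasets that are actually bad, and loss is the share of truly good datasets that are automatically rejected. The function name `evaluate`, the thresholds, and the toy numbers are hypothetical and do not reproduce the reported results.

```python
from typing import List, Tuple


def evaluate(scores: List[float], labels: List[int],
             low: float, high: float) -> Tuple[float, float, float]:
    """Return (saved, pollution, loss) for assumed thresholds low < high.

    scores: hypothetical model outputs, read as P(dataset is good).
    labels: expert labels, 1 for good and 0 for bad.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equally long")
    auto_good = [y for p, y in zip(scores, labels) if p >= high]
    auto_bad = [y for p, y in zip(scores, labels) if p <= low]
    # Fraction of datasets that no longer need manual inspection.
    saved = (len(auto_good) + len(auto_bad)) / len(scores)
    # Bad datasets that slipped into the automatically accepted set.
    pollution = auto_good.count(0) / len(auto_good) if auto_good else 0.0
    # Good datasets that were automatically thrown away.
    total_good = labels.count(1)
    loss = auto_bad.count(1) / total_good if total_good else 0.0
    return saved, pollution, loss


# Toy example with made-up numbers.
saved, pollution, loss = evaluate(
    scores=[0.95, 0.92, 0.60, 0.40, 0.05],
    labels=[1, 0, 1, 0, 0],
    low=0.1, high=0.9,
)
print(saved, pollution, loss)
```

In practice the thresholds would be chosen so that pollution and loss stay at or below the levels achieved by fully manual labeling, and the saved fraction is whatever remains automatic under that constraint.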

Primary keyword: Monitoring
Secondary keyword: Artificial intelligence/Machine learning

Primary author

Maxim Borisyak (National Research University Higher School of Economics (HSE) (RU); Yandex School of Data Analysis (RU))

Co-authors

Andrey Ustyuzhanin (National Research University Higher School of Economics (HSE) (RU); Yandex School of Data Analysis (RU))
Dmitry Smolyakov (Yandex School of Data Analysis (RU))
Dr Jean-Roch Vlimant (California Institute of Technology (US))
Maria Stenina (Yandex (RU))
Maurizio Pierini (CERN)
