ACAT 2017

Name: ACAT 2017
Start: 2017-08-21T07:45:00-07:00
End: 2017-08-25T18:00:00-07:00
Location: University of Washington, Seattle

US/Pacific timezone

Predictive analytics tools to adjust and monitor performance metrics for the ATLAS Production System

24 Aug 2017, 16:00

45m

The Commons (Alder Hall)

Poster Track 1: Computing Technology for Physics Research Poster Session

Siarhei Padolski (BNL)

Every scientific workflow involves an organizational part whose purpose is to plan the analysis process thoroughly according to a defined schedule and thus keep work progressing efficiently. Information such as an estimate of the processing time or the likelihood of a system outage (abnormal behaviour) improves the planning process, assists in monitoring system performance, and helps predict the system's next state.

The ATLAS Production System is an automated scheduling system responsible for central production of Monte Carlo data, highly specialized production for physics groups, and data pre-processing and analysis on facilities such as grid infrastructures, clouds and supercomputers. Its next generation (ProdSys2) processes around 2M tasks per year, which corresponds to more than 365M jobs per year. ProdSys2 evolves to accommodate a growing number of users and new requirements from the ATLAS Collaboration, physics groups and individual users. ATLAS Distributed Computing in its current state is an aggregation of large and heterogeneous facilities running on the WLCG, academic and commercial clouds, and supercomputers. This cyber-infrastructure presents computing conditions in which contention for resources among high-priority data analyses happens routinely, which can lead to significant interruptions in workload and data handling. The inability to monitor and predict the behaviour of the analysis process (its duration) and the state of the system itself motivated the design of built-in situational-awareness analytics tools.

The proposed suite of tools aims to estimate the completion time (the so-called "Time To Complete", TTC) of every production task (i.e., to predict the task duration), to estimate the completion time of a chain of tasks, and to predict a failure state of the system (e.g., based on "abnormal" task processing times). Its implementation is based on machine learning methods and techniques. Besides historical information about finished tasks, it uses ProdSys2 job execution information and the state of resource usage (real-time parameters and metrics that adjust the predicted values according to the state of the computing environment).
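The three goals named above (per-task TTC, TTC for a chain of tasks, and flagging "abnormal" processing times) can be illustrated with a minimal sketch. It is not the authors' method: the abstract describes a machine learning implementation, whereas this example substitutes a simple median-based baseline. The task types, the sample history, the `load_factor` parameter standing in for real-time resource state, and all function names are hypothetical.

```python
from statistics import median

# Hypothetical history of finished tasks: (task_type, duration_hours).
HISTORY = [
    ("simulation", 10.0), ("simulation", 12.0), ("simulation", 11.0),
    ("simulation", 13.0), ("simulation", 40.0),
    ("reconstruction", 4.0), ("reconstruction", 5.0), ("reconstruction", 6.0),
]


def durations_for(task_type, history=HISTORY):
    """Return historical durations of finished tasks of the given type."""
    return [d for t, d in history if t == task_type]


def estimate_ttc(task_type, load_factor=1.0, history=HISTORY):
    """Baseline TTC: median historical duration, scaled by load_factor.

    load_factor is an assumed stand-in for real-time resource state
    (1.0 = nominal, >1.0 = congested environment).
    """
    durations = durations_for(task_type, history)
    if not durations:
        raise ValueError(f"no history for task type {task_type!r}")
    return median(durations) * load_factor


def estimate_chain_ttc(task_types, load_factor=1.0, history=HISTORY):
    """TTC of a chain, assuming its tasks run strictly one after another."""
    return sum(estimate_ttc(t, load_factor, history) for t in task_types)


def is_abnormal(task_type, duration, k=3.0, history=HISTORY):
    """Flag a duration further than k median absolute deviations from the median."""
    durations = durations_for(task_type, history)
    center = median(durations)
    mad = median([abs(d - center) for d in durations])
    return abs(duration - center) > k * mad


if __name__ == "__main__":
    print(estimate_ttc("simulation"))
    print(estimate_chain_ttc(["simulation", "reconstruction"]))
    print(is_abnormal("simulation", 40.0))
```

The median and the median absolute deviation are used here because a single very long task in the history would distort a mean-based estimate; a learned model would replace `estimate_ttc` while the chain and anomaly logic could stay structurally similar.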

The WLCG ML R&D project started in 2016. Within the project, the first implementation of the TTC Estimator (for production tasks) was developed, and its visualization was integrated into the ProdSys Monitor.

Alexei Klimentov (Brookhaven National Laboratory (US))
Dmitri Golubkov (Institute for High Energy Physics (IHEP))
Fernando Harald Barreiro Megino (University of Texas at Arlington)
Maksim Gubin (National Research Tomsk Polytechnic University (RU))
Maria Grigoryeva (Institute for Theoretical and Experimental Physics (RU))
Mikhail Titov (National Research Centre Kurchatov Institute (RU))
Misha Borodin (University of Iowa (US))
Siarhei Padolski (BNL)
Tadashi Maeno (Brookhaven National Laboratory (US))
Tatiana Korchuganova (National Research Tomsk Polytechnic University (RU))

acat2017-n83.pdf

ACAT_2017_paper_Gubin_Titov.pdf
