9-13 July 2018
Sofia, Bulgaria
Europe/Sofia timezone

CMS Workflow Failures Recovery Panel, Towards AI-assisted Operation

10 Jul 2018, 16:00
1h

National Culture Palace, Boulevard "Bulgaria", 1463 NDK, Sofia, Bulgaria
Poster Track 3 – Distributed computing Posters

Speaker

Jean-Roch Vlimant (California Institute of Technology (US))

Description

The CMS central production system uses the LHC grid, effectively about 200 thousand cores spread over about a hundred computing centers worldwide. Such a large and unique distributed computing system is bound to sustain a certain rate of failures of various types, which are addressed with site administrators a posteriori. With up to 50 different campaigns running concurrently, the workload is diverse and complex, which leads to a certain number of misconfigurations despite all efforts in request preparation. Most of the 2000 to 4000 datasets produced each week are produced in full automation and delivered within an agreed level of completion. Despite efforts to reduce the failure rate, a good fraction of workflows still requires non-trivial intervention, and this work falls to computing operators. We present a tool developed to facilitate and improve this operation, with a view to reducing delivery delays. A dense and comprehensive representation of the errors that occurred during the processing of a request helps expedite the investigation. Workflows that suffered from similar failures are bundled and presented together to the operator. A realistically simplified operating panel front end is connected to a back end that automates the required technical operations. The framework was built so that it collects both the operator's decision and the information available to the operator when making it. It is therefore possible to employ machine learning techniques to learn from the operator by training on labelled data. The operator's procedure is automated further by applying the decisions that are predicted with acceptable confidence. This tool improves operational efficiency and will lead to further development in handling failures in distributed computing resources using machine learning.
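
The two mechanisms mentioned in the description, bundling workflows with similar failures and applying only those predicted decisions that reach an acceptable confidence, can be illustrated with a minimal Python sketch. This is not the tool's actual code: the function names, the error codes, the action label, and the 0.9 threshold are all hypothetical, and "similar failures" is simplified here to "the same set of error codes".

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Optional


def bundle_by_errors(
    workflows: Dict[str, Dict[int, int]]
) -> Dict[FrozenSet[int], List[str]]:
    """Group workflow names by the set of error codes they reported.

    `workflows` maps a workflow name to {error_code: occurrence_count}.
    Both the layout and the codes are assumptions made for illustration.
    """
    bundles: Dict[FrozenSet[int], List[str]] = defaultdict(list)
    for name, error_counts in workflows.items():
        # Only codes that occurred at least once form the signature.
        signature = frozenset(
            code for code, count in error_counts.items() if count > 0
        )
        bundles[signature].append(name)
    return dict(bundles)


def decide(
    predicted_action: str, confidence: float, threshold: float = 0.9
) -> Optional[str]:
    """Return the action to apply automatically, or None to defer.

    The threshold value is a placeholder, not a figure from the abstract.
    """
    return predicted_action if confidence >= threshold else None
```

In such a scheme, a `None` result would leave the bundle in the operator's panel, and the operator's eventual decision would become another labelled example for training.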

Primary authors

Daniel Robert Abercrombie (Massachusetts Inst. of Technology (US))
Allison Reinsvold Hall (University of Notre Dame (US))
Paola Katherine Rozo Bernal (Universidad de los Andes (CO))
Jean-Roch Vlimant (California Institute of Technology (US))
Thong Nguyen (California Institute of Technology (US))
Christian Contreras Campana (DESY, Hamburg (Germany))
Matteo Cremonesi (Fermi National Accelerator Lab. (US))
