Job monitoring on DIRAC for Belle II distributed computing

Not scheduled
15m
OIST

1919-1 Tancha, Onna-son, Kunigami-gun Okinawa, Japan 904-0495
poster presentation Track4: Middleware, software development and tools, experiment frameworks, tools for distributed computing

Speaker

Yuji Kato

Description

Belle II is an asymmetric-energy $e^{+}e^{-}$ collider experiment at SuperKEKB in Japan. One of the main goals of Belle II is to search for physics beyond the Standard Model with a data set of about $5 \times 10^{10}$ $B\bar{B}$ pairs. To store such a huge amount of data, including MC events, and analyze it in a timely manner, Belle II has adopted a distributed computing model based on DIRAC (Distributed Infrastructure with Remote Agent Control), a framework developed for the LHCb experiment to manage heterogeneous computing environments. To make the most effective use of resources, we must detect and repair errors quickly, so monitoring of job processing is crucial. Currently, DIRAC provides information such as job status and log files. However, this alone is not sufficient to detect problems. By collecting statistical information, such as the numbers of successful and failed jobs, and by analyzing log files, we can understand the status of each site. Furthermore, these processes must be automated so that all sites can be checked efficiently. We have implemented these functions in DIRAC. In this paper, we describe the design of the monitor and our experience with it in the MC data production campaign.
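As a rough illustration of the per-site statistics mentioned above, the following minimal Python sketch counts successful and failed jobs for each site and flags sites whose failure rate looks suspicious. It is not the implementation described in this contribution and does not call any DIRAC API: the function names `summarize_by_site` and `flag_sites`, the site names, the `(site, status)` record layout, and the thresholds are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def summarize_by_site(jobs):
    """Count job statuses per site from (site, status) records."""
    summary = defaultdict(Counter)
    for site, status in jobs:
        summary[site][status] += 1
    return summary


def flag_sites(summary, max_failure_rate=0.5, min_jobs=3):
    """Return sites whose failure rate exceeds max_failure_rate.

    Sites with fewer than min_jobs finished jobs are skipped, because a
    rate computed from very few jobs is not meaningful. Both thresholds
    are hypothetical defaults chosen for this example.
    """
    flagged = []
    for site, counts in sorted(summary.items()):
        finished = counts["Done"] + counts["Failed"]
        if finished >= min_jobs and counts["Failed"] / finished > max_failure_rate:
            flagged.append(site)
    return flagged


# Hypothetical job records; a real monitor would obtain these from the
# job database rather than from a hard-coded list.
jobs = [
    ("SITE_A", "Done"), ("SITE_A", "Done"), ("SITE_A", "Failed"),
    ("SITE_B", "Failed"), ("SITE_B", "Failed"), ("SITE_B", "Done"),
    ("SITE_C", "Failed"),
]

summary = summarize_by_site(jobs)
flagged = flag_sites(summary)
print(flagged)
```

In this sketch only `SITE_B` is flagged: `SITE_A` fails one job in three, and `SITE_C` has too few finished jobs to judge. Running such a check periodically over all sites is one way the automation described above could be approached.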

Primary author

Co-authors

Hideki Miyake (KEK), I Ueda (University of Tokyo (JP)), Kiyoshi Hayasaka (Nagoya Univ.), Takanori Hara (Osaka University)
