Speaker
Yuji Kato
Description
Belle II is an asymmetric-energy $e^{+}e^{-}$ collider experiment at SuperKEKB in Japan. One of the main goals of Belle II is to search for physics beyond the Standard Model with a data set of about $5 \times 10^{10}$ $B\bar{B}$ pairs. In order to store such a huge amount of data, including MC events, and to analyze it in a timely manner, Belle II has adopted a distributed computing model based on DIRAC (Distributed Infrastructure with Remote Agent Control), a framework developed for the LHCb experiment to manage heterogeneous computing environments.
To make the most effective use of these resources, we must detect and repair errors quickly, so monitoring of job processing is crucial. Currently, DIRAC provides information such as job status and log files. However, this alone is not sufficient to detect problems. By collecting statistical information, such as the numbers of successful and failed jobs, and by analyzing log files, we can understand the status of each site. Furthermore, these processes must be automated so that all sites can be checked efficiently. We have implemented these functions in DIRAC.
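As a rough illustration of the per-site statistics described above, the following minimal sketch counts successful and failed jobs for each site and flags sites whose failure rate exceeds a threshold. It is not the monitor implemented in DIRAC and does not use the DIRAC API: the record layout, the status strings `"Done"` and `"Failed"`, the site names, the function name `summarize_sites`, and both thresholds are assumptions made for this example.

```python
from collections import Counter, defaultdict


def summarize_sites(jobs, max_failure_rate=0.2, min_jobs=5):
    """Count job outcomes per site and flag sites with a high failure rate.

    `jobs` is an iterable of dicts with hypothetical keys "site" and "status".
    A site is flagged only if it has at least `min_jobs` finished jobs, so
    that a single failure at a nearly idle site does not raise an alarm.
    """
    counts = defaultdict(Counter)
    for job in jobs:
        counts[job["site"]][job["status"]] += 1

    summary = {}
    for site, per_status in counts.items():
        done = per_status["Done"]      # assumed label for successful jobs
        failed = per_status["Failed"]  # assumed label for failed jobs
        finished = done + failed
        rate = failed / finished if finished else 0.0
        summary[site] = {
            "done": done,
            "failed": failed,
            "failure_rate": rate,
            "flagged": finished >= min_jobs and rate > max_failure_rate,
        }
    return summary


# Hypothetical job records for two made-up sites.
jobs = (
    [{"site": "SITE.A", "status": "Done"}] * 9
    + [{"site": "SITE.A", "status": "Failed"}]
    + [{"site": "SITE.B", "status": "Done"}] * 2
    + [{"site": "SITE.B", "status": "Failed"}] * 4
)

summary = summarize_sites(jobs)
for site, info in sorted(summary.items()):
    print(site, info)
```

Running such a summary periodically over recent jobs is one simple way to automate the check across all sites. Log-file analysis would need a separate step and is not covered by this sketch.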
In this paper, we describe the design of the monitor and our experience with it during the MC data production campaign.
Author
Co-authors
Hideki Miyake
(KEK)
I Ueda
(University of Tokyo (JP))
Kiyoshi Hayasaka
(Nagoya Univ.)
Takanori Hara
(Osaka University)