Description
The Belle II experiment at KEK, Japan, manages a data volume exceeding 30 petabytes, with datasets distributed and processed worldwide using DIRAC and Rucio. With a globally distributed computing infrastructure and an expected order-of-magnitude increase in data volume, we face operational challenges for both computing experts and end-users. End-users frequently struggle with issues such as problems with job submission or locating relevant documentation, which generates load on the experts who provide support.
This contribution reports on ongoing research and development of an intelligent, automated assistance system. The proposed system is designed to optimize experiment workflows, diagnose common failures, and provide continuous 24/7 monitoring to reduce service downtime and accelerate incident response. Our work leverages recent advances in open-source Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG), which incorporates experiment-specific documentation such as software guides, troubleshooting resources, and FAQs to provide authoritative, context-aware assistance. In parallel, we explore AI agents for automated analysis of grid job logs, failure classification, and root-cause suggestion.
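As an illustration of the retrieval step in RAG mentioned above, the following is a minimal sketch and not the prototype's implementation. It assumes a simple bag-of-words cosine similarity in place of a learned embedding model, and the document snippets, the identifiers in `DOCS`, and the function names `retrieve` and `build_prompt` are all hypothetical. The idea is only to show how a user question can be matched against documentation and how the best matches can be placed into the prompt sent to an LLM.

```python
import math
import re
from collections import Counter

# Hypothetical documentation snippets for illustration only;
# these are not taken from the actual Belle II documentation.
DOCS = {
    "job-submission": "How to submit a grid job and check the job status",
    "proxy-renewal": "Renew an expired grid proxy certificate before submitting",
    "dataset-lookup": "Locate a dataset replica and download files from storage",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    """Return up to k document ids ranked by similarity to the query.

    Assumption: word-count vectors stand in for the embedding model
    that a real RAG system would use.
    """
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id)
              for doc_id, text in docs.items()]
    # Highest score first; ties are broken alphabetically by id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]


def build_prompt(query, docs, k=2):
    """Assemble an LLM prompt that contains the retrieved context."""
    context = "\n".join(f"[{d}] {docs[d]}" for d in retrieve(query, docs, k))
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")
```

Calling `build_prompt("my proxy certificate expired", DOCS)` would produce a prompt whose context contains only the `proxy-renewal` snippet, since the other two snippets share no words with the question. A production system would typically replace the word-count similarity with dense embeddings and a vector index.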
We propose a local LLM infrastructure that enhances privacy, security, and sustainability by keeping sensitive data internal. The self-contained deployment allows for task-specific fine-tuning, integration with Model Context Protocol (MCP) tools, and long-term cost control. The contribution details the prototype architecture, a preliminary evaluation, and a roadmap to improve Belle II experiment operations and user experience.