15–19 Sept 2025
CERN
Europe/Zurich timezone

MLOps Infrastructure and End-to-End Workflows for Online LHCb Operations

16 Sept 2025, 11:05
5m
40/S2-A01 - Salle Anderson (CERN)

4. AI Infrastructure for Model Training

Speaker

Apostolos Karvelas (CERN)

Description

This project focuses on establishing a dedicated MLOps environment tailored to the needs of the online operations of the LHCb experiment. Its goal is to enable the development, optimization, and deployment of machine learning models entirely within the LHCb technical network, using LHCb-managed resources and directly supporting online workflows.

The first phase of the project, deploying and configuring Kubeflow on the LHCb infrastructure, began as a summer student project. The system is already functional and has been configured with LDAP authentication and GPU time-slicing, allowing multiple users to share GPU resources. Additional work is still required to finalize integration with LHCb resource management and to prepare the system for production use.
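GPU time-slicing of the kind described above is typically enabled through the NVIDIA Kubernetes device plugin's sharing policy. As an illustration only (the ConfigMap name, data key, and replica count are placeholders, not the actual LHCb configuration), such a policy might look like:

```yaml
# Hypothetical sketch of an NVIDIA device-plugin time-slicing config;
# all names and the replica count are illustrative, not the LHCb settings.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # placeholder name
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU is advertised as 4 shareable slices
```

With a policy like this applied, each physical GPU is advertised as multiple `nvidia.com/gpu` resources, so several notebook or training pods can be scheduled onto the same device.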

Running our own instance of Kubeflow inside the LHCb network also enables access to specific Online resources, such as the Experiment Control System (ECS), which are essential for certain ML workflows and directly support LHCb operations.
Building on this foundation, the next steps will extend the infrastructure to support advanced MLOps capabilities:

  • Model Serving with KServe: Deploying trained ML models as scalable inference services within the LHCb environment.
  • Hyperparameter Optimization with Katib: Automating the search for optimal model configurations to improve performance and efficiency.
  • Workflow Automation with Kubeflow Pipelines: Enabling reproducible, end-to-end ML workflows that integrate data preprocessing, training, optimization, and deployment.
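As a concrete illustration of the model-serving step, a minimal KServe `InferenceService` manifest could resemble the sketch below; the model name, format, storage location, and GPU request are hypothetical placeholders, not the project's actual configuration:

```yaml
# Hypothetical KServe InferenceService sketch; the model name,
# format, and storage location are placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-trigger-model            # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                    # illustrative model format
      storageUri: "pvc://models/example" # placeholder storage location
      resources:
        limits:
          nvidia.com/gpu: "1"            # one (possibly time-sliced) GPU slice
```

Once applied, KServe exposes the model behind a scalable HTTP inference endpoint, which is the deployment pattern the bullet above refers to.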

By providing these capabilities, the project will deliver a unified MLOps platform for online operations, enabling rapid prototyping, efficient resource use, and reliable deployment of ML models within LHCb’s computing environment.

CERN group/ Experiment

LHCb

Working area Area 4: AI Infrastructure for Model Training
Project goals The intermediate goals are to finalize the Kubeflow deployment and its integration with LHCb systems, deploy KServe for model serving, integrate Katib for hyperparameter optimization, and develop initial ML pipelines for common workflows. The final goal is a production-ready MLOps environment for online operations, enabling automated, reproducible, and optimized ML workflows with models that can be quickly deployed and used in analysis and operations.
Timeline In months 0–6, the focus is on finalizing Kubeflow integration and deploying KServe and Katib in a test environment. In months 6–12, initial ML pipelines will be developed and validated. In years 2–3, the system will move to production deployment, with user support for serving, optimization, and pipelines.
Available person power 0.1 FTE (Origin Fellow)
Additional person power request 0.5 FTE (Technical Student)
Is this an already ongoing activity? Yes
Indicative hardware resources needs We currently have a server with 2 GPUs and sufficient CPU and RAM for initial development and testing. If more models and users are added in the future, we will need to expand these resources to handle the extra workload.
