15–19 Sept 2025
CERN
Europe/Zurich timezone

MLOps Infrastructure and End-to-End Workflows for Online LHCb Operations

16 Sept 2025, 11:05
5m
40/S2-A01 - Salle Anderson (CERN)

4. AI Infrastructure for Model Training

Speaker

Apostolos Karvelas (CERN)

Description

This project focuses on establishing a dedicated MLOps environment tailored to the needs of the online operations of the LHCb experiment. Its goal is to enable the development, optimization, and deployment of machine learning models entirely within the LHCb technical network, using LHCb-managed resources and directly supporting online workflows.

The first phase of the project, deploying and configuring Kubeflow on the LHCb infrastructure, began as a summer student project. The system is already functional and has been configured with LDAP authentication and GPU time-slicing, allowing multiple users to share GPU resources. Additional work is still required to finalize integration with LHCb resource management and to prepare the system for production use.
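GPU time-slicing of the kind described above is typically enabled through the NVIDIA Kubernetes device plugin's sharing policy. As an illustration only (the ConfigMap name, data key, and replica count are placeholders, not the actual LHCb configuration), such a policy might look like:

```yaml
# Hypothetical sketch of an NVIDIA device-plugin time-slicing config;
# all names and the replica count are illustrative, not the LHCb settings.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # placeholder name
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU is advertised as 4 shareable slices
```

With a policy like this applied, each physical GPU is advertised as multiple `nvidia.com/gpu` resources, so several notebook or training pods can be scheduled onto the same device.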

Running our own instance of Kubeflow inside the LHCb network also enables access to specific Online resources, such as the Experiment Control System (ECS), which are essential for certain ML workflows and directly support LHCb operations.
Building on this foundation, the next steps will extend the infrastructure to support advanced MLOps capabilities:

  • Model Serving with KServe: Deploying trained ML models as scalable inference services within the LHCb environment.
  • Hyperparameter Optimization with Katib: Automating the search for optimal model configurations to improve performance and efficiency.
  • Workflow Automation with Kubeflow Pipelines: Enabling reproducible, end-to-end ML workflows that integrate data preprocessing, training, optimization, and deployment.
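As a concrete illustration of the model-serving step, a minimal KServe `InferenceService` manifest could resemble the sketch below; the model name, format, storage location, and GPU request are hypothetical placeholders, not the project's actual configuration:

```yaml
# Hypothetical KServe InferenceService sketch; the model name,
# format, and storage location are placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-trigger-model            # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                    # illustrative model format
      storageUri: "pvc://models/example" # placeholder storage location
      resources:
        limits:
          nvidia.com/gpu: "1"            # one (possibly time-sliced) GPU slice
```

Once applied, KServe exposes the model behind a scalable HTTP inference endpoint, which is the deployment pattern the bullet above refers to.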

By providing these capabilities, the project will deliver a unified MLOps platform for online operations, enabling rapid prototyping, efficient resource use, and reliable deployment of ML models within LHCb’s computing environment.

CERN group/ Experiment

LHCb

Working area Area 4: AI Infrastructure for Model Training
Project goals The intermediate goals are to finalize the Kubeflow deployment and its integration with LHCb systems, deploy KServe for model serving, integrate Katib for hyperparameter optimization, and develop initial ML pipelines for common workflows. The final goal is a production-ready MLOps environment for online operations, enabling automated, reproducible, and optimized ML workflows with models that can be quickly deployed and used in analysis and operations.
Timeline In months 0–6, the focus is on finalizing Kubeflow integration and deploying KServe and Katib in a test environment. In months 6–12, initial ML pipelines will be developed and validated. In years 2–3, the system will move to production deployment, with user support for serving, optimization, and pipelines.
Available person power 0.1 FTE (Origin Fellow)
Additional person power request 0.5 FTE (Technical Student)
Is this an already ongoing activity? Yes
Indicative hardware resources needs We currently have a server with 2 GPUs and sufficient CPU and RAM for initial development and testing. If more models and users are added in the future, we will need to expand these resources to handle the extra workload.
