Description
This project focuses on establishing a dedicated MLOps environment tailored to the online operations of the LHCb experiment. Its goal is to enable the development, optimization, and deployment of machine learning models entirely within the LHCb technical network, using LHCb-managed resources and directly supporting online workflows.
The first phase of the project, deploying and configuring Kubeflow on the LHCb infrastructure, was started as part of a summer student project. The system is already functional and has been configured with LDAP authentication and GPU time-slicing, allowing multiple users to share GPU resources. Additional work is still required to finalize the integration with LHCb resource management and to prepare the system for production use.
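To illustrate the GPU time-slicing mentioned above: with the NVIDIA Kubernetes device plugin, sharing is enabled by a ConfigMap that advertises each physical GPU as several schedulable resources. The sketch below builds that configuration as a Python dict; the ConfigMap name and the replica count are illustrative placeholders, not the values used at LHCb.

```python
import json

# Number of virtual GPU slots advertised per physical GPU (illustrative).
REPLICAS_PER_GPU = 4

# Time-slicing section of the NVIDIA device-plugin configuration.
time_slicing_config = {
    "version": "v1",
    "sharing": {
        "timeSlicing": {
            "resources": [
                {"name": "nvidia.com/gpu", "replicas": REPLICAS_PER_GPU},
            ]
        }
    },
}

# The ConfigMap the device plugin reads; JSON is valid YAML, so the
# embedded config can be serialized with json.dumps.
configmap = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "nvidia-device-plugin-config"},
    "data": {"config.yaml": json.dumps(time_slicing_config)},
}
```

With this in place, pods request `nvidia.com/gpu` as usual, and up to `REPLICAS_PER_GPU` of them can be scheduled onto one physical device.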
Running our own instance of Kubeflow inside the LHCb network also enables access to specific Online resources, such as the Experiment Control System (ECS), which are essential for certain ML workflows and directly support LHCb operations.
Building on this foundation, the next steps will extend the infrastructure to support advanced MLOps capabilities:
- Model Serving with KServe: Deploying trained ML models as scalable inference services within the LHCb environment.
- Hyperparameter Optimization with Katib: Automating the search for optimal model configurations to improve performance and efficiency.
- Workflow Automation with Kubeflow Pipelines: Enabling reproducible, end-to-end ML workflows that integrate data preprocessing, training, optimization, and deployment.
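As a concrete sketch of the model-serving step: deploying a model with KServe amounts to applying an InferenceService resource. The dict below mirrors that manifest's structure; the model name, namespace, format, and storage URI are hypothetical placeholders, not actual LHCb values.

```python
# Hypothetical KServe InferenceService manifest, expressed as a Python dict
# (equivalent to the YAML that would be applied to the cluster).
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "example-classifier", "namespace": "lhcb-online"},
    "spec": {
        "predictor": {
            "model": {
                # Model format tells KServe which serving runtime to use.
                "modelFormat": {"name": "onnx"},
                # Location the trained model is pulled from (placeholder).
                "storageUri": "s3://models/example-classifier/v1",
                # One time-sliced GPU slot for inference.
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }
        }
    },
}
```

Once applied, KServe exposes the model behind a stable HTTP endpoint and scales the predictor replicas with load.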
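The hyperparameter search with Katib is likewise driven by a declarative Experiment resource. The sketch below shows its core structure; the experiment name, metric, parameter range, and trial counts are illustrative assumptions.

```python
# Hypothetical Katib Experiment manifest: random search over the learning
# rate, maximizing a reported "accuracy" metric.
experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "lr-search", "namespace": "lhcb-online"},
    "spec": {
        "objective": {
            "type": "maximize",
            "goal": 0.99,
            "objectiveMetricName": "accuracy",
        },
        "algorithm": {"algorithmName": "random"},
        "maxTrialCount": 12,
        "parallelTrialCount": 3,
        "parameters": [
            {
                "name": "learning_rate",
                "parameterType": "double",
                "feasibleSpace": {"min": "1e-4", "max": "1e-1"},
            }
        ],
        # The trialTemplate (the job each trial runs, substituting
        # ${trialParameters...} into the training command) is omitted
        # here for brevity.
    },
}
```

Katib launches trials until the goal or `maxTrialCount` is reached and records the best-performing configuration.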
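An end-to-end workflow of the kind described above could be defined with the Kubeflow Pipelines v2 Python SDK (`kfp`) and compiled to a pipeline specification for upload. This is a minimal configuration sketch: the component names and bodies are placeholders, not actual LHCb code.

```python
from kfp import compiler, dsl


@dsl.component
def preprocess(n_events: int) -> int:
    # Placeholder for real data preprocessing.
    return n_events


@dsl.component
def train(n_events: int) -> str:
    # Placeholder for real model training.
    return f"model trained on {n_events} events"


@dsl.pipeline(name="example-training-pipeline")
def training_pipeline(n_events: int = 1000):
    # Chain the steps: train consumes the output of preprocess.
    prep = preprocess(n_events=n_events)
    train(n_events=prep.output)


if __name__ == "__main__":
    # Compile to a YAML pipeline spec that can be uploaded to Kubeflow.
    compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```

The compiled spec can then be run, scheduled, and versioned from the Kubeflow UI, making the whole chain reproducible.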
By providing these capabilities, the project will deliver a unified MLOps platform for online operations, enabling efficient prototyping, effective use of shared resources, and reliable deployment of ML models within LHCb’s computing environment.
CERN group / Experiment
LHCb
| Working area | Area 4: AI Infrastructure for Model Training |
|---|---|
| Project goals | The intermediate goals are to finalize the Kubeflow deployment and its integration with LHCb systems, deploy KServe for model serving, integrate Katib for hyperparameter optimization, and develop initial ML pipelines for common workflows. The final goal is to provide a production-ready MLOps environment for online operations, enabling automated, reproducible, and optimized ML workflows with models that can be quickly deployed and used in analysis and operations. |
| Timeline | In months 0–6 the focus is on finalizing the Kubeflow integration and deploying KServe and Katib in a test environment; in months 6–12 on developing and validating initial ML pipelines; and in years 2–3 on production deployment with user support for serving, optimization, and pipelines. |
| Available person power | 0.1 FTE (Origin Fellow) |
| Additional person power request | 0.5 FTE (Technical Student) |
| Is this an already ongoing activity? | Yes |
| Indicative hardware resources needs | We currently have a server with 2 GPUs and sufficient CPU and RAM for initial development and testing. If more models and users are added in the future, the resources will need to be expanded to handle the additional workload. |