15–19 Sept 2025
CERN
Europe/Zurich timezone

Kubeflow backed by CVMFS: Efficient ML Model distribution for the Grid

16 Sept 2025, 12:20
5m
40/S2-A01 - Salle Anderson (CERN)

5. Infrastructure for AI Deployment

Speaker

Valentin Volkl (CERN)

Description

Infrastructure for deploying both training data and final models in a distributed computing environment such as the WLCG is essential to making optimal use of ML/AI in offline computing. CVMFS is the de facto standard for deploying software binaries and could bring its advantages to ML operations, in particular with respect to software preservation.

Because ML models used for inference are commonly stored in OCI registries, CVMFS can use existing container tools to cache and distribute them, and can integrate with other platforms such as Kubeflow. This is therefore not a reinvention of existing industry tools but an enhancement of state-of-the-art ones. However, since the access pattern of these model files differs from that of other software binaries, proxies and caches need to be tuned to work effectively for this use case. A central “model-registry.cern.ch” repository will be created as a service for the community, similar to unpacked.cern.ch, to make its use similarly accessible and transparent.
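As a rough illustration of what "similar to unpacked.cern.ch" could mean for a consumer, the sketch below maps an OCI reference to a read-only path under the planned repository. The directory layout (registry host, then image name and tag) is an assumption borrowed from the unpacked.cern.ch convention; the abstract does not specify it, and the function name `cvmfs_model_path` and the example reference are hypothetical.

```python
from pathlib import PurePosixPath

# Repository name is taken from the abstract; the layout below it is an assumption.
MODEL_REPO_ROOT = PurePosixPath("/cvmfs/model-registry.cern.ch")


def cvmfs_model_path(oci_ref: str, root: PurePosixPath = MODEL_REPO_ROOT) -> PurePosixPath:
    """Map 'registry/namespace/name[:tag]' to an assumed path in the CVMFS repository."""
    ref = oci_ref[len("oci://"):] if oci_ref.startswith("oci://") else oci_ref
    registry, sep, remainder = ref.partition("/")
    if not sep or not registry or not remainder:
        raise ValueError(f"expected 'registry/name[:tag]', got {oci_ref!r}")
    # Check for a tag only in the last path component, so that a registry
    # port such as 'localhost:5000' is not mistaken for one.
    last_component = remainder.rsplit("/", 1)[-1]
    if ":" not in last_component:
        remainder += ":latest"
    return root / registry / remainder


if __name__ == "__main__":
    # Hypothetical model reference, for illustration only.
    print(cvmfs_model_path("registry.cern.ch/ml/tagger:v1"))
```

On a worker node a job would then open the model file from that path like any local file, with the CVMFS client and site proxies handling download and caching.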

CERN group / Experiment

EP-SFT

Working area: Area 5: Infrastructure for AI Deployment

Project goals: Improve ML operations in distributed environments like the grid; integrate CVMFS with industry ML-Ops tools in order to leverage its efficiency and data preservation capabilities.

Timeline:
* 2 Months: Deployment of prototype “model-registry.cern.ch” repository, first benchmarks.
* 6 Months: Prototype integration with Kubeflow container registry / mlflow.cern.ch, further performance engineering.
* 1 Year: Production release: feedback of operators and community integrated, documentation, investigation of new registry-side publication mechanisms.

Available person power: 0.1 FTE

Additional person power request: 1 Graduate (over 1 year)

Is this an already ongoing activity? No

Indicative hardware resources needs: Publisher machine and S3 storage (provided centrally by IT)

Author
