Speaker
Description
Infrastructure to deploy both training data and final models in a distributed computing environment such as the WLCG is essential for making optimal use of ML/AI in offline computing. CVMFS is the de-facto standard for deploying software binaries and could bring its advantages to ML operations, in particular with respect to software preservation.
Since ML models used for inference are commonly stored in OCI registries, CVMFS can use existing container tools to cache and distribute them, integrating with platforms such as Kubeflow. This is therefore not a re-invention of existing industry tools, but an enhancement of state-of-the-art ones. However, since the access pattern of these model files differs from that of other software binaries, proxies and caches need to be tuned to work effectively for this use case. A central “model-registry.cern.ch” repository will be created as a service for the community, similar to unpacked.cern.ch, to make its use similarly accessible and transparent.
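To illustrate the intended access model, a minimal sketch of how an OCI model-image reference could map onto a read-only CVMFS path, assuming a layout analogous to unpacked.cern.ch (where images appear under `/cvmfs/unpacked.cern.ch/<registry-host>/<repository>:<tag>`). The path scheme below is illustrative, not a released interface of the proposed service:

```python
def cvmfs_model_path(oci_ref: str, repo: str = "model-registry.cern.ch") -> str:
    """Translate an OCI image reference (e.g. 'registry.cern.ch/ml/resnet50:v1')
    into a hypothetical CVMFS path under the proposed model repository.

    Mirrors the unpacked.cern.ch convention of nesting the source registry
    host and repository name under the CVMFS repository root."""
    return f"/cvmfs/{repo}/{oci_ref}"

print(cvmfs_model_path("registry.cern.ch/ml/resnet50:v1"))
# /cvmfs/model-registry.cern.ch/registry.cern.ch/ml/resnet50:v1
```

A grid job would then open the model file directly from that path, with CVMFS handling caching and deduplication transparently, instead of pulling the image from the registry itself.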
CERN group/ Experiment
EP-SFT
| Working area | Area 5: Infrastructure for AI Deployment |
|---|---|
| Project goals | Improve ML operations in distributed environments like the grid; integrate CVMFS with industry MLOps tools in order to leverage its efficiency and data preservation capabilities. |
| Timeline | * 2 months: Deployment of prototype “model-registry.cern.ch” repository; first benchmarks. * 6 months: Prototype integration with the Kubeflow container registry / mlflow.cern.ch; further performance engineering. * 1 year: Production release: feedback from operators and the community integrated; documentation; investigation of new registry-side publication mechanisms. |
| Available person power | 0.1 FTE |
| Additional person power request | 1 Graduate (over 1 year) |
| Is this an already ongoing activity? | No |
| Indicative hardware resources needs | Publisher machine and S3 storage (provided centrally by IT) |