Different groups at CERN have been adapting existing workflows and processes to rely on machine learning, in areas such as trigger farms, fast simulation, anomaly detection, and reinforcement learning.
To help end users with these tasks, a service must hide the underlying infrastructure complexity and integrate well with existing identity and storage services, as well as ease tasks such as data preparation, model training, and model serving.
In this talk we present a new solution available at CERN based on Kubeflow, a machine learning platform running on top of Kubernetes. We describe how the underlying resources - CPUs and GPUs - are offered to end users while hiding the complex details that allow the service to scale horizontally, and how they are shared with the goal of optimizing resource usage. We also present how existing on-premise capacity can be extended transparently to external resources (public clouds) for use cases where on-demand usage is cost effective, such as covering peak periods.
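As an illustration of how CPU and GPU resources can be requested without exposing the underlying infrastructure, a user typically submits a declarative job manifest and lets Kubernetes handle placement. The following is a minimal sketch of a TFJob (the Kubeflow training-operator resource for TensorFlow); the job name and container image are hypothetical placeholders, not part of the CERN setup described above.

```yaml
# Sketch of a distributed training job requesting GPUs.
# Name and image are illustrative assumptions.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-train            # hypothetical job name
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2              # scale out by raising the replica count
      template:
        spec:
          containers:
            - name: tensorflow
              image: registry.example.com/mnist:latest   # hypothetical image
              resources:
                limits:
                  nvidia.com/gpu: 1   # one GPU per worker; scheduling is handled by Kubernetes
```

Because the manifest only states *what* is needed (replicas, GPUs), the same job can land on on-premise nodes or on burst capacity in a public cloud without any change on the user side.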
In the second part of the talk we cover the complete ML lifecycle. Examples include quick code development and iteration using notebooks; submission of analysis pipelines that allow workloads to scale out easily, including the direct conversion of a notebook into a pipeline; distributed model training, with submission via both a web interface and an API; hyper-parameter tuning with multiple search algorithms available; and finally model storage and serving.
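The hyper-parameter tuning step mentioned above is handled in Kubeflow by Katib, where an Experiment resource declares the search space, the objective, and the search algorithm. Below is a minimal sketch; the experiment name, image, and training script are hypothetical placeholders, and the parameter ranges are arbitrary examples.

```yaml
# Sketch of a Katib hyper-parameter search experiment.
# Names, image, and ranges are illustrative assumptions.
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: lr-search              # hypothetical experiment name
spec:
  maxTrialCount: 12
  parallelTrialCount: 3
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random      # other algorithms, e.g. grid or bayesianoptimization, can be selected here
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: learningRate
        reference: lr
        description: Learning rate passed to the training script
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: training
                image: registry.example.com/train:latest   # hypothetical image
                command:
                  - python
                  - train.py                               # hypothetical script
                  - "--lr=${trialParameters.learningRate}"
```

Each trial is launched as an ordinary Kubernetes Job with one sampled parameter value substituted into the command line, so the same mechanism that runs a single training job also drives the whole search.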