Discussion on the differences in scope and technology of ml.cern.ch and the AF pilot.
The AF team's current understanding:
The effort behind serving Kubeflow is mainly to provide:
ML pipelines that are not possible elsewhere, e.g. distributed training
Automated hyperparameter optimisation
Model serving (use cases for classic ML but also for LLMs)
Notebooks with GPUs (as offered in SWAN) are interesting, but not the main focus.
There is ongoing work on dynamic GPU allocation: the idea is to aggregate all GPUs into a common pool so that services like SWAN, batch and ml.cern.ch draw resources from the same place.
This should reduce the time resources sit idle or underused and make the overall service more (cost-)efficient.
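To illustrate why a common pool should reduce idle time, here is a toy comparison of fixed per-service GPU quotas against a single shared pool. The quotas and demand figures are invented for illustration and do not reflect the real SWAN, batch or ml.cern.ch allocations; the function names are hypothetical as well.

```python
# Toy model: idle and unmet GPU demand under static quotas vs. one shared pool.
# All numbers below are made up for illustration only.

def static_partition(quotas, demands):
    """Each service can only use its own fixed quota."""
    idle = sum(max(quotas[s] - demands[s], 0) for s in quotas)
    unmet = sum(max(demands[s] - quotas[s], 0) for s in quotas)
    return idle, unmet

def shared_pool(quotas, demands):
    """All GPUs sit in one pool; any service can draw from it."""
    total_gpus = sum(quotas.values())
    total_demand = sum(demands.values())
    idle = max(total_gpus - total_demand, 0)
    unmet = max(total_demand - total_gpus, 0)
    return idle, unmet

# Hypothetical snapshot: notebooks are quiet while batch and ML are oversubscribed.
quotas = {"swan": 8, "batch": 16, "ml": 8}
demands = {"swan": 2, "batch": 20, "ml": 10}

static_idle, static_unmet = static_partition(quotas, demands)
pooled_idle, pooled_unmet = shared_pool(quotas, demands)
print(f"static quotas: {static_idle} idle, {static_unmet} unmet")
print(f"shared pool:   {pooled_idle} idle, {pooled_unmet} unmet")
```

In this snapshot the static split leaves GPUs idle in one service while requests queue in the others, whereas the pool serves all requests. A real scheduler would also need priorities, preemption and fair-share rules, which this sketch ignores.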
We should discuss how ML workflows could work end to end and which environment the user is expected to use for each step. How different are the environments from the user's point of view? Is this likely to work for experienced people in the experiments as well as for newcomers (PhD students etc.)?
-- Preparation of training data
-- AOD --> DAODs --> ntuples from data
-- MC generation of training data
-- Developing a model
-- Training the model with the data from step 1
-- including hyperparameter optimisation
-- Testing the model with production data
-- see first step
-- Making sense of the results (plots etc.)
-- Loop over all steps
-- Scaling up to a production analysis
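As a starting point for that discussion, the steps above can be sketched as one small loop. This is a toy stand-in only: the data is synthetic (y = 3x plus noise), the "model" is a single weight fitted by gradient descent, and the hyperparameter scan is a plain grid over the learning rate. All function names are hypothetical, and nothing here reflects how ml.cern.ch or the AF pilot actually run these stages.

```python
import random

def prepare_data(n=200, seed=0):
    """Stand-in for data preparation: synthetic (x, y) pairs, split 80/20."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    data = [(x, 3.0 * x + rng.gauss(0, 0.1)) for x in xs]
    split = int(0.8 * n)
    return data[:split], data[split:]

def train(train_set, lr, epochs=50):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in train_set) / len(train_set)
        w -= lr * grad
    return w

def evaluate(w, test_set):
    """Mean squared error on held-out data."""
    return sum((w * x - y) ** 2 for x, y in test_set) / len(test_set)

def run_workflow(learning_rates):
    """Prepare data once, then train and test one model per learning rate."""
    train_set, test_set = prepare_data()
    results = {lr: evaluate(train(train_set, lr), test_set) for lr in learning_rates}
    best_lr = min(results, key=results.get)
    return best_lr, results

best_lr, results = run_workflow([0.001, 0.01, 0.5])
for lr, mse in sorted(results.items()):
    print(f"lr={lr}: test MSE={mse:.4f}")
print("best learning rate:", best_lr)
```

The open question from the list is where each of these functions would run in practice (e.g. grid or batch for data preparation, a notebook for model development, a pipeline service for the scan) and how many environment changes that implies for the user.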