ATLAS:

Some constraints:

Current status (ML in Run 3): 

Discussion on using public clouds is welcome, but it has to take into account the cost of distributing training data:

CMS: 

Multiple ML models already in production

ML libraries are already in CMSSW (mostly TensorFlow at the moment)

Inference is limited by single-event calls; use indirect inference with batching instead
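
The batching pattern can be sketched as follows. This is a minimal, self-contained illustration, not CMSSW code: the `run_model` stub and the `BatchedInference` class are hypothetical stand-ins for a real ML backend.

```python
# Minimal sketch of indirect inference with batching: single-event
# requests are queued, and the model runs once per accumulated batch
# instead of once per event.

def run_model(batch):
    # Hypothetical stand-in for a real ML model call (e.g. TensorFlow):
    # returns one "score" per event in the batch.
    return [sum(features) / len(features) for features in batch]

class BatchedInference:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []   # queued (event_id, features) pairs
        self.results = {}   # event_id -> inference result

    def request(self, event_id, features):
        # Single-event entry point: queue the event, flush when full.
        self.pending.append((event_id, features))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Run the model once on everything queued so far.
        if not self.pending:
            return
        ids, batch = zip(*self.pending)
        for event_id, score in zip(ids, run_model(list(batch))):
            self.results[event_id] = score
        self.pending = []

engine = BatchedInference(batch_size=4)
for i in range(10):
    engine.request(i, [float(i), float(i) + 1.0])
engine.flush()  # handle the final partial batch
print(len(engine.results))  # 10 events scored in only 3 model calls
```

The trade-off is latency for throughput: results for an event are only available after its batch has been flushed, which is why this is "indirect" rather than per-event inference.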

Wishlist to improve R&D and deployment:

ALICE:

Example ML workloads with different training patterns:  

In general, trainings are relatively light, but there are many different models

Current infrastructure: 

LHCb:

The main use cases are in online operations and the trigger

ML for analysis is simpler than in the other experiments, since backgrounds are lower in rate and less complex

Issues and resources:

At the moment, common resources for training are accessible from different institutes; they are not available at CERN

Need a multi-GPU system (batch-like) allowing hyper-parameter optimization 
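
A multi-GPU batch system would essentially fan a hyper-parameter grid out over independent jobs, one per combination. A minimal sketch of that pattern, where the `train_and_score` function and its parameters are hypothetical (a real job would train a model on its own GPU):

```python
import itertools

# Each hyper-parameter combination becomes one independent job that a
# batch system could schedule on its own GPU; here the jobs simply run
# sequentially in-process.

def train_and_score(learning_rate, depth):
    # Hypothetical stand-in for a real training run that returns a
    # validation score; peaks at learning_rate=0.01, depth=4.
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(depth - 4) * 0.01

grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "depth": [2, 4, 8],
}

# Expand the grid into one job description per combination.
names = list(grid)
jobs = [dict(zip(names, values))
        for values in itertools.product(*(grid[n] for n in names))]

# "Submit" every job and collect the validation scores.
scores = {tuple(sorted(job.items())): train_and_score(**job) for job in jobs}
best = max(scores, key=scores.get)
print(dict(best))  # {'depth': 4, 'learning_rate': 0.01}
```

Because the jobs are independent, the wall-clock time of the scan scales down with the number of GPUs the batch system can provide, which is the point of asking for multi-GPU batch support.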

ATS:

Automation of the accelerator infrastructure is the main focus of ML research in ATS

In addition: accelerator design and AI assistants

Constraints and limitations: 

DISCUSSION: 

Customization vs. industry solutions (e.g. ONNX Runtime, a Microsoft-developed inference engine)

A FEW POINTS TO FOLLOW UP ON:

  1. Better advertising of what tools are available in IT and how to use them
  2. Public cloud (on-site resources are difficult to provision): the issue of keeping data private does not arise, and dedicated high-speed links exist
  3. Storage: some applications have a bottleneck in the speed of storage access (EOS? public cloud?)
  4. MLOps and hyper-parameter optimization: crucial aspect. Use a central repository of models.
  5. Hyper-parameter optimization: for ml.cern.ch this is already possible; multi-GPU batch support is needed (essential in the next years)
  6. Access to hardware resources in a coordinated, structured way. No authority to set up services in the experiments (ALICE EPN or LHCb HLT1) that combine ML usage with other standard usage
  7. Integrating ML software 
  8. Profiling and benchmarking (deployment optimization requires dedicated expertise) – projects can be done in collaboration with industry
  9. Dedicated architectures and associated licences for dedicated software