2nd CERN IT Machine Learning Infrastructure Workshop

Europe/Zurich
40/S2-A01 - Salle Anderson (CERN)

Description

The High Energy Physics community has started introducing Deep Learning techniques to improve different aspects of the experiment and accelerator life cycles and of the data processing steps.

 

The availability of reliable, user-friendly and scalable resources, including full software and hardware stacks, is critical to fully support activities in the ML/DL domain. In this context, the CERN IT department has launched an initiative to gather information about the status of its dedicated ML/DL infrastructure, the state of the art, and the needs of the HEP ML/DL community at CERN.

 

A first workshop focused on the current status of ML/DL activities and infrastructure solutions in CERN IT.

This second workshop aims to collect information about ongoing and planned AI research within the different departments and experiments at CERN.

 

Recording is available here!

Zoom Meeting ID: 68903957844
Host: Ricardo Rocha

ATLAS:

 

Some constraints:

  • Data access needs to allow for different data formats (see the sketch after this list)
  • Expect data storage on the order of tens of terabytes
  • Need flexibility in the software tooling, supporting conda through lxplus
  • Interactive GPU access would be important for testing jobs and environments, and would help prevent problems when porting to larger-scale hardware
  • Access to inference hardware that can handle custom-precision or pruned models for efficient deployment
  • Explore high-speed interconnects between nodes at national labs
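As an illustration of the multi-format data-access point above, here is a minimal sketch (assuming the uproot and numpy packages; the file, tree and branch names are hypothetical, and the branches are assumed to be flat) of reading a ROOT tree into arrays that any training framework can consume. Parquet or HDF5 inputs would follow the same pattern with a different reader, which is the kind of flexibility the constraint refers to.

```python
# Minimal sketch: read flat branches of a ROOT tree into a numpy feature matrix.
# File, tree and branch names are hypothetical.
import uproot
import numpy as np

def load_features(path="events.root", tree_name="Events",
                  branches=("pt", "eta", "phi")):
    """Return an (n_events, n_features) float32 array built from flat branches."""
    with uproot.open(path) as f:
        tree = f[tree_name]
        arrays = tree.arrays(list(branches), library="np")
    # Stack the per-branch 1D arrays into a single feature matrix.
    return np.stack([arrays[b] for b in branches], axis=1).astype(np.float32)

if __name__ == "__main__":
    X = load_features()
    print(X.shape)
```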

 

Current status (ML in Run 3): 

  • Most simulation is still classical (but fast simulation based on GANs is in production)
  • Tagging is fully ML-based; tracking is classical; the trigger is mostly classical, as is analysis
  • There are reconstruction training jobs that could run on consumer-grade GPUs
  • Expect about 50% of the ATLAS computing model to be accelerated by GPU-based ML by the 2030s

 

Discussion on using public clouds is welcome, but it has to take into account the costs of distributing training data:

 

  • “Back on the commercial clouds question: within ATLAS we exercised with GCP; data ingress is not a problem, and we connected one of their sites in Europe within WLCG. The big benefit is the scalability and the modern hardware. The drawback is that it still seems expensive compared to the compute our funding agencies provide. We plan to release a public document reporting the details.”

 

CMS: 

Multiple ML models are already in production:

  • Typically complex models with fully customized topologies, moving toward larger transformer-based models
  • For more than 50% of the models, training takes more than one day
  • Hyperparameter optimization is currently not possible due to lack of resources

 

ML libraries are already in CMSSW (mostly TensorFlow at the moment):

  • Using a compiler to translate trained models into a C++ library (5-10x reduction in memory, thanks to removing the TensorFlow runtime dependency); the export step is sketched below
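To make the export side of this concrete, here is a minimal sketch (plain TensorFlow, not the actual CMSSW tooling; the model, shapes and path are toys) of producing a SavedModel, which is the usual input to an ahead-of-time compiler that emits a standalone C++ library without the TensorFlow runtime:

```python
# Minimal sketch: export a toy TensorFlow model as a SavedModel.
# A SavedModel like this is what an ahead-of-time compiler can turn into a
# self-contained C++ library, dropping the TensorFlow runtime at inference time.
import tensorflow as tf

class TinyModel(tf.Module):
    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.random.normal([16, 1]), name="w")
        self.b = tf.Variable(tf.zeros([1]), name="b")

    @tf.function(input_signature=[tf.TensorSpec(shape=[None, 16], dtype=tf.float32)])
    def __call__(self, x):
        # Fixed input signature so the exported graph has a stable interface.
        return tf.sigmoid(tf.matmul(x, self.w) + self.b)

tf.saved_model.save(TinyModel(), "exported_model")
```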

 

Inference is limited by single-event calls → use indirect inference with batching:

  • Inference offloading via SONIC (see the sketch below)
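The sketch below illustrates the batched, offloaded pattern with a Triton gRPC client in Python. This is not the CMSSW/SONIC code path, only an illustration of the idea of sending one batched request to an external inference server instead of many single-event calls; the server URL, model name and tensor names are hypothetical.

```python
# Minimal sketch of batched, offloaded inference via a Triton gRPC client.
# Accumulate events into one batch, send a single request, read back all outputs.
# Server URL, model name and tensor names are hypothetical.
import numpy as np
import tritonclient.grpc as grpcclient

def infer_batch(events, url="localhost:8001", model="toy_model"):
    """events: (n_events, n_features) float32 array."""
    client = grpcclient.InferenceServerClient(url=url)
    inp = grpcclient.InferInput("input", list(events.shape), "FP32")
    inp.set_data_from_numpy(events)
    out = grpcclient.InferRequestedOutput("output")
    result = client.infer(model_name=model, inputs=[inp], outputs=[out])
    return result.as_numpy("output")

if __name__ == "__main__":
    # One request for 256 events instead of 256 single-event calls.
    batch = np.random.rand(256, 16).astype(np.float32)
    print(infer_batch(batch).shape)
```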

 

 

Wishlist to improve R&D and deployment:

  • Tools to track the full model lifecycle
  • Versioning of models used in production, integrated with the training infrastructure so that re-training can be triggered if needed (see the sketch after this list)
  • Continuous training for the Level-1 trigger
  • ML in custom environments beyond the training facilities → environments for testing ML-specific setups
  • How to store large models and how to version data in production
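As one possible illustration of the model-versioning item above (not an agreed CMS or CERN IT choice), here is a minimal MLflow sketch that logs parameters and metrics and registers each training as a new model version; it assumes mlflow and scikit-learn are installed, and the model name and tracking URI are hypothetical.

```python
# Minimal sketch: track a training run and register the model in MLflow's
# model registry, so a re-training simply becomes a new registered version.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

mlflow.set_tracking_uri("sqlite:///mlruns.db")  # local, registry-capable backend

X, y = make_classification(n_samples=500, n_features=16, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50).fit(X, y)

with mlflow.start_run():
    mlflow.log_param("n_estimators", 50)
    mlflow.log_metric("train_accuracy", clf.score(X, y))
    # Each call registers a new version of "toy_tagger" in the model registry.
    mlflow.sklearn.log_model(clf, "model", registered_model_name="toy_tagger")
```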

 

ALICE:

 

Example ML workloads with different training patterns:  

  • Bethe-Bloch corrections (for particle ID): expect simple models trained for about 300 hours per year of data taking, expected to scale to about 1000 hours
  • ITS2 (inner-tracker-based particle ID): trained once (stable detector); light model, 30 minutes of training on an A6000 (using the INFN Torino cluster)
  • Multi-detector combination for PID (based on a transformer encoder): about 1 hour of training time per model on a GTX 1660

In general, the trainings are relatively light, but there are many different models.

 

Current infrastructure: 

  • EPN cluster equipped with 8 AMD GPUs – current use is opportunistic: when data taking (and classical reconstruction) does not fill the cluster
  • Interactive lightweight (SWAN-like) access would currently be the most useful in ALICE, but it is strongly limited by resource availability
  • Data format is an issue (TMVA is used to run ONNX models), but development is needed to support more general ONNX operators (see the sketch after this list)
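Related to the ONNX operator-coverage point above, here is a minimal sketch (assuming the onnx and onnxruntime Python packages; the model file and input shape are hypothetical) that lists the operators a model uses and runs it with ONNX Runtime, a quick way to check whether an exported model stays within a supported operator set:

```python
# Minimal sketch: list the ONNX operators a model uses and run it with
# ONNX Runtime. The model file name and input shape are hypothetical.
import numpy as np
import onnx
import onnxruntime as ort

model = onnx.load("pid_model.onnx")
used_ops = sorted({node.op_type for node in model.graph.node})
print("operators used by the model:", used_ops)

sess = ort.InferenceSession("pid_model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
x = np.random.rand(8, 16).astype(np.float32)  # (batch, features), illustrative shape
outputs = sess.run(None, {input_name: x})
print(outputs[0].shape)
```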

 

 

LHCb:

 

Main use cases are in online operations and the trigger:

  • GNN-based tracking: tests to check whether it is possible to run in HLT1 (Allen); the main questions relate to model size and pile-up scaling
  • Primary vertex reconstruction
  • GNN-based full event interpretation (several days of training – currently impossible to do at CERN)

Analysis ML is simpler than in other experiments, since background rates are lower and less complex.


Issues and resources:

  • Maintenance and preservation:
    • Issues with the maintainability of fast-simulation models, retrainability, and the management and inference of large models
    • Need versioning tools for models, data and hyperparameters, plus storage for models and data (which can be very large)

 

  • Training: need robust, flexible pipelines

At the moment, common training resources are accessible from different institutes, but they are not available at CERN.

Need a multi-GPU (batch-like) system allowing hyper-parameter optimization; a sketch of what such an optimization could look like follows below.
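A minimal sketch of hyper-parameter optimization with a generic library such as Optuna (one possible tool, not an established LHCb or CERN IT choice). The objective below is a toy stand-in for a real training job; on a multi-GPU batch system each trial would typically run as its own job on its own GPU.

```python
# Minimal sketch of hyper-parameter optimization with Optuna.
# The objective is a toy stand-in for a real training + validation run.
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 2, 8)
    # Toy "validation loss": replace with a real training job and its metric.
    return (lr - 1e-3) ** 2 + 0.01 * depth

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print("best parameters:", study.best_params)
```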

 

  • Profiling of the computing resources is an important issue
  • Inference: the main challenge is optimizing models to run in Allen (HLT1)

 

 

ATS:

Automation of the accelerator infrastructure is the main scope of ML research in ATS.

In addition: accelerator design and AI assistants

Constraints and limitations: 

  • Applications run on the Technical Network (TN), separate from the general-purpose network
  • GPUs for ML are missing in the Kubernetes cluster, on the TN and in UCAP (the online data facility)

     

 

DISCUSSION: 

  • General issue with ONNX integration/inference. This is quite a technical challenge. Should we organise a day gathering experts from inside and outside CERN to resolve these issues? Would this make sense?

Customization vs. industry solutions (ONNX Runtime is a Microsoft-maintained project):

  • Do we really want to develop in-house tools?
  • It is difficult to keep up with the new solutions that become available outside
  • One possibility is to develop TMVA further in order to have a customizable solution
  • The solution could be a hybrid, depending on the use case (serving vs. model catalog); a sketch of the ONNX export path follows below
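To ground the customization-vs-industry discussion, the sketch below shows the "industry solution" side of the trade-off: exporting a toy PyTorch model to ONNX so that any compliant runtime, in-house or external, can serve it (the model and file name are illustrative only):

```python
# Minimal sketch: export a toy PyTorch model to ONNX, the interchange format
# discussed above. The model and file name are illustrative only.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
model.eval()

dummy_input = torch.randn(1, 16)
torch.onnx.export(
    model,
    dummy_input,
    "toy_model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # variable batch size
)
```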

 

 

A FEW POINTS TO FOLLOW UP ON:

  1. Better advertising of what tools are available in IT and how to use them
  2. Public cloud (given how difficult it is to provision on-site resources): keeping data private is not an issue, and dedicated high-speed links exist
  3. Storage: some applications are bottlenecked by the speed of storage access (EOS? public cloud?)
  4. MLOps and hyper-parameter optimization: a crucial aspect. Use a central repository of models.
  5. Hyper-parameter optimization: already possible on ml.cern.ch; multi-GPU batch support is needed (essential in the coming years)
  6. Access to hardware resources in a coordinated, structured way. There is no capacity to set up services in the experiments (ALICE EPN or LHCb HLT1) that combine ML usage with other standard usage
  7. Integrating ML software
  8. Profiling and benchmarking (deployment optimization requires dedicated expertise) – a project that could be done in collaboration with industry; a minimal benchmarking sketch follows this list
  9. Dedicated architectures and associated licences for dedicated software
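On point 8, a minimal benchmarking sketch (a pure numpy toy model; a real study would profile the actual inference backend and hardware) that measures per-batch latency and event throughput, the basic quantities a profiling and benchmarking effort would track:

```python
# Minimal sketch: measure per-batch latency and event throughput of a toy
# inference function. A real benchmark would call the actual model/backend.
import time
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((16, 1)).astype(np.float32)

def toy_inference(batch):
    # Stand-in for a real model call (e.g. an ONNX Runtime session or a Triton request).
    return np.tanh(batch @ weights)

batch = rng.random((1024, 16)).astype(np.float32)
n_repeats = 100

start = time.perf_counter()
for _ in range(n_repeats):
    toy_inference(batch)
elapsed = time.perf_counter() - start

print(f"mean latency per batch: {1e3 * elapsed / n_repeats:.2f} ms")
print(f"throughput: {n_repeats * batch.shape[0] / elapsed:.0f} events/s")
```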