ATLAS:
Some constraints:
- Data access needs to allow for different data formats
- Expect data storage to be on the order of tens of terabytes
- Need flexibility in the software tooling, e.g. supporting conda on lxplus
- Interactive GPU access would be important for testing jobs and environments, and would help prevent problems when porting to larger-scale hardware (see the smoke-test sketch after this list)
- Have access to inference hardware that can handle custom precision or pruned models for efficient deployment
- Explore high-speed interconnects between nodes at national labs
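As one illustration of the kind of quick interactive check meant above, a minimal GPU smoke test; the framework choice (TensorFlow) and the test itself are illustrative, not something the minutes prescribe:

    # Minimal GPU smoke test: confirm the device is visible and can execute a
    # kernel before submitting a large-scale job. Framework choice is illustrative.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    print("Visible GPUs:", gpus)
    if gpus:
        with tf.device("/GPU:0"):
            a = tf.random.normal((4096, 4096))
            b = tf.random.normal((4096, 4096))
            c = tf.matmul(a, b)  # exercises the device end to end
        print("GPU matmul OK, output shape:", c.shape)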
Current status (ML in Run 3):
- Most simulation is still classical (though GAN-based fast simulation is in production)
- Tagging is fully ML; tracking is classical; the trigger is mostly classical, as is analysis
- There are reconstruction jobs that could be trained on consumer-grade GPUs
- Expect ~50% of the ATLAS computing model to be accelerated by GPU-based ML by the 2030s
Discussion of using public clouds is welcome, but it has to take into account the cost of distributing training data:
- “Back on the commercial clouds question: within ATLAS we exercised with GCP; data ingress is not a problem, and we connected one of their sites in Europe within WLCG. The big benefit is the scalability and modern hardware. The drawback is that it still seems expensive compared to the compute our funding agencies provide. We plan to release a public document reporting the details”
CMS:
Multiple ML models already in production
- Typically complex, fully customized topologies, moving toward larger transformer-based models
- For >50% of models, training takes more than 1 day
- Can’t do hyperparameter optimization for lack of resources
ML libraries already in CMSSW (mostly TensorFlow at the moment)
- Using a compiler to translate trained models into a C++ library (5-10x memory reduction, thanks to removing the TensorFlow runtime); see the sketch below
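The minutes do not name the compiler; one possible realization of the same idea is TensorFlow's ahead-of-time (XLA AOT) path, which turns a SavedModel into a standalone object file plus a C++ header with no TensorFlow runtime dependency. A minimal sketch (paths and class names are hypothetical):

    # Sketch: export a trained Keras model, then AOT-compile it into a C++
    # library so inference needs no TensorFlow runtime. This illustrates the
    # approach; the actual CMS tooling is not specified in the minutes.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    tf.saved_model.save(model, "/tmp/tagger_savedmodel")

    # Ahead-of-time compilation with TensorFlow's saved_model_cli (shell command):
    #   saved_model_cli aot_compile_cpu \
    #       --dir /tmp/tagger_savedmodel --tag_set serve \
    #       --signature_def_key serving_default \
    #       --output_prefix tagger --cpp_class TaggerModel
    # This emits an object file and a TaggerModel C++ header to link into the
    # experiment software.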
Inference is limited by single-event calls; indirect inference with batching is used instead
- Offloading via the SONIC inference engine (a client-side sketch follows below)
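As a client-side illustration of the batched, indirect-inference pattern that SONIC implements (current SONIC deployments target Triton-compatible servers), a sketch with the Triton gRPC client; the server URL, model name, and tensor names are all hypothetical:

    # Sketch: send a whole batch of events to a remote inference server in one
    # call instead of one call per event. URL, model and tensor names are
    # hypothetical.
    import numpy as np
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url="triton.example.cern.ch:8001")

    batch = np.random.rand(256, 20).astype(np.float32)  # stand-in for 256 events' features
    inp = grpcclient.InferInput("input__0", batch.shape, "FP32")
    inp.set_data_from_numpy(batch)
    out = grpcclient.InferRequestedOutput("output__0")

    result = client.infer(model_name="tagger", inputs=[inp], outputs=[out])
    scores = result.as_numpy("output__0")  # one score per event in the batch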
Wishlist to improve R&D and deployment:
- Tools to track the full model lifecycle
- Versioning of models used in production, integrated with the training infrastructure so that re-training can be triggered when needed (see the versioning sketch after this list)
- Continuous training for Level 1 trigger
- ML in custom environments beyond the training facility's environment, for testing ML-specific facilities
- How to store large models, and versioning of the data used in production
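One widely used pattern covering the versioning items above is a tracking server plus a model registry; a sketch with MLflow (one option among several; the server URL, metric, and model names are hypothetical):

    # Sketch: record a training run, then register the resulting model under an
    # immutable version number so production can pin a version and a re-training
    # pipeline can compare new runs against it. All names are hypothetical.
    import mlflow

    mlflow.set_tracking_uri("https://mlflow.example.cern.ch")

    with mlflow.start_run() as run:
        mlflow.log_params({"learning_rate": 1e-3, "batch_size": 1024})
        mlflow.log_metric("val_auc", 0.93)   # stand-in validation metric
        mlflow.log_artifact("tagger.onnx")   # model file exported by the training job

    # Each call creates a new version of the registered model "jet-tagger".
    mv = mlflow.register_model(f"runs:/{run.info.run_id}/tagger.onnx", "jet-tagger")
    print("registered version:", mv.version)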
ALICE:
Example ML workloads with different training patterns:
- Bethe-Bloch corrections (for particle ID): simple models, trained for ~300 hours per year of data taking, expected to scale to ~1000 hours
- ITS2 (inner-tracker-based particle ID): trained once (stable detector); a light model, 30 minutes of training on an A6000 (uses the INFN Torino cluster)
- Multi-detector combination for PID (based on a transformer encoder): 1 h of training time per model on a GTX 1660
In general, trainings are relatively light, but there are many different models
Current infrastructure:
- EPN cluster equipped with 8 AMD GPUs – current use is opportunistic: when data taking (and classical reconstruction) does not fill the cluster
- Lightweight interactive access (SWAN-like) would currently be most useful for ALICE, but it is strongly limited by resource availability
- Data format is an issue (TMVA is used to run ONNX models), but development is needed to support more general ONNX operators; see the sketch below
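For comparison with the TMVA path, running an ONNX model directly through ONNX Runtime looks like the sketch below; unsupported operators surface as errors when the session is created, which is the operator-coverage gap noted above. The model file and input shape are hypothetical:

    # Sketch: run an ONNX PID model with ONNX Runtime. Session creation is
    # where an unsupported ONNX operator would fail. File name and input
    # shape are hypothetical.
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("pid_model.onnx", providers=["CPUExecutionProvider"])

    x = np.random.rand(1, 7).astype(np.float32)  # stand-in for one track's features
    input_name = sess.get_inputs()[0].name
    outputs = sess.run(None, {input_name: x})    # list with one array per model output
    print("PID scores:", outputs[0])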
LHCb:
Main use cases are in online operations and the trigger
- GNN-based tracking: tests to check whether it can run in HLT1 (Allen), but the main questions relate to model size, pile-up scaling, etc.
- Primary Vertex reconstruction
- GNN-based full event interpretation (several days of training, currently impossible to do at CERN)
Analysis ML is simpler than in the other experiments, since backgrounds are lower-rate and less complex
Issues and resources:
- Maintenance and preservation:
- Issues with maintainability of fast-simulation models, retrainability, and the management and inference of large models
- Need versioning tools for models, data, and hyperparameters, and storage for models and data (both can be very large)
- Training: need robust, flexible pipelines
At the moment, common training resources are accessible from different institutes, but not available at CERN
Need a multi-GPU (batch-like) system allowing hyper-parameter optimization (see the sketch after this list)
- Profiling of the computing resources is an important issue
- Inference: the main challenge is optimizing models to run in Allen (HLT1)
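For the multi-GPU, batch-like hyper-parameter optimization mentioned above, one common pattern is many independent batch jobs (one per GPU) sharing a single study through a database; a sketch with Optuna, where the storage URL, search space, and training function are hypothetical:

    # Sketch: each batch job runs this script; trials are coordinated through a
    # shared database, so adding jobs scales the search. Storage URL, search
    # space, and train_and_validate() are hypothetical stand-ins.
    import optuna

    def train_and_validate(lr, n_layers):
        # Stand-in for the real training job; returns a fake validation loss.
        return (lr * 1e3 - 1.0) ** 2 + 0.01 * n_layers

    def objective(trial):
        lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
        n_layers = trial.suggest_int("n_layers", 2, 8)
        return train_and_validate(lr, n_layers)

    study = optuna.create_study(
        study_name="hlt1-gnn",
        storage="postgresql://optuna@db.example.cern.ch/optuna",
        direction="minimize",
        load_if_exists=True,
    )
    study.optimize(objective, n_trials=20)  # each job contributes 20 trials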
ATS:
Automation of the accelerator infrastructure is the main scope of ML research in ATS
In addition: accelerator design and AI assistants
Constraints and limitations:
- Runs on the Technical Network (separate from the general-purpose network)
- GPUs for ML are missing in the Kubernetes cluster, on the TN, and in UCAP (the online data facility)
DISCUSSION:
- General issue with ONNX integration/inference; this is quite a technical challenge. Would it make sense to organize a day gathering experts from inside and outside CERN to resolve these issues?
Customization vs. industry solutions (ONNX Runtime is a Microsoft API):
- Do we really want to develop in-house tools?
- Difficult to keep up with the new solutions that are made available outside
- Possibility to develop TMVA in order to have a customizable solution
- The solution could be a hybrid, depending on the use case (serving vs. a model catalog)
A FEW POINTS TO FOLLOW UP ON:
- Better advertising of what tools are available in IT and how to use them
- Public cloud (on-site resources are difficult to provision): keeping data private is not an issue here, and dedicated high-speed links exist
- Storage: some applications are bottlenecked by the speed of storage access (EOS? public cloud?)
- MLOps and hyper-parameter optimization: a crucial aspect; use a central repository of models
- Hyper-parameter optimization: already possible on ml.cern.ch, but multi-GPU batch support is needed (essential in the coming years)
- Access to hardware resources in a coordinated, structured way; currently no means to set up services on experiment resources (ALICE EPN, LHCb HLT1) that combine ML usage with other standard usage
- Integrating ML software
- Profiling and benchmarking (deployment optimization requires dedicated expertise); a project that could be done in collaboration with industry
- Dedicated architectures and associated licences for dedicated software