ATLAS:

Some constraints:

Current status (ML in Run 3): 

Discussion on using public clouds is welcome, but it has to take into account the cost of distributing training data:

CMS: 

Multiple ML models already in production

ML libraries are already in CMSSW (mostly TensorFlow at the moment)

Inference is limited by single-event calls; use indirect inference with batching instead
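
The batching pattern can be sketched as follows. This is a minimal, self-contained illustration, not CMSSW code: the `run_model` stub and the `BatchedInference` class are hypothetical stand-ins for a real ML backend.

```python
# Minimal sketch of indirect inference with batching: single-event
# requests are queued, and the model runs once per accumulated batch
# instead of once per event.

def run_model(batch):
    # Hypothetical stand-in for a real ML model call (e.g. TensorFlow):
    # returns one "score" per event in the batch.
    return [sum(features) / len(features) for features in batch]

class BatchedInference:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []   # queued (event_id, features) pairs
        self.results = {}   # event_id -> inference result

    def request(self, event_id, features):
        # Single-event entry point: queue the event, flush when full.
        self.pending.append((event_id, features))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Run the model once on everything queued so far.
        if not self.pending:
            return
        ids, batch = zip(*self.pending)
        for event_id, score in zip(ids, run_model(list(batch))):
            self.results[event_id] = score
        self.pending = []

engine = BatchedInference(batch_size=4)
for i in range(10):
    engine.request(i, [float(i), float(i) + 1.0])
engine.flush()  # handle the final partial batch
print(len(engine.results))  # 10 events scored in only 3 model calls
```

The trade-off is latency for throughput: results for an event are only available after its batch has been flushed, which is why this is "indirect" rather than per-event inference.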

Wishlist to improve R&D and deployment:

ALICE:

Example ML workloads with different training patterns:  

In general, trainings are relatively light, but there are many different models

Current infrastructure: 

LHCb:

The main use cases are in online operations and the trigger

ML for analysis is simpler than in the other experiments, since backgrounds are lower in rate and less complex

Issues and resources:

At the moment, common resources for training are accessible from different institutes; they are not available at CERN

Need a multi-GPU system (batch-like) allowing hyper-parameter optimization 
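
A multi-GPU batch system would essentially fan a hyper-parameter grid out over independent jobs, one per combination. A minimal sketch of that pattern, where the `train_and_score` function and its parameters are hypothetical (a real job would train a model on its own GPU):

```python
import itertools

# Each hyper-parameter combination becomes one independent job that a
# batch system could schedule on its own GPU; here the jobs simply run
# sequentially in-process.

def train_and_score(learning_rate, depth):
    # Hypothetical stand-in for a real training run that returns a
    # validation score; peaks at learning_rate=0.01, depth=4.
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(depth - 4) * 0.01

grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "depth": [2, 4, 8],
}

# Expand the grid into one job description per combination.
names = list(grid)
jobs = [dict(zip(names, values))
        for values in itertools.product(*(grid[n] for n in names))]

# "Submit" every job and collect the validation scores.
scores = {tuple(sorted(job.items())): train_and_score(**job) for job in jobs}
best = max(scores, key=scores.get)
print(dict(best))  # {'depth': 4, 'learning_rate': 0.01}
```

Because the jobs are independent, the wall-clock time of the scan scales down with the number of GPUs the batch system can provide, which is the point of asking for multi-GPU batch support.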

ATS:

Automation of the accelerator infrastructure is the main focus of ML research in ATS

In addition: accelerator design and AI assistants

Constraints and limitations: 

DISCUSSION: 

Customization vs. industry solutions (e.g. ONNX Runtime, a Microsoft-developed inference engine)

A FEW POINTS TO FOLLOW UP ON:

  1. Better advertising of what tools are available in IT and how to use them
  2. Public cloud (on-site resources are difficult to provision): the issue of keeping data private does not arise, and dedicated high-speed links exist
  3. Storage: some applications have a bottleneck in the speed of storage access (EOS? public cloud?)
  4. MLOps and hyper-parameter optimization: crucial aspect. Use a central repository of models.
  5. Hyper-parameter optimization: for ml.cern.ch this is already possible; multi-GPU batch support is needed (essential in the next years)
  6. Access to hardware resources in a coordinated, structured way. No authority to set up services in the experiments (ALICE EPN or LHCb HLT1) that combine ML usage with other standard usage
  7. Integrating ML software 
  8. Profiling and benchmarking (deployment optimization requires dedicated expertise) – projects can be done in collaboration with industry
  9. Dedicated architectures and associated licences for dedicated software