ATLAS:
Some constraints:
- Data access needs to allow for different data formats
- Expect data storage to be on the order of tens of terabytes
- Need flexibility in the software tooling, e.g. supporting conda on lxplus
- Interactive GPU access would be important for testing jobs and environments, and would help prevent problems when porting to larger-scale hardware (see the smoke-test sketch after this list)
- Have access to inference hardware that can handle custom precision or pruned models for efficient deployment
- Explore high-speed interconnects between nodes at national labs
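As one illustration of the kind of quick interactive check meant above, a minimal GPU smoke test; the framework choice (TensorFlow) and the test itself are illustrative, not something the minutes prescribe:

    # Minimal GPU smoke test: confirm the device is visible and can execute a
    # kernel before submitting a large-scale job. Framework choice is illustrative.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    print("Visible GPUs:", gpus)
    if gpus:
        with tf.device("/GPU:0"):
            a = tf.random.normal((4096, 4096))
            b = tf.random.normal((4096, 4096))
            c = tf.matmul(a, b)  # exercises the device end to end
        print("GPU matmul OK, output shape:", c.shape)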
Current status (ML in Run 3):
- Most simulation is still classical (though GAN-based fast simulation is in production)
- Tagging is fully ML; tracking is classical; the trigger is mostly classical, as is analysis
- There are reconstruction jobs that could be trained on consumer-grade GPUs
- Expect ~50% of the ATLAS computing model to be accelerated by GPU-based ML by the 2030s
Discussion of using public clouds is welcome, but it has to take into account the cost of distributing training data:
- “Back on the commercial clouds question: within ATLAS we exercised with GCP; data ingress is not a problem, and we connected one of their sites in Europe within WLCG. The big benefit is the scalability and modern hardware. The drawback is that it still seems expensive compared to the compute our funding agencies provide. We plan to release a public document reporting the details”
CMS:
Multiple ML models already in production
- Typically complex, fully customized topologies, moving toward larger transformer-based models
- For >50% of models, training takes more than 1 day
- Can’t do hyperparameter optimization for lack of resources
ML libraries already in CMSSW (mostly TensorFlow at the moment)
- Using a compiler to translate trained models into a C++ library (5-10x memory reduction, thanks to removing the TensorFlow runtime); see the sketch below
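The minutes do not name the compiler; one possible realization of the same idea is TensorFlow's ahead-of-time (XLA AOT) path, which turns a SavedModel into a standalone object file plus a C++ header with no TensorFlow runtime dependency. A minimal sketch (paths and class names are hypothetical):

    # Sketch: export a trained Keras model, then AOT-compile it into a C++
    # library so inference needs no TensorFlow runtime. This illustrates the
    # approach; the actual CMS tooling is not specified in the minutes.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    tf.saved_model.save(model, "/tmp/tagger_savedmodel")

    # Ahead-of-time compilation with TensorFlow's saved_model_cli (shell command):
    #   saved_model_cli aot_compile_cpu \
    #       --dir /tmp/tagger_savedmodel --tag_set serve \
    #       --signature_def_key serving_default \
    #       --output_prefix tagger --cpp_class TaggerModel
    # This emits an object file and a TaggerModel C++ header to link into the
    # experiment software.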
Inference is limited by single-event calls; indirect inference with batching is used instead
- Offloading via the SONIC inference engine (a client-side sketch follows below)
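As a client-side illustration of the batched, indirect-inference pattern that SONIC implements (current SONIC deployments target Triton-compatible servers), a sketch with the Triton gRPC client; the server URL, model name, and tensor names are all hypothetical:

    # Sketch: send a whole batch of events to a remote inference server in one
    # call instead of one call per event. URL, model and tensor names are
    # hypothetical.
    import numpy as np
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url="triton.example.cern.ch:8001")

    batch = np.random.rand(256, 20).astype(np.float32)  # stand-in for 256 events' features
    inp = grpcclient.InferInput("input__0", batch.shape, "FP32")
    inp.set_data_from_numpy(batch)
    out = grpcclient.InferRequestedOutput("output__0")

    result = client.infer(model_name="tagger", inputs=[inp], outputs=[out])
    scores = result.as_numpy("output__0")  # one score per event in the batch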
Wishlist to improve R&D and deployment:
- Tools to track the full model lifecycle
- Versioning of models used in production, integrated with the training infrastructure so that re-training can be triggered when needed (see the versioning sketch after this list)
- Continuous training for Level 1 trigger
- ML in custom environments beyond the training facility's environment, for testing ML-specific facilities
- How to store large models, and versioning of the data used in production
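One widely used pattern covering the versioning items above is a tracking server plus a model registry; a sketch with MLflow (one option among several; the server URL, metric, and model names are hypothetical):

    # Sketch: record a training run, then register the resulting model under an
    # immutable version number so production can pin a version and a re-training
    # pipeline can compare new runs against it. All names are hypothetical.
    import mlflow

    mlflow.set_tracking_uri("https://mlflow.example.cern.ch")

    with mlflow.start_run() as run:
        mlflow.log_params({"learning_rate": 1e-3, "batch_size": 1024})
        mlflow.log_metric("val_auc", 0.93)   # stand-in validation metric
        mlflow.log_artifact("tagger.onnx")   # model file exported by the training job

    # Each call creates a new version of the registered model "jet-tagger".
    mv = mlflow.register_model(f"runs:/{run.info.run_id}/tagger.onnx", "jet-tagger")
    print("registered version:", mv.version)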
ALICE:
Example ML workloads with different training patterns:
- Bethe-Bloch corrections (for particle ID): simple models, trained for ~300 hours per year of data taking, expected to scale to ~1000 hours
- ITS2 (inner-tracker-based particle ID): trained once (stable detector); a light model, 30 minutes of training on an A6000 (uses the INFN Torino cluster)
- Multi-detector combination for PID (based on a transformer encoder): 1 h of training time per model on a GTX 1660
In general, trainings are relatively light, but there are many different models
Current infrastructure:
- EPN cluster equipped with 8 AMD GPUs – current use is opportunistic: when data taking (and classical reconstruction) does not fill the cluster
- Lightweight interactive access (SWAN-like) would currently be most useful for ALICE, but it is strongly limited by resource availability
- Data format is an issue (TMVA is used to run ONNX models), but development is needed to support more general ONNX operators; see the sketch below
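For comparison with the TMVA path, running an ONNX model directly through ONNX Runtime looks like the sketch below; unsupported operators surface as errors when the session is created, which is the operator-coverage gap noted above. The model file and input shape are hypothetical:

    # Sketch: run an ONNX PID model with ONNX Runtime. Session creation is
    # where an unsupported ONNX operator would fail. File name and input
    # shape are hypothetical.
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("pid_model.onnx", providers=["CPUExecutionProvider"])

    x = np.random.rand(1, 7).astype(np.float32)  # stand-in for one track's features
    input_name = sess.get_inputs()[0].name
    outputs = sess.run(None, {input_name: x})    # list with one array per model output
    print("PID scores:", outputs[0])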
LHCb:
Main use cases are in online operations and the trigger
- GNN-based tracking: tests to check whether it can run in HLT1 (Allen), but the main questions relate to model size, pile-up scaling, etc.
- Primary Vertex reconstruction
- GNN-based full event interpretation (several days of training, currently impossible to do at CERN)
Analysis ML is simpler than in the other experiments, since backgrounds are lower-rate and less complex
Issues and resources:
- Maintenance and preservation:
- Issues with maintainability of fast-simulation models, retrainability, and the management and inference of large models
- Need versioning tools for models, data, and hyperparameters, and storage for models and data (both can be very large)
- Training: need robust, flexible pipelines
At the moment, common training resources are accessible from different institutes, but not available at CERN
Need a multi-GPU (batch-like) system allowing hyper-parameter optimization (see the sketch after this list)
- Profiling of the computing resources is an important issue
- Inference: the main challenge is optimizing models to run in Allen (HLT1)
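For the multi-GPU, batch-like hyper-parameter optimization mentioned above, one common pattern is many independent batch jobs (one per GPU) sharing a single study through a database; a sketch with Optuna, where the storage URL, search space, and training function are hypothetical:

    # Sketch: each batch job runs this script; trials are coordinated through a
    # shared database, so adding jobs scales the search. Storage URL, search
    # space, and train_and_validate() are hypothetical stand-ins.
    import optuna

    def train_and_validate(lr, n_layers):
        # Stand-in for the real training job; returns a fake validation loss.
        return (lr * 1e3 - 1.0) ** 2 + 0.01 * n_layers

    def objective(trial):
        lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
        n_layers = trial.suggest_int("n_layers", 2, 8)
        return train_and_validate(lr, n_layers)

    study = optuna.create_study(
        study_name="hlt1-gnn",
        storage="postgresql://optuna@db.example.cern.ch/optuna",
        direction="minimize",
        load_if_exists=True,
    )
    study.optimize(objective, n_trials=20)  # each job contributes 20 trials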
ATS:
Automation of the accelerator infrastructure is the main scope of ML research in ATS
In addition: accelerator design and AI assistants
Constraints and limitations:
- Runs on the Technical Network (separate from the general-purpose network)
- GPUs for ML are missing in the Kubernetes cluster, on the TN, and in UCAP (the online data facility)
DISCUSSION:
- General issue with ONNX integration/inference; this is quite a technical challenge. Would it make sense to organize a day gathering experts from inside and outside CERN to resolve these issues?
Customization vs. industry solutions (ONNX Runtime is a Microsoft API):
- Do we really want to develop in-house tools?
- Difficult to keep up with the new solutions that are made available outside
- Possibility to develop TMVA in order to have a customizable solution
- The solution could be a hybrid, depending on the use case (serving vs. a model catalog)
A FEW POINTS TO FOLLOW UP ON:
- Better advertising of what tools are available in IT and how to use them
- Public cloud (on-site resources are difficult to provision): keeping data private is not an issue here, and dedicated high-speed links exist
- Storage: some applications are bottlenecked by the speed of storage access (EOS? public cloud?)
- MLOps and hyper-parameter optimization: a crucial aspect; use a central repository of models
- Hyper-parameter optimization: already possible on ml.cern.ch, but multi-GPU batch support is needed (essential in the coming years)
- Access to hardware resources in a coordinated, structured way; currently no means to set up services on experiment resources (ALICE EPN, LHCb HLT1) that combine ML usage with other standard usage
- Integrating ML software
- Profiling and benchmarking (deployment optimization requires dedicated expertise); a project that could be done in collaboration with industry
- Dedicated architectures and associated licences for dedicated software