15–19 Sept 2025
CERN
Europe/Zurich timezone

Zero-conversion reading of HEP data for training with common ML tools

16 Sept 2025, 11:15
5m
40/S2-A01 - Salle Anderson (CERN)

4. AI Infrastructure for Model Training

Speaker

Dr Vincenzo Eduardo Padulano (CERN)

Description

Training ML models on High Energy Physics data currently requires either expensive copying and conversion of the data to an intermediate format, or custom I/O pipelines written by the end user. ROOT provides a prototype system that ingests data in the common TTree format directly into the ML model; the prototype also supports the future RNTuple format. This requires zero conversion steps and takes a single function call from the end user. This streamlined approach to ingesting data into ML models can be made generic and cross-experiment. Work is still required to bring this prototype into production and to test it in distributed scenarios and in training that involves GPUs.
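To make the idea of reading batches straight from columnar data more concrete, here is a minimal sketch in plain Python. It is not ROOT's API: the function `iter_batches`, the column names `pt`, `eta` and `label`, and the in-memory dictionary standing in for a TTree are all hypothetical, chosen only for illustration. The sketch shows the general pattern of handing a training loop (features, target) batches through one call, without first writing an intermediate file.

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple

# Hypothetical stand-in for a columnar dataset (e.g. the branches of a tree).
Columns = Dict[str, Sequence[float]]
Batch = Tuple[List[List[float]], List[float]]


def iter_batches(
    columns: Columns,
    features: Sequence[str],
    target: str,
    batch_size: int,
    shuffle: bool = False,
    seed: int = 0,
) -> Iterator[Batch]:
    """Lazily yield (features, target) batches directly from the columns."""
    n_rows = len(columns[target])
    indices = list(range(n_rows))
    if shuffle:
        # Seeded shuffle so that runs are reproducible.
        random.Random(seed).shuffle(indices)
    for start in range(0, n_rows, batch_size):
        chunk = indices[start:start + batch_size]
        x = [[columns[name][i] for name in features] for i in chunk]
        y = [columns[target][i] for i in chunk]
        yield x, y


# Toy data: five "events" with two features and one label (illustrative values).
toy = {
    "pt": [10.0, 20.0, 30.0, 40.0, 50.0],
    "eta": [0.1, -0.2, 0.3, -0.4, 0.5],
    "label": [0.0, 1.0, 0.0, 1.0, 1.0],
}

batches = list(iter_batches(toy, ["pt", "eta"], "label", batch_size=2))
```

With five rows and a batch size of two, the generator produces three batches, the last one holding the single remaining row. A real loader would read from disk in chunks rather than from a dictionary in memory.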

CERN group/ Experiment

EP-SFT

Working area Area 4: AI Infrastructure for Model Training
Project goals
Problem: Common ML tools do not natively support loading data in HEP formats.
Intermediate goal: Benchmark native ROOT data loading into batches for ML training across multiple ML models, datasets, and computing platforms.
Final goal: Develop an easy-to-use API that provides native, efficient, and scalable loading of ROOT datasets into ML models, removing the need for intermediate data conversions and unnecessary bookkeeping.
Timeline
Year 1: Research typical physics use cases that employ ML training workflows. Use this knowledge to benchmark and profile the existing prototype in realistic scenarios. Report continuously and take stock by defining the most important optimizations and missing features.
Year 2: Act on the knowledge accumulated in Year 1; extend the data loading tool and bring it to production-grade level.
Year 3: Demonstrate possible integration of the tool into experiment frameworks and analyses that currently require expensive data duplication and bookkeeping.
Available person power 0
Additional person power request 1 Graduate + 0.2 Staff for supervision
Is this an already ongoing activity? Yes
Indicative hardware resources needs 1 PC equipped with a GPU
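
The benchmarking named in the intermediate goal could start from a harness like the one below. This is only a sketch under stated assumptions: `measure_throughput` and `toy_loader` are hypothetical names, and the toy loader stands in for whatever batch source is being profiled. It times one full pass over the batches and reports rows per second.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def measure_throughput(
    make_batches: Callable[[], Iterable[List[float]]],
) -> Tuple[int, int, float]:
    """Consume every batch once; return (batch count, row count, seconds)."""
    start = time.perf_counter()
    n_batches = 0
    n_rows = 0
    for batch in make_batches():
        n_batches += 1
        n_rows += len(batch)
    elapsed = time.perf_counter() - start
    return n_batches, n_rows, elapsed


def toy_loader(n_rows: int = 1000, batch_size: int = 128) -> Iterator[List[float]]:
    """Hypothetical batch source; a real benchmark would read from a dataset."""
    data = [float(i) for i in range(n_rows)]
    for start in range(0, n_rows, batch_size):
        yield data[start:start + batch_size]


n_batches, n_rows, elapsed = measure_throughput(toy_loader)
rate = n_rows / elapsed if elapsed > 0 else float("inf")
print(f"{n_batches} batches, {n_rows} rows, {rate:.0f} rows/s")
```

For a meaningful comparison across models, datasets, and platforms, the same harness would be run several times per configuration, with the first pass discarded to limit cache warm-up effects.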
