Speaker
Description
The CERN EP-SFT group, in their summary paper, proposed that "a common end-to-end fast-simulation tool could be created across experiments to complement the GEANT library." Building on the experience gained by LHCb in developing its Flash Simulation framework, Lamarr, several key challenges have emerged in integrating machine learning (ML) algorithms into high-energy physics software stacks:
- ML models are typically lightweight, but the event-level granularity of the Gaudi scheduler complicates batching particles across multiple events. This results in frequent model invocations and significant overhead when using dedicated runtimes.
- Dedicated runtimes are optimized for multithreading, which may conflict with Gaudi's own multithreading management.
- Constructing ML pipelines (comprising preprocessing, inference, and postprocessing) requires C++ development, a skill set often distinct from that of ML engineers, who typically work in Python.
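The batching problem above can be illustrated with a minimal sketch (the model, shapes, and event counts are invented for illustration, not taken from Lamarr):

```python
import numpy as np

def model(features: np.ndarray) -> np.ndarray:
    """Stand-in for a lightweight ML model: a single linear layer."""
    weights = np.array([0.5, -1.0, 2.0])
    return features @ weights

# Event-level scheduling: the model is invoked once per event,
# on the few particles that event contains.
rng = np.random.default_rng(0)
events = [rng.random((3, 3)) for _ in range(100)]  # 100 events, 3 particles each
per_event = [model(ev) for ev in events]           # 100 small invocations

# Cross-event batching: a single invocation over all particles at once.
batch = np.vstack(events)                          # shape (300, 3)
batched = model(batch)                             # 1 large invocation

# The results are identical; the difference is per-call overhead,
# which dominates when each call dispatches into a dedicated runtime.
assert np.allclose(np.concatenate(per_event), batched)
```

With an event-level scheduler only the first pattern is available, which is why lightweight models still incur significant runtime overhead.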
To address these challenges, Lamarr adopted an XML-based pipeline description language. This enables the composition of in-process computing blocks, distributed as shared objects via CVMFS. These blocks are transpiled from Python to C using tools such as scikinC and keras2c. This strategy shares conceptual similarities with SOFIE, a framework developed by CERN EP-SFT and used by LHCb.
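A hypothetical sketch of such an XML pipeline description follows; the element names, attributes, and CVMFS paths are illustrative only, not Lamarr's actual schema:

```xml
<!-- Illustrative only: composes three in-process computing blocks,
     each loaded from a shared object deployed on CVMFS. -->
<pipeline name="example_pid">
  <block name="preprocessing"
         library="/cvmfs/lhcb.cern.ch/…/libpreprocess.so"
         symbol="standard_scaler"/>   <!-- transpiled with scikinC -->
  <block name="inference"
         library="/cvmfs/lhcb.cern.ch/…/libmodel.so"
         symbol="pid_model"/>         <!-- transpiled with keras2c -->
  <block name="postprocessing"
         library="/cvmfs/lhcb.cern.ch/…/libpostprocess.so"
         symbol="inverse_scaler"/>
</pipeline>
```

The key property is that the pipeline is composed declaratively, so ML engineers can develop and validate each block in Python while the C++ application loads only transpiled shared objects.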
We propose a collaborative project to gather requirements and draft an implementation plan for a multi-experiment, multi-application ML deployment system. This system would target high-throughput computing (HTC) environments and multithreaded C++ applications.
Key considerations include:
- Intermediate Data Representation: Efficient in-memory formats for intermediate data exchanged between computing blocks, supporting batch processing and cross-language accessibility (e.g., C++ and Python). Apache Arrow Tables and ROOT's RDataFrame are promising examples.
- Experiment Independence: Leveraging Lamarr’s architecture as a foundation for a generalized, experiment-agnostic framework.
- Graph-Based Data Structures: Enabling the definition and execution of ML pipelines on heterogeneous graph data representing particles, vertices, and reconstructed physics objects.
We believe that Lamarr’s implementation offers a valuable starting point and could serve as a prototype for a broader, experiment-independent solution.
CERN group / Experiment
LHCb, EP-SFT
| Working area | Area 1: Cutting Edge AI for Offline Data Processing |
|---|---|
| Project goals | Participate in the activity, using Lamarr as one of the examples/backbones for a common end-to-end flash simulation. |
| Timeline | 3 years |
| Available person power | 0.1 FTE |
| Additional person power request | 1 FTE |
| Is this an already ongoing activity? | Yes |