Fast Machine Learning for Science Conference 2025

ETH Zurich

HIT E 51, Siemens Auditorium, ETH Zurich, Hönggerberg campus, 8093 Zurich, Switzerland
Benjamin Ramhorst (ETH Zurich), Denis-Patrick Odagiu (ETH Zurich (CH)), Marius Köppel (ETH Zurich (CH)), Maurizio Pierini (CERN), Sioni Paris Summers (CERN), Thea Aarrestad (ETH Zurich (CH))
Description

Swiss Institute of Particle Physics (CHIPP)

The Fast Machine Learning for Science Conference 2025 will be hosted by ETH Zurich from September 1 to 5, 2025.

As experimental methods continue to evolve, generating increasingly complex and high-resolution datasets, machine learning (ML) is becoming an essential tool across numerous scientific disciplines. This conference will explore emerging ML methods and their applications in scientific discovery, focusing on processing technologies and strategies to accelerate deep learning and inference.

Topics

Topics include, but are not limited to:

Machine Learning Algorithm Design & Optimization

  • Novel efficient architectures
  • Hyperparameter optimization and tuning
  • Model compression (quantization, sparsity)
  • Hardware/software co-design for ML efficiency

Accelerated Inference & Real-Time Processing

  • Low-latency ML for scientific experiments
  • FPGA/GPU-based ML acceleration
  • ML for trigger systems and data acquisition
  • On-detector and edge inference

Scalable & Distributed ML Systems

  • Cloud-based, accelerated ML processing
  • Distributed inference
  • Acceleration-as-a-service

Advanced Hardware & Computing Architectures

  • Specialized AI accelerators
  • Heterogeneous computing platforms for ML
  • Beyond CMOS

Scientific Applications of Fast ML

  • High-energy physics, astrophysics and astronomy
  • Space science and satellite-based ML
  • Genomics and medical imaging
  • Climate and environmental modeling
  • Materials science
  • Robotics

Important Deadlines

  • Abstract Submission: July 1, 2025
  • Registration Deadline: August 15, 2025
  • Extended Registration Deadline: August 25, 2025

 

We welcome abstracts for:

  • Scientific talks
  • Posters (A0 vertical)
  • 2-4 hour Monday tutorials
  • 2-3 hour Wednesday topical (birds-of-a-feather) sessions

More information and registration details will follow. We look forward to welcoming you in Zurich this September!

Best regards,
On behalf of the Organizers

Scientific Committee

  • Thea K. Årrestad (ETH Zürich)
  • Javier Duarte (UCSD)
  • Phil Harris (MIT)
  • Burt Holzman (Fermilab)
  • Scott Hauck (U. Washington)
  • Shih-Chieh Hsu (U. Washington)
  • Sergo Jindariani (Fermilab)
  • Mia Liu (Purdue University)
  • Allison M. Deiana (Southern M. U.)
  • Mark Neubauer (U. Illinois U-C)
  • Jennifer Ngadiuba (Fermilab)
  • Maurizio Pierini (CERN)
  • Sioni Summers (CERN)
  • Alex Tapper (Imperial College)
  • Nhan Tran (Fermilab)

Organising Committee

  • Thea K. Årrestad (ETH Zürich) - Chair
  • Marius Köppel (ETH Zürich) - Co-Chair
  • Cristina Botta (CERN/UZH)
  • Annapaola De Cosa (ETH)
  • Patrick Odagiu (ETH Zürich)
  • Maurizio Pierini (CERN)
  • Benjamin Ramhorst (ETH Zürich)
  • Anna Sfyrla (UniGe)
  • Sioni Summers (CERN)
  • Jennifer Zollinger (ETH Zürich)

    • 1
      Registration
    • Tutorials
      • 2
        hls4ml

        COPL Common Room HIT F 23.2

        FPGAs provide unique advantages in the realm of machine learning acceleration. Unlike CPUs and GPUs, FPGAs allow for custom parallelism, data type precision, and dataflow tailored specifically to the workload. Their reconfigurability enables the design of optimised hardware circuits that reduce latency and power consumption and improve throughput. Some common examples of FPGA-accelerated neural networks include particle classification, in-network traffic sniffing, and image segmentation for autonomous vehicles.

        In this tutorial, we will introduce hls4ml, an open-source library for real-time deployment of neural networks on FPGAs, and hold a hands-on demo. hls4ml allows a seamless conversion from high-level models (e.g., from Keras or PyTorch) to low-latency, low-power FPGA designs. We will cover the design choices behind hls4ml, from deeply pipelined dataflow architectures to model quantization and pruning. The hands-on demo will allow participants to experiment with hls4ml's Python API and try out the following concepts:

        • Quantization-aware training with QKeras
        • Model conversion with hls4ml
        • Analysis of model latency and resource utilisation
        • Tuning of model resources and latency

        Finally, the tutorial will conclude with a live demo of model inference on a real FPGA; a minimal sketch of the conversion flow is shown below.
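
        The sketch that follows is illustrative rather than the tutorial notebook: the layer sizes, quantizer settings, and FPGA part number are placeholder assumptions.

        # Minimal hls4ml flow (illustrative): a small quantization-aware QKeras model is
        # converted to an HLS project and emulated bit-accurately in software; the final
        # build step (commented out) would run synthesis and report latency and resources.
        import numpy as np
        from tensorflow.keras.models import Sequential
        from qkeras import QDense, QActivation, quantized_bits, quantized_relu
        import hls4ml

        # Quantization-aware model: weights and activations constrained to a few bits.
        model = Sequential([
            QDense(32, input_shape=(16,),
                   kernel_quantizer=quantized_bits(6, 0, alpha=1),
                   bias_quantizer=quantized_bits(6, 0, alpha=1)),
            QActivation(quantized_relu(6)),
            QDense(5, kernel_quantizer=quantized_bits(6, 0, alpha=1),
                   bias_quantizer=quantized_bits(6, 0, alpha=1)),
        ])

        # Convert to an HLS project; the FPGA part number is a placeholder.
        config = hls4ml.utils.config_from_keras_model(model, granularity='name')
        hls_model = hls4ml.converters.convert_from_keras_model(
            model, hls_config=config, output_dir='hls_prj', part='xcvu9p-flga2104-2-e')

        hls_model.compile()                                       # C simulation for bit-accurate checks
        y_hls = hls_model.predict(np.random.rand(10, 16).astype(np.float32))
        # hls_model.build(csim=False)                             # full synthesis: latency/resource reports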

        Speaker: Benjamin Ramhorst (ETH Zurich)
      • 3
        NeuraLUT-Assemble: Hardware-aware Assembling of Sub-Neural Networks for Efficient LUT Inference

        HIT E 41.1

        Neural networks (NNs) have gained significant interest in recent years due to their prevalence in AI applications. Lookup table (LUT) based NN architectures have emerged as a promising solution for ultra-low latency inference on reconfigurable hardware such as field programmable gate arrays (FPGAs). These techniques promise significant enhancements in both resource efficiency and inference speed, and have been shown to be effective across multiple latency-critical domains such as particle physics. However, existing LUT-based designs suffer from accuracy degradation because the fan-in of each neuron is limited by the exponential scaling of LUT resources with input width. In practice, prior work has resolved this tension by relying on extremely sparse models.

        In this tutorial, we will mainly demonstrate our latest work called NeuraLUT-Assemble, a state-of-the-art framework that addresses these limitations by combining mixed-precision techniques with the assembly of larger neurons from smaller units, thereby increasing connectivity while keeping the number of inputs of any given LUT manageable. NeuraLUT-Assemble closes the accuracy gap between LUT-based methods and (fully-connected) MLP-based models, achieving competitive accuracy on tasks such as network intrusion detection, digit classification, and jet classification. Additionally, we will illustrate NeuraLUT-Assemble's efficiency through a live demonstration of the network implemented on a physical FPGA.

        At the beginning of our tutorial, we will walk through the prior work that led to NeuraLUT-Assemble. We will start by reviewing two broad families of ultra-low-latency FPGA inference: (1) methods that learn the LUTs directly and (2) methods that train traditional neural networks with constraints so they can later be fully mapped into LUTs. For the first category, we will briefly introduce LUTNet and DWNNs, which make LUT behavior differentiable by approximating gradients. In the second category, we will cover LogicNets, which maps sparse quantized networks to LUTs but suffers from limited fan-in; PolyLUT, which increases expressivity by embedding multivariate polynomials into neurons; and NeuraLUT, which hides small MLPs inside each LUT to combine flexibility with trainability. We will also touch on ReducedLUT, a complementary technique that performs post-training logic minimization by exploiting don't-care conditions to further compress LUTs.

        After covering this background, we will introduce the methodology behind NeuraLUT-Assemble. We will explain the tree-based assembling strategy in detail, showing how it enables the construction of larger fan-in neurons by composing them from smaller, manageable L-LUTs. This section will include an intuitive breakdown of different assembly configurations, with visual diagrams and comparisons to traditional approaches. Next, we will walk through the training process, including the use of quantization-aware training, skip connections embedded within LUTs, and the structured pruning strategy used to determine connectivity. We will then present a hands-on demonstration of the full toolflow—from network specification in PyTorch to Verilog generation, FPGA synthesis, and deployment. The tutorial will conclude with an evaluation of NeuraLUT-Assemble across tasks such as MNIST and jet classification, highlighting empirical trade-offs in accuracy and area-delay product. Throughout, we aim to provide practical insight into designing, training, and deploying efficient hardware-aware neural networks using LUT-based techniques on FPGAs.
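
        As a rough illustration of the assembling idea (this is not the NeuraLUT-Assemble code; the fan-ins, hidden sizes, and tree depth are assumptions), the PyTorch sketch below builds one large-fan-in neuron as a tree of small sub-networks, each with a bounded number of inputs so that it can later be enumerated into a single LUT.

        # Hedged illustration: a neuron with effective fan-in 16 assembled from sub-networks
        # whose individual fan-in is only 4, keeping every eventual LUT's input width small.
        import torch
        import torch.nn as nn

        class SubNeuron(nn.Module):
            """Tiny MLP with bounded fan-in; after training it can be enumerated into one LUT."""
            def __init__(self, fan_in: int, hidden: int = 4):
                super().__init__()
                self.net = nn.Sequential(nn.Linear(fan_in, hidden), nn.ReLU(), nn.Linear(hidden, 1))

            def forward(self, x):
                return self.net(x)

        class AssembledNeuron(nn.Module):
            """Tree of sub-neurons: leaves see disjoint input slices, a root combines their outputs."""
            def __init__(self, total_inputs: int = 16, lut_fan_in: int = 4):
                super().__init__()
                self.lut_fan_in = lut_fan_in
                self.leaves = nn.ModuleList([SubNeuron(lut_fan_in) for _ in range(total_inputs // lut_fan_in)])
                self.root = SubNeuron(len(self.leaves))

            def forward(self, x):
                chunks = x.split(self.lut_fan_in, dim=-1)
                leaf_out = torch.cat([leaf(c) for leaf, c in zip(self.leaves, chunks)], dim=-1)
                return self.root(leaf_out)

        y = AssembledNeuron()(torch.rand(8, 16))    # batch of 8, effective fan-in 16, max LUT width 4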

        Speakers: Marta Andronic (Imperial College London), Mr Oliver Cassidy (Imperial College London)
      • 4
        Part 1: Agile Hardware Design for AI: A Hands-On Tutorial with SODA and Bambu

        HIT F 31.2

        This tutorial explores the growing demand for domain-specific hardware accelerators driven by the rapid evolution of AI and data analytics. Traditional hardware design cycles are too slow to keep up with the pace of algorithmic innovation. To address this, new agile hardware design methodologies are emerging, leveraging compiler technologies and High-Level Synthesis (HLS) to automate and accelerate the design process.

        The tutorial introduces the SODA Synthesizer, an open-source, compiler-based toolchain that enables the generation of efficient hardware accelerators from high-level algorithm descriptions.

        It consists of:

        • SODA-OPT: A front-end and optimizer built on the MLIR (Multi-Level Intermediate Representation) framework. It interfaces with popular Python-based data science and machine learning frameworks, performing hardware/software partitioning and domain-specific optimizations.
        • Bambu: A state-of-the-art open-source HLS tool developed at Politecnico di Milano. It translates optimized high-level code into hardware designs, supporting both FPGA and ASIC targets, and integrates with RTL simulation and logic synthesis tools.

        The tutorial highlights the limitations of traditional HLS tools, which were primarily designed for digital signal processing and required expertise in hardware description languages (HDLs). Modern HLS tools now support parallel programming models and integrate with high-level frameworks, making them more accessible to software developers and domain scientists.

        Key topics include:

        1. Current trends and methodologies in agile hardware design.
        2. Advantages and limitations of conventional HLS approaches.
        3. The role of MLIR in supporting diverse domains and frameworks.
        4. Hands-on demonstrations of both SODA-OPT and Bambu, showing how to go from Python code to optimized hardware accelerators.

        The tutorial also demonstrates how the toolchain integrates with both commercial and open-source Electronic Design Automation (EDA) tools, such as OpenROAD, enabling a complete end-to-end flow from Python to silicon. The design flow is flexible and supports both FPGA prototyping and ASIC implementation, making it suitable for a wide range of applications in AI and data-intensive computing.

        This tutorial is intended for researchers, engineers, and developers interested in accelerating AI and data-driven applications through agile hardware design, even without deep expertise in traditional hardware design languages.

        References

          [1] G. Gozzi, M. Fiorito, S. Curzel, C. Barone, V. G. Castellana, M. Minutoli, A. Tumeo, and F. Ferrandi: SPARTA: High-Level Synthesis of Parallel Multi-Threaded Accelerators. ACM Trans. Reconfigurable Technol. Syst. 18, 1, Article 9 (March 2025).
          [2] N. Bohm Agostini, S. Curzel, J. Zhang, A. Limaye, C. Tan, V. Amatya, M. Minutoli, V. G. Castellana, J. B. Manzano, D. Brooks, G-Y. Wei, A. Tumeo: Bridging Python to Silicon: The SODA Toolchain. IEEE Micro 42(5): 78-88 (2022) (Best paper for 2022)
          [3] N. Bohm Agostini, S. Curzel, V. Amatya, C. Tan, M. Minutoli, V. G. Castellana, J. B. Manzano, D. R. Kaeli, A. Tumeo: An MLIR-based Compiler Flow for System-Level Design and Hardware Acceleration. ICCAD 2022: 6:1-6:9
          [4] F. Ferrandi, V. G. Castellana, S. Curzel, P. Fezzardi, M. Fiorito, M. Lattuada, M. Minutoli, C. Pilato, A. Tumeo: Invited: Bambu: an Open-Source Research Framework for the High-Level Synthesis of Complex Applications. DAC 2021: 1327-1330
        Speakers: Mr Giovanni Gozzi (Politecnico di Milano), Mr Michele Fiorito (Politecnico di Milano), Dr Vito Giovanni Castellana (Pacific Northwest National Laboratory), Dr Antonino Tumeo (Pacific Northwest National Laboratory), Fabrizio Ferrandi (Politecnico di Milano)
      • 10:30
        Break, change rooms

        ETH Zurich

        HIT E 51, Siemens Auditorium, ETH Zurich, Hönggerberg campus, 8093 Zurich, Switzerland
      • 5
        Coyote v2: Open-source Abstractions and Infrastructure for FPGAs

        HIT E 41.1

        As Moore’s Law and Dennard Scaling reach their limits, computing is shifting toward heterogeneous hardware for large-scale data processing. Cloud vendors are deploying accelerators, like GPUs, DPUs, and FPGAs, to meet growing computational demands of ML and big data.

        While FPGAs offer great flexibility and performance, integrating them into larger systems remains challenging due to the long development cycles and expertise required. To address this, we introduce Coyote v2, an open-source FPGA shell with high-level, OS-like abstractions. Broadly speaking, Coyote v2 strives to simplify the application deployment process and enable developers to focus solely on their application logic and its performance, rather than infrastructure development. By providing clear and simple-to-use interfaces in both hardware and software, Coyote v2 allows users to leverage these abstractions for customized acceleration offloads and to build distributed, heterogeneous computer systems consisting of many FPGAs, GPUs and CPUs. Coyote v2 has been re-engineered for flexibility of use as the base platform for multi-tenant accelerators, SmartNICs, and near-memory accelerators.

        This tutorial will cover Coyote v2's vFPGAs, which enable users to seamlessly deploy arbitrary applications on FPGAs, the built-in networking stacks for distributed applications and, finally, the shared virtual memory model, enabling FPGA interaction with other hardware (CPU, GPU, storage). Additionally, we will showcase Coyote's high-level software API, which enables easy, yet high performance, interaction from C++ with the FPGA. Finally, we will showcase Coyote's integration with hls4ml, performing inference on a PCIe-attached FPGA from a few lines of Python.

        Speaker: Benjamin Ramhorst (ETH Zurich)
      • 6
        Designing and Deploying Low-Latency Neural Networks on FPGAs with HGQ and da4ml

        COPL Common Room HIT F 23.2

        Neural networks with a latency requirement on the order of microseconds are widely used at the CERN Large Hadron Collider, particularly in the low-level trigger system. To satisfy this latency requirement, these neural networks are often deployed on FPGAs.

        This tutorial aims to provide a practical, hands-on guide to a software-hardware co-design workflow using the HGQ2 and da4ml libraries. Compared with existing workflows, this approach has been shown to reduce the resource consumption of the resulting hardware designs by up to two orders of magnitude while maintaining the same accuracy. In particular, the following topics are covered:

        1. Setup and Basic Concepts

           • Environment: We will cover installing the HGQ2 and da4ml packages via pip, configuring Keras v3 backends, and understanding the basics of numba JIT-compilation used in da4ml to avoid common pitfalls.

           • HGQ Methodology: The key concepts of HGQ will be introduced, including the use of a surrogate gradient for differentiable bit-widths and the construction of a differentiable hardware resource estimate incorporated into the loss function for efficient model training.

           • da4ml Methodology: An overview of da4ml's two-stage hybrid algorithm will be provided, including the coarse-grained graph-based reduction and the fine-grained common subexpression elimination to create multiplier-free designs. We will explain how this process aligns with HGQ's training goal by effectively reducing the number of non-zero digits in the weight matrix.

        2. The Co-Design Workflow

           • Training with HGQ: We will define and train neural networks from scratch in HGQ, covering the basics of configuring fixed-point quantizers and applying HGQ to architectures ranging from simple DNNs to MLP-Mixers. Best practices for defining models that can be converted to FPGA firmware with bit-exactness will be discussed. In addition, guidance will be given on how to emulate QKeras behavior in HGQ2 when necessary.

           • Synthesis with hls4ml and da4ml: We will demonstrate how to convert an HGQ-trained model using hls4ml for bit-exact firmware generation, and explain how this is achieved in the background through a model-wise symbolic precision propagation. We will also show how to enable and configure da4ml using the distributed_arithmetic strategy in hls4ml (see the sketch after this list).

        3. Analysis and Advanced Techniques

           • RTL Generation: For compatible network architectures, we will explore da4ml's ability to generate fully pipelined Verilog directly from a trained model. We will also demonstrate how to verify the design's correctness with streamlined Verilator emulation.

           • Performance Review: We will analyze and compare key hardware metrics (initiation interval, latency, Fmax, and resource utilization) from both the hls4ml and standalone RTL workflows to discuss their trade-offs.

           • Tuning Techniques: We will cover more advanced techniques, such as beta scheduling or targeting a specific resource budget in HGQ2 with PID control of beta, automatically logging models on the Pareto front to explore the accuracy-resource trade-off, and debugging common issues like divergent bit-widths during conversion.
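
        The sketch below illustrates the hls4ml conversion step only. It is a hedged example, not the tutorial material: the stand-in model, the FPGA part number, and the exact configuration keys are assumptions and may differ between hls4ml versions.

        # Hedged sketch: enabling the da4ml-backed distributed arithmetic strategy when
        # converting a trained model with hls4ml. The Sequential model below is only a
        # stand-in for the HGQ2-trained model produced earlier in the workflow.
        import hls4ml
        from tensorflow.keras.models import Sequential
        from tensorflow.keras.layers import Dense

        model = Sequential([Dense(16, activation='relu', input_shape=(8,)), Dense(4)])

        config = hls4ml.utils.config_from_keras_model(model, granularity='name')
        config['Model']['Strategy'] = 'distributed_arithmetic'   # strategy name as described above; key layout may vary by version
        hls_model = hls4ml.converters.convert_from_keras_model(
            model, hls_config=config, output_dir='hgq_da_prj', part='xcvu13p-flga2577-2-e')
        hls_model.compile()                                      # C simulation for a bit-exactness check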

        Speaker: Chang Sun (California Institute of Technology (US))
      • 7
        Part 2: Agile Hardware Design for AI: A Hands-On Tutorial with SODA and Bambu

        HIT F 31.2

        Continuation of Part 1: see the abstract and references above.
        Speakers: Mr Giovanni Gozzi (Politecnico di Milano), Mr Michele Fiorito (Politecnico di Milano), Dr Vito Giovanni Castellana (Pacific Northwest National Laboratory), Dr Antonino Tumeo (Pacific Northwest National Laboratory), Fabrizio Ferrandi (Politecnico di Milano), Nicolo Ghielmetti (CERN)
      • 8
        Super Neural Architecture Codesign Package (SNAC-Pack)

        Siemens Auditorium

        Machine learning has become a critical tool for analysis and decision-making across a wide range of scientific domains, from particle physics to materials science. However, the deployment of neural networks in resource-constrained environments, such as hardware accelerators and edge devices, remains a significant challenge. This often requires specialized expertise in both neural architecture design and hardware optimization.

        To address this challenge, we introduce the Super Neural Architecture Codesign Package (SNAC-Pack), an integrated framework that automates the discovery and optimization of neural network architectures specifically tailored for hardware deployment. SNAC-Pack combines two powerful tools: Neural Architecture Codesign, which performs a two-stage neural architecture search for optimal models, and the Resource Utilization and Latency Estimator, which predicts how an architecture will perform when implemented on an FPGA.

        SNAC-Pack streamlines the neural architecture design process by enabling researchers to automatically explore diverse architectures optimized for both task performance and hardware efficiency. By providing quick estimates of resource utilization and latency without requiring time-consuming synthesis, SNAC-Pack accelerates the development cycle. State-of-the-art compression techniques, such as quantization-aware training and pruning, further optimize the models, resulting in architectures that can be deployed to FPGA hardware.

        This tutorial provides a hands-on introduction to SNAC-Pack, guiding participants through the complete workflow from dataset preparation to hardware deployment. By the end of the tutorial, attendees will be able to run SNAC-Pack for their own applications, achieving improvements in accuracy, latency, and resource utilization compared to naive hand-crafted approaches.
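
        As a purely illustrative sketch of the codesign loop described above (this is not SNAC-Pack's API; the search space, estimator, and scoring weight are invented stand-ins), candidate architectures can be scored by validated accuracy minus a fast surrogate estimate of hardware cost, avoiding full synthesis inside the loop:

        # Illustrative hardware-aware architecture search loop (not SNAC-Pack's API).
        import random

        SEARCH_SPACE = [{'layers': l, 'width': w} for l in (1, 2, 3) for w in (8, 16, 32, 64)]

        def estimated_cost(arch):
            # Stand-in for a resource/latency estimator: grows with parameter count.
            return (arch['width'] ** 2 * arch['layers']) / 1e4

        def validated_accuracy(arch):
            # Stand-in for training and validating the candidate model.
            return 0.7 + 0.05 * arch['layers'] + 0.001 * arch['width'] - 0.02 * random.random()

        def codesign_score(arch, alpha=0.2):
            # Trade task performance off against estimated hardware cost.
            return validated_accuracy(arch) - alpha * estimated_cost(arch)

        best = max(SEARCH_SPACE, key=codesign_score)
        print('selected architecture:', best)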

        Speaker: Dmitri Demler
    • 9
      Registration
    • Social: Welcome
    • Invited talks
      • 10
        Reasoning Language Models: Overview and Blueprint
        Speaker: Maciej Besta (ETH Zurich)
      • 11
        Real-time mixed-signal electronic circuits for understanding and implementing neural computation

        While machine learning has made tremendous progress in recent years, there is still a large gap between artificial and natural intelligence. Closing this gap requires combining fundamental research in neuroscience with mathematics, physics, and engineering to understand the principles of neural computation and cognition. Mixed-signal subthreshold analog and asynchronous digital electronic integrated circuits offer an additional means of exploring neural computation, by providing a computational substrate that shares many similarities with that of biological brains. In this subthreshold region of operation, transistor channels employ the same physics of carrier transport (diffusion) as the protein channels of real neurons. Complex neuromorphic circuits and networks built following this approach therefore share many similarities with real synapses, neurons, and cortical neural circuits. In this presentation I will demonstrate how to build neuromorphic processors that use the physics of their computational substrate to directly emulate the physics of biological neural processes in real time. I will show how to build complex recurrent electronic neural circuits with dynamics and response properties strikingly similar to those measured in real neural networks, and I will argue that these systems can be used to complement numerical simulations in both basic research and real-world applications.

        Speaker: Giacomo Indiveri (ETH Zurich)
    • 15:30
      Coffee
    • Invited talks
      • 12
        Real-time inference at the LHC

        The real-time processing of data created by the Large Hadron Collider's (LHC) experiments, amounting to over 10% of worldwide internet traffic, is one of the greatest computing challenges ever attempted. I will discuss the concrete applications of real-time processing in the LHC's main experiments, and the technological innovations in this area over the past decades. I will also reflect on the development of communities focused on real-time data processing in experimental HEP, and describe the ways in which these communities have expanded the physics reach of their experiments far beyond what had been originally imagined. Finally I will look ahead to the challenges facing the LHC experiments in the high-luminosity era of the LHC, and touch on ways in which the widespread availability of precision timing in our detectors will transform real-time processing in the next decades.

        Speaker: Vava Gligorov (Centre National de la Recherche Scientifique (FR))
      • 13
        Real-time inference in gravitational wave astronomy (REMOTE)
        Speaker: Maximilian Dax (ELLIS Institute Tübingen)
    • 14
      Commute to the ETH Main Building (ETH Link at 17:34 or 17:54); meet at Polyterrasse
    • Social: Reception, Dozentenfoyer (ETH Main Building)

    • Invited talks
      • 15
        The JAX scientific ecosystem

        This talk provides an overview of several libraries in the open-source JAX ecosystem (such as Equinox, Diffrax, Optimistix, ...). In short, we have been building an "autodifferentiable GPU-capable scipy". These libraries offer the foundational core of tools that have made it possible for us to train neural networks (e.g. score-based diffusions for image generation), solve PDEs, and smoothly handle hybridisations of the two (e.g. fit neural ODEs to scientific data). By the end of the talk, the goal is for you to be able to walk away with a slew of new modelling tools, suitable for tackling problems both in ML and in science.
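
        As a flavour of this composition (a minimal sketch written from memory; the exact Equinox and Diffrax signatures should be checked against their documentation), an Equinox MLP can act as the vector field inside a Diffrax ODE solve, with the whole computation differentiable under JAX:

        # Hedged sketch: a neural ODE built from JAX-ecosystem pieces.
        import jax
        import jax.numpy as jnp
        import equinox as eqx
        import diffrax

        mlp = eqx.nn.MLP(in_size=2, out_size=2, width_size=32, depth=2, key=jax.random.PRNGKey(0))

        def vector_field(t, y, args):
            return mlp(y)                      # the learned dynamics dy/dt

        sol = diffrax.diffeqsolve(
            diffrax.ODETerm(vector_field), diffrax.Tsit5(),
            t0=0.0, t1=1.0, dt0=0.01, y0=jnp.array([1.0, 0.0]),
        )
        print(sol.ys)                          # state at t1; gradients flow through the solve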

        Speaker: Patrick Kidger (Cradle.bio)
      • 16
        Wearable TinyML Platforms for Biosignal Intelligence Across the Body

        Most commercial wearables still capture only basic metrics such as step counts or heart rate, and remain closed systems without access to raw data. In this talk, I will present our holistic approach to full-body biosignal intelligence, where ultra-low-power embedded platforms and machine learning algorithms are co-designed to capture and process signals from the brain, eyes, muscles, and cardiovascular system in real time, while sustaining day-long battery lifetimes. I will show how open, modular platforms can be adapted into diverse wearable form factors, and how tailored ML algorithms make these signals usable for applications such as seizure detection and eye-movement classification. Finally, I will discuss how this vision extends to emerging modalities such as wearable ultrasound, representing the next leap in multimodal, ML-enabled wearables.

        Speaker: Dr Andrea Cossettini (ETH Zurich)
    • 10:30
      Coffee
    • Invited talks
      • 17
        Real-time ML and neuromorphic computing for smart robots
        Speaker: Yulia Sandamirskaya (Zurich University of Applied Sciences)
    • 12:00
      Lunch
    • Contributed talks
      • 18
        FINN+: Towards Hassle-Free Co-Design of FPGA DNN Inference Accelerators

        Custom FPGA dataflow accelerators for DNN inference can enable unprecedented performance and efficiency for many applications. Dataflow accelerator compilers, such as the FINN framework, have improved in recent years and allow practitioners to explore this technology without requiring in-depth FPGA knowledge.

        However, the overall design process remains quite tedious, time-consuming, and often requires significant manual intervention. This is primarily caused by limited flexibility and automation in the compiler, as well as the enormous size and complexity of the design space. In contrast to the typical exploration process, where a quantized DNN is manually trained and then passed through the compiler, requiring many iterations to reach an acceptable solution, we envision an automated co-design of DNN and FPGA accelerator based on Automated Machine Learning (AutoML) techniques.

        In an effort to realize this vision while also facilitating the exploration of FPGA dataflow accelerators for energy efficient inference in the datacenter, we introduce FINN+, our custom fork of the FINN framework. Our work so far includes empirical resource and power consumption modeling, support for Transformer topologies, efficient deployment on datacenter FPGAs, Multi-FPGA acceleration, and general usability improvements.
        In this talk, we will share recent highlights as well as remaining challenges of the project.

        Speaker: Felix Jentzsch
      • 19
        End-to-End Neural Network Compression and Deployment for Hardware Acceleration Using PQuant and hls4ml

        As the demand for efficient machine learning on resource-limited devices grows, model compression techniques like pruning and quantization have become increasingly vital. Despite their importance, these methods are typically developed in isolation, and while some libraries attempt to offer unified interfaces for compression, they often lack support for deployment tools such as hls4ml. To bridge this gap, we developed PQuant, a Python library designed to streamline the process of training and compressing machine learning models. PQuant offers a unified interface for applying a range of pruning and quantization techniques, catering to users with minimal background in compression while still providing detailed configuration options for advanced use. Notably, it features built-in compatibility with hls4ml, enabling seamless deployment of compressed models on FPGA-based accelerators. This makes PQuant a versatile resource for both researchers exploring compression strategies and developers targeting efficient implementation on edge devices or custom hardware platforms. We will present the PQuant library, the performance of several compression algorithms implemented with it, and demonstrate the conversion flow of a neural network model from an uncompressed state to optimized firmware for FPGAs.

        Speaker: Roope Oskari Niemi
      • 20
        PrioriFI: Efficient Fault Injection for Edge Neural Networks

        As neural networks (NNs) are increasingly used to provide edge intelligence, there is a growing need to make the edge devices that run them robust to faults. Edge devices must mitigate the resulting hardware failures while maintaining strict constraints on power, energy, latency, throughput, memory size, and computational resources. Edge NNs require fundamental changes in model architecture, e.g., quantization and fewer, smaller layers. PrioriFI is an efficient fault injection (FI) algorithm that evaluates edge NN robustness by ranking NN bits based on their fault sensitivity. PrioriFI uses the Hessian for the initial parameter ranking. Then, during an FI campaign, PrioriFI uses the information gained from each FI to focus on the bits likely to be the next most sensitive. With PrioriFI, designers can quickly evaluate different NN architectures and co-design fault-tolerant edge NNs.

        Speaker: Olivia Weng
      • 21
        ENABOL: Enabling Neural Backpropagation On-chip Learning for Edge AI Systems

        On-chip learning has the potential to unlock low-latency, low-power, and continuously adaptive AI directly on edge devices. However, research in this area remains limited by the lack of accessible hardware toolchains that support backpropagation. To address this gap, we propose ENABOL, a hardware-efficient extension of the HLS4ML toolchain that enables customizable backpropagation support for widely implemented neural network layers. ENABOL allows users to translate high-level Python models with training logic into synthesizable HLS-compatible C++ implementations, allowing seamless integration into FPGA/ASIC flows. It lowers the barrier for prototyping and exploring on-chip training strategies by automating key components of the training pipeline, such as gradient computation, weight updates, and control flow. While not intended as a fully optimized production framework, ENABOL provides a foundational platform for experimentation, hardware–algorithm co-design, and future optimization. Its compatibility with edge-oriented backend implementation platforms such as ESP facilitates scalable, real-time learning research in resource-constrained environments.

        Speaker: Manuel Valentin (Northwestern University)
      • 22
        da4ml: Distributed Arithmetic for Real-time Neural Networks on FPGAs

        Neural networks with a latency requirement on the order of microseconds, like the ones used at the CERN Large Hadron Collider, are typically deployed on FPGAs pipelined with II=1. A bottleneck for the deployment of such neural networks is area utilization, which is directly related to the required constant matrix-vector multiplication (CMVM) operations. In this work, we propose an efficient algorithm for implementing CMVM operations with distributed arithmetic (DA) on FPGAs that simultaneously optimizes for area consumption and latency. The algorithm achieves resource reduction similar to state-of-the-art algorithms while being significantly faster to compute.

        We release da4ml, a free and open source package that enables end-to-end, bit-exact neural network to Verilog or HLS design conversion, optimized with the proposed algorithm. For easy adoption into existing workflows, we also integrate da4ml into the hls4ml library. The results show that da4ml can reduce on-chip resources by up to a third for realistic, highly quantized neural networks while simultaneously reducing latency compared to the native hls4ml implementation, enabling the implementation of previously infeasible networks.
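
        As a toy illustration of why constant matrix-vector multiplication admits such savings (this is not the da4ml algorithm; the matrix and input are invented), the constant multipliers below are replaced by shift-and-add networks and a shared intermediate result is reused:

        # Toy shift-and-add evaluation of a constant matrix-vector product with sharing.
        import numpy as np

        W = np.array([[3, 5],
                      [6, 10]])        # constant weights; note that row 1 = 2 * row 0
        x = np.array([7, 9])           # integer input vector

        ref = W @ x                    # reference result

        # 3*x0 = (x0<<1)+x0 and 5*x1 = (x1<<2)+x1: shifts and adds only, no multipliers.
        row0 = ((x[0] << 1) + x[0]) + ((x[1] << 2) + x[1])
        row1 = row0 << 1               # the second output reuses row0 entirely
        assert np.array_equal(ref, np.array([row0, row1]))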

        Speaker: Chang Sun (California Institute of Technology (US))
      • 23
        AMD AI-Engines in fixed latency environments

        The ATLAS Level-0 Global Trigger is a mission-critical system that aims to take advantage of the full calorimeter granularity during Run 4 and beyond. Level-0 Global will execute a cascade of trigger algorithms combining both calorimeter and muon information. Within the Next Generation Trigger (NGT) project at CERN, a dedicated work package (WP2.1) is exploring the large-scale deployment of machine-learning-based algorithms to further enhance selections within the Global system. Given the tight latency and throughput conditions under which Global operates, any solution developed in WP2.1 has to be deployed on custom hardware using FPGAs. Cutting-edge FPGA technologies include, within the same package, dedicated co-processing chiplets optimised for machine learning applications; one such device is the AI Engine array provided in the Versal Premium packages. In this talk we will present the work performed within the scope of NGT WP2.1, aiming to characterise the performance of these devices, provide generic implementations for specific ML models, and explore the feasibility of deploying them in the harsh environment of a mission-critical system (L0 Global).

        Speaker: Ioannis Xiotidis (CERN)
    • Posters and coffee

      HIT G floor (gallery)

    • Contributed talks
      • 24
        Balancing Prediction Performance, Transparency and Energy Consumption in Machine Learning Models for Data Streams

        In the era of continuous data generation, real-time processing of data streams has become crucial for timely, adaptive, and context-aware decision-making. However, maintaining effective learning models in such dynamic environments requires carefully balancing prediction performance, transparency and energy consumption.

        In the talk, we will present two new state-of-the-art methods for classification on data streams in such settings: (i) SoHoTs (soft Hoeffding trees) for balancing prediction performance and transparency, and (ii) HEROS (Heterogeneous Online Ensembles) for balancing prediction performance and energy consumption. SoHoTs are transparent, differentiable decision tree models for data streams. They employ a novel routing mechanism based on the Hoeffding inequality and adapt to changing data distributions through gradient-based weight updates, like soft decision trees. They process data in real-time, one sample at a time, without the need for storage, and enhance interpretability via decision-rule-based feature importance, sparse activation, and visualized decision paths. To study the trade-off between prediction performance and energy consumption, we introduce HEROS, which avoids expensive hyperparameter optimization by maintaining a diverse pool of preconfigured models. At each time step, HEROS selects a resource-aware subset of models for training. A novel zeta-policy is introduced to guide this selection process, prioritizing models that deliver near-optimal performance under strict resource constraints. Empirical evaluations across 20 data streams (SoHoTs) and 11 benchmark datasets (HEROS) demonstrate that both methods achieve strong predictive performance while ensuring transparency or reduced resource consumption.
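
        For reference, the bound that gives Hoeffding trees their name is sketched below (illustrative; not the SoHoT or HEROS code): with probability 1 - delta, the observed mean of a statistic with range R over n samples lies within eps of its true mean, so a split or routing choice can be committed once the observed gap between the two best candidates exceeds eps.

        # Hoeffding bound used to decide when enough stream samples have been seen.
        import math

        def hoeffding_bound(value_range: float, n_samples: int, delta: float = 1e-5) -> float:
            return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n_samples))

        eps = hoeffding_bound(value_range=1.0, n_samples=500)
        print(f"accept split if best_gain - second_best_gain > {eps:.3f}")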

        Speaker: Kirsten Köbschall
      • 25
        Arbolta: A Fault Tolerance Study using Minimal Hardware Simulation

        The widespread deployment of embedded ML systems has created a need for resilient, fault-tolerant hardware and software capable of operating in inherently noisy conditions. While the standardization of low-precision (≤ 8-bit) datatypes has allowed for reduced training and inference costs and increased interoperability across commercial accelerators, clear guidelines for robust implementation under faulty conditions remain underdeveloped. Prior work has improved the efficiency of accelerator resilience studies through targeted fault injection campaigns at the software level (e.g., weights and inputs) and hybrid approaches which model high-level architectural state. However, these methods rely on assumptions which do not hold across the diverse hardware implementations of arithmetic units for emerging datatypes. This work extends the open-source Arbolta framework to enable fault injection in simulated accelerators through lightweight hardware-level simulation, capturing critical microarchitectural effects within an accessible Python environment. We compare Arbolta to existing tools which focus solely on fault injection in model weights/inputs and high-level architectural registers and propose a novel workflow for the design space exploration of resilient accelerators and models. We demonstrate the value of lightweight hardware simulation by presenting a series of case studies culminating in a fault-injection campaign on a minimal accelerator. Finally, we discuss the insights of our case studies and their applicability to more realistic hardware designs.

        Speaker: Alexander Redding (UC San Diego)
      • 26
        SuperSONIC: Cloud-Native Infrastructure for ML Inferencing

        The rising computational demands of increasing data rates and complex machine learning (ML) algorithms in large-scale scientific experiments have driven the adoption of the Services for Optimized Network Inference on Coprocessors (SONIC) framework. SONIC accelerates ML inference by offloading tasks to local or remote coprocessors, optimizing resource utilization. Its portability across diverse hardware platforms improves data processing and model deployment efficiency in advanced research domains such as high-energy physics (HEP) and multi-messenger astrophysics (MMA). We developed SuperSONIC, a scalable server infrastructure for SONIC that enables the deployment of computationally intensive inference tasks, such as charged particle reconstruction, on Kubernetes clusters equipped with graphics processing units (GPUs). Leveraging NVIDIA's Triton Inference Server, SuperSONIC decouples client workflows from server infrastructure, standardizing communication, improving throughput, and enabling robust load balancing and monitoring. SuperSONIC has been successfully deployed in production environments, including the CMS and ATLAS experiments at CERN's Large Hadron Collider, the IceCube Neutrino Observatory, and the LIGO gravitational-wave observatory. It offers a reusable, configurable framework that addresses cloud-native challenges and enhances the efficiency of accelerator-based inference across diverse scientific and industrial applications.
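
        On the client side, offloading an inference request to a Triton server (the inference server SuperSONIC builds on) looks roughly like the sketch below; the endpoint, model name, and tensor names are placeholders, not the SuperSONIC deployment configuration.

        # Hedged sketch of a Triton gRPC client call, illustrating the as-a-service pattern.
        import numpy as np
        import tritonclient.grpc as grpcclient

        client = grpcclient.InferenceServerClient(url='localhost:8001')    # placeholder endpoint

        batch = np.random.rand(4, 16).astype(np.float32)                   # placeholder input features
        inp = grpcclient.InferInput('INPUT0', list(batch.shape), 'FP32')   # placeholder tensor name
        inp.set_data_from_numpy(batch)
        out = grpcclient.InferRequestedOutput('OUTPUT0')                   # placeholder tensor name

        result = client.infer(model_name='tracking_gnn', inputs=[inp], outputs=[out])
        print(result.as_numpy('OUTPUT0').shape)                            # predictions returned to the client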

        Speaker: Yuan-Tang Chou (University of Washington (US))
      • 27
        Embedding domain knowledge: Inductive biases for algorithmic alignment in Machine Learning

        Most current machine learning (ML) applications are purely data-driven solutions with little consideration for the underlying problem dynamics, and they are limited to in-distribution applications. To tackle this limitation, a stream of literature is emerging to address out-of-distribution (OOD) performance: algorithmic alignment, which focuses on embedding algorithmic structures into ML architectures to reflect the inherent logic of the problem at hand. The general idea is summarized in two steps: first, we formalize the dynamics and mathematical workings involved, as well as constraints and assumptions on data, outputs and parameters; second, we design the corresponding ML algorithm that maximally replicates such a specification, i.e., we implement its inductive bias (IB).

        The relatively recent literature of algorithmic alignment, however, shows a lack of proper characterization of existing algorithms and IBs.

        We provide said characterization for our core research focus: acceleration of large-scale scientific simulators. We hypothesise that these already embed domain knowledge mathematically, as a result of well-characterized physical phenomena and decades of development of dedicated algorithms. Examples include sequential, discrete-event simulators such as traffic simulators, or physically characterized ones such as those from hydrodynamics or climate science. The approach is designed to be transferable to other scientific disciplines, facilitating the application of algorithmic reasoning in ML solutions. We analyze three main subjects: traditional inductive biases in ML and how they align with simulators; unconventional inductive biases inspired by the domain knowledge and generalizing power of simulators; and algorithmic structures from the most common algorithms in large-scale simulators.

        Our analysis of such multidisciplinary perspectives will result in a dictionary of IBs and their connections to specific tasks, which shall guide researchers and practitioners towards more robust ML solutions. We have already characterized simulators in the traffic domain and identified characteristic features of algorithms (such as Dijkstra, Frank Wolfe) and models (like the Cell Transmission Model or Elastic Traffic Assignment) in simulators. Some relevant trends are identified. We find, for example, that more simulators lean on addition and short term memory rather than multiplication or long term memory. The complexity of the simulation at hand is also a strong indicator of modularity, loop invariance and smoothness bias. Most simulators appear to deal with structural sparsity through computational power alone, but few more targeted models actually avoid it. Most simulators can also be separated in multiple components with different IBs and algorithmic structures, whose understanding is critical in designing ML meta-models able to perform OOD as well as current simulators.

        The study is designed to be a toolbox for novel efficient architectures embedding specific generalization preferences that mirror the identified IBs. The analysis of common algorithmic structures in simulators will make it easier to design ML models algorithmically aligned with the problem at hand, isolating nonlinearities and improving the expressiveness of the novel architectures. The findings should transfer readily to other scientific domains with a large body of domain knowledge historically embedded in models, algorithms and simulations. A practical example may be presented.

        Speaker: Serio Angelo Maria Agriesti (Department of Technology, Management and Economics, Technical University of Denmark, Lyngby, Denmark)
      • 28
        Pushing Matrix-Vector Multiplication Performance on AMD AI Engines for Low-Latency Edge Inference

        Matrix-vector (GEMV) operations are a common building block in many deep learning models, particularly for large dense layers found in convolutional neural networks (CNNs) and multi-layer perceptrons (MLPs). Despite their importance, GEMV kernels have historically underperformed compared to matrix-matrix (GEMM) operations due to their lower arithmetic intensity and limited data reuse, making them harder to scale efficiently. This work presents the first comprehensive analysis and optimization of matrix-vector operations using AMD's AI Engines on the latest AIE-ML architecture. It addresses key bottlenecks in deploying AI models that rely on such operations and target low-latency edge inference, such as meeting the tight real-time requirements of the CERN trigger system. Our proposed GEMV kernel achieves high throughput and low latency through exploitation of the AI Engine array, scaling efficiently across tiles both horizontally and vertically through a custom placement strategy. Furthermore, we introduce a novel graph connection mechanism that enables efficient pipelining across multiple layers. The design is modular and can be easily integrated with other frameworks such as hls4ml in a straightforward manner. Our multi-layer implementation achieves close to microsecond-level latency, demonstrating its suitability for ultra-low-latency applications. These results make AMD's AI Engines a realistic middle-ground solution that can offer the scalability that FPGAs struggle to reach for large models, while maintaining the ultra-low latency that GPUs typically cannot provide.

        Speaker: Dimitrios Danopoulos (CERN)
    • Invited talks
      • 29
        The Instrumental Edge: Real-Time and AI-Ready Scientific Discovery

        From radio telescopes to particle accelerators and electron microscopes, scientific instruments produce tremendous amounts of data at equally high rates; previous architectures that have relied on offline storage and large data transfers are unable to keep up. The future of scientific discovery is interactive, streaming, and AI driven, placing the autonomous and intelligent instrument at the center of a given science workflow. By making our instruments smarter, scientists can run higher impact experiments.

        In this tools-focused talk, we will highlight NVIDIA's work on real-time data processing, including high-speed data movement, FPGA co-design, and laying the foundations for moving AI inferencing as close to the data converter as possible.

        Speaker: Adam Thompson (NVIDIA)
      • 30
        AI for next-gen cellular networks (6G)
        Speaker: Bozidar Radunovic (Microsoft Research)
    • 10:30
      Coffee
    • Invited talks
      • 31
        Next Generation GPU Signal Processing Pipeline for Radio Astronomy

        As digitizer technologies scale, efficient processing of massive amounts of sensor data is essential for the next generation of science projects. This talk focuses on the next-generation electromagnetic signal processing pipeline developed at the Allen Telescope Array. Backed by the NVIDIA Holoscan SDK, this pipeline utilizes cutting-edge technologies to address the three key pillars of high-speed scientific sensor data acquisition in real-time. In this model, the GPU device acquires, processes, and stores the data without the direct involvement of the CPU memory. This topology, apart from performance gains, enables the deployment of cutting-edge detection technology based on machine learning models that directly work with spectrogram data in an online fashion. This work equips observatories with the capability to meet the growing demands of next-generation astronomy.

        Speaker: Luigi Cruz (SETI)
    • 32
      Conference photo outside of auditorium!
    • 12:00
      Lunch
    • Contributed talks
      • 33
        Real-Time Anomaly Detection in the CMS Level-1 Trigger with AXOL1TL

        AXOL1TL is an anomaly detection (AD) trigger algorithm integrated into the Global Trigger (GT) of the CMS Level-1 Trigger (L1T) system since 2024. The GT reduces the event rate from proton–proton collisions at the LHC, lowering it from 40 MHz to 100 kHz within a fixed latency of 50 ns. The AD algorithm, implemented in the FPGA firmware of the GT board, uses an autoencoder to assign an anomaly score to each event, enabling the selection of more anomalous events for further analysis. We present the full deployment workflow to achieve ultra-low-latency anomaly detection: from hardware-aware model training to firmware synthesis and integration into the L1T system. We also report on the characterisation and performance of the AXOL1TL trigger, using the latest model, updated in 2025 with an increased rate budget and a novel feature extraction technique, based on a self-supervised method, which led to improved performance. This work demonstrates one of the first fully functional anomaly detection triggers within the CMS L1T system and showcases how novel trigger-level approaches can enhance sensitivity to new physics in real-time event selection.
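
        The sketch below illustrates the anomaly-scoring principle only; it is not the deployed AXOL1TL model, and the input dimensionality, architecture, and threshold are assumptions. An autoencoder is trained on ordinary events, and events with large reconstruction error are flagged as anomalous.

        # Hedged sketch: reconstruction-error anomaly scoring with a small autoencoder.
        import numpy as np
        from tensorflow.keras import layers, Model

        n_inputs = 57                                    # assumed flattened trigger-object features
        x_in = layers.Input(shape=(n_inputs,))
        z = layers.Dense(16, activation='relu')(x_in)
        z = layers.Dense(4, activation='relu')(z)        # latent bottleneck
        x_rec = layers.Dense(16, activation='relu')(z)
        x_rec = layers.Dense(n_inputs)(x_rec)
        autoencoder = Model(x_in, x_rec)
        autoencoder.compile(optimizer='adam', loss='mse')

        events = np.random.rand(10000, n_inputs).astype(np.float32)   # stand-in for unbiased collision data
        autoencoder.fit(events, events, epochs=1, batch_size=256, verbose=0)

        # Anomaly score: per-event squared reconstruction error; the threshold is chosen
        # so that the selection fits within the available trigger rate budget.
        scores = np.mean((autoencoder.predict(events, verbose=0) - events) ** 2, axis=1)
        fires = scores > np.quantile(scores, 0.999)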

        Speaker: Sabrina Giorgetti (Universita e INFN, Padova (IT))
      • 34
        GELATO: A Generic Event-Level Anomaly detection Trigger for ATLAS

        The absence of BSM physics discoveries at the LHC suggests new physics could lie outside current trigger schemes. By applying unsupervised ML–based anomaly detection, we gain a model-agnostic way of spotting anomalous signatures that deviate from the current trigger’s expectations. Here we introduce a Run-3 trigger chain that embeds fast anomaly detection algorithms in both hardware and software levels. We will describe its design, integration, commissioning strategy with emphasis on rate stability and robustness, and show first validation results from its data stream. This marks ATLAS’s inaugural anomaly-detection trigger, laying the groundwork for future ML-driven triggers and novel sensitivity to a broad spectrum of new-physics signatures in Run-3 and beyond.

        Speaker: Kenny Jia (Stanford University/ SLAC)
      • 35
        Advancing the CMS Level-1 Trigger: Jet Tagging with DeepSets at the HL-LHC

        At the Phase-2 Upgrade of the CMS Level-1 Trigger (L1T), particles will be reconstructed by linking charged particle tracks with clusters in the calorimeters and muon tracks from the muon station. The 200 pileup interactions will be mitigated using primary vertex reconstruction for charged particles and a weighting for neutral particles based on the distribution of energy in a small area. Jets will be reconstructed from these pileup-subtracted particles using a fast cone algorithm. For the first time at the CMS L1T, the particle constituents of jets will be available for jet tagging. In this work we present a new multi-class jet tagging neural network (NN). Targeting the L1T, the NN is a small DeepSets architecture, and trained with Quantization Aware Training. The model predicts the classes: light jet (uds), gluon, b, c, $\tau_h^+$, $\tau_h^-$, electron, muon. The model additionally predicts the $p_T$ of the object. The new model enhances the selection power of the L1T for important processes for CMS at the High Luminosity LHC such as di-Higgs and Higgs production via Vector Boson Fusion. We present the model including its performance at object tagging and deployment into the L1T FPGA processors, and showcase the improved trigger capabilities enabled by the new tagger.
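
        The sketch below shows the general shape of a small DeepSets-style tagger over jet constituents with joint classification and pT-regression heads. It is an illustrative outline rather than the L1T model; the constituent count, feature count, layer sizes, and number of classes are assumptions, and quantization-aware training is omitted.

        # Hedged sketch: permutation-invariant (DeepSets-style) jet tagger.
        from tensorflow.keras import layers, Model

        n_const, n_feat, n_classes = 16, 8, 8            # constituents per jet, features, classes

        x_in = layers.Input(shape=(n_const, n_feat))
        # Per-constituent "phi" network, weights shared across constituents.
        h = layers.Dense(32, activation='relu')(x_in)
        h = layers.Dense(32, activation='relu')(h)
        # Permutation-invariant aggregation over the constituents.
        pooled = layers.GlobalAveragePooling1D()(h)
        # "rho" network acting on the aggregated representation.
        g = layers.Dense(32, activation='relu')(pooled)
        class_out = layers.Dense(n_classes, activation='softmax', name='jet_class')(g)
        pt_out = layers.Dense(1, name='jet_pt')(g)       # joint pT regression head

        model = Model(x_in, [class_out, pt_out])
        model.compile(optimizer='adam',
                      loss={'jet_class': 'categorical_crossentropy', 'jet_pt': 'mse'})
        model.summary()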

        Speaker: Christopher Edward Brown (CERN)
      • 36
        Low-Latency On-Chip Tau Event Selection with Machine Learning for the Belle II Level-1 Trigger

        Belle II is a luminosity frontier experiment located at the SuperKEKB asymmetric $e^+ e^-$ collider, operating at the $\Upsilon(4S)$ resonance. The $\tau$ physics program at Belle II involves both probes of new physics and precision measurements of standard model parameters with large statistics. SuperKEKB is projected to reach a luminosity of $6\times 10^{35}~\text{cm}^{-2}\text{s}^{-1}$ in the next decade. At these high luminosities, the hardware-based Level-1 Trigger system will require improved signal identification algorithms to maintain high trigger efficiencies while keeping the total trigger rate below the data acquisition system limit of $30~\text{kHz}$. Utilizing per-weight mixed-precision quantization aware training, we develop a fast machine-learning based logic for $\tau$ event selection with $\sim 100~\text{ns}$ latency, implemented on an AMD XCVU080 FPGA. Our algorithm uses energy, timing, and position information provided by the electromagnetic calorimeter sub-trigger system as inputs to a feed-forward dense neural network to reconstruct low-multiplicity standard model $\tau$ decays. When compared with common trigger conditions currently used for $\tau$ selection, we achieve up to $50\%$ reduction in total trigger rate while maintaining over $95\%$ signal efficiency. The new firmware has been validated using cosmic ray data collected in early 2025, and is now implemented in the Belle II analysis software framework for further validation in simulation. Full implementation of the new logic is planned for the next Belle II physics run in fall 2025.

        Speaker: Deven Misra (University of Tokyo)
      • 37
        Accelerated Graph Neural Network Inference on FPGAs for Real-Time Muon Triggering at the HL-LHC

        The High Luminosity upgrade of the Large Hadron Collider (HL-LHC) presents a demanding environment for real-time data processing, with substantially increased event rates requiring faster and more efficient trigger systems. This study explores the deployment of graph neural networks (GNNs) on field-programmable gate arrays (FPGAs) for fast and accurate inference within future muon trigger pipelines. By leveraging the sparse and relational structure of detector data, GNNs enable robust pattern recognition while preserving spatial and topological correlations. We investigate hardware-friendly implementations of GNN architectures, focusing on model compression, parallelism, and low-latency execution, contributing to the broader goal of AI-driven event selection in high-energy physics experiments.

        Speaker: Davide Fiacco (Sapienza Universita e INFN, Roma I (IT))
      • 38
        Jet finding in real-time using an object detection CNN

        The ATLAS trigger system will undergo a comprehensive upgrade in advance of the HL-LHC programme. In order to deal with the increased data bandwidth, trigger algorithms will be required to satisfy stricter latency requirements. We propose a method to speed up the current calorimeter-only preselection step and to aid trigger decisions for hadronic signals containing jets.
        We demonstrate the use of a dedicated object-detection Convolutional Neural Network (CNN) for jet finding in the ATLAS calorimeter. The modified computer vision model is employed in the task of jet detection to identify and localise jets within the central calorimeter acceptance and to subsequently estimate their transverse momenta. A custom architecture is introduced to reduce the number of learnable parameters required for improved inference speed. The model performance is evaluated on a set of simulated particle interactions in the ATLAS detector with up to 200 concurrent pile-up interactions.

        Speaker: Leon Bozianu (Universite de Geneve (CH))
    • Posters and coffee HIT G floor (gallery)

      HIT G floor (gallery)

    • Contributed talks
      • 39
        KAN-LUT: Efficient LUT-Based Acceleration of Kolmogorov-Arnold Networks (KANs) on FPGAs

        Optimized FPGA implementations of tiny neural networks are crucial for low-latency and hardware-efficient inference for a variety of applications. Neural networks based on lookup tables (LUTs) are a standard technique for such problems due to their hardware efficiency and strong expressivity. However, such networks are often difficult to scale up as their resource usage scales exponentially with LUT fan-in. To address this issue, we propose a LUT-based implementation of the recently proposed Kolmogorov-Arnold Network (KAN). KANs consist of spline-based, trainable activations as edges between neurons in adjacent layers, with each node performing a sum operation on incoming activations. Because of the strong expressivity of spline-based activations, KANs can often achieve accuracies similar to multi-layer perceptrons (MLPs) while using significantly fewer layers. Since each node-to-node spline computation is performed with a LUT lookup, the fan-in of each LUT is only one, which avoids the scaling issues associated with other LUT-based networks. Along with quantization-aware training (QAT), this architecture is well-suited for edge-pruning to decrease hardware resources after sparsification in training. Empirically, we demonstrate on various benchmarks that our design achieves task performance similar to other state-of-the-art techniques while also using comparable or fewer hardware resources.
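
        To illustrate the unit fan-in property claimed above: once an input is quantized, each spline edge activation can be tabulated and evaluated with a single lookup, and each node simply sums its incoming edge outputs. The toy numpy sketch below shows one such layer; the table contents and sizes are arbitrary placeholders, not the proposed design.

          import numpy as np

          rng = np.random.default_rng(1)
          n_in, n_out, n_levels = 4, 3, 16     # 16 levels = 4-bit quantized inputs

          # One table per (input, output) edge: the tabulated spline activation.
          # Each table depends on a single quantized input, so LUT fan-in stays one.
          edge_luts = rng.normal(size=(n_in, n_out, n_levels))

          def kan_layer(x_codes):
              """x_codes: quantized input codes in [0, n_levels). Every node sums
              the incoming edge activations read directly from the tables."""
              return sum(edge_luts[i, :, x_codes[i]] for i in range(n_in))

          print(kan_layer(rng.integers(0, n_levels, size=n_in)))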

        Speaker: Duc Hoang (Massachusetts Inst. of Technology (US))
      • 40
        COLLIDE-2V - 750 Million Dual-View LHC Event Dataset for Low-Latency ML

        Modern foundation models (FMs) have pushed the frontiers of language, vision, and multi-modal tasks by training ever-larger neural networks (NNs) on unprecedented volumes of data. The use of FMs has yet to be established in collider physics, which lacks both a comparably sized, general-purpose dataset on which to pre-train universal event representations and a clear, demonstrable need. Real-time event identification presents one such need, given the requirement to rapidly classify and select among all possible collisions at the LHC. We therefore construct a dual-view LHC collision dataset (COLLIDE-2V), a 50 TB public dataset comprising ~750 million proton-proton events generated with MadGraph + Pythia + Delphes under High-Luminosity LHC conditions (⟨μ⟩ = 200). Spanning everything from minimum-bias and γ+jets to top, Higgs, di-boson, multi-boson, exotic long-lived signatures, and dark showers, the sample covers 50+ distinct processes and >99% of the CMS Run-3 trigger menu in a single coherent format. To allow for effective real-time event interpretation, each event is provided twice, as Parquet files that retain the physics-critical content:


        • Offline - a full CMS-like reconstruction emulated by a tuned Delphes card
        • L1T - a low-latency, lower-resolution view obtained via a custom Level-1 Trigger (L1T) parameterisation (degraded vertex, track and calorimeter performance, altered puppi, |η| ≤ 2.5 tracking, pT thresholds, etc.)

        As a proof of concept, COLLIDE-2V supports a wide spectrum of research applications, spanning few-shot transfer learning, fine-tuning, pileup mitigation, detector-level generative modelling, cross-experiment benchmarking, fast-simulation surrogates, real-time trigger inference, and entirely novel anomaly detection, thereby accelerating the shift from handcrafted topology cuts to data-driven decision making throughout the HL-LHC program.
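
        A sketch of how the two views of a sample might be paired downstream is shown below, using toy in-memory tables with hypothetical column names; the real dataset ships as Parquet files loaded with a standard Parquet reader.

          import pandas as pd

          # Toy stand-ins for one event sample's two views; in practice the files
          # would be read with pd.read_parquet(...). Column names are hypothetical
          # placeholders, not the published schema.
          offline = pd.DataFrame({"event_id": [0, 1, 2], "jet_pt": [152.0, 88.5, 310.2]})
          l1t     = pd.DataFrame({"event_id": [0, 1, 2], "jet_pt": [139.0, 80.1, 295.7]})

          # Pair the offline and degraded Level-1 views of the same events, e.g. to
          # train a model mapping the low-latency view to offline-quality quantities.
          paired = offline.merge(l1t, on="event_id", suffixes=("_offline", "_l1t"))
          print(paired)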

        Speaker: Eric Anton Moreno (Massachusetts Institute of Technology (US))
      • 41
        Accelerating Efficient Transformer Architectures for Point Cloud Data using hls4ml (REMOTE)

        The analysis of point cloud data, for example signals from charged particles recorded by detectors in high energy physics (HEP) experiments, can be significantly enhanced and accelerated by the application of machine learning models. In recent years, transformer architectures have come into focus as offering excellent model performance. However, for traditional transformers, the need to compute attention between all elements of the input data set results in high computational requirements and poor scaling of the inference performance with increasing data set size. To address this, the Locality-Sensitive Hashing-Based Efficient Point Transformer (HEPT) has been proposed, which segments the input dataset into smaller samples based on their adjacency, evaluated using a hashing function. This approach has been shown to greatly improve computational efficiency when deployed on traditional GPU architectures.
        For deployment with stricter latency requirements, for example in the trigger systems of HEP experiments, further accelerating the inference of the HEPT architecture is required. We present an implementation of HEPT for AMD/Xilinx FPGAs using hls4ml, which includes the hashing and segmentation of the data set, the attention computation, and recombination of the data. Using a charged particle track reconstruction model as the benchmark, latencies on the microsecond scale are achieved within the computing resources available on an Alveo u250 FPGA. Model compression using pruning and quantization with PQuant is explored.
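
        The segmentation step can be pictured as follows: a locality-sensitive hash groups nearby points into buckets, and attention is evaluated only within each bucket. The numpy sketch below is a simplified illustration of that idea, not the HEPT model or the hls4ml implementation.

          import numpy as np

          rng = np.random.default_rng(2)
          pts = rng.normal(size=(256, 3))               # e.g. detector-hit coordinates
          feats = rng.normal(size=(256, 8))             # per-hit features

          # Random-projection LSH: nearby points tend to share a sign pattern.
          planes = rng.normal(size=(3, 4))
          codes = (pts @ planes > 0) @ (1 << np.arange(4))   # 4-bit bucket id per point

          def local_attention(x):
              scores = x @ x.T / np.sqrt(x.shape[1])
              w = np.exp(scores - scores.max(axis=1, keepdims=True))
              return (w / w.sum(axis=1, keepdims=True)) @ x

          out = np.empty_like(feats)
          for b in np.unique(codes):                    # attention restricted to each bucket
              idx = np.where(codes == b)[0]
              out[idx] = local_attention(feats[idx])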

        Speaker: Jan-Frederik Schulte (Purdue University (US))
      • 42
        Hierarchical Dataflow Accelerator of Interaction Networks for Large-Scale Particle Tracking on FPGA

        The Interaction Network (IN) algorithm has shown great promise for particle tracking applications at the Large Hadron Collider (LHC), where identifying complex particle trajectories from raw detector data is a computationally intensive task. IN leverages graph-based representations of detector hits to learn relationships between particle interactions, making it well-suited for this domain. Given the extremely high data rates and stringent latency requirements in the LHC environment, Field-Programmable Gate Arrays (FPGAs) present an ideal platform for deploying IN-based inference systems, thanks to their low-latency, parallel-processing capabilities and energy efficiency.

        However, existing FPGA implementations of the IN algorithm are constrained by the limited on-chip resources available to handle large-scale input graphs. As a result, they typically rely on graph reduction or subgraph sampling techniques that risk discarding important structural information and compromising the overall fidelity of the dataset. These limitations hinder the deployment of IN models in real-time applications that require both high throughput and high accuracy across the entire dataset.

        To address these challenges, we propose a modular and hierarchical dataflow accelerator tailored for efficient large-scale graph processing on FPGA hardware. The proposed architecture introduces a novel parallel pipeline design, which enables fine-grained concurrency across computation stages, significantly improving processing efficiency. To further enhance performance, the architecture incorporates optimized utilization of on-chip memory, reducing access latency and mitigating data transfer bottlenecks during graph traversal and edge feature computation. The use of stream-based processing allows for continuous ingestion and processing of graph data without requiring full graph buffering, which is critical for real-time applications.

        A key feature of our design is its hierarchical task-level structure, which facilitates scalability by decomposing large graphs into manageable subgraphs and distributing the computation across multiple processing units. This not only improves the overall throughput but also enables modular expansion of the architecture to accommodate larger and more complex datasets. Such a design makes the system adaptable to future upgrades in detector resolution and data complexity, ensuring long-term applicability.

        We implement our design on a Xilinx Virtex UltraScale+ XCVU9P FPGA and evaluate its performance using representative large-scale graph workloads relevant to LHC tracking tasks. Experimental results show that our accelerator achieves a 129.1× speedup over a CPU baseline and a 7.16× improvement over a GPU implementation. Furthermore, the proposed design demonstrates significant energy efficiency advantages, with up to 972× improvement over CPU and 195× over GPU in terms of energy consumed per inference.

        Speaker: Bo-Cheng Lai
      • 43
        An ML Pipeline for Real-time Gravitational Wave Alerts
        Speakers: Christina Reissel (Massachusetts Inst. of Technology (US)), Katya Govorkova (Massachusetts Inst. of Technology (US)), Philip Coleman Harris (Massachusetts Inst. of Technology (US))
    • 44
      Map to Tessin Grotto
    • 17:45
      Walk to Tessin Grotto (see map)
    • Social: Conference Dinner Tessin Grotto

      Tessin Grotto

    • Invited talks
      • 45
        Scaling up Advanced, Near-Sensor AI: An Open Platform Approach

        AI is accelerating into the generative era, and it is poised to disrupt multiple businesses and applications. With the increasing focus on edge and extreme-edge, near-sensor applications, inference is becoming the key workload and computational challenge. Computing systems need to scale out and scale up to meet the challenge. In this talk I will discuss how to scale up chip(lets) for efficient inference at the edge targeting advanced AI models, optimizing the whole hardware stack, from processing elements to the global interconnect. I will emphasize the strategic importance of an end-to-end (models, software, instruction set architecture, digital design) open-platform approach to ensure a healthy innovation ecosystem with long-term sustainability.

        Speaker: Luca Benini (ETH Zurich)
      • 46
        Co-Design for Efficient & Adaptive ML

        Beyond the well-known highlights in computer vision and natural language, AI is steadily expanding into new application domains. This Pervasive AI trend requires supporting diverse and fast-moving application requirements, ranging from specialized I/O to fault tolerance and limited resources, all the while retaining high performance and low latency. Adaptive compute architectures such as AMD FPGAs are an excellent fit for such requirements but require co-design of hardware and ML algorithms to reap the full benefits. In this talk, we will cover a breadth of co-design techniques, including their merits and challenges, from streaming dataflow architectures to quantization, from sparsity to full circuit co-design. By combining such techniques, we can enable nanosecond latency and performance in the hundreds of millions of inferences per second. The proliferation of this technology is enabled via open-source AMD tools such as FINN, Brevitas and LogicNets, as well as the AMD-FastML collaborative project QONNX.

        Speaker: Yaman Umuroglu (AMD)
    • 11:00
      Coffee
    • Contributed talks
      • 47
        Towards a Self-Driving Trigger: Adaptive Response in Real Time

        The trigger systems of ATLAS and CMS currently reject vast numbers of potentially valuable collision events due to their conservative, static designs, a limitation that directly hampers discovery potential. We propose an alternative to these rigid, hand-tuned menus with an autonomous controller capable of dynamically optimizing trigger performance in real time.
        In this work, we demonstrate that, by continuously adapting trigger thresholds and resource allocations in response to evolving experimental conditions (such as pileup, beam-induced backgrounds, or detector drifts), our self-driving trigger system maintains peak performance across multiple axes: signal efficiency, rate of unusual events, and computational cost.
        Crucially, we validate our approach through playback of zero-bias L1 trigger data from real LHC collision events. Anomaly detection triggers serve as a natural testbed, where the gains from dynamic prioritization are especially clear. More broadly, this framework is particularly powerful for triggers that rely on subtle correlations and are inherently more sensitive to changes in detector conditions.
        Our architecture supports both lightweight feedback controllers and more powerful reinforcement learning approaches, laying the foundation for truly adaptive, intelligent triggering at the LHC and beyond.
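
        A lightweight feedback controller of the kind mentioned above can be sketched in a few lines; the snippet below uses a toy rate model and a simple proportional update on a threshold, purely as an illustration rather than the proposed system.

          import numpy as np

          rng = np.random.default_rng(5)

          def observed_rate(threshold, pileup):
              """Toy stand-in for the measured trigger rate at a given threshold."""
              return 120.0 * np.exp(-threshold / (8.0 + 0.05 * pileup))

          target_rate, threshold, gain = 10.0, 20.0, 0.5      # kHz, GeV-like units, loop gain
          for step in range(200):                             # pileup drifts over the fill
              pileup = 60 - 0.2 * step + rng.normal(scale=2)
              rate = observed_rate(threshold, pileup)
              threshold += gain * np.log(rate / target_rate)  # nudge threshold toward target rate
          print(round(threshold, 2), round(rate, 2))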

        Speaker: Giovanna Salvi (University of Michigan (US))
      • 48
        Easing the path to deployment in ML4Sys through FPGAs

        Machine Learning (ML) techniques are increasingly applied to the optimization of complex computing systems, but their integration into core low-level system mechanisms remains limited. A key barrier is the lack of accessible, high-performance interfaces at the boundary between software and hardware, as well as of hardware-offloaded ML inference at full system speed. In this presentation, we show how Field Programmable Gate Arrays (FPGAs) can be a key enabler for closing this gap: the combination of mature FPGA shells, that is, full-fledged computer systems on reconfigurable fabric, and modern ML compilers for hardware-accelerated inference enables rapid prototyping and deployment of bespoke ML models directly at the interfaces of key system mechanisms. This approach allows for ultra-low-latency, real-time decision making in system components such as memory management, scheduling logic, and network control. We outline a vision of the fully ML-optimized FPGA-SmartHub, serving as a research platform for system optimization in both classic computer systems and next-generation accelerators.

        Speaker: Maximilian Heer (ETH Zurich)
      • 49
        Go small then go home - hyperparameter transfer for ML in HEP

        Tuning hyperparameters of ML models, especially large ML models, can be time consuming and computationally expensive. As a potential solution, several recent papers have explored hyperparameter transfer. Under certain conditions, the optimal hyperparameters of a small model are also optimal for larger models. One can therefore tune only the small model and transfer the hyperparameters to the larger model, saving a large amount of time and effort. This work explores how well the idea holds up in high-energy physics by applying it to three existing ML pipelines: metric learning for particle tracking, autoencoders for anomaly detection, and particle transformers for jet tagging. These cover several common ML architectures and reflect models currently used or in development at CMS and other experiments. We show that with a few changes to the models, hyperparameters can often be transferred across both neural net depth and width. We focus on learning rate transfer, but also show results on a few other hyperparameters. A few guidelines are introduced, encouraging the use of hyperparameter transfer in future HEP ML models.
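
        The basic recipe is illustrated below with a toy stand-in for training: sweep the learning rate only on the narrow model, then reuse the best value for the wide one. The training function and widths are hypothetical placeholders, not the tracking, anomaly-detection, or jet-tagging pipelines studied in this work.

          import numpy as np

          def train_and_score(width, lr):
              """Placeholder for training a model of the given width and returning a
              validation score; stands in for the real HEP pipelines."""
              return -(np.log10(lr) + 2.5) ** 2 - 0.001 * width   # toy objective

          lrs = 10.0 ** np.arange(-5, 0, 0.5)
          small_width = 64
          best_lr = max(lrs, key=lambda lr: train_and_score(small_width, lr))

          # Under the transfer hypothesis, the same learning rate is (near-)optimal
          # for the scaled-up model, so only the small model needs a full sweep.
          large_score = train_and_score(1024, best_lr)
          print(best_lr, large_score)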

        Speaker: Liv Helen Vage (Princeton University (US))
      • 50
        Edge Deep Learning for Particle Physics (EPIGRAPHY)
        Speaker: Benedikt Maier (Imperial College (GB))
    • 51
      Fast ML foundation discussion
    • 12:40
      Lunch
    • Contributed talks
      • 52
        Low-Latency Resource-Efficient GNNs for Jet Tagging on FPGAs

        Graph Neural Networks (GNNs), particularly Interaction Networks (INs), have shown exceptional performance for jet tagging at the CERN High-Luminosity Large Hadron Collider. However, their computational complexity and irregular memory access patterns pose significant challenges for deployment on FPGAs in hardware trigger systems, where strict latency and resource constraints apply.

        In this work, we present the first co-optimized framework that integrates High Granularity Quantization (HGQ) and Distributed Arithmetic for Machine Learning (da4ml) to enable efficient inference of IN-based GNNs on FPGAs. HGQ performs fine-grained, layer- and channel-level bitwidth allocation through hardware-aware optimization, minimizing precision where possible without sacrificing classification performance. Complementing this, da4ml replaces traditional multiply-accumulate units with highly parallel, LUT-based arithmetic units, allowing low-latency and resource-efficient implementation of quantized GNNs.

        We demonstrate our approach on a public benchmark dataset for jet classification, implementing compressed and quantized IN models on FPGAs. Our results show that GNN designs with the combined HGQ+da4ml approach achieve significant reductions in DSP and LUT usage compared to state-of-the-art GNN designs, while maintaining model accuracy and satisfying strict latency constraints.

        Speaker: Zhiqiang (Walkie) Que (Imperial College London)
      • 53
        Smartpixels: Intelligent pixel detectors: Towards a radiation hard ASIC with on-chip machine learning in 28nm CMOS

        The Smartpixels project is a coordinated effort to co-design pixel ASICs, design tools, ML algorithms, and sensors for on-detector data reduction, motivated by the technical challenges of current and future colliders. The drive to greater precision requires smaller pixel pitch, which together with higher event rates arising from pileup and/or beam-induced background generates petabytes of data per second. Readout chips must be power-efficient, radiation-hard, and capable of real-time data processing.

        The smartpixels team has developed algorithms for selecting the signatures of high-momentum tracks and coarse particle-trajectory reconstruction, and explored how the performance changes with pixel sensor geometry, orientation, and irradiation.

        We have leveraged and extended hls4ml to support neural network architectures meeting the strict latency and area constraints. To target our TSMC 28nm ASIC implementations, we have integrated the flow with Catapult HLS, allowing seamless synthesis of these designs into RTL for backend integration, and our first custom pixel ASICs have been produced and are undergoing testing.

        We will present the status of ongoing work, including efforts in testing ASICs, producing a new ASIC with a trajectory-reconstruction algorithm, and improving the realism of the detector simulation by including noise, charge thresholds, and other effects.

        Speakers: Benjamin Weiss (Cornell University), Jannicke Pearkes (University of Colorado Boulder (US))
      • 54
        Quantum-Inspired Tensor Network Models for Ultrafast Jet Tagging on FPGAs

        We conduct a systematic study of quantum-inspired Tensor Network (TN) models—Matrix Product States (MPS) and Tree Tensor Networks (TTN)—for real-time jet tagging in high-energy physics, with a focus on low-latency deployment on FPGAs. Motivated by the strict computational demands of the HL-LHC Level-1 Trigger system, we explore TN architectures as compact and interpretable alternatives to deep neural networks. Our models are trained on jet events represented by low-level features of jet constituents. Benchmarked against state-of-the-art deep learning classifiers, they demonstrated competitive performance in terms of classification accuracy and AUC. We implement quantization-aware training for TTNs and successfully deploy the best-performing models on FPGA hardware, evaluating DSP usage, latency and memory usage. We are currently working on extending the support for the quantization of MPS models and synthesizing their designs for full FPGA deployment, to be able to compare them with TTNs in terms of both performance and hardware cost. This work aims to highlight the potential of TN-based models for fast, resource-efficient inference in low-latency environments such as the LHC.
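
        For orientation, a tensor network classifier of this type contracts per-feature local vectors with a chain of low-rank cores. The numpy toy below sketches an MPS-style contraction; the feature map, bond dimension, and boundary treatment are chosen for brevity and are not the studied models.

          import numpy as np

          rng = np.random.default_rng(3)
          n_feat, phys, bond, n_classes = 6, 2, 4, 5

          # One rank-3 core per input feature; a separate core carries the class label.
          cores = [rng.normal(scale=0.3, size=(bond, phys, bond)) for _ in range(n_feat)]
          label_core = rng.normal(scale=0.3, size=(bond, n_classes, bond))

          def mps_logits(x):
              """x: features scaled to [0, 1]; embedded as (cos, sin) local vectors."""
              vecs = np.stack([np.cos(np.pi / 2 * x), np.sin(np.pi / 2 * x)], axis=1)
              env = np.ones(bond)
              for core, v in zip(cores, vecs):
                  env = np.einsum("a,apb,p->b", env, core, v)    # contract along the chain
              return np.einsum("a,acb,b->c", env, label_core, np.ones(bond))

          print(mps_logits(rng.uniform(size=n_feat)))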

        Speaker: Ms Ema Puljak (Universitat Autònoma de Barcelona)
      • 55
        Toward an Ultra-Fast, Energy-Efficient Readout of Calorimeters with Neuromorphic Processing

        Hadronic calorimeters are a key part of high energy physics experiments. Traditionally, they rely on high granularity to improve performance, but this leads to various challenges in terms of cost, energy consumption, and output data volume. Moreover, current detectors lack the capability to exploit the temporal information of the shower development, as the time frame for pattern detection is condensed to sub-nanoseconds due to the particles' speed. Neuromorphic architectures might help with overcoming these limitations.

        We explore a neuromorphic approach to calorimeter readout. Hadrons interacting with a homogeneous lead-tungstate (PbWO₄) calorimeter are simulated and the resulting time-dependent light signals, captured by a dense array of photodetectors, are encoded into spike trains to serve as input to a fully connected spiking neural network (SNN). This architecture is trained to reconstruct key physical quantities, including the total energy deposited and topological information about the events. The model performs effectively in both single- and multi-task regression settings, producing consistent results in both scenarios.
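
        As an illustration of the encoding step, the snippet below converts a time-dependent light signal into spikes via simple delta modulation, i.e. a spike is emitted whenever the signal moves by more than a fixed threshold. This is a generic scheme chosen for illustration, not necessarily the encoding used in this study.

          import numpy as np

          def delta_encode(signal, threshold):
              """Emit +1/-1 spikes whenever the signal moves more than `threshold`
              away from the last encoded level (simple delta modulation)."""
              spikes, level = np.zeros_like(signal, dtype=int), signal[0]
              for i, s in enumerate(signal):
                  while s - level > threshold:
                      spikes[i] += 1
                      level += threshold
                  while level - s > threshold:
                      spikes[i] -= 1
                      level -= threshold
              return spikes

          t = np.linspace(0, 5, 200)                      # arbitrary time units
          light = np.exp(-t) * np.sin(4 * t) ** 2         # toy scintillation-like pulse
          print(delta_encode(light, threshold=0.05).nonzero()[0][:10])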

        Finally, we discuss the feasibility of implementing such a readout system using nanophotonic hardware based on III-V semiconductor nanowires, highlighting a pathway toward ultra-fast, energy-efficient calorimeter readout.

        Speaker: Enrico Lupi (CERN, INFN Padova (IT))
      • 56
        SparsePixels: Efficient Convolution for Sparse Data on FPGAs

        Inference of standard convolutional neural networks (CNNs) on FPGAs often incurs high latency and long initiation intervals due to the nested loops required to slide filters across the full input, especially when the input dimensions are large. However, in some datasets, meaningful signals may occupy only a small fraction of the input, sometimes just a few percent of the total pixels or even less. In such cases, most computations are wasted on regions containing no useful information. In this work, we introduce SparsePixels, a framework for efficient convolution over sparsely populated input data on FPGAs operating under tight resource and low-latency constraints. Our approach implements a special class of CNNs where only active pixels (non-zero or above a threshold) are retained and processed at runtime, while the inactive ones are discarded. We show that our framework can achieve performance comparable to standard CNNs in some target datasets while significantly reducing both latency and resource usage on FPGAs. Custom kernels for training and the HLS implementation are developed to support sparse convolution operations.
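
        The underlying idea can be illustrated in plain Python: keep only the active pixels and compute convolution outputs solely at those locations, gathering whichever neighbours exist. This is a conceptual sketch, not the SparsePixels training kernels or HLS implementation.

          import numpy as np

          def sparse_conv(active, kernel):
              """active: dict {(row, col): value} of non-zero pixels. Convolution
              outputs are produced only at the active locations."""
              kh, kw = kernel.shape
              out = {}
              for (r, c) in active:
                  acc = 0.0
                  for dr in range(kh):                       # gather only stored neighbours
                      for dc in range(kw):
                          nbr = (r + dr - kh // 2, c + dc - kw // 2)
                          acc += kernel[dr, dc] * active.get(nbr, 0.0)
                  out[(r, c)] = acc
              return out

          kernel = np.ones((3, 3)) / 9.0
          active = {(10, 12): 1.0, (10, 13): 0.8, (11, 12): 0.5}   # tiny fraction of the frame
          print(sparse_conv(active, kernel))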

        Speaker: Ho-Fung Tsoi (University of Pennsylvania)
      • 57
        FPGA-accelerated ML for real-time RHEED inference

        Reflection High-Energy Electron Diffraction (RHEED) is a common diffraction-based surface characterization technique for analyzing the properties of crystalline materials that are grown using a thin-film deposition technique like pulsed-laser deposition (PLD) or molecular-beam epitaxy (MBE). In this work, we design an FPGA-accelerated machine learning (ML) algorithm to perform real-time analysis of RHEED data, allowing us to track the growth process of a thin-film deposition sample in real-time. This enables future study of the dynamic, high-speed interactions between the applied material (e.g. plasma, gaseous atomic or molecular plume) and the sample being grown, and lays the groundwork for future development in optimal control of thin-film deposition techniques.

        Our ML solution consists of two standard CNNs in sequence: (a) an upstream object-detection CNN to locate the local maxima (“diffraction spots”) in a RHEED diffraction pattern, followed by (b) a downstream regression CNN to parametrize the ‘shape’ of each diffraction spot. We use a combination of real, augmented, and synthetic training data to produce models which are accurate, generalizable and robust to out-of-distribution samples. Through a combination of quantization-aware training, pruning, and neural architecture search, we produce models which are both small (in terms of onboard FPGA resource usage) and fast (in terms of inference latency and throughput). We convert these CNNs into FPGA designs using the hls4ml translation tool, and integrate them into a higher-level ‘inference block’ design, which handles data I/O, image-cropping and normalization, and block-level control of each CNN. Finally, this inference block is linked to a Phantom S200 high-speed camera using the Euresys FrameGrabber FPGA design.
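
        The conversion step follows the standard hls4ml flow; a minimal sketch is shown below, where the model, project directory, and FPGA part are placeholders rather than the actual RHEED designs.

          import hls4ml
          from tensorflow import keras

          # Placeholder model standing in for the spot-detection / regression CNNs.
          model = keras.Sequential([
              keras.layers.Input(shape=(32, 32, 1)),
              keras.layers.Conv2D(4, 3, activation="relu"),
              keras.layers.Flatten(),
              keras.layers.Dense(4),
          ])

          config = hls4ml.utils.config_from_keras_model(model, granularity="name")
          hls_model = hls4ml.converters.convert_from_keras_model(
              model,
              hls_config=config,
              output_dir="rheed_cnn_prj",          # hypothetical project directory
              part="xcu250-figd2104-2L-e",         # example FPGA part, not the target device
          )
          hls_model.compile()                      # C simulation of the generated HLS code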

        We simulate the top-level FPGA design to show that this hardware-accelerated ML solution has sufficient accuracy, resource-usage, and latency to retrieve scientifically useful results from a real thin-film deposition growth process.

        Speaker: Abdelrahman Asem Elabd (University of Washington (US))
    • 15:30
      Coffee
    • Contributed talks
      • 58
        Low-latency Jet Tagging for HL-LHC Using Transformer Architectures

        Transformers are state-of-the-art model architectures widely used across application areas of machine learning. However, the performance of such architectures is less well explored in ultra-low-latency domains where deployment on FPGAs or ASICs is required. Such domains include the trigger and data acquisition systems of the LHC experiments.

        We present a transformer-based algorithm for jet tagging built with the HGQ2 framework, which produces a model with heterogeneous bitwidths for fast inference on FPGAs, as required in the trigger systems of the LHC experiments. The bitwidths are learned during training by minimizing the total number of bit operations as an additional objective. By allowing a bitwidth of zero, the model is pruned in situ during training. Using this quantization-aware approach, our algorithm achieves state-of-the-art performance while also retaining permutation invariance, which is a key property for particle physics applications.

        Due to the strength of transformers in representation learning, our work serves also as a stepping stone for the development of a larger foundation model for trigger applications.

        Speaker: Lauri Antti Olavi Laatu (Imperial College (GB))
      • 59
        Radiation-Hard, ML-Based, Low-Latency Compression for the LHCb ECAL Upgrade

        The LHCb Upgrade II will operate at a data rate of 200 Tb/s, requiring efficient real-time data reduction. A major challenge of this pipeline is the transfer of full timing information from the frontend Electromagnetic Calorimeter (ECAL) to the backend for processing, which is critical for resolving pile-up, background suppression, and enhancing energy resolution. Due to the data rate, full timing information cannot be transmitted, requiring compression of data to reduce bandwidth. To address this, we develop a machine-learning-based compression algorithm, capable of learning detector-specific correlations in the data, outperforming generic compression schemes. Central to this effort is the extension of the hls4ml framework to fully support Microchip architectures, enabling the deployment of optimised autoencoder networks on the PolarFire FPGAs. These networks compress high-granularity timing data with minimal latency, achieving O(25 ns) inference times within stringent resource constraints. This development is key to reducing bandwidth while preserving physics performance and represents an essential step toward maintaining the physics reach of LHCb Upgrade II in the high-luminosity era.

        Speaker: Katya Govorkova (Massachusetts Inst. of Technology (US))
      • 60
        Chisel4ml: Using Chisel For Direct Circuit Implementation of Deeply Quantized Neural Networks

        We give an introduction to chisel4ml, a tool for generating direct circuit implementations of deeply quantized neural networks. It uses structural descriptions of deeply quantized neural networks in the form of Chisel generators. Chisel is a domain-specific language for designing synchronous digital circuits. It is a language embedded in Scala that offers a wealth of powerful features, such as functional programming, object-oriented programming, and static type safety. We will introduce you to the basics of the Chisel language and show you how chisel4ml can be used to create implementations of deeply quantized neural networks.

        Speaker: Jure Vreča
    • Birds-of-a-Feather
      • 61
        MLOps Pipeline for Continuous Deployment of Machine Learning Algorithms for HEP HIT E 41.1

        HIT E 41.1

        We present an MLOps-based approach for managing the end-to-end lifecycle of machine learning (ML) algorithms deployed on FPGAs in real-time trigger systems, as used in experiments such as CMS and ATLAS. The primary objective of this pipeline is to enable agile and robust responses to evolving detector and beam conditions by automating the collection of new training data, retraining and optimizing models, validating performance, synthesizing firmware, and deploying updated versions to both online and offline environments.
        To monitor model stability over time, we incorporate dedicated data streams that bypass trigger selections (e.g., scouting or express streams). These streams allow for continuous monitoring of model outputs and the detection of distributional drifts, enabling us to assess model operational lifetimes and support strategies like continual learning, periodic retraining, or threshold adjustment to ensure consistent performance.
        Our pipeline uses existing computing infrastructure, which includes distributed computing resources, container orchestration frameworks like Kubeflow, and CI/CD tools such as GitLab, to provide a scalable and maintainable foundation for real-time ML integration. This architecture supports rapid iteration cycles while promoting long-term sustainability, both of which are essential as ML becomes more central to trigger design and real-time data processing in modern collider experiments.
        We invite discussion on shared challenges, solutions, and future directions for managing the ML lifecycle in low-latency HEP environments. This includes topics like model validation, deployment strategies, firmware synthesis workflows, and the role of community-developed tooling across experiments.

        Speakers: Maciej Mikolaj Glowacki (CERN), Marius Köppel (ETH Zurich (CH))
      • 62
        QONNX Birds-of-a-Feather (BoF) Session Siemens Auditorium

        Siemens Auditorium

        QONNX (Quantized ONNX) serves as a shared input representation and frontend for several efficient inference projects, including FINN, chisel4ml and NN2FPGA. This birds-of-a-feather session would serve as a gathering point for the community to discuss recent developments and future plans for QONNX.

        Speaker: Yaman Umuroglu
    • Invited talks
      • 63
        Fast inference with Decision Forests

        Decision Forests such as Random Forests and Gradient Boosted Trees are an effective and widely used class of models for machine learning, particularly for tabular data and forecasting. This talk covers the practical use and ongoing research on Decision Forests at Google. We provide a brief overview of decision forest modeling with a focus on novel split conditions. We will analyze their impact on model quality as well as on performance characteristics during training and inference. Then, we discuss a variety of real-world applications for these models. Finally, we will explore how algorithmic approaches to structuring tree traversal can be optimized for diverse hardware architectures, including CPUs, GPUs, and FPGAs.

        Speaker: Richard Stotz (Google Zurich)
      • 64
        Efficient Graph neural networks at Google

        Graph Neural Networks (GNNs) are a powerful paradigm for neural network ML models to operate on relational data or data with structural information. This talk explores the practical use of and ongoing research on GNNs at Google for industrial applications. We provide a brief overview of GNN modeling, including GCNs, Graph Transformers, and geometry-aware models. Then we discuss a variety of real-world applications. Finally, we talk about scaling challenges on very large graphs, dynamic graphs, and fast inference on specialized hardware accelerators.

        Speaker: Mathieu Guillame-Bert (Google Zurich)
    • 09:45
      Coffee
    • Contributed talks
      • 65
        A Real-Time GNN-based Clustering Algorithm for the Level 1 Calorimeter Trigger at Belle II

        With increasing beam background levels at Belle II, which have already been observed due to the world-record instantaneous luminosities achieved by SuperKEKB and which are expected to rise further, an upgrade of the current Level 1 (L1) trigger algorithms is necessary to handle the evolving conditions. In this work, we present an upgraded L1 electromagnetic calorimeter trigger, based on Graph Neural Networks (GNNs) using dynamic graph building, implemented on the AMD XCVU160 FPGA used in the Belle II Universal Trigger Board 4 (UT4). The algorithm was developed in a software-hardware co-design approach, including quantization-aware training, pruning, and post-training optimizations, with both performance optimization and hardware requirements in mind. The network performs cluster finding and reconstruction in a one-shot approach, without assuming a predefined maximum number of clusters. We demonstrate an implementation of a 15-layer deep GNN with multiple graph construction and message passing steps on the FPGA. This design achieves the required throughput of 8 MHz and an overall latency of 3 $\mu$s. The implementation on the UT4 was deployed in monitoring mode within the full Belle II L1 trigger system and was included in collision data-taking in December 2024. This is the first operation of a GNN-based hardware trigger. We report implementation results showing 75% logic block usage and full utilization of DSP resources, with validation on both collision and cosmic-ray data collected with the Belle II detector.

        Speaker: Isabel Haide (Karlsruhe Institute for Technology)
      • 66
        Experiences Deploying a Hybrid PVFinder Algorithm for Primary Vertex Reconstruction in LHCb’s GPU-Resident HLT1

        The PVFinder algorithm employs a hybrid deep neural network (DNN) approach to reconstruct primary vertices (PVs) in proton-proton collisions at the LHC, addressing the complexities of high pile-up environments in LHCb and ATLAS experiments. By integrating fully connected layers with a UNet architecture, PVFinder’s end-to-end tracks-to-hist DNN processes charged track parameters to predict PV positions, achieving efficiencies above 97% and false positive rates as low as 0.03 per event in LHCb, surpassing conventional heuristic methods. We present the current status of embedding PVFinder into LHCb’s Allen framework, a fully software-based, GPU-optimized first-level trigger system for Run 3, handling 30 MHz of beam crossing data. Key challenges include optimizing computational efficiency and model integration within Allen’s real-time constraints. For ATLAS, PVFinder matches the Adaptive Multi-Vertex Finder’s efficiency while improving vertex-vertex resolution (0.23–0.37 mm vs. 0.76 mm). Future efforts target ATLAS ACTS integration and graph neural network enhancements.

        Speaker: Mohamed Elashri (University of Cincinnati)
      • 67
        State space models for Project 8 event reconstruction

        The Project 8 experiment aims to directly probe the neutrino mass by precisely measuring the energy spectrum of beta electrons emitted in the decay of tritium. The collaboration has pioneered the cyclotron radiation emission spectroscopy technique (CRES), which measures the energy of single electrons by detecting the cyclotron radiation they emit in a magnetic field. Traditional methods for event reconstruction rely on detecting tracks in a spectrogram after transforming the voltage output into the frequency domain. With the goal of achieving 0.3 eV root mean square (rms) energy resolution in the next prototype, these frequency-based methods face challenges, such as how to determine the electron’s location in the detector volume. State space models (SSMs) have shown promise for performing well on long time series data with good computational efficiency. In this work, we will demonstrate that the diagonal structured state space architecture (S4D) shows potential for reconstructing event parameters directly from the voltage time series in high-fidelity Project 8 simulations. The architecture’s minimal operations and good efficiency also open the possibility of real-time reconstruction at the 400 MHz sampling frequency.
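
        For context, a diagonal state space layer reduces to an elementwise linear recurrence over the sampled waveform, which is what keeps the per-sample cost low. The numpy sketch below shows a generic discrete-time diagonal SSM, not the S4D parameterisation used by the collaboration.

          import numpy as np

          rng = np.random.default_rng(4)
          state_dim, T = 16, 1000

          # Discrete-time diagonal SSM: x[k+1] = a * x[k] + b * u[k],  y[k] = Re(c . x[k])
          a = np.exp(-0.05 + 1j * rng.uniform(0, np.pi, state_dim))   # stable complex poles
          b = rng.normal(size=state_dim) + 0j
          c = rng.normal(size=state_dim) + 0j

          u = rng.normal(size=T)                     # stands in for the sampled voltage trace
          x = np.zeros(state_dim, dtype=complex)
          y = np.empty(T)
          for k in range(T):                         # the recurrence is O(state_dim) per sample
              x = a * x + b * u[k]
              y[k] = (c * x).sum().real

          print(y[:5])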

        Speaker: Hannah Binney
      • 68
        Get your poster!
    • Invited talks: Summary talk and poster prizes
      • 69
        Conference summary
        Speaker: Nhan Tran (Fermi National Accelerator Lab. (US))
      • 70
        Poster prizes and closing
        Speakers: Benjamin Ramhorst (ETH Zurich), Denis-Patrick Odagiu (ETH Zurich (CH)), Marius Köppel (ETH Zurich (CH))
    • 71
      Satellite event: hls4ml dev meeting HIT E 41.1

      HIT E 41.1

      For minutes of the discussion, see https://indico.cern.ch/event/1586270/

      Speakers: Benjamin Ramhorst (ETH Zurich), Jan-Frederik Schulte (Purdue University (US))