FPGAs provide unique advantages in the realm of machine learning acceleration. Unlike CPUs and GPUs, FPGAs allow for custom parallelism, data-type precision, and dataflow tailored specifically to the workload. Their reconfigurability enables the design of optimised hardware circuits that reduce latency and power consumption while improving throughput. Some common examples of FPGA-accelerated...
Neural networks (NNs) have gained significant interest in recent years due to their prevalence in AI applications. Lookup table (LUT) based NN architectures have emerged as a promising solution for ultra-low latency inference on reconfigurable hardware such as field programmable gate arrays (FPGAs). These techniques promise significant enhancements in both resource efficiency and inference...
This tutorial explores the growing demand for domain-specific hardware accelerators driven by the rapid evolution of AI and data analytics. Traditional hardware design cycles are too slow to keep up with the pace of algorithmic innovation. To address this, new agile hardware design methodologies are emerging, leveraging compiler technologies and High-Level Synthesis (HLS) to automate and...
Neural networks with a latency requirement on the order of microseconds are widely used at the CERN Large Hadron Collider, particularly in the low-level trigger system. To satisfy this latency requirement, these neural networks are often deployed on FPGAs.
This tutorial aims to provide a practical, hands-on guide to a software-hardware co-design workflow using the HGQ2 and da4ml libraries....
While machine learning has made tremendous progress in recent years, there is still a large gap between artificial and natural intelligence.
Closing this gap requires combining fundamental research in neuroscience with mathematics, physics, and engineering to understand the principles of neural computation and cognition.
Mixed-signal subthreshold analog and asynchronous digital electronic...
The real-time processing of data created by the Large Hadron Collider's (LHC) experiments, amounting to over 10% of worldwide internet traffic, is one of the greatest computing challenges ever attempted. I will discuss the concrete applications of real-time processing in the LHC's main experiments, and the technological innovations in this area over the past decades. I will also reflect on the...
This talk provides an overview of several libraries in the open-source JAX ecosystem (such as Equinox, Diffrax, Optimistix, ...) In short, we have been building an "autodifferentiable GPU-capable scipy". These libraries offer the foundational core of tools that have made it possible for us to train neural networks (e.g. score-based diffusions for image generation), solve PDEs, and smoothly...
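To make the "autodifferentiable GPU-capable scipy" idea concrete, here is a minimal sketch that solves a small ODE and differentiates the solution with respect to its parameters. It is illustrative only: the call signatures follow the publicly documented Diffrax API as I understand it, while the toy vector field and parameter values are invented for this example.

import jax
import jax.numpy as jnp
import diffrax

def vector_field(t, y, args):
    # Toy damped oscillator: dy/dt = [v, -k*x - c*v]
    k, c = args
    return jnp.array([y[1], -k * y[0] - c * y[1]])

def final_position(params):
    term = diffrax.ODETerm(vector_field)
    sol = diffrax.diffeqsolve(term, diffrax.Tsit5(), t0=0.0, t1=10.0, dt0=0.01,
                              y0=jnp.array([1.0, 0.0]), args=params)
    return sol.ys[0, 0]          # position at t1

# Gradients of the ODE solution w.r.t. (k, c), obtained by autodiff through the solver.
grads = jax.grad(final_position)((2.0, 0.1))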
Most commercial wearables still capture only basic metrics such as step counts or heart rate, and remain closed systems without access to raw data. In this talk, I will present our holistic approach to full-body biosignal intelligence, where ultra-low-power embedded platforms and machine learning algorithms are co-designed to capture and process signals from the brain, eyes, muscles, and...
Custom FPGA dataflow accelerators for DNN inference can enable unprecedented performance and efficiency for many applications. Dataflow accelerator compilers, such as the FINN framework, have improved in recent years and allow practitioners to explore this technology without requiring in-depth FPGA knowledge.
However, the overall design process remains quite tedious, time-consuming, and...
As the demand for efficient machine learning on resource-limited devices grows, model compression techniques like pruning and quantization have become increasingly vital. Despite their importance, these methods are typically developed in isolation, and while some libraries attempt to offer unified interfaces for compression, they often lack support for deployment tools such as hls4ml. To...
As neural networks (NNs) are increasingly used to provide edge intelligence, there is a growing need to make the edge devices that run them robust to faults. Edge devices must mitigate the resulting hardware failures while maintaining strict constraints on power, energy, latency, throughput, memory size, and computational resources. Edge NNs require fundamental changes in model...
On-chip learning has the potential to unlock low-latency, low-power, and continuously adaptive AI directly on edge devices. However, research in this area remains limited by the lack of accessible hardware toolchains that support backpropagation. To address this gap, we propose ENABOL, a hardware-efficient extension of the HLS4ML toolchain that enables customizable backpropagation support...
Neural networks with a latency requirement on the order of microseconds, like the ones used at the CERN Large Hadron Collider, are typically deployed on FPGAs pipelined with II=1. A bottleneck for the deployment of such neural networks is area utilization, which is directly related to the required constant matrix-vector multiplication (CMVM) operations. In this work, we propose an efficient...
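For context on why CMVM dominates the area: with compile-time-constant fixed-point weights, every multiplication unrolls into shifts and adds, and subexpressions can be shared across outputs. The example below is a generic illustration of this standard shift-and-add expansion with subexpression reuse, not necessarily the specific decomposition proposed in the work:

$$
\begin{aligned}
y_0 &= 5x_0 + 3x_1 = \bigl((x_0 \ll 2) + x_0\bigr) + \bigl((x_1 \ll 1) + x_1\bigr),\\
y_1 &= 5x_0 + 6x_1 = \underbrace{\bigl((x_0 \ll 2) + x_0\bigr)}_{\text{shared with } y_0} + \underbrace{\bigl((x_1 \ll 1) + x_1\bigr)}_{\text{shared with } y_0} \ll 1,
\end{aligned}
$$

so the whole product reduces to a handful of adders, and the adder count, rather than DSP multipliers, drives the FPGA resource cost.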
The ATLAS Level-0 Global Trigger is a mission-critical system opting to take advantage of the full calorimeter granularity during Run-4 and beyond. Level-0 Global will execute a cascade of trigger algorithms that combine both calorimeter and muon information. Within the Next Generation Trigger project at CERN there is a dedicated work package (WP2.1) exploring large-scale deployment of...
In the era of continuous data generation, real-time processing of data streams has become crucial for timely, adaptive, and context-aware decision-making. However, maintaining effective learning models in such dynamic environments requires carefully balancing prediction performance, transparency and energy consumption.
In the talk, we will present two new state-of-the-art methods for...
The widespread deployment of embedded ML systems has created a need for resilient, fault-tolerant hardware and software capable of operating in inherently noisy conditions. While the standardization of low-precision (≤ 8-bit) datatypes has allowed for reduced training and inference costs and increased interoperability across commercial accelerators, clear guidelines for robust implementation...
The rising computational demands of increasing data rates and complex machine learning (ML) algorithms in large-scale scientific experiments have driven the adoption of the Services for Optimized Network Inference on Coprocessors (SONIC) framework. SONIC accelerates ML inference by offloading tasks to local or remote coprocessors, optimizing resource utilization. Its portability across diverse...
Most current machine learning (ML) applications are purely data-driven solutions with little consideration for the underlying problem dynamics, limiting them to in-distribution applications. To tackle this limitation, a stream of literature is emerging to address out-of-distribution (OOD) performance: Algorithmic alignment, which focuses on embedding algorithmic structures into ML architectures...
Matrix-vector (GEMV) operations are a common building block in many deep learning models, particularly for large dense layers found in convolutional neural networks (CNNs) and multi-layer perceptrons (MLPs). Despite their importance, GEMV kernels have historically underperformed compared to matrix-matrix (GEMM) operations due to their lower arithmetic intensity and limited data reuse, making...
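The arithmetic-intensity gap can be quantified with a simple operation/traffic count (a back-of-the-envelope estimate in elements rather than bytes, ignoring caches; the symbols m, n, k denote the matrix dimensions):

$$
I_{\mathrm{GEMV}} = \frac{2mn}{mn + m + n} \approx 2,
\qquad
I_{\mathrm{GEMM}} = \frac{2mnk}{mk + kn + mn} \;\xrightarrow{\;m=n=k=N\;}\; \frac{2N}{3},
$$

so a GEMV performs only about two operations per element touched and is memory-bound, whereas GEMM's intensity grows with problem size and can saturate the available compute.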
From radio telescopes to particle accelerators and electron microscopes, scientific instruments produce tremendous amounts of data at equally high rates; previous architectures that have relied on offline storage and large data transfers are unable to keep up. The future of scientific discovery is interactive, streaming, and AI driven, placing the autonomous and intelligent instrument at the...
AXOL1TL is an anomaly detection (AD) trigger algorithm integrated into the Global Trigger (GT) of the CMS Level-1 Trigger (L1T) system since 2024. The GT reduces the event rate from proton–proton collisions at the LHC, lowering it from 40 MHz to 100 kHz within a fixed latency of 50 ns. The AD algorithm, implemented in the FPGA firmware of the GT board, uses an autoencoder to assign an anomaly...
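As a generic point of reference (not necessarily the exact AXOL1TL definition), autoencoder-based triggers typically derive the anomaly score from how poorly an event is reconstructed,

$$
s(x) = \lVert x - D(E(x)) \rVert_2^2,
$$

with encoder $E$ and decoder $D$; firmware implementations sometimes replace this with a latent-space quantity so that the decoder need not be implemented on the FPGA.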
The absence of BSM physics discoveries at the LHC suggests new physics could lie outside current trigger schemes. By applying unsupervised ML–based anomaly detection, we gain a model-agnostic way of spotting anomalous signatures that deviate from the current trigger’s expectations. Here we introduce a Run-3 trigger chain that embeds fast anomaly detection algorithms in both hardware and...
At the Phase-2 Upgrade of the CMS Level-1 Trigger (L1T), particles will be reconstructed by linking charged particle tracks with clusters in the calorimeters and muon tracks from the muon station. The 200 pileup interactions will be mitigated using primary vertex reconstruction for charged particles and a weighting for neutral particles based on the distribution of energy in a small area. Jets...
Belle II is a luminosity frontier experiment located at the SuperKEKB asymmetric $e^+ e^-$ collider, operating at the $\Upsilon(4S)$ resonance. The $\tau$ physics program at Belle II involves both probes of new physics and precision measurements of standard model parameters with large statistics. SuperKEKB is projected to reach a luminosity of $6\times 10^{35}~\text{cm}^{-2}\text{s}^{-1}$ in...
The High Luminosity upgrade of the Large Hadron Collider (HL-LHC) presents a demanding environment for real-time data processing, with substantially increased event rates requiring faster and more efficient trigger systems. This study explores the deployment of graph neural networks (GNNs) on field-programmable gate arrays (FPGAs) for fast and accurate inference within future muon trigger...
The ATLAS trigger system will undergo a comprehensive upgrade in advance of the HL-LHC programme. In order to deal with the increased data bandwidth, trigger algorithms will be required to satisfy stricter latency requirements. We propose a method to speed up the current calorimeter-only preselection step and to aid trigger decisions for hadronic signals containing jets.
We demonstrate the use...
Optimized FPGA implementations of tiny neural networks are crucial for low-latency and hardware-efficient inference for a variety of applications. Neural networks based on lookup tables (LUTs) are a standard technique for such problems due to their hardware efficiency and strong expressivity. However, such networks are often difficult to scale up as their resource usage scales exponentially...
Modern foundation models (FMs) have pushed the frontiers of language, vision, and multimodal tasks by training ever-larger neural networks (NNs) on unprecedented volumes of data. The use of FMs has yet to be established in collider physics, which lacks both a comparably sized, general-purpose dataset on which to pre-train universal event representations and a clear, demonstrable need....
The analysis of point cloud data, for example signals from charged particles recorded by detectors in high energy physics (HEP) experiments, can be significantly enhanced and accelerated by the application of machine learning models. In recent years, transformer architectures have come into focus as offering excellent model performance. However, for traditional transformers, the need to compute...
The Interaction Network (IN) algorithm has shown great promise for particle tracking applications at the Large Hadron Collider (LHC), where identifying complex particle trajectories from raw detector data is a computationally intensive task. IN leverages graph-based representations of detector hits to learn relationships between particle interactions, making it well-suited for this domain....
AI is accelerating into the generative era, and it is poised to disrupt multiple businesses and applications. With the increasing focus on edge and extreme-edge, near-sensor applications, inference is becoming the key workload and computational challenge. Computing systems need to scale out and scale up to meet the challenge. In this talk I will discuss how to scale up chip(lets) for efficient...
Beyond the well-known highlights in computer vision and natural language, AI is steadily expanding into new application domains. This Pervasive AI trend requires supporting diverse and fast-moving application requirements, ranging from specialized I/O to fault tolerance and limited resources, all the while retaining high performance and low latency. Adaptive compute architectures such as AMD...
The trigger systems of ATLAS and CMS currently reject vast numbers of potentially valuable collision events due to their conservative, static designs, a limitation that directly hampers discovery potential. We propose an alternative to these rigid, hand-tuned menus with an autonomous controller capable of dynamically optimizing trigger performance in real time.
In this work, we demonstrate...
Machine Learning (ML) techniques are increasingly applied for the optimization of complex computing systems, but their integration into core low-level system mechanisms remains limited. A key barrier is the lack of accessible, high-performance interfaces at the boundary between software and hardware, as well as hardware-offloaded ML inference at full system speed. In this presentation, we...
Tuning hyperparameters of ML models, especially large ML models, can be time consuming and computationally expensive. As a potential solution, several recent papers have explored hyperparameter transfer. Under certain conditions, the optimal hyperparameters of a small model are also optimal for larger models. One can therefore tune only the small model and transfer the hyperparameters to the...
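A minimal sketch of the transfer step, assuming a muP-style rule in which the learning rate found on a narrow proxy model is rescaled by the width ratio; the 1/width scaling shown here is one commonly used choice, and the exact conditions under which transfer holds are the subject of the talk:

# Hypothetical illustration of hyperparameter transfer across model width.
# Assumes a muP-like scaling: hidden-layer learning rate ~ base_width / width.

def transfer_lr(tuned_lr, base_width, target_width):
    """Rescale a learning rate tuned on a narrow proxy model to a wider model."""
    return tuned_lr * base_width / target_width

lr_small = 3e-3                                   # found by sweeping the cheap width-128 proxy
lr_large = transfer_lr(lr_small, base_width=128, target_width=4096)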
Graph Neural Networks (GNNs), particularly Interaction Networks (INs), have shown exceptional performance for jet tagging at the CERN High-Luminosity Large Hadron Collider. However, their computational complexity and irregular memory access patterns pose significant challenges for deployment on FPGAs in hardware trigger systems, where strict latency and resource constraints apply.
In this...
The Smartpixels project is a coordinated effort to co-design pixel ASICs, design tools, ML algorithms, and sensors for on-detector data reduction, motivated by the technical challenges of current and future colliders. The drive to greater precision requires smaller pixel pitch, which together with higher event rates arising from pileup and/or beam-induced background generates petabytes of data...
We conduct a systematic study of quantum-inspired Tensor Network (TN) models—Matrix Product States (MPS) and Tree Tensor Networks (TTN)—for real-time jet tagging in high-energy physics, with a focus on low-latency deployment on FPGAs. Motivated by the strict computational demands of the HL-LHC Level-1 Trigger system, we explore TN architectures as compact and interpretable alternatives to deep...
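For orientation, the compactness claim rests on the standard MPS factorization, shown here in its textbook form rather than as this work's specific architecture: a rank-$N$ weight tensor is written as a chain of small cores,

$$
W_{s_1 s_2 \cdots s_N} = \sum_{\alpha_1,\ldots,\alpha_{N-1}} A^{s_1}_{\alpha_1}\, A^{s_2}_{\alpha_1\alpha_2} \cdots A^{s_N}_{\alpha_{N-1}},
$$

so the parameter count scales as $\mathcal{O}(N d \chi^2)$ with physical dimension $d$ and bond dimension $\chi$ instead of $d^N$, making $\chi$ the knob that trades accuracy against FPGA resources.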
Hadronic calorimeters are a key part of high energy physics experiments. Traditionally, they rely on high granularity to improve performance, but this leads to various challenges in terms of cost, energy consumption, and output data volume. Moreover, current detectors do not have the capability of exploiting temporal information of the shower development, as the time frame for pattern...
Inference of standard convolutional neural networks (CNNs) on FPGAs often incurs high latency and long initiation intervals due to the nested loops required to slide filters across the full input, especially when the input dimensions are large. However, in some datasets, meaningful signals may occupy only a small fraction of the input, sometimes just a few percent of the total pixels or...
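To make the latency source explicit, a reference sliding-window convolution (a plain Python sketch, not an HLS kernel, with invented array shapes) loops over every output position regardless of how little of the input actually contains signal:

import numpy as np

def conv2d_naive(x, w):
    """Reference 2-D convolution (valid padding, stride 1) with nested loops.
    The trip count scales with the full input size even if only a few percent
    of the pixels carry meaningful signal."""
    H, W = x.shape
    K = w.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(H - K + 1):        # slide the filter over rows
        for j in range(W - K + 1):    # slide the filter over columns
            out[i, j] = np.sum(x[i:i + K, j:j + K] * w)
    return out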
Reflection High-Energy Electron Diffraction (RHEED) is a common diffraction-based surface characterization technique for analyzing the properties of crystalline materials that are grown using a thin-film deposition technique like pulsed-laser deposition (PLD) or molecular-beam epitaxy (MBE). In this work, we design an FPGA-accelerated machine learning (ML) algorithm to perform real-time...
Transformers are state-of-the-art model architectures widely used across application areas of machine learning. However, the performance of such architectures is less well explored in ultra-low-latency domains where deployment on FPGAs or ASICs is required. Such domains include the trigger and data acquisition systems of the LHC experiments.
We present a transformer-based algorithm...
The LHCb Upgrade II will operate at a data rate of 200 Tb/s, requiring efficient real-time data reduction. A major challenge of this pipeline is the transfer of full timing information from the frontend Electromagnetic Calorimeter (ECAL) to the backend for processing, which is critical for resolving pile-up, background suppression, and enhancing energy resolution. Due to the data rate, full...
We present an MLOps-based approach for managing the end-to-end lifecycle of machine learning (ML) algorithms deployed on FPGAs in real-time trigger systems, as used in experiments such as CMS and ATLAS. The primary objective of this pipeline is to enable agile and robust responses to evolving detector and beam conditions by automating the collection of new training data, retraining and...
QONNX (Quantized ONNX) serves as a shared input representation and frontend for several efficient inference projects, including FINN, chisel4ml and NN2FPGA. This birds-of-a-feather session would serve as a gathering point for the community to discuss recent developments and future plans for QONNX.
Decision Forests such as Random Forests and Gradient Boosted Trees are an effective and widely used class of models for machine learning, particularly for tabular data and forecasting. This talk covers the practical use and ongoing research on Decision Forests at Google. We provide a brief overview of decision forest modeling with a focus on novel split conditions. We will analyze their impact...
Graph Neural Networks (GNNs) are a powerful paradigm for neural network models that operate on relational data or data with structural information. This talk explores the practical use of and ongoing research on GNNs at Google for industrial applications. We provide a brief overview of GNN modeling, including GCNs, Graph Transformers, and geometry-aware models. Then we discuss a variety of...
With increasing beam background levels at Belle II, which have already been observed due to the world-record instantaneous luminosities achieved by SuperKEKB and which are expected to rise further, an upgrade of the current Level 1 (L1) trigger algorithms is necessary to handle the evolving conditions. In this work, we present an upgraded L1 electromagnetic calorimeter trigger, based on Graph...
The PVFinder algorithm employs a hybrid deep neural network (DNN) approach to reconstruct primary vertices (PVs) in proton-proton collisions at the LHC, addressing the complexities of high pile-up environments in LHCb and ATLAS experiments. By integrating fully connected layers with a UNet architecture, PVFinder’s end-to-end tracks-to-hist DNN processes charged track parameters to predict PV...
For minutes of the discussion, see https://indico.cern.ch/event/1586270/
Quartz Crystal Microbalance (QCM) sensors are renowned for their high sensitivity to mass changes, making them ideal for detecting environmental parameters such as relative humidity (RH) and ultraviolet (UV) radiation. In this work, we present an AI-driven, dual-sided coated QCM sensor integrated with advanced machine learning (ML) and implemented on a real-time hardware platform. This sensor...
Authors:
Gustavo Alonso, Maximilian Jakob Heer, Benjamin Ramhorst
As Moore’s Law and Dennard Scaling reach their limits, computing is shifting toward heterogeneous hardware for large-scale data processing. Cloud vendors are deploying accelerators, like GPUs, DPUs, and FPGAs, to meet growing computational demands of ML and big data.
While FPGAs offer great flexibility and performance, practically integrating them in larger systems remains challenging due...
Transformers excel at modeling correlations in LHC collisions but incur high costs from quadratic attention. We analyze the Particle Transformer using attention maps and pair correlations on the (η,ϕ) plane, revealing that Particle Transformer attention maps learn traditional jet substructure observables. To improve efficiency we benchmark linear attention variants on JetClass and find that...
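Schematically (standard formulas, not specific to the Particle Transformer), the cost difference comes from where the matrix products are grouped:

$$
\mathrm{Attn}(Q,K,V) = \mathrm{softmax}\!\Bigl(\tfrac{QK^{\top}}{\sqrt{d}}\Bigr)V \in \mathcal{O}(N^2 d),
\qquad
\phi(Q)\,\bigl(\phi(K)^{\top}V\bigr) \in \mathcal{O}(N d^2),
$$

where $N$ is the number of particles, $d$ the embedding dimension, and $\phi$ a kernel feature map; linear-attention variants exploit the right-hand grouping at the cost of approximating the softmax.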
The Alpha Magnetic Spectrometer (AMS-02) is a precision high-energy cosmic-ray experiment consisting of a Transition Radiation Detector (TRD), Silicon Tracker, Magnet, Time of Flight (ToF), Ring Imaging Cherenkov Detector (RICH), Anti-Coincidence Counter (ACC), and Electromagnetic Calorimeter (ECAL). It has been operating on the ISS since 2011 and has collected more than 240 billion cosmic-ray events. Among...
Small ($R<4\,\mathrm{R}_{\oplus}$), long-period ($30\,\mathrm{days}<P$) exoplanets with low equilibrium temperatures are an extremely interesting population, promising insights into planet formation, atmospheric chemistry and evolution, as well as habitability. However, for these planets, the current observing strategy of NASA's Transiting Exoplanet Survey Satellite (TESS) can only capture...
With the increasing size of machine learning (ML) models and vast datasets, foundation models have transformed how we apply ML to solve real-world problems. Multimodal language models like ChatGPT and Llama have expanded their capabilities to specialized tasks from a common pre-training. Similarly, in high-energy physics (HEP), common tasks in the analysis face recurring challenges that demand...
Since version 1.0, hls4ml has provided a oneAPI backend for Altera FPGAs, as an evolution of the backend that targeted Intel HLS. Some design choices will be presented here, including the use of pipes and task sequences to develop a dataflow-style architecture. The oneAPI framework, unlike the Intel HLS framework, also naturally supports an accelerator-style deployment. Using always-run...
We present the development of a machine learning (ML) based regulation system for third-order resonant beam extraction in the Mu2e experiment at Fermilab. Classical and ML-based controllers have been optimized using semi-analytic simulations and evaluated in terms of regulation performance and training efficiency. We compare several controller architectures and discuss the integration of...
Introduction
Accurate climate prediction hinges on the ability to resolve multi-scale turbulent dynamics in the atmosphere and oceans [1]. An important mechanism of energy exchange between the ocean and the atmosphere is mesoscale turbulence, which contains motions of length scale $\mathcal{O}$(100 km). Two-layer quasi-geostrophic (QG) simulations [2] are a popular technique for...
The rising popularity of large language models (LLMs) has led to a growing demand for efficient model deployment. In this context, the combination of post-training quantization (PTQ) and low-precision floating-point formats such as FP4, FP6 and FP8 has emerged as an important technique, allowing for rapid and accurate quantization with the ability to capture outlier values in LLMs without...
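As a schematic of what such post-training quantization involves, the sketch below simulates rounding weights onto a low-precision floating-point grid with a limited mantissa and clamped exponent range. It is a simplified per-tensor scheme with invented parameter values; production FP4/FP6/FP8 PTQ pipelines additionally use per-channel or per-group scales and format-specific saturation and subnormal handling.

import numpy as np

def fake_quant_float(x, man_bits=2, min_exp=-6, max_exp=8):
    """Round-to-nearest simulation of a tiny float format: keep `man_bits`
    mantissa bits and clamp the exponent range. Illustrative only."""
    sign = np.sign(x)
    mag = np.where(x == 0, 1.0, np.abs(x))        # avoid log2(0)
    exp = np.clip(np.floor(np.log2(mag)), min_exp, max_exp)
    step = 2.0 ** (exp - man_bits)                # spacing of representable values
    return sign * np.round(np.abs(x) / step) * step

w = np.random.randn(512, 512).astype(np.float32)
w_q = fake_quant_float(w, man_bits=2)             # FP6-like mantissa width (assumption)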
Modern development flows that use tooling for automated building, testing, and deployment of software are becoming the norm for large scale software and hardware projects. These flows offer quite a few advantages that make them desirable, but for projects that use FPGAs, complications can arise when integrating them with traditional FPGA...
High-resolution electron microscopy generates large volumes of pixel detector data due to beam rates reaching $10^7$ to $10^{10}$ electrons per second directed at the sample. Of this data, only the electron entry point into the silicon detector prior to scattering is typically of interest for downstream analysis. Precise knowledge of these entry points is particularly important in electron...
In preparation for the High Luminosity LHC (HL-LHC) run, the CMS experiment is developing a major upgrade of its Level-1 (L1) Trigger system, which will integrate high-granularity calorimeter data and real-time tracking using FPGA-based processors connected via a high-bandwidth optical network. A central challenge is the identification of electrons in a high pileup environment within strict...
The LHCb experiment at CERN operates a fully software-based first-level trigger that processes 30 million collision events per second, with a data throughput of 4 TB/s. Real-time tracking—reconstructing particle trajectories from raw detector hits—is essential for selecting the most interesting events, but must be performed under tight latency and throughput constraints.
A key bottleneck in...
The CICADA (Calorimeter Image Convolutional Anomaly Detection Algorithm) project aims to detect anomalous physics signatures without bias from theoretical models in proton-proton collisions at the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider. CICADA identifies anomalies in low-level calorimeter trigger data using a convolutional autoencoder, whose behavior is...
The Large Hadron Collider (LHC) will soon undergo a high-luminosity (HL) upgrade to improve future searches for new particles and to measure particle properties with increased precision. The upgrade is expected to provide a dataset ten times larger than the one currently available by the end of its data-taking period. The increased beam intensity will also increase the number of simultaneous...
The inverse design of photonic surfaces produced by high-throughput femtosecond laser processing is limited by a strongly non-linear, many-to-one mapping from laser parameters (power, speed, hatch spacing) to the resulting optical spectrum. Tandem Neural Networks (TNNs) mitigate this ill-posedness by pairing a forward surrogate with a separately trained inverse network, but they still rely on...
Charged-particle track reconstruction is the foundation of collider experiments. Yet it is also the most computationally expensive part of particle reconstruction. Innovations in track reconstruction using graph neural networks (GNNs) have demonstrated a promising capability to address the computing challenges posed by the High-Luminosity LHC (HL-LHC) with machine learning....
In this paper, we propose a method to perform empirical analysis of the loss landscape of machine learning (ML) models. The method is applied to two ML models for scientific sensing, which necessitate quantization for deployment and are subject to noise and perturbations due to experimental conditions.
Our method allows assessing the robustness of ML models to such effects as a function of...
Benchmarks are a cornerstone of modern machine learning practice, providing standardized evaluations that enable reproducibility, comparison, and scientific progress. Yet, as AI systems, particularly deep learning models, become increasingly dynamic, traditional static benchmarking approaches are losing their relevance. Models rapidly evolve in architecture, scale, and capability;...
We present NomAD (Nanosecond Anomaly Detection), a real-time anomaly detection algorithm designed for the ATLAS Level-1 Topological (L1Topo) trigger using unsupervised machine learning. The algorithm combines a Variational Autoencoder (VAE) with Boosted Decision Tree (BDT) regression to compress and distill deep learning inference into a firmware-compatible format for FPGAs. Trained on 2024...
The escalating demand for data processing in particle physics research has spurred the exploration of novel technologies to enhance the efficiency and speed of calculations. This study presents the development of an FPGA implementation of MADGRAPH, a widely used tool for particle collision simulations, using High-Level Synthesis. The work is a proof of concept limited to a single,...
As the era of the High-Luminosity Large Hadron Collider (HL-LHC) approaches, the GPU-accelerated High-Level Trigger (HLT) of the CMS experiment faces a stringent requirement to reduce the Level-1 readout stream from 100 kHz to 5 kHz, a twenty-fold decrease essential to adhere to archival bandwidth constraints [1], [2]. Meeting this demand necessitates highly efficient real-time...
Simulating relativistic orbital dynamics around Schwarzschild black holes is essential for understanding general relativity and astrophysical phenomena like precession. Traditional numerical solvers face difficulty while dealing with noisy or sparse data, necessitating data-driven approaches. We develop a Scientific Machine Learning (SciML) framework to model orbital trajectories and...
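For context, the dynamics being learned can be summarized by the standard radial energy equation for timelike Schwarzschild geodesics (textbook form, geometric units $G = c = 1$; included only to make the target dynamics explicit, not as this work's model):

$$
\left(\frac{dr}{d\tau}\right)^{2} = E^{2} - \left(1 - \frac{2M}{r}\right)\left(1 + \frac{L^{2}}{r^{2}}\right),
$$

where $E$ and $L$ are the conserved energy and angular momentum per unit mass; the $-2ML^{2}/r^{3}$ term in the effective potential is what produces the relativistic precession absent in Newtonian orbits.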
We investigate the application of state space models (SSMs) to a diverse set of scientific time series tasks. In particular, we benchmark the performance of SSMs against a set of baseline neural networks across three domains: magnet quench prediction, gravitational wave signal classification (LIGO), and neural phase estimation. Our analysis evaluates both computational efficiency—quantified by...
Pak choi (Brassica rapa subsp. chinensis) is a leafy green vegetable widely cultivated in vertical urban farming systems due to its rapid growth and high yield under compact, hydroponic setups. However, even in these controlled environments, crops remain susceptible to various diseases. Among the most common threats are fungal infections such as Alternaria leaf spot and powdery mildew, and...
As machine learning (ML) is increasingly implemented in hardware to address real-time challenges in scientific applications, the development of advanced toolchains has significantly reduced the time required to iterate on various designs. These advancements have solved major obstacles, but also exposed new challenges. For example, processes that were not previously considered bottlenecks,...
Graph Neural Networks (GNNs) have become promising candidates for particle reconstruction and identification in high-energy physics, but their computational complexity makes them challenging to deploy in real-time data processing pipelines. In the next-generation LHCb calorimeter, detector hits—characterized by energy, position, and timing—can be naturally encoded as node features, with...