6–10 Oct 2025
Rethymno, Crete, Greece
Europe/Athens timezone

A Latency-Constrained, Gated Recurrent Unit (GRU) Implementation in the Versal AI Engine

10 Oct 2025, 09:20
16m
AQUILLES, Aquila

Oral | Programmable Logic, Design and Verification Tools and Methods | Logic

Speaker

Michail Sapkas (Universita e INFN, Padova (IT))

Description

This work explores the use of the AMD Xilinx Versal Adaptable Intelligent Engine (AIE) to accelerate Gated Recurrent Unit (GRU) inference for latency-constrained applications. We present a custom framework for distributing the workload across the AIE's vector processors and propose a hybrid AIE–Programmable Logic (PL) design to optimize computational efficiency. Benchmarking against existing FPGA GRU implementations using a top quark jet tagging dataset demonstrates promising latency results. Our approach highlights the potential of deploying adaptable neural networks in real-time environments, such as online preprocessing in the readout chain of a physics experiment, offering a flexible alternative to traditional fixed-function algorithms.

Summary (500 words)

In this ongoing research, we investigate the potential of the AMD Xilinx Versal Adaptable Intelligent Engine (AIE) as a hardware accelerator for Recurrent Neural Networks (RNNs), with a particular focus on the Gated Recurrent Unit (GRU) architecture. The GRU was selected as a representative model due to its relatively low parameter count and reduced number of internal gates, making it a computationally efficient choice for deployment in constrained or high-throughput environments. Our exploration seeks to assess both the feasibility and performance of GRU inference when implemented within the AIE infrastructure.
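For reference, a standard GRU cell (in one common formulation; the exact variant used in this work is not restated here) needs only an update gate z_t, a reset gate r_t and a candidate state, which is where its parameter advantage over the LSTM's three gates and separate cell state comes from:

    z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
    r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
    \tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)
    h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

Here \sigma is the logistic sigmoid, \odot the element-wise product, and W, U, b the trainable weights and biases.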
The Versal AIE architecture presents a highly parallel processing environment, featuring an array of 400 vectorized computing engines optimized for numerical tasks. However, a significant limitation of the current software toolchain is the absence of native support for automatic workload distribution across the available processing elements. Consequently, we address the technical challenge of manual workload partitioning and propose a novel methodology for distributing the GRU's computational graph across the AIE cores. Our initial framework successfully enables decentralized execution of GRU operations, maintaining consistency and synchronization of internal states across the array.
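The sketch below is not the framework itself, but a plain C++ model of the kind of row-wise partitioning it has to perform: each of a hypothetical N_CORES workers owns a contiguous slice of the hidden state, computes the gate slices it is responsible for, and the reset-gate values must be exchanged before the candidate state can be formed, which is exactly the synchronization point mentioned above. All names, sizes and interfaces here are illustrative assumptions, not the AIE kernel code.

    #include <array>
    #include <cmath>
    #include <cstddef>

    // Illustrative sizes only: N_H hidden units split row-wise over N_CORES workers.
    constexpr std::size_t N_X = 16, N_H = 64, N_CORES = 4, ROWS = N_H / N_CORES;

    using Vec = std::array<float, N_H>;
    using In  = std::array<float, N_X>;

    // One gate's parameters, restricted to the ROWS rows owned by a given core.
    struct GateSlice {
        std::array<std::array<float, N_X>, ROWS> W;  // input weights
        std::array<std::array<float, N_H>, ROWS> U;  // recurrent weights (need full h)
        std::array<float, ROWS> b;
    };
    struct CoreParams { GateSlice z, r, h; };

    static float sigmoidf(float v) { return 1.0f / (1.0f + std::exp(-v)); }

    // Phase 1: each core produces its slice of the update (z) and reset (r) gates
    // from the broadcast input x and the full previous hidden state.
    void phase1(std::size_t core, const CoreParams& p, const In& x, const Vec& h_prev,
                Vec& z_full, Vec& r_full) {
        for (std::size_t i = 0; i < ROWS; ++i) {
            float az = p.z.b[i], ar = p.r.b[i];
            for (std::size_t j = 0; j < N_X; ++j) { az += p.z.W[i][j] * x[j]; ar += p.r.W[i][j] * x[j]; }
            for (std::size_t j = 0; j < N_H; ++j) { az += p.z.U[i][j] * h_prev[j]; ar += p.r.U[i][j] * h_prev[j]; }
            z_full[core * ROWS + i] = sigmoidf(az);
            r_full[core * ROWS + i] = sigmoidf(ar);
        }
    }

    // Phase 2: after the reset-gate slices have been exchanged (the synchronization
    // point), each core forms its slice of the candidate state and of the new h.
    void phase2(std::size_t core, const CoreParams& p, const In& x, const Vec& h_prev,
                const Vec& z_full, const Vec& r_full, Vec& h_next) {
        for (std::size_t i = 0; i < ROWS; ++i) {
            float ah = p.h.b[i];
            for (std::size_t j = 0; j < N_X; ++j) ah += p.h.W[i][j] * x[j];
            for (std::size_t j = 0; j < N_H; ++j) ah += p.h.U[i][j] * (r_full[j] * h_prev[j]);
            const std::size_t g = core * ROWS + i;
            const float cand = std::tanh(ah);
            h_next[g] = (1.0f - z_full[g]) * h_prev[g] + z_full[g] * cand;
        }
    }

    // Host-side model of one time step: on the real array the two loops would run
    // concurrently on separate tiles, with a gather/broadcast between the phases.
    void gru_step(const std::array<CoreParams, N_CORES>& cores, const In& x, Vec& h) {
        Vec z{}, r{}, h_next{};
        for (std::size_t c = 0; c < N_CORES; ++c) phase1(c, cores[c], x, h, z, r);
        for (std::size_t c = 0; c < N_CORES; ++c) phase2(c, cores[c], x, h, z, r, h_next);
        h = h_next;
    }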
In addition to the AIE-centric approach, we propose a hybrid computing strategy that leverages the strengths of both the AIE and the Versal Programmable Logic (PL) fabric. In this paradigm, computationally intensive algebraic operations involving floating-point arithmetic are executed within the AIE, while the PL is employed for tasks well-suited to hardware acceleration, such as implementing Look-Up Tables (LUTs), control logic, and aggregation functions. This co-design strategy exploits the complementary capabilities of both domains, potentially leading to substantial improvements in inference latency and resource utilization.
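To make the division of labour concrete, the fragment below sketches one candidate PL-side task: a fixed-size lookup table approximating the sigmoid activation, so that the AIE kernels can be reserved for the floating-point matrix arithmetic. The table size, input range and interface are illustrative choices, written here in plain C++ rather than as the actual PL implementation used in this work.

    #include <algorithm>
    #include <array>
    #include <cmath>

    constexpr int   LUT_SIZE = 256;              // hypothetical table depth
    constexpr float LUT_MIN  = -8.0f, LUT_MAX = 8.0f;

    // Table filled once; in hardware this would be a ROM initialised at build/config time.
    std::array<float, LUT_SIZE> make_sigmoid_lut() {
        std::array<float, LUT_SIZE> lut{};
        for (int i = 0; i < LUT_SIZE; ++i) {
            float x = LUT_MIN + (LUT_MAX - LUT_MIN) * i / (LUT_SIZE - 1);
            lut[i] = 1.0f / (1.0f + std::exp(-x));
        }
        return lut;
    }

    // Nearest-entry lookup; beyond [-8, 8] the sigmoid is effectively saturated,
    // so clamping the argument costs little accuracy.
    float sigmoid_lut(float x, const std::array<float, LUT_SIZE>& lut) {
        float t = (std::clamp(x, LUT_MIN, LUT_MAX) - LUT_MIN) / (LUT_MAX - LUT_MIN);
        int idx = static_cast<int>(t * (LUT_SIZE - 1) + 0.5f);
        return lut[idx];
    }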
To validate our approach, we benchmark our GRU implementation using a high-energy physics (HEP) dataset, specifically for the task of top quark jet tagging. Performance is evaluated against a previously published GRU implementation on a conventional FPGA platform. Preliminary results indicate that our solution delivers competitive latency performance. Furthermore, we investigate how performance scales with varying model sizes and parameter configurations, highlighting the distinct resource constraints and computational trade-offs between the PL and AIE domains.
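As a rough scaling guide (the exact feature and hidden widths of the benchmark model are not restated here), a standard GRU layer with input width n_x and hidden width n_h performs approximately

    \mathrm{MACs\ per\ timestep} \approx 3\, n_h (n_x + n_h)

multiply-accumulate operations, i.e. the cost grows quadratically in the hidden dimension. For an assumed n_x = 16 and n_h = 64, this is 3 x 64 x 80 = 15,360 MACs per step, which is the quantity that must be spread over the AIE tiles or mapped onto PL resources.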
This research holds relevance for real-time data processing in HEP experiments, where online inference is critical. One promising application is the integration of RNNs into the readout of the L1 trigger system in CMS, particularly in the context of the 40 MHz scouting system that will pre-filter L1 objects during the High-Luminosity Large Hadron Collider (HL-LHC) era. Additionally, RNNs have demonstrated notable success in particle tracking tasks, further underscoring the utility of efficient neural network accelerators in experimental physics.
Finally, it is important to emphasize the broader implications of this work. By enabling the deployment of trainable models on real-time computing units like AIEs, we open the door to flexible, generalizable algorithms capable of adapting to diverse tasks via retraining, rather than requiring bespoke hardware implementations. This paradigm shift—from rigid, task-specific circuits to adaptive, learning-based solutions—represents a significant step forward in the development of intelligent, high-performance embedded systems.

Author

Michail Sapkas (Universita e INFN, Padova (IT))

Co-authors

Andrea Triossi (Universita e INFN, Padova (IT)), Marco Zanetti (Universita e INFN, Padova (IT))
