8–12 Sept 2025
Hamburg, Germany
Europe/Berlin timezone

Accelerating Deployment of FPGA-based AI in hls4ml with Parallel Synthesis through Model Partitioning

9 Sept 2025, 14:30
20m
ESA M

Oral, Track 1: Computing Technology for Physics Research

Speaker

Dimitrios Danopoulos (CERN)

Description

The increasing reliance on deep learning for high-energy physics applications demands efficient FPGA-based implementations. However, deploying complex neural networks on FPGAs is often constrained by limited hardware resources and prolonged synthesis times. Conventional monolithic implementations suffer from scalability bottlenecks, necessitating the adoption of modular and resource-aware design paradigms. hls4ml, an open-source tool developed to translate machine learning models into FPGA-compatible architectures, has been instrumental in this effort but still faces synthesis bottlenecks for large networks. To address this challenge, we introduce a novel partitioning methodology that integrates seamlessly with hls4ml, allowing users to segment neural networks at predefined layers. This approach facilitates parallel synthesis and enables stepwise optimization, thus improving both scalability and resource efficiency. The partitioned components are systematically reassembled into a unified architecture through an automated workflow leveraging AMD Vivado, ensuring functional correctness while minimizing manual intervention. An automated RTL-level testbench verifies system-wide correctness, eliminating manual validation steps and accelerating deployment. Experimental evaluations on convolutional neural networks, including ResNet20, demonstrate up to a 3.5× reduction in synthesis time, alongside enhanced debugging flexibility, thereby improving FPGA prototyping and deployment.
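The core idea of cutting a network at predefined layers and synthesizing the pieces concurrently can be sketched in a few lines. This is a minimal illustration, not the actual hls4ml API: `partition_model`, `synthesize_partition`, and the layer names are hypothetical stand-ins, and the synthesis call is a placeholder for an independent HLS run per partition.

```python
# Illustrative sketch of partition-and-parallel-synthesis; all names here
# (partition_model, synthesize_partition, layer list) are hypothetical
# stand-ins, not the hls4ml interface itself.
from concurrent.futures import ThreadPoolExecutor

def partition_model(layers, cut_points):
    """Split an ordered list of layers at the given cut indices."""
    bounds = [0] + sorted(cut_points) + [len(layers)]
    return [layers[a:b] for a, b in zip(bounds, bounds[1:])]

def synthesize_partition(partition):
    """Placeholder for one independent HLS synthesis run on a partition."""
    return {"layers": partition, "rtl": "_".join(partition) + ".v"}

layers = ["conv1", "bn1", "relu1", "conv2", "bn2", "relu2", "dense"]
partitions = partition_model(layers, cut_points=[3, 6])

# Because each partition is self-contained, the synthesis runs are
# independent and can proceed in parallel.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(synthesize_partition, partitions))

# The resulting per-partition RTL would then be stitched back into a
# unified design (in the paper's workflow, via an automated Vivado flow).
print([r["rtl"] for r in results])
```

The synthesis-time saving comes from the fact that the wall-clock cost of the parallel step is bounded by the slowest partition rather than the sum over all layers.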

References

NextGen Triggers Technical Workshop at CERN: https://nextgentriggers.web.cern.ch/nextgen-triggers-technical-workshop-kicks-off-at-cern/
Presentation: https://indico.cern.ch/event/1421629/contributions/6136754/

Significance

While existing tools like hls4ml facilitate machine learning model translation to FPGA architectures, they struggle with large networks due to long synthesis times and resource constraints. Our approach enables network partitioning at predefined layers, allowing for parallel synthesis and stepwise optimization, thus significantly improving scalability and resource efficiency. The key contributions are: A) a novel partitioning methodology that allows neural networks to be split at predefined layers, enabling parallel synthesis; B) automated workflow integration with hls4ml and AMD Vivado, reassembling the partitioned components into a unified design with minimal manual intervention; C) automated RTL-level verification, eliminating manual validation. Experimental results showed up to a 3.5× reduction in synthesis time, which represents a significant update to hls4ml, improving FPGA-based deep learning for trigger applications and making AI model deployment more practical for real-time data processing in high-energy physics.

Experiment context

This work focuses on improving hls4ml, a general-purpose, experiment-agnostic tool for FPGA-based deep learning utilized in high-energy physics, already in production with CMS and under evaluation by ATLAS, LHCb, sPHENIX/EIC, and others.
