Speaker
Description
In software-hardware co-design, balancing performance against hardware constraints is critical, especially when deploying FPGAs for real-time scientific applications with hls4ml. Limited resources and stringent latency requirements exacerbate this challenge. Existing frameworks such as AutoQKeras use Bayesian optimization to balance model size, energy, and accuracy, but they are time-consuming, rely on early-stage training, and often yield inaccurate configuration evaluations, requiring significant trial and error. Moreover, the proxy metrics they optimize often fail to reflect actual hardware usage.
In this work, we present ConNAS4ML, a gradient-based, constraint-aware Neural Architecture Search (NAS) framework for hardware-aware optimization within the hls4ml workflow. Our approach incorporates practical hardware resource metrics into the search process and dynamically adapts to different HLS designs, tool versions, and FPGA devices. Unlike AutoQKeras, ConNAS4ML trains and searches simultaneously, requiring only minimal fine-tuning afterward. Users can either explore trade-offs between model performance and hardware usage or apply user-defined hardware constraints to ensure that selected architectures stay within resource limits while maximizing performance. Key contributions include: (1) a user-friendly interface for customizing the search space, hardware metrics, and constraints; (2) deep integration with hls4ml, allowing users to define and experiment with their own HLS synthesis configurations for FPGA; and (3) efficient hardware-aware optimization that explores architectures under hardware constraints in a single-shot manner, avoiding time-consuming trial and error.
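To make the gradient-based, constraint-aware search concrete, the sketch below shows one common way such a search can be made differentiable: relax a discrete choice (here, the filter count of one layer) into a softmax over architecture logits, form an expected resource cost, and add a hinge penalty when that cost exceeds a user-defined budget. All numbers (candidate widths, LUT costs, budget, penalty weight) are illustrative assumptions, not values from ConNAS4ML, and the exact formulation in the framework may differ.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over architecture logits."""
    e = np.exp(a - a.max())
    return e / e.sum()

# Hypothetical candidate filter counts for one conv layer and their
# estimated LUT costs (illustrative numbers only).
widths = np.array([8, 16, 32, 64])
lut_cost = np.array([1200.0, 2300.0, 4500.0, 8800.0])

alpha = np.zeros(4)   # architecture logits, learned jointly with weights
budget = 5000.0       # user-defined LUT constraint (assumed value)
lam = 1e-3            # penalty strength (assumed value)

p = softmax(alpha)
expected_cost = p @ lut_cost                      # differentiable resource estimate
penalty = lam * max(0.0, expected_cost - budget)  # hinge penalty on the budget

# Gradient of the expected cost w.r.t. alpha via the softmax Jacobian;
# this term is what steers the search toward cheaper widths.
grad = p * (lut_cost - expected_cost)
```

With uniform logits the expected cost is the mean of the candidate costs (4200 LUTs here), which sits under the assumed 5000-LUT budget, so the penalty is inactive; as training sharpens the logits toward an expensive width, the penalty gradient pushes back.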
Preliminary results show our approach's effectiveness on two tasks. The first task optimized the filter counts of a 1.8M-parameter CNN for energy reconstruction in calorimeters, achieving a 48.01% parameter reduction along with reductions in LUT (29.73%), FF (31.62%), BRAM (16.06%), and DSP (23.92%) usage, with only a 0.84% increase in MAE after fine-tuning. The second task applied a precision search under various constraints to Jet Tagging classification. Even without fine-tuning, all models stayed within their constraints, with accuracy differences of less than 0.37% from the baseline. Both searches were efficient: the architecture search took 2 GPU hours and the precision search 0.26 GPU hours on a single GPU. This framework can greatly accelerate FPGA deployment in resource-constrained environments, benefiting fields beyond HEP such as edge computing and autonomous systems.