Description
Transformers are increasingly popular in fields such as natural language processing, speech processing, and computer vision. However, their high memory-bandwidth and power requirements make it difficult for contemporary hardware to keep pace with the trend toward ever-larger models. To improve hardware efficiency, increase throughput, and reduce latency, there has been a shift toward implementing Transformer algorithms on FPGAs. Compared to GPUs, FPGAs offer more on-chip Block RAM, which makes it possible to deploy medium-sized models for on-FPGA acceleration. However, as input sequences grow longer, larger on-chip buffers are needed to hold them. This work therefore adopts Flash Attention within a blocked computation-flow architecture to reduce the Block RAM usage and bandwidth requirements of the attention computation. Nonetheless, because Q, K, and V must be kept in Block RAM rather than fetched from HBM, the resulting designs often fail to complete place-and-route. To address this, before the hardware synthesis stage an optimized mixed-precision configuration is derived from a post-training quantized model using a Block RAM estimator combined with simulated annealing. This approach not only shortens the design cycle significantly but also reduces Block RAM utilization by roughly 20% to 40% without substantially affecting accuracy. When the Transformer kernels are implemented on FPGAs with high-level synthesis, power efficiency improves by 61% to 321% compared to other studies.
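A minimal sketch of the blocked-attention idea may help illustrate why the Block RAM demand shrinks: keys and values are consumed one tile at a time with an online softmax, so only a tile-sized buffer and a few per-row accumulators are needed instead of the full attention-score matrix. The NumPy sketch below is illustrative only; the tile size, function name, and array layout are assumptions and do not reflect the paper's actual HLS dataflow.

```python
import numpy as np

def flash_attention_tiled(Q, K, V, tile=64):
    """Blocked attention sketch: K/V are streamed tile by tile, so only a
    tile-sized buffer (not the full sequence) must be held on chip."""
    n, d = Q.shape
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max for numerical stability
    row_sum = np.zeros(n)           # running softmax denominator

    for start in range(0, K.shape[0], tile):
        Kt = K[start:start + tile]          # one K tile
        Vt = V[start:start + tile]          # one V tile
        scores = Q @ Kt.T / np.sqrt(d)      # partial score block

        new_max = np.maximum(row_max, scores.max(axis=1))
        scale = np.exp(row_max - new_max)   # rescale previous accumulators
        p = np.exp(scores - new_max[:, None])

        out = out * scale[:, None] + p @ Vt
        row_sum = row_sum * scale + p.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]
```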
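Likewise, the mixed-precision search can be pictured as a simulated-annealing loop over per-layer bit widths, scored by a Block RAM estimator and constrained by the accuracy of the post-training quantized model. The sketch below is only a schematic: estimate_bram, evaluate_accuracy, the candidate bit widths, and the annealing schedule are hypothetical placeholders, not the estimator or configuration used in the paper.

```python
import math
import random

# Hypothetical stand-ins for the paper's Block RAM estimator and the
# post-training-quantized model's accuracy evaluation.
def estimate_bram(bitwidths):
    return sum(bitwidths) * 4                             # placeholder cost model

def evaluate_accuracy(bitwidths):
    return 1.0 - sum(16 - b for b in bitwidths) * 1e-3    # placeholder accuracy model

def anneal_precision(n_layers, choices=(4, 8, 16), acc_floor=0.98,
                     steps=2000, t0=1.0, cooling=0.995):
    """Search per-layer bit widths that minimize estimated Block RAM
    while keeping accuracy above a floor (simulated annealing)."""
    config = [16] * n_layers                   # start from full precision
    cost = estimate_bram(config)
    best, best_cost = config[:], cost
    temp = t0
    for _ in range(steps):
        cand = config[:]
        cand[random.randrange(n_layers)] = random.choice(choices)  # perturb one layer
        if evaluate_accuracy(cand) < acc_floor:                    # reject harmful configs
            continue
        c = estimate_bram(cand)
        # Accept improvements outright; accept worse configs with a
        # temperature-dependent probability to escape local minima.
        if c < cost or random.random() < math.exp((cost - c) / temp):
            config, cost = cand, c
            if c < best_cost:
                best, best_cost = cand[:], c
        temp *= cooling
    return best, best_cost
```

In such a flow, the accepted configuration would be handed to the high-level synthesis step before compilation, which is where the reduction in design time mentioned above would come from.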