
15–18 Oct 2024
Purdue University
America/Indiana/Indianapolis timezone

[Remote] BRAM-Aware Quantization for Efficient Transformer Inference via Tile-based Architecture on an FPGA

17 Oct 2024, 14:30
15m
Steward Center 306 (Third floor) (Purdue University)

128 Memorial Mall Dr, West Lafayette, IN 47907
Standard 15 min talk (Contributed talks)

Speaker

Ling-Chi Yang (Institute of Electronics in National Yang Ming Chiao Tung University)

Description

Transformers are becoming increasingly popular in fields such as natural language processing, speech processing, and computer vision. However, because of the high memory bandwidth and power requirements of Transformers, contemporary hardware is struggling to keep pace with the trend toward larger models. To improve hardware efficiency, increase throughput, and reduce latency, there has been a shift toward implementing existing Transformer algorithms on FPGAs. Compared to GPUs, FPGAs offer more on-chip Block RAM, allowing more medium-sized models to be deployed and accelerated on FPGAs. However, as input sequences grow longer, larger on-chip buffers are needed to temporarily store these long sequences. This work therefore applies Flash Attention within a blocked computation flow architecture to reduce the Block RAM usage and bandwidth requirements of the attention computation. Nevertheless, because more Block RAM is needed to store Q, K, and V on chip instead of accessing HBM, designs often fail to complete Place & Route. As a solution, before the hardware synthesis and compilation stage, an optimized mixed-precision configuration is derived from post-training quantized models using a Block RAM estimator combined with a simulated annealing method. This approach not only significantly shortens the design period, but also reduces Block RAM utilization by approximately 20% to 40% without substantially affecting accuracy. When Transformer-related algorithms are implemented on FPGAs using high-level synthesis techniques, power efficiency improves by 61% to 321% compared to other studies.
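
The sketch below is a minimal illustration, not the authors' implementation, of the BRAM-aware mixed-precision search described above: a simulated-annealing loop mutates a per-layer bit-width configuration, scores it with a toy Block RAM estimator plus an assumed accuracy penalty, and keeps configurations that shrink BRAM usage without degrading accuracy too much. All names and numbers (LAYERS, BITWIDTH_CHOICES, bram_cost, accuracy_proxy, the cost weights) are hypothetical placeholders.

```python
"""Illustrative sketch of a BRAM-aware mixed-precision search via simulated annealing."""
import math
import random

# Hypothetical per-layer buffer sizes (number of stored elements).
LAYERS = {
    "q_proj": 64 * 64 * 12,
    "k_proj": 64 * 64 * 12,
    "v_proj": 64 * 64 * 12,
    "attn_out": 64 * 64 * 12,
    "ffn_in": 64 * 256 * 12,
    "ffn_out": 256 * 64 * 12,
}
BITWIDTH_CHOICES = [4, 6, 8, 16]   # candidate precisions per layer
BRAM_BITS = 18 * 1024              # one 18 Kb Block RAM on many FPGA families

def bram_cost(config):
    """Toy BRAM estimator: count 18 Kb blocks needed to hold each buffer."""
    return sum(math.ceil(n * bits / BRAM_BITS)
               for (_, n), bits in zip(LAYERS.items(), config))

def accuracy_proxy(config):
    """Assumed accuracy penalty: lower bit-widths cost more 'accuracy points'."""
    return sum((16 - bits) ** 2 for bits in config) / len(config)

def cost(config, bram_weight=1.0, acc_weight=5.0):
    return bram_weight * bram_cost(config) + acc_weight * accuracy_proxy(config)

def simulated_annealing(steps=5000, t0=50.0, cooling=0.999):
    config = [16] * len(LAYERS)          # start from full precision
    best, best_cost = list(config), cost(config)
    t = t0
    for _ in range(steps):
        cand = list(config)
        i = random.randrange(len(cand))
        cand[i] = random.choice(BITWIDTH_CHOICES)   # mutate one layer's precision
        delta = cost(cand) - cost(config)
        # Accept improvements always, worse moves with Boltzmann probability.
        if delta < 0 or random.random() < math.exp(-delta / t):
            config = cand
            if cost(config) < best_cost:
                best, best_cost = list(config), cost(config)
        t *= cooling
    return dict(zip(LAYERS, best)), best_cost

if __name__ == "__main__":
    mixed_precision, score = simulated_annealing()
    print("chosen bit-widths:", mixed_precision)
    print("estimated 18Kb BRAMs:", bram_cost(list(mixed_precision.values())))
```

In the paper's flow the accepted configuration would be produced before hardware synthesis, so the estimator stands in for an actual Place & Route run; the same loop structure applies if the proxy terms are replaced by a calibrated BRAM model and measured post-training-quantization accuracy.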

Primary authors

Bo-Cheng Lai; ChiJui Chen; Ling-Chi Yang (Institute of Electronics in National Yang Ming Chiao Tung University); Scott Hauck; Shih-Chieh Hsu (University of Washington Seattle (US)); Trung Le

Presentation materials