Description
Transformers are increasingly popular in fields such as natural language processing, speech processing, and computer vision. However, their high memory-bandwidth and power requirements make it difficult for contemporary hardware to keep pace with the trend toward ever-larger models. To improve hardware efficiency, increase throughput, and reduce latency, there has been a shift toward implementing Transformer algorithms on FPGAs. Compared to GPUs, FPGAs offer more on-chip Block RAM, which makes it possible to deploy medium-sized models for on-FPGA acceleration. However, as input sequences grow longer, larger on-chip buffers are needed to hold them. This work therefore adopts Flash Attention within a blocked computation-flow architecture to reduce the Block RAM usage and bandwidth requirements of the attention computation. Nonetheless, because Q, K, and V must be kept in Block RAM rather than fetched from HBM, the resulting designs often fail to complete place-and-route. To address this, before the hardware synthesis stage an optimized mixed-precision configuration is derived from a post-training quantized model using a Block RAM estimator combined with simulated annealing. This approach not only shortens the design cycle significantly but also reduces Block RAM utilization by roughly 20% to 40% without substantially affecting accuracy. When the Transformer kernels are implemented on FPGAs with high-level synthesis, power efficiency improves by 61% to 321% compared to other studies.
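A minimal sketch of the blocked-attention idea may help illustrate why the Block RAM demand shrinks: keys and values are consumed one tile at a time with an online softmax, so only a tile-sized buffer and a few per-row accumulators are needed instead of the full attention-score matrix. The NumPy sketch below is illustrative only; the tile size, function name, and array layout are assumptions and do not reflect the paper's actual HLS dataflow.

```python
import numpy as np

def flash_attention_tiled(Q, K, V, tile=64):
    """Blocked attention sketch: K/V are streamed tile by tile, so only a
    tile-sized buffer (not the full sequence) must be held on chip."""
    n, d = Q.shape
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max for numerical stability
    row_sum = np.zeros(n)           # running softmax denominator

    for start in range(0, K.shape[0], tile):
        Kt = K[start:start + tile]          # one K tile
        Vt = V[start:start + tile]          # one V tile
        scores = Q @ Kt.T / np.sqrt(d)      # partial score block

        new_max = np.maximum(row_max, scores.max(axis=1))
        scale = np.exp(row_max - new_max)   # rescale previous accumulators
        p = np.exp(scores - new_max[:, None])

        out = out * scale[:, None] + p @ Vt
        row_sum = row_sum * scale + p.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]
```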
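Likewise, the mixed-precision search can be pictured as a simulated-annealing loop over per-layer bit widths, scored by a Block RAM estimator and constrained by the accuracy of the post-training quantized model. The sketch below is only a schematic: estimate_bram, evaluate_accuracy, the candidate bit widths, and the annealing schedule are hypothetical placeholders, not the estimator or configuration used in the paper.

```python
import math
import random

# Hypothetical stand-ins for the paper's Block RAM estimator and the
# post-training-quantized model's accuracy evaluation.
def estimate_bram(bitwidths):
    return sum(bitwidths) * 4                             # placeholder cost model

def evaluate_accuracy(bitwidths):
    return 1.0 - sum(16 - b for b in bitwidths) * 1e-3    # placeholder accuracy model

def anneal_precision(n_layers, choices=(4, 8, 16), acc_floor=0.98,
                     steps=2000, t0=1.0, cooling=0.995):
    """Search per-layer bit widths that minimize estimated Block RAM
    while keeping accuracy above a floor (simulated annealing)."""
    config = [16] * n_layers                   # start from full precision
    cost = estimate_bram(config)
    best, best_cost = config[:], cost
    temp = t0
    for _ in range(steps):
        cand = config[:]
        cand[random.randrange(n_layers)] = random.choice(choices)  # perturb one layer
        if evaluate_accuracy(cand) < acc_floor:                    # reject harmful configs
            continue
        c = estimate_bram(cand)
        # Accept improvements outright; accept worse configs with a
        # temperature-dependent probability to escape local minima.
        if c < cost or random.random() < math.exp((cost - c) / temp):
            config, cost = cand, c
            if c < best_cost:
                best, best_cost = cand[:], c
        temp *= cooling
    return best, best_cost
```

In such a flow, the accepted configuration would be handed to the high-level synthesis step before compilation, which is where the reduction in design time mentioned above would come from.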