

## **Tutorial on GPU Optimization**

- Ziv Ilan - Solution Architect, NVIDIA
- Sergio Perez - Solution Architect, NVIDIA
- Harshita Seth - Solution Architect, NVIDIA



# Agenda of the tutorial

- Demo of TensorRT + Triton
- Build a TensorRT-LLM engine of Gemma 2B
- Evaluate the engine on MMLU
- Launch the Triton inference server
- Measure the throughput of Triton inference server
- Optional: compare to quantized versions of Gemma 2B

### How to connect to your tutorial instance?

- Create your NVIDIA account at https://learn.nvidia.com/join
- Navigate to https://learn.nvidia.com/dli-event
- Enter the event code: CERN\_XLAB\_SE24
- Click Start; this spins up an NVIDIA A10 32GB cloud instance
- The environment and the model artifacts take 10-15 minutes to load

## Demo: LLaMA 7B with TensorRT-LLM + Triton

Source code is available in our GitHub repo

## Demo Video

*[Video frame: terminal log of a TensorRT-LLM engine build, timestamped 02/23/2024. The log reports TensorRT memory usage, "Engine generation completed", and the engine being serialized to `/engines/1-gpu/llama_bfloat16_tp1_rank0.engine`.]*

## **MMLU Overview**

Academic benchmarks to evaluate LLMs

MMLU (Measuring Massive Multitask Language Understanding) is a benchmark that evaluates large language models across a wide range of tasks and domains. It gives a broad assessment of a model's general knowledge, reasoning, and language understanding.
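MMLU questions are multiple choice with four options, so scoring reduces to comparing a predicted letter against the gold letter. The following is a minimal sketch of that scoring step, not the evaluation script used in the tutorial; the record layout `(subject, predicted, gold)` and the function name `mmlu_accuracy` are assumptions made for illustration.

```python
from collections import defaultdict


def mmlu_accuracy(records):
    """Score multiple-choice answers.

    records: iterable of (subject, predicted_letter, gold_letter) tuples.
    This layout is hypothetical; adapt it to your evaluation harness.
    Returns (per-subject accuracy dict, overall accuracy).
    """
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, gold in records:
        per_subject[subject][0] += int(predicted == gold)
        per_subject[subject][1] += 1

    subject_acc = {s: c / t for s, (c, t) in per_subject.items()}
    total_correct = sum(c for c, _ in per_subject.values())
    total = sum(t for _, t in per_subject.values())
    return subject_acc, total_correct / total


# Toy data: two subjects, four questions.
records = [
    ("astronomy", "A", "A"),
    ("astronomy", "B", "C"),
    ("college_physics", "D", "D"),
    ("college_physics", "C", "C"),
]
subject_acc, overall = mmlu_accuracy(records)
print(subject_acc, overall)
```

Reporting per-subject accuracy alongside the overall number makes it easier to see whether a quantized engine loses accuracy uniformly or only in specific domains.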

## Quantization

#### How to Choose a Precision

- Best precision varies by application
  - FP8 activations generally provide the best performance
- Weight quantization reduces memory footprint & traffic
  - Reduces latency
  - Can fit larger models
  - Costs compute time to unpack the weights
- Activation quantization saves on compute
  - Improves throughput
  - Can run larger batch sizes
- WXAY = weights quantized to X bits, and activations to Y
- Quantization Guide

| Method                        | Performance improvement<br>small batch<br>BS <= 4 | Performance improvement<br>large batch<br>BS >= 16 | Accuracy impact | Calibration time |
|-------------------------------|--------------------------------------|--------------------------------------|----------------------------------|----------|
| <b>FP8</b><br>(W8A8)          | Medium                               | Medium                               | Very low / None                  | O(1min)  |
| INT8 SQ<br>(W8A8)             | Medium                               | Medium                               | Medium                           | O(1min)  |
| INT8 WO<br>(W8A16)            | Medium                               | None                                 | Low                              | None     |
| <b>INT4 WO</b><br>(W4A16)     | High                                 | None                                 | High                             | None     |
| <b>INT4 AWQ</b><br>(W4A16)    | High                                 | None                                 | Low                              | O(10min) |
| INT4 GPTQ<br>(W4A16)          | High                                 | None                                 | Low                              | O(10min) |
| <b>INT4-FP8 AWQ</b><br>(W4A8) | High                                 | Medium                               | Low                              | O(10min) |

SQ = SmoothQuant; WO = Weight Only; AWQ = Activation-aware Weight Quantization
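To make the WXAY notation and the "can fit larger models" point concrete, here is a rough sketch that parses a WXAY tag and estimates the memory taken by the weights alone. The parameter count of 2 billion is an assumed round number, not a measured figure for any model in this tutorial, and the estimate ignores quantization scales, zero points, activations, and the KV cache.

```python
import re


def parse_wxay(tag):
    """Split a tag such as 'W4A16' into (weight_bits, activation_bits)."""
    match = re.fullmatch(r"W(\d+)A(\d+)", tag)
    if match is None:
        raise ValueError(f"not a WXAY tag: {tag!r}")
    return int(match.group(1)), int(match.group(2))


def weight_gib(num_params, weight_bits):
    """Approximate weight storage in GiB (scales and zero points ignored)."""
    return num_params * weight_bits / 8 / 2**30


NUM_PARAMS = 2_000_000_000  # assumption: a hypothetical 2B-parameter model

for tag in ("W16A16", "W8A8", "W8A16", "W4A16", "W4A8"):
    w_bits, a_bits = parse_wxay(tag)
    print(f"{tag}: weights ~{weight_gib(NUM_PARAMS, w_bits):.2f} GiB, "
          f"activations in {a_bits}-bit")
```

Halving the weight bits halves both the footprint and the bytes read from memory per generated token, which is why weight-only methods help most at small batch sizes.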

## Wrapping up: Trends in model compression

## Distilling the Knowledge of LLMs into SLMs

Train only the largest LLM and get smaller models with similar quality

## How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model

Aug 14, 2024


By Sharath Sreenivas, Vinh Nguyen, Saurav Muralidharan, Marcin Chochowski and Raviraj Joshi





## **FP4 Format Supported in Blackwell Platform**

New FP4 format for inference


## NVIDIA Blackwell Platform Sets New LLM Inference Records in MLPerf Inference v4.1

Aug 28, 2024


By Ashraf Eassa, Ashwin Nanjappa, Zhihan Jiang, Yiheng Zhang, Jun Yang, Zihao Kong and Shengliang Xu



See blog https://developer.nvidia.com/blog/nvidia-blackwell-platform-sets-new-llm-inference-records-in-mlperf-inference-v4-1/

## The GPU Journey Continues: Stay Ahead of the Curve and Keep Innovating

Take your next steps on one of the following platforms



https://learn.nvidia.com/



## Thank you!

- Ziv Ilan - Solution Architect, NVIDIA
- Sergio Perez - Solution Architect, NVIDIA
- Harshita Seth - Solution Architect, NVIDIA

Extra slides about TensorRT features

## LAYER & TENSOR FUSION

Optimizes use of GPU memory and bandwidth by fusing nodes in a kernel

- Combines successive nodes into a single node, so they execute as a single kernel
- Significantly reduces number of layers to compute, resulting in faster performance
- Eliminates unnecessary memory traffic by removing concat/slice layers
- See the <u>supported fusion list</u>



## **KERNEL AUTO-TUNING**

Selects best data layers and algorithms based on the target GPU platform

- Hundreds of specialized kernels optimized for every GPU Platform
- The TensorRT optimizer uses a runtime profile to select the best-performing kernels
- Ensures best performance for specific deployment platform and specific neural network



## **DYNAMIC TENSOR MEMORY**

Minimizes memory footprint and reuses memory for tensors efficiently

- Reduces memory footprint and improves memory re-use
- Graph optimizer combines tensors into regions
- Region lifetime is a section of network execution time
- Memory Optimizer assigns regions to blocks; regions assigned to a block have disjoint lifetimes
- Just like register allocation
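The idea of packing regions with disjoint lifetimes into shared blocks can be sketched with a simple greedy pass. This is an illustration of the concept only and not TensorRT's actual algorithm; the region tuples, the half-open lifetime convention `[start, end)`, and the function name `assign_blocks` are all assumptions.

```python
def assign_blocks(regions):
    """Greedily assign regions to blocks so that regions sharing a block
    never overlap in time.

    regions: list of (name, start, end, size_bytes); lifetime is [start, end).
    Returns (assignment dict name -> block index, list of block records).
    """
    blocks = []       # each block: {"free_at", "size", "regions"}
    assignment = {}
    for name, start, end, size in sorted(regions, key=lambda r: r[1]):
        for idx, block in enumerate(blocks):
            if block["free_at"] <= start:  # previous tenant is already dead
                block["free_at"] = end
                block["size"] = max(block["size"], size)
                block["regions"].append(name)
                assignment[name] = idx
                break
        else:
            # No existing block is free at this point: open a new one.
            blocks.append({"free_at": end, "size": size, "regions": [name]})
            assignment[name] = len(blocks) - 1
    return assignment, blocks


# Four activation regions with overlapping lifetimes (toy numbers).
regions = [
    ("act0", 0, 2, 4096),
    ("act1", 1, 3, 8192),
    ("act2", 2, 4, 2048),
    ("act3", 3, 5, 8192),
]
assignment, blocks = assign_blocks(regions)
naive_bytes = sum(size for _, _, _, size in regions)
packed_bytes = sum(block["size"] for block in blocks)
print(assignment, naive_bytes, packed_bytes)
```

In this toy case two blocks serve four regions, because `act2` can reuse the memory of `act0` and `act3` can reuse the memory of `act1`.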






## TIME FUSION

Optimizes recurrent neural networks over time steps with dynamically generated kernels

- Recurrent Neural Network Optimizations
- Deploy highly optimized ASR and TTS
- Compiler fuses pointwise ops, fuses GEMMs, and computes efficiently across time steps



## **QUANTIZATION AWARE TRAINING**

Improved accuracy for INT8 inference

- Better accuracy compared to Post Training Quantization (PTQ)
- Quantize state of the art models with minimal loss of accuracy
- TensorRT optimizes the Q/DQ graph for inference without compromising performance
- Quantization Toolkit available for PyTorch and TensorFlow in OSS supporting QAT, PTQ and export to ONNX
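The Q/DQ (quantize/dequantize) nodes mentioned above can be illustrated with a scalar "fake quantization" function: the value is rounded to an INT8 grid and immediately mapped back to floating point, so training sees the rounding error. This is a conceptual sketch with an assumed symmetric per-tensor scale; it does not use the Quantization Toolkit API.

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Quantize x to a signed 8-bit grid, then dequantize it again.

    scale is an assumed symmetric per-tensor scale (real value per step).
    """
    q = round(x / scale)             # Q: map to the nearest integer level
    q = max(qmin, min(qmax, q))      # clamp to the INT8 range
    return q * scale                 # DQ: map back to a real value


scale = 0.25
for value in (1.1, 100.0, -100.0):
    print(value, "->", fake_quantize(value, scale))
```

Values inside the representable range pick up a rounding error of at most half a step, while values outside it saturate at the range limits, which is the behavior QAT lets the model adapt to.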



## **LoRA & Customization**

#### Efficiently Supporting Customer User Experience

- LoRA & prompt-tuned models are supported in TRT-LLM
- Support multiple customers with a single model
- Dynamically swap LoRAs at runtime
- S-LoRA / LoRAX-style caching of adapters on device
- Base model can be quantized for memory savings
  - QLoRA in progress



User Specific LoRAs

Dynamically Swap LoRAs based on User
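Serving several users from one base model works because a LoRA adapter only adds a low-rank correction to the shared weights: y = Wx + s·B(Ax). The sketch below shows that arithmetic with per-user adapter lookup in plain Python. It is not the TensorRT-LLM API; the adapter registry, the user names, and the `serve` function are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(base_weight, adapter, x):
    """Compute W x, plus the low-rank update s * B (A x) if an adapter is given."""
    base = matvec(base_weight, x)
    if adapter is None:
        return base
    a, b, scaling = adapter
    delta = matvec(b, matvec(a, x))
    return [y + scaling * d for y, d in zip(base, delta)]


# Shared base weight (2x2) and two hypothetical rank-1 adapters.
W = [[1.0, 0.0], [0.0, 1.0]]
ADAPTERS = {
    "user_a": ([[1.0, 1.0]], [[1.0], [0.0]], 0.5),   # A is 1x2, B is 2x1
    "user_b": ([[1.0, -1.0]], [[0.0], [2.0]], 1.0),
}


def serve(user, x):
    """Pick the adapter for this user at request time; fall back to the base model."""
    return lora_forward(W, ADAPTERS.get(user), x)


x = [2.0, 4.0]
print(serve("user_a", x), serve("user_b", x), serve("unknown", x))
```

Because the base weights are never modified, swapping adapters between requests only changes which small A and B matrices are read.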

## **KV Cache & Attention Techniques**

(Sliding) Window Attention, & Streaming LLM

- Allow for longer (sometimes unlimited) sequence length
  - Reduces KV Cache Memory usage
  - Avoids OOM Errors
- (Sliding) Window Attention evicts tokens based on arrival order
  - Significantly reduces memory usage
  - Can negatively impact accuracy or require recomputing KV
- <u>Streaming-LLM</u> allows for unlimited sequence length
  - Does not evict Attention Sinks (important elements)
  - KV Cache stays constant size
  - Does not require recompute & does not impact accuracy
  - Particularly beneficial for multi-turn (i.e. chat) use cases
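The eviction policy described for Streaming-LLM, keeping the first few "attention sink" tokens plus a sliding window of recent tokens, can be sketched as a small data structure. The class below tracks token positions only and stands in for real key/value tensors; the class name and its parameters are assumptions for illustration.

```python
from collections import deque


class SinkWindowCache:
    """Keep the first `num_sinks` tokens forever plus the last `window` tokens."""

    def __init__(self, num_sinks, window):
        self.num_sinks = num_sinks
        self.sinks = []                       # never evicted
        self.recent = deque(maxlen=window)    # oldest entry dropped automatically

    def append(self, token_position):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token_position)
        else:
            self.recent.append(token_position)

    def tokens(self):
        return self.sinks + list(self.recent)


cache = SinkWindowCache(num_sinks=2, window=3)
for position in range(10):
    cache.append(position)
print(cache.tokens())
```

The cache never holds more than `num_sinks + window` entries, which is what keeps KV memory constant regardless of sequence length.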

Attention KV Cache Usage (Less is Better)




### **KV Cache Reuse**

### System Prompt Caching & Block Reuse

Allows for interactive/turn-based systems & system prompts

- Load prior KV cache blocks to avoid recomputation
  - Saves significant compute
  - Reduces start-up time
- Block reuse allows for turn-based (chat) applications
  - Allows for additional options for intelligently reusing blocks
- System prompts allow for a preset KV cache for the LLM
  - E.g. to give rules, personality, or prior knowledge
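One way to picture block reuse is a cache keyed by the token prefix that each block covers: a block's keys and values depend on every token before it, so two requests can share a block only if they share the whole prefix. The sketch below counts reused versus computed blocks. It is a conceptual model, not the TensorRT-LLM implementation; the block size, class name, and placeholder "KV" payload are assumptions.

```python
BLOCK_SIZE = 4  # tokens per KV block (assumed value for the example)


class PrefixBlockCache:
    """Reuse KV blocks across requests that share a token prefix."""

    def __init__(self):
        self.blocks = {}    # prefix tuple -> placeholder for a KV block
        self.computed = 0   # how many blocks had to be computed

    def prefill(self, tokens):
        """Process a prompt; return the number of blocks served from cache."""
        reused = 0
        full_len = len(tokens) // BLOCK_SIZE * BLOCK_SIZE  # skip partial tail
        for end in range(BLOCK_SIZE, full_len + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])  # a block depends on its entire prefix
            if key in self.blocks:
                reused += 1
            else:
                self.blocks[key] = ("kv", key[-BLOCK_SIZE:])  # stand-in payload
                self.computed += 1
        return reused


cache = PrefixBlockCache()
system_prompt = list(range(8))                      # two full blocks
first = cache.prefill(system_prompt + [100, 101, 102, 103])
second = cache.prefill(system_prompt + [200, 201, 202, 203])
print(first, second, cache.computed)
```

The second request reuses both system-prompt blocks and only computes the block for its own user turn.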



## **Inflight Batching**

Maximizing GPU Utilization during LLM Serving

TensorRT-LLM provides custom Inflight Batching to optimize GPU utilization during LLM Serving

- Replaces completed requests in the batch
  - Evicts requests after EoS & inserts a new request
- Improves throughput, time to first token, & GPU utilization
- Integrated directly into the TensorRT-LLM Triton backend
- Accessible through the TensorRT-LLM Batch Manager
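The benefit of replacing finished requests immediately can be seen in a toy simulation that counts decoding steps. Each request is reduced to the number of tokens it will generate, and one step produces one token for every active request. This is a simplified model (no prefill cost, fixed batch size) and both function names are made up for the example.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def inflight_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next pending request."""
    pending = list(lengths)
    active = []          # remaining tokens for each running request
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1       # one decoding step for all active requests
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Output lengths (in tokens) of six requests, served with two batch slots.
lengths = [2, 8, 3, 8, 2, 3]
static_steps = static_batching_steps(lengths, batch_size=2)
inflight_steps = inflight_batching_steps(lengths, batch_size=2)
print(static_steps, inflight_steps)
```

With static batching, short requests leave their slot idle until the longest request in the batch ends; with inflight batching every slot stays busy, so the same work completes in fewer steps.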



Static Batching



Inflight Batching

## **KV Cache Optimizations**

Paged & Quantized KV Cache

Paged KV Cache improves memory consumption & utilization

- Stores keys & values in non-contiguous memory space
- Allows for reduced memory consumption of KV cache
- Allocates memory on demand

Quantized KV Cache improves memory consumption & perf

- Reduces KV Cache elements from 16b to 8b (or less!)
- Reduces memory transfer improving performance
- Supports INT8 / FP8 KV Caches

Both allow for increased peak performance
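Both ideas can be sketched in a few lines: a block table that hands out fixed-size pages on demand, and a per-token size formula showing what going from 16 to 8 bits saves. This is an illustrative model, not TensorRT-LLM code; the class, the page size, and the model dimensions used in the example are assumptions.

```python
class PagedKVCache:
    """Allocate fixed-size KV blocks on demand from a shared pool."""

    def __init__(self, num_blocks, block_tokens):
        self.free = list(range(num_blocks))   # ids of unused blocks
        self.block_tokens = block_tokens
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        length = self.lengths.get(seq_id, 0)
        if length % self.block_tokens == 0:   # current block is full (or none yet)
            if not self.free:
                raise MemoryError("KV cache pool exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits):
    """Keys and values for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bits // 8


pool = PagedKVCache(num_blocks=4, block_tokens=16)
for _ in range(17):
    pool.append_token("a")    # 17 tokens need two 16-token blocks
pool.append_token("b")        # a new sequence takes one block
blocks_a = len(pool.tables["a"])
free_before = len(pool.free)
pool.release("a")
free_after = len(pool.free)

# Assumed dimensions: 32 layers, 32 KV heads, head size 128.
fp16_bytes = kv_bytes_per_token(32, 32, 128, bits=16)
int8_bytes = kv_bytes_per_token(32, 32, 128, bits=8)
print(blocks_a, free_before, free_after, fp16_bytes, int8_bytes)
```

Because blocks are allocated only as tokens arrive, a short sequence never reserves memory for its maximum possible length, and halving the element width halves the bytes stored and moved per token.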

#### KV Cache Contents: TensorRT-LLM optimizes inference on NVIDIA GPUs ...



Traditional KV Caching



Paged KV Cache



#### **Quantized Paged KV Cache**



## **Multi-Modal Support**

#### Current support & adding more

- TensorRT-LLM supports BLIP, LLaVA, & Nougat VLMs
  - Including many derivatives of these models
- Utilizes TensorRT & TensorRT-LLM
  - Vision encoder in TensorRT
    - Standard ONNX export path to TRT
  - LLM running in TensorRT-LLM
  - Output of Vision encoder passed to TensorRT-LLM
- Any model similar to the supported ones can be added
  - Replace vision encoder or LLM with appropriate model
  - See <u>examples/multimodal</u>

#### Multi-Modal

This document shows how to run multimodal pipelines with TensorRT-LLM, e.g. from image+text input modalities to text output.

The LLM part of a multimodal model takes an additional parameter, --max\_multimodal\_len, compared to LLM-only build commands. Under the hood, max\_multimodal\_len and max\_prompt\_embedding\_table\_size are effectively the same concept, i.e., embeddings (either multimodal feature embeddings or prompt tuning embeddings) prepended/concatenated to the LLM input embeddings. The multimodal features from the visual encoder, of shape [batch\_size, num\_visual\_features, visual\_hidden\_dim], are flattened to [batch\_size \* num\_visual\_features, visual\_hidden\_dim] and passed like a prompt embedding table.

We first describe how to run each model on a single GPU. We then provide general guidelines on using tensor parallelism for the LLM part of the pipeline.

- BLIP2-OPT
- LLaVA and VILA
- Nougat
- Enabling tensor parallelism for multi-GPU

#### BLIP2-T5

1. Download Hugging Face weights and convert the original checkpoint to the TRT-LLM checkpoint format, following the example in examples/enc\_dec/README.md.

```
export MODEL_NAME=flan-t5-xl
```

2. Build TRT-LLM engine from TRT-LLM checkpoint

```
python ../enc_dec/build.py --model_type t5 \
--weight_dir tmp/trt_models/${MODEL_NAME}/tp1 \
--output_dir trt_engines/${MODEL_NAME}/1-gpu \
--remove_input_padding \
--use_bert_attention_plugin \
--use_gemm_plugin \
--dtype bfloat16 \
--max_batch_size 8 \
--max_encoder_input_len 924 \
--max_output_len 100 \
--max_multimodal_len 256 # 8 (max_batch_size) * 32 (num_visual_features)
```

NOTE: max\_multimodal\_len = max\_batch\_size \* num\_visual\_features, so if you change max\_batch\_size, max\_multimodal\_len MUST be changed accordingly.

The built T5 engines are located in ./trt\_engines/\${MODEL\_NAME}/1-gpu/bfloat16/tp1

3. Build TensorRT engines for visual components

## **Optimized Attention**

Custom Implementations for Attention

- Custom optimized CUDA kernels for Attention
  - Similar to FlashAttentionV2
- Optimized for A100 & H100
- Kernels for Encoder & Decoder, as well as context & prefill
- Supports MHA, MQA, GQA









## Multi-GPU Multi-Node

Sharding Models across GPUs

- Supports Tensor & Pipeline parallelism
- Allows for running very large models (tested up to 530B)
- Supports multi-GPU (single node) & multi-node
- TensorRT-LLM handles communication between GPUs
- Examples are parametrized for sharding across GPUs
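Tensor parallelism splits individual weight matrices across GPUs. A common form, shown below in plain Python, splits a weight matrix by columns: each device multiplies the full input by its own shard, and the partial outputs are concatenated. This is a single-process illustration with lists standing in for GPUs; the helper names are assumptions and the column count must divide evenly by the number of shards.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def split_columns(weight, parts):
    """Split a weight matrix into `parts` equal column shards."""
    step = len(weight[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in weight]
            for i in range(parts)]


def tensor_parallel_matmul(x, weight, parts):
    shards = split_columns(weight, parts)
    # Each "device" computes its slice of the output independently.
    partials = [matmul(x, shard) for shard in shards]
    # Gather step: concatenate the slices along the column dimension.
    return [sum((partial[r] for partial in partials), [])
            for r in range(len(x))]


x = [[1, 2], [3, 4]]
weight = [[1, 0, 2, 1], [0, 1, 1, 3]]
reference = matmul(x, weight)
sharded = tensor_parallel_matmul(x, weight, parts=2)
print(reference, sharded)
```

Pipeline parallelism, by contrast, keeps each layer whole and assigns different layers to different GPUs, so activations rather than partial results are passed between devices.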









No Parallelism

**Tensor Parallel** 

**Pipeline Parallel**