

### **Programming for GPUs** Daniel Cámpora | Senior Al Devtech Engineer, NVIDIA



### The CUDA Programming Model

- Host, device and memory
- Writing a kernel
- GPU architecture
- Common data parallel techniques
- Summary

### **Table of Contents**

Daniel Cámpora – dcampora@nvidia.com



- CUDA stands for Compute Unified Device Architecture.
- It is a programming model introduced in 2006 by NVIDIA as a set of extensions to the C programming language.
- Nowadays, it works with a variety of languages: Python, C, C++, Fortran, etc.
- It allows GPUs to be used for general purpose computing (also referred to as **GPGPU** or **GPU** computing).
- Since its inception, other standards have emerged, such as OpenCL, ROCm or SYCL, to name a few. We will discuss them in depth in lecture Design patterns and best practices.

## What Is CUDA?







## **A CPU Application**

Daniel Cámpora – dcampora@nvidia.com





## A GPU Application

Daniel Cámpora – dcampora@nvidia.com

### validate()



- Threads are organized in **blocks of threads**.

Thread

### **Parallel Processors**

# GPUs are parallel processors that can execute many threads in parallel.







### A function in CUDA, also called a kernel, is invoked with a configurable grid of blocks, each with the same number of threads.

## The Kernel





- The **solve** kernel runs blocks and threads in parallel.
- According to Amdahl's law, it's the best target to parallelize in this example.
- Let's inspect it.

## **Kernel Execution Example**





- solve is invoked with a configuration of 2 blocks and 4 threads per block.
- Each block runs independently from one another.
- Each thread runs independently from one another.
- All blocks in a grid must have the same number of threads.

### A Day in a Kernel's Life



- Inside our kernel execution, indices identify each individual thread.
- gridDim.x is the number of blocks in the grid, in this case 2.
- blockIdx.x identifies the current block within the grid.
- blockDim.x refers to the number of threads in a block, in this case 4.
- threadIdx.x identifies the current thread within the block.
- kernel in the grid.

## Indices

• The formula **blockIdx.x** \* **blockDim.x** + **threadIdx.x** uniquely identifies threads in our





### GPU DATA



## **Coordinating Parallel Threads**

Daniel Cámpora – dcampora@nvidia.com

| * | blockDim.x | + | threadIdx.x |
|---|------------|---|-------------|
|   | 4          |   | 3           |





### GPU DATA



## **Coordinating Parallel Threads**

Daniel Cámpora – dcampora@nvidia.com

| * | blockDim.x | + | threadIdx.x |
|---|------------|---|-------------|
|   | 4          |   | 3           |



- Grid and block size can be configured with 3 dimensions.
- The 3 dimensions allow for simplicity when assigning tasks to each block / thread.
  - Eq. a configuration of {32, 4, 4} threads will generate 512 threads in total.

### There is a maximum of 1024 threads per block

- The multiplication of the three dimensions should not exceed the maximum.
- Dimensions can be accessed with members x, y and z:
  - gridDim.x, gridDim.y, gridDim.z
  - blockDim.x, blockDim.y, blockDim.z
  - blockIdx.x, blockIdx.y, blockIdx.z
  - threadIdx.x, threadIdx.y, threadIdx.z

## **Configuration Goodies**



- The CUDA Programming Model
- Host, device and memory
- Writing a kernel
- GPU architecture
- Common data parallel techniques
- Summary

### **Table of Contents**

Daniel Cámpora – dcampora@nvidia.com



- A GPU-accelerated application runs like any other OS application. • It requires a CPU, which launches the application and acts as the **host**. The host can offload some of the work onto the GPU, which acts as the device. The host is in charge of the application at all times.

- It can kill the application, react to interrupts, etc.



## Host and Device



Get results

### Device



Function qualifiers:

- A function that is invoked by the **host**, and runs on the **host**. \_\_host\_\_ • \_\_device\_\_ A function that is invoked by the **device**, and runs on the **device**. • \_\_global\_\_ A function that is invoked by the **host or device**, and runs on the **device**.



## **CUDA Syntax**

\_\_global\_\_ functions always execute asynchronously.



### The GPU has three main kinds of memory:

- Global memory High latency, GBs of space.
- Caches Lower latency.
  - L2 cache MBs of space.
  - L1 cache KBs of space.
- **Registers** Lowest latency.
  - A configurable 64-255 registers available per GPU core.

### Memory Hierarchy



### L1 cache

Reg.

### L2 cache

### Global memory



- The host can only access device global memory.
- our kernel.



### Host – Device Communication

• Therefore, all **input data** to the GPU needs to be populated on **global memory** prior to starting

• All output data needs to be put on global memory before the kernel ends executing.

Offload computation

Get results

Daniel Cámpora – dcampora@nvidia.com

### Device



## Host – Device Communication (2)

- The host can only access device global memory.
- our kernel.



• Therefore, all input data to the GPU needs to be populated on global memory prior to starting

• All output data needs to be put on global memory before the kernel ends executing.





Global memory



### GPUs offer a low-level optimization that is not available on CPUs.

### • A part of L1 cache can be configured as shared memory.

Memory used for shared memory will not be used for L1 (tradeoff).

## Shared memory



Daniel Cámpora – dcampora@nvidia.com



- Shared memory is:
  - Fast memory you have direct control over.
  - Limited in size.
  - Shared among the block.
  - It can only be accessed from the device.
- Shared memory size is limited:

## Shared Memory (2)

Its contents are flushed after the \_\_global\_\_ function terminates.

• On the Tesla T4 that you will use in the tutorials, its processors (SMs) allow a maximum of **48 KBs**. • Generally speaking, it is configurable and its maximum size depends on the architecture.

• We'll see an example in lecture Performant Programming for GPUs.





## Memory Overview





## Memory Schema





Daniel Cámpora – dcampora@nvidia.com

local memory

### We will see this in the next lecture.

shared memory

global memory

constant memory

texture memory

You can ignore constant and texture memory.



- The CUDA Programming Model
- Host, device and memory

### Writing a kernel

- GPU architecture
- Common data parallel techniques
- Summary

### **Table of Contents**

Daniel Cámpora – dcampora@nvidia.com



### Let's write a vector addition kernel.

### • **A**, **B**, and **C** are arrays of float of size **N**.

### **Vector Addition**



• Compute to global memory access ratio (arithmetic intensity in FLOPs / Bytes):  $10 \, FLOPs / 120 \, Bytes = 0.08$ 



- Recall the Roofline model.
  The peak performance on the T4 is: 8 TFLOPS
- Our peak performance on the T4 is:
   0.32 TBps \* 10 FLOPS / 120 B = 0.027 TFLOPS
- What is the main limitation of this code?
   We are heavily memory bound, and we can obtain at best 0.027 / 8 = 0.0034 = 0.34% of the performance the T4 has to offer with a vector addition.

### Vector Addition





- Function attributes:
- threadIdx.x, blockIdx.x Indices:
- gridDim.x, blockDim.x • Dimensions:
- Kernels (\_\_global\_\_ functions) invocation must specify grid and block dimensions as follows: • fn<<<grid\_dim, block\_dim>>>(arg0, arg1, ...);

### **CUDA Syntax Reminder**

\_\_global\_\_, \_\_device\_\_, \_\_host\_\_



## Vector Addition Parallelized Across Single Block of Threads

This function is marked \_\_global\_\_, it can be invoked on the host and it will be executed on the device.



vector\_addition is invoked from the host. It will run asynchronously and non-blockingly. Control is returned immediately to the host.

The kernel will be invoked with a grid dimension of 1, and a block dimension of N. In other words, a single block of N threads.

Daniel Cámpora – dcampora@nvidia.com

The value of threadIdx.x will be different for each thread in the block.



## accesses.

## Flexibility

• The prior kernel assumes the block dimension to be N.

\_\_global\_\_ void vector\_addition(float\* A, float\* B, float\* C) { unsigned i = threadIdx.x; C[i] = A[i] + B[i];

Invoking the kernel with any other block dimension leads to incorrect results or out of bounds

Daniel Cámpora – dcampora@nvidia.com



- A common practice is to make loops involving threadIdx.x to be block dimension-strided. Now, invoking the kernel with any number of threads will give a correct result.



## **Block-dimension Strided Loops**

\_global\_\_ void vector\_addition(float\* A, float\* B, float\* C) { for (unsigned i = threadIdx.x; i < N; i += blockDim.x) {</pre> C[i] = A[i] + B[i];

vector\_addition<<<1, n>>>(A, B, C);

Any number of threads will yield the same result.



- We are so far using a single block. We could do better!
- Splitting the work across several blocks will ensure the GPU is better utilized for the task.

A good size for block dimension is 256.

## **Moving to Multiple Blocks**

| A | 0 | 1 | 2        | 3  | 4  | 5 | 6  | 7  | 8  |
|---|---|---|----------|----|----|---|----|----|----|
|   |   |   | <b>+</b> |    |    |   |    |    |    |
| B | 8 | 2 | 1        | 9  | 7  | 3 | 5  | 3  | 2  |
|   |   |   |          |    |    |   |    |    |    |
| С | 8 | 3 | 3        | 12 | 11 | 8 | 11 | 10 | 10 |
|   |   |   |          |    |    |   |    |    |    |

• If it's too few threads, the processors will be underutilized.











## Vector Addition Parallelized Across Several Blocks

We now iterate through all threads across all blocks, evenly assigning work.



Grid and block dimensions are configurable at runtime.

### \_\_global\_\_ void vector\_addition(float\* A, float\* B, float\* C) { for (unsigned i = blockIdx.x \* blockDim.x + threadIdx.x; i < N; i += gridDim.x \* blockDim.x)</pre>

Daniel Cámpora – dcampora@nvidia.com

Stride is adjusted to account for blocks.





## What About the Missing Sections?

- - 1. Allocate memory on the GPU.
  - 2. Populate inputs.
  - 3. Invoke kernel.
  - 4. Synchronize with kernel completion.
  - 5. Read outputs.



# • We also need to perform data preparation, synchronization and data retrieval:



Daniel Cámpora – dcampora@nvidia.com





### • There are several manners to control memory.

- Unified memory allows for a more high-level API where host device copies occur behind the scenes. • A lower-level API allows for explicit allocation, deallocation and copying.
- We will use the latter:
  - It assumes less from the user.
  - Allocations and copies are slow processes.
    - Unified memory requires doing prefetching for finer control, harder to get right.
  - They are fully translatable to other languages.

## Data Handling Syntax





5. Read outputs.

## **Data Handling Example**

- The CUDA Programming Model
- Host, device and memory
- Writing a kernel
- GPU architecture
- Common data parallel techniques
- Summary

### **Table of Contents**

Daniel Cámpora – dcampora@nvidia.com



### GPUs are made of processors known as **Streaming Multiprocessors (SMs)**.

- Each SM contains:
  - A small control unit.
  - Many arithmetic units.
  - L1 cache and register memory (more on this later).

## The Streaming Multiprocessor





## The Streaming Multiprocessor (2)

- The heavy lifting is done by CUDA cores:
  - INT32, FP32, FP64 units and SFUs.
- Tensor cores are processors specialized for AI.
  - They allow faster matrix multiplications + additions.
  - They can also be used with CUDA.



y CUDA cores: nd SFUs.



L0 Instruction Cache

Warp Scheduler (32 thread/clk)

Dispatch Unit (32 thread/clk)

### Register File (16,384 x 32-bit)

| Г32                    | FP32  | FP32  | FP64    |                            |
|------------------------|-------|-------|---------|----------------------------|
| Г32                    | FP32  | FP32  | FP64    |                            |
| <b>F32</b>             | FP32  | FP32  | FP64    |                            |
| <b>F32</b>             | FP32  | FP32  | FP64    |                            |
| <b>F32</b>             | FP32  | FP32  | FP64    |                            |
| F32                    | FP32  | FP32  | FP64    |                            |
| F32                    | FP32  | FP32  | FP64    |                            |
| Г32                    | FP32  | FP32  | FP64    | TENSOR CORE                |
| <b>F32</b>             | FP32  | FP32  | FP64    | 4 <sup>th</sup> GENERATION |
| <b>F32</b>             | FP32  | FP32  | FP64    |                            |
| 1/2                    | FP32  | FP32  | FP64    |                            |
| <b>F32</b>             | FP32  | FP32  | FP64    |                            |
| <b>F32</b>             | FP32  | FP32  | FP64    |                            |
| Г32                    | FP32  | FP32  | FP64    |                            |
| <b>F32</b>             | FP32  | FP32  | FP64    |                            |
| Г32                    | FP32  | FP32  | FP64    |                            |
| N                      | LD/ L | D/ LD | LDI LDI |                            |
| T ST ST ST ST ST ST ST |       |       |         |                            |
|                        |       |       |         |                            |



### Blocks of threads are scheduled to run on SMs.



## **Transparent Scaling**





# than one block may be scheduled on a SM.



## **Transparent Scaling (2)**

• Depending on the number of SMs, their capabilities, and the requirements of each block, more







How relevant this is depends on your use-case.



## **Transparent Scaling (3)**

 Invocation configuration that are not multiple of number of SMs lead to a wave quantization effect: we don't use the GPU to its fullest some of the time.





- We will use Tesla T4s for the tutorials.
  - Tesla T4s have **40 SMs**.
  - Tesla V100s have 80 SMs.
  - GH200s have **132 SMs**.

## Saturating the GPU

• During a kernel invocation, each block of threads is executed *preferably* on a separate SM. • Hence, a kernel with at least 40 blocks would use all SMs of the Tesla T4.



## **Saturating the GPU – Slightly More Detail**

- There are three parameters that determine how many blocks can be scheduled in parallel: Invocation configuration (i.e. number of blocks, number of threads). • **Register usage** of the kernel.
- - Shared memory usage of the kernel.
- Of course, if the GPU is busy processing other tasks that will also impact the performance of the kernel.
- The CUDA scheduler assigns work to the SMs and manages the GPU resources. In particular, it is possible to runs several kernels asynchronously. Or even several CUDA applications.



- The CUDA Programming Model
- Host, device and memory
- Writing a kernel
- GPU architecture
- Common data parallel techniques
- Summary

## **Table of Contents**

Daniel Cámpora – dcampora@nvidia.com

44



### GPUs are very efficient for data parallel workloads.

- Perform the same operation across a dataset.
- As opposed to instruction / thread / process level parallelism.



Constructing a tower is not data parallel

## Data Parallelism

Data dependencies are key factors to take into consideration!





Daniel Cámpora – dcampora@nvidia.com

### Image by freepik

### Calculating a filter is data parallel



- - Cache hits are more likely.



## Work Balance

Blocks are small executable pieces that can be carried out by a single SM.

There should be enough work to efficiently use the SM resources.

### Work imbalance is important: a single thread can stall resources of the entire block.

Ideally all threads should have a similar amount of work to do.

The slower thread stalls the entire block.



46



### • It acts as a *control flow barrier*, it will wait for all threads to reach that instruction.

```
for (unsigned i = threadIdx.x; i < N; i += blockDim.x) {</pre>
 C[i] = A[i] + B[i];
__syncthreads();
 C[i] += C[i - 1];
```

## Synchronising Threads

- Threads in a block can be synchronized through the \_\_syncthreads() command.
  - \_\_global\_\_ void vector\_addition(float\* A, float\* B, float\* C) {

for (unsigned i = 1 + threadIdx.x; i < N; i += blockDim.x) {</pre>



### Blocks cannot communicate to one another.



Assign blocks to sections.

Use all threads to load pixels in parallel.

## **Blocks Are Independent**

• In fact, *it is not guaranteed* that any two blocks will even execute concurrently.

You can reuse blocks smartly to divide the work considering data dependencies.

• Use \_\_synchronize() and then reassign the role of each thread.

• Eq. imagine we need to encrypt an image with an encoding that has a dependency across columns:





Synchronize





Each thread gets assigned a different row.



- memory).
- There are many use cases for atomics:
  - Counting elements.
  - Searching eg. elements on an array.
  - Histogramming...
- Atomics can be used over global data that is accessed by several threads or blocks.
  - They can also be used over shared data that is accessed by threads on a block.

## Atomics

### CUDA provides operations that allow atomic accesses over data (on global memory or shared)

Atomic accesses guarantee data will be coherent and prevents race conditions between threads.



Daniel Cár



- Atomics could be potentially very slow if they are abused.
- The good news is that atomics are automatically optimized (both software and hardware).
  - Just because they are needed so often.
- Two syntaxes are supported on CUDA:
  - Functions: atomicAdd, atomicInc, atomicOr, ...
  - cuda::atomic (it took 10 years to get this one) working in CUDA).

## Atomics (2)



### TOOK A LITTLE BIT LONGER TO EXPLAIN...





- - Hence, it could live in shared memory or global memory.

One detail to know about C++ atomics:

• std::atomic owns the memory.

Cannot be copied.

- location.
  - Meant to be passed by value, can be copied.

## cuda::(std::)atomic

 cuda::atomic behaves like what you would expect from the C++ standard std::atomic. It also allows to define a scope where the atomic takes effect: • The scope could e.g. be a single block, a cluster of blocks or the entire GPU.

• std::atomic\_ref is a lightweight non-owning wrapper around a user-specified memory



## **One Last Example: an Atomic Addition**

### Adding together numbers in a floating point array.

sum += A[i];

## Bear in mind: **Floating point** atomics will result in a non-deterministic result! Integer atomics don't have this issue.



\_\_global\_\_ void vector\_addition(float\* A, cuda::atomic<float>& sum) { for (unsigned i = blockIdx.x \* blockDim.x + threadIdx.x; i < N; i += gridDim.x \* blockDim.x) {</pre>

$$a_7 \ \dots \ a_{n-1}$$



- The CUDA Programming Model
- Host, device and memory
- Writing a kernel
- GPU architecture
- Common data parallel techniques
- Summary

## **Table of Contents**

Daniel Cámpora – dcampora@nvidia.com

53



- We have gone through the basic building blocks of CUDA. The host is in charge, the device is used for offloading computation. • *Kernels* are functions invoked on the host, run on the device. Computation is divided in blocks and threads.

- Knowing your hardware leads to better software.
  - SMs execute blocks in parallel.
  - Global memory allows for communication with the host and device, and must be preallocated.
- GPUs excel at data parallelism.
  - Identifying the right problem to tackle on GPU is half the work.

## Summary



- GPU Teaching Kit on Accelerated Computing.
- NVIDIA Deep Learning Institute materials.
- <u>CUDA Programming Guide</u>.

## **Resources Used in the Talk**

Talk by O. Giroux on <u>The One-Decade Task: Putting std::atomic in CUDA</u>.





