

# TOWARDS "WRITE ONCE, RUN ANYWHERE"



<u>Miguel Astrain Etxezarreta</u> Universidad Politécnica de Madrid (UPM) Acknowledgements:

Mariano Ruiz, Antonio Carpeño and Sergio Esquembri of the Universidad Politécnica de Madrid (UPM)



miguel.astrain@i2a2.upm.es





Introduction (1h)

- Context
- The OpenCL Models
- OpenCL Host Programming
- OpenCL Kernel Programming
- FPGA Oriented Kernel Design Advanced (1h)
- More on OpenCL host programming
- Example: Matrix Multiplication
- More on FPGA Oriented kernel design
- Memory Hierarchy
- Synchronization in OpenCL
- The OpenCL Event Model
- Building OpenCL



In addition to these slides, and the set of examples, it is useful to have:

- OpenCL in Action. Matthew Scarpino. Manning 2011. (Very recomended)
- Heterogeneous Computing with OpenCL. Gaster, Howes, et al Morgan Kaufmann 2011.
- OpenCL Parallel Programming Development CookBook. Raymond Tay. PACKT 2013.
- Vendor specific documentation for each platform:
- Xilinx VITIS Softwate platform
  - <u>https://www.xilinx.com/html\_docs/xilinx2019\_2/vitis\_doc/index.html</u>
- Intel FPGA SDK for OpenCL
  - <u>https://www.intel.com/content/www/us/en/programmable/products/design-software/embedded-software-developers/opencl/support.html</u>



#### Three main profiles that will benefit from this course:

- 1. Scientific background: Scientific experiments, Data Acquisition, some hardware...
- 2. Programmer background: Programming frameworks, some parallelization, GPUs....
- 3. Hardware background: HDL, FPGAs, DSP, some programming....

No knowledge is required from the other fields.

#### The goal... understanding the heterogeneous programming model.

- 1. Lower learning curve to write high-performance algorithms for hardware.
- 2. Apply programming concepts to hardware. Integrate many hardware in a unified framework.
- 3. Optimizations thanks to hardware knowledge, provide tools for the above.



Very recommended to review the CSC2019 lectures on these topics:

- 1. Software Design in the Many-Cores era; A. Gheata, E. Tejedor.
- 2. Base Concepts of Parallel Programming: A Pragmatic Approach; A. Gheata, E. Tejedor.

Over-summarized, the frameworks for high-performance computing help on:

- 1. Dividing tasks into smaller problems.
- 2. Managing the tasks (and memory!) to execute efficiently.

But OpenCL offers one more perk, managing heterogeneous hardware:

- 1. Additional to parallelism I can select optimal hardware!  $\rightarrow$  More performance!
- A single computing language can manage many devices! → Ease of maintenance, portability!
- 3. An open standard, which is manufacturer agnostic  $\rightarrow$  Write once, run anywhere!



 OpenCL lets Programmers write a single <u>portable</u> program that uses <u>ALL</u> resources in the heterogeneous platform

A modern computing platform includes:

- One or more CPUs
- One or more FPGAs
- One of more GPUs
- DSP processors
- <u>FPGA with AI Cores + DSP + PCIe +DDR</u>
- <u>ADCs?! (Xilinx RFSOC )</u>









#### **OpenCL – Open Computing Language**



Open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing CPUs, GPUs, FPGAs, DSPs,...

OpenCL is an open standard maintained by the non-profit technology consortium Khronos Group.

16/03/2020 Introduction to Heterogeneous Programming in OpenCL with FPGAs - Miguel Astrain - iCSC2020 7



- Very interesting hardware, with real-time capabilities.
- While often losing to GPUs in raw computation muscle, FPGAs are more energy efficient!
- The learning curve for HDLs is steep, what do I need to know about FPGAs to use OpenCL?
  - Actually, very little! Knowledge helps, of course, but the tools provided with OpenCL SDKs help us!



## THE OPENCL MODELS

There are quite a few.



OpenCL defines a set of models to organize the core ideas:

- Platform Model
- Execution Model
- Memory Model
- Programming Model

from: Khronos registry, OpenCL specification 2.2 276 pages of condensed information!



- One *Host* and one or more OpenCL *Devices.* 
  - Each OpenCL Device is composed of one or more *Compute Units.*
    - Each Compute Unit is divided into one or more *Processing Elements.*



OpenCL keywords are high-lightened in RED hereinafter.



- The host has the recipe on how to perform the computation
  - It uses commands to the device to do so.
- The device has the power to perform the computation.
  - Can only understand kernel code.
- Memory divided into *Host Memory* and *Device Memory*.





- Define a problem domain and execute an instance of a kernel for each point in the domain.
  - The smallest unit is called a work-item
- If the problem needs synchronization or has dependencies, manage them into work-groups





#### **OpenCL Execution Model**



• No imposed order of execution



## The idea behind OpenCL

- Computation divided into simpler functions (a kernel) executing at each point in a problem domain.
- Most of the computing muscle is usually needed in a few lines of code.
- Typical example: 1024x1024 image with one kernel invocation per pixel or 1024x1024 = 1,048,576 kernel executions

#### Traditional loops

```
void multiply(const int n, const
                                         kernel void multiply ( global
float *a,
                                      const float *a,
         const float *b, float
                                                         global const float
*c)
                                      *b,
                                                                global
  int i;
                                      float *c)
  for (i = 0; i < n; i++)
    c[i] = a[i] * b[i];
                                         int id = get_global_id(0);
                                        c[id] = a[id] * b[id];
in parallel, but how much parallelism?
```

**Data Parallel OpenCL** 

16/03/2020 Introduction to Heterogeneous Programming in OpenCL with PGAS Migue Astrain - iCSC2020 15



- Typical example, an image:
- Global Dimensions:
  - 1024x1024 (whole problem space)
- Local Dimensions:
  - 128x128 (work-group, executes together)
- The work-item is the pixel.
- You can tune the dimensions that are "best" for your hardware.
- Remember Platform Model "CU" and "PE".





## **OpenCL Memory model**

- Private Memory
  - $\circ$  Per work-item
- Local Memory
  - $\circ~$  Shared within a work-group
- Global Memory
  - $\circ~$  Visible to all work-groups
- Host Memory / Constant
  - o On the CPU



https://www.khronos.org/



## **OpenCL Memory model**

#### Memory management is **explicit!**.

More on this and SYCL later...



https://www.khronos.org/



Definitions, lots of definitions... What does it all mean?

- They help keep ideas clear.
- Divide and manage the problem the OpenCL way!
- The programming model is ... the rest ...
- Your problem, your algorithm, your hardware



### **OpenCL Programming Model**

Data Parallelism

kernels and indexes

 Work-items

Task Parallelism

- kernels and queues
   Work-Group
- Single Instruction Multiple Data
   OpenC
  - OpenCL C
    Vector instructions.

- Single Program Multiple Data
- Platforms and devices
   Deploy to multiple devices

## **OPENCL HOST PROGRAMMING**

What is your favorite language?



- There is now a full specification in C++.
- Host bindings are available for C, C++, Java, Python.
- Kernels are written in OpenCL C subset of C99 with specific extensions and restrictions.

- Recommend using C/C++. Most examples are written in C.
- Lots of development effort in C++, SYCL,...



- The host program is the code that runs on the host to:
  - Setup the environment for the OpenCL program
  - Create and manage kernels



- 5 simple steps in a basic host program:
  - 1. Define the *platform* ... platform = devices + context + queues
  - 2. Create the *program* (dynamic library for kernels)
  - 3. Setup *memory* objects
  - 4. Define the *kernel* (attach arguments to kernel functions)
  - 5. Submit *commands* ... transfer memory objects and execute kernels



#### **Host Program**



16/03/2020 Introduction to Heterogeneous Programming in OpenCL with FPGAs - Miguel Astrain - iCSC2020 24



- The computation "recipe" is scheduled through the command-queue.
- **Commands** for a device include kernel execution, synchronization, and memory transfer operations.

write

Task

read







#### Lets analyse a host program:

1. Platform:



#### \*The code ahead might be simplified or wrong to keep it shorter and readable.



 $\underline{\mathbf{Ex:}}$ 

2.

3.

4.

5.

**cl\_uint** num\_of\_platforms = 0;

**clGetPlatformIDs**(0, 0, &num\_of\_platforms);

cl\_platform\_id\* platforms = new cl\_platform\_id[num\_of\_platforms];

**clGetPlatformIDs**(num\_of\_platforms, platforms, 0);

**clGetPlatformInfo**(platforms[i], CL\_PLATFORM\_NAME, 0, 0, platform\_name\_length); **clGetPlatformInfo**(platforms[i], CL\_PLATFORM\_NAME, platform\_name\_length, platform\_name, 0);

cl\_platform\_id platform = platforms[selected\_platform\_index];

Number of available platforms: 1
Platform names:[0] Intel(R) OpenCL [Selected]



5.

**cl\_uint** cur\_num\_of\_devices; **clGetDeviceIDs**(platform, CL\_DEVICE\_TYPE\_CPU, 0, 0, &cur\_num\_of\_devices);

 $\mathbf{Ex}:$ **cl\_device\_id**\* devices\_of\_type = new cl\_device\_id[cur\_num\_of\_devices]; 1. clGetDeviceIDs(platform, CL\_DEVICE\_TYPE\_CPU, cur\_num\_of\_devices, 2. 3. devices\_of\_type, 0); 4.

CL DEVICE TYPE CPU: 1

```
cl_uint device_index = 0;
```

**cl\_device\_id** device = devices\_of\_type[device\_index];

CL DEVICE TYPE CPU[0] CL DEVICE NAME: Genuine Intel(R) CPU @ 2.60GHz CL DEVICE AVAILABLE: 1 CL DEVICE VENDOR: Intel(R) Corporation



Ex: 1.

2. 3.

4.

5.

• Create a simple context with a single device:

```
cl_context clCreateContext(cl_context_properties *properties, cl_uint
num_devices, cl_device_id *devices, (void CL_CALLBACK
*notify_func)(...), void user_data, cl_int *error);
```

```
context = clCreateContext(NULL, 1, &device_id, NULL,
NULL, &err);
```

Create a simple command-queue to feed our device:

 cl\_command\_queue clCreateCommandQueue(cl\_context context,
 cl\_device\_id device\_id, cl\_command\_queue\_properties properties,
 cl\_int \*error);

```
q_commands = clCreateCommandQueue(context, device_id, 0,
&err);
16/03/2020 Introduction to Heterogeneous Programming in OpenCL with FPGAs - Miguel Astrain - iCSC2020 29
```



 $\mathbf{E}\mathbf{x}$ :

2. 🗸

3.

4.

5.

### Example: step by step

```
const char* raw_text = &program_text_prepared[0];
cl_int err;
cl_program program = clCreateProgramWithSource(context,
1, &raw_text, 0, &err);
```

clBuildProgram(program, (cl\_uint)num\_of\_devices, devices, build\_options.c\_str(), 0, 0);

> Build program options: "-DT=float -DTILE\_SIZE\_M=1 -DTILE\_GROUP\_M=16 -DTILE\_SIZE\_N=128 -DTILE\_GROUP\_N=1 -DTILE\_SIZE\_K=8"



 $\mathbf{E}\mathbf{x}$ :

2. 🗸

4. 🧹

3.

5.

```
cl_kernel krnl = 0;
string kernel_name = "Multiply"
krnl = clCreateKernel(program, kernel_name.c_str(),
&err);
```

- As we will see later, kernels are really like functions.
- They have arguments. But must return void.
- They are identified by name for OpenCL.
- Remember that kernels are compiled for the device architecture.



HOST MEMORY BUFFER:

cl\_float\* data\_ptr = (cl\_float \*) malloc(sizeof(cl\_float) \* count); cl\_mem array\_of\_floats = clCreateBuffer(context, CL\_MEM\_READ\_WRITE, sizeof(cl\_float)\*count, data\_ptr, NULL);



clSetKernelArg(kernel, 0, sizeof(cl\_mem), &array\_of\_floats);





 $\frac{\mathbf{Ex:}}{\mathbf{1.}}$ 

```
Remember the variables:
```

```
cl_command_queue q_commands;
```

cl\_kernel krnl;

```
cl_mem array_of_floats;
```



- 2. clEnqueueTask(q\_commands, krnl, 0, NULL,NULL); NO dimensions!
- 3. clEnqueueReadBuffer(q\_commands, array\_of\_floats, CL\_TRUE, sizeof(cl\_float)\*count, data\_ptr, 0, NULL, NULL);

## **OPENCL KERNEL PROGRAMMING**



- Derived from ISO C99 + ISO C11
  - A few *restrictions*: no recursion, function pointers,...
  - Preprocessing directives defined by C99 are supported (#include etc.)
- Built-in data types
  - $\circ\,$  Scalar and vector data types, pointers
  - Image types:
    - image2d\_t, image3d\_t and sampler\_t
- OpenCL #pragmas added to guide the compiler.
- The return type of a kernel function must be void



- Function qualifiers
  - \_\_\_\_kernel qualifier declares a function as a kernel
    - I.e. makes it visible to host code so it can be enqueued.
  - Kernels can call other kernel-side functions
- Address space QUALIFIERS
  - o \_\_global, \_\_local, \_\_constant, \_\_private
  - Pointer kernel arguments must be declared with an address space qualifier
- Work-item functions

o get\_work\_dim(), get\_global\_id(), get\_local\_id(), get\_group\_id()

- Synchronization functions
  - Barriers all work-items within a work-group must execute the barrier function before any work-item can continue
  - Memory fences provides ordering between memory operations



- Pointers to functions are *not* allowed
- Pointers to pointers allowed *within* a kernel, but not as an argument to a kernel invocation
- Bit-fields are not supported
- Variable length arrays and structures are not supported
- Recursion is not supported (yet!)
- Double types are optional in OpenCL v1.1, but the key word is reserved (note: most implementations support double)



- Built-in functions *mandatory* 
  - $\circ~$  Work-Item functions, math.h, read and write image
  - Relational, geometric functions, synchronization functions
     printf ()
- Built-in functions *optional* (called "extensions")
  - $\circ~$  Double precision, atomics to global and local memory
  - Selection of rounding mode, writes to image3d\_t surface

# **FPGA ORIENTED KERNEL DESIGN**



#### **FPGA Oriented Kernel Design**

- Some general rules for FPGAs:
- Work-Item and a kernel.

| Strategy      | Scheme                                          | AREA | FREQ | THROUGHPUT | LATENCY |
|---------------|-------------------------------------------------|------|------|------------|---------|
| Parallelizing | $D1 \longrightarrow P1$ $D2 \longrightarrow P1$ | ++   | =    | ++         | =       |
| Pipelining    | D1 P1.1 P1.2<br>D2 P1.1                         | +    | ++   | +          | +       |
| Complex op.   | D1 → P1+P2+P3                                   | =    |      | ++         |         |
| Divide op.    | D1 → P1 → P2 → P3                               | ++   | ++   | =          | ++      |



#### The most important FPGA design patern

```
#pragma OPENCL EXTENSION cl_intel_channels : enable
channel int c0;
__kernel void producer() {
   for (int i = 0; i < 10; i++) {
      write_channel_intel (c0, i);
   }
}
__kernel void consumer (__global uint * restrict dst) {
   for (int i = 0; i < 5; i++) {
      dst[i] = read_channel_intel(c0);
   }
}</pre>
```





## **FPGA Oriented Kernel Design**



- Send Data from one kernel to another without host intervention
- Send Data from I/O to kernel or from kernel to I/O
- Send Data from host to kernel and vice versa without using global memory
- Data remains in a channel as long as the kernel remains loaded on the FPGA device, persistence among NDRange invocations and among work-groups
- Blocking and Non-Blocking behavior
- Pipes are OpenCL 2.0 standard, contrary to channels



for programers.

- SYCL. cross-platform abstraction C++ programming model for OpenCL.
   Adding much of the ease of use and flexibility of single-source C++.
- SYCL implements a single-source multiple compiler-passes (SMCP).
   Simplifying the **Device-Host** separation of OpenCL.
- Easier to handle for programmers. But OpenCL concepts remain, i.e.
- The SYCL Platform Model
   SYCL Execution Mode (command groups)
   Memory Model (same 4 layers)
   The SYCL programming model
   Cl::sycl::context
   cl::sycl::device
   cl::sycl::program
   cl::sycl::queue
   cl::sycl::queue

# **END OF INTRODUCTION**

## **ADVANCED:**

# MORE ON OPENCL HOST PROGRAMMING



## **Context and Command-Queues**

- Context.
  - The environment within which kernels execute and in which synchronization and memory management is defined.
- The *context* includes:
  - One or more devices
  - Device memory
  - $\circ~$  One or more command-queues
- All *commands* for a device (kernel execution, synchronization, and memory transfer operations) are submitted through a *command-queue*.
- Each *command-queue* points to a single device within a context.





- Commands include:
  - Kernel executions
  - Memory object management
  - $\circ$  Synchronization
- The only way to submit commands to a device is through a command-queue.
- Each command-queue points to a single device within a context.
- Multiple command-queues can feed a single device.
  - Independent streams of commands that don't require synchronization.





**Command queues** can be configured in different ways to control how commands execute

- In-order queues:
  - Commands are enqueued and complete in the order they appear in the program (programorder)

#### Out-of-order queues:

- Commands are enqueued in program-order but can execute (and hence complete) in any order.
- Execution of commands in the command-queue are guaranteed to be completed at synchronization points
  - o Discussed later



Memory Objects:

• A handle to a reference-counted region of global memory.

There are two kinds of memory object

- **Buffer** object:
  - Defines a linear collection of bytes ("just a C array").
  - The contents of buffer objects are fully exposed within kernels and can be accessed using pointers
- *Image* object:
  - Defines a two- or three-dimensional region of memory.
  - Image data can only be accessed with read and write functions, i.e. these are opaque data structures. The read functions use a sampler.
- Pipe (Channel) object:
  - A pipe is a memory object that stores data organized as a FIFO.
  - Pipe objects are not accessible from the host.



## **Memory Object Options**

| Flag value            | Meaning                                                                        |
|-----------------------|--------------------------------------------------------------------------------|
| CL_MEM_READ_WRITE     | The memory object can be read from and written to.                             |
| CL_MEM_WRITE_ONLY     | The memory object can only be written to.                                      |
| CL_MEM_READ_ONLY      | The memory object can only be read from.                                       |
| CL_MEM_USE_HOST_PTR   | The memory object will access the memory region specified by the host pointer. |
| CL_MEM_COPY_HOST_PTR  | The memory object will set the memory region specified by the host pointer.    |
| CL_MEM_ALLOC_HOST_PTR | A region in host-accessible memory will be allocated for use in data transfer. |

• These are from the point of view of the device.



- It can get confusing about whether a host variable is just a regular C array or an OpenCL buffer
- A useful convention is to prefix the names of your regular host C arrays with "h\_" and your OpenCL buffers which will live on the device with "d\_"



• Enqueue the kernel for execution:

clEnqueueTask(cl\_command\_queue commands, cl\_kernel kernel, cl\_uint
num\_events, const cl\_event \*wait\_list, cl\_event event);

clEnqueueNDRangeKernel (cl\_command\_queue commands, cl\_kernel kernel, cl\_uint work\_dims, size\_t \*global\_work\_offset, size\_t global\_work\_size, size\_t \*local\_work\_size, 0, NULL, NULL);

**clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local,** 0, NULL, NULL); <u>Events later.</u>

## MATRIX MULTIPLICATION EXAMPLE



Analize C=A-B

$$\begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix} \cdot \begin{pmatrix} B_{00} & B_{01} & B_{02} \\ B_{10} & B_{11} & B_{12} \\ B_{20} & B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} C_{00} & C_{01} & C_{02} \\ C_{10} & C_{11} & C_{12} \\ C_{20} & C_{21} & C_{22} \end{pmatrix}$$



```
void mat mul(int N, float *A, float *B, float *C)
ł
    int i, j, k;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            C[i*N+j] = 0.0f;
            for (k = 0; k < N; k++) {
              // C(i, j) = sum(over k) A(i,k) * B(k,j)
              C[i*N+j] += A[i*N+k] * B[k*N+j];
```



```
kernel void mat mul
const int N,
global float *A, global float *B, global float *C)
    int i, j, k;
    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        // C(i, j) = sum(over k) A(i,k) * B(k,j)
        for (k = 0; k < N; k++) {
          C[i*N+j] += A[i*N+k] * B[k*N+j];
                              Mark as a kernel function and
                              specify memory qualifiers
```



```
kernel void mat mul
 const int N,
   global float *A, global float *B, global float *C)
ł
   int i, j, k;
    i = get global id(0);
    j = get global id(1);
            for (k = 0; k < N; k++) {
                // C(i, j) = sum(over k) A(i,k) * B(k,j)
                C[i*N+j] += A[i*N+k] * B[k*N+j];
                                       Remove outer loops and set
                                       work-item co-ordinates
```



```
kernel void mat mul(
const int N,
  global float *A, global float *B, global float *C)
{
   int i, j, k;
   i = get global id(0);
   j = get global id(1);
   // C(i, j) = sum(over k) A(i,k) * B(k,j)
   for (k = 0; k < N; k++) {
     C[i*N+j] += A[i*N+k] * B[k*N+j];
```



• Rearrange and use a local scalar for intermediate C element values (a common optimization in Matrix Multiplication functions)

```
kernel void mmul
  const int N,
   global float *A,
   global float *B,
   global float *C)
{
 int k;
 int i = get_global_id(0);
 int j = get global id(1);
 float tmp = 0.0f;
 for (k = 0; k < N; k++)
  tmp += A[i*N+k]*B[k*N+j];
 C[i*N+j] += tmp;
```

This Accumulation is recognized by the compiler !

## **MORE ON FPGA ORIENTED KERNEL DESIGN**



- Intel recommends single work-Item kernels, when possible
- Use NDRange when the code does not have memory dependencies and loops. If data must be shared among WI this structure is not efficient.
- High throughput achieved by using multiple pipelines stages at any time. Parallelism by pipelining the loop iterations.
- Some strategies are common to FPGAs, others depend on the family of FPGA, consult the programming guide for each one.



#### Single Work-Item vs NDRange Kernel







#### Single Work-Item vs NDRange Kernel

#### NDRange Kernel





Strategy 1: Unrolling a Loop

#pragma unroll <N>

#pragma unroll 2
for(size\_t k = 0; k < 4; k++)
{
 mac += data\_in[(gid \* 4) + k] \* coeff[k];
}</pre>



Strategy 2: Coalescing Nested Loops

```
#pragma loop_coalesce <loop_nesting_level>
```

The OpenCL compiler hates nested loops. Try to avoid its use!!

```
#pragma loop_coalesce
for (int i = 0; i < N; i++)
  for (int j = 0; j < M; j++)
    sum[i][j] += i+j;
int i = 0;
int j = 0;
while(i < N){
    sum[i][j] += i+j;
    .
```

```
j++;
if (j == M){
    j = 0;
    i++;
}
```



**Strategy 3**: Specifying a Loop Initiation Interval

#pragma ii <desired\_initiation\_interval>

Define the number of clock cycles to wait among successive loop iterations

Strategy 4: Loop Concurrency

#pragma max\_concurrency <N>

Define the number of iterations to be in progress at one time



Strategy 5: Specifying the work group size

```
_attribute__((max_work_group_size(512,1,1)))
_kernel void sum (__global const float * restrict a,
___global const float * restrict b,
___global float * restrict answer)
{
size_t gid = get_global_id(0);
answer[gid] = a[gid] + b[gid];
}
```



Strategy 6: Specifying the Number of Compute Units

```
_attribute__((num_compute_units(2)))
_kernel void test(__global const float * restrict a,
___global const float * restrict b,
___global float * restrict answer)
{
size_t gid = get_global_id(0);
answer[gid] = a[gid] + b[gid];
}
```

Strategy 7: Specifying the Number of SIMD Work-Items

However. Use vectors explicitly as frequently as possible



Strategy 8: Removing Loop-Carried Dependencies

#pragma ivdep

When each loop access different parts of the array there may be fictitious dependencies. This directive commands the compiler to forget the dependencies and remove the extra initiation cycles in the loop immediately after the pragma directive.



#### **Compilation Report**

| HLD FPGA Reports (Beta) View reports Report menu                                               |                                                                                                                                                                                                            |             |                |           |                 |                                                                                                                                                                                                                              |      |
|------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------|----------------|-----------|-----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------|
| Summary                                                                                        |                                                                                                                                                                                                            |             |                |           |                 | - lob#                                                                                                                                                                                                                       | ×    |
| Info<br>Project Marrie<br>Target Family, Davies, Board<br>ACDS Writen<br>AOC Yenson<br>Command | HTM       4       Arvin TO, NORTH HERMERIKE, and give and give       TYTED INSTALLER AND AND       TYTED INSTALLER AND       Arvin Darks and       Arcol And And       Arvin Darks and       Arcol And And |             |                |           |                 | <pre>24 25 26 27 26 28 28 29 29 29 29 20 20 20 20 20 20 20 20 20 20 20 20 20</pre>                                                                                                                                           |      |
| Reports Generated At This Mar 2010;54:25:2017 Kennel, Sansmany                                 |                                                                                                                                                                                                            |             |                |           |                 | 36 dorfine MALLERORES (1 << (1.001 - LOGPONTS))<br>37 // Log of how much to fick at most for one area of input buffer,<br>38 dorfine LOG_CONT_SATOR 6<br>40 dorfine CONT_SATOR (1 << 106_CONT_SATOR)<br>41                   |      |
| Servel Name                                                                                    | Rennel Type                                                                                                                                                                                                | Actorso     | Warkgroup Stor |           | # Compute Units | <ul> <li>// Head some depth to car channels to accomplete their hursty filling.</li> <li>ds channel float2 chanim [8] _attribute_[(depth(CONT_PACTORY8)));</li> </ul>                                                        |      |
| fetzh                                                                                          | MDRumps                                                                                                                                                                                                    | No          | 512,1/1        |           | 4               | dd<br>d5 - uist bit_reversed(uist s, uist bits) (                                                                                                                                                                            |      |
| 491.4                                                                                          | Singla sarikitan                                                                                                                                                                                           | Nex         | 5,1,3          |           | 4               | 46 sint y = 8;<br>47 deragne unroll<br>48 - for kott i = 8; i < hits; i++) (                                                                                                                                                 |      |
| Estimated Resource Usage<br>Reset Name                                                         | 8 AUTS<br>2012                                                                                                                                                                                             | PR 100%     |                | EAM:      | DNPs<br>0       | $\begin{array}{cccccccccccccccccccccccccccccccccccc$                                                                                                                                                                         |      |
| trid                                                                                           | 2 39 4 8                                                                                                                                                                                                   | 11014       |                | 210       | 112             | 55 )<br>56 J/ fetch li points eo folleno:                                                                                                                                                                                    |      |
| Remeil Sabtutal                                                                                | 27681                                                                                                                                                                                                      | 90719       |                | 358       | 214             | 37 // retch = points of roution;<br>38 // - each thread will lise 8 consecutive values<br>69 // - load CON_FACTOR consecutive loads () values each), then jump by H/H, and load newt<br>60 // CAN_FACTOR consecutive values. |      |
| Channel Resources                                                                              | 400                                                                                                                                                                                                        | 4576        |                | 96        | 0               | <ol> <li>// - Dice load CONT_FACTOR values starting at 70/8, send CONT_FACTOR values</li> <li>// dots the chewnel to the fit kernel.</li> </ol>                                                                              |      |
| Board Interface                                                                                | 66000                                                                                                                                                                                                      | 113600      |                | 102       | 0               | 63 // - start process again.<br>64 // Trick way, only need BxCBNT_SECTOR local memory buffer, instead of BxH.                                                                                                                |      |
| Tetal                                                                                          | 9517013294                                                                                                                                                                                                 | 199927 (33% | 1              | 756 (MPR) | B4150 5140      | 65 // Group index is used as follows ( 0 to CONT_FACTOR, iteration num )<br>67 //                                                                                                                                            |      |
| Analysis pan<br>Compile Warnings                                                               | 387600<br>C                                                                                                                                                                                                | 1575200     |                | 2531      | 1518            | 88     // 64C values = 2138, nm_fetches=2133, 216 = CBNT_PACTOR, 211-nm_Fetches / cont_factor       89     // / < C (value 4 > 71       70     // < C (value 4 > 71       71     // SAL210001043100                          | ne , |
| Details                                                                                        |                                                                                                                                                                                                            |             |                |           |                 |                                                                                                                                                                                                                              | ×    |
| fftt1d:<br>• Kernel type: Sing<br>• Required workgr<br>• Maximum workg                         | oup size: (1, 1, 1)                                                                                                                                                                                        |             |                |           |                 |                                                                                                                                                                                                                              |      |
|                                                                                                |                                                                                                                                                                                                            |             |                |           |                 | Details pane                                                                                                                                                                                                                 |      |

Report about code structures that prevent the loops from being fully pipelined. Report about area usage.

Report about wrong memory management.



**Strategy 9**: Implementing Arbitrary Precision Integers

Sometimes, optimizing the code when working with FPGAs demands adjusting the size of data to the size strictly needed.

Strategy 10: Inferring Registers and Shift Registers.

When variables are defined as "private" and the access to arrays are statically inferable, they are implemented as FFs in LEs, or in blocks RAM (if their size is larger than 64 bytes). They are the fastest hardware for loop execution.

**Strategy 11**: Inferring Single-Cycle Floating-Point Accumulator.

Only for Arria10 devices.



Create good and efficient code for FPGA kernels is a complex task. The results depends heavily on the designer's expertise.

However, it's a perfect beginning to read two very useful manuals.

"Intel FPGA SDK for OpenCL Programming Guide" and "Intel FPGA SDK for OpenCL Best Practices Guide"

They are hard to deal with though they will give you priceless help for optimizing the area and speed of your application parting from the information given by the compilation report.

They will show you how to work with loop-carried dependency, how to carry out proper memory management to improve access, how to use channels or pipes when needed, etc.

# THE OPENCL MEMORY HIERARCHY



## **The Memory Hierarchy**





- Managing the memory hierarchy is one of <u>the</u> most important things to get right to achieve good performance.
- Remember memory transfers are explicit!
- Private Memory:
  - A very scarce resource, only a few thousands of 32-bit words per Work-Item at most
  - If you use too much it spills to global (or local) memory or reduces the number of Work-Items that can be run at the same time, potentially harming performance
  - These is the closest-to-hardware memory. The actual realization varies from one to another. (CPU registers, FPGA registers,...)



- The memory for the work-groups:
  - Close to the hardware, but shared between work-items. Each device realizes it in a different way.
  - Your kernels are responsible for transferring data between Local and Global/Constant memories
- Access patterns to Local Memory affect performance in a similar way to accessing Global Memory.
- Due to their architecture, managing local memory is most important to GPUs.
- FPGA local memory is still very fast (private), but being shared means access patterns affect it.
- CPUs do not have specialized hardware for this....



- The host accessible memory.
- The access pattern to global memory should be the easiest possible.
  - Move data to faster memories, think about dependencies of the algorithm.
- Constant memory is an specialization of global.
- In FPGAs global memory is RAM outside the chip.
  - Constant memory might get replicated or cached to chip memory to satisfy reading needs.
  - While the bandwidth is good, the FPGA can easily overwhelm it with read and write operations.



- OpenCL uses a relaxed consistency memory model; i.e.
  - The state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times.
- Within a work-item:
  - Memory has load/store consistency to the work-item's private view of memory,
     i.e. it sees its own reads and writes correctly
- Within a work-group:
  - Local memory is consistent between work-items at a barrier.
- Global memory is consistent within a work-group at a barrier, <u>but not guaranteed</u> <u>across different work-groups!!</u>
  - $\circ$  This is a common source of bugs!
- Consistency of memory shared between commands (e.g. kernel invocations) is enforced by synchronization (barriers, events, in-order queue)

## SYNCHRONIZATION IN OPENCL



- Global Dimensions:
  - 1024x1024 (whole problem space)
- Local Dimensions:
  - o 64x64 (work-group, executes together)



Synchronization: when multiple units of execution (e.g. work-items) are brought to a known point in their execution. Most common example is a barrier ... i.e. all units of execution "in scope" arrive at the barrier before any proceed.



- Use barrier to synchronize work items inside a work-group.
- barrier( CLK\_LOCAL\_MEM\_FENCE ) or barrier( CLK\_GLOBAL\_MEM\_FENCE )
- Careful with branching! All the work items must take the same branch.
- Across work-groups
  - No guarantees as to where and when a particular work-group will be executed relative to another work-group
  - Cannot exchange data, or have barrier-like synchronization between two different work-groups! (Critical issue!)
  - $\circ~$  Only solution: finish the kernel and start another
- The FPGA being hardware, does have other means to synchronize (pipes).



# Where might we need synchronization?

- Consider a reduction ... reduce a set of numbers to a single value
   E.g. find sum of all elements in an array
- Sequential code

```
int reduce(int Ndim, int *A)
{
    int sum = 0;
    for (int i = 0; i < Ndim; i++)
        sum += A[i];
    return sum;
}</pre>
```



- A reduction can be carried out in three steps:
  - 1. Each work-item sums its private values into a local array indexed by the work-item's local id
  - 2. When all the work-items have finished, one work-item sums the local array into an element of a global array (indexed by work-group id)
  - 3. When all work-groups have finished the kernel execution, the global array is summed on the host

Again, the dimensionality of the problem regarding performance depends on the hardware!

# THE OPENCL EVENT MODEL



# **OpenCL Kernel life cycle**

- An event is an object that communicates the status of commands in OpenCL ... legal values for an event:
  - **CL\_QUEUED**: command has been enqueued.

• ERROR\_CODE:

- CL\_SUBMITTED: command has been submitted to the compute device
- CL\_RUNNING: compute device is executing the command
  - CL\_COMPLETE: command has completed
    - a negative value indicates an error condition occurred.
- Can query the value of an event from the host ... for example to track the progress of a command.
   Examples:



## Generating and consuming events

• Consider the command to enqueue a kernel. The last three arguments optionally expose events (NULL otherwise).

```
cl int clEnqueueNDRangeKernel (
                                                 Number of events this
     cl command queue command queue,
                                                 command is waiting to
     cl kernel kernel,
     cl uint work dim,
                                                 complete before executing
     const size_t *global_work_offset,
     const size t *global work size,
     const size t *local work size,
     cl_uint num_events_in_wait_list,
     const cl_event *event wait list,
     cl event *event)
                                          Array of pointers to the
                                          events being waited upon ....
     Pointer to an event object
                                          Command queue and events
     generated by this command
                                          must share a context.
```



### **Event: basic event usage**

- Events can be used to impose order constraints on kernel execution.
- Very useful with out-of-order queues.

```
cl event
            k events[2];
err = clEnqueueNDRangeKernel(commands, kernel1, 1,
NULL, &global, &local, 0, NULL, &k_events[0]
                                                     Enqueue two
                                                     kernels that
err = clEnqueueNDRangeKernel(commands, kernel2
                                                     expose events
NULL, &global, &local, 0, NULL, &k events[1]);
err = clEnqueueNDRangeKernel(commands, kernel3, 1,
NULL, &global, &local, 2, k events, NULL);
                                              Wait to execute
                                              until two previous
                                             events complete
```



#### **OpenCL synchronization: queues & events**



16/03/2020 Introduction to Heterogeneous Programming in OpenCL with FPGAs - Miguel Astrain - iCSC2020 89



- OpenCL is a performance oriented language ... Hence performance analysis is an essential part of OpenCL programming.
- The OpenCL specification defines a portable way to collect profiling data.
- Can be used with most commands placed on the command queue ... includes:
  - Commands to read, write, map or copy memory objects
  - Commands to enqueue kernels, tasks
- Profiling works by turning an event into an opaque object to hold timing data.



- Profiling is enabled when a queue is created with the CL\_QUEUE\_PROFILING\_ENABLE flag set.
- When profiling is enabled, the following function is used to extract the timing data





- CL\_PROFILING\_COMMAND\_QUEUED
  - the device time in nanoseconds when the command is enqueued in a command-queue by the host. (cl\_ulong)
- CL\_PROFILING\_COMMAND\_SUBMIT
  - the device time in nanoseconds when the command is submitted to compute device. (cl\_ulong)
- CL\_PROFILING\_COMMAND\_START
  - the device time in nanoseconds when the command starts execution on the device. (cl\_ulong)
- CL\_PROFILING\_COMMAND\_END
  - the device time in nanoseconds when the command has finished execution on the device. (cl\_ulong)

# **BUILDING OPENCL FOR FPGA**



- The program object encapsulates:
  - o A context
  - The program kernel source or binary
  - List of target devices and build options
- The C API creates a program object:
  - o clCreateProgramWithSource()
  - o clCreateProgramWithBinary()



#### **CPUs and GPUs:**

OpenCL uses runtime compilation ... because in general you don't know the details of the target device when you ship the program





### **OpenCL FPGAs Development**

- Example of heterogeneous system:
- SoC with:
- FPGA (Altera)
- CPU (arm).



16/03/2020 Introduction to Heterogeneous Programming in OpenCL with FPGAs - Miguel Astrain - iCSC2020 95



- Remember the models to divide the computation the OpenCL way!
- The Host is controlling the computation of one or multiple heterogeneous devices.
- The host communicates using commands in command-queues.
- The 4 layers of memory. Memory transfers are explicit!
- There are lots of layers of parallelism
- Synchronize your work-items.
- Use events and profile them to monitor performance.
- Building an OpenCL application requires multiple compilations (min. 2)

# Thank you!

Special thanks to Dr. Antonio Carpeño and Professor Mariano Ruiz for their guidance.