

# oneAPI

## INTRODUCTION TO OFFLOAD ADVISOR

Klaus-Dieter Oertel Intel IAGS oneAPI@CERN 24-Mar-2020

#### **INTEL® ADVISOR (BETA)** Design assistant – Design for modern hardware

#### **Offload Advisor**

- Identify opportunities for offload to an accelerator

#### **Vectorization Advisor**

- Add and optimize vectorization

#### **Roofline Analysis**

- Optimize CPU/GPU code for memory and compute

#### **Threading Advisor**

- Add effective threading to unthreaded applications

#### **Flow Graph Analyzer**

- Create and analyze efficient flow graphs





## AGENDA

- Introduction to Offload Advisor
- Command line tips
- Understanding the performance modeling
- GPU Roofline Analysis



## INTRODUCTION TO OFFLOAD ADVISOR

Optimization Notice Copyright © 2019, Intel Corporation. All rights reserved. \*Other names and brands may be claimed as the property of others.



## INTEL OFFLOAD ADVISOR (BETA)

- Starting from a baseline binary (running on CPU), Offload Advisor:
  - Helps define which sections of the code should run on a given accelerator
  - Provides performance projections on accelerators (currently Gen9 and Gen11)

Screenshot: Offload Advisor report summary (Intel® Advisor Beta, build 604296) for the iso-3dfd application on a Gen9 GT2 target. Speed-up for accelerated code 8.9x, number of offloads 1, fraction of accelerated code 99%.

- Offloads bounded by: LLC BW 99%, non offloaded <1%, all other categories (compute, L3 cache BW, memory BW, data transfer, dependency, trip count, unknown) 0%
- Gen9 GT2 configuration: 1.15 GHz frequency, 24 EU, 512 KB L3, 24 GB/s DRAM bandwidth, integrated GPU





#### **MODELING FLOWS SUPPORTED: NOW**



#### MODELING FLOWS SUPPORTED: NOW + COMING SOON



### FROM YOUR CPU APPLICATION, YOU WONDER:

#### How might your code perform on an accelerator?

| Program metric                | Value    |
|-------------------------------|----------|
| Original                      | 13.84s   |
| Accelerated                   | 1.63s    |
| Target Platform               | Gen9 GT2 |
| Number of Offloads            | 1        |
| Speed Up for Accelerated Code | 8.9x     |
| Amdahl's Law Speed Up         | 8.5x     |
| Fraction of Accelerated Code  | 99%      |

| Accelerated time breakdown | Time   | Share |
|----------------------------|--------|-------|
| Time on Host               | 0.09s  | 6%    |
| Time on Accelerator        | 1.54s  | 94%   |
| Data Transfer Tax          | 0s     |       |
| Invocation Tax             | <0.01s |       |
| Code Transfer Tax          | <0.01s |       |
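The headline numbers of the summary are linked by simple arithmetic. The sketch below recomputes them from the timings shown in the report. The variable names are illustrative, and the assumption that the report derives its values exactly this way is ours, although the results agree with the reported figures after rounding.

```python
# Timings (seconds) read from the report summary.
cpu_total = 13.84     # "Original": whole program on the CPU
region_cpu = 13.75    # CPU time of the single offloaded loop
region_acc = 1.54     # modeled time of that loop on Gen9 GT2

# Time that stays on the host once the loop is offloaded.
host_time = cpu_total - region_cpu
accelerated_total = host_time + region_acc

region_speedup = region_cpu / region_acc
fraction_accelerated = region_cpu / cpu_total


def amdahl(f, s):
    """Overall speed-up when a fraction f of the run time is sped up by s."""
    return 1.0 / ((1.0 - f) + f / s)


overall_speedup = amdahl(fraction_accelerated, region_speedup)

print(f"accelerated total:    {accelerated_total:.2f}s")
print(f"region speed-up:      {region_speedup:.1f}x")
print(f"overall speed-up:     {overall_speedup:.1f}x")
print(f"fraction accelerated: {fraction_accelerated:.0%}")
```

With these inputs the script prints 1.63s, 8.9x, 8.5x and 99%, which are the "Accelerated", "Speed Up for Accelerated Code", "Amdahl's Law Speed Up" and "Fraction of Accelerated Code" entries.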

#### What might be limiting your performance on the accelerator?



#### What should you offload?

#### Top offloaded

| Location                                                        | Speed Up | Execution Time          | Bounded By | Data Transfer |
|-----------------------------------------------------------------|----------|-------------------------|------------|---------------|
| [loop in iso_3dfd\$omp\$parallel@52 at iso-3dfd_parallel.cc:53] | 8.94x    | CPU 13.75s<br>GPU 1.54s | LLC_BW     | <0.01MB       |

#### Which are the bad candidates for offload, and why?

#### Top non offloaded

| Location                                      | Data Transfer | Execution Time           | Why Not Offloaded                                                                                                |
|-----------------------------------------------|---------------|--------------------------|------------------------------------------------------------------------------------------------------------------|
| [loop in iso_3dfd at iso-3dfd_parallel.cc:85] | 0.09MB        | CPU 13.75s<br>GPU 19.45s | Not profitable.                                                                                                  |
| [loop in main at iso-3dfd_main.cc:194]        | 0MB           | CPU 0.02s<br>GPU <0.01s  | Total time is too small for reliable modelling. Use --loop-filter-threshold=0 to model such small offloads.      |




#### **TOP OFFLOADED IN DEPTH**

- Provides a detailed description of each loop that is a candidate for offload
  - Timings (total time, time on the accelerator, speed-up)
  - Offload metrics (offload tax, data transfers)
  - Memory traffic (DRAM, L3, L2, L1), trip count
  - Highlights which part of the code should run on the accelerator

Screenshot: Offloaded Regions view of the report for [loop in iso\_3dfd\$omp\$parallel@52 at iso-3dfd\_parallel.cc:53]. The source pane shows the OpenMP loop nest over iz, iy and ix (iso-3dfd\_parallel.cc, lines 51 to 62) next to the per-loop trip counts, L3, LLC and memory traffic, FLOP counts, data transferred between GPU and CPU, and diagnostics.

This is where you will use DPC++ or OpenMP target for offload.

#### NON OFFLOADED IN DEPTH

- Explains why Advisor doesn't recommend a given loop for offload
  - Dependency issues
  - Not profitable
  - Total time is too small

Screenshot: Non Offloaded Regions view of the report.

| Hierarchy                                      | Estimated Execution Time on Accelerator (+Host) (s) | Fraction Offloaded (%) | Why Not Offloaded                                                                                           | Potential Speed Up for Whole Region |
|------------------------------------------------|-----------------------------------------------------|------------------------|-------------------------------------------------------------------------------------------------------------|-------------------------------------|
| [loop in iso_3dfd at iso-3dfd_parallel.c…]     |                                                     | 100.00                 | Not profitable.                                                                                             | 0.7068x                             |
| [loop in main at iso-3dfd_main.cc:194]         | 0.020                                               | 0                      | Total time is too small for reliable modelling. Use --loop-filter-threshold=0 to model such small offloads. | 41338.1565x                         |
| [loop in initialize at iso-3dfd_main.cc:5…]    |                                                     |                        | Total time is too small for reliable modelling. Use --loop-filter-threshold=0 to model such small offloads. | 640.4016x                           |





The program tree offers another view of the proportion of code that can be offloaded to the accelerator.




## **COMMAND LINE TIPS**



### **BEFORE YOU START TO USE OFFLOAD ADVISOR**

The only strict requirement for compilation and linking is full debug information:

- -g: Requests full debug information (compiler and linker)

Offload Advisor supports any optimization level, but the following settings are recommended:

- -O2: Requests moderate optimization
- -no-ipo: Disables inter-procedural optimizations that may prevent Offload Advisor from collecting performance data (Intel® C++ & Fortran Compiler specific)



### **SOURCE OFFLOAD ADVISOR**

- To set up the Intel® Advisor Beta environment, source one of the following shell scripts:
  - source <ONEAPI\_INSTALL\_DIR>/setvars.sh
  - source <ADV\_INSTALL\_DIR>/env/vars.sh
- The script sets all required Intel Advisor environment variables, including APM, which points to <ADV\_INSTALL\_DIR>/perfmodels, the location of the Offload Advisor scripts in the Intel® Advisor Beta installation directory



The performance modeling functionality is available on Linux\* OS only



### **HOW DOES IT WORK?**

Collecting data and generating the output is straightforward in batch mode:

advixe-python <ADV\_INSTALL\_DIR>/perfmodels/run\_oa.py <project\_dir>
 --config gen9 --out-dir <project\_dir> [--options] -- <app> <app\_args>

- By default, run\_oa.py marks up all regions and only selects the most profitable ones for analysis
- To generate report.html, use the following command:

u31313@s001-n004:/opt/intel/inteloneapi/advisor/latest/perfmodels\$ ls accelerators analyze.py collect.py debug.so environ.py oa\_wrapper.so shared.so toml analyze\_impl.so collect\_impl.so compute\_stats.py dot\_graph.so helpers run\_oa.py template tree.so



### **RUN\_OA.PY: WHAT IS RUNNING BEHIND?**

run\_oa.py






### **OFFLOAD ADVISOR OUTPUT OVERVIEW**

- report.html: Main report in HTML format
- report.csv and whole\_app\_metric.csv: Report data in CSV format
- program\_tree.dot: A graphical representation of the call tree showing the offloadable and accelerated regions
- program\_tree.pdf: A graphical representation of the call tree
  - Generated if the DOT (GraphViz\*) utility is installed
  - 1:1 conversion from the program\_tree.dot file
- JSON and LOG files that contain data used to generate the HTML report and logs, primarily used for debugging and reporting bugs and issues
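If program\_tree.pdf was not produced because Graphviz was missing at analysis time, the .dot file can be rendered afterwards. The following is a minimal sketch, assuming the standard Graphviz `dot -Tpdf` invocation; the helper names and the `oa_report` directory are hypothetical and not part of Offload Advisor.

```python
import shutil
import subprocess
from pathlib import Path


def dot_to_pdf_command(dot_file):
    """Return the Graphviz command that renders dot_file as a PDF next to it."""
    dot_path = Path(dot_file)
    pdf_path = dot_path.with_suffix(".pdf")
    return ["dot", "-Tpdf", str(dot_path), "-o", str(pdf_path)]


def render_program_tree(out_dir):
    """Render <out_dir>/program_tree.dot, or return None if that is not possible."""
    dot_file = Path(out_dir) / "program_tree.dot"
    if shutil.which("dot") is None or not dot_file.exists():
        return None  # Graphviz is not installed or there is no .dot file
    cmd = dot_to_pdf_command(dot_file)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])


print(dot_to_pdf_command("oa_report/program_tree.dot"))
```

Calling `render_program_tree("oa_report")` returns the path of the generated PDF, or `None` when the conversion cannot be done.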



### WANT TO AVOID DEPENDENCY CHECKING?

- Dependency analysis adds a lot of time to the collection, so you might want to skip it.
- Add the option -c basic for the collection:

advixe-python <ADV\_INSTALL\_DIR>/perfmodels/run\_oa.py <path\_to\_result\_dir> --config gen9 -c basic --out-dir <path\_to\_result\_dir> [--options] -- <app>

- Add the option --assume-parallel for the analysis:

advixe-python \$APM/analyze.py <project\_dir> --assume-parallel --config gen9 [--options] -- <app\_binary> [app\_options]



- You might want to run the command lines independently to tweak the parameters.
- A good start is to run the run\_oa.py script with --dry-run to see the list of command lines and retrieve the cache configuration of the target accelerator.
- The following command prints the command lines for the separate analyses without running the Advisor collection:
  - advixe-python <ADV\_INSTALL\_DIR>/perfmodels/run\_oa.py <path\_to\_result\_dir> --dry-run --config gen9 -c basic --out-dir <path\_to\_result\_dir> [--options] -- <app>
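When the batch command is launched from a script, it can help to assemble it programmatically. The helper below is a hypothetical sketch that only uses the run\_oa.py options shown on these slides (--config, -c basic, --dry-run, --out-dir); the install path and application name are placeholders.

```python
import shlex


def run_oa_command(adv_install_dir, result_dir, app, app_args=(),
                   config="gen9", basic_collection=False, dry_run=False):
    """Build the run_oa.py command line as a list of arguments."""
    cmd = ["advixe-python", f"{adv_install_dir}/perfmodels/run_oa.py", result_dir]
    if dry_run:
        cmd.append("--dry-run")      # only print the underlying command lines
    cmd += ["--config", config]
    if basic_collection:
        cmd += ["-c", "basic"]       # skip the dependency collection
    cmd += ["--out-dir", result_dir, "--", app, *app_args]
    return cmd


# Placeholder install directory and application.
cmd = run_oa_command("/opt/intel/inteloneapi/advisor/latest", "./oa_report",
                     "./my_app", basic_collection=True, dry_run=True)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run`, while `shlex.join` gives a string that is safe to paste into a shell.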

- We start with the survey:
  - advixe-cl --collect=survey --auto-finalize --stackwalk-mode=online --static-instruction-mix --project-dir=./oa\_report -- my\_app
- The survey times your application and runs some static analysis on the binary without impact on the application's performance:
  - Sampling
  - Binary static analysis
  - Static code analysis (compiler and debug information)

- We continue with the trip counts and cache simulation:
  - advixe-cl --collect=tripcounts --return-app-exitcode --flop --stacks --auto-finalize --ignore-checksums --enable-data-transfer-analysis --track-heap-objects --profile-jit --cache-sources --enable-cache-simulation --cache-config=1:8w:32k/1:64w:512k/1:16w:8m --project-dir=./oa\_report -- my\_app
- The Trip Counts analysis with --flop and --enable-cache-simulation collects:
  - The number of iterations in your loops
  - The number of operations
  - The data transfers between the memory subsystems configured with --cache-config
- This analysis typically slows the application down by about 10x
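The cache configuration string describes one cache level per `/`-separated entry. The parser below is a small sketch that assumes each entry reads `count:ways:size`, with a `w` suffix on the associativity and a `k`/`m`/`g` suffix on the size; that interpretation is our assumption and should be checked against the Advisor documentation.

```python
from dataclasses import dataclass


@dataclass
class CacheLevel:
    count: int        # number of cache instances at this level (assumed meaning)
    ways: int         # associativity
    size_bytes: int   # capacity in bytes


_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_cache_config(text):
    """Split a string such as '1:8w:32k/1:64w:512k' into CacheLevel entries."""
    levels = []
    for part in text.split("/"):
        count, ways, size = part.split(":")
        levels.append(CacheLevel(
            count=int(count),
            ways=int(ways.rstrip("w")),
            size_bytes=int(size[:-1]) * _UNITS[size[-1].lower()],
        ))
    return levels


levels = parse_cache_config("1:8w:32k/1:64w:512k/1:16w:8m")
for level in levels:
    print(level)
```

Under this reading, the second entry is a 512 KB cache, which matches the 512 KB L3 listed in the Gen9 GT2 configuration of the report.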

- Optional step: dependency analysis
  - advixe-cl --collect=dependencies --loops="total-time>5" --filter-reductions --loop-call-count-limit=16 --project-dir=./oa\_report -- my\_app
- Detects data dependencies in your loops by checking their memory accesses
- This analysis adds significant overhead to the run time
- It is up to the user to define how loops are selected for this analysis. Here we use --loops="total-time>5", which selects all loops that account for more than 5% of the overall time

- Last step: generating the report
- Two cases, depending on whether you ran the dependency analysis:

```
# You ran the dependency analysis:
advixe-python $APM/analyze.py ./oa_report --config gen9 --out-dir oa_report -- my_app

# You didn't run the dependency analysis:
advixe-python $APM/analyze.py ./oa_report --config gen9 --assume-parallel --out-dir oa_report -- my_app
```
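The separate steps can be chained from a small driver script. The sketch below only builds and prints the command lines; it uses a reduced set of the options quoted above, written in the double-dash form, and the function name and defaults are hypothetical. Treat the exact flag spellings as assumptions to verify with `advixe-cl --help`.

```python
import os
import shlex


def offload_advisor_steps(project_dir, app, run_dependencies=True,
                          config="gen9", apm=None):
    """Return the command lines for survey, trip counts, dependencies and analysis."""
    if apm is None:
        # APM is set by the Advisor environment script; fall back to a placeholder.
        apm = os.environ.get("APM", "<ADV_INSTALL_DIR>/perfmodels")
    project = f"--project-dir={project_dir}"
    steps = [
        ["advixe-cl", "--collect=survey", "--auto-finalize",
         "--static-instruction-mix", project, "--", app],
        ["advixe-cl", "--collect=tripcounts", "--flop", "--stacks",
         "--enable-cache-simulation", "--enable-data-transfer-analysis",
         project, "--", app],
    ]
    analyze = ["advixe-python", f"{apm}/analyze.py", project_dir, "--config", config]
    if run_dependencies:
        steps.append(["advixe-cl", "--collect=dependencies",
                      "--loops=total-time>5", "--filter-reductions",
                      project, "--", app])
    else:
        # Without dependency data, the analysis must assume loops are parallel.
        analyze.append("--assume-parallel")
    analyze += ["--out-dir", project_dir, "--", app]
    steps.append(analyze)
    return steps


for step in offload_advisor_steps("./oa_report", "./my_app", run_dependencies=False):
    print(shlex.join(step))
```

Skipping the dependency collection removes one step and adds --assume-parallel to the analysis, mirroring the two cases above.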

## UNDERSTANDING THE PERFORMANCE MODELING




#### THE MECHANISMS BEHIND

#### First order analytical modeling pillars:

- Compute throughput model
- Memory sub-system model
- Offload data transfer modeling

Figure: for two regions X and Y, the diagram compares the following bars:

- Execution time on the baseline platform (CPU)
- Execution time on the accelerator, estimated assuming the region is bound exclusively by compute
- Execution time on the accelerator, estimated assuming the region is bound exclusively by caches/memory
- Offload tax estimate (data transfer + invoke)
- Final estimated time on the target platform (e.g. GPU), shown as X' and Y'

X is profitable to accelerate, since t(X) > t(X'). Y has too much overhead and is not accelerable, since t(Y) < t(Y').

$$t_{region} = \max(t_{compute}, t_{memory\ subsystem}) + t_{data\ transfer\ tax} + t_{kernel\ launch}$$
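The region formula and the profitability test can be written down in a few lines. This is a minimal sketch of the formula on the slide, not of Offload Advisor's internal implementation; all input timings below are hypothetical and chosen only to reproduce the X (profitable) and Y (too much overhead) cases.

```python
def region_time(t_compute, t_memory, t_data_transfer, t_kernel_launch):
    """Estimated accelerator time: the slower of compute and memory, plus offload taxes."""
    return max(t_compute, t_memory) + t_data_transfer + t_kernel_launch


def is_profitable(t_cpu, t_accelerator):
    """A region is worth offloading when the accelerator estimate beats the CPU time."""
    return t_cpu > t_accelerator


# Hypothetical region X: long CPU time, moderate memory-bound accelerator time.
t_x_cpu = 10.0
t_x_acc = region_time(t_compute=0.8, t_memory=1.5,
                      t_data_transfer=0.2, t_kernel_launch=0.01)

# Hypothetical region Y: short CPU time, dominated by the data transfer tax.
t_y_cpu = 0.5
t_y_acc = region_time(t_compute=0.05, t_memory=0.1,
                      t_data_transfer=0.6, t_kernel_launch=0.01)

print(f"X: CPU {t_x_cpu:.2f}s, accelerator {t_x_acc:.2f}s, "
      f"offload: {is_profitable(t_x_cpu, t_x_acc)}")
print(f"Y: CPU {t_y_cpu:.2f}s, accelerator {t_y_acc:.2f}s, "
      f"offload: {is_profitable(t_y_cpu, t_y_acc)}")
```

Region Y illustrates why a loop with a tiny kernel time can still be rejected: the transfer and launch taxes are added on top of the `max`, so they cannot be hidden behind compute or memory time.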


## **GPU ROOFLINE ANALYSIS**



### INTEL<sup>®</sup> GEN9 MEMORY HIERARCHY

- Intel® Graphics Compute Architecture shares DRAM with the CPU
- Level-3 (L3) data cache: an asset shared within a slice
- Shared Local Memory (SLM): a dedicated structure within the L3 that supports the work-group local memory address space
- Graphics Technology Interface (GTI): a dedicated unit that connects the graphics architecture to the rest of the SoC components
- The rest of the SoC memory hierarchy includes the large Last-Level Cache (LLC, shared between CPU and GPU), possibly embedded DRAM, and finally the system DRAM



A view of the SoC chip level memory hierarchy and its theoretical peak bandwidths for the compute architecture of Intel processor graphics gen9





#### FIND EFFECTIVE OPTIMIZATION STRATEGIES

#### GPU Roofline Performance Insights

- Highlights poorly performing loops
- Shows performance 'headroom' for each loop
  - Which can be improved
  - Which are worth improving
- Shows likely causes of bottlenecks
  - Memory bound vs. compute bound
- Suggests next optimization steps
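The notions of headroom and of memory bound versus compute bound follow from the basic roofline formula: attainable performance is the minimum of the compute peak and arithmetic intensity times bandwidth. The sketch below illustrates this with hypothetical machine numbers (the 400 GFLOP/s peak is invented for the example, and only one memory roof is modeled).

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(compute peak, FLOP/byte * GB/s)."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def classify(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Compare the loop's intensity with the ridge point of the roofline."""
    ridge = peak_gflops / bandwidth_gbs
    return "memory bound" if arithmetic_intensity < ridge else "compute bound"


def headroom(measured_gflops, arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """How many times faster the loop could run before hitting its roof."""
    roof = attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs)
    return roof / measured_gflops


PEAK = 400.0   # GFLOP/s, hypothetical compute peak
DRAM = 24.0    # GB/s, hypothetical memory bandwidth

# Hypothetical loops: (name, FLOP/byte, measured GFLOP/s)
for name, ai, measured in [("stencil", 0.5, 6.0), ("matmul", 32.0, 250.0)]:
    print(name, classify(ai, PEAK, DRAM),
          f"headroom {headroom(measured, ai, PEAK, DRAM):.1f}x")
```

A loop far below its roof has room for improvement in its current regime, while a loop sitting on the memory roof only improves by raising its arithmetic intensity or by using a faster memory level.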









The Roofline model on GPU is a technical preview feature and is not available by default. To enable it:

export ADVIXE\_EXPERIMENTAL=gpu-profiling

To run the GPU Roofline analysis in the Intel® Advisor CLI:

Run the Survey analysis with the **--enable-gpu-profiling** option:

advixe-cl --collect=survey --enable-gpu-profiling --project-dir=<my\_project\_directory> --search-dir src:r=<my\_source\_directory> -- ./myapp [app\_parameters]

Run the Trip Counts and FLOP analysis with the **--enable-gpu-profiling** option:

advixe-cl --collect=tripcounts --stacks --flop --enable-gpu-profiling --project-dir=<my\_project\_directory> --search-dir src:r=<my\_source\_directory> -- ./myapp [app\_parameters]

Generate a GPU Roofline report:

advixe-cl --report=roofline --gpu --project-dir=<my\_project\_directory> --report-output=roofline.html

Open the generated roofline.html in a web browser to visualize GPU performance.



### **ROOFLINE ANALYSIS ON INTEL® GPU**






#### **LEGAL NOTICES & DISCLAIMERS**

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Statements in this document that refer to Intel's plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel's results and plans is included in Intel's SEC filings, including the annual report on Form 10-K.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Intel, the Intel logo, Pentium, Celeron, Atom, Core, Xeon, Movidius and others are trademarks of Intel Corporation in the U.S. and/or other countries. \*Other names and brands may be claimed as the property of others.

© 2018 Intel Corporation.



## BACKUP




Intel Confidential – For use under NDA only

## THE MECHANISMS BEHIND 2/2

We minimize the total time spent in this loop hierarchy by varying the offload strategies $U$ (offload/non-offload, number of threads for each component $loop_i$ of the loop nest).

**Objective function**:
$$T_{all} = \min_{U = \{uf_1, uf_2, \ldots\}} \left(\sum_i T_i + t_{data\ transfer} + t_{invoke} + T_{cpu}\right)$$

Reject loop nests for which $T(x86) / T_{all}(x86 + \text{"X"}) < 1.0$.

$$T_i = \max \begin{cases} T_i^{Comp\ only} \\ T_i^{M_k\ only}\left(M_i^k\right) = \dfrac{M_i^k}{BW_k} \end{cases}$$

This is effectively a "balance" (throughput) model, applied under algorithmic constraints (dependencies and trip count/granularity).
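The objective function can be illustrated with a brute-force search over offload decisions. The sketch below is a deliberate simplification and not Advisor's algorithm: it treats the loops as independent siblings rather than a nest, ignores thread counts, and charges the transfer and invoke taxes per offloaded loop. All traffic, bandwidth and timing values are hypothetical.

```python
from itertools import product


def loop_time_on_accelerator(t_comp, traffic_gb, bandwidth_gbs):
    """T_i = max(compute-only time, M_k / BW_k for every memory level k)."""
    memory_times = [traffic_gb[k] / bandwidth_gbs[k] for k in traffic_gb]
    return max([t_comp] + memory_times)


def best_strategy(loops, bandwidth_gbs):
    """Try every offload/non-offload combination and keep the fastest one."""
    best_total, best_choice = None, None
    for choice in product([False, True], repeat=len(loops)):
        total = 0.0
        for loop, offload in zip(loops, choice):
            if offload:
                total += loop_time_on_accelerator(
                    loop["t_comp"], loop["traffic_gb"], bandwidth_gbs)
                total += loop["t_transfer"] + loop["t_invoke"]
            else:
                total += loop["t_cpu"]   # contributes to T_cpu
        if best_total is None or total < best_total:
            best_total, best_choice = total, choice
    return best_total, best_choice


# Hypothetical bandwidths (GB/s) and loops (times in seconds, traffic in GB).
BW = {"L3": 200.0, "LLC": 100.0, "DRAM": 24.0}
loops = [
    {"t_cpu": 10.0, "t_comp": 0.5, "t_transfer": 0.1, "t_invoke": 0.01,
     "traffic_gb": {"L3": 100.0, "LLC": 150.0, "DRAM": 12.0}},
    {"t_cpu": 0.2, "t_comp": 0.01, "t_transfer": 0.5, "t_invoke": 0.01,
     "traffic_gb": {"L3": 1.0, "LLC": 1.0, "DRAM": 1.0}},
]

t_all, choice = best_strategy(loops, BW)
t_x86 = sum(loop["t_cpu"] for loop in loops)
print(f"T_all = {t_all:.2f}s, offload decisions = {choice}")
print("accept" if t_x86 / t_all >= 1.0 else "reject")
```

In this example the first loop is LLC bound on the accelerator (150 GB at 100 GB/s) and still far faster than on the CPU, while the second loop stays on the host because its transfer tax exceeds its CPU time.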



