

## Diving into NVIDIA Grace Hopper and NVIDIA Grace CPU Superchips: capabilities and performance

Filippo Spiga | March 26<sup>th</sup>, 2024





THIS INFORMATION IS INTENDED TO OUTLINE OUR GENERAL PRODUCT DIRECTION. MANY OF THE PRODUCTS AND FEATURES DESCRIBED HEREIN REMAIN IN VARIOUS STAGES AND WILL BE OFFERED ON A WHEN-AND-IF-AVAILABLE BASIS. THIS ROADMAP DOES NOT CONSTITUTE COMMITMENT, PROMISE, OR LEGAL OBLIGATION AND IS SUBJECT TO CHANGE AT THE SOLE DISCRETION OF NVIDIA. THE DEVELOPMENT, RELEASE, AND TIMING OF FEATURES OR FUNCTIONALITIES DESCRIBED FOR OUR PRODUCTS REMAINS AT THE SOLE DISCRETION OF NVIDIA. NVIDIA WILL HAVE NO LIABILITY FOR FAILURE TO DELIVER OR DELAY IN THE DELIVERY OF ANY OF THE PRODUCTS, FEATURES, OR FUNCTIONS SET FORTH IN THIS DOCUMENT.







- Products
- Capabilities
- Software
- Use-Cases



# Platforms





# **NVIDIA Superchip Modules**

Grace Hopper (GH200) on left || Grace CPU Superchip on right



### High Performance Power Efficient Cores

144 flagship Arm Neoverse V2 Cores with SVE2 4x128b SIMD per core

### Fast On-Chip Fabric

3.2 TB/s of bi-section bandwidth connects CPU cores, NVLink-C2C, memory, and system IO

### High-Bandwidth Low-Power Memory

Up to 960GB of data center enhanced LPDDR5X Memory that delivers up to 1TB/s of memory bandwidth

### **Fast and Flexible CPU IO**

Up to 8x PCIe Gen5 x16 interfaces. PCIe Gen 5 delivers up to 128 GB/s, 2X more bandwidth than PCIe Gen 4.

### **Full NVIDIA Software Stack**

AI, Omniverse

## **NVIDIA Grace CPU Superchip**

2X Performance at the Same Power for the Modern Data Center





## **Grace Simplifies System Design and Workload Optimization**

Reduces NUMA & sub-NUMA Bottlenecks

### Grace Server: Grace C2 Superchip

- 2 NUMA Nodes
- 2 Compute Dies
- 500 Watts (CPU + MEM)
- 900 GB/s worst case n to n

Conventional 2-Socket Server Example: 2x AMD Genoa, Native NPS=4



## **NVIDIA GH200 Grace Hopper Superchip**

Processor For The Era of Accelerated Computing And Generative AI



#### GH200 with HBM3

Available for order

72 Core Grace CPU | 4 PFLOPS Hopper GPU | 96 GB HBM3 | 4 TB/s | 900 GB/s NVLink-C2C

- 7X bandwidth to GPU vs PCIe Gen 5
- Combined 576 GB of fast memory
- 1.2x capacity and bandwidth vs H100
- Full NVIDIA Compute Stack

#### GH200 with HBM3e

Available late Q2 2024

- World's first HBM3e GPU
- Combined 624 GB of fast memory
- 1.7x capacity and 1.5x bandwidth vs H100
- Full NVIDIA Compute Stack

#### NVLink Dual GH200 System

Available late Q2 2024

144 Core Grace CPU | 8 PFLOPS Hopper GPU | 288 GB HBM3e | 10 TB/s | 900 GB/s NVLink-C2C

- Simple to deploy MGX-compatible design
- Combined 1.2 TB fast memory
- 3.5x capacity and 3x bandwidth vs H100
- Full NVIDIA Compute Stack
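The combined-memory and bandwidth figures quoted for the GH200 variants follow from simple sums and ratios. The sketch below reproduces them in Python. Two inputs are assumptions inferred from the deck rather than stated on this slide: 480 GB of LPDDR5X per GH200 (576 GB combined minus 96 GB of HBM3) and 144 GB of HBM3e per GPU (half of the 288 GB quoted for the dual system). The function names are illustrative.

```python
"""Back-of-the-envelope check of the GH200 memory and bandwidth claims."""

# Assumed per-superchip capacities in GB (inferred, see the lead-in).
LPDDR5X_GB = 480   # 576 GB combined minus 96 GB of HBM3
HBM3_GB = 96
HBM3E_GB = 144     # half of the 288 GB quoted for the dual GH200 system

# Bandwidths in GB/s as quoted in this deck.
NVLINK_C2C_GBS = 900
PCIE_GEN5_GBS = 128


def combined_memory_gb(cpu_gb, gpu_gb, superchips=1):
    """Total CPU + GPU memory across `superchips` identical superchips."""
    return superchips * (cpu_gb + gpu_gb)


def ratio(numerator, denominator):
    """Plain ratio, used here for the bandwidth comparison."""
    return numerator / denominator


if __name__ == "__main__":
    print("GH200 HBM3  :", combined_memory_gb(LPDDR5X_GB, HBM3_GB), "GB")
    print("GH200 HBM3e :", combined_memory_gb(LPDDR5X_GB, HBM3E_GB), "GB")
    print("Dual GH200  :", combined_memory_gb(LPDDR5X_GB, HBM3E_GB, 2), "GB")
    print("C2C vs PCIe : %.1fx" % ratio(NVLINK_C2C_GBS, PCIE_GEN5_GBS))
```

Under these assumptions the dual-system total is the value the slide rounds to 1.2 TB, and the NVLink-C2C to PCIe Gen 5 ratio comes out close to the quoted 7X.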






## **Energy Efficient Design**

More Efficient Computation and Data Movement



[Figure: energy per bit and per flop on Grace Hopper (NVLink-C2C, CPU, GPU). Labels: 5 pJ/Bit, 7X less energy; 1.3 pJ/Bit, 5X less energy; 12 pJ/Flop DP; 62 pJ/Flop DP; equal energy; 1.6X less energy.]





# Capabilities



## **NVIDIA Grace SIMD Highlights**

Neoverse V2 Arm IP core

- 4x128b SIMD units = 512b SIMD vector bandwidth total
- Each SIMD unit can retire NEON or SVE2 instructions
- On this architecture, SVE2 and NEON have the same peak performance
- SVE2 can vectorize more complex codes and supports more data types than NEON.
- NEON doesn't require predicate calculation
  - Neither does VLS SVE

For more in-depth core u-arch details: <u>Arm<sup>®</sup> Neoverse™ V2 Core Technical Reference Manual</u>
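The SIMD bullet points translate directly into per-core throughput. The following is a minimal sketch of that arithmetic, not an official performance model: it assumes each 128-bit unit can issue one fused multiply-add per lane per cycle (counted as 2 flops), and the 3.0 GHz clock in the example is a hypothetical value chosen for illustration, since this slide does not state a frequency.

```python
"""Per-core SIMD width and a peak FP64 estimate for a Neoverse V2 based CPU."""

SIMD_UNITS = 4          # from the slide: 4 SIMD units per core
SIMD_UNIT_BITS = 128    # from the slide: 128 bits each


def lanes_per_cycle(element_bits):
    """Number of elements of the given width processed per cycle per core."""
    return SIMD_UNITS * SIMD_UNIT_BITS // element_bits


def flops_per_cycle(element_bits, flops_per_lane=2):
    """Assumes one fused multiply-add (2 flops) per lane per cycle."""
    return lanes_per_cycle(element_bits) * flops_per_lane


def peak_gflops(cores, clock_ghz, element_bits=64):
    """Peak GFLOP/s = cores x flops per cycle x clock in GHz."""
    return cores * flops_per_cycle(element_bits) * clock_ghz


if __name__ == "__main__":
    print("FP64 lanes per core per cycle:", lanes_per_cycle(64))
    print("FP32 lanes per core per cycle:", lanes_per_cycle(32))
    print("FP64 flops per core per cycle:", flops_per_cycle(64))
    # Hypothetical clock; substitute the frequency of the actual part.
    print("Peak FP64 GFLOP/s, 144 cores at 3.0 GHz:", peak_gflops(144, 3.0))
```

Because NEON and SVE2 retire on the same four units, the estimate is the same for either instruction set, which matches the "same peak performance" remark on the slide.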





[Figure: core pipeline diagram with **in-order** and **out-of-order** sections]



## Grace SoC Memory Subsystem Compute and Data-mover Architecture

- Separate L1 data and instruction caches per core
  - L1 64KB, 4-way set associative, 64B cache line
- Private, unified data and instruction L2 cache per core
  - 1MB, 8-way set associative
- Scalable Coherency Fabric
  - Shared, uniform 117MB of L3 cache for entire chip.
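The cache sizes and associativities listed here determine the number of sets in each cache. A small sketch of that calculation follows. It assumes KB and MB mean 1024-based units and that the L2 cache uses the same 64 B line as L1 (the slide states the line size only for L1). The helper names are illustrative.

```python
"""Cache geometry implied by the listed sizes and associativities."""

KIB = 1024
MIB = 1024 * KIB


def num_sets(size_bytes, ways, line_bytes=64):
    """Sets = capacity / (ways x line size) for a set-associative cache."""
    bytes_per_set = ways * line_bytes
    if size_bytes % bytes_per_set:
        raise ValueError("capacity is not a multiple of ways x line size")
    return size_bytes // bytes_per_set


def index_bits(sets):
    """Address bits needed to select a set (sets must be a power of two)."""
    if sets <= 0 or sets & (sets - 1):
        raise ValueError("set count must be a power of two")
    return sets.bit_length() - 1


if __name__ == "__main__":
    l1_sets = num_sets(64 * KIB, ways=4)
    l2_sets = num_sets(1 * MIB, ways=8)  # 64 B line assumed for L2
    print("L1:", l1_sets, "sets,", index_bits(l1_sets), "index bits")
    print("L2:", l2_sets, "sets,", index_bits(l2_sets), "index bits")
```

Knowing the set count helps when reasoning about conflict misses, for example when array strides are a multiple of sets x line size.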

| Superchip    | Memory capacity (GB) | OMP_NUM_THREADS | Expected TRIAD bandwidth (GB/s) |
|--------------|---------------|-----------------|-----------------------------|
| Grace-Hopper | 120           | 72              | 450+                        |
| Grace-Hopper | 480           | 72              | 340+                        |
| Grace CPU    | 240           | 144             | 900+                        |
| Grace CPU    | 480           | 144             | 900+                        |
| Grace CPU    | 960           | 144             | 680+                        |

Source: https://docs.nvidia.com/grace-performance-tuning-guide.pdf
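The TRIAD figures in the table come from the STREAM kernel `a[i] = b[i] + scalar * c[i]`, which STREAM accounts as two reads and one write of an 8-byte value per iteration. The sketch below shows only that accounting in Python. It is not a substitute for the real benchmark, which is compiled C or Fortran with OpenMP, and interpreted Python will report numbers far below the table. The function names are illustrative.

```python
"""How a STREAM Triad bandwidth figure is derived from size and time."""
import time

BYTES_PER_DOUBLE = 8


def triad(b, c, scalar):
    """The Triad kernel: a[i] = b[i] + scalar * c[i]."""
    return [bi + scalar * ci for bi, ci in zip(b, c)]


def triad_bytes(n):
    """STREAM counts two reads and one write of 8 bytes per iteration."""
    return 3 * BYTES_PER_DOUBLE * n


def bandwidth_gbs(n, seconds):
    """Bandwidth in GB/s (10^9 bytes per second) for n iterations."""
    return triad_bytes(n) / seconds / 1e9


if __name__ == "__main__":
    n = 1_000_000
    b = [1.0] * n
    c = [2.0] * n
    start = time.perf_counter()
    a = triad(b, c, 3.0)
    elapsed = time.perf_counter() - start
    # Illustrates the formula only; Python lists are not packed doubles.
    print("nominal bandwidth: %.3f GB/s" % bandwidth_gbs(n, elapsed))
```

With this accounting, a run that moves 24 GB of nominal traffic in one second is reported as 24 GB/s, regardless of any extra write-allocate traffic the hardware generates.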



- 72 Arm Neoverse V2 cores with 2X Perf/W over today's server
- 117 MB L3 cache





#### Grace SoC STREAM Triad

[Figure: STREAM Triad results on the Grace SoC plotted against thread count (axis ticks at 16, 24, 32, 40, 48, 56, 64 threads)]





## **Grace Hopper Superchip**

GPU can access CPU memory at CPU memory speeds



### **GPU Memory is Visible to the Operating System**

Standard operating system commands work on the GPU

[Terminal screenshot, `nvidia@localhost`: `numactl -H` lists NUMA nodes 0 through 8 with the cpus, size, and free lines for each node, followed by the node distance matrix, whose legible entries are 10, 80, and 255. A `free` command then reports memory totals.]



node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 5

> Hopper GPU appears to the OS as a NUMA node with no CPU cores

> Total system memory capacity is CPU (480GB) + GPU (96GB)

|       | total | used | free | shared | buff/cache | available |
|-------|-------|------|------|--------|------------|-----------|
| Mem:  | 573   | 11   | 558  | 1      | 3          | 541       |
| Swap: | 0     | 0    | 0    |        |            |           |

`numactl` can place a CPU application's data in GPU memory:

`nvidia@localhost:/home/nvidia/jlinford/mt-dgemm/src$ numactl -m1 ./mt-dgemm.nvpl 5000 1 1 1 0 1 1`
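Since the Hopper GPU appears as a NUMA node with no CPU cores, the node number to pass to `numactl -m` can be found by looking for nodes whose `cpus:` line is empty. The sketch below parses text in the `numactl -H` format. The `SAMPLE` text is a shortened, made-up two-node example, not the output from the slide, and `cpuless_nodes` is an illustrative helper name.

```python
"""Find NUMA nodes without CPU cores in `numactl -H` style output."""
import re

# Hypothetical, shortened sample in the `numactl -H` format.
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 4096 MB
node 0 free: 4000 MB
node 1 cpus:
node 1 size: 1024 MB
node 1 free: 1000 MB
node distances:
node   0   1
  0:  10  80
  1:  80  10
"""

# Matches "node <id> cpus: <list>"; the list may be empty.
CPUS_LINE = re.compile(r"^node (\d+) cpus:(.*)$", re.MULTILINE)


def cpuless_nodes(numactl_output):
    """Return the ids of NUMA nodes that list no CPU cores."""
    nodes = []
    for match in CPUS_LINE.finditer(numactl_output):
        node_id = int(match.group(1))
        cpu_list = match.group(2).split()
        if not cpu_list:
            nodes.append(node_id)
    return nodes


if __name__ == "__main__":
    print("CPU-less NUMA nodes:", cpuless_nodes(SAMPLE))
```

On a live system the same function could be fed the captured standard output of `numactl -H`. Note that a CPU-less node is not necessarily GPU memory on every machine, so treat the result as a candidate list.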





## **Global Access to All Data**

Cache-coherent access via NVLink C2C from either processor to either physical memory

- CPU fetches GPU data into CPU L3 cache
- Cache remains **coherent** with GPU memory
- Changes to GPU memory **evict** cache line

### Hopper directly reading Grace's memory

- GPU loads CPU data via CPU L3 cache
- CPU and GPU can both hit on cached data
- Changes to CPU memory **update** cache line



# Software



## Compilers

- Use a compiler that supports Neoverse V2
- Check and update your compiler flags
- Use -mcpu=native
  - Can also use -mcpu=neoverse-v2, but -mcpu=native will "port forward"
- If possible use -Ofast
  - If fast math optimizations are not acceptable, use -O3 -ffp-contract=fast
  - For even more accuracy, use -ffp-contract=off to disable floating point operation contraction (e.g. FMA)
- Use -flto to enable link-time optimization
  - The benefits of link-time optimization vary from code to code, but can be significant
  - See e.g. <u>https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html</u> for details
- Fortran may benefit from -fno-stack-arrays
- Apps may need -fsigned-char or -funsigned-char depending on the developer's assumption
- Remember to check your dependencies
  - This is also a good opportunity to check for newer versions with improved Arm Neoverse V2 / Grace support

| Compiler     | Minimum version |
|--------------|-----------|
| GCC          | 12.2      |
| LLVM (Clang) | 16        |
| NVIDIA HPC   | 23.3      |
| Arm Compiler | 23.04     |
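A build script can check an installed compiler against the minimum versions in the table. The sketch below is one way to do that comparison. It assumes version strings are dot-separated integers and compares them numerically, component by component, so "23.11" counts as newer than "23.3". The dictionary and function names are illustrative.

```python
"""Compare a compiler version against the minimums listed in the table."""

MIN_VERSIONS = {
    "GCC": "12.2",
    "LLVM (Clang)": "16",
    "NVIDIA HPC": "23.3",
    "Arm Compiler": "23.04",
}


def parse_version(text):
    """Turn '12.2' into (12, 2); assumes dot-separated integers."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(compiler, version):
    """True if `version` is at least the table's minimum for `compiler`."""
    required = parse_version(MIN_VERSIONS[compiler])
    found = parse_version(version)
    # Pad the shorter tuple with zeros so '16' compares equal to '16.0'.
    width = max(len(required), len(found))
    required += (0,) * (width - len(required))
    found += (0,) * (width - len(found))
    return found >= required


if __name__ == "__main__":
    for name, version in [("GCC", "11.4"), ("GCC", "13.1"),
                          ("LLVM (Clang)", "16.0.6")]:
        status = "ok" if meets_minimum(name, version) else "too old"
        print(name, version, status)
```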



## **Porting Applications that use Math Libraries (MKL, OpenBLAS, etc.)**

Several library options to choose from

Prefer Netlib BLAS/LAPACK and FFTW interfaces. Building on these interfaces enables compatibility.

### NVPL

```
gcc -DUSE_CBLAS -ffast-math -mcpu=native -O3 \
    -I/PATH/TO/nvpl/include \
    -L/PATH/TO/nvpl/lib \
    -o mt-dgemm.nvpl mt-dgemm.c \
    -lnvpl_blas_lp64_gomp
```

### ArmPL

```
gcc -DUSE_CBLAS -ffast-math -mcpu=native -O3 \
    -I/opt/arm/armpl-23.10.0_Ubuntu-22.04_gcc/include \
    -L/opt/arm/armpl-23.10.0_Ubuntu-22.04_gcc/lib \
    -o mt-dgemm.armpl mt-dgemm.c \
    -larmpl_lp64
```

- **ATLAS, OpenBLAS, BLIS, ...**: Community supported with some optimizations for Neoverse V2. Works on Grace, but unlikely to outperform NVPL and ArmPL. A good compatibility option.

NVPL shared libraries:

- `libnvpl_blas_ilp64_gomp.so`
- `libnvpl_blas_ilp64_seq.so`
- `libnvpl_blas_lp64_gomp.so`
- `libnvpl_blas_lp64_seq.so`
- `libnvpl_fftw.so`
- `libnvpl_lapack_ilp64_gomp.so`
- `libnvpl_lapack_ilp64_seq.so`
- `libnvpl_lapack_lp64_gomp.so`
- `libnvpl_lapack_lp64_seq.so`
- `libnvpl_rand_mt.so`
- `libnvpl_rand.so`
- `libnvpl_scalapack_ilp64.so`
- `libnvpl_scalapack_lp64.so`
- `libnvpl_sparse.so`
- `libnvpl_tensor.so`



### **Nsight Systems – Core/Uncore metrics**

Collect core/uncore performance metrics

- `nsys profile --cpu-core-metrics=help`
- `nsys profile --cpu-socket-metrics=help`

[Screenshot: Nsight Systems timeline. Rows for CPU 21 show the Backend Stalls, Frontend Stalls, and Retiring metrics and the CPU_CYCLES, OP_RETIRED, OP_SPEC, STALL_BACKEND, STALL_FRONTEND, and STALL_SLOT counters; further rows show CPU 23, process [89166] bench, and NVTX.]





## The Grace Hopper Advantage

Full CUDA support with additional Grace memory extensions

**System Allocated**

GPU can access memory allocated from malloc(), mmap(), etc.

[Figure: app data in CPU memory and GPU memory, with GPU access to malloc() memory]

- PCIe-attached GPU: access possible with an explicit call to cudaHostRegister(), at PCIe speeds. Requires HMM patch in Linux kernel.
- Grace Hopper: cudaHostRegister() not needed; access at NVLink C2C speeds.



# **Use-cases: NEMO on GH200**

### (Presented at GTC24 - S62337)



## **NEMO Ocean Model**

A partially accelerated case utilizing unified memory on Grace-Hopper

The "Nucleus for European Modelling of the Ocean" (NEMO) is a state-of-the-art modelling framework used for research activities and forecasting services in ocean and climate sciences.

#### Setup (NEMO v4.2.0)

- **GYRE\_PISCES** benchmark
  - Scaling factor for grid resolution: **nn\_GYRE = 25**
    - ~ORCA ½ grid
    - ~80 GB RAM, fits on single GPU
- **MPI-only**, single core to every MPI process for CPU runs
- Incremental porting on Grace-Hopper (480GB) using unified memory and access-counter based migrations
  - Memory management left to runtime: **system-allocated memory** with automatic migrations
    - compile with -gpu=unified,nomanaged
  - Simply offloading loops to GPU using **OpenACC**, in 3 steps, for both "active" (TRA) and "passive" (TRC) tracer transport:
    - Horizontal (lateral) diffusion
    - Advection
    - Vertical diffusion and time-filtering





Image source: NEMO User Guide — NEMO release-4.2.2 documentation (nemo-ocean.io)





[Profiler timeline: full timestep on Grace CPU. Legible regions: `stp_MLF` [2.032 s], `trc_stp` [1.108 s], `trc_sms` [139.319...], `p2z_sms`, `trc_trp` [951.736 ms], `trc_adv` [484.887 ms], `trc_ldf` [342.800 ms], `trc_zdf`, `tra_adv` [176.453 ms], `tra_ldf` [136.433...].]














**Ported to GPU: Horizontal diffusion** 

We run multiple (i.e. 40) MPI processes on **CPU** and GPU using MPS, and use **"migratable"** system allocated memory

[Profiler timelines for this step. Speedup labels: 1.9x, 1.3x, 1.65x. Legible regions: `tra_adv` [162.748 ms], `tra_ldf`, `traldf_iso_t`, `trc_zdf`, `zdf_phy` [122.5...], `zdf_phy` [88...], `zdf_phy` [91.718 ms], `ldf_slp` [69...], `ldf_slp` [68.48...].]

#### Ported to GPU:

- Horizontal diffusion
- Advection




[Profiler timelines for this step. Speedup labels: 1.9x, 1.3x, 1.65x, 2.5x, 1.92x. Legible regions: `tra_adv` [176.453 ms], `tra_ldf` [136.433...], `tra_adv` [162.748 ms], `tra_ldf` [72...], `tra_zdf`, `traldf_iso_t`, `zdf_phy` [122.5...], `zdf_phy` [93.052 ms], `ldf_slp` [69...], `ldf_slp` [68.4...].]

#### Ported to GPU:

- Horizontal diffusion
- Advection
- Vertical diffusion and time-filtering






## Porting NEMO to Grace-Hopper using Unified Memory

A deeper look into the effect of access-counter based migrations on the partially accelerated port

#### **CPU only run**

[Profiler timeline. Legible regions: `zdf_phy` [122.5...], `ldf_slp`..., `dyn_`..., `dia_`..., `trc_zdf`, `tra_adv` [176.453 ms], `tra_ldf` [136.433...].]







#### **Tracer transport on GPU** with migrations disabled

(buffers with first-touch on CPU will never migrate to GPU memory)

GPU kernels pull data first-touched by CPU directly from **CPU memory**.

[Profiler timeline. Speedup labels: **1.46x**, **1.44x**. CPU-only reference row: `tra_adv` [176.453 ms], `tra_ldf` [136.433...]. GPU run regions: `tra_adv` [96.436...], `tra_adv_fct`, `tra_ldf` [114.705 ms], `traldf_iso_t_DATA`, `traldf_iso_t_COMP`..., `dyn_`..., with "Wait : traldf" and `cuCtxSynch`... rows.]






### **Enabling automatic page** migrations from CPU to GPU

("hot" pages migrate to GPU)







**1.46x** GPU kernels pull data first-touched by CPU directly from CPU memory

[Figure: profiler timelines. Recoverable labels include tra\_adv [96.436 ..., tra\_ldf [114.705 ms], traldf\_iso\_t\_DATA, traldf\_iso\_t\_COMP..., cuCtxSynch..., tra\_adv [70.430 ..., tra\_adv\_fct [68 ..., lbc\_lnk, traldf\_iso\_t\_DATA [127.243 ms], tra\_adv [176.453 ms], tra\_ldf [136.433 ...]

**2.02x**

GPU kernels become faster as more and more pages migrate to GPU

**1.44x**
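The difference between the two GPU runs above can be pictured with a toy model of access-counter-based migration. This is only a sketch under stated assumptions: the class name `PageMigrationModel`, the per-page counter, and the fixed threshold are all hypothetical, and the real driver and hardware policy is not described on the slide.

```cpp
#include <cstddef>
#include <vector>

// Toy model only: names, threshold and policy are illustrative assumptions,
// not the behavior of the actual driver or hardware.
enum class Location { CpuMemory, GpuMemory };

class PageMigrationModel {
public:
    PageMigrationModel(std::size_t num_pages, unsigned threshold,
                       bool migrations_enabled)
        : location_(num_pages, Location::CpuMemory),  // first touch on the CPU
          gpu_accesses_(num_pages, 0),
          threshold_(threshold),
          migrations_enabled_(migrations_enabled) {}

    // Records one GPU access and returns where that access was served from.
    Location gpu_access(std::size_t page) {
        if (location_[page] == Location::GpuMemory) {
            return Location::GpuMemory;  // already local to the GPU
        }
        // Remote access: the page still lives in CPU memory.
        ++gpu_accesses_[page];
        if (migrations_enabled_ && gpu_accesses_[page] >= threshold_) {
            location_[page] = Location::GpuMemory;  // "hot" page migrates
        }
        return Location::CpuMemory;
    }

    // Number of pages that have migrated to GPU memory so far.
    std::size_t pages_in_gpu_memory() const {
        std::size_t count = 0;
        for (Location loc : location_) {
            if (loc == Location::GpuMemory) ++count;
        }
        return count;
    }

private:
    std::vector<Location> location_;
    std::vector<unsigned> gpu_accesses_;
    unsigned threshold_;
    bool migrations_enabled_;
};
```

With migrations disabled, every access in this model keeps being served from CPU memory. With migrations enabled, repeated accesses to the same page eventually turn into local GPU accesses, which is the qualitative effect the slide attributes to "hot" pages.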



# **Use-cases: Petrobras SolverBR**

#### (Presented at GTC24 - S62529)





# SolverBR



















- SolverBR is a high-performance CPU sparse linear solver specialized for reservoir simulation applications, developed in collaboration with academic institutions.
- It combines excellence in parallel computing with the most advanced algorithms of sparse linear algebra for flow and geomechanics problems.






# **Porting to Arm**

- Build system adjustments
  - Added a new compile target for the aarch64 architecture
    - Mapped compatible compile variables
  - Initially used gcc for a "fair" and easier comparison
- Code porting adjustments
  - Removed Intel intrinsics
    - Evaluated the usage of SIMDe and SSE2NEON libraries as replacements
  - Addressed synchronization issues that led to precision errors in floating-point calculations
  - Followed best practices
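As an illustration of what removing Intel intrinsics can involve, the sketch below guards the SIMD path by architecture and keeps a scalar fallback. It is not SolverBR code and it is not the SIMDe or SSE2NEON route named above: it uses native NEON intrinsics on aarch64 as an alternative, and the function name `sum_floats` is hypothetical. It assumes a gcc or clang toolchain.

```cpp
#include <cstddef>

#if defined(__aarch64__)
#include <arm_neon.h>   // native NEON intrinsics on Arm
#elif defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics on x86
#endif

// Sums n floats, four lanes at a time where a SIMD ISA is available,
// and finishes (or runs entirely) with a scalar loop otherwise.
float sum_floats(const float* x, std::size_t n) {
    std::size_t i = 0;
    float total = 0.0f;
#if defined(__aarch64__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        acc = vaddq_f32(acc, vld1q_f32(x + i));
    }
    total = vaddvq_f32(acc);  // horizontal add across the four lanes
#elif defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; i < n; ++i) {
        total += x[i];  // scalar tail, or the whole loop without SIMD
    }
    return total;
}
```

A translation header such as SSE2NEON keeps the x86 intrinsic names and maps them to NEON, which changes less source code. Writing the NEON path by hand, as here, trades that convenience for explicit control over each architecture.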

```
#ifndef __aarch64__
#include <xmmintrin.h>
#else
#include <utils/simd/sse2neon.h>
#endif
```

```
ALWAYS_INLINE void P2PNotify( const int task, volatile int * taskFinished )
{
#ifdef __aarch64__
    OPENMP ( omp flush)
    OPENMP ( omp atomic write)
#endif
    taskFinished[ task] = 1;
}
```

```
ALWAYS_INLINE void P2PWait( const int threadTask, const int * parentIndex,
                            const int * parents, const volatile int * taskFinished )
{
    for ( int tIdx = parentIndex[ threadTask ]; tIdx < parentIndex[ threadTask + 1 ]; ++tIdx )
    {
#ifdef __aarch64__
        const int t = parents[ tIdx ];
        int finished;
        do
        {
            OPENMP ( omp flush)
            OPENMP ( omp atomic read)
            finished = taskFinished[ t ];
        } while ( finished == 0);
#else
        const int t = parents[ tIdx ];
        while ( taskFinished[ t ] == 0 ) PAUSE;
#endif
    }
}
```
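The notify/wait pattern above relies on the flag write becoming visible only after the task's results are visible, which is not guaranteed by a plain `volatile` access on Arm's weaker memory ordering. The following is a minimal, self-contained sketch of the same idea written with `std::atomic` release/acquire instead of the OpenMP flush and atomic directives. It is a swapped-in technique for illustration, not the SolverBR implementation, and the names `notify_done`, `wait_for_parents`, and `run_dependency_demo` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Publishes completion of `task`. The release store makes every earlier
// write by this thread visible to a thread that acquires the same flag.
inline void notify_done(std::atomic<int>* task_finished, int task) {
    task_finished[task].store(1, std::memory_order_release);
}

// Spins until all parents of `thread_task` have published completion.
// parent_index holds CSR-style offsets into the parents array.
inline void wait_for_parents(int thread_task, const int* parent_index,
                             const int* parents,
                             const std::atomic<int>* task_finished) {
    assert(thread_task >= 0);
    for (int i = parent_index[thread_task];
         i < parent_index[thread_task + 1]; ++i) {
        const int t = parents[i];
        while (task_finished[t].load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();  // back off while the parent runs
        }
    }
}

// Small demo: task 2 depends on tasks 0 and 1 and sums their results.
inline double run_dependency_demo() {
    const int parent_index[] = {0, 0, 0, 2};
    const int parents[] = {0, 1};
    std::vector<double> partial(2, 0.0);
    std::atomic<int> task_finished[3];
    for (auto& flag : task_finished) flag.store(0);

    double total = 0.0;
    std::thread consumer([&] {
        wait_for_parents(2, parent_index, parents, task_finished);
        total = partial[0] + partial[1];  // safe: parents already published
        notify_done(task_finished, 2);
    });
    std::thread producer0([&] {
        partial[0] = 1.5;
        notify_done(task_finished, 0);
    });
    std::thread producer1([&] {
        partial[1] = 2.5;
        notify_done(task_finished, 1);
    });
    producer0.join();
    producer1.join();
    consumer.join();
    return total;
}
```

Calling `run_dependency_demo()` from a `main` function exercises the dependency chain once. Without the release/acquire pairing, the consumer could in principle observe a flag as set while still reading a stale partial result, which is one way a synchronization issue can surface as a numerical discrepancy.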





# Single-socket Speedups (max core-count)





Average of 150 executions per matrix (considering 3 different timesteps)







# Estimated Energy Efficiency at max load



Estimates are computed considering CPU max TDP at full load plus estimated memory consumption based on capacity and technology. We assume an average of 3W per 8 GB for DDR4 and 4.75W per 16 GB for DDR5. For Grace SoC, CPU plus memory is ~250W. Speed-ups are computed using AMD EPYC 9R14 as baseline.
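The estimation method described above can be written out as a short calculation. This is a sketch under the slide's stated assumptions (3 W per 8 GB of DDR4, 4.75 W per 16 GB of DDR5). The function names are hypothetical, and any TDP, capacity, or speed-up passed in is the caller's input, not a measured value from the presentation.

```cpp
enum class MemoryTech { DDR4, DDR5 };

// Estimated memory power from capacity, using the per-GB averages
// assumed on the slide: 3 W per 8 GB (DDR4), 4.75 W per 16 GB (DDR5).
inline double memory_power_watts(MemoryTech tech, double capacity_gb) {
    const double watts_per_gb =
        (tech == MemoryTech::DDR4) ? 3.0 / 8.0 : 4.75 / 16.0;
    return capacity_gb * watts_per_gb;
}

// Estimated power at max load: CPU max TDP plus estimated memory power.
inline double estimated_power_watts(double cpu_tdp_watts, MemoryTech tech,
                                    double capacity_gb) {
    return cpu_tdp_watts + memory_power_watts(tech, capacity_gb);
}

// Performance per watt relative to the baseline system, given a
// speed-up measured against that same baseline.
inline double relative_efficiency(double speedup_vs_baseline,
                                  double power_watts,
                                  double baseline_power_watts) {
    return speedup_vs_baseline * baseline_power_watts / power_watts;
}
```

For a Grace SoC, the slide's combined figure of ~250W would be passed directly as `power_watts`, since its memory is already included in that number.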





# To conclude...



## E4 and NVIDIA Partnership Relentless pursuit of innovation and added value

- Remote access to GH200 and Grace Superchip
  - CHIRON LAB: <u>https://www.e4company.com/en/chiron-lab/</u>
- Systems from various OEM/ODM
  - Ability to run comparison across architectures
  - Ability to measure server power consumption
- E4 long-standing experience in Arm-based systems
- E4 expertise in system configuration and bring-up
- NVIDIA expertise in GPU application tuning

<u>Contact</u>: Marco Cicala (<u>marco.cicala@e4company.com</u>)








## ANNOUNCING NVIDIA BLACKWELL PLATFORM FOR TRILLION-PARAMETER SCALE GENERATIVE AI



- AI SUPERCHIP: 208B Transistors
- RAS ENGINE: 100% In-System Self-Test
- 2<sup>nd</sup> GEN TRANSFORMER ENGINE: FP4/FP6 Tensor Core
- 5<sup>th</sup> GENERATION NVLINK: Scales to 576 GPUs
- SECURE AI: Full Performance Encryption & TEE
- DECOMPRESSION ENGINE: 800 GB/sec







## Further Resources for Grace CPU and Grace Hopper

#### **Grace CPU Superchip**

- Grace CPU Customer Deck
- Grace CPU Superchip Architecture Whitepaper
- Grace CPU Architecture In-Depth Blog
- Grace CPU Superchip Data Sheet
- Grace CPU Energy Efficiency Blog
- A Demonstration of AI and HPC Applications for NVIDIA Grace CPU [S51880]
- Grace CPU Power Efficiency Video
- Unlock the Power of NVIDIA Grace and Hopper with Foundational HPC Software

#### **GH200 Grace Hopper Superchip**

- GH200 Grace Hopper Customer Deck
- Grace Hopper Superchip Architecture Whitepaper
- Grace Hopper Architecture In-Depth Blog
- Grace Hopper Superchip Architecture Data Sheet
- Grace Hopper Recommender System Blog
- Programming Model and Applications for the Grace Hopper Superchip [S51120]
- Accelerating HPC applications with ISO C++ on Grace Hopper [S51054]
- Deploying RAG Applications on NVIDIA GH200







