

# CPU Hardware Architecture and Performance Optimization

G. Amadio (CERN)

#### **Our Main Goals**

- Understand the architecture of modern CPU hardware
  - Hardware evolution
  - Main features of modern hardware
- Understand how to analyze the performance of our code
  - How to identify performance bottlenecks
  - What to measure and how to measure it
- Combine architectural knowledge and performance analysis
  - How to interpret performance measurements
  - What changes to make to the software

# CPU Hardware Architecture and Evolution

### **Early Computing Devices**









| 2700–2300 BC | 1620–1630 | 1642 | 1820s |
|---|---|---|---|
| Abacus | Slide Rule | Pascaline | Difference Engine |
| Used since ancient times, until Arabic numerals became the norm. Still in use as an educational tool. | Uses logarithmic scales to help with multiplication and with computing other functions. Extensively used by engineers in the last century, before computers became powerful. | Mechanical calculator invented by Blaise Pascal to help his father with tax calculations. Could add and subtract. | Automatic mechanical calculator designed to tabulate polynomial functions. Designed and first created by Charles Babbage. |

Images: Wikipedia


#### Ada Lovelace, the first computer programmer

Augusta Ada King, Countess of Lovelace (10 December 1815 – 27 November 1852) was an English mathematician and writer, known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognize that the machine had applications beyond simple calculation, and she arguably wrote the first "computer program": in "Note G" of her notes on the Analytical Engine, she described in detail an algorithm for computing a sequence of Bernoulli numbers with it.



Source: Wikipedia

### The Turing Machine: concept of the first general-purpose computer



### From Turing Machine to Stored-Program Computer

| Year | Milestone | Description |
|---|---|---|
| 1936 | Turing Machine | Conceptually the first general computing machine. |
| 1937 | Harvard IBM Mark I | Inspired by the Analytical Engine. One of the earliest general-purpose electromechanical computers. First computer bug discovered in it by Grace Hopper. |
| 1942–1945 | ABC, Colossus, ENIAC | First truly digital computers, based on boolean logic and vacuum tubes. |
| 1947 | Solid-State Transistor | The first solid-state transistor was based on a point-contact connection to a crystal by closely spaced thin gold foils. |
| 1949 | Assembly Language | Beginning of standardization: computers are programmed with instruction sets. |
| 1950 | EDVAC | First stored-program computer, based on John von Neumann's architecture concept from 1945. |

### John von Neumann Architecture





OurWorldinData.org – Research and data to make progress against the world's largest problems.

Licensed under CC-BY by the authors Hannah Ritchie and Max Roser.

### **Integrated Circuit-Based Microprocessors**

| Year | Processor | Description |
|---|---|---|
| 1971 | Intel 4004 | First Intel microprocessor. 2,300 transistors. |
| 1972 | Intel 8008 | First 8-bit microprocessor. 3,500 transistors. |
| 1975 | MOS 6502 | Powered many popular devices, such as the Apple II, Atari 2600, Commodore 64, and the NES. 3,510 transistors. |
| 1976 | Zilog Z80 | 8-bit microprocessor. Powered devices such as the Sega Master System and Mega Drive, and Sinclair's ZX Spectrum. |
| 1978 | Intel 8086 | First 16-bit microprocessor. 29,000 transistors. Its successor, the Intel 8088, a slightly modified version, powered the first IBM PC. |
| 1985 | Intel 80386 | 32-bit microprocessor. 275,000 transistors, 33 MHz. Cemented Intel's PC market dominance. |

### **Intel's 8086 Registers and Assembly**

| Group | Register | Description |
|---|---|---|
| Main registers | AX (AH, AL) | Primary accumulator |
| | BX (BH, BL) | Base, accumulator |
| | CX (CH, CL) | Counter, accumulator |
| | DX (DH, DL) | Accumulator, extended accumulator |
| Index registers | SI | Source Index |
| | DI | Destination Index |
| | BP | Base Pointer |
| | SP | Stack Pointer |
| Program counter | IP | Instruction Pointer |
| Segment registers | CS | Code Segment |
| | DS | Data Segment |
| | ES | Extra Segment |
| | SS | Stack Segment |
| Status register | Flags | O D I T S Z - A - P - C (bits 11 to 0) |

All registers are 16 bits wide. The bit positions in the original figure run from 19 down to 0 because segment registers are shifted left by four bits to form 20-bit addresses.

    ; _strtolower:
    ; Copy a null-terminated ASCII string, converting
    ; all alphabetic characters to lower case.
    ; ES=DS
    ; Entry stack parameters
    ;      [SP+4] = src, Address of source string
    ;      [SP+2] = dst, Address of target string
    ;
    _strtolower proc
                push    bp             ;Set up the call frame
                mov     bp,sp
                push    si
                push    di
                mov     si,[bp+6]      ;Set SI = src (+2 due to push bp)
                mov     di,[bp+4]      ;Set DI = dst
                cld                    ;string direction ascending
    loop:       lodsb                  ;Load AL from [si], inc si
                cmp     al,'A'         ;If AL < 'A',
                jb      copy           ; Skip conversion
                cmp     al,'Z'         ;If AL > 'Z',
                ja      copy           ; Skip conversion
                add     al,'a'-'A'     ;Convert AL to lowercase
    copy:       stosb                  ;Store AL to [di], inc di
                or      al,al          ;If AL <> 0,
                jne     loop           ; Repeat the loop
    done:       pop     di             ;restore di and si
                pop     si
                pop     bp             ;Restore the prev call frame
                ret                    ;Return to caller
                end     proc

Source: Wikipedia
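For readers less used to assembly, the 8086 routine above corresponds roughly to the C function below. This is only a sketch for comparison: the name `strtolower_copy` is ours, and the comments map each statement to the instruction it mirrors.

```c
/* Hypothetical C counterpart of the 8086 _strtolower routine:
 * copies the null-terminated ASCII string src to dst, converting
 * 'A'..'Z' to lower case. dst must have room for strlen(src) + 1 bytes. */
void strtolower_copy(char *dst, const char *src)
{
    char c;

    do {
        c = *src++;                      /* lodsb: load a byte, advance src */
        if (c >= 'A' && c <= 'Z')        /* cmp/jb and cmp/ja: range check  */
            c = (char)(c + ('a' - 'A')); /* add al,'a'-'A'                  */
        *dst++ = c;                      /* stosb: store it, advance dst    */
    } while (c != '\0');                 /* or al,al / jne: stop after 0    */
}
```

As in the assembly version, the terminating zero byte is copied before the loop exits.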

### Intel x86 Assembly

```c
__attribute__((noinline))
int is_odd(unsigned long long n)
{
    return n & 1;
}

unsigned int collatz_count(unsigned long long n)
{
    unsigned int count = 0;

    /* Count the Collatz steps needed to reach 1 (assumes n >= 1). */
    while (n != 1) {
        if (is_odd(n))
            n = 3 * n + 1;
        else
            n = n / 2;

        ++count;
    }

    return count;
}
```

The corresponding x86-64 assembly (Intel syntax):

    is_odd:
            mov     eax, edi
            and     eax, 1
            ret
    collatz_count:
            xor     edx, edx
            cmp     rdi, 1
            jne     .L7
            jmp     .L3
    .L10:
            lea     rdi, [rdi+1+rdi*2]
            add     edx, 1
            cmp     rdi, 1
            je      .L3
    .L7:
            call    is_odd
            test    eax, eax
            jne     .L10
            shr     rdi
            add     edx, 1
            cmp     rdi, 1
            jne     .L7
    .L3:
            mov     eax, edx
            ret
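Two details of the assembly listing are worth spelling out: the multiplication `3 * n + 1` became a single `lea` that computes `n + 2*n + 1`, and the division by two became a right shift. The sketch below writes the same step in C the way the compiler lowered it; `collatz_step` and `collatz_count_steps` are illustrative names, not part of the original code.

```c
/* One Collatz step written the way the compiler lowered it (a sketch). */
static unsigned long long collatz_step(unsigned long long n)
{
    if (n & 1)                    /* is_odd: and eax, 1                 */
        return n + (n << 1) + 1;  /* lea rdi, [rdi+1+rdi*2]  ->  3n + 1 */
    return n >> 1;                /* shr rdi                 ->  n / 2  */
}

/* Same loop as collatz_count, built on collatz_step. Requires n >= 1. */
static unsigned int collatz_count_steps(unsigned long long n)
{
    unsigned int count = 0;

    while (n != 1) {
        n = collatz_step(n);
        ++count;
    }
    return count;
}
```

For example, starting from 6 the sequence is 6, 3, 10, 5, 16, 8, 4, 2, 1, so `collatz_count_steps(6)` returns 8.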

#### **Registers available in the x86-64 instruction set**

Figure: registers available in the x86-64 instruction set, color-coded by width (8, 16, 32, 64, 80, 128, 256, and 512 bits):

- Vector registers: XMM0–XMM15 (128-bit, SSE4.2), YMM0–YMM15 (256-bit, AVX/AVX2), ZMM0–ZMM31 (512-bit, AVX512)
- x87 floating point: ST(0)–ST(7) (80-bit), overlapping the MMX registers MM0–MM7 (64-bit), plus CW, SW, TW, FP_IP, FP_DP, FP_CS, FP_DS, FP_OPC
- General purpose: RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, R8–R15 (64-bit), each with 32-, 16-, and 8-bit sub-registers (for example EAX, AX, AH/AL, or R8D, R8W, R8B)
- Instruction pointer and flags: IP/EIP/RIP, FLAGS/EFLAGS/RFLAGS
- Segment registers: CS, SS, DS, ES, FS, GS
- System registers: GDTR, IDTR, TR, LDTR, MSW, MXCSR, control registers CR0–CR15, debug registers DR0–DR15

By Immae - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=32745525
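The narrower register names are views of the same physical register. As a rough model (plain C arithmetic on a 64-bit value, not real register access; all function names are ours), the sketch below shows what AL, AH, AX, and EAX contain for a given RAX, and how writes behave: a 16-bit write leaves the upper bits untouched, while a 32-bit write zero-extends into the full 64-bit register.

```c
#include <stdint.h>

/* Views of the 64-bit register RAX through its narrower names. */
static uint8_t  reg_al(uint64_t rax)  { return (uint8_t)(rax & 0xFF); }
static uint8_t  reg_ah(uint64_t rax)  { return (uint8_t)((rax >> 8) & 0xFF); }
static uint16_t reg_ax(uint64_t rax)  { return (uint16_t)(rax & 0xFFFF); }
static uint32_t reg_eax(uint64_t rax) { return (uint32_t)(rax & 0xFFFFFFFFu); }

/* Writing AX keeps the upper 48 bits of RAX unchanged... */
static uint64_t write_ax(uint64_t rax, uint16_t value)
{
    return (rax & ~(uint64_t)0xFFFF) | value;
}

/* ...while writing EAX zero-extends into the full 64-bit register. */
static uint64_t write_eax(uint64_t rax, uint32_t value)
{
    (void)rax;   /* the old contents do not survive a 32-bit write */
    return (uint64_t)value;
}
```

This is why the `is_odd` listing can return its result with a 32-bit `mov eax, edi` without leaving stale upper bits in RAX.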

#### **Instruction Sets**

- **CISC** (Complex Instruction Set Computer)
  - Intel x86 and AMD64
    - Most laptop and desktop PCs, PlayStation 5, Xbox One
  - IBM System z (mainframe computers)
  - Motorola 68000
    - Early Apple Macintosh
- **RISC** (Reduced Instruction Set Computer)
  - ARM
    - Amazon Graviton (AWS VMs)
    - Apple A-series and M1–M4 (iPhone, iPad, Mac)
    - Ampere Altra, Fujitsu A64FX, etc.
    - Qualcomm (mobile phones, tablets)
    - Nintendo Game Boy Advance, DS, 3DS and Switch, Raspberry Pi, etc.
  - IBM's PowerPC
    - Apple Macintosh (1994–2005), Nintendo GameCube and Wii, PlayStation 3, Xbox 360
  - DEC Alpha, MIPS, RISC-V, SPARC, SuperH
    - Nintendo 64, PlayStation 1 and 2 (MIPS), Sega Saturn and Dreamcast (SuperH)

### **Programming Language Evolution**

| Year | Language | Description |
|---|---|---|
| 1947 | Assembly | Low-level language. High correspondence between language and hardware instructions. Code written in assembly is converted to machine code using an assembler, which was a big upgrade over previous forms of programming. |
| 1954 | Fortran | One of the earliest high-level imperative programming languages. Introduced procedural programming, double precision, and complex numbers. Still popular in HPC, including in GPU application programming. |
| 1963 | BASIC | **B**eginner's **A**ll-purpose **S**ymbolic **I**nstruction **C**ode is a family of high-level languages. BASIC became popular during the 8-bit era, but declined in popularity in the 90s, when more advanced languages like C were the norm. |
| 1972 | C | Originally developed to implement many of the utilities for UNIX OSs. Still in wide use today. C is a portable, imperative language with a static type system that supports structured programming. |
| 1979 | C++ | C++ was designed with systems programming, embedded software, and efficiency in mind. Although many think of C++ as a superset of C or C with classes, their latest versions are not fully compatible. Used extensively in HEP and HPC nowadays. |

#### **50 Years of Microprocessor Trend Data**



source: https://github.com/karlrupp/microprocessor-trend-data


### **Breakdown of Dennard's Scaling**

- Power density stopped staying constant: supply voltage could no longer be lowered with each die shrink, so power per unit area started to rise
- Frequency could no longer keep increasing after each die shrink
  - But the transistor numbers kept growing
- Single-thread performance gains continued, albeit at a slower pace
  - More complexity: pipelining, superscalar, out-of-order execution, SIMD
- AMD and Intel bring 64-bit CPUs to the mainstream market
  - Intel with IA-64, and AMD with amd64 (x86\_64), announced in 1999
- From symmetric multiprocessing (SMP) to simultaneous multithreading (SMT)
  - In the '90s, dual socket high-end servers became popular
  - First mainstream SMT-capable CPU was the Intel Pentium 4 with Hyper-Threading, released in 2002
- First dual core processors began to appear in mid 2000s
- The era of parallelism is born
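The power-density argument can be sketched with the usual approximation for the dynamic power of a CMOS transistor. The symbols are our notation, not the slides': activity factor $\alpha$, capacitance $C$, supply voltage $V$, clock frequency $f$, area $A$, and a linear shrink factor $\kappa > 1$.

$$P \approx \alpha\, C\, V^{2} f$$

Under ideal Dennard scaling, $C \to C/\kappa$, $V \to V/\kappa$, $f \to \kappa f$ and $A \to A/\kappa^{2}$, so

$$P' \approx \alpha\, \frac{C}{\kappa}\, \frac{V^{2}}{\kappa^{2}}\, \kappa f = \frac{P}{\kappa^{2}}, \qquad \frac{P'}{A'} = \frac{P/\kappa^{2}}{A/\kappa^{2}} = \frac{P}{A}.$$

Once the voltage can no longer be reduced ($V \to V$) while frequency keeps increasing,

$$P' \approx \alpha\, \frac{C}{\kappa}\, V^{2}\, \kappa f = P, \qquad \frac{P'}{A'} = \frac{P}{A/\kappa^{2}} = \kappa^{2}\, \frac{P}{A},$$

which is why clock frequencies stalled while transistor counts kept growing.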

### **Instruction-Level Parallelism**
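As a small illustration of instruction-level parallelism, the sketch below sums an array twice: once with a single accumulator, where every addition has to wait for the previous one, and once with four independent accumulators that a superscalar, out-of-order core can keep in flight at the same time. Function names are ours, and whether the second version is actually faster depends on the compiler and CPU (optimizing compilers often perform this transformation, or vectorize, on their own for integer data), so it should be measured rather than assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each addition depends on the previous one. */
uint64_t sum_serial(const uint32_t *x, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}

/* Four independent accumulators: four dependency chains that the
 * core can execute in parallel. */
uint64_t sum_ilp4(const uint32_t *x, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i)   /* leftover elements */
        s0 += x[i];

    return (s0 + s1) + (s2 + s3);
}
```

With integers both functions return exactly the same value; with floating-point data the changed order of additions can alter rounding, which is one reason compilers do not apply this rewrite to floating-point sums by default.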



### Simultaneous multithreading (SMT)



#### Threads scheduled one at a time on each physical core





Threads run simultaneously on two logical cores



#### **CPU Architecture**

#### Logical Components

- Control Unit
- Arithmetic Logic Units (ALUs)
- Floating Point Unit (FPU)
- Branch Predictor Unit (BPU)
- Memory Management Unit (MMU)
- Translation Lookaside Buffer (TLB)
- Memory Subsystem
  - L1 (~32–512 KB per core)
    - L1 Instruction Cache
    - L1 Data Cache
  - L2 (~1–8 MB per core)
    - Shared instruction/data cache
  - L3 (~8 MB to 1.1 GB per socket)
    - Last level cache (LLC)

#### **Generic Dual Core CPU**



Source: Systems Performance 2nd Edition, Brendan Gregg

### **Memory Hierarchy**



### Latency Numbers Every Programmer Should Know

| Operation | Latency |
|---|---|
| L1 cache reference | 1 ns |
| Branch mispredict | 3 ns |
| L2 cache reference | 4 ns |
| Mutex lock/unlock | 17 ns |
| Main memory reference | 100 ns |
| Compress 1 KB with Zippy | 2,000 ns ≈ 2 µs |
| Read 1,000,000 bytes sequentially from SSD | 49,000 ns ≈ 49 µs |
| Read 1,000,000 bytes sequentially from disk | 825,000 ns ≈ 825 µs |
| Disk seek | 2,000,000 ns ≈ 2 ms |

For scale, 1 ns corresponds to roughly 5 CPU cycles.



Source: https://colin-scott.github.io/personal\_website/research/interactive\_latency.html
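With main memory roughly 100 times slower than the L1 cache, the order in which data is visited matters. The sketch below (function names are ours) sums the same row-major matrix in two ways: the first walks memory contiguously, the second jumps `cols` elements between accesses. Both return the same result; for matrices much larger than the caches, the strided version is expected to be noticeably slower, although the actual ratio has to be measured on the target machine.

```c
#include <stddef.h>

/* Sum a rows x cols matrix stored in row-major order, walking memory
 * contiguously: consecutive accesses mostly fall in the same cache line. */
double sum_row_major(const double *a, size_t rows, size_t cols)
{
    double sum = 0.0;

    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            sum += a[i * cols + j];
    return sum;
}

/* Same sum, visiting the matrix column by column: each access is
 * cols elements away from the previous one. */
double sum_col_major(const double *a, size_t rows, size_t cols)
{
    double sum = 0.0;

    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            sum += a[i * cols + j];
    return sum;
}
```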

### **Virtual Memory**

- First appeared in the Atlas computer in 1962
- Memory Management Unit (MMU)
- Memory managed in pages
  - Page sizes are usually 4 KB, 16 KB, or 64 KB
  - May also support "huge pages" of 2 MB or 1 GB
- Hides fragmentation of physical memory
- Memory hierarchy managed by the kernel
- Makes application programming easier
  - Memory looks contiguous
  - No need to worry about fragmentation
  - Each process seems to own the whole address space
  - Enabled timesharing features
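To make the paging idea concrete, the sketch below splits a virtual address into a page number, which the MMU translates, and an offset within the page, which is carried over unchanged. It also queries the page size of the running system with the POSIX call `sysconf(_SC_PAGESIZE)`. The helper names are ours, and the code assumes a POSIX system.

```c
#include <stdint.h>
#include <unistd.h>   /* sysconf (POSIX only) */

/* Page size used by the running system, e.g. 4096 on many x86-64 setups. */
long system_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Virtual page number: the part of the address that gets translated. */
uint64_t page_number(uint64_t vaddr, uint64_t page_size)
{
    return vaddr / page_size;
}

/* Offset within the page: identical in the virtual and physical address. */
uint64_t page_offset(uint64_t vaddr, uint64_t page_size)
{
    return vaddr % page_size;
}
```

With 4 KB pages, the address `0x12345` lies in page `0x12` at offset `0x345`.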



### **The Translation Lookaside Buffer**

"A translation lookaside buffer (TLB) is a memory cache that is used to reduce the time taken to access a user memory location. It is a part of the chip's memory management unit (MMU). The TLB stores the recent translations of virtual memory to physical memory and can be called an address-translation cache."
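A useful derived quantity is the TLB reach: the amount of memory that can be addressed without a TLB miss, which is simply the number of entries times the page size. The sketch below computes it; the function name is ours, and the 1536-entry TLB in the example is a hypothetical size chosen for illustration, not a figure from the slides.

```c
#include <stdint.h>

/* Memory covered by a fully populated TLB, in bytes. */
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_size)
{
    return entries * page_size;
}
```

For a hypothetical 1536-entry TLB, 4 KB pages give a reach of 6 MB, while 2 MB huge pages give 3 GB, which is why huge pages can help applications with large working sets.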



### **The Parallel Era**

- First dual-core x86 CPUs debut in 2005
  - Pentium D, based on Pentium 4
  - AMD Athlon 64 X2
- Quickly evolved from 2 to 4 cores
  - Stagnated at 4 cores for several years
- Ryzen brought AMD back in the game
  - Offered more cores, forced Intel to do the same
- ARM finally begins move from phones to servers
  - Amazon Graviton, Fujitsu A64FX, Ampere Altra
- Innovations in packaging led to multi-chip CPUs
  - AMD EPYC, Intel Sapphire and Emerald Rapids



Figures: Fujitsu A64FX (TofuD interconnect: 28 Gbps x 2 lanes x 10 ports; PCIe Gen3, 16 lanes) and Intel Emerald Rapids.

#### Ampere Roadmap 2020 – 2026



Source: Ampere

### **Non-Uniform Memory Access (NUMA)**



| | AMD EPYC 7001 'NAPLES' | AMD EPYC 7002 'ROME' | AMD EPYC 7003 'MILAN' | AMD EPYC 9004, 8004 'GENOA', 'SIENA' |
|---|---|---|---|---|
| Core Architecture | 'Zen' | 'Zen 2' | 'Zen 3' | 'Zen 4' and 'Zen 4c' |
| Cores | 8 to 32 | 8 to 64 | 8 to 64 | 8 to 128 |
| IPC Improvement Over Prior Generation | N/A | ~24% <sup>ROM-236</sup> | ~19% <sup>MLN-003</sup> | ~14% <sup>EPYC-038</sup> |
| Max L3 Cache | Up to 64 MB | Up to 256 MB | Up to 256 MB | Up to 384 MB (EPYC 9004)<br>Up to 128 MB (EPYC 8004) |
| Max L3 Cache with 3D V-Cache<sup>™</sup> technology | | | 768 MB | Up to 1152 MB |
| PCIe<sup>®</sup> Lanes | Up to 128 Gen 3 | Up to 128 Gen 4 | Up to 128 Gen 4 | Up to 128 Gen 5<br>8 bonus lanes Gen 3 |
| CPU Process Technology | 14nm | 7nm | 7nm | 5nm |
| I/O Die Process Technology | N/A | 14nm | 14nm | 6nm |
| Power (Configurable TDP [cTDP]) | 120-200W | 120-280W | 155-280W | 70-400W |
| Max Memory Capacity | 2 TB DDR4-2400/2666 | 4 TB DDR4-3200 | 4 TB DDR4-3200 | 6 TB DDR5-4800 |

### **AMD Zen4 Architecture in detail**

#### COMPUTE

- AMD "Zen4" x86 cores (Up to 12 CCDs / 96 cores / 192 threads)
- 1MB L2/Core, 96MB L3/CCD / Total up to 1,152MB L3
- ISA updates: BFLOAT16, VNNI, AVX-512 (256b data path)
- Memory addressability with 57b/52b Virtual/Physical Address
- Updated IOD and internal AMD Gen3 Infinity Fabric™ architecture with increased die-to-die bandwidth
- Target TDP range: Up to 400W (cTDP)
- Updated RAS

#### Memory

- 12 channel DDR5 with ECC up to 4800 MT/s
- Option for 2, 4, 6, 8, 10, 12 channel memory interleaving
- RDIMM, 3DS RDIMM
- Up to 2 DIMMs/channel capacity with up to 12TB in a 2 socket system (256GB 3DS RDIMMs)
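The memory configuration above translates into a theoretical peak bandwidth per socket. As a back-of-the-envelope estimate, assuming each DDR5-4800 channel delivers $4800 \times 10^{6}$ transfers per second over a 64-bit (8-byte) data path, ECC bits excluded:

$$B_\text{peak} = 12\ \text{channels} \times 4800 \times 10^{6}\ \frac{\text{transfers}}{\text{s}} \times 8\ \frac{\text{bytes}}{\text{transfer}} = 460.8\ \text{GB/s}$$

Sustained bandwidth measured by real applications is lower than this upper bound.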

Source: AMD

Figure: die layout with the "Zen 4" cores grouped into CCDs.

#### SP5 Platform

- New socket, increased power delivery and VR
- Up to 4 links of Gen3 AMD Infinity Fabric<sup>™</sup> with speeds of up to 32Gbps
- Flexible topology options
- Server Controller Hub (USB, UART, SPI, I2C, etc.)

#### Integrated I/O – No Chipset

- Up to 160 IO lanes (2P) of PCIe® Gen5
  - Speeds up to 32Gbps, bifurcations supported down to x1
  - Up to 12 bonus PCIe<sup>®</sup> Gen3 lanes in 2P config (8 lanes-1P)
  - Up to 32 IO lanes for SATA
- 64 IO lanes support for CXL 1.1+ with bifurcations supported down to x4
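
To put the lane counts in perspective, the per-lane signalling rate can be converted into approximate bandwidth. The sketch below assumes the 128b/130b line encoding used by PCIe Gen5 and ignores all packet and protocol overhead, so the results are upper bounds; the names are illustrative.

```python
# Approximate PCIe Gen5 bandwidth per direction; an upper bound only.
RAW_RATE_PER_LANE = 32e9        # 32 Gbps per lane, as quoted in the list
ENCODING_EFFICIENCY = 128 / 130 # assumed: 128b/130b line encoding

def link_gb_per_s(lanes: int) -> float:
    """Usable GB/s in one direction for a link with the given lane count."""
    usable_bits_per_s = RAW_RATE_PER_LANE * ENCODING_EFFICIENCY * lanes
    return usable_bits_per_s / 8 / 1e9

for lanes in (1, 4, 16, 160):
    print(f"x{lanes:<3d} ~ {link_gb_per_s(lanes):6.1f} GB/s per direction")
```

Under these assumptions a single lane carries just under 4 GB/s, and an x16 link about 63 GB/s, in each direction.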

#### **Security Features**

- Dedicated Security Subsystem with enhancements
- Secure Boot, Hardware Root-of-Trust
- SME (Secure Memory Encryption)
- SEV-ES (Secure Encrypted Virtualization & Register Encryption)
- SEV-SNP (Secure Nested Paging), AES-256-XTS with more encrypted VMs

#### Intel® Xeon® Die Package Enhancements

5th Gen Intel<sup>®</sup> Xeon<sup>®</sup> Processors

Scalable, Balanced Architecture



#### 5th Gen Intel® Xeon® Processors Turbo Frequencies

Introducing 5 Improved Turbo Ratio Levels

- Improves Turbo Frequencies for Intel® AVX heavy and Intel® AMX light workloads including HPC and AI
- ~2 bins Turbo Frequency Upside on Intel® AVX-512 Heavy usage
- ~9% performance improvement on low load (4T or 8T) LINPACK AVX512
- ~5% performance improvement on low instance (4 or 32) Resnet50 amx\_int8, amx\_bfloat16 and avx\_fp32
- Lowers the Turbo frequency penalty for using AVX512 or AMX, broadening usability of these instruction sets

4th Gen Intel® Xeon® CPU

| Instruction Class | Cdyn Class 0    | Cdyn Class 1 | Cdyn Class 2 | Cdyn Class 3 |
|-------------------|-----------------|--------------|--------------|--------------|
| SSE               | 128 Light       | 128 Heavy    |              |              |
| AVX2              | 256 Light       | 256 Moderate | 256 Heavy    |              |
| AVX512            | 512 Ultra-Light | 512 Light    | 512 Moderate | 512 Heavy    |
| AMX               |                 | AMX Light    | AMX Moderate | AMX Heavy    |
| Turbo Frequency   | SSE             | AVX2         | AVX512       | AMX          |
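
One way to read the first table above (the one with four Cdyn classes) is as a two-step lookup: the instruction class and its intensity select a Cdyn class (the column), and the Cdyn class selects the turbo frequency level in the bottom row. The sketch below encodes that reading. The dictionary, list, and function names are hypothetical, and the column-to-turbo mapping is an interpretation of the table, not a documented interface.

```python
# Cdyn class for each (instruction class, intensity), transcribed from the table.
CDYN_CLASS = {
    ("SSE", "Light"): 0, ("SSE", "Heavy"): 1,
    ("AVX2", "Light"): 0, ("AVX2", "Moderate"): 1, ("AVX2", "Heavy"): 2,
    ("AVX512", "Ultra-Light"): 0, ("AVX512", "Light"): 1,
    ("AVX512", "Moderate"): 2, ("AVX512", "Heavy"): 3,
    ("AMX", "Light"): 1, ("AMX", "Moderate"): 2, ("AMX", "Heavy"): 3,
}

# Turbo frequency level per Cdyn class (bottom row of the table).
TURBO_LEVEL = ["SSE", "AVX2", "AVX512", "AMX"]

def turbo_level(instruction_class: str, intensity: str) -> str:
    """Return the turbo frequency level a workload would run at."""
    return TURBO_LEVEL[CDYN_CLASS[(instruction_class, intensity)]]

print(turbo_level("AVX512", "Ultra-Light"))  # lightest AVX512 use, class 0
print(turbo_level("AVX2", "Heavy"))          # heavy AVX2 use, class 2
```

Read this way, light use of wide vectors need not cost any turbo frequency, while heavy AVX2 code already runs at the AVX512 turbo level.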

5th Gen Intel® Xeon® CPU

| Instruction Class | Cdyn Class 0    | Cdyn Class 1    | Cdyn Class 2 | Cdyn Class 3 | Cdyn Class 4 |
|-------------------|-----------------|-----------------|--------------|--------------|--------------|
| SSE               | 128 Light       | 128 Heavy       |              |              |              |
| AVX2              | 256 Light       | 256 Moderate    | 256 Heavy    |              |              |
| AVX512            | 512 Ultra-Light | 512 Light       | 512 Moderate | 512 Heavy    |              |
| AMX               |                 | AMX Ultra-Light | AMX Light    | AMX Moderate | AMX Heavy    |
| Turbo Frequency   | SSE             | AVX2            | AVX512       | AVX51244     | AMX          |

#### Compute Express Link® 1.1 Enhancements

Type 3 memory support with 5th Gen Intel® Xeon® processors



Diagram: 2-Tier Memory Support Example, with DDR5 DIMMs and a CXL Memory Buffer

#### 2-Tier Memory Support

Type 3 Memory Expansion Devices:

**Capacity Expansion** 

- Tier 1 memory = native DDR, Tier 2 memory = CXL<sup>®</sup> attached memory
- Supports up to 4 channels of CXL memory across two CXL type 3 devices
- Supports CXL memory latency QoS distress signaling

Increased transactions per second for In-Memory databases (e.g. Redis)

#### Single Tier Memory Support

- 12 channel DDR+CXL interleaved memory
- Either for capacity or bandwidth expansion

Source: Intel

#### **SIMD Vectorization**



### History of Intel<sup>®</sup> SIMD ISA Extensions

- Intel<sup>®</sup> Pentium Processor (1993)
  - 32bit
- Multimedia Extensions (MMX in 1997)
  - 64bit integer support only
- Streaming SIMD Extensions (SSE in 1999 to SSE4.2 in 2008)
  - 32bit/64bit integer and floating point, no masking
- Advanced Vector Extensions (AVX in 2011 and AVX2 in 2013)
  - Fused multiply-add (FMA), HW gather support (AVX2)
- Many Integrated Core Architecture (Xeon Phi<sup>™</sup> Knights Corner in 2013)
  - HW gather/scatter, exponential
- AVX512 on Knights Landing, Skylake Xeon, and Core X-series (2016/2017)
  - Conflict detection instructions
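
The practical effect of each generation is easiest to see as the number of elements ("lanes") processed by one instruction. The sketch below assumes the usual register widths of 64 bits for MMX, 128 bits for SSE, 256 bits for AVX/AVX2, and 512 bits for AVX512; the names are illustrative.

```python
# Vector register width in bits for each extension (assumed standard widths).
REGISTER_BITS = {"MMX": 64, "SSE": 128, "AVX/AVX2": 256, "AVX512": 512}

def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of the given size that fit in one register."""
    return register_bits // element_bits

print(f"{'ISA':<10}{'32-bit lanes':>14}{'64-bit lanes':>14}")
for isa, bits in REGISTER_BITS.items():
    print(f"{isa:<10}{lanes(bits, 32):>14}{lanes(bits, 64):>14}")
```

Each doubling of register width doubles the work done per instruction on the same element type, which is why a loop over single-precision data can, at best, go from 4 to 16 elements per instruction between SSE and AVX512.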



### **Evolution of Intel<sup>®</sup> SIMD ISA Extensions**

- AVX10
  - Supported on both P-cores and E-cores
  - Brings benefits of AVX512 to smaller registers
- Advanced Matrix Extensions (AMX)
  - Targeted at AI applications
  - SIMD for small matrix operations
  - Available on 4th and 5th generation Xeon
- Advanced Performance Extensions (APX)
  - Adds new features that improve general-purpose performance
  - Expands x86 instruction set with more general-purpose registers (from 16 to 32)
  - New REX2 prefix provides uniform access to the new registers
  - Adds conditional forms of load, store, and compare/test instructions
  - The new prefix increases average instruction length, but there are fewer instructions overall
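
To give a sense of scale for "small matrix operations", the sketch below works out how much data fits in AMX tile registers. It assumes the tile geometry from Intel's published AMX architecture (8 tile registers, each up to 16 rows of 64 bytes), which is not stated on the slide; the names are illustrative.

```python
# AMX tile geometry; assumed from Intel's public AMX description.
TILE_REGISTERS = 8
TILE_ROWS = 16
TILE_ROW_BYTES = 64

tile_bytes = TILE_ROWS * TILE_ROW_BYTES          # bytes per tile
total_tile_bytes = TILE_REGISTERS * tile_bytes   # all tile registers

ELEMENT_BYTES = {"int8": 1, "bfloat16": 2}
for dtype, size in ELEMENT_BYTES.items():
    cols = TILE_ROW_BYTES // size
    print(f"{dtype:>8}: {TILE_ROWS} x {cols} elements per tile "
          f"({TILE_ROWS * cols} values)")

print(f"Tile size: {tile_bytes} bytes, all tiles: {total_tile_bytes} bytes")
```

By comparison, a single 512-bit vector register holds 64 bytes, so one tile holds as much data as 16 such registers.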

| Intel® AVX         | Intel® AVX2        | Intel® AVX-512              | Intel® AVX10.1 (pre-enabling) | Intel® AVX10.2                                      |
|--------------------|--------------------|-----------------------------|-------------------------------|-----------------------------------------------------|
|                    |                    |                             |                               | New data movement, transforms and type instructions |
|                    |                    |                             | Optional 512-bit FP/Int       | Optional 512-bit FP/Int                             |
| 128/256-bit FP     | Float16            | 128/256/512-bit FP/Int      | 128/256-bit FP/Int            | 128/256-bit FP/Int                                  |
| 16 registers       | 128/256-bit FP FMA | 32 vector registers         | 32 vector registers           | 32 vector registers                                 |
| NDS (and AVX128)   | 256-bit int        | 8 mask registers            | 8 mask registers              | 8 mask registers                                    |
| Improved blend     | PERMD              | 512-bit embedded rounding   | 512-bit embedded rounding     | 256/512-bit embedded rounding                       |
| MASKMOV            | Gather             | Embedded broadcast          | Embedded broadcast            | Embedded broadcast                                  |
| Implicit unaligned |                    | Scalar/SSE/AVX "promotions" | Scalar/SSE/AVX "promotions"   | Scalar/SSE/AVX "promotions"                         |
|                    |                    | Native media additions      | Native media additions        | Native media additions                              |
|                    |                    | HPC additions               | HPC additions                 | HPC additions                                       |
|                    |                    | Transcendental support      | Transcendental support        | Transcendental support                              |
|                    |                    | Gather/Scatter              | Gather/Scatter                | Gather/Scatter                                      |
|                    |                    | Flag-based enumeration      | Version-based enumeration     | Version-based enumeration                           |
|                    |                    | Intel® Xeon P-core only     | Intel® Xeon P-core only       | Supported on P-cores, E-cores                       |
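
The mask registers listed for AVX-512 and AVX10 allow an instruction to operate on only some lanes of a vector. The sketch below emulates merge-masking semantics in plain Python for a 16-lane add, roughly what an intrinsic such as `_mm512_mask_add_epi32(src, k, a, b)` does. It is an illustration of the semantics only, and the function name `masked_add` is hypothetical.

```python
def masked_add(a, b, src, mask):
    """Lane i receives a[i] + b[i] if bit i of mask is set, else src[i]."""
    return [x + y if (mask >> i) & 1 else s
            for i, (x, y, s) in enumerate(zip(a, b, src))]

a = list(range(16))          # sixteen 32-bit lanes fill one 512-bit register
b = [100] * 16
src = [-1] * 16              # values kept in lanes where the mask bit is 0
mask = 0b0000_0000_1111_0101 # bits 0, 2, 4, 5, 6, 7 are set

result = masked_add(a, b, src, mask)
print(result)
```

Masking lets a vectorized loop handle conditionals and leftover iterations without falling back to scalar code, which is one reason the SSE entry in the history list notes "no masking".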

# Microarchitecture of a Modern Intel Core

- Front End
  - Instruction Fetch and Decode
  - Branch Predictor Unit (BPU)
  - L1 Instruction Cache
  - Instruction TLB

- Back End
  - Execution Engine
    - Scheduler
    - Register File
    - Execution Units (EUs)
  - Memory Subsystem
    - Load/Store Units (LSU)
    - L1 / L2 Data Cache
    - Data TLB





### **Meteor Lake Hybrid Architecture**









#### Meteor Lake Block Diagram



Source: Intel

### **Redwood Cove: New P-core**

Targeted for efficient performance



Front-end blocks shown in the diagram: MSROM, I-TLB + 64KB I-Cache, Decode, µop Queue, Predict, µop Cache

\*Architectural simulation vs. Golden Cove architecture. Results may vary across workloads.

### **Crestmont: New E-core**

Significant improvements over prior E-core





Source: Intel

#### **DCAI** Architecture Evolution



### Intel Process Technology



Source: Intel

### **Modern Hardware**





### Summary

- We've come a long way: modern hardware is quite complex
  - NUMA Architecture (multi socket)
  - High parallelism (multicore, superscalar)
  - Advanced Packaging (chiplets)
  - Hybrid Architectures (performance/efficiency)
  - Variable CPU frequency scaling (turbo boost, thermal throttling)
  - Accelerators and Heterogeneity (GPUs, NPUs, FPGAs, ASICs)
- Performance does not come for free: we need to adapt our software
  - Concurrency and Parallelism (processes, threads, SIMD)
  - Memory alignment, access patterns, fragmentation
  - Code layout, compiler optimizations, data structures, software design
  - Need the right tools to guide us: profilers, static analysis, etc.
  - Need the right methodology: identify causes of bottlenecks, address the right issue
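
As a small illustration of the measurement methodology, the sketch below times two traversal orders of the same matrix, repeats each measurement, keeps the minimum, and checks that both variants compute the same result before comparing them. It is a minimal sketch with illustrative names. In pure Python, interpreter overhead dominates, so any difference seen here should not be read as a direct measure of cache behaviour; a compiled language and hardware counters would be needed for that.

```python
import timeit

N = 256
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def sum_row_major(m):
    """Visit elements in the order they are stored, row by row."""
    total = 0
    for row in m:
        for value in row:
            total += value
    return total

def sum_column_major(m):
    """Visit elements column by column, jumping between rows."""
    total = 0
    for j in range(len(m[0])):
        for row in m:
            total += row[j]
    return total

def best_time(func, repeats=5):
    """Minimum of several runs, to reduce noise from other system activity."""
    return min(timeit.repeat(lambda: func(matrix), number=1, repeat=repeats))

# Check correctness first: both variants must agree before timing matters.
assert sum_row_major(matrix) == sum_column_major(matrix)

t_row = best_time(sum_row_major)
t_col = best_time(sum_column_major)
print(f"row-major: {t_row * 1e3:.2f} ms, column-major: {t_col * 1e3:.2f} ms")
```

Taking the minimum over repeated runs is one common choice for microbenchmarks, since external noise can only make a run slower, never faster.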