# Hardware aspects of optimization

#### David Grellscheid based on slides by Ivan Girotto, ICTP







#### International Centre for Theoretical Physics



**John Von Neumann** 

# The Classical Model







# The Instruction Processing Cycle

- Fetch: read the next instruction from memory
  - $\hspace{0.1in} 001000 \hspace{0.1in} 00001 \hspace{0.1in} 00010 \hspace{0.1in} 00000100001000$
- Decode: operands and operation are decoded
  - add, \$r1, \$r2, 10
- Load: retrieve the data from memory to registers
- Execute: execute the instruction
  - \$r1 = 4500 + 10
- Store: store the results



The Abdus Salam International Centre for Theoretical Physics







Hardware aspects of optimization, David Grellscheid 2016-05-20

International Centre for Theoretical Physics



| Pipelining    |   |   |   |   |   |   |   |   |   |     |   |
|---------------|---|---|---|---|---|---|---|---|---|-----|---|
| Clock Period  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ••• | Ν |
| Instruction 1 | F | D | L | E | S |   |   |   |   |     |   |
| Instruction 2 |   | F | D | L | Е | S |   |   |   |     |   |
| Instruction 3 |   |   | F | D | L | Е | S |   |   |     |   |
| Instruction 4 |   |   |   | F | D | L | Е | S |   |     |   |
| Instruction 5 |   |   |   |   | F | D | L | E | S |     |   |



#### International Centre for Theoretical Physics



IAEA

## Superscalaring

| Clock<br>Period | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ••• | Ν |
|-----------------|---|---|---|---|---|---|---|---|---|-----|---|
| Instruction 1   | F | D | L | Е | S |   |   |   |   |     |   |
| Instruction 2   | F | D | L | Е | S |   |   |   |   |     |   |
| Instruction 3   |   | F | D | L | Ε | S |   |   |   |     |   |
| Instruction 4   |   | F | D | L | Ε | S |   |   |   |     |   |
| Instruction 5   |   |   | F | D | L | Е | S |   |   |     |   |
| Instruction 6   |   |   | F | D | L | Е | S |   |   |     |   |



### Vectorization

### SIMD Mode

### Scalar Mode

в





# demo

Hardware aspects of optimization, David Grellscheid 2016-05-20



The Abdus Salam International Centre for Theoretical Physics



## Loops and Pipeline



| <pre>for( i =</pre> | 0; i  | < N; | i | += 1 | . ) |
|---------------------|-------|------|---|------|-----|
| {                   |       |      |   |      |     |
| A[i]                | = s * | A[i] |   |      |     |
| }                   |       |      |   |      |     |

| Loop: | load r1, A(i)   |
|-------|-----------------|
|       | load r2, s      |
|       | mult r3, r2, r1 |
|       | store A(i), r3  |
|       | branch => loop  |





## The CPU Memory Hierarchy







| Re<br>ref<br>Size: 100<br>Speed: 3 | CPU(s):<br>On-line CPU(s) list:<br>Thread(s) per core:<br>Core(s) per socket:<br>Socket(s):<br>NUMA node(s):<br>Vendor ID:<br>CPU family:<br>Model:<br>Model name:<br>Stepping:<br>CPU MHz:<br>BogoMIPS:<br>Virtualization: | x86_64<br>32-bit, 64-bit<br>Little Endian<br>8<br>0-7<br>2<br>4<br>1<br>1<br>GenuineIntel<br>6<br>60<br>Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz<br>3<br>1074.492<br>5986.16<br>VT-x<br>32K<br>32K<br>256K<br>8192K<br>0-7 | storage<br>Disk<br>emory<br>ference<br>-16 TB<br>-10 ms |
|------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
|------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|







| LI cache reference0.5 ns                                      |
|---------------------------------------------------------------|
| Branch mispredict5 ns                                         |
| L2 cache reference7 ns                                        |
| Mutex lock/unlock 25 ns                                       |
| Main memory reference 100 ns                                  |
| SSD random read 150,000 ns $=$ 150 µs                         |
| Read I MB sequentially from memory 250,000 ns $= 250 \ \mu s$ |
| Read I MB sequentially from SSD I,000,000 ns = I ms           |
| Disk seek 10,000,000 ns = 10 ms                               |
| Read I MB sequentially from disk 20,000,000 ns = 20 ms        |
| Send packet CA->Netherlands->CA 150,000,000 ns = 150 ms       |



| LI cache reference    | 0.5 s |
|-----------------------|-------|
| Branch mispredict     | 5 s   |
| L2 cache reference    | 7 s   |
| Mutex lock/unlock     | 25 s  |
| Main memory reference | 100 s |



| LI cache reference               | 0.5 s        |
|----------------------------------|--------------|
| Branch mispredict                | 5 s          |
| L2 cache reference               | 7 s          |
| Mutex lock/unlock                | 25 s         |
| Main memory reference            | 100 s        |
| SSD random read                  | 1.7 days     |
| Read I MB sequentially from memo | ory 2.9 days |
| Read I MB sequentially from SSD  | 11.6 days    |

Hardware aspects of optimization, David Grellscheid 2016-05-20



| LI cache reference               | 0.5 s        |
|----------------------------------|--------------|
| Branch mispredict                | 5 s          |
| L2 cache reference               | 7 s          |
| Mutex lock/unlock                | 25 s         |
| Main memory reference            | 100 s        |
| SSD random read                  | I.7 days     |
| Read I MB sequentially from memo | ory 2.9 days |
| Read I MB sequentially from SSD  | 11.6 days    |
| Disk seek 16.                    | 5 weeks      |
| Read I MB sequentially from disk | 7.8 months   |



| LI cache reference               | 0.5 s        |
|----------------------------------|--------------|
| Branch mispredict                | 5 s          |
| L2 cache reference               | 7 s          |
| Mutex lock/unlock                | 25 s         |
| Main memory reference            | 100 s        |
| SSD random read                  | I.7 days     |
| Read I MB sequentially from memo | ory 2.9 days |
| Read I MB sequentially from SSD  | 11.6 days    |
| Disk seek 16.                    | 5 weeks      |
| Read I MB sequentially from disk | 7.8 months   |
| Send packet CA->Netherlands->CA  | 4.8 years    |