

# Intel<sup>®</sup> Architecture for HPC Developers

Presenter: Georg Zitzlsberger

Date: 09-07-2015



# Agenda

- Introduction
- Intel<sup>®</sup> Architecture
  - Desktop, Mobile & Server
  - Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor
- Summary

# Moore's "Law"



#### "The number of transistors on a chip will **double** approximately **every two years.**" [Gordon Moore]



Moore's Law graph, 1965

#### Microprocessor Transistor Counts 1971-2011 & Moore's Law
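Read as a formula, the quoted statement says that a transistor count $N_0$ in year $t_0$ doubles every two years. The worked line below is only an illustration: it assumes the commonly cited 2,300 transistors of the Intel 4004 (1971) as the starting point, a figure that is not taken from the slide.

$$N(t) = N_0 \cdot 2^{(t - t_0)/2}$$

$$N(2011) = 2300 \cdot 2^{(2011 - 1971)/2} = 2300 \cdot 2^{20} \approx 2.4 \times 10^{9}$$

Under that assumption the projection lands at a few billion transistors for 2011.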



# Parallelism

#### **Problem:**

Economical operation **frequency of (CMOS) transistors is limited**. ⇒ **No free lunch anymore!** 

#### **Solution:**

More transistors allow more gates/logic on the same die space and power envelope, **improving parallelism**:

- Thread level parallelism (TLP): Multi- and many-core
- Data level parallelism (DLP): Wider vectors (SIMD)
- Instruction level parallelism (ILP): Microarchitecture improvements, e.g. threading, superscalarity, ...



## Processor Architecture Basics: UMA and NUMA



- UMA (aka. non-NUMA):
  - Uniform Memory Access (UMA)
  - Addresses interleaved across memory nodes by cache line
  - Accesses may or may not have to cross QPI link
    - ⇒ Provides good portable performance without tuning
- NUMA:
  - Non-Uniform Memory Access (NUMA)
  - Addresses not interleaved across memory nodes by cache line
  - Each processor has direct access to a contiguous block of memory
    - ⇒ Provides peak performance but requires special handling
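
The two address layouts can be sketched as a mapping from a physical address to the memory node that serves it. This is a simplified model, not how a real memory controller decodes addresses: the function names, the node count and the per-node block size are illustrative assumptions, and the 64-byte line size is assumed here as the interleaving granularity.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* interleaving granularity assumed for this sketch */

/* UMA-style layout: consecutive cache lines rotate across the memory nodes. */
unsigned node_interleaved(uint64_t addr, unsigned num_nodes)
{
    assert(num_nodes > 0);
    return (unsigned)((addr / CACHE_LINE_BYTES) % num_nodes);
}

/* NUMA-style layout: each node owns one contiguous block of bytes_per_node. */
unsigned node_contiguous(uint64_t addr, uint64_t bytes_per_node, unsigned num_nodes)
{
    assert(num_nodes > 0 && bytes_per_node > 0);
    uint64_t node = addr / bytes_per_node;
    /* Clamp addresses beyond the last block to the last node. */
    return (unsigned)(node < num_nodes ? node : num_nodes - 1);
}
```

With two nodes, addresses 0 and 64 map to different nodes under interleaving but to the same node under the contiguous layout. A thread streaming through an array therefore alternates between nodes in the first case and stays on one node in the second.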



Figure: System and Memory Map

### Processor Architecture Basics: UMA vs. NUMA

### **UMA (non-NUMA) is recommended:**

- Avoid additional layer of complexity to tune for NUMA
- Portable NUMA-tuning is very difficult
- Future platforms based on Intel or non-Intel processors might require different NUMA tuning
- Own NUMA strategy might conflict with optimizations in 3<sup>rd</sup> party code or OS (scheduler)

### Use NUMA if...

- Memory access latency and bandwidth are the dominating bottleneck
- Developers are willing to deal with additional complexity
- System is dedicated to few applications worth the NUMA tuning
- Benchmark situation

### Processor Architecture Basics: NUMA - Thread Affinity & Enumeration

### **Non-NUMA:**

Thread affinity **might** be beneficial (e.g. cache locality) but not required

### NUMA:

Thread affinity is **required**:

- Improve accesses to local memory vs. remote memory
- Ensure 3<sup>rd</sup> party components support affinity mapping, e.g.:
  - Intel<sup>®</sup> TBB via set\_affinity()
  - Intel<sup>®</sup> OpenMP\* via \$OMP\_PLACES
  - Intel<sup>®</sup> MPI via \$I\_MPI\_PIN\_DOMAIN
- Right way to get enumeration of cores: Intel<sup>®</sup> 64 Architecture Processor Topology Enumeration <u>https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration</u>

# Processor Architecture Basics: NUMA - Memory, Bandwidth & Latency

### Memory allocation:

- Differentiate: implicit vs. explicit memory allocation
- Explicit allocation with NUMA aware libraries, e.g. libnuma (Linux\*)
- Bind memory ⇔ (SW) thread, and (SW) thread ⇔ processor
- More information on optimizing for performance: <u>https://software.intel.com/de-de/articles/optimizing-applications-for-numa</u>



- Remote memory access **latency** ~1.7x greater than local memory
- Local memory bandwidth can be up to ~2x greater than remote
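
To get a feel for what the ~1.7x latency figure means for a whole application, the average latency can be estimated from the fraction of accesses that go to remote memory. A minimal sketch, assuming a simple linear mix and ignoring caches, prefetching and contention; the function name is illustrative:

```c
#include <assert.h>

/* Remote/local latency ratio taken from the figure quoted above (~1.7x). */
#define REMOTE_LATENCY_FACTOR 1.7

/* Average memory latency relative to purely local access (1.0),
 * for a given fraction of accesses that go to remote memory. */
double relative_avg_latency(double remote_fraction)
{
    assert(remote_fraction >= 0.0 && remote_fraction <= 1.0);
    return (1.0 - remote_fraction) + remote_fraction * REMOTE_LATENCY_FACTOR;
}
```

For example, if half of all accesses are remote, this model gives an average latency of about 1.35x the local latency.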

# **NUMA Information**

#### Get the NUMA configuration: http://www.open-mpi.org/projects/hwloc/ or numactl --hardware

### **Documentation (libnuma & numactl):**

http://halobates.de/numaapi3.pdf

#### numactl:

- \$ numactl --cpubind=0 --membind=0 <exe>
- \$ numactl --interleave=all <exe>

#### libnuma:

- Link with -lnuma
- Thread binding and preferred allocation: numa\_run\_on\_node(node\_id); numa\_set\_preferred(node\_id);
- Allocation example: void \*mem = numa\_alloc\_onnode(bytes, node\_id);

Copyright © 2015, Intel Corporation. All rights reserved. \*Other names and brands may be claimed as the property of others.


## Desktop, Mobile & Server

"Big Core"



**Optimization Notice** 


## Desktop, Mobile & Server: Tick/Tock Model






### Desktop, Mobile & Server: Your Source for Intel<sup>®</sup> Product Information

#### Naming schemes:

- Desktop & Mobile:
  - Intel<sup>®</sup> Core<sup>™</sup> i3/i5/i7 processor family
  - 5 generations, e.g.: 4<sup>th</sup> Generation Intel<sup>®</sup> Core<sup>™</sup> i5-4XXX, 5<sup>th</sup> Generation Intel<sup>®</sup> Core<sup>™</sup> i5-5XXX
  - 1<sup>st</sup> generation starts with "Nehalem"
- Server:
  - Intel<sup>®</sup> Xeon<sup>®</sup> E3/E5/E7 processor family
  - 3 generations, e.g.: Intel<sup>®</sup> Xeon<sup>®</sup> Processor E3-XXXX v3
  - 1<sup>st</sup> generation starts with "Sandy Bridge"

Information about available Intel products can be found here: <u>http://ark.intel.com/</u>



### Desktop, Mobile & Server: Learn About Intel<sup>®</sup> Processor Numbers

#### Processor numbers follow a specific scheme:

- Processor type
- Processor family/generation
- Product line/brand
- Power level
- Core count
- Multi-processor count
- Socket type

Screenshot: Intel web page on processor numbers for the 2nd generation Intel® Core™ processor family. Processor numbers have an alpha/numerical identifier followed by a four-digit numerical sequence, and may have an alpha suffix depending on the processor. Suffix examples shown: unlocked (i7-2600K), performance optimized lifestyle (i5-2500S/i5-2400S), power optimized lifestyle (i5-2500T).

Encoding of the processor numbers can be found here: <u>http://www.intel.com/products/processor_number/eng/</u>


## Desktop, Mobile & Server: Characteristics

- Processor core:
  - 4 issue
  - Superscalar out-of-order execution
  - Simultaneous multithreading: Intel<sup>®</sup> Hyper-Threading Technology with 2 HW threads per core
- Multi-core:
  - Intel<sup>®</sup> Core<sup>™</sup> processor family: up to 8 cores (desktop & mobile)
  - Intel<sup>®</sup> Xeon<sup>®</sup> processor family: up to 18 cores (server)
- Caches:
  - Three level cache hierarchy L1/L2/L3 (Nehalem and later)
  - 64 byte cache line

## Desktop, Mobile & Server: Caches

### Cache hierarchy:



| Level                            | Latency (cycles)               | Bandwidth<br>(per core per cycle) | Size                     |
|----------------------------------|--------------------------------|-----------------------------------|--------------------------|
| L1-D                             | 4                              | 2x 16 bytes                       | 32KiB                    |
| L2 (unified)                     | 12                             | 1x 32 bytes                       | 256KiB                   |
| L3 (LLC)                         | 26-31                          | 1x 32 bytes                       | varies (≥ 2MiB per core) |
| L2 and L1 D-Cache in other cores | 43 (clean hit), 60 (dirty hit) |                                   |                          |

Example for 4th Generation Intel<sup>®</sup> Core<sup>™</sup>
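
A first-order use of these sizes is to check which cache level a working set can fit into. The sketch below is a rough estimate only: it uses the per-core L1-D and L2 sizes from the table, takes the L3 size as a parameter because it varies by product, and ignores associativity, data shared with other threads and anything else competing for the cache. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Per-core sizes from the table above (4th Generation Intel Core example). */
#define L1D_BYTES (32u * 1024u)
#define L2_BYTES  (256u * 1024u)

/* Returns 1, 2 or 3 for the smallest cache level that can hold the working
 * set, or 0 if it is larger than the last level cache. */
int smallest_fitting_cache_level(size_t working_set_bytes, size_t llc_bytes)
{
    assert(llc_bytes >= L2_BYTES);
    if (working_set_bytes <= L1D_BYTES) return 1;
    if (working_set_bytes <= L2_BYTES)  return 2;
    if (working_set_bytes <= llc_bytes) return 3;
    return 0;
}
```

A 200 KiB working set, for instance, no longer fits into L1-D but still fits into L2, so loads that miss L1-D would be expected to see roughly the L2 latency from the table.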


## Desktop, Mobile & Server: Intel® Hyper-Threading Technology I

- 4 issue, superscalar, out-of-order processor:
  - 4 instructions are decoded to uops per cycle
  - Multiple uops are scheduled and executed in the backend
  - Backend-bound pipeline stalls likely (long 14 stage pipeline)
- Problem: How to increase instruction throughput? Solution: Intel<sup>®</sup> Hyper-Threading Technology (HT)
  - Simultaneous multi-threading (SMT)
  - Two threads per core
- Threads per core share the same resources partitioned or duplicated:
  - L/S buffer and ROB
  - Scheduler (reservation station)
  - Execution Units
  - Caches
  - TLB

## Desktop, Mobile & Server: Intel® Hyper-Threading Technology II

- Not shared are:
  - Registers
  - Architectural state
- Smaller extensions needed:
  - More uops in backend need to be handled
  - ROB needs to be increased
- Easy and efficient to implement:
  - Low die cost: Logic duplication is minimal
  - Easy to handle for a programmer (multiple SW threads)
  - Can be selectively used, depending on workload
  - Some workloads can benefit from SMT: just enable it
- More insights to Intel<sup>®</sup> Hyper-Threading Technology: <u>https://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology</u>

## Desktop, Mobile & Server: New Instructions in 4th Generation Intel® Core™ (Haswell)

| Group | Subgroup | Description | Count\* |
|-------|----------|-------------|---------|
| Intel<sup>®</sup> AVX2 | SIMD integer instructions promoted to 256 bits | | 170 / 124 |
| | Gather | Load elements using a vector of indices, vectorization enabler | |
| | Shuffling / Data Rearrangement | Blend, element shift and permute instructions | |
| FMA | | Fused Multiply-Add operation forms (FMA-3) | 96 / 60 |
| Bit Manipulation and Cryptography | | Improving performance of bit stream manipulation and decode, large integer arithmetic and hashes | 15 / 15 |
| TSX=RTM+HLE | | Transactional Memory | <del>4/4</del> |
| Other | | | 2 / 2 |

\* Total instructions / different mnemonics


## Desktop, Mobile & Server: Performance

- Following Moore's Law:

| Microarchitecture | Instruction<br>Set        | SP FLOPs<br>per Cycle<br>per Core | DP FLOPs<br>per Cycle<br>per Core | L1 Cache Bandwidth<br>(bytes/cycle) | L2 Cache<br>Bandwidth<br>(bytes/cycle) |
|-------------------|---------------------------|-----------------------------------|-----------------------------------|-------------------------------------|----------------------------------------|
| Nehalem           | SSE<br>(128-bits)         | 8                                 | 4                                 | 32<br>(16B read + 16B write)        | 32                                     |
| Sandy Bridge      | Intel® AVX<br>(256-bits)  | 16                                | 8                                 | 48<br>(32B read + 16B write)        | 32                                     |
| Haswell           | Intel® AVX2<br>(256-bits) | 32                                | 16                                | 96<br>(64B read + 32B write)        | 64                                     |

- Example of theoretical peak FLOP rates:
  - Intel<sup>®</sup> Core<sup>™</sup> i7-2710QE (Sandy Bridge):
     2.1 GHz \* 16 SP FLOPs \* 4 cores = 134.4 SP GFLOPs
  - Intel<sup>®</sup> Core<sup>™</sup> i7-4765T (Haswell):
     2.0 GHz \* 32 SP FLOPs \* 4 cores = 256 SP GFLOPs
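
The two peak rates above follow the same product of clock frequency, FLOPs per cycle per core and core count. A minimal sketch of that arithmetic; the function name is illustrative, and the result is a theoretical upper bound that ignores turbo, memory limits and instruction mix:

```c
#include <assert.h>

/* Theoretical peak in GFLOPs = clock (GHz) x FLOPs per cycle per core x cores. */
double peak_gflops(double ghz, int flops_per_cycle_per_core, int cores)
{
    assert(ghz > 0.0 && flops_per_cycle_per_core > 0 && cores > 0);
    return ghz * flops_per_cycle_per_core * cores;
}
```

Calling `peak_gflops(2.1, 16, 4)` reproduces the Sandy Bridge figure and `peak_gflops(2.0, 32, 4)` the Haswell one. Passing the DP values from the table (8 and 16) instead gives the corresponding double-precision peaks.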

## Desktop, Mobile & Server: FMA Latency

- FMA latency is better than that of separate multiply and add instructions:
  - Add latency: 3 cycles
  - Multiply and FMA latencies: 5 cycles
- But FMA does not give the best latency for every combination of operations, e.g.:



#### ⇒ FMA can improve or reduce performance depending on context!
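
The effect can be illustrated by adding up latencies along the dependency chain. This sketch uses the latencies listed above. The two expressions are a commonly used illustration chosen for this sketch, not necessarily the example from the original slide. Only the critical path is modeled, and the two independent multiplies are assumed to execute in parallel.

```c
#include <assert.h>

/* Latencies in cycles, as listed above. */
enum { ADD_LATENCY = 3, MUL_LATENCY = 5, FMA_LATENCY = 5 };

/* r = a*b + c, computed as a multiply followed by a dependent add. */
int latency_mul_add(void) { return MUL_LATENCY + ADD_LATENCY; }

/* r = a*b + c, computed as a single fused multiply-add. */
int latency_fma(void) { return FMA_LATENCY; }

/* r = a*b + c*d without FMA: both multiplies are independent and overlap,
 * then one add combines the products. */
int latency_two_mul_add(void) { return MUL_LATENCY + ADD_LATENCY; }

/* r = a*b + c*d with FMA: t = c*d must finish before fma(a, b, t) starts. */
int latency_mul_then_fma(void) { return MUL_LATENCY + FMA_LATENCY; }
```

For `a*b + c` the fused form wins (5 vs. 8 cycles). For `a*b + c*d` the chain of multiply plus FMA takes 10 cycles, while two overlapped multiplies plus an add take 8.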


## Desktop, Mobile & Server: Intel® Xeon® Processor, Ivy Bridge vs. Haswell

| Feature                   | Ivy Bridge                       | Haswell             |
|---------------------------|----------------------------------|---------------------|
| QPI Speed (GT/s)          | 6.4, 7.2 and 8.0                 | 6.4, 8.0, 9.6       |
| Cores                     | Up to 12                         | Up to 18            |
| Last Level Cache (LLC)    | Up to 30 MB                      | Up to 45 MB         |
| Memory                    | DDR3-800/1066/1333/<br>1600/1866 | DDR4-1600/1866/2133 |
| Max. Memory Bandwidth     | 59.7 GB/s                        | 68 GB/s             |
| Instruction Set Extension | Intel <sup>®</sup> AVX           | Intel® AVX2         |
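
The maximum memory bandwidth figures in the table are consistent with a simple product of transfer rate, channel width and channel count. A minimal sketch, assuming four memory channels per socket at the fastest supported speed and 8 bytes per transfer (64-bit DDR channels); the function name is illustrative:

```c
#include <assert.h>

/* Each DDR3/DDR4 channel is 64 bits wide, i.e. 8 bytes per transfer. */
#define BYTES_PER_TRANSFER 8

/* Theoretical peak bandwidth per socket in GB/s (1 GB = 10^9 bytes). */
double peak_mem_bandwidth_gbs(int mega_transfers_per_s, int channels)
{
    assert(mega_transfers_per_s > 0 && channels > 0);
    return (double)mega_transfers_per_s * BYTES_PER_TRANSFER * channels / 1000.0;
}
```

With four channels, DDR3-1866 gives about 59.7 GB/s and DDR4-2133 about 68.3 GB/s, in line with the Ivy Bridge and Haswell values above.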

## Desktop, Mobile & Server: New Instructions in 5th Generation Intel® Core™

| Instruction | Description                                                      |
|-------------|------------------------------------------------------------------|
| RDSEED      | provide reliable seeds for pseudo-random number generator (PRNG) |
| ADCX, ADOX  | large integer arithmetic addition                                |
| PREFETCHW   | extending SW prefetch                                            |

Supported via intrinsics:

- Intel Compilers 13.0 Update 2 (and later)
- GNU GCC 4.8

## 4<sup>th</sup> Generation Intel<sup>®</sup> Core<sup>™</sup> (Haswell): Execution Unit Overview



# Haswell EP Die Configurations



| Chop | Columns | Home Agents | Cores | Power (W) | Transistors (B) | Die Area (mm²) |
|------|---------|-------------|-------|-----------|-----------------|----------------|
| HCC  | 4       | 2           | 14-18 | 110-145   | 5.69            | 662            |
| MCC  | 3       | 2           | 6-12  | 65-160    | 3.84            | 492            |
| LCC  | 2       | 1           | 4-8   | 55-140    | 2.60            | 354            |


### Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5-2600 v3 Product Family: Snoop Modes

Each mode is configurable through BIOS settings

- Early Snoop Mode
  - Intel's BIOS default for HSW-EP
  - Same mode available on SNB-EP
  - For applications needing the lowest memory latency or low cache-to-cache latency to the remote socket
- Home Snoop Mode
  - Same mode available on IVB-EP\*
  - Optimized for NUMA applications
- Cluster on Die Mode
  - New mode introduced on HSW-EP
  - Lowest memory latency and highest bandwidth; requires NUMA app.!

- Memory bandwidth & latency tradeoffs will vary across the 3 die configurations for each snoop mode
- Intel recommends exposing all snoop modes as BIOS options to the user

\*Home Snoop mode is available on IVB-EP but is not the default setting



# Cluster on Die (COD) Mode

#### Supported on 2S HSW-EP SKUs with 2 Home Agents (10+ cores)

- Targeted at NUMA workloads where latency is more important than sharing data across Caching Agents (Cbo)
  - Reduces average LLC hit and local memory latencies
  - HA mostly sees requests from reduced set of threads which can lead to higher memory bandwidth
- OS/VMM own NUMA and process affinity decisions



#### COD Mode for 18C HSW-EP


### Feature Comparison across Intel<sup>®</sup> Xeon<sup>®</sup> Generations

|  | Intel® Xeon® Processor X5600 Series (Westmere-EP) | Intel® Xeon® Processor E5-2600 Product Family (Sandy Bridge-EP) | Intel® Xeon® Processor E5-2600 v2 Product Family (Ivy Bridge-EP) | Intel® Xeon® Processor E5-2600 v3 Product Family (Haswell-EP) |
|---|---|---|---|---|
| <u>Essentials</u> | | | | |
| Launch Date | Q1'11 | Q1'12 | Q3'13 | Q3'14 |
| Maximum # of Cores | | 8 | 12 | 18 |
| Maximum # of Threads | 12 | 16 | 24 | 36 |
| Last Level Cache (LLC) | | Up to 20 MB | Up to 30 MB | Up to 45 MB |
| Maximum QPI Bus Speed | | 8 GT/s | 8 GT/s | 9.6 GT/s |
| Instruction Set Extensions | | Intel® AVX | Intel® AVX | Intel® AVX2 |
| Intel Process Technology | | 32 nm | 22 nm | 22 nm |
| Intel® Turbo Boost Technology | 1.0 | 2.0 | 2.0 | 2.0 |
| Power Management | Cores<br>Fixed Uncore Frequency | Same P-States for All Cores<br>Same Core and Uncore Frequency | Same P-States for All Cores<br>Same Core and Uncore Frequency | Per Core P-States<br>Independent Uncore Frequency Scaling |
| Memory Specifications | | | | |
| Max Memory Size per Socket | 288 GB | 384 GB | 768 GB | 768 GB |
| Memory Types | DDR3-800/1066/1333<br>RDIMM/UDIMM | DDR3-800/1066/1333/1600<br>RDIMM/UDIMM<br>Quad Rank LRDIMM | DDR3-800/1066/1333/1600/1866<br>RDIMM/UDIMM<br>Quad Rank LRDIMM | DDR4-1600/1866/2133<br>RDIMM<br>Quad Rank LRDIMM |
| # of Memory Channels | 3 | 4 | 4 | 4 |
| Max # of DIMMs/Channel | | 3 | 3 | 3 |
| Max # of DIMMs/Socket | | 12 | 12 | 12 |
| Theoretical Max Memory Bandwidth per Socket | 32 GB/s | 51.2 GB/s | 59.7 GB/s | 68.2 GB/s |
| Expansion Options | | | | |
| PCI Express Revision | | 3.0 | 3.0 | 3.0 |
| Max # of PCI Express Lanes | N/A | 40 | 40 | 40 |
| PCI Express Configurations | N/A | x4, x8, x16 | x4, x8, x16<br>x16 Non-Transparent Bridge | x4, x8, x16<br>x16 Non-Transparent Bridge |

All comparisons based on a single socket.


# System Configurations for HPC App.

|  | Intel® Xeon® E5-2697 v2 | Intel® Xeon® E5-2667 v3 | Intel® Xeon® E5-2680 v3 | Intel® Xeon® E5-2695 v3 | Intel® Xeon® E5-2697 v3 | Intel® Xeon® E5-2699 v3 |
|---|---|---|---|---|---|---|
| Sockets / Cores | 2 x 12C | 2 x 8C | 2 x 12C | 2 x 14C | 2 x 14C | 2 x 18C |
| Base Freq | 2.7GHz, C0 | 3.2GHz, R2 | 2.5GHz, M0 | 2.3GHz, C1 | 2.6GHz, C0 | 2.3GHz |
| Turbo Mode | Enabled | Enabled | Enabled | Enabled | Enabled | Enabled |
| Memory | 64 GB DDR3-1867 | 128 GB DDR4-2133 | 128 GB DDR4-2133 | 128 GB DDR4-2133 | 128 GB DDR4-2133 | 128 GB DDR4-2133 |
| Platform | Romley-EP Server SDP | Wildcat Pass | Wildcat Pass | Wildcat Pass | Mayan City | Mayan City |

HT and snoop mode settings per workload (HT: Intel® Hyper-Threading Technology; snoop modes: ES = Early Snoop, HS = Home Snoop, COD = Cluster on Die):

| Workload | E5-2697 v2: HT | E5-2667 v3: HT | E5-2667 v3: Snoop Mode | E5-2680 v3: HT | E5-2680 v3: Snoop Mode | E5-2695 v3: HT | E5-2695 v3: Snoop Mode | E5-2697 v3: HT | E5-2697 v3: Snoop Mode | E5-2699 v3: HT | E5-2699 v3: Snoop Mode |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Computer Aided Engineering                             |                             |                 |                  |               |                         |                |                |                 |               |              |               |
| Finite Element Analysis                                | ON                          | ON              | ES               | OFF           | COD                     | OFF            | COD            | ON              | COD           | ON           | COD           |
| Dynamic Analysis                                       | ON                          | ON              | ES               | ON            | ES                      | ON             | ES             | ON              | HS            | ON           | HS            |
| Computational Fluid Dynamics                           | ON                          | ON              | ES               | ON            | COD                     | ON             | COD            | ON              | COD           | ON           | COD           |
| Multiphysics Simulation                                | ON                          | ON              | ES               | OFF           | COD                     | OFF            | COD            | ON              | COD           | ON           | COD           |
| Crash Simulation                                       | ON                          | ON              | ES               | ON            | COD                     | ON             | COD            | ON              | COD           | ON           | COD           |
| Energy                                                 |                             |                 |                  |               |                         |                |                |                 |               |              |               |
| seis-kernel2 (Kirchhoff-PB) v1.1-<br>12.0              | ON                          | ON              | ES               | ON            | COD                     | ON             | COD            | ON              | COD           | ON           | COD           |
| seis-kernel3 (TTI-T4-WG) v1.1-<br>12.0                 | ON                          | OFF             | ES               | ON            | COD                     | OFF            | ES             | OFF             | HS            | ON           | HS            |
| Financial Services                                     |                             |                 |                  |               |                         |                |                |                 |               |              |               |
| binomialcpu v3.0-13.1.0_AVX /<br>AVX2                  | ON                          | ON              | ES               | ON            | COD                     | ON             | COD            | ON              | HS            | ON           | HS            |
| BlackScholes v5.0-13.1.1_AVX /<br>AVX2                 | ON                          | ON              | ES               | ON            | ES                      | ON             | COD            | ON              | COD           | ON           | COD           |
| FNR v4.0-12.0                                          | ON                          | ON              | ES               | ON            | COD                     | ON             | ES             | ON              | COD           | ON           | ES            |
| MonteCarlo v3.0-13.1.0-AVX /<br>AVX2                   | ON                          | ON              | ES               | OFF           | ES                      | ON             | COD            | OFF             | COD           | ON           | COD           |
| Life Sciences                                          |                             |                 |                  |               |                         |                |                |                 |               |              |               |
| Amber v12-13.1.0_SSE4.2 / v12-<br>14.0.0 AVX2          | ON                          | ON              | ES               | OFF           | ES                      | OFF            | ES             | ON              | HS            | ON           | ES            |
| Blast v2.2.28+_13.1.0_OPT2                             | ON                          | ON              | ES               | ON            | ES                      | ON             | COD            | ON              | COD           | ON           | COD           |
| bowtie2 v2-2.1.0.0-13.1 / AVX2                         | ON                          | ON              | ES               | ON            | COD                     | ON             | COD            | ON              | HS            | ON           | ES            |
| Gamess v01MAY2012-R1-12.1* / v01MAY2013.R1-14.0.1_AVX2 | ON                          | ON              | ES               | ON            | ES                      | ON             | COD            | ON              | COD           | ON           | COD           |
| Gaussian g09-D.01                                      | ON                          | OFF             | ES               | OFF           | COD                     | OFF            | COD            | OFF             | COD           | ON           | HS            |
| Gromacs v4.6.1-13.1.1_AVX                              | ON                          | ON              | ES               | ON            | ES                      | ON             | COD            | ON              | ES            | ON           | ES            |
| NAMD v2.9-13.1.1_OPT3 / v2.9-<br>14.0.0_AVX2           | ON                          | ON              | ES               | OFF           | ES                      | OFF            | COD            | ON              | ES            | ON           | COD           |
| MILC v7.7.8-13.1.1_OPT3 /<br>v7.7.8-14.0.0_AVX2        | ON                          | ON              | ES               | ON            | COD                     | ON             | COD            | ON              | COD           | ON           | COD           |
| Numerical Weather                                      |                             |                 |                  |               |                         |                |                |                 |               |              |               |
| HOMME v2841-20130227_AVX                               | ON                          | ON              | ES               | ON            | COD                     | ON             | COD            | ON              | COD           | ON           | COD           |
| ROMS v3.0-12.0_AVX / v3.6.690-<br>14.0.0_AVX2          | ON                          | OFF             | ES               | OFF           | COD                     | OFF            | COD            | ON              | COD           | ON           | COD           |
| WRF v3.1-11.1_AVX / v3.5-<br>14.0.0_AVX2               | ON                          | ON              | ES               | OFF           | COD                     | OFF            | COD            | ON              | COD           | ON           | COD           |

## Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor

**High Performance Computing** 


## Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Generations



### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Your Source for Intel<sup>®</sup> Product Information

#### Family:

- Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor 3XXX:
  - Entry level
  - 57 cores
  - 6 GB memory
- Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor 5XXX:
  - Mid level
  - 60 cores
  - 8 GB memory
- Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor 7XXX:
  - Top level
  - 61 cores
  - 16 GB memory

Information about available Intel products can be found here: <u>http://ark.intel.com/</u>

| Product Name                                                  | Status   | Launch Date | # of Cores | Max TDP | Recommended Customer Price |
|---------------------------------------------------------------|----------|-------------|------------|---------|----------------------------|
| Intel® Xeon Phi™ Coprocessor 7120X (16GB, 1.238 GHz, 61 core) | Launched | Q2'13       | 61         | 300 W   | \$4129.00                  |
| Intel® Xeon Phi™ Coprocessor 7120P (16GB, 1.238 GHz, 61 core) | Launched | Q2'13       | 61         | 300 W   | \$4129.00                  |
| Intel® Xeon Phi™ Coprocessor 7120D (16GB, 1.238 GHz, 61 core) | Launched | Q1'14       | 61         | 270 W   | \$4235.00                  |
| Intel® Xeon Phi™ Coprocessor 7120A (16GB, 1.238 GHz, 61 core) | Launched | Q2'14       | 61         | 300 W   | \$4235.00                  |
| Intel® Xeon Phi™ Coprocessor 5120D (8GB, 1.053 GHz, 60 core)  | Launched | Q2'13       | 60         | 245 W   | \$2759.00                  |
| Intel® Xeon Phi™ Coprocessor 5110P (8GB, 1.053 GHz, 60 core)  | Launched | Q4'12       | 60         | 225 W   | \$2437.00 - \$2649.00      |
| Intel® Xeon Phi™ Coprocessor 3120P (6GB, 1.100 GHz, 57 core)  | Launched | Q2'13       | 57         | 300 W   | \$1695.00                  |
| Intel® Xeon Phi™ Coprocessor 3120A (6GB, 1.100 GHz, 57 core)  | Launched | Q2'13       | 57         | 300 W   | \$1695.00 - \$1960.00      |

### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Characteristics I

- Coprocessor core:
  - 2 issue (with 2 cycle delay)
  - In-order execution
  - Simultaneous multithreading: 4 HW threads per core
- Multi-core (many-core):
  - Up to 61 cores (57/60/61) per coprocessor
  - Up to 8 coprocessors already validated per host system (node)
- Caches:
  - Two level cache hierarchy L1/L2
  - 64 byte cache line

### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Characteristics II

- Page sizes:
  - 4 kB
  - 64 kB
  - 2 MB
- Others:
  - Coprocessor(s) connected to node via PCIe
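
The effect of the page sizes listed above can be made concrete with a small calculation. The following is a minimal sketch; the helper name `pages_needed` and the macro names are illustrative assumptions, not part of any Intel API. It counts how many pages are required to map a buffer of a given size.

```c
#include <stdint.h>

/* Page sizes listed on the slide, in bytes. */
#define PAGE_4K  (4ull * 1024ull)
#define PAGE_64K (64ull * 1024ull)
#define PAGE_2M  (2ull * 1024ull * 1024ull)

/* Number of pages needed to map `bytes` bytes, rounded up.
 * Hypothetical helper for illustration only. */
uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

For a 1 GiB buffer, `pages_needed` yields 262144 pages of 4 kB but only 512 pages of 2 MB, so far fewer address translations have to be cached when large pages are used.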

Architecture: <u>https://software.intel.com/sites/default/files/article/393195/intel-xeon-phi-core-micro-architecture.pdf</u>

Data sheet: <u>http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-phi-coprocessor-datasheet.pdf</u>

### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Intel<sup>®</sup> Xeon<sup>®</sup> Processor vs. Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor

### Intel<sup>®</sup> Xeon<sup>®</sup> Processor:

- Optimal for workloads with...
  - High single-thread performance
  - High memory capacity
- Core/memory connections via sockets and nodes
- Instruction set:
  - SIMD SSE 128-bit & AVX 256-bit
  - Gather, FMA, virtualization, AES, etc.

### **Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor:**

- Optimal for workloads with...
  - High parallelization
  - High memory bandwidth
- Up to 61 cores per die, connection via PCIe and nodes
- Instruction set:
  - SIMD 512-bit
  - Gather and scatter, FMA, masked instructions

### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Domain



Intel<sup>®</sup> Xeon Phi<sup>™</sup> coprocessor complements Intel<sup>®</sup> Xeon<sup>®</sup> processor!


### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Memory



- 57-61 cores
- 6-16 GB GDDR5 memory (ECC)
- PCIe Gen2 (client) x16 per direction
- 8 memory controllers (MC)
- 2 GDDR5 channels per MC
  - Up to 5.5 GT/s per channel

Intel<sup>®</sup> Xeon Phi<sup>™</sup> **theoretical** bandwidth: 8 MC \* 2 channels \* 5.5 GT/s \* 4 byte = **352 GB/s** 
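
The bandwidth formula above can be written as a one-line function. This is a minimal sketch; the function name `peak_bandwidth_gbs` and its parameter names are assumptions chosen for illustration.

```c
/* Theoretical peak memory bandwidth in GB/s:
 * memory controllers * channels per controller
 * * transfer rate (GT/s) * bytes per transfer.
 * Hypothetical helper for illustration only. */
double peak_bandwidth_gbs(int controllers, int channels_per_mc,
                          double gtransfers_per_s, int bytes_per_transfer)
{
    return controllers * channels_per_mc * gtransfers_per_s * bytes_per_transfer;
}
```

Calling `peak_bandwidth_gbs(8, 2, 5.5, 4)` reproduces the 352 GB/s from the slide. This is an upper bound only; measured bandwidth is lower.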


### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Memory Considerations

Limitations of the theoretical memory bandwidth:

- HW related (signal noise, GDDR5 overhead)
- Depending on the Intel<sup>®</sup> Xeon Phi<sup>™</sup> coprocessor family: Low frequency models cannot saturate the bandwidth with loads/stores.
- ECC on/off
- Page size (4 kB by default)
  ⇒ Use 2 MB pages (large pages)
- Application not purely memory bound

How to benchmark using **Stream Triad**: <u>https://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad</u>
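
For orientation, the Triad kernel itself is very small. The sketch below is a plain, untuned C version, not the optimized setup from the linked article; the function names `triad`, `triad_bytes` and `triad_gbs` are illustrative assumptions. Real measurements additionally need large arrays, threading, suitable alignment and repeated timing.

```c
#include <stddef.h>

/* STREAM Triad kernel: a[i] = b[i] + scalar * c[i]. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];
}

/* Bytes counted for one Triad pass: two loads and one store per element. */
size_t triad_bytes(size_t n)
{
    return 3 * n * sizeof(double);
}

/* Achieved bandwidth in GB/s for n elements processed in `seconds`. */
double triad_gbs(size_t n, double seconds)
{
    return (double)triad_bytes(n) / seconds / 1e9;
}
```

The achieved value from `triad_gbs` can then be compared against the theoretical peak to see how much of it an application-like access pattern actually reaches.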

### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Architecture

- Pentium scalar instruction set with x87
- Extended with full 64 bit addressing
- Decoding:
  - In-order operation
  - 2 cycle decoder, 2 issue (1 scalar & 1 vector)
- Simultaneous multi-threading:
  - 4 HW threads per core
  - 2 instruction prefetch per HW thread
  - Round robin
- Instruction latencies:
  - Scalar: 1 cycle
  - Vector: 4 cycles (throughput of 1 cycle)
- Two pipelines:
  - Scalar (V pipeline)
  - Vector/Scalar (U pipeline)
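
A practical consequence of the 2-cycle decoder is that one HW thread alone cannot keep a core busy. The sketch below models this under the assumption that a single HW thread can issue at most every other cycle, so at least two active threads per core are needed for full issue rate. The function names are illustrative, and the model ignores stalls and memory latency.

```c
/* HW threads available on a coprocessor with `cores` cores
 * (4 HW threads per core, as listed above). */
int hw_threads(int cores)
{
    return cores * 4;
}

/* Simplified model of the fraction of cycles in which a core can issue,
 * ASSUMING one HW thread can issue at most every other cycle. */
double issue_utilization(int active_threads_per_core)
{
    if (active_threads_per_core <= 0)
        return 0.0;
    return active_threads_per_core >= 2 ? 1.0 : 0.5;
}
```

Under this model a 61-core coprocessor offers 244 HW threads, and running only one thread per core caps the issue rate at half of the peak.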



### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor L1 Cache

- Size:
  - 32 KiB I-cache per core
  - 32 KiB D-cache per core
- 8 way associative
- 64 byte cache line
- 3 cycle access latency (address generation)
- Up to 8 outstanding requests
- Fully coherent
- Inclusive (L2 contains L1 data)



### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor L2 Cache

- Size:
  - 512 KiB unified per core
  - Total L2: 512 KiB x # cores (~30 MiB)
- 8 way associative
- 64 byte cache line
- 11 cycle access latency
- Up to 32 outstanding requests
- Streaming HW prefetcher
- Fully coherent
- Inclusive (L2 contains L1 data)
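
The cache parameters above determine the number of sets and the aggregate L2 capacity. The following minimal sketch uses the hypothetical helper names `cache_sets` and `total_l2_kib` to work these out.

```c
#include <stddef.h>

/* Number of sets in a set-associative cache:
 * capacity / (line size * associativity). */
size_t cache_sets(size_t size_bytes, size_t line_bytes, size_t ways)
{
    return size_bytes / (line_bytes * ways);
}

/* Aggregate L2 capacity in KiB, with 512 KiB of L2 per core. */
size_t total_l2_kib(size_t cores)
{
    return cores * 512;
}
```

With 64 byte lines and 8 ways, the 32 KiB L1 data cache has 64 sets and the 512 KiB L2 has 1024 sets. For 61 cores the total L2 is 31232 KiB, i.e. 30.5 MiB, which matches the "~30 MiB" on the slide.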



### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Vector Unit

- 32 vector registers of 512 bit each per HW thread
- Each holds 16 SP FP or 8 DP FP
- ALUs support:
  - 32 bit integer/FP operations
  - 64 bit integer/FP/logic operations
- Ternary operations including Fused Multiply Add (FMA)
- Broadcast/swizzle support and 16 bit FP up-convert
- 8 vector mask registers for per lane conditional operations
- Most ops have a 4-cycle latency and 1-cycle throughput
- Mostly IEEE 754-2008 compliant
- Not supported:
  - MMX<sup>™</sup> technology
  - Streaming SIMD Extensions (SSE)
  - Intel<sup>®</sup> Advanced Vector Extensions (Intel<sup>®</sup> AVX)
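
To illustrate what a mask register does for per-lane conditional operations, here is a scalar C model of a masked 16-lane single-precision multiply-add. It is a sketch only: the name `masked_fma16` is hypothetical, the loop is not vector code, and a hardware FMA rounds once whereas this expression may round twice.

```c
#include <stdint.h>

#define LANES_SP 16 /* 512 bit / 32 bit single precision */

/* Scalar model of a masked multiply-add over 16 lanes:
 * lanes whose mask bit is set compute a*b+c,
 * all other lanes keep their previous value in dst. */
void masked_fma16(float dst[LANES_SP], const float a[LANES_SP],
                  const float b[LANES_SP], const float c[LANES_SP],
                  uint16_t mask)
{
    for (int lane = 0; lane < LANES_SP; ++lane) {
        if (mask & (1u << lane))
            dst[lane] = a[lane] * b[lane] + c[lane];
    }
}
```

A 16 bit mask covers all 16 single-precision lanes; for double precision only the lower 8 bits would be needed.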



### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor SIMD Vectors



Example of Intel<sup>®</sup> Xeon Phi<sup>™</sup> coprocessor **theoretical peak** FLOP rate: 1.238 GHz \* 16 DP FLOPs \* 61 cores = **1.208 TeraFLOPs** 
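
The peak rate above follows from clock frequency, FLOPs per cycle and core count. In the minimal sketch below, the 16 DP FLOPs per cycle are taken as 8 double-precision lanes times 2 operations per FMA; the function names are illustrative assumptions.

```c
/* DP FLOPs per cycle per core: 8 double-precision lanes,
 * each FMA counted as 2 floating-point operations. */
int dp_flops_per_cycle(void)
{
    return 8 * 2;
}

/* Theoretical peak in GFLOP/s:
 * clock (GHz) * FLOPs per cycle per core * number of cores. */
double peak_gflops(double ghz, int flops_per_cycle, int cores)
{
    return ghz * flops_per_cycle * cores;
}
```

For a 61-core part at 1.238 GHz, `peak_gflops(1.238, dp_flops_per_cycle(), 61)` gives about 1208.3 GFLOP/s, i.e. the 1.208 TeraFLOPs quoted above. Code that does not use FMA in every cycle on every lane cannot reach this figure.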


### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Instruction Set

- Pentium scalar instruction set
- 64 bit extensions (e.g. 64 bit registers **rax**, **rbx**, ...)
- 512 bit SIMD vector registers: zmm0 ... zmm31
- Mask registers: k0 ... k7 (k0 is special, don't use)
- Backward compatibility to big core:
  - Missing SSE (128 bit) and AVX (256 bit) with "Knights Corner"
  - Compatibility with "Knights Landing"
- Legacy x87 also exists for scalar FP operations
  ⇒ For good performance, don't use!

Illustrations: Xi, Yi & results 32 bit integer

# Future "Knights Landing"



#### **PERFORMANCE**

- 3+ TeraFLOPS of double-precision peak theoretical performance per single socket node<sup>0</sup>

#### **INTEGRATION**

- Intel® Omni Scale™ fabric integration
- High-performance on-package memory (MCDRAM):
  - Over 5x STREAM vs. DDR4<sup>1</sup> → Over 400 GB/s
  - Up to 16GB at launch
  - NUMA support
  - Over 5x Energy Efficiency vs. GDDR5<sup>2</sup>
  - Over 3x Density vs. GDDR5<sup>2</sup>
  - In partnership with Micron Technology
  - Flexible memory modes including cache and flat

#### **SERVER PROCESSOR**

- Standalone bootable processor (running host OS) and a PCIe coprocessor (PCIe end-point device)
- Platform memory: up to 384GB DDR4 using 6 channels
- Reliability ("Intel server-class reliability")
- Power Efficiency (Over 25% better than discrete coprocessor)<sup>4</sup> → Over 10 GF/W
- Density (3+ KNL with fabric in 1U)<sup>5</sup>
- Up to 36 lanes PCIe\* Gen 3.0

#### **MICROARCHITECTURE**

- Over 8 billion transistors per die based on Intel's 14 nanometer manufacturing technology
- Binary compatible with Intel<sup>®</sup> Xeon<sup>®</sup> Processors with support for Intel<sup>®</sup> Advanced Vector Extensions 512 (Intel<sup>®</sup> AVX-512)<sup>6</sup>
- 3x Single-Thread Performance compared to Knights Corner<sup>7</sup>
- 60+ cores in a 2D Mesh architecture
- 2 cores per tile with 2 vector processing units (VPU) per core
- 1MB L2 cache shared between 2 cores in a tile (cache-coherent)
- "Based on Intel® Atom™ core (based on Silvermont microarchitecture) with many HPC enhancements":
  - 4 Threads / Core
  - 2X Out-of-Order Buffer Depth<sup>8</sup>
  - Gather/scatter in hardware
  - Advanced Branch Prediction
  - High cache bandwidth
  - 32KB Icache, Dcache
  - 2 x 64B Load ports in Dcache
  - 46/48 Physical/virtual address bits
- Most of today's parallel optimizations carry forward to KNL
- Multiple NUMA domain support per socket

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

All projections are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

<sup>0</sup> Over 3 Teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating point operations per cycle.

<sup>1</sup> Projected result based on internal Intel analysis of STREAM benchmark using a Knights Landing processor with 16GB of ultra high-bandwidth versus DDR4 memory with all channels populated.

<sup>2</sup> Projected result based on internal Intel analysis comparison of 16GB of ultra high-bandwidth memory to 16GB of GDDR5 memory used in the Intel® Xeon Phi<sup>™</sup> coprocessor 7120P.

<sup>3</sup> Compared to 1st Generation Intel® Xeon Phi<sup>™</sup> 7120P Coprocessor (formerly codenamed Knights Corner)

<sup>4</sup> Projected result based on internal Intel analysis using estimated performance and power consumption of a rack sized deployment of Intel® Xeon® processors and Knights Landing coprocessors as compared to a rack with KNL processors only

<sup>5</sup> Projected result based on internal Intel analysis comparing a discrete Knights Landing processor with integrated fabric to a discrete Intel fabric component card.

<sup>6</sup> Binary compatible with Intel® Xeon® Processors v3 (Haswell) with the exception of Intel® TSX (Transactional Synchronization Extensions)

<sup>7</sup> Projected peak theoretical single-thread performance relative to 1<sup>st</sup> Generation Intel® Xeon Phi™ Coprocessor 7120P

<sup>8</sup> Compared to the Intel® Atom<sup>™</sup> core (based on Silvermont microarchitecture)

# Future "Knights Landing" – cont'd NEW!

#### **FUTURE**

- Knights Hill is the codename for the 3rd generation of the Intel® Xeon Phi™ product family
- Based on Intel's 10 nanometer manufacturing technology
- Integrated 2nd generation Intel® Omni-Path Host Fabric Interface

#### **AVAILABILITY**

- First commercial HPC systems in 2H'15
- Knights Corner to Knights Landing upgrade program available today
- Intel Adams Pass board (1U half-width) is custom designed for Knights Landing (KNL) and will be available to system integrators for KNL launch; the board is OCP Open Rack 1.0 compliant, features 6 ch native DDR4 (1866/2133/2400MHz) and 36 lanes of integrated PCIe\* Gen 3 I/O






### AVX-512 - Greatly increased Register File



[Figure: vector register file. AVX-512 (2014, KNL): 32 x 512 bit registers]

## The Intel® AVX-512 Subsets [1]

#### AVX-512F: 512-bit Foundation instructions common between MIC and Xeon

- Comprehensive vector extension for HPC and enterprise
- All the key AVX-512 features: masking, broadcast...
- 32-bit and 64-bit integer and floating-point instructions
- Promotion of many AVX and AVX2 instructions to AVX-512
- Many new instructions added to accelerate HPC workloads

#### AVX-512CD: Conflict Detection instructions

- Allow vectorization of loops with possible address conflicts
- Will show up on Xeon

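
A typical loop with possible address conflicts is a histogram update such as `hist[idx[i]]++`, where two lanes of one vector may target the same element. The sketch below is a scalar C model of what conflict detection reports for 16 indices of 32 bit each; the name `conflict16` is hypothetical and the code only illustrates the idea, it does not use the instructions themselves.

```c
#include <stdint.h>

#define LANES 16 /* 16 indices of 32 bit in one 512 bit vector */

/* Scalar model of conflict detection:
 * bit j of out[i] is set when an earlier lane j < i
 * holds the same index as lane i. */
void conflict16(const uint32_t idx[LANES], uint32_t out[LANES])
{
    for (int i = 0; i < LANES; ++i) {
        out[i] = 0;
        for (int j = 0; j < i; ++j) {
            if (idx[j] == idx[i])
                out[i] |= 1u << j;
        }
    }
}
```

Lanes with a zero result have no earlier duplicate and can be updated together; lanes with a non-zero result have to wait for the lanes they conflict with.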


## The Intel<sup>®</sup> AVX-512 Subsets [2]

#### AVX-512DQ: Double and Quad word instructions

- All of the (packed) 32 bit/64 bit operations AVX-512F doesn't provide
- Close 64 bit gaps like VPMULLQ: packed 64x64 → 64
- Extend mask architecture to word and byte (to handle vectors)
- Packed/scalar converts of signed/unsigned to SP/DP

#### AVX-512BW: Byte and Word instructions

- Extend packed (vector) instructions to byte and word (16 and 8 bit) data types
  - MMX/SSE2/AVX2 re-promoted to AVX-512 semantics
- Mask operations extended to 32/64 bits to adapt to number of objects in 512 bit
- Permute architecture extended to words (VPERMW, VPERMI2W, ...)

#### AVX-512VL: Vector Length extensions

- Vector length orthogonality
  - Support for 128 and 256 bits instead of full 512 bit
- Not a new instruction set but an attribute of existing 512 bit instructions

### **Other New Instructions**




### AVX-512 – KNL and future XEON

- KNL and future Xeon architecture share a large set of instructions
  - but sets are not identical
- Subsets are represented by individual feature flags (CPUID)
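
Because each subset has its own CPUID feature flag, software can test for exactly the subsets it needs. The sketch below decodes the EBX value returned by CPUID leaf 7, sub-leaf 0, using the bit positions documented for the subsets named in this deck. The helper names are illustrative assumptions, and obtaining EBX itself (e.g. via a compiler's CPUID helper on x86) is left out to keep the example portable.

```c
#include <stdint.h>

/* Bit positions in CPUID.(EAX=7,ECX=0):EBX for AVX-512 subsets. */
enum {
    BIT_AVX512F  = 16,
    BIT_AVX512DQ = 17,
    BIT_AVX512CD = 28,
    BIT_AVX512BW = 30,
    BIT_AVX512VL = 31
};

/* Returns 1 if the given feature bit is set in ebx, else 0. */
int has_feature(uint32_t ebx, int bit)
{
    return (int)((ebx >> bit) & 1u);
}

/* Subset shared by KNL and future Xeon: Foundation + Conflict Detection. */
int has_common_avx512(uint32_t ebx)
{
    return has_feature(ebx, BIT_AVX512F) && has_feature(ebx, BIT_AVX512CD);
}
```

A dispatcher would check `has_common_avx512` first and then test the additional subsets (DQ, BW, VL) individually before selecting a code path that relies on them.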




### Intel<sup>®</sup> Compiler Processor Switches

| Switch            | Description                                      |
|-------------------|--------------------------------------------------|
| -xmic-avx512      | KNL only; already in 14.0                        |
| -xcore-avx512     | Future Xeon only; already in 15.0.1              |
| -xcommon-avx512   | AVX-512 subset common to both; already in 15.0.2 |
| -m, -march, /arch | Not yet!                                         |
| -ax\<avx512\>     | Same as for "-x\<avx512\>"                       |
| -mmic             | No – not for KNL                                 |

### Knights Landing Integrated On-Package Memory

- **Cache Model:** Let the hardware automatically manage the integrated on-package memory as an "L3" cache between KNL CPU and external DDR
- **Flat Model:** Manually manage how your application uses the integrated on-package memory and external DDR for peak performance
- **Hybrid Model:** Harness the benefits of both cache and flat models by segmenting the integrated on-package memory



### Maximizes performance through higher memory bandwidth and flexibility<sup>1</sup>

<sup>1</sup> As compared with Intel® Xeon Phi<sup>™</sup> x100 Coprocessor Family

Diagram is for conceptual purposes only and only illustrates a CPU and memory - it is not to scale, and is not representative of actual component layout.

## High Bandwidth On-Chip Memory API

- API is open-sourced (BSD licenses)
  - <u>https://github.com/memkind</u>
  - Uses jemalloc API underneath
    - <u>http://www.canonware.com/jemalloc/</u>
    - <u>https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919</u>

### Malloc replacement:

```
#include <memkind.h>
    hbw_check_available()
    hbw_malloc, _calloc, _realloc,... (memkind_t kind, ...)
    hbw_free()
    hbw_posix_memalign()
    hbw_get_size(), _psize()
ld ... -ljemalloc -lnuma -lmemkind -lpthread
```
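
A common usage pattern is to try high-bandwidth memory first and fall back to regular memory otherwise. The sketch below assumes the `hbw_check_available()`, `hbw_malloc()` and `hbw_free()` calls from `<hbwmalloc.h>`; the `USE_HBWMALLOC` build switch, the `buffer_t` type and the `buffer_alloc`/`buffer_free` helpers are hypothetical names introduced only for this illustration. Without the switch, the code compiles against plain `malloc`/`free`.

```c
#include <stdlib.h>
#ifdef USE_HBWMALLOC /* hypothetical build switch, e.g. -DUSE_HBWMALLOC */
#include <hbwmalloc.h>
#endif

/* Buffer that remembers which allocator produced it,
 * so that it is released with the matching free function. */
typedef struct {
    void *ptr;
    int   from_hbw; /* 1: hbw_malloc, 0: malloc */
} buffer_t;

buffer_t buffer_alloc(size_t bytes)
{
    buffer_t b = { NULL, 0 };
#ifdef USE_HBWMALLOC
    if (hbw_check_available() == 0) { /* 0: high-bandwidth memory available */
        b.ptr = hbw_malloc(bytes);
        b.from_hbw = (b.ptr != NULL);
    }
#endif
    if (b.ptr == NULL)
        b.ptr = malloc(bytes); /* fall back to regular memory */
    return b;
}

void buffer_free(buffer_t *b)
{
#ifdef USE_HBWMALLOC
    if (b->from_hbw) {
        hbw_free(b->ptr);
        b->ptr = NULL;
        b->from_hbw = 0;
        return;
    }
#endif
    free(b->ptr);
    b->ptr = NULL;
}
```

Keeping the origin of each allocation in the buffer avoids mixing `hbw_free()` and `free()`, and lets the same source run unchanged on systems without on-package memory.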

### HBW API for Fortran, C++

Fortran:

!DIR\$ ATTRIBUTES FASTMEM :: data\_object1, data\_object2

- All Fortran data types supported
- Global, local, stack or heap; scalar, array, ...
- Support in compiler 15.0 update 1 and later versions

C++:

Standard allocator replacement, e.g. for STL containers:

#include <hbwmalloc.h>

std::vector<int, hbwmalloc::hbw\_allocator>

Available already but not documented yet – working on documentation just now

### Intel<sup>®</sup> Software Development Emulator (SDE)

### Use Intel<sup>®</sup> Software Development Emulator (SDE) to test AVX-512 enabled code

- Will test instruction mix, not performance
- Does not emulate hardware (e.g. memory hierarchy) only ISA

### Use the SDE to answer

- Is compiler generating Intel<sup>®</sup> AVX-512/KNL-ready code for my source code already?
- How do I restructure my code so that Intel<sup>®</sup> AVX-512 code is generated?

Visit: Intel Xeon Phi Coprocessor code named "Knights Landing" - Application Readiness

## Agenda

- Introduction
- Intel<sup>®</sup> Architecture
  - Desktop, Mobile & Server
  - Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor
- Summary

### Summary

- More cores to come in the future
- Make sure your application scales with more cores
- Identify whether new technology is applicable for you and use it
- Parallelism is not just adding cores...

## Thank you!




# Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

