

# Innovating Toward Exascale...and Beyond

CHEP2015, Okinawa, Japan, April 13, 2015

Dr. William Magro, Intel Fellow, Software & Services Group; Chief Technologist, Technical Computing Software



# Legal Disclaimer & Optimization Notice

- INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
- Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
- Copyright © 2014, Intel Corporation. All rights reserved. Intel, Xeon, Xeon Phi, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

# HPC: Unwavering Progress... Amazing Impact





Top 500 FLOPS >50% CAGR For Past Decade

Source: Top500.org


#### One of the Most Advanced Supercomputers Ever Built

An Intel-led collaboration with ANL and Cray to accelerate discovery & innovation

- \>180 PFLOPS (option to increase up to 450 PF)
- \>50,000 nodes
- 13 MW
- 2018 delivery
- **18X** higher performance\*
- \>6X more energy efficient\*



**Prime Contractor** 



Source: Argonne National Laboratory and Intel.

\*Versus ANL's current largest system, named Mira (10 PF and 3.9 MW). Detailed comparison at www.intel.com/newsroom/assets/Intel\_Aurora\_factsheet.pdf

See "Legal Disclaimer & Optimization Notice" for important performance information. Other names and brands may be claimed as the property of others.

## Supercomputing Pulls All of HPC Along



Top 500 FLOPS >50%<sup>1</sup> CAGR For Past Decade

#500 system on Top 500 in 2004: ~600 GFLOPS

10 years later...one Intel® Xeon Phi™ coprocessor: ~1 TFLOPS

Sources: Top500.org, Intel
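To put the growth rate in perspective, the sketch below compounds an annual rate over a number of years. The function name `compound_growth` is illustrative only, and 50% is used as the lower bound quoted on the slide, so the result is a floor rather than a measured figure.

```c
#include <assert.h>

/* Growth factor after `years` of compounding at annual rate `cagr`
   (0.5 means 50% per year). Hypothetical helper, not from the slides. */
double compound_growth(double cagr, int years)
{
    assert(years >= 0);
    double factor = 1.0;
    for (int i = 0; i < years; i++) {
        factor *= 1.0 + cagr;   /* one year of compounding */
    }
    return factor;
}
```

Calling `compound_growth(0.5, 10)` yields roughly 57.7, so a sustained 50% CAGR corresponds to more than a 50-fold increase over a decade.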


#### HPC is Evolving...Powering Discovery in New Ways

Current and future Intel innovations aimed at overcoming architectural challenges

- Memory | Fabric | Storage
- Software Efficiency
- Energy-efficient Performance
- Space | Resiliency
- Fast and Efficient Data Mobility
- Persistent Memory Innovations
- Big Data Analytics
- Advanced Lustre\* File System



Extending HPC's Reach

Enabling HPC at Every Scale

\*Other names and brands may be claimed as the property of others.

## Breaking Down the Walls...Requires Co-design & Integration



### Moore's Law and Parallelism



Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten. Dotted line extrapolations by C. Moore.

## Code Modernization...Critical in the Multi-Core Era



# Multi-core CPU




## Increasingly Parallel Code Enables Increased Processor Parallelism...and Performance



|                   | Intel® Xeon® processor 64-bit series | Intel® Xeon® processor 5100 series | Intel® Xeon® processor 5500 series | Intel® Xeon® processor 5600 series | Intel® Xeon® processor code-named Sandy Bridge EP | Intel® Xeon® processor code-named Ivy Bridge EP | Intel® Xeon® processor code-named Haswell EX | Intel® Xeon Phi™ coprocessor Knights Corner | Intel® Xeon Phi™ processor & coprocessor Knights Landing<sup>1</sup> |
|-------------------|------|------|------|------|------|------|------|------|------|
| Core(s)           | 1    | 2    | 4    | 6    | 8    | 12   | 18   | 61   | 60+  |
| Threads           | 2    | 2    | 8    | 12   | 16   | 24   | 36   | 244  | 244+ |
| SIMD Width (bits) | 128  | 128  | 128  | 128  | 256  | 256  | 256  | 512  | 512  |

\*Product specification for launched and shipped products available on ark.intel.com.

1. Not yet launched.
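The vector-width row of the table translates into elements per instruction as sketched below. The helper names are hypothetical, and the second function rests on a simplifying assumption (one vector instruction in flight per core, ignoring clock rate, issue width, and memory limits), so it is an upper bound for illustration only.

```c
#include <assert.h>

/* Number of elements of `element_bits` each that fit in one vector
   register of `simd_width_bits` (e.g. 64-bit doubles in 512 bits). */
int simd_lanes(int simd_width_bits, int element_bits)
{
    assert(element_bits > 0 && simd_width_bits % element_bits == 0);
    return simd_width_bits / element_bits;
}

/* Illustrative upper bound on elements processed at once, assuming each
   core has exactly one vector instruction in flight. */
int concurrent_elements(int cores, int simd_width_bits, int element_bits)
{
    assert(cores > 0);
    return cores * simd_lanes(simd_width_bits, element_bits);
}
```

Under these assumptions, a 128-bit unit holds 2 doubles, a 512-bit unit holds 8, and 61 cores with 512-bit vectors expose 488 double-precision lanes. Code that is neither threaded nor vectorized uses one of them.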


# Intel<sup>®</sup> Xeon Phi<sup>™</sup> Product Family



<sup>1</sup> Claim based on calculated theoretical peak double precision performance capability for a single coprocessor. 16 DP FLOPS/clock/core \* 61 cores \* 1.238 GHz = 1.208 TeraFLOPS

<sup>2</sup> Over 3 Teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating point operations per cycle. FLOPS = cores x clock frequency x floating-point operations per cycle.

<sup>3</sup> Intel internal estimate
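The peak-performance formula in the footnotes can be checked with a few lines of C. This is a sketch only: `peak_gflops` is an illustrative name, and the 1.238 GHz clock is an assumption chosen because it is the value at which 16 FLOPS/clock/core on 61 cores reproduces the quoted 1.208 TFLOPS.

```c
#include <assert.h>

/* Theoretical peak in GFLOPS:
   cores x clock frequency (GHz) x floating-point operations per clock per core.
   Hypothetical helper for checking the footnote arithmetic. */
double peak_gflops(int cores, double clock_ghz, int flops_per_clock_per_core)
{
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_clock_per_core > 0);
    return cores * clock_ghz * flops_per_clock_per_core;
}
```

With these inputs, `peak_gflops(61, 1.238, 16)` evaluates to about 1208.3 GFLOPS, that is, 1.208 TFLOPS. This is a theoretical ceiling, not a measured result.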



#### Knights Landing: Next-Generation Intel<sup>®</sup> Xeon Phi<sup>™</sup> Products

**Platform Memory**

- Up to **384 GB** DDR4 (6 ch)

**Compute**

- Intel<sup>®</sup> Xeon<sup>®</sup> Processor Binary-Compatible
- 3+ TFLOPS<sup>1</sup>, 3x ST<sup>2</sup> (single-thread) perf. vs KNC
- 2D Mesh Architecture
- Out-of-Order Cores

**On-Package Memory**

- Over **5x** STREAM vs. DDR4<sup>3</sup>
- Up to **16 GB** at launch

**Omni-Path** (optional)

- 1<sup>st</sup> Intel processor to integrate

**I/O**

- Up to 36 PCIe 3.0 lanes

**Processor Package**

- Over 60 Cores
- Integrated Intel<sup>®</sup> Omni-Path

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit <a href="http://www.intel.com/performance">http://www.intel.com/performance</a>



# Intel<sup>®</sup> Parallel Studio XE: Portable, Standards-Based Programming



Unlike accelerators, optimizations for Intel® Xeon Phi<sup>™</sup> and Intel® Xeon® products share the same languages, directives, libraries, and tools.


# Simplifying Vectorization

#### **OpenMP 4: vectorization directive**

```
#pragma omp simd
for (int ray = 0; ray < N; ray++) {
  float Color = 0.0f, Opacity = 0.0f;
  int len = 0;
  int upper = raylen[ray];
  while (len < upper) {
    int voxel = ray + len;
    len++;
    if (visible[voxel] == 0) continue;   // skip hidden voxels
    float O = opacity[voxel];
    if (O == 0.0f) continue;             // skip fully transparent voxels
    float Shading = O + 1.0f;
    Color += Shading * (1.0f - Opacity);
    Opacity += O * (1.0f - Opacity);
    if (Opacity > THRESH) break;         // ray is saturated
  }
  color_out[ray] = Color;
}
```
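To try the loop in isolation, it can be wrapped in a function as sketched below. The function name `composite_rays`, the parameter types, and the `THRESH` value of 0.95 are assumptions made for illustration; the slide does not specify them. Compilers without OpenMP SIMD support enabled simply ignore the pragma and run the loop as scalar code.

```c
#include <assert.h>

#define THRESH 0.95f   /* assumed opacity cutoff; not given on the slide */

/* Accumulates color along each of `n` rays. Ray `ray` reads voxels
   ray .. ray + raylen[ray] - 1, so `visible` and `opacity` must be at
   least that long. All names here are illustrative. */
void composite_rays(int n, const int *raylen, const int *visible,
                    const float *opacity, float *color_out)
{
    assert(n >= 0);
    #pragma omp simd
    for (int ray = 0; ray < n; ray++) {
        float Color = 0.0f, Opacity = 0.0f;
        int len = 0;
        int upper = raylen[ray];
        while (len < upper) {
            int voxel = ray + len;
            len++;
            if (visible[voxel] == 0) continue;   /* hidden voxel */
            float O = opacity[voxel];
            if (O == 0.0f) continue;             /* transparent voxel */
            float Shading = O + 1.0f;
            Color += Shading * (1.0f - Opacity);
            Opacity += O * (1.0f - Opacity);
            if (Opacity > THRESH) break;         /* ray is saturated */
        }
        color_out[ray] = Color;
    }
}
```

Each ray has its own trip count and early exits, which is why an explicit directive helps here: the programmer asserts that the outer iterations are independent, so the compiler does not have to prove it.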

#### Intel<sup>®</sup> Advisor XE: Vectorization Advisor

[Screenshot: Intel Advisor XE "Threading and Vectorization Survey" report for `mmult_serial`, with tabs for Summary, Survey Report, Suitability Report, Correctness Report, and Memory Access Patterns. The hot loop (10.040 s self time, 99.4% of the 10.100 s total) is reported as vectorized with SSE2, with a gain estimate of 2.19727; the other listed loops are scalar.]

#### Intel® Parallel Computing Centers

- Over 50 centers worldwide
- Eight working in high-energy physics











UNIVERSIDADE ESTADUAL PAULISTA "JÚLIO DE MESQUITA FILHO"

Learn more @ https://software.intel.com/en-us/ipcc




# Intel<sup>®</sup> Omni-Path: the Next-Generation Fabric



- Host and Fabric Optimized for HPC
- Flexible Configurations
- End-to-End Solution

#### **INTEGRATION**



Coming in '15

- PCIe Adapters ✓
- Edge Switches ✓
- Director Systems
- Intel Silicon Photonics
- Open Software Tools\*

#### Intel<sup>®</sup> Omni-Path Architecture Benefits for "Every" Scale



<sup>1</sup> A 48-port Fat-Tree full bisectional bandwidth (FBB) topology based on a 36-port switch chip requires four (4) additional Edge Switches and 54 additional cables.

<sup>2</sup> Latency reductions based on Mellanox CS7500 Director Switch and Mellanox SB7700/SB7790 Edge switches compared to preliminary Intel simulations for Intel® Omni-Path switches. Fewer switches claim based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration, using a 48-port switch for Intel® Omni-Path cluster and 36-port switch ASIC for either Mellanox or Intel® True Scale clusters.

<sup>3</sup> Actual number is 27,648 nodes based on a cluster configured with the Intel® Omni-Path Architecture using 48-port switch ASICs, as compared with a 36-port switch chip that can support 11,664 nodes.

See "Legal Disclaimer & Optimization Notice" for important performance information.
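The node counts in the footnotes are consistent with the textbook capacity of a full-bisection-bandwidth fat tree, 2 × (radix/2)^tiers, evaluated for three switch tiers. The sketch below assumes that formula and that tier count; neither is stated in the slides, and the function names are illustrative.

```c
#include <assert.h>

/* Maximum end nodes in a full-bisection-bandwidth fat tree built from
   `radix`-port switches with `tiers` levels of switches:
   2 * (radix / 2)^tiers. Assumed textbook formula. */
long fat_tree_max_nodes(int radix, int tiers)
{
    assert(radix >= 2 && radix % 2 == 0 && tiers >= 1);
    long nodes = 2;
    for (int i = 0; i < tiers; i++) {
        nodes *= radix / 2;   /* each tier multiplies capacity by radix/2 */
    }
    return nodes;
}

/* Fewest switch tiers whose fat tree can hold at least `nodes` end nodes. */
int fat_tree_tiers_needed(int radix, long nodes)
{
    assert(radix >= 4 && nodes >= 1);
    int tiers = 1;
    while (fat_tree_max_nodes(radix, tiers) < nodes) {
        tiers++;
    }
    return tiers;
}
```

With three tiers the formula gives 27,648 nodes for 48-port switches and 11,664 for 36-port switches. It also suggests where the "fewer switches" claim could come from: under this model a 1024-node cluster fits in two tiers of 48-port switches (capacity 1,152) but needs a third tier with 36-port switches (two-tier capacity 648).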


# Intel<sup>®</sup> Omni-Path Architecture: Enhanced Switching Fabric



High messaging rates: Designed to support high MPI traffic from each node



Low latency: Extremely low port-to-port switch latency



End-to-End Reliability: Built-in error detection & correction



Consistent end-to-end latency: Enables higher MPI application performance



# OpenFabrics\* Scalable Fabric Interface: A New Software Framework to Unleash Fast Fabrics

#### **OpenFabrics Interfaces (OFI)**

| Control<br>Services | Communication<br>Services | Completion<br>Services | Data Transfer<br>Services                                                    |
|---------------------|---------------------------|------------------------|------------------------------------------------------------------------------|
| Discovery           | Connection<br>Management  | Event<br>Queues        | Message Queues<br>Tag Matching<br>RMA<br>Atomics<br>Triggered Operations     |
| fi_info             | Address<br>Vectors        | Counters               |                                                                              |


- Better semantic match for HPC: high performance
- Lower software complexity: high productivity

## Big Data Meets High Performance: New Intel® Data Analytics Acceleration Library



- Highly-optimized building blocks
- Supports all data analysis stages
- Supports batch, streaming, and distributed processing
- Works with popular platforms (Hadoop, Spark) and tools (R, Python, Matlab)
- Flexible data interfaces (CSV, MySQL, HDFS, RDD (Spark))
- Handles sparse and noisy data
- C++ and Java APIs

Get the beta @ bit.ly/psxe2016beta

## Innovation Across Key System Ingredients

CPU | Software & Tools | Fabric | Storage

- Intel® Xeon® Processors
- Intel® Xeon Phi™ Product Family
- Intel® Parallel Studio
- Intel® Enterprise Edition for Lustre\* software
- Intel® Omni-Path Architecture
- Intel® Solid-State Drives (NVMe)

#### Intel's HPC Scalable System Framework

Hardware and software building blocks that integrate into powerful, compatible systems



Powering HPC at every scale

Compute- and data-centric computing

Standards-based programmability

Application compatibility

- Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processors
- Intel<sup>®</sup> Ethernet
- Intel<sup>®</sup> SSDs
- Intel<sup>®</sup> Lustre\*-based Solutions
- Intel<sup>®</sup> Silicon Photonics Technology
- Intel<sup>®</sup> Parallel Studio Developer Tools
- Intel<sup>®</sup> Cluster Ready Program

\*Other names and brands may be claimed as the property of others.

#### Innovating Toward Exascale...and Beyond

- Innovative technologies in a scalable system framework
- A co-design approach that optimizes workload performance
- Powerful software tools to unlock performance & productivity
- A thriving, open, and enabled ecosystem

We're on this journey together

