



## Over 12B ARM-Based Chips Shipped in 2014

| Segment    | 2013  | 2014  | Growth |
|------------|-------|-------|--------|
| Mobile     | 5.1bn | 5.4bn | 300m   |
| Embedded   | 2.9bn | 4.1bn | 1.2bn  |
| Enterprise | 1.8bn | 1.9bn | 100m   |
| Home       | 0.6bn | 0.6bn | -      |

ARM CPU Core Unit Shipments
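As a quick cross-check of the headline figure, the per-segment shipments can be summed and the growth expressed as a percentage. This is a minimal sketch that assumes the two data columns are 2013 and 2014 unit shipments in billions. The dictionary and function names are illustrative and do not come from the slide.

```python
# Unit shipments in billions, transcribed from the shipment table
# (assumed column order: 2013, then 2014).
shipments_bn = {
    "Mobile": (5.1, 5.4),
    "Embedded": (2.9, 4.1),
    "Enterprise": (1.8, 1.9),
    "Home": (0.6, 0.6),
}


def summarize(shipments):
    """Return per-segment (absolute growth, percent growth) and yearly totals."""
    rows = {}
    for segment, (y2013, y2014) in shipments.items():
        delta = y2014 - y2013
        rows[segment] = (round(delta, 2), round(100 * delta / y2013, 1))
    total_2013 = round(sum(v[0] for v in shipments.values()), 2)
    total_2014 = round(sum(v[1] for v in shipments.values()), 2)
    return rows, total_2013, total_2014


rows, total_2013, total_2014 = summarize(shipments_bn)
for segment, (delta, pct) in rows.items():
    print(f"{segment:<10} +{delta:.1f}bn ({pct:.1f}%)")
print(f"Total: {total_2013:.1f}bn -> {total_2014:.1f}bn")
```

The 2014 column sums to 12.0bn, which matches the slide title. Embedded accounts for most of the year-over-year growth.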



## ARM Architecture: Licensing Overview



Hundreds of optimized system-on-chip solutions



### Extensible Architecture for Heterogeneous Multi-core Solutions





## ARM in Datacenter/HPC Compute

- ARM-based SoCs enable compelling solutions:
  - Optimized performance/watt, performance/RU
  - Workload optimized balance of CPU, memory, cache, and IO
  - Application specific HW accelerators
  - Heterogeneous compute (DSPs, FPGAs, GPUs, etc.)
  - Comprehensive SW Ecosystem
- Built to existing datacenter standards
  - ARM platforms use standard motherboard form-factors or chassis/rack form-factors
  - Standard interconnects (PCIe, RapidIO, etc.)
  - Standard platform firmware abstractions for deployment and management





### ARM: All the Pieces for HPC

### Partner 64-bit SoCs

- Applied Micro X-Gene 1 and 2
- AMD Seattle
- Cavium ThunderX
- Broadcom Vulcan
- HiSilicon
- Several others (confidential)

8 – 48 Cores + Accelerators

### Workloads:

Scale-out, HPC, Networking, Cloud, and Storage



### Engagements

- Global: NA, EU, Asia
- Exascale Projects, Nat'l Labs, and Universities

### Ecosystem

- HPC Workloads, Math Libraries, Compilers
- OpenCL, OpenMP, OpenMPI, etc.

### Technologies

- Performance Cores, Interconnects, GPU, DSP, Vector/Floating Point
- Partners and partner Accelerators
- Performance/Watt



## ARM HPC Software Ecosystem Overview

|                                             | Compilers<br>C/Fortran                           | Analysis Tools                                                            | Math Libraries                                              | Parallelism<br>Libraries                               | Parallel File<br>Systems                                  | Cluster Management and Test SW                                       |
|---------------------------------------------|--------------------------------------------------|---------------------------------------------------------------------------|-------------------------------------------------------------|--------------------------------------------------------|-----------------------------------------------------------|----------------------------------------------------------------------|
| x86 Ecosystem<br>(for comparison)           | ICC, Pathscale, PGI,<br>NAG, GCC,<br>Proprietary | Parallel Studio, Allinea, Rogue Wave, Cluster Studio XE, IBM Toolkit, VTune | MKL (BLAS),<br>BLIS, libATLAS,<br>Eigen BLAS,<br>ACML, FFTW | TBB, OpenMP, SequenceL<br>Distributed parallelism: OpenMPI/PGAS | Lustre, Panasas,<br>OpenSFS,<br>HDFS, Ceph,<br>IBM GPFS    | HP CMU, IBM Platform LSF, Adaptive Computing (Moab), Altair PBS Works |
| Supercomputer<br>Vendors & Labs             | Internal or proprietary<br>Compilers             | Vendor or Customer<br>SW Stack                                            | Internal or proprietary                                     | Vendor or Customer<br>SW Stack                         | Vendor SW Stack                                           | Vendor or<br>Customer SW Stack                                       |
| Enterprise<br>Volume HPC<br>(Commercial SW) | GCC, LLVM<br>Pathscale EKOPath,<br>NAG           | Allinea DDT<br>Rogue Wave<br>ARM DS-5 + PTP                               | Pathscale BLAS<br>NAG Numerical                             | ARM OpenMP<br>GCC Libgomp,<br>Libtbb<br>OpenMPI        | Lustre Client/Server<br>IBM GPFS<br>Panasas               | IBM LSF supported,<br>HP CMU supported,<br>Altair PBS and Moab       |
| HPC Labs, Universities<br>(Open Source SW)  | GCC & Clang, Gfortran<br>LLVM                    | TAU, Perf, HPCToolkit, PTP                                                | BLIS, FFTW<br>libATLAS,<br>EigenBLAS<br>OpenBLAS            | GCC, OpenMP<br>LLVM run-time<br>PGAS                   | OpenSFS (Lustre),<br>Lustre Client/Server,<br>HDFS, Ceph, Gluster | SLURM<br>OpenLava (LSF fork)                                         |



## Maximizing Throughput Density: per mm<sup>2</sup>, per Watt



### ARM Solution Benefits:

- Less than one-third the power for equivalent performance
- Allows more specialized computing or significantly greater thread density in the same power budget

Comparison for equivalent number of threads

- Platforms used:
  - Xeon-E5 2660 v3 10C20T platform (measured)
  - Xeon-E5 2650 v3 10C20T platform (measured)
  - GCC compiler v4.9 with `-O3` flag
- Estimated result on example 20C ARM Cortex platforms with CCN-508, 28MB total L2+L3 cache
  - Per-core measurements on RTL with relevant memory system
  - GCC compiler v4.9 with `-O3` flag
  - Scaled to 20T based on modelled and empirical results
  - Power estimated in 16nm based on ARM internal implementations for entire CPU + interconnect



### Cortex-A72: Ideal for dense compute environments

Single Cortex-A72 core: ~1.15 mm<sup>2</sup>

Cortex-A72 is <20% of the size

Single Broadwell CPU + 256K L2: ~8 mm<sup>2</sup>
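The "<20%" claim follows directly from the two approximate areas quoted above. The sketch below reproduces that arithmetic. It takes the slide's rounded figures at face value and ignores process-node, cache, and uncore differences, so the density number is only a rough upper bound. The constant and function names are illustrative.

```python
# Approximate per-core areas quoted on the slide (mm^2).
A72_CORE_MM2 = 1.15        # single Cortex-A72 core
BROADWELL_CORE_MM2 = 8.0   # single Broadwell CPU + 256K L2


def area_ratio(small_mm2, large_mm2):
    """Fraction of the larger core's area occupied by the smaller core."""
    return small_mm2 / large_mm2


ratio = area_ratio(A72_CORE_MM2, BROADWELL_CORE_MM2)
# Whole Cortex-A72 cores that fit in one Broadwell core footprint.
cores_per_footprint = int(BROADWELL_CORE_MM2 // A72_CORE_MM2)

print(f"A72 / Broadwell area: {ratio:.1%}")
print(f"A72 cores per Broadwell core footprint: {cores_per_footprint}")
```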





### Cortex-A72: More performance within constrained envelopes



- Intel workloads measured on Dell Venue Pro 11. SPEC benchmarks measured using GCC compiler v4.9 with `-O3` flag.
- Cortex-A72 measured on RTL with realistic memory system with the same compiler settings.
- Multi-threaded workloads use 2C4T Core-M CPU and are estimated on a 4C Cortex-A72 configuration with 2MB L2 cache.





### ARM Partner SoC solutions c.2015







ARMADA XP Block Diagram

#### "SEATTLE" SOC OVERVIEW

#### Power Efficient Cores

- Up to eight ARM Cortex-A57 cores
- Up to 4MB shared L2 cache total

#### Cache Coherent Network

- Full cache coherency
- 8MB L3 cache
- SMMU: I/O address mapping and protection

#### High Performance, Flexible Memory

- Two 64-bit DDR3/4 channels with ECC
- Two DIMMs/channel, up to 1866 MHz
- SODIMM, UDIMM, RDIMM support
- Up to 128GB per CPU
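The memory configuration above implies a theoretical peak bandwidth that can be estimated in a few lines. This sketch assumes the "1866" figure refers to the DDR-1866 speed grade, that is, 1866 mega-transfers per second per channel, and that only the 64 data bits of each channel count toward bandwidth. Sustained bandwidth in practice would be lower. The function name is illustrative.

```python
def peak_ddr_bandwidth_gbs(channels, data_width_bits, mega_transfers_per_s):
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = data_width_bits // 8
    return channels * bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


# Assumed reading of the slide: 2 channels x 64 data bits x 1866 MT/s.
seattle_peak_gbs = peak_ddr_bandwidth_gbs(
    channels=2, data_width_bits=64, mega_transfers_per_s=1866
)
print(f"Theoretical peak: {seattle_peak_gbs:.1f} GB/s")
```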

#### Highly Integrated I/O

- 8x SATA 3 (6Gb/s) ports
- Two 10GBASE-KR Ethernet ports
- 8 lanes PCI-Express® Gen 3, supports x8, x4, x2

#### System Control Processor

- TrustZone® technology for enhanced security
- Dedicated 1GbE system management port (RGMII)
- SPI, UART, I2C interfaces

#### Cryptographic Coprocessor

- Separate cryptographic algorithm engine for offloading encryption, decryption, compression, and decompression computations



### THUNDERX Family of Workload Optimized Processors

#### Cavium

#### Up to 48 custom ARMv8 cores @ 2.5GHz

- 78K I-cache & 32K D-cache, 16MB L2
- 1S and 2S configuration
- Up to 4x72 bit DDR3/4 Memory Controllers
  - 1 TB system memory in 2S config
- Family Specific I/Os
- Standards based low latency Ethernet fabric
- virtSOC™: Virtualization from Core to I/O
- Family Specific Accelerators
- 4 Workload Optimized Processor Families:
  - ThunderX\_CP: Compute Servers
  - ThunderX\_ST: Storage Servers
  - ThunderX\_NT: Network/Telco Servers
  - ThunderX\_SC: Secure Servers



### More ARM 64-bit solutions on the horizon...



Source: Broadcom Presentation at IDC HPC USER FORUM APRIL



#### Qualcomm to Build ARM-Based Se



CEO Steve Mollenkopf gave few details, but Qualcomm will present a challenge to both smaller ARM server chip makers and dominant player Intel.

Qualcomm, the world's top mobile chip maker, is ready to get into the crowded ARM-based server chip business.

At the company's annual analyst day Nov. 19 in New York, CEO Steve Mollenkopf said company engineers have been working on the technology "for some time. Now we are going to have a big product that goes into the server."

The Wall Street Journal was the first to report on Qualcomm's move.

## ARMv8-A Infrastructure Ecosystem Building Momentum

Example **End Users** 









Key Applications Middleware































Operating System, Virtualization & Firmware











OEMs and ODMs

































## Empowering Enterprise Software Developers



Multiple options for software developers on ARM.



## PayPal – Real-time Data Analysis

- Application logs
- Data center metrics
- Server machine data
- Metadata
- Social media data



3M events/sec, 25Tb/hour
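The two throughput figures together imply an average event size. The sketch below works that out. The slide does not say whether "Tb" means terabits or terabytes, so the result is reported under both readings. Decimal units are assumed and the variable names are illustrative.

```python
EVENTS_PER_SECOND = 3_000_000
VOLUME_PER_HOUR = 25e12  # "25Tb/hour": bits or bytes, ambiguous on the slide

volume_per_second = VOLUME_PER_HOUR / 3600
units_per_event = volume_per_second / EVENTS_PER_SECOND

# Interpret the same number under both possible unit readings.
bytes_if_terabytes = units_per_event
bytes_if_terabits = units_per_event / 8

print(f"If Tb = terabytes: ~{bytes_if_terabytes:.0f} bytes per event")
print(f"If Tb = terabits:  ~{bytes_if_terabits:.0f} bytes per event")
```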

ARM: General Purpose (Java, Python, OpenCL, OpenMPI)

TI DSP: Signal processing performance



- Leverages HPC and Networking/Telco dataplane Technology
- 55W per cartridge (measured) => 11.2 GFlops/W
  - Top of the Green500.org supercomputer list is 4.4 GFlops/W
- SoC Specific Advantages
  - Right Sized General Purpose Compute
  - TI DSP cores for high-performance signal processing
  - Low latency response times
  - An integrated, high-performance I/O fabric



# Questions

