# **CERN** openlab

## Does the Intel Xeon Phi processor fit HEP workloads?

October 17th, CHEP 2013, Amsterdam

Andrzej Nowak, CERN openlab CTO office

On behalf of Georgios Bitzes, Havard Bjerke, Andrea Dotti, Alfio Lazzaro, Sverre Jarp, Pawel Szostek, Liviu Valsan, Mirela-Madalina Botezatu, Julien Leduc



#### **CERN openlab**

Partners







ORACLE

SIEMENS

Contributors



Associates

Yandex

CERN openlab is a framework for evaluating and integrating cutting-edge IT technologies or services in partnership with industry

The Platform Competence Center (PCC) has worked closely with Intel for the past decade and focuses on:

- many-core scalability
- performance tuning and optimization
- benchmarking and thermal optimization
- teaching



#### Outline

- A brief history of openlab involvement with the Intel MIC project
- Architecture refresher
- HEP benchmarks
- Results
- Conclusions and projections





Larry Seiler<sup>1</sup>, Doug Carmean<sup>1</sup>, Eric Sprangle<sup>1</sup>, Tom Forsyth<sup>1</sup>, Michael Abrash<sup>2</sup>, Pradeep Dubey<sup>1</sup>, Stephen Junkins<sup>1</sup>, Adam Lake<sup>1</sup>, Jeremy Sugerman<sup>3</sup>, Robert Cavin<sup>1</sup>, Roger Espasa<sup>1</sup>, Ed Grochowski<sup>1</sup>, Toni Juan<sup>1</sup>, and Pat Hanrahan<sup>3</sup>

#### Abstract

This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2<sup>nd</sup> level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee's potential for a broad range of parallel computation.

CR Categories: I.3.1 [Computer Graphics]: Hardware Architecture--Graphics Processors, Parallel Processing; I.3.3 [Computer Graphics]: Picture/Image Generation--Display Algorithms; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism--Color, shading, shadowing, and texture

Keywords: graphics architecture, many-core computing, realtime graphics, software rendering, throughput computing, visual computing, parallel processing, SIMD, GPGPU.

#### 1. Introduction

Modern GPUs are increasingly programmable in order to support advanced graphics algorithms and other parallel applications.

<sup>1</sup> Intel® Corporation: larry.seiler, doug.carmean, eric.sprangle, tom.forsyth, pradeep.dubey, stephen.junkins, adam.t.lake, robert.d.cavin, roger.espasa, edward.grochowski & toni.juan @intel.com

<sup>2</sup> RAD Game Tools: mikea@radgametools.com

<sup>3</sup> Stanford University: yoel & hanrahan @cs.stanford.edu

#### ACM Reference Format

Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P. 2008. Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Trans. Graph. 27, 3, Article 18 (August 2008), 15 pages. DOI = 10.1145/1360612.1360617 http://doi.acm.org/10.1145/1360612.1360617.


However, general purpose programmability of the graphics pipeline is restricted by limitations on the memory model and by fixed function blocks that schedule the parallel threads of execution. For example, pixel processing order is controlled by the rasterization logic and other dedicated scheduling logic.

This paper describes a highly parallel architecture that makes the rendering pipeline completely programmable. The Larrabee architecture is based on in-order CPU cores that run an extended version of the x86 instruction set, including wide vector processing operations and some specialized scalar instructions. Figure 1 shows a schematic illustration of the architecture. The cores each access their own subset of a coherent L2 cache to provide high-bandwidth L2 cache access from each core and to simplify data sharing and synchronization.

Larrabee is more flexible than current GPUs. Its CPU-like x86-based architecture supports subroutines and page faulting. Some operations that GPUs traditionally perform with fixed function logic, such as rasterization and post-shader blending, are performed entirely in software in Larrabee. Like GPUs, Larrabee uses fixed function logic for texture filtering, but the cores assist the fixed function logic, e.g. by supporting page faults.

[Figure: in-order CPU cores and coherent L2 cache slices connected by an interprocessor ring network]

Figure 1: Schematic of the Larrabee many-core architecture: The number of CPU cores and the number and type of co-processors and I/O blocks are implementation-dependent, as are the positions of the CPU and non-CPU blocks on the chip.

This paper also describes a software rendering pipeline that runs efficiently on this architecture. It uses binning to increase parallelism and reduce memory bandwidth, while avoiding the problems of some previous tile-based architectures. Implementing the renderer in software allows existing features to be optimized based on workload and allows new features to be added. For example, programmable blending and order-independent transparency fit easily into the Larrabee software pipeline.

Finally, this paper describes a programming model that supports more general parallel applications, such as image processing, physical simulation, and medical & financial analytics. Larrabee's support for irregular data structures and its scatter-gather capability make it suitable for these throughput applications as demonstrated by our scalability and performance analysis.

ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, Publication date: August 2008.





**Figure 10**: Overall performance: shows the number of Larrabee Units (cores running at 1 GHz) needed to achieve 60fps for each of the series of sample frames in each game.



### **Brief history of Intel MIC at openlab**

#### Early access

- Work since MIC alpha (under RS-NDA)
- ISA reviews in 2008

#### Results

- 3 benchmarks ported from Xeon and delivering results: ROOT, Geant4, ALICE HLT trackfitter

#### Expertise

- Understood and compared with Xeon
- Post-launch dissemination



### **Architecture refresher (1)**



Image: Intel



### **Architecture refresher (2)**



Image: semiaccurate.com / Intel



### **Architecture refresher (3)**

- PCIe card form factor, "SMP machine on a chip"
  - Linux OS on board
  - PCIe power envelope ~200-300W
  - Limited on-board memory (16GB total limited by GDDR)
  - ~60 cores @ ~1 GHz
- x86 architecture
  - 64-bit
  - P54C core: in-order, superscalar
  - Shared coherent cache (~256-512 KB L2 per core, KNF/KNC)
- New ISA with new vector instructions
  - Floating point support through vector units
  - 512-bit wide, FMA
- 4-wide hardware threading
  - >200 threads in total
- 1 TFLOP DP
- Still with the programmability of a Xeon?
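The "1 TFLOP DP" figure in the list above can be roughly cross-checked from the other bullets. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes every core issues one 512-bit FMA per cycle, and the function name `peak_dp_gflops` is introduced here purely for illustration. The 61-core, 1.24 GHz configuration is the KNC card quoted later in the talk.

```python
def peak_dp_gflops(cores: int, clock_ghz: float,
                   vector_bits: int = 512, fma: bool = True) -> float:
    """Theoretical peak double-precision GFLOPS.

    Assumption: each core issues one full-width vector operation per cycle.
    """
    dp_lanes = vector_bits // 64         # 64-bit doubles per vector register
    flops_per_lane = 2 if fma else 1     # an FMA counts as multiply + add
    return cores * clock_ghz * dp_lanes * flops_per_lane


# Round numbers from the slide: ~60 cores at ~1 GHz
round_estimate = peak_dp_gflops(cores=60, clock_ghz=1.0)

# KNC configuration quoted in the Xeon comparison: 61 cores (244 threads), 1.24 GHz
knc_estimate = peak_dp_gflops(cores=61, clock_ghz=1.24)

print(f"~60 cores @ ~1 GHz  : {round_estimate:.0f} GFLOPS DP")
print(f"61 cores @ 1.24 GHz : {knc_estimate:.0f} GFLOPS DP")
```

Both estimates land close to 1 TFLOP, which is consistent with the slide. Real workloads reach this only if they are fully vectorized and use FMA, which is the point the benchmarks that follow are testing.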





### **HEP software today**

- Very limited or no vectorization
  - Online has somewhat better conditions to vectorize
- Sub-optimal instruction level parallelism (CPI at >1)
- Hardware threading often beneficial
- Cores used well through multiprocessing, bar the stiff memory requirements
  - However, systems put in production with delays
- Sockets used well
- Multiple systems used very well
- Relying on in-core improvements and # cores for scaling

#### **Benchmarks**



- Standard: HEPSPEC06 (partial test)
- Need to look forward to prototype workloads
- Analysis: MLFit
  - Threaded (pthreads, MPI, OpenMP, TBB)
  - Vectorized (Cilk+)
- Simulation: Next-gen multi-threaded Geant4 prototype
  - Threaded "FullCMS" example
  - No explicit vectorization
- Online: New ALICE/CBM track fitter prototype
  - Threaded (here with OpenMP)
  - Vectorized with Vc

ICC 14.0 and pinning used unless specified otherwise



#### **Porting effort**

| Benchmark                    | LOC       | First port time | New ports | Tuning   |
|------------------------------|-----------|-----------------|-----------|----------|
| TF (Trackfitter)             | < 1'000   | days            | N/A       | 2 weeks  |
| MLFit                        | 3'000     | < 1 day         | < 1 day   | weeks    |
| MTG (multi-threaded Geant4)  | 2'000'000 | 1 month         | < 1 day   | < 1 week |

## **HEPSPEC06** results and extrapolation

- HEPSPEC06 represents our current family of workloads, not optimized for next-gen hardware
- Soplex would not finish properly
  - Ran out of time to investigate
- 32 cores: 57.7 HS06
- Extrapolating to 61 cores: 110 HS06
- SMT scalability:
  - 1.8 / thread @ 1 core
  - 3.48 / 4 threads @ core
- SMT under full load scales differently
  - Using factor from MTG4 ~70%
- Expected throughput for 244 threads: ~190
- Reference IVB-EP w/ 48 threads: > 450
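The extrapolation above can be retraced with a few lines of arithmetic. This is a sketch of one plausible reading of the slide, not the authors' actual procedure: it assumes linear scaling from 32 to 61 cores and interprets the "~70%" factor as the throughput gain from running 4 hardware threads per core under full load. Variable names are illustrative.

```python
measured_hs06 = 57.7     # slide: HS06 score on 32 cores
measured_cores = 32
total_cores = 61         # slide: extrapolation target
smt_gain = 0.70          # assumption: "~70%" read as the gain from 4 threads per core

per_core = measured_hs06 / measured_cores           # about 1.8, matches "1.8 / thread @ 1 core"
one_thread_per_core = per_core * total_cores        # about 110 HS06
full_smt = one_thread_per_core * (1.0 + smt_gain)   # about 187, quoted as ~190

# Single-core SMT measurement from the slide, for comparison with the 70% factor
single_core_smt_ratio = 3.48 / 1.8                  # about 1.93x with 4 threads on one core

print(f"per core            : {per_core:.2f} HS06")
print(f"61 cores, 1 thread  : {one_thread_per_core:.0f} HS06")
print(f"61 cores, 4 threads : {full_smt:.0f} HS06")
print(f"single-core SMT gain: {single_core_smt_ratio:.2f}x")
```

Under these assumptions the numbers reproduce the slide: about 110 HS06 for 61 cores and just under 190 for 244 threads, well below the > 450 quoted for the IVB-EP reference.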





#### MLFit speedup on KNC



A. Nowak - Is the Intel Xeon Phi processor fit for HEP workloads? / CHEP 2013


## MLFit – SMT scaling / BW saturation

MLFit - SMT scaling





### MLFit kernel prototype

#### **Code courtesy of Vincenzo Innocente, CERN**

#### **Optimized MLFit kernel prototype on KNC**








#### The new ALICE/CBM Trackfitter preliminary results




Number of threads



#### **Xeon comparison**

**IVB-EP vs. KNC throughput** 

#### SNB-EP vs. KNC throughput



SNB 32 threads, 2.9 GHz vs. KNC 244 threads, 1.24 GHz

- Trackfitter: 114%
- MLFit: 87%



## Porting and software - conclusions

- Think in multiple dimensions of performance
  - Vectorization, threading, porting to ICC can and should be done independently, on Xeon
- Optimization
- Compiler
- Tuning: threading and performance tools exist, so do many runtimes
- Math usage (different implementation!)
- Memory: 16 GB limit today
- Build systems: the on-card OS is a "simpler" Linux, with some, not all, OSS add-ons

**VECTORIZATION** 

(hard, but not always impossible)






- Optimized applications surpass dual-socket Xeon performance
- Non-optimized performance reaches approximately a single server socket
  - We don't make use of optimized bandwidth
  - Control over math function usage and performance is key vis-à-vis Xeon
  - Stack (including compiler) maturity is important and improving
  - SMT benefit for 1, 2, 3, 4 HW threads changed over time



### Were we of any help?

Pre-silicon feedback (Geant4) -> arch. behavior

System connectivity -> full system

System integration -> ongoing (KNL)

Comments on general OS -> Linux

Math function usage -> better compilers and guidelines

Documentation -> improved

Benchmarks -> delivered

Testimonials -> delivered

Comments on stack -> ongoing (OSS)

Many more...




### Looking ahead

Per-dimension gains:

|         | SIMD | ILP  | HW THREADS | CORES | SOCKETS |
|---------|------|------|------------|-------|---------|
| MAX     | 8    | 4    | 1.35       | 12    | 4       |
| TYPICAL | 6    | 1.57 | 1.25       | 10    | 2       |
| HEP     | 1    | 0.80 | 1.25       | 8     | 2       |

Cumulative gains (running product, left to right):

|         | SIMD | ILP  | HW THREADS | CORES  | SOCKETS |
|---------|------|------|------------|--------|---------|
| MAX     | 8    | 32   | 43.2       | 518.4  | 2073.6  |
| TYPICAL | 6    | 9.43 | 11.79      | 117.86 | 235.71  |
| HEP     | 1    | 0.8  | 1          | 8      | 16      |
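The lower rows of the table above are the running product of the upper rows: each performance dimension multiplies the gain from the ones before it. A minimal sketch of that calculation follows; the dictionary layout and variable names are illustrative and not taken from the talk.

```python
from itertools import accumulate
from operator import mul

dimensions = ["SIMD", "ILP", "HW THREADS", "CORES", "SOCKETS"]

# Per-dimension gains as printed on the slide
gains = {
    "MAX":     [8, 4,    1.35, 12, 4],
    "TYPICAL": [6, 1.57, 1.25, 10, 2],
    "HEP":     [1, 0.80, 1.25, 8,  2],
}

# Running product across the dimensions, left to right
cumulative = {name: list(accumulate(row, mul)) for name, row in gains.items()}

for name, row in cumulative.items():
    cells = ", ".join(f"{dim}={value:.2f}" for dim, value in zip(dimensions, row))
    print(f"{name:8s} {cells}")
```

The MAX and HEP rows come out at 2073.6 and 16 respectively. The TYPICAL row comes out slightly lower than the slide (235.5 rather than 235.71), which suggests the slide used an unrounded ILP value of about 1.571. The gap between the HEP row and the other two is the slide's argument: without vectorization and with CPI above 1, nearly all of the compounded gain is left unused.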



- KNL: 14nm stand-alone or PCI, integrated memory
- Let's do some math
  - Stampede: 8 out of 10 PF from MIC
  - Upgrade to KNL: 15 PF
  - = 50-80% improvement over KNC
  - ISA convergence between KNL and Xeon?
- New platform connectivity options
- Heterogeneity
- The future of accessible performance
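The slide does not say how the "50-80% improvement" figure was derived from the Stampede numbers. The sketch below shows two possible readings, both of which are assumptions made here for illustration: comparing whole-system petaflops, or holding the non-MIC share fixed and attributing all growth to the MIC cards.

```python
total_pf_today = 10.0      # slide: Stampede total
mic_pf_today = 8.0         # slide: share delivered by MIC (KNC)
total_pf_upgraded = 15.0   # slide: total after the upgrade to KNL

# Reading A (assumption): compare whole-system figures
whole_system_gain = total_pf_upgraded / total_pf_today - 1.0

# Reading B (assumption): the host Xeon share stays fixed, only the MIC part grows
host_pf = total_pf_today - mic_pf_today
mic_only_gain = (total_pf_upgraded - host_pf) / mic_pf_today - 1.0

print(f"whole system: +{whole_system_gain:.1%}")
print(f"MIC only    : +{mic_only_gain:.1%}")
```

Both readings fall inside the quoted range, at 50% and 62.5% respectively, so the slide's estimate is at least consistent with simple ratios of the published petaflop figures.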



# Thank you (Q & A)







#### Andrzej.Nowak@cern.ch

Credits go to Georgios Bitzes, Havard Bjerke, Andrea Dotti, Alfio Lazzaro, Sverre Jarp, Pawel Szostek, Liviu Valsan, Mirela-Madalina Botezatu, Julien Leduc and many others. Special thanks to Intel. This work has been supported by the EU FP7 "ICE-DIP" project, #316596 (Marie Curie Actions)