



## Performance and Power Analysis Utilizing Intel<sup>®</sup> Performance Bottleneck Analyzer (PBA)

By: Michael Chynoweth, Principal Engineer, Intel Corporation

*Contributors:* Rajshree Chabukswar, Charlie Hewett, Seung-Woo Kim, Vardhan Dugar, Erik Niemeyer, Joe Olivas and Manuj Sabharwal

Michael Chynoweth – CERN Workshop

## Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.

\*Other names and brands may be claimed as the property of others.

Copyright © 2010. Intel Corporation.





## **Optimization Notice**


Intel<sup>®</sup> compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel<sup>®</sup> and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the "Intel<sup>®</sup> Compiler User and Reference Guides" under "Compiler Options." Many library routines that are part of Intel<sup>®</sup> compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel<sup>®</sup> compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel<sup>®</sup> compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel<sup>®</sup> Streaming SIMD Extensions 2 (Intel<sup>®</sup> SSE2), Intel<sup>®</sup> Streaming SIMD Extensions 3 (Intel<sup>®</sup> SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel<sup>®</sup> SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel<sup>®</sup> and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.

Notice revision #20101101





## Taking Advantage of Engineering Innovation Through Tools

*Figure labels:* Issue identified by hand; innovation to drive automatic identification; automatic identification; automated capability used by other engineers; capability is improved by other engineers.

*Figure (chart):* Workload Level Speedups Utilizing PBA (axis: Speedup).

Extensible Tools Allow Engineers to Attack Difficult Problems as a Community

## Intel<sup>®</sup> PBA Flow of Analysis



## 2 Primary Views of PBA: Replacing This



# What Is a Stream?

- Reconstructs the flow of basic blocks
- Shows the flow of instructions as it ran on the core
- Pulls all events along the stream for a "poor man's" pipe trace
- Finds and reconstructs loops as a unit of granularity
- Catches issues across branches

| Address    | Instruction                 |
|------------|-----------------------------|
| 30AA668A   | call 30AA63F7               |
| 30AA63F7   | mov ecx, dword ptr [edi]    |
| 30AA63FF   | sar ebx, 18h                |
| 30AA6403   | mov esi, eax                |
| 30AA6405   | js 30AA64D8                 |
| 30AA640B   | movzx eax, word ptr [edi+4] |
| 30AA640F   | cmp ax, 0FFFEh              |
| 30AA6413   | je 30AA647B                 |
| 30AA647B   | movzx eax, word ptr [edi+4] |
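
The idea of stitching basic blocks into a stream can be illustrated with a minimal Python sketch. It is not PBA's implementation. It assumes we have a static disassembly map and an ordered list of taken branches as `(source, target)` pairs, similar in spirit to what Last Branch Records provide. The function name `reconstruct_stream`, the `TAKEN` list, and the choice of which branches were taken (the `call` and the `je`, with the `js` falling through) are assumptions read off the table above.

```python
from bisect import bisect_left

# Static disassembly (address -> instruction), limited to the rows shown in
# the table above; instructions the slide omits are simply absent here.
DISASM = {
    0x30AA668A: "call 30AA63F7",
    0x30AA63F7: "mov ecx, dword ptr [edi]",
    0x30AA63FF: "sar ebx, 18h",
    0x30AA6403: "mov esi, eax",
    0x30AA6405: "js 30AA64D8",
    0x30AA640B: "movzx eax, word ptr [edi+4]",
    0x30AA640F: "cmp ax, 0FFFEh",
    0x30AA6413: "je 30AA647B",
    0x30AA647B: "movzx eax, word ptr [edi+4]",
}


def reconstruct_stream(disasm, taken_branches):
    """Stitch straight-line runs together from ordered (source, target) pairs."""
    if not taken_branches:
        return []
    addrs = sorted(disasm)
    stream = [taken_branches[0][0]]  # the first branch instruction itself
    for i, (_, target) in enumerate(taken_branches):
        start = bisect_left(addrs, target)
        if start == len(addrs) or addrs[start] != target:
            raise ValueError(f"unknown branch target {target:08X}")
        if i + 1 < len(taken_branches):
            next_source = taken_branches[i + 1][0]
            if next_source not in disasm or next_source < target:
                raise ValueError("inconsistent branch records")
            # Fall through from the target up to and including the next
            # taken branch; untaken branches in between stay in the run.
            end = bisect_left(addrs, next_source) + 1
        else:
            # No later record: we only know the target itself executed.
            end = start + 1
        stream.extend(addrs[start:end])
    return stream


# Assumed taken branches: the call, then the je (the js is not taken).
TAKEN = [(0x30AA668A, 0x30AA63F7), (0x30AA6413, 0x30AA647B)]
STREAM = reconstruct_stream(DISASM, TAKEN)
for addr in STREAM:
    print(f"{addr:08X}  {DISASM[addr]}")
```

Because the run between two taken branches is kept whole, events sampled anywhere in it can be attributed to one stream rather than to isolated basic blocks.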





## Comparing Architectures with Streams



Comparison between architectures to find issues



Streams Allow Us To Find Issues Across Branch Boundaries

*Figure axis: Clockticks.*

# DEMO #1 Streams: "If you want micro-architectural issues fixed...make it easy"





### Streams: Summary of Determining Issues in a Loop





### Streams: Correlating Static and Dynamic Identification



### **PBA Relates Static ASM with Events**

### Streams: Assist in Creating Larger and Larger Streams



*Figure labels:* PUTCOUT; Clockticks(SANDYBRIDGE) - BR_MISP_RETIRED.ALL_BRANCHES_PS(SANDYBRIDGE)

### **Streams Created with Last Branch Records Produce Greater Context**

## Streams: Loads and LFB Breakdown



**PBA Displays Where Problematic Loads are Satisfied** 

# Theme #2 JIT Explosion: "Everyone is writing a JIT Nowadays"





# VTune JIT APIs Picked Up by xIF

### JIT Explosion

### Automated Code Gen Analysis for JITs



### xIF automatically identifying issues in JIT





# Theme #2 Short Non-Steady State Workloads Becoming Critical: "Debugging in the Millisecond Timeframe"





# **DEMO TOUCH DEBUG**





# Theme #2 Power: "Power and performance are two sides of the same coin"





## Power Correlation using Intel<sup>®</sup> PBA



Package power intermittently jumping high






### Automatically Determine Causes of High Power

| Module_Name               | Clockticks<br>High Power % | Clockticks<br>Low Power % |
|---------------------------|--------------------------|--------------------------|
| AppProcess:mshtml.dll     | 15.3                     | 9.39                     |
| AppProcess:ntoskrnl.exe   | 10.64                    | 5.12                     |
| AppProcess:ntdll.dll      | 7.92                     | 2.5                      |
| AppProcess:oleaut32.dll   | 5.95                     | 0.15                     |
| AppProcess:igd10umd64.dll | 5.57                     | 7.23                     |
| AppProcess:msvcrt.dll     | 4.89                     | 1.3                      |

Loading Library is Causing High Power
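
One way to read the table above is to rank modules by how much larger their share of clockticks is during high-power intervals than during low-power intervals. The sketch below does this with the six rows from the table. The function name `rank_by_power_skew` and the use of a simple ratio are illustrative assumptions, not PBA's actual method.

```python
# (module, clockticks % in high-power intervals, clockticks % in low-power intervals)
# Values copied from the table above.
ROWS = [
    ("mshtml.dll", 15.3, 9.39),
    ("ntoskrnl.exe", 10.64, 5.12),
    ("ntdll.dll", 7.92, 2.5),
    ("oleaut32.dll", 5.95, 0.15),
    ("igd10umd64.dll", 5.57, 7.23),
    ("msvcrt.dll", 4.89, 1.3),
]


def rank_by_power_skew(rows):
    """Sort modules by the ratio of high-power share to low-power share."""
    ranked = []
    for module, high, low in rows:
        ratio = high / low if low > 0 else float("inf")
        ranked.append((module, round(high - low, 2), round(ratio, 2)))
    # Largest ratio first: these modules are most over-represented
    # when package power is high.
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked


RANKED = rank_by_power_skew(ROWS)
for module, delta, ratio in RANKED:
    print(f"{module:16s} delta={delta:+6.2f} pts  ratio={ratio:6.2f}x")
```

With these numbers, `oleaut32.dll` ranks first, since its share is far larger in high-power intervals, while `igd10umd64.dll` is the only module with a larger share in low-power intervals.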

## Shifting to Power Analysis

#### 1 ms activities wasting power!



| ModuleName | ProcessName | Clockticks%(SANDYBRIDGE) |
|------------|-------------|--------------------------|
|            |             | 27.18                    |
|            |             | 9.70                     |

### Calculating Frequency of Activities is Powerful for Performance
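
As a rough illustration of calculating how often an activity occurs, the sketch below derives a typical period and rate from a list of activity start times. The timestamps are made up for the example, and the function name `activity_rate` and the use of the median gap are assumptions rather than a description of how PBA computes this.

```python
from statistics import median


def activity_rate(start_times_ms):
    """Return (median period in ms, events per second) for sorted start times."""
    if len(start_times_ms) < 2:
        raise ValueError("need at least two timestamps")
    gaps = [b - a for a, b in zip(start_times_ms, start_times_ms[1:])]
    period_ms = median(gaps)  # median is robust to an occasional late wakeup
    if period_ms <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return period_ms, 1000.0 / period_ms


# Made-up start times (ms) for an activity that fires roughly every 1 ms,
# with one wakeup arriving late.
TIMESTAMPS_MS = [0.0, 1.0, 2.0, 3.5, 4.5, 5.5]
PERIOD_MS, PER_SECOND = activity_rate(TIMESTAMPS_MS)
print(f"period ~ {PERIOD_MS} ms, about {PER_SECOND:.0f} wakeups per second")
```

An activity that wakes the core about a thousand times per second can keep it out of deeper idle states even if each burst is short, which is why the rate is worth computing alongside the clocktick share.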

# **CERN Collider Proposal**





# Conclusion

- Intel<sup>®</sup> PBA post-processes data collected using Intel<sup>®</sup> VTune to automatically call out micro-architectural and SoC issues
- Issue identification capabilities increased by an order of magnitude using streams
- PBA provides extensibility by specifying rules for various architectures in a configuration file
- Over-time views, UX and power analysis are new capabilities that augment existing functionality
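
To make the idea of per-architecture rules in a configuration file concrete, here is a hypothetical sketch. The JSON layout, the rule fields (`name`, `event`, `per`, `threshold`), the threshold value, and the function `flag_issues` are all invented for illustration and do not reflect PBA's real configuration format. Only the architecture and event names are taken from the slides.

```python
import json

# Hypothetical configuration: one list of rules per architecture.
CONFIG_TEXT = """
{
  "SANDYBRIDGE": [
    {"name": "branch_mispredicts",
     "event": "BR_MISP_RETIRED.ALL_BRANCHES_PS",
     "per": "Clockticks",
     "threshold": 0.01}
  ]
}
"""


def flag_issues(config, arch, counts):
    """Return names of rules whose event ratio exceeds the rule threshold."""
    issues = []
    for rule in config.get(arch, []):
        denominator = counts.get(rule["per"], 0)
        if denominator == 0:
            continue  # no samples of the normalizing event; skip the rule
        ratio = counts.get(rule["event"], 0) / denominator
        if ratio > rule["threshold"]:
            issues.append(rule["name"])
    return issues


CONFIG = json.loads(CONFIG_TEXT)
# Made-up event counts for one stream.
STREAM_COUNTS = {"Clockticks": 100_000, "BR_MISP_RETIRED.ALL_BRANCHES_PS": 2_500}
ISSUES = flag_issues(CONFIG, "SANDYBRIDGE", STREAM_COUNTS)
print(ISSUES)
```

Keeping the rules in data rather than in code is what lets other engineers add a check for a new architecture without touching the analysis logic.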





