Google

# Improving perf\_events measurement correctness

Maria Dimakopoulou, Optimization Team

CERN PMU Workshop 2013



### What Is It About?

- at-retirement memory events may corrupt events on the sibling counter with HyperThreading enabled on Intel processors
  - 0xd0 : MEM\_UOPS\_RETIRED.\*
  - 0xd1 : MEM\_LOAD\_UOPS\_RETIRED.\*
  - 0xd2 : MEM\_LOAD\_UOPS\_LLC\_HIT\_RETIRED.\*
  - 0xd3 : MEM\_LOAD\_UOPS\_LLC\_MISS\_RETIRED.\*
- Example: SNB, CPU0,1 siblings
  - `# perf stat -a -C0 -e r81d0 sleep 10` (r81d0: MEM\_UOPS\_RETIRED:ALL\_LOADS)
  - `# perf stat -a -C1 -e r20cc sleep 1` (r20cc: ROB\_MISC:LBR\_INSERTS)
  - `10,022,279 r20cc` (LBR unused: should be zero)

- Silent & random measurement corruption
- Errata: SandyBridge (BJ122), IvyBridge (BV98), Haswell (HSD29)
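
To make the raw codes in the example concrete: in perf's x86 raw format, the low byte of `rNNNN` is the event select and the next byte is the unit mask. The sketch below decodes such a code and checks it against the four event selects listed above. The helper names `decode_raw` and `is_corrupting` are hypothetical, and the sketch deliberately ignores the higher config bits (cmask, inv, edge).

```python
# Event selects of the at-retirement memory events listed on this slide.
CORRUPTING_EVENT_SELECTS = {0xD0, 0xD1, 0xD2, 0xD3}


def decode_raw(raw):
    """Split a perf raw event such as 'r81d0' into (event_select, umask).

    Assumption: only the low 16 bits are used (event select + unit mask).
    """
    digits = raw[1:] if raw.startswith("r") else raw
    value = int(digits, 16)
    event_select = value & 0xFF
    umask = (value >> 8) & 0xFF
    return event_select, umask


def is_corrupting(raw):
    """True if the raw event uses one of the corrupting event selects."""
    event_select, _ = decode_raw(raw)
    return event_select in CORRUPTING_EVENT_SELECTS


if __name__ == "__main__":
    for raw in ("r81d0", "r20cc"):
        event_select, umask = decode_raw(raw)
        print(f"{raw}: event=0x{event_select:02x} umask=0x{umask:02x} "
              f"corrupting={is_corrupting(raw)}")
```

Under this decoding, `r81d0` falls in the corrupting set while `r20cc` is the event whose count gets corrupted.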


# Severity

- Corrupting events are commonly used
  - to study cache behavior
- Multiplexing increases risk
  - occurs asynchronously on each CPU
- Error is maximized when a high-frequency event is paired with a low-frequency event
  - mem\_load\_uops\_retired vs. mem\_load\_uops\_llc\_miss\_retired:remote\_dram
  - mem\_load\_uops\_retired vs. mispredicted\_branch\_retired
  - ...

# Solutions

- No Intel firmware fix available
- Only measure one logical CPU per physical core
  - coarse-grained exclusion
- Current kernel fix: blacklist corrupting events (IvyBridge for now)
  - at-retirement memory events can never be measured
- Our approach: USX Protocol (Cache Coherence Style Protocol)
  - fine-grained exclusion based on the sibling thread's state
  - force mutual exclusion for counters with corrupting events
  - allow sharing for counters with non-corrupting events
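
The coarse-grained option above (measure only one logical CPU per physical core) can be applied from user space by selecting one CPU out of each sibling set. The following is a minimal sketch, assuming a Linux system that exposes `thread_siblings_list` under sysfs; the function names are hypothetical and not part of any tool mentioned in these slides.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,4' or '0-1' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def one_cpu_per_core(siblings_by_cpu):
    """Pick the lowest-numbered logical CPU from each sibling set."""
    chosen, covered = [], set()
    for cpu in sorted(siblings_by_cpu):
        if cpu in covered:
            continue
        chosen.append(cpu)
        covered |= parse_cpu_list(siblings_by_cpu[cpu])
    return chosen


def read_sysfs_siblings(root="/sys/devices/system/cpu"):
    """Read each CPU's sibling list from sysfs (empty dict if unavailable)."""
    result = {}
    pattern = os.path.join(root, "cpu[0-9]*", "topology", "thread_siblings_list")
    for path in glob.glob(pattern):
        cpu_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as handle:
            result[int(cpu_dir[3:])] = handle.read()
    return result


if __name__ == "__main__":
    cpus = one_cpu_per_core(read_sysfs_siblings())
    # The resulting list could be passed to perf stat via -C.
    print(",".join(str(cpu) for cpu in cpus))
```

This avoids the corruption entirely but leaves half of the logical CPUs unmeasured, which is the cost the fine-grained USX approach tries to avoid.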


# perf\_events Scheduler Overview

- Kernel-level scheduling of event groups
  - greedy first-match algorithm, stops at first error
- Static constraints on events are hardcoded in kernel
- Multiplexing if necessary
  - Round-Robin of event group list for fairness
  - default rate is each timer tick
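
The behavior summarized above can be illustrated with a simplified model. This sketch schedules individual events rather than event groups, represents each static constraint as a bitmask of permitted counters, and rotates the list between ticks; it is an illustration of the idea, not the kernel's code, and all names are hypothetical.

```python
from collections import deque


def schedule_greedy(events, num_counters=4):
    """Assign each (name, mask) to the first free permitted counter.

    Stops at the first event that cannot be placed, mirroring the
    'stops at first error' behavior described above.
    """
    assigned = {}
    busy = 0
    all_counters = (1 << num_counters) - 1
    for name, mask in events:
        free = mask & ~busy & all_counters
        if free == 0:
            break  # first failure ends this scheduling pass
        counter = (free & -free).bit_length() - 1  # lowest set bit
        busy |= 1 << counter
        assigned[name] = counter
    return assigned


def rotate(events):
    """Round-robin step: move the head of the list to the tail."""
    queue = deque(events)
    queue.rotate(-1)
    return list(queue)


if __name__ == "__main__":
    # A and B may only use counter0; C may use any counter.
    events = [("A", 0b0001), ("B", 0b0001), ("C", 0b1111)]
    for tick in range(3):
        print(tick, schedule_greedy(events))
        events = rotate(events)
```

In this toy case the first pass schedules only `A`, because `B` fails and ends the pass before `C` is considered; rotation then gives `B` and `C` their turn.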





# **USX** Protocol

- Counter Events
  - **C**ycles = Non-Corrupting
  - **M**emory = Corrupting
- Counter States
  - **U**nused
  - **S**hared
  - **X**clusive



- Principles
  - event scheduling on one HT's counters affects the other HT's state
  - **M** events → allowed only on counters in the **U** state
  - **C** events → allowed only on counters in the **U** or **S** state
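
The principles above amount to a small compatibility table. A minimal sketch, using hypothetical names (`State`, `may_schedule`, `sibling_state_after`) and assuming that an M event marks the sibling's counter Xclusive while a C event marks it Shared:

```python
from enum import Enum


class State(Enum):
    UNUSED = "U"
    SHARED = "S"
    XCLUSIVE = "X"


# Which counter states each event kind may be scheduled on.
ALLOWED = {
    "M": {State.UNUSED},
    "C": {State.UNUSED, State.SHARED},
}

# State imposed on the sibling's matching counter once an event is scheduled.
IMPOSED = {
    "M": State.XCLUSIVE,
    "C": State.SHARED,
}


def may_schedule(kind, state):
    """True if an event of this kind may use a counter in this state."""
    return state in ALLOWED[kind]


def sibling_state_after(kind):
    """State the sibling's counter takes after this kind is scheduled."""
    return IMPOSED[kind]


if __name__ == "__main__":
    for kind in "MC":
        row = ", ".join(
            f"{s.value}={'yes' if may_schedule(kind, s) else 'no'}" for s in State
        )
        print(kind, row)
```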





- CPU0, CPU1 hyperthreads
- Event Lists
  - CPU0: M, C, C
  - CPU1: C, M, M
- Initial State





1. Add M event on CPU0
2. M static constraint: 1111 (run on any counter)
3. CPU0 state constraint: 1111
   - all counters unused
4. M dynamic constraint: 1111 & 1111 = 1111
5. Scheduler picks counter0
6. Mark counter0 in CPU1 as Xclusive
   - no events can be scheduled on it





1. Add C event on CPU1
2. C static constraint: 1111 (on any counter)
3. CPU1 state constraint: 1110
   - counter0 marked as X
4. C dynamic constraint: 1111 & 1110 = 1110
5. Scheduler picks counter1
6. Mark counter1 in CPU0 as Shared
   - only C events can be scheduled on it





1. Add C event on CPU0
2. C static constraint: 1111 (on any counter)
3. CPU0 state constraint: 1111
   - C events allowed on S counters
4. C dynamic constraint: 1111 & 1111 = 1111
5. Scheduler picks counter1
6. Mark counter1 in CPU1 as Shared
   - only C events can be scheduled on it





1. Add M event on CPU1
2. M static constraint: 1111 (on any counter)
3. CPU1 state constraint: 1100
4. M dynamic constraint: 1111 & 1100 = 1100
5. Scheduler picks counter2
6. Mark counter2 in CPU0 as Xclusive
   - no events can be scheduled on it





1. Add C event on CPU0
2. C static constraint: 1111 (on any counter)
3. CPU0 state constraint: 1011
4. C dynamic constraint: 1111 & 1011 = 1011
5. Scheduler picks counter3
6. Mark counter3 in CPU1 as Shared
   - only C events can be scheduled on it





1. Add M event on CPU1
2. M static constraint: 1111 (on any counter)
3. CPU1 state constraint: 0100
4. M dynamic constraint: 1111 & 0100 = 0100
5. Scheduler cannot pick counter2: occupied
   - Multiplexing!
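
The six scheduling steps walked through above can be replayed with a small simulation. This is a sketch under stated assumptions: four counters per hyperthread, bit *i* of each mask standing for counter *i* (so counter0 is the rightmost bit), and a scheduler that picks the lowest free permitted counter. The class and method names are hypothetical and do not come from the kernel patches.

```python
NUM_COUNTERS = 4
STATIC_ANY = 0b1111  # static constraint: event may run on any counter

ALLOWED_STATES = {"M": {"U"}, "C": {"U", "S"}}  # USX scheduling rules
IMPOSED_STATE = {"M": "X", "C": "S"}            # effect on the sibling


class Core:
    """Two hyperthreads (CPU0, CPU1) sharing one physical core."""

    def __init__(self):
        # used[cpu][i]: event kind occupying counter i on that CPU, or None
        self.used = [[None] * NUM_COUNTERS for _ in range(2)]
        # state[cpu][i]: U/S/X state imposed on counter i by the sibling
        self.state = [["U"] * NUM_COUNTERS for _ in range(2)]

    def state_mask(self, cpu, kind):
        """Bitmask of counters whose state permits this event kind."""
        mask = 0
        for i, s in enumerate(self.state[cpu]):
            if s in ALLOWED_STATES[kind]:
                mask |= 1 << i
        return mask

    def add(self, cpu, kind, static=STATIC_ANY):
        """Try to schedule an event; return (dynamic_mask, counter or None)."""
        dynamic = static & self.state_mask(cpu, kind)
        for i in range(NUM_COUNTERS):
            if (dynamic >> i) & 1 and self.used[cpu][i] is None:
                self.used[cpu][i] = kind
                self.state[cpu ^ 1][i] = IMPOSED_STATE[kind]
                return dynamic, i
        return dynamic, None  # no counter available: must multiplex


STEPS = [(0, "M"), (1, "C"), (0, "C"), (1, "M"), (0, "C"), (1, "M")]

core = Core()
trace = [core.add(cpu, kind) for cpu, kind in STEPS]

if __name__ == "__main__":
    for (cpu, kind), (dynamic, counter) in zip(STEPS, trace):
        picked = f"counter{counter}" if counter is not None else "none (multiplex)"
        print(f"CPU{cpu} add {kind}: dynamic={dynamic:04b} -> {picked}")
```

The dynamic masks and counter choices recorded in `trace` follow the walkthrough, ending with the second M event on CPU1 finding only counter2 permitted and already occupied.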







#### **USX Protocol: Example Results**

[Figure: example results, annotated "Broken Results", "Correct Results", and "Multiplexing"]



#### **USX Protocol: Other Results**

- Initial example: SNB, CPU0,1 siblings
  - `# perf stat -a -C0 -e r81d0 sleep 10` (r81d0: MEM\_UOPS\_RETIRED:ALL\_LOADS)
  - `# perf stat -a -C1 -e r20cc sleep 1` (r20cc: ROB\_MISC:LBR\_INSERTS)
  - `0 r20cc`
- Example with overcommitted counters (multiplexing)
  - `# perf stat -a --pfm-event rob_misc_events:lbr_inserts, mem_uops_retired:all_loads,...`



# Summary

- provided a workaround for an unsolved reliability issue on SNB/IVB/HSW
  - no change to the way the workload runs
  - no user-level changes
- all events can now be measured reliably
  - valuable for tools such as Gooda, Perf, GWP
- more **reliability** at the cost of **extra multiplexing** 
  - need for an optimal scheduling algorithm (Google Optimization Team)
- kernel patches to be pushed to upstream kernel



#### References

- Intel SandyBridge specification update
- Intel IvyBridge specification update
- Intel Haswell specification update
- Gooda Tool
- IA-32 Software Developer's Manual (SDM), Vol. 3B, September 2013