Performance Counters on Intel® Core™ 2 Duo Xeon® Processors

Michael D’Mello

michael.d’mello@intel.com


Copyright © 2006, Intel Corporation. All rights reserved.


Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Agenda

• Basic data collection mechanisms

• Some architectural considerations

• Cycle accounting methodology

• Use the VTune™ Performance Analyzer to identify micro-architectural bottlenecks in software running on Intel® Core™ 2 Duo Xeon® processors

• Address the performance bottleneck for Intel® Core™ 2 Duo Xeon® processors


Basic data collection mechanisms

• Deterministic interrupts
  • Processor interrupted at regular time intervals

• Interrupts based on pre-assigned metric
  • A performance counter increments on the CPU every time an event occurs
  • A sample of the execution context is recorded every time a performance counter overflows
  • Events = samples * sample after value
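The relation "Events = samples * sample after value" can be applied directly to sampling results. A minimal Python sketch follows; the function name and the numbers are illustrative assumptions, not VTune output.

```python
def estimate_event_count(samples: int, sample_after_value: int) -> int:
    """Estimate total events from the number of samples collected.

    One sample is recorded each time the counter overflows, and the counter
    overflows once every `sample_after_value` events.
    """
    if samples < 0 or sample_after_value <= 0:
        raise ValueError("samples must be >= 0 and sample_after_value > 0")
    return samples * sample_after_value


# Hypothetical run: 1,500 samples collected with a sample-after value of 2,000,000.
estimated = estimate_event_count(1_500, 2_000_000)
print(estimated)  # 3000000000
```

The result is an estimate: events that occurred after the last overflow are not represented by any sample.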


Architecture Block and Instruction Flow

[Figure: block diagram of one core. Fetch / Decode: 32 KB Instruction Cache, Next IP, Branch Target Buffer, Instruction Decode (4 issue), Microcode Sequencer, Register Allocation Table (RAT). Scheduler / Dispatch Ports: Reservation Stations (RS), 32 entry. Execute: ports feeding FP Add, FP Div/Mul, Integer Arithmetic, Integer Shift/Rotate, SIMD Integer Arithmetic, Load, Store Addr and Store Data units; Memory Order Buffer (MOB); 32 KB Data Cache. Retire: Re-Order Buffer (ROB), 96 entry; IA Register Set. Bus Unit to L2 Cache/Memory.]

Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been arranged or omitted for clarity. Some resources (Bus Unit, L2 Cache, etc…) are shared between cores.


Simpler abstraction – OOO engine

[Figure: simplified out-of-order pipeline with blocks Fetch & Branch prediction, Decode, Reservation Station, Execution Units, Re-Order Buffer, Retirement Writeback]

Notes:
• uops wait until their inputs are available in RS
• uops wait to be retired in ROB


Accounting for cycles - 1

• For simplicity select the micro-op dispatch point to begin analysis

• Decompose Total Cycles into sum of two input parts
  • Time spent issuing micro-ops to execution unit
  • Time spent not issuing micro-ops (i.e. execution stalls)

• Decompose Total Cycles into three “output” components
  • Cycles during which executed micro-ops are retired
  • Cycles during which executed micro-ops are not retired
  • Stalls

• Use simple balance equations to dig deeper:
  • micro-ops dispatched/executed = # retired + # not retired

• Convert to units of cycles using the micro-op dispatch rate


Accounting for cycles - 2

• Use VTune® Sampling to track selected events
  i. CPU_CLK_UNHALTED.CORE tracks Total Cycles
  ii. RS_UOPS_DISPATCHED tracks total number of micro-ops dispatched
  iii. RS_UOPS_DISPATCHED:C=1 tracks cycles during which micro-ops are dispatched
  iv. RS_UOPS_DISPATCHED_NONE (same as RS_UOPS_DISPATCHED:C=1:I=1) gives second term of input equation
  v. UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED gives an estimate of total micro-ops retired (approximate)
  vi. Micro-op dispatch rate obtained by dividing (ii) by (iii)
  vii. # of cycles during which micro-ops not ultimately retired are executed is given by the difference (ii) – (v) divided by the dispatch rate (vi)
  viii. Using (i), (vi), and (vii) obtain Stalls
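The bookkeeping in this list can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the VTune implementation: the function and field names are hypothetical, micro-op counts are converted to cycles with the dispatch rate (as the "Accounting for cycles - 1" slide suggests), and Stalls are taken as whatever remains of Total Cycles.

```python
from dataclasses import dataclass


@dataclass
class CycleBreakdown:
    dispatch_rate: float       # micro-ops per dispatching cycle, item (vi)
    retired_cycles: float      # cycles spent on micro-ops that retire
    non_retired_cycles: float  # cycles spent on micro-ops that never retire, item (vii)
    stall_cycles: float        # remainder of Total Cycles, item (viii)


def account_for_cycles(total_cycles, uops_dispatched, dispatch_cycles, uops_retired):
    """Split total cycles into retired, non-retired and stall components.

    total_cycles    -- (i)   CPU_CLK_UNHALTED.CORE
    uops_dispatched -- (ii)  RS_UOPS_DISPATCHED
    dispatch_cycles -- (iii) RS_UOPS_DISPATCHED:C=1
    uops_retired    -- (v)   UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED (approximate)
    """
    if uops_dispatched <= 0 or dispatch_cycles <= 0:
        raise ValueError("dispatch counts must be positive")
    dispatch_rate = uops_dispatched / dispatch_cycles
    retired_cycles = uops_retired / dispatch_rate
    non_retired_cycles = (uops_dispatched - uops_retired) / dispatch_rate
    stall_cycles = total_cycles - retired_cycles - non_retired_cycles
    return CycleBreakdown(dispatch_rate, retired_cycles, non_retired_cycles, stall_cycles)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
breakdown = account_for_cycles(
    total_cycles=1000, uops_dispatched=1600, dispatch_cycles=800, uops_retired=1400
)
print(breakdown.dispatch_rate)       # 2.0
print(breakdown.retired_cycles)      # 700.0
print(breakdown.non_retired_cycles)  # 100.0
print(breakdown.stall_cycles)        # 200.0
```

With these definitions the retired and non-retired cycles always add up to the dispatching cycles (iii), so the stall figure can be cross-checked against the count from (iv).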


Recap

• Achieve basic objective (Minimize Total Cycles) as follows:
  • Minimize the Stall component by removing memory & other bottlenecks
  • Minimize the Non-Retired component by reducing branch mispredictions
  • Minimize the Retired component by reducing the number of instructions (e.g. using SSEx)


Characterizing Stalls & Branch Mis-predictions

• Percentage of time stalled
  • (RS_UOPS_DISPATCHED.CYCLES_NONE / CPU_CLK_UNHALTED.CORE) * 100

• Fractions of useful & wasted work
  1. Count number of UOPS dispatched
     • Use RS_UOPS_DISPATCHED
  2. Count number of UOPS executed which are eventually retired
     • Use (UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED)
  3. Count number of UOPS executed which are not retired
     • Difference of amount dispatched & amount retired
  4. Compute fractions
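The two characterizations on this slide reduce to simple ratios. A short Python sketch follows, with hypothetical function names and made-up counts. Because the retired micro-op count is only an estimate, the wasted fraction can come out slightly off on real data.

```python
def stall_percentage(dispatch_none_cycles, total_cycles):
    """Percentage of cycles in which no micro-ops were dispatched."""
    return dispatch_none_cycles / total_cycles * 100.0


def work_fractions(uops_dispatched, uops_retired_any, uops_retired_fused):
    """Return (useful, wasted) fractions of the dispatched micro-ops."""
    retired = uops_retired_any + uops_retired_fused  # step 2
    non_retired = uops_dispatched - retired          # step 3
    return retired / uops_dispatched, non_retired / uops_dispatched  # step 4


# Hypothetical counts for illustration only.
print(stall_percentage(dispatch_none_cycles=250, total_cycles=1000))  # 25.0

useful, wasted = work_fractions(
    uops_dispatched=2000, uops_retired_any=1500, uops_retired_fused=300
)
print(round(useful, 3), round(wasted, 3))  # 0.9 0.1
```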


Characterizing FSB Usage/Saturation

• Method:
  • [(Core Frequency) * 64 * BUS_TRANS_BURST.BOTH_CORES.ALL_AGENTS] divided by CPU_CLK_UNHALTED.CORE
  • Result is bus traffic in bytes per second when Core Frequency is in Hz (64 bytes per burst transaction)

• Always useful to run a calibration test case

• Further analysis via:
  • BUS_TRANS_ set of events
  • Use VTune® Help facility for explanation of each event
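The method above can be written out as follows. This is a sketch under assumptions: the factor of 64 is taken to be the bytes moved per burst transaction (one cache line), CPU_CLK_UNHALTED.CORE divided by the core frequency is taken as the elapsed time, and the calibrated peak is a hypothetical number that would come from a calibration test case on the machine under study.

```python
BYTES_PER_BURST = 64  # assumed: one cache line per burst transaction


def fsb_bandwidth_bytes_per_sec(core_frequency_hz, bus_trans_burst, cpu_clk_unhalted_core):
    """Bus traffic in bytes per second, following the slide's formula."""
    elapsed_seconds = cpu_clk_unhalted_core / core_frequency_hz
    return BYTES_PER_BURST * bus_trans_burst / elapsed_seconds


def fsb_utilization(measured_bytes_per_sec, calibrated_peak_bytes_per_sec):
    """Fraction of a calibrated peak bandwidth that the workload consumed."""
    return measured_bytes_per_sec / calibrated_peak_bytes_per_sec


# Hypothetical run: 3 GHz core, one second of unhalted cycles, 10 million bursts.
measured = fsb_bandwidth_bytes_per_sec(
    core_frequency_hz=3_000_000_000,
    bus_trans_burst=10_000_000,
    cpu_clk_unhalted_core=3_000_000_000,
)
print(measured)  # 640000000.0

# Hypothetical calibrated peak of 6.4e9 bytes/s from a bandwidth-bound test case.
print(fsb_utilization(measured, 6.4e9))  # 0.1
```

Comparing against a measured calibration run rather than a theoretical peak keeps the utilization figure tied to what the platform actually sustains.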


Other characterizations of Stalls

• Instruction starvation component may be monitored via
  • RESOURCE_STALLS.ANY

• ROB & RS must be purged of incorrect executions
  • Approximate via RESOURCE_STALLS.BR_MISS_CLEAR
  • Units of this event are in cycles

• Other resource limited stalls:
  • RESOURCE_STALLS.ROB_FULL (96 instructions in ROB)
  • RESOURCE_STALLS.LD_ST (All Store or Load buffers in use)
  • RESOURCE_STALLS.RS_FULL (32 instructions waiting for inputs in Reservation Station)

• For more information see paper on Cycle Accounting by David Levinthal (available through www.intel.com)
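One way to read the resource-stall events side by side is to express each as a share of total cycles. The Python sketch below assumes every RESOURCE_STALLS event counts cycles (the slide states this only for the branch-misprediction event) and uses made-up counts. The categories may overlap, so the shares need not add up to the RESOURCE_STALLS.ANY share.

```python
def resource_stall_shares(event_cycles, total_cycles):
    """Return each stall event as a percentage of total cycles, largest first."""
    shares = {
        name: 100.0 * cycles / total_cycles for name, cycles in event_cycles.items()
    }
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


# Hypothetical counts for illustration only.
counts = {
    "RESOURCE_STALLS.ROB_FULL": 50,
    "RESOURCE_STALLS.LD_ST": 100,
    "RESOURCE_STALLS.RS_FULL": 150,
    "RESOURCE_STALLS.BR_MISS_CLEAR": 25,
}
for name, pct in resource_stall_shares(counts, total_cycles=1000).items():
    print(f"{name}: {pct:.1f}%")
```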
