Analysis of Path Profiling Information Generated with Performance Monitoring Hardware


Page 1: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Alex Shye, Matt Iyer, Tipp Moseley, Dave Hodgdon, Dan Fay, Vijay Janapa Reddi, Dan Connors

University of Colorado at Boulder

Department of Electrical and Computer Engineering

DRACO Architecture Research Group

Page 2: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Introduction

• Profile information is critical to the success of optimizers
– Point Profile - BB counts, edge profiles, etc.
– Path Profile - correlated branches

• Off-line Path Profiling Methods:
– Use static/dynamic instrumentation to gather a full path profile [Ball96][Joshi04][Bond05]

• On-line Path Profiling Method:
– Interpretation: MRET [Bala00][Bruen03]

• Both incur high overhead!!

• For run-time systems, this overhead is unacceptable

[Figure: example CFG with blocks A, B, C, D, E, F, G; edge weights 80/20 and 70/30]

Edge Profile: ABDFG 70-50

Path Profile: ABDFG 60, ACDFG 10, …
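The gap between the edge profile and the path profile in this example can be checked with a little arithmetic. The sketch below assumes the figure's weights mean A→B = 80 and D→F = 70 out of 100 executions (that edge assignment is read off the slide residue and is an assumption); the function name is illustrative only.

```python
def path_count_bounds(first_edge, second_edge, total):
    """Bounds on how often two edges at separate branch points are taken
    in the same execution, knowing only the two edge counts."""
    lower = max(0, first_edge + second_edge - total)  # forced overlap
    upper = min(first_edge, second_edge)              # rarer edge caps it
    return lower, upper

# Assumed edge weights: A->B taken 80 times, D->F taken 70 times, 100 runs.
low, high = path_count_bounds(80, 70, 100)
print(f"ABDFG executed between {low} and {high} times")
```

Under these assumptions the edge profile can only bracket ABDFG between 50 and 70, whereas the path profile records the actual count of 60.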

Page 3: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Performance Monitoring

• Modern processors contain on-chip Performance Monitoring Units (PMUs)
– Itanium, Pentium 4, PowerPC support branch vectors

• Sampling PMU
– Less information
– Non-deterministic, phase behavior

• Branch Execution Information
– Itanium-2 PMU Branch Trace Buffer (BTB) - up to four branches
  • Different configurations: Last-4 branches, Last-4 taken branches, etc.
– Compiler can expand this information

Page 4: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

PMU-based Path Profiling

• Goal: Combine compiler analysis and PMU branch vectors to generate a path profile

• For PMU-based path profiling to be effective, it must be comparable to a full path profile, e.g. Ball-Larus path profiling [Ball96]

• Other forms of PMU-based profile information have been shown to be effective for run-time optimization - ADORE [Chen03][Lu04]

[Figure: a Hot Path and a BTB Trace]

Page 5: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Hardware Profiling Approaches

• Proposed Techniques:
– BTB profile buffer [Conte94]
  • OS coupled with BTB hardware to fill out an edge profile
– Hot Spot Detection [Merten00]
  • Proposed Branch Behavior Buffer to store branch information to fill out edge profile
– Programmable Path Profiler [Vaswani05]
  • Hardware Path Stack and Path Detector

• Performance Monitoring Unit Techniques
– Continuous Profiling/Optimization Systems
  • Simple PMU - event counters
– ADORE Dynamic Optimizer [Chen03][Lu04]
  • Sampling Itanium-2 PMU to drive memory optimizations

Page 6: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Motivation

Characteristics of the Ideal Run-time Profiler

1. Accuracy - Ability to reflect run-time execution well

2. Single-Stage - Can profile binary on-the-fly without extra compilation stages

3. Low Overhead - Incurs little to no overhead

[Table: rows Static Instrumentation, Dynamic Instrumentation, Interpretation, Hardware Techniques; columns Accuracy, Single-Stage, Low-Overhead]

• Unfortunately, most existing techniques are only able to accomplish one or two of these.

• This project aims to combine the accuracy of path profiling with low overhead by utilizing existing performance monitoring hardware.

Page 7: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Itanium-2 PMU Path Profiling

• 2 Phases
– Online
  • BTB Trace Collection
– Offline
  • Partial Path Creation
  • Region Formation
  • Path Profile Generation
    – Path Matching
    – Path Crediting

[Figure: Processor PMU → BTB Traces → Partial Paths → Compiler-Aided Offline Analysis (Region Formation, Path Matching/Crediting) → Paths]

Terminology

BTB Trace: Series of addresses from BTB

Partial Path: Path of ops in compiler IR

Region: Single-entrance region in CFG

Path: Complete path through a region

Page 8: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

BTB Trace Collection

• BTB Trace: Sequence of four branches per sample
– Configured to sample only taken branches
  • Allows for longer partial paths to be built
  • The not-taken path is trivial to follow

• BTB Trace placed into specialized hash table every sample
– If BTB Trace exists, increment count

• At the end of execution, BTB Traces and counts are dumped to a file
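The collection step amounts to counting identical traces. A minimal sketch follows, using Python's `Counter` in place of the specialized hash table the slide mentions; the function names and the second sample address set are hypothetical, and the dump format is an assumption.

```python
from collections import Counter

def record_sample(table, branch_addresses):
    """Add one BTB sample (up to four taken-branch addresses) to the table.
    A new trace starts at 1; an existing trace has its count incremented."""
    table[tuple(branch_addresses)] += 1

def dump(table):
    """Format traces and counts for writing out, most frequent first."""
    return [",".join(f"{addr:x}" for addr in trace) + f" {count}"
            for trace, count in table.most_common()]

table = Counter()
samples = [
    (0x10AC, 0x640, 0x66C, 0x10C6),
    (0x2000, 0x2040, 0x2080, 0x20C0),   # hypothetical addresses
    (0x10AC, 0x640, 0x66C, 0x10C6),
]
for sample in samples:
    record_sample(table, sample)

for line in dump(table):
    print(line)
```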

Page 9: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Partial Path Creation

• Partial Path: List of low-level IR ops

• Partial Path Formation
– Recreate path from BTB Trace
– Partial Path weight = count
– Perform Partial Path Extensions
  • Up until Join Point
  • Down until Branch Point

[Figure: Partial Path from BTB Trace and Extended Partial Path, with the BTB Trace Branch, Join Point, and Branch Point marked]

Page 10: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Path Matching and Crediting

• Path Matching
– Find list of all paths that contain partial path

• Path Crediting
– Distribute partial path weight equally among matched paths

• Example:

Partial Path   Count   Matches          Inc   Total
OPRSUVXY       100     ABDLMOPRSUVXY    +25   25
                       ACDLMOPRSUVXY    +25   25
                       ABDLNOPRSUVXY    +25   25
                       ACDLNOPRSUVXY    +25   25

• Challenge:
– Number of paths grows exponentially
– Large control flow graphs present a problem

[Figure: CFG with blocks A through Y]
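Matching and crediting can be sketched in a few lines. The code below assumes single-letter block names so that a partial path matches wherever it appears as a substring, and it assumes a candidate path set built from one two-way choice at each of B/C, M/N, P/Q, S/T, and V/W (inferred from the paths named on the slides, not stated by them). Function names are illustrative.

```python
from collections import defaultdict
from itertools import product

def match_paths(partial, paths):
    """Return every full path that contains the partial path contiguously."""
    return [p for p in paths if partial in p]

def credit(partial, count, paths, totals):
    """Split the partial path's count equally among the paths it matches."""
    matches = match_paths(partial, paths)
    for p in matches:
        totals[p] += count / len(matches)
    return matches

# Assumed path set: 2**5 = 32 paths from A to Y.
paths = ["A" + b + "DL" + m + "O" + p + "R" + s + "U" + v + "XY"
         for b, m, p, s, v in product("BC", "MN", "PQ", "ST", "VW")]

totals = defaultdict(float)
matched = credit("OPRSUVXY", 100, paths, totals)
print(sorted(matched))
print(totals["ABDLMOPRSUVXY"])
```

With this path set, OPRSUVXY matches the same four paths as the slide's example and each receives 25.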

Page 11: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Region Formation

• We use region-based paths
– Makes total # paths more manageable
– Limits number of matching paths

• Rules for Region R:
– R must be single entry
– R may not cross loop boundaries
  • Loop Regions created first
– R may not cross function boundaries
– Total # paths in R is limited by a threshold
– R must be as large as possible

• Side Effects of Region Formation
– Partial Paths must be split at:
  • Loop boundaries
  • Function boundaries
  • Region boundaries

[Figure: CFG with blocks A through Y, divided into Region 1, Region 2, and Region 3]
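The path-count threshold matters because every two-way branch in sequence doubles the number of paths. A small sketch of counting paths in a single-entry acyclic region follows; the diamond-chain CFG and the helper names are hypothetical, and the slides do not state what threshold value is used.

```python
from functools import lru_cache

def count_paths(succs, entry):
    """Number of distinct paths from entry to any exit of an acyclic region."""
    @lru_cache(maxsize=None)
    def paths_from(block):
        nexts = succs.get(block, ())
        if not nexts:                 # region exit
            return 1
        return sum(paths_from(n) for n in nexts)
    return paths_from(entry)

def diamond_chain(k):
    """Hypothetical region: k if-then-else diamonds in sequence."""
    succs = {}
    for i in range(k):
        succs[f"h{i}"] = (f"l{i}", f"r{i}")
        succs[f"l{i}"] = (f"h{i + 1}",)
        succs[f"r{i}"] = (f"h{i + 1}",)
    return succs

for k in (4, 10, 20):
    print(k, count_paths(diamond_chain(k), "h0"))
```

A region former could compare this count against its threshold before admitting another block into R.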

Page 12: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Path Generation Example

• Suppose we encounter these paths:
– ABDLMOP
– ABDEFHIK
  • Split into ABD, EFHIK
– OPRSUVX

Partial Path   Count   Matches         Inc    Total
ABDLMOP        100     ABDLMOPRSUVX    +25    25
                       ABDLMOPRSUWX    +25    25
                       ABDLMOPRTUVX    +25    25
                       ABDLMOPRTUWX    +25    25
ABD            160     ABDLMOPRSUVX    +10    35
                       …(14 more)
                       ABDLNOQRTUWX    +10    10
EFHIK          160     EFHIK           +160   160
OPRSUVX        280     ABDLMOPRSUVX    +70    105
                       ABDLNOPRSUVX    +70    80
                       ACDLMOPRSUVX    +70    70
                       ACDLNOPRSUVX    +70    70

[Figure: CFG with blocks A through Y, divided into Region 1, Region 2, and Region 3]
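The split of ABDEFHIK into ABD and EFHIK can be sketched as cutting a partial path wherever consecutive blocks belong to different regions. The block-to-region map below is an assumption chosen only to reproduce the slide's split (blocks E through K in one region, all other blocks in another); loop and function boundaries are not modeled.

```python
from itertools import groupby

def split_at_region_boundaries(partial, region_of):
    """Cut a partial path into maximal runs of blocks in the same region."""
    return ["".join(blocks)
            for _, blocks in groupby(partial, key=lambda b: region_of[b])]

# Assumed region map: E..K form one region, every other block another.
region_of = {b: (2 if b in "EFGHIJK" else 1)
             for b in "ABCDEFGHIJKLMNOPQRSTUVWX"}

print(split_at_region_boundaries("ABDEFHIK", region_of))
print(split_at_region_boundaries("ABDLMOP", region_of))
```

Each resulting piece keeps the full count of the original partial path, which is why both ABD and EFHIK carry 160 in the table.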

Page 13: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Methodology

• Experiments run on Itanium 2

• Developed tool using perfmon kernel interface and libpfm [perfmon] to interface with PMU

• Benchmarks
– Set of SPEC2000 benchmarks
– Compiled with the OpenIMPACT Research Compiler [oicc]
  • Without aggressive profile-directed optimizations

• Off-line analysis with OpenIMPACT module

• Compared to full path profile gathered with a PIN path profiling tool

Page 14: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Effect of Sampling Period

• Knee of Overhead curve ~500K

• Number of Unique Paths consistently grows as sampling period decreases
– Levels off some between 50K and 100K

[Figure: two panels, 164.gzip and 300.twolf; Unique Paths and Overhead (%) vs. Sampling Period (50K, 100K, 500K, 1M, 5M, 10M)]

Page 15: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Accuracy Results

• Accuracy measured similarly to Wall’s weight matching scheme [Wall91]

[Figure: Accuracy Vs. Sampling Period; Accuracy (%) against Sampling Period (50K to 500M) for 164.gzip, 175.vpr, 177.mesa, 179.art, 181.mcf, 183.equake, 188.ammp, 197.parser, 256.bzip2, 300.twolf]
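For readers unfamiliar with weight matching, a rough sketch follows. It takes the n paths the estimated profile ranks hottest and asks how much of the weight of the n truly hottest paths they capture. The slides say only that the metric is "similar to" Wall's scheme, so this formulation, the choice of n, and all of the profile numbers below are assumptions.

```python
def weight_matching_accuracy(estimated, actual, n):
    """Actual weight of the estimated top-n paths, divided by the actual
    weight of the true top-n paths."""
    top_estimated = sorted(estimated, key=estimated.get, reverse=True)[:n]
    top_actual = sorted(actual, key=actual.get, reverse=True)[:n]
    captured = sum(actual.get(p, 0) for p in top_estimated)
    best = sum(actual[p] for p in top_actual)
    return captured / best if best else 0.0

# Hypothetical profiles: 'actual' from full instrumentation,
# 'estimated' from PMU sampling plus path crediting.
actual = {"p1": 60, "p2": 25, "p3": 10, "p4": 5}
estimated = {"p1": 50, "p3": 30, "p2": 15, "p5": 5}

accuracy = weight_matching_accuracy(estimated, actual, n=2)
print(f"{accuracy:.1%}")
```

Here the estimate picks p1 and p3 (actual weight 70) where the true top two are p1 and p2 (actual weight 85), so the score is 70/85.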

Page 16: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Incorrectly Detected Paths

• With our path crediting technique:
– We can distinguish hot paths within a region
– May incorrectly detect hot paths in the program
  • May be crediting cold paths enough for them to seem hot compared to the rest of the program

Partial Path   Count   Matches         Inc   Total
ABDLMOP        100     ABDLMOPRSUVX    +25   25
                       ABDLMOPRSUWX    +25   25
                       ABDLMOPRTUVX    +25   25
                       ABDLMOPRTUWX    +25   25

[Figure: CFG with blocks A through Y, divided into Region 1, Region 2, and Region 3]

Page 17: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Partial Path Length

• Length of Partial Paths drops drastically from splitting on function boundaries and loop back edges

[Figure: Partial Path Length; Length (# Ops), Initial Length vs. Length After Splitting, for 164.gzip, 175.vpr, 177.mesa, 179.art, 181.mcf, 183.equake, 186.crafty, 188.ammp, 197.parser, 256.bzip2, 300.twolf]

Page 18: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Function Correlation

• MANY partial paths cross function boundaries
– Should use function correlation

[Figure: Function Boundaries Spanned By Partial Paths; share of partial paths (0% to 100%) spanning 0, 1, 2, 3, or 4 function boundaries, for 164.gzip, 175.vpr, 177.mesa, 179.art, 181.mcf, 183.equake, 186.crafty, 188.ammp, 197.parser, 256.bzip2, 300.twolf]

Page 19: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Multiple Runs

• May be possible to use multiple runs to provide more accurate path profile data

[Figure: 164.gzip; # Unique Paths vs. Number of Aggregated Runs (1 to 18), one curve per sampling period (50K, 100K, 500K, 1M, 5M, 10M)]
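Aggregating runs can be pictured as summing per-run count tables and watching how many distinct paths have been seen so far. The sketch below is one simple way to do that; the per-run profiles are made-up placeholders, and the slides do not say how the aggregation was actually implemented.

```python
from collections import Counter

def aggregate_runs(run_profiles):
    """Sum per-run path counts; also report the number of unique paths
    seen after each additional run is folded in."""
    total = Counter()
    unique_after_each_run = []
    for profile in run_profiles:
        total.update(profile)          # adds counts path by path
        unique_after_each_run.append(len(total))
    return total, unique_after_each_run

# Hypothetical per-run profiles (path -> sampled count).
runs = [
    {"ABD": 3, "EFHIK": 1},
    {"ABD": 2, "OPRSUVX": 4},
    {"EFHIK": 2},
]
total, unique_counts = aggregate_runs(runs)
print(unique_counts)
print(total["ABD"])
```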

Page 20: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Future Work

• Region Formation
– Characterize quality of our regions
  • Important because no correlation between regions
– Regions stretching across function boundaries

• Noise Elimination
– Crucial to removing false positives due to path crediting

• Effects of Optimization
– Find effects of superblocks, inlining, etc. on partial paths and accuracy of path profile

Page 21: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Conclusion

• We introduce the rationale for PMU-based path profiling and present initial data

• PMU-based profiling shows promise

• At Sampling Period = 5M cycles
– ~85% accurate
– ~1% overhead

Questions?

Page 22: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

References

[Bala00] V. Bala, E. Duesterwald and S. Banerjia. “Dynamo: A Transparent Dynamic Optimization System”, PLDI 2000.

[Ball92] T. Ball and J.R. Larus. “Optimally Profiling and Tracing Programs”, TOPLAS 1992.

[Ball96] T. Ball and J.R. Larus. “Efficient Path Profiling”, MICRO-29, 1996.

[Bond05] M.D. Bond and K.S. McKinley. “Practical Path Profiling for Dynamic Optimizers”, CGO 2005.

[Bruen03] D. Bruening, R. Garnett and S. Amarasinghe. “An Infrastructure for Adaptive Dynamic Optimization”, CGO 2003.

[Chen03] H. Chen, W.C. Hsu, J. Lu, P.C. Yew and D.Y. Chen. “Dynamic Trace Selection Using Performance Monitoring Hardware Sampling”, CGO 2003.

[Conte94] T.M. Conte, B.A. Patel and J.S. Cox. “Using Branch Handling Hardware to Support Profile-Driven Optimization”, MICRO-27, 1994.

Page 23: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

References (cont)

[Intel04] Intel. “Intel Itanium 2 Processor Reference Manual: For Software Development and Optimization”, May 2004.

[Joshi04] R. Joshi, M.D. Bond and C. Zilles. “Targeted Path Profiling: Lower Overhead Path Profiling for Staged Dynamic Optimization Systems”, CGO 2004.

[Kistler01] T. Kistler and M. Franz. “Continuous Program Optimization”, IEEE Trans. on Computers, v50 no6, June 2001.

[Lu04] J. Lu, H. Chen, P.C. Yew and W.C. Hsu. “Design and Implementation of a Lightweight Dynamic Optimization System”, Journal of ILP 6, 2004.

[Merten00] M.C. Merten, A.R. Trick, E.M. Nystrom, R.D. Barnes, and W.W. Hwu. “A Hardware Mechanism for Dynamic Extraction and Relayout of Program Hot Spots”, ISCA 2000.

[oicc] http://gelato.uiuc.edu

[pin] http://rogue.colorado.edu/Pin

Page 24: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Extra Slides

Page 25: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

ADORE Trace Selection

• Goal: Gather hot traces with many cache misses to add pre-fetches

• However, hot traces may not be enough to detect full hot paths

• Compiler can perform further analysis
– Correlate BTB-based traces into longer paths

[Figure: the Itanium 2 PMU samples the last 4 taken branches into the Branch Trace Table; a BTB Trace shown against a Hot Path]

Branch Trace Table

Branch Trace         D1 Misses   I1 Misses   Cycles
10ac,640,66c,10c6    350         128         81280
…                    …           …           …

Page 26: Analysis of Path Profiling Information Generated with Performance Monitoring Hardware

Partial Path Characteristics

• Partial Path extensions increase length ~20%

• However, splitting drastically decreases lengths
– ~30% on function boundaries, ~20% more on loop back edges

• Many paths span 1 or more function boundaries
– Indicates a great amount of function correlation is being thrown away

Average Partial Path Lengths

Benchmark     Initial   Ext    Func   Loop
164.gzip      28.9      34.8   22.8   20.4
175.vpr       41.8      50.5   30.6   19.6
177.mesa      44.6      53.8   35.3   33.0
179.art       29.5      34.7   32.1   22.9
181.mcf       32.0      38.8   33.7   25.5
183.equake    65.8      75.1   66.8   54.5
186.crafty    36.0      45.2   31.7   31.1
188.ammp      31.3      39.5   36.4   28.5
197.parser    28.7      35.2   14.7   12.7
256.bzip2     38.8      45.8   33.4   22.7
300.twolf     37.8      46.5   32.5   25.4

Function Boundaries Spanned

Benchmark     0      1      2     3     4
164.gzip      187    149    35    2     0
175.vpr       234    171    162   65    12
177.mesa      75     50     27    7     0
179.art       230    29     7     1     0
181.mcf       1106   237    96    26    1
183.equake    231    31     24    1     0
186.crafty    1367   2281   968   154   13
188.ammp      298    123    42    5     2
197.parser    488    654    599   224   40
256.bzip2     532    353    279   55    9
300.twolf     1293   514    467   55    7