www.compaq.com

Using Interpretation for Profiling the Alpha 21264a

Kip Walker
Mike Burrows, Úlfar Erlingsson, Mark Vandevoorde, Carl Waldspurger, William E. Weihl and many more.

Introduction

[Slide background: overlapping DCPI-style listings of annotated Alpha assembly (sample counts, addresses, instructions), for example:]

    24   0  0xca8c  stt     $f16, 56(sp)
    36   0  0xca90  addssu  $f25,$f28,$f16
    27   0  0xca94  ldl     at, 8(a2)

Changing this ONE instruction will make my Java programs run 2.3% faster!

HOW CAN I FIND IT?

HOW DO I FIX IT??

The Options

Read the source -- not always useful
Read the assembly -- hard, not always useful
Simulation -- very slow, infeasible
Instrumentation -- slow, interference
Sample-based profiling -- not enough detail

Or use periodic interpretation

It’s Not Easy

A true story:
- Sometimes program X runs twice as long as usual
- Variance due to # of bytes in environment vars!
  - Base address of main()’s stack had dramatic effect
- Simulation eventually revealed the problem

Information requirements
- Detailed instruction behavior profile
- Contents of registers
- Correlated data for nearby instructions

Outline

Out-of-order Processors
Performance Problems
Why Interpretation?
Profiling Infrastructure
An Example
Evaluation
Future Work
Summary

Out-of-order Processors

Try to exploit instruction-level parallelism
- Fetch, issue 4 instructions at a time
- Many function units
- Retire up to 11 instructions in a cycle

Fetch in-order
Execute out of order
Retire in-order

Enemies of Performance

Bad cache utilization
Static stalls / dependences
Branch misprediction
Illegal re-ordering

Pipeline traps!

Traps

Processor detects that it let “bad things” happen
- wrong instructions executed
- instructions may have seen incorrect data
- up to 80 in-flight instructions thrown out!

Branch mispredict:

[Diagram: a beq is fetched and predicted not taken ("[Predict !taken]"), so the instructions following it are fetched and executed. When the beq executes it turns out to be TAKEN; the instructions already in flight are ABORTED and fetch restarts.]

Memory Order Traps

Memory operations are freely reordered
Must enforce consistent view of memory
Problems are detected dynamically

(a) reordered operations to overlapping bytes - “order” trap

    program order:      execute order:
    store to X          load from X
    ...                 ...
    load from X         store to X
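
The condition in (a) can be sketched in a few lines of Python. This is only an illustrative model, not the processor's actual detection logic: the `MemOp` record, the `overlaps` predicate and the `order_trap_candidates` helper are hypothetical names, and the trace format is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemOp:
    seq: int    # position in program order
    kind: str   # "load" or "store"
    addr: int   # effective address of the first byte accessed
    size: int   # access width in bytes

def overlaps(a: MemOp, b: MemOp) -> bool:
    """True if the two accesses touch at least one common byte."""
    return a.addr < b.addr + b.size and b.addr < a.addr + a.size

def order_trap_candidates(executed):
    """Scan ops in execute order and return (store, load) pairs where a
    load ran before an older store to overlapping bytes."""
    pairs = []
    for i, first in enumerate(executed):
        for later in executed[i + 1:]:
            if (first.kind == "load" and later.kind == "store"
                    and later.seq < first.seq and overlaps(first, later)):
                pairs.append((later, first))
    return pairs

store = MemOp(seq=0, kind="store", addr=0x1000, size=8)
load = MemOp(seq=1, kind="load", addr=0x1004, size=4)

# The younger load executes ahead of the older store: one candidate pair.
print(order_trap_candidates([load, store]))
# Executing in program order produces no candidates.
print(order_trap_candidates([store, load]))
```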

Troll Traps

(b) accesses resulting in contention for a cache line - “troll” trap
- not allowed to have more than one outstanding fill request
- unspecified ordering of responses from L2 cache
- replay the load until the fill happens

[Diagram: "Load from X" and "Load from Y" go to the L1 data cache, which is backed by the L2 cache. Labels on the figure: X, Y, "Miss!", "? ?".]

Wrong Size Traps

(c) wide load follows narrow store - “size” trap

[Diagram: "Store-long mem(x)" is held in the store queue; the following "Load-quad mem(x)" is drawn against both the store queue and the L1 data cache.]
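
As a rough illustration of case (c), the check below flags a load that overlaps an earlier store but needs bytes the store did not write. The function name and the "overlaps but is not fully covered" rule are assumptions made for this sketch; the slide only states "wide load follows narrow store". On Alpha a longword is 4 bytes and a quadword is 8 bytes.

```python
def is_size_trap_candidate(store_addr, store_size, load_addr, load_size):
    """Return True when a later load overlaps an earlier store's bytes
    but also reads bytes that the store did not write."""
    store_end = store_addr + store_size
    load_end = load_addr + load_size
    overlap = store_addr < load_end and load_addr < store_end
    covered = store_addr <= load_addr and load_end <= store_end
    return overlap and not covered

X = 0x2000  # hypothetical address standing in for mem(x)

# Store-long (4 bytes) followed by load-quad (8 bytes) at the same address.
print(is_size_trap_candidate(X, 4, X, 8))   # True
# Store-long followed by load-long: the store supplies every byte needed.
print(is_size_trap_candidate(X, 4, X, 4))   # False
```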

A Better Way

Need a runtime solution
- Notice when two instructions in a trace “match”
- Observe effective addresses of memory ops

Interpret instruction traces
- Emulate (most) operations
- Apply statistically to cover whole system
- Extends the power of sample-based profiling

Available Information

Control Flow – Edge Frequencies
- Return address (in register or on stack)
- Branch taken direction

Computed values
- Function arguments, results
- Load/store addresses

Possible replay trap culprits

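ProfileMe on Alpha 21264a

[Diagram: pipeline stages fetch, map, issue, exec, retire, with the icache and branch predictor feeding fetch. A fetch counter with random selection checks "overflow?" and applies the ProfileMe tag to an instruction. For the tagged instruction ("tagged?" / "capture!") the hardware records into internal processor registers: pc, imiss?, map stall?, mispredict?, dtbmiss?, taken?, notrap?, replay?, retired?. An interrupt is then raised.]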
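
The "fetch counter" and "random selection" boxes suggest a randomized sampling interval. A minimal software sketch of that idea follows. The function name, the uniform distribution and the count-down formulation are all assumptions for illustration and do not describe the 21264a hardware.

```python
import random

def sample_fetch_stream(num_fetched, mean_interval, seed=0):
    """Return the indices of fetched instructions that get tagged.

    A counter is loaded with a random value and decremented once per
    fetched instruction; when it reaches zero the current instruction
    is tagged and the counter is reloaded. Randomizing the reload value
    avoids sampling in lockstep with loops in the program.
    """
    rng = random.Random(seed)
    tagged = []
    countdown = rng.randint(1, 2 * mean_interval - 1)
    for index in range(num_fetched):
        countdown -= 1
        if countdown == 0:
            tagged.append(index)
            countdown = rng.randint(1, 2 * mean_interval - 1)
    return tagged

tags = sample_fetch_stream(num_fetched=10_000, mean_interval=100, seed=1)
print(len(tags), tags[:5])
```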

ProfileMe Interrupt

[Diagram: the program instruction stream executes instructions until a ProfileMe interrupt arrives. The handler reads the counters, gets the PID/PC, and logs the event in a hash table. The interrupt returns and the program goes back to executing instructions.]
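
The "log event in hash table" step might look roughly like the following. The `SampleLog` class, its method names and the key layout (PID, PC, event name) are hypothetical; the real handler's data structure is not shown on the slide.

```python
from collections import Counter

class SampleLog:
    """Aggregate samples in a hash table keyed by (pid, pc, event)."""

    def __init__(self):
        self.counts = Counter()

    def on_interrupt(self, pid, pc, events):
        """Record one sample: every event flag seen for this instruction."""
        for event in events:
            self.counts[(pid, pc, event)] += 1

    def total(self, event):
        """Total number of samples carrying the given event flag."""
        return sum(n for (_, _, e), n in self.counts.items() if e == event)

log = SampleLog()
log.on_interrupt(pid=42, pc=0x2002d2cc, events=["retired", "replays"])
log.on_interrupt(pid=42, pc=0x2002d2cc, events=["retired", "replays"])
log.on_interrupt(pid=42, pc=0x2002d2d0, events=["retired"])
print(log.total("replays"), log.total("retired"))
```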

Interpretation - Value Profiling

[Diagram: the program instruction stream executes natively until a ProfileMe interrupt. The register contents are handed to an interpreter that runs in the interrupt handler, follows a partial CFG, and updates registers and memory. Register contents are logged with the profile data. When the interrupt returns, native execution resumes with the new register values.]

Interpreter Details

Initial register values delivered with interrupt
Interpret n instructions or until bail
- PALcode (OS support)
- Page fault

Branches and jumps are interpreted
- can’t detect mispredicts

Memory accesses are performed
- can’t detect cache misses

Final register state updated
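
The "interpret n instructions or until bail" loop can be sketched with a toy interpreter. Everything here is a simplification assumed for illustration: the tuple instruction encoding, the tiny opcode subset, and the use of a dictionary as "mapped memory" so that a missing key stands in for a page fault.

```python
def interpret(program, regs, memory, pc, max_instructions):
    """Interpret up to max_instructions, starting at program index pc.

    Returns (new_pc, reason, samples). samples holds (pc, value) pairs:
    the result for arithmetic, the effective address for memory ops.
    regs and memory are updated in place, like the final register state.
    """
    samples = []
    for _ in range(max_instructions):
        if not 0 <= pc < len(program):
            return pc, "bail: left program", samples
        op, *args = program[pc]
        if op == "call_pal":                  # PALcode: give up, run natively
            return pc, "bail: PALcode", samples
        if op == "addq":
            dst, a, b = args
            regs[dst] = regs[a] + (regs[b] if isinstance(b, str) else b)
            samples.append((pc, regs[dst]))
            pc += 1
        elif op in ("ldq", "stq"):
            reg, base, offset = args
            addr = regs[base] + offset
            if addr not in memory:            # stands in for a page fault
                return pc, "bail: page fault", samples
            if op == "ldq":
                regs[reg] = memory[addr]
            else:
                memory[addr] = regs[reg]
            samples.append((pc, addr))
            pc += 1
        elif op == "bne":                     # branches are simply followed
            reg, target = args
            pc = target if regs[reg] != 0 else pc + 1
        else:
            return pc, "bail: unsupported", samples
    return pc, "limit reached", samples

program = [("addq", "t0", "t0", 1), ("stq", "t0", "sp", 8),
           ("ldq", "t1", "sp", 8), ("call_pal",)]
regs = {"t0": 41, "t1": 0, "sp": 0x100}
memory = {0x108: 0}
print(interpret(program, regs, memory, pc=0, max_instructions=128))
```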

Values Captured

Arithmetic -- result value
Memory op -- effective address
Indirect jump -- destination address
…

and current return address in all cases
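
One plausible way to aggregate captured values is a per-PC table of value counts, summarized by how dominant the most common value is. The `ValueProfile` class and its summary fields are assumptions for this sketch and are not taken from the actual tools.

```python
from collections import Counter, defaultdict

class ValueProfile:
    """Per-instruction histogram of captured values."""

    def __init__(self):
        self.values = defaultdict(Counter)   # pc -> Counter of values seen

    def record(self, pc, value):
        self.values[pc][value] += 1

    def summary(self, pc):
        """Return (total samples, distinct values, top value, top share %)."""
        seen = self.values.get(pc)
        if not seen:
            return 0, 0, None, 0.0
        total = sum(seen.values())
        top_value, top_count = seen.most_common(1)[0]
        return total, len(seen), top_value, round(100.0 * top_count / total, 1)

profile = ValueProfile()
for addr in (0x1000, 0x1000, 0x1000, 0x2000):
    profile.record(pc=0x400, value=addr)
print(profile.summary(0x400))
```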

Interpretation - Replay Traps

[Diagram: the program instruction stream executes natively until a ProfileMe interrupt. The handler interprets the following instructions, tracking register dependences and effective addresses, then analyzes them and reports possible culprits as value samples. The interrupt returns and native execution resumes.]
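
The "analyze" step could be approximated by scanning the interpreted trace for earlier stores whose bytes overlap each load. This is a sketch under assumptions: the trace tuple layout, the `possible_culprits` name, the look-back `window` default and the example PCs and addresses are all invented for illustration.

```python
from collections import Counter

def possible_culprits(trace, window=80):
    """trace: list of (pc, kind, addr, size) in interpreted order.

    For every load, look back over up to `window` earlier memory ops and
    count each store that touches overlapping bytes as a possible culprit.
    Returns a Counter keyed by (load_pc, store_pc).
    """
    culprits = Counter()
    for i, (pc, kind, addr, size) in enumerate(trace):
        if kind != "load":
            continue
        for s_pc, s_kind, s_addr, s_size in trace[max(0, i - window):i]:
            if (s_kind == "store"
                    and s_addr < addr + size and addr < s_addr + s_size):
                culprits[(pc, s_pc)] += 1
    return culprits

trace = [
    (0x100, "store", 0x8000, 4),   # narrow store
    (0x104, "store", 0x9000, 8),   # unrelated store
    (0x108, "load",  0x8000, 8),   # wide load over the first store's bytes
]
print(possible_culprits(trace))
```

A report built this way only lists candidates; it cannot confirm that a given pair actually caused a trap in the native run.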

Example Profile - MTRT

> dcpiprof $labels $db -pm replays mtrtbase.exe

Column         Total  Period (for events)
------         -----  ------
replays:count    397  126976
===========================================================
replays
 :count       %  procedure                            image
    100  25.19%  ...OctNode.Intersect(...)            mtrtbase.exe
     51  12.85%  java.io.BufferedInputStream.read()   mtrtbase.exe
     48  12.09%  ...Vector.Dot(...)                   mtrtbase.exe
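
The percentage column can be reproduced from the counts in the listing above. The final line assumes that "Period" is the average number of events per sample, so that samples times period estimates the event count; that reading is an assumption, not something the slide states.

```python
total_samples = 397
period = 126976

rows = [
    ("...OctNode.Intersect(...)", 100),
    ("java.io.BufferedInputStream.read()", 51),
    ("...Vector.Dot(...)", 48),
]

def share(count, total):
    """Percentage of all samples attributed to one procedure."""
    return round(100.0 * count / total, 2)

for name, count in rows:
    print(f"{count:5d}  {share(count, total_samples):5.2f}%  {name}")

# Assumed interpretation: each sample stands for `period` events.
estimated_events = total_samples * period
print(estimated_events)
```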

Replays in OctNode.Intersect

> dcpilist $labels $db -pm replays \
    '...OctNode.Intersect(...)' mtrtbase.exe
...OctNode.Intersect(...):
replays
 :count
   code elided
      0  0x2002d2a0  stt   $f8, 104(sp)
      0  0x2002d2a4  bis   a0, a0, s5
      0  0x2002d2a8  bis   a1, a1, s6
      0  0x2002d2ac  bis   a2, a2, s4
      0  0x2002d2b0  stt   $f19, 8(sp)
      0  0x2002d2b4  bsr   ra, 0x20022250
      0  0x2002d2b8  bis   v0, v0, a0
      0  0x2002d2bc  cpys  $f31,$f31,$f17
      0  0x2002d2c0  cpys  $f31,$f31,$f18
      0  0x2002d2c4  cpys  $f31,$f31,$f19
      0  0x2002d2c8  bis   v0, v0, s2
     43  0x2002d2cc  ldq   at, 0(a0)
      0  0x2002d2d0  bsr   ra, 0x20027a50

Order? Wrong Size? Troll? Queue Full?

Replay Trap Value Profile

> dcpilist $labels $db -pm replays -vreplay \
    '...OctNode.Intersect(...)' mtrtbase.exe
...OctNode.Intersect(...):
replays
 :count                                      vtot  thld  nv
   code elided
      0  0x2002d2a0  stt   $f8, 104(sp)         5   1.0   1  (100.0% 0x2002b4f8)
      0  0x2002d2a4  bis   a0, a0, s5           0   0.0   0
      0  0x2002d2a8  bis   a1, a1, s6           0   0.0   0
      0  0x2002d2ac  bis   a2, a2, s4           0   0.0   0
      0  0x2002d2b0  stt   $f19, 8(sp)          0   0.0   0
      0  0x2002d2b4  bsr   ra, 0x20022250       0   0.0   0
      0  0x2002d2b8  bis   v0, v0, a0           0   0.0   0
      0  0x2002d2bc  cpys  $f31,$f31,$f17       0   0.0   0
      0  0x2002d2c0  cpys  $f31,$f31,$f18       0   0.0   0
      0  0x2002d2c4  cpys  $f31,$f31,$f19       0   0.0   0
      0  0x2002d2c8  bis   v0, v0, s2           0   0.0   0
     43  0x2002d2cc  ldq   at, 0(a0)           25   1.0   1  (100.0% 0x203f10d0)
      0  0x2002d2d0  bsr   ra, 0x20027a50       0   0.0   0

Possible Conflicting Instruction (accesses overlapping bytes)

Conflicting Instruction

> dcpilist -vreplay -vshow 1 $labels $db -pm repl '0x203f10d0' \
    mtrtbase.exe
comp_alloc_fast:
replays
 :count                                      vtot  thld  nv
      0  0x203f10c0  ldq   t1, 64(s0)          88   1.0   4  (48.9% 0x203f10d8)
      0  0x203f10c4  ldq   v0, 56(s0)          98   1.0  12  (43.9% 0x203f10dc)
      0  0x203f10c8  subq  t1, a2, t1           0   0.0   0
      0  0x203f10cc  blt   t1, 0x203f1134       0   0.0   0
      1  0x203f10d0  stl   a1, 0(v0)           16   1.0  16  (6.2% T 0x2002b464)
      0  0x203f10d4  addq  v0, a2, t2           0   0.0   0
      0  0x203f10d8  stq   t1, 64(s0)          43   1.0   2  (97.7% 0x203f10d8)
      1  0x203f10dc  stq   t2, 56(s0)          46   1.0   6  (89.1% 0x203f10dc)
      0  0x203f10e0  ret   zero, (ra), 1        0   0.0   0

4-byte method pointer write in code for JVM’s new, followed by an 8-byte object header read for the null check: a wrong_size replay trap for every allocation.

Fix with 4-byte reads for null check!
2.3% speedup across SPECjvm98

(yes it matters!!)

Avoiding Traps

“Build a better …” {program, compiler, processor}
- Change access widths
- Try to get loads/stores further apart
- Correct unfortunate data alignment
- Avoid filling load/store queues
- Improve instruction slotting

Interpretation Parameters

Frequency
- don’t need to interpret on every interrupt

Duration
- longer runs find more possible traps... (interacting instructions can be > 80 apart!)
- ...but they are more expensive
  - we are running at highest priority
  - more time interpreting
  - more culprit data to collect
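
The two knobs, frequency and duration, can be modeled as a small policy object consulted on each interrupt. The class and parameter names are hypothetical; the default values of 128 simply mirror the "paths of 128 every 128 interrupts" configuration used in the evaluation.

```python
class InterpretationPolicy:
    """Decide, per interrupt, how many instructions to interpret."""

    def __init__(self, every_n_interrupts=128, path_length=128):
        self.every = every_n_interrupts
        self.path_length = path_length
        self.seen = 0

    def on_interrupt(self):
        """Return the interpretation budget for this interrupt (0 = skip)."""
        self.seen += 1
        if self.seen % self.every == 0:
            return self.path_length
        return 0

    def average_budget(self):
        """Average interpreted instructions per interrupt, at most."""
        return self.path_length / self.every

policy = InterpretationPolicy()
budgets = [policy.on_interrupt() for _ in range(256)]
print(sum(1 for b in budgets if b), sum(budgets), policy.average_budget())
```

Raising `path_length` widens the window in which interacting instructions can be caught, while raising `every_n_interrupts` lowers the time spent in the handler.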

Evaluation - Overhead

Single runs of 11 early cpu2000 int benchmarks
Dual 667 MHz Alpha 21264a
Paths of 128 every 128 interrupts
- 225/sec

Overhead (%)

Benchmark   ProfileMe   w/ Interp.
    1          1.20        4.70
    2          2.50        4.70
    3          2.60        3.30
    4          2.80        3.40
    5          2.90        6.00
    6          3.20        3.90
    7          3.40        4.70
    8          3.50        5.70
    9          3.80       19.20
   10          4.00        5.00
   11          7.20        7.20

g.mean         3.11        5.37

[Two entries are flagged with "?" on the slide.]
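
The g.mean row can be checked directly from the eleven per-benchmark overheads in the table. The helper name below is ours; the numbers are copied from the slide.

```python
import math

profileme = [1.20, 2.50, 2.60, 2.80, 2.90, 3.20, 3.40, 3.50, 3.80, 4.00, 7.20]
with_interp = [4.70, 4.70, 3.30, 3.40, 6.00, 3.90, 4.70, 5.70, 19.20, 5.00, 7.20]

def geometric_mean(values):
    """n-th root of the product, computed via logs for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

print(round(geometric_mean(profileme), 2))    # 3.11
print(round(geometric_mean(with_interp), 2))  # 5.37
```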

Future Work

Measure overhead for other frequencies/lengths
Evaluate ability to actually find culprits
Optimize data flow
Sample unbiasing
- more likely to discover culprits nearby
- more interpretation windows will cover both instrs.

Try to filter more unlikely culprits

Summary

Low-impact way to get trace information
- No special requirements for processor
- Benefits of statistical sampling

Manageable overhead
Useful applications
- Value profiling - code specialization, online optim.
- Path profiling - edge counts
- Pipeline trap explanation - replay trap culprits

www.compaq.com