www.compaq.com
Using Interpretation for Profiling the Alpha 21264a
Kip Walker
Mike Burrows, Úlfar Erlingsson, Mark Vandevoorde, Carl Waldspurger, William E. Weihl and many more.
2 / 30
Introduction
[Slide background: a wall of profile-annotated Alpha assembly listings, each line giving sample counts, an address, and an instruction, e.g. "43 0x2002d2cc ldq at, 0(a0)"]
Changing this ONE instruction will make my Java programs run 2.3% faster!
HOW CAN I FIND IT?
HOW DO I FIX IT??
3 / 30
The Options
Read the source: not always useful
Read the assembly: hard, not always useful
Simulation: very slow, infeasible
Instrumentation: slow, interference
Sample-based profiling: not enough detail
Or use periodic interpretation
4 / 30
It’s Not Easy
A true story:
  Sometimes program X runs twice as long as usual
  Variance due to # of bytes in environment vars!
    Base address of main()’s stack had dramatic effect
  Simulation eventually revealed the problem
Information requirements
  Detailed instruction behavior profile
  Contents of registers
  Correlated data for nearby instructions
5 / 30
Outline
Out-of-order Processors
Performance Problems
Why Interpretation?
Profiling Infrastructure
An Example
Evaluation
Future Work
Summary
6 / 30
Out-of-order Processors
Try to exploit instruction-level parallelism
  Fetch, issue 4 instructions at a time
  Many function units
  Retire up to 11 instructions in a cycle
Fetch in-order
Execute out of order
Retire in-order
7 / 30
Enemies of Performance
Bad cache utilization
Static stalls / dependences
Branch misprediction
Illegal re-ordering
Pipeline traps! [brace on the slide groups entries of the list above]
8 / 30
Traps
Processor detects that it let “bad things” happen
  wrong instructions executed
  instructions may have seen incorrect data
  up to 80 in-flight instructions thrown out!
Branch mispredict:
[Diagram: a beq is fetched and predicted not taken; the instructions after it are fetched and executed; the branch then resolves as TAKEN, the in-flight instructions are ABORTED, and fetch restarts]
9 / 30
Memory Order Traps
Memory operations are freely reordered
Must enforce consistent view of memory
Problems are detected dynamically
(a) reordered operations to overlapping bytes: “order” trap
  program order: store to X ... load from X
  execute order: load from X ... store to X
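The condition in (a) comes down to a byte-range overlap test between two memory operations that executed out of program order. A minimal sketch in Python, assuming each operation is described by an effective address and an access width; the names `MemOp`, `overlaps`, and `reordered_overlap` are illustrative and are not taken from the slides or from the processor:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    seq: int    # position in program order (hypothetical field)
    addr: int   # effective address
    size: int   # access width in bytes


def overlaps(a: MemOp, b: MemOp) -> bool:
    """True if the two accesses touch at least one common byte."""
    return a.addr < b.addr + b.size and b.addr < a.addr + a.size


def reordered_overlap(first_executed: MemOp, second_executed: MemOp) -> bool:
    """True if the op that executed first comes later in program order
    and both ops touch overlapping bytes."""
    return (first_executed.seq > second_executed.seq
            and overlaps(first_executed, second_executed))
```

For example, a load at sequence 1 that runs ahead of a store at sequence 0 to the same bytes satisfies `reordered_overlap(load, store)`, while the same pair executed in program order does not.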
10 / 30
Troll Traps
(b) accesses resulting in contention for a cache line: “troll” trap
  not allowed to have more than one outstanding fill request
  unspecified ordering of responses from L2 cache
  replay the load until the fill happens
[Diagram: "Load from X" and "Load from Y" both miss in the L1 data cache and wait on fills of X and Y from the L2 cache]
11 / 30
Wrong Size Traps
(c) wide load follows narrow store: “size” trap
[Diagram: "Store-long mem(x)" sits in the store queue while a following "Load-quad mem(x)" goes to the L1 data cache]
12 / 30
A Better Way
Need a runtime solution
  Notice when two instructions in a trace “match”
  Observe effective addresses of memory ops
Interpret instruction traces
  Emulate (most) operations
  Apply statistically to cover whole system
  Extends the power of sample-based profiling
13 / 30
Available Information
Control Flow: Edge Frequencies
  Return address (in register or on stack)
  Branch taken direction
Computed values
  Function arguments, results
  Load/store addresses
Possible replay trap culprits
14 / 30
ProfileMe on Alpha 21264a
[Diagram: pipeline stages fetch, map, issue, exec, retire, fed by the icache and branch predictor. Fetch counter (random selection) overflow? -> ProfileMe tag! For the tagged instruction, internal processor registers capture: pc, imiss?, map stall?, taken?, mispredict?, dtbmiss?, replay?, notrap?, retired?, … The capture raises an interrupt.]
15 / 30
ProfileMe Interrupt
[Diagram over the program instruction stream:]
execute instructions
  -> ProfileMe interrupt
  -> Read counters; get PID/PC
  -> Log event in hash table
  -> interrupt returns
  -> execute instructions
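The "log event in hash table" step can be pictured as counting samples under a key built from the values the handler reads. A minimal sketch, assuming a key of (PID, PC, event name); the key layout, the `samples` table, and `on_profile_interrupt` are hypothetical stand-ins, not the profiler's actual data structures:

```python
from collections import Counter

# Hypothetical in-memory stand-in for the profiler's hash table.
samples: Counter = Counter()


def on_profile_interrupt(pid: int, pc: int, events: tuple[str, ...]) -> None:
    """Record one sample for each event reported for the tagged instruction."""
    for event in events:
        samples[(pid, pc, event)] += 1


# Two samples that land on the same instruction in the same process.
on_profile_interrupt(42, 0x2002D2CC, ("retired", "replays"))
on_profile_interrupt(42, 0x2002D2CC, ("retired",))
```

Keeping only a counter per key keeps the work done inside the interrupt handler short; aggregation into per-procedure reports can happen later, outside the handler.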
16 / 30
Interpretation - Value Profiling
[Diagram over the program instruction stream:]
execute native
  -> ProfileMe interrupt
  -> interpret in interrupt handler (in: register contents; out: updated regs and memory, new register values, partial CFG)
  -> log register contents with profile data
  -> interrupt returns
  -> execute native
17 / 30
Interpreter Details
Initial register values delivered with interrupt
Interpret n instructions or until bail
  PALcode (OS support)
  Page fault
Branches and jumps are interpreted
  can’t detect mispredicts
Memory accesses are performed
  can’t detect cache misses
Final register state updated
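The "interpret n instructions or until bail" loop can be sketched on a toy instruction set. Everything here is an assumption made for illustration: the tuple encoding, the three opcodes, the use of a list index as the pc, and the dictionary standing in for mapped memory (a missing key plays the role of a page fault). It is not the Alpha interpreter from the talk.

```python
def interpret(program, pc, regs, memory, n):
    """Interpret up to n toy instructions starting at index pc.

    Bails on PALcode, on a simulated page fault, or on an unknown opcode.
    Returns (next_pc, captured), where captured holds (pc, value) pairs.
    """
    captured = []
    for _ in range(n):
        if not 0 <= pc < len(program):
            break                              # ran off the toy program
        op, *args = program[pc]
        if op == "call_pal":                   # OS support: bail to native code
            break
        if op == "addq":
            dst, src, imm = args
            regs[dst] = regs[src] + imm
            captured.append((pc, regs[dst]))   # arithmetic: result value
            pc += 1
        elif op == "ldq":
            dst, base, off = args
            addr = regs[base] + off
            if addr not in memory:             # stand-in for a page fault
                break
            regs[dst] = memory[addr]
            captured.append((pc, addr))        # memory op: effective address
            pc += 1
        elif op == "beq":
            reg, target = args
            pc = target if regs[reg] == 0 else pc + 1
        else:
            break                              # cannot emulate: bail
    return pc, captured
```

Because registers and memory are updated in place, native execution can resume at the returned pc with the final register state, which is the property the last bullet above relies on.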
18 / 30
Values Captured
Arithmetic: result value
Memory op: effective address
Indirect jump: destination address
…
and current return address in all cases
19 / 30
Interpretation - Replay Traps
[Diagram over the program instruction stream:]
execute native
  -> ProfileMe interrupt
  -> interpret (collect register dependences and effective addresses)
  -> analyze
  -> report possible culprits as value samples
  -> interrupt returns
  -> execute native
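The "analyze" step can be illustrated by scanning an interpreted trace for earlier stores whose bytes overlap a given load. A minimal sketch, assuming the trace is a list of hypothetical (pc, kind, effective address, width) tuples; the "size" versus "order" labelling rule is a simplification chosen for this example, not the tool's actual classifier:

```python
def possible_culprits(trace, load_index):
    """Return [(store_pc, reason)] for earlier stores overlapping the load."""
    _, kind, l_addr, l_size = trace[load_index]
    if kind != "load":
        raise ValueError("load_index must point at a load")
    culprits = []
    for pc, op_kind, addr, size in trace[:load_index]:
        if op_kind != "store":
            continue
        # Half-open byte ranges [addr, addr + size) intersect.
        if addr < l_addr + l_size and l_addr < addr + size:
            # Assumed rule: a narrower store feeding a wider load is "size".
            reason = "size" if size < l_size else "order"
            culprits.append((pc, reason))
    return culprits


# PCs borrowed from the listings in this deck; the data addresses are made up.
trace = [
    (0x203F10D0, "store", 0x10000, 4),   # stl a1, 0(v0)
    (0x203F10D8, "store", 0x20040, 8),   # stq t1, 64(s0)
    (0x2002D2CC, "load",  0x10000, 8),   # ldq at, 0(a0)
]
culprits = possible_culprits(trace, 2)
```

Each reported pair can then be logged against the load's pc as a value sample, which is how a culprit address ends up next to an instruction in a value profile.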
20 / 30
Example Profile - MTRT
> dcpiprof $labels $db -pm replays mtrtbase.exe

Column         Total  Period (for events)
------         -----  ------
replays:count    397  126976
===========================================================
replays
 :count       %  procedure                            image
    100  25.19%  ...OctNode.Intersect(...)            mtrtbase.exe
     51  12.85%  java.io.BufferedInputStream.read()   mtrtbase.exe
     48  12.09%  ...Vector.Dot(...)                   mtrtbase.exe
21 / 30
Replays in OctNode.Intersect
> dcpilist $labels $db -pm replays \
    '...OctNode.Intersect(...)' mtrtbase.exe

...OctNode.Intersect(...):
replays
 :count
 code elided
      0  0x2002d2a0 stt $f8, 104(sp)
      0  0x2002d2a4 bis a0, a0, s5
      0  0x2002d2a8 bis a1, a1, s6
      0  0x2002d2ac bis a2, a2, s4
      0  0x2002d2b0 stt $f19, 8(sp)
      0  0x2002d2b4 bsr ra, 0x20022250
      0  0x2002d2b8 bis v0, v0, a0
      0  0x2002d2bc cpys $f31,$f31,$f17
      0  0x2002d2c0 cpys $f31,$f31,$f18
      0  0x2002d2c4 cpys $f31,$f31,$f19
      0  0x2002d2c8 bis v0, v0, s2
     43  0x2002d2cc ldq at, 0(a0)
      0  0x2002d2d0 bsr ra, 0x20027a50

Order? Wrong Size? Troll? Queue Full?
22 / 30
Replay Trap Value Profile
> dcpilist $labels $db -pm replays -vreplay \
    '...OctNode.Intersect(...)' mtrtbase.exe

...OctNode.Intersect(...):
replays
 :count                                   vtot thld nv
 code elided
      0  0x2002d2a0 stt $f8, 104(sp)         5  1.0  1 (100.0% 0x2002b4f8)
      0  0x2002d2a4 bis a0, a0, s5           0  0.0  0
      0  0x2002d2a8 bis a1, a1, s6           0  0.0  0
      0  0x2002d2ac bis a2, a2, s4           0  0.0  0
      0  0x2002d2b0 stt $f19, 8(sp)          0  0.0  0
      0  0x2002d2b4 bsr ra, 0x20022250       0  0.0  0
      0  0x2002d2b8 bis v0, v0, a0           0  0.0  0
      0  0x2002d2bc cpys $f31,$f31,$f17      0  0.0  0
      0  0x2002d2c0 cpys $f31,$f31,$f18      0  0.0  0
      0  0x2002d2c4 cpys $f31,$f31,$f19      0  0.0  0
      0  0x2002d2c8 bis v0, v0, s2           0  0.0  0
     43  0x2002d2cc ldq at, 0(a0)           25  1.0  1 (100.0% 0x203f10d0)
      0  0x2002d2d0 bsr ra, 0x20027a50       0  0.0  0

Possible Conflicting Instruction (accesses overlapping bytes)
23 / 30
Conflicting Instruction
> dcpilist -vreplay -vshow 1 $labels $db -pm repl '0x203f10d0' \
    mtrtbase.exe

comp_alloc_fast:
replays
 :count                                   vtot thld nv
      0  0x203f10c0 ldq t1, 64(s0)          88  1.0  4 (48.9% 0x203f10d8)
      0  0x203f10c4 ldq v0, 56(s0)          98  1.0 12 (43.9% 0x203f10dc)
      0  0x203f10c8 subq t1, a2, t1          0  0.0  0
      0  0x203f10cc blt t1, 0x203f1134       0  0.0  0
      1  0x203f10d0 stl a1, 0(v0)           16  1.0 16 (6.2% T 0x2002b464)
      0  0x203f10d4 addq v0, a2, t2          0  0.0  0
      0  0x203f10d8 stq t1, 64(s0)          43  1.0  2 (97.7% 0x203f10d8)
      1  0x203f10dc stq t2, 56(s0)          46  1.0  6 (89.1% 0x203f10dc)
      0  0x203f10e0 ret zero, (ra), 1        0  0.0  0

4-byte method pointer write in code for JVM’s new; 8-byte object header read for null check: wrong_size replay trap for every allocation.
Fix with 4-byte reads for null check!
2.3% speedup across SPECjvm98
(yes it matters!!)
24 / 30
Avoiding Traps
“Build a better …” {program, compiler, processor}
  Change access widths
  Try to get loads/stores further apart
  Correct unfortunate data alignment
  Avoid filling load/store queues
  Improve instruction slotting
25 / 30
Interpretation Parameters
Frequency
  don’t need to interpret on every interrupt
Duration
  longer runs find more possible traps...
    (interacting instructions can be > 80 apart!)
  ...but they are more expensive
    we are running at highest priority
    more time interpreting
    more culprit data to collect
26 / 30
Evaluation - Overhead
Single runs of 11 early cpu2000 int benchmarks
Dual 667 MHz Alpha 21264a
Paths of 128 every 128 interrupts: 225/sec

Overhead (%)
Benchmark  ProfileMe  w/ Interp.
1          1.20       4.70
2          2.50       4.70
3          2.60       3.30
4          2.80       3.40
5          2.90       6.00
6          3.20       3.90
7          3.40       4.70
8          3.50       5.70
9          3.80       19.20
10         4.00       5.00
11         7.20       7.20
g.mean     3.11       5.37
?
?
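The g.mean row is a geometric mean of each overhead column. A short sketch that recomputes it from the eleven per-benchmark values in the table; the function name `geometric_mean` and the list names are chosen for this example only:

```python
import math

# Overhead (%) per benchmark, copied from the table above.
profileme = [1.20, 2.50, 2.60, 2.80, 2.90, 3.20, 3.40, 3.50, 3.80, 4.00, 7.20]
with_interp = [4.70, 4.70, 3.30, 3.40, 6.00, 3.90, 4.70, 5.70, 19.20, 5.00, 7.20]


def geometric_mean(values):
    """n-th root of the product, computed through logs for stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


gm_profileme = geometric_mean(profileme)
gm_with_interp = geometric_mean(with_interp)
```

Rounded to two decimals, the results agree with the 3.11 and 5.37 reported in the table. The geometric mean damps the effect of the single 19.20 entry compared with an arithmetic mean.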
27 / 30
Future Work
Measure overhead for other frequencies/lengths
Evaluate ability to actually find culprits
Optimize data flow
Sample unbiasing
  more likely to discover culprits nearby
  more interpretation windows will cover both instrs.
Try to filter more unlikely culprits
28 / 30
Summary
Low-impact way to get trace information
  No special requirements for processor
  Benefits of statistical sampling
Manageable overhead
Useful applications
  Value profiling: code specialization, online optim.
  Path profiling: edge counts
  Pipeline trap explanation: replay trap culprits