Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.
-
Upload
valerie-oneal -
Category
Documents
-
view
219 -
download
0
Transcript of Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.
![Page 1: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/1.jpg)
Alternative Dispatch Techniques for the Tcl VM
Benjamin Vitale
Mathew Zaleski
![Page 2: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/2.jpg)
Tcl VM Dispatch 2
Outline
• How the VM Interprets Bytecode
• Dispatch speed on pipelined CPUs
• The Context Problem
• Context Threading
• Results
![Page 3: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/3.jpg)
Tcl VM Dispatch 3
Running a Tcl Program
TclSource
Bytecode
Interpreter
BytecodeCompiler
![Page 4: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/4.jpg)
Tcl VM Dispatch 4
Compiling to Bytecode
# find first power of 2
# greater than 100
proc find_pow {} {
set x 1
while {$x < 100} {
incr x $x
}
return $x
}
0 push1 0 # x = 12 storeScalar1 04 pop
5 jump1 +7
7 loadScalar1 0 # x += x9 incrScalar1 0
11 pop
12 loadScalar1 0 # if x < 10014 push1 1 # goto 716 lt17 jumpTrue1 -10
19 loadScalar1 0 # return x21 done
![Page 5: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/5.jpg)
Tcl VM Dispatch 5
Interpreter
push1
0
storeScalar1
0
pop
jump1
7
loadScalar1
0
incrScalar1
0
…
vpc
for (;;) {
opcode = *vpc;
switch (opcode) {
case PUSH1:
// real work…
vpc += 2;
break;
case POP:
…Bytecode Representation
![Page 6: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/6.jpg)
Tcl VM Dispatch 6
Performance Problem
• Interpreting bytecode is faster than interpreting source
• But still slow
• One problem for some VMs is high dispatch overhead
• How does switch() dispatch work?
![Page 7: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/7.jpg)
Tcl VM Dispatch 7
How C compiles switch()
push_work:add r6, 4, r6ldub [r4+1], o0ld [fp+72], o2…bra .switch_end
pop_work:ld [r2], g1add r2, -4, r2mov g1, l0…bra .switch_end
&&push_work
&&pop_work
&&add_work
&&sub_work
…
Code Addresses
![Page 8: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/8.jpg)
Tcl VM Dispatch 8
Executing switch()
ldub opc = [vpc] // Opcode load (unaligned)
cmp opc, max_opc // Bounds check (useless)bg switch_default
set r5 = switch_table // Table lookup (avoidable)mul r1 = r4 * 4ld [r5 + r1], r1
jmp r1 + r5 // Indirectly jump to work• 17 cycles
![Page 9: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/9.jpg)
Tcl VM Dispatch 9
Direct Threading
ld address = [vpc] // Opcode load (aligned)jmp address // Indirect jump
• 12 Cycles
• portably expressed in Gnu C
– we should consider this for Tcl
• 2 insns in 12 cycles. What is CPU doing?
![Page 10: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/10.jpg)
Tcl VM Dispatch 10
CPU Pipeline
F D L E …Instruction
Cache
L2Cache
add r6 += 4ld r1 = [r4]ld r2 = [fp+8]ld addr = [vpc]jmp addr???
•Keeping pipeline full requires pre-fetching. But which instructions?
![Page 11: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/11.jpg)
Tcl VM Dispatch 11
Branch Target Predictor
0 add r6 += 44 ld addr = [r1]8 cmp r6, 12
12 bg 616 jmp addr20 ld r2 = [r3]24 sll r2 = r2, 228 jmp r2
pcjmp pctarget
16 42
28 1000
Branch Target
Address Cache• Predict branch target from past behavior
![Page 12: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/12.jpg)
Tcl VM Dispatch 12
? Xpcjmp target
…
Context Problem Example
BTAC
for
switch
add print …push
pcjmp target
switch push
…
pcjmp target
switch add
…
pcjmp target
switch print
…for
switch
add print …push
for
switch
add print …push
for
switch
add print …push
for
switch
add print …push
for
switch
add print …push
for
switch
add print …push
for
switch
add print …push
for
switch
add print …push
push 2push 3addprint
Interpreter
Bytecode Program
![Page 13: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/13.jpg)
Tcl VM Dispatch 13
Context Problem
• Hardware is using PC for prediction
– Only one branch means one BTAC entry
• VM is using vpc
– branch depends on vpc, has many targets
– [Ertl03] 85% mispredicts, costs 10+ cycles
• How can we avoid misprediction?
![Page 14: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/14.jpg)
Tcl VM Dispatch 14
Subroutine Threading
• Old idea. Great for modern CPUs
• Correlates native pc with virtual pc
• 6 cycle dispatch
0 push1 02 storeScalar1 04 pop5 loadScalar1 07 incrScalar1 09 pop
call push1call storeScalar1call popcall loadScalar1call incrScalar1call pop
Native Code (“CTT”)
Bytecode
![Page 15: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/15.jpg)
Tcl VM Dispatch 15
Context Threading
• Our implementation of subroutine threading
• CGO’05
• Keep bytecode around for operands, etc.
• Optimizations exploit CTT’s flexibility
![Page 16: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/16.jpg)
Tcl VM Dispatch 16
Inlining Small Opcodes
call push push
callstoreScalar
callpopcallincrScalarcallpop pop
incrScalar
storeScalar
push
![Page 17: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/17.jpg)
Tcl VM Dispatch 17
Virtual Branches become Native
• jump becomes two native instructions
• jumpTrue uses native branch, but also calls
0jump1+72loadScalar10
4incrScalar10
6pop7loadScalar10
9push11
11lt
12jumpTrue1-1014loadScalar10
0b 164set vpc = 78callloadScalar12 callincrScalar16 callpop20 callloadScalar124 callpush128 calllt
32 calljumpTrue1
36 beq 840 callloadScalar1
![Page 18: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/18.jpg)
Tcl VM Dispatch 18
Conditional Branch Peephole Opt
• We can eliminate the call to jumpTrue
– Profile: what precedes cond. branches?
• gt, lt, tryConvertNumeric, foreachStep4
– move the branch code into CTT and gt
• Tcl 8.5 has a similar optimization, but bigger payoff for native
• Loops go faster
![Page 19: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/19.jpg)
Tcl VM Dispatch 19
Cond. Branch Peephole Opt Demo
gt:c = do_compare…o = new_bool (c)push (o)vpc++return
jump_true: 91 insnso = pop ()coerce_bool (o)if (o.bool)
vpc = targetv
elsevpc = fall_thruv
asm (“cmp o.bool, 0”)return
gt_jump:c = do_compare…
vpc++return
gt_jump:c = do_compare…asm (“cmp o.bool, 0”)vpc++return
call gtcall jumpTruebeq targetn
call gt_jump beq targetn
call gt_jumpset vpc = targetv
beq targetn
set vpc = fall_thruv
call gtcall jumpTruebeq targetn
![Page 20: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/20.jpg)
Tcl VM Dispatch 20
Catenation
• IVME ’04
• Inline everything
– Specialize operands
– Eliminate vpc
• Complicated
• 0 cycle dispatch pop
storeScalar
push
push
add
![Page 21: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/21.jpg)
Tcl VM Dispatch 21
Results
• Tclbench
– microbenchmarks, only 12 with more than 100,000 dispatches
– de-facto standard
– focus on 60 with > 10,000 dispatches
• UltraSPARC III
• Use switch interpreter as baseline
![Page 22: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/22.jpg)
Tcl VM Dispatch 22
Performance Summary
Dispatch type
Geo. mean speedup
Number of benchmarks improved
Direct Threading
4.3% 88%
Catenation 4.0 73
Context Threading
5.4 88
CT + peephole
12.0 97
![Page 23: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/23.jpg)
Tclbench Speedup versus Switch
23
![Page 24: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/24.jpg)
Performance Details
![Page 25: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/25.jpg)
Tcl VM Dispatch 25
Tcl Opcodes are Big
Java25%
Ocaml
37%
Tcl 5%
Context Threading Speedup
![Page 26: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/26.jpg)
Tcl VM Dispatch 26
Conclusions and Future Work
• Context Threading is simple & effective
– fast dispatch (not Tcl’s problem)
– facilitates optimization
– inline more opcodes, port to x86, PowerPC
• 12% speedup trivial: Tcl 10x slower than C
– micro opcodes and a real JIT
![Page 27: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/27.jpg)
Tcl VM Dispatch 27
Low Dispatch Overhead
Tcl Opcodes
DISPATCH
Ocaml Opcodes
![Page 28: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/28.jpg)
Tcl VM Dispatch 28
Branch prediction on Sparc
• Ultra 1 had “NFA” in I-cache
• UltraSPARC III
– What kind of branch target predictor?
– “prepare-to-branch” instruction?
– Consider two virtual programs, on the next slide:
![Page 29: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/29.jpg)
Tcl VM Dispatch 29
Jekyll and Hyde Programs
start: Vop1Vop2Vop1Vop3Vop1Vop4Vop1Vop5Vop1goto start
Unpredictable
start: Vop1Vop1Vop1Vop1Vop1Vop1Vop1Vop1Vop1
goto start
Predictable
![Page 30: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/30.jpg)
Tcl VM Dispatch 30
Mispred vs. predict
Dispatch Type UltraSPARC III Pentium 4
switch 17.3 19.2
indirect mispred
14.2 18.6
indirect predict 14.2 11.8
direct mispred 11.2 18.7
direct predict 11.2 11.3
subroutine 6.3 8.4
![Page 31: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/31.jpg)
Tcl VM Dispatch 31
CT, Tcl, Sparc
• Branch Delay Slot
• Big Tcl bodies nearly all contain calls
– Calls clobber link register (o7)
– We save link register in a reserved reg
call bigopmov o7, save_ret…
bigop:call runtime…mov save_ret, o7retlinc vpc, 4
![Page 32: Alternative Dispatch Techniques for the Tcl VM Benjamin Vitale Mathew Zaleski.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f3a5503460f94c57935/html5/thumbnails/32.jpg)
Tcl VM Dispatch 32
Compilation Time
• We include compile time in every iteration
• Tclbench amortizes
Dispatch Type
Native Compile time relative to ByteCode
Direct Threading
6%
Catenation 44
Context Threading
35
CT + peephole 38
• Varies significantly across benchmarks