Post on 22-May-2020
YETI: Gradually Extensible Trace
InterpreterMathew Zaleski,
University of Toronto
matz@cs.toronto.edu
(thesis proposal)
Thesis Proposal Jan 2006 2
Overview
‣Introduction• Background• Efficient Interpretation• Our Approach to Mixed-Mode Execution• Results and Discussion
Thesis Proposal Jan 2006
Why so few JIT compilers?
• Complex JIT infrastructure built in “big bang”, before any generated code can run.
• Rather than incrementally extend the interpreter, typical JITs is built alongside.• The code generator of current JIT compilers
makes little provision to reuse the interpreter.• The method-orientation of most JITs means
that cold code is compiled with hot.‣ Interpreters should be more gradually extensible
to become dynamic compilers.
3
Thesis Proposal Jan 2006
Problems with current practice
• Packaging of virtual instruction bodies is:• Inefficient: Interpreters slowed by branch
misprediction• Non-reusable: JIT compilers must implement
all virtual instructions from scratch• Method orientation of a JIT compiler forces it to
compile cold code along with hot.• Code compiled cold requires complex runtime
to perform late binding if it runs.• Recompiling cold code that becomes hot
requires complex recompilation infrastructure.
4
Thesis Proposal Jan 2006
Our Approach
• Branch prediction problems of interpretation can be addressed by calling the virtual bodies.• Can speed up interpretation significantly.• Enables generated code to call the bodies.• JIT need not support all virtual instructions.
• Complexity of compiling cold code can be side stepped by compiling dynamically selected regions that contain only hot code.• We describe how compiling traces allows us to
compile only hot code and link on newly hot regions as they emerge.‣ Enables gradual enhancement of interpreter
5
Thesis Proposal Jan 2006
Overview of Contribution
• Callable bodies make for efficient interpretation.
• Reuse of callable bodies from generated code smooths “big bang”.
• A trace oriented JIT compiler is a simple and promising architecture.
6
interpretation
efficiency challenges
callable bodies
method based JIT
big bang development
reuse callable bodies/traces
YETI
Thesis Proposal Jan 2006 7
Gradual extension of VM
Larger regions. More instructions compiled.Size and Complexity of Compiled Code Regions
Traces - sub dispatchx
Basic Blocks
x
SUBinterpreterx
Traces - compile all virtual instructions x
Virtu
al in
stru
ctio
ns
com
pile
d
Traces- just integer instructions x✔
Thesis Proposal Jan 2006
Result preview - Efficient Interpretation
• Branch misprediction dealt with by calling the bodies from region of generated code.
• Relative to Direct Threaded VM
• Geo mean• Java SpecJVM98
benchmarks• Ocaml benchmarks
8
0
0.2
0.4
0.6
0.8
1.0
java/P
4
java/P
PC
ocam
l/P4
ocam
l/PPC
0.63
0.900.820.81
Subroutine Threading
Rela
tive
to D
irect
Thr
eadi
ng
Thesis Proposal Jan 2006
Result preview - Trace based JIT
• Geom mean SPECJVM98 relative to Sun Hotspot JIT
• SABVM• Selective inlining
• Modified JamVM.• TR-LINK = traces,no JIT• JIT = trace, JIT• Only 50 integer
bytecodes• Promising start
9
0
1
2
3
4
5
6
SABV
M
TR-LIN
K JIT
4.3
5.75.9
Java/PPC970
Rela
tive
to S
un H
otsp
ot
Thesis Proposal Jan 2006
Background & Related Work
10
Ertl & Gregg Branch misprediction
Piumarta & Riccardi Selective inlining
Parrot (perl6) Callable bodies
Vitale, Abdelrahman Catenation, Tcl
Bala, Duesterwald, Banerjia Dynamo
Bruening, Garnett ,Amarasinghe Dynamo Rio
Whaley Partial methods
Gal, Probst, Franz Hotpath, Trace-based JIT
Suganuma,Yasue,Nkatani Region based compilation
Hozle, Chambers, Ungar Self
Many Java, JVM and JIT authors Java
Thesis Proposal Jan 2006 11
Overview
•Introduction‣ Background:‣ Dynamo & Traces‣ Interpretation
• Our Approach to Mixed-Mode Execution• Results and Discussion
Thesis Proposal Jan 2006 12
• Trace-oriented dynamic optimization system.• HP PA-8000 computers.
• Counter-Intuitive approach:• Don’t execute optimized binary interpret it.• Count transits of reverse branches.• Trace-generate (next slide).• Dispatch traces when encountered.
• Soon, most execution from trace cache.• faster than binary on hardware of the day!
HP Dynamo
Thesis Proposal Jan 2006
• Trace is path followed by program
• Conditional branches become trace exits.
• Do not expect trace exits to be taken.
Trace with if-then-else//c => b2if (c) b1;else b2;b3;
c
b1 b2
b3
c
b2b3
texit b1
13
Thesis Proposal Jan 2006 14
Overview
•Introduction‣ Background:
• Dynamo & Traces‣ Interpretation
• Efficient Interpretation • Our Approach• Selecting Regions• Results and Discussion
Thesis Proposal Jan 2006
int f(boolean); Code: 0: iload_1 1: ifeq 7 4: bipush 42 6: ireturn 7: iconst_0 8: ireturn
15
int f(boolean parm){ if (parm){ return 42; }else{ return 0; } }
Java Source
Java Bytecode
Javac compiler
Virtual Program
Thesis Proposal Jan 2006 16
Interpreter
LoadedProgram
Bytecodebodies
Internal Representation
fetch
dispatch LoadParms
execute
Execution Cycle
Thesis Proposal Jan 2006
vPC = internalRep;
while(1){ switch(*vPC++){
//and many more..
}};
17
Switched Interpreter
case iload_1: ..
break;
case ifeq: ..
break;
slow. Burdened by switch and loop overhead.
Thesis Proposal Jan 2006
static int *vPC = internalRep;
interp(){ while(1){ (*vPC)(); } };
18
Call Threaded Interpreter
void iload_1(){ //push load 1 vPC++; }
slow. burdened by function pointer call
void ifeq(){ //change vPC vPC++; }
Thesis Proposal Jan 2006
Direct Threaded Interpreter
‣ Good: one dispatch branch taken per body19
ifeq: if () vPC= goto *vPC++;
bipush: .. goto *vPC++;
iload_1: .. goto *vPC++;
iconst_0: .. goto *vPC++;
ireturn: .. goto *vPC++;
ireturn: .. goto *vPC++;
-Execution of virtual program “threads” through bodies
int f(boolean); Code: 0: iload_1 1: ifeq 7 4: bipush 42 6: ireturn 7: iconst_0 8: ireturn
Thesis Proposal Jan 2006
Context Problem
‣ Bad: hardware has no context to predict dispatch20
ifeq: if () vPC= goto *vPC++;
bipush: .. goto *vPC++;
iload_1: .. goto *vPC++;
iconst_0: .. goto *vPC++;
ireturn: .. goto *vPC++;
ireturn: .. goto *vPC++;
int f(boolean); Code: 0: iload_1 1: ifeq 7 4: bipush 42 6: ireturn 7: iload_1 8: ireturn
Virtual PC predicts destination.
Hardware PC insufficient context
Thesis Proposal Jan 2006 21
Overview
✓ Introduction✓ Background‣ Efficient Interpretation• Our Approach to Mixed-Mode Execution• Results and Discussion
Thesis Proposal Jan 2006
int f(boolean); Code: 0: iload_1 1: ifeq 7 4: bipush 42 6: ireturn 7: iconst_0 8: ireturn
22
int f(boolean parm){ if (parm){ return 42; }else{ return 0; } }
Java Source
Java Bytecode
Javac compiler
Virtual Program
Thesis Proposal Jan 2006
ifeq: if () vPC= goto *vPC++;
bipush: .. goto *vPC++;
23
Direct Threaded Interpreter
…iload_1ifeq 7bipush 42ireturniconst_0ireturn…
DTT - DirectThreading Table
VirtualProgram
vPC iload_1: .. goto *vPC++;
DTT maps vPC to implementation
C implementationof each body
DTT&&iload_1&&ifeq
+4&&bipush
42&&ireturn&&iconst_0&&ireturn
Thesis Proposal Jan 2006
Context Problem
‣ Bad: hardware has no context to predict dispatch24
ifeq: if () vPC= goto *vPC++;
bipush: .. goto *vPC++;
iload_1: .. goto *vPC++;
iconst_0: .. goto *vPC++;
ireturn: .. goto *vPC++;
ireturn: .. goto *vPC++;
int f(boolean); Code: 0: iload_1 1: ifeq 7 4: bipush 42 6: ireturn 7: iload_1 8: ireturn
Virtual PC predicts destination.
Hardware PC insufficient context
Thesis Proposal Jan 2006 25
Essence of Subroutine Threading
iload_1: .. asm(ret);
ifeq: vPC=.. goto *vPC;
Context Threading Table
Package bodies as subroutines and call them
CTT
call iload_1
call ifeq
call bipush
call ireturn
call iconst_0
call ireturn
DTT
4
42
ret terminated
virtual branches as directpoints to generated
code
Thesis Proposal Jan 2006 26
Context Threading (CT) -- Generating specialized code in CTT
4
DTT
Specialized bodies can also be generated in CTT!
vPCcall iload_1
subl
movl ;pop stack
cmpl ;compare 0
jne
movl ; vPC =
jmp ; else
addl ; vPC +=
call bipush
call ireturn
call iconst_0
call ireturn
inlined code for if_eq
context for conditional branches
Thesis Proposal Jan 2006
CT Performance
‣ CT is an efficient interpretation technique27Context Threading 24
!"#$%&'()*+',#*-.#(,#/)0
0.00
0.25
0.50
0.75
1.00
java
/p4
java
/ppc
ocam
l/p4
ocam
l/ppc
Subroutine Branch Inlining Tiny Inlining
!"#$%&'()*+,"+
-'#).,+/0#)%*'12
! 1%2*&#$3)'4%#*'5*#66#$&'7#*/)8*.#)#2/9*
Thesis Proposal Jan 2006 28
Overview
✓ Introduction✓ Background: Interpretation & traces✓ Efficient Interpretation‣ Our Approach to Mixed-Mode Execution
• Selecting Regions• Results and Discussion
Thesis Proposal Jan 2006
Gradually Extensible Trace Interpreter
Three main enablers:1. Bodies organized as callable routines so
executable regions can efficiently mix compiled code and dispatched bodies.
2. The DTT can point to variously shaped execution units.
3. Efficient, flexible instrumentation.
29
Thesis Proposal Jan 2006
1. Bodies are callable
‣ Needn’t build all virtual instructions all in one shot.
30
iload_1: ..ret;
istore: ..ret;
call iload_1call iload_1specialized
code
for iadd
call istore_1
Packaging bytecode bodies as lightweight subroutines means that they can be efficiently called from generated code
Thesis Proposal Jan 2006
2. DTT always points to implementation
‣ DTT indirection enables any shape of execution unit to be dispatched.
31
iload_1: ..ret;
istore: ..ret;
call iload_1call iload_1JIT compiled
codefor iadd
call istore_1
DTT
..of corresponding execution unitvPC
Thesis Proposal Jan 2006
3. Flexible, Efficient Instrumentation
‣ Our runtime active before and after each dispatch
32
A dispatcher describes an execution unit
DTT
postworker(){}
preworker(){}
dispatcherbodypayloadpre post
while(1){ //dispatch loop d = vPC->dipatcher; pay = d->payload; (*d->pre)(vPC,pay,&tcs); (*d->body)(); (*d->post)(vPC,pay,&tcs); }
body or generated code for region
payload data specific to execution unit
Thesis Proposal Jan 2006 33
Overview
✓ Introduction✓ Background: Interpretation & traces• Our Approach to Mixed-Mode Execution
• Selecting Regions‣ Basic Blocks• Traces
• Results and Discussion
Thesis Proposal Jan 2006
int f(boolean); Code: 0: iload_1 1: ifeq 7 4: bipush 42 6: ireturn 7: iconst_0 8: ireturn
34
int f(boolean parm){ if (parm){ return 42; }else{ return 0; } }
Java Source
Java Bytecode
Basic Block Detection
Thesis Proposal Jan 2006
t_dispatcherbody=&&iload_1payloadpre post
t_dispatcherbody=BB0 payload=bb0pre post
35
Basic Block Detection
0: iload_1 1: ifeq 7 4: bipush 42 6: ireturn 7: iconst_0 8: ireturn
t_thread_context historyList recordMode=0
iload_1 ifeq
DTT
t_dispatcherbody=&&ifeqpayloadpre post
1
while(1){ //dispatch loop d = vPC->dipatcher; pay = d->payload; (*d->pre)(vPC,pay,&tcs); (*d->body)(); (*d->post)(vPC,pay,&tcs); }
Thesis Proposal Jan 2006
Generated code for a basic block
• Could JIT the bb, currently we generate “subroutine threading style” code for it.
36
DTTiload_1ifeq4
iconst_0ireturn
bb1call iconst_0
call ireturn
ret
bb0call iload_1
call if_eq
ret
Basic block is a run-time superinstruction
bb0
bb1
t_dispatcherbodypayloadprepost
t_dispatcherbodypayloadprepost
Thesis Proposal Jan 2006
• all in one C function• thread private are C
local variables• loader initializes
DTT to static dispatchers
• dispatch loop calls instrumentation specific to dispatcher
C Code for interp function
37
static t_dispatcher init[256]={};interp(t_dispatcher *dtt){ t_dispatcher *vPC = dtt; t_thread_context tcs; iload: //real work of iload here.. vPC++; //to next instruction asm volatile(“ret”); iadd: //and many more bodies //dispatch loop while(1){ d = vPC->dispatcher; pay = d->payload; (*d->pre)(vPC,pay,&tcs); (*d->body)(); (*d->post)(vPC,pay,&tcs); }
Thesis Proposal Jan 2006 38
Overview
✓ Introduction✓ Background: Interpretation & traces✓ Our Approach to Mixed-Mode Execution
• Selecting Regions✓ Basic Blocks‣ Traces
• Results and Discussion
Thesis Proposal Jan 2006
Detecting Traces
• Use Dynamo’s trace detection heuristic.
• Instrument reverse branches until they are hot.• Postworker of basic block dispatcher
• Trace generate starting from hot reverse branch:• Much like bb’s were recorded• Postworker of each basic block region adds
each bb to thread private history list.• Eventually creates new trace dispatcher• Hold off generating code until after trace has
“trained” a few times..39
Thesis Proposal Jan 2006
• Trace is path followed by program
• Conditional branches become trace exits.
• Do not expect trace exits to be taken.
Trace with if-then-else//c => b2if (c) b1;else b2;b3;
c
b1 b2
b3
c
b2b3
texit b1
40
Thesis Proposal Jan 2006
Subroutine threading style code for a Trace
• Dispatch virtual instructions for trace
41
DTTiload_1ifeq4
iconst_0ireturn
trace-b0-b1
call iload_1
trace_exit_eq
call iconst_0
trace_exit_iret
..caller code..
...
Trace is super-super instruction
bb0
bb1
trace exit handlers
TEH..
ret
TEH..
ret
Thesis Proposal Jan 2006
Trace Exits
(a) Two-way: from conditional branches
• one leg on trace• other leg off trace
(b) Multiway: from invokes and returns
• one leg on trace• potentially many legs off trace
42
trace-b0-b1
call iload_1
trace_exit_eq
call iconst_0
trace_exit_iret
...
(a)
(b)
Thesis Proposal Jan 2006
Trace Exit Handlers
• Code runs when trace exit is taken before return to interpreter
• Record which trace exit has occurred in thread context
• Return to dispatch loop• Housekeeping roles:• flush state of JIT code• Trace linking
43
trace-b0-b1
call iload_1
trace_exit_eq
call iconst_0
trace_exit_iret
ret
TEHtcs=..
ret
TEHtcs=..
ret
Thesis Proposal Jan 2006
Trace linking
• When trace exit is hot and destination is a trace
• Rewrite ret at end of trace exit handler as jmp to destination trace• Only use of code
rewriting in system
44
trace-1
TEH
trace-2
retjmp
Hot trace exit
Destination Trace
Thesis Proposal Jan 2006
Trace JIT
• Generate native code for trace exits• A lot like branch inlining from CT system.
• Optimize invokevirtual when call and return occur in same trace.
• Naive register allocation scheme• Only handle 50 integer/object virtual instructions• Do virtual instructions one-by-one• Relatively easy debugging
• Floating point should be easy.
45
Thesis Proposal Jan 2006 46
Overview
✓ Introduction✓ Background: Interpretation & traces✓ Our Approach to Mixed-Mode Execution‣ Results and Discussion
• Data• Discussion• Remaining Work in this dissertation
Thesis Proposal Jan 2006
Implementation
• Modify JamVM 1.1.3 to be SUB threaded• Gradually extend it to:• Detect, execute subroutine style basic blocks• Detect, execute subroutine style traces• Link traces• Compile traces.
47
Thesis Proposal Jan 2006
Region Shape
• As execution units become larger• Trips around dispatch loop become less frequent• Next show data to justify “step back” approach.• Very simple experiment:• Modify dispatch loops to count iterations.
48
Condition DescriptionDCT Direct Call Threading
BB CT-style Basic BlocksTR Traces (no link, no JIT)
TR-LINK Linked Traces (no JIT)
Thesis Proposal Jan 2006
Region Shape Effect on Dispatch Count
49
compress
db jack
javac
jess
mpeg
mtrt ray
scitest
geomean
1e0
1e2
1e4
1e6
1e8
1e10
log
disp
atch
cou
nt
Spec JVM98 Benchmarks
LegendDCTBBTRTR-LINK
Thesis Proposal Jan 2006
How efficient is profiling system?
• Run instrumentation without the JIT.• Are the intermediate versions of Java viable?• Include SUB threading in comparison:• Since it is an efficient dispatch technique.
• Report elapsed time relative to distro of JamVM
50
Thesis Proposal Jan 2006
Profiling/Instrumentation Overhead
51
0.73
1.87
0.95
0.78
0.66
compress
0.87
1.45
1.05
0.90
0.80
db0.88
1.41
1.22
0.88
0.79
jack
0.88
1.39
1.18
0.96
0.85
javac
0.87
1.38
1.15
0.87
0.77
jess
0.72
2.00
0.99
0.78
0.72
mpeg
0.84
1.11
1.42
0.82
0.81
mtrt
0.88
1.49
1.33
0.84
0.77
ray
0.59
1.69
1.18
0.68
0.60
scitest
0.80
1.51
1.15
0.83
0.75
geomean
0.0
0.5
1.0
1.5
2.0
Elap
sed
time
rela
tive
to ja
m-d
istro
Spec JVM98 Benchmarks
LegendSUBDCTBBTRTR-LINK
Thesis Proposal Jan 2006
Performance of simple JIT
• Compare YETI performance WITHOUT JIT to selective inlining SableV
• Compare YETI with preliminary trace based JIT to Sun’s Hotspot optimizing compiler
• Not much basis for comparison to Hotpath
52
Condition DescriptionTR-LINK Linked TracesSABVM SableVM selective inlining
JIT Traces (JIT and Link)
Thesis Proposal Jan 2006
JIT Performance relative to Sun Hotspot
53
8.78
8.14
5.49
compress
2.31
1.93
1.50
db
3.06 3.23
2.59
jack
3.92
2.86
2.43
javac
5.73
5.17
4.25
jess
12.26 13.73
7.95
mpeg
11.65
11.96
11.87
mtrt
11.46
9.71
7.20
ray
4.075.21
3.45
scitest
5.94
5.69
4.31
geomean
0
2
4
6
8
10
12
14
Elap
sed
time
rela
tive
to h
otsp
ot
Spec JVM98 Benchmarks
LegendSABVMTR-LINKJIT
Thesis Proposal Jan 2006 54
Overview
✓ Introduction✓ Background: Interpretation & traces✓ Our Approach✓ Selecting Regions• Results and Discussion✓ Data‣ Discussion (and future work)• Remaining Work in this dissertation
Thesis Proposal Jan 2006
Gradual performance improvement
55
0
1
2
SABV
MSU
BDCT BB TR
TR-LIN
K JIT
0.57
0.750.83
1.15
1.51
0.800.78
Rela
tive
to J
amVM
1.3
.3 D
irect
Thr
eadi
ng
Geometric Mean SpecJVM98Elapsed Time PPC970
• Performance improves as effort invested.
• SUB very effective for lightweight bodies
• BB not viable by itself• TR-LINK about same
as CT/branch inlining.• JIT preliminary.
Thesis Proposal Jan 2006
Discussion
• We have demonstrated how to build an interpreter that is simple and yet as efficient as SableVM and JamVM.
• We have shown that our interpreter can be gradually extended to identify, link and compile traces.
• We have shown how generated code can reuse callable virtual instruction bodies.
• Our JIT, although it has no optimizer, only supports 50 Java virtual instructions, improves performance by 24%.• More instructions, better performance?
56
Thesis Proposal Jan 2006 57
2D vision of Incremental VM lifecycle..
SUBinterpreter
Basic Blocks Traces - sub dispatch
Have explored this spaceComplexity of Compiled Code Regions
Traces- just integer instructions
Traces - compile all virtual instructions x
x
Basic blocks - just integer virtual instructions
Num
ber o
f virt
ual in
stru
ctio
ns c
ompi
led
✔ ✔ ✔
✔
Thesis Proposal Jan 2006
Application
• If I had to build a new interpreter.• for “lightweight” bodies, so dispatch matters.
• Start with bodies that can be conditionally compiled to be either direct threaded or callable.
• Bring up the system using DCT because the dispatch loop makes it easier to debug.• e.g. Logging from dispatch loop is very helpful.
• Primary platforms would run SUB and secondary platforms would run direct threading.
• Gradually extend as described.
58
Thesis Proposal Jan 2006
Future Work
• Work could go in many different directions.• Apply to another language system• JavaScript? Python? Fortress?• Deal with polymorphic bytecodes
• Extend JIT/Optimizer • Explore performance potential• Need a lot more infrastructure (e.g. IR)
• Package infrastructure for others to apply.
59
Thesis Proposal Jan 2006 60
Overview
✓ Introduction✓ Background: Interpretation & traces✓ Our Approach✓ Selecting Regions• Results and Discussion✓ Data✓ Discussion‣ Remaining work in this dissertation
Thesis Proposal Jan 2006
Proposed Work for winter 2007.
• Infrastructure to measure:• Compile time;• Proportion of virtual instructions executed from
compiled code.• Add float register class.• scimark, ray, Linpack would likely benefit.
• Compile Basic Blocks• Long bb benchmarks will benefit.
• Write, write, write.
61
Thesis Proposal Jan 2006
• Back 12• Efficient 23• Our 28• BB 33• TR 38• RES 46• CONC 54
62