Catenation and Operand Specialization - Department of Computer


Transcript of Catenation and Operand Specialization - Department of Computer

Page 1:

Catenation and Operand Specialization

For Tcl VM Performance

Benjamin Vitale1, Tarek Abdelrahman

U. Toronto

1Now with OANDA Corp.

Page 2:

Preview

• Techniques for fast interpretation (of Tcl)

• Or slower, too!

• Lightweight compilation; a point between interpreters and JITs

• Unwarranted chumminess with Sparc assembly language

Page 3:

Implementation

(diagram) Tcl source → Tcl bytecode compiler → JIT Compiler → native machine code
• JIT Compiler: catenation, specialization, linking
• Runtime initialization: template extraction, patch synthesis

Page 4:

Outline

• VM dispatch overhead

• Techniques for removing overhead, and consequences

• Evaluation

• Conclusions

Page 5:

Interpreter Speed

• What makes interpreters slow?

• One problem is dispatch overhead
– Interpreter core is a dispatch loop

– Probably much smaller than run-time system

– Yet, focus of considerable interest

– Simple, elegant, fertile

Page 6:

Typical Dispatch Loop

for (;;)
    opcode = *vpc++          /* Fetch opcode */
    switch (opcode)          /* Dispatch */
    case INST_DUP
        obj = *stack_top     /* Real work */
        *++stack_top = obj
        break
    case INST_INCR
        arg = *vpc++         /* Fetch operand */
        *stack_top += arg

Page 7:

Dispatch Overhead

• Execution time of Tcl INST_PUSH

                Instructions   Cycles
Dispatch             10          19
Operand fetch         6          ~6
Real work             5          ~4

Page 8:

Goal: Reduce Dispatch

Dispatch Technique                                           SPARC Cycle Time
for/switch                                                         19
token threaded, decentralized next                                 14
direct threaded, decentralized next                                10
selective inlining, average (Piumarta & Riccardi, PLDI'98)        <<10
?                                                                   0

Page 9:

Native Code the Easy Way

• To eliminate all dispatch, we must execute native code

• But, we’re lazy hackers

• Instead of writing a real code generator, use the interpreter as a source of templates

Page 10:

Interpreter Structure

• Dispatch loop sequences bodies according to the virtual program

Virtual Program: push 0; push 1; add
Instruction Table: inst_push, inst_foo, inst_add
Interpreter Native Code: body of push + dispatch; body of foo + dispatch; body of add + dispatch

Page 11:

Catenation

Virtual Program: push 0; push 1; add

"Compiled" Native Code:
copy of code for inst_push
copy of code for inst_push
copy of code for inst_add

• No dispatch required

• Control falls through naturally

Page 12:

Opportunities

• Catenated code has a nice feature

• A normal interpreter has one generic implementation for each opcode

• Catenated code has separate copies

• This yields opportunities for further optimization. However…

Page 13:

Challenges

• Code is not meant to be moved after linking

• For example, some pc-relative instructions are hard to move, including some branches and function calls

• But first, the good news

Page 14:

Exploiting Catenated Code

• Separate bodies for each opcode yield three nice opportunities

1. Convert virtual branches to native

2. Remove virtual program counter

3. Reduce operand fetch code to runtime constants

Page 15:

Virtual Branches Become Native

bytecode:              native code:
L1: dec_var x          L2: dec_var body
    push 0                 push body
    cmp                    cmp body
    bz_virtual L1          bz_native L2
    done                   done body

• Arrange for cmp body to set the condition code

• Emit synthesized code for bz; don't memcpy

Page 16:

Eliminating Virtual PC

• vpc is used by dispatch, operand fetch, virtual branches – and exception handling

• Remove the code that maintains vpc, and free up a register

• For exceptions, we rematerialize vpc

Page 17:

Rematerializing vpc

Bytecode:            Native Code:
1: inc_var 1         code for inc_var
3: push 0              if (err) { vpc = 1; br exception }
5: inc_var 2         code for push
                     code for inc_var
                       if (err) { vpc = 5; br exception }

• Separate copies

• In the copy for vpc 1, set vpc = 1

Page 18:

Moving Immovable Code

• pc-relative instructions can break when code is moved:

original: 7000: call +2000 (9000 <printf>)
moved:    3000: call +2000 (5000 <????>)

Page 19:

Patching Relocated Code

• Patch pc-relative instructions so they work:

before: 3000: call +2000 (5000 <????>)
after:  3000: call +6000 (9000 <printf>)

Page 20:

Patches

• Objects describing a change to code

• Input: type, position, and size of operand in bytecode instruction

• Output: type and offset of instruction in template

• Only 4 output types on Sparc!

Input Types: ARG, LITERAL, BUILTIN_FUNC, JUMP, PC, NONE
Output Types: SIMM13, SETHI/OR, JUMP, CALL

Page 21:

Interpreter Operand Fetch

Bytecode Instruction: push 1 (bytes 00 01)

Literal Table: 0 0xf81d4   1 0xfa00   8 "foo"

Operand Fetch:
add l5, 1, l5       increment vpc to operand
ldub [l5], o0       load operand from bytecode stream
ld [fp+48], o2      get bytecode object addr from C stack
ld [o2+4c], o1      get literal tbl addr from bytecode obj
sll o0, 2, o0       compute offset into literal table
ld [o1+ o0], o1     load from literal table

Page 22:

Operand Specialization

Generic operand fetch (before):
add l5, 1, l5
ldub [l5], o0
ld [fp+48], o2
ld [o2+4c], o1
sll o0, 2, o0
ld [o1+ o0], o1

Specialized (after):
sethi o1, hi(obj_addr)
or o1, lo(obj_addr)

• Array load becomes a constant

• Patch input: one-byte integer literal at offset 1

• Patch output: sethi/or at offset 0

Page 23:

Net Improvement

Final Template for push:
sethi o1, hi(obj_addr)
or o1, lo(obj_addr)
st o1, [l6]
ld [o1], o0
inc o0
st o0, [o1]

• Interpreter: 11 instructions + 8 dispatch

• Catenated: 6 instructions + 0 dispatch

• push is shorter than most, but very common

Page 24:

Evaluation

• Performance

• Ideas

Page 25:

Compilation Time

• Templates are fixed size, fast

• Two catenation passes:
– compute total length
– memcpy, apply patches (very fast)

• adds 30 - 100% to bytecode compile time

Page 26:

Execution Time

Page 27:

Varying I-Cache Size

• Four hypothetical I-cache sizes

• Simics Full Machine Simulator

• 520 Tcl Benchmarks

• Run both interpreted and catenating VM

Page 28:

Varying I-Cache Size

(chart) Percentage of benchmarks improved, by I-cache size (KB):
32: 45%   128: 66%   1024: 82%   infinite: 54%

Page 29:

Branch Prediction - Catenation

• No dispatch branches

• Virtual branches become native

• Similar CFG for native and virtual program

• BTB knows what to do

• Prediction rate similar to statically compiled code: excellent for many programs

Page 30:

Implementation Retrospective

• Getting templates from interpreter is fun

• Too brittle for portability, research

• Need explicit control over code gen

• Write own code generator, or

• Make compiler scriptable?

Page 31:

Related Work

• Ertl & Gregg, PLDI 2003 – Efficient Interpreters (Forth, OCaml)

– Smaller bytecodes, more dispatch overhead

– Code growth, but little I-cache overflow

• DyC: m88ksim

• Qemu x86 simulator (F. Bellard)

• Many others; see paper

Page 32:

Conclusions

• Many ways to speed up interpreters

• Catenation is a good idea but, like all inlining, needs selective application

• Not very applicable to Tcl’s large bytecodes

• Ever-changing micro-architectural issues

Page 33:

Future Work

• Investigate correlation between opcode body size and I-cache misses

• Selective outlining, other adaptation

• Port: another architecture; an efficient VM

• Study benefit of each optimization separately

• Type inference

Page 34:

The End

Page 35:

JIT Emitting

• Interpret patches

• A few loads, shifts, adds, and stores

(diagram) Bytecode Instruction + Template w/ Patches → Emitter → Specialized Native Code

Page 36:

Token Threading in GNU C

#define NEXT goto * (instr_table [*vpc++])

enum {INST_ADD, INST_PUSH, …};
char prog [] = {INST_PUSH, 2, INST_PUSH, 3, INST_MUL, …};
void *instr_table [] = { &&INST_ADD, &&INST_PUSH, … };

INST_PUSH:
    /* … implementation of PUSH … */
    NEXT;

INST_ADD:
    /* … implementation of ADD … */
    NEXT;

Page 37:

Virtual Machines are Everywhere

• Perl, Tcl, Java, Smalltalk. grep?

• Why so popular?

• Software layering strategy

• Portability, deploy-ability, manageability

• Very late binding

• Security (e.g., sandbox)

Page 38:

Software layering strategy

• Software getting more complex

• Use expressive higher level languages

• Raise level of abstraction

(diagram) Two stacks: App / Runtime Sys / OS / Machine, versus App / App layer 2 / App layer 1 / App runtime (VM) / Runtime Sys / OS / Machine

Page 39:

Problem: Performance

• Interpreters are slow: 10 to 1000 times slower than native code

• One possible solution: JITs

Page 40:

Just-In-Time Compilation

• Compile to native inside VM, at runtime

• But, JITs are complex and non-portable; a JIT would be the most complex and least portable part of, e.g., Tcl

• Many JIT VMs interpret sometimes

Page 41:

Reducing Dispatch Count

• In addition to reducing the cost of each dispatch, we can reduce the number of dispatches

• Superinstructions: static, or dynamic, e.g.:

• Selective Inlining (Piumarta & Riccardi, PLDI'98)

Page 42:

Switch Dispatch Assembly

for_loop:
    ldub [i0], o0            fetch opcode
switch:
    cmp o0, 19               bounds check
    bgu for_loop
    add i0, 1, i0            increment vpc
    sethi hi(inst_tab), r0   lookup addr
    or r0, lo(inst_tab), r0
    sll o0, 2, o0
    ld [r0 + o0], o2
    jmp o2                   dispatch
    nop

Page 43:

Push Opcode Implementation

add l6, 4, l6      ; increment VM stack pointer
add l5, 1, l5      ; increment vpc past opcode. Now at operand
ldub [l5], o0      ; load operand from bytecode stream
ld [fp + 48], o2   ; get bytecode object addr from C stack
ld [o2 + 4c], o1   ; get literal tbl addr from bytecode obj
sll o0, 2, o0      ; compute array offset into literal table
ld [o1 + o0], o1   ; load from literal table
st o1, [l6]        ; store to top of VM stack
ld [o1], o0        ; next 3 instructions increment ref count
inc o0
st o0, [o1]

• 11 instructions

Page 44:

Indirect (Token) Threading

Bytecode "asm":     Bytecode:
push 11             0 11
push 7              0 7
mul                 4
print               5
done                6

Instruction Table:
0 push   0x80001000
1 pop    0x80001020
2 add    0x80001034
3 sub    0x80001058
4 mul    0x80001072
5 print  0x8000111c
6 done   0x80001024

Interpreter Code:
inst_push: … next
inst_pop:  … next
inst_add:  … next
…

Page 45:

Token Threading Example

#define TclGetUInt1AtPtr(p) ((unsigned int) *(p))
#define Tcl_IncrRefCount(objPtr) ++(objPtr)->refCount
#define NEXT goto *jumpTable [*pc];

case INST_PUSH:
    Tcl_Obj *objectPtr;

    objectPtr = codePtr->objArrayPtr [TclGetUInt1AtPtr (pc + 1)];
    *++tosPtr = objectPtr;    /* top of stack */
    Tcl_IncrRefCount (objectPtr);
    pc += 2;
    NEXT;

Page 46:

Token Threaded Dispatch

• 8 instructions

• 14 cycles

sethi hi(800), o0
or o0, 2f0, o0
ld [l7 + o0], o1
ldub [l5], o0
sll o0, 2, o0
ld [o1 + o0], o0
jmp o0
nop

Page 47:

Direct Threading

• Represent program instructions with the address of their interpreter implementations

Bytecode:           Interpreter Code:
0x80001000 11       inst_push:  … next
0x80001000 7        inst_mul:   … next
0x80001072          inst_print: … next
0x8000111c          inst_add:   … next
0x80001024          inst_done:  … next

Page 48:

Direct Threaded Dispatch

• 4 instructions

• 10 cycles

ld r1 = [vpc]

add vpc = vpc + 4

jmp *r1

nop

Page 49:

Direct Threading in GNU C

#define NEXT goto * (*vpc++)

int prog [] = {&&INST_PUSH, 2, &&INST_PUSH, 3, &&INST_MUL, …};

INST_PUSH:
    /* … implementation of PUSH … */
    NEXT;

INST_ADD:
    /* … implementation of ADD … */
    NEXT;

Page 50:

Superinstructions

iload 3
iload 1
iload 2
imul
iadd
istore 3
iinc 2 1
iload 1
bipush 10
if_icmplt 7
iload 2
bipush 20
if_icmplt 12
iinc 1 1

• Note repeated opcode sequence (iload; bipush; if_icmplt)

• Create new synthetic opcode iload_bipush_if_icmplt

• takes 3 parms

Page 51:

Copying Native Code

inst_push:  ld; st; next
inst_mul:   ld; mul; st; next
inst_print: …; next

Superinstruction formed by copying bodies:
inst_push_mul: ld; st; ld; mul; st; next

Page 52:

inst_add Assembly

inst_table:
    .word inst_add        switch table
    .word inst_push
    .word inst_print
    .word inst_done

inst_add:
    ld [i1], o1           arg = *stack_top--
    add i1, -4, i1
    ld [i1], o0           *stack_top += arg
    add o0, o1, o0
    st o0, [i1]
    b for_loop            dispatch

Page 53:

Copying Native Code

uint push_len = &&inst_push_end - &&inst_push_start;

uint mul_len = &&inst_mul_end - &&inst_mul_start;

void *codebuf = malloc (push_len + mul_len + 4);

mmap (codebuf, MAP_EXEC);

memcpy(codebuf, &&inst_push_start, push_len);

memcpy(codebuf + push_len, &&inst_mul_start, mul_len);

/* … memcpy (dispatch code) */

Page 54:

Limitations of Selective Inlining

• Code is not meant to be memcpy’d

• Can’t move function calls, some branches

• Can’t jump into middle of superinstruction

• Can’t jump out of middle (actually you can)

• Thus, only usable at virtual basic block boundaries

• Some dispatch remains

Page 55:

Catenation

• Essentially a template compiler

• Extract templates from interpreter

Page 56:

Catenation - Branches

L1: inc_var 1

push 2

cmp

beq L1

L1: code for inc_var

code for push

code for cmp

code for beq-test

beq L1

bytecode native code

• Virtual branches become native branches

• Emit synthesized code; don't memcpy

Page 57:

Operand Fetch

• In interpreter, one generic copy of push for all virtual instructions, with any operands

• Java, Smalltalk, etc. have push_1, push_2

• But, only 256 bytecodes

• Catenated code has a separate copy of push for each instruction

sample code:
push 1
push 2
inc_var 1

Page 58:

Threaded Code, Decentralized Dispatch

• Eliminate bounds check by avoiding switch

• Make dispatch explicit

• Eliminate extra branch by not using for

• James Bell, 1973 CACM

• Charles Moore, 1970, Forth

• Give each instruction its own copy of dispatch

Page 59:

Why Real Work Cycles Decrease

• We do not separately show improvements from branch conversion, vpc elimination, and operand specialization

Page 60:

Why I-cache Improves

• Useful bodies packed tightly in instruction memory (in interpreter, unused bodies pollute I-cache)

Page 61:

Operand Specialization

• push not typical; most instructions much longer (for Tcl)

• But, push is very common

Page 62:

Micro Architectural Issues

• Operand fetch includes 1 - 3 loads

• Dispatch includes 1 load, 1 indirect jump

• Branch prediction

Page 63:

Branch Prediction

(figure) Control Flow Graph, Switch Dispatch: the for/switch fans out to every body (push, pop, add, sub, beq, …), and the BTB holds only one "Last Op" entry for the single indirect jump

• 85 - 100% mispredictions [Ertl 2003]

Page 64:

Better Branch Prediction

(figure) CFG, Threaded Code: each body (push, pop, add, sub, beq: B1 - B5) ends in its own indirect jump, so the BTB keeps a separate "Last Successor" entry per body

• Approx. 60% mispredict

Page 65:

How Catenation Works

• Scan templates for patterns at VM startup:
– Operand specialization points

– vpc rematerialization points

– pc-relative instruction fixups

• Cache results in a “compiled” form

• Adds 4 ms to startup time

Page 66:

From Interpreter to Templates

• Programming effort:
– Decompose interpreter into one instruction case per file
– Replace operand fetch code with magic numbers

Page 67:

From Interpreter to Templates 2

• Software build time (make):
– compile C to assembly (PIC)
– selectively de-optimize assembly
– conventional link

Page 68:

Magic Numbers

#ifdef INTERPRET

#define MAGIC_OP1_U1_LITERAL codePtr->objArray [GetUInt1AtPtr (pc + 1)]
#define PC_OP(x) pc ## x
#define NEXT_INSTR break

#elif defined (COMPILE)

#define MAGIC_OP1_U1_LITERAL (Tcl_Obj *) 0x7bc5c5c1
#define NEXT_INSTR goto *jump_range_table [*pc].start
#define PC_OP(x) /* unnecessary */

#endif

case INST_PUSH1:
    Tcl_Obj *objectPtr;

    objectPtr = MAGIC_OP1_U1_LITERAL;
    *++tosPtr = objectPtr;    /* top of stack */
    Tcl_IncrRefCount (objectPtr);
    PC_OP (+= 2);
    NEXT_INSTR;    /* dispatch */

Page 69:

Dynamic Execution Frequency

(chart) Dynamic execution frequency per opcode (0% - 15%), most frequent first: loadScalar1, push1, pop, loadScalar4, jumpFalse1, jump1, push4, storeScalar1, foreach_step4, dup, add, invokeStk1, jumpFalse4, jump4, storeScalar4, beginCatch4, done, incrScalar1Imm, bitor, bitand, bitxor, sub, invokeStk4, loadArrayStk, rshift, eq, lt, tryCvtToNumeric, loadScalarStk, listindex, lshift, concat1, incrScalar1, lappendScalar1, appendScalar1, loadArray1, lappendScalar4, mult, incrScalarStkImm, callBuiltinFunc1

Page 70:

Instruction Body Length

(chart) Body length in native instructions (0 - 400) per opcode, ordered from dynamically most frequent to least: loadScalar1, push1, pop, loadScalar4, jumpFalse1, jump1, push4, storeScalar1, foreach_step4, dup, add, invokeStk1, jumpFalse4, jump4, storeScalar4, beginCatch4, done, incrScalar1Imm, bitor, bitand, bitxor, sub, invokeStk4, loadArrayStk, rshift, eq, lt, tryCvtToNumeric, loadScalarStk, listindex, lshift, concat1, incrScalar1, lappendScalar1, appendScalar1, loadArray1, lappendScalar4, mult, incrScalarStkImm, callBuiltinFunc1, callFunc1, not, gt, appendScalar4, listlength, loadArray4, endCatch, div, bitnot, storeArray1