Catenation and Operand Specialization
For Tcl VM Performance
Benjamin Vitale*, Tarek Abdelrahman
U. Toronto
*Now with OANDA Corp.
2
Preview
• Techniques for fast interpretation (of Tcl)
• Or slower, too!
• Lightweight compilation; a point between interpreters and JITs
• Unwarranted chumminess with Sparc assembly language
3
Implementation

[Figure: Tcl source → Tcl bytecode compiler → JIT compiler (catenation, specialization, linking) → native machine code; runtime initialization performs template extraction and patch synthesis]
4
Outline
• VM dispatch overhead
• Techniques for removing overhead, and consequences
• Evaluation
• Conclusions
5
Interpreter Speed
• What makes interpreters slow?
• One problem is dispatch overhead
– Interpreter core is a dispatch loop
– Probably much smaller than run-time system
– Yet, focus of considerable interest
– Simple, elegant, fertile
6
Typical Dispatch Loop

for (;;) {
    opcode = *vpc++;             /* fetch opcode */
    switch (opcode) {            /* dispatch */
    case INST_DUP:               /* real work: */
        obj = *stack_top;
        *++stack_top = obj;
        break;
    case INST_INCR:
        arg = *vpc++;            /* fetch operand */
        *stack_top += arg;
    ...
7
Dispatch Overhead

• Execution time of Tcl INST_PUSH:

                   Instructions   Cycles
  Dispatch              10          19
  Operand fetch          6          ~6
  Real work              5          ~4
8
Goal: Reduce Dispatch

  Dispatch Technique                                   SPARC Cycle Time
  for/switch                                                 19
  token threaded, decentralized next                         14
  direct threaded, decentralized next                        10
  selective inlining (Piumarta & Riccardi, PLDI '98)        <<10 (average)
  ?                                                           0
9
Native Code the Easy Way
• To eliminate all dispatch, we must execute native code
• But, we’re lazy hackers
• Instead of writing a real code generator, use the interpreter as a source of templates
10
Interpreter Structure

[Figure: a virtual program (push 0; push 1; add) indexes an instruction table (inst_push, inst_foo, inst_add); each entry points at native code in the interpreter: body of push + dispatch, body of foo + dispatch, body of add + dispatch]

• Dispatch loop sequences bodies according to the virtual program
11
Catenation
Virtual program:        "Compiled" native code:
  push 0                  copy of code for inst_push
  push 1                  copy of code for inst_push
  add                     copy of code for inst_add

• No dispatch required
• Control falls through naturally
12
Opportunities
• Catenated code has a nice feature
• A normal interpreter has one generic implementation for each opcode
• Catenated code has separate copies
• This yields opportunities for further optimization. However…
13
Challenges
• Code is not meant to be moved after linking
• For example, some pc-relative instructions are hard to move, including some branches and function calls
• But first, the good news
14
Exploiting Catenated Code
• Separate bodies for each opcode yield three nice opportunities
1. Convert virtual branches to native
2. Remove virtual program counter
3. Reduce operand fetch code to runtime constants
15
Virtual branches become Native
bytecode:              native code:
  L1: dec_var x          L2: dec_var body
      push 0                 push body
      cmp                    cmp body
      bz-virtual L1          bz-native L2
      done                   done body

• Arrange for the cmp body to set the condition code
• Emit synthesized code for bz; don't memcpy it
16
Eliminating Virtual PC
• vpc is used by dispatch, operand fetch, virtual branches – and exception handling
• Remove code to maintain vpc, and free up register
• For exceptions, we rematerialize vpc
17
Rematerializing vpc
Bytecode:           Native code:
  1: inc_var 1        code for inc_var
  3: push 0           if (err) { vpc = 1; br exception }
  5: inc_var 2        code for push
                      code for inc_var
                      if (err) { vpc = 5; br exception }

• Separate copies
• In the copy for vpc 1, set vpc = 1
18
Moving Immovable Code
• pc-relative instructions can break when code moves:

  original:  7000: call +2000 (9000 <printf>)
  moved:     3000: call +2000 (5000 <????>)
19
Patching Relocated Code
• Patch pc-relative instructions so they work after the move:

  before:  3000: call +2000 (5000 <????>)
  after:   3000: call +6000 (9000 <printf>)
20
Patches

  Input types:  ARG, LITERAL, BUILTIN_FUNC, JUMP, PC, NONE
  Output types: SIMM13, SETHI/OR, JUMP, CALL

• Objects describing a change to code
• Input: type, position, and size of operand in bytecode instruction
• Output: type and offset of instruction in template
• Only 4 output types on Sparc!
21
Interpreter Operand Fetch
Operand fetch:
  add  l5, 1, l5        ! increment vpc to operand
  ldub [l5], o0         ! load operand from bytecode stream
  ld   [fp+48], o2      ! get bytecode object addr from C stack
  ld   [o2+4c], o1      ! get literal tbl addr from bytecode obj
  sll  o0, 2, o0        ! compute offset into literal table
  ld   [o1+o0], o1      ! load from literal table

Bytecode instruction:  push 1  (bytes: 00 01)
Literal table:  0: 0xf81d4   1: 0xfa008 → "foo"
22
Operand Specialization
  sethi hi(obj_addr), o1
  or    o1, lo(obj_addr), o1

• Array load becomes a constant
• Patch input: one-byte integer literal at offset 1
• Patch output: sethi/or at offset 0

Original operand-fetch code being replaced:
  add  l5, 1, l5
  ldub [l5], o0
  ld   [fp+48], o2
  ld   [o2+4c], o1
  sll  o0, 2, o0
  ld   [o1+o0], o1
23
Net Improvement
Final template for push:
  sethi hi(obj_addr), o1
  or    o1, lo(obj_addr), o1
  st    o1, [l6]
  ld    [o1], o0
  inc   o0
  st    o0, [o1]

• Interpreter: 11 instructions + 8 dispatch
• Catenated: 6 instructions + 0 dispatch
• push is shorter than most, but very common
24
Evaluation
• Performance
• Ideas
25
Compilation Time
• Templates are fixed size, fast
• Two catenation passes:
– compute total length
– memcpy, apply patches (very fast)
• Adds 30–100% to bytecode compile time
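The two passes can be sketched over plain byte buffers (no executable memory, patch application elided; template contents are invented for the example):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A template: a run of bytes standing in for a compiled opcode body. */
struct tmpl { const unsigned char *code; size_t len; };

/* Pass 1 computes the total length; pass 2 memcpys bodies end to end
 * (a real emitter would apply patches right after each memcpy). */
static unsigned char *catenate(const struct tmpl *t, size_t n, size_t *out_len)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += t[i].len;                      /* pass 1 */

    unsigned char *buf = malloc(total);
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {            /* pass 2 */
        memcpy(buf + off, t[i].code, t[i].len);
        off += t[i].len;
    }
    *out_len = total;
    return buf;
}
```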
26
Execution Time
27
Varying I-Cache Size
• Four hypothetical I-cache sizes
• Simics Full Machine Simulator
• 520 Tcl Benchmarks
• Run both interpreted and catenating VM
28
Varying I-Cache Size

[Chart: number of benchmarks improved (y-axis 0–100%) at four I-cache sizes — 32 KB: 45, 128 KB: 66, 1024 KB: 82, infinite: 54]
29
Branch Prediction - Catenation
• No dispatch branches
• Virtual branches become native
• Similar CFG for native and virtual program
• BTB knows what to do
• Prediction rate similar to statically compiled code: excellent for many programs
30
Implementation Retrospective
• Getting templates from interpreter is fun
• Too brittle for portability, research
• Need explicit control over code gen
• Write own code generator, or
• Make compiler scriptable?
31
Related Work
• Ertl & Gregg, PLDI 2003 – Efficient Interpreters (Forth, OCaml)
– Smaller bytecodes, more dispatch overhead
– Code growth, but little I-cache overflow
• DyC: m88ksim
• Qemu x86 simulator (F. Bellard)
• Many others; see paper
32
Conclusions
• Many ways to speed up interpreters
• Catenation is a good idea, but like all inlining needs selective application
• Not very applicable to Tcl’s large bytecodes
• Ever-changing micro-architectural issues
33
Future Work
• Investigate correlation between opcode body size, I-cache misses
• Selective outlining, other adaptation
• Port: another architecture; an efficient VM
• Study benefit of each optimization separately
• Type inference
34
The End
35
JIT Emitting

• Interpret patches
• A few loads, shifts, adds, and stores

[Figure: bytecode instruction + template with patches → emitter → specialized native code]
36
Token Threading in GNU C

#define NEXT goto *(instr_table [*vpc++])

enum {INST_ADD, INST_PUSH, …};
char prog [] = {INST_PUSH, 2, INST_PUSH, 3, INST_MUL, …};
void *instr_table [] = {&&INST_ADD, &&INST_PUSH, …};

INST_PUSH:
    /* … implementation of PUSH … */
    NEXT;

INST_ADD:
    /* … implementation of ADD … */
    NEXT;
37
Virtual Machines are Everywhere
• Perl, Tcl, Java, Smalltalk… grep?
• Why so popular?
• Software layering strategy
• Portability, deployability, manageability
• Very late binding
• Security (e.g., sandbox)
38
Software layering strategy

• Software getting more complex
• Use expressive higher-level languages
• Raise level of abstraction

[Figure: two stacks — left: App / Runtime Sys / OS / Machine; right: App layer 2 / App layer 1 / App runtime (VM) / Runtime Sys / OS / Machine]
39
Problem: Performance
• Interpreters are slow: 10–1000 times slower than native code
• One possible solution: JITs
40
Just-In-Time Compilation
• Compile to native inside VM, at runtime
• But, JITs are complex and non-portable – would be the most complex and least portable part of, e.g., Tcl
• Many JIT VMs interpret sometimes
41
Reducing Dispatch Count
• In addition to reducing the cost of each dispatch, we can reduce the number of dispatches
• Superinstructions: static, or dynamic, e.g.:
• Selective inlining (Piumarta & Riccardi, PLDI '98)
42
Switch Dispatch Assembly

for_loop:
    ldub  [i0], o0              ! fetch opcode
switch:
    cmp   o0, 19                ! bounds check
    bgu   for_loop
    add   i0, 1, i0             ! increment vpc
    sethi hi(inst_tab), r0      ! lookup addr
    or    r0, lo(inst_tab), r0
    sll   o0, 2, o0
    ld    [r0 + o0], o2
    jmp   o2                    ! dispatch
    nop
43
Push Opcode Implementation
add  l6, 4, l6      ; increment VM stack pointer
add  l5, 1, l5      ; increment vpc past opcode. Now at operand
ldub [l5], o0       ; load operand from bytecode stream
ld   [fp + 48], o2  ; get bytecode object addr from C stack
ld   [o2 + 4c], o1  ; get literal tbl addr from bytecode obj
sll  o0, 2, o0      ; compute array offset into literal table
ld   [o1 + o0], o1  ; load from literal table
st   o1, [l6]       ; store to top of VM stack
ld   [o1], o0       ; next 3 instructions increment ref count
inc  o0
st   o0, [o1]
• 11 instructions
44
Indirect (Token) Threading

Bytecode:  0 11  0 7  4  5  6   (push 11; push 7; mul; print; done)

Instruction table:
  0 push   0x80001000
  1 pop    0x80001020
  2 add    0x80001034
  3 sub    0x80001058
  4 mul    0x80001072
  5 print  0x8000111c
  6 done   0x80001024

Interpreter code ("asm"):
  inst_push: … next
  inst_pop:  … next
  inst_add:  … next
  …
45
Token Threading Example
#define TclGetUInt1AtPtr(p)       ((unsigned int) *(p))
#define Tcl_IncrRefCount(objPtr)  ++(objPtr)->refCount
#define NEXT goto *jumpTable [*pc];

case INST_PUSH:
    Tcl_Obj *objectPtr;

    objectPtr = codePtr->objArrayPtr [TclGetUInt1AtPtr (pc + 1)];
    *++tosPtr = objectPtr;        /* top of stack */
    Tcl_IncrRefCount (objectPtr);
    pc += 2;
    NEXT;
46
Token Threaded Dispatch
• 8 instructions
• 14 cycles

  sethi hi(800), o0
  or    o0, 2f0, o0
  ld    [l7 + o0], o1
  ldub  [l5], o0
  sll   o0, 2, o0
  ld    [o1 + o0], o0
  jmp   o0
  nop
47
Direct Threading

• Represent program instructions with the address of their interpreter implementations

Bytecode:  0x80001000 11   0x80001000 7   0x80001072   0x8000111c   0x80001024

Interpreter code:
  inst_push:  … next
  inst_mul:   … next
  inst_print: … next
  inst_add:   … next
  inst_done:  … next
48
Direct Threaded Dispatch
• 4 instructions
• 10 cycles
ld r1 = [vpc]
add vpc = vpc + 4
jmp *r1
nop
49
Direct Threading in GNU C
#define NEXT goto * (*vpc++)

int prog [] = {&&INST_PUSH, 2, &&INST_PUSH, 3, &&INST_MUL, …};

INST_PUSH:
    /* … implementation of PUSH … */
    NEXT;

INST_ADD:
    /* … implementation of ADD … */
    NEXT;
50
Superinstructions

  iload 3
  iload 1
  iload 2
  imul
  iadd
  istore 3
  iinc 2 1
  iload 1
  bipush 10
  if_icmplt 7
  iload 2
  bipush 20
  if_icmplt 12
  iinc 1 1

• Note the repeated opcode sequence (iload; bipush; if_icmplt)
• Create a new synthetic opcode iload_bipush_if_icmplt
• It takes 3 parameters
51
Copying Native Code

  inst_push:  ld; st; next
  inst_mul:   ld; mul; st; next
  inst_print: …; next
  …

Superinstruction built by copying the bodies:
  inst_push_mul: ld; st; ld; mul; st; next
52
inst_add Assembly
inst_table:
    .word inst_add      ! switch table
    .word inst_push
    .word inst_print
    .word inst_done

inst_add:
    ld   [i1], o1       ! arg = *stack_top--
    add  i1, -4, i1
    ld   [i1], o0       ! *stack_top += arg
    add  o0, o1, o0
    st   o0, [i1]
    b    for_loop       ! dispatch
53
Copying Native Code
uint push_len = &&inst_push_end - &&inst_push_start;
uint mul_len = &&inst_mul_end - &&inst_mul_start;
void *codebuf = malloc (push_len + mul_len + 4);
mmap (codebuf, MAP_EXEC);
memcpy(codebuf, &&inst_push_start, push_len);
memcpy(codebuf + push_len, &&inst_mul_start, mul_len);
/* … memcpy (dispatch code) */
54
Limitations of Selective Inlining
• Code is not meant to be memcpy’d
• Can’t move function calls, some branches
• Can’t jump into middle of superinstruction
• Can’t jump out of middle (actually you can)
• Thus, only usable at virtual basic block boundaries
• Some dispatch remains
55
Catenation
• Essentially a template compiler
• Extract templates from interpreter
56
Catenation - Branches
bytecode:              native code:
  L1: inc_var 1          L1: code for inc_var
      push 2                 code for push
      cmp                    code for cmp
      beq L1                 code for beq-test
      …                      beq L1
                             …

• Virtual branches become native branches
• Emit synthesized code; don't memcpy
57
Operand Fetch
• In the interpreter, one generic copy of push serves all virtual instructions, with any operands
• Java, Smalltalk, etc. have push_1, push_2
• But, only 256 bytecodes
• Catenated code has a separate copy of push for each instruction

sample code:
  push 1
  push 2
  inc_var 1
58
Threaded Code,Decentralized Dispatch
• Eliminate bounds check by avoiding switch
• Make dispatch explicit
• Eliminate extra branch by not using for
• James Bell, 1973 CACM
• Charles Moore, 1970, Forth
• Give each instruction its own copy of dispatch
59
Why Real Work Cycles Decrease
• We do not separately show improvements from branch conversion, vpc elimination, and operand specialization
60
Why I-cache Improves
• Useful bodies packed tightly in instruction memory (in interpreter, unused bodies pollute I-cache)
61
Operand Specialization
• push not typical; most instructions much longer (for Tcl)
• But, push is very common
62
Micro Architectural Issues
• Operand fetch includes 1 - 3 loads
• Dispatch includes 1 load, 1 indirect jump
• Branch prediction
63
Branch Prediction

[Figure: control-flow graph for switch dispatch — the for/switch node fans out to every opcode body (push, pop, add, sub, beq, …) and each body branches back; the BTB holds only the last target of the single indirect jump]

• 85–100% mispredictions [Ertl 2003]
64
Better Branch Prediction

[Figure: control-flow graph for threaded code — each opcode body (push, pop, add, sub, beq) ends in its own indirect branch (B1–B5), so the BTB keeps a separate last-successor entry per branch]

• Approx. 60% mispredict
65
How Catenation Works
• Scan templates for patterns at VM startup:
– operand specialization points
– vpc rematerialization points
– pc-relative instruction fixups
• Cache results in a “compiled” form
• Adds 4 ms to startup time
66
From Interpreter to Templates
• Programming effort:
– decompose interpreter into 1 instruction case per file
– replace operand fetch code with magic numbers
67
From Interpreter to Templates 2
• Software build time (make):
– compile C to assembly (PIC)
– selectively de-optimize assembly
– conventional link
68
Magic Numbers

#ifdef INTERPRET

#define MAGIC_OP1_U1_LITERAL codePtr->objArray [GetUInt1AtPtr (pc + 1)]
#define PC_OP(x) pc ## x
#define NEXT_INSTR break

#elif COMPILE

#define MAGIC_OP1_U1_LITERAL (Tcl_Obj *) 0x7bc5c5c1
#define NEXT_INSTR goto *jump_range_table [*pc].start
#define PC_OP(x) /* unnecessary */

#endif

case INST_PUSH1:
    Tcl_Obj *objectPtr;

    objectPtr = MAGIC_OP1_U1_LITERAL;
    *++tosPtr = objectPtr;        /* top of stack */
    Tcl_IncrRefCount (objectPtr);
    PC_OP (+= 2);
    NEXT_INSTR;                   /* dispatch */
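Template extraction can then locate the magic constant by scanning the compiled template's bytes, and operand specialization overwrites it with the real address. A data-only sketch (assumes the constant is stored in native byte order; offsets and addresses are made up):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAGIC 0x7bc5c5c1u

/* Scan a template's bytes for the 32-bit magic constant; return its
 * offset, or -1 if the template contains no operand to specialize. */
static long find_magic(const unsigned char *tmpl, size_t len)
{
    for (size_t off = 0; off + 4 <= len; off++) {
        uint32_t word;
        memcpy(&word, tmpl + off, 4);      /* unaligned-safe load */
        if (word == MAGIC)
            return (long) off;
    }
    return -1;
}

/* Operand specialization: overwrite the magic with the real address. */
static void apply_patch(unsigned char *tmpl, long off, uint32_t real)
{
    memcpy(tmpl + off, &real, 4);
}
```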
69
Dynamic Execution Frequency

[Chart: dynamic execution frequency (0–15%) of each Tcl opcode, most frequent first — loadScalar1, push1, pop, loadScalar4, jumpFalse1, jump1, push4, storeScalar1, foreach_step4, dup, add, invokeStk1, jumpFalse4, jump4, storeScalar4, beginCatch4, done, incrScalar1Imm, bitor, bitand, bitxor, sub, invokeStk4, loadArrayStk, rshift, eq, lt, tryCvtToNumeric, loadScalarStk, listindex, lshift, concat1, incrScalar1, lappendScalar1, appendScalar1, loadArray1, lappendScalar4, mult, incrScalarStkImm, callBuiltinFunc1]
70
[Chart: instruction body length (native instructions, 0–400) per opcode, with dynamically more frequent opcodes to the left — loadScalar1, push1, pop, loadScalar4, jumpFalse1, jump1, push4, storeScalar1, foreach_step4, dup, add, invokeStk1, jumpFalse4, jump4, storeScalar4, beginCatch4, done, incrScalar1Imm, bitor, bitand, bitxor, sub, invokeStk4, loadArrayStk, rshift, eq, lt, tryCvtToNumeric, loadScalarStk, listindex, lshift, concat1, incrScalar1, lappendScalar1, appendScalar1, loadArray1, lappendScalar4, mult, incrScalarStkImm, callBuiltinFunc1, callFunc1, not, gt, appendScalar4, listlength, loadArray4, endCatch, div, bitnot, storeArray1]