Dalvik jit

Android JIT

Introduction

Just-In-Time (JIT)/Dynamic Compilation

JIT Design

Dalvik JIT

JIT Compiler

Intermediate Representation

Optimization Techniques

Data- Control- Flow Analysis

Introduction:

The Java language is made to be interpreted to achieve the critical goal of application portability.

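[Figure: javac compiles the Java source file HW.java (public class HW { ... void hello() { ... } }) into the class file HW.class (bytecode starting ca fe ba be ...), which the java launcher runs on the Java Virtual Machine together with other classes.]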

Just as microprocessors have instruction sets that define the operations they can perform, so does the VM: Java source compiles into VM instructions in a format known as bytecode.

It is through the VM that executable bytecode Java classes are executed and ultimately routed to appropriate native system calls.

Problem: “A Java program executing within the VM is executed one bytecode at a time”


Problem (Contd.):

The conventional approach resulted in significantly lower performance than compiled languages like C/C++ because of the additional processor and memory usage during interpretation.

As a result, slow and space-constrained computing devices have tended not to include virtual computing technology (i.e. a JVM).

Initiatives:

JSR-30: J2ME CLDC (Connected Limited Device Configuration) Specification
- Reference implementation of the J2ME CLDC in April 1999, approved in August 1999
- Final public release of CLDC 1.0 in May 2003

The HotSpot engine was developed to address the perception that Java virtual machine performance was insufficient for many mainstream applications.

By implementing a host of performance-enhancing techniques that went beyond innovations like just-in-time (JIT) compilers, the performance of the Java virtual machine increased by an order of magnitude.

Just-In-Time (JIT)/Dynamic Compilation :

The Just-In-Time (JIT) compiler is a component of the Java Runtime Environment. It improves the performance of Java applications by compiling bytecodes to native machine code at run time.

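[Figure: bytecodes enter the JVM, which contains the garbage collector (GC), the runtime, the profiler, and the Just-In-Time (JIT) compiler. The JIT compiler consists of an intermediate representation generator, an optimizer, and a code generator.]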

Just-In-Time (JIT)/Dynamic Compilation (Contd.) :

JIT Compilation Strategies:

With a JIT compiler, Java programs are compiled into the native processor's instructions one block of code at a time as they execute, to achieve higher performance. The process involves generating an internal representation of a method that is different from bytecode but at a higher level than the target processor's native instructions. The compiler performs optimizations to improve quality and efficiency, and finally a code-generation step translates the optimized internal representation into the target processor's native instructions.

To avoid the overhead of compiling and optimizing all of an application's classes at once, a number of incremental compilation strategies have evolved. The general strategy of compiling only the “hot” parts of an application will often result in only a small percentage of the application being compiled, thus saving considerable compilation time.

“A continuously operating sampling profiler identifies the program's hot regions for code reoptimization”

“The JIT compiler operates on a compilation thread that's separate from the application threads so that the application doesn't need to wait for a compilation to occur”



A Java class that has been loaded into memory by the VM contains a V-table (virtual table), which is a list of the addresses for all the methods in the class.

Each address in the V-table points to the executable bytecode for the particular method.

[Figure: a V-table with entries Method 1 to Method 4, each pointing to that method's bytecode.]


When the JIT is loaded, each bytecode address in the V-table is replaced with the address of the JIT compiler itself.

[Figure: a V-table with entries Method 1 to Method 5, all pointing to the Just-In-Time compiler.]

When the VM calls a method through the address in the V-table, the JIT compiler is executed instead.


The JIT compiler steps in and compiles the Java bytecode into native code and then patches the native code address back to the V-table.

[Figure: the same V-table, with the Method 5 entry now pointing to the native code of Method 5.]

From now on, each call to the method results in a call to the native version.
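The patching scheme described above can be sketched in a few lines of Java. This is only an illustrative model, not how any real VM is implemented: the class `VTableDemo`, the one-slot table, and the lambda standing in for "native code" are all hypothetical.

```java
import java.util.function.IntUnaryOperator;

// Illustrative model only: a "V-table" slot holds whatever code should run
// for a method. All names here are hypothetical.
class VTableDemo {
    static int compileCount = 0;   // how many times the pretend JIT ran
    final IntUnaryOperator[] vtable = new IntUnaryOperator[1];

    VTableDemo() {
        // Initially the slot points at a stub that invokes the compiler.
        vtable[0] = arg -> compileAndRun(0, arg);
    }

    // Stand-in for the JIT: "compiles" the method, patches the slot, runs it.
    private int compileAndRun(int slot, int arg) {
        compileCount++;
        IntUnaryOperator nativeCode = x -> x * x;  // pretend native code
        vtable[slot] = nativeCode;                 // patch the V-table entry
        return nativeCode.applyAsInt(arg);
    }

    // A method call always goes through the V-table.
    int call(int slot, int arg) {
        return vtable[slot].applyAsInt(arg);
    }

    public static void main(String[] args) {
        VTableDemo vm = new VTableDemo();
        System.out.println(vm.call(0, 3));  // first call goes through the stub
        System.out.println(vm.call(0, 4));  // later calls hit the patched entry
        System.out.println("compilations: " + compileCount);
    }
}
```

The caller never needs to know whether a method has been compiled; it always jumps through the table, and only the table entry changes.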

JIT Design :

Challenges (Price of Platform neutrality):

The time it takes to compile the code is added to the program's running time. JIT typically causes a slight delay in initial execution of an application, due to the time taken to load and compile the bytecode.

Optimizations:

Modern JIT compilers take one of two approaches:

1. Compile all the code, but without performing any expensive analyses or transformations, so that the code is generated quickly.
2. Devote compilation resources to only a small number of methods that execute frequently.

Combine interpretation and JIT compilation. The application code is initially interpreted, but the JVM monitors which sequences of bytecode are frequently executed and translates them to machine code for direct execution on the hardware.

JIT Design (Contd.) :

There are four reasons why a JIT for the complete bytecode set was not implemented and the combined use of interpreter and JIT became unavoidable.

1. If thread context switching had to be performed while executing generated native code, this would have added complexity to code generation, runtime support, and the base VM code. By performing context switching only in the interpreter, no changes were needed to the way thread scheduling was done in the VM.

2. The generated machine code would have needed to be more rigorous in the way it dealt with error conditions and other exceptional conditions. As it is, the machine code only needs to check for error conditions. When they occur the error handling bytecodes can be then executed by the interpreter, which then can deal with the details of how the error should be processed.

3. A complete JIT would have required more complicated interactions between the generated machine code and the virtual machine as a whole. For example, the generated machine code could cause the compiler, class loader, garbage collector, or native code to run. In retrospect some of these restrictions were not strictly necessary, but the system probably has fewer undiscovered bugs, and it does not seem to have limited the performance of the type of compute-intensive software that is the target of the design.



4. A debugging technique (discussed below) was used which could not have been employed so easily with a complete JIT.

Therefore the system was designed to allow execution to pass from the compiled code to the interpreter at any time, and also for the interpreter to be able to return to generated code in a timely fashion.

Additionally, to keep the interpreter from getting trapped in a long loop of bytecodes it was necessary to be able to return to compiled code in the middle of a method as well as at the start.

“The JIT lets the interpreter deal with complex tasks such as class loading, exception handling, synchronization, garbage collection, etc.”

The basic interpreter loop is as follows:

Start:
    Try to enter compiled code.
    Interpret the next bytecode.
    goto Start.

If the current method has not been compiled then checks are performed to determine if it can be.


Compilation may not be possible for one of the following reasons.

1. A native function was called.
2. The method has more than a certain number of parameters or local variables, or is unusually large.
3. There is no available memory for more compiled code.
4. An object could not be created without running the garbage collector.
5. An operation was attempted that required a class to be initialized.
6. The start of an exception handler was reached.
7. An exception or error occurred. The interpreter always processes these.
8. A part of a method was reached for which no corresponding machine code could be generated.
9. A function was called for which there was no compiled code.
10. A method return was executed but there was no compiled code to return to because the code buffer had been flushed.




1. The JVM interprets a method until its call count exceeds a JIT threshold.
2. After a method is compiled, its call count is reset to zero; subsequent calls to the method continue to increment its count.
3. When the call count of a method reaches a JIT recompilation threshold, the JIT compiles it a second time, this time applying a larger selection of optimizations than on the previous compilation (because the method has proven to be a significant part of the whole program).
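The counting scheme in these three steps can be sketched as a small state machine. The thresholds (10 and 100), the class `CallCounter`, and the tier names are assumptions made for illustration; real JVMs choose their own values and conditions.

```java
// Hypothetical thresholds and names, chosen only for illustration.
class CallCounter {
    enum Tier { INTERPRETED, COMPILED, RECOMPILED }

    static final int JIT_THRESHOLD = 10;         // assumed value
    static final int RECOMPILE_THRESHOLD = 100;  // assumed value

    Tier tier = Tier.INTERPRETED;
    int count = 0;

    // Called on every invocation; returns the tier the method is in afterwards.
    Tier invoke() {
        count++;
        if (tier == Tier.INTERPRETED && count >= JIT_THRESHOLD) {
            tier = Tier.COMPILED;    // first, cheap compilation
            count = 0;               // the call count restarts after compiling
        } else if (tier == Tier.COMPILED && count >= RECOMPILE_THRESHOLD) {
            tier = Tier.RECOMPILED;  // second compilation, more optimizations
        }
        return tier;
    }

    public static void main(String[] args) {
        CallCounter method = new CallCounter();
        for (int call = 1; call <= 120; call++) {
            Tier before = method.tier;
            Tier after = method.invoke();
            if (before != after) {
                System.out.println("call " + call + ": " + before + " -> " + after);
            }
        }
    }
}
```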

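[Figure: .class files run on JVMs above the operating system. With JIT=OFF, all code goes through the interpreter. With JIT=ON and Threshold=10, a method called fewer than 10 times is interpreted, and a method called 10 or more times is compiled by the JIT to native code.]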


Dalvik JIT :

Dalvik Execution Environment:

1. Register-based architecture (register machine)
   Stack-based machines (JVMs) must use instructions to load data onto the stack and manipulate that data, and thus require more instructions than register machines.
2. Very compact representation
   Java bytecode is converted into an alternate instruction set used by the Dalvik VM. dx is a tool used to convert some (but not all) Java .class files into the .dex format.
3. Emphasis on code/data sharing to reduce memory usage
   Multiple classes are included in a single .dex file.
4. Highly tuned, very fast Dalvik interpreter (2x the speed of similar interpreters), good enough for most applications.

For compute-intensive applications, the Native Development Kit was released to allow Dalvik applications to call out to statically compiled (native) methods.

Dalvik JIT (Contd.):

The other part of the solution is the Dalvik JIT, which translates bytecode to optimized native code at run time. Two styles of JIT were considered: a method compiler and a trace compiler.

1. Method Compiler
- Most common model for server JITs
- Interprets with profiling to detect hot methods
- Compiles and optimizes method-sized chunks
- Strengths
  • Larger optimization window
  • Machine state syncs with the interpreter only at method call boundaries
- Weaknesses
  • Cold code within hot methods gets compiled
  • Much higher memory usage during compilation and optimization
  • Longer delay between the point at which a method goes hot and the point at which a compiled and optimized method delivers benefits

2. Trace Compiler
- Most common model for low-level code migration systems
- Interprets with profiling to identify hot execution paths
- Compiled fragments are chained together in a translation cache
- Strengths
  • Only the hottest of hot code is compiled, minimizing memory usage
  • Tight integration with the interpreter allows focus on common cases
  • Very rapid return of performance boost once hotness is detected
- Weaknesses
  • Smaller optimization window limits peak gain
  • More frequent state synchronization with the interpreter
  • Difficult to share the translation cache across processes


Method vs. Trace:

- Full program: 4,695,780 bytes
- Hot methods: 396,230 bytes (8% of program)
- Hot traces: 26% of hot methods, or roughly 103,000 bytes (2% of program)

Method JIT: best optimization window
Trace JIT: best speed/space tradeoff


The provisional decision was to start with a trace JIT for the following reasons:

• Minimizing memory usage is critical for mobile devices
• It is important to deliver the performance boost quickly
  - The user might give up on a new app if we wait too long to JIT
• It leaves open the possibility of supplementing with a method-based JIT
  - The two styles can co-exist
  - A mobile device looks more like a server when it's plugged in
  - Best of both worlds
    • Trace JIT when running on battery
    • Method JIT in background while charging

The Dalvik JIT can be considered as an extension of the Interpreter because it is the Interpreter which profiles and triggers trace selection mode when a potential trace head goes hot.


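Dalvik Trace JIT Flow:

[Figure: flow chart of the trace JIT, summarized in the steps below.]

1. Start: interpret until the next potential trace head.
2. Update the profile count for this location.
3. Threshold reached? If NO, go back to step 1.
4. If YES: does a translation exist? If YES, execute the translation from the translation cache; each translation leaves through one of its exits (Exit 0, Exit 1).
5. If NO translation exists: interpret while building a trace request, then submit a compilation request to the compiler thread, which installs the new translation in the translation cache.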
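A minimal sketch of this flow, under stated assumptions: the class `TraceProfiler`, the threshold of 3, and the string placeholder for a translation are all hypothetical, and the compiler "thread" is modelled as a queue that is drained explicitly rather than a real background thread.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical model of trace-head profiling; not Dalvik's actual code.
class TraceProfiler {
    static final int THRESHOLD = 3;  // assumed hotness threshold

    final Map<Integer, Integer> counts = new HashMap<>();
    final Map<Integer, String> translationCache = new HashMap<>();
    final Queue<Integer> compileQueue = new ArrayDeque<>();

    // Called by the interpreter whenever it reaches a potential trace head.
    String atTraceHead(int pc) {
        int n = counts.merge(pc, 1, Integer::sum);  // update profile count
        if (n < THRESHOLD) {
            return "interpret";                     // not hot yet
        }
        String translation = translationCache.get(pc);
        if (translation != null) {
            return "run " + translation;            // jump into compiled code
        }
        if (!compileQueue.contains(pc)) {
            compileQueue.add(pc);                   // submit compilation request
        }
        return "interpret";                         // keep going until installed
    }

    // Stand-in for the compiler thread: compile and install pending requests.
    void drainCompileQueue() {
        Integer pc;
        while ((pc = compileQueue.poll()) != null) {
            translationCache.put(pc, "trace@" + pc);
        }
    }

    public static void main(String[] args) {
        TraceProfiler jit = new TraceProfiler();
        for (int i = 0; i < 3; i++) {
            System.out.println(jit.atTraceHead(42));
        }
        jit.drainCompileQueue();
        System.out.println(jit.atTraceHead(42));
    }
}
```

Because compilation happens off to the side, the interpreter never waits: it simply keeps interpreting until the translation shows up in the cache.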


Features:

• Trace request is built during interpretation
  - Allows access to actual run-time values
  - Ensures that the trace only includes bytecodes that have successfully executed at least once (useful for some optimizations)
• Trace requests are handed off to the compiler thread, which compiles and optimizes them into native code
• Compiled traces are chained together in the translation cache
• Per-process translation caches (sharing only within security sandboxes)
• Simple traces, generally 1 to 2 basic blocks long
• Local optimizations
  - Register promotion
  - Load/store elimination
  - Redundant null-check elimination
  - Heuristic scheduling
• Loop optimizations
  - Simple loop detection
  - Invariant code motion
  - Induction variable optimization

JIT Compiler:

JIT Compiler Work Flow:

In order to execute bytecode, the JIT compiler goes through three stages.

1. Baseline: generates code that is “obviously correct”. The process involves generating an internal representation of the Java code that is different from bytecode but at a higher level than the target processor's native instructions (the Intermediate Representation, IR). “IR allows more effective machine-specific optimizations”

2. Optimizing: applies a set of optimizations to a class when it is loaded at run time.

3. Adaptive: methods are compiled with a non-optimizing compiler first; “hot” methods are then selected for recompilation based on run-time profiling information.

“A key part of the JIT design was to split the compilation process into two passes. The first pass transforms the standard, stack-based bytecodes into a simple 3-address intermediate representation in which all temporary statement results are placed into new local variables instead of entries on an evaluation stack. The second pass converts this three-address form into native machine code.”
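The first pass described in the quote can be sketched by simulating the evaluation stack symbolically: instead of pushing values, the translator pushes the names of the variables that hold them. The class `StackToTac` and the four-instruction textual bytecode subset are assumptions made for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical translator for a tiny subset of stack bytecode.
class StackToTac {
    static List<String> translate(String[] bytecode) {
        Deque<String> stack = new ArrayDeque<>();  // holds names, not values
        List<String> out = new ArrayList<>();
        int temp = 0;
        for (String insn : bytecode) {
            String[] p = insn.split(" ");
            switch (p[0]) {
                case "iload":   // push a local variable's name
                case "iconst":  // push a constant
                    stack.push(p[1]);
                    break;
                case "iadd":
                case "imul": {
                    String right = stack.pop();
                    String left = stack.pop();
                    String t = "t" + (++temp);  // result goes to a new temporary
                    String op = p[0].equals("iadd") ? "+" : "*";
                    out.add(t + " := " + left + " " + op + " " + right);
                    stack.push(t);
                    break;
                }
                case "istore":  // pop into a local variable
                    out.add(p[1] + " := " + stack.pop());
                    break;
                default:
                    throw new IllegalArgumentException("unsupported: " + insn);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] code = { "iload x", "iconst 5", "iadd", "istore y" };
        for (String line : translate(code)) {
            System.out.println(line);
        }
    }
}
```

Loads and constants emit no code at all in this sketch; only operations and stores produce three-address statements.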

Intermediate Representation:

An IR instruction is an N-tuple (an ordered list) consisting of an operator and some number of operands.

“The Intermediate Representation is a machine- and language-independent version of the original source code”

The operator is the instruction to perform. Operands are used to represent symbolic registers, physical registers, memory locations, constants, branch targets, method signatures, types, etc.

IR code must be convenient to translate into real assembly code for all desired target machines.

Intermediate Representation (contd.):

Three Address Code (TAC or 3AC):

1. Three-address code is a form of intermediate code (IR) used by compilers to aid in the implementation of code-improving transformations.
2. Each instruction in three-address code can be described as a 4-tuple (operator, operand1, operand2, result), as shown:

    result := operand1 operator operand2

   such as

    x := y + z

3. Expressions containing more than one fundamental operation, such as

    p = x + y * z

   are not representable in three-address code as a single instruction. Instead, they are decomposed into an equivalent series of instructions, such as

    t1 := y * z
    p := x + t1

“The key features of three-address code are that every instruction implements exactly one fundamental operation, and that the source and destination may refer to any available register”
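As a rough sketch of how such a decomposition can be automated, the following walks an expression tree bottom-up and emits one three-address instruction per operator. The `TacGen` class and its tiny `Expr` type are hypothetical, introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical three-address code generator for binary expressions.
class TacGen {
    // A leaf has a name; an inner node has an operator and two children.
    static final class Expr {
        final String name;
        final char op;
        final Expr left, right;
        Expr(String name) { this.name = name; this.op = 0; this.left = null; this.right = null; }
        Expr(char op, Expr left, Expr right) { this.name = null; this.op = op; this.left = left; this.right = right; }
    }

    final List<String> code = new ArrayList<>();
    private int temps = 0;

    // Returns the name that holds the value of e, emitting code as needed.
    String gen(Expr e) {
        if (e.name != null) return e.name;
        String l = gen(e.left);
        String r = gen(e.right);
        String t = "t" + (++temps);
        code.add(t + " := " + l + " " + e.op + " " + r);
        return t;
    }

    // Emits code for "target = e"; the root operator writes straight to target.
    void assign(String target, Expr e) {
        if (e.name != null) {
            code.add(target + " := " + e.name);
            return;
        }
        String l = gen(e.left);
        String r = gen(e.right);
        code.add(target + " := " + l + " " + e.op + " " + r);
    }

    public static void main(String[] args) {
        TacGen g = new TacGen();
        Expr yz = new Expr('*', new Expr("y"), new Expr("z"));
        g.assign("p", new Expr('+', new Expr("x"), yz));
        g.code.forEach(System.out::println);
    }
}
```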


Static Single Assignment form (SSA):

1. A refinement of three-address code and a property of an intermediate representation (IR), which says that each variable is assigned exactly once.
2. Existing variables in the original IR are split into versions (new variables, typically indicated in textbooks by the original name with a subscript), so that every definition gets its own version.

Benefits (by example):

TAC:

    y := 1
    y := 2
    x := y

SSA:

    y1 := 1
    y2 := 2
    x := y2

1. Humans can see that the first assignment is not necessary.
2. The value of y being used in the third line comes from the second assignment of y.

Without SSA, a program would have to perform “reaching definition analysis” to establish these two facts. With SSA, both are immediate: y1 is defined but never used, so omitting its assignment won't affect any other part of the code.
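For straight-line code, the renaming step and the resulting dead-definition check are short enough to sketch. The class `SsaDemo` and its two-element statement format are hypothetical, and the sketch ignores branches entirely (real SSA construction also needs phi functions at control-flow merges).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical SSA renaming for straight-line copies only (no branches).
class SsaDemo {
    // Each statement is {dest, source}; source is a variable name or a literal.
    static List<String[]> toSsa(List<String[]> stmts) {
        Map<String, Integer> version = new HashMap<>();
        List<String[]> out = new ArrayList<>();
        for (String[] s : stmts) {
            String src = s[1];
            Integer v = version.get(src);
            if (v != null) src = src + v;  // a use refers to the latest version
            int n = version.merge(s[0], 1, Integer::sum);
            out.add(new String[] { s[0] + n, src });  // each definition is new
        }
        return out;
    }

    // In SSA, a definition is dead if its name is never read and not live-out.
    static List<String> unusedDefs(List<String[]> ssa, Set<String> liveOut) {
        List<String> dead = new ArrayList<>();
        for (String[] def : ssa) {
            boolean used = liveOut.contains(def[0]);
            for (String[] s : ssa) {
                if (s[1].equals(def[0])) used = true;
            }
            if (!used) dead.add(def[0]);
        }
        return dead;
    }

    public static void main(String[] args) {
        List<String[]> tac = Arrays.asList(
            new String[] { "y", "1" },
            new String[] { "y", "2" },
            new String[] { "x", "y" });
        List<String[]> ssa = toSsa(tac);
        for (String[] s : ssa) {
            System.out.println(s[0] + " := " + s[1]);
        }
        System.out.println("dead: " + unusedDefs(ssa, Collections.singleton("x1")));
    }
}
```

Unlike the slide, the sketch also gives x a version number (x1); the live-out set says which final values the surrounding code still needs.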


Three levels of IR:

[Figure: bytecode -> HIR -> MIR -> LIR -> machine code]

1. IRs that are close to a high-level language are called high-level IRs, and IRs that are close to assembly are called low-level IRs.

2. A high-level IR might preserve things like array subscripts or field accesses whereas a low-level IR converts those into explicit addresses and offsets.

Original           HIR               MIR            LIR

float a[10][20]    t1 = a[i, j+2]    t1 = j+2       r1 = [fp-4]
a[i][j+2]                            t2 = i*20      r2 = r1+2
                                     t3 = t1+t2     r3 = [fp-8]
                                     t4 = 4*t3      r4 = r3*20
                                     t5 = addr a    r5 = r4+r2
                                     t6 = t5+t4     r6 = 4*r5
                                     t7 = *t6       r7 = fp-216
                                                    f1 = [r7+r6]
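The MIR column is just row-major address arithmetic spelled out one operation at a time. A small sketch makes this concrete; the class `ArrayAddress` is hypothetical, and it assumes the C-style row-major layout and 4-byte floats implied by the constants 20 and 4 in the example.

```java
// Hypothetical helper mirroring the MIR column step by step.
class ArrayAddress {
    static final int COLS = 20;      // second dimension of float a[10][20]
    static final int ELEM_SIZE = 4;  // bytes per float

    // Byte offset of a[i][j+2] from the base address of a (row-major layout).
    static int byteOffset(int i, int j) {
        int t1 = j + 2;           // t1 = j+2
        int t2 = i * COLS;        // t2 = i*20
        int t3 = t1 + t2;         // t3 = t1+t2 (element index)
        int t4 = ELEM_SIZE * t3;  // t4 = 4*t3  (byte offset)
        return t4;                // MIR then adds addr a and loads *t6
    }

    public static void main(String[] args) {
        System.out.println(byteOffset(0, 0));
        System.out.println(byteOffset(3, 5));
    }
}
```

The HIR keeps all of this hidden inside a single array-access operator, which is why optimizations that reason about arrays are easier at that level.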


1. HIR (High-Level IR)
a) IRs that are closer to a high-level language (operators similar to Java bytecode)
b) Usually preserves information such as loop structure and if-then-else statements
c) Operates on symbolic registers instead of an implicit stack

HIR Generation:

Java code (.java):

    class AdditionMethodTest {
        public static void main(String args[]) {
            int a = 3;
            int b = 4;
            int c = a + b;
            int d = getValue(c);
            return;
        } // End method main

        public static int getValue(int var) {
            return var * var;
        } // End method getValue
    }

Bytecode (.class):

    Method void main(java.lang.String[])
       0 iconst_3
       1 istore_1
       2 iconst_4
       3 istore_2
       4 iload_1
       5 iload_2
       6 iadd
       7 istore_3
       8 iload_3
       9 invokestatic #2 <Method int getValue(int)>
      12 istore 4
      14 return

    Method int getValue(int)
       0 iload_0
       1 iload_0
       2 imul
       3 ireturn


Conversion from Java bytecode to HIR:

The compiler that performs this conversion contains two parts:

1. The BC2IR algorithm, which translates bytecode to HIR and performs on-the-fly optimizations during translation.
2. Additional optimizations performed on the HIR after translation.

BC2IR Translation:

1. Discovers extended basic blocks
2. Constructs an exception table for the method
3. Creates HIR instructions for bytecodes
4. Performs on-the-fly optimizations
   a) Copy propagation
   b) Constant propagation
   c) Register renaming for local variables
   d) Dead-code elimination
   e) Short final or static methods are inlined

Note: Even though these optimizations are performed in later phases, doing so here reduces the size of the HIR generated and thus compile time.


Example of on-the-fly optimization:

Copy propagation can be seen at work here.

Java source:

    y = x + 5

Java bytecode:

    iload x
    iconst 5
    iadd
    istore y

Generated IR (optimization off):

    INT_ADD tint, xint, 5
    INT_MOVE yint, tint

Generated IR (optimization on):

    INT_ADD yint, xint, 5
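The effect shown in this example can be approximated with a very small peephole pass: when a move copies a temporary that the immediately preceding instruction just defined, the earlier instruction is retargeted to write the final destination and the move disappears. This is a simplified stand-in, not the BC2IR algorithm itself. The class `MoveFolding`, the array-based instruction format, and the convention that temporaries start with "t" and are used only once are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical peephole pass; instruction format is {opcode, dest, operands...}.
class MoveFolding {
    static List<String[]> fold(List<String[]> ir) {
        List<String[]> out = new ArrayList<>();
        for (String[] insn : ir) {
            if (insn[0].equals("INT_MOVE") && !out.isEmpty()) {
                String[] prev = out.get(out.size() - 1);
                // Assumption: names starting with "t" are single-use temporaries,
                // so redirecting their definition cannot break a later use.
                if (insn[2].startsWith("t") && prev[1].equals(insn[2])) {
                    String[] merged = prev.clone();
                    merged[1] = insn[1];              // write the final destination
                    out.set(out.size() - 1, merged);
                    continue;                         // the move is no longer needed
                }
            }
            out.add(insn);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> ir = Arrays.asList(
            new String[] { "INT_ADD", "t1", "x", "5" },
            new String[] { "INT_MOVE", "y", "t1" });
        for (String[] insn : fold(ir)) {
            System.out.println(String.join(" ", insn));
        }
    }
}
```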

The HIR generated for AdditionMethodTest.java:

    ********* START OF IR DUMP Initial HIR FOR AdditionMethodTest.getValue (I)I
    -13  LABEL0  Frequency: 0.0
    -2   EG ir_prologue  l0i(I,d) =
    2    int_mul         t2i(I) = l0i(I,d), l0i(I,d)
    3    int_move        t1i(I) = t2i(I)
    -3   return          t1i(I)
    -1   bbend           BB0 (ENTRY)
    ********* END OF IR DUMP Initial HIR FOR AdditionMethodTest.getValue (I)I

    ********* START OF IR DUMP Initial HIR FOR AdditionMethodTest.main ([Ljava/lang/String;)V
    -13  LABEL0  Frequency: 0.0
    -2   EG ir_prologue  l0i([Ljava/lang/String;,d) =
    1    int_move        l1i(B) = 3
    3    int_move        l2i(B) = 4
    7    int_move        l3i(B) = 7
    9    EG call         l5i(I) AF CF OF PF SF ZF = 66668, static"AdditionMethodTest.getValue (I)I", <unused>, 7
    -3   return          <unused>
    -1   bbend           BB0 (ENTRY)
    ********* END OF IR DUMP Initial HIR FOR AdditionMethodTest.main ([Ljava/lang/String;)V

Note that in main the sum a + b already appears as the constant 7, both in the move to l3i and as the argument of the call.


Optimizations for HIR:

The following optimizers are provided for basic optimization:

1. CF  // Constant Folding
2. CPF // Constant Propagation and Folding (triggered by the propagation)
3. CSE // Common Sub-expression Elimination (within basic blocks)
4. DCE // Dead Code Elimination
5. GT  // Global Variable Temporalization (within basic block)

The optimizers CF and GT do not require data flow analysis; CPF, CSE and DCE, however, require some results of data flow analysis.
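To illustrate the difference between plain folding and propagation-triggered folding, here is a minimal sketch for integer addition. The class `ConstFold` and its map of known constants are hypothetical and not taken from the COINS optimizers; the map stands in for the data-flow result that CPF needs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical folding helper for "a + b" on int operands.
class ConstFold {
    // Resolves an operand to a constant if it is a literal or a known variable.
    static Integer value(String operand, Map<String, Integer> known) {
        if (known.containsKey(operand)) return known.get(operand);
        try {
            return Integer.valueOf(operand);
        } catch (NumberFormatException e) {
            return null;  // not a constant
        }
    }

    // Returns the folded sum, or null if either operand is not constant.
    static Integer foldAdd(String a, String b, Map<String, Integer> known) {
        Integer x = value(a, known);
        Integer y = value(b, known);
        return (x != null && y != null) ? Integer.valueOf(x + y) : null;
    }

    public static void main(String[] args) {
        Map<String, Integer> known = new HashMap<>();
        System.out.println(foldAdd("3", "4", known));  // plain folding of literals
        known.put("a", 3);                             // facts from propagation
        known.put("b", 4);
        System.out.println(foldAdd("a", "b", known));  // propagation, then folding
        System.out.println(foldAdd("a", "n", known));  // n unknown: not foldable
    }
}
```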

A complete description is available at http://www.coins-project.org/international/COINSdoc.en/hiropt/hiropt.html



2. Medium-Level IRs (MIR)
a) Support a range of features in a set of source languages, but in a language-independent way.
b) Good basis for generation of efficient machine code for one or more architectures. Example: register transfer languages.

3. Low-Level IRs (LIR)
a) Almost one-to-one correspondence to target-machine instructions: quite architecture-dependent.

<MIR & LIR to be added>

Optimization Techniques:

Why Optimization:

1. Programmers do not always write optimal code.
   a) For example, ways to improve code are not always recognized (e.g. moving loop-invariant code out of loops, avoiding re-computation of the same expression).
2. A high-level language may not allow a programmer to avoid redundant computation (or may make it inconvenient):

    a[i][j] = a[i][j] + 1

3. The programmer should not be bothered with the target machine architecture.

Moreover, modern machine architectures assume optimization; it has become hard to optimize by hand.

Goal:

Let programmers write clean, high-level source code, and still produce programs that approach assembly-code performance.

Optimization: the transformation of a program P into a program P´ that has the same input/output behavior but is somehow “better”. Better might mean:

• faster, or
• smaller, or
• uses less power, or
• whatever you care about

P´ is not necessarily optimal, and may even be worse than P.

1. Inlining (also at lower levels)
2. Specialization
3. Constant folding
4. Constant propagation
5. Value numbering
6. Dead code elimination
7. Loop-invariant code motion
8. Common sub-expression elimination
9. Strength reduction
10. Branch prediction/optimization
11. Register allocation
12. Loop unrolling
13. Cache optimization
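Two of the techniques in this list, loop-invariant code motion and strength reduction, are easy to show as a hand-written before/after pair. The class `LoopOpts` and both methods are made up for illustration; a compiler would perform the same rewrite on its IR rather than on source code.

```java
// Hypothetical before/after pair showing P and P' with identical behavior.
class LoopOpts {
    // Before: recomputes k * scale and multiplies by i in every iteration.
    static int before(int n, int k, int scale) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            int inv = k * scale;  // loop-invariant: same value every time
            sum += inv + i * 4;   // multiplication by the induction variable
        }
        return sum;
    }

    // After: the invariant is hoisted out of the loop, and i * 4 is replaced
    // by a running addition (strength reduction).
    static int after(int n, int k, int scale) {
        int sum = 0;
        int inv = k * scale;      // computed once
        int i4 = 0;               // always equal to i * 4
        for (int i = 0; i < n; i++) {
            sum += inv + i4;
            i4 += 4;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(before(5, 3, 2));
        System.out.println(after(5, 3, 2));
    }
}
```

Both methods return the same value for the same inputs, which is exactly the "same input/output behavior" requirement in the definition of optimization above.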
