“Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia,...

29
“Dynamo: A Transparent Dynamic Optimization System” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 Presented by: David Carrillo CS 295: Run-time Techniques for Efficient and Reliable Program Execution. Guoqing Xu

Transcript of “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia,...

Page 1: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

“Dynamo: A Transparent Dynamic Optimization System”

V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000

“Dynamo: A Transparent Dynamic Optimization System”

V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000

Presented by:

David Carrillo

CS 295: Run-time Techniques for Efficient and Reliable Program Execution. Guoqing Xu

Page 2: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Motivation• Recent developments decrease effectiveness of compiler

techniques• Software shipped as collection of DLLs not single executable• Object oriented languages use delayed binding• Dynamic code generation

• System vendors have to rely on Software Vendors to avail powerful static compiler optimizations

• Still trend is to offload complexity from hardware to compiler• CISC vs. RISC• Superscalar vs. VLIW (EPIC)

Page 3: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Motivation (contd.)• Legacy binaries may not take advantages of modern

compilers/chip innovations

• Two opposite developments• Static compiler taking on increased performance burden• Obstacles to static compiler analysis increasing

• Choices1.Go for complex specialized compilers

2.Analyze code dynamically and do optimizations (Run-time optimization)

Dynamo’s Choice

Page 4: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Overview

• Dynamo is a dynamic “native-to-native” optimization system• Input an executing native instruction stream• No involvement of Software Vendor• Transparent operation

• Input instruction stream to Dynamo may be• Statically created binary• Dynamically generated binary (i.e. JIT, emulator)

• Present work focuses on static binary case

Page 5: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Overview (contd.)

• Dynamo works like a Virtual Machine• Can observe execution behavior• Tries to identify hot instruction sequences (traces)

• After identification of hot traces• Generates optimized fragments for those traces• Stored into fragment cache (a software code cache)• Links successive traces (trace tree).

• Subsequent occurrences of a hot trace• Stops interpreting• Executes corresponding fragment from “fragment cache”• Upon end of fragment, Dynamo resumes interpreting code

Page 6: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Overview (contd.)

• Overheads associated• Interpreting code is very slow• Identification of hot traces and keeping track• Generation of optimized fragments

• Advantages of approach• Working set of application gradually materializes into cache• Optimized using run-time information (expected to be much faster

than original code)

• Beneficial if• Performance benefits from reuse of fragments offset overheads• Assumes that major working set of application is small

Page 7: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Dynamo: System Architecture

Page 8: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Identification of Hot Traces

• Uses the MRET (Most Recently Executed Trail) scheme• Interpreter associates counters with possible start-of-trace points• Increments whenever these points executed

• Tries to do minimal profiling• Assumes that following instruction are also hot • Estimates the end-of-trace point using some conditions

• When count exceeds a threshold• Stops interpreting• Records instruction sequence till end-of-trace condition• This gives the hot trace

Page 9: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Trace Optimization and Fragment Generation• Selected hot trace converted to intermediate representation(IR)

• IR close to underlying machine instruction set

• Performs optimizations on the IR• Removes unconditional branches• Other optimizations on branch instructions

• For fragment generation• Trace might be split into multiple fragments due to branches• Register allocator tries to preserve original mappings• Stubs are inserted for exit from fragment

• Fragment linking done to couple contiguous fragments (highest performance gain)

Page 10: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

The TraceA

B C

D

E

F

G

H I

J

call

return

*

Speculative Trace Construction

A

B*

C

D

E

F*

G

H*

I

J

Back to the interpreter

From the interpreter

Page 11: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Effects of Fragment Construction• Hot path is linearized

• Eliminates branches• Creates superblock• Applies local optimization

• Cold-path branches remain• Targets are stubs• Send control to interpreter

• Path includes call & return• Jumps not branches• Interprocedural effects

• Indirect branches• Speculate on target• Fall back on hash table of

branch targets

A

B*

C

D

E

F*

G

H*

I

J

Back to the interpreter

From the interpreter

Page 12: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Sources of ImprovementMany small things contribute to make Dynamo profitable

• Linearization eliminates branches

• Improves TLB & I-cache behavior

• 2 passes of local optimization• Redundancy elimination, copy propagation, constant folding, simple strength

reduction, loop unrolling, loop invariant code motion, redundant load removal

• Keep in mind that “local” includes interprocedural in Dynamo

• Focuses on Optimization hard to do on compilation, at least without profile information. Aims to not compete with static compiler.

Nevertheless, Engineering detail makes a difference

Page 13: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Fragment LinkingA

B C

D

E

F

G

H I

J

call

returnWhat happens if another path becomes hot? (Say ABDGIJE)

*

Speculative Trace Construction

Block B meets “start of trace” condition (exit from fragment)

A

B*

C

D

E

F*

G

H*

I

J

Back to the interpreter

From the interpreter

Page 14: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Fragment LinkingWhen counter reaches

hot:• Builds a fragment

A

B*

C

D

E

F*

G

H*

I

J

Back to the interpreter

From the interpreter

Page 15: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Fragment LinkingWhen counter reaches

hot:• Builds a fragment

A

B*

C

D

E

F*

G

H*

I

J

Back to the interpreter

From the interpreter

B

D

E

F*

G

H*

I

J

A*

Page 16: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Fragment LinkingWhen counter reaches

hot:• Builds a fragment• Links exit AB to new

fragment• Links exit EA to old

fragment

A

C

D

E

F*

G

H*

I

J

Back to the interpreter

From the interpreter

B

D

E

F*

G

H*

I

J

Page 17: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Fragment LinkingWhen counter reaches hot:• Builds a fragment• Links exit AB to new

fragment• Links exit EA to old

fragment

What if B* held redundant op?

• Have LIVE on entry to B• Can test LIVE sets for both

exit from A & entry to B • May show op is dead …• Link-time RE

A

C

D

E

F*

G

H*

I

J

Back to the interpreter

From the interpreter

B

D

E

F*

G

H*

I

J

Page 18: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Fragment Cache Management

• Need to flush out cold entries regularly• Mechanism for replacement can’t take too much overhead• Schemes such as LRU might not be feasible

• Dynamo’s approach• Looks for sharp increase in fragment creation range• Indicates significant change in program working set• Complete fragment cache flush triggered

Page 19: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

An example justifying this assumption

Page 20: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Signal handling in Dynamo

• Signals/interrupts could be of two types• Asynchronous• Synchronous

• Asynchronous signals queued at arrival time and processed at end of current fragment’s execution

• Synchronous signal arrival on optimized code can create problems• Can’t be postponed• Might assume original program context which no longer exists due

to optimizations

Page 21: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Signal handling in Dynamo (contd.)

• Conservative Optimizations• Constant propagation, constant folding, copy propagation, etc• Allow precise signal context to be reconstructed

• Aggressive Optimizations• Dead code removal, code sinking, loop invariant code motion, etc.• Can’t guarantee that synchronous signals can be handled properly

Page 22: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Results

Figure. Speedup of +O2 optimized PA-8000 binaries running on Dynamo, relative to the identical binaries running standalone. The contributions from dynamo inlining due to trace selection, conservative trace optimization and aggressive trace optimization are shown. Dynamo bails out to direct native execution on go and vortex.

Page 23: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Results• Dynamo runs faster than native code in the average case

• Overhead averages 1.5% (Spec95 on HP PA-8000)

• Most significant source of speedup is trace-selection.

• Most of the overhead is spent in trace selection• Interpret, bump target counters, test against threshold• Optimization & code generation have minor cost

Page 24: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Results

Performance gain reduces V.S. highly optimized binaries.

Page 25: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Explanation of results• Gains due to trace selection due to

• Selecting a trace and forming fragments out of it• Partial procedure inlining and improved layout in cache

• Large improvements in• Perl, compress, li and m88skim• Have stable working sets

• No performance improvements on• go, ijpeg and vortex• Startup time is non-negligible fraction of the total runtime in these

programs• Don’t run long enough to recoup Dynamo’s startup overhead• Lack of a stable working set.

Page 26: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Remarks• Dynamo project

• Started in 1996• Tries to achieve dynamic performance improvement• Novel native-to-native online optimization aproach• Complements the static compiler

• Very influential paper• 279 tracked in ACM Digital Library

Page 27: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Remarks• Starting up

• Single call to library routine• It copies the stack, creates space for interpreter’s state, and begins

Dynamo’s execution• This model simplifies initialization of program and grants Dynamo local

access to binary memory space.

• Counter management• 1st time at branch target allocates & initializes a counter• 2nd & subsequent times bumps that counter• Optimization & code generation recycles counter’s space

• With cheap breakpoint mechanism, it could spare the use of an interpreter and use breakpoint to invoke Dynamo (similar to a debugger).

Page 28: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

Caveats• Runs on PA-8000 (introduced on November 1995)

• Very simple instructions, easier to write optimizer than CISC (recall Itanium Fiasco)

• This chip lacks most of hardware optimization features common on modern chips.

• Low performance gain on already highly optimized binaries

• Poor handling of interruptions may cause problems in I/O intensive scenarios. Experiments does not test this scenario

• Posterior documents from the authors market Dynamo as an interpreter for emulators

Page 29: “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System.

References• Dynamo: A Transparent Dynamic Optimization System

• Vasanth Bala, Evelyn Duesterwald and Sanjeev Banerjia• Proceedings of the ACM SIGPLAN 2000 conference on Programming

language design and implementation

• ARS Technica report on Dynamo• http://www.arstechnica.com/reviews/1q00/dynamo/dynamo-1.html

Slides based on previous slides of:• Dynamic Optimization as typified by the Dynamo

System.• Keith D. Cooper & Linda Torczon.• COMP 512, Rice University.

• Mayank Agarwal. CS 812 : High Level Design and Modeling Student Presentation