Software Performance Optimisation Group: Domain-specific interpreters (a nested talk), Paul Kelly...

Domain-specific interpreters (a nested talk)

Paul Kelly (Imperial College London)

Joint work with

Olav Beckmann, Karen Osmond, Tony Field and others

Dagstuhl, January 2006

Domain-specific optimisation

Libraries extend general-purpose languages

Good libraries promote problem-focused code

“Active libraries” apply library-specific optimisations to client code

Client:

    C a = new C(…);
    C b = new C(…);
    …
    c = a.f(…);
    …
    print( b.g(c) );

Library:

    constructor C(…);
    f(…) {…}
    g(…) {…}

Client calling context may enable optimisation: fusion, redundancy elimination, incrementalisation, etc.

Active library technologies

How to deliver “active libraries”?

Domain-specific compiler?

Source-to-source transformation?

Plug-in-based compiler architecture?

Plug-in-based virtual machine?

“Domain-specific optimisation components”

Aspect weaver?

This talk is about an appealingly low-tech solution, which we glorify with a big name – the “Domain-Specific Interpreter”

Domain-specific interpreter

DSI is interposed between client and library

Client:

    C a = new C(…);
    C b = new C(…);
    …
    c = a.f(…);
    …
    print( b.g(c) );

DSI:

    Delay execution, build “recipe”
    Plan optimised execution, execute

Library:

    constructor C(…);
    f(…) {…}
    g(…) {…}

Inject proxy between application and library

Use proxy to capture, delay and optimise the calls

Domain-specific interpreter

DSI is a design pattern

Standard questions:

When is DSI a good idea?

When is it applicable?

How do you implement it (in your favoured language)?

Show me an example!

Let’s do the example first…

MayaVi

Tool for visualising fluid flows

GUI supports interactive construction of visualisation pipelines

E.g. fluid flow past a heated sphere: temperature isosurface with temperature-shaded streamtubes




I’m going to show you how we dramatically improved MayaVi interactivity

By parallel execution on SMP

By parallel execution on linux cluster

By caching pre-calculated results

Without changing a single line of MayaVi or VTK code*

Without writing a compiler

* Actually we did change a few lines in VTK to fix a problem with Python’s Global Interpreter Lock

MayaVi: Working on partitioned data

Our ocean simulations are generated in parallel

Input data consists of a set of partitions (and an XML index)

Normally, VTK fuses these partitions into one mesh as they are read

Some – many – analyses can operate partition-by-partition
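To make the partition-by-partition point concrete, here is a minimal sketch in which a per-partition analysis runs independently on each partition and the partial results are merged afterwards. The data layout, `count_above` and the thread pool are illustrative assumptions, not the VTK data model or the talk's implementation; in CPython, real speedup from threads would also need the heavy work to release the Global Interpreter Lock.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for partitioned input: each partition is a list of
# scalar samples (e.g. temperatures). Real partitions would be mesh pieces.
partitions = [
    [0.1, 0.7, 0.9],
    [0.4, 0.8],
    [0.2, 0.3, 0.95, 0.6],
]

def count_above(partition, threshold):
    """Per-partition analysis: needs no data from any other partition."""
    return sum(1 for value in partition if value > threshold)

def analyse_partitioned(parts, threshold, workers=2):
    """Run the analysis on each partition independently, then merge."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda p: count_above(p, threshold), parts))
    return sum(partial)

def analyse_fused(parts, threshold):
    """Reference version: fuse all partitions into one dataset first."""
    fused = [value for part in parts for value in part]
    return count_above(fused, threshold)

total = analyse_partitioned(partitions, 0.5)
```

Both routes give the same answer here because the analysis never looks across a partition boundary; analyses that do need neighbouring data would not split this simply.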

MayaVi: what the DSI has to do

Capture all delayable calls to methods from a DSL through a proxy layer

A force point is a call which requires an immediate result – in this case to render on screen

A recipe is the set of calls between consecutive force points

(in parallel)
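The recipe and force-point vocabulary can be illustrated with a small sketch that cuts a captured call trace into recipes. The trace, the choice of `Render` as the only force point, and the decision to include the force point as the last call of its recipe are all assumptions made for illustration.

```python
# Hypothetical force points: calls that need an immediate result.
FORCE_POINTS = {"Render"}

def split_into_recipes(calls):
    """Group captured (object, method) calls into recipes.

    Each recipe ends with the force point that triggers its execution.
    Calls made after the last force point are returned separately because
    they are still delayed.
    """
    recipes, current = [], []
    for call in calls:
        current.append(call)
        if call[1] in FORCE_POINTS:
            recipes.append(current)
            current = []
    return recipes, current

# Illustrative trace of captured calls, in program order.
trace = [
    ("reader", "SetFileName"),
    ("contour", "SetInput"),
    ("contour", "SetValue"),
    ("window", "Render"),
    ("contour", "SetValue"),
    ("window", "Render"),
    ("contour", "SetValue"),
]
recipes, pending = split_into_recipes(trace)
```

Each element of `recipes` is a unit the DSI can plan and optimise as a whole before executing it.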

Implementing a generic DSI proxy in Python

Actually, the real implementation generates dummies for all the methods and members as well as the classes

So when MayaVi reflects on the module to generate the GUI configuration forms it finds the right stuff

Self-generating proxy module:

    import vtkpython_real
    from vtkdsi import proxyObject

    # Create one proxy class for every name exported by the real VTK module
    for className in dir(vtkpython_real):
        exec("class " + className + "(proxyObject): pass")

Proxy implementation:

    class proxyObject:
        def __getattr__(self, callName):
            # Any method not defined on the proxy becomes a captured call
            return lambda *callArgs: self.proxyCall(callName, callArgs)

        def proxyCall(self, callName, callArgs):
            # if forcepoint: optimise and apply recipe
            # else: add call to the current recipe
            pass
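A self-contained variant of the same idea, runnable without VTK, might look like the sketch below. `Recorder`, `Proxy` and the toy `Pipeline` library are hypothetical names introduced for illustration; the real implementation's recipe handling is not shown in the slides, so the replay logic here is an assumption.

```python
class Recorder:
    """Holds the current recipe and replays it when a force point arrives."""

    def __init__(self, force_points):
        self.force_points = set(force_points)
        self.recipe = []

    def call(self, target, name, args):
        self.recipe.append((target, name, args))
        if name in self.force_points:
            return self.flush()
        return None  # delayed calls cannot return a real result yet

    def flush(self):
        recipe, self.recipe = self.recipe, []
        result = None
        # An optimiser could reorder, fuse or parallelise `recipe` here.
        for target, name, args in recipe:
            result = getattr(target, name)(*args)
        return result


class Proxy:
    """Stands in for a library object and captures every method call."""

    def __init__(self, real, recorder):
        self._real = real
        self._recorder = recorder

    def __getattr__(self, name):
        # Only reached for attributes that normal lookup does not find.
        return lambda *args: self._recorder.call(self._real, name, args)


class Pipeline:
    """Toy stand-in for the real library."""

    def __init__(self):
        self.log = []

    def set_value(self, value):
        self.log.append(("set_value", value))

    def render(self):
        self.log.append(("render",))
        return len(self.log)


real = Pipeline()
recorder = Recorder(force_points={"render"})
pipeline = Proxy(real, recorder)
pipeline.set_value(1)
pipeline.set_value(2)
delayed = list(real.log)   # nothing has reached the library yet
frames = pipeline.render() # force point: the whole recipe is replayed
```

The client code is unchanged apart from being handed the proxy instead of the real object, which is the property the talk relies on.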

How well does it work?

Benchmark:

Plot isosurfaces for seven pressure values in flow past heated sphere

Each isosurface is several hundred MB

Hardware:

For SMP: Athlon 1600+, dual SMP, 256 KB L2, 1 GB RAM, Linux 2.4

For distributed-memory: Cluster of 4 Pentium 4 2.8 GHz, 512 KB L2, 1 GB RAM, Linux 2.4

Tiling optimisation yields substantial speedup

Modest further speedup from two-way shared-memory parallel

Parallel execution on a four-processor Linux cluster also offers substantial speedup

Isosurface benchmark: cluster of four 2GHz Pentium 4 PCs

Further MayaVi DSI optimisations

Caching: check whether results of this recipe (or part thereof) are available in cache

Multiple frames per second…

Region of Interest (RoI): Load from disk only those partitions which intersect a cuboid specified by the user

Level of Detail (LoD): Each dataset is stored in full-resolution form but also in a hierarchy of coarsened, decimated versions

Put together… “Google Earth” for global ocean flow

Large space of possible execution plans for each visualisation task – choose

Appropriate parallelisation

Recalculate or retrieve from (remote, persistent, peer?) cache

Which intermediate results to save to cache

Partition size

Level of detail (eg to satisfy response-time budget)

Whether to decimate surfaces to fit in graphics RAM

Whether to construct (and cache) index for multiple isosurfaces
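Two of the optimisations above, recipe caching and Region of Interest selection, can be sketched in a few lines. This is an illustration under stated assumptions: the cache key (a hash of the recipe's printed form), the `RecipeCache` class, and the partition index mapping names to axis-aligned bounding cuboids are all hypothetical, not the format used by the real system.

```python
import hashlib

def recipe_key(recipe):
    """Derive a cache key from the calls in a recipe (names and arguments)."""
    return hashlib.sha256(repr(recipe).encode("utf-8")).hexdigest()

class RecipeCache:
    """In-memory cache of recipe results, keyed by recipe content."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def get_or_compute(self, recipe, compute):
        key = recipe_key(recipe)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = compute(recipe)
        return self._store[key]

def intersects(box_a, box_b):
    """Cuboids are ((xmin, ymin, zmin), (xmax, ymax, zmax))."""
    (a_lo, a_hi), (b_lo, b_hi) = box_a, box_b
    return all(a_lo[i] <= b_hi[i] and b_lo[i] <= a_hi[i] for i in range(3))

def select_partitions(index, roi):
    """Return the names of partitions whose bounds intersect the RoI."""
    return [name for name, box in index.items() if intersects(box, roi)]

# Hypothetical partition index: name -> bounding cuboid.
index = {
    "part0": ((0, 0, 0), (1, 1, 1)),
    "part1": ((1, 0, 0), (2, 1, 1)),
    "part2": ((4, 4, 4), (5, 5, 5)),
}
roi = ((0.5, 0.2, 0.2), (1.5, 0.8, 0.8))
selected = select_partitions(index, roi)

cache = RecipeCache()
recipe = [("contour", "SetValue", (0, 0.5)), ("window", "Render", ())]
first = cache.get_or_compute(recipe, lambda r: len(r))
second = cache.get_or_compute(recipe, lambda r: len(r))
```

Only `part0` and `part1` would be loaded from disk for this cuboid, and the second request for the same recipe is served from the cache.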

Back to DSI…

When is DSI a good idea?

When:

You can’t analyse the client code

The client code is too complex to analyse statically

The client composes library code dynamically

The overheads are small compared to library functions’ execution time

When is it applicable?

When:

Execution of library code can be delayed

All dependencies between client and library code are explicit in library API

Library data structures are opaque

How do you implement it?

Interpose proxy:

Built by hand

Using generic proxy mechanism based on reflection – as shown in Python

Using IDL-based parameter marshalling

Using aspect weaver (but…)

Show me an example!

We have used the DSI trick several times

So have lots of other people…

MayaVi/Python/VTK

Message fusion and scheduling in parallel programming

Loop fusion in a matrix/vector library

Aggregation of Java RMI (correctness issues are tricky)

What makes DSI hard to implement?

Non-opaque return values: e.g. vector type is opaque, but dot-product returns a non-opaque scalar

Exceptions: delayed execution shifts the point where errors are discovered

Unnecessary force-points: e.g. property getter methods

Hidden dependencies: e.g. we can aggregate remote method calls provided none of them results in a callback that can affect the caller JVM

Antidependencies: client overwrites operand of delayed call
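The antidependency hazard is easy to reproduce. In the sketch below a delayed call keeps a reference to its operand, the client then overwrites that operand, and the delayed call sees the new value. Snapshotting the operand at capture time (a simple form of renaming) is shown as one possible remedy; `DelayedSum` and its `rename` flag are hypothetical names, and the other obvious remedy is to force execution before the overwrite.

```python
import copy

class DelayedSum:
    """A delayed call: sum(operand), executed only when forced."""

    def __init__(self, operand, rename=False):
        # With rename=True the operand is snapshotted at capture time, so a
        # later overwrite by the client cannot affect the delayed call.
        self.operand = copy.copy(operand) if rename else operand

    def force(self):
        return sum(self.operand)

data = [1, 2, 3]
naive = DelayedSum(data)
renamed = DelayedSum(data, rename=True)

data[0] = 100  # client overwrites the operand before the force point

naive_result = naive.force()      # computed from the overwritten operand
renamed_result = renamed.force()  # computed from the value at call time
```

The copy costs memory and time, so a DSI has to weigh it against simply forcing the recipe early.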

(Next to Last slide)

Conclusions/discussion

DSI is not new

But it just keeps popping up and solves tricky problems

DSI programs are program generators

Type safety of the recipe derives from type safety of the client (so DSI interpreter could be tagless)

Safety of optimising transformations is another matter…

DSIs can be JITs

E.g. our C++ matrix/vector library uses a multistage programming library to generate C loops at runtime (and fuse them)

There is a useful catalogue of techniques to enhance DSI applicability, reduce overheads, etc.
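The "DSIs can be JITs" point can be illustrated with a Python analogue of runtime loop fusion. The library mentioned above is C++ and generates C loops; the sketch below instead builds a delayed expression tree with operator overloading and generates a single Python comprehension for the whole expression, so no intermediate vectors are built. All class and function names are assumptions made for illustration.

```python
class Expr:
    """Base class: arithmetic builds a delayed expression tree."""

    def __add__(self, other):
        return BinOp("+", self, other)

    def __mul__(self, other):
        return BinOp("*", self, other)


class Vec(Expr):
    def __init__(self, data):
        self.data = list(data)

    def emit(self, env):
        # Bind this vector to a fresh name in the generated code's namespace.
        name = "v%d" % len(env)
        env[name] = self.data
        return "%s[i]" % name


class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def emit(self, env):
        return "(%s %s %s)" % (self.left.emit(env), self.op, self.right.emit(env))


def force(expr, n):
    """Generate one fused loop for the whole expression and run it."""
    env = {}
    body = expr.emit(env)
    source = "[%s for i in range(n)]" % body
    env["n"] = n
    # env is passed as globals so names are visible inside the comprehension.
    return Vec(eval(source, env)), source


a, b, c = Vec([1, 2, 3]), Vec([4, 5, 6]), Vec([7, 8, 9])
expr = a * b + c            # nothing is computed yet
result, source = force(expr, 3)
```

Evaluating `a * b` eagerly would allocate a temporary vector before the addition; the generated code visits each index once.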

Last slide

Related stuff…

Lazy evaluation – with reflection

Template metaprogramming – encode recipe in type

Proxy interposition trick is common in dynamically-typed languages:

Redefining the lookup function in Common Lisp

The “doesNotUnderstand: hack” in Smalltalk

The idea of converting a call to a message…

Message-Oriented Programming: The Case for First Class Messages (Dave Thomas, JOT 2004)

Tomasulo-style renaming to prevent antidependences from forcing execution

Compare with explicit recipe construction: workflow systems, command objects, LINQ
