Software Performance Optimisation Group
Domain-specific interpreters (a nested talk)
Paul Kelly (Imperial College London)
Joint work with
Olav Beckmann, Karen Osmond, Tony Field and others
Dagstuhl, January 2006
Domain-specific optimisation
Libraries extend general-purpose languages
Good libraries promote problem-focused code
“Active libraries” apply library-specific optimisations to client code
Client:
    C a = new C(…);
    C b = new C(…);
    …
    c = a.f(…);
    …
    print( b.g(c) );
Library:
    constructor C(…);
    f(…) {…}
    g(…) {…}
Client calling context may enable optimisation: fusion, redundancy elimination, incrementalisation, etc
Active library technologies
How to deliver “active libraries”?
Domain-specific compiler?
Source-to-source transformation?
Plug-in-based compiler architecture?
Plug-in-based virtual machine?
“Domain-specific optimisation components”
Aspect weaver?
This talk is about an appealingly low-tech solution, which we glorify with a big name: the “Domain-Specific Interpreter”
Domain-specific interpreter
DSI is interposed between client and library
[Diagram: the client and library code from the “Domain-specific optimisation” slide, with the DSI placed between Client and Library]
Client → DSI: delay execution, build “recipe”
DSI → Library: plan optimised execution, execute
Inject proxy between application and library
Use proxy to capture, delay and optimise the calls
Domain-specific interpreter
DSI is a design pattern
Standard questions:
When is DSI a good idea?
When is it applicable?
How do you implement it (in your favoured language)?
Show me an example!
Let’s do the example first…
MayaVi
Tool for visualising fluid flows
GUI supports interactive construction of visualisation pipelines
E.g. fluid flow past a heated sphere: temperature isosurface with temperature-shaded streamtubes
I’m going to show you how we dramatically improved MayaVi interactivity
By parallel execution on SMP
By parallel execution on Linux cluster
By caching pre-calculated results
Without changing a single line of MayaVi or VTK code*
Without writing a compiler
* Actually we did change a few lines in VTK to fix a problem with Python’s Global Interpreter Lock
MayaVi: Working on partitioned data
Our ocean simulations are generated in parallel
Input data consists of a set of partitions (and an XML index)
Normally, VTK fuses these partitions into one mesh as they are read
Some – many – analyses can operate partition-by-partition
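As a rough illustration of what "partition-by-partition" means, the sketch below runs an analysis on each partition independently and then combines the results, without ever fusing the partitions into one mesh. The data, the `field_range` analysis and the `combine` step are all made-up stand-ins, not VTK filters.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for partitioned simulation output: each partition
# holds the temperature values of its own mesh nodes.
partitions = [
    [12.0, 15.5, 14.1],
    [9.8, 11.2],
    [16.3, 13.7, 10.4, 12.9],
]

def field_range(partition):
    """Per-partition analysis: needs no data from any other partition."""
    return min(partition), max(partition)

def combine(ranges):
    """Merge the per-partition results into a result for the whole mesh."""
    lows, highs = zip(*ranges)
    return min(lows), max(highs)

def fused_range(parts):
    """Reference: fuse all partitions into one list first, then analyse."""
    fused = [value for part in parts for value in part]
    return min(fused), max(fused)

# Each partition can be handed to a different worker.
with ThreadPoolExecutor() as pool:
    per_partition = list(pool.map(field_range, partitions))

partitioned_result = combine(per_partition)
```

Both routes give the same answer here; the partitioned route is the one that can be spread over processors or cluster nodes.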
MayaVi: what the DSI has to do
Capture all delayable calls to methods from a DSL through a proxy layer
A force point is a call which requires an immediate result, in this case to render on screen
A recipe is the set of calls between consecutive force points
(in parallel)
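A minimal sketch of the recipe and force-point idea, assuming a toy library whose only force point is `Render`. The class names `RecordingProxy` and `ToyLibrary`, and the method names, are hypothetical and are not the MayaVi or VTK API.

```python
class RecordingProxy:
    """Stands between client and library; delays calls until a force point."""

    FORCE_POINTS = {"Render"}  # assumption: only rendering needs a result now

    def __init__(self, library):
        self.library = library   # object whose methods do the real work
        self.recipe = []         # delayed calls since the last force point
        self.executed = []       # names of calls in each recipe that was run

    def call(self, name, *args):
        self.recipe.append((name, args))
        if name in self.FORCE_POINTS:
            self.force()

    def force(self):
        recipe, self.recipe = self.recipe, []
        # An optimiser could reorder, fuse or parallelise `recipe` here.
        for name, args in recipe:
            getattr(self.library, name)(*args)
        self.executed.append([name for name, _ in recipe])


class ToyLibrary:
    """Hypothetical library that just logs what it is asked to do."""

    def __init__(self):
        self.log = []

    def SetInput(self, path):
        self.log.append("SetInput " + path)

    def SetValue(self, value):
        self.log.append("SetValue %s" % value)

    def Render(self):
        self.log.append("Render")


lib = ToyLibrary()
dsi = RecordingProxy(lib)
dsi.call("SetInput", "ocean.vtu")
dsi.call("SetValue", 0.5)
calls_run_before_force = list(lib.log)   # still empty: calls were delayed
dsi.call("Render")                       # force point: the recipe is executed
```

Between two force points the proxy sees the whole recipe at once, which is what gives it room to optimise.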
Implementing a generic DSI proxy in Python
Self-generating proxy module:
    import vtkpython_real
    from vtkdsi import proxyObject
    # Create one empty proxy class for every name exported by the real module
    for className in dir(vtkpython_real):
        exec("class " + className + "(proxyObject): pass")
Proxy implementation:
    class proxyObject:
        def __getattr__(self, callName):
            # Any method lookup returns a function that records the call
            return lambda *callArgs: self.proxyCall(callName, callArgs)
        def proxyCall(self, callName, callArgs):
            # if forcepoint: optimise and apply recipe
            # else: add call to the current recipe
            pass  # body elided on the slide
Actually, the real implementation generates dummies for all the methods and members as well as the classes, so when MayaVi reflects on the module to generate the GUI configuration forms it finds the right stuff
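The same trick can be sketched in a self-contained form that runs without VTK. Everything here is an assumption made for illustration: `toylib_real` stands in for the real library module, `Reader` and `Contour` are invented classes, and the proxy classes are built with `type()` rather than `exec`.

```python
import inspect
import types

# Hypothetical stand-in for the real library module (vtkpython_real on the slide).
real = types.ModuleType("toylib_real")

class Reader:
    def SetFileName(self, name):
        self.name = name

class Contour:
    def SetValue(self, index, value):
        self.value = (index, value)

real.Reader, real.Contour = Reader, Contour

recipe = []  # delayed calls, in client order


class ProxyObject:
    """Turns any method call into a recorded (class, method, args) entry."""

    def __getattr__(self, call_name):
        # Only reached for names not found normally, i.e. library methods.
        def delayed(*call_args):
            recipe.append((type(self).__name__, call_name, call_args))
        return delayed


def make_proxy_module(real_module, name):
    """Build a module exposing one empty proxy class per real class."""
    proxy = types.ModuleType(name)
    for class_name, _ in inspect.getmembers(real_module, inspect.isclass):
        setattr(proxy, class_name, type(class_name, (ProxyObject,), {}))
    return proxy


toylib = make_proxy_module(real, "toylib")   # the client uses this instead

reader = toylib.Reader()
reader.SetFileName("ocean.vtu")
contour = toylib.Contour()
contour.SetValue(0, 0.5)
```

The client code is unchanged apart from which module it imports; none of the real library's methods have run yet, and `recipe` holds the calls for later optimisation.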
How well does it work?
Benchmark:
Plot isosurfaces for seven pressure values in flow past heated sphere
Each isosurface is several hundred MB
Hardware:
For SMP: Athlon 1600+, dual SMP, 256 KB L2, 1 GB RAM, Linux 2.4
For distributed-memory: Cluster of 4 Pentium 4 2.8 GHz, 512 KB L2, 1 GB RAM, Linux 2.4
Tiling optimisation yields substantial speedup
Modest further speedup from two-way shared-memory parallel
Parallel execution on a four-processor Linux cluster also offers substantial speedup
Isosurface benchmark: cluster of four 2GHz Pentium 4 PCs
Further MayaVi DSI optimisations
Caching: check whether results of this recipe (or part thereof) are available in cache
Multiple frames per second…
Region of Interest (RoI): load from disk only those partitions which intersect a cuboid specified by the user
Level of Detail (LoD): each dataset is stored in full-resolution form but also in a hierarchy of coarsened, decimated versions
Put together… “Google Earth” for global ocean flow
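Two of these optimisations can be sketched in a few lines, under assumptions: a recipe is taken to be a list of `(name, args)` pairs, the partition index is a dictionary of axis-aligned bounding boxes, and `execute` is a placeholder for the expensive library work. None of these names come from the talk.

```python
cache = {}          # recipe -> result; a real cache might be remote or persistent
evaluations = []    # records which recipes were actually executed

def recipe_key(recipe):
    """Make a recipe hashable so it can index the cache."""
    return tuple((name, tuple(args)) for name, args in recipe)

def execute(recipe):
    """Hypothetical expensive evaluation; here it only formats a string."""
    evaluations.append(recipe_key(recipe))
    return " | ".join("%s%r" % (name, tuple(args)) for name, args in recipe)

def run(recipe):
    """Return the cached result for this recipe, evaluating it at most once."""
    key = recipe_key(recipe)
    if key not in cache:
        cache[key] = execute(recipe)
    return cache[key]

def intersects(box_a, box_b):
    """Axis-aligned boxes given as ((xmin, ymin, zmin), (xmax, ymax, zmax))."""
    (a_lo, a_hi), (b_lo, b_hi) = box_a, box_b
    return all(a_lo[i] <= b_hi[i] and b_lo[i] <= a_hi[i] for i in range(3))

# Hypothetical partition index: name -> bounding box.
index = {
    "part0": ((0, 0, 0), (1, 1, 1)),
    "part1": ((1, 0, 0), (2, 1, 1)),
    "part2": ((4, 4, 4), (5, 5, 5)),
}
region_of_interest = ((0.5, 0.0, 0.0), (1.5, 1.0, 1.0))

# Region of Interest: only partitions that touch the cuboid are loaded.
to_load = sorted(p for p, box in index.items() if intersects(box, region_of_interest))

recipe = [("Read", to_load), ("Contour", [0.5])]
first = run(recipe)
second = run(recipe)   # same recipe: served from the cache
```

Keying the cache on the whole recipe is the simplest choice; caching "part thereof", as the slide suggests, would need keys for recipe prefixes as well.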
Large space of possible execution plans for each visualisation task. Choose:
Appropriate parallelisation
recalculate or retrieve from (remote, persistent, peer?) cache
Which intermediate results to save to cache
Partition size
Level of detail (eg to satisfy response-time budget)
Whether to decimate surfaces to fit in graphics RAM
Whether to construct (and cache) index for multiple isosurfaces
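One of these choices, picking a level of detail to satisfy a response-time budget, might be sketched as below. The cost model, the throughput figure and the level sizes are invented for illustration; a real planner would weigh the other choices in the list too.

```python
# Hypothetical level-of-detail hierarchy: (level, number of cells), finest first.
levels = [(0, 8_000_000), (1, 1_000_000), (2, 125_000), (3, 16_000)]

def estimated_seconds(cells, workers, cells_per_second=2_000_000):
    """Crude cost model (an assumption): time scales with cells / workers."""
    return cells / (cells_per_second * workers)

def choose_plan(budget_seconds, workers):
    """Pick the finest level whose estimated time fits the budget."""
    for level, cells in levels:
        cost = estimated_seconds(cells, workers)
        if cost <= budget_seconds:
            return {"level": level, "workers": workers, "seconds": cost}
    # Nothing fits: fall back to the coarsest level.
    level, cells = levels[-1]
    return {"level": level, "workers": workers,
            "seconds": estimated_seconds(cells, workers)}

interactive = choose_plan(budget_seconds=0.2, workers=4)   # tight budget
offline = choose_plan(budget_seconds=10.0, workers=1)      # generous budget
```

With these made-up numbers the interactive plan drops to a coarser level while the offline plan uses full resolution.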
Back to DSI…
When is DSI a good idea? When:
You can’t analyse the client code
The client code is too complex to analyse statically
The client composes library code dynamically
The overheads are small compared to library functions’ execution time
When is it applicable? When:
Execution of library code can be delayed
All dependencies between client and library code are explicit in library API
Library data structures are opaque
How do you implement it? Interpose proxy:
Built by hand
Using generic proxy mechanism based on reflection – as shown in Python
Using IDL-based parameter marshalling
Using aspect weaver (but…)
Show me an example!
We have used the DSI trick several times
So have lots of other people…
MayaVi/Python/VTK
Message fusion and scheduling in parallel programming
Loop fusion in a matrix/vector library
Aggregation of Java RMI (correctness issues are tricky)
What makes DSI hard to implement?
Non-opaque return values: e.g. vector type is opaque, but dot-product returns a non-opaque scalar
Exceptions: delayed execution shifts the point where errors are discovered
Unnecessary force points: e.g. property getter methods
Hidden dependencies: e.g. we can aggregate remote method calls provided none of them results in a callback that can affect the caller JVM
Antidependencies: client overwrites operand of delayed call
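The antidependency problem is easy to reproduce in a toy setting. In the sketch below, `DelayedVectorLib` is a hypothetical DSI for a vector library: `scale` is delayed and `total` is a force point that returns a non-opaque scalar. Copying the operand when the call is recorded is one simple remedy, assumed here for illustration; forcing execution before the overwrite is another.

```python
import copy

class DelayedVectorLib:
    """Toy DSI for a vector library: scale() is delayed, total() forces."""

    def __init__(self, snapshot_operands):
        self.snapshot_operands = snapshot_operands
        self.recipe = []

    def scale(self, vector, factor):
        # Delayed call: remember the operand instead of computing now.
        operand = copy.copy(vector) if self.snapshot_operands else vector
        self.recipe.append((operand, factor))

    def total(self):
        # Force point: returns a non-opaque scalar, so the recipe must run.
        result = sum(x * factor for operand, factor in self.recipe for x in operand)
        self.recipe = []
        return result

def client(lib):
    v = [1.0, 2.0, 3.0]
    lib.scale(v, 10.0)     # delayed
    v[0] = 100.0           # client overwrites the operand of the delayed call
    return lib.total()     # eager semantics would give 60.0

naive = client(DelayedVectorLib(snapshot_operands=False))
safe = client(DelayedVectorLib(snapshot_operands=True))
```

The naive version sees the overwritten operand and returns a different total from the eager program; the snapshotting version preserves the eager result at the cost of a copy.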
Conclusions/discussion
DSI is not new, but it just keeps popping up and solves tricky problems
DSI programs are program generators: type safety of the recipe derives from type safety of the client (so DSI interpreter could be tagless)
Safety of optimising transformations is another matter…
DSIs can be JITs: e.g. our C++ matrix/vector library uses a multistage programming library to generate C loops at runtime (and fuse them)
There is a useful catalogue of techniques to enhance DSI applicability, overheads etc
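The "DSIs can be JITs" point can be imitated in Python, as a loose analogue rather than the C++ library mentioned above: a recipe of elementwise vector operations is turned into the source of one fused loop, which is then compiled at runtime with `exec`. The recipe encoding (expression strings over `x0`, `x1`, … and `t0`, `t1`, …) is an assumption made for this sketch.

```python
def generate_fused_source(steps, n_inputs):
    """Emit source for one loop applying every delayed elementwise step.

    `steps` are expression strings over x0, x1, ... (input elements) and
    t0, t1, ... (results of earlier steps); the last step is the output.
    """
    params = ", ".join("v%d" % i for i in range(n_inputs))
    lines = ["def fused(%s):" % params,
             "    out = []",
             "    for i in range(len(v0)):"]
    for i in range(n_inputs):
        lines.append("        x%d = v%d[i]" % (i, i))
    for j, expr in enumerate(steps):
        lines.append("        t%d = %s" % (j, expr))
    lines.append("        out.append(t%d)" % (len(steps) - 1))
    lines.append("    return out")
    return "\n".join(lines)

def compile_fused(steps, n_inputs):
    """Compile the generated source at runtime and return the function."""
    namespace = {}
    exec(generate_fused_source(steps, n_inputs), namespace)
    return namespace["fused"]

# Recipe for (a + b) * 2 - a, recorded as three delayed vector operations.
steps = ["x0 + x1", "t0 * 2", "t1 - x0"]
fused = compile_fused(steps, n_inputs=2)

a, b = [1, 2, 3], [10, 20, 30]
result = fused(a, b)

# Unfused reference: one pass and one temporary vector per operation.
t0 = [x + y for x, y in zip(a, b)]
t1 = [x * 2 for x in t0]
reference = [x - y for x, y in zip(t1, a)]
```

The fused version makes a single pass and allocates no intermediate vectors, which is the benefit loop fusion is after.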
Related stuff…
Lazy evaluation – with reflection
Template metaprogramming – encode recipe in type
Proxy interposition trick is common in dynamically-typed languages:
Redefining the lookup function in Common Lisp
The “doesNotUnderstand: hack” in Smalltalk
The idea of converting a call to a message… Message-Oriented Programming: The Case for First Class Messages (Dave Thomas, JOT 2004)
Tomasulo-style renaming to prevent antidependences from forcing execution
Compare with explicit recipe construction: workflow systems, command objects, LINQ