
SEJITS: Raising the Abstraction Level of Productivity Programming

Armando Fox, Bryan Catanzaro, Shoaib Kamil, Yunsup Lee, Krste Asanovic, Kurt Keutzer, Dave Patterson

Layering vs. “Stovepiping”

• Layering: one or a few common intermediate languages
  • Must be flexible enough to support many DSLs
  • And map to a wide variety of HW
• Stovepiping: specialize structural computation patterns (motifs, not domains) directly to HW

Other Opportunities

• Autotuning
  • SEJITS can intercept calls and substitute autotuned code (see PySKI; sketched below)
  • Locus of control for making co-tuning decisions
• Cloud computing
  • Generate Hadoop (Java) code expressing the PLL computation as MapReduce
  • Generate code for multiple cloud frameworks

Early case study: Python + CUDA

[Figure: “Traditional Layers” vs. “SEJITS Stovepipes”. The traditional stack maps app domains (virtual worlds, data visualization, robotics, music) through computation domains (rendering, probabilistic, physics, linear algebra), a common language substrate, and a thick runtime onto the runtime & OS and hardware (OOO, GPU, SIMD, FPGA, ???). The SEJITS stack maps applications through motifs/patterns (sparse matrix, dense matrix, stencil) and a thin runtime directly onto the same hardware.]

Leveraging Efficiency-Layer Research

Efficiency programmers and autotuner writers target computation patterns to hardware:

• stencil/SIMD codes => GPUs
• sparse matrix => communication-avoiding algorithms on multicore
• “big finance” Monte Carlo simulation => MapReduce

Libraries? Useful, but they don’t raise the abstraction level. How to make ELL work accessible to more PLL programmers?

SEJITS in a nutshell: Selective, Embedded Just-in-Time Specialization

• Productivity programmers write in a general-purpose, modern, high-level PLL
• SEJITS infrastructure specializes computation patterns selectively at runtime
• Specialization uses runtime info to generate and JIT-compile ELL code targeted to the hardware
• Embedded because the PLL’s own machinery enables it (vs. extending the PLL interpreter)
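From the productivity programmer's side, the whole mechanism can hide behind a single annotation. A minimal sketch, with an invented @specialize decorator standing in for the SEJITS machinery:

    def specialize(fn):
        # Stand-in for the SEJITS machinery: here it returns fn unchanged;
        # the real infrastructure would inspect fn at call time and may swap
        # in JIT-compiled ELL code (see the step-by-step sketch below).
        return fn

    @specialize
    def weighted_sum(xs):
        # Ordinary high-level Python; only annotated functions are candidates.
        return sum(0.2 * x for x in xs)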

[Figure: SEJITS workflow. A productivity app (.py) running on the PLL interpreter has a function annotated @h(); the SEJITS specializer emits efficiency-language code (.c), which cc/ld compiles into a shared library (.so) that is called and cached ($), with performance counters feeding a history. Numbered arrows 1–5 correspond to the steps listed below.]

SEJITS Exploits Productivity-Level Language (PLL) Mechanisms

1. Some functions in the productivity app are annotated as potentially specializable
2. SEJITS intercepts calls using dynamic language features and uses introspection to examine the function's Abstract Syntax Tree (AST); if the AST contains a function call or pattern known in the local catalog, the specializer is invoked and handed the AST
3. The specializer generates source code in an efficiency language (C, OpenMP, CUDA, ...), compiles & links it
4. The specialized function binary is called, and results are returned to the productivity language
5. (Optional) Performance is recorded and the code cached for future calls

Selective specialization: if any step fails, fall back to the PLL (no need to JIT or specialize the whole app). Embedded: the SEJITS machinery uses PLL features; no need to modify or extend the PLL interpreter.
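The five steps can be sketched with ordinary Python machinery (decorators, the ast and inspect modules, ctypes). This is a hedged illustration, not the ParLab implementation: the one-entry "catalog", the canned C source, and a cc compiler on the PATH are all assumptions made here.

    import ast, ctypes, inspect, os, subprocess, tempfile

    # Step 3's output for the one pattern this toy catalog knows (a
    # 0.2-weighted accumulation loop); a real specializer would generate
    # this from the AST rather than use a fixed template.
    C_SOURCE = """
    double scale_sum(double *xs, int n) {
        double acc = 0.0;
        for (int i = 0; i < n; i++) acc += 0.2 * xs[i];
        return acc;
    }
    """

    def specializable(fn):                      # Step 1: annotation
        cache = {}
        def call(xs):
            if "lib" not in cache:
                try:
                    # Step 2: introspect the AST for a cataloged pattern.
                    tree = ast.parse(inspect.getsource(fn))
                    if not any(isinstance(n, ast.For) for n in ast.walk(tree)):
                        raise LookupError("pattern not in catalog")
                    # Step 3: emit ELL source, compile & link it.
                    d = tempfile.mkdtemp()
                    src, so = os.path.join(d, "s.c"), os.path.join(d, "s.so")
                    with open(src, "w") as f:
                        f.write(C_SOURCE)
                    subprocess.check_call(
                        ["cc", "-O2", "-shared", "-fPIC", src, "-o", so])
                    lib = ctypes.CDLL(so)
                    lib.scale_sum.restype = ctypes.c_double
                    cache["lib"] = lib          # Step 5: cache for future calls
                except Exception:
                    cache["lib"] = None         # selective: fall back to PLL
            if cache["lib"] is None:
                return fn(xs)
            arr = (ctypes.c_double * len(xs))(*xs)
            return cache["lib"].scale_sum(arr, len(xs))   # Step 4
        return call

    @specializable
    def scale_sum(xs):
        acc = 0.0
        for x in xs:
            acc += 0.2 * x
        return acc

If any step throws (no compiler available, pattern not in the catalog), the call silently reverts to the pure-Python body, which is exactly the "selective" property above.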


Conclusions

• Enables a code-generation strategy per-function, not per-app
• Uniform approach to productive programming: same app on cloud, multicore, autotuned libraries
• Research enabler: incrementally develop specializers for different motifs and prototype HW; no need for a full compiler & toolchain just to get started

Subverting PLL Mechanisms

Observation: mechanisms intended to promote reuse also enable SEJITS.

• Metaprogramming: generate & JIT-compile efficiency code to replace PLL code for this function; make decisions at runtime based on available HW, argument values, etc., vs. “static” autotuning
• Introspection: intercept & analyze a function to see if it can be specialized; extend the PLL without modifying the interpreter
• Higher-order programming: patterns at higher levels of abstraction capture reusable motifs as well as low-level functions (see the sketch after this list)
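To illustrate the higher-order point, here is a sketch of the category-reduce motif that appears in the image-processing case study below; the function name and signature are invented for illustration.

    def category_reduce(categories, values, combine):
        # Higher-order motif: reduce values that share a category key.
        # Because the pattern has a name, a specializer can replace this
        # whole call with e.g. a CUDA implementation, compiling only
        # `combine` rather than re-deriving the pattern from arbitrary loops.
        out = {}
        for key, v in zip(categories, values):
            out[key] = combine(out[key], v) if key in out else v
        return out

    # Example: a histogram is a category-reduce with addition.
    hist = category_reduce([0, 1, 0, 2], [1, 1, 1, 1], lambda a, b: a + b)
    # -> {0: 2, 1: 1, 2: 1}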

Status, Ongoing Work, Challenges

• Prototypes working for NVidia, x86 multicore, RAMP (SPARC v8)
• Generalize infrastructure for catalog, pattern matching, call-site annotation, history
• Integrate with PySKI/autotuning
• Cloud computing: integrate with Nexus
• Cloud/multicore synergy: specialize intra-node as well as generate cloud code
• Capture additional motifs as specializers

Ruby => OpenMP on multicore x86 (S. Kamil)

• ~1000-2000x faster than pure Ruby
• Minimal per-call overhead at runtime

Python => NVidia GPU (B. Catanzaro, Y. Lee)

• Stencils & category-reduce (image processing)
• Python decorators denote specializable functions (sketched below)
• ~1000x faster than pure Python
• 3x-12x slower than handcrafted CUDA (including specialization overhead)
• Overheads: naive code generation & caching, type propagation, CUDA compilation, data marshalling

The productivity programmer only writes Python/Ruby, not CUDA or OpenMP.
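A hedged sketch of what "decorators denote specializable functions" might look like for a stencil; @stencil_kernel and the pure-Python body are illustrative stand-ins, not the prototype's actual API.

    def stencil_kernel(fn):
        fn.specializable = True   # mark for the SEJITS interceptor
        return fn

    @stencil_kernel
    def laplacian(in_grid, out_grid, n):
        # Pure-Python semantics of a 2D Laplacian stencil on n-by-n nested
        # lists; a specializer would recognize this pattern and emit CUDA.
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                out_grid[i][j] = 0.2 * (in_grid[i - 1][j] + in_grid[i + 1][j]
                                        + in_grid[i][j - 1] + in_grid[i][j + 1])

The Ruby prototype expresses the same idea through subclassing, as in the pair of listings below.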

Productivity language code (Ruby):

    class LaplacianKernel < Kernel
      def kernel(in_grid, out_grid)
        in_grid.each_interior do |point|
          in_grid.neighbors(point, 1).each do |x|
            out_grid[point] += 0.2 * x.val
          end
        end
      end
    end

Efficiency language code (generated C/OpenMP):

    VALUE kern_par(int argc, VALUE* argv, VALUE self) {
      unpack_arrays into in_grid and out_grid;

      #pragma omp parallel for default(shared) private(t_6,t_7,t_8)
      for (t_8=1; t_8<256-1; t_8++) {
        for (t_7=1; t_7<256-1; t_7++) {
          for (t_6=1; t_6<256-1; t_6++) {
            int center = INDEX(t_6,t_7,t_8);
            out_grid[center] = (out_grid[center]
                + (0.2*in_grid[INDEX(t_6-1,t_7,t_8)]));
            ...
            out_grid[center] = (out_grid[center]
                + (0.2*in_grid[INDEX(t_6,t_7,t_8+1)]));
          }
        }
      }
      return Qtrue;
    }