Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries...

12
Spiral Computer Generation of Performance Libraries José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team Applications Platforms Performance

Transcript of Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries...

Page 1: Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries High-Performance Library “FFTW-like” Spiral Library Generator Input: Transform:

Carnegie Mellon

SpiralComputer Generation ofPerformance LibrariesJosé M. F. MouraMarkus PüschelFranz Franchetti& the Spiral Team

Applications

Platforms

Performance

Page 2: Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries High-Performance Library “FFTW-like” Spiral Library Generator Input: Transform:

Carnegie Mellon

What is Spiral?

Traditionally Spiral Approach

High performance libraryoptimized for given platform

Spiral

High performance libraryoptimized for given platform

Comparable performance

Page 3: Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries High-Performance Library “FFTW-like” Spiral Library Generator Input: Transform:

Carnegie Mellon

Main Idea: Program Generation

νpμ

Architectural parameter:Vector length, #processors, …

rewritingdefines

Kernel: problem size, algorithm choice

picksearch

abstraction abstraction

Model: common abstraction= spaces of matching formulas

architecturespace

algorithmspace

optimization

Page 4: Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries High-Performance Library “FFTW-like” Spiral Library Generator Input: Transform:

Carnegie Mellon

How Spiral Works

Algorithm Generation

Algorithm Optimization

Implementation

Code Optimization

Compilation

Compiler Optimizations

Problem specification (“DFT 1024” or “DFT”)

algorithm

C code

Fast executable

performance

Sear

ch

controls

controls

Spiral

Complete automation of the implementation and optimization task

Basic ideas: • Declarative representation

of algorithms

• Rewriting systems to generate and optimize algorithms at a high level of abstraction

Page 5: Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries High-Performance Library “FFTW-like” Spiral Library Generator Input: Transform:

Carnegie Mellon

Algorithms: Rules in Domain Specific LanguageViterbi DecodingLinear Transforms

Matrix-Matrix Multiplication Synthetic Aperture Radar (SAR)

interpolation 2D iFFTmatched filtering

preprocessing

convolutionalencoder

Viterbidecoder

010001 11 10 00 01 10 01 11 00 01000111 10 01 01 10 10 11 00

= £

£

Page 6: Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries High-Performance Library “FFTW-like” Spiral Library Generator Input: Transform:

Carnegie Mellon

One Approach for all Types of Parallelism

Multithreading (Multicore)

Vector SIMD (SSE, VMX/Altivec,…)

Message Passing (Clusters, MPP)

Streaming/multibuffering (Cell)

Graphics Processors (GPUs)

Gate-level parallelism (FPGA)

HW/SW partitioning (CPU + FPGA)

Page 7: Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries High-Performance Library “FFTW-like” Spiral Library Generator Input: Transform:

Carnegie Mellon

Example: Code Generation for Multicore CPUs

Tensor product: embarrassingly parallel operator

A

A

A

A

x y

Processor 0

Processor 1

Processor 2

Processor 3

Permutation: problematic; may produce false sharing

x y

Hardware abstraction: shared cache with cache lines

Page 8: Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries High-Performance Library “FFTW-like” Spiral Library Generator Input: Transform:

Carnegie Mellon

Spiral: Meta-Tool to Autotuning Libraries

High-Performance Library “FFTW-like”

Spiral Library Generator

Input: Transform:

Algorithms:

Vectorization: 2-way SSE

Threading: Yes

Output: Optimized library (10,000 lines of C++)

For general input size (not collection of fixed sizes)

Vectorized

Multithreaded

With runtime adaptation mechanism

Performance competitive with hand-written code

Page 9: Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries High-Performance Library “FFTW-like” Spiral Library Generator Input: Transform:

Carnegie Mellon

Verify algorithms symbolically

Check rules through verification of instances

Check code empirically

Verification and Testing

= ?

= ?

= ? DFT4([0,1,0,0])

DFT4_rnd([0.1,1.77,2.28,-55.3]))DFT4([0.1,1.77,2.28,-55.3]) = ?

Page 10: Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries High-Performance Library “FFTW-like” Spiral Library Generator Input: Transform:

Carnegie Mellon

Samsung i9100 Galaxy S IIDual-core ARM at 1.2GHz with NEON ISA

SIMD vectorization + multi-threading

Range: Cell Phone To Supercomputer

G. Almási, B. Dalton, L. L. Hu, F. Franchetti, Y. Liu, A. Sidelnik, T. Spelce, I. G. Tānase, E. Tiotto, Y. Voronenko, X. Xue:2010 IBM HPC Challenge Class II Submission. Winner of the 2010 HPC Challenge Class II Award (Most Productive System).

Global FFT (1D FFT, HPC Challenge)performance [Gflop/s]

BlueGene/P at Argonne National Laboratory128k cores (quad-core CPUs) at 850 MHz

SIMD vectorization + multi-threading + MPI

6.4 Tflop/s

BlueGene/P

Page 11: Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries High-Performance Library “FFTW-like” Spiral Library Generator Input: Transform:

Carnegie Mellon

More Results: Spiral Outperforms Humans

FFT on Multicore

FFT on FPGA

SAR

SDR

improvement

Page 12: Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries High-Performance Library “FFTW-like” Spiral Library Generator Input: Transform:

Carnegie Mellon

More Information:www.spiral.netwww.spiralgen.com