Spiral · How Spiral Works Algorithm Generation Algorithm Optimization Implementation Code...

Spiral Computer Generation of Performance Libraries José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team Applications Platforms Performance


Page 1:

Carnegie Mellon

Spiral: Computer Generation of Performance Libraries
José M. F. Moura, Markus Püschel, Franz Franchetti & the Spiral Team

Applications

Platforms

Performance

Page 2:

What is Spiral?

Traditionally:
High-performance library optimized for given platform

Spiral approach:
Spiral → High-performance library optimized for given platform

Comparable performance

Page 3:

Main Idea: Program Generation

Architecture space: architectural parameters (ν, p, μ): vector length, #processors, …

Algorithm space: kernel: problem size, algorithm choice

Model: common abstraction = spaces of matching formulas

[Diagram labels: abstraction (architecture space), abstraction (algorithm space), rewriting, defines, search, pick, optimization]

Page 4:

How Spiral Works

Problem specification (“DFT 1024” or “DFT”)

→ Algorithm Generation, Algorithm Optimization → algorithm

→ Implementation, Code Optimization → C code

→ Compilation, Compiler Optimizations → Fast executable

→ performance → Search → controls (two feedback arrows into the Spiral stages)

Complete automation of the implementation and optimization task

Basic ideas:

• Declarative representation of algorithms

• Rewriting systems to generate and optimize algorithms at a high level of abstraction
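
To make the two basic ideas concrete, here is a minimal sketch, not Spiral's actual rule language. It assumes the standard Cooley-Tukey factorization in tensor-product notation, DFT_km → (DFT_k ⊗ I_m) T (I_k ⊗ DFT_m) L, as the only rewrite rule, and it treats sizes with no nontrivial factorization as base-case kernels. The function names and the string encoding of formulas are hypothetical, chosen only for illustration.

```python
from functools import lru_cache


def factor_pairs(n):
    """All ways to write n = k * m with 1 < k < n."""
    return [(k, n // k) for k in range(2, n) if n % k == 0]


@lru_cache(maxsize=None)
def expand(n):
    """Return every fully expanded formula for DFT_n as a tuple of strings.

    Assumed rewrite rule (Cooley-Tukey, for n = k * m):
        DFT_n -> (DFT_k (x) I_m) T(n,m) (I_k (x) DFT_m) L(n,k)
    Sizes without a nontrivial factorization are kept as kernels "DFTn".
    """
    formulas = []
    for k, m in factor_pairs(n):
        for left in expand(k):
            for right in expand(m):
                formulas.append(
                    f"({left} (x) I{m}) T({n},{m}) (I{k} (x) {right}) L({n},{k})"
                )
    return tuple(formulas) if formulas else (f"DFT{n}",)


@lru_cache(maxsize=None)
def count(n):
    """Number of formulas expand(n) would return, without building the strings."""
    pairs = factor_pairs(n)
    if not pairs:
        return 1
    return sum(count(k) * count(m) for k, m in pairs)


if __name__ == "__main__":
    for formula in expand(8):
        print(formula)
    print("alternatives for DFT 1024:", count(1024))
```

Each formula is a declarative description of one algorithm; applying the rule in every possible way enumerates the algorithm space that a search could then choose from.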

Page 5:

Algorithms: Rules in Domain Specific Language

Linear Transforms

Viterbi Decoding: convolutional encoder, Viterbi decoder

Matrix-Matrix Multiplication

Synthetic Aperture Radar (SAR): preprocessing, matched filtering, interpolation, 2D iFFT

Page 6:

One Approach for all Types of Parallelism

Multithreading (Multicore)

Vector SIMD (SSE, VMX/Altivec,…)

Message Passing (Clusters, MPP)

Streaming/multibuffering (Cell)

Graphics Processors (GPUs)

Gate-level parallelism (FPGA)

HW/SW partitioning (CPU + FPGA)

Page 7:

Example: Code Generation for Multicore CPUs

Tensor product: embarrassingly parallel operator

[Figure: block-diagonal matrix with four copies of A mapping x to y, one block each for Processor 0, Processor 1, Processor 2, Processor 3]

Permutation: problematic; may produce false sharing

[Figure: permutation mapping x to y]

Hardware abstraction: shared cache with cache lines
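
The contrast between the tensor product and the permutation can be sketched as follows. This is an illustrative sketch only, not code generated by Spiral: the helper names, the ownership lists, and the cache line length `mu` are assumptions, and Python threads are used only to show that the blocks are independent, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec(A, x):
    """Dense matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def apply_block_diagonal(A, x, p):
    """Compute y = (I_p (x) A) x: apply A independently to p contiguous chunks."""
    n = len(A)
    assert len(x) == p * n
    chunks = [x[i * n:(i + 1) * n] for i in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        results = list(pool.map(lambda chunk: matvec(A, chunk), chunks))
    return [v for chunk in results for v in chunk]


def stride_permutation(x, stride):
    """Read x at the given stride: x[0], x[s], x[2s], ..., then x[1], x[1+s], ..."""
    return [x[i] for start in range(stride) for i in range(start, len(x), stride)]


def shared_cache_lines(owner, mu):
    """Count cache lines (mu consecutive elements) written by more than one processor.

    owner[i] is the (hypothetical) processor that writes output element i.
    """
    lines = [set(owner[i:i + mu]) for i in range(0, len(owner), mu)]
    return sum(1 for writers in lines if len(writers) > 1)


if __name__ == "__main__":
    A = [[1, 1], [1, -1]]
    print(apply_block_diagonal(A, [1, 2, 3, 4, 5, 6, 7, 8], p=4))
    print(stride_permutation(list(range(8)), 2))
    # Contiguous ownership (tensor product) versus interleaved ownership.
    print(shared_cache_lines([0, 0, 1, 1, 2, 2, 3, 3], mu=2))
    print(shared_cache_lines([0, 1, 2, 3, 0, 1, 2, 3], mu=2))
```

With contiguous blocks each processor writes only its own cache lines; when ownership is interleaved, as a permutation can cause, several processors write into the same line, which is the false-sharing hazard noted above.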

Page 8:

Spiral: Meta-Tool to Autotuning Libraries

High-Performance Library “FFTW-like”

Spiral Library Generator

Input: Transform:

Algorithms:

Vectorization: 2-way SSE

Threading: Yes

Output: Optimized library (10,000 lines of C++)

For general input size (not collection of fixed sizes)

Vectorized

Multithreaded

With runtime adaptation mechanism

Performance competitive with hand-written code

Page 9:

Verification and Testing

Verify algorithms symbolically

Check rules through verification of instances

Check code empirically

[Figure: equality checks ("= ?") between formulas and code, including a check against DFT4([0,1,0,0])]

DFT4_rnd([0.1,1.77,2.28,-55.3]) = ? DFT4([0.1,1.77,2.28,-55.3])
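
An empirical check of this kind can be sketched as below. The sketch is an assumption about what such a test might look like, not Spiral's test harness: `dft4_fast` is a hand-written stand-in for generated code, `dft_definition` is the reference computed directly from the DFT definition, and all helper names and the tolerance are hypothetical.

```python
import cmath
import random


def dft_definition(x):
    """Reference DFT computed directly from the definition (O(n^2))."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft4_fast(x):
    """Size-4 DFT using two butterfly stages; stands in for generated code."""
    a, b = x[0] + x[2], x[0] - x[2]
    c, d = x[1] + x[3], x[1] - x[3]
    return [a + c, b - 1j * d, a - c, b + 1j * d]


def close(u, v, tol=1e-9):
    """True if two vectors have equal length and agree entrywise within tol."""
    return len(u) == len(v) and all(abs(a - b) <= tol for a, b in zip(u, v))


def check_basis_vectors(impl, n):
    """Apply impl to every unit vector, which compares the matrix column by column."""
    for j in range(n):
        e = [0.0] * n
        e[j] = 1.0
        if not close(impl(e), dft_definition(e)):
            return False
    return True


def check_random(impl, n, trials=10, seed=0):
    """Compare impl against the reference on random real inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-100.0, 100.0) for _ in range(n)]
        if not close(impl(x), dft_definition(x)):
            return False
    return True


if __name__ == "__main__":
    print(check_basis_vectors(dft4_fast, 4))
    print(check_random(dft4_fast, 4))
    x = [0.1, 1.77, 2.28, -55.3]
    print(close(dft4_fast(x), dft_definition(x)))
```

Checking all unit vectors is exhaustive for a linear map up to rounding error, while random inputs are a cheaper spot check that scales to large sizes.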

Page 10:

Range: Cell Phone To Supercomputer

Samsung i9100 Galaxy S II
Dual-core ARM at 1.2GHz with NEON ISA
SIMD vectorization + multi-threading

BlueGene/P at Argonne National Laboratory
128k cores (quad-core CPUs) at 850 MHz
SIMD vectorization + multi-threading + MPI

[Plot: Global FFT (1D FFT, HPC Challenge), performance [Gflop/s] on BlueGene/P; 6.4 Tflop/s]

G. Almási, B. Dalton, L. L. Hu, F. Franchetti, Y. Liu, A. Sidelnik, T. Spelce, I. G. Tānase, E. Tiotto, Y. Voronenko, X. Xue: 2010 IBM HPC Challenge Class II Submission. Winner of the 2010 HPC Challenge Class II Award (Most Productive System).

Page 11:

More Results: Spiral Outperforms Humans

FFT on Multicore

FFT on FPGA

SAR

SDR

improvement

Page 12:

More Information:
www.spiral.net
www.spiralgen.com