Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries...

Carnegie Mellon

SpiralComputer Generation ofPerformance LibrariesJosé M. F. MouraMarkus PüschelFranz Franchetti& the Spiral Team

Applications

Platforms

Performance

Carnegie Mellon

What is Spiral?

Traditionally Spiral Approach

High performance libraryoptimized for given platform

Spiral

High performance libraryoptimized for given platform

Comparable performance

Carnegie Mellon

Main Idea: Program Generation

Architectural parameter:Vector length, #processors, …

rewritingdefines

Kernel: problem size, algorithm choice

picksearch

abstraction abstraction

Model: common abstraction= spaces of matching formulas

architecturespace

algorithmspace

optimization

Carnegie Mellon

How Spiral Works

Algorithm Generation

Algorithm Optimization

Implementation

Code Optimization

Compilation

Compiler Optimizations

Problem specification (“DFT 1024” or “DFT”)

algorithm

C code

Fast executable

performance

controls

Spiral

Complete automation of the implementation and optimization task

Basic ideas: • Declarative representation

of algorithms

• Rewriting systems to generate and optimize algorithms at a high level of abstraction

Carnegie Mellon

Algorithms: Rules in Domain Specific LanguageViterbi DecodingLinear Transforms

Matrix-Matrix Multiplication Synthetic Aperture Radar (SAR)

interpolation 2D iFFTmatched filtering

preprocessing

convolutionalencoder

Viterbidecoder

010001 11 10 00 01 10 01 11 00 01000111 10 01 01 10 10 11 00

Carnegie Mellon

One Approach for all Types of Parallelism

Multithreading (Multicore)

Vector SIMD (SSE, VMX/Altivec,…)

Message Passing (Clusters, MPP)

Streaming/multibuffering (Cell)

Graphics Processors (GPUs)

Gate-level parallelism (FPGA)

HW/SW partitioning (CPU + FPGA)

Carnegie Mellon

Example: Code Generation for Multicore CPUs

Tensor product: embarrassingly parallel operator

Processor 0

Processor 1

Processor 2

Processor 3

Permutation: problematic; may produce false sharing

Hardware abstraction: shared cache with cache lines

Carnegie Mellon

Spiral: Meta-Tool to Autotuning Libraries

High-Performance Library “FFTW-like”

Spiral Library Generator

Input: Transform:

Algorithms:

Vectorization: 2-way SSE

Threading: Yes

Output: Optimized library (10,000 lines of C++)

For general input size (not collection of fixed sizes)

Vectorized

Multithreaded

With runtime adaptation mechanism

Performance competitive with hand-written code

Carnegie Mellon

Verify algorithms symbolically

Check rules through verification of instances

Check code empirically

Verification and Testing

= ? DFT4([0,1,0,0])

DFT4_rnd([0.1,1.77,2.28,-55.3]))DFT4([0.1,1.77,2.28,-55.3]) = ?

Carnegie Mellon

Samsung i9100 Galaxy S IIDual-core ARM at 1.2GHz with NEON ISA

SIMD vectorization + multi-threading

Range: Cell Phone To Supercomputer

G. Almási, B. Dalton, L. L. Hu, F. Franchetti, Y. Liu, A. Sidelnik, T. Spelce, I. G. Tānase, E. Tiotto, Y. Voronenko, X. Xue:2010 IBM HPC Challenge Class II Submission. Winner of the 2010 HPC Challenge Class II Award (Most Productive System).

Global FFT (1D FFT, HPC Challenge)performance [Gflop/s]

BlueGene/P at Argonne National Laboratory128k cores (quad-core CPUs) at 850 MHz

SIMD vectorization + multi-threading + MPI

6.4 Tflop/s

BlueGene/P

Carnegie Mellon

More Results: Spiral Outperforms Humans

FFT on Multicore

FFT on FPGA

improvement

Carnegie Mellon

More Information:www.spiral.netwww.spiralgen.com

Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries...

Documents

Transcript of Spiral › doc › slides › spiral-10slides.pdf · Spiral: Meta-Tool to Autotuning Libraries...

for version 3.3.10, 10 December 2020 - FFTW

Autotuning Scientific Kernels on Multicore Systems · 2012. 9. 7. · Autotuning Scientific Kernels on Multicore Systems Sam Williams*§, Jonathan Carter*, James Demmel§, Leonid

Fftw Paper Ieee

Autotuning Large Computational Chemistry Codes

Autotuning in Web100

Autotuning the Intel HLS Compiler using the Opentuner ...

PetaBricks: A Language and Compiler based on Autotuning

Autotuning in High- Performance Computing Applications

pOSKI: An Extensible Autotuning Framework to Perform ...parlab.eecs.berkeley.edu/sites/all/parlab/files/pOSKI- AN Extensible... · pOSKI: An Extensible Autotuning Framework to Perform

Parallelism in Spiral - Rice Universitycscads.rice.edu/Franchetti.pdfSpiral DCT4 FFTW 3.1.2 DCT4 (k11) IPP 5.1 DCT2 (?) novel algorithm (algebraic algorithm theory) Carnegie Mellon

Circulator Autotuning

Questions Big Autotuning: The

A Survey on Compiler Autotuning using Machine LearningAdditional Key Words and Phrases: Compilers, Autotuning, Machine Learning, Phase ordering, Optimizations, Application Characterization

Efficient Autotuning of Hyperparameters in Approximate ...

USING AUTOTUNING FOR ACCELERATING TENSOR …

The secrets of FFTW: the Fastest Fourier Transform in the West

Autotuning (1/2) - Vuduc

Identiﬁcation and Autotuning of Temperature-Cont

A Learning-Based Recommender System for Autotuning …

Autotuning Programs with Algorithmic Choice - MIT CSAILgroups.csail.mit.edu/commit/papers/2014/ansel-phd-thesis-slides.pdf · Introduction PetaBricks OpenTuner Conclusions Autotuning

Autotuning Scientific Kernels on Multicore Systems · 2012. 9. 7. · Autotuning Scientific Kernels on Multicore Systems Sam Williams§, Jonathan Carter, James Demmel§, Leonid