The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

82
synergy.cs.vt .edu The Hardware-Software Co-Design Process for the fast Fourier transform (FFT) Carlo C. del Mundo Advisor: Prof. Wu-chun Feng

description

The Hardware-Software Co-Design Process for the fast Fourier transform (FFT). Carlo C. del Mundo Advisor: Prof. Wu- chun Feng. The Multi- and Many-core Menace. - PowerPoint PPT Presentation

Transcript of The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

Page 1: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.edu

The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)Carlo C. del MundoAdvisor: Prof. Wu-chun Feng

Page 2: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

The Multi- and Many-core Menace• “...when we start talking about parallelism and

ease of use of truly parallel computers, we’re talking about a problem that’s as hard as any that computer science has faced. ...I would be panicked if I were in industry.”

John Hennessy, Stanford UniversityAuthor of Computer Architecture: A Quantitative Approach

Page 3: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Berkeley’s View

Page 4: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Dwarfs of Symbolic Computation1,2

• Dwarf (noun): An algorithmic method that captures a pattern of computation and communication

2 Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.

1 Colella, Phillip. Defining software requirements for scientific computing. 2004. www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf.

Page 5: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Dwarfs of Symbolic Computation1,2

• Dwarf (noun): An algorithmic method that captures a pattern of computation and communication

Structured Grid

UnstructuredGrid

SpectralMethods

2 Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.

Dense Linear

Algebra

Sparse Linear

Algebra

N-BodyMethods

MapReduce

Graphical Models

Combinational Logic

1 Colella, Phillip. Defining software requirements for scientific computing. 2004. www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf.

Graph Traversal Dynamic Programming

Branch-and-Bound

Finite State Machines

Page 6: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Dwarfs of Symbolic Computation1,2

• Dwarf (noun): An algorithmic method that captures a pattern of computation and communication

Structured Grid

UnstructuredGrid

SpectralMethods

2 Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.

Dense Linear

Algebra

Sparse Linear

Algebra

N-BodyMethods

Combinational Logic

1 Colella, Phillip. Defining software requirements for scientific computing. 2004. www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf.

Graph Traversal Dynamic Programming

Branch-and-Bound

Finite State Machines

MapReduce

Graphical Models

Page 7: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Dwarfs of Symbolic Computation1,2

• Dwarf (noun): An algorithmic method that captures a pattern of computation and communication

SpectralMethods

2 Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.

1 Colella, Phillip. Defining software requirements for scientific computing. 2004. www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf.

Fast Fourier Transfor

m

Software Hardware

HardwareSoftware /

Boundary

Page 8: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Dwarfs of Symbolic Computation1,2

• Dwarf (noun): An algorithmic method that captures a pattern of computation and communication

SpectralMethods

2 Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.

1 Colella, Phillip. Defining software requirements for scientific computing. 2004. www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf.

Fast Fourier Transfor

m

Software Hardware

HardwareSoftware /

Boundary

Page 9: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

OpenFFT: A heterogeneous FFT library

C. del Mundo and W. Feng. “Towards a Performance-Portable FFT Library for Heterogeneous Computing,” in IEEE IPDPS ‘13. Phoenix, AZ, USA, May 2014. (Under review.)C. del Mundo and W. Feng. “Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture”, SC|13, Denver, CO, Nov. 2013. (Poster publication)C. del Mundo et al. “Accelerating FFT for Wideband Channelization.” in IEEE ICC ‘13. Budapest, Hungary, June 2013.

Page 10: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Berkeley’s View

Page 11: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Computational Pattern: The Butterfly

Page 12: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

• Dwarf: Spectral Methods– Butterfly Pattern

Computational Pattern

a b

+ -

a + b * wk

a – b * wkFigure 1: Simple Butterfly Figure 2: Butterfly with

Twiddle, wk

a b

+ -

a + b a - b

wk

Page 13: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

16-pt FFT: Stages of ComputationInput Array

S1: “Columns”

S2: “Twiddles”S3: “Transpose”

S1: “Columns”

Page 14: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

16-pt FFT: Computation-to-Core: 1 thread

Page 15: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

16-pt FFT: Computation-to-Core: 4 threads

Page 16: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

16-pt FFT: Computation-to-Core: One Warp

Page 17: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

16-pt FFT: Computation-to-Core: One Block

Page 18: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

16-pt FFT: Computation-to-Core: One SM

37.5% Occupancy on NVIDIA Kepler K20c

Page 19: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

16-pt FFT: Stages of ComputationInput Array

S1: “Columns”

S2: “Twiddles”S3: “Transpose”

S1: “Columns”

Page 20: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

GPU Memory Spaces

Page 21: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Background (GPUs)• GPU Memory

Hierarchy

Page 22: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Background (GPUs)• GPU Memory

Hierarchy– Global Memory

Page 23: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Background (GPUs)• GPU Memory

Hierarchy– Global Memory

Memory Unit

Read Bandwidth (TB/s)

Global 0.17

Table: Memory Read Bandwidth for Radeon HD 6970

Page 24: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Background (GPUs)• GPU Memory

Hierarchy– Global Memory– Image Memory

Memory Unit

Read Bandwidth (TB/s)

L1/L2 Cache 1.35 / 0.45Global 0.17

Table: Memory Read Bandwidth for Radeon HD 6970

Page 25: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Background (GPUs)• GPU Memory

Hierarchy– Global Memory– Image Memory– Constant Memory

Memory Unit

Read Bandwidth (TB/s)

Constant 5.4

L1/L2 Cache 1.35 / 0.45Global 0.17

Table: Memory Read Bandwidth for Radeon HD 6970

Page 26: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Background (GPUs)• GPU Memory

Hierarchy– Global Memory– Image Memory– Constant Memory– Local Memory

Memory Unit

Read Bandwidth (TB/s)

Constant 5.4Local 2.7

L1/L2 Cache 1.35 / 0.45Global 0.17

Table: Memory Read Bandwidth for Radeon HD 6970

Page 27: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Background (GPUs)• GPU Memory

Hierarchy– Global Memory– Image Memory– Constant Memory– Local Memory– Registers

Memory Unit

Read Bandwidth (TB/s)

Registers 16.2Constant 5.4

Local 2.7L1/L2 Cache 1.35 / 0.45

Global 0.17

Table: Memory Read Bandwidth for Radeon HD 6970

Page 28: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Global Data Bandwidth(Bus Traffic)

Page 29: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Global Data Banwidth• Bus Traffic: Bytes transferred from off-chip

to on-chip memory and back.• Suppose we take the FFT of a 128 MB data set

– Minimum bus traffic is 2 x 128 = 256 MB• Load 128 MB (global -> on-chip)• Performs FFT• Store 128 MB (on-chip -> global)

• Bus Traffic and Performance

Page 30: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

16-pt FFT: Computation-to-Core: One Warp• What’s wrong with this access pattern?

Page 31: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

16-pt FFT: Computation-to-Core: One Warp• What’s wrong with this access pattern?

• Scattered Memory Accesses (power-of-2 strides)

• Uncoalesced Memory Access

Page 32: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

16-pt FFT: Stages of ComputationInput Array

S1: “Columns”

S2: “Twiddles”S3: “Transpose”

S1: “Columns”

Page 33: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

System-level optimizations(applicable to any application)

Page 34: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

System-level optimizations(applicable to any application)

1. Register Preloading2. Vector Access/{Vector,Scalar}

Arithmetic3. Constant Memory Usage 4. Dynamic Instruction Reduction5. Memory Coalescing

Page 35: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

S1: Register Preloading• Load to registers first

Without Register Preloading

79 __kernel void FFT16_vanilla(__global float2 *buffer) 80 { 81 int index = …; 82 buffer += index; 83 84 FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);With Register

Preloading 79 __kernel void FFT16_strawberry1(__global float2 *buffer) 80 { 81 int index = …; 82 buffer += index; 83 84 float2 registers[4]; // Explicit Loads 85 for (int i = 0; i < 4; ++i) 87 registers[i] = buffer[4*i]; 88 FFT4_in_order_output(&registers[0], &registers[1], &registers[2], &registers[3]);

Page 36: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

S2: Vector Types• Vector Access (float{2, 4,

8, 16})

a[0]

a[1]

a[2]

a[3]

Page 37: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

S2: Vector Types• Vector Access (float{2, 4,

8, 16})

– Scalar Math (VASM)• float + float

a[0]

a[1]

a[2]

a[3]

Page 38: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

S2: Vector Types• Vector Access (float{2, 4,

8, 16})

– Scalar Math (VASM)• float + float

– Vector Math (VAVM)• floatN + floatN

a[0]

a[1]

a[2]

a[3]

Page 39: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

S3: Constant Memory• Fast cached lookup

for frequently used data

Page 40: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

S3: Constant Memory• Fast cached lookup

for frequently used data

16 __constant float2 twiddles[16] = { (float2)(1.0f,0.0f), (float2) (1.0f,0.0f), (float2)(1.0f,0.0f), (float2)(1.0f,0.0f), ... more sin/cos values};

Without Constant Memory 61 for (int j = 1; j < 4; ++j) 62 { 63 double theta = -2.0 * M_PI * tid * j / 16; 64 float2 twid = make_float2(cos(theta), sin(theta)); 65 result[j] = buffer[j*4] * twid; 66 }

With Constant Memory61 for (int j = 1; j < 4; ++j)62 result[j] = buffer[j*4] *

twiddles[4*j+tid];

Page 41: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

16-pt FFT: Stages of ComputationInput Array

S1: “Columns”

S2: “Twiddles”S3: “Transpose”

S1: “Columns”

Page 42: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Algorithm-level optimizations

(applicable only to FFT)

Page 43: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Algorithm-level optimizations

(applicable only to FFT)

1. Transpose via LM2. Compute/Transpose via LM3. Compute/No Transpose via LM4. Register-to-register transpose

(shuffle)

Page 44: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

A1: Transpose via local1 memory• Via Shared Memory

1 CUDA shared memory == OpenCL local memory

Page 45: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

A1: Transpose via local1 memory• Via Shared Memory

1 CUDA shared memory == OpenCL local memory

Page 46: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

A1: Transpose via local1 memory• Via Shared Memory

• Register to Register (shfl)

1 CUDA shared memory == OpenCL local memory

Page 47: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Algorithm-level optimizationsOriginal Transposed

Page 48: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

1. Naïve Transpose (LM-CM)

Algorithm-level optimizations

Local Memory

t0 t1 t2 t3

Original Transposed

Register File

Page 49: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

1. Naïve Transpose (LM-CM)

Algorithm-level optimizations

Local Memory

t0 t1 t2 t3

Original Transposed

Register File

Page 50: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

1. Naïve Transpose (LM-CM)

Algorithm-level optimizations

Local Memory

t0 t1 t2 t3

Original Transposed

Register File

Page 51: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

1. Naïve Transpose (LM-CM)

Algorithm-level optimizations

Local Memory

t0 t1 t2 t3

Original Transposed

Register File

Page 52: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Mechanics

Page 53: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Mechanics• FFT Transpose can be implemented using

shuffle

Page 54: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Mechanics

Register File

t0 t1 t2 t3

• FFT Transpose can be implemented using shuffle

Page 55: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Mechanics

Register File

t0 t1 t2 t3

• FFT Transpose can be implemented using shuffle

Stage 1: Horizontal

Page 56: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Mechanics

Register File

t0 t1 t2 t3

• FFT Transpose can be implemented using shuffle

Stage 2: Vertical

Page 57: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Mechanics

Register File

t0 t1 t2 t3

• FFT Transpose can be implemented using shuffle

Stage 3: Horizontal

Page 58: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Mechanics

Register File

t0 t1 t2 t3

• FFT Transpose can be implemented using shuffle– Bottleneck: Intra-thread data movement

Stage 2: Vertical

Page 59: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Mechanics

Register File

t0 t1 t2 t3

for (int k = 0; k < 4; ++k) dst_registers[k] = src_registers[(4 - tid + k) % 4];

Code 1: (NAIVE)

• FFT Transpose can be implemented using shuffle– Bottleneck: Intra-thread data movement

Stage 2: Vertical

Page 60: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle MechanicsCode 1 (NAIVE)

63 for (int k = 0; k < 4; ++k)64 dst_registers[k] = src_registers[(4 -

tid + k) % 4];

General strategies• Shuffle instructions are cheap.• CUDA local memory is slow.

– Compiler is forced to place registers into CUDA local memory if array indices CANNOT be determined at compile time.

Page 61: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle MechanicsCode 1 (NAIVE)

63 for (int k = 0; k < 4; ++k)64 dst_registers[k] = src_registers[(4 -

tid + k) % 4];

Code 2 (DIV)int tmp = src_registers[0];if (tid == 1){

src_registers[0] = src_registers[3];src_registers[3] = src_registers[2];src_registers[2] = src_registers[1];src_registers[1] = tmp;

}else if (tid == 2){

src_registers[0] = src_registers[2];src_registers[2] = tmp;tmp = src_registers[1];src_registers[1] = src_registers[3];src_registers[3] = tmp;

}else if (tid == 3){

src_registers[0] = src_registers[1];src_registers[1] = src_registers[2];src_registers[2] = src_registers[3];src_registers[3] = tmp;

}

General strategies• Shuffle instructions are cheap.• CUDA local memory is slow.

– Compiler is forced to place registers into CUDA local memory if array indices CANNOT be determined at compile time.

Page 62: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle MechanicsCode 1 (NAIVE)

63 for (int k = 0; k < 4; ++k)64 dst_registers[k] = src_registers[(4 -

tid + k) % 4];

Code 2 (DIV)int tmp = src_registers[0];if (tid == 1){

src_registers[0] = src_registers[3];src_registers[3] = src_registers[2];src_registers[2] = src_registers[1];src_registers[1] = tmp;

}else if (tid == 2){

src_registers[0] = src_registers[2];src_registers[2] = tmp;tmp = src_registers[1];src_registers[1] = src_registers[3];src_registers[3] = tmp;

}else if (tid == 3){

src_registers[0] = src_registers[1];src_registers[1] = src_registers[2];src_registers[2] = src_registers[3];src_registers[3] = tmp;

}

General strategies• Shuffle instructions are cheap.• CUDA local memory is slow.

– Compiler is forced to place registers into CUDA local memory if array indices CANNOT be determined at compile time.

Divergence

Divergence

Divergence

Page 63: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle MechanicsCode 1 (NAIVE)

63 for (int k = 0; k < 4; ++k)64 dst_registers[k] = src_registers[(4 -

tid + k) % 4];

Code 2 (DIV)int tmp = src_registers[0];if (tid == 1){

src_registers[0] = src_registers[3];src_registers[3] = src_registers[2];src_registers[2] = src_registers[1];src_registers[1] = tmp;

}else if (tid == 2){

src_registers[0] = src_registers[2];src_registers[2] = tmp;tmp = src_registers[1];src_registers[1] = src_registers[3];src_registers[3] = tmp;

}else if (tid == 3){

src_registers[0] = src_registers[1];src_registers[1] = src_registers[2];src_registers[2] = src_registers[3];src_registers[3] = tmp;

}

General strategies• Shuffle instructions are cheap.• CUDA local memory is slow.

– Compiler is forced to place registers into CUDA local memory if array indices CANNOT be determined at compile time.

Code 3 (SELP OOP) 65 dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0]; 66 dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1]; 67 dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2]; 68 dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3]; 69 70 dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0]; 71 dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3]; 72 dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2]; 73 dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1]; 74 75 dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0]; 76 dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2]; 77 dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1]; 78 dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3]; 79 80 dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0]; 81 dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1]; 82 dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2]; 83 dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];

Divergence

Divergence

Divergence

Page 64: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Results (Experimental Testbed)EvaluationAlgorithm 1D FFT (batched), N = 16-, 64-, and 256-

ptsFFTW Version v3.3.2 (4 threads, OpenMP with AVX

extensions)FFTW Hardware Intel i5-2400 (4 cores @ 3.1 GHz)

GPU Testbed

Device CoresPeak

Performance

(GFLOPS)

PeakBandwidth

(GB/s)Max TDP(Watts)

AMD Radeon HD 6970

1536 2703 176 250

AMD Radeon HD 7970

2048 3788 264 250

NVIDIA Tesla C2075

448 1288 144 225

NVIDIA Tesla K20c 2496 4106 208 225

Page 65: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Results (optimizations in isolation)

• Minimize bus traffic via on-chip optimizations (RP, LM-CM, LM-CC, LM-CT)– Critical in AMD GPUs, not so much for NVIDIA GPUs

• Use VASM2/VASM4 (do not consider VAVM types)RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and

Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.

Page 66: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Results (optimizations in concert)

• Device data transfer (black) subsumes execution time§

• One set of opts. for all GPUs§ in RP+LM-CM + VASM2 + CM/GAP

RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.

Page 67: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Results

Page 68: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Results

• Fractionenhanced (0 < f < 1) = 16.5%

16.5%

Page 69: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Results

• Fractionenhanced (0 < f < 1) = 16.5%• Max. speedupenhanced (s > 1) = 1.19x

16.5%

Page 70: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Results

• Fractionenhanced (0 < f < 1) = 16.5%• Max. speedupenhanced (s > 1) = 1.19x

Page 71: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Results

• Fractionenhanced (0 < f < 1) = 16.5%• Max. speedupenhanced (s > 1) = 1.19x

– Speedupenhanced (s > 1) = 1.08x

Page 72: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Results

Name

Reg.

Shm.

SLOC

LMEM

Shared

50 4K 11 0

SELP(OOP

)

74 4K 382 0

SELP(IP)

49 4K 418 0

DIV 72 4K 462 0

Naive

52 4K 100 128• Fractionenhanced (0 < f < 1) = 16.5%• Max. speedupenhanced (s > 1) = 1.19x

– Speedupenhanced (s > 1) = 1.08x

Page 73: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Results

Name

Reg.

Shm.

SLOC

LMEM

Shared

50 4K 11 0

SELP(OOP

)

74 4K 382 0

SELP(IP)

49 4K 418 0

DIV 72 4K 462 0

Naive

52 4K 100 128• Fractionenhanced (0 < f < 1) = 16.5%• Max. speedupenhanced (s > 1) = 1.19x

– Speedupenhanced (s > 1) = 1.08x ... but, wait! There’s more!

Page 74: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Results

Name

Reg.

Shm.

SLOC

LMEM

Shared

50 4K 11 0

SELP(OOP

)

74 4K 382 0

SELP(IP)

49 4K 418 0

DIV 72 4K 462 0

Naive

52 4K 100 128• Fractionenhanced (0 < f < 1) = 16.5%• Max. speedupenhanced (s > 1) = 1.19x

– Speedupenhanced (s > 1) = 1.08x ... but, wait! There’s more!

Page 75: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Results: At 50% occupancy

• Fractionenhanced (0 < f < 1) = 16.5%• Max. speedupenhanced (s > 1) = 1.19x

Page 76: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Results: At 50% occupancy

• Fractionenhanced (0 < f < 1) = 16.5%• Max. speedupenhanced (s > 1) = 1.19x

Page 77: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Results: At 50% occupancy

• Fractionenhanced (0 < f < 1) = 16.5%• Max. speedupenhanced

1 (s > 1) = 1.19x – Speedupenhanced

2 (s > 1) = 1.08x -> 1.17x1 Calculation at 37.5% occupancy2 Calculation at 50% occupancy

Page 78: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Shuffle Results: At 50% occupancy

1 Calculation at 37.5% occupancy2 Calculation at 50% occupancy

Name

Reg.

Shm.

SLOC

OCC

Shared

50 4K 11 37.5%

SELP(OOP

)

74 4K 382 37.5%

SELP(IP)

49 4K 418 37.5%

SELP (IP)

49 0 418 50%

• Fractionenhanced (0 < f < 1) = 16.5%• Max. speedupenhanced

1 (s > 1) = 1.19x – Speedupenhanced

2 (s > 1) = 1.08x -> 1.17x1 Calculation at 37.5% occupancy2 Calculation at 50% occupancy

Page 79: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Name

Reg.

Shm.

SLOC

OCC

Shared

50 4K 11 37.5%

SELP(OOP

)

74 4K 382 37.5%

SELP(IP)

49 4K 418 37.5%

SELP (IP)

49 0 418 50%

Shuffle Results: At 50% occupancy

1 Calculation at 37.5% occupancy2 Calculation at 50% occupancy

• Fractionenhanced (0 < f < 1) = 16.5%• Max. speedupenhanced

1 (s > 1) = 1.19x – Speedupenhanced

2 (s > 1) = 1.08x -> 1.17x1 Calculation at 37.5% occupancy2 Calculation at 50% occupancy

Higher performance athigher occupancy

Page 80: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Results (1D FFT 256-pts)

• Speedups – as high as 9.1 and 5.8 over FFTW– as high as 18.2 and 2.9 over unoptimized GPU

Page 81: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Summary• Approach

– Focus on identifying optimizations for hardware

• Takeaways– FFTs are memory-bound (focus should be on memory

opts.)– Homogeneous set of optimizations for all GPUs:

RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.

Page 82: The Hardware-Software Co-Design Process for the fast Fourier transform (FFT)

synergy.cs.vt.eduThe Co-Design Process for the FFT

Thank You!• Contributions:

– Optimization principles for FFT on GPUs– An analysis of GPU optimizations applied in isolation

and in concert on AMD and NVIDIA GPU architectures• Contact:

– Carlo del Mundo <[email protected]>

RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.