C6: Discrete Fourier Transform (DFT) and Fast Fourier Transform (FFT)
The Hardware-Software Co-Design Process for the Fast Fourier Transform (FFT)
Transcript of a presentation by Carlo C. del Mundo. Advisor: Prof. Wu-chun Feng
synergy.cs.vt.edu | The Co-Design Process for the FFT
The Multi- and Many-core Menace
• "...when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. ... I would be panicked if I were in industry."
— John Hennessy, Stanford University, author of Computer Architecture: A Quantitative Approach
Berkeley’s View
Dwarfs of Symbolic Computation¹,²
• Dwarf (noun): an algorithmic method that captures a pattern of computation and communication
• The thirteen dwarfs: Dense Linear Algebra, Sparse Linear Algebra, Spectral Methods, N-Body Methods, Structured Grid, Unstructured Grid, MapReduce, Combinational Logic, Graph Traversal, Dynamic Programming, Branch-and-Bound, Graphical Models, Finite State Machines

¹ Colella, Phillip. Defining Software Requirements for Scientific Computing. 2004. www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf
² Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.
Dwarfs of Symbolic Computation¹,²
• Dwarf (noun): an algorithmic method that captures a pattern of computation and communication
• Of the dwarfs, Spectral Methods includes the Fast Fourier Transform.
[Figure: the Fast Fourier Transform spans the hardware/software boundary, with software on one side and hardware on the other.]

¹ Colella, Phillip. Defining Software Requirements for Scientific Computing. 2004. www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf
² Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.
OpenFFT: A Heterogeneous FFT Library
• C. del Mundo and W. Feng. "Towards a Performance-Portable FFT Library for Heterogeneous Computing," in IEEE IPDPS '13. Phoenix, AZ, USA, May 2014. (Under review.)
• C. del Mundo and W. Feng. "Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture," SC|13, Denver, CO, Nov. 2013. (Poster publication.)
• C. del Mundo et al. "Accelerating FFT for Wideband Channelization," in IEEE ICC '13. Budapest, Hungary, June 2013.
Computational Pattern: The Butterfly
• Dwarf: Spectral Methods — Butterfly Pattern
[Figure 1 (simple butterfly): inputs a and b; outputs a + b and a − b.]
[Figure 2 (butterfly with twiddle wᵏ): inputs a and b; outputs a + b·wᵏ and a − b·wᵏ.]
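In code, both butterflies reduce to one multiply followed by a sum and a difference. A minimal Python sketch (the function name and the use of Python's built-in complex type are illustrative, not from the slides):

```python
import cmath

def butterfly(a, b, w=1.0):
    """Radix-2 butterfly: returns (a + b*w, a - b*w).
    With w = 1 this is the simple butterfly (Figure 1); with
    w = W_N^k = exp(-2j*pi*k/N) it is the twiddle butterfly (Figure 2)."""
    t = b * w
    return a + t, a - t

# Example: the 2-point DFT of [x0, x1] is exactly one simple butterfly.
x0, x1 = 3.0, 5.0
X0, X1 = butterfly(x0, x1)  # (8.0, -2.0)

# A twiddle factor for a 16-point FFT: w = W_16^1
w = cmath.exp(-2j * cmath.pi * 1 / 16)
```

An N-point radix-2 FFT applies (N/2)·log₂N of these butterflies, which is why the butterfly is the computational pattern worth mapping carefully onto hardware.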
16-pt FFT: Stages of Computation
Input Array → S1: "Columns" → S2: "Twiddles" → S3: "Transpose" → S1: "Columns"
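This staging is the four-step factorization of a 16-point FFT: 4-point FFTs down the columns, a twiddle scaling, a transpose, then 4-point FFTs down the columns again. A NumPy sketch (illustrative host code, not the slide's kernel code) that checks the decomposition against a direct FFT:

```python
import numpy as np

np.random.seed(0)
N1 = N2 = 4
N = N1 * N2
x = np.random.rand(N) + 1j * np.random.rand(N)

A = x.reshape(N1, N2)                   # view input as a 4x4 matrix
B = np.fft.fft(A, axis=0)               # S1: length-4 FFTs down the "columns"
k1 = np.arange(N1).reshape(N1, 1)
n2 = np.arange(N2).reshape(1, N2)
B *= np.exp(-2j * np.pi * k1 * n2 / N)  # S2: "twiddles" W_N^(k1*n2)
C = np.fft.fft(B.T, axis=0)             # S3 + S1: "transpose", then column FFTs
X = C.reshape(N)                        # read out in natural order

assert np.allclose(X, np.fft.fft(x))
```

The same structure generalizes to any N = N1·N2, which is how large FFTs are built from the small per-thread FFTs shown in the following slides.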
16-pt FFT: Computation-to-Core: 1 thread
16-pt FFT: Computation-to-Core: 4 threads
16-pt FFT: Computation-to-Core: One Warp
16-pt FFT: Computation-to-Core: One Block
16-pt FFT: Computation-to-Core: One SM
37.5% Occupancy on NVIDIA Kepler K20c
GPU Memory Spaces
Background (GPUs)
• GPU Memory Hierarchy: Global Memory, Image Memory, Constant Memory, Local Memory, Registers

Table: Memory Read Bandwidth for Radeon HD 6970
  Memory Unit  | Read Bandwidth (TB/s)
  Registers    | 16.2
  Constant     | 5.4
  Local        | 2.7
  L1/L2 Cache  | 1.35 / 0.45
  Global       | 0.17
Global Data Bandwidth (Bus Traffic)
Global Data Bandwidth
• Bus Traffic: bytes transferred from off-chip to on-chip memory and back.
• Suppose we take the FFT of a 128 MB data set.
  – Minimum bus traffic is 2 × 128 MB = 256 MB:
    • Load 128 MB (global → on-chip)
    • Perform the FFT
    • Store 128 MB (on-chip → global)
• Bus traffic bounds performance: the closer a kernel gets to this minimum, the better.
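The minimum-traffic arithmetic, plus the arithmetic intensity it implies. This sketch assumes 8-byte complex-float (float2) samples and the standard 5·N·log₂N flop estimate for a complex FFT; both are common conventions, not figures from the slides:

```python
import math

data_bytes = 128 * 2**20        # 128 MB of float2 (8-byte) samples
n = data_bytes // 8             # 16,777,216 complex points
min_traffic = 2 * data_bytes    # load once + store once = 256 MB

flops = 5 * n * math.log2(n)    # standard complex-FFT flop estimate
intensity = flops / min_traffic # flops per byte of bus traffic

assert min_traffic == 256 * 2**20
assert intensity == 7.5
```

At roughly 7.5 flops per byte of mandatory traffic, the FFT sits well below the compute-to-bandwidth balance point of the GPUs in the testbed, which is why the deck treats bus traffic as the quantity to minimize.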
16-pt FFT: Computation-to-Core: One Warp
• What's wrong with this access pattern?
  – Scattered memory accesses (power-of-2 strides)
  – Uncoalesced memory accesses
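Why power-of-2 strides are particularly bad: memory banks and channels come in power-of-2 counts, so a power-of-2 stride maps many threads onto few banks. A toy Python model (32 banks of one word each is a common configuration; the numbers here are illustrative, not measurements from the slides):

```python
NUM_BANKS = 32

def banks_touched(stride, num_threads=32):
    """Set of bank indices hit when thread t accesses word t*stride."""
    return {(t * stride) % NUM_BANKS for t in range(num_threads)}

# Unit stride: all 32 banks used -> conflict-free and coalescible.
assert len(banks_touched(1)) == 32

# Power-of-2 stride 4 (as in the 16-pt FFT layout): only 8 banks, 4-way conflicts.
assert len(banks_touched(4)) == 8

# Stride 16: only 2 banks, 16-way conflicts.
assert len(banks_touched(16)) == 2
```

The same congestion argument applies at the DRAM-channel level, which is why the later optimizations reorder data on-chip instead of striding through global memory.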
System-level optimizations (applicable to any application)
1. Register Preloading
2. Vector Access / {Vector, Scalar} Arithmetic
3. Constant Memory Usage
4. Dynamic Instruction Reduction
5. Memory Coalescing
S1: Register Preloading
• Load to registers first

Without register preloading:

    __kernel void FFT16_vanilla(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);

With register preloading:

    __kernel void FFT16_strawberry1(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        float2 registers[4]; // explicit loads
        for (int i = 0; i < 4; ++i)
            registers[i] = buffer[4*i];
        FFT4_in_order_output(&registers[0], &registers[1], &registers[2], &registers[3]);
S2: Vector Types
• Vector Access (float{2, 4, 8, 16})
  – Scalar Math (VASM): float + float
  – Vector Math (VAVM): floatN + floatN
[Figure: a four-element array a[0], a[1], a[2], a[3] accessed as one vector.]
S3: Constant Memory
• Fast cached lookup for frequently used data

    __constant float2 twiddles[16] = {
        (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
        (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
        ... more sin/cos values
    };

Without constant memory:

    for (int j = 1; j < 4; ++j)
    {
        double theta = -2.0 * M_PI * tid * j / 16;
        float2 twid = make_float2(cos(theta), sin(theta));
        result[j] = buffer[j*4] * twid;
    }

With constant memory:

    for (int j = 1; j < 4; ++j)
        result[j] = buffer[j*4] * twiddles[4*j+tid];
Algorithm-level optimizations (applicable only to the FFT)
1. Transpose via LM
2. Compute/Transpose via LM
3. Compute/No Transpose via LM
4. Register-to-register transpose (shuffle)
A1: Transpose via local¹ memory
• Via shared memory
• Register to register (shfl)

¹ CUDA shared memory == OpenCL local memory
Algorithm-level optimizations: 1. Naïve Transpose (LM-CM)
[Figure: threads t0–t3 move data from the register file through local memory and back, turning the original layout into the transposed layout.]
Shuffle Mechanics
• FFT transpose can be implemented using shuffle.
  – Three stages over the register file (threads t0–t3): Stage 1: horizontal; Stage 2: vertical; Stage 3: horizontal.
  – Bottleneck: intra-thread data movement.
Shuffle Mechanics

Code 1 (NAIVE):

    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

Code 2 (DIV) — each branch diverges:

    int tmp = src_registers[0];
    if (tid == 1)
    {
        src_registers[0] = src_registers[3];
        src_registers[3] = src_registers[2];
        src_registers[2] = src_registers[1];
        src_registers[1] = tmp;
    }
    else if (tid == 2)
    {
        src_registers[0] = src_registers[2];
        src_registers[2] = tmp;
        tmp = src_registers[1];
        src_registers[1] = src_registers[3];
        src_registers[3] = tmp;
    }
    else if (tid == 3)
    {
        src_registers[0] = src_registers[1];
        src_registers[1] = src_registers[2];
        src_registers[2] = src_registers[3];
        src_registers[3] = tmp;
    }

Code 3 (SELP OOP):

    dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
    dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
    dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
    dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

    dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
    dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
    dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
    dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

    dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
    dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
    dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
    dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

    dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
    dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
    dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
    dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];

General strategies:
• Shuffle instructions are cheap.
• CUDA local memory is slow.
  – The compiler is forced to spill registers into CUDA local memory if array indices cannot be determined at compile time (as in Code 1's tid-dependent index).
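All three variants compute the same per-thread rotation; they differ only in how the GPU executes them. A quick CPU-side check in Python (illustrative transcription, not device code) that NAIVE's modular indexing and DIV's branchy swaps agree:

```python
def naive(src, tid):
    """Code 1: rotate the 4 registers by tid via modular indexing."""
    return [src[(4 - tid + k) % 4] for k in range(4)]

def div(src, tid):
    """Code 2: the same rotation via per-tid sequential swaps."""
    s = list(src)
    tmp = s[0]
    if tid == 1:
        s[0] = s[3]; s[3] = s[2]; s[2] = s[1]; s[1] = tmp
    elif tid == 2:
        s[0] = s[2]; s[2] = tmp
        tmp = s[1]; s[1] = s[3]; s[3] = tmp
    elif tid == 3:
        s[0] = s[1]; s[1] = s[2]; s[2] = s[3]; s[3] = tmp
    return s

for tid in range(4):
    assert naive(list("abcd"), tid) == div(list("abcd"), tid)

assert naive(list("abcd"), 1) == ["d", "a", "b", "c"]
```

SELP writes the same destination values branch-free with predicated selects, trading divergence for extra instructions; the table in the results slides compares the costs.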
Results (Experimental Testbed)

Evaluation
  Algorithm:     1D FFT (batched), N = 16-, 64-, and 256-pts
  FFTW Version:  v3.3.2 (4 threads, OpenMP with AVX extensions)
  FFTW Hardware: Intel i5-2400 (4 cores @ 3.1 GHz)

GPU Testbed
  Device             | Cores | Peak Performance (GFLOPS) | Peak Bandwidth (GB/s) | Max TDP (W)
  AMD Radeon HD 6970 | 1536  | 2703                      | 176                   | 250
  AMD Radeon HD 7970 | 2048  | 3788                      | 264                   | 250
  NVIDIA Tesla C2075 |  448  | 1288                      | 144                   | 225
  NVIDIA Tesla K20c  | 2496  | 4106                      | 208                   | 225
Results (optimizations in isolation)
• Minimize bus traffic via on-chip optimizations (RP, LM-CM, LM-CC, LM-CT)
  – Critical on AMD GPUs; less so on NVIDIA GPUs
• Use VASM2/VASM4 (do not consider VAVM types)

Legend — RP: Register Preloading; LM-{CM, CT, CC}: Local Memory {Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math (floatn); VAVM{n}: Vectorized Access & Vector Math (floatn); CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop Unrolling; CSE: Common Subexpression Elimination; IL: Function Inlining; Baseline: VASM2.
Results (optimizations in concert)
• Device data transfer (black) subsumes execution time
• One set of optimizations suffices for all GPUs: RP + LM-CM + VASM2 + CM/CGAP
Shuffle Results
• Fraction_enhanced (0 < f < 1) = 16.5%
• Max. speedup_enhanced (s > 1) = 1.19x
  – Speedup_enhanced (s > 1) = 1.08x
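These are Amdahl's-law quantities: if a fraction f of the runtime is enhanced and that part speeds up by s, the overall speedup is 1/((1−f) + f/s), and the 1.19x ceiling is the s→∞ limit. A check using the slide's numbers (the implied per-stage speedup below is my back-calculation, not a figure from the slides):

```python
def amdahl(f, s):
    """Overall speedup when fraction f of runtime is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

f = 0.165                    # Fraction_enhanced = 16.5%
ceiling = 1.0 / (1.0 - f)    # s -> infinity limit of amdahl(f, s)
assert abs(ceiling - 1.19) < 0.01

# The observed 1.08x overall speedup implies the shuffled stage itself ran
# roughly 1.8x faster (solving 1.08 = amdahl(0.165, s) for s).
s = f / (1.0 / 1.08 - (1.0 - f))
assert abs(amdahl(f, s) - 1.08) < 1e-9
```

Because f is small, even a perfect shuffle transpose cannot beat 1.19x overall; the occupancy results on the following slides attack the other 83.5%.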
Shuffle Results

  Name      | Reg. | Shm. | SLOC | LMEM
  Shared    |  50  |  4K  |  11  |   0
  SELP(OOP) |  74  |  4K  | 382  |   0
  SELP(IP)  |  49  |  4K  | 418  |   0
  DIV       |  72  |  4K  | 462  |   0
  Naive     |  52  |  4K  | 100  | 128

• Fraction_enhanced (0 < f < 1) = 16.5%
• Max. speedup_enhanced (s > 1) = 1.19x
  – Speedup_enhanced (s > 1) = 1.08x ... but wait, there's more!
Shuffle Results: At 50% occupancy
• Fraction_enhanced (0 < f < 1) = 16.5%
• Max. speedup_enhanced¹ (s > 1) = 1.19x
  – Speedup_enhanced² (s > 1) = 1.08x → 1.17x

¹ Calculated at 37.5% occupancy
² Calculated at 50% occupancy
Shuffle Results: At 50% occupancy

  Name      | Reg. | Shm. | SLOC | OCC
  Shared    |  50  |  4K  |  11  | 37.5%
  SELP(OOP) |  74  |  4K  | 382  | 37.5%
  SELP(IP)  |  49  |  4K  | 418  | 37.5%
  SELP(IP)  |  49  |  0   | 418  | 50%

• Fraction_enhanced (0 < f < 1) = 16.5%
• Max. speedup_enhanced¹ (s > 1) = 1.19x
  – Speedup_enhanced² (s > 1) = 1.08x → 1.17x
• Higher performance at higher occupancy

¹ Calculated at 37.5% occupancy
² Calculated at 50% occupancy
Results (1D FFT, 256-pts)
• Speedups:
  – as high as 9.1x and 5.8x over FFTW
  – as high as 18.2x and 2.9x over unoptimized GPU code
Summary
• Approach
  – Focus on identifying optimizations for the hardware
• Takeaways
  – FFTs are memory-bound, so the focus should be on memory optimizations
  – One homogeneous set of optimizations works for all GPUs: RP + LM-CM + VASM2 + CM/CGAP
Thank You!
• Contributions:
  – Optimization principles for the FFT on GPUs
  – An analysis of GPU optimizations applied in isolation and in concert on AMD and NVIDIA GPU architectures
• Contact:
  – Carlo del Mundo <[email protected]>