C6: Discrete Fourier Transform (DFT) and Fast Fourier Transform (FFT)
The Hardware-Software Co-Design Process for the Fast Fourier Transform (FFT)
Transcript of a presentation by Carlo C. del Mundo. Advisor: Prof. Wu-chun Feng
synergy.cs.vt.edu | The Co-Design Process for the FFT
The Multi- and Many-core Menace
• "...when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. ... I would be panicked if I were in industry."
— John Hennessy, Stanford University, author of Computer Architecture: A Quantitative Approach
Berkeley’s View
Dwarfs of Symbolic Computation¹,²
• Dwarf (noun): an algorithmic method that captures a pattern of computation and communication
• The thirteen dwarfs: Dense Linear Algebra, Sparse Linear Algebra, Spectral Methods, N-Body Methods, Structured Grid, Unstructured Grid, MapReduce, Combinational Logic, Graph Traversal, Dynamic Programming, Branch-and-Bound, Graphical Models, Finite State Machines

¹ Colella, Phillip. Defining Software Requirements for Scientific Computing. 2004. www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf
² Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.
Dwarfs of Symbolic Computation¹,²
• Dwarf (noun): an algorithmic method that captures a pattern of computation and communication
• Of the dwarfs, Spectral Methods includes the Fast Fourier Transform.
[Figure: the Fast Fourier Transform spans the hardware/software boundary, with software on one side and hardware on the other.]

¹ Colella, Phillip. Defining Software Requirements for Scientific Computing. 2004. www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf
² Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.
OpenFFT: A Heterogeneous FFT Library
• C. del Mundo and W. Feng. "Towards a Performance-Portable FFT Library for Heterogeneous Computing," in IEEE IPDPS '13. Phoenix, AZ, USA, May 2014. (Under review.)
• C. del Mundo and W. Feng. "Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture," SC|13, Denver, CO, Nov. 2013. (Poster publication.)
• C. del Mundo et al. "Accelerating FFT for Wideband Channelization," in IEEE ICC '13. Budapest, Hungary, June 2013.
Computational Pattern: The Butterfly
• Dwarf: Spectral Methods — Butterfly Pattern
[Figure 1 (simple butterfly): inputs a and b; outputs a + b and a − b.]
[Figure 2 (butterfly with twiddle wᵏ): inputs a and b; outputs a + b·wᵏ and a − b·wᵏ.]
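In code, both butterflies reduce to one multiply followed by a sum and a difference. A minimal Python sketch (the function name and the use of Python's built-in complex type are illustrative, not from the slides):

```python
import cmath

def butterfly(a, b, w=1.0):
    """Radix-2 butterfly: returns (a + b*w, a - b*w).
    With w = 1 this is the simple butterfly (Figure 1); with
    w = W_N^k = exp(-2j*pi*k/N) it is the twiddle butterfly (Figure 2)."""
    t = b * w
    return a + t, a - t

# Example: the 2-point DFT of [x0, x1] is exactly one simple butterfly.
x0, x1 = 3.0, 5.0
X0, X1 = butterfly(x0, x1)  # (8.0, -2.0)

# A twiddle factor for a 16-point FFT: w = W_16^1
w = cmath.exp(-2j * cmath.pi * 1 / 16)
```

An N-point radix-2 FFT applies (N/2)·log₂N of these butterflies, which is why the butterfly is the computational pattern worth mapping carefully onto hardware.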
16-pt FFT: Stages of Computation
Input Array → S1: "Columns" → S2: "Twiddles" → S3: "Transpose" → S1: "Columns"
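This staging is the four-step factorization of a 16-point FFT: 4-point FFTs down the columns, a twiddle scaling, a transpose, then 4-point FFTs down the columns again. A NumPy sketch (illustrative host code, not the slide's kernel code) that checks the decomposition against a direct FFT:

```python
import numpy as np

np.random.seed(0)
N1 = N2 = 4
N = N1 * N2
x = np.random.rand(N) + 1j * np.random.rand(N)

A = x.reshape(N1, N2)                   # view input as a 4x4 matrix
B = np.fft.fft(A, axis=0)               # S1: length-4 FFTs down the "columns"
k1 = np.arange(N1).reshape(N1, 1)
n2 = np.arange(N2).reshape(1, N2)
B *= np.exp(-2j * np.pi * k1 * n2 / N)  # S2: "twiddles" W_N^(k1*n2)
C = np.fft.fft(B.T, axis=0)             # S3 + S1: "transpose", then column FFTs
X = C.reshape(N)                        # read out in natural order

assert np.allclose(X, np.fft.fft(x))
```

The same structure generalizes to any N = N1·N2, which is how large FFTs are built from the small per-thread FFTs shown in the following slides.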
16-pt FFT: Computation-to-Core: 1 thread
16-pt FFT: Computation-to-Core: 4 threads
16-pt FFT: Computation-to-Core: One Warp
16-pt FFT: Computation-to-Core: One Block
16-pt FFT: Computation-to-Core: One SM
37.5% Occupancy on NVIDIA Kepler K20c
GPU Memory Spaces
Background (GPUs)
• GPU Memory Hierarchy: Global Memory, Image Memory, Constant Memory, Local Memory, Registers

Table: Memory Read Bandwidth for Radeon HD 6970
  Memory Unit  | Read Bandwidth (TB/s)
  Registers    | 16.2
  Constant     | 5.4
  Local        | 2.7
  L1/L2 Cache  | 1.35 / 0.45
  Global       | 0.17
Global Data Bandwidth (Bus Traffic)
Global Data Bandwidth
• Bus Traffic: bytes transferred from off-chip to on-chip memory and back.
• Suppose we take the FFT of a 128 MB data set.
  – Minimum bus traffic is 2 × 128 MB = 256 MB:
    • Load 128 MB (global → on-chip)
    • Perform the FFT
    • Store 128 MB (on-chip → global)
• Bus traffic bounds performance: the closer a kernel gets to this minimum, the better.
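The minimum-traffic arithmetic, plus the arithmetic intensity it implies. This sketch assumes 8-byte complex-float (float2) samples and the standard 5·N·log₂N flop estimate for a complex FFT; both are common conventions, not figures from the slides:

```python
import math

data_bytes = 128 * 2**20        # 128 MB of float2 (8-byte) samples
n = data_bytes // 8             # 16,777,216 complex points
min_traffic = 2 * data_bytes    # load once + store once = 256 MB

flops = 5 * n * math.log2(n)    # standard complex-FFT flop estimate
intensity = flops / min_traffic # flops per byte of bus traffic

assert min_traffic == 256 * 2**20
assert intensity == 7.5
```

At roughly 7.5 flops per byte of mandatory traffic, the FFT sits well below the compute-to-bandwidth balance point of the GPUs in the testbed, which is why the deck treats bus traffic as the quantity to minimize.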
16-pt FFT: Computation-to-Core: One Warp
• What's wrong with this access pattern?
  – Scattered memory accesses (power-of-2 strides)
  – Uncoalesced memory accesses
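Why power-of-2 strides are particularly bad: memory banks and channels come in power-of-2 counts, so a power-of-2 stride maps many threads onto few banks. A toy Python model (32 banks of one word each is a common configuration; the numbers here are illustrative, not measurements from the slides):

```python
NUM_BANKS = 32

def banks_touched(stride, num_threads=32):
    """Set of bank indices hit when thread t accesses word t*stride."""
    return {(t * stride) % NUM_BANKS for t in range(num_threads)}

# Unit stride: all 32 banks used -> conflict-free and coalescible.
assert len(banks_touched(1)) == 32

# Power-of-2 stride 4 (as in the 16-pt FFT layout): only 8 banks, 4-way conflicts.
assert len(banks_touched(4)) == 8

# Stride 16: only 2 banks, 16-way conflicts.
assert len(banks_touched(16)) == 2
```

The same congestion argument applies at the DRAM-channel level, which is why the later optimizations reorder data on-chip instead of striding through global memory.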
System-level optimizations (applicable to any application)
1. Register Preloading
2. Vector Access / {Vector, Scalar} Arithmetic
3. Constant Memory Usage
4. Dynamic Instruction Reduction
5. Memory Coalescing
S1: Register Preloading
• Load to registers first

Without register preloading:

    __kernel void FFT16_vanilla(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);

With register preloading:

    __kernel void FFT16_strawberry1(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        float2 registers[4]; // explicit loads
        for (int i = 0; i < 4; ++i)
            registers[i] = buffer[4*i];
        FFT4_in_order_output(&registers[0], &registers[1], &registers[2], &registers[3]);
S2: Vector Types
• Vector Access (float{2, 4, 8, 16})
  – Scalar Math (VASM): float + float
  – Vector Math (VAVM): floatN + floatN
[Figure: a four-element array a[0], a[1], a[2], a[3] accessed as one vector.]
S3: Constant Memory
• Fast cached lookup for frequently used data

    __constant float2 twiddles[16] = {
        (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
        (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
        ... more sin/cos values
    };

Without constant memory:

    for (int j = 1; j < 4; ++j)
    {
        double theta = -2.0 * M_PI * tid * j / 16;
        float2 twid = make_float2(cos(theta), sin(theta));
        result[j] = buffer[j*4] * twid;
    }

With constant memory:

    for (int j = 1; j < 4; ++j)
        result[j] = buffer[j*4] * twiddles[4*j+tid];
Algorithm-level optimizations (applicable only to the FFT)
1. Transpose via LM
2. Compute/Transpose via LM
3. Compute/No Transpose via LM
4. Register-to-register transpose (shuffle)
A1: Transpose via local¹ memory
• Via shared memory
• Register to register (shfl)

¹ CUDA shared memory == OpenCL local memory
Algorithm-level optimizations: 1. Naïve Transpose (LM-CM)
[Figure: threads t0–t3 move data from the register file through local memory and back, turning the original layout into the transposed layout.]
Shuffle Mechanics
• FFT transpose can be implemented using shuffle.
  – Three stages over the register file (threads t0–t3): Stage 1: horizontal; Stage 2: vertical; Stage 3: horizontal.
  – Bottleneck: intra-thread data movement.
Shuffle Mechanics

Code 1 (NAIVE):

    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

Code 2 (DIV) — each branch diverges:

    int tmp = src_registers[0];
    if (tid == 1)
    {
        src_registers[0] = src_registers[3];
        src_registers[3] = src_registers[2];
        src_registers[2] = src_registers[1];
        src_registers[1] = tmp;
    }
    else if (tid == 2)
    {
        src_registers[0] = src_registers[2];
        src_registers[2] = tmp;
        tmp = src_registers[1];
        src_registers[1] = src_registers[3];
        src_registers[3] = tmp;
    }
    else if (tid == 3)
    {
        src_registers[0] = src_registers[1];
        src_registers[1] = src_registers[2];
        src_registers[2] = src_registers[3];
        src_registers[3] = tmp;
    }

Code 3 (SELP OOP):

    dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
    dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
    dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
    dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

    dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
    dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
    dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
    dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

    dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
    dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
    dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
    dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

    dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
    dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
    dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
    dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];

General strategies:
• Shuffle instructions are cheap.
• CUDA local memory is slow.
  – The compiler is forced to spill registers into CUDA local memory if array indices cannot be determined at compile time (as in Code 1's tid-dependent index).
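All three variants compute the same per-thread rotation; they differ only in how the GPU executes them. A quick CPU-side check in Python (illustrative transcription, not device code) that NAIVE's modular indexing and DIV's branchy swaps agree:

```python
def naive(src, tid):
    """Code 1: rotate the 4 registers by tid via modular indexing."""
    return [src[(4 - tid + k) % 4] for k in range(4)]

def div(src, tid):
    """Code 2: the same rotation via per-tid sequential swaps."""
    s = list(src)
    tmp = s[0]
    if tid == 1:
        s[0] = s[3]; s[3] = s[2]; s[2] = s[1]; s[1] = tmp
    elif tid == 2:
        s[0] = s[2]; s[2] = tmp
        tmp = s[1]; s[1] = s[3]; s[3] = tmp
    elif tid == 3:
        s[0] = s[1]; s[1] = s[2]; s[2] = s[3]; s[3] = tmp
    return s

for tid in range(4):
    assert naive(list("abcd"), tid) == div(list("abcd"), tid)

assert naive(list("abcd"), 1) == ["d", "a", "b", "c"]
```

SELP writes the same destination values branch-free with predicated selects, trading divergence for extra instructions; the table in the results slides compares the costs.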
Results (Experimental Testbed)

Evaluation
  Algorithm:     1D FFT (batched), N = 16-, 64-, and 256-pts
  FFTW Version:  v3.3.2 (4 threads, OpenMP with AVX extensions)
  FFTW Hardware: Intel i5-2400 (4 cores @ 3.1 GHz)

GPU Testbed
  Device             | Cores | Peak Performance (GFLOPS) | Peak Bandwidth (GB/s) | Max TDP (W)
  AMD Radeon HD 6970 | 1536  | 2703                      | 176                   | 250
  AMD Radeon HD 7970 | 2048  | 3788                      | 264                   | 250
  NVIDIA Tesla C2075 |  448  | 1288                      | 144                   | 225
  NVIDIA Tesla K20c  | 2496  | 4106                      | 208                   | 225
Results (optimizations in isolation)
• Minimize bus traffic via on-chip optimizations (RP, LM-CM, LM-CC, LM-CT)
  – Critical on AMD GPUs; less so on NVIDIA GPUs
• Use VASM2/VASM4 (do not consider VAVM types)

Legend — RP: Register Preloading; LM-{CM, CT, CC}: Local Memory {Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math (floatn); VAVM{n}: Vectorized Access & Vector Math (floatn); CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop Unrolling; CSE: Common Subexpression Elimination; IL: Function Inlining; Baseline: VASM2.
Results (optimizations in concert)
• Device data transfer (black) subsumes execution time
• One set of optimizations suffices for all GPUs: RP + LM-CM + VASM2 + CM/CGAP
Shuffle Results
• Fraction_enhanced (0 < f < 1) = 16.5%
• Max. speedup_enhanced (s > 1) = 1.19x
  – Speedup_enhanced (s > 1) = 1.08x
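These are Amdahl's-law quantities: if a fraction f of the runtime is enhanced and that part speeds up by s, the overall speedup is 1/((1−f) + f/s), and the 1.19x ceiling is the s→∞ limit. A check using the slide's numbers (the implied per-stage speedup below is my back-calculation, not a figure from the slides):

```python
def amdahl(f, s):
    """Overall speedup when fraction f of runtime is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

f = 0.165                    # Fraction_enhanced = 16.5%
ceiling = 1.0 / (1.0 - f)    # s -> infinity limit of amdahl(f, s)
assert abs(ceiling - 1.19) < 0.01

# The observed 1.08x overall speedup implies the shuffled stage itself ran
# roughly 1.8x faster (solving 1.08 = amdahl(0.165, s) for s).
s = f / (1.0 / 1.08 - (1.0 - f))
assert abs(amdahl(f, s) - 1.08) < 1e-9
```

Because f is small, even a perfect shuffle transpose cannot beat 1.19x overall; the occupancy results on the following slides attack the other 83.5%.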
Shuffle Results

  Name      | Reg. | Shm. | SLOC | LMEM
  Shared    |  50  |  4K  |  11  |   0
  SELP(OOP) |  74  |  4K  | 382  |   0
  SELP(IP)  |  49  |  4K  | 418  |   0
  DIV       |  72  |  4K  | 462  |   0
  Naive     |  52  |  4K  | 100  | 128

• Fraction_enhanced (0 < f < 1) = 16.5%
• Max. speedup_enhanced (s > 1) = 1.19x
  – Speedup_enhanced (s > 1) = 1.08x ... but wait, there's more!
Shuffle Results: At 50% occupancy
• Fraction_enhanced (0 < f < 1) = 16.5%
• Max. speedup_enhanced¹ (s > 1) = 1.19x
  – Speedup_enhanced² (s > 1) = 1.08x → 1.17x

¹ Calculated at 37.5% occupancy
² Calculated at 50% occupancy
Shuffle Results: At 50% occupancy

  Name      | Reg. | Shm. | SLOC | OCC
  Shared    |  50  |  4K  |  11  | 37.5%
  SELP(OOP) |  74  |  4K  | 382  | 37.5%
  SELP(IP)  |  49  |  4K  | 418  | 37.5%
  SELP(IP)  |  49  |  0   | 418  | 50%

• Fraction_enhanced (0 < f < 1) = 16.5%
• Max. speedup_enhanced¹ (s > 1) = 1.19x
  – Speedup_enhanced² (s > 1) = 1.08x → 1.17x
• Higher performance at higher occupancy

¹ Calculated at 37.5% occupancy
² Calculated at 50% occupancy
Results (1D FFT, 256-pts)
• Speedups:
  – as high as 9.1x and 5.8x over FFTW
  – as high as 18.2x and 2.9x over unoptimized GPU code
Summary
• Approach
  – Focus on identifying optimizations for the hardware
• Takeaways
  – FFTs are memory-bound, so the focus should be on memory optimizations
  – One homogeneous set of optimizations works for all GPUs: RP + LM-CM + VASM2 + CM/CGAP
Thank You!
• Contributions:
  – Optimization principles for the FFT on GPUs
  – An analysis of GPU optimizations applied in isolation and in concert on AMD and NVIDIA GPU architectures
• Contact:
  – Carlo del Mundo <[email protected]>