Why GPUs?
Robert Strzodka
2
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
3
Data Processing in General
[Diagram: IN, memory, Processor, memory, OUT. Labeled bottlenecks: memory wall, lack of parallelism]
4
Old and New Wisdom in Computer Architecture
• Old: Power is free, Transistors are expensive
• New: “Power wall”, Power expensive, Transistors free
(Can put more transistors on chip than can afford to turn on)
• Old: Multiplies are slow, Memory access is fast
• New: “Memory wall”, Multiplies fast, Memory slow
(200 clocks to DRAM memory, 4 clocks for FP multiply)
• Old: Increasing Instruction Level Parallelism via compilers, innovation (Out-of-order, speculation, VLIW, …)
• New: “ILP wall”, diminishing returns on more ILP HW
(Explicit thread and data parallelism must be exploited)
• New: Power Wall + Memory Wall + ILP Wall = Brick Wall
Slide courtesy of Christos Kozyrakis
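The memory-wall figures on this slide imply a rough cost ratio. The following is only a back-of-the-envelope reading of the slide's two numbers; the symbols t_DRAM and t_mul are introduced here for illustration and do not appear in the original.

```latex
\frac{t_{\mathrm{DRAM}}}{t_{\mathrm{mul}}}
  = \frac{200\ \text{clocks}}{4\ \text{clocks}}
  = 50
```

Under these numbers, one access that goes all the way to DRAM costs about as much as 50 FP multiplies. Ignoring caches and latency hiding, a computation would need roughly that many arithmetic operations per fetched operand before arithmetic rather than memory becomes the limit.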
5
Uniprocessor Performance (SPECint)
[Chart: performance relative to the VAX-11/780 (log scale, 1 to 10000) by year, 1978 to 2006; growth annotations: 25%/year, 52%/year, ??%/year; a 3X gap is marked]
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
Sea change in chip design: multiple “cores” or processors per chip
Slide courtesy of Christos Kozyrakis
6
Instruction-Stream-Based Processing
[Diagram labels: Processor, instructions, cache, memory, data]
7
Instruction- and Data-Streams
Addition of 2D arrays: C = A + B
Instruction stream processing data:
    for (y = 0; y < HEIGHT; y++)
        for (x = 0; x < WIDTH; x++) {
            C[y][x] = A[y][x] + B[y][x];
        }
Data streams undergoing a kernel operation:
    inputStreams(A, B);
    outputStream(C);
    kernelProgram(OP_ADD);
    processStreams();
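The contrast between the two styles can be sketched in plain C. This is a minimal illustration, not the API from the slide: the names `kernel_fn`, `op_add`, `add_loops` and `process_streams` are hypothetical, and the slide's four stream calls are collapsed into a single function that receives the streams and the kernel as arguments.

```c
#include <assert.h>
#include <stddef.h>

/* A kernel maps one element from each input stream to one output element. */
typedef float (*kernel_fn)(float a, float b);

/* Hypothetical stand-in for the slide's OP_ADD kernel. */
float op_add(float a, float b) { return a + b; }

/* Instruction-stream style: the loop nest spells out the traversal order. */
void add_loops(size_t height, size_t width,
               const float *A, const float *B, float *C)
{
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            C[y * width + x] = A[y * width + x] + B[y * width + x];
}

/* Data-stream style: the caller names the streams and the kernel only.
 * The order in which elements are visited is an implementation detail,
 * so the runtime is free to process them in parallel. */
void process_streams(kernel_fn kernel, size_t n,
                     const float *in0, const float *in1, float *out)
{
    assert(kernel != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = kernel(in0[i], in1[i]);
}
```

Calling `process_streams(op_add, height * width, A, B, C)` produces the same array as `add_loops`. The difference is that the stream version makes no promise about element order, which is what leaves room for parallel execution.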
8
Data-Stream-Based Processing
[Diagram labels: Processor, memory, data, configuration, three pipelines]
9
Architectures: Data – Processor Locality
• Field Programmable Gate Array (FPGA)
– Compute by configuring Boolean functions and local memory
• Processor Array / Multi-core Processor
– Assemble many (simple) processors and memories on one chip
• Processor-in-Memory (PIM)
– Insert processing elements directly into RAM chips
• Stream Processor
– Create data locality through a hierarchy of memories
10
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
11
The GPU is a Fast, Parallel Array Processor
Input Arrays: 1D, 2D (typical), 3D
Vertex Processor (VP): Kernel changes index regions of output arrays
Rasterizer: Creates data streams from index regions
Stream of array elements, order unknown
Fragment Processor (FP): Kernel changes each datum independently, reads more input arrays
Output Arrays: 1D, 2D (typical), 3D (slice)
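The pipeline on this slide can be imitated on a CPU to make the roles concrete. The sketch below is only an analogy under simplifying assumptions: the caller plays the vertex processor by choosing a rectangular index region, `rasterize` turns the region into a stream of indices, and a fragment kernel computes each output element independently. All names (`region`, `fragment_fn`, `scale_by_two`, `rasterize`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Index region of the output array, half-open: [x0, x1) x [y0, y1). */
typedef struct { size_t x0, y0, x1, y1; } region;

/* Fragment kernel: computes one output element from its index and the input. */
typedef float (*fragment_fn)(size_t x, size_t y,
                             const float *input, size_t width);

/* Example kernel: doubles the input element at the same index. */
float scale_by_two(size_t x, size_t y, const float *input, size_t width)
{
    return 2.0f * input[y * width + x];
}

/* "Rasterizer": generates every index in the region and runs the kernel on it.
 * The loops run backwards on purpose, as a reminder that a kernel must not
 * rely on any particular element order. */
void rasterize(region r, fragment_fn kernel,
               const float *input, float *output, size_t width)
{
    assert(r.x0 <= r.x1 && r.y0 <= r.y1 && r.x1 <= width);
    for (size_t y = r.y1; y-- > r.y0; )
        for (size_t x = r.x1; x-- > r.x0; )
            output[y * width + x] = kernel(x, y, input, width);
}
```

Elements outside the chosen region are left untouched, which mirrors how the index region decides which part of the output array a kernel changes.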
12
Index Regions in Output Arrays
• Quads and Triangles
– Fastest option
• Line segments
– Slower, try to pair lines to 2xh, wx2 quads
• Point Clouds
– Slowest, try to gather points into larger forms
[Figures: the output region covered by each primitive type]
13
High Level Graphics Language for the Kernels
• Float data types:
– half 16-bit (s10e5), float 32-bit (s23e8)
• Vectors, structs and arrays:
– float4, float vec[6], float3x4, float arr[5][3], struct {}
• Arithmetic and logic operators:
– +, -, *, /; &&, ||, !
• Trigonometric, exponential functions:
– sin, asin, exp, log, pow, …
• User defined functions
– float max3(float a, float b, float c) { return max(a, max(b, c)); }
• Conditional statements, loops:
– if, for, while, dynamic branching in PS3
• Streaming and random data access
14
Input and Output Arrays
CPU
• Input and output arrays may overlap
GPU
• Input and output arrays must not overlap
[Diagrams: Input and Output arrays for each case]
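One common way to live with the no-overlap rule in multi-pass algorithms is to alternate between two buffers ("ping-pong"). The sketch below assumes a toy kernel that adds each element's left neighbor; the function names `add_left_neighbor` and `ping_pong` are hypothetical and the code runs on the CPU purely to illustrate the data flow.

```c
#include <assert.h>
#include <stddef.h>

/* One pass: out[i] = in[i] + in[i-1], with a missing left neighbor read as 0.
 * The assert enforces the GPU-style rule that input and output are distinct. */
void add_left_neighbor(const float *in, float *out, size_t n)
{
    assert(in != out);
    for (size_t i = 0; i < n; i++) {
        float left = (i > 0) ? in[i - 1] : 0.0f;
        out[i] = in[i] + left;
    }
}

/* Runs `passes` passes, swapping the roles of the two buffers after each one.
 * Returns a pointer to whichever buffer holds the final result. */
float *ping_pong(float *buf_a, float *buf_b, size_t n, int passes)
{
    float *src = buf_a;
    float *dst = buf_b;
    for (int p = 0; p < passes; p++) {
        add_left_neighbor(src, dst, n);
        float *tmp = src;
        src = dst;
        dst = tmp;
    }
    return src;
}
```

Because every pass reads only the previous pass's values, the result does not depend on the order in which elements are processed. An in-place loop over the same array would instead read values already updated in the current pass and compute something different.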
15
Native Memory Layout – Data Locality
CPU
• 1D input
• 1D output
• Higher dimensions with offsets
GPU
• 1D, 2D, 3D input
• 2D output
• Other dimensions with offsets
[Diagrams: Input and Output arrays, color coded by locality: red (near), blue (far)]
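"Higher dimensions with offsets" on the CPU side usually means computing a 1D offset from multi-dimensional indices. The sketch below assumes the conventional row-major layout; the helper names `offset_2d` and `offset_3d` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major offset of element (x, y) in a 2D array stored in 1D memory. */
size_t offset_2d(size_t x, size_t y, size_t width)
{
    assert(x < width);
    return y * width + x;
}

/* Row-major offset of element (x, y, z) in a 3D array stored as a stack
 * of 2D slices, each slice holding width * height elements. */
size_t offset_3d(size_t x, size_t y, size_t z, size_t width, size_t height)
{
    assert(x < width && y < height);
    return (z * height + y) * width + x;
}
```

In this layout horizontal neighbors sit next to each other in memory, while vertical neighbors are `width` elements apart. That asymmetry is the kind of locality difference the color coding on the slide refers to.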
16
Data-Flow: Gather and Scatter
CPU
• Arbitrary gather
• Arbitrary scatter
GPU
• Arbitrary gather
• Restricted scatter
[Diagrams: Input to Output data flow for gather and for scatter, on CPU and GPU]
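The two access patterns differ in which side carries the computed index. The sketch below shows both in C, plus one possible workaround for restricted scatter: when the index map is a permutation, a scatter can be rewritten as a gather through the inverse map. The function names are hypothetical, and the rewrite is valid only under that permutation assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Gather: each output element reads from a computed input location. */
void gather(const float *in, const size_t *idx, float *out, size_t n)
{
    assert(in != out);
    for (size_t i = 0; i < n; i++)
        out[i] = in[idx[i]];
}

/* Scatter: each input element writes to a computed output location. */
void scatter(const float *in, const size_t *idx, float *out, size_t n)
{
    assert(in != out);
    for (size_t i = 0; i < n; i++)
        out[idx[i]] = in[i];
}

/* Inverse of a permutation: inv[perm[i]] = i.
 * Assumes perm contains each of 0..n-1 exactly once. */
void invert_permutation(const size_t *perm, size_t *inv, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(perm[i] < n);
        inv[perm[i]] = i;
    }
}
```

With `inv` computed from `idx`, `gather(in, inv, out, n)` fills `out` exactly as `scatter(in, idx, out, n)` would, but every output element is written at its own position only.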
17
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
18
1) Computational Performance
[Chart: GFLOPS; ATI R520 labeled. Chart courtesy of John Owens]
Note: Sustained performance is usually much lower and depends heavily on the memory system!
19
2) Memory Performance
• CPU
– Large cache
– Few processing elements
– Optimized for spatial and temporal data reuse
• GPU
– Small cache
– Many processing elements
– Optimized for sequential (streaming) data access
[Chart: GeForce 7800 GTX versus Pentium 4; memory access types: Cache, Sequential, Random. Chart courtesy of Ian Buck]
20
3) Configuration Overhead
[Chart: configuration limited versus computation limited regimes. Chart courtesy of Ian Buck]
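The two regimes can be illustrated with a deliberately simple cost model. This model is an assumption made here, not something taken from the chart: each pass pays a fixed configuration time `t_config` and then processes elements at a constant `rate`. The function names and parameters are hypothetical.

```c
#include <assert.h>

/* Total time for one pass over n elements under the assumed linear model:
 * a fixed setup cost plus n elements processed at a constant rate. */
double pass_time(double t_config, double rate, double n)
{
    assert(rate > 0.0);
    return t_config + n / rate;
}

/* Element count at which configuration time equals computation time. */
double break_even_elements(double t_config, double rate)
{
    assert(rate > 0.0);
    return t_config * rate;
}

/* Returns 1 if the pass is configuration limited (setup dominates), else 0. */
int is_configuration_limited(double t_config, double rate, double n)
{
    return n < break_even_elements(t_config, rate);
}
```

Under this model, passes over small arrays spend most of their time on setup, so batching work into fewer, larger passes moves a computation from the configuration-limited regime toward the computation-limited one.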
21
Conclusions
• Parallelism is now indispensable to further increase performance
• Both memory-dominated and processing-element-dominated designs have pros and cons
• Mapping algorithms to the appropriate architecture allows enormous speedups
• Many of the GPU’s restrictions are crucial for parallel efficiency (Eat the cake or have it)