
Parallel programming: Introduction to GPU architecture

Sylvain Collange, Inria Rennes – Bretagne Atlantique

[email protected]


2

GPU internals

What makes a GPU tick?

NVIDIA GeForce GTX 980 Maxwell GPU. Artist rendering!


3

Outline

Computer architecture crash course

The simplest processor

Exploiting instruction-level parallelism

GPU, many-core: why, what for?

Technological trends and constraints

From graphics to general purpose

Forms of parallelism, how to exploit them

Why we need (so much) parallelism: latency and throughput

Sources of parallelism: ILP, TLP, DLP

Uses of parallelism: horizontal, vertical

Let's design a GPU!

Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD

Putting it all together

Architecture of current GPUs: cores, memory


4

The free lunch era... was yesterday

1980s to 2002: Moore's law, Dennard scaling, micro-architecture improvements

Exponential performance increase

Software compatibility preserved

Do not rewrite software, buy a new machine!

Hennessy, Patterson. Computer Architecture: A Quantitative Approach. 4th ed., 2006.


5

Computer architecture crash course

How does a processor work?

Or rather, how it worked in the 1980s and 1990s: modern processors are much more complicated!

An attempt to sum up 30 years of research in 15 minutes


6

Machine language: instruction set

Registers

State for computations

Keeps variables and temporaries

Instructions

Perform computations on registers, move data between registers and memory, branch…

Instruction word

Binary representation of an instruction

Assembly language

Readable form of machine language

Examples

Which instruction set does your laptop/desktop run?

Your cell phone?

Example: the instruction word 01100111 encodes ADD R1, R3

Registers: R0, R1, R2, R3, … R31
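To make the encoding idea concrete, here is a minimal decoding sketch in Python. The field layout (4-bit opcode, two 2-bit register fields) is invented so that the slide's example word 01100111 comes out as ADD R1, R3; real instruction sets use different, richer encodings.

```python
# Hypothetical instruction format: [opcode:4][rd:2][rs:2]
OPCODES = {0b0110: "ADD"}  # made-up opcode table, not a real ISA

def decode(word):
    opcode = (word >> 4) & 0b1111  # high 4 bits: which operation
    rd = (word >> 2) & 0b11        # next 2 bits: destination register
    rs = word & 0b11               # low 2 bits: source register
    return OPCODES[opcode], f"R{rd}", f"R{rs}"

print(decode(0b01100111))  # ('ADD', 'R1', 'R3')
```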


7

The Von Neumann processor

Let's look at it step by step

[Diagram: the full Von Neumann processor — Fetch unit with PC and +1 increment, Decoder producing operation and operands from the instruction word, Register file, Arithmetic and Logic Unit, Load/Store Unit, Branch Unit, state machine, Memory, result bus]


8

Step by step: Fetch

The processor maintains a Program Counter (PC)

Fetch: read the instruction word pointed by PC in memory

[Diagram: Fetch unit reads the instruction word 01100111 from Memory at the address in PC]


9

Decode

Split the instruction word to understand what it represents

Which operation? → ADD

Which operands? → R1, R3

[Diagram: Decoder splits the instruction word 01100111 into the operation ADD and the operands R1, R3]


10

Read operands

Get the value of registers R1, R3 from the register file

[Diagram: Register file returns the values 42, 17 for operands R1, R3]


11

Execute operation

Compute the result: 42 + 17

[Diagram: Arithmetic and Logic Unit computes 42 + 17 = 59]


12

Write back

Write the result back to the register file

[Diagram: the result 59 is written back to R1 over the result bus]


13

Increment PC

[Diagram: PC is incremented by 1 via the +1 adder]


14

Load or store instruction

Can read and write memory from a computed address

[Diagram: the Load/Store Unit reads and writes Memory at a computed address]


15

Branch instruction

Instead of incrementing PC, set it to a computed value

[Diagram: the Branch Unit sets PC to a computed target instead of PC+1]


16

What about the state machine?

The state machine controls everybody

Sequences the successive steps

Send signals to units depending on current state

At every clock tick, switch to the next state

Clock is a periodic signal (as fast as possible)

States: Fetch-decode → Read operands → Execute → Writeback → Increment PC
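The fetch–decode–execute–writeback cycle above can be sketched as a small interpreter loop. The instruction tuples are a made-up encoding, not a real ISA; the ADD R1, R3 example and the values 42 and 17 come from the slides.

```python
# A minimal Von Neumann interpreter: fetch, decode, execute,
# write back, increment PC.
memory = {0: ("ADD", 1, 3),   # R1 <- R1 + R3
          1: ("HALT",)}
regs = [0] * 32
regs[1], regs[3] = 42, 17

pc = 0
while True:
    insn = memory[pc]              # Fetch: read instruction word at PC
    op = insn[0]                   # Decode: operation and operands
    if op == "HALT":
        break
    if op == "ADD":
        rd, rs = insn[1], insn[2]  # Read operands from the register file
        regs[rd] = regs[rd] + regs[rs]  # Execute and write back
    pc += 1                        # Increment PC (no branches in this sketch)

print(regs[1])  # 59
```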


17

Recap

We can build a real processor

As it was in the early 1980's

You are here

How did processors become faster?


18

Reason 1: faster clock

Progress in semiconductor technology allows higher frequencies

Frequency scaling

But this is not enough!


19

Outline

Computer architecture crash course

The simplest processor

Exploiting instruction-level parallelism

GPU, many-core: why, what for?

Technological trends and constraints

From graphics to general purpose

Forms of parallelism, how to exploit them

Why we need (so much) parallelism: latency and throughput

Sources of parallelism: ILP, TLP, DLP

Uses of parallelism: horizontal, vertical

Let's design a GPU!

Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD

Putting it all together

Architecture of current GPUs: cores, memory


20

Going faster using ILP: pipeline

Idea: we do not have to wait until instruction n has finished to start instruction n+1

Like a factory assembly line

Or the bandeijão


21

Pipelined processor

Independent instructions can follow each other

Exploits ILP to hide instruction latency

Program

1: add r1, r3
2: mul r2, r3
3: load r3, [r1]

Fetch Decode Execute Writeback

1: add


22

Pipelined processor

Independent instructions can follow each other

Exploits ILP to hide instruction latency

Program

1: add r1, r3
2: mul r2, r3
3: load r3, [r1]

Fetch Decode Execute Writeback

In flight: 1: add, 2: mul


23

Pipelined processor

Independent instructions can follow each other

Exploits ILP to hide instruction latency

Program

1: add r1, r3
2: mul r2, r3
3: load r3, [r1]

Fetch Decode Execute Writeback

In flight: 1: add, 2: mul, 3: load
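The overlap shown on these slides can be sketched by computing, for each cycle, which instruction occupies each stage. This assumes independent instructions and no stalls; the 4 stages and the 3-instruction program come from the slides.

```python
# In an ideal pipeline, instruction i occupies stage s during cycle i + s.
stages = ["Fetch", "Decode", "Execute", "Writeback"]
program = ["add", "mul", "load"]

def timeline(program, n_stages):
    """Return, per cycle, the instruction in each stage ('-' if empty)."""
    n_cycles = len(program) + n_stages - 1
    rows = []
    for cycle in range(n_cycles):
        row = []
        for s in range(n_stages):
            i = cycle - s
            row.append(program[i] if 0 <= i < len(program) else "-")
        rows.append(row)
    return rows

for cycle, row in enumerate(timeline(program, len(stages)), start=1):
    print(f"cycle {cycle}: " + "  ".join(f"{st}:{ins}" for st, ins in zip(stages, row)))
```

Three instructions finish in 6 cycles instead of 12: the pipeline hides instruction latency as long as instructions are independent.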


24

Superscalar execution

Multiple execution units in parallel

Independent instructions can execute at the same time

Exploits ILP to increase throughput

[Diagram: one shared Fetch unit feeding multiple parallel Decode → Execute → Writeback pipelines]


26

Locality

Time to access main memory: ~200 clock cycles

One memory access every few instructions

Are we doomed?

Fortunately: principle of locality

~90% of memory accesses on ~10% of data

Accessed locations are often the same

Temporal locality: access the same location at different times

Spatial locality: access locations close to each other


27

Caches

Large memories are slower than small memories

The computer theorists lied to you: in the real world, access in an array of size n costs O(log n), not O(1)!

Think about looking up a book in a small or huge library

Idea: put frequently-accessed data in small, fast memory

Can be applied recursively: hierarchy with multiple levels of cache

L1 cache: 64 KB capacity, 2 ns access time

L2 cache: 1 MB, 10 ns

L3 cache: 8 MB, 30 ns

Memory: 8 GB, 60 ns
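A quick way to see why the hierarchy pays off is an average access time calculation. The latencies are the slide's; the fractions of accesses served by each level are made-up illustrative numbers in the spirit of the 90%/10% locality figure.

```python
# (level, access time in ns, fraction of accesses served by that level)
# Fractions are hypothetical; latencies are from the slide.
levels = [
    ("L1", 2, 0.90),
    ("L2", 10, 0.06),
    ("L3", 30, 0.03),
    ("DRAM", 60, 0.01),
]
# Simplification: each access pays only the latency of the level that
# finally serves it (real hierarchies accumulate miss-path latencies).
amat = sum(latency * fraction for _, latency, fraction in levels)
print(f"average access time: {amat:.2f} ns")  # vs. 60 ns if all went to DRAM
```

With 90% of accesses hitting L1, the average access time is close to the L1 latency rather than the DRAM latency.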


28

Branch prediction

What if we have a branch?

We do not know the next PC to fetch from until the branch executes

Solution 1: wait until the branch is resolved

Problem: programs have 1 branch every 5 instructions on average

We would spend most of our time waiting

Solution 2: predict (guess) the most likely direction

If correct, we have bought some time

If wrong, just go back and start over

Modern CPUs can correctly predict over 95% of branches

World record holder: 1.691 mispredictions / 1000 instructions

General concept: speculation

P. Michaud and A. Seznec. "Pushing the branch predictability limits with the multi-poTAGE+ SC predictor." JWAC-4: Championship Branch Prediction (2014).
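As a toy illustration of dynamic prediction, here is the classic 2-bit saturating counter — not the TAGE-class predictors cited above, which are far more elaborate. On a loop branch taken 9 times out of 10, it mispredicts only the loop exits:

```python
# 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
class TwoBitPredictor:
    def __init__(self):
        self.counter = 2  # start weakly taken

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

# A loop branch: taken 9 times, then falls through once, three times over.
p = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 3
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
print(f"{correct}/{len(outcomes)} correct")  # 27/30: only loop exits miss
```

The 2-bit hysteresis is what keeps a single loop exit from flipping the prediction for the next iteration of the outer loop.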


29

Example CPU: Intel Core i7 Haswell

Up to 192 instructions in flight

May be 48 predicted branches ahead

Up to 8 instructions/cycle executed out of order

About 25 pipeline stages at ~4 GHz

Quiz: how far does light travel during the 0.25 ns of a clock cycle?

Too complex to explain in 1 slide, or even 1 lecture

David Kanter, Intel's Haswell CPU architecture, RealWorldTech, 2012http://www.realworldtech.com/haswell-cpu/


30

Recap

Many techniques to run sequential programs as fast as possible

Discovers and exploits parallelism between instructions

Speculates to remove dependencies

Works on existing binary programs, without rewriting or re-compiling

Upgrading hardware is cheaper than improving software

Extremely complex machine


31

Outline

Computer architecture crash course

The simplest processor

Exploiting instruction-level parallelism

GPU, many-core: why, what for?

Technological trends and constraints

From graphics to general purpose

Forms of parallelism, how to exploit them

Why we need (so much) parallelism: latency and throughput

Sources of parallelism: ILP, TLP, DLP

Uses of parallelism: horizontal, vertical

Let's design a GPU!

Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD

Putting it all together

Architecture of current GPUs: cores, memory


32

Technology evolution

Memory wall

Memory speed does not increase as fast as computing speed

More and more difficult to hide memory latency

Power wall

Power consumption of transistors does not decrease as fast as density increases

Performance is now limited by power consumption

ILP wall

Law of diminishing returns on Instruction-Level Parallelism

Pollack's rule: cost ≃ performance²

[Graphs: cost vs. serial performance; the growing gap between compute and memory performance over time; transistor density, per-transistor power and total power over time]


33

Usage changes

New applications demand parallel processing

Computer games: 3D graphics

Search engines, social networks… “big data” processing

New computing devices are power-constrained

Laptops, cell phones, tablets…

Small, light, battery-powered

Datacenters

High power supply and cooling costs


34

Latency vs. throughput

Latency: time to solution

CPUs

Minimize time, at the expense of power

Throughput: quantity of tasks processed per unit of time

GPUs

Assumes unlimited parallelism

Minimize energy per operation


35

Amdahl's law

Bounds speedup attainable on a parallel machine

S = 1 / ((1 − P) + P/N)

[Graph: speedup S vs. number of available processors N; execution time split into time to run sequential portions and time to run parallel portions]

G. Amdahl. Validity of the Single Processor Approach to Achieving Large-ScaleComputing Capabilities. AFIPS 1967.

S: speedup, P: ratio of parallel portions, N: number of processors
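Plugging numbers into the formula shows how quickly the sequential fraction comes to dominate:

```python
# Amdahl's law from the slide: S = 1 / ((1 - P) + P/N),
# with P the parallel fraction and N the number of processors.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelized, speedup saturates near 20x:
for n in (4, 16, 256, 10**6):
    print(f"N = {n:>7}: S = {amdahl_speedup(0.95, n):.2f}")
```

The limit as N grows is 1 / (1 − P): with a 5% sequential portion, no amount of hardware gets past 20×.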


36

Why heterogeneous architectures?

Latency-optimized multi-core (CPU)

Low efficiency on parallel portions: spends too many resources

Throughput-optimized multi-core (GPU)

Low performance on sequential portions

S = 1 / ((1 − P) + P/N)

Heterogeneous multi-core (CPU+GPU)

Use the right tool for the right job

Allows aggressive optimization for latency or for throughput

[Graph: time to run sequential portions vs. time to run parallel portions on each kind of multi-core]

M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.


37

Example: System on Chip for smartphone

Big cores for applications

Small cores for background activity

GPU

Special-purpose accelerators

Lots of interfaces


38

Outline

Computer architecture crash course

The simplest processor

Exploiting instruction-level parallelism

GPU, many-core: why, what for?

Technological trends and constraints

From graphics to general purpose

Forms of parallelism, how to exploit them

Why we need (so much) parallelism: latency and throughput

Sources of parallelism: ILP, TLP, DLP

Uses of parallelism: horizontal, vertical

Let's design a GPU!

Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD

Putting it all together

Architecture of current GPUs: cores, memory


39

Vertices

The (simplest) graphics rendering pipeline

Fragments

Clipping, rasterization, attribute interpolation

Z-compare, blending

Pixels

Vertex shader

Fragment shader

Primitives (triangles…)

Framebuffer

Programmable stage

Parametrizable stage

Textures

Z-Buffer


40

How much performance do we need

… to run 3DMark 11 at 50 frames/second?

Element Per frame Per second

Vertices 12.0M 600M

Primitives 12.6M 630M

Fragments 180M 9.0G

Instructions 14.4G 720G

Intel Core i7 2700K: 56 Ginsn/s peak

We need to go 13x faster

Make a special-purpose accelerator

Source: Damien Triolet, Hardware.fr
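The 13× figure is just the ratio of the two numbers above:

```python
# Ratio of the instruction throughput needed for 50 fps to the CPU's peak,
# both taken from the slide.
needed = 720e9    # instructions per second required
cpu_peak = 56e9   # Intel Core i7 2700K peak
ratio = needed / cpu_peak
print(f"{ratio:.1f}x")  # ~12.9x, rounded up to 13x on the slide
```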


42

Beginnings of GPGPU

[Timeline, 2000–2010:
Microsoft DirectX: 7.x, 8.0, 8.1, 9.0, 9.0a/b/c, 10.0, 10.1, 11
NVIDIA: NV10, NV20, NV30, NV40, G70, G80-G90, GT200, GF100
ATI/AMD: R100, R200, R300, R400, R500, R600, R700, Evergreen
Milestones: programmable shaders, dynamic control flow, FP 16 / FP 24 / FP 32 / FP 64 arithmetic, unified shaders, SIMT, CTM, CAL, CUDA, growing GPGPU traction]


43

Today: what do we need GPUs for?

1. 3D graphics rendering for games

Complex texture mapping, lighting computations…

2. Computer Aided Design workstations

Complex geometry

3. GPGPU

Complex synchronization, data movements

One chip to rule them all

Find the common denominator


44

Outline

Computer architecture crash course

The simplest processor

Exploiting instruction-level parallelism

GPU, many-core: why, what for?

Technological trends and constraints

From graphics to general purpose

Forms of parallelism, how to exploit them

Why we need (so much) parallelism: latency and throughput

Sources of parallelism: ILP, TLP, DLP

Uses of parallelism: horizontal, vertical

Let's design a GPU!

Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD

Putting it all together

Architecture of current GPUs: cores, memory


45

Little's law: data = throughput × latency

[Chart: throughput (GB/s) vs. latency (ns) for the Intel Core i7 920 and NVIDIA GeForce GTX 580 memory hierarchies (L1, L2, DRAM); e.g. GTX 580 DRAM: 190 GB/s at 350 ns]

J. Little. A Proof for the Queuing Formula L = λW. Operations Research, 1961.


46


Hiding memory latency with pipelining

Memory throughput: 190 GB/s

Memory latency: 350 ns

Data in flight = 66 500 Bytes

At 1 GHz: 190 bytes/cycle, 350 cycles to wait

[Diagram: pipelined memory requests over time — 65 KB in flight at 190 B/cycle throughput and 350-cycle latency]
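The 66 500 bytes follow directly from Little's law with the slide's throughput and latency:

```python
# Little's law applied to memory, as on the slide:
# data in flight = throughput x latency.
throughput = 190e9   # bytes per second (190 GB/s)
latency = 350e-9     # seconds (350 ns)
in_flight = throughput * latency
print(f"{in_flight / 1024:.1f} KiB in flight")  # ~65 KB, i.e. 66 500 bytes
```

To keep the memory bus saturated, that many bytes' worth of requests must be outstanding at all times.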


47

Consequence: more parallelism

GPU vs. CPU

8× more parallelism to feed more units (throughput)

8× more parallelism to hide longer latency

64× more total parallelism

How to find this parallelism?

[Diagram: grid of in-flight requests — 8× wider in space and 8× deeper in time than the CPU case]


48

Sources of parallelism

ILP: Instruction-Level Parallelism

Between independent instructions in a sequential program

TLP: Thread-Level Parallelism

Between independent execution contexts: threads

DLP: Data-Level Parallelism

Between elements of a vector: same operation on several elements

add r3 ← r1, r2
mul r0 ← r0, r1
sub r1 ← r3, r0

Thread 1 and Thread 2 run in parallel (add ∥ mul)

vadd r ← a, b: r1 = a1 + b1, r2 = a2 + b2, r3 = a3 + b3 in parallel


49

Example: X ← a×X

In-place scalar-vector product: X ← a×X

Sequential (ILP): for i = 0 to n-1 do: X[i] ← a * X[i]

Threads (TLP): launch n threads: X[tid] ← a * X[tid]

Vector (DLP): X ← a * X

Or any combination of the above
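The three formulations can be sketched in Python. This is illustrative only: real TLP code would not spawn one OS thread per element, and real DLP would use vector instructions rather than a list comprehension.

```python
# The slide's three formulations of X <- a*X.
import threading

a, n = 2.0, 8

# Sequential (ILP): independent iterations a superscalar core can overlap
X = [float(i) for i in range(n)]
for i in range(n):
    X[i] = a * X[i]

# Threads (TLP): one thread per element, indexed by tid
Y = [float(i) for i in range(n)]
def work(tid):
    Y[tid] = a * Y[tid]
threads = [threading.Thread(target=work, args=(t,)) for t in range(n)]
for t in threads: t.start()
for t in threads: t.join()

# Vector (DLP): one operation over the whole array
Z = [a * float(i) for i in range(n)]

assert X == Y == Z  # all three compute the same result
```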


50

Uses of parallelism

“Horizontal” parallelism for throughput

More units working in parallel

“Vertical” parallelism for latency hiding

Pipelining: keep units busy when waiting for dependencies, memory

[Diagram: units A–D side by side (horizontal → more throughput) vs. stacked across cycles 1–4 (vertical → latency hiding)]


51

How to extract parallelism?

     | Horizontal      | Vertical
ILP  | Superscalar     | Pipelined
TLP  | Multi-core, SMT | Interleaved / switch-on-event multithreading
DLP  | SIMD / SIMT     | Vector / temporal SIMT

We have seen the first row: ILP

We will now review techniques for the next rows: TLP, DLP


52

Outline

Computer architecture crash course

The simplest processor

Exploiting instruction-level parallelism

GPU, many-core: why, what for?

Technological trends and constraints

From graphics to general purpose

Forms of parallelism, how to exploit them

Why we need (so much) parallelism: latency and throughput

Sources of parallelism: ILP, TLP, DLP

Uses of parallelism: horizontal, vertical

Let's design a GPU!

Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD

Putting it all together

Architecture of current GPUs: cores, memory


53

Sequential processor

Focuses on instruction-level parallelism

Exploits ILP: vertically (pipelining) and horizontally (superscalar)

Source code:

for i = 0 to n-1: X[i] ← a * X[i]

Machine code:

move i ← 0
loop:
  load t ← X[i]
  mul t ← a × t
  store X[i] ← t
  add i ← i+1
  branch i<n? loop

[Diagram: sequential CPU (Fetch, Decode, Execute, Memory stages) executing add i ← 18, store X[17], mul]


54

The incremental approach: multi-core

Source: Intel

Intel Sandy Bridge

Several processors on a single chip sharing one memory space

Area: benefits from Moore's law

Power: extra cores consume little when not in use

e.g. Intel Turbo Boost


55

Homogeneous multi-core

Horizontal use of thread-level parallelism

Improves peak throughput

[Diagram: two cores, each with its own Fetch/Decode/Execute/Load-Store pipeline, sharing Memory; thread T0 runs add i ← 18, store X[17], mul while thread T1 runs add i ← 50, store X[49], mul]


56

Example: Tilera Tile-GX

Grid of (up to) 72 tiles

Each tile: 3-way VLIW processor, 5 pipeline stages, 1.2 GHz

[Diagram: 9 × 8 grid of tiles]


57

Interleaved multi-threading

Vertical use of thread-level parallelism

Hides latency thanks to explicit parallelism; improves achieved throughput

[Diagram: one pipeline (Fetch, Decode, Execute, load-store unit, Memory) interleaving instructions from threads T0–T3: mul, add i ← 50, load X[89], store X[72], load X[17], store X[49], add i ← 73]


58

Example: Oracle Sparc T5

16 cores / chip

Core: out-of-order superscalar, 8 threads

15 pipeline stages, 3.6 GHz

[Diagram: cores 1–16, each cycling through threads 1–8]


59

Clustered multi-core

For each individual unit, select between

Horizontal replication

Vertical time-multiplexing

Examples

Sun UltraSparc T2, T3

AMD Bulldozer

IBM Power 7

Area-efficient tradeoff

Blurs boundaries between cores

[Diagram: two clusters sharing pipeline stages (Fetch, Decode, EX, L/S Unit, Memory); threads T0–T3 issue mul, add i ← 50, load X[89], store X[72], load X[17], store X[49], add i ← 73, branch, store]


60

Implicit SIMD

In NVIDIA-speak

SIMT: Single Instruction, Multiple Threads

Convoy of synchronized threads: warp

Extracts DLP from multi-thread applications

[Diagram: one Fetch/Decode front-end issues (0-3) load, per-lane mul (0)–(3), and (0-3) store to four execution lanes (threads T0–T3) sharing Memory]

Factorization of fetch/decode, load-store units

Fetch 1 instruction on behalf of several threads

Read 1 memory location and broadcast to several registers


61

Explicit SIMD

Single Instruction Multiple Data

Horizontal use of data level parallelism

Examples

Intel MIC (16-wide)

AMD GCN GPU (16-wide×4-deep)

Most general purpose CPUs (4-wide to 8-wide)

loop:
  vload T ← X[i]
  vmul T ← a × T
  vstore X[i] ← T
  add i ← i+4
  branch i<n? loop

Machine code

SIMD CPU

[Diagram: SIMD CPU pipeline (F, D, X, Mem) executing vmul, add i ← 20, vstore X[16..19]]


62

Quiz: link the words

Parallelism

ILP

TLP

DLP

Use

Horizontal:more throughput

Vertical:hide latency

Architectures

Superscalar processor

Homogeneous multi-core

Multi-threaded core

Clustered multi-core

Implicit SIMD

Explicit SIMD



65

Outline

Computer architecture crash course

The simplest processor

Exploiting instruction-level parallelism

GPU, many-core: why, what for?

Technological trends and constraints

From graphics to general purpose

Forms of parallelism, how to exploit them

Why we need (so much) parallelism: latency and throughput

Sources of parallelism: ILP, TLP, DLP

Uses of parallelism: horizontal, vertical

Let's design a GPU!

Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD

Putting it all together

Architecture of current GPUs: cores, memory


66

Hierarchical combination

Both CPUs and GPUs combine these techniques

Multiple cores

Multiple threads/core

SIMD units
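These levels compose multiplicatively. A back-of-the-envelope sketch (the chip parameters below are illustrative assumptions, not vendor specifications):

```python
# The three mechanisms above compose multiplicatively.
# Parameters are illustrative assumptions, not measured specs.

def parallel_slots(cores, threads_per_core, simd_width):
    """Number of scalar operation slots per cycle when every core,
    hardware-thread slot, and SIMD lane is occupied."""
    return cores * threads_per_core * simd_width

# CPU-like chip: 4 cores, 2 threads/core, 8-wide SIMD
print(parallel_slots(4, 2, 8))      # 64
# GPU-like chip: 16 multiprocessors, 48 warps each, 32-wide SIMT
print(parallel_slots(16, 48, 32))   # 24576
```

Note that SMT thread slots share execution units, so this counts occupancy slots rather than sustained throughput.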

67

Example CPU: Intel Core i7

Is a wide superscalar, but also has:

Multicore

Multi-thread / core

SIMD units

Up to 117 operations/cycle from 8 threads

256-bit SIMD units: AVX

Wide superscalar

Simultaneous Multi-Threading: 2 threads

4 CPU cores
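The 117 operations/cycle figure counts a broad instruction mix. A simpler floating-point-only bound can be sketched under assumed parameters (a Haswell-like chip with 4 cores, 2 FMA-capable 256-bit AVX units per core, 8 fp32 lanes per unit):

```python
# Hedged sketch: peak fp32 operations per cycle for a hypothetical
# Haswell-like chip. All parameters are assumptions for illustration.
def peak_fp32_per_cycle(cores, fma_units, lanes_per_unit, ops_per_fma=2):
    # Each FMA counts as two operations (multiply + add) per lane.
    return cores * fma_units * lanes_per_unit * ops_per_fma

print(peak_fp32_per_cycle(4, 2, 8))  # 128
```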

68

Example GPU: NVIDIA GeForce GTX 580

SIMT: warps of 32 threads

16 SMs / chip

2×16 cores / SM, 48 warps / SM

Up to 512 operations per cycle from 24576 threads in flight

[Figure: inside each SM (SM1 … SM16), warps are interleaved over time on the cores: odd warps 1, 3, …, 47 run on cores 1–16 and even warps 2, 4, …, 48 on cores 17–32.]
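The headline numbers on this slide follow directly from the configuration:

```python
# GTX 580 configuration, from the figures above
sms = 16                 # streaming multiprocessors per chip
cores_per_sm = 2 * 16    # two groups of 16 cores per SM
warps_per_sm = 48        # resident warps per SM
warp_width = 32          # threads per warp

ops_per_cycle = sms * cores_per_sm                   # horizontal: throughput
threads_in_flight = sms * warps_per_sm * warp_width  # vertical: latency hiding

print(ops_per_cycle)       # 512
print(threads_in_flight)   # 24576
```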

69

Taxonomy of parallel architectures

        Horizontal            Vertical
ILP     Superscalar / VLIW    Pipelined
TLP     Multi-core, SMT       Interleaved / switch-on-event multithreading
DLP     SIMD / SIMT           Vector / temporal SIMT

70

Classification: multi-core

                    ILP   DLP (SIMD)   Horizontal TLP   Vertical TLP
                                       (cores)          (threads/core)
Oracle Sparc T5      2        –          16                8
Intel Haswell        8     8 (AVX)        4                2 (Hyperthreading)
IBM Power 8         10        4          12                8
Fujitsu SPARC64 X    8        2          16                2

General-purpose multi-cores: balance ILP, TLP and DLP

Sparc T: focus on TLP

71

Classification: GPU and many small-core

                    ILP   DLP (SIMD/SIMT)   Horizontal TLP   Vertical TLP
                                            (cores×units)    (multi-threading)
Intel MIC            2         16             60                4
Nvidia Kepler        2         32             16×4              32
AMD GCN              4         16             20×4              40
Kalray MPPA-256      5          –             17×16             –
Tilera Tile-GX       3          –             72                –

GPU: focus on DLP and TLP, horizontal and vertical

Many small-core: focus on horizontal TLP

72

Takeaway

All processors use hardware mechanisms to turn parallelism into performance

GPUs focus on Thread-level and Data-level parallelism

73

Outline

Computer architecture crash course

The simplest processor

Exploiting instruction-level parallelism

GPU, many-core: why, what for?

Technological trends and constraints

From graphics to general purpose

Forms of parallelism, how to exploit them

Why we need (so much) parallelism: latency and throughput

Sources of parallelism: ILP, TLP, DLP

Uses of parallelism: horizontal, vertical

Let's design a GPU!

Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD

Putting it all together

Architecture of current GPUs: cores, memory

74

Computation cost vs. memory cost

Power measurements on NVIDIA GT200

                                  Energy/op (nJ)   Total power (W)
Instruction control                    1.8               18
Multiply-add on a 32-wide warp         3.6               36
Load 128B from DRAM                    80                90

With the same amount of energy, we can either:
- Load 1 word from external memory (DRAM), or
- Compute 44 flops

Must optimize memory accesses first!
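The 44-flops-per-word figure follows from the table: one 128-byte DRAM transaction costs 80 nJ, i.e. 2.5 nJ per 4-byte word, while a 32-wide multiply-add (64 flops) costs 3.6 nJ:

```python
# Energy figures from the GT200 measurements above
dram_load_nj = 80.0    # one 128-byte DRAM load
madd_warp_nj = 3.6     # multiply-add on a 32-wide warp

energy_per_word = dram_load_nj / (128 // 4)   # 2.5 nJ per 4-byte word
energy_per_flop = madd_warp_nj / (32 * 2)     # 2 flops per lane (mul + add)

print(round(energy_per_word / energy_per_flop))  # 44
```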

75

External memory: discrete GPU

Classical CPU-GPU model

Split memory spaces

Highest bandwidth from GPU memory

Transfers to main memory are slower

[Figure: CPU with 8 GB main memory (26 GB/s), linked over PCI Express (16 GB/s) to a GPU with 3 GB graphics memory (290 GB/s). Example: Intel Core i7 4770 + Nvidia GeForce GTX 780.]
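A consequence of the bandwidth gap: data moved once over PCI Express must be reused many times from graphics memory to keep the GPU busy. A rough estimate from the bandwidths above:

```python
pcie_gbs = 16.0        # CPU <-> GPU transfer bandwidth
gpu_mem_gbs = 290.0    # GPU <-> graphics memory bandwidth

# Each byte sent over PCIe must be read about 18 times from graphics
# memory just to keep the two links equally busy.
min_reuse = gpu_mem_gbs / pcie_gbs
print(round(min_reuse))  # 18
```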

76

External memory: embedded GPU

Most GPUs today

Same memory

May support memory coherence

GPU can read directly from CPU caches

More contention on external memory

[Figure: CPU and GPU on the same chip, sharing a cache and 8 GB of main memory at 26 GB/s.]

77

GPU: on-chip memory

Conventional wisdom

Cache area in CPU vs. GPU, according to the NVIDIA CUDA Programming Guide: [figure]

But... if we include registers:

Register files + caches
  NVIDIA GM204 GPU:   8.3 MB
  AMD Hawaii GPU:    15.8 MB
  Intel Core i7 CPU:  9.3 MB

GPU/accelerator internal memory exceeds desktop CPUs

78

Registers: CPU vs. GPU

Registers keep the contents of local variables.

Typical values:

                      CPU       GPU
Registers/thread       32        32
Registers/core        256     65536
Read / write ports  10R/5W     2R/1W

GPU: many more registers, but made of simpler memory
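Multiplying out the typical values shows where the area goes; a minimal sketch, assuming 4-byte (32-bit) registers:

```python
BYTES_PER_REG = 4  # assumed 32-bit registers

def regfile_kib(regs_per_core):
    """Register-file capacity per core in KiB."""
    return regs_per_core * BYTES_PER_REG / 1024

print(regfile_kib(256))     # CPU core: 1 KiB
print(regfile_kib(65536))   # GPU SM: 256 KiB
```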

79

Internal memory: GPU

Cache hierarchy

Keep frequently-accessed data

Reduce throughput demand on main memory

Managed by hardware (L1, L2) or software (shared memory)

[Figure: each core has a private L1 cache (~1 MB aggregate, ~2 TB/s), connected through a crossbar to partitioned L2 caches (6 MB total) that access external memory at 290 GB/s.]

80

Caches: CPU vs. GPU

On CPU, caches are designed to avoid memory latency; reducing throughput demand is a side effect.

On GPU, multi-threading deals with memory latency; caches are used to improve throughput (and energy).

              CPU                    GPU
Latency       Caches, prefetching    Multi-threading
Throughput    (side effect)          Caches
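How much multi-threading does latency hiding take? Little's law gives the number of memory requests that must be in flight: in-flight bytes = bandwidth × latency. The 290 GB/s bandwidth is from the earlier slide; the 400 ns DRAM latency is an assumed figure for illustration.

```python
bandwidth_gb_s = 290.0   # GPU memory bandwidth (earlier slide)
latency_ns = 400.0       # assumed DRAM access latency

# GB/s × ns conveniently cancels to bytes (1e9 × 1e-9 = 1)
bytes_in_flight = bandwidth_gb_s * latency_ns
loads_in_flight = bytes_in_flight / 128   # 128-byte transactions

print(round(loads_in_flight))  # 906
```

Hundreds of outstanding loads require thousands of resident threads, which is exactly what the warp scheduling above provides.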

81

GPU: thousands of cores?

Computational resources

NVIDIA GPUs   G80/G92   GT200    GF100    GK104    GK110    GM204
              (2006)    (2008)   (2010)   (2012)   (2012)   (2014)
Exec. units     128       240      512      1536     2688     2048
SMs              16        30       16        8        14       16

AMD GPUs      R600     R700    Evergreen   NI       SI       VI
              (2007)   (2008)  (2009)      (2010)   (2012)   (2013)
Exec. units    320      800     1600        1536     2048     2560
SIMD CUs        4        10       20          24       32       40

Number of clients in the interconnection network (cores/SMs) stays limited
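Dividing execution units by SM count shows the trend the slide points at: the number of interconnect clients stays roughly flat while each SM grows wider. A quick check on the NVIDIA figures:

```python
# (execution units, SMs) per generation, from the table above
nvidia = {"G80/G92": (128, 16), "GT200": (240, 30), "GF100": (512, 16),
          "GK104": (1536, 8), "GK110": (2688, 14), "GM204": (2048, 16)}

# Execution units per SM: grows from 8 to ~200 across generations
units_per_sm = {chip: u // sm for chip, (u, sm) in nvidia.items()}
print(units_per_sm["G80/G92"], units_per_sm["GK110"])  # 8 192
```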

82

Takeaway

Result of many tradeoffs

Between locality and parallelism

Between core complexity and interconnect complexity

GPU optimized for throughput

Exploits primarily DLP, TLP

Energy-efficient on parallel applications with regular behavior

CPU optimized for latency

Exploits primarily ILP

Can use TLP and DLP when available

83

Next time

Next Tuesday, 1:00pm, room 2014: CUDA

Execution model

Programming model

API

Thursday, 1:00pm, room 2011: Lab work: what is my GPU and when should I use it?

There may be available seats even if you are not enrolled