
GPUDet: A Deterministic GPU Architecture

Hadi Jooybar1, Wilson Fung1, Mike O’Connor2, Joseph Devietti3, Tor M. Aamodt1

1 The University of British Columbia   2 AMD Research   3 University of Washington


• GPUs are …

• Fast

• Energy efficient

• Commodity hardware

But…

× Mostly used for a certain range of applications

Why?

Communication among concurrent threads (1000s of threads)


Motivation

BFS algorithm, published in HiPC 2007

__global__ void BFS_step_kernel(...) {
    if( active[tid] ) {
        active[tid] = false;
        visited[tid] = true;
        foreach (int id = neighbour_nodes){
            if( visited[id] == false ){
                cost[id] = cost[tid] + 1;
                active[id] = true;
                *over = true;
            }
        }
    }
}

[Figure: a three-vertex graph (V0, V1, V2) shown in three snapshots. The two labeled vertices start at Cost = -, Active = -, then both reach Cost = 1, Active = 1, and in the last snapshot one of them ends at Cost = 2, Active = 1.]
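The order dependence in this kernel can be reproduced on the host with a small C++ model. This is only a sketch under stated assumptions: the edge set (V0 -> V1, V0 -> V2, V1 -> V2), the names `BfsState`, `runThread`, and `runStep`, and the simplification that each thread runs to completion before the next one starts are all illustrative and not taken from the slides. The `*over` flag is left out because it does not affect the result shown here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical graph: V0 -> V1, V0 -> V2, V1 -> V2 (an assumption, not from the slides).
static const std::vector<std::vector<int>> kNeighbours = {{1, 2}, {2}, {}};

struct BfsState {
    std::vector<bool> active{true, false, false};   // V0 is the source vertex
    std::vector<bool> visited{false, false, false};
    std::vector<int> cost{0, -1, -1};               // -1 means "not reached yet"
};

// Body of BFS_step_kernel for a single thread, executed without interruption.
void runThread(BfsState& s, int tid) {
    if (!s.active[tid]) return;
    s.active[tid] = false;
    s.visited[tid] = true;
    for (int id : kNeighbours[tid]) {
        if (!s.visited[id]) {
            s.cost[id] = s.cost[tid] + 1;
            s.active[id] = true;
        }
    }
}

// Runs one kernel step with the threads scheduled in the given order.
BfsState runStep(const std::vector<int>& order) {
    BfsState s;
    for (int tid : order) {
        assert(tid >= 0 && tid < 3);
        runThread(s, tid);
    }
    return s;
}
```

In this model, `runStep({0, 1, 2})` leaves `cost[2] == 2`, because the thread for V1 sees `active[1]` already set by V0 within the same step, while `runStep({2, 1, 0})` leaves `cost[2] == 1`. The same input produces different outputs depending only on scheduling.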


I will debug it this time

What about debuggers?!

The bug may appear occasionally or in different places in each run.

OMG! Where was that bug?!

Motivation


GPUDet

Strong Determinism (hardware proposal)

• Same Outputs

• Same Execution Path

Makes the program easier to debug and test


GPUDet

Strong Determinism

• Same Outputs

• Same Execution Path

Makes the program easier to debug and test

× There is no free lunch

× Performance Overhead

Our goal is to provide deterministic execution on GPU architectures with acceptable performance overhead.


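GPU Architecture

[Figure: the CPU performs a kernel launch that creates workgroups (workgroup 0, workgroup 1, workgroup 2). The workgroups run on Compute Units, each with ALUs, a Memory Unit, and an L1 Cache. The Compute Units share an L2 Cache backed by DRAM.]

x = input[threadID];
y = func(x);
output[threadID] = y;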


Outline

• Introduction

• GPU Architecture

• Challenges

• Deterministic Execution with GPUDet

• GPUDet Optimizations

    • Workgroup-Aware Quantum Formation

    • Deterministic parallel commit using Z-Buffer Unit

    • Compute Unit level serialization

• Results and Conclusion


Deterministic GPU Execution Challenges

[Figure: Normal Execution, with threads T0 to T3 running freely, compared with deterministic execution divided into Quantum 0 … Quantum n. In each quantum, T0 to T3 go through an Isolation phase followed by a Communication phase.]

• Isolation mechanism

• Provide method to pause execution of a thread


Deterministic GPU Execution Challenges

• Isolation mechanism

• Lack of private caches

• Lack of cache coherency

• Provide method to pause execution of a thread

• Single Instruction Multiple Threads (SIMT)

• Potential deadlock condition

• Major changes in control flow hardware

• Performance overhead

[Figure: a workgroup and one of its wavefronts]


Deterministic GPU Execution Challenges

• Very large number of threads

• Expensive global synchronization

• Expensive serialization

• Different program properties

• Large number of short running threads

• Frequent workgroup synchronization

• Less locality in intra thread memory accesses


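Deterministic Execution of a Wavefront

if (tid < 16) x[tid%2] = tid;

[Figure: threads T0 to T15 of one wavefront send their stores to the Coalescing Unit: T0 writes x[0] = 0, T1 writes x[1] = 1, T2 writes x[0] = 2, …, T15 writes x[1] = 15. The stores to the same element form a data race.]

Coalescing Unit output:

Mask:     v   v   -   -   -   -   -   -   …   -
Address:  x
Data:     14  15  -   -   -   -   -   -   …   -

To memory: x[0] = 14, x[1] = 15, all other elements not modified

Execution of one wavefront is deterministic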
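The result on this slide (x[0] = 14, x[1] = 15) is what one gets if conflicting stores inside a wavefront are always resolved in a fixed lane order. The following host-side C++ sketch models that idea. The names `LaneStore`, `coalesceWavefrontStores`, and `runSlideExample`, the wavefront width of 32, and the rule that the highest active lane wins are assumptions chosen to match the slide, not a description of the actual coalescing hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One lane's pending store: whether the lane is active, which element it
// writes, and the value it writes.
struct LaneStore {
    bool active;
    std::size_t index;
    int value;
};

// Resolves all stores of one wavefront in ascending lane order, so that on a
// conflict the highest active lane always wins (assumed rule).
void coalesceWavefrontStores(const std::vector<LaneStore>& lanes,
                             std::vector<int>& memory) {
    for (const LaneStore& s : lanes) {
        if (!s.active) continue;
        assert(s.index < memory.size());
        memory[s.index] = s.value;  // later lanes overwrite earlier ones
    }
}

// Models `if (tid < 16) x[tid%2] = tid;` for a 32-lane wavefront.
std::vector<int> runSlideExample() {
    std::vector<int> x(4, -1);  // -1 marks elements that are never written
    std::vector<LaneStore> lanes;
    for (int tid = 0; tid < 32; ++tid) {
        lanes.push_back({tid < 16, static_cast<std::size_t>(tid % 2), tid});
    }
    coalesceWavefrontStores(lanes, x);
    return x;
}
```

Because the resolution order depends only on lane numbers, every run of `runSlideExample()` yields the same memory contents, which is the property the slide relies on.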


Deterministic GPU Execution Challenges

• Isolation mechanism

• Provide method to pause execution of a thread

[Figure: threads T0 to T3 alternating between Isolation and Communication phases, annotated "wavefront granularity" and "not a challenge anymore".]


• GPUDet-Basic

[Figure: Wavefronts with Local Memory and Store Buffers in front of Global Memory (Read Only); operations labeled Load Op, Commit, and Atomic Op.]

Reaching Quantum Boundary

1. Instruction Count
2. Atomic Operations
3. Memory Fences
4. Workgroup Barriers
5. Execution Complete
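A rough host-side C++ model of this scheme is sketched below. It is an illustration under assumptions, not GPUDet's implementation: the types `Op`, `Wavefront`, and `Memory`, the function names, the choice that a load checks the wavefront's own store buffer first, and the commit rule (the lowest wavefront ID wins on a conflict) are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

using Address = int;
using Memory = std::map<Address, int>;

enum class Op { Alu, Load, Store, Atomic, Fence, Barrier, Exit };

// True when a wavefront reaches a quantum boundary: an atomic operation, a
// memory fence, a workgroup barrier, the end of execution, or the
// instruction-count limit of the quantum.
bool reachesQuantumBoundary(Op op, int instructionsInQuantum, int quantumSize) {
    assert(quantumSize > 0);
    switch (op) {
        case Op::Atomic:
        case Op::Fence:
        case Op::Barrier:
        case Op::Exit:
            return true;
        default:
            return instructionsInQuantum >= quantumSize;
    }
}

struct Wavefront {
    int id = 0;
    Memory storeBuffer;  // private until the quantum commits

    // Stores stay in the private buffer, so other wavefronts cannot see them.
    void store(Address a, int v) { storeBuffer[a] = v; }

    // Loads see the wavefront's own buffered stores first, then the global
    // memory image, which is never modified inside a quantum.
    int load(const Memory& global, Address a) const {
        auto own = storeBuffer.find(a);
        if (own != storeBuffer.end()) return own->second;
        auto shared = global.find(a);
        return shared == global.end() ? 0 : shared->second;
    }
};

// Commits all store buffers in a fixed order (highest ID first), so the
// lowest wavefront ID writes last and wins every conflict (assumed rule).
void commitQuantum(std::vector<Wavefront>& wavefronts, Memory& global) {
    std::sort(wavefronts.begin(), wavefronts.end(),
              [](const Wavefront& a, const Wavefront& b) { return a.id > b.id; });
    for (Wavefront& w : wavefronts) {
        for (const auto& entry : w.storeBuffer) global[entry.first] = entry.second;
        w.storeBuffer.clear();
    }
}
```

Within a quantum each wavefront only observes global memory plus its own writes, so its behavior does not depend on how the other wavefronts are scheduled. Determinism of the final memory image then rests entirely on the commit rule.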


Workgroup-Aware Quantum Formation

• Extra global synchronizations

Load Imbalance

Reducing number of synchronizations

Avoid unnecessary quantum termination


Workgroup-Aware Quantum Formation

[Figure: stacked bar chart of % of Termination Reasons (0% to 100%) for AES, BFSr, BFSf, CFD, CP, HOTSP, LIB, LPS, SRAD, HT, ATM, and CLopt, broken down into Atomic Operations, Instruction Count, Execution Complete, and Workgroup Barriers.]

Quanta are finished by workgroup barriers

Workgroup-Aware Decision Making

All reach a workgroup barrier

Continue execution in the parallel mode
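One way to read this decision rule is sketched below in C++. The slide does not say which set of wavefronts "all" refers to, so the function simply takes the group under consideration as input. The enums `Boundary` and `Decision` and the function `decide` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <vector>

// Why a wavefront stopped; None means it is still running in this quantum.
enum class Boundary {
    None,
    InstructionCount,
    AtomicOperation,
    WorkgroupBarrier,
    ExecutionComplete
};

enum class Decision { Wait, ContinueParallel, EndQuantum };

// Delays the decision until every wavefront in the group has stopped. If all
// of them stopped at a workgroup barrier, execution continues in parallel
// mode. Any other combination ends the quantum (assumed fallback).
Decision decide(const std::vector<Boundary>& wavefronts) {
    assert(!wavefronts.empty());
    bool allAtBarrier = true;
    for (Boundary b : wavefronts) {
        if (b == Boundary::None) return Decision::Wait;
        if (b != Boundary::WorkgroupBarrier) allAtBarrier = false;
    }
    return allAtBarrier ? Decision::ContinueParallel : Decision::EndQuantum;
}
```

The point of the sketch is that a barrier alone does not have to force a global synchronization. Only a mix of stop reasons, such as a barrier in one wavefront and an atomic operation in another, falls back to ending the quantum.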


Workgroup-Aware Quantum Formation

[Figure: stacked bar chart of % of Termination Reasons (0% to 100%) for AES, BFSr, BFSf, CFD, CP, HOTSP, LIB, LPS, SRAD, HT, ATM, and CLopt, broken down into Atomic Operations, Instruction Count, Execution Complete, and Workgroup Barriers.]

Workgroup-Aware Decision Making

Finish execution of the Kernel function

Deterministic workgroup partitioning


Deterministic Parallel Commit using the Z-Buffer Unit

Z-Buffer Unit

Store Buffer Contents ≈ Color Values

Wavefront ID ≈ Depth Values

Depth Buffer (four successive states):

∞ ∞ ∞ ∞ ∞ ∞
∞ ∞ ∞ ∞ ∞ ∞
∞ ∞ ∞ ∞ ∞ ∞
∞ ∞ ∞ ∞ ∞ ∞
∞ ∞ ∞ ∞ ∞ ∞

∞ ∞ ∞ ∞ ∞ ∞
∞ ∞ ∞ ∞ ∞ ∞
7 7 7 ∞ ∞ ∞
7 7 7 ∞ ∞ ∞
7 7 7 ∞ ∞ ∞

8 8 8 8 8 8
8 8 8 8 8 8
7 7 7 8 8 8
7 7 7 8 8 8
7 7 7 8 8 8

8 8 5 5 8 8
8 8 5 5 5 8
7 5 5 5 5 5
7 5 5 5 5 5
5 5 5 5 5 5
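The depth-buffer analogy can be modeled on the host as a depth test per memory location: a store is applied only if its wavefront ID is smaller than the ID already recorded for that location. The C++ sketch below is an assumption-laden illustration of that idea. The names `Cell`, `BufferedStore`, `StoreBuffer`, `commitStore`, and `commitAll` are hypothetical, memory is assumed to start at 0, and each store buffer is assumed to hold at most one store per address.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// One memory location together with the depth value guarding it.
struct Cell {
    int depth = std::numeric_limits<int>::max();  // "infinity": nothing committed yet
    int value = 0;
};

struct BufferedStore {
    std::size_t address;
    int value;
};

struct StoreBuffer {
    int wavefrontId;
    std::vector<BufferedStore> stores;
};

// Depth test: the store is kept only if its wavefront ID is lower than the
// ID recorded for this location so far.
void commitStore(std::vector<Cell>& memory, int wavefrontId, const BufferedStore& s) {
    assert(s.address < memory.size());
    Cell& c = memory[s.address];
    if (wavefrontId < c.depth) {
        c.depth = wavefrontId;
        c.value = s.value;
    }
}

// Commits the buffers in whatever order they are given and returns the
// resulting memory values.
std::vector<int> commitAll(const std::vector<StoreBuffer>& buffers, std::size_t memorySize) {
    std::vector<Cell> memory(memorySize);
    for (const StoreBuffer& b : buffers) {
        for (const BufferedStore& s : b.stores) commitStore(memory, b.wavefrontId, s);
    }
    std::vector<int> values;
    for (const Cell& c : memory) values.push_back(c.value);
    return values;
}
```

Since the surviving value at each location is always the one from the lowest wavefront ID, the final memory image does not depend on the order in which buffers arrive. That order independence is what allows the commit to proceed in parallel while staying deterministic.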


Compute Unit Level Serialization

• GPUs preserve Point to Point Ordering

[Figure: six operations, each labeled A]

Serialization is only among compute units


Results

• GPGPU-Sim 3.0.2

Applications with atomic operations

[Figure: stacked bar chart of Normalized Execution Time (0 to 5) for AES, BFSr, BFSf, CFD, CP, HOTSP, LIB, LPS, SRAD, HT, ATM, and CLopt, split into Serial Mode, Commit Mode, and Parallel Mode.]

2x Slowdown


Quantum Formation

20% Performance Improvement for applications with barriers

19% Performance Improvement for applications with small kernel functions

[Figure: bar chart of Normalized Execution Time (0 to 5) per benchmark, with an AVG bar, comparing GPUDet-base, Workgroup Barrier, and End of the Kernel.]


Deterministic Parallel Commit using the Z-Buffer Unit

[Figure: bar chart of Normalized Execution Time (0 to 10) comparing Z-Buffer and Locking for AES, BFSr, BFSf, CFD, CP, HOTSP, LIB, LPS, SRAD, HT, ATM, and CLopt.]

60% Performance Improvement on Average


Compute Unit Level Serialization

[Figure: bar chart of Normalized Execution Time (0 to 14) comparing W-Ser and CU-Ser for CLopt, HT, and ATM, with the Serial Mode portion of each bar highlighted.]

6.1x Performance Improvement in Serial Mode


Conclusion

• Encourages programmers to use GPUs in a broader range of applications

• Exploits GPU characteristics to reduce performance overhead

    • Deterministic execution within a wavefront

    • Workgroup-aware quantum formation

    • Deterministic parallel commit using Z-Buffer Unit

    • Compute Unit level serialization

Questions?