GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik...

21
GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang , Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia Sudhanva Gurumurthi AMD Research 1

Transcript of GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik...

Page 1: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU

Applications

Bo Fang , Karthik Pattabiraman, Matei Ripeanu,

The University of British ColumbiaSudhanva Gurumurthi

AMD Research1

Page 2: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Graphic Processing Unit Main accelerators for general

purpose applications (e.g. TITAN , 18,000 GPUs)

2

Performance

PowerReliability

Trade-off

Page 3: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Reliability trends The soft error rate per chip will continue to

increase because of larger arrays; probability of soft errors within the combination logic will also increase. [Constantinescu, Micro 2003]

Source: final report for CCC cross-layer reliability visioning study (http://www.relxlayer.org/)

3

Page 4: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

GPGPUs vs Traditional GPUs Traditional: image rendering

GPGPU: DNA matchingATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTTCGTACTTTTACACAGTATATCGTGT

ATATTTTTTCTTGTTTTTTATATCCACAATCTCTTTTCGTACTTTTACACAGTATATCGTGT

Serious result !

Page 5: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Vulnerability and error resilience Vulnerability: Probability that the

system experiences a fault that causes a failure • Not consider the behaviors of

applications Error Resilience: Given a fault occurs,

how resilient the application is ✔

5Device/Circuit Level

Architectural Level

Operating System Level

Application Level

Architecture Vulnerability Factor study stops here

Error resilience study

Faults

Page 6: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Goal

6

Application-specific, software-

based, fault tolerance

mechanisms

Investigate the error-resilience

characteristics of

applications

Find correlatio

n with applicati

on propertie

s

Page 7: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

GPU-Qin: a methodology and a tool Fault injection: introduce faults in a

systematic, controlled manner and study the system’s behavior

GPU-Qin: on real hardware via cuda-gdb Advantages:

Representativeness efficiency

7

Page 8: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Challenges

High level: massive parallelism Inject into all the threads:

impossible! In practice: limitations in cuda-gdb

Breakpoint on source code level, not instruction level

Single-step to target instructions

8

Page 9: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Fault Model

Transient faults (e.g. soft errors) Single-bit flip

No cost to extend to multiple-bit flip Faults in arithmetic and logic

unit(ALU), floating point Unit (FPU) and the load-store unit(LSU). (Otherwise ECC protected)

9

Page 10: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Overview of the FI methodology

10

Grouping Profiling

Fault injection

Page 11: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Grouping• Detects groups of threads with

similar behaviors (i.e. same number of dynamic insts)

• Pick representative groups (challenge 1)

11

Category

Benchmarks Groups Groups to profile

% of threads in picked groups

Category I

AES,MRI-Q,MAT,Mergesort-k0, Transpose

1 1 100%

Category II

SCAN, Stencil, Monte Carlo, SAD, LBM, HashGPU

2-10 1-4 95%-100%

Category III

BFS 79 2 >60%

Grouping

Profiling

Fault injectio

n

Page 12: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Profiling

Finds the instructions executed by a thread and their corresponding CUDA source-code line (challenge2)

Output: Instruction trace: consisting of the

program counter values and the source line associated with each instruction

12

Grouping

Profiling

Fault injectio

n

Page 13: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Fault injection

13

Native execution

Breakpoint hit

Single-step execution

PC hit

Fault injection

Single-step execution

Activation window

Native execution

Grouping

Profiling

Fault injectio

n

Until end

Page 14: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Heuristics for efficiency

A large number of iterations in loop Limit the number of loop iterations

explored Single-step is time-consuming

Limit activation window size

14

Page 15: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Characterization study

NVIDIA Tesla C 2070/2075 12 CUDA benchmarks, 15 kernels

Rodinia, Parboil and CUDA SDK Outcomes:

Benign: correct output Crash: exceptions from the system or

application Silent Data Corruption (SDC):

incorrect output Hang: finish in considerably long time 15

Page 16: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Error Resilience Characteristics

16

AES

HashG

PU-m

d5

HashG

PU-s

ha1

MRI-Q M

AT

Tran

spos

e

SAD-k

0

SAD-k

1

SAD-k

2

Sten

cil

SCAN-b

lock

MONTE

Mer

geSo

rt-k0 BFS

LBM

0%10%20%30%40%

Benchmarks

SD

C r

ate

AES

HashG

PU-m

d5

HashG

PU-s

ha1

MRI-Q M

AT

Tran

spos

e

SAD-k

0

SAD-k

1

SAD-k

2

Sten

cil

SCAN-b

lock

MONTE

Mer

geSo

rt-k0 BFS

LBM

0%20%40%60%80%

Benchmarks

Cra

sh

ra

te

1.SDC rates and crash rates vary significantly across benchmarks

SDC: 1% - 38% Crash: 5% - 70%2. Average failure rate is 67%3. CPUs have 5% - 15% difference in SDC rates

We need application-specific FT mechanisms

Page 17: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Hardware error detection mechanisms• Hardware exceptions

17

AES (Crash rate: 43%) MAT (Crash rate: 30%)

1. Exceptions due to memory-related error2. Crashes are comparable on CPUs and GPUs

Page 18: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Crash latency – measure the fault propagation• Two examples:

18

1.Device illegal address have highest latency in all benchmarks except Stencil

2.Warp out-of-range has lower crash latency otherwise

3. Crash latency in CPUs is shorter (thousands of cycles) [Gu, DSN04]

We can use crash latency to design application-specific

checkpoint/recovery techniques

Page 19: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Resilience Category

Resilience Category

Benchmarks Measured SDC Dwarfs

Search-based MergeSort 6% Backtrack and Branch+Bound

Bit-wise Operation

HashGPU, AES 25% - 37% Combinational Logic

Average-out Effect

Stencil, MONTE 1% - 5% Structured Grids, Monte Carlo

Graph Processing

BFS 10% Graph Traversal

Linear Algebra Transpose, MAT, MRI-Q, SCAN-block, LBM, SAD

15% - 25% Dense Linear Algebra, Sparse Linear Algebra, Structured Grids

19

Page 20: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Conclusion

GPU-Qin: a methodology and a fault injection tool to characterize the error resilience of GPU applications We find that there are significant

variations in resilience characteristics of GPU applications

Algorithmic characteristics of the application can help us understand the variation of SDC rates

20

Page 21: GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.

Thank you !

Project homepage: http://netsyslab.ece.ubc.ca/wiki/

index.php/FTGPUCode: https://github.com/

DependableSystemsLab/GPU-Injector

21