Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl...

32
Partially supported by Presenter: Box. Leangsuksun SWEPCO Endowed Professor*, Computer Science Louisiana Tech University S. Laosooksathit, N. Naksinehaboon, Box. Leangsuksun, A. Dhungana, C Ch dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Louisiana Tech University [email protected]u C. Chandler Louisiana Tech U Thammasat Univ 4 th HPCVirt workshop, Paris, France, April 13, 2010

Transcript of Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl...

Page 1: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Partially supported by

Presenter: Box. LeangsuksunSWEPCO Endowed Professor*, Computer Science

Louisiana Tech University

S. Laosooksathit, N. Naksinehaboon, Box. Leangsuksun, A. Dhungana,

C Ch dl

Amir FabinU of Texas, Arlington

K. Chanchio

Thammasat Univ

Louisiana Tech [email protected]

C. ChandlerLouisiana Tech U

Thammasat Univ

4th HPCVirt workshop, Paris, France, April 13, 2010

Page 2: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Motivations Background - VCCP GPU checkpoint protocols: Memcopy vs

i l SsimpleStream CheCUDA (related work) GPU checkpoint protocols: CUDA Streams GPU checkpoint protocols: CUDA Streams Restart protocols Scheduling model and Analysis Scheduling model and Analysis Conclusion

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 2

Page 3: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

More attention on GPUs ORNL-NVDIA 10 petaflop machine Large scale GPU cluster -> fault tolerance for

GPU li iGPU applications◦ Normal checkpoint doesn’t help GPU applications

when a failure occurs.e a a u e occu s◦ GPU execution isn’t saved when do checkpoint on

CPU

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 3

Page 4: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

High transparency◦ Checkpoint/restart mechanisms should be

transparent to applications OS and runtimetransparent to applications, OS, and runtime environments; no modification required

Efficiency◦ Checkpoint/restart mechanisms should not

generate unacceptable overheads Normal ExecutionNormal Execution Communication Checkpointing Delay

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010

Page 5: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Run apps/OS unmodified

FIFO R li blCheckpoint/restart protocols

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010

FIFO, ReliableCheckpoint/restart protocols

Page 6: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

1. Pauce VM computation1. Pauce VM computation2. Flush messages out of the

networknetwork3. Locally Save State of every VM4. Continue computationp

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010

Page 7: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

VCCP checkpoint protocolVCCP checkpoint protocol

Head compute01 compute02

save

save

save

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010

Page 8: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

VCCP checkpoint protocolVCCP checkpoint protocol

Head compute0

1

compute0

21 2

Flush

communication communication

channel

channel empty channel empty channel empty

save VM & buffersave VM & buffer save VM & buffer

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010

Page 9: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

VCCP checkpoint protocolVCCP checkpoint protocol

Head compute01 compute02

res lt res ltresult result

resumesuccess

t resumecont

cont

cont

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010

cont

Page 10: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Publication in IEEE cluster 2009

Average overhead 12%

Provide transparent checkpoint/restart

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 10

Page 11: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

1. Device Initialization D i

Host Device

Grid 1

2. Device memory allocation

3. Copies data to device

Kernel 1

Block(0, 0)

Block(1, 0)

Block(2, 0)

Block(0, 1)

Block(1, 1)

Block(2, 1)

memory4. Executes kernel (Calling

__global__ function) Kernel 2

( , ) ( , ) ( , )

Grid 2

5. Copies data from device memory (retrieve results)

2

Block (1, 1)

Issues – latency round trip data movement Thread

(0, 1)Thread(1, 1)

Thread(2, 1)

Thread(3, 1)

Thread(4, 1)

Thread(0, 0)

Thread(1, 0)

Thread(2, 0)

Thread(3, 0)

Thread(4, 0)

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 11

Thread(0, 2)

Thread(1, 2)

Thread(2, 2)

Thread(3, 2)

Thread(4, 2)

Page 12: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Long running GPU application High (relatively) failure rate in a large scale

GPU cluster in MPI & GPU environmentS GPU f Save GPU software state

Move data back from GPU in low latency◦ Memcopy (pauce GPU) vs simpleStream◦ Memcopy (pauce GPU) vs simpleStream

(concurrency)

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 12

Page 13: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

“CheCUDA: A Checkpoint/Restart Tool for CUDA Applications” by H. Takizawa, K. Sato, K. Komatsu, and H. Kobayashi

A prototype of an add on package of BLCR A prototype of an add-on package of BLCR for GPU checkpointing

Memcopy approach Memcopy approach

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 13

Page 14: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

CPU checkpointing

2

Migration/ CPU checkpoint

2checkpoint

GPU checkpointing

1

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 14

Page 15: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Process startsH-D memoryH D memory

copyKernel starts

Syncthread()

GPU checkpointGPU checkpoint durationCPU checkpoint/

migration

Kernel completesD-H memory

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 15

co p etesD H memory copy

Process ends

Page 16: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

1. Copying all the user data in the device memory to the host memory

2. Writing the current status of the application and the user data to a checkpoint fileand the user data to a checkpoint file

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 16

Page 17: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

1. Read the checkpoint file2. Initialize the GPU and recreating CUDA

resourcesS di h d b k h d i3. Sending the user data back to the device memory

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 17

Page 18: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Transfer data from device to host = overhead◦ Must pauce GPU computation until the copy is

completed

SimpleStream◦ Using latency hiding (Streams) to reduce the

overhead◦ CUDA streams = overlap memory copy and kernel

executionexecution

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 18

Page 19: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Process starts

H D memory K l t t Code AnalysisH-D memory copy

Kernel starts Code Analysis

After the sync point, OVERWRITE?

Syncthread()

GPU checkpoint durationCPU checkpoint/ migration YES NO

Kernel completesD-H memory

copy

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 19

Kernel completespy

Process ends

Page 20: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Process starts

H D memory K l t t Code AnalysisH-D memory copy

Kernel starts Code Analysis

After the sync point, OVERWRITE?

Syncthread()

YES NO

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 20

Page 21: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Process starts

H D memory K l t tAfter the H-D memory

copyKernel startssync point,

OVERWRITE?

Syncthread()NO

CPU checkpoint/ migration

GPU checkpoint duration

Kernel completesD-H memory

copy

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 21

Kernel completespy

Process endsBACK

Page 22: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Process starts

H D memory K l t tAfter the H-D memory copy

Kernel startsAfter the sync point,

OVERWRITE?

Syncthread()

YES C h i i GPUDuplicate image

CPU checkpoint/ migration

YES

GPU checkpoint duration

Copy the sync image in GPU

migration

Kernel completesD-H memory

copy

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 22

Kernel completespy

Process endsBACK

Page 23: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Restart CPU Transfer the last GPU checkpoint back to CPU Recreate CUDA context from the CKpt file Restart the kernel execution from the marked

synchronization point

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 23

Page 24: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

GPU checkpoint after a thread synchronization

NOT every thread synchronizationQUESTION??? QUESTION???◦ Which thread synchronization should a checkpoint

be invoked?be o ed FACTORs◦ GPU checkpoint overhead◦ Chance of a failure occurrence

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 24

Page 25: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

O

jCthm thnC

CCPn

mjjf

ˆ mj

OPCOP ff 1ˆPerform the checkpoint:

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 25

Page 26: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

O

jCthm thnC

CCPn

mjjf

ˆSkip the checkpoint:

mj

OPCOP ff 1ˆPerform the checkpoint:

Perform the checkpointOCPn

jf ÷÷

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 26

pmj

jf

Page 27: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Simulate failures & the wasted time ◦ total checkpoint overhead + re-computing due to a

failure Overhead Overhead◦ Non-stream: 10 milliseconds – 3 seconds◦ Streams: negligible

MTTF: 12 hours – 7 days Thread sync interval: 10 and 30 minutes

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 27

Page 28: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Thread sync interval = 10 mins

Thread sync interval = 30 mins10 mins 30 mins

28

Page 29: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Thread sync interval = 10 mins

Thread sync interval = 30 mins10 mins 30 mins

29

Page 30: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Against MTTFs Against overheadsg g

30

Page 31: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

GPU checkpointing with Stream to reduce overhead

Non-stream and stream checkpoints are insignificantly different if data transfer isinsignificantly different if data transfer is insignificant

BUT stream checkpoint potentially performs BUT stream checkpoint potentially performs better when the checkpoint overhead of memcopy is larger.

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 31

Page 32: Lightweight Checkpoint Mechanism and Modeling in GPGPU ... · Box. Leangsuksun, A. Dhungana, CCh dl Amir Fabin U of Texas, Arlington K. Chanchio Thammasat Univ Tech University box@latech.edu

Implement GPU checkpoint/restart mechanism

Work on other checkpoint protocolI l d GPU i i Include GPU process migration

4th HPCVirt workshop, Paris, France, April 13, 20104th HPCVirt workshop, Paris, France, April 13, 2010 32