Transcript of "Lightweight Checkpoint Mechanism and Modeling in GPGPU"
Partially supported by
Presenter: Box Leangsuksun, SWEPCO Endowed Professor, Computer Science, Louisiana Tech University ([email protected])

Authors:
S. Laosooksathit, N. Naksinehaboon, Box Leangsuksun, A. Dhungana, C. Chandler — Louisiana Tech University
Amir Fabin — U of Texas, Arlington
K. Chanchio — Thammasat University
4th HPCVirt workshop, Paris, France, April 13, 2010
Outline
- Motivations
- Background: VCCP
- GPU checkpoint protocols: Memcopy vs. simpleStream
- CheCUDA (related work)
- GPU checkpoint protocols: CUDA Streams
- Restart protocols
- Scheduling model and analysis
- Conclusion
Motivations
- More attention on GPUs: the ORNL-NVIDIA 10-petaflop machine
- A large-scale GPU cluster needs fault tolerance for GPU applications
  - A normal checkpoint doesn't help a GPU application when a failure occurs
  - GPU execution state isn't saved when checkpointing on the CPU
Checkpoint/restart requirements
- High transparency: checkpoint/restart mechanisms should be transparent to applications, OS, and runtime environments; no modification required
- Efficiency: checkpoint/restart mechanisms should not generate unacceptable overheads (normal execution, communication, checkpointing delay)
Background: VCCP
- Runs apps/OS unmodified
- Assumes FIFO, reliable communication channels
- Provides checkpoint/restart protocols
VCCP checkpoint protocol:
1. Pause VM computation
2. Flush messages out of the network
3. Locally save the state of every VM
4. Continue computation
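The VCCP checkpoint round can be sketched as a coordinator driving every node through the four phases. This is a hypothetical illustration (the class and function names are invented, and real in-flight messages are stood in for by a counter), not VCCP's actual implementation:

```python
# Hypothetical sketch of a VCCP-style coordinated checkpoint round:
# the head node pauses every VM, drains the communication channels,
# has each VM save state locally, then resumes computation.

class VirtualMachine:
    def __init__(self, name):
        self.name = name
        self.running = True
        self.in_flight = 2        # stand-in for messages still in the network
        self.checkpoint = None

    def pause(self):
        self.running = False

    def flush(self):
        # Drain the channel before saving, so no message is lost
        # or duplicated on restart.
        while self.in_flight:
            self.in_flight -= 1

    def save(self):
        # Locally save the VM state plus any buffered messages.
        self.checkpoint = {"name": self.name, "buffered": self.in_flight}

    def resume(self):
        self.running = True


def coordinated_checkpoint(vms):
    for vm in vms:            # 1. pause computation everywhere
        vm.pause()
    for vm in vms:            # 2. flush messages out of the network
        vm.flush()
    for vm in vms:            # 3. locally save the state of every VM
        vm.save()
    for vm in vms:            # 4. continue computation
        vm.resume()
    return all(vm.checkpoint is not None for vm in vms)


cluster = [VirtualMachine(f"compute{i:02d}") for i in (1, 2)]
assert coordinated_checkpoint(cluster)
```

Flushing before saving is what makes the set of local snapshots globally consistent: with empty channels, no message can be in flight across the checkpoint line.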
VCCP checkpoint protocol
[Diagram: the head node signals compute01 and compute02; each node saves its state]
[Diagram: the head node flushes the communication channels; once every channel is empty, each node saves its VM state and message buffers]
[Diagram: each node reports its result; on success, the head node tells all nodes to resume and continue computation]
VCCP results
- Published in IEEE Cluster 2009
- Average overhead: 12%
- Provides transparent checkpoint/restart
CUDA program flow
1. Device initialization
2. Device memory allocation
3. Copy data to device memory
4. Execute kernel (calling a __global__ function)
5. Copy data from device memory (retrieve results)

Issue: latency of the round-trip data movement between host and device.

[Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid contains blocks, e.g. Blocks (0,0) through (2,1), and each block contains threads, e.g. Block (1,1) holds Threads (0,0) through (4,2)]
Why GPU checkpointing?
- Long-running GPU applications
- Relatively high failure rate in a large-scale GPU cluster (MPI & GPU environment)
- Need to save the GPU software state
- Need to move data back from the GPU with low latency: Memcopy (pauses the GPU) vs. simpleStream (concurrency)
Related work: "CheCUDA: A Checkpoint/Restart Tool for CUDA Applications" by H. Takizawa, K. Sato, K. Komatsu, and H. Kobayashi
- A prototype add-on package for BLCR for GPU checkpointing
- Uses the Memcopy approach
[Diagram: the two-phase scheme — (1) GPU checkpointing, then (2) CPU checkpointing/migration]
[Timeline: process starts → H→D memory copy → kernel starts → syncthreads() → GPU checkpoint (duration), then CPU checkpoint/migration → kernel completes → D→H memory copy → process ends]
GPU checkpoint (Memcopy approach):
1. Copy all the user data in the device memory to the host memory
2. Write the current status of the application and the user data to a checkpoint file
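The two steps can be sketched as follows. This is an illustrative stand-in, not CheCUDA's code: the device-to-host copy is replaced by copying plain Python lists, and the file format (pickle) is invented for the example:

```python
# Illustrative sketch of the memcopy-approach checkpoint: copy
# device-resident user data to the host, then write the application
# status plus that data to a checkpoint file.
import os
import pickle
import tempfile

def gpu_checkpoint(device_buffers, app_status, path):
    # Step 1: in a real tool this is cudaMemcpy(DeviceToHost);
    # here "device" buffers are stood in for by plain lists.
    host_copies = {name: list(buf) for name, buf in device_buffers.items()}
    # Step 2: persist application status and user data to a checkpoint file.
    with open(path, "wb") as f:
        pickle.dump({"status": app_status, "data": host_copies}, f)

def gpu_restart(path):
    # On restart, the saved data would be sent back to device memory.
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["status"], ckpt["data"]

path = os.path.join(tempfile.gettempdir(), "gpu.ckpt")
gpu_checkpoint({"A": [1, 2, 3]}, {"iteration": 42}, path)
status, data = gpu_restart(path)
assert status == {"iteration": 42} and data["A"] == [1, 2, 3]
```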
Restart:
1. Read the checkpoint file
2. Initialize the GPU and recreate CUDA resources
3. Send the user data back to the device memory
simpleStream
- Transferring data from device to host is overhead: the GPU computation must pause until the copy completes
- simpleStream uses latency hiding (streams) to reduce that overhead
- CUDA streams overlap the memory copy with kernel execution
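A back-of-the-envelope model (our own illustration, not a measurement from the slides) shows why overlap helps: a blocking memcopy adds its full transfer time, while a copy overlapped with kernel work in another stream is hidden up to the remaining kernel time.

```python
# Simple cost model for blocking vs. overlapped device-to-host copies.

def memcopy_overhead(t_copy):
    # Non-stream: the GPU is paused for the whole device-to-host copy.
    return t_copy

def stream_overhead(t_copy, t_kernel_remaining):
    # Stream: the copy runs concurrently with the kernel, so only the
    # part that outlasts the remaining kernel work shows as overhead.
    return max(0.0, t_copy - t_kernel_remaining)

# A 3 s copy with 10 s of kernel work left: the overhead disappears.
assert memcopy_overhead(3.0) == 3.0
assert stream_overhead(3.0, 10.0) == 0.0
# A 3 s copy with only 1 s of kernel work left: 2 s still shows.
assert stream_overhead(3.0, 1.0) == 2.0
```

This matches the conclusion later in the talk: when the data transfer is small the two protocols are close, but the stream checkpoint wins as the memcopy overhead grows.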
[Timeline: process starts → H→D memory copy → kernel starts; code analysis asks whether the data will be OVERWRITTEN after the sync point → syncthreads() → GPU checkpoint (duration) with CPU checkpoint/migration, branching on YES/NO → kernel completes → D→H memory copy → process ends]
[Timeline, NO branch: if the data is not overwritten after the sync point, the GPU checkpoint and CPU checkpoint/migration proceed directly, then the kernel completes and the D→H copy runs]
[Timeline, YES branch: if the data will be overwritten after the sync point, the sync-point image is duplicated inside GPU memory first; the GPU checkpoint and CPU checkpoint/migration then use that copy while the kernel continues]
Restart protocol
1. Restart the CPU
2. Transfer the last GPU checkpoint back to the CPU
3. Recreate the CUDA context from the checkpoint file
4. Restart the kernel execution from the marked synchronization point
Scheduling model
- A GPU checkpoint happens after a thread synchronization, but not at every thread synchronization
- Question: at which thread synchronizations should a checkpoint be invoked?
- Factors: the GPU checkpoint overhead, and the chance of a failure occurring
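The kind of trade-off the model formalizes can be sketched as an expected-cost comparison. This is our own simplified rule, not the authors' exact formula: checkpoint at a sync point only when the expected recomputation a failure would cause outweighs the checkpoint overhead. It assumes exponentially distributed failures with a known MTTF.

```python
# Simplified checkpoint-or-skip rule at a thread-sync point, balancing
# checkpoint overhead against the chance and cost of a failure.
import math

def should_checkpoint(overhead, mttf, work_since_last_ckpt, work_to_next_sync):
    # Probability of at least one failure before the next sync point,
    # assuming exponentially distributed failures.
    p_fail = 1.0 - math.exp(-work_to_next_sync / mttf)
    # Skipping means a failure also loses the work since the last checkpoint.
    expected_extra_loss_if_skip = p_fail * work_since_last_ckpt
    return overhead < expected_extra_loss_if_skip

# 3 s overhead, 12 h MTTF, 30 min of unsaved work, 30 min to the next
# sync: the expected loss from skipping (~73 s) dwarfs the 3 s overhead.
assert should_checkpoint(3.0, 12 * 3600, 1800, 1800)
# Same setup with only 1 s of unsaved work: skip the checkpoint.
assert not should_checkpoint(3.0, 12 * 3600, 1.0, 1800)
```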
[Equations: expressions for the expected wasted time when the checkpoint at the j-th thread synchronization is skipped and when it is performed, in terms of the checkpoint overhead O_CP; the checkpoint is performed when its expected wasted time is the smaller of the two]
Simulation setup
- Simulate failures and measure the wasted time: total checkpoint overhead plus re-computation due to a failure
- Overheads — non-stream: 10 milliseconds to 3 seconds; streams: negligible
- MTTF: 12 hours to 7 days
- Thread sync interval: 10 and 30 minutes
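The experiment can be sketched as a small Monte-Carlo loop. This is our own illustration of the setup (the simulator's code is not in the slides), simplified so that a checkpoint is taken at every sync point and a failure rolls execution back to the last one:

```python
# Wasted-time simulation: run a fixed-length job, checkpoint at every
# thread-sync interval, inject exponentially distributed failures, and
# total up checkpoint overhead plus forced recomputation.
import random

def wasted_time(job_len, sync_interval, overhead, mttf, seed=0):
    rng = random.Random(seed)
    t = progress = wasted = 0.0
    next_failure = rng.expovariate(1.0 / mttf)
    while progress < job_len:
        step = min(sync_interval, job_len - progress)
        if t + step > next_failure:
            # Failure mid-interval: work since the last sync-point
            # checkpoint is lost and must be recomputed.  (A failure
            # landing inside the previous checkpoint window is treated
            # as a plain rollback with no partial-interval loss.)
            fail_at = max(t, next_failure)
            wasted += fail_at - t
            t = fail_at
            next_failure = t + rng.expovariate(1.0 / mttf)
            continue
        t += step + overhead      # compute to the sync point, then checkpoint
        progress += step
        wasted += overhead        # checkpoint overhead counts as wasted time
    return wasted

# 24 h job, 10 min sync interval, 3 s memcopy overhead, failures off
# (huge MTTF): waste is exactly 144 checkpoints x 3 s = 432 s.
assert wasted_time(24 * 3600, 600, 3.0, mttf=1e12) == 432.0
# A negligible stream overhead wastes nothing when no failure occurs.
assert wasted_time(24 * 3600, 600, 0.0, mttf=1e12) == 0.0
```

With failures enabled (e.g. a 12-hour MTTF), the non-stream run accumulates both checkpoint overhead and recomputation, which is the quantity the next slides plot.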
[Charts: simulated wasted time for thread sync intervals of 10 and 30 minutes, comparing non-stream and stream checkpoints]
[Charts: wasted time plotted against MTTFs and against checkpoint overheads]
Conclusion
- GPU checkpointing with streams reduces the checkpoint overhead
- Non-stream and stream checkpoints differ insignificantly when the data transfer is insignificant
- But the stream checkpoint potentially performs better as the memcopy checkpoint overhead grows
Future work
- Implement the GPU checkpoint/restart mechanism
- Work on other checkpoint protocols
- Include GPU process migration