Warp-Aware Trace Scheduling for GPUs
James Jablin (Brown), Thomas Jablin (UIUC), Onur Mutlu (CMU), Maurice Herlihy (Brown)
Slide 2
Historical Trends in GFLOPS: CPUs vs. GPUs
Reproduced from the NVIDIA CUDA C Programming Guide (Version 5.0)
Slide 3
Performance Pitfalls
Control flow can negatively affect performance.
Slide 4
Performance Pitfalls
Pipeline Stall: an execution delay in an instruction pipeline to resolve a dependency.
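To make the definition concrete, here is a minimal sketch of a toy in-order, single-issue pipeline that counts stall cycles. The function name `count_stalls`, the register names, and the latencies (4 cycles for a load, 1 for an add) are illustrative assumptions, not figures from the talk.

```python
# Toy in-order, single-issue pipeline: one instruction issues per cycle
# unless one of its source operands is not ready yet, in which case the
# pipeline stalls. Latencies are assumed values for illustration only.

def count_stalls(program, latency):
    """program: list of (dest, op, sources). Returns total stall cycles."""
    ready_at = {}  # register -> cycle at which its value becomes available
    cycle = 0
    stalls = 0
    for dest, op, sources in program:
        # Earliest cycle at which every source operand is available.
        earliest = max((ready_at.get(s, 0) for s in sources), default=0)
        if earliest > cycle:
            stalls += earliest - cycle  # wait for the dependency
            cycle = earliest
        ready_at[dest] = cycle + latency[op]
        cycle += 1
    return stalls

LATENCY = {"load": 4, "add": 1}

# Each add immediately follows the load it depends on.
dependent = [
    ("r1", "load", ["a"]),
    ("r2", "add", ["r1", "r1"]),
    ("r3", "load", ["b"]),
    ("r4", "add", ["r3", "r3"]),
]

# Same work, but the independent load is hoisted to hide latency.
interleaved = [
    ("r1", "load", ["a"]),
    ("r3", "load", ["b"]),
    ("r2", "add", ["r1", "r1"]),
    ("r4", "add", ["r3", "r3"]),
]

print(count_stalls(dependent, LATENCY))    # 6
print(count_stalls(interleaved, LATENCY))  # 2
```

Reordering independent instructions reduces the stall count in this model, which is the effect an instruction scheduler aims for.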
Slide 5
Hardware: CPU versus GPU
Reproduced from the NVIDIA CUDA C Programming Guide (Version 5.0)
Slide 19
Performance Pitfalls
Pipeline Stall: an execution delay in an instruction pipeline to resolve a dependency.
Warp Divergence: threads within a warp take different paths, and the different execution paths are serialized.
Slide 20
Warp Divergence Example
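As a rough illustration of how serialization costs time, the sketch below models one warp executing an if/else in Python rather than CUDA. The helper `warp_issue_slots` and the per-path costs of 10 slots are assumptions made for this example. The warp size of 32 matches NVIDIA hardware.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def warp_issue_slots(thread_ids, cond, then_cost, else_cost):
    """Issue slots one warp spends on `if cond(tid): ... else: ...`.

    Every path taken by at least one thread is executed once for the whole
    warp (inactive threads are masked off), so divergent paths add up.
    """
    taken = [cond(t) for t in thread_ids]
    slots = 0
    if any(taken):        # some thread runs the 'then' path
        slots += then_cost
    if not all(taken):    # some thread runs the 'else' path
        slots += else_cost
    return slots

warp = range(WARP_SIZE)

# Even/odd threads disagree, so both paths are serialized.
divergent = warp_issue_slots(warp, lambda t: t % 2 == 0, 10, 10)

# The condition is constant across the warp, so only one path runs.
uniform = warp_issue_slots(warp, lambda t: (t // WARP_SIZE) % 2 == 0, 10, 10)

print(divergent, uniform)  # 20 10
```

In this model a branch on `t % 2` doubles the cost, while a branch whose outcome is the same for all 32 threads costs no more than straight-line code.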
Slide 29
Warp-Aware Trace Scheduling
Schedule instructions across basic block boundaries to expose additional instruction-level parallelism (ILP) while managing and optimizing warp divergence.
Slide 34
Origins: Microcode Trace Scheduling, generalizing local and disparate vertical-to-horizontal microcode compaction
Step: Description
1. Trace Selection: Partition basic blocks into regions.
2. Trace Formation: Facilitate local scheduling, potentially adding nodes and edges.
3. Local Scheduling: Schedule instructions within each region.
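The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' algorithm: `select_traces` uses a common greedy heuristic (seed with the hottest unplaced block, then follow the most likely successor), `list_schedule` is a simple single-issue greedy list scheduler, and the control-flow graph, edge probabilities, and latencies are invented for the example. Trace formation (step 2) and any warp-divergence modeling are omitted.

```python
def select_traces(blocks, succs, weight):
    """Step 1: partition basic blocks into traces (regions).

    blocks: block names; succs: block -> [(successor, edge probability)];
    weight: block -> estimated execution frequency.
    """
    unvisited = set(blocks)
    traces = []
    # Seed each trace with the hottest block that is not yet placed.
    for seed in sorted(blocks, key=lambda b: -weight[b]):
        if seed not in unvisited:
            continue
        trace = [seed]
        unvisited.remove(seed)
        current = seed
        while True:
            candidates = [(p, s) for s, p in succs.get(current, [])
                          if s in unvisited]
            if not candidates:
                break
            _, current = max(candidates)  # follow the most likely edge
            trace.append(current)
            unvisited.remove(current)
        traces.append(trace)
    return traces


def list_schedule(instrs, deps, latency):
    """Step 3: greedy single-issue list scheduling within one region.

    instrs: names in program order; deps: name -> names it depends on
    (assumed acyclic); latency: name -> cycles until the result is ready.
    """
    done_at = {}
    order = []
    remaining = list(instrs)
    cycle = 0
    while remaining:
        ready = [i for i in remaining
                 if all(d in done_at and done_at[d] <= cycle
                        for d in deps.get(i, ()))]
        if ready:
            # Start long-latency work as early as possible.
            pick = max(ready, key=lambda i: latency[i])
            order.append(pick)
            done_at[pick] = cycle + latency[pick]
            remaining.remove(pick)
        cycle += 1  # an empty `ready` list models a stall cycle
    return order


# Hypothetical diamond-shaped CFG: A branches to B (likely) or C, both join at D.
blocks = ["A", "B", "C", "D"]
succs = {"A": [("B", 0.9), ("C", 0.1)], "B": [("D", 1.0)], "C": [("D", 1.0)]}
weight = {"A": 100, "B": 90, "C": 10, "D": 100}
traces = select_traces(blocks, succs, weight)

# Hypothetical instructions inside one region.
instrs = ["ld_a", "add_a", "ld_b", "add_b"]
deps = {"add_a": {"ld_a"}, "add_b": {"ld_b"}}
latency = {"ld_a": 4, "ld_b": 4, "add_a": 1, "add_b": 1}
schedule = list_schedule(instrs, deps, latency)

print(traces)    # [['A', 'B', 'D'], ['C']]
print(schedule)  # ['ld_a', 'ld_b', 'add_a', 'add_b']
```

The hot path A, B, D becomes one region, so the scheduler can move instructions across what were originally basic block boundaries. A full implementation would also insert compensation code on the off-trace path through C during trace formation.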