Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation

Ping Xiang, Yi Yang, Huiyang Zhou

The 20th IEEE International Symposium on High Performance Computer Architecture, Orlando, Florida, USA


Outline

• Background

• Motivation

• Mitigation: WarpMan

• Experiments

• Conclusions


Overview of GPU Architecture

[Figure: GPU architecture overview. Labeled components: control logic, ALUs, cache, DRAM, register file, and shared memory. Threads are grouped into warps, and warps into thread blocks (TBs).]


Motivation: Resource Fragmentation

• Typically large TB size (e.g., 512 threads)
– More efficient data sharing/communication within a TB
– Limited total number of TBs

[Figure: a register file holding two TBs, with the leftover registers unused (resource fragmentation).]
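The fragmentation on this slide can be made concrete with a small host-side calculation. This is only a sketch: the helper names are hypothetical, and it models registers alone, ignoring the other per-SM limits (threads, shared memory, warp entries).

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result type: what TB-granularity allocation leaves behind.
struct Fragmentation {
    int resident_tbs;      // TBs that fit when registers are granted per TB
    int unused_registers;  // registers left over after placing whole TBs
};

// Registers are granted in whole-TB units, so any remainder smaller than
// one TB's requirement stays unused.
Fragmentation tb_level_fragmentation(int regs_per_sm, int regs_per_thread,
                                     int threads_per_tb, int max_tbs_per_sm) {
    assert(regs_per_thread > 0 && threads_per_tb > 0);
    const int regs_per_tb = regs_per_thread * threads_per_tb;
    const int tbs = std::min(regs_per_sm / regs_per_tb, max_tbs_per_sm);
    return Fragmentation{tbs, regs_per_sm - tbs * regs_per_tb};
}

// Extra warps that would fit in the leftover registers if registers were
// granted per warp instead of per TB.
int extra_warps_at_warp_level(int unused_registers, int regs_per_thread,
                              int warp_size = 32) {
    assert(regs_per_thread > 0 && warp_size > 0);
    return unused_registers / (regs_per_thread * warp_size);
}
```

As an illustration, take a 128 KB register file (32768 32-bit registers), 512-thread TBs, at most 8 TBs per SM, and an assumed 24 registers per thread. `tb_level_fragmentation(32768, 24, 512, 8)` then yields 2 resident TBs and 8192 unused registers (25% of the file), and `extra_warps_at_warp_level(8192, 24)` shows that 10 more warps would fit at warp granularity.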


Motivation: Warp-Level Divergence

[Figure: a TB with four warps (Warp1 to Warp4). Two warps have already finished, and their resources sit unused.]

• Warps within the same TB don't finish at the same time

• Resources cannot be released promptly


Outline

• Background

• Motivation
– Characterization

• Mitigation: WarpMan

• Experiments

• Conclusions


Characterization

[Figure: two kinds of underutilization. Spatial resource underutilization: resources in the register file left unused next to the resident TBs. Temporal resource underutilization: resources still held by finished warps.]


Spatial Resource Underutilization

Register resource as an example

[Figure: stacked bars of register file usage (0% to 100%) by resident TB (TB1 to TB8) for RS(2), HS(3), RAY(2), MM(5), NN(5), CT(7), MC(4), HG(3), ST(1), and GM. Values called out on the chart: 28%, 17%, and 46%.]


Temporal Resource Underutilization

• Case study: Ray Tracing (RAY)
– 6 warps per TB
– Study TB0 as an example

[Figure: Warp-Level Divergence for RAY. x-axis: cycle (0 to 25000); y-axis: warp number (0 to 5).]

RTRU = 49.7%

RTRU: ratio of temporal resource underutilization
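The slide names the metric but does not spell out a formula. One plausible formulation, stated here as an assumption, treats a warp's resources as idle from the cycle that warp finishes until the last warp of its TB finishes, and divides the total idle warp-cycles by the warp-cycles the TB holds: RTRU = Σ (T_TB − T_i) / (n · T_TB). A minimal sketch, assuming all warps of the TB start at cycle 0:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed definition (not given on the slide): the fraction of warp-cycles
// in which a finished warp still holds resources while its TB is running.
// Input: the finish cycle of every warp in one TB, all starting at cycle 0.
double rtru(const std::vector<long long>& warp_finish_cycles) {
    assert(!warp_finish_cycles.empty());
    const long long tb_finish = *std::max_element(
        warp_finish_cycles.begin(), warp_finish_cycles.end());
    assert(tb_finish > 0);
    long long idle = 0;
    for (std::size_t i = 0; i < warp_finish_cycles.size(); ++i) {
        // Cycles during which this warp's resources sit unused.
        idle += tb_finish - warp_finish_cycles[i];
    }
    return static_cast<double>(idle) /
           (static_cast<double>(tb_finish) *
            static_cast<double>(warp_finish_cycles.size()));
}
```

For made-up finish cycles of 5000, 10000, 10000, and 20000 (not the RAY measurements), the idle total is 35000 warp-cycles out of 80000, so this sketch returns 0.4375.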


Why Is There Temporal Resource Underutilization?

• Input-dependent workload imbalance
– Same code, different input: “if(a < 123)”

• Program-dependent workload imbalance
– Code like if(tid < 32)

• Memory divergence
– Some warps experience more cache hits than others

• Warp scheduling policy
– The scheduler prioritizes certain warps over others


Characterization: RTRU

[Figure: RTRU (0% to 90%) under the round-robin scheduling policy for CT, MC, RS, SN, HS, PF, SR, ST, RAY, MM, NN, BT, HG, and GM.]


Outline

• Background

• Motivation
– Characterization
– Micro-benchmarking

• Mitigation: WarpMan

• Experiments

• Conclusions


Micro-benchmark

• Code runs on both GTX 480 and GTX 680

    __global__ void TB_resource_kernel(…, bool call = false) {
        if (call) bloatOccupancy(start, size);
        ...
        clock_t start_clock = clock();
        if (tid < 32) {  // tid is the thread id within a TB, so only warp 0 enters
            clock_offset = 0;
            // Busy-wait until clock_count cycles have elapsed
            while (clock_offset < clock_count) {
                clock_offset = clock() - start_clock;
            }
        }
        clock_t end_clock = clock();
        d_o[index] = start_clock;  // index is the global thread id
        d_e[index] = end_clock;
    }


Micro-benchmarking

• Results

    > Using CUDA device [0]: GeForce GTX 480
    > Detected Compute SM 3.0 hardware with 8 multi-processors.
    …
    CTA 250 Warp 0: start 80, end 81
    CTA 269 Warp 0: start 80, end 81
    CTA 272 Warp 0: start 80, end 81
    CTA 283 Warp 0: start 80, end 81
    CTA 322 Warp 0: start 80, end 81
    CTA 329 Warp 0: start 80, end 81
    …


Outline

• Background

• Motivation

• Mitigation: WarpMan

• Experiments

• Conclusions


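WarpMan

[Figure: an SM running TB0 and TB1 under TB-level resource management. Warp0, Warp1, and Warp2 have finished, but their resources stay unused while TB2 waits. A timeline (workload vs. cycle) shows TB0, TB1, and TB2 alongside the warp-level resource management case with Warp0, Warp1, and Warp2.]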


[Figure: TB-level resource management vs. WarpMan on an SM running TB0 and TB1. With TB-level management, the resources of finished warps stay unused. With WarpMan, Warp0 and Warp1 from TB2 start in the resources released by finished warps, and Warp2 from TB2 starts once another warp finishes and releases its resources. The timeline (workload vs. cycle) marks the saved cycles under warp-level resource management.]


WarpMan: Design

• Dispatch logic
– Traditional TB-level dispatching logic
– Add partial TB dispatch logic

• Workload buffer
– Stores the dispatched but not yet running partial TBs


Dispatching

[Figure: dispatch flow for the workload to be dispatched. A TB-level resource check against the resources required for a TB leads to dispatching a full TB; a warp-level resource check against the resources required for a warp leads to dispatching a partial TB. Resources shown: shared memory, warp entries, TB entries, registers.]

The shared memory is still allocated at the TB level
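The two checks can be sketched as a single host-side function. This is a simplified model under stated assumptions, not the hardware logic from the talk: all type and field names are hypothetical, shared memory and the TB entry are checked once per TB (as the slide states for shared memory), and registers and warp entries are checked per warp.

```cpp
#include <algorithm>
#include <cassert>

// Free resources on one SM (illustrative names).
struct SmResources {
    int registers;
    int shared_mem_bytes;
    int warp_entries;
    int tb_entries;
};

// Requirements of the next TB to be dispatched (illustrative names).
struct TbRequirement {
    int warps;               // warps per TB
    int registers_per_warp;  // registers needed by one warp
    int shared_mem_bytes;    // allocated once for the whole TB
};

// Returns how many warps of the next TB can start now:
// tb.warps means a full TB, a smaller positive value a partial TB, 0 none.
int warps_to_dispatch(const SmResources& avail, const TbRequirement& tb) {
    assert(tb.warps > 0 && tb.registers_per_warp > 0);
    // TB-granularity resources: one TB entry and the TB's shared memory.
    if (avail.tb_entries < 1 || avail.shared_mem_bytes < tb.shared_mem_bytes) {
        return 0;
    }
    // Warp-granularity resources: registers and warp entries.
    const int by_registers = avail.registers / tb.registers_per_warp;
    const int fit = std::min(by_registers, avail.warp_entries);
    return std::min(fit, tb.warps);
}
```

In this model a return value equal to `tb.warps` corresponds to the full-TB path, and anything between 1 and `tb.warps - 1` to a partial TB whose remaining warps would wait in the workload buffer.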


Workload Buffer

• Stores the dispatched but not yet running TB
– Hardware TB id (assigned by the hardware): 3 bits
– Software TB id (defined by the software): 26 bits
– Start warp id: 5 bits
– End warp id: 5 bits
– Valid bit: 1 bit

Total: 40 bits per entry
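The field widths (3 + 26 + 5 + 5 + 1 = 40 bits) can be checked with a small packing sketch. The slide gives the widths but not the bit positions, so the layout below and all names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// One workload buffer entry, unpacked (illustrative names).
struct WorkloadEntry {
    std::uint32_t hw_tb_id;    // 3 bits, assigned by the hardware
    std::uint32_t sw_tb_id;    // 26 bits, defined by the software
    std::uint32_t start_warp;  // 5 bits
    std::uint32_t end_warp;    // 5 bits
    bool valid;                // 1 bit
};

// Assumed layout: [39:37] hw_tb_id, [36:11] sw_tb_id,
// [10:6] start_warp, [5:1] end_warp, [0] valid.
std::uint64_t pack(const WorkloadEntry& e) {
    assert(e.hw_tb_id < (1u << 3) && e.sw_tb_id < (1u << 26));
    assert(e.start_warp < (1u << 5) && e.end_warp < (1u << 5));
    return (static_cast<std::uint64_t>(e.hw_tb_id) << 37) |
           (static_cast<std::uint64_t>(e.sw_tb_id) << 11) |
           (static_cast<std::uint64_t>(e.start_warp) << 6) |
           (static_cast<std::uint64_t>(e.end_warp) << 1) |
           static_cast<std::uint64_t>(e.valid ? 1 : 0);
}

WorkloadEntry unpack(std::uint64_t bits) {
    WorkloadEntry e;
    e.hw_tb_id = static_cast<std::uint32_t>((bits >> 37) & 0x7);
    e.sw_tb_id = static_cast<std::uint32_t>((bits >> 11) & 0x3FFFFFF);
    e.start_warp = static_cast<std::uint32_t>((bits >> 6) & 0x1F);
    e.end_warp = static_cast<std::uint32_t>((bits >> 1) & 0x1F);
    e.valid = (bits & 1) != 0;
    return e;
}
```

Packing an entry with every field at its maximum value produces exactly 40 set bits, which matches the total on the slide.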


Workload Buffer

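Store the dispatched but not running TB

[Figure: workload buffer example. An SM running TB117 and TB118 has unused resources; under WarpMan, Warp0 and Warp1 from TB120 start in them, and Warp2 from TB120 starts after another warp finishes. The workload buffer entry has the fields TB Num, Start Warp ID, End Warp ID, and Valid, with TB Num = 120.]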


Outline

• Background

• Motivation

• Mitigation: WarpMan

• Experiments

• Conclusions


Methodology

• Use GPUWattch for both timing and energy evaluation

• Baseline architecture (GTX 480)
– 15 SMs with a SIMD size of 32, running at 1.4 GHz
– Max TBs per SM: 8; max threads per SM: 1536
– Scheduling policy: round robin / two level
– 16 KB L1 cache, 48 KB shared memory, 128 KB registers

• Applications from:
– Nvidia CUDA SDK
– Rodinia benchmark suite
– GPGPU-Sim


Performance Results:

• temp: allow early-finished warps to release resources for new warps

• temp + spatial: resources are allocated/released at the warp level

• The performance improvement can be as high as 71%/76%

• On average, a 15.3% improvement

[Figure: Performance Improvement (axis 100% to 150%) for temp and temp+spatial across CT, MC, RS, SN, HS, PF, SR, ST, RAY, MM, NN, BT, HG, and GM. Off-scale bars are labeled 171% and 176%.]


Energy Results

[Figure: normalized energy consumption (70% to 100%) for temp and temp+spatial across CT, MC, RS, SN, HS, PF, SR, ST, RAY, MM, NN, BT, HG, and GM.]

Energy savings exceed 20% in the best case and are 6% on average


A Software Alternative

• Change the software to have a smaller TB size

• Change the hardware to enable more concurrent TBs

• Inefficient shared memory usage / synchronization

• Decreased data locality

• More as we proceed to the experimental results…


Comparing to the Software Alternative

• CT and ST: the software approach decreases L1 locality

• NN and BT: reduced total number of threads

• On average: 25% improvement vs. 48% degradation

[Figure: Performance Improvement (0% to 180%) of temp+spatial vs. TBsize_32 for CT, MC, ST, RAY, NN, BT, and GM. Labeled values: 125% and 52%.]


Related Work

• Resource underutilization due to branch divergence or thread-level divergence has been well studied.

• Yi Yang et al. [PACT-21] target shared memory resource management; their work is complementary to our proposed WarpMan scheme.

• D. Tarjan et al. [US Patent, 2009] propose using a virtual register table to manage the physical register file and enable more concurrent TBs.


Conclusion

• We highlight the limitations of TB-level resource management

• We characterize warp-level divergence and reveal the fundamental reasons for such divergent behavior

• We propose WarpMan and show that it can be implemented with minor hardware changes

• We show that our proposed solution is highly effective and achieves significant performance improvements and energy savings

Questions?
