A Case for Core-Assisted Bottleneck Acceleration in GPUs
Enabling Flexible Data Compression with Assist Warps

Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, Onur Mutlu

Transcript of the presentation slides.

Page 1

A Case for Core-Assisted

Bottleneck Acceleration in GPUs

Enabling Flexible Data Compression

with Assist Warps

Nandita Vijaykumar

Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick,

Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir,

Todd C. Mowry, Onur Mutlu

Page 2

Executive Summary

Observation: Imbalances in execution leave GPU resources underutilized

Our Goal: Employ underutilized GPU resources to do something useful – accelerate bottlenecks using helper threads

Challenge: How do you efficiently manage and use helper threads in a throughput-oriented architecture?

Our Solution: CABA (Core-Assisted Bottleneck Acceleration)

A new framework to enable helper threading in GPUs

Enables flexible data compression to alleviate the memory bandwidth bottleneck

A wide set of use cases (e.g., prefetching, memoization)

Key Results: Using CABA to implement data compression in memory improves performance by 41.7%

Page 3

GPUs today are used for a wide range of applications:
Computer Vision, Data Analytics, Scientific Simulation, Medical Imaging

Page 4

Challenges in GPU Efficiency

[Figure: a GPU streaming multiprocessor with its cores, register file, and memory hierarchy; the register file is full while the cores sit idle]

Thread limits lead to an underutilized register file.
The memory bandwidth bottleneck leads to idle cores.

Page 5

Motivation: Unutilized On-chip Memory

24% of the register file is unallocated on average

Similar trends for on-chip scratchpad memory

[Figure: percentage of unallocated registers per application; y-axis "% Unallocated Registers", 0% to 100%]

Page 6

Motivation: Idle Pipelines

[Figure: active vs. stall cycles ("% Cycles") for memory-bound benchmarks (CONS, JPEG, LPS, MUM, RAY, SCP, PVC, PVR, bfs) and compute-bound benchmarks (NN, STO, bp, hs, dmr, NQU, SLA, lc, pt, mc)]

Memory-bound applications: 67% of cycles idle
Compute-bound applications: 35% of cycles idle

Page 7

Motivation: Summary

Heterogeneous application requirements lead to:
- Bottlenecks in execution
- Idle resources

Page 8

Our Goal

[Figure: helper threads using the idle cores, register file, and memory hierarchy]

Use idle resources to do something useful: accelerate bottlenecks using helper threads.

A flexible framework to enable helper threading in GPUs: Core-Assisted Bottleneck Acceleration (CABA)

Page 9

Helper threads in GPUs

Large body of work in CPUs … [Chappell+ ISCA ’99, MICRO ’02], [Yang+ USC TR ’98], [Dubois+ CF ’04], [Zilles+ ISCA ’01], [Collins+ ISCA ’01, MICRO ’01], [Aamodt+ HPCA ’04], [Lu+ MICRO ’05], [Luk+ ISCA ’01], [Moshovos+ ICS ’01], [Kamruzzaman+ ASPLOS ’11], etc.

However, there are new challenges with GPUs…

Page 10

Challenge

How do you efficiently manage and use helper threads in a throughput-oriented architecture?

Page 11

Managing Helper Threads in GPUs

[Diagram: possible granularities (Thread, Warp, Block) spanning the software/hardware boundary]

Where do we add helper threads?

Page 12

Approach #1: Software-only

[Figure: regular threads and helper threads as separate software threads]

+ No hardware changes
- Coarse grained
- Not aware of runtime program behavior
- Synchronization is difficult

Page 13

Where Do We Add Helper Threads?

[Diagram repeated: Thread, Warp, Block granularities spanning the software/hardware boundary]

Page 14

Approach #2: Hardware-only

+ Fine-grained control: synchronization, enforcing priorities
- Providing contexts efficiently is difficult

[Figure: a GPU core's many warps share one large register file, whereas a CPU core holds only a few register contexts]

Page 15

CABA: An Overview

SW: "tight coupling" of helper threads and regular threads
- Efficient context management
- Simpler data communication

HW: "decoupled management" of helper threads and regular threads
- Dynamic management of threads
- Fine-grained synchronization

Page 16

CABA: 1. In Software

Helper threads:
- Tightly coupled to regular threads
- Simply instructions injected into the GPU pipelines
- Share the same context (registers, block) as the regular threads

Benefits: efficient context management, simpler data communication

Page 17

CABA: 2. In Hardware

Helper threads:
- Decoupled from regular threads
- Tracked at the granularity of a warp: the Assist Warp
- Each regular (parent) warp can have different assist warps (e.g., Parent Warp X with Assist Warps A and B)

Benefits: dynamic management of threads, fine-grained synchronization

Page 18

Key Functionalities

- Triggering and squashing assist warps
- Associating events with assist warps
- Deploying active assist warps
- Scheduling instructions for execution
- Enforcing priorities
  - Between assist warps and parent warps
  - Between different assist warps

Page 19

CABA: Mechanism

[Figure: the SM pipeline (Fetch, I-Cache, Decode, Instruction Buffer, Scoreboard, Issue, ALUs, Mem, Writeback) augmented with the Assist Warp Store, Assist Warp Controller, and Assist Warp Buffer; triggers reach the controller, which deploys assist warp instructions to the scheduler]

Assist Warp Store:
- Holds instructions for the different assist warp routines

Assist Warp Controller:
- Central point of control for triggering and squashing assist warps
- Tracks progress for active assist warps

Assist Warp Buffer:
- Stages instructions from triggered assist warps for execution
- Helps enforce priorities
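The roles of these three structures can be sketched as a minimal software model. This is purely illustrative: the real design is hardware, and the class and method names here (`register`, `associate`, `trigger`, `squash`, `stage`) are assumptions, not the paper's interface.

```python
class AssistWarpStore:
    """Holds the instruction sequences for each assist warp routine."""
    def __init__(self):
        self.routines = {}          # routine name -> list of instructions

    def register(self, name, instructions):
        self.routines[name] = instructions


class AssistWarpBuffer:
    """Stages instructions from triggered assist warps for issue; the
    scheduler could prioritize regular warps over these entries."""
    def __init__(self):
        self.pending = []           # (warp id, instruction), in stage order

    def stage(self, warp_id, instrs):
        self.pending.extend((warp_id, i) for i in instrs)

    def drop(self, warp_id):
        self.pending = [(w, i) for (w, i) in self.pending if w != warp_id]


class AssistWarpController:
    """Central point of control: maps events to routines, triggers and
    squashes assist warps, and tracks progress of active ones."""
    def __init__(self, store, buffer):
        self.store = store
        self.buffer = buffer
        self.event_map = {}         # event -> routine name
        self.active = {}            # warp id -> remaining instruction count

    def associate(self, event, routine):
        self.event_map[event] = routine

    def trigger(self, event, warp_id):
        routine = self.event_map.get(event)
        if routine is None:         # no assist warp registered for this event
            return False
        instrs = self.store.routines[routine]
        self.active[warp_id] = len(instrs)
        self.buffer.stage(warp_id, instrs)
        return True

    def squash(self, warp_id):
        self.active.pop(warp_id, None)
        self.buffer.drop(warp_id)
```

For example, an "L2 fill" event could be associated with a "decompress" routine, so that every fill of a compressed block triggers an assist warp for the parent warp that missed.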

Page 20

Other functionality

In the paper:
- More details on the hardware structures
- Data communication and synchronization
- Enforcing priorities

Page 21

CABA: Applications

Data compression

Memoization

Prefetching


Page 22

A Case for CABA: Data Compression

Data compression can help alleviate the memory bandwidth bottleneck by transmitting data in a more condensed form.

[Figure: uncompressed vs. compressed data moving through the memory hierarchy while the compute pipelines sit idle]

CABA employs idle compute pipelines to perform compression.
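The benefit of the "more condensed form" is simple arithmetic: if blocks shrink on the wire, the same physical bus delivers proportionally more useful bytes per second. A back-of-envelope model (the 1.5x ratio below is illustrative, not a result from the talk):

```python
def effective_bandwidth(raw_gb_s, compression_ratio):
    """If blocks shrink by `compression_ratio` on the wire, the same
    physical bus delivers proportionally more useful bytes per second."""
    return raw_gb_s * compression_ratio

def transfer_time_ns(block_bytes, raw_gb_s, compression_ratio=1.0):
    """Time to move one cache block over the bus (1 GB/s == 1 byte/ns)."""
    return block_bytes / compression_ratio / raw_gb_s
```

With the evaluated system's 177.4 GB/s bus, a hypothetical 1.5x average compression ratio would make compressible traffic behave as if the bus ran at roughly 266 GB/s.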

Page 23

Data Compression with CABA

Use assist warps to:
- Compress cache blocks before writing to memory
- Decompress cache blocks before placing them into the cache

CABA flexibly enables various compression algorithms.

Example: BDI (Base-Delta-Immediate) Compression [Pekhimenko+ PACT ’12]
- Parallelizable across the SIMT width
- Low latency

Others: FPC [Alameldeen+ TR ’04], C-Pack [Chen+ VLSI ’10]
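A functional sketch of the BDI idea, assuming fixed 8-byte words and 1-byte deltas for illustration; the real scheme tries several base/delta sizes and packs the deltas into narrow fields, both of which are omitted here. Note that each delta is computed independently, which is what makes the work parallelizable across the SIMT width by an assist warp.

```python
def bdi_compress(block, base_size=8, delta_size=1):
    """Return (base, deltas) if every word in the block fits as
    base + narrow delta, else None (the block stays uncompressed)."""
    assert len(block) % base_size == 0
    words = [int.from_bytes(block[i:i + base_size], "little", signed=True)
             for i in range(0, len(block), base_size)]
    base = words[0]
    deltas = [w - base for w in words]      # one independent delta per lane
    limit = 1 << (8 * delta_size - 1)
    if all(-limit <= d < limit for d in deltas):
        return base, deltas
    return None

def bdi_decompress(base, deltas, base_size=8):
    """Rebuild the original cache block from a base and its deltas."""
    return b"".join((base + d).to_bytes(base_size, "little", signed=True)
                    for d in deltas)
```

Blocks of numerically close values (pointers into one region, pixels, small counters) compress; blocks with widely spread values fall through uncompressed.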

Page 24

Walkthrough of Decompression

[Figure: an L1D hit is served directly; an L1D miss fetches the compressed block from L2/memory, and the returned fill triggers the Assist Warp Controller to launch a decompression routine from the Assist Warp Store on the cores]
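The fill path above can be modeled as a toy two-level hierarchy: a hit needs no assist warp, while a miss that returns a compressed block runs a decompression routine before the block is placed into the cache. This is a functional model only; the routine is an ordinary function standing in for an assist warp, and the block format (base plus deltas) is illustrative.

```python
class ToyCacheHierarchy:
    def __init__(self, memory):
        self.memory = memory        # address -> (tag, payload) in L2/memory
        self.l1 = {}                # address -> uncompressed block

    def load(self, addr):
        if addr in self.l1:                      # L1 hit: served directly,
            return self.l1[addr]                 # no assist warp needed
        tag, payload = self.memory[addr]         # L1 miss: fill from below
        if tag == "compressed":
            block = self._decompress(payload)    # trigger point for the
        else:                                    # decompression assist warp
            block = payload
        self.l1[addr] = block                    # place uncompressed block
        return block

    @staticmethod
    def _decompress(payload):
        """Stand-in routine: rebuild a block from a base and its deltas."""
        base, deltas = payload
        return [base + d for d in deltas]
```

The key point the walkthrough makes is the trigger placement: decompression runs on the fill path, so data in the L1 is always stored uncompressed and subsequent hits pay no extra latency.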

Page 25

Walkthrough of Compression

[Figure: a block written back from the L1D triggers the Assist Warp Controller to launch a compression routine from the Assist Warp Store on the cores; the compressed block is then written to L2/memory]

Page 26

Evaluation

Page 27

Methodology

Simulator: GPGPU-Sim, GPUWattch

Workloads: Lonestar, Rodinia, MapReduce, CUDA SDK

System Parameters:
- 15 SMs, 32 threads/warp
- 48 warps/SM, 32768 registers, 32KB shared memory
- Core: 1.4GHz, GTO warp scheduler, 2 schedulers/SM
- Memory: 177.4GB/s bandwidth, 6 GDDR5 memory controllers, FR-FCFS scheduling
- Cache: L1 16KB, 4-way associative; L2 768KB, 16-way associative

Metrics:
- Performance: Instructions per Cycle (IPC)
- Bandwidth consumption: fraction of cycles the DRAM data bus is busy

Page 28

Effect on Performance

[Figure: normalized performance of CABA-BDI and No-Overhead-BDI across workloads]

CABA provides a 41.7% average performance improvement.
CABA achieves performance close to that of designs with no overhead for compression.

Page 29

Effect on Bandwidth Consumption

[Figure: memory bandwidth consumption (fraction of cycles the DRAM data bus is busy) for Baseline vs. CABA-BDI]

Data compression with CABA alleviates the memory bandwidth bottleneck.

Page 30

Different Compression Algorithms

[Figure: normalized performance for CABA-FPC, CABA-BDI, CABA-CPack, and CABA-BestOfAll]

CABA is flexible: it improves performance with different compression algorithms.

Page 31

Other Results

CABA's performance is similar to that of pure hardware-based BDI compression.

CABA reduces overall system energy (by 22%) by decreasing the off-chip memory traffic.

Other evaluations:
- Compression ratios
- Sensitivity to memory bandwidth
- Capacity compression
- Compression at different levels of the hierarchy

Page 32

Conclusion

Observation: Imbalances in execution leave GPU resources underutilized

Our Goal: Employ underutilized GPU resources to do something useful – accelerate bottlenecks using helper threads

Challenge: How do you efficiently manage and use helper threads in a throughput-oriented architecture?

Our Solution: CABA (Core-Assisted Bottleneck Acceleration)

A new framework to enable helper threading in GPUs

Enables flexible data compression to alleviate the memory bandwidth bottleneck

A wide set of use cases (e.g., prefetching, memoization)

Key Results: Using CABA to implement data compression in memory improves performance by 41.7%

Page 33

A Case for Core-Assisted

Bottleneck Acceleration in GPUs

Enabling Flexible Data Compression

with Assist Warps

Nandita Vijaykumar

Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick,

Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir,

Todd C. Mowry, Onur Mutlu

Page 34

Backup Slides

Page 35

Effect on Energy

[Figure: normalized energy for CABA-BDI, Ideal-BDI, HW-BDI-Mem, and HW-BDI]

CABA reduces the overall system energy by decreasing the off-chip memory traffic.

Page 36

Effect on Compression Ratio

[Figure: compression ratios across workloads]

Page 37

Other Uses of CABA

Hardware Memoization

Goal: avoid redundant computation by reusing previous results for the same or similar inputs.

Idea:
- Hash the inputs at predefined points
- Use the load/store pipelines to save inputs in shared memory
- Eliminate redundant computation by loading stored results

Prefetching

Similar to helper-thread prefetching in CPUs
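The memoization steps can be sketched as a small functional model: hash the inputs at a predefined point, and reuse a stored result when that hash has been seen before. In CABA the lookup and update would be done by an assist warp through the load/store pipelines against a small table in shared memory; this dictionary-based version is only a stand-in, and all names here are illustrative.

```python
_MISS = object()   # sentinel so that a stored result of None is still a hit

class MemoTable:
    """Bounded table of input-hash -> saved result (shared memory is small,
    so the capacity is fixed and no eviction is modeled)."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.table = {}

    def lookup(self, inputs):
        return self.table.get(hash(inputs), _MISS)

    def save(self, inputs, result):
        if len(self.table) < self.capacity:
            self.table[hash(inputs)] = result

def memoized(table, compute, inputs):
    """Run `compute` only when no saved result exists for these inputs."""
    result = table.lookup(inputs)
    if result is _MISS:
        result = compute(*inputs)
        table.save(inputs, result)
    return result
```

Hashing the inputs means collisions (or deliberately coarse hashes over "similar" inputs) trade exactness for more reuse, which is why the slide says "same/similar inputs".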