A Case for Toggle-Aware Compression for GPU Systems

Gennady Pekhimenko, Nandita Vijaykumar, Onur Mutlu, Todd C. Mowry, Evgeny Bolotin, Stephen W. Keckler


Page 1: A Case for Toggle-Aware Compression for GPU Systems

A Case for Toggle-Aware Compression for GPU Systems

Gennady Pekhimenko,

Nandita Vijaykumar,

Onur Mutlu, Todd C. Mowry

Evgeny Bolotin, Stephen W. Keckler

Page 2: A Case for Toggle-Aware Compression for GPU Systems

Executive Summary


Data compression is a known technique to decrease the bandwidth pressure

Observation: Compression significantly increases the energy cost of communication by increasing the number of bit toggles (bit flips)

Our approach: Toggle-Aware Compression

• Energy Control (EC): send compressed data only when it is beneficial

• Metadata Consolidation (MC): consolidate metadata bits to reduce the bit toggle count

Key results: 2.2X increase in bit toggles reduced to only 1.1X with most of the performance benefits preserved

Page 3: A Case for Toggle-Aware Compression for GPU Systems

Performance and Energy Efficiency

Energy efficiency

Applications today are data-intensive:

• Memory Caching

• Databases

• Graphics

Page 4: A Case for Toggle-Aware Compression for GPU Systems

Computation vs. Communication

Data movement is very costly

– Integer operation: ~1 pJ

– Floating-point operation: ~20 pJ

– Low-power memory access: ~1200 pJ

Implication

– Transfer less or keep data near processing units


Modern memory systems are bandwidth constrained

Sources: Bill Dally (NVIDIA/Stanford), Kayvon Fatahalian (CMU)
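As a rough reading of the energy figures on this slide, the sketch below computes how many arithmetic operations cost as much as one low-power memory access. The dictionary and function names are illustrative; the picojoule values are the approximate ones quoted above.

```python
# Approximate per-operation energy from the slide, in picojoules.
ENERGY_PJ = {
    "integer_op": 1,
    "floating_point_op": 20,
    "low_power_memory_access": 1200,
}

def ops_per_memory_access(op):
    """How many `op` operations cost as much energy as one memory access."""
    return ENERGY_PJ["low_power_memory_access"] / ENERGY_PJ[op]

print(ops_per_memory_access("integer_op"))         # 1200.0
print(ops_per_memory_access("floating_point_op"))  # 60.0
```

With these figures, one memory access costs roughly as much as 60 floating-point operations, which is why transferring less data matters.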

Page 5: A Case for Toggle-Aware Compression for GPU Systems

Potential for Data Compression

Significant redundancy in memory transfers:

0x00000000 0x0000000B 0x00000003 0x00000004 …

How can we exploit this redundancy?

• Bandwidth compression

• Provides the effect of a higher effective bandwidth without increasing the number of wires or raising the frequency
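To make the redundancy concrete, here is a minimal sketch of a narrow-value encoding applied to the four words shown on this slide. It is a deliberately simplified stand-in, not FPC, BDI, or any other algorithm evaluated in the talk; the function names are hypothetical and metadata overhead is ignored.

```python
def narrow_width_bytes(words):
    """Smallest width in bytes (1, 2, or 4) that can hold every 32-bit word."""
    for width in (1, 2, 4):
        if all(0 <= w < (1 << (8 * width)) for w in words):
            return width
    raise ValueError("values do not fit in 32 bits")

def compression_ratio(words):
    """Uncompressed size over compressed size, ignoring any metadata."""
    return 4 / narrow_width_bytes(words)

line = [0x00000000, 0x0000000B, 0x00000003, 0x00000004]
print(narrow_width_bytes(line), compression_ratio(line))  # 1 4.0
```

Every word in the example fits in one byte, so the same data can cross the bus in a quarter of the transfers.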

Page 6: A Case for Toggle-Aware Compression for GPU Systems

Bandwidth Compression for GPUs

[Figure: compression ratio (y-axis, 0.9 to 1.8) for FPC, BDI, BDI+FPC, Fibonacci, and C-Pack, shown for three workload groups: Discrete, Mobile, and Open-Source]

Compression is effective in providing higher bandwidth

Workloads: 221 apps from NVIDIA, 21 open-source apps

Page 7: A Case for Toggle-Aware Compression for GPU Systems

Common Wisdom about Compression

Higher effective capacity

Higher effective bandwidth

(Usually) lower energy consumption


Compression/Decompression overhead

Complexity/Cost to support variable size

A new problem:

Significant increase in the bit toggle count (# bit flips), despite fewer bits sent

Page 8: A Case for Toggle-Aware Compression for GPU Systems

What is a Bit Toggle?

How energy is spent in data transfers:

Previous data: 0011

New data: 0101

Bit toggles: 2 (the second and third wires change value)

Energy = C*V^2

Energy of data transfers (e.g., NoC, DRAM) is proportional to the bit toggle count
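The definition above can be sketched in a few lines: a toggle is a wire whose value differs between two consecutive transfers. The function name is illustrative, and transfers are represented as bit strings for readability.

```python
def count_toggles(previous, new):
    """Number of wires whose value changes between two equal-width transfers."""
    if len(previous) != len(new):
        raise ValueError("transfers must have the same width")
    return sum(p != n for p, n in zip(previous, new))

print(count_toggles("0011", "0101"))  # 2
```

For the slide's example, the first and last wires keep their values and the two middle wires flip, giving two toggles.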

Page 9: A Case for Toggle-Aware Compression for GPU Systems

Excessive Number of Bit Toggles

Uncompressed Cache Line

0x00003A00 0x8001D000 0x00003A01 0x8001D008 …

Flit 0: 0x00003A00 0x8001D000

Flit 1: 0x00003A01 0x8001D008

Flit 0 XOR Flit 1 = 000000010….01000

# Toggles = 2

Compressed Cache Line (FPC)

0x5 0x3A00 0x7 8001D000 0x5 0x3A01 0x7 8001D008 …

Flit 0: 5 3A00 7 8001D000 5 1D

Flit 1: 1 01 7 8001D008 5 3A02 1

Flit 0 XOR Flit 1 = 001001111 … 110100011000

# Toggles = 31
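The uncompressed count on this slide can be reproduced with a short sketch, assuming 8-byte flits (two 32-bit words per flit) sent back to back on the same wires. The helper names are hypothetical. The compressed count of 31 depends on the exact bit layout of the FPC output, so it is not reproduced here.

```python
def to_flits(words, words_per_flit=2, word_bits=32):
    """Group 32-bit words into flits, first word in the most significant bits."""
    flits = []
    for i in range(0, len(words), words_per_flit):
        flit = 0
        for w in words[i:i + words_per_flit]:
            flit = (flit << word_bits) | w
        flits.append(flit)
    return flits

def total_toggles(flits):
    """Sum of bit flips between each pair of consecutive flits."""
    return sum(bin(a ^ b).count("1") for a, b in zip(flits, flits[1:]))

line = [0x00003A00, 0x8001D000, 0x00003A01, 0x8001D008]
print(total_toggles(to_flits(line)))  # 2
```

Because the uncompressed words are aligned, each word is XORed with a nearly identical neighbor and only two bits differ. Compression breaks that alignment, which is where the extra toggles come from.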

Page 10: A Case for Toggle-Aware Compression for GPU Systems

Effect of Compression on Bit Toggles

[Figure: normalized bit toggle count (y-axis, 0.8 to 2.2) for FPC, BDI, BDI+FPC, Fibonacci, and C-Pack, shown for three workload groups: Discrete, Mobile, and Open-Source]

Compression significantly increases bit toggle count

Page 11: A Case for Toggle-Aware Compression for GPU Systems

Outline

• Motivation

• Key Observations

• Toggle-Aware Compression:

– Energy Control (EC)

– Metadata Consolidation (MC)

• Evaluation

• Conclusion

Page 12: A Case for Toggle-Aware Compression for GPU Systems

Energy Control Decision Flow

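[Diagram: the cache line ($Line) is compressed (Comp. $Line), and bit toggles are counted for both versions (T0, T1). The EC decision takes T0, T1, the bandwidth (BW) utilization, and the compression ratio (CR) as inputs and selects which version of the line to send.]

Send compressed or uncompressed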

Page 13: A Case for Toggle-Aware Compression for GPU Systems

How to Make the EC Decision?


• Energy

– Battery life

• Energy X Delay

– Balance performance and energy

• Energy X Delay^2

– Fixed power with voltage scaling

• Energy ~ Toggle #, Delay ~ 1/(Comp. Ratio)

– When bandwidth utilization (BU) is high (>50%), use a 1 / (1 - BU) coefficient
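One possible reading of this decision rule is sketched below. Energy is approximated by the toggle count and delay by the inverse of the compression ratio, both relative to sending the line uncompressed. The slide does not spell out how the 1 / (1 - BU) coefficient enters the formula, so the weighting used here is an assumption, as are the function and parameter names.

```python
def send_compressed(toggles_uncompressed, toggles_compressed,
                    comp_ratio, bw_utilization, delay_exponent=1):
    """Return True if the compressed line has the lower Energy x Delay^n cost.

    Costs are relative to sending the line uncompressed (cost 1.0).
    Use delay_exponent=2 for the Energy x Delay^2 metric.
    """
    if toggles_uncompressed == 0:
        # Nothing toggles when sent uncompressed; compress only if that is also free.
        return toggles_compressed == 0
    energy = toggles_compressed / toggles_uncompressed  # Energy ~ toggle count
    delay = 1.0 / comp_ratio                            # Delay ~ 1 / compression ratio
    if bw_utilization > 0.5:
        # ASSUMPTION: under high bandwidth utilization the delay benefit of
        # compression is weighted by 1 / (1 - BU), i.e. delay scales by (1 - BU).
        delay *= 1.0 - bw_utilization
    return energy * delay ** delay_exponent < 1.0

print(send_compressed(10, 12, 2.0, 0.2))  # True: few extra toggles, good ratio
print(send_compressed(2, 31, 1.5, 0.2))   # False: toggles dominate
print(send_compressed(10, 30, 2.0, 0.9))  # True: the bus is nearly saturated
```

Under this reading, a line that triples the toggle count is still sent compressed when the bus is close to saturation, because bandwidth is then the scarcer resource.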

Page 14: A Case for Toggle-Aware Compression for GPU Systems

EC in the System

[Diagram: data path from the Streaming Multiprocessor (L1D) over the NoC to the LLC, and from the LLC over the off-chip bus to DRAM. The diagram shows four Compressor/Decompressor blocks, each paired with an Energy Control block, placed along this path.]

Page 15: A Case for Toggle-Aware Compression for GPU Systems

Energy Control Summary

• Bit toggle count: compressed vs. uncompressed

• Use a heuristic (Energy X Delay or Energy X Delay^2 metric) to estimate the trade-off

• Take bandwidth utilization into account

• Throttle compression when it is not beneficial


Page 16: A Case for Toggle-Aware Compression for GPU Systems

Metadata Consolidation

Compressed Cache Line with FPC, 4-byte flits

0x5, 0x3A00, 0x5, 0x3A01, 0x5, 0x3A02, 0x5, 0x3A03, …

# Toggles = 18

Toggle-aware FPC: all metadata consolidated

0x3A00, 0x3A01, 0x3A02, 0x3A03, … followed by Consolidated Metadata: 0x5 0x5 … 0x5

# Toggles = 2
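The two toggle counts on this slide can be reproduced with a small sketch under stated assumptions: each FPC prefix (0x5) is 3 bits wide, each data value is 16 bits, fields are packed most-significant-bit first, and toggles are counted between the first two 4-byte flits. The helper names are hypothetical and the packing is a simplification of real FPC output.

```python
def pack_bits(fields):
    """Concatenate (value, width) fields MSB-first into one bit string."""
    return "".join(format(value, f"0{width}b") for value, width in fields)

def first_flit_toggles(bits, flit_bits=32):
    """Bit flips between the first and the second flit of a bit stream."""
    flit0 = bits[:flit_bits]
    flit1 = bits[flit_bits:2 * flit_bits]
    return sum(a != b for a, b in zip(flit0, flit1))

PREFIX = (0x5, 3)  # assumed 3-bit FPC prefix
values = [0x3A00, 0x3A01, 0x3A02, 0x3A03]

# Original layout: each prefix sits directly in front of its data value.
interleaved = pack_bits([f for v in values for f in (PREFIX, (v, 16))])
# Toggle-aware layout: all data first, all prefixes consolidated at the end.
consolidated = pack_bits([(v, 16) for v in values] + [PREFIX] * len(values))

print(first_flit_toggles(interleaved))   # 18
print(first_flit_toggles(consolidated))  # 2
```

Interleaved 3-bit prefixes push the 16-bit values off their flit alignment, so similar values no longer line up on the same wires. Moving the metadata out of the way restores the alignment.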

Page 17: A Case for Toggle-Aware Compression for GPU Systems

Outline

• Motivation

• Key Observations

• Toggle-Aware Compression:

– Energy Control (EC)

– Metadata Consolidation (MC)

• Evaluation

• Conclusion

Page 18: A Case for Toggle-Aware Compression for GPU Systems

Methodology

• Simulator: GPGPU-Sim 3.2.x and in-house simulator

• Workloads:

– NVIDIA apps (discrete and mobile): 221 apps

– Open-source (Lonestar, Rodinia, MapReduce): 21 apps

• System parameters (Fermi):

– 15 SMs, 32 threads/warp

– 48 warps/SM, 32768 registers, 32KB Shared Memory

– Core: 1.4GHz, GTO scheduler, 2 schedulers/SM

– Memory: 177.4GB/s BW, GDDR5

– Cache: L1 - 16KB; L2 - 768KB


Page 19: A Case for Toggle-Aware Compression for GPU Systems

Effect of EC on Bit Toggle Count

[Figure: normalized bit toggle count (y-axis, 0.8 to 2.2), Without EC vs. With EC, for FPC, BDI, BDI+FPC, Fibonacci, and C-Pack, shown for three workload groups: Discrete, Mobile, and Open-Source]

EC significantly reduces the bit toggle count

Works for different compression algorithms

Page 20: A Case for Toggle-Aware Compression for GPU Systems

Effect of EC on Compression Ratio

[Figure: compression ratio (y-axis, 0.8 to 1.8), Without EC vs. With EC, for FPC, BDI, BDI+FPC, Fibonacci, and C-Pack, shown for three workload groups: Discrete, Mobile, and Open-Source]

EC preserves most of the benefits of compression

Page 21: A Case for Toggle-Aware Compression for GPU Systems

Bit Toggles for C-Pack Algorithm

[Figure: normalized bit toggle count (y-axis, 0 to 10) with C-Pack, Without EC vs. EC, per application. CUDA: BFS, CONS, FWT, JPEG, LPS, MUM, RAY, SLA, TRA; lonestar: bfs, bh, mst, sp, sssp; Mars: Kmeans, MatrixMul, PageViewCount, PageViewRank, SimilarityScore; rodinia: heartwall, nw; GeoMean]

Different tradeoffs for different applications

Page 22: A Case for Toggle-Aware Compression for GPU Systems

DRAM Energy for C-Pack

[Figure: normalized DRAM energy (y-axis, 0 to 3) with C-Pack, Without EC vs. EC, per application. CUDA: BFS, CONS, FWT, JPEG, LPS, MUM, RAY, SLA, TRA; lonestar: bfs, bh, mst, sp, sssp; Mars: Kmeans, MatrixMul, PageViewCount, PageViewRank, SimilarityScore; rodinia: heartwall, nw; GeoMean]

7% average DRAM energy reduction, up to 28% for TRA

Page 23: A Case for Toggle-Aware Compression for GPU Systems

Effect of Metadata Consolidation (MC)

[Figure: normalized bit toggle count (y-axis, 0 to 7) with the FPC compression algorithm, comparing Without EC, MC, EC, and MC+EC, per application. CUDA: BFS, CONS, FWT, JPEG, LPS, MUM, RAY, SLA, TRA; lonestar: bfs, bh, mst, sp, sssp; Mars: Kmeans, MatrixMul, PageViewCount, PageViewRank, SimilarityScore; rodinia: heartwall, nw; GeoMean]

MC is effective in reducing the bit toggle count

But less effective than EC

Page 24: A Case for Toggle-Aware Compression for GPU Systems

Other Results in the Paper

• On-chip interconnect results

– Higher impact of bit toggles on the interconnect energy, but lower overall energy impact

• Data bus inversion (DBI)

– EC and MC benefits are independent of whether DBI encoding is used

• Complexity estimation

– Energy and latency

• Analyzing different EC decision functions

– Energy X Delay vs. Energy X Delay^2

Page 25: A Case for Toggle-Aware Compression for GPU Systems

Conclusion


Data compression is a known technique to decrease the bandwidth pressure

Observation: Compression significantly increases the energy cost of communication by increasing the number of bit toggles (bit flips)

Our approach: Toggle-Aware Compression

• Energy Control (EC): send compressed data only when it is beneficial

• Metadata Consolidation (MC): consolidate metadata bits to reduce the bit toggle count

Key results: 2.2X increase in bit toggles reduced to only 1.1X with most of the performance benefits preserved
