L1 Data Cache Decomposition for Energy Efficiency


Page 1: L1 Data Cache Decomposition for Energy Efficiency

L1 Data Cache Decomposition for Energy Efficiency

Michael Huang, Joe Renau, Seung-Moon Yoo, Josep Torrellas
University of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu/flexram

Page 2: L1 Data Cache Decomposition for Energy Efficiency

International Symposium on Low Power Electronics and Design, August 2001


Objective

Reduce L1 data cache energy consumption
No performance degradation

Partition the cache in multiple ways
Specialization for stack accesses

Page 3: L1 Data Cache Decomposition for Energy Efficiency


Outline

L1 D-Cache decomposition
Specialized Stack Cache
Pseudo Set-Associative Cache
Simulation Environment
Evaluation
Conclusions

Page 4: L1 Data Cache Decomposition for Energy Efficiency


L1 D-Cache Decomposition

A Specialized Stack Cache (SSC)

A Pseudo Set-Associative Cache (PSAC)

Page 5: L1 Data Cache Decomposition for Energy Efficiency


Selection

Selection is done in the decode stage to speed it up
– Based on instruction address and opcode

2Kbit table to predict the PSAC way

[Figure: selection logic with inputs Opcode and Address and outputs PSAC and SSC]
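The selection step can be sketched as follows. This is a minimal illustration under assumptions the slide does not state: stack accesses are recognized by a stack-pointer base register (the slide only says the choice uses the instruction address and opcode), and the 2 Kbit table is laid out as 1024 two-bit entries indexed by low-order instruction-address bits. All names are hypothetical.

```python
WAYS = 4                    # 4-way PSAC, so a way number fits in 2 bits
TABLE_BITS = 2048           # 2 Kbit table, as stated on the slide
ENTRIES = TABLE_BITS // 2   # assumption: 1024 two-bit entries


class WayPredictor:
    """Predicts the PSAC way for an instruction (hypothetical layout)."""

    def __init__(self):
        self.table = [0] * ENTRIES

    def _index(self, pc):
        # Assumption: 4-byte instructions, so drop the two low bits
        # and use the next low-order bits as the table index.
        return (pc >> 2) % ENTRIES

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, actual_way):
        # Remember the way that actually hit for this instruction.
        self.table[self._index(pc)] = actual_way % WAYS


def select_cache(base_register, pc, predictor):
    """Decode-stage choice: ("SSC", None) or ("PSAC", predicted_way)."""
    # Assumption: a stack access is one whose base register is "sp".
    if base_register == "sp":
        return ("SSC", None)
    return ("PSAC", predictor.predict(pc))
```

Because both inputs are known at decode, the choice can be made before the effective address is computed, which is the point of doing the selection early.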

Page 6: L1 Data Cache Decomposition for Energy Efficiency


Stack Cache

Small, direct-mapped cache
Virtually tagged
Software optimizations:
– Very important to reduce stack cache size
– Avoid thrashing: allocate large structs in heap
– Easy to implement

Page 7: L1 Data Cache Decomposition for Energy Efficiency


SSC: Specialized Stack Cache

Pointers to reduce traffic:
TOS: reduces the number of write-backs
SRB (safe-region-bottom): reduces unnecessary line-fills on write misses
– The region between TOS and SRB is “safe” (missing lines are uninitialized)

[Figure: stack layout as the stack grows, showing the TOS and SRB pointers and an infrequently accessed region]
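The two pointer checks can be sketched as below. This is a rough sketch, not the authors' hardware: it assumes the stack grows toward lower addresses, that the safe region is the half-open range from TOS up to SRB, and a hypothetical 32-byte line. It also ignores lines that straddle a region boundary.

```python
LINE_SIZE = 32  # bytes per cache line; hypothetical value


def needs_write_back(line_addr, dirty, tos):
    """A dirty line is written back only if it may still hold live stack data."""
    # Assumption: the stack grows toward lower addresses, so bytes below
    # TOS are dead and a line lying entirely below TOS can be dropped.
    line_end = line_addr + LINE_SIZE
    return dirty and line_end > tos


def needs_line_fill(addr, tos, srb):
    """On a write miss, skip the line-fill when the address is in the safe region."""
    # Assumption: the safe region is the half-open range [TOS, SRB).
    in_safe_region = tos <= addr < srb
    return not in_safe_region
```

For example, with TOS at 0x1040 a dirty line at 0x1000 is dead and is simply discarded, while a write miss at 0x1050 with SRB at 0x1080 allocates the line without fetching it.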

Page 8: L1 Data Cache Decomposition for Energy Efficiency


Pseudo Set-Associative Cache

Partition the cache in 4 ways

Evaluated activation policies: Sequential, FallBackReg, Phased Cache, FallBackPha, PredictPha

[Figure: the Tag and Data arrays of the partitioned cache]

Page 9: L1 Data Cache Decomposition for Energy Efficiency


Sequential (Calder ‘96)

[Figure: way activation in cycle 1, cycle 2, and cycle 3]

Page 10: L1 Data Cache Decomposition for Energy Efficiency


Fallback-regular (Inoue ‘99)

[Figure: way activation in cycle 1 and cycle 2]

Page 11: L1 Data Cache Decomposition for Energy Efficiency


Phased Cache (Hasegawa ‘95)

[Figure: way activation in cycle 1 and cycle 2]

Page 12: L1 Data Cache Decomposition for Energy Efficiency


Fallback-phased (ours)

[Figure: way activation in cycle 1, cycle 2, and cycle 3]

Emphasis on energy reduction

Page 13: L1 Data Cache Decomposition for Energy Efficiency


Predictive Phased (ours)

[Figure: way activation in cycle 1 and cycle 2]

Emphasis on performance
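The diagrams for the five activation policies are not recoverable from this transcript, so the sketch below is only one plausible reading of them, chosen to match the cycle counts the slides show. It counts cycles, tag-array reads, and data-array reads for a cache hit. The behaviour assigned to each policy, and every name in the code, is an assumption rather than the authors' definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    cycles: int
    tag_reads: int
    data_reads: int


def access_cost(policy, ways, predicted, actual):
    """Cost of a hit in way `actual` when way `predicted` is tried first."""
    hit_first = predicted == actual
    if policy == "Sequential":
        # Assumed: one way (tag + data) per cycle, predicted way first.
        order = [predicted] + [w for w in range(ways) if w != predicted]
        probes = order.index(actual) + 1
        return Cost(probes, probes, probes)
    if policy == "FallBackReg":
        # Assumed: predicted way first, then all remaining ways in parallel.
        return Cost(1, 1, 1) if hit_first else Cost(2, ways, ways)
    if policy == "Phased":
        # Assumed: all tags in cycle 1, only the matching data way in cycle 2.
        return Cost(2, ways, 1)
    if policy == "FallBackPha":
        # Assumed: predicted way first, then a phased access to the rest.
        return Cost(1, 1, 1) if hit_first else Cost(3, ways, 2)
    if policy == "PredictPha":
        # Assumed: all tags plus the predicted data way in cycle 1,
        # then the matching data way in cycle 2 on a misprediction.
        return Cost(1, ways, 1) if hit_first else Cost(2, ways, 2)
    raise ValueError(f"unknown policy: {policy}")
```

Under these assumptions a mispredicted FallBackPha hit costs three cycles but only two data-array reads, while PredictPha never takes more than two cycles, which is consistent with the "energy" and "performance" emphasis noted on the two slides.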

Page 14: L1 Data Cache Decomposition for Energy Efficiency


Simulation Environment

Baseline configuration:
– Processor: 1 GHz, R10000-like
– L1: 32 KB, 2-way
– L2: 512 KB, 8-way phased cache
– Memory: 1 Rambus channel
Energy model: extended CACTI
Energy is for the data memory hierarchy only

Page 15: L1 Data Cache Decomposition for Energy Efficiency


Applications

Multimedia:
– Mp3dec: MP3 decoder
– Mp3enc: MP3 encoder
SPECint:
– Gzip: data compression
– Crafty: chess game
– MCF: traffic model
Scientific:
– Bsom: data mining
– Blast: protein matching
– Treeadd: Olden tree search

Page 16: L1 Data Cache Decomposition for Energy Efficiency


Adding a Stack Cache

Delay, Energy, and E*D, normalized to baseline:

Configuration   Delay   Energy   E*D
PLAIN 256B      1.01    0.83     0.84
SSC 256B        1.00    0.80     0.81
PLAIN 512B      0.99    0.78     0.77
SSC 512B        0.99    0.77     0.76
PLAIN 1KB       0.99    0.77     0.76
SSC 1KB         0.98    0.76     0.75

For the same size the Specialized Stack Cache is always better
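As a sanity check on how the three metrics relate, the snippet below recomputes E*D as the product of the normalized delay and energy and compares it with the reported value. It assumes the numbers on this slide are listed in legend order, which is a reading of the chart rather than something the transcript states. Small differences are expected because the slide values are rounded to two decimals.

```python
# (delay, energy, reported E*D), normalized to baseline.
# Assumption: values appear on the slide in legend order.
results = {
    "PLAIN 256B": (1.01, 0.83, 0.84),
    "SSC 256B": (1.00, 0.80, 0.81),
    "PLAIN 512B": (0.99, 0.78, 0.77),
    "SSC 512B": (0.99, 0.77, 0.76),
    "PLAIN 1KB": (0.99, 0.77, 0.76),
    "SSC 1KB": (0.98, 0.76, 0.75),
}


def ed_mismatch(delay, energy, reported_ed):
    """Absolute gap between delay * energy and the reported E*D."""
    return abs(delay * energy - reported_ed)


worst = max(ed_mismatch(*values) for values in results.values())
```

Every row agrees with the product to within rounding, which supports reading each triple as (Delay, Energy, E*D) for one configuration.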

Page 17: L1 Data Cache Decomposition for Energy Efficiency


Pseudo Set-Associative Cache

Delay, Energy, and E*D, normalized to baseline:

Configuration       Delay   Energy   E*D
4-way Sequential    1.05    0.68     0.72
4-way FallBackReg   0.99    0.69     0.69
4-way Phased        1.05    0.74     0.78
4-way FallBackPha   1.01    0.67     0.68
4-way PredictPha    0.98    0.68     0.67

PredictPha has the best delay and energy-delay product
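The claim can be checked directly from the slide's numbers. The snippet below picks the policy with the lowest value for each metric, assuming the values are listed in legend order (an inference from the chart layout, not something the transcript states).

```python
# (delay, energy, E*D), normalized to baseline, for the 4-way PSAC policies.
# Assumption: values appear on the slide in legend order.
psac = {
    "Sequential": (1.05, 0.68, 0.72),
    "FallBackReg": (0.99, 0.69, 0.69),
    "Phased": (1.05, 0.74, 0.78),
    "FallBackPha": (1.01, 0.67, 0.68),
    "PredictPha": (0.98, 0.68, 0.67),
}

best_delay = min(psac, key=lambda policy: psac[policy][0])
best_energy = min(psac, key=lambda policy: psac[policy][1])
best_ed = min(psac, key=lambda policy: psac[policy][2])
```

With this reading, PredictPha wins on delay and E*D, while FallBackPha has the lowest energy by a small margin.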

Page 18: L1 Data Cache Decomposition for Energy Efficiency


PSAC: 2-way vs. 4-way

Delay, Energy, and E*D, normalized to baseline:

Configuration      Delay   Energy   E*D
2-way Sequential   0.99    0.78     0.77
2-way PredictPha   0.97    0.79     0.76
4-way PredictPha   0.98    0.68     0.67

For E*D, 4-way PSAC is better than 2-way

Page 19: L1 Data Cache Decomposition for Energy Efficiency


Pseudo Set-Associative + Specialized Stack Cache

Delay, Energy, and E*D, normalized to baseline:

Configuration                Delay   Energy   E*D
4-way PredictPha             0.98    0.68     0.67
4-way PredictPha + SSC256B   0.98    0.61     0.60
4-way PredictPha + SSC512B   0.97    0.58     0.56
4-way PredictPha + SSC1KB    0.96    0.57     0.55

Combining PSAC and SSC reduces E*D by 44% on average
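The percentage follows from the normalized E*D values, since a normalized value of x corresponds to a reduction of (1 - x) relative to the baseline. The snippet below works this out, assuming the slide's values are in legend order. Under that reading, the 44% figure corresponds to the 512-byte SSC configuration.

```python
# Normalized E*D per configuration.
# Assumption: values appear on the slide in legend order.
combined_ed = {
    "4-way PredictPha": 0.67,
    "4-way PredictPha + SSC256B": 0.60,
    "4-way PredictPha + SSC512B": 0.56,
    "4-way PredictPha + SSC1KB": 0.55,
}


def reduction_percent(normalized_value):
    """Percentage reduction relative to a baseline of 1.0."""
    return round((1.0 - normalized_value) * 100)


reductions = {name: reduction_percent(ed) for name, ed in combined_ed.items()}
```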

Page 20: L1 Data Cache Decomposition for Energy Efficiency


Area Constrained: small PSAC+SSC

Delay, Energy, and E*D, normalized to baseline:

Configuration                     Delay   Energy   E*D
24KB 3-way PredictPha             0.98    0.74     0.72
24KB 3-way PredictPha + SSC512B   0.98    0.61     0.60
32KB 4-way PredictPha + SSC512B   0.97    0.58     0.56

SSC + small PSAC delivers a cost-effective E*D design

Page 21: L1 Data Cache Decomposition for Energy Efficiency


Energy Breakdown

[Figure: energy breakdown normalized to baseline, split into SSC, L1, L2, and Mem, for the Baseline, 4-way PSAC, SSC512B, and Comb configurations on BLAST, MCF, and MP3D]

Page 22: L1 Data Cache Decomposition for Energy Efficiency


Conclusions

Stack cache: important for energy efficiency
SW optimization required for stack caches
Effective Specialized Stack Cache extensions
Pseudo Set-Associative Cache:
– 4-way more effective than 2-way
– Predictive Phased PSAC has the lowest E*D
Effective to combine PSAC and SSC
– E*D reduced by 44% on average

Page 23: L1 Data Cache Decomposition for Energy Efficiency


Backup Slides

Page 24: L1 Data Cache Decomposition for Energy Efficiency


Cache Energy

[Figure: cache energy in pJ (axis from 0 to 2000) versus cache size (4K, 8K, 16K, 32K, 64K) for 1-way, 2-way, and 4-way caches]

Page 25: L1 Data Cache Decomposition for Energy Efficiency


Extended CACTI

New sense amplifier
– 15% bit-line swing for reads
Full bit-line swing for writes
Different energy for reads, writes, line-fills, and write-backs
Multiple optimization parameters
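Keeping a separate energy per access type means the total cache energy is a weighted sum of event counts. The sketch below shows only that bookkeeping. The per-event energies are hypothetical placeholders, not values from the slides or from CACTI, and the function name is made up for illustration.

```python
# Hypothetical per-event energies in pJ; NOT taken from the slides.
ENERGY_PJ = {
    "read": 100.0,
    "write": 150.0,
    "line_fill": 400.0,
    "write_back": 400.0,
}


def total_energy_pj(event_counts, energy_pj=ENERGY_PJ):
    """Sum of count * per-event energy over the access types seen."""
    return sum(count * energy_pj[event] for event, count in event_counts.items())
```

With these placeholder numbers, ten reads and two writes cost 1300 pJ. A model that charged every access the same energy would hide the higher cost of writes, which use a full bit-line swing.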

Page 26: L1 Data Cache Decomposition for Energy Efficiency


SSC Energy Overhead

Small energy consumption required to use TOS and SRB

Registers updated at function call and return

Registers checked on a cache miss

Page 27: L1 Data Cache Decomposition for Energy Efficiency


Miss Rate

[Figure: miss rate (axis from 0% to 12%) versus cache size (4KB, 8KB, 16KB, 32KB, 64KB) for BLAST, BSOM, CRAFTY, GZIP, MCF, MP3D, MP3E, and TREE]

Page 28: L1 Data Cache Decomposition for Energy Efficiency


Overview