
1

Brian Zimmer, Pi-Feng Chiu, Borivoje Nikolić, and Krste Asanović

Dept. of Electrical Engineering and Computer Sciences, University of California, Berkeley

5th RISC-V Workshop, Nov. 29-30, 2016

Reprogrammable Redundancy for Cache Vmin Reduction in a 28nm RISC-V Processor

2

Motivation

•  Voltage scaling is effective in reducing energy consumption

•  SRAM limits minimum operating voltage (Vmin)

3

Approach

•  Instead of preventing errors, tolerate errors

•  Significant Vmin reduction possible by tolerating ~1000s of errors per MB

[Figure: bit error rate (BER) versus Vdd, showing the bitcell failure rate against the allowable failure rate. Circuit-level techniques prevent errors; architecture-level techniques tolerate errors. Vmin is marked for Prevent, Tolerate, and Tolerate+Prevent.]

4

Related work

•  Circuit-level (prevent)
    – Assist circuits (negative bitline, wordline underdrive, etc.)
    – Cell upsizing

•  Architecture-level (tolerate)
    – SECDED ECC
    – Fused column redundancy
    – Line disable

5

Chip Goals

•  Prove that SRAM Vmin can be effectively lowered by tolerating a reasonable number of failing bitcells

•  Intuition: target tail of distribution

[Figure: bitcell distribution with the failure region in the tail marked.]

6

RISC-V

•  Want real silicon evaluation

•  Rocket chip generator is an ideal platform to build the experiment on
    – Realistic SRAM usage/constraints
    – Software toolchain for testing

•  Modified caches to add reprogrammable redundancy, ECC, and BIST

7

Implemented Techniques

•  Three techniques:
    1.  Dynamic column redundancy (DCR): avoid single-bit errors in data SRAM
    2.  Line disable (LD): avoid >=2 bit errors in data SRAM
    3.  Bit bypass (BB): avoid all errors in tag SRAM

8

System Architecture

•  Target system: RISC-V “Rocket” scalar processor

•  16KB L1 and 1MB L2 cache protected by DCR, BB, and LD

•  ECC monitors SRAM to ensure all errors are avoided

[Figure: system block diagram with three supply domains. VDD_CORE (core + L1 cache): 5-stage pipeline, 8KB instruction cache and 16KB data cache (tags and data, each with BB, DCR, and line disable), BIST, and reprogrammable redundancy control. VDD_UNCORE (uncore): SCR, HTIF, MEMIO, chip IO, and three high-speed clock receivers. VDD_L2 (L2 cache): 1MB L2 cache with cache banks 0-3 (tags and data, with BB, DCR, and line disable) behind a crossbar network on TileLinkIO, plus BIST and reprogrammable redundancy control.]

9

Dynamic Column Redundancy (DCR)

•  Traditional column redundancy, but different mux address per row

•  Bits per RA is programmable

[Figure: DCR structure and example accesses. Each row stores a redundancy address (RA) that selects which column to skip in that row, so data is steered around the bad column.]

10

DCR Implementation

•  Redundancy address stored in tag array
    –  Lookup is part of tag access and does not impact critical path

•  Small timing overhead for shifting only

•  Most area overhead inside tag array

[Figure: DCR concept, implementation, and pipeline diagram. Concept: the per-row RA structure repeated from the previous slide. Implementation: the RA for each way (way0 to wayN) is read from the tag SRAM; on writes a DCR encoder steers Data_In around the bad column into the spare column of the data SRAM, and on reads a DCR decoder restores Data_Out. Pipeline diagram: data SRAM and DCR stages for accesses n and n+1.]

11

DCR Details

•  Single 2:1 mux on critical path

Example: 64 bit word (D[63:0]), avoid bit at index 62

•  Encoding: each column selects between {0,D[63:0]} and {D[63:0],0} using a thermometer code (110...0 to avoid column 62)

•  Decoding: same thermometer code, without MSB (1 less mux)

[Figure: encoder and decoder mux rows for D[63] ... D[0], with the bad column marked.]
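The shifting described above can be sketched in a few lines of Python. This is a minimal behavioral model, not the chip's RTL: the function names (`thermometer`, `dcr_encode`, `dcr_decode`) are hypothetical, and the exact bit alignment of the thermometer code is an assumption chosen so that the bad column is never read back. The value written into the bad column is a don't-care.

```python
WORD_BITS = 64  # data width from the slide's example, D[63:0]

def thermometer(bad_col, width):
    """Thermometer code: 1 for every position >= bad_col, else 0.

    bad_col=None means the row has no bad column, so nothing shifts.
    (Alignment is an assumption of this sketch.)
    """
    if bad_col is None:
        return [0] * width
    return [1 if i >= bad_col else 0 for i in range(width)]

def dcr_encode(data, bad_col):
    """Map WORD_BITS data bits onto WORD_BITS + 1 columns (one spare).

    data is a list of bits with index 0 = D[0]. Each column is a 2:1 mux
    between {0, D[63:0]} (unshifted) and {D[63:0], 0} (shifted by one).
    """
    code = thermometer(bad_col, WORD_BITS + 1)
    unshifted = data + [0]   # {0, D[63:0]}: column j sees D[j]
    shifted = [0] + data     # {D[63:0], 0}: column j sees D[j-1]
    return [shifted[j] if code[j] else unshifted[j]
            for j in range(WORD_BITS + 1)]

def dcr_decode(cols, bad_col):
    """Recover the data word; uses the same code without its MSB."""
    code = thermometer(bad_col, WORD_BITS)
    return [cols[i + 1] if code[i] else cols[i] for i in range(WORD_BITS)]

if __name__ == "__main__":
    word = [(i * 7 + 3) % 2 for i in range(WORD_BITS)]
    stored = dcr_encode(word, bad_col=62)
    stored[62] ^= 1  # emulate the failing bitcell in column 62
    print(dcr_decode(stored, bad_col=62) == word)
```

Because each output bit only chooses between two neighbouring inputs, the model is consistent with the slide's claim of a single 2:1 mux on the critical path.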

12

Line Disable (LD)

•  Multibit failures limit Vmin, but extending DCR to handle multibit faults is too expensive

•  Multibit failures are rare, so it is acceptable to pay a large cost to avoid them: disable an entire way

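[Figure: replacement way selection. The tags supply way0_enabled ... wayN_enabled. An LFSR proposes random ways (wayX, wayY) and a priority encoder proposes a fallback; muxes controlled by wayX_enabled and wayY_enabled choose the replacement way.]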
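A rough software model of the replacement logic is sketched below. It assumes the selection order is "first random candidate, then second random candidate, then priority encoder"; the slide does not spell out the order, so treat that as a guess. The 16-bit LFSR polynomial, the class name `Lfsr16`, and the function names are illustrative choices, not details taken from the chip.

```python
class Lfsr16:
    """16-bit Fibonacci LFSR (taps 16, 14, 13, 11); width is an assumption."""

    def __init__(self, seed=0xACE1):
        if seed & 0xFFFF == 0:
            raise ValueError("LFSR seed must be non-zero")
        self.state = seed & 0xFFFF

    def next(self):
        s = self.state
        bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
        self.state = (s >> 1) | (bit << 15)
        return self.state

def priority_encode(enabled):
    """Return the lowest-numbered enabled way, or None if all are disabled."""
    for way, is_enabled in enumerate(enabled):
        if is_enabled:
            return way
    return None

def pick_replacement_way(enabled, lfsr):
    """Choose a victim way that is not disabled.

    enabled[i] is the per-way enable bit read from the tags for this set.
    """
    n_ways = len(enabled)
    r = lfsr.next()
    way_x = r % n_ways          # first random candidate
    way_y = (r >> 8) % n_ways   # second random candidate
    if enabled[way_x]:
        return way_x
    if enabled[way_y]:
        return way_y
    return priority_encode(enabled)  # deterministic fallback

if __name__ == "__main__":
    lfsr = Lfsr16()
    enabled = [True, False, True, True, False, True, True, True]
    print([pick_replacement_way(enabled, lfsr) for _ in range(8)])
```

The fallback guarantees that a disabled way is never chosen, while the random candidates keep replacement close to random when few ways are disabled.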

13

Bit Bypass (BB)

•  Tag arrays cannot have faults

•  Use standard cell flip-flops to avoid failing bitcells

[Figure: bit bypass structure. A 16k SRAM (rows and columns 0 to 127) with bad bits marked, next to a repair array of n sets; each set holds a repair row, repair columns, and repair bits (23*n bit repair array).]

14

BB Implementation

•  Writes: If writing to an address with a bad bit, store the correct bit value

•  Reads: If reading from an address with a bad bit, replace output with the correct value

•  Repair array: n repair sets * m repair bits per set

•  L1 tags: Repair up to 7 addresses (n) with 2 failing bits each (m)

•  L2 tags: Repair up to 22 addresses (n) with 2 failing bits each (m)

[Figure: BB datapath around the SRAM. The address is compared (==) against each repair row to produce a write hit or read hit; on a write hit the bit of Data_In at the repair column is stored in the repair bit, and on a read hit the repair bit replaces the bit of Data_Out at the repair column.]
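The read and write rules above can be modelled as follows. This is a behavioral sketch only: the class `BitBypassArray`, its method names, and the stuck-at fault model are assumptions made for illustration. The defaults n = 7 and m = 2 are the L1 tag values from the slide.

```python
class BitBypassArray:
    """SRAM with stuck-at faults plus a small flip-flop repair array."""

    def __init__(self, rows, width, n_entries=7, m_bits=2, stuck_at=None):
        self.width = width
        self.sram = [0] * rows
        self.stuck_at = stuck_at or {}   # {(row, col): forced value}, fault model
        self.n_entries = n_entries       # n: repairable addresses
        self.m_bits = m_bits             # m: repairable bits per address
        self.repairs = {}                # row -> {col: stored correct bit}

    def program(self, row, cols):
        """Register the failing columns of one row (done after BIST)."""
        if len(cols) > self.m_bits:
            raise ValueError("too many failing bits in one row")
        if row not in self.repairs and len(self.repairs) >= self.n_entries:
            raise ValueError("repair array is full")
        # Repair bits hold valid data only after the next write to this row.
        self.repairs[row] = {c: 0 for c in cols}

    def raw_read(self, row):
        """Read the SRAM without bypass, with faulty cells forced."""
        value = self.sram[row]
        for (r, c), forced in self.stuck_at.items():
            if r == row:
                value = (value & ~(1 << c)) | (forced << c)
        return value

    def write(self, row, value):
        self.sram[row] = value
        # Write hit: also store the correct bit values in flip-flops.
        for c in self.repairs.get(row, {}):
            self.repairs[row][c] = (value >> c) & 1

    def read(self, row):
        value = self.raw_read(row)
        # Read hit: replace the bad bits with the stored correct values.
        for c, bit in self.repairs.get(row, {}).items():
            value = (value & ~(1 << c)) | (bit << c)
        return value
```

A typical use is to call `program` once per failing row after BIST, then use `read` and `write` in place of direct SRAM accesses.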

15

L1 Data Cache Pipeline

[Figure: L1 data cache pipeline. Tag SRAM lookup (valid and tag compare produce hit and waysel) and data SRAM. Redundancy overhead: bit bypass, RA and disabled bits, DCR encoder and decoder. ECC overhead (uncorrected): ECC encoders and decoders on the SRAMs, with recycle paths.]

•  Reprogrammable redundancy (RR) is easy to add, but ECC is difficult to add

•  ECC decoding is pipelined

•  If an error is detected, the operation is recycled

16

L2 Cache Pipeline

•  Adding latency by pipelining ECC isn’t a problem

•  Serial tag and data access

[Figure: L2 cache pipeline. L2 tags: tag SRAM with ECC encode/decode, bit bypass, valid and tag compare, and disabled check, producing waysel and RA, with paths to and from the TSHR. L2 data: data SRAM with DCR encoder/decoder and ECC encode/decode, with paths to and from the TSHR. Red marks redundancy overhead; blue marks ECC overhead.]

17

ECC logging

•  Queue ECC errors

•  Arbitrate from every SRAM array

•  Read through HTIF

[Figure: ECC logging path. ECC decoders on the SRAMs in the L2 cache pipeline feed queues that store syndrome + address + way. A 12-input arbiter in each of the 4 banks, followed by a final arbiter, forwards 32-bit ECC log entries (RA, Tag[7:0], Data) to the uncore SCR, which is read over HTIF.]

18

BIST Interface

•  SRAMs tested in parallel

SRAMs under test:

I$ Tags: 128x65

I$ Data Way0, Way1: 512x138 each

D$ Tags: 128x125

D$ Data Way0 to Way3: 512x146 each

[Figure: BIST interface. The BIST controller (Verilog) drives address, read/write, din, and expected dout to all SRAMs. Each SRAM's dout is compared (==) with the expected dout, and an error buffer entry {address, dout, current test} is selected (error buffer select) and read through the uncore SCR and HTIF. Latches enforce C-Q within half a period between clk and bistclk; bist enable selects the BIST path across stages 1 to 3.]

19

Comparison to prior art

Technique                  Maximum BER    Protects tags?   Area overhead: Tags         Area overhead: Data       Total cache overhead
Nominal                    1.1⨉10^-10     No               -                           -                         -
ECC† [4]                   1.3⨉10^-6      Yes              7% #                        7% #                      6.7%
Static Redundancy* [2]     4.4⨉10^-7      Yes              SC: 1%, Fuses: 0.12% #      SC: 1%, Fuses: 0.12% #    1.1%
Line Disable§ [3]          1.8⨉10^-5      No               4% #                        -                         0.2%
Extreme Redundancy**       2.4⨉10^-5      Yes              SC: 6.4%, Flops: 0.5% #     SC: 6.4%, Flops: 0.5% #   6.6%
Proposed (BB+DCR+LD)§      9.8⨉10^-5      Yes              BB: 15%, DCR: 6%, LD: 4%    DCR: 0.7%                 2.2%

§ Up to 1% disabled

** 1 column repair/KB and 1 row repair/32KB

† 1 repair/128 bit

* 10 repairs/MB

# Estimate (not reported), fuse = 100⨉ bitcell, flop = 10⨉ bitcell

Note: Data portion is 90% of cache area, tag portion is 6% of cache area

[2] Huang JSSC 2013 [3] Chang JSSC 2007 [4] Sawant ISSCC 2011

20

Area Overhead Breakdown


•  Data array (L2): 137 bits + 1 redundant column

•  Tag array (L2): 216 bits + 8 bits (LD) + 13 bits (DCR)

•  BB: 15% larger than tags for additional standard cells

•  Tag is 6% of L2, data is 90% of L2
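These figures can be combined into a rough total. The sketch below assumes that each overhead scales with the number of added bits per row and that the BB cost is 15% of the tag area; both are simplifications for illustration, and the function name `l2_area_overhead` is hypothetical.

```python
def l2_area_overhead(
    data_bits=137, data_spare_bits=1,          # data row: 137 bits + 1 redundant column
    tag_bits=216, tag_ld_bits=8, tag_dcr_bits=13,
    bb_fraction_of_tags=0.15,                   # BB standard cells, relative to tag area
    data_share=0.90, tag_share=0.06,            # shares of total L2 area
):
    """Return per-technique overheads and the total, as fractions of L2 area."""
    data_dcr = data_spare_bits / data_bits      # spare column in the data array
    tag_ld = tag_ld_bits / tag_bits             # line-disable bits stored in tags
    tag_dcr = tag_dcr_bits / tag_bits           # redundancy addresses stored in tags
    tag_total = tag_ld + tag_dcr + bb_fraction_of_tags
    total = data_share * data_dcr + tag_share * tag_total
    return {
        "data_dcr": data_dcr,
        "tag_ld": tag_ld,
        "tag_dcr": tag_dcr,
        "tag_total": tag_total,
        "total": total,
    }

if __name__ == "__main__":
    for name, value in l2_area_overhead().items():
        print(f"{name}: {100 * value:.2f}%")
```

Under these assumptions the per-array terms come out near 0.7% (data DCR), 4% (LD in tags), and 6% (DCR in tags), and the total lands at roughly 2% of L2 area.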

21

Optimal BER

•  Techniques are intentionally optimized to tolerate BER of 1×10^-4

•  In 1MB cache, ≈1000 errors

[Figure: probability a word has a failing bit versus probability a bit fails (Pbit, 10^-10 to 10^-2), with curves for L1 and L2 and Vdiff ≈ 50mV marked. Vdiff legend: 28nm FDSOI: -19 mV; 32nm Intel: -72 mV; 32nm PTM: -42 mV; References: -14 mV.]
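The ≈1000-error estimate and the shape of the word-failure curve follow from basic probability. The sketch below assumes bit failures are independent, which the slide does not state; the 137-bit word width is the L2 data row width quoted on the area slide and is used here only as an example. Function names are hypothetical.

```python
def expected_failing_bits(capacity_bytes, p_bit):
    """Expected number of failing bitcells in an array of the given size."""
    return capacity_bytes * 8 * p_bit

def p_word_has_failure(word_bits, p_bit):
    """Probability that at least one bit in a word fails."""
    return 1.0 - (1.0 - p_bit) ** word_bits

def p_word_has_multibit_failure(word_bits, p_bit):
    """Probability that two or more bits in a word fail (the LD case)."""
    p_none = (1.0 - p_bit) ** word_bits
    p_one = word_bits * p_bit * (1.0 - p_bit) ** (word_bits - 1)
    return 1.0 - p_none - p_one

if __name__ == "__main__":
    p_bit = 1e-4
    print(expected_failing_bits(1 << 20, p_bit))      # 1MB cache
    print(p_word_has_failure(137, p_bit))
    print(p_word_has_multibit_failure(137, p_bit))
```

At a BER of 1×10^-4 a 1MB array is expected to hold on the order of 800 to 900 failing bits, single-bit failures affect on the order of 1% of words, and multibit failures are roughly two orders of magnitude rarer, which is why a costly fallback such as line disable is acceptable for them.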

22

Fabricated Prototype

•  TSMC 28nm HPM

•  2.7mm x 1.8mm, wire-bonded

•  RISC-V core + 1MB L2 Cache

[Figure: annotated chip image, 2.7mm x 1.8mm, showing the 1MB L2 cache (banks 0 to 3, tags and data), the uncore, and Core+L1 (pipeline, I$, D$), with VDD/clock boundaries marked.]

23

Design Flow

•  TSMC provided standard cells and SRAM compiler

•  Chisel + Verilog wrappers for BIST, multi-clock support

•  Synthesis: Synopsys Design Compiler

•  Place-and-Route: Synopsys IC Compiler

•  Timing: Synopsys PrimeTime

•  Simulation: Synopsys VCS

24

Measured Shmoo plot

•  Processor core and L1 operate up to 1GHz and 30mW

[Figure: measured shmoo plots of frequency (MHz) versus Vdd (a.u., 0.45 to 0.85) with PASS and FAIL regions. L1 cache + core: 20 MHz to 1000 MHz, 0.86mW to 34mW. 1MB L2 cache: 20 MHz to 650 MHz, 0.61mW to 21mW.]

25

BIST Programming Algorithm

•  Standard BIST identifies fault locations

[Flowchart]

1.  Set target Vmin and frequency
2.  Run BIST tests
3.  On each error:
    – Error in tag array: add BB entry
    – Error in data array, first error in the set: add DCR entry
    – Error in data array, already an error in the set: add LD entry
4.  Continue tests until tests are done
5.  Program BB, program DCR, and program LD
6.  Start processor

Ideal BIST test time: 5 (number of March tests) ⨉ 4096 (deepest array) ⨉ 20 (avg. test complexity) ⨉ 50ns (cycle period at Vmin) ≈ 20ms
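The decision flow on this slide can be written as a short routine. This is a sketch under stated assumptions: `Fault`, `build_repair_plan`, and `ideal_bist_time_s` are hypothetical names, `group` stands for whatever unit shares one redundancy address (called a "set" on the slide), repeated reports of the same failing bit are treated as one fault, and the BB capacity defaults are the L1 tag values (7 addresses, 2 bits each).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fault:
    array: str    # "tag" or "data"
    group: int    # row (tags) or set sharing one redundancy address (data)
    column: int   # failing bit position

def build_repair_plan(faults, bb_entries=7, bb_bits=2):
    """Turn BIST fault reports into BB, DCR, and LD programming tables."""
    bb = {}      # tag row -> set of failing columns
    dcr = {}     # data group -> column to steer around
    ld = set()   # data groups that must be disabled
    for f in faults:
        if f.array == "tag":
            bb.setdefault(f.group, set()).add(f.column)
            if len(bb) > bb_entries or len(bb[f.group]) > bb_bits:
                raise RuntimeError("tag faults exceed bit-bypass capacity")
        elif f.array == "data":
            if f.group not in dcr:
                dcr[f.group] = f.column      # first error in the set
            elif dcr[f.group] != f.column:
                ld.add(f.group)              # already an error in the set
        else:
            raise ValueError(f"unknown array {f.array!r}")
    return bb, dcr, ld

def ideal_bist_time_s(num_tests=5, depth=4096, avg_ops=20, cycle_s=50e-9):
    """Ideal test time: tests x deepest array x ops per address x cycle."""
    return num_tests * depth * avg_ops * cycle_s

if __name__ == "__main__":
    reports = [Fault("tag", 3, 10), Fault("data", 7, 62),
               Fault("data", 8, 1), Fault("data", 8, 5)]
    print(build_repair_plan(reports))
    print(ideal_bist_time_s())
```

With the slide's numbers the test-time product evaluates to 20.48 ms, which the slide rounds to 20ms.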

26

Measured SRAM Error Rate

•  Suite of March tests to identify errors

•  As expected, high density bitcell in L2 has higher Vmin than high speed bitcell in L1
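As an illustration of how a March test locates failing bits, the sketch below runs March C- over a simulated memory with stuck-at faults. March C- is used only as a well-known example; the slides do not say which March tests were run. `FaultyMemory` and `run_march` are hypothetical names, and the error tuple mirrors the {address, dout, current test} entry from the BIST slide.

```python
class FaultyMemory:
    """Word-addressed memory in which some cells are stuck at 0 or 1."""

    def __init__(self, depth, width, stuck_at=None):
        self.depth, self.width = depth, width
        self.cells = [0] * depth
        self.stuck_at = stuck_at or {}   # {(addr, bit): forced value}

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        value = self.cells[addr]
        for (a, bit), forced in self.stuck_at.items():
            if a == addr:
                value = (value & ~(1 << bit)) | (forced << bit)
        return value

# March C-: (w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); (r0)
MARCH_C_MINUS = [
    ("up", [("w", 0)]),
    ("up", [("r", 0), ("w", 1)]),
    ("up", [("r", 1), ("w", 0)]),
    ("down", [("r", 0), ("w", 1)]),
    ("down", [("r", 1), ("w", 0)]),
    ("up", [("r", 0)]),
]

def run_march(mem, elements=MARCH_C_MINUS):
    """Return (error log, set of failing (addr, bit) locations)."""
    all_ones = (1 << mem.width) - 1
    errors, bad_bits = [], set()
    for step, (direction, ops) in enumerate(elements):
        addrs = range(mem.depth)
        if direction == "down":
            addrs = range(mem.depth - 1, -1, -1)
        for addr in addrs:
            for op, bit in ops:
                pattern = all_ones if bit else 0
                if op == "w":
                    mem.write(addr, pattern)
                    continue
                dout = mem.read(addr)
                if dout != pattern:
                    errors.append((addr, dout, step))
                    diff = dout ^ pattern
                    bad_bits.update((addr, b) for b in range(mem.width)
                                    if (diff >> b) & 1)
    return errors, bad_bits

if __name__ == "__main__":
    mem = FaultyMemory(16, 8, {(3, 2): 1, (10, 7): 0})
    print(sorted(run_march(mem)[1]))
```

The set of failing locations returned by such a test is what a programming step would consume when deciding between BB, DCR, and LD entries.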

[Figure: measured bit error rate (BER, 10^-8 to 10^-2) versus Vdd (a.u., 0.3 to 0.6) for L1 data, L1 tags, L2 data, and L2 tags. The maximum BER levels of [2], [3], [4], and the proposed techniques are marked, along with the resulting L1 and L2 Vmin reduction.]

[2] Huang JSSC 2013 [3] Chang JSSC 2007 [4] Sawant ISSCC 2011

27

Measured Vmin Reduction

•  Proposed techniques achieve 25% average Vmin reduction in L2 for 2% area overhead

[Figure: measured Vmin reduction.]

28

Summary

•  Easier than assist techniques to adopt
    – Predictable Vmin reduction
    – Comparable effectiveness

•  Requires power-up BIST or non-volatile memory

29

Conclusion

•  DCR, LD, and BB are programmed by BIST to avoid failing bitcells

•  Significant Vmin reduction for low overhead and no circuit changes
    – Can use in addition to assist techniques

•  28nm prototype processor shows
    –  25% Vmin reduction
    –  49% power reduction
    –  2% area overhead

30

Acknowledgements

•  Henry Cook

•  Yunsup Lee

•  Andrew Waterman

•  James Dunn

•  Brian Richards

•  Stephen Twigg

•  Scott Liao

•  Jonathan Chang

•  Work funded in part by BWRC, ASPIRE, DARPA PERFECT Award Number HR0011-12-2-0016, Intel ARO, and a fabrication donation by TSMC.
