Reprogrammable Redundancy for Cache Vmin Reduction in a ...€¦ · 5th RISC-V Workshop, Nov....
Transcript of Reprogrammable Redundancy for Cache Vmin Reduction in a ...€¦ · 5th RISC-V Workshop, Nov....
1
Brian Zimmer, Pi-Feng Chiu, Borivoje Nikolić, and Krste Asanović
Dept. of Electrical Engineering and Computer Sciences, University of California, Berkeley
5th RISC-V Workshop, Nov. 29-30, 2016
Reprogrammable Redundancy for Cache Vmin Reduction in a
28nm RISC-V Processor
2
Motivation
• Voltage scaling is effective in reducing energy consumption
• SRAM limits minimum operating voltage (Vmin)
3
Approach
• Instead of preventing errors, tolerate errors • Significant Vmin reduction possible by
tolerating ~1000s of errors per MB
10.4 0.5 0.6 0.7 0.8 0.9
100
10-12
10-10
10-8
10-6
10-4
10-2
Vdd
Bit E
rror R
ate
(BER
)
Vmin
Preventerrors (circuit-level)
Tolerate errors (architecture-level)
Allowable failure rate
Bitcell failure rate
Tolerate+Prevent
Prevent
Tolerate
4
Related work
• Circuit-level (prevent) – Assist circuits (negative bitline,
wordline underdrive, etc) – Cell upsizing
• Architecture-level (tolerate) – SECDED ECC – Fused column redundancy – Line disable
5
Chip Goals
• Prove that SRAM Vmin can be effectively lowered by tolerating a reasonable number of failing bitcell
• Intuition: target tail of distribution
Failure region
6
RISC-V
• Want real silicon evaluation • Rocket chip generator perfect
platform to build experiment on – Realistic SRAM usage/constraints – Software toolchain for testing
• Modified caches to add reprogrammable redundancy, ECC, and BIST
7
Implemented Techniques
• Three techniques: 1. Dynamic column redundancy (DCR):
avoid single-bit errors in data SRAM 2. Line disable (LD): avoid >=2 bit errors
in data SRAM 3. Bit bypass (BB): avoid all errors in tag
SRAM
8
System Architecture
• Target system: RISC-V “Rocket” scalar processor
• 16KB L1 and 1MB L2 cache protected by DCR, BB, and LD
• ECC monitors SRAM to ensure all errors are avoided
VDD_CORE (Core + L1 Cache)5-stage pipeline
SCR
8KB Instr. cacheTags Data
DCRBBDisable
16KB Data cacheTags Data
16K
DCRBBDisable
BIST
VDD_UNCORE(Uncore)
VDD_L2 (L2 Cache)TileLinkIO
Cache Bank [0-3]
Crossbar Network
Tags Data
DCRBBDisable
High-speedclock receivers
(3)
To S
CR
L1 cache
1MB L2 Cache
Reprog.Redund.Control
BIST
Reprog.Redund.Control
Chip IO
HTIFTo
SCR
To MEMIO
MEMIO
9
Dynamic Column Redundancy (DCR)
• Traditional column redundancy, but different mux address per row
• Bits per RA is programmable
4 3 2 1 04 bad 2 1 04 3 2 1 bad
4 3 2 1 0
11100100
Row
RA
Row RA x11100100
0000 3 2 1 04 2 1 01000
01
4 3 2 111113 2 1 00000
Example Accesses:Structure:
010101
10
DCR Implementation
• Redundancy address stored in tag array – Lookup is part of tag access and does not
impact critical path • Small timing overhead for shifting only • Most area overhead inside tag array
Concept
Implementation
4 3 2 1 04 bad 2 1 04 3 2 1 bad
4 3 2 1 0
11100100
Row
RA
Row RA x11100100
0000 3 2 1 04 2 1 01000
01
4 3 2 111113 2 1 00000
Example Accesses:Structure:
010101
DCR Encoder
Tag SRAMRAwayN way0...
...
DCR Decoder
...
......
bad...
...
...
Spare columnbad
(Writes) (Reads)
DataSRAM
TagSRAM
Data
_In
==
DCR
TagSRAM
Data
_Out
==
RA
RA
Data SRAMRAwayN way0...
Pipeline Diagram
DataSRAM DC
R
n+1
n+1
n
n
11
DCR Details
• Single 2:1 mux on critical path
Encoding
Decoding
...
D[63] D[62] D[62] ... D[0]
{0,D[63:0]}{D[63:0],0}
D[63] 0D[62] D[61]0 D[63] D[62] D[0]
110...0 1 0 1 0 1 0 1 0
Avoid column 62:Thermometer code
bad
...1 0 1 0 1 0_10...0
Same thermometercode, without MSB (1 less mux)
Example: 64 bit word (D[63:0]), avoid bit at index 62
D[63] D[62] ... D[0]
12
Line Disable (LD)
• Multibit failures limit Vmin, but extending DCR to handle multibit faults is too expensive
• Multibit failures are rare, can pay large cost to avoid -> disable an entire way
Tags
LFSR
...way0_enabled wayN_enabled
replacement way
Priorityencoder
Random:wayXwayY
wayY
wayY_enabled
wayX
wayX_enabled 1
1
0
0
13
Bit Bypass (BB)
• Tag arrays cannot have faults • Use standard cell flip-flops to avoid failing
bitcells
bad bad
bad
...
127126
01
RowCol
......
...
127 126 1 0
1127 1 A B1260 -
---- - -
-C
--
--
(16k SRAM)n
sets
Repairrow
Repaircolumns
Repairbits
(23*n bit repair array)
14
BB Implementation
SRAM
Writes: If writing to an addresswith a bad bit, store the correctbit value
Matching repair entryRepair column
n repair sets * m repair bits per set
Data
_In
Repair Row
==Address
Read hit
Repair Row
==Address
Write hit
Repair bit
1n
Reads: If reading from anaddress with a bad bit, replaceoutput with the correct value
Matching repair entryRepair column
...
...
Data
_Out
L1 tags: Repair up to 7 addresses (n) with 2 failing bits each (m)L2 tags: Repair up to 22 addresses (n) with 2 failing bits each (m)
15
L1 Data Cache Pipeline
Tag SRAM
Data SRAM
hit waysel
Valid?
==?
DC
R E
nc.
DC
R D
ec.
Bit b
ypas
s
Redundancy overhead
RA
Disabled?
ECC
Enc
ECC
Enc
Rec
ycle
Rec
ycle
ECC
D
ecEC
C
Dec
ECC overhead (uncorrected)
ECC
ECC
ECC
L1 Data Cache
• RR easy to add, but ECC difficult to add • ECC decoding is pipelined • If error detected, operation is recycled
16
L2 Cache Pipeline • Adding
latency by pipelining ECC isn’t a problem
• Serial tag and data access
Tag SRAM
Data SRAM
Valid?
==?
DC
R E
nc.
DC
R D
ec.
Bit b
ypas
s
Red: Redundancy overhead
RA
Disabled?
ECC
Enc
ECC
Enc
ECC
Dec
ECC
Dec
Blue: ECC overhead
ECC
ECC
L2 Tags
ECC
waysel
RA
L2 Data
To TSHR...From TSHR...
From TSHR... To TSHR...
data
17
ECC logging
• Queue ECC errors
• Arbitrate from every SRAM array
• Read through HTIF Arbiter (12 inputs)Arbiter (12 inputs)Arbiter (12 inputs)
RA Tag[7:0] Data
Arbiter (12 inputs)
ECC Log Entry (32 bits)
Uncore SCR
HTIF
ECC Decoding
SRAMS
Queues(stores syndrome+address+way)
ECC ECC ECC
L2 Cache
Pipeline
(Address: log[20:18],log[3:0]x=bank)
7:0,x
Arbiter
4 banks
...
0,4+x 0,8+x
18
BIST Interface
• SRAMs tested in parallel
I$ Tags
128x65
I$ Data Way0
512x138
I$ Data Way1
512x138
D$ Tags
128x125
D$ Data Way0
512x146
D$ Data Way1
512x146
D$ Data Way2
512x146
D$ Data Way3
512x146
addr
ess
read
/writ
eex
pect
ed d
in
din
== == == == == == == ==
error buffer entry: {address, dout, current test}
error buffer select
BIST controller (Verilog)
Uncore SCR
HTIF
Latches enforceC-Q withinhalf period)
clkbistclk
(Stage 3)
(Stage 2)
(Stage 1)
bist enable
expected doutdout
19
Comparison to prior art
6.6%SC: 6.4%Flops: 0.5%
Extreme Redundancy** Yes2.4⨉10-5 SC: 6.4%
Flops: 0.5%
6.7%
Total Cache
Overhead
-
0.2%
1.1%
2.2%
Data
-
DCR: 0.7%
7%SC: 1%
Fuses: 0.12%
-4%
-
SC: 1%Fuses: 0.12%
Tags
BB: 15%DCR: 6%LD: 4%
7%Yes
No
Protects tags?
No
Yes
Yes
ECC† [4] 1.3⨉10-6
Nominal 1.1⨉10-10
Static Redundancy* [2] 4.4⨉10-7
Line Disable § [3] 1.8⨉10-5
9.8⨉10-5Proposed (BB +DCR+LD) §
Maximum BER Technique
§ Up to 1% disabled **1 column repair/KB and 1 row repair/32KB †1 repair/128 bit *10 repairs/MB # Estimate (not reported), fuse=100⨉ bitcell, flop=10⨉ bitcell Note: Data portion is 90% of cache area, tag portion is 6% of cache area
Area Overhead
#
# #
##
# #
[2] Huang JSSC 2013 [3] Chang JSSC 2007 [4] Sawant ISSCC 2011
20
Area Overhead Breakdown
6.6%SC: 6.4%Flops: 0.5%
Extreme Redundancy** Yes2.4⨉10-5 SC: 6.4%
Flops: 0.5%
6.7%
Total Cache
Overhead
-
0.2%
1.1%
2.2%
Data
-
DCR: 0.7%
7%SC: 1%
Fuses: 0.12%
-4%
-
SC: 1%Fuses: 0.12%
Tags
BB: 15%DCR: 6%LD: 4%
7%Yes
No
Protects tags?
No
Yes
Yes
ECC† [4] 1.3⨉10-6
Nominal 1.1⨉10-10
Static Redundancy* [2] 4.4⨉10-7
Line Disable § [3] 1.8⨉10-5
9.8⨉10-5Proposed (BB +DCR+LD) §
Maximum BER Technique
§ Up to 1% disabled **1 column repair/KB and 1 row repair/32KB †1 repair/128 bit *10 repairs/MB # Estimate (not reported), fuse=100⨉ bitcell, flop=10⨉ bitcell Note: Data portion is 90% of cache area, tag portion is 6% of cache area
Area Overhead
#
# #
##
# #• Data array (L2): 137 bits + 1 redundant column • Tag array (L2): 216 bits + 8 bits (LD) + 13 (DCR) • BB: 15% larger than tags for additional
standard cells • Tag is 6% of L2, data is 90% of L2
21
Optimal BER
• Techniques are intentionally optimized to tolerate BER of 1×10-4
• In 1MB cache, ≈1000 errors
10�10 10�9 10�8 10�7 10�6 10�5 10�4 10�3 10�2
Probability a bit fails (Pbit)
0.0
0.2
0.4
0.6
0.8
1.0P
roba
bilit
ya
wor
dha
sa
faili
ngbi
t
Vdiff
28nm FDSOI Vdiff: -19 mV
32nm Intel Vdiff: -72 mV
32nm PTM Vdiff: -42 mV
References Vdiff: -14 mV
L1L2
Vdiff ≈ 50mV
22
Fabricated Prototype
• TSMC 28nm HPM • 2.7mm x 1.8mm, wire-bonded • RISC-V core + 1MB L2 Cache
1.8mm
2.7mm
1MB L2 Cache
Uncore
Bank 0
Bank 1
Bank 2
Bank 3
TagsData
D$
I$Pipeline
VDD/ClockBoundaries
Core+L1
23
Design Flow • TSMC provided standard cells and SRAM
compiler • Chisel + Verilog wrappers for BIST, multi-
clock support • Synthesis: Synopsys Design Compiler • Place-and-Route: Synopsys IC Compiler • Timing: Synopsys Primetime • Simulation: Synopsys VCS
24
Measured Shmoo plot
• Processor core and L1 operate up to 1GHz and 30mW
0.46 0.48 0.5 0.52 0.54 0.56 0.58 0.6 0.62 0.64 0.66 0.68 0.7 0.72 0.74 0.76 0.78 0.8 0.82 0.84 0.86 0.88 0.9
20.0
70.0
120.0
170.0
220.0
270.0
320.0
370.0
420.0
470.0
520.0
570.0
620.0
670.0
L1 Cache + Core 1MB L2 Cache
Freq
uenc
y (M
Hz)
Freq
uenc
y (M
Hz)
20
500
1000Vdd (a.u.) Vdd (a.u.)
0.45 0.850.65
20
300
650 0.45 0.850.65
0.86mW to 34mW 0.61mW to 21mW
PASS PASS
FAILFAIL
25
BIST Programming Algorithm
• Standard BIST identifies fault locations
Set target Vmin and Freq.
Run BIST tests
Add BB entry
Add DCR entry
Add LD entry
Program BB
Error intag array
Error indata array
First errorin the set
Program DCR
Program LD
Start processor
Already errorin the set
Continue tests
Testsdone
Ideal BIST test time:5 (number of March tests)⨉ 4096 (deepest array)⨉ 20 (avg. test complexity)⨉ 50ns (cycle period at Vmin)= 20ms
26
Measured SRAM Error Rate
• Suite of March tests to identify errors
• As expected, high density bitcell in L2 has higher Vmin than high speed bitcell in L1
0.3 0.35 0.4 0.45 0.5 0.55 0.6Vdd (a.u.)
10�8
10�7
10�6
10�5
10�4
10�3
10�2
Bit
Err
orR
ate
(BE
R)
L1 DataL1 TagsL2 DataL2 Tags
[3]
[4][2]
(Proposed)
L1 Vmin Reduction
L2 Vmin Reduction
[2] Huang JSSC 2013 [3] Chang JSSC 2007 [4] Sawant ISSCC 2011
27
Measured Vmin Reduction
• Proposed techniques achieve 25% average Vmin reduction in L2 for 2% area overhead
Vminreduction
28
Summary
• Easier than assist techniques to adopt – Predictable Vmin reduction – Comparable effectiveness
• Requires power-up BIST or non-volatile memory
29
Conclusion
• DCR, LD, and BB are programmed by BIST to avoid failing bitcells
• Significant Vmin reduction for low overhead and no circuit changes – Can use in addition to assist techniques
• 28nm prototype processor shows – 25% Vmin reduction, – 49% power reduction – 2% area overhead
30
Acknowledgements
• Henry Cook • Yunsup Lee • Andrew Waterman • James Dunn • Brian Richards • Stephen Twigg • Scott Liao • Jonathan Chang • Work funded in part by BWRC, ASPIRE,
DARPA PERFECT Award Number HR0011-12-2-0016, Intel ARO, and a fabrication donation by TSMC.