Amoeba-Cache Adaptive Blocks for Eliminating Waste in the Memory Hierarchy
Snehasish Kumar, Arrvindh Shriraman, Eric Matthews, Lesley Shannon
Hongzhou Zhao, Sandhya Dwarkadas
Amoeba Cache : Adaptive blocks for Eliminating Waste in the Memory Hierarchy
On-chip Storage
Fixed granularity cache
Tag Array Data Array
Cache data utilization
Legend: Tags, Data, Untouched Data
Tag Array Data Array
Utilization = Fraction of words touched in cache block at the time of eviction
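The definition above can be made concrete with a minimal sketch. It assumes a 64-byte block of eight 8-byte words and a per-block bitmap with one bit per word; the names `words_touched` and `block_utilization` are hypothetical and only illustrate the metric.

```c
#include <stdint.h>

#define WORDS_PER_BLOCK 8  /* assumption: 64-byte block, 8-byte words */

/* Count the words whose bit is set in the per-block touched bitmap. */
int words_touched(uint8_t touched)
{
    int count = 0;
    for (int w = 0; w < WORDS_PER_BLOCK; w++)
        count += (touched >> w) & 1;
    return count;
}

/* Utilization of one block at eviction time, as a fraction in [0, 1]. */
double block_utilization(uint8_t touched)
{
    return (double)words_touched(touched) / WORDS_PER_BLOCK;
}
```

A block evicted with only four of its eight words touched has a utilization of 0.5 under this definition.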
Cache utilization (64K L1, 4 ways, 64B/block)
[Figure: bar chart of cache utilization, 0% to 100%, for apache, canneal, eclipse, firefox, h2, jbb, lbm, mcf, tpcc and x264]
Block Distribution by # Words Touched (64K, 64B/block)
[Figure: pie charts for Apache, Eclipse, Firefox and Canneal; segments are blocks with 1-2, 3-4, 5-6 and 7-8 words touched. Segment labels: 55%, 13%, 6%, 26%; 18%, 5%, 4%, 73%; 40%, 26%, 9%, 25%; 75%, 14%, 6%, 5%]
Block Distribution by # Words Touched: Canneal, 64K vs. 1M (64B/block)
[Figure: two pie charts for Canneal; segments are blocks with 1-2, 3-4, 5-6 and 7-8 words touched. Segment labels: 58%, 20%, 12%, 10%; 75%, 14%, 6%, 5%]
Factors affecting cache utilization
• Application-specific behaviour: inefficient data structure access patterns
• Interaction with cache geometry: way conflicts reduce block lifetime and cause poor utilization
Application Specific Behaviour
struct TIE {
    long long X, Y, Z;
    long long V, H;
    long long data[3];
} Imperial[1024];

Data Array block layout: X Y Z V H Data[3]

Access in a loop:
for (int i = 0; i < 1024; i++) {
    Imperial[i].X = …;
    Imperial[i].Y = …;
    Imperial[i].Z = …;
    Imperial[i].V = …;
}
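As a rough worked example of the loop above, the sketch below derives which words of one `struct TIE` the loop writes and the resulting utilization. It assumes 8-byte `long long` values and that each array element occupies exactly one 64-byte block; the helper names are hypothetical.

```c
#include <stddef.h>

#define WORD_BYTES 8  /* assumption: 8-byte words, matching long long */

struct TIE {
    long long X, Y, Z;
    long long V, H;
    long long data[3];
};

/* Bitmap (bit w = word w) of the words the loop writes: X, Y, Z and V. */
unsigned tie_loop_touched(void)
{
    unsigned touched = 0;
    touched |= 1u << (offsetof(struct TIE, X) / WORD_BYTES);
    touched |= 1u << (offsetof(struct TIE, Y) / WORD_BYTES);
    touched |= 1u << (offsetof(struct TIE, Z) / WORD_BYTES);
    touched |= 1u << (offsetof(struct TIE, V) / WORD_BYTES);
    return touched;
}

/* Fraction of the struct's words that the loop touches. */
double tie_loop_utilization(void)
{
    int total = (int)(sizeof(struct TIE) / WORD_BYTES);
    unsigned touched = tie_loop_touched();
    int used = 0;
    for (int w = 0; w < total; w++)
        used += (touched >> w) & 1;
    return (double)used / total;
}
```

Under these assumptions the loop touches 4 of 8 words, so half of every fetched 64-byte block (H and data[3]) is never used.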
Cache Geometry
Data Array – 4 ways
Problem : Lots of data map to same set
[Figure: five blocks, numbered 1 to 5, competing for the same 4-way set]
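The conflict problem can be illustrated with a small sketch. It assumes the 64K, 4-way, 64B/block geometry quoted earlier in the deck and a conventional modulo set index; the deck does not state the indexing function, so that part is an assumption.

```c
#include <stdint.h>

/* Assumed geometry: 64K cache, 4 ways, 64B blocks. */
#define CACHE_BYTES 65536u
#define NUM_WAYS    4u
#define BLOCK_BYTES 64u
#define NUM_SETS    (CACHE_BYTES / (NUM_WAYS * BLOCK_BYTES))  /* 256 sets */

/* Set selected by a modulo index on the block address (assumption). */
unsigned set_index(uint64_t addr)
{
    return (unsigned)((addr / BLOCK_BYTES) % NUM_SETS);
}

/* Count how many of n strided addresses land in the same set as base. */
unsigned count_same_set(uint64_t base, uint64_t stride, unsigned n)
{
    unsigned same = 0;
    for (unsigned i = 0; i < n; i++)
        if (set_index(base + i * stride) == set_index(base))
            same++;
    return same;
}
```

With a 16 KB stride, five consecutive accesses all map to one set, so a 4-way set must evict a block before it has been fully used.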
Implications
1. Shrinks effective cache space
2. Increases miss rate
3. Wastes on-chip bandwidth
4. Increases on-chip cache energy consumption
Target Metrics (Amoeba Cache)
• Miss Rate
• Space Utilisation
• Bandwidth
Variable Granularity Blocks
Tag Array Data Array
How to support variable # of blocks / set ?
How to support variable granularity for each block?
Our Approach : Amoeba Cache
Unified SRAM Array
Amoeba Cache
• Insert
• Lookup
• Partial Miss
• Overheads
SRAM Array
Tag (1 word): Region Tag | Start | End
Data Block (1+ words)
Bitmaps: Valid? and Tag?, initially all zeros
Tag - Regions
Memory is divided into regions of RMAX bytes
64-bit address: Region Tag | Set Index | Byte
Start / End: 3 bits (figure label: Top 3)
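One way to picture the address split is the sketch below. The slide does not give concrete sizes, so RMAX = 64 bytes, 8-byte words and 256 sets are assumptions chosen for illustration, and `split_address` is a hypothetical helper rather than the hardware's actual decoder.

```c
#include <stdint.h>

#define WORD_BYTES 8u    /* assumption: 8-byte words */
#define RMAX_BYTES 64u   /* assumption: region size RMAX = 64 bytes */
#define NUM_SETS   256u  /* assumption: illustrative set count */

struct amoeba_addr {
    uint64_t region_tag;  /* identifies the RMAX-byte region */
    unsigned set_index;   /* set that holds blocks of this region */
    unsigned word;        /* word offset inside the region, 0..7 */
};

/* Split an address into region tag, set index and word offset. */
struct amoeba_addr split_address(uint64_t addr)
{
    struct amoeba_addr a;
    uint64_t region = addr / RMAX_BYTES;
    a.word = (unsigned)((addr % RMAX_BYTES) / WORD_BYTES);
    a.set_index = (unsigned)(region % NUM_SETS);
    a.region_tag = region / NUM_SETS;
    return a;
}
```

With eight words per region, a word offset fits in 3 bits, which is consistent with 3-bit Start and End fields.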
Example
struct TIE {
    long long X, Y, Z;
    long long V, H;
    long long data[3];
} Imperial;

Imperial.X = …;
Miss
Invoke Spatial Granularity Predictor (PC/Region based)
Fetch: Tag X Y Z V
Amoeba Cache – Insert (8 words/set)
SRAM Array / Set: Valid? 00000000, Tag? 00000000
Miss: insert 4+1 words (Tag X Y Z V)
1. Find a free run of 5 words in the Valid? bitmap (00000 substring()): Pos: 0
Amoeba Cache – Insert (8 words/set)
SRAM Array / Set: Valid? 00000000 becomes 11111000, Tag? 00000000 becomes 10000000
Steps 2 and 3: update the bitmaps and refill the block
Refill: Tag X Y Z V
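The insert steps can be sketched in software as follows. This is a minimal model, assuming an 8-word set and bitmaps stored with bit i representing word i (so the slide's 11111000, printed with word 0 on the left, corresponds to 0x1F here); `find_free_run` and `insert_block` are hypothetical names, and the real design performs the search in hardware.

```c
#include <stdint.h>

#define SET_WORDS 8  /* 8 words per set, as in the slide's example */

struct insert_set {
    uint8_t valid;  /* bit i set: word i of the set is occupied */
    uint8_t tag;    /* bit i set: word i holds a tag (start of a block) */
};

/* Find the lowest position with len consecutive free words, or -1. */
int find_free_run(const struct insert_set *s, int len)
{
    if (len <= 0 || len > SET_WORDS)
        return -1;
    unsigned mask = (1u << len) - 1u;
    for (int pos = 0; pos + len <= SET_WORDS; pos++)
        if (((unsigned)s->valid & (mask << pos)) == 0)
            return pos;
    return -1;
}

/* Insert a block of data_words plus one tag word; returns its position or -1. */
int insert_block(struct insert_set *s, int data_words)
{
    int len = data_words + 1;
    int pos = find_free_run(s, len);
    if (pos < 0)
        return -1;  /* no room: the caller would have to evict first */
    unsigned mask = ((1u << len) - 1u) << pos;
    s->valid = (uint8_t)(s->valid | mask);       /* mark tag + data words */
    s->tag = (uint8_t)(s->tag | (1u << pos));    /* mark the tag word */
    return pos;
}
```

Inserting 4+1 words into an empty set lands at position 0, matching the slide's example.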
Example
struct TIE {
    long long X, Y, Z;
    long long V, H;
    long long data[3];
} Imperial;

Imperial.Y = …;
Lookup data from the cache
Cached block: Tag X Y Z V
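Amoeba Cache – Lookup (8 words/set)
Address fields: Region Tag | Set Index | Word (W)
SRAM Array / Set: Tag X Y Z V, Tag? bitmap 10000000
1. Tag?: select the tag words with the Tag? bitmap (2x1 multiplexers)
2. Addr ∈ Tag: Region == Region Tag, Start ≤ W, End > W, giving Hit?
3. Word Selector: copy the requested word to the Output Buffer
[Figure marks the critical path]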
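The lookup condition can be written out as a short sketch. It assumes an 8-word set and a decoded view of each tag word (`struct tag_word`, a hypothetical layout); hardware checks all tag positions in parallel, whereas this model simply loops over them.

```c
#include <stdint.h>

#define SET_WORDS 8

/* Hypothetical decoded tag word: region tag plus the range [start, end). */
struct tag_word {
    uint64_t region_tag;
    unsigned start;  /* first word of the region held by the block */
    unsigned end;    /* one past the last word held by the block */
};

struct lookup_set {
    uint8_t tag_bits;                 /* bit i set: slot i holds a tag */
    struct tag_word tags[SET_WORDS];  /* meaningful only where tag_bits is set */
    uint64_t data[SET_WORDS];         /* data words follow their tag word */
};

/* Returns 1 and writes the word on a hit, 0 on a miss.
   Assumes a well-formed set, i.e. every block fits inside the set. */
int lookup(const struct lookup_set *s, uint64_t region_tag, unsigned w,
           uint64_t *out)
{
    for (unsigned i = 0; i < SET_WORDS; i++) {
        if (!((s->tag_bits >> i) & 1))
            continue;  /* slot i is not a tag word */
        const struct tag_word *t = &s->tags[i];
        if (t->region_tag == region_tag && t->start <= w && t->end > w) {
            *out = s->data[i + 1 + (w - t->start)];  /* word selector */
            return 1;
        }
    }
    return 0;
}
```

For the block Tag X Y Z V, a lookup of Y hits, while a lookup of H (word 4) misses because End > W fails.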
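Partial Miss: Identify Sub-Blocks (Step 1 of 2)
1. New ∩ Tags: find the cached blocks that overlap the new block (Tag X Y and Tag V H)
2. Evict Overlap: move the overlapping blocks to the MSHR
Fetch New: Tag X Y Z V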
Partial Miss: Insert New Block (Step 2 of 2)
3. Allocate 6 words: Tag X Y Z V H
4. Miss: fetch the new block
5. Patch Missing ?'s: the MSHR holds X Y ? V H and the fetched Z fills the gap
Occurs ≈ 5 in 1000 accesses
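The two partial-miss steps can be sketched as a range merge. This is an illustrative model only: ranges are half-open word intervals within one region, the helper names are hypothetical, and the sketch computes exactly which words are missing, whereas the slides may fetch the whole new block and patch from it.

```c
#include <stdint.h>

struct word_range { unsigned start, end; };  /* half-open [start, end) */

struct merge_plan {
    struct word_range merged;  /* range of the block to allocate */
    uint8_t have;              /* bit w set: word w is already cached */
    uint8_t fetch;             /* bit w set: word w is missing */
};

/* Bitmap with one bit set per word of the range. */
uint8_t range_mask(struct word_range r)
{
    return (uint8_t)(((1u << (r.end - r.start)) - 1u) << r.start);
}

/* Merge a request with the cached blocks of the same region that overlap it. */
struct merge_plan plan_partial_miss(struct word_range req,
                                    const struct word_range *cached, int n)
{
    struct merge_plan p = { req, 0, 0 };
    for (int i = 0; i < n; i++) {
        /* Only blocks overlapping the request are evicted into the MSHR. */
        if (cached[i].start < req.end && req.start < cached[i].end) {
            if (cached[i].start < p.merged.start)
                p.merged.start = cached[i].start;
            if (cached[i].end > p.merged.end)
                p.merged.end = cached[i].end;
            p.have = (uint8_t)(p.have | range_mask(cached[i]));
        }
    }
    p.fetch = (uint8_t)(range_mask(p.merged) & ~p.have);
    return p;
}
```

With cached blocks X Y (words 0 to 1) and V H (words 3 to 4) and a request for X Y Z V, the merged block covers five data words, which with its tag gives the 6 words allocated on the slide, and only Z is missing.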
Hardware Overheads
[Figure: SRAM Array with Valid? and Tag? bitmap metadata; labels: Critical Path, Extra Amoeba Critical Path, 1 KB, Latency +4%]
Evaluation
• Parameters for latency and energy
• Workloads
Latency Parameters (cycles)
[Figure: latency parameters for CPU, 64K L1, 1M LLC and memory; values shown: 1, 3, 20, 300. Fixed Granularity vs. Amoeba Cache: 1.04 (Latency +4%)]
On-Chip Energy Parameters (pJ)
[Figure: energy parameters for the 64K L1 and 1M LLC, Fixed Granularity vs. Amoeba Cache; values shown: 101, 230, 105, 238 and ≈ 7 / word]
Workloads
• 22 diverse workloads from
  - PARSEC
  - SPEC-CPU 2000 & 2006
  - DaCapo (Java Benchmarks)
  - Apache, Firefox and PostgreSQL
Results
% Improvement in L1 Miss-Rate
[Figure: bar chart, 0% to 40%, for mcf, canneal, lbm, h2, jbb, apache, x264, firefox, tpcc and eclipse]
Reduces L1 and L2 miss rate by 18%
% Improvement in L1 Miss-Bandwidth
[Figure: bar chart, 0% to 75%, for mcf, canneal, lbm, h2, jbb, apache, x264, firefox, tpcc and eclipse]
Reduces on-chip bandwidth by 46%
Reduces off-chip bandwidth by 38%
% Improvement in memory energy
[Figure: bar chart, 0% to 40%, for mcf, canneal, lbm, h2, jbb, apache, x264, firefox, tpcc and eclipse]
Reduces energy by 11%
% Improvement in execution time
[Figure: bar chart, 0% to 20%, for mcf, canneal, lbm, h2, jbb, apache, x264, firefox, tpcc and eclipse; one bar is labelled 21%]
Improves performance by 10%
Results Summary: Amoeba-Cache
• Reduce cache pollution for applications with low cache utilization
• Improve performance for moderate cache utilization
• Maintain performance for high cache utilization workloads
• Save energy for streaming applications by keeping out unused words
Additional Results
Lookup as an extra cache pipeline stage vs. throttling the CPU
  - With the extra pipeline stage, 8 of 22 applications show improvement
Spatial Granularity Predictor
  - Indexing: address region is better for 18 of 22 applications
  - Training: evictions and first touch
  - Table size: 256 entries (PC) and 1024 entries (Region)
Additional Results
Multicore Shared Cache
Comparison against other designs
  - Fixed Granularity 2X
  - Sector Cache variants
  - Multi-$
Reduces miss rate (avg 18%) and LLC miss bandwidth (16%-39%)
Amoeba Cache
What? Enable variable granularity data caching
Why? Eliminate waste
How? Unify tag and data into a single SRAM array (afforded by recent technology trends)
Where? Definitely at the L2, possibly at the L1
Frequently Asked Questions
1. Multiple threads?
2. Compare against other designs
3. Spatial Pattern Predictor
4. Replacement Policy
Multicore Shared Cache
Mix                              Miss T1   Miss T2   Miss T3   Miss T4   BW (All)
jbb x2, tpc-c x2                 12.38%    12.38%    22.29%    22.37%    39.07%
Firefox x2, x264 x2               3.82%     3.61%    -2.44%     0.43%    15.71%
cactus, fluid., omnet., sopl.     1.01%     1.86%    22.38%     0.59%    18.62%
canneal, astar, ferret, milc      4.85%     2.75%    19.39%    -4.07%    17.77%
Comparison
[Table: Amoeba Cache vs. Multi-$ vs. Sector Variants, compared on: Impact on Miss-Rate, Impact on Bandwidth, Low tag overhead, Tradeoff data and tag space, Dynamically resize blocks. Extracted cell values: Yes, Yes, ~, ~, No, Yes, No, No, No, No]
Comparison – Moderate Group – 64K
[Figure: scatter plot of Bandwidth Ratio (0.4 to 1.0) against Miss Rate Ratio (1.0 to 1.6) for Sector (x: 2.9), Sector-Pre, Fixed-2X, Amoeba, Multi$-25 and Multi$-50]
Spatial Pattern Predictor
Predictor History Table
Index          Pattern
PC / Region    01011111
PC / Region    00011101
1. PC : Read Addr indexes the table and returns the pattern 0 0 0 1 1 1 0 1
2. Critical Word
Policy Miss vs Policy-Bandwidth
What to do when there is no entry?
Predictor Training
Data Array
Index          Pattern
PC / Region    01011111
PC / Region    00011101
Add / update entry on evict
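The prediction and training steps can be sketched as a small table. Several choices below are assumptions, not the deck's design: the table is direct-mapped instead of set-associative, a missing entry falls back to fetching the whole region, and the predicted block is the run of set pattern bits around the critical word. Bit w of a pattern stands for word w, and `critical` is assumed to be below 8.

```c
#include <stdint.h>

#define TABLE_ENTRIES    1024u  /* region-indexed table size quoted in the deck */
#define WORDS_PER_REGION 8u     /* assumption: 8 words per region */

struct word_range { unsigned start, end; };  /* half-open [start, end) */

struct pht_entry { uint64_t key; uint8_t pattern; uint8_t valid; };
struct pht { struct pht_entry e[TABLE_ENTRIES]; };

/* Training: on eviction, record which words of the block were touched. */
void pht_train(struct pht *t, uint64_t key, uint8_t touched)
{
    struct pht_entry *e = &t->e[key % TABLE_ENTRIES];
    e->key = key;
    e->pattern = touched;
    e->valid = 1;
}

/* Prediction: contiguous range of predicted words around the critical word. */
struct word_range pht_predict(const struct pht *t, uint64_t key,
                              unsigned critical)
{
    struct word_range r = { 0, WORDS_PER_REGION };  /* assumed fallback */
    const struct pht_entry *e = &t->e[key % TABLE_ENTRIES];
    if (!e->valid || e->key != key)
        return r;
    unsigned pattern = e->pattern | (1u << critical);  /* always include it */
    r.start = critical;
    r.end = critical + 1;
    while (r.start > 0 && ((pattern >> (r.start - 1)) & 1u))
        r.start--;
    while (r.end < WORDS_PER_REGION && ((pattern >> r.end) & 1u))
        r.end++;
    return r;
}
```

The key would be a PC or a region address, depending on which predictor variant is modelled.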
Predictor – L1 Miss Rate (1 of 2)
[Figure: MPKI, 0 to 10, for canneal, eclipse, firefox, h2, tpc-c and x264; series: Aligned, Finite, Infinite, Finite+FT, History]
Predictor – L1 Miss Rate (2 of 2)
[Figure: MPKI, 0 to 140, for apache, lbm, mcf and jbb; series: Aligned, Finite, Infinite, Finite+FT, History]
Predictor – L1 Miss Bandwidth (1 of 2)
[Figure: Bandwidth Rate, 0 to 1800, for canneal, eclipse, firefox, h2, tpc-c and x264; series: Aligned, Finite, Infinite, Finite+FT, History]
Predictor – L1 Miss Bandwidth (2 of 2)
[Figure: Bandwidth Rate, 0 to 10000, for apache, lbm, mcf and jbb; series: Aligned, Finite, Infinite, Finite+FT, History]
Predictor – Summary
For the majority of applications: Region Predictor with
  - 1024-entry table
  - Table with 8 ways x 128 sets
PC Predictor is good for 5 applications: apache, art, mcf, lbm and omnetpp
Pseudo LRU Replacement
• Logically partition the set into N ways
• Pick a block at random from the way
• Unset the T? (Tag) and V? (Valid) bits
[Figure: a set partitioned into Way 0 and Way 1]
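The replacement steps can be sketched as follows, under stated assumptions: the set has 8 words, a logical way is a word range of the set, the caller supplies the random number `pick`, and each block's length is kept in a hypothetical `len` array (real hardware would derive it from the Start and End fields of the tag).

```c
#include <stdint.h>

#define SET_WORDS 8  /* assumption: 8-word set, as in the insert example */

struct repl_set {
    uint8_t valid;           /* bit i set: word i is occupied */
    uint8_t tag;             /* bit i set: word i holds a tag */
    uint8_t len[SET_WORDS];  /* hypothetical: words (tag + data) of the block at slot i */
};

/* Collect the tag positions inside the logical way [way_start, way_end). */
int tags_in_way(const struct repl_set *s, int way_start, int way_end, int *out)
{
    int n = 0;
    for (int i = way_start; i < way_end && i < SET_WORDS; i++)
        if ((s->tag >> i) & 1)
            out[n++] = i;
    return n;
}

/* Evict one block of the way, chosen by pick; returns its position or -1. */
int evict_from_way(struct repl_set *s, int way_start, int way_end,
                   unsigned pick)
{
    int pos[SET_WORDS];
    int n = tags_in_way(s, way_start, way_end, pos);
    if (n == 0)
        return -1;  /* no block starts in this way */
    int victim = pos[pick % (unsigned)n];
    unsigned mask = ((1u << s->len[victim]) - 1u) << victim;
    s->valid = (uint8_t)(s->valid & ~mask);         /* unset V? bits */
    s->tag = (uint8_t)(s->tag & ~(1u << victim));   /* unset T? bit */
    return victim;
}
```

Passing the random value in as a parameter keeps the sketch deterministic; a simulator could pass the output of its own random number generator.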
Access Distribution for L1 (word distribution for 64K L1)
[Figure: stacked bars of Words Accessed (%), 0 to 100, split into 1-2, 3-4, 5-6 and 7-8 Words, for apache, art, astar, cactus, can, eclipse, fac, ferret, firefox, fluid., freq., h2, jbb, lbm, mcf, milc, omnet., soplex, tpc-c., trade., twolf, x264 and mean. Bar labels in benchmark order: 45 20 39 79 30 80 77 82 49 62 55 38 40 32 29 81 33 21 53 73 29 46 50]
Amoeba block size distribution for L1 (block distribution for 64K L1)
[Figure: stacked bars of % of Amoeba Blocks, 0 to 100, split into 1-2, 3-4, 5-6 and 7-8 Words, for apache, art, astar, cactus, can, eclipse, fac, ferret, firefox, fluid., freq., h2, jbb, lbm, mcf, milc, omnet., soplex, tpc-c., trade., twolf, x264 and mean. Bar labels in benchmark order: 92 80 98 100 67 98 88 99 78 100 94 82 89 89 93 100 83 91 91 97 70 91 90]
L1 FSM
Miss-Rate (64K L1)
[Figure: bar chart, 0 to 80, Fixed vs. Amoeba, for mcf, canneal, lbm, h2, jbb, apache, x264, firefox, tpcc and eclipse]
Miss Bandwidth Rate (64K L1)
[Figure: bar chart, 0 to 10000, Fixed vs. Amoeba, for mcf, canneal, lbm, h2, jbb, apache, x264, firefox, tpcc and eclipse]
Energy Rate (L1 + LLC) – (nJ/KI)
[Figure: bar chart, 0 to 100, Fixed vs. Amoeba, for mcf, canneal, lbm, h2, jbb, apache, x264, firefox, tpcc and eclipse]
Reduction in execution time
[Figure: bar chart, 0 to 16000, Fixed vs. Amoeba]