ECE4750/CS4420 Computer Architecture
L6: Advanced Memory Hierarchy
Edward Suh
Computer Systems Laboratory
Announcements
Lab 1 due today
Reading: Chapter 5.1 – 5.3
ECE4750/CS4420 — Computer Architecture, Fall 2008
Overview
How to improve cache performance
Recent research: Flash cache
Improving Cache Performance
Average memory access time = Hit time + Miss rate × Miss penalty
To improve performance: reduce the hit time, the miss rate, or the miss penalty
Small, Simple Caches
On many machines today the cache access sets the cycle time
• Hit time is therefore important beyond its effect on AMAT
[Figure: a direct-mapped cache (address split into tag | index | byte, a single tag comparator producing hit, data read directly) vs. a set-associative cache (parallel tag comparators producing hit, plus a mux to select the data way)]
Way-Predicting Caches (MIPS R10000 L2)
Use processor address to index into way prediction table
Look in the predicted way at the given index, then: on a tag match, fast hit; otherwise probe the remaining ways (slow hit) or miss
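The lookup flow above can be sketched in C (a minimal sketch; the sizes and the `lookup`/`pred` names are illustrative, not the R10000's actual organization):

```c
#include <stdbool.h>

#define SETS 64
#define WAYS 2

static unsigned tags[SETS][WAYS];  /* tag array of a 2-way cache */
static unsigned pred[SETS];        /* way-prediction table, indexed like the cache */

/* Returns true on a hit; *fast reports whether the predicted way was right
 * (fast hit) or another way had to be probed (slow hit). */
bool lookup(unsigned set, unsigned tag, bool *fast)
{
    unsigned p = pred[set];
    if (tags[set][p] == tag) {           /* predicted way: one probe, fast hit */
        *fast = true;
        return true;
    }
    *fast = false;
    for (unsigned w = 0; w < WAYS; w++)  /* slow path: probe the other ways */
        if (w != p && tags[set][w] == tag) {
            pred[set] = w;               /* retrain the predictor */
            return true;
        }
    return false;                        /* miss */
}
```

A mispredicted hit costs an extra probe: way prediction trades some latency variability for a near-direct-mapped fast hit time.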
Improving Cache Performance
Decrease Hit Time
Decrease Miss Rate
Decrease Miss Penalty
Causes for Cache Misses
• Compulsory: first reference to a block, a.k.a. cold-start misses
- misses that would occur even with an infinite cache
• Capacity: cache is too small to hold all data needed by the program
- misses that would occur even under perfect replacement in a fully associative cache of the same size
• Conflict: misses that occur because of collisions due to the block-placement strategy
- misses that would not occur with full associativity
Effect of Cache Parameters
• Larger cache size: reduces capacity misses, but increases hit time
• Higher associativity: reduces conflict misses, but increases hit time
• Larger block size: reduces compulsory misses (exploits spatial locality), but increases miss penalty and can increase conflict misses
Victim Cache (HP7200)
[Figure: CPU/RF with an L1 data cache, a small victim cache alongside it, and a unified L2 cache]
A victim cache is a small associative back-up cache, added to a direct-mapped cache, which holds recently evicted blocks
Prefetching
Speculate on future instruction and data accesses and fetch them into cache(s)
Varieties of prefetching
• Hardware prefetching
• Software prefetching
• Mixed schemes
What types of misses does prefetching affect?
Issues in Prefetching
Usefulness
Timeliness
Cache and bandwidth pollution
[Figure: CPU/RF with split L1 instruction and L1 data caches and a unified L2 cache; prefetched data flows into the caches]
Hardware Instruction Prefetch
Alpha 21064
[Figure: Alpha 21064 instruction prefetch: on an L1 instruction-cache miss, the requested block is fetched into the cache and the next sequential block into a stream buffer, both from the unified L2 cache]
Hardware Data Prefetching
Prefetch-on-miss: on a miss to block b, also fetch block b+1
One Block Lookahead (OBL) scheme: on an access to block b, prefetch block b+1
Strided prefetch: detect a constant stride s in the miss stream and prefetch block b+s
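A strided prefetcher can be sketched as a small table indexed by the PC of the missing load (a hypothetical sketch; the field and function names are illustrative, not from the lecture):

```c
/* Per-PC stride table: record the last miss address and stride; once the
 * same stride is seen twice in a row, predict the next address. */
typedef struct {
    unsigned long last_addr;
    long stride;
    int confident;   /* stride repeated, prediction enabled */
} stride_entry;

#define TABLE_SIZE 64
static stride_entry table_[TABLE_SIZE];

/* Called on each cache miss; returns the address to prefetch, or 0 if
 * no confident prediction exists yet. */
unsigned long stride_prefetch(unsigned long pc, unsigned long miss_addr)
{
    stride_entry *e = &table_[pc % TABLE_SIZE];
    long stride = (long)(miss_addr - e->last_addr);
    e->confident = (e->last_addr != 0 && stride == e->stride);
    e->stride = stride;
    e->last_addr = miss_addr;
    return e->confident ? miss_addr + stride : 0;
}
```

Three misses at a constant stride from the same PC are enough to train an entry and start prefetching.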
Software Prefetching
Compiler-directed prefetching
• compiler can analyze code and know where misses occur
for(i=0; i < N; i++) {
prefetch( &a[i + 1] );
prefetch( &b[i + 1] );
SUM = SUM + a[i] * b[i];
}
Issues?
What property do we require of the cache for prefetching to work?
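With GCC or Clang, the slide's `prefetch()` calls can be written with the `__builtin_prefetch` intrinsic (a sketch; the one-iteration prefetch distance is illustrative, and real code tunes it to the miss latency):

```c
#include <stddef.h>

/* Dot product with software prefetching: bring a[i+1] and b[i+1] toward
 * the cache while a[i] * b[i] is being computed. The second argument is
 * read (0) vs. write (1); the third is a temporal-locality hint. */
double dot_prefetch(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n) {
            __builtin_prefetch(&a[i + 1], 0, 1);
            __builtin_prefetch(&b[i + 1], 0, 1);
        }
        sum += a[i] * b[i];
    }
    return sum;
}
```

For prefetching to help at all, the cache must be non-blocking (hit under miss), so that normal accesses proceed while the prefetch is outstanding.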
Compiler Optimizations
Restructuring code affects the data block access sequence
• Group data accesses together to improve spatial locality
• Re-order data accesses to improve temporal locality
Prevent data from entering the cache
• Useful for variables that will only be accessed once before being replaced
• Needs mechanism for software to tell hardware not to cache data (instruction hints or page table bits)
Kill data that will never be used again
• Streaming data exploits spatial locality but not temporal locality
• Replace into dead cache locations
Array Merging
Some weak programmers may produce code like:
int val[SIZE];
int key[SIZE];
… and proceed to reference val and key in lockstep
Problem?
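Merging the two parallel arrays into one array of structs is the standard fix: each `val[i]` then sits next to `key[i]`, so a lockstep scan touches one cache block where the separate arrays touched two (a minimal sketch):

```c
#define SIZE 1024

/* Array merging: one array of structs instead of two parallel arrays */
struct merged {
    int val;
    int key;
};
struct merged merged_array[SIZE];

/* A lockstep scan now gets both fields from the same cache block. */
long scan(void)
{
    long sum = 0;
    for (int i = 0; i < SIZE; i++)
        sum += (long)merged_array[i].val * merged_array[i].key;
    return sum;
}
```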
Loop Interchange
for(j=0; j < N; j++) {
  for(i=0; i < M; i++) {
    x[i][j] = 2 * x[i][j];
  }
}
What type of locality does this improve?
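The interchanged version makes `j` the inner loop, so consecutive iterations walk sequential addresses in C's row-major layout (a minimal sketch with illustrative sizes):

```c
#define M 4
#define N 8

/* Loop interchange: j innermost means x[i][0], x[i][1], ... are visited
 * in address order, so successive accesses hit the same cache block. */
void scale(int x[M][N])
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}
```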
ECE4750/CS4420 — Computer Architecture, Fall 2008
-
10
19
Loop Fusion
for(i=0; i < N; i++)
for(j=0; j < M; j++)
a[i][j] = b[i][j] * c[i][j];
for(i=0; i < N; i++)
for(j=0; j < M; j++)
d[i][j] = a[i][j] * c[i][j];
What type of locality does this improve?
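The fused version computes and consumes each `a[i][j]` in the same iteration, while it is still in the cache (a sketch using C99 variable-length array parameters):

```c
/* Loop fusion: one loop nest produces a[i][j] and immediately uses it,
 * instead of writing the whole array and re-reading it in a second nest. */
void fused(int n, int m, double a[n][m], double b[n][m],
           double c[n][m], double d[n][m])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++) {
            a[i][j] = b[i][j] * c[i][j];
            d[i][j] = a[i][j] * c[i][j];
        }
}
```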
Blocking
for(i=0; i < N; i++)
  for(j=0; j < N; j++) {
    r = 0;
    for(k=0; k < N; k++)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }
[Figure: for one (i, j) pair, x is touched at [i][j], row i of y, and column j of z; legend: not touched / old access / new access]
Blocking
for(jj=0; jj < N; jj=jj+B)
for(kk=0; kk < N; kk=kk+B)
for(i=0; i < N; i++)
for(j=jj; j < min(jj+B,N); j++) {
r = 0;
for(k=kk; k < min(kk+B,N); k++)
r = r + y[i][k] * z[k][j];
x[i][j] = x[i][j] + r;
}
[Figure: with blocking, each pass touches only B×B sub-blocks of y and z]
Improving Cache Performance
Decrease Hit Time
Decrease Miss Rate
Decrease Miss Penalty
• Some cache misses are inevitable
• when they do happen, we want to service them as quickly as possible
Multilevel Caches
A memory cannot be both large and fast
Increasing sizes of cache at each level
AMAT = Hit time_L1 + Miss rate_L1 × Miss penalty_L1
Miss penalty_L1 = Hit time_L2 + Miss rate_L2 × Miss penalty_L2
What is 2nd-level miss rate?
• local miss rate – number of cache misses / cache accesses
• global miss rate – number of cache misses / CPU memory refs
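A worked example with illustrative numbers (not from the slide): L1 hit time 1 cycle with a 5% miss rate, L2 hit time 10 cycles with a 20% local miss rate, and a 100-cycle DRAM access:

```c
/* Two-level AMAT, straight from the formulas above. With these inputs the
 * global L2 miss rate would be 0.05 * 0.20 = 1% of CPU memory references. */
double amat(double l1_hit, double l1_miss_rate,
            double l2_hit, double l2_local_miss_rate, double mem_latency)
{
    double l1_miss_penalty = l2_hit + l2_local_miss_rate * mem_latency;
    return l1_hit + l1_miss_rate * l1_miss_penalty;
}
```

amat(1, 0.05, 10, 0.20, 100) = 1 + 0.05 × (10 + 0.20 × 100) = 2.5 cycles.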
[Figure: CPU → L1 → L2 → DRAM]
L2 Cache Hit Time vs. Miss Rate
Reduce Read Miss Penalty
Let read misses bypass writes
Problem? A read miss may hit an address whose latest write is still sitting in the write buffer
Solution? Check the write buffer on a read miss; forward the buffered data or drain the buffer on a match
[Figure: CPU/RF with a data cache and a write buffer in front of a unified L2 cache; read misses can bypass buffered writes]
Early Restart
Decrease miss penalty with no new hardware
• well, okay, with some more complicated control
Strategy: impatience!
There is no need to wait for entire line to be fetched
Early Restart – as soon as the requested word (or double word) of the cache block arrives, let the CPU continue execution
If CPU references another cache line or a later word in the same line: stall
Early restart is often combined with the next technique…
Critical Word First
Improvement over early restart
• request missed word first from memory system
• send it to the CPU as soon as it arrives
• CPU consumes word while rest of line arrives
Example: 32B block (8 words), miss on address 20
• words return from memory system as follows:
• word addresses in the 32B block: 0 4 8 12 16 20 24 28 — the word at address 20 returns first, then the fetch wraps around: 24, 28, 0, 4, 8, 12, 16
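The wrap-around return order can be computed directly (a minimal sketch; "wrapped fetch" is the usual memory-system behavior assumed here):

```c
/* Critical word first with wrapped fetch: the missed word returns first,
 * then the remaining words of the block in wrap-around order. */
void cwf_order(unsigned miss_addr, unsigned block_bytes,
               unsigned word_bytes, unsigned order[])
{
    unsigned words = block_bytes / word_bytes;
    unsigned block_base = miss_addr / block_bytes * block_bytes;
    unsigned first = (miss_addr - block_base) / word_bytes;
    for (unsigned i = 0; i < words; i++)
        order[i] = block_base + ((first + i) % words) * word_bytes;
}
```

For the slide's example (32B block, 4B words, miss at 20) this yields 20, 24, 28, 0, 4, 8, 12, 16.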
Sub-blocking
Tags are too large, i.e., too much overhead
• Simple solution: Larger blocks, but miss penalty could be large.
Sub-block placement (aka sector cache)
• A valid bit added to units smaller than the full block, called sub-blocks
• Only read a sub-block on a miss
• If a tag matches, is the word in the cache?
Example tag array with per-sub-block valid bits:
tag 100: 1 1 1 1
tag 300: 1 1 0 0
tag 204: 0 1 0 1
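So the answer to the question above is: only if the requested sub-block's valid bit is also set. A minimal sketch of the hit check:

```c
#include <stdbool.h>

#define SUBBLOCKS 4

/* One cache line in a sector cache: a single tag, but a valid bit
 * per sub-block rather than per line. */
struct line {
    unsigned tag;
    bool valid[SUBBLOCKS];
};

/* A tag match alone is not a hit; the requested sub-block must be valid. */
bool subblock_hit(const struct line *l, unsigned tag, unsigned sub)
{
    return l->tag == tag && l->valid[sub];
}
```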
Recent Research: Flash Cache
Slides from Taeho Kgil and Trevor Mudge (U of Michigan)
Roadmap for Memory - Flash
From the ITRS 2005 roadmap:

Year                                        2005      2007       2009       2011         2013
SRAM cell density (MBytes/cm²)              11        17         28         46           74
DRAM cell density (MBytes/cm²)              153       243        458        728          1,154
NAND Flash cell density, SLC/MLC (MB/cm²)   339/713   604/1,155  864/1,696  1,343/5,204  2,163/8,653
Power/Performance of DRAM and Flash
       Density (Gb/cm²)  $/Gb  Active power  Idle power  Read latency  Write latency  Erase latency
DRAM   0.7               48    878 mW        80 mW       55 ns         55 ns          N/A
NAND   1.42              21    27 mW         6 μW        25 μs         200 μs         1.5 ms
Flash good for idle power optimization
• 1000× less power than DRAM
Flash not so good for low access latency usage model
• DRAM still required for decent access latencies
Performance and power for 1Gb memory parts, from Samsung datasheets (2003)
Cost of Power and Cooling
[Figure: worldwide cost of purchasing and operating servers (source: IDC)]
System-Level Power Consumption
From a Sun talk given by James Laudon: SunFire T2000 power running SpecJBB (total 271 W)
• Processor: 26%
• 16GB memory: 22%
• I/O: 22%
• AC/DC conversion: 15%
• Fans: 10%
• Disk: 4%
• Service processor: remainder
Case for Flash as 2nd Disk Cache
Many server workloads have large working sets (hundreds of MBs to tens of GBs, or more)
The large working set is cached in main memory to maintain high throughput
A large portion of DRAM therefore goes to the disk cache
Many server applications are more read-intensive than write-intensive
32GB of DRAM on a SunFire T2000 consumes 45W of idle power
Flash memory consumes 1000× less idle power than DRAM
Use DRAM for recent, frequently accessed content, and Flash for older, infrequently accessed content
Client requests follow a Zipf-like distribution in space and time
e.g., 90% of client requests go to 20% of the files
Latency vs. Throughput
[Figure: SPECweb99 network bandwidth in Mbps (throughput, 0–1,200) vs. disk-cache access latency to 80% of files (12μs to 1600μs), for MP4, MP8, and MP12 configurations]
Overall Architecture
[Figure: baseline architecture without FlashCache: processors, 1GB DRAM main memory, HDD controller, and hard disk drive]
Flash Lifetime
[Figure: Flash lifetime in years (log scale, 0.01 to 10,000) vs. Flash memory size as a percentage of working-set size (0–100%), for SURGE, SPECWeb99, Financial1, and WebSearch1]
Programmable Flash Controller
[Figure: programmable Flash memory controller: an external interface drives Flash density control, Flash program/erase time control, a GF-field LUT, BCH encode/decode (with BCH configuration, density, and P/E descriptors), and CRC encode/decode, connected to NAND Flash memory; inputs are the Flash address and write data, outputs are read data and a bit-error yes/no indication]
Impact of ECC
[Figure: maximum tolerable P/E cycles (100,000 to 7,100,000) vs. number of correctable errors (code strength, 0–12), for error standard deviations of 0, 5%, 10%, and 20% of the mean]
Flash Lifetime w/ Programmable Controller
[Figure: with the programmable controller, Flash lifetime in years (log scale) vs. Flash memory size as a percentage of working-set size, for SURGE, SPECWeb99, Financial1, and WebSearch1]
Overall Performance - Mbps
[Figure: SPECweb99 network bandwidth in Mbps (0–1,200) for MP4, MP8, and MP12 across configurations: 32MB, 64MB, 128MB, 256MB, or 512MB DRAM + 1GB Flash, and 1GB DRAM alone]
Overall Main Memory Power
[Figure: SPECweb99 overall main-memory power (read + write + idle), in W: DDR2 1GB active ≈ 2.5 W, DDR2 1GB with powerdown ≈ 1.6 W, DDR2 128MB + Flash 1GB ≈ 0.6 W]