ECE 4750 Computer Architecture
Topic 6: Cache Microarchitecture
Christopher Batten
School of Electrical and Computer Engineering
Cornell University
http://www.csl.cornell.edu/courses/ece4750
slide revision: 2013-10-01-22-23
• Single-Bank Cache uArch • Multi-Bank Cache uArch Basic Optimizations Cache Examples
Agenda
Single-Bank Cache Microarchitecture
Multi-Bank Cache Microarchitecture
Basic Optimizations
Cache Examples
ECE 4750 T06: Cache Microarchitecture 2 / 36
Direct Mapped Cache Microarchitecture
Set Associative Cache Microarchitecture
Fully Associative Cache Microarchitecture
Synchronous SRAMs
Direct-Mapped Parallel Read Hit Path
Direct Mapped Pipelined Write Hit Path
Set-Associative Parallel Read Hit Path
Processor-Cache Interaction
[Figure: five-stage processor pipeline tightly coupled to the primary instruction cache (PC → addr/inst) and the primary data cache (addr/wdata/rdata). Cache refill data comes from the lower levels of the memory hierarchy; the entire CPU stalls on a data cache miss.]
Zero-Cycle Hit Latency with Tightly Coupled Interface
[Figure: five-stage pipeline — Fetch (F), Decode & Reg Read (D), Execute (X), Memory (M), Writeback (W) — with imem and dmem embedded directly in the F and M stages. Each cache performs tag check followed by data access within a single stage, so a cache hit adds zero extra cycles.]
Two-Cycle Hit Latency with Val/Rdy Interface
[Figure: the same pipeline with val/rdy memory interfaces. Fetch and Memory are each split into two stages (F0/F1, M0/M1), so a cache hit takes two cycles: tag check followed by data access. A not-ready (!rdy) memory request or a not-yet-valid memory response stalls the pipeline.]
Parallel Read, Pipelined Write Hit Path
[Figure: pipeline with a cache that performs tag check and read data access in parallel for loads, while stores perform the tag check in M and pipeline the write data access into the following cycle. Question: what new hazards does the pipelined write create?]
Agenda
Single-Bank Cache Microarchitecture
Multi-Bank Cache Microarchitecture
Basic Optimizations
Cache Examples
Multicore PARC w/ Shared Unified I/D $
Multicore PARC w/ Shared Single-Bank I$ and D$
Multicore PARC w/ Shared Multi-Bank I$ and D$
Multicore PARC w/ Private I$ and D$
Multicore PARC w/ Private I$ and Shared D$
Agenda
Single-Bank Cache Microarchitecture
Multi-Bank Cache Microarchitecture
Basic Optimizations
Cache Examples
Reduce Average Memory Access Time
Avg Mem Access Time = Hit Time + ( Miss Rate × Miss Penalty )
I Reduce hit time. Small and simple caches
I Reduce miss rate. Large block size. Large cache size. High associativity. Compiler optimizations
I Reduce miss penalty. Multi-level cache hierarchy. Prioritize reads
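The average memory access time formula above can be turned into a one-line helper. The sketch below is illustrative; the function name and all parameter values are assumptions, not numbers from the slides.

```c
#include <assert.h>

/* Average memory access time for a single-level cache:
 *   AMAT = Hit Time + ( Miss Rate * Miss Penalty )
 * Times are in cycles; miss_rate is a fraction in [0, 1]. */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

For example, a 1-cycle hit time, 50% miss rate, and 20-cycle miss penalty give amat(1.0, 0.5, 20.0) = 11.0 cycles, which makes clear how a high miss rate dominates even a fast hit path.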
Reduce Hit Time: Small & Simple Caches
Reduce Miss Rate: Large Block Size
I Less tag overhead
I Exploit fast burst transfers from DRAM
I Exploit fast burst transfers over wide on-chip busses
I Can waste bandwidth if data is not used
I Fewer blocks → more conflicts
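The tag-overhead point can be made concrete: with a fixed cache capacity, doubling the block size halves the number of blocks and therefore roughly halves total tag storage. The sketch below assumes 32-bit addresses and a direct-mapped cache; the function names and parameters are illustrative, not from the slides.

```c
#include <assert.h>

/* Integer log2 for power-of-two inputs. */
static unsigned log2u(unsigned x)
{
    unsigned n = 0;
    while (x >>= 1) n++;
    return n;
}

/* Total bits of tag storage for a direct-mapped cache with
 * 32-bit addresses, given capacity and block size in bytes
 * (both assumed to be powers of two). */
static unsigned tag_bits_total(unsigned cache_bytes, unsigned block_bytes)
{
    unsigned num_blocks  = cache_bytes / block_bytes;
    unsigned offset_bits = log2u(block_bytes);
    unsigned index_bits  = log2u(num_blocks);
    unsigned tag_bits    = 32 - offset_bits - index_bits;
    return num_blocks * tag_bits;
}
```

For a 16KB cache, moving from 16B to 64B blocks drops total tag storage from 18432 bits to 4608 bits: the tag width per block stays the same, but there are 4× fewer blocks to tag.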
Reduce Miss Rate: Large Cache Size
Empirical Rule of Thumb: If the cache size is doubled, the miss rate usually drops by about a factor of √2
Reduce Miss Rate: High Associativity
Empirical Rule of Thumb: A direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2
Reduce Miss Rate: Compiler Optimizations
I Restructuring code affects the data block access sequence. Group data accesses together to improve spatial locality. Re-order data accesses to improve temporal locality
I Prevent data from entering the cache. Useful for variables that will only be accessed once before eviction. Needs a mechanism for software to tell hardware not to cache data ("no-allocate" instruction hints or page table bits)
I Kill data that will never be used again. Streaming data exploits spatial locality but not temporal locality. Replace into dead-cache locations
Loop Interchange
for(j=0; j < N; j++) {
  for(i=0; i < M; i++) {
    x[i][j] = 2 * x[i][j];
  }
}

for(i=0; i < M; i++) {
  for(j=0; j < N; j++) {
    x[i][j] = 2 * x[i][j];
  }
}
What type of locality does this improve?
Loop Fusion
for(i=0; i < N; i++)
  a[i] = b[i] * c[i];

for(i=0; i < N; i++)
  d[i] = a[i] * c[i];

for(i=0; i < N; i++) {
  a[i] = b[i] * c[i];
  d[i] = a[i] * c[i];
}
What type of locality does this improve?
Reduce Miss Penalty: Multi-Level Caches
[Figure: Processor → L1 Cache → L2 Cache → Main Memory; an access can hit in L1, miss in L1 and hit in L2, or miss in both and go to main memory.]
Avg Mem Access Time = Hit Time of L1 + ( Miss Rate of L1 × Miss Penalty of L1 )
Miss Penalty of L1 = Hit Time of L2 + ( Miss Rate of L2 × Miss Penalty of L2 )
I Local miss rate = misses in cache / accesses to cache
I Global miss rate = misses in cache / processor memory accesses
I Misses per instruction = misses in cache / number of instructions
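The two equations compose: the L1 miss penalty is itself the average access time of the L2. A minimal sketch, with all numeric values and names as illustrative assumptions:

```c
#include <assert.h>

/* Two-level AMAT:
 *   AMAT       = L1 hit time + L1 miss rate * L1 miss penalty
 *   L1 penalty = L2 hit time + L2 local miss rate * memory penalty
 * Note l2_local_miss_rate is misses-in-L2 / accesses-to-L2;
 * the L2 *global* miss rate is l1_miss_rate * l2_local_miss_rate. */
static double amat2(double l1_hit, double l1_miss_rate,
                    double l2_hit, double l2_local_miss_rate,
                    double mem_penalty)
{
    double l1_miss_penalty = l2_hit + l2_local_miss_rate * mem_penalty;
    return l1_hit + l1_miss_rate * l1_miss_penalty;
}
```

For instance, amat2(1.0, 0.5, 10.0, 0.5, 100.0) = 31.0 cycles: the L1 miss penalty is 10 + 0.5 × 100 = 60, and AMAT is 1 + 0.5 × 60.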
Reduce Miss Penalty: Multi-Level Caches
I Use a smaller L1 if there is also an L2. Trade increased L1 miss rate for reduced L1 hit time & L1 miss penalty. Reduces average access energy
I Use a simpler write-through L1 with an on-chip L2. Write-back L2 cache absorbs write traffic, which doesn't go off-chip. Simplifies the processor pipeline. Simplifies on-chip coherence issues
I Inclusive Multilevel Cache. Inner cache holds a copy of data in the outer cache. External coherence is simpler
I Exclusive Multilevel Cache. Inner cache holds data not in the outer cache. Swap lines between inner/outer cache on a miss
Reduce Miss Penalty: Prioritize Reads
[Figure: CPU and register file connected to the data cache and a write buffer, both feeding the unified L2 cache. The write buffer holds evicted dirty lines (write-back cache) or all writes (write-through cache).]
I Processor not stalled on writes, and read misses can go ahead of writes to main memory
I Write buffer may hold the updated value of a location needed by a read miss. On a read miss, either wait for the write buffer to drain, or check the write buffer addresses and bypass the matching data
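The second option above can be sketched as a small associative lookup. This is a hypothetical structure for illustration, not the course's actual design: on a read miss, scan the buffered writes newest-to-oldest and bypass the data instead of draining the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

/* FIFO of pending writes, oldest first. */
typedef struct {
    uint32_t addr[WB_ENTRIES];
    uint32_t data[WB_ENTRIES];
    int      count;
} write_buffer_t;

static void wb_push(write_buffer_t *wb, uint32_t addr, uint32_t data)
{
    /* Sketch assumes the buffer never overflows; real hardware
     * would stall the processor when the buffer is full. */
    wb->addr[wb->count] = addr;
    wb->data[wb->count] = data;
    wb->count++;
}

/* On a read miss, return true and set *data if a buffered write
 * matches the miss address; scan newest-to-oldest so the most
 * recent write to an address wins. */
static bool wb_bypass(const write_buffer_t *wb, uint32_t addr, uint32_t *data)
{
    for (int i = wb->count - 1; i >= 0; i--) {
        if (wb->addr[i] == addr) {
            *data = wb->data[i];
            return true;
        }
    }
    return false;   /* no match: safe to fetch the line from memory */
}
```

The newest-match-wins scan is what makes the bypass correct when the buffer holds multiple writes to the same address.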
Cache Optimizations Impact on Average Memory Access Time

Technique               Hit Time   Miss Rate   Miss Penalty   HW
Parallel read hit          −                                   0
Pipelined write hit        −                                   1
Smaller caches             −          ++                       0
Large block size                      −            +           0
Large cache size           ++         −                        1
High associativity         ++         −                        1
Compiler optimizations                −                        0
Multi-level cache                                  −           2
Prioritize reads                                   −           1

( − = decreases, + = increases; HW = relative hardware cost, 0–2 )
Agenda
Single-Bank Cache Microarchitecture
Multi-Bank Cache Microarchitecture
Basic Optimizations
Cache Examples
Intel Itanium-2 On-Chip Caches (Intel/HP, 2002)
Level 1: 16KB, 4-way s.a., 64B line, quad-port (2 load+2 store), single cycle latency
Level 2: 256KB, 4-way s.a., 128B line, quad-port (4 load or 4 store), five cycle latency
Level 3: 3MB, 12-way s.a., 128B line, single 32B port, twelve cycle latency
IBM Power7 On-Chip Caches [IBM 2009]
32KB L1 I$/core and 32KB L1 D$/core, 3-cycle latency
256KB unified L2$/core, 8-cycle latency
32MB unified shared L3$ in embedded DRAM, 25-cycle latency to local slice
Acknowledgements
Some of these slides contain material developed and copyrighted by:
Arvind (MIT), Krste Asanovic (MIT/UCB), Joel Emer (Intel/MIT),
James Hoe (CMU), John Kubiatowicz (UCB), David Patterson (UCB)
MIT material derived from course 6.823
UCB material derived from courses CS152 and CS252