CSE 4201
Memory Hierarchy Design, Ch. 5
(Hennessy and Patterson)
Memory Hierarchy
We need a huge amount of cheap and fast memory
Memory is either fast or cheap; never both. Do as politicians do: fake it
Give a little bit of fast memory and tons of cheap memory.
As technology progresses, cheap becomes cheaper rapidly and fast becomes faster rapidly, but cheap does not become fast, nor fast cheap
Since the 80's...
Processors became 10,000 times faster; memory became 10 times faster. Back then, cache was for high-performance systems. Now we need multiple levels of cache
Addressing Scheme
(Diagram: the address is split into a block address and a block offset; the block address is split into a tag and a cache index; the index maps to a set, whose tags are checked for a match.)
Set-Associative
(Set Address) = (Block Address) MOD K Where K is the number of sets in the cache
(Block Address) = (Address) DIV b Where b is the number of bytes in the block
(Block Offset) = (Address) MOD b Set has n blocks. (n-way associative) Every block has data and address (tag). If K=1, fully associative If n=1, direct mapped
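A minimal sketch of this addressing scheme in C, assuming power-of-two block and set counts (the constants below are illustrative, not from the slides):

```c
#include <stdint.h>

/* Illustrative geometry: 64-byte blocks, 512 sets (any powers of two work). */
#define BLOCK_BYTES 64u    /* b: bytes per block             */
#define NUM_SETS    512u   /* K: number of sets in the cache */

/* (Block Address) = Address DIV b,  (Block Offset) = Address MOD b,
   (Set Address)   = (Block Address) MOD K; the tag is what remains. */
static inline uint64_t block_address(uint64_t addr) { return addr / BLOCK_BYTES; }
static inline uint64_t block_offset(uint64_t addr)  { return addr % BLOCK_BYTES; }
static inline uint64_t set_index(uint64_t addr)     { return block_address(addr) % NUM_SETS; }
static inline uint64_t tag_bits(uint64_t addr)      { return block_address(addr) / NUM_SETS; }
```

With power-of-two sizes the divisions and moduli reduce to shifts and masks, which is why block sizes and set counts are powers of two in practice.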
Victim Selection
Which block to expel to make room for the new entry? Least recently used (LRU), random, or first in first out (FIFO).
All work more or less the same. LRU is rarely exact, almost always approximate. The choice has little effect on big caches, about 10% for smaller ones.
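As a rough illustration of victim selection, here is a hypothetical 2-way set where a single bit per set implements exact LRU (higher associativities usually settle for approximations such as pseudo-LRU trees):

```c
/* A 2-way set: one LRU bit is enough to track the least recently used way.
   Tags, valid bits and data are omitted; this only shows the LRU policy. */
struct set2 {
    unsigned lru_way;   /* index (0 or 1) of the way to evict next */
};

/* On a hit to (or fill of) 'way', the other way becomes the next victim. */
static void touch(struct set2 *s, unsigned way) { s->lru_way = way ^ 1u; }

/* On a miss, expel the way named by the LRU bit. */
static unsigned pick_victim(const struct set2 *s) { return s->lru_way; }
```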
What Happens on a Write?
Writes are less common than reads: all instruction fetches are reads, stores are about 10% of the instructions, loads about 25%. We have 10 writes for every 125 reads, so better take good care of the reads.
Writes are costlier than reads: we write 1-8 bytes at a time into a block typically 32-64 bytes long, and we have issues with consistency.
Write Through, Back
Write through: no need to write back on a cache miss; no need for a dirty bit.
Write back: less bus traffic.
Write Through, Back
Write through <-> no-write-allocate: allocate a cache block only on reads; multiple writes without an immediate read do not disturb the cache.
Write back <-> write-allocate: makes subsequent reads fast.
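A sketch of what each pairing does on a write, reduced to policy decisions (the struct and function names are illustrative only):

```c
#include <stdbool.h>

/* The set of actions a cache takes when the CPU issues a write. */
struct write_actions {
    bool update_cache_block;    /* put the data into the cache            */
    bool allocate_on_miss;      /* fetch the block on a write miss        */
    bool write_next_level_now;  /* propagate immediately (write through)  */
    bool mark_dirty;            /* defer the write-back (write back)      */
};

static struct write_actions write_through_no_allocate(bool hit)
{
    return (struct write_actions){
        .update_cache_block   = hit,    /* only if the block is already there */
        .allocate_on_miss     = false,  /* a write miss bypasses the cache    */
        .write_next_level_now = true,   /* every write goes below             */
        .mark_dirty           = false,
    };
}

static struct write_actions write_back_allocate(bool hit)
{
    return (struct write_actions){
        .update_cache_block   = true,   /* after allocating, if it was a miss  */
        .allocate_on_miss     = !hit,   /* write-allocate                      */
        .write_next_level_now = false,  /* deferred until the block is evicted */
        .mark_dirty           = true,
    };
}
```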
AMD Opteron Cache
L1 cache (data): 64 KB, 64-byte blocks, 1 K blocks, 2-way associative, 512 sets. LRU. Write back, write allocate.
Nominally 64-bit ("bogo-bits") addresses: 48 virtual, 40 physical.
AMD Opteron Cache
Various sizes: physical address: 40 bits; block address: 34 bits; block offset: 6 bits; cache index: 9 bits; tag: 25 bits; set size: 2 blocks (2-way set associative); number of clock cycles: 2 (2 stalls on a hazard).
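As a quick check of these numbers: 64 KB / 64 B = 1024 blocks; 1024 blocks / 2 ways = 512 sets, so the index needs 9 bits; the 64-byte block needs a 6-bit offset; the block address is 40 - 6 = 34 bits; and the tag is the remaining 40 - 9 - 6 = 25 bits of the physical address.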
AMD Opteron Cache
Steps of a cache hit: the 40 bits are split into tag (25), index (9), offset (6). A set (2 blocks) is retrieved using the index. Their tags are compared and their valid bits checked. The correct block is selected. The 3 MSBs of the offset are used to select the word to be read/written. Update the LRU bits.
AMD Opteron Cache
Steps of a cache miss: same up until we know it is a miss. Identify a victim (LRU). If the victim is dirty, write it back. If a read, stall until the next level responds; if a write, continue (provisionally).
Miss Rate
Not your elementary school teacher. The three Cs:
Compulsory (the first time a block is referenced). Capacity (the cache cannot hold all the blocks the program needs). Conflict (when blocks have to share the same spot in a set).
We may add one more: coherency
Sneaky Miss Rate
Miss Rate can be misleading Defined as misses per (1000) access(es) Our delay is related to misses per instruction
Misses per instruction is the miss rate times the memory accesses per instruction.
Even this can be misleading: we want to reduce the delay
What is the delay
Avg. Mem. Access Time = Hit Time + Miss Rate * Miss Penalty
We do better by decreasing any of the three quantities on the right-hand side.
Unfortunately, these always involve trade-offs, and AMAT is just an approximation of the effect on the execution time.
Complications...
What exactly is a miss in speculative execution?
How much does a miss cost under dynamic scheduling?
Under multi-threading? If we allow a miss under a miss?
Example
Effective access time for a 16KB+16KB split cache.
Misses per 1000 instructions: 3.82 (instr. cache), 40.9 (data cache).
Mix: 36% of instructions are loads/stores. Hit: 1 cycle; miss: 100 cycles.
Example
Instruction miss rate: 0.00382
Data miss rate: 0.0409/0.36=0.114
Percent of references that are instructions: 100/136 ≈ 74%
Avg. Mem. Acc. Time = 74%*(1+0.00382*100) + 26%*(1+0.114*100)
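The same calculation written out as a small C program (only the arithmetic; the inputs are the numbers quoted above):

```c
#include <stdio.h>

int main(void)
{
    /* Inputs from the example: misses per 1000 instructions, the
       load/store mix, and the hit/miss costs. */
    double i_misses_per_instr = 3.82 / 1000.0;  /* instruction cache */
    double d_misses_per_instr = 40.9 / 1000.0;  /* data cache        */
    double mem_refs_per_instr = 0.36;           /* loads + stores    */
    double hit_time = 1.0, miss_penalty = 100.0;

    /* Convert misses per instruction into miss rates per access. */
    double i_miss_rate = i_misses_per_instr;                      /* 1 fetch per instruction */
    double d_miss_rate = d_misses_per_instr / mem_refs_per_instr; /* ~0.114                  */

    /* Fraction of all memory references that are instruction fetches (~74%). */
    double frac_instr = 1.0 / (1.0 + mem_refs_per_instr);

    double amat = frac_instr         * (hit_time + i_miss_rate * miss_penalty)
                + (1.0 - frac_instr) * (hit_time + d_miss_rate * miss_penalty);

    printf("AMAT = %.2f cycles\n", amat);  /* roughly 4.3 cycles */
    return 0;
}
```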
Miss Penalty under DynamicExecution
Is it the full latency? Is it the “exposed” latency? What about the latency due to contention by speculative instructions? Any form of latency has the same problem.
Simple (simplistic) solution: find which instruction did not commit in time and attribute the stall to it.
How to Increase Performance
Larger cache: obviously reduces misses; increases cost and power; increases hit time.
Larger block: decreases compulsory (initial) misses; better exploits spatial locality; decreases the number of blocks; increases miss penalty and bus traffic.
How to Increase Performance
Higher associativity: reduces conflict misses; increases hit time, silicon area, and power consumption.
Multilevel cache: lets L1 stay small (keeping the hit time low) while reducing the miss penalty; increases cost and power.
Give priority to read misses: let reads jump the queue.
Overlap TLB and cache read...
TLB and Cache
The cache understands physical addresses, so we have to consult the TLB to convert a virtual address to a physical one. How about if we overlap the two?
When is such a thing possible?
What is the Trick?
TLB is a small cache that associates a (virtual) page number to (physical) frame number
The page offset and the frame offset are the same and need no translation
If the page offset is enough to index the set in the cache, we do not need any bit from the frame number: we can retrieve the set while the TLB does the translation. When the TLB is done we compare the tags.
This is the Trick
(Diagram: the virtual page number goes to the TLB for mapping while the page offset, containing the cache index and block offset, selects the set in parallel; the physical tag from the TLB is then matched against the tags in the set.)
Disadvantages
Cache size ≤ page size * associativity. Usually we want a “medium” size page and a large cache. There are ways to deviate from this rule with extra hardware.
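A tiny sketch of that rule, assuming 4 KB pages (the page size here is an assumption, not from the slide):

```c
#include <stdbool.h>
#include <stdio.h>

/* The overlap trick needs the (index + block offset) bits to fit inside
   the page offset, i.e. cache size <= page size * associativity. */
static bool can_overlap_tlb(unsigned cache_bytes, unsigned page_bytes,
                            unsigned ways)
{
    return cache_bytes <= page_bytes * ways;
}

int main(void)
{
    /* A 32 KB cache with 4 KB pages: 2 ways is not enough, 8 ways is. */
    printf("2-way: %d\n", can_overlap_tlb(32 * 1024, 4096, 2));  /* 0 */
    printf("8-way: %d\n", can_overlap_tlb(32 * 1024, 4096, 8));  /* 1 */
    return 0;
}
```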
11 Advanced Optimizations
We organize them into 5 groups: reduce hit time, increase bandwidth, reduce miss penalty, reduce miss rate, and prefetching.
Small is Beautiful
Small and simple caches are faster: reduce size, reduce associativity, rely on the L2 cache. L1 cache sizes do not change much with technology.
Way Prediction
Tag comparison is costly. Store, along with the data, prediction bits for the next access; the index is augmented by the predictor bits. The data is sent to the CPU while we check the tags; if the tags do not match, we send an “Oops!” Pentium 4 uses it.
Example
Prediction hit rate: 85% (typical). Correct prediction: 1 cycle; misprediction: 3 cycles; without way prediction: 2 cycles. 0.85*1 + 0.15*3 = 1.3 < 2
Trace Cache
Seems so devious... It is almost Harry-Potterish. The cache contains a dynamic trace (the sequence of instructions as they are executed). Branch prediction is folded into the cache. Pentium 4 uses it for its micro-operations cache.
Cache Pipeline
Most caches take more than 1 cycle. Pipelining is tried and true: embed the cache pipeline into the CPU pipeline.
Pentium 4 takes 4 cycles (despite way prediction, etc.).
Non-Blocking Miss
Allow hit under miss Or multiple miss under miss
FP intensive programs benefit from multiple miss under miss
Dynamic execution benefits from it
Multi-Banked Cache
Multi-bank (aka interleaved) memories were always popular
Best suited to the L2 cache: allows each bank to be smaller and to work independently, and increases bandwidth.
AMD Opteron has 2 banks, Sun T1 has 4
Critical Word First
Critical word (the one we asked for) first, if the block is transmitted in multiple cycles.
Early restart: do not wait for the whole block to arrive.
Merging Write Buffer
A write miss might find its block already sitting in the (victim) write-back buffer.
Similar idea to the victim buffer (as in virtual memory).
Compiler Re-ordering
Try to access arrays in the order they are laid out in memory (and thus brought into the cache).
The magic behind fast matrix multiplication (blocking): break the matrix into pieces that fit comfortably in the cache.
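A minimal sketch of blocking for matrix multiplication in C (row-major n x n matrices; the tile size is an assumed tuning parameter, chosen so three tiles fit comfortably in the cache):

```c
#define TILE 32   /* assumed tile size; tune so 3 * TILE*TILE doubles fit in cache */

/* C += A * B for n x n row-major matrices; C must be zero-initialized
   if a plain product is wanted.  The three outer loops walk tiles, the
   three inner loops multiply one tile, so each tile is reused while it
   is still resident in the cache. */
void matmul_blocked(int n, const double *a, const double *b, double *c)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int k = kk; k < kk + TILE && k < n; k++) {
                        double aik = a[i * n + k];
                        for (int j = jj; j < jj + TILE && j < n; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```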
Prefetching
Hardware: if there are two misses in the same page, prefetch. Most schemes prefetch instructions into the instruction cache; Opteron and P4 prefetch data too.
Compiler: insert special “prefetch” instructions. Needs a non-blocking cache; increases traffic.
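As an illustration of the compiler-directed flavor, a sketch using GCC/Clang's __builtin_prefetch; the prefetch distance is an assumed tuning parameter, and the technique only pays off with a non-blocking cache:

```c
#define PF_DIST 16   /* assumed: how many elements ahead to prefetch */

/* Sum a large array while prefetching future elements.  The second
   argument (0) means "prefetch for reading"; the third is a locality hint. */
double sum_with_prefetch(const double *a, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0, 1);
        s += a[i];
    }
    return s;
}
```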
Memory Technologies
SRAM (Static RAM): big transistors optimized for speed.
DRAM: cheap capacitors optimized for density; reads are destructive; needs refreshing.
DRAMs Rule the Desktop
Memory size and CPU speed grow at about the same rate: it has always taken about a second to scan the whole memory.
Through most of their history DRAMs increased 4-fold in capacity every three years; now they increase 4-fold every four years. Speed increases only about 5% per year.
DRAM Organization
(Diagram: an address buffer feeds a row decoder and a column decoder; the row decoder selects a row of the memory array, the sense amps read it out, and the column decoder selects the bits for the data I/O.)
RAS and CAS
Row Address Strobe, Column Address Strobe. RAS goes first and the whole row is copied out; CAS then selects the bit or bits.
Improving RAM
Fast Page Mode: increment the CAS (column) address several times with the same RAS.
Make use of the modularity available: memory is organized in blocks of 1-4 Mbits each for manufacturing reasons, naturally interleaved.
SDRAM, DDR
Synchronous DRAM Shares the clock with the CPU No synchronization overhead in communication.
DDR Double Data Rate Front end of memory is fast Heavily interleaved back-ends
Virtual Memory
Expand RAM to disk (not that useful today). Allow multiple processes to share the physical memory. Allow arbitrary mapping.
File I/O, shared memory, dynamic libraries, etc
Critical to security
Security
Virtual memory is handled through the kernel. Page tables can be manipulated only in monitor mode. A process does not have the means to access the address space of another process.
However...
A kernel is a huge program. Huge programs have bugs. Most bugs cause the system to crash; a few of them are exploitable.
A better way...
Use virtual machines: much smaller, fewer bugs, one extra level of protection.
VMs have other advantages as well: share a computer, cloud computing, migrate a live program to different H/W.
VMM
Virtual Machine Monitors (hypervisors) allow a guest OS to run efficiently as a process on a host OS. User-level code runs natively; system calls are trapped and emulated. The VMM mediates between the guest OS and the H/W on the host: network connection, USB device management, filesystem and state.
ISA Support
An ISA supporting virtualization is called virtualizable
Virtualization is a new idea (geologically speaking)
Attempts by the guest to execute privileged instructions result in traps. The problem is that not all relevant instructions result in traps, and handling virtual memory is tricky.
Virtual Memory for Virtual Machines
Normally we distinguish between virtual memory and physical memory. Now we have an intermediate level: real memory.
The guest OS maps virtual to real; the VMM maps real to physical.
Shadow Page Table
The two-step process is too slow and interferes with h/w-assisted virtual memory. Shadow tables do the translation in one shot, but this means the guest OS cannot manage the page tables of its own processes. TLBs must have PID tags and/or be flushed on a context switch.
IBM ('70s) had one more level of indirection
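A toy sketch of the three memory levels and of how a shadow table precomposes the two mappings (the tiny flat arrays stand in for real page tables and are purely illustrative):

```c
#include <stdint.h>

#define NPAGES 16   /* toy page-table size */

static uint32_t guest_pt[NPAGES];   /* guest OS: virtual page -> real page     */
static uint32_t vmm_pt[NPAGES];     /* VMM:      real page    -> physical page */
static uint32_t shadow_pt[NPAGES];  /* shadow:   virtual page -> physical page */

/* The slow two-step walk the hardware would need without shadow tables. */
static uint32_t translate_two_step(uint32_t vpn)
{
    return vmm_pt[guest_pt[vpn]];
}

/* Guest page-table writes are trapped; the VMM rebuilds the shadow entry
   so the hardware can later walk a single table in one shot. */
static void guest_writes_pte(uint32_t vpn, uint32_t rpn)
{
    guest_pt[vpn]  = rpn;
    shadow_pt[vpn] = vmm_pt[rpn];
}
```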
Virtualized I/O
There are far too many devices and drivers to handle
I/O happens with the mediation of H/W, so it would be too slow to handle with emulation
Solution: generic devices for each type of I/O Network: time shared or NATed.
Example: Xen
Instead of trying to emulate everything just to trick the guest, allow small modifications to the guest to keep things simple.
It is called paravirtualization; Xen is the most popular example (VMware is another).
The Tricks of Xen
Augment the guest kernel: e.g., 1% of Linux is modified.
Uses the protection levels (rings) of x86: Xen at level 0 (highest), the guest OS at 1, apps at 3.
Wraps I/O devices in special virtual machines (driver domains) and talks to them with page remapping
VMM and ISA
Designers of ISAs were cheapos! To save a couple of bits they had the same instruction behave differently in monitor mode and user mode: POPF (pop flags) ignores privileged flags in user mode.
The IBM 370, '70s technology, is still the gold standard.
Cache and I/O
Should we do I/O through the cache? We would get the data immediately with perfect consistency, but it slows down the processor and pollutes the cache.
I/O directly with memory is most popular. It works well with write through (no stale data); otherwise we can mark pages as non-cacheable, or flush the cache, or send cache invalidations.
Fallacy: Predicting cache performance
Miss rates vary by a factor of 10,000 or more between programs, and there is a tremendous difference between instruction and data miss rates.
RAMBUS promises
RAMBUS had a bandwidth 8 times higher than the competition, but performance was only 0-15% faster overall, while the cost was 2-3 times higher (20% larger die). The reason is that most of the traffic is absorbed by the L2 cache.