Lecture 4 memory.ppt - University of Arkansas

Chapter 5 Large and Fast: Exploiting Memory Hierarchy

Transcript of Lecture 4 memory.ppt - University of Arkansas

Page 1:

Chapter 5

Large and Fast: Exploiting Memory Hierarchy

Page 2:

Processor-Memory Performance Gap

Figure: performance vs. year, 1980–2004, log scale ("Moore's Law"). µProc improves ~55%/year (2X/1.5yr); DRAM improves ~7%/year (2X/10yrs); the Processor-Memory Performance Gap grows ~50%/year.

Page 3:

The “Memory Wall”: the processor vs. DRAM speed disparity continues to grow

Figure: clocks per instruction vs. clocks per DRAM access (log scale, 0.01 to 1000) for VAX/1980, PPro/1996, and 2010+ (core vs. memory).

Good memory hierarchy (cache) design is increasingly important to overall performance

Page 4:

Memory Technology

Static RAM (SRAM) 0.5ns – 2.5ns, $2000 – $5000 per GB

Dynamic RAM (DRAM) 50ns – 70ns, $20 – $75 per GB

Magnetic disk 5ms – 20ms, $0.20 – $2 per GB

Ideal memory: access time of SRAM, capacity and cost/GB of disk

Page 5:

Memory Hierarchy: The Pyramid


Page 6:

A Typical Memory Hierarchy

Figure: on-chip components (Control, Datapath, RegFile, Instr Cache, Data Cache, ITLB, DTLB), then a Second Level Cache (SRAM), Main Memory (DRAM), and Secondary Memory (Disk).
Speed (cycles): ½'s → 1's → 10's → 100's → 10,000's
Size (bytes): 100's → 10K's → M's → G's → T's
Cost: highest → lowest

Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.

Page 7:

Inside the Processor

AMD Barcelona: 4 processor cores

Page 8:

Principle of Locality

Programs access a small proportion of their address space at any time

Temporal locality (locality in time): items accessed recently are likely to be accessed again soon, e.g., instructions in a loop, induction variables. Keep the most recently accessed items in the cache.

Spatial locality (locality in space): items near those accessed recently are likely to be accessed soon, e.g., sequential instruction access, array data. Move blocks consisting of contiguous words closer to the processor.
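To make the two kinds of locality concrete, here is a small illustrative C sketch (mine, not from the slides; the array name and sizes are arbitrary). The row-major loop walks consecutive addresses, so it benefits from multiword cache blocks (spatial locality) and reuses sum on every iteration (temporal locality); the column-major loop strides across rows and gets far less help from the cache.

```c
#include <stdio.h>

#define N 1024

static double a[N][N];   /* stored row-major in C */

int main(void)
{
    double sum = 0.0;    /* 'sum' is reused every iteration: temporal locality */

    /* Row-major traversal: consecutive elements of a row sit in the same
       cache block, so most accesses hit: spatial locality. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Column-major traversal touches one word per cache block before
       moving on, so it misses far more often in a multiword-block cache. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("sum = %f\n", sum);
    return 0;
}
```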

Page 9:

Taking Advantage of Locality

Memory hierarchy:
Store everything on disk
Copy recently accessed (and nearby) items from disk to smaller DRAM memory (main memory)
Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory (cache memory attached to the CPU)

Page 10:

Memory Hierarchy Levels

Block (aka cache line): the unit of copying; may be multiple words

If accessed data is present in the upper level:
Hit: access satisfied by the upper level
Hit ratio: hits/accesses
Hit time: time to access the block in the upper level + time to determine hit/miss

If accessed data is absent:
Miss: data not in the upper level
Miss ratio: misses/accesses = 1 – hit ratio
Miss penalty: time to access the block in the lower level + time to transmit that block to the level that experienced the miss + time to insert the block in that level + time to pass the block to the requestor

Page 11:

Cache Memory

§5.3 The Basics of Caches

Cache memory: the level of the memory hierarchy closest to the CPU

Given accesses X1, …, Xn–1, Xn:
How do we know if the data is present?
Where do we look?

Page 12:

Direct Mapped Cache

Location determined by address
Direct mapped: only one choice:
(Block address) modulo (#Blocks in cache)
#Blocks is a power of 2, so just use the low-order address bits

Figure: memory blocks 0–23 map onto an 8-block direct mapped cache; memory block i goes to cache block (i modulo 8).
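A minimal sketch of the mapping rule, assuming the figure's 8-block direct mapped cache (variable names are mine): because 8 is a power of 2, the modulo is just the low-order 3 address bits, and the remaining high-order bits become the tag.

```c
#include <stdio.h>

#define NUM_BLOCKS 8                   /* must be a power of 2 */

int main(void)
{
    unsigned block_addr = 22;          /* block address 22 = 10110 in binary */

    unsigned index = block_addr % NUM_BLOCKS;             /* low-order bits: 110 */
    unsigned masked = block_addr & (NUM_BLOCKS - 1);      /* same thing, as a bit mask */
    unsigned tag = block_addr / NUM_BLOCKS;                /* high-order bits: 10 */

    printf("index = %u (mask gives %u), tag = %u\n", index, masked, tag);
    return 0;
}
```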

Page 13:

Tags and Valid Bits

How do we know which particular block is stored in a cache location?
Store the block address as well as the data
Actually, only the high-order bits are needed – called the tag

What if there is no data in a location?
Valid bit: 1 = valid, 0 = not valid
Initially 0

Page 14:

Cache Example

8 blocks, 1 word/block, direct mapped. Initial state:

Index    V  Tag  Data
000 (0)  N
001 (1)  N
010 (2)  N
011 (3)  N
100 (4)  N
101 (5)  N
110 (6)  N
111 (7)  N

Access sequence (word addresses): 22, 26, 22, 26, 16, 3, 16, 18, 16
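The next few pages step through this sequence by hand. As a cross-check, here is a hedged C sketch (my own code, using the slide's parameters: 8 blocks, 1 word per block, direct mapped) that simulates the same accesses and should reproduce the hit/miss pattern shown on the following pages.

```c
#include <stdio.h>

#define NUM_BLOCKS 8   /* direct mapped, 1 word per block */

int main(void)
{
    int valid[NUM_BLOCKS] = {0};        /* valid bits, initially 0 */
    unsigned tag[NUM_BLOCKS] = {0};     /* stored tags */

    unsigned seq[] = {22, 26, 22, 26, 16, 3, 16, 18, 16};
    int n = sizeof seq / sizeof seq[0];

    for (int i = 0; i < n; i++) {
        unsigned addr = seq[i];
        unsigned index = addr % NUM_BLOCKS;   /* low-order 3 bits */
        unsigned t     = addr / NUM_BLOCKS;   /* high-order bits = tag */

        if (valid[index] && tag[index] == t) {
            printf("addr %2u -> index %u: hit\n", addr, index);
        } else {
            printf("addr %2u -> index %u: miss\n", addr, index);
            valid[index] = 1;                 /* fill (or replace) the block */
            tag[index]   = t;
        }
    }
    return 0;
}
```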

Page 15:

Cache Example

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Miss      110

Page 16:

Cache Example

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
26         11 010       Miss      010

Page 17:

Cache Example

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Hit       110
26         11 010       Hit       010

Page 18:

Cache Example

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Miss      000
3          00 011       Miss      011

Page 19:

Cache Example

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  10   Mem[10010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Hit       000
18         10 010       Miss      010

Page 20:

Address Subdivision

Page 21:

Taking Advantage of Spatial Locality

Let the cache block hold more than one word (two words per block).

Word reference string: 0, 1, 2, 3, 4, 3, 4, 15
Start with an empty cache - all blocks initially marked as not valid

0    miss  → 00 | Mem(1) Mem(0)
1    hit
2    miss  → 00 | Mem(3) Mem(2)
3    hit
4    miss  → 01 | Mem(5) Mem(4)   (replaces 00 | Mem(1) Mem(0))
3    hit
4    hit
15   miss  → 11 | Mem(15) Mem(14)

8 requests, 4 misses

Page 22:

Larger Block Size

64 blocks, 16 bytes/block. To what block number does address 1200 map?
1200 (decimal) = 0x000004B0
Block address = ⌊1200/16⌋ = 75; cache block number = 75 modulo 64 = 11

Address fields: Tag = bits 31–10 (22 bits), Index = bits 9–4 (6 bits), Offset = bits 3–0 (4 bits)
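The arithmetic above can be checked with a short hedged C sketch (constants simply restate the slide's parameters):

```c
#include <stdio.h>

int main(void)
{
    unsigned addr        = 1200;   /* byte address from the slide */
    unsigned block_bytes = 16;     /* 16 bytes per block          */
    unsigned num_blocks  = 64;     /* 64 blocks in the cache      */

    unsigned block_addr = addr / block_bytes;        /* 1200 / 16 = 75 */
    unsigned block_num  = block_addr % num_blocks;   /* 75 mod 64 = 11 */

    printf("block address = %u, cache block number = %u\n",
           block_addr, block_num);
    return 0;
}
```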

Page 23:

Multiword Block Direct Mapped Cache

Figure: a direct mapped cache with four words per block and a cache size of 1K words (256 blocks). Address fields: byte offset = bits 1–0, block offset = bits 3–2, index = bits 11–4 (8 bits), tag = bits 31–12 (20 bits); a 20-bit tag comparison produces Hit, and the block offset selects the 32-bit data word.

What kind of locality are we taking advantage of?

Page 24:

Block Size Considerations

Larger blocks should reduce the miss rate, due to spatial locality
But spatial locality decreases if the block becomes very large

In a fixed-sized cache:
Larger blocks → fewer of them → more competition → increased miss rate

Larger miss penalty:
Can override the benefit of a reduced miss rate
Early restart and critical-word-first can help

Page 25:

Miss Rate vs. Block Size


Page 26:

Cache Field Sizes

The number of bits in a cache includes the storage for data, the tags, and the flag bits
For a direct mapped cache with 2^n blocks, n bits are used for the index
For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word
What is the size of the tag field? For a 32-bit address: 32 − (n + m + 2)
The total number of bits in a direct-mapped cache is then 2^n × (block size + tag field size + flag bits size)

How many total bits are required for a direct mapped cache with 16KB of data and 4-word blocks, assuming a 32-bit address? (see the worked sketch below)
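Here is the worked sketch referred to above, using the slide's formula and my own assumption that the only flag bit is a valid bit:

```c
#include <stdio.h>

int main(void)
{
    /* 16KB of data, 4-word blocks, 32-bit byte addresses */
    unsigned words  = 16 * 1024 / 4;        /* 4K words of data          */
    unsigned blocks = words / 4;            /* 1K blocks -> n = 10       */
    unsigned n = 10, m = 2;                 /* 2^10 blocks, 2^2 words    */

    unsigned block_bits = 4 * 32;           /* 128 data bits per block   */
    unsigned tag_bits   = 32 - n - m - 2;   /* 32 - 10 - 2 - 2 = 18      */
    unsigned flag_bits  = 1;                /* assume just a valid bit   */

    unsigned total = blocks * (block_bits + tag_bits + flag_bits);
    printf("blocks = %u, tag = %u bits, total = %u bits (~%.1f KB)\n",
           blocks, tag_bits, total, total / 8.0 / 1024.0);
    /* 1024 x (128 + 18 + 1) = 147 Kbits = 150528 bits, about 18.4 KB,
       i.e., roughly 1.15 times the 16KB of data. */
    return 0;
}
```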

Page 27:

Handling Cache Hits

Read hits (I$ and D$): this is what we want!

Write hits (D$ only):
Require the cache and memory to be consistent:
always write the data into both the cache block and the next level in the memory hierarchy (write-through)
Allow the cache and memory to be inconsistent:
write the data only into the cache block (write-back the cache block to the next level in the memory hierarchy when that cache block is “evicted”)

Page 28:

Write-Through

Write-through: update the data in both the cache and the main memory
But this makes writes take longer: writes run at the speed of the next level in the memory hierarchy
e.g., if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles:
Effective CPI = 1 + 0.1 × 100 = 11

Solution: write buffer
Holds data waiting to be written to memory
CPU continues immediately; only stalls on a write if the write buffer is full

Page 29:

Write-Back

Alternative: on a data-write hit, just update the block in the cache
Keep track of whether each block is dirty: need a dirty bit
When a dirty block is replaced:
Write it back to memory
Can use a write buffer

Page 30:

Handling Cache Misses (single-word cache block)

Read misses (I$ and D$):
Stall the CPU pipeline
Fetch the block from the next level of the hierarchy
Copy it to the cache
Send the requested word to the processor
Resume the CPU pipeline

Page 31:

Handling Cache Misses (single-word cache block)

What should happen on a write miss?
Write allocate (aka fetch on write): the data at the missed-write location is loaded into the cache

For write-through: two alternatives
Allocate on miss: fetch the block, i.e., write allocate
No write allocate: don't fetch the block
(since programs often write a whole data structure before reading it, e.g., initialization)

For write-back:
Usually fetch the block, i.e., write allocate

Page 32:

Multiword Block Considerations

Read misses (I$ and D$):
Processed the same as for single-word blocks – a miss returns the entire block from memory
Miss penalty grows as block size grows
Early restart – processor resumes execution as soon as the requested word of the block is returned
Requested word first – the requested word is transferred from the memory to the cache (and processor) first
Nonblocking cache – allows the processor to continue to access the cache while the cache is handling an earlier miss

Page 33:

Multiword Block Considerations

Write misses (D$):
If using write allocate, must first fetch the block from memory and then write the word to the block – or we could end up with a “garbled” block in the cache (e.g., for 4-word blocks: a new tag, one word of data from the new block, and three words of data from the old block)
If not using write allocate, forward the write request to the main memory

Page 34:

Main Memory Supporting Caches

Use DRAMs for main memory
Example – cache block read:
1 bus cycle for address transfer
15 bus cycles per DRAM access
1 bus cycle per data transfer
4-word cache block, 3 different main memory configurations:
1-word wide memory
4-word wide memory
4-bank interleaved memory

Page 35:

1-Word Wide Bus, 1-Word Wide Memory

What if the block size is four words and the bus and the memory are 1-word wide?

cycle to send address

cycles to read DRAM

cycles to return 4 data words

total clock cycles miss penalty

Page 36:

1-Word Wide Bus, 1-Word Wide Memory

What if the block size is four words and the bus and the memory are 1-word wide?
1 cycle to send the address
4 × 15 = 60 cycles to read the DRAM (15 cycles per word, read one word at a time)
4 cycles to return the 4 data words
65 total clock cycles of miss penalty

Bandwidth = (4 × 4)/65 = 0.246 bytes per cycle

Page 37:

4-Word Wide Bus, 4-Word Wide Memory

What if the block size is four words and the bus and the memory are 4-word wide?
cycle to send address
cycles to read DRAM
cycles to return 4 data words
total clock cycles miss penalty

Page 38:

4-Word Wide Bus, 4-Word Wide Memory

What if the block size is four words and the bus and the memory are 4-word wide?
1 cycle to send the address
15 cycles to read the DRAM (all four words read in parallel)
1 cycle to return the 4 data words
17 total clock cycles of miss penalty

Bandwidth = (4 × 4)/17 = 0.941 bytes per cycle

Page 39:

Interleaved Memory, 1-Word Wide Bus

For a block size of four words:
cycle to send address
cycles to read DRAM banks
cycles to return 4 data words
total clock cycles miss penalty

Page 40:

Interleaved Memory, 1-Word Wide Bus

For a block size of four words:
1 cycle to send the address
15 cycles to read the DRAM banks (all four banks read in parallel)
4 × 1 = 4 cycles to return the 4 data words (one bus transfer per word)
20 total clock cycles of miss penalty

Bandwidth = (4 × 4)/20 = 0.8 bytes per cycle
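The three configurations can be compared with one hedged C sketch (the helper function and its name are mine; the 1/15/1 bus-cycle costs are the slide's):

```c
#include <stdio.h>

/* Miss-penalty model from the slides: 1 cycle for the address, then either
   serialized or overlapped DRAM accesses, then 1 bus cycle per bus transfer. */
static void report(const char *name, unsigned serial_dram_accesses,
                   unsigned bus_transfers)
{
    unsigned penalty = 1 + serial_dram_accesses * 15 + bus_transfers * 1;
    double bandwidth = (4.0 * 4.0) / penalty;   /* 4 words x 4 bytes per miss */
    printf("%-36s penalty = %2u cycles, bandwidth = %.3f B/cycle\n",
           name, penalty, bandwidth);
}

int main(void)
{
    report("1-word wide bus, 1-word wide memory", 4, 4);  /* 1 + 60 + 4 = 65 */
    report("4-word wide bus, 4-word wide memory", 1, 1);  /* 1 + 15 + 1 = 17 */
    report("4-bank interleaved, 1-word wide bus", 1, 4);  /* 1 + 15 + 4 = 20 */
    return 0;
}
```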

Page 41:

Advanced DRAM Organization

Bits in a DRAM are organized as a rectangular array; a DRAM access reads an entire row
Burst mode: supply successive words from a row with reduced latency
Double data rate (DDR) DRAM: transfer on both the rising and falling clock edges
Quad data rate (QDR) DRAM: separate DDR inputs and outputs

Page 42:

Measuring Cache Performance

§5.4 Measuring and Improving Cache Performance

Components of CPU time:
Program execution cycles (includes cache hit time)
Memory stall cycles (mainly from cache misses)

With simplifying assumptions:

Total memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                          = (Instructions / Program) × (Misses / Instruction) × Miss penalty

CPU time = IC × CPI × CC
         = IC × (CPI_ideal + Average memory-stall cycles per instruction) × CC

Page 43:

Cache Performance Example

Given:
I-cache miss rate = 2%
D-cache miss rate = 4%
Miss penalty = 100 cycles
Base CPI (ideal cache) = 2
Loads & stores are 36% of instructions

Miss cycles per instruction:
I-cache: 0.02 × 100 = 2 (every instruction is fetched from the I-cache)
D-cache: 0.36 × 0.04 × 100 = 1.44 (36% of instructions access the D-cache)

Actual CPI = 2 + 2 + 1.44 = 5.44
The ideal-cache CPU is 5.44/2 = 2.72 times faster
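The same arithmetic, as a hedged C sketch with my own variable names:

```c
#include <stdio.h>

int main(void)
{
    double base_cpi     = 2.0;    /* ideal-cache CPI              */
    double icache_miss  = 0.02;   /* I-cache miss rate            */
    double dcache_miss  = 0.04;   /* D-cache miss rate            */
    double mem_refs     = 0.36;   /* loads/stores per instruction */
    double miss_penalty = 100.0;  /* cycles                       */

    double i_stall = icache_miss * miss_penalty;              /* 2.00 */
    double d_stall = mem_refs * dcache_miss * miss_penalty;   /* 1.44 */
    double cpi     = base_cpi + i_stall + d_stall;            /* 5.44 */

    printf("CPI = %.2f, slowdown vs. ideal = %.2fx\n", cpi, cpi / base_cpi);
    return 0;
}
```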

Page 44:

Average Memory Access Time

Hit time is also important for performance
Average memory access time (AMAT):
AMAT = Hit time + Miss rate × Miss penalty

Example: CPU with a 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
AMAT = 1 + 0.05 × 20 = 2 ns, i.e., 2 cycles per instruction fetch
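And the AMAT formula in the same style (again a sketch, not part of the original slides):

```c
#include <stdio.h>

int main(void)
{
    double hit_time     = 1.0;    /* cycles (1 ns at a 1 ns clock) */
    double miss_rate    = 0.05;   /* I-cache miss rate             */
    double miss_penalty = 20.0;   /* cycles                        */

    double amat = hit_time + miss_rate * miss_penalty;   /* 1 + 0.05*20 = 2 */
    printf("AMAT = %.1f cycles (= %.1f ns with a 1 ns clock)\n", amat, amat);
    return 0;
}
```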

Page 45:

Performance Summary

When CPU performance increases, the miss penalty becomes more significant
Decreasing the base CPI: a greater proportion of time is spent on memory stalls
Increasing the clock rate: memory stalls account for more CPU cycles
Can't neglect cache behavior when evaluating system performance

Page 46:

Mechanisms to Reduce Cache Miss Rates

1. Allow more flexible block placement
2. Use multiple levels of caches

Page 47:

Reducing Cache Miss Rates #1

1. Allow more flexible block placement

In a direct mapped cache a memory block maps to exactly one cache block
At the other extreme, a memory block could be allowed to map to any cache block – a fully associative cache
A compromise is to divide the cache into sets, each of which consists of n “ways” (n-way set associative)
A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices)

Page 48:

Associative Caches

Fully associative (only one set):
Allow a given block to go in any cache entry
Requires all entries to be searched at once
Comparator per entry (expensive)

n-way set associative:
Each set contains n entries
Block address determines which set: (Block address) modulo (#Sets in cache)
Search all entries in a given set at once
n comparators (less expensive)

Page 49:

Associative Cache Example

Page 50:

Spectrum of Associativity

For a cache with 8 entries

Page 51:

Associativity Example

Compare 4-block caches: direct mapped, 2-way set associative, fully associative
Block address sequence: 0, 8, 0, 6, 8

Direct mapped

Block address  Cache index  Hit/miss  Cache content after access (blocks 0 1 2 3)
0              0            miss      Mem[0]
8              0            miss      Mem[8]
0              0            miss      Mem[0]
6              2            miss      Mem[0], Mem[6]
8              0            miss      Mem[8], Mem[6]

Page 52:

Associativity Example

2-way set associative:

Block address  Cache index  Hit/miss  Cache content after access (Set 0 | Set 1)
0              0            miss      Mem[0]
8              0            miss      Mem[0], Mem[8]
0              0            hit       Mem[0], Mem[8]
6              0            miss      Mem[0], Mem[6]
8              0            miss      Mem[8], Mem[6]

Fully associative:

Block address  Hit/miss  Cache content after access
0              miss      Mem[0]
8              miss      Mem[0], Mem[8]
0              hit       Mem[0], Mem[8]
6              miss      Mem[0], Mem[8], Mem[6]
8              hit       Mem[0], Mem[8], Mem[6]
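All three organizations fall out of one small simulator, sketched below under my own assumptions (4 cache blocks, LRU replacement, block-address references): ways = 1 reproduces the direct mapped table (5 misses), ways = 2 the 2-way table (4 misses), and ways = 4 the fully associative table (3 misses).

```c
#include <stdio.h>

#define CACHE_BLOCKS 4

/* Simulate a 4-block cache with the given associativity and LRU replacement;
   returns the number of misses for the block-address reference stream. */
static int simulate(int ways, const unsigned *refs, int n)
{
    int sets = CACHE_BLOCKS / ways;
    int valid[CACHE_BLOCKS] = {0};
    unsigned tag[CACHE_BLOCKS];
    int last_use[CACHE_BLOCKS];
    int misses = 0;

    for (int t = 0; t < n; t++) {
        unsigned block = refs[t];
        int set = block % sets;
        int base = set * ways;          /* ways of a set are stored contiguously */
        int hit = 0, victim = base;

        for (int w = base; w < base + ways; w++) {
            if (valid[w] && tag[w] == block) { hit = 1; last_use[w] = t; break; }
            if (!valid[w]) victim = w;                          /* prefer an empty way */
            else if (valid[victim] && last_use[w] < last_use[victim]) victim = w;
        }
        if (!hit) {                                             /* LRU replacement     */
            misses++;
            valid[victim] = 1;
            tag[victim] = block;
            last_use[victim] = t;
        }
    }
    return misses;
}

int main(void)
{
    unsigned refs[] = {0, 8, 0, 6, 8};
    int n = sizeof refs / sizeof refs[0];

    printf("direct mapped    : %d misses\n", simulate(1, refs, n));
    printf("2-way set assoc  : %d misses\n", simulate(2, refs, n));
    printf("fully associative: %d misses\n", simulate(4, refs, n));
    return 0;
}
```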

Page 53:

Another Reference String Mapping

Consider the main memory word reference string: 0, 4, 0, 4, 0, 4, 0, 4
Start with an empty cache - all blocks initially marked as not valid
(direct mapped: words 0 and 4 map to the same cache block)

0  miss  → tag 00: Mem(0)
4  miss  → tag 01: Mem(4)   (replaces Mem(0))
0  miss  → tag 00: Mem(0)   (replaces Mem(4))
4  miss  → tag 01: Mem(4)
0  miss  → tag 00: Mem(0)
4  miss  → tag 01: Mem(4)
0  miss  → tag 00: Mem(0)
4  miss  → tag 01: Mem(4)

8 requests, 8 misses
Ping pong effect due to conflict misses - two memory locations that map into the same cache block

Page 54:

Another Reference String Mapping

Consider the main memory word reference string: 0, 4, 0, 4, 0, 4, 0, 4
Start with an empty cache - all blocks initially marked as not valid

Page 55:

Another Reference String Mapping

Consider the main memory word reference string 0, 4, 0, 4, 0, 4, 0, 4 in a 2-way set associative cache
Start with an empty cache - all blocks initially marked as not valid

0  miss  → set holds 000 Mem(0)
4  miss  → set holds 000 Mem(0), 010 Mem(4)
0  hit      set holds 000 Mem(0), 010 Mem(4)
4  hit      set holds 000 Mem(0), 010 Mem(4)
(the remaining references all hit)

8 requests, 2 misses

Solves the ping pong effect in a direct mapped cache due to conflict misses, since now two memory locations that map into the same cache set can co-exist!

Page 56:

How Much Associativity

Increased associativity decreases the miss rate, but with diminishing returns
Simulation of a system with a 64KB D-cache, 16-word blocks, SPEC2000:
1-way: 10.3%
2-way: 8.6%
4-way: 8.3%
8-way: 8.1%

Page 57:

Benefits of Set Associative Caches

The largest gains are in going from a direct mapped cache to a 2-way set associative cache (20+% reduction in miss rate)

Page 58:

Set Associative Cache Organization

Page 59:

Range of Set Associative Caches

For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets – it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit

Address fields: Tag | Index | Block offset | Byte offset

Page 60:

Range of Set Associative Caches

For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets – it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit

Address fields: Tag (used for tag compare) | Index (selects the set) | Block offset (selects the word in the block) | Byte offset

Increasing associativity → fully associative (only one set): the tag is all the bits except the block and byte offsets
Decreasing associativity → direct mapped (only one way): smaller tags, only a single comparator

Page 61:

Example: Tag Size

Cache: 4K blocks, 4-word block size, 32-bit byte address
Total number of tag bits for the cache? (see the worked sketch below)
Direct mapped
2-way set associative
4-way set associative
Fully associative
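Here is the worked sketch referred to above (a hedged calculation of my own; it assumes the 16-byte block gives a 4-bit byte/block offset):

```c
#include <stdio.h>

int main(void)
{
    int addr_bits   = 32;
    int blocks      = 4 * 1024;                /* 4K blocks               */
    int offset_bits = 4;                       /* 4-word (16-byte) blocks */

    int ways[]         = {1, 2, 4, 4 * 1024};  /* fully assoc. = all blocks */
    const char *name[] = {"direct mapped", "2-way", "4-way", "fully associative"};

    /* Expected results: 16/64, 17/68, 18/72, 28/112 (tag bits / Kbits of tag storage). */
    for (int i = 0; i < 4; i++) {
        int sets = blocks / ways[i];
        int index_bits = 0;
        while ((1 << index_bits) < sets)
            index_bits++;                      /* log2(number of sets) */

        int tag_bits = addr_bits - index_bits - offset_bits;
        printf("%-18s index = %2d bits, tag = %2d bits, tag storage = %3d Kbits\n",
               name[i], index_bits, tag_bits, tag_bits * blocks / 1024);
    }
    return 0;
}
```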


Page 62:

Replacement Policy

Direct mapped: no choice
Set associative:
Prefer a non-valid entry, if there is one
Otherwise, choose among the entries in the set

Least-recently used (LRU):
Choose the one unused for the longest time
Simple for 2-way, manageable for 4-way, too hard beyond that

Random:
Gives approximately the same performance as LRU for high associativity

Page 63:

Reducing Cache Miss Rates #2

2. Use multiple levels of caches

Primary (L1) cache attached to the CPU: small but fast; separate L1 I$ and L1 D$
Level-2 cache services misses from the L1 cache: larger, slower, but still faster than main memory; a unified cache for both instructions and data
Main memory services L2 cache misses; some high-end systems include an L3 cache

Page 64:

Multilevel Cache Example

Given: CPU base CPI = 1, clock rate = 4GHz, miss rate = 2% per instruction, main memory access time = 100ns

With just the primary cache:
Miss penalty = 100ns / 0.25ns = 400 cycles
Effective CPI = 1 + 0.02 × 400 = 9

Page 65:

Example (cont.)

Now add an L2 cache: access time = 5ns, global miss rate to main memory = 0.5%
Primary miss with L2 hit: penalty = 5ns / 0.25ns = 20 cycles
Primary miss with L2 miss: extra penalty = 400 cycles
CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
Performance ratio = 9 / 3.4 = 2.6
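The same two CPI figures, as a hedged sketch with my own variable names:

```c
#include <stdio.h>

int main(void)
{
    double base_cpi    = 1.0;
    double l1_miss     = 0.02;    /* misses per instruction        */
    double l2_penalty  = 20.0;    /* L1 miss, L2 hit: 5ns / 0.25ns */
    double mem_penalty = 400.0;   /* 100ns / 0.25ns                */
    double global_miss = 0.005;   /* misses that also miss in L2   */

    double cpi_l1only = base_cpi + l1_miss * mem_penalty;            /* 9.0 */
    double cpi_l1l2   = base_cpi + l1_miss * l2_penalty
                                 + global_miss * mem_penalty;        /* 3.4 */

    printf("L1 only: CPI = %.1f; with L2: CPI = %.1f; speedup = %.1fx\n",
           cpi_l1only, cpi_l1l2, cpi_l1only / cpi_l1l2);
    return 0;
}
```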

Page 66:

Multilevel Cache Considerations

Primary cache: focus on minimal hit time; smaller total size with a smaller block size
L2 cache: focus on a low miss rate to avoid main memory accesses; hit time has less overall impact; larger total size with a larger block size; higher levels of associativity

Page 67:

Global vs. Local Miss Rate

Global miss rate (GR): the fraction of references that miss in all levels of a multilevel cache; dictates how often the main memory is accessed
Global miss rate so far (GRS): the fraction of references that miss up to a certain level in a multilevel cache
Local miss rate (LR): the fraction of references to one level of a cache that miss

The L2$ local miss rate >> the global miss rate

Page 68:

Example

LR1 = 5%, LR2 = 20%, LR3 = 50%
GR = LR1 × LR2 × LR3 = 0.05 × 0.2 × 0.5 = 0.005
GRS1 = LR1 = 0.05
GRS2 = LR1 × LR2 = 0.05 × 0.2 = 0.01
GRS3 = LR1 × LR2 × LR3 = 0.05 × 0.2 × 0.5 = 0.005
CPI = 1 + GRS1 × Pty1 + GRS2 × Pty2 + GRS3 × Pty3
    = 1 + 0.05 × 10 + 0.01 × 20 + 0.005 × 100 = 2.2
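And the same numbers in code form (a sketch; Pty1–Pty3 are the per-level penalties used in the line above):

```c
#include <stdio.h>

int main(void)
{
    double lr1 = 0.05, lr2 = 0.20, lr3 = 0.50;   /* local miss rates    */
    double pty1 = 10, pty2 = 20, pty3 = 100;     /* per-level penalties */

    double grs1 = lr1;                 /* miss in L1                  = 0.05  */
    double grs2 = lr1 * lr2;           /* miss in L1 and L2           = 0.01  */
    double grs3 = lr1 * lr2 * lr3;     /* miss everywhere (global GR) = 0.005 */

    double cpi = 1.0 + grs1 * pty1 + grs2 * pty2 + grs3 * pty3;
    printf("GR = %.3f, CPI = %.1f\n", grs3, cpi);   /* GR = 0.005, CPI = 2.2 */
    return 0;
}
```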


Page 69:

Sources of Cache Misses

Compulsory (cold start or process migration, first reference):
First access to a block; a “cold” fact of life, not a whole lot you can do about it. If you are going to run “millions” of instructions, compulsory misses are insignificant
Solution: increase the block size (increases miss penalty; very large blocks could increase the miss rate)

Capacity:
The cache cannot contain all blocks accessed by the program
Solution: increase the cache size (may increase access time)

Conflict (collision):
Multiple memory locations mapped to the same cache location
Solution 1: increase the cache size
Solution 2: increase associativity (may increase access time)

Page 70:

Cache Design Trade-offs

Design change            Effect on miss rate           Negative performance effect
Increase cache size      Decreases capacity misses     May increase access time
Increase associativity   Decreases conflict misses     May increase access time
Increase block size      Decreases compulsory misses   Increases miss penalty; for very large block sizes, may increase the miss rate due to pollution

Page 71:

Virtual Memory

§5.7 Virtual Memory

A memory management technique developed for multitasking computer architectures:
Virtualizes various forms of data storage
Allows a program to be designed as if there is only one type of memory, i.e., “virtual” memory
Each program runs in its own virtual address space
Uses main memory as a “cache” for secondary (disk) storage
Allows efficient and safe sharing of memory among multiple programs
Provides the ability to easily run programs larger than the size of physical memory
Simplifies loading a program for execution by providing for code relocation

Page 72:

Two Programs Sharing Physical Memory

A program's address space is divided into pages (fixed size)
The starting location of each page (either in main memory or in secondary memory) is contained in the program's page table

Figure: Program 1's and Program 2's virtual address spaces both map onto main memory, which acts as a “cache” of the hard drive.

Page 73:

Address Translation

Virtual Address (VA): virtual page number (bits 31 … 12) | page offset (bits 11 … 0)
        ↓ translation
Physical Address (PA): physical page number (bits 29 … 12) | page offset (bits 11 … 0)

So each memory request first requires an address translation from the virtual space to the physical space
A virtual memory miss (i.e., when the page is not in physical memory) is called a page fault
A virtual address is translated to a physical address by a combination of hardware and software
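A minimal sketch of the bit slicing above, assuming the 4KB pages implied by the 12-bit page offset; the page_table array and the mapping installed in main are hypothetical, purely for illustration:

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12                    /* 4KB pages, bits 11..0 */
#define NUM_VIRTUAL_PAGES (1u << 20)           /* 32-bit VA -> 2^20 virtual pages */

/* Hypothetical page table: VPN -> PPN (a real one also holds valid/dirty bits). */
static uint32_t page_table[NUM_VIRTUAL_PAGES];

static uint32_t translate(uint32_t va)
{
    uint32_t vpn    = va >> PAGE_OFFSET_BITS;             /* virtual page number */
    uint32_t offset = va & ((1u << PAGE_OFFSET_BITS) - 1);
    uint32_t ppn    = page_table[vpn];                     /* page table lookup   */
    return (ppn << PAGE_OFFSET_BITS) | offset;             /* physical address    */
}

int main(void)
{
    page_table[0x12345] = 0x00777;      /* pretend VPN 0x12345 maps to PPN 0x777 */
    uint32_t va = 0x12345ABC;
    printf("VA 0x%08X -> PA 0x%08X\n", (unsigned)va, (unsigned)translate(va));
    return 0;
}
```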

Page 74:

Page Fault Penalty

On a page fault, the page must be fetched from disk:
Takes millions of clock cycles
Handled by OS code

Try to minimize the page fault rate:
Fully associative placement – a page can be placed anywhere in main memory
Smart replacement algorithms

Page 75:

Page Tables

Store placement information:
An array of page table entries, indexed by virtual page number
The page table register in the CPU points to the page table in physical memory

If the page is present in memory:
The page table entry stores the physical page number
Plus other status bits (referenced, dirty, …)

If the page is not present:
The page table entry can refer to a location in swap space on disk
Swap space: the space on disk reserved for the full virtual memory space of a process

Page 76:

Address Translation Mechanisms

Figure: the page table register points to the page table (in main memory); the virtual page number indexes a page table entry (valid bit + physical page base address); for a valid entry, the physical page base address is combined with the page offset to form the physical address, while invalid entries point to pages in disk storage.

Page 77:

Translation Using a Page Table

Page 78:

Replacement and Writes

To reduce the page fault rate, prefer least-recently used (LRU) replacement:
Reference bit (aka use bit) in the PTE is set to 1 on access to the page
Periodically cleared to 0 by the OS
A page with reference bit = 0 has not been used recently

Disk writes take millions of cycles:
Write a whole page at once, not individual locations
Use write-back; write-through is impractical
Dirty bit in the PTE is set when the page is written

Page 79:

Virtual Addressing with a Cache

Thus it takes an extra memory access to translate a VA to a PA

Figure: CPU → (VA) → translation → (PA) → cache; on a cache miss the request goes to main memory, on a hit the data returns to the CPU.

This makes memory (cache) accesses very expensive (if every access were really two accesses)
The hardware fix is to use a Translation Lookaside Buffer (TLB) – a small cache that keeps track of recently used address mappings to avoid having to do a page table lookup

Page 80:

Fast Translation Using a TLB

Address translation would appear to require extra memory references:
One to access the PTE
Then the actual memory access (which checks the cache first)

But access to page tables has good locality
So use a fast cache of PTEs within the CPU, called a Translation Look-aside Buffer (TLB)
Typical: 16–512 PTEs, 0.5–1 cycle for a hit, 10–100 cycles for a miss, 0.01%–1% miss rate
Misses could be handled by hardware or software

Page 81:

Making Address Translation Fast

Figure: a TLB (valid bit, tag, physical page base address) caches recently used entries of the page table (in physical memory, pointed to by the page table register); page table entries in turn give the physical page base address in main memory or point to disk storage.

Page 82:

Fast Translation Using a TLB

Page 83:

Translation Lookaside Buffers (TLBs)

Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped

TLB entry: V | Tag | Physical Page # | Dirty | Ref | Access

TLB access time is typically smaller than cache access time (because TLBs are much smaller than caches)
TLBs are typically not more than 512 entries, even on high-end machines

Page 84:

TLB Misses

If the page is in memory:
Load the PTE from the page table into the TLB and retry
Could be handled in hardware – can get complex for more complicated page table structures
Or handled in software – raise a special exception, with an optimized handler

If the page is not in memory (page fault):
The OS handles fetching the page and updating the page table
Then restart the faulting instruction

Page 85:

TLB Miss Handler

A TLB miss indicates one of two possibilities:

Page present, but the PTE is not in the TLB:
The handler copies the PTE from memory to the TLB, then restarts the instruction

Page not present in main memory:
A page fault will occur

Page 86:

Page Fault Handler

1. Use the virtual address to find the PTE
2. Locate the page on disk
3. Choose a page to replace, if necessary
   If the chosen page is dirty, write it to disk first
4. Read the page into memory and update the page table
5. Make the process runnable again
   Restart from the faulting instruction

Page 87:

TLB and Cache Interaction

If the cache tag uses the physical address:
Need to translate before the cache lookup

Alternative: use a virtual address tag
Complications due to aliasing: different virtual addresses for the same shared physical address

Page 88:

TLB Event Combinations

TLB   Page table  Cache     Possible? Under what circumstances?
Hit   Hit         Hit       Yes – what we want!
Hit   Hit         Miss      Yes – although the page table is not checked if the TLB hits
Miss  Hit         Hit       Yes – TLB miss, PA in page table
Miss  Hit         Miss      Yes – TLB miss, PA in page table, but data not in cache
Miss  Miss        Miss      Yes – page fault
Hit   Miss        Miss/Hit  Impossible – TLB translation not possible if page is not present in memory
Miss  Miss        Hit       Impossible – data not allowed in cache if page is not in memory

Page 89:

Memory Protection

Different tasks can share parts of their virtual address spaces
But need to protect against errant access – requires OS assistance

Hardware support for OS protection:
Privileged supervisor mode (aka kernel mode)
Privileged instructions
Page tables and other state information accessible only in supervisor mode
System call exception (e.g., syscall in MIPS)

Page 90:

Concluding Remarks

§5.16 Concluding Remarks

Fast memories are small, large memories are slow
We really want fast, large memories – caching gives this illusion
Principle of locality: programs use a small part of their memory space frequently
Memory hierarchy: L1 cache, L2 cache, …, DRAM memory, disk
Memory system design is critical for multiprocessors