Ch 02. Memory Hierarchy Design (1)


Description: It shows the memory hierarchy in detail.

Transcript of Ch 02. Memory Hierarchy Design (1)

  • Computer Architecture: Memory Hierarchy Design

    Prof. Yong Ho Song, Spring 2015

  • Introduction

    Programmers want unlimited amounts of memory

    with low latency

    Fast memory technology is more expensive per bit

    than slower memory

    Solution: organize memory system into a hierarchy

    Entire addressable memory space available in largest, slowest

    memory

    Incrementally smaller and faster memories, each containing a

    subset of the memory below it, proceed in steps up toward the

    processor

  • Principle of Locality

    Temporal and spatial locality ensures that nearly all

    references can be found in smaller memories

    Gives the illusion of a large, fast memory being presented to the

    processor

    Led to hierarchies based on memories of different

    speeds and sizes

  • Memory Hierarchy

  • Memory Performance Gap

    [Figure: growth in core count and clock frequency vs. DRAM bandwidth; a
    multicore processor can demand 409.6 GB/s while LPDDR4 and Wide I/O SDR
    memories supply only about 25.8 GB/s]

  • Memory Hierarchy Design

    Memory hierarchy design becomes more crucial with

    recent multi-core processors:

    Aggregate peak bandwidth grows with # cores:

    Intel Core i7 can generate two references per core per clock

    Four cores and 3.2 GHz clock

    25.6 billion 64-bit data references/second +

    12.8 billion 128-bit instruction references

    = 409.6 GB/s!

    DRAM bandwidth is only 6% of this (25 GB/s)

    Requires:

    Multi-port, pipelined caches

    Two levels of cache per core

    Shared third-level cache on chip
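
    Checking the arithmetic behind the figures above (8 B per 64-bit data
    reference, 16 B per 128-bit instruction reference):

    \[ 4 \text{ cores} \times 2 \text{ refs} \times 3.2\,\text{GHz} = 25.6 \times 10^9 \text{ data refs/s}, \quad 4 \times 3.2\,\text{GHz} = 12.8 \times 10^9 \text{ instr refs/s} \]

    \[ 25.6 \times 10^9 \times 8\,\text{B} + 12.8 \times 10^9 \times 16\,\text{B} = 204.8 + 204.8 = 409.6\,\text{GB/s} \]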

  • Performance and Power

    High-end microprocessors have >10 MB on-chip cache

    Consumes large amount of area and power budget

    Caches account for 25% to 50% of the total power

    consumption

  • Memory Hierarchy Basics

    When a word is not found in the cache, a miss occurs:

    Fetch word from lower level in hierarchy, requiring a higher

    latency reference

    Lower level may be another cache or the main memory

    Also fetch the other words contained within the block

    Takes advantage of spatial locality

    Place block into cache in any location within its set, determined

    by address

    Takes advantage of temporal locality

    Set index = (Block address) MOD (#Sets)
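
    As a concrete illustration, a minimal C sketch of that address breakdown
    (the 64 B blocks and 128 sets are hypothetical values chosen only for the
    example):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical geometry: 64 B blocks, 128 sets (e.g., a 32 KB 4-way cache). */
    #define BLOCK_SIZE 64
    #define NUM_SETS   128

    int main(void) {
        uint64_t addr   = 0x7ffe12345678;     /* arbitrary example address */
        uint64_t block  = addr / BLOCK_SIZE;  /* block address */
        uint64_t offset = addr % BLOCK_SIZE;  /* byte within the block */
        uint64_t set    = block % NUM_SETS;   /* (Block address) MOD (#Sets) */
        uint64_t tag    = block / NUM_SETS;   /* stored and compared on lookup */

        printf("tag=%llx set=%llu offset=%llu\n",
               (unsigned long long)tag, (unsigned long long)set,
               (unsigned long long)offset);
        return 0;
    }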

  • Memory Hierarchy Basics

    N blocks in a set → N-way set associative

    Direct-mapped cache → one block per set

    Fully associative → all blocks in one set

    Writing to cache: two strategies

    Write-through

    Immediately update lower levels of hierarchy

    Write-back

    Only update lower levels of hierarchy when an updated block is

    replaced

    Both strategies use write buffer to make writes asynchronous

  • Memory Hierarchy Basics

    Miss rate

    Fraction of cache accesses that result in a miss

    Causes of misses

    Compulsory

    Very first reference to a block

    Even an infinite-sized cache cannot avoid this type of cache miss

    Capacity (in fully associative)

    Cache memory is full of loaded blocks

    Program working set is much larger than cache capacity

    Conflict (not in fully associative)

    Program makes repeated references to multiple addresses from

    different blocks that map to the same location in the cache

    Miss to the block which has been replaced

  • 3Cs Absolute Miss Rate (SPEC92)

    [Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB)
    for 1-way, 2-way, 4-way, and 8-way associativity, broken into conflict,
    capacity, and compulsory components]

    Note: compulsory misses are small

  • 2:1 Cache Rule

    [Figure: the same miss-rate-per-type chart, illustrating the 2:1 cache
    rule]

    Miss rate of a 1-way associative cache of size X ≈ miss rate of a 2-way
    associative cache of size X/2

  • Four Memory Hierarchy Questions

    Q1: Where can a block be placed in the upper level?

    (block placement)

    Q2: How is a block found if it is in the upper level?

    (block identification)

    Q3: Which block should be replaced on a miss? (block

    replacement)

    Q4: What happens on a write? (write strategy)


  • Q1: block placement

    [Figure: block placement in direct-mapped and set-associative caches]

  • Q2: block identification

    Cache entry

    Address tag to see if it matches the block address from the

    processor

    Valid bit to indicate whether or not the entry contains a valid

    address

    Processor address breakdown

    Index to select an entry or a set of entries to compare the tag

    Tag to match the selected entries

    Block offset to access the individual byte (or word) within the

    block


  • Q3: block replacement

    Random

    To spread allocation uniformly

    Some systems generate pseudorandom block numbers to get

    reproducible behavior.

    Least recently used (LRU)

    To replace the block that has been unused for the longest time

    If recently used blocks are likely to be used again, then a good

    candidate for disposal is the least recently used block

    First in, first out (FIFO)

    Because LRU can be complicated to calculate, this approximates

    LRU by determining the oldest block rather than the LRU.

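
    A minimal C sketch of LRU victim selection within one set (the per-block
    timestamp is an illustrative model; real hardware uses cheaper
    approximations such as LRU bits):

    #include <stdint.h>

    #define WAYS 4

    struct block {
        int      valid;
        uint64_t tag;
        uint64_t last_used;   /* logical time of the most recent access */
    };

    /* Pick the victim in a set: an invalid block if one exists,
       otherwise the least recently used block. */
    int choose_victim(const struct block set[WAYS]) {
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid)
                return w;            /* free slot: no replacement needed */
            if (set[w].last_used < set[victim].last_used)
                victim = w;          /* older access => better candidate */
        }
        return victim;
    }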

  • Q4: write strategy

    Handling write hit

    Write-through

    Easy to implement

    Ensures data coherency for multiprocessors and I/O

    Avoids write stalls by using a write buffer

    Write-back (requiring a dirty bit)

    Better performance

    Handling write miss

    Write allocate

    Write miss looks like a read miss from the memory's viewpoint

    No write allocate

    The block is modified only in the lower-level memory

    Blocks stay out of the cache until the program tries to read the blocks


  • Data Cache Sample: AMD Opteron

    64KB Data Cache

    Two-way Set Associative Cache

    64 Byte Block

  • Memory Access Performance

    Average memory access time

    For each memory access

    Revised average memory access time

    Memory accesses

    Instruction fetch

    Data access

    Average memory accesses per instruction

    1 (instruction fetch rate) + d (data access rate)

    d = #data-access instructions / total instruction count in a given program

    % instructions = 1 / (1 + d)

    % data = d / (1 + d)
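
    In equation form (the standard formulation these bullets refer to):

    \[ \text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty} \]

    and the revised version weights instruction and data accesses by the
    fractions above:

    \[ \text{AMAT}_{\text{revised}} = \frac{1}{1+d}\,\text{AMAT}_{\text{inst}} + \frac{d}{1+d}\,\text{AMAT}_{\text{data}} \]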

  • Cache Performance

    CPU execution time

    Extra CPI due to cache misses
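
    In their standard form (assuming cache misses are the only source of
    memory stalls), the equations behind these bullets are:

    \[ \text{CPU time} = \text{IC} \times \left( \text{CPI}_{\text{execution}} + \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty} \right) \times \text{Clock cycle time} \]

    where the miss term is exactly the extra CPI due to cache misses.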

  • Six Basic Cache Optimizations

    Larger block size

    Reduces compulsory misses

    Increases capacity and conflict misses, increases miss penalty

    Larger total cache capacity to reduce miss rate

    Increases hit time, increases power consumption

    Higher associativity

    Reduces conflict misses

    Increases hit time, increases power consumption

  • Six Basic Cache Optimizations

    Higher number of cache levels

    Reduces overall memory access time

    Giving priority to read misses over writes

    Reduces miss penalty

    Avoiding address translation in cache indexing

    Reduces hit time

  • Larger block size

  • Larger capacity and higher associativity

  • Larger capacity and higher associativity

  • Multi-level Cache

  • Relative execution time by L2 cache size

  • Higher priority to read misses

    Case 1: write-through cache

    Write buffer

    May hold the updated version of a location needed on a read miss

    Solution

    Make read miss wait until the write buffer is empty

    Or, check the write buffer on a read miss and if no conflicts, let the

    read miss continue

    Case 2: write-back cache

    Write buffer

    Put the replaced dirty block into the write buffer

    Then, read the missed block from the memory system


  • Physical vs. Virtual Caches

    Physical caches are addressed with physical addresses

    Virtual addresses are generated by the CPU

    Address translation is required, which may increase the hit time

    Virtual caches are addressed with virtual addresses

    Address translation is not required for a hit (only for a miss)

    [Diagram: the CPU core sends the VA to a virtual cache; a hit returns
    data directly, and the TLB supplies the PA only on a miss to main memory]

  • Physical vs. Virtual Caches

    one-step process in case of a hit (+)

    cache needs to be flushed on a context switch unless process identifiers (PIDs) included in tags (-)

    Aliasing problems due to the sharing of pages (-)

    maintaining cache coherence (-)

    [Diagram: physical cache, where the CPU sends the VA to the TLB and the
    resulting PA is used to access the cache and primary memory]

    Alternative: place the cache before the TLB

    [Diagram: virtual cache, where the CPU indexes the cache with the VA and
    the TLB translates to a PA only for accesses to primary memory]

  • Drawbacks of a Virtual Cache

    Protection bits must be associated with each cache block

    Whether it is read-only or read-write

    Flushing the virtual cache on a context switch

    To avoid mixing between virtual addresses of different processes

    Can be avoided or reduced using a process identifier tag (PID)

    Aliases

    Different virtual addresses map to same physical address

    Sharing code (shared libraries) and data between processes

    Copies of same block in a virtual cache

    Updates make duplicate blocks inconsistent

    Can't happen in a physical cache

  • Aliasing in Virtual-Address Caches

    [Diagram: the page table maps two virtual pages VA1 and VA2 to the same
    physical page PA; the virtual cache then holds two tag/data copies of
    the data at PA]

    Two virtual pages share one physical page

    Virtual cache can have two copies of the same physical data; writes to
    one copy are not visible to reads of the other!

    General solution: disallow aliases to coexist in the cache

    OS software solution for a direct-mapped cache: VAs of shared pages must
    agree in the cache index bits; this ensures all VAs accessing the same
    PA will conflict in a direct-mapped cache

  • Address Translation during Indexing

    To lookup a cache, we can distinguish between two tasks

    Indexing the cache: physical or virtual address can be used

    Comparing tags: physical or virtual address can be used

    Virtual caches eliminate address translation for a hit

    However, they cause many problems (protection, flushing, and aliasing)

    Best combination for an L1 cache

    Index the cache using virtual address

    Address translation can start concurrently with indexing

    The page offset is the same in both virtual and physical addresses

    Only part of the page offset can be used for indexing, which limits the cache size

    Compare tags using physical address

    Ensure that each cache block is given a unique physical address


  • Concurrent Access to TLB & Cache

    Index L is available without consulting the TLB

    cache and TLB accesses can begin simultaneously

    Tag comparison is made after both accesses are completed

    Cases: L ≤ k-b, and L > k-b (aliasing problem)

    [Diagram: the VA splits into a VPN and a k-bit page offset, itself split
    into an L-bit index and a b-bit block offset; the TLB translates the VPN
    to a PPN while the L-bit index selects one of 2^L blocks (2^b bytes
    each) in a direct-mapped cache; the PPN is then compared with the
    physical tag to signal a hit]
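
    The two cases reduce to a size constraint: the virtual index is free of
    translation only when the index and block-offset bits fit inside the
    page offset,

    \[ L + b \le k \iff 2^{L+b} \le 2^{k} = \text{page size} \]

    so a direct-mapped, virtually indexed cache larger than one page
    (L > k-b) draws index bits from the VPN, and aliasing becomes possible.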

  • Problem with L1 Cache size > Page size

    Virtual index now uses the lower a bits of the VPN

    VA1 and VA2 can map to the same PPN

    Aliasing problem: index bits L > k-b

    Can the OS ensure that the lower a bits of the VPN are the same in the
    PPN?

    [Diagram: the virtual index (L bits) extends a bits into the VPN; the
    TLB translates VPN to PPN, and two virtual addresses VA1 and VA2 with
    the same PPN can occupy two different lines of the direct-mapped L1
    cache]

  • Anti-Aliasing with Higher Associativity

    Set associative organization

    [Diagram: the virtual index L = k-b selects a set of 2^a ways of 2^L
    blocks each; the TLB translates the VPN to a PPN, which is compared
    against the physical tags of all ways]

    Using higher associativity: cache size > page size

    2^a physical tags are compared in parallel

    Cache size = 2^a × 2^L × 2^b > page size (2^k bytes)

  • Anti-Aliasing via Second Level Cache

    Usually a common L2 cache backs up both Instruction and Data L1 caches

    L2 is typically inclusive of both Instruction and Data caches

    [Diagram: the CPU's L1 instruction and data caches are backed by a
    unified L2 cache in front of memory]

  • Anti-Aliasing Using L2: MIPS R10000

    [Diagram: a virtually indexed, direct-mapped L1 backed by a
    direct-mapped, physically addressed L2; VA1 and VA2 index different L1
    lines but the same L2 line]

    Suppose VA1 and VA2 (VA1 ≠ VA2) both map to the same PPN, and VA1 is
    already in L1 and L2

    After VA2 is resolved to a PA, a collision will be detected in L2

    VA1 will be purged from L1 and L2, and VA2 will be loaded → no aliasing!

  • Avoiding address translation time

  • Virtually indexed, physically tagged

  • Small and Simple First Level Caches

    Small and simple first level caches

    Critical timing path:

    addressing tag memory, then

    comparing tags, then

    selecting correct set

    Direct-mapped caches can overlap tag compare and transmission

    of data

    Lower associativity reduces power because fewer cache lines are

    accessed

    1st Optimization

  • L1 Size and Associativity

    Access time vs. size and associativity

    1st Optimization

  • L1 Size and Associativity

    Energy per read vs. size and associativity

    1st Optimization

  • Way Prediction

    To improve hit time, predict the way to pre-set mux

    Only a single tag comparison is performed

    Miss results in checking the other blocks for matches in the next

    clock cycle (longer miss penalty)

    Prediction accuracy

    > 90% for two-way, > 80% for four-way

    I-cache has better accuracy than D-cache

    Used on ARM Cortex-A8

    Way selection

    Extend to predict block as well

    Increases mis-prediction penalty

    2nd Optimization

  • Pipelined Cache Access

    Pipeline cache access to improve bandwidth

    Examples:

    Pentium: 1 cycle

    Pentium Pro to Pentium III: 2 cycles

    Pentium 4 to Core i7: 4 cycles

    Increases branch mis-prediction penalty

    Makes it easier to increase associativity

    3rd Optimization


  • Non-blocking Caches

    Allows the CPU to continue executing instructions

    while a miss is handled

    some processors allow only 1 outstanding miss (hit under miss)

    some processors allow multiple misses outstanding (miss under

    miss)

    can be used with both in-order and out-of-order

    processors

    in-order processors stall when an instruction that uses the load

    data is the next instruction to be executed (non-blocking loads)

    out-of-order processors can execute instructions after the load

    consumer


    4th Optimization

  • Miss Status Handling Register

    Also called miss buffer

    Keeps track of

    Outstanding cache misses

    Pending load/store accesses that refer to the missing cache block

    Fields of a single MSHR

    Valid bit

    Cache block address (to match incoming accesses)

    Control/status bits (prefetch, issued to memory, which subblocks

    have arrived, etc)

    Data for each subblock

    For each pending load/store

    Valid, type, data size, byte in block, destination register or store buffer entry address

    4th Optimization
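
    A rough C sketch of one MSHR with these fields (all sizes and names are
    assumptions made for illustration, not a real design):

    #include <stdbool.h>
    #include <stdint.h>

    #define SUBBLOCKS       4
    #define SUBBLOCK_BYTES 16
    #define MAX_TARGETS     4    /* pending loads/stores per MSHR */

    struct mshr_target {                 /* one pending load/store */
        bool    valid;
        bool    is_store;                /* type */
        uint8_t size;                    /* data size in bytes */
        uint8_t byte_in_block;           /* offset of the access in the block */
        uint8_t dest;                    /* dest register or store-buffer entry */
    };

    struct mshr {
        bool     valid;
        uint64_t block_addr;             /* matched against incoming accesses */
        bool     is_prefetch;            /* control/status bits */
        bool     issued_to_memory;
        bool     subblock_arrived[SUBBLOCKS];
        uint8_t  data[SUBBLOCKS][SUBBLOCK_BYTES];
        struct mshr_target targets[MAX_TARGETS];
    };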

  • Miss Status Handling Register

    4th Optimization

  • MSHR Operation

    On a cache miss:

    Search MSHR for a pending access to the same block

    Found: Allocate a load/store entry in the same MSHR entry

    Not found: Allocate a new MSHR

    No free entry: stall

    When a subblock returns from the next level in

    memory

    Check which loads/stores waiting for it

    Forward data to the load/store unit

    Deallocate load/store entry in the MSHR entry

    Write subblock in cache or MSHR

    If last subblock, deallocate MSHR (after writing the block in cache)

    4th Optimization

  • Non-blocking Caches

    4th Optimization

  • Multibanked Caches

    Organize cache as independent banks to support

    simultaneous access

    ARM Cortex-A8 supports 1-4 banks for L2

    Intel i7 supports 4 banks for L1 and 8 banks for L2

    Interleave banks according to block address

    Works best when accesses naturally spread across

    banks

    5th Optimization

  • Critical Word First, Early Restart

    Critical word first

    Request missed word from

    memory first

    Send it to the processor as soon

    as it arrives

    Early restart

    Request words in normal order

    Send the missed word to the processor as soon as it arrives

    Effectiveness of these strategies depends on block

    size and likelihood of another access to the portion of

    the block that has not yet been fetched

    6th Optimization

  • Merging Write Buffer

    When storing to a block that is already pending in the

    write buffer, update write buffer

    Reduces stalls due to full write buffer

    Multiword writes are usually faster than writes

    performed one word at a time

    [Figure: write-buffer contents with write merging vs. without write
    merging]

    7th Optimization
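
    A minimal sketch of the merge check in C, assuming a word-granularity
    buffer over 64 B blocks (all names here are hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    #define WORDS_PER_ENTRY 8    /* one 64 B block = 8 x 8 B words */

    struct wb_entry {
        bool     valid;
        uint64_t block_addr;                   /* aligned block address */
        bool     word_valid[WORDS_PER_ENTRY];
        uint64_t word[WORDS_PER_ENTRY];
    };

    /* Try to merge a store into an entry already waiting on the same block.
       Returns false if a new entry must be allocated instead. */
    bool try_merge(struct wb_entry *buf, int entries,
                   uint64_t addr, uint64_t data) {
        uint64_t block = addr & ~(uint64_t)(WORDS_PER_ENTRY * 8 - 1);
        int      w     = (int)((addr >> 3) & (WORDS_PER_ENTRY - 1));
        for (int i = 0; i < entries; i++) {
            if (buf[i].valid && buf[i].block_addr == block) {
                buf[i].word[w]       = data;   /* merge: overwrite the word */
                buf[i].word_valid[w] = true;
                return true;
            }
        }
        return false;
    }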

  • Compiler Optimizations

    Loop Interchange

    Swap nested loops to access memory in sequential order

    Blocking

    Instead of accessing entire rows or columns, subdivide matrices

    into blocks

    Requires more memory accesses but improves locality of

    accesses

    8th Optimization

  • Loop Interchange

    Swap nested loops to access memory in sequential order

    /* Before */
    for (k = 0; k < 100; k = k+1)
        for (j = 0; j < 100; j = j+1)
            for (i = 0; i < 5000; i = i+1)
                x[i][j] = 2 * x[i][j];

    /* After */
    for (k = 0; k < 100; k = k+1)
        for (i = 0; i < 5000; i = i+1)
            for (j = 0; j < 100; j = j+1)
                x[i][j] = 2 * x[i][j];

    [Figure: memory layout of x at addresses 0x0000, 0x4E20, 0x9C40, 0xEA60;
    before interchange the traversal strides between rows, after interchange
    it sweeps each row sequentially]

    8th Optimization

  • Blocking

    8th Optimization
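
    A standard concrete instance is blocked matrix multiply; a C sketch,
    with the blocking factor B chosen (as an assumption) so that three B x B
    tiles of doubles fit comfortably in a 32 KB cache:

    #define N 512
    #define B 32     /* 3 * 32 * 32 * 8 B = 24 KB of active tiles */

    /* Accumulates x += y * z, touching B-wide tiles of y and z so each
       tile stays cache-resident while it is reused. */
    void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double r = x[i][j];
                        for (int k = kk; k < kk + B; k++)
                            r += y[i][k] * z[k][j];
                        x[i][j] = r;
                    }
    }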


  • Hardware Prefetching

    Fetch two blocks on miss (include next sequential

    block)

    Prefetched block is placed into the instruction stream

    buffer

    9th Optimization

  • Hardware Prefetching

    Instruction Prefetching

    Typically, CPU fetches 2 blocks on a miss: the requested block and the

    next consecutive block.

    Requested block is placed in instruction cache when it returns, and

    prefetched block is placed into instruction stream buffer

    Data Prefetching

    Pentium 4 can prefetch data into L2 cache from up to 8 streams from 8

    different 4 KB pages

    Prefetching invoked if 2 successive L2 cache misses to a page,

    if distance between those cache blocks is < 256 bytes


  • Issues in Prefetching

    Usefulness: should produce hits

    Timeliness: not too late and not too early

    Cache and bandwidth pollution

    [Diagram: prefetched data flows from memory into the unified L2 cache
    and the L1 data/instruction caches feeding the CPU register file]

  • Hardware Instruction Prefetching

    Instruction prefetch in Alpha AXP 21064

    Fetch two blocks on a miss; the requested block (i) and the next

    consecutive block (i+1)

    Requested block placed in cache, and next block in instruction

    stream buffer

    If miss in cache but hit in stream buffer, move stream buffer block

    into cache and prefetch next block (i+2)

    [Diagram: on a miss, the requested block goes from the unified L2 into
    the L1 instruction cache while the next block is placed in a stream
    buffer beside it]

  • Hardware Data Prefetching

    Prefetch-on-miss:

    Prefetch b + 1 upon miss on b

    One Block Lookahead (OBL) scheme

    Initiate prefetch for block b + 1 when block b is accessed

    Why is this different from doubling block size?

    Can extend to N block lookahead

    Strided prefetch

    If a sequence of accesses to blocks b, b+N, b+2N is observed, then
    prefetch b+3N, etc.

    Example

    IBM Power 5 supports 8 independent streams of strided prefetch

    per processor, prefetching 12 lines ahead of current access


  • Hardware Prefetching

    Fetch two blocks on miss (include next sequential

    block)

    Pentium 4 Pre-fetching

    9th Optimization

  • Compiler Prefetching

    Insert prefetch instructions before data is needed

    Non-faulting: prefetch doesn't cause exceptions

    Register prefetch

    Loads data into register

    Cache prefetch

    Loads data into cache

    Combine with loop unrolling and software pipelining

    10th Optimization
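
    A hand-inserted cache-prefetch sketch in C, using GCC/Clang's
    __builtin_prefetch (non-faulting, so running past the end of the array
    is harmless); the distance of 16 elements is an assumed tuning value:

    #define DIST 16    /* prefetch distance: tune to cover memory latency */

    double sum_with_prefetch(const double *a, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            /* args: address, 0 = read, 3 = high temporal locality */
            __builtin_prefetch(&a[i + DIST], 0, 3);
            s += a[i];
        }
        return s;
    }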

  • Compiler Prefetching

    Insert prefetch instructions before data is needed

    Timing is the biggest issue, not predictability

    If you prefetch very close to when the data is required, you

    might be too late

    Prefetching too early causes pollution

    Estimate how long it will take for the data to arrive in L1, so the
    prefetch distance P can be set appropriately

    10th Optimization

  • Software Prefetching Data

    Data Prefetch

    Load data into register (HP PA-RISC loads)

    Cache Prefetch: load into cache

    (MIPS IV, PowerPC, SPARC V9)

    Special prefetching instructions cannot cause faults;

    a form of speculative execution

    Issuing prefetch instructions takes time

    Is the cost of issuing prefetches < the savings in reduced misses?

    Wider superscalar issue reduces the difficulty of finding issue bandwidth


  • Summary

  • Memory Technology

    Performance metrics

    Latency is the concern of caches

    Bandwidth is the concern of multiprocessors and I/O

    Access time

    Time between read request and when desired word arrives

    Cycle time

    Minimum time between unrelated requests to memory

    DRAM used for main memory, SRAM used for cache

  • Memory Technology

    SRAM

    Requires low power to retain bit

    Requires 6 transistors/bit

    DRAM

    Must be re-written after being read

    Must also be periodically refreshed

    Every ~ 8 ms

    Each row can be refreshed simultaneously

    One transistor/bit

    Address lines are multiplexed:

    Upper half of address: row access strobe (RAS)

    Lower half of address: column access strobe (CAS)

  • Memory Technology

    Amdahl:

    Memory capacity should grow linearly with processor speed

    Unfortunately, memory capacity and speed have not kept pace

    with processors

    Some optimizations:

    Multiple accesses to same row

    Synchronous DRAM

    Added clock to DRAM interface

    Burst mode with critical word first

    Wider interfaces

    Double data rate (DDR)

    Multiple banks on each DRAM device

  • Memory Optimizations

  • Memory Optimizations

  • Memory Optimizations

    DDR (Double Data Rate)

    DDR2

    Lower power (2.5 V -> 1.8 V)

    Higher clock rates (266 MHz, 333 MHz, 400 MHz)

    DDR3

    1.5 V

    800 MHz

    DDR4

    1-1.2 V

    1600 MHz

    GDDR5 is graphics memory based on DDR3

  • Memory Optimizations

    Graphics memory

    Achieve 2-5 X bandwidth per DRAM vs. DDR3

    Wider interfaces (32 vs. 16 bit)

    Higher clock rate

    Possible because they are attached via soldering instead of socketed DIMM modules

    Reducing power in SDRAMs

    Lower voltage

    Low power mode (ignores clock, continues to refresh)

  • Memory Power Consumption

  • Flash Memory

    Type of EEPROM

    Must be erased (in blocks) before being overwritten

    Nonvolatile

    Limited number of write cycles

    Cheaper than SDRAM, more expensive than disk

    Slower than SDRAM, faster than disk

  • Memory Dependability

    Memory is susceptible to cosmic rays

    Soft errors: dynamic errors

    Detected and fixed by error correcting codes (ECC)

    Hard errors: permanent errors

    Use spare rows to replace defective rows

    Chipkill: a RAID-like error recovery technique

  • Virtual Memory

    Protection via virtual memory

    Keeps processes in their own memory space

    Role of architecture:

    Provide user mode and supervisor mode

    Protect certain aspects of CPU state

    Provide mechanisms for switching between user mode and

    supervisor mode

    Provide mechanisms to limit memory accesses

    Provide TLB to translate addresses

  • Virtual Memory Concepts

    What is Virtual Memory?

    Uses disk as an extension to memory system

    Main memory acts like a cache to hard disk

    Each process has its own virtual address space

    Page: a virtual memory block

    Page fault: a memory miss

    Page is not in main memory → transfer the page from disk to memory

    Address translation:

    CPU and OS translate virtual addresses to physical addresses

    Page Size:

    Size of a page in memory and on disk

    Typical page size = 4 KB to 16 KB

  • Virtual Memory Concepts

    A program's address space is divided into pages

    All pages have the same fixed size (simplifies their allocation)

    [Diagram: the virtual address spaces of two programs map page by page
    into main memory]

    Pages are either in main memory or in secondary storage (hard disk)

  • Issues in Virtual Memory

    Page Size

    Small page sizes ranging from 4KB to 16KB are typical today

    Large page size can be 1MB to 4MB (reduces page table size)

    Recent processors support multiple page sizes

    Placement Policy and Locating Pages

    Fully associative placement is typical to reduce page faults

    Pages are located using a structure called a page table

    Page table maps virtual pages onto physical page frames

    Handling Page Faults and Replacement Policy

    Page faults are handled in software by the operating system

    Replacement algorithm chooses which page to replace in memory

    Write Policy

    Write-through will not work, since writes take too long

    Instead, virtual memory systems use write-back


  • Three Advantages of Virtual Memory

    Memory Management

    Programs are given contiguous view of memory

    Pages have the same size, which simplifies memory allocation

    Physical page frames need not be contiguous

    Only the Working Set of program must be in physical memory

    Stacks and Heaps can grow

    Use only as much physical memory as necessary

    Protection

    Different processes are protected from each other

    Different pages can be given special behavior (read only, etc)

    Kernel data protected from User programs

    Protection against malicious programs

    Sharing

    Can map the same physical page to multiple users → shared memory

  • Page Table and Address Mapping

    Page table register: contains the address of the page table

    [Diagram: a 32-bit virtual address splits into a 20-bit virtual page
    number and a 12-bit page offset; the page table register points to the
    page table, which is indexed by the virtual page number; each entry
    holds a valid bit (if 0, the page is not present in memory) and an
    18-bit physical page number, which is concatenated with the page offset
    to form the physical address]

    Page table maps virtual page numbers to physical frames

    Virtual page number is used as an index into the page table

    Page table entry (PTE): describes the page and its usage

  • Page Table cont'd

    Each process has a page table

    The page table defines the address space of a process

    Address space: set of page frames that can be accessed

    Page table is stored in main memory

    Can be modified only by the Operating System

    Page table register

    Contains the physical address of the page table in memory

    Processor uses this register to locate page table

    Page table entry

    Contains information about a single page

    Valid bit specifies whether page is in physical memory

    Physical page number specifies the physical page address

    Additional bits are used to specify protection and page use


  • Size of the Page Table

    One-level table is simplest to implement

    Each page table entry is typically 4 bytes

    With 4K pages and a 32-bit virtual address space, we need:

    2^32 / 2^12 = 2^20 entries × 4 bytes = 4 MB

    With 4K pages and a 48-bit virtual address space, we need:

    2^48 / 2^12 = 2^36 entries × 4 bytes = 2^38 bytes = 256 GB !

    Cannot keep the whole page table in memory!

    Most of the virtual address space is unused

    [Diagram: 32-bit virtual address = 20-bit virtual page number + 12-bit
    page offset]

  • Reducing the Page Table Size

    Use a limit register to restrict the size of the page table

    If virtual page number > limit register, then page is not allocated

    Requires that the address space expand in only one direction

    Divide the page table into two tables with two limits

    One table grows from lowest address up and used for the heap

    One table grows from highest address down and used for stack

    Does not work well when the address space is sparse

    Use a Multiple-Level (Hierarchical) Page Table

    Allows the address space to be used in a sparse fashion

    Sparse allocation without the need to allocate the entire page table

    Primary disadvantage is multiple level address translation


  • Multi-Level Page Table

    [Diagram: a processor register holds the root of the current page table;
    the virtual address splits into a 10-bit L1 index p1 (bits 31-22), a
    10-bit L2 index p2 (bits 21-12), and a 12-bit offset (bits 11-0); p1
    selects a level-2 page table and p2 selects the data page; pages and
    PTEs may reside in primary memory, in secondary memory, or be
    nonexistent]
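
    A minimal C model of the two-level walk under the 10/10/12 split shown
    above (the PTE layout, with a valid bit in bit 0 and a frame address in
    the upper bits, is a hypothetical simplification):

    #include <stdint.h>

    #define PTE_VALID 0x1u

    typedef struct { uint32_t pte[1024]; } page_table_t;  /* 2^10 entries */

    /* Translate a 32-bit VA; returns 1 and fills *pa, or 0 on a page fault.
       For simplicity the level-2 table pointer is treated as directly
       dereferenceable. */
    int translate(const page_table_t *root, uint32_t va, uint32_t *pa) {
        uint32_t p1     = (va >> 22) & 0x3ff;   /* 10-bit L1 index */
        uint32_t p2     = (va >> 12) & 0x3ff;   /* 10-bit L2 index */
        uint32_t offset =  va        & 0xfff;   /* 12-bit offset */

        uint32_t l1e = root->pte[p1];
        if (!(l1e & PTE_VALID)) return 0;       /* no level-2 table: fault */
        const page_table_t *l2 =
            (const page_table_t *)(uintptr_t)(l1e & ~0xfffu);

        uint32_t l2e = l2->pte[p2];
        if (!(l2e & PTE_VALID)) return 0;       /* nonexistent page: fault */
        *pa = (l2e & ~0xfffu) | offset;
        return 1;
    }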

  • Variable-Sized Page Support

    [Diagram: the same two-level structure, but a level-1 PTE may point
    directly to a large 4 MB page in primary memory instead of a level-2
    page table]

  • Hashed Page Table

    [Diagram: the VPN (plus PID) is hashed, and the hash index added to the
    table base gives the physical address of a PTE; each entry holds
    (PID, VPN, PPN) and a link for chaining on collisions]

    One table for all processes

    Table is only a small fraction of memory

    Number of entries is 2 to 3 times the number of page frames, to reduce
    collision probability

    Hash function for address translation

    Search through a chain of page table entries

  • Handling a Page Fault

    Page fault: requested page is not in memory

    The missing page is located on disk or created

    Page is brought from disk and Page table is updated

    Another process may be run on the CPU while the first process

    waits for the requested page to be read from disk

    If no free pages are left, a page is swapped out

    Pseudo-LRU replacement policy

    Reference bit in each page table entry

    Each time a page is accessed, set reference bit =1

    OS periodically clears the reference bits

    Page faults are handled completely in software by the

    OS

    It takes milliseconds to transfer a page from disk to memory


  • Write Policy

    Write through does not work

    Takes millions of processor cycles to write disk

    Write back

    Individual writes are accumulated into a page

    The page is copied back to disk only when the page is replaced

    Dirty bit

    1 if the page has been written

    0 if the page never changed


  • TLB = Translation Lookaside Buffer

    [Diagram: the CPU core sends the VA to the TLB, which supplies the PA to
    the cache; a hit returns data, a miss goes to main memory]

    Address translation is very expensive

    Must translate virtual memory address on every memory access

    With a multilevel page table, each translation takes several memory
    accesses

    Solution: TLB for address translation

    Keep track of most common translations in the TLB

    TLB = Cache for address translation


  • Translation Lookaside Buffer

    [Diagram: the virtual address (VPN = virtual page number, plus offset)
    indexes the TLB; on a hit the stored PPN (physical page number) replaces
    the VPN to form the physical address; each entry also carries V/R/W/D
    bits]

    TLB hit: Fast single cycle translation

    TLB miss: Slow page table translation

    Must update TLB on a TLB miss

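
    A direct-mapped TLB lookup could be modeled in C as follows (the entry
    count, 4 KB pages, and field layout are assumptions for the sketch):

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64

    struct tlb_entry {
        bool     valid;      /* V bit; R/W/D bits omitted for brevity */
        uint64_t vpn;
        uint64_t ppn;
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Fast path: returns true on a hit and fills *pa. On a miss the caller
       must walk the page table and then update the TLB. */
    bool tlb_lookup(uint64_t va, uint64_t *pa) {
        uint64_t vpn = va >> 12;                       /* 4 KB pages */
        struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
        if (e->valid && e->vpn == vpn) {
            *pa = (e->ppn << 12) | (va & 0xfff);
            return true;                               /* single-cycle hit */
        }
        return false;                                  /* slow path */
    }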

  • Address Translation & Protection

    Every instruction and data access needs address translation and protection checks

    Check whether page is read only, writable, or executable

    Check whether it can be accessed by the user or kernel only

    [Diagram: the virtual address (VPN + offset) passes through address
    translation to the physical address (PPN + offset), while a parallel
    protection check, using kernel/user mode and read/write intent, can
    raise an exception]

  • Handling TLB Misses and Page Faults

    TLB miss: No entry in the TLB matches a virtual address

    TLB miss can be handled in software or in hardware

    Lookup page table entry in memory to bring into the TLB

    If page table entry is valid then retrieve entry into the TLB

    If page table entry is invalid then page fault (page is not in memory)

    Handling a Page Fault

    Interrupt the active process that caused the page fault

    Program counter of instruction that caused page fault must be saved

    Instruction causing page fault must not modify registers or memory

    Transfer control to the operating system to transfer the page

    Restart later the instruction that caused the page fault


  • Handling a TLB Miss

    Software (MIPS, Alpha)

    TLB miss causes an exception and the operating system walks

    the page tables and reloads TLB

    A privileged addressing mode is used to access page tables

    Hardware (SPARC v8, x86, PowerPC)

    A memory management unit (MMU) walks the page tables and

    reloads the TLB

    If a missing (data or PT) page is encountered during the TLB

    reloading, MMU gives up and signals a Page-Fault exception for

    the original instruction. The page fault is handled by the OS

    software.


  • Address Translation Summary

    [Flowchart: TLB lookup (hardware). On a hit, a protection check follows:
    if permitted, the physical address goes to the cache; if denied, a
    protection fault (SEGFAULT). On a miss, a page table walk (hardware or
    software): if the page is in memory, the TLB is updated; if not, a page
    fault (software, the OS loads the page) and the instruction is
    restarted.]

  • TLB, Page Table, Cache Combinations

    TLB   Page Table  Cache     Possible? Under what circumstances?

    Hit   Hit         Hit       Yes: what we want!

    Hit   Hit         Miss      Yes: although the page table is not checked
                                if the TLB hits

    Miss  Hit         Hit       Yes: TLB miss, PA in page table

    Miss  Hit         Miss      Yes: TLB miss, PA in page table, but data
                                not in cache

    Miss  Miss        Miss      Yes: page fault (page is on disk)

    Hit   Miss        Hit/Miss  Impossible: TLB translation not possible if
                                page is not in memory

    Miss  Miss        Hit       Impossible: data not allowed in cache if
                                page is not in memory

  • Address Translation in CPU Pipeline

    Software handlers need restartable exception on page fault

    Handling a TLB miss needs a hardware or software mechanism to refill TLB

    Need mechanisms to cope with the additional latency of a TLB

    Slow down the clock

    Pipeline the TLB and cache access

    Virtual address caches

    Parallel TLB/cache access

    [Diagram: a five-stage pipeline with an instruction TLB in front of the
    instruction cache and a data TLB in front of the data cache; either TLB
    can signal a TLB miss, page fault, or protection violation]

  • Virtual Machines

    Supports isolation and security

    Sharing a computer among many unrelated users

    Enabled by raw speed of processors, making the

    overhead more acceptable

    Allows different ISAs and operating systems to be

    presented to user programs

    System Virtual Machines

    SVM software is called virtual machine monitor or hypervisor

    Individual virtual machines run under the monitor are called

    guest VMs

  • Impact of VMs on Virtual Memory

    Each guest OS maintains its own set of page tables

    VMM adds a level of memory between physical and virtual

    memory called real memory

    VMM maintains shadow page table that maps guest virtual

    addresses to physical addresses

    Requires the VMM to detect the guest's changes to its own page table

    Occurs naturally if accessing the page table pointer is a privileged

    operation

  • End of Chapter