Ch 02. Memory Hierarchy Design (1)


Description: It shows the memory hierarchy in detail.

Transcript of Ch 02. Memory Hierarchy Design (1)

  • Computer Architecture: Memory Hierarchy Design

    Prof. Yong Ho Song, Spring 2015

  • Introduction

    Programmers want unlimited amounts of memory

    with low latency

    Fast memory technology is more expensive per bit

    than slower memory

    Solution: organize memory system into a hierarchy

    Entire addressable memory space available in largest, slowest

    memory

    Incrementally smaller and faster memories, each containing a

    subset of the memory below it, proceed in steps up toward the

    processor

  • Principle of Locality

    Temporal and spatial locality ensures that nearly all

    references can be found in smaller memories

    Gives the illusion of a large, fast memory being presented to the

    processor

    Led to hierarchies based on memories of different

    speeds and sizes

  • Memory Hierarchy

  • Memory Performance Gap

    [Figure: growth in core count and clock frequency vs. DRAM bandwidth; a
    multicore processor can demand 409.6 GB/s while LPDDR4 and Wide I/O SDR
    memories supply only about 25.8 GB/s]

  • Memory Hierarchy Design

    Memory hierarchy design becomes more crucial with

    recent multi-core processors:

    Aggregate peak bandwidth grows with # cores:

    Intel Core i7 can generate two references per core per clock

    Four cores and 3.2 GHz clock

    25.6 billion 64-bit data references/second +

    12.8 billion 128-bit instruction references

    = 409.6 GB/s!

    DRAM bandwidth is only 6% of this (25 GB/s)

    Requires:

    Multi-port, pipelined caches

    Two levels of cache per core

    Shared third-level cache on chip
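
    Checking the arithmetic behind the figures above (8 B per 64-bit data
    reference, 16 B per 128-bit instruction reference):

    \[ 4 \text{ cores} \times 2 \text{ refs} \times 3.2\,\text{GHz} = 25.6 \times 10^9 \text{ data refs/s}, \quad 4 \times 3.2\,\text{GHz} = 12.8 \times 10^9 \text{ instr refs/s} \]

    \[ 25.6 \times 10^9 \times 8\,\text{B} + 12.8 \times 10^9 \times 16\,\text{B} = 204.8 + 204.8 = 409.6\,\text{GB/s} \]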

  • Performance and Power

    High-end microprocessors have >10 MB on-chip cache

    Consumes large amount of area and power budget

    Caches account for 25% to 50% of the total power

    consumption

  • Memory Hierarchy Basics

    When a word is not found in the cache, a miss occurs:

    Fetch word from lower level in hierarchy, requiring a higher

    latency reference

    Lower level may be another cache or the main memory

    Also fetch the other words contained within the block

    Takes advantage of spatial locality

    Place block into cache in any location within its set, determined

    by address

    Takes advantage of temporal locality

    Set index = (Block address) MOD (#Sets)
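
    As a concrete illustration, a minimal C sketch of that address breakdown
    (the 64 B blocks and 128 sets are hypothetical values chosen only for the
    example):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical geometry: 64 B blocks, 128 sets (e.g., a 32 KB 4-way cache). */
    #define BLOCK_SIZE 64
    #define NUM_SETS   128

    int main(void) {
        uint64_t addr   = 0x7ffe12345678;     /* arbitrary example address */
        uint64_t block  = addr / BLOCK_SIZE;  /* block address */
        uint64_t offset = addr % BLOCK_SIZE;  /* byte within the block */
        uint64_t set    = block % NUM_SETS;   /* (Block address) MOD (#Sets) */
        uint64_t tag    = block / NUM_SETS;   /* stored and compared on lookup */

        printf("tag=%llx set=%llu offset=%llu\n",
               (unsigned long long)tag, (unsigned long long)set,
               (unsigned long long)offset);
        return 0;
    }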

  • Memory Hierarchy Basics

    N blocks in a set → N-way set associative

    Direct-mapped cache → one block per set

    Fully associative → all blocks in one set

    Writing to cache: two strategies

    Write-through

    Immediately update lower levels of hierarchy

    Write-back

    Only update lower levels of hierarchy when an updated block is

    replaced

    Both strategies use write buffer to make writes asynchronous

  • Memory Hierarchy Basics

    Miss rate

    Fraction of cache accesses that result in a miss

    Causes of misses

    Compulsory

    Very first reference to a block

    Even an infinite-sized cache cannot avoid this type of cache miss

    Capacity (in fully associative)

    Cache memory is full of loaded blocks

    Program working set is much larger than cache capacity

    Conflict (not in fully associative)

    Program makes repeated references to multiple addresses from

    different blocks that map to the same location in the cache

    Miss to the block which has been replaced

  • 3Cs Absolute Miss Rate (SPEC92)

    [Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB)
    for 1-way, 2-way, 4-way, and 8-way associativity, broken into conflict,
    capacity, and compulsory components]

    Note: compulsory misses are small

  • 2:1 Cache Rule

    [Figure: the same miss-rate-per-type chart, illustrating the 2:1 cache
    rule]

    Miss rate of a 1-way associative cache of size X ≈ miss rate of a 2-way
    associative cache of size X/2

  • Four Memory Hierarchy Questions

    Q1: Where can a block be placed in the upper level?

    (block placement)

    Q2: How is a block found if it is in the upper level?

    (block identification)

    Q3: Which block should be replaced on a miss? (block

    replacement)

    Q4: What happens on a write? (write strategy)


  • Q1: block placement

    [Figure: block placement in direct-mapped and set-associative caches]

  • Q2: block identification

    Cache entry

    Address tag to see if it matches the block address from the

    processor

    Valid bit to indicate whether or not the entry contains a valid

    address

    Processor address breakdown

    Index to select an entry or a set of entries to compare the tag

    Tag to match the selected entries

    Block offset to access the individual byte (or word) within the

    block


  • Q3: block replacement

    Random

    To spread allocation uniformly

    Some systems generate pseudorandom block numbers to get

    reproducible behavior.

    Least recently used (LRU)

    To replace the block that has been unused for the longest time

    If recently used blocks are likely to be used again, then a good

    candidate for disposal is the least recently used block

    First in, first out (FIFO)

    Because LRU can be complicated to calculate, this approximates

    LRU by determining the oldest block rather than the LRU.

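
    A minimal C sketch of LRU victim selection within one set (the per-block
    timestamp is an illustrative model; real hardware uses cheaper
    approximations such as LRU bits):

    #include <stdint.h>

    #define WAYS 4

    struct block {
        int      valid;
        uint64_t tag;
        uint64_t last_used;   /* logical time of the most recent access */
    };

    /* Pick the victim in a set: an invalid block if one exists,
       otherwise the least recently used block. */
    int choose_victim(const struct block set[WAYS]) {
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid)
                return w;            /* free slot: no replacement needed */
            if (set[w].last_used < set[victim].last_used)
                victim = w;          /* older access => better candidate */
        }
        return victim;
    }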

  • Q4: write strategy

    Handling write hit

    Write-through

    Easy to implement

    Ensures data coherency for multiprocessors and I/O

    Avoids write stalls by using a write buffer

    Write-back (requiring a dirty bit)

    Better performance

    Handling write miss

    Write allocate

    Write miss looks like a read miss from the memory's viewpoint

    No write allocate

    The block is modified only in the lower-level memory

    Blocks stay out of the cache until the program tries to read the blocks


  • Data Cache Sample: AMD Opteron

    64KB Data Cache

    Two-way Set Associative Cache

    64 Byte Block

  • Memory Access Performance

    Average memory access time

    For each memory access

    Revised average memory access time

    Memory accesses

    Instruction fetch

    Data access

    Average memory accesses per instruction

    1 (instruction fetch rate) + d (data access rate)

    d = #data-access instructions / total instruction count in a given program

    % instructions = 1 / (1 + d)

    % data = d / (1 + d)
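
    In equation form (the standard formulation these bullets refer to):

    \[ \text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty} \]

    and the revised version weights instruction and data accesses by the
    fractions above:

    \[ \text{AMAT}_{\text{revised}} = \frac{1}{1+d}\,\text{AMAT}_{\text{inst}} + \frac{d}{1+d}\,\text{AMAT}_{\text{data}} \]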

  • Cache Performance

    CPU execution time

    Extra CPI due to cache misses
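
    In their standard form (assuming cache misses are the only source of
    memory stalls), the equations behind these bullets are:

    \[ \text{CPU time} = \text{IC} \times \left( \text{CPI}_{\text{execution}} + \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty} \right) \times \text{Clock cycle time} \]

    where the miss term is exactly the extra CPI due to cache misses.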

  • Six Basic Cache Optimizations

    Larger block size

    Reduces compulsory misses

    Increases capacity and conflict misses, increases miss penalty

    Larger total cache capacity to reduce miss rate

    Increases hit time, increases power consumption

    Higher associativity

    Reduces conflict misses

    Increases hit time, increases power consumption

  • Six Basic Cache Optimizations

    Higher number of cache levels

    Reduces overall memory access time

    Giving priority to read misses over writes

    Reduces miss penalty

    Avoiding address translation in cache indexing

    Reduces hit time

  • Larger block size

  • Larger capacity and higher associativity

  • Larger capacity and higher associativity

  • Multi-level Cache

  • Relative execution time by L2 cache size

  • Higher priority to read misses

    Case 1: write-through cache

    Write buffer

    May hold the updated version of a location needed on a read miss

    Solution

    Make read miss wait until the write buffer is empty

    Or, check the write buffer on a read miss and if no conflicts, let the

    read miss continue

    Case 2: write-back cache

    Write buffer

    Put the replaced dirty block into the write buffer

    Then, read the missed block from the memory system


  • Physical vs. Virtual Caches

    Physical caches are addressed with physical addresses

    Virtual addresses are generated by the CPU

    Address translation is required, which may increase the hit time

    Virtual caches are addressed with virtual addresses

    Address translation is not required for a hit (only for a miss)

    [Diagram: the CPU core sends the VA to a virtual cache; a hit returns
    data directly, and the TLB supplies the PA only on a miss to main memory]

  • Physical vs. Virtual Caches

    one-step process in case of a hit (+)

    cache needs to be flushed on a context switch unless process identifiers (PIDs) included in tags (-)

    Aliasing problems due to the sharing of pages (-)

    maintaining cache coherence (-)

    [Diagram: physical cache, where the CPU sends the VA to the TLB and the
    resulting PA is used to access the cache and primary memory]

    Alternative: place the cache before the TLB

    [Diagram: virtual cache, where the CPU indexes the cache with the VA and
    the TLB translates to a PA only for accesses to primary memory]

  • Drawbacks of a Virtual Cache

    Protection bits must be associated with each cache block

    Whether it is read-only or read-write

    Flushing the virtual cache on a context switch

    To avoid mixing between virtual addresses of different processes

    Can be avoided or reduced using a process identifier tag (PID)

    Aliases

    Different virtual addresses map to same physical address

    Sharing code (shared libraries) and data between processes

    Copies of same block in a virtual cache

    Updates make duplicate blocks inconsistent

    Can't happen in a physical cache

  • Aliasing in Virtual-Address Caches

    [Diagram: the page table maps two virtual pages VA1 and VA2 to the same
    physical page PA; the virtual cache then holds two tag/data copies of
    the data at PA]

    Two virtual pages share one physical page

    Virtual cache can have two copies of the same physical data; writes to
    one copy are not visible to reads of the other!

    General solution: disallow aliases to coexist in the cache

    OS software solution for a direct-mapped cache: VAs of shared pages must
    agree in the cache index bits; this ensures all VAs accessing the same
    PA will conflict in a direct-mapped cache

  • Address Translation during Indexing

    To lookup a cache, we can distinguish between two tasks

    Indexing the cache: physical or virtual address can be used

    Comparing tags: physical or virtual address can be used

    Virtual caches eliminate address translation for a hit

    However, they cause many problems (protection, flushing, and aliasing)

    Best combination for an L1 cache

    Index the cache using virtual address

    Address translation can start concurrently with indexing

    The page offset is the same in both virtual and physical addresses

    Only part of the page offset can be used for indexing, which limits the cache size

    Compare tags using physical address

    Ensure that each cache block is given a unique physical address


  • Concurrent Access to TLB & Cache

    Index L is available without consulting the TLB

    cache and TLB accesses can begin simultaneously

    Tag comparison is made after both accesses are completed

    Cases: L ≤ k-b, and L > k-b (aliasing problem)

    [Diagram: the VA splits into a VPN and a k-bit page offset, itself split
    into an L-bit index and a b-bit block offset; the TLB translates the VPN
    to a PPN while the L-bit index selects one of 2^L blocks (2^b bytes
    each) in a direct-mapped cache; the PPN is then compared with the
    physical tag to signal a hit]
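
    The two cases reduce to a size constraint: the virtual index is free of
    translation only when the index and block-offset bits fit inside the
    page offset,

    \[ L + b \le k \iff 2^{L+b} \le 2^{k} = \text{page size} \]

    so a direct-mapped, virtually indexed cache larger than one page
    (L > k-b) draws index bits from the VPN, and aliasing becomes possible.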

  • Problem with L1 Cache size > Page size

    Virtual index now uses the lower a bits of the VPN

    VA1 and VA2 can map to the same PPN

    Aliasing problem: index bits L > k-b

    Can the OS ensure that the lower a bits of the VPN are the same in the
    PPN?

    [Diagram: the virtual index (L bits) extends a bits into the VPN; the
    TLB translates VPN to PPN, and two virtual addresses VA1 and VA2 with
    the same PPN can occupy two different lines of the direct-mapped L1
    cache]

  • Anti-Aliasing with Higher Associativity

    Set associative organization

    [Diagram: the virtual index L = k-b selects a set of 2^a ways of 2^L
    blocks each; the TLB translates the VPN to a PPN, which is compared
    against the physical tags of all ways]

    Using higher associativity: cache size > page size

    2^a physical tags are compared in parallel

    Cache size = 2^a × 2^L × 2^b > page size (2^k bytes)

  • Anti-Aliasing via Second Level Cache

    Usually a common L2 cache backs up both Instruction and Data L1 caches

    L2 is typically inclusive of both Instruction and Data caches

    [Diagram: the CPU's L1 instruction and data caches are backed by a
    unified L2 cache in front of memory]

  • Anti-Aliasing Using L2: MIPS R10000

    [Diagram: a virtually indexed, direct-mapped L1 backed by a
    direct-mapped, physically addressed L2; VA1 and VA2 index different L1
    lines but the same L2 line]

    Suppose VA1 and VA2 (VA1 ≠ VA2) both map to the same PPN, and VA1 is
    already in L1 and L2

    After VA2 is resolved to a PA, a collision will be detected in L2

    VA1 will be purged from L1 and L2, and VA2 will be loaded → no aliasing!

  • Avoiding address translation time

  • Virtually indexed, physically tagged

  • Small and Simple First Level Caches

    Small and simple first level caches

    Critical timing path:

    addressing tag memory, then

    comparing tags, then

    selecting correct set

    Direct-mapped caches can overlap tag compare and transmission

    of data

    Lower associativity reduces power because fewer cache lines are

    accessed

    1st Optimization

  • L1 Size and Associativity

    Access time vs. size and associativity

    1st Optimization

  • L1 Size and Associativity

    Energy per read vs. size and associativity

    1st Optimization

  • Way Prediction

    To improve hit time, predict the way to pre-set mux

    Only a single tag comparison is performed

    Miss results in checking the other blocks for matches in the next

    clock cycle (longer miss penalty)

    Prediction accuracy

    > 90% for two-way, > 80% for four-way

    I-cache has better accuracy than D-cache

    Used on ARM Cortex-A8

    Way selection

    Extend to predict block as well

    Increases mis-prediction penalty

    2nd Optimization

  • Pipelined Cache Access

    Pipeline cache access to improve bandwidth

    Examples:

    Pentium: 1 cycle

    Pentium Pro to Pentium III: 2 cycles

    Pentium 4 to Core i7: 4 cycles

    Increases branch mis-prediction penalty

    Makes it easier to increase associativity

    3rd Optimization


  • Non-blocking Caches

    Allows the CPU to continue executing instructions

    while a miss is handled

    some processors allow only 1 outstanding miss (hit under miss)

    some processors allow multiple misses outstanding (miss under

    miss)

    can be used with both in-order and out-of-order

    processors

    in-order processors stall when an instruction that uses the load

    data is the next instruction to be executed (non-blocking loads)

    out-of-order processors can execute instructions after the load

    consumer


    4th Optimization

  • Miss Status Handling Register

    Also called miss buffer

    Keeps track of

    Outstanding cache misses

    Pending load/store accesses that refer to the missing cache block

    Fields of a single MSHR

    Valid bit

    Cache block address (to match incoming accesses)

    Control/status bits (prefetch, issued to memory, which subblocks

    have arrived, etc)

    Data for each subblock

    For each pending load/store

    Valid, type, data size, byte in block, destination register or store buffer entry address

    4th Optimization
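
    A rough C sketch of one MSHR with these fields (all sizes and names are
    assumptions made for illustration, not a real design):

    #include <stdbool.h>
    #include <stdint.h>

    #define SUBBLOCKS       4
    #define SUBBLOCK_BYTES 16
    #define MAX_TARGETS     4    /* pending loads/stores per MSHR */

    struct mshr_target {                 /* one pending load/store */
        bool    valid;
        bool    is_store;                /* type */
        uint8_t size;                    /* data size in bytes */
        uint8_t byte_in_block;           /* offset of the access in the block */
        uint8_t dest;                    /* dest register or store-buffer entry */
    };

    struct mshr {
        bool     valid;
        uint64_t block_addr;             /* matched against incoming accesses */
        bool     is_prefetch;            /* control/status bits */
        bool     issued_to_memory;
        bool     subblock_arrived[SUBBLOCKS];
        uint8_t  data[SUBBLOCKS][SUBBLOCK_BYTES];
        struct mshr_target targets[MAX_TARGETS];
    };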

  • Miss Status Handling Register

    4th Optimization

  • MSHR Operation

    On a cache miss:

    Search MSHR for a pending access to the same block

    Found: Allocate a load/store entry in the same MSHR entry

    Not found: Allocate a new MSHR

    No free entry: stall

    When a subblock returns from the next level in

    memory

    Check which loads/stores waiting for it

    Forward data to the load/store unit

    Deallocate load/store entry in the MSHR entry

    Write subblock in cache or MSHR

    If last subblock, deallocate MSHR (after writing the block in cache)

    4th Optimization

  • Non-blocking Caches

    4th Optimization

  • Multibanked Caches

    Organize cache as independent banks to support

    simultaneous access

    ARM Cortex-A8 supports 1-4 banks for L2

    Intel i7 supports 4 banks for L1 and 8 banks for L2

    Interleave banks according to block address

    Works best when accesses naturally spread across

    banks

    5th Optimization

  • Critical Word First, Early Restart

    Critical word first

    Request missed word from

    memory first

    Send it to the processor as soon

    as it arrives

    Early restart

    Request words in normal order

    Send the missed word to the processor as soon as it arrives

    Effectiveness of these strategies depends on block

    size and likelihood of another access to the portion of

    the block that has not yet been fetched

    6th Optimization

  • Merging Write Buffer

    When storing to a block that is already pending in the

    write buffer, update write buffer

    Reduces stalls due to full write buffer

    Multiword writes are usually faster than writes

    performed one word at a time

    [Figure: write-buffer contents with write merging vs. without write
    merging]

    7th Optimization
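
    A minimal sketch of the merge check in C, assuming a word-granularity
    buffer over 64 B blocks (all names here are hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    #define WORDS_PER_ENTRY 8    /* one 64 B block = 8 x 8 B words */

    struct wb_entry {
        bool     valid;
        uint64_t block_addr;                   /* aligned block address */
        bool     word_valid[WORDS_PER_ENTRY];
        uint64_t word[WORDS_PER_ENTRY];
    };

    /* Try to merge a store into an entry already waiting on the same block.
       Returns false if a new entry must be allocated instead. */
    bool try_merge(struct wb_entry *buf, int entries,
                   uint64_t addr, uint64_t data) {
        uint64_t block = addr & ~(uint64_t)(WORDS_PER_ENTRY * 8 - 1);
        int      w     = (int)((addr >> 3) & (WORDS_PER_ENTRY - 1));
        for (int i = 0; i < entries; i++) {
            if (buf[i].valid && buf[i].block_addr == block) {
                buf[i].word[w]       = data;   /* merge: overwrite the word */
                buf[i].word_valid[w] = true;
                return true;
            }
        }
        return false;
    }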

  • Compiler Optimizations

    Loop Interchange

    Swap nested loops to access memory in sequential order

    Blocking

    Instead of accessing entire rows or columns, subdivide matrices

    into blocks

    Requires more memory accesses but improves locality of

    accesses

    8th Optimization

  • Loop Interchange

    Swap nested loops to access memory in sequential order

    /* Before */
    for (k = 0; k < 100; k = k+1)
        for (j = 0; j < 100; j = j+1)
            for (i = 0; i < 5000; i = i+1)
                x[i][j] = 2 * x[i][j];

    /* After */
    for (k = 0; k < 100; k = k+1)
        for (i = 0; i < 5000; i = i+1)
            for (j = 0; j < 100; j = j+1)
                x[i][j] = 2 * x[i][j];

    [Figure: memory layout of x at addresses 0x0000, 0x4E20, 0x9C40, 0xEA60;
    before interchange the traversal strides between rows, after interchange
    it sweeps each row sequentially]

    8th Optimization

  • Blocking

    8th Optimization
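
    A standard concrete instance is blocked matrix multiply; a C sketch,
    with the blocking factor B chosen (as an assumption) so that three B x B
    tiles of doubles fit comfortably in a 32 KB cache:

    #define N 512
    #define B 32     /* 3 * 32 * 32 * 8 B = 24 KB of active tiles */

    /* Accumulates x += y * z, touching B-wide tiles of y and z so each
       tile stays cache-resident while it is reused. */
    void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double r = x[i][j];
                        for (int k = kk; k < kk + B; k++)
                            r += y[i][k] * z[k][j];
                        x[i][j] = r;
                    }
    }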


  • Hardware Prefetching

    Fetch two blocks on miss (include next sequential

    block)

    Prefetched block is placed into the instruction stream

    buffer

    9th Optimization

  • Hardware Prefetching

    Instruction Prefetching

    Typically, CPU fetches 2 blocks on a miss: the requested block and the

    next consecutive block.

    Requested block is placed in instruction cache when it returns, and

    prefetched block is placed into instruction stream buffer

    Data Prefetching

    Pentium 4 can prefetch data into L2 cache from up to 8 streams from 8

    different 4 KB pages

    Prefetching invoked if 2 successive L2 cache misses to a page,

    if distance between those cache blocks is < 256 bytes


  • Issues in Prefetching

    Usefulness: should produce hits

    Timeliness: not too late and not too early

    Cache and bandwidth pollution

    [Diagram: prefetched data flows from memory into the unified L2 cache
    and the L1 data/instruction caches feeding the CPU register file]

  • Hardware Instruction Prefetching

    Instruction prefetch in Alpha AXP 21064

    Fetch two blocks on a miss; the requested block (i) and the next

    consecutive block (i+1)

    Requested block placed in cache, and next block in instruction

    stream buffer

    If miss in cache but hit in stream buffer, move stream buffer block

    into cache and prefetch next block (i+2)

    [Diagram: on a miss, the requested block goes from the unified L2 into
    the L1 instruction cache while the next block is placed in a stream
    buffer beside it]

  • Hardware Data Prefetching

    Prefetch-on-miss:

    Prefetch b + 1 upon miss on b

    One Block Lookahead (OBL) scheme

    Initiate prefetch for block b + 1 when block b is accessed

    Why is this different from doubling block size?

    Can extend to N block lookahead

    Strided prefetch

    If a sequence of accesses to blocks b, b+N, b+2N is observed, then
    prefetch b+3N, etc.

    Example

    IBM Power 5 supports 8 independent streams of strided prefetch

    per processor, prefetching 12 lines ahead of current access


  • Hardware Prefetching

    Fetch two blocks on miss (include next sequential

    block)

    Pentium 4 Pre-fetching

    9th Optimization

  • Compiler Prefetching

    Insert prefetch instructions before data is needed

    Non-faulting: prefetch doesn't cause exceptions

    Register prefetch

    Loads data into register

    Cache prefetch

    Loads data into cache

    Combine with loop unrolling and software pipelining

    10th Optimization
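
    A hand-inserted cache-prefetch sketch in C, using GCC/Clang's
    __builtin_prefetch (non-faulting, so running past the end of the array
    is harmless); the distance of 16 elements is an assumed tuning value:

    #define DIST 16    /* prefetch distance: tune to cover memory latency */

    double sum_with_prefetch(const double *a, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            /* args: address, 0 = read, 3 = high temporal locality */
            __builtin_prefetch(&a[i + DIST], 0, 3);
            s += a[i];
        }
        return s;
    }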

  • Compiler Prefetching

    Insert prefetch instructions before data is needed

    Timing is the biggest issue, not predictability

    If you prefetch very close to when the data is required, you

    might be too late

    Prefetching too early causes pollution

    Estimate how long it will take for the data to arrive in L1, so the
    prefetch distance P can be set appropriately

    10th Optimization

  • Software Prefetching Data

    Data Prefetch

    Load data into register (HP PA-RISC loads)

    Cache Prefetch: load into cache

    (MIPS IV, PowerPC, SPARC V9)

    Special prefetching instructions cannot cause faults;

    a form of speculative execution

    Issuing prefetch instructions takes time

    Is the cost of issuing prefetches < the savings in reduced misses?

    Wider superscalar issue reduces the difficulty of finding issue bandwidth


  • Summary

  • Memory Technology

    Performance metrics

    Latency is the concern of caches

    Bandwidth is the concern of multiprocessors and I/O

    Access time

    Time between read request and when desired word arrives

    Cycle time

    Minimum time between unrelated requests to memory

    DRAM used for main memory, SRAM used for cache

  • Memory Technology

    SRAM

    Requires low power to retain bit

    Requires 6 transistors/bit

    DRAM

    Must be re-written after being read

    Must also be periodically refreshed

    Every ~ 8 ms

    Each row can be refreshed simultaneously

    One transistor/bit

    Address lines are multiplexed:

    Upper half of address: row access strobe (RAS)

    Lower half of address: column access strobe (CAS)

  • Memory Technology

    Amdahl:

    Memory capacity should grow linearly with processor speed

    Unfortunately, memory capacity and speed have not kept pace

    with processors

    Some optimizations:

    Multiple accesses to same row

    Synchronous DRAM

    Added clock to DRAM interface

    Burst mode with critical word first

    Wider interfaces

    Double data rate (DDR)

    Multiple banks on each DRAM device

  • Memory Optimizations

  • Memory Optimizations

  • Memory Optimizations

    DDR (Double Data Rate)

    DDR2

    Lower power (2.5 V -> 1.8 V)

    Higher clock rates (266 MHz, 333 MHz, 400 MHz)

    DDR3

    1.5 V

    800 MHz

    DDR4

    1-1.2 V

    1600 MHz

    GDDR5 is graphics memory based on DDR3

  • Memory Optimizations

    Graphics memory

    Achieve 2-5 X bandwidth per DRAM vs. DDR3

    Wider interfaces (32 vs. 16 bit)

    Higher clock rate

    Possible because they are attached via soldering instead of socketed DIMM modules

    Reducing power in SDRAMs

    Lower voltage

    Low power mode (ignores clock, continues to refresh)

  • Memory Power Consumption

  • Flash Memory

    Type of EEPROM

    Must be erased (in blocks) before being overwritten

    Nonvolatile

    Limited number of write cycles

    Cheaper than SDRAM, more expensive than disk

    Slower than SDRAM, faster than disk

  • Memory Dependability

    Memory is susceptible to cosmic rays

    Soft errors: dynamic errors

    Detected and fixed by error correcting codes (ECC)

    Hard errors: permanent errors

    Use spare rows to replace defective rows

    Chipkill: a RAID-like error recovery technique

  • Virtual Memory

    Protection via virtual memory

    Keeps processes in their own memory space

    Role of architecture:

    Provide user mode and supervisor mode

    Protect certain aspects of CPU state

    Provide mechanisms for switching between user mode and

    supervisor mode

    Provide mechanisms to limit memory accesses

    Provide TLB to translate addresses

  • Virtual Memory Concepts

    What is Virtual Memory?

    Uses disk as an extension to memory system

    Main memory acts like a cache to hard disk

    Each process has its own virtual address space

    Page: a virtual memory block

    Page fault: a memory miss

    Page is not in main memory → transfer the page from disk to memory

    Address translation:

    CPU and OS translate virtual addresses to physical addresses

    Page Size:

    Size of a page in memory and on disk

    Typical page size = 4 KB to 16 KB

  • Virtual Memory Concepts

    A program's address space is divided into pages

    All pages have the same fixed size (simplifies their allocation)

    [Diagram: the virtual address spaces of two programs map page by page
    into main memory]

    Pages are either in main memory or in secondary storage (hard disk)

  • Issues in Virtual Memory

    Page Size

    Small page sizes ranging from 4KB to 16KB are typical today

    Large page size can be 1MB to 4MB (reduces page table size)

    Recent processors support multiple page sizes

    Placement Policy and Locating Pages

    Fully associative placement is typical to reduce page faults

    Pages are located using a structure called a page table

    Page table maps virtual pages onto physical page frames

    Handling Page Faults and Replacement Policy

    Page faults are handled in software by the operating system

    Replacement algorithm chooses which page to replace in memory

    Write Policy

    Write-through will not work, since writes take too long

    Instead, virtual memory systems use write-back


  • Three Advantages of Virtual Memory

    Memory Management

    Programs are given contiguous view of memory

    Pages have the same size, which simplifies memory allocation

    Physical page frames need not be contiguous

    Only the Working Set of program must be in physical memory

    Stacks and Heaps can grow

    Use only as much physical memory as necessary

    Protection

    Different processes are protected from each other

    Different pages can be given special behavior (read only, etc)

    Kernel data protected from User programs

    Protection against malicious programs

    Sharing

    Can map the same physical page to multiple users → shared memory

  • Page Table and Address Mapping

    Page table register: contains the address of the page table

    [Diagram: a 32-bit virtual address splits into a 20-bit virtual page
    number and a 12-bit page offset; the page table register points to the
    page table, which is indexed by the virtual page number; each entry
    holds a valid bit (if 0, the page is not present in memory) and an
    18-bit physical page number, which is concatenated with the page offset
    to form the physical address]

    Page table maps virtual page numbers to physical frames

    Virtual page number is used as an index into the page table

    Page table entry (PTE): describes the page and its usage

  • Page Table cont'd

    Each process has a page table

    The page table defines the address space of a process

    Address space: set of page frames that can be accessed

    Page table is stored in main memory

    Can be modified only by the Operating System

    Page table register

    Contains the physical address of the page table in memory

    Processor uses this register to locate page table

    Page table entry

    Contains information about a single page

    Valid bit specifies whether page is in physical memory

    Physical page number specifies the physical page address

    Additional bits are used to specify protection and page use


  • Size of the Page Table

    One-level table is simplest to implement

    Each page table entry is typically 4 bytes

    With 4K pages and a 32-bit virtual address space, we need:

    2^32 / 2^12 = 2^20 entries × 4 bytes = 4 MB

    With 4K pages and a 48-bit virtual address space, we need:

    2^48 / 2^12 = 2^36 entries × 4 bytes = 2^38 bytes = 256 GB !

    Cannot keep the whole page table in memory!

    Most of the virtual address space is unused

    [Diagram: 32-bit virtual address = 20-bit virtual page number + 12-bit
    page offset]

  • Reducing the Page Table Size

    Use a limit register to restrict the size of the page table

    If virtual page number > limit register, then page is not allocated

    Requires that the address space expand in only one direction

    Divide the page table into two tables with two limits

    One table grows from lowest address up and used for the heap

    One table grows from highest address down and used for stack

    Does not work well when the address space is sparse

    Use a Multiple-Level (Hierarchical) Page Table

    Allows the address space to be used in a sparse fashion

    Sparse allocation without the need to allocate the entire page table

    Primary disadvantage is multiple level address translation


  • Multi-Level Page Table

    [Diagram: a processor register holds the root of the current page table;
    the virtual address splits into a 10-bit L1 index p1 (bits 31-22), a
    10-bit L2 index p2 (bits 21-12), and a 12-bit offset (bits 11-0); p1
    selects a level-2 page table and p2 selects the data page; pages and
    PTEs may reside in primary memory, in secondary memory, or be
    nonexistent]
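
    A minimal C model of the two-level walk under the 10/10/12 split shown
    above (the PTE layout, with a valid bit in bit 0 and a frame address in
    the upper bits, is a hypothetical simplification):

    #include <stdint.h>

    #define PTE_VALID 0x1u

    typedef struct { uint32_t pte[1024]; } page_table_t;  /* 2^10 entries */

    /* Translate a 32-bit VA; returns 1 and fills *pa, or 0 on a page fault.
       For simplicity the level-2 table pointer is treated as directly
       dereferenceable. */
    int translate(const page_table_t *root, uint32_t va, uint32_t *pa) {
        uint32_t p1     = (va >> 22) & 0x3ff;   /* 10-bit L1 index */
        uint32_t p2     = (va >> 12) & 0x3ff;   /* 10-bit L2 index */
        uint32_t offset =  va        & 0xfff;   /* 12-bit offset */

        uint32_t l1e = root->pte[p1];
        if (!(l1e & PTE_VALID)) return 0;       /* no level-2 table: fault */
        const page_table_t *l2 =
            (const page_table_t *)(uintptr_t)(l1e & ~0xfffu);

        uint32_t l2e = l2->pte[p2];
        if (!(l2e & PTE_VALID)) return 0;       /* nonexistent page: fault */
        *pa = (l2e & ~0xfffu) | offset;
        return 1;
    }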

  • Variable-Sized Page Support

    [Diagram: the same two-level structure, but a level-1 PTE may point
    directly to a large 4 MB page in primary memory instead of a level-2
    page table]

  • Hashed Page Table

    [Diagram: the VPN (plus PID) is hashed, and the hash index added to the
    table base gives the physical address of a PTE; each entry holds
    (PID, VPN, PPN) and a link for chaining on collisions]

    One table for all processes

    Table is only a small fraction of memory

    Number of entries is 2 to 3 times the number of page frames, to reduce
    collision probability

    Hash function for address translation

    Search through a chain of page table entries

  • Handling a Page Fault

    Page fault: requested page is not in memory

    The missing page is located on disk or created

    Page is brought from disk and Page table is updated

    Another process may be run on the CPU while the first process

    waits for the requested page to be read from disk

    If no free pages are left, a page is swapped out

    Pseudo-LRU replacement policy

    Reference bit in each page table entry

    Each time a page is accessed, set reference bit =1

    OS periodically clears the reference bits

    Page faults are handled completely in software by the

    OS

    It takes milliseconds to transfer a page from disk to memory


  • Write Policy

    Write through does not work

    Takes millions of processor cycles to write disk

    Write back

    Individual writes are accumulated into a page

    The page is copied back to disk only when the page is replaced

    Dirty bit

    1 if the page has been written

    0 if the page never changed


  • TLB = Translation Lookaside Buffer

    [Diagram: the CPU core sends the VA to the TLB, which supplies the PA to
    the cache; a hit returns data, a miss goes to main memory]

    Address translation is very expensive

    Must translate virtual memory address on every memory access

    With a multilevel page table, each translation takes several memory
    accesses

    Solution: TLB for address translation

    Keep track of most common translations in the TLB

    TLB = Cache for address translation


  • Translation Lookaside Buffer

    [Diagram: the virtual address (VPN = virtual page number, plus offset)
    indexes the TLB; on a hit the stored PPN (physical page number) replaces
    the VPN to form the physical address; each entry also carries V/R/W/D
    bits]

    TLB hit: Fast single cycle translation

    TLB miss: Slow page table translation

    Must update TLB on a TLB miss

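
    A direct-mapped TLB lookup could be modeled in C as follows (the entry
    count, 4 KB pages, and field layout are assumptions for the sketch):

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64

    struct tlb_entry {
        bool     valid;      /* V bit; R/W/D bits omitted for brevity */
        uint64_t vpn;
        uint64_t ppn;
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Fast path: returns true on a hit and fills *pa. On a miss the caller
       must walk the page table and then update the TLB. */
    bool tlb_lookup(uint64_t va, uint64_t *pa) {
        uint64_t vpn = va >> 12;                       /* 4 KB pages */
        struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
        if (e->valid && e->vpn == vpn) {
            *pa = (e->ppn << 12) | (va & 0xfff);
            return true;                               /* single-cycle hit */
        }
        return false;                                  /* slow path */
    }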

  • Address Translation & Protection

    Every instruction and data access needs address translation and protection checks

    Check whether page is read only, writable, or executable

    Check whether it can be accessed by the user or kernel only

    [Diagram: the virtual address (VPN + offset) passes through address
    translation to the physical address (PPN + offset), while a parallel
    protection check, using kernel/user mode and read/write intent, can
    raise an exception]

  • Handling TLB Misses and Page Faults

    TLB miss: No entry in the TLB matches a virtual address

    TLB miss can be handled in software or in hardware

    Lookup page table entry in memory to bring into the TLB

    If page table entry is valid then retrieve entry into the TLB

    If page table entry is invalid then page fault (page is not in memory)

    Handling a Page Fault

    Interrupt the active process that caused the page fault

    Program counter of instruction that caused page fault must be saved

    Instruction causing page fault must not modify registers or memory

    Transfer control to the operating system to transfer the page

    Restart later the instruction that caused the page fault


  • Handling a TLB Miss

    Software (MIPS, Alpha)

    TLB miss causes an exception and the operating system walks

    the page tables and reloads TLB

    A privileged addressing mode is used to access page tables

    Hardware (SPARC v8, x86, PowerPC)

    A memory management unit (MMU) walks the page tables and

    reloads the TLB

    If a missing (data or PT) page is encountered during the TLB

    reloading, MMU gives up and signals a Page-Fault exception for

    the original instruction. The page fault is handled by the OS

    software.


  • Address Translation Summary

    [Flowchart: TLB lookup (hardware). On a hit, a protection check follows:
    if permitted, the physical address goes to the cache; if denied, a
    protection fault (SEGFAULT). On a miss, a page table walk (hardware or
    software): if the page is in memory, the TLB is updated; if not, a page
    fault (software, the OS loads the page) and the instruction is
    restarted.]

  • TLB, Page Table, Cache Combinations

    TLB   Page Table  Cache     Possible? Under what circumstances?

    Hit   Hit         Hit       Yes: what we want!

    Hit   Hit         Miss      Yes: although the page table is not checked
                                if the TLB hits

    Miss  Hit         Hit       Yes: TLB miss, PA in page table

    Miss  Hit         Miss      Yes: TLB miss, PA in page table, but data
                                not in cache

    Miss  Miss        Miss      Yes: page fault (page is on disk)

    Hit   Miss        Hit/Miss  Impossible: TLB translation not possible if
                                page is not in memory

    Miss  Miss        Hit       Impossible: data not allowed in cache if
                                page is not in memory

  • Address Translation in CPU Pipeline

    Software handlers need restartable exception on page fault

    Handling a TLB miss needs a hardware or software mechanism to refill TLB

    Need mechanisms to cope with the additional latency of a TLB

    Slow down the clock

    Pipeline the TLB and cache access

    Virtual address caches

    Parallel TLB/cache access

    [Diagram: a five-stage pipeline with an instruction TLB in front of the
    instruction cache and a data TLB in front of the data cache; either TLB
    can signal a TLB miss, page fault, or protection violation]

  • Virtual Machines

    Supports isolation and security

    Sharing a computer among many unrelated users

    Enabled by raw speed of processors, making the

    overhead more acceptable

    Allows different ISAs and operating systems to be

    presented to user programs

    System Virtual Machines

    SVM software is called virtual machine monitor or hypervisor

    Individual virtual machines run under the monitor are called

    guest VMs

  • Impact of VMs on Virtual Memory

    Each guest OS maintains its own set of page tables

    VMM adds a level of memory between physical and virtual

    memory called real memory

    VMM maintains shadow page table that maps guest virtual

    addresses to physical addresses

    Requires the VMM to detect the guest's changes to its own page table

    Occurs naturally if accessing the page table pointer is a privileged

    operation

  • End of Chapter