2009 Sept. 10SYSC 2001 - F09. SYSC2001-Ch4.ppt1 Chapter 4 Cache Memory 4.1 Memory system 4.2 Cache...

2009 Sept. 10 SYSC 2001 - F09. SYSC2001-Ch4.ppt 1

Chapter 4

Cache Memory

4.1 Memory system

4.2 Cache principles

4.3 Cache design

4.4 Examples


Memory: Location Registers: inside cpu

• Fastest – on CPU chip

Cache: very fast, semiconductor, close to CPU

Internal or main memory

• Typically semiconductor media (transistors)

• Fast, random access, on system bus

External or secondary memory

• peripheral storage devices (e.g. disk, tape)

• Slower, often magnetic media , maybe slower bus


Memory: Capacity

Word size: # of bits in natural unit of organization

• Usually related to length of an instruction or the number of bits used to represent an integer number

Capacity expressed as number of words or number of bytes

• Usually a power of 2, e.g. 1 KB 1024 bytes why?


Other Memory System Characteristics Unit of Transfer: Number of bits read from, or written

into memory at a time

• Internal : usually governed by data bus width

• External : usually a block of words e.g 512 or more

Addressable unit: smallest location which can be uniquely addressed

• Internal : word or byte

• External : device dependent e.g. a disk “cluster”


Sequential Access Method

Start at the beginning – read through in order

Access time depends on location of data and previous location

e.g. tape

location of interest

start

read to here

. . .

first location


Direct Access Individual blocks have unique address

Access is by jumping to vicinity plus sequential search (or waiting! e.g. waiting for disk to rotate)

Access time depends on target location and previous location

e.g. disk

block i

. . .

jump to here

read to here


Random Access Method

Individual addresses identify specific locations

Access time independent of location or previous access

e.g. RAM. . .

read here

main memory types

Ch. 5.1


Performance (Speed) Access time

• Time between presenting the address and getting the valid data (memory or other storage)

Memory cycle time

• Some time may be required for the memory to “recover” before next access

• cycle time = access + recovery

Transfer rate

• rate at which data can be moved

• for random access memory = 1 / cycle time

(cycle time)-1


Memory Hierarchy

size ? speed ? cost ?

registers

• in CPU

internal

• may include one or more levels of cache

external memory

• backing store

smallest, fastest, most expensive, most frequently accessed

medium, quick, price varies

largest, slowest, cheapest, least frequently accessed


Performance & Hierarchy List Registers

Level 1 Cache

Level 2 Cache

Main memory

Disk cache

Disk

Optical

Tape

soon( 2 slides ! )

Faster, +$/byte

Slower, -$/byte


Locality of Reference (circa 1968) During program execution memory references tend to

cluster, e.g. loops

Many instructions in localized areas of pgm are executed repeatedly during some time period, and remainder of pgm is accessed infrequently. (Tanenbaum)

Temporal LOR: a recently executed instruction is likely to be executed again soon

Spatial LOR: instructions with addresses close to a recently executed instruction are likely to be executed soon.

Same principles apply to data references.


Cache

small amount of fast memory

sits between normal main memory and CPU

may be located on CPU chip or module

word transfer

block transfer

cache

cache viewsmain memoryas organized in “blocks”

smaller thanmain memory

recall “blocks” from a few slides back


Why does Caching Improve Speed?Example:

Main memory has 100,000 words, access time is 0.1 s.

Cache has 1000 words and access time is 0.01 s.

If word is

• in cache (hit), it can be accessed directly by processor.

• in memory (miss), it must be first transferred to cache before access.

Suppose that 95% of access requests are hits.

Average time to access a word (0.95)(0.01 s)+0.05(0.1 s+ 0.01 s) = 0.015 s

Close to cache speed

Key proviso


Cache Read Operation

CPU requests contents of memory location

check cache for contents of location cache hit !

cache miss !present get data from cache (fast)

not present read required block from main to cache

– then deliver data from cache to CPU


Cache Design

Size

Mapping Function

Replacement Algorithm

Write Policy

Block Size

Number of Caches


Size

Cost

• More cache is expensive

Speed

• More cache is faster (up to a point)

• Checking cache for data takes time


Mapping Function how does cache contents map to main memory contents?

cache

tag data block000

xxx

blocki

blockj

. . .

main memory

address contents

line

use tag (and maybe line address) to

identify block address


Cache Basics

cache line vs. main memory location

– same concept – avoid confusion (?)

– line has address and contents

contents of cache line divided into tag and data fields

• fixed width

• fields used differently !

• data field holds contents of a block of main memory

• tag field helps identify the start address of the block of memory that is in the data field

cache line width bigger than

memory location width !


Mapping Function Example

cache of 64 KByte

• 16 K (214) lines – each line is 5 bytes wide = 40 bits

16 MBytes main memory

24 bit address

• 224 = 16 M

will consider DIRECT and ASSOCIATIVE mappings

4 byte blocksof main memory

holds up to 64 Kbytes of main memory contents

tag field: 1 byte

data field: 4 bytes


Direct Mapping

each block of main memory maps to only one cache line

• i.e. if a block is in cache, it must be in one specific place – based on address!

split address into two parts

• least significant w bits identify unique word in block

• most significant s bits specify one memory block

– split s bits into:

• cache line address field r bits

• tag field of s-r most significant bits

s w

tag line

address

s – r r

s

line field identifies line

containingblock !


tags-r

line address r

wordw

8 14 2

24 bit address

s = 22 bit block identifier

2 bit word identifier (4 byte block)

Direct Mapping: Address Structure for Example

two blocks may have the same r value, but then always have different tag value !


Direct Mapping Cache Line Table

cache line main memory blocks held

0 0, m, 2m, 3m, … 2s-m

1 1, m+1, 2m+1, … 2s-m+1

m-1 m-1, 2m-1,3m-1, … 2s-1

. . .

. . .

m=214

s=22

each block = 4 bytes

But…a line can contain only one of these at a time!


Direct Mapping Summary

Address length = (s + w) bits

Number of addressable units = 2s+w words or bytes

Block size = line size – tag size = 2w words or bytes

Number of blocks in main memory = 2s+ w/2w = 2s

Number of lines in cache = m = 2r

Size of tag = (s – r) bits


Direct Mapping pros & cons

Simple

Inexpensive

Fixed location for given block

• If a program accesses 2 blocks that map to the same line repeatedly, cache misses are very high


Associative Memory

read: specify tag field value and word select

checks all lines – finds matching tag

• return contents of data field @ selected word

access time independent of location or previous access

write to data field @ tag value + word select

• what if no words with matching tag? Ch. 4.3


Associative Mapping

main memory block can load into any line of cache

memory address is interpreted as tag and word select in block

tag uniquely identifies block of memory !

every line’s tag is examined for a match

cache searching gets expensive

s = tag does not use line address !


Tag 22 bitWord2 bit

Associative MappingAddress Structure

22 bit tag stored with each 32 bit block of data

Compare tag field with tag entry in cache to check for hit

Least significant 2 bits of address identify which 8 bit word is required from 32 bit data block

e.g.

• Address Tag Data Cache line

• FFFFFC 3FFFFF 24682468 any, e.g. 3FFF


Associative Mapping Summary

Address length = (s + w) bits

Number of addressable units = 2s+w words or bytes

Block size = line size – tag size = 2w words or bytes

Number of blocks in main memory = 2s+ w/2w = 2s

Number of lines in cache = undetermined

Size of tag = s bits


Set Associative Mapping

Cache is divided into a number of sets

Each set contains k lines k – way associative

A given block maps to any line in a given set

• e.g. Block B can be in any line of set i

e.g. 2 lines per set

• 2 – way associative mapping

• A given block can be in one of 2 lines in only one set


Set Associative MappingAddress Structure

Use set field to determine which set of cache lines to look in (direct)

Within this set, compare tag fields to see if we have a hit (associative)

e.g

• Address Tag Data Set number

• FFFFFC 1FF 12345678 1FFF

• 00FFFF 001 11223344 1FFF

Tag 9 bit Set 13 bitWord2 bit

Same Set, different Tag, different Word


e.g Breaking into Tag, Set, Word

Given Tag=9 bits, Set=13 bits, Word=2 bits

Given address FFFFFD16

What are values of Tag, Set, Word?

• First 9 bits are Tag, next 13 are Set, next 2 are Word

• Rewrite address in base 2: 1111 1111 1111 1111 1111 1101

• Group each field in groups of 4 bits starting at right

• Add zero bits as necessary to leftmost group of bits

0001 1111 1111 0001 1111 1111 1111 0001

1FF 1FFF 1 (Tag, Set, Word)


Replacement Algorithms Direct Mapping

what if bringing in a new block, but no line available in cache?

must replace (overwrite) a line – which one?

direct no choice

• each block only maps to one line

replace that line


Replacement Algorithms Associative & Set Associative

hardware implemented algorithm (speed)

Least Recently Used (LRU)

e.g. in 2-way set associative

• which of the 2 blocks is LRU?

First In first Out (FIFO)

• replace block that has been in cache longest

Least Frequently Used (LFU)

• replace block which has had fewest hits

Random


Write Policy

must not overwrite a cache block unless main memory is up to date

Complication: Multiple CPUs may have individual caches!!

Complication: I/O may address main memory too (read and write)!!

N.B. 15% of memory references are writes


Write Through Method

all writes go to main memory as well as cache

Each of multiple CPUs can monitor main memory traffic to keep its own local cache up to date

lots of traffic slows down writes


Write Back Method

updates initially made in cache only

update (dirty) bit for cache slot is set when update occurs

if block is to be replaced, write to main memory only if update bit is set

Other caches get out of sync

I/O must access main memory through cache


Multiple Caches on one processor two levels – L1 close to processor (often on chip)

– L2 – between L1 and main memory

check L1 first – if miss – then check L2

• if L2 miss – get from memory

processor L1 L2localbus system bus

to high speed bus


Unified vs. Split Caches

unified both instruction and data in same cache

split separate caches for instructions and data

• separate local busses to cache

increased concurrency pipelining

• allows instruction fetch to be concurrent with operand access

Ch. 12


Pentium Family Cache Evolution

80386 – no on chip cache

80486 – 8k using 16 byte lines and four way set associative organization

Pentium (all versions) – two on chip L1 (split) caches

• data & instructions


Pentium 4 Cache

Pentium 4 – split L1 caches

• 8k bytes

• 128 lines of 64 bytes each

• four way set associative = 32 sets

unified L2 cache – feeding both L1 caches

• 256k bytes

• 2048 (2k) lines of 128 bytes each

• 8 way set associative = 256 sets

how many bits ?w wordss set


Power PC Cache Evolution

601 – single 32kb 8 way set associative

603 – 16kb (2 x 8kb) two way set associative

604 – 32kb

610 – 64kb

G3 & G4

• 64kb L1 cache 8 way set associative

• 256k, 512k or 1M L2 cache two way set associative

2009 Sept. 10SYSC 2001 - F09. SYSC2001-Ch4.ppt1 Chapter 4 Cache Memory 4.1 Memory system 4.2 Cache...

Documents

Transcript of 2009 Sept. 10SYSC 2001 - F09. SYSC2001-Ch4.ppt1 Chapter 4 Cache Memory 4.1 Memory system 4.2 Cache...