
1  © 1998 Morgan Kaufmann Publishers

Memory

[Figure: CPU connected to Memory.]

Memory should be large and fast:

– large amounts of code and data

– time-critical applications

Fast memories are expensive, slow memories are cheap.

A fast and large memory results in an expensive system.

Memory is used to store programs and data.


Main Memory Types

• CPU uses the main memory on the instruction level.

• RAM (Random Access Memory): we can read and write.
– Static
– Dynamic

• ROM (Read-Only Memory): we can only read.
– Changing the contents is possible for most types.

• Characteristics
– Access time
– Price
– Volatility


• SRAM:

– bit value is stored on a pair of inverting gates

– very fast but takes up more space than DRAM (4 to 6 transistors)

• DRAM:

– bit value is stored as a charge on a capacitor (must be refreshed)

– very small but slower than SRAM (factor of 2 to 5)

Random Access Memories

[Figure: DRAM cell (word line, bit line, access transistor, capacitor) and SRAM cell (a pair of cross-coupled inverting gates with word line and complementary bit lines).]


• Users want large and fast memories!

SRAM access times are 5–12 ns at a cost of $25 per MB.
DRAM access times are 5–20 ns at a cost of $0.15 per MB.
Disk access times are 7–10 ms at a cost of $0.001 per MB.

• Give it to them anyway.

– build a memory hierarchy

Exploiting Memory Hierarchy

[Figure (2003): levels in the memory hierarchy, from the CPU down through Level 1, Level 2, …, Level n; distance from the CPU in access time increases downward, while the size of the memory at each level grows.]


Locality

• A principle that makes having a memory hierarchy a good idea

• If an item is referenced,

temporal locality: it will tend to be referenced again soon

spatial locality: nearby items will tend to be referenced soon.

• Why does code have locality?

– loops

– instructions accessed sequentially

– arrays, records
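As a sketch of why locality pays off, here is a toy direct-mapped cache model (the cache sizes and access patterns are illustrative, not from the slides): a sequential sweep reuses each fetched block, while a block-sized stride misses every time.

```python
def hit_rate(addresses, num_blocks=64, block_words=4):
    """Hit rate of word accesses on a tiny direct-mapped cache
    (64 blocks of 4 words each; sizes are illustrative)."""
    cache = [None] * num_blocks          # tag stored per cache block
    hits = 0
    for addr in addresses:
        block = addr // block_words      # memory block number
        index = block % num_blocks       # cache slot for this block
        tag = block // num_blocks
        if cache[index] == tag:
            hits += 1
        else:
            cache[index] = tag           # miss: the whole block is fetched
    return hits / len(addresses)

sequential = list(range(1024))           # good spatial locality
strided = [i * 64 for i in range(1024)]  # one access per block, all conflicting
print(hit_rate(sequential))              # 3 of every 4 accesses hit: 0.75
print(hit_rate(strided))                 # every access misses: 0.0
```

Sequential instruction fetch and array sweeps behave like the first pattern.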


Memory Hierarchy

[Figure: memory hierarchy: Registers → Cache (levels L1, L2, …; hardware implementation, SRAMs) → Main memory → Secondary memory (virtual memory; software implementation).]

The computer uses the main memory (DRAM) on the instruction level.


Memory Hierarchy

A pair of levels in the memory hierarchy:

– two levels: upper and lower

– block: minimum unit of data transferred between upper and lower level

– hit: data requested is in the upper level

• hit rate

– miss: data requested is not in the upper level

• miss rate

– miss penalty depends mainly on lower level access time


• What information goes into the cache in addition to the referenced one?

• How do we know if a data item is in the cache?

• If it is, how do we find it?

Cache Basics


• Simple approach: direct mapped
– block size is one word
– every main memory location can be mapped to exactly one cache location
– lots of words in the main memory share a single location in the cache

• Address in the cache = (address in the main memory) modulo (number of words in the cache)

– cache address is identical with lower bits in the main memory address

– tag (higher address bits) differentiates between competing main memory words

• We are taking advantage of temporal locality.

Direct Mapped Cache
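The modulo mapping can be checked directly; this sketch uses an 8-word cache as in the simple example that follows (function names are illustrative):

```python
def cache_index(address, cache_words=8):
    # address in the cache = memory address modulo number of cache words;
    # for a power-of-two cache this is just the lower address bits
    return address % cache_words

def cache_tag(address, cache_words=8):
    # the higher address bits differentiate competing memory words
    return address // cache_words

# memory words 1 (00001) and 9 (01001) compete for cache location 001:
print(cache_index(0b00001), cache_tag(0b00001))
print(cache_index(0b01001), cache_tag(0b01001))
```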


Direct Mapped Cache: Simple Example

[Figure: eight-word direct-mapped cache with indices 000–111; memory addresses 00001, 01001, 10001, 11001 map to cache index 001, and 00101, 01101, 10101, 11101 map to cache index 101.]


A More Realistic Example

• 32 bit word length

• 32 bit address

• 1 kW cache

• block size 1 word

• 10 bit cache index

• 20 bit tag size

• 2 bit byte offset (word alignment assumed)

• valid bit
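The address split for this cache can be sketched as a bit-field decomposition (the example address is arbitrary):

```python
def split_address(addr):
    """Split a 32-bit byte address for the 1 kW direct-mapped cache:
    20-bit tag, 10-bit index, 2-bit byte offset."""
    byte_offset = addr & 0x3        # bits 1-0 (word alignment assumed)
    index = (addr >> 2) & 0x3FF     # bits 11-2 select one of 1024 words
    tag = addr >> 12                # bits 31-12, compared against the stored tag
    return tag, index, byte_offset

tag, index, offset = split_address(0x12345678)   # arbitrary example address
print(hex(tag), index, offset)                   # 0x12345 414 0
```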


Cache Access

[Figure: cache access. The 32-bit address (bits 31–0) splits into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset (bits 1–0). The index selects one of 1024 entries (0–1023), each holding a valid bit, a 20-bit tag, and 32 bits of data; Hit is asserted when the entry is valid and its tag matches.]


Cache Size

• Cache memory size: 1024 × 32 b = 32 kb

• Tag memory size: 1024 × 20 b = 20 kb

• Valid information: 1024 × 1 b = 1 kb

• Efficiency: 32/53 = 60.4 %
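The arithmetic can be checked in a few lines:

```python
entries = 1024
data_bits = entries * 32     # cache data memory: 32 kb
tag_bits = entries * 20      # tag memory: 20 kb
valid_bits = entries * 1     # valid information: 1 kb
total_bits = data_bits + tag_bits + valid_bits   # 53 kb in all
efficiency = data_bits / total_bits
print(round(100 * efficiency, 1))                # 60.4
```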


Cache Hits and Misses

• Cache hit - continue

– access the cache

• Cache miss

– stall the CPU

– get information from the main memory

– write information in the cache

• data, tag, set valid bit

– resume execution


• Read hits
– this is what we want!

• Read misses
– stall the CPU, fetch block from memory, deliver to cache, restart

• Write hits:
– replace the data in the cache and memory (write-through)
– write the data only in the cache, write it in the main memory later (write-back)

• Write misses:
– read the entire block into the cache, then write the word

Hits and Misses in More Detail


Write Through and Write Back

• Write through

– update cache and main memory at the same time

– may result in extra main memory writes

– requires a write buffer; it stores data while it is waiting to be written to memory

• Write back

– update main memory only during replacement

– replacement may be slower
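A rough way to see the difference in main-memory traffic, using a deliberately tiny one-block cache model (all sizes and patterns are illustrative assumptions):

```python
def memory_writes(write_addresses, policy):
    """Main-memory writes for a stream of CPU writes, on a tiny model
    holding a single 4-word block (sizes are illustrative)."""
    assert policy in ("write-through", "write-back")
    cached_block, dirty, writes = None, False, 0
    for addr in write_addresses:
        block = addr // 4
        if block != cached_block:            # replacement of the cached block
            if policy == "write-back" and dirty:
                writes += 1                  # copy the dirty block back now
            cached_block, dirty = block, False
        if policy == "write-through":
            writes += 1                      # cache and memory updated together
        else:
            dirty = True                     # only the cache is updated
    return writes

hot_loop = [0] * 100 + [64]   # 100 writes to one word, then a conflicting block
print(memory_writes(hot_loop, "write-through"))  # 101
print(memory_writes(hot_loop, "write-back"))     # 1
```

The repeated writes are exactly the case where write-back saves memory traffic and write-through needs a write buffer.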


Combined vs. Split Cache

• Combined cache

– size equal to the sum of the split caches

– no rigid division between locations used by instructions and data

– usually a slightly better hit rate

– possibly stalls due to simultaneous access to instructions and data

– lower bandwidth because of sharing of resources

• Split instruction and data cache

– increased bandwidth from the cache

– slightly lower hit rate

– no conflict when accessing instruction and data simultaneously


• The cache described so far:

– simple

– block size one word

– only temporal locality exploited

• Spatial locality

– block size longer than one word

– when a miss occurs, multiple adjacent words are fetched

Taking Advantage of Spatial Locality


Direct Mapped Cache with Four-Word Blocks

[Figure: direct-mapped cache with four-word blocks. The address splits into a 16-bit tag (bits 31–16), a 12-bit index (bits 15–4), a 2-bit block offset (bits 3–2), and a 2-bit byte offset (bits 1–0). The index selects one of 4K entries, each with a valid bit, a 16-bit tag, and 128 bits of data; a multiplexer driven by the block offset selects one of the four 32-bit words.]


• Increasing block size tends to decrease miss rate:

• There is more spatial locality in code:

Miss Rate vs. Block Size

[Figure: miss rate (0%–40%) vs. block size (4–256 bytes) for cache sizes 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB; miss rate falls as block size grows, most sharply for the small caches.]

Program   Block size   Instruction   Data        Effective combined
          in words     miss rate     miss rate   miss rate
gcc       1            6.1%          2.1%        5.4%
gcc       4            2.0%          1.7%        1.9%
spice     1            1.2%          1.3%        1.2%
spice     4            0.3%          0.6%        0.4%


Achieving Higher Memory Bandwidth

[Figure: a. one-word-wide memory organization (CPU, cache, bus, memory); b. wide memory organization (CPU, multiplexor, cache, bus, wide memory); c. interleaved memory organization (CPU, cache, bus, memory banks 0–3).]

Miss penalties for a four-word block: a. four memory latencies and four bus cycles; b. one memory latency and one bus cycle; c. one memory latency and four bus cycles. The memory latency is much longer than the bus cycle.
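With illustrative timings (assumed here, not given on the slide: 15 bus cycles of memory latency per access and 1 bus cycle per word transferred), the three miss penalties work out as:

```python
MEM_LATENCY = 15   # bus cycles per memory access (assumed, not from the slide)
BUS_CYCLE = 1      # bus cycles per word transferred (assumed)
WORDS = 4          # four-word block

one_word_wide = WORDS * (MEM_LATENCY + BUS_CYCLE)  # a: four latencies, four transfers
wide = MEM_LATENCY + BUS_CYCLE                     # b: one latency, one wide transfer
interleaved = MEM_LATENCY + WORDS * BUS_CYCLE      # c: banks overlap their latencies
print(one_word_wide, wide, interleaved)            # 64 16 19
```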


Performance

• Simplified model:

execution time = (execution cycles + stall cycles) × cycle time

stall cycles = # of instructions × miss rate × miss penalty

• The model is more complicated for writes than reads(write-through vs. write-back, write buffers)

• Two ways of improving performance:

– decreasing the miss rate

– decreasing the miss penalty
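The model can be turned into a few lines of arithmetic (the input numbers below are hypothetical):

```python
def execution_time(instructions, base_cpi, miss_rate, miss_penalty, cycle_time):
    """execution time = (execution cycles + stall cycles) x cycle time,
    stall cycles = # of instructions x miss rate x miss penalty."""
    execution_cycles = instructions * base_cpi
    stall_cycles = instructions * miss_rate * miss_penalty
    return (execution_cycles + stall_cycles) * cycle_time

# Hypothetical inputs: 1M instructions, CPI 1.0, 2% miss rate,
# 100-cycle miss penalty, 2 ns cycle time.
t = execution_time(1_000_000, 1.0, 0.02, 100, 2e-9)
print(t)   # stalls triple the execution time to about 6 ms
```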


Flexible Placement of Cache blocks

• Direct mapped cache
– a memory block can go in exactly one place in the cache
– use the tag to identify the referenced word
– easy to implement

• Fully associative cache
– a memory block can be placed in any location in the cache
– search all entries in the cache in parallel
– expensive to implement (a comparator associated with each cache entry)

• Set-associative cache
– a memory block can be placed in a fixed number of locations
– n locations: n-way set-associative cache
– a block is mapped to a set; it can be placed in any element of that set
– search the elements of the set
– implementation simpler than in a fully associative cache


Cache Types

[Figure: searching for block 12. Direct mapped: block # 0–7, only one entry is searched; set associative: set # 0–3, both entries of the selected set are searched; fully associative: all entries are searched.]

We are looking at block 12 in an 8-block cache; 12 mod 8 = 4, 12 mod 4 = 0
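The placement rules can be sketched for all three organizations at once (a hypothetical helper; set = block number modulo number of sets):

```python
def candidate_locations(block, blocks_in_cache, ways):
    """Cache locations where a memory block may be placed.
    ways=1 gives direct mapped; ways=blocks_in_cache gives fully associative."""
    sets = blocks_in_cache // ways
    s = block % sets                       # the set the block maps to
    return [s * ways + w for w in range(ways)]

print(candidate_locations(12, 8, 1))   # direct mapped: 12 mod 8 = 4
print(candidate_locations(12, 8, 2))   # two-way: 12 mod 4 = 0, either element
print(candidate_locations(12, 8, 8))   # fully associative: anywhere
```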


Mapping of an Eight-Block Cache

[Figure: an eight-block cache in four configurations. One-way set associative (direct mapped): blocks 0–7, one tag/data pair each; two-way set associative: sets 0–3, two tag/data pairs per set; four-way set associative: sets 0–1, four pairs per set; eight-way set associative (fully associative): a single set of eight pairs.]


Performance Improvement

Associativity reduces high miss rates.

Program   Associativity   Instruction   Data        Combined
                          miss rate     miss rate   miss rate
gcc       1               2.00%         1.70%       1.90%
gcc       2               1.60%         1.40%       1.50%
gcc       4               1.60%         1.40%       1.50%
spice     1               0.30%         0.60%       0.40%
spice     2               0.30%         0.60%       0.40%
spice     4               0.30%         0.60%       0.40%


Locating a Block

• Address portions

– Index selects the set.

– Tag chooses the block by comparison.

– Block offset is the address of the data within the block.

• The costs of an associative cache

– comparators and multiplexers

– time for comparison and selection

Address layout: tag | index | block offset


Four-Way Set-Associative Cache

[Figure: four-way set-associative cache. The address splits into a 22-bit tag (bits 31–10) and an 8-bit index (bits 9–2); the index selects one of 256 sets (0–255), each holding four (valid, tag, data) entries. Four comparators check the tags in parallel and a 4-to-1 multiplexor selects the data word on a hit.]


Replacement Strategy

• Replacement is needed in associative caches.

• Random

• First-in-first-out
– the oldest block is replaced

• First-in-not-used-first-out
– the oldest of the blocks not accessed since the previous replacement is replaced

• LRU (Least Recently Used)
– the block that has been unused for the longest time is replaced
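A minimal LRU sketch for one set, using an ordered dictionary to keep blocks in use order (class and variable names are illustrative):

```python
from collections import OrderedDict

class LRUSet:
    """One set of an associative cache with LRU replacement."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()         # least recently used first

    def access(self, tag):
        if tag in self.blocks:              # hit: mark as most recently used
            self.blocks.move_to_end(tag)
            return True
        if len(self.blocks) == self.ways:   # full: evict the LRU block
            self.blocks.popitem(last=False)
        self.blocks[tag] = True             # install the new block
        return False

s = LRUSet(ways=2)
print([s.access(t) for t in ("A", "B", "A", "C", "B")])
# A miss, B miss, A hit, C evicts B, B evicts A
```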


Random vs. LRU

• Random

– simple to implement

– almost as good as other algorithms

• LRU (Least Recently Used)

– 2-way set-associative: implementation simple (one bit)

– 4-way set-associative: approximated to make implementation reasonably simple


Pseudo LRU for Four-Way S-A Cache

• Approximation of LRU, implemented with 3 bits per set

• Replacement needed: check B1; depending on B1, check B2 or B3; the leaf reached selects block 0, 1, 2, or 3 for replacement (a binary decision tree)

• At every cache access two of the bits are updated to point away from the MRU block

• Always chooses the best or second best choice
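A sketch of the three-bit tree, under the convention (assumed here) that each bit points toward the side to be replaced next:

```python
class TreePLRU:
    """Pseudo LRU for one four-way set: three bits, B1 (root), B2, B3."""
    def __init__(self):
        self.b1 = self.b2 = self.b3 = 0     # 0 = left, 1 = right

    def victim(self):
        if self.b1 == 0:
            return self.b2                  # left half: block 0 or 1
        return 2 + self.b3                  # right half: block 2 or 3

    def touch(self, block):
        # on every access two bits are updated to point away from the MRU block
        self.b1 = 1 if block < 2 else 0
        if block < 2:
            self.b2 = 1 - block
        else:
            self.b3 = 3 - block

plru = TreePLRU()
for b in (0, 1, 2, 3):
    plru.touch(b)
print(plru.victim())   # 0, which is also the true LRU block here
```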


Performance (SPEC92)

[Figure: SPEC92 miss rate (0%–15%) vs. associativity (one-way, two-way, four-way, eight-way) for cache sizes 1 KB to 128 KB; increasing associativity helps most for the small caches.]


Multilevel Caches

• Usually two levels:
– L1 cache is often on the same chip as the processor
– L2 cache is usually off-chip
– miss penalty goes down if data is in the L2 cache

• Example:
– CPI of 1.0 on a 500 MHz machine with a 200 ns main memory access time: miss penalty 100 clock cycles
– add a second-level cache with 20 ns access time: miss penalty 10 clock cycles

• Using multilevel caches:
– minimise the hit time on L1
– minimise the miss rate on L2


Virtual Memory

• Main memory can act as a “cache” for the secondary storage
– large virtual address space used in each program
– smaller main memory

• Motivations
– efficient and safe sharing of memory among multiple programs
– remove the programming burdens of a small main memory

• Advantages
– illusion of having more physical memory
– program relocation
– protection


Virtual Memory

[Figure: address translation maps virtual addresses to physical addresses in main memory or to disk addresses.]


Pages: virtual memory blocks

• CPU produces a virtual address

– translated by a combination of hardware and software to a physical address

• Virtual address: virtual page number and page offset

• Physical address: physical page number and page offset

• Page fault: data is not in memory, retrieve it from disk
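A minimal sketch of the translation, assuming 4 kB pages and a page table held as a dictionary (all names and values are hypothetical):

```python
PAGE_BITS = 12                 # 4 kB pages (assumed)
PAGE_SIZE = 1 << PAGE_BITS

def translate(vaddr, page_table):
    """Virtual-to-physical translation; None signals a page fault."""
    vpn = vaddr >> PAGE_BITS             # virtual page number
    offset = vaddr & (PAGE_SIZE - 1)     # page offset, copied unchanged
    ppn = page_table.get(vpn)            # physical page number, if resident
    if ppn is None:
        return None                      # page fault: fetch the page from disk
    return (ppn << PAGE_BITS) | offset

page_table = {0x12345: 0x00077}          # hypothetical resident page
print(hex(translate(0x12345ABC, page_table)))   # 0x77abc
print(translate(0xDEAD0123, page_table))        # None: page fault
```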


Address Translation

[Figure: address translation. The 32-bit virtual address (bits 31–0) splits into a virtual page number (bits 31–12) and a page offset (bits 11–0); translation replaces the virtual page number with a physical page number (bits 29–12), and the page offset passes through unchanged.]


Virtual Memory Design

• Huge miss penalty, thus pages should be fairly large (4–64 kB).

• Reducing page faults is important (LRU is worth the price).

• Page faults can be handled in software instead of hardware
– overhead is small compared to the access time of the disk

• Virtual memory systems use write-back.
– using write-through is too expensive

• Dirty bit
– indicates whether a page needs to be copied back when it is replaced
– initially cleared, set when the page is first written


Page Tables

[Figure: page table, indexed by virtual page number. Each entry holds a valid bit and a physical page number or disk address; valid entries point into physical memory, invalid entries point to disk storage.]


Page Table Details

[Figure: page table details. The page table register points to the page table; the 20-bit virtual page number (bits 31–12) indexes it. Each entry holds a valid bit and an 18-bit physical page number; if the valid bit is 0, the page is not present in memory. The 12-bit page offset is copied to the physical address unchanged.]


Making Address Translation Fast

• A cache for address translations: translation lookaside buffer (TLB)

[Figure: the TLB next to the page table. Each TLB entry holds a valid bit, a tag (virtual page number), and a physical page address; the page table itself maps every virtual page to a physical page or a disk address, backed by physical memory and disk storage.]


MIPS R2000 TLB and Cache

[Figure: MIPS R2000 TLB and cache. The 20-bit virtual page number is looked up in the TLB (valid, dirty, tag, physical page number) to produce a 20-bit physical page number; with the 12-bit page offset this forms the physical address, which then splits into a 16-bit cache tag, a 14-bit cache index, and a 2-bit byte offset for a direct-mapped cache. TLB hit and cache hit are signaled separately.]


TLBs and caches

[Figure: flowchart for TLB and cache access. Virtual address → TLB access; on a TLB miss, raise a TLB miss exception; on a TLB hit, form the physical address. For a read, try to read the data from the cache: a miss stalls the CPU, a hit delivers the data to the CPU. For a write, first test the write access bit: if it is off, raise a write protection exception; otherwise write the data into the cache, update the tag, and put the data and the address into the write buffer.]


Protection and Virtual Memory

• Multiple processes and the operating system

– share a single main memory

– memory protection is provided

• A user process cannot access other processes’ data

• The operating system takes care of system administration

– page tables, TLBs


Hardware Requirements for Protection

• At least two operating modes
– user process
– operating system process (also called kernel, supervisor, or executive process)

• Portions of the CPU state a user process can read but not write
– user/supervisor mode bit
– page table pointer
– TLB

• Mechanisms for going from user mode to supervisor mode, and vice versa
– system call exception
– return from exception


Handling Page Faults and TLB Misses

• TLB miss

– page present in the memory → create the missing TLB entry

– page not present in the memory → page fault → transfer control to the operating system

• Look at the matching page table entry

– valid bit on → copy the page table entry from memory into the TLB

– valid bit off → page fault exception

• Page fault

– EPC contains the virtual address of the faulting page

– find the page and move it into the memory after choosing a page to be replaced