
Memory and Cache

Alexander Nelson

April 20, 2020

University of Arkansas - Department of Computer Science and Computer Engineering

“Ideally one would desire an indefinitely large memory capacity such that any particular ... word would be immediately available. ... We are ... forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible” – Burks, Goldstine, von Neumann (1946)

Memory

What is memory?

[Figure: SRAM cell and DRAM cell circuit diagrams]

Memory

Memory – Safe place to store data

• Includes static & dynamic memory technologies

• Includes magnetic disk storage

Goal – Be able to access data quickly & reliably

May need to access a lot of data


Memory

Why a hierarchy of memories?

Type                       | Access Time                  | Cost per GiB
SRAM semiconductor memory  | 0.5 ns - 2.5 ns              | $500 - $1000
DRAM semiconductor memory  | 50 ns - 70 ns                | $10 - $20
Flash semiconductor memory | 5,000 ns - 50,000 ns         | $0.75 - $1.00
Magnetic disk              | 5,000,000 ns - 20,000,000 ns | $0.05 - $0.10

(Numbers as of 2012 per our book)

Tradeoff between cost and speed!


Goal: Hierarchy of Memories

Why this structure?

Cache Memory

Use near-CPU memories as cache!

Cache – “A safe place for hiding or storing things”

Copy data from main memory to near memory for quick access

How do we know what will be used?


Principle of Temporal Locality

def. “An item that has been referenced is likely to be referenced again soon”

e.g. Web Browser cache – Store more static aspects of pages

e.g. Variables in a loop


Principle of Spatial Locality

def. “An item that is referenced is likely near to items that will be referenced soon”

e.g. Website pre-fetch – Downloading content before you scroll so that content shows in real-time

e.g. Sequential instruction access


Locality

These two principles of locality enable caches to work well

How do we define working well?

Hit Rate – Fraction of memory accesses found in the upper levels of cache!

conversely

Miss Rate – Fraction of memory accesses not found in upper levels of cache

Miss Rate = 1 - Hit Rate

Good cache choices increase hit rate!


Taking Advantage of Locality

Cache Strategy:

• Store everything on disk
• Copy recently accessed items to smaller DRAM memory
• Copy items nearby recent accesses to smaller DRAM memory
• Copy most recent accesses (and nearby) from DRAM to SRAM

Hit Rate & Miss Penalty

Block (a.k.a. line) – Minimum unit of information that can be present or not present in a level of the hierarchy

If the block is present in the cache = Hit

If the block is not present in the cache = Miss

Hit Time = Time required to access a given level of the hierarchy if the item is present

Miss Penalty = Time required to fetch a block into a level of the memory hierarchy from a lower level

Hierarchy performance is based on improved Hit Time and reduced Miss Rate & Miss Penalty

Memory Technologies

SRAM – Static Random Access Memory

SRAM – 6-Transistor SRAM cell

Fixed access & write times – typically very close to cycle time


DRAM – Dynamic Random Access Memory

DRAM – Uses a capacitor to store each bit

Must be periodically refreshed

To refresh:

• Read value
• Write it back

Charge kept for several milliseconds

DRAM Structure

Uses two-level decode structure – Allows refreshing entire rows at the same time

DRAM Organization

Bits in DRAM organized as rectangular array

• DRAM accesses entire row

• Burst Mode – supply successive words from a row with reduced latency

Double Data Rate DRAM (DDR DRAM)

• Transfer on rising and falling clock edges

Quad Data Rate (QDR) DRAM

• Separate DDR inputs & outputs


DRAM – Cost & Capacity

DRAM – Often organized into DIMMs

DIMM = Dual Inline Memory Module

DIMMs contain 4-16 DRAMs

Cost per bit has dramatically reduced


DRAM Performance Factors

Row Buffer

• Allows several words to be read and refreshed in parallel

Synchronous DRAM

• Allow consecutive accesses in bursts without need to send addresses

• Improves bandwidth

DRAM banking

• Allow simultaneous access to multiple DRAMs

• Improves bandwidth


Increasing Memory Bandwidth


Flash Storage

Flash – Nonvolatile Semiconductor Storage

• 100x - 1000x faster than disk

• Smaller, lower power, more robust

• More expensive per bit


Disk Storage

Disk – Nonvolatile, rotating magnetic storage


Disk Sectors and Access

Each sector records

• Sector ID

• Data (512 bytes, 4096 bytes proposed)

• Error correcting code (ECC)

• Used to hide defects and recording errors

• Synchronization fields and gaps

Access to a sector involves: queuing delay (if other accesses are pending); seek (moving the heads); rotational latency; data transfer; and controller overhead

Cache

Cache Memory

Cache memory – level of memory hierarchy closest to CPU

Cache memory is smaller than lower levels of hierarchy

How do we know if data is in the cache?

Where would we look?


Cache Memory

Direct Mapped Cache – Location determined by address

Direct mapped = only one choice!

Cache address = (block address) mod (number of blocks in cache)

# of blocks is a power of 2
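As a quick sketch (not from the slides), the mapping arithmetic looks like this in C, assuming a hypothetical 8-block cache; because the block count is a power of 2, the mod reduces to keeping the low-order bits of the block address:

    #include <stdint.h>

    #define NUM_BLOCKS 8u   /* hypothetical cache size in blocks */

    /* Direct-mapped placement: the block address alone determines the slot. */
    uint32_t cache_index(uint32_t block_addr) {
        return block_addr % NUM_BLOCKS;   /* same as block_addr & (NUM_BLOCKS - 1) */
    }

    /* The high-order bits left over become the "tag" (next slides). */
    uint32_t cache_tag(uint32_t block_addr) {
        return block_addr / NUM_BLOCKS;
    }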

Cache Memory

How do we know which block is in a given cache location?

Cache Memory

How do we know which block is in a given cache location?

Store block address as well as data!

• Only need high order bits

• i.e. (block address) / (number of blocks in cache)

• Called the “tag”

What if no data in a location?


Cache Memory

How do we know which block is in a given cache location?

Store block address as well as data!

• Only need high order bits

• i.e. (block address) / (number of blocks in cache)

• Called the “tag”

What if no data in a location?

Valid bit!

• 1 = present

• 0 = not present

• initialized to 0


Cache Example

Consider the following cache:

8 blocks, 1 word/block, direct mapped

Note, all valid bits = 0


Cache Example – Initial State

Instruction 1:

lw $s0, $zero, 10110₂ # Load from 10110₂ into register

Is this cache hit or miss?

Cache Example – Initial State

Instruction 1:

lw $s0, $zero, 10110₂ # Load from 10110₂ into register

Is this cache hit or miss?

MISS! – Cache is empty, all valid bits are 0!

Load block from memory into cache, then into register


Cache Example – State 1

New state:

Instruction 2:

lw $s1, $zero, 11010₂ # Load from 11010₂ into register

Cache hit or miss?

Cache Example – State 2

MISS! – Index 010 not valid

Load block from memory, into register. New state:

Instruction 3:

lw $s3, $zero, 10110₂ # Load from 10110₂ into register

Cache hit or miss?

Cache Example – State 3

HIT! – Index 110 valid, and Tag == 10

Load directly from cache into register, no state change

Instruction 4:

lw $s4, $zero, 10000₂ # Load from 10000₂ into register

Cache hit or miss?

Cache Example – State 4

MISS – Index 000 not valid

Load into cache then register

Instruction 5: lw $s5, $zero, 00011₂ # Load from 00011₂ into register

Cache hit or miss?

Instruction 6: lw $s6, $zero, 10000₂ # Load from 10000₂ into register

Cache hit or miss?

Cache Example – State 5

Instruction 5: MISS! – index 011 not valid

Instruction 6: HIT! – index 000 valid, tag == 10

New state:

Instruction 7 – lw $s7, $zero, 10010₂

Cache hit or miss?

Cache Example – State 6

MISS! – Index 010 is valid, but its stored tag (11) does not match the requested tag (10)

Replace the cached block with the block from memory, then load into register

Final state:

Address Subdivision

LSBs, less the byte offset, determine which cache block to address

• "Byte Offset" width is set by the size of a block in words

Compare MSBs to the stored tag to identify whether the block matches
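Putting the whole lookup together (a minimal sketch, assuming the running example's 8-block, 1-word/block direct-mapped cache and 32-bit byte addresses; illustrative only):

    #include <stdbool.h>
    #include <stdint.h>

    #define OFFSET_BITS 2u               /* 4-byte words */
    #define INDEX_BITS  3u               /* 8 blocks */
    #define NUM_BLOCKS  (1u << INDEX_BITS)

    struct line {
        bool     valid;                  /* initialized to 0: nothing cached yet */
        uint32_t tag;
        uint32_t data;                   /* one word per block in this example */
    };

    static struct line cache[NUM_BLOCKS];

    /* Returns true on a hit and fills *word; a real controller would
     * fetch the block from memory on a miss. */
    bool lookup(uint32_t addr, uint32_t *word) {
        uint32_t index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1);
        uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
        if (cache[index].valid && cache[index].tag == tag) {
            *word = cache[index].data;   /* hit: data comes from the cache */
            return true;
        }
        return false;                    /* miss: valid bit unset or tag mismatch */
    }

On a miss, the controller would fetch the block, set the valid bit, and store the tag, exactly as in the example sequence above.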


Example – Larger Block Size

What if we use a larger block?

64 blocks, 16 bytes/block = 4 words/block

• What block number does address 1200 map to?


Example – Larger Block Size

What if we use a larger block?

64 blocks, 16 bytes/block = 4 words/block

• What block number does address 1200 map to?

Block address = (1200/16) = 75

Block Number = 75 mod 64 = 11
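The same arithmetic as a quick C check (numbers taken from this example):

    #include <stdio.h>

    /* Byte address 1200, 16-byte blocks, 64-block direct-mapped cache.
     * Integer division drops the byte offset within the block. */
    int main(void) {
        unsigned addr = 1200, block_bytes = 16, num_blocks = 64;
        unsigned block_addr = addr / block_bytes;      /* 75 */
        unsigned block_num  = block_addr % num_blocks; /* 75 mod 64 = 11 */
        printf("block address %u maps to cache block %u\n", block_addr, block_num);
        return 0;
    }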


Block Size Considerations

Larger blocks should reduce miss rate!

• What principle helps us here?


Block Size Considerations

Larger blocks should reduce miss rate!

• What principle helps us here?

• Spatial Locality!

But in a fixed-size cache:

• Larger blocks = fewer # of blocks

• More competition = increased miss rate

• Larger blocks lead to pollution

Larger miss penalty!

• Can override benefit of reduced miss rate

• Early restart and critical-word-first addressing can help


Cache Miss

On cache hit, CPU loads from cache & proceeds normally

What happens on a cache miss?


Cache Miss

On cache hit, CPU loads from cache & proceeds normally

What happens on a cache miss?

On cache miss:

• Stall CPU pipeline

• Fetch block from next level of hierarchy

• Instruction Cache Miss?

• Restart instruction fetch

• Data Cache Miss?

• Complete data access


Write-Through Cache

What about cache writing?


Write-Through Cache

What about cache writing?

On data write hit, could just update block in cache

• Is this true memory hierarchy?

• No! – Memory & cache inconsistent

Write Through – Also update memory on writes!

What is the drawback of this strategy?


Write-Through Cache

What about cache writing?

On data write hit, could just update block in cache

• Is this true memory hierarchy?

• No! – Memory & cache inconsistent

Write Through – Also update memory on writes!

What is the drawback of this strategy?

Writes take longer!

• e.g. Base CPI = 1
• If 10% of instructions are stores & each write to memory takes 100 cycles
• Effective CPI = 1 + 0.1 × 100 = 11!
• Over an order of magnitude slowdown!

Write Buffer

How do we solve this problem?

Write Buffer! – Hold onto data waiting to be written to memory

CPU continues immediately, only stalls on write if buffer full


Write-Back Cache

Alternate Strategy – Only update memory when a block is replaced in the cache

Called a “Write-Back” cache

Strategy:

• On data-write hit, update in cache

• Keep track of whether a block is “dirty”

• When dirty block replaced – write back to memory

• Can use write buffer to allow replacing block to be read first
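A minimal sketch of that bookkeeping (assumed write-allocate layout, one word per block as in the earlier example; mem_read/mem_write are hypothetical stand-ins for the next level of the hierarchy):

    #include <stdbool.h>
    #include <stdint.h>

    struct wb_line {
        bool     valid, dirty;
        uint32_t tag, data;
    };

    extern uint32_t mem_read(uint32_t block_addr);              /* hypothetical */
    extern void     mem_write(uint32_t block_addr, uint32_t v); /* hypothetical */

    void write_word(struct wb_line *line, uint32_t tag, uint32_t value,
                    uint32_t index, uint32_t index_bits) {
        if (!(line->valid && line->tag == tag)) {       /* write miss */
            if (line->valid && line->dirty)             /* evict: write back first */
                mem_write((line->tag << index_bits) | index, line->data);
            line->data  = mem_read((tag << index_bits) | index); /* fetch block */
            line->tag   = tag;
            line->valid = true;
        }
        line->data  = value;   /* update only the cache... */
        line->dirty = true;    /* ...and remember to write back on eviction */
    }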


Write Allocation

What should happen on a write miss?


Write Allocation

What should happen on a write miss?

Alternatives for write-through

• Allocate on miss – fetch the block

• Write around – don’t fetch the block

• Programs often write a whole block before reading it

• e.g. Initialization of arrays

For write-back – Usually fetch the block


Example – Intrinsity FastMATH

Embedded MIPS processor:

• 12-stage pipeline

• Instruction and data access on each cycle

Split cache – separate Instruction and Data caches

• Each 16KB

• 256 blocks, 16 words/block

• D-cache - write-through or write-back


Example – Intrinsity FastMATH

SPEC2000 miss rates:

• I-cache = 0.4%

• D-cache = 11.4%

• Weighted average = 3.2%


Example – Intrinsity FastMATH


Main Memory Supporting Caches

Use DRAMs for main memory

• Fixed width (e.g. 1 word)

• Connected by fixed-width clocked bus

• Bus clock is typically slower than CPU clock

Example cache block read:

• 1 bus cycle for address transfer

• 15 bus cycles per DRAM access

• 1 bus cycle per data transfer

For a 4-word block, 1-word-wide DRAM

• Miss penalty = 1 + 4 × 15 + 4 × 1 = 65 cycles

• Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle


Improving Cache Performance

Measuring Cache Performance

Components of CPU time:

• Program execution cycles – includes cache hit time

• Memory stall cycles – mainly cache misses

If we make a few simplifications:

Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                    = (Instructions / Program) × (Misses / Instruction) × Miss penalty

Cache Performance Example

Given:

• I-cache miss rate = 2%

• D-cache miss rate = 4%

• Miss penalty = 100 cycles

• Base CPI (ideal cache) = 2

• Loads & stores = 36% of instructions

Then, miss cycles per instruction:

• I-cache = 0.02 × 100 = 2

• D-cache = 0.36 × 0.04 × 100 = 1.44

Therefore, actual CPI:

• 2 (base) + 2 (I-cache) + 1.44 (D-cache) = 5.44

• Ideal cache: 5.44 / 2 = 2.72 times faster

Average Access Time

Hit time is also important for performance

Average memory access time (AMAT)

• AMAT = Hit time + Miss rate × Miss penalty

Example:

• CPU with 1ns clock

• Hit time = 1 cycle

• Miss penalty = 20 cycles
• I-cache miss rate = 0.05 (5%)

AMAT = 1 + 0.05 × 20 = 2 cycles = 2 ns per access
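The same formula as a one-line helper (a sketch using the example's numbers):

    #include <stdio.h>

    /* AMAT = hit time + miss rate × miss penalty, all in cycles. */
    double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Example values: 1-cycle hit, 5% miss rate, 20-cycle penalty. */
        printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 20.0));   /* 2.0 */
        return 0;
    }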


Performance Summary

When CPU performance increases, the miss penalty becomes more significant!

Decreasing base CPI – greater proportion of time spent on memory stalls (Amdahl’s law)

Increasing clock rate – memory stalls account for more CPU cycles

Can’t neglect cache behavior when evaluating system performance!


Associative Caches

What if we made caches more flexible to improve hit rate?

Associative caches – Blocks have flexibility on where to go

Fully Associative Cache – Allow a block to go in any cache entry

• Requires all entries to be searched at once

• Comparator per entry (expensive!)

n-way set associative – each set contains n entries

• Block number determines which set

• (Block number) mod (#Sets in cache)

• Search all entries in a given set at once

• n comparators (less expensive!)
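A sketch of the lookup (assumed 2-way, 4-set cache, one word per block; real hardware searches the ways in parallel rather than in a loop):

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 2u
    #define SETS 4u

    struct entry {
        bool     valid;
        uint32_t tag, data;
    };

    static struct entry sets[SETS][WAYS];

    bool sa_lookup(uint32_t block_addr, uint32_t *word) {
        uint32_t set = block_addr % SETS;   /* (block number) mod (#sets) */
        uint32_t tag = block_addr / SETS;
        for (uint32_t way = 0; way < WAYS; way++) {      /* search the set */
            if (sets[set][way].valid && sets[set][way].tag == tag) {
                *word = sets[set][way].data;
                return true;                             /* hit in this way */
            }
        }
        return false;                                    /* miss in all ways */
    }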


Associative Caches


Spectrum of Associativity

For a cache with 8 entries:


Associativity Example

Compare 4-block caches:

• Direct mapped, 2-way set associative, fully associative

• Block access sequence: 0, 8, 0, 6, 8

Direct Mapped:


Associativity Example

2-way set associative:

Fully associative:
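The state tables from the slides are not reproduced here; tracing the sequence 0, 8, 0, 6, 8 by hand (assuming LRU replacement where there is a choice) gives:

Direct mapped (4 blocks; 0, 8 map to block 0, 6 maps to block 2): 0 miss, 8 miss (evicts 0), 0 miss (evicts 8), 6 miss, 8 miss (evicts 0) → 5 misses

2-way set associative (2 sets; 0, 8, 6 all map to set 0): 0 miss, 8 miss, 0 hit, 6 miss (evicts 8), 8 miss (evicts 0) → 4 misses

Fully associative: 0 miss, 8 miss, 0 hit, 6 miss, 8 hit → 3 misses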


How Much Associativity?

Increased associativity decreases miss rate

• With diminishing returns

Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000

• 1-way (direct) – 10.3% Miss rate

• 2-way – 8.6% Miss rate

• 4-way – 8.3% Miss rate

• 8-way – 8.1% Miss rate


Set Associative Cache Organization


Replacement Policy

Direct mapped – no choice!

Set associative:

• Prefer non-valid entry, if exists

• Else, choose among entries in the set

LRU – least recently used:

• Choose the block unused for longest time

• Simple for 2-way, manageable for 4-way, hard beyond that

Random – Approximately the same performance as LRU for high associativity
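A sketch of why 2-way is simple (assumed structure): a single bit per set suffices to name the least-recently-used way:

    #include <stdbool.h>
    #include <stdint.h>

    #define SETS 4u

    struct way { bool valid; uint32_t tag; };

    static struct way ways[SETS][2];
    static uint8_t    lru[SETS];          /* index of the LRU way in each set */

    /* Pick a victim: prefer a non-valid entry, else the LRU way. */
    uint32_t choose_victim(uint32_t set) {
        for (uint32_t w = 0; w < 2; w++)
            if (!ways[set][w].valid)
                return w;
        return lru[set];
    }

    /* On every access (hit or fill), the other way becomes LRU. */
    void touch(uint32_t set, uint32_t way) {
        lru[set] = (uint8_t)(1u - way);
    }

Beyond 2-way, exact LRU needs an ordering over n ways per set, which is why it is manageable for 4-way and hard beyond that.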


Multilevel Caches

Primary cache attached to CPU – small but fast!

Level-2 cache services misses from the primary cache

• Larger, slower, still faster than main memory

Main memory services L-2 cache misses

Some systems include L-3 cache

L-4 Cache? – Very uncommon


Multilevel Cache Example

Given:

• CPU base CPI = 1

• Clock rate = 4 GHz

• Miss rate/instruction = 2%

• Main memory access time = 100ns

With just primary cache:

• Miss penalty = 100 ns / 0.25 ns (cycle time at 4 GHz) = 400 cycles
• Effective CPI = 1 + 0.02 × 400 = 9

Multilevel Cache Example

Now add L-2 cache:

• Access time = 5ns

• Global miss rate to main memory = 0.5%

Primary miss with L-2 hit:

• Penalty = 5 ns / 0.25 ns = 20 cycles

Primary miss with L-2 miss:

• Extra penalty = 400 cycles (main memory access)

CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4

Performance ratio = 9 / 3.4 = 2.6 times faster!

Multilevel Cache Considerations

Primary cache – focus on minimal hit time

L-2 cache:

• Focus on low miss-rate to avoid main memory access

• Hit time has less overall impact

Results:

• L-1 cache usually smaller than a single cache

• L-1 block size smaller than L-2 block size


Interactions with Advanced CPUs

Out-of-order CPUs can execute instructions during cache miss

• Pending store stays in load/store unit

• Dependent instructions wait in reservation stations

• Independent instructions continue

Effect of miss depends on program data flow

• Much harder to analyze

• Use system simulation


Interactions with Software

Misses depend on memory access patterns

• Algorithm behavior
• Compiler optimization for memory access

Dependable Memory

Memory hierarchy needs to hold data safely, even in failure

Dependability from redundancy!

Two is one, one is none!


Defining Failure

How to define failure?

1. Service Accomplishment – Service is delivered as specified

2. Service Interruption – Delivered service is different from specification

Failure – Transition from state 1 to state 2

Restoration – Transition from state 2 to state 1

Failures can be intermittent or permanent


Reliability

Reliability – A measure of continuous service accomplishment

Can be expressed as the amount of time to failure

Mean Time to Failure (MTTF) – Average amount of time until a failure

Annual Failure Rate (AFR) – Percentage of devices expected to fail in a year

AFR = 1 − exp(−8766 / MTTF) (1)

(with MTTF in hours; 8766 = hours in a year)
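Equation (1) as a quick C check (values match the example on the next slide):

    #include <math.h>
    #include <stdio.h>

    /* Annual failure rate from MTTF in hours, per equation (1). */
    double afr(double mttf_hours) {
        return 1.0 - exp(-8766.0 / mttf_hours);
    }

    int main(void) {
        double rate = afr(1e6);                         /* 1M-hour MTTF disk */
        printf("AFR = %.5f (%.3f%%)\n", rate, rate * 100.0);     /* ~0.873% */
        printf("expected failures in 100k disks: %.0f\n", rate * 100000.0);
        return 0;
    }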


Reliability Example

Example: Assume an HD with 1M-hour MTTF (= 114 years!) and a server farm with 100k disks. How many disks fail per year?

AFR = 1 − exp(−8766 / 1,000,000)

AFR = 1 − exp(−0.008766)

AFR = 1 − 0.991272

AFR = 0.0087277 ≈ 0.873%

Failures per year = AFR × #components

Failures per year = 0.0087277 × 100,000

Failures per year ≈ 872.77 disks/year!


Reliability Example

Our book uses an approximation for AFR:

AFR ≈ 8766 / MTTF (2)

Using this approximation:

AFR = 8766 / 1,000,000 = 0.008766

Failures per year = 0.008766 × 100,000 = 876.6 disks/year

Fairly close approximation for small AFR!


Availability

Service Interruption can be measured as the amount of time to repair

Mean Time to Repair (MTTR)

Mean Time Between Failures (MTBF) – Sum of MTTF + MTTR

Availability – Measure of service accomplishment

Measured with respect to accomplishment and interruption

Therefore:

Availability = MTTF / (MTTF + MTTR) (3)

76

Availability

Goal – keep availability high!

Shorthand = “nines of availability”

• One nine = 90% = 36.5 days of repair/year

• Two nines = 99% = 3.65 days of repair/year

• Three nines = 99.9% = 526 minutes of repair/year

• Four nines = 99.99% = 52.6 minutes of repair/year

• Five nines = 99.999% = 5.26 minutes of repair/year

Good services can provide four or five nines of availability


Increasing Availability

Availability can be improved by:

• Increasing MTTF

• Decreasing MTTR

Fault – Failure of any single component

• May or may not result in system failure


Increasing Availability

Three ways to improve MTTF:

• Fault Avoidance – Prevent fault occurrence by construction
• Fault Tolerance – Using redundancy to allow service to comply with specification despite faults
• Fault Forecasting – Predicting presence and creation of faults, allowing a component to be replaced before it fails

Example: Hamming ECC Code

Example of improving dependability:

Goal – Reduce errors as the result of individual bits being incorrect

Hamming Distance – # of bits that are different between two bit patterns

What is the Hamming distance between the following:

0b00110011

0b00100001


Example: Hamming ECC Code

Example of improving dependability:

Goal – Reduce errors as the result of individual bits being incorrect

Hamming Distance – # of bits that are different between two bit patterns

What is the Hamming distance between the following:

0b00110011

0b00100001

2!
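A sketch of the computation: XOR the patterns and count the 1 bits:

    #include <stdint.h>
    #include <stdio.h>

    /* Hamming distance = number of differing bit positions,
     * i.e. the population count of the XOR of the two patterns. */
    unsigned hamming_distance(uint32_t a, uint32_t b) {
        uint32_t diff = a ^ b;        /* 1 wherever the patterns disagree */
        unsigned count = 0;
        while (diff) {
            count += diff & 1u;
            diff >>= 1;
        }
        return count;
    }

    int main(void) {
        /* The patterns from the slide: 0b00110011 and 0b00100001. */
        printf("%u\n", hamming_distance(0x33, 0x21));   /* prints 2 */
        return 0;
    }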


Example: Hamming ECC Code

Minimum Hamming Distance – Minimum distance between any two valid bit patterns

If minimum distance == 2, provides 1-bit error detection

• e.g. Parity codes

If minimum distance == 3, provides single-bit error correction, 2-bit error detection

Example: Hamming ECC Code

Hamming Error Correction Code (ECC) – distance-3 code

Steps:

1. Start numbering bits from 1 on the left
2. Mark all bit positions that are powers of 2 as parity bits (positions 1, 2, 4, 8, 16, ...)
3. All other bit positions are used for data bits
4. The position of a parity bit determines the sequence of data bits that it checks
5. Set parity bits to create even parity for each group
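A sketch of these steps in C for the 8-data-bit layout on the next slide (bit[i] holds position i; the left-to-right numbering of step 1 is abstracted into array indices, and bit[0] is unused):

    #include <stdint.h>

    /* Data at positions 3, 5, 6, 7, 9, 10, 11, 12; parity at 1, 2, 4, 8. */
    void hamming_encode(const uint8_t data[8], uint8_t bit[13]) {
        static const int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};
        for (int i = 0; i < 8; i++)
            bit[data_pos[i]] = data[i];
        /* Parity bit p covers every position whose binary index includes p. */
        for (int p = 1; p <= 8; p <<= 1) {
            uint8_t parity = 0;
            for (int pos = p + 1; pos <= 12; pos++)
                if (pos & p)
                    parity ^= bit[pos];
            bit[p] = parity;           /* even parity over the whole group */
        }
    }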


Example: Hamming ECC Code

For 8 data bits:

• Parity bits at 1, 2, 4, 8

• Data bits at 3, 5, 6, 7, 9, 10, 11, 12

• Bits are checked as follows:


Decoding ECC

Value of parity bits indicates which bits are in error!

• Use numbering from encoding procedure

• E.g.:

• Parity bits = 0000 – No Error!

• Parity bits = 1010 – bit 10 was flipped
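A matching decode sketch (same bit[] layout as the encoder above): recomputing each parity group yields a syndrome that is 0 for no error and otherwise names the flipped position:

    #include <stdint.h>

    unsigned hamming_syndrome(const uint8_t bit[13]) {
        unsigned syndrome = 0;
        for (int p = 1; p <= 8; p <<= 1) {
            uint8_t parity = 0;
            for (int pos = p; pos <= 12; pos++)
                if (pos & p)
                    parity ^= bit[pos];    /* includes the parity bit itself */
            if (parity)                    /* group parity is odd: error here */
                syndrome |= (unsigned)p;
        }
        return syndrome;   /* e.g. 0b1010 = 10 means bit 10 was flipped */
    }

Correcting a single-bit error is then just bit[syndrome] ^= 1.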


Example Hamming SEC/DED

Hamming did not stop at single-bit error correction

Add an additional parity bit for the whole word (pn)

This makes the minimum Hamming distance == 4

Single Error Correcting, Double Error Detecting (SEC/DED)

Decoding: Let H = SEC parity bits

• H even, pn even – No error

• H odd, pn odd – Correctable single bit error

• H even, pn odd – Error in pn bit

• H odd, pn even – Double error occurred
