
Transcript of ECE4750/CS4420 Computer Architecture L6: Advanced Memory Hierarchy


    ECE4750/CS4420 Computer Architecture

    L6: Advanced Memory Hierarchy

    Edward Suh

    Computer Systems Laboratory

    [email protected]

    2

    Announcements

    Lab 1 due today

    Reading: Chapter 5.1 – 5.3


    3

    Overview

    How to improve cache performance

    Recent research: Flash cache


    4

    Improving Cache Performance


Average memory access time = Hit time + Miss rate x Miss penalty

To improve performance:

• Decrease hit time

• Decrease miss rate

• Decrease miss penalty
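A quick worked example with assumed numbers (not from the slides): with a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty,

    Average memory access time = 1 + 0.05 x 20 = 2 cycles

Halving either the miss rate or the miss penalty saves half a cycle per access, while a shorter hit time helps every access.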


    5

    Small, Simple Caches

On many machines today, the cache access time sets the cycle time

    • Hit time is therefore important beyond its effect on AMAT


[Figure: cache lookup datapaths. The address is split into tag, index, and byte offset; a direct-mapped cache needs a single tag comparator driving the hit signal and the data directly, while a set-associative cache needs one comparator per way plus a mux to select the data, lengthening the hit path.]

    6

Way-Predicting Caches (MIPS R10000 L2)

Use the processor address to index into a way-prediction table

Look in the predicted way at the given index, then:

• HIT: return the data, as fast as a direct-mapped access

• MISS: check the other way(s); a slow hit updates the prediction table, and a true miss goes to the next level


    7

    Improving Cache Performance

    Decrease Hit Time

    Decrease Miss Rate

    Decrease Miss Penalty


    8

    Causes for Cache Misses

• Compulsory: first reference to a block, a.k.a. cold-start misses
- misses that would occur even with an infinite cache

• Capacity: cache is too small to hold all data needed by the program
- misses that would occur even under a perfect replacement policy

• Conflict: misses that occur because of collisions due to the block-placement strategy
- misses that would not occur with full associativity


    9

    Effect of Cache Parameters

    • Larger cache size

    • Higher associativity

    • Larger block size


    10

    Victim Cache (HP7200)


[Figure: CPU/RF, L1 data cache (with victim cache alongside), unified L2 cache.]

A victim cache is a small associative back-up cache, added to a direct-mapped cache, which holds recently evicted blocks.


    11

    Prefetching

    Speculate on future instruction and data accesses and fetch them into cache(s)

    Varieties of prefetching

    • Hardware prefetching

    • Software prefetching

    • Mixed schemes

    What types of misses does prefetching affect?


    12

    Issues in Prefetching

    Usefulness

    Timeliness

    Cache and bandwidth pollution


[Figure: CPU/RF, L1 instruction and L1 data caches, unified L2 cache; arrow showing prefetched data entering the hierarchy.]


    13

    Hardware Instruction Prefetch

    Alpha 21064


[Figure: Alpha 21064-style instruction prefetch. On an L1 instruction-cache miss, the requested block is fetched from the unified L2 into the L1 I-cache while the next sequential block is prefetched into a stream buffer next to it.]

    14

    Hardware Data Prefetching

Prefetch-on-miss: prefetch block b+1 on a miss to block b

One Block Lookahead (OBL) scheme: initiate a prefetch for block b+1 whenever block b is accessed (can be extended to N-block lookahead)

Strided prefetch: if accesses to blocks b, b+N, b+2N are observed, prefetch b+3N, and so on
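A minimal software model of the strided-prefetch idea, as a sketch in C; the stream_t structure and the issue_prefetch() hook are hypothetical stand-ins, not the interface of any real machine.

    /* Sketch of one strided-prefetch stream: on each demand-miss
     * address, compute the stride from the previous miss; once the
     * same stride repeats, prefetch one stride ahead. */
    #include <stdint.h>

    typedef struct {
        uint64_t last_addr;   /* address of the previous miss     */
        int64_t  stride;      /* last observed stride             */
        int      confident;   /* set once the same stride repeats */
    } stream_t;

    extern void issue_prefetch(uint64_t addr);  /* stand-in hook */

    void on_miss(stream_t *s, uint64_t addr)
    {
        int64_t d = (int64_t)(addr - s->last_addr);
        if (d != 0 && d == s->stride) {
            s->confident = 1;
        } else {
            s->stride = d;
            s->confident = 0;
        }
        s->last_addr = addr;
        if (s->confident)
            issue_prefetch(addr + (uint64_t)s->stride);
    }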


    15

    Software Prefetching

    Compiler-directed prefetching

    • compiler can analyze code and know where misses occur

    for(i=0; i < N; i++) {
        prefetch( &a[i + 1] );
        prefetch( &b[i + 1] );
        SUM = SUM + a[i] * b[i];
    }

    Issues?

    What property do we require of the cache for prefetching to work?
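A self-contained version of the slide's loop, assuming GCC or Clang where the __builtin_prefetch intrinsic is available; the prefetch distance of one iteration mirrors the slide, though real code usually prefetches several iterations ahead. Prefetching only pays off if the cache is non-blocking, i.e., it can keep servicing accesses while prefetches are outstanding.

    #include <stddef.h>

    /* Dot product with software prefetching (sketch; dot_prefetch is a
     * hypothetical name, __builtin_prefetch is the GCC/Clang intrinsic). */
    double dot_prefetch(const double *a, const double *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 1 < n) {                  /* stay inside the arrays */
                __builtin_prefetch(&a[i + 1]);
                __builtin_prefetch(&b[i + 1]);
            }
            sum += a[i] * b[i];
        }
        return sum;
    }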


    16

    Compiler Optimizations

    Restructuring code affects the data block access sequence

    • Group data accesses together to improve spatial locality

    • Re-order data accesses to improve temporal locality

    Prevent data from entering the cache

    • Useful for variables that will only be accessed once before being replaced

    • Needs mechanism for software to tell hardware not to cache data (instruction hints or page table bits)

    Kill data that will never be used again

    • Streaming data exploits spatial locality but not temporal locality

    • Replace into dead cache locations


    17

    Array Merging

    Some weak programmers may produce code like:

    int val[SIZE];

    int key[SIZE];

    … and proceed to reference val and key in lockstep

    Problem?
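One standard fix is to merge the two parallel arrays into a single array of structs so that each val/key pair shares a cache block; a minimal sketch (SIZE is given a value here only so the snippet compiles):

    #define SIZE 1024            /* illustrative; the slide leaves SIZE abstract */

    /* Merged layout: val[i] and key[i] now sit next to each other, so
     * referencing them in lockstep exploits spatial locality. */
    struct merged {
        int val;
        int key;
    };
    struct merged merged_array[SIZE];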


    18

    Loop Interchange

    for(j=0; j < N; j++) {
        for(i=0; i < M; i++) {
            x[i][j] = 2 * x[i][j];
        }
    }

    What type of locality does this improve?
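For reference, a sketch of the interchanged loop nest, wrapped in a hypothetical function so it compiles on its own: the inner loop now walks along a row, which is contiguous in C's row-major layout.

    /* Loop interchange: the inner loop strides by 1 instead of by N,
     * improving spatial locality. */
    void scale(int M, int N, double x[M][N])
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                x[i][j] = 2 * x[i][j];
    }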


    19

    Loop Fusion

    for(i=0; i < N; i++)
        for(j=0; j < M; j++)
            a[i][j] = b[i][j] * c[i][j];

    for(i=0; i < N; i++)
        for(j=0; j < M; j++)
            d[i][j] = a[i][j] * c[i][j];

What type of locality does this improve?
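A sketch of the fused version, again wrapped in a hypothetical function: each a[i][j] and c[i][j] is reused while still in the cache instead of being evicted between the two separate loop nests.

    /* Loop fusion: one pass over the data instead of two. */
    void fused(int N, int M, double a[N][M], double b[N][M],
               double c[N][M], double d[N][M])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                a[i][j] = b[i][j] * c[i][j];
                d[i][j] = a[i][j] * c[i][j];
            }
    }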

    20

    Blocking

    for(i=0; i < N; i++)
        for(j=0; j < N; j++) {
            r = 0;
            for(k=0; k < N; k++)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }

[Figure: access patterns of x (indexed i, j), y (indexed i, k), and z (indexed k, j); legend: not touched, old access, new access.]


    21

    Blocking

    for(jj=0; jj < N; jj=jj+B)
        for(kk=0; kk < N; kk=kk+B)
            for(i=0; i < N; i++)
                for(j=jj; j < min(jj+B,N); j++) {
                    r = 0;
                    for(k=kk; k < min(kk+B,N); k++)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }
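A rough sizing note, assuming 8-byte elements (not from the slides): the B x B sub-block of z is what gets reused across all N values of i, so B is chosen so that this sub-block plus a B-element strip of y and of x fits in the cache, roughly (B^2 + 2B) x 8 bytes; for a 32 KB cache that allows B up to about 60.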

[Figure: blocked access patterns of x, y, and z.]

    22

    Improving Cache Performance

    Decrease Hit Time

    Decrease Miss Rate

Decrease Miss Penalty

• Some cache misses are inevitable

• When they do happen, we want to service them as quickly as possible


    23

    Multilevel Caches

    A memory cannot be large and fast

    Increasing sizes of cache at each level

AMAT = Hit time(L1) + Miss rate(L1) x Miss penalty(L1)

Miss penalty(L1) = Hit time(L2) + Miss rate(L2) x Miss penalty(L2)

What is the 2nd-level miss rate?

• local miss rate – L2 misses / L2 accesses

• global miss rate – L2 misses / all CPU memory references
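A worked example with assumed numbers (not from the slides): a 1-cycle L1 hit time, 5% L1 miss rate, 20-cycle L2 hit time, 20% L2 local miss rate, and 200-cycle memory access give

    Miss penalty(L1) = 20 + 0.20 x 200 = 60 cycles
    AMAT = 1 + 0.05 x 60 = 4 cycles
    Global L2 miss rate = 0.05 x 0.20 = 1% (vs. the 20% local miss rate)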

[Figure: CPU → L1 → L2 → DRAM]

    24

    L2 Cache Hit Time vs. Miss Rate


    25

    Reduce Read Miss Penalty

    Let read misses bypass writes

    Problem?

    Solution?

[Figure: CPU/RF with a data cache and a write buffer in front of the unified L2 cache; reads that miss can bypass writes waiting in the write buffer.]

    26

    Early Restart

    Decrease miss penalty with no new hardware

    • well, okay, with some more complicated control

    Strategy: impatience!

    There is no need to wait for entire line to be fetched

    Early Restart – as soon as the requested word (or double word) of the cache block arrives, let the CPU continue execution

    If CPU references another cache line or a later word in the same line: stall

    Early restart is often combined with the next technique…


    27

    Critical Word First

    Improvement over early restart

    • request missed word first from memory system

    • send it to the CPU as soon as it arrives

    • CPU consumes word while rest of line arrives

    Example: 32B block (8 words), miss on address 20

    • words return from memory system as follows:

[Figure: the 32B block's word offsets 0, 4, 8, 12, 16, 20, 24, 28.]
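Assuming the memory system supports wrap-around fetch, the words return starting at the critical word: 20, 24, 28, 0, 4, 8, 12, 16, so the CPU receives word 20 after a single word-transfer time. With early restart alone, the block returns in order 0 through 28 and the CPU resumes only once word 20 arrives.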

    28

    Sub-blocking

    Tags are too large, i.e., too much overhead

    • Simple solution: Larger blocks, but miss penalty could be large.

    Sub-block placement (aka sector cache)

    • A valid bit added to units smaller than the full block, called sub-blocks

    • Only read a sub-block on a miss

    • If a tag matches, is the word in the cache?

[Example: sub-block (sector) tags with per-sub-block valid bits]

    Tag   Sub-block valid bits
    100   1 1 1 1
    300   1 1 0 0
    204   0 1 0 1
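A minimal sketch of the resulting tag check in C, with hypothetical names: a tag match alone does not mean the word is present; the addressed sub-block's valid bit must also be set.

    #include <stdbool.h>
    #include <stdint.h>

    #define SUBBLOCKS 4              /* sub-blocks per block (illustrative) */

    struct sector_entry {
        uint32_t tag;
        bool     valid[SUBBLOCKS];   /* one valid bit per sub-block */
    };

    /* Hit only if the tag matches AND the addressed sub-block was filled. */
    bool sector_hit(const struct sector_entry *e, uint32_t tag, unsigned sub)
    {
        return e->tag == tag && e->valid[sub];
    }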


    29

    Recent Research: Flash Cache

    Slides from Taeho Kgil and Trevor Mudge (U of Michigan)

    Roadmap for Memory - Flash


    Cell density (MBytes/cm²)        2005      2007       2009         2011         2013
    SRAM                               11        17         28           46           74
    DRAM                              153       243        458          728        1,154
    NAND Flash (SLC / MLC)        339/713  604/1,155  864/1,696  1,343/5,204  2,163/8,653

From ITRS 2005 Roadmap

    30

    Power/Performance of DRAM and Flash


                Density (Gb/cm²)   $/Gb   Active power   Idle power   Read latency   Write latency   Erase latency
    DRAM        0.7                48     878 mW         80 mW        55 ns          55 ns           N/A
    NAND        1.42               21     27 mW          6 μW         25 μs          200 μs          1.5 ms

    Flash good for idle power optimization

    • 1000× less power than DRAM

    Flash not so good for low access latency usage model

    • DRAM still required for decent access latencies

Performance and power for 1Gb memory, from Samsung datasheets (2003)


    31

    Cost of Power and Cooling


Worldwide cost of purchasing and operating servers (source: IDC)

[Chart: data points labeled 50% and 25%.]

    32

    System-Level Power Consumption


    From Sun talk given by James Laudon

SunFire T2000 power running SpecJBB (total power 271W):

Processor 26%, 16GB memory 22%, I/O 22%, AC/DC conversion 15%, Fans 10%, Disk 4%, plus the service processor.


    33

    Case for Flash as 2nd Disk Cache

Many server workloads have large working sets (hundreds of MBs to tens of GBs, or more)

The large working set is cached in main memory to maintain high throughput

A large portion of DRAM goes to the disk cache

Many server applications are more read-intensive than write-intensive

32GB of DRAM on a SunFire T2000 consumes 45W of idle power

Flash memory consumes about 1000x less idle power than DRAM

Use DRAM for recently and frequently accessed content, and Flash for less recently and infrequently accessed content

Client requests follow a Zipf-like distribution in space and time

e.g., 90% of client requests go to 20% of the files


    34

    Latency vs. Throughput

[Chart: SPECweb99 network bandwidth in Mbps (throughput, 0 to 1,200) vs. disk-cache access latency to 80% of files (12us to 1,600us), for MP4, MP8, and MP12 configurations.]


    35

    Overall Architecture

[Figure: baseline system without FlashCache: processors, 1GB DRAM main memory, HDD controller, and hard disk drive.]

    36

    Flash Lifetime

[Chart: Flash lifetime in years (log scale, 0.01 to 10,000) vs. Flash memory size as a percentage of working-set size (0% to 100%), for SURGE, SPECWeb99, Financial1, and WebSearch1.]


    37

    Programmable Flash Controller

[Figure: programmable Flash memory controller between the external interface and the NAND Flash memory, containing Flash density control, Flash program/erase time control, a GF-field LUT, BCH encode/decode, and CRC encode/decode, configured by BCH, density, and P/E descriptors; it takes the Flash address and write data and returns read data and a bit-error indication.]

    38

    Impact of ECC

[Chart: maximum tolerable P/E cycles (from about 100,000 to over 7,000,000) vs. number of correctable errors (ECC code strength, 0 to 12), for error standard deviations of 0, 5%, 10%, and 20% of the mean.]


    39

    Flash Lifetime w/ Programmable Controller

[Chart: Flash lifetime in years (log scale) vs. Flash memory size as a percentage of working-set size, with the programmable controller, for SURGE, SPECWeb99, Financial1, and WebSearch1.]

    40

    Overall Performance - Mbps

[Chart: SPECweb99 network bandwidth in Mbps (0 to 1,200) for DRAM 32MB, 64MB, 128MB, 256MB, and 512MB each paired with 1GB Flash, and for 1GB DRAM alone, under MP4, MP8, and MP12.]


    41

    Overall Main Memory Power

[Chart: SPECweb99 overall main-memory power (read, write, and idle components): DDR2 1GB active ≈ 2.5W, DDR2 1GB with powerdown ≈ 1.6W, DDR2 128MB + Flash 1GB ≈ 0.6W.]