Improving DRAM Performance by Parallelizing Refreshes with Accesses


Page 1: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu

Kevin Chang

Page 2: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Executive Summary
• DRAM refresh interferes with memory accesses
– Degrades system performance and energy efficiency
– Becomes exacerbated as DRAM density increases
• Goal: Serve memory accesses in parallel with refreshes to reduce refresh interference on demand requests
• Our mechanisms:
– 1. Enable more parallelization between refreshes and accesses across different banks with new per-bank refresh scheduling algorithms
– 2. Enable serving accesses concurrently with refreshes in the same bank by exploiting DRAM subarrays
• Improve system performance and energy efficiency for a wide variety of workloads and DRAM densities
– 20.2% performance and 9.0% energy-efficiency improvements for 8-core systems using 32Gb DRAM
– Very close to the ideal scheme without refreshes

Page 3: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Outline
• Motivation and Key Ideas
• DRAM and Refresh Background
• Our Mechanisms
• Results

Page 4: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Refresh Penalty
[Figure: Processor → Memory Controller → DRAM. A refresh command occupies the DRAM before a read can be served and its data returned; each DRAM cell is a capacitor with an access transistor.]
Refresh delays requests by 100s of ns. Refresh interferes with memory accesses.

Page 5: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Existing Refresh Modes
[Figure: Two timelines. All-bank refresh in commodity DRAM (DDRx) refreshes Banks 0-7 together; per-bank refresh in mobile DRAM (LPDDRx) refreshes Banks 0-7 one at a time in round-robin order.]
Per-bank refresh allows accesses to other banks while a bank is refreshing

Page 6: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Shortcomings of Per-Bank Refresh
• Problem 1: Refreshes to different banks are scheduled in a strict round-robin order
– The static ordering is hardwired into DRAM chips
– Refreshes busy banks with many queued requests when other banks are idle
• Key idea: Schedule per-bank refreshes to idle banks opportunistically in a dynamic order

Page 7: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Shortcomings of Per-Bank Refresh
• Problem 2: Banks that are being refreshed cannot concurrently serve memory requests
[Figure: Per-bank refresh timeline; a read (RD) to Bank 0 is delayed by an ongoing refresh to the same bank.]

Page 8: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Shortcomings of Per-Bank Refresh
• Problem 2: Refreshing banks cannot concurrently serve memory requests
• Key idea: Exploit subarrays within a bank to parallelize refreshes and accesses across subarrays
[Figure: Bank 0 timeline; Subarray 0 is refreshed while Subarray 1 serves a read (RD) in parallel.]

Page 9: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Outline
• Motivation and Key Ideas
• DRAM and Refresh Background
• Our Mechanisms
• Results

Page 10: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

DRAM System Organization
[Figure: A DRAM system with multiple ranks (Rank 0, Rank 1), each containing Banks 0-7.]
• Banks can serve multiple requests in parallel

Page 11: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

DRAM Refresh Frequency
• DRAM standard requires memory controllers to send periodic refreshes to DRAM
– tRefPeriod (tREFI): remains constant
– tRefLatency (tRFC): varies based on DRAM chip density (e.g., 350ns)
– For comparison, a read/write takes roughly 50ns
[Figure: Timeline of refreshes issued every tRefPeriod, each making the DRAM unavailable for tRefLatency.]

Page 12: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Increasing Performance Impact
• DRAM is unavailable to serve requests for tRefLatency / tRefPeriod of the time
• 6.7% for today's 4Gb DRAM
• Unavailability increases with higher density due to higher tRefLatency
– 23% / 41% for future 32Gb / 64Gb DRAM
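As a concrete illustration of the ratio above, the minimal sketch below computes tRefLatency / tRefPeriod. The timing values in the example are assumptions for illustration (they are not the exact parameters behind the slide's 6.7% / 23% / 41% figures).

```python
# Illustrative calculation of the refresh unavailability fraction
# tRefLatency (tRFC) / tRefPeriod (tREFI). Values below are assumptions.

def unavailability(t_rfc_ns: float, t_refi_ns: float) -> float:
    """Fraction of time DRAM is busy refreshing instead of serving requests."""
    return t_rfc_ns / t_refi_ns

# Example: tRFC = 350 ns (as quoted on the previous slide) with an assumed
# tREFI of 7800 ns gives roughly 4.5% unavailability; higher-density chips
# with larger tRFC push this fraction up.
print(f"{unavailability(350, 7800):.1%}")
```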

Page 13: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

All-Bank vs. Per-Bank Refresh
• All-bank refresh: employed in commodity DRAM (DDRx, LPDDRx)
[Figure: Timeline; an all-bank refresh makes every bank unavailable at once (internally staggered across banks to limit power) before reads to Bank 0 and Bank 1 can resume.]
• Per-bank refresh: in mobile DRAM (LPDDRx)
– Shorter tRefLatency than that of all-bank refresh
– More frequent refreshes (shorter tRefPeriod)
[Figure: Timeline; while Bank 0 is refreshed, Bank 1 serves a read, and vice versa.]
Per-bank refresh can serve memory accesses in parallel with refreshes across banks

Page 14: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Shortcomings of Per-Bank Refresh
• 1) Per-bank refreshes are strictly scheduled in round-robin order (as fixed by DRAM's internal logic)
• 2) A refreshing bank cannot serve memory accesses
Goal: Enable more parallelization between refreshes and accesses using practical mechanisms

Page 15: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Outline
• Motivation and Key Ideas
• DRAM and Refresh Background
• Our Mechanisms
– 1. Dynamic Access-Refresh Parallelization (DARP)
– 2. Subarray Access-Refresh Parallelization (SARP)
• Results

Page 16: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Our First Approach: DARP
• Dynamic Access-Refresh Parallelization (DARP)
– An improved scheduling policy for per-bank refreshes
– Exploits refresh scheduling flexibility in DDR DRAM
• Component 1: Out-of-Order Per-Bank Refresh
– Avoids poor static scheduling decisions
– Dynamically issues per-bank refreshes to idle banks
• Component 2: Write-Refresh Parallelization
– Avoids refresh interference on latency-critical reads
– Parallelizes refreshes with a batch of writes

Page 17: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

1) Out-of-Order Per-Bank Refresh
• Dynamic scheduling policy that prioritizes refreshes to idle banks
• Memory controllers decide which bank to refresh

Page 18: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Our Mechanism: DARP — 1) Out-of-Order Per-Bank Refresh
[Figure: Timelines for Bank 0 and Bank 1 with per-bank request queues. Baseline (round-robin order): a refresh is issued to a bank that has pending reads, delaying them. Our mechanism (DARP): the refresh is issued to an idle bank instead, so the reads are served without delay and cycles are saved.]
Reduces refresh penalty on demand requests by refreshing idle banks first in a flexible order
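To make the scheduling decision above concrete, here is a minimal Python sketch of an out-of-order per-bank refresh selector. It illustrates the idea rather than the paper's exact DARP algorithm; the names (pick_bank_to_refresh, refresh_credits, request_queues) are hypothetical.

```python
# Sketch: the controller tracks how many refreshes each bank still owes in
# the current window and, when it can issue one, prefers a bank with no
# pending demand requests.

def pick_bank_to_refresh(refresh_credits, request_queues):
    """Return the bank to refresh next, or None to postpone.

    refresh_credits: dict bank -> number of refreshes still owed
    request_queues:  dict bank -> list of pending demand requests
    """
    owed = [b for b, n in refresh_credits.items() if n > 0]
    if not owed:
        return None                      # nothing owed right now

    idle = [b for b in owed if len(request_queues[b]) == 0]
    if idle:
        return idle[0]                   # refresh an idle bank opportunistically

    # No idle bank: fall back to the bank that owes the most refreshes.
    # (A real controller must also respect the deadline by which all owed
    # refreshes have to be issued.)
    return max(owed, key=lambda b: refresh_credits[b])


# Example: Bank 0 has queued reads, Bank 1 is idle -> refresh Bank 1 first.
credits = {0: 1, 1: 1}
queues = {0: ["RD", "RD"], 1: []}
print(pick_bank_to_refresh(credits, queues))   # -> 1
```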

Page 19: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Outline
• Motivation and Key Ideas
• DRAM and Refresh Background
• Our Mechanisms
– 1. Dynamic Access-Refresh Parallelization (DARP)
  • 1) Out-of-Order Per-Bank Refresh
  • 2) Write-Refresh Parallelization
– 2. Subarray Access-Refresh Parallelization (SARP)
• Results

Page 20: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Refresh Interference on Upcoming Requests
• Problem: A refresh may collide with an upcoming request in the near future
[Figure: A refresh issued to Bank 0 collides with a read to Bank 0 that arrives shortly afterward, delaying it.]

Page 21: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

DRAM Write Draining
• Observations:
• 1) Bus-turnaround latency when transitioning from writes to reads or vice versa
– To mitigate bus-turnaround latency, writes are typically drained to DRAM in a batch during a period of time
• 2) Writes are not latency-critical
[Figure: Timeline; a read is followed by a bus turnaround and then a batch of writes drained to Banks 0 and 1.]

Page 22: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

2) Write-Refresh Parallelization
• Proactively schedules refreshes when banks are serving write batches
[Figure: Baseline — a refresh delays a latency-critical read that follows it. Write-refresh parallelization — 1. postpone the refresh; 2. issue it while the banks drain a batch of writes after the bus turnaround, so the read is served without delay and cycles are saved.]
Avoids stalling latency-critical read requests by refreshing with non-latency-critical writes
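A minimal sketch of this policy's core decision, assuming the controller tracks whether it is in write-drain mode and whether the refresh deadline is close; the names (should_issue_refresh, in_write_drain, deadline_near) are illustrative, not the paper's exact interface.

```python
# Sketch: postpone a due refresh while latency-critical reads are being
# served, and issue it once the controller enters write-drain mode so the
# refresh overlaps with non-latency-critical writes.

def should_issue_refresh(in_write_drain, refresh_due, deadline_near):
    """Decide whether to issue the pending per-bank refresh now."""
    if not refresh_due:
        return False
    if deadline_near:
        return True        # must issue to stay within the refresh window
    # Otherwise, prefer to hide the refresh behind the write batch.
    return in_write_drain


# Example: a refresh is due but not urgent; it is issued only once the
# controller starts draining writes.
print(should_issue_refresh(in_write_drain=False, refresh_due=True,
                           deadline_near=False))  # False: postpone
print(should_issue_refresh(in_write_drain=True, refresh_due=True,
                           deadline_near=False))  # True: overlap with writes
```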

Page 23: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Outline
• Motivation and Key Ideas
• DRAM and Refresh Background
• Our Mechanisms
– 1. Dynamic Access-Refresh Parallelization (DARP)
– 2. Subarray Access-Refresh Parallelization (SARP)
• Results

Page 24: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Our Second Approach: SARP
Observations:
1. A bank is further divided into subarrays
– Each has its own row buffer to perform refresh operations
2. Some subarrays and the bank I/O remain completely idle during refresh
[Figure: A bank is built from subarrays, each with a row buffer, connected through shared bank I/O; while one subarray is refreshed, the other subarrays and the bank I/O sit idle.]

Page 25: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Our Second Approach: SARP
• Subarray Access-Refresh Parallelization (SARP):
– Parallelizes refreshes and accesses within a bank

Page 26: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Our Second Approach: SARP
• Subarray Access-Refresh Parallelization (SARP):
– Parallelizes refreshes and accesses within a bank
• Very modest DRAM modifications: 0.71% die area overhead
[Figure: Within Bank 1, Subarray 0 is refreshed while Subarray 1 serves reads and returns data through the bank I/O.]
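The scheduling condition SARP relies on can be sketched as follows. This is only an illustration of the idea (the slide notes that the real mechanism also requires modest DRAM changes); the types and names (BankState, can_serve_in_parallel) are hypothetical.

```python
# Sketch: an access may proceed in a bank that is being refreshed only if it
# targets a different subarray than the one under refresh, so the idle
# subarrays and the bank I/O are put to use.

from dataclasses import dataclass
from typing import Optional

@dataclass
class BankState:
    refreshing_subarray: Optional[int] = None   # None -> no refresh in progress

def can_serve_in_parallel(bank: BankState, access_subarray: int) -> bool:
    """True if the access can be served while the bank is refreshing."""
    if bank.refreshing_subarray is None:
        return True                              # bank is not refreshing at all
    return access_subarray != bank.refreshing_subarray

# Example: Bank 1 is refreshing Subarray 0; a read to Subarray 1 can proceed,
# while a read to Subarray 0 must wait.
bank1 = BankState(refreshing_subarray=0)
print(can_serve_in_parallel(bank1, access_subarray=1))  # True
print(can_serve_in_parallel(bank1, access_subarray=0))  # False
```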

Page 27: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Outline
• Motivation and Key Ideas
• DRAM and Refresh Background
• Our Mechanisms
• Results

Page 28: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Methodology
• Simulator configuration: 8-core processor (L1 $: 32KB, L2 $: 512KB/core) with a memory controller connected to a DDR3 rank of 8 banks (Bank 0-7)
• 100 workloads: SPEC CPU2006, STREAM, TPC-C/H, random access
• System performance metric: weighted speedup
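For reference, weighted speedup is the usual multiprogrammed-workload metric: the sum over applications of each application's IPC when co-running divided by its IPC when running alone. A minimal sketch (the IPC values in the example are made up):

```python
# Weighted speedup = sum_i (IPC_i^shared / IPC_i^alone)

def weighted_speedup(ipc_shared, ipc_alone):
    """ipc_shared[i], ipc_alone[i]: IPC of application i with and without co-runners."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

# Example with illustrative IPC values for a 4-application mix:
print(weighted_speedup([0.8, 1.1, 0.5, 0.9], [1.0, 1.5, 0.8, 1.2]))
```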

Page 29: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Comparison Points
• All-bank refresh [DDR3, LPDDR3, …]
• Per-bank refresh [LPDDR3]
• Elastic refresh [Stuecheli et al., MICRO '10]:
– Postpones refreshes by a time delay based on the predicted rank idle time to avoid interference on memory requests
– Proposed to schedule all-bank refreshes without exploiting per-bank refreshes
– Cannot parallelize refreshes and accesses within a rank
• Ideal (no refresh)

Page 30: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

System Performance
[Figure: Weighted speedup (geometric mean) vs. DRAM chip density (8Gb, 16Gb, 32Gb) for All-Bank, Per-Bank, Elastic, DARP, SARP, DSARP, and Ideal; DSARP's gains are 7.9%, 12.3%, and 20.2% at the three densities.]
1. Both DARP and SARP provide performance gains, and combining them (DSARP) improves performance even more
2. Consistent system performance improvement across DRAM densities (within 0.9%, 1.2%, and 3.8% of ideal)

Page 31: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Energy Efficiency
[Figure: Energy per access (nJ) vs. DRAM chip density (8Gb, 16Gb, 32Gb) for All-Bank, Per-Bank, Elastic, DARP, SARP, DSARP, and Ideal; DSARP's reductions are 3.0%, 5.2%, and 9.0% at the three densities.]
Consistent reduction in energy consumption

Page 32: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Other Results and Discussion in the Paper
• Detailed multi-core results and analysis
• Result breakdown based on memory intensity
• Sensitivity results on number of cores, subarray counts, refresh interval length, and DRAM parameters
• Comparisons to DDR4 fine granularity refresh

Page 33: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Executive Summary
• DRAM refresh interferes with memory accesses
– Degrades system performance and energy efficiency
– Becomes exacerbated as DRAM density increases
• Goal: Serve memory accesses in parallel with refreshes to reduce refresh interference on demand requests
• Our mechanisms:
– 1. Enable more parallelization between refreshes and accesses across different banks with new per-bank refresh scheduling algorithms
– 2. Enable serving accesses concurrently with refreshes in the same bank by exploiting DRAM subarrays
• Improve system performance and energy efficiency for a wide variety of workloads and DRAM densities
– 20.2% performance and 9.0% energy-efficiency improvements for 8-core systems using 32Gb DRAM
– Very close to the ideal scheme without refreshes

Page 34: Improving DRAM Performance  by Parallelizing Refreshes with Accesses

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu

Kevin Chang