Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS....

37
Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM Donghyuk Lee Lavanya Subramanian, Rachata Ausavarungnirun, Jongmoo Choi, Onur Mutlu Decoupled Direct Memory Access

Transcript of Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS....

Page 1: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

IsolatingCPUandIOTrafficbyLeveragingaDual-Data-PortDRAM

DonghyukLeeLavanya Subramanian,Rachata Ausavarungnirun,

Jongmoo Choi,Onur Mutlu

DecoupledDirectMemoryAccess

Page 2: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

2

processor

LogicalSystemOrganization

mainmemory

IOdevices

CPUaccess

IOaccess

MainmemoryconnectsprocessorandIOdevicesasanintermediatelayer

Page 3: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

3

processor

PhysicalSystemImplementation

mainmemory

IOdevices

CPUaccess

IOaccess

IOaccess

HighPinCostinProcessor

HighContentioninMemoryChannel

Page 4: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

4

processor

OurApproach

mainmemory

IOdevices

CPUaccess

EnablingIOchannel,decoupled & isolated fromCPUchannel

IOaccess

IOaccess

Page 5: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

5

ExecutiveSummary• Problem

– CPUandIOaccessescontendforthesharedmemorychannel

• OurApproach:DecoupledDirectMemoryAccess(DDMA)– DesignnewDRAMarchitecturewithtwoindependentdataports

àDual-Data-PortDRAM– ConnectoneporttoCPUandtheotherporttoIOdevices

àDecoupleCPUandIOaccesses

• Application– Communicationbetweencomputeunits(e.g.,CPU–GPU)– In-memorycommunication(e.g.,bulkin-memorycopy/init.)– Memory-storagecommunication(e.g.,pagefault,IOprefetch)

• Result– Significantperformanceimprovement(20%in2ch.&2ranksystem)– CPUpincountreduction(4.5%)

Page 6: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

6

Outline1.Problem

3.Dual-Data-PortDRAM

5.Evaluation

4.ApplicationsforDDMA

2.OurApproach

1.Problem

Page 7: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

7

mainmemory

CPU

DMA

graphics

network

storage

USB

IOinterfacememorycontroller

MemoryChannelContentionDRAMChip

ProcessorChip

Problem1:MemoryChannelContention

DMAIOinterface

Page 8: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

8

0%

20%

40%

60%

80%

100%TimeSpentonCPU-GPUCommunication

Benchmarks

33.5%onaverage

Fractio

nofExecutio

nTime

AlargefractionoftheexecutiontimeisspentonIOaccesses

Problem1:MemoryChannelContention

Page 9: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

9

IntegratingIOinterfaceontheprocessorchipleadstohighareacost

ProcessorPinCount(w/opowerpins)

power memory(2ch)

IOinterface(10.6%)

IOinterface(28.4%)

others

memory(2ch)

(w/powerpins)ProcessorPinCount

959pinsintotal 359pinsintotal

Problem2:HighCostforIOInterfaces

Page 10: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

10

SharedMemoryChannel

• MemorychannelcontentionforIOaccessandCPUaccess

• HighareacostforintegratingIOinterfacesonprocessorchip

Page 11: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

11

Outline1.Problem

3.Dual-Data-PortDRAM

5.Evaluation

4.ApplicationsforDDMA

2.OurApproach

Page 12: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

12

OurApproach

CPU

DMA

graphics

network

storage

USB

DRAMChip

mainmemory

?

DMACTRL.

DMAcontrol

ProcessorChip

controlchannel

Dual-Data-PortDRAM

Port1

Port2

memorycontroller IOinterface

DMAChip DMAIOinterface

Page 13: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

13

OurApproach

?

CPU

graphics

network

storage

USB

DRAMChip

DMACTRL.

DMAcontrol

ProcessorChip

controlchannel

Dual-Data-PortDRAM

Port1

Port2

memorycontroller

DMAChip DMAIOinterface

IOACCESS

DecoupledDirectMemoryAccess

CPUACCESS

Page 14: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

14

Outline1.Problem

3.Dual-Data-PortDRAM

5.Evaluation

4.ApplicationsforDDMA

2.OurApproach

Page 15: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

15

peripherallogic

bank

Background:DRAMOperation

mem

orychannel

datachannel controlchannel

control

port

dataport

control

port

dataport

bank

activateread

bankbankREADY

DRAMperipherallogic:i)controlsbanks,andii)transfersdataovermemorychannel

memorycontrolleratCPU

Page 16: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

16

bank

Problem:SingleDataPort

periphery

Requestsareservedseriallyduetosingledataport

datachannel controlchannel

control

port

dataport

read

control

port

dataport

bankREADY

bankREADY

dataport

read

ManyBanks

SingleDataPort

memorycontrolleratCPU

Page 17: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

17

Problem:SingleDataPort

RD

DATA

RD

DATA

ControlPort

DataPort

time

RD

DATA

RDControlPort

DataPort1

time

DATADataPort2

WhataboutaDRAMwithtwodataports?

Page 18: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

18

bank

periphery

twicethebandwidth&independentdataportswithlowoverhead

datachannel controlchanneldataport1

bank

bank

control

port

toPort1(upper)

toPort2(lower)

bankdatabus

portse

lectsignal

dataport2

datachannel

mux

mux

OverheadArea:1.6%↑Pins:20↑

Dual-Data-PortDRAM

Page 19: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

19

DDP-DRAMMemorySystem

bank

periphery

CPUchannel controlchannelwithportselectdata

port1

bank

bank

control

port

dataport2

IOchannel

mux

mux

DDMAIOinterface

memorycontrolleratCPU

Page 20: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

20

ThreeDataTransferModes

• CPUAccess:AccessthroughCPUchannel– DRAMread/writewithCPUportselection

• IOAccess:AccessthroughIOchannel– DRAMread/writewithIOportselection

• PortBypass:Directtransferbetweenchannels– DRAMaccesswithportbypassselection

Page 21: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

21

1.CPUAccessMode

bank

periphery

CPUchannel

bank

control

port

dataport2

IOchannel

DDMAIOinterface

controlchannelwithportselect

mux

mux

dataport

bankREADY

memorycontrolleratCPU

read

control

port

CPUchanneldataport1

controlchannelwithCPUchannel

Page 22: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

22

2.IOAccessMode

bank

periphery

CPUchannel

bank

control

portIOchannel

DDMAIOinterface

controlchannelwithportselect

mux

mux

dataport1

controlchannelwith IOchannel

memorycontrolleratCPU

IOchannel

dataportdataport2

bankREADY

read

control

port

Page 23: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

23

3.PortBypassMode

bank

periphery

CPUchannel

bank

control

portIOchannel

controlchannelwithportselect

mux

mux

controlchannelwith portbypass

IOchannel

bank

dataport

dataport

dataport2

dataport1

CPUchannel

DDMAIOinterface

memorycontrolleratCPU

Page 24: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

24

Outline1.Problem

3.Dual-Data-PortDRAM

5.Evaluation

4.ApplicationsforDDMA

2.OurApproach

Page 25: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

25

ThreeApplicationsforDDMA

• Communicationb/wComputeUnits– CPU-GPUcommunication

• In-MemoryCommunicationandInitialization– Bulkpagecopy/initialization

• Communicationb/wMemoryandStorage– Servingpagefault/fileread&write

Page 26: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

26

ctrl.channel

DDMActrl.

read

with

IOse

l.

CPU→

GPU

1.ComputeUnit↔ComputeUnitCPU

DDMActrl.

memorycontroller

DDP-DRAM

DDMAIOinterface

GPU

DDMActrl.

memorycontroller

DDP-DRAM

DDMAIOinterface

ctrl.channel

DDMActrl.

destination

DDMAIOinterface

source Ack.destination

DDMAIOinterface

write

with

IOse

l.

TransferdatathroughDDMAwithoutinterferingw/CPU/GPUmemoryaccesses

CPU

memorycontroller

GPU

memorycontroller

Page 27: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

27

ctrl.chan.

read

with

IOse

l.write

with

IOse

l.

2.In-MemoryCommunication

DDMActrl.

CPU

DDMActrl.

memorycontroller

DDP-DRAM

DDMAIOinterface

sourcedestination

TransferdatainDRAMthroughDDAMwithoutinterferingwithCPUmemoryaccesses

CPU

memorycontroller

Page 28: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

28

DDMActrl.

Acc.Storage

Ack.

3.Memory↔Storage

ctrl.chan.

write

with

IOse

l.

CPU

DDMActrl.

memorycontroller

DDP-DRAM

DDMAIOinterface StorageStorage(source)

destination

DDMAIOinterface

TransferdatafromstoragethroughDDMAwithoutinterferingwithCPUmemoryaccesses

destination

CPU

memorycontroller

Page 29: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

29

Outline1.Problem

3.Dual-Data-PortDRAM

5.Evaluation

4.ApplicationsforDDMA

2.OurApproach

Page 30: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

30

EvaluationMethods• System

– Processor:4– 16cores– LLC:16-wayassociative,512KBprivatecache-slice/core– Memory:1– 4ranksand1– 4channels

• Workloads– Memoryintensive:SPECCPU2006,TPC,stream(31benchmarks)

– CPU-GPUcommunicationintensive:polybench (8benchmarks)

– In-memorycommunicationintensive:apache,bootup,compiler,filecopy,mysql,fork,shell,memcached (8intotal)

Page 31: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

31

0%

5%

10%

15%

20%

25%

4-Core 8-Core 16-Core0%

5%

10%

15%

20%

25%

4-Core 8-Core 16-Core

Perfo

rmanceIm

provem

ent

Perfo

rmanceIm

provem

ent

CPU-GPUComm.-Intensive In-MemoryComm.-Intensive

More performance improvementathighercorecountHighperformanceimprovement

Performance(2Channel,2Rank)

Page 32: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

32

PerformanceonVariousSystems

0%5%10%15%20%25%30%35%40%

1rank 2rank 4rank0%5%10%15%20%25%30%35%40%

1ch 2ch 4ch

ChannelCount RankCount

Perfo

rmanceIm

provem

ent

Perfo

rmanceIm

provem

ent

Performanceincreaseswithrankcount

Page 33: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

33

0

200

400

600

800

1000

1200

1ch 1chDDMA

2ch0%20%40%60%80%100%120%140%160%180%

1ch 1chDDMA

2ch

Perfo

rmance

ProcessorP

inCou

nt

DDMAachieveshigherperformanceatlowerprocessorpincount

959 915

1103

DDMAvs.DoublingChannel

Page 34: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

34

Conclusion• Problem

– CPUandIOaccessescontendforthesharedmemorychannel

• OurApproach:DecoupledDirectMemoryAccess(DDMA)– DesignnewDRAMarchitecturewithtwoindependentdataports

àDual-Data-PortDRAM– ConnectoneporttoCPUandtheotherporttoIOdevices

àDecoupleCPUandIOaccesses

• Application– Communicationbetweencomputeunits(e.g.,CPU–GPU)– In-memorycommunication(e.g.,bulkin-memorycopy/init.)– Memory-storagecommunication(e.g.,pagefault,IOprefetch)

• Result– Significantperformanceimprovement(20%in2ch.&2ranksystem)– CPUpincountreduction(4.5%)

Page 35: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

IsolatingCPUandIOTrafficbyLeveragingaDual-Data-PortDRAM

DonghyukLeeLavanya Subramanian,Rachata Ausavarungnirun,

Jongmoo Choi,Onur Mutlu

DecoupledDirectMemoryAccess

Page 36: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

36

SystemOverhead

DDMAreducesmoreexpensiveon-chiparea,whileincreasinglessexpensiveoff-chiparea

processor

DRAM

IOdevices

ConventionalSystem

processor

DDP-DRAM

IOdevicesDDMA-IO

ProposedSystem

LowC

ost

High

Page 37: Decoupled Direct Memory Accessomutlu/pub/decoupled... · Decoupled Direct Memory Access CPU ACCESS. 14 Outline 1. Problem 3. Dual-Data-Port DRAM 5. Evaluation 4. Applications for

37

0%10%20%30%40%50%60%70%80%90%100%

1Channel 2Channel 2Channel 1Rank 2Rank 4Rank

ChannelUtilizationAnalysis

SimultaneousChannelUtilizationàPerformanceImprovement

CPU-GPUCommunication-Intensive

ChannelU

tiliza

tion

CPU IO

CPU IO

CPU IO

CPU IO

CPU IO

CPU IO

0%10%20%30%40%50%60%70%80%90%100%

1Channel 2Channel 2Channel 1Rank 2Rank 4Rank

0%10%20%30%40%50%60%70%80%90%100%

1Channel 2Channel 2Channel 1Rank 2Rank 4Rank

0%10%20%30%40%50%60%70%80%90%100%

1Channel 2Channel 2Channel 1Rank 2Rank 4Rank

BothChannelsBusy SingleChannelBusy

4