Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical...

Memory Hierarchy

L = Latency, C = Capacity, BW = Bandwidth

L: 0.5ns, C: 10MB

L: 50ns, C: 100GB, BW: 100GB/s

L: 10us, C: 2TB, BW: 2GB/s

L: 10ms, C: 4TB, BW: 600MB/s

Controller

DRAM Primer

Address: <bank, row, column>

Page buffer per bank
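The <bank, row, column> addressing can be sketched as a bit-field split of a flat address. This is only an illustration: the field widths and the row | bank | column ordering below are assumptions, not values from the talk, and real controllers choose the mapping to suit their traffic.

```python
from typing import NamedTuple


class DramAddress(NamedTuple):
    bank: int
    row: int
    column: int


# Hypothetical geometry, chosen only for illustration.
COLUMN_BITS = 10
BANK_BITS = 3
ROW_BITS = 14


def decode(addr: int) -> DramAddress:
    """Split a flat word address as row | bank | column (one possible mapping)."""
    column = addr & ((1 << COLUMN_BITS) - 1)
    bank = (addr >> COLUMN_BITS) & ((1 << BANK_BITS) - 1)
    row = (addr >> (COLUMN_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
    return DramAddress(bank, row, column)
```

With this mapping, consecutive addresses stay in one page of one bank until the column field overflows, then move to the next bank.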

DRAM Characteristics

DRAM page crossing

Charges ~10K DRAM cells and bitlines

Increases power & latency

Decreases effective bandwidth

Sequential access vs. random access

Less page crossing

Lower power consumption

4.4x shorter latency

10x better BW
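Why sequential access crosses fewer pages can be sketched with a toy model that keeps one open row per bank and counts how often an access lands outside it. The page size, bank count and address range are assumptions for illustration. The model only counts crossings and does not try to reproduce the latency or bandwidth ratios quoted above.

```python
import random


def count_page_crossings(addresses, page_size=1024, banks=8):
    """Count accesses that miss the open page of their bank (one page buffer per bank)."""
    open_row = {}  # bank -> row currently held in that bank's page buffer
    crossings = 0
    for addr in addresses:
        page = addr // page_size
        bank = page % banks
        row = page // banks
        if open_row.get(bank) != row:
            crossings += 1  # precharge + activate would be needed here
            open_row[bank] = row
    return crossings


n = 100_000
rng = random.Random(0)
seq_crossings = count_page_crossings(range(n))
rand_crossings = count_page_crossings(rng.randrange(1 << 26) for _ in range(n))
```

In this model the sequential stream crosses a page only once per `page_size` accesses, while nearly every random access does.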

Take Away: DRAM = Disk

Embedded Controller

Bad News

None available as in general-purpose processors

Good News

Opportunities for customization

Agenda

Overview

Multi-Port Memory Controller (MPMC) Design

“Out-of-Core” Algorithmic Exploration

Motivating Example: H.264 Decoder

[Figure: H.264 decoder block diagram annotated with memory traffic figures: 6.4, 9.6, 1.2, 164.8, 0.09, 31.0, 156.7, 94 MB/s]

Dynamic latency, BW and power

Diverse QoS requirements

Latency sensitive

Bandwidth sensitive

Wanted

Bandwidth guarantee

Prioritized access

Reduced page crossing

Previous Works

Bandwidth guarantee

• Q0: Distinguish bandwidth guarantee for different classes of ports

• Q1: Distinguish bandwidth guarantee for each port

• Q2: Prioritized access

• Q3: Residual bandwidth allocation

• Q4: Effective DRAM bandwidth

Q0 Q1 Q2 Q3 Q4

[Rixner,00][McKee,00][Hur,04] ✓

[Heithecker,03,05][Whitty,08] ✓ ✓ ✓

[Lee,05] ✓ ✓

[Burchard,05] ✓ ✓

Proposed BCBR ✓ ✓ ✓ ✓

Key Observations

Port locality: requests from the same port tend to hit the same DRAM page

Service time flexibility: 1/24 second to decode a video frame leaves about 4M cycles at 100 MHz for request reordering

Residual bandwidth: statically allocated BW is underutilized at runtime

Weighted round robin: minimum BW guarantee, bursting service

Credit borrow & repay: reorder requests according to priority

Dynamic BW calculation: capture and re-allocate residual BW

T(Rij): arrival time of the j-th request for Qi

Weighted Round Robin

Assume bandwidth requirement Q2: 30% Q1: 50% Q0: 20%

Tround = 10

Time: scheduling cycles

[Figure: timeline over clock cycles 0-9 with rows for request time and service time, showing requests R10-R14 and R00, R01]
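A minimal sketch of weighted round robin with Tround = 10 and the 30% / 50% / 20% split above. The slot order within a round (Q2, then Q1, then Q0) is an assumption; it was chosen because it yields the 8-cycle wait for Q0 mentioned on the "Problem with WRR" slide. The function and variable names are hypothetical.

```python
from collections import deque

# Assumed slot table for one round (Tround = 10): Q2 30%, Q1 50%, Q0 20%.
SLOT_OWNER = ["Q2"] * 3 + ["Q1"] * 5 + ["Q0"] * 2


def wrr(arrivals, cycles):
    """arrivals: {cycle: [(queue, request), ...]}.

    Returns {request: (arrival_cycle, service_cycle)}.
    """
    pending = {q: deque() for q in set(SLOT_OWNER)}
    timing = {}
    for clock in range(cycles):
        for queue, request in arrivals.get(clock, []):
            pending[queue].append((request, clock))
        owner = SLOT_OWNER[clock % len(SLOT_OWNER)]
        if pending[owner]:
            request, arrived = pending[owner].popleft()
            timing[request] = (arrived, clock)
        # If the owner has nothing pending, the slot stays idle in this sketch.
    return timing


timing = wrr({0: [("Q0", "R00"), ("Q1", "R10")]}, cycles=10)
wait_r00 = timing["R00"][1] - timing["R00"][0]
```

Each queue is guaranteed its share of every round, but a request that arrives just after its queue's slots have passed waits for most of a round.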

Problem with WRR

Priority: Q0 > Q2

8 cycles of waiting time! Could be worse!

[Figure: WRR timeline over clock cycles 0-9 with requests R10-R14 and R00, R01]

Borrow Credits

Zero Waiting time for Q0!

[Figure: timeline over clock cycles 0-9; prioritized queue Q0* serves R00 by borrowing slots, and debtQ0 records the lender of each borrowed slot (Q2, Q2, Q2)]

Repay Later

At Q0’s turn, BW guarantee is recovered

[Figure: timeline over clock cycles 0-9; Q0's own slots are used to serve Q2's requests R21 and R22 while debtQ0 (Q2, Q2, Q2) is drained]

Prioritized access!
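The borrow-and-repay idea can be sketched as follows. This is a simplified reading of the slides, not the BCBR hardware design: the slot table, the single prioritized queue, the debt queue depth and all names are assumptions made for illustration.

```python
from collections import deque

SLOT_OWNER = ["Q2"] * 3 + ["Q1"] * 5 + ["Q0"] * 2  # assumed slot table, Tround = 10
PRIORITY = "Q0"                                     # assumed prioritized queue


def borrow_repay(arrivals, cycles, debt_depth=4):
    """arrivals: {cycle: [(queue, request), ...]}. Returns [(cycle, request, action)]."""
    pending = {q: deque() for q in set(SLOT_OWNER)}
    debt = deque()  # lenders that the prioritized queue still owes a slot
    log = []
    for clock in range(cycles):
        for queue, request in arrivals.get(clock, []):
            pending[queue].append(request)
        owner = SLOT_OWNER[clock % len(SLOT_OWNER)]
        if owner != PRIORITY and pending[PRIORITY] and len(debt) < debt_depth:
            # Borrow: the prioritized request takes the slot and the lender is recorded.
            debt.append(owner)
            log.append((clock, pending[PRIORITY].popleft(), "borrow from " + owner))
        elif owner == PRIORITY and debt and pending[debt[0]]:
            # Repay: the prioritized queue hands its own slot to the oldest lender.
            lender = debt.popleft()
            log.append((clock, pending[lender].popleft(), "repay " + lender))
        elif pending[owner]:
            log.append((clock, pending[owner].popleft(), "own slot"))
    return log


log = borrow_repay(
    {0: [("Q0", "R00"), ("Q2", "R20"), ("Q2", "R21"), ("Q2", "R22")]}, cycles=10
)
```

In this run R00 is served at cycle 0 instead of cycle 8, and Q2 still receives three slots in the round because one of Q0's slots is given back to it.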

Problem: Depth of DebtQ

DebtQ as residual BW collector

BW allocated to Q0 increases to: 20% + residual BW

Requirement for the depth of DebtQ0 decreases

[Figure: timeline over clock cycles 0-9 with arrival times T(R2) for Q2, T(R1) for Q1 and T(R0) for Q0*, debtQ0 holding Q2 entries, and the annotation "Help repay"]

Evaluation Framework

Simulation Framework

Workload: ALPBench suite

DRAMSim: simulates DRAM latency + BW + power

Reference schedulers: PQ, RR, WRR, BGPQ

Bandwidth Guarantee

Bandwidth guarantees: P0: 2%, P1: 30%, P2: 20%, P3: 20%, P4: 20%

System residual: 8%

Scheduler  Port 0  Port 1  Port 2  Port 3  Port 4
RR         1.08%   24%     24%     24%     24%
PQ         0.73%   80%     18%     0%      0%
BGPQ       1.07%   39%     20%     20%     20%
WRR        0.76%   33%     22%     22%     22%
BCBR       0.76%   33%     22%     22%     22%

No BW guarantee

Provides BW guarantee!

Cache Response Latency

Average 16x faster than WRR

As fast as PQ (prioritized access)

DRAM Energy & BW Efficiency

30% less page crossing (compared to RR)

1.4x more energy efficient

1.2x higher effective DRAM BW

As good as WRR (exploit port locality)

               RR     BGPQ   WRR    BCBR
GB/J           0.298  0.289  0.412  0.411
Act-Pre Ratio  29.6%  30.1%  23.0%  23.0%
Improvement    1.0x   0.97x  1.38x  1.38x
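The "Improvement" row appears to be the GB/J figure of each scheduler divided by that of RR. A quick check under that assumption, using only the numbers in the table:

```python
# GB/J values copied from the table above; RR is taken as the baseline.
gb_per_joule = {"RR": 0.298, "BGPQ": 0.289, "WRR": 0.412, "BCBR": 0.411}

improvement = {
    name: round(value / gb_per_joule["RR"], 2)
    for name, value in gb_per_joule.items()
}
```

The rounded ratios match the table: 1.0, 0.97, 1.38 and 1.38.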

Hardware Cost

Design                               LUTs  Registers  BRAMs
Xilinx MPMC (frontend + backend)     3450  5540       1-9
BCBR + Speedy                        3379  2264       4
  BCBR (frontend)                    1393  884        0
  Speedy DDRMC (reference backend)   1986  1380       4

Better performance without higher cost!

Agenda

Overview

Multi-Port Memory Controller (MPMC) Design

“Out-of-Core” Algorithm / Architecture Exploration

Remember DRAM=DISK

So let’s

Ask the same questions

Plug in DRAM parameters

Get DRAM-specific answers

Out-of-core algorithms

Data does not fit in DRAM

Performance dominated by IO

Key questions

Reduce #IOs

Block granularity

Motivating Example: CDN

Caches in CDN

Get closer to users

Save bandwidth

Zipf’s law: 80-20 rule for hits
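The link between Zipf's law and the 80-20 rule can be checked numerically. The sketch below assumes request popularity proportional to 1/rank^s with s = 1 and a catalogue of 1000 items; both values are illustrative and do not come from the talk.

```python
def zipf_top_share(n_items, top_fraction=0.2, s=1.0):
    """Share of requests that go to the most popular top_fraction of items under Zipf(s)."""
    weights = [1.0 / rank ** s for rank in range(1, n_items + 1)]
    top = max(1, int(n_items * top_fraction))
    return sum(weights[:top]) / sum(weights)


share = zipf_top_share(1000)
```

Under these assumptions the top 20% of items attract a large majority of the requests, which is what makes a small cache effective.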

Video Cache

Defining the Knobs

Transaction: a number of column access commands enclosed by row activation / precharge

W: burst size

s: # bursts

Function of array organization & timing params

Function of algorithmic parameters

D-nary Heap

Algorithmic Design Variable: Branching Factor

Record Size
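To make the branching factor concrete, here is a minimal array-based min-heap with a configurable number of children per node. It is a generic textbook structure rather than the design evaluated in the talk; the class name and default branching factor are hypothetical. A larger branching factor gives a shallower tree, at the price of scanning more children per level, and those children are contiguous in the array.

```python
class DaryHeap:
    """Array-based min-heap in which every node has up to d children."""

    def __init__(self, d=4):
        if d < 2:
            raise ValueError("branching factor must be at least 2")
        self.d = d
        self.items = []

    def push(self, item):
        items = self.items
        items.append(item)
        i = len(items) - 1
        while i > 0:  # sift up
            parent = (i - 1) // self.d
            if items[parent] <= items[i]:
                break
            items[parent], items[i] = items[i], items[parent]
            i = parent

    def pop(self):
        items = self.items
        top = items[0]  # raises IndexError on an empty heap
        last = items.pop()
        if items:
            items[0] = last
            i, n = 0, len(items)
            while True:  # sift down
                first = self.d * i + 1
                if first >= n:
                    break
                # The children of node i occupy one contiguous slice of the array.
                child = min(range(first, min(first + self.d, n)), key=items.__getitem__)
                if items[i] <= items[child]:
                    break
                items[i], items[child] = items[child], items[i]
                i = child
        return top
```

Because each sift-down step reads one contiguous group of d records, the branching factor and record size together determine how much data is touched per level.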

B+ Tree

Lessons Learned

Optimal result can be beautifully derived!

Big O does not matter in some cases

Depending on data input characteristics
