The Case for Embedded NoCs on FPGAs


The Case for Embedded NoCs on FPGAs. Mohamed ABDELFATTAH, Vaughn BETZ. Outline: 1. Why NoCs on FPGAs? 2. Embedded NoCs. 3. Area & Power Analysis. 4. Comparison Against P2P/Buses.

Transcript of The Case for Embedded NoCs on FPGAs

Design Tradeoffs for Hard and Soft FPGA-based Networks-on-Chip

The Case for Embedded NoCs on FPGAs
Mohamed ABDELFATTAH, Vaughn BETZ

Outline
1. Why NoCs on FPGAs?
2. Embedded NoCs
3. Area & Power Analysis
4. Comparison Against P2P/Buses

1. Why NoCs on FPGAs?

Motivation
- FPGA fabric: Logic Blocks, Switch Blocks, Wires (the interconnect)
- Hard Blocks: Memory, Multiplier, Processor
- Hard Interfaces: DDR/PCIe ...
- Interconnect still the same
- Figure labels: 1600 MHz, 200 MHz, 800 MHz

Motivation
Figure labels: DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet

Problems:
- Bandwidth requirements for hard logic/interfaces
- Timing closure
- High interconnect utilization: huge CAD problem, slow compilation, power/area utilization
- Wire speed not scaling: delay is interconnect-dominated

Barcelona / Los Angeles
Keep the roads, but add freeways.
Figure labels: Hard Blocks, Logic Cluster. Source: Google Earth

FPGA with NoC
- NoC = Routers + Links
- Router forwards data packet
- Router moves data to local interconnect

Abstraction favors modularity:
- Parallel compilation
- Partial reconfiguration
- Multi-chip interconnect

FPGA with NoC
- Pre-design NoC to requirements
- NoC links are re-usable
- NoC is heavily pipelined
- NoC abstraction favors modularity
- High bandwidth endpoints known
- Latency-tolerant communication

NoCs can simplify FPGA design
- Does the NoC abstraction come at a high area/power cost?
- How to integrate NoCs in FPGAs?
- How do embedded NoCs compare to current interconnects?

Outline
1. Why NoCs on FPGAs?
2. Embedded NoCs: Mixed NoCs, Hard NoCs
3. Area & Power Analysis
4. Comparison Against P2P/Buses

2. Embedded NoCs

- Soft NoC = Soft Routers + Soft Links
- Mixed NoC = Hard Routers + Soft Links
- Hard NoC = Hard Routers + Hard Links

Methodology
- Soft: FPGA CAD Tools
- Hard: ASIC CAD Tools (Design Compiler)
- Metrics: Area, Speed, Power
- Power methodology: toggle rates, gate-level simulation; Mixed: HSPICE

Mixed NoCs
Mixed NoC = Hard Routers + Soft Links
Figure labels: FPGA, Router, Logic blocks, Programmable soft interconnect

Baseline Router
  Width (bits)   VCs   Ports   Buffer
  32             2     5       10/VC

Special Feature: Configurable topology
- Assumed a mesh
- Can form any topology
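The mesh assumption can be made concrete with a short sketch. The code below enumerates the links of a rows x cols mesh and counts ports per router, assuming one port per mesh neighbour plus one local port to the fabric (the function names and the node numbering are illustrative, not taken from the slides). Under that assumption an interior router needs 5 ports, which agrees with the baseline router above.

```python
from itertools import product


def mesh_links(rows, cols):
    """Return the bidirectional links of a rows x cols mesh as (node, node) pairs.

    Nodes are numbered row by row: node = r * cols + c (an illustrative choice).
    """
    links = []
    for r, c in product(range(rows), range(cols)):
        node = r * cols + c
        if c + 1 < cols:              # link to the east neighbour
            links.append((node, node + 1))
        if r + 1 < rows:              # link to the south neighbour
            links.append((node, node + cols))
    return links


def router_ports(rows, cols):
    """Ports per router: one per mesh neighbour plus one local (fabric) port."""
    ports = [1] * (rows * cols)       # start with the local port
    for a, b in mesh_links(rows, cols):
        ports[a] += 1
        ports[b] += 1
    return ports


if __name__ == "__main__":
    links = mesh_links(8, 8)          # 64 nodes arranged as 8 x 8 (assumed)
    ports = router_ports(8, 8)
    print(len(links), "links; ports per router range from",
          min(ports), "to", max(ports))
```

With soft links, the same hard routers could be wired into a different topology simply by changing which pairs appear in the link list.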

Hard NoCs
Hard NoC = Hard Routers + Hard Links
Figure labels: FPGA, Router, Logic blocks, Dedicated hard interconnect, Programmable soft interconnect

Special Feature: Low-V mode
- Supply lowered from 1.1 V to 0.9 V
- Save 33% Dynamic Power
- ~15% slower
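The 33% figure for Low-V mode is consistent with the usual first-order CMOS model in which dynamic power scales with V^2 * f. The slides do not show the derivation, so the sketch below is an assumption about where the number comes from; the function name is illustrative.

```python
def dynamic_power_ratio(v_new, v_old, f_ratio=1.0):
    """First-order CMOS model: P_dyn is proportional to C * V^2 * f.

    Returns new power divided by old power for a supply change from v_old
    to v_new and a clock scaled by f_ratio (1.0 means the same clock).
    """
    return (v_new / v_old) ** 2 * f_ratio


if __name__ == "__main__":
    same_clock = dynamic_power_ratio(0.9, 1.1)
    print(f"saving at the same clock: {1 - same_clock:.1%}")

    # If the clock is also lowered by the ~15% speed penalty, power drops further.
    slower_clock = dynamic_power_ratio(0.9, 1.1, f_ratio=0.85)
    print(f"saving with a 15% slower clock: {1 - slower_clock:.1%}")
```

At an unchanged clock the ratio is (0.9 / 1.1)^2, about 0.67, which matches the quoted saving of roughly a third.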

Fabric Port
Bridge NoC and FPGA fabric:
- Width adaptation
- Frequency adaptation
- Voltage adaptation
- Bus protocol, e.g. AXI

Outline
1. Why NoCs on FPGAs?
2. Embedded NoCs
3. Area & Power Analysis: Soft vs. Mixed vs. Hard, System Area/Power
4. Comparison Against P2P/Buses

3. Area/Power Analysis

Router Microarchitecture
State-of-the-art router architecture from Stanford:
- The NoC community has excelled at building on-chip routers: we just use it
- To meet FPGA bandwidth requirements: high-performance router
- Complex functionality such as virtual channels: assigning traffic priority could be useful

Routers and Links
- Hard Router vs. Soft Router: 30X smaller, 6X faster, 14X lower power
- Hard Links vs. Soft Links: 9X smaller, 2.4X faster, 1.4X lower power

Soft, Mixed and Hard (average gaps)
              Soft   Mixed         Hard (Low-V)
  Area Gap    1X     20X smaller   23X smaller
  Speed Gap   1X     5X faster     6X faster
  Power Gap   1X     9X less       11X (15X) less

Soft, Mixed and Hard: 64-node NoC on Stratix III [65 nm]
                 Soft          Mixed           Hard (Low-V)
  Area           ~12,500 LBs   576 LBs         448 LBs
  Share of FPGA  33%           ~1.5%           ~1.5%
  Speed          166 MHz       730-940 MHz (Mixed and Hard)
  Bisection BW   ~10 GB/s      ~50 GB/s (Mixed and Hard)
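The 64-node figures can be cross-checked with simple arithmetic. The sketch below assumes an 8 x 8 mesh with 32-bit links and counts links in both directions across the middle cut; none of these details is stated on the slide, so treat them as assumptions. It also checks the deck's remark that the whole embedded NoC costs less than three soft nodes.

```python
NODES = 64
SOFT_LBS, MIXED_LBS, HARD_LBS = 12_500, 576, 448   # logic blocks, from the slide


def bisection_gb_per_s(freq_mhz, mesh_dim=8, width_bits=32):
    """Peak bisection bandwidth of a mesh_dim x mesh_dim mesh in GB/s.

    Assumes mesh_dim links cross the middle cut in each direction and every
    link moves width_bits per cycle.
    """
    crossing_links = 2 * mesh_dim
    bytes_per_cycle = width_bits / 8
    return crossing_links * bytes_per_cycle * freq_mhz * 1e6 / 1e9


if __name__ == "__main__":
    print(f"soft  @ 166 MHz: {bisection_gb_per_s(166):.1f} GB/s")
    print(f"mixed @ 730 MHz: {bisection_gb_per_s(730):.1f} GB/s")
    print(f"hard  @ 940 MHz: {bisection_gb_per_s(940):.1f} GB/s")

    three_soft_nodes = 3 * SOFT_LBS / NODES
    print(f"three soft nodes: {three_soft_nodes:.0f} LBs "
          f"vs {MIXED_LBS} (mixed) and {HARD_LBS} (hard) for the whole NoC")
```

Under these assumptions the soft NoC lands near 10 GB/s and the embedded NoCs between roughly 47 and 60 GB/s, in line with the rounded values in the table.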

64-node NoC on Stratix III
- Provides ~50 GB/s peak bisection bandwidth
- Very cheap! Less than cost of 3 soft nodes

NoC Power Budget
How much is used for system-level communication?
- Typical FPGA dynamic power: 17.4 W (largest Stratix-III device)
- NoC power at 250 GB/s total bandwidth, as a share of typical FPGA dynamic power:
    Soft NoC           123%
    Mixed NoC          15%
    Hard NoC           11%
    Hard NoC (Low-V)   7%
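To put the budget percentages in absolute terms, the sketch below converts each share of the 17.4 W typical dynamic power into watts and into energy per byte at the quoted 250 GB/s. The pairing of percentages with NoC types follows the order in which they appear on the slides; the dictionary and variable names are illustrative.

```python
TYPICAL_DYNAMIC_W = 17.4      # typical FPGA dynamic power quoted on the slide
TOTAL_BW_GB_S = 250           # total NoC bandwidth the percentages refer to

# Share of typical dynamic power per NoC type (pairing assumed from slide order).
SHARE = {
    "Soft NoC": 1.23,
    "Mixed NoC": 0.15,
    "Hard NoC": 0.11,
    "Hard NoC (Low-V)": 0.07,
}

watts = {name: share * TYPICAL_DYNAMIC_W for name, share in SHARE.items()}

# W divided by GB/s gives nJ per byte; multiply by 1000 for pJ per byte.
pj_per_byte = {name: w / TOTAL_BW_GB_S * 1000 for name, w in watts.items()}

if __name__ == "__main__":
    for name in SHARE:
        print(f"{name:18s} {watts[name]:6.2f} W  {pj_per_byte[name]:6.2f} pJ/byte")
```

The soft NoC alone would exceed the whole typical dynamic power budget, while the hard variants stay around one to two watts.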

Bandwidth in Perspective
- Figure labels: DDR3, PCIe, Module 1, Module 2
- Link bandwidths shown: 4 x 14.6 GB/s and 4 x 17 GB/s (full theoretical BW)
- Aggregate bandwidth: 126 GB/s
- NoC power budget: 3.5%
- Cross whole chip!

Outline
1. Why NoCs on FPGAs?
2. Embedded NoCs
3. Area & Power Analysis
4. Comparison Against P2P/Buses: Point-to-point links, Qsys Buses

4. Comparison

FPGA Interconnect
- Point-to-point Links (1 to 1), Broadcast (1 to n): Interconnect = Just wires
- Multiple Masters (n to 1), Multiple Masters and Multiple Slaves (n to n), each with Mux + Arbiter: Interconnect = Wires + Logic
- Interconnect = NoC
- Compare wires interconnect to NoCs

NoC Power vs. FPGA Interconnect
- Hard and Mixed NoCs are area/power efficient
- Length of 1 NoC link
- 1% area overhead on Stratix 5
- Runs at 730-943 MHz
- Power on-par with simplest FPGA interconnect
- Figure labels: 200 MHz; High Performance / Packet Switched
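As a sanity check on the Bandwidth in Perspective numbers earlier in this section, the sketch below sums the eight link bandwidths and scales the Low-V hard NoC budget (7% at 250 GB/s) down to that aggregate. Linear scaling of NoC power with delivered bandwidth is an assumption made here, not something the slides state.

```python
LINK_BW_GB_S = [14.6] * 4 + [17.0] * 4   # link bandwidths shown on the slide

REFERENCE_BW_GB_S = 250                   # bandwidth at which budgets are quoted
LOW_V_SHARE_AT_REFERENCE = 0.07           # Hard NoC (Low-V) share at 250 GB/s

aggregate = sum(LINK_BW_GB_S)

# Assumption: NoC dynamic power scales linearly with the bandwidth it carries.
scaled_share = LOW_V_SHARE_AT_REFERENCE * aggregate / REFERENCE_BW_GB_S

if __name__ == "__main__":
    print(f"aggregate bandwidth: {aggregate:.1f} GB/s")
    print(f"scaled Low-V power budget: {scaled_share:.1%}")
```

Under that assumption the aggregate comes to about 126 GB/s and the scaled budget to about 3.5%, which agrees with the values quoted on the slide.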

DDR3: Qsys Bus vs. NoC
- Qsys bus: Build logical bus from fabric
- Embedded NoC: 16 Nodes, hard routers & links

Design Effort
- Steps to close timing using Qsys (figure labels: close, far, FPGA)
- Timing closure can be simplified with an embedded NoC

Area Comparison
- Entire NoC smaller than bus for 3 modules!
- Even with only 1/8 of the hard NoC bandwidth used, the NoC already takes less area for most systems

Power Comparison
- Hard NoC saves power for even the simplest systems

Summary
1. Why NoCs on FPGAs?
   - Big city needs freeways to handle traffic
2. Embedded NoCs: Mixed & Hard
3. Area & Power Analysis
   - Area: 20-23X, Speed: 5-6X, Power: 9-15X
   - Area Budget for 64 nodes: ~1%
   - Power Budget for 100 GB/s: 3-7%
4. Comparison Against P2P/Buses
   - Raw efficiency close to simplest P2P links
   - NoC more efficient & lower design effort

Thank You!
eecg.utoronto.ca/~mohamed/noc_designer.html

2. Embedded NoCs

Fabric Port
- 200 MHz 128-bit module, 900 MHz 32-bit router?
- Configurable time-domain mux / demux: match bandwidth
- Asynchronous FIFO: cross clock domains
- Full NoC bandwidth, w/o clock restrictions on modules
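The Fabric Port notes describe width adaptation by time-domain multiplexing. The sketch below is a behavioural model only, with hypothetical function names: it splits one 128-bit module word into four 32-bit flits, reassembles them, and compares raw bandwidth on both sides. The least-significant-first flit order is an arbitrary choice for illustration, and the asynchronous FIFO that crosses clock domains is not modelled.

```python
def split_word(word, module_bits=128, router_bits=32):
    """Time-domain demux: split one wide module word into narrow flits.

    Flits are emitted least-significant first (an arbitrary choice here).
    """
    if module_bits % router_bits != 0:
        raise ValueError("module width must be a multiple of router width")
    mask = (1 << router_bits) - 1
    return [(word >> (i * router_bits)) & mask
            for i in range(module_bits // router_bits)]


def join_flits(flits, router_bits=32):
    """Time-domain mux in the other direction: rebuild the wide word."""
    word = 0
    for i, flit in enumerate(flits):
        word |= flit << (i * router_bits)
    return word


def bandwidth_gbit_per_s(freq_mhz, width_bits):
    """Raw bandwidth of a port in Gbit/s."""
    return freq_mhz * 1e6 * width_bits / 1e9


if __name__ == "__main__":
    word = 0x0123456789ABCDEF_FEDCBA9876543210
    flits = split_word(word)
    assert join_flits(flits) == word

    module_bw = bandwidth_gbit_per_s(200, 128)
    router_bw = bandwidth_gbit_per_s(900, 32)
    print(f"{len(flits)} flits per word; module {module_bw:.1f} Gbit/s, "
          f"router {router_bw:.1f} Gbit/s")
```

Because four 32-bit flits at 900 MHz carry more than one 128-bit word at 200 MHz, the narrow, fast router port can absorb the full bandwidth of the wide, slow module in this example.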

1. Why NoCs on FPGAs?

Compute Acceleration
- Maxeler: Geoscience (14x, 70x), Financial analysis (5x, 163x)
- Altera OpenCL: Video compression (3x, 114x), Information filtering (5.5x)
- Figure labels: GPU, CPU, NoC