The Case for Embedded NoCs on FPGAs
Mohamed Abdelfattah, Vaughn Betz

Outline
1. Why NoCs on FPGAs?
2. Embedded NoCs
3. Area & Power Analysis
4. Comparison Against P2P/Buses

1. Why NoCs on FPGAs?

Motivation
  Interconnect: logic blocks, switch blocks, wires
  Hard blocks: memory, multiplier, processor
  Hard interfaces: DDR, PCIe, ...
  Interconnect still the same
[Figure: FPGA with a DDR3 PHY and controller, a PCIe controller and Gigabit Ethernet; clock labels 1600 MHz, 800 MHz, 200 MHz]

Problems:
  Bandwidth requirements for hard logic/interfaces
  Timing closure
High interconnect utilization:
  Huge CAD problem
  Slow compilation
  Power/area utilization
Wire speed not scaling:
  Delay is interconnect-dominated

Barcelona, Los Angeles: keep the roads, but add freeways.
[Figure: satellite images of Barcelona and Los Angeles; FPGA labels: Hard Blocks, Logic Cluster. Source: Google Earth]

FPGA with NoC
[Figure: FPGA with an embedded NoC of routers and links, alongside the DDR3 PHY and controller, PCIe controller and Gigabit Ethernet]
  Router forwards data packet
  Router moves data to local interconnect

Against the problems listed earlier (bandwidth requirements, timing closure, high interconnect utilization, wire speed not scaling), the FPGA with NoC offers:
  High-bandwidth endpoints known
  Pre-design NoC to requirements
  NoC links are re-usable
  NoC is heavily pipelined
  Latency-tolerant communication
  NoC abstraction favors modularity:
    Parallel compilation
    Partial reconfiguration
    Multi-chip interconnect

NoCs can simplify FPGA design.
  Does the NoC abstraction come at a high area/power cost?
  How to integrate NoCs in FPGAs?
  How do embedded NoCs compare to current interconnects?

2. Embedded NoCs (mixed NoCs, hard NoCs)

Embedded NoCs
  Soft NoC  = soft links + soft routers
  Mixed NoC = soft links + hard routers
  Hard NoC  = hard links + hard routers

Methodology
  Soft: FPGA CAD tools
  Hard: ASIC CAD tools (Design Compiler)
  Measured: area, speed, power
  Power methodology: toggle rates, gate-level simulation
  Mixed: HSPICE

Mixed NoCs
Mixed NoC = soft links + hard routers
[Figure: FPGA with hard routers placed among the logic blocks and connected through the programmable soft interconnect]

Baseline router:
  Width   32
  VCs     2
  Ports   5
  Buffer  10/VC

Assumed a mesh, but the soft links can form any topology.
Special feature: configurable topology

Hard NoCs
Hard NoC = hard links + hard routers
[Figure: FPGA with hard routers connected by dedicated hard interconnect, alongside the logic blocks and the programmable soft interconnect]

Special feature: Low-V mode
  1.1 V lowered to 0.9 V
  Saves 33% dynamic power
  ~15% slower

Fabric Port
Bridges the NoC and the FPGA fabric:
  Width adaptation
  Frequency adaptation
  Voltage adaptation
  Bus protocol, e.g. AXI

3. Area & Power Analysis (soft vs. mixed vs. hard, system area/power)

Router Microarchitecture
State-of-the-art router architecture from Stanford:
  The NoC community has excelled at building on-chip routers: we just use it.
  To meet FPGA bandwidth requirements: high-performance router.
  Complex functionality such as virtual channels: assigning traffic priority could be useful.
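
As a rough plausibility check on the Low-V mode quoted for hard NoCs, the first-order CMOS model (dynamic power proportional to C * V^2 * f) gives about the same 33% saving when only the supply voltage changes from 1.1 V to 0.9 V. The function name and the assumption of unchanged capacitance, activity and clock frequency are illustrative and do not come from the slides.

```python
def dynamic_power_saving(v_nominal, v_low, freq_ratio=1.0):
    """Fractional dynamic-power saving under the first-order model P ~ C * V^2 * f.

    Assumption (not from the slides): switched capacitance and switching
    activity are unchanged. freq_ratio is f_low / f_nominal; the default of
    1.0 compares both voltages at the same clock frequency.
    """
    if v_nominal <= 0 or v_low <= 0 or freq_ratio <= 0:
        raise ValueError("voltages and frequency ratio must be positive")
    return 1.0 - (v_low / v_nominal) ** 2 * freq_ratio


# Voltages taken from the Low-V slide.
saving = dynamic_power_saving(1.1, 0.9)
print(f"dynamic power saving at equal frequency: {saving:.1%}")
```

Passing a freq_ratio below 1.0 models the slower clock as well; the slide's 33% figure matches the equal-frequency case, which is also the saving per transferred bit under this model.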

Routers and Links
  Hard router vs. soft router: 30X smaller, 6X faster, 14X lower power
  Hard links vs. soft links: 9X smaller, 2.4X faster, 1.4X lower power

Soft, Mixed and Hard (average gaps)
               Mixed         Hard (Low-V)      Soft
  Area gap     20X smaller   23X smaller       1X
  Speed gap    5X faster     6X faster         1X
  Power gap    9X less       11X (15X) less    1X

64-node NoC on Stratix III [65 nm]
                 Mixed      Hard       Soft
  Area           576 LBs    448 LBs    ~12,500 LBs
  Share of FPGA  ~1.5% (mixed and hard)   33%
  Speed          730 MHz    940 MHz    166 MHz
  Bisection BW   ~50 GB/s (mixed and hard)   ~10 GB/s
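
The bisection-bandwidth and area figures for the 64-node NoC can be roughly reproduced with a small calculation. This is a sketch under stated assumptions: an 8 x 8 mesh (the slides say a mesh was assumed but do not give its shape), one 32-bit link per direction between neighbouring routers, and one word per link per cycle. The function names are made up for illustration.

```python
def mesh_bisection_bandwidth_gbytes(side, link_width_bits, freq_mhz):
    """Peak bisection bandwidth of a side x side mesh, in GB/s.

    Assumptions (not stated on the slides): square mesh, one link per
    direction between neighbouring routers, one full-width word per cycle.
    """
    links_cut = 2 * side  # 'side' router pairs straddle the cut, two directions each
    bytes_per_cycle = links_cut * link_width_bits / 8
    return bytes_per_cycle * freq_mhz * 1e6 / 1e9


def soft_node_equivalents(noc_area_lbs, soft_noc_area_lbs=12500, nodes=64):
    """Area of a whole NoC expressed in units of one soft NoC node."""
    return noc_area_lbs / (soft_noc_area_lbs / nodes)


# Clock speeds and areas are the ones quoted for the 64-node NoC on Stratix III.
for name, mhz in (("soft", 166), ("mixed", 730), ("hard", 940)):
    print(name, round(mesh_bisection_bandwidth_gbytes(8, 32, mhz), 1), "GB/s")
for name, area in (("mixed", 576), ("hard", 448)):
    print(name, round(soft_node_equivalents(area), 2), "soft nodes")
```

Under these assumptions the soft NoC lands near the quoted ~10 GB/s and the mixed and hard NoCs bracket the quoted ~50 GB/s. Both embedded NoCs come out below three soft-node equivalents.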

64-node NoC on Stratix III:
  Provides ~50 GB/s peak bisection bandwidth
  Very cheap! Less than the cost of 3 soft nodes

NoC Power Budget
How much is used for system-level communication?
Typical FPGA dynamic power: 17.4 W (largest Stratix III device)
Share of that power needed for 250 GB/s total bandwidth:
  Soft NoC           123%
  Mixed NoC          15%
  Hard NoC           11%
  Hard NoC (Low-V)   7%

Bandwidth in Perspective
[Figure: DDR3, PCIe, Module 1 and Module 2 connected across the whole chip; four links labelled 14.6 GB/s and four labelled 17 GB/s]
  Full theoretical BW: 126 GB/s aggregate bandwidth
  NoC power budget: 3.5%
  Cross whole chip!

4. Comparison Against P2P/Buses (point-to-point links, Qsys buses)

FPGA Interconnect
  Point-to-point links, broadcast (1 to n): interconnect = just wires
  Multiple masters; multiple masters, multiple slaves (mux + arbiter): interconnect = wires + logic
  1 ... n: interconnect = NoC
  Compare wires interconnect to NoCs.

NoC Power vs. FPGA Interconnect
[Figure labels: length of 1 NoC link; 200 MHz]
  Hard and mixed NoCs are area/power efficient:
    1% area overhead on Stratix V
    Runs at 730-943 MHz
    Power on par with the simplest FPGA interconnect
    High performance, packet switched

DDR3: Qsys Bus vs. NoC
  Qsys bus: build a logical bus from the fabric
  Embedded NoC: 16 nodes, hard routers and links

Design Effort
[Figure: steps to close timing using Qsys, with modules placed close together and then far apart on the FPGA]
  Timing closure can be simplified with an embedded NoC.

Area Comparison
  Entire NoC smaller than bus for 3 modules!
  With 1/8 of the hard NoC bandwidth used, already less area for most systems.

Power Comparison
  Hard NoC saves power for even the simplest systems.

1. Why NoCs on FPGAs?
     Big city needs freeways to handle traffic
2. Embedded NoCs: mixed and hard
3. Area & Power Analysis
     Area: 20-23X, speed: 5-6X, power: 9-15X
     Area budget for 64 nodes: ~1%
     Power budget for 100 GB/s: 3-7%
4. Comparison Against P2P/Buses
     Raw efficiency close to simplest P2P links
     NoC more efficient and lower design effort
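
The power-budget figures can be tied together with a short calculation. The sketch below assumes that NoC power scales linearly with the bandwidth carried, which the slides do not state explicitly. Under that assumption the 7% quoted for the Low-V hard NoC at 250 GB/s scales to roughly the 3.5% quoted for the 126 GB/s aggregate. All names are illustrative.

```python
TYPICAL_DYNAMIC_POWER_W = 17.4    # from the slides: largest Stratix III device
REFERENCE_BANDWIDTH_GBS = 250.0   # bandwidth at which the shares below are quoted
SHARE_AT_REFERENCE = {            # fraction of typical FPGA dynamic power
    "soft": 1.23,
    "mixed": 0.15,
    "hard": 0.11,
    "hard_low_v": 0.07,
}


def noc_power_share(noc, bandwidth_gbs):
    """Share of typical dynamic power needed to carry bandwidth_gbs.

    Assumption (not from the slides): power scales linearly with bandwidth.
    """
    return SHARE_AT_REFERENCE[noc] * bandwidth_gbs / REFERENCE_BANDWIDTH_GBS


# Link bandwidths from the "Bandwidth in Perspective" slide.
aggregate_gbs = 4 * 14.6 + 4 * 17.0
share = noc_power_share("hard_low_v", aggregate_gbs)
print(f"aggregate bandwidth: {aggregate_gbs:.1f} GB/s")
print(f"Low-V hard NoC share: {share:.1%} ({share * TYPICAL_DYNAMIC_POWER_W:.2f} W)")
for noc in ("mixed", "hard", "hard_low_v"):
    print(noc, f"{noc_power_share(noc, 100.0):.1%} at 100 GB/s")
```

With the same linear assumption, the embedded NoCs need between roughly 3% and 6% of the typical dynamic power at 100 GB/s, in line with the 3-7% range quoted above.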

Thank You!
eecg.utoronto.ca/~mohamed/noc_designer.html

Fabric Port (2. Embedded NoCs)
200 MHz, 128-bit module and 900 MHz, 32-bit router?
  Configurable time-domain mux/demux: match bandwidth
  Asynchronous FIFO: cross clock domains
  Full NoC bandwidth, without clock restrictions on modules
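
The width and frequency adaptation on this slide can be sketched in software. The model below is purely illustrative, not the authors' hardware design: it splits one wide module word into narrow flits (time-domain mux), reassembles them (demux), and checks that the router side has enough bandwidth. The asynchronous FIFO that crosses clock domains is not modelled, and all function names are hypothetical.

```python
def bandwidth_gbits(width_bits, freq_mhz):
    """Raw bandwidth of a port in Gb/s."""
    return width_bits * freq_mhz / 1000.0


def min_router_freq_mhz(module_width_bits, module_freq_mhz, router_width_bits):
    """Lowest router clock that still carries the module's full bandwidth."""
    return module_width_bits * module_freq_mhz / router_width_bits


def tdm_split(word, word_bits=128, flit_bits=32):
    """Time-domain mux: cut one wide word into flits, least significant first."""
    if word_bits % flit_bits != 0:
        raise ValueError("word width must be a multiple of the flit width")
    if not 0 <= word < (1 << word_bits):
        raise ValueError("word does not fit in word_bits")
    mask = (1 << flit_bits) - 1
    return [(word >> (flit_bits * i)) & mask for i in range(word_bits // flit_bits)]


def tdm_merge(flits, flit_bits=32):
    """Time-domain demux: rebuild the wide word from its flits."""
    word = 0
    for i, flit in enumerate(flits):
        word |= flit << (flit_bits * i)
    return word


# Numbers from the slide: 200 MHz, 128-bit module and 900 MHz, 32-bit router.
module_bw = bandwidth_gbits(128, 200)
router_bw = bandwidth_gbits(32, 900)
print(module_bw, router_bw, min_router_freq_mhz(128, 200, 32))

word = 0x0123456789ABCDEF_FEDCBA9876543210
flits = tdm_split(word)
assert tdm_merge(flits) == word
```

In this model the router port has slightly more bandwidth than the module needs, so the module can run at its own clock without throttling, which is the point the slide makes.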

Compute Acceleration (1. Why NoCs on FPGAs?)
  Maxeler: geoscience (14x, 70x), financial analysis (5x, 163x)
  Altera OpenCL: video compression (3x, 114x), information filtering (5.5x)
[Figure labels: GPU, CPU, NoC]