Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3...

52
The Case for Embedded NoCs on FPGAs Mohamed ABDELFATTAH Vaughn BETZ

Transcript of Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3...

Page 1: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

The Case for Embedded NoCs on FPGAs

Mohamed ABDELFATTAHVaughn BETZ

Page 2: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

2

Outline

Why NoCs on FPGAs?

Embedded NoCs

Area & Power Analysis

1

2

3

Comparison Against P2P/Buses4

Page 3: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

3

Interconnect

Motivation1. Why NoCs on FPGAs?

Logic Blocks

Switch Blocks

Wires

Page 4: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

4

Motivation1. Why NoCs on FPGAs?

Logic Blocks

Switch Blocks

Wires

Hard Blocks:• Memory• Multiplier• Processor

Page 5: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

5

Motivation1. Why NoCs on FPGAs?

Logic Blocks

Switch Blocks

Wires

Hard InterfacesDDR/PCIe ..

Interconnect still the same

Hard Blocks:• Memory• Multiplier• Processor

1600 MHz

200 MHz

800 MHz

Page 6: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

6

MotivationDDR3 PHY and Controller

Problems:1. Bandwidth requirements for

hard logic/interfaces2. Timing closure

1. Why NoCs on FPGAs?PCIe Controller

Gigabit Ethernet

1600 MHz

200 MHz

800 MHz

Page 7: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

7

MotivationDDR3 PHY and Controller

Problems:1. Bandwidth requirements for

hard logic/interfaces2. Timing closure3. High interconnect utilization:

– Huge CAD Problem– Slow compilation– Power/area utilization

4. Wire speed not scaling:– Delay is interconnect-dominated

1. Why NoCs on FPGAs?PCIe Controller

Gigabit Ethernet

Page 8: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

Barcelona Los Angeles

Keep the “roads”, but add “freeways”.

Hard Blocks

Logic Cluster

Source: Google Earth

Page 9: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

9

DDR3 PHY and Controller

1. Why NoCs on FPGAs?PCIe Controller

Gigabit Ethernet

Problems:1. Bandwidth requirements for

hard logic/interfaces2. Timing closure3. High interconnect utilization:

– Huge CAD Problem– Slow compilation– Power/area utilization

4. Wire speed not scaling:– Delay is interconnect-dominated

FPGA with NoCNoC

Routers

Links Router forwards data packet

Router moves data to local interconnect

Page 10: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

10

DDR3 PHY and Controller

1. Why NoCs on FPGAs?PCIe Controller

Gigabit Ethernet

Problems:1. Bandwidth requirements for

hard logic/interfaces2. Timing closure3. High interconnect utilization:

– Huge CAD Problem– Slow compilation– Power/area utilization

4. Wire speed not scaling:– Delay is interconnect-dominated

5. Abstraction favours modularity:– Parallel compilation– Partial reconfiguration– Multi-chip interconnect

FPGA with NoC

Pre-design NoC to requirements NoC links are “re-usable” NoC is heavily “pipelined” NoC abstraction favors modularity

High bandwidth endpoints known

Page 11: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

11

DDR3 PHY and Controller

1. Why NoCs on FPGAs?PCIe Controller

Gigabit Ethernet

FPGA with NoC

Latency-tolerant communication NoC abstraction favors modularity

Problems:1. Bandwidth requirements for

hard logic/interfaces2. Timing closure3. High interconnect utilization:

– Huge CAD Problem– Slow compilation– Power/area utilization

4. Wire speed not scaling:– Delay is interconnect-dominated

5. Abstraction favours modularity:– Parallel compilation– Partial reconfiguration– Multi-chip interconnect

NoCs can simplify FPGA design

Does the NoC abstraction come at a high area/power cost?

How to integrate NoCs in FPGAs?

How do embedded NoCs compare to current interconnects?

Page 12: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

12

OutlineWhy NoCs on FPGAs?

Embedded NoCs

Area & Power Analysis

1

2

3

Mixed NoCs Hard NoCs

Comparison Against P2P/Buses4

Page 13: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

Embedded NoCsFPGA

DD

Rx In

terf

ace

PCIe

Inte

rfac

e

Router

Compute Module

Links(Hard or Soft)

Fabric

Port

(Hard or Soft)

2. Embedded NoCs

“Mixed” NoC

“Hard” NoC

Soft LinksHard Routers

Hard LinksHard Routers =++

=“Soft” NoCSoft LinksSoft Routers + =

Page 14: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

14

Soft Hard

FPGA CAD Tools ASIC CAD Tools

Design Compiler

Area

Speed

Power?Power

Methodology

Toggle rates

Gate-level simulation Gate-level simulation

Mixed

HSPICE

Page 15: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

15

Router Logic

Programmable Interconnect

FPGA

Router

Mixed NoCs2. Embedded NoCs

Logic blocks

Baseline Router

Programmable“soft” interconnect

Width VCs Ports Buffer

32 2 5 10/VC

“Mixed” NoCSoft LinksHard Routers + =

Page 16: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

16

Router Logic

Programmable Interconnect

FPGA

Router

Mixed NoCs2. Embedded NoCs

Router Logic

16“Mixed” NoCSoft LinksHard Routers + =

Page 17: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

17

Router Logic

Programmable Interconnect

Router

Assumed a mesh Can form any topology

FPGA

Mixed NoCs2. Embedded NoCs

Special FeatureConfigurable topology

Page 18: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

18

Router Logic

Dedicated Interconnect

FPGA

Router

Hard NoCs2. Embedded NoCs

Logic blocks

Dedicated “hard” interconnect

Programmable“soft” interconnect

18“Hard” NoCHard LinksHard Routers + =

Page 19: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

19

Router Logic

Dedicated Interconnect

FPGA

Router

Hard NoCs2. Embedded NoCs

Router Logic

19“Hard” NoCHard LinksHard Routers + =

Page 20: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

20

Router Logic

Dedicated Interconnect

FPGA

Router

Hard NoCs2. Embedded NoCs

Low-V mode

1.1 V0.9 V

Save 33% Dynamic Power

Special Feature

~15% slower

20“Hard” NoCHard LinksHard Routers + =

Page 21: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

21

Fabric Port2. Embedded NoCs

21

Router

Compute Module

Links(Hard or Soft)

Fabric

Port

(Hard or Soft)• Width adaptation

• Frequency adaptation

• Voltage adaptation

Bridge NoC and FPGA fabric:

• Bus protocol e.g. AXI

Page 22: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

22

OutlineWhy NoCs on FPGAs?

Embedded NoCs

1

2

Area & Power Analysis

Soft vs. mixed vs.Hard

3

System Area/Power

Comparison Against P2P/Buses4

Page 23: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

23

Router Microarchitecture

State-of-the-art router architecture from Stanford:1. NoC community have excelled at building on-chip routers:

We just use it2. To meet FPGA bandwidth requirements:

High-performance router3. Complex functionality such as virtual channels:

Assigning traffic priority could be useful

3. Area/Power Analysis

Input Modules Output Modules

Virtual Channel (VC) Allocator

Switch Allocator

Crossbar Switch

1

5

1

5

Page 24: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

Routers and Links

24

3. Area/Power Analysis

Hard Router vs. Soft Router

9X smaller, 2.4X faster, 1.4X lower power

30X smaller, 6X faster, 14X lower power

Hard Links vs. Soft Links

Page 25: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

Soft, Mixed and Hard

Mixed Hard Soft

Speed

Speed

Bisection BW

~ 1.5% of FPGA33% of FPGA

730 – 940 MHz166 MHz

~ 50 GB/s~ 10 GB/s

64 –

NoC

[65 nm]

3. Area/Power Analysis

576 LBs~12,500 LBsArea

448 LBs

64-node NoC on Stratix III

Page 26: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

Soft, Mixed and Hard

Mixed Hard (Low-V)Soft

Speed

Speed

Bisection BW

~ 1.5% of FPGA33% of FPGA

730 – 940 MHz166 MHz

~ 50 GB/s~ 10 GB/s

64 –

NoC

[65 nm]

3. Area/Power Analysis

576 LBs~12,500 LBsArea

448 LBs

Provides ~50GB/s peak bisection bandwidth

Very Cheap! Less than cost of 3 soft nodes

64-node NoC on Stratix III

Page 27: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

28

NoC Power BudgetSoft NoC Mixed NoC Hard NoC Hard NoC (Low-V)

17.4 W

250 GB/s total bandwidth

Typical FPGA Dynamic Power

123%How much is used for system-level communication?

3. Area/Power Analysis

Largest Stratix-III device

Page 28: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

29

NoC Power BudgetSoft NoC Mixed NoC Hard NoC Hard NoC (Low-V)

17.4 W

NoC

250 GB/s total bandwidth 15%

Typical FPGA Dynamic Power

3. Area/Power Analysis

123%

Page 29: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

30

NoC Power Budget

NoC

17.4 WTypical FPGA

Dynamic Power

Soft NoC Mixed NoC Hard NoC Hard NoC (Low-V)250 GB/s total bandwidth 15%123% 11%

3. Area/Power Analysis

Page 30: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

31

NoC Power Budget

NoC

17.4 WTypical FPGA

Dynamic Power

Soft NoC Mixed NoC Hard NoC Hard NoC (Low-V)250 GB/s total bandwidth 15%123% 11% 7%

3. Area/Power Analysis

Page 31: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

32

Bandwidth in Perspective

14.6 GB/s

14.6 GB/s

14.6 GB/s

14.6 GB/s

17 G

B/s

17 G

B/s

17 G

B/s

17 G

B/s

DDR3 Module 1

PCIe Module 2

Full theoretical BW

126 GB/sAggregate Bandwidth

3.5%NoC Power Budget

Cross whole chip!

3. Area/Power Analysis

Page 32: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

33

OutlineWhy NoCs on FPGAs?

Embedded NoCs

1

2

Area &Power Analysis

Point-to-point links

3

Comparison Against P2P/Buses4

Qsys Buses

Page 33: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

34

FPGA Interconnect

1 1

Point-to-point Links

Broadcast

1 1

n

Multiple Masters

1

1Mux + Arbiter

n

Multiple Masters, Multiple Slaves

1 1Mux + Arbiter

n nMux + Arbiter

Interconnect = Just wires Interconnect = Wires + Logic Interconnect = NoC

1 .. .. ..

.. .. .. ..

.. .. ..

.. .. .. n

..Compare “wires” interconnect to NoCs

4. Comparison

Page 34: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

35

NoC Power vs. FPGA Interconnect

Hard and Mixed NoCs Area/Power Efficient

Length of 1 NoC Link1 % area overhead on Stratix 5

Runs at 730-943 MHz

Power on-par with simplest FPGA interconnect

200 MHz

High Performance / Packet Switched

4. Comparison

Page 35: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

36

DDR3: Qsys Bus vs. NoC4. Comparison

Qsys bus: Build logical bus from fabric

Embedded NoC: 16 Nodes, hard routers & links

Page 36: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

37

Design Effort4. Comparison

• Steps to close timing using Qsys

close

FPGA

Page 37: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

38

Design Effort4. Comparison

• Steps to close timing using Qsys

far

FPGA

Page 38: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

39

Design Effort4. Comparison

• Steps to close timing using Qsys

far

FPGA

Timing closure can be simplified with an embedded NoC

Page 39: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

40

Area Comparison4. Comparison

Page 40: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

41

Area Comparison4. Comparison

Page 41: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

42

Area Comparison4. Comparison

Entire NoC smaller than bus for 3 modules!

Page 42: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

43

Area Comparison4. Comparison

1/8 Hard NoC BW used already less area for most systems

Page 43: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

44

Power Comparison4. Comparison

Hard NoC saves power for even the simplest systems

Page 44: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

1

2

3

Big city needs freeways to handle traffic

Area: 20-23X

Why NoCs on FPGAs?

Embedded NoCs: Mixed & Hard

Area & Power Analysis

Speed: 5-6X Power: 9-15X

• Area Budget for 64 nodes: ~1%• Power Budget for 100 GB/s: 3-7%

Comparison Against P2P/Buses4• Raw efficiency close to simplest P2P links• NoC more efficient & lower design effort

Page 45: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

46

eecg.utoronto.ca/~mohamed/noc_designer.html

Page 46: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

47

Thank You!

eecg.utoronto.ca/~mohamed/noc_designer.html

Page 47: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

48

200 MHz 128-bit module, 900 MHz 32-bit router? Configurable time-domain mux / demux: match bandwidth Asynchronous FIFO: cross clock domains Full NoC bandwidth, w/o clock restrictions on modules

2. Embedded NoCs

Fabric Port

Page 48: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

49

1. Why NoCs on FPGAs?

Compute Acceleration

• Maxeler• Geoscience (14x, 70x)• Financial analysis (5x, 163x)

• Altera OpenCL• Video compression (3x, 114x)• Information filtering (5.5x)

GPU CPU

Page 49: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

50

1. Why NoCs on FPGAs?

Compute Acceleration

Page 50: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

51

1. Why NoCs on FPGAs?

Compute Acceleration

Page 51: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.

52

1. Why NoCs on FPGAs?

Compute Acceleration

NoC

Page 52: Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis 1 1 2 2 3 3 Comparison Against P2P/Buses 4 4.