OpenSoC Fabric: An On-Chip Network Generator
Farzad Fatollahi-Fard, Dave Donofrio, George Michelogiannakis, John Shalf
2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2016)
April 17-19, 2016. Uppsala, Sweden.


1. Motivation
2. OpenSoC Fabric
3. What is Chisel?
4. OpenSoC Fabric Breakdown
5. Results
6. Conclusion and Future Work

Motivation
Why Are We Doing This?

‣ Want to build and model candidate future HPC chip multiprocessors
‣ Parallelism is growing at an exponential rate
‣ Data movement dominates power costs
‣ Network topology greatly affects application performance

Reference: "An analysis of on-chip interconnection networks for large-scale chip multiprocessors," ACM Transactions on Architecture and Code Optimization (TACO), April 2010

What Interconnect Provides the Performance? Is it Open Source?
What tools exist to answer these questions?

What tools exist for SoC research
What tools do we have to evaluate large, complex networks of cores?

‣ Software models
  • Fast to create, but plagued by long runtimes as system size increases
‣ Hardware emulation
  • Fast, accurate evaluation that scales with system size, but suffers from long development time

Reference: "A complexity-effective architecture for accelerating full-system multiprocessor simulations using FPGAs," FPGA 2008

Comparison of NoCs
Software Tools

Tool | Language | Accuracy | Verification | Drawbacks
Booksim | C++ | Cycle-Accurate | RTL | Long runtimes limit simulation size
Garnet | C++ (GEM5) | Event-Driven | Other Simulators | Not fast enough for larger simulations (1K+ cores)
NoCTweak | SystemC | Cycle-Accurate | RTL | Long runtimes limit simulation size
PhoenixSim | OMNeT++ | Event-Driven | Other Simulators | For photonic on-chip networks
Topaz | C++ (GEM5) | Cycle-Accurate | Other Simulators | Not fast enough for larger simulations (1K+ cores)

Comparison of NoCs
Hardware Tools

Tool | Language | Features | Open Source? | Drawbacks
Stanford NoC Router | Verilog | Long list of Verilog parameters | Yes | Hard to configure
CONNECT | Bluespec SystemVerilog | Completely customizable via website | Yes (noncommercial) | Designed for FPGAs
ARM CoreLink | Pre-generated IP | Up to clusters of 48 cores | No | Designed for ARM cores (not design space exploration); for "small" designs; cache coherent
Arteris FlexNoC | Pre-generated IP | Tool optimized for VLSI design | No | Full parameters unknown


OpenSoC Fabric
An Open-Source, Flexible, Parameterized NoC Generator

‣ Part of the CoDEx tool suite
‣ Written in Chisel
‣ Dimensions, topology, VCs all configurable
‣ Fast functional C++ model for functional validation
‣ Verilog-based description for FPGA or ASIC
  • Synthesis path enables accurate power / energy modeling

[Figure: OpenSoC Fabric at the center, connected through AXI interfaces to CPU(s), HMC, 10GbE, and PCIe]

Current Status
Version 1.1.2 Released

‣ Multiple topologies
  • Mesh
  • Flattened Butterfly
‣ Wormhole flow control
‣ Virtual channels
‣ Run through both ASIC and FPGA tools
‣ Available for download
  • www.opensocfabric.org


Chisel: A New Hardware DSL
Using Scala to construct Verilog and C++ descriptions

‣ Chisel provides both software and hardware models from the same codebase
‣ Object-oriented hardware development
  • Allows definition of structs and other high-level constructs
‣ Powerful libraries and components ready to use
‣ Working processors fabricated using Chisel

[Figure: Chisel flow. Labels: Scala, Chisel; Software Compilation (C++ Simulation, SystemC Simulation); Hardware Compilation (Verilog for FPGA and ASIC)]

Recent Chisel Designs
Chisel code successfully boots Linux

• First tape-out in 2012
• Raven core taped out in 2014 (28nm)

[Figure: chip with labeled Clock test site, SRAM test site, DCDC test site, and Processor site]

Chisel Overview
How does Chisel work?

‣ Not "Scala to Gates"
‣ Describe hardware functionality
‣ Chisel creates graph representation
  • Flattened
‣ Each node translated to Verilog or C++

Algebraic Graph Construction: Mux(x > y, x, y)
[Figure: graph in which x and y feed a > node, whose output selects between x and y in a Mux node]
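A minimal sketch of the graph-construction idea, in plain Scala with no Chisel dependency. The node classes (Wire, GreaterThan, Mux) and the emit function are hypothetical names made up for illustration; they are not Chisel's internal classes. The point is only that an expression such as Mux(x > y, x, y) builds a data structure rather than computing a value, and that each node can then be translated to a backend string.

```scala
// Hypothetical node types for illustration; these are not Chisel's internal classes.
sealed trait Node {
  // Overloaded operator: builds a graph node instead of comparing values.
  def >(that: Node): Node = GreaterThan(this, that)
}
final case class Wire(name: String) extends Node
final case class GreaterThan(lhs: Node, rhs: Node) extends Node
final case class Mux(sel: Node, ifTrue: Node, ifFalse: Node) extends Node

object GraphDemo {
  // Walk the graph and emit a Verilog-style expression for each node.
  def emit(node: Node): String = node match {
    case Wire(name)        => name
    case GreaterThan(a, b) => s"(${emit(a)} > ${emit(b)})"
    case Mux(s, t, f)      => s"(${emit(s)} ? ${emit(t)} : ${emit(f)})"
  }

  val x: Node = Wire("x")
  val y: Node = Wire("y")
  // Reads like ordinary Scala, but evaluates to a small graph of nodes.
  val max: Node = Mux(x > y, x, y)
}
```

Calling GraphDemo.emit(GraphDemo.max) walks the graph once; a second emitter over the same nodes could target C++ instead.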


OpenSoC – Top Level Diagram


OpenSoC – Functional Hierarchy

[Figure: hierarchy diagram. Labels recovered: Top-Level; Network Interface; Topology; Injection/Ejection; AXI; AHB; FIFO; Router; Routing Function; Mesh; Flattened Butterfly; Torus; Switch; Allocator; Arbiter; Round Robin; Priority; Cyclic]


Configuring
Parameters

‣ OpenSoC configured at run time through Parameters class
  • Declared at top level, submodules can add / change parameters tree
‣ Not limited to just numerical values
  • Leverage Scala to pass functions to parameterize module creation
    - Example: Routing Function constructor passed as parameter to router
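A rough sketch of passing a constructor as a parameter, in plain Scala. Everything here (Params, RoutingFunction, RingRouting, the key names "numRouters" and "routingFunctionCtor") is a hypothetical stand-in chosen for illustration, not the OpenSoC Fabric API. It shows a router that never names a concrete routing class and instead calls whatever constructor function it finds in its parameters.

```scala
// Hypothetical stand-ins; names and signatures are assumptions, not the OpenSoC Fabric API.
object FunctionParamDemo {
  // A routing function maps (current router, destination) to an output port number.
  trait RoutingFunction { def route(current: Int, dest: Int): Int }

  // Toy ring routing: port 0 = eject, 1 = clockwise, 2 = counterclockwise.
  class RingRouting(size: Int) extends RoutingFunction {
    def route(current: Int, dest: Int): Int = {
      val clockwiseHops = (dest - current + size) % size
      if (clockwiseHops == 0) 0 else if (clockwiseHops <= size / 2) 1 else 2
    }
  }

  // Parameters hold arbitrary values, including constructor functions.
  class Params(values: Map[String, Any]) {
    def get[T](key: String): T = values(key).asInstanceOf[T]
  }

  // The router calls the constructor it was handed instead of naming a class.
  class Router(parms: Params) {
    private val numRouters = parms.get[Int]("numRouters")
    private val makeRouting = parms.get[Int => RoutingFunction]("routingFunctionCtor")
    val routing: RoutingFunction = makeRouting(numRouters)
  }

  val parms = new Params(Map(
    "numRouters" -> 4,
    "routingFunctionCtor" -> ((n: Int) => new RingRouting(n))
  ))
  val router = new Router(parms)
}
```

Swapping in a different routing scheme then means changing one map entry, with no edits to Router.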

Configuring
Parameters

‣ All OpenSoC Modules take a Parameters class as a constructor argument
‣ Setting parameters:
  • parms.child("MySwitch", Map(("numInPorts" -> Soft(8)), ("numOutPorts" -> Soft(3))))
‣ Getting a parameter:
  • val numInPorts = parms.get[Int]("numInPorts")
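To make the child / get calls concrete, here is a small self-contained model of a parameter tree in plain Scala. It is a sketch under assumptions: this Parameters class is written from scratch for illustration, the fallback-to-parent lookup is an assumed behavior, and Soft is treated as a plain value wrapper because the slide does not define its semantics.

```scala
object ParamTreeDemo {
  // Assumption: Soft is a plain value wrapper; the slide does not define its semantics.
  final case class Soft(value: Any)

  // Minimal parameter tree: a child checks its own entries first, then its parent's.
  class Parameters(name: String, local: Map[String, Soft], parent: Option[Parameters]) {
    def child(childName: String, overrides: Map[String, Soft]): Parameters =
      new Parameters(childName, overrides, Some(this))

    def get[T](key: String): T = local.get(key) match {
      case Some(Soft(v)) => v.asInstanceOf[T]
      case None => parent match {
        case Some(p) => p.get[T](key)
        case None    => throw new NoSuchElementException(s"$key not found from $name")
      }
    }
  }

  val top = new Parameters("Top", Map("numVCs" -> Soft(2)), None)
  val switchParms = top.child("MySwitch", Map(
    ("numInPorts" -> Soft(8)),
    ("numOutPorts" -> Soft(3))
  ))
  val numInPorts: Int = switchParms.get[Int]("numInPorts")
  val numVCs: Int = switchParms.get[Int]("numVCs") // not set on the child, found on the parent
}
```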

Developing
Incredibly Fast Development Time

‣ Modules have a standard interface that you inherit
‣ Development of modules is very quick
  • Flattened Butterfly took 2 hours of development

abstract class VCRouter(parms: Parameters) extends Module(parms) {
  // Port counts and VC count come from the parameter tree
  val numInChannels  = parms.get[Int]("numInChannels")
  val numOutChannels = parms.get[Int]("numOutChannels")
  val numVCs         = parms.get[Int]("numVCs")
  // Standard interface shared by every router implementation
  val io = new Bundle {
    val inChannels  = Vec.fill(numInChannels) { new ChannelVC(parms) }
    val outChannels = Vec.fill(numOutChannels) { new ChannelVC(parms).flip() }
  }
}

class SimpleVCRouter(parms: Parameters) extends VCRouter(parms) {
  // Implementation
}


OpenSoC – Top Level Modules
Topology

‣ Stitches routers together
‣ Assigns each router an individual ID
‣ Assigns Routing Function to routers
‣ Passes down Arbitration scheme
‣ Connects Injection and Ejection Queues for network endpoints
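The first two Topology responsibilities, assigning IDs and stitching routers together, can be sketched in plain Scala for a 2D mesh. This is a software model, not hardware: RouterNode, buildMesh, the row-major ID scheme, and the direction names are all assumptions made for illustration.

```scala
object MeshTopologyDemo {
  // Hypothetical router record: an ID, grid position, and neighbour IDs keyed by direction.
  final case class RouterNode(id: Int, x: Int, y: Int, neighbors: Map[String, Int])

  // Build a width x height mesh: assign IDs row by row, then link adjacent routers.
  def buildMesh(width: Int, height: Int): Vector[RouterNode] = {
    def idOf(x: Int, y: Int): Int = y * width + x
    val nodes = for {
      y <- 0 until height
      x <- 0 until width
    } yield {
      val candidates = Seq(
        ("east", x + 1, y),
        ("west", x - 1, y),
        ("north", x, y - 1),
        ("south", x, y + 1)
      )
      // A mesh has no wraparound links, so off-grid neighbours are dropped.
      val links = candidates.collect {
        case (dir, nx, ny) if nx >= 0 && nx < width && ny >= 0 && ny < height =>
          dir -> idOf(nx, ny)
      }.toMap
      RouterNode(idOf(x, y), x, y, links)
    }
    nodes.toVector
  }
}
```

In a 4x4 mesh built this way, corner routers end up with two links, edge routers with three, and interior routers with four.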


Results – Traffic Patterns
4x4 DOR, Single Concentration, Dual Virtual Channel Mesh Network
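For readers unfamiliar with DOR (dimension-order routing), a minimal sketch in plain Scala follows. It assumes row-major router IDs with Y growing towards South, and the Port names are made up here; the slides do not show how OpenSoC Fabric implements its routing function.

```scala
object DorDemo {
  sealed trait Port
  case object East extends Port
  case object West extends Port
  case object North extends Port
  case object South extends Port
  case object Eject extends Port

  // Dimension-order (XY) routing on a mesh: finish the X dimension before moving in Y.
  // Assumption: router IDs are row-major and Y grows towards South.
  def route(current: Int, dest: Int, width: Int): Port = {
    val (cx, cy) = (current % width, current / width)
    val (dx, dy) = (dest % width, dest / width)
    if (dx > cx) East
    else if (dx < cx) West
    else if (dy > cy) South
    else if (dy < cy) North
    else Eject
  }

  // Follow the routing decisions hop by hop and return the visited router IDs.
  def path(src: Int, dest: Int, width: Int): List[Int] = {
    @scala.annotation.tailrec
    def loop(cur: Int, acc: List[Int]): List[Int] = route(cur, dest, width) match {
      case Eject => (cur :: acc).reverse
      case East  => loop(cur + 1, cur :: acc)
      case West  => loop(cur - 1, cur :: acc)
      case South => loop(cur + width, cur :: acc)
      case North => loop(cur - width, cur :: acc)
    }
    loop(src, Nil)
  }
}
```

On a 4x4 mesh, DorDemo.path(0, 15, 4) travels along the top row first and only then turns down the last column.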

Results – Average Latency

Compared to Booksim | OpenSoC Fabric (Software) | OpenSoC Fabric (Hardware)
Uniform | +1.86% | +8.37%
Tornado | +0.84% | +0.42%
Transpose | +7.37% | +8.29%
Neighbor | +0.84% | +6.28%
Bit Reverse | +1.85% | +10.6%
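The deterministic traffic patterns named in the table can be sketched as source-to-destination functions on a 4x4 mesh. These follow common textbook definitions and assume row-major node IDs; the exact definitions used in the experiments are not given on the slides and may differ. Uniform traffic is omitted because it picks destinations at random.

```scala
object TrafficPatternDemo {
  val k = 4    // routers per dimension (4x4 mesh)
  val bits = 4 // address bits for k * k = 16 nodes

  // Transpose: swap the x and y coordinates of a row-major node ID.
  def transpose(src: Int): Int = (src % k) * k + (src / k)

  // Bit reverse: destination address is the source address with its bits reversed.
  def bitReverse(src: Int): Int =
    (0 until bits).foldLeft(0) { (acc, i) => (acc << 1) | ((src >> i) & 1) }

  // Shift both coordinates by `offset`, wrapping around at the mesh edge.
  private def shift(src: Int, offset: Int): Int = {
    val (x, y) = (src % k, src / k)
    ((y + offset) % k) * k + ((x + offset) % k)
  }

  // Neighbor: one step in each dimension.
  def neighbor(src: Int): Int = shift(src, 1)

  // Tornado: ceil(k/2) - 1 steps in each dimension.
  def tornado(src: Int): Int = shift(src, (k + 1) / 2 - 1)
}
```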


Results – Latency and Utilization

Nearest Neighbor Traffic Pattern

Results – Application Traces

Compared to Booksim | OpenSoC
AMR, avg latency | -2.42%
MiniDFT, avg latency | -28.3%
AMG, avg latency | +16.3%
AMR, execution time | -2.19%
MiniDFT, execution time | -5.25%
AMG, execution time | +130.8%


Future additions
Towards a full set of features

‣ Upgrade OpenSoC Fabric to use Chisel 3
‣ A collection of topologies and routing functions
‣ Standardized interfaces at the endpoints
‣ Power modeling in the C++ model

Conclusion

‣ This is an open-source community-driven infrastructure
  • We are counting on your contributions


Acknowledgements

‣ UCB Chisel

‣ US Dept of Energy

‣ Laboratory for Physical Sciences

‣ Ke Wen

‣ Columbia LRL

‣ John Bachan

More Information
http://opensocfabric.org