Wed0900 Celerity - An Open Source 511-core RISC-V Tiered Accelerator Fabric

Celerity: An Open Source 511-core RISC-V Tiered Accelerator Fabric Prof. Michael Taylor Bespoke Silicon Group University of Washington http://www.opencelerity.org


Page 1:

Celerity: An Open Source 511-core RISC-V Tiered Accelerator Fabric

Prof. Michael Taylor, Bespoke Silicon Group

University of Washington, http://www.opencelerity.org

Page 2:

Outline

The Celerity Open Source RISC-V Tiered Accelerator Fabric: Fast Architectural Design Methodologies for Fast Chips

BaseJump: Designing the DNA for Open Source ASICs

BaseJump Manycore: “Open Source for GPU”

http://www.opencelerity.org

http://www.bjump.org

http://www.bjump.org/manycore

Page 3:

The Celerity Open Source RISC-V Tiered Accelerator Fabric: Fast Architectural Design Methodologies for Fast Chips

Ritchie Zhao, Chun Zhao, Shaolin Xie, Bandhav Veluri, Luis Vega, Christopher Torng, Ningxiao Sun, Austin Rovinski, Anuj Rao, Gai Liu, Paul Gao, Scott Davidson, Steve Dai, Aporva Amarnath, Khalid Al-Hawaj, Tutu Ajayi

Christopher Batten, Ronald G. Dreslinski, Rajesh K. Gupta, Michael B. Taylor, Zhiru Zhang

Page 4:

Building with the RISC-V Software/Hardware Ecosystem

Software Toolchain
• A complete, off-the-shelf software stack (e.g., binutils, GCC, newlib/glibc, Linux kernel & distros) for both embedded and general-purpose

Architecture
• RISC-V ISA specification designed to be both modular and extensible, with a small base ISA and optional extensions

Microarchitecture
• On-chip network specifications and implementations (NASTI, TileLink)
• RISC-V processor implementations for both in-order (Berkeley Rocket) and out-of-order (Berkeley BOOM) cores

Physical Design
• Previous spins of chips for reference

Testing
• Standard core verification test suites + turn-key FPGA gateware

[Figure: computing stack, from Application, Algorithm, Programming Language, Operating System, and Compilers through Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Gate-Level, and Circuits to Devices and Technology]

Page 5:

The Celerity System-on-Chip

Celerity, an accelerator-centric SoC with a tiered accelerator fabric that targets highly performant and energy-efficient embedded systems

Funded by the DARPA CRAFT program, “Circuit Realization At Faster Timescales”

The goal was to develop new methodologies to design chips more quickly

We leveraged the RISC-V software/hardware ecosystem as we built Celerity, and we believe it was instrumental in enabling a team of 20 graduate students to tape out a complex SoC in only 9 months

[Figure: Celerity block diagram. General-Purpose Tier: five RISC-V Rocket cores, each with I-Cache, D-Cache, NASTI, and RoCC. Manycore Tier: RISC-V Vanilla-5 cores, each with I Mem, D Mem, XBAR, and NoC router. Specialization Tier. BaseJump FSB and Motherboard.]

Page 6:

Celerity: Chip Overview

• TSMC 16nm FFC
• 25 mm2 die area (5 mm x 5 mm)
• ~385 million transistors
• 511 RISC-V cores
  • 5 Linux-capable RV64G Berkeley Rocket cores
  • 496-core RV32IM mesh tiled array “manycore”
  • 10-core RV32IM mesh tiled array (low voltage)
• Binarized Neural Network Specialized Accelerator
• On-chip synthesizable PLLs and DC/DC LDO
  • Developed in-house
• 3 clock domains
  • 400 MHz – DDR I/O
  • 625 MHz – Rocket core + Specialized accelerator
  • 1.05 GHz – Manycore array
• 672-pin flip chip BGA package
• 9 months from PDK access to tape-out

http://www.opencelerity.org
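The 511-core figure is the sum of the three core groups listed above. A minimal sketch of that bookkeeping follows; the struct and names are illustrative only and do not come from the Celerity sources.

```cpp
// Core counts as listed on the chip-overview slide.
// The type and field names are hypothetical, chosen for this sketch.
struct CelerityCores {
    int rocket;       // Linux-capable RV64G Berkeley Rocket cores
    int manycore;     // RV32IM mesh tiled array ("manycore")
    int low_voltage;  // RV32IM low-voltage mesh tiled array
};

constexpr int total_cores(const CelerityCores& c) {
    return c.rocket + c.manycore + c.low_voltage;
}

constexpr CelerityCores kCelerity{5, 496, 10};

// Compile-time check that the tiers add up to the headline number.
static_assert(total_cores(kCelerity) == 511, "tiers should sum to 511 cores");
```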

Page 7:

Agenda

• Introduction
• For each tier:
  • What did we build?
  • How did we build it?
  • RISC-V ecosystem successes
  • RISC-V ecosystem challenges
• Conclusion

[Figure: Celerity block diagram]

Page 8:

Celerity: General-Purpose Tier

[Figure: Celerity block diagram]

Page 9:

General-Purpose Tier Overview

• 5 Berkeley Rocket Cores (RV64G)
• Workload
  • General-purpose compute
  • Operating system (e.g. Linux & TCP/IP Stack)
  • Interrupt and exception handling
  • Program dispatch and control flow
• Interface
  • Interface to off-chip I/O and other peripherals
  • 4 cores connect to the manycore array
  • 1 core interfaces with the BNN
• Memory
  • Each core executes independently within its own address space
  • Memory management for all tiers

[Figure: five RISC-V Rocket cores (I-Cache, D-Cache, NASTI, RoCC) connected to the Manycore, the BNN, and the BaseJump FSB and Motherboard]

Page 10:

Berkeley Rocket Cores

• 5 Berkeley Rocket Cores (https://github.com/freechipsproject/rocket-chip)
  • Generated from Chisel
  • RV64G ISA
  • 5-stage, in-order, scalar processor
  • Double-precision floating point
  • I-Cache: 16KB 4-way assoc.
  • D-Cache: 16KB 4-way assoc.
• Physical Implementation
  • ~900 MHz
  • 5 cores per mm2

http://www.lowrisc.org/docs/tagged-memory-v0.1/rocket-core/

Page 11:

Design Iterations

1. Loopback: Baseline design to validate FSB and Northbridge
2. Alpaca: Implemented NASTI bridge and connected Rocket core
3. Bison: Implemented accelerator connected through blackboxed RoCC
4. Coyote: Modularized RoCC interface to accelerator

[Figure: block diagrams of the four iterations, each attached to the BaseJump Motherboard through the BaseJump FSB: a loopback FIFO; a RISC-V Rocket core over NASTI; a Rocket core with a RoCC accelerator; a Rocket core with RoCC and accelerator]

Page 12:

Off-Chip Interface and Northbridge

• Open-source BaseJump IP Library
  • http://bjump.org
• Front Side Bus
  • BaseJump Communication Link
  • High-speed (DDR) source-synchronous communication interface
• Packaging
  • Modified BaseJump BGA Package and I/O Ring
• Validation
  • BaseJump SuperTrouble PCB (Daughter Card)
  • BaseJump Motherboard (ZedBoard)

[Figure: BaseJump Motherboard (DRAM controller, Ethernet, SSD, L2 $, JTAG, clocks, BaseJump FPGA Bridge) connected to the Celerity SoC (BaseJump FSB & FPGA Bridge, five RISC-V Rocket cores with NASTI and RoCC)]

Page 13:

RISC-V Successes

• Berkeley Rocket Cores
  • Very quickly generated validated designs
  • Vibrant ecosystem to provide feedback and support
  • Test and validation infrastructure
  • Software and toolchain support
• Flexible memory system and peripheral I/O support
  • Easy integration with BaseJump IP Library
• Balances extensibility and software support

Page 14:

Outline

The Celerity Open Source RISC-V Tiered Accelerator Fabric: Fast Architectures and Design Methodologies for Fast Chips

BaseJump: Designing the DNA for Open Source ASICs

BaseJump Manycore: “Open Source for GPU”

http://www.opencelerity.org

http://www.bjump.org

http://www.bjump.org/manycore

Page 15:

Celerity: Manycore Tier (BaseJump Manycore)

Developed by Taylor’s Bespoke Silicon Group @ UW

[Figure: Celerity block diagram]

http://bjump.org/manycore

Page 16:

BaseJump Manycore architecture

The Vanilla core: simple but efficient to run C code without any toolchain modification

• ISA: RV32IM
• Pipeline: 5-stage, fully forwarded, in-order, single issue
• Scratchpad memory: 4KB for IMem, 4KB for DMem
• Second tape-out of this tiled architecture (10-core)

[Figure: mesh of 496 RISC-V cores; each tile holds a RISC-V core, IMEM, DMEM, a memory crossbar, and a NoC router]

Page 17:

BaseJump Manycore Mesh Network

• 80-bit forward links
  • Single-flit
  • <XY_dest, XY_src, data>
  • Parameterized field sizes
• 10-bit reverse links
  • Routes XY_src back to src. Allows fences.
• Router
  • Simple XY-dimension routing
  • 2-element buffering per input port
  • No virtual channels
  • Tiny
  • In-order delivery
  • Deadlock free

[Figure: forward packet, forward response, reverse packet, and reverse response paths; buffered router; tilelink protocol]
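The routing decision described above can be sketched in a few lines of C++. This is a software illustration of generic dimension-ordered (XY) routing, not the BaseJump router RTL: the type names, the 32-bit payload, and the axis orientation (x grows east, y grows south) are all assumptions made for the example.

```cpp
#include <cstdint>

enum class Port { Local, East, West, North, South };

struct Coord { int x; int y; };

// Single-flit forward packet in the slide's <XY_dest, XY_src, data> form.
// Field sizes are parameterized in the real design; 32 data bits is assumed here.
struct ForwardPacket {
    Coord dest;
    Coord src;
    std::uint32_t data;
};

// Dimension-ordered routing: travel along X until the column matches,
// then along Y. A packet never turns from Y back to X, which is what
// makes this scheme deadlock free on a mesh.
constexpr Port route_xy(Coord here, const ForwardPacket& pkt) {
    if (pkt.dest.x > here.x) return Port::East;
    if (pkt.dest.x < here.x) return Port::West;
    if (pkt.dest.y > here.y) return Port::South;
    if (pkt.dest.y < here.y) return Port::North;
    return Port::Local;  // arrived: deliver to this tile
}

// Number of router-to-router hops between two tiles under XY routing.
constexpr int hop_count(Coord src, Coord dest) {
    const int dx = dest.x > src.x ? dest.x - src.x : src.x - dest.x;
    const int dy = dest.y > src.y ? dest.y - src.y : src.y - dest.y;
    return dx + dy;
}
```

Because every packet between a given pair of tiles follows the same path and there are no virtual channels to reorder traffic, this style of routing is consistent with the in-order delivery listed on the slide.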

Page 18:

Manycore Links to General-Purpose and Specialized Tier

Cross clock domain interface
• To General-Purpose Tier: convert RoCC to link protocol, support configuring DMA, write and reset manycore, etc.
• To Specialized Tier: aggregate link interface to increase the bandwidth and throughput

[Figure: four Rocket cores, each with RoCC, connect through link_to_rocc blocks (endpoint, DMA, L1 D-Cache, core req/resp, cmd/resp/busy) and async FIFOs to routers on the manycore mesh; link widths of 32 and 64 are marked; the General-Purpose Tier, Manycore, and Specialized Tier clock domains are separated by clock-domain crossings]

Page 19:

BaseJump Manycore Programming Support for Streaming

Producer-consumer programming model: extended instructions for efficient inter-tile synchronization

• Load Reserved (lr.w): load value and set the reservation address
• Load-on-broken-reservation (lr.lbr): stall if the reserved address wasn’t written by other cores
• Consumer: wait on <address, value>
• Benefits: no polling, no interrupt, fast response; stalled pipeline can save power

[Figure: streaming patterns (Input, Split, Join, Feedback, Pipeline, Output). Producer-consumer programming: Core A issues a remote store over the NoC to the reserved address in Core B's DMEM, which invokes Core B's stalled pipeline that was waiting for events]
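The consumer-side behavior of lr.w and lr.lbr can be modeled in software. The sketch below is a toy, single-threaded model written from the slide's one-line descriptions only; `TileModel`, its fields, and the choice to report a stall as a `false` return value are assumptions for illustration, and the real instruction semantics may differ in detail.

```cpp
#include <cstdint>

// Toy model of one tile's data memory plus its reservation state.
// All names and sizes are hypothetical; nothing here is taken from the RTL.
struct TileModel {
    static constexpr int kWords = 1024;  // 4 KB DMEM viewed as 32-bit words
    std::uint32_t dmem[kWords] = {};
    int  reserved = -1;     // word index under reservation, -1 if none
    bool broken   = false;  // set once another core writes the reserved word

    // lr.w: load the value and record the reservation address.
    constexpr std::uint32_t lr_w(int idx) {
        reserved = idx;
        broken = false;
        return dmem[idx];
    }

    // A remote store arriving over the NoC from a producer tile.
    constexpr void remote_store(int idx, std::uint32_t value) {
        dmem[idx] = value;
        if (idx == reserved) broken = true;
    }

    // lr.lbr: completes (returns true, fills `out`) only once the reservation
    // has been broken; returning false stands in for a stalled pipeline.
    constexpr bool lr_lbr(int idx, std::uint32_t& out) {
        if (idx != reserved || !broken) return false;
        out = dmem[idx];
        reserved = -1;
        broken = false;
        return true;
    }
};

// Consumer arms a reservation, stalls, then wakes on the producer's store.
constexpr std::uint32_t consumer_demo() {
    TileModel t{};
    std::uint32_t v = 0;
    t.lr_w(8);                                   // arm the reservation
    const bool stalled_first = !t.lr_lbr(8, v);  // nothing written yet: stall
    t.remote_store(8, 42);                       // producer writes the word
    const bool woke = t.lr_lbr(8, v);            // reservation broken: load completes
    return (stalled_first && woke) ? v : 0;
}
```

In hardware the stall would simply hold the pipeline rather than return a flag, which is where the "no polling" and power-saving benefits on the slide come from.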

Page 20:

BaseJump Manycore: Current Efforts

CUDA
• Fast porting support for CUDA code
• Porting existing GPU libraries and primitives

Focus on embedded; exploit locality of cores rather than relying on external storage in GDDR5

Happy to provide simulation images on Amazon F1 for those who wish to collaborate on programming models.

Page 21:

Physical Thread Density Comparison

Normalized Physical Threads (ALU ops) per Area

Configuration | Memory | Normalized Area (32nm) | Area Ratio
Celerity Tile @ 16nm | D-MEM = 4KB, I-MEM = 4KB | 0.024 * (32/16)^2 = 0.096 mm2 | 1x
OpenPiton Tile @ 32nm | L1 D-Cache = 8KB, L1 I-Cache = 16KB, L1.5/L2 Cache = 72KB | 1.17 mm2 [1] | 12x
Raw Tile @ 180nm | L1 D-Cache = 32KB, L1 I-SRAM = 96KB | 16.0 * (32/180)^2 = 0.506 mm2 | 5.25x
MIAOW GPU Compute Unit Lane @ 32nm | VRF = 256KB, SRF = 2KB | 15.0 / 16 = 0.938 mm2 [2] | 9.75x

• Timing: 1.05 GHz @ 16nm
• Area: 42 cores per mm^2

[1] J. Balkind, et al. “OpenPiton: An Open Source Manycore Research Framework,” in the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016.
[2] R. Balasubramanian, et al. "Enabling GPGPU Low-Level Hardware Explorations with MIAOW: An Open-Source RTL Implementation of a GPGPU," in ACM Transactions on Architecture and Code Optimization (TACO). 12.2 (2015): 21.
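The normalization in the table scales each area by the square of the feature-size ratio. The following sketch reproduces that arithmetic; the function and constant names are made up for the example, and the computed ratios (about 12.2, 5.3, and 9.8) only approximately match the rounded values shown on the slide.

```cpp
// Scale an area from one process node to another, assuming area
// shrinks with the square of the feature size (the slide's convention).
constexpr double normalize_area(double area_mm2, double from_nm, double to_nm) {
    const double scale = to_nm / from_nm;
    return area_mm2 * scale * scale;
}

constexpr double area_ratio(double area_mm2, double baseline_mm2) {
    return area_mm2 / baseline_mm2;
}

// Inputs copied from the table; all areas end up expressed at 32nm.
constexpr double kCelerityTile = normalize_area(0.024, 16.0, 32.0);  // 0.024 * (32/16)^2
constexpr double kOpenPiton    = 1.17;                               // already at 32nm
constexpr double kRawTile      = normalize_area(16.0, 180.0, 32.0);  // 16.0 * (32/180)^2
constexpr double kMiaowLane    = 15.0 / 16.0;                        // one of 16 lanes, 32nm

// Tile density at the native 16nm node: 1 / 0.024 mm2 is roughly 42 tiles per mm2.
constexpr double kTilesPerMm2 = 1.0 / 0.024;
```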

Page 22:

How did we build the Manycore tier?

Design
• BaseJump STL library (open source): Dataflow, NoC, Arithmetic, …

Testing
• RISC-V toolchain (open source): Assembly Test Suite, Modified Runtime, C Compiler

Replicated hard-macro
• One tile: RISC-V Vanilla-5 Core, I Mem, D Mem, RF, XBAR, NoC Router

Hierarchical flow
• 1 tile, 1 Rocket, 1 BNN, 1 die

[Figure: size comparison]

Page 23:

Manycore Tile Efficiency Analysis

Tile:
• Vanilla Core (pipeline, ALU, MUL, RF)
• Router
• RF (2R1W)
• 4KB IMEM, 4KB DMEM (1RW)

• Timing: 1.05 GHz @ 16nm
• Area: 0.024 mm2 @ 16nm
• Utilization ratio: 90%

Cell area breakdown: memory accounts for two-thirds of the area

Page 24:

Celerity: Specialization Tier

[Figure: Celerity block diagram]

Page 25:

Case Study: Mapping Flexible Image Recognition to a Tiered Accelerator Fabric

Three steps to map applications to tiered accelerator fabric:
Step 1. Implement the algorithm using the general-purpose tier
Step 2. Accelerate the algorithm using either the Manycore tier OR the specialization tier
Step 3. Improve performance by cooperatively using both the specialization AND the Manycore tier

[Figure: image classification network (Convolution, Pooling, Convolution, Pooling, Fully-connected) producing scores bird (0.02), boat (0.94), cat (0.04), dog (0.01), mapped onto the General-Purpose, Manycore, and Specialization Tiers]

Page 26:

Step 2: Application to Accelerator
General-Purpose Tier for Weight Storage

• The BNN specialized accelerator can use one of the Rocket cores’ caches to load every layer’s weights

[Figure: Celerity block diagram with off-chip I/O and five RISC-V Rocket cores, each with I-Cache, D-Cache, AXI, and RoCC]

Page 27:

Step 3: Assisting Accelerators
Manycore Tier for Weight Storage

• The BNN specialized accelerator can use one of the Rocket cores’ caches to load every layer’s weights
• Each core in the Manycore tier executes a remote-load-store program to orchestrate sending weights to the specialization tier via a hardware FIFO

[Figure: Celerity block diagram with off-chip I/O and five RISC-V Rocket cores, each with I-Cache, D-Cache, AXI, and RoCC]


Page 30:

Performance Benefits of Cooperatively Using the Manycore and the Specialization Tiers

General-Purpose Tier: Software implementation assuming ideal performance estimated with an optimistic one instruction per cycle
Specialization Tier: Full-system RTL simulation of the BNN specialized accelerator running with a frequency of 625 MHz
Specialization + Manycore Tiers: Full-system RTL simulation of the BNN specialized accelerator with the weights being streamed from the manycore

Metric | General-Purpose Tier | Specialization Tier | Specialization + Manycore Tiers
Runtime per Image (ms) | 4,024 | 20 | 3.3
Power (Watts) | 0.2 – 0.5 | 0.2 – 0.5 | 0.5 – 2.0
Improvement in Perf / Power | 1x | ~200x | ~400x
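One way to read the last row of the table is as the ratio of (images per second) per watt between a configuration and the general-purpose baseline. The slide does not say which point in each power range was used, so the sketch below is an assumption-laden check rather than the authors' calculation: pairing the low ends of the ranges gives roughly 488x and pairing the high ends roughly 305x, which brackets the quoted ~400x, while equal power for the first two columns gives roughly 201x.

```cpp
// Gain in performance per watt of a new configuration over a baseline.
// Performance is taken as 1 / runtime, so the gain is
// (base_ms * base_w) / (new_ms * new_w).
// The function name and argument order are choices made for this sketch.
constexpr double perf_per_watt_gain(double base_ms, double base_w,
                                    double new_ms, double new_w) {
    return (base_ms / new_ms) * (base_w / new_w);
}

// Specialization tier vs. general-purpose tier at equal power.
constexpr double kSpecGain = perf_per_watt_gain(4024.0, 0.2, 20.0, 0.2);

// Specialization + manycore, pairing low ends and high ends of the power ranges.
constexpr double kCoopGainLow  = perf_per_watt_gain(4024.0, 0.2, 3.3, 0.5);
constexpr double kCoopGainHigh = perf_per_watt_gain(4024.0, 0.5, 3.3, 2.0);
```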

Page 31:

Design Methodology

[Figure: design flow through SystemC and Constraints, Stratus HLS, RTL, PyMTL Wrappers & Adapters (including RoCC interfaces), and Final RTL]

    // DMA request process of the BNN accelerator, written in SystemC for HLS.
    void bnn::dma_req() {
      while ( 1 ) {
        // Block until a DMA descriptor arrives.
        DmaMsg msg = dma_req.get();

        // Issue one memory request per 8-byte word of the transfer.
        for ( int i = 0; i < msg.len; i++ ) {
          HLS_PIPELINE_LOOP( HARD_STALL, 1 );

          int    req_type = 0;
          word_t data     = 0;
          addr_t addr     = msg.base + i*8;

          if ( type == DMA_TYPE_WRITE ) {
            data     = msg.data;
            req_type = MemReqMsg::WRITE;
          } else {
            req_type = MemReqMsg::READ;
          }

          memreq.put( MemReqMsg( req_type, addr, data ) );
        }

        // Signal completion of the whole transfer.
        dma_resp.put( DMA_REQ_DONE );
      }
    }

Page 32:

The Celerity System-on-Chip

Celerity, an accelerator-centric SoC with a tiered accelerator fabric that targets highly performant and energy-efficient embedded systems

Celerity’s goal was to develop new methodologies to design chips more quickly

We believe the RISC-V software/hardware ecosystem was instrumental in enabling a team of 20 graduate students to tape out a complex SoC in only 9 months

We thank the many contributors to the open-source RISC-V software and hardware ecosystem, with special thanks to U.C. Berkeley for forming the RISC-V ecosystem

Acknowledgements: DARPA, under the CRAFT program

Special thanks to Dr. Linton Salmon for program support and coordination

[Figure: Celerity block diagram]

http://www.opencelerity.org

Page 33:

Accelerating ASIC Design Through Reuse

• BaseJump: open-source polymorphic HW components
  • Design libraries: BSG IP Cores, BGA Package, I/O Pad Ring
  • Test infrastructure: DoubleTrouble PCB, RealTrouble PCB
  • Available at bjump.org
• RISC-V: open-source ISA
  • Rocket core: high performance RV64G in-order core
  • Vanilla-5: high efficiency RV32IM in-order core
• RoCC: open-source on-chip interconnect
  • Common interface to connect all 3 compute tiers
• Extensible designs
  • BSG Manycore: fully parameterized RTL and APR scripts
• Third-party IP
  • ARM standard cells, I/O cells, RF/SRAM generators

Page 34:

BaseJump: Designing the DNA for Open Source ASICs

http://www.bjump.org

Page 35:

In the Bespoke Silicon Group (BSG), We Think Building Hardware Is An Epic Sport

BaseJump On-Chip Clock Generator

ASIC Tapeouts (6 months)
• BSG-Loopback (180nm) – November 2016
• BSG-X (180nm) – December 2016
• Celerity (16nm) – April 2017

Page 36:

Let’s Build the DNA for Open Source ASICs

• The components required to build a full system
  • RTL Design (NOCs, async crossers, arbiters, FIFOs, …)
  • IP Cores (High-speed IO, PLLs, CPU, …)
  • Hardware emulation
  • Socket (package and padring)
  • PCB motherboard
• Digital ASIC Systems share a lot of “DNA”
  • Many of these components are very common
  • Only minor modifications (if any) are needed
  • Every chip inherits many defaults from the last; gets easier and easier
• What if we could share a “base class” for ASICs across the world and extend to fit our system requirements?

Page 37:

BaseJump: An Open Source “Base Class” For ASIC Designs

• A collection of open source components that act as a starting place for each step in building an end-to-end system
• Major Components
  • BaseJump STL
  • BaseJump Socket
    • IO Padring
    • BGA Package
  • BaseJump Motherboards
    • Emulation PCB, ASIC PCB
  • BaseJump Rocket & RV-IOV Adapters
  • BaseJump Manycore
    • The manycore you saw earlier
  • BaseJump FPGA Bridge

http://bjump.org

All the hardware you are about to see has complete designs on our website, and we’ll tell you who to send it to to get it fabricated.

Page 38:

BaseJump STL

• Like C++ STL but for SystemVerilog
• Many essential IP blocks (stop building them over and over!)
  • Basic building blocks (FIFOs, crossbars, arbiters)
  • Intuitive interfaces
  • Highly parameterizable
• More advanced IP blocks
  • Clock generators
  • High speed source synchronous I/O
  • SoC configuration interface (like SPI or JTAG)
• Verified and validated
  • Unit testing and regression testing suites
  • Many components are silicon proven

Page 39:

BaseJump STL Components

Several Hundred Modules, All Parameterized

Package | Example IP Cores
bsg_misc | popcount, flop trays, decoders, lfsr, multiplies, flexible muxes, transposers, crossbars, gray_to_binary, priority encoder, thermometer encoders, counters
bsg_async | asynchronous fifos and interfaces
bsg_clk_gen | synthesizable digital clock generator
bsg_comm_link | high-speed I/O source-synchronous interface
bsg_dataflow | FIFOs, stream mergers, round-robin arbiters, serial-to-parallel converters
bsg_fsb | front side bus (high-speed bridge between off-chip and on-chip worlds)
bsg_mem | portability layer for SRAMs
bsg_mesosync | mesosynchronous I/O library (high_speed + low latency)
bsg_noc | network-on-chip building blocks; RISC-V interface logic
bsg_tag | SoC configuration interface (like SPI or JTAG)
bsg_test | test bench blocks: reset generators, delay lines, clock gens
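As a flavor of the small utility blocks in the table, here is what a gray_to_binary conversion computes, written as a C++ software model. This is not the BaseJump SystemVerilog module; the 32-bit width and function names are assumptions for the sketch. Gray codes are commonly used for pointers that cross clock domains in asynchronous FIFOs, because only one bit changes per increment.

```cpp
#include <cstdint>

// Binary to Gray code: adjacent values differ in exactly one bit.
constexpr std::uint32_t binary_to_gray(std::uint32_t b) {
    return b ^ (b >> 1);
}

// Gray code to binary: each binary bit is the XOR of all Gray bits at or
// above its position, computed here as a prefix XOR with doubling shifts.
constexpr std::uint32_t gray_to_binary(std::uint32_t g) {
    std::uint32_t b = g;
    for (unsigned shift = 1; shift < 32; shift <<= 1) {
        b ^= b >> shift;
    }
    return b;
}
```

A hardware version would be parameterized by width, in keeping with the "all parameterized" note on the slide; the fixed 32-bit type here is only for brevity.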

Page 40:

BaseJump Socket: Standard I/O Interfaces for Your ASIC

• Complete suite for speaking out of your chip to an FPGA at high frequency
  • RTL Code for ASIC (bsg_comm_link)
  • RTL Code for FPGA (bsg_comm_link)
  • Standard I/O padring
  • BGA Package
  • BGA Socket
• Optimized for signal integrity and cost
  • DDR, source synchronous
  • Single-ended
• Starting point for new padring designs
  • Can repurpose pins
    • Switch pin directions
    • Replace with analog pins
    • Add more power domains

Inspired DARPA CRAFT Flipchip Sockets!

[Figure: BGA Padring, BGA Socket, BGA Substrate]

Page 41:

BaseJump: Motherboards

• DoubleTrouble: pre-tapeout HW emulation
• RealTrouble: post-tapeout ASIC bring-up
• BaseJump includes open firmware for FPGA
  • Can be used on DoubleTrouble and RealTrouble
• Can connect to Xilinx dev boards over FMC
  • Allows us to use Xilinx’s IP cores (DRAM, PCIe, …)

Plug your ASIC into this socket and you are done!

Page 42:

• HDL Design
• BaseJump STL: Standard library of hardware components

http://bjump.org

Page 43:

• HDL Design
• BaseJump STL: Standard library of hardware components
• BaseJump Socket: IO Padring

http://bjump.org

Page 44:

• HDL Design
• BaseJump STL: Standard library of hardware components
• BaseJump Socket: IO Padring
• BaseJump DoubleTrouble: HW Emulation Motherboard

http://bjump.org

Page 45:

• HDL Design
• BaseJump STL: Standard library of hardware components
• BaseJump Socket: IO Padring
• BaseJump DoubleTrouble: HW Emulation Motherboard
• BaseJump: Open FPGA Firmware

http://bjump.org

Page 46:

• HDL Design
• BaseJump STL: Standard library of hardware components
• BaseJump Socket: IO Padring
• BaseJump DoubleTrouble: HW Emulation Motherboard
• BaseJump: Open FPGA Firmware
• Tape out!

http://bjump.org

Page 47:

• HDL Design
• BaseJump STL: Standard library of hardware components
• BaseJump Socket: IO Padring
• BaseJump DoubleTrouble: HW Emulation Motherboard
• BaseJump: Open FPGA Firmware
• BaseJump Socket: BGA Package

http://bjump.org

Page 48:

• HDL Design
• BaseJump STL: Standard library of hardware components
• BaseJump Socket: IO Padring
• BaseJump DoubleTrouble: HW Emulation Motherboard
• BaseJump: Open FPGA Firmware
• BaseJump Socket: BGA Package
• BaseJump RealTrouble: Bring-up Motherboard

http://bjump.org

Page 49:

Outline

The Celerity Open Source RISC-V Tiered Accelerator Fabric: Fast Architectural Design Methodologies for Fast Chips

BaseJump: Designing the DNA for Open Source ASICs

BaseJump Manycore: “Open Source for GPU”

http://www.opencelerity.org

http://www.bjump.org

http://www.bjump.org/manycore

Page 50:

Thanks

Ritchie Zhao, Chun Zhao, Shaolin Xie, Bandhav Veluri, Luis Vega, Christopher Torng, Ningxiao Sun, Austin Rovinski, Anuj Rao, Gai Liu, Paul Gao, Scott Davidson, Steve Dai, Aporva Amarnath, Khalid Al-Hawaj, Tutu Ajayi

Christopher Batten, Ronald G. Dreslinski, Rajesh K. Gupta, Michael B. Taylor, Zhiru Zhang