Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣...

24
1 Emulating Future HPC SoC Architectures Using RISC-V Farzad Fatollahi-Fard, Dave Donofrio, John Shalf Lawrence Berkeley National Lab RISC-V Workshop January 5, 2016 – Redwood City, CA

Transcript of Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣...

Page 1: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

1

Emulating Future HPC SoCArchitectures Using RISC-VFarzad Fatollahi-Fard, Dave Donofrio, John ShalfLawrence Berkeley National LabRISC-V WorkshopJanuary 5, 2016 – Redwood City, CA

Page 2: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Should HPC Take Inspiration from the Embedded Market?

‣ Have most of the IP and experience with for low-power technology

• Have sophisticated tools for rapid turn-around of designs

‣ Vibrant commodity market in IP components

• Change your notion of “commodity”!

• It’s commodity IP on the chip (not the chip itself!)

‣ Design validation / verification dominate cost

• Another benefit of COTS IP

Page 3: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Building an SoC for HPC

‣ Consumer market dominates PC and server market• Smartphone and tablets are in control

• Huge investments in IP, design practices, etc.

‣ HPC is power limited (delivered performance/watt)• Need better computational efficiency and lower power with greater

parallelism

• Embedded has always been driven by max performance/watt (max battery life) and minimizing cost

‣ HPC and embedded requirements are now aligned• …and now we have a very large commodity ecosystem

‣ Why not leverage technologies for the embedded and consumer for HPC?

3

Is this a good idea?

Page 4: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

4

Looking back…A previous HPC system design based on semi-custom SoCs

Page 5: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Green Wave2009-2012

‣ Seismic imaging used extensively by oil and gas industry

• Dominant method is RTM (Reverse Time Migration)

‣ RTM models acoustic wave propagation through rock strata using explicit PDE solve for elastic equation in 3D

• High order (8th or more) stencils

• High computational intensity

Page 6: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Green Wave Design StudySeismic Imaging

Performance Energy Efficiency

EmbeddedDesign library

EmbeddedDesign library

Page 7: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Embedded SoC Efficiency Competitive with cutting-edge designs

!"#"$!"$#"%!"%#"&!"&#"'!"'#"#!"

()*" +,-./" 01" ()*" +,-./" 01"

234"5-6,-" $%34"5-6,-"

!"#

$%&'('()

*

3.5x4x

Fermi without host

7.6x8x

Page 8: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Green Wave Chip Block Diagram

8

‣ 12 x 12 2D on-chip torus network

‣ 676 Compute cores (500 in compute clusters, 176 in peripheral clusters)

‣ 33 Supervisory cores

‣ 1 PCI express interface

‣ 8 Hybrid Memory Cube (HMC) interfaces

‣ 1 Flash controller

‣ 1 1000BaseT Ethernet controller

‣ It is not anticipated that all cores will be utilized – some are spares for yield enhancement.

Courtesy Marty Deneroff, Green Wave, Inc.

10

Example: Green Wave Chip Block Diagram •  12 x 12 2D on-chip torus network •  676 Compute cores (500 in compute

clusters, 176 in peripheral clusters) •  33 Supervisory cores •  1 PCIexpress interface (16 bit Gen 3) •  8 Hybrid Memory Cube (HMC) interfaces •  1 Flash controller •  1 1000BaseT ethernet controller

It is not anticipated that all cores will be utilized – some are spares for yield enhancement.

Page 9: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

MemoryDRAM

DRAM

Building an SoC from IP Logic BlocksIt’s Legos with a some extra integration and verification cost

Processor Core (ARM, Tensilica, RISC-V)With extra “options” like DP FPU, ECC

OpenSoC Fabric (on-chip network)(currently proprietary ARM or Arteris)

DDR memory controller(Denali/Cadence, SiCreations)+ Phy & Programmable PLL

PCIe Gen3 Root complex

Integrated FLASH Controller

10GigE or IB DDR 4x Channel

MemControlMem

Control

PCIe

FLASH Control

IB or GigE

IB or GigE

IO

Page 10: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Hierarchical Power Costs

10

Data Movement is the Dominant Power Cost

120 pJ

2000 pJ

250 pJ

~2500 pJ

100 pJ

6 pJ

Cost to move data off chip to a neighboring node

Cost to move data off chip into DRAM

Cost to move off-chip, but stay within the package (SMP)

Cost to move data 20 mm on chip

Typical cost of a single floating point operation

Cost to move data 1 mm on-chip

Page 11: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Other Parameters Impact PerformanceFor Example, Topology…

‣ Network topology can greatly influence application performance

An analysis of on-chip interconnection networks for large-scale chip multiprocessorsACM Transactions on computer architecture and code optimization (TACO), April 2010

Page 12: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Our SoC for HPC System

12

‣ Z-Scale processors connected in a Concentrated Mesh

‣ 4 Z-scale processors

‣ 2x2 Concentrated mesh with 2 virtual channels

‣ Designed for Area Efficiency on the FPGA

A Point Design on an FPGA

FPGA 0 FPGA 1 FPGA 2

FPGA 5FPGA 4FPGA 3

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

Core Off-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

Core Off-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

Core Off-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

Core Off-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

Core Off-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

Core Off-Chip

Page 13: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Z-Scale

13

‣ A tiny 32-bit 3-stage RISC-V core generator suited for microcontrollers and embedded systems

‣ Z-scale is designed to talk to AHB-Lite buses

‣ Z-scale generator also generates the interconnect between core and devices• Includes buses, slave

muxes, and crossbars

Tiny 32-bit RISC-V System

PCGen IF DE

EX

WBMEMMUL

I-BusRequest

I-BusResponse

D-BusRequest

D-BusResponse

Page 14: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Top-Level Z-Scale Configuration

14

Z-Scale

Crossbar

DMACaches

I$ D$

MSG

MSG

DRAM Router

Send Recv

Page 15: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

OpenSoC Fabric

15

‣ Written in Chisel

‣ Dimensions, topology, VCs all configurable

‣ AHB Endpoints available now• AXI in development

‣ Fast functional C++ model for functional validation

‣ Verilog based description for FPGA or ASIC• Synthesis path enables accurate

power / energy modeling

AXIOpenSoC

FabricCPU(s)

HMC

AXI

AXI

CPU(s)

AXI CPU(s)

AXI

CPU(s)

AXI

CPU(s)

AXI

AXI

10GbE

PCIe

An Open-Source, Flexible, Parameterized, NoC Generator

Page 16: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Top Level Diagram

16

Page 17: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Functional Hierarchy

17

Page 18: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Top Level Modules

‣ Stiches routers together

‣ Assigns routers individual ID

‣ Assigns Routing Function to routers

‣ Connections Injection and Ejection Queues for network endpoints

18

Topology

Page 19: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

OpenSoC Top Level Modules

‣ Created and connected by Topology module‣ Instantiates and connects: • Routing Function• Allocators• Switch

‣ Pipelined• 3 stage pipeline for Wormhole• 4 stage pipeline for VCs• Includes state storage for each sub-module

‣ Connects to Injection / Ejection Queues19

Router

Page 20: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

OpenSoC Top Level Modules

ArbArb

VC Allocator Switch Allocator

Allocation

Switch

Allocator Result RegisterSwitch Control

Input / VC ControlInput FIFOper VC

Head flits passed to allocator

Register Files per VC for Head Flits

Function

Routing Functions

Routing Function consumesHead Flits, assigns range of VCs,

updates register file with result

Assigns VC from range specified by routing

function to a single head flit

Per VC Credit passed back to allocator

Single Output Reg

RouterState Machine

Concentration

Router

Router N

Concentration

Router

Router

Router

20

VC Router

Page 21: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Configuration options

21

‣ Network Parameters

• Dimension

• Routers per dimension

• Concentration

• Virtual Channels

• Topology

• Queue depths

• Routing Function

A few of the current run time configuration parameters

‣ Packet / Flit Parameters

• Flit widths

• Packet types / lengths

‣ Testing Parameters

• Pattern- Neighbor, random,

tornado, etc

• Injection Rate

Highly modular architecture supports FUB replacement

Page 22: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Demo Setup

22

HPC SoC Design Versus Commodity HPC Design

FPGA 0 FPGA 1 FPGA 2

FPGA 5FPGA 4FPGA 3

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

Core Off-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

Core Off-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

Core Off-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

Core Off-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

Core Off-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

Core Off-Chip

Socket 0

Core Core Core Core Core

Core Core Core Core Core

Core Core Core Core Core

Core Core Core Core Core

Socket 1

Core Core Core Core Core

Core Core Core Core Core

Core Core Core Core Core

Core Core Core Core Core

DDR3 DRAM

Page 23: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

Demo Results

23

Exchanging of Cache Lines Versus Message Passing

0 100 200 300 400 500 600

Local Exchange

Remote Exchane

Cycles

Inter-Thread Latency

RISCV-SoCx86

Page 24: Emulating Future HPC SoC Architectures Using RISC-V · 2020. 8. 13. · Building an SoC for HPC ‣ Consumer market dominates PC and server market • Smartphone and tablets are in

24

More Informationhttp://www.codexhpc.org

http://opensocfabric.org