BeiHang Short Course, Part 6: CoRAM FPGA...

14
9/24/2014 1 BeiHang Short Course, Part 6: CoRAM FPGA “Architecture” James C. Hoe Department of ECE J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSCL6s1 Department of ECE Carnegie Mellon University Collaborators: Eric S. Chung (MSR), Gabriel Weisz, Michael Papamichael, and Ken Mai Eric S. Chung, James C. Hoe, and Kenneth Mai, “CoRAM: An InFabric Memory Architecture for FPGAbased Computing,” Proc. ACM International Symposium on FieldProgrammable Gate Arrays (FPGA), pp 97106, February 2011. Moore’s Law for FPGA 10TFLOPS Stratix 10 J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSCL6s2 2000 2005 2010 2015

Transcript of BeiHang Short Course, Part 6: CoRAM FPGA...

Page 1: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?

9/24/2014

1

BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”

James C. Hoe

Department of ECE

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s1

Department of ECE

Carnegie Mellon University

Collaborators: Eric S. Chung (MSR), Gabriel Weisz, Michael Papamichael, and Ken Mai

Eric S. Chung, James C. Hoe, and Kenneth Mai, “CoRAM: An In‐Fabric Memory Architecture for FPGA‐based Computing,” Proc. ACM International Symposium on Field‐Programmable Gate Arrays (FPGA), pp 97‐106, February 2011. 

Moore’s Law for FPGA

10TFLOPSStratix 10

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s2

2000 2005 2010 2015

Page 2: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?

9/24/2014

2

Why doesn’t everybody want to compute with FPGAs?

“ di i ll G h b h b d

G d O d i d f i

“Traditionally, FPGAs have been the bastardstep‐brother of ASICs…”

Proceedings of ISFPGA 2004

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s3

• FPGAs today NOT designed for computing

– tools and languages difficult to use

– users exposed to device and platform details

– no application portability

Outline

• Motivations

• CoRAM Architecture

• Current work and evaluations

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s4

• Conclusions

Page 3: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?

9/24/2014

3

Review of FPGA Anatomy 

Block RAM

I

Programmable lookup tables (LUT) and flip‐flops (FF)

aka “soft logic” or “fabric”

4KBBlockRAM

Write Data

Read DataLocal 

Address

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s5

I/O pins Interconnect

LUT FF

Computing on FPGA TodayI/O Pads

kernel

SRAM

User responsible for:

application and data

kernel

SRAMFPGA

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s6

I/O Pads

kernel

SRAM

Page 4: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?

9/24/2014

4

Computing on FPGA TodayI/O Pads

Memory Controller Memory Controller

User responsible for:

application and data

platform I/O and 

memory interfaces

data distribution

kernel

SRAM

ControlLogic

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s7

memory control logic

I/O Pads

Memory Controller Memory Controller

The CoRAM FPGAI/O Pads

Memory Controller Memory ControllerMemory Interfaces (Global Address Space)

SRA

M

kernel

SRAM

ControlLogic

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s8

I/O Pads

Memory Controller Memory ControllerMemory Interfaces (Global Address Space)

Page 5: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?

9/24/2014

5

The CoRAM FPGA

Memory Interfaces (Global Address Space)

SRA

Mkernel

Control

Hard logic for data 

distribution (NoC)

“Software”‐managed 

memory hierarchy

General‐purpose Network‐on‐Chip

ControlLogic

kernel

SRAM

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s9

Memory Interfaces (Global Address Space)

Controlthreads

General‐purpose Network‐on‐Chip

CoRAM FPGA Architectural View

SRA

M

kernel

Control

Hard logic for data 

distribution (NoC)

“Software”‐managed 

memory hierarchy

Simple, portable

kernel

SRAM

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s10

Controlthreads

Simple, portable 

abstraction for user

Memory

Page 6: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?

9/24/2014

6

FPGA as First‐Class, Equal Partner

CPU FPGACPUCPU FPGA

DRAM DRAM

CPU

Memory Bus / Interconnect

Core Core FPGA

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s11

L1

Single‐Chip Heterogeneous Processor

L1

L2 Cache

Fabric

DRAM

Simple Examplectrlthread() {

cohandle c0 = get_sram(“c0”);

1 c coram rite(c0

SRAMAddress

}

module verilog_top(…);

coram c0(/*ports*/);

cofifo f0 (/*ports*);

…endmodule

Control Thread

1 c_coram_write(c0,sram_address,global_address,size);

2 c_fifo_write(1);

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s12

C0WrData

RdData

kernel Thread

1GlobalMemory

DataData

2

Channel FIFO

Page 7: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?

9/24/2014

7

Control ActionsEdge Memory

SingleCoRAM

Edge Memory2 CoRAMs

(concatenated)

memaddroffset

bytes

coram_write( coh coram, void *offset,void *memaddr, int bytes);

CoRAM

memaddroffset

bytes

collective_write(coh coram, void *offset,void *memaddr, int bytes);

Edge MemorySrc CoRAM

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s13

2 CoRAMs(interleaved)

memaddr offset

collective_write(coh coram, void *offset,void *memaddr, int bytes);

A BC D

A B C D

srcoffsetdstoffset

bytescoram_copy(coh src, coh dst,

void *srcoffset,void *dstoffset,int bytes);

Dst CoRAM

Design Case Study: MMMult

MMM in hardwareCompute Engine

Single Compute Engine

N compute engines(1 per row of matrix )

custom data network

Compute Engine

Compute Engine

Compute Engine

Compute Engine

Compute Engine

p g

C=ABA

Network

B SRAM

OP=writebackOP=load

R

R

x +

WenRenPktIn

PktOut

A/B SRAMsC SRAM

Control

toNext

PacketProcess

x +

A/B SRAMsC SRAM

Control

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s14

Compute Engine

B C

OP PE_ID DATA

DRAM

Packet Generator Address Generator

Packet Format

C SRAM

B

A SRAM

toNext

Page 8: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?

9/24/2014

8

Control Thread Program

MMMult with CoRAM

Channel FIFOPEs ready

void ctrl_thread() {for (j = 0; j < N; j += NB)for (i = 0; i < N; i += NB)for (k = 0; k < N; k += NB) { c_fifo_read(…); for (m = 0; m < NB; m++) { c_collective_write(

ramsA, m*NB, A + i*N+k + m*N

Start compute

x +

A/B SRAMsC SRAM

Control

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s15

A + i*N+k + m*N, NB*dsz);

…}

c_fifo_write(…);}

}}

}

Related Work in Abstracting

• BORPH [So, UCB PhD Thesis 2007]

– support HW kernels with fixed interface/infrastructurepp

– manage HW kernels as processes

– communication between HW and SW processes

• VirtualRC [Kirchgessner, et al., ISFPGA 2012]

– “virtual machine” for FPGA platforms

– fixed SW API to interact with HW kernels

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s16

fixed SW API to interact with HW kernels

• LEAP [Adler, et al., ISFPGA 2011]

– large, linear memory abstraction of local backing store

– communicating with SW and remote memory

Page 9: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?

9/24/2014

9

Outline

• Motivations

• CoRAM FPGA Architecture

• Development and evaluations

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s17

• Conclusions

Beneath the Abstraction

Control 

M

kernel

Architecture

Microarchitecture

t t

thread programs

Control Unit Control Unit

SRAM

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s18

Memory Interfaces (DRAM or Cache)

Fabric

Bulk Data Distribution (Network‐on‐Chip)

Cluster Unit

Fabric

Cluster Unit

CoRAMFPGA

Page 10: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?

9/24/2014

10

Decoder

CoramIdx == 0?

CoramIdx == 1?

CoramEn0 (1‐bit)

Data Steer

4‐inputOR

Mux 

LaneSelect0 (2‐bit)

CoRAM Cluster DesignSRAMFrontend Pipeline DataDeq

64x5‐bitmaptable

i10‐bit/6‐bit

Divider

LocalAddress (10‐bit)Map Table Index (5‐bit)

CoramIdx == 31?

CoramIdx == 0?

CoramIdx == 1?

CoramIdx == 31?

CoramIdx == 0?

CoramIdx == 1?

CoramIdx == 31?

CoramIdx == 0?

CoramIdx == 1?

CoramIdx == 31?

Ctrl

LaneSelect (2‐bit)

4‐inputOR

Mux Ctrl

4‐inputOR

Mux Ctrl

4‐inputOR

Mux Ctrl

32

addrQ

C Id (5 bit)

64x5‐bitmaptable

64x5‐bitmaptable

64x5‐bitmaptable

i+1

i+2

i+3

10‐bit/6‐bit

Divider

10‐bit/6‐bit

Divider

10‐bit/6‐bit

Dividerw

idxQ Hazard Detection

Deq0

Deq1

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s19

DataIn (32‐bit)

CORAM_WORD_WIDTH (32‐bit)

Data arrives sequentially from main memory, all accesses assumed 4‐byte word‐aligned

datai

datai+1

datai+2

datai+3

dataQCoramIdx (5‐bit)

CoramAddr (10‐bit)

Data response packet arrival in single clock

Cluster Resources

5000

FPGA logic area for 32‐way Cluster, 16B memory datapath, W=4

21% of V6 LX760 (if every BRAM used as CoRAM)

1500

2000

2500

3000

3500

4000

4500

t LU

Ts (40nm Virtex‐6)

Output 32x4 Crossbar (data)Input 4x32 Crossbar (data)Input 4x32 Crossbar (address)MemRequestQ (data‐out)DataOutQsDividers (7‐bit x 6‐bit)Input Crossbar DecoderDataInQsIntermediate Data QsMessage CountersLane SchedulerIntermediate State TablesOtherFreelist FIFO (physical tags)PtagQ (data‐out)Freelist FIFO (transaction tags)

21% of V6‐LX760 (if every BRAM used as CoRAM)

17%

10% 9%

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s20

0

500

1000

1500

6‐input Freelist FIFO (transaction tags)

Input 4x32 Crossbar (1‐bit rnw)Pattern TableDivision Status TableFree ptag counterIntermediate Token QPattern Offset TableLane Mask LogicGrantQControlActionQMemRequestQ (data‐in)

<1%FPGA max clock freq ~ 250‐300MHz

65nm ASIC ~1.1GHz

Page 11: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?

9/24/2014

11

RTL Prototyping on FPGA

Microblaze(100MHz)

I/O DevicesI/O Subsystem 

100MHz

Router

Router

CacheBank

CacheBank

Memory

128b

128bRouter

AXI Memory Bus

100MHz100MHz

Adapter

200MHz

Router

x+

Control Unit

x+

Control Unit

x+

Control UnitControl

and Drivers

MMMultiply Compute Elements

CoRAMCluster

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s21

MemoryController(Xilinx MIG) 128b

Arbiter

RouterRouter

RouterCacheBank

128bCacheBank

Xilinx Platform

 

512b

x+

Control Unit

x+

Control Unit

x+

Control UnitControl

CoRAMCluster

NoC and ClustersDRAM Interfaces DRAM Caches

CoRAMCluster

Router

Outline

• Motivations

• CoRAM FPGA Architecture

• Development and evaluations

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s22

• Conclusions

Page 12: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?

9/24/2014

12

M

kernel

BTW, we got it wrong……

M

FFT

M

FFT

CoRAM

kernel

Controlthreads

Memory

CoRAM

FFT(streaming)

CoRAM

FFT(streaming)

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s23

• Architecture design is an exercise in compromise

i.e., enough but not too much

• Do we really need an FPGA “architecture” just for computing?

CoRAM “Architecture” Revisited

Memory Interfaces (Global Address Space) Must have hardened 

CoRA

M

kernel

Control

memory interface and NoC

I like the decoupled computation paradigm

Do I need to harden

General‐purpose Network‐on‐Chip

kernel

CoRAM

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s24

Memory Interfaces (Global Address Space)

Controlthreads

Do I need to harden CoRAM?

– fixed set of commands

– insolate logic to one‐side

General‐purpose Network‐on‐Chip

Page 13: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?

9/24/2014

13

CoRAM Reconstructed as “APIs”

nds API

‐like API

infrastructure for lib builders

abstraction for app builders

CoRAM

CoRAM

controlh d k l

comman

mem

ory

pattern specific logic

pattern specificlogic B

RAM

NoC

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s25

threads HW kernels

end developer

end developer

pattern specificlogic

Memory Interface

Recap the Last Hour

• Future must look beyond general purpose

– FPGAs offer impressive performance and efficiencyp p y

– but need to be easy‐to‐use and portable

• The CoRAM Architecture/API

– standard, scalable memory abstraction for FPGA

– simplifies application development and portability

• Software is easy; hardware is hard?

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s26

• Software is easy; hardware is hard?

– C libraries and system calls is a big part of this impression

– CPUs and GPUs are not much easier if to get the last ounce of performance 

Page 14: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?

9/24/2014

14

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s27

Computer Architecture Lab (CALCM)

Carnegie Mellon University

http://www.ece.cmu.edu/CALCM