A Case for FAME: FPGA Architecture Model Execution
Zhangxi Tan, Andrew Waterman, Henry Cook, Sarah Bird, Krste Asanovic, David Patterson
The Parallel Computing Lab, UC Berkeley
ISCA '10

Transcript of the slide presentation (28 pages).

Page 1:

A Case for FAME: FPGA Architecture Model Execution

Zhangxi Tan, Andrew Waterman,

Henry Cook, Sarah Bird,

Krste Asanovic, David Patterson

The Parallel Computing Lab, UC Berkeley

ISCA ’10

Page 2:

A Brief History of Time

Hardware prototyping initially popular for architects

Prototyping each point in a design space is expensive

Simulators became a popular, cost-effective alternative

Software Architecture Model Execution (SAME) simulators most popular

SAME performance scaled with uniprocessor performance scaling

Page 3:

The Multicore Revolution

Abrupt change to multicore architectures

HW, SW systems larger, more complex

Timing-dependent nondeterminism

Dynamic code generation

Automatic tuning of app kernels

We need more simulation cycles than ever

Page 4:

The Multicore Simulation Gap

As the number of cores increases exponentially, the time to model a target cycle increases accordingly

SAME is difficult to parallelize because of cycle-by-cycle interactions

Relaxed simulation synchronization may not work

Must bridge simulation gap

Page 5:

One Decade of SAME

            Median Instructions       Median #Cores   Median Instructions
            Simulated / Benchmark                     Simulated / Core
ISCA 1998   267M                      1               267M
ISCA 2008   825M                      16              100M

The effect is dramatically shorter simulation runs (~10 ms)

Page 6:

FAME: FPGA Architecture Model Execution

The SAME approach provides inadequate simulation throughput and latency

Need a fundamentally new strategy to maximize useful experiments per day

Want flexibility of SAME and performance of hardware

Ours: FPGA Architecture Model Execution (FAME) (cf. SAME, Software Architecture Model Execution)

Why FPGAs?

FPGA capacity scaling with Moore's Law

Now can fit a few cores on die

Highly concurrent programming model with cheap synchronization

Page 7:

Non-FAME: FPGA Computers

FPGA Computers: using FPGAs to build a production computer

RAMP Blue (UCB 2006)

1008 MicroBlaze cores

No MMU, message passing only

Requires lots of hardware

• 21 BEE2 boards (full rack) / 84 FPGAs

RTL directly mapped to FPGA

Time-consuming to modify

Cool, useful, but not a flexible simulator

Page 8:

FAME: System Simulators in FPGAs

[Slide diagram: two different target systems (Target System A: cores with private I$/D$, a shared L2$/interconnect, and DRAM; Target System B: cores with private I$/D$, multiple L2$ banks, and DRAM), both modeled by one host system, the FAME simulator.]

Page 9:

A Vast FAME Design Space

FAME design space even larger than SAME's

Three dimensions of FAME simulators:

Direct or Decoupled: does one host cycle model one target cycle?

Full RTL or Abstract RTL?

Host Single-threaded or Host Multi-threaded?

See paper for a FAME taxonomy!

Page 10:

FAME Dimension 1: Direct vs. Decoupled

Direct FAME: compile target RTL to FPGA

Problem: common ASIC structures map poorly to FPGAs

Solution: resource-efficient multi-cycle FPGA mapping

Decoupled FAME: decouple host cycles from target cycles (see the sketch after the diagram below)

Full RTL still modeled, so timing accuracy still guaranteed

[Slide diagram: a target register file with four read ports and two write ports (R1-R4, W1-W2) is implemented on the host as a two-read, one-write register file sequenced by an FSM over multiple host cycles.]
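To make the decoupled mapping concrete, here is a minimal SystemVerilog sketch of the idea shown in the diagram above. It is illustrative only (not code from the paper); the module name, widths, and the fixed two-host-cycles-per-target-cycle schedule are assumptions.

```systemverilog
// Sketch: emulate a 4-read/2-write target register file with a 2-read/1-write
// host memory. One target cycle is stretched over two host cycles; the rest of
// a decoupled model advances target time only when target_cycle_done fires.
module decoupled_regfile #(parameter int AW = 5, parameter int DW = 32) (
  input  logic          clk,
  input  logic          rst,
  // Target-cycle interface: four read ports (R1-R4) and two write ports (W1-W2)
  input  logic [AW-1:0] raddr [4],
  input  logic [AW-1:0] waddr [2],
  input  logic [DW-1:0] wdata [2],
  input  logic          wen   [2],
  output logic [DW-1:0] rdata [4],
  output logic          target_cycle_done
);
  // Host storage: only two reads and one write are performed per host clock
  logic [DW-1:0] mem [0:(1<<AW)-1];
  logic          phase;                          // 0: serve ports 0/1, 1: serve ports 2/3

  always_ff @(posedge clk) begin
    if (rst) begin
      phase <= 1'b0;
    end else begin
      phase <= ~phase;                           // two host cycles per target cycle

      // One write per host cycle
      if (!phase && wen[0]) mem[waddr[0]] <= wdata[0];
      if ( phase && wen[1]) mem[waddr[1]] <= wdata[1];

      // Two reads per host cycle, latched into the target-visible outputs
      if (!phase) begin
        rdata[0] <= mem[raddr[0]];
        rdata[1] <= mem[raddr[1]];
      end else begin
        rdata[2] <= mem[raddr[2]];
        rdata[3] <= mem[raddr[3]];
      end
    end
  end

  // The target register file "ticks" once both halves have been serviced
  assign target_cycle_done = phase;
endmodule
```

The decoupling is what preserves accuracy: target time advances only on target_cycle_done, so the extra host cycles spent sequencing the ports never show up in the modeled target timing.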

Page 11:

FAME Dimension 2: Full RTL vs. Abstract RTL

Decoupled FAME models full RTL of target machine

Don't have full RTL in initial design phase

Full RTL is too much work for design space exploration

Abstract FAME: model the target RTL at a high level

For example, split timing and functional models (à la SAME)

Also enables runtime parameterization: run different simulations without re-synthesizing the design (see the sketch below)

Advantages of Abstract FAME come at a cost: model verification

Timing of the abstract model is not guaranteed to match the target machine
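As an illustration of runtime parameterization, here is a small SystemVerilog sketch under assumed names (not from the paper): an abstract timing model reads its miss penalty from a register written at run time, so a different memory-latency experiment needs no re-synthesis.

```systemverilog
// Sketch: the modeled L2 miss penalty is a runtime-writable register rather than
// a synthesized constant, so different target configurations can be simulated
// without rebuilding the FPGA image.
module miss_timer #(parameter int W = 16) (
  input  logic         clk,
  input  logic         rst,
  // Control interface (e.g., written from host software before a run)
  input  logic         cfg_we,
  input  logic [W-1:0] cfg_miss_latency,
  // Timing-model interface
  input  logic         miss_start,      // a modeled L2 miss begins
  output logic         miss_done        // asserted when the modeled penalty has elapsed
);
  logic [W-1:0] miss_latency;           // runtime parameter
  logic [W-1:0] count;
  logic         busy;

  always_ff @(posedge clk) begin
    if (rst) begin
      miss_latency <= W'(100);          // default penalty (target cycles) until configured
      busy         <= 1'b0;
      count        <= '0;
    end else begin
      if (cfg_we) miss_latency <= cfg_miss_latency;
      if (!busy && miss_start) begin
        busy  <= 1'b1;
        count <= miss_latency;
      end else if (busy) begin
        if (count == 0) busy  <= 1'b0;  // penalty elapsed
        else            count <= count - 1'b1;
      end
    end
  end

  assign miss_done = busy && (count == 0);
endmodule
```

The same pattern (timing knobs held in registers) lets one synthesized image serve many different simulation experiments.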

[Slide diagram: the target RTL is abstracted into a split functional model plus timing model.]

Page 12:

FAME Dimension 3: Single- or Multi-threaded Host

Problem: can’t fit big manycore on FPGA, even abstracted Problem: long host latencies reduce utilization Solution: host-multithreading

[Slide diagram: a target model of four CPUs (CPU1-CPU4) emulated by a multithreaded emulation engine on the FPGA: one hardware pipeline with per-CPU copies of state (PCs, GPRs) sharing the I$ and D$ ports. Caption: single hardware pipeline with multiple copies of CPU state.]
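Here is a rough SystemVerilog illustration of that idea; it is a sketch, not RAMP Gold's pipeline, and the thread count, state layout, and next-PC logic are placeholder assumptions.

```systemverilog
// Sketch of host-multithreading: one physical fetch stage, NTHREADS copies of
// target CPU state, round-robin selection of which target CPU is modeled each
// host cycle.
module ht_fetch #(parameter int NTHREADS = 4, parameter int XLEN = 32) (
  input  logic                        clk,
  input  logic                        rst,
  output logic [$clog2(NTHREADS)-1:0] tid,       // target CPU issued this host cycle
  output logic [XLEN-1:0]             fetch_pc
);
  logic [XLEN-1:0] pc [NTHREADS];                // one PC per target CPU

  always_ff @(posedge clk) begin
    if (rst) begin
      tid <= '0;
      for (int t = 0; t < NTHREADS; t++) pc[t] <= '0;
    end else begin
      tid <= (tid == NTHREADS-1) ? '0 : tid + 1; // rotate through the target CPUs
      // Per-thread state is touched only on that thread's turn, so a long host
      // latency (e.g., a host DRAM access) stalls one target CPU while the other
      // threads keep the single pipeline busy.
      pc[tid] <= pc[tid] + 4;                    // placeholder next-PC logic
    end
  end

  assign fetch_pc = pc[tid];
endmodule
```

Only the state is replicated; the datapath, which dominates FPGA area, exists once, which is what lets a modest FPGA model a much larger target.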

Page 13:

Metrics besides Cycles: Power, Area, Cycle Time

FAME simulators determine how many cycles a program takes to run

Computing Power/Area/Cycle Time: SAME old story

Push target RTL through VLSI flow

Analytical or empirical models

Collecting event stats for model inputs is much faster than with SAME
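For example (a hypothetical sketch; the counter names and events are assumptions, not RAMP Gold's actual instrumentation), the timing model can keep per-target-core event counters that host software later reads and feeds to analytical or empirical power/area models:

```systemverilog
// Sketch: per-target-core event counters embedded in a FAME timing model.
// Counts accumulate at full FPGA speed instead of being logged in software.
module event_counters #(parameter int NCORES = 64, parameter int W = 48) (
  input  logic                      clk,
  input  logic                      rst,
  input  logic [$clog2(NCORES)-1:0] core,              // core modeled this host cycle
  input  logic                      ev_l2_miss,        // events reported by the timing model
  input  logic                      ev_branch_mispred,
  // Simple read-out port for host software
  input  logic [$clog2(NCORES)-1:0] rd_core,
  output logic [W-1:0]              rd_l2_miss,
  output logic [W-1:0]              rd_branch_mispred
);
  logic [W-1:0] l2_miss        [NCORES];
  logic [W-1:0] branch_mispred [NCORES];

  always_ff @(posedge clk) begin
    if (rst) begin
      for (int c = 0; c < NCORES; c++) begin
        l2_miss[c]        <= '0;
        branch_mispred[c] <= '0;
      end
    end else begin
      if (ev_l2_miss)        l2_miss[core]        <= l2_miss[core]        + 1'b1;
      if (ev_branch_mispred) branch_mispred[core] <= branch_mispred[core] + 1'b1;
    end
  end

  assign rd_l2_miss        = l2_miss[rd_core];
  assign rd_branch_mispred = branch_mispred[rd_core];
endmodule
```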

Page 14:

RAMP Gold: A Multithreaded FAME Simulator

Rapid, accurate simulation of manycore architectural ideas using FPGAs

Initial version models 64 SPARC v8 cores with a shared memory system on a $750 board

Hardware FPU, MMU, boots OS

                   Cost             Performance (MIPS)   Simulations per day
Simics (SAME)      $2,000           0.1 - 1              1
RAMP Gold (FAME)   $2,000 + $750    50 - 100             100

Page 15:

RAMP Gold Target Machine

[Slide diagram: the RAMP Gold target machine: 64 SPARC V8 cores, each with private I$ and D$, connected through a shared L2$/interconnect to DRAM.]

Page 16:

RAMP Gold Model

[Slide diagram: a functional model pipeline with architectural state, coupled to a timing model pipeline with timing state.]

SPARC V8 ISA

One-socket manycore target

Split functional/timing model, both in hardware

– Functional model: executes the ISA

– Timing model: captures pipeline timing detail

Host multithreading of both functional and timing models

Functional-first, timing-directed (see the sketch below)

Built for Xilinx Virtex-5 systems

[ RAMP Gold, DAC ‘10 ]
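Below is a rough sketch of what a functional-first, timing-directed split can look like in hardware; the interface, signal names, and handshake are illustrative assumptions, not RAMP Gold's actual design.

```systemverilog
// Sketch of a timing-directed handshake: the timing model owns target time and
// tells the functional model which hardware thread (target core) may execute its
// next instruction; the functional model reports back what happened so the
// timing model can charge the right number of target cycles.
interface fame_step_if #(parameter int NTHREADS = 64) ();
  // Timing model -> functional model: execute one instruction for this thread
  logic                        step_valid;
  logic [$clog2(NTHREADS)-1:0] step_tid;
  // Functional model -> timing model: outcome of that instruction
  logic                        done_valid;
  logic [$clog2(NTHREADS)-1:0] done_tid;
  logic                        done_is_load;
  logic                        done_cache_miss;

  modport timing     (output step_valid, step_tid,
                      input  done_valid, done_tid, done_is_load, done_cache_miss);
  modport functional (input  step_valid, step_tid,
                      output done_valid, done_tid, done_is_load, done_cache_miss);
endinterface
```

Because target time lives entirely on the timing side of this boundary, most experiments only have to modify the timing model, which is the point made later about RAMP Gold's 1000-line timing model.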

[Slide diagram: the one-socket, 64-core target (cores with private I$/D$, shared L2$/interconnect, DRAM), as on the previous slide.]

Page 17:

Case Study: Manycore OS Resource Allocation

Spatial resource allocation in a manycore system is hard

Combinatorial explosion in number of apps and number of resources

Idea: use predictive models of app performance to make it easier on OS

HW partitioning for performance isolation (so models still work when apps run together)

Problem: evaluating effectiveness of resulting scheduling decisions requires running hundreds of schedules for billions of cycles each

Simulation-bound: 8.3 CPU-years for Simics!

See paper for app modeling strategy details

Page 18:

Case Study: Manycore OS Resource Allocation

[Slide chart: normalized runtime (y-axis, 0 to 4) of the worst, chosen, and best schedules for the Synthetic Only and PARSEC Small workloads.]

Page 19:

Case Study: Manycore OS Resource Allocation

The technique appears to perform very well for synthetic or reduced-input workloads, but is lackluster in reality!

[Slide chart: normalized runtime (y-axis, 0 to 4) of the worst, chosen, and best schedules for the Synthetic Only, PARSEC Small, and PARSEC Large workloads.]

Page 20:

RAMP Gold Performance

FAME (RAMP Gold) vs. SAME (Simics) performance

PARSEC parallel benchmarks, large input sets

>250x faster than full system simulator for a 64-core target system

[Slide chart: speedup (geometric mean) of RAMP Gold over Simics configured as functional only, functional + cache/memory (g-cache), and functional + cache/memory + coherency (GEMS), for 4 to 64 target cores (y-axis: speedup, 0 to 300).]

Page 21:

Researcher Productivity is Inversely Proportional to Latency

Simulation latency is even more important than throughput

How long before the experimenter gets feedback?

How many experimenter-days are wasted if there was an error in the experimental setup?

        Median Latency (days)   Maximum Latency (days)
FAME    0.04                    0.12
SAME    7.50                    33.20

Page 22:

Fallacy: FAME is too hard

FAME simulators are more complex, but not greatly so

Efficient, complete SAME simulators are also quite complex

Most experiments only need to change the timing model

RAMP Gold’s timing model is only 1000 lines of SystemVerilog

Modeled Globally Synchronized Frames [Lee08] in 3 hours & 100 LOC

Corollary fallacy: architects don't need to write RTL

We design hardware; we shouldn't be scared of HDL

Page 23:

Fallacy: FAME Costs Too Much

Running SAME on the cloud (EC2) is much more expensive!

FAME: 5 XUP boards at $750 each; $0.10 per kWh

SAME: EC2 Medium-High instances at $0.17 per hour

        Runtime (hours)   Cost for first experiment   Cost for next experiment   Carbon offset (trees)
FAME    257               $3,750                      $10                        0.1
SAME    73,000            $12,500                     $12,500                    55.0
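As a quick check (arithmetic derived from figures quoted elsewhere in the talk, not text from the original slide), the SAME runtime and the costs are consistent:

```latex
8.3~\text{CPU-years} \times 8760~\tfrac{\text{h}}{\text{yr}} \approx 72{,}700~\text{h} \approx 73{,}000~\text{h of SAME runtime}
\qquad
73{,}000~\text{h} \times \$0.17/\text{h} \approx \$12{,}400 \approx \$12{,}500

5~\text{boards} \times \$750 = \$3{,}750 \text{ for the first FAME experiment}
\qquad
\$10 \,/\, (\$0.10/\text{kWh}) = 100~\text{kWh of electricity for the 257-hour FAME run}
```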

Are architects good stewards of the environment?

SAME uses the energy of 45 seconds of the Gulf oil spill!

Page 24:

Fallacy: statistical sampling will save us

Sampling may not make sense for multiprocessors

Timing is now architecturally visible

May be OK for transactional workloads

Even if sampling is appropriate, runtime is dominated by functional warming => still need FAME

The FAME simulator ProtoFlex (CMU) was originally designed for this purpose

Parallel programs of the future will likely be dynamically adaptive and auto-tuned, which may render sampling useless

Page 25:

Challenge: Simulator Debug Loop can be Longer

It takes 2 hours to push RAMP Gold through the CAD tools

Software RTL simulation to debug the simulator is also very slow

SAME debug loop only minutes long

But the sheer speed of FAME eases some tasks

Try debugging and porting a complex parallel program in SAME

Page 26:

Challenge: FPGA CAD Tools

Compared to ASIC tools, FPGA tools are immature

Encountered 84 formally-tracked bugs developing RAMP Gold

Including several in the formal verification tools!!

By far FAME’s biggest barrier (Help us, industry!) On the bright side, the more people using FAME,

the better

Page 27:

When should Architects still use SAME?

SAME still appropriate in some situations:

Pure functional simulation

ISA design

Uniprocessor pipeline design

FAME necessary for manycore research with modern applications

Page 28:

Conclusions

FAME uses FPGAs to build simulators, not computers

FAME works, it’s fast, and we’re using it SAME doesn’t cut it, so use FAME!

Thanks to the entire RAMP community for contributions to FAME methodology

Thanks to NSF, DARPA, Xilinx, SPARC International, IBM, Microsoft, Intel, and UC Discovery for funding support

RAMP Gold source code is available: http://ramp.eecs.berkeley.edu/gold