© Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil...

141
1 © Derek Chiou Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe, Hari Angepat University of Texas at Austin Electrical and Computer Engineering
  • date post

    15-Jan-2016
  • Category

    Documents

  • view

    231
  • download

    0

Transcript of © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil...

Page 1: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1© Derek Chiou

Functional/Timing Split in UT FAST

Derek Chiou,

Dam Sunwoo, Joonsoo Kim,

Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe, Hari Angepat

University of Texas at Austin

Electrical and Computer Engineering

Page 2: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 2

Test of size

First, Some Terminology

Host: the system on which a simulator runs Dell 390 with a single 1.8GHz Core 2 Duo and 4GB of RAM A Xilinx FPGA board

Target: the system being modeled Alpha 21264 processor Dell 390 with a single 1.8GHz Core 2 Duo and 4GB of RAM

Host

Simulator

Target

Your desktop

Simplescalar (sim-alpha)

Alpha 21264

Page 3: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 3

Test of size

FAST Goals RTL-level cycle-accuracy Complex ISA capable (x86, PowerPC) Complex micro-architecture capable (Intel Core 2) Off-the-shelf OS, apps (Windows, MS Word, Linux)

Lesson I learned from my grad school career Can run other stuff too of course

MP-capable (scale with FPGA resources) Fast (10MIPS range) (Relatively) easy to implement, modify, extend

Page 4: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 4

Test of size

FAST Prototype in Real Time

Page 5: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 5

Test of size

FAST: Speculative Functional/Timing Partitioning Proven partitioning (FastSim)

FM executes instructions to completion, pushes inst trace to TM

FM insts used as TM fetched insts If functional insts != timing insts, TM

forces FM to rollback Eg., branch mis-speculation, resolve,

memory ordering Clean inst trace/rollback interface

Optimize the common case! FM runs independently from TM when

functional insts == timing insts! Easy to parallelize

Better target uArch simulates faster

Factorized, not partitioned! (FM + TM) < (monolithic simulator) FM fairly simple, only functionality TM fairly simple, only timing

FunctionalModel

(ISA + peripherals)

TimingModel

(Micro-architecture)

InstructionsArchitectural registers

Peripheral functionality…..

CachesArbitrationPipeliningAssociativity….

Inst trace

Page 6: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 6

Test of size

High Level FAST Architecture:A Parallelized Simulator

Parallelized between FM & TM Parallel target would have a parallelized FM, each target core

conceptually running on a separate host core Parallelized TM

Parallelizes nicely in hardware (FPGA) Latency tolerant, infrequent round-trips Stats can be done in hardware! (no performance impact)

trace

FunctionalModel

(ISA + peripherals)

TimingModel

(Micro-architecture)

Host HostFPGA

Page 7: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 7

Test of size

What Is A FAST Functional Model? Requirements

Fast, Full System, generate instruction trace, support rollback

Hardware functional models Fast, but FPGA implementation difficult to make complete

x86, boots Windows? Simple, resource efficient FM sufficient (but need trace, rollback)

Software functional models Bochs, QEMU, Simics, SimNow, SimOS, etc.

Run on fastest hardware we know about to execute an ISA Full system

Page 8: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 8

Test of size

What is a FAST Timing Model?

TraceTrace

0x2

addrinst

InstructionMemory

Add

rd1

GPR File

rr1rr2

wrwd rd2

we

Immed.Extend

M

0

2

raddr

waddr

wdata

rdata

re

Data Memory

ALU

algn

1

3

wePCA

B

MD1

Y

MD2

IR

IR IR IR

R

Bypass/interlock I1

I2

Page 9: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 9

Test of size

Modular Timing Models:Modules + Connectors

Modules model timing functionality E.g., rename, caches, etc. Built hierarchically for extensibility

CAM, FIFOs, arbiters, etc. Branch predictors, Caches, TLBs,

Schedulers, ALUs Fetch, Decode, Rename, RS, ROB

Many are essentially wires (e.g., ALU) Often written to execute one operation

E.g., Rename, Cache Executed multiple times per target

cycle for wider processor, higher associativity

Simplifies implementation, tradeoff time for space

Connectors connect modules Abstract timing from modules

Throughput (input, output), delay, maxTransactions

Stats and tracing

Bill Reinhart

Page 10: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 10

Test of size

Microcode Compiler Intention

Automate generation of new ISA instructions Automatically retarget new micro-architectures Necessary for x86

Uses the LLVM Compiler Infrastructure developed at UIUC http://www.llvm.org

Compile the Bochs CPU model Bochs is another portable x86 full-system emulator. Backend retargeted to micro-op ISA

Generates microcode that “runs” on the timing model

Over 99% dynamic inst coverage for most INT benchmarks floating point instructions not yet supported

Average 1.27 uOps per handled dynamic x86 instruction Microcode ISA is simple load/store with some x86 specialization

Build a functional model as a processor executing microcode ISA?

Nikhil Patil

Page 11: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 11

Test of size

Prototype Overview

Software functional model Eventually hardware functional model, but software sim exists

FPGA-based timing model written in Bluespec Complex OoO micro-architecture fits in a single FPGA

DRC or XUP

trace

FunctionalModel

Software

TimingModel

Bluespec HDL

Processor FPGADRCComputer

HT Xilinx FPGAPowerPC 405

Page 12: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 12

Test of size

Current Prototype Functional Model Derived from QEMU

Fast (JIT), boots Linux, Windows Supports x86, x86-64, PowerPC, Sparc, ARM, MIPS, …

Prototype QEMU currently supports x86 Added tracing, rollback (implemented with checkpoint)

Including I/O (keyboard, mouse, video) Hosts

x86 machines PowerPC inside of an FPGA

Dam Sunwoo, Jeb Keefe

Page 13: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 13

Test of size

Current Prototype Timing Model:OOO Superscalar

Fetch

BRU

R/SRename/

ROBD

ALU

LdSt Queue

iL1 dL1L2

ResultBus

TLB

MEM

BP

25 25

8

8 88

11

1 1

1 1

1 1

1

1 1

1 1

Joonsoo Kim, Nikhil Patil, Bill Reinhart, Eric Johnson

Page 14: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 14

Test of size

Current Simulator Performance on DRC

0

0.5

1

1.5

2

2.5

3

3.5

MIP

S

gshare

BP 97%

BP 100%

Includes Operating System CodeIncludes Operating System Code

Page 15: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 15

Test of size

FAST Future Work Improve performance (currently TM bottlenecked)

TM (5MIPS-20MIPS) FM (10MIPS-100MIPS)

Optimize simulator (10MIPS-20MIPS) Hardware FM (FPGA ~100MIPS)

Simple pipe running microcode ISA Multithreaded (Protoflex) to improve throughput

Real processor??? (debug for trace, transactional for rollback?) FAST-MP

Need MP host PowerPC ISA support (this month) Power estimation capabilities in FPGAs

Don’t slow down FAST performance

Page 16: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 16

Test of size

FAST and RAMP-Classic: A Comparison

FAST RAMPCores Arbitrary, complex cores from

simple core+rollback+TMFull RTL, fits in FPGA

ISAs/OSs Arbitrary ISA/OS

(x86/Windows, PowerPC, …)

Depends on core

(PowerPC, Sparc)

Accuracy RTL cycle-accurate

Host cycles >= target cycles enabling reuse of hardware resources

Only accurate target RTL available (unless timing model used)

Host cycle == target cycle

Speed Complexity/Resource tradeoff As fast as the FPGA will run

Scalability Depends on FPGA resources (TM costs)

Depends on FPGA resources

Page 17: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 17

Test of size

FAST-MP: RAMP as a functional model How to do FAST-MP? Need multicore host! RAMP will not be able to accurately model Intel Core X

micro-architecture in RAMP Unless it becomes much simpler in-order pipeline

even then, will Intel give us RTL? But, RAMP processor can execute ISA

Target ISA == Host ISA Add trace, rollback capabilities

Target ISA != Host ISA Simulate target ISA Hardware support for target ISA, trace, rollback?

RAMP becomes a FAST functional model Use timing model to predict arbitrary behavior

Page 18: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 18

Test of size

FAST & RAMP-White Shared Infrastructure FAST connectors as White connectors

Stats, timing FAST modules as White modules

Quad ported RAM CAM/Cache on top of quad-ported RAM Multi-host cycle caches Branch predictors Load/store units, bus interface units Interconnection networks

Eventually, full processors? Executing microcode, software cracking?

FAST TM as RAMP TM At that point, RAMP == FAST-MP

Page 19: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 19

Test of size

Conclusions Split functional/timing can be cycle-accurate

Roll back (FAST) or Timing-directed (timing-first?) (HASim)

We believe that roll back can be done relatively cheaply Can be done in software at about order of magnitude impact Hardware support appears to be reasonable

Log old values, playback (easy for simple core) Transactional processor?

Simple cores + roll back as host for functional model? Virtualization for more threads is orthogonal, but

multiplicative in resources

Page 20: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 20

Test of size

Another View of FAST Old processor/system modeling new processor/system

Leverage the fact that most of the time, functionality and order is identical

Differences in timing between old and new modeled in timing model

Roll back/re-execute to deal with differences Use current generation host as functional model for

next generation target? Trace implemented by debug support Roll back implemented with transactional support

Page 21: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 21

Test of size

Host == Target Modules in FAST FAST can leverage modules that are full and accurate

implementations of modules

Host TLB == target TLB

Memory requests issued when timing model

Page 22: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 22

Test of size

FAST-MP: Mixing

P P

IU IU

Page 23: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 23

Test of size

What Is A FAST Functional Model? Requirements

Fast Full System Generates instruction trace Supports rollback

Hardware functional models (very fast) Real processor doesn’t support trace/rollback FPGA implementation difficult to make complete

x86, boots Windows? Software functional models exist today

Bochs, QEMU, Simics, SimNow, SimOS, etc. Relatively fast, full system

Run on fastest hardware we know about to execute an ISA Can be modified to generate trace/support rollback

Page 24: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 24

Test of size

trace

Step 1: Improving Performance via Parallelization

Parallel slowdown due to communication? FM runs ahead, speculatively, round-trip communication infrequent Round-trip communication only when (functional path != timing path)

Microprocessors have same problem Multiple issue, deep pipelines only work if predicted path is correct

FM like perfect front end of processor, real uArch (TM) slows it down The better the target micro-architecture, the faster the simulator

FunctionalModel

(ISA + peripherals)

TimingModel

(Micro-architecture)

HostHost Host

Page 25: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 25

Test of size

Current Prototype Functional Model Derived from QEMU

Fast (JIT), boots Linux, Windows Supports x86, x86-64, PowerPC, Sparc, ARM, MIPS, …

Prototype currently supports x86 Added tracing, rollback (implemented with checkpoint)

Including I/O (keyboard, mouse, video) Hosts

x86 machines PowerPC inside of an FPGA

PowerPC target by January about 1 month to port

Dam Sunwoo, Jeb Keefe

Page 26: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 26

Test of size

Current Prototype Timing Model

Fetch

BRU

R/SRename/

ROBD

ALU

LdSt Queue

iL1 dL1L2

ResultBus

TLB

MEM

BP

25 25

8

8 88

11

1 1

1 1

1 1

1

1 1

1 1

Joonsoo Kim, Nikhil Patil, Bill Reinhart, Eric Johnson

Page 27: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 27

Test of size

Microcode Compiler Intention

Automate generation of new ISA instructions Automatically retarget new micro-architectures Necessary for x86

Uses the LLVM Compiler Infrastructure developed at UIUC http://www.llvm.org

Compile the Bochs CPU model Bochs is another portable x86 full-system emulator. Backend retargeted to micro-op ISA

Generates microcode that “runs” on the timing model

Over 99% dynamic inst coverage for most INT benchmarks floating point instructions not yet supported

Average 1.27 uOps per handled dynamic x86 instruction

Nikhil Patil

Page 28: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 28

Test of size

Current Simulator Performance on DRC

0

0.5

1

1.5

2

2.5

3

3.5

MIP

S

gshare

BP 97%

BP 100%

Includes Operating System CodeIncludes Operating System Code

Page 29: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 29

Test of size

Performance Details Timing model is current bottleneck

100MHz host cycle (not pushing timing) Currently taking ~30 host (FPGA) cycles per target cycle, max about 54

cycles (currently max latency defines target clock) BP is a simple gshare predictor

Functional model Unoptimized modified QEMU With perfect BP, immediate return from TM, 5.4MIPS

FM/TM communication 469ns blocking read from Opteron on DRC (has gotten better)

Poll every other basic block 13ns/word for burst write

Page 30: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 30

Test of size

Some Related Work (there is a lot) Software

Functional/timing partitioned Asim, current M5, Timing-First, Opal all timing model driven

Timing model tells functional model what to do and when to do it

FastSim (Schnarr, et al, ASPLOS 98) Functional/timing, rollback when functional path != timing path But, instrumented binaries, not parallelized, no hardware

Hardware HASim: Hardware ASim (Emer, et. al)

Timing-first Seven points of communication between FM & TM

Requires infinitely renamed out-of-order FM Current supports a simplified MIPS ISA

Page 31: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 31

Test of size

Conclusions/Future Work It works Current FAST simulator prototype

1.2MIPS (unoptimized), about 1000 times slower than target Timely: during architecture phase Complete: runs Windows, Linux Transparent: extensive, hardware-based stats Relatively inexpensive, easy to build and extend

(Some) future work Optimize

5MIPS soon, 10MIPS-20MIPS later (hardware FM using uCode?) More realistic timing model & calibration Tattler: automatic bottleneck detection CMP/SMP targets

Page 32: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 32

Test of size

FAST Relative Speeds (Current Prototype)

1

10

100

1000

FA

ST

Rela

tive S

peed

Page 33: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 33

Test of size

X86 Micro-Op Coverage

0

0.2

0.4

0.6

0.8

1

1.2

Dyn

am

ic In

str

ucti

on

Co

vera

ge

Page 34: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 34

Test of size

Number of uOps/x86 Instruction

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

uO

ps

/x8

6

Page 35: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 35

Test of size

Outline What is FAST (1 minute)

Targeted properties Demo (30 sec)

Runs x86, Windows at speeds fast enough to interact How?

Start with partition Proven strategy (FastSim) Rollback to handle branch mis-speculation Simplifies full-system capabilities, since the complexity of the full-

system is encapsulated in the functional model Functional model can be full-system simulator or processor

FM passes trace to TM TM is simple

can model complex structures fairly low weight Improve performance

Parallelize on FM/TM boundary Round-trip communication infrequent

Permit FM to run ahead speculatively Doesn’t mis-speculation slow things down? Learn from computer architecture (microcoded,

pipelined, OoO) Speculation permits computer system to run

faster Key is not to mis-speculate often Parallelize on functional/timing boundary Handles branch prediction, anything where

functional path is not equal to target path Rollback required from FM

Bottleneck is timing model Parallelize TM?

Difficult to do in software Practical limiations due to number of processors that

communicate quickly Hardware (FPGA) is ideal

Parallelizes nicely TM partition simplifies hardware Latency tolerant, infrequent round-trips

Prototype Overview of prototype

Software functional model running on processor host TM in Bluespec on FPGA Describe some details DRC or XUP

FM description TM description

Block diagram Microcode compiler Performance Relative performance Bottlenecks

Future work Conclusions

Page 36: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 36

Test of size

Functional Model Modifications Need to

Rollback, force branch Rollback, restore and continue

How? set_pc(inst_num, pc)

Set a particular dynamic instance of an instruction to a particular instruction pointed to by PC

Sufficient Can be implemented with checkpoints

ISA state, memory, peripherals

BR

BR

BR

BRBR

Page 37: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 37

Test of size

Page 38: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 38

Test of size

Why

Page 39: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 39

Test of size

Why Are Simulators Slow? Complexity

40 entry fully-assoc iTLB 40 entry fully-assoc dTLB 16-way L2 cache Many schedulers Multiple Decode Out-of-order issue I/O Multiple cores, processors

Modeling PARALLELISM

Can we parallelize simulator?

Page 40: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 40

Test of size

Parallelizing Simulators Software

Starting from 10KHz sequential simulator, 10MHz performance requires 1K processors Impractical

If a software simulator could be parallelized, use same techniques to make the target faster?

Hardware It’s what processors are made from Plenty of parallelism Difficult to implement? FPGAs for configurability

Page 41: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 41

Test of size

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

Modeling Complex Processors on FPGA(s) Compile RTL for FPGA

Pentium fits in large FPGA 3.1M transistors (Lu, Intel)

Issues Fit: Core 2 in a single FPGA?

Impossible: Processors grow as fast as FPGAs

Multiple FPGAs to model one processor

Full RTL required A lot of RTL Difficult to cover all cases Difficult to modify

FPGAPentiumPentium Core 2

Page 42: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 42

Test of size

Strawman Solution Partition

Break problem down into multiple pieces Target module boundaries obeyed?

Replace some/all RTL with pre-written modules and/or easier to write (behavioral/software) code Hybrid simulation

Partitioned simulator, each partition running in potentially different host technology

Page 43: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 43

Test of size

Simple Module-based Software/Hardware Partitioning

Partitioning on module boundaries over FPGAs and software Simplescalar + FPGA L1 DCache (SuhWARFP2006)

0x2

addrinst

Instruction$/Mem

Add

rd1

GPR File

rr1rr2

wrwd rd2

we

Immed.Extend

M

0

2

raddr

waddr

wdata

rdata

re

Data $/Memory

ALU

algn

1

3

wePCA

B

MD1

Y

MD2

IR

IR IR IR

R

I1

I2

bypass

IS THERE A BETTER PARTITIONING?

Page 44: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 44

Test of size

FAST

FunctionalModel

(ISA)

TimingModel

(Micro-architecture)

InstructionsArchitectural registers

Peripheral functionality…..

FetchDecodeRenameReservation stationsScheduling windowReorder buffer….

Inst trace

FPGA

Page 45: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 45

Test of size

More Complexity

Caches/TLBs? Keep tags, pass address (virtual and

physical if necessary) Hits, misses determined but don’t need

data Superscalar (multiple issue)?

“Fetch and issue” multiple instructions assuming they meet boundary constraints

Multiple “functional units” Schedulers Reorder buffer/instruction window Pipeline control along with instructions

NO DATAPATH only part of control path

Page 46: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 46

Test of size

A Timing Model

iTLB iCache

dTLB dCache

Align & Pick

Decode Decode Decode

Sched Sched Sched Sched

L2 Cache

FunctionalModel

Memory &I/O timingmodels

Can Easily Execute in Parallel in Hardware

Page 47: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 47

Test of size

Trace-Driven Simulation? Described a trace-driven simulator Accurate if “functional path” == “target path”

“Ideal” processor

Unfortunately, functional path is not always equal to timing path Even for simple pipelines Real processors speculatively execute, assuming their target

path is the functional path

Page 48: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 48

Test of size

Branch Prediction

iTLB iCache

dTLB dCache

Align & Pick

Decode Decode Decode

Sched Sched Sched Sched

L2 Cache

FunctionalModel

Memory &I/O timingmodels

Wrong-path instructions! Implement BP in timing model Timing model forces ISA simulator to mis-

speculate Rollback, restore

Branch predictor predictor in ISA simulator?

BP only works in processor if it’s fairly accurate

FAST simulators take advantage of the fact that most of the time micro-architecture is on the right path Most complexity (BP, parallelism) can be

handled this way

Page 49: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 49

Test of size

Speculative Simulation FAST simulators themselves are speculative

Speculate functional path == target path Timing model detects functional path != target path

Forces functional model down wrong path OR Returns functional model to functional path

Speculation reduces need for round-trip communication Makes FAST latency tolerant (good for parallelization)

Functional model reusable Timing model determines target behavior

Page 50: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 50

Test of size

FAST Tool Flow

Functional modelspecification

Timing Model(FPGA or software)

Instruction decode Functional Model

(software and/or FPGA)

Functional model

Timing modelSpecification

(modules + connector)

Timing model (RTL/C)

Timing Model Compiler(Bluespec)

Functional model compiler (gcc)

RTL-to-Timing Model

RTL

MicrocodeGenerator

Will be using Intel AWB for stats, etc.

Page 51: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 51

Test of size

Performance Details Functional model

Unoptimized modified QEMU Driving an FPGA-based timing model

Standard producer/consumer parallel communication problems 469ns blocking round-trip from Opteron to FPGA Poll every other basic block

BP is a simple gshare predictor Timing model (unoptimized)

100MHz host cycle right now (not pushing timing) Currently taking ~30 host cycles per target cycle, max about 54 cycles

(currently max latency defines target clock) Complex timing model fits into single FPGA

Page 52: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 52

Test of size

Outline Motivation

FAST: Parallelized complex core, full-system simulator

RAMP-White: Parallel hosts running parallel targets

Page 53: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 53

Test of size

RAMP-White Requirements Coherent shared memory experimental platform

Configurable coherence protocol, engine

Scalable to the same level as other RAMP machines

1K eventual target Down to 2

Full system (OS, I/O, etc.) Intentions

ISA/Architecture independent (like all RAMP efforts) Use different cores

Integrate components from other RAMP participants A test-bed for sharing IP

Page 54: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 54

Test of size

Texas Modifications to RAMP-White New code in Bluespec rather than Verilog/VHDL

Many advantages including interfaces, configurability My group’s hardware development is exclusively Bluespec Free/low cost to academics (www.bluespec.com)

Start with XUP board We had XUP before BEE2

Support Leon3 and PowerPC Started with embedded PowerPC but Linux, non-MP core issues Restarted with Leon3

Linux, MP-capable core Get RAMP-White infrastructure to work

Plan to port back to embedded PowerPC, easy to move to soft PowerPC

Page 55: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 55

Test of size

High-Level Architecture Philosophy Flexibility

Avoid wasted work Easy changes Module-agnostic

Processors, network, I/O, etc.

Interfaces Complete set of necessary interfaces All communication via messages

Fixed fields, but fields are configurable “shims” connect components to White infrastructure

Use existing IP

Page 56: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 56

Test of size

RAMP-White Block DiagramRAMP-White Block Diagram

Network Router

Intersection Unit (IU)

Memory Controller

(MC)

IO & Platform

Devices

Processor

Network Interface

(NIU)

Coherent $

Proc dependent

Page 57: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 57

Test of size

Three Phase Approach to Hardware H1: Incoherent shared memory

No hardware global cache, just global shared memory support Optional cache for local memory

However, software can maintain coherence if necessary Network virtual memory Run a simulator on top of the processor

Ring network H2: Ring-based coherence (scalable bus)

Requires a coherent cache, IU awareness Running what is essentially a snoopy protocol

True coherence engine not required But, very restricted communication

Sufficient for testing, modeling many targets H3: General network-based coherence

Requires general coherence engine, general network

H4: Different cores

IUP $$

MC I/O

IUP $$

MC I/O

C $IU

P $$

MC I/O

IU

P $$

MC I/O

C $

C $IU

P $$

MC I/O

IU

P $$

MC I/O

C $

Page 58: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 58

Test of size

Operating System Issues with SMP OS on embedded PowerPC

Incoherent cache Load-reservation/store-conditional instructions not MP capable Also missing TLB Invalidation & OpenPIC (interprocessor interrupts,

bring-up) How scalable anyways? (1K processors)

Four phase approach using RAMP-White hardware O1: Separate OS per core (PowerPC) working for XUP

Region of memory is global (mmap) Locks using regular loads/stores + sequential consistency

O2: SMP OS on Leon3 Use RAMP-White scalable hardware across multiple FPGAs Snoopy cache

O3: SMP OS on Leon3 using directory cache O4: SMP OS on PowerPC port

Page 59: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 59

Test of size

Programmer View Sequential consistency

PowerPC Global addresses labeled as uncached

Ordered accesses from PowerPC 405 Coherent global cache still uncached from processor

Soft cores can be weaker

User interface Terminal per core/OS if desired Mmap to map shared memory

Page 60: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 60

Test of size

H1/O1 RAMP-White Hari Angepat did the work

Components Written in Bluespec NIU code complete and tested

2 processor ring (PowerPC) IU code complete and tested

Processor Slave (no coherence right now) PLB Master/slave interface (I/O) NIU interface

Hardware intended to target different ISAs PLB master and slave shims written

Some preliminary OS work Multi-image mmap interface running

Page 61: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 61

Test of size

Current RAMP-White Phase 1

Intersection Unit (IU)

IO & Platform

Devices

PPC 405/Leon3

Network Interface

(NIU)

Memory Controller

(MC)

PLB shim

Intersection Unit (IU)

PPC 405/Leon3

Network Interface

(NIU)

Linux Linux

Page 62: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 62

Test of size

Our Long Term Plans H1/O1, XUP working June 2007

PowerPC With multi-OS, limited device support

H2/O2 BEE2, Leon3 Coherent cache, IU forwarding modifications SMP

H3/O3 BEE2/BEE3, Leon3 Arbitrary network, cache coherency engine

Getting network from Washington, Berkeley H4/O4 BEE2/BEE3, PowerPC

Port back to PowerPC

Page 63: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 63

Test of size

RAMP Conclusions RAMP-White architecture

Phased approach minimizes wasted work Designed to be easy to modify for your purpose

Many architectures only require modified coherence engine, maybe cache

ISA/implementation agnostic Care taken to not be specific

RAMP-White Phase 1 works Running on XUP

Phase 2 close to working Running on BEE2

Page 64: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 64

Test of size

Future Work: FAST + RAMP-White FAST is currently not scalable

Need parallel functional models and parallel timing models

Parallel functional model runs on parallel host Maintain speed

Intention is to run FAST on top of RAMP-White Modify soft core to support checkpoint, rollback in hardware

FAST provides cycle-accurate of complex cores RAMP provides high performance, scalable host FAST+RAMP provides scalable complex core simulator

Page 65: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 65

Test of size

Acknowledgements Students

Dam Sunwoo (FM) Joonsoo Kim (TM, Interface) Nikhil Patil (TM, tools) Bill Reinhart (TM connector) Hari Angepat (RAMP-White) Eric Johnson (verification, Linux)

Funding DOE Early Career, NSF, SRC Intel, IBM, Xilinx, Freescale

Software Bluespec

Open-source full system simulators QEMU, Bochs

Page 66: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

66© Derek Chiou

Extra slides

Page 67: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 67

Test of size

What is in a Trace? Conceptually, everything a functional model can

produce Flattened opcode, virtual/physical address of instruction,

virtual/physical address of data, source registers, destination registers, condition code source and destination registers, exceptions, etc.

Can be heavily compressed Eg., simulator TLB to avoid physical address

Page 68: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 68

Test of size

Modules Modules predict what happens each cycle

Which instruction is scheduled? Modules hierarchically constructed

CAM, FIFOs, arbiters, etc. Branch predictors, Caches, TLBs, Schedulers, ALUs Fetch, Decode, Rename, RS, ROB

Many are essentially wires (e.g., ALU) Often written to execute one operation (simplifies)

E.g., Rename, Cache Executed multiple times per target cycle for wider processor, higher

associativity Simplifies implementation, tradeoff time for space

Joonsoo Kim, Nikhil Patil

Page 69: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 69

Test of size

Connector Interface Commit

Indicates module is done with target cycle

Done Indicates Connector is done

with target cycle Enq, First, Deq

Just like FIFO

EnqCommit

Done

DeqCommit First Done

Bill Reinhart

Page 70: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 70

Test of size

32b Address in Shared Memory Machine?? 4GB possible per BEE2 FPGA

Need more than 32b

Eventually, hope for 64b soft-core processors

For now two options: live with 4GB space Or, provide one more layer of translation

Physical address in certain region is global virtual address Translated by hardware to node + physical address

Also useful for multiple OSs in single memory OSs tend to assume they own physical address 0

Page 71: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 71

Test of size

Node Architecture

IU

P

P $$

MC I/O

IUP $$

MC I/O

C $IU

P $$

MC I/O

IU

P $$

MC I/O

C $

Page 72: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 72

Test of size

Generalized Architecture

Proc

IU NIUMC

$

Mem

OPBbridge

Intersection Unit Network Interface Unit

PLB

Proc dependent

Proc independent

Page 73: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 73

Test of size

Intersection Unit Processor interface Slave Snoop

Network interface Master (send) Slave (receive)

Memory interface Master (issue memory requests)

Hooks for coherency engine Bluespec nice to specify coherence

engine Incoherent version is a special case

Programmable memory regions Global (local and remote) Local translation

Intersection Unit (IU)

Memory Controller

(MC)

IO &

Platform Devices

Processor

Network Interface

(NIU)

Coherent $

Page 74: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 74

Test of size

Network Interface Unit Currently two virtual channels

Split into two components Msg composition/Queuing Net transmit/receive

Insert/extract for ring Intended to permit other net-

specific transmit/receive

One input/one output Creates a simple

unidirectional ring Can interface to more

advanced fabrics

Intersection Unit (IU)

Memory Controller

(MC)

IO &

Platform Devices

Processor

Network Interface

(NIU)

Coherent $

Page 75: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 75

Test of size

Sharing IP: Some Preliminary Experience We looked at RAMP-Red XUP

Used some code (PLB master) Red-BEE is not ready to distribute

Looking for switch code Berkeley’s code on CVS repository

But, we can’t use memory controller because we don’t have BEE2 board yet Bluespec We are spinning almost all of our own code right now

Would like to steal software OS (kernel proxy) SMP OS port

Naming MPI reference design in BEE2 repository Is that RAMP-Blue?

A central CVS repository for RAMP code?

Page 76: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 76

Test of size

Sharing Over the Long Term Processor is shared

Leon PowerPC MicroBlaze Everything else

MC is shared Xilinx or Berkeley

Coherent cache can be shared Transactional/traditional Borrow Stanford’s?

Coherency engine can be shared CMU/Stanford

IU functionality can be shared Trying to make ours general

NIU can be shared Borrow half from Berkeley?

Network can be shared Borrow Berkeley’s?

Proc

IU NIUMC

$

Mem

Peripherals

CCE

Page 77: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 77

Test of size

IU Internal Message

Defaults PRI: High priority, Low priority CMD: Read, Write, Coherence, … PERM: Modified, Exclusive, Shared, Invalid SIZE: Byte, word, double word, cache-line GADDR: global address (translated by IU) DATA: dependent on size

Bluespec permits easy modification for your protocol

PRI CMD PERM SIZE TAG

GADDR

DATA

Page 78: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 78

Test of size

Network Message

PRI: High and Low DEST,SRC: destination, source of message SIZE: Total message size NETTAG: network tag (optional) CMD: network command (optional) MESSAGE: data

PRI DEST SRC NETTAG CMD

MESSAGE

SIZE

Page 79: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 79

Test of size

Intersection Unit InternalsIntersection Unit Internals

Intersection Unit ControllerMemory

Controller & DRAM

Controller BRAMs

Proc IO Net

Proc IO Net

Global Address Translation

hardware

Page 80: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

80© Derek Chiou

Fast, Full-System, Cycle-Accurate Computer Simulators via Parallelization

Derek ChiouUniversity of Texas at Austin

Electrical and Computer Engineering

Page 81: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 81

Test of size

My Ideal Computer Simulator

Fast: as fast as possible 2-3 orders of magnitude slower than target? Fast enough to run real datasets to completion Interactive?

Timely: enough time to make decisions Accurate: produce cycle-accurate numbers Complete: run unmodified operating systems,

applications,… Transparent: full visibility, no performance hit Inexpensive: need thousands Flexible: quick changes, generate from RTL

Page 82: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 82

Test of size

Current Software Simulators

Performance-Detail tug-of-war Higher performance means less

simulation tasks Higher detail means more

simulation tasks

Simulator Slowdown Sim 2 min

ISA (SimNow, Simics) 1-100 10 min

Caches 10-1000 10 hrs

OOO (ASim ~1-10KIPS) 10K-1M 1 year

RTL 100M-1B 2000 years

CMP? ??? ???

Page 83: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 83

Test of size

Why So Slow? Complexity

40 entry fully-assoc iTLB 40 entry fully-assoc dTLB 16-way L2 cache Many schedulers Decoder I/O

Modeling PARALLELISM Parallelize simulator?

Sampling, benchmarking are orthogonal

Page 84: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 84

Test of size

Outline Motivation

How to Parallelize?

Functional Models and Timing Models

Status, Conclusions

Page 85: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 85

Test of size

Parallelizing Simulators Software

Starting from 10KHz sequential simulator, 10MHz performance requires 1K processors Impractical

If a software simulator could be parallelized, use same techniques to make the target faster?

Hardware It’s what processors are made from Plenty of parallelism Difficult to implement? FPGAs for configurability

Page 86: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 86

Test of size

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

Modeling Complex Processors on FPGA(s) Compile RTL for FPGA

Pentium fits in large FPGA 3.1M transistors (Lu, Intel)

Issues Fit: Core 2 in a single FPGA?

Impossible: Processors grow as fast as FPGAs

Multiple FPGAs to model one processor

Full RTL required A lot of RTL Difficult to cover all cases Difficult to modify

FPGAPentiumPentium Core 2

Page 87: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 87

Test of size

Strawman Solution Partition

Break problem down into multiple pieces Target module boundaries obeyed?

Replace some/all RTL with pre-written modules and/or easier to write (behavioral/software) code Hybrid simulation

Partitioned simulator, each partition running in potentially different host technology

Page 88: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 88

Test of size

Simple Module-based Software/Hardware Partitioning

Partitioning on module boundaries over FPGAs and software Simplescalar + FPGA L1 DCache (SuhWARFP2006)

0x2

addrinst

Instruction$/Mem

Add

rd1

GPR File

rr1rr2

wrwd rd2

we

Immed.Extend

M

0

2

raddr

waddr

wdata

rdata

re

Data $/Memory

ALU

algn

1

3

wePCA

B

MD1

Y

MD2

IR

IR IR IR

R

I1

I2

bypass

IS THERE A BETTER PARTITIONING?

Page 89: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 89

Test of size

Our Partitioning: Functionality/Timing Boundaries Proven Software Partitioning

Asim, FastSim, etc. Factorized, not partitioned!

Functional simulators exist Timing model becomes very simple

Promotes reuse Functional/Timing Simplifies timing model

Latency tolerant Separate FM & TM Software functional model? Hardware timing model?

Balance time/spaceFunctional

Model

(ISA + peripherals)

TimingModel

(Micro-architecture)

InstructionsArchitectural registers

Peripheral functionality…..

CachesArbitrationPipeliningAssociativity….

Inst trace

Page 90: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 90

Test of size

Partition on ISA/Timing Proven Partitioning

Asim, Simplescalar, Timing-First, FastSim, etc.

Simplifies simulator. Promotes reuse

Same performance in software Asim at 10KHz

Most of the time spent in timing model!

Hardware???FunctionalModel

(ISA)

TimingModel

(Micro-architecture)

InstructionsArchitectural registers

Peripheral functionality…..

FetchDecodeRenameReservation stationsScheduling windowReorder buffer….

Inst trace

Page 91: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 91

Test of size

FAST

FunctionalModel

(ISA)

TimingModel

(Micro-architecture)

InstructionsArchitectural registers

Peripheral functionality…..

FetchDecodeRenameReservation stationsScheduling windowReorder buffer….

Inst trace

FPGA

Page 92: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 92

Test of size

What is in a Trace? Conceptually, everything a functional model can

produce Flattened opcode, virtual/physical address of instruction,

virtual/physical address of data, source registers, destination registers, condition code source and destination registers, exceptions, etc.

Can be heavily compressed Eg., simulator TLB to avoid physical address

Page 93: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 93

Test of size

What is a FAST Timing Model?

TraceTrace

0x2

addrinst

InstructionMemory

Add

rd1

GPR File

rr1rr2

wrwd rd2

we

Immed.Extend

M

0

2

raddr

waddr

wdata

rdata

re

Data Memory

ALU

algn

1

3

wePCA

B

MD1

Y

MD2

IR

IR IR IR

R

Bypass/interlock I1

I2

Stats gathering in hardware => no performance impact

Page 94: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 94

Test of size

More Complexity

Caches/TLBs? Keep tags, pass address (virtual and

physical if necessary) Hits, misses determined but don’t need

data Superscalar (multiple issue)?

“Fetch and issue” multiple instructions assuming they meet boundary constraints

Multiple “functional units” Schedulers Reorder buffer/instruction window Pipeline control along with instructions

NO DATAPATH only part of control path

Page 95: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 95

Test of size

A Timing Model

iTLB iCache

dTLB dCache

Align & Pick

Decode Decode Decode

Sched Sched Sched Sched

L2 Cache

FunctionalModel

Memory &I/O timingmodels

Page 96: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 96

Test of size

Driving a Timing Model

iTLB iCache

dTLB dCache

Align & Pick

Decode Decode Decode

Sched Sched Sched Sched

L2 Cache

Memory &I/O timingmodels

FunctionalModel

Can Easily Execute in Parallel in Hardware

Page 97: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 97

Test of size

Trace-Driven Simulation? What we’ve described is a trace-driven simulator Accurate if “functional path” == “target path”

“Ideal” processor

Unfortunately, functional path is not always equal to timing path Real processors speculatively execute, assuming their target

path is the functional path

Page 98: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 98

Test of size

Branch Prediction

iTLB iCache

dTLB dCache

Align & Pick

Decode Decode Decode

Sched Sched Sched Sched

L2 Cache

FunctionalModel

Memory &I/O timingmodels

Wrong-path instructions! Implement BP in timing model Timing model forces ISA simulator to mis-

speculate Rollback, restore

Branch predictor predictor in ISA simulator?

BP only works in processor if it’s fairly accurate

FAST simulators take advantage of the fact that most of the time micro-architecture is on the right path Most complexity (BP, parallelism) can be

handled this way

Page 99: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 99

Test of size

Speculative Simulation FAST simulators themselves are speculative

Speculate functional path == target path Timing model detects functional path != target path

Forces functional model down wrong path OR Returns functional model to functional path

Speculation reduces need for round-trip communication Makes FAST latency tolerant (good for parallelization)

Functional model reusable Timing model determines target behavior

Page 100: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 100

Test of size

A Parallel Simulator Functional runs in parallel with timing Functional model can run in parallel

OoO superscalar processor?

Timing model runs in parallel in hardware Hard to not run in parallel Every register-to-register transition could potentially be

parallelized

Page 101: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 101

Test of size

Outline Motivation

How to Parallelize?

Functional Models, Timing Models and Tools

Status, Related Work, Conclusions

Page 102: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 102

Test of size

Functional Models Hardware functional models

Difficult to make complete (use Protoflex methods?) Size

Software functional models exist today Fast Full system Very usable Bochs, QEMU, Simics, SimNow, SimOS, etc. Runs on fastest hardware we know about to execute an ISA

Can we use them? What modifications?

Page 103: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 103

Test of size

Functional Model Modifications Need to

Rollback, force branch Rollback, restore and continue

How? set_pc(inst_num, pc)

Set a particular dynamic instance of an instruction to a particular instruction pointed to by PC

Sufficient Can be implemented with checkpoints

ISA state, memory, peripherals

BR

BR

BR

BRBR

Page 104: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 104

Test of size

Our Functional Model Derived from QEMU

Fast Boots Linux, Windows x86, x86-64, PowerPC, Sparc, ARM, MIPS, …

We support x86 for now Added tracing, checkpoint, rollback Runs on x86 hosts as well as embedded PowerPC host inside

FPGA PowerPC in the works

Dam Sunwoo

Page 105: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 105

Test of size

Modular Timing Models Modules

Models timing functionality Rename, caches, etc.

Connectors Attach modules together Abstract timing from modules

Throughput (input, output), delay, maxTransactions

Stats and tracing

Page 106: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 106

Test of size

Modules Modules predict what happens each cycle

Which instruction is scheduled? Modules hierarchically constructed

CAM, FIFOs, arbiters, etc. Branch predictors, Caches, TLBs, Schedulers, ALUs Fetch, Decode, Rename, RS, ROB

Many are essentially wires (e.g., ALU) Often written to execute one operation (simplifies)

E.g., Rename, Cache Executed multiple times per target cycle for wider processor, higher

associativity Simplifies implementation, tradeoff time for space

Joonsoo Kim, Nikhil Patil

Page 107: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 107

Test of size

Connector Interface Commit

Indicates module is done with target cycle

Done Indicates Connector is done

with target cycle Enq, First, Deq

Just like FIFO

EnqCommit

Done

DeqCommit First Done

Bill Reinhart

Page 108: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 108

Test of size

Current Timing Model

Fetch

BRU

R/SRename/

ROBD

ALU

LdSt Queue

iL1 dL1L2

ResultBus

TLB

MEM

BP

25 25

8

8 88

11

1 1

1 1

1 1

1

1 1

1 1

Page 109: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 109

Test of size

FAST Tool Flow

Functional modelspecification

Timing Model(FPGA or software)

Instruction decode Functional Model

(software and/or FPGA)

Functional model

Timing modelSpecification

(modules + connector)

Timing model (RTL/C)

Timing Model Compiler(Bluespec)

Functional model compiler (gcc)

RTL-to-Timing Model

RTL

MicrocodeGenerator

Will be using Intel AWB for stats, etc.

Page 110: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 110

Test of size

Outline Motivation

How to Parallelize?

Functional Models and Timing Models

Status, Related Work, Conclusions

Page 111: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 111

Test of size

Execution Platforms DRC Computer

Opteron + FPGA on HT Xilinx development boards

XUP, ML-310 (embedded PowerPC)

Research Accelerator for MP (RAMP) A shared infrastructure for MP development Uses BEE2/BEE3 FPGA boards

4/5 large FPGAs and fast off-board I/O Large collaboration between Berkeley, CMU (Hoe), MIT, Stanford,

Texas, Washington and Intel.

Page 112: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 112

Test of size

Simulator Performance on DRC

0

0.5

1

1.5

2

2.5

3

3.5

MIP

S

gshare

BP 97%

BP 100%

Includes Operating System CodeIncludes Operating System Code

Page 113: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 113

Test of size

Comparison to Related Work

1

10

100

1000

Intel Asim AMD GEMS Freescale IBM Mambo PTLSim sim-outorder FAST

FA

ST

Sp

eed

up

Page 114: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 114

Test of size

Acknowledgements Students

Dam Sunwoo (FM) Joonsoo Kim (TM, Interface) Nikhil Patil (TM, tools) Bill Reinhart (TM connector) Eric Johnson (verification, Linux)

Funding DOE Early Career, NSF, SRC Intel, IBM, Xilinx, Freescale

Software Bluespec

Open-source full system simulators QEMU, Bochs

Page 115: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 115

Test of size

Conclusions

Fast: Expect 2MIPS-10MIPS soon

2-4 orders of magnitude slower than target Interactive

Timely: quick model building Accurate: produce cycle-accurate numbers Complete: runs Linux/Windows + apps now, compile apps on

standard machines Transparent:

Hardware stats provides full visibility, no performance hit Inexpensive:

$300/$600 XUP boards can fit many timing models $9K (academic) DRC is still cost effective per cycle

Flexible: tools

Page 116: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

116© Derek Chiou

Backup Slides

Page 117: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 117

Test of size

Connectors Abstract timing information from modules

Cannot do so perfectly (state within modules) Characteristics

Throughput (input and output), minimum delay, maximum outstanding Entries are equal and fully shiftable

If you have distinct lanes, need separate connectors Provide stats gathering, tracing

512 entry trigger-based trace Stats funneled BACK through connector

helps with place-and-route (Nikhil Patil)

Bill Reinhart

Page 118: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 118

Test of size

Example Connector Interface Code (in ALU Module)module mkALU#(ConsumerPort#(Tuple2#(Maybe#(PReg_t), ROBTag_t)) inQ,

ProducerPort#(Tuple2#(Maybe#(PReg_t), Execute2ROB_t)) outQ)(ALU);

rule pass;inQ.deq;match { .dest, .robtag } = inQ.first;outQ.enq(tuple2(dest, Execute2ROB_t{robtag:robtag, exception: Invalid}));

endrule

rule done (inQ.done || outQ.done);inQ.commit(True);outQ.commit(True);

endrule

endmodule

ALU latency defined by Connector min-delay!

Page 119: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 119

Test of size

FAST Tool Flow

Functional modelspecification

Timing Model(FPGA or software)

Instruction decode Functional Model

(software and/or FPGA)

Functional model

Timing modelSpecification

(modules + connector)

Timing model (RTL/C)

Timing Model Compiler(Bluespec)

Functional model compiler (gcc)

RTL-to-Timing Model

RTL

MicrocodeGenerator

Will be using AWB for stats, etc.

Page 120: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 120

Test of size

Microcode Compiler Uses the LLVM Compiler Infrastructure developed at UIUC (

http://www.llvm.org) Compile the Bochs CPU (Bochs is another portable x86 full-

system emulator.) Backend retargetted to the micro-op "ISA“

Intention is to automate generation of new ISA instructions, automatically retarget new micro-architectures

Generates microcode that “runs” on the timing model

Nikhil Patil

Page 121: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 121

Test of size

Example Microcode Compile"ADD r/m32, r32" instruction.  void BX_CPU_C::ADD_EdGd(bxInstruction_c *i)   {

Bit32u op2_32, op1_32, sum_32;   op2_32 = BX_READ_32BIT_REG(i->nnn());   if (i->modC0()) {   op1_32 = BX_READ_32BIT_REG(i->rm());  sum_32 = op1_32 + op2_32;  BX_WRITE_32BIT_REGZ(i->rm(), sum_32);}else {  read_RMW_virtual_dword(i->seg(),

RMAddr(i), &op1_32);  sum_32 = op1_32 + op2_32;  write_RMW_virtual_dword(sum_32)  }  SET_FLAGS_OSZAPC_32(op1_32, op2_32, sum_32, BX_INSTR_ADD32)

}

When ModRM != 0xC0:

        %u0 = ADDRGEN        LOADd %v0:[%u0] -> %u1        cc,%u1 = add %u1, %nnn         STOREd %v0:[%u0] <- %u1

When ModRM == 0xC0:

        cc,%rm = add %rm, %nnn

Page 122: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 122

Test of size

RTL to Timing Model

TraceTrace

0x2

addrinst

InstructionMemory

Add

rd1

GPR File

rr1rr2

wrwd rd2

we

Immed.Extend

M

0

2

raddr

waddr

wdata

rdata

re

Data Memory

ALU

algn

1

3

wePCA

B

MD1

Y

MD2

IR

IR IR IR

R

Bypass/interlock I1

I2

Timing model perfectly models RTLVerification??? But, do you really want to do it this way?

Page 123: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 123

Test of size

Traces to Timing Model Interface provides ability for functional model to pass

instruction trace to timing model Storage that timing model components can read

Compression TLB to eliminate physical address Cache static information (src/dest registers, opcode, etc.)

Joonsoo Kim

Page 124: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 124

Test of size

Adding Modules Easy to add modules to improve accuracy

No functionality, only timing

E.g., DRAM model Model banks, row register, control timing, refresh Results in some requests taking longer If modeling memory controller, reorder memory operations as

well

Page 125: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 125

Test of size

Long Term Plans/Impact CMP/SMP support Early power estimation (with Freescale) Simulator available for software development/tuning before

hardware True co-development Performance/power revs of software

Design derived from simulator Write cycle-accurate simulator, automatically generate design (at

RTL + libraries level)

Change the way computer systems are designed and evaluated

Page 126: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 126

Test of size

Demo Compile code, run on

simulator

BP Perfect, gshare See difference in simulator

performance #ALUs

1, 8

Decode

Fetch

R/S

ROB

Rename

L1$

Br ALU TLB Ld/st $

L2$BP

Page 127: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 127

Test of size

Better Partitioning? Traditional partitioning on module boundaries

Timing and functionality

Is there a better way? 64b adder

cannot be implemented as a single monolithic entity But, 64 1b adders very tractable

Can compose a 64b adder from 64 x 1b adders

Is there a better partitioning for simulation?

Page 128: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 128

Test of size

Current/Future Work Finish uni-processor version

Very close, running on one platform Debug rest of pipeline

Shared/coherent bus MP version

Porting QEMU to MP host MP timing model

Hardware functional model Tool chain

Page 129: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 129

Test of size

FPGA Resources for TM OOO, branch prediction, ROB, two ALUs, 1 branch unit, 1 load/store unit (32 entry load/store

queue), iTLB, dTLB 7221 slices (52% of a 2VP30) 9 block RAMs (6% of a 2VP30) NOTE: large structures have not yet been mapped to block RAMs (trace buffer, ROB)

Configurable cache model (old Verilog version) 32KB 4-way set associative cache with 16B cache-lines

165 slices (1% of a 2VP30) 17 block RAMs (12% of a 2VP30)

2MB 4-way set-associative cache with 64B cache-lines 140 slices (1% of a 2VP30) 40 block RAMs (29% of a 2VP30)

2VP30 is an old FPGA found on a $600 list-price XUP board Current FPGAs

10 times as much logic (330K logic cells (2VP30 is around 30K)) 3 to 4 times as much block RAM

Page 130: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 130

Test of size

Current Limitations No datapath

Data speculation requires datapath

Control path/data path crossover Informing loads, Query loads

Can support But, can require additional communication Quite possible

Page 131: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 131

Test of size

FAST Tool Flow

Functional modelspecification

Timing Model(FPGA or software)

Instruction decode Functional Model

(software and/or FPGA)

Functional model

Timing modelSpecification

(modules + glue)

Timing model (RTL/C)

Timing Model Compiler(Bluespec)

Functional model compiler (gcc)

RTL-to-Timing Model

RTL

DecodeGenerator

Page 132: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 132

Test of size

RTL to Timing Model

TraceTrace

0x2

addrinst

InstructionMemory

Add

rd1

GPR File

rr1rr2

wrwd rd2

we

Immed.Extend

M

0

2

raddr

waddr

wdata

rdata

re

Data Memory

ALU

algn

1

3

wePCA

B

MD1

Y

MD2

IR

IR IR IR

R

Bypass/interlock I1

I2

Timing model perfectly models RTLVerification???

Page 133: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 133

Test of size

Hardware Related Work FAST is first FPGA-based functional/timing model

partitioned simulator that we know of HASim (Emer et al)

FPGA-based processor emulators Academic: RAMP, MIT UMUM, Scale, Stanford FAST, ATLAS,

CMU, … Intel (Shih-Lien Lu) !Flexible (huge effort), !Accurate (old processors, not

modeling everything)

Page 134: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 134

Test of size

Software Related Work FastSim: Schnarr, Larus (1998)

Direct execution with rollback, memoization for speed Simplescalar accuracy

Emer, et al: Asim Intel cycle-accurate, about 10KHz

PTLSim: Yourst Full system x86, about 200KHz Seems similar to Simplescalar in terms of accuracy Uses Xen to fast forward

Page 135: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 135

Test of size

Simulator Users and Tradeoffs

Architects: e.g., Matt Performance/Power/Reliability Timely, ~Accurate, Flexible !Fast, !Complete

Software: e.g., Wen-Mei Development, tuning Fast, OS !Accurate, !Flexible, !Timely

Implementation (RTL) Correctness Accurate, ~Timely !Fast, !Complete

Page 136: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 136

Test of sizeStolen from http://research.microsoft.com/si/PPT/HardwareModelingInfrastructure.pdf

Impossible????

Page 137: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 137

Test of size

Microprocessors One instruction executes to completion before next

starts Implementation is different leading to different

performance

Page 138: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 138

Test of size

Microprocessors

In-OrderFront

Out-of-OrderMiddle

In-OrderEnd

Page 139: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 139

Test of size

Current Software Simulators

Performance-Detail tug-of-war Higher performance means less

simulation tasks Higher detail means more

simulation tasks

Simulator Slowdown Sim 2 min

ISA (SimNow, Simics) 1-100 10 min

Caches 10-1000 10 hrs

OOO (ASim ~1-10KIPS) 10K-1M 1 year

RTL 100M-1B 2000 years

CMP? ??? ???

Page 140: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 140

Test of size

Why Build? (Anant) Software won’t work unless you are building hardware Motivation for software tools Large data sets Hard problems better understood and show up once

you became building Have to solve hard problems More radical the idea, more important it is to build

World only trusts end-to-end results

Cycle simulator only becomes accurate after hardware gets precisely defined

Needed for commercialization

Page 141: © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe,

1/17/08 RAMP Retreat 141

Test of size

Functional Models Hardware functional models

Difficult to make complete (use Protoflex methods?) Size

Software functional models exist today Fast Full system Very usable Bochs, QEMU, Simics, SimNow, SimOS, etc. Runs on fastest hardware we know about to execute an ISA

Can we use them? What modifications?