© Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil...

1© Derek Chiou

Functional/Timing Split in UT FAST

Derek Chiou,

Dam Sunwoo, Joonsoo Kim,

Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe, Hari Angepat

University of Texas at Austin

Electrical and Computer Engineering

1/17/08 RAMP Retreat 2

Test of size

First, Some Terminology

Host: the system on which a simulator runs Dell 390 with a single 1.8GHz Core 2 Duo and 4GB of RAM A Xilinx FPGA board

Target: the system being modeled Alpha 21264 processor Dell 390 with a single 1.8GHz Core 2 Duo and 4GB of RAM

Host

Simulator

Target

Your desktop

Simplescalar (sim-alpha)

Alpha 21264


Test of size

FAST Goals RTL-level cycle-accuracy Complex ISA capable (x86, PowerPC) Complex micro-architecture capable (Intel Core 2) Off-the-shelf OS, apps (Windows, MS Word, Linux)

Lesson I learned from my grad school career Can run other stuff too of course

MP-capable (scale with FPGA resources) Fast (10MIPS range) (Relatively) easy to implement, modify, extend


Test of size

FAST Prototype in Real Time


Test of size

FAST: Speculative Functional/Timing Partitioning Proven partitioning (FastSim)

FM executes instructions to completion, pushes inst trace to TM

FM insts used as TM fetched insts If functional insts != timing insts, TM

forces FM to rollback Eg., branch mis-speculation, resolve,

memory ordering Clean inst trace/rollback interface

Optimize the common case! FM runs independently from TM when

functional insts == timing insts! Easy to parallelize

Better target uArch simulates faster

Factorized, not partitioned! (FM + TM) < (monolithic simulator) FM fairly simple, only functionality TM fairly simple, only timing

FunctionalModel

(ISA + peripherals)

TimingModel

(Micro-architecture)

InstructionsArchitectural registers

Peripheral functionality…..

CachesArbitrationPipeliningAssociativity….

Inst trace


Test of size

High Level FAST Architecture:A Parallelized Simulator

Parallelized between FM & TM Parallel target would have a parallelized FM, each target core

conceptually running on a separate host core Parallelized TM

Parallelizes nicely in hardware (FPGA) Latency tolerant, infrequent round-trips Stats can be done in hardware! (no performance impact)

trace

FunctionalModel

(ISA + peripherals)

TimingModel


Host HostFPGA


Test of size

What Is A FAST Functional Model? Requirements

Fast, Full System, generate instruction trace, support rollback

Hardware functional models Fast, but FPGA implementation difficult to make complete

x86, boots Windows? Simple, resource efficient FM sufficient (but need trace, rollback)

Software functional models Bochs, QEMU, Simics, SimNow, SimOS, etc.

Run on fastest hardware we know about to execute an ISA Full system


Test of size

What is a FAST Timing Model?

TraceTrace

0x2

addrinst

InstructionMemory

Add

rd1

GPR File

rr1rr2

wrwd rd2

we

Immed.Extend

M

0

2

raddr

waddr

wdata

rdata

re

Data Memory

ALU

algn

1

3

wePCA

B

MD1

Y

MD2

IR

IR IR IR

R

Bypass/interlock I1

I2


Test of size

Modular Timing Models:Modules + Connectors

Modules model timing functionality E.g., rename, caches, etc. Built hierarchically for extensibility

CAM, FIFOs, arbiters, etc. Branch predictors, Caches, TLBs,

Schedulers, ALUs Fetch, Decode, Rename, RS, ROB

Many are essentially wires (e.g., ALU) Often written to execute one operation

E.g., Rename, Cache Executed multiple times per target

cycle for wider processor, higher associativity

Simplifies implementation, tradeoff time for space

Connectors connect modules Abstract timing from modules

Throughput (input, output), delay, maxTransactions

Stats and tracing

Bill Reinhart


Test of size

Microcode Compiler Intention

Automate generation of new ISA instructions Automatically retarget new micro-architectures Necessary for x86

Uses the LLVM Compiler Infrastructure developed at UIUC http://www.llvm.org

Compile the Bochs CPU model Bochs is another portable x86 full-system emulator. Backend retargeted to micro-op ISA

Generates microcode that “runs” on the timing model

Over 99% dynamic inst coverage for most INT benchmarks floating point instructions not yet supported

Average 1.27 uOps per handled dynamic x86 instruction Microcode ISA is simple load/store with some x86 specialization

Build a functional model as a processor executing microcode ISA?

Nikhil Patil


Test of size

Prototype Overview

Software functional model Eventually hardware functional model, but software sim exists

FPGA-based timing model written in Bluespec Complex OoO micro-architecture fits in a single FPGA

DRC or XUP

trace

FunctionalModel

Software

TimingModel

Bluespec HDL

Processor FPGADRCComputer

HT Xilinx FPGAPowerPC 405


Test of size

Current Prototype Functional Model Derived from QEMU

Fast (JIT), boots Linux, Windows Supports x86, x86-64, PowerPC, Sparc, ARM, MIPS, …

Prototype QEMU currently supports x86 Added tracing, rollback (implemented with checkpoint)

Including I/O (keyboard, mouse, video) Hosts

x86 machines PowerPC inside of an FPGA

Dam Sunwoo, Jeb Keefe


Test of size

Current Prototype Timing Model:OOO Superscalar

Fetch

BRU

R/SRename/

ROBD

ALU

LdSt Queue

iL1 dL1L2

ResultBus

TLB

MEM

BP

25 25

8

8 88

11

1 1

1 1

1 1

1

1 1

1 1

Joonsoo Kim, Nikhil Patil, Bill Reinhart, Eric Johnson


Test of size

Current Simulator Performance on DRC

0

0.5

1

1.5

2

2.5

3

3.5

MIP

S

gshare

BP 97%

BP 100%

Includes Operating System CodeIncludes Operating System Code


Test of size

FAST Future Work Improve performance (currently TM bottlenecked)

TM (5MIPS-20MIPS) FM (10MIPS-100MIPS)

Optimize simulator (10MIPS-20MIPS) Hardware FM (FPGA ~100MIPS)

Simple pipe running microcode ISA Multithreaded (Protoflex) to improve throughput

Real processor??? (debug for trace, transactional for rollback?) FAST-MP

Need MP host PowerPC ISA support (this month) Power estimation capabilities in FPGAs

Don’t slow down FAST performance


Test of size

FAST and RAMP-Classic: A Comparison

FAST RAMPCores Arbitrary, complex cores from

simple core+rollback+TMFull RTL, fits in FPGA

ISAs/OSs Arbitrary ISA/OS

(x86/Windows, PowerPC, …)

Depends on core

(PowerPC, Sparc)

Accuracy RTL cycle-accurate

Host cycles >= target cycles enabling reuse of hardware resources

Only accurate target RTL available (unless timing model used)

Host cycle == target cycle

Speed Complexity/Resource tradeoff As fast as the FPGA will run

Scalability Depends on FPGA resources (TM costs)

Depends on FPGA resources


Test of size

FAST-MP: RAMP as a functional model How to do FAST-MP? Need multicore host! RAMP will not be able to accurately model Intel Core X

micro-architecture in RAMP Unless it becomes much simpler in-order pipeline

even then, will Intel give us RTL? But, RAMP processor can execute ISA

Target ISA == Host ISA Add trace, rollback capabilities

Target ISA != Host ISA Simulate target ISA Hardware support for target ISA, trace, rollback?

RAMP becomes a FAST functional model Use timing model to predict arbitrary behavior


Test of size

FAST & RAMP-White Shared Infrastructure FAST connectors as White connectors

Stats, timing FAST modules as White modules

Quad ported RAM CAM/Cache on top of quad-ported RAM Multi-host cycle caches Branch predictors Load/store units, bus interface units Interconnection networks

Eventually, full processors? Executing microcode, software cracking?

FAST TM as RAMP TM At that point, RAMP == FAST-MP


Test of size

Conclusions Split functional/timing can be cycle-accurate

Roll back (FAST) or Timing-directed (timing-first?) (HASim)

We believe that roll back can be done relatively cheaply Can be done in software at about order of magnitude impact Hardware support appears to be reasonable

Log old values, playback (easy for simple core) Transactional processor?

Simple cores + roll back as host for functional model? Virtualization for more threads is orthogonal, but

multiplicative in resources


Test of size

Another View of FAST Old processor/system modeling new processor/system

Leverage the fact that most of the time, functionality and order is identical

Differences in timing between old and new modeled in timing model

Roll back/re-execute to deal with differences Use current generation host as functional model for

next generation target? Trace implemented by debug support Roll back implemented with transactional support


Test of size

Host == Target Modules in FAST FAST can leverage modules that are full and accurate

implementations of modules

Host TLB == target TLB

Memory requests issued when timing model


Test of size

FAST-MP: Mixing

P P

IU IU


Test of size

What Is A FAST Functional Model? Requirements

Fast Full System Generates instruction trace Supports rollback

Hardware functional models (very fast) Real processor doesn’t support trace/rollback FPGA implementation difficult to make complete

x86, boots Windows? Software functional models exist today

Bochs, QEMU, Simics, SimNow, SimOS, etc. Relatively fast, full system

Run on fastest hardware we know about to execute an ISA Can be modified to generate trace/support rollback


Test of size

trace

Step 1: Improving Performance via Parallelization

Parallel slowdown due to communication? FM runs ahead, speculatively, round-trip communication infrequent Round-trip communication only when (functional path != timing path)

Microprocessors have same problem Multiple issue, deep pipelines only work if predicted path is correct

FM like perfect front end of processor, real uArch (TM) slows it down The better the target micro-architecture, the faster the simulator

FunctionalModel

(ISA + peripherals)

TimingModel


HostHost Host


Test of size

Current Prototype Functional Model Derived from QEMU

Fast (JIT), boots Linux, Windows Supports x86, x86-64, PowerPC, Sparc, ARM, MIPS, …

Prototype currently supports x86 Added tracing, rollback (implemented with checkpoint)

Including I/O (keyboard, mouse, video) Hosts

x86 machines PowerPC inside of an FPGA

PowerPC target by January about 1 month to port

Dam Sunwoo, Jeb Keefe


Test of size

Current Prototype Timing Model

Fetch

BRU

R/SRename/

ROBD

ALU

LdSt Queue

iL1 dL1L2

ResultBus

TLB

MEM

BP

25 25

8

8 88

11

1 1

1 1

1 1

1

1 1

1 1

Joonsoo Kim, Nikhil Patil, Bill Reinhart, Eric Johnson


Test of size

Microcode Compiler Intention

Automate generation of new ISA instructions Automatically retarget new micro-architectures Necessary for x86

Uses the LLVM Compiler Infrastructure developed at UIUC http://www.llvm.org

Compile the Bochs CPU model Bochs is another portable x86 full-system emulator. Backend retargeted to micro-op ISA


Over 99% dynamic inst coverage for most INT benchmarks floating point instructions not yet supported

Average 1.27 uOps per handled dynamic x86 instruction

Nikhil Patil


Test of size

Current Simulator Performance on DRC

0

0.5

1

1.5

2

2.5

3

3.5

MIP

S

gshare

BP 97%

BP 100%



Test of size

Performance Details Timing model is current bottleneck

100MHz host cycle (not pushing timing) Currently taking ~30 host (FPGA) cycles per target cycle, max about 54

cycles (currently max latency defines target clock) BP is a simple gshare predictor

Functional model Unoptimized modified QEMU With perfect BP, immediate return from TM, 5.4MIPS

FM/TM communication 469ns blocking read from Opteron on DRC (has gotten better)

Poll every other basic block 13ns/word for burst write


Test of size

Some Related Work (there is a lot) Software

Functional/timing partitioned Asim, current M5, Timing-First, Opal all timing model driven

Timing model tells functional model what to do and when to do it

FastSim (Schnarr, et al, ASPLOS 98) Functional/timing, rollback when functional path != timing path But, instrumented binaries, not parallelized, no hardware

Hardware HASim: Hardware ASim (Emer, et. al)

Timing-first Seven points of communication between FM & TM

Requires infinitely renamed out-of-order FM Current supports a simplified MIPS ISA


Test of size

Conclusions/Future Work It works Current FAST simulator prototype

1.2MIPS (unoptimized), about 1000 times slower than target Timely: during architecture phase Complete: runs Windows, Linux Transparent: extensive, hardware-based stats Relatively inexpensive, easy to build and extend

(Some) future work Optimize

5MIPS soon, 10MIPS-20MIPS later (hardware FM using uCode?) More realistic timing model & calibration Tattler: automatic bottleneck detection CMP/SMP targets


Test of size

FAST Relative Speeds (Current Prototype)

1

10

100

1000

FA

ST

Rela

tive S

peed


Test of size

X86 Micro-Op Coverage

0

0.2

0.4

0.6

0.8

1

1.2

Dyn

am

ic In

str

ucti

on

Co

vera

ge


Test of size

Number of uOps/x86 Instruction

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

uO

ps

/x8

6


Test of size

Outline What is FAST (1 minute)

Targeted properties Demo (30 sec)

Runs x86, Windows at speeds fast enough to interact How?

Start with partition Proven strategy (FastSim) Rollback to handle branch mis-speculation Simplifies full-system capabilities, since the complexity of the full-

system is encapsulated in the functional model Functional model can be full-system simulator or processor

FM passes trace to TM TM is simple

can model complex structures fairly low weight Improve performance

Parallelize on FM/TM boundary Round-trip communication infrequent

Permit FM to run ahead speculatively Doesn’t mis-speculation slow things down? Learn from computer architecture (microcoded,

pipelined, OoO) Speculation permits computer system to run

faster Key is not to mis-speculate often Parallelize on functional/timing boundary Handles branch prediction, anything where

functional path is not equal to target path Rollback required from FM

Bottleneck is timing model Parallelize TM?

Difficult to do in software Practical limiations due to number of processors that

communicate quickly Hardware (FPGA) is ideal

Parallelizes nicely TM partition simplifies hardware Latency tolerant, infrequent round-trips

Prototype Overview of prototype

Software functional model running on processor host TM in Bluespec on FPGA Describe some details DRC or XUP

FM description TM description

Block diagram Microcode compiler Performance Relative performance Bottlenecks

Future work Conclusions


Test of size

Functional Model Modifications Need to

Rollback, force branch Rollback, restore and continue

How? set_pc(inst_num, pc)

Set a particular dynamic instance of an instruction to a particular instruction pointed to by PC

Sufficient Can be implemented with checkpoints

ISA state, memory, peripherals

BR

BR

BR

BRBR


Test of size


Test of size

Why


Test of size

Why Are Simulators Slow? Complexity

40 entry fully-assoc iTLB 40 entry fully-assoc dTLB 16-way L2 cache Many schedulers Multiple Decode Out-of-order issue I/O Multiple cores, processors

Modeling PARALLELISM

Can we parallelize simulator?


Test of size

Parallelizing Simulators Software

Starting from 10KHz sequential simulator, 10MHz performance requires 1K processors Impractical

If a software simulator could be parallelized, use same techniques to make the target faster?

Hardware It’s what processors are made from Plenty of parallelism Difficult to implement? FPGAs for configurability


Test of size

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

Modeling Complex Processors on FPGA(s) Compile RTL for FPGA

Pentium fits in large FPGA 3.1M transistors (Lu, Intel)

Issues Fit: Core 2 in a single FPGA?

Impossible: Processors grow as fast as FPGAs

Multiple FPGAs to model one processor

Full RTL required A lot of RTL Difficult to cover all cases Difficult to modify

FPGAPentiumPentium Core 2


Test of size

Strawman Solution Partition

Break problem down into multiple pieces Target module boundaries obeyed?

Replace some/all RTL with pre-written modules and/or easier to write (behavioral/software) code Hybrid simulation

Partitioned simulator, each partition running in potentially different host technology


Test of size

Simple Module-based Software/Hardware Partitioning

Partitioning on module boundaries over FPGAs and software Simplescalar + FPGA L1 DCache (SuhWARFP2006)

0x2

addrinst

Instruction$/Mem

Add

rd1

GPR File

rr1rr2

wrwd rd2

we

Immed.Extend

M

0

2

raddr

waddr

wdata

rdata

re

Data $/Memory

ALU

algn

1

3

wePCA

B

MD1

Y

MD2

IR

IR IR IR

R

I1

I2

bypass

IS THERE A BETTER PARTITIONING?


Test of size

FAST

FunctionalModel

(ISA)

TimingModel




FetchDecodeRenameReservation stationsScheduling windowReorder buffer….

Inst trace

FPGA


Test of size

More Complexity

Caches/TLBs? Keep tags, pass address (virtual and

physical if necessary) Hits, misses determined but don’t need

data Superscalar (multiple issue)?

“Fetch and issue” multiple instructions assuming they meet boundary constraints

Multiple “functional units” Schedulers Reorder buffer/instruction window Pipeline control along with instructions

NO DATAPATH only part of control path


Test of size

A Timing Model

iTLB iCache

dTLB dCache

Align & Pick

Decode Decode Decode

Sched Sched Sched Sched

L2 Cache

FunctionalModel

Memory &I/O timingmodels

Can Easily Execute in Parallel in Hardware


Test of size

Trace-Driven Simulation? Described a trace-driven simulator Accurate if “functional path” == “target path”

“Ideal” processor

Unfortunately, functional path is not always equal to timing path Even for simple pipelines Real processors speculatively execute, assuming their target

path is the functional path


Test of size

Branch Prediction

iTLB iCache

dTLB dCache

Align & Pick



L2 Cache

FunctionalModel


Wrong-path instructions! Implement BP in timing model Timing model forces ISA simulator to mis-

speculate Rollback, restore

Branch predictor predictor in ISA simulator?

BP only works in processor if it’s fairly accurate

FAST simulators take advantage of the fact that most of the time micro-architecture is on the right path Most complexity (BP, parallelism) can be

handled this way


Test of size

Speculative Simulation FAST simulators themselves are speculative

Speculate functional path == target path Timing model detects functional path != target path

Forces functional model down wrong path OR Returns functional model to functional path

Speculation reduces need for round-trip communication Makes FAST latency tolerant (good for parallelization)

Functional model reusable Timing model determines target behavior


Test of size

FAST Tool Flow

Functional modelspecification

Timing Model(FPGA or software)

Instruction decode Functional Model

(software and/or FPGA)

Functional model

Timing modelSpecification

(modules + connector)

Timing model (RTL/C)

Timing Model Compiler(Bluespec)

Functional model compiler (gcc)

RTL-to-Timing Model

RTL

MicrocodeGenerator

Will be using Intel AWB for stats, etc.


Test of size

Performance Details Functional model

Unoptimized modified QEMU Driving an FPGA-based timing model

Standard producer/consumer parallel communication problems 469ns blocking round-trip from Opteron to FPGA Poll every other basic block

BP is a simple gshare predictor Timing model (unoptimized)

100MHz host cycle right now (not pushing timing) Currently taking ~30 host cycles per target cycle, max about 54 cycles

(currently max latency defines target clock) Complex timing model fits into single FPGA


Test of size

Outline Motivation

FAST: Parallelized complex core, full-system simulator

RAMP-White: Parallel hosts running parallel targets


Test of size

RAMP-White Requirements Coherent shared memory experimental platform

Configurable coherence protocol, engine

Scalable to the same level as other RAMP machines

1K eventual target Down to 2

Full system (OS, I/O, etc.) Intentions

ISA/Architecture independent (like all RAMP efforts) Use different cores

Integrate components from other RAMP participants A test-bed for sharing IP


Test of size

Texas Modifications to RAMP-White New code in Bluespec rather than Verilog/VHDL

Many advantages including interfaces, configurability My group’s hardware development is exclusively Bluespec Free/low cost to academics (www.bluespec.com)

Start with XUP board We had XUP before BEE2

Support Leon3 and PowerPC Started with embedded PowerPC but Linux, non-MP core issues Restarted with Leon3

Linux, MP-capable core Get RAMP-White infrastructure to work

Plan to port back to embedded PowerPC, easy to move to soft PowerPC


Test of size

High-Level Architecture Philosophy Flexibility

Avoid wasted work Easy changes Module-agnostic

Processors, network, I/O, etc.

Interfaces Complete set of necessary interfaces All communication via messages

Fixed fields, but fields are configurable “shims” connect components to White infrastructure

Use existing IP


Test of size

RAMP-White Block DiagramRAMP-White Block Diagram

Network Router

Intersection Unit (IU)

Memory Controller

(MC)

IO & Platform

Devices

Processor

Network Interface

(NIU)

Coherent $

Proc dependent


Test of size

Three Phase Approach to Hardware H1: Incoherent shared memory

No hardware global cache, just global shared memory support Optional cache for local memory

However, software can maintain coherence if necessary Network virtual memory Run a simulator on top of the processor

Ring network H2: Ring-based coherence (scalable bus)

Requires a coherent cache, IU awareness Running what is essentially a snoopy protocol

True coherence engine not required But, very restricted communication

Sufficient for testing, modeling many targets H3: General network-based coherence

Requires general coherence engine, general network

H4: Different cores

IUP $$

MC I/O

IUP $$

MC I/O

C $IU

P $$

MC I/O

IU

P $$

MC I/O

C $

C $IU

P $$

MC I/O

IU

P $$

MC I/O

C $


Test of size

Operating System Issues with SMP OS on embedded PowerPC

Incoherent cache Load-reservation/store-conditional instructions not MP capable Also missing TLB Invalidation & OpenPIC (interprocessor interrupts,

bring-up) How scalable anyways? (1K processors)

Four phase approach using RAMP-White hardware O1: Separate OS per core (PowerPC) working for XUP

Region of memory is global (mmap) Locks using regular loads/stores + sequential consistency

O2: SMP OS on Leon3 Use RAMP-White scalable hardware across multiple FPGAs Snoopy cache

O3: SMP OS on Leon3 using directory cache O4: SMP OS on PowerPC port


Test of size

Programmer View Sequential consistency

PowerPC Global addresses labeled as uncached

Ordered accesses from PowerPC 405 Coherent global cache still uncached from processor

Soft cores can be weaker

User interface Terminal per core/OS if desired Mmap to map shared memory


Test of size

H1/O1 RAMP-White Hari Angepat did the work

Components Written in Bluespec NIU code complete and tested

2 processor ring (PowerPC) IU code complete and tested

Processor Slave (no coherence right now) PLB Master/slave interface (I/O) NIU interface

Hardware intended to target different ISAs PLB master and slave shims written

Some preliminary OS work Multi-image mmap interface running


Test of size

Current RAMP-White Phase 1


IO & Platform

Devices

PPC 405/Leon3

Network Interface

(NIU)

Memory Controller

(MC)

PLB shim


PPC 405/Leon3

Network Interface

(NIU)

Linux Linux


Test of size

Our Long Term Plans H1/O1, XUP working June 2007

PowerPC With multi-OS, limited device support

H2/O2 BEE2, Leon3 Coherent cache, IU forwarding modifications SMP

H3/O3 BEE2/BEE3, Leon3 Arbitrary network, cache coherency engine

Getting network from Washington, Berkeley H4/O4 BEE2/BEE3, PowerPC

Port back to PowerPC


Test of size

RAMP Conclusions RAMP-White architecture

Phased approach minimizes wasted work Designed to be easy to modify for your purpose

Many architectures only require modified coherence engine, maybe cache

ISA/implementation agnostic Care taken to not be specific

RAMP-White Phase 1 works Running on XUP

Phase 2 close to working Running on BEE2


Test of size

Future Work: FAST + RAMP-White FAST is currently not scalable

Need parallel functional models and parallel timing models

Parallel functional model runs on parallel host Maintain speed

Intention is to run FAST on top of RAMP-White Modify soft core to support checkpoint, rollback in hardware

FAST provides cycle-accurate of complex cores RAMP provides high performance, scalable host FAST+RAMP provides scalable complex core simulator


Test of size

Acknowledgements Students

Dam Sunwoo (FM) Joonsoo Kim (TM, Interface) Nikhil Patil (TM, tools) Bill Reinhart (TM connector) Hari Angepat (RAMP-White) Eric Johnson (verification, Linux)

Funding DOE Early Career, NSF, SRC Intel, IBM, Xilinx, Freescale

Software Bluespec

Open-source full system simulators QEMU, Bochs

66© Derek Chiou

Extra slides


Test of size

What is in a Trace? Conceptually, everything a functional model can

produce Flattened opcode, virtual/physical address of instruction,

virtual/physical address of data, source registers, destination registers, condition code source and destination registers, exceptions, etc.

Can be heavily compressed Eg., simulator TLB to avoid physical address


Test of size

Modules Modules predict what happens each cycle

Which instruction is scheduled? Modules hierarchically constructed

CAM, FIFOs, arbiters, etc. Branch predictors, Caches, TLBs, Schedulers, ALUs Fetch, Decode, Rename, RS, ROB

Many are essentially wires (e.g., ALU) Often written to execute one operation (simplifies)

E.g., Rename, Cache Executed multiple times per target cycle for wider processor, higher

associativity Simplifies implementation, tradeoff time for space

Joonsoo Kim, Nikhil Patil


Test of size

Connector Interface Commit

Indicates module is done with target cycle

Done Indicates Connector is done

with target cycle Enq, First, Deq

Just like FIFO

EnqCommit

Done

DeqCommit First Done

Bill Reinhart


Test of size

32b Address in Shared Memory Machine?? 4GB possible per BEE2 FPGA

Need more than 32b

Eventually, hope for 64b soft-core processors

For now two options: live with 4GB space Or, provide one more layer of translation

Physical address in certain region is global virtual address Translated by hardware to node + physical address

Also useful for multiple OSs in single memory OSs tend to assume they own physical address 0


Test of size

Node Architecture

IU

P

P $$

MC I/O

IUP $$

MC I/O

C $IU

P $$

MC I/O

IU

P $$

MC I/O

C $


Test of size

Generalized Architecture

Proc

IU NIUMC

$

Mem

OPBbridge

Intersection Unit Network Interface Unit

PLB

Proc dependent

Proc independent


Test of size

Intersection Unit Processor interface Slave Snoop

Network interface Master (send) Slave (receive)

Memory interface Master (issue memory requests)

Hooks for coherency engine Bluespec nice to specify coherence

engine Incoherent version is a special case

Programmable memory regions Global (local and remote) Local translation


Memory Controller

(MC)

IO &

Platform Devices

Processor

Network Interface

(NIU)

Coherent $


Test of size

Network Interface Unit Currently two virtual channels

Split into two components Msg composition/Queuing Net transmit/receive

Insert/extract for ring Intended to permit other net-

specific transmit/receive

One input/one output Creates a simple

unidirectional ring Can interface to more

advanced fabrics


Memory Controller

(MC)

IO &

Platform Devices

Processor

Network Interface

(NIU)

Coherent $


Test of size

Sharing IP: Some Preliminary Experience We looked at RAMP-Red XUP

Used some code (PLB master) Red-BEE is not ready to distribute

Looking for switch code Berkeley’s code on CVS repository

But, we can’t use memory controller because we don’t have BEE2 board yet Bluespec We are spinning almost all of our own code right now

Would like to steal software OS (kernel proxy) SMP OS port

Naming MPI reference design in BEE2 repository Is that RAMP-Blue?

A central CVS repository for RAMP code?


Test of size

Sharing Over the Long Term Processor is shared

Leon PowerPC MicroBlaze Everything else

MC is shared Xilinx or Berkeley

Coherent cache can be shared Transactional/traditional Borrow Stanford’s?

Coherency engine can be shared CMU/Stanford

IU functionality can be shared Trying to make ours general

NIU can be shared Borrow half from Berkeley?

Network can be shared Borrow Berkeley’s?

Proc

IU NIUMC

$

Mem

Peripherals

CCE


Test of size

IU Internal Message

Defaults PRI: High priority, Low priority CMD: Read, Write, Coherence, … PERM: Modified, Exclusive, Shared, Invalid SIZE: Byte, word, double word, cache-line GADDR: global address (translated by IU) DATA: dependent on size

Bluespec permits easy modification for your protocol

PRI CMD PERM SIZE TAG

GADDR

DATA


Test of size

Network Message

PRI: High and Low DEST,SRC: destination, source of message SIZE: Total message size NETTAG: network tag (optional) CMD: network command (optional) MESSAGE: data

PRI DEST SRC NETTAG CMD

MESSAGE

SIZE


Test of size

Intersection Unit InternalsIntersection Unit Internals

Intersection Unit ControllerMemory

Controller & DRAM

Controller BRAMs

Proc IO Net

Proc IO Net

Global Address Translation

hardware

80© Derek Chiou

Fast, Full-System, Cycle-Accurate Computer Simulators via Parallelization

Derek ChiouUniversity of Texas at Austin

Electrical and Computer Engineering


Test of size

My Ideal Computer Simulator

Fast: as fast as possible 2-3 orders of magnitude slower than target? Fast enough to run real datasets to completion Interactive?

Timely: enough time to make decisions Accurate: produce cycle-accurate numbers Complete: run unmodified operating systems,

applications,… Transparent: full visibility, no performance hit Inexpensive: need thousands Flexible: quick changes, generate from RTL


Test of size

Current Software Simulators

Performance-Detail tug-of-war Higher performance means less

simulation tasks Higher detail means more

simulation tasks

Simulator Slowdown Sim 2 min

ISA (SimNow, Simics) 1-100 10 min

Caches 10-1000 10 hrs

OOO (ASim ~1-10KIPS) 10K-1M 1 year

RTL 100M-1B 2000 years

CMP? ??? ???


Test of size

Why So Slow? Complexity

40 entry fully-assoc iTLB 40 entry fully-assoc dTLB 16-way L2 cache Many schedulers Decoder I/O

Modeling PARALLELISM Parallelize simulator?

Sampling, benchmarking are orthogonal


Test of size

Outline Motivation

How to Parallelize?

Functional Models and Timing Models

Status, Conclusions


Test of size

Parallelizing Simulators Software

Starting from 10KHz sequential simulator, 10MHz performance requires 1K processors Impractical

If a software simulator could be parallelized, use same techniques to make the target faster?

Hardware It’s what processors are made from Plenty of parallelism Difficult to implement? FPGAs for configurability


Test of size

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

FPGA FPGA

Modeling Complex Processors on FPGA(s) Compile RTL for FPGA

Pentium fits in large FPGA 3.1M transistors (Lu, Intel)

Issues Fit: Core 2 in a single FPGA?

Impossible: Processors grow as fast as FPGAs

Multiple FPGAs to model one processor

Full RTL required A lot of RTL Difficult to cover all cases Difficult to modify

FPGAPentiumPentium Core 2


Test of size

Strawman Solution Partition

Break problem down into multiple pieces Target module boundaries obeyed?

Replace some/all RTL with pre-written modules and/or easier to write (behavioral/software) code Hybrid simulation

Partitioned simulator, each partition running in potentially different host technology


Test of size

Simple Module-based Software/Hardware Partitioning

Partitioning on module boundaries over FPGAs and software Simplescalar + FPGA L1 DCache (SuhWARFP2006)

0x2

addrinst

Instruction$/Mem

Add

rd1

GPR File

rr1rr2

wrwd rd2

we

Immed.Extend

M

0

2

raddr

waddr

wdata

rdata

re

Data $/Memory

ALU

algn

1

3

wePCA

B

MD1

Y

MD2

IR

IR IR IR

R

I1

I2

bypass

IS THERE A BETTER PARTITIONING?


Test of size

Our Partitioning: Functionality/Timing Boundaries Proven Software Partitioning

Asim, FastSim, etc. Factorized, not partitioned!

Functional simulators exist Timing model becomes very simple

Promotes reuse Functional/Timing Simplifies timing model

Latency tolerant Separate FM & TM Software functional model? Hardware timing model?

Balance time/spaceFunctional

Model

(ISA + peripherals)

TimingModel




CachesArbitrationPipeliningAssociativity….

Inst trace


Test of size

Partition on ISA/Timing Proven Partitioning

Asim, Simplescalar, Timing-First, FastSim, etc.

Simplifies simulator. Promotes reuse

Same performance in software Asim at 10KHz

Most of the time spent in timing model!

Hardware???FunctionalModel

(ISA)

TimingModel





Inst trace


Test of size

FAST

FunctionalModel

(ISA)

TimingModel





Inst trace

FPGA


Test of size

What is in a Trace? Conceptually, everything a functional model can

produce Flattened opcode, virtual/physical address of instruction,

virtual/physical address of data, source registers, destination registers, condition code source and destination registers, exceptions, etc.

Can be heavily compressed Eg., simulator TLB to avoid physical address


Test of size

What is a FAST Timing Model?

TraceTrace

0x2

addrinst

InstructionMemory

Add

rd1

GPR File

rr1rr2

wrwd rd2

we

Immed.Extend

M

0

2

raddr

waddr

wdata

rdata

re

Data Memory

ALU

algn

1

3

wePCA

B

MD1

Y

MD2

IR

IR IR IR

R

Bypass/interlock I1

I2

Stats gathering in hardware => no performance impact


Test of size

More Complexity

Caches/TLBs? Keep tags, pass address (virtual and

physical if necessary) Hits, misses determined but don’t need

data Superscalar (multiple issue)?

“Fetch and issue” multiple instructions assuming they meet boundary constraints

Multiple “functional units” Schedulers Reorder buffer/instruction window Pipeline control along with instructions

NO DATAPATH only part of control path


Test of size

A Timing Model

iTLB iCache

dTLB dCache

Align & Pick



L2 Cache

FunctionalModel



Test of size

Driving a Timing Model

iTLB iCache

dTLB dCache

Align & Pick



L2 Cache


FunctionalModel

Can Easily Execute in Parallel in Hardware


Test of size

Trace-Driven Simulation? What we’ve described is a trace-driven simulator Accurate if “functional path” == “target path”

“Ideal” processor

Unfortunately, functional path is not always equal to timing path Real processors speculatively execute, assuming their target

path is the functional path


Test of size

Branch Prediction

iTLB iCache

dTLB dCache

Align & Pick



L2 Cache

FunctionalModel


Wrong-path instructions! Implement BP in timing model Timing model forces ISA simulator to mis-

speculate Rollback, restore

Branch predictor predictor in ISA simulator?

BP only works in processor if it’s fairly accurate

FAST simulators take advantage of the fact that most of the time micro-architecture is on the right path Most complexity (BP, parallelism) can be

handled this way


Test of size

Speculative Simulation FAST simulators themselves are speculative

Speculate functional path == target path Timing model detects functional path != target path

Forces functional model down wrong path OR Returns functional model to functional path

Speculation reduces need for round-trip communication Makes FAST latency tolerant (good for parallelization)

Functional model reusable Timing model determines target behavior


Test of size

A Parallel Simulator Functional runs in parallel with timing Functional model can run in parallel

OoO superscalar processor?

Timing model runs in parallel in hardware Hard to not run in parallel Every register-to-register transition could potentially be

parallelized


Test of size

Outline Motivation

How to Parallelize?

Functional Models, Timing Models and Tools

Status, Related Work, Conclusions


Test of size

Functional Models Hardware functional models

Difficult to make complete (use Protoflex methods?) Size

Software functional models exist today Fast Full system Very usable Bochs, QEMU, Simics, SimNow, SimOS, etc. Runs on fastest hardware we know about to execute an ISA

Can we use them? What modifications?


Test of size

Functional Model Modifications Need to

Rollback, force branch Rollback, restore and continue

How? set_pc(inst_num, pc)

Set a particular dynamic instance of an instruction to a particular instruction pointed to by PC

Sufficient Can be implemented with checkpoints

ISA state, memory, peripherals

BR

BR

BR

BRBR


Test of size

Our Functional Model Derived from QEMU

Fast Boots Linux, Windows x86, x86-64, PowerPC, Sparc, ARM, MIPS, …

We support x86 for now Added tracing, checkpoint, rollback Runs on x86 hosts as well as embedded PowerPC host inside

FPGA PowerPC in the works

Dam Sunwoo


Test of size

Modular Timing Models Modules

Models timing functionality Rename, caches, etc.

Connectors Attach modules together Abstract timing from modules

Throughput (input, output), delay, maxTransactions

Stats and tracing


Test of size

Modules Modules predict what happens each cycle

Which instruction is scheduled? Modules hierarchically constructed

CAM, FIFOs, arbiters, etc. Branch predictors, Caches, TLBs, Schedulers, ALUs Fetch, Decode, Rename, RS, ROB

Many are essentially wires (e.g., ALU) Often written to execute one operation (simplifies)

E.g., Rename, Cache Executed multiple times per target cycle for wider processor, higher

associativity Simplifies implementation, tradeoff time for space

Joonsoo Kim, Nikhil Patil


Test of size

Connector Interface Commit

Indicates module is done with target cycle

Done Indicates Connector is done

with target cycle Enq, First, Deq

Just like FIFO

EnqCommit

Done

DeqCommit First Done

Bill Reinhart


Test of size

Current Timing Model

Fetch

BRU

R/SRename/

ROBD

ALU

LdSt Queue

iL1 dL1L2

ResultBus

TLB

MEM

BP

25 25

8

8 88

11

1 1

1 1

1 1

1

1 1

1 1


Test of size

FAST Tool Flow





Functional model






RTL-to-Timing Model

RTL

MicrocodeGenerator

Will be using Intel AWB for stats, etc.


Test of size

Outline Motivation

How to Parallelize?

Functional Models and Timing Models

Status, Related Work, Conclusions


Test of size

Execution Platforms DRC Computer

Opteron + FPGA on HT Xilinx development boards

XUP, ML-310 (embedded PowerPC)

Research Accelerator for MP (RAMP) A shared infrastructure for MP development Uses BEE2/BEE3 FPGA boards

4/5 large FPGAs and fast off-board I/O Large collaboration between Berkeley, CMU (Hoe), MIT, Stanford,

Texas, Washington and Intel.


Test of size

Simulator Performance on DRC

0

0.5

1

1.5

2

2.5

3

3.5

MIP

S

gshare

BP 97%

BP 100%



Test of size

Comparison to Related Work

1

10

100

1000

Intel Asim AMD GEMS Freescale IBM Mambo PTLSim sim-outorder FAST

FA

ST

Sp

eed

up


Test of size

Acknowledgements Students

Dam Sunwoo (FM) Joonsoo Kim (TM, Interface) Nikhil Patil (TM, tools) Bill Reinhart (TM connector) Eric Johnson (verification, Linux)

Funding DOE Early Career, NSF, SRC Intel, IBM, Xilinx, Freescale

Software Bluespec

Open-source full system simulators QEMU, Bochs


Test of size

Conclusions

Fast: Expect 2MIPS-10MIPS soon

2-4 orders of magnitude slower than target Interactive

Timely: quick model building Accurate: produce cycle-accurate numbers Complete: runs Linux/Windows + apps now, compile apps on

standard machines Transparent:

Hardware stats provides full visibility, no performance hit Inexpensive:

$300/$600 XUP boards can fit many timing models $9K (academic) DRC is still cost effective per cycle

Flexible: tools


Test of size

Connectors Abstract timing information from modules

Cannot do so perfectly (state within modules) Characteristics

Throughput (input and output), minimum delay, maximum outstanding Entries are equal and fully shiftable

If you have distinct lanes, need separate connectors Provide stats gathering, tracing

512 entry trigger-based trace Stats funneled BACK through connector

helps with place-and-route (Nikhil Patil)

Bill Reinhart


Test of size

Example Connector Interface Code (in ALU Module)module mkALU#(ConsumerPort#(Tuple2#(Maybe#(PReg_t), ROBTag_t)) inQ,

ProducerPort#(Tuple2#(Maybe#(PReg_t), Execute2ROB_t)) outQ)(ALU);

rule pass;inQ.deq;match { .dest, .robtag } = inQ.first;outQ.enq(tuple2(dest, Execute2ROB_t{robtag:robtag, exception: Invalid}));

endrule

rule done (inQ.done || outQ.done);inQ.commit(True);outQ.commit(True);

endrule

endmodule

ALU latency defined by Connector min-delay!


Test of size

FAST Tool Flow





Functional model






RTL-to-Timing Model

RTL

MicrocodeGenerator

Will be using AWB for stats, etc.


Test of size

Microcode Compiler Uses the LLVM Compiler Infrastructure developed at UIUC (

http://www.llvm.org) Compile the Bochs CPU (Bochs is another portable x86 full-

system emulator.) Backend retargetted to the micro-op "ISA“

Intention is to automate generation of new ISA instructions, automatically retarget new micro-architectures


Nikhil Patil


Test of size

Example Microcode Compile"ADD r/m32, r32" instruction. void BX_CPU_C::ADD_EdGd(bxInstruction_c *i) {

Bit32u op2_32, op1_32, sum_32; op2_32 = BX_READ_32BIT_REG(i->nnn()); if (i->modC0()) { op1_32 = BX_READ_32BIT_REG(i->rm()); sum_32 = op1_32 + op2_32; BX_WRITE_32BIT_REGZ(i->rm(), sum_32);}else { read_RMW_virtual_dword(i->seg(),

RMAddr(i), &op1_32); sum_32 = op1_32 + op2_32; write_RMW_virtual_dword(sum_32) } SET_FLAGS_OSZAPC_32(op1_32, op2_32, sum_32, BX_INSTR_ADD32)

}

When ModRM != 0xC0:

%u0 = ADDRGEN LOADd %v0:[%u0] -> %u1 cc,%u1 = add %u1, %nnn STOREd %v0:[%u0] <- %u1

When ModRM == 0xC0:

cc,%rm = add %rm, %nnn


Test of size

RTL to Timing Model

TraceTrace

0x2

addrinst

InstructionMemory

Add

rd1

GPR File

rr1rr2

wrwd rd2

we

Immed.Extend

M

0

2

raddr

waddr

wdata

rdata

re

Data Memory

ALU

algn

1

3

wePCA

B

MD1

Y

MD2

IR

IR IR IR

R

Bypass/interlock I1

I2

Timing model perfectly models RTLVerification??? But, do you really want to do it this way?


Test of size

Traces to Timing Model Interface provides ability for functional model to pass

instruction trace to timing model Storage that timing model components can read

Compression TLB to eliminate physical address Cache static information (src/dest registers, opcode, etc.)

Joonsoo Kim


Test of size

Adding Modules Easy to add modules to improve accuracy

No functionality, only timing

E.g., DRAM model Model banks, row register, control timing, refresh Results in some requests taking longer If modeling memory controller, reorder memory operations as

well


Test of size

Long Term Plans/Impact CMP/SMP support Early power estimation (with Freescale) Simulator available for software development/tuning before

hardware True co-development Performance/power revs of software

Design derived from simulator Write cycle-accurate simulator, automatically generate design (at

RTL + libraries level)

Change the way computer systems are designed and evaluated


Test of size

Demo Compile code, run on

simulator

BP Perfect, gshare See difference in simulator

performance #ALUs

1, 8

Decode

Fetch

R/S

ROB

Rename

L1$

Br ALU TLB Ld/st $

L2$BP


Test of size

Better Partitioning? Traditional partitioning on module boundaries

Timing and functionality

Is there a better way? 64b adder

cannot be implemented as a single monolithic entity But, 64 1b adders very tractable

Can compose a 64b adder from 64 x 1b adders

Is there a better partitioning for simulation?


Test of size

Current/Future Work Finish uni-processor version

Very close, running on one platform Debug rest of pipeline

Shared/coherent bus MP version

Porting QEMU to MP host MP timing model

Hardware functional model Tool chain


Test of size

FPGA Resources for TM OOO, branch prediction, ROB, two ALUs, 1 branch unit, 1 load/store unit (32 entry load/store

queue), iTLB, dTLB 7221 slices (52% of a 2VP30) 9 block RAMs (6% of a 2VP30) NOTE: large structures have not yet been mapped to block RAMs (trace buffer, ROB)

Configurable cache model (old Verilog version) 32KB 4-way set associative cache with 16B cache-lines

165 slices (1% of a 2VP30) 17 block RAMs (12% of a 2VP30)

2MB 4-way set-associative cache with 64B cache-lines 140 slices (1% of a 2VP30) 40 block RAMs (29% of a 2VP30)

2VP30 is an old FPGA found on a $600 list-price XUP board Current FPGAs

10 times as much logic (330K logic cells (2VP30 is around 30K)) 3 to 4 times as much block RAM


Test of size

Current Limitations No datapath

Data speculation requires datapath

Control path/data path crossover Informing loads, Query loads

Can support But, can require additional communication Quite possible


Test of size

FAST Tool Flow





Functional model


(modules + glue)




RTL-to-Timing Model

RTL

DecodeGenerator


Test of size

RTL to Timing Model

TraceTrace

0x2

addrinst

InstructionMemory

Add

rd1

GPR File

rr1rr2

wrwd rd2

we

Immed.Extend

M

0

2

raddr

waddr

wdata

rdata

re

Data Memory

ALU

algn

1

3

wePCA

B

MD1

Y

MD2

IR

IR IR IR

R

Bypass/interlock I1

I2

Timing model perfectly models RTLVerification???


Test of size

Hardware Related Work FAST is first FPGA-based functional/timing model

partitioned simulator that we know of HASim (Emer et al)

FPGA-based processor emulators Academic: RAMP, MIT UMUM, Scale, Stanford FAST, ATLAS,

CMU, … Intel (Shih-Lien Lu) !Flexible (huge effort), !Accurate (old processors, not

modeling everything)


Test of size

Software Related Work FastSim: Schnarr, Larus (1998)

Direct execution with rollback, memoization for speed Simplescalar accuracy

Emer, et al: Asim Intel cycle-accurate, about 10KHz

PTLSim: Yourst Full system x86, about 200KHz Seems similar to Simplescalar in terms of accuracy Uses Xen to fast forward


Test of size

Simulator Users and Tradeoffs

Architects: e.g., Matt Performance/Power/Reliability Timely, ~Accurate, Flexible !Fast, !Complete

Software: e.g., Wen-Mei Development, tuning Fast, OS !Accurate, !Flexible, !Timely

Implementation (RTL) Correctness Accurate, ~Timely !Fast, !Complete


Test of sizeStolen from http://research.microsoft.com/si/PPT/HardwareModelingInfrastructure.pdf

Impossible????


Test of size

Microprocessors One instruction executes to completion before next

starts Implementation is different leading to different

performance


Test of size

Microprocessors

In-OrderFront

Out-of-OrderMiddle

In-OrderEnd


Test of size

Current Software Simulators

Performance-Detail tug-of-war Higher performance means less

simulation tasks Higher detail means more

simulation tasks

Simulator Slowdown Sim 2 min

ISA (SimNow, Simics) 1-100 10 min

Caches 10-1000 10 hrs

OOO (ASim ~1-10KIPS) 10K-1M 1 year

RTL 100M-1B 2000 years

CMP? ??? ???


Test of size

Why Build? (Anant) Software won’t work unless you are building hardware Motivation for software tools Large data sets Hard problems better understood and show up once

you became building Have to solve hard problems More radical the idea, more important it is to build

World only trusts end-to-end results

Cycle simulator only becomes accurate after hardware gets precisely defined

Needed for commercialization


Test of size

Functional Models Hardware functional models

Difficult to make complete (use Protoflex methods?) Size

Software functional models exist today Fast Full system Very usable Bochs, QEMU, Simics, SimNow, SimOS, etc. Runs on fastest hardware we know about to execute an ISA

Can we use them? What modifications?

© Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil...

Documents

Transcript of © Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil...