
6/15/06 Derek Chiou, UT Austin, RAMP 1

Confessions of a RAMP Heretic: Fast, Full-System, Cycle-Accurate x86/PowerPC/ARM/Sparc Simulators

Derek Chiou
University of Texas at Austin
Electrical and Computer Engineering

6/15/06 Derek Chiou, UT Austin, RAMP 2

FAST Goals

  • Fast: as fast as possible
      • 2-3 orders of magnitude slower than target?
      • Fast enough to run real datasets to completion
      • Interactive?
  • Accurate: produce cycle-accurate numbers for modern microprocessors (Pentium M)
  • Complete: run unmodified operating systems, applications, ISAs, …
  • Transparent: full visibility, no performance hit
  • Inexpensive: need thousands
  • Usable: quick changes, use RTL to generate
  • I/O: the MOST important part of systems

6/15/06 Derek Chiou, UT Austin, RAMP 3

Functional/Timing Partitioning

  • Proven partitioning
      • Asim, SimpleScalar, Timing-First, Memoized, etc.
  • Simplifies simulator. Promotes reuse
  • Same performance in software
      • Asim at 10 kHz
      • Most of the time spent in timing model!
  • Hardware???

[Diagram: the Functional Model (ISA: instructions, architectural registers, peripheral functionality, …) sends an instruction stream to the Timing Model (micro-architecture: fetch, decode, rename, reservation stations, scheduling window, reorder buffer, …)]
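The partitioning on this slide can be illustrated with a minimal software sketch. Everything here is hypothetical and chosen for illustration: the three-instruction toy ISA, the names `TraceEntry`, `functional_model`, and `timing_model`, and the latencies in `LATENCY`. It is not code from Asim or FAST. The point it shows is that only the functional model computes values, while the timing model sees the instruction stream and tracks when things happen.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class TraceEntry:
    """One entry of the instruction stream sent from functional to timing model."""
    pc: int
    opcode: str
    dest: int
    srcs: Tuple[int, ...]


def functional_model(program: List[tuple]) -> Iterator[TraceEntry]:
    """Toy ISA interpreter: computes register values, emits the instruction stream."""
    regs = [0] * 8
    for pc, (op, dest, a, b) in enumerate(program):
        if op == "addi":            # regs[dest] = regs[a] + immediate b
            regs[dest] = regs[a] + b
            yield TraceEntry(pc, op, dest, (a,))
        elif op == "add":
            regs[dest] = regs[a] + regs[b]
            yield TraceEntry(pc, op, dest, (a, b))
        elif op == "mul":
            regs[dest] = regs[a] * regs[b]
            yield TraceEntry(pc, op, dest, (a, b))
        else:
            raise ValueError(f"unknown opcode {op}")


# Assumed execution latencies in cycles (illustrative, not from the slides).
LATENCY = {"addi": 1, "add": 1, "mul": 3}


def timing_model(stream: Iterator[TraceEntry]) -> int:
    """Toy in-order, single-issue timing model.

    It never sees data values; it only tracks the cycle at which each
    register becomes available and stalls issue on unready sources.
    """
    ready = {}   # register -> cycle at which its value is available
    cycle = 0    # earliest cycle at which the next instruction may issue
    for entry in stream:
        issue = max([cycle] + [ready.get(r, 0) for r in entry.srcs])
        ready[entry.dest] = issue + LATENCY[entry.opcode]
        cycle = issue + 1
    return max([cycle] + list(ready.values()))


program = [
    ("addi", 1, 0, 5),  # r1 = r0 + 5
    ("mul", 2, 1, 1),   # r2 = r1 * r1
    ("add", 3, 2, 1),   # r3 = r2 + r1 (waits for the multiply)
    ("addi", 4, 0, 1),  # r4 = r0 + 1
]
total_cycles = timing_model(functional_model(program))
```

Because the timing model only consumes `TraceEntry` records, either side can be swapped out independently, which is the reuse argument the slide makes.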

6/15/06 Derek Chiou, UT Austin, RAMP 4

FAST

  • Functional model could be
      • Pure software (QEMU, Bochs, Simics, SimNow)
          • Use JIT for performance, very fast
          • No better hardware for executing ISA than processor
          • Can operate under the covers (flush cache, for example)
      • Pure hardware (Hoe et al.)
      • Hybrid (Hoe et al.)
  • Timing model: very simple hardware

[Diagram: the Functional Model (ISA), labeled Full-System Simulator, sends an instruction stream to the Timing Model (Micro-architecture), labeled FPGA]

6/15/06 Derek Chiou, UT Austin, RAMP 5

What is a FAST Timing Model?

[Figure: pipelined processor datapath driven by Trace inputs. Labeled blocks: PC, Instruction Memory, Add, GPR File, Immed. Extend, ALU, Data Memory, pipeline registers (IR, A, B, Y, MD1, MD2, R), and Bypass/interlock logic]

6/15/06 Derek Chiou, UT Austin, RAMP 6

More Complexity

  • Caches/TLBs?
      • Keep tags, pass address (virtual and physical if necessary)
      • Hits, misses determined but don’t need data
  • Superscalar (multiple issue)?
      • “Fetch and issue” multiple instructions assuming they meet boundary constraints
      • Multiple “functional units”
      • Schedulers
      • Reorder buffer/instruction window
      • Pipeline control along with instructions
  • NO DATAPATH (and only part of control path)!!!!
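The "keep tags, pass address" idea can be sketched as a cache model that stores only tags and reports hit or miss. This is a minimal illustration, not the FAST hardware: the class name `TagOnlyCache` is hypothetical, and LRU replacement is an assumption, since the slides do not state a replacement policy.

```python
from collections import OrderedDict


class TagOnlyCache:
    """Set-associative cache model that holds tags but no data."""

    def __init__(self, size_bytes: int, ways: int, line_bytes: int):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        # One ordered dict per set; key order tracks recency (oldest first).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit, False on a miss. No data is read or written."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(tags) >= self.ways:
            tags.popitem(last=False)   # evict least recently used (assumed policy)
        tags[tag] = None
        return False


# 32KB, 4-way, 16B lines: one of the configurations listed later in the deck.
cache = TagOnlyCache(32 * 1024, ways=4, line_bytes=16)
first = cache.access(0x1000)    # cold miss
second = cache.access(0x1004)   # same 16B line, so a hit
```

The timing model only needs the boolean result to decide how many cycles an access takes; the functional model has already produced the data.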

6/15/06 Derek Chiou, UT Austin, RAMP 7

Driving a Timing Model

[Diagram: the Functional Model drives a timing model built from iTLB, iCache, Align & Pick, three Decode units, four Sched units, dTLB, dCache, and L2 Cache, alongside Memory & I/O timing models]

6/15/06 Derek Chiou, UT Austin, RAMP 8

Complexity: BP

[Diagram: same as the previous slide. The Functional Model drives iTLB, iCache, Align & Pick, three Decode units, four Sched units, dTLB, dCache, L2 Cache, and Memory & I/O timing models]

  • Wrong-path instructions!
      • Implement BP in timing model
      • Timing model forces ISA simulator to mis-speculate
      • Rollback, restore
  • BP only works in processor if it’s fairly accurate
      • Degrades to trace driven!
  • FAST simulators take advantage of the fact that most of the time the micro-architecture is on the right path
      • Most complexity (BP, parallelism) can be handled this way
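A minimal sketch of "implement BP in timing model": the timing model runs its own predictor over the right-path branch outcomes reported by the functional model and notes where the two disagree. Those are the points where it would have to force the functional model down the wrong path and later roll it back. The names `TwoBitPredictor` and `find_misspeculations`, the table size, and the trace format `(inst_num, pc, taken)` are assumptions made for this example.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by PC."""

    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.counters = [1] * entries   # start at weakly not-taken

    def predict(self, pc: int) -> bool:
        return self.counters[pc % self.entries] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def find_misspeculations(branch_trace, predictor):
    """Return the dynamic instruction numbers of mispredicted branches.

    branch_trace yields (inst_num, pc, taken) tuples for right-path branches,
    as reported by the functional model.
    """
    requests = []
    for inst_num, pc, taken in branch_trace:
        if predictor.predict(pc) != taken:
            requests.append(inst_num)   # wrong-path fetch would start here
        predictor.update(pc, taken)
    return requests


# A loop branch at one PC: taken four times, then falls through.
trace = [(n, 0x40, True) for n in range(4)] + [(4, 0x40, False)]
rollback_points = find_misspeculations(trace, TwoBitPredictor())
```

With an accurate predictor the list stays short, which is the slide's argument: most of the time the functional model's right-path stream is exactly what the timing model needs.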

6/15/06 Derek Chiou, UT Austin, RAMP 9

Parallelism: Detect Problem & Rollback

[Diagram: four functional models (FM) with Memory, and four timing models (TM) with a Network and a Memory Model]

6/15/06 Derek Chiou, UT Austin, RAMP 10

Functional Model Rollback

  • Need to
      • Rollback, force branch
      • Rollback, restore and continue
  • How? set_pc(inst_num, pc)
      • Set a particular dynamic instance of an instruction to a particular instruction pointed to by PC
      • Sufficient
  • Currently implemented with checkpoints
      • ISA state, memory, peripherals
  • Works for parallelism too

[Diagram: chain of branch (BR) nodes]
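One way to picture `set_pc(inst_num, pc)` on top of checkpoints is the sketch below. Only the call name and its two arguments come from the slide; the class `CheckpointedFunctionalModel`, the two-instruction toy ISA, and the checkpoint interval are hypothetical. The checkpoint here covers only registers and PC, whereas the slide lists ISA state, memory, and peripherals.

```python
import copy

# Toy ISA (hypothetical): ("addi", reg, imm) or ("bnez", reg, target_pc).
PROGRAM = [
    ("addi", 0, 1),    # pc 0: r0 += 1
    ("bnez", 0, 3),    # pc 1: branch to pc 3 if r0 != 0
    ("addi", 1, 10),   # pc 2: fall-through path
    ("addi", 2, 100),  # pc 3: taken path
    ("addi", 3, 1),    # pc 4
    ("addi", 3, 1),    # pc 5
]


class CheckpointedFunctionalModel:
    def __init__(self, program, checkpoint_interval=4):
        self.program = program
        self.interval = checkpoint_interval
        self.state = {"pc": 0, "regs": [0] * 4}
        self.inst_num = 0                      # next dynamic instruction number
        self.forced = {}                       # inst_num -> PC forced by set_pc
        self.checkpoints = {0: copy.deepcopy(self.state)}

    def step(self):
        """Execute one dynamic instruction; return (inst_num, pc)."""
        n = self.inst_num
        if n % self.interval == 0 and n not in self.checkpoints:
            self.checkpoints[n] = copy.deepcopy(self.state)
        if n in self.forced:
            self.state["pc"] = self.forced[n]
        pc = self.state["pc"]
        op, a, b = self.program[pc]
        regs = self.state["regs"]
        if op == "addi":
            regs[a] += b
            next_pc = pc + 1
        elif op == "bnez":
            next_pc = b if regs[a] != 0 else pc + 1
        else:
            raise ValueError(f"unknown opcode {op}")
        self.state["pc"] = next_pc
        self.inst_num = n + 1
        return n, pc

    def set_pc(self, inst_num, pc):
        """Make dynamic instruction inst_num execute the instruction at pc."""
        if inst_num > self.inst_num:
            raise ValueError("cannot set the PC of an instruction not yet reached")
        base = max(c for c in self.checkpoints if c <= inst_num)
        # Drop checkpoints and forced PCs that belong to the discarded path.
        self.checkpoints = {c: s for c, s in self.checkpoints.items() if c <= base}
        self.forced = {n: p for n, p in self.forced.items() if n < inst_num}
        self.forced[inst_num] = pc
        self.state = copy.deepcopy(self.checkpoints[base])
        self.inst_num = base
        while self.inst_num < inst_num:        # replay up to the target instruction
            self.step()


fm = CheckpointedFunctionalModel(PROGRAM)
right_path = [fm.step() for _ in range(4)]
fm.set_pc(2, 2)        # rollback, force branch: run the fall-through path
wrong_path = fm.step()
fm.set_pc(2, 3)        # rollback, restore and continue on the right path
resumed = fm.step()
```

The same call covers both cases on the slide: forcing a wrong-path branch and restoring the right path differ only in the PC passed in.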

6/15/06 Derek Chiou, UT Austin, RAMP 11

RTL to Timing Model

[Figure: the same pipelined datapath as in "What is a FAST Timing Model?", driven by Trace inputs. Labeled blocks: PC, Instruction Memory, Add, GPR File, Immed. Extend, ALU, Data Memory, pipeline registers (IR, A, B, Y, MD1, MD2, R), and Bypass/interlock logic]

  • Timing model perfectly models RTL
  • Verification???

6/15/06 Derek Chiou, UT Austin, RAMP 12

Current FAST System

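[Diagram: Xilinx ML310/XUP board with a Virtex FPGA (XC2VP30). The functional model (FM) runs under Linux on an embedded PowerPC, the timing model (TM) runs in the FPGA fabric, and a second embedded PowerPC is shown]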

6/15/06 Derek Chiou, UT Austin, RAMP 13

QEMU on Xilinx PowerPC

6/15/06 Derek Chiou, UT Austin, RAMP 14

Status

  • x86 functional model boots Linux, targeting 80486 to Pentium D-like and beyond (Dam Sunwoo)
      • Modified Bochs and QEMU
  • Branch-predicted, multi-function-unit, OOO timing model compiles in Bluespec (FAST group)
      • Synthesized for FPGA, 8.5K lines of code, rated Top 5 User!
  • Memory, disk models
      • Hope to have network model soon
  • Have straight pipeline 486 model with TLBs and caches
  • Preliminary statistics gathered in hardware timing model
  • RTL-to-timing model (Nikhil Patil)
  • Defining tools for ISA extension and timing model assembly

6/15/06 Derek Chiou, UT Austin, RAMP 15

Timing Model Resources

  • OOO, superscalar, 2-bit branch prediction, five functional units, 32KB DCache
      • [INTERFACE: Fast_if] + [TM: IfcVB (interface between Bluespec & Verilog)/CmdQ/Fetch/Decode/Rename/Execute]: 26% of V2P30 (3593 slices)
      • 22 block RAMs (out of 136)
      • ROB broken right now
  • Early configurable cache model (state shouldn’t change much)
      • 32KB 4-way set-associative cache with 16B cache lines
          • 165 slices (1% of a 2VP30)
          • 17 block RAMs (12% of a 2VP30)
      • 2MB 4-way set-associative cache with 64B cache lines
          • 140 slices (1% of a 2VP30)
          • 40 block RAMs (29% of a 2VP30)
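For reference, the geometry of the two cache configurations above follows directly from size, associativity, and line size. The helper below is a small sketch with a hypothetical name (`cache_geometry`); it assumes all three parameters are powers of two and says nothing about how the tags are packed into block RAMs.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int) -> dict:
    """Derive line count, set count, and address field widths for a cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2 for powers of two
        "index_bits": sets.bit_length() - 1,
    }


small = cache_geometry(32 * 1024, ways=4, line_bytes=16)
large = cache_geometry(2 * 1024 * 1024, ways=4, line_bytes=64)
```

The 2MB configuration has 16 times as many sets as the 32KB one, which is consistent with it needing more block RAMs for tag state while using a similar number of slices for control logic.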

6/15/06 Derek Chiou, UT Austin, RAMP 16

Current Performance

  • Functional model
      • Up to 500K x86 inst/sec today on V2P30 FPGA
          • Includes rollbacks assuming 5% mis-speculation
          • Not that optimized
      • 5 MIPS unmodified, 10M+ on 3.0GHz Pentium 4
          • DRC box should give this performance
      • PowerPC ISA should be much faster!
          • PowerPC on PowerPC
  • Timing model
      • Not bottleneck!

6/15/06 Derek Chiou, UT Austin, RAMP 17

Conclusions

  • 1MHz to 100MHz, cycle-accurate, full-system, multiprocessor x86, x86-64, PowerPC, ARM, Sparc simulator
  • Leverage extant full-system simulators
  • FPGA timing models maximize performance and statistics-gathering capabilities
  • Pretty much any timing model seems to fit into a single FPGA (Pentium M in V2P30?)
  • Uniprocessor, multi-processor capable
  • Tools can minimize creation/modification effort