© Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil...
© Derek Chiou
Functional/Timing Split in UT FAST
Derek Chiou,
Dam Sunwoo, Joonsoo Kim,
Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe, Hari Angepat
University of Texas at Austin
Electrical and Computer Engineering
1/17/08 RAMP Retreat 2
First, Some Terminology
Host: the system on which a simulator runs
  Dell 390 with a single 1.8GHz Core 2 Duo and 4GB of RAM
  A Xilinx FPGA board
Target: the system being modeled
  Alpha 21264 processor
  Dell 390 with a single 1.8GHz Core 2 Duo and 4GB of RAM
Example
  Host: your desktop
  Simulator: SimpleScalar (sim-alpha)
  Target: Alpha 21264
FAST Goals
  RTL-level cycle-accuracy
  Complex ISA capable (x86, PowerPC)
  Complex micro-architecture capable (Intel Core 2)
  Off-the-shelf OS, apps (Windows, MS Word, Linux)
    Lesson I learned from my grad school career
    Can run other stuff too of course
  MP-capable (scale with FPGA resources)
  Fast (10MIPS range)
  (Relatively) easy to implement, modify, extend
FAST Prototype in Real Time
FAST: Speculative Functional/Timing Partitioning
  Proven partitioning (FastSim)
    FM executes instructions to completion, pushes inst trace to TM
    FM insts used as TM fetched insts
    If functional insts != timing insts, TM forces FM to roll back
      E.g., branch mis-speculation, resolve, memory ordering
    Clean inst trace/rollback interface
  Optimize the common case!
    FM runs independently from TM when functional insts == timing insts!
    Easy to parallelize
    Better target uArch simulates faster
  Factorized, not partitioned!
    (FM + TM) < (monolithic simulator)
    FM fairly simple, only functionality
    TM fairly simple, only timing
[Figure: the Functional Model (ISA + peripherals: instructions, architectural registers, peripheral functionality, …) sends an inst trace to the Timing Model (micro-architecture: caches, arbitration, pipelining, associativity, …)]
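The common case described on this slide can be sketched in a few lines of Python. This is only a toy illustration, not the FAST implementation: the two-instruction loop, the `TraceEntry` fields, and the static "always taken" predictor are all assumptions made for the example. It shows that the timing model needs a round trip to the functional model only when its predicted path differs from the functional path.

```python
from dataclasses import dataclass


@dataclass
class TraceEntry:
    inst_num: int    # dynamic instruction number
    pc: int          # address of this instruction
    next_pc: int     # address the functional model went to next
    is_branch: bool


class FunctionalModel:
    """Toy functional model: a countdown loop executed to completion."""

    def __init__(self, iterations):
        self.iterations = iterations

    def run(self):
        """Yield the functional (correct-path) instruction trace."""
        inst_num = 0
        for remaining in range(self.iterations, 0, -1):
            # pc 0: decrement the counter, fall through to pc 1
            yield TraceEntry(inst_num, 0, 1, False)
            inst_num += 1
            # pc 1: branch back to pc 0 while iterations remain
            taken = remaining > 1
            yield TraceEntry(inst_num, 1, 0 if taken else 2, True)
            inst_num += 1


class TimingModel:
    """Toy timing model: consumes the trace and predicts every branch taken."""

    def __init__(self):
        self.consumed = 0
        self.round_trips = 0   # times the TM would have to redirect the FM

    def consume(self, entry):
        self.consumed += 1
        if entry.is_branch:
            predicted_next = 0   # hypothetical static predictor: always taken
            if predicted_next != entry.next_pc:
                # Functional path != timing path: a real TM would force the
                # FM down the wrong path here and roll it back later.
                self.round_trips += 1


fm = FunctionalModel(iterations=100)
tm = TimingModel()
for entry in fm.run():
    tm.consume(entry)
```

With 100 loop iterations the timing model consumes 200 trace entries but needs a single round trip, at the loop exit, which is the sense in which a better-predicting target simulates faster.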
High Level FAST Architecture: A Parallelized Simulator
  Parallelized between FM & TM
  Parallel target would have a parallelized FM, each target core conceptually running on a separate host core
  Parallelized TM
    Parallelizes nicely in hardware (FPGA)
    Latency tolerant, infrequent round-trips
    Stats can be done in hardware! (no performance impact)
[Figure: the Functional Model (ISA + peripherals) on a host sends a trace to the Timing Model (micro-architecture) on a host FPGA]
What Is A FAST Functional Model?
  Requirements
    Fast, full system, generate instruction trace, support rollback
  Hardware functional models
    Fast, but FPGA implementation difficult to make complete
      x86, boots Windows?
    Simple, resource efficient FM sufficient (but need trace, rollback)
  Software functional models
    Bochs, QEMU, Simics, SimNow, SimOS, etc.
    Run on fastest hardware we know about to execute an ISA
    Full system
What is a FAST Timing Model?
[Figure: a pipelined datapath fed by the trace: PC, instruction memory, GPR file, immediate extend, ALU, data memory, pipeline registers, and bypass/interlock logic]
Modular Timing Models: Modules + Connectors
  Modules model timing functionality
    E.g., rename, caches, etc.
    Built hierarchically for extensibility
      CAM, FIFOs, arbiters, etc.
      Branch predictors, Caches, TLBs, Schedulers, ALUs
      Fetch, Decode, Rename, RS, ROB
    Many are essentially wires (e.g., ALU)
    Often written to execute one operation
      E.g., Rename, Cache
      Executed multiple times per target cycle for wider processor, higher associativity
      Simplifies implementation, tradeoff time for space
  Connectors connect modules
    Abstract timing from modules
      Throughput (input, output), delay, maxTransactions
    Stats and tracing
Bill Reinhart
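A connector that carries the three timing parameters named above (throughput, delay, maxTransactions) might look roughly like the following Python sketch. The actual FAST connectors are written in Bluespec; the class name, method names, and the `tick()` cycle model here are assumptions chosen for illustration.

```python
from collections import deque


class Connector:
    """Toy connector: a FIFO whose timing is set by its parameters."""

    def __init__(self, in_throughput, out_throughput, delay, max_transactions):
        self.in_throughput = in_throughput        # max enqueues per target cycle
        self.out_throughput = out_throughput      # max dequeues per target cycle
        self.delay = delay                        # target cycles in flight
        self.max_transactions = max_transactions  # capacity
        self.cycle = 0
        self.queue = deque()                      # (ready_cycle, item) pairs
        self.enq_count = 0
        self.deq_count = 0

    def can_enq(self):
        return (self.enq_count < self.in_throughput
                and len(self.queue) < self.max_transactions)

    def enq(self, item):
        if not self.can_enq():
            raise RuntimeError("connector cannot accept another item this cycle")
        self.queue.append((self.cycle + self.delay, item))
        self.enq_count += 1

    def can_deq(self):
        return (self.deq_count < self.out_throughput
                and bool(self.queue)
                and self.queue[0][0] <= self.cycle)

    def deq(self):
        if not self.can_deq():
            raise RuntimeError("no item is ready this cycle")
        self.deq_count += 1
        return self.queue.popleft()[1]

    def tick(self):
        """Advance one target cycle and reset the per-cycle counters."""
        self.cycle += 1
        self.enq_count = 0
        self.deq_count = 0
```

Because the delay and throughput live in the connector, a module on either side can be written as if it handled one operation at a time, which matches the slide's point that connectors abstract timing away from modules.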
Microcode Compiler
  Intention
    Automate generation of new ISA instructions
    Automatically retarget new micro-architectures
    Necessary for x86
  Uses the LLVM Compiler Infrastructure developed at UIUC
    http://www.llvm.org
  Compile the Bochs CPU model
    Bochs is another portable x86 full-system emulator.
    Backend retargeted to micro-op ISA
  Generates microcode that “runs” on the timing model
    Over 99% dynamic inst coverage for most INT benchmarks
      Floating point instructions not yet supported
    Average 1.27 uOps per handled dynamic x86 instruction
    Microcode ISA is simple load/store with some x86 specialization
  Build a functional model as a processor executing microcode ISA?
Nikhil Patil
Prototype Overview
  Software functional model
    Eventually hardware functional model, but software sim exists
  FPGA-based timing model written in Bluespec
    Complex OoO micro-architecture fits in a single FPGA
  DRC or XUP
[Figure: the software Functional Model on a processor sends a trace to the Bluespec HDL Timing Model on an FPGA. On the DRC Computer the processor and FPGA are linked by HT; on the Xilinx FPGA the processor is a PowerPC 405.]
Current Prototype Functional Model
  Derived from QEMU
    Fast (JIT), boots Linux, Windows
    Supports x86, x86-64, PowerPC, Sparc, ARM, MIPS, …
      Prototype QEMU currently supports x86
    Added tracing, rollback (implemented with checkpoint)
      Including I/O (keyboard, mouse, video)
  Hosts
    x86 machines
    PowerPC inside of an FPGA
Dam Sunwoo, Jeb Keefe
Current Prototype Timing Model: OOO Superscalar
[Figure: block diagram of the out-of-order superscalar timing model: Fetch, BP, Decode, Rename, R/S, ROB, BRU, ALU, LdSt Queue, TLB, iL1, dL1, L2, MEM and ResultBus, with numeric annotations on the links]
Joonsoo Kim, Nikhil Patil, Bill Reinhart, Eric Johnson
Current Simulator Performance on DRC
[Figure: bar chart of simulator speed in MIPS (y-axis 0 to 3.5) for gshare, BP 97% and BP 100% branch prediction. Includes operating system code.]
FAST Future Work
  Improve performance (currently TM bottlenecked)
    TM (5MIPS-20MIPS)
    FM (10MIPS-100MIPS)
      Optimize simulator (10MIPS-20MIPS)
      Hardware FM (FPGA ~100MIPS)
        Simple pipe running microcode ISA
        Multithreaded (Protoflex) to improve throughput
        Real processor??? (debug for trace, transactional for rollback?)
  FAST-MP
    Need MP host
  PowerPC ISA support (this month)
  Power estimation capabilities in FPGAs
    Don’t slow down FAST performance
FAST and RAMP-Classic: A Comparison
  Cores
    FAST: Arbitrary, complex cores from simple core + rollback + TM
    RAMP: Full RTL, fits in FPGA
  ISAs/OSs
    FAST: Arbitrary ISA/OS (x86/Windows, PowerPC, …)
    RAMP: Depends on core (PowerPC, Sparc)
  Accuracy
    FAST: RTL cycle-accurate; host cycles >= target cycles, enabling reuse of hardware resources
    RAMP: Only accurate target RTL available (unless timing model used); host cycle == target cycle
  Speed
    FAST: Complexity/resource tradeoff
    RAMP: As fast as the FPGA will run
  Scalability
    FAST: Depends on FPGA resources (TM costs)
    RAMP: Depends on FPGA resources
FAST-MP: RAMP as a functional model
  How to do FAST-MP? Need multicore host!
  RAMP will not be able to accurately model Intel Core X micro-architecture in RAMP
    Unless it becomes much simpler in-order pipeline
      Even then, will Intel give us RTL?
  But, RAMP processor can execute ISA
    Target ISA == Host ISA
      Add trace, rollback capabilities
    Target ISA != Host ISA
      Simulate target ISA
      Hardware support for target ISA, trace, rollback?
  RAMP becomes a FAST functional model
    Use timing model to predict arbitrary behavior
FAST & RAMP-White Shared Infrastructure
  FAST connectors as White connectors
    Stats, timing
  FAST modules as White modules
    Quad ported RAM
    CAM/Cache on top of quad-ported RAM
    Multi-host cycle caches
    Branch predictors
    Load/store units, bus interface units
    Interconnection networks
  Eventually, full processors?
    Executing microcode, software cracking?
  FAST TM as RAMP TM
    At that point, RAMP == FAST-MP
Conclusions
  Split functional/timing can be cycle-accurate
    Roll back (FAST) or Timing-directed (timing-first?) (HASim)
  We believe that roll back can be done relatively cheaply
    Can be done in software at about order of magnitude impact
    Hardware support appears to be reasonable
      Log old values, playback (easy for simple core)
      Transactional processor?
  Simple cores + roll back as host for functional model?
    Virtualization for more threads is orthogonal, but multiplicative in resources
Another View of FAST
  Old processor/system modeling new processor/system
    Leverage the fact that most of the time, functionality and order are identical
    Differences in timing between old and new modeled in timing model
    Roll back/re-execute to deal with differences
  Use current generation host as functional model for next generation target?
    Trace implemented by debug support
    Roll back implemented with transactional support
Host == Target Modules in FAST
  FAST can leverage modules that are full and accurate implementations of modules
Host TLB == target TLB
Memory requests issued when timing model
FAST-MP: Mixing
[Figure: two processors (P), each attached to an IU]
What Is A FAST Functional Model?
  Requirements
    Fast
    Full System
    Generates instruction trace
    Supports rollback
  Hardware functional models (very fast)
    Real processor doesn’t support trace/rollback
    FPGA implementation difficult to make complete
      x86, boots Windows?
  Software functional models exist today
    Bochs, QEMU, Simics, SimNow, SimOS, etc.
    Relatively fast, full system
      Run on fastest hardware we know about to execute an ISA
    Can be modified to generate trace/support rollback
Step 1: Improving Performance via Parallelization
  Parallel slowdown due to communication?
    FM runs ahead, speculatively, round-trip communication infrequent
    Round-trip communication only when (functional path != timing path)
  Microprocessors have same problem
    Multiple issue, deep pipelines only work if predicted path is correct
    FM like perfect front end of processor, real uArch (TM) slows it down
    The better the target micro-architecture, the faster the simulator
[Figure: the Functional Model (ISA + peripherals) on one host sends a trace to the Timing Model (micro-architecture) on another host]
Current Prototype Functional Model
  Derived from QEMU
    Fast (JIT), boots Linux, Windows
    Supports x86, x86-64, PowerPC, Sparc, ARM, MIPS, …
      Prototype currently supports x86
    Added tracing, rollback (implemented with checkpoint)
      Including I/O (keyboard, mouse, video)
  Hosts
    x86 machines
    PowerPC inside of an FPGA
  PowerPC target by January
    About 1 month to port
Dam Sunwoo, Jeb Keefe
Performance Details
  Timing model is current bottleneck
    100MHz host cycle (not pushing timing)
    Currently taking ~30 host (FPGA) cycles per target cycle, max about 54 cycles (currently max latency defines target clock)
    BP is a simple gshare predictor
  Functional model
    Unoptimized modified QEMU
    With perfect BP, immediate return from TM, 5.4MIPS
  FM/TM communication
    469ns blocking read from Opteron on DRC (has gotten better)
      Poll every other basic block
    13ns/word for burst write
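The timing-model numbers above translate into a simulated target clock rate with one division. The short Python calculation below is a worked example under stated assumptions: the host clock and cycles-per-target-cycle figures come from the slide, while the target IPC used to convert cycles into MIPS is purely hypothetical.

```python
def target_cycles_per_second(host_clock_hz, host_cycles_per_target_cycle):
    """Target cycles the timing model can simulate per wall-clock second."""
    return host_clock_hz / host_cycles_per_target_cycle


def simulated_mips(target_cycles_per_s, target_ipc):
    """Simulated instruction rate for an assumed target IPC."""
    return target_cycles_per_s * target_ipc / 1e6


HOST_CLOCK_HZ = 100e6   # 100MHz host cycle, from the slide

typical = target_cycles_per_second(HOST_CLOCK_HZ, 30)   # ~30 host cycles per target cycle
worst = target_cycles_per_second(HOST_CLOCK_HZ, 54)     # max about 54 host cycles

# Hypothetical IPC of 1.0, only to show the unit conversion.
typical_mips = simulated_mips(typical, target_ipc=1.0)
```

At roughly 30 host cycles per target cycle the timing model advances about 3.3 million target cycles per second; if the 54-cycle maximum sets the target clock, the rate falls to about 1.85 million.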
Some Related Work (there is a lot)
  Software
    Functional/timing partitioned
      Asim, current M5, Timing-First, Opal all timing model driven
        Timing model tells functional model what to do and when to do it
      FastSim (Schnarr, et al., ASPLOS 98)
        Functional/timing, rollback when functional path != timing path
        But, instrumented binaries, not parallelized, no hardware
  Hardware
    HASim: Hardware ASim (Emer, et al.)
      Timing-first
      Seven points of communication between FM & TM
        Requires infinitely renamed out-of-order FM
      Currently supports a simplified MIPS ISA
Conclusions/Future Work
  It works
  Current FAST simulator prototype
    1.2MIPS (unoptimized), about 1000 times slower than target
    Timely: during architecture phase
    Complete: runs Windows, Linux
    Transparent: extensive, hardware-based stats
    Relatively inexpensive, easy to build and extend
  (Some) future work
    Optimize
      5MIPS soon, 10MIPS-20MIPS later (hardware FM using uCode?)
    More realistic timing model & calibration
    Tattler: automatic bottleneck detection
    CMP/SMP targets
FAST Relative Speeds (Current Prototype)
[Figure: chart of FAST relative speed, log-scale y-axis from 1 to 1000]
X86 Micro-Op Coverage
[Figure: chart of dynamic instruction coverage, y-axis from 0 to 1.2]
Number of uOps/x86 Instruction
[Figure: chart of uOps per x86 instruction, y-axis from 0 to 1.6]
Outline
  What is FAST (1 minute)
    Targeted properties
  Demo (30 sec)
    Runs x86, Windows at speeds fast enough to interact
  How?
    Start with partition
      Proven strategy (FastSim)
      Rollback to handle branch mis-speculation
      Simplifies full-system capabilities, since the complexity of the full-system is encapsulated in the functional model
      Functional model can be full-system simulator or processor
      FM passes trace to TM
      TM is simple
        Can model complex structures fairly low weight
    Improve performance
      Parallelize on FM/TM boundary
        Round-trip communication infrequent
      Permit FM to run ahead speculatively
        Doesn’t mis-speculation slow things down?
        Learn from computer architecture (microcoded, pipelined, OoO)
          Speculation permits computer system to run faster
          Key is not to mis-speculate often
      Parallelize on functional/timing boundary
        Handles branch prediction, anything where functional path is not equal to target path
        Rollback required from FM
    Bottleneck is timing model
      Parallelize TM?
        Difficult to do in software
          Practical limitations due to number of processors that communicate quickly
        Hardware (FPGA) is ideal
          Parallelizes nicely
          TM partition simplifies hardware
          Latency tolerant, infrequent round-trips
  Prototype
    Overview of prototype
      Software functional model running on processor host
      TM in Bluespec on FPGA
      Describe some details
      DRC or XUP
    FM description
    TM description
      Block diagram
    Microcode compiler
    Performance
      Relative performance
      Bottlenecks
  Future work
  Conclusions
Functional Model Modifications
  Need to
    Rollback, force branch
    Rollback, restore and continue
  How?
    set_pc(inst_num, pc)
      Set a particular dynamic instance of an instruction to a particular instruction pointed to by PC
    Sufficient
    Can be implemented with checkpoints
      ISA state, memory, peripherals
[Figure: diagram of branch (BR) nodes]
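One way to picture `set_pc(inst_num, pc)` on top of checkpoints is the Python sketch below. It is a toy under heavy assumptions: the functional model's whole state is a small dictionary, the "ISA" is a single made-up instruction, and checkpoints are taken every few instructions. The modified QEMU in the prototype is far more involved (memory, peripherals, I/O).

```python
import copy


class CheckpointingFM:
    """Toy functional model with periodic checkpoints and set_pc rollback."""

    def __init__(self, checkpoint_interval=4):
        self.state = {"pc": 0, "acc": 0}   # stands in for ISA state, memory, peripherals
        self.inst_num = 0                  # number of dynamic instructions executed
        self.interval = checkpoint_interval
        self.checkpoints = {0: copy.deepcopy(self.state)}

    def step(self):
        # Made-up ISA: every instruction adds its pc to acc, then advances the pc.
        self.state["acc"] += self.state["pc"]
        self.state["pc"] += 1
        self.inst_num += 1
        if self.inst_num % self.interval == 0:
            self.checkpoints[self.inst_num] = copy.deepcopy(self.state)

    def set_pc(self, inst_num, pc):
        """Make dynamic instruction number `inst_num` the instruction at `pc`."""
        # Restore the newest checkpoint taken at or before inst_num.
        base = max(n for n in self.checkpoints if n <= inst_num)
        self.state = copy.deepcopy(self.checkpoints[base])
        self.inst_num = base
        # Later checkpoints describe a path that is being discarded.
        self.checkpoints = {n: s for n, s in self.checkpoints.items() if n <= base}
        # Re-execute forward to the requested instruction, then redirect.
        while self.inst_num < inst_num:
            self.step()
        self.state["pc"] = pc


fm = CheckpointingFM(checkpoint_interval=4)
for _ in range(10):
    fm.step()              # the FM runs ahead
fm.set_pc(6, 100)          # the TM redirects dynamic instruction 6 to pc 100
```

The same call covers both needs listed on the slide: forcing the functional model down a mispredicted path, and later restoring it to the correct one.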
Why
Why Are Simulators Slow?
  Complexity
    40 entry fully-assoc iTLB
    40 entry fully-assoc dTLB
    16-way L2 cache
    Many schedulers
    Multiple Decode
    Out-of-order issue
    I/O
    Multiple cores, processors
  Modeling PARALLELISM
  Can we parallelize simulator?
Parallelizing Simulators
  Software
    Starting from 10KHz sequential simulator, 10MHz performance requires 1K processors
      Impractical
    If a software simulator could be parallelized, use same techniques to make the target faster?
  Hardware
    It’s what processors are made from
    Plenty of parallelism
    Difficult to implement?
      FPGAs for configurability
Modeling Complex Processors on FPGA(s)
  Compile RTL for FPGA
    Pentium fits in large FPGA
      3.1M transistors (Lu, Intel)
  Issues
    Fit: Core 2 in a single FPGA?
      Impossible: Processors grow as fast as FPGAs
      Multiple FPGAs to model one processor
    Full RTL required
      A lot of RTL
      Difficult to cover all cases
      Difficult to modify
[Figure: a Pentium in a single FPGA next to a Core 2 spread over an array of FPGAs]
Strawman Solution
  Partition
    Break problem down into multiple pieces
    Target module boundaries obeyed?
  Replace some/all RTL with pre-written modules and/or easier to write (behavioral/software) code
  Hybrid simulation
    Partitioned simulator, each partition running in potentially different host technology
Simple Module-based Software/Hardware Partitioning
  Partitioning on module boundaries over FPGAs and software
    SimpleScalar + FPGA L1 DCache (Suh, WARFP 2006)
[Figure: pipelined datapath (PC, instruction cache/memory, GPR file, immediate extend, ALU, data cache/memory, pipeline registers, bypass) partitioned along module boundaries]
IS THERE A BETTER PARTITIONING?
FAST
[Figure: the Functional Model (ISA: instructions, architectural registers, peripheral functionality, …) sends an inst trace to the Timing Model (micro-architecture: fetch, decode, rename, reservation stations, scheduling window, reorder buffer, …) on an FPGA]
More Complexity
  Caches/TLBs?
    Keep tags, pass address (virtual and physical if necessary)
    Hits, misses determined but don’t need data
  Superscalar (multiple issue)?
    “Fetch and issue” multiple instructions assuming they meet boundary constraints
  Multiple “functional units”
  Schedulers
  Reorder buffer/instruction window
  Pipeline control along with instructions
  NO DATAPATH only part of control path
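The "keep tags, no data" idea for caches can be shown with a small Python sketch. It assumes a direct-mapped organization and made-up sizes purely for brevity; the slide does not specify the cache geometry, and the real timing model is hardware written in Bluespec.

```python
class TagOnlyCache:
    """Direct-mapped cache timing model: stores tags, never data."""

    def __init__(self, num_sets, line_bytes):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.tags = [None] * num_sets   # one tag per set, no data array
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False


cache = TagOnlyCache(num_sets=4, line_bytes=64)
results = [cache.access(a) for a in (0, 32, 256, 0)]
```

Addresses 0 and 32 share a line, so the second access hits; address 256 maps to the same set with a different tag and evicts it, so the final access to 0 misses again. Hit and miss outcomes, which are all the timing model needs, are fully determined without storing any data.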
A Timing Model
[Figure: timing model block diagram fed by the Functional Model: iTLB and iCache, Align & Pick, three Decode units, four Sched units, dTLB and dCache, L2 Cache, and memory & I/O timing models]
Can Easily Execute in Parallel in Hardware
Trace-Driven Simulation?
  Described a trace-driven simulator
    Accurate if “functional path” == “target path”
      “Ideal” processor
  Unfortunately, functional path is not always equal to timing path
    Even for simple pipelines
    Real processors speculatively execute, assuming their target path is the functional path
Branch Prediction
[Figure: timing model block diagram fed by the Functional Model: iTLB and iCache, Align & Pick, three Decode units, four Sched units, dTLB and dCache, L2 Cache, and memory & I/O timing models]
  Wrong-path instructions!
    Implement BP in timing model
    Timing model forces ISA simulator to mis-speculate
      Rollback, restore
  Branch predictor predictor in ISA simulator?
    BP only works in processor if it’s fairly accurate
  FAST simulators take advantage of the fact that most of the time micro-architecture is on the right path
    Most complexity (BP, parallelism) can be handled this way
Speculative Simulation
  FAST simulators themselves are speculative
    Speculate functional path == target path
    Timing model detects functional path != target path
      Forces functional model down wrong path OR
      Returns functional model to functional path
  Speculation reduces need for round-trip communication
    Makes FAST latency tolerant (good for parallelization)
  Functional model reusable
    Timing model determines target behavior
FAST Tool Flow
[Figure: FAST tool flow with boxes for: functional model specification, instruction decode, Microcode Generator, functional model compiler (gcc), Functional Model (software and/or FPGA), RTL, RTL-to-Timing Model, timing model specification (modules + connector), Timing Model Compiler (Bluespec), timing model (RTL/C), and Timing Model (FPGA or software)]
Will be using Intel AWB for stats, etc.
Performance Details
  Functional model
    Unoptimized modified QEMU
  Driving an FPGA-based timing model
    Standard producer/consumer parallel communication problems
    469ns blocking round-trip from Opteron to FPGA
    Poll every other basic block
  BP is a simple gshare predictor
  Timing model (unoptimized)
    100MHz host cycle right now (not pushing timing)
    Currently taking ~30 host cycles per target cycle, max about 54 cycles (currently max latency defines target clock)
    Complex timing model fits into single FPGA
Outline
Motivation
FAST: Parallelized complex core, full-system simulator
RAMP-White: Parallel hosts running parallel targets
RAMP-White
  Requirements
    Coherent shared memory experimental platform
      Configurable coherence protocol, engine
    Scalable to the same level as other RAMP machines
      1K eventual target
      Down to 2
    Full system (OS, I/O, etc.)
  Intentions
    ISA/Architecture independent (like all RAMP efforts)
      Use different cores
    Integrate components from other RAMP participants
      A test-bed for sharing IP
Texas Modifications to RAMP-White
  New code in Bluespec rather than Verilog/VHDL
    Many advantages including interfaces, configurability
    My group’s hardware development is exclusively Bluespec
    Free/low cost to academics (www.bluespec.com)
  Start with XUP board
    We had XUP before BEE2
  Support Leon3 and PowerPC
    Started with embedded PowerPC but Linux, non-MP core issues
    Restarted with Leon3
      Linux, MP-capable core
      Get RAMP-White infrastructure to work
    Plan to port back to embedded PowerPC, easy to move to soft PowerPC
High-Level Architecture Philosophy
  Flexibility
    Avoid wasted work
    Easy changes
    Module-agnostic
      Processors, network, I/O, etc.
  Interfaces
    Complete set of necessary interfaces
    All communication via messages
      Fixed fields, but fields are configurable
    “shims” connect components to White infrastructure
      Use existing IP
RAMP-White Block Diagram
[Figure: a RAMP-White node. The Processor and Coherent $ (processor dependent) attach to the Intersection Unit (IU), which connects to the Memory Controller (MC), IO & Platform Devices, and the Network Interface (NIU); the NIU connects to the Network Router.]
Three Phase Approach to Hardware
  H1: Incoherent shared memory
    No hardware global cache, just global shared memory support
      Optional cache for local memory
    However, software can maintain coherence if necessary
      Network virtual memory
      Run a simulator on top of the processor
    Ring network
  H2: Ring-based coherence (scalable bus)
    Requires a coherent cache, IU awareness
    Running what is essentially a snoopy protocol
      True coherence engine not required
      But, very restricted communication
    Sufficient for testing, modeling many targets
  H3: General network-based coherence
    Requires general coherence engine, general network
  H4: Different cores
[Figure: node diagrams for the phases, built from processor (P), caches ($), coherent cache (C $), IU, MC and I/O blocks]
Operating System
  Issues with SMP OS on embedded PowerPC
    Incoherent cache
    Load-reservation/store-conditional instructions not MP capable
    Also missing TLB Invalidation & OpenPIC (interprocessor interrupts, bring-up)
    How scalable anyways? (1K processors)
  Four phase approach using RAMP-White hardware
    O1: Separate OS per core (PowerPC) working for XUP
      Region of memory is global (mmap)
      Locks using regular loads/stores + sequential consistency
    O2: SMP OS on Leon3
      Use RAMP-White scalable hardware across multiple FPGAs
      Snoopy cache
    O3: SMP OS on Leon3 using directory cache
    O4: SMP OS on PowerPC port
Programmer View
  Sequential consistency
    PowerPC
      Global addresses labeled as uncached
        Ordered accesses from PowerPC 405
      Coherent global cache still uncached from processor
    Soft cores can be weaker
  User interface
    Terminal per core/OS if desired
    Mmap to map shared memory
H1/O1 RAMP-White
  Hari Angepat did the work
  Components
    Written in Bluespec
    NIU code complete and tested
      2 processor ring (PowerPC)
    IU code complete and tested
      Processor Slave (no coherence right now)
      PLB Master/slave interface (I/O)
      NIU interface
    Hardware intended to target different ISAs
      PLB master and slave shims written
  Some preliminary OS work
    Multi-image mmap interface running
Current RAMP-White Phase 1
[Figure: two nodes, each with a PPC 405/Leon3 processor running Linux, an Intersection Unit (IU) and a Network Interface (NIU); the diagram also shows a PLB shim, the Memory Controller (MC), and IO & Platform Devices]
Our Long Term Plans
  H1/O1, XUP working June 2007
    PowerPC
    With multi-OS, limited device support
  H2/O2 BEE2, Leon3
    Coherent cache, IU forwarding modifications
    SMP
  H3/O3 BEE2/BEE3, Leon3
    Arbitrary network, cache coherency engine
      Getting network from Washington, Berkeley
  H4/O4 BEE2/BEE3, PowerPC
    Port back to PowerPC
RAMP Conclusions
  RAMP-White architecture
    Phased approach minimizes wasted work
    Designed to be easy to modify for your purpose
      Many architectures only require modified coherence engine, maybe cache
    ISA/implementation agnostic
      Care taken to not be specific
  RAMP-White Phase 1 works
    Running on XUP
  Phase 2 close to working
    Running on BEE2
Future Work: FAST + RAMP-White
  FAST is currently not scalable
    Need parallel functional models and parallel timing models
      Parallel functional model runs on parallel host
      Maintain speed
  Intention is to run FAST on top of RAMP-White
    Modify soft core to support checkpoint, rollback in hardware
  FAST provides cycle-accurate simulation of complex cores
  RAMP provides high performance, scalable host
  FAST+RAMP provides scalable complex core simulator
Acknowledgements
  Students
    Dam Sunwoo (FM)
    Joonsoo Kim (TM, Interface)
    Nikhil Patil (TM, tools)
    Bill Reinhart (TM connector)
    Hari Angepat (RAMP-White)
    Eric Johnson (verification, Linux)
  Funding
    DOE Early Career, NSF, SRC
    Intel, IBM, Xilinx, Freescale
  Software
    Bluespec
    Open-source full system simulators
      QEMU, Bochs
Extra slides
What is in a Trace?
  Conceptually, everything a functional model can produce
    Flattened opcode, virtual/physical address of instruction, virtual/physical address of data, source registers, destination registers, condition code source and destination registers, exceptions, etc.
  Can be heavily compressed
    E.g., simulator TLB to avoid physical address
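As a rough picture of such a trace entry, the Python sketch below defines a record with the fields listed on the slide and shows the compression example: physical addresses are left out of the trace and recovered on the receiving side from a TLB-like table. The field names, the 4 KB page size, and the dictionary used as a TLB are assumptions for illustration, not the FAST trace format.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

PAGE_BITS = 12   # assumed 4 KB pages


@dataclass(frozen=True)
class TraceRecord:
    opcode: int                        # flattened opcode
    inst_vaddr: int                    # virtual address of the instruction
    inst_paddr: Optional[int] = None   # omitted when the trace is compressed
    data_vaddr: Optional[int] = None   # virtual address of data, if any
    data_paddr: Optional[int] = None   # omitted when the trace is compressed
    src_regs: Tuple[int, ...] = ()
    dst_regs: Tuple[int, ...] = ()
    exception: Optional[int] = None


def to_physical(vaddr, tlb):
    """Translate with a {virtual page: physical page} table."""
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    return (tlb[vaddr >> PAGE_BITS] << PAGE_BITS) | offset


def decompress(record, tlb):
    """Fill in the physical addresses that were left out of the trace."""
    data_paddr = None
    if record.data_vaddr is not None:
        data_paddr = to_physical(record.data_vaddr, tlb)
    return replace(record,
                   inst_paddr=to_physical(record.inst_vaddr, tlb),
                   data_paddr=data_paddr)


tlb = {0x400: 0x123, 0x7FF: 0x055}
compressed = TraceRecord(opcode=1, inst_vaddr=0x400ABC, data_vaddr=0x7FF010,
                         src_regs=(1, 2), dst_regs=(3,))
full = decompress(compressed, tlb)
```

Only TLB updates need to cross the link in this scheme; the per-instruction physical addresses are reconstructed locally.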
Modules
  Modules predict what happens each cycle
    Which instruction is scheduled?
  Modules hierarchically constructed
    CAM, FIFOs, arbiters, etc.
    Branch predictors, Caches, TLBs, Schedulers, ALUs
    Fetch, Decode, Rename, RS, ROB
  Many are essentially wires (e.g., ALU)
  Often written to execute one operation (simplifies)
    E.g., Rename, Cache
    Executed multiple times per target cycle for wider processor, higher associativity
    Simplifies implementation, tradeoff time for space
Joonsoo Kim, Nikhil Patil
Connector Interface
  Commit
    Indicates module is done with target cycle
  Done
    Indicates Connector is done with target cycle
  Enq, First, Deq
    Just like FIFO
[Figure: a connector with Enq, Commit and Done on its input side and Deq, Commit, First and Done on its output side]
Bill Reinhart
32b Address in Shared Memory Machine??
  4GB possible per BEE2 FPGA
    Need more than 32b
  Eventually, hope for 64b soft-core processors
  For now two options: live with 4GB space
  Or, provide one more layer of translation
    Physical address in certain region is global virtual address
    Translated by hardware to node + physical address
  Also useful for multiple OSs in single memory
    OSs tend to assume they own physical address 0
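The extra translation layer can be illustrated with a few lines of Python. Everything numeric here is a placeholder: the slide does not give the region base, its size, or how a global address is split between node number and node-local address, so the constants and the even per-node split below are assumptions.

```python
# Hypothetical layout: a 1 GB window of the 32-bit physical address space
# holds global addresses, divided evenly into 16 MB slices, one per node.
GLOBAL_BASE = 0x8000_0000
GLOBAL_SIZE = 0x4000_0000
NODE_WINDOW = 0x0100_0000


def translate(paddr):
    """Map a processor physical address to a local or global destination.

    Returns ("local", paddr) for ordinary addresses, or
    ("global", node, node_paddr) for addresses inside the global region.
    """
    if not (GLOBAL_BASE <= paddr < GLOBAL_BASE + GLOBAL_SIZE):
        return ("local", paddr)
    node, node_paddr = divmod(paddr - GLOBAL_BASE, NODE_WINDOW)
    return ("global", node, node_paddr)


local = translate(0x0000_1000)
remote = translate(GLOBAL_BASE + 3 * NODE_WINDOW + 0x20)
```

Each OS still sees its own memory starting at physical address 0, while accesses that fall in the global window are routed to a node and a node-local address.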
Node Architecture
[Figure: node diagrams built from processor (P), caches ($), coherent cache (C $), IU, MC and I/O blocks]
Generalized Architecture
[Figure: generalized node with Proc, $, PLB, Intersection Unit (IU), Network Interface Unit (NIU), MC, Mem and OPB bridge blocks, divided into a processor-dependent part and a processor-independent part]
Intersection Unit
  Processor interface
    Slave
    Snoop
  Network interface
    Master (send)
    Slave (receive)
  Memory interface
    Master (issue memory requests)
  Hooks for coherency engine
    Bluespec nice to specify coherence engine
    Incoherent version is a special case
  Programmable memory regions
    Global (local and remote)
    Local translation
[Figure: RAMP-White block diagram: Processor, Coherent $, Intersection Unit (IU), Memory Controller (MC), IO & Platform Devices, Network Interface (NIU)]
Network Interface Unit
  Currently two virtual channels
  Split into two components
    Msg composition/Queuing
    Net transmit/receive
      Insert/extract for ring
      Intended to permit other net-specific transmit/receive
  One input/one output
    Creates a simple unidirectional ring
    Can interface to more advanced fabrics
[Figure: RAMP-White block diagram: Processor, Coherent $, Intersection Unit (IU), Memory Controller (MC), IO & Platform Devices, Network Interface (NIU)]
Sharing IP: Some Preliminary Experience
  We looked at RAMP-Red XUP
    Used some code (PLB master)
    Red-BEE is not ready to distribute
  Looking for switch code
  Berkeley’s code on CVS repository
    But, we can’t use memory controller because we don’t have BEE2 board yet
  Bluespec
    We are spinning almost all of our own code right now
  Would like to steal software
    OS (kernel proxy)
    SMP OS port
  Naming
    MPI reference design in BEE2 repository
      Is that RAMP-Blue?
  A central CVS repository for RAMP code?
Sharing Over the Long Term
  Processor is shared
    Leon
    PowerPC
    MicroBlaze
  Everything else
    MC is shared
      Xilinx or Berkeley
    Coherent cache can be shared
      Transactional/traditional
      Borrow Stanford’s?
    Coherency engine can be shared
      CMU/Stanford
    IU functionality can be shared
      Trying to make ours general
    NIU can be shared
      Borrow half from Berkeley?
    Network can be shared
      Borrow Berkeley’s?
[Figure: node blocks: Proc, $, IU, NIU, MC, Mem, Peripherals, CCE]
IU Internal Message
  Defaults
    PRI: High priority, Low priority
    CMD: Read, Write, Coherence, …
    PERM: Modified, Exclusive, Shared, Invalid
    SIZE: Byte, word, double word, cache-line
    GADDR: global address (translated by IU)
    DATA: dependent on size
  Bluespec permits easy modification for your protocol
  Message layout
    PRI | CMD | PERM | SIZE | TAG
    GADDR
    DATA
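The default fields above map naturally onto a small record type. The Python sketch below is one possible rendering, not the Bluespec definition: the enum encodings, the byte counts attached to each SIZE value (including the 32-byte cache line), and the rule that only writes are checked for payload length are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Pri(Enum):
    HIGH = 0
    LOW = 1


class Cmd(Enum):
    READ = 0
    WRITE = 1
    COHERENCE = 2


class Perm(Enum):
    MODIFIED = 0
    EXCLUSIVE = 1
    SHARED = 2
    INVALID = 3


class Size(Enum):
    # Byte counts are assumed for illustration.
    BYTE = 1
    WORD = 4
    DOUBLE_WORD = 8
    CACHE_LINE = 32


@dataclass
class IUMessage:
    pri: Pri
    cmd: Cmd
    perm: Perm
    size: Size
    tag: int
    gaddr: int           # global address, translated by the IU
    data: bytes = b""    # payload length depends on size

    def __post_init__(self):
        # A write must carry exactly as many bytes as its SIZE field says.
        if self.cmd is Cmd.WRITE and len(self.data) != self.size.value:
            raise ValueError("write payload does not match SIZE")


msg = IUMessage(Pri.LOW, Cmd.WRITE, Perm.MODIFIED, Size.WORD,
                tag=7, gaddr=0x1000, data=bytes(4))
```

Keeping the field set fixed while leaving the enumerations open is one way to read "fixed fields, but fields are configurable" from the architecture philosophy slide.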
Network Message
  PRI: High and Low
  DEST, SRC: destination, source of message
  SIZE: Total message size
  NETTAG: network tag (optional)
  CMD: network command (optional)
  MESSAGE: data
[Figure: message layout with fields PRI, DEST, SRC, NETTAG, CMD, SIZE and MESSAGE]
Intersection Unit Internals
[Figure: the Intersection Unit Controller with Proc, IO and Net ports on both sides, controller BRAMs, global address translation hardware, and the Memory Controller & DRAM]
Fast, Full-System, Cycle-Accurate Computer Simulators via Parallelization
Derek Chiou
University of Texas at Austin
Electrical and Computer Engineering
My Ideal Computer Simulator
  Fast: as fast as possible
    2-3 orders of magnitude slower than target?
    Fast enough to run real datasets to completion
    Interactive?
  Timely: enough time to make decisions
  Accurate: produce cycle-accurate numbers
  Complete: run unmodified operating systems, applications, …
  Transparent: full visibility, no performance hit
  Inexpensive: need thousands
  Flexible: quick changes, generate from RTL
Current Software Simulators
  Performance-Detail tug-of-war
    Higher performance means less simulation tasks
    Higher detail means more simulation tasks

  Simulator               Slowdown    Sim 2 min
  ISA (SimNow, Simics)    1-100       10 min
  Caches                  10-1000     10 hrs
  OOO (ASim ~1-10KIPS)    10K-1M      1 year
  RTL                     100M-1B     2000 years
  CMP?                    ???         ???
Why So Slow?
  Complexity
    40 entry fully-assoc iTLB
    40 entry fully-assoc dTLB
    16-way L2 cache
    Many schedulers
    Decoder
    I/O
  Modeling PARALLELISM
    Parallelize simulator?
  Sampling, benchmarking are orthogonal
1/17/08 RAMP Retreat 84
Test of size
Outline
  Motivation
  How to Parallelize?
  Functional Models and Timing Models
  Status, Conclusions
Parallelizing Simulators
  Software
    Starting from 10KHz sequential simulator, 10MHz performance requires 1K processors
      Impractical
    If a software simulator could be parallelized, use same techniques to make the target faster?
  Hardware
    It’s what processors are made from
    Plenty of parallelism
    Difficult to implement?
    FPGAs for configurability
Modeling Complex Processors on FPGA(s)
  Compile RTL for FPGA
    Pentium fits in large FPGA
      3.1M transistors (Lu, Intel)
  Issues
    Fit: Core 2 in a single FPGA?
      Impossible: Processors grow as fast as FPGAs
      Multiple FPGAs to model one processor
    Full RTL required
      A lot of RTL
      Difficult to cover all cases
      Difficult to modify
  [Figure: FPGA, Pentium, Core 2, and an array of FPGAs]
Strawman Solution
  Partition
    Break problem down into multiple pieces
    Target module boundaries obeyed?
  Replace some/all RTL with pre-written modules and/or easier to write (behavioral/software) code
  Hybrid simulation
    Partitioned simulator, each partition running in potentially different host technology
Simple Module-based Software/Hardware Partitioning
  Partitioning on module boundaries over FPGAs and software
    Simplescalar + FPGA L1 DCache (Suh, WARFP 2006)
  [Figure: pipelined processor datapath (PC, instruction $/memory, GPR file, immediate extend, ALU, data $/memory, bypass)]
  IS THERE A BETTER PARTITIONING?
Our Partitioning: Functionality/Timing Boundaries
  Proven Software Partitioning
    Asim, FastSim, etc.
  Factorized, not partitioned!
    Functional simulators exist
    Timing model becomes very simple
    Promotes reuse
  Functional/Timing
    Simplifies timing model
    Latency tolerant
  Separate FM & TM
    Software functional model?
    Hardware timing model?
    Balance time/space
  [Figure: Functional Model (ISA + peripherals: instructions, architectural registers, peripheral functionality, …) passes an inst trace to the Timing Model (micro-architecture: caches, arbitration, pipelining, associativity, …)]
Partition on ISA/Timing
  Proven Partitioning
    Asim, Simplescalar, Timing-First, FastSim, etc.
  Simplifies simulator. Promotes reuse
  Same performance in software
    Asim at 10KHz
    Most of the time spent in timing model!
  Hardware???
  [Figure: Functional Model (ISA: instructions, architectural registers, peripheral functionality, …) passes an inst trace to the Timing Model (micro-architecture: fetch, decode, rename, reservation stations, scheduling window, reorder buffer, …)]
FAST
  [Figure: Functional Model (ISA) passes an inst trace to the Timing Model (micro-architecture); the diagram is labeled FPGA]
What is in a Trace?
  Conceptually, everything a functional model can produce
    Flattened opcode, virtual/physical address of instruction, virtual/physical address of data, source registers, destination registers, condition code source and destination registers, exceptions, etc.
  Can be heavily compressed
    E.g., simulator TLB to avoid physical address
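One way to read the TLB compression idea: the sender omits the physical address whenever the receiver can already derive it from a page mapping it has seen before. The sketch below is an assumption-laden illustration, not the FAST trace format: the `TraceEntry` fields are a subset of those listed on the slide, and 4 KiB pages and all class names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

PAGE_BITS = 12  # assumed 4 KiB pages

@dataclass(frozen=True)
class TraceEntry:
    opcode: int
    inst_vaddr: int
    inst_paddr: Optional[int]          # None means "recover it from the TLB"
    src_regs: Tuple[int, ...] = ()
    dst_regs: Tuple[int, ...] = ()
    exception: Optional[str] = None

class TraceCompressor:
    """Sender side: drop the physical address when the mapping is already known."""
    def __init__(self):
        self.known = {}                # virtual page -> physical page

    def compress(self, e: TraceEntry) -> TraceEntry:
        vpn, ppn = e.inst_vaddr >> PAGE_BITS, e.inst_paddr >> PAGE_BITS
        if self.known.get(vpn) == ppn:
            return replace(e, inst_paddr=None)
        self.known[vpn] = ppn          # first use (or remap): send it in full
        return e

class TraceDecompressor:
    """Receiver side: a simulator TLB that rebuilds omitted physical addresses."""
    def __init__(self):
        self.tlb = {}

    def decompress(self, e: TraceEntry) -> TraceEntry:
        vpn = e.inst_vaddr >> PAGE_BITS
        if e.inst_paddr is None:
            offset = e.inst_vaddr & ((1 << PAGE_BITS) - 1)
            return replace(e, inst_paddr=(self.tlb[vpn] << PAGE_BITS) | offset)
        self.tlb[vpn] = e.inst_paddr >> PAGE_BITS
        return e
```

Only the first instruction touching a page carries its physical address; later entries on the same page travel without it.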
What is a FAST Timing Model?
  [Figure: pipelined datapath (instruction memory, GPR file, immediate extend, ALU, data memory, bypass/interlock) driven by the trace]
  Stats gathering in hardware => no performance impact
More Complexity
  Caches/TLBs?
    Keep tags, pass address (virtual and physical if necessary)
    Hits, misses determined but don’t need data
  Superscalar (multiple issue)?
    “Fetch and issue” multiple instructions assuming they meet boundary constraints
    Multiple “functional units”
    Schedulers
    Reorder buffer/instruction window
  Pipeline control along with instructions
  NO DATAPATH, only part of control path
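The "keep tags, don't need data" point can be illustrated with a cache model that stores only tags and reports hit or miss. This is a sketch under assumptions: the LRU replacement policy and the class name are not from the slides.

```python
from collections import OrderedDict

class TagOnlyCache:
    """Set-associative cache timing model: tags only, no data array."""
    def __init__(self, size_bytes: int, ways: int, line_bytes: int):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        # One ordered dict per set; key order tracks recency (assumed LRU).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, paddr: int) -> bool:
        """Return True on hit, False on miss; update tags and counters."""
        line = paddr // self.line_bytes
        index, tag = line % self.num_sets, line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.move_to_end(tag)      # mark most recently used
            self.hits += 1
            return True
        if len(tags) >= self.ways:
            tags.popitem(last=False)   # evict least recently used tag
        tags[tag] = None
        self.misses += 1
        return False
```

For example, `TagOnlyCache(32 * 1024, 4, 16)` matches the 32KB 4-way, 16B-line configuration quoted in the FPGA resource slide.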
A Timing Model
  [Figure: Functional Model feeding a timing model made of iTLB, iCache, Align & Pick, three Decode units, four Sched units, dTLB, dCache, L2 Cache, and memory & I/O timing models]
Driving a Timing Model
  [Figure: the timing model (iTLB, iCache, Align & Pick, Decode, Sched, dTLB, dCache, L2 Cache, memory & I/O timing models) driven by the Functional Model]
  Can Easily Execute in Parallel in Hardware
Trace-Driven Simulation?
  What we’ve described is a trace-driven simulator
  Accurate if “functional path” == “target path”
    “Ideal” processor
  Unfortunately, functional path is not always equal to timing path
    Real processors speculatively execute, assuming their target path is the functional path
Branch Prediction
  [Figure: timing model (iTLB, iCache, Align & Pick, Decode, Sched, dTLB, dCache, L2 Cache, memory & I/O timing models) with the Functional Model]
  Wrong-path instructions!
    Implement BP in timing model
    Timing model forces ISA simulator to mis-speculate
    Rollback, restore
  Branch predictor predictor in ISA simulator?
  BP only works in processor if it’s fairly accurate
  FAST simulators take advantage of the fact that most of the time micro-architecture is on the right path
    Most complexity (BP, parallelism) can be handled this way
Speculative Simulation
  FAST simulators themselves are speculative
    Speculate functional path == target path
  Timing model detects functional path != target path
    Forces functional model down wrong path OR
    Returns functional model to functional path
  Speculation reduces need for round-trip communication
    Makes FAST latency tolerant (good for parallelization)
  Functional model reusable
    Timing model determines target behavior
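To see why speculation reduces round-trip communication, the sketch below counts how many times a timing model would have to message the functional model for a given trace. Everything here is a toy assumption: fixed 4-byte instructions, a predictor that always predicts fall-through, and two messages per mispredicted branch (force wrong path, then return to the functional path).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Inst:
    pc: int
    next_pc: int            # functional (correct) successor
    is_branch: bool = False

def predict_next(inst: Inst) -> int:
    """Hypothetical timing-model predictor: always predict fall-through."""
    return inst.pc + 4      # assumes fixed 4-byte instructions

def resteer_messages(trace) -> int:
    """Count timing-model-to-functional-model messages needed for a trace."""
    messages = 0
    for inst in trace:
        if inst.is_branch and predict_next(inst) != inst.next_pc:
            messages += 2   # force down wrong path, later restore
    return messages

trace = [
    Inst(0, 4),
    Inst(4, 8),
    Inst(8, 100, is_branch=True),   # taken branch, mispredicted
    Inst(100, 104),
]
print(resteer_messages(trace), "messages for", len(trace), "instructions")
```

When the functional path equals the target path, the count is zero and the functional model streams ahead without waiting on the timing model.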
A Parallel Simulator
  Functional runs in parallel with timing
  Functional model can run in parallel
    OoO superscalar processor?
  Timing model runs in parallel in hardware
    Hard to not run in parallel
    Every register-to-register transition could potentially be parallelized
Outline
  Motivation
  How to Parallelize?
  Functional Models, Timing Models and Tools
  Status, Related Work, Conclusions
Functional Models
  Hardware functional models
    Difficult to make complete (use Protoflex methods?)
    Size
  Software functional models exist today
    Fast
    Full system
    Very usable
    Bochs, QEMU, Simics, SimNow, SimOS, etc.
    Runs on fastest hardware we know about to execute an ISA
  Can we use them?
    What modifications?
Functional Model Modifications
  Need to
    Rollback, force branch
    Rollback, restore and continue
  How?
    set_pc(inst_num, pc)
      Set a particular dynamic instance of an instruction to a particular instruction pointed to by PC
    Sufficient
    Can be implemented with checkpoints
      ISA state, memory, peripherals
  [Figure: instruction stream with branch (BR) points]
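A minimal sketch of `set_pc(inst_num, pc)` implemented with checkpoints. The toy ISA (each instruction adds a constant to one register), the per-instruction checkpoint granularity, and the class name are all assumptions; a real functional model would also checkpoint memory and peripherals, as the slide notes.

```python
import copy

class ToyFunctionalModel:
    def __init__(self, program, start_pc=0):
        self.program = program    # pc -> (dest_reg, increment, next_pc)
        self.pc = start_pc
        self.regs = {}
        self.inst_num = 0         # index of the next dynamic instruction
        self.checkpoints = {}     # inst_num -> register state before it ran

    def step(self):
        """Execute one instruction to completion and return its trace record."""
        self.checkpoints[self.inst_num] = copy.deepcopy(self.regs)
        dest, inc, next_pc = self.program[self.pc]
        self.regs[dest] = self.regs.get(dest, 0) + inc
        record = (self.inst_num, self.pc)
        self.pc = next_pc
        self.inst_num += 1
        return record

    def set_pc(self, inst_num, pc):
        """Roll back so that dynamic instruction inst_num is fetched from pc."""
        self.regs = copy.deepcopy(self.checkpoints[inst_num])
        # Discard checkpoints of instructions that are now squashed.
        self.checkpoints = {n: s for n, s in self.checkpoints.items() if n < inst_num}
        self.inst_num = inst_num
        self.pc = pc
```

The same call covers both cases on the slide: forcing a wrong path (pass the mispredicted target) and restoring the correct path (pass the resolved target).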
Our Functional Model
  Derived from QEMU
    Fast
    Boots Linux, Windows
    x86, x86-64, PowerPC, Sparc, ARM, MIPS, …
      We support x86 for now
  Added tracing, checkpoint, rollback
  Runs on x86 hosts as well as embedded PowerPC host inside FPGA
  PowerPC in the works
  Dam Sunwoo
Modular Timing Models
  Modules
    Models timing functionality
    Rename, caches, etc.
  Connectors
    Attach modules together
    Abstract timing from modules
      Throughput (input, output), delay, maxTransactions
    Stats and tracing
Modules
  Modules predict what happens each cycle
    Which instruction is scheduled?
  Modules hierarchically constructed
    CAM, FIFOs, arbiters, etc.
    Branch predictors, Caches, TLBs, Schedulers, ALUs
    Fetch, Decode, Rename, RS, ROB
  Many are essentially wires (e.g., ALU)
  Often written to execute one operation (simplifies)
    E.g., Rename, Cache
    Executed multiple times per target cycle for wider processor, higher associativity
    Simplifies implementation, tradeoff time for space
  Joonsoo Kim, Nikhil Patil
Connector Interface
  Commit
    Indicates module is done with target cycle
  Done
    Indicates Connector is done with target cycle
  Enq, First, Deq
    Just like FIFO
  [Figure: Connector with Enq, Commit, Done on the producer side and Deq, First, Commit, Done on the consumer side]
  Bill Reinhart
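The FIFO-plus-commit behavior can be sketched as a software model of a Connector. The semantics below are assumptions made for illustration: an entry becomes visible `delay` target cycles after it is enqueued, the target cycle advances once both sides have committed, throughput limits are ignored, and the `Done` signal is left out.

```python
from collections import deque

class Connector:
    def __init__(self, delay: int = 1, max_transactions: int = 4):
        self.delay = delay                      # minimum delay in target cycles
        self.max_transactions = max_transactions
        self.cycle = 0                          # current target cycle
        self.queue = deque()                    # (ready_cycle, item)
        self.producer_committed = False
        self.consumer_committed = False

    def can_enq(self) -> bool:
        return len(self.queue) < self.max_transactions

    def enq(self, item) -> None:
        if not self.can_enq():
            raise RuntimeError("connector full")
        self.queue.append((self.cycle + self.delay, item))

    def can_deq(self) -> bool:
        return bool(self.queue) and self.queue[0][0] <= self.cycle

    def first(self):
        if not self.can_deq():
            raise RuntimeError("nothing visible this target cycle")
        return self.queue[0][1]

    def deq(self):
        item = self.first()
        self.queue.popleft()
        return item

    def commit(self, producer: bool) -> None:
        """A module signals it is done with the current target cycle."""
        if producer:
            self.producer_committed = True
        else:
            self.consumer_committed = True
        if self.producer_committed and self.consumer_committed:
            self.cycle += 1
            self.producer_committed = self.consumer_committed = False
```

Because the latency lives in the connector, a module such as an ALU can simply pass entries through and let `delay` define its timing.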
Current Timing Model
  [Figure: timing model block diagram with Fetch, BP, D, Rename/ROB, R/S, BRU, ALU, LdSt Queue, ResultBus, TLB, iL1, dL1, L2, MEM; connections annotated with the values 25, 8, and 1]
FAST Tool Flow
  [Figure: tool flow. Boxes: functional model specification, functional model compiler (gcc), functional model, Functional Model (software and/or FPGA), instruction decode, Microcode Generator, timing model specification (modules + connector), Timing Model Compiler (Bluespec), timing model (RTL/C), Timing Model (FPGA or software), RTL, RTL-to-Timing Model]
  Will be using Intel AWB for stats, etc.
Outline
  Motivation
  How to Parallelize?
  Functional Models and Timing Models
  Status, Related Work, Conclusions
Execution Platforms
  DRC Computer
    Opteron + FPGA on HT
  Xilinx development boards
    XUP, ML-310 (embedded PowerPC)
  Research Accelerator for MP (RAMP)
    A shared infrastructure for MP development
    Uses BEE2/BEE3 FPGA boards
      4/5 large FPGAs and fast off-board I/O
    Large collaboration between Berkeley, CMU (Hoe), MIT, Stanford, Texas, Washington and Intel
Simulator Performance on DRC
  [Figure: chart of simulator speed in MIPS (axis 0 to 3.5) for gshare, BP 97%, and BP 100%]
  Includes Operating System Code
Comparison to Related Work
  [Figure: chart of FAST speedup (axis ticks 1, 10, 100, 1000); x-axis labels: Intel Asim AMD GEMS Freescale IBM Mambo PTLSim sim-outorder FAST]
Acknowledgements
  Students
    Dam Sunwoo (FM)
    Joonsoo Kim (TM, Interface)
    Nikhil Patil (TM, tools)
    Bill Reinhart (TM connector)
    Eric Johnson (verification, Linux)
  Funding
    DOE Early Career, NSF, SRC
    Intel, IBM, Xilinx, Freescale
  Software
    Bluespec
    Open-source full system simulators
      QEMU, Bochs
Conclusions
  Fast:
    Expect 2MIPS-10MIPS soon
    2-4 orders of magnitude slower than target
    Interactive
  Timely: quick model building
  Accurate: produce cycle-accurate numbers
  Complete: runs Linux/Windows + apps now, compile apps on standard machines
  Transparent:
    Hardware stats provides full visibility, no performance hit
  Inexpensive:
    $300/$600 XUP boards can fit many timing models
    $9K (academic) DRC is still cost effective per cycle
  Flexible: tools
Backup Slides
Connectors
  Abstract timing information from modules
    Cannot do so perfectly (state within modules)
  Characteristics
    Throughput (input and output), minimum delay, maximum outstanding
    Entries are equal and fully shiftable
      If you have distinct lanes, need separate connectors
  Provide stats gathering, tracing
    512 entry trigger-based trace
    Stats funneled BACK through connector
      helps with place-and-route (Nikhil Patil)
  Bill Reinhart
Example Connector Interface Code (in ALU Module)

    module mkALU#(ConsumerPort#(Tuple2#(Maybe#(PReg_t), ROBTag_t)) inQ,
                  ProducerPort#(Tuple2#(Maybe#(PReg_t), Execute2ROB_t)) outQ)(ALU);

      rule pass;
        inQ.deq;
        match { .dest, .robtag } = inQ.first;
        outQ.enq(tuple2(dest, Execute2ROB_t{robtag: robtag, exception: Invalid}));
      endrule

      rule done (inQ.done || outQ.done);
        inQ.commit(True);
        outQ.commit(True);
      endrule

    endmodule

  ALU latency defined by Connector min-delay!
FAST Tool Flow
  [Figure: tool flow. Boxes: functional model specification, functional model compiler (gcc), functional model, Functional Model (software and/or FPGA), instruction decode, Microcode Generator, timing model specification (modules + connector), Timing Model Compiler (Bluespec), timing model (RTL/C), Timing Model (FPGA or software), RTL, RTL-to-Timing Model]
  Will be using AWB for stats, etc.
Microcode Compiler
  Uses the LLVM Compiler Infrastructure developed at UIUC (http://www.llvm.org)
  Compile the Bochs CPU (Bochs is another portable x86 full-system emulator.)
  Backend retargeted to the micro-op “ISA”
  Intention is to automate generation of new ISA instructions, automatically retarget new micro-architectures
  Generates microcode that “runs” on the timing model
  Nikhil Patil
Example Microcode Compile
  "ADD r/m32, r32" instruction.

    void BX_CPU_C::ADD_EdGd(bxInstruction_c *i) {
      Bit32u op2_32, op1_32, sum_32;
      op2_32 = BX_READ_32BIT_REG(i->nnn());
      if (i->modC0()) {
        op1_32 = BX_READ_32BIT_REG(i->rm());
        sum_32 = op1_32 + op2_32;
        BX_WRITE_32BIT_REGZ(i->rm(), sum_32);
      }
      else {
        read_RMW_virtual_dword(i->seg(), RMAddr(i), &op1_32);
        sum_32 = op1_32 + op2_32;
        write_RMW_virtual_dword(sum_32);
      }
      SET_FLAGS_OSZAPC_32(op1_32, op2_32, sum_32, BX_INSTR_ADD32);
    }

  When ModRM != 0xC0:
    %u0 = ADDRGEN
    LOADd %v0:[%u0] -> %u1
    cc,%u1 = add %u1, %nnn
    STOREd %v0:[%u0] <- %u1
  When ModRM == 0xC0:
    cc,%rm = add %rm, %nnn
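The two microcode sequences above differ only in whether the ModRM byte selects a register or a memory operand. The sketch below shows that selection in isolation. It is not the microcode compiler, which works from the Bochs source through LLVM; the function name is hypothetical, and the test on the top two ModRM bits is the standard x86 register-direct check that `modC0()` corresponds to.

```python
from typing import List

def expand_add_ed_gd(modrm: int) -> List[str]:
    """Return the micro-op sequence for ADD r/m32, r32 given its ModRM byte."""
    if (modrm & 0xC0) == 0xC0:
        # mod == 11b: register destination, single micro-op
        return ["cc,%rm = add %rm, %nnn"]
    # Memory destination: address generation, load, add, store
    return [
        "%u0 = ADDRGEN",
        "LOADd %v0:[%u0] -> %u1",
        "cc,%u1 = add %u1, %nnn",
        "STOREd %v0:[%u0] <- %u1",
    ]

for modrm in (0xC3, 0x03):
    print(hex(modrm), expand_add_ed_gd(modrm))
```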
RTL to Timing Model
  [Figure: pipelined datapath (instruction memory, GPR file, immediate extend, ALU, data memory, bypass/interlock) driven by the trace]
  Timing model perfectly models RTL
  Verification???
  But, do you really want to do it this way?
Traces to Timing Model
  Interface provides ability for functional model to pass instruction trace to timing model
    Storage that timing model components can read
  Compression
    TLB to eliminate physical address
    Cache static information (src/dest registers, opcode, etc.)
  Joonsoo Kim
Adding Modules
  Easy to add modules to improve accuracy
    No functionality, only timing
  E.g., DRAM model
    Model banks, row register, control timing, refresh
    Results in some requests taking longer
    If modeling memory controller, reorder memory operations as well
Long Term Plans/Impact
  CMP/SMP support
  Early power estimation (with Freescale)
  Simulator available for software development/tuning before hardware
    True co-development
    Performance/power revs of software
  Design derived from simulator
    Write cycle-accurate simulator, automatically generate design (at RTL + libraries level)
  Change the way computer systems are designed and evaluated
Demo
  Compile code, run on simulator
  BP
    Perfect, gshare
    See difference in simulator performance
  #ALUs
    1, 8
  [Figure: timing model with Fetch, Decode, Rename, R/S, ROB, BP, Br, ALU, TLB, Ld/st $, L1$, L2$]
Better Partitioning?
  Traditional partitioning on module boundaries
    Timing and functionality
  Is there a better way?
    64b adder cannot be implemented as a single monolithic entity
    But, 64 1b adders very tractable
      Can compose a 64b adder from 64 x 1b adders
  Is there a better partitioning for simulation?
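The adder analogy can be made concrete: a 64-bit adder composed from 64 one-bit full adders chained by their carries. This is a plain ripple-carry sketch written for illustration; the function names are not from the slides.

```python
def full_adder(a: int, b: int, cin: int):
    """One-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def add64(x: int, y: int):
    """64-bit add built from 64 one-bit adders; returns (sum, carry out)."""
    carry = 0
    result = 0
    for i in range(64):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry

print(add64(2, 3))
print(add64(2**64 - 1, 1))
```

The point of the analogy is the decomposition itself: each piece is trivial, and the composition rule (the carry chain) is what makes the whole tractable.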
Current/Future Work
  Finish uni-processor version
    Very close, running on one platform
    Debug rest of pipeline
  Shared/coherent bus MP version
    Porting QEMU to MP host
    MP timing model
  Hardware functional model
  Tool chain
FPGA Resources for TM
  OOO, branch prediction, ROB, two ALUs, 1 branch unit, 1 load/store unit (32 entry load/store queue), iTLB, dTLB
    7221 slices (52% of a 2VP30)
    9 block RAMs (6% of a 2VP30)
    NOTE: large structures have not yet been mapped to block RAMs (trace buffer, ROB)
  Configurable cache model (old Verilog version)
    32KB 4-way set associative cache with 16B cache-lines
      165 slices (1% of a 2VP30)
      17 block RAMs (12% of a 2VP30)
    2MB 4-way set-associative cache with 64B cache-lines
      140 slices (1% of a 2VP30)
      40 block RAMs (29% of a 2VP30)
  2VP30 is an old FPGA found on a $600 list-price XUP board
  Current FPGAs
    10 times as much logic (330K logic cells (2VP30 is around 30K))
    3 to 4 times as much block RAM
Current Limitations
  No datapath
    Data speculation requires datapath
  Control path/data path crossover
    Informing loads, Query loads
    Can support
    But, can require additional communication
  Quite possible
FAST Tool Flow
  [Figure: tool flow. Boxes: functional model specification, functional model compiler (gcc), functional model, Functional Model (software and/or FPGA), instruction decode, Decode Generator, timing model specification (modules + glue), Timing Model Compiler (Bluespec), timing model (RTL/C), Timing Model (FPGA or software), RTL, RTL-to-Timing Model]
RTL to Timing Model
  [Figure: pipelined datapath (instruction memory, GPR file, immediate extend, ALU, data memory, bypass/interlock) driven by the trace]
  Timing model perfectly models RTL
  Verification???
Hardware Related Work
  FAST is first FPGA-based functional/timing model partitioned simulator that we know of
    HASim (Emer et al)
  FPGA-based processor emulators
    Academic: RAMP, MIT UMUM, Scale, Stanford FAST, ATLAS, CMU, …
    Intel (Shih-Lien Lu)
    !Flexible (huge effort), !Accurate (old processors, not modeling everything)
Software Related Work
  FastSim: Schnarr, Larus (1998)
    Direct execution with rollback, memoization for speed
    Simplescalar accuracy
  Emer, et al: Asim
    Intel cycle-accurate, about 10KHz
  PTLSim: Yourst
    Full system x86, about 200KHz
    Seems similar to Simplescalar in terms of accuracy
    Uses Xen to fast forward
Simulator Users and Tradeoffs
  Architects: e.g., Matt
    Performance/Power/Reliability
    Timely, ~Accurate, Flexible
    !Fast, !Complete
  Software: e.g., Wen-Mei
    Development, tuning
    Fast, OS
    !Accurate, !Flexible, !Timely
  Implementation (RTL)
    Correctness
    Accurate, ~Timely
    !Fast, !Complete
Stolen from http://research.microsoft.com/si/PPT/HardwareModelingInfrastructure.pdf
Impossible????
Microprocessors
  One instruction executes to completion before next starts
  Implementation is different leading to different performance
Microprocessors
  [Figure: In-Order Front, Out-of-Order Middle, In-Order End]
Why Build? (Anant)
  Software won’t work unless you are building hardware
    Motivation for software tools
    Large data sets
  Hard problems better understood and show up once you begin building
    Have to solve hard problems
  More radical the idea, more important it is to build
    World only trusts end-to-end results
  Cycle simulator only becomes accurate after hardware gets precisely defined
  Needed for commercialization