Closely-Coupled Timing-Directed Partitioning in HAsim
Michael Pellauer†
Murali Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡
†MIT CS and AI Lab
Computation Structures Group
‡Intel Corporation
VSSAD Group
To Appear In: ISPASS 2008
Motivation
We want to simulate target platforms quickly
We also want to construct simulators quickly
Partitioned simulators are a known technique from traditional performance models:
[Diagram: a Timing Partition (micro-architecture, off-chip communication, resource contention, dependencies) interacting with a Functional Partition (ISA)]
• Simplifies timing model
• Amortizes functional model design effort over many models
• Functional Partition can be extremely FPGA-optimized
Different Partitioning Schemes
As categorized by Mauer, Hill and Wood:
Source: [MAUER 2002], ACM SIGMETRICS
We believe that a timing-directed solution will ultimately lead to the best performance
Both partitions upon the FPGA
Functional Partition in Software Asim
Get Instruction (at a given Address)
Get Dependencies
Get Instruction Results
Read Memory*
Speculatively Write Memory* (locally visible)
Commit or Abort instruction
Write Memory* (globally visible)
* Optional depending on instruction type
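The operation sequence above can be sketched in software. This is a hypothetical Python stand-in with illustrative names (the real HAsim functional partition is a hardware design, not this API); a trivial ADD instruction flows through the operations in the order a timing model would request them:

```python
# Illustrative sketch only: a minimal functional partition servicing
# one ADD instruction. All names here are hypothetical.

program = {0: ("ADD", "r1", "r2", "r3")}   # r3 <- r1 + r2
regfile = {"r1": 4, "r2": 5, "r3": 0}

def get_instruction(addr):
    """Get Instruction (at a given Address)."""
    return program[addr]

def get_dependencies(inst):
    """Get Dependencies: source registers and destination."""
    op, s1, s2, dst = inst
    return (s1, s2), dst

def get_results(inst):
    """Get Instruction Results (only ADD in this sketch)."""
    op, s1, s2, dst = inst
    assert op == "ADD"
    return regfile[s1] + regfile[s2]

def commit(inst, result):
    """Commit instruction: make the result globally visible."""
    _, _, _, dst = inst
    regfile[dst] = result

inst = get_instruction(0)
srcs, dst = get_dependencies(inst)
result = get_results(inst)
commit(inst, result)
# regfile["r3"] is now 9
```

A memory instruction would additionally invoke the optional Read Memory / Write Memory operations between execute and commit.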
Execution in Phases
[Diagram: per-instruction phase timelines, e.g. F D X R C; phases correspond to the operations above: Fetch (F), Decode (D), Execute (X), Memory Read (R), Memory Write (W), Commit (C)]
The Emer Assertion:
All data dependencies can be represented via these phases
Detailed Example: 3 Different Timing Models
Executing the same instruction sequence:
Functional Partition in Hardware?
Requirements:
• Support these operations in hardware
• Allow for out-of-order execution, speculation, rollback
Challenges:
• Minimize operation execution times
• Pipeline wherever possible
• Tradeoff between BRAM and multiported RAMs
• Race conditions due to extreme parallelism
Functional Partition As Pipeline
Conveys the concept well, but poor performance
[Diagram: the Timing Model drives a Functional Partition pipeline — Token Gen, Fet, Dec, Exe, Mem, LCom, GCom — backed by a RegFile (Register State) and Memory State]
Implementation: Large Scoreboards in BRAM
Series of tables in BRAM
Store information about each in-flight instruction
Tables are indexed by "token", also used by the timing partition to refer to each instruction
New operation "getToken" allocates a space in the tables
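A software analogue of the token scheme (hypothetical; the real tables live in FPGA Block RAM and token allocation is a hardware operation) might look like:

```python
# Hypothetical Python analogue of the BRAM scoreboard tables. Each
# in-flight instruction gets a token; every table is indexed by it.

NUM_TOKENS = 8                        # table depth = max in-flight insts

free_tokens = list(range(NUM_TOKENS))
addr_table   = [None] * NUM_TOKENS    # fetch address per token
inst_table   = [None] * NUM_TOKENS    # decoded instruction per token
result_table = [None] * NUM_TOKENS    # execution result per token

def get_token():
    """Allocate a table slot for a new in-flight instruction."""
    return free_tokens.pop(0)

def free_token(tok):
    """On commit or abort, the slot is recycled."""
    addr_table[tok] = inst_table[tok] = result_table[tok] = None
    free_tokens.append(tok)

tok = get_token()
addr_table[tok] = 0x400
inst_table[tok] = "ADD r3, r1, r2"
result_table[tok] = 9
free_token(tok)
```

The timing partition never sees the table contents directly; it only passes tokens back to the functional partition to name instructions.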
Implementing the Operations
See paper for details (also extra slides)
Assessment: Three Timing Models
Unpipelined Target
MIPS R10K-like out-of-order superscalar
5-Stage Pipeline
Assessment: Target Performance
Targets have idealized memory hierarchy
[Chart: Target Processor CPI — Model Cycles per Instruction (CPI), y-axis 0 to 3.5, for median, multiply, qsort, towers, vvadd, and average; series: Unpipelined, 5-stage, Out-of-Order]
Assessment: Simulator Performance
Some correspondence between target and functional partition is very helpful
[Chart: Simulation Rate — FPGA-Cycles per Model Cycle (FMR), y-axis 0 to 45, for median, multiply, qsort, towers, vvadd, and average; series: Unpipelined, 5-Stage, Out-of-Order]
Assessment: Reuse and Physical Stats
Where is functionality implemented:
[Table: for each design (Unpipelined, 5-Stage, Out-of-Order), whether the IMem, Program Counter, Branch Predictor, Scoreboard/ROB, RegFile, Maptable/Freelist, ALU, DMem, Store Buffer, and Snapshots/Rollback are implemented in the timing model or the Functional Partition; structures marked N/A are unused — the Unpipelined design uses the fewest, the Out-of-Order design uses them all]

FPGA usage (Virtex-II Pro 70, synthesized with Xilinx ISE 8.1i):

                       Unpipelined   5-stage      Out-of-Order
FPGA Slices            6599 (20%)    9220 (28%)   22,873 (69%)
Block RAMs             18 (5%)       25 (7%)      25 (7%)
Clock Speed            98.8 MHz      96.9 MHz     95.0 MHz
Average FMR            41.1          7.49         15.6
Simulation Rate        2.4 MHz       14 MHz       6 MHz
Average Simulator IPS  2.4 MIPS      5.1 MIPS     4.7 MIPS
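The simulation-rate rows follow from the clock and FMR rows: model cycles per second is FPGA cycles per second divided by FPGA cycles per model cycle. A quick arithmetic check on two of the columns:

```python
# Simulation rate = FPGA clock / FPGA-to-Model cycle Ratio (FMR).

def sim_rate_mhz(clock_mhz, fmr):
    return clock_mhz / fmr

print(round(sim_rate_mhz(98.8, 41.1), 1))   # Unpipelined: ~2.4 MHz
print(round(sim_rate_mhz(95.0, 15.6), 1))   # Out-of-Order: ~6.1 MHz
```
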
Future Work: Simulating Multicores

Scheme 1: Duplicate both partitions
[Diagram: Timing Models A–D, each paired with its own Functional Reg State + Datapath, all sharing the Functional Memory State; interaction occurs at the memory state]

Scheme 2: Cluster Timing Partitions
[Diagram: Timing Models A–D share a single Functional Reg State + Datapath and the Functional Memory State; interaction still occurs there]
Use a context ID to reference all state lookups
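A hypothetical sketch of the context-ID idea in Scheme 2 (illustrative Python, not the hardware design): one shared register state serves several cores, and every lookup carries a context ID so the shared structure can tell them apart.

```python
# Hypothetical sketch: a single shared functional register state
# serving multiple timing-model contexts (cores). The context ID
# selects the per-core slice of one flat table.

NUM_CONTEXTS = 4
NUM_REGS = 32

regfile = [0] * (NUM_CONTEXTS * NUM_REGS)

def read_reg(ctx, reg):
    return regfile[ctx * NUM_REGS + reg]

def write_reg(ctx, reg, val):
    regfile[ctx * NUM_REGS + reg] = val

write_reg(0, 5, 42)   # core 0's r5
write_reg(1, 5, 99)   # core 1's r5 -- fully independent state
```
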
Future Work: Simulating Multicores

Scheme 3: Multiplex the timing models themselves
Leverage HAsim A-Ports in the timing model
Out of scope of today's talk
[Diagram: multiplexed Timing Models A–D sharing the Functional Reg State + Datapath and the Functional Memory State; interaction still occurs there, with a context ID referencing all state lookups]
Future Work: Unifying with the UT-FAST Model

UT-FAST is functional-first
This can be unified into timing-directed: just do "execute-at-fetch"
[Diagram: a functional emulator running in software feeds an execution stream to the Timing Partition on the FPGA, which can resteer the emulator]
Summary
Described a scheme for closely-coupled timing-directed partitioning
Both partitions are suitable for on-FPGA implementation
Demonstrated such a scheme's benefits: very good reuse, very good area/clock speed
Good FPGA-to-Model cycle Ratio (caveat: assuming some correspondence between the timing model and the functional partition — recall the unpipelined target)
We plan to extend this using contexts for hardware multiplexing [Chung 07]
Future: rare complex operations (such as syscalls) could be done in software using virtual channels
Questions?
Extra Slides
Functional Partition Fetch
Functional Partition Decode
Functional Partition Execute
Functional Partition Back End
Timing Model: Unpipelined
5-Stage Pipeline Timing Model
Out-Of-Order Superscalar Timing Model