Closely-Coupled Timing-Directed Partitioning in HAsim
![Page 1: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/1.jpg)
Closely-Coupled Timing-Directed Partitioning in HAsim
Michael Pellauer†, Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡
†MIT CS and AI Lab, Computation Structures Group
‡Intel Corporation, VSSAD Group
To Appear In: ISPASS 2008
![Page 2: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/2.jpg)
Motivation
We want to simulate target platforms quickly
We also want to construct simulators quickly
Partitioned simulators are a known technique from traditional performance models:

Functional Partition
• ISA
• Off-chip communication

Timing Partition
• Micro-architecture
• Resource contention
• Dependencies

Interaction (between the two partitions)

• Simplifies timing model
• Amortizes functional model design effort over many models
• Functional Partition can be extremely FPGA-optimized
![Page 3: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/3.jpg)
Different Partitioning Schemes
As categorized by Mauer, Hill and Wood:
Source: [MAUER 2002], ACM SIGMETRICS
We believe that a timing-directed solution will ultimately lead to the best performance
Both partitions on the FPGA
![Page 4: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/4.jpg)
Functional Partition in Software Asim
• Get Instruction (at a given Address)
• Get Dependencies
• Get Instruction Results
• Read Memory*
• Speculatively Write Memory* (locally visible)
• Commit or Abort instruction
• Write Memory* (globally visible)
* Optional depending on instruction type
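The operations listed above can be read as an interface that the timing model calls one step at a time. The following is only a minimal Python sketch of that idea, not Asim or HAsim code: the class name `ToyFunctionalPartition`, the three-opcode toy ISA (`addi`, `load`, `store`), the instruction tuple layout, and the `inst_id` keys are all assumptions made for illustration.

```python
class ToyFunctionalPartition:
    """Toy functional partition; the caller plays the role of the timing model."""

    def __init__(self, program):
        self.program = program      # address -> (opcode, reg, base, imm)  (assumed layout)
        self.regs = {}              # register state; unwritten registers read as 0
        self.memory = {}            # globally visible memory
        self.store_buffer = {}      # inst_id -> (addr, value); locally visible only

    def get_instruction(self, address):
        return self.program[address]

    def get_dependencies(self, inst):
        opcode, reg, base, _imm = inst
        if opcode == "store":
            return [base, reg], []  # a store reads two registers and writes none
        return [base], [reg]

    def get_results(self, inst):
        # Result for addi; effective address for load and store.
        _opcode, _reg, base, imm = inst
        return self.regs.get(base, 0) + imm

    def read_memory(self, addr):
        # The newest speculative store to this address wins over global memory.
        for buffered_addr, value in reversed(list(self.store_buffer.values())):
            if buffered_addr == addr:
                return value
        return self.memory.get(addr, 0)

    def speculative_write(self, inst_id, addr, value):
        self.store_buffer[inst_id] = (addr, value)

    def commit(self, reg, value):
        self.regs[reg] = value

    def abort(self, inst_id):
        self.store_buffer.pop(inst_id, None)

    def write_memory(self, inst_id):
        addr, value = self.store_buffer.pop(inst_id)
        self.memory[addr] = value


fp = ToyFunctionalPartition({
    0: ("addi", "r1", "r0", 7),
    4: ("store", "r1", "r0", 100),
    8: ("load", "r2", "r0", 100),
})
addi = fp.get_instruction(0)
fp.commit("r1", fp.get_results(addi))                    # r1 = 0 + 7
store = fp.get_instruction(4)
fp.speculative_write(1, fp.get_results(store), fp.regs["r1"])
load = fp.get_instruction(8)
fp.commit("r2", fp.read_memory(fp.get_results(load)))    # sees the buffered store
fp.write_memory(1)                                       # store becomes globally visible
```

In this sketch the load observes the store's value before it reaches global memory, which is the distinction the slide draws between locally visible and globally visible writes.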
![Page 5: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/5.jpg)
Execution in Phases
F D X R C
F D X W C W
F D X C
The Emer Assertion:
All data dependencies can be represented via these phases
F D X R A
F D X X C W
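One way to make the phase sequences above concrete is a small checker. The sketch below is an assumption-laden illustration: the letter legend is inferred from the operation list on the previous slide rather than stated here, and the ordering rules in `is_well_formed` are a deliberately permissive guess, not a rule taken from the talk.

```python
# Assumed legend, inferred from the functional partition operations.
PHASES = {
    "F": "get instruction",
    "D": "get dependencies",
    "X": "get instruction results",
    "R": "read memory",
    "W": "write memory",
    "C": "commit",
    "A": "abort",
}


def is_well_formed(sequence):
    """Check a space-separated phase string against a few assumed rules."""
    phases = sequence.split()
    if any(p not in PHASES for p in phases):
        return False
    if phases[:3] != ["F", "D", "X"]:
        return False                      # every instruction starts F, D, X
    enders = [i for i, p in enumerate(phases) if p in ("C", "A")]
    if len(enders) != 1:
        return False                      # exactly one commit or abort
    tail = phases[enders[0] + 1:]
    if phases[enders[0]] == "A":
        return tail == []                 # nothing may follow an abort
    return all(p == "W" for p in tail)    # only a global write may follow commit


slide_sequences = ["F D X R C", "F D X W C W", "F D X C", "F D X R A", "F D X X C W"]
results = [is_well_formed(s) for s in slide_sequences]
```

Under these assumed rules every sequence shown on the slide is accepted, while a sequence such as `F D X A W` (a write after an abort) is rejected.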
![Page 6: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/6.jpg)
Detailed Example: 3 Different Timing Models
Executing the same instruction sequence:
![Page 7: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/7.jpg)
Functional Partition in Hardware?
Requirements
• Support these operations in hardware
• Allow for out-of-order execution, speculation, rollback
Challenges
• Minimize operation execution times
• Pipeline wherever possible
• Tradeoff between BRAM/multiport RAMs
• Race conditions due to extreme parallelism
![Page 8: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/8.jpg)
Functional Partition As Pipeline
Conveys concept well, but poor performance
[Figure: Timing Model and Functional Partition; the Functional Partition is drawn as a pipeline of Token Gen, Fet, Dec, Exe, Mem, LCom and GCom stages with Register State (RegFile) and Memory State]
![Page 9: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/9.jpg)
Implementation: Large Scoreboards in BRAM
• Series of tables in BRAM
• Store information about each in-flight instruction
• Tables are indexed by “token”
• Also used by the timing partition to refer to each instruction
• New operation “getToken” to allocate a space in the tables
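The token-indexed tables can be pictured as parallel arrays that share one index. The Python sketch below is only an illustration of that indexing scheme, not the HAsim implementation: the table names (`inst`, `dest`, `result`), the free-list policy, and the `release` method are hypothetical, and only `get_token` corresponds to the “getToken” operation named on the slide.

```python
from collections import deque


class TokenTables:
    """Parallel per-instruction tables, all indexed by the same token."""

    def __init__(self, num_tokens):
        self.free = deque(range(num_tokens))   # tokens not currently in flight
        self.inst = [None] * num_tokens        # fetched instruction (assumed table)
        self.dest = [None] * num_tokens        # destination register (assumed table)
        self.result = [None] * num_tokens      # computed result (assumed table)

    def get_token(self):
        """Allocate a row in every table; None means all tokens are in flight."""
        if not self.free:
            return None
        return self.free.popleft()

    def release(self, token):
        """Hypothetical cleanup once an instruction commits or aborts."""
        self.inst[token] = self.dest[token] = self.result[token] = None
        self.free.append(token)


tables = TokenTables(num_tokens=2)
t0 = tables.get_token()
t1 = tables.get_token()
t2 = tables.get_token()          # no space left, so the timing model must wait
tables.inst[t0] = "addi r1, r0, 7"
tables.release(t0)
t3 = tables.get_token()          # the freed row is reused
```

Because the timing partition holds only the token, it can refer to an in-flight instruction without carrying any of the instruction's functional state.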
![Page 10: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/10.jpg)
Implementing the Operations
See paper for details (also extra slides)
![Page 11: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/11.jpg)
Assessment: Three Timing Models
Unpipelined Target
MIPS R10K-like out-of-order superscalar
5-Stage Pipeline
![Page 12: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/12.jpg)
Assessment: Target Performance
Targets have idealized memory hierarchy
[Figure: Target Processor CPI. Bar chart of Model Cycles per Instruction (CPI), axis from 0 to 3.5, for the benchmarks median, multiply, qsort, towers and vvadd plus their average; series: Unpipelined, 5-stage, Out-of-Order]
![Page 13: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/13.jpg)
Assessment: Simulator Performance
Some correspondence between target and functional partition is very helpful
[Figure: Simulation Rate. Bar chart of FPGA-Cycles per Model Cycle (FMR), axis from 0 to 45, for the benchmarks median, multiply, qsort, towers and vvadd plus their average; series: Unpipelined, 5-Stage, Out-of-Order]
![Page 14: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/14.jpg)
Assessment: Reuse and Physical Stats
Where is functionality implemented:
Design, IMem, Program Counter, Branch Predictor, Scoreboard/ROB, RegFile, Maptable/Freelist, ALU, DMem, Store Buffer, Snapshots/Rollback
Functional Partition
Unpipelined N/A N/A N/A N/A N/A
5-Stage N/A
Out-of-Order

FPGA usage:

| Design | Unpipelined | 5-Stage | Out-of-Order |
|---|---|---|---|
| FPGA Slices | 6,599 (20%) | 9,220 (28%) | 22,873 (69%) |
| Block RAMs | 18 (5%) | 25 (7%) | 25 (7%) |
| Clock Speed | 98.8 MHz | 96.9 MHz | 95.0 MHz |
| Average FMR | 41.1 | 7.49 | 15.6 |
| Simulation Rate | 2.4 MHz | 14 MHz | 6 MHz |
| Average Simulator IPS | 2.4 MIPS | 5.1 MIPS | 4.7 MIPS |

Virtex-II Pro 70, using ISE 8.1i
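Since FMR counts FPGA cycles per model cycle, a simulation rate in model cycles per second follows from dividing the FPGA clock by the FMR. The short Python sketch below applies that relation to the clock speeds and average FMR values from the table; the function and variable names are illustrative only.

```python
def simulation_rate_mhz(clock_mhz, fmr):
    """Model cycles per microsecond: FPGA clock (MHz) / FPGA cycles per model cycle."""
    return clock_mhz / fmr


# (clock speed in MHz, average FMR) as reported in the table
designs = {
    "Unpipelined": (98.8, 41.1),
    "5-Stage": (96.9, 7.49),
    "Out-of-Order": (95.0, 15.6),
}

rates = {name: simulation_rate_mhz(clock, fmr) for name, (clock, fmr) in designs.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} MHz")
```

This gives roughly 2.4, 12.9 and 6.1 MHz. The unpipelined and out-of-order values line up with the reported 2.4 MHz and 6 MHz; the 5-stage value is somewhat below the reported 14 MHz, which may reflect rounding or a different averaging over benchmarks in the original measurements.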
![Page 15: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/15.jpg)
Future Work: Simulating Multicores
Scheme 1: Duplicate both partitions
[Figure: Timing Models A, B, C and D, each with its own Func Reg + Datapath, and one Functional Memory State; label: “Interaction occurs here”]
Scheme 2: Cluster Timing Partitions
[Figure: Timing Models A, B, C and D with one Functional Reg State + Datapath and one Functional Memory State; label: “Interaction still occurs here”]
Use a context ID to reference all state lookups
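The context-ID idea can be sketched as a single register-state structure in which every lookup carries the ID of the simulated core. The Python below is a toy illustration under stated assumptions, not the planned HAsim design: the class name, the 32-register default, and the choice to keep one memory shared by all contexts are assumptions (the last one is meant to mirror the single Functional Memory State in the diagram).

```python
class SharedFunctionalState:
    """One functional register state serving several timing models via context IDs."""

    def __init__(self, num_contexts, num_regs=32):
        # One register file per context, stored in a single indexed structure.
        self.regs = [[0] * num_regs for _ in range(num_contexts)]
        # Assumed: memory is shared, so cores interact through it.
        self.memory = {}

    def read_reg(self, context_id, reg):
        return self.regs[context_id][reg]

    def write_reg(self, context_id, reg, value):
        self.regs[context_id][reg] = value

    def load(self, addr):
        return self.memory.get(addr, 0)

    def store(self, addr, value):
        self.memory[addr] = value


state = SharedFunctionalState(num_contexts=4)
state.write_reg(0, 1, 5)      # timing model A (context 0) writes r1
state.write_reg(1, 1, 9)      # timing model B (context 1) writes its own r1
state.store(0x100, 42)        # a store made on behalf of context 0 ...
seen_by_b = state.load(0x100) # ... is visible to a load on behalf of context 1
```

Register state stays private to each context while memory remains the point where the simulated cores interact.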
![Page 16: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/16.jpg)
Future Work: Simulating Multicores
Scheme 3: Perform multiplexing of timing models themselves
• Leverage HAsim A-Ports in Timing Model
• Out of scope of today’s talk
[Figure: Timing Models A, B, C and D with one Functional Reg State + Datapath and one Functional Memory State; label: “Interaction still occurs here”]
Use a context ID to reference all state lookups
![Page 17: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/17.jpg)
Future Work: Unifying with the UT-FAST model
UT-FAST is Functional-First
This can be unified into Timing-Directed
Just do “execute-at-fetch”
[Figure: a functional emulator running in software and the FPGA, connected by “execution stream” and “resteer” arrows; a second diagram adds the Emulator alongside the Func Partition and Timing Partition]
![Page 18: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/18.jpg)
Summary
Described a scheme for closely-coupled timing-directed partitioning
• Both partitions are suitable for on-FPGA implementation
Demonstrated such a scheme’s benefits:
• Very Good Reuse, Very Good Area/Clock Speed
• Good FPGA-to-Model Cycle Ratio
• Caveat: Assuming some correspondence between timing model and functional partitions (recall the unpipelined target)
We plan to extend this using contexts for hardware multiplexing [Chung 07]
Future: rare complex operations (such as syscalls) could be done in software using virtual channels
![Page 21: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/21.jpg)
Functional Partition Fetch
![Page 22: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/22.jpg)
Functional Partition Decode
![Page 23: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/23.jpg)
Functional Partition Execute
![Page 24: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/24.jpg)
Functional Partition Back End
![Page 25: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/25.jpg)
Timing Model: Unpipelined
![Page 26: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/26.jpg)
5-Stage Pipeline Timing Model
![Page 27: Closely-Coupled Timing-Directed Partitioning in HAsim](https://reader036.fdocuments.us/reader036/viewer/2022070419/56815b68550346895dc95d38/html5/thumbnails/27.jpg)
Out-Of-Order Superscalar Timing Model