Superscalar Architecture Design Framework for DSP Operations
Overview
An optimization tool that alters superscalar architectural configuration parameters to suit a given DSP application.
It tunes the architectural blocks (number of ALUs, cache size, etc.).
Motivation
Gives designers an initial idea of what their design should look like.
Particularly useful for software defined radio applications.
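Optimizations can target both power consumption and speed.
Simulators: SimpleScalar, Wattch
Stage 1: Search and optimization algorithm (Simulated Annealing)
Stage 2: Heuristic Approach
Target Function (Equation 1; operators inferred, as they did not survive on the slide):
$$Gain = Gain_{IPC} + scale \times Gain_{APPI}$$
$$Gain_{IPC} = \%IPC\_increase$$
$$Gain_{APPI} = \%APPI\_decrease$$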
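The target function can be sketched in a few lines. This is a minimal illustration, assuming the gain is a weighted sum of the percentage IPC increase and the percentage APPI decrease; the additive form, the default `scale` value, and the function names are assumptions for illustration, not taken from the slides.

```python
def percent_change(new, old):
    """Signed percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


def gain(ipc_new, ipc_old, appi_new, appi_old, scale=1.0):
    """Combined gain of a new configuration over the previous one.

    Assumed form: Gain = Gain_IPC + scale * Gain_APPI, where Gain_IPC is the
    percentage increase in IPC and Gain_APPI the percentage decrease in APPI.
    """
    gain_ipc = percent_change(ipc_new, ipc_old)      # higher IPC is better
    gain_appi = -percent_change(appi_new, appi_old)  # lower APPI is better
    return gain_ipc + scale * gain_appi
```

With this form, a configuration that raises IPC by 10% and lowers APPI by 10% scores a gain of about 20 when `scale` is 1; a larger `scale` shifts the emphasis toward power.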
Simulated Annealing
1. Initial configuration simulated; results recorded.
2. Configuration changed (Table 2).
3. Changed configuration simulated; results recorded.
4. Gain evaluation (Equation 1).
   - Positive gain: configuration change finalized.
   - Negative gain: probability of acceptance calculated (Equation 2).
     - Configuration probabilistically chosen: change kept.
     - Configuration not chosen probabilistically: revert to previous configuration.
5. Number of simulations < 250: repeat from step 2.
6. Number of simulations = 250: configuration finalized.
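The accept/revert loop of the simulated annealing stage can be sketched as follows. This is a hedged outline, not the authors' implementation: `neighbour`, `evaluate_gain`, and `accept_prob` are hypothetical callbacks standing in for the configuration change, the simulator run plus gain evaluation, and the acceptance probability. Treating a gain of exactly zero like a negative gain is also an assumption.

```python
import random


def anneal(initial, neighbour, evaluate_gain, accept_prob, n_sims=250, rng=None):
    """Run the accept/revert loop for a fixed number of simulations.

    neighbour(config, rng)      -> changed copy of config (hypothetical helper)
    evaluate_gain(new, current) -> signed gain of `new` over `current`
    accept_prob(gain, sim_idx)  -> probability of keeping a non-positive-gain change
    """
    rng = rng or random.Random()
    current = initial
    for sim_idx in range(1, n_sims + 1):
        candidate = neighbour(current, rng)
        g = evaluate_gain(candidate, current)
        if g > 0:
            current = candidate  # positive gain: change finalized
        elif rng.random() < accept_prob(g, sim_idx):
            current = candidate  # negative gain, but chosen probabilistically
        # otherwise: revert, i.e. keep `current` unchanged
    return current
```

Passing the callbacks in keeps the loop independent of any particular simulator, so the same skeleton can be exercised with a toy objective before it is wired to real simulation runs.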
Simulated Annealing Parameter set
Sr No Parameter Configuration
1 IFQ 1, 2, 4, 16, 32
2 Branch Table 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384
3 RAS 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192
4 BTB 16 4, 32 4, 64 4, 128 4, 256 4, 512 4, 1024 4, 2048 4, 4096 4, 8192 4
5 Decode Width 1, 2, 4, 16, 32
6 Issue Width 1, 2, 4, 16, 32
7 Commit Width 1, 2, 4, 16, 32
8 RUU 8, 16, 32, 64, 128, 256, 512, 1024
9 LSQ 8, 16, 32, 64, 128, 256, 512, 1024
10 I Cache 4:32:4:l, 8:32:4:l, 16:32:4:l, 32:32:4:l, 64:32:4:l, 128:32:4:l, 256:32:4:l, 1024:32:4:l, 2048:32:4:l, 8192:32:4:l
11 D Cache 4:32:4:l, 8:32:4:l, 16:32:4:l, 32:32:4:l, 64:32:4:l, 128:32:4:l, 256:32:4:l, 1024:32:4:l, 2048:32:4:l
12 Bus Width 4, 8, 16, 32, 64
13 I TLB 1:1024:4:l, 2:1024:4:l, 4:1024:4:l, 8:1024:4:l, 16:1024:4:l, 32:1024:4:l, 64:1024:4:l, 128:1024:4:l
14 D TLB 1:1024:4:l, 2:1024:4:l, 4:1024:4:l, 8:1024:4:l, 16:1024:4:l, 32:1024:4:l, 64:1024:4:l, 128:1024:4:l
15 I ALU 1, 2, 4, 8
16 I Mul/Div 1, 2, 4, 8
17 Memory Ports 1, 2, 4, 8
18 FP ALU 1, 2, 4, 8
19 FP Mul/Div 1, 2, 4, 8
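The cache and TLB entries in the table appear to follow SimpleScalar's geometry strings, `<nsets>:<bsize>:<assoc>:<repl>` (number of sets, block or page size in bytes, associativity, replacement policy), with the leading cache name omitted. The sketch below parses such a string and computes the capacity as sets times block size times ways; `Geometry`, `parse_geometry`, and `capacity_bytes` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Geometry(NamedTuple):
    nsets: int  # number of sets
    bsize: int  # block size in bytes (page size for a TLB)
    assoc: int  # associativity (ways per set)
    repl: str   # replacement policy: 'l' LRU, 'f' FIFO, 'r' random


def parse_geometry(spec: str) -> Geometry:
    """Split a '<nsets>:<bsize>:<assoc>:<repl>' string into typed fields."""
    nsets, bsize, assoc, repl = spec.split(":")
    return Geometry(int(nsets), int(bsize), int(assoc), repl)


def capacity_bytes(spec: str) -> int:
    """Capacity (or address reach, for a TLB) = sets * block size * ways."""
    g = parse_geometry(spec)
    return g.nsets * g.bsize * g.assoc


print(capacity_bytes("16:32:4:l"))   # 2048
print(capacity_bytes("8:1024:4:l"))  # 32768
```

Under this reading, `16:32:4:l` is a 2 KB four-way cache and `8:1024:4:l` covers 32 KB.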
[Figure: Probability of acceptance versus simulation index (0 to 250), with curves for gain = -2, -3, -5 and -10 and a probability = 0.5 reference line.]
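Acceptance probability (Equation 2, reconstructed; the operators did not survive on the slide):
$$p = \exp\left(\frac{Gain_{Eff} \cdot s}{10}\right)$$
$$s = \log_{10}(num\_sim) + 1$$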
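The behaviour of the acceptance probability can be explored numerically. The sketch below assumes the forms p = exp(Gain_Eff * s / 10) and s = log10(num_sim) + 1; both are a reading of the garbled slide equation rather than a confirmed formula, and the function names are hypothetical.

```python
import math


def cooling_factor(num_sim: int) -> float:
    """Assumed form: s = log10(num_sim) + 1, growing slowly with the simulation index."""
    return math.log10(num_sim) + 1.0


def acceptance_probability(gain_eff: float, num_sim: int) -> float:
    """Assumed form: p = exp(gain_eff * s / 10), intended for negative gains."""
    return math.exp(gain_eff * cooling_factor(num_sim) / 10.0)


# Tabulate p for the gains shown in the plot at a few simulation indices.
for g in (-2, -3, -5, -10):
    row = [round(acceptance_probability(g, n), 3) for n in (1, 50, 250)]
    print(g, row)
```

Under these assumptions the probability of accepting a given loss falls as the simulation index grows, and larger losses are always less likely to be accepted than smaller ones, which matches the usual cooling behaviour of simulated annealing.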
The final configuration from simulated annealing is further optimized using the heuristic approach.
The heuristic approach is based on the operating principle of superscalar architecture.
No  Configuration Change      Monitored Result          dir=0  dir=1
1   Branch Table              Branch_Misses             Inc    Dec
2   BTB                       Gain                      Inc    Dec
3   Return Address Stack      Gain                      Inc    Dec
4   IFQ, Exec Win, I ALU      IFQ_full, Eff_Gain, IPB   Inc    Dec
5   I ALU                     Gain                      Inc    Dec
6   I Mul/Div                 Gain                      Inc    Dec
7   FP ALU                    Gain                      Inc    Dec
8   FP Mul/Div                Gain                      Inc    Dec
9   RUU                       Gain                      Dec    Inc
10  LSQ                       Gain                      Dec    Inc
11  I-Compress                Gain                      En     En
12  I-Cache                   Gain                      Dec    Inc
13  D-Cache                   Gain                      Dec    Inc
14  Instruction TLB           Gain                      Dec    Inc
15  Data TLB                  Gain                      Dec    Inc
16  Bus Width                 Gain                      Inc    Dec
17  Memory To System Ports    Gain                      Inc    Dec
18  Exit Stage                Gain                      Nil    Nil
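One way to read the heuristic table is as a lookup from a parameter and a direction flag to an action. The sketch below assumes that `dir` selects whether the parameter is stepped up or down through its allowed values, and that a step past either end of the list is clamped; both points are assumptions, and only two rows of the table are reproduced.

```python
# Allowed values per parameter (two rows from the simulated annealing parameter set).
ALLOWED = {
    "RUU": [8, 16, 32, 64, 128, 256, 512, 1024],
    "I ALU": [1, 2, 4, 8],
}

# Action for dir = 0 and dir = 1, copied from the heuristic table.
ACTIONS = {
    "RUU": ("dec", "inc"),
    "I ALU": ("inc", "dec"),
}


def step(param: str, value: int, direction: int) -> int:
    """Move `value` one position along its allowed list; clamp at the ends (assumption)."""
    values = ALLOWED[param]
    i = values.index(value)
    delta = 1 if ACTIONS[param][direction] == "inc" else -1
    return values[min(max(i + delta, 0), len(values) - 1)]
```

For example, `step("RUU", 16, 0)` moves the RUU size down to 8, while `step("RUU", 16, 1)` moves it up to 32; the monitored result for that row would then decide whether the new value is kept.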
[Figure: Gain, y-axis from -150 to 350, x-axis ticks from 1 to 495.]
[Figure: IPC and APPI, y-axis from 0 to 25, x-axis ticks from 1 to 487.]
Results Summary
Optimized Configuration performance measures
Instructions per Cycle: 1.1934
Average Power per Instruction: 4.6744
Instructions per second (1 GHz): 1.193421 G
Transistor Count: 10,645,929
Transistor Count for Pentium III: 9,500,000
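The derived figures in the summary follow from simple arithmetic: instructions per second is IPC multiplied by the clock frequency, and the two transistor counts can be compared as a ratio. A short check, using only the numbers quoted above (variable names are illustrative):

```python
ipc = 1.1934              # instructions per cycle
clock_hz = 1e9            # 1 GHz clock
ips = ipc * clock_hz      # instructions per second

ratio = 10_645_929 / 9_500_000  # optimized design vs. Pentium III figure

print(f"{ips / 1e9:.4f} G instructions/s")  # 1.1934 G instructions/s
print(f"{(ratio - 1) * 100:.1f}% more transistors than the Pentium III figure")  # 12.1%
```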
IFFT
Parameter Configuration
Instruction Fetch Queue 32
Branch Table Size 32768
Return Address Stack 16
Branch Target Buffer 1024
Instruction Decode Width 32
Instruction Issue Width 2
Instruction Commit Width 32
Register Update Unit 16
Load Store Queue 8
D Cache 2 KB
I Cache 4 KB
Memory Bus Width 64 bytes
Instruction TLB 32 KB
Data TLB 16 KB
Integer ALUs 4
Integer Mul/Div 1
Memory to System Ports 2
Floating Point ALU 1
Floating Point Mul/Div 4