Superscalar Architecture Design Framework for DSP Operations

Rehan Ahmed

Overview

An optimization tool that alters superscalar architectural configuration parameters to suit a given DSP application.

It alters the architectural blocks (number of ALUs, cache size, etc.).

Motivation

Gives designers an initial idea of what their design should look like.

Particularly useful for software-defined radio applications.

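Optimizations can target both power consumption and speed.

Simulators: SimpleScalar (performance) and Wattch (power)

Stage 1: Search and optimization algorithm (Simulated Annealing)

Stage 2: Heuristic Approach

Target Function (Equation 1):

$\mathrm{Gain} = \mathrm{Gain}_{IPC} + \mathrm{scale} \times \mathrm{Gain}_{APPI}$

$\mathrm{Gain}_{APPI} = \%\,\mathrm{APPI\_decrease}$

$\mathrm{Gain}_{IPC} = \%\,\mathrm{IPC\_increase}$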
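As a minimal sketch of the target function, the snippet below assumes the gain is the percentage increase in IPC plus `scale` times the percentage decrease in APPI. The function names and the sample numbers are hypothetical and only illustrate the arithmetic.

```python
def percent_change(new, old):
    """Signed percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


def gain(ipc_old, ipc_new, appi_old, appi_new, scale=1.0):
    """Combined gain: % increase in IPC plus scale times % decrease in APPI."""
    gain_ipc = percent_change(ipc_new, ipc_old)       # positive when IPC rises
    gain_appi = -percent_change(appi_new, appi_old)   # positive when APPI falls
    return gain_ipc + scale * gain_appi


# Hypothetical numbers: IPC rises by 10%, APPI falls by 5%, scale = 2.
print(gain(1.0, 1.1, 5.0, 4.75, scale=2.0))
```

A larger `scale` makes a reduction in power per instruction count for more than an equal percentage improvement in speed.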

Simulated Annealing

1. Start from the initial configuration; simulate it and record the results.

2. Change the configuration (Table 2).

3. Simulate the changed configuration and record the results.

4. Evaluate the gain (Equation 1).

5. Positive gain: the configuration change is finalized.

6. Negative gain: calculate the probability of acceptance (Equation 2). The configuration is then either chosen probabilistically, or not chosen, in which case revert to the previous configuration.

7. If the number of simulations is below 250, return to step 2. At 250 simulations the configuration is finalized.
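The flow above can be sketched as a short loop. This is an illustration under stated assumptions, not the tool's actual code: `simulate`, `neighbour`, and `gain` are hypothetical caller-supplied functions, and the acceptance rule assumes Equation 2 reads p = exp(gain * s / 10) with s = log10(num_sim) + 1.

```python
import math
import random


def anneal(initial, simulate, neighbour, gain, max_sims=250, rng=random):
    """Run the accept/revert loop for a fixed budget of simulations.

    simulate(config) -> result, neighbour(config, rng) -> new config, and
    gain(old_result, new_result) -> float are hypothetical placeholders.
    """
    current = initial
    current_result = simulate(current)           # simulation number 1
    for num_sim in range(2, max_sims + 1):
        candidate = neighbour(current, rng)
        candidate_result = simulate(candidate)
        g = gain(current_result, candidate_result)
        if g > 0:
            accept = True                         # positive gain: keep the change
        else:
            s = math.log10(num_sim) + 1           # assumed schedule term
            accept = rng.random() < math.exp(g * s / 10)
        if accept:
            current, current_result = candidate, candidate_result
        # otherwise the previous configuration is kept (the revert branch)
    return current, current_result


# Toy usage: the "configuration" is a single integer and the best value is 7.
best, score = anneal(
    initial=0,
    simulate=lambda x: -(x - 7) ** 2,
    neighbour=lambda x, r: x + r.choice([-1, 1]),
    gain=lambda old, new: new - old,
    rng=random.Random(0),
)
print(best, score)
```

A gain of exactly zero is not covered by the flow; this sketch sends it through the probabilistic branch, where it is always accepted.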

Simulated Annealing Parameter set

Sr No | Parameter | Configuration
1 | IFQ | 1, 2, 4, 16, 32
2 | Branch Table | 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384
3 | RAS | 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192
4 | BTB | 16 4, 32 4, 64 4, 128 4, 256 4, 512 4, 1024 4, 2048 4, 4096 4, 8192 4
5 | Decode Width | 1, 2, 4, 16, 32
6 | Issue Width | 1, 2, 4, 16, 32
7 | Commit Width | 1, 2, 4, 16, 32
8 | RUU | 8, 16, 32, 64, 128, 256, 512, 1024
9 | LSQ | 8, 16, 32, 64, 128, 256, 512, 1024
10 | I Cache | 4:32:4:l, 8:32:4:l, 16:32:4:l, 32:32:4:l, 64:32:4:l, 128:32:4:l, 256:32:4:l, 1024:32:4:l, 2048:32:4:l, 8192:32:4:l
11 | D Cache | 4:32:4:l, 8:32:4:l, 16:32:4:l, 32:32:4:l, 64:32:4:l, 128:32:4:l, 256:32:4:l, 1024:32:4:l, 2048:32:4:l
12 | Bus Width | 4, 8, 16, 32, 64
13 | I TLB | 1:1024:4:l, 2:1024:4:l, 4:1024:4:l, 8:1024:4:l, 16:1024:4:l, 32:1024:4:l, 64:1024:4:l, 128:1024:4:l
14 | D TLB | 1:1024:4:l, 2:1024:4:l, 4:1024:4:l, 8:1024:4:l, 16:1024:4:l, 32:1024:4:l, 64:1024:4:l, 128:1024:4:l
15 | I ALU | 1, 2, 4, 8
16 | I Mul/Div | 1, 2, 4, 8
17 | Memory Ports | 1, 2, 4, 8
18 | FP ALU | 1, 2, 4, 8
19 | FP Mul/Div | 1, 2, 4, 8
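One way to turn this table into a configuration change is to pick a parameter at random and move it to an adjacent allowed value. The slides do not say how the change is chosen, so the move rule below is an assumption, and `PARAMETER_SET` and `random_neighbour` are hypothetical names. Only a few rows of the table are copied; the rest would follow the same pattern.

```python
import random

# A few rows copied from the parameter table above.
PARAMETER_SET = {
    "IFQ": [1, 2, 4, 16, 32],
    "RUU": [8, 16, 32, 64, 128, 256, 512, 1024],
    "LSQ": [8, 16, 32, 64, 128, 256, 512, 1024],
    "Bus Width": [4, 8, 16, 32, 64],
    "I ALU": [1, 2, 4, 8],
}


def random_neighbour(config, rng=random):
    """Return a copy of `config` with one parameter moved to an adjacent allowed value."""
    name = rng.choice(sorted(PARAMETER_SET))
    values = PARAMETER_SET[name]
    index = values.index(config[name])
    # Step up or down by one position, staying inside the list.
    candidates = [i for i in (index - 1, index + 1) if 0 <= i < len(values)]
    new_config = dict(config)
    new_config[name] = values[rng.choice(candidates)]
    return new_config


# Start every parameter at its smallest value and make one move.
start = {name: values[0] for name, values in PARAMETER_SET.items()}
print(random_neighbour(start, random.Random(0)))
```

Restricting each move to a neighbouring value keeps successive configurations close together, which suits an accept-or-revert search.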

Figure: Probability of Acceptance versus Simulation Index (0 to 250) for gain = -2, gain = -3, gain = -5, and gain = -10, with a Probability = 0.5 reference line.

Equation 2:

$p = \exp\left(\frac{\mathrm{Gain\_Eff} \times s}{10}\right)$

$s = \log_{10}(\mathrm{num\_sim}) + 1$
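To get a feel for the acceptance rule, the sketch below evaluates it for the four gains shown in the plot. It assumes the rule reads p = exp(Gain_Eff * s / 10) with s = log10(num_sim) + 1; the function name is hypothetical.

```python
import math


def acceptance_probability(gain_eff, num_sim):
    """p = exp(gain_eff * s / 10) with s = log10(num_sim) + 1 (assumed form)."""
    s = math.log10(num_sim) + 1
    return math.exp(gain_eff * s / 10)


# Tabulate the probability at the start, middle, and end of the run.
for gain_eff in (-2, -3, -5, -10):
    row = [round(acceptance_probability(gain_eff, n), 3) for n in (1, 50, 250)]
    print(gain_eff, row)
```

Under this reading, the first simulation has s = 1, so a gain of -2 is accepted with probability exp(-0.2), about 0.82. The probability falls as the simulation index grows and as the gain becomes more negative.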

The final configuration from simulated annealing is further optimized using the heuristic approach.

The heuristic approach is based on the operating principle of superscalar architecture.

Sr No | Configuration Change | Monitored Result | dir=0 | dir=1
1 | Branch Table | Branch_Misses | Incr | Dec
2 | BTB | Gain | Incr | Dec
3 | Return Address Stack | Gain | Incr | Dec
4 | IFQ, Exec Win, I ALU | IFQ_full, Eff_Gain, IPB | Incr | Dec
5 | I ALU | Gain | Incr | Dec
6 | I Mul/Div | Gain | Incr | Dec
7 | FP ALU | Gain | Incr | Dec
8 | FP Mul/Div | Gain | Incr | Dec
9 | RUU | Gain | Dec | Incr
10 | LSQ | Gain | Dec | Incr
11 | I-Compress | Gain | En | En
12 | I-Cache | Gain | Dec | Incr
13 | D-Cache | Gain | Dec | Incr
14 | Instruction TLB | Gain | Dec | Incr
15 | Data TLB | Gain | Dec | Incr
16 | Bus Width | Gain | Incr | Dec
17 | Memory To System Ports | Gain | Incr | Dec
18 | Exit Stage | Gain | Nil | Nil
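One possible reading of a single row of this table is a greedy one-parameter search: move in the dir = 0 direction while the monitored result improves, and try the dir = 1 direction only if the first gave no improvement. The slides do not spell out this procedure, so the sketch below is an assumption throughout; `tune_parameter` and `evaluate` are hypothetical names.

```python
def tune_parameter(config, name, values, evaluate, first_step=+1):
    """Greedy search on one parameter: try `first_step` (dir = 0), then the opposite (dir = 1)."""
    best = dict(config)
    best_score = evaluate(best)
    for step in (first_step, -first_step):        # dir = 0, then dir = 1
        improved = False
        while True:
            index = values.index(best[name]) + step
            if not 0 <= index < len(values):
                break                              # ran off the end of the allowed values
            trial = dict(best)
            trial[name] = values[index]
            score = evaluate(trial)
            if score <= best_score:
                break                              # no gain: stop moving this way
            best, best_score, improved = trial, score, True
        if improved:
            break                                  # dir = 0 helped, so dir = 1 is skipped
    return best, best_score


# Toy usage: a made-up score that peaks at 4 integer ALUs.
result, score = tune_parameter(
    {"I ALU": 1}, "I ALU", [1, 2, 4, 8],
    evaluate=lambda c: -abs(c["I ALU"] - 4),
)
print(result, score)
```

Rows listed as Incr then Dec would use `first_step=+1`; rows listed as Dec then Incr, such as RUU or LSQ, would use `first_step=-1`.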

Optimization Results

IFFT Operation Scale=40 (High precedence given to efficiency)

Figure: Gain versus simulation index (y-axis from -150 to 350).

Figure: IPC and APPI versus simulation index (y-axis from 0 to 25).

Results Summary

Optimized Configuration performance measures

Instructions per Cycle: 1.1934

Average Power per Instruction: 4.6744

Instructions per second (1 GHz): 1.193421 G

Transistor Count: 10,645,929

Transistor Count for Pentium III: 9,500,000
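The throughput figure follows directly from the IPC and the 1 GHz clock, and the transistor counts can be compared the same way. The short check below uses only the numbers quoted in the summary; the variable names are illustrative.

```python
ipc = 1.1934                          # instructions per cycle, from the summary
clock_hz = 1e9                        # 1 GHz clock, as stated
transistors = 10_645_929              # optimized configuration
pentium_iii_transistors = 9_500_000   # reference figure from the summary

instructions_per_second = ipc * clock_hz
ratio = transistors / pentium_iii_transistors

print(f"{instructions_per_second / 1e9:.4f} G instructions per second")
print(f"{ratio:.3f} times the Pentium III transistor count")
```

The optimized configuration therefore uses roughly 12% more transistors than the quoted Pentium III count.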

Configuration Parameter | IFFT
Instruction Fetch Queue | 32
Branch Table Size | 32768
Return Address Stack | 16
Branch Target Buffer | 1024
Instruction Decode Width | 32
Instruction Issue Width | 2
Instruction Commit Width | 32
Register Update Unit | 16
Load Store Queue | 8
D Cache | 2 KB
I Cache | 4 KB
Memory Bus Width | 64 bytes
Instruction TLB | 32 KB
Data TLB | 16 KB
Integer ALUs | 4
Integer Mul/Div | 1
Memory to System Ports | 2
Floating Point ALU | 1
Floating Point Mul/Div | 4
