Branch Predictor Design for AE64000

Branch Predictor Design for AE64000. Lynn Choi, Department of Electronics and Computer Engineering, Korea University, [email protected]. Session: 5D, Paper: 8


Transcript of Branch Predictor Design for AE64000

Page 1: Branch Predictor Design for AE64000

Branch Predictor Design for AE64000

Lynn Choi

Department of Electronics and Computer Engineering

Korea University

[email protected]

Session: 5D Paper: 8

Page 2: Branch Predictor Design for AE64000

Motivation

Demand for high-performance embedded processors
- High-end embedded applications
- Many uses of embedded processors

Addition of a branch predictor
- To achieve higher performance
- The most cost-effective method

Page 3: Branch Predictor Design for AE64000

AE64000 Characteristics

IFU to minimize the performance loss caused by LERIs
- Two additional pipeline stages (IFU1 + IFU2) to eliminate LERIs
- 3 line buffers to store 12 instructions
- PrePC in the IFU and PC in the pipeline core

Branch misprediction penalty: 3 cycles

Page 4: Branch Predictor Design for AE64000

Branch Predictor Design for AE64000

Issues in branch predictor design for AE64000

- AE64000 has two additional stages (IFU1-IFU2) in front of the 5-stage pipeline core.
- At which pipeline stage should prediction be performed? In the IFU1 stage.
- Because of the line buffers in the IFU, predicted target addresses must be buffered as well so that branch prediction results can be verified.
  - This requires buffers for predicted branch target addresses (PTAB).
- Since 4 instructions are fetched at a time, multiple branches can be fetched at a time as well.
  - Only the first taken branch will be predicted.
  - To do that, the TAC holds the precise target address.
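The "only the first taken branch is predicted" rule can be illustrated with a minimal Python sketch. The slides do not give the instruction width or the lookup interface, so `INSTR_BYTES`, the `predicted_taken` callback, and the dictionary used as a stand-in for the TAC are all assumptions made for illustration.

```python
INSTR_BYTES = 2   # assumption: fixed 16-bit instructions (not stated in the slides)
FETCH_WIDTH = 4   # from the slide: 4 instructions are fetched at a time


def predict_fetch_group(fetch_pc, predicted_taken, tac):
    """Scan one fetch group in program order.

    predicted_taken: hypothetical callback, pc -> bool (direction prediction).
    tac: hypothetical dict mapping branch pc -> target address.
    Returns (next_fetch_pc, slot_of_predicted_branch or None).
    """
    for slot in range(FETCH_WIDTH):
        pc = fetch_pc + slot * INSTR_BYTES
        # Redirect only on the first branch that is predicted taken
        # and whose target is available.
        if predicted_taken(pc) and pc in tac:
            return tac[pc], slot
    # No predicted-taken branch: continue with the next sequential group.
    return fetch_pc + FETCH_WIDTH * INSTR_BYTES, None


# Two branches in the same group; only the earlier one redirects fetch.
tac = {0x104: 0x200, 0x106: 0x300}
next_pc, slot = predict_fetch_group(0x100, lambda pc: pc in tac, tac)
print(hex(next_pc), slot)
```

In this sketch later branches in the same group are simply ignored once an earlier one redirects fetch, which matches the slide's statement but leaves out how the hardware identifies branch slots.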

Branch misprediction penalty

- Can be reduced from 3 to 2 cycles by adding a MUX in the IFU so that PPC is updated in the same cycle as PC

Page 5: Branch Predictor Design for AE64000

Branch Predictor for AE64000

Separate BPT with TAC

PTAB to store predicted target address for instructions in the line buffer

Branch prediction verification in the ID stage

Page 6: Branch Predictor Design for AE64000

Predicted Target Address Buffer

Predicted Target Address Buffer (PTAB): holds predicted targets for branch instructions in the line buffer

When we send a branch instruction to the pipeline core, we also send the corresponding predicted target address
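A minimal Python sketch of how such a buffer might behave is given below. It models the PTAB as a FIFO that travels alongside the line buffer; the capacity of 12, the method names, and the check function are assumptions for illustration, since the slides do not specify the PTAB organization.

```python
from collections import deque


class PTAB:
    """Hypothetical FIFO of (branch_pc, predicted_target) pairs."""

    def __init__(self, capacity=12):  # assumption: one slot per line-buffer instruction
        self.capacity = capacity
        self.entries = deque()

    def push(self, branch_pc, predicted_target):
        # Called at prediction time, when the branch enters the line buffer.
        if len(self.entries) >= self.capacity:
            raise OverflowError("PTAB full; fetch would have to stall")
        self.entries.append((branch_pc, predicted_target))

    def pop_for(self, branch_pc):
        # Called when the branch leaves the line buffer for the pipeline core.
        pc, target = self.entries.popleft()
        if pc != branch_pc:
            raise ValueError("PTAB out of sync with the line buffer")
        return target


def verify_in_id(predicted_target, actual_next_pc):
    """Check the prediction in the ID stage; True means it was correct."""
    return predicted_target == actual_next_pc


ptab = PTAB()
ptab.push(0x104, 0x200)
predicted = ptab.pop_for(0x104)
print(verify_in_id(predicted, 0x200))
```

Keeping the predicted target with the instruction lets the ID stage compare it against the resolved address without looking the predictor up again.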

Page 7: Branch Predictor Design for AE64000

Simulation Environment

Developed a cycle-accurate AE64000 simulator
- Simulated 1 billion instructions: 30 minutes on a 1.6 GHz P4 with 512 MB RAM
- Indirect branches are not predicted in the simulation
- Input: AE64000 compiler binary, memory and predictor configuration parameters
- Output: IPC, BPT/TAC hit ratios, etc.

Benchmarks
- SPECint95 (compress, go)
- Dhrystone
- Whetstone

Predictors tested
- Last-time predictor
- Bimodal predictor
- G-share predictor
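For readers unfamiliar with the three schemes, the following Python sketch shows their textbook forms. It is not the simulator's code: the table size, the 16-bit instruction alignment used for indexing, and the initial counter value are assumptions chosen for illustration.

```python
class LastTime:
    """1-bit scheme: predict whatever the branch did last time."""

    def __init__(self, entries=256):        # assumption: table size
        self.mask = entries - 1
        self.table = [False] * entries

    def index(self, pc):
        return (pc >> 1) & self.mask         # assumption: 16-bit instructions

    def predict(self, pc):
        return self.table[self.index(pc)]

    def update(self, pc, taken):
        self.table[self.index(pc)] = taken


class Bimodal(LastTime):
    """2-bit saturating counters; values 2 and 3 predict taken."""

    def __init__(self, entries=256):
        super().__init__(entries)
        self.table = [1] * entries           # start weakly not-taken

    def predict(self, pc):
        return self.table[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        c = self.table[i]
        self.table[i] = min(c + 1, 3) if taken else max(c - 1, 0)


class GShare(Bimodal):
    """Bimodal counters indexed by PC XOR global branch history."""

    def __init__(self, entries=256):
        super().__init__(entries)
        self.history = 0

    def index(self, pc):
        return ((pc >> 1) ^ self.history) & self.mask

    def update(self, pc, taken):
        super().update(pc, taken)            # uses the pre-update history
        self.history = ((self.history << 1) | int(taken)) & self.mask


def mispredictions(predictor, trace):
    """Count wrong predictions over a list of (pc, taken) outcomes."""
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch that is taken three times, then falls through, repeated.
trace = [(0x100, t) for _ in range(10) for t in (True, True, True, False)]
for cls in (LastTime, Bimodal, GShare):
    print(cls.__name__, mispredictions(cls(), trace))
```

On this toy loop trace the last-time scheme mispredicts twice per loop exit while the counter-based schemes do better; real benchmark behavior depends on the branch mix, which is what the simulation measures.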

Simulator Block Diagram

Page 8: Branch Predictor Design for AE64000

Simulation Results

Without branch predictor (IPC)

Classification   3-cycle misprediction penalty   2-cycle misprediction penalty
Compress         0.6787                          0.7429
Go               0.6905                          0.7322
Dhrystone        0.6500                          0.7200
Whetstone        0.5569                          0.6521
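The relative gain from the shorter penalty can be read directly off these numbers. The short Python snippet below computes it per benchmark from the IPC values reported above; the unweighted arithmetic mean is an assumption about how to summarize, since the slides do not say how their overall figure was averaged.

```python
# IPC without a branch predictor: (3-cycle penalty, 2-cycle penalty)
ipc = {
    "Compress":  (0.6787, 0.7429),
    "Go":        (0.6905, 0.7322),
    "Dhrystone": (0.6500, 0.7200),
    "Whetstone": (0.5569, 0.6521),
}


def gain_percent(ipc_3cycle, ipc_2cycle):
    """Relative IPC improvement, in percent, from the shorter penalty."""
    return (ipc_2cycle / ipc_3cycle - 1.0) * 100.0


gains = {name: gain_percent(*values) for name, values in ipc.items()}
mean_gain = sum(gains.values()) / len(gains)   # assumption: unweighted mean

for name, gain in gains.items():
    print(f"{name:10s} {gain:6.2f}%")
print(f"{'Mean':10s} {mean_gain:6.2f}%")
```

The per-benchmark gains range from about 6% (Go) to about 17% (Whetstone), and the unweighted mean is a little under 11%.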

Page 9: Branch Predictor Design for AE64000

Simulation Results

Last-time branch predictor

Page 10: Branch Predictor Design for AE64000

Simulation Results (cont’d)

Bimodal Branch Predictor

Page 11: Branch Predictor Design for AE64000

Simulation Results (cont’d)

G-share Branch Predictor

Page 12: Branch Predictor Design for AE64000

Conclusion

Simulation result analysis
- Consider both performance and area
- The additional performance gain from the g-share and bimodal predictors is negligible relative to their size and complexity.

Final design
- Last-time predictor with a 4-way set-associative, 8-entry TAC with LRU replacement
  - IPC improves by 10% when the branch misprediction penalty is reduced from 3 to 2 cycles
  - Additional 15% IPC improvement from the branch predictor
- About 11,500 gates (about 2.64% of the area) in the Verilog HDL model
  - Thus, we can improve the performance of AE64000 by 25% at less than 3% area cost
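As an illustration of the TAC organization named in the final design, here is a small Python sketch of a 4-way set-associative, 8-entry cache with LRU replacement. Only those three parameters come from the slides; the set-index function, the use of the full PC as the tag, and the method names are assumptions.

```python
from collections import OrderedDict


class TAC:
    """Hypothetical model: 8 entries, 4 ways, so 2 sets, with LRU per set."""

    def __init__(self, entries=8, ways=4):
        self.ways = ways
        self.num_sets = entries // ways
        # Each set keeps pc -> target, ordered from least to most recently used.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def _set_for(self, pc):
        # assumption: 16-bit instructions, low PC bits select the set
        return self.sets[(pc >> 1) % self.num_sets]

    def lookup(self, pc):
        s = self._set_for(pc)
        if pc in s:
            s.move_to_end(pc)        # a hit makes the entry most recently used
            return s[pc]
        return None                  # miss: no target available

    def insert(self, pc, target):
        s = self._set_for(pc)
        if pc in s:
            s.move_to_end(pc)
        elif len(s) >= self.ways:
            s.popitem(last=False)    # evict the least recently used entry
        s[pc] = target


tac = TAC()
for pc in (0x100, 0x104, 0x108, 0x10C):   # all four map to set 0 here
    tac.insert(pc, pc + 0x40)
tac.lookup(0x100)                          # touch 0x100 so it is not the LRU entry
tac.insert(0x110, 0x150)                   # fifth branch in the set evicts 0x104
print(tac.lookup(0x104), hex(tac.lookup(0x100)))
```

With only two sets, five hot branches that map to the same set are enough to cause an eviction, which is the kind of behavior the TAC hit ratios reported by the simulator would capture.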