Branch Predictor Design for AE64000

Lynn Choi

Department of Electronics and Computer Engineering

Korea University

[email protected]

Session: 5D Paper: 8

Motivation

Demand for high-performance embedded processors

– High-end embedded applications

– Many uses of embedded processors

Addition of a branch predictor

– To achieve higher performance

– The most cost-effective method

AE64000 Characteristics

IFU to minimize the performance decrease caused by LERI instructions

– Two additional pipeline stages (IFU1 + IFU2) to eliminate LERI instructions

– 3 line buffers to store 12 instructions

– PrePC in the IFU and PC in the pipeline core

Branch misprediction penalty: 3 cycles

Branch Predictor Design for AE64000

Issues in branch predictor design for AE64000

AE64000 has two additional stages (IFU1 and IFU2) in front of the 5-stage pipeline core.

At which pipeline stage should prediction be performed?

IFU1 stage

Because of the line buffers in the IFU, predicted target addresses must be buffered as well so that branch prediction results can be verified

– Needs a buffer for predicted branch target addresses (PTAB)

Since 4 instructions are fetched at a time, multiple branches can be fetched at a time as well.

Only the first taken branch will be predicted.

To do that, the TAC holds the precise target address.
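The "first taken branch" rule above can be sketched as follows. This is only an illustration, not the AE64000 logic itself: the function name, the per-slot inputs, and the 2-byte instruction size are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

FETCH_WIDTH = 4  # instructions fetched per cycle, as stated on the slide


def select_prediction(
    fetch_pc: int,
    predicted_taken: Sequence[bool],
    tac_targets: Sequence[Optional[int]],
    instr_bytes: int = 2,  # assumption: fixed-size 2-byte instructions
) -> Tuple[Optional[int], int]:
    """Return (slot of the predicted branch, next fetch address).

    Only the first slot that is predicted taken AND has a target in the
    TAC redirects the fetch; later branches in the group are ignored.
    """
    for slot in range(FETCH_WIDTH):
        if predicted_taken[slot] and tac_targets[slot] is not None:
            return slot, tac_targets[slot]
    # No predicted-taken branch: continue with the next sequential group.
    return None, fetch_pc + FETCH_WIDTH * instr_bytes
```

With two predicted-taken branches in slots 1 and 3, only slot 1 redirects the fetch; the branch in slot 3 is not predicted in that cycle.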

Branch misprediction penalty

Can be reduced from 3 to 2 cycles by adding a MUX in the IFU so that PPC is updated in the same cycle as PC

Branch Predictor for AE64000

BPT separate from the TAC

PTAB to store predicted target address for instructions in the line buffer

Branch prediction verification in the ID stage

Predicted Target Address Buffer (PTAB)

For branch instructions in the line buffer

When we send a branch instruction to the pipeline core, we also send the corresponding predicted target address
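A minimal sketch of how a PTAB could travel alongside the line buffer, assuming a simple FIFO keyed by branch address. The class and function names, the FIFO organization, and the capacity of 12 (borrowed from the 12-instruction line buffers) are assumptions for illustration; the slides do not give the actual organization.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class PTAB:
    """Hypothetical FIFO of (branch_pc, predicted_target) pairs."""

    def __init__(self, capacity: int = 12):  # assumption: one slot per buffered instruction
        self.capacity = capacity
        self._entries: Deque[Tuple[int, int]] = deque()

    def push(self, branch_pc: int, predicted_target: int) -> bool:
        """Record a prediction made in IFU1; False if the buffer is full."""
        if len(self._entries) >= self.capacity:
            return False
        self._entries.append((branch_pc, predicted_target))
        return True

    def pop_for(self, branch_pc: int) -> Optional[int]:
        """Hand over the predicted target when the branch leaves the line buffer."""
        if self._entries and self._entries[0][0] == branch_pc:
            return self._entries.popleft()[1]
        return None  # this branch was not predicted taken


def verify_in_id(predicted_target: Optional[int], fallthrough_pc: int, actual_next_pc: int) -> bool:
    """ID-stage check: True when the fetched path matches the resolved one."""
    predicted_next = predicted_target if predicted_target is not None else fallthrough_pc
    return predicted_next == actual_next_pc
```

A failed check in the ID stage is what triggers the misprediction penalty discussed earlier.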

Simulation Environment

Developed a cycle-accurate AE64000 simulator

– Simulated 1 billion instructions in 30 minutes on a 1.6 GHz P4 with 512 MB RAM

Indirect branches are not predicted in the simulation

Input: AE64000 compiler binary, memory and predictor configuration parameters

Output: IPC, BPT/TAC hit ratios, etc.

Benchmarks: SPECint95 (compress, go), Dhrystone, Whetstone

Predictors tested: last-time predictor, bimodal predictor, g-share predictor
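For reference, the three predictor types can be sketched in their textbook form. Table sizes, history length, initial counter values, and the PC hashing below are illustrative assumptions, not the configurations that were simulated.

```python
class LastTime:
    """1-bit predictor: predict whatever the branch did last time."""

    def __init__(self, entries: int = 64):
        self.table = [False] * entries
        self.mask = entries - 1  # entries assumed to be a power of two

    def predict(self, pc: int) -> bool:
        return self.table[(pc >> 1) & self.mask]

    def update(self, pc: int, taken: bool) -> None:
        self.table[(pc >> 1) & self.mask] = taken


class Bimodal:
    """2-bit saturating counters indexed by PC; counter >= 2 means taken."""

    def __init__(self, entries: int = 64):
        self.table = [1] * entries  # start weakly not-taken
        self.mask = entries - 1

    def _index(self, pc: int) -> int:
        return (pc >> 1) & self.mask

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)


class GShare(Bimodal):
    """Same counters, but indexed by PC XOR global branch history."""

    def __init__(self, entries: int = 64, history_bits: int = 6):
        super().__init__(entries)
        self.history = 0
        self.hmask = (1 << history_bits) - 1

    def _index(self, pc: int) -> int:
        return ((pc >> 1) ^ self.history) & self.mask

    def update(self, pc: int, taken: bool) -> None:
        super().update(pc, taken)  # uses the history seen at prediction time
        self.history = ((self.history << 1) | int(taken)) & self.hmask


def mispredictions(predictor, trace) -> int:
    """Count wrong predictions over a list of (pc, taken) pairs."""
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses
```

On a toy trace of one loop branch repeating taken, taken, taken, not-taken, the last-time predictor misses twice per period, the bimodal predictor once, and g-share learns the pattern after warm-up. This shows the mechanisms only; it says nothing about the benchmark results reported in the slides.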

Simulator Block Diagram

Simulation Results

Without branch predictor (IPC)

Benchmark    3-cycle misprediction penalty    2-cycle misprediction penalty
Compress     0.6787                           0.7429
Go           0.6905                           0.7322
Dhrystone    0.6500                           0.7200
Whetstone    0.5569                           0.6521
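The relative IPC gain from shortening the penalty can be worked out directly from the figures above. The snippet below is just that arithmetic; the variable and function names are made up for the example.

```python
# IPC without a branch predictor: (3-cycle penalty, 2-cycle penalty)
ipc = {
    "Compress": (0.6787, 0.7429),
    "Go": (0.6905, 0.7322),
    "Dhrystone": (0.6500, 0.7200),
    "Whetstone": (0.5569, 0.6521),
}


def gain_percent(base: float, new: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


gains = {name: gain_percent(p3, p2) for name, (p3, p2) in ipc.items()}
mean_gain = sum(gains.values()) / len(gains)

for name, g in gains.items():
    print(f"{name}: {g:.1f}%")
print(f"Mean: {mean_gain:.1f}%")
```

The per-benchmark gains range from about 6% (Go) to about 17% (Whetstone), with an arithmetic mean near 10.8%.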

Simulation Results

Last-time branch predictor

Simulation Results (cont’d)

Bimodal Branch Predictor

Simulation Results (cont’d)

G-share Branch Predictor

Conclusion

Simulation result analysis

Consider both performance and area

The additional performance gain from the g-share and bimodal predictors is negligible compared to their size and complexity.

Final design

Last-time predictor with a 4-way set-associative, 8-entry TAC with LRU replacement

– IPC is improved by 10% by reducing the branch misprediction penalty from 3 to 2 cycles

– Additional 15% IPC improvement by branch predictor

About 11,500 gates (about 2.64% of the area) in the Verilog HDL model

– Thus, we can improve the performance of AE64000 by 25% with less than 3% cost
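As an illustration of the chosen TAC organization, an 8-entry, 4-way set-associative cache has 2 sets of 4 ways each. The sketch below models LRU replacement with an ordered dictionary per set; the class name, the use of the full PC as the tag, and the set-index hash are assumptions, not details from the Verilog model.

```python
from collections import OrderedDict
from typing import List, Optional


class TAC:
    """Hypothetical target address cache: 8 entries, 4 ways, LRU per set."""

    def __init__(self, entries: int = 8, ways: int = 4):
        self.ways = ways
        self.num_sets = entries // ways  # 8 / 4 = 2 sets
        # Each set maps branch PC -> target, ordered from LRU to MRU.
        self.sets: List["OrderedDict[int, int]"] = [OrderedDict() for _ in range(self.num_sets)]

    def _set_for(self, pc: int) -> "OrderedDict[int, int]":
        return self.sets[(pc >> 1) % self.num_sets]  # assumed index hash

    def lookup(self, pc: int) -> Optional[int]:
        s = self._set_for(pc)
        if pc in s:
            s.move_to_end(pc)  # a hit makes the entry most recently used
            return s[pc]
        return None

    def insert(self, pc: int, target: int) -> None:
        s = self._set_for(pc)
        if pc in s:
            s.move_to_end(pc)
        elif len(s) >= self.ways:
            s.popitem(last=False)  # evict the least recently used way
        s[pc] = target
```

Inserting a fifth branch into a full set evicts whichever of the four resident branches was used least recently.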
