Branch Prediction Contest: Implementation of Piecewise Linear Prediction Algorithm
Prosunjit Biswas
Department of Computer Science, University of Texas at San Antonio
Abstract
Branch predictor accuracy is critical to harnessing the instruction-level parallelism (ILP) available in programs and thus to improving the performance of today's microprocessors, especially superscalar processors. Among branch predictors, neural predictors such as the Scaled Neural Analog Predictor (SNAP) and the Piecewise Linear Branch Predictor outperform other state-of-the-art predictors. In this final project for the Computer Architecture course (CS-5513), I studied various neural predictors and implemented the Piecewise Linear Branch Predictor following the algorithm given in a research paper by Dr. Daniel A. Jimenez. The hardware budget for this project is restricted, and I implemented the predictor within a predefined budget of 64K of memory. I am also competing in the branch prediction contest.

Keywords: Piecewise Linear, Neural Network, Branch Prediction.
I. INTRODUCTION
Neural branch predictors are among the most accurate predictors in the literature, but they were long considered impractical due to the high latency associated with prediction. This latency stems from the complex computation required to determine the excitation of an artificial neuron [3]. Piecewise Linear Branch Prediction [1] improved both accuracy and latency over previous neural predictors. It works by developing a set of linear functions, one for each program path to the branch being predicted, that separate predicted-taken from predicted-not-taken branches. In the paper "Piecewise Linear Branch Prediction", Daniel A. Jimenez proposed two versions of the prediction algorithm: i) an idealized piecewise linear branch predictor and ii) a practical piecewise linear branch predictor. This project focuses on the idealized predictor.
II. RELATED WORKS
The perceptron predictor was one of the first attempts in branch prediction history to perform branch prediction with a neural network. It reduced the misprediction rate on a composite trace of SPEC2000 benchmarks by 14.7% [2]. Unfortunately, this predictor was impractical due to its high latency.
Fast Path-Based Neural Branch Prediction [4] is another attempt, combining path and pattern history to overcome the limitations of preexisting neural predictors. It improved accuracy over previous neural predictors while achieving significantly lower latency, increasing the IPC of an aggressively clocked microarchitecture by 16% over the earlier perceptron predictor. The Scaled Neural Analog Predictor, or SNAP, is another recently proposed neural branch predictor; it uses the concept of piecewise linear branch prediction and relies on a mixed analog/digital implementation, reducing both latency and power consumption compared to other neural predictors [5]. Fig. 1 (courtesy of "An Optimized Scaled Neural Branch Predictor" by Daniel A. Jimenez) shows the comparative performance of notable branch prediction approaches on a set of SPEC CPU 2000 and 2006 integer benchmarks.
III. THE ALGORITHM
Fig. 1: Performance of different branch predictors on SPEC CPU 2000 and 2006 integer benchmarks (courtesy of "An Optimized Scaled Neural Branch Predictor" by Daniel A. Jimenez)

The branch predictor algorithm has two major parts: i) the prediction algorithm and ii) the train/update algorithm. Before going into the implementation of these two algorithms, we discuss the state and variables they use. The three-dimensional array W is the data structure used to store the branch weights; it is used in both the prediction and the update algorithm.
Fig. 2: The array W with its corresponding indices
The branch address is generally taken as the last 8-10 bits of the instruction address. For each branch being predicted, the algorithm keeps a history of all other branches that precede it on the dynamic path to the branch. The second dimension, indexed by the array GA, tracks this per-branch dynamic path history. The third dimension, indexed by the position i (with outcome GHR[i]), tracks the position of the address GA[i] in the global branch history register GHR. Some of the important variables of the algorithm are listed here for clarity:

- GA: an array of addresses that keeps the path history associated with each branch address. As each new branch executes, its address is shifted into the first position of the array.
- GHR: an array of Boolean (taken/not-taken) values that tracks the outcomes of recent branches.
- H: the length of the history register.
- output: an integer value computed by the prediction algorithm to predict the current branch.

Table I: The prediction algorithm

```cpp
branch_update *predict (branch_info &b) {
    bi = b;
    if (b.br_flags & BR_CONDITIONAL) {
        // index built from bits 2..7 of the branch address (see Section V)
        address = (((b.address >> 4) & 0x0F) << 2) | ((b.address >> 2) & 0x03);
        output = W[address][0][0];               // bias weight
        for (int i = 0; i < H; i++) {
            if (GHR[i])
                output += W[address][GA[i]][i];
            else
                output -= W[address][GA[i]][i];
        }
        u.direction_prediction (output >= 0);
    } else {
        u.direction_prediction (false);
    }
    u.target_prediction (0);
    return &u;
}
```
Table II: The update/train algorithm

```cpp
void update (branch_update *u, bool taken, unsigned int target) {
    if (bi.br_flags & BR_CONDITIONAL) {
        // train when the output magnitude is below theta or the prediction was wrong
        if (abs (output) < theta || ((output >= 0) != taken)) {
            if (taken) {
                if (W[address][0][0] < SAT_VAL)
                    W[address][0][0]++;
            } else {
                if (W[address][0][0] > -SAT_VAL)
                    W[address][0][0]--;
            }
            for (int i = 0; i < H-1; i++) {
                if (GHR[i] == taken) {
                    if (W[address][GA[i]][i] < SAT_VAL)
                        W[address][GA[i]][i]++;
                } else {
                    if (W[address][GA[i]][i] > -SAT_VAL + 1)
                        W[address][GA[i]][i]--;
                }
            }
        }
        shift_update_GA (address);
        shift_update_GHR (taken);
    }
}
```
IV. TUNING PERFORMANCE
Besides the algorithm itself, the MPKI (misses per kilo-instruction) rate depends on the sizes of the dimensions of the array W. I have experimented with MPKI across various dimensions of W; Table III shows the results.

Table III: MPKI of the piecewise linear algorithm within the 64K budget
| W[i][GA[i]][GHR[i]] | MPKI |
| --- | --- |
| W[64][16][64] | 3.982 |
| W[128][16][32] | 4.217 |
| W[64][8][128] | 4.292 |
| W[32][16][128] | 5.807 |
| W[64][64][16] | 4.826 |
The table shows that the predictor performs best when the i, GA[i], and GHR[i] dimensions have 64, 16, and 64 entries, respectively.
V. TWEAKING INSTRUCTION ADDRESS
I have found that, rather than taking the lowest bits of the address directly, discarding the 2 least significant bits of the address and then taking the next bits (bits 3-8) makes the predictor predict more accurately. This decreases aliasing and thus improves the prediction rate slightly.
Fig. 3: Tweaking the branch address for a performance speedup
VI. RESULT
The misprediction rate of the benchmarks under the piecewise linear algorithm is shown in Fig. 4. Fig. 5 shows a comparison of different prediction algorithms (piecewise linear, perceptron, and gshare) across the given benchmarks.
Fig. 4: Misprediction rate of different benchmarks (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 201.compress, 202.jess, 205.raytrace, 209.db, 213.javac, 222.mpegaudio, 227.mtrt, 228.jack, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf) using the piecewise linear prediction algorithm
Fig. 5: Comparison of prediction algorithms across different benchmarks under the given 64K budget
VII. 64K BUDGET CALCULATION
I have limited the implementation of the piecewise linear prediction algorithm to 64K + 256 bytes of memory. The algorithm performs better as the memory limit increases. Table IV shows the calculation of the 64K + 256 byte budget.
Table IV: 64K (65,536 byte) memory budget calculation

| Data structure / variable | Memory calculation | Bytes |
| --- | --- | --- |
| W[64][16][63], each entry 1 byte | 64 × 16 × 63 | 64,512 bytes |
| Constants (SIZE, H, SAT_VAL, theta, N) | 5 × 1 byte (each value < 128) | 5 bytes |
| GA | 63 entries × 6 bits / 8 | 48 bytes |
| GHR | 63 entries × 1 bit / 8 | 8 bytes |
| Variables (address, output) | 2 × 4 bytes | 8 bytes |
| Total | | 64,581 bytes |
VIII. CONCLUSION
In this individual course final project, I implemented the piecewise linear branch prediction algorithm. My implementation achieved an MPKI of 3.982 at best. I believe the performance of this algorithm could be enhanced further with better implementation tricks. I have also compared the performance of the piecewise linear prediction algorithm with the perceptron and gshare algorithms; under the same memory limit, piecewise linear prediction performs significantly better than the other two.
REFERENCES
[1] Daniel A. Jimenez. Piecewise linear branch prediction. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA-32), June 2005.
[2] D. Jimenez and C. Lin. Dynamic branch prediction with perceptrons. In Proceedings of the Seventh International Symposium on High Performance Computer Architecture, January 2001.
[3] Lakshminarayanan, Arun; Shriraghavan, Sowmya, "Neural Branch Prediction," available at http://webspace.ulbsibiu.ro/lucian.vintan/html/neuralpredictors.pdf
[4] D.A. Jimenez, “Fast Path-Based Neural Branch Prediction,” Proc. 36th Ann. Int’l Symp. Microarchitecture, pp. 243-252, Dec. 2003.
[5] D.A. Jimenez, "An optimized scaled neural branch predictor," Computer Design (ICCD), 2011 IEEE 29th International Conference, pp. 113-118, Oct. 2011.