VHDL Implementation and Performance Analysis of two Division Algorithms
by
Salman Khan
B.S., Sir Syed University of Engineering and Technology, 2010
A Thesis Submitted in Partial Fulfillment of the
Requirements for the Degree of
Master of Applied Science
in the Department of Electrical and Computer Engineering
© Salman Khan, 2015
University of Victoria
All rights reserved. This thesis may not be reproduced in whole or in part, by
photocopying or other means, without the permission of the author.
VHDL Implementation and Performance Analysis of two Division Algorithms
by
Salman Khan
B.S., Sir Syed University of Engineering and Technology, 2010
Supervisory Committee
Dr. Fayez Gebali, Supervisor
(Department of Electrical and Computer Engineering)
Dr. Atef Ibrahim, Member
(Department of Electrical and Computer Engineering)
ABSTRACT
Division is one of the most fundamental arithmetic operations and is used extensively in engineering, scientific, mathematical and cryptographic applications. The implementation of an arithmetic operation such as division is complex and expensive in hardware. Unlike addition and subtraction, division requires several iterative computational steps on the given operands to produce the result. Division has often been perceived as an infrequently used operation and has therefore received less attention, yet it is one of the most difficult operations in computer arithmetic. The technique used to implement such an iterative computation in hardware impacts the speed, area and power of the digital circuit. For this reason, we consider two division algorithms distinguished by their shift step size. Algorithm 1 operates with a fixed shift step size and a fixed number of iterations, while Algorithm 2 operates with a variable shift step size and requires considerably fewer iterations. In this thesis, a technique is provided to save power and speed up the overall computation. The thesis also examines different design goal strategies and presents a comparative study to assess how each of the two designs performs in terms of area, delay and power consumption.
Contents
Supervisory Committee ii
Abstract iii
Table of Contents iv
List of Tables vii
List of Figures viii
Acknowledgements x
Dedication xi
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation for this work . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Division Background 5
2.1 Division Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Division Algorithms Classes . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Digit Recurrence Algorithms . . . . . . . . . . . . . . . . . . . 8
2.2.2 Functional Iteration Algorithms . . . . . . . . . . . . . . . . . 8
2.2.3 Very High Radix Algorithms . . . . . . . . . . . . . . . . . . . 8
2.2.4 Look-up Tables . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.5 Variable Latency Algorithms . . . . . . . . . . . . . . . . . . . 9
2.3 Related work in the area . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Considered Division Algorithms 11
3.1 Division Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.1 Reasons For Considerations . . . . . . . . . . . . . . . . . . . 11
3.1.2 Overview of Operation . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Division Algorithm 1 : Fixed Shift Algorithm . . . . . . . . . . . . . 12
3.2.1 Mode 1 : Range reduction of Y . . . . . . . . . . . . . . . . . 13
3.2.2 Mode 2 : Post processing of Y and Z . . . . . . . . . . . . . . 14
3.3 Division Algorithm 2 : Adaptive Shift Algorithm . . . . . . . . . . . 15
3.3.1 Mode 1 : Range reduction of Y . . . . . . . . . . . . . . . . . 15
3.3.2 Mode 2 : Post processing of Y and Z . . . . . . . . . . . . . . 16
3.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Design and Implementation 18
4.1 Hardware entities for Algorithm 1 . . . . . . . . . . . . . . . . . . . . 18
4.1.1 X, Y and Z Registers . . . . . . . . . . . . . . . . . . . . . . 19
4.1.2 Data Multiplexer . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1.3 Comparator for Y . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.4 The Look-up table . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.5 The ALU unit . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1.6 Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.7 Finite State Machine . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.8 FSM : State transition diagram . . . . . . . . . . . . . . . . . 24
4.2 Hardware entities for Algorithm 2 . . . . . . . . . . . . . . . . . . . . 25
4.2.1 Delta Address Generator . . . . . . . . . . . . . . . . . . . . . 27
4.2.2 DAG Implementation . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.3 Finite State Machine . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.4 FSM : State transition diagram . . . . . . . . . . . . . . . . . 36
4.3 Circuit Implementations . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3.1 Algorithm 1 : Fixed Shift division algorithm . . . . . . . . . . 38
4.3.2 DAG overall layout . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.3 Algorithm 2: Adaptive Shift division algorithm . . . . . . . . 40
4.4 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5 Results and Evaluation 43
5.1 Numerical Simulation using MATLAB . . . . . . . . . . . . . . . . . 43
5.1.1 Numerical Simulation of Algorithm 1 . . . . . . . . . . . . . . 44
5.1.2 Numerical Simulation of Algorithm 2 . . . . . . . . . . . . . . 45
5.2 Hardware Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.1 VHDL Simulation of Algorithm 1 . . . . . . . . . . . . . . . . 46
5.2.2 VHDL Simulation of Algorithm 2 . . . . . . . . . . . . . . . . 48
5.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3.1 Device Utilization . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3.2 Timing Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.3 Power Consumption . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.4 Power-Delay Product . . . . . . . . . . . . . . . . . . . . . . . 54
5.3.5 Area-Delay Product . . . . . . . . . . . . . . . . . . . . . . . 55
5.4 Comparison of Work in Related Area . . . . . . . . . . . . . . . . . . 55
5.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6 Conclusion, Contributions and Future Work 59
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Bibliography 62
7 Additional Information 64
7.1 Interpretation of signals . . . . . . . . . . . . . . . . . . . . . . . . . 65
8 Used Terms and Acronyms 67
List of Tables
Table 4.1 Truth Table when Y is positive . . . . . . . . . . . . . . . . . . 33
Table 4.2 Truth Table when Y is negative . . . . . . . . . . . . . . . . . . 33
Table 5.1 Iterations for Algorithm 1 . . . . . . . . . . . . . . . . . . . . . 44
Table 5.2 Iterations for Algorithm 2 . . . . . . . . . . . . . . . . . . . . . 45
Table 5.3 On-chip device utilization of Algorithm 1 . . . . . . . . . . . . . 51
Table 5.4 On-chip device utilization of Algorithm 2 . . . . . . . . . . . . . 51
Table 5.5 Timing Summary of Algorithm 1 . . . . . . . . . . . . . . . . . 52
Table 5.6 Timing Summary of Algorithm 2 . . . . . . . . . . . . . . . . . 52
Table 5.7 On-chip power consumptions. . . . . . . . . . . . . . . . . . . . 54
Table 5.8 Power-delay product for Algorithm 1 and 2. . . . . . . . . . . . 55
Table 5.9 Area-delay product for Algorithm 1 and 2. . . . . . . . . . . . . 55
Table 5.10 Summary of related work in the area . . . . . . . . . . . . . . . 57
List of Figures
Figure 2.1 Nonzero bits of X and Y at the start of division . . . . . . . . . 7
Figure 2.2 Nonzero bits of X and Y at the end of division . . . . . . . . . 7
Figure 4.1 Algorithm 1 system level . . . . . . . . . . . . . . . . . . . . . . 19
Figure 4.2 Registers X, Y and Z in the bank . . . . . . . . . . . . . . . . 20
Figure 4.3 Data multiplexer for register bank . . . . . . . . . . . . . . . . 21
Figure 4.4 Comparator for Y . . . . . . . . . . . . . . . . . . . . . . . . . 21
Figure 4.5 LUT block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Figure 4.6 ALU block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Figure 4.7 Logical operation of ALU during ith iteration . . . . . . . . . . 23
Figure 4.8 Counter block diagram . . . . . . . . . . . . . . . . . . . . . . . 23
Figure 4.9 Finite State Machine block . . . . . . . . . . . . . . . . . . . . 24
Figure 4.10 State transition diagram for Algorithm 1 . . . . . . . . . . . . 25
Figure 4.11 Algorithm 2 system level . . . . . . . . . . . . . . . . . . . . . 27
Figure 4.12 Delta (δ) Address Generator . . . . . . . . . . . . . . . . . . . 28
Figure 4.13 DAG system level . . . . . . . . . . . . . . . . . . . . . . . . . 28
Figure 4.14 Position finder unit block . . . . . . . . . . . . . . . . . . . . . 29
Figure 4.15 Multiplexer for flag input . . . . . . . . . . . . . . . . . . . . . 30
Figure 4.16 Multiplexer for data input . . . . . . . . . . . . . . . . . . . . 30
Figure 4.17 The Px Register . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Figure 4.18 The number subtractor block in DAG . . . . . . . . . . . . . . 31
Figure 4.19 2-bits scan unit 0 in level 1 . . . . . . . . . . . . . . . . . . . . 32
Figure 4.20 Hierarchical approach between level 1 and 2 . . . . . . . . . . 33
Figure 4.21 Hierarchical arrangement of position finder unit . . . . . . . . 35
Figure 4.22 Finite State Machine block . . . . . . . . . . . . . . . . . . . . 36
Figure 4.23 State transition diagram for Algorithm 2 . . . . . . . . . . . . 37
Figure 4.24 Top level block of fixed shift division algorithm . . . . . . . . . 38
Figure 4.25 Fixed shift division algorithm RTL schematic . . . . . . . . . . 39
Figure 4.26 Delta address generator RTL schematic . . . . . . . . . . . . . 40
Figure 4.27 Top level block of adaptive shift division algorithm . . . . . . . 40
Figure 4.28 Adaptive shift division algorithm RTL schematic . . . . . . . . 41
Figure 5.1 All iterations for Algorithm 1 . . . . . . . . . . . . . . . . . . . 46
Figure 5.2 Iterations 0 to 2 for Algorithm 1 . . . . . . . . . . . . . . . . . 47
Figure 5.3 Iterations 3 to 7 for Algorithm 1 . . . . . . . . . . . . . . . . . 47
Figure 5.4 Iterations 8 to 11 for Algorithm 1 . . . . . . . . . . . . . . . . . 48
Figure 5.5 Iterations 12 to 14 for Algorithm 1 . . . . . . . . . . . . . . . . 48
Figure 5.6 All iterations for Algorithm 2 . . . . . . . . . . . . . . . . . . . 49
ACKNOWLEDGMENTS
In the name of Allah, the Most Gracious and the Most Merciful.
All praises belong to Allah the Merciful for His guidance and blessings, which enabled me to complete this thesis. I would like to thank:
My parents, for their prayers, love, patience, emotional support and assurance in difficult and frustrating moments, and for their constant motivation. Despite financial constraints, they were always ready to support me financially.
My Supervisor, Dr. Fayez Gebali, for all the mentoring and support which enabled me to achieve my academic and research objectives, for helping me cope with off-school problems and settle in as an international student, and for sharing his ideas, concepts and experiences. It would not have been possible to complete my research without his invaluable guidance.
My Committee, Dr. Atef Ibrahim, for devoting precious time and providing valu-
able suggestions to improve the quality of the thesis.
My Manager at BC Hydro, Djordje Atanackovic, for his encouragement and
support to help me focus on my thesis completion.
UVIC ECE Dept admin and lab staff, Kevin Jones, Janice Closson, Paul Fedrigo
and Brent Sirna for assisting me during the course of my degree.
DEDICATION
To my father, Muhammad Khalid Zahid, and my mother, Imtiaz Khalid, for having a lifelong dream to see me achieve my graduate qualification at a world class foreign institution. In difficult times, this proved a key motivating factor and enabled me to maintain focus.
To my grandmother, Rasool Fatima for her countless prayers and believing in me.
To my Supervisor, Dr. Fayez Gebali; he is one of the most knowledgeable, kindest and most helpful people I have known. I wish him the best of health.
Chapter 1
Introduction
1.1 Overview
Implementation of mathematical algorithms, such as those required by a random number generator (RNG), requires complex and expensive arithmetic operations like division and multiplication, along with iterative computations on the given inputs to obtain the required output. The technique used to implement these operations and iterations in hardware significantly impacts the speed, area and power of the hardware. The division of two integers, the divisor and the dividend, results in an integer remainder and an integer quotient. Integer division is one of the most fundamental arithmetic operations and is heavily required in engineering, scientific, mathematical and statistical computations. Implementing and performing the division operation in hardware is complex and expensive, and it consumes more computational power than the addition and subtraction operations. According to [1], division is the most difficult operation in computer arithmetic, and it is a common perception to think of division as an infrequently used operation whose implementation does not receive much attention. Division in modern microprocessors takes many clock cycles; furthermore, the number of clock cycles required for integer division also depends on the operands' values [2], as larger integer operands require more clock cycles to perform the division. The more clock cycles or iterations the divider needs, the more power it consumes and the slower it operates. Ignoring the divider implementation has been shown to result in significant system performance degradation [3]. In applications that employ the division operation, an efficient implementation of the division hardware can significantly improve the overall performance of the system; it is therefore imperative to find the best method of implementing the division algorithm in hardware. A divider with lower heat dissipation is also a desirable attribute in terms of performance and security.
1.2 Motivation for this work
This work is part of on-going research on the design, development and implementation of a low power Pseudo Random Number Generator (PRNG), and it focuses on the implementation and performance analysis of division algorithms to be incorporated in the PRNG. These division algorithms are implemented as co-processor designs which will later be required by the PRNG to implement the mathematical algorithm that generates the random numbers. Although the implementation of the overall PRNG exceeds the scope of this work, the targeted PRNG is based on the Park-Miller algorithm, a fairly popular choice for the generation of random numbers. The algorithm requires an initial seed value, a special prime number, a quotient and a remainder to generate a random number [4]. Two hardware divider designs are considered and implemented to generate the quotient and remainder through a division algorithm for the Park-Miller algorithm so that the random number can be generated by the PRNG.
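As background, the role of division in such a generator can be sketched numerically. This is a hedged illustration of the widely published Park-Miller "minimal standard" generator, using its classic constants (multiplier 16807 and the Mersenne prime modulus 2^31 − 1); it is not the thesis's VHDL design, and the thesis's exact seed and prime parameters may differ:

```python
# Illustrative Park-Miller step: each new value is the remainder of
# (seed * A) divided by the prime M. The quotient/remainder pair is
# exactly what a hardware divider must supply to the PRNG.
A = 16807       # classic Park-Miller multiplier (assumed here)
M = 2**31 - 1   # special prime modulus, 2147483647 (assumed here)

def park_miller(seed):
    """Return the next pseudo-random value from the current seed."""
    product = seed * A
    q = product // M    # quotient, produced by the divider
    r = product - q * M # remainder, becomes the next seed
    return r

seed = park_miller(1)   # -> 16807
```

The remainder feeds back as the next seed, so one division per generated number is required, which is why the divider's speed and power dominate the PRNG's cost.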
The hardware for 32 bit integer division is based on the digit-recurrence, non-restoring division algorithm. The divider designs are later analyzed for their performance and their impact on the parameters that matter for the choice of application. There has been quite a bit of work on hardware dividers with reference to the application of algorithms, particularly dealing with higher radix and floating point implementations. Most researchers compare the performance results of the overall divider in terms of speed and area, while the methodology of implementation, and how changes in implementation affect the performance of fixed point integer division, especially its power consumption, has not been explained very clearly. This motivated us to determine the best implementation of a hardware divider in terms of performance parameters and to study the two dividers to see which one is best suited to a low power, high speed or low cost implementation. Another motivation for this work was to come up with a simplified design approach that would allow new designers and researchers to understand and re-implement integer division in hardware. From an academic and learning point of view, this work enabled an understanding of iterative algorithms, their design and implementation, and state machine synchronization, which are useful skills for anyone learning practical hardware design implementation.
1.3 Contributions
Two division algorithms based on digit-recurrence, non-restoring division are considered and implemented. The first algorithm is called the "fixed shift division algorithm" while the second is the "adaptive shift algorithm". The second algorithm is an improvement over the first in terms of performance. Our work contributes the following:
1. Designed and implemented two signed integer division algorithms for performing the division operation in hardware.
2. Verified the hardware design by developing MATLAB code to confirm the correctness and accuracy of the hardware implemented in VHDL.
3. Compared the performance of the two division algorithms from the viewpoint of device utilization (area), power consumption and timing analysis (delay).
4. Adapted the high-radix technique proposed in [5] for floating point arithmetic to integer arithmetic.
Our work will help designers decide which division implementation to choose for an application specific purpose. If the application demands high speed or low power computation, as in RNGs and cryptographic or encryption processors, then the adaptive shift algorithm is the preferred choice, whereas in applications with area and cost constraints, such as smart cards, the fixed shift algorithm is better suited.
1.4 Thesis Organization
This section outlines the organization of the thesis and presents the reader with a brief summary of the main focus of each chapter.
Chapter 1 introduces the reader to the subject and the scope of the research. The motivation for the research and its contributions, which were the fundamental objectives of the thesis, are discussed.
Chapter 2 describes the background and fundamentals of division in hardware. A brief classification of division algorithms is provided in order to aid the reader in understanding the related previous work done in the area.
Chapter 3 describes our approach towards the division operation. The two considered division algorithms, known as the fixed shift division algorithm and the adaptive shift division algorithm, are presented, and the methodology used to achieve the correct result of the division operation is explained.
Chapter 4 describes the hardware design and implementation. The system hardware entities that are common to both algorithms are explained, as well as the ones that are specific to each of the two algorithms. The circuit implementations of both algorithms are presented.
Chapter 5 contains the results and evaluation of the two algorithms. Numerical simulation results are obtained to verify that the algorithms work, and then the results of hardware simulations (in VHDL) are presented to confirm that the two algorithms have been implemented correctly. Performance analysis of the two algorithms is also conducted in this chapter.
Chapter 6 has the concluding statements and a short description of the work and what was achieved through it.
Chapter 2
Division Background
2.1 Division Fundamentals
There are various references, such as [6][7][8], by authors who have worked on number division. The fundamental principle of division is that the division of a dividend by a divisor can be realized in cycles of shifting and adding (in practice, subtracting), with hardware or software control of the loop that iteratively converges on the correct result of the division through the hardware divider.
In this literature, we refer to Y as the dividend and X as the divisor. We wish to divide the integer Y by a positive integer X; the result of this division operation should be two integers, the quotient and the remainder, denoted by q and r respectively, so that the following equation is satisfied:
Y = qX + r (2.1)
q and r can be expressed as:
q = ⌊Y / X⌋ (2.2)
0 ≤ r < X (2.3)
The floor in eqn (2.2) gives a whole number rounded down to the lower integer; the difference between the actual value and this rounded value is the fractional part. The whole number is the quotient, while the fractional part of the floor function gives us the remainder. Using this concept we can rewrite:
r = Y − qX (2.4)
The above equation states that the remainder r can be obtained by subtracting X from Y a total of q times, until the condition in (2.3) is satisfied; at this point the value of Y is the desired remainder, r. Most hardware dividers operate in this manner, which is very similar to long division by hand: the hardware divider updates the value of Y as per the equation:
Y ← Y − δX (2.5)
Here δ is the partial quotient and the updated value of Y is the partial remainder. The hardware divider, in the same manner as the long division method by hand, keeps track of the quotients by accumulating their values in a register Z:
Z ← Z + δ (2.6)
From (2.5) and (2.6) we see that δX is subtracted from Y while δ is added to Z.
The value of δ can be chosen arbitrarily while still achieving the correct result of the division, provided that the following two conditions are met:
1. The updated value of Y in (2.5) should converge to the range 0 ≤ Y < X, so that it produces the desired remainder. If Y is positive, the factor δX is subtracted; if Y is negative, the factor δX is added to Y.
2. The updated value of Z in (2.6) should be increased or decreased so as to produce the desired quotient.
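The update rules (2.5) and (2.6) under these two conditions can be sketched as follows. This is a minimal software illustration of the generic shift-and-subtract principle, not the thesis's VHDL; for simplicity it uses the smallest valid step δ = 1 at every iteration (real dividers pick larger δ to converge faster):

```python
# Generic iterative division per (2.5) and (2.6): subtract delta*X from a
# positive Y (adding delta to Z), add delta*X to a negative Y (subtracting
# delta from Z), until Y lands in the range 0 <= Y < X.
def divide(Y, X):
    """Return (quotient, remainder) of Y / X for a positive integer X."""
    Z = 0
    while not (0 <= Y < X):
        delta = 1            # any valid step size works; 1 is the simplest
        if Y >= 0:
            Y -= delta * X   # Y <- Y - delta*X   (2.5)
            Z += delta       # Z <- Z + delta     (2.6)
        else:
            Y += delta * X
            Z -= delta
    return Z, Y              # Z is the quotient, Y the remainder

divide(23, 5)    # -> (4, 3)
divide(-23, 5)   # -> (-5, 2), since -23 = (-5)*5 + 2
```

Note that a negative dividend yields a remainder in [0, X), matching condition (2.3), at the cost of extra correction iterations, which is precisely what the post processing mode of the algorithms in Chapter 3 handles in a single step.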
We represent the dividend Y with n bits in 2's complement, so that the range of Y is:
−2^(n−1) ≤ Y < 2^(n−1) (2.7)
Our divisor X is assumed to require only m bits for its representation, such that m ≤ n. Figure 2.1 shows the nonzero bits in Y and X at the start of the division operation. Our goal is to iteratively reduce the nonzero bits of Y to m bits so that Y comes into the range:
0 ≤ Y < X (2.8)
Figure 2.2 shows the nonzero bits of Y at the end of the division operation, where Y stores the value of the remainder, which falls in the range 0 ≤ r < X.
The choice of the value of δ at each iteration used to implement (2.5) and (2.6) is what differentiates the division algorithms that we implement in our work; this will be demonstrated in the chapters to follow.
Figure 2.1: Nonzero bits of X and Y at the start of division
Figure 2.2: Nonzero bits of X and Y at the end of division
2.2 Division Algorithms Classes
Oberman and Flynn presented a taxonomy of division algorithms in [3], which classified the algorithms based on their hardware implementations into five classes: digit recurrence, functional iteration, very high radix, table look-up and variable latency. Many practical division algorithms are hybrids that combine features of several of these classes.
2.2.1 Digit Recurrence Algorithms
Digit recurrence is the simplest and most widely implemented of all division algorithm classes. It uses subtractive methods to deduce digits of the quotient, retiring a fixed number of quotient bits in every iteration; that is, the step size of the bits retired in each iteration is the same. Implementations of digit recurrence algorithms require little complexity and area.
2.2.2 Functional Iteration Algorithms
Functional iteration uses the multiplication operation as the basis of the division operation. It takes advantage of a high speed multiplier to converge to the result quadratically, unlike subtractive division, in which the result is converged upon linearly; this reduces the latency and the number of iteration cycles. Instead of retiring a fixed number of bits per iteration, this class of algorithms retires an increasing number of bits at each iteration.
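The quadratic convergence can be illustrated with the textbook Newton-Raphson reciprocal recurrence x ← x(2 − dx); this is a hedged sketch of the general class, not hardware described in this thesis, and the power-of-two initial guess is an assumption chosen so the iteration converges for any d ≥ 1:

```python
import math

# Newton-Raphson reciprocal: the error satisfies e_{k+1} = e_k^2, so the
# number of correct bits roughly doubles per iteration (quadratic
# convergence), versus the linear convergence of subtractive division.
def reciprocal(d, iterations=6):
    """Approximate 1/d for d >= 1 using only multiplies and subtracts."""
    x = 2.0 ** -math.ceil(math.log2(d))  # initial guess: d*x is in (0.5, 1]
    for _ in range(iterations):
        x = x * (2.0 - d * x)            # each multiply-based step squares the error
    return x

q = 355 * reciprocal(113)                # divide via multiplication: ~355/113
```

Division then reduces to one final multiplication by the dividend, which is why this class pays off only when a fast multiplier is already available.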
2.2.3 Very High Radix Algorithms
Digit recurrence algorithms are suited to low radix division operations; as the radix increases, the hardware and the divisor-multiple selection process become more complicated and consume more area and computation time. A variant that avoids the constraints posed by the higher radix is the very high radix algorithm; the term "very high radix" applies to dividers that retire more than 10 bits in each iteration.
2.2.4 Look-up Tables
When a low-precision quotient is required, it may be feasible to perform division using a look-up table implementation without an iterative algorithm. This implementation uses direct and linear approximation methods to compute the quotient bits. The table can be implemented as a ROM; the advantage of this approach is fast processing, since no arithmetic calculation is needed, but on the down side the size of the look-up table grows exponentially with each bit added for accuracy.
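The trade-off can be sketched in a few lines. This is a hedged, direct-lookup illustration (no approximation step) with a 4-bit operand width chosen only for demonstration; it is not a design from this thesis:

```python
# Table look-up division: precompute every quotient for small operands into
# a ROM-like table, so a lookup replaces all arithmetic at divide time.
BITS = 4  # illustrative operand width; the table has 2^BITS * (2^BITS - 1) entries
ROM = {(y, x): y // x for y in range(2**BITS) for x in range(1, 2**BITS)}

def lut_divide(y, x):
    """Return the quotient by a single memory access, no arithmetic."""
    return ROM[(y, x)]

lut_divide(13, 3)   # -> 4
```

Doubling the operand width to 8 bits would already require 256 × 255 entries, which is the exponential growth penalty noted above.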
2.2.5 Variable Latency Algorithms
Digit recurrence and very high radix algorithms retire a fixed number of bits in every iteration, while functional iteration based algorithms retire an increasing number of bits in every iteration, but all three of these classes complete the operation in a fixed number of cycles. Dividers based on variable latency algorithms, in contrast, perform division in a variable amount of time.
2.3 Related work in the area
The main classes of algorithms for hardware division implementation were highlighted in the previous section, and each methodology has its own applications and benefits. However, digit recurrence algorithms are the most commonly used approach for hardware division implementation, with procedures such as restoring, non-restoring, SRT division (Sweeney, Robertson and Tocher), approximation algorithms, the CORDIC algorithm, the multiplicative algorithm and the continued product algorithm [9]. According to Sutter and Deschamps in [10], binary non-restoring digit recurrence algorithms are the most preferred procedure for FPGA based dividers. The authors of [9] implemented high speed non-restoring division using a high speed adder/subtractor approach to speed up the division operation. Sutter and Deschamps implemented high speed fixed point dividers in [10] based on the utilization of FPGA characteristics such as: adder/subtractors or conditional adders having the same delay as simple adders; the existence of dedicated, fast carry generation and propagation logic; and multiplexers additional to the general purpose LUTs in sequential, combinational and pipelined circuits. Achieving higher speed is desirable in hardware implementation, but some applications may also require power efficiency. Nannarelli and Lang proposed a low power divider [11], discussing power saving techniques such as: re-timing the recurrence, changing redundant representations to reduce the number of flip flops, using gates with lower drive capability, equalizing the paths of the input signals of the blocks to reduce glitches, and switching off inactive blocks.
We focused our implementation of the division algorithms on the non-restoring division methodology, first designing a fixed iteration division algorithm and then utilizing Dr. Gebali's HCORDIC technique [5], an adaptive algorithm methodology based on a hierarchical design, to reduce the number of iterations in the adaptive shift algorithm. Dr. Gebali developed this technique for floating point arithmetic, and we adapted it to make it applicable to integer arithmetic.
2.4 Chapter Summary
This chapter highlighted the basics of division in hardware, which will enable the reader to understand the algorithms we present in Chapter 3. An overview of some of the known division algorithm classes was presented to enable the reader to understand the high level differences between different implementations. Related work in the area of division was also discussed to give the reader additional background for a better understanding of the intended work.
Chapter 3
Considered Division Algorithms
The non-restoring division algorithm is based on retiring a fixed number of quotient bits in each iteration. The basis of our algorithms is the shift, or δ, introduced in the previous chapter. The difference in the size of δ distinguishes our two algorithms, one with a fixed δ and one with an adaptive δ, which we refer to as the fixed shift algorithm and the adaptive shift algorithm respectively.
3.1 Division Approach
3.1.1 Reasons For Considerations
We chose these two division algorithms for the following reasons:
1. They are popular for the implementation of division in integer arithmetic.
2. No multiplier is needed (reducing power and area).
3. No dedicated adder, multiplier or look-up table macro is utilized, so the designs can be implemented in non-Xilinx programmable logic devices; hence these algorithms are not device specific.
4. The algorithms are simple.
3.1.2 Overview of Operation
The two algorithms essentially operate in two modes:
1. Range reduction mode of Y: in this mode, the algorithm takes multiple steps/iterations to reduce the dividend and converge on the result.
2. Post processing mode of Y and Z: this is a single step to process the remainder and quotient when the result of mode 1 does not fall in the desired range.
To begin the operation in mode 1, the sign of the current value of the dividend Y is checked. If the value is negative, the product of δ and the divisor X is added to Y to obtain the next value of Y; if the value of Y is positive, the product δX is subtracted from the current value of Y to obtain the next value of Y. These steps yield the value of the remainder.
The quotient is produced in simultaneous steps: δ is added to or subtracted from the current value of the quotient Z, depending on the operation performed on Y, since the two always have opposite operations performed on them. At each of these steps, the range of Y is also kept in check; if, at the end of the iterations, the value of Y is in the desired range, that value of Y is the remainder and the corresponding value of Z is the quotient.
If the value is not in the range at the end of the range reduction mode, the algorithm jumps to mode 2, a single step that adjusts the range so that we have the correct quotient and remainder at the next step. This methodology is mathematically explained in the next section.
3.2 Division Algorithm 1 : Fixed Shift Algorithm
This algorithm performs a fixed, minimal number of iterative steps to give the quotient and the remainder when we divide Y by X. In our work, Y is a 32-bit signed integer, so n, the number of bits in the dividend, is 32. The divisor X is m = 17 bits long, the minimum width needed by Dr. Gebali for the initial quotient to implement the random number generator.
The sign of X is arbitrary and is therefore assumed to be positive.
The fixed shift division algorithm has the following properties:
1. The required number of iterations is equal to n − m + 1.
2. The sign of the current value of Y determines if the operation needed on the
next iteration is addition or subtraction.
3. The value of Z will converge on to the quotient with the opposite operation to
the operation of Y in property number 2.
4. The δ at every iteration is determined by equation (3.6) below.
3.2.1 Mode 1 : Range reduction of Y
The step size δ is given by the iteration index and not by the intermediate values of Y. Property 1 is applied to Y and Z as per the following equations:
Y^(i+1) = Y^(i) − µ_i δ_i X,  0 ≤ i ≤ n − m    (3.1)
Z^(i+1) = Z^(i) + µ_i δ_i    (3.2)
where the initial values of Y and Z are:
Y^(0) = Y    (3.3)
Z^(0) = 0    (3.4)
The µ_i in equations (3.1) and (3.2) denotes the addition or subtraction operation at a given iteration index i, and δ_i is the step size, given by the following equations:
µ_i = +1 when Y^(i) ≥ 0
µ_i = −1 when Y^(i) < 0    (3.5)

δ_i = 2^(n−m−i),  0 ≤ i ≤ n − m    (3.6)
Once again, it is important to remember that the iteration step size depends on δ and not on the intermediate values of the partial quotient and remainder. This step size is governed by the binary shift and is used by the ALU of the divider to compute the result.
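The recurrences (3.1)-(3.6) can be sketched in software. The following is an illustrative Python model of the range reduction mode (not the thesis's VHDL; the function and variable names are ours):

```python
def range_reduction_fixed(Y, X, n=32, m=17):
    """Mode 1 of the fixed shift algorithm, eqs (3.1)-(3.6)."""
    Z = 0                                # Z^(0) = 0, eq (3.4)
    for i in range(n - m + 1):           # 0 <= i <= n - m
        delta = 1 << (n - m - i)         # delta_i = 2^(n-m-i), eq (3.6)
        mu = 1 if Y >= 0 else -1         # eq (3.5)
        Y -= mu * delta * X              # eq (3.1)
        Z += mu * delta                  # eq (3.2)
    return Y, Z                          # partial remainder and partial quotient
```

For the worked example used later in chapter 5 (Y = 1,176,349 and X = 127,773), this loop ends with Y = 26,392 and Z = 9, i.e. the remainder and quotient are already in range.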
3.2.2 Mode 2 : Post processing of Y and Z
On the completion of Mode 1, the value of the remainder, Y^(n−m+1), needs to fall in the range:
−2^(m−1) ≤ Y^(n−m+1) ≤ 2^(m−1) − 1    (3.7)
This range may not be satisfied due to the following:
1. The value of Y^(n−m+1) is negative.
2. The value of Y^(n−m+1) is positive but greater than X.
In either case, the post processing mode is applied so that the inequality below is satisfied and the correct remainder is achieved:
0 ≤ Y^(n−m+1) < X    (3.8)
The value of the quotient, Z^(n−m+1), also needs to be updated whenever Y is changed. In order to bring the result Y^(n−m+1) into the desired range, the following process needs to be applied:
Y^(n−m+1) = Y^(n−m+1) − µX    (3.9)
Z^(n−m+1) = Z^(n−m+1) + µ    (3.10)
where µ works in the same way as in the range reduction mode to determine the addition or subtraction operation in equations (3.9) and (3.10), based on the following condition:
µ = +1 when Y^(n−m+1) ≥ X
µ = −1 when Y^(n−m+1) < 0    (3.11)
To satisfy (3.8), this process is needed only once. The total number of iterations needed in algorithm 1 is n − m + 1 if the result of the division is achieved in mode 1. If the result is not achieved in mode 1, a total of n − m + 2 iterations is required.
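Combining mode 1 with the single post-processing step of equations (3.9)-(3.11) gives the complete procedure. The sketch below is an illustrative Python model (names are ours, not the thesis's VHDL); for non-negative dividends within range, it agrees with ordinary integer division:

```python
def divide_fixed_shift(Y, X, n=32, m=17):
    """Fixed shift division: mode 1 (eqs 3.1-3.6) then mode 2 (eqs 3.9-3.11)."""
    Z = 0
    for i in range(n - m + 1):           # range reduction of Y
        delta = 1 << (n - m - i)
        mu = 1 if Y >= 0 else -1
        Y -= mu * delta * X
        Z += mu * delta
    if Y < 0 or Y >= X:                  # result not yet in range 0 <= Y < X
        mu = 1 if Y >= X else -1         # eq (3.11)
        Y -= mu * X                      # eq (3.9)
        Z += mu                          # eq (3.10)
    return Z, Y                          # quotient, remainder
```

For the chapter 5 operands this returns the quotient 9 and remainder 26,392, matching Y = 9 × 127,773 + 26,392.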
3.3 Division Algorithm 2 : Adaptive Shift Algorithm
This algorithm does not perform a fixed number of iterative steps to compute the quotient and the remainder; instead, it determines at each iteration the step size δ from the magnitude of the input data. Since the step size of the shift is not fixed, we call this the adaptive shift. This algorithm requires fewer iterations than the fixed shift algorithm. As in our assumptions for the fixed shift algorithm, we consider the divisor X to have m bits and the dividend Y to have n bits, inclusive of the sign bit.
The adaptive shift division algorithm has the following properties:
1. The required number of iterations is determined by the input data.
2. The sign of the current value of Y determines if the operation needed on the
next iteration is addition or subtraction.
3. The value of Z will converge on to the quotient with the opposite operation to
the operation of Y in property number 2.
4. The locations of the most significant bit values of Y and X determine the value of δ at every iteration, by equation (3.17) below.
3.3.1 Mode 1 : Range reduction of Y
The step size δ in the adaptive shift algorithm is obtained from the magnitude of the input data and not from the iteration index, as it was in the fixed shift algorithm. The iterations on Y and Z occur as per the following equations:
Y^(i+1) = Y^(i) − µ_i δ_i X,  0 ≤ i ≤ n − m    (3.12)
Z^(i+1) = Z^(i) + µ_i δ_i    (3.13)
where the initial values of Y and Z are:
Y^(0) = Y    (3.14)
Z^(0) = 0    (3.15)
The µ_i in equations (3.12) and (3.13) denotes the addition or subtraction operation at a given iteration index i, and δ_i is the step size, given respectively by the following equations:
µ_i = +1 when Y^(i) ≥ 0
µ_i = −1 when Y^(i) < 0    (3.16)

δ_i = 2^(Py−Px),  |Y| ≥ X    (3.17)
where Px is the position of the most significant set bit of X; since the sign of X is arbitrary, our notation assumes it to be positive,
while Py is defined as:
Py = position of the most significant 1 when Y > 0
Py = 0 when Y = 0
Py = position of the most significant 0 when Y < 0    (3.18)
When Py ≤ Px, the iterations of the range reduction mode are stopped.
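Equation (3.18) can be modelled compactly in software. In two's-complement form, the most significant 0 of a negative Y sits where the most significant 1 of its bitwise complement ~Y sits, which gives the following illustrative Python sketch (the function name is ours):

```python
def bit_position(Y):
    """P per eq (3.18): MS 1 for Y > 0, 0 for Y = 0, MS 0 for Y < 0."""
    if Y == 0:
        return 0
    if Y > 0:
        return Y.bit_length() - 1        # position of the most significant 1
    return (~Y).bit_length() - 1         # MS 0 of the two's-complement form of Y
```

With the operands used in chapter 5, bit_position(1176349) = 20 and the most significant set bit of X = 127773 is at position 16, so the first step size is δ = 2^(20−16) = 2^4.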
3.3.2 Mode 2 : Post processing of Y and Z
On the completion of Mode 1, the value of Y^(n−m+1) needs to fall in the range:
−2^(m−1) ≤ Y^(n−m+1) ≤ 2^(m−1) − 1    (3.19)
Just like in the fixed shift algorithm post processing, the range may not be satisfied because the value of Y^(n−m+1) is either negative, or positive but greater than X; this value therefore needs to be processed so that it satisfies the range:
0 ≤ Y^(n−m+1) < X    (3.20)
The value of the quotient, Z^(n−m+1), also needs to be updated whenever Y is changed. In order to bring the result Y^(n−m+1) into the desired range, the following process needs to be applied:
Y^(n−m+1) = Y^(n−m+1) − µX    (3.21)
Z^(n−m+1) = Z^(n−m+1) + µ    (3.22)
where µ works in the same way as in the range reduction mode to determine the addition or subtraction operation in equations (3.21) and (3.22), based on the following condition:
µ = +1 when Y^(n−m+1) ≥ X
µ = −1 when Y^(n−m+1) < 0    (3.23)
This process is needed so that the range of equation (3.20) is satisfied. The total number of iterations needed in algorithm 2 is at most n − m if the result of the division is achieved in mode 1. If the result is not achieved in mode 1, one more iteration is needed in mode 2.
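Putting the pieces together, the adaptive shift division can be sketched as follows. This is an illustrative Python model of equations (3.12)-(3.23), not the thesis's VHDL, and the names are ours:

```python
def divide_adaptive_shift(Y, X):
    """Adaptive shift division: eqs (3.12)-(3.18), then eqs (3.21)-(3.23)."""
    def pos(v):                          # P per eq (3.18)
        if v == 0:
            return 0
        return (v if v > 0 else ~v).bit_length() - 1
    Z = 0
    Px = pos(X)                          # X is assumed positive
    while True:
        Py = pos(Y)
        if Py <= Px:                     # stop condition of the range reduction mode
            break
        delta = 1 << (Py - Px)           # delta_i = 2^(Py - Px), eq (3.17)
        mu = 1 if Y >= 0 else -1         # eq (3.16)
        Y -= mu * delta * X              # eq (3.12)
        Z += mu * delta                  # eq (3.13)
    if Y < 0 or Y >= X:                  # single post-processing step, eq (3.23)
        mu = 1 if Y >= X else -1
        Y -= mu * X
        Z += mu
    return Z, Y                          # quotient, remainder
```

For Y = 1,176,349 and X = 127,773 this yields the same quotient 9 and remainder 26,392 as the fixed shift algorithm, after only three range reduction steps and one post-processing step.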
3.4 Chapter Summary
In this chapter, we considered the two division algorithms: the fixed shift algorithm and the adaptive shift algorithm. The equations and conditions required by the algorithms were explained and represented mathematically. The difference between the two algorithms is primarily in the step size δ: in the fixed shift algorithm, δ is determined by the iteration index, while in the adaptive shift algorithm, δ is governed by the input data, that is, the difference between the position of the most significant “1” or “0” (based on the sign of Y) and the position of the most significant “1” in X, since X is assumed to be positive. In both algorithms, the idea is to reduce Y as determined by δ such that it is positive and less than X in magnitude. When Y fails to fall in the correct range, a post processing step is required to obtain the correct values of Y and Z.
Chapter 4
Design and Implementation
The hardware realization of the division algorithms requires the identification and design of the individual system blocks and their interconnectivity in the divider designs. This chapter describes the design methodology.
4.1 Hardware entities for Algorithm 1
The division methodology, equations, conditions and operations explained in chapter 3 will be used to determine the hardware entities required for each of the division algorithms. In this section we look at the hardware entities that are required for the implementation of Algorithm 1. In every iteration the hardware needs to implement:
• One shift.
• One addition and one subtraction (two operations performed by the ALU)
To implement this, Algorithm 1 needs the following entities:
• X, Y and Z registers
• Data multiplexer
• Comparator for Y
• Look-up table
• ALU
• Counter
• Finite state machine
The system block-level diagram of Algorithm 1 is shown in fig. 4.1.
Figure 4.1: Algorithm 1 system level
4.1.1 X, Y and Z Registers
Division requires four operands in total; the divisor, the dividend, the quotient and
the remainder but in our implementation, only three operands are needed since we
20
reduce the dividend such that it yields the quotient. Therefore we need to store only
three values in the registers; the remainder Y, the quotient Z and the divisor X. The
word width of Y is 32 bits, therefore we set the registers of X and Z to 32 bits
word width too. Having a uniform word width of the three registers will simplify the
applicability of arithmetic operations on these operands.
Moreover, the registers are required to hold values from the following:
• The initial values from the external data lines
• The intermediate values of Y and Z from the data feedback from the ALU
during each iteration.
• The final values of Y and Z once the iterations are complete and division result
is obtained.
To meet the above requirements, we need control signals for the register bank to enable the read/write capability on the register contents, and we also need the ability to switch selectively between the external data and the internal feedback data. The block level view of our register bank is shown in fig. 4.2 below.
Figure 4.2: Registers X, Y and Z in the bank
4.1.2 Data Multiplexer
The multiplexer has a control signal input from the controller to select between the external data lines and the feedback data lines from the ALU; the output data lines from the multiplexer feed the data into the registers. The block level of the multiplexer is shown in fig. 4.3.
Figure 4.3: Data multiplexer for register bank
4.1.3 Comparator for Y
The comparator that scans Y is an important part of the hardware since it determines whether an addition or a subtraction operation is needed on the next values of Y and Z. The operands X and Y are fed into the comparator, which raises the flags when the following conditions occur:
• Raise the flag when the value of Y goes negative (f_ypos = 0)
• Raise the flag when the value of Y is positive and greater than or equal to X (f_ygtex = 1)
The block level view of the comparator is shown in fig. 4.4 below.
Figure 4.4: Comparator for Y
4.1.4 The Look-up table
The look-up table (LUT) is implemented as a ROM in the system, with contents stored as weights of binary shifts. The value of δ calculated in the two algorithms corresponds to an address in the LUT, whose content is picked up by the ALU during the computation in each iteration. The LUT block is shown in fig. 4.5.
Figure 4.5: LUT block
4.1.5 The ALU unit
The ALU unit computes equations (3.1), (3.2), (3.9), (3.10), (3.12), (3.13), (3.21) and (3.22) and is comprised of three ALUs that perform the following:
• Perform multiplication between δ_i and X.
• Perform addition/subtraction (based on the sign bit of the current Y) of the product δ_i X from Y^(i) to obtain Y^(i+1).
• Perform addition/subtraction (based on the sign bit of the current Y) of δ_i from Z^(i) to obtain Z^(i+1).
The ALU requires a control signal, based on the status of the comparator flags, to perform the addition or subtraction operation. The ALU block is shown in fig. 4.6 and the logical operation during an iteration is shown in fig. 4.7.
Figure 4.6: ALU block
Figure 4.7: Logical operation of ALU during ith iteration
4.1.6 Counter
To perform the shift we need a counter. Recall from section 3.2.1 that the step size δ is given by the iteration index and not by the intermediate values of Y. The counter is employed in algorithm 1 to produce the iteration index at each iteration, which pulls the corresponding value from the LUT for the ALU. When the iterations are complete, a flag is raised and its status is provided to the controlling unit. The counter block is shown below in fig. 4.8.
Figure 4.8: Counter block diagram
4.1.7 Finite State Machine
The finite state machine (FSM) is the controlling unit of the system; it sends and receives control signals to and from the other hardware entities in the system. The FSM block is shown in fig. 4.9. The FSM of algorithm 1 is fairly simple and only has four states: initial, iterate, adjust and final.
Figure 4.9: Finite State Machine block
4.1.8 FSM : State transition diagram
In the initial state the FSM is in idle mode and scans for an external “start” input control signal. The initial state is used as a system initialization mode which occurs upon reset: the counter is cleared and “sel” (select) is set high so that the external data inputs are selected and those values are ready to be loaded into the registers X, Y and Z. The enable_x and enable_yz signals are set high, which enables writing to the registers, while the “done” signal is set to “0” and add_sub_y is essentially in the don't care state.
Once the “start” is received, the FSM goes into the iterate mode, which implements the “range reduction of Y ” mode. For this, the counter is enabled and the “sel” control is set to “0” so that the internal feedback data lines from the ALU are selected for the next iteration. The flags f_ypos = 0 and f_ygtex = 1 mean that Y is negative, or is positive and greater than or equal to X, respectively, and add_sub_y is controlled accordingly: if the value of Y is negative, an addition is performed; if it is positive and greater than or equal to X, a subtraction is performed. When the counter has reached the pre-determined count, the f_i flag is raised to “1”, which signals to the FSM that the iterations of mode 1 are complete.
The FSM checks the status of the flags: if f_ypos = 0 and f_ygtex = 1, the FSM goes into the adjust mode to “post process Y and Z ”. Otherwise, if the flags have the opposite status (f_ypos = 1 and f_ygtex = 0), Y is in the correct range and the FSM goes directly into the final state. In the final state, the write capability of registers Y and Z is disabled through enable_yz and the “done” signal is set to “1”, which indicates that the division operation is complete. The state transition diagram is shown in fig. 4.10.
Figure 4.10: State transition diagram for Algorithm 1
4.2 Hardware entities for Algorithm 2
We know from section 3.3.1 that the step size δ in the adaptive shift algorithm is obtained from the magnitude of the input data and not from the iteration index; therefore we do not use a counter in the implementation of this algorithm. We instead need a special hardware unit that checks for the most significant 1's or 0's in the operand, depending on whether the number is positive or negative, respectively. In our design, we call this unit the delta address generator (DAG). In every iteration the hardware needs to implement the following operations:
• Determine the location of the most significant 1 or 0 of Y^(i).
• One shift.
• One addition and one subtraction (two operations performed by the ALU)
To implement this, Algorithm 2 needs the following entities:
• X, Y and Z registers
• Data multiplexer
• Comparator for Y
• Look-up table
• ALU
• Delta (δ) address generator
• Finite state machine
The system block-level diagram of Algorithm 2 is shown in fig. 4.11. We only discuss the DAG and the finite state machine for Algorithm 2, because these are specific to the adaptive shift algorithm, while the rest of the entities are implemented in exactly the same way as in Algorithm 1. One key difference between the two designs is that the counter used in Algorithm 1 is not used in Algorithm 2; instead, the DAG generates the shifts in δ.
Figure 4.11: Algorithm 2 system level
4.2.1 Delta Address Generator
This unit determines the location of the most significant 1 or 0 by finding the positions Py and Px and generating an address from the difference of the two positions, to obtain the corresponding value of the shift in δ from the LUT ROM, which is used by the ALU for the computation in the iteration step. The DAG block level diagram is shown in fig. 4.12 and the overall system block-level diagram is given in fig. 4.13.
Figure 4.12: Delta (δ) Address Generator
Figure 4.13: DAG system level
The DAG is composed of several hardware entities:
• position finder unit.
• multiplexer for flag.
• multiplexer for data lines.
• Px register.
• number subtractor.
Position finder unit
The purpose of this unit is to find Py and Px from Y and X respectively, based on the input flag f_ypos. If f_ypos = 1, the position finder unit detects the most significant 1 bit in Y, and if f_ypos = 0 the unit detects the most significant 0 bit. Since X is assumed to be positive, the unit always looks for the most significant 1 in X. See fig. 4.14 below: the number at the input is either X or Y, depending on the data multiplexer input. Similarly, f_mux, the flag forwarded by the flag multiplexer, indicates the sign of the number operand at the input of the position finder unit. For the case of Y, f_mux takes its input from the f_ypos output of the comparator; for the case of X, f_mux sends a “1” to the position finder unit, which instructs the unit to look for the most significant “1” in X. The output “position” holds the value of Py or Px from Y or X respectively. The “flag out” signal results from the hierarchical implementation of the position finder and is not used in the computation of Py and Px or in the division operation.
Figure 4.14: Position finder unit block
Multiplexer for flag
This is just a simple multiplexer that enables re-using the same position finder unit for Px and Py. It reads the status of the flag f_ypos to decide whether the unit needs to look for 1's or 0's in Y. For the case of Px, we feed a “1” from the multiplexer input so that the unit always looks for the most significant “1” in X, since X is always positive. Figure 4.15 below highlights this: the “sel_x” input comes from the FSM and, when it is high, the multiplexer sends “1” at the output; otherwise, when it is low or “0”, it sends f_ypos at the output as f_mux.
Figure 4.15: Multiplexer for flag input
Multiplexer for data lines
This works in the exact same way as the multiplexer for flag and share the same
control input “sel x”, since we re-use the position finder unit for both Px and Py, this
multiplexer helps to control the data lines selected as input for the position finder
unit as shown in fig. 4.16 below.
Figure 4.16: Multiplexer for data input
Px register
To enable the re-usability of the position finder unit, we need a register that stores Px for the number subtractor. Since this register is only used for Px, it functions only when “sel_x = 1”, and it is therefore controlled by the signal “enable_reg_Px”. Figure 4.17 illustrates this block.
Figure 4.17: The Px Register
Number subtractor
This hardware entity performs the subtraction Py − Px, which is used as an address for the LUT; it also raises the flag “f_i” when the result of the subtraction is less than or equal to 0, which indicates to the FSM that the “range reduction of Y ” mode is complete. The delta address holds the value of the delta from the result of Py − Px, while the “position_x” and “position” signals represent Px and Py respectively. Figure 4.18 illustrates this block.
Figure 4.18: The number subtractor block in DAG
4.2.2 DAG Implementation
The DAG is the most important hardware unit for Algorithm 2, since this unit computes the adaptive shift δ for this algorithm. Recall that in algorithm 1 we employed the counter to compute the fixed shifts based on the iteration index i; in the adaptive shift based division technique, we instead scan the words Y and X for the bit positions of the most significant 1's or 0's and then use the difference between the bit locations to obtain the value of δ.
The DAG is implemented in a hierarchical arrangement of five levels, since for a 32-bit operand the number of levels x is given by the relation:
2^x = 32    (4.1)
therefore, x = 5.
“Level 1” is comprised of 16 2-bit scan units that each scan two bits at a time across the entire word width of Y, starting from bit locations Y0Y1 up to Y30Y31. Each unit checks for the presence of a 1 or a 0 in the MSB, depending on the sign of Y, otherwise checks the LSB for a 1 or a 0, and sends the flag and position to the next hierarchical level. This unit also accepts a starting base value n at each block, from which the output value is obtained to pass on to the next level. Figure 4.19 below shows 2 of these units to help illustrate the concept.
Tables 4.1 and 4.2 show how the 2-bit scan unit works when Y is positive or negative.
Figure 4.19: 2-bits scan unit 0 in level 1
“Level 2” is comprised of 8 scan blocks that each scan, essentially, 4 bits: the two numbers and the two flags from the 2-bit scan units in level 1, from scan block 0 up to scan block 7. If the flag “f1” of 2-bit scan unit 1 is a “1”, then the number on the output of scan block 0 is “n1”, and if the flag “f0” is a “1” and “f1”
Y1 Y0 | n0  | f0
 0  0 |  0  | 0
 0  1 |  n  | 1
 1  0 | n+1 | 1
 1  1 | n+1 | 1
Table 4.1: Truth Table when Y is positive
Y1 Y0 | n0  | f0
 0  0 | n+1 | 1
 0  1 | n+1 | 1
 1  0 |  n  | 1
 1  1 |  0  | 0
Table 4.2: Truth Table when Y is negative
is a “0” then the number on the output of scan block 0 is “n0”. We demonstrate this relation between scan block 0 in level 2 and the two units, 2-bit scan unit 0 and scan unit 1 from level 1, in fig. 4.20.
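The 2-bit scan unit behaviour of Tables 4.1 and 4.2 can be modelled directly. The sketch below (illustrative Python; the function name is ours) returns the position output n0 and flag f0 of one unit with base value n:

```python
def scan_unit(y1, y0, n, y_positive):
    """2-bit scan unit per Tables 4.1 (Y positive) and 4.2 (Y negative)."""
    if not y_positive:                   # for negative Y, look for the MS 0:
        y1, y0 = 1 - y1, 1 - y0          # complement the bits, then reuse Table 4.1
    if y1:
        return n + 1, 1                  # hit in the upper bit
    if y0:
        return n, 1                      # hit in the lower bit
    return 0, 0                          # no hit: flag stays low
```

Complementing the bits for negative Y reproduces Table 4.2 exactly, since looking for the most significant 0 of Y is looking for the most significant 1 of its complement.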
The approach for level 2 proceeds in the same manner all the way down to level 5, through levels 3 and 4. The scan block shown in fig. 4.20 is exactly the same for the remaining levels and works on the same principle, accepting two numbers and two flags from the previous level and updating the number output depending on the status of the flag(s). As we go up a level, the number of scan blocks needed is reduced by a factor of 2, hence we have four scan blocks in level 3, two in level 4 and one in level 5.
Figure 4.20: Hierarchical approach between level 1 and 2
The hierarchical arrangement of all 5 levels is shown in fig. 4.21. The number “n0_L5” obtained at the output of level 5 is the position of the most significant 1 or 0, depending on the sign of the operand.
The n at the top of each “2-bit scan unit”, referred to in the figure as “u”, is the base value for each unit. Notice that the whole word width of 32 bits is covered by the 16 2-bit scan units; each scan unit forwards its respective bit position output (n0...n15) and flag output (f0...f15) to the scan blocks in level 2.
Although the methodology of operation is the same for scan blocks as for 2-bit scan units, a different notation for the number and flag outputs is used to highlight the difference. The numbers are the base value plus the position of the most significant 1 or 0 in that unit, and the flags determine which scan block has the most significant 1 or 0; in other words, if the flag from a higher order scan block is high, the number output of that scan block is sent to the output.
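The five-level reduction can be modelled by pairing scan results level by level, with the higher-order result winning whenever its flag is set. The sketch below is an illustrative Python model of the position finder hierarchy (names are ours, not the thesis's VHDL):

```python
def position_finder(word, y_positive, width=32):
    """Hierarchical position finder: level 1 scans 2-bit slices, levels 2-5 reduce pairwise."""
    if not y_positive:
        word = ~word & ((1 << width) - 1)    # scan for 0s by complementing the word
    # level 1: 16 two-bit scan units with base values n = 0, 2, 4, ..., 30
    results = []
    for u in range(width // 2):
        y0 = (word >> (2 * u)) & 1
        y1 = (word >> (2 * u + 1)) & 1
        n = 2 * u                            # base value of this unit
        if y1:
            results.append((n + 1, 1))
        elif y0:
            results.append((n, 1))
        else:
            results.append((0, 0))
    # levels 2-5: halve the number of blocks each level, higher-order block wins
    while len(results) > 1:
        results = [(hi if hi[1] else lo)
                   for lo, hi in zip(results[0::2], results[1::2])]
    return results[0]                        # (position, flag)
```

For the chapter 5 operand Y = 1,176,349 the finder reports position 20 with the flag high, and for the 32-bit two's-complement pattern of −868,019 (scanning for 0s) it reports position 19.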
Figure 4.21: Hierarchical arrangement of position finder unit
4.2.3 Finite State Machine
The finite state machine (FSM) is the controlling unit of the system; it sends and receives the control signals to and from the other hardware entities in the system. The FSM block is shown below in fig. 4.22. The FSM of algorithm 2 has one more state than that of algorithm 1, for a total of five states: initial, load X (initialize X), iterate, adjust and final.
Figure 4.22: Finite State Machine block
4.2.4 FSM : State transition diagram
In the initial state the FSM is in idle mode and scans for an external “start” input control signal. Once the “start” is received, the FSM goes into the “load X” mode. The load X state is an additional initialization state, alongside the initial state, that loads the value of Px into the Px register so that the iterations are synchronized with Py when the iterate mode is reached. The initial state is used as a system initialization mode which occurs upon reset: “sel” (select) is set high so that the external data inputs are selected and those values are ready to be loaded into the registers X, Y and Z. The enable_x and enable_yz signals are set high, which enables writing to the registers, while the “done” signal is set to “0” and add_sub_y is essentially in the don't care state. We have two additional control signals, “sel_x” (select x) and “enable_reg_x” (enable register x), which are associated with obtaining the value of Px, the position of X. After the load X state, “sel_x” and “enable_reg_x” are disabled so that the DAG will fetch values of Y in order to obtain the value of Py.
The value of δ can be obtained once the DAG performs the operation Py − Px; on the next clock cycle the FSM goes into the iterate mode, which implements the “range reduction of Y ” mode. The flags f_ypos = 0 and f_ygtex = 1 mean that Y is negative, or is positive and greater than or equal to X, respectively, and add_sub_y is controlled accordingly: if the value of Y is negative, an addition is performed; if it is positive and greater than or equal to X, a subtraction is performed. When the result of Py − Px ≤ 0, the f_i flag is raised to “1”, which signals to the FSM that the iterations of mode 1 are complete. The state transition diagram is shown in fig. 4.23.
Figure 4.23: State transition diagram for Algorithm 2
The FSM checks the status of the flags: if f_ypos = 0 and f_ygtex = 1, the FSM goes into the adjust mode to “post process Y and Z ”. Otherwise, if the flags have the opposite status (f_ypos = 1 and f_ygtex = 0), Y is in the correct range and the FSM goes directly into the final state. In the final state, the write capability of registers Y and Z is disabled through enable_yz and the “done” signal is set to “1”, which indicates that the division operation is complete.
4.3 Circuit Implementations
The general description of the system and its blocks has been covered in the previous sections. In this section we look at the Register Transfer Level (RTL) view of the top level block and the overall RTL schematic of the division and allied hardware implementation. The signal paths are shown in red and the data paths are shown in black in the schematics.
4.3.1 Algorithm 1 : Fixed Shift division algorithm
The top level block and the RTL schematic for the fixed shift division algorithm are shown in figs. 4.24 and 4.25.
Figure 4.24: Top level block of fixed shift division algorithm
Figure 4.25: Fixed shift division algorithm RTL schematic
4.3.2 DAG overall layout
The overall RTL schematic for the DAG used in algorithm 2 is shown in fig. 4.26.
Figure 4.26: Delta address generator RTL schematic
4.3.3 Algorithm 2: Adaptive Shift division algorithm
The schematics for the adaptive shift division algorithm are shown in figs. 4.27 and 4.28.
Figure 4.27: Top level block of adaptive shift division algorithm
Figure 4.28: Adaptive shift division algorithm RTL schematic
4.4 Chapter summary
In this chapter, the design overview and methodology were explained for each of the two division algorithms: algorithm 1, the fixed shift division algorithm, and algorithm 2, the adaptive shift division algorithm. The difference in operation and implementation between the two algorithms was explained with reference to the step size δ. In algorithm 1, the iterations are pre-determined, which is achieved through a counter, while in algorithm 2 the shifts in δ are achieved through a special hardware unit called the DAG. The DAG is a hierarchical implementation of scan units and scan blocks with the purpose of calculating the difference Py − Px. This difference corresponds to an address in the LUT that holds the shifted binary value of δ.
Chapter 5
Results and Evaluation
The aim of this chapter is to demonstrate that the two division algorithm designs of the previous chapter work as per the algorithms discussed in chapter 3. The implementation phase proved to be very challenging and required a considerable amount of testing, debugging and design revision to ensure the proper functionality of the intended hardware. This chapter documents the tests and simulation results used to analyze the functionality and the performance of the two algorithms. Initially, division algorithm 1, based on the fixed shift, was constructed to achieve a working division algorithm, and then algorithm 2, based on the adaptive shift technique, was constructed to produce the same division result. A comparative analysis was conducted between the two algorithms for their power consumption, device utilization, timing, area-delay product and power-delay product, based on design goals for balanced, timing performance and power optimization. Some of the related work in the area is also compared in this chapter.
5.1 Numerical Simulation using MATLAB
The two algorithms were first implemented in software using MATLAB in order to verify that the division algorithms yielded the correct values of quotient and remainder when the dividend was divided by the divisor. The purpose of this numerical simulation was also to have a reference benchmark of the numerical values in each iteration so that comparisons could be drawn during the hardware implementation phase. These simulation numbers were not only important from the verification point of view, but were also very beneficial when debugging the hardware description.
5.1.1 Numerical Simulation of Algorithm 1
Table 5.1 shows the numerical simulation in each iteration when Y = 1,176,349 is
divided by X = 127,773.
Range Reduction Mode of Y, Algorithm 1
i          | Y^(i+1)     | Z^(i+1) | δ_i = 2^(n−m−i) | µ_i × δ_i
Initialize | 1,176,349   | 0       | -    | -
0          | -2092256483 | 16384   | 2^14 | 16384
1          | -1045540067 | 8192    | 2^13 | -8192
2          | -522181859  | 4096    | 2^12 | -4096
3          | -260502755  | 2048    | 2^11 | -2048
4          | -129663203  | 1024    | 2^10 | -1024
5          | -64243427   | 512     | 2^9  | -512
6          | -31533539   | 256     | 2^8  | -256
7          | -15178595   | 128     | 2^7  | -128
8          | -7001123    | 64      | 2^6  | -64
9          | -2912387    | 32      | 2^5  | -32
10         | -868019     | 16      | 2^4  | -16
11         | 154165      | 8       | 2^3  | -8
12         | -356927     | 12      | 2^2  | 4
13         | -101381     | 10      | 2^1  | -2
14         | 26392       | 9       | 2^0  | -1
Post Processing Mode of Y and Z, Algorithm 1
Not required, results are obtained
Table 5.1: Iterations for Algorithm 1
In the above table we have the values of Y^(i+1) and Z^(i+1) for each iteration i. Notice that the shifts in δ_i decrease by 1 bit per iteration; for this reason we call this algorithm the fixed shift division algorithm. In chapter 3, we discussed that in our work Y is 32 bits and X needs to be at least 17 bits, denoted by n and m respectively. The difference n − m = 15 bits gives us the number of iterations required for the division operation, therefore we perform a total of 15 iterations. Since the result of the division on the chosen values of the operands Y and X satisfies equation (3.8), the “post processing mode of Y and Z ” is not needed in the fixed shift division algorithm.
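Table 5.1 can be reproduced with a short script. The sketch below (illustrative Python; the function name is ours) follows the δ schedule as printed in the table, where the first step uses δ = 2^14 and the run takes n − m = 15 iterations:

```python
def algorithm1_trace(Y, X, n=32, m=17):
    """Return (i, Y, Z, delta) rows matching Table 5.1 (delta from 2^(n-m-1) down to 2^0)."""
    Z, rows = 0, []
    for i in range(n - m):               # i = 0 .. 14 for n = 32, m = 17
        delta = 1 << (n - m - 1 - i)
        mu = 1 if Y >= 0 else -1
        Y -= mu * delta * X
        Z += mu * delta
        rows.append((i, Y, Z, delta))
    return rows
```

The first and last rows match the table: (0, −2092256483, 16384, 16384) and (14, 26392, 9, 1).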
5.1.2 Numerical Simulation of Algorithm 2
Table 5.2 shows the numerical simulation in each iteration when Y = 1,176,349 is
divided by X = 127,773.
Range Reduction Mode of Y, Algorithm 2
i          | Y^(i+1)   | Z^(i+1) | δ_i = 2^(Py−Px) | µ_i × δ_i
Initialize | 1,176,349 | 0       | -   | -
0          | -868019   | 16      | 2^4 | 16
1          | 154165    | 8       | 2^3 | -8
2          | -101381   | 10      | 2^1 | 2
Post Processing Mode of Y and Z, Algorithm 2
-          | 26392     | 9       | 2^0 | -1
Table 5.2: Iterations for Algorithm 2
The table lists the values of Y_{i+1} and Z_{i+1} for each iteration i. Notice that the
shifts in δi do not decrease by a fixed step as in Algorithm 1; for this reason we call
this the adaptive shift division algorithm. As discussed in chapter 3, the number of
iterations for the adaptive shift division algorithm is given by Py − Px. The most
significant 1 in Y is at bit position 20, counting from bit position 0, the least
significant bit of Y, while the most significant 1 in X is at bit position 16, counting
from bit position 0, the least significant bit of X. The difference between the two
respective bit positions is 20 − 16 = 4, therefore the algorithm takes a total of 4
iterations to produce the result. The iterations in the "range reduction mode of Y"
end when Py ≤ Px; at this point the value of Y does not satisfy equation (3.20),
therefore the algorithm enters the "post processing mode of Y and Z" to obtain the
correct result. Table 5.2 shows that the number of iterations required to achieve the
division result is much smaller than the number given in Table 5.1.
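The adaptive shift recurrence can be sketched similarly (again a hedged Python stand-in for the Matlab model; the shift is recomputed every iteration from the positions Py and Px of the most significant 1 in |Y| and in X, the role played by the DAG hardware described in chapter 4):

```python
def adaptive_shift_divide(y, x):
    """Illustrative model of the adaptive shift (Algorithm 2) recurrence."""
    z = 0
    iterations = 0
    while not (0 <= y < x):  # range reduction plus post processing
        # Py - Px from the most significant 1 positions, floored at 0 once Py <= Px
        shift = max(abs(y).bit_length() - x.bit_length(), 0)
        if y > 0:
            y -= x << shift
            z += 1 << shift
        else:
            y += x << shift
            z -= 1 << shift
        iterations += 1
    return z, y, iterations

# The operands of Table 5.2: four iterations instead of fifteen
print(adaptive_shift_divide(1_176_349, 127_773))  # (9, 26392, 4)
```

Running this on the operands of Table 5.2 reproduces the residuals −868,019, 154,165, −101,381 and 26,392, returning quotient 9 and remainder 26,392 in 4 iterations instead of 15.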
5.2 Hardware Simulation
The two algorithms were designed, synthesized and implemented in VHDL using Xilinx
ISE Project Navigator 13.4. The top-level and overall RTL schematics of both division
algorithms and the allied hardware modules were presented in chapter 4. VHDL test
benches were created and simulated to verify that the hardware performs division
correctly.
5.2.1 VHDL Simulation of Algorithm 1
Figure 5.1 shows screenshots of the test bench output.
Figure 5.1: All iterations for Algorithm 1
By observing Y and Z in fig. 5.1, once "start" becomes "1", an iteration takes place
on every rising clock edge, as can be seen, until the quotient and remainder of the
division are obtained. For clarity we break the figure down and examine zoomed
views of the iterations in fig. 5.2 to fig. 5.5, so that the iteration data can be verified
against the numerical simulation data.
Figure 5.2: Iterations 0 to 2 for Algorithm 1
Figure 5.3: Iterations 3 to 7 for Algorithm 1
Figure 5.4: Iterations 8 to 11 for Algorithm 1
Figure 5.5: Iterations 12 to 14 for Algorithm 1
From the test bench screenshots above, it can be seen that the iteration data from
the VHDL simulation is consistent with the numerical simulation data obtained in
section 5.1.
5.2.2 VHDL Simulation of Algorithm 2
We now assess the functionality of division algorithm 2, the adaptive shift division
algorithm. As in section 5.1, the adaptive shift technique was observed to reduce
the number of iterations considerably compared to the fixed shift technique; this is
verified by observing Y and Z in fig. 5.6.
Figure 5.6: All iterations for Algorithm 2
5.3 Performance Evaluation
The hardware device chosen for the implementation is the Xilinx Spartan-3E
xc3s1200e-4fg320. This Spartan-3E FPGA device contains 1,200,000 system gates,
19,512 equivalent logic cells and a total of 8,672 slices [12]; the available logic
consists of 17,344 flip flops, 17,344 4-input LUTs and 250 bonded IOBs. Apart from
implementing logic, slices are also used for routing signals within the device. This
study analyzes and compares the two division algorithms for their power consumption,
device utilization and timing using the Xilinx ISE tool with respect to three design
goals:
• Balanced.
• Timing performance.
• Power optimization.
These design goal profiles are pre-defined in the ISE Navigator tool and can be set to
the desired goal in the synthesis properties. In the balanced profile, the optimization
goal is "speed" and the optimization effort is "normal"; in the timing performance
profile, the optimization goal is "speed" with a "high" optimization effort; in the
power optimization profile, the optimization goal is "area" with a "high" optimization
effort.
5.3.1 Device Utilization
Once the two division algorithms were successfully compiled, they were synthesized
to assess device utilization and performance. The device utilization results for the
fixed shift division algorithm and the adaptive shift division algorithm are shown in
Tables 5.3 and 5.4 respectively.
The device utilization summary can be obtained through the following:
Go to ISE Navigator Design pane > select Implementation (view) > select the design
as "top module".
In the Process pane > select Synthesize - XST > view the Design Summary (synthe-
sized) window.
Device Utilization Summary : Algorithm 1
Design goal Balanced Timing Performance Power Optimization
Number of Blocks 457 462 426
Flip Flops 70 86 70
4-Input LUTs 253 242 222
Occupied Slices 129 159 134
Table 5.3: On-chip device utilization of Algorithm 1
The on-chip logic utilization summary shows that Algorithm 1 uses a total of 457
blocks for the balanced profile, 462 blocks for timing performance and 426 blocks for
power optimization; these totals also include 132 IOBs, 1 BUFGMUX and
1 MULTI18xSIO for each of the three profiles. All three profiles have a device
utilization of 1% for our target device.
Device Utilization Summary : Algorithm 2
Design goal Balanced Timing Performance Power Optimization
Number of Blocks 607 582 585
Flip Flops 105 104 104
4-Input LUTs 368 344 347
Occupied Slices 194 182 192
Table 5.4: On-chip device utilization of Algorithm 2
Algorithm 2 uses more area than Algorithm 1; in addition to the flip flops and
4-input LUTs listed in Table 5.4, the total number of blocks also includes 132 IOBs,
1 BUFGMUX and 1 MULTI18xSIO for each of the three profiles. All three profiles
have a device utilization of 2% for our target device.
5.3.2 Timing Analysis
In the timing analysis we look at the clock frequency, the critical path delay and the
overall completion time required by the division operation for the two algorithms.
Tables 5.5 and 5.6 show the timing summaries of division algorithms 1 and 2
respectively.
Timing Analysis : Algorithm 1
Design goal Balanced Timing Performance Power Optimization
Clock Frequency [MHz] 81.374 83.008 77.48
Critical Path Delay [ns] 12.289 12.047 12.907
Division operation completion time = 155 ns
Table 5.5: Timing Summary of Algorithm 1
Timing Analysis : Algorithm 2
Design goal Balanced Timing Performance Power Optimization
Clock Frequency [MHz] 39.022 41.722 37.198
Critical Path Delay [ns] 25.626 23.968 26.883
Division operation completion time = 70 ns
Table 5.6: Timing Summary of Algorithm 2
The critical path delay determines the clock frequency and can be obtained by
running the Synthesize - XST option from the Process pane in the ISE tool. Once
the synthesis is complete, the timing report can be viewed by right-clicking
Synthesize - XST. This report also reveals the source and destination of the critical
path in each of the profiles.
The data paths that cause the critical path delay for Algorithm 1 were the following:
• Balanced.
source : counter instance/temp count 3 (FF)
destination : regbank instance/y reg alu 31 (FF).
• Timing performance.
source : regbank instance/y reg alu 1 1 (FF)
destination : regbank instance/z reg alu 31 (FF).
• Power optimization.
source : regbank instance/y reg alu 0 (FF)
destination : regbank instance/z reg alu 31 (FF).
The data paths that cause the critical path delay for Algorithm 2 were the following:
• Balanced.
source : fsm instance/pstate internal FSM FFd3 (FF)
destination : regbank instance/y reg alu 31 (FF).
• Timing performance.
source : fsm instance/pstate internal FSM FFd2 (FF)
destination : regbank instance/y reg alu 31 (FF).
• Power optimization.
source : fsm instance/pstate internal FSM FFd2 (FF)
destination : regbank instance/y reg alu 31 (FF).
The overall division operation completion time was obtained by running the ISIM
simulation from the "simulation" view of the ISE tool and double-clicking "simulate
behavioral model", which shows the test bench output. Vertical markers were used
to measure the time difference between the rising edge at which the "start" signal
becomes high and the rising edge at which the "done" signal is set high.
With respect to the empirical timing data, it was observed that algorithm 2 had less
than half the clock frequency of algorithm 1, corresponding to roughly double the
clock period or delay. The addition of the DAG hardware increased the delay per
clock cycle, thereby reducing the clock frequency, which results in lower circuit power
consumption and increased reliability [13]. This is because dynamic power
consumption is related to clock frequency: the higher the switching activity in the
circuit, or the higher the clock frequency, the higher the dynamic power consumption.
The DAG hardware also resulted in a shorter job (division operation) completion
time, which verifies that Algorithm 2 is more than 50% faster than Algorithm 1.
5.3.3 Power Consumption
The total on-chip power is the sum of the static power and the dynamic power. The
static power results mainly from the leakage current of the transistors within the
device and exists even when a transistor is logically "OFF". The dynamic power
depends on the switching activity defined in [14]. Based on this theory, the total
power consumption will change for the same design if a different target device is
used; therefore, in the data presented in Table 5.7 we refer to the dynamic power.
The ISE XPower tool can be used through the following:
Go to ISE Navigator Process pane > select Implement Design > Place & Route >
Analyze Power Distribution (XPower Analyzer) > go to the XPA tool window and
set the clock frequency in the tree drop-down list > go to Tools in the menu bar >
Update Power Analysis.
Design Power Consumption [mW]
Design goal Algorithm 1 Algorithm 2 % difference
Balanced 87 31 64.37 %
Timing Performance 93 38 59.14%
Power Optimization 68 31 54.41%
Table 5.7: On-chip power consumptions.
The last column in the table, the % difference between the power consumption of the
two designs, is a clear indication that the adaptive shift division algorithm is power
efficient. All three design goals for the adaptive shift division algorithm show more
than 50% lower power consumption than the fixed shift division algorithm. It is
worth mentioning that an increase in the area of a design generally increases its
power consumption, which would be expected for Algorithm 2, but this was not
found to be the case. The reason for this, to the best of our knowledge, is that
Xilinx's implementation of the division through Algorithm 2 results in a smaller
number of gates switching, which causes less switching activity and hence less
dynamic power consumed.
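The percentage column of Table 5.7 can be reproduced with a one-line helper (illustrative Python; the names are ours):

```python
def percent_reduction(p_fixed, p_adaptive):
    """Percent difference as used in the last column of Table 5.7."""
    return round((p_fixed - p_adaptive) / p_fixed * 100, 2)

# Dynamic power [mW] from Table 5.7: (Algorithm 1, Algorithm 2)
table_5_7 = {"Balanced": (87, 31),
             "Timing Performance": (93, 38),
             "Power Optimization": (68, 31)}
saving = {goal: percent_reduction(p1, p2) for goal, (p1, p2) in table_5_7.items()}
print(saving)
# {'Balanced': 64.37, 'Timing Performance': 59.14, 'Power Optimization': 54.41}
```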
5.3.4 Power-Delay Product
The power-delay product and area-delay product are two figures of merit considered
by many designers in digit electrons and hence we include these metrics to enable
prospective designers to determine the trade off between the designs. The power-
delay product can also be referred to as the “switching energy” of the digital circuit
and is given by the power consumption over a switching event. Table 5.8 shows the
power-delay product for the two algorithms, this figure of merit is measured in joules
[J].
Since the target device we have chosen is very large compared to the overall area
utilization of the designs (1% for Algorithm 1 and 2% for Algorithm 2), the leakage
power is disproportionately large; we therefore use the dynamic power of the circuit
for this measurement.
Power-Delay Product [pJ]
Design goal Algorithm 1 Algorithm 2
Balanced 1069.143 794.406
Timing Performance 1120.371 910.784
Power Optimization 877.676 833.373
Table 5.8: Power-delay product for Algorithm 1 and 2.
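As a check, the entries of Table 5.8 follow directly from Tables 5.5–5.7. The short Python sketch below (variable names are ours) recomputes them and makes the units explicit, since mW × ns = 10⁻¹² J:

```python
# Dynamic power [mW] (Table 5.7) and critical path delay [ns] (Tables 5.5, 5.6),
# listed as (Algorithm 1, Algorithm 2) per design goal.
power_mw = {"Balanced": (87, 31),
            "Timing Performance": (93, 38),
            "Power Optimization": (68, 31)}
delay_ns = {"Balanced": (12.289, 25.626),
            "Timing Performance": (12.047, 23.968),
            "Power Optimization": (12.907, 26.883)}

# PDP = power x delay; mW x ns = 1e-3 W x 1e-9 s = 1e-12 J, i.e. picojoules
pdp_pj = {goal: tuple(round(p * d, 3) for p, d in zip(power_mw[goal], delay_ns[goal]))
          for goal in power_mw}
print(pdp_pj)
```

The computed values match Table 5.8 entry for entry.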
5.3.5 Area-Delay Product
In simple terms, the area-delay product is the number of LUTs × clock period of the
design [15]. For our calculation we also include the flip flops utilized. Table 5.9
shows the area-delay product for the two algorithms; this figure of merit is calculated
as (Flip Flops + 4-input LUTs) × clock period of the design.
Area-Delay Product [(FF+4LUT)·ns]
Design goal Algorithm 1 Algorithm 2
Balanced 3969.347 12121.098
Timing Performance 3951.416 10737.664
Power Optimization 3768.844 12124.233
Table 5.9: Area-delay product for Algorithm 1 and 2.
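Again as a check, Table 5.9 follows from Tables 5.3–5.6; a short Python sketch (variable names are ours):

```python
# Flip flops and 4-input LUTs (Tables 5.3, 5.4) and clock period [ns]
# (Tables 5.5, 5.6), listed as (Algorithm 1, Algorithm 2) per design goal.
area = {"Balanced": ((70, 253), (105, 368)),
        "Timing Performance": ((86, 242), (104, 344)),
        "Power Optimization": ((70, 222), (104, 347))}
period_ns = {"Balanced": (12.289, 25.626),
             "Timing Performance": (12.047, 23.968),
             "Power Optimization": (12.907, 26.883)}

# ADP = (flip flops + 4-input LUTs) x clock period
adp = {goal: tuple(round((ff + lut) * t, 3)
                   for (ff, lut), t in zip(area[goal], period_ns[goal]))
       for goal in area}
print(adp)
```

The computed values match Table 5.9 entry for entry.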
5.4 Comparison of Work in Related Area
Firstly, as previously discussed in section 2.3 and to the best of our knowledge from
reviewing related work in the area, most work on division targets high speed dividers,
particularly in floating point arithmetic, using techniques such as high speed
adders/subtracters and multiplication, and the utilization of precomputed values and
look-up tables to speed up division. Several authors have implemented division in
higher radices or with different operand sizes, while our work focuses on a radix-2
implementation with 32-bit operands. Our implementation of division is carried out
in 32-bit signed fixed point arithmetic without utilizing any of the above mentioned
techniques to speed up the division operation; finding exactly comparable work was
therefore difficult, since we vary δ, the step size of the bit shift, to perform division
and speed up the overall operation.
The authors of [16] stress that work related to the power consumption of dividers in
FPGAs is fairly limited, which is what we encountered as well during our search for
comparable work on power consumption analysis of integer division in FPGAs.
Lastly, some of the other work on division is implemented in ASIC and custom
CMOS technologies; in fairness to the differences in design implementation, area
and power consumption among these technologies, we do not compare that work
with ours. In other instances, the authors of [17] implemented a 16-bit fixed point
complex divider using a pipelined CORDIC to divide complex numbers in polar
coordinates; because this and similar work focus on a different coordinate system,
we do not compare such work to ours either. Moreover, there has not been much
work on FPGA based fixed point integer division in recent years and, to the best of
our knowledge, the work of the authors of [9][10][18][19] is among the closest recent
work that is somewhat comparable with our results.
The authors of [9] implemented division using a high speed adder/subtractor with
a 10-bit divisor and a 9-bit dividend, producing a 9-bit quotient and a 10-bit
remainder. Although it is not clearly stated, judging from the operands used in the
example, this work targets fixed point integer division, but details about power,
delay, frequency and latency are not provided. The authors of [10] implemented a
high speed fixed point divider based on an adder/subtracter or conditional adder and
ripple carry adders; details about power are not reported. The authors of [18]
provide results for a fixed point divider in an FPGA that uses precomputed values:
the input is scaled so that the denominator has a value between 0.5 and 1, and the
inverse of this denominator is then multiplied by the numerator; details about power
consumption and area utilization are not provided. The authors of [19] present a
power consumption analysis of 6 ÷ 3 bit integer division in FPGAs, based on an
ancient Vedic mathematics technique. Table 5.10 summarizes the results from these
authors along with our considered designs.
Fixed Point Division : Results Summary
Scheme              Bits       # of slices  LUTs  Power [mW]  Delay [ns]  Frequency [MHz]  Latency [ns]
Algorithm 1*        32         129          253   87          12.29       81.37            155
Algorithm 1**       32         159          242   93          12.05       83.00            "
Algorithm 1***      32         134          222   68          12.91       77.48            "
Algorithm 2*        32         194          368   31          25.63       39.02            70
Algorithm 2**       32         182          344   38          23.97       41.72            "
Algorithm 2***      32         192          347   31          26.88       37.198           "
[9]                 10×9       103          176   -           -           -                -
[10] Combinational  32         687          1152  -           85.0        -                -
[10] Pipelined      32         1273         -     -           3.8         263              129.2
[10] Sequential     32         201          -     -           3.8         263              125.4
[18]                32/16 I/O  -            647†  -           -           -                0.03 [µs]
[19]                6÷3        -            -     93          41§         250              -
* Balanced design goal
** Timing performance design goal
*** Power optimization design goal
† TLE - Total logic elements
§ Propagation delay
Table 5.10: Summary of related work in the area
5.5 Chapter Summary
In this chapter, the two division algorithms were numerically simulated to verify that
they perform division, and the algorithms were then designed in hardware using
VHDL, which showed that the algorithms are implementable in hardware just as in
the numerical methodology. We then analyzed and evaluated the performance of the
digital designs of the two division algorithms.
The fixed shift division algorithm (algorithm 1) demonstrated lower device logic
utilization than the adaptive shift division algorithm (algorithm 2). In the power
consumption analysis, the adaptive shift division algorithm demonstrated
significantly more efficient power consumption than the fixed shift algorithm. In the
timing analysis, we studied the critical path delay and the overall division operation
completion time for both designs. We then provided the power-delay product and
the area-delay product of the two designs. Lastly, we compared our fixed point
divider with some of the work published in the related area in recent years. This
information and these results are very useful to our study since they give us the
ability to assess, from the design point of view, which algorithm is needed in a given
application. In applications where power efficiency and higher speed are
requirements, the adaptive shift divider algorithm is recommended; in cost-sensitive
applications, the fixed shift divider algorithm can be implemented.
Chapter 6
Conclusion, Contributions and
Future Work
6.1 Conclusion
This thesis considered the implementation of the division operation in hardware,
which in the past has not received as much attention as multipliers and adders.
Moreover, the implementation of a power efficient division algorithm was one of the
major objectives of the thesis and research work. The division operation is critical in
cryptographic and security processors and is often employed in a co-processor in
cryptographic and encryption processors. A power efficient divider not only saves
power, it also helps reduce the heat signature of the digital device, which is an
important aspect since many side-channel analyses and attacks monitor the heat
dissipation profile.
The two division algorithms considered are based on the digit recurrence
non-restoring division technique, using the simple long division concept performed
by hand. The algorithms were numerically simulated in Matlab to confirm that they
work and yield correct values of quotient and remainder for a given dividend and
divisor. The hardware implementation was verified through VHDL and yielded
results consistent with the numerical simulation.
Three different analyses were performed on the two algorithms: device utilization,
power consumption and timing. Each analysis was obtained for three different
design goals based on the optimization profiles preset in the VHDL synthesis tool.
The fixed shift division algorithm works on predetermined fixed shifts in δ, changing
by one bit position with each iteration and corresponding to the shifted value 2^i.
The adaptive shift algorithm does not have fixed steps in the value of δ, hence the
name adaptive. The adaptive shifts are generated by special hardware that checks
the position of the most significant "1" (or "0", for negative values) in the given
dividend against the most significant "1" in the divisor; the difference between the
two positions of the most significant digits gives the shift for the algorithm. The
adaptive shift algorithm performed the same division operation and generated the
results in far fewer iterative steps. In doing so, it utilized more area on the device
and incurred increased routing and logic delays. The adaptive shift division
algorithm also demonstrated lower power consumption than the fixed shift
algorithm. The two designs were designed and implemented in a simple yet efficient
methodology.
It is also necessary to mention that this work is part of Dr. Gebali's work on a
Pseudo Random Number Generator (PRNG) based on the Park-Miller algorithm,
which requires a division operation and generates random numbers using the
quotient and remainder. The integer division implemented in this thesis is directly
applicable to any future work in that direction.
6.2 Contributions
The following contributions are made in this thesis:
1. Designed and implemented two signed integer division algorithms for performing
the division operation in hardware.
2. Verified the hardware design by developing Matlab code to confirm the correct-
ness and accuracy of the hardware implemented in VHDL.
3. Compared the performance of the two division algorithms from the viewpoint of
device utilization (area), power consumption and timing (delay).
4. Adapted the high-radix technique proposed in [5] for floating point arithmetic
to integer arithmetic.
6.3 Future Work
This work provided hardware implementations for 32-bit machines. The work can be
extended in two directions:
1. Random number generation for 64- or even 128-bit machines.
2. Random number generation for elliptic curve cryptography where the integers
are represented by 200 to 600 bits.
Random number generation in mobile devices is typically 32 bits, which is what we
use in this work. Most current systems use the library-provided RAND function to
generate random numbers for use in applications. When a larger number of bits is
needed, say 64 bits or more, the RAND function can be limited in generating high
quality random numbers for the desired application; to cater to this, our work needs
to be extended and scaled accordingly so that the requirements can be met.
The number of bits required in elliptic curve cryptography typically varies between
200 and 600 bits; for such scenarios our work will need to be scaled accordingly.
Two ways to do this are:
1. Increasing the size of the components to the required bit size instead of 32
bits.
2. Using the 32-bit design and concatenating the resulting numbers to represent
an arbitrary size integer.
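Option 2 can be illustrated with a small Python sketch (hypothetical helper names; this shows only the 32-bit limb representation of a wide integer, not the multi-word division itself):

```python
LIMB_BITS = 32
MASK = (1 << LIMB_BITS) - 1

def to_limbs(value, n_limbs):
    """Split a wide unsigned integer into n_limbs 32-bit words, LSB first."""
    return [(value >> (LIMB_BITS * i)) & MASK for i in range(n_limbs)]

def from_limbs(limbs):
    """Concatenate 32-bit words back into a single wide integer."""
    return sum(word << (LIMB_BITS * i) for i, word in enumerate(limbs))

# A 256-bit value, of the size used in elliptic curve cryptography: 8 limbs
wide = (1 << 255) + 12345
limbs = to_limbs(wide, 8)
assert from_limbs(limbs) == wide  # round trip recovers the wide integer
```

Each 32-bit limb could then be handled by the existing 32-bit datapath, with carries and partial remainders propagated between limbs.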
To improve system speed, high-radix implementations could be considered for the
calculation of the terms δi. Pipelining the operations might not lead to improved
latency, since the increased clock speed is accompanied by an increased number of
pipeline stages.
Bibliography
[1] N.M. Nayeem, M.A. Hossain, M. Haque, L. Jamal, and H. Babu. Novel reversible
division hardware. In Circuits and Systems, 2009. MWSCAS ’09. 52nd IEEE
International Midwest Symposium on, pages 1134–1138, Aug 2009.
[2] N. Takagi, S. Kadowaki, and K. Takagi. A hardware algorithm for integer di-
vision. In Computer Arithmetic, 2005. ARITH-17 2005. 17th IEEE Symposium
on, pages 140–146, June 2005.
[3] S.F. Oberman and M. Flynn. Design issues in division and other floating-point
operations. Computers, IEEE Transactions on, 46(2):154–161, Feb 1997.
[4] S. K. Park and K. W. Miller. Random number generators: Good ones are hard
to find. Commun. ACM, 31(10):1192–1201, Oct 1988.
[5] Fayez Elguibaly and A. Rayhan. Hcordic: a high-radix adaptive cordic algorithm.
Canadian Journal of Electrical and Computer Engineering, 25(4):149, 2000.
[6] Shlomo Waser and Michael J. Flynn. Introduction to arithmetic for digital sys-
tems designers. Holt, Rinehart and Winston, New York, 1982.
[7] R.P. Brent and P. Zimmermann. Modern Computer Arithmetic. Cambridge
Monographs on Applied and Computational Mathematics. Cambridge University
Press, 2010.
[8] B. Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford
series in electrical and computer engineering. Oxford University Press, 2010.
[9] Sukhmeet Kaur and Rajeev Agarwal. Vhdl implementation of non restoring
division algorithm using high speed adder/subtractor. International Journal of
Advanced Research in Electrical, Electronics and Instrumentation Engineering,
2.7, 2013.
[10] G. Sutter and J. Deschamps. High speed fixed point dividers for fpgas. In Field
Programmable Logic and Applications, 2009. FPL 2009. International Confer-
ence on, pages 448–452, Aug 2009.
[11] A. Nannarelli and T. Lang. Low-power divider. Computers, IEEE Transactions
on, 48(1):2–14, Jan 1999.
[12] Xilinx. Xilinx Power Tools Spartan-6 and Virtex-6 FPGAs. http://www.
xilinx.com/support/documentation/sw_manuals/xilinx11/ug733.pdf,
March 2010. Accessed : April 2015.
[13] Anju S Pillai and TB Isha. Factors causing power consumption in an embed-
ded processor-a study. International Journal of Application or Innovation in
Engineering & Management (IJAIEM), 2(7), 2013.
[14] Xilinx. Spartan-3E FPGA Family Data Sheet. http://www.xilinx.com/
support/documentation/data_sheets/ds312.pdf, July 2013. Accessed : April
2015.
[15] Xilinx User Community. Area delay product of FPGA designs.
http://forums.xilinx.com/xlnx/board/crawl_message?board.id=IMPBD&
message.id=5111, March 2012. Accessed : June 2015.
[16] Ruzica Jevtic, Bojan Jovanovic, and Carlos Carreras. Power estimation of di-
viders implemented in fpgas. In Proceedings of the 21st Edition of the Great Lakes
Symposium on Great Lakes Symposium on VLSI, GLSVLSI ’11, pages 313–318,
New York, NY, USA, 2011. ACM.
[17] Dong Wang, Pengju Ren, and Leibo Liu. A high-throughput fixed-point complex
divider for fpgas. IEICE Electronics Express, 10(4):20120879–20120879, 2013.
[18] Muhammad Firmansyah Kasim, Trio Adiono, Muhammad Fahreza, and Muham-
mad Fadhli Zakiy. Fpga implementation of fixed-point divider using pre-
computed values. Procedia Technology, 11(0):206 – 211, 2013. 4th International
Conference on Electrical Engineering and Informatics, ICEEI 2013.
[19] D. Kumar, A. Sharma, and P. Saha. Integer division technique for signal
processing applications. In Proceedings of the 9th International Conference on
Ubiquitous Information Management and Communication, IMCOM ’15, pages
52:1–52:4, New York, NY, USA, 2015. ACM.
Chapter 7
Additional Information
7.1 Interpretation of signals
Signal Bit(s) Description
x in 32 initial value of X, the divisor
y in 32 initial value of Y, the dividend
mux to y 32 data path from data multiplexer to register Y
mux to z 32 data path from data multiplexer to register z
enable x 1 control signal from FSM to register X
enable yz 1 control signal from FSM to registers Y and Z
clk 1 system clock signal
x reg alu 32 data path register X to ALU
y reg alu 32 data path register Y to ALU
z reg alu 32 data path register Z to ALU
y alu reg 32 feedback data path from ALU containing value of Y (i+ 1)
z alu reg 32 feedback data path from ALU containing value of Z(i+ 1)
f ypos 1 flag indication for positive or negative value of Y
f ygtex 1 flag indication for Y greater than or equal to X
address 5 input value of LUT address from counter in Algorithm 1
delta alu 32 data path carrying value of δ from the LUT to the ALU
y add sub 1 FSM control signal to ALU to perform addition or subtraction
reset 1 reset system and initialize the divider
start 1 start division operation
counter enable 1 input control signal at counter from FSM
f i 1 flag indication for completion of iteration
Signal Bit(s) Description
count 5 output value of counter to LUT in Algorithm 1
count enable 1 FSM control signal output to the counter
sel 1 select external/feedback data values
done 1 indication for completion of division operation
sel x 1 select data from X in DAG in Algorithm 2
enable reg px 1 enable writing data in register Px in DAG
flag out 1 output of flag from DAG (we don’t care about this)
delta address 5 input value of LUT address from DAG in Algorithm 2
flag out 1 output of flag from DAG and position finder unit
number 32 input data value or operand in position finder unit
f mux 1 position finder input to select X or Y flag
position (of Y ) 5 input value of Py for number subtractor
position x 5 input value of Px for number subtractor
enable reg x 1 write enable in register Px inside DAG
Chapter 8
Used Terms and Acronyms
• RNG, Random Number Generator
• PRNG, Pseudo Random Number Generator
• VHDL, VHSIC Hardware Description Language
• ROM, Read Only Memory
• CORDIC, COordinate Rotation DIgital Computer
• FPGA, Field Programmable Gate Array
• LUT, Look Up Table
• H-CORDIC, High Performance Adaptive CORDIC
• ALU, Arithmetic and Logical Unit
• FSM, Finite State Machine
• DAG, Delta Address Generator
• MSB, Most Significant Bit
• LSB, Least Significant Bit
• RTL, Register Transfer Logic
• MATLAB, Matrix Laboratory (Mathworks, Inc. computation tool)
• IOB, Input/Output Block
• ISE, Integrated Synthesis Environment (Xilinx HDL tool)
• BUFGMUXs, multiplexed global clock buffer that can select between two input
clocks
• MULTI18xSIOs, dedicated multipliers in the (Xilinx) target device
• XST, Xilinx Synthesis Technology
• ISIM, ISE Simulator
• ASIC, Application-Specific Integrated Circuit
• CMOS, Complementary Metal-Oxide-Semiconductor