Journal of Engineering Technology (ISSN 0747-9964)

Volume 6, Special Issue on Technology Innovations and Applications

Oct. 2017, PP. 410-422

FPGA Implementation of Variable Precision Euclid’s GCD Algorithm

Qasem Abu Al-Haija 1, Sharifah Mumtazah Syed Ahmad 2, Islam Alfarran 3

1 College of Engineering, King Faisal University, Hufof 31982, P.O. 380, Saudi Arabia.

2 Faculty of Engineering, Universiti Putra Malaysia, 43400, Serdang, Selangor, Malaysia

3 College of Applied Studies and Community Service, King Faisal University, Hufof, Saudi Arabia

Abstract: Introduction: Euclid's algorithm is well known for its efficiency and its simple iterative procedure for computing the greatest common divisor (GCD) of two non-negative integers. It contributes to almost all

public key cryptographic algorithms over a finite field of arithmetic. This, in turn, has led to increased

research in this domain, particularly with the aim of improving the performance throughput for many

GCD-based applications. Methodology: In this paper, we implement a fast GCD coprocessor based on

Euclid's method with variable precisions (32-bit to 1024-bit). The proposed implementation was

benchmarked using seven field programmable gate arrays (FPGA) chip families (i.e., one Altera chip

and six Xilinx chips) and reported on four cost complexity factors: the maximum frequency, the total

delay values, the hardware utilization and the total FPGA thermal power dissipation. Results: The

results demonstrated that the XC7VH290T-2-HCG1155 and XC7K70T-2-FBG676 devices recorded the best maximum frequencies, from 243.934 MHz at 32-bit precision down to 39.94 MHz at 1024-bit precision. Additionally, the implementation at every precision utilized minimal resources of the target device, i.e., a maximum of 2% of the device registers and 4% of the look-up tables (LUTs). Conclusions: These results imply that the design area is scalable and can be

easily increased or embedded with many other design applications. Finally, comparisons with previous

designs/implementations illustrate that the proposed coprocessor implementation is faster than many

reported state-of-the-art solutions. This paper is an extended version of our conference paper [1].

Keywords: Digital arithmetic, FPGA, Integrated circuit synthesis, Euclid's algorithm, GCD.

1. Introduction

Over the past decade, rapid advances in digital hardware design have largely displaced traditional methods of logic design. Millions of logic gates and tens of thousands of flip-flops can now co-exist in a single device through field programmable gate arrays (FPGAs) [2]. FPGA devices contain a matrix of configurable logic blocks (CLBs) connected via a programmable network; they are configured by describing the circuit in a hardware description language (HDL) such as VHDL [3] and then synthesizing the design [4]. Efficient FPGA-based design has recently

emerged in many fields and applications, such as high-performance computing, networking, security

and cryptography, as in [5, 6]; fault tolerance applications, as in [7]; and many other regular and

irregular applications.

Cryptoprocessors design [8] involves the use of different number theory algorithms at the upper level

of the design, which can be built based on the digital arithmetic [9] that can be easily described by

HDLs for FPGA verification purposes. Number theory [8, 10] was largely separated from other fields

of mathematics since it is topically related to elementary arithmetic. Its applications have rapidly

increased in recent years for areas such as coding theory, cryptography and statistical mechanics. The

“Euclidean algorithm” [8] and “Sieve of Eratosthenes” [10] are both recent candidates to implement.

This paper focuses on the iterative Euclid's algorithm to compute the GCD of two non-negative

integer numbers. Its efficiency and simplicity make it attractive to many applications, especially those

that are related to public key cryptography using finite field arithmetic operations. The RSA (Rivest, Shamir, and Adleman) cryptosystem is a good example of a cryptographic algorithm that relies on the GCD [5], and the trusted platform module (TPM) uses RSA as a building block [6]. Another example of a GCD application is the calculation of multiplicative inverses [11]. The choice of Euclid's GCD is supported by the study in [12], whose authors discussed four common GCD algorithms: Dijkstra's algorithm, the Euclidean algorithm, the binary GCD algorithm and Lehmer's algorithm. They found that the Euclidean algorithm can be used efficiently to

compute GCD with time complexity of ( ( )). However, this linearity can be reduced to (

( )) by using Lehmer's algorithm to reduce the large integer operands before applying Euclid's algorithm. The contributions of this paper can be summarized as follows:

- A state-of-the-art review of various design techniques for the GCD algorithm (hardware, software or hybrid).
- Details on the hardware implementation of a variable-datapath GCD coprocessor using efficient modules, including a schematic diagram, a finite state machine, and a full RTL diagram for a 32-bit GCD (Appendix).
- A comparative performance evaluation of the FPGA implementation of the GCD using seven different FPGA chip families.
- A discussion of the synthesis results related to the design area, the total delay, the minimum delay, the maximum frequency and the total FPGA thermal power dissipation.
- A comparison of the proposed GCD implementation with many state-of-the-art works.

The remainder of this paper is organized as follows. Section 2 discusses the related works on GCD

designs and implementations. Section 3 provides a brief background of GCD arithmetic along with a

detailed numerical example for illustration purposes. Section 4 discusses the complete hardware

implementation and specifications. Section 5 contains experimental results with their associated

discussions, including performance measures, hardware utilization of the proposed implementation,

FPGA total thermal power dissipation and a state-of-the-art benchmarking study. Finally, Section 6

concludes the paper.

2. Literature Review

Recently, many hardware/software solutions have attempted to address the efficient design of iterative

number theory algorithms, such as the Euclidian algorithm. The most commonly used solutions

include FPGA design and synthesis, hardwired microprogramming, and software-based simulations

via high-level programming languages.

For instance, the FPGA design with its various chip families was the dominant method of

implementing a high-speed GCD processor. Upadhyay and Patel [13] proposed a 4-bit hardware

design for the Euclidean-calculated GCD using a narrative method with modular arithmetic based on

subtraction (replacing the remainder by repeated subtraction) using basic digital components, such as

multiplexers, comparators, registers, and a full subtractor. They concluded that the proposed design

provides less complexity in terms of both hardware requirements and execution time. Additionally,

Kohale and Jasutkar [14] studied the performance (area-speed) and power dissipation for the FPGA

design of an 8-bit GCD processor using Euclid’s and Stein’s algorithms with Spartan 6 as the

hardware chip technology. Their experimental results showed that Spartan 6 improved the power

consumption by 42% and increased the performance speed over previous generation devices (i.e.,

Spartan 3). They also found that Stein’s algorithm has better results than Euclid’s algorithm with less

power consumption and better performance. In their related work [15], they targeted the Xilinx

Spartan-3 chip family via VHDL to develop an FPGA design for the GCD based on two computation

methods: Euclid’s and Stein’s algorithms. Their experimental results were generated using Xilinx ISE

9.1i and showed that Euclid's GCD algorithm recorded a better performance with fewer slice registers

and required bounded input/output blocks (IOB). It also recorded the minimal power consumption at

24 mW.

Moreover, Shah et al. [16] utilized the idea of reversible logic to break the conventional speed-power

trade-off, which they claimed had a close match to quantum computing devices. To authenticate their

research, various combinational and sequential circuits were implemented using reversible gates, including an 8-bit GCD processor. Their FPGA design of the GCD processor recorded a maximum frequency of 456 MHz at an operand size of 8 bits using the Spartan-3 XC3S50 family. Furthermore, Willingham and Kale [17] proposed an asynchronous system that uses Euclid’s algorithm to calculate

the GCD of two integers that contain both repetition and decision to implement arbitrarily complex

computational systems. They showed that, under typical conditions in a 0.35-μm process, a 16-bit

implementation can perform a 24-cycle test vector in 2.067 μs with a power consumption of 3.257

nW. Boland et al. [18] applied a word-length optimization technique to implement every arithmetic

operator throughout a custom FPGA-based accelerator via the IEEE-754 standard single or double

precision arithmetic. They implemented the FPGA design of Euclid's GCD algorithm using Xilinx Coregen and obtained a maximum frequency ranging from 230 MHz down to 180 MHz as the number of fractional bits varies from 5 to 10.

Recently, the FDFM (few DSP slices and few block memories) approach has been proposed and efficiently applied to different FPGA designs, such as the FDFM-based designs/implementations of a Euclid's GCD unit in [19, 20]. Zhou et al. [19]

implemented their processor core that executes Euclid's GCD algorithm using few DSP slices and few

blocks of RAM in a single FPGA. This processor core (called the GCD processor core) has been built

using the FDFM approach embedded in the Xilinx Virtex-7 family FPGA XC7VX485T-2. Their

experimental results showed that the performance of this FPGA implementation using the 1280 GCD

processor cores is 0.0904 µs per GCD computation for two 1024-bit integers, which is 3.8 times faster

than the best GPU implementation and 316 times faster than a sequential implementation on an Intel

Xeon CPU. In a related work, the same authors [20] used the same implementation environment and adopted 1408 processors that work in parallel and compute the GCD independently. They showed that this new core runs at 0.057 µs per GCD computation of two 1024-bit RSA moduli, which is 6.0 times faster than the best GPU implementation and 500 times faster than a sequential implementation on an Intel Xeon CPU.

The use of BIST (built in self-test) technology as an additional microchip controller had an impact on

the very large scale integrated (VLSI) design according to Devi et al. [21], who focused on the

dramatic impact of the VLSI as it increases the complexity of the circuits. Per the Altera corporation

[22], one solution can be used to avoid the overhead of the VLSI design by adding an extra IC chip

with a self-test ability. Thus, they proposed a VHDL implementation of the GCD processor with the

BIST capability using the Xilinx Spartan-3 chip family. Then, they compared the area overhead for

both schemes (with/without BIST). The experimental results showed that the BIST implementation

for the GCD increased the area overhead but eliminated the need to acquire high-end testers. In addition, Kohale and Jasutkar [23] proposed an FPGA design with the BIST controller of the arithmetic logic

unit (ALU) to calculate the 8-bit GCD of two positive integers using Euclid’s and Stein’s algorithms.

They compared the design using various Xilinx Families (with and without the BIST technique). The

selection of the Xilinx Family depends on the lowest power consumption of the ALU. Thus, they

concluded that the Spartan 3E FPGA family was preferable for the GCD design with the BIST feature, as it recorded the lowest power dissipation of 34 mW. In a related work, the authors of [24] applied BIST technology in new 4-bit and 8-bit GCD processors based on the BIST controller using Euclid’s and Stein’s algorithms. They applied the proposed FPGA design to three Xilinx target devices, namely the XC3S50, XC4VFX12, and XC6SLX4. Comparisons of the number of look-up tables (LUTs) showed that the XC6SLX4 device was the most efficient, as it required the minimum design area.

The software-based solutions are valid for specific design situations. In [25], Upadhyay et al.

proposed an 8-bit hardware design GCD processor using four different algorithms, including Euclid's

method, the divisibility check method, the dynamic modulo method and the static modulo method.

They simulated their work using Logisim Simulator 2.7.1 and compared the designs in terms of both

space and time complexity. The conducted experiments of [25] showed that Euclid's method was the

best-suited method in terms of space, while the dynamic modulo method was the best method in terms

of time complexity since the number of clock pulses was considerably reduced. Additionally, Hemmer et al. [26] reported on several generic implementations for univariate polynomial GCD computations over the integers, particularly over algebraic extensions. They designed a new polynomial software package that became a part of the CGAL release 3.4. Regarding the GCD, their FPGA implementation of the GCD using the hybrid approach computed Euclid's algorithm in approximately 1 ms at a datapath size of 1024 bits. Furthermore, Ellerve and his research

group [27] described an environment to accelerate fault simulation by hardware emulation on FPGA

digital circuits. The proposed approach allows the simulation speed to be increased by 40 to 500 times

compared to the state-of-the-art software-based fault simulation. The study included the FPGA design

of the 32-bit GCD as a benchmark, which recorded a maximum frequency of 25 MHz with and

without fault dropping and 20 MHz with three-valued logic. Based on the experiments, it can be

concluded that it is beneficial to use emulation for circuits that require large numbers of test vectors

while using simple but flexible algorithmic test vector generating circuits (BIST).

Another notable method recorded in the literature is the use of mixed solutions, such as the conversion of a high-level language to an HDL. In [28], the authors presented an optimization technique for flow paths (a compiler for converting high-level stack-based languages (Java, C++, C#, and VB) to VHDL for use on an FPGA or application-specific integrated circuit (ASIC)), yielding new self-propagating flow paths that execute faster and are less resource-intensive. They conducted several comparisons and synthesized their proposed Euclid’s GCD design for a Xilinx Spartan-6 XC6SLX75, which reported a maximum frequency of 200 MHz for the 32-bit size.

Different from these previous works, the major contribution of our work focuses on efficient FPGA

implementation and the synthesis of Euclid's GCD using different datapath sizes and FPGA device

technologies in terms of timing issues, such as the critical path delay and the maximum frequency of a

digital GCD processor. In this paper, we implemented the proposed GCD using VHDL. In addition, a comparative synthesis study is presented for several implementation options using different FPGA devices in terms of delay and maximum frequency. The comparison with other existing designs and implementations shows that the proposed coprocessor implementation improves performance at larger operand sizes.

3. Euclid's GCD algorithm-Revisited

The GCD of two numbers is the largest integer that divides both of them. Two numbers whose

GCD is 1 are called co-prime or relatively prime. There are many algorithms that can be used to

compute the GCD [8, 10]. Euclid's algorithm was chosen due to its proven efficiency and simplicity.

It also arrives at the solution faster within a single cycle [10]. Euclid’s algorithm computes the GCD

of two non-negative integers (at least one of which is non-zero). The well-ordering principle states

that every non-empty set of positive integers has a smallest element. Assume that a ≥ b > 0 for

integers a and b. To find the GCD of (a, b), the division algorithm [9] tells us that:

$$a = q_1 b + r_1, \quad 0 \le r_1 < b.$$

If $r_1 = 0$, then $b \mid a$ and $\mathrm{GCD}(a, b) = b$. If $r_1 \ne 0$, divide $b$ by $r_1$ to produce integers $q_2$ and $r_2$ such that:

$$b = q_2 r_1 + r_2, \quad 0 \le r_2 < r_1.$$

If $r_2 = 0$, we stop the process. Otherwise, we continue to get $r_1 = q_3 r_2 + r_3$, where $0 \le r_3 < r_2$.

This process continues until we get a zero remainder $r_{N+1}$. We arrive at the following system of equations:

$$
\begin{aligned}
a &= q_1 b + r_1, & 0 < r_1 < b \\
b &= q_2 r_1 + r_2, & 0 < r_2 < r_1 \\
r_1 &= q_3 r_2 + r_3, & 0 < r_3 < r_2 \\
&\;\;\vdots \\
r_{N-2} &= q_N r_{N-1} + r_N, & 0 < r_N < r_{N-1} \\
r_{N-1} &= q_{N+1} r_N + 0
\end{aligned}
$$

Hence $r_N = \mathrm{GCD}(a, b)$.

Additionally, selecting an appropriate FPGA kit depends on the application itself. There is a clear

trade-off between the two dominant FPGA companies of Altera and Xilinx. The comparison of kits

for the main and common features provided in each chip family can be found in [4]. For a better

understanding and explanation of the implemented Euclid’s GCD algorithm, we give an illustration

example as follows:
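As a minimal sketch of the division chain above, the following Python function records each step a = q·b + r until the remainder reaches zero. The operands 252 and 198 and the name `euclid_trace` are illustrative assumptions, not values or identifiers taken from the paper.

```python
def euclid_trace(a, b):
    """Return (gcd, steps); each step is (dividend, divisor, quotient, remainder)."""
    if a < 0 or b < 0 or (a == 0 and b == 0):
        raise ValueError("operands must be non-negative and not both zero")
    steps = []
    while b > 0:
        q, r = divmod(a, b)          # a = q*b + r with 0 <= r < b
        steps.append((a, b, q, r))
        a, b = b, r                  # the divisor and remainder become the next pair
    return a, steps


if __name__ == "__main__":
    # Hypothetical operands chosen only for illustration.
    g, steps = euclid_trace(252, 198)
    for dividend, divisor, q, r in steps:
        print(f"{dividend} = {q} * {divisor} + {r}")
    print("GCD =", g)
```

For these operands the chain has four divisions, and the last non-zero remainder, 18, is the GCD.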

4. Implementation and Environment

To benchmark the proposed implementation, we developed it in VHDL as the hardware description language and tested the code on seven different FPGA chips, including the Altera Cyclone IV (EP4CGX-22CF19C6) and the Xilinx Virtex-7 (XC7VH290T-2-HCG1155). On the software side, several programs were used: Altera Quartus II [22] as the full platform for the Altera kit, including the hardware synthesis; ModelSim-Altera 10.1d for simulation and verification [22]; the Xilinx ISE Design Suite version 14.2 [4] to synthesize the implementation for six Xilinx FPGA devices, in addition to the Altera device (to benchmark and compare); and Maple Worksheets 17 for mathematical verification [29]. Moreover, a high-performance multiprocessor platform was used in the coding, simulation, verification, synthesis, and testing phases. The platform specifications are shown in Table 1. Furthermore, the implementation was synthesized for a range of bit sizes (from 32 to 1024 bits).

Table 1. Simulation Platform Specifications.

| Item | Description |
|---|---|
| Processor / OS | 4th Gen. Intel-I7 Quad-Core [3.4 GHz, 8 MB Shared Cache] / Win 8.1 64 bit |
| Memory / Hard Drive | 16 GB DDR3 - 1600 MHz / 2 TB 7200 RPM SATA |
| Graphics / Screen | 2 GB AMD Radeon R7 240 [DVI, HDMI, & DVI-VGA] / 23" LED Display |

Figure 1. (a) The Euclidian GCD algorithm (b) The finite state diagram of the GCD

In this work, we have implemented the Euclidean algorithm (given in figure 1.a) to iteratively compute the GCD by performing the repeated modular multiplication method. The two-parameter loop is executed repeatedly while the second input is greater than zero, exchanging the input registers on each pass and placing the remainder in the second input register. When the loop finishes, the output is ready in the first input register. The finite state diagram of the proposed implementation is depicted in figure 1.b. It shows the state values along with the transition conditions.
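The register-level loop described above can be sketched as a small software model. This is only an illustration under stated assumptions: the state names `IDLE`, `COMPUTE` and `DONE`, the function name `gcd_fsm`, and the simplification that each remainder step takes one iteration are all hypothetical, since figure 1.b is not reproduced here.

```python
from enum import Enum


class State(Enum):
    IDLE = 0      # hypothetical: waiting for the enable signal
    COMPUTE = 1   # hypothetical: one swap-and-reduce step per iteration
    DONE = 2      # hypothetical: result valid, acknowledgement asserted


def gcd_fsm(x, y, width=32):
    """Software model of the loop; returns (gcd, number of reduce steps)."""
    mask = (1 << width) - 1
    if not (0 <= x <= mask and 0 <= y <= mask) or (x == 0 and y == 0):
        raise ValueError("operands must fit the datapath and not both be zero")
    state, reg_a, reg_b, steps = State.IDLE, 0, 0, 0
    while state is not State.DONE:
        if state is State.IDLE:
            reg_a, reg_b = x, y                    # load the two input registers
            state = State.COMPUTE
        elif reg_b > 0:
            reg_a, reg_b = reg_b, reg_a % reg_b    # exchange; remainder goes to the second register
            steps += 1
        else:
            state = State.DONE                     # second register is zero: result sits in reg_a
    return reg_a, steps
```

The loop continues while the second register is non-zero, which matches the termination condition in the text; a real datapath would spend several clock cycles on each remainder, so `steps` counts iterations rather than clock cycles.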

Figure. 2. (a) Top-level block diagram of the GCD. (b) Internal architecture of the GCD coprocessor

The top view of the GCD processor is illustrated in figure 2.a. The GCD processor has two (N-1 bits)

numbers as input values, three control signals (reset, enable and a synchronized clock), N-1 bits

number as an output, and an acknowledgement signal. The internal hardware architecture is illustrated

in figure 2.b. The implementation consists of the following:

- Three main registers: two to hold the inputs and one to hold the output.
- A modular multiplication unit that uses an interleaved algorithm for faster performance. The modular multiplication is needed to keep the products smaller than the input operands.
- A subtraction unit used for repeated reductions of the swapped operands.
- One equality comparator, which can be built only from NOR gates. It iteratively tests the output results.
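The text names an interleaved modular multiplication unit without listing its steps. The sketch below shows the textbook shift-add-reduce form of interleaved modular multiplication; it is offered as an assumption about what such a unit computes, not as the authors' circuit, and the names `interleaved_modmul` and `width` are hypothetical.

```python
def interleaved_modmul(x, y, m, width):
    """Compute (x * y) mod m by scanning the bits of x from MSB to LSB."""
    if m <= 0 or not (0 <= x < m and 0 <= y < m):
        raise ValueError("require m > 0 and 0 <= x, y < m")
    if x >> width:
        raise ValueError("x does not fit in the given width")
    p = 0
    for i in reversed(range(width)):
        p <<= 1                    # double the partial product
        if (x >> i) & 1:
            p += y                 # add the multiplicand when bit i of x is set
        # Here p < 3m, so at most two subtractions bring it back below m.
        if p >= m:
            p -= m
        if p >= m:
            p -= m
    return p
```

Because the reduction is interleaved with each shift-add step, the partial product never exceeds three times the modulus, so the datapath stays only slightly wider than the operands.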

We believe that our proposed implementation is efficient for several reasons. First, the use of interleaved modular multiplication as the core of computing Euclid’s GCD improves efficiency. Second, the implementation uses the maximum possible number of concurrent VHDL statements. Third, the use of new FPGA chip technology offers better hardware utilization and enhanced performance.

5. Cost Factors Results and Discussion

The experimental data presented in this section were generated using both the Altera Quartus II and Xilinx ISE synthesis tools. The target chip technologies were the Altera Cyclone IV (ep4cgx-22cf19c6/ep4ce115 f29c7), the Xilinx Virtex-7 (XC7VH290T-2-HCG1155), the Xilinx Virtex-5 (XC5VLX20T-2-FF323), the Xilinx Spartan-6 (XC7Z010-2-CLG400), the Xilinx Artix-7 (XC7A100T-2-CSG324), the Xilinx Kintex-7 (XC7K70T-2-FBG676), and the Xilinx Zynq (XC7Z010-2-CLG400). We synthesized our VHDL code for the GCD processor on all of them.

The bar chart in figure 3 compares the maximum frequency values (in MHz) of variable implementation lengths (32-, 64-, 128-, 256-, 512-, and 1024-bits) for the seven FPGA devices. The lowest frequencies are recorded for the Altera implementation, with 104.2, 77.1 and 43.4 MHz for 32, 64, and 128 bits, respectively. No numbers were recorded for higher bit lengths due to the limited capacity of this educational device. The next two devices have higher rates: the maximum frequencies for the implementations based on the Artix-7 and the Virtex-5 are greater than Altera by almost 33% and 14%, respectively. The implementations based on the Virtex-7 and Zynq devices achieved equal frequencies, with an average enhancement of 48% in overall frequency relative to Altera. The Spartan-6 version showed a similar tendency to the Altera version except for the 32-bit implementation length. The highest numbers belonged to the implementation based on the Kintex-7 device, with an 11% increase in frequency compared to the Virtex-7 for the 32-bit length.

Figure 3. Maximum frequency values (MHz)

In contrast, the minimum period values (in nanoseconds) shown in figure 4 were low in all device families. The highest values were recorded for the Artix-7 and the Virtex-5 at the 1024-bit length (44.1 ns and 42.7 ns, respectively). The lowest values were recorded for the Kintex-7, the Virtex-7, and the Zynq, with 25 ns for the same bit length. The critical path delay for the Spartan-6 device was 38.1 ns for the same length. No delays are listed for the Altera device for bit lengths above 128 bits. The figures for other bit lengths were relatively uniform and range from 4.1 ns for the 32-bit Kintex-7 implementation to 18.9 ns for the 128-bit Altera implementation and up to 28.4 ns for the 512-bit Spartan-6 implementation, which is about twice the period of the Virtex-7 and Zynq at the same bit length. Thus, higher bit length implementations (256, 512, and 1024) are best implemented with the Virtex-7, Zynq, and Kintex-7 device families since they recorded the best figures for maximum frequency.
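The minimum period and the maximum frequency are two views of the same measurement, related by f_max = 1 / T_min. A minimal sketch (the helper name `period_ns_to_mhz` is an assumption) converts the 1024-bit periods quoted above into frequencies:

```python
def period_ns_to_mhz(period_ns):
    """Convert a minimum clock period in ns to a maximum frequency in MHz."""
    if period_ns <= 0:
        raise ValueError("period must be positive")
    return 1000.0 / period_ns


if __name__ == "__main__":
    # 1024-bit minimum periods quoted in the text.
    for device, period in [("Kintex-7", 25.0), ("Virtex-5", 42.7), ("Artix-7", 44.1)]:
        print(f"{device}: {period} ns -> {period_ns_to_mhz(period):.1f} MHz")
```

A 25 ns period corresponds to 40 MHz, which agrees with the roughly 39.9 MHz reported for the 1024-bit Kintex-7 and Virtex-7 implementations once rounding of the period is taken into account.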

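Values plotted in Figure 3 (maximum frequency, MHz):

| Device | 32 bits | 64 bits | 128 bits | 256 bits | 512 bits | 1024 bits |
|---|---|---|---|---|---|---|
| Altera | 104.2 | 77.1 | 43.4 | - | - | - |
| Virtex-7 | 217.1 | 187.6 | 147.5 | 106.4 | 68.4 | 39.9 |
| Virtex-5 | 121.0 | 102.8 | 83.7 | 61.0 | 39.7 | 23.4 |
| Spartan-6 | 217.1 | 76.5 | 58.5 | 52.6 | 35.2 | 26.2 |
| Artix-7 | 154.9 | 129.3 | 97.8 | 65.8 | 40.3 | 22.7 |
| Kintex-7 | 243.9 | 187.6 | 147.5 | 106.4 | 68.4 | 39.9 |
| Zynq | 217.1 | 187.6 | 147.5 | 106.4 | 68.4 | 39.5 |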

Figure 4. Minimum period values (ns)

Figure 5 shows the total delay values. These values are estimated by multiplying the expected longest

source clock period (from source rise to destination rise) by the number of logic stages (levels). The

number of levels is that required by the synthesizer to travel from the input of a flip-flop or latch

through logic and routing and arrive at the output of the chip before the next clock edge. This includes

the clock-to-Q delay of the source flip-flop and the path delay from that flip-flop to the output pad

(ISE 14.1 Synthesizer, 2014). The figure shows that the longest delay belongs to the Artix-7 at the 1024-bit length, with 88.2 ns. The best delay belongs to the Kintex-7, Virtex-7 and Zynq, with delays of about 50.1 ns for the same bit length.
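The estimate described above is a simple product of the clock period and the number of logic levels. In the sketch below, the level count of 2 is an inference from the quoted figures (an 88.2 ns total delay against a 44.1 ns minimum period for the 1024-bit Artix-7), not a number stated in the paper; the function and parameter names are assumptions.

```python
def total_delay_ns(clock_period_ns, logic_levels):
    """Estimate total delay as clock period multiplied by the number of logic levels."""
    if clock_period_ns <= 0 or logic_levels <= 0:
        raise ValueError("period and level count must be positive")
    return clock_period_ns * logic_levels


if __name__ == "__main__":
    # 1024-bit Artix-7: 44.1 ns minimum period, assumed 2 logic levels.
    print(f"{total_delay_ns(44.1, 2):.1f} ns")
```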

Figure 5. Total delay values (ns)

Values plotted in Figure 4 (minimum period, ns):

| Device | 32 bits | 64 bits | 128 bits | 256 bits | 512 bits | 1024 bits |
|---|---|---|---|---|---|---|
| Altera | 9.0 | 14.0 | 18.9 | - | - | - |
| Virtex-7 | 4.6 | 5.3 | 6.8 | 9.4 | 14.6 | 25.0 |
| Virtex-5 | 8.3 | 9.7 | 11.9 | 16.4 | 25.2 | 42.7 |
| Spartan-6 | 4.6 | 13.1 | 17.1 | 19.0 | 28.4 | 38.1 |
| Artix-7 | 6.5 | 7.7 | 10.2 | 15.2 | 24.8 | 44.1 |
| Kintex-7 | 4.1 | 5.3 | 6.8 | 9.4 | 14.6 | 25.0 |
| Zynq | 4.6 | 5.3 | 6.8 | 9.4 | 14.6 | 25.3 |

Values plotted in Figure 5 (total delay, ns):

| Device | 32 bits | 64 bits | 128 bits | 256 bits | 512 bits | 1024 bits |
|---|---|---|---|---|---|---|
| Altera | 18.0 | 28.0 | 37.8 | - | - | - |
| Virtex-7 | 9.2 | 10.7 | 13.6 | 18.8 | 29.2 | 50.1 |
| Virtex-5 | 16.5 | 19.4 | 23.9 | 32.8 | 50.4 | 85.4 |
| Spartan-6 | 9.2 | 26.2 | 34.2 | 38.0 | 56.8 | 76.3 |
| Artix-7 | 12.9 | 15.5 | 20.4 | 30.4 | 49.7 | 88.2 |
| Kintex-7 | 8.2 | 10.7 | 13.6 | 18.8 | 29.2 | 50.1 |
| Zynq | 9.2 | 10.7 | 13.6 | 18.8 | 29.2 | 50.6 |

From the obtained results, we can see that the minimum delay and maximum frequency occur when the precision/datapath size is 32 bits on the Xilinx Kintex-7 XC7K70T-2-FBG676. As the operand precision increases, the delay increases approximately linearly. For a larger datapath size such as 1024 bits, the maximum frequency was recorded for the Xilinx Virtex-7 XC7VH290T-2-HCG1155 and the Xilinx Kintex-7 XC7K70T-2-FBG676. Even though some of the previous designs and implementations differ in architecture, datapath size and device technology, the comparisons between our implementation and others are valid, as they show that our proposed implementation is competitive with many dedicated solutions. For instance, the FPGA implementation of the GCD in [16] recorded a maximum frequency of 456.09 MHz at an operand size of 8 bits for the reversible GCD control unit using the Spartan-3 XC3S50 family. By comparison, our GCD processor on a newer member of the same chip family (Spartan-6 XC7Z010-2-CLG400) computes the GCD 1.8 times faster and would be 2.03 times faster if used with the Kintex-7 version. Furthermore, the Euclid’s GCD implementation synthesized in [28] for the Xilinx Spartan-6 XC6SLX75 reported a maximum frequency of 200 MHz for the 32-bit size. Our processor reaches 217.086 MHz when synthesized with the Xilinx Spartan-6 XC7Z010-2-CLG400 (the same FPGA family) and 243.9 MHz when synthesized with the Xilinx Kintex-7 XC7K70T-2-FBG676. Therefore, our processor is 1.09 and 1.22 times faster, respectively.

Table 2. Hardware Utilization Using the XC7VH290T-2-HCG1155.

Precision   32-bit     64-bit     128-bit     256-bit     512-bit     1024-bit
Registers   376 (0%)   729 (0%)   1435 (0%)   2844 (0%)   5658 (1%)   11290 (2%)
LUTs        508 (0%)   964 (0%)   1637 (1%)   2697 (1%)   5315 (2%)   10518 (4%)

Table 2 shows the hardware utilization results for the GCD coprocessor when implemented on the Virtex-7 (device: XC7VH290T-2-HCG1155), expressed as the number of utilized registers (the target device has 437600 registers in total) and the number of utilized look-up tables, LUTs (the target device has 218800 LUTs in total). The implementation utilizes only a small fraction of the target device's resources at every precision. The largest implementation (i.e., 1024-bit) utilizes a maximum of 2% and 4% of the device registers and LUTs, respectively. This indicates that the implementation area is scalable and that the design can easily be enlarged or embedded alongside many other design applications.
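The utilization figures in Table 2 can be recomputed from the raw counts. The short Python sketch below does so, using the register and LUT totals quoted above; the names `USED`, `utilization` and `utilization_table` are illustrative and not part of the original design flow. The integer percentages in Table 2 are those reported by the synthesis tool, so the exact ratios printed here may differ from them by rounding.

```python
# Device totals quoted in the text for the XC7VH290T-2-HCG1155.
TOTAL_REGISTERS = 437_600
TOTAL_LUTS = 218_800

# Counts transcribed from Table 2: precision (bits) -> (registers, LUTs).
USED = {
    32: (376, 508),
    64: (729, 964),
    128: (1435, 1637),
    256: (2844, 2697),
    512: (5658, 5315),
    1024: (11290, 10518),
}


def utilization(used, total):
    """Return used/total as a percentage."""
    return 100.0 * used / total


def utilization_table():
    """Map each precision to (register %, LUT %) of the target device."""
    return {
        bits: (utilization(regs, TOTAL_REGISTERS), utilization(luts, TOTAL_LUTS))
        for bits, (regs, luts) in USED.items()
    }


if __name__ == "__main__":
    for bits, (reg_pct, lut_pct) in utilization_table().items():
        print(f"{bits:>4}-bit: registers {reg_pct:5.2f}%  LUTs {lut_pct:5.2f}%")
```

Under these numbers, even the 1024-bit design stays below 3% of the registers and below 5% of the LUTs, and the register count roughly doubles each time the precision doubles.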

Table 3. Total FPGA Thermal Power Dissipation (mW) using the Altera Cyclone IV E (EP4CE115F29C7)

Precision      8-bit   16-bit  32-bit  64-bit  128-bit  164-bit  256-bit  512-bit  1024-bit
I/O Power      4.0     7.0     13.0    26.0    50.0     64.0     100.0    200.0    300.0
Static Power   135.0   135.0   135.0   135.0   135.0    135.0    135.0    135.0    135.0
Total FPGA     139.0   142.0   148.0   161.0   185.0    199.0    235.0    335.0    435.0

Table 3 shows the total FPGA thermal power dissipation (mW) consumed when applying the GCD algorithm with different datapath lengths (8-bit to 1-kbit) to the Altera Cyclone IV E (EP4CE115F29C7) FPGA kit. The estimated power results for the designs with precisions from 8-bit to 164-bit were generated using the PowerPlay Early Power Estimator tool in the Quartus II CAD package. The 164-bit design was the largest datapath that the power estimation tool allowed, owing to the number of I/O pins provided by the target FPGA kit: the 512 pins cover two 164-bit inputs and one 164-bit output result, and the remaining pins are used for control signals such as clock, enable, acknowledge and reset. The power values for the larger designs (i.e., 256, 512 and 1024 bits) can be extrapolated from the general trend of the power figures.


The total FPGA design power is driven mostly by the I/O power, while the other term (the static power) is constant, as illustrated in Figure 6.

Figure 6. Total FPGA thermal power dissipation (mW): I/O power and total FPGA power for datapath sizes from 8 bits to 1024 bits.
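The relationship between the rows of Table 3 can be checked with a few lines of Python. The sketch below assumes that the total is simply the sum of the I/O and static terms, which is inferred from the table values rather than stated as a formula in the text; the names `IO_MW`, `total_power_mw` and `io_share` are illustrative.

```python
# Values transcribed from Table 3 (mW). Entries above 164 bits are the
# extrapolated figures listed in the table, not tool estimates.
STATIC_MW = 135.0
IO_MW = {
    8: 4.0, 16: 7.0, 32: 13.0, 64: 26.0, 128: 50.0,
    164: 64.0, 256: 100.0, 512: 200.0, 1024: 300.0,
}
REPORTED_TOTAL_MW = {
    8: 139.0, 16: 142.0, 32: 148.0, 64: 161.0, 128: 185.0,
    164: 199.0, 256: 235.0, 512: 335.0, 1024: 435.0,
}


def total_power_mw(io_mw, static_mw=STATIC_MW):
    """Assumed model: total thermal power = I/O power + static power."""
    return io_mw + static_mw


def io_share(bits):
    """Fraction of the total power attributable to I/O at a given width."""
    io = IO_MW[bits]
    return io / total_power_mw(io)


if __name__ == "__main__":
    for bits, io in IO_MW.items():
        total = total_power_mw(io)
        status = "ok" if total == REPORTED_TOTAL_MW[bits] else "mismatch"
        print(f"{bits:>4}-bit: total {total:6.1f} mW, "
              f"I/O share {io_share(bits):5.1%} ({status})")
```

With the static term fixed at 135 mW, all of the growth in total power with datapath width comes from the I/O term, whose share rises from about 3% at 8 bits to about 69% at 1024 bits.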

To sum up, this paper reports competitive synthesis results that can guide the implementation of a GCD processor using parallel arithmetic units and redundant multipliers and adders, such as those commonly used in cryptographic systems over a known finite field. It was found that choosing the best chip technology increases the throughput of the arithmetic operations.
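As a software point of reference for the GCD processor discussed above, Euclid's iteration can be sketched in a few lines of Python, whose integers are arbitrary precision and therefore cover the 32-bit to 1024-bit operand range. This is only a behavioural sketch using the remainder form of the algorithm; it is not the hardware datapath of the paper, and the helper `fits_precision` is a hypothetical name introduced for illustration.

```python
def euclid_gcd(a, b):
    """Greatest common divisor of two non-negative integers (Euclid)."""
    if a < 0 or b < 0:
        raise ValueError("operands must be non-negative")
    # Replace (a, b) by (b, a mod b) until the remainder is zero.
    while b != 0:
        a, b = b, a % b
    return a


def fits_precision(value, bits):
    """True if value is representable as an unsigned integer of `bits` bits."""
    return 0 <= value < (1 << bits)


if __name__ == "__main__":
    x = (1 << 1024) - 1   # largest 1024-bit operand
    y = (1 << 512) - 1    # 512-bit operand
    assert fits_precision(x, 1024) and fits_precision(y, 1024)
    print(euclid_gcd(48, 18))
    print(euclid_gcd(x, y) == y)
```

A model of this kind could be used to generate expected results when simulating a hardware implementation at each precision.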

6. Conclusions

In this paper, we propose an efficient FPGA implementation of a GCD processor based on Euclid's algorithm, targeting Altera and Xilinx FPGA devices. The performance of the proposed implementation is studied in terms of both the critical path delay (ns) and the maximum frequency (MHz), so that the proposed coprocessor can be compared across different implementations and simulations. The synthesized results target seven different chip technologies (one Altera chip and six Xilinx chips) and six different datapath sizes (32 to 1024 bits). The Xilinx Virtex-7 XC7VH290T-2-HCG1155 and the Xilinx Kintex-7 XC7K70T-2-FBG676 were found to be the fastest FPGA devices. Overall, maximum frequencies of 243.934 MHz for 32-bit datapaths down to 39.94 MHz for 1024-bit datapaths were achieved, which shows that the proposed coprocessor implementation is up to two times faster than other state-of-the-art implementations.
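The two performance metrics named above are related. Assuming the usual synchronous-design relationship, in which the maximum clock frequency is the reciprocal of the minimum clock period set by the critical path, the reported frequencies can be converted to delays as in the following sketch (the function names are illustrative).

```python
def period_ns(fmax_mhz):
    """Minimum clock period in ns for a maximum frequency in MHz."""
    return 1000.0 / fmax_mhz


def fmax_mhz(delay_ns):
    """Maximum frequency in MHz for a critical path delay in ns."""
    return 1000.0 / delay_ns


if __name__ == "__main__":
    # Frequencies quoted in the conclusions for 32-bit and 1024-bit datapaths.
    for label, f in (("32-bit", 243.934), ("1024-bit", 39.94)):
        print(f"{label}: {f} MHz corresponds to {period_ns(f):.2f} ns")
```

Under this assumption, 243.934 MHz corresponds to a critical path of about 4.1 ns and 39.94 MHz to about 25.0 ns.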

References

[1]. Q. A. Al-Haija, M. Al-Ja'fari, M. Smadi, (2016) 'A comparative study up to 1024-bit

Euclid's GCD algorithm FPGA implementation & synthesizing', 2016 5th International

Conference on Electronic Devices, Systems and Applications (ICEDSA), Ras Al Khaimah,

United Arab Emirates. Pp. 296-300.


[2]. C. Maxfield, (2004) 'The Design Warrior's Guide to FPGAs: Devices, Tools and Flows', Mentor

Graphics Corporation and Xilinx, Inc., Elsevier.

[3]. D. L. Perry, (2002) 'VHDL: programming by Example', Fourth ed., the McGraw-Hill

Company.

[4]. ISE 14.1 Synthesizer, (2014) 'Software manual and forum', Xilinx corporation.

https://forums.xilinx.com.

[5]. Q. Abu Al-Haija, M. Smadi, M. Al-Ja'fari, A. Al-Shua'ibi (2014) 'Efficient FPGA

Implementation of RSA Coprocessor using Scalable Modules', 9th International

Conference on Future Networks & Communications (FNC), Elsevier, Canada. Procedia

Computer Science, Vol 34; Pp 647 – 654

[6]. X. Chu, D. Feng, (2015) ‘On the provable security of TPM2.0 cryptography APIs’, Int. J. of

Embedded Systems, Inderscience, Vol 7(3/4); Pp. 230-243.

[7]. H. Sriraman, P. Venkatasubbu, (2017) ‘On the field design bug tolerance on a multi-core

processor using FPGA’, Int. J. of High Performance Computing and Networking,

Inderscience, Vol 10(1/2); Pp.34 - 43.

[8]. W. Trappe, L. C. Washington, (2002) 'Introduction to Cryptography with Coding Theory',

Prentice Hall, Vol (1); Pp. 1-176.

[9]. M. D. Ercegovac, T. Lang, (2004) 'Digital Arithmetic', Morgan Kaufmann Publishers,

Elsevier, Vol (1); Pp 51-136.

[10]. W. Stein, (2011) 'Elementary Number Theory: Primes, Congruence, and Secrets', Springer,

Vol (1).

[11]. Y. C. Mei, S. Z. M. Naziri, (2011) 'The FPGA implementation of multiplicative inverse

value of GF(2^8) generator using Extended Euclid Algorithm (EEA) method for Advanced

Encryption Standard (AES) algorithm', IEEE International Conference on Computer

Applications & Industrial Electronics. Pp 12-15.

[12]. I. Marouf, M. M. Asad, Q. Abu Al-Haija, (2017) 'Reviewing and Analyzing Efficient

GCD/LCM Algorithms for Cryptographic Design', International Journal of New Computer

Architectures and their Applications (IJNCAA), The Society of Digital Information and

Wireless Communications, Vol. 7(1), P.p. 1-7.

[13]. D. Upadhyay, H. Patel, (2013) 'Hardware Implementation of Greatest Common Divisor

using Subtractor in Euclid Algorithm', International Journal of Computer Applications

(0975 – 8887), Vol 65 (7); Pp 24-28.

[14]. S. D. Kohale, R. W. Jasutkar, (2014) 'Power optimization of GCD processor using low

power Spartan 6 FPGA family (an improvement over Spartan 3 FPGA)', International

Journal of Conceptions on Electronics & Communication Engineering, Vol 2(1); Pp 1–6.

[15]. S. D. Kohale, R. W. Jasutkar, (2013) 'Designing of an 8-Bit ALU for GCD Computations

using Two Approaches', 3rd International Conference on Intelligent Computational Systems

(ICICS'13). Pp 34-38.

[16]. H. Shah, A. Rao, M. Deshpande, A. Rane, S. Nagvekar, (2014) 'Implementation of High

Speed Low Power Combinational and Sequential Circuits using Reversible logic', IEEE

International Conference on Advances in Electrical Engineering (ICAEE). Pp. 1-4.


[17]. D. J. Willingham, I. Kale, (2008) 'System for calculating the Greatest Common

Denominator implemented using Asynchrobatic Logic', IEEE 26th NORCHIP Conference.

Pp 194-197

[18]. D. Boland, G. A. Constantinides, (2014) 'Word-length Optimization Beyond Straight Line

Code', ACM/SIGDA international symposium on Field programmable gate arrays (FPGA

14), Pp. 105-114.

[19]. Z. Zhou, K. Nakano, Y. Ito, (2016) ‘Efficient Implementation of FDFM Approach for

Euclidean Algorithms on the FPGA’, International Journal of Networking and Computing,

Vol 6 (2); P.p. 420–435.

[20]. Z. Zhou, K. Nakano, Y. Ito, (2016) ‘Parallel FDFM Approach for Computing GCDs using

the FPGA’, Springer: PPAM 2015, Part I, LNCS 9573, Pp. 238–247, DOI: 10.1007/978-3-319-32149-3_23.

[21]. R. Devi, J. Singh, M. Singh, (2011) 'VHDL Implementation of GCD Processor with Built

in Self-Test Feature', International Journal of Computer Applications, Vol 25(2); Pp 1-3.

[22]. ALTERA Corporation, (2013) 'Introduction to the Quartus® II Software'.

[23]. S. D. Kohale, R. W. Jasutkar, (2013) 'Power Dissipation of ALU Implementation of GCD

Processor with and Without BIST Among Various Xilinx Families', International Journal of

Engineering Research & Technology (IJERT), Vol. 2(2); Pp 1-7.

[24]. S. D. Kohale, R. W. Jasutkar, (2013) 'FPGA Based Implementation of BIST Controller

Using Different Approaches', International Journal of Materials, Mechanics and

Manufacturing, vol. 1 (2); Pp 110-113.

[25]. D. Upadhyay, J. Kolte, K. Jalan, (2013) 'Approach to design Greatest Common Divisor

Circuits based on Methodological analysis and Valuate Most Efficient Computational

Circuit', International Journal of Electrical and Electronics Engineering Research (IJEEER),

Vol 3(4); Pp. 59-66.

[26]. M. Hemmer, D. Hulse, (2009) 'Generic implementation of a modular GCD over Algebraic

Extension Fields', 25th European Workshop on Computational Geometry.

[27]. P. Ellervee, J. Raik, K. Tammemäe, R. Ubar, (2006) 'Environment for FPGA-based fault

emulation', Estonian Academy of Sciences, Engineering. Pp 323–33.

[28]. D. M. Hanna, B. Jones, L. Lorenz, M. Bowers, (2011) 'Generating Hardware from Java

using Self-Propagating Flowpaths', International Conference on Embedded Systems and

Applications (ICESA 11).

[29]. Waterloo Maple Inc., (2007) 'Maple User Manual', Maplesoft, a division of Waterloo Maple Inc.