Comparison of the Hardware Performance of the AES Candidates Using Reconfigurable Hardware
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science at George Mason University
By
Pawel R. Chodowiec
Bachelor of Science
Warsaw University of Technology, 1998
Director: Kris M. Gaj, Assistant Professor
Department of Electrical and Computer Engineering
Spring Semester 2002
George Mason University
Fairfax, VA
Table of Contents
                                                                          Page
Abstract ............................................................... viii
1. Preface ................................................................. 1
1.1 Data Encryption Standard ............................................... 1
1.2 DES security ........................................................... 2
2. Introduction ............................................................ 6
2.1 Advanced Encryption Standard ........................................... 6
2.1.1 Requirements and evaluation criteria ................................. 7
2.1.2 Evaluation process .................................................. 10
2.2 Need for comparison of hardware implementations ....................... 13
2.3 Previous work ......................................................... 14
3. Characteristics of hardware implementations ............................ 16
3.1 Hardware vs. software implementations ................................. 16
3.2 Parameters of hardware implementations ................................ 23
3.2.1 Throughput .......................................................... 23
3.2.2 Latency ............................................................. 24
3.2.3 Area ................................................................ 26
3.3 Design tradeoffs ...................................................... 30
3.3.1 Increasing the throughput ........................................... 31
3.3.2 Decreasing the area ................................................. 43
4. Hardware architectures for symmetric-key block ciphers ................. 45
4.1 Main characteristics of block ciphers ................................. 45
4.1.1 Structure of a symmetric-key block cipher ........................... 45
4.1.2 Key schedule ........................................................ 47
4.1.3 Modes of operation .................................................. 48
4.2 Basic iterative architecture .......................................... 51
4.3 Loop unrolling ........................................................ 55
4.4 Outer round pipelining ................................................ 60
4.5 Inner round pipelining ................................................ 65
4.6 Mixed inner- and outer-round pipelining ............................... 70
5. Methodology of comparison of AES candidates ............................ 74
5.1 Limits of this research ............................................... 74
5.2 Choice of architectures ............................................... 77
5.2.1 Comparison in feedback modes ........................................ 77
5.2.2 Comparison in non-feedback modes .................................... 78
5.3 Tools, design process and synthesis parameters ........................ 79
6. Implementation of AES candidates ....................................... 82
6.1 MARS .................................................................. 82
6.1.1 Structure and components of MARS .................................... 82
6.1.2 Implementation of multiplication modulo 2^32 ........................ 89
6.1.3 Results of the implementation of MARS ............................... 94
6.2 RC6 ................................................................... 96
6.2.1 Structure and components of RC6 ..................................... 96
6.2.2 Implementation of squaring modulo 2^32 .............................. 98
6.2.3 Results of the implementation of RC6 ............................... 100
6.3 Rijndael ............................................................. 102
6.3.1 Structure and components of Rijndael ............................... 102
6.3.2 Results of the implementation of Rijndael .......................... 108
6.4 Serpent .............................................................. 109
6.4.1 Structure and components of Serpent ................................ 109
6.4.2 Results of the implementation of Serpent ........................... 112
6.5 Twofish .............................................................. 113
6.5.1 Structure and components of Twofish ................................ 113
6.5.2 Results of the implementation of Twofish ........................... 118
7. Analysis of the results ............................................... 119
7.1 Comparison of ciphers in feedback modes .............................. 119
7.2 Comparison of ciphers in non-feedback modes .......................... 125
8. Summary ............................................................... 132
List of References ....................................................... 138
List of Tables
Table                                                                     Page
2.1-1 Fifteen candidate algorithms ........................................ 11
2.1-2 Security margins of final AES candidate algorithms .................. 12
3.1-I Characteristic features of implementations of cryptographic
      transformations in ASICs, FPGAs, and software ....................... 22
3.3-I Features of methods exploring parallel computations ................. 42
List of Figures
Figure                                                                    Page
3.1-1 Structure of the Virtex FPGA ........................................ 18
3.1-2 Example of an attack on the hardware implementation ................. 20
3.2-1 System consisting of multiple modules with throughput parameters .... 24
3.2-2 System consisting of multiple modules with latency parameters ....... 25
3.2-3 Circuit with FIFO buffers ........................................... 25
3.2-4 Variety of functions possible to implement using one lookup
      table (LUT) ......................................................... 28
3.2-5 Example of LUT utilization .......................................... 29
3.3-1 Parallel processing units – string of data split among units ........ 31
3.3-2 Principles of pipelined implementation .............................. 33
3.3-3 Pipeline with delay of registers taken into account ................. 35
3.3-4 Unbalanced pipeline ................................................. 36
3.3-5 Pipelining of circuit consisting of unequal operations .............. 37
3.3-6 Example of unnecessarily placed register ............................ 38
3.3-7 Throughput in the pipelined implementations ......................... 39
3.3-8 Pipelining of Feistel-network cipher ................................ 41
3.3-9 Example of an array multiplier as a circuit requiring additional
      area for registers when pipelined ................................... 42
3.3-10 Resource sharing ................................................... 44
4.1-1 Flow diagram of a typical symmetric-key block cipher ................ 46
4.1-2 Example of feedback and non-feedback modes of operation ............. 49
4.1-3 Counter mode ........................................................ 50
4.2-1 Basic iterative architecture ........................................ 52
4.2-2 Critical path in the basic iterative architecture ................... 53
4.3-1 Loop unrolling ...................................................... 55
4.3-2 Optimization of logic across rounds ................................. 57
4.3-3 Simultaneous evaluation of functions in unrolled rounds ............. 57
4.3-4 Throughput vs. area ratio for unrolled architectures ................ 59
4.4-1 Outer round pipelining .............................................. 61
4.4-2 Optimization of logic across rounds ................................. 63
4.4-3 Throughput vs. area ratio in outer round pipelining ................. 65
4.5-1 Inner round pipelining .............................................. 66
4.5-2 Throughput vs. area ratio for inner round pipelining ................ 70
4.6-1 Mixed inner- and outer-round pipelining ............................. 71
4.6-2 Throughput vs. area ratio for mixed pipelining ...................... 73
4.6-3 Latency vs. area ratio for mixed pipelining ......................... 73
5.1-1 Block diagram common for all implementations ........................ 76
5.2-1 Throughput/area ratio for mixed architecture ........................ 79
5.3-1 Design flow for each implementation ................................. 80
6.1-1 High-level structure of MARS ........................................ 83
6.1-2 Mixing transformation ............................................... 84
6.1-3 Mixing transformation core .......................................... 84
6.1-4 Keyed transformation ................................................ 86
6.1-5 Keyed transformation core ........................................... 86
6.1-6 E-function. Red line indicates critical path ........................ 87
6.1-7 Variable rotation ................................................... 88
6.1-8 Virtex Slice with carry logic ....................................... 89
6.1-9 Example of multiplication scheme. Two AND gates feed full adder ..... 90
6.1-10 Multiplication – implementation of the circuit from 6.1-9
       in a Virtex Slice .................................................. 91
6.1-11 Array multiplier modulo 2^8 ........................................ 92
6.1-12 Structure of an array multiplier with reversed order of additions .. 92
6.1-13 Change from array to tree .......................................... 93
6.1-14 Final multiplication schematic ..................................... 94
6.2-1 Implementation of one round of RC6 .................................. 97
6.2-2 Squarer derived from array multiplier ............................... 98
6.2-3 Squarer modulo 2^8 .................................................. 99
6.2-4 Optimized squarer modulo 2^8 ........................................ 99
6.3-1 One round of Rijndael .............................................. 103
6.3-2 Construction of a) ByteSub, b) InvByteSub transformations .......... 103
6.3-3 Structure of the implementation of a single round of Rijndael ...... 107
6.4-1 Single round of Serpent ............................................ 109
6.4-2 Implementation of Serpent I8 in basic architecture ................. 110
6.4-3 Serpent I1 ......................................................... 111
6.5-1 High-level structure of the Twofish cipher ......................... 114
6.5-2 S-boxes in Twofish ................................................. 115
6.5-3 Permutation q ...................................................... 115
6.5-4 PHT transformation ................................................. 117
6.5-5 Implementation of a single round of Twofish ........................ 117
7.1-1 Throughput for Virtex XCV-1000, my results ......................... 119
7.1-2 Area for Virtex XCV-1000, my results ............................... 120
7.1-3 Throughput for Virtex XCV-1000, comparison with results of
      other groups ....................................................... 121
7.1-4 Area for Virtex XCV-1000, comparison with results of other groups .. 122
7.1-5 Throughput vs. area for Virtex-1000, our results. The result for
      Serpent I1 based on [12] ........................................... 123
7.1-6 Throughput vs. area for 0.5 µm CMOS standard-cell ASICs,
      NSA result ......................................................... 124
7.2-1 Throughput for mixed inner- and outer-round pipelining in
      Virtex 1000, my results ............................................ 126
7.2-2 Area for mixed inner- and outer-round pipelining on Virtex 1000,
      my results ......................................................... 127
7.2-3 Increase in the encryption/decryption latency as a result of moving
      from the basic architecture to mixed inner- and outer-round
      pipelining ......................................................... 127
7.2-4 Throughput for 0.5 µm CMOS standard-cell ASICs, NSA results ........ 128
8-1 Results of survey filled by participants of the AES3 conference ...... 135
Abstract
COMPARISON OF THE HARDWARE PERFORMANCE OF THE AES
CANDIDATES USING RECONFIGURABLE HARDWARE
Pawel Chodowiec, Computer Engineering M.S.
George Mason University, 2002
Thesis Director: Dr. Kris M. Gaj
The results of fast implementations of all five AES final candidates using Xilinx Virtex
Field Programmable Gate Arrays are presented and analyzed. Performance of several
alternative hardware architectures is discussed and compared. The architecture optimal
from the point of view of the throughput-to-area ratio is selected for each of the two major
types of block cipher modes. For feedback cipher modes, all AES candidates have been
implemented using the basic iterative architecture, and achieved speeds ranging from 61
Mbit/s for Mars to 431 Mbit/s for Serpent. For non-feedback cipher modes, four AES
candidates have been implemented using a high-throughput architecture with pipelining
inside and outside of cipher rounds, and achieved speeds ranging from 12.2 Gbit/s for
Rijndael to 16.8 Gbit/s for Serpent. A new methodology for a fair comparison of the
hardware performance of secret-key block ciphers has been developed and contrasted
with the methodology used by the NSA team.
1. Preface
1.1 Data Encryption Standard
DES is probably one of the best-studied and most controversial ciphers. Its history
began in 1973, when the National Bureau of Standards (NBS) issued a public request for
proposals for a standard symmetric key cryptographic algorithm. The request specified a
series of design criteria. Some of the most important requirements were:
• The algorithm had to provide a high level of security,
• The algorithm had to be completely specified and easy to understand,
• The security of the algorithm had to reside in the key, and could not depend on the
secrecy of the algorithm,
• The algorithm had to be available to all users on a royalty-free basis,
• The algorithm had to be adaptable for use in diverse applications,
• The algorithm had to be economically implementable in electronic devices.
In 1974 IBM submitted a promising algorithm in response to this request. NBS
asked the National Security Agency (NSA) for help in evaluating the algorithm. NSA
introduced a few changes to the algorithm: the key length was shortened from 128
to 56 bits, and the contents of all S-boxes were changed. NSA, however, classified all
information justifying these changes. Since then, DES has drawn criticism. Many
researchers suspected that NSA had installed a trapdoor in the S-boxes permitting NSA to
cryptanalyze the algorithm. The reduction of the key length was also controversial.
Despite all the criticism, DES was adopted as a US encryption standard in 1977,
and became a de facto world standard. The algorithm is defined in the American standard
FIPS 46 "Data Encryption Standard", and is described as a 16-round Feistel-network
cipher operating on 64-bit blocks of data.
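The Feistel construction underlying DES can be illustrated with a short sketch. Note that this is a generic Feistel network, not DES itself: the round function `toy_f` and the round keys below are made-up stand-ins for the real DES f-function and key schedule, chosen only to show why the structure is invertible.

```python
def feistel_encrypt(block64, round_keys, f):
    """Generic Feistel network on a 64-bit block split into 32-bit halves."""
    left, right = block64 >> 32, block64 & 0xFFFFFFFF
    for k in round_keys:
        left, right = right, left ^ f(right, k)
    # Undo the final swap so decryption is the same network with reversed keys.
    return (right << 32) | left

def feistel_decrypt(block64, round_keys, f):
    return feistel_encrypt(block64, list(reversed(round_keys)), f)

def toy_f(half32, key):
    # Stand-in for the DES f-function (expansion, S-box lookups, permutation).
    return (half32 * 0x9E3779B1 ^ key) & 0xFFFFFFFF

# 16 rounds, as in DES; the round keys here are arbitrary toy values.
keys = [(0xA5A5A5A5 + 0x01234567 * i) & 0xFFFFFFFF for i in range(16)]
pt = 0x0123456789ABCDEF
ct = feistel_encrypt(pt, keys, toy_f)
assert feistel_decrypt(ct, keys, toy_f) == pt
```

The key property, used by every Feistel cipher including DES, is that decryption reuses the encryption network with the round keys in reverse order, regardless of whether `f` itself is invertible.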
The terms of the DES standard required its review every five years. In 1983 the
standard was automatically recertified for the next five years. In 1987 NSA proposed
the Commercial COMSEC Endorsement Program, which would lead to the development
of a series of algorithms replacing DES. Those algorithms would not be made public, and
would be available only in tamper-proof VLSI chips. The NSA's proposal was not well
received, and because of the lack of other propositions the DES standard remained
effective for the next five years. In 1993 the 15-year-old standard still remained
unbroken. Again the lack of any alternative led to its recertification for another five
years. In 1997 the National Institute of Standards and Technology (NIST; formerly
NBS), aware of the DES weakness, lying mainly in its short key, announced a contest for
the development of the Advanced Encryption Standard, which is to replace the 20-year-old DES.
1.2 DES security
The unclear design criteria classified by NSA sparked the biggest worldwide
effort to break DES. What were the criteria for choosing S-boxes? Why does DES consist
of exactly 16 rounds? Why does the key have only 56 bits? Those and other questions
exposed DES to cryptanalysis like no other cipher. Despite all attempts to find a
crack, DES's secrets remained uncovered for nearly 15 years. Finally, in 1990, Eli Biham
and Adi Shamir discovered differential cryptanalysis, a new and powerful method of
cryptanalysis [5]. DES appeared to be surprisingly resistant to the new attack [6, 7]. The
attack requires 2^47 chosen plaintexts or 2^55 known plaintexts, with an analytical
complexity of 2^37 operations. The enormous amount of data and time needed to mount
the attack makes it less efficient than the brute-force search for the key. Biham and Shamir came to
interesting conclusions:
• The S-boxes happened to be optimized against differential cryptanalysis,
• Any number of rounds less than 16 makes differential cryptanalysis more efficient
than the brute-force attack in the known-plaintext setting.
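The scale of these requirements can be made concrete with a few lines of arithmetic. The 2^47, 2^55, and 2^37 figures are those quoted above; the conversion to bytes (8 bytes per DES block) is my own illustration.

```python
# Figures quoted above for the best differential attack on full DES,
# compared with exhaustive search of the 56-bit key space.
chosen_plaintexts = 2**47
known_plaintexts = 2**55
analysis_ops = 2**37
avg_bruteforce_trials = 2**56 // 2   # expected trials of exhaustive search

# Each DES block is 8 bytes, so the chosen-plaintext variant alone needs
# 2**50 bytes (one pebibyte) of attacker-chosen plaintext/ciphertext pairs.
data_bytes = chosen_plaintexts * 8

# The known-plaintext variant needs as many *captured* texts (2**55) as
# brute force needs *offline* trials, which is why the attack is the less
# practical of the two.
print(data_bytes == 2**50, avg_bruteforce_trials == known_plaintexts)
```

Collecting a pebibyte of chosen plaintexts from a victim is far harder than performing the same number of offline encryptions, which is the sense in which the attack is "less efficient than brute force."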
Why is DES so resistant to an attack discovered many years after its
development? The answer to this question is even more surprising. The designers of DES
already knew about differential cryptanalysis at design time. After consultation with
NSA, they decided that disclosure of the design considerations might reveal differential
cryptanalysis itself. Although DES was already resistant to the new attack, many other
ciphers already in use appeared to be vulnerable. After the publication of differential
cryptanalysis, IBM finally released the design criteria for the S-boxes and P-box,
showing that no trapdoor had been intentionally installed. Soon researchers in the open
cryptographic community began to appreciate the design principles behind DES.
Is DES still secure today? No attack better than the brute force search has been
discovered, but the main criticism, that the key is too short, remains irrefutable. Ever
since DES was first proposed in the 1970s, it has been criticized for its short key. US
government officials claimed that governments could not decrypt information protected
by DES, or that it would take multimillion-dollar networks of computers and months to
decrypt one message.
In 1997 RSA Laboratories issued a series of challenges in order to demonstrate
that DES offers only marginal protection. The first DES Challenge was launched in
January 1997. The secret key was recovered in 96 days by a team led by Rocke Verser of
Loveland, Colorado. In February 1998, Distributed.Net won DES Challenge II-1 during a
41-day effort. Distributed.Net consolidated tens of thousands of computers connected
through the Internet for this task. In July, the Electronic Frontier Foundation (EFF) won
DES Challenge II-2 by recovering an encrypted message in 56 hours, shattering the
previous record. The answer to the challenge was “It’s time for those 128-, 192-, and
256-bit keys”. The main significance of the new record lies in the fact that a single
machine, specifically designed for cracking DES, achieved what the US government
claimed was impossible.
The design of the EFF DES Cracker consists of an ordinary personal computer
connected to a large array of custom chips. One ASIC chip contains 24 search units,
each capable of checking 2.5 million keys per second. Over 1800 chips were used in the
design, giving a search speed of 90 billion keys per second. The average time to recover
the key is only 4.5 days. It took EFF less than one year to build Deep Crack, and it cost
only $220,000. EFF and O’Reilly and Associates have published a book about the EFF
DES Cracker [EFF+98]. The book contains the complete design details for the Deep
Crack chips, boards and software. EFF thus proved that DES is undoubtedly insecure.
Moreover, it strongly suggests that many of the world’s governments have already built
similar or even more powerful machines.
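The quoted figures can be sanity-checked with simple arithmetic. Note one small inconsistency in the numbers as given: 1800 chips × 24 units × 2.5 million keys/s predicts about 1.08 × 10^11 keys/s, slightly above the quoted aggregate of 90 billion, so the per-unit rate is best read as a peak figure; the timing estimate below uses the quoted aggregate.

```python
# Figures quoted in the text for the EFF DES Cracker ("Deep Crack").
chips = 1800
units_per_chip = 24
keys_per_unit_per_s = 2.5e6
aggregate_keys_per_s = 90e9          # quoted overall search speed

# Per-unit peak: 1800 * 24 * 2.5e6 = 1.08e11 keys/s, above the quoted rate.
peak = chips * units_per_chip * keys_per_unit_per_s

# An average search covers half of the 2**56 key space.
avg_seconds = (2**56 / 2) / aggregate_keys_per_s
avg_days = avg_seconds / 86400
print(f"average search: {avg_days:.1f} days")
```

At the quoted 90 billion keys/s this works out to roughly 4.6 days on average, consistent with the 4.5 days stated above.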
[Photographs: the DES Cracker “Deep Crack” custom microchip, and a DES Cracker
circuit board fitted with Deep Crack chips. The machine tests over 90 billion keys per
second, taking an average of less than 5 days to discover a DES key.]
The final nail was put into the Data Encryption Standard coffin on January 19,
1999. Distributed.Net and the EFF DES Cracker won DES Challenge III in 22 hours and
15 minutes. Over 100,000 computers connected through the Internet and EFF’s machine
were testing 245 billion keys per second when the key was found. The decrypted message
foreshadows a new standard: “See you in Rome (second AES Conference), March 22-23,
1999.”
2. Introduction
2.1 Advanced Encryption Standard
After DES was shown to be vulnerable to the brute force attack, the need for a
new standard became unquestionable. There already exists an ANSI encryption standard,
3DES [3], which offers higher security than DES [16], but it is highly inefficient,
especially in software implementations. DES was primarily designed for hardware
implementations in existing technology. Nevertheless, current demands for higher
bandwidths in both computer and telecommunication networks are becoming difficult to
satisfy by 3DES encryption devices, especially when feedback modes of operation are
being considered. It was shown in [32] that DES implemented in VirtexE-8 FPGA in
non-feedback mode can achieve a throughput of 12 Gbps. This would translate to 4 Gbps
for 3DES. Po Khuon has demonstrated a 3DES implementation in a Virtex-6 FPGA
capable of handling a throughput of 59 Mbps in feedback mode [19] and 7 Gbps in
non-feedback mode for a deeply pipelined design [9]. My recent research led to an
implementation of 3DES in a Virtex-6 FPGA that achieves a throughput of 116 Mbps in
feedback mode. Of
course, ASIC devices can satisfy higher throughput demands. One of the reported
implementations of 3DES in an older 0.6 µm CMOS technology is capable of encrypting
data with a throughput of at least 155 Mbps [21]. Many current computer and
telecommunication networks require higher throughputs in the range of gigabits per
second. I have already participated in the design of hardware accelerators for encryption
algorithms used in a 1 Gbps IPSec implementation [8]. However, the next generation of
10 Gbps LAN networks is being developed, and 10 Gbps encryption speeds will soon be
required. Clearly, the 3DES algorithm can be a serious bottleneck in those applications.
The National Institute of Standards and Technology (NIST) has recognized the
need for a new standard and initiated the process of developing an Advanced Encryption
Standard [2]. NIST's main objective was to develop an algorithm that offers security at
least equal to that of 3DES and is significantly more efficient in software and hardware
implementations on a variety of platforms. The algorithm should be capable of protecting
sensitive government information well into the 21st century.
2.1.1 Requirements and evaluation criteria
NIST published a formal call for candidate algorithms in 1997. The minimum
acceptable capabilities were:
1. The algorithm must implement symmetric (secret) key cryptography.
2. The algorithm must be a block cipher.
3. The candidate algorithm shall be capable of supporting key-block combinations with
sizes of 128-128, 192-128, and 256-128 bits.
In addition to the above list, all submissions had to include:
• A complete written specification of the algorithm, consisting of all necessary
mathematical equations, tables, diagrams, and parameters needed to implement
the algorithm,
• A statement of the algorithm’s estimated computational efficiency in hardware
and software. Submitters were required to at least provide estimates for the
“NIST AES analysis platform” and for 8-bit processors,
• A set of test vectors allowing verification of correctness of all implementations,
• A statement of the expected strength of the algorithm along with any supporting
rationale,
• Analyses of the algorithm with respect to known attacks. All known weak keys,
equivalent keys, complementation properties, restrictions on key selection, and
similar features of the algorithm should also be noted,
• Optimized and reference source code in ANSI C and Java describing the
algorithm,
• Declarations granting full rights to patents covering the algorithm when and if
it should be chosen as a federal standard.
It was a remarkable change in the government’s approach to the security issue.
The previous government standard, DES [16], had been developed in close cooperation
with the National Security Agency (NSA). NSA concealed the design criteria and
justifications, which resulted in a lack of trust in the standard. This time NIST organized
the entire process in the form of a contest. Anybody could submit their own algorithm.
Submitters were obliged to reveal all information about the algorithms and justify all
design decisions. The entire cryptographic community evaluated all algorithms openly.
The organization of the AES selection had several important advantages in that it:
• Focused the effort of the cryptographic community on one task, which was
essential given the small number of specialists in unclassified research,
• Stimulated research on methods of constructing secure ciphers,
• Avoided backdoor theories, and
• Sped up the acceptance of the standard.
All algorithms were evaluated with respect to three categories of criteria:
1. SECURITY - the most important factor in the evaluation
• Actual security offered by the algorithm,
• Extent to which the algorithm output is indistinguishable from a random
permutation of the input block,
• Soundness of the mathematical basis for the algorithm’s security,
• Other security factors, for example attacks demonstrating that the actual
security of the algorithm is less than the strength claimed by the submitter.
2. COST
• Licensing requirements,
• Computational efficiency – speed of the algorithm in hardware and software,
• Memory requirements – in the case of software implementations, code size and
RAM requirements are major factors; in the case of hardware implementations,
gate count is taken into account.
3. ALGORITHM AND IMPLEMENTATION CHARACTERISTICS
• Flexibility – the ability of the algorithm to be implemented on different
platforms for various applications,
• Hardware and software suitability – the algorithm should not be restricted to
hardware or software implementations only,
• Simplicity – simplicity of design and ease of implementation.
2.1.2 Evaluation process
The process of evaluating candidate algorithms has been divided into two rounds.
The first round was intended to focus on the evaluation of algorithms based on the
cryptanalysis performed by the public, as well as on the efficiency of software
implementations on a variety of platforms. The AES contest attracted 15 block cipher
submissions from 12 countries on four continents, as shown in Table 2.1-1. Most of the algorithms
came from outside of the USA, demonstrating the large interest of the broad
cryptographic community in the development of the U.S. government encryption
standard.
Table 2.1-1 Fifteen candidate algorithms.

Continent      Country             Cipher
North America  Canada              CAST-256, Deal
               USA                 Mars, RC6, Twofish, Safer+, HPC
               Costa Rica          Frog
Europe         Germany             Magenta
               Belgium             Rijndael
               France              DFC
               Israel, UK, Norway  Serpent
Asia           Korea               Crypton
               Japan               E2
Australia      Australia           LOKI97
Only five algorithms passed to the second round of the evaluation: Mars [4], RC6
[28], Rijndael [11], Serpent [1], and Twofish [30]. All of the final candidates proved to
be sufficiently secure according to the best knowledge available during their analysis. Of
course, nobody can claim absolute invulnerability of their design to future cryptanalysis
methods. At best, only an estimate based on the current state of the art in cryptanalysis can
be made. One of the ways of assessing the security of symmetric-key ciphers is based on
differential [5] and linear [23] cryptanalysis. For both methods, a minimal number of
rounds can be found which makes the attack less practical than brute-force search. Any
number of rounds greater than this minimum is believed to create a security margin, a type
of assurance by the designers themselves against future attacks. Table 2.1-2 summarizes
the security features of the five final candidates. Obviously, ciphers with greater security
margins pay a price in speed of operation, since their numbers of rounds are greater than
necessary.
Table 2.1-2 Security margins of final AES candidate algorithms.

Algorithm  Number of  Minimum number of rounds  Security  Number of rounds of the best  Security
           rounds     believed to be secure     margin    actual or estimated attack    margin
Mars       32         20                        60%       12                            166%
RC6        20         21                        -5%       16                            25%
Rijndael   10         8                         25%       6                             66%
Serpent    32         17                        88%       15                            113%
Twofish    16         14                        14%       6                             166%
The second round of evaluation focused on further cryptanalysis and on hardware
implementations of each of the finalists. FPGA-based implementations played a great role
in the final evaluation. In this thesis I present my contribution to the selection of the new
cryptographic standard. My results were presented at the Third AES Conference in
New York, April 2000 [19]. I further extended my analyses and presented them at the
FPGA'2001 conference held in Monterey, February 2001 [9], and at the RSA'2001
conference held in San Francisco, April 8-12, 2001 [35]. Those results are also included
in this thesis.
Finally, in October 2000, NIST announced the winner of the contest: Rijndael.
AES was finally accepted as a federal standard on November 26, 2001 [18].
2.2 Need for comparison of hardware implementations
Software implementations of cryptography dominate today's encryption
market. Most users do not require high encryption speeds for their applications.
Encrypting electronic mail or private files usually does not need to be done strictly in
real-time. Each day more users start using computer networks and want to ensure privacy
for their network transactions. Existing Local Area Networks and Metropolitan Area
Networks operate with moderate speeds of 10 and 100 Mbps. These speeds can still be
handled by personal computers. However, new technological breakthroughs in LAN and
MAN networks change the horizon significantly. Gigabit Ethernet already exists, and is
becoming a competitive solution for LANs. In response to market trends, where Gigabit
Ethernet is being deployed over tens of kilometers in private networks, the Ethernet
industry developed a way to not only increase the speed of Ethernet to 10 Gbps, but also
to extend its operating distance. Encrypting data with speeds in the range of gigabits per
second is unachievable for the current and foreseeable generations of personal computers,
and broader use of hardware accelerators becomes inevitable.
The number of cryptographic standards targeting communication networks grows
rapidly, and it seems to be natural that cryptographic services become a standard feature
of new products. Future communication devices will be equipped with cryptographic
14
modules by default. If one looks closer at those devices, most likely we will see a small
hardware chip protecting the privacy of our communication.
With respect to those trends, the comparison of hardware implementations of
candidate algorithms for AES becomes one of the most important selection criteria. My
research indicates that hardware implementations reveal large differences in performance
among candidate algorithms. Furthermore, implementations developed by other groups
confirmed most of my conclusions. In the absence of any major breakthroughs in the
cryptanalysis of the AES candidates, and given the relatively inconclusive results of their
software performance evaluation [24, 30], the comparison of the hardware performance
of the AES algorithms provided a major indicator for the final decision regarding the new
standard.
2.3 Previous work
All AES candidate ciphers are brand new algorithms. Their analysis period was
very short and very little could have been done to analyze their performance in dedicated
hardware. The designers of the submitted algorithms are mostly mathematicians, who
usually have limited knowledge of and experience with hardware design. Their original
documentation contains only rough estimates of the hardware performance [4, 28, 11, 1,
29]. Additionally, these estimates are very difficult to compare with each other,
because of large differences in assumptions regarding the technology, and because of
different architectural choices. By the time we started our research, only two results of
actual implementations of individual algorithms had become available [14, 26]; however,
this was still fragmentary knowledge, not suitable for a reliable comparison.
When starting our research I already had some experience in working with
reconfigurable hardware and implementations of cryptography. I gained this
experience during my senior design project completed at the Warsaw University of
Technology, which focused on implementing a hardware encryption device for hard
drives using the RC5 algorithm [10].
3. Characteristics of hardware implementations
3.1 Hardware vs. software implementations
Cryptography can be implemented in both software and hardware. Usually the
desired speed of encryption/decryption and the cost of the implementation are the major
factors influencing the choice of technology.
Software implementations are designed and coded in programming languages,
such as C, C++, Java, and assembly language, and are developed to run on general-purpose
processors, digital signal processors, and smart cards. Usually, software implementations
are very inexpensive. In most cases, cryptographic transformations match modern
microprocessor architectures very well, and even inexperienced programmers may easily
come up with correct implementations.
General-purpose processors offer enough power to satisfy the needs of individual
users; therefore the majority of the existing implementations of cryptography reside in
software. Hardware implementations are the only way to achieve speeds beyond the
reach of general-purpose microprocessors.
Hardware implementations are designed and coded either in hardware description
languages, such as VHDL and Verilog HDL, or using schematic capture. There exist two
major implementation approaches for hardware designs: Application Specific Integrated
Circuits (ASIC) and Field Programmable Gate Arrays (FPGA).
Application Specific Integrated Circuits are designed all the way from a behavioral
description down to the physical layout. The design process is very time consuming and
requires a lot of manpower. The final layout is sent to a very expensive fabrication
process. Clearly, every design mistake may have a large impact on the length of the
design cycle and its cost. Designers needed some inexpensive means of rapid prototyping.
This idea found its realization in the form of FPGA devices.
Field Programmable Gate Arrays offer unique features. They can be
purchased off-the-shelf and reconfigured to perform different functions. Each
reconfiguration takes only a fraction of a second. An FPGA consists of thousands of
small universal building blocks, known as Configurable Logic Blocks (CLB) [34]. CLBs
are connected using programmable interconnects. Some of the FPGA families contain
dedicated memory blocks. These are called Block SelectRAMs [34]. Figure 3.1-1 shows
the architecture of the Xilinx Virtex FPGA family.
Figure 3.1-1 Structure of the Virtex FPGA.
Although FPGAs were originally invented primarily to support the development of
custom ASICs, they have found applications as target devices in their own right. Due to
their ability to be reconfigured, FPGAs are very sophisticated circuits, and their potential
is rarely fully exploited. Even very simple components, like individual gates, have to be
implemented using an entire CLB, leading to suboptimal utilization of the available
resources. All connections between Configurable Logic Blocks are routed using
configurable switches, which present additional sources of delay. The rule of thumb is that
an ASIC is ten times faster than an equivalent FPGA, provided that both are fabricated in
the same technology. However, in FPGAs, the reconfiguration capability may be
exploited as an essential feature of the design. In cryptography, it is often useful to switch
among encryption algorithms. Changing the FPGA configuration can easily accomplish
this task. Also, correcting mistakes or simply upgrading existing products by adding more
functionality is as easy as upgrading software implementations.
The cost of a design based on the FPGA technology is far lower than for the custom
ASIC technology. Designers themselves can reprogram FPGAs, therefore less manpower
is needed in the development process. For low volumes, FPGA-based products are
more profitable than those based on ASICs. Nevertheless, FPGA implementations are still
more expensive than software implementations.
A very essential parameter of every cryptographic implementation is its level of
security. No secure algorithm helps if an attack on the implementation exists. Although
software implementations happen to be the most common, they provide the lowest level
of security. It is extremely difficult to ensure no leakage of information. For example, one
form of attack on software may use little programs, like viruses, to imperceptibly collect
information and send it to an attacker. Another approach may be based on scanning the
memory freed by a cryptographic program that has finished execution. What if the key just
used by that program has not been wiped out of the memory?
Hardware implementations are much easier to protect. It is relatively easy to
design circuits ensuring that no attack is possible unless there is physical access to the
device. In some situations the existence of such access has to be taken into account. One
of the ways to attack a hardware implementation is to replace the cryptographic chip with
another one, which would perform the same operation as the original chip, but could
additionally leak important information to the attacker, see Figure 3.1-2.
Figure 3.1-2 Example of an attack on the hardware implementation.
FPGA-based designs do not offer much protection against physical
manipulation. The configuration bitstream can be easily read and reverse engineered to
identify all security mechanisms. Therefore, preparing and replacing the bitstream with a
slightly changed version of the circuit is not difficult. Moreover, FPGA devices are
equipped with a Readback function, useful for debugging, which permits reading the
contents of all registers and memories together with the configuration bitstream. This way
sensitive information, e.g., an encryption key, can be easily compromised.
Xilinx has addressed this problem in their latest family of Virtex-II FPGAs [25].
Virtex-II devices have on-chip decryptors that have their keys loaded during board
manufacture in a secure environment. Once the devices have been programmed with the
correct keys, they can be configured with encrypted bitstreams. Xilinx has chosen DES
and Triple DES for encrypting bitstreams. This solution is, however, unique among
FPGAs.
Protecting ASIC-based designs is somewhat easier. The ASIC chip can be
designed to be tamper resistant. This means that the chip should not leak any information
under any form of stressing. Some of the possible attacks may be based on power
consumption analysis or fault introduction. Furthermore, the chip may implement a strong
authentication mechanism, making replacement of the chip a difficult task.
Table 3.1-I Characteristic features of implementations of cryptographic
transformations in ASICs, FPGAs, and software

                       ASICs            FPGAs                 Software (general-purpose
                                                              microprocessors)
Speed                  very fast        fast                  moderately fast
Development process
  Design Cost          very expensive   moderately expensive  inexpensive
  Design Cycle         long             moderately long       short
  Design Tools         very expensive   inexpensive           inexpensive
  Maintenance and
  Upgrades             expensive        inexpensive           inexpensive
Cryptographic issues
  Tamper Resistance    strong           limited               weak
  Key Protection       strong           limited               weak
  Algorithm Agility    no               yes                   yes
Every type of implementation has its advantages and disadvantages. Their basic
features are summarized in Table 3.1-I. Software implementations are an attractive
choice when the speed of encryption is not a main concern and the expected level of
security does not have to be high, i.e., when the importance of the protected information
is low compared to the effort required for breaking the security mechanisms. The more
secure and faster the required solution, the more vital the role played by hardware
implementations. Among hardware solutions, FPGAs are becoming more and more
attractive. If we assume that an attacker has no physical access to the device, then FPGA
designs can be as secure as ASICs.
3.2 Parameters of hardware implementations
Every hardware circuit can be characterized by two major parameters: speed of
operation, and area. Cryptographic algorithms are intended to perform cryptographic
transformations on strings of data. Therefore the speed of cryptographic implementations
is commonly characterized by the throughput. Throughput does not always give full
information about the speed, and is often accompanied by another parameter – latency.
3.2.1 Throughput
Throughput is defined as the number of bits processed in a unit of time after the
process has gone through any initialization, and is usually expressed in Mbps or Gbps.
Typically, the encryption and decryption throughputs are equal. All symmetric-key
algorithms perform a fixed sequence of transformations; in other words, no conditional
operations are performed. Therefore, the time of encryption of one block of data is
usually fixed and known, unless the implementation uses tricks that vary the
time of encryption. From the point of view of cryptographers, any technique
yielding a correlation between data and encryption time is highly undesirable. Such a
correlation leaks information about the data, and can be used to mount timing attacks on
the implementation.
Throughput = Number of bits processed / (Time − Startup time)
Throughput has a very important meaning when considering a bigger system
consisting of multiple modules processing data in sequence, as shown in Figure 3.2-1,
because the throughput of the whole system is limited by the throughput of the slowest
module. Cryptographic transformations usually require the most processing power, and
present a bottleneck in many applications.
Figure 3.2-1 System consisting of multiple modules with throughput
parameters.
Throughput_system = min(Thpt_1, ..., Thpt_n),  where Thpt_i is the throughput of module i
When we talk about throughput we usually mean the maximum throughput of a
circuit, although it may also process streams of lower throughput.
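The bottleneck rule above can be sketched in a few lines of Python; the module throughputs below are hypothetical values chosen for illustration, not measurements from this thesis:

```python
# Throughput of a chain of modules (Figure 3.2-1) is limited by the
# slowest module, per the formula above.
def system_throughput(module_throughputs_mbps):
    """Return the throughput of modules processing data in sequence."""
    return min(module_throughputs_mbps)

# Hypothetical chain: 1000 Mbps network interface, 300 Mbps cipher core,
# 800 Mbps framer -- the cryptographic module is the bottleneck.
print(system_throughput([1000, 300, 800]))  # -> 300
```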
3.2.2 Latency
Latency is defined as the time required to complete processing of one block of
data, and is usually expressed in number of clock cycles. This is the time between a
moment when a block of data enters the encryption unit, and a moment when it leaves it.
Throughput and latency describe different features of systems. The total latency of a
system is the sum of the latencies of all modules processing data sequentially. Therefore,
all modules, no matter how different from each other, contribute to the overall latency.
Figure 3.2-2 System consisting of multiple modules with latency
parameters.
Latency_system = Ltncy_1 + Ltncy_2 + ... + Ltncy_n,  where Ltncy_i is the latency of module i
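The summation above can be illustrated with a short Python sketch; the module latencies are hypothetical values expressed in clock cycles:

```python
# Total latency of modules processing data sequentially (Figure 3.2-2)
# is the sum of the individual module latencies.
def system_latency(module_latencies_cycles):
    return sum(module_latencies_cycles)

# Hypothetical chain: input buffer (2 cycles), cipher core (10 cycles),
# output buffer (2 cycles).
print(system_latency([2, 10, 2]))  # -> 14
```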
Latency does not always stay at a fixed level, even if the throughput does. In many
communication applications data come in bursts, i.e. packets, and need to be buffered for
further processing. First-In-First-Out buffers are commonly used for data buffering, as
shown in Figure 3.2-3. Data blocks arriving at a nearly full FIFO have to wait for
processing until preceding blocks are completed. Therefore, their latency is significantly
larger than the latency of blocks arriving at an empty FIFO.
Figure 3.2-3 Circuit with FIFO buffers.
It makes more sense to talk about worst-case latency, or average latency under given
assumptions.
In this work I focused on implementing the cryptographic algorithms
themselves. I did not make use of any FIFO buffers, and all circuits presented here have
fixed latency.
3.2.3 Area
Area describes the “size” of the circuit. There exist different ways of expressing
this size depending on technology.
In the ASIC technology, area is expressed in terms of the size of a die [µm²], or in
terms of the number of transistors or logic gates. Both measurements correspond closely
to the cost of the design. Die cost is typically proportional to the fourth or higher power
of the die area [20]:

Cost of die = f((Die area)^4)
This means that doubling the size of a die would increase its cost sixteen times.
There are also other costs associated with the production of ASICs. Namely, the cost of
testing and the cost of packaging. The cost of testing is proportional to the complexity of
the circuit, and can also be a function of the circuit area.
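The fourth-power rule quoted above can be illustrated with a minimal sketch; the exponent is the rule-of-thumb value cited from [20], and actual exponents vary by process:

```python
# Relative die cost under the rule of thumb: cost ~ (die area)^4.
def relative_die_cost(area_ratio, exponent=4):
    """Cost ratio of a die whose area is scaled by area_ratio."""
    return area_ratio ** exponent

# Doubling the die area multiplies the cost by 2^4 = 16.
print(relative_die_cost(2))  # -> 16
```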
In the FPGA technology, cost analysis is easier. The size of the circuit is expressed
in terms of the number of configurable logic blocks. Therefore, given a logic block
count, it is easy to estimate the cost of the design by simply comparing prices of devices
into which it would fit. Very often FPGA manufacturers give the number of logic gates
equivalent to the entire FPGA circuitry. For example, the Virtex 1000 FPGA is equivalent
to 1,124,022 logic gates [34]. It is tempting to use this number to find the size of an
equivalent ASIC circuit based on the FPGA utilization. However, practice shows that
such estimates are highly inaccurate. Some of the reasons for this inaccuracy include
the following situations:
1) One lookup table (LUT) can realize a wide variety of combinational functions of
different complexity. It may represent only one logic gate, or a quite complex circuit
consisting of many gates, while still being counted as one LUT, as shown in Figure
3.2-4.
2) One CLB Slice consists of two LUTs. It happens quite often that only one of them is
utilized, as shown in Figure 3.2-5.
Figure 3.2-4 Variety of functions possible to implement using one lookup
table (LUT). a) single logic gate, b) complex combinational logic.
Figure 3.2-5 Example of LUT utilization. a) two functions occupy one
CLB Slice, b) two functions occupy two CLB Slices.
The number of CLBs is usually the main, and often the only, parameter reported by
designers. For the reasons listed above, it does not give full information about the area. It
should be accompanied by the specific numbers of lookup tables, flip-flops, and memory
elements used in the design.
Some FPGA devices are equipped with dedicated RAM blocks. There is no
reliable way to translate these into an equivalent number of CLBs or logic gates, which
complicates the comparison of implementations in different technologies even more.
3.3 Design tradeoffs
Designers of hardware implementations have much more flexibility in choosing
the way of developing their implementation than designers of software. In some
applications, the maximization of speed may be an ultimate goal. In these cases, cutting
edge technology plus sophisticated design techniques may be applied. In other
applications the speed of operation may not be very demanding, and designers may be
more concerned with the area and cost constraints. Every digital circuit can be
implemented differently, keeping in mind specific requirements. In this section, I present
a few basic techniques demonstrating the tradeoffs between area, speed, and latency.
3.3.1 Increasing the throughput
One way to increase the speed of the circuit is by increasing the speed of
particular operations. This approach can be used in any situation, and may be achieved by
using sophisticated design techniques like fast multipliers or aggressive logic
decomposition. Another way to increase speed is through exploration of parallelism
existing on different levels. Parallelism can be found in the encryption algorithm, when
certain transformations can be performed simultaneously. It can also be exploited outside
the algorithm if many independent blocks of data can be processed simultaneously. I will
discuss two basic techniques often used when the parallelism can be exploited outside the
encryption algorithm.
Using multiple independent processing units
The first trivial technique is to use many identical processing units working on
independent blocks of data. A single processing unit can represent a complete
encryption/decryption circuit. This unit does not have to be very fast. High speed is
obtained simply by using a large number of processing units.
Figure 3.3-1 Parallel processing units – string of data split among units.
The speedup obtained using this technique is directly proportional to the number
of processing units:
Throughput_parallel units = N × Throughput_one unit
The latency stays the same, as we do not change anything in the structure of a
single unit. There is, however, a price for this speedup: the area of the circuit grows
proportionally to the number of processing units, significantly increasing the cost of the
implementation. A good example of a design consisting of many identical units is the DES
Cracker built by the Electronic Frontier Foundation, as described in Chapter 1.
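A minimal sketch of the speedup formula above; the unit count and unit throughput are hypothetical values for illustration:

```python
# N identical, independent processing units multiply the throughput by N;
# the latency of a single block is unchanged.
def parallel_throughput(n_units, unit_throughput_mbps):
    return n_units * unit_throughput_mbps

# Hypothetical example: 8 encryption units at 50 Mbps each.
print(parallel_throughput(8, 50))  # -> 400
```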
Pipelining
Pipelining is a more sophisticated method of achieving higher speeds than
increasing the number of processing units. Pipelining is applied inside the processing
unit by dividing it into several stages that execute in sequence. Let us consider a
combinational circuit as shown in Figure 3.3-2 a). The circuit can process only one block
of data at a time, and each processing takes time T. The following parameters
characterize our circuit:
Latency_combinational = T

Throughput_combinational = Size of block of data / T
Figure 3.3-2 Principles of pipelined implementation. a) original
combinational logic, b) pipelined version of the same logic.
We can divide this circuit into n stages, as shown in Figure 3.3-2 b). The modified
circuit is said to be pipelined, and can process n blocks of data simultaneously, each in a
different phase.
The pipelined circuit will have the following parameters:

Latency_pipelined = T

Throughput_pipelined = (N × Size of block) / ((T/n) × (N + n − 1))
where: N – number of blocks of data to be processed
n – number of pipeline stages
If we have a large number of blocks of data, then the throughput is approximately n
times greater than for the non-pipelined circuit:

lim (N→∞) Throughput_pipelined = n × Size of block / T
From now on, I will assume in my analyses that we have a large number of blocks
to process: N >> n.
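The ideal pipelined throughput formula above can be checked numerically; the block size, total delay T, stage count, and block count below are hypothetical values:

```python
# Ideal pipelined throughput, per the formulas above: n stages of T/n each,
# so processing N blocks takes (N + n - 1) stage times.
def pipelined_throughput(block_bits, t_ns, n_stages, n_blocks):
    stage_time = t_ns / n_stages
    total_time = stage_time * (n_blocks + n_stages - 1)
    return n_blocks * block_bits / total_time   # bits per ns

# Hypothetical numbers: 128-bit blocks, T = 100 ns, 10 stages.
# With many blocks, the throughput approaches n * block_bits / T = 12.8 bits/ns.
print(pipelined_throughput(128, 100, 10, 1_000_000))
```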
Unfortunately, a speedup of n times is impossible to achieve in real
implementations, because each introduced register contributes an additional delay,
characterized by its propagation time τ.
Figure 3.3-3 Pipeline with delay of registers taken into account.
If we take it into account, the latency and throughput will be expressed by:

Latency_balanced pipeline = T + n·τ

Throughput_balanced pipeline = (n × Size of block) / (T + n·τ)
We can now observe that throughput is not linearly proportional to the number of
pipeline stages. For a small number of stages, T >> n·τ, and the throughput will be nearly
n times greater. However, it grows more and more slowly as we keep introducing
additional stages.
There exists another issue limiting the efficiency of a pipeline. It is very difficult to
create a perfectly balanced pipeline. Usually the stages are not equal, and the stage
requiring the longest time for computations limits the maximum clock frequency of the
entire circuit.
Figure 3.3-4 Unbalanced pipeline.
We can express the imbalance as an additional delay ΔT contributed by the
computations in the longest stage. The latency and throughput parameters become:

Latency_imbalanced pipeline = T + n·(ΔT + τ)

Throughput_imbalanced pipeline = (n × Size of block) / (T + n·(ΔT + τ))
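The three cases can be compared numerically with a small model based on the formulas above; the block size, T, register delay τ, and imbalance ΔT are hypothetical values, and the model matches the qualitative behavior of curves A, B, and C in Figure 3.3-7:

```python
# Throughput as a function of pipeline depth n:
# ideal()      -- n-times speedup, no register delay (curve A),
# balanced()   -- includes register propagation time tau (curve B),
# imbalanced() -- additionally includes per-stage imbalance dT (curve C).
def ideal(n, block, T):
    return n * block / T

def balanced(n, block, T, tau):
    return n * block / (T + n * tau)

def imbalanced(n, block, T, tau, dT):
    return n * block / (T + n * (dT + tau))

# Hypothetical values: 128-bit block, T = 100 ns, tau = 1 ns, dT = 2 ns.
block, T, tau, dT = 128, 100.0, 1.0, 2.0
for n in (1, 5, 20):
    print(n, ideal(n, block, T), balanced(n, block, T, tau),
          imbalanced(n, block, T, tau, dT))
```

For growing n, the balanced and imbalanced curves fall further below the ideal n-times line, which is exactly the saturation effect discussed in the text.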
We can list some possible sources of imbalance:
- The combinational circuit may consist of a few basic operations, which execute in
different amounts of time. To create a well-balanced pipeline, the designer may need to
introduce registers somewhere inside one of the operations. This may require
significant design effort.
Figure 3.3-5 Pipelining of a circuit consisting of unequal operations. a)
original combinational circuit, b) pipelined circuit – operation 2
determines the clock frequency.
- The synthesis tool may synthesize the circuit differently than the designer
predicts, and unnecessary delay can be introduced into the circuit.
Figure 3.3-6 Example of unnecessarily placed register. a) properly
pipelined circuit, b) improperly pipelined circuit – requires more area and
has larger latency.
- The designer of an FPGA-based design usually has very little influence on the
placement of components and the routing of nets. Design tools support manual
floorplanning, but it again requires a lot of work. In my designs I relied completely
on automatic placement and routing. In all of my designs, routing contributed
60-90% of the total delay in the critical path.
In the case of FPGAs, one straightforward approach to pipelining the design is to
limit the number of CLB levels per stage. This way, we can control the delay
through the logic part, but still have no control over the routing part. The deeper the
pipeline we want to design, the more difficult balancing becomes. Even nets with high
fanouts can dramatically change the overall performance.
To summarize, I expect that the imbalance factor ΔT should be treated as a
function of the number of pipeline stages, increasing as the number of stages
increases. I believe that the ΔT factor can easily exceed 2T in very deep pipeline
designs when the designer does not do any floorplanning, as is the case in my designs.
Figure 3.3-7 shows the typical relationship between throughput and the number of
pipeline stages. Curve B shows a perfectly balanced pipeline, and curve C
assumes some imbalance.
Figure 3.3-7 Throughput in the pipelined implementations. A – ideal n
times speedup. B – ideally balanced pipeline. C – unbalanced pipeline.
40
From Figure 3.3-7 we can easily conclude that the common assumption of
throughput being proportional to the number of pipeline stages is true only for small
number of stages. In all of my fully pipelined implementations, the number of stages
ranges from 81 to 588, resulting in much smaller than n times speedup.
Another important question is how the pipelining affects requirements for area. In
the case of ASIC designs every additional flip-flop or latch introduced to the circuit
requires additional resources. However, in FPGA technology pipelining is usually much
less expensive. Every lookup table (LUT) has a D flip-flop associated with it, and it is
frequently sufficient to just use these “free” flip-flops to implement a pipeline.
Unfortunately, this is not always sufficient. The circuit structure may not be balanced, meaning that different computations, requiring logic of different complexity, must be performed on different pieces of data. Any Feistel-network cipher is a classical example of a structure demanding a lot of overhead when pipelined – see Figure 3.3-8. In this example, pipelining has to be introduced in the two data paths in addition to pipelining the F-function. Even if pipelining the F-function does not require any additional area, pipelining those two data paths may significantly increase the area requirements.
Figure 3.3-8 Pipelining of Feistel-network cipher. a) combinational
circuit, b) pipelined circuit – additional registers required in two data
paths.
Additionally, some arithmetic operations used in a circuit may require more area when pipelined. One example of such an operation is the array multiplier presented in Figure 3.3-9.
I have presented two main ways of increasing throughput, one based on using multiple processing units, and the other based on pipelining. Both methods have their advantages and disadvantages. I summarize their basic characteristics in Table 3.3-I. By combining these two methods one can achieve very high throughputs.
Figure 3.3-9 Example of an array multiplier as a circuit requiring
additional area for registers when pipelined. a) combinational circuit, b)
pipelined circuit (additional registers are required to pipeline arguments
input to the array, however they are not shown in the schematic).
Table 3.3-I Features of methods exploiting parallel computations

                       Multiple processing units       Pipelining
Complexity of design   simple                          may be difficult, especially for
                                                       deep pipelines
Speedup                proportional to the number      for a small number of stages,
                       of processing units             proportional to the number of
                                                       stages; for a large number of
                                                       stages the speedup gain drops
Area requirements      proportional to the number      in balanced designs may be very
                       of processing units             small; in unbalanced designs may
                                                       cause a significant area increase
3.3.2 Decreasing the area
When area is our main concern, it can always be decreased by sacrificing circuit performance. Of course, it may happen that the circuit is not properly minimized, and its minimization may decrease the area and increase the speed at the same time. It was shown in [27] that commercial tools do not perform logic decomposition very well. However, designers usually do not attempt any logic optimization by themselves and rely entirely on synthesis tools.
One way to decrease the area is to use simple units for specific operations, for example, ripple-carry adders instead of carry-lookahead or carry-skip adders. Sometimes we can replace one large and complex combinational circuit with a small sequential unit. For example, multiplication can be performed in multiple clock cycles using only one adder, instead of in a single clock cycle using an irregular combinational structure like a Wallace tree.
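The sequential alternative can be sketched behaviorally (Python standing in for HDL; each loop iteration models one clock cycle of a shift-and-add multiplier that reuses a single adder):

```python
def multiply_sequential(a: int, b: int, width: int = 8) -> int:
    """Shift-and-add multiplication: one small adder reused over `width`
    clock cycles, trading speed for area versus a combinational array."""
    acc = 0
    for cycle in range(width):      # one iteration models one clock cycle
        if (b >> cycle) & 1:        # inspect one multiplier bit per cycle
            acc += a << cycle       # the single shared adder at work
    return acc
```

The result matches the combinational multiplier after `width` cycles instead of one, but only one adder is ever instantiated.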
Another way to reduce the area is to use resource sharing. It may happen that the circuit consists of multiple units performing the same operations. Fortunately, in the case of ciphers the same types of operations are very often applied to different parts of the data: multiplications, rotations, or S-boxes. We then need to implement only one instance of such a unit and perform all the computations in different clock cycles.
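A minimal sketch of the idea, using a hypothetical 4-bit S-box (the table below is illustrative only, not taken from any cipher):

```python
# Hypothetical 4-bit S-box, for illustration only.
SBOX = [0x6, 0x4, 0xC, 0x5, 0x0, 0x7, 0x2, 0xE,
        0x1, 0xF, 0x3, 0xD, 0x8, 0xA, 0x9, 0xB]

def substitute_shared(nibbles):
    """One shared S-box unit applied to each data slice in consecutive
    'clock cycles' (each loop iteration models one cycle)."""
    out = []
    for nibble in nibbles:
        out.append(SBOX[nibble])
    return out

def substitute_parallel(nibbles):
    """N parallel S-box instances: same result in one 'cycle', N times the area."""
    return [SBOX[n] for n in nibbles]
```

Both variants compute identical results; the shared unit simply spreads the work over more clock cycles in exchange for instantiating the S-box once.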
Figure 3.3-10 Resource sharing. a) two identical operations performed on distinct data, b) one shared circuit performs the same operation over a greater number of clock cycles.
In the case of very aggressive area requirements, one can try to design a smaller number of universal units capable of performing different types of operations: addition, multiplication, rotation, XOR, and so on. In this approach the circuit structure may become similar to that of a microprocessor.
4. Hardware architectures for symmetric-key block ciphers
4.1 Main characteristics of block ciphers
Symmetric-key block ciphers form a very specific class of algorithms. Most of them have a very similar, well-studied structure and work in standardized modes of operation. The goal of this research is to make a fair comparison of hardware implementations of all of the AES finalists. Therefore, I decided to implement each of the considered algorithms in the same way and exploit all their commonalities. In this chapter I present the key features common to all five AES candidates that had the largest impact on my hardware design decisions.
4.1.1 Structure of a symmetric-key block cipher
Most symmetric-key block ciphers have a similar round-oriented structure. All five final AES candidate algorithms share this feature. Figure 4.1-1 shows the general flow of the encryption/decryption process.
Usually all rounds within the cipher contain identical operations and permit iterative execution. Among the candidates, Mars and Serpent differ slightly. Mars employs two entirely different types of rounds, which even serve different purposes.
Serpent has all rounds very similar in structure, yet it uses S-boxes with slightly different contents in consecutive rounds.
Figure 4.1-1 Flow diagram of a typical symmetric-key block cipher: an initial transformation with round key[0], the cipher round iterated for i = 1 to #rounds with round key[i], and a final transformation with round key[#rounds+1].
Despite the similarity of the internal rounds, it is quite common for the first or last round to be slightly different. These differences, however, do not impose significant constraints on the design.
All operations within a round are fixed arithmetic or logical transformations, such as substitution, addition and multiplication modulo, and operations on polynomials in Galois fields. There are no conditional statements in the algorithm. This makes implementations very straightforward.
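The round-oriented flow can be sketched in software (a behavioral model, not HDL; the transformation names are placeholders for cipher-specific operations, not any particular AES candidate):

```python
def encrypt_block(block, round_keys, n_rounds,
                  initial_transform, cipher_round, final_transform):
    """Generic flow of a round-oriented block cipher (cf. Figure 4.1-1).
    All transformations are fixed, data-independent operations, so the
    control flow never depends on the data being processed."""
    state = initial_transform(block, round_keys[0])
    for i in range(1, n_rounds + 1):            # iterate the single round
        state = cipher_round(state, round_keys[i])
    return final_transform(state, round_keys[n_rounds + 1])
```

In hardware, the loop body corresponds to the one physically implemented round that the data circulates through.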
For some ciphers the algorithm used for encryption is identical to that used for decryption. In particular, this is true for Feistel-network ciphers like DES. Twofish, RC6 and Mars have a similar structure; however, none of them permits performing encryption and decryption in exactly the same way, although Twofish can be expressed in a slightly different way to exploit this feature. The remaining ciphers, Serpent and Rijndael, employ inverse operations for decryption which have very little in common with their encryption counterparts.
4.1.2 Key schedule
Generally, every round accepts at least one round key. Round keys are computed by an accompanying algorithm called the key schedule. The key schedule is also an iterative algorithm, which expands the main key into round keys. The more rounds the algorithm requires, the more round keys need to be supplied by the key schedule. Performing decryption requires applying the round keys to the rounds in reverse order. Some key schedules permit computing round keys in any order, as in the case of DES and Twofish. This feature creates a perfect opportunity for computing keys “on the fly” and may therefore significantly influence design strategies for hardware implementations. Other ciphers have key schedules that compute keys only in the forward direction. Usually, this restriction forces pre-computing all round keys and storing them in some memory before use.
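A toy forward-only key schedule illustrates why decryption forces pre-computation (the rotation and the constant 0x9E37 are arbitrary choices of mine, not from any cipher):

```python
def expand_key(main_key: int, n_round_keys: int):
    """Toy forward-only key schedule over 16-bit words: each round key
    depends on the previous one, so keys cannot be produced in reverse
    order on the fly -- decryption must read a precomputed list backwards."""
    keys = [main_key & 0xFFFF]
    for _ in range(n_round_keys - 1):
        prev = keys[-1]
        # rotate left by 3 and mix with an arbitrary constant
        keys.append(((prev << 3) | (prev >> 13)) & 0xFFFF ^ 0x9E37)
    return keys

encryption_keys = expand_key(0x1234, 16)
decryption_keys = list(reversed(encryption_keys))  # stored, then read backwards
```

A key schedule that could also step backwards (as in DES or Twofish) would remove the storage requirement entirely.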
The key schedule itself represents a completely independent algorithm and can be
considered as a design independent from the encryption unit. The key schedule is also
constrained by slightly different criteria than the encryption circuit. If a large bulk of data
is going to be encrypted, then it is not necessary to change keys frequently, and the key
schedule unit does not have to be fast even when the encryption speed is high. On the
other hand, some applications require fast key changes. For example, encrypting ATM
cells with different keys requires changing keys every 56 bytes. For those applications the
design of key schedule may be even more challenging than the design of the encryption
core.
Key schedule units of all final AES candidates are certainly worth looking at, and differences found among them could be very influential in the selection process. Nevertheless, we have not conducted any research that could yield a comparison of key schedules. In this thesis I have focused entirely on implementing encryption and decryption circuits under different assumptions and constraints.
4.1.3 Modes of operation
Symmetric-key block ciphers are used in several operating modes. Currently, four modes have been standardized for use with DES [17]: Electronic CodeBook (ECB), Cipher Block Chaining (CBC), Cipher FeedBack (CFB), and Output FeedBack (OFB). Together with the Counter (CTR) mode, they can be classified into two categories:
• non-feedback modes: ECB, CTR
• feedback modes: CBC, CFB, OFB.
Figure 4.1-2 shows examples of non-feedback and feedback modes, where consecutive blocks of plaintext (P) are transformed into blocks of ciphertext (C). In the case of a feedback mode, an initialization vector (IV) is supplied.
Figure 4.1-2 Example of feedback and non-feedback modes of operation.
a) ECB mode encryption, b) ECB mode decryption, c) CBC mode
encryption, d) CBC mode decryption.
ECB mode has the very nice property of treating all blocks of ciphertext independently. This is a very valuable feature from the point of view of hardware implementation, because all techniques exploiting parallelism, as discussed in chapter 3, can be applied to speed up computations. Unfortunately, ECB mode is rarely used in practice. The main reason is that it does not hide data patterns occurring in the plaintext.
The feedback modes of operation offer better security and are used more often, but there is a price for it. All feedback modes imply a strong data dependency between consecutive blocks of data. Computation of a ciphertext block may start only when the previous ciphertext block has already been computed, as in CBC mode – Figure 4.1-2c. This restriction has a significant impact on the hardware implementation because no parallelism can be exploited within a single stream of data. All techniques for speeding up the hardware implementation which use parallel processing, such as pipelining, can be employed only if several independent streams of data are available. Although the overall throughput of the implementation can be significantly improved, the throughput of each independent stream remains limited.
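The serial dependency is visible in a minimal CBC model (small integers stand in for 128-bit blocks, and `encrypt` is a placeholder for the block cipher, not a real one):

```python
def cbc_encrypt(blocks, iv, encrypt):
    """CBC encryption: each ciphertext block is fed into the next
    computation, so the blocks of one stream cannot be processed in
    parallel -- the loop body must finish before the next can start."""
    ciphertext = []
    feedback = iv
    for p in blocks:
        feedback = encrypt(p ^ feedback)   # waits on the previous result
        ciphertext.append(feedback)
    return ciphertext
```

A pipeline can only be kept busy by interleaving several such independent streams, each with its own IV (and possibly its own key).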
The limitations imposed by feedback modes have already been recognized, and
other non-feedback modes have been proposed. One of the well-studied modes is counter
mode – Figure 4.1-3.
Figure 4.1-3 Counter mode. a) encryption, b) decryption.
Counter mode works similarly to one-time-pad ciphers. First, a pseudo-random sequence is generated based on the key and IV values. Next, the plaintext is simply XOR-ed with this sequence. Decryption is performed by XOR-ing the ciphertext with the same pseudo-random sequence. Therefore, counter mode requires implementing only the encryption transformation of the underlying block cipher.
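A minimal sketch of the counter mode keystream construction (again with integer "blocks" and a placeholder `encrypt`):

```python
def ctr_crypt(blocks, iv, encrypt):
    """Counter mode: encrypt(IV + i) forms a keystream XOR-ed with the data.
    Encryption and decryption are the same operation, and every block is
    independent, so all of them could be processed in parallel."""
    return [b ^ encrypt(iv + i) for i, b in enumerate(blocks)]
```

Applying the function twice with the same key and IV recovers the plaintext, which is why only the forward cipher transformation is needed.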
Counter mode has been proven to be as secure as the feedback modes, provided that the same key and initialization vector pair is never used to encrypt two different messages. This concern was probably one of the primary reasons why it was not standardized earlier.
NIST foresaw the need for new modes of operation soon after selecting the winner of the AES contest. It is likely that the previously standardized modes, together with counter mode, will be temporarily accepted for use with AES. However, NIST has already initiated a new public effort to develop new modes of operation intended for AES. One of the promising modes submitted to NIST is the Offset CodeBook (OCB) mode [22]. OCB offers not only high security and parallelism, but also an authentication service which is not present in any of the currently standardized modes of operation.
4.2 Basic iterative architecture
Since most block ciphers have a round-oriented design, it is sufficient to implement only one round and circulate the data through the same logic several times. The most straightforward implementation of a typical symmetric-key block cipher is shown in Figure 4.2-1. A single round is implemented as combinational logic and is supplemented with a register and a multiplexer. The register is required to hold the intermediate results of computations between consecutive clock cycles. It can be positioned anywhere within the round circuit, depending on particular design constraints. The purpose of the multiplexer is to feed data back to the circuit or to fetch a new block of data.
Figure 4.2-1 Basic iterative architecture
Usually, the combinational circuit representing one cipher round is capable of performing either encryption or decryption, since most modes of operation require both transformations. Obviously, in the basic iterative architecture only one block of data is transformed in the circuit at a time, with only the encryption or the decryption operation activated. Sometimes it can be justified to implement encryption or decryption only. Such an implementation may require two separate chips, or one FPGA chip that can be entirely or partially reconfigured to switch between operations. However, from the point of view of most applications, the time of reconfiguration implies unacceptably large overhead and is rarely justified in practice.
The basic iterative architecture permits encrypting only one block of data at a time and is therefore suitable for any mode of operation. Since only one round is physically implemented in the circuit, transforming one block of data most likely takes the same number of clock cycles as the number of cipher rounds. However, some block ciphers require more round keys than the number of rounds, for example additional whitening keys. If the key schedule computes only one key per clock cycle, or all keys are stored in one memory, then the number of clock cycles needed to encrypt one block of data will be equal to the number of round keys. Obviously, each cipher has its own characteristic features, and the number of clock cycles required to process one block of data may deviate slightly from the number of cipher rounds.
The critical path in the circuit determines the minimum clock period. If our goal is to maximize the throughput, the critical path should appear in the feedback between the output and input of the intermediate register, as shown in Figure 4.2-2. Otherwise, the performance of the circuit would be unnecessarily limited by other logic.
Figure 4.2-2 Critical path in the basic iterative architecture.
The minimum clock period can be computed as the sum of the following factors:
• propagation time through the multiplexer, Tmux,
• propagation and setup time of the register, τ, and
• propagation time through the round circuit, Tround:

clock period_basic = Tmux + Tround + τ

Latency and throughput are expressed as follows:

latency_basic = clock period_basic × #clock cycles_basic

throughput_basic = block size / (clock period_basic × #clock cycles_basic)
Different ciphers have different numbers of rounds and different round sizes. For example, Serpent has many simple rounds, while Rijndael has fewer, but more complex, rounds. Therefore, the influence of the multiplexer and register on the overall performance of the circuit is not equal for the various ciphers implemented in the basic iterative architecture. Typically, however, the propagation times through the multiplexer and register are much smaller than the propagation time through the round circuit, and can be safely neglected.
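The formulas above translate directly into a small timing model (the parameter values in the test are illustrative, not from any implementation):

```python
def basic_iterative_performance(t_mux, t_round, t_reg, n_cycles, block_size):
    """Timing model of the basic iterative architecture.
    Times in ns, block_size in bits.
    Returns (clock period [ns], latency [ns], throughput [Mbit/s])."""
    clock_period = t_mux + t_round + t_reg        # Tmux + Tround + tau
    latency = clock_period * n_cycles
    throughput = block_size / latency * 1000      # ns per block -> Mbit/s
    return clock_period, latency, throughput
```

For a hypothetical 128-bit cipher with ten rounds, Tround = 20 ns, and 1 ns each for the multiplexer and register, this gives a 22 ns period and roughly 582 Mbit/s.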
4.3 Loop unrolling
Loop unrolling is the simplest extension of the basic iterative architecture, as shown in Figure 4.3-1. The idea is to implement more than one round as a combinational circuit. Typically, the number of unrolled rounds is a divisor of the total number of cipher rounds, and I assume this is the case in further analyses. In the extreme case all rounds can be unrolled, as shown in Figure 4.3-1b; this eliminates the need for the multiplexer and feedback.
Figure 4.3-1 Loop unrolling. a) partial unrolling, b) full unrolling.
As in the basic iterative architecture, there can be only one block of data
processed at a time and, therefore, loop unrolling is equally well suited for feedback and
non-feedback modes of operation.
The number of clock cycles necessary to encrypt a single block of data decreases proportionally to the number of unrolled rounds. At the same time, the minimum clock period increases, but possibly by a factor slightly smaller than the number of unrolled rounds. There are two major reasons for this:
• the input multiplexer and register have a smaller influence on the overall circuit performance, and
• unrolled rounds may permit additional logic optimizations.
The minimum clock period can be expressed as follows:

clock period_unrolled = Tmux + T_unrolled rounds + τ
Let k be the number of unrolled rounds. We can assume that the propagation time through the unrolled logic is at most k times larger than for one round. However, some optimizations may occur in the unrolled circuit that decrease the delay. It may happen that operations executed at the end of one round and the beginning of the next can be implemented as one circuit with a smaller delay, as shown in Figure 4.3-2.
It may be possible to further exploit the timing characteristics of the round circuit. Figure 4.3-3 shows a hypothetical round structure with two operations, f1 and f2. One unrolling is sufficient to notice that the functions f1 in both rounds can be evaluated simultaneously.
Figure 4.3-2 Optimization of logic across rounds. a) single round, b) two
rounds unrolled.
Figure 4.3-3 Simultaneous evaluation of functions in unrolled rounds. a)
single round, b) two rounds unrolled.
This potential speedup comes at a cost in circuit area. The area grows very quickly with each unrolled round, because not only the round logic but also the key schedule logic needs to be expanded. The more complex circuit consumes additional logic and routing resources. At the current scale of integration of CMOS circuits, the delays introduced by interconnections between logic elements play a crucial role in the overall performance. According to Xilinx, for small designs a 50/50 ratio between logic and routing delays should be anticipated; the bigger the design, the bigger the routing share, 40/60 or even 30/70. The increased delay through interconnects may easily cancel the speedup gained in the logic resources. For this reason it is generally not advisable to use loop unrolling in FPGA devices, unless the designer takes great care with placement and routing. Full loop unrolling eliminates the need for the multiplexer and feedback, and therefore presents a perfect object for placement and routing. Loop unrolling can prove beneficial especially in ASIC devices, as demonstrated in [31, 36].
The latency and throughput can be expressed similarly as for the basic architecture:

latency_unrolled = clock period_unrolled × #clock cycles_unrolled

throughput_unrolled = block size / (clock period_unrolled × #clock cycles_unrolled)

If the design is well placed and routed, the following holds:

clock period_unrolled < k × clock period_basic

#clock cycles_unrolled = #clock cycles_basic / k

throughput_unrolled > throughput_basic
As a result, the latency and throughput parameters are slightly better for the unrolled circuit, but this comes at a large price in circuit size. Figure 4.3-4 shows the throughput to area ratio of the unrolled circuit with respect to the basic iterative architecture.
Figure 4.3-4 Throughput vs. area ratio for unrolled architectures.
It becomes clear that loop unrolling can be justified only when the design is not constrained by area requirements. In practice, loop unrolling is too expensive a way of speeding up the circuit; therefore, I did not attempt to implement any of the ciphers in this architecture.
4.4 Outer round pipelining
Pipelining of digital circuits is not always an easy task. A proper pipeline should have approximately equal stages. As mentioned in section 4.1.1, existing symmetric-key ciphers have round-oriented architectures. All rounds are alike and have similar, if not identical, complexities. This feature makes them a natural choice for pipeline stages. Similarly to loop unrolling, one can implement a few rounds and introduce pipeline registers between them. The natural choice for register placement is directly between the rounds; however, registers may be placed inside each round, provided the placement is consistent in every round.
The number of pipeline stages K is usually a divisor of the total number of cipher rounds. When area constraints permit, all rounds can be implemented, which eliminates the need for feedback, as in full loop unrolling – Figure 4.4-1.
Figure 4.4-1 Outer round pipelining. a) partial unrolling, b) full unrolling.
Outer round pipelining is straightforward to apply. Since in general all rounds are identical, it is sufficient to design only one stage and reuse it. This guarantees a very well balanced pipeline from the point of view of logic resources. Routing resources are more difficult to control. The designer may, however, prepare a macro with one round fully placed and routed. This macro can be instantiated several times in the final circuit, and with careful placement the routing delays should be very similar in all stages.
The pipelined circuit is capable of processing K blocks of data simultaneously. All blocks in the pipeline are processed independently, and no dependencies among them are allowed. This limitation sets constraints on processing data in feedback modes. The pipeline can be fully utilized in feedback modes only when K independent streams of data are available. This also means that the key schedule has to supply K different keys to the encryption unit. If these conditions are not met, pipelining of any kind gives no advantage over the basic iterative architecture for feedback modes.
The number of clock cycles necessary to process one block of data is the same as for the basic iterative architecture; however, since many blocks can be processed simultaneously, the average number of clock cycles per block is approximately K times smaller.
The minimal clock period should, in general, remain very similar to the minimal
clock period of a basic iterative architecture:
clock period_outer = Tmux + Tround + τ ≈ clock period_basic
Of course, the more pipeline stages there are, the more complex the overall circuit and the signal routing become. If the circuit is placed automatically, it is very likely that the minimum clock period will deteriorate compared to the basic iterative architecture.
As in the case of loop unrolling, there exists the potential for optimization of logic across rounds, as shown in Figure 4.3-2. For outer round pipelining, registers are usually placed exactly between the rounds, which inhibits optimization across rounds. However, it may still be possible to exploit this feature of a cipher. The designer has to recognize these situations and place the pipeline registers in optimal places, not necessarily between rounds. For the situation shown in Figure 4.3-2 it may be beneficial to place a register between operations f1 and f2, as shown in Figure 4.4-2.
Figure 4.4-2 Optimization of logic across rounds. a) one round, b) two
rounds.
The latency of the circuit with outer round pipelining remains approximately the same as for the basic iterative architecture:

latency_outer = clock period_outer × #clock cycles_outer

clock period_outer ≈ clock period_basic

#clock cycles_outer = #clock cycles_basic

latency_outer ≈ latency_basic
Both the minimum clock period and the number of clock cycles required to process one block of data remain unchanged. This feature permits applying outer round pipelining without drastic changes to the surrounding logic, which can be designed for the basic iterative architecture first and subsequently adapted for a pipelined implementation.
The throughput of the pipelined circuit is approximately K times higher than for the basic iterative architecture. This speedup comes from the fact that K blocks of data can be processed simultaneously.
throughput_outer = K × block size / (clock period_outer × #clock cycles_outer) ≈ K × throughput_basic
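The K-fold speedup, and its collapse when too few independent streams are available, can be sketched with a small model (parameter values are illustrative only):

```python
def outer_pipeline_performance(clock_period, n_cycles, block_size, K, n_streams):
    """Average performance of a K-stage outer-round pipeline.
    Times in ns, block_size in bits. In feedback modes only
    min(K, n_streams) independent streams can keep stages busy.
    Returns (latency [ns], average throughput [Mbit/s])."""
    blocks_in_flight = min(K, n_streams)
    latency = clock_period * n_cycles                 # unchanged vs. basic
    throughput = blocks_in_flight * block_size / latency * 1000
    return latency, throughput
```

With K = 4 and four independent streams the throughput is four times the basic rate; with a single CBC stream it falls back to the basic rate while the latency stays the same.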
Outer round pipelining gives a linear speedup proportional to the number of implemented rounds. This linear speedup comes at the cost of a linear area increase, because not only the round logic but also the key schedule logic has to be expanded, as in the loop unrolling architecture. Figure 4.4-3 shows the throughput to area ratio of the pipelined circuit with respect to the basic iterative architecture.
Figure 4.4-3 Throughput vs. area ratio in outer round pipelining.
The simplicity of outer round pipelining makes it the most commonly used
pipelined architecture for secret-key ciphers documented in the literature.
4.5 Inner round pipelining
Inner round pipelining is another, more advanced, method of pipelining a block cipher. The idea is to implement and pipeline only one round, and to use this circuit iteratively, as in the basic iterative architecture. Figure 4.5-1 shows the inner round pipelining architecture.
Figure 4.5-1 Inner round pipelining.
The question arises as to how to select the number and placement of pipeline stages within the cipher round. In most practical designs the ultimate goal is to meet certain performance requirements, such as throughput or latency. To meet a throughput requirement, it is sufficient to divide the round into combinational pieces such that the longest one does not exceed the required minimum clock period. The way to minimize the latency is to apply as few pipeline stages as possible.
When both throughput and latency have to be optimized, the best approach is to divide the round into equal stages. However, this may not be easy to realize, since rounds usually consist of different operations. It is tempting to look at each operation from the point of view of its logical structure and divide the entire round into stages consisting of an equal number of logic levels (CLB levels). This approach is easy to apply, since an experienced designer can anticipate ahead of time how each of the operations will fit into an FPGA. Usually it is enough to look at each of the operations separately, but some optimizations may be possible across operations, resulting in a smaller total number of logic levels. It is, therefore, advisable to look at the entire round. Unfortunately, this method has a serious flaw: it does not take into account delays through routing resources. Long nets and high fanout may contribute significant delays, resulting in an unexpected loss of performance. These kinds of problems are very difficult to recognize at the initial stage of design.
A more accurate design method requires analyzing the actual delays in an existing combinational circuit before applying pipelining. This means that the basic iterative architecture should be implemented and analyzed first. Analyzing the timing parameters of the implemented circuit is not an easy task in itself, as it requires good knowledge of the implementation tools and consumes a lot of time. Finally, these analyses are only approximate, because to some extent they are specific to a particular placement and routing realization. Automatic tools may solve placement and routing problems differently for pipelined circuits than for non-pipelined ones, and the routing delays may be significantly different. The most straightforward way to prevent this is to force some placement pattern. This may be done by constraining parts of the circuit to specific areas of the FPGA. Of course, this adds to the overall complexity of the design process and usually requires a lot of experience.
Let us assume that the number of pipeline stages is k. This number can be arbitrary, as it is not correlated with the number of cipher rounds in any way. The minimum clock period will be smaller than for the basic iterative architecture, but even for a perfectly balanced pipeline it will not be k times smaller – see chapter 3:

clock period_inner = Tmux + Tround/k + τ > clock period_basic / k
A pipelined circuit requires a higher clock frequency than the basic iterative architecture, and this means that the surrounding logic has to keep up with this requirement. Processing one block of data takes k times more clock cycles than in the basic iterative architecture:

#clock cycles_inner = k × #clock cycles_basic
Taking these facts into account, it is clear that the latency of the pipelined circuit will be worse than for the basic iterative architecture:

latency_inner = #clock cycles_inner × clock period_inner

latency_inner > latency_basic

Since k blocks of data can be processed simultaneously, the average number of clock cycles per block of data is the same as for the basic iterative architecture:

average #clock cycles per block_inner = #clock cycles_basic
From this observation we can conclude that for inner round pipelining the speedup comes from processing data at an increased clock frequency. Therefore, the throughput depends only on the minimum clock period, and not on the number of pipeline stages:

throughput_inner = block size / (#rounds × clock period_inner)

throughput_inner < k × throughput_basic
Although inner round pipelining is quite difficult to apply and increases latency, it gives significant area benefits in FPGA designs. Introducing pipeline registers into the existing combinational logic, in this case the cipher round, usually costs little area, as there exist “free” flip-flops in every CLB slice – see chapter 3. Figure 4.5-2 shows the throughput versus area ratio for inner round pipelining.
The benefits of inner round pipelining make this technique very attractive for high-speed implementations. Unfortunately, the difficulties in designing good pipelines discourage researchers from using them in their implementations.
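The latency/throughput trade-off follows directly from the formulas above. The sketch below assumes a perfectly balanced pipeline (Tround split evenly across k stages), which the text notes is optimistic; all parameter values are illustrative:

```python
def inner_pipeline_performance(t_mux, t_round, t_reg, n_rounds, block_size, k):
    """Compare the basic iterative architecture with k-stage inner round
    pipelining (non-feedback mode). Times in ns, block_size in bits.
    Returns (basic, inner) dicts with latency [ns] and throughput [Mbit/s]."""
    basic_period = t_mux + t_round + t_reg
    inner_period = t_mux + t_round / k + t_reg   # mux/register overhead paid per stage
    basic = {"latency": basic_period * n_rounds,
             "throughput": block_size / (basic_period * n_rounds) * 1000}
    inner = {"latency": inner_period * n_rounds * k,   # k times more cycles
             "throughput": block_size / (inner_period * n_rounds) * 1000}
    return basic, inner
```

The inner-pipelined circuit wins on throughput but loses on latency, and its speedup stays below k because the multiplexer and register overhead is paid in every stage.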
Figure 4.5-2 Throughput vs. area ratio for inner round pipelining.
4.6 Mixed inner- and outer-round pipelining
Inner- and outer-round pipelining each have their limits on maximum throughput. In
some applications throughput rates in the range of gigabits per second are required. For
those applications the inner- and outer-round pipelining techniques may be mixed together.
The basic idea is to implement inner-round pipelining and then unroll such a round as many
times as necessary, as shown in Figure 4.6-1.
In the extreme case all rounds can be unrolled, giving the maximum possible throughput.
From my experience the maximum throughput may range beyond 10 Gbps for a fully
unrolled circuit.
Figure 4.6-1 Mixed inner- and outer-round pipelining. a) partial unrolling,
b) full unrolling.
The mixed inner- and outer-round pipelining inherits all of the implementation
difficulties of both inner-round and outer-round pipelining. The main difficulty in
applying mixed inner- and outer-round pipelining lies in implementing the inner-round
pipeline and then efficiently placing all its instances such that the routing delays remain
the same in the entire circuit. This procedure usually requires careful manual placement.
We can expect that the minimum clock period will be approximately the same as
for inner-round pipelining. However, it may deteriorate since rounds are unrolled and
routing constraints are harder to meet.
clock period_mixed ≈ clock period_inner > clock period_basic / k
The latency may also be expected to be at the level of the inner-round pipelined circuit.
innermixed latencylatency �
The maximum throughput becomes the highest among all architectures since it is
K times higher than for inner-round pipelined circuit. We have achieved throughputs in
the range of 10 Gbps for mixed inner- and outer-round pipelined architecture. However,
this high throughput is achievable only when K·k independent data streams are available.
throughput_mixed = K × block size / (# rounds × clock period_mixed)
throughput_mixed < K × k × throughput_basic
The mixed inner- and outer-round pipelining gives a significant gain in throughput,
but it also inherits all the drawbacks of inner-round and outer-round
pipelining. It affects latency in the same way as inner-round pipelining does, and is
associated with a high cost in area, just as in the case of outer-round pipelining. The
throughput versus area and latency versus area trade-offs are shown in Figure 4.6-2 and Figure
4.6-3, respectively. On top of that, the proper design of a mixed architecture is the most
challenging task.
Figure 4.6-2 Throughput vs. area ratio for mixed pipelining.
Figure 4.6-3 Latency vs. area ratio for mixed pipelining.
5. Methodology of comparison of AES candidates
5.1 Limits of this research
The scope of possible applications for AES is large. The different modes of
operation that can be used with the underlying cipher, different block and key sizes, as
well as different application constraints make it impossible to perform an exhaustive
comparison within a limited time frame. Therefore, I restricted my research with the
following five assumptions:
1. Only 128-bit keys have been considered. Performance for other key sizes can be
easily derived.
Each AES candidate was required to operate with three different key sizes:
128, 192 and 256 bits. For most of the ciphers the key size influences only the key
schedule algorithm, and does not make any difference in the
encryption/decryption transformation. However, in the case of Rijndael the number of
cipher rounds depends on the key size and block size. This dependence is very simple,
and the performance of Rijndael can be easily derived for other key sizes.
2. No comparison of key schedules has been made.
Comparing key schedules can be a more challenging task than comparing
encryption algorithms, because of their strong dependence on key sizes. In many
applications all three key sizes are required, and the key schedule unit would have
to support all of them. I have chosen not to implement any of the key schedules
for comparison purposes, as it would require significant effort and more time.
Instead, my implementations include a memory of internal keys loaded with the
keys generated externally, and the circuitry necessary to distribute these keys
from the memory to the encryption/decryption unit.
3. Only 128-bit blocks have been supported.
AES requirements are limited to 128-bit blocks only. Therefore, I have
considered only this block size, even if the given AES candidate supports other
block sizes.
4. Encryption and decryption implemented in one circuit if possible.
Most secret-key cipher applications require both encryption and decryption
services. Therefore, I think that a proper comparison of ciphers should reflect those
needs, and I have implemented both transformations together. The MARS, RC6, and
Twofish algorithms perform encryption and decryption in a very similar way, and
permit resource sharing between encryption and decryption. I have chosen to
exploit this feature whenever it was possible. Resource sharing makes the
overall circuit smaller, but impairs throughput, because additional switching
circuitry is required. The other ciphers, Serpent and Rijndael, do not share many
similarities between the encryption and decryption transformations, and required
the implementation of two separate units. Moreover, Rijndael’s circuits designated
for encryption and decryption have different complexities. The decryption unit
has a longer critical path and slows down the entire implementation. In light of
this fact, I feel that it is unfair to analyze only the encryption unit, as many other
research groups did.
Figure 5.1-1 Block diagram common for all implementations.
5. Throughput/area ratio was the main optimization criterion.
Throughput is probably the most popular parameter of a hardware
implementation, and even if it does not carry full information about the
circuit, it is often associated with its “power.” However, I did not try to achieve
the highest possible throughput at all cost. I tried to trade speed and area in an
intelligent way, so that my implementations could reflect the costs associated with
circuit sizes too. Therefore, I tried to maximize throughput together with the
throughput/area ratio.
The general block diagram for all my architectures is shown in Figure 5.1-1.
5.2 Choice of architectures
In my opinion, a fair methodology for comparing hardware performance of the
AES candidates should not favor any group of ciphers or a specific internal structure of a
cipher. Different ciphers have different architectures. In particular, some ciphers employ
a large number of small rounds, while others employ a small number of bigger and more
sophisticated rounds. However, both may achieve similar throughputs and have similar
throughput/area ratios. My main goal was to compare ciphers in both feedback and non-
feedback modes when only one stream of data is available.
5.2.1 Comparison in feedback modes
Pipelined architectures are highly underutilized in feedback modes when only one
stream of data is available. Only basic iterative and unrolled architectures come into play.
The unrolled architecture offers slightly higher throughput than basic iterative
architecture, but is not easy to properly design, and has an unattractive throughput/area
ratio. The basic architecture is much more practical from the point of view of real
implementations, and has several important features:
- Relatively easy to implement in a similar way for all AES candidates, which
supports fair comparison,
- Presents a good starting point for all other architectures. Many parameters of
other architectures can be estimated from the basic architecture, and
- Assures the maximum throughput/area ratio for feedback operating modes
(CBC, CFB), now commonly used for bulk data encryption.
5.2.2 Comparison in non-feedback modes
Non-feedback modes permit encrypting more than one block of data belonging to
the same stream simultaneously. Only pipelined architectures can take full advantage of
this feature. Outer-round pipelining is the easiest to apply, but is not well suited for
a general comparison because it enforces one pipeline stage per round. This way
ciphers with smaller rounds automatically achieve better performance. A fair comparison
should exploit all potentials of a cipher. I think that the best approach is to show
performances and sizes for ciphers implemented in mixed architectures. The ideal
situation would be to implement inner-round pipeline with respect to some optimization
criteria first, and next completely unroll the cipher. Inserting too many pipeline stages
into the cipher round does not make sense, as it gives little gain in throughput, if any,
and increases area and latency. I believe that the optimum number of in-round pipeline
stages kopt should be found for each cipher. This optimum should give the best
throughput/area ratio, as shown in Figure 5.2-1. Unfortunately, this method has a very
serious drawback. Finding the optimum number of pipeline stages is not easy, and in
practice we can only estimate what this number could be and where all the registers could
be placed. Due to these constraints, this task became infeasible, and I had to choose
another, sub-optimal strategy. The simplest idea is to introduce enough pipeline registers to run
the circuit at as high a clock frequency as possible. This essentially shows what
throughput/area ratio one can get at a similar clock frequency for each cipher. It turned
out to give quite impressive results in terms of throughput.
Figure 5.2-1 Throughput/area ratio for mixed architecture.
5.3 Tools, design process and synthesis parameters
All my hardware designs have been encoded in VHDL’87. I made a significant
effort to describe all components structurally so that I could indirectly guide the synthesis
tools as to how each component should be synthesized. I could have used specific
primitives from the Xilinx library to enforce certain implementation strategies; however,
it would have made my code device-specific. I have chosen to use those libraries only
when it was necessary, for example to enforce the use of lookup tables (LUTs) in RAM
mode. Other than that, I completely relied on the synthesis tools.
I used Active-HDL 3.6 as a design entry tool. This tool greatly supported the
development of my code, permitting very accurate behavioral, post-synthesis and timing
simulations under the control of test benches. Once the entire code was verified I used
Xilinx Foundation Series 2.1i for design synthesis and implementation. The design flow
is shown in Figure 5.3-1.
Figure 5.3-1 Design flow for each implementation.
As a target device I have chosen the Xilinx Virtex XCV1000BG560-6 FPGA. This
device is fabricated in a 0.22 μm CMOS process, and contains around one million
equivalent logic gates.
I have not set any constraints other than the target clock frequency. In the case of
basic architectures the target clock frequency was 50 MHz, and in the case of pipelined
architectures 150 MHz. Foundation Series returns a detailed layout of a circuit,
information about resources utilized, and a netlist with all timing information which can
be further simulated in Active-HDL. This final simulation, as well as the output
from a static timing analyzer, gave me the final performance of each of the circuits.
6. Implementation of AES candidates
6.1 MARS
MARS was submitted to the AES contest by a large team from IBM [4]. Some of
the members of this team participated in the design of DES over twenty years ago. The
designers have put a lot of effort into making MARS as secure as possible. As they claim,
they have added many stop-fault mechanisms which make MARS resistant to known
and anticipated attacks. Indeed, MARS has one of the largest security margins among all
candidates – see Table 2.1-2. All this security comes, however, with a high price in
performance, both in software and hardware.
6.1.1 Structure and components of MARS
MARS consists of 32 rounds divided into four major groups:
- forward mixing,
- keyed forward transformation,
- keyed backwards transformation, and
- backwards mixing.
Figure 6.1-1 shows the general structure of MARS.
Figure 6.1-1 High-level structure of MARS.
Only the keyed transformations make use of keys, and together they are called
the “cryptographic core”. The mixing transformations serve a more auxiliary purpose.
The forward and backwards mixing transformations for encryption and decryption are
alike, but not the same, and with some effort they can be implemented in one circuit. I
have chosen to implement them all in one unit. This appeared not to be an easy task. The
reader may refer to the original MARS documentation [4] for a description of the mixing
transformations. The structure I came up with is shown in Figure 6.1-2 and Figure 6.1-3.
Figure 6.1-2 Mixing transformation.
Figure 6.1-3 Mixing transformation core.
Even a brief inspection of Figure 6.1-2 and Figure 6.1-3 reveals a large number of
multiplexers, which are used to route data in different directions depending on which
mixing transformation is going to be performed. Obviously, these multiplexers have a
negative influence on circuit performance.
The mixing transformations employ four 8x32 S-boxes of two kinds, S0 and S1,
simple additions and subtractions modulo 2^32, XORs, and fixed rotations. The most
interesting from my point of view are the S-boxes. They accept 8-bit inputs, and are therefore too
big for the 4-bit LUTs present in the Xilinx FPGA. All four S-boxes can be implemented on
three Block SelectRAMs. However, those types of memories are unavailable in the less
expensive families of FPGAs, therefore I have decided to describe them as big lookup
tables and leave their decomposition to the synthesizer. This resulted in a large circuit size and
long propagation delays. Perhaps better results could be obtained with an attempt to
decompose the S-boxes using advanced tools designed for this task.
The “cryptographic core” consists of two types of rounds: forward and backwards
keyed transformations, which, similarly to the mixing transformations, are not identical for
encryption and decryption. The circuit that realizes all “cryptographic core”
transformations is shown in Figure 6.1-4 and Figure 6.1-5.
Figure 6.1-4 Keyed transformation.
Figure 6.1-5 Keyed transformation core.
One can again notice a large number of multiplexers in Figure 6.1-4, which
contribute additional propagation delays. The keyed transformation consists of the
E-function, a couple of adders, subtractors, XORs, and fixed rotations. The details of the
E-function are presented in Figure 6.1-6.
Figure 6.1-6 E-function. Red line indicates critical path.
The E-function is the largest transformation in the entire cipher. Among other
operations, it employs a 9x32 S-box, variable rotations, and multiplication modulo 2^32.
Multiplication and variable rotation appear in the critical path, and I paid extra attention
to their implementation.
The 9x32 S-box is simply a concatenation of S-boxes S0 and S1, where the most
significant bit selects S-box S0 or S1 for transforming eight least significant bits. Since
the mixing transformation and the keyed transformation are never executed
simultaneously in the basic architecture, I decided to share two S-boxes between both
transformations, as shown in Figure 6.1-3.
The variable rotation has been implemented on CLBs in the way presented in
Figure 6.1-7.
Figure 6.1-7 Variable rotation.
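The rotator of Figure 6.1-7 can be modeled in software as a chain of five multiplexer-controlled fixed rotations, one per bit of the 5-bit rotation amount. This is a behavioral sketch, not the VHDL used in the actual design:

```python
# Behavioral model of the 5-stage barrel rotator from Figure 6.1-7:
# a 32-bit variable left rotation built from conditional fixed
# rotations by 16, 8, 4, 2 and 1 positions.

MASK = 0xFFFFFFFF

def rotl_fixed(x, n):
    """Fixed left rotation of a 32-bit word by n positions (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK

def rotl_variable(x, rot):
    """Variable rotation as a chain of 2-to-1 multiplexer stages;
    stage i either passes the word through or rotates it by 2^i."""
    for bit, amount in zip(range(4, -1, -1), (16, 8, 4, 2, 1)):
        if (rot >> bit) & 1:
            x = rotl_fixed(x, amount)
    return x

# Any amount 0..31 is the sum of the selected fixed rotations.
assert rotl_variable(0x80000001, 1) == 0x00000003
```

The hardware analogue is that each stage is a layer of 2-to-1 multiplexers selected by one rotation-amount bit, so the depth is fixed at five stages regardless of the rotation amount.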
Implementing multiplication appeared to be a more challenging task, and I have
taken advantage of the structure of CLBs for optimizations.
6.1.2 Implementation of multiplication modulo 2^32
The multiplication modulo 2^32 can be implemented efficiently on Xilinx Virtex
devices since the CLB Slices contain special logic supporting arithmetic operations. A
simplified structure of one Slice is shown in Figure 6.1-8. It consists of two LUTs,
associated control and carry logic, and two D flip-flops. The carry logic plays an important
role in the implementation of multiplication. This dedicated logic is designed to speed up
arithmetic operations using ripple adders. According to the Virtex documentation, the
maximum propagation time from CIN to COUT is only 0.1 ns, while the propagation time
from the inputs through the LUT to the output is 0.6 ns. Clearly, the use of carry logic in
the design of the multiplier is a reasonable choice.
Figure 6.1-8 Virtex Slice with carry logic.
When performing multiplication, I want to sum partial products which are pre-computed
in the AND matrix. Figure 6.1-9 shows the typical situation, where a full adder follows
two AND gates.
Figure 6.1-9 Example of multiplication scheme. Two AND gates feed full adder.
Fortunately, the Virtex Slice supports computing the ANDs together with the FA in a
single LUT. This is a unique feature among currently available FPGAs. Figure 6.1-10
shows details of the implementation. The LUT computes the propagate function:
p = (x0 and a31) xor (x1 and a30)
The additional AND gate computes the generate signal from x1 and a30:
g = x1 and a30
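These two signals can be checked in a small software model. The carry-multiplexer behavior below reflects my reading of the Virtex fast carry chain (the carry output selects the incoming carry when the propagate signal is set, and the generate signal otherwise):

```python
# Behavioral check of the Slice mapping from Figure 6.1-10: the LUT
# computes the propagate signal p, a dedicated AND gate computes the
# generate signal g, and the carry mux plus XOR reproduce a full adder
# over the two partial products x0*a31 and x1*a30.

def slice_fa(x0, a31, x1, a30, cin):
    p = (x0 & a31) ^ (x1 & a30)   # propagate, from the 4-input LUT
    g = x1 & a30                  # generate, from the extra AND gate
    s = p ^ cin                   # sum output
    cout = cin if p else g        # fast carry multiplexer
    return s, cout

# Exhaustive check against an ordinary full adder on the AND products.
for bits in range(32):
    x0, a31, x1, a30, cin = [(bits >> i) & 1 for i in range(5)]
    total = (x0 & a31) + (x1 & a30) + cin
    assert slice_fa(x0, a31, x1, a30, cin) == (total & 1, total >> 1)
```

The interesting detail is that g needs only one of the two products: when p is 0 the two products are equal, so either of them is a valid carry-out.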
Figure 6.1-10 Multiplication – implementation of the circuit from Figure
6.1-9 in a Virtex Slice.
Let us focus now on the full multiplier, which is created using the principles of array
multipliers. I will use an 8x8-bit example, but all conclusions can be easily extended to
the 32x32-bit version. The logic for a multiplier modulo 2^8 is sketched in Figure 6.1-11.
There exist many paths which are equally critical. I have highlighted only one of them in
red. The same result can be obtained by reordering the summed terms:
x·2^7·a7 + x·2^6·a6 + ... + x·a0   instead of   x·a0 + x·2^1·a1 + ... + x·2^7·a7
The resulting structure has a much shorter, and only one, critical path, as shown in
Figure 6.1-12. I do not consider the horizontal nets as critical because they are
implemented in fast carry logic.
Figure 6.1-11 Array multiplier modulo 2^8.
Figure 6.1-12 Structure of an array multiplier with reversed order of
additions.
The main concern is then to minimize the vertical path. The multiplication from
Figure 6.1-12 can be symbolically represented as consecutive additions performed one at
a time, as shown in Figure 6.1-13a. These additions can be organized into a tree
(Figure 6.1-13b). This trick significantly reduces the number of logic levels from 7 to 3
(31 to 5 in the case of the 32-bit multiplier). In general, the use of the tree allows
realizing the multiplier on log2(n) logic levels, where n is the number of multiplied bits.
Therefore a 64-bit multiplier should contain 6 levels of logic, and so on. The final
multiplier architecture resulting from applying the tree structure is shown in Figure 6.1-14.
This structure was used in the basic iterative architecture.
a) array additions b) tree additions
Figure 6.1-13 Change from array to tree.
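The array-to-tree rearrangement can be illustrated with a small software model, here an 8-bit multiplier modulo 2^8; the width is assumed to be a power of two so the pairwise reduction comes out even:

```python
# Model of the rearrangement from Figure 6.1-13: the n partial
# products of an n x n multiplier modulo 2^n are summed pairwise,
# so the number of adder levels drops from n-1 to log2(n).

from math import log2

def tree_multiply(x, y, n=8):
    """Multiply modulo 2^n via a tree of adders; returns (product,
    number of adder levels). n is assumed to be a power of two."""
    mask = (1 << n) - 1
    # Partial products x * y_i * 2^i, already reduced modulo 2^n.
    terms = [(x << i) & mask if (y >> i) & 1 else 0 for i in range(n)]
    levels = 0
    while len(terms) > 1:                       # one adder level per pass
        terms = [(a + b) & mask
                 for a, b in zip(terms[0::2], terms[1::2])]
        levels += 1
    return terms[0], levels

product, levels = tree_multiply(0xB7, 0x5D)
assert product == (0xB7 * 0x5D) & 0xFF
assert levels == int(log2(8))   # 3 levels for 8 bits, 5 for 32 bits
```

Because addition modulo 2^n is associative, the tree produces the same residue as the linear chain; only the depth changes.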
Figure 6.1-14 Final multiplication schematic.
6.1.3 Results of the implementation of MARS
Throughput and area in the basic iterative architecture
The implementation of MARS in the basic iterative architecture has taken 2,744
CLB Slices. The static timing analyzer indicated the maximum clock frequency at the
level of 15.3 MHz, which gives a throughput of 61 Mbps. This is not much, since the
designers have reported a throughput of 85 Mbps on a 200 MHz PowerPC. A better FPGA
implementation of MARS was presented in [12], where the designers made extensive use
of optimized libraries and achieved a throughput at a level of 102 Mbps.
I believe that better performance could be obtained if the S-boxes were
implemented on Block SelectRAMs. Additionally, I have noticed that the keyed
transformations take approximately twice as much time as the mixing transformations.
Therefore, it may be a better solution to perform only half of the keyed transformation
within one clock cycle and increase the clock frequency.
I have not attempted any implementation of MARS in a pipelined architecture.
I considered it a very time-consuming task. Moreover, MARS had already been criticized
for its complexity, and it was unlikely that it would be selected for the AES.
6.2 RC6
6.2.1 Structure and components of RC6
The RC6 cipher was submitted to the AES contest by Ronald Rivest from MIT
together with his partners from RSA Labs [28]. RC6 is an extension of RC5, an older
cipher designed by Rivest, and also belongs to the class of Feistel-network ciphers. This
feature permits implementing encryption and decryption within the same circuit.
The algorithm consists of 20 identical and simple rounds. Figure 6.2-1 shows the
structure of a circuit implementing one round.
The main operations employed in the algorithm are variable rotations and
multiplications modulo 2^32. The rotations are implemented in the same way as in the case
of MARS – see Figure 6.1-7. One could also use the same multiplier, but fortunately RC6
can be tweaked a little bit, and the multiplier can be significantly reduced.
The F-function present in Figure 6.2-1 performs the following operation:
output <= (input * (2 * input + 1)) <<< 5
where <<< denotes rotation to the left, in this case by 5 bit positions.
Figure 6.2-1 Implementation of one round of RC6.
It turns out that it can easily be replaced by the following equation:
output <= (2 * input^2 + input) <<< 5
The trick is to get rid of the multiplication by changing it into squaring.
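The equivalence of the two forms is purely algebraic and can be checked in a few lines of software; the 32-bit width and the rotation by 5 follow the RC6 description above:

```python
# Check of the RC6 tweak described above: rewriting
# f(x) = x * (2x + 1) mod 2^32 as 2 * x^2 + x mod 2^32 lets a
# dedicated squarer replace the general multiplier.

MASK = 0xFFFFFFFF

def rotl5(x):
    """32-bit left rotation by 5, as used by the F-function."""
    return ((x << 5) | (x >> 27)) & MASK

def f_original(x):
    return rotl5((x * (2 * x + 1)) & MASK)

def f_squared(x):
    return rotl5((2 * x * x + x) & MASK)

for x in (0, 1, 0x12345678, 0xFFFFFFFF):
    assert f_original(x) == f_squared(x)   # algebraically identical
```

Since x·(2x + 1) = 2x² + x holds over the integers, it also holds modulo 2^32, so the rewritten form is exact, not an approximation.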
6.2.2 Implementation of squaring modulo 2^32
Any multiplier can perform squaring if both its inputs are connected together.
However, a special-purpose squarer is much smaller and faster. An array multiplier can
be reduced to a squarer in a way shown in Figure 6.2-2.
Figure 6.2-2 Squarer derived from array multiplier.
It is easy to notice that the same products exist in the same columns, and many of them
can be reduced as shown in Figure 6.2-2b. The resulting squarer, shown
in Figure 6.2-3, occupies around 50% of the area of the corresponding multiplier, with
half its height.
Figure 6.2-3 Squarer modulo 2^8.
I have further reduced the height of the squarer by ordering the adders in a tree,
similarly to the multiplier in MARS. The resulting final circuit is presented in Figure
6.2-4. Although the area of the squarer is much smaller than that of the multiplier, the
number of logic levels involved in a 32-bit squaring is four, which is only slightly less
than for the multiplication.
Figure 6.2-4 Optimized squarer modulo 2^8.
6.2.3 Results of the implementation of RC6
Throughput and area in basic architecture
The implementation of RC6 in the basic iterative architecture has taken 1,137
CLB Slices, which is less than half of the size of MARS. The maximum clock frequency
indicated by the static timing analyzer was 22.3 MHz, which translates to a throughput of
142.7 Mbps. This result is far better than for MARS, but does not satisfy the most demanding
needs for fast encryption. This relatively poor performance comes from the fact that RC6
has a long critical path, which goes through a squarer and a variable rotator, as indicated
in Figure 6.2-1.
Throughput and area in mixed architecture
For the implementation of RC6 with mixed inner- and outer-round
pipelining, I have decided not to use a tree structure for the squarer. A pure array squarer
has a very regular layout, and this makes it easier for the placing tool to minimize the
delays of the interconnections. In the case of deeply pipelined designs, interconnections
may contribute delays similar to or even larger than the logic. I have pipelined RC6 very deeply,
as I introduced 28 pipeline stages within one round. This gives a total of 560 stages in the
mixed architecture, but allows inputting data blocks every clock cycle. The resultant
circuit takes approximately 47,000 CLB Slices, and requires 4 Virtex1000 devices.
According to the static analyzer it can be run with 108.1 MHz clock. This gives a
throughput of 13.1 Gbps.
Introducing so many pipeline stages was completely unnecessary, since the gain
in frequency was only a factor of five. Most likely I could have achieved a similar
throughput with fewer than ten pipeline stages. Additionally, I could have reduced the
circuit size by around 25%.
6.3 Rijndael
6.3.1 Structure and components of Rijndael
Rijndael is an SP-network cipher. This means that the main transformations
employed in this cipher are substitutions and permutations applied to all bits of the data
block in every round. Rijndael was submitted to the AES contest by V. Rijmen and
J. Daemen from Belgium [11]. Rijndael is unique in many ways. The number of cipher
rounds depends on the size of the key and the size of the data block, and is equal to 10 for
a 128-bit key and 128-bit block. Despite such a small number of rounds, Rijndael offers
quite a good security margin, although some researchers express worries about this small
number of rounds. All the main operations employed in Rijndael are based on arithmetic
in Galois fields, which makes it highly efficient in both software and hardware
implementations. Since Rijndael is an SP-network cipher, decryption requires inverse
transformations, which in general are not easy to implement in the same circuit as
encryption. Figure 6.3-1 shows one round of encryption and decryption.
There are only three main operations: MixColumn, ShiftRow, and ByteSub, plus
their inverse versions in the case of decryption. These operations have very nice properties
which permit reordering them, giving the implementer many degrees of freedom.
ShiftRow and InvShiftRow change only the order of bytes within a 128-bit block,
and do not require any logic resources.
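As a software illustration that ShiftRow is pure wiring, it can be modeled as a byte permutation. The column-major state layout and per-row offsets below follow the AES specification, not the VHDL of this thesis:

```python
# ShiftRow as a pure byte permutation: in hardware this is just
# routing, with no logic resources. The 128-bit state is a 4x4 byte
# matrix stored column-major (state[4*c + r] is row r of column c),
# and row r is rotated left by r positions.

def shift_row(state):
    out = [0] * 16
    for r in range(4):
        for c in range(4):
            out[4 * c + r] = state[4 * ((c + r) % 4) + r]
    return out

def inv_shift_row(state):
    """InvShiftRow: the inverse wiring pattern."""
    out = [0] * 16
    for r in range(4):
        for c in range(4):
            out[4 * ((c + r) % 4) + r] = state[4 * c + r]
    return out

s = list(range(16))
assert inv_shift_row(shift_row(s)) == s   # a permutation and its inverse
```

Since both directions are fixed permutations, encryption and decryption each get their routing for free; only the wire patterns differ.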
Figure 6.3-1 One round of Rijndael. a) encryption, b) decryption.
ByteSub and InvByteSub can be viewed as ordinary 8x8 S-boxes. Rijndael uses 16
S-boxes of each kind in a single round. The best implementation approach would be to
implement those S-boxes in Block SelectRAMs, but I decided not to use these
components in any of the ciphers. Implementing 32 8x8 S-boxes on LUTs takes a large
amount of area; however, the construction of those S-boxes gives an opportunity to save
some space through resource sharing. ByteSub and InvByteSub can be decomposed into
two operations: a simple affine transformation and an inversion in the Galois field, as
shown in Figure 6.3-2.
Figure 6.3-2 Construction of a) ByteSub, b) InvByteSub transformations.
The affine transformation and its inverse are simple to implement on LUTs. The
inversion in the Galois field is its own inverse, and stays the same in both encryption and
decryption. This feature permits sharing the inversion between both transformations.
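This sharing can be sketched in software. The following is a minimal model, assuming the standard Rijndael field polynomial x^8 + x^4 + x^3 + x + 1 and affine constant '63' (hex); the function names are illustrative:

```python
def gf_mul(a, b):
    """Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B  # x^4 + x^3 + x + 1, the reduction term
        b >>= 1
    return p

def gf_inverse(a):
    """Inversion in GF(2^8): a^254 = a^(-1); 0 maps to 0 by convention."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def affine(b):
    """The Rijndael affine transformation over GF(2)."""
    c, r = 0x63, 0
    for i in range(8):
        bit = ((b >> i) ^ (b >> ((i + 4) % 8)) ^ (b >> ((i + 5) % 8))
               ^ (b >> ((i + 6) % 8)) ^ (b >> ((i + 7) % 8))) & 1
        r |= (bit ^ ((c >> i) & 1)) << i
    return r

def byte_sub(b):
    # ByteSub = affine transformation applied to the field inverse
    return affine(gf_inverse(b))
```

Only the inversion stage needs to be duplicated between ByteSub and InvByteSub in hardware; InvByteSub applies the inverse affine transformation first and then feeds the shared inverter.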
MixColumn and InvMixColumn do not share the same operations. The MixColumn
transformation can be expressed as a matrix multiplication in the Galois field GF(2^8):
[B0]   [02 03 01 01]   [A0]
[B1] = [01 02 03 01] · [A1]
[B2]   [01 01 02 03]   [A2]
[B3]   [03 01 01 02]   [A3]
Each symbol in this equation (such as Ai, Bi, '03') represents an 8-bit element of
the Galois field. Each of these elements can be treated as a polynomial of degree less
than 8, with coefficients in {0,1} determined by the respective bits of the GF(2^8) element.
For example, '03' is equivalent to '0000 0011' in binary, and to
c(x) = 0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 0·x^2 + 1·x + 1·1 = x + 1
in the polynomial representation.
The multiplication of elements of GF(2^8) is accomplished by multiplying the
corresponding polynomials modulo a fixed irreducible polynomial
m(x) = x^8 + x^4 + x^3 + x + 1
For example, multiplying a variable element A = a7 a6 a5 a4 a3 a2 a1 a0 by the constant
element '03' is equivalent to computing
B(x) = b7 x^7 + b6 x^6 + b5 x^5 + b4 x^4 + b3 x^3 + b2 x^2 + b1 x + b0 =
= (a7 x^7 + a6 x^6 + a5 x^5 + a4 x^4 + a3 x^3 + a2 x^2 + a1 x + a0) · (x + 1) mod (x^8 + x^4 + x^3 + x + 1)
After several simple transformations
B(x) = (a7 + a6) x^7 + (a6 + a5) x^6 + (a5 + a4) x^5 + (a4 + a3 + a7) x^4 + (a3 + a2 + a7) x^3 +
+ (a2 + a1) x^2 + (a1 + a0 + a7) x + (a0 + a7)
where '+' represents addition modulo 2, i.e., an XOR operation.
As a result, each bit of the product B can be represented as an XOR function of at most
three variable input bits, e.g., b7 = a7 + a6, b4 = a4 + a3 + a7, etc.
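This derivation can be checked mechanically against a generic GF(2^8) multiplier; a sketch (the function names are illustrative):

```python
def gf_mul(a, b):
    # multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return p

def mul_03_xor(a):
    # direct XOR form derived above: each output bit uses at most 3 input bits
    bit = lambda i: (a >> i) & 1
    b = [bit(0) ^ bit(7),            # b0 = a0 + a7
         bit(1) ^ bit(0) ^ bit(7),   # b1 = a1 + a0 + a7
         bit(2) ^ bit(1),            # b2 = a2 + a1
         bit(3) ^ bit(2) ^ bit(7),   # b3 = a3 + a2 + a7
         bit(4) ^ bit(3) ^ bit(7),   # b4 = a4 + a3 + a7
         bit(5) ^ bit(4),            # b5 = a5 + a4
         bit(6) ^ bit(5),            # b6 = a6 + a5
         bit(7) ^ bit(6)]            # b7 = a7 + a6
    return sum(v << i for i, v in enumerate(b))

# the polynomial and the XOR formulations agree for every byte value
assert all(gf_mul(a, 0x03) == mul_03_xor(a) for a in range(256))
```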
Each byte of the result of a matrix multiplication is an XOR of four bytes
representing the Galois Field product of a byte A0, A1, A2, or A3 by a respective constant.
The entire MixColumn transformation can be performed using two layers of XOR gates,
with up to 3-input gates in the first layer, and 4-input gates in the second layer. In Virtex
FPGAs, each of these XOR operations requires only one lookup table (i.e., a half of a
CLB Slice).
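Putting the pieces together, a software model of the full MixColumn matrix multiplication can look as follows (a sketch; `MIX` and `mix_column` are illustrative names, and the byte products use the same GF(2^8) arithmetic as above):

```python
def gf_mul(a, b):
    # multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return p

MIX = [[0x02, 0x03, 0x01, 0x01],
       [0x01, 0x02, 0x03, 0x01],
       [0x01, 0x01, 0x02, 0x03],
       [0x03, 0x01, 0x01, 0x02]]

def mix_column(col):
    # each output byte is an XOR of four GF(2^8) byte products
    return [gf_mul(row[0], col[0]) ^ gf_mul(row[1], col[1])
            ^ gf_mul(row[2], col[2]) ^ gf_mul(row[3], col[3])
            for row in MIX]
```

In hardware, each of the four products per output byte is just the two- or three-input XOR network derived earlier, so the whole column reduces to the two XOR layers described above.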
The InvMixColumn transformation can be expressed as the following matrix
multiplication in GF(2^8):
[A0]   [0E 0B 0D 09]   [B0]
[A1] = [09 0E 0B 0D] · [B1]
[A2]   [0D 09 0E 0B]   [B2]
[A3]   [0B 0D 09 0E]   [B3]
The primary difference compared to MixColumn is the larger values of the matrix
coefficients. Multiplication by these constant elements of the Galois field leads to a more
complex dependence between the bits of a variable input and the bits of the respective
product. For example, the multiplication A = '0E' · B leads to the following dependence
between the bits of A and B:
a7 = b6 + b5 + b4
a6 = b5 + b4 + b3 + b7
a5 = b4 + b3 + b2 + b6
a4 = b3 + b2 + b1 + b5
a3 = b2 + b1 + b0 + b6 + b5
a2 = b1 + b0 + b6
a1 = b0 + b5
a0 = b7 + b6 + b5
The entire InvMixColumn transformation can be performed using two layers of
XOR gates, with up to 6-input gates in the first layer, and 4-input gates in the second
layer. In Virtex FPGAs, an implementation of a 6-input XOR operation requires two
layers of CLB Slices. As a result, the InvMixColumn transformation has a significantly
longer critical path than the MixColumn transformation, and the entire decryption is
more time-consuming than encryption.
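These bit dependences can again be verified against a generic GF(2^8) multiplier; a sketch (the function names are illustrative):

```python
def gf_mul(a, b):
    # multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return p

def mul_0e_xor(b):
    # XOR form of multiplication by '0E' listed above (up to 5-input XORs)
    bit = lambda i: (b >> i) & 1
    a = [bit(7) ^ bit(6) ^ bit(5),                    # a0
         bit(0) ^ bit(5),                             # a1
         bit(1) ^ bit(0) ^ bit(6),                    # a2
         bit(2) ^ bit(1) ^ bit(0) ^ bit(6) ^ bit(5),  # a3
         bit(3) ^ bit(2) ^ bit(1) ^ bit(5),           # a4
         bit(4) ^ bit(3) ^ bit(2) ^ bit(6),           # a5
         bit(5) ^ bit(4) ^ bit(3) ^ bit(7),           # a6
         bit(6) ^ bit(5) ^ bit(4)]                    # a7
    return sum(v << i for i, v in enumerate(a))

# the listed equations agree with the field multiplication for every byte
assert all(gf_mul(b, 0x0E) == mul_0e_xor(b) for b in range(256))
```

The wider XOR terms (five inputs for a3, compared with at most three inputs for MixColumn) are exactly what forces the extra layer of CLB Slices in the decryption datapath.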
Taking into account all properties of the component operations, I have
implemented Rijndael in the structure shown in Figure 6.3-3.
Figure 6.3-3 Structure of the implementation of a single round of Rijndael.
6.3.2 Results of the implementation of Rijndael
Throughput and area in basic architecture
The implementation of Rijndael in the basic iterative architecture took 2,507
CLB Slices, very close to the size of MARS. However, the maximum clock frequency
indicated by the static timing analyzer was 32.3 MHz. Together with the small number of
rounds, this puts Rijndael in a high position in the AES ranking, with a throughput of
413.4 Mbps. This result is much better than for MARS and RC6.
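Throughput figures like these follow directly from the clock frequency and the number of clock cycles spent per 128-bit block; a small sketch of the arithmetic (`throughput_mbps` is an illustrative helper name):

```python
def throughput_mbps(block_bits, clock_mhz, cycles_per_block):
    # one block leaves the circuit every cycles_per_block clock cycles
    return block_bits * clock_mhz / cycles_per_block

# Rijndael, basic iterative architecture: one round per cycle, 10 rounds
basic = throughput_mbps(128, 32.3, 10)      # 413.44 Mbps

# fully pipelined architecture: one block per clock cycle
pipelined = throughput_mbps(128, 95.0, 1)   # 12160 Mbps, i.e. about 12.1 Gbps
```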
Throughput and area in mixed architecture
Implementing Rijndael in a pipelined architecture was more challenging than it may
at first seem. The main difficulty lies in pipelining the S-boxes, whose decomposition one
would normally leave to the synthesis tool. Unfortunately, our synthesizer does not
insert pipeline stages automatically. I could have bought a special core for distributed
memory, which allows such optimization, or done the decomposition manually, but
neither solution seemed to guarantee good performance. For this reason I decided to use
Block SelectRAMs to implement the S-boxes.
I introduced 7 pipeline stages into a single round, which gives a total of 70
stages for the full cipher. The amount of area required by the implementation was in the
range of 12,600 CLB Slices plus 80 Block SelectRAMs. I could run the circuit with a 95
MHz clock, which gives a throughput of 12.1 Gbps.
6.4 Serpent
6.4.1 Structure and components of Serpent
Serpent is a block cipher developed in international cooperation by R.
Anderson, E. Biham, and L. Knudsen [1]. All the submitters are very well known
cryptanalysts. The authors emphasize that their design philosophy was highly
conservative; therefore, only well-studied and well-understood operations are used.
Taking into account the reputation of the submitters, it is not surprising that Serpent has
the largest security margin among all candidates. Serpent belongs to the class of
SP-network ciphers. It consists of 32 small and simple rounds. Figure 6.4-1 shows one
round of Serpent. The last round is slightly different, but does not impose any significant
constraints on the design.
Figure 6.4-1 Single round of Serpent.
Unfortunately, not all rounds are identical. The cipher employs eight different sets
of 4x4 S-boxes that repeat every eight rounds. Additionally, encryption and decryption
consist of different operations, so we cannot take any advantage of resource sharing
between encryption and decryption.
Serpent can still be implemented in a basic architecture evaluating one round per
clock cycle, but this requires a switching circuit selecting the S-boxes, as shown in
Figure 6.4-3, and turns out to be very inefficient. I have made an exception for Serpent
and have unrolled eight rounds, treating this configuration as the basic architecture, as
shown in Figure 6.4-2. I call this architecture Serpent I8.
Figure 6.4-2 Implementation of Serpent I8 in basic architecture. a)
encryption, b) decryption.
As I have mentioned, the S-boxes accept only 4 inputs, and therefore match the
structure of an FPGA exceptionally well. Moreover, the linear transformation consists
of only two levels of XORs, which can be implemented very efficiently on LUTs. The
same observations apply to the decryption circuit. Serpent matches the internal
architecture of an FPGA so well that it is hard to believe that its designers are
mathematicians with no hardware design experience. The implementation of Serpent
took us the least amount of time.
Some research groups have implemented Serpent based on only one round with
switched S-boxes, as shown in Figure 6.4-3. We refer to this architecture as Serpent
I1.
Figure 6.4-3 Serpent I1.
6.4.2 Results of the implementation of Serpent
Throughput and area in basic architecture
The implementation of Serpent in the basic iterative architecture, as shown in
Figure 6.4-2, took 4,507 CLB Slices and constitutes the largest circuit, but we have to
keep in mind that eight rounds have been unrolled. The maximum clock frequency
indicated by the static timing analyzer was 13.5 MHz, which, in combination with only
four clock cycles per block, gives a throughput of 431 Mbps. This result outperforms all
other ciphers.
Throughput and area in mixed architecture
Applying pipelining to Serpent was a very easy task because this cipher has a very
FPGA-friendly structure. We introduced only three pipeline stages per round,
which gives a total of 96 pipeline stages for the entire implementation. The circuit takes
approximately 19,700 CLB Slices, which indicates a very small area increase associated
with introducing registers. We could run this circuit at a clock frequency of 130.9 MHz,
which gives a high throughput of 16.7 Gbps. This is the best result achieved by any
cipher reported in the literature.
6.5 Twofish
6.5.1 Structure and components of Twofish
Twofish was submitted to the AES contest by a team from Counterpane Systems
[29], led by B. Schneier, a well-known cryptanalyst. It almost perfectly follows the
classical Feistel-network structure, and performing encryption and decryption in the same
circuit requires introducing only a very small amount of switching logic. The entire
structure of the cipher is shown in Figure 6.5-1.
The designers of Twofish have introduced a new idea in cipher design: the use of
key-dependent S-boxes. Unlike in other ciphers using fixed S-boxes, the contents of key-
dependent S-boxes change for every key, making cryptanalysis certainly much harder.
The perfect way of implementing those 8x8 S-boxes would be to express them as
memories, which could be filled with new contents every time the keys are changed. This
could be done using Block SelectRAMs in a Virtex FPGA. Four such RAM blocks would
be sufficient to implement all eight S-boxes. This solution could be accepted only in the case
of the basic iterative architecture, where we do not need to change keys on the fly. In the
case of the pipelined architectures, changing the contents of memory in one clock cycle is
not feasible, unless we could make use of several memory modules and switch among them.
I have chosen not to use this technique, and I have implemented the algorithm that
computes the contents of the S-boxes inside the cipher round.
Figure 6.5-1 High-level structure of the Twofish cipher.
Each S-box consists of three permutations interleaved with keys S0 and S1, as
shown in Figure 6.5-2. Each q-permutation can be efficiently implemented on LUTs, as it
consists of small 4x4 t-boxes, shown in Figure 6.5-3, which match the internal
architecture of an FPGA very well.
Figure 6.5-2 S-boxes in Twofish.
Figure 6.5-3 Permutation q.
Another function used in Twofish is a 4-by-4-byte MDS matrix. The
transformation performed by this matrix is described by the formula:
[z0]   [01 EF 5B 5B]   [y0]
[z1] = [5B EF EF 01] · [y1]
[z2]   [EF 5B 01 EF]   [y2]
[z3]   [EF 01 EF 5B]   [y3]
where y3...y0 are consecutive bytes of the input 32-bit word (y3 is the most significant
byte), and z3...z0 form the output word. This matrix multiplies a 32-bit input value by 8-bit
constants, with all multiplications performed (byte by byte) in the Galois field GF(2^8).
The primitive polynomial is x^8 + x^6 + x^5 + x^3 + 1. Only three different
multiplications are effectively used in the MDS matrix, namely multiplication
- by 5B (hex) = 0101 1011 (binary), represented in GF(2^8) by the polynomial
  x^6 + x^4 + x^3 + x + 1,
- by EF (hex) = 1110 1111 (binary), i.e. x^7 + x^6 + x^5 + x^3 + x^2 + x + 1, and
- by 01 (hex) = 0000 0001 (binary), the identity element of GF(2^8); obviously the
  result is equal to the input value.
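A software model of these constant multiplications is a small variation of the Rijndael multiplier, with a different reduction polynomial; a sketch (the matrix is written in y0-first orientation, and the function names are illustrative):

```python
def gf_mul_twofish(a, b):
    """Multiply in GF(2^8) modulo the Twofish MDS polynomial x^8 + x^6 + x^5 + x^3 + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x69  # x^6 + x^5 + x^3 + 1, the reduction term
        b >>= 1
    return p

MDS = [[0x01, 0xEF, 0x5B, 0x5B],
       [0x5B, 0xEF, 0xEF, 0x01],
       [0xEF, 0x5B, 0x01, 0xEF],
       [0xEF, 0x01, 0xEF, 0x5B]]

def mds_multiply(y):
    """Multiply the byte vector [y0, y1, y2, y3] by the MDS matrix, giving [z0, z1, z2, z3]."""
    return [gf_mul_twofish(row[0], y[0]) ^ gf_mul_twofish(row[1], y[1])
            ^ gf_mul_twofish(row[2], y[2]) ^ gf_mul_twofish(row[3], y[3])
            for row in MDS]
```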
Finally, the PHT transform is a simple function that consists of two additions modulo 2^32,
as shown in Figure 6.5-4. Both additions are in fact independent and can be performed
simultaneously.
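The PHT itself is just the following pair of additions (a sketch, with a and b the two 32-bit inputs; `pht` is an illustrative name):

```python
MASK32 = 0xFFFFFFFF

def pht(a, b):
    # Pseudo-Hadamard Transform: two independent additions modulo 2^32;
    # the multiplication by 2 corresponds to the 1-bit left shift in Figure 6.5-4
    return (a + b) & MASK32, (a + 2 * b) & MASK32
```

Because neither sum depends on the other's result, the two 32-bit adders can run side by side in hardware, as noted above.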
Figure 6.5-4 PHT transformation.
As I have mentioned at the beginning of this section, both encryption and
decryption transformations can be implemented within the same circuit with a small
amount of additional logic. Figure 6.5-5 shows the structure of an implementation of a
single round used in my design.
Figure 6.5-5 Implementation of a single round of Twofish.
6.5.2 Results of the implementation of Twofish
Throughput and area in basic architecture
Twofish matches the structure of an FPGA very well, which results in a compact
design. Its implementation took 1,076 CLB Slices. The maximum clock frequency
indicated by the static timing analyzer was 22.1 MHz, which translates to a throughput of
177 Mbps.
Throughput and area in mixed architecture
Twofish has a quite long critical path through its round, and there is a lot of
room for pipeline stages. I introduced as many registers as I could, and this resulted
in a very deep pipeline with 24 stages per round. Hence, the total number of stages for the
full cipher is 384. The area of the circuit was in the range of 21,000 CLB Slices, and it
could be run with a clock frequency of 119 MHz. This gives a high throughput of 15.2
Gbps. As we can see, the number of introduced pipeline stages proved to be too large, as
the gain in clock frequency was only by a factor of five. Similarly to RC6, we could most
likely obtain a similar performance with fewer than ten pipeline stages.
Analysis of the results
7.1 Comparison of ciphers in feedback modes
The results of implementing the AES candidates, according to the assumptions and
design procedure summarized in chapter 5, are shown in Figure 7.1-1 and Figure 7.1-2.
Figure 7.1-1 Throughput for Virtex XCV-1000, my results. (Serpent: 431,
Rijndael: 414, Twofish: 177, RC6: 142, Mars: 61, 3DES: 59 Mbps)
All implementations were based on the Virtex XCV-1000BG560-6, one of the largest
currently available Xilinx Virtex devices. Additionally, I have implemented the current
ANSI standard [3], Triple DES, which I used as a reference for comparison.
Implementations of all ciphers took from 9% (for Twofish) to 37% (for Serpent
I8) of the total number of 12,288 CLB Slices available in the Virtex device used in my
designs. This means that less expensive Virtex devices could be used for all
implementations. Additionally, the key scheduling unit could easily be implemented
within the same device as the encryption/decryption unit.
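The utilization figures quoted above can be recomputed directly from the slice counts; a small sketch (`utilization` is an illustrative helper name):

```python
XCV1000_SLICES = 12288  # CLB Slices available in a Virtex XCV-1000

def utilization(slices_used):
    """Percentage of the device's CLB Slices consumed by a design."""
    return 100.0 * slices_used / XCV1000_SLICES

twofish = utilization(1076)     # about 9% of the device
serpent_i8 = utilization(4507)  # about 37% of the device
```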
Figure 7.1-2 Area for Virtex XCV-1000, my results. (Twofish: 1,076;
RC6: 1,137; Rijndael: 2,507; Mars: 2,744; Serpent I8: 4,507; 3DES: 356
CLB Slices)
In Figure 7.1-3 and Figure 7.1-4, I compare my results with the results of research
groups from Worcester Polytechnic Institute [15] and the University of Southern California
[12]. Both groups used identical FPGA devices, the same design tools, and a similar design
procedure. The order of the AES algorithms in terms of encryption and decryption
throughput is identical in the reports of all research groups. Serpent in architecture I8 (see
Figure 6.4-2) and Rijndael are over twice as fast as the remaining candidates. Twofish and
RC6 offer medium throughput. Mars is consistently the slowest of all candidates.
Interestingly, all candidates, including Mars, are faster than Triple DES. Serpent I8 (see
Figure 6.4-2) is significantly faster than Serpent I1 (Figure 6.4-3), and this architecture
should clearly be used in cipher feedback modes whenever speed is a primary
concern and the area limit is not exceeded.
Figure 7.1-3 Throughput for Virtex XCV-1000, comparison with results
of other groups (Worcester Polytechnic Institute, University of Southern
California, and my results).
The agreement among circuit areas obtained by different research groups is not as
good as for the circuit throughputs, as shown in Figure 7.1-4. These differences can be
explained by the fact that speed was the primary optimization criterion for all
involved groups, and area was treated only as a secondary parameter. Additional
differences resulted from different assumptions regarding sharing resources between
encryption and decryption, key storage, and the use of dedicated memory blocks.
Figure 7.1-4 Area for Virtex XCV-1000, comparison with results of other
groups (Worcester Polytechnic Institute, University of Southern
California, and my results).
Despite these different assumptions, the analysis of the results presented in Figure
7.1-4 leads to relatively consistent conclusions. All ciphers can be divided into three
major groups:
1. Twofish and RC6 require the smallest amount of area,
2. Rijndael and Mars require a medium amount of area (at least 50% more than
Twofish and RC6),
3. Serpent I8 requires the largest amount of area (at least 60% more than
Rijndael and Mars). Serpent I1 belongs to the first group according to [12],
and to the second group according to [15].
The overall features of all AES candidates can best be presented using a two-
dimensional diagram showing the relationship between the encryption/decryption
throughput and the circuit area. I collected my results for the Xilinx Virtex FPGA
implementations in Figure 7.1-5. For comparison, I show the results obtained by the NSA
group for ASIC implementations [33] in Figure 7.1-6.
Figure 7.1-5 Throughput vs. area for Virtex-1000, our results. The result
for Serpent I1 is based on [12].
Comparing the diagrams shown in Figure 7.1-5 and Figure 7.1-6 reveals that the
throughput/area characteristics of the AES candidates are almost identical for the FPGA
and ASIC implementations. The primary difference between the two diagrams comes from
the absence of an ASIC implementation of Serpent I8 in the NSA report [33].
Figure 7.1-6 Throughput vs. area for 0.5 µm CMOS standard-cell ASICs,
NSA results.
All ciphers can be divided into three distinct groups:
- Rijndael and Serpent I8 offer the highest speed at the expense of a
  relatively large area;
- Twofish, RC6, and Serpent I1 offer medium speed combined with a very
  small area;
- Mars is the slowest of all AES candidates and second to last in terms of
  circuit area.
Looking at this diagram, one may ask which of the two parameters, speed or area,
should be weighted more heavily in the comparison? The definitive answer is speed. The
primary reason for this choice is that in feedback cipher modes it is not possible to
substantially increase encryption throughput even at the cost of a very substantial
increase in circuit area. On the other hand, by using the resource sharing described in
section 3.3.2, the designer can substantially decrease circuit area at the cost of a
proportional (or higher) decrease in encryption throughput. Therefore, Rijndael and
Serpent can be implemented using almost the same amount of area as Twofish and RC6;
but Twofish and RC6 can never reach the speeds of the fastest implementations of
Rijndael and Serpent I8.
7.2 Comparison of ciphers in non-feedback modes
The results of my implementations of four AES candidates using full mixed inner-
and outer-round pipelining and Virtex XCV-1000BG560-6 FPGA devices are
summarized in Figure 7.2-1, Figure 7.2-2, and Figure 7.2-3. Because of the lack of time, I
did not attempt to implement Mars in this architecture. In Figure 7.2-4, I provide the
results of the implementation of all five AES finalists by the NSA group, using full outer-
round pipelining and semi-custom ASICs in a 0.5 µm CMOS MOSIS library [33].
Figure 7.2-1 Throughput for mixed inner- and outer-round pipelining in
Virtex-1000, my results. (Serpent: 16.8, Twofish: 15.2, RC6: 13.1,
Rijndael: 12.2 Gbps)
To the best of my knowledge, the throughputs of the AES candidates obtained as a result
of my design effort, and shown in Figure 7.2-1, are the best ever reported, including both
FPGA and ASIC technologies.
Figure 7.2-2 Area for mixed inner- and outer-round pipelining on
Virtex-1000, my results. (Serpent: 19,700; Twofish: 21,000; RC6: 46,900
CLB Slices; Rijndael: 12,600 CLB Slices plus 80 Block SelectRAMs)
Figure 7.2-3 Increase in the encryption/decryption latency as a result of
moving from the basic architecture to mixed inner- and outer-round
pipelining. (Serpent I8: 0.297 to 0.733 µs, x2.5; Rijndael: 0.309 to
0.737 µs, x2.4; Twofish: 0.722 to 3.092 µs, x4.3; RC6: 0.897 to 5.490 µs,
x6.1)
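The latencies in Figure 7.2-3 follow from the cycle counts and clock frequencies reported in chapter 6; a small sketch using the Serpent numbers (`latency_us` is an illustrative helper name):

```python
def latency_us(cycles, clock_mhz):
    # latency = number of clock cycles a block spends in the circuit / clock rate
    return cycles / clock_mhz

# Serpent I8, basic architecture: 4 clock cycles per block at 13.5 MHz
basic = latency_us(4, 13.5)        # about 0.30 us

# Serpent, fully pipelined: 96 pipeline stages at 130.9 MHz
pipelined = latency_us(96, 130.9)  # about 0.73 us, roughly a 2.5x increase
```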
My designs outperform similar pipelined designs based on the use of identical
FPGA devices, reported in [15], by a factor ranging from 3.5 for Serpent to 9.6 for
Twofish. These differences may be attributed to the use of a sub-optimal number of inner-
round pipeline stages and to limiting designs to single-chip modules in [15]. My designs
outperform the NSA ASIC designs in terms of encryption/decryption throughput by a
factor ranging from 2.1 for Serpent to 6.6 for Twofish (see Figure 7.2-1 and Figure
7.2-4). Since both groups obtained very similar throughputs in the basic
iterative architecture (see Figure 7.1-5 and Figure 7.1-6), these large differences should
be attributed primarily to the differences between the full mixed inner- and outer-round
architecture employed by me and the full outer-round architecture used by the
NSA team.
Figure 7.2-4 Throughput for 0.5 µm CMOS standard-cell ASICs, NSA
results. (Serpent: 8.0, Rijndael: 5.7, Twofish: 2.3, RC6: 2.2, Mars: 2.2
Gbps)
By comparing Figure 7.2-1 and Figure 7.2-4, it can be clearly seen that using full
outer-round pipelining for the comparison of the AES candidates favors ciphers with less
complex cipher rounds. Twofish and RC6 are over two times slower than Rijndael and
Serpent I1 when full outer-round pipelining is used (Figure 7.2-4), but have a
throughput greater than Rijndael, and comparable to Serpent I1, when full mixed inner-
and outer-round pipelining is applied (Figure 7.2-1). Based on my implementation of
Mars in the basic iterative architecture, I predict that the choice of the pipelined
architecture would have a similar effect on Mars.
The deviations in the values of the AES candidates' throughputs with full mixed
inner- and outer-round pipelining do not exceed 20% of their mean value. The analysis of
critical paths in my implementations has demonstrated that all critical paths contain only
a single level of CLBs and differ only in the delays of programmable interconnects. Taking
into account the already small spread of the AES candidates' throughputs and the potential
for further optimizations, I conclude that the demonstrated differences in throughput are not
sufficient to favor any of the AES algorithms over the others. As a result, circuit area
should be the primary criterion of comparison for our architecture and non-feedback
cipher modes.
As shown in Figure 7.2-2, Serpent and Twofish require almost identical area for
their implementations based on full mixed inner- and outer-round pipelining. RC6
imposes over twice as large area requirements. Comparison of the area of Rijndael with
the other ciphers is made difficult by the use of dedicated memory blocks, Block
SelectRAMs, to implement the S-boxes. Block SelectRAMs are not used in implementations
of any of the remaining AES candidates, and I am not aware of any formula for
expressing the area of Block SelectRAMs in terms of the area used by CLB Slices.
Nevertheless, I have estimated that an equivalent implementation of Rijndael, composed
of CLBs only, would take approximately 24,600 CLBs, which is only 17 and 25 percent
more than the implementations of Twofish and Serpent, respectively.
Additionally, Serpent, Twofish, and Rijndael can all be implemented using two
XCV-1000 FPGA devices, while RC6 requires four such devices. It should be noted that
in my designs, all implemented circuits perform both encryption and decryption. This is
in contrast with the designs reported in [15], where only encryption logic is implemented,
and therefore a fully pipelined implementation of Serpent can fit in one FPGA
device.
Connecting two or more Virtex FPGA devices into a multi-chip module working
with the same clock frequency is possible because the FPGA system-level clock can
achieve rates up to 200 MHz [34], and the highest internal clock frequency required by
the AES candidate implementations is 131 MHz, for Serpent. New devices of the Virtex
family released in 2001 are capable of holding full implementations of Serpent,
Twofish, and Rijndael on a single integrated circuit.
In Figure 7.2-3, I report the increase in the encryption/decryption latency resulting
from using inner-round pipelining with the number of stages optimal from the point
of view of the throughput/area ratio. In the majority of applications that require hardware-
based high-speed encryption, the encryption/decryption throughput is the primary
performance measure, and the latencies shown in Figure 7.2-3 are fully acceptable.
Therefore, in this type of application, the only parameter that truly differentiates the AES
candidates working in non-feedback cipher modes is the area, and thus the cost, of the
implementations. As a result, in non-feedback cipher modes, Serpent, Twofish, and
Rijndael offer very similar performance characteristics, while RC6 requires over twice as
much area and twice as many Virtex XCV-1000 FPGA devices.
Summary
I have implemented all five final AES candidates in the basic iterative
architecture, suitable for feedback cipher modes, using Xilinx Virtex XCV-1000 FPGA
devices. For all five ciphers, I have obtained the best throughput/area ratio compared to
the results of other groups reported for FPGA devices. Additionally, I have implemented
four AES algorithms using full mixed inner- and outer-round pipelining, suitable for
operation in non-feedback cipher modes. For all four ciphers, I have obtained throughputs
in excess of 12 Gbps, the highest throughputs ever reported in the literature for hardware
implementations of the AES candidates, taking into account both FPGA and ASIC
implementations.
I have developed a consistent methodology for the fast implementation and fair
comparison of the AES candidates in hardware. I have found that the choice of an
optimal architecture and a fair performance measure is different for feedback and non-
feedback cipher modes.
For feedback cipher modes (CBC, CFB, OFB), the basic iterative architecture is
the most appropriate for comparison and future implementations. The
encryption/decryption throughput should be the primary criterion of comparison, because
it cannot easily be increased by switching to a different architecture, even at the cost of a
substantial increase in circuit area. Serpent and Rijndael outperform the three remaining
AES candidates by at least a factor of two in both throughput and latency. Two
independent research groups have confirmed my results for feedback modes.
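The reason throughput is the fixed quantity in feedback modes can be made concrete: the basic iterative architecture processes one round per clock cycle and one block at a time, so throughput is determined entirely by the block size, the round count, and the clock frequency. A minimal sketch; the clock frequencies below are hypothetical values for illustration, not measurements from this work:

```python
# Throughput of the basic iterative architecture in feedback modes:
# one round per clock cycle, one block in flight at a time.
def iterative_throughput_mbps(block_bits, rounds, clk_mhz):
    """Throughput in Mbps = block size * clock frequency / number of rounds."""
    return block_bits * clk_mhz / rounds

# Hypothetical clock frequencies, for illustration only.
print(iterative_throughput_mbps(128, 10, 30.0))  # a 10-round cipher: 384 Mbps
print(iterative_throughput_mbps(128, 32, 15.0))  # a 32-round cipher: 60 Mbps
```

The round count divides the throughput directly, which is why a cipher cannot easily buy back feedback-mode throughput by spending more area.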
For non-feedback cipher modes (ECB, counter mode), the architecture with full
mixed inner- and outer-round pipelining is the most appropriate for comparison and
future implementations. In this architecture, all AES candidates achieve high, and
approximately the same, throughput. As a result, the implementation area should be the
primary criterion of comparison. Implementations of Serpent, Twofish, and Rijndael
consume approximately the same amount of FPGA resources; RC6 requires over twice as
large an area. My approach to the comparison of the AES candidates in non-feedback
cipher modes is new and unique, and has yet to be followed, verified, and confirmed by
other research groups.
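The contrast with feedback modes can be shown in one line: with full inner- and outer-round pipelining, a new block enters the circuit every clock cycle, so throughput no longer depends on the number of rounds at all, only on the block size and the clock frequency. A minimal sketch; the 100 MHz clock is a hypothetical figure, not a measured result:

```python
# Throughput of a fully pipelined architecture in non-feedback modes:
# one block is accepted every clock cycle, regardless of the round count.
def pipelined_throughput_gbps(block_bits, clk_mhz):
    """Throughput in Gbps = block size * clock frequency."""
    return block_bits * clk_mhz / 1000.0

# With a hypothetical 100 MHz clock, any 128-bit cipher reaches the same
# 12.8 Gbps, which is why area becomes the differentiating criterion.
print(pipelined_throughput_gbps(128, 100.0))
```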
My analysis leads to the following ranking of the AES candidates in terms of
hardware efficiency: Rijndael and Serpent in a close tie for first place, followed in order
by Twofish, RC6, and Mars. Figure 7.1-5 clearly indicates that Rijndael offers high
throughput and the best throughput/area ratio in the basic iterative architecture. Figure
7.2-2 shows the area requirements of all ciphers implemented in the fully pipelined
architecture. None of those ciphers could fit within the device I used for comparison;
however, Xilinx Inc. has already introduced an extended family of FPGA devices:
Virtex-E and Virtex-II. FPGAs in the Virtex-E family have a large amount of Block
SelectRAM and a number of CLB slices similar to that of the Virtex family. Using
Virtex-E devices, most of the candidates could certainly be implemented within one chip.
Only RC6 is too big for the largest of these devices. Serpent and Twofish could each be
implemented in one of the largest chips, the Virtex-E XCV2600E. Rijndael appears to
have the smallest requirements, as it can be implemented entirely within one Virtex-E
XCV1600E. Again, Rijndael takes the lead in the comparison of the ciphers.
When I came to the AES3 conference in New York with my advisor, we attended
the reception before the conference sessions. Talking with other participants of the
conference, we had a feeling that everyone would most likely see an American candidate
cipher as the winner of the contest, because the winner was going to become an American
government standard. From this point of view, MARS, RC6, and Twofish had an
advantage over the remaining candidates. Rijndael was proposed by two relatively
unknown researchers from Europe, which was not a good omen for its acceptance.
Serpent already had a reputation for being very slow in software. At the AES3 conference
we presented a paper [19] that focused on implementations of the AES candidates in the
basic iterative architecture. We showed my results, which are summarized in Figure 7.1-1
and Figure 7.1-4. Other research groups presented similar results, as shown in Figure
7.1-3. At the end of the AES3 conference, all participants were asked to fill out a survey
in which everyone could indicate his or her choice for the AES standard. The results of
the survey are presented in Figure 8-1.
[Bar chart: number of votes received by Serpent, Rijndael, Twofish, RC6, and Mars, on a
scale of 0 to 100.]
Figure 8-1 Results of the survey filled out by participants of the AES3 conference.
The opinion voiced by AES3 participants is surprisingly well correlated with the
results of our research.
The winner of the contest was finally announced in October 2000. Rijndael has
become the AES, and will protect US government data well into the 21st century.

The AES contest is over, but the results of my research remain of interest. All the
finalists proved to be equally secure, and may find use in real applications. I have
already encountered requests for including all the remaining candidate algorithms in
secure communication standards as optional algorithms. My research results may guide
hardware implementers of those algorithms.
Rijndael was officially approved as the AES in November 2001. It will become a
required algorithm for the most important secure communication protocols, such as
IPSec. I have started my research focusing on implementing the AES for gigabit IPSec,
and have already presented an implementation of Rijndael in the basic iterative
architecture that achieves a throughput of 577 Mbps [8]. I am currently working on
meeting the 1 Gbps requirement of gigabit IPSec.
List of References
[1] R. Anderson, E. Biham, L. Knudsen, Serpent: A Proposal for the Advanced Encryption Standard, NIST AES Proposal, June 1998.
[2] Advanced Encryption Standard Development Effort, http://www.nist.gov/aes.
[3] ANSI X9.52, Triple Data Encryption Algorithm Modes of Operation, 1998.
[4] C. Burwick, D. Coppersmith, E. D'Avignon, R. Gennaro, S. Halevi, C. Jutla, S. Matyas, L. O'Connor, M. Peyravian, D. Safford, N. Zunic, MARS – a candidate cipher for AES, NIST AES Proposal, June 1998.
[5] E. Biham, A. Shamir, Differential cryptanalysis of DES-like cryptosystems, Technical report CS90-16, Weizmann Institute of Science, CRYPTO'90 & Journal of Cryptology, Vol. 4, No. 1, pp. 3-72, 1991.
[6] E. Biham, A. Shamir, Differential Cryptanalysis of the full 16-round DES, Advances in Cryptology, CRYPTO'92, 1992.
[7] E. Biham, A. Shamir, Differential Cryptanalysis of the Data Encryption Standard, Springer Verlag, 1993. ISBN: 0-387-97930-1, 3-540-97930-1.
[8] P. Chodowiec, K. Gaj, P. Bellows, B. Schott, Experimental Testing of the Gigabit IPSec-Compliant Implementations of Rijndael and Triple DES Using SLAAC-1V FPGA Accelerator Board, Proc. Information Security Conference, Malaga, Spain, October 1-3, 2001.
[9] P. Chodowiec, P. Khuon, K. Gaj, Fast Implementations of Secret-Key Block Ciphers Using Mixed Inner- and Outer-Round Pipelining, Proc. ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays, FPGA'01, Monterey, February 2001, pp. 94-102.
[10] P. Chodowiec, W. Todryk, Hardware Encryptor for Hard Drives, Warsaw University of Technology, Faculty of Electronics and Information Technology, Senior Design Project, Warsaw, 1998.
[11] J. Daemen, V. Rijmen, AES Proposal: Rijndael, NIST AES Proposal, June 1998.
[12] A. Dandalis, V. Prasanna, J. Rolim, A Comparative Study of Performance of AES Final Candidates Using FPGAs, Proc. Cryptographic Hardware and Embedded Systems Workshop, CHES 2000, Worcester, MA, Aug. 17-18, 2000.
[13] Electronic Frontier Foundation and O'Reilly and Associates, Cracking DES: Secrets of Encryption Research, Wiretap Politics & Chip Design, July 1998.
[14] A. Elbirt, C. Paar, An FPGA Implementation and Performance Evaluation of the Serpent Block Cipher, Eighth ACM International Symposium on Field-Programmable Gate Arrays, Monterey, California, February 10-11, 2000.
[15] A. Elbirt, W. Yip, B. Chetwynd, C. Paar, An FPGA Implementation and Performance Evaluation of the AES Block Cipher Candidate Algorithm Finalists, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.
[16] Federal Information Processing Standards Publication 46-3, Data Encryption Standard, National Institute of Standards and Technology, 1999.
[17] Federal Information Processing Standards Publication 81, DES Modes of Operation, National Institute of Standards and Technology, 1980.
[18] Federal Information Processing Standards Publication 197, Advanced Encryption Standard (AES), National Institute of Standards and Technology, 2001.
[19] K. Gaj, P. Chodowiec, Comparison of the Hardware Performance of the AES Candidates Using Reconfigurable Hardware, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.
[20] J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, Second Edition, 1995. ISBN: 1-55960-329-8.
[21] H. Leitold, W. Mayerwieser, U. Payer, K. Posch, R. Posch, J. Wolkerstorfer, A 155 Mbps Triple-DES Network Encryptor, Proc. Cryptographic Hardware and Embedded Systems Workshop, CHES 2000.
[22] H. Lipmaa, P. Rogaway, D. Wagner, CTR-Mode Encryption, Comments to NIST concerning AES Modes of Operation, 2000.
[23] M. Matsui, Linear cryptanalysis method for DES cipher, Advances in Cryptology, EUROCRYPT'93, 1993.
[24] J. Nechvatal, E. Barker, D. Dodson, M. Dworkin, J. Foti, E. Roback, Status Report on the First Round of the Development of the Advanced Encryption Standard, NIST report, August 1999.
[25] M. Peattie, Use Triple DES for Ultimate Virtex-II Design Protection, Xcell Journal, Issue 40, Summer 2001.
[26] M. Riaz, H. Heys, The FPGA Implementation of RC6 and CAST-256 Encryption Algorithms, CCECE'99, Edmonton, Alberta, Canada, 1999.
[27] M. Rawski, L. Jozwiak, M. Nowicka, T. Luba, Non-Disjoint Decomposition of Boolean Functions and Its Application in FPGA-oriented Technology Mapping, Proc. EUROMICRO'97, Budapest, Hungary, September 1-4, 1997.
[28] R. Rivest, M. Robshaw, R. Sidney, The RC6 Block Cipher, NIST AES Proposal, June 1998.
[29] B. Schneier, J. Kelsey, D. Whiting, D. Wagner, C. Hall, N. Ferguson, Twofish: A 128-bit Block Cipher, NIST AES Proposal, June 1998.
[30] B. Schneier, J. Kelsey, D. Whiting, D. Wagner, C. Hall, N. Ferguson, Performance Comparison of the AES Submissions, Second AES Candidate Conference, Rome, April 1999.
[31] A. Satoh, N. Ooba, K. Takano, E. D'Avignon, High-Speed MARS Hardware, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.
[32] S. Trimberger, R. Pang, A. Singh, A 12 Gbps DES Encryptor/Decryptor Core in an FPGA, Proc. Cryptographic Hardware and Embedded Systems Workshop, CHES 2000.
[33] B. Weeks, M. Bean, T. Rozylowicz, C. Ficke, Hardware Performance Simulations of Round 2 Advanced Encryption Standard Algorithms, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.
[34] Xilinx, Inc., Virtex 2.5V Field Programmable Gate Arrays, The Programmable Logic Data Book, 2000.
[35] K. Gaj, P. Chodowiec, Fast Implementation and Fair Comparison of the Final Candidates for Advanced Encryption Standard Using Field Programmable Gate Arrays, Proc. RSA Security Conference – Cryptographer's Track, San Francisco, CA, April 8-12, 2001.
[36] T. Ichikawa, T. Kasuya, M. Matsui, Hardware Evaluation of the AES Finalists, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.