Comparison of the Hardware Performance of the AES Candidates Using Reconfigurable Hardware
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science at George Mason University
By
Pawel R. Chodowiec
Bachelor of Science
Warsaw University of Technology, 1998
Director: Kris M. Gaj, Assistant Professor
Department of Electrical and Computer Engineering
Spring Semester 2002
George Mason University
Fairfax, VA
Table of Contents
                                                                          Page
Abstract ............................................................... viii
1. Preface ................................................................. 1
1.1 Data Encryption Standard ............................................... 1
1.2 DES security ........................................................... 2
2. Introduction ............................................................ 6
2.1 Advanced Encryption Standard ........................................... 6
2.1.1 Requirements and evaluation criteria ................................. 7
2.1.2 Evaluation process .................................................. 10
2.2 Need for comparison of hardware implementations ....................... 13
2.3 Previous work ......................................................... 14
3. Characteristics of hardware implementations ............................ 16
3.1 Hardware vs. software implementations ................................. 16
3.2 Parameters of hardware implementations ................................ 23
3.2.1 Throughput .......................................................... 23
3.2.2 Latency ............................................................. 24
3.2.3 Area ................................................................ 26
3.3 Design tradeoffs ...................................................... 30
3.3.1 Increasing the throughput ........................................... 31
3.3.2 Decreasing the area ................................................. 43
4. Hardware architectures for symmetric-key block ciphers ................. 45
4.1 Main characteristics of block ciphers ................................. 45
4.1.1 Structure of a symmetric-key block cipher ........................... 45
4.1.2 Key schedule ........................................................ 47
4.1.3 Modes of operation .................................................. 48
4.2 Basic iterative architecture .......................................... 51
4.3 Loop unrolling ........................................................ 55
4.4 Outer round pipelining ................................................ 60
4.5 Inner round pipelining ................................................ 65
4.6 Mixed inner- and outer-round pipelining ............................... 70
5. Methodology of comparison of AES candidates ............................ 74
5.1 Limits of this research ............................................... 74
5.2 Choice of architectures ............................................... 77
5.2.1 Comparison in feedback modes ........................................ 77
5.2.2 Comparison in non-feedback modes .................................... 78
5.3 Tools, design process and synthesis parameters ........................ 79
6. Implementation of AES candidates ....................................... 82
6.1 MARS .................................................................. 82
6.1.1 Structure and components of MARS .................................... 82
6.1.2 Implementation of multiplication modulo 2^32 ........................ 89
6.1.3 Results of the implementation of MARS ............................... 94
6.2 RC6 ................................................................... 96
6.2.1 Structure and components of RC6 ..................................... 96
6.2.2 Implementation of squaring modulo 2^32 .............................. 98
6.2.3 Results of the implementation of RC6 ............................... 100
6.3 Rijndael ............................................................. 102
6.3.1 Structure and components of Rijndael ............................... 102
6.3.2 Results of the implementation of Rijndael .......................... 108
6.4 Serpent .............................................................. 109
6.4.1 Structure and components of Serpent ................................ 109
6.4.2 Results of the implementation of Serpent ........................... 112
6.5 Twofish .............................................................. 113
6.5.1 Structure and components of Twofish ................................ 113
6.5.2 Results of the implementation of Twofish ........................... 118
7. Analysis of the results ............................................... 119
7.1 Comparison of ciphers in feedback modes .............................. 119
7.2 Comparison of ciphers in non-feedback modes .......................... 125
8. Summary ............................................................... 132
List of References ....................................................... 138
List of Tables
Table                                                                     Page
2.1-1 Fifteen candidate algorithms ........................................ 11
2.1-2 Security margins of final AES candidate algorithms .................. 12
3.1-I Characteristic features of implementations of cryptographic
      transformations in ASICs, FPGAs, and software ....................... 22
3.3-I Features of methods exploring parallel computations ................. 42
List of Figures
Figure                                                                    Page
3.1-1 Structure of the Virtex FPGA ........................................ 18
3.1-2 Example of an attack on the hardware implementation ................. 20
3.2-1 System consisting of multiple modules with throughput parameters .... 24
3.2-2 System consisting of multiple modules with latency parameters ....... 25
3.2-3 Circuit with FIFO buffers ........................................... 25
3.2-4 Variety of functions possible to implement using one lookup
      table (LUT) ......................................................... 28
3.2-5 Example of LUT utilization .......................................... 29
3.3-1 Parallel processing units – string of data split among units ........ 31
3.3-2 Principles of pipelined implementation .............................. 33
3.3-3 Pipeline with delay of registers taken into account ................. 35
3.3-4 Unbalanced pipeline ................................................. 36
3.3-5 Pipelining of circuit consisting of unequal operations .............. 37
3.3-6 Example of unnecessarily placed register ............................ 38
3.3-7 Throughput in the pipelined implementations ......................... 39
3.3-8 Pipelining of Feistel-network cipher ................................ 41
3.3-9 Example of an array multiplier as a circuit requiring additional
      area for registers when pipelined ................................... 42
3.3-10 Resource sharing ................................................... 44
4.1-1 Flow diagram of a typical symmetric-key block cipher ................ 46
4.1-2 Example of feedback and non-feedback modes of operation ............. 49
4.1-3 Counter mode ........................................................ 50
4.2-1 Basic iterative architecture ........................................ 52
4.2-2 Critical path in the basic iterative architecture ................... 53
4.3-1 Loop unrolling ...................................................... 55
4.3-2 Optimization of logic across rounds ................................. 57
4.3-3 Simultaneous evaluation of functions in unrolled rounds ............. 57
4.3-4 Throughput vs. area ratio for unrolled architectures ................ 59
4.4-1 Outer round pipelining .............................................. 61
4.4-2 Optimization of logic across rounds ................................. 63
4.4-3 Throughput vs. area ratio in outer round pipelining ................. 65
4.5-1 Inner round pipelining .............................................. 66
4.5-2 Throughput vs. area ratio for inner round pipelining ................ 70
4.6-1 Mixed inner- and outer-round pipelining ............................. 71
4.6-2 Throughput vs. area ratio for mixed pipelining ...................... 73
4.6-3 Latency vs. area ratio for mixed pipelining ......................... 73
5.1-1 Block diagram common for all implementations ........................ 76
5.2-1 Throughput/area ratio for mixed architecture ........................ 79
5.3-1 Design flow for each implementation ................................. 80
6.1-1 High-level structure of MARS ........................................ 83
6.1-2 Mixing transformation ............................................... 84
6.1-3 Mixing transformation core .......................................... 84
6.1-4 Keyed transformation ................................................ 86
6.1-5 Keyed transformation core ........................................... 86
6.1-6 E-function. Red line indicates critical path ........................ 87
6.1-7 Variable rotation ................................................... 88
6.1-8 Virtex Slice with carry logic ....................................... 89
6.1-9 Example of multiplication scheme. Two AND gates feed full adder ..... 90
6.1-10 Multiplication – implementation of the circuit from 6.1-9
       in a Virtex Slice .................................................. 91
6.1-11 Array multiplier modulo 2^8 ........................................ 92
6.1-12 Structure of an array multiplier with reversed order of additions .. 92
6.1-13 Change from array to tree .......................................... 93
6.1-14 Final multiplication schematic ..................................... 94
6.2-1 Implementation of one round of RC6 .................................. 97
6.2-2 Squarer derived from array multiplier ............................... 98
6.2-3 Squarer modulo 2^8 .................................................. 99
6.2-4 Optimized squarer modulo 2^8 ........................................ 99
6.3-1 One round of Rijndael .............................................. 103
6.3-2 Construction of a) ByteSub, b) InvByteSub transformations .......... 103
6.3-3 Structure of the implementation of a single round of Rijndael ...... 107
6.4-1 Single round of Serpent ............................................ 109
6.4-2 Implementation of Serpent I8 in basic architecture ................. 110
6.4-3 Serpent I1 ......................................................... 111
6.5-1 High-level structure of the Twofish cipher ......................... 114
6.5-2 S-boxes in Twofish ................................................. 115
6.5-3 Permutation q ...................................................... 115
6.5-4 PHT transformation ................................................. 117
6.5-5 Implementation of a single round of Twofish ........................ 117
7.1-1 Throughput for Virtex XCV-1000, my results ......................... 119
7.1-2 Area for Virtex XCV-1000, my results ............................... 120
7.1-3 Throughput for Virtex XCV-1000, comparison with results of
      other groups ....................................................... 121
7.1-4 Area for Virtex XCV-1000, comparison with results of other groups .. 122
7.1-5 Throughput vs. area for Virtex-1000, our results. The result for
      Serpent I1 based on [12] ........................................... 123
7.1-6 Throughput vs. area for 0.5 µm CMOS standard-cell ASICs,
      NSA result ......................................................... 124
7.2-1 Throughput for mixed inner- and outer-round pipelining in
      Virtex 1000, my results ............................................ 126
7.2-2 Area for mixed inner- and outer-round pipelining on Virtex 1000,
      my results ......................................................... 127
7.2-3 Increase in the encryption/decryption latency as a result of moving
      from the basic architecture to mixed inner- and outer-round
      pipelining ......................................................... 127
7.2-4 Throughput for 0.5 µm CMOS standard-cell ASICs, NSA results ........ 128
8-1 Results of survey filled by participants of the AES3 conference ...... 135
Abstract
COMPARISON OF THE HARDWARE PERFORMANCE OF THE AES
CANDIDATES USING RECONFIGURABLE HARDWARE
Pawel Chodowiec, Computer Engineering M.S.
George Mason University, 2002
Thesis Director: Dr. Kris M. Gaj
The results of fast implementations of all five AES final candidates using Xilinx Virtex
Field Programmable Gate Arrays are presented and analyzed. Performance of several
alternative hardware architectures is discussed and compared. The architecture optimal
from the point of view of the throughput-to-area ratio is selected for each of the two major
types of block cipher modes. For feedback cipher modes, all AES candidates have been
implemented using the basic iterative architecture, and achieved speeds ranging from 61
Mbit/s for Mars to 431 Mbit/s for Serpent. For non-feedback cipher modes, four AES
candidates have been implemented using a high-throughput architecture with pipelining
inside and outside of cipher rounds, and achieved speeds ranging from 12.2 Gbit/s for
Rijndael to 16.8 Gbit/s for Serpent. A new methodology for a fair comparison of the
hardware performance of secret-key block ciphers has been developed and contrasted
with the methodology used by the NSA team.
1. Preface
1.1 Data Encryption Standard
DES is probably one of the best-studied and most controversial ciphers. Its history
began in 1973, when the National Bureau of Standards (NBS) issued a public request for
proposals for a standard symmetric key cryptographic algorithm. The request specified a
series of design criteria. Some of the most important requirements were:
• The algorithm had to provide a high level of security,
• The algorithm had to be completely specified and easy to understand,
• The security of the algorithm had to reside in the key, and could not depend on the
secrecy of the algorithm,
• The algorithm had to be available to all users on a royalty-free basis,
• The algorithm had to be adaptable for use in diverse applications,
• The algorithm had to be economically implementable in electronic devices.
In 1974 IBM submitted a promising algorithm in response to this request. NBS
asked the National Security Agency (NSA) for help in evaluating the algorithm. NSA
introduced a few changes to the algorithm: the key length was shortened from 128
to 56 bits, and the contents of all S-boxes were changed. NSA, however, classified all
information justifying these changes. Since then, DES has drawn criticism. Many
researchers suspected that NSA had installed a trapdoor in the S-boxes permitting NSA to
cryptanalyze the algorithm. The reduction of the key length was also controversial.
Despite all the criticism, DES was adopted as a US encryption standard in 1977,
and became a de facto world standard. The algorithm is defined in the American standard
FIPS 46 "Data Encryption Standard", and is described as a 16-round Feistel-network
cipher operating on 64-bit blocks of data.
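The Feistel construction underlying DES can be illustrated with a short sketch. Note that this is a generic Feistel network, not DES itself: the round function `toy_f` and the round keys below are made-up stand-ins for the real DES f-function and key schedule, chosen only to show why the structure is invertible.

```python
def feistel_encrypt(block64, round_keys, f):
    """Generic Feistel network on a 64-bit block split into 32-bit halves."""
    left, right = block64 >> 32, block64 & 0xFFFFFFFF
    for k in round_keys:
        left, right = right, left ^ f(right, k)
    # Undo the final swap so decryption is the same network with reversed keys.
    return (right << 32) | left

def feistel_decrypt(block64, round_keys, f):
    return feistel_encrypt(block64, list(reversed(round_keys)), f)

def toy_f(half32, key):
    # Stand-in for the DES f-function (expansion, S-box lookups, permutation).
    return (half32 * 0x9E3779B1 ^ key) & 0xFFFFFFFF

# 16 rounds, as in DES; the round keys here are arbitrary toy values.
keys = [(0xA5A5A5A5 + 0x01234567 * i) & 0xFFFFFFFF for i in range(16)]
pt = 0x0123456789ABCDEF
ct = feistel_encrypt(pt, keys, toy_f)
assert feistel_decrypt(ct, keys, toy_f) == pt
```

The key property, used by every Feistel cipher including DES, is that decryption reuses the encryption network with the round keys in reverse order, regardless of whether `f` itself is invertible.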
The terms of the DES standard required its review every five years. In 1983 the
standard was automatically recertified for the next five years. In 1987 NSA proposed
the Commercial COMSEC Endorsement Program, which would lead to the development
of a series of algorithms replacing DES. Those algorithms would not be made public, and
would be available only in tamper-proof VLSI chips. The NSA's proposal was not well
received, and because of the lack of other propositions the DES standard remained
effective for the next five years. In 1993 the 15-year-old standard still remained
unbroken. Again the lack of any alternative led to its recertification for another five
years. In 1997 the National Institute of Standards and Technology (NIST; formerly
NBS), aware of the DES weakness, lying mainly in its short key, announced a contest for
the development of the Advanced Encryption Standard, which is to replace the 20-year-old DES.
1.2 DES security
The unclear design criteria classified by NSA sparked the biggest worldwide
effort to break DES. What were the criteria for choosing S-boxes? Why does DES consist
of exactly 16 rounds? Why does the key have only 56 bits? Those and other questions
exposed DES to cryptanalysis like no other cipher. Despite all attempts to find a
crack, DES's secrets remained uncovered for nearly 15 years. Finally, in 1990, Eli Biham
and Adi Shamir discovered differential cryptanalysis, a new and powerful method of
cryptanalysis [5]. DES appeared to be surprisingly resistant to the new attack [6, 7]. The
attack requires 2^47 chosen plaintexts or 2^55 known plaintexts, with an analytical
complexity of 2^37 operations. The enormous amount of data and time needed to mount
the attack makes it less efficient than the brute-force search for the key. Biham and Shamir came to
interesting conclusions:
• The S-boxes happened to be optimized against differential cryptanalysis,
• Any number of rounds less than 16 makes differential cryptanalysis more efficient
than the brute-force attack in the known-plaintext setting.
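The scale of these requirements can be made concrete with a few lines of arithmetic. The 2^47, 2^55, and 2^37 figures are those quoted above; the conversion to bytes (8 bytes per DES block) is my own illustration.

```python
# Figures quoted above for the best differential attack on full DES,
# compared with exhaustive search of the 56-bit key space.
chosen_plaintexts = 2**47
known_plaintexts = 2**55
analysis_ops = 2**37
avg_bruteforce_trials = 2**56 // 2   # expected trials of exhaustive search

# Each DES block is 8 bytes, so the chosen-plaintext variant alone needs
# 2**50 bytes (one pebibyte) of attacker-chosen plaintext/ciphertext pairs.
data_bytes = chosen_plaintexts * 8

# The known-plaintext variant needs as many *captured* texts (2**55) as
# brute force needs *offline* trials, which is why the attack is the less
# practical of the two.
print(data_bytes == 2**50, avg_bruteforce_trials == known_plaintexts)
```

Collecting a pebibyte of chosen plaintexts from a victim is far harder than performing the same number of offline encryptions, which is the sense in which the attack is "less efficient than brute force."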
Why is DES so resistant to an attack discovered many years after its
development? The answer to this question is even more surprising. The designers of DES
already knew about differential cryptanalysis at design time. After consultation with
NSA, they decided that disclosure of the design considerations might reveal differential
cryptanalysis itself. Although DES was already resistant to the new attack, many other
ciphers already in use appeared to be vulnerable. After the publication of differential
cryptanalysis, IBM finally released the design criteria for the S-boxes and P-box,
showing that no trapdoor had been intentionally installed. Soon researchers in the open
cryptographic community began to appreciate the design principles behind DES.
Is DES still secure today? No attack better than the brute force search has been
discovered, but the main criticism, that the key is too short, remains irrefutable. Ever
since DES was first proposed in the 1970s, it has been criticized for its short key. US
government officials claimed that governments could not decrypt information protected
by DES, or that it would take multimillion-dollar networks of computers and months to
decrypt one message.
In 1997 RSA Laboratories issued a series of challenges in order to demonstrate
that DES offers only marginal protection. The first DES Challenge was launched in
January 1997. The secret key was recovered in 96 days by a team led by Rocke Verser of
Loveland, Colorado. In February 1998, Distributed.Net won DES Challenge II-1 during a
41-day effort. Distributed.Net consolidated tens of thousands of computers connected
through the Internet for this task. In July, the Electronic Frontier Foundation (EFF) won
DES Challenge II-2 by recovering an encrypted message in 56 hours, shattering the
previous record. The answer to the challenge was “It’s time for those 128-, 192-, and
256-bit keys”. The main significance of the new record lies in the fact that a single
machine, specifically designed for cracking DES, achieved what the US government
claimed was impossible.
The design of the EFF DES Cracker consists of an ordinary personal computer
connected to a large array of custom chips. One ASIC chip contains 24 search units,
each capable of checking 2.5 million keys per second. Over 1800 chips were used in the
design, giving a search speed of 90 billion keys per second. The average time to recover
the key is only 4.5 days. It took EFF less than one year to build Deep Crack, and it cost
only $220,000. EFF and O’Reilly and Associates have published a book about the EFF
DES Cracker [EFF+98]. The book contains the complete design details for the Deep
Crack chips, boards and software. EFF thus proved that DES is undoubtedly insecure.
Moreover, it strongly suggests that many of the world’s governments have already built
similar or even more powerful machines.
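The quoted figures can be sanity-checked with simple arithmetic. Note one small inconsistency in the numbers as given: 1800 chips × 24 units × 2.5 million keys/s predicts about 1.08 × 10^11 keys/s, slightly above the quoted aggregate of 90 billion, so the per-unit rate is best read as a peak figure; the timing estimate below uses the quoted aggregate.

```python
# Figures quoted in the text for the EFF DES Cracker ("Deep Crack").
chips = 1800
units_per_chip = 24
keys_per_unit_per_s = 2.5e6
aggregate_keys_per_s = 90e9          # quoted overall search speed

# Per-unit peak: 1800 * 24 * 2.5e6 = 1.08e11 keys/s, above the quoted rate.
peak = chips * units_per_chip * keys_per_unit_per_s

# An average search covers half of the 2**56 key space.
avg_seconds = (2**56 / 2) / aggregate_keys_per_s
avg_days = avg_seconds / 86400
print(f"average search: {avg_days:.1f} days")
```

At the quoted 90 billion keys/s this works out to roughly 4.6 days on average, consistent with the 4.5 days stated above.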
[Photographs: the DES Cracker “Deep Crack” custom microchip, and a DES Cracker
circuit board fitted with Deep Crack chips. The machine tests over 90 billion keys per
second, taking an average of less than 5 days to discover a DES key.]
The final nail was put into the Data Encryption Standard coffin on January 19,
1999. Distributed.Net and the EFF DES Cracker won DES Challenge III in 22 hours and
15 minutes. Over 100,000 computers connected through the Internet and EFF’s machine
were testing 245 billion keys per second when the key was found. The decrypted message
foreshadows a new standard: “See you in Rome (second AES Conference), March 22-23,
1999.”
2. Introduction
2.1 Advanced Encryption Standard
After DES was shown to be vulnerable to the brute force attack, the need for a
new standard became unquestionable. There already exists an ANSI encryption standard,
3DES [3], which offers higher security than DES [16], but it is highly inefficient,
especially in software implementations. DES was primarily designed for hardware
implementations in existing technology. Nevertheless, current demands for higher
bandwidths in both computer and telecommunication networks are becoming difficult to
satisfy by 3DES encryption devices, especially when feedback modes of operation are
being considered. It was shown in [32] that DES implemented in VirtexE-8 FPGA in
non-feedback mode can achieve a throughput of 12 Gbps. This would translate to 4 Gbps
for 3DES. Po Khuon has demonstrated a 3DES implementation in a Virtex-6 FPGA
capable of handling a throughput of 59 Mbps in feedback mode [19] and 7 Gbps in
non-feedback mode for a deeply pipelined design [9]. My recent research led to an
implementation of 3DES in a Virtex-6 FPGA that achieves a throughput of 116 Mbps in
feedback mode. Of
course, ASIC devices can satisfy higher throughput demands. One of the reported
implementations of 3DES in an older 0.6 µm CMOS technology is capable of encrypting
data with a throughput of at least 155 Mbps [21]. Many current computer and
telecommunication networks require higher throughputs in the range of gigabits per
second. I have already participated in the design of hardware accelerators for encryption
algorithms used in a 1 Gbps IPSec implementation [8]. However, the next generation of
10 Gbps LAN networks is being developed, and 10 Gbps encryption speeds will soon be
required. Clearly, the 3DES algorithm can be a serious bottleneck in those applications.
The National Institute of Standards and Technology (NIST) has recognized the
need for a new standard and initiated the process of developing an Advanced Encryption
Standard [2]. NIST's main objective was to develop an algorithm that offers security at
least equal to that of 3DES and is significantly more efficient in software and hardware
implementations on a variety of platforms. The algorithm should be capable of protecting
sensitive government information well into the 21st century.
2.1.1 Requirements and evaluation criteria
NIST published a formal call for candidate algorithms in 1997. The minimum
acceptable capabilities were:
1. The algorithm must implement symmetric (secret) key cryptography.
2. The algorithm must be a block cipher.
3. The candidate algorithm shall be capable of supporting key-block combinations with
sizes of 128-128, 192-128, and 256-128 bits.
In addition to the above list, all submissions had to include:
• A complete written specification of the algorithm, consisting of all necessary
mathematical equations, tables, diagrams, and parameters needed to implement
the algorithm,
• A statement of the algorithm’s estimated computational efficiency in hardware
and software. Submitters were required to at least provide estimates for the
“NIST AES analysis platform” and for 8-bit processors,
• A set of test vectors allowing verification of correctness of all implementations,
• A statement of the expected strength of the algorithm along with any supporting
rationale,
• Analyses of the algorithm with respect to known attacks. All known weak keys,
equivalent keys, complementation properties, restrictions on key selection, and
similar features of the algorithm should also be noted,
• Optimized and reference source code in ANSI C and Java describing the
algorithm,
• Declarations granting full rights to patents covering the algorithm when and if
it should be chosen as a federal standard.
It was a remarkable change in the government’s approach to the security issue.
The previous government standard, DES [16], had been developed in close cooperation
with the National Security Agency (NSA). NSA concealed the design criteria and
justifications, which resulted in a lack of trust in the standard. This time NIST organized
the entire process in the form of a contest. Anybody could submit their own algorithm.
Submitters were obliged to reveal all information about the algorithms and justify all
design decisions. The entire cryptographic community evaluated all algorithms openly.
The organization of the AES selection had several important advantages in that it:
• Focused the effort of the cryptographic community on one task, which was
essential given the small number of specialists in unclassified research,
• Stimulated research on methods of constructing secure ciphers,
• Avoided backdoor theories, and
• Sped up the acceptance of the standard.
All algorithms were evaluated with respect to three categories of criteria:
1. SECURITY - the most important factor in the evaluation
• Actual security offered by the algorithm,
• Extent to which the algorithm output is indistinguishable from a random
permutation of the input block,
• Soundness of the mathematical basis for the algorithm’s security,
• Other security factors, for example attacks demonstrating that the actual
security of the algorithm is less than the strength claimed by the submitter.
2. COST
• Licensing requirements,
• Computational efficiency – speed of the algorithm in hardware and software,
• Memory requirements – in the case of software implementations, code size and
RAM requirements are major factors; in the case of hardware implementations,
gate count is taken into account.
3. ALGORITHM AND IMPLEMENTATION CHARACTERISTICS
• Flexibility – the ability of the algorithm to be implemented on different
platforms for various applications,
• Hardware and software suitability – the algorithm should not be restricted to
hardware or software implementations only,
• Simplicity – simplicity of design and ease of implementation.
2.1.2 Evaluation process
The process of evaluating candidate algorithms has been divided into two rounds.
The first round was intended to focus on the evaluation of algorithms based on the
cryptanalysis performed by the public, as well as on the efficiency of software
implementations on a variety of platforms. The AES contest attracted 15 block cipher
submissions from 12 countries on four continents, as shown in Table 2.1-1. Most of the algorithms
came from outside of the USA, demonstrating the large interest of the broad
cryptographic community in the development of the U.S. government encryption
standard.
Table 2.1-1 Fifteen candidate algorithms.

Continent      Country             Cipher
North America  Canada              CAST-256, Deal
               USA                 Mars, RC6, Twofish, Safer+, HPC
               Costa Rica          Frog
Europe         Germany             Magenta
               Belgium             Rijndael
               France              DFC
               Israel, UK, Norway  Serpent
Asia           Korea               Crypton
               Japan               E2
Australia      Australia           LOKI97
Only five algorithms passed to the second round of the evaluation: Mars [4], RC6
[28], Rijndael [11], Serpent [1], and Twofish [30]. All of the final candidates proved to
be sufficiently secure according to the best knowledge available during their analysis. Of
course, nobody can claim absolute invulnerability of their design to future cryptanalysis
methods. At best, only an estimate based on the current state of the art in cryptanalysis can
be made. One of the ways of assessing the security of symmetric-key ciphers is based on
differential [5] and linear [23] cryptanalysis. For both methods, a minimal number of
rounds can be found which makes the attack less practical than brute-force search. Any
number of rounds greater than this minimum is believed to create a security margin, a type
of assurance by the designers themselves against future attacks. Table 2.1-2 summarizes
the security features of the five final candidates. Obviously, ciphers with greater security
margins pay a price in speed of operation, since their numbers of rounds are greater than
necessary.
Table 2.1-2 Security margins of final AES candidate algorithms.

Algorithm  Number of  Minimum number of rounds  Security  Number of rounds of the best  Security
           rounds     believed to be secure     margin    actual or estimated attack    margin
Mars       32         20                        60%       12                            166%
RC6        20         21                        -5%       16                            25%
Rijndael   10         8                         25%       6                             66%
Serpent    32         17                        88%       15                            113%
Twofish    16         14                        14%       6                             166%
The second round of evaluation focused on further cryptanalysis and on hardware
implementations of each of the finalists. FPGA-based implementations played a great role
in the final evaluation. In this thesis I present my contribution to the selection of the new
cryptographic standard. My results were presented at the Third AES Conference in
New York, April 2000 [19]. I further extended my analyses and presented them at the
FPGA'2001 conference held in Monterey, February 2001 [9], and at the RSA'2001
conference held in San Francisco, April 8-12, 2001 [35]. Those results are also included
in this thesis.
Finally, in October 2000, NIST announced the winner of the contest: Rijndael.
AES was finally accepted as a federal standard on November 26, 2001 [18].
2.2 Need for comparison of hardware implementations
Software implementations of cryptography dominate today's encryption
market. Most users do not require high encryption speeds for their applications.
Encrypting electronic mail or private files usually does not need to be done strictly in
real-time. Each day more users start using computer networks and want to ensure privacy
for their network transactions. Existing Local Area Networks and Metropolitan Area
Networks operate with moderate speeds of 10 and 100 Mbps. These speeds can still be
handled by personal computers. However, new technological breakthroughs in LAN and
MAN networks change the horizon significantly. Gigabit Ethernet already exists, and is
becoming a competitive solution for LANs. In response to market trends, where Gigabit
Ethernet is being deployed over tens of kilometers in private networks, the Ethernet
industry developed a way to not only increase the speed of Ethernet to 10 Gbps, but also
to extend its operating distance. Encrypting data with speeds in the range of gigabits per
second is unachievable for the current and foreseeable generations of personal computers,
and broader use of hardware accelerators becomes inevitable.
The number of cryptographic standards targeting communication networks grows
rapidly, and it seems to be natural that cryptographic services become a standard feature
of new products. Future communication devices will be equipped with cryptographic
14
modules by default. If one looks closer at those devices, most likely we will see a small
hardware chip protecting the privacy of our communication.
With respect to those trends, the comparison of hardware implementations of
candidate algorithms for AES becomes one of the most important selection criteria. My
research indicates that hardware implementations reveal large differences in performance
among candidate algorithms. Furthermore, implementations developed by other groups
confirmed most of my conclusions. In the absence of any major breakthroughs in the
cryptanalysis of the AES candidates, and given the relatively inconclusive results of their
software performance evaluation [24, 30], the comparison of the hardware performance
of the AES algorithms provided a major indicator for the final decision regarding the new
standard.
2.3 Previous work
All AES candidate ciphers are brand new algorithms. Their analysis period was
very short and very little could have been done to analyze their performance in dedicated
hardware. The designers of the submitted algorithms are mostly mathematicians, who
usually have limited knowledge of and experience with hardware design. Their original
documentation contains only rough estimates of the hardware performance [4, 28, 11, 1,
29]. Additionally, these estimates are very difficult to compare with each other,
because of large differences in assumptions regarding the technology, and because of
different architectural choices. By the time we started our research, only two results of
actual implementations of individual algorithms had become available [14, 26]; however,
this was still fragmentary knowledge, not suitable for a reliable comparison.
When starting our research I already had some experience in working with
reconfigurable hardware and implementations of cryptography. I gained this
experience during my senior design project completed at the Warsaw University of
Technology, which focused on implementing a hardware encryption device for hard
drives using the RC5 algorithm [10].
3. Characteristics of hardware implementations
3.1 Hardware vs. software implementations
Cryptography can be implemented in both software and hardware. Usually the
desired speed of encryption/decryption and the cost of the implementation are the major
factors influencing the choice of technology.
Software implementations are designed and coded in programming languages,
such as C, C++, Java, and assembly language, and are developed to run on general-purpose
processors, digital signal processors, and smart cards. Usually, software implementations
are very inexpensive. In most cases, cryptographic transformations match modern
microprocessor architectures very well, and even inexperienced programmers may easily
come up with correct implementations.
General-purpose processors offer enough power to satisfy the needs of individual
users; therefore the majority of the existing implementations of cryptography reside in
software. Hardware implementations are the only way to achieve speeds beyond the
reach of general-purpose microprocessors.
Hardware implementations are designed and coded either in hardware description
languages, such as VHDL and Verilog HDL, or using schematic capture. There exist two
major implementation approaches for hardware designs: Application Specific Integrated
Circuits (ASIC) and Field Programmable Gate Arrays (FPGA).
Application Specific Integrated Circuits are designed all the way from a behavioral
description down to the physical layout. The design process is very time consuming and
requires a lot of manpower. The final layout is sent to a very expensive fabrication
process. Clearly, every design mistake may have a large impact on the length of the
design cycle and its cost. Designers needed some inexpensive means of rapid prototyping.
This idea found its realization in the form of FPGA devices.
Field Programmable Gate Arrays offer unique features. They can be
purchased off-the-shelf and reconfigured to perform different functions. Each
reconfiguration takes only a fraction of a second. An FPGA consists of thousands of
small universal building blocks, known as Configurable Logic Blocks (CLB) [34]. CLBs
are connected using programmable interconnects. Some of the FPGA families contain
dedicated memory blocks. These are called Block SelectRAMs [34]. Figure 3.1-1 shows
the architecture of the Xilinx Virtex FPGA family.
Figure 3.1-1 Structure of the Virtex FPGA.
Although FPGAs were originally invented primarily to support the development of
custom ASICs, they have found applications as target devices in their own right. Due to
their ability to be reconfigured, FPGAs are very sophisticated circuits, and their potential
is rarely fully exploited. Even very simple components, like individual gates, have to be
implemented using an entire CLB, leading to suboptimal utilization of the available
resources. All connections between Configurable Logic Blocks are routed using
configurable switches, which present additional sources of delay. The rule of thumb is that
an ASIC is ten times faster than an equivalent FPGA, provided that both are fabricated in
the same technology. However, in FPGAs, the reconfiguration capability may be
exploited as an essential feature of the design. In cryptography, it is often useful to switch
among encryption algorithms. Changing the FPGA configuration can easily accomplish
this task. Also, correcting mistakes or simply upgrading existing products by adding more
functionality is as easy as upgrading software implementations.
The cost of a design based on the FPGA technology is far lower than for the custom
ASIC technology. Designers themselves can reprogram FPGAs, therefore less manpower
is needed in the development process. For low volumes, FPGA-based products are
more profitable than those based on ASICs. Nevertheless, FPGA implementations are still
more expensive than software implementations.
A very essential parameter of every cryptographic implementation is its level of
security. No secure algorithm helps if an attack on the implementation exists. Although
software implementations happen to be the most common, they provide the lowest level
of security. It is extremely difficult to ensure no leakage of information. For example, one
form of attack on software may use little programs, like viruses, to imperceptibly collect
information and send it to an attacker. Another approach may be based on scanning the
memory freed by a cryptographic program that has finished execution. What if the key just
used by that program has not been wiped out of the memory?
Hardware implementations are much easier to protect. It is relatively easy to
design circuits ensuring that no attack is possible unless there is physical access to the
device. In some situations the existence of such access has to be taken into account. One
of the ways to attack a hardware implementation is to replace the cryptographic chip with
another one, which would perform the same operation as the original chip, but could
additionally leak important information to the attacker, see Figure 3.1-2.
Figure 3.1-2 Example of an attack on the hardware implementation.
FPGA-based designs do not offer much protection against physical
manipulation. The configuration bitstream can be easily read and reverse engineered to
identify all security mechanisms. Therefore, preparing and replacing the bitstream with a
slightly changed version of the circuit is not difficult. Moreover, FPGA devices are
equipped with a Readback function, useful for debugging, which permits reading the
contents of all registers and memories together with the configuration bitstream. This way
sensitive information, e.g., an encryption key, can be easily compromised.
Xilinx has addressed this problem in their latest family of Virtex-II FPGAs [25].
Virtex-II devices have on-chip decryptors that have their keys loaded during board
manufacture in a secure environment. Once the devices have been programmed with the
correct keys, they can be configured with encrypted bitstreams. Xilinx has chosen DES
and Triple DES for encrypting bitstreams. This solution is, however, unique among
FPGAs.
Protecting ASIC-based designs is somewhat easier. The ASIC chip can be
designed to be tamper resistant. This means that the chip should not leak any information
under any form of stressing. Some of the possible attacks may be based on power
consumption analysis or fault introduction. Furthermore, the chip may implement a strong
authentication mechanism, making replacement of the chip a difficult task.
Table 3.1-I Characteristic features of implementations of cryptographic
transformations in ASICs, FPGAs, and software

                       ASICs            FPGAs                 Software (general-purpose
                                                              microprocessors)
Speed                  very fast        fast                  moderately fast
Development process
  Design Cost          very expensive   moderately expensive  inexpensive
  Design Cycle         long             moderately long       short
  Design Tools         very expensive   inexpensive           inexpensive
  Maintenance and
  Upgrades             expensive        inexpensive           inexpensive
Cryptographic issues
  Tamper Resistance    strong           limited               weak
  Key Protection       strong           limited               weak
  Algorithm Agility    no               yes                   yes
Every type of implementation has its advantages and disadvantages. Their basic
features are summarized in Table 3.1-I. Software implementations are an attractive
choice when the speed of encryption is not a main concern and the expected level of
security does not have to be high, i.e., when the importance of the protected information
is low compared to the effort required for breaking the security mechanisms. The more
secure and faster the required solution, the more vital the role played by hardware
implementations. Among hardware solutions, FPGAs are becoming more and more
attractive. If we assume that an attacker has no physical access to the device, then FPGA
designs can be as secure as ASICs.
3.2 Parameters of hardware implementations
Every hardware circuit can be characterized by two major parameters: speed of
operation, and area. Cryptographic algorithms are intended to perform cryptographic
transformations on strings of data. Therefore the speed of cryptographic implementations
is commonly characterized by the throughput. Throughput does not always give full
information about the speed, and is often accompanied by another parameter – latency.
3.2.1 Throughput
Throughput is defined as the number of bits processed in a unit of time after the
process has gone through any initialization, and is usually expressed in Mbps or Gbps.
Typically, the encryption and decryption throughputs are equal. All symmetric-key
algorithms perform a fixed sequence of transformations; in other words, no conditional
operations are performed. Therefore, the time of encryption of one block of data is
usually fixed and known, unless the implementation uses tricks that vary the
time of encryption. From the point of view of cryptographers, any technique
yielding a correlation between data and encryption time is highly undesirable. Such a
correlation leaks information about the data, and can be used to mount timing attacks on
the implementation.
Throughput = Number of bits processed / (Time − Startup time)
Throughput has a very important meaning when considering a bigger system
consisting of multiple modules processing data in sequence, as shown in Figure 3.2-1,
because the throughput of the whole system is limited by the throughput of the slowest
module. Cryptographic transformations usually require the most processing power, and
present a bottleneck in many applications.
Figure 3.2-1 System consisting of multiple modules with throughput
parameters.
Throughput_system = min(Thpt_1, ..., Thpt_n),  where Thpt_i is the throughput of module i
When we talk about throughput we usually mean the maximum throughput of a
circuit, although it may also process streams of lower throughput.
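The bottleneck rule above can be sketched in a few lines of Python; the module throughputs below are hypothetical values chosen for illustration, not measurements from this thesis:

```python
# Throughput of a chain of modules (Figure 3.2-1) is limited by the
# slowest module, per the formula above.
def system_throughput(module_throughputs_mbps):
    """Return the throughput of modules processing data in sequence."""
    return min(module_throughputs_mbps)

# Hypothetical chain: 1000 Mbps network interface, 300 Mbps cipher core,
# 800 Mbps framer -- the cryptographic module is the bottleneck.
print(system_throughput([1000, 300, 800]))  # -> 300
```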
3.2.2 Latency
Latency is defined as the time required to complete processing of one block of
data, and is usually expressed in number of clock cycles. This is the time between a
moment when a block of data enters the encryption unit, and a moment when it leaves it.
Throughput and latency describe different features of systems. The total latency of a
system is the sum of the latencies of all modules processing data sequentially. Therefore,
all modules, no matter how different from each other, contribute to the overall latency.
Figure 3.2-2 System consisting of multiple modules with latency
parameters.
Latency_system = Ltncy_1 + Ltncy_2 + ... + Ltncy_n,  where Ltncy_i is the latency of module i
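The summation above can be illustrated with a short Python sketch; the module latencies are hypothetical values expressed in clock cycles:

```python
# Total latency of modules processing data sequentially (Figure 3.2-2)
# is the sum of the individual module latencies.
def system_latency(module_latencies_cycles):
    return sum(module_latencies_cycles)

# Hypothetical chain: input buffer (2 cycles), cipher core (10 cycles),
# output buffer (2 cycles).
print(system_latency([2, 10, 2]))  # -> 14
```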
Latency does not always stay at a fixed level, even if the throughput does. In many
communication applications data come in bursts, i.e. packets, and need to be buffered for
further processing. First-In-First-Out buffers are commonly used for data buffering, as
shown in Figure 3.2-3. Data blocks arriving at a nearly full FIFO have to wait for
processing until preceding blocks are completed. Therefore, their latency is significantly
larger than the latency of blocks arriving at an empty FIFO.
Figure 3.2-3 Circuit with FIFO buffers.
It makes more sense to talk about worst-case latency, or average latency under given
assumptions.
In this work I focused on implementing the cryptographic algorithms
themselves. I did not make use of any FIFO buffers, and all circuits presented here have
fixed latency.
3.2.3 Area
Area describes the “size” of the circuit. There exist different ways of expressing
this size depending on technology.
In the ASIC technology, area is expressed in terms of the size of a die [µm²], or in
terms of the number of transistors or logic gates. Both measurements correspond closely
to the cost of the design. Die cost is typically proportional to the fourth or higher power
of the die area [20]:

Cost of die = f((Die area)^4)
This means that doubling the size of a die would increase its cost sixteen times.
There are also other costs associated with the production of ASICs. Namely, the cost of
testing and the cost of packaging. The cost of testing is proportional to the complexity of
the circuit, and can also be a function of the circuit area.
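The fourth-power rule quoted above can be illustrated with a minimal sketch; the exponent is the rule-of-thumb value cited from [20], and actual exponents vary by process:

```python
# Relative die cost under the rule of thumb: cost ~ (die area)^4.
def relative_die_cost(area_ratio, exponent=4):
    """Cost ratio of a die whose area is scaled by area_ratio."""
    return area_ratio ** exponent

# Doubling the die area multiplies the cost by 2^4 = 16.
print(relative_die_cost(2))  # -> 16
```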
In the FPGA technology, cost analysis is easier. The size of the circuit is expressed
in terms of the number of configurable logic blocks. Therefore, given a logic block
count, it is easy to estimate the cost of the design by simply comparing prices of devices
into which it would fit. Very often FPGA manufacturers give the number of logic gates
equivalent to the entire FPGA circuitry. For example, the Virtex 1000 FPGA is equivalent
to 1,124,022 logic gates [34]. It is tempting to use this number to find the size of an
equivalent ASIC circuit based on the FPGA utilization. However, practice shows that
such estimates are highly inaccurate. Some of the reasons for this inaccuracy include
the following situations:
1) One lookup table (LUT) can realize a wide variety of combinational functions of
different complexity. It may represent only one logic gate, or a quite complex circuit
consisting of many gates, while still being counted as one LUT, as shown in Figure
3.2-4.
2) One CLB Slice consists of two LUTs. It happens quite often that only one of them is
utilized, as shown in Figure 3.2-5.
Figure 3.2-4 Variety of functions possible to implement using one lookup
table (LUT). a) single logic gate, b) complex combinational logic.
Figure 3.2-5 Example of LUT utilization. a) two functions occupy one
CLB Slice, b) two functions occupy two CLB Slices.
The number of CLBs is usually the main, and often the only, parameter reported by
designers. For the reasons listed above, it does not give full information about the area. It
should be accompanied by the specific numbers of lookup tables, flip-flops, and memory
elements used in the design.
Some FPGA devices are equipped with dedicated RAM blocks. There is no
reliable way to translate these into an equivalent number of CLBs or logic gates, which
complicates the comparison of implementations in different technologies even more.
3.3 Design tradeoffs
Designers of hardware implementations have much more flexibility in choosing
the way of developing their implementation than designers of software. In some
applications, the maximization of speed may be an ultimate goal. In these cases, cutting
edge technology plus sophisticated design techniques may be applied. In other
applications the speed of operation may not be very demanding, and designers may be
more concerned with the area and cost constraints. Every digital circuit can be
implemented differently, keeping in mind specific requirements. In this section, I present
a few basic techniques demonstrating the tradeoffs between area, speed, and latency.
3.3.1 Increasing the throughput
One way to increase the speed of the circuit is by increasing the speed of
particular operations. This approach can be used in any situation, and may be achieved by
using sophisticated design techniques like fast multipliers or aggressive logic
decomposition. Another way to increase speed is through exploration of parallelism
existing on different levels. Parallelism can be found in the encryption algorithm, when
certain transformations can be performed simultaneously. It can also be exploited outside
the algorithm if many independent blocks of data can be processed simultaneously. I will
discuss two basic techniques often used when the parallelism can be exploited outside the
encryption algorithm.
Using multiple independent processing units
The first trivial technique is to use many identical processing units working on
independent blocks of data. A single processing unit can represent a complete
encryption/decryption circuit. This unit does not have to be very fast. High speed is
obtained simply by using a large number of processing units.
Figure 3.3-1 Parallel processing units – string of data split among units.
The speedup obtained using this technique is directly proportional to the number
of processing units:
Throughput_parallel units = N × Throughput_one unit
The latency stays the same, as we do not change anything in the structure of a
single unit. There is, however, a price for this speedup: the area of the circuit grows
proportionally to the number of processing units, significantly increasing the cost of the
implementation. A good example of a design consisting of many identical units is the DES
Cracker built by the Electronic Frontier Foundation, as described in Chapter 1.
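A minimal sketch of the speedup formula above; the unit count and unit throughput are hypothetical values for illustration:

```python
# N identical, independent processing units multiply the throughput by N;
# the latency of a single block is unchanged.
def parallel_throughput(n_units, unit_throughput_mbps):
    return n_units * unit_throughput_mbps

# Hypothetical example: 8 encryption units at 50 Mbps each.
print(parallel_throughput(8, 50))  # -> 400
```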
Pipelining
Pipelining is a more sophisticated method of achieving higher speeds than
increasing the number of processing units. Pipelining is applied inside the processing
unit by dividing it into several stages that execute in sequence. Let us consider a
combinational circuit as shown in Figure 3.3-2 a). The circuit can process only one block
of data at a time, and each processing takes time T. The following parameters
characterize our circuit:
Latency_combinational = T

Throughput_combinational = Size of block of data / T
Figure 3.3-2 Principles of pipelined implementation. a) original
combinational logic, b) pipelined version of the same logic.
We can divide this circuit into n stages, as shown in Figure 3.3-2 b). The modified
circuit is said to be pipelined, and can process n blocks of data simultaneously, each in a
different phase.
The pipelined circuit will have the following parameters:

Latency_pipelined = T

Throughput_pipelined = (N × Size of block) / ((T/n) × (N + n − 1))
where: N – number of blocks of data to be processed
n – number of pipeline stages
If we have a large number of blocks of data, then the throughput is approximately n
times greater than for the non-pipelined circuit:

lim (N→∞) Throughput_pipelined = n × Size of block / T
From now on, I will assume in my analyses that we have a large number of blocks
to process: N >> n.
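The ideal pipelined throughput formula above can be checked numerically; the block size, total delay T, stage count, and block count below are hypothetical values:

```python
# Ideal pipelined throughput, per the formulas above: n stages of T/n each,
# so processing N blocks takes (N + n - 1) stage times.
def pipelined_throughput(block_bits, t_ns, n_stages, n_blocks):
    stage_time = t_ns / n_stages
    total_time = stage_time * (n_blocks + n_stages - 1)
    return n_blocks * block_bits / total_time   # bits per ns

# Hypothetical numbers: 128-bit blocks, T = 100 ns, 10 stages.
# With many blocks, the throughput approaches n * block_bits / T = 12.8 bits/ns.
print(pipelined_throughput(128, 100, 10, 1_000_000))
```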
Unfortunately, a speedup of n times is impossible to achieve in real
implementations, because each introduced register contributes an additional delay,
characterized by its propagation time τ.
Figure 3.3-3 Pipeline with delay of registers taken into account.
If we take it into account, the latency and throughput will be expressed by:

Latency_balanced pipeline = T + n·τ

Throughput_balanced pipeline = (n × Size of block) / (T + n·τ)
We can now observe that throughput is not linearly proportional to the number of
pipeline stages. For a small number of stages, T >> n·τ, and the throughput will be nearly
n times greater. However, it grows more and more slowly as we keep introducing
additional stages.
There exists another issue limiting the efficiency of a pipeline. It is very difficult to
create a perfectly balanced pipeline. Usually the stages are not equal, and the stage
requiring the longest time for computations limits the maximum clock frequency of the
entire circuit.
Figure 3.3-4 Unbalanced pipeline.
We can express the imbalance as an additional delay ΔT contributed by the
computations in the longest stage. The latency and throughput parameters become:

Latency_imbalanced pipeline = T + n·(ΔT + τ)

Throughput_imbalanced pipeline = (n × Size of block) / (T + n·(ΔT + τ))
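The three cases can be compared numerically with a small model based on the formulas above; the block size, T, register delay τ, and imbalance ΔT are hypothetical values, and the model matches the qualitative behavior of curves A, B, and C in Figure 3.3-7:

```python
# Throughput as a function of pipeline depth n:
# ideal()      -- n-times speedup, no register delay (curve A),
# balanced()   -- includes register propagation time tau (curve B),
# imbalanced() -- additionally includes per-stage imbalance dT (curve C).
def ideal(n, block, T):
    return n * block / T

def balanced(n, block, T, tau):
    return n * block / (T + n * tau)

def imbalanced(n, block, T, tau, dT):
    return n * block / (T + n * (dT + tau))

# Hypothetical values: 128-bit block, T = 100 ns, tau = 1 ns, dT = 2 ns.
block, T, tau, dT = 128, 100.0, 1.0, 2.0
for n in (1, 5, 20):
    print(n, ideal(n, block, T), balanced(n, block, T, tau),
          imbalanced(n, block, T, tau, dT))
```

For growing n, the balanced and imbalanced curves fall further below the ideal n-times line, which is exactly the saturation effect discussed in the text.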
We can list some possible sources of imbalance:
- The combinational circuit may consist of a few basic operations, which execute in
different amounts of time. To create a well-balanced pipeline, the designer may need to
introduce registers somewhere inside one of the operations. This may require
significant design effort.
Figure 3.3-5 Pipelining of a circuit consisting of unequal operations. a)
original combinational circuit, b) pipelined circuit – operation 2
determines the clock frequency.
- The synthesis tool may synthesize the circuit differently than the designer
predicts, and unnecessary delay can be introduced into the circuit.
Figure 3.3-6 Example of unnecessarily placed register. a) properly
pipelined circuit, b) improperly pipelined circuit – requires more area and
has larger latency.
- The designer of an FPGA-based design usually has very little influence on the
placement of components and the routing of nets. Design tools support manual
floorplanning, but it again requires a lot of work. In my designs I relied completely
on automatic placement and routing. In all of my designs, routing contributed
60-90% of the total delay in the critical path.
In the case of FPGAs, one straightforward approach to pipelining the design is to
limit the number of CLB levels per stage. This way, we can control the delay
through the logic part, but still have no control over the routing part. The deeper the
pipeline we want to design, the more difficult balancing becomes. Even nets with high
fanouts can dramatically change the overall performance.
To summarize, I expect that the imbalance factor ΔT should be treated as a
function of the number of pipeline stages, increasing as the number of stages
increases. I believe that the ΔT factor can easily exceed 2T in very deep pipeline
designs when the designer does not do any floorplanning, as is the case in my designs.
Figure 3.3-7 shows the typical relationship between throughput and the number of
pipeline stages. Curve B shows a perfectly balanced pipeline, and curve C
assumes some imbalance.
Figure 3.3-7 Throughput in the pipelined implementations. A – ideal n
times speedup. B – ideally balanced pipeline. C – unbalanced pipeline.
40
From Figure 3.3-7 we can easily conclude that the common assumption of
throughput being proportional to the number of pipeline stages is true only for small
number of stages. In all of my fully pipelined implementations, the number of stages
ranges from 81 to 588, resulting in much smaller than n times speedup.
Another important question is how the pipelining affects requirements for area. In
the case of ASIC designs every additional flip-flop or latch introduced to the circuit
requires additional resources. However, in FPGA technology pipelining is usually much
less expensive. Every lookup table (LUT) has a D flip-flop associated with it, and it is
frequently sufficient to just use these “free” flip-flops to implement a pipeline.
Unfortunately, this is not always sufficient. The circuit structure may not be balanced, meaning that different computations, requiring logic of different complexity, must be performed on different pieces of data. Any Feistel-network cipher is a classical example of a structure demanding a lot of overhead when pipelined – see Figure 3.3-8. In this example, pipelining has to be introduced in the two data paths in addition to pipelining the F-function. Even if pipelining the F-function does not require any additional area, pipelining those two data paths may significantly increase the area requirements.
Figure 3.3-8 Pipelining of Feistel-network cipher. a) combinational
circuit, b) pipelined circuit – additional registers required in two data
paths.
Additionally, some arithmetic operations used in a circuit may require more area when pipelined. One example of such an operation is the array multiplier presented in Figure 3.3-9.
I have presented two main ways of increasing throughput, one based on using multiple processing units, and the other based on pipelining. Both methods have their advantages and disadvantages. I summarize their basic characteristics in Table 3.3-I. By combining these two methods one can achieve very high throughputs.
Figure 3.3-9 Example of an array multiplier as a circuit requiring
additional area for registers when pipelined. a) combinational circuit, b)
pipelined circuit (additional registers are required to pipeline arguments
input to the array, however they are not shown in the schematic).
Table 3.3-I Features of methods exploiting parallel computations

                       Multiple processing units       Pipelining
Complexity of design   simple                          may be difficult, especially for
                                                       deep pipelines
Speedup                proportional to the number      for a small number of stages,
                       of processing units             proportional to the number of
                                                       stages; for a large number of
                                                       stages the speedup gain drops
Area requirements      proportional to the number      in balanced designs may be very
                       of processing units             small; in unbalanced designs may
                                                       cause a significant area increase
3.3.2 Decreasing the area
When area is our main concern, it can always be decreased by sacrificing circuit performance. Of course, it may happen that the circuit is not properly minimized, and its minimization may decrease the area and increase the speed at the same time. It was shown in [27] that commercial tools do not perform logic decomposition very well. However, designers usually do not attempt any logic optimization by themselves and rely entirely on synthesis tools.
One way to decrease the area is to use simple units for specific operations, for example, ripple-carry adders instead of carry-lookahead or carry-skip adders. Sometimes we can replace one large and complex combinational circuit with a small sequential unit. For example, multiplication can be performed in multiple clock cycles using only one adder, instead of in a single clock cycle using an irregular combinational structure like a Wallace tree.
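The sequential alternative can be sketched behaviorally (Python standing in for HDL; each loop iteration models one clock cycle of a shift-and-add multiplier that reuses a single adder):

```python
def multiply_sequential(a: int, b: int, width: int = 8) -> int:
    """Shift-and-add multiplication: one small adder reused over `width`
    clock cycles, trading speed for area versus a combinational array."""
    acc = 0
    for cycle in range(width):      # one iteration models one clock cycle
        if (b >> cycle) & 1:        # inspect one multiplier bit per cycle
            acc += a << cycle       # the single shared adder at work
    return acc
```

The result matches the combinational multiplier after `width` cycles instead of one, but only one adder is ever instantiated.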
Another way to reduce the area is to use resource sharing. It may happen that the circuit consists of multiple units performing the same operations. Fortunately, in the case of ciphers the same types of operations are very often applied to different parts of the data: multiplications, rotations, or S-boxes. We then need to implement only one instance of such a unit and perform all the computations in different clock cycles.
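A minimal sketch of the idea, using a hypothetical 4-bit S-box (the table below is illustrative only, not taken from any cipher):

```python
# Hypothetical 4-bit S-box, for illustration only.
SBOX = [0x6, 0x4, 0xC, 0x5, 0x0, 0x7, 0x2, 0xE,
        0x1, 0xF, 0x3, 0xD, 0x8, 0xA, 0x9, 0xB]

def substitute_shared(nibbles):
    """One shared S-box unit applied to each data slice in consecutive
    'clock cycles' (each loop iteration models one cycle)."""
    out = []
    for nibble in nibbles:
        out.append(SBOX[nibble])
    return out

def substitute_parallel(nibbles):
    """N parallel S-box instances: same result in one 'cycle', N times the area."""
    return [SBOX[n] for n in nibbles]
```

Both variants compute identical results; the shared unit simply spreads the work over more clock cycles in exchange for instantiating the S-box once.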
Figure 3.3-10 Resource sharing. a) two identical operations performed on distinct data, b) one shared circuit performs the same operation over a greater number of clock cycles.
In the case of very aggressive area requirements, one can try to design a smaller number of universal units capable of performing different types of operations: addition, multiplication, rotation, XOR, and so on. In this approach the circuit structure may become similar to that of a microprocessor.
4. Hardware architectures for symmetric-key block ciphers
4.1 Main characteristics of block ciphers
Symmetric-key block ciphers form a very specific class of algorithms. Most of them have a very similar, well-studied structure and work in standardized modes of operation. The goal of this research is to make a fair comparison of hardware implementations of all of the AES finalists. Therefore, I decided to implement each of the considered algorithms in the same way and exploit all their commonalities. In this chapter I present the key features common to all five AES candidates that had the largest impact on my hardware design decisions.
4.1.1 Structure of a symmetric-key block cipher
Most symmetric-key block ciphers have a similar round-oriented structure. All five final AES candidate algorithms share this feature. Figure 4.1-1 shows the general flow of the encryption/decryption process.
Usually all rounds within the cipher contain identical operations and permit iterative execution. Among the candidates, Mars and Serpent differ slightly. Mars employs two entirely different types of rounds, which even serve different purposes.
Serpent has all rounds very similar in structure, yet it uses S-boxes with slightly different contents in consecutive rounds.
Figure 4.1-1 Flow diagram of a typical symmetric-key block cipher: an initial transformation with round key[0], the cipher round iterated for i = 1 to #rounds with round key[i], and a final transformation with round key[#rounds+1].
Despite the similarity of the internal rounds, it is quite common for the first or last round to be slightly different. These differences, however, do not impose significant constraints on the design.
All operations within a round are fixed arithmetic or logical transformations, such as substitution, addition and multiplication modulo, and operations on polynomials in Galois fields. There are no conditional statements in the algorithm. This makes implementations very straightforward.
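The round-oriented flow can be sketched in software (a behavioral model, not HDL; the transformation names are placeholders for cipher-specific operations, not any particular AES candidate):

```python
def encrypt_block(block, round_keys, n_rounds,
                  initial_transform, cipher_round, final_transform):
    """Generic flow of a round-oriented block cipher (cf. Figure 4.1-1).
    All transformations are fixed, data-independent operations, so the
    control flow never depends on the data being processed."""
    state = initial_transform(block, round_keys[0])
    for i in range(1, n_rounds + 1):            # iterate the single round
        state = cipher_round(state, round_keys[i])
    return final_transform(state, round_keys[n_rounds + 1])
```

In hardware, the loop body corresponds to the one physically implemented round that the data circulates through.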
For some ciphers the algorithm used for encryption is identical to that used for decryption. In particular, this is true for Feistel-network ciphers like DES. Twofish, RC6 and Mars have a similar structure; however, none of them permits performing encryption and decryption in exactly the same way, although Twofish can be expressed in a slightly different way to exploit this feature. The remaining ciphers, Serpent and Rijndael, employ inverse operations for decryption which have very little in common with their encryption counterparts.
4.1.2 Key schedule
Generally, every round accepts at least one round key. Round keys are computed by an accompanying algorithm called the key schedule. The key schedule is also an iterative algorithm, which expands the main key into round keys. The more rounds the algorithm requires, the more round keys need to be supplied by the key schedule. Performing decryption requires applying the round keys to the rounds in reverse order. Some key schedules permit computing round keys in any order, as in the case of DES and Twofish. This feature creates a perfect opportunity for computing keys “on the fly” and may therefore significantly influence design strategies for hardware implementations. Other ciphers have key schedules that compute keys only in the forward direction. Usually, this restriction forces pre-computing all round keys and storing them in some memory before use.
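A toy forward-only key schedule illustrates why decryption forces pre-computation (the rotation and the constant 0x9E37 are arbitrary choices of mine, not from any cipher):

```python
def expand_key(main_key: int, n_round_keys: int):
    """Toy forward-only key schedule over 16-bit words: each round key
    depends on the previous one, so keys cannot be produced in reverse
    order on the fly -- decryption must read a precomputed list backwards."""
    keys = [main_key & 0xFFFF]
    for _ in range(n_round_keys - 1):
        prev = keys[-1]
        # rotate left by 3 and mix with an arbitrary constant
        keys.append(((prev << 3) | (prev >> 13)) & 0xFFFF ^ 0x9E37)
    return keys

encryption_keys = expand_key(0x1234, 16)
decryption_keys = list(reversed(encryption_keys))  # stored, then read backwards
```

A key schedule that could also step backwards (as in DES or Twofish) would remove the storage requirement entirely.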
The key schedule itself represents a completely independent algorithm and can be
considered as a design independent from the encryption unit. The key schedule is also
constrained by slightly different criteria than the encryption circuit. If a large bulk of data
is going to be encrypted, then it is not necessary to change keys frequently, and the key
schedule unit does not have to be fast even when the encryption speed is high. On the
other hand, some applications require fast key changes. For example, encrypting ATM
cells with different keys requires changing keys every 56 bytes. For those applications the
design of key schedule may be even more challenging than the design of the encryption
core.
Key schedule units of all final AES candidates are certainly worth looking at, and differences found among them could be very influential in the selection process. Nevertheless, we have not conducted any research that could yield a comparison of key schedules. In this thesis I have focused entirely on implementing encryption and decryption circuits under different assumptions and constraints.
4.1.3 Modes of operation
Symmetric-key block ciphers are used in several operating modes. Currently, four modes have been standardized for use with DES [17]: Electronic CodeBook (ECB), Cipher Block Chaining (CBC), Cipher FeedBack (CFB), and Output FeedBack (OFB). Together with the Counter (CTR) mode, they can be classified into two categories:
• non-feedback modes: ECB, CTR
• feedback modes: CBC, CFB, OFB.
Figure 4.1-2 shows examples of non-feedback and feedback modes, where consecutive blocks of plaintext (P) are transformed into blocks of ciphertext (C). In the case of a feedback mode, an initialization vector (IV) is supplied.
Figure 4.1-2 Example of feedback and non-feedback modes of operation.
a) ECB mode encryption, b) ECB mode decryption, c) CBC mode
encryption, d) CBC mode decryption.
ECB mode has the very nice property of treating all blocks of ciphertext independently. This is a very valuable feature from the point of view of hardware implementation, because all techniques exploiting parallelism, as discussed in chapter 3, can be applied to speed up computations. Unfortunately, ECB mode is rarely used in practice. The main reason is that it does not hide data patterns occurring in the plaintext.
The feedback modes of operation offer better security and are used more often, but there is a price for it. All feedback modes imply a strong data dependency between consecutive blocks of data. Computation of a ciphertext block may start only when the previous ciphertext block has already been computed, as in CBC mode – Figure 4.1-2c. This restriction has a significant impact on the hardware implementation because no parallelism can be exploited within a single stream of data. All techniques for speeding up the hardware implementation which use parallel processing, such as pipelining, can be employed only if several independent streams of data are available. Although the overall throughput of the implementation can be significantly improved, the throughput of each independent stream remains limited.
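The serial dependency is visible in a minimal CBC model (small integers stand in for 128-bit blocks, and `encrypt` is a placeholder for the block cipher, not a real one):

```python
def cbc_encrypt(blocks, iv, encrypt):
    """CBC encryption: each ciphertext block is fed into the next
    computation, so the blocks of one stream cannot be processed in
    parallel -- the loop body must finish before the next can start."""
    ciphertext = []
    feedback = iv
    for p in blocks:
        feedback = encrypt(p ^ feedback)   # waits on the previous result
        ciphertext.append(feedback)
    return ciphertext
```

A pipeline can only be kept busy by interleaving several such independent streams, each with its own IV (and possibly its own key).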
The limitations imposed by feedback modes have already been recognized, and
other non-feedback modes have been proposed. One of the well-studied modes is counter
mode – Figure 4.1-3.
Figure 4.1-3 Counter mode. a) encryption, b) decryption.
Counter mode works similarly to one-time-pad ciphers. First, a pseudo-random sequence is generated based on the key and IV values. Next, the plaintext is simply XOR-ed with this sequence. Decryption is performed by XOR-ing the ciphertext with the same pseudo-random sequence. Therefore, counter mode requires implementing only the encryption transformation of the underlying block cipher.
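A minimal sketch of the counter mode keystream construction (again with integer "blocks" and a placeholder `encrypt`):

```python
def ctr_crypt(blocks, iv, encrypt):
    """Counter mode: encrypt(IV + i) forms a keystream XOR-ed with the data.
    Encryption and decryption are the same operation, and every block is
    independent, so all of them could be processed in parallel."""
    return [b ^ encrypt(iv + i) for i, b in enumerate(blocks)]
```

Applying the function twice with the same key and IV recovers the plaintext, which is why only the forward cipher transformation is needed.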
Counter mode has been proven to be as secure as the feedback modes, provided that the same key and initialization vector pair is never used to encrypt two different messages. This concern was probably one of the primary reasons why it was not standardized earlier.
NIST foresaw the need for new modes of operation soon after selecting the winner of the AES contest. It is likely that the previously standardized modes, together with counter mode, will be temporarily accepted for use with AES. However, NIST has already initiated a new public effort to develop new modes of operation intended for AES. One of the promising modes submitted to NIST is the Offset CodeBook (OCB) mode [22]. OCB offers not only high security and parallelism, but also an authentication service which is not present in any of the currently standardized modes of operation.
4.2 Basic iterative architecture
Since most block ciphers have a round-oriented design, it is sufficient to implement only one round and circulate the data through the same logic several times. The most straightforward implementation of a typical symmetric-key block cipher is shown in Figure 4.2-1. A single round is implemented as combinational logic and is supplemented with a register and a multiplexer. The register is required to hold the intermediate results of computations between consecutive clock cycles. It can be positioned anywhere within the round circuit, depending on particular design constraints. The purpose of the multiplexer is to feed data back to the circuit or to fetch a new block of data.
Figure 4.2-1 Basic iterative architecture
Usually, the combinational circuit representing one cipher round is capable of performing either encryption or decryption, since most modes of operation require both transformations. Obviously, in the basic iterative architecture only one block of data is transformed in the circuit at a time, with only the encryption or the decryption operation activated. Sometimes it can be justified to implement encryption or decryption only. Such an implementation may require two separate chips, or one FPGA chip that can be entirely or partially reconfigured to switch between operations. However, from the point of view of most applications, the time of reconfiguration implies unacceptably large overhead and is rarely justified in practice.
The basic iterative architecture permits encrypting only one block of data at a time and is therefore suitable for any mode of operation. Since only one round is physically implemented in the circuit, transforming one block of data most likely takes the same number of clock cycles as the number of cipher rounds. However, some block ciphers require more round keys than the number of rounds, for example additional whitening keys. If the key schedule computes only one key per clock cycle, or all keys are stored in one memory, then the number of clock cycles needed to encrypt one block of data will be equal to the number of round keys. Obviously, each cipher has its own characteristic features, and the number of clock cycles required to process one block of data may deviate slightly from the number of cipher rounds.
The critical path in the circuit determines the minimum clock period. If our goal is to maximize the throughput, the critical path should appear in the feedback between the output and input of the intermediate register, as shown in Figure 4.2-2. Otherwise, the performance of the circuit would be unnecessarily limited by other logic.
Figure 4.2-2 Critical path in the basic iterative architecture.
The minimum clock period can be computed as the sum of the following factors:
• propagation time through the multiplexer, Tmux,
• propagation and setup time of the register, τ, and
• propagation time through the round circuit, Tround:

clock period_basic = Tmux + Tround + τ

Latency and throughput are expressed as follows:

latency_basic = clock period_basic × #clock cycles_basic

throughput_basic = block size / (clock period_basic × #clock cycles_basic)
Different ciphers have different numbers of rounds and different round sizes. For example, Serpent has many simple rounds, while Rijndael has fewer, but more complex, rounds. Therefore, the influence of the multiplexer and register on the overall performance of the circuit is not equal for the various ciphers implemented in the basic iterative architecture. Typically, however, the propagation times through the multiplexer and register are much smaller than the propagation time through the round circuit, and can be safely neglected.
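The formulas above translate directly into a small timing model (the parameter values in the test are illustrative, not from any implementation):

```python
def basic_iterative_performance(t_mux, t_round, t_reg, n_cycles, block_size):
    """Timing model of the basic iterative architecture.
    Times in ns, block_size in bits.
    Returns (clock period [ns], latency [ns], throughput [Mbit/s])."""
    clock_period = t_mux + t_round + t_reg        # Tmux + Tround + tau
    latency = clock_period * n_cycles
    throughput = block_size / latency * 1000      # ns per block -> Mbit/s
    return clock_period, latency, throughput
```

For a hypothetical 128-bit cipher with ten rounds, Tround = 20 ns, and 1 ns each for the multiplexer and register, this gives a 22 ns period and roughly 582 Mbit/s.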
4.3 Loop unrolling
Loop unrolling is the simplest extension of the basic iterative architecture, as shown in Figure 4.3-1. The idea is to implement more than one round as a combinational circuit. Typically, the number of unrolled rounds is a divisor of the total number of cipher rounds, and I assume this is the case in further analyses. In the extreme case all rounds can be unrolled, as shown in Figure 4.3-1b; this eliminates the need for the multiplexer and feedback.
Figure 4.3-1 Loop unrolling. a) partial unrolling, b) full unrolling.
As in the basic iterative architecture, there can be only one block of data
processed at a time and, therefore, loop unrolling is equally well suited for feedback and
non-feedback modes of operation.
The number of clock cycles necessary to encrypt a single block of data decreases proportionally to the number of unrolled rounds. At the same time, the minimum clock period increases, but possibly by a factor slightly smaller than the number of unrolled rounds. There are two major reasons for this:
• the input multiplexer and register have a smaller influence on the overall circuit performance, and
• unrolled rounds may permit additional logic optimizations.
The minimum clock period can be expressed as follows:

clock period_unrolled = Tmux + T_unrolled rounds + τ
Let k be the number of unrolled rounds. We can assume that the propagation time through the unrolled logic is at most k times larger than for one round. However, some optimizations may occur in the unrolled circuit that decrease the delay. It may happen that operations executed at the end of one round and the beginning of the next can be implemented as one circuit with a smaller delay, as shown in Figure 4.3-2.
It may be possible to further exploit the timing characteristics of the round circuit. Figure 4.3-3 shows a hypothetical round structure with two operations, f1 and f2. One unrolling is sufficient to notice that the functions f1 in both rounds can be evaluated simultaneously.
Figure 4.3-2 Optimization of logic across rounds. a) single round, b) two
rounds unrolled.
Figure 4.3-3 Simultaneous evaluation of functions in unrolled rounds. a)
single round, b) two rounds unrolled.
This potential speedup comes at a cost in circuit area. The area grows very quickly with each unrolled round, because not only the round logic but also the key schedule logic needs to be expanded. The more complex circuit consumes additional logic and routing resources. At the current scale of integration of CMOS circuits, the delays introduced by interconnections between logic elements play a crucial role in the overall performance. According to Xilinx, for small designs a 50/50 ratio between logic and routing delays should be anticipated; the bigger the design, the bigger the routing share, 40/60 or even 30/70. The increased delay through interconnects may easily cancel the speedup gained in the logic resources. For this reason it is generally not advisable to use loop unrolling in FPGA devices, unless the designer takes great care with placement and routing. Full loop unrolling eliminates the need for the multiplexer and feedback, and therefore presents a perfect object for placement and routing. Loop unrolling can prove beneficial especially in ASIC devices, as demonstrated in [31, 36].
The latency and throughput can be expressed similarly as for the basic architecture:

latency_unrolled = clock period_unrolled × #clock cycles_unrolled

throughput_unrolled = block size / (clock period_unrolled × #clock cycles_unrolled)

If the design is well placed and routed, the following holds:

clock period_unrolled < k × clock period_basic

#clock cycles_unrolled = #clock cycles_basic / k

throughput_unrolled > throughput_basic
As a result, the latency and throughput parameters are slightly better for the unrolled circuit, but this comes at a large price in circuit size. Figure 4.3-4 shows the throughput to area ratio of the unrolled circuit with respect to the basic iterative architecture.
Figure 4.3-4 Throughput vs. area ratio for unrolled architectures.
It becomes clear that loop unrolling can be justified only when the design is not constrained by area requirements. In practice, loop unrolling is too expensive a way of speeding up the circuit; therefore, I did not attempt to implement any of the ciphers in this architecture.
4.4 Outer round pipelining
Pipelining of digital circuits is not always an easy task. A proper pipeline should have approximately equal stages. As mentioned in section 4.1.1, existing symmetric-key ciphers have round-oriented architectures. All rounds are alike and have similar, if not identical, complexities. This feature makes them a natural choice for pipeline stages. Similarly to loop unrolling, one can implement a few rounds and introduce pipeline registers between them. The natural choice for register placement is directly between the rounds; however, registers may be placed inside each round, provided the placement is consistent in every round.
The number of pipeline stages K is usually a divisor of the total number of cipher rounds. When area constraints permit, all rounds can be implemented, which eliminates the need for feedback, as in full loop unrolling – Figure 4.4-1.
Figure 4.4-1 Outer round pipelining. a) partial unrolling, b) full unrolling.
Outer round pipelining is straightforward to apply. Since in general all rounds are identical, it is sufficient to design only one stage and reuse it. This guarantees a very well balanced pipeline from the point of view of logic resources. Routing resources are more difficult to control. The designer may, however, prepare a macro with one round fully placed and routed. This macro can be instantiated several times in the final circuit, and with careful placement the routing delays should be very similar in all stages.
The pipelined circuit is capable of processing K blocks of data simultaneously. All blocks in the pipeline are processed independently, and no dependencies among them are allowed. This limitation sets constraints on processing data in feedback modes. The pipeline can be fully utilized in feedback modes only when K independent streams of data are available. This also means that the key schedule has to supply K different keys to the encryption unit. If these conditions are not met, pipelining of any kind gives no advantage over the basic iterative architecture for feedback modes.
The number of clock cycles necessary to process one block of data is the same as for the basic iterative architecture; however, since many blocks can be processed simultaneously, the average number of clock cycles per block is approximately K times smaller.
The minimal clock period should, in general, remain very similar to the minimal
clock period of a basic iterative architecture:
clock period_outer = Tmux + Tround + τ ≈ clock period_basic
Of course, the more pipeline stages there are, the more complex the overall circuit and the signal routing become. If the circuit is placed automatically, it is very likely that the minimum clock period will deteriorate compared to the basic iterative architecture.
As in the case of loop unrolling, there exists the potential for optimization of logic across rounds, as shown in Figure 4.3-2. For outer round pipelining, registers are usually placed exactly between the rounds, which inhibits optimization across rounds. However, it may still be possible to exploit this feature of a cipher. The designer has to recognize these situations and place the pipeline registers in optimal places, not necessarily between rounds. For the situation shown in Figure 4.3-2 it may be beneficial to place a register between operations f1 and f2, as shown in Figure 4.4-2.
Figure 4.4-2 Optimization of logic across rounds. a) one round, b) two
rounds.
The latency of the circuit with outer round pipelining remains approximately the same as for the basic iterative architecture:

latency_outer = clock period_outer × #clock cycles_outer

clock period_outer ≈ clock period_basic

#clock cycles_outer = #clock cycles_basic

latency_outer ≈ latency_basic
Both the minimum clock period and the number of clock cycles required to process one block of data remain unchanged. This feature permits applying outer round pipelining without drastic changes to the surrounding logic, which can be designed for the basic iterative architecture first and subsequently adapted for a pipelined implementation.
The throughput of the pipelined circuit is approximately K times higher than for the basic iterative architecture. This speedup comes from the fact that K blocks of data can be processed simultaneously.
throughput_outer = K × block size / (clock period_outer × #clock cycles_outer) ≈ K × throughput_basic
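The K-fold speedup, and its collapse when too few independent streams are available, can be sketched with a small model (parameter values are illustrative only):

```python
def outer_pipeline_performance(clock_period, n_cycles, block_size, K, n_streams):
    """Average performance of a K-stage outer-round pipeline.
    Times in ns, block_size in bits. In feedback modes only
    min(K, n_streams) independent streams can keep stages busy.
    Returns (latency [ns], average throughput [Mbit/s])."""
    blocks_in_flight = min(K, n_streams)
    latency = clock_period * n_cycles                 # unchanged vs. basic
    throughput = blocks_in_flight * block_size / latency * 1000
    return latency, throughput
```

With K = 4 and four independent streams the throughput is four times the basic rate; with a single CBC stream it falls back to the basic rate while the latency stays the same.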
Outer round pipelining gives a linear speedup proportional to the number of implemented rounds. This linear speedup comes at the cost of a linear area increase, because not only the round logic but also the key schedule logic has to be expanded, as in the loop unrolling architecture. Figure 4.4-3 shows the throughput to area ratio of the pipelined circuit with respect to the basic iterative architecture.
Figure 4.4-3 Throughput vs. area ratio in outer round pipelining.
The simplicity of outer round pipelining makes it the most commonly used
pipelined architecture for secret-key ciphers documented in the literature.
4.5 Inner round pipelining
Inner round pipelining is another, more advanced, method of pipelining a block cipher. The idea is to implement and pipeline only one round, and to use this circuit iteratively, as in the basic iterative architecture. Figure 4.5-1 shows the inner round pipelining architecture.
Figure 4.5-1 Inner round pipelining.
The question arises as to how to select the number and placement of pipeline stages within the cipher round. In most practical designs the ultimate goal is to meet certain performance requirements, such as throughput or latency. To meet a throughput requirement, it is sufficient to divide the round into combinational pieces such that the longest one does not exceed the required minimum clock period. The way to minimize the latency is to apply as few pipeline stages as possible.
When both throughput and latency have to be optimized, the best approach is to divide the round into equal stages. However, this may not be easy to realize, since rounds usually consist of different operations. It is tempting to look at each operation from the point of view of its logical structure and divide the entire round into stages consisting of an equal number of logic levels (CLB levels). This approach is easy to apply, since an experienced designer can anticipate ahead of time how each of the operations will fit into an FPGA. Usually it is enough to look at each of the operations separately, but some optimizations may be possible across operations, resulting in a smaller total number of logic levels. It is, therefore, advisable to look at the entire round. Unfortunately, this method has a serious flaw: it does not take into account delays through routing resources. Long nets and high fanout may contribute significant delays, resulting in an unexpected loss of performance. These kinds of problems are very difficult to recognize at the initial stage of design.
A more accurate design method requires analyzing the actual delays in an existing combinational circuit before applying pipelining. This means that the basic iterative architecture should be implemented and analyzed first. Analyzing the timing parameters of the implemented circuit is not an easy task in itself, as it requires good knowledge of the implementation tools and consumes a lot of time. Finally, these analyses are only approximate, because to some extent they are specific to a particular placement and routing realization. Automatic tools may solve placement and routing problems differently for pipelined circuits than for non-pipelined ones, and the routing delays may be significantly different. The most straightforward way to prevent this is to force some placement pattern. This may be done by constraining parts of the circuit to specific areas of the FPGA. Of course, this adds to the overall complexity of the design process and usually requires a lot of experience.
Let us assume that the number of pipeline stages is k. This number can be arbitrary, as it is not correlated with the number of cipher rounds in any way. The minimum clock period will be smaller than for the basic iterative architecture, but even for a perfectly balanced pipeline it will not be k times smaller – see chapter 3:

clock period_inner = Tmux + Tround/k + τ > clock period_basic / k
A pipelined circuit requires a higher clock frequency than the basic iterative architecture, and this means that the surrounding logic has to keep up with this requirement. Processing one block of data takes k times more clock cycles than in the basic iterative architecture:

#clock cycles_inner = k × #clock cycles_basic
Taking these facts into account, it is clear that the latency of the pipelined circuit will be worse than for the basic iterative architecture:

latency_inner = #clock cycles_inner × clock period_inner

latency_inner > latency_basic

Since k blocks of data can be processed simultaneously, the average number of clock cycles per block of data is the same as for the basic iterative architecture:

average #clock cycles per block_inner = #clock cycles_basic
From this observation we can conclude that for inner round pipelining the speedup comes from processing data at an increased clock frequency. Therefore, the throughput depends only on the minimum clock period, and not on the number of pipeline stages:

throughput_inner = block size / (#rounds × clock period_inner)

throughput_inner < k × throughput_basic
Although inner round pipelining is quite difficult to apply and increases latency, it gives significant area benefits in FPGA designs. Introducing pipeline registers into the existing combinational logic, in this case the cipher round, usually costs little area, as there exist “free” flip-flops in every CLB slice – see chapter 3. Figure 4.5-2 shows the throughput versus area ratio for inner round pipelining.
The benefits of inner round pipelining make this technique very attractive for high-speed implementations. Unfortunately, the difficulties in designing good pipelines discourage researchers from using them in their implementations.
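The latency/throughput trade-off follows directly from the formulas above. The sketch below assumes a perfectly balanced pipeline (Tround split evenly across k stages), which the text notes is optimistic; all parameter values are illustrative:

```python
def inner_pipeline_performance(t_mux, t_round, t_reg, n_rounds, block_size, k):
    """Compare the basic iterative architecture with k-stage inner round
    pipelining (non-feedback mode). Times in ns, block_size in bits.
    Returns (basic, inner) dicts with latency [ns] and throughput [Mbit/s]."""
    basic_period = t_mux + t_round + t_reg
    inner_period = t_mux + t_round / k + t_reg   # mux/register overhead paid per stage
    basic = {"latency": basic_period * n_rounds,
             "throughput": block_size / (basic_period * n_rounds) * 1000}
    inner = {"latency": inner_period * n_rounds * k,   # k times more cycles
             "throughput": block_size / (inner_period * n_rounds) * 1000}
    return basic, inner
```

The inner-pipelined circuit wins on throughput but loses on latency, and its speedup stays below k because the multiplexer and register overhead is paid in every stage.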
Figure 4.5-2 Throughput vs. area ratio for inner round pipelining.
4.6 Mixed inner- and outer-round pipelining
Inner- and outer-round pipelining each have their limits on maximum throughput. In
some applications throughput rates in the range of gigabits per second are required. For
those applications the inner- and outer-round pipelining techniques may be mixed together.
The basic idea is to implement inner-round pipelining and then unroll such a round as many
times as necessary, as shown in Figure 4.6-1.
In the extreme case all rounds can be unrolled, giving the maximum possible throughput.
From my experience the maximum throughput may range beyond 10 Gbps for a fully
unrolled circuit.
Figure 4.6-1 Mixed inner- and outer-round pipelining. a) partial unrolling,
b) full unrolling.
The mixed inner- and outer-round pipelining inherits all of the implementation
difficulties of both inner-round and outer-round pipelining. The main difficulty in
applying mixed inner- and outer-round pipelining lies in implementing the inner-round
pipeline and then efficiently placing all its instances such that the routing delays remain
the same in the entire circuit. This procedure usually requires careful manual placement.
We can expect that the minimum clock period will be approximately the same as
for inner-round pipelining. However, it may deteriorate since rounds are unrolled and
routing constraints are harder to meet.
clock period_mixed ≈ clock period_inner > clock period_basic / k
The latency may also be expected to be at the level of the inner-round pipelined circuit.
innermixed latencylatency �
The maximum throughput becomes the highest among all architectures since it is
K times higher than for inner-round pipelined circuit. We have achieved throughputs in
the range of 10 Gbps for mixed inner- and outer-round pipelined architecture. However,
this high throughput is achievable only when K·k independent data streams are available.
throughput_mixed = K × block size / (# rounds × clock period_mixed)
throughput_mixed < K × k × throughput_basic
The mixed inner- and outer-round pipelining gives a significant gain in throughput,
but it also inherits all the drawbacks of inner-round and outer-round
pipelining. It affects latency in the same way as inner-round pipelining does, and is
associated with a high cost in area, just as in the case of outer-round pipelining. The
throughput versus area and latency versus area trade-offs are shown in Figure 4.6-2 and Figure
4.6-3, respectively. On top of that, the proper design of a mixed architecture is the most
challenging task.
Figure 4.6-2 Throughput vs. area ratio for mixed pipelining.
Figure 4.6-3 Latency vs. area ratio for mixed pipelining.
5. Methodology of comparison of AES candidates
5.1 Limits of this research
The scope of possible applications for AES is large. The different modes of
operation that can be used with the underlying cipher, different block and key sizes, as
well as different application constraints make it impossible to perform an exhaustive
comparison within a limited time frame. Therefore, I restricted my research with the
following five assumptions:
1. Only 128-bit keys have been considered. Performance for other key sizes can be
easily derived.
Each AES candidate was required to operate with three different key sizes:
128, 192 and 256 bits. For most of the ciphers the key size influences only the key
schedule algorithm, and does not make any difference in the
encryption/decryption transformation. However, in the case of Rijndael the number of
cipher rounds depends on the key size and block size. This dependence is very simple,
and the performance of Rijndael can be easily derived for other key sizes.
2. No comparison of key schedules has been made.
Comparing key schedules can be a more challenging task than comparing
encryption algorithms, because of their strong dependence on key sizes. In many
applications all three key sizes are required, and the key schedule unit would have
to support all of them. I have chosen not to implement any of the key schedules
for comparison purposes, as it would require significant effort and more time.
Instead, my implementations include a memory of internal keys loaded with the
keys generated externally, and the circuitry necessary to distribute these keys
from the memory to the encryption/decryption unit.
3. Only 128-bit blocks have been supported.
AES requirements are limited to 128-bit blocks only. Therefore, I have
considered only this block size, even if the given AES candidate supports other
block sizes.
4. Encryption and decryption implemented in one circuit if possible.
Most secret-key cipher applications require both encryption and decryption
services. Therefore, I think that a proper comparison of ciphers should reflect those
needs, and I have implemented both transformations together. The MARS, RC6, and
Twofish algorithms perform encryption and decryption in a very similar way, and
permit resource sharing between encryption and decryption. I have chosen to
exploit this feature whenever it was possible. Resource sharing makes the
overall circuit smaller, but impairs throughput, because additional switching
circuitry is required. The other ciphers, Serpent and Rijndael, do not share many
similarities between the encryption and decryption transformations, and required
the implementation of two separate units. Moreover, Rijndael’s circuits designated
for encryption and decryption have different complexities. The decryption unit
has a longer critical path and slows down the entire implementation. In light of
this fact, I feel that it is unfair to analyze only the encryption unit, as many other
research groups did.
Figure 5.1-1 Block diagram common for all implementations.
5. Throughput/area ratio was the main optimization criterion.
Throughput is probably the most popular parameter of a hardware
implementation, and even if it does not carry full information about the
circuit, it is often associated with its “power.” However, I did not try to achieve
the highest possible throughput at all cost. I tried to trade speed and area in an
intelligent way, so that my implementations could reflect the costs associated with
circuit sizes too. Therefore, I tried to maximize throughput together with the
throughput/area ratio.
The general block diagram for all my architectures is shown in Figure 5.1-1.
5.2 Choice of architectures
In my opinion, a fair methodology for comparing hardware performance of the
AES candidates should not favor any group of ciphers or a specific internal structure of a
cipher. Different ciphers have different architectures. In particular, some ciphers employ
a large number of small rounds, while others employ a small number of bigger and more
sophisticated rounds. However, both may achieve similar throughputs and have similar
throughput/area ratios. My main goal was to compare ciphers in both feedback and non-
feedback modes when only one stream of data is available.
5.2.1 Comparison in feedback modes
Pipelined architectures are highly underutilized in feedback modes when only one
stream of data is available. Only basic iterative and unrolled architectures come into play.
The unrolled architecture offers slightly higher throughput than basic iterative
architecture, but is not easy to properly design, and has an unattractive throughput/area
ratio. The basic architecture is much more practical from the point of view of real
implementations, and has several important features:
- Relatively easy to implement in a similar way for all AES candidates, which
supports fair comparison,
- Presents a good starting point for all other architectures. Many parameters of
other architectures can be estimated from the basic architecture, and
- Assures the maximum throughput/area ratio for feedback operating modes
(CBC, CFB), now commonly used for bulk data encryption.
5.2.2 Comparison in non-feedback modes
Non-feedback modes permit encrypting more than one block of data belonging to
the same stream simultaneously. Only pipelined architectures can take full advantage of
this feature. Outer-round pipelining is the easiest to apply, but is not well suited for
a general comparison because it enforces one pipeline stage per round. This way
ciphers with smaller rounds automatically achieve better performance. A fair comparison
should exploit all potentials of a cipher. I think that the best approach is to show
performances and sizes for ciphers implemented in mixed architectures. The ideal
situation would be to implement inner-round pipeline with respect to some optimization
criteria first, and next completely unroll the cipher. Inserting too many pipeline stages
into the cipher round does not make sense, as it gives little gain in throughput, if any,
and increases area and latency. I believe that the optimum number of in-round pipeline
stages kopt should be found for each cipher. This optimum should give the best
throughput/area ratio, as shown in Figure 5.2-1. Unfortunately, this method has a very
serious drawback. Finding the optimum number of pipeline stages is not easy, and in
practice we can only estimate what this number could be and where all the registers could
be placed. Due to these constraints, this task became infeasible, and I had to choose
another, sub-optimal strategy. The simplest idea is to introduce enough pipeline registers to run
the circuit at as high a clock frequency as possible. This essentially shows what
throughput/area ratio one can get at a similar clock frequency for each cipher. It turned
out to give quite impressive results in terms of throughput.
Figure 5.2-1 Throughput/area ratio for mixed architecture.
5.3 Tools, design process and synthesis parameters
All my hardware designs have been encoded in VHDL’87. I made a significant
effort to describe all components structurally so that I could indirectly guide the synthesis
tools as to how each component should be synthesized. I could have used specific
primitives from the Xilinx library to enforce certain implementation strategies; however,
it would have made my code device-specific. I have chosen to use those libraries only
when it was necessary, for example to enforce the use of lookup tables (LUTs) in RAM
mode. Other than that, I completely relied on the synthesis tools.
I used Active-HDL 3.6 as a design entry tool. This tool greatly supported the
development of my code, permitting very accurate behavioral, post-synthesis and timing
simulations under the control of test benches. Once the entire code was verified I used
Xilinx Foundation Series 2.1i for design synthesis and implementation. The design flow
is shown in Figure 5.3-1.
Figure 5.3-1 Design flow for each implementation.
As a target device I have chosen the Xilinx Virtex XCV1000BG560-6 FPGA. This
device is fabricated in a 0.22 μm CMOS process, and contains around one million
equivalent logic gates.
I have not set any constraints other than the target clock frequency. In the case of
basic architectures the target clock frequency was 50 MHz, and in the case of pipelined
architectures 150 MHz. Foundation Series returns a detailed layout of a circuit,
information about resources utilized, and a netlist with all timing information which can
be further simulated in Active-HDL. This final simulation, as well as the output
from a static timing analyzer, gave me the final performance of each of the circuits.
6. Implementation of AES candidates
6.1 MARS
MARS was submitted to the AES contest by a large team from IBM [4]. Some of
the members of this team participated in the design of DES over twenty years ago. The
designers have put a lot of effort into making MARS as secure as possible. As they claim,
they have added many stop-fault mechanisms which make MARS resistant to known
and anticipated attacks. Indeed, MARS has one of the largest security margins among all
candidates – see Table 2.1-2. All this security comes, however, with a high price in
performance, both in software and hardware.
6.1.1 Structure and components of MARS
MARS consists of 32 rounds divided into four major groups:
- forward mixing,
- keyed forward transformation,
- keyed backwards transformation, and
- backwards mixing.
Figure 6.1-1 shows the general structure of MARS.
Figure 6.1-1 High-level structure of MARS.
Only the keyed transformations make use of keys, and together they are called
the “cryptographic core”. The mixing transformations serve a more auxiliary purpose.
The forward and backwards mixing transformations for encryption and decryption are
alike, but not the same, and with some effort they can be implemented in one circuit. I
have chosen to implement them all in one unit. This appeared not to be an easy task. The
reader may refer to the original MARS documentation [4] for a description of the mixing
transformations. The structure I came up with is shown in Figure 6.1-2 and Figure 6.1-3.
Figure 6.1-2 Mixing transformation.
Figure 6.1-3 Mixing transformation core.
Even a brief inspection of Figure 6.1-2 and Figure 6.1-3 reveals a large number of
multiplexers, which are used to route data in different directions depending on which
mixing transformation is going to be performed. Obviously, these multiplexers have a
negative influence on circuit performance.
The mixing transformations employ four 8x32 S-boxes of two kinds, S0 and S1,
simple additions and subtractions modulo 2^32, XORs, and fixed rotations. The most
interesting from my point of view are the S-boxes. They accept 8-bit inputs, and are therefore too
big for the 4-bit LUTs present in the Xilinx FPGA. All four S-boxes can be implemented on
three Block SelectRAMs. However, those types of memories are unavailable in the less
expensive families of FPGAs, therefore I have decided to describe them as big lookup
tables and leave their decomposition to the synthesizer. This resulted in a large circuit size and
long propagation delays. Perhaps better results could be obtained with an attempt to
decompose the S-boxes using advanced tools designed for this task.
The “cryptographic core” consists of two types of rounds: forward and backwards
keyed transformations, which, similarly to the mixing transformations, are not identical for
encryption and decryption. The circuit that realizes all “cryptographic core”
transformations is shown in Figure 6.1-4 and Figure 6.1-5.
Figure 6.1-4 Keyed transformation.
Figure 6.1-5 Keyed transformation core.
One can again notice a large number of multiplexers in Figure 6.1-4, which
contribute additional propagation delays. The keyed transformation consists of the
E-function, a couple of adders, subtractors, XORs, and fixed rotations. The details of the
E-function are presented in Figure 6.1-6.
Figure 6.1-6 E-function. Red line indicates critical path.
The E-function is the largest transformation in the entire cipher. Among other
operations, it employs a 9x32 S-box, variable rotations, and multiplication modulo 2^32.
Multiplication and variable rotation appear in the critical path, and I paid extra attention
to their implementation.
The 9x32 S-box is simply a concatenation of S-boxes S0 and S1, where the most
significant bit selects S-box S0 or S1 for transforming eight least significant bits. Since
the mixing transformation and the keyed transformation are never executed
simultaneously in the basic architecture, I decided to share two S-boxes between both
transformations, as shown in Figure 6.1-3.
The variable rotation has been implemented on CLBs in the way presented in
Figure 6.1-7.
Figure 6.1-7 Variable rotation.
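The rotator of Figure 6.1-7 can be modeled in software as a chain of five multiplexer-controlled fixed rotations, one per bit of the 5-bit rotation amount. This is a behavioral sketch, not the VHDL used in the actual design:

```python
# Behavioral model of the 5-stage barrel rotator from Figure 6.1-7:
# a 32-bit variable left rotation built from conditional fixed
# rotations by 16, 8, 4, 2 and 1 positions.

MASK = 0xFFFFFFFF

def rotl_fixed(x, n):
    """Fixed left rotation of a 32-bit word by n positions (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK

def rotl_variable(x, rot):
    """Variable rotation as a chain of 2-to-1 multiplexer stages;
    stage i either passes the word through or rotates it by 2^i."""
    for bit, amount in zip(range(4, -1, -1), (16, 8, 4, 2, 1)):
        if (rot >> bit) & 1:
            x = rotl_fixed(x, amount)
    return x

# Any amount 0..31 is the sum of the selected fixed rotations.
assert rotl_variable(0x80000001, 1) == 0x00000003
```

The hardware analogue is that each stage is a layer of 2-to-1 multiplexers selected by one rotation-amount bit, so the depth is fixed at five stages regardless of the rotation amount.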
Implementing multiplication appeared to be a more challenging task, and I have
taken advantage of the structure of CLBs for optimizations.
6.1.2 Implementation of multiplication modulo 2^32
The multiplication modulo 2^32 can be implemented efficiently on Xilinx Virtex
devices since the CLB Slices contain special logic supporting arithmetic operations. A
simplified structure of one Slice is shown in Figure 6.1-8. It consists of two LUTs,
associated control and carry logic, and two D flip-flops. The carry logic plays an important
role in the implementation of multiplication. This dedicated logic is designed to speed up
arithmetic operations using ripple adders. According to the Virtex documentation, the
maximum propagation time from CIN to COUT is only 0.1 ns, while the propagation time
from the inputs through the LUT to the output is 0.6 ns. Clearly, the use of carry logic in
the design of the multiplier is a reasonable choice.
Figure 6.1-8 Virtex Slice with carry logic.
When performing multiplication, I want to sum partial products which are pre-computed
in the AND matrix. Figure 6.1-9 shows the typical situation, where a full adder follows
two AND gates.
Figure 6.1-9 Example of multiplication scheme. Two AND gates feed full adder.
Fortunately, the Virtex Slice supports computing the ANDs together with the FA in a
single LUT. This is a unique feature among currently available FPGAs. Figure 6.1-10
shows details of the implementation. The LUT computes the propagate function:
p = (x0 and a31) xor (x1 and a30)
The additional AND gate computes the generate signal from x1 and a30:
g = x1 and a30
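These two signals can be checked in a small software model. The carry-multiplexer behavior below reflects my reading of the Virtex fast carry chain (the carry output selects the incoming carry when the propagate signal is set, and the generate signal otherwise):

```python
# Behavioral check of the Slice mapping from Figure 6.1-10: the LUT
# computes the propagate signal p, a dedicated AND gate computes the
# generate signal g, and the carry mux plus XOR reproduce a full adder
# over the two partial products x0*a31 and x1*a30.

def slice_fa(x0, a31, x1, a30, cin):
    p = (x0 & a31) ^ (x1 & a30)   # propagate, from the 4-input LUT
    g = x1 & a30                  # generate, from the extra AND gate
    s = p ^ cin                   # sum output
    cout = cin if p else g        # fast carry multiplexer
    return s, cout

# Exhaustive check against an ordinary full adder on the AND products.
for bits in range(32):
    x0, a31, x1, a30, cin = [(bits >> i) & 1 for i in range(5)]
    total = (x0 & a31) + (x1 & a30) + cin
    assert slice_fa(x0, a31, x1, a30, cin) == (total & 1, total >> 1)
```

The interesting detail is that g needs only one of the two products: when p is 0 the two products are equal, so either of them is a valid carry-out.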
Figure 6.1-10 Multiplication – implementation of the circuit from Figure
6.1-9 in a Virtex Slice.
Let us focus now on the full multiplier, which is created using the principles of array
multipliers. I will use an 8x8-bit example, but all conclusions can be easily extended to
the 32x32-bit version. The logic for a multiplier modulo 2^8 is sketched in Figure 6.1-11.
There exist many paths which are equally critical. I have highlighted only one of them in
red. The same result can be obtained by reordering the summed terms:
x·2^7·a7 + x·2^6·a6 + ... + x·a0   instead of   x·a0 + x·2^1·a1 + ... + x·2^7·a7
The resulting structure has a much shorter, and only one, critical path, as shown in
Figure 6.1-12. I do not consider the horizontal nets as critical because they are
implemented in fast carry logic.
Figure 6.1-11 Array multiplier modulo 2^8.
Figure 6.1-12 Structure of an array multiplier with reversed order of
additions.
The main concern is then to minimize the vertical path. The multiplication from
Figure 6.1-12 can be symbolically represented as consecutive additions performed one at
a time, as shown in Figure 6.1-13a. These additions can be organized into a tree
(Figure 6.1-13b). This trick significantly reduces the number of logic levels from 7 to 3
(31 to 5 in the case of the 32-bit multiplier). In general, the use of the tree allows
realizing the multiplier on log2(n) logic levels, where n is the number of multiplied bits.
Therefore a 64-bit multiplier should contain 6 levels of logic, and so on. The final
multiplier architecture resulting from applying the tree structure is shown in Figure 6.1-14.
This structure was used in the basic iterative architecture.
a) array additions b) tree additions
Figure 6.1-13 Change from array to tree.
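The array-to-tree rearrangement can be illustrated with a small software model, here an 8-bit multiplier modulo 2^8; the width is assumed to be a power of two so the pairwise reduction comes out even:

```python
# Model of the rearrangement from Figure 6.1-13: the n partial
# products of an n x n multiplier modulo 2^n are summed pairwise,
# so the number of adder levels drops from n-1 to log2(n).

from math import log2

def tree_multiply(x, y, n=8):
    """Multiply modulo 2^n via a tree of adders; returns (product,
    number of adder levels). n is assumed to be a power of two."""
    mask = (1 << n) - 1
    # Partial products x * y_i * 2^i, already reduced modulo 2^n.
    terms = [(x << i) & mask if (y >> i) & 1 else 0 for i in range(n)]
    levels = 0
    while len(terms) > 1:                       # one adder level per pass
        terms = [(a + b) & mask
                 for a, b in zip(terms[0::2], terms[1::2])]
        levels += 1
    return terms[0], levels

product, levels = tree_multiply(0xB7, 0x5D)
assert product == (0xB7 * 0x5D) & 0xFF
assert levels == int(log2(8))   # 3 levels for 8 bits, 5 for 32 bits
```

Because addition modulo 2^n is associative, the tree produces the same residue as the linear chain; only the depth changes.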
Figure 6.1-14 Final multiplication schematic.
6.1.3 Results of the implementation of MARS
Throughput and area in the basic iterative architecture
The implementation of MARS in the basic iterative architecture has taken 2,744
CLB Slices. The static timing analyzer indicated the maximum clock frequency at the
level of 15.3 MHz, which gives a throughput of 61 Mbps. This is not much, since the
designers have reported a throughput of 85 Mbps on a 200 MHz PowerPC. A better FPGA
implementation of MARS was presented in [12], where the designers made extensive use
of optimized libraries and achieved a throughput at a level of 102 Mbps.
I believe that better performance could be obtained if the S-boxes were
implemented on Block SelectRAMs. Additionally, I have noticed that the keyed
transformations take approximately twice as much time as the mixing transformations.
Therefore, it may be a better solution to perform only half of the keyed transformation
within one clock cycle and increase the clock frequency.
I have not attempted any implementation of MARS in a pipelined architecture.
I considered it a very time-consuming task. Moreover, MARS had already been criticized
for its complexity, and it was unlikely that it would be selected for the AES.
6.2 RC6
6.2.1 Structure and components of RC6
The RC6 cipher was submitted to the AES contest by Ronald Rivest from MIT
together with his partners from RSA Labs [28]. RC6 is an extension of RC5, an older
cipher designed by Rivest, and also belongs to the class of Feistel-network ciphers. This
feature permits implementing encryption and decryption within the same circuit.
The algorithm consists of 20 identical and simple rounds. Figure 6.2-1 shows the
structure of a circuit implementing one round.
The main operations employed in the algorithm are variable rotations and
multiplications modulo 2^32. The rotations are implemented in the same way as in the case
of MARS – see Figure 6.1-7. One could also use the same multiplier, but fortunately RC6
can be tweaked a little bit, and the multiplier can be significantly reduced.
The F-function present in Figure 6.2-1 performs the following operation:
output <= (input * (2 * input + 1)) <<< 5
where <<< denotes rotation to the left, in this case by 5 bit positions.
Figure 6.2-1 Implementation of one round of RC6.
It turns out that it can easily be replaced by the following equation:
output <= (2 * input^2 + input) <<< 5
The trick is to get rid of the multiplication by changing it into squaring.
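The equivalence of the two forms is purely algebraic and can be checked in a few lines of software; the 32-bit width and the rotation by 5 follow the RC6 description above:

```python
# Check of the RC6 tweak described above: rewriting
# f(x) = x * (2x + 1) mod 2^32 as 2 * x^2 + x mod 2^32 lets a
# dedicated squarer replace the general multiplier.

MASK = 0xFFFFFFFF

def rotl5(x):
    """32-bit left rotation by 5, as used by the F-function."""
    return ((x << 5) | (x >> 27)) & MASK

def f_original(x):
    return rotl5((x * (2 * x + 1)) & MASK)

def f_squared(x):
    return rotl5((2 * x * x + x) & MASK)

for x in (0, 1, 0x12345678, 0xFFFFFFFF):
    assert f_original(x) == f_squared(x)   # algebraically identical
```

Since x·(2x + 1) = 2x² + x holds over the integers, it also holds modulo 2^32, so the rewritten form is exact, not an approximation.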
6.2.2 Implementation of squaring modulo 2^32
Any multiplier can perform squaring if both its inputs are connected together.
However, a special-purpose squarer is much smaller and faster. An array multiplier can
be reduced to a squarer in a way shown in Figure 6.2-2.
Figure 6.2-2 Squarer derived from array multiplier.
It is easy to notice that the same products exist in the same columns, and many of them
can be reduced as shown in Figure 6.2-2b. The resulting squarer, shown
in Figure 6.2-3, occupies around 50% of the area of the corresponding multiplier, with
half its height.
Figure 6.2-3 Squarer modulo 2^8.
I have further reduced the height of the squarer by ordering the adders in a tree,
similarly to the multiplier in MARS. The resulting final circuit is presented in Figure
6.2-4. Although the area of the squarer is much smaller than that of the multiplier, the
number of logic levels involved in a 32-bit squaring is four, which is only slightly less
than for the multiplication.
Figure 6.2-4 Optimized squarer modulo 2^8.
6.2.3 Results of the implementation of RC6
Throughput and area in basic architecture
The implementation of RC6 in the basic iterative architecture has taken 1,137
CLB Slices, which is less than half of the size of MARS. The maximum clock frequency
indicated by the static timing analyzer was 22.3 MHz, which translates to a throughput of
142.7 Mbps. This result is far better than for MARS, but does not satisfy the most demanding
needs for fast encryption. This relatively poor performance comes from the fact that RC6
has a long critical path, which goes through a squarer and a variable rotator, as indicated
in Figure 6.2-1.
Throughput and area in mixed architecture
For the implementation of RC6 with mixed inner- and outer-round
pipelining, I have decided not to use a tree structure for the squarer. A pure array squarer
has a very regular layout, and this makes it easier for the placing tool to minimize the
delays of the interconnections. In the case of deeply pipelined designs, interconnections
may contribute delays similar to or even larger than the logic. I have pipelined RC6 very deeply,
as I introduced 28 pipeline stages within one round. This gives a total of 560 stages in the
mixed architecture, but allows inputting data blocks every clock cycle. The resultant
circuit takes approximately 47,000 CLB Slices, and requires 4 Virtex1000 devices.
According to the static analyzer it can be run with 108.1 MHz clock. This gives a
throughput of 13.1 Gbps.
Introducing so many pipeline stages was completely unnecessary, since the gain
in frequency was only a factor of five. Most likely I could have achieved a similar
throughput with fewer than ten pipeline stages. Additionally, I could have reduced the
circuit size by around 25%.
6.3 Rijndael
6.3.1 Structure and components of Rijndael
Rijndael is an SP-network cipher. This means that the main transformations
employed in this cipher are substitutions and permutations applied to all bits of the data
block in every round. Rijndael was submitted to the AES contest by V. Rijmen and
J. Daemen from Belgium [11]. Rijndael is unique in many ways. The number of cipher
rounds depends on the size of the key and the size of the data block, and is equal to 10 for
a 128-bit key and 128-bit block. Despite such a small number of rounds, Rijndael offers
quite a good security margin, although some researchers express worries about this small
number of rounds. All the main operations employed in Rijndael are based on arithmetic
in Galois fields, which makes it highly efficient in both software and hardware
implementations. Since Rijndael is an SP-network cipher, decryption requires inverse
transformations, which in general are not easy to implement in the same circuit as
encryption. Figure 6.3-1 shows one round of encryption and decryption.
There are only three main operations: MixColumn, ShiftRow, and ByteSub, plus
their inverse versions in the case of decryption. These operations have very nice properties
which permit reordering them, giving the implementer many degrees of freedom.
ShiftRow and InvShiftRow change only the order of bytes within a 128-bit block,
and do not require any logic resources.
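As a software illustration that ShiftRow is pure wiring, it can be modeled as a byte permutation. The column-major state layout and per-row offsets below follow the AES specification, not the VHDL of this thesis:

```python
# ShiftRow as a pure byte permutation: in hardware this is just
# routing, with no logic resources. The 128-bit state is a 4x4 byte
# matrix stored column-major (state[4*c + r] is row r of column c),
# and row r is rotated left by r positions.

def shift_row(state):
    out = [0] * 16
    for r in range(4):
        for c in range(4):
            out[4 * c + r] = state[4 * ((c + r) % 4) + r]
    return out

def inv_shift_row(state):
    """InvShiftRow: the inverse wiring pattern."""
    out = [0] * 16
    for r in range(4):
        for c in range(4):
            out[4 * ((c + r) % 4) + r] = state[4 * c + r]
    return out

s = list(range(16))
assert inv_shift_row(shift_row(s)) == s   # a permutation and its inverse
```

Since both directions are fixed permutations, encryption and decryption each get their routing for free; only the wire patterns differ.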
Figure 6.3-1 One round of Rijndael. a) encryption, b) decryption.
ByteSub and InvByteSub can be viewed as ordinary 8x8 S-boxes. Rijndael uses 16
S-boxes of each kind in a single round. The best implementation approach would be to
implement those S-boxes in Block SelectRAMs, but I decided not to use these
components in any of the ciphers. Implementing 32 8x8 S-boxes on LUTs takes a large
amount of area; however, the construction of those S-boxes gives an opportunity to save
some space through resource sharing. ByteSub and InvByteSub can be decomposed into
two operations: a simple affine transformation and an inversion in the Galois field, as
shown in Figure 6.3-2.
Figure 6.3-2 Construction of a) ByteSub, b) InvByteSub transformations.
The affine transformation and its inverse are simple to implement on LUTs. The
inversion in the Galois field is its own inverse, and stays the same in both encryption and
decryption. This feature permits sharing the inversion between both transformations.
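This sharing can be sketched in software. The following is a minimal model, assuming the standard Rijndael field polynomial x^8 + x^4 + x^3 + x + 1 and affine constant '63' (hex); the function names are illustrative:

```python
def gf_mul(a, b):
    """Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B  # x^4 + x^3 + x + 1, the reduction term
        b >>= 1
    return p

def gf_inverse(a):
    """Inversion in GF(2^8): a^254 = a^(-1); 0 maps to 0 by convention."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def affine(b):
    """The Rijndael affine transformation over GF(2)."""
    c, r = 0x63, 0
    for i in range(8):
        bit = ((b >> i) ^ (b >> ((i + 4) % 8)) ^ (b >> ((i + 5) % 8))
               ^ (b >> ((i + 6) % 8)) ^ (b >> ((i + 7) % 8))) & 1
        r |= (bit ^ ((c >> i) & 1)) << i
    return r

def byte_sub(b):
    # ByteSub = affine transformation applied to the field inverse
    return affine(gf_inverse(b))
```

Only the inversion stage needs to be duplicated between ByteSub and InvByteSub in hardware; InvByteSub applies the inverse affine transformation first and then feeds the shared inverter.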
MixColumn and InvMixColumn do not share the same operations. The MixColumn
transformation can be expressed as a matrix multiplication in the Galois field GF(2^8):
[B0]   [02 03 01 01]   [A0]
[B1] = [01 02 03 01] · [A1]
[B2]   [01 01 02 03]   [A2]
[B3]   [03 01 01 02]   [A3]
Each symbol in this equation (such as Ai, Bi, '03') represents an 8-bit element of
the Galois field. Each of these elements can be treated as a polynomial of degree less
than 8, with coefficients in {0,1} determined by the respective bits of the GF(2^8) element.
For example, '03' is equivalent to '0000 0011' in binary, and to
c(x) = 0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 0·x^2 + 1·x + 1·1 = x + 1
in the polynomial representation.
The multiplication of elements of GF(2^8) is accomplished by multiplying the
corresponding polynomials modulo a fixed irreducible polynomial
m(x) = x^8 + x^4 + x^3 + x + 1
For example, multiplying a variable element A = a7 a6 a5 a4 a3 a2 a1 a0 by the constant
element '03' is equivalent to computing
B(x) = b7 x^7 + b6 x^6 + b5 x^5 + b4 x^4 + b3 x^3 + b2 x^2 + b1 x + b0 =
= (a7 x^7 + a6 x^6 + a5 x^5 + a4 x^4 + a3 x^3 + a2 x^2 + a1 x + a0) · (x + 1) mod (x^8 + x^4 + x^3 + x + 1)
After several simple transformations
B(x) = (a7 + a6) x^7 + (a6 + a5) x^6 + (a5 + a4) x^5 + (a4 + a3 + a7) x^4 + (a3 + a2 + a7) x^3 +
+ (a2 + a1) x^2 + (a1 + a0 + a7) x + (a0 + a7)
where '+' represents addition modulo 2, i.e., an XOR operation.
As a result, each bit of the product B can be represented as an XOR function of at most
three variable input bits, e.g., b7 = a7 + a6, b4 = a4 + a3 + a7, etc.
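This derivation can be checked mechanically against a generic GF(2^8) multiplier; a sketch (the function names are illustrative):

```python
def gf_mul(a, b):
    # multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return p

def mul_03_xor(a):
    # direct XOR form derived above: each output bit uses at most 3 input bits
    bit = lambda i: (a >> i) & 1
    b = [bit(0) ^ bit(7),            # b0 = a0 + a7
         bit(1) ^ bit(0) ^ bit(7),   # b1 = a1 + a0 + a7
         bit(2) ^ bit(1),            # b2 = a2 + a1
         bit(3) ^ bit(2) ^ bit(7),   # b3 = a3 + a2 + a7
         bit(4) ^ bit(3) ^ bit(7),   # b4 = a4 + a3 + a7
         bit(5) ^ bit(4),            # b5 = a5 + a4
         bit(6) ^ bit(5),            # b6 = a6 + a5
         bit(7) ^ bit(6)]            # b7 = a7 + a6
    return sum(v << i for i, v in enumerate(b))

# the polynomial and the XOR formulations agree for every byte value
assert all(gf_mul(a, 0x03) == mul_03_xor(a) for a in range(256))
```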
Each byte of the result of a matrix multiplication is an XOR of four bytes
representing the Galois Field product of a byte A0, A1, A2, or A3 by a respective constant.
The entire MixColumn transformation can be performed using two layers of XOR gates,
with up to 3-input gates in the first layer, and 4-input gates in the second layer. In Virtex
FPGAs, each of these XOR operations requires only one lookup table (i.e., a half of a
CLB Slice).
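Putting the pieces together, a software model of the full MixColumn matrix multiplication can look as follows (a sketch; `MIX` and `mix_column` are illustrative names, and the byte products use the same GF(2^8) arithmetic as above):

```python
def gf_mul(a, b):
    # multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return p

MIX = [[0x02, 0x03, 0x01, 0x01],
       [0x01, 0x02, 0x03, 0x01],
       [0x01, 0x01, 0x02, 0x03],
       [0x03, 0x01, 0x01, 0x02]]

def mix_column(col):
    # each output byte is an XOR of four GF(2^8) byte products
    return [gf_mul(row[0], col[0]) ^ gf_mul(row[1], col[1])
            ^ gf_mul(row[2], col[2]) ^ gf_mul(row[3], col[3])
            for row in MIX]
```

In hardware, each of the four products per output byte is just the two- or three-input XOR network derived earlier, so the whole column reduces to the two XOR layers described above.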
The InvMixColumn transformation can be expressed as the following matrix
multiplication in GF(2^8):
[A0]   [0E 0B 0D 09]   [B0]
[A1] = [09 0E 0B 0D] · [B1]
[A2]   [0D 09 0E 0B]   [B2]
[A3]   [0B 0D 09 0E]   [B3]
The primary difference compared to MixColumn is the larger values of the matrix
coefficients. Multiplication by these constant elements of the Galois field leads to a more
complex dependence between the bits of a variable input and the bits of the respective
product. For example, the multiplication A = '0E' · B leads to the following dependence
between the bits of A and B:
a7 = b6 + b5 + b4
a6 = b5 + b4 + b3 + b7
a5 = b4 + b3 + b2 + b6
a4 = b3 + b2 + b1 + b5
a3 = b2 + b1 + b0 + b6 + b5
a2 = b1 + b0 + b6
a1 = b0 + b5
a0 = b7 + b6 + b5
The entire InvMixColumn transformation can be performed using two layers of
XOR gates, with up to 6-input gates in the first layer, and 4-input gates in the second
layer. In Virtex FPGAs, an implementation of a 6-input XOR operation requires two
layers of CLB Slices. As a result, the InvMixColumn transformation has a significantly
longer critical path than the MixColumn transformation, and the entire decryption is
more time-consuming than encryption.
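These bit dependences can again be verified against a generic GF(2^8) multiplier; a sketch (the function names are illustrative):

```python
def gf_mul(a, b):
    # multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return p

def mul_0e_xor(b):
    # XOR form of multiplication by '0E' listed above (up to 5-input XORs)
    bit = lambda i: (b >> i) & 1
    a = [bit(7) ^ bit(6) ^ bit(5),                    # a0
         bit(0) ^ bit(5),                             # a1
         bit(1) ^ bit(0) ^ bit(6),                    # a2
         bit(2) ^ bit(1) ^ bit(0) ^ bit(6) ^ bit(5),  # a3
         bit(3) ^ bit(2) ^ bit(1) ^ bit(5),           # a4
         bit(4) ^ bit(3) ^ bit(2) ^ bit(6),           # a5
         bit(5) ^ bit(4) ^ bit(3) ^ bit(7),           # a6
         bit(6) ^ bit(5) ^ bit(4)]                    # a7
    return sum(v << i for i, v in enumerate(a))

# the listed equations agree with the field multiplication for every byte
assert all(gf_mul(b, 0x0E) == mul_0e_xor(b) for b in range(256))
```

The wider XOR terms (five inputs for a3, compared with at most three inputs for MixColumn) are exactly what forces the extra layer of CLB Slices in the decryption datapath.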
Taking into account all properties of the component operations, I have
implemented Rijndael in the structure shown in Figure 6.3-3.
Figure 6.3-3 Structure of the implementation of a single round of Rijndael.
6.3.2 Results of the implementation of Rijndael
Throughput and area in basic architecture
The implementation of Rijndael in the basic iterative architecture took 2,507
CLB Slices, very close to the size of MARS. However, the maximum clock frequency
indicated by the static timing analyzer was 32.3 MHz. Together with the small number of
rounds, this puts Rijndael in a high position in the AES ranking, with a throughput of
413.4 Mbps. This result is much better than for MARS and RC6.
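Throughput figures like these follow directly from the clock frequency and the number of clock cycles spent per 128-bit block; a small sketch of the arithmetic (`throughput_mbps` is an illustrative helper name):

```python
def throughput_mbps(block_bits, clock_mhz, cycles_per_block):
    # one block leaves the circuit every cycles_per_block clock cycles
    return block_bits * clock_mhz / cycles_per_block

# Rijndael, basic iterative architecture: one round per cycle, 10 rounds
basic = throughput_mbps(128, 32.3, 10)      # 413.44 Mbps

# fully pipelined architecture: one block per clock cycle
pipelined = throughput_mbps(128, 95.0, 1)   # 12160 Mbps, i.e. about 12.1 Gbps
```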
Throughput and area in mixed architecture
Implementing Rijndael in a pipelined architecture was more challenging than it may
at first seem. The main difficulty lies in pipelining the S-boxes, whose decomposition one
would normally leave to the synthesis tool. Unfortunately, our synthesizer does not
insert pipeline stages automatically. I could have bought a special core for distributed
memory, which allows such optimization, or done the decomposition manually, but
neither solution seemed to guarantee good performance. For this reason I decided to use
Block SelectRAMs to implement the S-boxes.
I introduced 7 pipeline stages into a single round, which gives a total of 70
stages for the full cipher. The amount of area required by the implementation was in the
range of 12,600 CLB Slices plus 80 Block SelectRAMs. I could run the circuit with a 95
MHz clock, which gives a throughput of 12.1 Gbps.
6.4 Serpent
6.4.1 Structure and components of Serpent
Serpent is a block cipher developed in international cooperation by R.
Anderson, E. Biham, and L. Knudsen [1]. All the submitters are very well known
cryptanalysts. The authors emphasize that their design philosophy was highly
conservative; therefore, only well-studied and well-understood operations are used.
Taking into account the reputation of the submitters, it is not surprising that Serpent has
the largest security margin among all candidates. Serpent belongs to the class of
SP-network ciphers. It consists of 32 small and simple rounds. Figure 6.4-1 shows one
round of Serpent. The last round is slightly different, but does not impose any significant
constraints on the design.
Figure 6.4-1 Single round of Serpent.
Unfortunately, not all rounds are identical. The cipher employs eight different sets
of 4x4 S-boxes that repeat every eight rounds. Additionally, encryption and decryption
consist of different operations, so we cannot take any advantage of resource sharing
between encryption and decryption.
Serpent can still be implemented in a basic architecture evaluating one round per
clock cycle, but this requires a switching circuit selecting the S-boxes, as shown in
Figure 6.4-3, and turns out to be very inefficient. I have made an exception for Serpent
and have unrolled eight rounds, treating this configuration as the basic architecture, as
shown in Figure 6.4-2. I call this architecture Serpent I8.
Figure 6.4-2 Implementation of Serpent I8 in basic architecture. a)
encryption, b) decryption.
As I have mentioned, the S-boxes accept only 4 inputs, and therefore match the
structure of an FPGA exceptionally well. Moreover, the linear transformation consists
of only two levels of XORs, which can be implemented very efficiently on LUTs. The
same observations apply to the decryption circuit. Serpent matches the internal
architecture of an FPGA so well that it is hard to believe that its designers are
mathematicians with no hardware design experience. The implementation of Serpent
took us the least amount of time.
Some research groups have implemented Serpent based on only one round with
switched S-boxes, as shown in Figure 6.4-3. We refer to this architecture as Serpent
I1.
Figure 6.4-3 Serpent I1.
6.4.2 Results of the implementation of Serpent
Throughput and area in basic architecture
The implementation of Serpent in the basic iterative architecture, as shown in
Figure 6.4-2, took 4,507 CLB Slices and constitutes the largest circuit, but we have to
keep in mind that eight rounds have been unrolled. The maximum clock frequency
indicated by the static timing analyzer was 13.5 MHz, which, in combination with only
four clock cycles per block, gives a throughput of 431 Mbps. This result outperforms all
other ciphers.
Throughput and area in mixed architecture
Applying pipelining to Serpent was a very easy task because this cipher has a very
FPGA-friendly structure. We introduced only three pipeline stages per round,
which gives a total of 96 pipeline stages for the entire implementation. The circuit takes
approximately 19,700 CLB Slices, which indicates a very small area increase associated
with introducing registers. We could run this circuit at a clock frequency of 130.9 MHz,
which gives a high throughput of 16.7 Gbps. This is the best result achieved by any
cipher reported in the literature.
6.5 Twofish
6.5.1 Structure and components of Twofish
Twofish was submitted to the AES contest by a team from Counterpane Systems
[29], led by B. Schneier, a well-known cryptanalyst. It almost perfectly follows the
classical Feistel-network structure, and performing encryption and decryption in the same
circuit requires introducing only a very small amount of switching logic. The entire
structure of the cipher is shown in Figure 6.5-1.
The designers of Twofish have introduced a new idea in cipher design: the use of
key-dependent S-boxes. Unlike in other ciphers using fixed S-boxes, the contents of key-
dependent S-boxes change for every key, making cryptanalysis certainly much harder.
The perfect way of implementing those 8x8 S-boxes would be to express them as
memories, which could be filled with new contents every time the keys are changed. This
could be done using Block SelectRAMs in a Virtex FPGA. Four such RAM blocks would
be sufficient to implement all eight S-boxes. This solution could be accepted only in the case
of the basic iterative architecture, where we do not need to change keys on the fly. In the
case of the pipelined architectures, changing the contents of memory in one clock cycle is
not feasible, unless we could make use of several memory modules and switch among them.
I have chosen not to use this technique, and I have implemented the algorithm that
computes the contents of the S-boxes inside the cipher round.
Figure 6.5-1 High-level structure of the Twofish cipher.
Each S-box consists of three permutations interleaved with keys S0 and S1, as
shown in Figure 6.5-2. Each q-permutation can be efficiently implemented on LUTs, as it
consists of small 4x4 t-boxes, shown in Figure 6.5-3, which match the internal
architecture of an FPGA very well.
Figure 6.5-2 S-boxes in Twofish.
Figure 6.5-3 Permutation q.
Another function used in Twofish is a 4-by-4-byte MDS matrix. The
transformation performed by this matrix is described by the formula:
[z0]   [01 EF 5B 5B]   [y0]
[z1] = [5B EF EF 01] · [y1]
[z2]   [EF 5B 01 EF]   [y2]
[z3]   [EF 01 EF 5B]   [y3]
where y3...y0 are consecutive bytes of the input 32-bit word (y3 is the most significant
byte), and z3...z0 form the output word. This matrix multiplies a 32-bit input value by 8-bit
constants, with all multiplications performed (byte by byte) in the Galois field GF(2^8).
The primitive polynomial is x^8 + x^6 + x^5 + x^3 + 1. Only three different
multiplications are effectively used in the MDS matrix, namely multiplication
- by 5B (hex) = 0101 1011 (binary), represented in GF(2^8) by the polynomial
  x^6 + x^4 + x^3 + x + 1,
- by EF (hex) = 1110 1111 (binary), i.e. x^7 + x^6 + x^5 + x^3 + x^2 + x + 1, and
- by 01 (hex) = 0000 0001 (binary), the identity element of GF(2^8); obviously the
  result is equal to the input value.
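A software model of these constant multiplications is a small variation of the Rijndael multiplier, with a different reduction polynomial; a sketch (the matrix is written in y0-first orientation, and the function names are illustrative):

```python
def gf_mul_twofish(a, b):
    """Multiply in GF(2^8) modulo the Twofish MDS polynomial x^8 + x^6 + x^5 + x^3 + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x69  # x^6 + x^5 + x^3 + 1, the reduction term
        b >>= 1
    return p

MDS = [[0x01, 0xEF, 0x5B, 0x5B],
       [0x5B, 0xEF, 0xEF, 0x01],
       [0xEF, 0x5B, 0x01, 0xEF],
       [0xEF, 0x01, 0xEF, 0x5B]]

def mds_multiply(y):
    """Multiply the byte vector [y0, y1, y2, y3] by the MDS matrix, giving [z0, z1, z2, z3]."""
    return [gf_mul_twofish(row[0], y[0]) ^ gf_mul_twofish(row[1], y[1])
            ^ gf_mul_twofish(row[2], y[2]) ^ gf_mul_twofish(row[3], y[3])
            for row in MDS]
```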
Finally, the PHT transform is a simple function that consists of two additions modulo 2^32,
as shown in Figure 6.5-4. Both additions are in fact independent and can be performed
simultaneously.
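The PHT itself is just the following pair of additions (a sketch, with a and b the two 32-bit inputs; `pht` is an illustrative name):

```python
MASK32 = 0xFFFFFFFF

def pht(a, b):
    # Pseudo-Hadamard Transform: two independent additions modulo 2^32;
    # the multiplication by 2 corresponds to the 1-bit left shift in Figure 6.5-4
    return (a + b) & MASK32, (a + 2 * b) & MASK32
```

Because neither sum depends on the other's result, the two 32-bit adders can run side by side in hardware, as noted above.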
Figure 6.5-4 PHT transformation.
As I have mentioned at the beginning of this section, both encryption and
decryption transformations can be implemented within the same circuit with a small
amount of additional logic. Figure 6.5-5 shows the structure of an implementation of a
single round used in my design.
Figure 6.5-5 Implementation of a single round of Twofish.
6.5.2 Results of the implementation of Twofish
Throughput and area in basic architecture
Twofish matches the structure of an FPGA very well, which results in a compact
design. Its implementation took 1,076 CLB Slices. The maximum clock frequency
indicated by the static timing analyzer was 22.1 MHz, which translates to a throughput of
177 Mbps.
Throughput and area in mixed architecture
Twofish has a quite long critical path through its round, and there is a lot of
room for pipeline stages. I introduced as many registers as I could, and this resulted
in a very deep pipeline with 24 stages per round. Hence, the total number of stages for the
full cipher is 384. The area of the circuit was in the range of 21,000 CLB Slices, and it
could be run with a clock frequency of 119 MHz. This gives a high throughput of 15.2
Gbps. As we can see, the number of introduced pipeline stages proved to be too large, as
the gain in clock frequency was only by a factor of five. Similarly to RC6, we could most
likely obtain a similar performance with fewer than ten pipeline stages.
Analysis of the results
7.1 Comparison of ciphers in feedback modes
The results of implementing the AES candidates, according to the assumptions and
design procedure summarized in chapter 5, are shown in Figure 7.1-1 and Figure 7.1-2.
Figure 7.1-1 Throughput for Virtex XCV-1000, my results. (Serpent: 431,
Rijndael: 414, Twofish: 177, RC6: 142, Mars: 61, 3DES: 59 Mbps)
All implementations were based on the Virtex XCV-1000BG560-6, one of the largest
currently available Xilinx Virtex devices. Additionally, I have implemented the current
ANSI standard [3], Triple DES, which I used as a reference for comparison.
Implementations of all ciphers took from 9% (for Twofish) to 37% (for Serpent
I8) of the total number of 12,288 CLB Slices available in the Virtex device used in my
designs. This means that less expensive Virtex devices could be used for all
implementations. Additionally, the key scheduling unit could easily be implemented
within the same device as the encryption/decryption unit.
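The utilization figures quoted above can be recomputed directly from the slice counts; a small sketch (`utilization` is an illustrative helper name):

```python
XCV1000_SLICES = 12288  # CLB Slices available in a Virtex XCV-1000

def utilization(slices_used):
    """Percentage of the device's CLB Slices consumed by a design."""
    return 100.0 * slices_used / XCV1000_SLICES

twofish = utilization(1076)     # about 9% of the device
serpent_i8 = utilization(4507)  # about 37% of the device
```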
Figure 7.1-2 Area for Virtex XCV-1000, my results. (Twofish: 1,076;
RC6: 1,137; Rijndael: 2,507; Mars: 2,744; Serpent I8: 4,507; 3DES: 356
CLB Slices)
In Figure 7.1-3 and Figure 7.1-4, I compare my results with the results of research
groups from Worcester Polytechnic Institute [15] and the University of Southern California
[12]. Both groups used identical FPGA devices, the same design tools, and a similar design
procedure. The order of the AES algorithms in terms of encryption and decryption
throughput is identical in the reports of all research groups. Serpent in architecture I8 (see
Figure 6.4-2) and Rijndael are over twice as fast as the remaining candidates. Twofish and
RC6 offer medium throughput. Mars is consistently the slowest of all candidates.
Interestingly, all candidates, including Mars, are faster than Triple DES. Serpent I8 (see
Figure 6.4-2) is significantly faster than Serpent I1 (Figure 6.4-3), and this architecture
should clearly be used in cipher feedback modes whenever speed is a primary
concern and the area limit is not exceeded.
Figure 7.1-3 Throughput for Virtex XCV-1000, comparison with results
of other groups (Worcester Polytechnic Institute, University of Southern
California, and my results).
The agreement among circuit areas obtained by different research groups is not as
good as for the circuit throughputs, as shown in Figure 7.1-4. These differences can be
explained by the fact that speed was the primary optimization criterion for all
involved groups, and area was treated only as a secondary parameter. Additional
differences resulted from different assumptions regarding sharing resources between
encryption and decryption, key storage, and the use of dedicated memory blocks.
Figure 7.1-4 Area for Virtex XCV-1000, comparison with results of other
groups (Worcester Polytechnic Institute, University of Southern
California, and my results).
Despite these different assumptions, the analysis of the results presented in Figure
7.1-4 leads to relatively consistent conclusions. All ciphers can be divided into three
major groups:
1. Twofish and RC6 require the smallest amount of area,
2. Rijndael and Mars require a medium amount of area (at least 50% more than
Twofish and RC6),
3. Serpent I8 requires the largest amount of area (at least 60% more than
Rijndael and Mars). Serpent I1 belongs to the first group according to [12],
and to the second group according to [15].
The overall features of all AES candidates can best be presented using a two-
dimensional diagram showing the relationship between the encryption/decryption
throughput and the circuit area. I collected my results for the Xilinx Virtex FPGA
implementations in Figure 7.1-5. For comparison, I show the results obtained by the NSA
group for ASIC implementations [33] in Figure 7.1-6.
Figure 7.1-5 Throughput vs. area for Virtex-1000, our results. The result
for Serpent I1 is based on [12].
Comparing the diagrams shown in Figure 7.1-5 and Figure 7.1-6 reveals that the
throughput/area characteristics of the AES candidates are almost identical for the FPGA
and ASIC implementations. The primary difference between the two diagrams comes from
the absence of an ASIC implementation of Serpent I8 in the NSA report [33].
Figure 7.1-6 Throughput vs. area for 0.5 µm CMOS standard-cell ASICs,
NSA results.
All ciphers can be divided into three distinct groups:
- Rijndael and Serpent I8 offer the highest speed at the expense of a
  relatively large area;
- Twofish, RC6, and Serpent I1 offer medium speed combined with a very
  small area;
- Mars is the slowest of all AES candidates and second to last in terms of
  circuit area.
Looking at this diagram, one may ask which of the two parameters, speed or area,
should be weighted more heavily in the comparison? The definitive answer is speed. The
primary reason for this choice is that in feedback cipher modes it is not possible to
substantially increase encryption throughput even at the cost of a very substantial
increase in circuit area. On the other hand, by using the resource sharing described in
section 3.3.2, the designer can substantially decrease circuit area at the cost of a
proportional (or higher) decrease in encryption throughput. Therefore, Rijndael and
Serpent can be implemented using almost the same amount of area as Twofish and RC6;
but Twofish and RC6 can never reach the speeds of the fastest implementations of
Rijndael and Serpent I8.
7.2 Comparison of ciphers in non-feedback modes
The results of my implementations of four AES candidates using full mixed inner-
and outer-round pipelining and Virtex XCV-1000BG560-6 FPGA devices are
summarized in Figure 7.2-1, Figure 7.2-2, and Figure 7.2-3. Because of the lack of time, I
did not attempt to implement Mars in this architecture. In Figure 7.2-4, I provide the
results of the implementation of all five AES finalists by the NSA group, using full outer-
round pipelining and semi-custom ASICs in a 0.5 µm CMOS MOSIS library [33].
Figure 7.2-1 Throughput for mixed inner- and outer-round pipelining in
Virtex-1000, my results. (Serpent: 16.8, Twofish: 15.2, RC6: 13.1,
Rijndael: 12.2 Gbps)
To the best of my knowledge, the throughputs of the AES candidates obtained as a result
of my design effort, and shown in Figure 7.2-1, are the best ever reported, including both
FPGA and ASIC technologies.
Figure 7.2-2 Area for mixed inner- and outer-round pipelining on
Virtex-1000, my results. (Serpent: 19,700; Twofish: 21,000; RC6: 46,900
CLB Slices; Rijndael: 12,600 CLB Slices plus 80 Block SelectRAMs)
Figure 7.2-3 Increase in the encryption/decryption latency as a result of
moving from the basic architecture to mixed inner- and outer-round
pipelining. (Serpent I8: 0.297 to 0.733 µs, x2.5; Rijndael: 0.309 to
0.737 µs, x2.4; Twofish: 0.722 to 3.092 µs, x4.3; RC6: 0.897 to 5.490 µs,
x6.1)
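The latencies in Figure 7.2-3 follow from the cycle counts and clock frequencies reported in chapter 6; a small sketch using the Serpent numbers (`latency_us` is an illustrative helper name):

```python
def latency_us(cycles, clock_mhz):
    # latency = number of clock cycles a block spends in the circuit / clock rate
    return cycles / clock_mhz

# Serpent I8, basic architecture: 4 clock cycles per block at 13.5 MHz
basic = latency_us(4, 13.5)        # about 0.30 us

# Serpent, fully pipelined: 96 pipeline stages at 130.9 MHz
pipelined = latency_us(96, 130.9)  # about 0.73 us, roughly a 2.5x increase
```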
My designs outperform similar pipelined designs based on the use of identical
FPGA devices, reported in [15], by a factor ranging from 3.5 for Serpent to 9.6 for
Twofish. These differences may be attributed to the use of a sub-optimal number of inner-
round pipeline stages and to limiting designs to single-chip modules in [15]. My designs
outperform the NSA ASIC designs in terms of encryption/decryption throughput by a
factor ranging from 2.1 for Serpent to 6.6 for Twofish (see Figure 7.2-1 and Figure
7.2-4). Since both groups obtained very similar throughputs in the basic
iterative architecture (see Figure 7.1-5 and Figure 7.1-6), these large differences should
be attributed primarily to the differences between the full mixed inner- and outer-round
architecture employed by me and the full outer-round architecture used by the
NSA team.
Figure 7.2-4 Throughput for 0.5 µm CMOS standard-cell ASICs, NSA
results. (Serpent: 8.0, Rijndael: 5.7, Twofish: 2.3, RC6: 2.2, Mars: 2.2
Gbps)
By comparing Figure 7.2-1 and Figure 7.2-4, it can be clearly seen that using full
outer-round pipelining for the comparison of the AES candidates favors ciphers with less
complex cipher rounds. Twofish and RC6 are over two times slower than Rijndael and
Serpent I1 when full outer-round pipelining is used (Figure 7.2-4), but have a
throughput greater than Rijndael, and comparable to Serpent I1, when full mixed inner-
and outer-round pipelining is applied (Figure 7.2-1). Based on my implementation of
Mars in the basic iterative architecture, I predict that the choice of the pipelined
architecture would have a similar effect on Mars.
The deviations in the values of the AES candidates' throughputs with full mixed
inner- and outer-round pipelining do not exceed 20% of their mean value. The analysis of
critical paths in my implementations has demonstrated that all critical paths contain only
a single level of CLBs and differ only in the delays of programmable interconnects. Taking
into account the already small spread of the AES candidates' throughputs and the potential
for further optimizations, I conclude that the demonstrated differences in throughput are not
sufficient to favor any of the AES algorithms over the others. As a result, circuit area
should be the primary criterion of comparison for our architecture and non-feedback
cipher modes.
As shown in Figure 7.2-2, Serpent and Twofish require almost identical area for
their implementations based on full mixed inner- and outer-round pipelining. RC6
imposes over twice as large area requirements. Comparison of the area of Rijndael with
the other ciphers is made difficult by the use of dedicated memory blocks, Block
SelectRAMs, to implement the S-boxes. Block SelectRAMs are not used in implementations
of any of the remaining AES candidates, and I am not aware of any formula for
expressing the area of Block SelectRAMs in terms of the area used by CLB Slices.
Nevertheless, I have estimated that an equivalent implementation of Rijndael, composed
of CLBs only, would take approximately 24,600 CLBs, which is only 17 and 25 percent
more than the implementations of Twofish and Serpent, respectively.
Additionally, Serpent, Twofish, and Rijndael can all be implemented using two
XCV-1000 FPGA devices, while RC6 requires four such devices. It should be noted that
in my designs, all implemented circuits perform both encryption and decryption. This is
in contrast with the designs reported in [15], where only encryption logic is implemented,
and therefore a fully pipelined implementation of Serpent can fit in one FPGA
device.
Connecting two or more Virtex FPGA devices into a multi-chip module working
with the same clock frequency is possible because the FPGA system-level clock can
achieve rates up to 200 MHz [34], and the highest internal clock frequency required by
the AES candidate implementations is 131 MHz, for Serpent. New devices of the Virtex
family released in 2001 are capable of holding full implementations of Serpent,
Twofish, and Rijndael on a single integrated circuit.
In Figure 7.2-3, I report the increase in the encryption/decryption latency resulting
from using inner-round pipelining with the number of stages optimal from the point
of view of the throughput/area ratio. In the majority of applications that require hardware-
based high-speed encryption, the encryption/decryption throughput is the primary
performance measure, and the latencies shown in Figure 7.2-3 are fully acceptable.
Therefore, in this type of application, the only parameter that truly differentiates the AES
candidates working in non-feedback cipher modes is the area, and thus the cost, of the
implementations. As a result, in non-feedback cipher modes, Serpent, Twofish, and
Rijndael offer very similar performance characteristics, while RC6 requires over twice as
much area and twice as many Virtex XCV-1000 FPGA devices.
Summary
I have implemented all five final AES candidates in the basic iterative
architecture, suitable for feedback cipher modes, using Xilinx Virtex XCV-1000 FPGA
devices. For all five ciphers, I have obtained the best throughput/area ratio compared to
the results of other groups reported for FPGA devices. Additionally, I have implemented
four AES algorithms using full mixed inner- and outer-round pipelining, suitable for
operation in non-feedback cipher modes. For all four ciphers, I have obtained throughputs
in excess of 12 Gbps, the highest throughputs ever reported in the literature for hardware
implementations of the AES candidates, taking into account both FPGA and ASIC
implementations.
I have developed a consistent methodology for the fast implementation and fair
comparison of the AES candidates in hardware. I have found that the choice of an
optimal architecture and a fair performance measure is different for feedback and non-
feedback cipher modes.
For feedback cipher modes (CBC, CFB, OFB), the basic iterative architecture is
the most appropriate for comparison and future implementations. The
encryption/decryption throughput should be the primary criterion of comparison, because
it cannot easily be increased by switching to a different architecture, even at the cost of a
substantial increase in circuit area. Serpent and Rijndael outperform the three remaining
AES candidates by at least a factor of two in both throughput and latency. Two
independent research groups have confirmed my results for feedback modes.
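The reason throughput is the fixed quantity in feedback modes can be made concrete: the basic iterative architecture processes one round per clock cycle and one block at a time, so throughput is determined entirely by the block size, the round count, and the clock frequency. A minimal sketch; the clock frequencies below are hypothetical values for illustration, not measurements from this work:

```python
# Throughput of the basic iterative architecture in feedback modes:
# one round per clock cycle, one block in flight at a time.
def iterative_throughput_mbps(block_bits, rounds, clk_mhz):
    """Throughput in Mbps = block size * clock frequency / number of rounds."""
    return block_bits * clk_mhz / rounds

# Hypothetical clock frequencies, for illustration only.
print(iterative_throughput_mbps(128, 10, 30.0))  # a 10-round cipher: 384 Mbps
print(iterative_throughput_mbps(128, 32, 15.0))  # a 32-round cipher: 60 Mbps
```

The round count divides the throughput directly, which is why a cipher cannot easily buy back feedback-mode throughput by spending more area.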
For non-feedback cipher modes (ECB, counter mode), the architecture with full
mixed inner- and outer-round pipelining is the most appropriate for comparison and
future implementations. In this architecture, all AES candidates achieve high, and
approximately the same, throughput. As a result, the implementation area should be the
primary criterion of comparison. Implementations of Serpent, Twofish, and Rijndael
consume approximately the same amount of FPGA resources; RC6 requires over twice as
large an area. My approach to the comparison of the AES candidates in non-feedback
cipher modes is new and unique, and has yet to be followed, verified, and confirmed by
other research groups.
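The contrast with feedback modes can be shown in one line: with full inner- and outer-round pipelining, a new block enters the circuit every clock cycle, so throughput no longer depends on the number of rounds at all, only on the block size and the clock frequency. A minimal sketch; the 100 MHz clock is a hypothetical figure, not a measured result:

```python
# Throughput of a fully pipelined architecture in non-feedback modes:
# one block is accepted every clock cycle, regardless of the round count.
def pipelined_throughput_gbps(block_bits, clk_mhz):
    """Throughput in Gbps = block size * clock frequency."""
    return block_bits * clk_mhz / 1000.0

# With a hypothetical 100 MHz clock, any 128-bit cipher reaches the same
# 12.8 Gbps, which is why area becomes the differentiating criterion.
print(pipelined_throughput_gbps(128, 100.0))
```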
My analysis leads to the following ranking of the AES candidates in terms of
hardware efficiency: Rijndael and Serpent in a close tie for first place, followed in order
by Twofish, RC6, and Mars. Figure 7.1-5 clearly indicates that Rijndael offers high
throughput and the best throughput/area ratio in the basic iterative architecture. Figure
7.2-2 shows the area requirements of all ciphers implemented in the fully pipelined
architecture. None of those ciphers could fit within the device I used for comparison;
however, Xilinx Inc. has already introduced an extended family of FPGA devices:
Virtex-E and Virtex-II. FPGAs in the Virtex-E family have a large amount of Block
SelectRAM and a number of CLB slices similar to that of the Virtex family. Using
Virtex-E devices, most of the candidates could certainly be implemented within one chip.
Only RC6 is too big for the largest of these devices. Serpent and Twofish could each be
implemented in one of the largest chips, the Virtex-E XCV2600E. Rijndael appears to
have the smallest requirements, as it can be implemented entirely within one Virtex-E
XCV1600E. Again, Rijndael takes the lead in the comparison of the ciphers.
When I came to the AES3 conference in New York with my advisor, we attended
the reception before the conference sessions. Talking with other participants of the
conference, we had a feeling that everyone would most likely see an American candidate
cipher as the winner of the contest, because the winner was going to become an American
government standard. From this point of view, MARS, RC6, and Twofish had an
advantage over the remaining candidates. Rijndael was proposed by two relatively
unknown researchers from Europe, which was not a good omen for its acceptance.
Serpent already had a reputation for being very slow in software. At the AES3 conference
we presented a paper [19] that focused on implementations of the AES candidates in the
basic iterative architecture. We showed my results, which are summarized in Figure 7.1-1
and Figure 7.1-4. Other research groups presented similar results, as shown in Figure
7.1-3. At the end of the AES3 conference, all participants were asked to fill out a survey
in which everyone could indicate his or her choice for the AES standard. The results of
the survey are presented in Figure 8-1.
[Bar chart: number of votes received by Serpent, Rijndael, Twofish, RC6, and Mars, on a
scale of 0 to 100.]
Figure 8-1 Results of the survey filled out by participants of the AES3 conference.
The opinion voiced by AES3 participants is surprisingly well correlated with the
results of our research.
The winner of the contest was finally announced in October 2000. Rijndael has
become the AES, and will protect US government data well into the 21st century.

The AES contest is over, but the results of my research remain of interest. All the
finalists proved to be equally secure, and may find use in real applications. I have
already encountered requests for including all the remaining candidate algorithms in
secure communication standards as optional algorithms. My research results may guide
hardware implementers of those algorithms.
Rijndael was officially approved as the AES in November 2001. It will become a
required algorithm for the most important secure communication protocols, such as
IPSec. I have started my research focusing on implementing the AES for gigabit IPSec,
and have already presented an implementation of Rijndael in the basic iterative
architecture that achieves a throughput of 577 Mbps [8]. I am currently working on
meeting the 1 Gbps requirement of gigabit IPSec.
List of References
[1] R. Anderson, E. Biham, L. Knudsen, Serpent: A Proposal for the Advanced Encryption Standard, NIST AES Proposal, June 1998.
[2] Advanced Encryption Standard Development Effort, http://www.nist.gov/aes.
[3] ANSI X9.52, Triple Data Encryption Algorithm Modes of Operation, 1998.
[4] C. Burwick, D. Coppersmith, E. D'Avignon, R. Gennaro, S. Halevi, C. Jutla, S. Matyas, L. O'Connor, M. Peyravian, D. Safford, N. Zunic, MARS – a candidate cipher for AES, NIST AES Proposal, June 1998.
[5] E. Biham, A. Shamir, Differential cryptanalysis of DES-like cryptosystems, Technical report CS90-16, Weizmann Institute of Science, CRYPTO'90 & Journal of Cryptology, Vol. 4, No. 1, pp. 3-72, 1991.
[6] E. Biham, A. Shamir, Differential Cryptanalysis of the full 16-round DES, Advances in Cryptology, CRYPTO'92, 1992.
[7] E. Biham, A. Shamir, Differential Cryptanalysis of the Data Encryption Standard, Springer Verlag, 1993. ISBN: 0-387-97930-1, 3-540-97930-1.
[8] P. Chodowiec, K. Gaj, P. Bellows, B. Schott, Experimental Testing of the Gigabit IPSec-Compliant Implementations of Rijndael and Triple DES Using SLAAC-1V FPGA Accelerator Board, Proc. Information Security Conference, Malaga, Spain, October 1-3, 2001.
[9] P. Chodowiec, P. Khuon, K. Gaj, Fast Implementations of Secret-Key Block Ciphers Using Mixed Inner- and Outer-Round Pipelining, Proc. ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays, FPGA'01, Monterey, February 2001, pp. 94-102.
[10] P. Chodowiec, W. Todryk, Hardware Encryptor for Hard Drives, Warsaw University of Technology, Faculty of Electronics and Information Technology, Senior Design Project, Warsaw, 1998.
[11] J. Daemen, V. Rijmen, AES Proposal: Rijndael, NIST AES Proposal, June 1998.
[12] A. Dandalis, V. Prasanna, J. Rolim, A Comparative Study of Performance of AES Final Candidates Using FPGAs, Proc. Cryptographic Hardware and Embedded Systems Workshop, CHES 2000, Worcester, MA, Aug. 17-18, 2000.
[13] Electronic Frontier Foundation and O'Reilly and Associates, Cracking DES: Secrets of Encryption Research, Wiretap Politics & Chip Design, July 1998.
[14] A. Elbirt, C. Paar, An FPGA Implementation and Performance Evaluation of the Serpent Block Cipher, Eighth ACM International Symposium on Field-Programmable Gate Arrays, Monterey, California, February 10-11, 2000.
[15] A. Elbirt, W. Yip, B. Chetwynd, C. Paar, An FPGA Implementation and Performance Evaluation of the AES Block Cipher Candidate Algorithm Finalists, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.
[16] Federal Information Processing Standards Publication 46-3, Data Encryption Standard, National Institute of Standards and Technology, 1999.
[17] Federal Information Processing Standards Publication 81, DES Modes of Operation, National Institute of Standards and Technology, 1980.
[18] Federal Information Processing Standards Publication 197, Advanced Encryption Standard (AES), National Institute of Standards and Technology, 2001.
[19] K. Gaj, P. Chodowiec, Comparison of the Hardware Performance of the AES Candidates Using Reconfigurable Hardware, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.
[20] J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, Second Edition, 1995. ISBN: 1-55960-329-8.
[21] H. Leitold, W. Mayerwieser, U. Payer, K. Posch, R. Posch, J. Wolkerstorfer, A 155 Mbps Triple-DES Network Encryptor, Proc. Cryptographic Hardware and Embedded Systems Workshop, CHES 2000.
[22] H. Lipmaa, P. Rogaway, D. Wagner, CTR-Mode Encryption, Comments to NIST concerning AES Modes of Operation, 2000.
[23] M. Matsui, Linear cryptanalysis method for DES cipher, Advances in Cryptology, EUROCRYPT'93, 1993.
[24] J. Nechvatal, E. Barker, D. Dodson, M. Dworkin, J. Foti, E. Roback, Status Report on the First Round of the Development of the Advanced Encryption Standard, NIST report, August 1999.
[25] M. Peattie, Use Triple DES for Ultimate Virtex-II Design Protection, Xcell Journal, Issue 40, Summer 2001.
[26] M. Riaz, H. Heys, The FPGA Implementation of RC6 and CAST-256 Encryption Algorithms, CCECE'99, Edmonton, Alberta, Canada, 1999.
[27] M. Rawski, L. Jozwiak, M. Nowicka, T. Luba, Non-Disjoint Decomposition of Boolean Functions and Its Application in FPGA-oriented Technology Mapping, Proc. EUROMICRO'97, Budapest, Hungary, September 1-4, 1997.
[28] R. Rivest, M. Robshaw, R. Sidney, The RC6 Block Cipher, NIST AES Proposal, June 1998.
[29] B. Schneier, J. Kelsey, D. Whiting, D. Wagner, C. Hall, N. Ferguson, Twofish: A 128-bit Block Cipher, NIST AES Proposal, June 1998.
[30] B. Schneier, J. Kelsey, D. Whiting, D. Wagner, C. Hall, N. Ferguson, Performance Comparison of the AES Submissions, Second AES Candidate Conference, Rome, April 1999.
[31] A. Satoh, N. Ooba, K. Takano, E. D'Avignon, High-Speed MARS Hardware, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.
[32] S. Trimberger, R. Pang, A. Singh, A 12 Gbps DES Encryptor/Decryptor Core in an FPGA, Proc. Cryptographic Hardware and Embedded Systems Workshop, CHES 2000.
[33] B. Weeks, M. Bean, T. Rozylowicz, C. Ficke, Hardware Performance Simulations of Round 2 Advanced Encryption Standard Algorithms, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.
[34] Xilinx, Inc., Virtex 2.5V Field Programmable Gate Arrays, The Programmable Logic Data Book, 2000.
[35] K. Gaj, P. Chodowiec, Fast Implementation and Fair Comparison of the Final Candidates for Advanced Encryption Standard Using Field Programmable Gate Arrays, Proc. RSA Security Conference – Cryptographer's Track, San Francisco, CA, April 8-12, 2001.
[36] T. Ichikawa, T. Kasuya, M. Matsui, Hardware Evaluation of the AES Finalists, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.