Cryptographic Functions on the SideWorks architecture

Nuno Andre Mendes Pissarra

Thesis to obtain the Master of Science Degree in

Telecommunications and Informatics Engineering

Supervisor: Prof. Ricardo Jorge Fernandes Chaves

Examination Committee

Chairperson: Prof. Paulo Jorge Pires Ferreira

Supervisor: Prof. Ricardo Jorge Fernandes Chaves

Member of the Committee: Prof. Fernando Duarte Goncalves

October 2015

Cryptography is about communication in the presence of adversaries.

Ronald Linn Rivest

Acknowledgments

First of all, I would like to thank my supervisor, Professor Ricardo Chaves, for the support, guidance and helpful suggestions throughout the preparation of this dissertation.

I sincerely want to thank Alexandre Santos and Ruben Lino for putting up with too many questions and

doubts regarding the Coreworks framework and debug scenarios. Working with them has been an extremely

valuable experience.

Special thanks to Nuno Neves for all the support regarding the MB-LITE processor. I would also like to express my gratitude to all my colleagues and friends at Instituto Superior Técnico. I thank them for the friendship and company during the lazy days and long nights in which this work was developed.

Furthermore, I am grateful to my sister Ana Pissarra for the motivation, for reviewing some parts of this

work, providing me with feedback and patience when it was most required.

Finally, I dedicate this work to my mother Fernanda Mendes and father Mario Pissarra, for all the love and understanding, for always believing in me, and for opening all the doors in life for me. I thank them for the unconditional support and encouragement.

Abstract

The proliferation of smartphones, tablets and smaller embedded systems, combined with the increase in the amount of data stored and transmitted by this kind of device, has changed the way we view the need to protect valuable information, now seen less as a feature and more as a requirement.

Coreworks developed a computing technology which speeds up the development of high-performance, small area and low power reconfigurable processors. The Coreworks technology named SideWorks highlights the fact that reconfigurable processors built with this technology are primarily targeted to work as dedicated high-performance offload engines.

This work presents the implementation of multiple cryptographic algorithms, namely the SHA family, AES and CLEFIA, using the Coreworks processing framework. The FireWorks processor of this framework is used to control the hardware accelerator, designated SideWorks.

The presented work explores the use of the SideWorks platform to adequately schedule and accelerate the computation of the most common cryptographic algorithms (excluding asymmetric algorithms). The proposed approach considers merging the needed processing structures, towards more compact and efficient cryptography implementations. By taking full advantage of the features provided by the SideWorks technology and the proposed building blocks, this work demonstrates a novel approach to the implementation of known cryptographic algorithms.

Keywords

SHA, AES, CLEFIA, FPGA, Coreworks, SideWorks.

Resumo

The proliferation of smartphones, tablets and ever smaller embedded systems, together with the growth in the amount of data stored and transmitted by this kind of device, has changed the way we view the need to protect information. Protecting information is no longer an option or a feature; it has become indispensable.

Coreworks developed a technology that speeds up the development and improves the performance of systems, with a small area footprint and low energy consumption, for reconfigurable processors. The Coreworks technology, named SideWorks, stands out in that reconfigurable processors developed with it are primarily aimed at dedicated high-performance engines.

This work presents the implementation of multiple cryptographic algorithms, among them members of the SHA family, AES and CLEFIA, using the Coreworks framework. The work explores the use of the SideWorks platform to accelerate and adequately schedule the computation of some of the most common cryptographic algorithms (excluding asymmetric algorithms). The proposed approach considers merging the required processing structures in order to create more compact and efficient implementations. By making the most of the resources offered by the SideWorks technology and the proposed building blocks, this work presents a new approach to the implementation of these cryptographic algorithms.

Keywords

SHA, AES, CLEFIA, FPGA, Coreworks, SideWorks.

Contents

1 Introduction

1.1 Motivation

1.2 Relevance

1.3 Development environment

1.4 Objectives

1.5 Document Structure

2 Related State of the art

2.1 Hash functions

2.1.1 SHA-1

2.1.2 SHA-2

2.1.3 SHA-3

2.2 Symmetric Encryption

2.2.1 AES

2.2.1.1 Structure

2.2.1.2 SubBytes

2.2.1.3 ShiftRows

2.2.1.4 MixColumns

2.2.1.5 AddRoundKey

2.2.1.6 Key Schedule

2.2.2 CLEFIA

2.2.2.1 Structure

2.2.2.2 F-Functions

2.2.2.3 Key Schedule

2.2.3 Modes of Operation

2.2.3.1 Electronic Code Book

2.2.3.2 Cipher Block Chaining

2.3 Summary

3 Related Technology

3.1 Hardware-Software CoDesign

3.2 SideWorks

3.3 MB-LITE processor core

3.4 Summary

4 Development Environment

4.1 Coreworks Processing Engine

4.2 FireWorks Architecture

4.3 SideWorks Architecture

4.4 Design Flow

4.5 SideWorks Design Example

4.6 SideWorks Functional Unit Example

4.7 Summary

5 Proposed Structures for Hash Functions

5.1 Secure Hash Algorithm (SHA)-1

5.1.1 Proposed Structure

5.1.2 Required Elements

5.1.3 Final Structure

5.2 SHA-2

5.2.3 Final Structure

5.3 SHA-3

5.3.3 Final Structure

5.4 Summary

6 Proposed Structures for Symmetric Key Encryption Algorithms

6.1 AES

6.1.3 Final Structure

6.2 CLEFIA

6.2.3 Final Structure

6.3 Summary

7 Evaluation

7.1 Hardware Tests

7.2 Functional Unit Design

7.3 Proposed Structures performance

7.4 Summary

8 Conclusions

Appendix A SideWorks Examples

A.1 SideWorks code example

A.2 SideWorks code example parallel version

A.3 SideWorks FU ROL32.c

A.4 SideWorks SHA-1 Datapath

Appendix B SideWorks Functional Units

B.1 ADDX

B.2 ADDX2

B.3 ANDX

B.4 CHI

B.5 CXOR

B.6 FXOR

B.7 FXOR2

B.8 ROL

B.9 SBSHIFT

B.10 SXOR

B.11 SXOR2

B.12 XORCL

B.13 XORR

B.14 XORX

B.15 XORX2

List of Figures

2.1 Schematic of the SHA-1 algorithm.

2.2 Schematic of the SHA-2 algorithm.

2.3 Sponge construction

2.4 Terminology used in Keccak

2.5 The step mappings of Keccak-f

2.6 Basic structure of the AES algorithm: encryption (left), decryption (right).

2.7 Substitute byte transformation

2.8 ShiftRows transformation

2.9 MixColumns transformation

2.10 Add round key transformation

2.11 CLEFIA datapath

2.12 Schematic of the CLEFIA F functions.

2.13 Encryption and decryption with the Electronic Code Book mode.

2.14 Encryption and decryption with the Cipher Block Chaining mode.

3.1 Top level view of the SideWorks architecture template [1].

3.2 MB-LITE configuration example

4.1 Coreworks Processing Engine interfacing with user cores and memory [1].

4.2 FireWorks Architecture [1].

4.3 Hardware/Software Co-design flow using SideWorks [1].

4.4 Sequential vector add example diagram.

4.5 Example diagram parallel version

5.1 SHA-1 Extend the sixteen 32-bit words into eighty 32-bit words.

5.2 SHA-1 F1.

5.3 SHA-1 F2.

5.4 SHA-1 F3.

5.5 ADDX FU description

5.6 SHA-1 Final critical datapath

5.7 XORX FU description

5.8 SHA256 Final schematic

5.9 CHI FU description

5.10 XORR FU description

6.1 XOR Select description

6.2 XOR final description

6.3 Advanced Encryption Standard (AES) Electronic CodeBook (ECB) Core Partial Round Example

6.4 AES Cipher Block Chaining (CBC) Core Partial Round Example

6.5 CLEFIA Round Example

B.1 ADDX FU Diagram

B.2 ADDX2 FU Diagram

B.3 ANDx Diagram

B.4 CHI Diagram

B.5 CXOR Diagram

B.6 FXOR Diagram

B.7 FXOR2 Diagram

B.8 ROL Diagram

B.9 SBSHIFT Diagram

B.10 SBSHIFT Properties

B.11 SXOR FU Diagram

B.12 SXOR2 Diagram

B.13 XORCL Diagram

B.14 XORR Diagram

B.15 XORx Diagram

B.16 XORX2 Diagram

List of Tables

2.1 Block Cipher Modes of Operation

5.1 SHA-1 and SHA-2 Functional Units requirements and technical features.

5.2 SHA-3 Functional Units requirements and technical features.

6.1 AES and CLEFIA Functional Units requirements and technical features.

7.1 CXOR alternative implementations features

7.2 ADDX alternative implementations

7.3 XORX alternative implementations

7.4 Proposed structures performance summary

B.1 ADDX Properties

B.2 ADDX2 Properties

B.3 ANDX Properties

B.4 CHI Properties

B.5 CXOR Properties

B.6 FXOR Properties

B.7 FXOR2 Properties

B.8 ROL Properties

B.9 SXOR Properties

B.10 SXOR2 Properties

B.11 XORCL Properties

B.12 XORR Properties

B.13 XORX Properties

B.14 XORX2 Properties

Abbreviations

AES Advanced Encryption Standard

AGU Address Generation Unit

ALU Arithmetic Logic Unit

ASIC Application-Specific Integrated Circuit

BRAM Block RAM

CAD Computer-Aided Design

CBC Cipher Block Chaining

CWPE Coreworks Processing Engine

DES Data Encryption Standard

DMA Direct Memory Access

DPA Differential Power Analysis

DRM Digital Rights Management

DSP Digital Signal Processor

DSS Digital Signature Standard

ECB Electronic CodeBook

OFB Output FeedBack

EDK Embedded Development Kit

ELF Executable and Linkable Format

FIFO First In First Out

FIPS Federal Information Processing Standards

FPGA Field Programmable Gate Array

FU Functional Unit

GFN Generalized Feistel Network

GF Galois field

HMAC Hash-based Message Authentication Code

IETF Internet Engineering Task Force

IoT Internet of Things

IPsec Internet Protocol Security

IP Intellectual Property

IV Initialization Vector

LUT LookUp Table

MD5 Message-Digest algorithm 5

MU Memory Unit

NIST National Institute of Standards and Technology

RFC Request for Comments

RISC Reduced Instruction Set Computing

RTL Register Transfer Level

SHA Secure Hash Algorithm

SoC System on Chip

SSH Secure Shell

TLS Transport Layer Security

VHDL VHSIC Hardware Description Language

VLSI Very Large Scale Integration

1 Introduction

Contents

1.1 Motivation

1.2 Relevance

1.3 Development environment

1.4 Objectives

1.5 Document Structure

In recent years, with the evolution of communications, the amount of transmitted data has increased dramatically. In today's world of e-mail, internet banking, online shopping and other sensitive digital communications, cryptography has become a vital tool for securing connection and interaction in society. A clear example is hash functions, which lie at the root of many popular cryptographic methods currently in use, such as the Digital Signature Standard (DSS), the Transport Layer Security (TLS) and Internet Protocol Security (IPsec) protocols, numerous random number generation algorithms, encryption algorithms, all-or-nothing transforms and password storage mechanisms.

In essence, information security deals with the safe and accurate transfer of information, and the need to protect valuable information is as ancient as mankind. Whether that information is a classified document, the technology to produce weapons or a building blueprint, it has some value and therefore needs to be protected [2].

1.1 Motivation

Today, the most complicated cryptographic systems are implemented in software rather than in hardware. One major reason is that implementers are more familiar with software programming than with hardware design. Software tools are widely available at low prices, while commercial Very Large Scale Integration (VLSI) Computer-Aided Design (CAD) tools are of interest only to large companies and specialized research groups. Individual users and class projects are restricted to software. The increasing demand of applications for computational power, together with the energy restrictions of portable devices, forces us to consider that general-purpose processors are no longer an efficient solution for mobile systems. New hardware approaches are therefore needed to implement computationally heavy functions with low power demands, so as to meet current network speed requirements and the amount of data to be transmitted. Two such approaches are Application-Specific Integrated Circuit (ASIC) technology and Field Programmable Gate Arrays (FPGAs).

ASIC devices are the best solution when dealing with real-time and more demanding systems. ASICs guarantee better performance with low power consumption; however, they lack adaptability. Between software implementations and ASIC devices there is a middle ground, which is covered by FPGAs. These components provide reconfigurable logic and are commercially available at low prices. They vary in capacity and performance. Their main disadvantage is that they are not suitable for the implementation of large functions. Programmable logic has several advantages over custom hardware, such as shorter time to market and higher adaptability.

1.2 Relevance

The growing number of knowledgeable users, in conjunction with the unprecedentedly low cost of electronic devices, has brought users an unparalleled level of convenience and flexibility.

Tablets and laptops let users work anywhere, anytime. Unfortunately, physical security is a major problem

for these devices. To be portable, they must be lightweight and small-sized. Since they are designed for mobile

use, they are often exposed in public places such as airports, coffee houses or taxis, where they are vulnerable

to theft or loss. Along with the value of lost hardware, users must worry about the exposure of sensitive

information. People store vast amounts of personal data on their mobile devices and the loss of a device may

lead to the exposure of credit card numbers, passwords, client data or even military secrets [3][4].

Even in smaller embedded systems many of these problems are still relevant, as security assurance is a particularly challenging endeavor. Resource and performance constraints imposed upon embedded systems preclude the application of security assurance techniques that have been developed for desktops and enterprise computation platforms.

Identical concerns can be found in non-critical systems as well, with the increased adoption of the Internet of Things (IoT). The IoT consists of billions of digital devices, people, services and other physical objects with the potential to seamlessly connect, interact and exchange information about themselves through a digital environment, combining network connectivity with embedded systems, sensors and actuators in the physical world. This new concept involves objects of our daily life, such as clothes, cars and smart cards, which will be able to reveal information about themselves and interact with each other and with the environment. The IoT will therefore add an enormous range of new industrial opportunities to the software and hardware markets. Due to the multiple aspects involved, security for the IoT will be a critical concern that must be addressed in order to enable several current and future applications [5].

1.3 Development environment

Lately, interest in parallel architectures has increased as a way to address the demanding computational needs of multimedia and communication algorithms. These architectures are composed of application-specific hardware that accelerates the critical parts of the algorithms while keeping energy consumption within adequate limits. However, specific hardware has a long development time and cannot be altered once manufactured. This limitation is incompatible with the fast and dynamic market changes and short life expectancy of modern electronic products.

Coreworks developed a computing technology which minimizes the problems outlined above, as it speeds up the development of high-performance, small area and low power reconfigurable processors. The Coreworks technology named SideWorks highlights the fact that reconfigurable processors built with this technology are primarily targeted to work as dedicated high-performance offload engines.

SideWorks is a general architecture template for runtime reconfigurable processors, using pre-designed and pre-verified programmable functional units (FUs) and embedded memories interconnected by programmable partial crossbars. The FU library includes general-purpose FUs such as Arithmetic Logic Units (ALUs), multipliers and shifters, as well as more application-specific units such as bit packing/unpacking functions and data type converters. With this technology, application-specific architectures can be automatically generated, programmed and simulated by proprietary tools. The Coreworks technology combines a proprietary conventional processor (FireWorks) with a programmable hardware block (SideWorks) to form a fully programmable solution that achieves the performance of custom hardware and the flexibility of software. The hardware and software provided by the Coreworks technology address fast-evolving multimedia and communication standards and accommodate the frequent need for updates and bug fixes. These characteristics contribute to an overall risk reduction compared with the development of a multi-million dollar chip design [1].

1.4 Objectives

When facing the challenge of implementing cryptographic algorithms on a platform such as SideWorks, the developer must identify the building blocks needed to achieve the most efficient and compact implementation. To succeed in this task, studying and understanding the target algorithms is a necessary and crucial step in creating and optimizing the required blocks. As such, the main goal of this work is not only to implement cryptographic algorithms in SideWorks but also to provide SideWorks with the essential building blocks to do so in the most effective way. The fact that there are currently no implementation patterns of this kind is the motivation behind this project, and it could represent a new field in which SideWorks can thrive.

1.5 Document Structure

The document is organized as follows. Chapter 2 presents the general description of the implemented target algorithms, using reference implementations provided by the authors. Chapter 3 briefly introduces the design challenges and considerations faced when using Hardware-Software CoDesign, with some references on how Coreworks tackles the problem, and also presents the MB-LITE processor as a direct alternative. Chapter 4 introduces the Coreworks framework by describing the SideWorks architecture and design flow, and presents a design example to illustrate the basic procedure when developing for this architecture. Chapter 5 focuses on the inherent challenges and the implementation of the SHA algorithms, presenting the support procedures and mechanisms devised during the course of the work, to be used within the proposed structures.

In Chapter 6, structures for two symmetric encryption algorithms are proposed. The first section proposes a compact Advanced Encryption Standard (AES) structure capable of achieving high throughput with a small device occupation. The second section proposes a structure for CLEFIA, notable for its lightweight efficiency. Chapter 7 presents the hardware utilization and performance evaluation for both the FUs and the proposed structures. Finally, Chapter 8 concludes this work, summarizing its main results.

2 Related State of the art

Contents
2.1 Hash functions
    2.1.1 SHA-1
    2.1.2 SHA-2
    2.1.3 SHA-3
2.2 Symmetric Encryption
    2.2.1 AES
        2.2.1.1 Structure
        2.2.1.2 SubBytes
        2.2.1.3 ShiftRows
        2.2.1.4 MixColumns
        2.2.1.5 AddRoundKey
        2.2.1.6 Key Schedule
    2.2.2 CLEFIA
        2.2.2.1 Structure
        2.2.2.2 F-Functions
        2.2.2.3 Key Schedule
    2.2.3 Modes of Operation
        2.2.3.1 Electronic Code Book
        2.2.3.2 Cipher Block Chaining
2.3 Summary

This Chapter is divided into two major Sections, one for hash functions and the other for symmetric encryption algorithms. Each Section describes the principles of the algorithms, focusing on the reference implementations and how they work. The aim is to give a firmer understanding of what hash functions and block ciphers are and how they work, in particular the Secure Hash Algorithm (SHA) family, AES, and CLEFIA, while also providing insight into some of the existing modes of operation.

2.1 Hash functions

Hash functions are a key primitive in modern cryptography, often informally called one-way functions. A hash function is a computationally efficient function that maps binary strings of arbitrary length to binary strings of some fixed length, called hash values or digests.

Hash functions are used in many cryptographic algorithms and protocols. In the case of cryptographic algorithms, they are used in digital signature schemes, in the message padding of public-key encryption schemes, and in message authentication codes.

In 1993, the National Institute of Standards and Technology (NIST) published the first standard, SHA-0, which was superseded two years later by SHA-1 to improve the original design. SHA-1 was still deemed secure by the end of the millennium, but shortly after, an avalanche of results on hash functions culminated in collision attacks on Message-Digest algorithm 5 (MD5) [6] and SHA-1 [7]. Meanwhile, NIST had introduced the SHA-2 family. Although SHA-2 bears some similarities to the SHA-1 algorithm, these attacks have not been successfully extended to SHA-2, which remains unbroken to date.

Recently, NIST announced the SHA-3 competition, calling for proposals for a hash function that will replace the SHA-2 standard. NIST selected 51 entries for Round 1 [8]; 14 of them advanced to Round 2 [9], from which 5 finalists were selected [10].

The selection of winning candidates was driven by considering security as well as implementation efficiency

of the proposed hash algorithms in hardware and software. However, systematic cryptanalysis of hash functions

is not well established, and it is hard to measure the cryptographic strength of a hash function beyond obvious

metrics such as digest length. On October 2, 2012, Keccak was selected as the winner of the SHA-3 competition.

The following subsections describe the SHA-1, SHA-2, and the novel SHA-3 algorithms.

2.1.1 SHA-1

SHA-1 is the most popular hash function in use. The algorithm takes as input a message with a length of less than $2^{64}$ bits and produces as output a 160-bit message digest. The input is processed in 512-bit blocks. The algorithm processing includes the following steps:

Padding: The purpose of message padding is to make the total length of the padded message congruent to 448 modulo 512 (length ≡ 448 mod 512). The number of padding bits is between 1 and 512. Padding consists of a single 1-bit followed by the necessary number of 0-bits.

Appending Length: A 64-bit binary representation of the original length of the message is appended to the end of the message.
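The two pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration that assumes byte-aligned messages (so the single 1-bit becomes the byte 0x80) and a big-endian length field; the function name `sha1_pad` is introduced here for illustration only.

```python
def sha1_pad(message: bytes) -> bytes:
    """Pad a byte-aligned message as described above (hypothetical helper)."""
    bit_length = 8 * len(message)
    # A single 1-bit followed by seven 0-bits.
    padded = message + b"\x80"
    # 0-bits until the length is congruent to 448 mod 512 bits (56 mod 64 bytes).
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    # 64-bit big-endian representation of the original length in bits.
    padded += bit_length.to_bytes(8, "big")
    return padded


padded = sha1_pad(b"abc")
print(len(padded) * 8, padded.hex())
```

A 55-byte message still fits in a single 512-bit block, whereas a 56-byte message needs a second block because the 1-bit and the length field no longer fit in the first one.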

Figure 2.1: Schematic of the SHA-1 algorithm.

Initialize the SHA-1 buffer: The 160-bit buffer consists of five 32-bit registers (A, B, C, D and E) used to store the intermediate and final results of the SHA-1 message digest computation.

Process the message in 16-word blocks: The heart of the algorithm is a module that consists of four rounds of 20 processing steps each. The four rounds have a similar structure, but each uses a different primitive logical function. These logical functions are defined as follows:

\[
f_t(B,C,D) =
\begin{cases}
(B \wedge C) \vee (\neg B \wedge D) & 0 \le t \le 19 \\
B \oplus C \oplus D & 20 \le t \le 39 \\
(B \wedge C) \vee (B \wedge D) \vee (C \wedge D) & 40 \le t \le 59 \\
B \oplus C \oplus D & 60 \le t \le 79
\end{cases}
\tag{2.1}
\]

These rounds take as input the current 512-bit block and the 160-bit buffer value (A, B, C, D and E), and then update the buffer. Each round also makes use of an additive constant $K_t$. In hexadecimal format these are given by:

\[
K_t =
\begin{cases}
\texttt{5A827999} & 0 \le t \le 19 \\
\texttt{6ED9EBA1} & 20 \le t \le 39 \\
\texttt{8F1BBCDC} & 40 \le t \le 59 \\
\texttt{CA62C1D6} & 60 \le t \le 79
\end{cases}
\tag{2.2}
\]

The output of the fourth round is added to the input of the first round, with the addition performed modulo $2^{32}$, to produce the A, B, C, D and E values used to process the next 512-bit block.

Output: After all 512-bit blocks have been processed, the output of the last block is the 160-bit message digest.
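The steps above can be put together in a short Python sketch. The logical functions and the constants follow Equations 2.1 and 2.2; the initial register values and the message expansion into 80 words are not given in the text above and are taken here from the SHA-1 standard, so they should be read as assumptions of this sketch rather than as part of the description.

```python
import hashlib

MASK = 0xFFFFFFFF
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)


def rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def f(t, b, c, d):
    """Round-dependent logical function of Equation 2.1."""
    if t < 20:
        return (b & c) | (~b & d)
    if 40 <= t < 60:
        return (b & c) | (b & d) | (c & d)
    return b ^ c ^ d


def sha1(message: bytes) -> bytes:
    # Initial A..E values from the SHA-1 standard (assumption, not in the text).
    h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
    # Padding and length appending.
    data = message + b"\x80"
    data += b"\x00" * ((56 - len(data) % 64) % 64)
    data += (8 * len(message)).to_bytes(8, "big")
    for off in range(0, len(data), 64):
        w = [int.from_bytes(data[off + 4 * i: off + 4 * i + 4], "big")
             for i in range(16)]
        # Message expansion from 16 to 80 words (from the standard).
        for t in range(16, 80):
            w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
        a, b, c, d, e = h
        for t in range(80):
            temp = (rotl(a, 5) + f(t, b, c, d) + e + K[t // 20] + w[t]) & MASK
            a, b, c, d, e = temp, a, rotl(b, 30), c, d
        # Feed-forward: add the block output to its input, modulo 2^32.
        h = [(x + y) & MASK for x, y in zip(h, (a, b, c, d, e))]
    return b"".join(x.to_bytes(4, "big") for x in h)


print(sha1(b"abc").hex())
print(sha1(b"abc") == hashlib.sha1(b"abc").digest())
```

The comparison against `hashlib.sha1` is only a sanity check of the sketch; it is not meant as a reference implementation.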

2.1.2 SHA-2

The hash functions of the SHA-2 family differ most significantly in the number of security bits provided for the hashed input message. Security is directly related to the message digest length. In most cases, when a hash function is used in conjunction with another cryptographic algorithm, there are special demands that require a hash function with a certain number of security bits. For example, if a message is being signed with a digital signature algorithm that provides 192-bit security, then that signature algorithm requires a secure hash algorithm that provides at least 192-bit security, which in this case means SHA-2(384).

Figure 2.2: Schematic of the SHA-2 algorithm.

The SHA-2 standard supersedes the existing SHA-1, adding four new hash functions, SHA-2(224), SHA-2(256), SHA-2(384) and SHA-2(512), for computing a condensed representation of a message, the message digest. The produced message digest ranges in length from 224 to 512 bits, depending on the selected hash function. These hash functions enable the verification of message integrity: any change to the message will, with very high probability, result in a different message digest. For the hash functions specified in this standard, it is considered computationally infeasible (1) to find a message that corresponds to a given message digest, or (2) to find two different messages that produce the same message digest.

Each hash function operation can be divided into two stages: pre-processing and hash computation. Pre-processing involves padding the input message, parsing the padded data into a number of m-bit blocks and setting the appropriate initial values, which are used in the hash computation. The hash computation uses the padded data along with functions, constants and word-level logical and algebraic operations to iteratively generate a series of hash values. The hash value produced after the specified number of transformation rounds is the message digest. The SHA-2 algorithm is very similar in structure to SHA-1; nevertheless, it uses eight, rather than five, working words (32-bit words for SHA-2(224) and SHA-2(256), 64-bit words for SHA-2(384) and SHA-2(512)), and each message block, taken as 16 words, is expanded into 64 (or 80, for the 64-bit variants) words. SHA-2 operates in the same manner as SHA-1. Taking SHA-2(256) as an example, the message to be hashed is first (1) padded, with its length appended, in such a way that the result is a multiple of 512 bits long, and then (2) parsed into 512-bit message blocks $M^{(1)}, M^{(2)}, \ldots, M^{(N)}$.

The message blocks are processed one at a time, beginning with a fixed initial hash value $H^{(0)}$, sequentially computing $H^{(i)} = H^{(i-1)} + C_{M^{(i)}}(H^{(i-1)})$, where C is the SHA-2 compression function and + means word-wise addition modulo $2^{32}$. $H^{(N)}$ is the hash of M. The compression function is depicted in Figure 2.2 and the respective pseudocode is given in Listing 5.3.
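The iteration over message blocks can be sketched in Python for SHA-2(256). The round function, the message schedule and the derivation of the constants (fractional parts of square and cube roots of the first primes) are not spelled out in the text above; they are taken here from the SHA-2 standard and are assumptions of this sketch. The helper names (`compress`, `icbrt`, `first_primes`) are illustrative.

```python
import hashlib
from math import isqrt

MASK = 0xFFFFFFFF


def first_primes(n):
    primes, cand = [], 2
    while len(primes) < n:
        if all(cand % p for p in primes):
            primes.append(cand)
        cand += 1
    return primes


def icbrt(n):
    """Integer cube root (largest m with m**3 <= n), by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


PRIMES = first_primes(64)
# First 32 fractional bits of the square roots / cube roots of the primes.
H0 = [isqrt(p << 64) & MASK for p in PRIMES[:8]]
K = [icbrt(p << 96) & MASK for p in PRIMES]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def compress(h, block):
    """C_M(H): run the 64 rounds on state h using the message block."""
    w = [int.from_bytes(block[4 * i: 4 * i + 4], "big") for i in range(16)]
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, hh = h
    for t in range(64):
        sig1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (hh + sig1 + ch + K[t] + w[t]) & MASK
        sig0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (sig0 + maj) & MASK
        a, b, c, d, e, f, g, hh = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [a, b, c, d, e, f, g, hh]


def sha256(message: bytes) -> bytes:
    data = message + b"\x80"
    data += b"\x00" * ((56 - len(data) % 64) % 64)
    data += (8 * len(message)).to_bytes(8, "big")
    h = list(H0)
    for off in range(0, len(data), 64):
        # H(i) = H(i-1) + C_M(i)(H(i-1)), word-wise modulo 2^32.
        out = compress(h, data[off: off + 64])
        h = [(x + y) & MASK for x, y in zip(h, out)]
    return b"".join(x.to_bytes(4, "big") for x in h)


print(sha256(b"abc").hex())
print(sha256(b"abc") == hashlib.sha256(b"abc").digest())
```

Keeping the feed-forward addition outside `compress` mirrors the formula term by term: `compress` plays the role of C, and the loop in `sha256` performs the word-wise addition.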



2.1.3 SHA-3

Keccak [11] is a family of hash functions that was submitted to, and won, the NIST hash algorithm competition for SHA-3. It is based on the sponge construction and uses an iterated permutation as its building block.

The sponge construction [12] is a mode of operation, based on a simple iterated construction, for building a function F with variable-length input and arbitrary-length output from a fixed-length permutation or transformation f operating on a fixed number b of bits. Here b is called the width. The sponge construction operates on a state of b = r + c bits. The value r is called the bit rate and c is called the capacity. First, all the bits of the state are initialized to zero. The input message is divided into pieces of r bits each. The construction then proceeds in two phases: the absorbing phase followed by the squeezing phase. In the absorbing phase, the r-bit input message blocks are XORed into the first r bits of the state, interleaved with applications of the function f. When all message blocks are processed, the sponge construction switches to the squeezing phase. In the squeezing phase, the first r bits of the state are returned as output blocks, interleaved with applications of the function f. The number of iterations is determined by the requested number of bits.

Finally, the output is truncated to the requested length. The sponge construction is illustrated in Figure 2.3. The capacity c = b − r determines the attainable security level of the construction [13, 14].

Figure 2.3: Sponge construction

Figure 2.4: Terminology used in Keccak

In order to facilitate the description of the individual mappings, the following naming conventions are used by the authors for parts of the Keccak-f state. The lane length is $w = 2^l$, where l ranges from 0 to 6; hence, there are seven such permutations, where w represents the lane length and b the width of each permutation (b = 25w).


Some of the notions presented in Figure 2.4 are:

• A row is a set of 5 bits with constant y and z coordinates.

• A column is a set of 5 bits with constant x and z coordinates.

• A lane is a set of w bits with constant x and y coordinates.

• A sheet is a set of 5w bits with constant x coordinate.

• A plane is a set of 5w bits with constant y coordinate.

• A slice is a set of 25 bits with constant z coordinate.

Figure 2.5: The step mappings of Keccak-f

In Keccak, the underlying function is a permutation chosen from a set of seven Keccak-f permutations, denoted Keccak-f[b], where b ∈ {25, 50, 100, 200, 400, 800, 1600} is the width of the permutation. It consists of the iteration of a simple round function, similar to a block cipher without a key schedule. The nominal version of Keccak-f operates on a 1600-bit state. The choice of operations is limited to bitwise XOR, AND, NOT and rotations. There is no need for table lookups, arithmetic operations or data-dependent rotations. The state is organized as an array of 5 × 5 lanes, each of length w ∈ {1, 2, 4, 8, 16, 32, 64} (b = 25w).

The Keccak round is described in Listing 2.1 and detailed in Figure 2.5. The number of rounds $n_r$ depends on the permutation width and is given by $n_r = 12 + 2l$, where $2^l = w$. This gives 24 rounds for Keccak-f[1600], since $12 + 2 \cdot 6 = 24$.
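As a quick check of these relations, the following Python sketch tabulates the lane length, permutation width and number of rounds for the seven Keccak-f permutations. The function name `keccak_f_parameters` is illustrative only.

```python
def keccak_f_parameters():
    """Return (l, w, b, n_r) for the seven Keccak-f permutations."""
    rows = []
    for l in range(7):
        w = 2 ** l          # lane length in bits
        b = 25 * w          # permutation width in bits
        n_r = 12 + 2 * l    # number of rounds
        rows.append((l, w, b, n_r))
    return rows


for l, w, b, n_r in keccak_f_parameters():
    print(f"l={l}  w={w:2d}  b={b:4d}  rounds={n_r}")
```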

In Listing 2.1, A denotes the complete permutation state array, and A[x, y] denotes a particular lane in that state. B[x, y], C[x] and D[x] are intermediate variables. The constants r[x, y] are the rotation offsets, while RC[i] are the round constants. rot(W, r) is a bitwise cyclic shift operation, moving the bit at position i into position i + r (modulo the lane size).

Listing 2.1: Keccak Main loop, Pseudocode

Keccak-f[b](A) {
  for all i in 0...n_r-1
    A = Round[b](A, RC[i])
  return A
}

Round[b](A,RC) {
  // all index arithmetic on x and y is modulo 5

  // θ step
  for all x in 0...4
    C[x] = A[x,0] xor A[x,1] xor A[x,2] xor A[x,3] xor A[x,4]
  for all x in 0...4
    D[x] = C[x-1] xor rot(C[x+1],1)
  for all (x,y) in (0...4, 0...4)
    A[x,y] = A[x,y] xor D[x]

  // ρ and π steps
  for all (x,y) in (0...4, 0...4)
    B[y,2*x+3*y] = rot(A[x,y], r[x,y])

  // χ step
  for all (x,y) in (0...4, 0...4)
    A[x,y] = B[x,y] xor ((not B[x+1,y]) and B[x+2,y])

  // ι step
  A[0,0] = A[0,0] xor RC

  return A
}
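Listing 2.1 translates almost line by line into Python for the nominal width b = 1600. In the sketch below, the rotation offsets r[x, y] and the round constants RC[i] are generated with the rules given in the Keccak reference (a triangular-number walk and an 8-bit LFSR), and the sponge is instantiated as SHA3-256 (r = 1088, c = 512, padding byte 0x06 ... 0x80). None of these specifics appear in the text above, so they are assumptions of this sketch, and the function names are illustrative.

```python
import hashlib

MASK64 = (1 << 64) - 1


def rot(w, r):
    """Cyclic left shift of a 64-bit lane by r positions."""
    r %= 64
    return ((w << r) | (w >> (64 - r))) & MASK64


def rotation_offsets():
    r = [[0] * 5 for _ in range(5)]
    x, y = 1, 0
    for t in range(24):
        r[x][y] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return r


def round_constants():
    constants, lfsr = [], 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) & 0xFF
            if lfsr & 2:
                rc ^= 1 << ((1 << j) - 1)
        constants.append(rc)
    return constants


R = rotation_offsets()
RC = round_constants()


def keccak_round(A, rc):
    # theta step
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rot(C[(x + 1) % 5], 1) for x in range(5)]
    A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
    # rho and pi steps
    B = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            B[y][(2 * x + 3 * y) % 5] = rot(A[x][y], R[x][y])
    # chi step
    A = [[B[x][y] ^ (~B[(x + 1) % 5][y] & B[(x + 2) % 5][y])
          for y in range(5)] for x in range(5)]
    # iota step
    A[0][0] ^= rc
    return A


def keccak_f1600(A):
    for rc in RC:
        A = keccak_round(A, rc)
    return A


def sha3_256(message: bytes) -> bytes:
    rate = 136  # r = 1088 bits, so c = 1600 - 1088 = 512 bits
    padded = bytearray(message) + bytes(rate - len(message) % rate)
    padded[len(message)] ^= 0x06   # domain separation bits and first pad bit
    padded[-1] ^= 0x80             # final pad bit
    A = [[0] * 5 for _ in range(5)]
    # Absorbing phase: XOR each r-bit block into the state, then apply f.
    for off in range(0, len(padded), rate):
        for i in range(rate // 8):
            lane = int.from_bytes(padded[off + 8 * i: off + 8 * i + 8], "little")
            A[i % 5][i // 5] ^= lane
        A = keccak_f1600(A)
    # Squeezing phase: 256 output bits fit in the first r bits of the state.
    return b"".join(A[i][0].to_bytes(8, "little") for i in range(4))


print(sha3_256(b"").hex())
print(sha3_256(b"abc") == hashlib.sha3_256(b"abc").digest())
```

Because the requested output (256 bits) is shorter than the bit rate, a single squeeze suffices and no further application of f is needed after absorbing.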


2.2 Symmetric Encryption

Symmetric encryption, also referred to as symmetric-key, secret-key, or single-key encryption, is one of the oldest and best understood cryptographic primitives, representing the beginnings of this field of knowledge. Caesar and his cipher, the Germans and Enigma, and the Japanese and Purple are all examples of symmetric encryption. It is an encryption system in which the sender and receiver of a message share a single, common key that is used to cipher and decipher the message. A secret key, which can be a number, a word or just a string of random letters, is applied to the text of a message to change the content in a particular way. This might be as simple as shifting each letter by a number of places in the alphabet. As long as both sender and recipient know the secret key, they can cipher and decipher all messages that use the same key.

The algorithms used for encryption with this type of scheme can be either block cipher algorithms or stream cipher algorithms. The former encrypt a block of data at a time, while the latter cipher data on a character-by-character basis. A block cipher takes a fixed-length block of plaintext of b bits and a key as input and produces a b-bit block of ciphertext. Using the basic block cipher with the same key, two instances of the same input block will give the same output block. The block cipher encrypts a block of the message, or plaintext, p into a block of ciphertext c under the action of a secret key k. This is usually denoted as c = ENC(k, p). The exact form of the encryption transformation is determined by the selection of the block cipher and the value of the key k. The process of encryption is reversed by decryption, which uses the same user-supplied key. This is denoted as p = DEC(k, c). When multiple blocks of plaintext are encrypted using the same key, a number of security issues arise. To apply a block cipher in a variety of applications, five modes of operation have been defined by NIST in SP 800-38A [15], and are discussed in Section 2.2.3. Although specific modes of operation can effectively turn a block cipher into a stream cipher, the use of this type of cipher is out of the scope of this work.
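The notation c = ENC(k, p) and p = DEC(k, c) can be illustrated with the letter-shifting cipher mentioned above. This is a toy Python sketch for the notation only, with no security value; the function names `enc` and `dec` and the restriction to uppercase letters are choices made here for illustration.

```python
import string

ALPHABET = string.ascii_uppercase


def enc(k: int, p: str) -> str:
    """Shift every uppercase letter of p by k places; other characters pass through."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + k) % 26] if ch in ALPHABET else ch
        for ch in p
    )


def dec(k: int, c: str) -> str:
    """Decryption uses the same key and shifts in the opposite direction."""
    return enc(-k, c)


c = enc(3, "ATTACK AT DAWN")
print(c)
print(dec(3, c))
```

The repeated letters in the ciphertext also show the property noted above: under the same key, identical inputs always produce identical outputs.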

2.2.1 AES

The Advanced Encryption Standard (AES) is the most widely used symmetric-key cipher today, although the term Standard in its name only refers to US government applications. Being the first publicly accessible and open cipher approved by the NSA for top secret information, the AES block cipher is also mandatory in several industry standards and is used in many commercial systems. Among the commercial standards that include AES are the Internet security standards IPsec and TLS, the Wi-Fi encryption standard IEEE 802.11i, the Secure Shell (SSH) network protocol and numerous security products around the world.

It is worth mentioning the process by which the AES algorithm was selected. In 1997, NIST called for proposals for the new standard. Designers had a clean slate in terms of putting together a cipher matching the requirements specified by NIST, and these requirements were few but clear:

• The cipher should be a single block cipher with a 128-bit block size

• The cipher should be available royalty-free worldwide

• The cipher should support three key lengths: 128, 192 and 256 bits

• The cipher should offer the security of two-key triple-Data Encryption Standard (DES) as a minimum


Fifteen different algorithms were submitted from all over the world. Following the first round of submission, the different algorithms were analyzed for their security and performance. Two AES conferences were held, one in 1998 and one in 1999, and papers with results and conclusions were published on the security and other properties of the submitted schemes. Following the second AES conference, the pool was reduced to five finalists and the second round began. A third AES conference was then held, inviting additional scrutiny of the five finalists. Finally, in October 2000, NIST announced that the winning algorithm was Rijndael [16]; the standard was published in November 2001 as FIPS PUB 197 [17].

This process was ingenious because any group who submitted an algorithm and was, therefore, interested

in having their algorithm adopted, had a strong motivation to attack all the other submissions. Note that the

motivation was not financial because the winning submission could not be patented.

The AES cipher is a subset of the Rijndael block cipher. The Rijndael block and key size vary between 128,

192 and 256 bits. However, the AES standard only calls for a block size of 128 bits. Hence, only Rijndael with

a block length of 128 bits is known as the AES algorithm.

2.2.1.1 Structure

Today, AES-128 is predominant and supported by most hardware implementations. The central design

principle of the AES algorithm is simplicity. The simplicity is realized by two means: the adoption of symmetry

at different levels and the choice of basic operations. The first level of symmetry lies in the fact that the AES

algorithm encrypts 128-bit blocks of plaintext by repeatedly applying the same round transformation, outlined

in Figure 2.6. AES-128 applies the round transformation 10 times; AES-192 uses 12, and AES-256 uses 14

iterations.

Figure 2.6: Basic structure of the AES algorithm: encryption (left), decryption (right).

The basic operations used in the AES algorithm can all be described very easily in terms of operations over the finite field (Galois field, GF) $GF(2^8)$. The round transformation modifies the 128-bit state. The initial state is the input plaintext and the final state after the last round transformation is the output ciphertext. The state is organized as a 4 × 4 square matrix of bytes. This is the best way to visualize the internal state and the operations performed on it, and it is also the origin of the name of the cipher on which Rijndael is based, SQUARE [18].

The round transformation scrambles the bytes of the state either individually, row-wise, or column-wise by applying the functions SubBytes, ShiftRows, MixColumns, and AddRoundKey sequentially. An initial AddRoundKey operation precedes the first round. The last round differs slightly from the others: the MixColumns operation is omitted. The functions of the round transformation are linear and non-linear operations that are reversible, to allow decryption using their inverses. Every function affects all bytes of the state. The function SubBytes is the only non-linear function in AES. It substitutes all bytes of the state using a table lookup. The content of the table can be computed by a finite field inversion followed by an affine transformation in the binary extension field $GF(2^8)$. The resulting lookup table is often called an S-Box. The same S-Box is used for all 16 bytes of the state.

The ShiftRows function is a simple operation. It rotates the rows of the state to the left by an offset equal to the row index: the first row is not shifted at all, and the last row is shifted 3 bytes to the left. The MixColumns function accesses the state column-wise, working on each column in the same way. It interprets a column as a polynomial over $GF(2^8)$ with degree < 4. The state bytes are the coefficients of the polynomial. The output column corresponds to the polynomial obtained by multiplying by a constant polynomial and reducing the result modulo $x^4 + 1$.

The AddRoundKey function is a 128-bit XOR operation that adds a round key to the state. A new round key is derived for every iteration from the previous round key. The initial round key is equal to the original secret key. The computation of the round keys is based on the SubBytes function and additionally uses some simple byte-level operations like XOR.

Decryption recovers the original plaintext from a ciphertext. During decryption, the AES algorithm reverses encryption by executing the inverse round transformations in reverse order. The round transformation of decryption uses the functions AddRoundKey, InvMixColumns, InvShiftRows, and InvSubBytes, in this order. AddRoundKey is its own inverse function because the XOR function is its own inverse. The round keys have to be computed in reverse order. InvMixColumns needs a different constant polynomial than MixColumns does. InvShiftRows rotates to the right instead of to the left. InvSubBytes reverses the S-Box lookup table by an inverse affine transformation followed by the same inversion over $GF(2^8)$ that was used for encryption.

The following Sections describe in more detail each of the previously mentioned components of AES.

2.2.1.2 SubBytes

As shown in Figure 2.6, the first layer in each round is the Byte Substitution layer. The Byte Substitution layer can be viewed as a row of 16 parallel S-Boxes, each with an 8-bit input and an 8-bit output. The 16 input bytes are substituted by looking up a fixed table. AES defines a 16 × 16 matrix of byte values, called an S-box, that contains a permutation of all 256 possible 8-bit values. Each individual byte of the state is mapped into a new byte in the following way. The leftmost 4 bits of the byte are used as a row value and the rightmost 4 bits are used as a column value. These row and column values serve as indexes into the S-box to select a unique 8-bit output value. For example, the hexadecimal value 78 references row 7, column 8 of the S-box, which contains the value BC. The 16 new bytes that result are arranged in a square consisting of four rows and four columns, as illustrated in Figure 2.7. Note that all 16 S-Boxes are identical: each state byte $a_i$ is replaced, i.e., substituted, by another byte $b_i$, with $S(a_i) = b_i$.

Figure 2.7: Substitute byte transformation

The S-Box is the only nonlinear element of AES, i.e., it holds that $S(a) + S(b) \neq S(a + b)$ for two states a and b. The S-Box substitution is a bijective mapping, i.e., each of the 256 possible input elements is mapped one-to-one to an output element. Even though the S-Box is bijective, it does not have any fixed points, i.e., there are no input values $a_i$ such that $S(a_i) = a_i$. Even the zero input is not a fixed point: S(00) = 63.

The inverse substitute byte transformation, called InvSubBytes, makes use of the inverse S-box (IS). For example, the input BC to the inverse S-box produces the output 78, just as the input 78 to the S-box produces BC. Of course, the S-box must be invertible, that is, IS[S(a)] = a. However, the S-box is not self-inverse, in the sense that it is not true that S(a) = IS(a). For example, S(78) = BC, but IS(78) = C1.
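The properties listed above can be checked by building the S-box from its definition, a multiplicative inversion in $GF(2^8)$ followed by an affine transformation. The Python sketch below assumes the standard AES reduction polynomial $x^8 + x^4 + x^3 + x + 1$ and the affine constant 63, and computes the inverse naively as $a^{254}$; the helper names are illustrative.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse computed as a^254; the input 0 maps to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    return ((x << n) | (x >> (8 - n))) & 0xFF


def affine(x: int) -> int:
    """Affine transformation over GF(2) with the constant 0x63."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


SBOX = [affine(gf_inv(a)) for a in range(256)]
INV_SBOX = [0] * 256
for value, substituted in enumerate(SBOX):
    INV_SBOX[substituted] = value

print(f"S(00) = {SBOX[0x00]:02X}")
print(f"S(78) = {SBOX[0x78]:02X}, IS(78) = {INV_SBOX[0x78]:02X}")
print("fixed points:", [a for a in range(256) if SBOX[a] == a])
```

A lookup table built once in this way is what a table-based SubBytes would index with the row and column nibbles described above.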

2.2.1.3 ShiftRows

In this step, each of the four rows of the square resulting from the byte substitution process is cyclically shifted to the left. More precisely, the first row is not altered, the second row is shifted one byte position to the left, the third row is shifted two positions to the left and the fourth row is shifted three positions to the left, as depicted in Figure 2.8.

Figure 2.8: ShiftRows transformation

The result is a new square consisting of the same 16 bytes, with the property that all the entries that used to

be in one column have been moved so that they now lie in different columns.

The inverse shift row transformation, called InvShiftRows, performs the circular shifts in the opposite di-

rection for each of the last three rows, with a one-byte circular right shift for the second row, and so on.
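The following Python sketch illustrates ShiftRows and InvShiftRows on a state stored as a list of four rows. The row-major list representation and the function names are choices made here for illustration.

```python
def shift_rows(state):
    """Rotate row i of the 4 x 4 state left by i byte positions."""
    return [row[i:] + row[:i] for i, row in enumerate(state)]


def inv_shift_rows(state):
    """Rotate row i of the 4 x 4 state right by i byte positions."""
    n = 4
    return [row[n - i:] + row[:n - i] for i, row in enumerate(state)]


# Entry 10*r + c marks the byte that starts in row r, column c.
state = [[0, 1, 2, 3],
         [10, 11, 12, 13],
         [20, 21, 22, 23],
         [30, 31, 32, 33]]

shifted = shift_rows(state)
for row in shifted:
    print(row)
```

In the shifted state, the four bytes that started in column 0 (0, 10, 20 and 30) end up in four different columns, which is the property described above.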

2.2.1.4 MixColumns

The MixColumns step is a linear transformation which mixes each column of the state matrix. Operating on each column individually, each byte of a column is mapped into a new value that is a function of all four bytes in that column. To do this, we view the column as a 4 × 1 column vector of entries in $GF(2^8)$ and multiply this column vector by the 4 × 4 $GF(2^8)$ matrix M, where:

\[
M =
\begin{pmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{pmatrix}
\tag{2.3}
\]

The output is a 4 × 1 column vector of entries in $GF(2^8)$ which replaces the column being processed (see Figure 2.9).

Figure 2.9: MixColumns transformation

The inverse mix column transformation, called InvMixColumns, is defined by multiplication with the following matrix:

\[
M' =
\begin{pmatrix}
0E & 0B & 0D & 09 \\
09 & 0E & 0B & 0D \\
0D & 09 & 0E & 0B \\
0B & 0D & 09 & 0E
\end{pmatrix}
\tag{2.4}
\]

The additions in the vector-matrix multiplication are $GF(2^8)$ additions, which are simple bitwise XORs of the respective bytes. For the multiplications, MixColumns only requires multiplication by the constants 01, 02 and 03. These are quite efficient, and in fact, the three constants were chosen such that software implementation is easy. Multiplication by 01 is multiplication by the identity and does not involve any explicit operation. Multiplication by 02 and 03 can be done through table lookup in two tables of 256 8-bit entries. As an alternative, multiplication by 02 can also be implemented as a multiplication by x, which is a left shift by one bit followed by a modular reduction with $P(x) = x^8 + x^4 + x^3 + x + 1$. Similarly, multiplication by 03, which represents the polynomial (x + 1), can be implemented by a left shift by one bit and the addition of the original value, followed by a modular reduction with P(x).
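The shift-and-reduce approach can be sketched in Python as follows. The function `xtime` implements multiplication by 02, `mix_column` applies the matrix M of Equation 2.3 to one column, and `inv_mix_column` applies M′ of Equation 2.4 using a general $GF(2^8)$ multiplication. The function names are illustrative, and the two input columns are commonly used MixColumns example values.

```python
def xtime(a: int) -> int:
    """Multiply by 02: left shift by one bit, then reduce with P(x)."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B  # x^8 + x^4 + x^3 + x + 1
    return a


def mul3(a: int) -> int:
    """Multiply by 03 = (x + 1): shifted value XOR original value."""
    return xtime(a) ^ a


def gf_mul(a: int, b: int) -> int:
    """General GF(2^8) multiplication built from xtime."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def mix_column(col):
    a0, a1, a2, a3 = col
    return [xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
            a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
            a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
            mul3(a0) ^ a1 ^ a2 ^ xtime(a3)]


INV_M = [[0x0E, 0x0B, 0x0D, 0x09],
         [0x09, 0x0E, 0x0B, 0x0D],
         [0x0D, 0x09, 0x0E, 0x0B],
         [0x0B, 0x0D, 0x09, 0x0E]]


def inv_mix_column(col):
    out = []
    for row in INV_M:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)
        out.append(acc)
    return out


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
print([f"{b:02X}" for b in mixed])
print(inv_mix_column(mixed) == [0xDB, 0x13, 0x53, 0x45])
```

The forward direction needs only `xtime` and XOR, whereas the larger constants of M′ make the inverse more expensive; this sketch simply falls back to the general multiplication for that case.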



2.2.1.5 AddRoundKey

At each round, the state array is combined with a same-sized array of subkey material, which can be considered the round key. This is derived from the user-supplied key using a key schedule that will be described in Section 2.2.1.6. The 128 bits of the state are bitwise XORed with the 128 bits of the round key. This is illustrated in Figure 2.10 for the element (0, 3), representing the state and the round key as 4 × 4 matrices. Denote by $a_{i,j}$ the byte in row i and column j of the state array, and likewise by $k_{i,j}$ the analogous byte in the key array.

Figure 2.10: Add round key transformation

Then the AddRoundKey step consists of computing $b_{i,j} = a_{i,j} \oplus k_{i,j}$ for every element in the matrix. The inverse add round key transformation is identical to the add round key transformation, because the XOR operation is its own inverse.
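A minimal Python sketch of this step, with the state and the round key both stored as 4 × 4 lists of bytes (a representation chosen here for illustration), also shows that applying the same round key twice restores the original state. The byte values are arbitrary example data.

```python
def add_round_key(state, round_key):
    """Compute b[i][j] = a[i][j] XOR k[i][j] for every element."""
    return [[a ^ k for a, k in zip(state_row, key_row)]
            for state_row, key_row in zip(state, round_key)]


state = [[0x32, 0x88, 0x31, 0xE0],
         [0x43, 0x5A, 0x31, 0x37],
         [0xF6, 0x30, 0x98, 0x07],
         [0xA8, 0x8D, 0xA2, 0x34]]
round_key = [[0x2B, 0x28, 0xAB, 0x09],
             [0x7E, 0xAE, 0xF7, 0xCF],
             [0x15, 0xD2, 0x15, 0x4F],
             [0x16, 0xA6, 0x88, 0x3C]]

mixed = add_round_key(state, round_key)
print(add_round_key(mixed, round_key) == state)
```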

2.2.1.6 Key Schedule

The key schedule takes the original input key of length 128, 192 or 256 bits and derives the subkeys used in AES. Note that an XOR addition of a subkey is used both at the input and output of AES. This process is sometimes referred to as key whitening. The number of subkeys is equal to the number of rounds plus one, due to the key needed for key whitening in the first key addition layer (Figure 2.6). Thus, for a key length of 128 bits, the number of rounds is Nr = 10, and there are 11 subkeys, each of 128 bits. AES with a 192-bit key requires 13 subkeys of 128 bits, and AES with a 256-bit key requires 15 subkeys. The AES subkeys are computed recursively, i.e., in order to derive subkey $k_i$, subkey $k_{i-1}$ must be known. The AES key schedule is word oriented, where 1 word = 32 bits. Subkeys are stored in a key expansion array w that consists of words. The expansion of the input key into the key schedule proceeds according to the pseudocode in Listing 2.2.

Listing 2.2: AES Key Expansion Algorithm, Pseudocode

KeyExpansion(byte key[4*Nk], word w[Nb*(Nr+1)], Nk) {
  word temp
  i = 0
  while (i < Nk) {
    w[i] = word(key[4*i], key[4*i+1], key[4*i+2], key[4*i+3])
    i = i + 1
  }
  i = Nk
  while (i < Nb * (Nr+1)) {
    temp = w[i-1]
    if (i mod Nk = 0) {
      temp = SubWord(RotWord(temp)) xor Rcon[i/Nk]
    } else if (Nk > 6 and i mod Nk = 4) {
      temp = SubWord(temp)
    }
    w[i] = w[i-Nk] xor temp
    i = i + 1
  }
}

RotWord performs a one-byte circular left shift on a word and SubWord performs a byte substitution on each byte of its input word, using the S-box. The Key Expansion generates a total of Nb(Nr + 1) words: the algorithm requires an initial set of Nb words, and each of the Nr rounds requires Nb words of key data. From Listing 2.2, it can be seen that the first Nk words of the expanded key are filled with the Cipher Key. Every following word, $w_i$, is equal to the XOR of the previous word, $w_{i-1}$, and the word Nk positions earlier, $w_{i-Nk}$. For words in positions that are a multiple of Nk, a transformation is applied to $w_{i-1}$ prior to the XOR, followed by an XOR with a round constant, Rcon[i/Nk].

It is important to note that the Key Expansion routine for 256-bit Cipher Keys (Nk = 8) is slightly different than for 128- and 192-bit Cipher Keys. If Nk = 8 and i − 4 is a multiple of Nk, then SubWord is applied to $w_{i-1}$ prior to the XOR.
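Listing 2.2 can be turned into a runnable Python sketch. The S-box is rebuilt from its $GF(2^8)$ definition so that the block is self-contained, and the round constant is assumed to be the word with the byte $x^{j-1}$ in its most significant position, as in the AES standard (this is not stated in the text above). The example key is the 128-bit example key of the AES standard; the helper names are illustrative.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def make_sbox():
    def rotl8(x, n):
        return ((x << n) | (x >> (8 - n))) & 0xFF
    sbox = []
    for a in range(256):
        inv = 1
        for _ in range(254):          # a^254 is the inverse; 0 maps to 0
            inv = gf_mul(inv, a)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = make_sbox()


def rot_word(word: int) -> int:
    """One-byte circular left shift of a 32-bit word."""
    return ((word << 8) | (word >> 24)) & 0xFFFFFFFF


def sub_word(word: int) -> int:
    """Apply the S-box to each of the four bytes of a word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def key_expansion(key: bytes):
    nk, nb = len(key) // 4, 4
    nr = nk + 6                       # 10, 12 or 14 rounds
    w = [int.from_bytes(key[4 * i: 4 * i + 4], "big") for i in range(nk)]
    rcon = 0x01                       # x^0; doubled in GF(2^8) after each use
    for i in range(nk, nb * (nr + 1)):
        temp = w[i - 1]
        if i % nk == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 0x02)
        elif nk > 6 and i % nk == 4:
            temp = sub_word(temp)
        w.append(w[i - nk] ^ temp)
    return w


key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
words = key_expansion(key)
print(len(words), f"{words[4]:08x}", f"{words[43]:08x}")
```

Grouping the returned words four at a time gives the Nr + 1 round keys, the first of which is the cipher key itself for a 128-bit key.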

2.2.2 CLEFIA

CLEFIA was developed by SONY Corporation and presented for the first time at the Fast Software Encryption 2007 conference [19]. It is advertised as a fast encryption algorithm in both software and hardware and is claimed to be highly secure. Originally designed for Digital Rights Management (DRM) purposes, this algorithm has become notable for its lightweight efficiency. These and several other attractive features of CLEFIA have been widely recognized, and the cipher has been submitted to, and standardized by, several bodies. CLEFIA was submitted to the Internet Engineering Task Force (IETF) [20] and has been part of an ISO/IEC International Standard [21] since 2012, making it one of only two lightweight block ciphers recommended by the ISO/IEC standards. It is also a candidate for the ciphers list of CRYPTREC, the Japanese government standardization body, and is in the final stage of evaluation before becoming a CRYPTREC standard.

2.2.2.1 Structure

The CLEFIA algorithm is a symmetric block cipher with a 128 bit block and a key size of 128, 192, or 256 bits. As in AES and most current block ciphers, it is divided into two parts, a Key Scheduling Part and a Data Path computed in multiple rounds, employing a relatively homogeneous algorithm.

This regularity facilitates the development of compact structures, allowing it to be easily deployed in platforms

with limited resources.

In CLEFIA, several design techniques can be found that are considered state of the art in current cryptographic algorithms, namely: Whitening Keys, a technique used to improve the security of iterated block ciphers, consisting of steps that combine the data with portions of the key before the first round and after the last round; Feistel Structures, which are the most widely used and best studied structures for the design of block ciphers, initially proposed by Horst Feistel in the early 1970s and adopted by well-known block ciphers such as DES; and a Diffusion Switching Mechanism, consisting of the use of multiple diffusion matrices in a predetermined order, to ensure immunity against differential and linear attacks [22–24].

Figure 2.11: CLEFIA datapath (encryption on the left-hand side, decryption on the right-hand side)

The different key sizes that can be used in CLEFIA (128, 192, or 256 bits) directly influence the number of computed rounds: 18, 22, or 26, respectively. CLEFIA employs a generalized Feistel structure with four data branches computed over several rounds, denoted as the Generalized Feistel Network GFN_{4,n}, where n represents the number of rounds. This structure is an extended version of the traditional two-branch Feistel structure. As in most ciphering algorithms, operations on data consist of byte swapping, byte substitution, and arithmetic operations. The following describes the main operations performed in the CLEFIA algorithm.

The encryption process in CLEFIA mostly consists of the GFN_{4,n}: in each round the data is mixed and combined with the round keys using bitwise XOR additions, bit permutations, and two F-functions. These F-functions operate in parallel within a round, and their inputs and outputs are 32 bits long. Due to this four-branch structure, each F-function is smaller than in a two-branch Feistel structure, which requires a double-sized input. This characteristic of the four-branch structure permits a more efficient implementation of these F-functions in both hardware and software. The input P and the output C are 128 bit values, while the RKs are the 2n round keys of 32 bits each. A full 128 bit block CLEFIA encryption also requires the addition of four 32 bit whitening keys WKi. Two 32 bit whitening keys are added before the GFN_{4,n} computation and two more are added after all the GFN_{4,n} rounds are computed, as depicted on the left-hand side of Figure 2.11. Denote by P = P0|P1|P2|P3 a 128 bit plaintext, where each Pi, i = 0, 1, 2, 3, is a 32 bit vector, and by C the corresponding ciphertext.

Given the Feistel network based structure of this algorithm, the decryption process is identical to the encryption one, using the same computational units and differing only in the order in which the operations are performed, as depicted on the right-hand side of Figure 2.11. This inverse computation is achieved by feeding the round and whitening keys in the inverse order, allowing the same computational structure to be used.
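The four-branch structure and the key-reversal property described above can be sketched in a few lines of Python. This is a structural illustration only: toy_f0 and toy_f1 are arbitrary stand-ins invented for this sketch and are not the CLEFIA F-functions, the key values in the usage note are arbitrary rather than the output of the CLEFIA key schedule, and all function names are assumptions.

```python
MASK32 = 0xFFFFFFFF


def toy_f0(rk, x):
    """Stand-in for F0 (NOT the real CLEFIA F-function)."""
    return ((x ^ rk) * 0x9E3779B1 + 1) & MASK32


def toy_f1(rk, x):
    """Stand-in for F1 (NOT the real CLEFIA F-function)."""
    v = x ^ rk
    return (((v << 7) | (v >> 25)) ^ 0xA5A5A5A5) & MASK32


def gfn4(round_keys, words, inverse=False):
    """Four-branch generalized Feistel network over n = len(round_keys)/2 rounds."""
    t0, t1, t2, t3 = words
    n = len(round_keys) // 2
    for i in range(n):
        r = n - 1 - i if inverse else i   # round keys are fed in reverse order
        t1 ^= toy_f0(round_keys[2 * r], t0)
        t3 ^= toy_f1(round_keys[2 * r + 1], t2)
        if inverse:
            t0, t1, t2, t3 = t3, t0, t1, t2
        else:
            t0, t1, t2, t3 = t1, t2, t3, t0
    # Undo the word rotation applied after the last round.
    return (t1, t2, t3, t0) if inverse else (t3, t0, t1, t2)


def encrypt(round_keys, wk, p):
    """Whitening with WK0, WK1, then the GFN, then whitening with WK2, WK3."""
    y = gfn4(round_keys, (p[0], p[1] ^ wk[0], p[2], p[3] ^ wk[1]))
    return (y[0], y[1] ^ wk[2], y[2], y[3] ^ wk[3])


def decrypt(round_keys, wk, c):
    """Same structure as encrypt, with the keys applied in the inverse order."""
    y = gfn4(round_keys, (c[0], c[1] ^ wk[2], c[2], c[3] ^ wk[3]), inverse=True)
    return (y[0], y[1] ^ wk[0], y[2], y[3] ^ wk[1])
```

With 36 round keys (18 rounds, as for a 128 bit key), decrypt(rks, wk, encrypt(rks, wk, p)) returns p for any choice of F-functions, since each round only XORs an F-function output onto a branch that the F-function does not read.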

2.2.2.2 F-Functions

The two F-functions F0 and F1 consist of Round Key (RK) addition, four nonlinear 8 bit S-boxes, and a

diffusion matrix. The construction of F0 and F1 is shown in Figure 2.12.

Figure 2.12: Schematic of the CLEFIA F functions. In F0, the four key-added input bytes pass through the S-boxes S0, S1, S0, S1 and then the diffusion matrix M0; in F1, they pass through S1, S0, S1, S0 and then M1.

Two kinds of S-boxes, S0 and S1, are employed, and the order of these S-boxes is different in F0 and F1. S0 is an 8 bit S-box based on 4 bit S-boxes, and S1 is an 8 bit S-box based on the inverse function over GF(2^8). For the 8 bit input of an S-box, the upper 4 bits indicate a row and the lower 4 bits indicate a column. By combining these two S-boxes with different algebraic characteristics, immunity against byte oriented algebraic attacks is expected to be strong.

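The diffusion matrices of F0 and F1 are also different, as shown below.

\begin{equation}
M_0 = \begin{pmatrix}
\mathtt{01} & \mathtt{02} & \mathtt{04} & \mathtt{06} \\
\mathtt{02} & \mathtt{01} & \mathtt{06} & \mathtt{04} \\
\mathtt{04} & \mathtt{06} & \mathtt{01} & \mathtt{02} \\
\mathtt{06} & \mathtt{04} & \mathtt{02} & \mathtt{01}
\end{pmatrix},
\qquad
M_1 = \begin{pmatrix}
\mathtt{01} & \mathtt{08} & \mathtt{02} & \mathtt{0A} \\
\mathtt{08} & \mathtt{01} & \mathtt{0A} & \mathtt{02} \\
\mathtt{02} & \mathtt{0A} & \mathtt{01} & \mathtt{08} \\
\mathtt{0A} & \mathtt{02} & \mathtt{08} & \mathtt{01}
\end{pmatrix}
\tag{2.5}
\end{equation}

The matrix M0 is used in F0 and M1 in F1. The multiplications between these matrices and vectors are performed in GF(2^8) defined by the primitive polynomial z^8 + z^4 + z^3 + z^2 + 1.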

Each one of the four 8 bit input lines is multiplied by the values in each line of the matrix and additions

are performed at the end to finish the operations on these matrices. The constant values used on these matrices

suggest some simplifications for the operations needed in these diffusion matrices, as proposed in [25].
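As a minimal sketch of the matrix-vector products described above, the following Python code multiplies a column of four bytes by M0 or M1 in GF(2^8) with the polynomial z^8 + z^4 + z^3 + z^2 + 1 (0x11D). The function names gf_mul and mat_vec are assumptions made for this illustration; the code is not the optimized formulation proposed in [25].

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo z^8 + z^4 + z^3 + z^2 + 1 (0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


M0 = [[0x01, 0x02, 0x04, 0x06],
      [0x02, 0x01, 0x06, 0x04],
      [0x04, 0x06, 0x01, 0x02],
      [0x06, 0x04, 0x02, 0x01]]

M1 = [[0x01, 0x08, 0x02, 0x0A],
      [0x08, 0x01, 0x0A, 0x02],
      [0x02, 0x0A, 0x01, 0x08],
      [0x0A, 0x02, 0x08, 0x01]]


def mat_vec(matrix, column):
    """Multiply a 4x4 matrix by a column of four bytes; additions are XORs."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gf_mul(coeff, byte)
        out.append(acc)
    return out
```

Every coefficient in both matrices is a power of two or the XOR of two powers of two (06 = 04 xor 02, 0A = 08 xor 02), so each product reduces to a few doublings and XORs. This is one way to read the simplification hinted at above.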

2.2.2.3 Key Schedule

The key schedule of CLEFIA is complex compared to that of AES. As stated in Section 2.2.2.1, the CLEFIA algorithm supports input keys of 128, 192, and 256 bits. However, the ciphering process itself requires several 32 bit round keys and whitening keys. This means that the input key needs to be expanded into 36, 44, or 52 round keys, respectively (two per round), plus four whitening keys. This expansion is carried out by the key scheduling part of the CLEFIA algorithm. The calculation of the round keys is performed by feeding the initial key value through a processing network (GFN) similar to the one depicted in Figure 2.11, which is used to cipher the data. An intermediate key is first produced from the master key; the round keys are then obtained from the intermediate key using XOR operations and bit permutations. The resulting values are the round keys used in the ciphering data path.

2.2.3 Modes of Operation

A block cipher on its own is rather limited. It takes a b bit string and outputs a b bit string under the action

of a secret key. When a block cipher is standardized for widespread use, the need to know how it should be used

arises. In essence, a mode of operation is a technique for enhancing the effect of a cryptographic algorithm or

adapting the algorithm for an application, such as applying a block cipher to a sequence of data blocks or a data

stream.

Shortly after the publication of DES, four modes of operation for DES were specified in FIPS 81 [26]. The modes described in that document are often considered the standard modes of block cipher operation. Similarly, after establishing the AES in 2001 [17], NIST published guidance on how the cipher should be used in Special Publication SP800-38A [15], which updated the previously specified modes and added a fifth mode, Counter Mode.

These modes are intended to cover a wide variety of encryption applications for which a block cipher could be used, and they can be used with any symmetric block cipher. The modes are summarized in Table 2.1.

| Mode | Description | Typical Application |
|---|---|---|
| Electronic Code Book (ECB) | Each block of plaintext bits is encoded independently using the same key. | Secure transmission of single values (e.g., an encryption key). |
| Cipher Block Chaining (CBC) | The input to the encryption algorithm is the XOR of the next block of plaintext and the preceding block of ciphertext. | General-purpose block-oriented transmission. Authentication. |
| Cipher Feed Back (CFB) | Input is processed s bits at a time. Preceding ciphertext is used as input to the encryption algorithm to produce pseudorandom output, which is XORed with the plaintext to produce the next unit of ciphertext. | General-purpose stream-oriented transmission. Authentication. |
| Output Feed Back (OFB) | Similar to CFB, except that the input to the encryption algorithm is the preceding encryption output, and full blocks are used. | Stream-oriented transmission over a noisy channel (e.g., satellite communication). |
| Counter (CTR) | Each block of plaintext is XORed with an encrypted counter. The counter is incremented for each subsequent block. | General-purpose block-oriented transmission. Useful for high-speed requirements. |

Table 2.1: Block Cipher Modes of Operation

Note that arbitrary length messages can be unambiguously padded to a total length that is a multiple of any desired block size by appending a 1 followed by a sufficient number of 0s; if the length of the message is already a multiple of the block size, a full extra block of padding is added. For all of the constructions considered here, the plaintext message is assumed to have a length that is exactly a multiple of the block size.
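The padding rule above can be illustrated with a byte-oriented sketch. It assumes that messages consist of whole bytes, so the single 1 bit followed by 0 bits becomes the byte 0x80 followed by 0x00 bytes. The function names pad and unpad are chosen for this example only.

```python
def pad(message: bytes, block_size: int) -> bytes:
    """Append 0x80 and then zero bytes up to the next multiple of block_size.

    A message whose length is already a multiple of block_size receives a
    full extra block, so the padding can always be removed unambiguously.
    """
    pad_len = block_size - (len(message) % block_size)   # between 1 and block_size
    return message + b"\x80" + b"\x00" * (pad_len - 1)


def unpad(padded: bytes) -> bytes:
    """Remove the trailing zero bytes and the 0x80 marker added by pad()."""
    stripped = padded.rstrip(b"\x00")
    if not stripped or stripped[-1] != 0x80:
        raise ValueError("invalid padding")
    return stripped[:-1]
```

Because the marker byte is always present, a message that itself ends in zero bytes is still recovered correctly.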

2.2.3.1 Electronic Code Book

The simplest mode is the Electronic Code Book mode, in which plaintext is handled one block at a time and

each block of plaintext is encrypted using the same key (Figure 2.13). The term code book is used because, for

a given key, there is a unique ciphertext for every b bit block of plaintext. Therefore, this can be viewed as a

gigantic code book in which there is an entry for every possible b bit plaintext pattern showing its corresponding

ciphertext.

Figure 2.13: Encryption and decryption with the Electronic Code Book mode.

The main problem of the ECB mode is that it encrypts highly deterministically. This means that identical

plaintext blocks result in identical ciphertext blocks, as long as the key does not change. Thus, it is easy to

distinguish an encryption of a plaintext that consists of two identical blocks from an encryption of a plaintext

that consists of two different blocks.

The ECB mode also has advantages. Block synchronization between the encryption and decryption parties

is not necessary, i.e., if the receiver does not receive all encrypted blocks due to transmission problems, it is

still possible to decrypt the received blocks. Also, block ciphers operating in ECB mode can be parallelized,

e.g., one encryption unit encrypts or decrypts block1, the next one block2, and so on. This is an advantage for

high-speed implementations.
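Both properties of ECB discussed above, the deterministic mapping of equal blocks and the independence of the blocks, can be observed with a small Python sketch. The block cipher used here is a hypothetical four-round Feistel construction built on SHA-256, invented only so that the example is self-contained. It is not AES or CLEFIA and must not be used for real encryption.

```python
import hashlib

BLOCK = 16  # block size in bytes


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def _round(key, r, half):
    """Toy round function: a truncated hash of the key, round index and half block."""
    return hashlib.sha256(key + bytes([r]) + half).digest()[:BLOCK // 2]


def toy_encrypt_block(key, block):
    left, right = block[:8], block[8:]
    for r in range(4):
        left, right = right, _xor(left, _round(key, r, right))
    return left + right


def toy_decrypt_block(key, block):
    left, right = block[:8], block[8:]
    for r in reversed(range(4)):
        left, right = _xor(right, _round(key, r, left)), left
    return left + right


def ecb_encrypt(key, data):
    """Encrypt each 16-byte block independently with the same key."""
    assert len(data) % BLOCK == 0
    return b"".join(toy_encrypt_block(key, data[i:i + BLOCK])
                    for i in range(0, len(data), BLOCK))


def ecb_decrypt(key, data):
    """Decrypt each 16-byte block independently with the same key."""
    assert len(data) % BLOCK == 0
    return b"".join(toy_decrypt_block(key, data[i:i + BLOCK])
                    for i in range(0, len(data), BLOCK))
```

Encrypting the three blocks A, B, A produces a ciphertext whose first and third blocks are equal, which reveals the repetition to an observer. On the other hand, any single ciphertext block can be decrypted without the others, which is what makes ECB tolerant to lost blocks and easy to parallelize.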

2.2.3.2 Cipher Block Chaining

There are two main ideas behind the Cipher Block Chaining (CBC) mode. First, the encryption of all blocks

is chained together such that ciphertext Ci depends not only on block Pi but on all previous plaintext blocks as

well. Second, the encryption is randomized by using an Initialization Vector (IV).

The ciphertext Ci, which is the result of the encryption of plaintext block Pi, is fed back to the cipher input and XORed with the succeeding plaintext block Pi+1. This XOR sum is then encrypted, yielding the next ciphertext Ci+1, which can then be used for encrypting Pi+2, and so on. This process is shown on the left-hand side of Figure 2.14. For the first plaintext block P1 there is no previous ciphertext, so an IV is XORed with the first plaintext instead, which also makes each CBC encryption nondeterministic. Note that the first ciphertext C1 depends on plaintext P1 and the IV. The second ciphertext C2 depends on the IV, P1, and P2. The third ciphertext C3 depends on the IV and P1, P2, P3, and so on. The last ciphertext is a function of all plaintext blocks and the IV.

Figure 2.14: Encryption and decryption with the Cipher Block Chaining mode.

When decrypting a ciphertext block Ci in CBC mode, it is necessary to reverse the two operations done on the encryption side. First, the block cipher encryption is reversed by applying the decryption function. Then, the XOR operation is undone by XORing the result with the preceding ciphertext block (or with the IV for the first block). This process is depicted on the right-hand side of Figure 2.14.

If a new IV is chosen every time, the CBC mode becomes a probabilistic encryption scheme. This means that when a string of blocks P1, . . . , Pt is encrypted once with a first IV and a second time with a different IV, the two resulting ciphertext sequences look completely unrelated to each other.

The IV must be known to both the sender and receiver but be unpredictable by a third party. In particular, for

any given plaintext, it must not be possible to predict the IV that will be associated to the plaintext in advance of

the generation of the IV. In most cases, the IV should be a nonce, i.e., a number used only once. For maximum

security, the IV should be protected against unauthorized changes. This could be done by sending the IV using

ECB encryption.

The main drawback of this mode is that encryption must be carried out sequentially because the ciphertext

block Ci is needed in order to encrypt the plaintext block Pi+1, unlike decryption which may be executed in

parallel. Thus, if parallel processing is available, CBC mode encryption may not be the best choice.
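The chaining described in this section can be summarized in a short Python sketch. The block cipher is passed in as a pair of functions, and toy_enc / toy_dec are a deliberately trivial, insecure stand-in (a byte rotation followed by an XOR with a fixed key) assumed here only to keep the example self-contained. The IVs in the usage note are fixed for reproducibility, whereas in practice a fresh unpredictable IV would be used.

```python
BLOCK = 16
TOY_KEY = bytes(range(BLOCK))   # fixed demo key, an assumption of this sketch


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def toy_enc(block):
    """Insecure stand-in for a block cipher: rotate left by one byte, XOR key."""
    return xor_bytes(block[1:] + block[:1], TOY_KEY)


def toy_dec(block):
    """Inverse of toy_enc: XOR key, rotate right by one byte."""
    rotated = xor_bytes(block, TOY_KEY)
    return rotated[-1:] + rotated[:-1]


def cbc_encrypt(enc, iv, blocks):
    """C_i = ENC(P_i xor C_{i-1}), with the IV in place of C_0's predecessor."""
    out, prev = [], iv
    for p in blocks:
        prev = enc(xor_bytes(p, prev))   # needs the previous ciphertext: sequential
        out.append(prev)
    return out


def cbc_decrypt(dec, iv, blocks):
    """P_i = DEC(C_i) xor C_{i-1}; every block depends only on known ciphertext."""
    out, prev = [], iv
    for c in blocks:
        out.append(xor_bytes(dec(c), prev))
        prev = c
    return out
```

Encrypting two identical plaintext blocks with cbc_encrypt(toy_enc, bytes(16), [b"A" * 16, b"A" * 16]) yields two different ciphertext blocks, and changing the IV changes the whole ciphertext. The decryption loop only reads ciphertext blocks, which is why it can be parallelized while encryption cannot.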

2.3 Summary

In this chapter, the first section presents a brief introduction to hash functions, explaining what a hash function is, where, when, and why it is used, and how hash functions have evolved, highlighting the most important ones, which were therefore chosen for implementation. It also provides an overview of the structure and inner functions used by each algorithm.

The second section begins with a short definition and examples of known symmetric key encryption algorithms, and goes on to explain the origins and inner workings of AES, one of the most widely used symmetric key algorithms, and its counterpart CLEFIA, developed by SONY. Both AES and CLEFIA are symmetric block ciphers sharing some ciphering techniques such as diffusion matrices, permutations, and byte substitution. Furthermore, this chapter provides a description and representation of the algorithms, their respective elements, and the considered modes of operation.

3 Related Technology

Contents
3.1 Hardware-Software CoDesign
3.2 SideWorks
3.3 MB-LITE processor core
3.4 Summary

This chapter briefly discusses the design challenges and considerations faced when using Hardware-Software CoDesign and gives some references on how Coreworks tackles the problem. It also gives an overview of MB-LITE, a lightweight implementation of the MicroBlaze Instruction Set Architecture, which is used for a comparison between the developed solution and the reference implementations running on this processor (see Chapter 7).

3.1 Hardware-Software CoDesign

Currently, the hardware and software components composing a computer-based system are designed synchronously as a whole, each considering the other's features. The design method considering the quality and performance trade-off between hardware and software components is called hardware/software codesign [27], [28]. Hardware/software codesign is not a new concept, but it has received much attention as computer-based systems have grown in size and complexity. Both hardware and software systems have to be designed to maximize their mutual performance. The concept of codesign also becomes important in system quality and performance measurement and assessment.

Full-custom, application-specific design of on-chip hardware accelerators can provide orders of magnitude improvements in efficiency for a wide variety of application domains [29]. However, a full-custom design is expensive in many aspects. Hence, the question is whether such techniques can be applied to a broader class of more general applications, amortizing the cost of custom design by providing multiple functionalities.

As a compromise, a hardware-software codesign approach seems an attractive solution [30], [31].

Reconfigurable hardware has received increasing attention in the past decade due to its adaptability and short design time. Instead of using FPGAs simply as ASIC replacements, designers can combine reconfigurable hardware with conventional instruction processors in a codesign system, providing a flexible and powerful means of implementing computationally demanding Digital Signal Processing (DSP) applications.

Most traditional codesign implementations are application specific and do not have a standard method for

implementing tasks. A hardware model is usually very different from those used in software. These distinctive

views of hardware and software tasks can cause problems in the codesign process. For example, swapping tasks

between hardware and software can result in a totally new structure for the control circuit.

The following Section describes how Coreworks and their technology tackle these problems.

3.2 SideWorks

SideWorks is a high performance digital signal processing engine. The SideWorks core has a fully customizable architecture targeted at reconfigurable digital signal processing applications. The SideWorks design is based on a coarse-grain reconfigurable array and is optimized for the efficient execution of algorithms comprised of nested loops containing complex logic and arithmetic expressions. These computationally expensive algorithms are common in a vast number of application domains. Additionally, the post-silicon reconfigurability allows the reduction of silicon area by combining multiple fixed-function hardwired blocks into fewer SideWorks instances.

SideWorks key features include pre-silicon configurability of the nested loop depth, the number of input and output ports, the type and number of arithmetic Functional Units (FUs), the datapath width, the datapath routing, and the address generation. In addition, the core supports runtime partial reconfiguration using a memory mapped configuration register file. This DSP technology targets cost and power sensitive applications, such as multimedia and communications. SideWorks enables the creation of DSP cores that are both configurable before fabrication and reconfigurable afterwards. The data transfers and some aspects of the execution unit function are programmable at runtime. SideWorks does not run as a stand-alone processor; instead, it is combined with a general-purpose host processor that manages program flow and data I/O. Therefore, Coreworks also supplies FireWorks, a compact 32-bit Reduced Instruction Set Computing (RISC) processor core.

Figure 3.1: Top level view of the SideWorks architecture template[1].

Designed to run computation intensive applications in a multi-processor environment, the Coreworks Processing Engine (CWPE) is an Intellectual Property (IP) core system developed by Coreworks. Figure 4.1 illustrates the use of the CWPE in a multi-core System on Chip (SoC).

The computational power of the CWPE comes essentially from the use of the Coreworks reconfigurable accelerator technology, the SideWorks blocks.

A high-level view of the SideWorks architecture is depicted in Figure 3.1. The architecture consists of arrays of FUs, memory units (MUs), and Address Generation Units (AGUs), which can be flexibly and dynamically interconnected according to the information in a configuration register file (CONFIG). A second register file (CONFIG NXT), holding the next SideWorks configuration, is included for fast reconfiguration of the engine.

The data defining the configuration of the programmable FUs, crossbars, and address generators is held in the configuration register file, which also stores some constants used in the computations. The SideWorks configuration register file can be accessed by the Direct Memory Access (DMA) engine while SideWorks is running, thereby hiding or mitigating the reconfiguration time. SideWorks tasks are modeled as sequences of nested loop groups in C, using a proprietary function library called SideC. The values in the configuration registers create a temporary hardware datapath for executing a nested loop group [1].

3.3 MB-LITE processor core

There are several processor cores that are commonly used in SoC applications: commercial cores that require the acquisition of a license, like ARM [32], PowerPC [33], NIOS3 [34], MicroBlaze [35], and many others; and free or open cores that may be used without the need to acquire a license, like LEON3 [36], OpenRisc [37], and MB-LITE [38]. The main advantage of commercial processors is that they are usually well tested and optimized for a specific target hardware and provide a complete set of CAD tools that make SoC design an easier process. For example, MicroBlaze from Xilinx is well integrated with the development platform from the same vendor, which leads to highly optimized designs at the cost of being bound to a particular technology (the Xilinx Spartan and Virtex FPGA families) and a concrete set of tools (Xilinx ISE and the Embedded Development Kit (EDK)).

The MB-LITE processor implements a reduced MicroBlaze instruction set architecture (ISA) in a 32-bit Harvard RISC architecture, based on the MIPS five-stage pipeline structure. The stages usually featured in this type of architecture are Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (MEM), and Write-Back (WB). All instructions have a single cycle latency, except for branches, whose latency is two or three clock cycles (with or without delay slots, respectively). The adopted architecture also includes an address decoder to allow communication with the different peripherals. It also provides a character device, so that STDOUT can be easily used in the software development phase.

The design of this processor is completely modular, which provides an easy way to add, remove, and connect other modules and peripherals. The core itself features the five components that implement the pipeline, the instruction and data memories, the address decoder, and the Wishbone [39] adapter as completely independent modules, allowing the microprocessor to be highly configurable. There are also other parameters that can be configured depending on the targeted application, such as the enabling or disabling of the multiplier, barrel shifter, and interrupts, the size of the memory buses, the byte order policy, and the usage of forwarding in the different pipeline stages.

Figure 3.2: MB-LITE configuration example

Figure 3.2 illustrates a possible configuration, featuring the connection of I/O mapped devices through the address decoder. The configuration includes a local memory and a character device directly connected to the core, as well as access to a global memory through a Wishbone bus [39].

The MB-LITE soft-core was selected for this project not only due to its simple and portable processing structure but also due to its compliance with the well-known MicroBlaze ISA. Therefore, it also has the essential advantage of having an available and highly mature compiler, i.e., GCC. The use of MB-LITE is further discussed in Chapter 7.

3.4 Summary

The beginning of the chapter presents a brief overview of the current state of Hardware-Software CoDesign, with a more detailed focus on the problems faced and how to address them. This is followed by an overview of how Coreworks tackles the outlined problems, which is presented in more depth in Chapter 4. Finally, the chapter gives some brief considerations on the MB-LITE microprocessor, a lightweight implementation of the MicroBlaze Instruction Set Architecture, which is used for a comparison between the developed solution and the reference implementations running on this processor, as presented in Chapter 7.

4 Development Environment

Contents
4.1 Coreworks Processing Engine
4.2 FireWorks Architecture
4.3 SideWorks Architecture
4.4 Design Flow
4.5 SideWorks Design Example
4.6 SideWorks Functional Unit Example
4.7 Summary

This chapter provides a brief introduction to the Coreworks framework. It describes the SideWorks architecture and design flow and provides a design example to illustrate the basics of SideWorks.

The main objective of this work is to develop the proposed implementations using technology provided by Coreworks. This computing technology accelerates the design of high-performance, small area, and low power reconfigurable processors. The Coreworks technology has been named SideWorks to express the fact that reconfigurable processors built with this technology are mainly intended to work as dedicated high-performance offload engines.

4.1 Coreworks Processing Engine

The CWPE is a system designed to run computationally intensive applications in a multi-processor environment. The massive computational power of the system comes mainly from SideWorks, which allows various configurations of the accelerator.

Figure 4.1: Coreworks Processing Engine interfacing with user cores and memory[1].

In addition to the already mentioned SideWorks, this system also includes the FireWorks processor, a DMA module, a Data I/O peripheral, externally accessible Control and Status register files, a Boot interface peripheral, and a bus master interface.

The FireWorks processor is a 32-bit RISC processor equipped with instruction and data caches. It controls the SideWorks engines, each of which runs in parallel with the FireWorks processor to perform the more compute intensive tasks. It also has a collection of DSP instructions, which in some cases makes the use of SideWorks accelerators unnecessary.

The embedded DMA engine, controlled by FireWorks, is responsible for autonomous high-speed data transfers between SideWorks and external memory, or between the Data I/O peripheral and external memory.

The Data I/O interface is represented by Boot I/F in Figure 4.1. It is a peripheral of FireWorks based on a First In First Out (FIFO) buffer and is used for data streaming.

The Control and Status register files are accessed by external circuitry to operate on the processor and the running application. FireWorks reads settings and commands from the Control register file and then writes status information to the Status register file.

The Boot interface, which is also a FireWorks peripheral, is used to load a program from a boot device when the system is initialized.

Finally, the Master Bus Interface mainly serves to access an external memory, which can be shared with other processes being executed. The Master Bus Interface serves the FireWorks instruction and data caches and the DMA engine.

4.2 FireWorks Architecture

FireWorks is a multi-purpose, Harvard architecture 32-bit RISC processor core whose high-level block diagram is illustrated in Figure 4.2. This architecture uses a pipeline of five stages: fetching the next instruction (Fetch), decoding (Decode), execution (Execute), memory access (Memory), and writing back the result (Write Back). The processor has a parallel programming interface that can be used to write into any memory space through write instructions. The sizes of the instruction and data caches are configurable, and there are also local data and instruction memories for storing small startup programs.

Figure 4.2: FireWorks Architecture[1].

4.3 SideWorks Architecture

SideWorks is a general purpose architecture for reconfigurable accelerators. Using this architecture, one can automatically generate the desired hardware components through the Coreworks integrated development environment. These components act as accelerators for FireWorks, which controls SideWorks through the local bus by accessing its control registers and internal state.

The architecture is pre-defined and can be composed of any combination of the following: FUs, memory units (MUs), and AGUs, which can be flexibly and dynamically interconnected according to the information in a configuration register file. A second register file holding the next SideWorks configuration is included for fast reconfiguration of the engine.

SideWorks uses FUs provided by its own library, which contains general-purpose units such as ALUs, multipliers, and shifters, making this a flexible technology for creating many types of applications in a simplified way. The memory addresses are generated in the AGUs. All embedded memories are dual port memories in order to maximize the bandwidth; however, single port memories can be used if necessary. The data processed by the FUs comes from either the MUs or the outputs of other FUs. The Read Crossbar selects the FU inputs. Each FU produces multiple data outputs, which may have different bit-widths, including 1-bit output flags. Flags are also routed by the Read Crossbar to other FUs, where they are used as control inputs. Flags can also be selected to halt SideWorks and assert a status bit in the Status register. The data from the outputs of the FUs can be forwarded to the Read Crossbar in order to be used by other FUs, sent to the Write Crossbar in order to be written to the MUs, or forwarded to the Address Output Crossbar to be used as memory addresses. The MU array can be accessed through two interfaces: the first is a slave to FireWorks, and the other is a slave to the DMA engine. The latter is particularly relevant because DMA access to the memories while the application is running is very important for minimizing data transfer times.

The configuration register file contains the data that defines the configuration of the programmable FUs, crossbars, and address generators. It also stores some constants used in the computations. The SideWorks configuration register file can be accessed by the DMA engine while SideWorks is running, thereby hiding or mitigating the reconfiguration time. SideWorks tasks are modeled as sequences of nested loop groups in C, using a proprietary function library called SideC. The values in the configuration registers create a temporary hardware datapath for executing a nested loop group.

4.4 Design Flow

The design flow can be divided into three parts: the FUs, the settings for the SideWorks with the desired

architecture, and files that constitute the core.

Describing a FU in the Coreworks architecture requires four files: a .c file describing the operation of the FU, which is mainly used for simulation; the corresponding .h file; an XML file describing the inputs and outputs of the FU and their respective lengths, which is used in the construction of the datapath architecture; and the description of the FU in VHSIC Hardware Description Language (VHDL) or Verilog. The compilation of the FUs generates a library of elements that is used to construct the configuration used by SideWorks. This configuration is built from the functions defined in the C language files of the FUs, together with the necessary connections between them. The compilation of the SideWorks architecture generates the core elements as well as the Register Transfer Level (RTL) files needed to build the hardware of the SideWorks configuration for the FPGA. Using the Xilinx ISE, the files specific to the desired FPGA design must be added to these RTL files in order to generate the required bitfile.

The core consists of three distinct settings: SIM, TEST and EMB. The SIM setting executes the hardware simulation without the need to use the FPGA, so the operation of the respective SideWorks configuration can be tested. The SIM compilation generates an executable that simulates the hardware on the PC. The EMB and TEST settings are both used for the hardware implementation. TEST constitutes the part of the application that runs on the PC: it includes the processing that is not done in FireWorks and also transfers data between the PC and the memories of the board to be used by FireWorks. EMB contains the code that is executed in FireWorks, including the setup and monitoring of SideWorks and the data transfers between the different memories. Compiling this code generates an Executable and Linkable Format (ELF) file that is used to program the FireWorks processor.

In order to run a project in hardware, these are the necessary steps: program the FPGA with the bitfile, then

use the ELF file to program FireWorks and finally run the TEST setting on the computer.

The complete flow is depicted in Figure 4.3.

Figure 4.3: Hardware/Software Co-design flow using SideWorks[1].


4.5 SideWorks Design Example

The best way to understand how to use SideWorks is by means of an example. The chosen example is a vector add operation. Listing 4.1 shows the vector add C code, which consists of a single loop. Nested loops are also supported by the SideWorks architecture, but a simple single-loop example is easier to follow.

Listing 4.1: Example C Pseudo-code version

for (j = 0; j < vector_size; j++) {
    D[j] = A[j] + B[j] + C[j];
}

Listing 4.1 is a typical vector add implementation in the C language for a regular processor, where the load from memory, addition and store to memory instructions are executed sequentially. Assuming one cycle per instruction, each loop iteration needs three loads (A[j], B[j] and C[j]), two additions and one store, plus the increment of index j and the branch instruction. Counting the loop control as a single step, it will take 7*vector_size cycles to complete this loop.
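The sequential baseline above can be sketched as plain C. This is only an illustration: the function names and the use of 32-bit unsigned elements are assumptions, and the cycle estimate simply multiplies the vector size by a per-iteration cost supplied by the caller (7 in the text).

```c
#include <stddef.h>
#include <stdint.h>

/* Three-operand vector add, one element per iteration (as in Listing 4.1).
 * The element type uint32_t is an assumption made for this sketch. */
void vector_add3(const uint32_t *a, const uint32_t *b, const uint32_t *c,
                 uint32_t *d, size_t vector_size)
{
    for (size_t j = 0; j < vector_size; j++) {
        d[j] = a[j] + b[j] + c[j];
    }
}

/* Rough cycle estimate for a sequential processor: every iteration costs
 * the same number of cycles (hypothetical helper, not part of SideC). */
unsigned long sequential_cycles(unsigned long vector_size,
                                unsigned long cycles_per_iteration)
{
    return vector_size * cycles_per_iteration;
}
```

With the figure used in the text, a 16-element vector would be estimated at sequential_cycles(16, 7) cycles.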

[Figure: pipeline Stages 0 to 4, with blocks MEM0, MEM1, MEM2, ADD0 and ADD1]
Figure 4.4: Sequential vector add example diagram.

To implement the code in Listing 4.1 using the SideWorks architecture, it is necessary to use the SideC library of primitives and write, for example, the code in Listing 4.2, which is illustrated in Figure 4.4. Note that Listing 4.2 shows only the parts of the code relevant to the present example; the full version can be found in Listing A.1 of Appendix A. Figure 4.4 also shows the use of a dual-port memory, MEM0: in order to reduce the memory used, the result is saved in the same memory that originally contained vector A. Single-port memories can also be used, as seen in MEM1 and MEM2.

Listing 4.2: Datapath example

//Loop setup. The state register is used to configure the loop's number of iterations
F_STATE_11(&loop_end_k);
F_TIME(&tu_rd, loop_end_k.out, ZERO, ZERO, ZERO, 0);

//addresses for reading
F_BAU_11(&bau_rd, TWO, ONE, undefined, tu_rd.enable[K], tu_rd.start[K], 0);

//data read from memories
F_MEM_PORT_A(&m0, bau_rd.out, undefined, ZERO, 0);
F_MEM_PORT_A(&m1, bau_rd.out, undefined, ZERO, 0);

//final parameter represents the delay, in this case it is one, the second adder has an offset of one time unit
F_MEM_PORT_A(&m2, bau_rd.out, undefined, ZERO, 1);

//vector add operation
F_ADD_32(&alu_0, m0.data_a, m1.data_a);
//value from the first adder and memory two
F_ADD_32(&alu_1, m2.data_a, alu_0.out);

//addresses for writing
F_DELAY(&tu_wr, &tu_rd, 3);

//increment step size, initial value
F_BAU_11(&bau_wr, ONE, ONE, undefined, tu_wr.enable[K], tu_wr.start[K], 0);

//data destinations
F_MEM_PORT_B(&m0, bau_wr.out, alu_1.out, tu_wr.enable[K], 1);

The code can be rewritten in a functionally equivalent way, as shown in Listing 4.3, assuming pseudo-parallel operation. Considering that all the operations of one iteration can be done in parallel, it will take vector_size/2 cycles to complete this loop. This represents a speedup compared to the execution on a regular processor.

Listing 4.3: Example C Pseudo-code, Pseudo-parallel version

for (j = 0; j < vector_size/2; j++) {
    D[j*2] = A[j*2] + B[j*2] + C[j*2];
    D[j*2+1] = A[j*2+1] + B[j*2+1] + C[j*2+1];
}

Using SideWorks this can be achieved as Figure 4.5 depicts, taking full advantage of the dual-port memories to access the even and odd elements of the vectors, and adding a fourth memory, MEM3, to save the result.

[Figure: pipeline Stages 0 to 6, with blocks MEM0A, MEM1A, MEM2A, MEM0B, MEM1B, MEM2B, ADD0, ADD1, ADD2, ADD3 and MEM3]
Figure 4.5: Example diagram parallel version
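The functional equivalence of the pseudo-parallel loop of Listing 4.3 can be checked with a small C sketch. The function name is hypothetical, and the handling of an odd vector size is an assumption added here, since Listing 4.3 implicitly assumes an even number of elements.

```c
#include <stddef.h>
#include <stdint.h>

/* Vector add processing one even and one odd element per iteration,
 * mirroring the two ports of each dual-port memory. */
void vector_add3_by2(const uint32_t *a, const uint32_t *b, const uint32_t *c,
                     uint32_t *d, size_t vector_size)
{
    size_t pairs = vector_size / 2;

    for (size_t j = 0; j < pairs; j++) {
        d[j * 2]     = a[j * 2]     + b[j * 2]     + c[j * 2];      /* even element */
        d[j * 2 + 1] = a[j * 2 + 1] + b[j * 2 + 1] + c[j * 2 + 1];  /* odd element  */
    }

    /* Assumption of this sketch: a leftover element of an odd-sized
     * vector is handled separately. */
    if (vector_size % 2 != 0) {
        size_t last = vector_size - 1;
        d[last] = a[last] + b[last] + c[last];
    }
}
```

The loop body runs vector_size/2 times, which corresponds to the iteration count assumed in the text when both halves of the body execute in parallel.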

This example is a basic illustration of how SideWorks can improve performance by creating a hardware

accelerator for a generic program loop. Note that this accelerator is temporarily created using the SideWorks

reconfigurable fabric, which can express other accelerators at different times to accelerate other tasks.

Listing 4.4 shows a short version of the new implementation under SideWorks. Listing A.2 shows the full description of the Vector Add configuration in a SideWorks engine, which can morph into various other configurations, including the Next Vector Crunch configuration outlined in the same Listing. A SideWorks instance is defined by describing the set of configurations it must morph into in order to implement a given algorithm. This description uses the Coreworks proprietary SideC library of primitives.

Listing 4.4: Datapath example parallel version


//Loop setup. The state register is used to configure the loop's number of iterations
F_STATE_11(&loop_end_k);
F_TIME(&tu_rd, loop_end_k.out, ZERO, ZERO, ZERO, 0);

//addresses for reading
F_BAU_11(&bau_rd_even, TWO, ZERO, undefined, tu_rd.enable[K], tu_rd.start[K], 0);
F_BAU_11(&bau_rd_odd, TWO, ONE, undefined, tu_rd.enable[K], tu_rd.start[K], 0);

//data read from memories
F_MEM_PORT_A(&m0, bau_rd_even.addr, undefined, ZERO, 0);
F_MEM_PORT_B(&m0, bau_rd_odd.addr, undefined, ZERO, 0);
F_MEM_PORT_A(&m1, bau_rd_even.addr, undefined, ZERO, 0);
F_MEM_PORT_B(&m1, bau_rd_odd.addr, undefined, ZERO, 0);
F_MEM_PORT_A(&m2, bau_rd_even.addr, undefined, ZERO, 0);
F_MEM_PORT_B(&m2, bau_rd_odd.addr, undefined, ZERO, 0);

//vector add operation even
F_ADD_32(&alu_0, m0.data_a, m1.data_a);
//value from the first adder and memory two
F_ADD_32(&alu_1, m2.data_a, alu_0.out);

//vector add operation odd
F_ADD_32(&alu_2, m0.data_b, m1.data_b);
//value from the first adder and memory two
F_ADD_32(&alu_3, m2.data_b, alu_2.out);

//addresses for writing
F_TU_4_4_11_DELAY(&tu_wr, &tu_rd, 3);
F_BAU_11(&bau_wr_even, TWO, ZERO, undefined, tu_wr.enable[K], tu_wr.start[K], 0);
F_BAU_11(&bau_wr_odd, TWO, ONE, undefined, tu_wr.enable[K], tu_wr.start[K], 0);

//final parameter represents the delay, in this case it is one, the second adder has an offset of one time unit
F_MEM_PORT_A(&m3, bau_wr_even.out, undefined, ZERO, 1);
F_MEM_PORT_B(&m3, bau_wr_odd.out, undefined, ZERO, 1);

The description of a SideWorks configuration consists of three parts:

1. Declaration of the SideWorks objects (FUs) used in the configuration

2. Declaration of placement constraints for the SideWorks objects

3. Description of the hardware datapath implemented by the configuration

4.6 SideWorks Functional Unit Example

As detailed in Section 4.4, adding a new FU to the SideWorks SideC library requires the creation of four new files: a .c file describing the operation of the FU, which is mainly used for simulation; the corresponding .h file; an XML file describing the inputs, outputs and respective lengths of the FU (used in the construction of the datapath architecture); and the description of the FU in VHDL or Verilog. In the course of this work, all FU hardware descriptions were written in VHDL.

As an example, the ROL FU was developed in order to understand and replicate the process necessary to add a new FU to the SideC library. The ROL FU operates as a simple 32-bit barrel shifter. Listing 4.5 shows the associated XML file, which is quite simple: it contains only the inputs and outputs of the FU, with their names and lengths.

Listing 4.5: SideWorks XML description for ROL

<FunctionalUnit type="ROL_32">
    <OutputPort name="out0" width="32"/>
    <InputPort name="word0" width="32"/>
    <InputPort name="word1" width="12"/>
    <Function name="F_ROL_32" select="0">
        <FunctionArg name="word0"/>
        <FunctionArg name="word1"/>
    </Function>
</FunctionalUnit>

The input word0 is the value to be shifted and word1 the number of bits to be shifted. The XML descriptions are used as input by the SideConf program and are compiled to generate the SideWorks Architecture XML file, as illustrated in Figure 4.3.

Listing 4.6 is a simplified version of the C code description of the ROL FU; the original version can be found in Listing A.3. This description is used to simulate the behavior of the FU and to assist in building the SideWorks Architecture XML (for simulation only). For this purpose the use of SideC functions and types is checked: when developing the C code for a FU, all the inputs are defined as sideC_port_t and should be validated, for example with the validate_input_length function; all these tools are provided by the SideC library. The number of cycles the FU needs to correctly complete its task must also be taken into account and mimicked by the C simulation code. To achieve this a pseudo pipeline is considered (as seen in the code): the result computed from the input signal.current.value is assigned to the output signal.next.value, representing a FU with single-cycle latency. Note that all the FUs created in this work are single-cycle latency FUs; otherwise, a multi-stage pipeline would have to be used.
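The current/next convention described above can be illustrated with a small stand-alone model. This is a sketch under assumptions: sim_signal, add_fu_evaluate, signal_clock and run_adder_chain are hypothetical names and not part of the SideC API, and a two-input adder is used as the FU. It shows why chaining two single-cycle FUs yields a two-cycle latency.

```c
#include <stdint.h>

/* Hypothetical stand-in for a SideC signal: the value visible in this
 * cycle and the value that becomes visible after the next clock edge. */
typedef struct {
    uint32_t current;
    uint32_t next;
} sim_signal;

/* Single-cycle adder FU: reads the current inputs, writes only 'next'. */
void add_fu_evaluate(sim_signal *out, const sim_signal *x, const sim_signal *y)
{
    out->next = x->current + y->current;
}

/* Clock edge: the pending value becomes the visible one. */
void signal_clock(sim_signal *s)
{
    s->current = s->next;
}

/* Runs two chained adders (a + b, then + c) for 'cycles' clock cycles and
 * returns the value visible at the output of the second adder. */
uint32_t run_adder_chain(uint32_t a, uint32_t b, uint32_t c, unsigned cycles)
{
    sim_signal in_a = {a, a}, in_b = {b, b}, in_c = {c, c};
    sim_signal sum0 = {0, 0}, sum1 = {0, 0};

    for (unsigned i = 0; i < cycles; i++) {
        /* Evaluate every FU first, using only 'current' values... */
        add_fu_evaluate(&sum0, &in_a, &in_b);
        add_fu_evaluate(&sum1, &sum0, &in_c);
        /* ...then advance all signals together. */
        signal_clock(&sum0);
        signal_clock(&sum1);
    }
    return sum1.current;
}
```

After one cycle the second adder has only seen the reset value of the first one; the complete sum appears after the second cycle, which is the behavior a datapath description has to account for with delays.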

Listing 4.6: SideWorks simplified C code description for ROL

#include "sim/sideC_types.h"
#include "sim/sideC_internal.h"
#include "rol_32.h"

#define ROL32(a, offset) ((((unsigned long)a) << (offset)) ^ (((unsigned long)a) >> (32-(offset))))

void sidefu_rol_32_sim(sidefu_rol_32 *fu, sideC_port_t word0, sideC_port_t word1, int line, char *file)
{
    if(sideC_validate){
        validate_input_length(1, word0, 32, line, file);
        validate_input_length(2, word1, 12, line, file);
    }else{
        fu->out0.signal.next.value = ROL32(word0.signal.current.value, word1.signal.current.value) & 0xFFFFFFFF;
    }
}

void sidefu_rol_32_xml(sidefu_rol_32 *fu, char *fu_name, char *word0, char *word1, int line, char *file)
{
    if(sideC_init){
        sideC_xml_print_func("ROL_32", fu_name, "F_ROL_32", line, file);
        sideC_xml_print_char_arg("word0", word0);
        sideC_xml_print_char_arg("word1", word1);
    }
}

The VHDL description of the ROL FU is presented in Listing 4.7. This FU was the first one developed in the course of this work and is based on SBSHIFT, its counterpart in the official release of the Coreworks SideC library, with some minor differences (e.g. the original was developed in Verilog). Section B.8 details the specifications of the new FU, and Section B.9 reproduces the specifications of the original FU, extracted from the SideWorks Reference Book [1].

Following the approach of the SideWorks Reference Book [1], an entry was created in Appendix B for each FU developed in this work. These entries follow the format of those in the reference book, which provides a description of the SideWorks objects (FUs) and of the SideWorks object functions.

Listing 4.7: SideWorks VHDL description for ROL

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.numeric_std.all;

entity ROL_32 is
    Port (
        ROL_32_clk : in STD_LOGIC;
        ROL_32_reset : in STD_LOGIC;
        ROL_32_init : in STD_LOGIC;
        ROL_32_in_word0 : in STD_LOGIC_VECTOR (32-1 downto 0);
        ROL_32_in_word1 : in STD_LOGIC_VECTOR (12-1 downto 0);
        ROL_32_out_out0 : out STD_LOGIC_VECTOR (32-1 downto 0));
end ROL_32;

architecture Behavioral of ROL_32 is
begin
    ROL_32_out_out0 <= (To_StdLogicVector(To_bitvector(ROL_32_in_word0) rol to_integer(unsigned(ROL_32_in_word1))));
end Behavioral;
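Since word1 is 12 bits wide, the rotate amount can exceed 31, and a software reference used to check the FU has to define what happens in that case. The sketch below is one possible reference, assuming the rotation is taken modulo 32 (which is how the VHDL rol operator behaves for a 32-bit vector); rol32_ref is a hypothetical name and is not part of SideC. It also avoids shifting a 32-bit value by 32 positions, which is undefined behavior in C.

```c
#include <stdint.h>

/* Reference 32-bit left rotation. The amount is reduced modulo 32, so an
 * amount of 0 or 32 returns the value unchanged and no shift by 32 bits
 * is ever executed. */
uint32_t rol32_ref(uint32_t value, uint32_t amount)
{
    uint32_t n = amount % 32u;

    if (n == 0u) {
        return value;
    }
    return (value << n) | (value >> (32u - n));
}
```

A rotation by 36 positions then gives the same result as a rotation by 4, which is the property to compare against the hardware description.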

4.7 Summary

This chapter presents the SideWorks technology to a new user and describes the SideWorks architecture, detailing the major components of the Coreworks platform (CWPE, FireWorks and SideWorks) and the design flow. In order to better understand the architecture, a number of how-to-use design examples are provided to illustrate the basics of SideWorks.

The last two Sections wrap up the developed work, explaining the process using datapath examples, complemented with an illustration of how an FU runs in the platform and followed by an overview of how to integrate all the components mentioned in the previous Sections.

5 Proposed Structures for Hash Functions

Contents
5.1 SHA-1
5.1.1 Proposed Structure
5.1.2 Required Elements
5.1.3 Final Structure
5.2 SHA-2
5.2.1 Proposed Structure
5.2.2 Required Elements
5.2.3 Final Structure
5.3 SHA-3
5.3.1 Proposed Structure
5.3.2 Required Elements
5.3.3 Final Structure
5.4 Summary

This chapter focuses on hashing algorithms, which can be extremely useful in data authentication and message integrity checks. Currently, the most commonly used hash functions are SHA-1 and SHA-2. This chapter reviews the design challenges and concerns faced when implementing the chosen hash algorithms in the SideWorks architecture. In addition, it presents the support procedures and mechanisms devised during the course of the work, to be used within the proposed structures.

5.1 SHA-1

5.1.1 Proposed Structure

The main objective of the first phase of this project was to fully understand the target framework, including the workflow directly involved in its development. The starting point was the implementation of the SHA-1 algorithm. The first step in this implementation was to evaluate the availability in SideWorks of the logic functions and of all the necessary FUs. Although all the logic functions required by the SHA-1 algorithm are available as FUs, the design and implementation of FUs is an important and necessary step in the workflow. As such, and in order to acquire experience in this task, an alternative to the barrel shifter already available in the SideC library was implemented. This FU was named ROL, and its implementation process was described in Section 4.6. The rationale for this choice is simple: ROL represents a simple operation with a verified counterpart already present in SideC, and both are interchangeable and can be used in any of the SHA-1 datapaths outlined in this Section. The only notable difference between them is that the original is described in Verilog and its counterpart in VHDL; the logic function is exactly the same.

Listing 5.1: SHA-1 Pseudocode

//Extend the sixteen 32-bit words into eighty 32-bit words
for i from 16 to 79
    w[i] = (w[i-3] xor w[i-8] xor w[i-14] xor w[i-16]) leftrotate 1

//Main Loop
for i from 0 to 79
    if 0 ≤ i ≤ 19 then
        f = (b and c) or ((not b) and d)
        k = 0x5A827999
    else if 20 ≤ i ≤ 39
        f = b xor c xor d
        k = 0x6ED9EBA1
    else if 40 ≤ i ≤ 59
        f = (b and c) or (b and d) or (c and d)
        k = 0x8F1BBCDC
    else if 60 ≤ i ≤ 79
        f = b xor c xor d
        k = 0xCA62C1D6

    temp = (a leftrotate 5) + f + e + k + w[i]
    e = d; d = c; c = b leftrotate 30; b = a; a = temp;

The Secure Hash Standard, defined in Federal Information Processing Standards (FIPS) PUB 180 [40], and Request for Comments (RFC) 3174 [41] were used as the sources for the reference implementation presented in Listing 5.1.
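The pseudocode of Listing 5.1 can be turned into a small software reference, useful for checking the output of a hardware datapath. The sketch below is not the SideWorks implementation: the function names are hypothetical, the initial hash values and the final addition to the state come from the SHA-1 standard rather than from the listing, and only messages that fit in a single padded block (fewer than 56 bytes) are handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 32-bit left rotation; only called with n in 1..31. */
uint32_t sha1_rol(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32u - n));
}

/* Processes one 64-byte block and updates the five-word state h. */
void sha1_compress(uint32_t h[5], const uint8_t block[64])
{
    uint32_t w[80];

    /* Load the sixteen big-endian input words. */
    for (int i = 0; i < 16; i++) {
        w[i] = ((uint32_t)block[4 * i] << 24) | ((uint32_t)block[4 * i + 1] << 16) |
               ((uint32_t)block[4 * i + 2] << 8) | (uint32_t)block[4 * i + 3];
    }
    /* Extend the sixteen 32-bit words into eighty 32-bit words. */
    for (int i = 16; i < 80; i++) {
        w[i] = sha1_rol(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1);
    }

    uint32_t a = h[0], b = h[1], c = h[2], d = h[3], e = h[4];

    /* Main loop: eighty rounds, four ranges of the F function. */
    for (int i = 0; i < 80; i++) {
        uint32_t f, k;
        if (i < 20) {
            f = (b & c) | (~b & d);
            k = 0x5A827999u;
        } else if (i < 40) {
            f = b ^ c ^ d;
            k = 0x6ED9EBA1u;
        } else if (i < 60) {
            f = (b & c) | (b & d) | (c & d);
            k = 0x8F1BBCDCu;
        } else {
            f = b ^ c ^ d;
            k = 0xCA62C1D6u;
        }
        uint32_t temp = sha1_rol(a, 5) + f + e + k + w[i];
        e = d; d = c; c = sha1_rol(b, 30); b = a; a = temp;
    }

    h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}

/* Hashes a message of fewer than 56 bytes (one padded block). */
void sha1_short(const uint8_t *msg, size_t len, uint32_t digest[5])
{
    uint8_t block[64] = {0};
    uint64_t bits = (uint64_t)len * 8u;

    assert(len < 56);
    memcpy(block, msg, len);
    block[len] = 0x80;                           /* padding marker */
    for (int i = 0; i < 8; i++) {                /* big-endian bit length */
        block[63 - i] = (uint8_t)(bits >> (8 * i));
    }

    digest[0] = 0x67452301u; digest[1] = 0xEFCDAB89u; digest[2] = 0x98BADCFEu;
    digest[3] = 0x10325476u; digest[4] = 0xC3D2E1F0u;
    sha1_compress(digest, block);
}
```

Hashing the three-byte message "abc" with sha1_short should reproduce the well-known test digest a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d.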


[Figure: pipeline Stages 0 to 6, with blocks MEM0, MEM1, MEM2, MEM5, XOR0, XOR1, XOR2 and ROL0]
Figure 5.1: SHA-1 Extend the sixteen 32-bit words into eighty 32-bit words.

In order to maximize the learning experience, the implementation of SHA-1 was unfolded into four datapaths. Taking advantage of the features explained and demonstrated in Section 4.5, one datapath was designed for the message schedule (extending the sixteen 32-bit words into eighty 32-bit words), shown in Figure 5.1, and three others for the branches of the SHA-1 F function described in Section 2.1.1 (and depicted in Equation 2.1), shown in Figures 5.2, 5.3 and 5.4.

In the datapaths, the elements with a red background represent memory blocks, which require two clock cycles (for the read and write operations) or, as an alternative, only one clock cycle for each of the operations when registers are used. The blue elements represent FUs (always with single-cycle latency, as mentioned before). Figures 5.1 to 5.4 were designed to highlight the number of cycles required by each of the operations and their respective access times. Note that in some of the figures the same memory block or register is represented multiple times; although it is one and the same element, the repetitions represent different access times along the datapath and/or different operations (read or write).

[Figure: pipeline Stages 0 to 8, with blocks MEM0 to MEM5, CNSTK, ROL0, ROL1, ADD0 to ADD3, AND1, AND2, NOT0 and OR0]
Figure 5.2: SHA-1 F1.

Furthermore, in Figure 5.1, the stages highlighted in red mark the last two clock cycles, which only need to be counted once, for the last round, as the value is saved in MEM5. This value will not be reused in this datapath, so there is no need to read it back. This means that the number of cycles necessary per round is four, and that two extra cycles must be added to the total number of cycles required (number of rounds times cycles per round) to allow the value of the last round to be written into the respective memory block.
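The cycle count described above can be written as a one-line formula. The helper below is a hypothetical illustration, not part of the SideWorks tools; for the message schedule of Listing 5.1, which computes words 16 to 79 (64 rounds) at four cycles per round plus two final write cycles, it gives 64 * 4 + 2 cycles.

```c
/* Total cycles of a datapath that repeats a fixed number of cycles per
 * round and needs a few extra cycles once, at the end, to write back the
 * last result (hypothetical helper for illustration). */
unsigned long datapath_cycles(unsigned long rounds,
                              unsigned long cycles_per_round,
                              unsigned long final_write_cycles)
{
    return rounds * cycles_per_round + final_write_cycles;
}
```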

The main driver behind the decision to implement the four datapaths was to take different approaches to the common elements, with the goal of reducing the number of cycles necessary for each round of the respective sub-function.

[Figure: pipeline Stages 0 to 8, with blocks MEM0 to MEM5, CNSTK, ROL0, ROL1, ADD0 to ADD3, XOR0 and XOR1]
Figure 5.3: SHA-1 F2.


The analysis of the critical path of each datapath makes it clear that the main margin for reducing the number of cycles necessary for a round lies in the multiple adders required to obtain the new value of word A (Figure 2.1 or Listing 5.1). This limitation is caused by the fact that the adders available in SideWorks support only two input words.

For SHA-1, a four-input adder would be highly beneficial to increase the efficiency of the implementation. The same rationale applies to a three-input XOR.

Coupling the four datapaths presented produces a working version of the SHA-1 algorithm, as efficient as the FUs currently available in the SideC library allow.

[Figure: pipeline Stages 0 to 8, with blocks MEM0 to MEM5, CNSTK, ROL0, ROL1, ADD0 to ADD3, AND0, AND1, AND2, OR0 and OR1]
Figure 5.4: SHA-1 F3.



5.1.2 Required Elements

After the preliminary analysis of the critical path presented in the previous subsection, a more in-depth breakdown is needed. As depicted in Figures 5.2 to 5.4, all three branches of the F function require eight cycles per round. In the illustrations, the four cycles required by the memory blocks and the respective access timing stand out, but this is addressed later in this Section. Focusing initially on the necessary operations, the chain of four adders is clearly an opportunity for improvement, but would a simple four-word input adder be the best choice? To answer this question, the individual parts of the branches must be considered. F1 employs two AND operations, one NOT and one OR (Figure 5.2); again overlooking the memory access cycles, this requires three cycles. F2 requires only two XORs to obtain the result of XORing three words, at a cost of two cycles. F3, in turn, employs two OR plus three AND operations, at a cost of three cycles per round. It would be simple to implement the whole F function in a single custom FU, but that choice goes against the main idea of reusability of the FUs present in the SideC library. With this in mind, the option was to split the F function into two distinct FUs in order to improve the chance of reusing them in other datapaths. The F1 and F3 operations are available through the FU named ANDX and the F2 operation through XORX; the specifications of both FUs are presented in Sections B.3 and B.14, respectively. Reviewing these specifications, it becomes clear that the FUs provide more operations than those described here; a detailed explanation can be found in Section 5.2.
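The split of the F function into two units can be modeled in software as follows. This is a behavioral sketch only: the assignment of F1 to out0 and F3 to out1 is an assumption made for illustration (the actual port mapping is given by the FU specifications in Appendix B), and the extra operations those FUs provide are not modeled.

```c
#include <stdint.h>

/* Outputs of the hypothetical ANDX model: both AND/OR-based branches are
 * computed from the same three input words. */
typedef struct {
    uint32_t out0;  /* F1: (b and c) or ((not b) and d)           */
    uint32_t out1;  /* F3: (b and c) or (b and d) or (c and d)    */
} andx_outputs;

andx_outputs andx_32_model(uint32_t b, uint32_t c, uint32_t d)
{
    andx_outputs r;
    r.out0 = (b & c) | (~b & d);
    r.out1 = (b & c) | (b & d) | (c & d);
    return r;
}

/* XORX model restricted to the SHA-1 use: F2, a three-input XOR. */
uint32_t xorx_32_model(uint32_t b, uint32_t c, uint32_t d)
{
    return b ^ c ^ d;
}
```

Each function corresponds to one combinational step, which is why replacing the AND/OR/NOT chains of Figures 5.2 to 5.4 by such units removes the intermediate cycles.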

Taking into account the newly proposed FUs, which execute the F function in a single cycle, it is necessary to address the multiple-input adder. Reconsidering the previous datapaths (Figures 5.2 to 5.4), the values to be added that come from memory blocks or registers amount to three, plus the result of one rotation operation and the three values from the F function, giving a total of seven input words. In order to execute this operation, an FU called ADDX was developed, with seven 32-bit word inputs and two one-bit select flag inputs. It performs a five-word addition in which the select flags choose which of the last three words is used, which provides the ability to select the F branch. The specifications of the new FU can be found in Section B.1, and a breakdown of the operation is depicted in Figure 5.5.

[Figure: ADDX block with inputs w0 to w6 and select; internal ADD, ADD, MUX and ADD elements; output out]
Figure 5.5: ADDX FU description
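The behavior of ADDX can be summarized by a short software model: four words are always added, and a multiplexer picks the fifth operand among the remaining three. The encoding of the two select flags used below is an assumption made for this sketch; the real encoding is defined in Section B.1.

```c
#include <stdint.h>

/* Behavioral model of a five-word adder with a three-way selection of the
 * last operand. Assumed (hypothetical) flag encoding:
 *   sel1 = 0, sel0 = 0 -> w4
 *   sel1 = 0, sel0 = 1 -> w5
 *   sel1 = 1           -> w6
 * The sum wraps around modulo 2^32, as SHA-1 requires. */
uint32_t addx_32_model(uint32_t w0, uint32_t w1, uint32_t w2, uint32_t w3,
                       uint32_t w4, uint32_t w5, uint32_t w6,
                       unsigned sel0, unsigned sel1)
{
    uint32_t selected;

    if (sel1) {
        selected = w6;
    } else if (sel0) {
        selected = w5;
    } else {
        selected = w4;
    }
    return w0 + w1 + w2 + w3 + selected;
}
```

In the SHA-1 round, the four fixed operands would be the rotated word A, E, k and w[i], and the selected operand would be the output of the F branch that is active in the current round.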

5.1.3 Final Structure

Using the FUs outlined in the last subsection, the implementation of the SHA-1 algorithm in the SideWorks platform was possible requiring only four cycles per round, against the eight needed for the implementation detailed in Section 5.1.1. This was achieved by developing and implementing the new FUs and by replacing some of the memory blocks with registers where possible. The critical datapath is illustrated in Figure 5.6. The use of registers instead of memory blocks comes at the cost of losing flexibility: in the implementation illustrated in Figure 5.6 it is only possible to calculate one hash at a time, a limitation imposed by the use of registers for the input words. To overcome this limitation, two extra cycles would be needed for each hash value: one at the end of the execution of the main loop, to allow the final values to be written into memory, and one at the start of a new hash, in order to load the initial values into the registers. In terms of cycles, two extra clock cycles are marginal, but the logic required for this implementation would significantly increase the size of the datapath. This and other improvements are discussed in greater detail in Chapter 7.

[Figure: pipeline Stages 0 to 4, with memories MEMw and MEMk, registers A, B, C, D and E, and FUs ANDX, XORX, ROL0 and ADDX]
Figure 5.6: SHA-1 Final critical datapath

The final structure includes one ADDX, one ANDX, two ROL and one XORX. Multiplying eighty, the number of rounds of the SHA-1 algorithm, by four (the number of cycles per round) results in three hundred and twenty cycles required to hash a value using SHA-1 in SideWorks. The full datapath declaration can be found in Appendix A.4. It is also a good example: the implementation is small, which makes it easy to understand most of the mechanisms employed by the SideWorks platform, and it demonstrates how easy and adaptable an implementation in SideWorks can be. Listing 5.2 depicts the core loop of this implementation.

Listing 5.2: SHA-1 Final main loop

//------------------stage0-----------------------------------------
//Read W[]
F_MEM_PORT_A(&m0, bau_w.out, undefined, ZERO, 0);
//Read K[]
F_MEM_PORT_A(&m1, bau_k.out, undefined, ZERO, 0);

//------------------stage1-----------------------------------------
//ROL(30,B)
F_ROL_L_32(&rol0, reg_b.out, cnst30.out);
//ROL(5,A)
F_ROL_L_32(&rol1, reg_a.out, cnst5.out);
//ANDx(B, C, D)
F_ANDX_32(&andx_0, reg_b.out, reg_c.out, reg_d.out);
//XORx(B, C, D)
F_XORX_32(&xorx_0, reg_b.out, reg_c.out, reg_d.out, bau_k.eq);

//------------------stage2------------------------------------------
F_ADDX_32(&addx_0, rol1.out, reg_e.out, m1.data_a, m0.data_a, xorx_0.out0, andx_0.out0, andx_0.out1, bau_k.eq, bau_w.lt);

//--------------------stage3---------------------------------------
//ADDx --> A
F_REG_WEN_32(&reg_a, addx_0.out0, tu_wr.end[K]);
//A --> B
F_REG_WEN_32(&reg_b, reg_a.out, tu_wr.end[K]);
//ROL(30,B) --> C
F_REG_WEN_32(&reg_c, rol0.out, tu_wr.end[K]);
//C --> D
F_REG_WEN_32(&reg_d, reg_c.out, tu_wr.end[K]);
//D --> E
F_REG_WEN_32(&reg_e, reg_d.out, tu_wr.end[K]);


5.2 SHA-2


5.2.1 Proposed Structure

SHA-2 includes a significant number of changes from its predecessor SHA-1, consisting of a set of four hash functions with digest sizes of 224, 256, 384 and 512 bits. The SHA-2 hashing algorithm is identical for the SHA224, SHA256, SHA384 and SHA512 hash functions, differing mainly in the size of the operands, the initialization vectors and the size of the final message digest. Apart from the initialization vectors, SHA224 and SHA384 are the same as SHA256 and SHA512, respectively, their digest values being obtained by truncating the final hash value to its leftmost bits. The SHA256 hash function differs from SHA512 mostly in the size of the operands, using 32-bit words instead of 64-bit words, and in the number of rounds, using 64 rounds instead of 80. Listing 5.3 shows the pseudocode based on the reference implementation of the SHA-2 family described in [40, 41]. Despite all the differences from SHA-1, the basic functions used by both are the same.

Listing 5.3: SHA-2 Main loop, Pseudocode

//Kt represents a sequence of 64 32-bit or 80 64-bit word constants
for i from 0 to 63 //SHA-224/256
for i from 0 to 79 //SHA-384/512
    ∑0 = (a ROTR 2) xor (a ROTR 13) xor (a ROTR 22) //SHA-224/256
    ∑0 = (a ROTR 28) xor (a ROTR 34) xor (a ROTR 39) //SHA-384/512
    maj = (a and b) xor (a and c) xor (b and c)
    t2 = ∑0 + maj
    ∑1 = (e ROTR 6) xor (e ROTR 11) xor (e ROTR 25) //SHA-224/256
    ∑1 = (e ROTR 14) xor (e ROTR 18) xor (e ROTR 41) //SHA-384/512
    ch = (e and f) xor ((not e) and g)
    t1 = h + ∑1 + ch + k[i] + w[i]
    h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
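The SHA-224/256 variant of the round in Listing 5.3 can be expressed directly in C, which is convenient as a reference when checking a datapath round by round. This is a sketch with hypothetical function names; the state is assumed to be stored as an array in the order a, b, c, d, e, f, g, h, and the round constant and message word are passed in by the caller.

```c
#include <stdint.h>

/* 32-bit right rotation; only called with n in 1..31. */
uint32_t sha256_rotr(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

uint32_t sha256_sigma0(uint32_t a)
{
    return sha256_rotr(a, 2) ^ sha256_rotr(a, 13) ^ sha256_rotr(a, 22);
}

uint32_t sha256_sigma1(uint32_t e)
{
    return sha256_rotr(e, 6) ^ sha256_rotr(e, 11) ^ sha256_rotr(e, 25);
}

uint32_t sha256_ch(uint32_t e, uint32_t f, uint32_t g)
{
    return (e & f) ^ (~e & g);
}

uint32_t sha256_maj(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) ^ (a & c) ^ (b & c);
}

/* One round of the main loop. s[0..7] holds a..h; k is the round constant
 * k[i] and w the message schedule word w[i]. */
void sha256_round(uint32_t s[8], uint32_t k, uint32_t w)
{
    uint32_t t1 = s[7] + sha256_sigma1(s[4]) + sha256_ch(s[4], s[5], s[6]) + k + w;
    uint32_t t2 = sha256_sigma0(s[0]) + sha256_maj(s[0], s[1], s[2]);

    s[7] = s[6]; s[6] = s[5]; s[5] = s[4]; s[4] = s[3] + t1;
    s[3] = s[2]; s[2] = s[1]; s[1] = s[0]; s[0] = t1 + t2;
}
```

The five-operand sum t1 and the two-operand sum t2 are the additions that a multiple-input adder can collapse into single steps.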


5.2.2 Required Elements

Following the approach used in the previous Section for the implementation of the SHA-1 algorithm, and breaking the algorithm down into its basic functions, even SHA512, which uses 64-bit words, can be implemented using 32-bit words. Examining the pseudocode in Listing 5.3, some of the problems faced when implementing SHA-1 stand out again, such as the benefit of a multiple-input adder; this problem can again be addressed by using the ADDX FU. In its current form, however, this is only valid for SHA256. Since SHA512 is based on 64-bit words, simply duplicating the ADDX FUs would not solve the problem, as each of them would still perform the same 32-bit operation. As a consequence, a second FU based on ADDX was created, named ADDX2. This FU is compatible with the SHA-1 implementation described in Section 5.1.3 and with SHA256, with one distinction: when both select flags are equal to one, the FU behaves as an adder of three 64-bit words, as specified in Section B.2. This new FU provides the ability to add three 64-bit values without the need for ripple-carry adders or similar solutions.
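The reason why two independent 32-bit adders are not enough is the carry between the low and high halves. The sketch below models a three-operand 64-bit addition on values stored as pairs of 32-bit words; the word64 type and the function name are hypothetical, and the way ADDX2 handles the carry internally is specified in Section B.2, not here.

```c
#include <stdint.h>

/* A 64-bit value stored as two 32-bit words, as on a 32-bit datapath. */
typedef struct {
    uint32_t hi;
    uint32_t lo;
} word64;

/* Adds three 64-bit values (modulo 2^64) using 32-bit halves. The sum of
 * three low halves can produce a carry of 0, 1 or 2 into the high half. */
word64 add3_64_model(word64 x, word64 y, word64 z)
{
    /* A wider accumulator is used only to expose the carry. */
    uint64_t low_sum = (uint64_t)x.lo + y.lo + z.lo;
    uint32_t carry = (uint32_t)(low_sum >> 32);
    word64 r;

    r.lo = (uint32_t)low_sum;
    r.hi = x.hi + y.hi + z.hi + carry;
    return r;
}
```

Dropping the carry term reproduces what two unrelated 32-bit adders would compute, which is incorrect whenever the low halves overflow.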

The functions denoted in Listing 5.3 as ∑0 and ∑1 are implemented by the FU named XORX, which represents a three-input XOR with the particularity of supporting the rotations required by SHA256. Figure 5.7 illustrates the operations performed by the XORX FU; it is important to note that the rotations are carried out with fixed values (shown in Listing 5.3).

[Figure: XORX block with inputs w0, w1, w2 and select; internal XOR, ROL0, ROL1 and MUX elements; outputs out0 and out1]
Figure 5.7: XORX FU description

In the same manner as ADDX2, a second version, XORX2, was created to support SHA512 by supplying the different rotation values. The only remaining operations in Listing 5.3 are ch and maj, which are very similar to the functions in the SHA-1 F function branches F1 (Figure 5.2) and F3 (Figure 5.4). As with SHA-1, these functions were added to the FU named ANDX. For the SHA512 implementation no adaptation was necessary, since all the operations performed by this FU are bitwise and need no carry-out flag; the SHA512 datapath only requires duplicating the existing FU to obtain the desired result. All the FU specifications can be found in Section A.
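
For reference, the SHA256 operations mapped onto XORX and ANDX can be sketched in software as follows. The rotation amounts are those of the SHA-256 standard (FIPS 180-4); the function names are illustrative and are not SideC identifiers, and the sketch says nothing about how the FUs are built internally.

```python
MASK32 = 0xFFFFFFFF


def rotr32(x, n):
    """Rotate a 32-bit word right by n positions (0 < n < 32)."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def big_sigma0(a):
    # Three-input XOR of fixed rotations, the kind of operation XORX provides.
    return rotr32(a, 2) ^ rotr32(a, 13) ^ rotr32(a, 22)


def big_sigma1(e):
    return rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25)


def ch(e, f, g):
    # "Choose": each bit of e selects the corresponding bit of f (1) or g (0).
    return ((e & f) ^ (~e & g)) & MASK32


def maj(a, b, c):
    # "Majority": each output bit is the majority of the three input bits.
    return (a & b) ^ (a & c) ^ (b & c)
```

Both ch and maj are purely bitwise, which is why the same FU can serve 64-bit words simply by being instantiated twice.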


Exploiting the advantages provided by the FUs detailed above, two new datapaths were devised, one for SHA256, depicted in Figure 5.8, and another for SHA512. Both adopt the same structure; the main difference is the duplication of FUs where required to support 64-bit operations, or the use of similar ones, as discussed in the previous Section.

Table 5.1 lists which FUs, and how many of each, are required by each of the datapaths. Note that the SHA256 datapath also supports SHA224, just as the SHA512 datapath supports SHA384.

Figure 5.8: SHA256 Final schematic (datapath stages 0 to 5 with FUs XORX0, XORX1, ANDX0, ANDX1, ADDX0, ADDX1 and ADD)

The number of cycles per round required by both versions is exactly the same and can be inferred from Figure 5.8: this datapath needs five cycles to complete one round. The direct correlation between the pseudocode in Listing 5.3 and the datapath is highlighted by labelling the connections among the FUs with the same names as in the pseudocode. In the SHA256 datapath, an adder already available in the SideC library is used. The motive behind this decision is simple: the operation requires only a two-input 32-bit adder, so there was no need to use an extra ADDX. In the SHA512 counterpart, however, it was necessary to use an ADDX2 instead.

In order to reduce the number of cycles to the bare minimum, registers were used instead of memory blocks wherever possible. As mentioned before, this comes at the cost of only being able to calculate one hash for each call of the datapath. Again, this problem could be solved by adding memory blocks, at the cost of two extra cycles per hashed value to load and save the values to and from memory. In this particular datapath, however, it is simple to add memory and maintain the same number of clock cycles per round, because the word E defines the critical path: by keeping the register used to store E side by side with the new memory block, the register acts as a buffer that maintains the required clock cycles per round, increasing only the memory footprint. The size of the added memory determines the maximum number of hash values that can be calculated sequentially. Relying on the datapath presented in Figure 5.8, the total number of cycles can be determined by multiplying five by the number of rounds: SHA256 has sixty-four rounds, resulting in three hundred and twenty cycles, while SHA512 has eighty rounds, for a total of four hundred cycles per hashed value.
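
The cycle counts above can be reproduced with a few lines of code. This is only a restatement of the arithmetic in the text (five cycles per round, 64 or 80 rounds); the names `CYCLES_PER_ROUND`, `ROUNDS` and `cycles_per_hash` are hypothetical and chosen for this example.

```python
# Values taken from the text: five cycles per round for both datapaths.
CYCLES_PER_ROUND = 5
ROUNDS = {"SHA256": 64, "SHA512": 80}


def cycles_per_hash(algorithm):
    """Cycles needed per hashed value, ignoring any load/store overhead."""
    return CYCLES_PER_ROUND * ROUNDS[algorithm]


for name in ROUNDS:
    print(name, cycles_per_hash(name))
```

If memory blocks were added as discussed above, the two extra load/store cycles per hashed value would have to be added to these totals.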

FU       SHA-1   SHA256   SHA512   Latency (ns)   Slices
ADDX     1       2        0        8.436          226
ADDX2    0       0        4        6.904          257
ANDX     1       2        4        1.599          32
ROL      2       0        0        3.417          80
XORX     1       1        0        2.414          79
XORX2    0       0        2        2.386          96

Table 5.1: SHA-1 and SHA-2 Functional Units requirements and technical features.

Table 5.1 presents a brief summary of the technical features of the FUs added to SideC to implement SHA-1 and SHA-2, as well as the number of instances required in each datapath. It is important to emphasize that all FUs used in the SHA512 datapath can be reused in the SHA-1 and SHA256 datapaths, with the exception of XORX2. This means that the three datapaths can share all the FUs apart from one XORX, which is used only by SHA-1 and SHA256. Moreover, starting from the SHA512 datapath FU list (Table 5.1), it is easy to extend this list to provide all the elements necessary for all three datapaths just by adding one XORX and two ROL FUs.

5.3 SHA-3


The SHA-3 hash function uses the sponge construction to map arbitrarily long inputs into fixed-length outputs. Its official versions have an internal state size of b = 1600 bits and an output size n of 224, 256, 384 or 512 bits. The internal permutation of Keccak consists of 24 applications of a nonlinear round function to the 1600-bit state.


The state of Keccak-f[1600] is organized as a three-dimensional array, which suggests several ways to partition the bits. While processing the state as 64-bit lanes is an optimal choice on software platforms that actually offer 64-bit operations, the technique known as bit interleaving allows efficient implementations on systems with smaller word sizes and can also be used to target compact hardware circuits. In its simplest form, namely factor-2 interleaving, it splits the odd and even bits of each lane.

The technique of bit interleaving consists in coding a w-bit lane as an array of s = w/m words of m bits each, with word ζ containing the lane bits with z ≡ ζ (mod s). Here, s is called the interleaving factor. This technique can be applied to any version of Keccak-f on any processor whose word length m divides the lane length w. Consider first the concrete case of 64-bit lanes and 32-bit words: a 64-bit lane is coded as two 32-bit words, one containing the lane bits with even z-coordinates and the other those with odd z-coordinates. More exactly, a lane L[z] = a[x][y][z] is mapped onto words U0 and U1 with U0[j] = L[2j] and U1[j] = L[2j + 1]. If all the lanes of the state are coded this way, the bitwise operations can be implemented simply with bitwise instructions operating on the words. The main benefit is that the lane translations in ρ and θ can now be implemented with 32-bit word rotations. A translation of L with an even offset 2τ corresponds to the translation of the two corresponding words with offset τ. A translation of L with an odd offset 2τ + 1 corresponds to U0 ← ROT32(U1, τ + 1) and U1 ← ROT32(U0, τ). Additionally, a translation with an offset equal to 1 or −1 results in only a single word rotation. For Keccak-f[1600] specifically, this is the case for 6 out of the 29 lane translations in each round (5 in θ and 1 in ρ) [11].
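
The mapping and the two rotation rules described above can be checked with a small software model. This is a minimal sketch written for this explanation, not the Inplace32BI code; all function names are hypothetical, and "translation" is taken to mean a cyclic left rotation of the lane.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def rot32(x, n):
    """Rotate a 32-bit word left by n positions (n taken modulo 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def rot64(x, n):
    """Rotate a 64-bit lane left by n positions (n taken modulo 64)."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64


def interleave(lane):
    """Split lane L into (U0, U1) with U0[j] = L[2j] and U1[j] = L[2j + 1]."""
    u0 = u1 = 0
    for j in range(32):
        u0 |= ((lane >> (2 * j)) & 1) << j
        u1 |= ((lane >> (2 * j + 1)) & 1) << j
    return u0, u1


def deinterleave(u0, u1):
    """Inverse of interleave: rebuild the 64-bit lane from its two words."""
    lane = 0
    for j in range(32):
        lane |= ((u0 >> j) & 1) << (2 * j)
        lane |= ((u1 >> j) & 1) << (2 * j + 1)
    return lane


def rotate_interleaved(u0, u1, offset):
    """Translate an interleaved lane using only 32-bit rotations."""
    tau, odd = divmod(offset % 64, 2)
    if odd:
        # Odd offset 2*tau + 1: the words swap roles.
        return rot32(u1, tau + 1), rot32(u0, tau)
    # Even offset 2*tau: both words rotate by tau.
    return rot32(u0, tau), rot32(u1, tau)
```

For an offset of 1 (τ = 0) the second word is rotated by zero positions, which is the single-rotation case mentioned in the text.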

Consider the specific case of an in-place implementation using an interleaving factor s = 2, e.g., to implement Keccak-f[1600] using 32-bit operations. After 4 rounds, one would expect all the lanes to come back to their initial positions, but not necessarily the word offsets within the lanes, which would require another 4 rounds for O(8, x, y) to be zero (modulo 2). However, one of the major advantages of this special case is that O(4, x, y) = 0 (mod 2) for all coordinates x, y. This implies that after 4 rounds, all the words within the lanes are also back to their initial positions [11].

The chosen implementation, Inplace32BI, proposed by the authors of Keccak, exploits the described characteristics by unrolling 4 rounds and using the factor-2 interleaving technique to map the 64-bit lanes to 32-bit words, representing the state of Keccak-f[1600] as 50 words of 32 bits in order to save memory.


The selected structure, based on the factor-2 interleaving implementation, still depends on the same basic functions as the original, but the unfolding of the main loop, combined with the grouping and routing of these functions, provides an efficient 32-bit implementation. It is composed of the repetition of four simple functions: a two-word XOR, a five-word XOR, a ROL operation and the χ function (see 2.1). The χ function is itself a combination of XOR, AND and NOT operations (depicted in Figure 5.9 and Listing 5.4) and can be implemented as a cascade of the three FUs NOT, AND and XOR, in this order, all of which are already present in the SideC library. This approach would require three clock cycles rather than only one; for this reason, a dedicated FU named CHI was designed.

Developing and implementing this version of the SHA-3 family algorithm was the last task performed during the development stage; therefore, most of the FUs were already designed and implemented. Some were reused in order to speed up the development process. This is the case for CXOR, originally employed for the AES solution and here reused as a five-input 32-bit XOR. Another example is the ROL FU used in SHA-1. On the other hand, not all the required functions were available in an efficient way; to fill this gap, two new FUs were created: CHI, presented in Listing 5.4, and XORR, presented in Listing 5.5.

Figure 5.9: CHI FU description (inputs w0, w1, w2, w3; XOR, NOT, AND and XOR blocks; output out)

Listing 5.4: CHI FU Pseudocode

CHI(w0, w1, w2, w3) {
    return (w0 ^ (~w1 & w2) ^ w3)
}

Listing 5.5: XORR FU Pseudocode

XORR_L(w0, w1, w2) {
    return ROL_L((w0 ^ w1), w2)
}
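
A software model of the two listings is given below as a sketch. It assumes 32-bit operands, that the NOT in CHI is a bitwise complement, and that the third operand of XORR is the rotation amount; these are assumptions made for the example, and the function names are not SideC identifiers.

```python
MASK32 = 0xFFFFFFFF


def rol32(x, n):
    """Rotate a 32-bit word left by n positions (n taken modulo 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def chi(w0, w1, w2, w3):
    """Model of the CHI FU: w0 XOR (NOT w1 AND w2) XOR w3, in a single step."""
    return (w0 ^ (~w1 & w2) ^ w3) & MASK32


def xorr(w0, w1, amount):
    """Model of the XORR FU: XOR of two words followed by a left rotation."""
    return rol32(w0 ^ w1, amount)


def xorr_chained(w0, w1, amount):
    """Same result obtained by chaining a plain XOR and a plain ROL."""
    x = w0 ^ w1              # first FU in the chain
    return rol32(x, amount)  # second FU in the chain
```

Both `xorr` and `xorr_chained` return the same value; the difference in hardware is that the chained version spends one clock cycle per FU.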

All of the added FUs can easily be replicated by chaining already existing SideC FUs; for example, a single two-input XOR followed by a ROL would output the same result as the XORR FU. The downside is that each FU in the chain requires one clock cycle to produce its result. Considering the size of the SHA-3 datapath, this approach would greatly increase the number of clock cycles needed to perform a single round. The two FUs created for SHA-3 are among the smallest and simplest developed in the scope of this work, a reflection of the chosen implementation, which relies on small operations but requires multiple instances.

Figure 5.10: XORR FU description (inputs w0, w1, w2; XOR followed by ROL; output out)


Although SHA-3 requires only a small number of operations such as AND, XOR, NOT and ROT, the implementation itself is not trivial due to the structure of the algorithm. The first step in adapting the reference implementation was to split the base solution developed by the authors into small reusable FUs, taking into account the already existing FUs, while at the same time trying to reduce the number of cycles necessary, i.e. to balance the area occupied by the FUs against the number of FUs required. Table 5.2 lists the FUs required by this structure and the technical features of each one. Clearly, the number of FUs required is much higher than in any other structure proposed in this work. This is to be expected, since the state of SHA-3 is 1600 bits, which in itself represents a significant increase compared with the other algorithms. It is also important to note that the FUs developed for this structure are amongst the simplest and smallest developed throughout this work. However, this can be a problem when considering the high number of instances of the XORR FU, which accounts for one-third of the area occupied by the SHA-3 datapath.

FU      SHA-3   Latency (ns)   Slices
CHI     200     1.579          32
CXOR    40      4.821          248
ROL     20      3.417          80
XORR    240     4.003          96

Table 5.2: SHA-3 Functional Units requirements and technical features.

The final structure provides an implementation of the SHA-3 512 function, with a single execution of the datapath performing 4 of the 24 rounds at a cost of 20 clock cycles per execution, resulting in 120 cycles to produce the hash value. As expected, the chosen unrolled factor-2 interleaving structure results in a fast implementation of this algorithm and represents a good example of the trade-off between area and speed. Note that this structure is at least three times faster than the ones proposed for the former versions of the SHA family algorithms.
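
The cycle count quoted above follows directly from the figures given in the text (24 rounds, 4 rounds per datapath execution, 20 clock cycles per execution):

```latex
\[
\frac{24\ \text{rounds}}{4\ \text{rounds per execution}} \times 20\ \text{cycles per execution}
= 6 \times 20 = 120\ \text{cycles}
\]
```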

5.4 Summary

In this chapter, the development of the SHA family algorithms was described. This new approach enables the use of the algorithms within the Coreworks platform. The chapter also detailed the major challenges faced when using this platform and how the limitations found in the preliminary version were overcome. The system was made highly reconfigurable by basing it on a modular architecture that enables multiple combinations, providing the ability to generate tailored solutions.


6 Proposed Structures for Symmetric Key Encryption Algorithms

Contents
6.1 AES
6.1.1 Proposed Structure
6.1.2 Required Elements
6.1.3 Final Structure
6.2 CLEFIA
6.2.1 Proposed Structure
6.2.2 Required Elements
6.2.3 Final Structure
6.3 Summary


In this chapter, structures for two symmetric encryption algorithms are proposed. The first Section proposes a compact AES structure capable of achieving high throughput with a small device occupation. The second Section proposes a structure for CLEFIA, notable for its lightweight efficiency. Even though it is less frequently used, the CLEFIA algorithm is still employed by many applications. CLEFIA and AES share similar ciphering techniques, such as diffusion matrices and byte substitution (S-Box), which makes them a good example for demonstrating the Sideworks capabilities.

6.1 AES

A lot of work has been done on implementing the AES algorithm on reconfigurable hardware devices using different types of architectures and methods. There is always a trade-off between throughput and area. In particular, FPGAs are well suited for encryption implementations due to their flexibility and an architecture that can be exploited to accommodate typical encryption transformations.

In this Section, a LookUp Table (LUT) methodology based on a fully rolled AES implementation is described.


AES can be implemented in hardware using several different types of architectures, such as rolled, unrolled, pipelined and sub-pipelined. A rolled architecture is selected when the area is to be minimized, because the same hardware is used for all the rounds; for an n-round algorithm, n iterations of that round are carried out to perform the encryption. This means the output of one round is given as input to the next round. In an unrolled structure, the round computation is expanded and pipelined. With this, multiple rounds of the algorithm can be executed independently and in parallel.

Pipelining is achieved by replicating the round and placing registers between rounds to control the flow of data. A fully pipelined architecture processes several different data blocks at the same time: when round one is completed, its output is used as input for the second round while, at the same time, round one takes new data as input for processing. This applies to all the rounds. Sub-pipelining is carried out on a partially pipelined design when the round is complex, by adding registers both between rounds and between the transformations inside each round. This decreases the delay between pipeline stages but increases the number of clock cycles required to perform the encryption operation.

Coupled with the different architectures, there are also multiple implementation methods, such as S-Box [42–44] and T-Box [45–48], both based on LUTs; the first was explained in Section 2.2.1.1. A T-Box based implementation of AES was first proposed by J. Daemen and V. Rijmen for software implementations on 32-bit microprocessors [16]; afterwards, the approach was adapted for hardware implementations [49].

T-Box is a well-defined approach where SubBytes and MixColumns are replaced by a single word lookup for each input byte, and the state is reconstructed before AddRoundKey. These operations are, in effect, precomputed, and the expected results for all possible inputs are placed in the LUTs. This is accomplished by directly mapping the SubBytes transformation onto MixColumns. To clarify, consider the combination of SubBytes and MixColumns; an output column of this transform equals:


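\[
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\times
\begin{bmatrix} S[a_0] \\ S[a_1] \\ S[a_2] \\ S[a_3] \end{bmatrix}
\tag{6.1}
\]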

where the bi represent the transform output bytes and the ai its input bytes. The bi vector is equivalent to:

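\[
\begin{bmatrix} 02 \\ 01 \\ 01 \\ 03 \end{bmatrix} \times S[a_0]
\oplus
\begin{bmatrix} 03 \\ 02 \\ 01 \\ 01 \end{bmatrix} \times S[a_1]
\oplus
\begin{bmatrix} 01 \\ 03 \\ 02 \\ 01 \end{bmatrix} \times S[a_2]
\oplus
\begin{bmatrix} 01 \\ 01 \\ 03 \\ 02 \end{bmatrix} \times S[a_3]
\tag{6.2}
\]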

Therefore, defining the four tables as:

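\[
T_0[a] = \begin{bmatrix} S[a] \bullet 02 \\ S[a] \\ S[a] \\ S[a] \bullet 03 \end{bmatrix}
\quad
T_1[a] = \begin{bmatrix} S[a] \bullet 03 \\ S[a] \bullet 02 \\ S[a] \\ S[a] \end{bmatrix}
\quad
T_2[a] = \begin{bmatrix} S[a] \\ S[a] \bullet 03 \\ S[a] \bullet 02 \\ S[a] \end{bmatrix}
\quad
T_3[a] = \begin{bmatrix} S[a] \\ S[a] \\ S[a] \bullet 03 \\ S[a] \bullet 02 \end{bmatrix}
\tag{6.3}
\]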

the combination of SubBytes and MixColumns equals:

\[
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
= T_0[a_0] \oplus T_1[a_1] \oplus T_2[a_2] \oplus T_3[a_3]
\tag{6.4}
\]
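
The construction in (6.1) to (6.4) can be checked in software. The sketch below generates the AES S-Box from its mathematical definition, builds the four tables of (6.3), and compares a T-Box lookup with the direct SubBytes plus MixColumns computation. It is written for this explanation only and does not reflect the memory layout of the proposed hardware; all names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """AES S-Box: multiplicative inverse followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()
MIX = [(2, 3, 1, 1), (1, 2, 3, 1), (1, 1, 2, 3), (3, 1, 1, 2)]  # matrix of (6.1)


def mix_column(col):
    """Direct MixColumns on one 4-byte column."""
    return [gf_mul(MIX[r][0], col[0]) ^ gf_mul(MIX[r][1], col[1])
            ^ gf_mul(MIX[r][2], col[2]) ^ gf_mul(MIX[r][3], col[3])
            for r in range(4)]


# T[i][a] is column i of the matrix scaled by S[a], as in (6.3).
T = [[tuple(gf_mul(MIX[r][i], SBOX[a]) for r in range(4)) for a in range(256)]
     for i in range(4)]


def tbox_column(a):
    """Combined SubBytes + MixColumns via four lookups and XORs, as in (6.4)."""
    out = [0, 0, 0, 0]
    for i in range(4):
        for r in range(4):
            out[r] ^= T[i][a[i]][r]
    return out
```

Each table entry holds four bytes, i.e. one 32-bit word per input byte, which is why the round reduces to four lookups and a tree of XORs.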

The state, a 4 × 4 byte matrix holding the 128-bit input, is multiplied by a constant matrix; precomputing these products yields the T-Box tables and the inverse T-Box tables given in (6.3) and (6.5), respectively. The result of a round is obtained by XORing the outputs of the T-Box tables, as given in expression (6.4). Similarly, the output of decryption is obtained by XORing the outputs of the inverse T-Box tables (6.5).

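\[
T_0^{-1}[a] = \begin{bmatrix} S^{-1}[a] \bullet 0E \\ S^{-1}[a] \bullet 09 \\ S^{-1}[a] \bullet 0D \\ S^{-1}[a] \bullet 0B \end{bmatrix}
\quad
T_1^{-1}[a] = \begin{bmatrix} S^{-1}[a] \bullet 0B \\ S^{-1}[a] \bullet 0E \\ S^{-1}[a] \bullet 09 \\ S^{-1}[a] \bullet 0D \end{bmatrix}
\quad
T_2^{-1}[a] = \begin{bmatrix} S^{-1}[a] \bullet 0D \\ S^{-1}[a] \bullet 0B \\ S^{-1}[a] \bullet 0E \\ S^{-1}[a] \bullet 09 \end{bmatrix}
\quad
T_3^{-1}[a] = \begin{bmatrix} S^{-1}[a] \bullet 09 \\ S^{-1}[a] \bullet 0D \\ S^{-1}[a] \bullet 0B \\ S^{-1}[a] \bullet 0E \end{bmatrix}
\tag{6.5}
\]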

A complete round function is obtained by concatenating the 32-bit round outputs. The last round, however, is unique in that it omits the MixColumns operation (see Figure 2.6) and requires special consideration. In this round, SubBytes values are needed instead of T-Box outputs. No additional memory is needed to implement a separate SubBytes table, as the one-byte outputs of the SubBytes transformation can be extracted directly: this value is equal to the multiplication by 01 (see equation 6.6), which is already given by the memory computation. In decryption, however, all four results are multiplied by coefficients different from 01 (i.e. 09, 0E, 0B and 0D; see equation 6.7).


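\[
01 \bullet a_i = 03 \bullet a_i \oplus 01 \bullet a_i \oplus 01 \bullet a_i \oplus 02 \bullet a_i
\tag{6.6}
\]

\[
01 \bullet a_i = 0B \bullet a_i \oplus 0D \bullet a_i \oplus 09 \bullet a_i \oplus 0E \bullet a_i
\tag{6.7}
\]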
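
Both identities hold for every byte value, because multiplication in GF(2^8) distributes over XOR and the coefficients of each row XOR to 01. The short check below verifies this exhaustively; `gf_mul` is a helper written for this example and assumes the AES reduction polynomial.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def xor_of_products(a, coefficients):
    """XOR together the products of byte a with each coefficient."""
    out = 0
    for c in coefficients:
        out ^= gf_mul(a, c)
    return out


ENC_COEFFS = (0x03, 0x01, 0x01, 0x02)  # coefficients of equation (6.6)
DEC_COEFFS = (0x0B, 0x0D, 0x09, 0x0E)  # coefficients of equation (6.7)

enc_ok = all(xor_of_products(a, ENC_COEFFS) == a for a in range(256))
dec_ok = all(xor_of_products(a, DEC_COEFFS) == a for a in range(256))
```

In the decryption case this means the plain inverse S-Box byte can be recovered by XORing the four bytes of an inverse T-Box word, without a dedicated table.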

The chosen implementation is based on the T-Box design proposed by Chaves et al. [47], in which the authors implement the ShiftRows operation through hardwired and multiplexed paths. To improve hardware efficiency, both the encryption and decryption T-Boxes are merged into a single memory block. The encryption-decryption memory block requires 2048 addressable bytes (2 × 32 × 2^8 = 16384 bits = 2048 bytes). The new memory address is formed by a byte of the state matrix and an additional bit indicating whether the operation is encryption or decryption, as depicted in Figure 6.3. The result is a simple rolled architecture using T-Boxes, where the key expansion is performed in software.
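
The 9-bit address and the memory size can be illustrated as follows. The position of the mode bit (here the most significant address bit) is an assumption made for this sketch; the text only states that the address consists of a state byte plus one encryption/decryption bit. The names are hypothetical.

```python
ADDRESS_BITS = 9   # 8-bit state byte + 1 mode bit
WORD_BITS = 32     # each T-Box entry is a 32-bit word


def tbox_address(state_byte, decrypt):
    """Form the merged T-Box address (mode bit assumed to be the top bit)."""
    return (int(bool(decrypt)) << 8) | (state_byte & 0xFF)


# 2 tables (encrypt/decrypt) x 32 bits x 2^8 entries, as in the text.
MEMORY_BITS = 2 * WORD_BITS * 2 ** 8
MEMORY_BYTES = MEMORY_BITS // 8
```

With this layout, encryption entries occupy addresses 0x000 to 0x0FF and decryption entries 0x100 to 0x1FF.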


Using the implementation described by Chaves et al. in [47] as a starting point, three new FUs were designed, CXOR, SXOR and FXOR, each one corresponding to a different stage of the algorithm. From the beginning, the SXOR was conceived to select and route the initial and the feedback inputs for the rounds. It accepts as input the Round Key for the first round, the plaintext input and the result from the previous rounds, as well as a mode select that determines when to use decryption. Figure 6.1 presents the inner workings of this FU, where the red values represent the switch in the index for the decryption mode. The supporting documentation for this FU can be found in Section B.10.

Figure 6.1: XOR Select description (32-bit inputs Input0,j, Key0,j and PipeIN; 1-bit select and Decrypt; XOR, MUX0 and MUX1 blocks; four 9-bit outputs)

The output of the SXOR is directly connected to the address of the T-Box memory. The values obtained from the T-Box memory then need to be XORed to complete the round; for that purpose, the CXOR FU was created. The result of the round is computed by XORing the outputs of the T-Boxes, as explained in the previous Section and depicted in Figure 6.2. The CXOR specification can be found in Section B.5.

The last round of the AES computation has the particularity of not applying the MixColumns operation, as illustrated in Figure 6.3. It can be computed with the same memory structure, using only the S-Box operation. The output value is then directly added to the key, since no polynomial addition has to be performed.


Figure 6.2: XOR final description

To this effect, the FXOR FU is used, providing the features required to compute the last round of the AES algorithm. Figure 6.3 shows a partial round of the chosen implementation, in which the FXOR FU is depicted as well. Note that, although not present in this figure, a CXOR FU is placed right beside each FXOR: for each FXOR, an equivalent CXOR using the same state inputs is necessary in order to compute the inner state of the algorithm.

Figure 6.3: AES ECB Core Partial Round Example (SXOR feeding two T-Box memories, MEM0 and MEM1, through 9-bit addresses; 32-bit T-Box outputs combined in FXOR with ror(8), ror(16) and ror(24) and Keyi,j to produce out0)


This memory-based structure allows an efficient implementation of the AES encryption and decryption core that is, at the same time, potentially more resistant to Differential Power Analysis (DPA) attacks [50], due to the uniform power consumption of the memory blocks. Furthermore, to optimize the implementation, Block RAM (BRAM), available through the Sideworks framework, is used. This enables the use of dual-port RAM, cutting the memory requirements in half, and is the ideal option for saving logic resources.

Figure 6.3 shows a part of the AES ECB core, which processes a 32-bit block of the 128-bit AES state; the full structure is therefore composed of four structures like the one represented, each processing 32 bits. Every 128-bit block requires four clock cycles to be processed. As mentioned previously, although CXOR is not present in Figure 6.3, one is used for every FXOR, for a total of four, to process the inner state values.

Figure 6.4: AES CBC Core Partial Round Example (as Figure 6.3, with SXOR2 and FXOR2 in place of SXOR and FXOR, each with an additional 32-bit IV input and a 1-bit select1 flag)

A second version of this structure is proposed in Figure 6.4; the main difference is the added ability to use the CBC mode, described in Section 2.2.3. To enable this feature, second versions of SXOR and FXOR were created. They support the same functions as the first versions but add one more input word, which accepts the respective IV, and a select flag that determines when the CBC mode is to be employed.

The proposed implementation uses a fully rolled version of the AES core capable of encrypting and decrypting data blocks in both ECB and CBC modes for all key sizes, which correspond to 10, 12 or 14 rounds. The operation mode, as well as the IV for the CBC mode, is passed as an additional parameter.
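
To make the role of the extra IV input concrete, the sketch below shows generic CBC chaining around an arbitrary block function. It illustrates the mode only: `toy_encrypt` and `toy_decrypt` are deliberately trivial placeholders, not AES and not secure, and none of the names correspond to the SXOR2 or FXOR2 interfaces.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_encrypt(blocks, iv, encrypt_block):
    """CBC encryption: XOR each block with the previous ciphertext (or IV)."""
    previous, out = iv, []
    for block in blocks:
        previous = encrypt_block(xor_bytes(block, previous))
        out.append(previous)
    return out


def cbc_decrypt(blocks, iv, decrypt_block):
    """CBC decryption: decrypt, then XOR with the previous ciphertext (or IV)."""
    previous, out = iv, []
    for block in blocks:
        out.append(xor_bytes(decrypt_block(block), previous))
        previous = block
    return out


def toy_encrypt(block):
    """Placeholder block function (NOT AES): add 1 to every byte."""
    return bytes((b + 1) % 256 for b in block)


def toy_decrypt(block):
    """Inverse of the placeholder block function."""
    return bytes((b - 1) % 256 for b in block)
```

In encryption the XOR with the IV or previous ciphertext happens before the first round, and in decryption after the last one, which is consistent with the extra input being added to both the first-stage and the final-stage FU.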


6.2 CLEFIA

Many lightweight cryptographic algorithms sacrifice security and/or speed; however, CLEFIA provides high-level security of 128, 192 and 256 bits and high performance in software and hardware [20]. CLEFIA should be considered when using resource-constrained devices [20], especially if they are connected to the Internet.


As described in Section 2.2.2.1, the CLEFIA algorithm computation is divided into the key scheduling computation and the ciphering computation itself. While the ciphering computation needs to be performed for every 128-bit data block, the key scheduling only needs to be computed once for the same input key. This is an important factor when deciding whether or not to add dedicated hardware to perform the key expansion. As in the AES algorithm, the key scheduling operation is assumed to be performed in software, with the resulting round keys either transferred to the hardware core during the initialization procedure or made available to the respective datapath during execution, in the order in which they are used.

Due to the similarity between AES and CLEFIA, a similar approach was chosen, again based on a rolled T-Box implementation. Originally presented in [25], and validated by the structures proposed in [25, 51], a fast implementation of the CLEFIA algorithm can be achieved with the use of T-Boxes. As in the AES algorithm, the T-Boxes merge the computation of the S-Box and part of the diffusion matrix (2.5) operations with the linear transformation layers, compressing the resulting structure into a LUT and also reducing the critical path [42].

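In the CLEFIA F-functions, T-Boxes can be used to replace S0, S1, M0 and M1 by the lookup operations given in (6.8), followed by bitwise XOR operations (additions over GF(2^8)) [25]:

\[
\begin{aligned}
T_{00} &= (S_0,\ 02 \times S_0,\ 04 \times S_0,\ 06 \times S_0) \\
T_{01} &= (02 \times S_1,\ S_1,\ 06 \times S_1,\ 04 \times S_1) \\
T_{02} &= (04 \times S_0,\ 06 \times S_0,\ S_0,\ 02 \times S_0) \\
T_{03} &= (06 \times S_1,\ 04 \times S_1,\ 02 \times S_1,\ S_1) \\
T_{10} &= (S_1,\ 08 \times S_1,\ 02 \times S_1,\ 0A \times S_1) \\
T_{11} &= (08 \times S_0,\ S_0,\ 0A \times S_0,\ 02 \times S_0) \\
T_{12} &= (02 \times S_1,\ 0A \times S_1,\ S_1,\ 08 \times S_1) \\
T_{13} &= (0A \times S_0,\ 02 \times S_0,\ 08 \times S_0,\ S_0)
\end{aligned}
\tag{6.8}
\]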

The resulting T-Boxes have an 8-bit input bus and a 32-bit data output. These lookup tables can be implemented using LUTs [51] or using dedicated memory blocks. Given that most current reconfigurable devices, FPGAs in particular, have dedicated embedded memory blocks designated as BRAMs, the T-Box implementation can be efficiently realized with these components. This allows for faster and less LUT-demanding solutions [51]. Further optimizations can reduce the resource requirements by taking into account that these tables perform identical calculations. In fact, T00 and T02, presented in (6.8), perform the same lookup operation given the same input, differing only in a 16-bit shift of the output. The same applies to T01/T03, T10/T12 and T11/T13. The additional shift operations can be implemented by hardwired routing, without additional area overhead. The remaining hardware required to perform the round computation is composed of a tree of bitwise XOR operations (additions over GF(2^8)) [25]. Apart from the round computation, the addition of the four 32-bit whitening keys also needs to be performed, two at the beginning and two more at the end of the final round computation. The resulting structure is similar to the one proposed in [25].
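
The relation between the paired tables can be verified with a short script. Because the property depends only on the coefficient ordering in (6.8), the sketch uses arbitrary placeholder permutations instead of the real CLEFIA S-Boxes; the reduction polynomial 0x11D and the byte packing order (first tuple element in the most significant byte) are likewise assumptions of this example.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply in GF(2^8); the reduction polynomial is an assumed parameter."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def make_tbox(sbox, coeffs):
    """Build a 256-entry table of 32-bit words from an S-Box and 4 coefficients."""
    table = []
    for x in range(256):
        word = 0
        for c in coeffs:
            word = (word << 8) | gf_mul(sbox[x], c)
        table.append(word)
    return table


def rot16(word):
    """Rotate a 32-bit word by 16 bits (swap its two halves)."""
    return ((word << 16) | (word >> 16)) & 0xFFFFFFFF


# Placeholder permutations, NOT the CLEFIA S0 and S1 tables.
PLACEHOLDER_S0 = [(7 * x + 3) % 256 for x in range(256)]
PLACEHOLDER_S1 = [(13 * x + 5) % 256 for x in range(256)]

T00 = make_tbox(PLACEHOLDER_S0, (0x01, 0x02, 0x04, 0x06))
T02 = make_tbox(PLACEHOLDER_S0, (0x04, 0x06, 0x01, 0x02))
T10 = make_tbox(PLACEHOLDER_S1, (0x01, 0x08, 0x02, 0x0A))
T12 = make_tbox(PLACEHOLDER_S1, (0x02, 0x0A, 0x01, 0x08))

pair_00_02 = all(rot16(T00[x]) == T02[x] for x in range(256))
pair_10_12 = all(rot16(T10[x]) == T12[x] for x in range(256))
```

Since a 16-bit rotation is pure wiring in hardware, only one table of each pair needs to be stored.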


Taking full advantage of the similarities between AES and CLEFIA, only one new FU was added to the SideC library, named XORCL (Appendix B.12). This unit completes the final stage of each round and is composed of XOR operations and hardwired shifts, as depicted in Figure 6.5. The first stage of CLEFIA is performed by reusing the SXOR FU designed for AES. While it is true that not all the features provided by this unit are in fact used, assuming the AES core is present, reusing the FU optimizes resource allocation.

Figure 6.5: CLEFIA Round Example (two parallel branches, each with an SXOR, two T-Box memories and an XORCL; inputs P0 to P3, whitening keys WKi and WKi+1, round keys RKi and RKi+1)


The implemented structure, depicted in Figure 6.5, represents the complete structure of the CLEFIA algorithm. As can be seen, it uses a fourth of the memory needed by AES for the T-Boxes: this structure requires four 1024-byte T-Boxes (32 × 2^8 = 8192 bits = 1024 bytes), as opposed to the eight 2048-byte T-Boxes (2 × 32 × 2^8 = 16384 bits = 2048 bytes) used in AES. In this case there are two for T0 and two for T1, considering the use of dual-port memories. Such a large reduction of the memory footprint is possible because the CLEFIA algorithm uses a 128-bit block but only 64 bits (32 + 32) change in each round.
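
The one-fourth ratio can be checked directly from the T-Box counts and sizes given in the text:

```latex
\[
\underbrace{4 \times 1024}_{\text{CLEFIA}} = 4096\ \text{bytes},
\qquad
\underbrace{8 \times 2048}_{\text{AES}} = 16384\ \text{bytes},
\qquad
\frac{4096}{16384} = \frac{1}{4}
\]
```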

The use of the SXOR FU at the beginning of the CLEFIA datapath provides access to the 9th address bit of the T-Box in the same fashion as the AES implementation, although in this case it is not used. In a folded structure such as the one proposed by Proenca et al. [52], however, the symmetry between the F0 and F1 functions is exploited to obtain an even more compact CLEFIA implementation. Since the main difference between F0 and F1 resides in the M0 and M1 tables, as depicted in 2.5, a more compact structure can be derived by merging the computation of these two tables into a single LUT. The resulting merged T-Boxes are capable of computing both F0 and F1, using the 9th address bit to differentiate between an F0 and an F1 T-Box when computing the CLEFIA algorithm. Although this folded version was not implemented, it is worth mentioning because the differences in the required FUs are minimal, and the resulting implementation would resort to only one T-Box with double the size (2 × 32 × 2^8 = 16384 bits = 2048 bytes) of those currently used, at the cost of double the execution time of the proposed implementation. Still, this would require half the memory of the proposed structure.

FU       AES ECB   AES CBC   CLEFIA   Latency (ns)   Slices
CXOR     4         4         0        4.821          248
FXOR     4         0         0        1.676          32
FXOR2    0         4         0        2.600          80
SXOR     4         0         2        2.401          40
SXOR2    0         4         0        3.218          61
XORCL    0         0         2        4.170          56

Table 6.1: AES and CLEFIA Functional Units requirements and technical features.

Table 6.1 presents the necessary FUs and the respective number of instances for the structures discussed in this Section, as well as the area requirements associated with each FU. It confirms that CLEFIA has a much smaller area footprint than the AES structures, which is also apparent from the representation of the structures in Figures 6.3, 6.4 and 6.5.

6.3 Summary

This Chapter detailed the proposed structures for AES and CLEFIA and explained the rationale behind each structure. It described several new and existing mechanisms to improve the performance and efficiency of the algorithms, as well as the datapaths and the respective FUs created to implement the algorithms on the platform.


7 Evaluation

Contents
7.1 Hardware Tests
7.2 Functional Unit Design
7.3 Proposed Structures performance
7.4 Summary

63

7.1 Hardware Tests Evaluation

In this Chapter, experimental results for the proposed structures are presented and compared. The results regarding the FUs were obtained using the Xilinx ISE WebPACK Design Suite (v12.4), with the FU designs described in VHDL. For each FU, a test bench was created using known values for the target algorithms in order to validate the correct operation of the FU. The values presented for the proposed designs were obtained after Place and Route with the software default parameters, namely Normal Speed Optimization Effort in Synthesis and High Optimization Effort in Mapping and Place and Route, with no extra effort.

All the results for the total number of clock cycles required by each datapath are provided by the SideWorks simulator (see Section 4.4) included in the Coreworks developer tools.

7.1 Hardware Tests

All the hardware tests were carried out on a Spartan3 xc3s5000 kindly provided by Coreworks. On this FPGA the clock obtained for the SideWorks platform was 80 MHz, which is the value used as a reference. The Spartan3 FPGA was a shared resource among the developers working at Coreworks, with product debugging and testing taking priority. As such, the availability of this resource was scarce and, coupled with the ongoing rewriting of part of the framework at the time, this made testing and debugging the developed solution difficult. Nevertheless, with the help and patience of the Coreworks engineers it was possible to test the FUs that constitute the SHA-1 structure as well as the AES ECB implementation. This included a test bench for each FU present in the developed structures, in which multiple known values obtained from the software implementation of the algorithms were used to verify the correct operation of each FU. This phase was also very important for assessing some of the limitations of the platform, in particular the latency associated with each FU. Initial tests included more complex FUs. Because of their complexity, these FUs presented a high latency, which was a problem because a latency above the clock period (12.5 ns at 80 MHz) means that the FU requires two or more clock cycles to complete the target operation. As such, this line of development was discontinued, and the FUs described in this work all present a low latency relative to the reference clock period of the SideWorks implementation. Of all the FUs developed in this work, the one with the highest latency is ADDX with 8.5 ns, which supports an operating frequency of up to 117 MHz, well above the 80 MHz reference value for the SideWorks platform. This confirms that the ADDX FU requires only one clock cycle to complete its operation properly.
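The relation between FU latency, maximum operating frequency and the number of clock cycles an FU needs can be sketched as follows. This is a minimal illustration in plain C and is not part of the SideWorks tools; the function names and the use of picoseconds as the time unit are assumptions made for the example.

```c
#include <stdint.h>

/* Maximum clock frequency in MHz (truncated) for a combinational
 * latency given in picoseconds: f = 1 / latency. */
uint32_t max_freq_mhz(uint32_t latency_ps)
{
    return 1000000u / latency_ps;
}

/* Clock cycles an FU needs at a given clock: ceil(latency / period). */
uint32_t cycles_needed(uint32_t latency_ps, uint32_t clock_mhz)
{
    uint32_t period_ps = 1000000u / clock_mhz;   /* 80 MHz -> 12500 ps */
    return (latency_ps + period_ps - 1u) / period_ps;
}
```

For the 8.5 ns latency quoted for ADDX, `max_freq_mhz(8500)` yields 117 and `cycles_needed(8500, 80)` yields 1, whereas any latency above 12.5 ns would need at least two cycles at 80 MHz.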

During this phase, many bugs were found and corrected, although the great majority of them were a byproduct of the ongoing rework of the framework taking place at the time of this testing. All the corresponding bugs were checked by a Coreworks engineer but were considered low priority at the time, since they did not affect the main products. By working around some memory issues present at the time, it was possible to validate the SHA-1 and AES ECB implementations. Due to the lack of time and the limited resources, this was the full extent of the hardware tests. All the subsequent results are provided by the SideWorks simulator (see Section 4.4) included in the Coreworks developer tools.


7.2 Functional Unit Design

During the design and development stages, one of the main objectives when adding a new FU was to follow the most generic approach possible, in order to enable the reuse of the same FU among multiple datapaths and so optimize the overall solution. Despite its merits, this approach can be extremely counterproductive. A good example is shown in Table 7.1, using the CXOR FU. This FU receives as one of its inputs an integer that is used as the factor of a rotation executed by the FU, which means that this variable can take any value from 0 to 31. However, a careful analysis of the algorithms using the FU shows that only two values are ever used as input for the factor variable: 1 and 8. Taking this into account, the FU can be greatly improved by reducing the number of values considered for this variable, or by removing the variable altogether.

FU         Latency   Slices   AES    AES area   SHA-3   SHA-3 area
           (ns)                      (slices)           (slices)
original    4.81      248      4       992       40       9920
factor 8    1.676      32      4       128      n.a.       -
factor 1    1.676      32     n.a.      -        40       1280
select      1.844      80      4       320       40       3200

Table 7.1: CXOR alternative implementations features

The first approach was to use a fixed value. This limits the use of each CXOR variant to one of the two algorithms using the FU, but the area required by the FU drops drastically. A compromise can be found by adding a select flag, which lets the developer specify which of the available values for the factor variable is to be used in each datapath call. Note that when the factor value is equal to 1 the FU behaves like a simple five-word XOR. The values presented in Table 7.1 show that the area is reduced by a factor of about three for the select version and about eight for the fixed-value versions. In this Table, the Area field refers to the total number of slices necessary for all the instances of the FU in each implementation of the respective algorithm. The benefits of using a more limited FU stand out, both in the gain in speed and in the saving in occupied area. This comes at the cost of flexibility which, in this case, means removing the ability to shift the inputs by a freely chosen factor value. Of all the available combinations, the one with the smallest area uses the fixed-value versions of the FU for each of the algorithms, because the combined area required by the fixed-value FUs is less than the area required by the select version.
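The area comparison above can be reproduced from the per-FU slice counts and instance counts listed in Table 7.1. The following C sketch is only an illustration: the constant and function names are hypothetical, and it assumes that the two algorithms do not share FU instances.

```c
#include <stdint.h>

/* Slices per CXOR variant, as listed in Table 7.1. */
enum { CXOR_ORIGINAL = 248, CXOR_FIXED = 32, CXOR_SELECT = 80 };

/* Number of CXOR instances used by each algorithm, as listed in Table 7.1. */
enum { AES_INSTANCES = 4, SHA3_INSTANCES = 40 };

/* Area of all instances of one FU variant in one algorithm. */
uint32_t total_area(uint32_t slices_per_fu, uint32_t instances)
{
    return slices_per_fu * instances;
}

/* Area when both algorithms are present and no instance is shared
 * (assumption of this sketch). */
uint32_t combined_area(uint32_t slices_per_fu)
{
    return total_area(slices_per_fu, AES_INSTANCES)
         + total_area(slices_per_fu, SHA3_INSTANCES);
}
```

With these values, `combined_area(CXOR_FIXED)` evaluates to 1408 slices and `combined_area(CXOR_SELECT)` to 3520 slices, which is consistent with the fixed-value versions being the smallest option.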

Following the same process, the ADDX FU and its evolution ADDX2, created to accommodate the 64 bit adder required by the SHA512 structure, are considered. Table 7.2 presents the result of combining both FUs and the respective features. The resulting FU proves to be slower and requires a greater area. However, when all the SHA structures up to SHA512 are taken into account, using only four instances of the select version of the ADDX FU requires 15% less area than using ADDX and ADDX2 in the same situation.

Aside from the differences in speed and area, the most significant difference is that the new FU no longer offers the option to add seven 32 bit values. The seven-word adder present in the original FU was removed in the select version for a small gain in area. Nevertheless, since none of the considered datapaths makes use of this function, replacing the original FU with the select version presents no downside.

FU       Latency   Slices   SHA-1   SHA-1 area   SHA256   SHA256 area   SHA512   SHA512 area
         (ns)                       (slices)              (slices)               (slices)
ADDX      8.436     226       1       226          2        452          n.a.      n.a.
ADDX2     6.569     257      n.a.     128         n.a.       -            4        1028
select    8.727     323       1       323          2        646           4        1296

Table 7.2: ADDX alternative implementations

Another FU evolution is that of XORX and XORX2, which share the same datapaths as ADDX and ADDX2 and likewise arise from the need for a 64 bit operation. The results can be found in Table 7.3. In this case, a select FU that implements both functions presents no benefits: unlike the ADDX case, using this solution for XORX would increase the necessary area.

FU       Latency   Slices   SHA-1   SHA-1 area   SHA256   SHA256 area   SHA512   SHA512 area
         (ns)                       (slices)              (slices)               (slices)
XORX      2.414      79       1        79          2        158          n.a.      n.a.
XORX2     2.386      96      n.a.     n.a.        n.a.       -            2        192
select    2.928     192       1       192          2        384           2        384

Table 7.3: XORX alternative implementations

To fully leverage the capabilities of SideWorks it is extremely important to know how to select FUs and when to reuse them. The previous examples clearly demonstrate why multiple approaches must be considered when designing FUs in order to optimize the final results. Well-defined requirements from the beginning of a project greatly reduce the chance of mistakes like these manifesting in later stages of the project. Nevertheless, thanks to the versatility of the platform, situations like these can be corrected quickly.

7.3 Proposed Structures performance

In this Section, the experimental results for the proposed structures and for their counterparts based on the MB-Lite processor are presented. For each of the structures proposed in this work an MB-Lite version is considered, all of them based on the reference implementations used during the original design, mentioned in Chapters 5 and 6.

The results were obtained using the simulator provided in the Coreworks development tools. For the MB-Lite version, Xilinx ISE WebPACK Design Suite (v12.4) was used to synthesize the design and perform the Post-Place & Route procedures. It is important to note that the considered MB-Lite version has the optional barrel shifter enabled in order to improve the overall speed results.

Table 7.4 presents a performance overview of the proposed structures. The values in the table are derived from the reference clock speeds: a maximum of 80 MHz for the Coreworks platform running on a Spartan3 xc3s5000 FPGA and, for its counterpart, a maximum of 119 MHz for MB-Lite on the same FPGA. These are the reference values used to calculate the throughput, as exemplified in Equation 7.1, where 40 is the number of cycles required per block, also shown in the table.

                             SHA-1   SHA256   SHA512   SHA-3   AES ECB   AES CBC   CLEFIA
Block Size (bits)             160     256      512     1600     128       128       128
Cycles per round               4       5        5        5       4         4         4
Total Nbr Cycles              320     320      400      120      40        40        72
Throughput (Mbps)              40      64      102     1067     256       256       142
Slices                        417     674     1348    32320     416       692       192
MB-Lite Throughput (Mbps)     3.9    2.43     2.83      0.6    0.08      0.06      0.22

Table 7.4: Proposed structures performance summary

It is important to note that the area values only account for the total number of slices required by the FUs used by each of the proposed structures, not the total for the whole SideWorks platform. This is because synthesizing the design and performing the Post-Place & Route procedures for the SideWorks platform together with the developed structures requires access to the SideWorks source files, which are only available to Coreworks engineers. This was only an option during the hardware testing phase, as detailed in Section 7.1.

\[
\mathrm{Throughput}_{\mathrm{AES128}} = \frac{80\,\mathrm{MHz} \times 128\,\mathrm{bits}}{40} = 256\,\mathrm{Mbit/s} \qquad (7.1)
\]
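The throughput calculation of Equation 7.1 can be written as a small C helper to check the values reported in Table 7.4. This is only a sketch; the function name and the rounding to the nearest integer are assumptions made for this example.

```c
#include <stdint.h>

/* Throughput in Mbit/s, rounded to the nearest integer:
 * clock (MHz) x block size (bits) / cycles per block. */
uint32_t throughput_mbps(uint32_t clock_mhz, uint32_t block_bits,
                         uint32_t cycles_per_block)
{
    uint32_t bits_per_us = clock_mhz * block_bits;
    return (bits_per_us + cycles_per_block / 2u) / cycles_per_block;
}
```

For AES, `throughput_mbps(80, 128, 40)` returns 256; the same call with 72 cycles (CLEFIA) returns 142, in line with Table 7.4.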

Analyzing the results, it is possible to determine the main characteristics of the developed implementations and to detect some problems. The first one that stands out is the structure developed for SHA-3. Despite functioning properly in the simulation environment and presenting rather promising speed results, the area required by the FUs alone makes it impossible to implement on a Spartan3, because the total number of available slices in the Spartan3 is 33,280 and the FUs for this structure alone occupy 32,220, before the SideWorks platform itself is accounted for. This means that the area required for this structure is at least twenty-four times greater than that of any of the other developed structures. When selecting the base implementation for this algorithm, speed was favored over low area, which results in a structure with a high throughput but high area requirements.

Table 5.2 confirms that the area required by each of the FUs used in the SHA-3 structure is in line with the values of the FUs in the other implementations; the total area is high because a large number of FUs is needed. This situation can be corrected in future iterations by adding an FU to control the state and timing, making it possible to use a smaller version of the datapath that does not process the whole 1600 bit state simultaneously. The control FU would manage the operations, trading throughput for a reduction in occupied area. On the other hand, this implementation has a throughput 3 to 8 times higher than the remaining structures established for the other versions of the SHA family.

Although all the structures shown are prepared to operate individually, they were developed so as to promote the reuse of FUs. Except for the mentioned SHA-3 implementation, combining the remaining algorithms, especially if they share similar structures, can represent significant gains in area.


Considering a combined implementation of SHA-1, SHA256 and SHA512, the total area needed is 1587 slices, versus 497 (SHA-1) plus 592 (SHA256) plus 1348 (SHA512), totaling 2437 slices, for the sum of the individual structures. For a combined implementation of AES and CLEFIA, the area required by the FUs is 528 slices, versus 416 (AES) plus 192 (CLEFIA), a total of 608 slices, for the two separate structures. This means a saving of about 13% in area for this case and about 35% for the previous case. To achieve a greater reduction in area in the case of AES, the most appropriate solution is to consider a version in which the core function is folded, reusing the structure illustrated in Figure 6.3 but using only two branches of this structure, or even one. To complete this implementation it would be necessary to add a control FU, and the resulting structure would require at least two or four times the current operating time, respectively.

7.4 Summary

This Chapter presents the detailed simulation results of the proposed structures and an overview of how the results were obtained. It also analyzes the results, comments on the most important conclusions drawn from the work and the applied methodology, and suggests some significant improvements to the original design.

8 Conclusions

Cryptographic algorithms can be divided into three classes: public-key algorithms, symmetric key algorithms, and hash functions. While the first two are used to encrypt and decrypt data, hash functions are one-way functions that do not allow the processed data to be retrieved. This work presents the development of structures capable of processing the most common symmetric encryption algorithms and cryptographic hash functions, in hardware/software co-design using the SideWorks framework. The obtained results suggest that this framework and the resulting architecture provide a fast development environment with high computational potential and a unique ability to facilitate the creation of tailor-made solutions. Coupled with the fully functional implementation of multiple algorithms, this gives a good starting point for the development of a commercially viable cryptographic core, supported by the SideWorks framework on a Xilinx Spartan 3 FPGA.

As stated before, the methodology selected to undertake this project may not have been the most appropriate. Still, it provided a unique opportunity to learn from the mistakes made and to obtain valuable experience, while providing the SideWorks framework with some of the building blocks required by many cryptographic algorithms. The SideWorks framework provides a fast-paced development environment with a lot of potential. During the development of this work it was possible to gain insight into the inner workings of the platform and to recognize its great potential for developing cryptographic solutions, as it provides a flexible and fast development environment. At the beginning of development the framework may not seem the most intuitive, but once this initial phase is overcome it proves to be very easy to use and very adaptable. Although the result of this work is not ready to be used as a product, it is considered a success, as it provides Coreworks with the first set of fully operational cryptographic algorithms and respective FUs, as well as a description of the problems encountered during the development of this work.

This work can represent a new field in which the Coreworks platform can thrive.

Future work

Considering the proposed implementation, there is a lot of room for improvement, depending on the direction chosen: either increasing the number of supported modes of operation for the symmetric encryption algorithms, or providing new features such as extending the hash algorithms to support Hash-based Message Authentication Code (HMAC). The current structures can also be improved by reducing the memory needed for each implementation. The most obvious case is the AES implementation, which can be adapted to a folded version; the same can be done for CLEFIA. Both versions can be made available in the SideWorks framework, giving the developer the option to choose the most suitable one based on the speed and area constraints. The CLEFIA implementation is fully prepared to use a folded approach that takes advantage of the merged T-Box to obtain a low-area structure. A similar low-memory structure can be achieved for AES, which can be useful when targeting low-end devices.

However, it may be more feasible to consider a new set or subset of the current algorithms, coupled with the knowledge obtained during this work, to target a real-world application with clear and well-defined requirements, so as to produce a working product able to truly confirm the commercial viability of cryptographic algorithms on the SideWorks architecture. These issues and recommendations are left to future iterations of this work.



A SideWorks Examples

Contents
A.1 SideWorks code example
A.2 SideWorks code example parallel version
A.3 SideWorks FU ROL32.c
A.4 SideWorks SHA-1 Datapath

A.1 SideWorks code example

Listing A.1: Datapath example

//initializations
//Time Units for read and write operations
TU_4_4_11(tu_rd);
TU_4_4_11(tu_wr);

REG_11(loop_end_k);

BAU_11(bau_rd);
BAU_11(bau_wr);

ALU_32(alu_0);
ALU_32(alu_1);

MEM(m0, 11, 32, 1024);
MEM(m1, 11, 32, 1024);
MEM(m2, 11, 32, 1024);

PLACE_MEM(&m0, "0");
PLACE_MEM(&m1, "1");
PLACE_MEM(&m2, "2");

//Datapath description.
SIW_DATAPATH(tu_wr.done, 1) {
    //Done status updated 1 cycle after tu_wr.done signal asserted
    //Loop setup. The state register is used to configure the number of loop iterations
    F_STATE_11(&loop_end_k);
    F_TIME(&tu_rd, loop_end_k.out, ZERO, ZERO, ZERO, 0);

    //addresses for reading
    F_BAU_11(&bau_rd, TWO, ONE, undefined, tu_rd.enable[K], tu_rd.start[K], 0);

    //data read from memories
    F_MEM_PORT_A(&m0, bau_rd.out, undefined, ZERO, 0);
    F_MEM_PORT_A(&m1, bau_rd.out, undefined, ZERO, 0);

    //the final parameter represents the delay; in this case it is one, since the second adder
    //has an offset of one time unit
    F_MEM_PORT_A(&m2, bau_rd.out, undefined, ZERO, 1);

    //vector add operation
    F_ADD_32(&alu_0, m0.data_a, m1.data_a);
    //value from the first adder and memory two
    F_ADD_32(&alu_1, m2.data_a, alu_0.out);

    //addresses for writing
    F_DELAY(&tu_wr, &tu_rd, 3);

    //increment step size, initial value
    F_BAU_11(&bau_wr, ONE, ONE, undefined, tu_wr.enable[K], tu_wr.start[K], 0);

    //data destinations
    F_MEM_PORT_B(&m0, bau_wr.out, alu_1.out, tu_wr.enable[K], 1);

    //Debug print the vector values in the memory used for INPUT
    sideC_print_port(m0.data_a);
    sideC_print_port(m1.data_a);
    sideC_print_port(m2.data_a);

    //Debug print the Output value
    sideC_print_port(m0.data_b);
}

}


A.2 SideWorks code example parallel version

Listing A.2: Next Vector Crunch Datapath, parallel version

//initializations
SIW_CONF ('vec_add') { // Vector Add configuration description

    //Time Units for read and write operations
    TU_4_4_11(tu_rd);
    TU_4_4_11(tu_wr);

    REG_11(loop_end_k);

    BAU_11(bau_rd_even);
    BAU_11(bau_rd_odd);
    BAU_11(bau_wr_even);
    BAU_11(bau_wr_odd);

    ALU_32(alu_0);
    ALU_32(alu_1);
    ALU_32(alu_2);
    ALU_32(alu_3);

    MEM(m0, 11, 32, 1024);
    MEM(m1, 11, 32, 1024);
    MEM(m2, 11, 32, 1024);
    MEM(m3, 11, 32, 1024);

    PLACE_MEM(&m0, "0");
    PLACE_MEM(&m1, "1");
    PLACE_MEM(&m2, "2");
    PLACE_MEM(&m3, "3");

    //Datapath description.
    SIW_DATAPATH(tu_wr.done, 1) {

        //Done status updated 1 cycle after tu_wr.done signal asserted
        //Loop setup. The state register is used to configure the number of loop iterations
        F_STATE_11(&loop_end_k);
        F_TIME(&tu_rd, loop_end_k.out, ZERO, ZERO, ZERO, 0);

        //addresses for reading
        F_BAU_11(&bau_rd_even, TWO, ZERO, undefined, tu_rd.enable[K], tu_rd.start[K], 0);
        F_BAU_11(&bau_rd_odd, TWO, ONE, undefined, tu_rd.enable[K], tu_rd.start[K], 0);

        //data read from memories
        F_MEM_PORT_A(&m0, bau_rd_even.addr, undefined, ZERO, 0);
        F_MEM_PORT_B(&m0, bau_rd_odd.addr, undefined, ZERO, 0);
        F_MEM_PORT_A(&m1, bau_rd_even.addr, undefined, ZERO, 0);
        F_MEM_PORT_B(&m1, bau_rd_odd.addr, undefined, ZERO, 0);
        F_MEM_PORT_A(&m2, bau_rd_even.addr, undefined, ZERO, 0);
        F_MEM_PORT_B(&m2, bau_rd_odd.addr, undefined, ZERO, 0);

        //vector add operation even
        F_ADD_32(&alu_0, m0.data_a, m1.data_a);
        //value from the first adder and memory two
        F_ADD_32(&alu_1, m2.data_a, alu_0.out);

        //vector add operation odd
        F_ADD_32(&alu_2, m0.data_b, m1.data_b);
        //value from the first adder and memory two
        F_ADD_32(&alu_3, m2.data_b, alu_2.out);

        //addresses for writing
        F_TU_4_4_11_DELAY(&tu_wr, &tu_rd, 3);
        F_BAU_11(&bau_wr_even, TWO, ZERO, undefined, tu_wr.enable[K], tu_wr.start[K], 0);
        F_BAU_11(&bau_wr_odd, TWO, ONE, undefined, tu_wr.enable[K], tu_wr.start[K], 0);

        //the final parameter represents the delay, in this case one: the second adder has an offset of one time unit


        F_MEM_PORT_A(&m3, bau_wr_even.out, undefined, ZERO, 1);
        F_MEM_PORT_B(&m3, bau_wr_odd.out, undefined, ZERO, 1);

        //Debug: print the even vector values in the memories used for INPUT
        sideC_print_port(m0.data_a);
        sideC_print_port(m1.data_a);
        sideC_print_port(m2.data_a);

        //Debug: print the output value
        sideC_print_port(m3.data_a);
    }
}

SIW_CONF ('next_vec_crunch') { // Next Vector Crunch configuration description
...
}
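As a plain C reference for what the two vector add datapaths compute, the sketch below adds three input vectors element by element, first one element per step (as in Listing A.1) and then one even and one odd element per step (as in Listing A.2). The function names and the use of ordinary arrays in place of the memories are assumptions made for illustration; time units, address generators and pipeline delays are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequential version: one 32-bit result per iteration. */
void vec_add_seq(const uint32_t *m0, const uint32_t *m1, const uint32_t *m2,
                 uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t alu_0 = m0[i] + m1[i];   /* first adder */
        out[i] = m2[i] + alu_0;           /* second adder */
    }
}

/* Parallel version: the even and the odd element are handled in the same
   iteration, mirroring the two memory ports and the two adder pairs. */
void vec_add_par(const uint32_t *m0, const uint32_t *m1, const uint32_t *m2,
                 uint32_t *out, size_t n)
{
    assert(n % 2 == 0);                   /* assumption: even vector length */
    for (size_t i = 0; i < n; i += 2) {
        uint32_t alu_0 = m0[i] + m1[i];           /* even lane */
        uint32_t alu_2 = m0[i + 1] + m1[i + 1];   /* odd lane */
        out[i]     = m2[i] + alu_0;
        out[i + 1] = m2[i + 1] + alu_2;
    }
}
```

Both functions produce the same output vector; the parallel one simply needs half as many iterations, which is the effect the second listing obtains by using both ports of each memory.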


A.3 SideWorks FU ROL32.c

Listing A.3: SideWorks C description for ROL

#include "sim/sideC_types.h"
#include "sim/sideC_internal.h"
#include "rol_32.h"

#define ROL32(a, offset) ((((unsigned long)a) << (offset)) ^ (((unsigned long)a) >> (32-(offset))))

void sidefu_rol_32_sim(sidefu_rol_32 *fu, sideC_port_t word0, sideC_port_t word1, int line, char *file)
{
    if(sideC_update_combinatorial) return;

    if(sideC_reset) {
        sideC_reset_reg(&(fu->out0), line, file);
        sideC_add_reg(&(fu->out0), line, file);
    }
    else if (sideC_init) {
        fu->private_t.flags = 0;

        fu->out0.width = 32;
        fu->out0.signal.current.x = 0;
        fu->out0.signal.current.value = 0;
        fu->out0.signal.next.x = 0;
        fu->out0.signal.next.value = 0;
    }
    else if(sideC_validate) {
        validate_input_length(1, word0, 32, line, file);
        validate_input_length(2, word1, 12, line, file);
    }
    else {
        fu->out0.signal.next.value = ROL32(word0.signal.current.value,
                                           word1.signal.current.value) & 0xFFFFFFFF;
    }
}

void sidefu_rol_32_xml(sidefu_rol_32 *fu, char *fu_name, char *word0, char *word1, int line, char *file)
{
    if(sideC_init) {
        sideC_xml_print_func("ROL_32", fu_name, "F_ROL_32", line, file);
        sideC_xml_print_char_arg("word0", word0);
        sideC_xml_print_char_arg("word1", word1);
    }
}
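The ROL32 macro of Listing A.3 widens its operand to unsigned long and masks the result to 32 bits afterwards. A minimal standalone sketch of the same rotation on a fixed-width type is given below; the function name rol32 and the restriction of the offset to 0..31 are assumptions of this sketch and are not part of the SideC library.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits. Offsets of 32 or more are treated
   as a caller error here; an offset of 0 is handled separately because
   shifting a 32-bit value by 32 bits is undefined in C. */
uint32_t rol32(uint32_t x, unsigned n)
{
    assert(n < 32u);
    if (n == 0)
        return x;
    return (x << n) | (x >> (32u - n));
}
```

For example, rol32(0x12345678, 8) moves the most significant byte to the least significant position and yields 0x34567812.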


A.4 SideWorks SHA-1 Datapath

Listing A.4: SHA-1 Final datapath

#ifndef SHA1_CORE_H_
#define SHA1_CORE_H_

SIW_CONF("sha1_core") {

    TU_4_4_12(tu_rd);
    TU_4_4_12(tu_wr);

    REG_12(loop_end_k); // number of cycles necessary for one round
    REG_12(loop_end_j); // number of rounds

    REG_32(reg_a); //A
    REG_32(reg_b); //B
    REG_32(reg_c); //C
    REG_32(reg_d); //D
    REG_32(reg_e); //E

    BAU_12(bau_k);
    BAU_12(bau_w);
    BAU_12(bau_aux);

    ROL_32(rol0);
    ROL_32(rol1);

    CNST_11(cnst5);
    CNST_11(cnst30);

    CNST_12(cnst19);

    ADDX_32(addx_0);
    ANDX_32(andx_0);
    XORX_32(xorx_0);

    // Size 80 * 32 bits
    MEM(m0, 12, 32, M0_SIZE);
    // Size 4 * 32 bits
    MEM(m1, 12, 32, M1_SIZE);
    PLACE_MEM(&m0, "0");
    PLACE_MEM(&m1, "1");

    SIW_DATAPATH(tu_wr.done, 0) { //Datapath description.
        //Done status updated 1 cycle after tu_wr.done signal asserted
        // Loop setup. The state register is used to configure the number of loop iterations
        //---------stage0-----------------------------------------------------------------

        //Const 5
        F_CNST_11(&cnst5, 5);

        //Const 30
        F_CNST_11(&cnst30, 30);

        //Cnst 19
        F_CNST_12(&cnst19, 20);

        F_STATE_12(&loop_end_k);
        F_STATE_12(&loop_end_j);

        F_TIME(&tu_rd, loop_end_k.out, loop_end_j.out, ZERO, ZERO, 0);
        F_DELAY(&tu_wr, &tu_rd, 2);

        //----------------stage1-----------------------------------------------------------
        F_BAU_12(&bau_w, ONE, ZERO, cnst19.out, tu_rd.enable[J], tu_rd.start[J], 0);
        F_BAU_12(&bau_aux, ONE, ONE, cnst19.out, tu_rd.enable[J], bau_aux.gt, 0);
        F_BAU_12(&bau_k, ONE, ZERO, TWO, bau_aux.gt, tu_rd.start[J], 0);

        //Read W[]
        F_MEM_PORT_A(&m0, bau_w.out, undefined, ZERO, 0);


        //Read K[]
        F_MEM_PORT_A(&m1, bau_k.out, undefined, ZERO, 0);

        //ROL(30,B)
        F_ROL_L_32(&rol0, reg_b.out, cnst30.out);

        //ROL(5,A)
        F_ROL_L_32(&rol1, reg_a.out, cnst5.out);

        //ANDx(B, C, D)
        F_ANDX_32(&andx_0, reg_b.out, reg_c.out, reg_d.out);

        //XORx(B, C, D)
        F_XORX_32(&xorx_0, reg_b.out, reg_c.out, reg_d.out, bau_k.eq);

        //------------------stage2---------------------------------------------------------
        //S0 K ==2 S1 if W > 20
        F_ADDX_32(&addx_0, rol1.out, reg_e.out, m1.data_a, m0.data_a, xorx_0.out0,
                  andx_0.out0, andx_0.out1, bau_k.eq, bau_w.lt);

        //--------------------stage3-------------------------------------------------------
        //ADDx --> A
        F_REG_WEN_32(&reg_a, addx_0.out0, tu_wr.end[K]);
        //A --> B
        F_REG_WEN_32(&reg_b, reg_a.out, tu_wr.end[K]);
        //ROL(30,B) --> C
        F_REG_WEN_32(&reg_c, rol0.out, tu_wr.end[K]);
        //C --> D
        F_REG_WEN_32(&reg_d, reg_c.out, tu_wr.end[K]);
        //D --> E
        F_REG_WEN_32(&reg_e, reg_d.out, tu_wr.end[K]);

        //---------------------stage4------------------------------------------------------
    }

}
#endif /*SHA1_CORE_H_*/
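For comparison with the datapath, the following is a minimal software sketch of the SHA-1 compression function as specified in FIPS 180-4. The function names sha1_block and rol32 are illustrative and do not belong to the SideWorks code. The mapping noted in the comments is an interpretation of Listing A.4: memory m0 appears to hold the 80 expanded message words W[t], memory m1 the four round constants K, and the five-operand sum corresponds to the ADDX unit.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rol32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32u - n));
}

/* Process one 64-byte block and update the five-word state h[0..4]. */
void sha1_block(uint32_t h[5], const uint8_t block[64])
{
    static const uint32_t K[4] = {
        0x5A827999u, 0x6ED9EBA1u, 0x8F1BBCDCu, 0xCA62C1D6u
    };
    uint32_t w[80];

    /* Message schedule W[0..79], big-endian words from the block. */
    for (int t = 0; t < 16; t++)
        w[t] = ((uint32_t)block[4 * t] << 24) | ((uint32_t)block[4 * t + 1] << 16) |
               ((uint32_t)block[4 * t + 2] << 8) | (uint32_t)block[4 * t + 3];
    for (int t = 16; t < 80; t++)
        w[t] = rol32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1);

    uint32_t a = h[0], b = h[1], c = h[2], d = h[3], e = h[4];

    for (int t = 0; t < 80; t++) {
        uint32_t f;
        switch (t / 20) {
        case 0:  f = (b & c) | (~b & d);          break;  /* Ch */
        case 2:  f = (b & c) | (b & d) | (c & d); break;  /* Maj */
        default: f = b ^ c ^ d;                   break;  /* Parity */
        }
        /* Five-operand addition, followed by the register shift of stage 3. */
        uint32_t temp = rol32(a, 5) + e + K[t / 20] + w[t] + f;
        e = d;
        d = c;
        c = rol32(b, 30);
        b = a;
        a = temp;
    }

    h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}
```

Starting from the standard initial state and the single padded block of the message "abc", the function yields the well-known digest a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d.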


B SideWorks Functional Units

Contents
B.1 ADDX
B.2 ADDX2
B.3 ANDX
B.4 CHI
B.5 CXOR
B.6 FXOR
B.7 FXOR2
B.8 ROL
B.9 SBSHIFT
B.10 SXOR
B.11 SXOR2
B.12 XORCL
B.13 XORR
B.14 XORX
B.15 XORX2


This appendix describes the SideWorks objects and object functions that this work makes available in the SideC functional unit library.

B.1 ADDX

Figure B.1: ADDX FU Diagram

Properties:
Name:                              Value:
Throughput [samples/time slots]    1
Delay [time slots]                 1
Delay [ns]                         8.436
Slice LUTs                         226
Used in                            SHA1, SHA256

Interface:
Port:      Direction:   Description:
select0    IN           Select Flag 0
select1    IN           Select Flag 1
word0      IN[31:0]     Input operand 0
word1      IN[31:0]     Input operand 1
word2      IN[31:0]     Input operand 2
word3      IN[31:0]     Input operand 3
word4      IN[31:0]     Input operand 4
word5      IN[31:0]     Input operand 5
word6      IN[31:0]     Input operand 6
out        OUT[31:0]    ADDX data output

Functions:
Name:                                        Description:
F_ADDX_<N>(fu, word0-6, select0, select1)    select0 and select1 signals of ADDX,
                                             {select0, select1}:
                                             "00"  out = w0 + w1 + w2 + w3 + w4
                                             "01"  out = w0 + w1 + w2 + w3 + w5
                                             "10"  out = w0 + w1 + w2 + w3 + w6
                                             "11"  out = w0 + w1 + w2 + w3 + w4 + w5 + w6

Table B.1: ADDX Properties
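The selection behaviour of ADDX can be written as a small C model. This is a behavioural sketch only: the function name addx32, the array argument and the packing of {select0, select1} into one integer (select0 as the high bit) are assumptions made for illustration, and all sums wrap modulo 2^32 as the 32-bit output implies.

```c
#include <assert.h>
#include <stdint.h>

/* sel packs {select0, select1} into two bits, select0 being the high bit. */
uint32_t addx32(const uint32_t w[7], unsigned sel)
{
    assert(sel <= 3u);
    uint32_t base = w[0] + w[1] + w[2] + w[3];   /* common to all modes */
    switch (sel) {
    case 0:  return base + w[4];                 /* "00" */
    case 1:  return base + w[5];                 /* "01" */
    case 2:  return base + w[6];                 /* "10" */
    default: return base + w[4] + w[5] + w[6];   /* "11" */
    }
}
```

In the SHA-1 datapath of Listing A.4, word4 to word6 carry the three candidate round functions, so the select flags choose which of them enters the round sum.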


B.2 ADDX2

Figure B.2: ADDX2 FU Diagram

Properties:
Name:                              Value:
Throughput [samples/time slots]    1
Delay [time slots]                 1
Delay [ns]                         6.904
Slice LUTs                         257
Used in                            SHA512

Interface:
Port:      Direction:   Description:
select0    IN           Select Flag 0
select1    IN           Select Flag 1
word0      IN[31:0]     Input operand 0
word1      IN[31:0]     Input operand 1
word2      IN[31:0]     Input operand 2
word3      IN[31:0]     Input operand 3
word4      IN[31:0]     Input operand 4
word5      IN[31:0]     Input operand 5
word6      IN[31:0]     Input operand 6
out0       OUT[31:0]    ADDX2 data output
out1       OUT[31:0]    ADDX2 data output

Functions:
Name:                                         Description:
F_ADDX2_<N>(fu, word0-6, select0, select1)    select0 and select1 signals of ADDX2,
                                              {select0, select1}:
                                              "00"  out = w0 + w1 + w2 + w3 + w4
                                              "01"  out = w0 + w1 + w2 + w3 + w5
                                              "10"  out = w0 + w1 + w2 + w3 + w6
                                              "11"  out0 = ({w0, w1} + {w2, w3} + {w4, w5})[63:32]
                                                    out1 = ({w0, w1} + {w2, w3} + {w4, w5})[31:0]

Table B.2: ADDX2 Properties


B.3 ANDX

Figure B.3: ANDX Diagram

Properties:
Name:                              Value:
Throughput [samples/time slots]    1
Delay [time slots]                 1
Delay [ns]                         1.599
Slice LUTs                         32
Used in                            SHA1, SHA256, SHA512

Interface:
Port:      Direction:   Description:
word0      IN[31:0]     Input operand 0
word1      IN[31:0]     Input operand 1
word2      IN[31:0]     Input operand 2
out0       OUT[31:0]    Output 0
out1       OUT[31:0]    Output 1
out2       OUT[31:0]    Output 2
out3       OUT[31:0]    Output 3

Functions:
Name:                       Description:
F_ANDX_<N>(fu, word0-2)     out0 = (w0&w1) | (!w1&w2)
                            out1 = (w0&w1) | (w1&w2) | (w0&w2)
                            out2 = (w0&w1) ^ (!w1&w2)
                            out3 = (w0&w1) ^ (w1&w2) ^ (w0&w2)

Table B.3: ANDX Properties


B.4 CHI

Figure B.4: CHI Diagram

Interface:
Port:      Direction:   Description:
word0      IN[31:0]     Input operand 0
word1      IN[31:0]     Input operand 1
word2      IN[31:0]     Input operand 2
word3      IN[31:0]     Input operand 3
out        OUT[31:0]    Output

Functions:
Name:                      Description:
F_CHI_<N>(fu, word0-3)     out = w0 ^ (!w1&w2) ^ w3

Table B.4: CHI Properties
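A behavioural sketch of the CHI function in C is given below, reading "!" as a bitwise complement. The helper chi_row is a hypothetical usage example and not part of the library: it assumes the unit is applied along a row of five lanes in the manner of the Keccak chi step, with word3 tied to zero.

```c
#include <assert.h>
#include <stdint.h>

/* One CHI evaluation as listed in Table B.4. */
uint32_t chi32(uint32_t w0, uint32_t w1, uint32_t w2, uint32_t w3)
{
    return w0 ^ (~w1 & w2) ^ w3;
}

/* Hypothetical use: chi over a row of five lanes, with w3 tied to zero. */
void chi_row(uint32_t out[5], const uint32_t in[5])
{
    assert(out != in);   /* every output lane reads two other input lanes */
    for (int x = 0; x < 5; x++)
        out[x] = chi32(in[x], in[(x + 1) % 5], in[(x + 2) % 5], 0u);
}
```

The separate output buffer matters: writing the results back into the input row while iterating would feed already updated lanes into later evaluations.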


B.5 CXOR

Figure B.5: CXOR Diagram

Properties:
Name:                              Value:
Throughput [samples/time slots]    1
Delay [time slots]                 1
Delay [ns]                         4.821
Slice LUTs                         248
Used in                            AES-CBC, AES-ECB, SHA3

Interface:
Port:      Direction:   Description:
decrypt    IN           Decrypt flag
factor     IN[10:0]     Factor
word0      IN[31:0]     Input operand 0
word1      IN[31:0]     Input operand 1
word2      IN[31:0]     Input operand 2
word3      IN[31:0]     Input operand 3
word4      IN[31:0]     Input operand 4
out        OUT[31:0]    CXOR data output

Functions:
Name:                                       Description:
F_CXOR_<N>(fu, word0-4, factor, decrypt)    factor (f) and decrypt signals of CXOR,
                                            {decrypt}:
                                            "0"  out = w0 ^ ROL(w1, f) ^ ROL(w2, f*2) ^ ROL(w3, f*3) ^ ROL(w4, f*4)
                                            "1"  out = w0 ^ ROL(w1, f*3) ^ ROL(w2, f*2) ^ ROL(w3, f) ^ ROL(w4, f*4)

Table B.5: CXOR Properties
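The two CXOR modes differ only in the rotation applied to word1 and word3, which the following C sketch makes explicit. The names cxor32 and rol32 are illustrative, and the reduction of rotation amounts modulo 32 is an assumption of this sketch, since the table does not say how amounts of 32 or more are treated.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rol32(uint32_t x, unsigned n)
{
    n &= 31u;   /* assumption: rotation amounts wrap modulo 32 */
    return n ? (x << n) | (x >> (32u - n)) : x;
}

uint32_t cxor32(const uint32_t w[5], unsigned f, int decrypt)
{
    assert(decrypt == 0 || decrypt == 1);
    unsigned r1 = decrypt ? 3u * f : f;   /* rotation applied to w1 */
    unsigned r3 = decrypt ? f : 3u * f;   /* rotation applied to w3 */
    return w[0] ^ rol32(w[1], r1) ^ rol32(w[2], 2u * f)
                ^ rol32(w[3], r3) ^ rol32(w[4], 4u * f);
}
```

With f = 8 the rotations are whole bytes, so each of word1 to word4 contributes its bytes to a different byte position of the result.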


B.6 FXOR

Figure B.6: FXOR Diagram

Properties:
Name:                              Value:
Throughput [samples/time slots]    1
Delay [time slots]                 1
Delay [ns]                         1.676
Slice LUTs                         32
Used in                            AES-ECB

Interface:
Port:      Direction:   Description:
select0    IN           Select Flag 0
word0      IN[31:0]     Input to be XORed
word1      IN[31:0]     Input to be shifted and XORed
word2      IN[31:0]     Input to be shifted and XORed
word3      IN[31:0]     Input to be shifted and XORed
word4      IN[31:0]     Input to be shifted and XORed
out        OUT[31:0]    Shifted and XORed output

Functions:
Name:                       Description:
F_FXOR_<N>(fu, word0-4)     select0 signal for FXOR:
                            "0"  out = (w0_0 ^ w1_0 ^ w2_0 ^ w3_0 ^ w4_0)[7:0]
                                       (w0_1 ^ w1_1 ^ w2_1 ^ w3_1 ^ w4_1)[15:8]
                                       (w0_2 ^ w1_2 ^ w2_2 ^ w3_2 ^ w4_2)[23:16]
                                       (w0_3 ^ w1_3 ^ w2_3 ^ w3_3 ^ w4_3)[31:24]
                            "1"  out = (w0_0 ^ w1_0 ^ w2_0 ^ w3_0 ^ w4_0)[7:0]
                                       (w0_1 ^ w1_3 ^ w2_3 ^ w3_3 ^ w4_3)[15:8]
                                       (w0_2 ^ w1_2 ^ w2_2 ^ w3_2 ^ w4_2)[23:16]
                                       (w0_3 ^ w1_1 ^ w2_1 ^ w3_1 ^ w4_1)[31:24]

Table B.6: FXOR Properties
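The byte selection of FXOR can be sketched in C as follows. The sketch rests on an interpretation of the table that is not confirmed elsewhere in the text: the second index of each operand is taken to be a byte number (0 for the least significant byte) and the bracketed range is taken to be the byte lane of the output that receives the XOR result. The names fxor32 and byte_of are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Byte j (0 = least significant) of a 32-bit word. */
static uint32_t byte_of(uint32_t w, unsigned j)
{
    assert(j < 4u);
    return (w >> (8u * j)) & 0xFFu;
}

uint32_t fxor32(const uint32_t w[5], int select0)
{
    uint32_t out = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        /* With select0 set, lanes 1 and 3 take bytes 3 and 1 of w1..w4. */
        unsigned src = lane;
        if (select0 && lane == 1)
            src = 3;
        else if (select0 && lane == 3)
            src = 1;

        uint32_t b = byte_of(w[0], lane);   /* w0 is never shifted */
        for (int i = 1; i < 5; i++)
            b ^= byte_of(w[i], src);
        out |= b << (8u * lane);
    }
    return out;
}
```

Under this reading, select0 only exchanges the source bytes of lanes 1 and 3 for word1 to word4, while word0 always contributes the byte of the lane being computed.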


B.7 FXOR2

Figure B.7: FXOR2 Diagram

Properties:
Name:                              Value:
Throughput [samples/time slots]    1
Delay [time slots]                 1
Delay [ns]                         2.600
Slice LUTs                         80
Used in                            AES-CBC

Interface:
Port:      Direction:   Description:
select0    IN           Select Flag 0
select1    IN           Select Flag 1
word0      IN[31:0]     Input to be XORed
word1      IN[31:0]     Input to be shifted and XORed
word2      IN[31:0]     Input to be shifted and XORed
word3      IN[31:0]     Input to be shifted and XORed
word4      IN[31:0]     Input to be shifted and XORed
word5      IN[31:0]     Input to be shifted and XORed
out        OUT[31:0]    Shifted and XORed output

Functions:
Name:                        Description:
F_FXOR2_<N>(fu, word0-5)     select0 signal for FXOR2:
                             "0"  out = (w0_0 ^ w1_0 ^ w2_0 ^ w3_0 ^ w4_0)[7:0]
                                        (w0_1 ^ w1_1 ^ w2_1 ^ w3_1 ^ w4_1)[15:8]
                                        (w0_2 ^ w1_2 ^ w2_2 ^ w3_2 ^ w4_2)[23:16]
                                        (w0_3 ^ w1_3 ^ w2_3 ^ w3_3 ^ w4_3)[31:24]
                             "1"  out = (w0_0 ^ w1_0 ^ w2_0 ^ w3_0 ^ w4_0)[7:0]
                                        (w0_1 ^ w1_3 ^ w2_3 ^ w3_3 ^ w4_3)[15:8]
                                        (w0_2 ^ w1_2 ^ w2_2 ^ w3_2 ^ w4_2)[23:16]
                                        (w0_3 ^ w1_1 ^ w2_1 ^ w3_1 ^ w4_1)[31:24]
                             If select1 = 1:
                                  out = out ^ word5

Table B.7: FXOR2 Properties


B.8 ROL

Figure B.8: ROL Diagram

Interface:
Port:      Direction:   Description:
word0      IN[31:0]     Input operand 0
word1      IN[31:0]     Input operand 1
out        OUT[31:0]    ROL data output

Functions:
Name:                           Description:
F_ROL_<N>(fu, word0, word1)     out = ROL(word0, word1)
                                (Rotate word0 left by word1 bits, as in Listing A.3)

Table B.8: ROL Properties


B.9 SBSHIFT

Figure B.9: SBSHIFT Diagram

Figure B.10: SBSHIFT Properties


B.10 SXOR

Figure B.11: SXOR FU Diagram

Properties:
Name:                              Value:
Throughput [samples/time slots]    1
Delay [time slots]                 1
Delay [ns]                         2.401
Slice LUTs                         40
Used in                            AES-ECB, CLEFIA

Interface:
Port:      Direction:   Description:
select     IN           0 = first round
decrypt    IN           Input to select decrypt mode (1 = enable)
word0      IN[31:0]     Input to be XORed
word1      IN[31:0]     Input to be XORed
word2      IN[31:0]     Input to be XORed
out0       OUT[11:0]    XOR data output
out1       OUT[11:0]    XOR data output
out2       OUT[11:0]    XOR data output
out3       OUT[11:0]    XOR data output

Functions:
Name:                                       Description:
F_SXOR_<N>(fu, word0-2, select, decrypt)    select and decrypt signals of SXOR,
                                            {select, decrypt}:
                                            "0X"  out0 = (w0 ^ w1)[7:0]     out2 = (w0 ^ w1)[23:16]
                                            "00"  out1 = (w0 ^ w1)[15:8]    out3 = (w0 ^ w1)[31:24]
                                            "01"  out1 = (w0 ^ w1)[31:24]   out3 = (w0 ^ w1)[15:8]
                                            "1X"  out0 = w2[7:0]            out2 = w2[23:16]
                                            "10"  out1 = w2[15:8]           out3 = w2[31:24]
                                            "11"  out1 = w2[31:24]          out3 = w2[15:8]

Table B.9: SXOR Properties
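The six table rows reduce to two independent choices, which the following C sketch spells out: select picks the source word and decrypt exchanges bytes 1 and 3. The function name sxor32 and the modelling of the 12-bit output ports as uint16_t values with the upper four bits left at zero are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* out[0..3] receive the four bytes of the selected source word. */
void sxor32(uint16_t out[4], uint32_t w0, uint32_t w1, uint32_t w2,
            int select, int decrypt)
{
    assert((select == 0 || select == 1) && (decrypt == 0 || decrypt == 1));
    uint32_t v = select ? w2 : (w0 ^ w1);          /* source word */
    uint16_t b1 = (uint16_t)((v >> 8) & 0xFFu);
    uint16_t b3 = (uint16_t)((v >> 24) & 0xFFu);

    out[0] = (uint16_t)(v & 0xFFu);
    out[2] = (uint16_t)((v >> 16) & 0xFFu);
    out[1] = decrypt ? b3 : b1;                    /* decrypt swaps bytes 1 and 3 */
    out[3] = decrypt ? b1 : b3;
}
```

Bytes 0 and 2 are routed identically in both directions, which is why the table lists them only once per value of select.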


B.11 SXOR2

Figure B.12: SXOR2 Diagram

Properties:
Name:                              Value:
Throughput [samples/time slots]    1
Delay [time slots]                 1
Delay [ns]                         3.218
Slice LUTs                         61
Used in                            AES-CBC

Interface:
Port:      Direction:   Description:
select0    IN           Select first flag
select1    IN           Input to select decrypt
select2    IN           Input to select flag
select3    IN           Input to select flag
word0      IN[31:0]     Input to be XORed
word1      IN[31:0]     Input to be XORed
word2      IN[31:0]     Input to be XORed
word3      IN[31:0]     Input to be XORed
out0       OUT[11:0]    XOR data output
out1       OUT[11:0]    XOR data output
out2       OUT[11:0]    XOR data output
out3       OUT[11:0]    XOR data output

Functions:
Name:                                   Description:
F_SXOR2_<N>(fu, word0-3, select0-3)     select signals of SXOR2,
                                        {select0, select1, select2, select3}:
                                        "0000"  out0 = (w0 ^ w2)[7:0]          out1 = (w0 ^ w2)[15:8]
                                                out2 = (w0 ^ w2)[23:16]        out3 = (w0 ^ w2)[31:24]
                                        "01XX"  out0 = (w0 ^ w2)[7:0]          out1 = (w0 ^ w2)[31:24]
                                                out2 = (w0 ^ w2)[23:16]        out3 = (w0 ^ w2)[15:8]
                                        "10XX"  out0 = w1[7:0]                 out1 = w1[15:8]
                                                out2 = w1[23:16]               out3 = w1[31:24]
                                        "11XX"  out0 = w1[7:0]                 out1 = w1[31:24]
                                                out2 = w1[23:16]               out3 = w1[15:8]
                                        "0010"  out0 = (w0 ^ w2 ^ w3)[7:0]     out1 = (w0 ^ w2 ^ w3)[15:8]
                                                out2 = (w0 ^ w2 ^ w3)[23:16]   out3 = (w0 ^ w2 ^ w3)[31:24]
                                        "0011"  out0 = (w0 ^ w2 ^ w3)[7:0]     out1 = (w0 ^ w2 ^ w3)[31:24]
                                                out2 = (w0 ^ w2 ^ w3)[23:16]   out3 = (w0 ^ w2 ^ w3)[15:8]

Table B.10: SXOR2 Properties


B.12 XORCL

Figure B.13: XORCL Diagram

Properties:
Name:                              Value:
Throughput [samples/time slots]    1
Delay [time slots]                 1
Delay [ns]                         2.461
Slice LUTs                         64
Used in                            CLEFIA

Interface:
Port:      Direction:   Description:
select0    IN           Select Flag 0
select1    IN           Select Flag 1
selectf    IN           Select function
word0      IN[31:0]     Input operand 0
word1      IN[31:0]     Input operand 1
word2      IN[31:0]     Input operand 2
word3      IN[31:0]     Input operand 3
word4      IN[31:0]     Input operand 4
word5      IN[31:0]     Input operand 5
out        OUT[31:0]    XORCL data output

Functions:
Name:                                                   Description:
F_XORCL_<N>(fu, word0-5, select0, select1, selectf)     select0 and select1 signals of XORCL,
                                                        {select0, select1}:
                                                        "00"  similar to the CXOR equivalent mixing,
                                                              plus two inputs and an XOR

Table B.11: XORCL Properties


B.13 XORR

Figure B.14: XORR Diagram

Interface:
Port:      Direction:   Description:
word0      IN[31:0]     Input operand 0
word1      IN[31:0]     Input operand 1
word2      IN[31:0]     Input operand 2
out        OUT[31:0]    Output

Functions:
Name:                         Description:
F_XORR_R_<N>(fu, word0-2)     out = ROL((w0 ^ w1), w2)
F_XORR_L_<N>(fu, word0-2)     out = ROL((w0 ^ w1), w2)

Table B.12: XORR Properties


B.14 XORX

Figure B.15: XORX Diagram

Properties:
Name:                              Value:
Throughput [samples/time slots]    1
Delay [time slots]                 1
Delay [ns]                         2.414
Slice LUTs                         79
Used in                            SHA1, SHA256

Interface:
Port:      Direction:   Description:
select0    IN           Select Flag 0
word0      IN[31:0]     Input operand 0
word1      IN[31:0]     Input operand 1
word2      IN[31:0]     Input operand 2
out0       OUT[31:0]    Output 0
out1       OUT[31:0]    Output 1

Functions:
Name:                                Description:
F_XORX_<N>(fu, word0-2, select0)     out0 = w0 ^ w1 ^ w2
                                     select0 signal for XORX:
                                     "0"  out1 = (w0 >> 2) ^ (w0 >> 13) ^ (w0 >> 22)
                                     "1"  out1 = (w0 >> 6) ^ (w0 >> 11) ^ (w0 >> 25)

Table B.13: XORX Properties
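The shift amounts listed for out1 coincide with those of the SHA-256 functions Sigma0 (2, 13, 22) and Sigma1 (6, 11, 25) in FIPS 180-4, which use rotations to the right. The C sketch below therefore reads ">>" as a right rotation; this reading, like the names xorx32 and ror32, is an assumption of the sketch. If plain logical shifts were intended, ror32 would have to be replaced accordingly.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t ror32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32u - n));
}

void xorx32(uint32_t *out0, uint32_t *out1,
            uint32_t w0, uint32_t w1, uint32_t w2, int select0)
{
    *out0 = w0 ^ w1 ^ w2;                                    /* three-input parity */
    *out1 = select0
          ? ror32(w0, 6) ^ ror32(w0, 11) ^ ror32(w0, 25)     /* "1" */
          : ror32(w0, 2) ^ ror32(w0, 13) ^ ror32(w0, 22);    /* "0" */
}
```

out0 is the parity function used in the SHA-1 rounds, so a single unit can serve both hash functions.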


B.15 XORX2

Figure B.16: XORX2 Diagram

Interface:
Port:      Direction:   Description:
select0    IN           Select Flag 0
word0      IN[31:0]     Input operand 0
word1      IN[31:0]     Input operand 1
word2      IN[31:0]     Shift value
out0       OUT[31:0]    Output 0
out1       OUT[31:0]    Output 1

Functions:
Name:                                 Description:
F_XORX2_<N>(fu, word0-2, select0)     out0 = w0 ^ w1 ^ w2
                                      select0 signal for XORX2:
                                      "0"  out1 = ({w0, w1} >> 28) ^ ({w0, w1} >> 34) ^ ({w0, w1} >> 39)
                                      "1"  out1 = ({w0, w1} >> 14) ^ ({w0, w1} >> 18) ^ ({w0, w1} >> 41)

Table B.14: XORX2 Properties
