RTL Implementations and FPGA Benchmarking of Three...

30
William Diehl and Kris Gaj ECE Department, George Mason University, Fairfax, Virginia, USA http://cryptography.gmu.edu RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers Competing in CAESAR Round Two Based on work supported by the National Science Foundation under Grant No. 1314540

Transcript of RTL Implementations and FPGA Benchmarking of Three...

Page 1: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

William Diehl and Kris Gaj

ECE Department, George Mason University, Fairfax, Virginia, USA

http://cryptography.gmu.edu

RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers Competing in

CAESAR Round Two

Based on work supported by the National Science Foundation under

Grant No. 1314540

Page 2: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Outline

• CAESAR Competition and Authenticated Ciphers

• CAESAR Hardware API & Compliant Code Development

• Discussion of Designs

• Results

• Summary, Conclusions, and Lessons Learned

2

Page 3: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Cryptographic Standard Contests

time

97 98 99 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17

AES

NESSIE

CRYPTREC

eSTREAM

SHA-3

34 stream 4 HW winnersciphers → + 4 SW winners

51 hash functions → 1 winner

15 block ciphers → 1 winner

IX.1997 X.2000

I.2000 XII.2002

IV.2008

X.2007 X.2012

XI.2004

CAESAR

I.2013

57 authenticated ciphers → multiple winners

XII.2017

3

Page 4: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Authenticated Ciphers

Combine the functionality of confidentiality, integrity, and authenticity

Notation: Npub = Public Message Number; (Enc) Nsec = (Encrypted) Secret Message Number; AD = Associated Data

4

Page 5: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Evaluation Criteria

Security

Software Efficiency Hardware Efficiency

Simplicity

FPGAs ASICs

Flexibility Licensing

µProcessors µControllers

5

Page 6: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Motivation for Universal API

• Hardware API can have a high influence on Area and Throughput/Area ratio of all

implementations

• Hardware API typically much more difficult to modify than Software API

• Without a comprehensive hardware API, the comparison highly unreliable and

potentially unfair

• Designers can “play to strengths” and “hide weaknesses”

Conclusion: Impossible to perform fair evaluation of hardware

implementations without standardized interface and protocol

6

Page 7: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Specifies:

• Minimum compliance criteria

• Interface

• Communication protocol

• Timing characteristics

Assures:

• Compatibility

• Fairness

Timeline:

• Based on the GMU Hardware API presented at CryptArchi 2015,

DIAC 2015, and ReConFig 2015

• Revised version posted on Feb. 15, 2016

• Officially approved by the CAESAR Committee on May 6, 2016

CAESAR Hardware API

7

Page 8: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

General Interface and Internal Architecture for High-Speed Implementations

Additional detail available at https://cryptography.gmu.edu/athena/8

Page 9: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Implementer’s Guide• v1.0 - May 12, 2016

Development Packagea. VHDL code of generic pre-processing and post- processing units

for high-speed implementations (src_rtl)

b. Universal testbench (AEAD_TB)

c. Python app used to automatically generate test vectors

(aeadtvgen)

d. Six reference high-speed implementations of Dummy authenticated ciphers

GMU Support for Designers of VHDL/Verilog Code

https://cryptography.gmu.edu/athena/index.php?id=download

9

Page 10: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

ManualDesign

HDL Code

Automated Optimization

FPGA Tools

Preliminary Post

Place & Route

Results

(Resource Utilization,

Max. Clock Frequency)

Functional Verification

Specification

Test Vectors

The API Compliant Code Development

Reference

C Code

Development

Package

src_rtl

Development

Package

aeadtvgen

Development

Package

AEAD_TB

Pass/

Fail

Formulas

for the

Execution Time

& Throughput

10

Page 11: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Ciphers• SCREAM, POET, Minalpher

• OMD (Not presented)

Compliance• Round Two published specification

• C Reference Code

• CAESAR HW API

Optimization Criteria1. Throughput-to-area (TP/A) ratio

2. Throughput (Maximize frequency; minimize cycles/block)

3. Area (Minimize LUTs)

My Tasks

11

Page 12: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

SCREAM

Side-Channel Resistant Authenticated Encryption with Masking

• Based on Liskov, Rivest, and Wagner’s “Tweakable Block Cipher”

• Unique tweak for every block

• 128-bit key, state variable, tag

• Padding for Associated Data

12

Page 13: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

SCREAM (cont’d)

Cryptographic primitive is Tweakable Block Cipher

(TBC) (EK)

10 steps (σ) per block, 2 rounds (ρ) per step• Tweak updated once per step to form Tweak key (TK)

Non-linear substitution layer composed of 8x8

“nearly involute” S-Boxes

16-bit Round Constant, RC(ρ,σ) applied each

round

Linear permutation layer L-Box

13Bus widths are 128 bits unless indicated

Page 14: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Basic Iterative versus Unrolled Architectures

14

Basic Iterative Unrolled x2

Typically TP/A ratio decreases

Source: “Throughput vs. Area trade-offs in High-speed Architectures of Five Round 3 SHA-3

Candidates Implemented Using Xilinx and Altera FPGAs,” Homsirikamol, Rogawski, Gaj, 2011

Page 15: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

SCREAM – CipherCore Datapath

Notation:

Auth = Authenticator

Sum = Checksum

Trunc = Truncation

T0 = Initial Tweak

EK = Tweakable Block Cipher

npub = Public Message Number

exp_tag = expected tag

Interesting Features:

• Initial Tweak = f(npub, counter, type &

length of block)

• Tag = f(Auth, EK[Sum])

• Truncation of output of partial final

plaintext and ciphertext blocks

15Bus width of thick wires is 128 bits unless indicated. Bus width of thin wires is 1 bit.

Page 16: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

POET

• Pipelineable On-line Encryption with Authentication Tag

• “Cipher Agnostic” – uses any 128-bit block cipher and ε-AXU keyed hash function

• AES-128 used for block cipher, and AES-4 used for keyed hash

• 128-bit key, state variable, tag

• Padding for Associated Data

16

Page 17: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

POET – CipherCore Datapath

Interesting Features:

• Requires three sub keys (L, K, Kf) and

round key generation in AES

• L sub key multiplied by 2 in GF(2128)

during header processing

• Variable shifts for tag generation and

verificationAESAES4

con

st

din

L 2i

key

key

bdo

tag

key

decrypt

dout dout

din

dout

L

K

S

Σ τ

Kf

X

τ

S

Y

τ

|M|

x2

bdi

Ft

>>distance

<<

= Trunc

Τα

last

4

34

Trunc

S

Τ tag_valid

βΤ’ τ

ατ'

τ

17

Notation:

x2 = GF(2128) x2 field multiplier

τ = Authenticator

Σ = Associated Data cumulative sum

>> (<<) = Variable right (left) shift

|M| = length of message

S = EK(|M|) Bus width of thick wires is 128 bits unless indicated. Bus width of thin wires is 1 bit.

Page 18: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Minalpher

• Uses Tweakable Even-Mansour (TEM) with Minalpher-P primitive

• 2 TEM cores for encrypt/decrypt and tag generation in parallel

• 128-bit key, 256-bit state, 128-bit tag

• Each final plaintext block must have *10 padding even if full

18

Page 19: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Minalpher (cont’d)

Each Minalpher-P consists of

• S (SubNibbles) 4x4 S-Boxes,

• T (ShuffleRows),

• E (Round Constant),

• M (MixColumns)

• Decryption has reversed ‘E’ and ‘M’

17.5 rounds of Minalpher-P in one TEM

• Final “half round” is an extra S and T

function

19

Bus widths of paths A and B are 128 bits.

Page 20: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Minalpher – CipherCore Datapath

Interesting Features:

• Parallel encrypt/decrypt and tag

generation using 2 TEM cores requires

19 cycles/block

• “Serial truncator” to remove final *10

during decryption

20

Notation:

TEM = Tweakable Even Mansour

A = Associated Data register

M = Plaintext/Ciphertext Register (TEM)

C = Plaintext/Ciphertext Register (TEM aux)

L = Block specific mask

T = Cumulative Tag generation

npub = Public Message Number

exp_tag = expected tag Bus width of thick wires is 128 bits unless indicated. Bus width of thin wires is 1 bit.

Page 21: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

FPGA Devices

• Xilinx Virtex-6: xc6vlx240tff1156-3

FPGA Devices & Tools

FPGA Tools:

Synthesis Tool: Xilinx XST 14.7

Implementation Tool: Xilinx ISE 14.7

Automated Optimization: ATHENa

Options of tools:

No embedded memories and no embedded DSP units allowed inside of AEAD

21

Page 22: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

0.506

0.497

0.478

0.385

SCREAM has highest

Throughput-to-Area (TP/A) Ratio• Basic iterative (=1 round/clock cycle) higher

TP/A than Unrolled x2

POET has high TP but large area• Several AES cores required

Minalpher close to SCREAM in TP/A

TP

(M

bp

s)

Area (LUTs)

SCREAM (Basic Iterative)

SCREAM (Unrolled x2)

MinalpherPOET

Results of GMU Implementations – Virtex 6

22

Page 23: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Implementations by Other Groups

SCREAM (by Lubos Gaspar & Stephanie Kerckhof, CG UCL, INRIA)

• Full-block width custom interface

• No support for the CAESAR API Protocol

POET (by Amir Moradi, EmSec Rühr-Universität Bochum)

• Full-block width custom interface

• No support for the CAESAR API Protocol

Minalpher (by Takeshi Sugawara, Minalpher Team/Mitsubishi Electric)

• Fully compliant with the CAESAR API

23

Page 24: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Differences in API Specifications

SCREAM• Contains 5-bit “length” input and output fields

(Allows for partial blocks)

POET• No input for length of final block (key parameter in

tag generation and verification)

• Only 1 output port (Details of tag generation and

verification left to higher protocol)

Derived from:: A. Moradi (2016) and S. Kerckhof, L. Gaspar (2015)

24

Page 25: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Comparison with non-GMU Implementations – Virtex 6

0.506

0.497

0.478

0.385

0.679

0.575

0.272

0.636

Area (LUTs)

TP

(M

bp

s)

SCREAM (Basic Iterative)

SCREAM (Unrolled x2)

Minalpher POET

Minalpher has highest (TP/A) Ratio• 1 TEM core versus 2 TEM cores in GMU

design (39 cycles/block vs. 19 cycles/block)

SCREAM TP/As close to GMU• Not compliant with CAESAR HW API

Divergent TP/A for POET• Not compliant with CAESAR HW API

• Different choice of architecture

25

Page 26: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Throughput/Area of AES-GCM = 1.020 (Mbit/s)/LUTs

Comparison with all Round Two Candidates

E – Throughput/Area for Encryption

D – Throughput/Area for Decryption

A – Throughput/Area for Authentication Only

Default: Throughput/Area the same for all 3 operations

Relative Throughput/Area in Virtex 6 vs. AES-GCM

26

17 21 25

(Mitsubishi) (GMU CERG) (GMU CERG)

Page 27: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Summary & Conclusions

Three functionally correct high-speed implementations of CAESAR Round Two Candidates using RTL design in CAESAR HW API

• Substantial part of GMU CERG benchmarking effort in support of CAESAR Round Two evaluations

SCREAM (w/basic iterative architecture) had highest TP/A in GMU CERG implementations

• Minalpher (Mitsubishi version) highest TP/A overall

27

Page 28: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Lessons Learned

Features of implemented ciphers negatively affecting their performance

• Variable Shifts and Truncations

• #Ciphertext blocks ≠ #Plaintext blocks

28

Page 29: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

One Stop Website

https://cryptography.gmu.edu/athena/index.php?id=download

OR

https://cryptography.gmu.edu/athena

and click on CAESAR

• VHDL/Verilog Code of CAESAR Candidates: Summary I

• VHDL/Verilog Code of CAESAR Candidates: Summary II

• ATHENa Database of Results: Rankings View

• ATHENa Database of Results: Table View

• Benchmarking of Round 2 CAESAR Candidates in Hardware:

Methodology, Designs & Results

• GMU Implementations of Authenticated Ciphers and Their Building

Blocks

• CAESAR Hardware API v1.029

Page 30: RTL Implementations and FPGA Benchmarking of Three ...ece.gmu.edu/~kgaj/publications/conferences/GMU_DSD... · RTL Implementations and FPGA Benchmarking of Three Authenticated Ciphers

Questions?