Vivado HLS Implementation of Round-2 SHA-3 Candidates

1

Vivado HLS Implementation of Round-2 SHA-3 Candidates

Farnoud FarahmandECE 646

CERG: http://cryptography.gmu.edu

2

Cryptographic Standard Contests

time97 98 99 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17

AES

NESSIE

CRYPTREC

eSTREAM

SHA-3

34 stream 4 HW winnersciphers → + 4 SW winners

51 hash functions → 1 winner

15 block ciphers → 1 winnerIX.1997 X.2000

I.2000 XII.2002

IV.2008

X.2007 X.2012

XI.2004

CAESARI.2013

57 authenticated ciphers → multiple winnersXII.2017

3

Evaluation Criteria

Security

Software Efficiency Hardware Efficiency

Simplicity

FPGAs ASICs

Flexibility Licensing

µProcessors µControllers

4

Traditional Development & Benchmarking Flow

ManualDesign

HDLCode

Manual OptimizationFPGATools

Netlist

PostPlace&Route

Results

Functional Verification

Timing Verification

InformalSpecification TestVectors

5

• Large number of candidates• Long time necessary to develop and verify

RTL (Register-Transfer Level)Hardware Description Language (HDL) codes

• Multiple variants of algorithms (e.g., multiple hash value size)

• Multiple hardware architectures • Dependence on skills of designers

Remaining Difficulties of Hardware Benchmarking

6

High-Level Synthesis (HLS)

High Level Language(e.g. C, C++, SystemC)

Hardware Description Language(e.g., VHDL or Verilog)

High-Level Synthesis

7

High-Level Synthesis

HDLCode

Automated OptimizationFPGATools

Netlist

PostPlace&Route

Results

Functional Verification

Timing Verification

ReferenceImplementationinC

TestVectorsManual Modifications

(pragmas, tweaks)

HLS-readyCcode

Proposed HLS-Based Development and Benchmarking Flow

SoftwareVerification

8

Examples of Source Code Modifications (1)

for (i = 0; i < 16; i++) #pragma HLS UNROLL

b[i] = a[i]+4;

Unrolling of loops:

void main() { int A[16];#pragma HLS ARRAY_RESHAPE Variable=A complete dim=1

...}

Array Reshaping:

9

Examples of Source Code Modifications (2)

Function Reuse:

10

• 4 Round 2 SHA-3 candidates + 5 Finalists are done previously • Basic iterative architecture• GMU Hardware Interface for SHA core• Padding done in software• RTL Implementations developed previously • Starting point: Informal specifications and reference software

implementations in C provided by the algorithm authors• Post P&R results generated for

- Xilinx Virtex 5 using Xilinx ISE • No use of BRAMs or DSP Units

Our Test Case

11

Parameters of Hash Functions

Algorithm Number of Rounds

Block Size Hash size Chain Size

Luffa 8 256 256 768Cubehash 16 256 256 1024BMW 1 512 256 512ECHO 8 1536 256 512

12

SHA core Interface

13

HLS and RTL Implementations Parameters

Algorithm #Rounds Cycles/BlockRTL

Cycles/BlockHLS

IntervalHLS

Luffa 8 9 10 11Cubehash 16 16 17 18BMW 1 2 1 1ECHO 8 26 25 26

Interval = Number of Cycles require to processtwo consecutives blocks

14

Datapath vs. Control Unit

Datapath ControlUnit

Data Inputs

Data Outputs

Control Inputs

Control Outputs

Control Signals

StatusSignals

Determines• Area• Clock Frequency

Determines• Number of clock cycles

15

Control Unit suboptimal• Difficulty in inferring an overlap between completing the last

round and reading the next input block• One additional clock cycle used for initialization of the state at

the beginning of each round

Encountered Problems

16

0

1,000

2,000

3,000

4,000

5,000

6,000

7,000

HLS RTL

Area

Cubehash Luffa BMW ECHO

RTL vs. HLS (Area):

17

0

2,000

4,000

6,000

8,000

10,000

12,000

14,000

HLS RTL

Throughput

Cubehash BMW Luffa ECHO

RTL vs. HLS (Throughput):

18

Throughput vs. Area:

0

2,000

4,000

6,000

8,000

10,000

12,000

14,000

0 1,000 2,000 3,000 4,000 5,000 6,000 7,000

Thro

ughp

ut

Area

HLSLuffaCubeHashBMWECHO

0

2,000

4,000

6,000

8,000

10,000

12,000

14,000

0 1,000 2,000 3,000 4,000 5,000 6,000 7,000

Thro

ughp

utArea

RTLLuffaCubeHashBMWECHO

19

Overall Results:

HLS RTLFreq. TP A TP/A Freq TP A TP/A

Luffa 217.39 5,565 1,390 4.00 334.4 9,513 1,023 9.29Cubehash 168.71 2,399 1,076 2.23 248.3 3,972 663 5.99BMW 9.45 4,838 4,859 0.99 34 8,704 4,350 2.00ECHO 145.77 8,612 6,389 1.34 203.8 12,094 4,888 2.47

RTL/HLSFreq. TP A TP/A

Luffa 1.53 1.70 0.73 2.32Cube Hash 1.47 1.65 0.61 2.68BMW 3.59 1.79 0.89 2.00ECHO 1.39 1.40 0.76 1.83

20

Conclusions

• High-level synthesis offers a potential to facilitate hardware benchmarking during the design of cryptographic algorithms and at the early stages of cryptographic contests

• Case study based on 4 Round 2 SHA-3 candidates demonstrated correct ranking for all of candidates using all major performance metrics

• More research & development needed to overcome remaining difficulties

• Suboptimal control unit of HLS implementations• Wide range of RTL to HLS performance metric ratios• A few potentially suboptimal HLS or RTL implementations• Efficient and reliable generation of HLS-ready C codes

21

Comments? Questions?

Suggestions?

CERG: http://cryptography.gmu.edu

Thank you!

Vivado HLS Implementation of Round-2 SHA-3 Candidates

Documents

Transcript of Vivado HLS Implementation of Round-2 SHA-3 Candidates