Vivado HLS Implementation of Round-2 SHA-3 Candidates
Farnoud Farahmand
ECE 646
CERG: http://cryptography.gmu.edu
2
Cryptographic Standard Contests
Timeline: 1997 to 2017
• AES (IX.1997 to X.2000): 15 block ciphers → 1 winner
• NESSIE (I.2000 to XII.2002)
• CRYPTREC
• eSTREAM (XI.2004 to IV.2008): 34 stream ciphers → 4 HW winners + 4 SW winners
• SHA-3 (X.2007 to X.2012): 51 hash functions → 1 winner
• CAESAR (I.2013 to XII.2017): 57 authenticated ciphers → multiple winners
3
Evaluation Criteria
• Security
• Software Efficiency: µProcessors, µControllers
• Hardware Efficiency: FPGAs, ASICs
• Simplicity
• Flexibility
• Licensing
4
Traditional Development & Benchmarking Flow
Informal Specification, Test Vectors
→ Manual Design
→ HDL Code (Functional Verification)
→ FPGA Tools (Manual Optimization)
→ Netlist (Timing Verification)
→ Post Place & Route Results
5
Remaining Difficulties of Hardware Benchmarking
• Large number of candidates
• Long time necessary to develop and verify RTL (Register-Transfer Level) Hardware Description Language (HDL) codes
• Multiple variants of algorithms (e.g., multiple hash value sizes)
• Multiple hardware architectures
• Dependence on skills of designers
6
High-Level Synthesis (HLS)
High-Level Language (e.g., C, C++, SystemC)
→ High-Level Synthesis
→ Hardware Description Language (e.g., VHDL or Verilog)
7
Proposed HLS-Based Development and Benchmarking Flow
Reference Implementation in C, Test Vectors
→ Manual Modifications (pragmas, tweaks)
→ HLS-ready C code (Software Verification)
→ High-Level Synthesis
→ HDL Code (Functional Verification)
→ FPGA Tools (Automated Optimization)
→ Netlist (Timing Verification)
→ Post Place & Route Results
8
Examples of Source Code Modifications (1)
Unrolling of loops:
    for (i = 0; i < 16; i++) {
    #pragma HLS UNROLL
        b[i] = a[i] + 4;
    }
Array Reshaping:
    void main() {
        int A[16];
    #pragma HLS ARRAY_RESHAPE variable=A complete dim=1
        ...
    }
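As a minimal sketch, the two fragments on this slide can be combined into one self-contained function that is also checkable in software. The function name add4, the local buffer tmp, and the array size macro N are illustrative assumptions, not part of the original design. An ordinary C compiler ignores the #pragma HLS lines (possibly with a warning), so the same source can be verified against test vectors before synthesis.

```c
#define N 16

/* Hypothetical top-level function: writes a[i] + 4 into b[i]. */
void add4(const int a[N], int b[N])
{
    int tmp[N];
/* Reshape tmp into a single wide word so that all elements
   can be accessed in the same clock cycle. */
#pragma HLS ARRAY_RESHAPE variable=tmp complete dim=1
    int i;

    for (i = 0; i < N; i++) {
/* Fully unroll the loop: N adders working in parallel. */
#pragma HLS UNROLL
        tmp[i] = a[i] + 4;
    }

    for (i = 0; i < N; i++) {
#pragma HLS UNROLL
        b[i] = tmp[i];
    }
}
```

A C testbench would fill a with known values, call add4(a, b), and compare every b[i] with a[i] + 4.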
9
Examples of Source Code Modifications (2)
Function Reuse:
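A minimal sketch of one way function reuse can be requested in Vivado HLS, assuming the INLINE and ALLOCATION pragmas. This is an illustration, not the code shown on the slide: the names round_fn and two_rounds are hypothetical, and the rotate-and-xor body is a placeholder rather than the round of any SHA-3 candidate.

```c
#include <stdint.h>

/* Hypothetical round function (placeholder logic only). */
static uint32_t round_fn(uint32_t x, uint32_t k)
{
/* Keep round_fn as a separate module instead of inlining it
   into every call site. */
#pragma HLS INLINE off
    return ((x << 7) | (x >> 25)) ^ k;
}

/* Hypothetical top-level function that calls round_fn twice. */
uint32_t two_rounds(uint32_t x, uint32_t k0, uint32_t k1)
{
/* Request at most one hardware instance of round_fn,
   shared by both calls below. */
#pragma HLS ALLOCATION instances=round_fn limit=1 function
    uint32_t t = round_fn(x, k0);
    return round_fn(t, k1);
}
```

Sharing one instance trades clock cycles for area: the two calls can no longer execute in the same cycle, but only one copy of the round logic is built.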
10
Our Test Case
• 4 Round 2 SHA-3 candidates, in addition to the 5 finalists done previously
• Basic iterative architecture
• GMU Hardware Interface for SHA core
• Padding done in software
• RTL implementations developed previously
• Starting point: informal specifications and reference software implementations in C provided by the algorithm authors
• Post-P&R results generated for Xilinx Virtex 5 using Xilinx ISE
• No use of BRAMs or DSP units
11
Parameters of Hash Functions
Algorithm   Number of Rounds   Block Size (bits)   Hash Size (bits)   Chain Size (bits)
Luffa       8                  256                 256                768
Cubehash    16                 256                 256                1024
BMW         1                  512                 256                512
ECHO        8                  1536                256                512
12
SHA core Interface
13
HLS and RTL Implementations Parameters
Algorithm   #Rounds   Cycles/Block (RTL)   Cycles/Block (HLS)   Interval (HLS)
Luffa       8         9                    10                   11
Cubehash    16        16                   17                   18
BMW         1         2                    1                    1
ECHO        8         26                   25                   26
Interval = number of clock cycles required to process two consecutive blocks
14
Datapath vs. Control Unit
Datapath
• Data Inputs, Data Outputs
• Determines: area, clock frequency
Control Unit
• Control Inputs, Control Outputs
• Determines: number of clock cycles
Between the two: Control Signals (Control Unit → Datapath), Status Signals (Datapath → Control Unit)
15
Encountered Problems
Control unit suboptimal:
• Difficulty in inferring an overlap between completing the last round and reading the next input block
• One additional clock cycle used for initialization of the state at the beginning of each round
16
RTL vs. HLS (Area):
[Bar chart: area of the HLS and RTL implementations of Cubehash, Luffa, BMW, and ECHO; axis from 0 to 7,000]
17
RTL vs. HLS (Throughput):
[Bar chart: throughput of the HLS and RTL implementations of Cubehash, BMW, Luffa, and ECHO; axis from 0 to 14,000]
18
Throughput vs. Area:
[Two scatter plots, one for HLS and one for RTL: throughput (0 to 14,000) against area (0 to 7,000) for Luffa, CubeHash, BMW, and ECHO]
19
Overall Results:
TP = throughput, A = area

            HLS                                RTL
Algorithm   Freq.    TP      A       TP/A     Freq.   TP       A       TP/A
Luffa       217.39   5,565   1,390   4.00     334.4   9,513    1,023   9.29
Cubehash    168.71   2,399   1,076   2.23     248.3   3,972    663     5.99
BMW         9.45     4,838   4,859   0.99     34      8,704    4,350   2.00
ECHO        145.77   8,612   6,389   1.34     203.8   12,094   4,888   2.47

RTL/HLS ratios
Algorithm   Freq.   TP     A      TP/A
Luffa       1.53    1.70   0.73   2.32
Cubehash    1.47    1.65   0.61   2.68
BMW         3.59    1.79   0.89   2.00
ECHO        1.39    1.40   0.76   1.83
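The derived columns can be recomputed with a few lines of C. This is a sketch under stated assumptions: throughput for long messages is taken as block size × clock frequency / cycles per block, frequencies are read as MHz, and throughput as Mbit/s. The slides do not state these units or the formula, and the function names are illustrative.

```c
/* Assumed formula: throughput = block_size * frequency / cycles_per_block.
   With block_bits in bits and freq_mhz in MHz the result is in Mbit/s. */
double throughput_mbps(double block_bits, double freq_mhz,
                       double cycles_per_block)
{
    return block_bits * freq_mhz / cycles_per_block;
}

/* Throughput-to-area ratio (the TP/A column). */
double tp_per_area(double tp, double area)
{
    return tp / area;
}

/* RTL/HLS ratio of any single metric (frequency, TP, A, or TP/A). */
double rtl_over_hls(double rtl, double hls)
{
    return rtl / hls;
}
```

For the Luffa RTL design (256-bit block, 334.4, 9 cycles per block), throughput_mbps(256, 334.4, 9) gives roughly 9,512, close to the tabulated 9,513; the small difference is consistent with the frequency having been rounded in the table.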
20
Conclusions
• High-level synthesis offers a potential to facilitate hardware benchmarking during the design of cryptographic algorithms and at the early stages of cryptographic contests
• Case study based on 4 Round 2 SHA-3 candidates demonstrated correct ranking of all candidates using all major performance metrics
• More research & development needed to overcome remaining difficulties:
  • Suboptimal control unit of HLS implementations
  • Wide range of RTL to HLS performance metric ratios
  • A few potentially suboptimal HLS or RTL implementations
  • Efficient and reliable generation of HLS-ready C codes
21
Comments? Questions?
Suggestions?
Thank you!