Research Article
High Performance and Low Power Hardware Implementation for Cryptographic Hash Functions
Yunlong Zhang,1 Joohee Kim,1 Ken Choi,1 and Taeshik Shon2
1 Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA
2 Division of Information and Computer Engineering, College of Information Technology, Ajou University, San 5, Woncheon-Dong, Yeongtong-Gu, Suwon 443-749, Republic of Korea
Correspondence should be addressed to Ken Choi; kchoi@ece.iit.edu
Received 12 September 2013; Accepted 4 January 2014; Published 2 March 2014
Academic Editor: Jongsung Kim
Copyright © 2014 Yunlong Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Since hash functions are cryptography's most widely used primitives, efficient hardware implementation of hash functions is of critical importance. The proposed high performance hardware implementation of hash functions uses sponge construction, which generates a digest of the desired length, and considers two key design metrics: throughput and power consumption. Firstly, this paper introduces unfolding transformation, which increases the throughput of the hash function, and pipelining and parallelism design techniques, which reduce the delay. Secondly, we propose a frequency trade-off technique, which gives a range of frequency values for making a trade-off between low dynamic power consumption and high throughput. Finally, we use a load-enable based clock gating scheme to eliminate wasted signal toggling in the idle mode of the hash encryption system. We demonstrate the proposed design techniques using 45 nm CMOS technology at 10 MHz. The results show that we can achieve up to 47.97 times higher throughput, 6.31% delay reduction, and 13.65% dynamic power reduction.
1. Introduction
The explosion of e-commerce has boosted transactions over the Internet, so intruders must be prevented from accessing sensitive information. This calls for a higher level of security protection. Modern cryptography includes, for example, symmetric-key cryptography, public-key cryptography, and cryptographic hash functions. Cryptographic hash functions are used in almost every modern application, especially in a multitude of protocols, for instance in digital signatures for achieving message authentication and integrity protection. For example, hash-based message authentication codes (HMACs) are used in the IP security protocol and in the secure sockets layer (SSL) protocol [1].
Some hash functions, such as the message-digest algorithm (MD) series (MD4 and its strengthened variant MD5) and the secure hash algorithm (SHA) series (SHA-0 and SHA-1), were widely used but have been broken in practice. Considering the potential danger of SHA-2 being attacked, in 2008 the National Institute of Standards and Technology (NIST) started the NIST hash competition to develop the future hash standard SHA-3 [2].
Although software encryption is becoming more prevalent today, hardware design is the embodiment of choice for many commercial and military applications [3]. Firstly, hardware design is much faster than the corresponding software implementation [4]. Secondly, hardware implementation provides physical protection and thus a high level of security [5]. However, a hash function with a higher security level requires more complicated logic, and a larger amount of data requires a higher frequency to maintain efficiency (or throughput). As a result, the power dissipation of the hardware design increases tremendously. This causes serious problems in hardware systems, such as lower reliability, higher energy consumption, and higher device costs. Thus, low power techniques are highly valued in today's hardware design.
Hindawi Publishing Corporation
International Journal of Distributed Sensor Networks
Volume 2014, Article ID 736312, 12 pages
http://dx.doi.org/10.1155/2014/736312
Figure 1: Sponge construction [6]. (Diagram labels: pad, M, f, Z, br, c, absorbing, squeezing.)
The rest of this paper is organized as follows. Sponge construction and the low power methods used in this paper are introduced in Section 2. In Section 3, we analyze the hash function designed by sponge construction and its original hardware implementation, and then present the unfolding transformation and the pipelining and parallelism design techniques used to improve the throughput and delay of the hash function. In Section 4, we construct the hash encryption system and introduce two low power techniques: the frequency trade-off technique and the load-enable based clock gating scheme. The paper is concluded in Section 5.
2. Background of the Research
In this section, sponge construction is explained first. Next, we introduce the two dynamic power reduction methods used in this paper.
2.1. Sponge Construction. The idea of sponge construction came from the design of RadioGatun, and its final definition was given at the Ecrypt Hash Workshop in Barcelona [6]. As shown in Figure 1, sponge construction takes an input of arbitrary length, operates on a finite internal state, and gives an output of any desired length.
There are three components in sponge construction [7]:

(i) a state memory,
(ii) a function of fixed length that permutes or transforms the state memory,
(iii) a padding function.
The state memory in Figure 1 is divided into two parts: the top section, called the bitrate, of $br$ bits and the bottom section, called the capacity, of $c$ bits. The input message ($M$ in Figure 1) is padded to a whole multiple of the bitrate, so the padded input message can be broken into $br$-bit blocks.
Sponge construction consists of two processes: absorbing and squeezing. In the absorbing phase (left of the dashed line in Figure 1), first, the input message is padded and the state memory is initialized; second, the first $br$-bit block of the padded input is XORed with the initial $br$ bits of the state memory; third, the fixed-length function (block $f$ in Figure 1) updates the state memory. Steps two and three are repeated until all the padded $br$-bit blocks are used up. In the squeezing phase (the right section), first, the $br$ bits of the latest state memory form the first $br$-bit output; second, if more output bits are needed, the fixed-length function updates the state memory and the $br$ bits of the new state memory form the second $br$-bit output. This process is repeated until the desired number of output bits ($Z$ in Figure 1) is produced.
The extent to which the $c$-bit part is altered by the input message depends on the fixed-length function [7]. The security of the hash function, for example, its resistance to collision or preimage attacks, relies on this $c$-bit part. Because of its arbitrarily long input and output sizes, the sponge construction allows building various primitives such as hash functions. The Keccak hash function, known as the new SHA-3, uses this sponge construction.
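The absorbing and squeezing phases described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `sponge` and `toy_f` are hypothetical, the state is held as a single integer with the bitrate part assumed to occupy the most significant $br$ bits, and `toy_f` is a stand-in bijection with no cryptographic strength.

```python
from typing import Callable

def sponge(message_blocks: list[int], f: Callable[[int], int],
           br: int, c: int, out_blocks: int) -> list[int]:
    """Generic sponge over a (br + c)-bit integer state; blocks are br-bit integers."""
    state = 0                                  # state memory initialised to zero
    for block in message_blocks:               # absorbing phase
        state ^= block << c                    # XOR into the br-bit part (assumed: top bits)
        state = f(state)                       # fixed-length function updates the state
    output = []
    for n in range(out_blocks):                # squeezing phase
        if n > 0:
            state = f(state)                   # update the state between output blocks
        output.append(state >> c)              # read the br-bit part
    return output

def toy_f(state: int, width: int = 128) -> int:
    """Stand-in permutation (NOT cryptographic): affine map with odd multiplier, then rotate."""
    mask = (1 << width) - 1
    state = (state * 0x9E3779B97F4A7C15 + 1) & mask
    return ((state << 13) | (state >> (width - 13))) & mask

# Example: br = 32 and c = 96 give a 128-bit state; squeeze four 32-bit blocks.
digest = sponge([0x00000001, 0x00000002], toy_f, br=32, c=96, out_blocks=4)
```

Only the $br$-bit part is ever touched by the message or exposed in the output; the $c$-bit part changes solely through `f`, which is why the security argument rests on the capacity.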
2.2. Dynamic Power Reduction Methods. Digital circuits consume dynamic power in the active mode. There are two sources of dynamic power consumption [8]:

(i) charging and discharging of the output capacitance,
(ii) short-circuit current when the PMOS and NMOS networks are both ON.
Because the short-circuit power is usually less than 10% of the total dynamic power [9], dynamic power in the rest of this paper refers to switching power. Dynamic power is given by (1), where $C_L$ is the load capacitance, $V_{DD}$ is the supply voltage, $f$ is the clock frequency, and TR is the toggle rate of the gate output.
$$P_{\text{dynamic}} = \frac{1}{2} C_L V_{DD}^2 f \cdot \mathrm{TR} \tag{1}$$
Since power optimization at RTL has significant impact with reasonable accuracy, RTL is considered the optimal stage for low power techniques [8]. According to (1), four parameters determine the dynamic power consumption: supply voltage, clock frequency, load capacitance, and the toggle rate of the gate output. Because reducing the supply voltage increases the critical path delay, and changing the capacitance of the gate output requires redesigning the load logic, it is more efficient to focus on clock frequency and toggle rate at RTL.
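As a small numerical illustration of (1), the sketch below evaluates the switching power at a hypothetical operating point and shows that halving either the clock frequency or the toggle rate halves the result. The function name and all numeric values are assumptions chosen for illustration; they are not measurements from this work.

```python
def dynamic_power(c_load: float, vdd: float, freq: float, toggle_rate: float) -> float:
    """Switching power from equation (1): 0.5 * C_L * VDD^2 * f * TR, in watts."""
    return 0.5 * c_load * vdd ** 2 * freq * toggle_rate

# Hypothetical operating point: 1 pF load, 1.0 V supply, 10 MHz clock, TR = 0.2
base = dynamic_power(1e-12, 1.0, 10e6, 0.2)         # 1 microwatt
half_clock = dynamic_power(1e-12, 1.0, 5e6, 0.2)    # halving f halves the power
half_toggle = dynamic_power(1e-12, 1.0, 10e6, 0.1)  # halving TR halves the power
```

The two RTL-level knobs discussed in the text, frequency and toggle rate, both enter (1) linearly, whereas the supply voltage enters quadratically.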
2.2.1. Dynamic Voltage/Frequency Scaling. Figure 2 shows a basic dynamic voltage/frequency scaling (DVFS) system. By collecting information about the workload and the temperature, the DVFS controller determines a clock frequency that is sufficient to finish the work and gives the best performance without overheating. Choosing a proper clock frequency in this way reduces dynamic power.
2.2.2. Load-Enable Based Clock Gating. Combinational clock gating is widely used to address dynamic power in single-level registers, whereas sequential clock gating considers multiple-level (pipeline) registers. In this research we focus on combinational clock gating; in particular, we use a load-enable based clock gating scheme [10].

Figure 2: DVFS system [9]. (Diagram labels: DVFS controller, core logic, switching voltage regulator, voltage control, frequency control, workload, temperature, Vin, VDD.)

Figure 3: Load-enable based clock gating. (Diagram labels: FFs, E, D, clk, en, gclk, D[N-1:0], Q[N-1:0].)
Figure 3 shows the normal structure of the load-enable based clock gating scheme. If the data do not change during some consecutive clock periods, or the enable signal is kept low, those clock periods are wasted. This technique can be applied to a circuit with a mux in which the enable signal is the selection signal, or to a pipelined circuit such as the hash encryption system in this research.
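The effect of load-enable based clock gating can be illustrated with a cycle-level behavioral model. The sketch below is an assumption-laden simplification (the function name `simulate_register` is hypothetical, and real gating cells use a latch to avoid glitches): it only counts how many clock edges reach the flip-flops with and without gating, while checking that the stored values are identical.

```python
def simulate_register(data_in: list[int], enable: list[int], gated: bool):
    """Cycle-level model of an enable-controlled register bank.

    Returns (trace of stored values, number of clock edges delivered to the flip-flops).
    """
    q = 0
    trace, clock_edges = [], 0
    for d, en in zip(data_in, enable):
        if gated:
            # gclk = clk AND enable: the flip-flops only see an edge when en = 1
            if en:
                clock_edges += 1
                q = d
        else:
            # ungated: every edge reaches the flip-flops; a mux recirculates q when en = 0
            clock_edges += 1
            q = d if en else q
        trace.append(q)
    return trace, clock_edges

data = [5, 6, 7, 8, 9, 10]
load_enable = [1, 0, 0, 1, 0, 0]
plain_trace, plain_edges = simulate_register(data, load_enable, gated=False)
gated_trace, gated_edges = simulate_register(data, load_enable, gated=True)
```

In this example the register contents match in both cases, but the gated version clocks the flip-flops on 2 of 6 cycles, which is the toggle-rate saving that (1) turns into a dynamic power saving.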
3. Proposed High-Speed Hashing Module in Hardware
Cryptographic hash functions provide powerful protection for data and have been utilized in the security layer of every communication protocol. However, as protocols evolve, data sizes and communication speeds are increasing dramatically, and the low throughput of hash functions becomes a bottleneck in these digital communication systems. A promising solution is hardware implementation on reconfigurable devices, which combines high flexibility with speed and physical security.
Various techniques have been proposed to speed up or improve the throughput of hash functions, for example, unfolding transformation and pipeline and parallelism techniques. In this section, the characteristics relevant to the hardware implementation of the hash algorithm are presented. Then the high-speed hashing module is introduced based on the delay bound analysis. Finally, two techniques, unfolding transformation and pipeline and parallelism, are used to optimize the inner logic of the transformation rounds.

Table 1: The parameters of SHAT.

| SHAT | Hash value | Number of steps |
|---|---|---|
| SHAT-128 | 128 bits | 48 |
| SHAT-256 | 256 bits | 48 |
| SHAT-384 | 384 bits | 48 |
3.1. Hash Algorithm Specification. In this section we introduce a cryptographic hash algorithm with sponge construction, called sponge hash algorithm (SHAT). SHAT is a hash function generating 128-, 256-, or 384-bit hash values. According to the hash value length, SHAT is denoted by SHAT-$(128 \cdot i)$ ($i = 1, 2, 3$). The parameters of SHAT are shown in Table 1.
3.1.1. $G$ Function. The $G$ function of SHAT consists of an $S$-box and a diffusion layer. The $S$-box is a substitution function that satisfies the confusion property on each 4-bit word. A 32-bit input word $W$, for example, is divided into eight 4-bit words $(w_0, \ldots, w_7)$, and each 4-bit word goes through the $S$-box. The $S$-box is defined as $sw_i = \mathrm{Sbox}(w_i)$ ($i = 0, \ldots, 7$) and is specified in Table 2. The diffusion layer is a permutation that satisfies the diffusion property (the same as the $P$ function of Camellia [11]). For computational efficiency, the diffusion layer is represented using only bit-wise exclusive ORs. The branch number of the diffusion layer should be optimal against differential and linear cryptanalysis for security [11]. Once all eight 4-bit outputs of the $S$-box $(sw_0, \ldots, sw_7)$ are available, the diffusion layer mixes them. The diffusion layer is defined as (2):

$$
\begin{pmatrix} w'_0 \\ w'_1 \\ w'_2 \\ w'_3 \\ w'_4 \\ w'_5 \\ w'_6 \\ w'_7 \end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 1 & 1 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 & 1 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 \\
1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 0 & 1
\end{pmatrix}
\begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \\ w_4 \\ w_5 \\ w_6 \\ w_7 \end{pmatrix}
\tag{2}
$$
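A software model of the $G$ function can help when checking a hardware implementation against Table 2 and (2). The sketch below is illustrative only. It assumes that $w_0$ is the most significant 4-bit word of the 32-bit input, that the leftmost matrix column multiplies the first word, and that the diffusion layer acts on the $S$-box outputs, as the text describes; none of these ordering conventions is stated explicitly, so treat them as assumptions.

```python
# S-box from Table 2: index is the 4-bit input, value is the 4-bit output.
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]

# Rows of the binary matrix in equation (2), most significant bit = leftmost column.
DIFFUSION = [
    0b01111001, 0b10111100, 0b11010110, 0b11100011,
    0b01111110, 0b10110111, 0b11011011, 0b11101101,
]

def g_function(word: int) -> int:
    """S-box layer followed by the XOR-only diffusion layer on a 32-bit word."""
    # Assumption: w0 is the most significant nibble of the word.
    nibbles = [(word >> (28 - 4 * i)) & 0xF for i in range(8)]
    sw = [SBOX[w] for w in nibbles]
    out = 0
    for row in DIFFUSION:
        acc = 0
        for j in range(8):
            if (row >> (7 - j)) & 1:      # matrix entry (row, j) is 1
                acc ^= sw[j]              # multiplication over GF(2) is a selective XOR
        out = (out << 4) | acc
    return out
```

Because every matrix entry is 0 or 1, each output word is simply the XOR of a subset of the $S$-box outputs, which is what makes the layer cheap in hardware.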
3.1.2. Hash Function of SHAT. SHAT uses the hermetic sponge construction shown in Figure 4. As mentioned in Section 2, $br$ is called the bitrate and $c$ the capacity. The bitrate ($br$) and the capacity ($c$) of SHAT-$(128 \cdot i)$ ($i = 1, 2, 3$) are $32 \cdot i$ and $96 \cdot i$ bits, respectively. The internal state $S$ is divided into $4 \cdot i$ sections as $S = (S_0, \ldots, S_{4i-1})$ ($i = 1, 2, 3$).

Table 2: $S$-box of the $G$ function.

| $w$ | $\mathrm{Sbox}(w)$ | $w$ | $\mathrm{Sbox}(w)$ |
|---|---|---|---|
| 0x0 | 0x1 | 0x8 | 0xF |
| 0x1 | 0x2 | 0x9 | 0x8 |
| 0x2 | 0x4 | 0xA | 0x9 |
| 0x3 | 0xB | 0xB | 0x7 |
| 0x4 | 0xD | 0xC | 0x6 |
| 0x5 | 0xE | 0xD | 0x3 |
| 0x6 | 0xA | 0xE | 0x0 |
| 0x7 | 0x5 | 0xF | 0xC |

Figure 4: Sponge construction of SHAT. (Diagram labels: initial state words 0, 0, and 128·i; br; c; message blocks M0 to Mn; Perm; outputs H0, H1, H2, H3; phases Initialization, Absorbing, Squeezing.)

In the absorbing phase, the input message $M = (M_0, M_1, \ldots, M_{n-1})$ shown in Figure 4 is padded to a whole multiple of the bitrate ($br$). The padding method is as follows. Let $l$ be the total length of the input message in bits (we assume that $l$ is a whole multiple of four, that is, the message is an integer number of hexadecimal digits). We append a 1 to the end of the message, followed by $k$ zero bits, where $k$ is the smallest nonnegative integer satisfying
$$(l + 1 + k) \bmod (32 \cdot i) = 0. \tag{3}$$
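The padding rule in (3) can be sketched as follows. This is a minimal illustration that represents the message as a string of '0' and '1' characters for readability; the helper names `pad_message` and `split_blocks` are hypothetical and not part of the original design.

```python
def pad_message(bits: str, i: int = 1) -> str:
    """Append '1' then k zeros so that the length is a multiple of the bitrate 32*i."""
    br = 32 * i
    k = (-(len(bits) + 1)) % br      # smallest k >= 0 with (l + 1 + k) mod br == 0
    return bits + "1" + "0" * k

def split_blocks(padded: str, i: int = 1) -> list[int]:
    """Break the padded bit string into br-bit blocks, returned as integers."""
    br = 32 * i
    return [int(padded[p:p + br], 2) for p in range(0, len(padded), br)]

# A 4-bit message (one hexadecimal digit) padded for SHAT-128 (i = 1).
blocks = split_blocks(pad_message("1010"))
```

Note that a message whose length is already a multiple of the bitrate still gains a full extra block, since the appended 1 bit is mandatory.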
Then we set $S_{4i-1}$ as the bitrate part that is XORed with the padded $br$-bit message block. The result goes through the one-way compression function Perm, a permutation process with 48 steps. Each STEP is defined in Algorithm 1, in which the left circular rotations $\mathrm{rot}_k$ are $\mathrm{rot}_0 = 19$, $\mathrm{rot}_1 = 1$, and $\mathrm{rot}_2 = 14$. In the squeezing phase, the output of SHAT is defined by (4). SHAT-$(128 \cdot i)$ ($i = 1, 2, 3$) is specified in Algorithm 2.
$$
\mathrm{SQUEEZE}(S, i) =
\begin{cases}
S_3, & i = 1, \\
(S_3, S_7), & i = 2, \\
(S_3, S_7, S_{11}), & i = 3.
\end{cases}
\tag{4}
$$
3.2. Hardware Implementation. Following the specification of SHAT-$(128 \cdot i)$ ($i = 1, 2, 3$) in Algorithm 2, the architecture of SHAT is illustrated in Figure 5. The $S$-box of the $G$ function is designed from a Karnaugh map. According to Table 2, we obtain the logic functions of the $S$-box shown in (5), where $A_i$ ($i = 0, 1, 2, 3$) are the input bits of the $S$-box and $Q_i$ ($i = 0, 1, 2, 3$) are the output bits:

$$
\begin{aligned}
Q_3 &= \bar{A}_3\bar{A}_2A_1A_0 + A_3A_2A_1A_0 + \bar{A}_3A_2\bar{A}_1 + \bar{A}_3A_2\bar{A}_0 + A_3\bar{A}_2\bar{A}_1 + A_3\bar{A}_2\bar{A}_0, \\
Q_2 &= A_3A_1A_0 + \bar{A}_3A_2\bar{A}_1 + A_3\bar{A}_1\bar{A}_0 + \bar{A}_3A_2A_0 + \bar{A}_3\bar{A}_2A_1\bar{A}_0, \\
Q_1 &= A_3\bar{A}_1\bar{A}_0 + \bar{A}_2A_1A_0 + \bar{A}_3\bar{A}_2A_0 + A_2\bar{A}_1A_0 + \bar{A}_3A_2A_1\bar{A}_0, \\
Q_0 &= \bar{A}_3\bar{A}_1\bar{A}_0 + \bar{A}_3A_1A_0 + A_3A_2\bar{A}_1A_0 + A_3\bar{A}_2\bar{A}_0 + A_3\bar{A}_2A_1.
\end{aligned}
\tag{5}
$$
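A quick way to validate sum-of-products logic such as (5) is to evaluate it for all 16 inputs and compare with Table 2. In the sketch below, the placement of the complemented literals is inferred from Table 2 (the product terms use the same literal sets as (5)), with $A_3$ taken as the most significant input bit; treat both as assumptions rather than as the authors' exact netlist.

```python
# S-box from Table 2: index is the 4-bit input, value is the 4-bit output.
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]

def sbox_logic(a: int) -> int:
    """Evaluate the sum-of-products form of the S-box for a 4-bit input a."""
    a3, a2, a1, a0 = (a >> 3) & 1, (a >> 2) & 1, (a >> 1) & 1, a & 1
    n3, n2, n1, n0 = 1 - a3, 1 - a2, 1 - a1, 1 - a0   # complemented inputs
    q3 = ((n3 & n2 & a1 & a0) | (a3 & a2 & a1 & a0) | (n3 & a2 & n1)
          | (n3 & a2 & n0) | (a3 & n2 & n1) | (a3 & n2 & n0))
    q2 = ((a3 & a1 & a0) | (n3 & a2 & n1) | (a3 & n1 & n0)
          | (n3 & a2 & a0) | (n3 & n2 & a1 & n0))
    q1 = ((a3 & n1 & n0) | (n2 & a1 & a0) | (n3 & n2 & a0)
          | (a2 & n1 & a0) | (n3 & a2 & a1 & n0))
    q0 = ((n3 & n1 & n0) | (n3 & a1 & a0) | (a3 & a2 & n1 & a0)
          | (a3 & n2 & n0) | (a3 & n2 & a1))
    return (q3 << 3) | (q2 << 2) | (q1 << 1) | q0

# Exhaustive comparison against the lookup table.
mismatches = [a for a in range(16) if sbox_logic(a) != SBOX[a]]
```

An empty `mismatches` list indicates that the gate-level expressions and the table describe the same 4-bit substitution.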
There are 48 iteration rounds in the basic architecture of the Perm function. We use the rolling loop technique to reduce the area requirement: our design is a single operation block that is reused 48 times, as shown in Figure 6. Here $r_i$ ($i = 0$ to 47) is a counter for the number of iteration rounds, from 0 to 47. The critical path is highlighted by the bold line. Since the delay of a circular shift is negligible in hardware, the critical path delay of this architecture is

$$\mathrm{Delay}_n = 4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g). \tag{6}$$
3.3. Proposed High-Speed Module. In the previous section we introduced the rolling loop technique to construct the Perm function. Although this approach is area efficient, throughput is kept low because 48 clock cycles are required to generate the result. Many architectures can be obtained by varying the Perm function to solve this problem. We applied the unfolding transformation technique. This high-speed module combines STEP blocks into a single round and can even take advantage of architectures with a completely round-unrolled circuit. By unfolding, the hidden concurrencies can be parallelized [12]. In addition, the pipeline and parallelism technique explained in [13] improves the unfolded construction of the hash function. This technique relies on precomputation, based on analysing the inner logic and architecture of the hash function.
3.3.1. Unfolding Transformation. According to Figure 6, the mathematical expression of one iteration round is

$$
\begin{aligned}
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus S_2), \\
S'_2 &= S_1, \\
S'_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1), \\
S'_0 &= S_3 \oplus r.
\end{aligned}
\tag{7}
$$
Step(S)
(i) For k = 0 to i − 1
    (a) S_{4k+3} = S_{4k+3} ⊕ r
    (b) S_{4k} = S_{4k} ⊕ S_{4k+1}
    (c) S_{4k+2} = S_{4k+2} ⊕ S_{4k+3}
    (d) S_{4k} = S_{4k} ⊕ G(S_{4k+2})
    (e) S_{4k+2} = S_{4k+2} ⊕ (S_{4k} <<< rot_k)
(ii) Temp = S_{4i−1}
(iii) For k = 4i − 1 to 1
    S_k = S_{k−1}
(iv) S_0 = Temp

Algorithm 1: Typical one step algorithm.
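Algorithm 1 translates almost line by line into software, which is convenient for generating reference vectors for the hardware. The sketch below is an illustration under stated assumptions: the $G$ function is passed in as a parameter `g`, the round counter $r$ is assumed to take the values 0 to 47 across the 48 steps of Perm, and the names `rotl`, `step`, and `perm` are hypothetical.

```python
MASK = 0xFFFFFFFF
ROT = (19, 1, 14)                 # rot_0, rot_1, rot_2 as given in the text

def rotl(x: int, n: int) -> int:
    """Left circular rotation of a 32-bit word."""
    return ((x << n) | (x >> (32 - n))) & MASK

def step(state, r, g, i=1):
    """One STEP of Algorithm 1 on a state of 4*i 32-bit words; returns a new list."""
    s = list(state)
    for k in range(i):                         # (i) per-lane update
        s[4 * k + 3] ^= r                      # (a)
        s[4 * k] ^= s[4 * k + 1]               # (b)
        s[4 * k + 2] ^= s[4 * k + 3]           # (c)
        s[4 * k] ^= g(s[4 * k + 2])            # (d)
        s[4 * k + 2] ^= rotl(s[4 * k], ROT[k]) # (e)
    temp = s[4 * i - 1]                        # (ii)
    for k in range(4 * i - 1, 0, -1):          # (iii) shift words up by one position
        s[k] = s[k - 1]
    s[0] = temp                                # (iv)
    return s

def perm(state, g, i=1, steps=48):
    """Perm as 48 consecutive STEPs; using the step index as r is an assumption."""
    for r in range(steps):
        state = step(state, r, g, i)
    return state
```

For $i = 1$, tracing the five assignments and the final word rotation gives exactly the four expressions in (7), so the closed form and the algorithm can be cross-checked against each other.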
SHAT-(128 · i)(M)
Inputs: n padded message blocks M = (M_0, M_1, ..., M_{n−1})
Outputs: (128 · i)-bit hash value (H_0, H_1, H_2, H_3)
(1) S = (S_0, ..., S_{4i−1}) = (0, 0, ..., 0, 128 · i)    // initialization
(2) Perm(S)
(3) For j = 0 to n − 1    // absorbing phase
    (i) For k = 0 to i − 1
        S_{4k+3} = S_{4k+3} ⊕ M_{j,k}
    (ii) Perm(S)
(4) H_0 = SQUEEZE(S, i)    // squeezing phase
(5) For k = 1 to 3
    (i) Perm(S)
    (ii) H_k = SQUEEZE(S, i)

Algorithm 2: SHAT-(128 · i).
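The control flow of Algorithm 2 can be sketched independently of the permutation itself. In the illustration below, `perm` is injected as a parameter so that any model of Perm can be plugged in, each message block is given as a list of $i$ 32-bit words $M_{j,k}$, and the function names `shat` and `squeeze` are hypothetical. The usage example deliberately uses an identity function as a placeholder for Perm, purely to make the data flow visible; it has no security value.

```python
def squeeze(state, i):
    """Equation (4): read S3 (and S7, S11 for larger variants) from the state."""
    return [state[4 * k + 3] for k in range(i)]

def shat(blocks, perm, i=1):
    """Algorithm 2 control flow; blocks is a list of n blocks of i 32-bit words each."""
    state = [0] * (4 * i)
    state[4 * i - 1] = 128 * i                 # (1) initialization
    state = perm(state)                        # (2)
    for block in blocks:                       # (3) absorbing phase
        for k in range(i):
            state[4 * k + 3] ^= block[k]       # XOR message word into the bitrate word
        state = perm(state)
    digest = [squeeze(state, i)]               # (4) H_0
    for _ in range(3):                         # (5) H_1 .. H_3
        state = perm(state)
        digest.append(squeeze(state, i))
    return digest

# Placeholder permutation (identity) used only to trace the data flow.
identity = lambda s: list(s)
trace = shat([[1], [2]], identity)             # S3 goes 128 -> 129 -> 131
```

With a real Perm model substituted for `identity`, the four squeezed values form the $(128 \cdot i)$-bit digest $(H_0, H_1, H_2, H_3)$.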
Figure 5: A typical SHAT core. (Diagram labels: padding unit, RAM, SHAT, control unit, message digest extraction, input data, padded data, message digest, n × 32 bits, 32-bit wide registers, 4 × 32 bits, 128 bits.)
Here $S_i$ ($i = 0, 1, 2, 3$) is the input of the current round and $S'_i$ ($i = 0, 1, 2, 3$) is the output of this round (or the input of the next round). In order to distribute the 48 operations equally over the rounds, the possible unfolding factors are the divisors of 48, that is, 1, 2, 3, 4, 6, 8, 12, 16, 24, and 48. For example, if we unfold two STEP operations in each round, we get 24 rounds in one permutation process.

Figure 6: Typical architecture of one STEP round. (Diagram labels: S0, S1, S2, S3, r_i, g, ROT.)

The expression of throughput is given as
$$\text{Throughput} = \frac{(\#\text{ of bits}) \cdot f_{\text{round}}}{\#\text{ of rounds}} \tag{8}$$
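Equation (8) is easy to check numerically. The sketch below assumes a 32-bit block for SHAT-128 (as in Table 3) and a fixed clock frequency for every unfolding factor; the function name is hypothetical. At a fixed clock the ratio between the fully unrolled and the fully rolled design is exactly 48 before any rounding of tabulated values.

```python
def throughput(bits_per_block: int, f_round_hz: float, rounds: int) -> float:
    """Equation (8): bits processed per second for a given round frequency."""
    return bits_per_block * f_round_hz / rounds

rolled_100k = throughput(32, 100e3, 48)      # about 66.67 kbps
rolled_10m = throughput(32, 10e6, 48)        # about 6.67 Mbps
two_steps_10m = throughput(32, 10e6, 24)     # about 13.33 Mbps
unrolled_10m = throughput(32, 10e6, 1)       # 320 Mbps
```

In practice the unfolded round has a longer critical path, so the maximum usable `f_round_hz` drops as the number of rounds shrinks; (8) captures only the cycle-count side of that trade-off.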
Considering (7), although this unfolding transformation reduces the maximum operating frequency, the throughput is increased significantly because the number of operations is reduced from 48 to 24. The mathematical expression of one iteration round is replaced by
$$
\begin{aligned}
\mathrm{temp}_3 &= \mathrm{ROT}(\mathrm{temp}_1) \oplus (\mathrm{temp}_0 \oplus S_2), \\
\mathrm{temp}_2 &= S_1, \\
\mathrm{temp}_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1), \\
\mathrm{temp}_0 &= S_3 \oplus r, \\
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus \mathrm{temp}_2), \\
S'_2 &= \mathrm{temp}_1, \\
S'_1 &= G(\mathrm{temp}_3 \oplus r \oplus \mathrm{temp}_2) \oplus (\mathrm{temp}_0 \oplus \mathrm{temp}_1), \\
S'_0 &= \mathrm{temp}_3 \oplus r.
\end{aligned}
\tag{9}
$$
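The unfolded expressions in (9) should produce the same state as two applications of (7). The sketch below checks this equivalence in software for SHAT-128. It is an illustration under assumptions: the round counter values for the two merged steps are passed separately as `r_a` and `r_b`, the rotation amount is 19 ($\mathrm{rot}_0$), and `g` is an arbitrary stand-in function, since the equivalence does not depend on the particular $G$.

```python
MASK = 0xFFFFFFFF

def rotl(x: int, n: int = 19) -> int:
    """Left circular rotation of a 32-bit word (rot_0 = 19 by default)."""
    return ((x << n) | (x >> (32 - n))) & MASK

def one_step(s, r, g):
    """Equation (7): one iteration round on (S0, S1, S2, S3)."""
    s0, s1, s2, s3 = s
    n0 = s3 ^ r
    n1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    n2 = s1
    n3 = rotl(n1) ^ (n0 ^ s2)
    return (n0, n1, n2, n3)

def two_steps_unfolded(s, r_a, r_b, g):
    """Equation (9): two STEPs merged into one round via temp0..temp3."""
    s0, s1, s2, s3 = s
    t0 = s3 ^ r_a
    t1 = g(s3 ^ r_a ^ s2) ^ (s0 ^ s1)
    t2 = s1
    t3 = rotl(t1) ^ (t0 ^ s2)
    n0 = t3 ^ r_b
    n1 = g(t3 ^ r_b ^ t2) ^ (t0 ^ t1)
    n2 = t1
    n3 = rotl(n1) ^ (n0 ^ t2)
    return (n0, n1, n2, n3)

def stand_in_g(x: int) -> int:
    """Arbitrary 32-bit function used only for the equivalence check."""
    return (x * 0x9E3779B1 + 0x7F4A7C15) & MASK
```

Running both paths on the same input state and comparing the outputs is a cheap regression test to keep alongside an RTL testbench when changing the unfolding factor.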
3.3.2. Pipeline and Parallelism. Suppose we unroll two STEP operations in each round. This certainly lowers the clock frequency while increasing the throughput; however, increased area is introduced as a penalty. If some logic can be computed in parallel, and this parallelism lies on the critical path, then the delay of each round decreases and the operating frequency increases. According to (8), when the number of operations (and the number of bits) is kept constant, the throughput increases with the frequency. This method can be used in the hardware implementation of any other hash function.
For example, Figure 7 shows the architecture obtained by unfolding two STEP operations in one round with the minimum critical path delay. The critical path is composed of seven XOR gates and two $G$ functions. By unfolding two STEPs in one round, the critical path grows by three 32-bit XOR gates and one $G$ function compared with the architecture of one STEP block. The critical path is highlighted by the bold line.
In Figure 7, the cycle counter $r_{i+1}$ can first be combined with $\mathrm{temp}_2$ and then XORed with $\mathrm{temp}_3$ in the second STEP part. Compared with the first STEP part, where $r_i$ is XORed with $S_3$ and then with $S_2$, there is an additional component that computes with $\mathrm{temp}_3$ and $r_{i+1}$. Because the output must be generated, this area penalty cannot be avoided.
Thus, as the number of unfolded STEP operations increases (for example, to three, four, or five), the delay of each round increases by three 32-bit XOR gates and one $G$ function per additional STEP. Therefore, the normalized delay with unfolding factor $n$ ($n = 1, 2, 3, \ldots$) is

$$\mathrm{Delay}_n = \frac{4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) + (n - 1) \cdot \left(3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g)\right)}{n}. \tag{10}$$

Taking the limit as $n$ grows, (10) becomes

$$\lim_{n \to \infty} \mathrm{Delay}_n = 3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g). \tag{11}$$

This is the delay bound of SHAT: the delay of one SHAT operation round cannot be less than this bound.
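The convergence of (10) to the bound in (11) can be seen numerically. In the sketch below, the gate delays `d_xor` and `d_g` are hypothetical values in nanoseconds chosen only for illustration, and the function names are not from the original work. Algebraically, (10) equals the bound plus `d_xor / n`, so the normalized delay decreases monotonically towards the bound.

```python
def normalized_delay(n: int, d_xor: float, d_g: float) -> float:
    """Equation (10): per-STEP delay when n STEPs are unfolded into one round."""
    return (4 * d_xor + d_g + (n - 1) * (3 * d_xor + d_g)) / n

def delay_bound(d_xor: float, d_g: float) -> float:
    """Equation (11): limit of the normalized delay as n grows."""
    return 3 * d_xor + d_g

# Hypothetical gate delays (ns): 32-bit XOR = 0.1, G function = 0.5
delays = [normalized_delay(n, 0.1, 0.5) for n in (1, 2, 4, 48)]
bound = delay_bound(0.1, 0.5)
```

With these example values the sequence starts at 0.9 ns for $n = 1$ and approaches 0.8 ns, so most of the available gain is already captured at small unfolding factors.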
Figure 7: Proposed architecture of two STEPs round. (Diagram labels: S0, S1, S2, S3, r_i, r_{i+1}, temp0, temp1, temp2, temp3, g, ROT.)
3.4. Experimental Results. We introduce a measure of hardware efficiency in (12) [14], which is an improvement on the normal figure of merit (FOM). We assume that power is proportional to gate count, so the metric is divided by the gate equivalents (GE) a second time, instead of by power dissipation, when trading off throughput for power. Note that one gate equivalent (GE) is equal to the area of a two-input NAND gate in 45 nm CMOS technology.

$$\mathrm{FOM} = \frac{\text{Throughput}}{\mathrm{GE}^2} \tag{12}$$
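The FOM in (12) can be recomputed from throughput and area. The sketch below assumes that the SHAT-128 entry of Table 3 reads 66.67 kbps, 1605 GE, and FOM 25.880, and that the H-Present-128 entry reads 11.45 kbps, 2330 GE, and FOM 2.109. The scale factor of $10^6$ is inferred from those values and is not stated in the text, so it should be treated as an assumption.

```python
def fom(throughput_kbps: float, area_ge: float, scale: float = 1e6) -> float:
    """Equation (12): throughput divided by area squared.

    The scale factor is an assumption chosen to match the magnitude of the
    tabulated FOM values; it does not affect comparisons between designs.
    """
    return scale * throughput_kbps / area_ge ** 2

shat128 = fom(66.67, 1605)        # about 25.88
hpresent128 = fom(11.45, 2330)    # about 2.11
ratio = shat128 / hpresent128     # relative hardware efficiency
```

Squaring the area penalizes large designs more heavily than a plain throughput-per-area metric, which is why compact designs rank well under this measure.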
Table 3 shows the hardware implementation results of some 128-bit hash functions using a 100 kHz clock frequency and 45 nm CMOS technology. Firstly, the throughput of SHAT-128 (66.67 kbps) is less than that of 5 other hash algorithms, namely MD4 (112.28 kbps), MD5 (83.66 kbps), H-Present-128-32-round (200 kbps), and ARMADILLO2-B (250 kbps and 1000 kbps). However, the area of SHAT-128 is only 28.42% of that of these hash functions on average, which makes the hardware efficiency of SHAT-128 13.12 times higher on average. Secondly, the area of SHAT-128 (1605 GE) is larger than that of 3 hash algorithms, namely U-QUARK-544-round (1379 GE), PHOTON-128-996-round (1122 GE), and SPONGENT-128-8-bit-2380-round (1060 GE); however, the throughput of SHAT-128 is 94.27 times higher, so the FOM of SHAT-128 is 46.75 times higher on average. Finally, the area of SHAT-128 (1605 GE) is less than that of 4 other hash algorithms, namely H-Present-128-559-round (2330 GE), U-QUARK-68-round (2392 GE), PHOTON-128-156-round (1708 GE), and SPONGENT-128-70-round (1687 GE), and the throughput of SHAT-128 is also 5.95 times higher than that of these 4 hash algorithms on average. As a result, the FOM of SHAT-128 is 9.66 times higher on average.
In Table 4, firstly, the throughput of SHAT-256 is 51.05% of that of Grostl; however, the area of SHAT-256 is only 21.84% of that of Grostl, which results in 84.47 times higher hardware efficiency for SHAT-256. Secondly, the throughput of SHAT-256 (3193 GE) is 412.91 times higher on average than that of 2 hash algorithms, PHOTON-256-156-round (2177 GE) and SPONGENT-256-9520-round (1950 GE); although the area of SHAT-256 is larger, the FOM of SHAT-256 is still 158.25 times higher than that of these 2 hash algorithms. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256-156-round, and SPONGENT-256-140-round, the throughput of SHAT-256 is 4.65 times higher on average and the area of SHAT-256 is only 49.15% of that of these hash algorithms on average. Therefore, the FOM of SHAT-256 is 119.14 times higher on average.

Table 3: Hardware implementation results of some 128-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-128 | 32 | 48 | 66.67 | 1605 | 25.880 |
| H-Present-128 [15] | 128 | 559 | 11.45 | 2330 | 2.109 |
| | 128 | 32 | 200 | 4256 | 11.041 |
| MD4 [15] | 512 | 456 | 112.28 | 7350 | 2.078 |
| MD5 [15] | 512 | 612 | 83.66 | 8400 | 1.186 |
| ARMADILLO2-B [15] | 64 | 256 | 250 | 4353 | 1.319 |
| | 64 | 64 | 1000 | 6025 | 2.755 |
| U-QUARK [15] | 8 | 544 | 1.47 | 1379 | 0.773 |
| | 8 | 68 | 11.76 | 2392 | 2.056 |
| PHOTON-128 [15] | 16 | 996 | 1.61 | 1122 | 1.278 |
| | 16 | 156 | 10.26 | 1708 | 3.515 |
| SPONGENT-128 [15] | 8 | 2380 | 0.34 | 1060 | 0.299 |
| | 16 | 70 | 11.43 | 1687 | 4.016 |

Table 4: Hardware implementation results of some 256-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-256 | 64 | 48 | 133.33 | 3193 | 13.078 |
| SHA-256 [14] | 512 | 490 | 104.48 | 8588 | 1.417 |
| ARMADILLO2-E [14] | 128 | 512 | 25 | 8653 | 0.334 |
| | 128 | 128 | 100 | 11914 | 0.705 |
| BLAKE [14] | 32 | 816 | 72.79 | 13575 | 0.021 |
| Grostl [14] | 64 | 196 | 261.14 | 14622 | 0.153 |
| PHOTON-256 [14] | 32 | 156 | 3.21 | 2177 | 0.678 |
| | 32 | 156 | 20.51 | 4362 | 1.017 |
| SPONGENT-256 [14] | 16 | 9520 | 0.17 | 1950 | 0.044 |
| | 16 | 140 | 11.43 | 3281 | 1.062 |

Table 5: Hardware implementation results of some 384-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-384 | 96 | 48 | 200 | 4753 | 8.853 |
| SHA-384 [14] | 1024 | 84 | 1219.04 | 43330 | 0.649 |

Table 6: Performance results of hash function using pipeline and parallelism.

| Number of iteration rounds | Area (GE) | Area increase (%) | Delay (ns) | Delay reduction (%) | Power (μW) | Power increase (%) |
|---|---|---|---|---|---|---|
| 48 | 965 | 0.00 | 0.94 | 0.00 | 27.27 | 0.00 |
| 24 | 2010 | 4.15 | 1.87 | 2.60 | 79.05 | 0.91 |
| 16 | 3055 | 5.53 | 2.81 | 3.44 | 136.42 | 3.52 |
| 12 | 4100 | 6.22 | 3.74 | 4.35 | 193.71 | 4.67 |
| 8 | 6190 | 6.91 | 5.61 | 4.92 | 308.48 | 5.73 |
| 6 | 8280 | 7.25 | 7.47 | 5.32 | 423.18 | 6.21 |
| 4 | 12460 | 7.60 | 11.20 | 5.64 | 652.62 | 6.68 |
| 3 | 16640 | 7.77 | 14.93 | 5.74 | 882.00 | 6.90 |
| 2 | 25000 | 7.94 | 22.40 | 5.88 | 1340.80 | 7.12 |
| 1 | 50080 | 8.12 | 44.70 | 6.31 | 2695.40 | 7.40 |

Table 7: Performance results of unrolling steps constructions.

| Number of iteration rounds | Area (GE) | Delay (ns) | Power (μW) | Throughput at 10 MHz (Mbps) |
|---|---|---|---|---|
| 48 | 965 | 0.94 | 27.27 | 6.67 |
| 24 | 1930 | 1.92 | 78.34 | 13.33 |
| 16 | 2895 | 2.91 | 131.78 | 20.00 |
| 12 | 3860 | 3.91 | 185.06 | 26.67 |
| 8 | 5790 | 5.90 | 291.77 | 40.00 |
| 6 | 7720 | 7.89 | 398.42 | 53.33 |
| 4 | 11580 | 11.87 | 611.78 | 80.00 |
| 3 | 15440 | 15.84 | 825.06 | 106.67 |
| 2 | 23160 | 23.80 | 1251.70 | 160.00 |
| 1 | 46320 | 47.71 | 2509.70 | 320.00 |
In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times larger, so the hardware efficiency of SHAT-384 is 13.64 times higher than that of SHA-384.
Then we implement the unfolding transformation technique with 10 different numbers of unrolled loops (1, 2, ..., 48) using 45 nm CMOS technology at 10 MHz to evaluate the performance of SHAT-128; the results are shown in Table 7. As Table 7 shows, the throughput of the Perm function reaches up to 47.97 times that of the original design, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.
Finally, we implement the pipeline and parallelism technique to reconstruct the STEP block. As shown in Table 6, compared with the original circuit the critical path delay is reduced by up to 6.31%, while power and area increase by roughly 8% at most.
4. Low Power Design for Hash Function
Low power design is a significant consideration in hardware implementation. The power consumption determines a device's lifetime, reliability, and energy cost. Thus low power techniques are now applied to nearly every application. There are many methods to reduce power consumption, such as clock gating and power gating, which address dynamic power and leakage power, respectively. Lowering the clock frequency also reduces power dissipation dramatically.
Firstly, we propose the frequency trade-off technique. Using this method, we obtain a range of frequency values for making a trade-off between low power consumption and high throughput of the hash function. Secondly, we construct a hash encryption system that includes an input data padding unit, RAM registers, the main hash computing block, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle mode and control signals of this hash encryption system, a load-enable based clock gating scheme is applied to reduce dynamic power consumption.
4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency decreases dynamic power dissipation linearly. Section 2.2 discussed the DVFS technique: by collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy, and normally the clock frequency is treated as a constant. Moreover, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. There is therefore a trade-off to consider: a high clock frequency brings high throughput, but the dramatically increased dynamic power consumption is a critical drawback; a low clock frequency minimizes dynamic power dissipation but decreases throughput as well.
However, with the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases as the number of unrolled loops increases. This means that we can decrease the clock frequency while increasing the throughput of the hash algorithm. The unrolling transformation thus achieves high performance without a high clock frequency. Thanks to this advantage, by choosing a proper clock frequency we can make a trade-off between high performance and low power consumption.
Next we explain how to obtain this range of frequency values from two performance bounds. First, we obtain two values for the rolling Perm circuit: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is constrained by the circuit design (the clock period computed from $f_1$ must not be less than the critical path delay). Then, according to (8), we obtain the throughput $T_1$ at this frequency. The two performance bounds are defined in (13), where $n$ is the number of iteration rounds in one Perm function with rolling STEPs:
$$
\begin{aligned}
P_{\max} &= P_1 \cdot n, \\
T_{\min} &= T_1.
\end{aligned}
\tag{13}
$$
This method can be stated as follows. Relative to the performance of the original folded circuit (we assume that this circuit is the one with 48 iteration rounds in one Perm function), each unfolding transformation design with a different number of unrolled STEPs (2, 3, ..., 48) has two performance bounds: the maximum dynamic power and the minimum throughput of the circuit. These two bounds determine the boundary of the proper frequency range for each unfolded circuit. That is, when we choose a specific clock frequency within this range, the total dynamic power consumption of the Perm function is not more than the defined maximum dynamic power $P_{\max}$ and its throughput is not less than the fixed minimum throughput $T_{\min}$.

Figure 8: Hash encryption system. (Diagram labels: receiver, RAM, main control, hash process, clock divider, LCD displayer, input message, output digest, digest display, n × br bits.)
This clock frequency range gives us many different choices for different circuit designs that use the unfolding transformation technique. The results of this frequency trade-off technique are shown in Table 9 in Section 4.4.
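The frequency window described above can be sketched numerically. The following is a minimal sketch, not the authors' tool: it assumes that dynamic power scales linearly with clock frequency, as (1) suggests for a fixed circuit, that the lower frequency bound is the frequency at which the unfolded design just reaches $T_{\min}$ according to (8), and that the upper bound is rounded down to two decimals so that the power stays below $P_{\max}$. The function names are hypothetical, and the 940.08 μW input is the Table 10 lower power bound for the 24-round design.

```python
import math

BITS_PER_PERM = 32     # bitrate br of SHAT-128, bits absorbed per Perm call
TOTAL_STEPS = 48       # STEPs in one Perm function
F_ROLLED_MHZ = 10.0    # clock of the rolled design (one STEP per round)
P_STEP_UW = 27.27      # dynamic power of the rolled design at 10 MHz (Table 7)


def throughput_mbps(f_mhz, rounds):
    """Equation (8): bits per Perm times clock frequency over rounds per Perm."""
    return BITS_PER_PERM * f_mhz / rounds


def frequency_window(rounds, p_at_fmin_uw):
    """Return (f_min, f_max) in MHz for a design with `rounds` rounds per Perm.

    p_at_fmin_uw is the total Perm power of that design at f_min (an input
    taken from measurement or synthesis, not computed here).
    """
    t_min = throughput_mbps(F_ROLLED_MHZ, TOTAL_STEPS)   # T_min of (13)
    p_max = P_STEP_UW * TOTAL_STEPS                      # P_max of (13)
    f_min = t_min * rounds / BITS_PER_PERM               # throughput bound
    f_max = f_min * p_max / p_at_fmin_uw                 # assumes P grows linearly with f
    return f_min, math.floor(f_max * 100) / 100          # round down to stay under P_max


if __name__ == "__main__":
    lo, hi = frequency_window(24, 940.08)
    print(f"24 rounds: {lo:.2f} MHz < f < {hi:.2f} MHz")
    print(f"throughput at the upper bound: {throughput_mbps(hi, 24):.2f} Mbps")
```

Under these assumptions the 24-round design gets a window of 5.00 MHz to 6.96 MHz, which agrees with the range listed for that design in Table 9.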
4.2. Hash Encryption System Design. The hash encryption system is divided into five main parts, as shown in Figure 8.
Firstly, the receiver and RAM section is actually our padding unit. We use serial communication to connect the PC and the hash encryption system. Thus we need a clock divider to generate a clock that is synchronous with the Baud rate of the serial link. We choose 4800 Baud as our transmission rate, which is deliberately slow in order to keep the error rate low (less than 3%). In this case one Baud represents 1 bit. Our transmission frame is one start bit "0", then an 8-bit message, and one finish bit "1". The start and finish bits are added to the transmitted message bits automatically. The sampling rate of the receiver is 16, and the FPGA board provides a 100 MHz clock. Thus the clock period used for sampling is 1302 times the period of the provided 100 MHz clock, as shown in (14). The resulting error is 0.0064%, which is less than 3%.
\[ \text{Sampling Clock Cycles} = \frac{\text{Clock Frequency}}{\text{Baud rate} \cdot \text{Sampling Rate}} = \frac{100\ \text{MHz}}{16 \times 4800\ \text{B/s}} \approx 1302 \tag{14} \]
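The arithmetic in (14) can be checked with a few lines of code. This is only an illustrative sketch; the function name and the choice to round to the nearest integer divider are assumptions, not part of the original design description.

```python
def sampling_divider(clock_hz, baud, oversampling):
    """Return the integer clock divider for the receiver and its error in percent."""
    exact = clock_hz / (baud * oversampling)   # ideal, generally non-integer ratio
    divider = round(exact)                     # a hardware counter needs an integer
    error_pct = abs(exact - divider) / exact * 100.0
    return divider, error_pct


if __name__ == "__main__":
    divider, error_pct = sampling_divider(100_000_000, 4800, 16)
    print(divider, f"{error_pct:.4f}%")
```

With a 100 MHz clock, 4800 Baud, and 16 samples per bit, the ideal ratio is about 1302.08, so an integer divider of 1302 leaves a relative error of 0.0064%.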
The liquid crystal display (LCD) limits the number of characters we can display to 32 hexadecimal characters, which matches the number of digest bits of SHAT-128. Thus our $br$ for each padded block is set to 32 bits, which consist of eight 4-bit hexadecimal numbers.

Secondly, the hash function introduced in Section 3 is designed as a sponge construction, as shown in Figure 4. After absorbing $n$ 32-bit message blocks, a 128-bit digest is squeezed out.

Finally, the main control unit manages the working order between the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.

Because we use serial communication, the speed is slow. We use 4800 Baud as our Baud rate for a low error rate; thus each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. Although the hash function used in this system performs one STEP per round, which means 48 iteration rounds for a complete Perm function, hash processing needs only roughly 6 μs. The LCD display period also costs considerable time: even though we can finish LCD initialization before we get the hash digest, we still need roughly 15 ms to display all data.
4.3. Load-Enable Based Clock Gating. In this section we introduce the load-enable based clock gating technique for the hash encryption system.

Clock gating is the most widely used low power technique at RTL. Of the four factors in (1), the toggle rate of the gate output is more reasonable to determine at RTL than the other three, namely $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system is organized as a pipeline. The finishing signal of each process can be treated as the enable signal in load-enable based clock gating, as shown in Figure 3. On the other hand, the XOR-based clock gating technique needs to specify the outputs of single-level flip-flops, which are not easily determined in our encryption system; thus load-enable based clock gating is our best option as a low power method.
As shown in Figure 10, three signal pairs realize this load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency that corresponds to the serial communication, the main control unit does not gate the clock signal of the receiver directly; instead, by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.
Figure 9 gives the three operation phases of the encryption system. In the first phase, en_div and en_lcd are asserted to logic one and en_h is asserted to logic zero; thus the receiver starts receiving input messages and padding them into RAM. In the meantime, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because serial communication takes a long time, owing to the low Baud rate and its bit-by-bit transmission, the LCD displayer initialization can be finished before the padded message is ready. Thus en_lcd can be asserted to logic zero by the main control unit when fsh_lcd switches to logic one.

Figure 9: Three phases of hash encryption system. Receiver: data receiving and padding (phase one), idle (phases two and three). Hash: idle (phase one), hash processing (phase two), idle (phase three). LCD: initialization then idle (phase one), idle (phase two), LCD displaying (phase three).

Figure 10: Control signals of hash encryption system (en_div, clk_r, fsh_r, en_h, fsh_h, en_lcd, and fsh_lcd, connecting the main control with the clock divider, receiver, hash process, and LCD displayer).
During the second phase, the padded message is ready, so fsh_r switches to logic one. Then en_div is asserted to zero, which means that the clock divider is turned off and no receiver clock is produced; thus the receiver stops working. In this phase en_h is asserted to logic one for hash encryption, which is our core function. en_lcd is still zero, waiting for the hash digest generated by hash processing.

The system enters the third phase when the fsh_h signal switches to logic one. In this phase the hash digest is ready, so both the receiver and the hash process are in idle mode, which means that en_div and en_h are both asserted to logic zero. Signal en_lcd is asserted to logic one to start LCD displaying, and it is asserted back to zero when the displaying process is finished. This is the end of the whole operation; the device is then turned off or repeats these three phases for another input message.
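The three-phase enable scheme can be modelled as a small state machine. The sketch below is a behavioural illustration only, not the authors' RTL: the `Phase` enumeration and function names are hypothetical, and it assumes that fsh_lcd marks both the end of LCD initialization in phase one and the end of displaying in phase three.

```python
from enum import Enum


class Phase(Enum):
    RECEIVE = 1   # phase one: receiving and padding, LCD initialization
    HASH = 2      # phase two: hash processing
    DISPLAY = 3   # phase three: LCD displaying
    DONE = 4      # everything idle


def enables(phase, lcd_init_done):
    """Clock-enable signals driven by the main control unit in each phase."""
    return {
        "en_div": phase is Phase.RECEIVE,   # clock divider feeds the receiver
        "en_h": phase is Phase.HASH,        # hash process clock
        # LCD is clocked while initializing in phase one and while displaying
        "en_lcd": (phase is Phase.RECEIVE and not lcd_init_done)
                  or phase is Phase.DISPLAY,
    }


def next_phase(phase, fsh_r, fsh_h, fsh_lcd):
    """Advance on the finishing signal of the currently active unit."""
    if phase is Phase.RECEIVE and fsh_r:
        return Phase.HASH
    if phase is Phase.HASH and fsh_h:
        return Phase.DISPLAY
    if phase is Phase.DISPLAY and fsh_lcd:
        return Phase.DONE
    return phase


if __name__ == "__main__":
    phase = Phase.RECEIVE
    print(phase.name, enables(phase, lcd_init_done=False))
    phase = next_phase(phase, fsh_r=True, fsh_h=False, fsh_lcd=True)
    print(phase.name, enables(phase, lcd_init_done=True))
```

In every phase at most the units that are doing useful work receive a clock, which is the property the gating scheme relies on.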
By analyzing the construction and operation of the hash encryption system, we can determine the idle time of each component. Applying load-enable based clock gating to each component then reduces the dynamic power dissipation of the system, as shown in Table 8 in Section 4.4.
4.4. Experimental Results. Using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput with the frequency trade-off method. Note that $f_i$ stands for frequency, $T_i$ stands for throughput, and $T_{i\,pct}$ is the percentage increase compared with the minimum throughput ($T_{\min}$), which is 6.67 Mbps. $P_i$ is the total dynamic power consumed in finishing a complete Perm function, and $P_{i\,pct}$ is the percentage power reduction compared with the maximum power consumption ($P_{\max}$), defined as 1308.96 μW, which is the product of 48 (the number of iteration rounds) and 27.27 μW (as shown in Table 7). Note that $i$ stands for the number of iteration rounds.

Table 8: Hardware implementation with/without load-enable based clock gating.

System type   Area (GE)   Area increase (%)   Delay (ns)   Delay increase (%)   Power (μW)   Power reduction (%)
Original      14053       n/a                 1.63         n/a                  1830.20      n/a
Clock gated   14565       3.64                1.72         5.52                 1580.36      13.65

Table 9: Area and delay performances of frequency trade-off technique.

Number of iteration rounds   Area (GE)   Delay (ns)   Frequency (MHz)
48                           965         0.94         10.00
24                           1930        1.92         5.00 < f_24 < 6.96
16                           2895        2.91         3.33 < f_16 < 6.20
12                           3860        3.91         2.50 < f_12 < 5.89
8                            5790        5.90         1.67 < f_8 < 5.60
6                            7720        7.89         1.25 < f_6 < 5.47
4                            11580       11.87        0.83 < f_4 < 5.34
3                            15440       15.84        0.63 < f_3 < 5.28
2                            23160       23.80        0.42 < f_2 < 5.22
1                            46320       47.71        0.21 < f_1 < 5.21
We then apply the load-enable based clock gating scheme to the hash encryption system, using the 100 MHz clock frequency provided on the FPGA board and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%. However, a 3.64% increase in area and a 5.52% increase in critical path delay are sacrificed.
Table 10: Dynamic power consumption of frequency trade-off technique.

Number of iteration rounds   Power (μW)                   Reduction (%)               Frequency (MHz)
48                           1308.96                      n/a                         10.00
24                           940.08 < P_24 < 1308.48      28.18 > P_24pct > 0.04      5.00 < f_24 < 6.96
16                           702.88 < P_16 < 1307.20      46.30 > P_16pct > 0.13      3.33 < f_16 < 6.20
12                           555.12 < P_12 < 1308.00      57.59 > P_12pct > 0.07      2.50 < f_12 < 5.89
8                            389.04 < P_8 < 1307.12       70.28 > P_8pct > 0.14       1.67 < f_8 < 5.60
6                            298.80 < P_6 < 1307.64       77.17 > P_6pct > 0.10       1.25 < f_6 < 5.47
4                            203.92 < P_4 < 1306.76       84.42 > P_4pct > 0.17       0.83 < f_4 < 5.34
3                            154.71 < P_3 < 1306.89       88.18 > P_3pct > 0.16       0.63 < f_3 < 5.28
2                            104.30 < P_2 < 1306.76       92.03 > P_2pct > 0.17       0.42 < f_2 < 5.22
1                            52.29 < P_1 < 1307.60        96.01 > P_1pct > 0.10       0.21 < f_1 < 5.21
Table 11: Throughput performances of frequency trade-off technique.

Number of iteration rounds   Throughput (Mbps)         Improvement (%)              Frequency (MHz)
48                           6.67                      n/a                          10.00
24                           6.67 < T_24 < 9.28        0.00 < T_24pct < 39.13       5.00 < f_24 < 6.96
16                           6.67 < T_16 < 12.40       0.00 < T_16pct < 85.91       3.33 < f_16 < 6.20
12                           6.67 < T_12 < 15.71       0.00 < T_12pct < 135.53      2.50 < f_12 < 5.89
8                            6.67 < T_8 < 22.40        0.00 < T_8pct < 235.83       1.67 < f_8 < 5.60
6                            6.67 < T_6 < 29.17        0.00 < T_6pct < 337.33       1.25 < f_6 < 5.47
4                            6.67 < T_4 < 42.72        0.00 < T_4pct < 540.48       0.83 < f_4 < 5.34
3                            6.67 < T_3 < 56.32        0.00 < T_3pct < 744.38       0.63 < f_3 < 5.28
2                            6.67 < T_2 < 83.52        0.00 < T_2pct < 1152.17      0.42 < f_2 < 5.22
1                            6.67 < T_1 < 166.72       0.00 < T_1pct < 2399.55      0.21 < f_1 < 5.21
5. Conclusion

In order to achieve a high performance and low power hardware implementation of a cryptographic hash function that uses sponge construction, we proceed in four steps. Firstly, we use the unfolding transformation technique to improve the throughput of the hash function. Secondly, pipeline and parallelism design techniques are implemented to reduce the critical path delay by modifying the structure of the permutation function. Thirdly, a frequency trade-off technique is proposed to calculate a frequency range that can be used to trade off low dynamic power consumption against high throughput of the hash function. Finally, a load-enable based clock gating scheme is applied in the hash encryption system to eliminate wasted signal toggling in idle mode.

The experimental results show that the unfolding transformation technique can achieve up to 47.97 times higher throughput, the pipeline and parallelism methods give a 63.1% delay reduction, the load-enable based clock gating scheme decreases dynamic power consumption by 13.65%, and the frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve low power consumption and high throughput.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).
References
[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1–4, Hong Kong, December 2008.
[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.
[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.
[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165–180, Amsterdam, The Netherlands, 2002.
[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82–92, 2005.
[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.
[7] "Sponge function," Wikipedia, http://en.wikipedia.org/wiki/Sponge_function.
[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.
[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.
[10] Y. Zhang, Q. Tong, L. Li, et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceedings of the International SoC Design Conference (ISOCC '12), pp. 21–24, Jeju Island, Republic of Korea, 2012.
[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia - a 128-bit block cipher," Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.
[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354–359, Steamboat Springs, Colo, USA, September 2006.
[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086–4089, Kobe, Japan, May 2005.
[14] S. Badel, N. Dagtekin, J. Nakahara Jr., et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398–412, 2010.
[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.
Figure 1: Sponge construction [6] (the message $M$ is padded and absorbed block by block through $f$ into a state of bitrate $br$ and capacity $c$, both initialized to 0; the output $Z$ is produced in the squeezing phase).
The rest of this paper is organized as follows. Sponge construction and the low power methods used in this paper are introduced in Section 2. In Section 3, we analyze the hash function designed by sponge construction and its original hardware implementation, and then present the unfolding transformation and the pipelining and parallelism design techniques used to improve the throughput and delay of the hash function. In Section 4, we construct the hash encryption system and introduce two low power techniques: the frequency trade-off technique and the load-enable based clock gating scheme. The paper is concluded in Section 5.
2. Background of the Research

In this section, sponge construction is explained first. Next, we introduce the two dynamic power reduction methods used in this paper.

2.1. Sponge Construction. The idea of sponge construction came from the design of RadioGatun, and its final definition was given at the Ecrypt Hash Workshop in Barcelona [6]. As shown in Figure 1, sponge construction takes an arbitrary-length input, uses a finite internal state, and gives an output of any desired length.

There are three components in sponge construction [7]:

(i) a state memory,
(ii) a function of fixed length that permutes or transforms the state memory,
(iii) a padding function.
The state memory in Figure 1 is divided into two parts: the top section, called the bitrate, of $br$ bits, and the bottom section, called the capacity, of $c$ bits. The input message ($M$ in Figure 1) is padded to a whole multiple of the bitrate, so the padded input message can be broken into $br$-bit blocks.

Sponge construction consists of two processes, absorbing and squeezing. In the absorbing process (left of the dashed line in Figure 1), firstly, the input message is padded and the state memory is initialized; secondly, the first $br$-bit block of padded input is XORed with the initial $br$ bits of the state memory; thirdly, the fixed-length function (block $f$ in Figure 1) updates the state memory. Steps two and three are repeated until all the padded $br$-bit blocks are used up. In the squeezing process (right of the dashed line), firstly, the $br$ bits of the latest state memory form the first $br$-bit output; secondly, if more output bits are needed, the fixed-length function updates the state memory and the $br$ bits of the new state memory form the second $br$-bit output. This process is repeated until the desired number of output bits ($Z$ in Figure 1) is produced.

The extent to which the $c$-bit part is altered by the input message depends on the fixed-length function [7]. The security of the hash function, for example, its resistance to collision or preimage attacks, relies on this $c$-bit part. Because of its arbitrarily long input and output sizes, the sponge construction allows building various primitives such as hash functions. The Keccak hash function, known as the new SHA-3, uses this sponge construction.
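The absorbing and squeezing processes can be illustrated with a short generic sketch. It is not SHAT or Keccak: the state is modelled as a list of 32-bit words, the bitrate part is assumed to be the first `br_words` words, and `toy_f` is a deliberately simple, non-cryptographic stand-in for the fixed-length function $f$. All names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def toy_f(state):
    """Toy stand-in for the fixed-length function f. NOT cryptographically secure."""
    rotated = state[1:] + state[:1]
    return [(w * 0x9E3779B1 + k + 1) & MASK32 for k, w in enumerate(rotated)]


def sponge(blocks, f, br_words, c_words, out_blocks):
    """Absorb padded br-sized blocks, then squeeze out `out_blocks` br-sized blocks."""
    state = [0] * (br_words + c_words)        # initialized state memory
    for block in blocks:                      # absorbing
        for j in range(br_words):
            state[j] ^= block[j]              # XOR the block into the bitrate part
        state = f(state)                      # update the whole state
    output = []
    for n in range(out_blocks):               # squeezing
        if n > 0:
            state = f(state)                  # update only when more output is needed
        output.append(state[:br_words])
    return output


if __name__ == "__main__":
    digest = sponge([[0x12345678], [0x9ABCDEF0]], toy_f, br_words=1, c_words=3, out_blocks=4)
    print([f"{w[0]:08x}" for w in digest])
```

Note that the capacity words are never XORed with message data and never output directly; they change only through $f$, which is why the security argument rests on them.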
2.2. Dynamic Power Reduction Methods. Digital circuits consume dynamic power in the active mode. There are two sources of dynamic power consumption [8]:

(i) charging and discharging of the output capacitance,
(ii) short-circuit current when the PMOS and NMOS networks are both ON.

Because the short-circuit power is usually less than 10% of the total dynamic power [9], the dynamic power consumption that we try to reduce is taken to be the switching power for the rest of this paper. Dynamic power is given by (1), where $C_L$ is the load capacitance, $V_{DD}$ is the supply voltage, $f$ is the clock frequency, and TR is the toggle rate of the gate output.

\[ P_{\text{dynamic}} = \frac{1}{2} C_L V_{DD}^{2} f \cdot \mathrm{TR} \tag{1} \]
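As a quick numerical illustration of (1), the sketch below evaluates the switching power for made-up parameter values (1 pF load, 1.0 V supply, 10 MHz clock, toggle rate 0.5); these numbers are illustrative assumptions and are not taken from the experiments in this paper.

```python
def dynamic_power(c_load_farad, v_dd_volt, f_hz, toggle_rate):
    """Switching power in watts according to equation (1)."""
    return 0.5 * c_load_farad * v_dd_volt ** 2 * f_hz * toggle_rate


if __name__ == "__main__":
    base = dynamic_power(1e-12, 1.0, 10e6, 0.5)
    half_clock = dynamic_power(1e-12, 1.0, 5e6, 0.5)     # frequency scaling
    gated = dynamic_power(1e-12, 1.0, 10e6, 0.25)        # fewer toggles via clock gating
    print(base, half_clock, gated)
```

Halving either the clock frequency or the toggle rate halves the switching power, which is why the two low power techniques in this paper target exactly these two factors.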
Since power optimization at RTL has a significant impact with reasonable accuracy, RTL is considered the optimal stage for low power techniques [8]. According to (1), four parameters determine the dynamic power consumption: supply voltage, clock frequency, load capacitance, and the toggle rate of the gate output. Because reducing the supply voltage increases the critical path delay, and changing the gate output capacitance requires redesigning the load logic, it is more efficient to focus on clock frequency and toggle rate at RTL.

2.2.1. Dynamic Voltage/Frequency Scaling. Figure 2 shows a basic dynamic voltage/frequency scaling (DVFS) system. By collecting information about the workload and the temperature, the DVFS controller determines the clock frequency that is sufficient to finish the work and gives the best performance without overheating. This variable clock frequency scheme then leads to dynamic power reduction by choosing a proper clock frequency.
2.2.2. Load-Enable Based Clock Gating. Combinational clock gating is widely used to address dynamic power in single-level registers, while sequential clock gating considers multiple-level (pipeline) registers. In this research we focus on the combinational clock gating technique; in particular, we use the load-enable based clock gating scheme [10].

Figure 2: DVFS system [9] (the DVFS controller receives workload and temperature information and issues voltage control to a switching voltage regulator, which converts Vin to VDD, and frequency control to the core logic).

Figure 3: Load-enable based clock gating (signals: en, clk, gated clock gclk, and flip-flops with input D[N-1:0] and output Q[N-1:0]).
Figure 3 shows the normal structure of the load-enable based clock gating scheme. If the data do not change during some consecutive clock periods, or the enable signal is kept low, those clock periods are wasted. This technique can be applied to a circuit with a multiplexer in which the enable signal is the selection signal, or to a pipelined circuit such as the hash encryption system in this research.
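The effect of gating a load-enable register can be illustrated with a simple cycle-level model. This is a behavioural sketch with hypothetical function names, not a description of the gating cell itself: it only counts how many clock edges reach the flip-flops and checks that the stored value is unchanged by gating.

```python
def run_register(data_trace, enable_trace):
    """Load-enable register: Q follows D only in cycles where enable is high."""
    q, history = 0, []
    for d, en in zip(data_trace, enable_trace):
        if en:
            q = d
        history.append(q)
    return history


def clock_edges_seen(enable_trace, gated):
    """Number of rising clock edges delivered to the flip-flops."""
    if not gated:
        return len(enable_trace)                 # free-running clock
    return sum(1 for en in enable_trace if en)   # gclk pulses only when enabled


if __name__ == "__main__":
    data = [5, 6, 7, 8]
    enable = [1, 0, 0, 1]
    print(run_register(data, enable))
    print(clock_edges_seen(enable, gated=False), clock_edges_seen(enable, gated=True))
```

In this example the register contents are the same with or without gating, but the gated version delivers two clock edges instead of four, so the clock pins of the flip-flops toggle half as often.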
3. Proposed High-Speed Hashing Module in Hardware

Cryptographic hash functions provide powerful protection for data and have been utilized in the security layer of every communication protocol. However, as protocols evolve and data sizes and communication speeds increase dramatically, the low throughput of hash functions becomes a bottleneck in these digital communication systems. A promising solution is hardware implementation on reconfigurable devices, which combines high flexibility with speed and physical security.
Various techniques have been proposed to speed up or improve the throughput of hash functions, for example, unfolding transformation and pipeline and parallelism techniques. In this section, the characteristics relevant to the hardware implementation of the hash algorithm are presented. Then the high-speed hashing module is introduced based on the delay bound analysis. Finally, two techniques, unfolding transformation and pipeline and parallelism, are used to optimize the inner logic of the transformation rounds.

Table 1: The parameters of SHAT.

SHAT       Hash value   Number of steps
SHAT-128   128 bits     48
SHAT-256   256 bits     48
SHAT-384   384 bits     48
3.1. Hash Algorithm Specification. In this section we introduce a cryptographic hash algorithm with sponge construction called the sponge hash algorithm (SHAT). SHAT is a hash function generating 128-, 256-, or 384-bit hash values. According to the hash value length, SHAT is denoted by SHAT-($128 \cdot i$) ($i = 1, 2, 3$). The parameters of SHAT are shown in Table 1.
3.1.1. $G$ Function. The $G$ function of SHAT consists of an $S$-box and a diffusion layer. The $S$-box is a substitution function that satisfies the confusion property on each 4-bit word. A 32-bit input word $W$, for example, is divided into eight 4-bit words ($w_0, \ldots, w_7$), and each 4-bit word goes through the $S$-box. The $S$-box is defined as $sw_i = \mathrm{Sbox}(w_i)$ ($i = 0, \ldots, 7$) and is specified in Table 2. The diffusion layer is a permutation that satisfies the diffusion property (the same as the $P$ function of Camellia [11]). For computational efficiency, the diffusion layer should be represented using only bit-wise exclusive ORs. The branch number of the diffusion layer should be optimal against differential and linear cryptanalysis for security [11]. When all eight 4-bit outputs of the $S$-box ($sw_0, \ldots, sw_7$) are available, the diffusion layer mixes them. The diffusion layer is defined as (2).

\[
\begin{pmatrix} w'_0 \\ w'_1 \\ w'_2 \\ w'_3 \\ w'_4 \\ w'_5 \\ w'_6 \\ w'_7 \end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 1 & 1 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 & 1 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 \\
1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 0 & 1
\end{pmatrix}
\begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \\ w_4 \\ w_5 \\ w_6 \\ w_7 \end{pmatrix}
\tag{2}
\]
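A possible software model of the $G$ function is sketched below. The $S$-box values come from Table 2 and the matrix from (2); the function name, the choice of $w_0$ as the most significant nibble, and the application of the matrix to the $S$-box outputs (with XOR as addition) are assumptions, since the text does not fix the nibble ordering.

```python
# S-box of Table 2: SBOX[w] = Sbox(w)
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]

# Rows of the binary matrix in equation (2)
DIFFUSION = ["01111001", "10111100", "11010110", "11100011",
             "01111110", "10110111", "11011011", "11101101"]


def g_function(word):
    """Apply the S-box to each nibble of a 32-bit word, then the diffusion layer."""
    # Assumption: w[0] is the most significant 4-bit word.
    w = [(word >> (28 - 4 * j)) & 0xF for j in range(8)]
    sw = [SBOX[x] for x in w]                 # confusion: nibble-wise substitution
    out = 0
    for row in DIFFUSION:                     # diffusion: XOR of selected nibbles
        acc = 0
        for bit, s in zip(row, sw):
            if bit == "1":
                acc ^= s
        out = (out << 4) | acc
    return out


if __name__ == "__main__":
    print(f"{g_function(0x00000000):08x}")
```

For the all-zero input every $S$-box output is 0x1, so each output nibble is simply the parity of the corresponding matrix row, which makes this input a convenient sanity check.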
Table 2: $S$-box of the $G$ function.

w     Sbox(w)     w     Sbox(w)
0x0   0x1         0x8   0xF
0x1   0x2         0x9   0x8
0x2   0x4         0xA   0x9
0x3   0xB         0xB   0x7
0x4   0xD         0xC   0x6
0x5   0xE         0xD   0x3
0x6   0xA         0xE   0x0
0x7   0x5         0xF   0xC

3.1.2. Hash Function of SHAT. SHAT uses the hermetic sponge construction shown in Figure 4. As mentioned in Section 2, $br$ is called the bitrate and $c$ is called the capacity. The bitrate ($br$) and the capacity ($c$) of SHAT-($128 \cdot i$) ($i = 1, 2, 3$) are $32 \cdot i$ and $96 \cdot i$ bits, respectively. The internal state $S$ is divided into $4 \cdot i$ sections as $S = (S_0, \ldots, S_{4i-1})$ ($i = 1, 2, 3$).

Figure 4: Sponge construction of SHAT (initialization, absorbing of $M_0, \ldots, M_n$, and squeezing of $H_0, H_1, H_2, H_3$, with a Perm call between consecutive operations; the initial state holds 0 in the bitrate part and $128 \cdot i$ in the capacity part).

In the absorbing phase, the input message $M = (M_0, M_1, \ldots, M_{n-1})$ shown in Figure 4 is padded to a whole multiple of the bitrate ($br$). Our padding method is as follows. Let $l$ be the total length of the input message (we assume that $l$ is a whole multiple of four, that is, an integer number of hexadecimal digits). We append a 1 to the end of the message, followed by $k$ zero bits, where $k$ is the smallest nonnegative integer satisfying

\[ (l + 1 + k) \bmod (32 \cdot i) = 0 \tag{3} \]

Then we set $S_{4i-1}$ as the bitrate part that is XORed with the padded $br$-bit message block. The result then goes through the one-way compression function Perm. Perm is a permutation process that has 48 steps. Each STEP is defined in Algorithm 1. In Algorithm 1, the left circular rotations $\mathrm{rot}_k$ are $\mathrm{rot}_0 = 19$, $\mathrm{rot}_1 = 1$, and $\mathrm{rot}_2 = 14$. In the squeezing phase, SQUEEZE is defined in (4). SHAT-($128 \cdot i$) ($i = 1, 2, 3$) is specified in Algorithm 2.

\[
\mathrm{SQUEEZE}(S, i) =
\begin{cases}
S_3, & i = 1 \\
S_3, S_7, & i = 2 \\
S_3, S_7, S_{11}, & i = 3
\end{cases}
\tag{4}
\]
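The padding rule in (3) can be written down directly. The sketch below represents the message as a string of '0' and '1' characters purely for readability; this representation and the function name are assumptions for illustration.

```python
def pad_message(bits, i=1):
    """Append a single 1 bit and then k zero bits so the length is a multiple of 32*i."""
    br = 32 * i
    k = (-(len(bits) + 1)) % br    # smallest nonnegative k with (l + 1 + k) mod br == 0
    return bits + "1" + "0" * k


if __name__ == "__main__":
    padded = pad_message("1010")           # l = 4, so k = 27
    print(len(padded), padded)
    print(len(pad_message("0" * 32)))      # a full block still gains a padding block
```

A message whose length is already a multiple of the bitrate is extended by one whole block, because the appended 1 bit must always be present.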
3.2. Hardware Implementation. Following the guidelines of SHAT-($128 \cdot i$) ($i = 1, 2, 3$) shown in Algorithm 2, the architecture of SHAT is illustrated in Figure 5.

The $S$-box of the $G$ function is designed from a Karnaugh map. According to Table 2, we obtain the logic functions of the $S$-box shown in (5). We set $A_i$ ($i = 0, 1, 2, 3$) as the input bits of the $S$-box and $Q_i$ ($i = 0, 1, 2, 3$) as the output bits; an overbar denotes the complement.

\[
\begin{aligned}
Q_3 &= \bar{A}_3 \bar{A}_2 A_1 A_0 + A_3 A_2 A_1 A_0 + \bar{A}_3 A_2 \bar{A}_1 + \bar{A}_3 A_2 \bar{A}_0 + A_3 \bar{A}_2 \bar{A}_1 + A_3 \bar{A}_2 \bar{A}_0 \\
Q_2 &= A_3 \bar{A}_1 \bar{A}_0 + \bar{A}_3 A_2 \bar{A}_1 + A_3 A_1 A_0 + \bar{A}_3 A_2 A_0 + \bar{A}_3 \bar{A}_2 A_1 \bar{A}_0 \\
Q_1 &= A_3 \bar{A}_1 \bar{A}_0 + \bar{A}_2 A_1 A_0 + \bar{A}_3 \bar{A}_2 A_0 + A_2 \bar{A}_1 A_0 + \bar{A}_3 A_2 A_1 \bar{A}_0 \\
Q_0 &= \bar{A}_3 \bar{A}_1 \bar{A}_0 + \bar{A}_3 A_1 A_0 + A_3 A_2 \bar{A}_1 A_0 + A_3 \bar{A}_2 \bar{A}_0 + A_3 \bar{A}_2 A_1
\end{aligned}
\tag{5}
\]
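A sum-of-products form for the $S$-box can be checked exhaustively against Table 2. The sketch below uses one such form, derived here from the truth table in Table 2 with $A_3$ and $Q_3$ taken as the most significant bits; the placement of the complements is therefore an assumption of this sketch, and the function name is hypothetical.

```python
# S-box of Table 2: SBOX[w] = Sbox(w)
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]


def sbox_logic(x):
    """Evaluate the S-box as two-level AND/OR logic on the four input bits."""
    a3, a2, a1, a0 = (x >> 3) & 1, (x >> 2) & 1, (x >> 1) & 1, x & 1
    n3, n2, n1, n0 = 1 - a3, 1 - a2, 1 - a1, 1 - a0      # complemented inputs
    q3 = ((n3 & n2 & a1 & a0) | (a3 & a2 & a1 & a0) | (n3 & a2 & n1)
          | (n3 & a2 & n0) | (a3 & n2 & n1) | (a3 & n2 & n0))
    q2 = ((a3 & n1 & n0) | (n3 & a2 & n1) | (a3 & a1 & a0)
          | (n3 & a2 & a0) | (n3 & n2 & a1 & n0))
    q1 = ((a3 & n1 & n0) | (n2 & a1 & a0) | (n3 & n2 & a0)
          | (a2 & n1 & a0) | (n3 & a2 & a1 & n0))
    q0 = ((n3 & n1 & n0) | (n3 & a1 & a0) | (a3 & a2 & n1 & a0)
          | (a3 & n2 & n0) | (a3 & n2 & a1))
    return (q3 << 3) | (q2 << 2) | (q1 << 1) | q0


if __name__ == "__main__":
    mismatches = [x for x in range(16) if sbox_logic(x) != SBOX[x]]
    print("mismatches:", mismatches)
```

An exhaustive comparison over the 16 inputs is cheap and is a useful regression check whenever the gate-level $S$-box is rewritten or re-minimized.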
There are 48 iteration rounds in the basic architecture of the Perm function. We use the rolling loop technique to reduce the area requirement: our design is a single operation block that is reused 48 times, as shown in Figure 6. Here $r_i$ ($i$ = 1 to 47) is a counter for the number of iteration rounds from 0 to 47. The critical path is highlighted by a bold line. Since the delay of a circular shift is negligible in a hardware implementation, the critical path delay of this architecture is

\[ \mathrm{Delay}_n = 4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) \tag{6} \]
3.3. Proposed High-Speed Module. In the previous section we introduced the rolling loop technique to construct the Perm function. Although this approach is area efficient, throughput is kept low because 48 clock cycles are required to generate the result. Many architectures can be made by varying the Perm function to solve this problem. We apply the unfolding transformation technique. This high-speed module combines STEP blocks into a single round and can even take advantage of architectures with a completely round-unrolled circuit. By unfolding, the hidden concurrencies can be parallelized [12]. Also, in [13] the pipeline and parallelism technique was explained as a way to improve the unfolded construction of a hash function. This technique is related to precomputation based on analysing the inner logic and architecture of the hash function.
3.3.1. Unfolding Transformation. According to Figure 6, the mathematical expression of one iteration round is

\[
\begin{aligned}
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus S_2) \\
S'_2 &= S_1 \\
S'_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1) \\
S'_0 &= S_3 \oplus r
\end{aligned}
\tag{7}
\]
Algorithm 1: Typical one step algorithm.

Step(S)
  (i) For k = 0 to i − 1
        (a) S_{4k+3} = S_{4k+3} ⊕ r
        (b) S_{4k} = S_{4k} ⊕ S_{4k+1}
        (c) S_{4k+2} = S_{4k+2} ⊕ S_{4k+3}
        (d) S_{4k} = S_{4k} ⊕ G(S_{4k+2})
        (e) S_{4k+2} = S_{4k+2} ⊕ (S_{4k} <<< rot_k)
  (ii) Temp = S_{4i−1}
  (iii) For k = 4i − 1 to 1
        S_k = S_{k−1}
  (iv) S_0 = Temp
Algorithm 2: SHAT-(128 · i).

SHAT-(128 · i)(M)
  Inputs: n padded message blocks M = (M_0, M_1, ..., M_{n−1})
  Outputs: (128 · i)-bit hash value (H_0, H_1, H_2, H_3)
  (1) S = (S_0, ..., S_{4i−1}) = (0, 0, ..., 0, 128 · i)    (initialization)
  (2) Perm(S)
  (3) For j = 0 to n − 1    (absorbing phase)
        (i) For k = 0 to i − 1
              S_{4k+3} = S_{4k+3} ⊕ M_{j,k}
        (ii) Perm(S)
  (4) H_0 = SQUEEZE(S, i)    (squeezing phase)
  (5) For k = 1 to 3
        (i) Perm(S)
        (ii) H_k = SQUEEZE(S, i)
Figure 5: A typical SHAT core (input data enter a padding unit and RAM, which delivers $n \times 32$ bits of padded data to the SHAT block with its 32-bit wide registers under a control unit; the message digest extraction outputs $4 \times 32$ bits, that is, a 128-bit message digest).
Figure 6: Typical architecture of one STEP round.

Here $S_i$ ($i = 0, 1, 2, 3$) is the input of the current round and $S'_i$ ($i = 0, 1, 2, 3$) is the output of this round (or the input of the next round). In order to distribute the 48 operations equally over the rounds, the possible values for the unfolding factor are the divisors of 48, that is, 1, 2, 3, 4, 6, 8, 12, 16, 24, and 48. For example, if we unfold two STEP operations in each round, we get 24 rounds in one permutation process. The expression of throughput is given as

\[
\text{Throughput} = \frac{(\#\text{ of bits}) \cdot f_{\text{round}}}{\#\text{ of rounds}}
\tag{8}
\]
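As a quick numerical illustration of (8), the sketch below evaluates the throughput of a 32-bit block for each admissible number of rounds at a fixed 10 MHz clock. The function name is hypothetical, and the calculation assumes the clock can actually be held at 10 MHz for every unfolding factor.

```python
def throughput_bps(bits_per_block: int, f_round_hz: float, rounds: int) -> float:
    """Equation (8): bits processed per second for a given round frequency."""
    return bits_per_block * f_round_hz / rounds


if __name__ == "__main__":
    # 48 rounds = one STEP per round, 1 round = all 48 STEPs unfolded
    for rounds in (48, 24, 16, 12, 8, 6, 4, 3, 2, 1):
        mbps = throughput_bps(32, 10e6, rounds) / 1e6
        print(f"{rounds:2d} rounds: {mbps:7.2f} Mbps")
```

Halving the number of rounds at a fixed clock doubles the throughput, which is why unfolding pays off as long as the longer critical path still fits in the clock period.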
Considering (7), although this unfolding transformation reduces the maximum operation frequency, the throughput is increased significantly because the number of operations is reduced from 48 to 24. The mathematical expression of one iteration round is replaced by

\[
\begin{aligned}
\text{temp}_3 &= \text{ROT}(\text{temp}_1) \oplus (\text{temp}_0 \oplus S_2) \\
\text{temp}_2 &= S_1 \\
\text{temp}_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1) \\
\text{temp}_0 &= S_3 \oplus r \\
S'_3 &= \text{ROT}(S'_1) \oplus (S'_0 \oplus \text{temp}_2) \\
S'_2 &= \text{temp}_1 \\
S'_1 &= G(\text{temp}_3 \oplus r \oplus \text{temp}_2) \oplus (\text{temp}_0 \oplus \text{temp}_1) \\
S'_0 &= \text{temp}_3 \oplus r
\end{aligned}
\tag{9}
\]
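The relation between (7) and (9) can be checked with a small sketch: unfolding by a factor of two is simply the composition of two single STEPs, with the intermediate values kept as combinational signals instead of registers. The names below are hypothetical, `g` is an arbitrary stand-in for $G$, and the same $r$ is used in both halves exactly as (9) is written.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def one_step(s, r, g, rot):
    """Single STEP round, written as in (7)."""
    s0, s1, s2, s3 = s
    n0 = s3 ^ r
    n1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    n2 = s1
    n3 = rotl32(n1, rot) ^ (n0 ^ s2)
    return (n0, n1, n2, n3)


def two_steps_unfolded(s, r, g, rot):
    """Two STEPs merged into one round, written as in (9)."""
    s0, s1, s2, s3 = s
    t0 = s3 ^ r
    t1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    t2 = s1
    t3 = rotl32(t1, rot) ^ (t0 ^ s2)
    n0 = t3 ^ r
    n1 = g(t3 ^ r ^ t2) ^ (t0 ^ t1)
    n2 = t1
    n3 = rotl32(n1, rot) ^ (n0 ^ t2)
    return (n0, n1, n2, n3)
```

Both functions compute the same state, so the unfolded round preserves the hash output while halving the number of clocked rounds.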
3.3.2. Pipeline and Parallelism. Suppose we unroll two STEP operations in each round. This certainly reduces the maximum frequency while increasing the throughput; however, increased area is introduced as a penalty. If some logic can be computed in parallel, and this parallelism lies on the critical path, then the delay of each round decreases, so the frequency of each operation can be increased. According to (8), when the number of operations is kept constant (the number of bits is also kept constant), the throughput increases with the frequency. This method could be used in any other hardware implementation of a hash function.
For example, Figure 7 shows the architecture of unfolding two STEP operations in one round that has the minimum critical path delay. The critical path is composed of seven XOR gates and two $G$ functions. By unfolding two STEPs in one round, the critical path grows by three 32-bit XOR gates and one $G$ function compared with the architecture of one STEP block. The critical path is highlighted by a bold line.
In Figure 7, the cycle counter $r_{i+1}$ can be combined with $\text{temp}_2$ first and then XORed with $\text{temp}_3$ in the second STEP part. Compared with the first STEP part, where $r_i$ is XORed with $S_3$ and then XORed with $S_2$, we can see that there is an additional component, which is used to combine $\text{temp}_3$ and $r_{i+1}$. Because this output must be generated, this area penalty cannot be avoided.
Thus, when we increase the number of unfolded STEP operations (for example, to three, four, or five), the delay of each round increases by three 32-bit XOR gates and one $G$ function per additional STEP. Therefore the normalized delay with unfolding factor $n$ ($n = 1, 2, 3, \ldots$) is

\[
\text{Delay}_n = \frac{4 \cdot \text{Delay}(\oplus) + \text{Delay}(g) + (n - 1) \cdot \bigl(3 \cdot \text{Delay}(\oplus) + \text{Delay}(g)\bigr)}{n}
\tag{10}
\]

Taking the limit of (10) as $n$ grows gives

\[
\lim_{n \to \infty} \text{Delay}_n = 3 \cdot \text{Delay}(\oplus) + \text{Delay}(g)
\tag{11}
\]

This is the delay bound of SHAT, which means that the delay of one SHAT operation round cannot be less than this bound.
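The behavior of (10) and its limit (11) can be illustrated numerically. In the sketch below the gate delays are arbitrary unit values chosen only for illustration (they are not measurements from this work), and the function names are hypothetical.

```python
def normalized_delay(n: int, d_xor: float, d_g: float) -> float:
    """Equation (10): per-STEP delay when n STEPs are unfolded into one round."""
    return (4 * d_xor + d_g + (n - 1) * (3 * d_xor + d_g)) / n


def delay_bound(d_xor: float, d_g: float) -> float:
    """Equation (11): limit of the normalized delay as n grows."""
    return 3 * d_xor + d_g


if __name__ == "__main__":
    d_xor, d_g = 1.0, 2.0  # illustrative unit delays (assumption)
    for n in (1, 2, 4, 12, 48):
        print(n, normalized_delay(n, d_xor, d_g))
    print("bound:", delay_bound(d_xor, d_g))
```

With these unit delays the normalized delay equals the bound plus `d_xor / n`, so each additional unfolded STEP saves less than the previous one.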
Figure 7: Proposed architecture of two STEPs round.
3.4. Experimental Results. We introduce a measurement of hardware efficiency in (12) [14]. This is an improvement on the normal figure of merit (FOM). We assume that the power is proportional to the gate count, so we can divide the metric by the gate count a second time instead of by the power dissipation when we want to trade off throughput for power. Note that one gate equivalent (GE) is equal to the area of a two-input NAND gate in 45 nm CMOS technology.

\[
\text{FOM} = \frac{\text{Throughput}}{\text{GE}^2}
\tag{12}
\]
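A minimal sketch of (12) follows. It returns the raw ratio in bits per second per GE squared; any scaling factor applied to the FOM column of the result tables is not assumed here, and the function name is hypothetical.

```python
def fom(throughput_bps: float, area_ge: float) -> float:
    """Equation (12): throughput divided by the squared gate-equivalent area."""
    if area_ge <= 0:
        raise ValueError("area must be positive")
    return throughput_bps / (area_ge ** 2)


if __name__ == "__main__":
    # SHAT-128 at 100 kHz: 66.67 kbps in 1605 GE
    print(fom(66.67e3, 1605))
    # SHAT-384 at 100 kHz: 200 kbps in 4753 GE
    print(fom(200e3, 4753))
```

Because the area enters squared, a design that is twice as large needs four times the throughput to reach the same FOM, which is what favors compact designs in this metric.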
Table 3 shows the hardware implementation results of some 128-bit hash functions using a 100 kHz clock frequency and 45 nm CMOS technology. Firstly, the throughput of SHAT-128 (66.67 kbps) is less than that of 5 other hash implementations, namely MD4 (112.28 kbps), MD5 (83.66 kbps), H-Present-128-32-round (200 kbps), and ARMADILLO2-B (250 kbps and 1000 kbps). However, the area of SHAT-128 is only 28.42% of that of these hash functions on average. As a result, the hardware efficiency of SHAT-128 is 13.12 times higher on average. Secondly, the area of SHAT-128 (1605 GE) is larger than that of 3 hash implementations, namely U-QUARK-544-round (1379 GE), PHOTON-128-996-round (1122 GE), and SPONGENT-128-8-bit-2380-round (1060 GE); however, the throughput of SHAT-128 is 94.27 times higher. Thus the FOM of SHAT-128 is 46.75 times higher on average. Finally, the area of SHAT-128 (1605 GE) is less than that of 4 other hash implementations, namely H-Present-128-559-round (2330 GE), U-QUARK-68-round (2392 GE), PHOTON-128-156-round (1708 GE), and SPONGENT-128-70-round (1687 GE), and the throughput of SHAT-128 is also 5.95 times higher than that of these 4 on average. As a result, the FOM of SHAT-128 is 9.66 times higher on average.

Table 3: Hardware implementation results of some 128-bit hash functions.

Hash function        Block size (bits)  Number of operations  Throughput at 100 kHz (kbps)  Area (GE)  FOM
SHAT-128             32                 48                    66.67                         1605       258.80
H-Present-128 [15]   128                559                   11.45                         2330       21.09
H-Present-128 [15]   128                32                    200                           4256       110.41
MD4 [15]             512                456                   112.28                        7350       20.78
MD5 [15]             512                612                   83.66                         8400       11.86
ARMADILLO2-B [15]    64                 256                   250                           4353       13.19
ARMADILLO2-B [15]    64                 64                    1000                          6025       27.55
U-QUARK [15]         8                  544                   1.47                          1379       7.73
U-QUARK [15]         8                  68                    11.76                         2392       20.56
PHOTON-128 [15]      16                 996                   1.61                          1122       12.78
PHOTON-128 [15]      16                 156                   10.26                         1708       35.15
SPONGENT-128 [15]    8                  2380                  0.34                          1060       2.99
SPONGENT-128 [15]    16                 70                    11.43                         1687       40.16

In Table 4, firstly, the throughput of SHAT-256 is 51.05% of that of Grøstl; however, the area of SHAT-256 is only 21.84% of that of Grøstl, which results in 84.47 times higher hardware efficiency for SHAT-256. Secondly, the throughput of SHAT-256 (3193 GE) is 412.91 times higher on average than that of 2 hash implementations, namely PHOTON-256-156-round (2177 GE) and SPONGENT-256-9520-round (1950 GE); although the area of SHAT-256 is larger, the FOM of SHAT-256 is still 158.25 times higher than that of these 2 implementations. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256-156-round, and SPONGENT-256-140-round, the throughput of SHAT-256 is 4.65 times higher on average and the area of SHAT-256 is only 49.15% of theirs on average. Therefore the FOM of SHAT-256 is 119.14 times higher on average.

Table 4: Hardware implementation results of some 256-bit hash functions.

Hash function        Block size (bits)  Number of operations  Throughput at 100 kHz (kbps)  Area (GE)  FOM
SHAT-256             64                 48                    133.33                        3193       130.78
SHA-256 [14]         512                490                   104.48                        8588       14.17
ARMADILLO2-E [14]    128                512                   25                            8653       3.34
ARMADILLO2-E [14]    128                128                   100                           11914      7.05
BLAKE [14]           32                 816                   72.79                         13575      0.21
Grøstl [14]          64                 196                   261.14                        14622      1.53
PHOTON-256 [14]      32                 156                   3.21                          2177       6.78
PHOTON-256 [14]      32                 156                   20.51                         4362       10.17
SPONGENT-256 [14]    16                 9520                  0.17                          1950       0.44
SPONGENT-256 [14]    16                 140                   11.43                         3281       10.62

In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times higher, so the hardware efficiency of SHAT-384 is 13.64 times higher than that of SHA-384.

Table 5: Hardware implementation results of some 384-bit hash functions.

Hash function        Block size (bits)  Number of operations  Throughput at 100 kHz (kbps)  Area (GE)  FOM
SHAT-384             96                 48                    200                           4753       88.53
SHA-384 [14]         1024               84                    1219.04                       43330      6.49

Then we implement the unfolding transformation technique with 10 different numbers of unrolled loops (1, 2, ..., 48), using 45 nm CMOS technology at 10 MHz, to evaluate the performance of SHAT-128; the results are shown in Table 7. As Table 7 shows, the throughput of the Perm function can reach up to 47.97 times that of the original design, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.

Table 7: Performance results of unrolling steps constructions.

Number of iteration rounds  Area (GE)  Delay (ns)  Power (µW)  Throughput at 10 MHz (Mbps)
48                          965        0.94        27.27       6.67
24                          1930       1.92        78.34       13.33
16                          2895       2.91        131.78      20.00
12                          3860       3.91        185.06      26.67
8                           5790       5.90        291.77      40.00
6                           7720       7.89        398.42      53.33
4                           11580      11.87       611.78      80.00
3                           15440      15.84       825.06      106.67
2                           23160      23.80       1251.70     160.00
1                           46320      47.71       2509.70     320.00

Finally, we implement the pipeline and parallelism technique to reconstruct the STEP block, as shown in Table 6. Compared with the performance of the original circuit, the critical path delay is reduced by up to 6.31%, while the power and area increase by roughly 8% at most.

Table 6: Performance results of hash function using pipeline and parallelism.

Number of iteration rounds  Area (GE)  Area increase (%)  Delay (ns)  Delay reduction (%)  Power (µW)  Power increase (%)
48                          965        0.00               0.94        0.00                 27.27       0.00
24                          2010       4.15               1.87        2.60                 79.05       0.91
16                          3055       5.53               2.81        3.44                 136.42      3.52
12                          4100       6.22               3.74        4.35                 193.71      4.67
8                           6190       6.91               5.61        4.92                 308.48      5.73
6                           8280       7.25               7.47        5.32                 423.18      6.21
4                           12460      7.60               11.20       5.64                 652.62      6.68
3                           16640      7.77               14.93       5.74                 882.00      6.90
2                           25000      7.94               22.40       5.88                 1340.80     7.12
1                           50080      8.12               44.70       6.31                 2695.40     7.40
4. Low Power Design for Hash Function

Low power design is a significant consideration in hardware implementation. The power consumption determines a device's lifetime, reliability, and energy cost. Thus low power techniques are now applied to almost every application. There are many methods to reduce power consumption, such as clock gating and power gating, which address dynamic power and leakage power, respectively. Lowering the clock frequency also reduces the power dissipation dramatically.
Firstly, we propose the frequency trade-off technique. With this method we obtain a range of frequency values for making a trade-off between low power consumption and high throughput of the hash function. Secondly, we construct a hash encryption system which includes an input data padding unit, RAM registers, the main hash computing construction, a message digest extraction component, and the main control unit. Thirdly, by analyzing the idle mode and control signals of this hash encryption system, a load-enable based clock gating scheme is applied to reduce the dynamic power consumption.
4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency is an effective method to decrease dynamic power dissipation linearly. In Section 2.2 we discussed the DVFS technique. By collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy; normally we treat the clock frequency as constant. Also, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. Therefore there is an issue we need to consider: a high clock frequency brings high throughput, but the dramatically increased dynamic power consumption is a critical drawback; a low clock frequency minimizes the dynamic power dissipation, but it decreases the throughput as well.
However, according to the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases as the number of unrolled loops increases. This means that we can decrease the clock frequency while still increasing the throughput of the hash algorithm. Thus the unrolling transformation achieves high performance without a high clock frequency. Thanks to this advantage, by choosing a proper clock frequency we can make a trade-off between high performance and low power consumption.
Next we explain how to get this scope of frequency values from the two performance bounds. First, we obtain two values for the rolled Perm circuit: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is constrained by the circuit design (the clock period computed from $f_1$ must be not less than the critical path delay). Then, according to (8), we can get the throughput $T_1$ at this frequency. The two performance bounds are thus defined in (13), where $n$ is the number of iteration rounds in one Perm function with rolled STEPs:

\[
\begin{aligned}
P_{\max} &= P_1 \cdot n \\
T_{\min} &= T_1
\end{aligned}
\tag{13}
\]
This method can be stated as follows. Referring to the performance of the original folded circuit (we assume that this circuit is the one with 48 iteration rounds in one Perm function), each unfolding transformation design with a different number of unrolled STEPs (2, 3, ..., 48) has two performance bounds: one is the maximum dynamic power and the other is the minimum throughput of the circuit. These two performance bounds determine the boundary of the proper frequency range for each unfolded circuit. This means that when we choose a specific clock frequency within this scope, the total dynamic power consumption of that Perm function will be not more than the defined maximum dynamic power $P_{\max}$, and its throughput will be not less than the fixed minimum throughput $T_{\min}$.

Figure 8: Hash encryption system (receiver, clock divider, RAM, hash process, main control, and LCD displayer).
This clock frequency scope gives us many different choices for different circuit designs that use the unfolding transformation technique. The results of this frequency trade-off technique are shown in Table 9 in Section 4.4.
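The frequency scope can be worked out with a few lines of arithmetic. The sketch below is one interpretation of the method, under two stated assumptions: dynamic power scales linearly with clock frequency, and the total power for a complete Perm function is the per-round power multiplied by the number of rounds. The function and parameter names are hypothetical; the power figures are read from Table 7 as 27.27 and 78.34 µW, and only their ratio matters for the result.

```python
def frequency_range_mhz(rounds: int, power_at_ref_uw: float, *,
                        p_max_uw: float, t_min_mbps: float,
                        ref_mhz: float = 10.0, bits: int = 32):
    """Return (f_low, f_high) in MHz for an unfolded design.

    f_low keeps the throughput bits * f / rounds at or above T_min.
    f_high keeps rounds * power(f) at or below P_max, assuming power(f)
    scales linearly from the value measured at ref_mhz.
    """
    f_low = t_min_mbps * rounds / bits
    f_high = p_max_uw * ref_mhz / (power_at_ref_uw * rounds)
    return f_low, f_high


if __name__ == "__main__":
    p_max = 48 * 27.27          # bound from the rolled design, as in (13)
    t_min = 32 * 10.0 / 48      # throughput of the rolled design at 10 MHz
    low, high = frequency_range_mhz(24, 78.34, p_max_uw=p_max, t_min_mbps=t_min)
    print(f"24 rounds: {low:.2f} MHz < f < {high:.2f} MHz")
```

For the 24-round design this gives a window of roughly 5.00 to 6.96 MHz: any clock in that window is both faster than the rolled design and no more power hungry.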
4.2. Hash Encryption System Design. The hash encryption system is divided into 5 main parts, as shown in Figure 8.
Firstly, the receiver and RAM section is our padding unit. We use serial communication to connect the PC and the hash encryption system. Thus we need a clock divider to generate a clock that is synchronous with the Baud rate of the serial communication. We choose 4800 Bauds as our transmission Baud rate, a modest speed selected for its low error rate (less than 3%). In this case one Baud represents 1 bit. Our transmission frame is one start bit "0", then an 8-bit message, and one finish bit "1". The start bit and finish bit are added to the transmitted message bits automatically. The sampling rate of the receiver is 16, and the FPGA board provides a 100 MHz clock. Thus the clock period used in sampling is 1302 times the provided 100 MHz clock period, as shown in (14). The resulting error is 0.0064%, which is less than 3%.
\[
\text{Sampling Clock Cycles} = \frac{\text{Clock Frequency}}{\text{Baud rate} \cdot \text{Sampling Rate}} = \frac{100\,\text{MHz}}{16 \times 4800\,\text{B/s}} \approx 1302
\tag{14}
\]
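The divider value in (14) and the associated rounding error can be reproduced with the short sketch below. The function name is hypothetical, and the error is taken here as the relative difference between the exact ratio and the rounded integer divider, which is an assumption about how the quoted error figure was obtained.

```python
def sampling_divider(clock_hz: float, baud: int, oversample: int):
    """Return (integer divider, relative rounding error in percent)."""
    exact = clock_hz / (baud * oversample)
    divider = round(exact)
    error_pct = abs(exact - divider) / exact * 100.0
    return divider, error_pct


if __name__ == "__main__":
    divider, error_pct = sampling_divider(100e6, 4800, 16)
    print(divider, f"{error_pct:.4f}%")
```

With a 100 MHz clock, 4800 Bauds, and 16 samples per bit, the exact ratio is about 1302.08, so rounding to 1302 introduces only a very small timing error per sample.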
The liquid crystal display (LCD) limits the number of characters we can display to 32 hexadecimal characters, which matches the number of digest bits of SHAT-128. Thus our $br$ for each padded block is set to 32 bits, which consist of eight 4-bit hexadecimal numbers.
Secondly, the hash function introduced in Section 3 is designed as a sponge construction, as shown in Figure 4. After absorbing $n$ 32-bit message blocks, a 128-bit digest is squeezed out.
Finally, the main control unit manages the working order among the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.
Because we use serial communication, the speed is slow. We use 4800 Bauds as our Baud rate for a low error rate, so each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system performs one STEP per round, which means 48 iteration rounds for a complete Perm function; even so, hash processing needs only roughly 6 µs. The LCD displaying period also costs considerable time: even though LCD initialization can be finished before the hash digest is available, we still need roughly 15 ms to display all the data.
4.3. Load-Enable Based Clock Gating. In this section we introduce the load-enable based clock gating technique for the hash encryption system.
Clock gating is the most widely used low power technique at RTL. At RTL it is more practical to control the toggle rate of the gate outputs than the other three components of dynamic power, namely $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system is a pipelined construction. The finishing signal of each process can be treated as the enable signal in load-enable based clock gating, as shown in Figure 3. On the other hand, the XOR-based clock gating technique needs the outputs of single-level flip-flops to be specified, which is not easy in our encryption system; thus load-enable based clock gating is our best option as a low power method.
As shown in Figure 10, three signal pairs realize this load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency that corresponds to the serial communication, the main control unit does not gate the clock signal of the receiver directly; instead, by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.
Figure 9: Three phases of hash encryption system.
Receiver: data receiving and padding (phase one); idle (phases two and three).
Hash: idle (phase one); hash processing (phase two); idle (phase three).
LCD: initialization, then idle (phase one); idle (phase two); LCD displaying (phase three).

Figure 10: Control signals of hash encryption system (en_div, en_h, en_lcd, fsh_r, fsh_h, fsh_lcd, clk_r).

Figure 9 gives the three operation phases of the encryption system. In the first phase, the en_div and en_lcd signals are set to logic one and en_h is set to logic zero; thus the receiver starts receiving input messages and padding them into RAM. In the meantime, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because serial communication takes a long time, owing to the low Baud rate and its bit-by-bit transmission, the LCD displayer initialization can be finished before the padded message is ready. Thus en_lcd can be set to logic zero by the main control unit when fsh_lcd switches to logic one.
During the second phase, the padded message is ready, so fsh_r switches to logic one. Then en_div is set to zero, which means that the clock divider is turned off; no divided clock is produced, and thus the receiver stops working. In this phase en_h is set to logic one for hash encryption, which is our core function, while en_lcd remains zero, waiting for the hash digest generated by hash processing.
The system enters the third phase when the fsh_h signal switches to logic one. In this phase the hash digest is ready, so both the receiver and the hash process are in idle mode, which means that en_div and en_h are both set to logic zero. The signal en_lcd is set to logic one to start LCD displaying and is set back to zero when the displaying process is finished. This is the end of the whole sequence; the device is then turned off or repeats these three phases for another input message.
By analyzing the construction and operation of the hash encryption system, we can determine the idle time of each component. Then, by applying load-enable based clock gating to each component, the dynamic power dissipation of this system can be reduced, as shown in Table 8 in Section 4.4.
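The three-phase enable logic described above can be summarized as a small combinational function. This is a simplified behavioral sketch, not the RTL of the main control unit: the underscore signal names are an assumption about the original naming, and the model assumes that fsh_lcd is cleared when the third phase starts so that the same flag can report both the end of initialization and the end of displaying.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enables:
    en_div: int  # enables the clock divider (and so the receiver)
    en_h: int    # enables the hash process
    en_lcd: int  # enables the LCD displayer


def enables(fsh_r: int, fsh_h: int, fsh_lcd: int) -> Enables:
    """Derive the clock-gating enables from the three finishing signals."""
    if not fsh_r:
        # Phase one: receive and pad; LCD initializes until it reports done.
        return Enables(en_div=1, en_h=0, en_lcd=0 if fsh_lcd else 1)
    if not fsh_h:
        # Phase two: only the hash process is clocked.
        return Enables(en_div=0, en_h=1, en_lcd=0)
    # Phase three: only the LCD is clocked, until displaying is finished.
    return Enables(en_div=0, en_h=0, en_lcd=0 if fsh_lcd else 1)
```

In every phase at most two blocks receive a clock, and in phases two and three exactly one does at most, which is the idle time that the clock gating exploits.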
4.4. Experimental Results. Using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput with the frequency trade-off method. Note that $f_i$ stands for frequency, $T_i$ stands for throughput, and $T_{i\text{pct}}$ is the percentage increase compared with the minimum throughput ($T_{\min}$), which is 6.67 Mbps. $P_i$ is the total dynamic power consumption for finishing a complete Perm function, and $P_{i\text{pct}}$ is the percentage power reduction compared with the maximum power consumption ($P_{\max}$), defined as 1308.96 µW, which is the product of 48 (the number of iteration rounds) and 27.27 µW (as shown in Table 7). Note that $i$ stands for the number of iteration rounds.
Then we apply the load-enable based clock gating scheme to the hash encryption system, using the 100 MHz clock frequency provided on the FPGA board and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%. However, this costs a 3.64% increase in area and a 5.52% increase in critical path delay.

Table 8: Hardware implementation with/without load-enable based clock gating.

System type  Area (GE)  Area increase (%)  Delay (ns)  Delay increase (%)  Power (µW)  Power reduction (%)
Original     14053      n/a                1.63        n/a                 1830.20     n/a
Clock gated  14565      3.64               1.72        5.52                1580.36     13.65

Table 9: Area and delay performances of frequency trade-off technique.

Number of iteration rounds  Area (GE)  Delay (ns)  Frequency (MHz)
48                          965        0.94        10.00
24                          1930       1.92        5.00 < f_24 < 6.96
16                          2895       2.91        3.33 < f_16 < 6.20
12                          3860       3.91        2.50 < f_12 < 5.89
8                           5790       5.90        1.67 < f_8 < 5.60
6                           7720       7.89        1.25 < f_6 < 5.47
4                           11580      11.87       0.83 < f_4 < 5.34
3                           15440      15.84       0.63 < f_3 < 5.28
2                           23160      23.80       0.42 < f_2 < 5.22
1                           46320      47.71       0.21 < f_1 < 5.21
Table 10: Dynamic power consumption of frequency trade-off technique.

Number of iteration rounds  Power (µW)                Reduction (%)              Frequency (MHz)
48                          1308.96                   n/a                        10.00
24                          940.08 < P_24 < 1308.48   28.18 > P_24pct > 0.04     5.00 < f_24 < 6.96
16                          702.88 < P_16 < 1307.20   46.30 > P_16pct > 0.13     3.33 < f_16 < 6.20
12                          555.12 < P_12 < 1308.00   57.59 > P_12pct > 0.07     2.50 < f_12 < 5.89
8                           389.04 < P_8 < 1307.12    70.28 > P_8pct > 0.14      1.67 < f_8 < 5.60
6                           298.80 < P_6 < 1307.64    77.17 > P_6pct > 0.10      1.25 < f_6 < 5.47
4                           203.92 < P_4 < 1306.76    84.42 > P_4pct > 0.17      0.83 < f_4 < 5.34
3                           154.71 < P_3 < 1306.89    88.18 > P_3pct > 0.16      0.63 < f_3 < 5.28
2                           104.30 < P_2 < 1306.76    92.03 > P_2pct > 0.17      0.42 < f_2 < 5.22
1                           52.29 < P_1 < 1307.60     96.01 > P_1pct > 0.10      0.21 < f_1 < 5.21
Table 11: Throughput performances of frequency trade-off technique.

Number of iteration rounds  Throughput (Mbps)      Improvement (%)              Frequency (MHz)
48                          6.67                   n/a                          10.00
24                          6.67 < T_24 < 9.28     0.00 < T_24pct < 39.13       5.00 < f_24 < 6.96
16                          6.67 < T_16 < 12.4     0.00 < T_16pct < 85.91       3.33 < f_16 < 6.20
12                          6.67 < T_12 < 15.71    0.00 < T_12pct < 135.53      2.50 < f_12 < 5.89
8                           6.67 < T_8 < 22.40     0.00 < T_8pct < 235.83       1.67 < f_8 < 5.60
6                           6.67 < T_6 < 29.17     0.00 < T_6pct < 337.33       1.25 < f_6 < 5.47
4                           6.67 < T_4 < 42.72     0.00 < T_4pct < 540.48       0.83 < f_4 < 5.34
3                           6.67 < T_3 < 56.32     0.00 < T_3pct < 744.38       0.63 < f_3 < 5.28
2                           6.67 < T_2 < 83.52     0.00 < T_2pct < 1152.17      0.42 < f_2 < 5.22
1                           6.67 < T_1 < 166.72    0.00 < T_1pct < 2399.55      0.21 < f_1 < 5.21
5. Conclusion

In order to achieve a high performance and low power hardware implementation of a cryptographic hash function that uses the sponge construction, firstly, we use the unfolding transformation technique to improve the throughput of the hash function; secondly, pipeline and parallelism design techniques are implemented to reduce the critical path delay by modifying the structure of the permutation function; thirdly, a frequency trade-off technique is proposed to calculate a frequency scope that can be used to make a trade-off between low dynamic power consumption and high throughput of the hash function; finally, a load-enable based clock gating scheme is applied in the hash encryption system to eliminate wasted signal toggling in the idle mode.
The experimental results show that the unfolding transformation technique can achieve up to 47.97 times higher throughput, the pipeline and parallelism methods give a 6.31% delay reduction, the load-enable based clock gating scheme decreases dynamic power consumption by 13.65%, and the frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve low power consumption and high throughput.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).
References
[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1-4, Hong Kong, December 2008.
[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.
[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.
[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165-180, Amsterdam, The Netherlands, 2002.
[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82-92, 2005.
[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.
[7] "Sponge function," Wikipedia, http://en.wikipedia.org/wiki/Sponge_function.
[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.
[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.
[10] Y. Zhang, Q. Tong, L. Li et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceedings of the International SoC Design Conference (ISOCC '12), pp. 21-24, Jeju Island, Republic of Korea, 2012.
[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia - a 128-bit block cipher," Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.
[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354-359, Steamboat Springs, Colo, USA, September 2006.
[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086-4089, Kobe, Japan, May 2005.
[14] S. Badel, N. Dagtekin, J. Nakahara Jr. et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398-412, 2010.
[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT, Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.
Figure 2: DVFS system [9].

Figure 3: Load-enable based clock gating.
gating technique; in particular, we use the load-enable based clock gating scheme [10].
Figure 3 shows a typical structure of the load-enable based clock gating scheme. If the data do not change during some consecutive clock periods, or the enable signal is kept low, those clock periods are wasted. This technique can be applied to a circuit with a multiplexer in which the enable signal is the selection signal, or to a pipelined circuit such as the hash encryption system in this research.
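The effect of load-enable based clock gating can be illustrated with a small behavioral model. This is a software sketch rather than RTL: it only counts how many clock edges actually reach the flip-flops when the clock is gated by the enable signal. The class name and its attributes are hypothetical.

```python
class GatedRegister:
    """Behavioral model of a register bank whose clock is gated by an enable."""

    def __init__(self, value: int = 0) -> None:
        self.q = value
        self.clock_edges = 0  # edges of the gated clock seen by the flip-flops

    def tick(self, d: int, enable: int) -> int:
        """Advance one cycle of the free-running clock."""
        if enable:
            # Gated clock toggles only when enable is high, so D is loaded.
            self.clock_edges += 1
            self.q = d
        # When enable is low the flip-flops see no edge and hold their value.
        return self.q


if __name__ == "__main__":
    reg = GatedRegister()
    for d, en in [(1, 1), (2, 0), (3, 0), (4, 1)]:
        reg.tick(d, en)
    print(reg.q, reg.clock_edges)
```

In this run the free-running clock ticks four times but the flip-flops are clocked only twice, which is the switching activity the gating saves during idle cycles.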
3. Proposed High-Speed Hashing Module in Hardware

Cryptographic hash functions provide powerful protection for data and have been used in the security layer of virtually every communication protocol. However, as protocols evolve, data sizes and communication speeds are increasing dramatically, and the low throughput of hash functions becomes a bottleneck in these digital communication systems. A promising solution is hardware implementation on reconfigurable devices, which combines high flexibility with speed and physical security.
Various techniques have been proposed to speed up or improve the throughput of hash functions, for example, unfolding transformation and pipeline and parallelism techniques. In this section, the characteristics that are relevant to the hardware implementation of the hash algorithm are presented. Then the high-speed hashing methodology is introduced based on the delay bound analysis. Finally, two techniques, unfolding transformation and pipeline and parallelism, are used to optimize the inner logic of the transformation rounds.

Table 1: The parameters of SHAT.

SHAT      Hash value  Number of steps
SHAT-128  128 bits    48
SHAT-256  256 bits    48
SHAT-384  384 bits    48
3.1. Hash Algorithm Specification. In this section we introduce a cryptographic hash algorithm with sponge construction called the sponge hash algorithm (SHAT). SHAT is a hash function generating 128-, 256-, or 384-bit hash values. According to the hash value length, SHAT is denoted by SHAT-$(128 \cdot i)$ ($i = 1, 2, 3$). The parameters of SHAT are shown in Table 1.
3.1.1. $G$ Function. The $G$ function of SHAT consists of an $S$-box and a diffusion layer. The $S$-box is a substitution function that satisfies the confusion property on each 4-bit word. A 32-bit input word $W$, for example, is divided into eight 4-bit words $(w_0, \ldots, w_7)$, and each 4-bit word goes through this $S$-box. The $S$-box is defined as $sw_i = \mathrm{Sbox}(w_i)$ $(i = 0, \ldots, 7)$.
This $S$-box is specified in Table 2. The diffusion layer is a permutation that satisfies the diffusion property (the same as the $P$ function of Camellia [11]). For computational efficiency, this diffusion layer is represented using only bit-wise exclusive ORs. For security, the branch number of the diffusion layer should be optimal against differential and linear cryptanalysis [11]. Once all eight 4-bit outputs of the $S$-box $(sw_0, \ldots, sw_7)$ are available, the diffusion layer mixes them. The diffusion layer is defined as

$$
\begin{pmatrix} w'_0 \\ w'_1 \\ w'_2 \\ w'_3 \\ w'_4 \\ w'_5 \\ w'_6 \\ w'_7 \end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 1 & 1 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 & 1 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 \\
1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 0 & 1
\end{pmatrix}
\begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \\ w_4 \\ w_5 \\ w_6 \\ w_7 \end{pmatrix}
\tag{2}
$$
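As an illustration of how the $S$-box of Table 2 and the diffusion layer of (2) fit together, the following Python sketch models the $G$ function in software. It is a minimal sketch and not the hardware design: the function names are hypothetical, and it assumes that $w_0$ is the most significant 4-bit word of the 32-bit input, which the text does not state.

```python
# S-box of Table 2, indexed by the 4-bit input word.
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]

# Rows of the binary matrix in (2); row j selects the words XORed into w'_j.
DIFFUSION = ["01111001", "10111100", "11010110", "11100011",
             "01111110", "10110111", "11011011", "11101101"]


def split_nibbles(word):
    """Split a 32-bit word into (w0, ..., w7); w0 is ASSUMED most significant."""
    return [(word >> (28 - 4 * j)) & 0xF for j in range(8)]


def join_nibbles(nibbles):
    """Inverse of split_nibbles."""
    word = 0
    for n in nibbles:
        word = (word << 4) | n
    return word


def diffuse(nibbles):
    """Apply the matrix of (2) using only XORs of 4-bit words."""
    out = []
    for row in DIFFUSION:
        acc = 0
        for bit, n in zip(row, nibbles):
            if bit == "1":
                acc ^= n
        out.append(acc)
    return out


def g_function(word):
    """S-box on every 4-bit word, followed by the diffusion layer."""
    substituted = [SBOX[n] for n in split_nibbles(word)]
    return join_nibbles(diffuse(substituted))
```

Under these assumptions, `g_function(0x00000000)` evaluates to `0x11110000`: every 4-bit word maps to `0x1`, the first four matrix rows XOR an odd number of words, and the last four rows XOR an even number.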
3.1.2. Hash Function of SHAT. SHAT uses the hermetic sponge construction, as shown in Figure 4. As mentioned in Section 2, $br$ is called the bitrate and $c$ is called the capacity.
Table 2: $S$-box of the $G$ function.
w | Sbox(w) | w | Sbox(w)
0x0 | 0x1 | 0x8 | 0xF
0x1 | 0x2 | 0x9 | 0x8
0x2 | 0x4 | 0xA | 0x9
0x3 | 0xB | 0xB | 0x7
0x4 | 0xD | 0xC | 0x6
0x5 | 0xE | 0xD | 0x3
0x6 | 0xA | 0xE | 0x0
0x7 | 0x5 | 0xF | 0xC
Figure 4: Sponge construction of SHAT (phases: initialization, absorbing, and squeezing; state parts $br$ and $c$; inputs $M_0, \ldots, M_n$; outputs $H_0$, $H_1$, $H_2$, $H_3$; Perm is applied between stages).
The bitrate ($br$) and the capacity ($c$) of SHAT-$(128 \cdot i)$ $(i = 1, 2, 3)$ are $32 \cdot i$ and $96 \cdot i$ bits, respectively. The internal state $S$ is divided into $4 \cdot i$ $(i = 1, 2, 3)$ sections as $S = (S_0, \ldots, S_{4i-1})$.

In the absorbing phase, the input message $M = (M_0, M_1, \ldots, M_{n-1})$ shown in Figure 4 is padded to a whole multiple of the bitrate ($br$). The padding method is as follows. Let $l$ be the total length of the input message (we assume that $l$ is a whole multiple of four, that is, an integer number of hexadecimal digits). We append a 1 to the end of the message, followed by $k$ zero bits, where $k$ is the smallest nonnegative integer that satisfies

$$ (l + 1 + k) \bmod (32 \cdot i) = 0 \tag{3} $$
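The padding rule in (3) can be sketched in a few lines of Python. This is an illustrative sketch only: the message is represented as a string of "0" and "1" characters for readability, and the function names are hypothetical.

```python
def pad_message(bits, i=1):
    """Append a 1 and then k zeros so that the length is a multiple of 32*i, as in (3)."""
    if i not in (1, 2, 3):
        raise ValueError("i must be 1, 2, or 3")
    br = 32 * i                      # bitrate in bits
    k = (-(len(bits) + 1)) % br      # smallest nonnegative k satisfying (3)
    return bits + "1" + "0" * k


def split_blocks(padded, i=1):
    """Cut the padded message into br-bit blocks M_0, ..., M_{n-1}."""
    br = 32 * i
    return [padded[p:p + br] for p in range(0, len(padded), br)]
```

For example, an 8-bit message with $i = 1$ receives one 1 bit and $k = 23$ zero bits, giving a single 32-bit block, while a 32-bit message needs $k = 31$ and therefore occupies two blocks.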
Then we set $S_{4i-1}$ as the bitrate part of the state, which is XORed with the padded $br$-bit message block. The result then goes through the one-way compression function Perm. Perm is a permutation process which has 48 steps; each STEP is defined in Algorithm 1. In Algorithm 1, the left circular rotation amounts $\mathrm{rot}_k$ are $\mathrm{rot}_0 = 19$, $\mathrm{rot}_1 = 1$, and $\mathrm{rot}_2 = 14$. In the squeezing phase, the output of SHAT is defined in (4). SHAT-$(128 \cdot i)$ $(i = 1, 2, 3)$ is specified in Algorithm 2.

$$
\mathrm{SQUEEZE}(S, i) =
\begin{cases}
S_3, & i = 1 \\
S_3, S_7, & i = 2 \\
S_3, S_7, S_{11}, & i = 3
\end{cases}
\tag{4}
$$
3.2. Hardware Implementation. Following the specification of SHAT-$(128 \cdot i)$ $(i = 1, 2, 3)$ shown in Algorithm 2, the architecture of SHAT is illustrated in Figure 5.

The $S$-box of the $G$ function is designed from a Karnaugh map. According to Table 2, we get the logic functions of the $S$-box shown in (5), where $A_i$ $(i = 0, 1, 2, 3)$ are the input bits of the $S$-box, $Q_i$ $(i = 0, 1, 2, 3)$ are the output bits, and an overbar denotes the logical complement.

$$
\begin{aligned}
Q_3 &= \bar{A}_3\bar{A}_2A_1A_0 + A_3A_2A_1A_0 + \bar{A}_3A_2\bar{A}_1 + \bar{A}_3A_2\bar{A}_0 + A_3\bar{A}_2\bar{A}_1 + A_3\bar{A}_2\bar{A}_0 \\
Q_2 &= A_3\bar{A}_1\bar{A}_0 + \bar{A}_3A_2\bar{A}_1 + A_3A_1A_0 + \bar{A}_3A_2A_0 + \bar{A}_3\bar{A}_2A_1\bar{A}_0 \\
Q_1 &= A_3\bar{A}_1\bar{A}_0 + \bar{A}_2A_1A_0 + \bar{A}_3\bar{A}_2A_0 + A_2\bar{A}_1A_0 + \bar{A}_3A_2A_1\bar{A}_0 \\
Q_0 &= \bar{A}_3\bar{A}_1\bar{A}_0 + \bar{A}_3A_1A_0 + A_3A_2\bar{A}_1A_0 + A_3\bar{A}_2\bar{A}_0 + A_3\bar{A}_2A_1
\end{aligned}
\tag{5}
$$
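A sum-of-products description of the $S$-box can be checked against Table 2 by evaluating it for all 16 inputs. The sketch below does this in Python for one set of minimized terms derived from the Table 2 truth table, with complemented literals written out explicitly. The function name is hypothetical, and the code models only the Boolean behaviour, not gate-level timing or area.

```python
# S-box of Table 2, indexed by the 4-bit input (A3 A2 A1 A0).
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]


def sbox_logic(a):
    """Evaluate the S-box as two-level AND/OR logic on the input bits."""
    a3, a2, a1, a0 = (a >> 3) & 1, (a >> 2) & 1, (a >> 1) & 1, a & 1
    n3, n2, n1, n0 = 1 - a3, 1 - a2, 1 - a1, 1 - a0  # complemented inputs
    q3 = ((n3 & n2 & a1 & a0) | (a3 & a2 & a1 & a0) | (n3 & a2 & n1)
          | (n3 & a2 & n0) | (a3 & n2 & n1) | (a3 & n2 & n0))
    q2 = ((a3 & n1 & n0) | (n3 & a2 & n1) | (a3 & a1 & a0)
          | (n3 & a2 & a0) | (n3 & n2 & a1 & n0))
    q1 = ((a3 & n1 & n0) | (n2 & a1 & a0) | (n3 & n2 & a0)
          | (a2 & n1 & a0) | (n3 & a2 & a1 & n0))
    q0 = ((n3 & n1 & n0) | (n3 & a1 & a0) | (a3 & a2 & n1 & a0)
          | (a3 & n2 & n0) | (a3 & n2 & a1))
    return (q3 << 3) | (q2 << 2) | (q1 << 1) | q0


def matches_table():
    """True when the logic equations reproduce every entry of Table 2."""
    return all(sbox_logic(a) == SBOX[a] for a in range(16))
```

Calling `matches_table()` is a quick way to confirm that a set of product terms agrees with the table before committing it to RTL.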
There are 48 iteration rounds in the basic architecture of the Perm function. We use the rolling loop technique to reduce the area requirement: our design is a single operation block which is reused 48 times, as shown in Figure 6. Here $r_i$ $(i = 0$ to $47)$ is a counter for the number of iteration rounds, running from 0 to 47. The critical path is highlighted by a bold line. Since the delay of a circular shift is negligible in a hardware implementation, the critical path delay $D_n$ of this architecture is

$$ D_n = 4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) \tag{6} $$
3.3. Proposed High-Speed Module. In the previous section, we introduced the rolling loop technique to construct the Perm function. Although this approach is area efficient, the throughput is kept low because 48 clock cycles are required to generate the result. Many architectures can be derived by varying the Perm function to solve this problem. We apply the unfolding transformation technique. The resulting high-speed module combines several STEP blocks into a single round and can even use a completely round-unrolled circuit. By unfolding, hidden concurrencies can be parallelized [12]. Also, in [13], the pipeline and parallelism technique was explained as a way to improve the unfolded construction of a hash function. This technique is related to precomputation and is based on analysing the inner logic and architecture of the hash function.
3.3.1. Unfolding Transformation. According to Figure 6, the mathematical expression of one iteration round is

$$
\begin{aligned}
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus S_2) \\
S'_2 &= S_1 \\
S'_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1) \\
S'_0 &= S_3 \oplus r
\end{aligned}
\tag{7}
$$
Step($S$)
(i) For $k = 0$ to $i - 1$:
    (a) $S_{4k+3} = S_{4k+3} \oplus r$
    (b) $S_{4k} = S_{4k} \oplus S_{4k+1}$
    (c) $S_{4k+2} = S_{4k+2} \oplus S_{4k+3}$
    (d) $S_{4k} = S_{4k} \oplus G(S_{4k+2})$
    (e) $S_{4k+2} = S_{4k+2} \oplus (S_{4k} \lll \mathrm{rot}_k)$
(ii) Temp $= S_{4i-1}$
(iii) For $k = 4i - 1$ to 1:
    $S_k = S_{k-1}$
(iv) $S_0 =$ Temp

Algorithm 1: Typical one step algorithm.
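To make the relation between Algorithm 1 and the round expression (7) concrete, the following Python sketch implements both for 32-bit words and lets them be compared. It is a sketch under stated assumptions: `toy_g` is a hypothetical stand-in for the $G$ function (any 32-bit function works for the comparison), $r$ is treated as a 32-bit value, and the function names are illustrative.

```python
MASK = 0xFFFFFFFF
ROT = (19, 1, 14)  # rot_0, rot_1, rot_2 as given in the text


def rotl32(x, n):
    """Left circular rotation of a 32-bit word."""
    return ((x << n) | (x >> (32 - n))) & MASK


def toy_g(x):
    """Hypothetical stand-in for G; NOT the S-box/diffusion function of SHAT."""
    return (x * 0x9E3779B1 + 1) & MASK


def step(state, r, g=toy_g):
    """One STEP following Algorithm 1 on a state of 4*i words."""
    s = list(state)
    i = len(s) // 4
    for k in range(i):
        s[4 * k + 3] ^= r
        s[4 * k] ^= s[4 * k + 1]
        s[4 * k + 2] ^= s[4 * k + 3]
        s[4 * k] ^= g(s[4 * k + 2])
        s[4 * k + 2] ^= rotl32(s[4 * k], ROT[k])
    temp = s[4 * i - 1]
    for k in range(4 * i - 1, 0, -1):   # shift every word one position up
        s[k] = s[k - 1]
    s[0] = temp
    return s


def round_eq7(state, r, g=toy_g):
    """The same round for i = 1, written directly from (7)."""
    s0, s1, s2, s3 = state
    n0 = s3 ^ r
    n1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    n2 = s1
    n3 = rotl32(n1, ROT[0]) ^ (n0 ^ s2)
    return [n0, n1, n2, n3]
```

For any four-word state and any value of `r`, `step(state, r)` and `round_eq7(state, r)` return the same list, which is the sense in which (7) summarizes Algorithm 1 for SHAT-128.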
SHAT-$(128 \cdot i)(M)$
Inputs: $n$ padded message blocks $M = (M_0, M_1, \ldots, M_{n-1})$
Outputs: $(128 \cdot i)$-bit hash value $(H_0, H_1, H_2, H_3)$
(1) $S = (S_0, \ldots, S_{4i-1}) = (0, 0, \ldots, 0, 128 \cdot i)$ (initialization)
(2) Perm($S$)
(3) For $j = 0$ to $n - 1$ (absorbing phase):
    (i) For $k = 0$ to $i - 1$:
        $S_{4k+3} = S_{4k+3} \oplus M_{j,k}$
    (ii) Perm($S$)
(4) $H_0 = \mathrm{SQUEEZE}(S, i)$ (squeezing phase)
(5) For $k = 1$ to 3:
    (i) Perm($S$)
    (ii) $H_k = \mathrm{SQUEEZE}(S, i)$

Algorithm 2: SHAT-$(128 \cdot i)$.
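The control flow of Algorithm 2 can be sketched independently of the permutation itself. In the Python sketch below, `perm` is passed in as a parameter, so the code shows only the initialization, absorbing, and squeezing structure. The function names are hypothetical, each message block is assumed to be a tuple of $i$ 32-bit words $(M_{j,0}, \ldots, M_{j,i-1})$, and `rotate_left` is a toy permutation used purely to trace the data flow.

```python
def squeeze(state, i):
    """SQUEEZE(S, i) from (4): the words S_3, S_7, S_11 that exist for this i."""
    return tuple(state[4 * k + 3] for k in range(i))


def shat_sponge(blocks, i, perm):
    """Sponge skeleton of Algorithm 2; perm maps a list of 4*i words to a new list."""
    state = [0] * (4 * i - 1) + [128 * i]      # (1) initialization
    state = perm(state)                         # (2)
    for block in blocks:                        # (3) absorbing phase
        for k in range(i):
            state[4 * k + 3] ^= block[k]
        state = perm(state)
    digest = [squeeze(state, i)]                # (4) H_0
    for _ in range(3):                          # (5) H_1, H_2, H_3
        state = perm(state)
        digest.append(squeeze(state, i))
    return digest


def rotate_left(state):
    """Toy permutation for tracing only; it has no cryptographic value."""
    return state[1:] + state[:1]
```

With the toy permutation, `shat_sponge([(1,)], 1, rotate_left)` returns `[(0,), (0,), (128,), (1,)]`, which makes it easy to follow how the initial constant and the absorbed word move through the state. A real instance would pass 48 applications of STEP as `perm`.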
Figure 5: A typical SHAT core (blocks: padding unit, RAM, control unit, SHAT, and message digest extraction; input data enter as $n \times 32$ bits of padded data held in 32-bit wide registers, and the $4 \times 32$-bit output forms the 128-bit message digest).
Here $S_i$ $(i = 0, 1, 2, 3)$ is the input of the current round and $S'_i$ $(i = 0, 1, 2, 3)$ is the output of this round (or the input of the next round). In order to distribute the 48 operations equally over the rounds, the possible values for the unfolding factor are the divisors of 48, that is, 1, 2, 3, 4, 6, 8, 12, 16, 24, and 48. For example, if we unfold two STEP operations in each round, we get 24 rounds in one permutation process.

Figure 6: Typical architecture of one STEP round (inputs $S_0$, $S_1$, $S_2$, $S_3$ and $r_i$; blocks ROT and $g$; outputs $S_0$, $S_1$, $S_2$, $S_3$).

The throughput is given as

$$ \mathrm{Throughput} = \frac{(\#\text{ of bits}) \cdot f_{\mathrm{round}}}{\#\text{ of rounds}} \tag{8} $$
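As a quick numerical illustration of (8), the sketch below evaluates the throughput for a given block size, round frequency, and number of rounds. The function name and the choice of SI units (bits, hertz, bits per second) are assumptions made for the example.

```python
def throughput_bps(bits_per_block, f_round_hz, rounds):
    """Throughput from (8): bits processed per block times round frequency over rounds."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    return bits_per_block * f_round_hz / rounds


# SHAT-128 absorbs 32 bits per Perm call.
rolled_100khz = throughput_bps(32, 100e3, 48)    # about 66.67 kbps
rolled_10mhz = throughput_bps(32, 10e6, 48)      # about 6.67 Mbps
unfolded_10mhz = throughput_bps(32, 10e6, 24)    # about 13.33 Mbps
```

Halving the number of rounds at a fixed clock doubles the throughput, which is the effect the unfolding transformation relies on. The achievable clock frequency itself drops as the round becomes longer.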
Considering (8), although this unfolding transformation reduces the maximum operating frequency, the throughput is increased significantly because the number of operations is reduced from 48 to 24. The mathematical expression of one iteration round is replaced by

$$
\begin{aligned}
\mathrm{temp}_3 &= \mathrm{ROT}(\mathrm{temp}_1) \oplus (\mathrm{temp}_0 \oplus S_2) \\
\mathrm{temp}_2 &= S_1 \\
\mathrm{temp}_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1) \\
\mathrm{temp}_0 &= S_3 \oplus r \\
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus \mathrm{temp}_2) \\
S'_2 &= \mathrm{temp}_1 \\
S'_1 &= G(\mathrm{temp}_3 \oplus r \oplus \mathrm{temp}_2) \oplus (\mathrm{temp}_0 \oplus \mathrm{temp}_1) \\
S'_0 &= \mathrm{temp}_3 \oplus r
\end{aligned}
\tag{9}
$$
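The two-STEP round in (9) should produce exactly what two consecutive applications of (7) produce. The Python sketch below checks this equivalence in software. It assumes that the two STEPs use separate round counter values `r0` and `r1` (corresponding to $r_i$ and $r_{i+1}$ in Figure 7), and it uses a hypothetical stand-in `toy_g` for the $G$ function, since the equivalence does not depend on what $G$ computes.

```python
MASK = 0xFFFFFFFF
ROT0 = 19  # rot_0, the only rotation used for SHAT-128 (i = 1)


def rotl32(x, n):
    """Left circular rotation of a 32-bit word."""
    return ((x << n) | (x >> (32 - n))) & MASK


def toy_g(x):
    """Hypothetical stand-in for G; NOT the real SHAT function."""
    return (x ^ (x >> 7) ^ 0xA5A5A5A5) & MASK


def round_eq7(state, r, g=toy_g):
    """One STEP, written from (7)."""
    s0, s1, s2, s3 = state
    n0 = s3 ^ r
    n1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    n3 = rotl32(n1, ROT0) ^ (n0 ^ s2)
    return [n0, n1, s1, n3]


def two_steps_eq9(state, r0, r1, g=toy_g):
    """Two STEPs merged into one round, written from (9)."""
    s0, s1, s2, s3 = state
    temp0 = s3 ^ r0
    temp1 = g(s3 ^ r0 ^ s2) ^ (s0 ^ s1)
    temp2 = s1
    temp3 = rotl32(temp1, ROT0) ^ (temp0 ^ s2)
    n0 = temp3 ^ r1
    n1 = g(temp3 ^ r1 ^ temp2) ^ (temp0 ^ temp1)
    n2 = temp1
    n3 = rotl32(n1, ROT0) ^ (n0 ^ temp2)
    return [n0, n1, n2, n3]
```

For any state, `two_steps_eq9(s, r0, r1)` equals `round_eq7(round_eq7(s, r0), r1)`. The unfolded form computes the same values in one clock cycle at the cost of duplicating the STEP logic.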
3.3.2. Pipeline and Parallelism. Suppose we unroll two STEP operations in each round. This certainly reduces the frequency while increasing the throughput; however, increased area is introduced as a penalty. If some logic can be done in parallel, and this parallelism happens in the critical path, then the delay of each round can be decreased so that the operating frequency is increased. According to (8), when the number of operations and the number of bits are kept constant, the throughput increases with the frequency. This method can be used in any other hardware implementation of a hash function.
For example, Figure 7 shows the architecture that unfolds two STEP operations in one round with the minimum critical path delay. The critical path is composed of seven XOR gates and two $G$ functions. By unfolding two STEPs in one round, the critical path grows by three 32-bit XOR gates and one $G$ function compared with the architecture of one STEP block. The critical path is highlighted by a bold line.
In Figure 7, the cycle counter $r_{i+1}$ can be combined with $\mathrm{temp}_2$ first and then XORed with $\mathrm{temp}_3$ in the second STEP part. Compared with the first STEP part, where $r_i$ is XORed with $S_3$ and then XORed with $S_2$, there is one additional component, which is used to combine $\mathrm{temp}_3$ and $r_{i+1}$. Because this output must be generated, this area penalty cannot be avoided.
Thus, when we increase the number of unfolded STEP operations, for example, to three, four, or five, each additional STEP increases the round delay by three 32-bit XOR gates and one $G$ function. Therefore, the normalized delay $D_n$ with unfolding factor $n$ $(n = 1, 2, 3, \ldots)$ is

$$ D_n = \frac{4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) + (n - 1) \cdot (3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g))}{n} \tag{10} $$

Taking the limit of (10) as $n$ grows gives

$$ \lim_{n \to \infty} D_n = 3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) \tag{11} $$

This is the delay bound of SHAT, which means that the delay of one SHAT operation round cannot be less than this bound.
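The behaviour of (10) and its limit (11) can be explored numerically. In the sketch below, the gate delays are free parameters in arbitrary time units; the values used in the example are placeholders chosen for illustration, not measured delays from the 45 nm library.

```python
def normalized_delay(n, d_xor, d_g):
    """Normalized delay per STEP for unfolding factor n, following (10)."""
    if n < 1:
        raise ValueError("unfolding factor must be at least 1")
    return (4 * d_xor + d_g + (n - 1) * (3 * d_xor + d_g)) / n


def delay_bound(d_xor, d_g):
    """Limit of the normalized delay as n grows, following (11)."""
    return 3 * d_xor + d_g


# Placeholder delays: 1 unit per XOR, 2 units per G function.
for n in (1, 2, 4, 48):
    print(n, normalized_delay(n, 1.0, 2.0), delay_bound(1.0, 2.0))
```

Rewriting (10) as $3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) + \mathrm{Delay}(\oplus)/n$ shows that the gain from further unfolding is limited to one XOR delay, spread over $n$ STEPs.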
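Figure 7: Proposed architecture of two STEPs round (inputs $S_0$, $S_1$, $S_2$, $S_3$, $r_i$, and $r_{i+1}$; intermediate values $\mathrm{temp}_0$, $\mathrm{temp}_1$, $\mathrm{temp}_2$, $\mathrm{temp}_3$; two ROT blocks and two $g$ blocks; outputs $S_0$, $S_1$, $S_2$, $S_3$).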
3.4. Experimental Results. We introduce a measurement of hardware efficiency in (12) [14]. This is an improvement of the normal figure of merit (FOM). We assume that the power is proportional to the gate count; then we can divide the metric by another GE instead of the power dissipation when we want to trade off throughput for power. Note that one gate equivalent (GE) is equal to the area of a two-input NAND gate in 45 nm CMOS technology.

$$ \mathrm{FOM} = \frac{\mathrm{Throughput}}{\mathrm{GE}^2} \tag{12} $$
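The FOM in (12) is straightforward to compute. The sketch below assumes that the tabulated FOM values are expressed in nanobits per clock cycle per GE squared, a scaling that is not stated in the text but that reproduces the SHAT-128 entry of Table 3 from its throughput and area. The function name is hypothetical.

```python
def fom_nanobits(throughput_kbps, clock_khz, area_ge):
    """FOM from (12), ASSUMED to be scaled to nanobits per clock cycle per GE^2."""
    bits_per_cycle = throughput_kbps / clock_khz
    return bits_per_cycle / area_ge ** 2 * 1e9


# SHAT-128 at 100 kHz: 32 bits every 48 cycles, 1605 GE.
shat128 = fom_nanobits(32 * 100 / 48, 100, 1605)   # roughly 258.8
```

Squaring the area in the denominator rewards compact designs more strongly than throughput per GE would, which is why a small core with moderate throughput can score well on this metric.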
Table 3 shows the hardware implementation results of some 128-bit hash functions at a 100 kHz clock frequency in 45 nm CMOS technology. Firstly, the throughput of SHAT-128 (66.67 kbps) is less than that of 5 other hash implementations, namely MD4 (112.28 kbps), MD5 (83.66 kbps), H-Present-128-32-round (200 kbps), and ARMADILLO2-B (250 kbps and 1000 kbps). However, the area of SHAT-128 is only 28.42% of theirs on average, so the hardware efficiency of SHAT-128 is 13.12 times higher on average. Secondly, the area of SHAT-128 (1605 GE) is larger than that of 3 hash implementations, namely U-QUARK-544-round (1379 GE), PHOTON-128-996-round (1122 GE), and SPONGENT-128-8-bit-2380-round (1060 GE); however, the throughput of SHAT-128 is 94.27 times higher, so the FOM of SHAT-128 is 46.75 times higher on average. Finally, the area of SHAT-128 (1605 GE) is less than that of 4 other hash implementations, namely H-Present-128-559-round (2330 GE), U-QUARK-68-round (2392 GE), PHOTON-128-156-round (1708 GE), and SPONGENT-128-70-round (1687 GE), and the throughput of SHAT-128 is also 5.95 times higher than theirs on average. As a result, the FOM of SHAT-128 is 9.66 times higher on average.
In Table 4, firstly, the throughput of SHAT-256 is 51.05% of that of Grøstl; however, the area of SHAT-256 is only 21.84% of that of Grøstl, which results in 84.47 times higher hardware efficiency for SHAT-256. Secondly, the throughput of SHAT-256 (3193 GE) is on average 412.91 times higher than that of 2 hash implementations, PHOTON-256-156-round (2177 GE) and SPONGENT-256-9520-round (1950 GE); although the area of SHAT-256 is larger, its FOM is still 158.25 times higher than theirs. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256-156-round, and SPONGENT-256-140-round, the throughput of SHAT-256 is 4.65 times higher on average and its area is only 49.15% of theirs on average. Therefore, the FOM of SHAT-256 is 119.14 times higher on average.

Table 3: Hardware implementation results of some 128-bit hash functions.
Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM
SHAT-128 | 32 | 48 | 66.67 | 1605 | 258.80
H-Present-128 [15] | 128 | 559 | 11.45 | 2330 | 21.09
H-Present-128 [15] | 128 | 32 | 200 | 4256 | 110.41
MD4 [15] | 512 | 456 | 112.28 | 7350 | 20.78
MD5 [15] | 512 | 612 | 83.66 | 8400 | 11.86
ARMADILLO2-B [15] | 64 | 256 | 250 | 4353 | 13.19
ARMADILLO2-B [15] | 64 | 64 | 1000 | 6025 | 27.55
U-QUARK [15] | 8 | 544 | 1.47 | 1379 | 7.73
U-QUARK [15] | 8 | 68 | 11.76 | 2392 | 20.56
PHOTON-128 [15] | 16 | 996 | 1.61 | 1122 | 12.78
PHOTON-128 [15] | 16 | 156 | 10.26 | 1708 | 35.15
SPONGENT-128 [15] | 8 | 2380 | 0.34 | 1060 | 2.99
SPONGENT-128 [15] | 16 | 70 | 11.43 | 1687 | 40.16

Table 4: Hardware implementation results of some 256-bit hash functions.
Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM
SHAT-256 | 64 | 48 | 133.33 | 3193 | 130.78
SHA-256 [14] | 512 | 490 | 104.48 | 8588 | 14.17
ARMADILLO2-E [14] | 128 | 512 | 25 | 8653 | 3.34
ARMADILLO2-E [14] | 128 | 128 | 100 | 11914 | 7.05
BLAKE [14] | 32 | 816 | 72.79 | 13575 | 0.21
Grøstl [14] | 64 | 196 | 261.14 | 14622 | 1.53
PHOTON-256 [14] | 32 | 156 | 3.21 | 2177 | 6.78
PHOTON-256 [14] | 32 | 156 | 20.51 | 4362 | 10.17
SPONGENT-256 [14] | 16 | 9520 | 0.17 | 1950 | 0.44
SPONGENT-256 [14] | 16 | 140 | 11.43 | 3281 | 10.62

Table 5: Hardware implementation results of some 384-bit hash functions.
Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM
SHAT-384 | 96 | 48 | 200 | 4753 | 88.53
SHA-384 [14] | 1024 | 84 | 1219.04 | 43330 | 6.49

Table 6: Performance results of hash function using pipeline and parallelism.
Number of iteration rounds | Area (GE) | Area increase (%) | Delay (ns) | Delay reduction (%) | Power (μW) | Power increase (%)
48 | 965 | 0.00 | 0.94 | 0.00 | 27.27 | 0.00
24 | 2010 | 4.15 | 1.87 | 2.60 | 79.05 | 0.91
16 | 3055 | 5.53 | 2.81 | 3.44 | 136.42 | 3.52
12 | 4100 | 6.22 | 3.74 | 4.35 | 193.71 | 4.67
8 | 6190 | 6.91 | 5.61 | 4.92 | 308.48 | 5.73
6 | 8280 | 7.25 | 7.47 | 5.32 | 423.18 | 6.21
4 | 12460 | 7.60 | 11.20 | 5.64 | 652.62 | 6.68
3 | 16640 | 7.77 | 14.93 | 5.74 | 882.00 | 6.90
2 | 25000 | 7.94 | 22.40 | 5.88 | 1340.80 | 7.12
1 | 50080 | 8.12 | 44.70 | 6.31 | 2695.40 | 7.40

Table 7: Performance results of unrolling steps constructions.
Number of iteration rounds | Area (GE) | Delay (ns) | Power (μW) | Throughput at 10 MHz (Mbps)
48 | 965 | 0.94 | 27.27 | 6.67
24 | 1930 | 1.92 | 78.34 | 13.33
16 | 2895 | 2.91 | 131.78 | 20.00
12 | 3860 | 3.91 | 185.06 | 26.67
8 | 5790 | 5.90 | 291.77 | 40.00
6 | 7720 | 7.89 | 398.42 | 53.33
4 | 11580 | 11.87 | 611.78 | 80.00
3 | 15440 | 15.84 | 825.06 | 106.67
2 | 23160 | 23.80 | 1251.70 | 160.00
1 | 46320 | 47.71 | 2509.70 | 320.00
In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times larger, so the hardware efficiency of SHAT-384 is 13.64 times higher than that of SHA-384.
Then we implement the unfolding transformation technique with 10 different numbers of unrolled loops (1, 2, ..., 48) in 45 nm CMOS technology at 10 MHz to evaluate the performance of SHAT-128; the results are shown in Table 7. As Table 7 shows, the throughput of the Perm function can be up to 47.97 times higher than that of the original design, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.
Finally, we implement the pipeline and parallelism technique to reconstruct the STEP block, as shown in Table 6. Compared with the original circuit, the critical path delay is reduced by up to 6.31%, while the power and area increase by about 8%.
4. Low Power Design for Hash Function
Low power design is a significant consideration in hardware implementation. The power consumption determines a device's lifetime, reliability, and energy cost. Thus low power techniques are now applied to nearly every application. There are many methods to reduce power consumption, such as clock gating and power gating, which address dynamic power and leakage power, respectively. Decreasing the frequency also reduces the power dissipation dramatically.
Firstly, we propose the frequency trade-off technique. With this method we obtain a range of frequency values for making a trade-off between low power consumption and high throughput of the hash function. Secondly, we construct a hash encryption system which includes an input data padding unit, RAM, registers, the main hash computing construction, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle mode and control signals of this hash encryption system, a load-enable based clock gating scheme is applied to reduce the dynamic power consumption.
4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency is an effective method to decrease dynamic power dissipation linearly. In Section 2.2, we discussed the DVFS technique. By collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy, and normally we treat the clock frequency as constant. Also, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. Therefore, there is an issue we need to consider: a high clock frequency brings high throughput, but the dramatically increased dynamic power consumption is a critical drawback; a low clock frequency minimizes the dynamic power dissipation, but it decreases the throughput as well.
However, according to the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases as the number of unrolled loops increases. This means that we can decrease the clock frequency while increasing the throughput of the hash algorithm. Thus the unfolding transformation achieves high performance without a high clock frequency. Exploiting this advantage, by choosing a proper clock frequency we can make a trade-off between high performance and low power consumption.
Next, we explain how to get this range of frequency values from the two performance bounds. First, for the rolling Perm circuit we obtain two values: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is determined by the requirements of the circuit design (the clock period computed from $f_1$ must be not less than the critical path delay). Then, according to (8), we get the throughput $T_1$ at this frequency. The two performance bounds are defined in (13), where $n$ is the number of iteration rounds in one Perm function with rolling STEPs.

$$
\begin{aligned}
P_{\max} &= P_1 \cdot n \\
T_{\min} &= T_1
\end{aligned}
\tag{13}
$$
This method can be defined as follows. Referring to the performance of the original folded circuit (we assume that this circuit is the one with 48 iteration rounds in one Perm function), each unfolding transformation design with a different number of unrolled STEPs (2, 3, ..., 48) has two performance bounds: one is the maximum dynamic power and the other is the minimum throughput of the circuit. These two performance bounds determine the boundary of the proper frequency range for each unfolded circuit. That is, when we choose a specific clock frequency within this range, the total dynamic power consumption of the Perm function will be not more than the defined maximum dynamic power $P_{\max}$, and its throughput will be not less than the fixed minimum throughput $T_{\min}$.

Figure 8: Hash encryption system (blocks: receiver, RAM, clock divider, hash process, main control, and LCD displayer; the input message is stored as $n \times br$ bits, and the output digest is sent to the digest display).
This clock frequency range gives us many different choices for different circuit designs based on the unfolding transformation technique. The results of this frequency trade-off technique are shown in Table 9 in Section 4.4.
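The frequency range described above can be sketched numerically. The Python function below is an illustration under two assumptions that the text implies but does not spell out: dynamic power scales linearly with clock frequency, as in (1), and the total power for one Perm function is the per-round power multiplied by the number of rounds. The function name and parameter names are hypothetical, and the default values are the rolled-design figures quoted for 10 MHz.

```python
def frequency_range(rounds, power_uw, f_ref_mhz=10.0,
                    base_rounds=48, base_power_uw=27.27):
    """Return (f_low, f_high) in MHz for an unfolded design.

    rounds   -- iteration rounds of the unfolded Perm function
    power_uw -- its dynamic power measured at f_ref_mhz
    """
    # Throughput must not fall below T_min of the rolled design at f_ref.
    f_low = f_ref_mhz * rounds / base_rounds
    # Total power must not exceed P_max = P_1 * n of the rolled design.
    p_max = base_power_uw * base_rounds
    f_high = f_ref_mhz * p_max / (power_uw * rounds)
    return f_low, f_high


low_24, high_24 = frequency_range(24, 78.34)   # about 5.00 and 6.96 MHz
```

With a power of 78.34 μW for the 24-round design, the sketch gives a range of roughly 5.00 MHz to 6.96 MHz. Any frequency inside the range keeps the throughput at or above $T_{\min}$ and the power at or below $P_{\max}$.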
4.2. Hash Encryption System Design. The hash encryption system is divided into 5 main parts, as shown in Figure 8.
Firstly, the receiver and RAM section is our padding unit. We use serial communication to connect a PC and the hash encryption system. Thus we need a clock divider to generate a clock that is synchronous with the Baud rate of the serial link. We choose 4800 Bauds as our transmission Baud rate, a modest speed chosen for a low error rate (less than 3%). In this case, one Baud represents 1 bit. Each transmission frame consists of one start bit "0", then an 8-bit message, and one finish bit "1". The start bit and finish bit are added to the transmitted message bits automatically. The sampling rate of the receiver is 16, and the FPGA board provides a 100 MHz clock. Thus the clock period used in sampling is 1302 times the period of the provided 100 MHz clock, as shown in (14). The resulting error is 0.0064%, which is less than 3%.

$$
\text{Sampling Clock Cycles} = \frac{\text{Clock Frequency}}{\text{Baud rate} \cdot \text{Sampling Rate}} = \frac{100\,\text{MHz}}{16 \times 4800\,\text{B/s}} \approx 1302 \tag{14}
$$
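The divider value in (14) and the quoted rounding error can be reproduced with a short calculation. The sketch below is illustrative; the function name is hypothetical, and the error is defined here as the relative difference between the exact ratio and the integer divider.

```python
def sampling_divider(clock_hz, baud, oversample):
    """Return the integer clock divider from (14) and its relative error in percent."""
    exact = clock_hz / (baud * oversample)
    divider = round(exact)
    error_pct = abs(exact - divider) / exact * 100
    return divider, error_pct


divider, error_pct = sampling_divider(100e6, 4800, 16)
# divider is 1302; error_pct is about 0.0064
```

Because the exact ratio is 1302.08, rounding to 1302 shifts the sampling instant only slightly within each bit, well inside the stated 3% tolerance.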
Because the liquid crystal display (LCD) limits the number of characters we can display to 32 hexadecimal characters, it matches the number of digest bits of SHAT-128. Thus $br$ for each padded block is set to 32 bits, which consist of eight 4-bit hexadecimal digits.
Secondly, the hash function introduced in Section 3 is designed as a sponge construction, as shown in Figure 4. After $n$ 32-bit message blocks are absorbed, a 128-bit digest is squeezed out.
Finally, the main control unit manages the working order of the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.
Because we use serial communication, the speed is slow. We use 4800 Bauds as our Baud rate for a low error rate; thus each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system has one STEP per round, which means that there are 48 iteration rounds for a complete Perm function; nevertheless, hash processing needs only roughly 6 μs. The LCD displaying period also costs much time. Even though we can finish LCD initialization before we get the hash digest, we still need roughly 15 ms to completely display all data.
4.3. Load-Enable Based Clock Gating. In this section, we introduce the load-enable based clock gating technique for the hash encryption system.
Clock gating is the most widely used low power technique at RTL. At RTL, it is more practical to control the toggle rate of the gate outputs than any of the other three components, namely $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system forms a pipeline. The finishing signal of each process can be treated as the enable signal in load-enable based clock gating, as shown in Figure 3. On the other hand, the XOR-based clock gating technique needs the outputs of single-level flip-flops to be specified, which is not easily done in our encryption system; thus load-enable based clock gating is our best option for a low power method.
As shown in Figure 10, three signal pairs realize this load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency that corresponds to the serial communication, the main control unit does not gate the clock signal of the receiver directly; instead, by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.
Figure 9 gives the three operation phases of the encryption system. In the first phase, the en_div and en_lcd signals are asserted to logic one and en_h is held at logic zero; thus the receiver starts receiving input messages and padding them into RAM. In the meantime, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because serial communication takes a long time, owing to the low Baud rate and its bit-by-bit transmission, LCD displayer initialization can be finished before the padded message is ready. Thus en_lcd can be set to logic zero by the main control unit when fsh_lcd switches to logic one.

Figure 9: Three phases of hash encryption system (phase one: the receiver performs data receiving and padding, the hash unit is idle, and the LCD is initialized and then idle; phase two: the receiver is idle, the hash unit performs hash processing, and the LCD is idle; phase three: the receiver and hash unit are idle and the LCD performs LCD displaying).

Figure 10: Control signals of hash encryption system (blocks: receiver, RAM, clock divider, hash process, main control, and LCD displayer; signals: en_div, fsh_r, en_h, fsh_h, en_lcd, fsh_lcd, and clk_r; data: input message, padded message, digest, and digest display).
During the second phase, the padded message is ready, so fsh_r switches to logic one. Then en_div is set to zero, which means that the clock divider is turned off and no receiver clock is produced; thus the receiver stops working. In this phase, en_h is asserted to logic one for hash encryption, which is our core function. en_lcd remains at zero while waiting for the hash digest generated by hash processing.
The system enters the third phase when the fsh_h signal switches to logic one. In this phase, the hash digest is ready; thus both the receiver and the hash process are in idle mode, which means that en_div and en_h are both at logic zero. Signal en_lcd is asserted to logic one to start LCD displaying and is set back to zero when the displaying process is finished. This is the end of the whole sequence; the device is then turned off or repeats these three phases for another input message.
By analyzing the construction and operation of the hash encryption system, we can identify the idle time of each component. By then applying load-enable based clock gating to each component, the dynamic power dissipation of this system can be reduced, as shown in Table 8 in Section 4.4.
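The enable-signal behaviour described for the three phases can be summarized as a small state machine. The Python model below is a behavioural sketch only, not the RTL of the main control unit: the class name is hypothetical, the finishing signals are modelled as one-cycle pulses, and forcing en_lcd to zero on entry to phase two is an assumption consistent with the statement that LCD initialization finishes first.

```python
class MainControl:
    """Behavioural model of the enable signals (en_div, en_h, en_lcd)."""

    def __init__(self):
        self.phase = 1
        self.en_div, self.en_h, self.en_lcd = 1, 0, 1   # phase one defaults

    def update(self, fsh_r=0, fsh_h=0, fsh_lcd=0):
        """Advance one control step and return the current enables."""
        if self.phase == 1:
            if fsh_lcd:                 # LCD initialization finished
                self.en_lcd = 0
            if fsh_r:                   # padded message ready: start hashing
                self.phase = 2
                self.en_div, self.en_h, self.en_lcd = 0, 1, 0
        elif self.phase == 2:
            if fsh_h:                   # digest ready: start displaying
                self.phase = 3
                self.en_div, self.en_h, self.en_lcd = 0, 0, 1
        elif self.phase == 3:
            if fsh_lcd:                 # display finished: everything idle
                self.en_lcd = 0
        return (self.en_div, self.en_h, self.en_lcd)
```

Stepping the model with `fsh_lcd`, then `fsh_r`, then `fsh_h`, then `fsh_lcd` again yields the enable triples (1, 0, 0), (0, 1, 0), (0, 0, 1), and (0, 0, 0). At most one of the three units is clocked after LCD initialization completes, which is where the clock gating saves dynamic power.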
4.4. Experimental Results. Using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput with the frequency trade-off method. Note that $f_i$ stands for frequency, $T_i$ stands for throughput, and $T_{i\mathrm{pct}}$ is the percentage increase compared with the minimum throughput ($T_{\min}$), which is 6.67 Mbps. $P_i$ is the total dynamic power consumption for finishing a complete Perm function, and $P_{i\mathrm{pct}}$ is the percentage of power reduction compared with the maximum power consumption ($P_{\max}$), defined as 1308.96 μW, which is the product of 48 (the number of iteration rounds) and 27.27 μW (as shown in Table 7). Note that $i$ stands for the number of iteration rounds.

Table 8: Hardware implementation with/without load-enable based clock gating.
System type | Area (GE) | Area increase (%) | Delay (ns) | Delay increase (%) | Power (μW) | Power reduction (%)
Original | 14053 | n/a | 1.63 | n/a | 1830.20 | n/a
Clock gated | 14565 | 3.64 | 1.72 | 5.52 | 1580.36 | 13.65

Table 9: Area and delay performances of frequency trade-off technique.
Number of iteration rounds | Area (GE) | Delay (ns) | Frequency (MHz)
48 | 965 | 0.94 | 10.00
24 | 1930 | 1.92 | 5.00 < f_24 < 6.96
16 | 2895 | 2.91 | 3.33 < f_16 < 6.20
12 | 3860 | 3.91 | 2.50 < f_12 < 5.89
8 | 5790 | 5.90 | 1.67 < f_8 < 5.60
6 | 7720 | 7.89 | 1.25 < f_6 < 5.47
4 | 11580 | 11.87 | 0.83 < f_4 < 5.34
3 | 15440 | 15.84 | 0.63 < f_3 < 5.28
2 | 23160 | 23.80 | 0.42 < f_2 < 5.22
1 | 46320 | 47.71 | 0.21 < f_1 < 5.21
Then we apply the load-enable based clock gating scheme to the hash encryption system, using the 100 MHz clock frequency provided on the FPGA board and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%. However, a 3.64% increase in area and a 5.52% increase in critical path delay are sacrificed.
Table 10: Dynamic power consumption of frequency trade-off technique.
Number of iteration rounds | Power (μW) | Reduction (%) | Frequency (MHz)
48 | 1308.96 | n/a | 10.00
24 | 940.08 < P_24 < 1308.48 | 28.18 > P_24pct > 0.04 | 5.00 < f_24 < 6.96
16 | 702.88 < P_16 < 1307.20 | 46.30 > P_16pct > 0.13 | 3.33 < f_16 < 6.20
12 | 555.12 < P_12 < 1308.00 | 57.59 > P_12pct > 0.07 | 2.50 < f_12 < 5.89
8 | 389.04 < P_8 < 1307.12 | 70.28 > P_8pct > 0.14 | 1.67 < f_8 < 5.60
6 | 298.80 < P_6 < 1307.64 | 77.17 > P_6pct > 0.10 | 1.25 < f_6 < 5.47
4 | 203.92 < P_4 < 1306.76 | 84.42 > P_4pct > 0.17 | 0.83 < f_4 < 5.34
3 | 154.71 < P_3 < 1306.89 | 88.18 > P_3pct > 0.16 | 0.63 < f_3 < 5.28
2 | 104.30 < P_2 < 1306.76 | 92.03 > P_2pct > 0.17 | 0.42 < f_2 < 5.22
1 | 52.29 < P_1 < 1307.60 | 96.01 > P_1pct > 0.10 | 0.21 < f_1 < 5.21
Table 11: Throughput performances of frequency trade-off technique.
Number of iteration rounds | Throughput (Mbps) | Improvement (%) | Frequency (MHz)
48 | 6.67 | n/a | 10.00
24 | 6.67 < T_24 < 9.28 | 0.00 < T_24pct < 39.13 | 5.00 < f_24 < 6.96
16 | 6.67 < T_16 < 12.4 | 0.00 < T_16pct < 85.91 | 3.33 < f_16 < 6.20
12 | 6.67 < T_12 < 15.71 | 0.00 < T_12pct < 135.53 | 2.50 < f_12 < 5.89
8 | 6.67 < T_8 < 22.40 | 0.00 < T_8pct < 235.83 | 1.67 < f_8 < 5.60
6 | 6.67 < T_6 < 29.17 | 0.00 < T_6pct < 337.33 | 1.25 < f_6 < 5.47
4 | 6.67 < T_4 < 42.72 | 0.00 < T_4pct < 540.48 | 0.83 < f_4 < 5.34
3 | 6.67 < T_3 < 56.32 | 0.00 < T_3pct < 744.38 | 0.63 < f_3 < 5.28
2 | 6.67 < T_2 < 83.52 | 0.00 < T_2pct < 1152.17 | 0.42 < f_2 < 5.22
1 | 6.67 < T_1 < 166.72 | 0.00 < T_1pct < 2399.55 | 0.21 < f_1 < 5.21
5. Conclusion
In order to achieve a high performance and low power hardware implementation of a cryptographic hash function that uses the sponge construction, we apply four techniques. Firstly, we use the unfolding transformation technique to improve the throughput of the hash function. Secondly, pipeline and parallelism design techniques are implemented to reduce the critical path delay by modifying the structure of the permutation function. Thirdly, a frequency trade-off technique is proposed to calculate a frequency range which can be used to make a trade-off between low dynamic power consumption and high throughput of the hash function. Finally, a load-enable based clock gating scheme is applied in the hash encryption system to eliminate wasted signal toggling in the idle mode.
The experimental results have shown that the unfolding transformation technique can achieve up to 47.97 times higher throughput, the pipeline and parallelism methods give a 6.31% delay reduction, the load-enable based clock gating scheme decreases dynamic power consumption by 13.65%, and the frequency trade-off technique shows how to decide the clock frequency of the hash function to achieve low power consumption and high throughput.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).
References
[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1–4, Hong Kong, December 2008.
[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.
[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.
[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165–180, Amsterdam, The Netherlands, 2002.
[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82–92, 2005.
[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.
[7] "Sponge function," Wikipedia, http://en.wikipedia.org/wiki/Sponge_function.
[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.
[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.
[10] Y. Zhang, Q. Tong, L. Li, et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceedings of the International SoC Design Conference (ISOCC '12), pp. 21–24, Jeju Island, Republic of Korea, 2012.
[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia - a 128-bit block cipher," Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.
[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354–359, Steamboat Springs, Colo, USA, September 2006.
[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086–4089, Kobe, Japan, May 2005.
[14] S. Badel, N. Dagtekin, J. Nakahara Jr., et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398–412, 2010.
[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT, Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.
Table 2: S-box of the G function.

w   | Sbox(w) | w   | Sbox(w)
0x0 | 0x1     | 0x8 | 0xF
0x1 | 0x2     | 0x9 | 0x8
0x2 | 0x4     | 0xA | 0x9
0x3 | 0xB     | 0xB | 0x7
0x4 | 0xD     | 0xC | 0x6
0x5 | 0xE     | 0xD | 0x3
0x6 | 0xA     | 0xE | 0x0
0x7 | 0x5     | 0xF | 0xC
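Figure 4: Sponge construction of SHAT. [Diagram: initialization, absorbing, and squeezing phases; the state is split into a bitrate part br and a capacity part c; message blocks M0, ..., Mn are XORed in during absorbing, outputs H0, H1, H2, H3 are produced during squeezing, and Perm is applied between stages.]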
the bitrate ($br$) and the capacity ($c$) of SHAT-($128 \cdot i$) ($i = 1, 2, 3$) are $32 \cdot i$ and $96 \cdot i$, respectively. The internal state $S$ is divided into $4 \cdot i$ ($i = 1, 2, 3$) sections as $S = (S_0, \ldots, S_{4i-1})$.

In the absorbing phase, the input message $M = (M_0, M_1, \ldots, M_{n-1})$ shown in Figure 4 is padded to a whole multiple of the bitrate ($br$). The padding method is as follows. Let $l$ be the total length of the input message (we assume that $l$ is a whole multiple of four, that is, the message is a whole number of hexadecimal digits). We append a 1 to the end of the message, followed by $k$ zero bits, where $k$ is the smallest nonnegative integer that satisfies

$$(l + 1 + k) \bmod (32 \cdot i) = 0 \qquad (3)$$
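The padding rule in (3) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the message is modelled as a string of '0' and '1' characters so that no bit-ordering convention has to be assumed, and the function names `pad_bits` and `split_blocks` are hypothetical.

```python
def pad_bits(message_bits: str, i: int = 1) -> str:
    """Append '1' and then k zeros so that (l + 1 + k) mod (32 * i) == 0."""
    if i not in (1, 2, 3):
        raise ValueError("i must be 1, 2, or 3")
    if any(b not in "01" for b in message_bits):
        raise ValueError("message_bits must contain only '0' and '1'")
    bitrate = 32 * i
    l = len(message_bits)
    # Smallest nonnegative k that makes the padded length a multiple of the bitrate.
    k = (-(l + 1)) % bitrate
    return message_bits + "1" + "0" * k


def split_blocks(padded_bits: str, i: int = 1) -> list:
    """Split a padded bit string into br-bit blocks M_0, ..., M_{n-1}."""
    bitrate = 32 * i
    return [padded_bits[j:j + bitrate] for j in range(0, len(padded_bits), bitrate)]
```

For a 32-bit message and i = 1, the rule appends one 1 bit and 31 zero bits, so the padded message occupies two 32-bit blocks.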
We then take $S_{4i-1}$ as the bitrate part of the state, which is XORed with the padded $br$-bit message block. The result then goes through the one-way compression function Perm. Perm is a permutation process that has 48 steps; each STEP is defined in Algorithm 1. In Algorithm 1, the left circular rotation amounts $\mathrm{rot}_k$ are $\mathrm{rot}_0 = 19$, $\mathrm{rot}_1 = 1$, and $\mathrm{rot}_2 = 14$. In the squeezing phase, the output of SHAT is defined in (4). SHAT-($128 \cdot i$) ($i = 1, 2, 3$) is specified in Algorithm 2.

$$\mathrm{SQUEEZE}(S, i) = \begin{cases} S_3, & i = 1 \\ S_3, S_7, & i = 2 \\ S_3, S_7, S_{11}, & i = 3 \end{cases} \qquad (4)$$
3.2. Hardware Implementation. Following the specification of SHAT-($128 \cdot i$) ($i = 1, 2, 3$) in Algorithm 2, the architecture of SHAT is illustrated in Figure 5.

The $S$-box of the $G$ function is designed from a Karnaugh map. According to Table 2, we obtain the logic functions of the $S$-box shown in (5), where $A_i$ ($i = 0, 1, 2, 3$) are the input bits of the $S$-box, $Q_i$ ($i = 0, 1, 2, 3$) are the output bits, and an overbar denotes the logical complement.

$$\begin{aligned}
Q_3 &= \bar{A}_3 \bar{A}_2 A_1 A_0 + A_3 A_2 A_1 A_0 + \bar{A}_3 A_2 \bar{A}_1 + \bar{A}_3 A_2 \bar{A}_0 + A_3 \bar{A}_2 \bar{A}_1 + A_3 \bar{A}_2 \bar{A}_0 \\
Q_2 &= A_3 A_1 A_0 + \bar{A}_3 A_2 \bar{A}_1 + A_3 \bar{A}_1 \bar{A}_0 + \bar{A}_3 A_2 A_0 + \bar{A}_3 \bar{A}_2 A_1 \bar{A}_0 \\
Q_1 &= A_3 \bar{A}_1 \bar{A}_0 + \bar{A}_2 A_1 A_0 + \bar{A}_3 \bar{A}_2 A_0 + A_2 \bar{A}_1 A_0 + \bar{A}_3 A_2 A_1 \bar{A}_0 \\
Q_0 &= \bar{A}_3 \bar{A}_1 \bar{A}_0 + \bar{A}_3 A_1 A_0 + A_3 A_2 \bar{A}_1 A_0 + A_3 \bar{A}_2 \bar{A}_0 + A_3 \bar{A}_2 A_1
\end{aligned} \qquad (5)$$
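The sum-of-products form of the S-box can be checked against the lookup table of Table 2 with a short script. The sketch below is illustrative only: the placement of the complemented literals is inferred from Table 2 (it is an assumption, since only the product-term structure of (5) is taken as given), and the name `sbox_logic` is hypothetical.

```python
# Table 2 as a lookup table: SBOX[w] = Sbox(w).
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]


def sbox_logic(w: int) -> int:
    """Evaluate the S-box as two-level AND/OR logic on the input bits A3..A0."""
    a3, a2, a1, a0 = (w >> 3) & 1, (w >> 2) & 1, (w >> 1) & 1, w & 1
    n3, n2, n1, n0 = 1 - a3, 1 - a2, 1 - a1, 1 - a0  # complemented inputs
    q3 = ((n3 & n2 & a1 & a0) | (a3 & a2 & a1 & a0) | (n3 & a2 & n1)
          | (n3 & a2 & n0) | (a3 & n2 & n1) | (a3 & n2 & n0))
    q2 = ((a3 & a1 & a0) | (n3 & a2 & n1) | (a3 & n1 & n0)
          | (n3 & a2 & a0) | (n3 & n2 & a1 & n0))
    q1 = ((a3 & n1 & n0) | (n2 & a1 & a0) | (n3 & n2 & a0)
          | (a2 & n1 & a0) | (n3 & a2 & a1 & n0))
    q0 = ((n3 & n1 & n0) | (n3 & a1 & a0) | (a3 & a2 & n1 & a0)
          | (a3 & n2 & n0) | (a3 & n2 & a1))
    return (q3 << 3) | (q2 << 2) | (q1 << 1) | q0


def logic_matches_table() -> bool:
    """True if the logic equations reproduce every row of Table 2."""
    return all(sbox_logic(w) == SBOX[w] for w in range(16))
```

An exhaustive comparison over all 16 inputs is cheap, so it is a convenient regression check before the equations are committed to RTL.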
There are 48 iteration rounds in the basic architecture of the Perm function. We use the rolling loop technique to reduce the area requirement: our design is a single operation block that is reused 48 times, as shown in Figure 6. Here $r_i$ ($i = 0$ to 47) is a counter for the number of iteration rounds, running from 0 to 47. The critical path is highlighted by the bold line. Since the delay of a circular shift is negligible in a hardware implementation, the critical path delay of this architecture is

$$\mathrm{Delay}_n = 4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) \qquad (6)$$
3.3. Proposed High-Speed Module. In the previous section, we introduced the rolling loop technique to construct the Perm function. Although this approach is area efficient, throughput is kept low because 48 clock cycles are required to generate the result. Many architectures can be obtained by varying the Perm function to solve this problem. We applied the unfolding transformation technique. This high-speed module combines several STEP blocks into a single round and can even take advantage of architectures with a completely round-unrolled circuit. By unfolding, the hidden concurrencies can be parallelized [12]. Also, in [13], the pipeline and parallelism technique was explained as a way to improve the unfolded construction of a hash function. This technique relies on precomputation, which is identified by analysing the inner logic and architecture of the hash function.
3.3.1. Unfolding Transformation. According to Figure 6, the mathematical expression of one iteration round is

$$\begin{aligned}
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus S_2) \\
S'_2 &= S_1 \\
S'_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1) \\
S'_0 &= S_3 \oplus r
\end{aligned} \qquad (7)$$
Step(S)
  (i) For k = 0 to i − 1
      (a) S_{4k+3} = S_{4k+3} ⊕ r
      (b) S_{4k} = S_{4k} ⊕ S_{4k+1}
      (c) S_{4k+2} = S_{4k+2} ⊕ S_{4k+3}
      (d) S_{4k} = S_{4k} ⊕ G(S_{4k+2})
      (e) S_{4k+2} = S_{4k+2} ⊕ (S_{4k} <<< rot_k)
  (ii) Temp = S_{4i−1}
  (iii) For k = 4i − 1 to 1
      S_k = S_{k−1}
  (iv) S_0 = Temp

Algorithm 1: Typical one step algorithm.
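As a cross-check between Algorithm 1 and the closed form in (7), the following Python sketch implements both and lets them be compared for i = 1. It is an illustration under stated assumptions: the state words are taken to be 32 bits wide, and `g` is assumed to apply the Table 2 S-box to each 4-bit nibble of a word (the exact definition of G is given elsewhere in the paper, so this is a stand-in). The names `rotl32`, `g`, `step`, and `step_eq7` are hypothetical.

```python
MASK32 = 0xFFFFFFFF
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]  # Table 2
ROT = (19, 1, 14)  # rot_0, rot_1, rot_2


def rotl32(x: int, n: int) -> int:
    """Left circular rotation of a 32-bit word by n positions."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def g(x: int) -> int:
    """ASSUMPTION: G substitutes each of the eight nibbles through the S-box."""
    out = 0
    for shift in range(0, 32, 4):
        out |= SBOX[(x >> shift) & 0xF] << shift
    return out


def step(state, r: int, i: int = 1):
    """One STEP of Algorithm 1 on a list of 4*i words; returns a new list."""
    s = list(state)
    for k in range(i):
        s[4 * k + 3] ^= r                              # (a)
        s[4 * k] ^= s[4 * k + 1]                       # (b)
        s[4 * k + 2] ^= s[4 * k + 3]                   # (c)
        s[4 * k] ^= g(s[4 * k + 2])                    # (d)
        s[4 * k + 2] ^= rotl32(s[4 * k], ROT[k])       # (e)
    # (ii)-(iv): S_k = S_{k-1} for k = 4i-1 .. 1, and S_0 = old S_{4i-1}.
    return [s[-1]] + s[:-1]


def step_eq7(s0: int, s1: int, s2: int, s3: int, r: int):
    """The same round written as the closed form of equation (7), i = 1."""
    n0 = s3 ^ r
    n1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    n2 = s1
    n3 = rotl32(n1, ROT[0]) ^ (n0 ^ s2)
    return [n0, n1, n2, n3]
```

Because both functions share the same `g`, their agreement does not depend on the assumed definition of G; it only confirms that the word shuffling and XOR structure of Algorithm 1 and (7) coincide.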
SHAT-(128·i)(M)
  Inputs: n padded message blocks M = (M_0, M_1, ..., M_{n−1})
  Outputs: (128·i)-bit hash value (H_0, H_1, H_2, H_3)
  (1) S = (S_0, ..., S_{4i−1}) = (0, 0, ..., 0, 128·i)    // initialization
  (2) Perm(S)
  (3) For j = 0 to n − 1                                  // absorbing phase
      (i) For k = 0 to i − 1
          S_{4k+3} = S_{4k+3} ⊕ M_{j,k}
      (ii) Perm(S)
  (4) H_0 = SQUEEZE(S, i)                                 // squeezing phase
  (5) For k = 1 to 3
      (i) Perm(S)
      (ii) H_k = SQUEEZE(S, i)

Algorithm 2: SHAT-(128·i).
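The control flow of Algorithm 2 can be illustrated with the Python sketch below. It is not a reference implementation and has not been checked against test vectors. Several points are assumptions: `g` applies the Table 2 S-box nibble by nibble, the round counter r takes the values 0 to 47 inside Perm, each padded block is supplied as a list of i 32-bit words M_{j,0}, ..., M_{j,i-1}, and all function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]  # Table 2
ROT = (19, 1, 14)  # rot_0, rot_1, rot_2


def rotl32(x, n):
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def g(x):
    """ASSUMPTION: S-box applied to each 4-bit nibble of the 32-bit word."""
    return sum(SBOX[(x >> sh) & 0xF] << sh for sh in range(0, 32, 4))


def step(state, r, i):
    """One STEP as in Algorithm 1."""
    s = list(state)
    for k in range(i):
        s[4 * k + 3] ^= r
        s[4 * k] ^= s[4 * k + 1]
        s[4 * k + 2] ^= s[4 * k + 3]
        s[4 * k] ^= g(s[4 * k + 2])
        s[4 * k + 2] ^= rotl32(s[4 * k], ROT[k])
    return [s[-1]] + s[:-1]


def perm(state, i):
    """Perm: 48 STEPs, with the round counter r assumed to run from 0 to 47."""
    for r in range(48):
        state = step(state, r, i)
    return state


def squeeze(state, i):
    """SQUEEZE(S, i) from (4): the words S_3, S_7, S_11 as available."""
    return [state[4 * k + 3] for k in range(i)]


def shat(blocks, i=1):
    """Sponge flow of Algorithm 2; returns [H_0, H_1, H_2, H_3]."""
    state = [0] * (4 * i - 1) + [128 * i]        # (1) initialization
    state = perm(state, i)                       # (2)
    for block in blocks:                         # (3) absorbing phase
        for k in range(i):
            state[4 * k + 3] ^= block[k] & MASK32
        state = perm(state, i)
    digest = [squeeze(state, i)]                 # (4) squeezing phase
    for _ in range(3):                           # (5)
        state = perm(state, i)
        digest.append(squeeze(state, i))
    return digest
```

Each of the four outputs carries 32·i bits, so the concatenated digest has 128·i bits, matching the output size stated in Algorithm 2.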
Figure 5: A typical SHAT core. [Block diagram: padding unit, RAM (32-bit wide registers), SHAT, control unit, and message digest extraction; labeled signals: input data, padded data (n × 32 bits), and message digest (4 × 32 bits, 128 bits).]
Here $S_i$ ($i = 0, 1, 2, 3$) is the input of the current round and $S'_i$ ($i = 0, 1, 2, 3$) is the output of this round (or the input of the next round). In order to distribute the 48 operations equally over the rounds, the possible values for the unfolding factor are the divisors of 48, that is, 1, 2, 3, 4, 6, 8, 12, 16, 24, and 48. For example, we can unfold two STEP operations in each round; then we get 24 rounds in one permutation process.

Figure 6: Typical architecture of one STEP round. [Diagram: inputs S0, S1, S2, S3 and the round counter r_i pass through XOR gates, the g function, and ROT to produce the next S0, S1, S2, S3.]

The expression of throughput is given as

$$\text{Throughput} = \frac{(\#\ \text{of bits}) \cdot f_{\text{round}}}{\#\ \text{of rounds}} \qquad (8)$$
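Equation (8) can be evaluated directly for each unfolding factor. The sketch below assumes 32 bits are processed per Perm call (the bitrate of SHAT-128) and a 10 MHz clock, which are the conditions reported for Table 7; the function name is hypothetical.

```python
def throughput_mbps(bits: int, f_mhz: float, rounds: int) -> float:
    """Equation (8): bits processed per Perm call times clock rate over rounds."""
    return bits * f_mhz / rounds


# Rounds per Perm call for every divisor of 48 (48 = fully rolled, 1 = fully unrolled).
ROUND_COUNTS = [48, 24, 16, 12, 8, 6, 4, 3, 2, 1]

THROUGHPUT_AT_10MHZ = {n: throughput_mbps(32, 10.0, n) for n in ROUND_COUNTS}
```

With these assumptions the fully rolled design gives about 6.67 Mbps and the fully unrolled design 320 Mbps, a ratio of 48 before rounding, which is in line with the roughly 47.97 times gain quoted from the rounded table values.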
Considering (8), although this unfolding transformation reduces the maximum operating frequency, the throughput is increased significantly because the number of operations is reduced from 48 to 24. The mathematical expression of one iteration round is replaced by
$$\begin{aligned}
\mathrm{temp}_3 &= \mathrm{ROT}(\mathrm{temp}_1) \oplus (\mathrm{temp}_0 \oplus S_2) \\
\mathrm{temp}_2 &= S_1 \\
\mathrm{temp}_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1) \\
\mathrm{temp}_0 &= S_3 \oplus r \\
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus \mathrm{temp}_2) \\
S'_2 &= \mathrm{temp}_1 \\
S'_1 &= G(\mathrm{temp}_3 \oplus r \oplus \mathrm{temp}_2) \oplus (\mathrm{temp}_0 \oplus \mathrm{temp}_1) \\
S'_0 &= \mathrm{temp}_3 \oplus r
\end{aligned} \qquad (9)$$
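That the unfolded round in (9) is simply two consecutive applications of (7) can be confirmed numerically. The sketch below is illustrative: it assumes 32-bit words, uses a nibble-wise S-box as a stand-in for G, and passes separate counter values `r0` and `r1` to the two halves (Figure 7 labels them r_i and r_{i+1}, while (9) writes r in both places). All names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]  # Table 2
ROT0 = 19  # rot_0


def rot(x):
    """ROT: left circular rotation of a 32-bit word by rot_0 positions."""
    return ((x << ROT0) | (x >> (32 - ROT0))) & MASK32


def g(x):
    """ASSUMPTION: stand-in for G, S-box applied to each 4-bit nibble."""
    return sum(SBOX[(x >> sh) & 0xF] << sh for sh in range(0, 32, 4))


def one_step(s, r):
    """One STEP round, equation (7)."""
    s0, s1, s2, s3 = s
    n0 = s3 ^ r
    n1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    n2 = s1
    n3 = rot(n1) ^ (n0 ^ s2)
    return (n0, n1, n2, n3)


def two_steps_unfolded(s, r0, r1):
    """Two STEPs merged into one round, equation (9)."""
    s0, s1, s2, s3 = s
    temp0 = s3 ^ r0
    temp1 = g(s3 ^ r0 ^ s2) ^ (s0 ^ s1)
    temp2 = s1
    temp3 = rot(temp1) ^ (temp0 ^ s2)
    n0 = temp3 ^ r1
    n1 = g(temp3 ^ r1 ^ temp2) ^ (temp0 ^ temp1)
    n2 = temp1
    n3 = rot(n1) ^ (n0 ^ temp2)
    return (n0, n1, n2, n3)
```

The unfolded round produces the same state as two rolled rounds but in a single clock cycle, which is why the round count halves from 48 to 24.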
3.3.2. Pipeline and Parallelism. Suppose we unroll two STEP operations in each round. This reduces the maximum frequency while increasing the throughput; however, increased area is the penalty. If some logic can be computed in parallel, and this parallelism lies on the critical path, then the delay of each round can be decreased so that the operating frequency can be increased. According to (8), when the number of operations is kept constant (the number of bits is also constant), the throughput increases with the frequency. This method can be used in the hardware implementation of any other hash function.

For example, Figure 7 shows the architecture that unfolds two STEP operations in one round with the minimum critical path delay. The critical path is composed of seven XOR gates and two $G$ functions. By unfolding two STEPs in one round, the critical path grows by three 32-bit XOR gates and one $G$ function compared with the architecture of one STEP block. The critical path is highlighted by the bold line.

In Figure 7, the cycle counter $r_{i+1}$ can be combined with $\mathrm{temp}_2$ first and then XORed with $\mathrm{temp}_3$ in the second STEP part. Compared with the first STEP part, where $r_i$ is XORed with $S_3$ and then with $S_2$, there is an additional component that combines $\mathrm{temp}_3$ and $r_{i+1}$. Because this output must still be generated, this area penalty cannot be avoided.
Thus, when we increase the number of unfolded STEP operations (for example, to three, four, or five), the delay of each round increases by three 32-bit XOR gates and one $G$ function per additional STEP. Therefore, the normalized delay with unfolding factor $n$ ($n = 1, 2, 3, \ldots$) is

$$\mathrm{Delay}_n = \frac{4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) + (n - 1) \cdot (3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g))}{n} \qquad (10)$$

Taking the limit in $n$, (10) becomes

$$\lim_{n \to \infty} \mathrm{Delay}_n = 3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) \qquad (11)$$

This is the delay bound of SHAT, which means that the delay of one SHAT operation round cannot be less than this bound.
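The behaviour of (10) and its limit (11) can be explored numerically. The gate delays used in the sketch below (0.1 ns for a 32-bit XOR and 0.5 ns for the g function) are hypothetical placeholder values chosen only for illustration; they are not measurements from the paper, and the function names are likewise hypothetical.

```python
def normalized_delay(n: int, d_xor: float, d_g: float) -> float:
    """Equation (10): per-STEP delay when n STEPs are unfolded into one round."""
    if n < 1:
        raise ValueError("unfolding factor n must be at least 1")
    return (4 * d_xor + d_g + (n - 1) * (3 * d_xor + d_g)) / n


def delay_bound(d_xor: float, d_g: float) -> float:
    """Equation (11): the limit of the normalized delay as n grows."""
    return 3 * d_xor + d_g


# Hypothetical gate delays in nanoseconds, for illustration only.
D_XOR, D_G = 0.1, 0.5
SWEEP = [(n, normalized_delay(n, D_XOR, D_G)) for n in (1, 2, 3, 4, 6, 8, 12, 16, 24, 48)]
```

Rearranging (10) gives Delay_n = 3·Delay(⊕) + Delay(g) + Delay(⊕)/n, so the gap to the bound shrinks as 1/n and the normalized delay approaches the bound from above without reaching it.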
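Figure 7: Proposed architecture of two STEPs round. [Diagram: inputs S0, S1, S2, S3 with round counters r_i and r_{i+1}; intermediate values temp0, temp1, temp2, temp3; two g blocks and two ROT blocks connected by XOR gates; outputs S0, S1, S2, S3.]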
3.4. Experimental Results. We use the measure of hardware efficiency in (12) [14], which is an improvement on the usual figure of merit (FOM). Assuming that power is proportional to gate count, the metric is divided by GE a second time, in place of power dissipation, when throughput is traded off against power. Note that one gate equivalent (GE) is equal to the area of a two-input NAND gate in 45 nm CMOS technology.

$$\mathrm{FOM} = \frac{\text{Throughput}}{\mathrm{GE}^2} \qquad (12)$$
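The figure of merit in (12) is simple to compute from a throughput and an area. In the sketch below, the scaling to nanobits per clock cycle per GE² is an assumption: it is the unit under which the SHAT, MD4, and MD5 rows of the comparison tables are reproduced from their throughput and area columns, but the unit is not spelled out in the text. The function name is hypothetical.

```python
def fom(throughput_kbps: float, area_ge: float, clock_khz: float = 100.0) -> float:
    """Equation (12), scaled to nanobits per clock cycle per GE^2 (assumed unit)."""
    bits_per_cycle = throughput_kbps / clock_khz
    return bits_per_cycle / (area_ge ** 2) * 1e9


# Rows taken from the 128-bit and 384-bit comparison tables: (kbps at 100 kHz, GE).
EXAMPLES = {
    "SHAT-128": (66.67, 1605),
    "MD4": (112.28, 7350),
    "MD5": (83.66, 8400),
    "SHAT-384": (200.0, 4753),
}
FOMS = {name: fom(t, a) for name, (t, a) in EXAMPLES.items()}
```

Squaring the area penalizes large designs more heavily than a plain throughput-per-area ratio would, which is why a small core with moderate throughput can score well on this metric.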
Table 3 shows the hardware implementation results of some 128-bit hash functions at a 100 kHz clock frequency in 45 nm CMOS technology. Firstly, the throughput of SHAT-128 (66.67 kbps) is less than that of 5 other hash algorithms, namely MD4 (112.28 kbps), MD5 (83.66 kbps), H-Present-128-32-round (200 kbps), and ARMADILLO2-B (250 kbps and 1000 kbps). However, the area of SHAT-128 is only 28.42% of the area of these hash functions on average. As a result, the hardware efficiency of SHAT-128 is 13.12 times higher on average. Secondly, the area of SHAT-128 (1605 GE) is larger than that of 3 hash algorithms, namely U-QUARK-544-round (1379 GE), PHOTON-128-996-round (1122 GE), and SPONGENT-128-8-bit-2380-round (1060 GE); however, the throughput of SHAT-128 is 94.27 times higher. Thus the FOM of SHAT-128 is 46.75 times higher on average. Finally, the area of SHAT-128 (1605 GE) is less than that of 4 other hash algorithms, namely H-Present-128-559-round (2330 GE), U-QUARK-68-round (2392 GE), PHOTON-128-156-round (1708 GE), and SPONGENT-128-70-round (1687 GE), and the throughput of SHAT-128 is also 5.95 times higher than that of these 4 hash algorithms on average. As a result, the FOM of SHAT-128 is 9.66 times higher on average.
In Table 4, firstly, the throughput of SHAT-256 is 51.05% of that of Grostl; however, the area of SHAT-256 is only 21.84% of that of Grostl, which results in 84.47 times higher hardware efficiency for SHAT-256. Secondly, the throughput of SHAT-256 (3193 GE) is 412.91 times higher on average than that of 2 hash algorithms, PHOTON-256-156-round (2177 GE) and SPONGENT-256-9520-round (1950 GE); although the area of SHAT-256 is larger, the FOM of SHAT-256 is still 158.25 times higher than that of these 2 hash algorithms. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256-156-round, and SPONGENT-256-140-round, the throughput of SHAT-256 is 4.65 times higher on average and the area of SHAT-256 is only 49.15% of the area of these hash algorithms on average. Therefore, the FOM of SHAT-256 is 119.14 times higher on average.

Table 3: Hardware implementation results of some 128-bit hash functions.

Hash function      | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM
SHAT-128           | 32                | 48                   | 66.67                        | 1605      | 258.80
H-Present-128 [15] | 128               | 559                  | 11.45                        | 2330      | 21.09
H-Present-128 [15] | 128               | 32                   | 200                          | 4256      | 110.41
MD4 [15]           | 512               | 456                  | 112.28                       | 7350      | 20.78
MD5 [15]           | 512               | 612                  | 83.66                        | 8400      | 11.86
ARMADILLO2-B [15]  | 64                | 256                  | 250                          | 4353      | 13.19
ARMADILLO2-B [15]  | 64                | 64                   | 1000                         | 6025      | 27.55
U-QUARK [15]       | 8                 | 544                  | 1.47                         | 1379      | 7.73
U-QUARK [15]       | 8                 | 68                   | 11.76                        | 2392      | 20.56
PHOTON-128 [15]    | 16                | 996                  | 1.61                         | 1122      | 12.78
PHOTON-128 [15]    | 16                | 156                  | 10.26                        | 1708      | 35.15
SPONGENT-128 [15]  | 8                 | 2380                 | 0.34                         | 1060      | 2.99
SPONGENT-128 [15]  | 16                | 70                   | 11.43                        | 1687      | 40.16

Table 4: Hardware implementation results of some 256-bit hash functions.

Hash function      | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM
SHAT-256           | 64                | 48                   | 133.33                       | 3193      | 130.78
SHA-256 [14]       | 512               | 490                  | 104.48                       | 8588      | 14.17
ARMADILLO2-E [14]  | 128               | 512                  | 25                           | 8653      | 3.34
ARMADILLO2-E [14]  | 128               | 128                  | 100                          | 11914     | 7.05
BLAKE [14]         | 32                | 816                  | 72.79                        | 13575     | 0.21
Grostl [14]        | 64                | 196                  | 261.14                       | 14622     | 1.53
PHOTON-256 [14]    | 32                | 156                  | 3.21                         | 2177      | 6.78
PHOTON-256 [14]    | 32                | 156                  | 20.51                        | 4362      | 10.17
SPONGENT-256 [14]  | 16                | 9520                 | 0.17                         | 1950      | 0.44
SPONGENT-256 [14]  | 16                | 140                  | 11.43                        | 3281      | 10.62

Table 5: Hardware implementation results of some 384-bit hash functions.

Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM
SHAT-384      | 96                | 48                   | 200                          | 4753      | 88.53
SHA-384 [14]  | 1024              | 84                   | 1219.04                      | 43330     | 6.49

Table 6: Performance results of hash function using pipeline and parallelism.

Number of iteration rounds | Area (GE) | Area increase (%) | Delay (ns) | Delay reduction (%) | Power (µW) | Power increase (%)
48                         | 965       | 0.00              | 0.94       | 0.00                | 27.27      | 0.00
24                         | 2010      | 4.15              | 1.87       | 2.60                | 79.05      | 0.91
16                         | 3055      | 5.53              | 2.81       | 3.44                | 136.42     | 3.52
12                         | 4100      | 6.22              | 3.74       | 4.35                | 193.71     | 4.67
8                          | 6190      | 6.91              | 5.61       | 4.92                | 308.48     | 5.73
6                          | 8280      | 7.25              | 7.47       | 5.32                | 423.18     | 6.21
4                          | 12460     | 7.60              | 11.20      | 5.64                | 652.62     | 6.68
3                          | 16640     | 7.77              | 14.93      | 5.74                | 882.00     | 6.90
2                          | 25000     | 7.94              | 22.40      | 5.88                | 1340.80    | 7.12
1                          | 50080     | 8.12              | 44.70      | 6.31                | 2695.40    | 7.40

Table 7: Performance results of unrolling steps constructions.

Number of iteration rounds | Area (GE) | Delay (ns) | Power (µW) | Throughput at 10 MHz (Mbps)
48                         | 965       | 0.94       | 27.27      | 6.67
24                         | 1930      | 1.92       | 78.34      | 13.33
16                         | 2895      | 2.91       | 131.78     | 20.00
12                         | 3860      | 3.91       | 185.06     | 26.67
8                          | 5790      | 5.90       | 291.77     | 40.00
6                          | 7720      | 7.89       | 398.42     | 53.33
4                          | 11580     | 11.87      | 611.78     | 80.00
3                          | 15440     | 15.84      | 825.06     | 106.67
2                          | 23160     | 23.80      | 1251.70    | 160.00
1                          | 46320     | 47.71      | 2509.70    | 320.00
In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times larger, so the hardware efficiency of SHAT-384 is 13.64 times higher than that of SHA-384.
We then implemented the unfolding transformation technique with 10 different numbers of unrolled loops (1, 2, ..., 48) in 45 nm CMOS technology at 10 MHz to evaluate the performance of SHAT-128; the results are shown in Table 7. As Table 7 shows, the throughput of the Perm function can be up to 47.97 times higher than that of the original design, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.
Finally, we implemented the pipeline and parallelism technique to reconstruct the STEP block. As shown in Table 6, compared with the performance of the original circuit, the critical path delay is reduced by up to 6.31%, while the power and area increase by up to about 8%.
4. Low Power Design for Hash Function
Low power design is a significant consideration in hardware implementation. Power consumption determines a device's lifetime, reliability, and energy cost, so low power techniques are now applied to nearly every application. There are many methods to reduce power consumption, such as clock gating and power gating, which address dynamic power and leakage power, respectively. Decreasing the clock frequency also lowers power dissipation dramatically.
Firstly, we propose the frequency trade-off technique. Using this method, we obtain a range of frequency values for trading off low power consumption against high throughput of the hash function. Secondly, we construct a hash encryption system that includes an input data padding unit, RAM registers, the main hash computing construction, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle mode and control signals of this hash encryption system, a load-enable based clock gating scheme is applied to reduce the dynamic power consumption.
4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency is an effective method to decrease dynamic power dissipation linearly. In Section 2.2, we discussed the DVFS technique. By collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy; normally we treat the clock frequency as constant. Also, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. Therefore, there is a trade-off to consider: a high clock frequency brings high throughput, but the dramatically increased dynamic power consumption is a critical drawback; a low clock frequency minimizes dynamic power dissipation, but it decreases the throughput as well.
However, with the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases as the number of unrolled loops increases, while the throughput at a given clock frequency increases. This means that we can decrease the clock frequency while still increasing the throughput of the hash algorithm. Thus the unfolding transformation achieves high performance without a high clock frequency. Exploiting this advantage, by choosing a proper clock frequency we can make a trade-off between high performance and low power consumption.
Next, we explain how to obtain this range of frequency values from the two performance bounds. First, we obtain two values for the rolled Perm circuit: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is constrained by the circuit design (the clock period computed from $f_1$ must not be less than the critical path delay). Then, according to (8), we obtain the throughput $T_1$ at this frequency. The two performance bounds are defined in (13), where $n$ is the number of iteration rounds in one Perm function with rolled STEPs.

$$P_{\max} = P_1 \cdot n, \qquad T_{\min} = T_1 \qquad (13)$$
This method can be stated as follows. With reference to the performance of the original folded circuit (we assume that this circuit is the one with 48 iteration rounds in one Perm function), each unfolding transformation design with a different number of unrolled STEPs (2, 3, ..., 48) has two performance bounds: one is the maximum dynamic power and the other is the minimum throughput of the circuit. These two performance bounds determine the boundaries of the proper frequency range for each unfolded circuit. When we choose a specific clock frequency within this range, the total dynamic power consumption of the Perm function is not more than the defined maximum dynamic power $P_{\max}$, and its throughput is not less than the fixed minimum throughput $T_{\min}$.

Figure 8: Hash encryption system. [Block diagram: receiver, RAM, clock divider, main control, hash process, and LCD displayer; labeled signals: input message, n × br bits, output digest, and digest display.]
This clock frequency range gives us many choices for different circuit designs that use the unfolding transformation technique. The results of this frequency trade-off technique are shown in Table 9 in Section 4.4.
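The frequency range for each unfolded design can be derived from the two bounds in (13). The Python sketch below is one reading of the method and rests on assumptions that are not stated in this form in the text: dynamic power is taken to scale linearly with clock frequency, the power for a complete Perm call is taken as the per-round power times the number of rounds, and the per-round power values are the 10 MHz figures from Table 7. The function and constant names are hypothetical.

```python
# Dynamic power (in microwatts) at 10 MHz for each round count, from Table 7.
POWER_UW_AT_10MHZ = {48: 27.27, 24: 78.34, 16: 131.78, 12: 185.06, 8: 291.77,
                     6: 398.42, 4: 611.78, 3: 825.06, 2: 1251.70, 1: 2509.70}
REF_ROUNDS = 48       # rolled reference design
REF_FREQ_MHZ = 10.0   # clock of the reference design


def frequency_range(rounds: int):
    """Return (f_min, f_max) in MHz for a design with the given round count.

    f_min keeps the throughput at or above T_min (reference throughput);
    f_max keeps the total Perm power at or below P_max = P_1 * 48.
    """
    p_max = POWER_UW_AT_10MHZ[REF_ROUNDS] * REF_ROUNDS
    # Throughput is proportional to f / rounds, so T >= T_min gives f >= f_1 * rounds / 48.
    f_min = REF_FREQ_MHZ * rounds / REF_ROUNDS
    # ASSUMPTION: power scales linearly with f, so total power = P(10 MHz) * rounds * f / 10.
    total_at_ref = POWER_UW_AT_10MHZ[rounds] * rounds
    f_max = REF_FREQ_MHZ * p_max / total_at_ref
    return f_min, f_max
```

Under these assumptions the 24-round design may run anywhere between 5.00 MHz and roughly 6.96 MHz, which agrees with the tabulated range to within rounding.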
4.2. Hash Encryption System Design. The hash encryption system is divided into 5 main parts, as shown in Figure 8.

Firstly, the receiver and RAM section is our padding unit. We use serial communication to connect a PC and the hash encryption system, so we need a clock divider to generate a clock that is synchronous with the Baud rate of the serial link. We choose 4800 Baud as our transmission rate, which is a modest speed chosen for a low error rate (less than 3%). In this case, one Baud represents 1 bit. Each transmitted frame consists of one start bit "0", then an 8-bit message, and one finish bit "1". The start bit and finish bit are added to the transmitted message bits automatically. The sampling rate of the receiver is 16, and the FPGA board provides a 100 MHz clock. Thus the clock period used for sampling is 1302 times the period of the provided 100 MHz clock, as shown in (14). The resulting error is 0.0064%, which is less than 3%.

$$\text{Sampling clock cycles} = \frac{\text{Clock frequency}}{\text{Baud rate} \cdot \text{Sampling rate}} = \frac{100\ \text{MHz}}{16 \times 4800\ \text{B/s}} \approx 1302 \qquad (14)$$
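The divider value and its rounding error in (14) can be reproduced with a few lines of Python. This is a small worked calculation rather than part of the design; the function name `sampling_divider` is hypothetical.

```python
def sampling_divider(clock_hz: float, baud: int, oversample: int):
    """Return (integer divider, relative error in percent) for a UART sampling clock."""
    exact = clock_hz / (baud * oversample)   # ideal, generally non-integer ratio
    divider = round(exact)                   # value a hardware counter can use
    error_pct = abs(exact - divider) / exact * 100.0
    return divider, error_pct


# Values from the text: 100 MHz board clock, 4800 Baud, 16 samples per bit.
DIVIDER, ERROR_PCT = sampling_divider(100e6, 4800, 16)
```

The exact ratio is about 1302.08, so rounding to 1302 introduces an error of 0.0064%, far below the 3% tolerance quoted in the text.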
Because the liquid crystal display (LCD) limits the number of characters we can display to 32 hexadecimal characters, it suits the number of digest bits of SHAT-128. Our $br$ for each padded block is therefore 32 bits, which consist of eight 4-bit hexadecimal digits.
Secondly, the hash function introduced in Section 3 is designed as a sponge construction, as shown in Figure 4. After absorbing $n$ 32-bit message blocks, a 128-bit digest is squeezed out.
Finally, the main control unit manages the working order of the receiver, hash process, and LCD display. Figure 9 shows the pipelined operation of the system.
Because we use serial communication, the transfer is slow. We use 4800 Baud for a low error rate, so each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system performs one STEP per round, which means that there are 48 iteration rounds for a complete Perm function; even so, hash processing needs only roughly 6 µs. The LCD display period also takes considerable time: even though we can finish LCD initialization before we get the hash digest, we still need roughly 15 ms to display all the data.
4.3. Load-Enable Based Clock Gating. In this section, we introduce the load-enable based clock gating technique for the hash encryption system.
Clock gating is the most widely used low power technique at RTL. At RTL, it is more practical to control the toggle rate of gate outputs than any of the other three components, namely $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system operates as a pipeline. The finishing signal of each process can be treated as the enable signal in load-enable based clock gating, as shown in Figure 3. By contrast, the XOR-based clock gating technique needs to specify the outputs of single-level flip-flops, which are not easily determined in our encryption system; thus load-enable based clock gating is our best option as a low power method.
As shown in Figure 10, three signal pairs realize this load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency that corresponds to the serial communication, the main control unit does not gate the clock signal of the receiver directly; by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.
Figure 9 gives the three operation phases of the encryption system. In the first phase, the en_div and en_lcd signals are asserted to logic one and en_h is held at logic zero; thus the receiver starts receiving input messages and padding them into RAM. In the meantime, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because serial communication takes a long time, owing to the low Baud rate and to the fact that message bits are transmitted one by one, LCD displayer initialization can be finished before the padded message is ready. Thus en_lcd can be set to logic zero by the main control unit when fsh_lcd switches to logic one.

Figure 9: Three phases of hash encryption system.

Unit     | Phase one                  | Phase two       | Phase three
Receiver | Data receiving and padding | Idle            | Idle
Hash     | Idle                       | Hash processing | Idle
LCD      | Initialization, then idle  | Idle            | LCD displaying

Figure 10: Control signals of hash encryption system. [Block diagram: receiver, RAM, clock divider, main control, hash process, and LCD displayer, with input message, padded message, digest, and digest display paths, annotated with the control signals en_div, fsh_r, clk_r, en_h, fsh_h, en_lcd, and fsh_lcd.]
During the second phase, because the padded message is ready, fsh_r switches to logic one. Then en_div is set to zero, which turns off the clock divider; no serial clock is produced, so the receiver stops working. In this phase, en_h is asserted to logic one for hash processing, which is our core function, while en_lcd remains zero, waiting for the hash digest generated by the hash processing.
The system enters the third phase when the fsh_h signal switches to logic one. In this phase, the hash digest is ready, so both the receiver and the hash process are in idle mode, which means that en_div and en_h are both at logic zero. Signal en_lcd is asserted to logic one to start LCD displaying and is set back to zero when the displaying process is finished. This is the end of the whole process; the device is then turned off or repeats these three phases for another input message.
By analyzing the construction and operation of the hash encryption system, we can determine the idle time of each component. Applying load-enable based clock gating to each component then reduces the dynamic power dissipation of this system, as shown in Table 8 in Section 4.4.
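The enable sequencing described for the three phases can be modelled as a small state machine. The Python sketch below is a behavioural illustration only, not the RTL of the main control unit: the finish signals are treated as single-call pulses, LCD initialization is assumed to complete before the padded message is ready (as the text states), and the class name `GatingController` and the use of phase 0 to mean "finished" are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GatingController:
    """Behavioural model of the clock-enable signals en_div, en_h, and en_lcd."""
    phase: int = 1    # 1, 2, 3 as in Figure 9; 0 means the sequence has finished
    en_div: int = 1   # clock divider (and hence receiver) enabled in phase one
    en_h: int = 0     # hash process waits for the padded message
    en_lcd: int = 1   # LCD initialization runs in phase one

    def update(self, fsh_r: int = 0, fsh_h: int = 0, fsh_lcd: int = 0):
        """Apply finish pulses and return the enables (en_div, en_h, en_lcd)."""
        if self.phase == 1:
            if fsh_lcd:                 # LCD initialization done: gate its clock
                self.en_lcd = 0
            if fsh_r:                   # padded message ready: enter phase two
                self.phase, self.en_div, self.en_h = 2, 0, 1
        elif self.phase == 2:
            if fsh_h:                   # digest ready: enter phase three
                self.phase, self.en_h, self.en_lcd = 3, 0, 1
        elif self.phase == 3:
            if fsh_lcd:                 # display finished: everything idle
                self.phase, self.en_lcd = 0, 0
        return (self.en_div, self.en_h, self.en_lcd)
```

In every phase at most the blocks doing useful work have their clock enabled, which is the source of the dynamic power saving attributed to the load-enable scheme.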
4.4. Experimental Results. Using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput under the frequency trade-off method. Here $i$ stands for the number of iteration rounds, $f_i$ stands for frequency, $T_i$ stands for throughput, and $T_{i\mathrm{pct}}$ is the percentage increase compared with the minimum throughput ($T_{\min}$), which is 6.67 Mbps. $P_i$ is the total dynamic power consumed in completing one Perm function, and $P_{i\mathrm{pct}}$ is the percentage of power reduction compared with the maximum power consumption ($P_{\max}$), defined as 1308.96 µW, which is the product of 48 (the number of iteration rounds) and 27.27 µW (as shown in Table 7).

Table 8: Hardware implementation with and without load-enable based clock gating.

System type | Area (GE) | Area increase (%) | Delay (ns) | Delay increase (%) | Power (µW) | Power reduction (%)
Original    | 14053     | n/a               | 1.63       | n/a                | 1830.20    | n/a
Clock gated | 14565     | 3.64              | 1.72       | 5.52               | 1580.36    | 13.65

Table 9: Area and delay performances of frequency trade-off technique.

Number of iteration rounds | Area (GE) | Delay (ns) | Frequency (MHz)
48                         | 965       | 0.94       | 10.00
24                         | 1930      | 1.92       | 5.00 < f_24 < 6.96
16                         | 2895      | 2.91       | 3.33 < f_16 < 6.20
12                         | 3860      | 3.91       | 2.50 < f_12 < 5.89
8                          | 5790      | 5.90       | 1.67 < f_8 < 5.60
6                          | 7720      | 7.89       | 1.25 < f_6 < 5.47
4                          | 11580     | 11.87      | 0.83 < f_4 < 5.34
3                          | 15440     | 15.84      | 0.63 < f_3 < 5.28
2                          | 23160     | 23.80      | 0.42 < f_2 < 5.22
1                          | 46320     | 47.71      | 0.21 < f_1 < 5.21
Then we apply load-enable based clock gating schemeto hash encryption system by using 100MHz clock fre-quency which can be provided on FPGA board and 45 nmCMOS technology As shown in Table 8 the dynamic powerdecreases 1365 However 364 increased area and 552increased critical path delay are sacrificed
International Journal of Distributed Sensor Networks 11
Table 10: Dynamic power consumption of frequency trade-off technique. The largest reduction corresponds to the lowest frequency in each range.
Number of iteration rounds   Power (µW)                   Reduction (%)               Frequency (MHz)
48                           1308.96                      n/a                         10.00
24                           940.08 < P_24 < 1308.48      0.04 < P_24,pct < 28.18     5.00 < f_24 < 6.96
16                           702.88 < P_16 < 1307.20      0.13 < P_16,pct < 46.30     3.33 < f_16 < 6.20
12                           555.12 < P_12 < 1308.00      0.07 < P_12,pct < 57.59     2.50 < f_12 < 5.89
8                            389.04 < P_8 < 1307.12       0.14 < P_8,pct < 70.28      1.67 < f_8 < 5.60
6                            298.80 < P_6 < 1307.64       0.10 < P_6,pct < 77.17      1.25 < f_6 < 5.47
4                            203.92 < P_4 < 1306.76       0.17 < P_4,pct < 84.42      0.83 < f_4 < 5.34
3                            154.71 < P_3 < 1306.89       0.16 < P_3,pct < 88.18      0.63 < f_3 < 5.28
2                            104.30 < P_2 < 1306.76       0.17 < P_2,pct < 92.03      0.42 < f_2 < 5.22
1                            52.29 < P_1 < 1307.60        0.10 < P_1,pct < 96.01      0.21 < f_1 < 5.21
Table 11: Throughput performances of frequency trade-off technique.
Number of iteration rounds   Throughput (Mbps)         Improvement (%)              Frequency (MHz)
48                           6.67                      n/a                          10.00
24                           6.67 < T_24 < 9.28        0.00 < T_24,pct < 39.13      5.00 < f_24 < 6.96
16                           6.67 < T_16 < 12.4        0.00 < T_16,pct < 85.91      3.33 < f_16 < 6.20
12                           6.67 < T_12 < 15.71       0.00 < T_12,pct < 135.53     2.50 < f_12 < 5.89
8                            6.67 < T_8 < 22.40        0.00 < T_8,pct < 235.83      1.67 < f_8 < 5.60
6                            6.67 < T_6 < 29.17        0.00 < T_6,pct < 337.33      1.25 < f_6 < 5.47
4                            6.67 < T_4 < 42.72        0.00 < T_4,pct < 540.48      0.83 < f_4 < 5.34
3                            6.67 < T_3 < 56.32        0.00 < T_3,pct < 744.38      0.63 < f_3 < 5.28
2                            6.67 < T_2 < 83.52        0.00 < T_2,pct < 1152.17     0.42 < f_2 < 5.22
1                            6.67 < T_1 < 166.72       0.00 < T_1,pct < 2399.55     0.21 < f_1 < 5.21
5. Conclusion
In order to achieve a high performance and low power hardware implementation of a cryptographic hash function based on the sponge construction, we first use the unfolding transformation technique to improve the throughput of the hash function. Secondly, pipelining and parallelism design techniques are implemented to reduce the critical path delay by modifying the structure of the permutation function. Thirdly, a frequency trade-off technique is proposed to calculate a frequency range that can be used to trade off low dynamic power consumption against high throughput. Finally, a load-enable based clock gating scheme is applied to the hash encryption system to eliminate wasted signal toggling in idle mode.
The experimental results show that the unfolding transformation technique achieves up to 47.97 times higher throughput, the pipelining and parallelism methods give up to 6.31% delay reduction, the load-enable based clock gating scheme decreases dynamic power consumption by 13.65%, and the frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve low power consumption and high throughput.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).
References
[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1-4, Hong Kong, December 2008.
[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.
[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.
[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165-180, Amsterdam, The Netherlands, 2002.
[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82-92, 2005.
[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.
[7] "Sponge function," Wikipedia, http://en.wikipedia.org/wiki/Sponge_function.
[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.
[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.
[10] Y. Zhang, Q. Tong, L. Li, et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceedings of the International SoC Design Conference (ISOCC '12), pp. 21-24, Jeju Island, Republic of Korea, 2012.
[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia - a 128-bit block cipher," Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.
[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354-359, Steamboat Springs, Colo, USA, September 2006.
[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086-4089, Kobe, Japan, May 2005.
[14] S. Badel, N. Dagtekin, J. Nakahara Jr., et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398-412, 2010.
[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT, Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.
Step(S)
(i) For k = 0 to i − 1
    (a) S_{4k+3} = S_{4k+3} ⊕ r
    (b) S_{4k} = S_{4k} ⊕ S_{4k+1}
    (c) S_{4k+2} = S_{4k+2} ⊕ S_{4k+3}
    (d) S_{4k} = S_{4k} ⊕ G(S_{4k+2})
    (e) S_{4k+2} = S_{4k+2} ⊕ (S_{4k} <<< rot_k)
(ii) Temp = S_{4i−1}
(iii) For k = 4i − 1 to 1
    S_k = S_{k−1}
(iv) S_0 = Temp
Algorithm 1: Typical one step algorithm.
SHAT-(128 · i)(M)
Inputs: n padded message blocks M = (M_0, M_1, ..., M_{n−1})
Outputs: (128 · i)-bit hash value (H_0, H_1, H_2, H_3)
(1) S = (S_0, ..., S_{4i−1}) = (0, 0, ..., 0, 128 · i)    (initialization)
(2) Perm(S)
(3) For j = 0 to n − 1    (absorbing phase)
    (i) For k = 0 to i − 1
        S_{4k+3} = S_{4k+3} ⊕ M_{j,k}
    (ii) Perm(S)
(4) H_0 = SQUEEZE(S, i)    (squeezing phase)
(5) For k = 1 to 3
    (i) Perm(S)
    (ii) H_k = SQUEEZE(S, i)
Algorithm 2: SHAT-(128 · i).
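Algorithms 1 and 2 can be exercised together with a short software model. The sketch below is not the SHAT specification. The function `G`, the rotation amounts in `ROT`, the use of the round index as the constant r, and the words returned by `squeeze` are all placeholders chosen for illustration, because this section does not define them. Only the control flow (initialization, absorbing, squeezing, and 48 steps per Perm) follows the algorithms above.

```python
MASK = 0xFFFFFFFF          # state words are 32 bits wide
ROT = [7, 11, 13]          # hypothetical rot_k values (not from the paper)


def rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK


def G(x):
    """Placeholder nonlinear function; the real G is defined elsewhere."""
    return (x * 0x9E3779B1 + 1) & MASK


def step(S, r, i=1):
    """One STEP of Algorithm 1 on a state of 4*i words."""
    S = list(S)
    for k in range(i):
        S[4 * k + 3] ^= r
        S[4 * k] ^= S[4 * k + 1]
        S[4 * k + 2] ^= S[4 * k + 3]
        S[4 * k] ^= G(S[4 * k + 2])
        S[4 * k + 2] ^= rotl(S[4 * k], ROT[k])
    return [S[-1]] + S[:-1]   # S_0 = old S_{4i-1}, S_k = old S_{k-1}


def perm(S, i=1):
    """Perm: 48 STEPs; the round index is assumed to serve as r."""
    for r in range(48):
        S = step(S, r, i)
    return S


def squeeze(S, i=1):
    """Hypothetical SQUEEZE: read the i words that absorb message input."""
    return [S[4 * k + 3] for k in range(i)]


def shat(blocks, i=1):
    """Sponge flow of Algorithm 2; each block is a list of i 32-bit words."""
    S = perm([0] * (4 * i - 1) + [128 * i], i)   # initialization + Perm
    for block in blocks:                          # absorbing phase
        for k in range(i):
            S[4 * k + 3] ^= block[k]
        S = perm(S, i)
    H = [squeeze(S, i)]                           # squeezing phase
    for _ in range(3):
        S = perm(S, i)
        H.append(squeeze(S, i))
    return H
```

With i = 1 the model returns four 32-bit words, matching the 128-bit digest of SHAT-128. Every operation in `step` is invertible, so `perm` is a permutation of the state whatever placeholder is used for `G`.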
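Figure 5: A typical SHAT core. [Block diagram: input data enters the padding unit, padded data (n × 32 bits) is stored in RAM, the SHAT block with 32-bit wide registers processes it under a control unit, and the message digest extraction unit outputs the 4 × 32-bit (128-bit) message digest.]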
Here S_i (i = 0, 1, 2, 3) is the input of the current round and S'_i (i = 0, 1, 2, 3) is the output of this round (or the input of the next round). In order to distribute the 48 operations equally over the rounds, the possible values for the unfolding factor are the divisors of 48, that is, 1, 2, 3, 4, 6, 8, 12, 16, 24, and 48. For example, we can unfold two STEP operations in each round; then we get 24 rounds in one permutation process. The expression of throughput is given as

$$
\text{Throughput} = \frac{(\#\text{ of bits}) \cdot f_{\text{round}}}{\#\text{ of rounds}} \tag{8}
$$

Figure 6: Typical architecture of one STEP round. [Datapath from the inputs S0, S1, S2, S3 and the round counter r_i through XOR gates, the g function, and the ROT block to the outputs S0, S1, S2, S3.]

Considering (7), although this unfolding transformation reduces the maximum operation frequency, the throughput is increased significantly because the number of rounds is reduced from 48 to 24. The mathematical expression of one iteration round is replaced by
$$
\begin{aligned}
\text{temp}_3 &= \mathrm{ROT}(\text{temp}_1) \oplus (\text{temp}_0 \oplus S_2) \\
\text{temp}_2 &= S_1 \\
\text{temp}_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1) \\
\text{temp}_0 &= S_3 \oplus r \\
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus \text{temp}_2) \\
S'_2 &= \text{temp}_1 \\
S'_1 &= G(\text{temp}_3 \oplus r \oplus \text{temp}_2) \oplus (\text{temp}_0 \oplus \text{temp}_1) \\
S'_0 &= \text{temp}_3 \oplus r
\end{aligned}
\tag{9}
$$
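A quick way to check (9) is to confirm that the unfolded round produces the same state as two single STEP rounds applied in sequence. The sketch below does this in Python for the four-word state. `G` and the rotation amount are placeholders (assumptions), and the two round constants are passed separately as `r_a` and `r_b`, since the surrounding text refers to distinct counter values for the two STEPs.

```python
MASK = 0xFFFFFFFF


def rotl(x, n=7):
    """32-bit left rotation; the amount 7 is a placeholder for ROT."""
    return ((x << n) | (x >> (32 - n))) & MASK


def G(x):
    """Placeholder for the G function (not the real SHAT definition)."""
    return (x * 0x9E3779B1 + 1) & MASK


def step(S, r):
    """One STEP round on (S0, S1, S2, S3), written in explicit form."""
    S0, S1, S2, S3 = S
    n0 = S3 ^ r
    n1 = G(S3 ^ r ^ S2) ^ (S0 ^ S1)
    n2 = S1
    n3 = rotl(n1) ^ (n0 ^ S2)
    return (n0, n1, n2, n3)


def two_steps_unfolded(S, r_a, r_b):
    """Two STEPs in one round, following the structure of (9)."""
    S0, S1, S2, S3 = S
    t0 = S3 ^ r_a
    t1 = G(S3 ^ r_a ^ S2) ^ (S0 ^ S1)
    t2 = S1
    t3 = rotl(t1) ^ (t0 ^ S2)
    o0 = t3 ^ r_b
    o1 = G(t3 ^ r_b ^ t2) ^ (t0 ^ t1)
    o2 = t1
    o3 = rotl(o1) ^ (o0 ^ t2)
    return (o0, o1, o2, o3)
```

The equivalence holds for any choice of `G` and rotation amount, because the unfolded form only renames the intermediate state of the first STEP as temp_0 to temp_3.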
3.3.2. Pipeline and Parallelism. Suppose we unroll two STEP operations in each round. This lowers the maximum frequency while increasing the throughput; however, increased area is the penalty. If some logic can be executed in parallel, and this parallelism lies on the critical path, then the delay of each round decreases, so the operating frequency can be increased. According to (8), when the number of rounds is kept constant (and the number of bits is also constant), the throughput increases with frequency. This method could be used in the hardware implementation of any other hash function.
For example, Figure 7 shows the architecture of unfolding two STEP operations in one round, which has the minimum critical path delay. The critical path is composed of seven XOR gates and two G functions. By unfolding two STEPs in one round, the critical path grows by three 32-bit XOR gates and one G function compared with the architecture of one STEP block. The critical path is highlighted by the bold line.
In Figure 7, cycle counter r_{i+1} can be combined with temp_2 first and then XORed with temp_3 in the second STEP part. Compared with the first STEP part, where r_i is XORed with S_3 and then XORed with S_2, we can see that there is an additional component, which is used to make a calculation with temp_3 and r_{i+1}. Because of the mandatory output generation necessity, this area penalty cannot be avoided.
Thus when we increase the number of unfolded STEP operations, for example, to three, four, five, and so on, each round delay will increase by three 32-bit XOR gates and one G function. Therefore the normalized delay with unfolding factor n (n = 1, 2, 3, ...) is

$$
\text{Delay}_n = \frac{4 \cdot \text{Delay}(\oplus) + \text{Delay}(g) + (n-1) \cdot \bigl(3 \cdot \text{Delay}(\oplus) + \text{Delay}(g)\bigr)}{n} \tag{10}
$$

Taking the limit in n, (10) becomes

$$
\lim_{n \to \infty} \text{Delay}_n = 3 \cdot \text{Delay}(\oplus) + \text{Delay}(g) \tag{11}
$$

This is the delay bound of SHAT, which means that the delay of one SHAT operation round cannot be less than this bound.
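The behavior of (10) and (11) is easy to see numerically. The sketch below evaluates the normalized delay for a few unfolding factors; the gate delays used in the usage note are arbitrary illustrative values, not measurements from the paper.

```python
def normalized_delay(n, d_xor, d_g):
    """Per-STEP delay of a round that unfolds n STEPs, as in (10)."""
    if n < 1:
        raise ValueError("unfolding factor n must be at least 1")
    first_step = 4 * d_xor + d_g          # critical path of one STEP
    extra_step = 3 * d_xor + d_g          # added by each further STEP
    return (first_step + (n - 1) * extra_step) / n


def delay_bound(d_xor, d_g):
    """Limit of the normalized delay for large n, as in (11)."""
    return 3 * d_xor + d_g
```

For example, with `d_xor = 1.0` and `d_g = 2.0` the normalized delay is 6.0 for n = 1 and 5.5 for n = 2, and it approaches the bound of 5.0 from above as n grows.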
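Figure 7: Proposed architecture of two STEPs round. [Datapath from the inputs S0, S1, S2, S3 through two cascaded STEP stages built from XOR gates, two g functions, and two ROT blocks, with round counters r_i and r_{i+1} and intermediate values temp0, temp1, temp2, and temp3, to the outputs S0, S1, S2, S3.]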
3.4. Experimental Results. We introduce a measurement of hardware efficiency in (12) [14]. It is an improvement on the normal figure of merit (FOM). We assume that power is proportional to the gate count; we can then divide the metric by GE a second time, instead of by power dissipation, when we want to trade off throughput for power. Note that one gate equivalent (GE) is equal to the area of a two-input NAND gate in 45 nm CMOS technology.

$$
\text{FOM} = \frac{\text{Throughput}}{\text{GE}^2} \tag{12}
$$
Table 3 shows the hardware implementation results of some 128-bit hash functions using a 100 kHz clock frequency and 45 nm CMOS technology. Firstly, the throughput of SHAT-128 (66.67 kbps) is less than that of five other hash implementations: MD4 (112.28 kbps), MD5 (83.66 kbps), H-Present-128-32-round (200 kbps), and ARMADILLO2-B (250 kbps and 1000 kbps). However, the area of SHAT-128 is on average only 28.42% of the area of these hash functions. As a result, the hardware efficiency of SHAT-128 is 13.12 times higher on average. Secondly, the area of SHAT-128 (1605 GE) is larger than that of three hash implementations, namely, U-QUARK-544-round (1379 GE), PHOTON-128-996-round (1122 GE), and SPONGENT-128-8-bit-2380-round (1060 GE); however, the throughput of SHAT-128 is 94.27 times higher. Thus the FOM of SHAT-128 is 46.75 times higher on average. Finally, the area of SHAT-128 (1605 GE) is less than that of four other hash implementations, namely, H-Present-128-559-round (2330 GE), U-QUARK-68-round (2392 GE), PHOTON-128-156-round (1708 GE), and SPONGENT-128-70-round (1687 GE). The throughput of SHAT-128 is also 5.95 times higher than that of these four on average. As a result, the FOM of SHAT-128 is 9.66 times higher on average.
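The FOM column of Table 3 can be reproduced from the throughput and area columns. The scaling used below is an inference from the tabulated numbers rather than something stated in the text: most rows of Table 3, including SHAT-128, MD4, and U-QUARK, are consistent with FOM expressed in nanobits per clock cycle per GE squared.

```python
def fom(throughput_kbps, clock_khz, area_ge):
    """Figure of merit (12), scaled to nanobits per cycle per GE^2 (assumed unit)."""
    bits_per_cycle = throughput_kbps / clock_khz
    return bits_per_cycle / area_ge ** 2 * 1e9


# Rows taken from Table 3: (throughput in kbps at 100 kHz, area in GE)
shat_128 = fom(66.67, 100, 1605)    # about 258.8
md4 = fom(112.28, 100, 7350)        # about 20.78
u_quark_544 = fom(1.47, 100, 1379)  # about 7.73
```

Dividing by the area twice penalizes large designs heavily, which is why SHAT-128 can rank above MD4 in FOM even though MD4 has the higher raw throughput.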
In Table 4, firstly, the throughput of SHAT-256 is 51.05% of that of Grostl; however, the area of SHAT-256 is only 21.84% of that of Grostl, which results in 84.47 times higher hardware efficiency for SHAT-256. Secondly, the throughput of SHAT-256 (3193 GE) is on average 412.91 times higher than that of two hash implementations, PHOTON-256-156-round (2177 GE) and SPONGENT-256-9520-round (1950 GE); although the area of SHAT-256 is larger, its FOM is still 158.25 times higher than that of these two. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256-156-round, and SPONGENT-256-140-round, the throughput of SHAT-256 is 4.65 times higher on average, and its area is on average only 49.15% of theirs. Therefore the FOM of SHAT-256 is 119.14 times higher on average.
In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times higher. As a result, the hardware efficiency of SHAT-384 is 13.64 times higher than that of SHA-384.
Then we implement the unfolding transformation technique with 10 different numbers of unrolled loops (1, 2, ..., 48), using 45 nm CMOS technology at 10 MHz, to evaluate the performance of SHAT-128; the results are shown in Table 7. As Table 7 shows, the throughput of the PERM function can reach up to 47.97 times that of the original design, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.
Finally, we implement the pipeline and parallelism technique to reconstruct the STEP block. As shown in Table 6, compared with the original circuit, the critical path delay is reduced by at most 6.31%, while the power and area increase by about 8% at most.

Table 3: Hardware implementation results of some 128-bit hash functions.
Hash function        Block size (bits)   Number of operations   Throughput at 100 kHz (kbps)   Area (GE)   FOM
SHAT-128             32                  48                     66.67                          1605        258.80
H-Present-128 [15]   128                 559                    11.45                          2330        21.09
H-Present-128 [15]   128                 32                     200                            4256        110.41
MD4 [15]             512                 456                    112.28                         7350        20.78
MD5 [15]             512                 612                    83.66                          8400        11.86
ARMADILLO2-B [15]    64                  256                    250                            4353        13.19
ARMADILLO2-B [15]    64                  64                     1000                           6025        27.55
U-QUARK [15]         8                   544                    1.47                           1379        7.73
U-QUARK [15]         8                   68                     11.76                          2392        20.56
PHOTON-128 [15]      16                  996                    1.61                           1122        12.78
PHOTON-128 [15]      16                  156                    10.26                          1708        35.15
SPONGENT-128 [15]    8                   2380                   0.34                           1060        2.99
SPONGENT-128 [15]    16                  70                     11.43                          1687        40.16

Table 4: Hardware implementation results of some 256-bit hash functions.
Hash function        Block size (bits)   Number of operations   Throughput at 100 kHz (kbps)   Area (GE)   FOM
SHAT-256             64                  48                     133.33                         3193        130.78
SHA-256 [14]         512                 490                    104.48                         8588        14.17
ARMADILLO2-E [14]    128                 512                    25                             8653        3.34
ARMADILLO2-E [14]    128                 128                    100                            11914       7.05
BLAKE [14]           32                  816                    72.79                          13575       0.21
Grostl [14]          64                  196                    261.14                         14622       1.53
PHOTON-256 [14]      32                  156                    3.21                           2177        6.78
PHOTON-256 [14]      32                  156                    20.51                          4362        10.17
SPONGENT-256 [14]    16                  9520                   0.17                           1950        0.44
SPONGENT-256 [14]    16                  140                    11.43                          3281        10.62

Table 5: Hardware implementation results of some 384-bit hash functions.
Hash function        Block size (bits)   Number of operations   Throughput at 100 kHz (kbps)   Area (GE)   FOM
SHAT-384             96                  48                     200                            4753        88.53
SHA-384 [14]         1024                84                     1219.04                        43330       6.49

Table 6: Performance results of hash function using pipeline and parallelism.
Number of iteration rounds   Area (GE)   Area increase (%)   Delay (ns)   Delay reduction (%)   Power (µW)   Power increase (%)
48                           965         0.00                0.94         0.00                  27.27        0.00
24                           2010        4.15                1.87         2.60                  79.05        0.91
16                           3055        5.53                2.81         3.44                  136.42       3.52
12                           4100        6.22                3.74         4.35                  193.71       4.67
8                            6190        6.91                5.61         4.92                  308.48       5.73
6                            8280        7.25                7.47         5.32                  423.18       6.21
4                            12460       7.60                11.20        5.64                  652.62       6.68
3                            16640       7.77                14.93        5.74                  882.00       6.90
2                            25000       7.94                22.40        5.88                  1340.80      7.12
1                            50080       8.12                44.70        6.31                  2695.40      7.40

Table 7: Performance results of unrolling steps constructions.
Number of iteration rounds   Area (GE)   Delay (ns)   Power (µW)   Throughput at 10 MHz (Mbps)
48                           965         0.94         27.27        6.67
24                           1930        1.92         78.34        13.33
16                           2895        2.91         131.78       20.00
12                           3860        3.91         185.06       26.67
8                            5790        5.90         291.77       40.00
6                            7720        7.89         398.42       53.33
4                            11580       11.87        611.78       80.00
3                            15440       15.84        825.06       106.67
2                            23160       23.80        1251.70      160.00
1                            46320       47.71        2509.70      320.00
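The throughput column of Table 7 follows directly from (8) with a 32-bit block and a fixed 10 MHz clock. The short sketch below recomputes it; the function name `throughput_mbps` is introduced here for illustration only.

```python
def throughput_mbps(block_bits, clock_mhz, rounds):
    """Throughput from (8): bits per block times clock rate over rounds per Perm."""
    return block_bits * clock_mhz / rounds


ROUNDS = [48, 24, 16, 12, 8, 6, 4, 3, 2, 1]

# SHAT-128 absorbs 32 bits per Perm call; Table 7 uses a 10 MHz clock.
table7_throughput = {n: round(throughput_mbps(32, 10, n), 2) for n in ROUNDS}
```

The fully unrolled design (one round) reaches 320 Mbps, 48 times the exact baseline of 32 × 10 / 48 Mbps; the reported factor of 47.97 appears to result from dividing by the rounded baseline of 6.67 Mbps.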
4. Low Power Design for Hash Function
Low power design is a significant consideration in hardware implementation. Power consumption determines a device's lifetime, reliability, and energy cost. Thus low power techniques are now applied to almost every application. There are many methods to reduce power consumption, such as clock gating and power gating, which address dynamic power and leakage power, respectively. Lowering the frequency also reduces power dissipation dramatically.
Firstly, we propose the frequency trade-off technique. Using this method, we obtain a range of frequency values for making a trade-off between low power consumption and high throughput of the hash function. Secondly, we construct a hash encryption system that includes an input data padding unit, RAM, registers, the main hash computing block, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle modes and control signals of this hash encryption system, a load-enable based clock gating scheme is applied to reduce the dynamic power consumption.
4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency is an effective method to decrease dynamic power dissipation linearly. In Section 2.2 we discussed the DVFS technique. By collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy; normally we treat the clock frequency as constant. Also, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. Therefore there is an issue we need to consider: a high clock frequency brings high throughput, but the dramatically increased dynamic power consumption is a critical drawback. A low clock frequency minimizes the dynamic power dissipation, but it decreases the throughput as well.
However, according to the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases as the number of unrolled loops increases. It means that we can decrease the clock frequency while increasing the throughput of the hash algorithm. Thus the unfolding transformation technique achieves high performance without a high clock frequency. Exploiting this advantage, by choosing a proper clock frequency we can make a trade-off between high performance and low power consumption.
Next we explain how to get this scope of frequency values from the two performance bounds. For example, first we obtain two values for the rolled Perm circuit: the dynamic power consumption P_1 and the clock frequency f_1, which is constrained by the circuit design (the clock period computed from f_1 must not be less than the critical path delay). Then according to (8) we can get the throughput T_1 at this frequency. Thus the two performance bounds are defined in (13), where n is the number of iteration rounds in one Perm function with rolled STEPs.

$$
P_{\max} = P_1 \cdot n, \qquad T_{\min} = T_1 \tag{13}
$$
This method can be defined as follows. Referring to the performance of the original folded circuit (we assume that this circuit is the one with 48 iteration rounds in one Perm function), each unfolding transformation design with a different number of unrolled STEPs (2, 3, ..., 48) has two performance bounds: one is the maximum dynamic power and the other is the minimum throughput of the circuit. These two performance bounds determine the boundary of the proper frequency range for each unfolding transformation circuit. It means that when we choose a specific clock frequency within this range, the total dynamic power consumption of that PERM function will not exceed the defined maximum dynamic power P_max and its throughput will not be less than the fixed minimum throughput T_min.

Figure 8: Hash encryption system. [Block diagram: the input message enters the receiver, which is clocked through the clock divider and an AND gate; padded data (n × br bits) is stored in RAM and passed to the hash process; the output digest is sent to the LCD displayer for the digest display; the main control unit coordinates all blocks.]

This clock frequency scope gives us many different choices for different circuit designs using the unfolding transformation technique. The results of this frequency trade-off technique are shown in Table 9 in Section 4.4.
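The frequency window for each unfolded design can be derived from the two bounds in (13). The sketch below rests on an assumed reading of the method, which the text does not spell out: the total power of one Perm call is taken as the per-round power from Table 7 multiplied by the number of rounds, and dynamic power is taken to scale linearly with clock frequency. Under these assumptions the bounds in Table 9 are reproduced to within rounding.

```python
# Per-round dynamic power at 10 MHz from Table 7, keyed by rounds per Perm (µW)
TABLE7_POWER_UW = {48: 27.27, 24: 78.34, 16: 131.78, 12: 185.06, 8: 291.77,
                   6: 398.42, 4: 611.78, 3: 825.06, 2: 1251.70, 1: 2509.70}
F_REF_MHZ = 10.0    # clock used for Table 7
BLOCK_BITS = 32     # bits absorbed per Perm call in SHAT-128


def frequency_window(rounds):
    """Return (f_low, f_high) in MHz for a design with the given round count."""
    p_max = TABLE7_POWER_UW[48] * 48            # bound (13): P_max = P_1 * n
    t_min = BLOCK_BITS * F_REF_MHZ / 48         # bound (13): T_min = T_1
    # Throughput bound: BLOCK_BITS * f / rounds >= t_min
    f_low = t_min * rounds / BLOCK_BITS
    # Power bound: (per-round power * rounds) * f / F_REF <= p_max
    total_power_at_ref = TABLE7_POWER_UW[rounds] * rounds
    f_high = F_REF_MHZ * p_max / total_power_at_ref
    return f_low, f_high
```

For the 24-round design this gives a window of about 5.00 MHz to 6.96 MHz, which matches the corresponding row of Table 9; any clock chosen inside the window keeps throughput at or above T_min while total power stays at or below P_max.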
4.2. Hash Encryption System Design. The hash encryption system is divided into 5 main parts, as shown in Figure 8.
Firstly, the receiver and RAM section is our padding unit. We use serial communication to connect the PC and the hash encryption system. Thus we need a clock divider to generate a clock that is synchronous with the Baud rate of the serial link. We choose 4800 Baud as our transmission rate, a modest speed chosen for a low error rate (less than 3%). In this case one Baud represents 1 bit. Each frame consists of one start bit "0", then an 8-bit message, and one stop bit "1". The start and stop bits are added to the transmitted message bits automatically. The sampling rate of the receiver is 16, and the FPGA board provides a 100 MHz clock. Thus the clock period used for sampling is 1302 times the period of the provided 100 MHz clock, as shown in (14). The resulting error is 0.0064%, which is less than 3%.

$$
\text{Sampling Clock Cycles} = \frac{\text{Clock Frequency}}{\text{Baud rate} \cdot \text{Sampling Rate}} = \frac{100\,\text{MHz}}{16 \times 4800\,\text{B/s}} \approx 1302 \tag{14}
$$
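The divider value and its rounding error in (14) can be checked with a few lines of Python. The helper name `sampling_divider` is introduced here for illustration.

```python
def sampling_divider(clock_hz, baud, oversampling):
    """Return the integer clock divider and its relative rounding error in percent."""
    exact = clock_hz / (baud * oversampling)   # ideal, generally non-integer
    divider = round(exact)                     # value a hardware counter can use
    error_pct = abs(exact - divider) / exact * 100
    return divider, error_pct


divider, error_pct = sampling_divider(100_000_000, 4800, 16)
```

With a 100 MHz clock, 4800 Baud, and 16 samples per bit, the ideal ratio is about 1302.08, so rounding to 1302 leaves an error of 0.0064%, in agreement with the value quoted above.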
Because the liquid crystal display (LCD) limits the number of characters we can display to 32 hexadecimal characters, this number suits the digest length of SHAT-128. Thus br for each padded block is set to 32 bits, which consist of eight 4-bit hexadecimal numbers.
Secondly, the hash function that we introduced in Section 3 is designed as a sponge construction, as shown in Figure 4. After absorbing n 32-bit message blocks, a 128-bit digest is squeezed out.
Finally, the main control unit manages the working order of the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.
Because we use serial communication, the speed is low. We use 4800 Baud for a low error rate; thus each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system executes one STEP per round, which means that there are 48 iteration rounds in a complete Perm function; even so, hash processing needs only roughly 6 µs. LCD displaying also takes considerable time: even though we can finish LCD initialization before the hash digest is available, we still need roughly 15 ms to display all the data.
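The orders of magnitude quoted here can be approximated with a simple estimate. The sketch below is one plausible way to arrive at them, under assumptions that are not stated in the text: only payload bits are counted for the serial link, the hash core runs at the 100 MHz board clock with one round per clock cycle, and the number of Perm calls follows Algorithm 2 (one at initialization, one per block, and three during squeezing).

```python
def receive_time_ms(blocks, block_bits=32, baud=4800):
    """Serial transfer time for the payload bits only (framing bits ignored)."""
    return blocks * block_bits / baud * 1000


def hash_time_us(blocks, clock_mhz=100, rounds_per_perm=48, squeezes=4):
    """Hash processing time, assuming one round per clock cycle."""
    perm_calls = 1 + blocks + (squeezes - 1)   # init + absorb + extra squeezes
    return perm_calls * rounds_per_perm / clock_mhz
```

For seven blocks this gives about 46.7 ms of receiving against about 5.3 µs of hashing, close to the "roughly 50 ms" and "roughly 6 µs" figures. The hash core is therefore idle for almost the entire receive phase, which is the situation clock gating targets.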
4.3. Load-Enable Based Clock Gating. In this section we introduce the load-enable based clock gating technique for the hash encryption system.
Clock gating is the most widely used low power technique at RTL. At RTL it is more practical to control the toggle rate of gate outputs than any of the other three factors, namely V_DD, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system operates as a pipeline. The finishing signal of each process can be treated as the enable signal in load-enable based clock gating, as shown in Figure 3. On the other hand, the XOR-based clock gating technique needs to specify the outputs of single-level flip-flops, which are not easily determined in our encryption system; thus load-enable based clock gating is our best option for a low power method.
As shown in Figure 10, there are three signal pairs to realize this load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency that corresponds to the serial communication, the main control unit does not gate the clock signal of the receiver directly; by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.
Figure 9 gives the three operation phases of the encryption system. In the first phase, the en_div and en_lcd signals are asserted to logic one and en_h is held at logic zero; thus the receiver starts receiving input messages and padding them into RAM. In the meantime, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because serial communication takes a long time, owing to the low Baud rate and the bit-by-bit transmission, LCD displayer initialization can be finished before the padded message is ready. Thus en_lcd can be set to logic zero by the main control unit when fsh_lcd switches to logic one.

Figure 9: Three phases of hash encryption system.
            Phase one                     Phase two          Phase three
Receiver    Data receiving and padding    Idle               Idle
Hash        Idle                          Hash processing    Idle
LCD         Initialization, then idle     Idle               LCD displaying

Figure 10: Control signals of hash encryption system. [Block diagram of Figure 8 annotated with the control signals between the main control unit and the other blocks: en_div and fsh_r for the clock divider and receiver (receiver clock clk_r), en_h and fsh_h for the hash process, and en_lcd and fsh_lcd for the LCD displayer; data paths: input message, padded message, digest, and digest display.]
During the second phase because the padded messageis ready then 119891119904ℎ 119903 switches to logic one Then 119890119899 119889119894V isasserted to zero which means that clock divider is turnedoff then no specific clock frequency is produced thus thereceiver will stop working In this phase 119890119899 ℎ is asserted tologic one for hash encryption which is our core function119890119899 119897119888119889 is still zero waiting for the hash digest generated byhash processing
This system will enter the third phase when the 119891119904ℎ ℎsignal switches to logic one In this phase hash digest isready thus both receiver and hash processes are in idle modewhich means that 119890119899 119889119894V and 119890119899 ℎ are all asserted to logiczero Signal 119890119899 119897119888119889 will be asserted to logic one to start LCDdisplaying 119890119899 119897119888119889 will be asserted back to zero when thedisplaying process is finished This is the end of the wholesystem then the device will be turned off or repeats thesethree phases for another input message
By analyzing the construction and process of hashencryption system we can figure out the idle time foreach component Then applying the load-enable based clockgating to each component the dynamic power dissipation ofthis system can be properly reduced as shown in Table 8 inSection 44
44 Experimental Results By using 10MHz clock frequencyand 45 nm CMOS technology the results of frequency trade-off technique are shown in Tables 9 10 and 11 Table 9 showsthat the area and critical path delay are not changed compar-ing with the unfolding transformation technique Tables 10
Table 8 Hardware implementationwithwithout load-enable basedclock gating
Systemtype
Area Delay Power
(GE) Increase() (ns) Increase
() (120583W) Reduction()
Original 14053 na 163 na 183020 naClockgated 14565 364 172 552 158036 1365
Table 9 Area and delay performances of frequency trade-offtechnique
Number ofiteration rounds
Area(GE)
Delay(ns)
Frequency(MHz)
48 965 094 100024 1930 192 500 lt 119891
24lt 696
16 2895 291 333 lt 11989116lt 620
12 3860 391 250 lt 11989112lt 589
8 5790 590 167 lt 1198918lt 560
6 7720 789 125 lt 1198916lt 547
4 11580 1187 083 lt 1198914lt 534
3 15440 1584 063 lt 1198913lt 528
2 23160 2380 042 lt 1198912lt 522
1 46320 4771 021 lt 1198911lt 521
and 11 give us the variation of dynamic power consumptionand throughput with frequency trade-off method Note that119891119894stands for frequency 119879
119894stands for throughput and 119879
119894pct isthe percentage of increasing comparing with the minimumthroughput (119879min) which is 667Mbps 119875
119894means the total
dynamic power consumption by finishing a complete Permfunction and 119875
119894pct is the percentage of power reductioncomparing with the maximum power consumption (119875max)defined as 130896 120583W which is calculated from the productof 48 (number of iteration rounds) and 2727120583W(as shown inTable 7)Note that 119894 stands for the number of iteration rounds
Then we apply the load-enable based clock gating scheme to the hash encryption system using a 100 MHz clock frequency, which the FPGA board can provide, and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%. However, this costs a 3.64% increase in area and a 5.52% increase in critical path delay.
Table 10: Dynamic power consumption of frequency trade-off technique.

| Number of iteration rounds | Power (μW) | Reduction (%) | Frequency (MHz) |
|---|---|---|---|
| 48 | 1308.96 | n/a | 10.00 |
| 24 | 940.08 < P_24 < 1308.48 | 28.18 > P_24pct > 0.04 | 5.00 < f_24 < 6.96 |
| 16 | 702.88 < P_16 < 1307.20 | 46.30 > P_16pct > 0.13 | 3.33 < f_16 < 6.20 |
| 12 | 555.12 < P_12 < 1308.00 | 57.59 > P_12pct > 0.07 | 2.50 < f_12 < 5.89 |
| 8 | 389.04 < P_8 < 1307.12 | 70.28 > P_8pct > 0.14 | 1.67 < f_8 < 5.60 |
| 6 | 298.80 < P_6 < 1307.64 | 77.17 > P_6pct > 0.10 | 1.25 < f_6 < 5.47 |
| 4 | 203.92 < P_4 < 1306.76 | 84.42 > P_4pct > 0.17 | 0.83 < f_4 < 5.34 |
| 3 | 154.71 < P_3 < 1306.89 | 88.18 > P_3pct > 0.16 | 0.63 < f_3 < 5.28 |
| 2 | 104.30 < P_2 < 1306.76 | 92.03 > P_2pct > 0.17 | 0.42 < f_2 < 5.22 |
| 1 | 52.29 < P_1 < 1307.60 | 96.01 > P_1pct > 0.10 | 0.21 < f_1 < 5.21 |
Table 11: Throughput performances of frequency trade-off technique.

| Number of iteration rounds | Throughput (Mbps) | Improvement (%) | Frequency (MHz) |
|---|---|---|---|
| 48 | 6.67 | n/a | 10.00 |
| 24 | 6.67 < T_24 < 9.28 | 0.00 < T_24pct < 39.13 | 5.00 < f_24 < 6.96 |
| 16 | 6.67 < T_16 < 12.40 | 0.00 < T_16pct < 85.91 | 3.33 < f_16 < 6.20 |
| 12 | 6.67 < T_12 < 15.71 | 0.00 < T_12pct < 135.53 | 2.50 < f_12 < 5.89 |
| 8 | 6.67 < T_8 < 22.40 | 0.00 < T_8pct < 235.83 | 1.67 < f_8 < 5.60 |
| 6 | 6.67 < T_6 < 29.17 | 0.00 < T_6pct < 337.33 | 1.25 < f_6 < 5.47 |
| 4 | 6.67 < T_4 < 42.72 | 0.00 < T_4pct < 540.48 | 0.83 < f_4 < 5.34 |
| 3 | 6.67 < T_3 < 56.32 | 0.00 < T_3pct < 744.38 | 0.63 < f_3 < 5.28 |
| 2 | 6.67 < T_2 < 83.52 | 0.00 < T_2pct < 1152.17 | 0.42 < f_2 < 5.22 |
| 1 | 6.67 < T_1 < 166.72 | 0.00 < T_1pct < 2399.55 | 0.21 < f_1 < 5.21 |
5. Conclusion

To achieve a high performance and low power hardware implementation of a cryptographic hash function that uses the sponge construction, we apply four techniques. Firstly, we use the unfolding transformation technique to improve the throughput of the hash function. Secondly, pipeline and parallelism design techniques reduce the critical path delay by modifying the structure of the permutation function. Thirdly, the frequency trade-off technique calculates a frequency range that can be used to trade off low dynamic power consumption against high throughput. Finally, the load-enable based clock gating scheme is applied to the hash encryption system to eliminate wasted signal toggling in idle mode.

The experimental results show that the unfolding transformation technique achieves up to 47.97 times higher throughput, the pipeline and parallelism methods give a 6.31% delay reduction, the load-enable based clock gating scheme reduces dynamic power consumption by 13.65%, and the frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve low power consumption and high throughput.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).
References
[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1–4, Hong Kong, December 2008.
[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.
[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.
[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165–180, Amsterdam, The Netherlands, 2002.
[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82–92, 2005.
[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.
[7] "Sponge function," Wikipedia, http://en.wikipedia.org/wiki/Sponge_function.
[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.
[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.
[10] Y. Zhang, Q. Tong, L. Li, et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceedings of the International SoC Design Conference (ISOCC '12), pp. 21–24, Jeju Island, Republic of Korea, 2012.
[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia - a 128-bit block cipher," Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.
[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354–359, Steamboat Springs, Colo, USA, September 2006.
[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086–4089, Kobe, Japan, May 2005.
[14] S. Badel, N. Dagtekin, J. Nakahara Jr., et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398–412, 2010.
[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT, Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.
numbers are reduced from 48 to 24. The mathematical expression of one iteration round is replaced by
\[
\begin{aligned}
\mathrm{temp}_3 &= \mathrm{ROT}(\mathrm{temp}_1) \oplus (\mathrm{temp}_0 \oplus S_2),\\
\mathrm{temp}_2 &= S_1,\\
\mathrm{temp}_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1),\\
\mathrm{temp}_0 &= S_3 \oplus r,\\
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus \mathrm{temp}_2),\\
S'_2 &= \mathrm{temp}_1,\\
S'_1 &= G(\mathrm{temp}_3 \oplus r \oplus \mathrm{temp}_2) \oplus (\mathrm{temp}_0 \oplus \mathrm{temp}_1),\\
S'_0 &= \mathrm{temp}_3 \oplus r.
\end{aligned}
\tag{9}
\]
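To make the structure of (9) concrete, the following sketch checks that one unfolded round equals two consecutive single STEPs. It is an illustration only: `rot` and `g` are hypothetical placeholders (the real ROT and G functions of SHAT are not defined in this section), the single-STEP form is inferred from the second half of (9), and the two cycle-counter values are passed separately as `r_a` and `r_b`.

```python
MASK = 0xFFFFFFFF  # 32-bit words


def rot(x):
    """Hypothetical placeholder for ROT: rotate left by one bit."""
    return ((x << 1) | (x >> 31)) & MASK


def g(x):
    """Hypothetical placeholder for G: any fixed 32-bit mapping works here."""
    return (x * 0x9E3779B1 + 1) & MASK


def step(state, r):
    """One STEP, in the form implied by the second half of (9)."""
    s0, s1, s2, s3 = state
    n0 = s3 ^ r
    n1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    n2 = s1
    n3 = rot(n1) ^ (n0 ^ s2)
    return (n0, n1, n2, n3)


def two_step_round(state, r_a, r_b):
    """One unfolded round following (9), using temp0..temp3 as intermediates."""
    s0, s1, s2, s3 = state
    temp0 = s3 ^ r_a
    temp1 = g(s3 ^ r_a ^ s2) ^ (s0 ^ s1)
    temp2 = s1
    temp3 = rot(temp1) ^ (temp0 ^ s2)
    n0 = temp3 ^ r_b
    n1 = g(temp3 ^ r_b ^ temp2) ^ (temp0 ^ temp1)
    n2 = temp1
    n3 = rot(n1) ^ (n0 ^ temp2)
    return (n0, n1, n2, n3)


state = (0x01234567, 0x89ABCDEF, 0xDEADBEEF, 0x0BADF00D)
assert two_step_round(state, 1, 2) == step(step(state, 1), 2)
```

Because the unfolded round is functionally identical to two STEPs, a 48-round permutation can be run as 24 unfolded rounds without changing its output; only the cycle count and the combinational depth per cycle change.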
3.3.2. Pipeline and Parallelism. Suppose we unroll two STEP operations in each round. This lowers the clock frequency while increasing the throughput, but it incurs increased area as a penalty. If some logic can be computed in parallel, and this parallelism lies on the critical path, then the delay of each round decreases, so the operating frequency can be raised. According to (8), when the number of operations and the number of bits are kept constant, the throughput increases with frequency. This method could be used in other hardware implementations of hash functions.
For example, Figure 7 shows the architecture that unfolds two STEP operations in one round with the minimum critical path delay. The critical path is composed of seven XOR gates and two G functions. By unfolding two STEPs in one round, the critical path gains three 32-bit XOR gates and one G function compared with the architecture of one STEP block. The critical path is highlighted by a bold line.
In Figure 7, the cycle counter $r_{i+1}$ can be combined with $\mathrm{temp}_2$ first and then XORed with $\mathrm{temp}_3$ in the second STEP part. Compared with the first STEP part, where $r_i$ is XORed with $S_3$ and then XORed with $S_2$, there is one additional component, which is used to combine $\mathrm{temp}_3$ and $r_{i+1}$. Because this output must be generated, the area penalty cannot be avoided.
Thus, when we increase the number of unfolded STEP operations (for example, to three, four, or five), each round's delay increases by three 32-bit XOR gates and one G function per additional STEP. Therefore, the normalized delay with unfolding factor $n$ ($n = 1, 2, 3, \ldots$) is

\[
\mathrm{Delay}_n = \frac{4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) + (n-1) \cdot \bigl(3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g)\bigr)}{n}.
\tag{10}
\]

Taking the limit in $n$, (10) becomes

\[
\lim_{n \to \infty} \mathrm{Delay}_n = 3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g).
\tag{11}
\]

This is the delay bound of SHAT, which means that the delay of one SHAT operation round cannot be less than this bound.
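A short numeric sketch of (10) and (11) may help. The gate delays used below are purely illustrative values, not figures from the paper, and the function names are hypothetical.

```python
def normalized_delay(n, d_xor, d_g):
    """Per-STEP delay of a round that unfolds n STEPs, following (10)."""
    if n < 1:
        raise ValueError("unfolding factor n must be at least 1")
    total = 4 * d_xor + d_g + (n - 1) * (3 * d_xor + d_g)
    return total / n


def delay_bound(d_xor, d_g):
    """Limit of the normalized delay as n grows, following (11)."""
    return 3 * d_xor + d_g


# Illustrative (assumed) gate delays in nanoseconds.
D_XOR, D_G = 0.1, 0.5

for n in (1, 2, 4, 48):
    print(n, round(normalized_delay(n, D_XOR, D_G), 4))
print("bound", delay_bound(D_XOR, D_G))
```

The normalized delay exceeds the bound by exactly Delay(⊕)/n, so it decreases monotonically toward the bound as more STEPs are unfolded.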
Figure 7: Proposed architecture of two STEPs round.
3.4. Experimental Results. We introduce a measure of hardware efficiency in (12) [14]. This is an improvement on the normal figure of merit (FOM). We assume that power is proportional to the gate count, so we can divide the metric by another GE instead of power dissipation when we want to trade off throughput for power. Note that one gate equivalent (GE) is equal to the area of a two-input NAND gate in 45 nm CMOS technology.

\[
\mathrm{FOM} = \frac{\mathrm{Throughput}}{\mathrm{GE}^2}
\tag{12}
\]
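The metric in (12) is simple to evaluate directly. In the sketch below the throughput is taken in bits per second, and the scale factor of 10^4 used to compare against the tabulated FOM column is an assumption inferred from the magnitudes of the reported values; the input figures are the SHAT-128, SHAT-384, and SHA-384 entries of Tables 3 and 5 with decimal points restored.

```python
def figure_of_merit(throughput_bps, area_ge):
    """FOM of (12): throughput divided by the square of the area in GE."""
    return throughput_bps / area_ge ** 2


# SHAT-128: 66.67 kbps at 100 kHz, 1605 GE.
fom_shat128 = figure_of_merit(66.67e3, 1605)

# SHAT-384 versus SHA-384: 200 kbps / 4753 GE and 1219.04 kbps / 43330 GE.
ratio_384 = figure_of_merit(200e3, 4753) / figure_of_merit(1219.04e3, 43330)

print(f"SHAT-128 FOM (x 1e4): {fom_shat128 * 1e4:.2f}")
print(f"SHAT-384 / SHA-384 FOM ratio: {ratio_384:.2f}")
```

Squaring the area penalizes large designs twice, once for silicon cost and once as a proxy for power, which is why a slower but much smaller design can still score higher.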
Table 3 shows the hardware implementation results of some 128-bit hash functions using a 100 kHz clock frequency and 45 nm CMOS technology. Firstly, the throughput of SHAT-128 (66.67 kbps) is less than that of 5 other hash algorithms, namely MD4 (112.28 kbps), MD5 (83.66 kbps), H-Present-128-32-round (200 kbps), and ARMADILLO2-B (250 kbps and 1000 kbps). However, the area of SHAT-128 is on average only 28.42% of that of these hash functions. As a result, the hardware efficiency of SHAT-128 is on average 13.12 times higher. Secondly, the area of SHAT-128 (1605 GE) is larger than that of 3 hash algorithms, namely U-QUARK-544-round (1379 GE), PHOTON-128-996-round (1122 GE), and SPONGENT-128-8-bit-2380-round (1060 GE); however, the throughput of SHAT-128 is 94.27 times higher. Thus the FOM of SHAT-128 is on average 46.75 times higher. Finally, the area of SHAT-128 (1605 GE) is less than that of 4 other hash algorithms, namely H-Present-128-559-round (2330 GE), U-QUARK-68-round (2392 GE), PHOTON-128-156-round (1708 GE), and SPONGENT-128-70-round (1687 GE), and the throughput of SHAT-128 is also on average 5.95 times higher than that of these 4 hash algorithms. As a result, the FOM of SHAT-128 is on average 9.66 times higher.
In Table 4, firstly, the throughput of SHAT-256 is 51.05% of that of Grøstl; however, the area of SHAT-256 is only 21.84% of that of Grøstl, and this results in 84.47
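Table 3: Hardware implementation results of some 128-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-128 | 32 | 48 | 66.67 | 1605 | 258.80 |
| H-Present-128 [15] | 128 | 559 | 11.45 | 2330 | 21.09 |
| H-Present-128 [15] | 128 | 32 | 200 | 4256 | 110.41 |
| MD4 [15] | 512 | 456 | 112.28 | 7350 | 20.78 |
| MD5 [15] | 512 | 612 | 83.66 | 8400 | 11.86 |
| ARMADILLO2-B [15] | 64 | 256 | 250 | 4353 | 13.19 |
| ARMADILLO2-B [15] | 64 | 64 | 1000 | 6025 | 27.55 |
| U-QUARK [15] | 8 | 544 | 1.47 | 1379 | 7.73 |
| U-QUARK [15] | 8 | 68 | 11.76 | 2392 | 20.56 |
| PHOTON-128 [15] | 16 | 996 | 1.61 | 1122 | 12.78 |
| PHOTON-128 [15] | 16 | 156 | 10.26 | 1708 | 35.15 |
| SPONGENT-128 [15] | 8 | 2380 | 0.34 | 1060 | 2.99 |
| SPONGENT-128 [15] | 16 | 70 | 11.43 | 1687 | 40.16 |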
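Table 4: Hardware implementation results of some 256-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-256 | 64 | 48 | 133.33 | 3193 | 130.78 |
| SHA-256 [14] | 512 | 490 | 104.48 | 8588 | 14.17 |
| ARMADILLO2-E [14] | 128 | 512 | 25 | 8653 | 3.34 |
| ARMADILLO2-E [14] | 128 | 128 | 100 | 11914 | 7.05 |
| BLAKE [14] | 32 | 816 | 72.79 | 13575 | 0.21 |
| Grøstl [14] | 64 | 196 | 261.14 | 14622 | 1.53 |
| PHOTON-256 [14] | 32 | 156 | 3.21 | 2177 | 6.78 |
| PHOTON-256 [14] | 32 | 156 | 20.51 | 4362 | 10.17 |
| SPONGENT-256 [14] | 16 | 9520 | 0.17 | 1950 | 0.44 |
| SPONGENT-256 [14] | 16 | 140 | 11.43 | 3281 | 10.62 |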
Table 5: Hardware implementation results of some 384-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-384 | 96 | 48 | 200 | 4753 | 88.53 |
| SHA-384 [14] | 1024 | 84 | 1219.04 | 43330 | 6.49 |
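Table 6: Performance results of hash function using pipeline and parallelism.

| Number of iteration rounds | Area (GE) | Area increase (%) | Delay (ns) | Delay reduction (%) | Power (μW) | Power increase (%) |
|---|---|---|---|---|---|---|
| 48 | 965 | 0.00 | 0.94 | 0.00 | 27.27 | 0.00 |
| 24 | 2010 | 4.15 | 1.87 | 2.60 | 79.05 | 0.91 |
| 16 | 3055 | 5.53 | 2.81 | 3.44 | 136.42 | 3.52 |
| 12 | 4100 | 6.22 | 3.74 | 4.35 | 193.71 | 4.67 |
| 8 | 6190 | 6.91 | 5.61 | 4.92 | 308.48 | 5.73 |
| 6 | 8280 | 7.25 | 7.47 | 5.32 | 423.18 | 6.21 |
| 4 | 12460 | 7.60 | 11.20 | 5.64 | 652.62 | 6.68 |
| 3 | 16640 | 7.77 | 14.93 | 5.74 | 882.00 | 6.90 |
| 2 | 25000 | 7.94 | 22.40 | 5.88 | 1340.80 | 7.12 |
| 1 | 50080 | 8.12 | 44.70 | 6.31 | 2695.40 | 7.40 |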
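Table 7: Performance results of unrolling steps constructions.

| Number of iteration rounds | Area (GE) | Delay (ns) | Power (μW) | Throughput at 10 MHz (Mbps) |
|---|---|---|---|---|
| 48 | 965 | 0.94 | 27.27 | 6.67 |
| 24 | 1930 | 1.92 | 78.34 | 13.33 |
| 16 | 2895 | 2.91 | 131.78 | 20.00 |
| 12 | 3860 | 3.91 | 185.06 | 26.67 |
| 8 | 5790 | 5.90 | 291.77 | 40.00 |
| 6 | 7720 | 7.89 | 398.42 | 53.33 |
| 4 | 11580 | 11.87 | 611.78 | 80.00 |
| 3 | 15440 | 15.84 | 825.06 | 106.67 |
| 2 | 23160 | 23.80 | 1251.70 | 160.00 |
| 1 | 46320 | 47.71 | 2509.70 | 320.00 |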
times higher hardware efficiency for SHAT-256. Secondly, the throughput of SHAT-256 (3193 GE) is on average 412.91 times higher than that of 2 hash algorithms, PHOTON-256-156-round (2177 GE) and SPONGENT-256-9520-round (1950 GE); although the area of SHAT-256 is larger, its FOM is still 158.25 times higher than that of these 2 hash algorithms. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256-156-round, and SPONGENT-256-140-round, the throughput of SHAT-256 is on average 4.65 times higher and its area is on average only 49.15% of theirs. Therefore, the FOM of SHAT-256 is on average 119.14 times higher.
In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times larger. As a result, the hardware efficiency of SHAT-384 is 13.64 times higher than that of SHA-384.
Then we implement the unfolding transformation technique with 10 different numbers of unrolled loops (1, 2, ..., 48), using 45 nm CMOS technology at 10 MHz, to evaluate the performance of SHAT-128; the results are shown in Table 7. As Table 7 shows, the throughput of the PERM function can be up to 47.97 times higher than that of the original design, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.
Finally, we implement the pipeline and parallelism technique to reconstruct the STEP block. As shown in Table 6, compared with the original circuit, the critical path delay is reduced by at most 6.31%, while power and area increase by about 8% at most.
4. Low Power Design for Hash Function

Low power design is a significant consideration in hardware implementation. The level of power consumption determines a device's lifetime, reliability, and energy cost. Thus low power techniques are now applied to nearly every application. There are many methods to reduce power consumption, such as clock gating and power gating, which target dynamic power and leakage power, respectively. Lowering the clock frequency also reduces power dissipation dramatically.
Firstly, we propose the frequency trade-off technique. Using this method, we obtain a range of frequency values for making a trade-off between low power consumption and high throughput of the hash function. Secondly, we construct a hash encryption system which includes an input data padding unit, RAM registers, the main hash computing construction, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle mode and control signals of this hash encryption system, the load-enable based clock gating scheme is applied to reduce the dynamic power consumption.
4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency is an effective way to decrease dynamic power dissipation linearly. In Section 2.2 we discussed the DVFS technique. By collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy; normally we treat the clock frequency as constant. Moreover, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. There is therefore a trade-off to consider: a high clock frequency brings high throughput, but dramatically increased dynamic power consumption is the critical drawback; a low clock frequency minimizes dynamic power dissipation, but it decreases the throughput as well.
However, according to the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases as the number of unrolled loops increases. This means that we can decrease the clock frequency while increasing the throughput of the hash algorithm. Thus the unfolding transformation delivers high performance without a high clock frequency. Exploiting this advantage, by choosing a proper clock frequency we can make a trade-off between high performance and low power consumption.
Next we explain how to obtain this range of frequency values from the two performance bounds. First, we obtain two values for the rolled Perm circuit: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is constrained by the circuit design (the clock period computed from $f_1$ must be not less than the critical path delay). Then, according to (8), we obtain the throughput $T_1$ at this frequency. The two performance bounds are thus defined in (13), where $n$ is the number of iteration rounds in one Perm function with rolled STEPs:

\[
P_{\max} = P_1 \cdot n, \qquad T_{\min} = T_1.
\tag{13}
\]
This method can be defined as follows. Referring to the performance of the original folded circuit (we assume that this is the circuit with 48 iteration rounds in one Perm function), each unfolding transformation design with a different number of unrolled STEPs (2, 3, ..., 48) has two performance bounds: one is the maximum dynamic power and the other is the minimum throughput of the circuit. These two performance bounds determine the boundary of the proper frequency range for each unfolding transformation circuit. This means that when we choose a specific clock frequency within this range, the total dynamic power consumption of that Perm function will not exceed the defined maximum dynamic power $P_{\max}$, and its throughput will not fall below the fixed minimum throughput $T_{\min}$.

Figure 8: Hash encryption system (blocks: input message, receiver, clock divider, RAM, hash process, main control, LCD displayer, digest display; output digest of n × br bits).
This clock frequency range offers many different choices for different circuit designs that use the unfolding transformation technique. The results of this frequency trade-off technique are shown in Table 9 in Section 4.4.
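The frequency range for a given unfolded design can be sketched as follows. This is a minimal illustration under assumptions that are not spelled out in this section: dynamic power is taken to scale linearly with frequency, throughput is taken as 32 bits per Perm call times frequency divided by the number of rounds, and the per-design power at the 10 MHz reference is read from Table 7 with decimal points restored. The function name `frequency_window` is hypothetical.

```python
BITS_PER_PERM = 32       # bits absorbed per Perm call (SHAT-128 block size)
REF_MHZ = 10.0           # reference frequency of the measurements
ROLLED_ROUNDS = 48       # iteration rounds of the rolled circuit
ROLLED_POWER_UW = 27.27  # power of the rolled circuit at REF_MHZ


def frequency_window(rounds, power_uw_at_ref):
    """Return (f_low, f_high) in MHz for an unfolded design.

    f_low keeps throughput at or above T_min; f_high keeps the total
    dynamic power of one Perm call at or below P_max.
    """
    t_min = BITS_PER_PERM * REF_MHZ / ROLLED_ROUNDS  # Mbps
    p_max = ROLLED_POWER_UW * ROLLED_ROUNDS          # uW
    f_low = t_min * rounds / BITS_PER_PERM
    f_high = REF_MHZ * p_max / (rounds * power_uw_at_ref)
    return f_low, f_high


for rounds, power in ((24, 78.34), (12, 185.06), (1, 2509.70)):
    lo, hi = frequency_window(rounds, power)
    print(f"{rounds:2d} rounds: {lo:.2f} MHz < f < {hi:.2f} MHz")
```

Any frequency inside the window satisfies both bounds at once; moving toward the lower edge favours power saving and moving toward the upper edge favours throughput.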
4.2. Hash Encryption System Design. The hash encryption system is divided into 5 main parts, as shown in Figure 8.
Firstly, the receiver and RAM section is actually our padding unit. We use serial communication to connect the PC and the hash encryption system. Thus we need a clock divider to generate a clock that is synchronous with the baud rate of the serial link. We choose 4800 baud as our transmission rate; this is not fast, but it keeps the error rate low (less than 3%). In this case one baud represents 1 bit. Each transmitted frame consists of one start bit "0", an 8-bit message, and one finish bit "1". The start and finish bits are added to the transmitted message bits automatically. The sampling rate of the receiver is 16, and the FPGA board provides a 100 MHz clock. Thus the clock period used for sampling is 1302 times the provided 100 MHz clock period, as shown in (14). The resulting error is 0.0064%, which is less than 3%.

\[
\text{Sampling Clock Cycles} = \frac{\text{Clock Frequency}}{\text{Baud rate} \cdot \text{Sampling Rate}} = \frac{100\ \text{MHz}}{16 \times 4800\ \text{Bd}} \approx 1302
\tag{14}
\]
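The divider value and its rounding error in (14) can be reproduced with a few lines. The function name `sampling_divider` is hypothetical; the inputs are the 100 MHz board clock, the 4800 baud rate, and the 16x sampling rate given in the text.

```python
def sampling_divider(clock_hz, baud, oversample):
    """Return the integer clock divider and its relative error in percent."""
    exact = clock_hz / (baud * oversample)
    divider = round(exact)
    error_pct = abs(exact - divider) / exact * 100.0
    return divider, error_pct


divider, error_pct = sampling_divider(100_000_000, 4800, 16)
print(f"divider = {divider}, error = {error_pct:.4f} %")
```

Because the exact ratio is about 1302.08, rounding to an integer divider shifts the sampling clock by only a few parts in 100,000, far below the 3% tolerance quoted for the link.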
The liquid crystal display (LCD) limits the number of characters we can display to 32 hexadecimal characters, which matches the number of digest bits of SHAT-128. Thus $b_r$ for each padded block is set to 32 bits, which consist of eight 4-bit hexadecimal numbers.
Secondly, the hash function introduced in Section 3 is designed as a sponge construction, as shown in Figure 4. After absorbing $n$ 32-bit message blocks, a 128-bit digest is squeezed out.
Finally, the main control unit manages the working order among the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.
Because we use serial communication, the transfer speed is slow. We use 4800 baud for a low error rate, so each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system performs one STEP per round, which means that a complete Perm function takes 48 iteration rounds; even so, hash processing needs only roughly 6 μs. The LCD display period also takes considerable time: even though LCD initialization can finish before the hash digest is available, roughly 15 ms is still needed to display all the data.
4.3. Load-Enable Based Clock Gating. In this section we introduce the load-enable based clock gating technique for the hash encryption system.
Clock gating is the most widely used low power technique at RTL. At RTL it is more practical to control the toggle rate of gate outputs than the other three factors, namely $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system operates as a pipeline. The finishing signal of each process can be treated as the enable signal in load-enable based clock gating, as shown in Figure 3. On the other hand, the XOR-based clock gating technique needs the outputs of single-level flip-flops to be specified, which is not easily done in our encryption system; thus load-enable based clock gating is our best option as a low power method.
As shown in Figure 10, three signal pairs realize this load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency corresponding to the serial communication, the main control unit does not gate the receiver's clock directly; instead, by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.
Figure 9 gives the three operation phases of the encryption system. In the first phase, the en_div and en_lcd signals are asserted to logic one and en_h is held at logic zero; thus the receiver starts receiving input messages and padding them into RAM. In the meantime, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because serial communication takes a long time, owing to the low baud rate and to transmitting the message one bit at a time, LCD displayer initialization can finish before the padded message is ready. Thus en_lcd can be returned to logic zero by the main control unit when fsh_lcd switches to logic one.

Figure 9: Three phases of hash encryption system (rows: receiver, hash, LCD; activities: data receiving and padding, hash processing, LCD initialization and displaying, idle).

Figure 10: Control signals of hash encryption system (signals: en_div, fsh_r, en_h, fsh_h, en_lcd, fsh_lcd, clk_r).
During the second phase, the padded message is ready, so fsh_r switches to logic one. Then en_div is returned to zero, which means that the clock divider is turned off and no receiver clock is produced; thus the receiver stops working. In this phase en_h is asserted to logic one for hash encryption, which is our core function. en_lcd remains at zero, waiting for the hash digest generated by hash processing.
This system will enter the third phase when the 119891119904ℎ ℎsignal switches to logic one In this phase hash digest isready thus both receiver and hash processes are in idle modewhich means that 119890119899 119889119894V and 119890119899 ℎ are all asserted to logiczero Signal 119890119899 119897119888119889 will be asserted to logic one to start LCDdisplaying 119890119899 119897119888119889 will be asserted back to zero when thedisplaying process is finished This is the end of the wholesystem then the device will be turned off or repeats thesethree phases for another input message
By analyzing the construction and process of hashencryption system we can figure out the idle time foreach component Then applying the load-enable based clockgating to each component the dynamic power dissipation ofthis system can be properly reduced as shown in Table 8 inSection 44
44 Experimental Results By using 10MHz clock frequencyand 45 nm CMOS technology the results of frequency trade-off technique are shown in Tables 9 10 and 11 Table 9 showsthat the area and critical path delay are not changed compar-ing with the unfolding transformation technique Tables 10
Table 8 Hardware implementationwithwithout load-enable basedclock gating
Systemtype
Area Delay Power
(GE) Increase() (ns) Increase
() (120583W) Reduction()
Original 14053 na 163 na 183020 naClockgated 14565 364 172 552 158036 1365
Table 9 Area and delay performances of frequency trade-offtechnique
Number ofiteration rounds
Area(GE)
Delay(ns)
Frequency(MHz)
48 965 094 100024 1930 192 500 lt 119891
24lt 696
16 2895 291 333 lt 11989116lt 620
12 3860 391 250 lt 11989112lt 589
8 5790 590 167 lt 1198918lt 560
6 7720 789 125 lt 1198916lt 547
4 11580 1187 083 lt 1198914lt 534
3 15440 1584 063 lt 1198913lt 528
2 23160 2380 042 lt 1198912lt 522
1 46320 4771 021 lt 1198911lt 521
and 11 give us the variation of dynamic power consumptionand throughput with frequency trade-off method Note that119891119894stands for frequency 119879
119894stands for throughput and 119879
119894pct isthe percentage of increasing comparing with the minimumthroughput (119879min) which is 667Mbps 119875
119894means the total
dynamic power consumption by finishing a complete Permfunction and 119875
119894pct is the percentage of power reductioncomparing with the maximum power consumption (119875max)defined as 130896 120583W which is calculated from the productof 48 (number of iteration rounds) and 2727120583W(as shown inTable 7)Note that 119894 stands for the number of iteration rounds
Then we apply load-enable based clock gating schemeto hash encryption system by using 100MHz clock fre-quency which can be provided on FPGA board and 45 nmCMOS technology As shown in Table 8 the dynamic powerdecreases 1365 However 364 increased area and 552increased critical path delay are sacrificed
International Journal of Distributed Sensor Networks 11
Table 10 Dynamic power consumption of frequency trade-off technique
Number of iteration rounds Power Frequency(120583W) Reduction () (MHz)
48 130896 na 100024 94008 lt 119875
24lt 130848 2818 lt 119875
24pct lt 004 500 lt 11989124lt 696
16 70288 lt 11987516lt 130720 4630 lt 119875
16pct lt 013 333 lt 11989116lt 620
12 55512 lt 11987512lt 130800 5759 lt 119875
12pct lt 007 250 lt 11989112lt 589
8 38904 lt 1198758lt 130712 7028 lt 119875
8pct lt 014 167 lt 1198918lt 560
6 29880 lt 1198756lt 130764 7717 lt 119875
6pct lt 010 125 lt 1198916lt 547
4 20392 lt 1198754lt 130676 8442 lt 119875
4pct lt 017 083 lt 1198914lt 534
3 15471 lt 1198753lt 130689 8818 lt 119875
3pct lt 016 063 lt 1198913lt 528
2 10430 lt 1198752lt 130676 9203 lt 119875
2pct lt 017 042 lt 1198912lt 522
1 5229 lt 1198751lt 130760 9601 lt 119875
1pct lt 010 021 lt 1198911lt 521
Table 11 Throughput performances of frequency trade-off technique
Number of iteration rounds Throughput Frequency(Mbps) Improvement () (MHz)
48 667 na 100024 667 lt 119879
24lt 928 000 lt 119879
24pct lt 3913 500 lt 11989124lt 696
16 667 lt 11987916lt 124 000 lt 119879
16pct lt 8591 333 lt 11989116lt 620
12 667 lt 11987912lt 1571 000 lt 119879
12pct lt 13553 250 lt 11989112lt 589
8 667 lt 1198798lt 2240 000 lt 119879
8pct lt 23583 167 lt 1198918lt 560
6 667 lt 1198796lt 2917 000 lt 119879
6pct lt 33733 125 lt 1198916lt 547
4 667 lt 1198794lt 4272 000 lt 119879
4pct lt 54048 083 lt 1198914lt 534
3 667 lt 1198793lt 5632 000 lt 119879
3pct lt 74438 063 lt 1198913lt 528
2 667 lt 1198792lt 8352 000 lt 119879
2pct lt 115217 042 lt 1198912lt 522
1 667 lt 1198791lt 16672 000 lt 119879
1pct lt 239955 021 lt 1198911lt 521
5 Conclusion
In order to achieve high performance and low power hard-ware implementation for cryptographic hash function whichuses sponge construction firstly we use unfolding transfor-mation technique to improve the throughput of hash func-tion secondly pipeline and parallelism design techniques areimplemented to reduce the critical path delay by modifyingthe structure of permutation function thirdly frequencytrade-off technique is proposed to calculate a frequency scopewhich can be used to make a trade-off between low dynamicpower consumption and high throughput of hash functionfinally load-enable based clock gating scheme is applied inhash encryption system to eliminate wasted toggle rate ofsignals in the idle mode
The experimental results have shown that unfoldingtransformation technique can achieve up to 4797 timeshigher throughput pipeline and parallelism methods give631delay reduction load-enable based clock gating schemedecreases 1365 dynamic power consumption and fre-quency trade-off technique shows how to decide the clockfrequency of the hash function to achieve low power con-sumption and high throughput
Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).
References

[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1-4, Hong Kong, December 2008.
[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.
[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.
[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165-180, Amsterdam, The Netherlands, 2002.
[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82-92, 2005.
[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.
[7] "Sponge function," Wikipedia, http://en.wikipedia.org/wiki/Sponge_function.
[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.
[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.
[10] Y. Zhang, Q. Tong, L. Li et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceeding of the International SoC Design Conference (ISOCC '12), pp. 21-24, Jeju Island, Republic of Korea, 2012.
[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia - a 128-bit block cipher," Nippon Telegraphy and Telephone Corporation, Mitsubishi Electric Corporation, 2000.
[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354-359, Steamboat Springs, Colo, USA, September 2006.
[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086-4089, Kobe, Japan, May 2005.
[14] S. Badel, N. Dagtekin, J. Nakahara Jr. et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398-412, 2010.
[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT, Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.
Table 3: Hardware implementation results of some 128-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-128 | 32 | 48 | 66.67 | 1605 | 258.80 |
| H-Present-128 [15] | 128 | 559 | 11.45 | 2330 | 21.09 |
| | 128 | 32 | 200 | 4256 | 110.41 |
| MD4 [15] | 512 | 456 | 112.28 | 7350 | 20.78 |
| MD5 [15] | 512 | 612 | 83.66 | 8400 | 11.86 |
| ARMADILLO2-B [15] | 64 | 256 | 25.0 | 4353 | 13.19 |
| | 64 | 64 | 100.0 | 6025 | 27.55 |
| U-QUARK [15] | 8 | 544 | 1.47 | 1379 | 7.73 |
| | 8 | 68 | 11.76 | 2392 | 20.56 |
| PHOTON-128 [15] | 16 | 996 | 1.61 | 1122 | 12.78 |
| | 16 | 156 | 10.26 | 1708 | 35.15 |
| SPONGENT-128 [15] | 8 | 2380 | 0.34 | 1060 | 2.99 |
| | 16 | 70 | 11.43 | 1687 | 40.16 |
Table 4: Hardware implementation results of some 256-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-256 | 64 | 48 | 133.33 | 3193 | 130.78 |
| SHA-256 [14] | 512 | 490 | 104.48 | 8588 | 14.17 |
| ARMADILLO2-E [14] | 128 | 512 | 25 | 8653 | 3.34 |
| | 128 | 128 | 100 | 11914 | 7.05 |
| BLAKE [14] | 32 | 816 | 72.79 | 13575 | 0.21 |
| Grøstl [14] | 64 | 196 | 261.14 | 14622 | 1.53 |
| PHOTON-256 [14] | 32 | 156 | 3.21 | 2177 | 6.78 |
| | 32 | 156 | 20.51 | 4362 | 10.17 |
| SPONGENT-256 [14] | 16 | 9520 | 0.17 | 1950 | 0.44 |
| | 16 | 140 | 11.43 | 3281 | 10.62 |
Table 5: Hardware implementation results of some 384-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-384 | 96 | 48 | 200 | 4753 | 88.53 |
| SHA-384 [14] | 1024 | 84 | 1219.04 | 43330 | 6.49 |
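Table 6: Performance results of hash function using pipeline and parallelism.

| Number of iteration rounds | Area (GE) | Area increase (%) | Delay (ns) | Delay reduction (%) | Power (μW) | Power increase (%) |
|---|---|---|---|---|---|---|
| 48 | 965 | 0.00 | 0.94 | 0.00 | 27.27 | 0.00 |
| 24 | 2010 | 4.15 | 1.87 | 2.60 | 79.05 | 0.91 |
| 16 | 3055 | 5.53 | 2.81 | 3.44 | 136.42 | 3.52 |
| 12 | 4100 | 6.22 | 3.74 | 4.35 | 193.71 | 4.67 |
| 8 | 6190 | 6.91 | 5.61 | 4.92 | 308.48 | 5.73 |
| 6 | 8280 | 7.25 | 7.47 | 5.32 | 423.18 | 6.21 |
| 4 | 12460 | 7.60 | 11.20 | 5.64 | 652.62 | 6.68 |
| 3 | 16640 | 7.77 | 14.93 | 5.74 | 882.00 | 6.90 |
| 2 | 25000 | 7.94 | 22.40 | 5.88 | 1340.80 | 7.12 |
| 1 | 50080 | 8.12 | 44.70 | 6.31 | 2695.40 | 7.40 |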
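Table 7: Performance results of unrolling steps constructions.

| Number of iteration rounds | Area (GE) | Delay (ns) | Power (μW) | Throughput at 10 MHz (Mbps) |
|---|---|---|---|---|
| 48 | 965 | 0.94 | 27.27 | 6.67 |
| 24 | 1930 | 1.92 | 78.34 | 13.33 |
| 16 | 2895 | 2.91 | 131.78 | 20.00 |
| 12 | 3860 | 3.91 | 185.06 | 26.67 |
| 8 | 5790 | 5.90 | 291.77 | 40.00 |
| 6 | 7720 | 7.89 | 398.42 | 53.33 |
| 4 | 11580 | 11.87 | 611.78 | 80.00 |
| 3 | 15440 | 15.84 | 825.06 | 106.67 |
| 2 | 23160 | 23.80 | 1251.70 | 160.00 |
| 1 | 46320 | 47.71 | 2509.70 | 320.00 |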
times higher hardware efficiency of SHAT-256. Secondly, the throughput of SHAT-256 (3193 GE) is on average 412.91 times higher than that of two hash algorithms, PHOTON-256-156-round (2177 GE) and SPONGENT-256-9520-round (1950 GE). Although the area of SHAT-256 is larger, its FOM is still 158.25 times higher than that of these two algorithms. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256-156-round, and SPONGENT-256-140-round, the throughput of SHAT-256 is 4.65 times higher on average, and its area is only 49.15% of theirs on average. Therefore, the FOM of SHAT-256 is 119.14 times higher on average.

In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times larger. As a result, the hardware efficiency of SHAT-384 is 13.64 times higher than that of SHA-384.
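The FOM columns in Tables 3 to 5 are consistent with throughput divided by the square of the area. The following minimal sketch checks the SHAT-384 comparison under that reading. The scale factor of 10^7 (which corresponds to nanobits per clock cycle per GE² at 100 kHz) is inferred from the table entries and should be treated as an assumption, and `fom` is a hypothetical helper name.

```python
def fom(throughput_kbps: float, area_ge: float, scale: float = 1e7) -> float:
    """Figure of merit as throughput / area^2; `scale` is an inferred assumption."""
    return scale * throughput_kbps / area_ge ** 2

shat_384 = fom(200.0, 4753)      # Table 5 lists 88.53
sha_384 = fom(1219.04, 43330)    # Table 5 lists 6.49
print(round(shat_384, 2), round(sha_384, 2), shat_384 / sha_384)
```

Because the area enters squared, the 9.11 times larger area of SHA-384 outweighs its 6.09 times higher throughput, which is why the ratio of the two FOM values comes out close to the 13.64 quoted in the text.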
Then we implement the unfolding transformation with 10 different numbers of unrolling loops (1, 2, ..., 48), using 45 nm CMOS technology at 10 MHz, to evaluate the performance of SHAT-128; the results are shown in Table 7. As Table 7 shows, the throughput of the PERM function can reach up to 47.97 times that of the original design, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.
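The throughput column of Table 7 can be reproduced from the block size and the number of clock cycles per Perm function. The sketch below assumes one iteration round per clock cycle and the 32-bit block of SHAT-128 listed in Table 3; the function name `throughput_mbps` is illustrative and does not come from the paper.

```python
def throughput_mbps(block_bits: int, rounds: int, f_mhz: float) -> float:
    """Throughput in Mbps, assuming one iteration round per clock cycle."""
    return block_bits * f_mhz / rounds

# SHAT-128 at 10 MHz for the numbers of iteration rounds listed in Table 7
for rounds in (48, 24, 16, 12, 8, 6, 4, 3, 2, 1):
    print(rounds, round(throughput_mbps(32, rounds, 10.0), 2))
```

Under this model the exact ratio between the fully unrolled and the rolled design is 48; the slightly lower reported figure appears to come from dividing the rounded table values 320.00 and 6.67.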
Finally, we implement the pipeline and parallelism technique to reconstruct the STEP block. As shown in Table 6, compared with the original circuit, the critical path delay is reduced by up to 6.31%, while the power and area increase by up to about 8%.
4. Low Power Design for Hash Function

Low power design is a significant consideration in hardware implementation. Power consumption determines a device's lifetime, reliability, and energy cost, so low power techniques are now applied in nearly every application. There are many methods to reduce power consumption, such as clock gating and power gating, which target dynamic power and leakage power, respectively. Lowering the clock frequency also reduces power dissipation dramatically.

Firstly, we propose the frequency trade-off technique. With this method we obtain a range of frequency values for trading off low power consumption against high throughput of the hash function. Secondly, we construct a hash encryption system that includes an input data padding unit, RAM registers, the main hash computing construction, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle mode and control signals of this hash encryption system, we apply a load-enable based clock gating scheme to reduce the dynamic power consumption.
4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency decreases dynamic power dissipation linearly. In Section 2.2 we discussed the DVFS technique: by collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy, and normally we treat the clock frequency as a constant. Moreover, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. There is therefore a conflict to consider: a high clock frequency brings high throughput, but the dramatically increased dynamic power consumption is a critical drawback; a low clock frequency minimizes dynamic power dissipation, but it decreases throughput as well.

However, with the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases as the number of unrolling loops increases. This means that we can decrease the clock frequency while still increasing the throughput of the hash algorithm. The unrolling transformation thus achieves high performance without a high clock frequency. Exploiting this advantage, by choosing a proper clock frequency we can make a trade-off between high performance and low power consumption.
Next, we explain how to obtain this range of frequency values from two performance bounds. First, for the rolled Perm circuit we obtain two values: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is constrained by the circuit design (the clock period computed from $f_1$ must be no less than the critical path delay). Then, according to (8), we obtain the throughput $T_1$ at this frequency. The two performance bounds are defined in (13), where $n$ is the number of iteration rounds in one Perm function with rolled STEPs:

$$P_{\max} = P_1 \cdot n, \qquad T_{\min} = T_1. \tag{13}$$

The method is defined as follows. Taking the performance of the original folded circuit as the reference (we assume that this circuit uses 48 iteration rounds in one Perm function), each unfolded design with a different number of unrolled STEPs (2, 3, ..., 48) has two performance bounds: one is the maximum dynamic power and the other is the minimum throughput of the circuit. These two bounds determine the boundaries of the proper frequency range for each unfolded circuit. When we choose a specific clock frequency within this range, the total dynamic power consumption of that PERM function will be no more than the defined maximum dynamic power $P_{\max}$, and its throughput will be no less than the fixed minimum throughput $T_{\min}$.

Figure 8: Hash encryption system (blocks: receiver, RAM, clock divider, main control, hash process, LCD displayer; input: input message; output: digest display of $n \times br$ bits).
This clock frequency range gives us many different choices for different circuit designs that use the unfolding transformation technique. The results of the frequency trade-off technique are shown in Table 9 in Section 4.4.
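The two bounds in (13) can be turned into a frequency range with a few lines of arithmetic. The sketch below is one interpretation that agrees with the ranges in Table 9 to within rounding: it assumes dynamic power scales linearly with frequency, that throughput scales as frequency divided by the number of iteration rounds, and that the power budget applies to a complete Perm function. The power values come from Table 7; all names are hypothetical.

```python
F_REF_MHZ = 10.0        # reference clock used for Table 7
ROLLED_ROUNDS = 48      # iteration rounds of the original (rolled) design
P_ROLLED_UW = 27.27     # dynamic power of the rolled design at 10 MHz (Table 7)

# iteration rounds -> dynamic power in uW at 10 MHz (Table 7)
POWER_AT_REF_UW = {24: 78.34, 16: 131.78, 12: 185.06, 8: 291.77, 6: 398.42,
                   4: 611.78, 3: 825.06, 2: 1251.70, 1: 2509.70}

def frequency_range_mhz(rounds: int) -> tuple[float, float]:
    """Return (f_low, f_high) for a design with `rounds` iteration rounds."""
    p_max = P_ROLLED_UW * ROLLED_ROUNDS  # power bound, as in (13)
    # Throughput scales as f / rounds, so T >= T_min needs f >= f_ref * rounds / 48.
    f_low = F_REF_MHZ * rounds / ROLLED_ROUNDS
    # Power over one Perm function is rounds * P(f), with P linear in f (assumption).
    f_high = F_REF_MHZ * p_max / (rounds * POWER_AT_REF_UW[rounds])
    return f_low, f_high

for n in sorted(POWER_AT_REF_UW, reverse=True):
    low, high = frequency_range_mhz(n)
    print(f"{n:2d} rounds: {low:.2f} MHz < f < {high:.3f} MHz")
```

Any frequency picked between the two values keeps the design at or above the throughput of the rolled circuit while staying within its power budget; where in the range to operate is a design decision the sketch does not make.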
4.2. Hash Encryption System Design. The hash encryption system is divided into 5 main parts, as shown in Figure 8.

Firstly, the receiver and RAM section acts as our padding unit. We use serial communication to connect the PC and the hash encryption system, so a clock divider is needed to generate a clock that is synchronous with the Baud rate of the serial link. We choose 4800 Baud as the transmission rate, a modest speed that keeps the error rate low (less than 3%). In this case one Baud represents 1 bit. Each transmitted frame consists of one start bit "0", an 8-bit message, and one finish bit "1"; the start and finish bits are added to the transmitted message bits automatically. The sampling rate of the receiver is 16, and the FPGA board provides a 100 MHz clock. Thus the clock period used for sampling is 1302 times the period of the provided 100 MHz clock, as shown in (14). The resulting error is 0.0064%, well below 3%.
$$\text{Sampling Clock Cycles} = \frac{\text{Clock Frequency}}{\text{Baud rate} \cdot \text{Sampling Rate}} = \frac{100\,\text{MHz}}{16 \times 4800\,\text{B/s}} \approx 1302. \tag{14}$$
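The divider value in (14) and the quoted error can be checked numerically. This is a minimal sketch that assumes the divider is the nearest integer to the exact ratio; `sampling_divider` is a hypothetical helper, not part of the described hardware.

```python
def sampling_divider(clk_hz: float, baud: int, oversample: int) -> tuple[int, float]:
    """Return the integer clock divider and its relative error in percent."""
    exact = clk_hz / (baud * oversample)
    divider = round(exact)  # nearest-integer divider (assumption)
    error_pct = abs(exact - divider) / exact * 100.0
    return divider, error_pct

divider, error_pct = sampling_divider(100e6, 4800, 16)
print(divider, f"{error_pct:.4f}%")
```

The exact ratio is about 1302.08, so rounding to an integer divider introduces only a very small timing error per sample.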
Because the liquid crystal display (LCD) limits the number of characters we can display to 32 hexadecimal characters, which matches the number of digest bits of SHAT-128, our $br$ for each padded block is set to 32 bits, that is, eight 4-bit hexadecimal digits.

Secondly, the hash function introduced in Section 3 is designed as a sponge construction, as shown in Figure 4. After absorbing $n$ 32-bit message blocks, a 128-bit digest is squeezed out.

Finally, the main control unit manages the working order of the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.

Because we use serial communication, the transfer speed is slow. We use 4800 Baud for a low error rate, so each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system executes one STEP per round, which means 48 iteration rounds for a complete Perm function; even so, hash processing needs only roughly 6 μs. The LCD display period is also time consuming: even though LCD initialization can finish before the hash digest is available, roughly 15 ms is still needed to display all the data.
4.3. Load-Enable Based Clock Gating. In this section we introduce the load-enable based clock gating technique for the hash encryption system.

Clock gating is the most widely used low power technique at RTL. At RTL it is more practical to control the toggle rate of gate outputs than the other three components of dynamic power, namely $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system is organized as a pipeline. The finishing signal of each process can be used as the enable signal in load-enable based clock gating, as shown in Figure 3. XOR-based clock gating, on the other hand, requires specifying the outputs of single-level flip-flops, which are not easily determined in our encryption system; thus load-enable based clock gating is our best low power option.

As shown in Figure 10, three signal pairs realize the load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency that corresponds to the serial communication, the main control unit does not gate the clock signal of the receiver directly; by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.
Figure 9 shows the three operation phases of the encryption system. In the first phase, the en_div and en_lcd signals are asserted to logic one and en_h is held at logic zero, so the receiver starts receiving input messages and padding them into RAM. At the same time, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because serial communication takes a long time, owing to the low Baud rate and the bit-by-bit transmission of the message, LCD displayer initialization can finish before the padded message is ready. Thus en_lcd can be set to logic zero by the main control unit when fsh_lcd switches to logic one.

Figure 9: Three phases of hash encryption system. Receiver: data receiving and padding (phase one), idle (phases two and three). Hash: idle (phase one), hash processing (phase two), idle (phase three). LCD: initialization then idle (phase one), idle (phase two), LCD displaying (phase three).

Figure 10: Control signals of hash encryption system (blocks: receiver, RAM, clock divider, main control, hash process, LCD displayer; signals: en_div, clk_r, fsh_r, en_h, fsh_h, en_lcd, fsh_lcd; data: input message, padded message, digest, digest display).
The second phase begins when the padded message is ready and fsh_r switches to logic one. Then en_div is set to zero, which turns off the clock divider; no receiver clock is produced, so the receiver stops working. In this phase en_h is asserted to logic one for hash encryption, which is our core function, while en_lcd remains zero, waiting for the hash digest generated by hash processing.

The system enters the third phase when the fsh_h signal switches to logic one. In this phase the hash digest is ready, so both the receiver and the hash process are in idle mode, which means that en_div and en_h are both at logic zero. Signal en_lcd is asserted to logic one to start LCD displaying and is set back to zero when the displaying process is finished. This ends the operation of the whole system; the device is then turned off or repeats the three phases for another input message.

By analyzing the construction and operation of the hash encryption system, we can determine the idle time of each component. Applying load-enable based clock gating to each component then reduces the dynamic power dissipation of the system, as shown in Table 8 in Section 4.4.
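The enable and finishing signals described above can be summarized as a small behavioural model of the main control unit. This is a software sketch of the phase logic only, not the authors' RTL; the `Phase` names and the `lcd_busy` flag (true while the LCD is still initializing or displaying) are assumptions introduced for illustration.

```python
from enum import Enum

class Phase(Enum):
    RECEIVE = 1   # phase one: data receiving and padding, LCD initialization
    HASH = 2      # phase two: hash processing
    DISPLAY = 3   # phase three: LCD displaying

def enables(phase: Phase, lcd_busy: bool) -> dict[str, int]:
    """Enable signals driven by the main control unit in each phase."""
    return {
        "en_div": int(phase is Phase.RECEIVE),
        "en_h": int(phase is Phase.HASH),
        # The LCD is clocked only while it initializes (phase one) or displays (phase three).
        "en_lcd": int(phase is not Phase.HASH and lcd_busy),
    }

def next_phase(phase: Phase, fsh_r: int, fsh_h: int, fsh_lcd: int) -> Phase:
    """Advance when the finishing signal of the active unit is asserted."""
    if phase is Phase.RECEIVE and fsh_r:
        return Phase.HASH
    if phase is Phase.HASH and fsh_h:
        return Phase.DISPLAY
    if phase is Phase.DISPLAY and fsh_lcd:
        return Phase.RECEIVE
    return phase
```

In each phase at most two of the three units receive a clock, which is where the dynamic power saving of the load-enable scheme comes from.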
4.4. Experimental Results. Using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput under the frequency trade-off method. Note that $f_i$ stands for frequency, $T_i$ stands for throughput, and $T_{i\mathrm{pct}}$ is the percentage increase relative to the minimum throughput ($T_{\min}$), which is 6.67 Mbps. $P_i$ is the total dynamic power consumed in completing one Perm function, and $P_{i\mathrm{pct}}$ is the percentage power reduction relative to the maximum power consumption ($P_{\max}$), defined as 1308.96 μW, which is the product of 48 (the number of iteration rounds) and 27.27 μW (as shown in Table 7). Here $i$ stands for the number of iteration rounds.

Then we apply the load-enable based clock gating scheme to the hash encryption system, using the 100 MHz clock frequency available on the FPGA board and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%, at the cost of a 3.64% increase in area and a 5.52% increase in critical path delay.

Table 8: Hardware implementation with/without load-enable based clock gating.

| System type | Area (GE) | Area increase (%) | Delay (ns) | Delay increase (%) | Power (μW) | Power reduction (%) |
|---|---|---|---|---|---|---|
| Original | 14053 | n/a | 1.63 | n/a | 1830.20 | n/a |
| Clock gated | 14565 | 3.64 | 1.72 | 5.52 | 1580.36 | 13.65 |

Table 9: Area and delay performance of the frequency trade-off technique.

| Number of iteration rounds | Area (GE) | Delay (ns) | Frequency (MHz) |
|---|---|---|---|
| 48 | 965 | 0.94 | 10.00 |
| 24 | 1930 | 1.92 | $5.00 < f_{24} < 6.96$ |
| 16 | 2895 | 2.91 | $3.33 < f_{16} < 6.20$ |
| 12 | 3860 | 3.91 | $2.50 < f_{12} < 5.89$ |
| 8 | 5790 | 5.90 | $1.67 < f_8 < 5.60$ |
| 6 | 7720 | 7.89 | $1.25 < f_6 < 5.47$ |
| 4 | 11580 | 11.87 | $0.83 < f_4 < 5.34$ |
| 3 | 15440 | 15.84 | $0.63 < f_3 < 5.28$ |
| 2 | 23160 | 23.80 | $0.42 < f_2 < 5.22$ |
| 1 | 46320 | 47.71 | $0.21 < f_1 < 5.21$ |
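The percentages quoted for clock gating follow directly from the absolute values in Table 8. The short check below recomputes them; `percent_change` is a hypothetical helper and the dictionaries simply restate the table entries.

```python
def percent_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

original = {"area_ge": 14053, "delay_ns": 1.63, "power_uw": 1830.20}
gated = {"area_ge": 14565, "delay_ns": 1.72, "power_uw": 1580.36}

for key in original:
    print(key, round(percent_change(original[key], gated[key]), 2))
```

A negative value indicates a reduction, so the power entry corresponds to the reported saving, while area and delay show the overhead of the added gating logic.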
8 International Journal of Distributed Sensor Networks
Table 7 Performance results of unrolling steps constructions
Number ofiteration rounds
Area(GE)
Delay(ns)
Power(120583W)
Throughput at10MHz(Mbps)
48 965 094 2727 66724 1930 192 7834 133316 2895 291 13178 200012 3860 391 18506 26678 5790 590 29177 40006 7720 789 39842 53334 11580 1187 61178 80003 15440 1584 82506 106672 23160 2380 125170 160001 46320 4771 250970 32000
times higher hardware efficiency of SHAT-256 Secondlythe throughput of SHAT-256 (3193GE) is 41291 timeshigher than that of 2 hash algorithms such as PHOTON-256-156-round (2177GE) and SPONGENT-256-9520-round(1950GE) in average although the area of SHAT-256 islarger the FOM of SHAT-256 is still 15825 times higher thanthat of 2 hash algorithms Thirdly comparing with SHA-256 ARMADILL02-E BLAKE PHOTON-256-156-roundand SPONGENT-256-140-round the throughput of SHAT-256 is 465 times higher in average and the area of SHAT-256 is only 4915 of that of hash algorithms in averageTherefore the FOM of SHAT-256 is 11914 times higher inaverage
In Table 5 the throughput of SHA-384 is 609 timeshigher than that of SHAT-384 however the area of SHA-384 is 911 times higher this results in having the hardwareefficiency of SHAT-384 to be 1364 times higher than that ofSHA-384
Then we implement unfolding transformation techniquewith 10 different numbers of unrolling loops (1 2 48) byusing 45 nm CMOS technology at 10MHz to evaluate theperformances of SHAT-128 the results are shown in Table 7As we can see in Table 7 the throughput of PERM functioncan be achieved up to 4797 times higher than original onewhich is 667Mbps However area delay and power willincrease dramatically as penalty
Finally we implement pipeline and parallelism techniqueto reconstruct STEP block as shown in Table 6 comparingwith the performances of original circuit the critical pathdelay reduces to 631 at most while the power and area willincrease in 8
4 Low Power Design for Hash Function
Low power design is a significant consideration in hardware implementation. The power consumption determines a device's lifetime, reliability, and energy cost. Thus low power techniques are now applied to nearly every application. There are many methods to reduce power consumption, such as clock gating and power gating, which address dynamic power and leakage power, respectively. Decreasing the clock frequency also reduces power dissipation dramatically.
Firstly, we propose the frequency trade-off technique. With this method we obtain a range of frequency values for making a trade-off between low power consumption and high throughput of the hash function. Secondly, we construct a hash encryption system, which includes an input data padding unit, RAM registers, the main hash computing construction, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle mode and control signals of this hash encryption system, a load-enable based clock gating scheme is applied to reduce the dynamic power consumption.
4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency is an effective way to decrease dynamic power dissipation linearly. In Section 2.2 we discussed the DVFS technique. By collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy; normally we treat the clock frequency as constant. Also, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. Therefore there is a trade-off to consider: a high clock frequency brings high throughput, but the dramatically increased dynamic power consumption is a critical drawback; a low clock frequency minimizes the dynamic power dissipation, but it decreases the throughput as well.
However, according to the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases as the number of unrolled loops increases. This means that we can decrease the clock frequency while increasing the throughput of the hash algorithm. Thus the unfolding transformation technique achieves high performance without a high clock frequency. Taking advantage of this, by choosing a proper clock frequency we can make a trade-off between high performance and low power consumption.
Next we explain how to obtain this range of frequency values from the two performance bounds. First we obtain two values for the rolled Perm circuit: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is constrained by the circuit design (the clock period computed from $f_1$ must be not less than the critical path delay). Then, according to (8), we can get the throughput $T_1$ at this frequency. The two performance bounds are then defined in (13), where $n$ is the number of iteration rounds in one Perm function with rolled STEPs:

$$P_{\max} = P_1 \cdot n, \qquad T_{\min} = T_1. \tag{13}$$
This method can be defined as follows. Referring to the performance of the original folded circuit (we assume that this circuit is the one with 48 iteration rounds in one Perm function), each unfolding transformation design with a different number of unrolled STEPs (2, 3, ..., 48) has two performance bounds: one is the maximum dynamic power and the other is the minimum throughput of the circuit. These two performance bounds determine the boundaries of the proper frequency range for each unfolded circuit. That is, when we choose a specific clock frequency in this range, the total dynamic power consumption of that PERM function will be no more than the defined maximum dynamic power $P_{\max}$, and its throughput will be no less than the fixed minimum throughput $T_{\min}$.

Figure 8: Hash encryption system (blocks: input message, receiver, clock divider, RAM, hash process, main control, LCD displayer, digest display; label: $n \times br$ bits).
This clock frequency range gives us many different choices for different circuit designs that use the unfolding transformation technique. The results of this frequency trade-off technique are shown in Table 9 in Section 4.4.
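The frequency range described in this section can be sketched numerically. The example below is a minimal illustration, not the authors' tool: it assumes that both dynamic power and throughput scale linearly with clock frequency, and that the power for a complete Perm function is the per-cycle power multiplied by the number of iteration rounds, as in (13). The names `TABLE7` and `frequency_range` are hypothetical, and the data are the Table 7 figures measured at 10 MHz, with decimal points restored.

```python
F_REF_MHZ = 10.0   # clock frequency at which the Table 7 figures were obtained
N_ROLLED = 48      # iteration rounds of the original (rolled) Perm circuit

# Table 7: iteration rounds -> (dynamic power in uW, throughput in Mbps) at 10 MHz
TABLE7 = {
    48: (27.27, 6.67), 24: (78.34, 13.33), 16: (131.78, 20.00),
    12: (185.06, 26.67), 8: (291.77, 40.00), 6: (398.42, 53.33),
    4: (611.78, 80.00), 3: (825.06, 106.67), 2: (1251.70, 160.00),
    1: (2509.70, 320.00),
}


def frequency_range(rounds: int) -> tuple[float, float]:
    """Return (f_min, f_max) in MHz for a design with `rounds` iteration rounds."""
    p1, t1 = TABLE7[N_ROLLED]
    p_max = p1 * N_ROLLED   # equation (13): P_max = P_1 * n
    t_min = t1              # equation (13): T_min = T_1
    p_i, t_i = TABLE7[rounds]
    # Throughput scales with f, so T_i * f / F_REF must stay at or above T_min.
    f_min = F_REF_MHZ * t_min / t_i
    # One complete Perm takes `rounds` cycles; its power must stay at or below P_max.
    f_max = F_REF_MHZ * p_max / (rounds * p_i)
    return f_min, f_max


for rounds in sorted(TABLE7, reverse=True):
    if rounds == N_ROLLED:
        continue  # the rolled circuit is the reference and runs at F_REF_MHZ
    low, high = frequency_range(rounds)
    print(f"{rounds:2d} rounds: {low:.2f} MHz < f < {high:.2f} MHz")
```

Under these assumptions the computed bounds agree with the frequency column of Table 9 to within rounding in the second decimal place.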
4.2. Hash Encryption System Design. The hash encryption system is divided into 5 main parts, as shown in Figure 8.
Firstly, the receiver and RAM section is our padding unit. We use serial communication to connect the PC and the hash encryption system. Thus we need a clock divider to generate a clock that is synchronous with the Baud rate of the serial communication. We choose 4800 Baud as our transmission rate, which is not fast, in order to keep the error rate low (less than 3%). In this case one Baud represents 1 bit. Our transmission frame is one start bit "0", then an 8-bit message, and one finish bit "1". The start bit and finish bit are added to the transmitted message bits automatically. The sampling rate of the receiver is 16, and the FPGA board provides a 100 MHz clock. Thus the clock period used in sampling is 1302 times the period of the provided 100 MHz clock, as shown in (14). The resulting error is 0.0064%, which is less than 3%.
$$\text{Sampling Clock Cycles} = \frac{\text{Clock Frequency}}{\text{Baud rate} \cdot \text{Sampling Rate}} = \frac{100\,\text{MHz}}{16 \times 4800\,\text{B/s}} \approx 1302. \tag{14}$$
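The divider value in (14) and the associated rounding error can be reproduced with a few lines of arithmetic. This is a sketch only; the function name `sampling_divider` is hypothetical, and the error is taken here as the relative difference between the exact ratio and the integer divider.

```python
def sampling_divider(clock_hz: float, baud: int, oversampling: int) -> tuple[int, float]:
    """Return (integer divider, relative error) for a UART sampling clock."""
    exact = clock_hz / (baud * oversampling)   # ideal, generally non-integer ratio
    divider = round(exact)                     # value a hardware counter can use
    error = abs(exact - divider) / exact       # relative timing error per sample
    return divider, error


divider, error = sampling_divider(clock_hz=100e6, baud=4800, oversampling=16)
print(divider, f"{error * 100:.4f}%")
```

For the 100 MHz clock, 4800 Baud, and 16x sampling used here, the divider is 1302 and the relative error is about 0.0064%.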
Because the liquid crystal display (LCD) limits the number of characters we can display to 32 hexadecimal characters, this number suits the number of digest bits of SHAT-128. Thus our $br$ for each padded block is set to 32 bits, which consist of eight 4-bit hexadecimal numbers.
Secondly, the hash function introduced in Section 3 is designed as a sponge construction, as shown in Figure 4. After $n$ 32-bit message blocks are absorbed, a 128-bit digest is squeezed out.
Finally, the main control unit manages the working order of the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.
Because we use serial communication, the transfer is slow. We use 4800 Baud for a low error rate, so each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system performs one STEP per round, which means that there are 48 iteration rounds for a complete Perm function; nevertheless, hash processing needs only roughly 6 μs. The LCD display period also takes considerable time. Even though we can finish LCD initialization before we get the hash digest, we still need roughly 15 ms to display all the data.
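The receive-time estimates can be approximated as follows. This sketch is an illustration under stated assumptions: the source does not say whether the estimate of roughly 7 ms per block counts the start and finish bits, so both cases are computed. The constant and function names are hypothetical.

```python
BAUD = 4800        # bits per second on the serial link (one Baud is one bit here)
BLOCK_BITS = 32    # payload bits per padded block
FRAME_BITS = 10    # 1 start bit + 8 message bits + 1 finish bit per byte


def block_time_ms(framed: bool) -> float:
    """Time to receive one 32-bit block, with or without start and finish bits."""
    bits = (BLOCK_BITS // 8) * FRAME_BITS if framed else BLOCK_BITS
    return 1e3 * bits / BAUD


payload_only = block_time_ms(framed=False)   # counts the 32 message bits only
with_framing = block_time_ms(framed=True)    # counts 4 frames of 10 bits each
print(f"one block: {payload_only:.2f} ms (payload), {with_framing:.2f} ms (framed)")
print(f"seven blocks: {7 * payload_only:.1f} ms (payload), {7 * with_framing:.1f} ms (framed)")
```

Counting payload bits only gives a little under 7 ms per block and a little under 50 ms for seven blocks, close to the rounded figures quoted above; including the framing bits lengthens both by a quarter.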
4.3. Load-Enable Based Clock Gating. In this section we introduce the load-enable based clock gating technique for the hash encryption system.
Clock gating is the most widely used low power technique at RTL. At RTL it is more practical to control the toggle rate of gate outputs than the other three components of dynamic power: $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system has a pipelined structure. The finish signal of each process can be treated as the enable signal in load-enable based clock gating, as shown in Figure 3. On the other hand, the XOR-based clock gating technique needs to specify the outputs of single-level flip-flops, which are not easily determined in our encryption system; thus load-enable based clock gating is our best option as a low power method.
As shown in Figure 10, three signal pairs realize this load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency that corresponds to the serial communication, the main control unit does not gate the clock signal of the receiver directly; by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.
Figure 9 gives the three operation phases of the encryption system. In the first phase, the en_div and en_lcd signals are asserted to logic one and en_h is set to logic zero; thus the receiver starts receiving input messages and padding them into RAM. In the meantime, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because serial communication takes a long time, owing to the low Baud rate and to the message being transmitted bit by bit, LCD displayer initialization can be finished before the padded message is ready. Thus en_lcd can be set to logic zero by the main control unit when fsh_lcd switches to logic one.

Figure 9: Three phases of hash encryption system.
Unit       Phase one                     Phase two         Phase three
Receiver   Data receiving and padding    Idle              Idle
Hash       Idle                          Hash processing   Idle
LCD        Initialization, then idle     Idle              LCD displaying

Figure 10: Control signals of hash encryption system (blocks: receiver, clock divider, RAM, hash process, main control, LCD displayer; signals: en_div, fsh_r, en_h, fsh_h, en_lcd, fsh_lcd, clk_r; data: input message, padded message, digest, digest display).
During the second phase, the padded message is ready, so fsh_r switches to logic one. Then en_div is set to zero, which means that the clock divider is turned off; no receiver clock is produced, so the receiver stops working. In this phase en_h is asserted to logic one for hash encryption, which is our core function. en_lcd is still zero, waiting for the hash digest generated by hash processing.
The system enters the third phase when the fsh_h signal switches to logic one. In this phase the hash digest is ready, so both the receiver and the hash process are in idle mode, which means that en_div and en_h are both logic zero. Signal en_lcd is asserted to logic one to start LCD displaying, and it is set back to zero when the displaying process is finished. This is the end of the whole operation; the device is then turned off or repeats these three phases for another input message.
By analyzing the construction and operation of the hash encryption system, we can determine the idle time of each component. Then, by applying load-enable based clock gating to each component, the dynamic power dissipation of this system can be reduced, as shown in Table 8 in Section 4.4.
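The three-phase enable sequence described in this section can be modelled behaviourally. The sketch below is an illustration, not the authors' RTL: the signal names follow Figure 10, while the class `MainControl`, the step-based interface, and the modelling of a gated clock as a plain AND of clock and enable are simplifying assumptions (practical clock-gating cells usually add a latch to avoid glitches).

```python
class MainControl:
    """Behavioural model of the enable signals driven by the main control unit."""

    def __init__(self) -> None:
        self.phase = 1
        # Phase one: receive and pad the message while the LCD initializes.
        self.en_div, self.en_h, self.en_lcd = 1, 0, 1

    def step(self, fsh_r: int = 0, fsh_h: int = 0, fsh_lcd: int = 0) -> tuple[int, int, int]:
        """Update the enables from the finish flags and return (en_div, en_h, en_lcd)."""
        if self.phase == 1:
            if fsh_lcd:                    # LCD initialization done: gate its clock
                self.en_lcd = 0
            if fsh_r:                      # padded message ready: enter phase two
                self.en_div, self.en_h = 0, 1
                self.phase = 2
        elif self.phase == 2:
            if fsh_h:                      # digest ready: enter phase three
                self.en_h, self.en_lcd = 0, 1
                self.phase = 3
        elif self.phase == 3:
            if fsh_lcd:                    # digest displayed: every unit is idle
                self.en_lcd = 0
                self.phase = 0
        return self.en_div, self.en_h, self.en_lcd


def gated_clock(clk: int, enable: int) -> int:
    """Simplified load-enable gating: the clock only toggles while enable is one."""
    return clk & enable


ctrl = MainControl()
trace = [
    ctrl.step(),             # phase one, LCD still initializing
    ctrl.step(fsh_lcd=1),    # LCD initialized before the message is ready
    ctrl.step(fsh_r=1),      # message padded, hash starts
    ctrl.step(fsh_h=1),      # digest ready, display starts
    ctrl.step(fsh_lcd=1),    # display finished
]
print(trace)
```

In this model at most two units are clocked at any time, and after the final step all three enables are zero, which is the idle condition that clock gating exploits.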
4.4. Experimental Results. Using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput under the frequency trade-off method. Note that $f_i$ stands for frequency, $T_i$ stands for throughput, and $T_{i\mathrm{pct}}$ is the percentage increase compared with the minimum throughput ($T_{\min}$), which is 6.67 Mbps. $P_i$ is the total dynamic power consumption for finishing a complete Perm function, and $P_{i\mathrm{pct}}$ is the percentage power reduction compared with the maximum power consumption ($P_{\max}$), defined as 1308.96 μW, which is the product of 48 (the number of iteration rounds) and 27.27 μW (as shown in Table 7). Here $i$ stands for the number of iteration rounds.

Table 8: Hardware implementation with/without load-enable based clock gating.
System type   Area (GE)   Area increase (%)   Delay (ns)   Delay increase (%)   Power (μW)   Power reduction (%)
Original      14053       n/a                 1.63         n/a                  1830.20      n/a
Clock gated   14565       3.64                1.72         5.52                 1580.36      13.65

Table 9: Area and delay performances of frequency trade-off technique.
Number of iteration rounds   Area (GE)   Delay (ns)   Frequency (MHz)
48                           965         0.94         10.00
24                           1930        1.92         $5.00 < f_{24} < 6.96$
16                           2895        2.91         $3.33 < f_{16} < 6.20$
12                           3860        3.91         $2.50 < f_{12} < 5.89$
8                            5790        5.90         $1.67 < f_{8} < 5.60$
6                            7720        7.89         $1.25 < f_{6} < 5.47$
4                            11580       11.87        $0.83 < f_{4} < 5.34$
3                            15440       15.84        $0.63 < f_{3} < 5.28$
2                            23160       23.80        $0.42 < f_{2} < 5.22$
1                            46320       47.71        $0.21 < f_{1} < 5.21$
Then we apply the load-enable based clock gating scheme to the hash encryption system, using a 100 MHz clock frequency, which can be provided on the FPGA board, and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%. However, this comes at the cost of a 3.64% increase in area and a 5.52% increase in critical path delay.
Table 10: Dynamic power consumption of frequency trade-off technique.
Number of iteration rounds   Power (μW)                     Reduction (%)                            Frequency (MHz)
48                           1308.96                        n/a                                      10.00
24                           $940.08 < P_{24} < 1308.48$    $0.04 < P_{24\mathrm{pct}} < 28.18$      $5.00 < f_{24} < 6.96$
16                           $702.88 < P_{16} < 1307.20$    $0.13 < P_{16\mathrm{pct}} < 46.30$      $3.33 < f_{16} < 6.20$
12                           $555.12 < P_{12} < 1308.00$    $0.07 < P_{12\mathrm{pct}} < 57.59$      $2.50 < f_{12} < 5.89$
8                            $389.04 < P_{8} < 1307.12$     $0.14 < P_{8\mathrm{pct}} < 70.28$       $1.67 < f_{8} < 5.60$
6                            $298.80 < P_{6} < 1307.64$     $0.10 < P_{6\mathrm{pct}} < 77.17$       $1.25 < f_{6} < 5.47$
4                            $203.92 < P_{4} < 1306.76$     $0.17 < P_{4\mathrm{pct}} < 84.42$       $0.83 < f_{4} < 5.34$
3                            $154.71 < P_{3} < 1306.89$     $0.16 < P_{3\mathrm{pct}} < 88.18$       $0.63 < f_{3} < 5.28$
2                            $104.30 < P_{2} < 1306.76$     $0.17 < P_{2\mathrm{pct}} < 92.03$       $0.42 < f_{2} < 5.22$
1                            $52.29 < P_{1} < 1307.60$      $0.10 < P_{1\mathrm{pct}} < 96.01$       $0.21 < f_{1} < 5.21$
Table 11: Throughput performances of frequency trade-off technique.
Number of iteration rounds   Throughput (Mbps)          Improvement (%)                            Frequency (MHz)
48                           6.67                       n/a                                        10.00
24                           $6.67 < T_{24} < 9.28$     $0.00 < T_{24\mathrm{pct}} < 39.13$        $5.00 < f_{24} < 6.96$
16                           $6.67 < T_{16} < 12.4$     $0.00 < T_{16\mathrm{pct}} < 85.91$        $3.33 < f_{16} < 6.20$
12                           $6.67 < T_{12} < 15.71$    $0.00 < T_{12\mathrm{pct}} < 135.53$       $2.50 < f_{12} < 5.89$
8                            $6.67 < T_{8} < 22.40$     $0.00 < T_{8\mathrm{pct}} < 235.83$        $1.67 < f_{8} < 5.60$
6                            $6.67 < T_{6} < 29.17$     $0.00 < T_{6\mathrm{pct}} < 337.33$        $1.25 < f_{6} < 5.47$
4                            $6.67 < T_{4} < 42.72$     $0.00 < T_{4\mathrm{pct}} < 540.48$        $0.83 < f_{4} < 5.34$
3                            $6.67 < T_{3} < 56.32$     $0.00 < T_{3\mathrm{pct}} < 744.38$        $0.63 < f_{3} < 5.28$
2                            $6.67 < T_{2} < 83.52$     $0.00 < T_{2\mathrm{pct}} < 1152.17$       $0.42 < f_{2} < 5.22$
1                            $6.67 < T_{1} < 166.72$    $0.00 < T_{1\mathrm{pct}} < 2399.55$       $0.21 < f_{1} < 5.21$
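The power and throughput ranges in Tables 10 and 11 follow from evaluating a design at the two ends of its frequency range. The sketch below shows this for the 24-round design. It assumes linear scaling of power and throughput with clock frequency from the 10 MHz reference values; the function name `bounds_at` is hypothetical, and the inputs are the Table 7 values for 24 iteration rounds with decimal points restored.

```python
F_REF_MHZ = 10.0            # frequency at which Table 7 was measured
P_MAX_UW = 48 * 27.27       # maximum power bound, equation (13)
T_MIN_MBPS = 6.67           # minimum throughput bound, equation (13)


def bounds_at(rounds: int, power_uw: float, throughput_mbps: float, f_mhz: float):
    """Return (total power, power reduction %, throughput, throughput gain %) at f_mhz."""
    scale = f_mhz / F_REF_MHZ
    total_power = rounds * power_uw * scale          # power for one complete Perm
    throughput = throughput_mbps * scale
    reduction_pct = 100.0 * (P_MAX_UW - total_power) / P_MAX_UW
    improvement_pct = 100.0 * (throughput - T_MIN_MBPS) / T_MIN_MBPS
    return total_power, reduction_pct, throughput, improvement_pct


# 24-round design from Table 7: 78.34 uW and 13.33 Mbps at 10 MHz.
low_end = bounds_at(24, 78.34, 13.33, f_mhz=5.00)
high_end = bounds_at(24, 78.34, 13.33, f_mhz=6.96)
print("f = 5.00 MHz:", [round(v, 2) for v in low_end])
print("f = 6.96 MHz:", [round(v, 2) for v in high_end])
```

At the lower frequency bound the power saving is largest and the throughput matches the rolled circuit; at the upper bound the throughput gain is largest and the power approaches $P_{\max}$. Small differences from the tabulated values are expected because the table entries are rounded.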
5. Conclusion
To achieve a high performance and low power hardware implementation of a cryptographic hash function that uses the sponge construction, we first use the unfolding transformation technique to improve the throughput of the hash function. Secondly, pipeline and parallelism design techniques are implemented to reduce the critical path delay by modifying the structure of the permutation function. Thirdly, a frequency trade-off technique is proposed to calculate a frequency range that can be used to trade off low dynamic power consumption against high throughput of the hash function. Finally, a load-enable based clock gating scheme is applied to the hash encryption system to eliminate wasted signal toggling in idle mode.
The experimental results show that the unfolding transformation technique achieves up to 47.97 times higher throughput, the pipeline and parallelism methods give a 63.1% delay reduction, the load-enable based clock gating scheme reduces dynamic power consumption by 13.65%, and the frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve low power consumption and high throughput.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).
References
[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1-4, Hong Kong, December 2008.
[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.
[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.
[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165-180, Amsterdam, The Netherlands, 2002.
[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82-92, 2005.
[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.
[7] "Sponge function," WIKIPEDIA, http://en.wikipedia.org/wiki/Sponge_function.
[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.
[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.
[10] Y. Zhang, Q. Tong, L. Li et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceedings of the International SoC Design Conference (ISOCC '12), pp. 21-24, Jeju Island, Republic of Korea, 2012.
[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia - a 128-bit block cipher," Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.
[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354-359, Steamboat Springs, Colo, USA, September 2006.
[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086-4089, Kobe, Japan, May 2005.
[14] S. Badel, N. Dagtekin, J. Nakahara Jr. et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398-412, 2010.
[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT, Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.
International Journal of
AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Active and Passive Electronic Components
Control Scienceand Engineering
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
RotatingMachinery
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
Journal ofEngineeringVolume 2014
Submit your manuscripts athttpwwwhindawicom
VLSI Design
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Shock and Vibration
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Civil EngineeringAdvances in
Acoustics and VibrationAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Advances inOptoElectronics
Hindawi Publishing Corporation httpwwwhindawicom
Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
SensorsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Chemical EngineeringInternational Journal of Antennas and
Propagation
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Navigation and Observation
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
DistributedSensor Networks
International Journal of
International Journal of Distributed Sensor Networks 9
Receiver
RAM
Maincontrol LCD
displayer
Hashprocess
Clockdivider
Inputmessage
Digestdisplay
Outputdigestn times br bits
and
Figure 8 Hash encryption system
Perm function) each unfolding transformation design withdifferent numbers of unrolling STEPs (2 3 48) has twoperformance bounds one is maximum dynamic power andthe other is minimum throughput of the circuit These twoperformance bounds are used to determine the boundary ofproper frequency range for each unfolding transformationcircuit It means that when we choose one specific clockfrequency in this value scope the total dynamic powerconsumption of that PERM function will be not more thandefined maximum dynamic power 119875max and its throughputwill be not less than that fixed minimum throughput 119879min
This clock frequency scope gives us many differentchoices for different circuit designs by using unfoldingtransformation techniqueThe results of this frequency trade-off technique are shown in Table 9 in Section 44
42 Hash Encryption System Design The hash encryptionsystem is divided into 5 main parts as shown in Figure 8
Firstly the receiver and RAM section is actually ourpadding unit We use serial communication technique toconnect PC and the hash encryption system Thus weneed clock divider to generate proper clock cycle to besynchronous with Baud rate of serial communication Wechoose 4800 Bauds as our transmission Baud rate which isnot a quick speed for low error rate (less than 3) In thiscase one Baud represents 1 bit Our rule of transmission isa one start bit ldquo0rdquo then 8-bit message and one finish bit ldquo1rdquoThis start bit and finish bit will be added into the transmissionmessage bits automatically the sampling rate of receiver is 16and FPGA board provides 100MHz clock frequency Thusthe clock period used in sampling is 1302 times provided100MHz clock period as shown in (14)This error is 00064less than 3
Sampling Clock Cycles =Clock Frequency
Baud rate sdot Sampling Rate
=100MHz
16 times 4800Bs
asymp 1302
(14)
Because the liquid crystal display (LCD) limits the numberof characters we can display which are 32 characters inhexadecimal this number is suitable for the number ofdigest bits of SHAT-128 Thus our 119887119903 for each padded blockis determined to be 32 bits which consist of eight 4-bithexadecimal numbers
Secondly hash functionwhichwe introduced in Section 3is designed as sponge construction as shown in Figure 4Absorbing 119899 32-bit message blocks there are 128 bits digestthat will be squeezed out
Finally the main control unit is designed for managingthe working order between receiver hash process and LCDdisplay Figure 9 shows the pipeline working of system
Because we use serial communication technique thespeed will be slow We apply 4800 Bauds as our Baud ratefor low error rate thus each 32-bit block needs roughly 7msFor example there are seven 32-bit blocks that need to betransmitted roughly 50ms needs to be dissipated for datareceiving and padding Although the hash function that weused in this system is one STEP each round this means thatthere are 48 iteration rounds for a complete Perm functionHowever hash processing just needs roughly 6 120583s It also costsmuch time in LCD displaying period Even though we canfinish LCD initialization before we get hash digest we stillneed roughly 15ms to completely display all data
43 Load-Enable Based Clock Gating In this section weintroduce the load-enable based clock gating technique forthe hash encryption system
Clock gating is themost widely used low power techniqueat RTL It is more reasonable to determine the toggle rate ofgate output at RTL than any other three components such as119881DD clock frequency and gate output capacitance Accordingto Figure 9 the hash encryption system is composed of apipeline construction Finishing signal of each process canbe treated as enable signal in load-enable based clock gatingas shown in Figure 3 On the other hand XOR-based clockgating technique needs to specify the outputs of single levelflip-flops which is not easily determined in our encryptionsystem thus the load-enable based clock gating is our bestoption for low power method
As shown in Figure 10 there are three signal pairs torealize this load-enable based clock gating 119890119899 119889119894V and 119891119904ℎ 119903119890119899 ℎ and 119891119904ℎ ℎ and 119890119899 119897119888119889 and 119891119904ℎ 119897119888119889 Because receiveris implemented in a specific clock frequency which is cor-responding to the serial communication the main controlunit will not gate the clock signal of receiver directly bycontrolling the clock signal of clock divider with 119890119899 119889119894Vreceiver can be properly managed
Figure 9 gives us three operation phases of the encryptionsystem In first phrase 119890119899 119889119894V and 119890119899 119897119888119889 signals are assertedto logic one and 119890119899 ℎ is asserted to logic zero thus receiverstarts receiving input messages and padding them into RAMAt the meantime system will begin the initialization processfor LCD displayer However the hash processing unit iswaiting for the padded input message Considering the serialcommunication takes long time due to the low Baud rate andits characteristic which is transmitting message bit one by
10 International Journal of Distributed Sensor Networks
Receiver
Hash
LCD
Phase one Phase two Phase three
Data receiving and padding
Idle
Initialization Idle
Idle
IdleHash processing
LCD displaying
Figure 9 Three phases of hash encryption system
Receiver
RAM
Maincontrol
LCDdisplayer
Hashprocess
Clockdivider
Inputmessage
Digestdisplay
Paddedmessage
Digest
en lcd
fsh r
en di
en h fsh h
fsh lcd
clk r
and
Figure 10 Control signals of hash encryption system
one LCD displayer initialization can be finished before thepaddedmessage is readyThus 119890119899 119897119888119889 can be asserted to logiczero by main control unit when 119891119904ℎ 119897119888119889 is switching to logicone
During the second phase because the padded messageis ready then 119891119904ℎ 119903 switches to logic one Then 119890119899 119889119894V isasserted to zero which means that clock divider is turnedoff then no specific clock frequency is produced thus thereceiver will stop working In this phase 119890119899 ℎ is asserted tologic one for hash encryption which is our core function119890119899 119897119888119889 is still zero waiting for the hash digest generated byhash processing
This system will enter the third phase when the 119891119904ℎ ℎsignal switches to logic one In this phase hash digest isready thus both receiver and hash processes are in idle modewhich means that 119890119899 119889119894V and 119890119899 ℎ are all asserted to logiczero Signal 119890119899 119897119888119889 will be asserted to logic one to start LCDdisplaying 119890119899 119897119888119889 will be asserted back to zero when thedisplaying process is finished This is the end of the wholesystem then the device will be turned off or repeats thesethree phases for another input message
By analyzing the construction and process of hashencryption system we can figure out the idle time foreach component Then applying the load-enable based clockgating to each component the dynamic power dissipation ofthis system can be properly reduced as shown in Table 8 inSection 44
44 Experimental Results By using 10MHz clock frequencyand 45 nm CMOS technology the results of frequency trade-off technique are shown in Tables 9 10 and 11 Table 9 showsthat the area and critical path delay are not changed compar-ing with the unfolding transformation technique Tables 10
Table 8 Hardware implementationwithwithout load-enable basedclock gating
Systemtype
Area Delay Power
(GE) Increase() (ns) Increase
() (120583W) Reduction()
Original 14053 na 163 na 183020 naClockgated 14565 364 172 552 158036 1365
Table 9 Area and delay performances of frequency trade-offtechnique
Number ofiteration rounds
Area(GE)
Delay(ns)
Frequency(MHz)
48 965 094 100024 1930 192 500 lt 119891
24lt 696
16 2895 291 333 lt 11989116lt 620
12 3860 391 250 lt 11989112lt 589
8 5790 590 167 lt 1198918lt 560
6 7720 789 125 lt 1198916lt 547
4 11580 1187 083 lt 1198914lt 534
3 15440 1584 063 lt 1198913lt 528
2 23160 2380 042 lt 1198912lt 522
1 46320 4771 021 lt 1198911lt 521
and 11 give us the variation of dynamic power consumptionand throughput with frequency trade-off method Note that119891119894stands for frequency 119879
119894stands for throughput and 119879
119894pct isthe percentage of increasing comparing with the minimumthroughput (119879min) which is 667Mbps 119875
119894means the total
dynamic power consumption by finishing a complete Permfunction and 119875
119894pct is the percentage of power reductioncomparing with the maximum power consumption (119875max)defined as 130896 120583W which is calculated from the productof 48 (number of iteration rounds) and 2727120583W(as shown inTable 7)Note that 119894 stands for the number of iteration rounds
Then we apply load-enable based clock gating schemeto hash encryption system by using 100MHz clock fre-quency which can be provided on FPGA board and 45 nmCMOS technology As shown in Table 8 the dynamic powerdecreases 1365 However 364 increased area and 552increased critical path delay are sacrificed
International Journal of Distributed Sensor Networks 11
Table 10 Dynamic power consumption of frequency trade-off technique
Number of iteration rounds Power Frequency(120583W) Reduction () (MHz)
48 130896 na 100024 94008 lt 119875
24lt 130848 2818 lt 119875
24pct lt 004 500 lt 11989124lt 696
16 70288 lt 11987516lt 130720 4630 lt 119875
16pct lt 013 333 lt 11989116lt 620
12 55512 lt 11987512lt 130800 5759 lt 119875
12pct lt 007 250 lt 11989112lt 589
8 38904 lt 1198758lt 130712 7028 lt 119875
8pct lt 014 167 lt 1198918lt 560
6 29880 lt 1198756lt 130764 7717 lt 119875
6pct lt 010 125 lt 1198916lt 547
4 20392 lt 1198754lt 130676 8442 lt 119875
4pct lt 017 083 lt 1198914lt 534
3 15471 lt 1198753lt 130689 8818 lt 119875
3pct lt 016 063 lt 1198913lt 528
2 10430 lt 1198752lt 130676 9203 lt 119875
2pct lt 017 042 lt 1198912lt 522
1 5229 lt 1198751lt 130760 9601 lt 119875
1pct lt 010 021 lt 1198911lt 521
Table 11 Throughput performances of frequency trade-off technique
Number of iteration rounds Throughput Frequency(Mbps) Improvement () (MHz)
48 667 na 100024 667 lt 119879
24lt 928 000 lt 119879
24pct lt 3913 500 lt 11989124lt 696
16 667 lt 11987916lt 124 000 lt 119879
16pct lt 8591 333 lt 11989116lt 620
12 667 lt 11987912lt 1571 000 lt 119879
12pct lt 13553 250 lt 11989112lt 589
8 667 lt 1198798lt 2240 000 lt 119879
8pct lt 23583 167 lt 1198918lt 560
6 667 lt 1198796lt 2917 000 lt 119879
6pct lt 33733 125 lt 1198916lt 547
4 667 lt 1198794lt 4272 000 lt 119879
4pct lt 54048 083 lt 1198914lt 534
3 667 lt 1198793lt 5632 000 lt 119879
3pct lt 74438 063 lt 1198913lt 528
2 667 lt 1198792lt 8352 000 lt 119879
2pct lt 115217 042 lt 1198912lt 522
1 667 lt 1198791lt 16672 000 lt 119879
1pct lt 239955 021 lt 1198911lt 521
5. Conclusion
To achieve a high performance and low power hardware implementation of a cryptographic hash function that uses the sponge construction, we applied four techniques. First, the unfolding transformation improves the throughput of the hash function. Second, pipelining and parallelism reduce the critical path delay by modifying the structure of the permutation function. Third, the proposed frequency trade-off technique calculates a frequency range that can be used to trade off low dynamic power consumption against high throughput. Finally, a load-enable based clock gating scheme is applied to the hash encryption system to eliminate wasted signal toggling in idle mode.
The experimental results show that the unfolding transformation achieves up to 47.97 times higher throughput, the pipelining and parallelism methods give a 6.31% delay reduction, and the load-enable based clock gating scheme decreases dynamic power consumption by 13.65%. The frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve both low power consumption and high throughput.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).
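Figure 9: Three phases of hash encryption system. (Phase one: the receiver performs data receiving and padding, the LCD is initialized and then idles, and the hash process is idle. Phase two: hash processing, with the receiver and LCD idle. Phase three: LCD displaying, with the receiver and hash process idle.)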
Figure 10: Control signals of hash encryption system. (Blocks: receiver, RAM, main control, LCD displayer, hash process, and clock divider. Data paths: input message, padded message, digest, and digest display. Control signals: en_div, en_h, en_lcd, fsh_r, fsh_h, fsh_lcd, and clk_r.)
one, the LCD displayer initialization can be finished before the padded message is ready. Thus, en_lcd can be set to logic zero by the main control unit when fsh_lcd switches to logic one.
During the second phase, the padded message is ready, so fsh_r switches to logic one. Then en_div is set to zero, which turns the clock divider off; no receiver clock is produced, and the receiver stops working. In this phase en_h is asserted to logic one for hash encryption, which is our core function. en_lcd remains zero while waiting for the hash digest generated by hash processing.
The system enters the third phase when the fsh_h signal switches to logic one. In this phase the hash digest is ready, so both the receiver and the hash process are in idle mode, which means that en_div and en_h are both held at logic zero. Signal en_lcd is asserted to logic one to start LCD displaying and is set back to zero when the displaying process is finished. This ends the whole process; the device is then turned off or repeats these three phases for another input message.
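The enable logic described for the three phases can be summarized as a small behavioral model. This is only a sketch of the main control unit, not the authors' RTL: the en_h value in phase one is assumed from Figure 9 (hash idle), and display_done is a hypothetical flag standing in for whatever signal marks the end of LCD displaying, which the text does not name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enables:
    en_div: int  # clock divider enable (drives the receiver clock)
    en_h: int    # hash process enable
    en_lcd: int  # LCD displayer enable


def main_control(fsh_r: int, fsh_h: int, fsh_lcd: int, display_done: int) -> Enables:
    """Map finish flags to load-enable signals for the three phases.

    fsh_r, fsh_h, fsh_lcd follow the signal names in the text.
    display_done is HYPOTHETICAL: it marks the end of LCD displaying.
    """
    if not fsh_r:
        # Phase one: receiver runs; LCD is enabled only until its
        # initialization finishes (fsh_lcd goes to one).
        return Enables(en_div=1, en_h=0, en_lcd=0 if fsh_lcd else 1)
    if not fsh_h:
        # Phase two: padded message ready, only the hash process runs.
        return Enables(en_div=0, en_h=1, en_lcd=0)
    # Phase three: digest ready, only the LCD runs until displaying is done.
    return Enables(en_div=0, en_h=0, en_lcd=0 if display_done else 1)


# Walk through the phases in order.
for flags in [(0, 0, 0, 0), (0, 0, 1, 0), (1, 0, 1, 0), (1, 1, 1, 0), (1, 1, 1, 1)]:
    print(flags, main_control(*flags))
```

At most one of the hash process and the receiver is enabled at any time, which is what gives load-enable based clock gating its idle periods to exploit.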
By analyzing the construction and process of the hash encryption system, we can determine the idle time of each component. Applying load-enable based clock gating to each component then reduces the dynamic power dissipation of the system, as shown in Table 8 in Section 4.4.
4.4. Experimental Results. Using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput with the frequency trade-off method. Here i stands for the number of iteration rounds, f_i stands for frequency, T_i stands for throughput, and T_ipct is the percentage increase compared with the minimum throughput (T_min), which is 6.67 Mbps. P_i is the total dynamic power consumed in finishing a complete Perm function, and P_ipct is the percentage power reduction compared with the maximum power consumption (P_max), defined as 1308.96 µW, which is the product of 48 (the number of iteration rounds) and 27.27 µW (as shown in Table 7).

Table 9: Area and delay performances of frequency trade-off technique.

Number of iteration rounds    Area (GE)    Delay (ns)    Frequency (MHz)
48                            965          0.94          10.00
24                            1930         1.92          5.00 < f_24 < 6.96
16                            2895         2.91          3.33 < f_16 < 6.20
12                            3860         3.91          2.50 < f_12 < 5.89
8                             5790         5.90          1.67 < f_8 < 5.60
6                             7720         7.89          1.25 < f_6 < 5.47
4                             11580        11.87         0.83 < f_4 < 5.34
3                             15440        15.84         0.63 < f_3 < 5.28
2                             23160        23.80         0.42 < f_2 < 5.22
1                             46320        47.71         0.21 < f_1 < 5.21

Then we apply the load-enable based clock gating scheme to the hash encryption system, using a 100 MHz clock frequency, which can be provided on an FPGA board, and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%. However, this costs a 3.64% increase in area and a 5.52% increase in critical path delay.

Table 8: Hardware implementation with/without load-enable based clock gating.

System type    Area (GE)    Increase (%)    Delay (ns)    Increase (%)    Power (µW)    Reduction (%)
Original       14053        n/a             1.63          n/a             1830.20       n/a
Clock gated    14565        3.64            1.72          5.52            1580.36       13.65
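The overhead and saving percentages quoted for clock gating follow directly from the raw figures in Table 8. The short check below recomputes them, assuming the decimal placement 1.63 ns, 1.72 ns, 1830.20 µW, and 1580.36 µW; the dictionary keys and function name are illustrative only.

```python
def pct_change(before: float, after: float) -> float:
    """Signed percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Figures from Table 8 (assumed decimal placement).
ORIGINAL = {"area_ge": 14053, "delay_ns": 1.63, "power_uw": 1830.20}
CLOCK_GATED = {"area_ge": 14565, "delay_ns": 1.72, "power_uw": 1580.36}

for key in ORIGINAL:
    # Positive values are increases (area, delay); negative is a reduction (power).
    print(key, round(pct_change(ORIGINAL[key], CLOCK_GATED[key]), 2))
```

The power saving is paid for with the extra gating logic, which shows up as the area increase and the slightly longer critical path.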