Glitch-Free Implementation of Masking in Modern …...2012/04/24  · Glitch-Free Implementation of...


Glitch-Free Implementation of Masking in Modern FPGAs

Amir Moradi and Oliver Mischke

Horst Görtz Institute for IT Security, Ruhr University Bochum, Germany
{moradi, mischke}@crypto.rub.de

Abstract—Due to the propagation of glitches in combinational circuits, side-channel leakage of masked S-boxes realized in hardware is a known issue. Our contribution in this paper is to adapt a masked AES S-box circuit to the FPGA resources in order to avoid the glitches. Our design is suitable for the 5, 6, and 7 FPGA series of Xilinx, although our practical investigations are performed using a Virtex-5 chip. In short, compared to the original design synthesized by automatic tools, and while requiring the same area (slice count), our design reduces power consumption, critical path delay, and, more importantly, the side-channel leakage. In our practical investigations we could not recover any first-order leakage of our design using up to 50 million traces. However, since the targeted S-box realizes a first-order Boolean masking, the second-order leakage could be revealed using around 25 million measurements.

I. INTRODUCTION

With the increasing pervasion of cryptography in more and more embedded systems, to protect either the intellectual property of a vendor or to preserve privacy by allowing secure communications, the need for secure implementations of cryptographic primitives like AES is at an all-time high. These implementations should not only be resistant to classical attacks but also be protected against side-channel attacks like power analysis [11], [12].

Countermeasures against power analysis attacks in hardware can be realized on multiple levels. However, if the target platform is an FPGA, algorithmic-level countermeasures are the main possible choices. Masking of sensitive values is one of the most considered solutions, and several schemes have already been published. These options include multiplicative [2], [10], additive [3], [7], [20], or the relatively recent affine [9] masking schemes.

The problem of masking in hardware could not yet be solved by these schemes. Several attacks have been published, e.g., [13], [15], which exploit a remaining first-order leakage in the designs. The reason for the remaining leakage, namely glitches in the combinational circuits, is well known to the community. A couple of new schemes have been proposed to solve this issue by creating glitch-resistant implementations. The notable ones are the threshold implementation (TI) [17], [18], [19] and a new proposal based on a mixture of multi-party computation (MPC) and Shamir secret sharing [22], [23].

However, making a correct TI of most algorithms is very challenging. So far only the Noekeon [8] and the PRESENT [5] S-boxes could be successfully implemented [19], [21]. The MPC scheme has not been practically evaluated yet, but because of the proposed design of the inversion, the area and speed overheads of a single S-box computation are quite large.

In this work we do not try to create a glitch-resistant implementation but instead try to avoid causing any glitches. The target of our implementation is the Virtex-5 LX-50 FPGA of the readily available side-channel evaluation platform SASEBO-GII [1]. For this we take the very compact masked S-box by Canright-Batina [7] and manually map the combinational functions to the resources of our target platform. By efficiently using special enable signals in each FPGA Look-Up Table (LUT), we can suppress any glitches at the LUT outputs by enabling them only sequentially.

We have evaluated different versions of our design, including a fully pipelined one achieving a very high clock frequency. Note that although our design has initially been optimized for the 6-input LUT architecture of the Xilinx Virtex-5 FPGA, the same architecture is used in the newer Series 6 and 7 FPGAs, which allows using the same design on these recent platforms.

When evaluating the side-channel leakage of our final design, contrary to the original S-box implementation, our design did not show any first-order leakage when analyzing 50 million measurements. Since the scheme only implements a first-order masking, a second-order attack is expected to be successful, which is practically confirmed using a very large number of 25 million measurements.

In the next section we briefly describe the reasons why we have selected the Canright-Batina masked S-box as the basis of our implementation. Moreover, we introduce the Xilinx LUT architecture and how we have used it to eliminate glitches. Section III gives an overview of our S-box design and names the implementation profiles used in the evaluation, whose results are presented in Section IV. Finally, Section V concludes this article.

II. TARGETS

In the following we first give a short summary of recent masked S-box designs and state why we have chosen the one of Canright and Batina as the basis for our modifications to create a glitch-free version. Afterwards we describe the architecture of the Xilinx 6-input LUT and how we use it to minimize the possible leakage.

A. Masked AES S-box

As stated previously, the currently known glitch-resistant schemes come with some drawbacks. Threshold implementation has been shown to be quite effective when using small S-boxes [21], but because of the large S-box size of AES, up to now no expressions could be found to rewrite the AES S-box using this scheme. Note that the implementation reported in [16] has been made by masking the multipliers of a tower-field implementation of the AES S-box, which could not follow the requirements of the threshold implementation. At CHES 2011 a mixture of the Shamir secret sharing scheme and multi-party computation was introduced [22]. While it has not been practically evaluated yet, it is clear that the hardware resource requirements are quite high. Furthermore, because of the sequential way of computing the inversion of the S-box, a large number of clock cycles is necessary to compute only one S-box output. All these predicted area and time overheads may hinder its practical feasibility.

Fig. 1. Masked GF(2^8) Inverter by Canright-Batina (taken from [15])

Instead of focusing on glitch resistance, in this article we try to avoid any glitches at the FPGA LUTs altogether. Among the more traditional masking schemes currently known, the one of Canright-Batina [7] uses an additive masking and implements the S-box in a tower-field approach using carefully chosen normal bases to minimize the circuit size. It is based on the area-optimized S-box by Canright [6], and it is still supposed to be the most compact design available. While it was claimed to be perfectly secure by the definition of [3], it was shown in [15] that because of glitches in the circuit there still exists an exploitable first-order leakage. Figure 1 shows an overview of the GF(2^8) inverter design, omitting the tower-field conversions. The GF(2^4) inverter is implemented using the same design, the only difference being that the inversion in GF(2^2) is also merged into this module. The authors of the original design were kind enough to supply the HDL source code online¹, which we used as the basis for our modifications detailed in the following.

B. Xilinx FPGA Resources

When not using dedicated hardware blocks like Multipliers/DSPs, a combinational logic circuit in an FPGA is usually implemented by means of many-to-one Look-Up Tables. Their general design is a number of single-bit storage elements whose values are initialized during the configuration of the FPGA by the bitstream. The inputs of the LUT control the setting of internal multiplexers, thereby choosing which stored bit value is available at the output of the LUT. As an example, considering the 6-to-1 LUT of the Xilinx Series 5, 6, and 7 FPGAs, this LUT is realized as two 5-to-1 LUTs and a multiplexer, as can be seen in Fig. 2(a). Each of these 5-to-1 LUTs can itself again be seen as two 4-to-1 LUTs and a multiplexer, and so on.

¹ http://faculty.nps.edu/drcanrig/pub/index.html

Fig. 2. Two possible LUTs in Virtex-5: (a) 6-input LUT, (b) 32-bit Shift-Register [25]

In our device under test, the Xilinx Virtex-5 LX50 FPGA mounted on a SASEBO-GII board, each slice consists of four LUT6 and four single-bit flip-flops. The LUT6, as depicted in Fig. 2(a), can be hard-instantiated in two different configurations. As LUT6_1, any combinational function having up to 6 input signals and one output signal can be implemented. Using the LUT in a LUT5_2 configuration allows providing two output signals from the 5 inputs, but only if these 5 inputs are the same for both internal 5-to-1 LUTs, i.e., the inputs must be shared.

Glitches at the output of a LUT happen since the input signals arrive at different instants of time because of the routing in the device. In order to avoid this, the output of the LUT must be held stable until all input signals have arrived. We achieve this by using one of the input signals as an active-low enable signal, i.e., in our case as long as this input signal is set to logic '1', the LUT output will always be logic '0' no matter the values of the other input signals. Here it is important to choose the LUT input signal used as enable carefully. Let us consider choosing the input I5 in Fig. 2(a) as the enable signal. While the output of the LUT6 will actually not change during the transition period of the other input signals, there will still be glitches at the output of one of the internal LUT5 instances. We therefore have to choose the input signal which controls the very first multiplexer stage so that toggles at the select signals of the following multiplexers do not cause any glitches.
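The effect of the enable input can be illustrated with a minimal behavioural sketch in Python. It is not the HDL of the design: it models a single two-input XOR LUT at the LUT output level only, so it does not distinguish which internal multiplexer stage the enable drives. The discrete time steps, the arrival times and the helper names (`lut_output`, `count_toggles`, `xor2`) are assumptions made purely for illustration.

```python
def lut_output(f, inputs, enable_n):
    """Active-low enable: while enable_n is 1 the LUT output is forced to 0."""
    return 0 if enable_n else f(*inputs)


def count_toggles(f, old, new, arrival, enable_release=None, horizon=10):
    """Count output transitions of one LUT over discrete time steps.

    old, new       -- input values before and after the change
    arrival        -- time step at which each input takes its new value
    enable_release -- step at which the active-low enable drops to 0;
                      None means the LUT is enabled all the time
    """
    toggles = 0
    prev = None
    for t in range(horizon):
        inputs = [n if t >= a else o for o, n, a in zip(old, new, arrival)]
        enable_n = 0 if enable_release is None or t >= enable_release else 1
        out = lut_output(f, inputs, enable_n)
        if prev is not None and out != prev:
            toggles += 1
        prev = out
    return toggles


def xor2(a, b):
    return a ^ b


# Inputs go from (0, 0) to (1, 1) but arrive at steps 1 and 3 (hypothetical skew).
print(count_toggles(xor2, (0, 0), (1, 1), (1, 3)))                    # 2: a glitch pulse
print(count_toggles(xor2, (0, 0), (1, 1), (1, 3), enable_release=4))  # 0: output stays at 0
```

Releasing the enable only after the last input has arrived leaves at most the single transition that the new result requires, which is the behaviour the enable signal is meant to guarantee.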

Fig. 3. Design of our full-custom optimized S-box (inversion part only). (Block diagram: the masked GF_INV_8 module, composed of GF_MUL_4x4 and GF_INV_4 sub-modules built from LUT5x2 and LUT6x1 primitives implementing XOR, MUL 2x2, and MUL.SCL 2x2, with stage enables en1 to en15.)

Although the details of the internal architecture of the FPGA resources are not publicly available, this input signal can be identified by looking at the architecture of the SRLC32E depicted in Fig. 2(b). It is a special mode of operation for LUTs in some slices of Xilinx FPGAs that realizes a shift register. In this mode the content of the LUT storage cells can be changed in a serial fashion during the operation of the FPGA. By using the LUT inputs as select lines, the length of the shift register can be set dynamically. Since the all-zero input sets the length to 1 bit, and switching the I0 input signal to logic '1' increases the length to 2 bits, i.e., chooses the neighboring cell, the I0 signal must control the very first multiplexer stage. Therefore, I0 is the correct choice for the enable signal. Note that since the synthesizer permutes the LUT input signals (and accordingly changes the LUT configuration) to optimize the routing, one has to keep the pin positions of the hard-instantiated LUTs locked by means of special constraints [24].
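As a hedged illustration of what using I0 as the enable means for the LUT configuration, the following Python sketch derives the 64-bit content of a LUT6 whose output is forced to '0' whenever I0 is '1'. It assumes the usual Xilinx indexing in which the stored bit selected by the inputs has index I5·32 + I4·16 + I3·8 + I2·4 + I1·2 + I0; the helper name `lut6_init_with_enable` and the example function are hypothetical and not taken from the published source code.

```python
def lut6_init_with_enable(f5):
    """Return the 64-bit INIT value of a LUT6 computing f5(I1..I5) gated by I0.

    I0 acts as the active-low enable: every address with I0 = 1 stores a 0.
    Assumed bit index: I5*32 + I4*16 + I3*8 + I2*4 + I1*2 + I0.
    """
    init = 0
    for addr in range(64):
        i0 = addr & 1
        data = [(addr >> k) & 1 for k in range(1, 6)]  # I1, I2, I3, I4, I5
        out = 0 if i0 else (f5(*data) & 1)
        init |= out << addr
    return init


def xor_i1_i2(i1, i2, i3, i4, i5):
    """Example payload: XOR of I1 and I2, ignoring the other inputs."""
    return i1 ^ i2


init = lut6_init_with_enable(xor_i1_i2)
print(f"64'h{init:016X}")
```

With I0 as the enable, every odd-indexed stored bit is zero, so the two storage cells selected by the first multiplexer stage differ only in that the disabled one always holds '0'.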

III. OUR DESIGN

The detailed structure of our design is given in Fig. 3. Omitting the tower-field conversion, 15 LUT stages are required to perform the full inversion in GF(2^8). We give performance figures for 6 different implementation profiles, from the original unmodified design to our optimized one, with or without pipelining stages and with or without the special enable signals that minimize glitches in the circuit. The implementation profiles of the S-box are as follows:

1) The original HDL code optimized by the ISE synthesizer
2) The original HDL but avoiding any optimizations or trimming by the synthesizer, i.e., one LUT per gate to keep all hierarchy levels
3) Our modified design using hard-instantiated LUTs, all enable signals always '0', no pipeline registers
4) Our modified design without pipelining but activating each stage sequentially by the enable signals
5) Our modified design using pipelining to hinder glitch propagation, but all enable signals always '0'
6) Our modified design using both pipelining to hinder the glitch propagation and the enable signals to avoid glitches in the circuit


In Profiles 1, 2, and 3 the implementations are purely combinational functions where in each clock cycle the full S-box is computed at once. Glitches in the first LUT stage are therefore passed through the whole S-box, generating a highly glitching circuit until all signals become stable. Therefore, we do not consider Profiles 1 and 2 in our side-channel evaluations (Section IV).

Profile 4 avoids this issue. Here only one LUT stage is activated in each clock cycle, thereby not only hindering the propagation of glitches but also not causing any glitches at all. That is because the input signals of the next LUT stage are stable when it is activated in the following clock cycle. The downside of this profile is its apparent impracticality. One needs 15 clock cycles to compute a single S-box output while the inputs must be held stable. To make matters worse, one would need to spend another 15 clock cycles to deactivate each stage in the reverse order before the next S-box computation can begin. In Profile 5 the pipelining stages hinder the glitch propagation. On the other hand, since all enable signals are kept at '0', glitches will still occur at the LUT outputs of each stage.

Finally, in the last Profile 6 we combine both the pipelining to avoid any glitch propagation and the use of the active-low enable signals to completely shut down glitches at the LUT outputs. In order to reach our goal in a straightforward way one would need to i) first disable all LUTs, ii) clock every second pipelining register after enabling its corresponding LUTs, iii) disable all LUTs again, iv) clock the other half of the pipelining registers, having their corresponding LUTs enabled, and so on. This means that a new S-box input can be fed into the circuit only every four clock cycles, and it leads to a latency of 30 clock cycles from input to output. This is necessary because if one simply merged clocking every second register and disabling the connected LUT stage at the same time, the routing of the signals would determine whether the disable signal arrives at the LUT first or whether other inputs arrive earlier, the latter of which causes glitches at the LUT output.

To avoid this issue we can use the special way the clock signal is routed in the FPGA. The clock is routed on special dedicated paths to each switch box separately to avoid race problems in synchronous circuits. However, the LUT output signals need to first go back to the corresponding slice's switch box and from there travel to the destination LUT inputs, where more switch boxes might be passed. Therefore, a transition, e.g., low-to-high, on the clock signal arrives at the registers and LUTs of each slice earlier than the other signals. Consequently, by tying our active-low LUT enable signals to the clock signal, the LUT gets deactivated at each rising clock edge before the new inputs arrive. At the falling edge of the clock the LUT becomes active and provides the output signal to the next flip-flop stage, where it will be stored at the next rising edge. This way the pipelining registers can be active in every clock cycle and no glitches will occur. Please note that the clock period in this case cannot be shorter than twice the longest critical path delay of the S-box circuit. In order to provide a better understanding, Fig. 4 showcases the different signal timings. Also, the performance results of each implementation profile for only the inversion module of the S-box are given in Table I.

Fig. 4. Signal timings on LUT inputs and outputs (signals shown: clk/LUT en, LUT data in, LUT output)

TABLE I
SYNTHESIS RESULTS FOR ALL PROFILES (INVERSION ONLY)

Profile  Max. Freq.    #LUTs  #FFs  Latency (#clocks)  Throughput (16 Inv./s)
1        105.519 MHz   99     0     0                  6 594 937
2        56.504 MHz    244    0     0                  3 531 500
3        88.300 MHz    100    0     0                  5 518 750
4        641.026 MHz   100    0     30                 1 335 471
5        641.026 MHz   100    649   15 (pipe'd)        20 678 258
6        320.513 MHz   100    649   15 (pipe'd)        10 339 129
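The throughput column can be cross-checked with a short Python sketch. The cycle counts per block of 16 inversions used below are an interpretation of the text, not values stated in the table: one inversion per clock for Profiles 1 to 3, 30 clocks per inversion for Profile 4, and 15 + 16 = 31 clocks for the pipelined Profiles 5 and 6. The names `PROFILES` and `throughput` are illustrative only.

```python
# (profile, max. frequency in Hz, assumed clocks per 16 inversions, table value)
PROFILES = [
    (1, 105.519e6, 16, 6_594_937),
    (2, 56.504e6, 16, 3_531_500),
    (3, 88.300e6, 16, 5_518_750),
    (4, 641.026e6, 16 * 30, 1_335_471),
    (5, 641.026e6, 15 + 16, 20_678_258),
    (6, 320.513e6, 15 + 16, 10_339_129),
]


def throughput(freq_hz, clocks_per_block):
    """Blocks of 16 inversions completed per second at the given clock."""
    return freq_hz / clocks_per_block


for profile, freq, clocks, reported in PROFILES:
    print(profile, f"{throughput(freq, clocks):.1f}", reported)
```

Under these assumptions the computed values agree with the table to within one unit. The frequency of Profile 6 is exactly half that of Profile 5, consistent with the LUTs being active only during the low half of each clock period.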

IV. EVALUATION

We used a SASEBO-GII [1] board as the target platform to examine the side-channel leakage of our designs. Different profiles of our design were implemented on the Virtex-5 (XC5VLX50) FPGA embedded on the target board, and the power consumption traces were collected using a LeCroy WP715Zi 1.5 GHz oscilloscope at a sampling rate of 1 GS/s. Since our design employs a very small number of LUTs in the target FPGA, and the number of toggles in each clock cycle is restricted, the peak-to-peak amplitude of the signal in the power traces was quite low. Therefore, we measured the power traces by means of a 1 Ω resistor in the VDD path, a DC blocker, a passive probe and an amplifier. Furthermore, we restricted the bandwidth of the measurements (on the oscilloscope) to 20 MHz to eliminate the electrical noise, while our designs were clocked by a stable 3 MHz oscillator.

We built an exemplary architecture containing one AddRoundKey module (128-bit) and one instance of the targeted S-box. The 128-bit (masked) input is XORed with a 128-bit secret key, and the result is sequentially given to the S-box module, one byte per clock cycle. The method we used to examine the side-channel leakage of our targeted designs is a correlation collision attack [15]. It examines the first-order leakage of one circuit instance that is used at different time instances. Therefore, it perfectly suits our exemplary architecture, since the targeted S-box instance is shared for all SubBytes transformations. The target masked S-box [7] uses two different mask bytes per input byte, i.e., a random byte to mask an input byte and another random byte as the mask of the S-box output. Therefore, we provided two random values for each input byte, and gave the above-mentioned architecture the masked inputs and the corresponding masks. In other words, in each run of the circuit two independent 128-bit random values are provided as input and output masks for the aforementioned circuit.

Fig. 5. Profile 3: evaluation results (a) a sample trace (Voltage [mV] over Time [µs]), (b) attack result using 50 000 traces, (c) over the number of traces.

For comparison purposes we start our evaluations with Profile 3 to have a reference design where glitches are not controlled and can propagate. Please note that we omitted the evaluation results of Profiles 1 and 2 since there is no control over the glitches, and they have the same side-channel leakage as Profile 3.

A sample power trace of this design is shown in Fig. 5(a). Sixteen clock cycles related to the sixteen S-box computations are clearly distinguishable. We measured 50 000 traces and performed a correlation collision attack considering two plaintext bytes which are processed consecutively by the targeted S-box instance. Note that this attack, similar to most side-channel collision attacks, recovers a relation between the targeted secret key bytes. In the case of our targets (as in the linear collision attack on AES [4]) the attack searches for the XOR difference of the key bytes corresponding to the two targeted plaintext bytes. The result of this attack is depicted in Fig. 5(b) and Fig. 5(c), showing how easily the secret is recovered, i.e., with 5 000 traces, when the glitches in the masked S-box are not controlled.
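The principle of recovering the XOR difference of two key bytes can be sketched on simulated data. The following Python example is a toy model under stated assumptions, not the measurement setup: each S-box invocation is reduced to a single leakage sample, the leakage is the Hamming weight of an unmasked stand-in permutation (`TOY_SBOX`, not the AES S-box) plus Gaussian noise, and all names and parameters are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def mean_traces(leak, key, reps):
    """Average simulated leakage of one S-box invocation per plaintext byte."""
    return [sum(leak(p ^ key) for _ in range(reps)) / reps for p in range(256)]


def collision_attack(means_a, means_b):
    """Return the delta maximising corr(means_a[p], means_b[p ^ delta])."""
    def score(delta):
        return pearson(means_a, [means_b[p ^ delta] for p in range(256)])
    return max(range(256), key=score)


rng = random.Random(1)
TOY_SBOX = list(range(256))
rng.shuffle(TOY_SBOX)  # stand-in permutation, NOT the AES S-box


def toy_leak(x):
    """Hamming weight of the toy S-box output plus Gaussian noise."""
    return bin(TOY_SBOX[x]).count("1") + rng.gauss(0.0, 1.0)


k1, k2 = 0x3C, 0xA7  # hypothetical key bytes of the two targeted positions
means_1 = mean_traces(toy_leak, k1, 20)
means_2 = mean_traces(toy_leak, k2, 20)
recovered = collision_attack(means_1, means_2)
print(hex(recovered), hex(k1 ^ k2))
```

The two mean vectors coincide up to noise when the second one is re-indexed by the correct difference, because both invocations then process the same S-box input. This is why the attack yields only the relation k1 XOR k2 rather than the key bytes themselves.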

Profile 5 is the next S-box design we evaluated. As mentioned in Section III, this design does not avoid the glitches, but it prevents their propagation to the next circuit stages. Since this design provides a pipeline with 15 stages, sequentially giving the 16 key-whitened plaintext bytes to this S-box instance requires 31 clock cycles to compute all the SubBytes transformations. These 31 clock cycles are clearly distinguishable in Fig. 6(a), which shows a sample power trace of this design. As an interesting point, compared to that of Profile 3 (Fig. 5(a)) the power consumption of Profile 5 is reduced, though it needs 15 more clock cycles to finish all SubBytes transformations.

Fig. 6. Profile 5: evaluation results (a) a sample trace (Voltage [mV] over Time [µs]), (b) attack result using 20 000 000 traces, (c) over the number of traces.

In order to perform a successful attack on this design and recover the desired secret, we needed to collect many more traces than for Profile 3, i.e., 20 000 000. This is because preventing glitch propagation limits the data-dependent leakage, which consequently is harder to detect. The same attack scheme with the same target as for Profile 3 was performed. As shown in Fig. 6(b), there is still a first-order leakage. This shows that controlling the propagation of the glitches significantly reduces the side-channel leakage, but it does not completely prevent it, as we need about 8 000 000 traces to see the desired leakage (Fig. 6(c)).

Fig. 7. Profile 6: evaluation results (a) a sample trace (Voltage [mV] over Time [µs]), (b) attack result using 50 000 000 traces, (c) attack result on 50 000 000 squared mean-free traces, (d) over the number of traces.

The last design we considered for evaluation is Profile 6, where the glitches are prevented by a sophisticated control over the LUT enable signals. The level of power consumption of this design, as shown by Fig. 7(a), is roughly the same² as that of Profile 5. In order to perform the attacks, we measured 50 000 000 traces of this design. Performing the same attack as before led to the unsuccessful result depicted in Fig. 7(b). In fact, it shows that preventing the glitches significantly helps in resisting first-order attacks. However, this design should have second-order leakage because of its underlying first-order masking scheme. In order to check this issue we performed the same attack, i.e., a correlation collision attack, but using the second-order moments. That is, as illustrated in [14], in a correlation collision attack one can employ the variance traces of the measurements instead of the averages to examine the second-order moments. It is, in fact, the same as squaring the mean-free traces and then performing a correlation collision attack [14]. We performed this preprocessing step prior to the same correlation collision attack as before, and the result is presented in Fig. 7(c). As expected, the second-order leakage is present and can be used to reveal the desired secret using around 25 000 000 measurements (see Fig. 7(d)).

² Indeed, it is slightly lower because of the glitch prevention.
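The preprocessing step of squaring the mean-free traces can be sketched on simulated data as well. The Python example below rests on assumptions that are not part of the evaluation: one leakage sample per invocation, a stand-in permutation instead of the AES S-box, and a toy leakage model (Hamming weight of the masked value plus Hamming weight of the mask) chosen only because its mean is independent of the unmasked value while its variance is not. All names are hypothetical.

```python
import random


def hw(x):
    return bin(x).count("1")


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def best_delta(feat_a, feat_b):
    """Delta maximising corr(feat_a[p], feat_b[p ^ delta])."""
    return max(range(256),
               key=lambda d: pearson(feat_a, [feat_b[p ^ d] for p in range(256)]))


def second_order_features(samples_by_value):
    """Subtract the overall mean, square, then average per plaintext value."""
    flat = [s for samples in samples_by_value for s in samples]
    overall_mean = sum(flat) / len(flat)
    return [sum((s - overall_mean) ** 2 for s in samples) / len(samples)
            for samples in samples_by_value]


rng = random.Random(7)
TOY_SBOX = list(range(256))
rng.shuffle(TOY_SBOX)  # stand-in permutation, NOT the AES S-box


def masked_leak(value):
    """Toy first-order masked leakage: HW(value ^ mask) + HW(mask) + noise."""
    mask = rng.randrange(256)
    return hw(value ^ mask) + hw(mask) + rng.gauss(0.0, 0.5)


def acquire(key, reps=200):
    """Simulated samples for every plaintext byte at one S-box position."""
    return [[masked_leak(TOY_SBOX[p ^ key]) for _ in range(reps)]
            for p in range(256)]


k1, k2 = 0x11, 0xE4  # hypothetical key bytes
samples_1, samples_2 = acquire(k1), acquire(k2)
recovered_delta = best_delta(second_order_features(samples_1),
                             second_order_features(samples_2))
print(hex(recovered_delta), hex(k1 ^ k2))
```

In this toy model the per-value means carry no information about the S-box output, so an attack on the averages has nothing to correlate, whereas the averaged squared mean-free samples estimate a variance that depends on the Hamming weight of the unmasked value.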

We should mention that we considered only univariate attacks, i.e., first-order and zero-offset second-order. Because of the pipeline architecture of our design, the leakages relevant to one S-box computation are distributed over 15 clock cycles. Therefore, one may perform a second-order attack by combining the leakages appearing at different clock cycles, i.e., a multivariate attack. This is outside the evaluation criteria we have considered in this paper. However, we believe that combining leakages of different time instances increases the noise and most likely does not provide a better result than the univariate second-order attack whose result is shown here.

V. CONCLUSIONS

In this work we have taken the highly optimized for ASICsvery compact masked S-box by Canright and Batina, andported it to use the available resources of the current XilinxFPGA Series (Virtex-5 onward) in a size-optimized manner.Compared to a design created by an automatic synthesizer thisled to the same number of LUTs and slight decrease of theoperation frequency. We could also, as already pointed outin [15], confirm the still available first-order leakage of thisS-box design when implemented in a straightforward manner.

Since this leakage was caused by glitches in the circuit, we have first eliminated the glitches by placing enable signals in each used LUT, so that no output is propagated while the inputs are not stable. By combining this solution with pipelining stages and utilizing the special way in which the clock signals are routed for the LUT enable signals, we could create an implementation which operates at an extremely high clock frequency while showing no detectable first-order leakage in 50 million power consumption measurements.

While not specifically focusing on this, we also achieved a quite high resistance against univariate second-order attacks. In this case 25 million traces is the threshold after which the secrets slowly become distinguishable using the very sophisticated attacks of [14]. We should emphasize a comparison between our results and those of a threshold implementation of AES reported in [16] and [14]. Although their implementation platform is different from ours, their scheme required roughly the same number of traces for the second-order leakage to be exploited, while the area overhead of their design (excluding all the internal PRNGs) is much higher than that of our optimized one. In order to allow further study of our design and to use it in real applications, the HDL source code of our masked S-box design is available online at http://www.emsec.rub.de/research/publications.

ACKNOWLEDGMENT

In this project O. Mischke has been part-financed by the European Union, Investing in your future, European Regional Development Fund.

REFERENCES

[1] Side-channel attack standard evaluation board (SASEBO). Further information is available via http://www.morita-tech.co.jp/SASEBO/en/index.html.

[2] M.-L. Akkar and C. Giraud. An Implementation of DES and AES, Secure against Some Attacks. In CHES 2001, volume 2162 of LNCS, pages 309–318. Springer, 2001.

[3] J. Blömer, J. Guajardo, and V. Krummel. Provably Secure Masking of AES. In SAC 2004, volume 3357 of LNCS, pages 69–83. Springer, 2004.


[4] A. Bogdanov. Multiple-Differential Side-Channel Collision Attacks on AES. In CHES 2008, volume 5154 of LNCS, pages 30–44. Springer, 2008.

[5] A. Bogdanov, G. Leander, L. Knudsen, C. Paar, A. Poschmann, M. Robshaw, Y. Seurin, and C. Vikkelsoe. PRESENT - An Ultra-Lightweight Block Cipher. In CHES 2007, volume 4727 of LNCS, pages 450–466. Springer, 2007.

[6] D. Canright. A Very Compact S-Box for AES. In CHES 2005, volume 3659 of LNCS, pages 441–455. Springer, 2005. The HDL specification is available at the author’s official webpage http://faculty.nps.edu/drcanrig/pub/index.html.

[7] D. Canright and L. Batina. A Very Compact "Perfectly Masked" S-Box for AES. In ACNS 2008, volume 5037 of LNCS, pages 446–459. Springer, 2008. The corrected version is at Cryptology ePrint Archive, Report 2009/011, http://eprint.iacr.org/.

[8] J. Daemen, M. Peeters, G. Assche, and V. Rijmen. Nessie proposal: NOEKEON. Submitted as a NESSIE Candidate Algorithm, 2000. http://www.cryptonessie.org.

[9] L. Genelle, E. Prouff, and M. Quisquater. Thwarting Higher-Order Side Channel Analysis with Additive and Multiplicative Maskings. In CHES 2011, volume 6917 of LNCS, pages 240–255. Springer, 2011.

[10] J. D. Golic and C. Tymen. Multiplicative Masking and Power Analysis of AES. In CHES 2002, volume 2523 of LNCS, pages 198–212. Springer, 2003.

[11] P. C. Kocher, J. Jaffe, and B. Jun. Differential Power Analysis. In CRYPTO 1999, volume 1666 of LNCS, pages 388–397. Springer, 1999.

[12] S. Mangard, E. Oswald, and T. Popp. Power Analysis Attacks: Revealing the Secrets of Smart Cards. Springer, 2007.

[13] S. Mangard, N. Pramstaller, and E. Oswald. Successfully Attacking Masked AES Hardware Implementations. In CHES 2005, volume 3659 of LNCS, pages 157–171. Springer, 2005.

[14] A. Moradi. Statistical Tools Flavor Side-Channel Collision Attacks. In EUROCRYPT 2012, volume 7237 of LNCS, pages 428–445. Springer, 2012.

[15] A. Moradi, O. Mischke, and T. Eisenbarth. Correlation-Enhanced Power Analysis Collision Attack. In CHES 2010, volume 6225 of LNCS, pages 125–139. Springer, 2010. The extended version is at Cryptology ePrint Archive, Report 2010/297, http://eprint.iacr.org/.

[16] A. Moradi, A. Poschmann, S. Ling, C. Paar, and H. Wang. Pushing the Limits: A Very Compact and a Threshold Implementation of AES. In EUROCRYPT 2011, volume 6632 of LNCS, pages 69–88. Springer, 2011.

[17] S. Nikova, C. Rechberger, and V. Rijmen. Threshold Implementations Against Side-Channel Attacks and Glitches. In ICICS 2006, volume 4307 of LNCS, pages 529–545. Springer, 2006.

[18] S. Nikova, V. Rijmen, and M. Schläffer. Secure Hardware Implementations of Non-Linear Functions in the Presence of Glitches. In ICISC 2008, volume 5461 of LNCS, pages 218–234. Springer, 2008.

[19] S. Nikova, V. Rijmen, and M. Schläffer. Secure Hardware Implementation of Nonlinear Functions in the Presence of Glitches. J. Cryptology, 24(2):292–321, 2011.

[20] E. Oswald, S. Mangard, N. Pramstaller, and V. Rijmen. A Side-Channel Analysis Resistant Description of the AES S-Box. In FSE 2005, volume 3557 of LNCS, pages 413–423. Springer, 2005.

[21] A. Poschmann, A. Moradi, K. Khoo, C.-W. Lim, H. Wang, and S. Ling. Side-Channel Resistant Crypto for Less than 2,300 GE. J. Cryptology, 24(2):322–345, 2011.

[22] E. Prouff and T. Roche. Higher-Order Glitches Free Implementation of the AES Using Secure Multi-party Computation Protocols. In CHES 2011, volume 6917 of LNCS, pages 63–78. Springer, 2011.

[23] A. Shamir. How to Share a Secret. Commun. ACM, 22(11):612–613, 1979.

[24] Xilinx. Constraints Guide. Available via http://www.xilinx.com/itp/xilinx10/books/docs/cgd/cgd.pdf, 2008.

[25] Xilinx. Virtex-5 Libraries Guide for HDL Designs. Available via http://www.xilinx.com/support/documentation/sw_manuals/xilinx11/virtex5_hdl.pdf, September 2009.
