TIMA Lab. Research Reportstima.univ-grenoble-alpes.fr/publications/files/rr/adc_234.pdf · A...

ISSN 1292-862

TIMA Lab. Research Reports

CNRS INPG UJF

TIMA Laboratory, 46 avenue Félix Viallet, 38000 Grenoble France

An Asynchronous DES Crypto-Processor Secured against Fault Attacks

Y. Monnet, M. Renaudin, R. Leveugle, S. Dumont, F. Bouesse TIMA Laboratory

46, avenue Felix Viallet 38031 GRENOBLE cedex-FRANCE

[email protected]

Abstract

This paper presents an asynchronous DES crypto-processor hardened against fault attacks. A fault attack consists in causing an intentional temporary dysfunction of a circuit by injecting faults in its combinational or sequential parts. This failure enables hackers to access protected memory areas or secret information such as cryptographic keys. An analysis of the behavior of VLSI Quasi Delay Insensitive (QDI) asynchronous circuits in the presence of faults shows that they are attractive for designing robust systems. In this paper, an asynchronous reference DES architecture is described. Then hardening techniques are proposed and applied at design time to significantly harden the DES architecture with a very low area overhead and a reasonable performance penalty.

1. Introduction

1.1. Overview

The security of systems such as smart cards relies on the ability of the smart card to perform cryptographic operations while keeping the key secret. A particular threat is the use of fault injection to attack such devices. A fault can be injected, for instance, by power supply noise, radiation from lightning, or laser emission. The resulting incorrect behaviour of the circuit enables hackers to bypass some crucial verification or to find some stored secret information.

Most of the integrated circuits today are synchronous. Their activities are controlled by a global clock which triggers at the same time the memorization of the complete state of the circuit. The behaviour and the sensitivity of synchronous circuits exposed to fault injection have widely been studied [1] [2] [3].

Asynchronous circuits represent a class of circuits which are not controlled by a global clock but by the data themselves. Because of their specific architecture, asynchronous circuits behave very differently from synchronous circuits in the presence of faults. QDI circuits are asynchronous circuits that operate correctly regardless of gate delays in the system. Their delay-insensitive property makes them naturally robust against some categories of faults such as delay faults. Thus, QDI circuits are attractive for designing fault-tolerant systems.

This paper is organised as follows. Section 2 introduces the QDI asynchronous technology. Section 3 presents the reference DES architecture. Then, fault models and hardening techniques are presented in Section 4. These techniques are applied to design a secure DES architecture presented in Section 5. Reference and secure circuits are evaluated in terms of area and speed after fabrication and compared in Section 6. Then Section 7 concludes the paper.

1.2. Related Work

A duplication-based method was proposed in [4] to detect errors in micro-pipeline asynchronous circuits. However, since this class of circuits needs timing assumptions, it is not naturally robust against fault injection and the behaviour is similar to synchronous circuits. In [5], self-checking properties of m-out-of-n encoded circuits are exploited to detect incorrect codes by means of alarms. Fault detection and isolation techniques for QDI circuits were proposed in [6] both at the layout level and at the circuit level, for a large class of faults. A hardening technique was proposed in [7] to design SEU tolerant QDI circuits using the Handshaking Expansion (HSE) language. It consists in doubling every node of the circuit, which is very costly in terms of area and speed.

2. Asynchronous logic: Quasi Delay Insensitive Circuits

2.1. Overview

An asynchronous circuit is composed of individual modules which communicate with each other by means of point-to-point communication channels [8]. A given module becomes active when it senses the presence of incoming data. It then computes and sends the result to the output channels. Communications through channels are governed by a protocol which requires bi-directional signalling between senders and receivers (request and acknowledgement). Such protocols are called handshaking protocols.

The communication protocol is the basis of the sequencing rules in asynchronous circuits. There are two main classes of handshaking protocols: two-phase and four-phase. Only the four-phase protocol is considered in this work. Figure 1 describes the four-phase protocol, which requires a return to zero phase for both data requests and acknowledgments. In phase 1, a valid data is detected. This data is acknowledged in phase 2. Then the data is re-initialized in phase 3 (return to zero phase) and the acknowledgment signal is reset in phase 4.

Figure 1. Four-phase handshaking protocol
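The four phases can be traced with a small behavioural model. The following is a minimal sketch, not the circuit itself: the generator name `four_phase_cycle` and the boolean signals `data_valid` and `acknowledged` are assumptions introduced for illustration, and signal polarity is abstracted away.

```python
def four_phase_cycle():
    """Yield (phase, data_valid, acknowledged) for one four-phase transfer."""
    data_valid = acknowledged = False
    data_valid = True        # phase 1: valid data is detected
    yield 1, data_valid, acknowledged
    acknowledged = True      # phase 2: the data is acknowledged
    yield 2, data_valid, acknowledged
    data_valid = False       # phase 3: the data returns to zero (invalid)
    yield 3, data_valid, acknowledged
    acknowledged = False     # phase 4: the acknowledgment is reset
    yield 4, data_valid, acknowledged


for phase, data_valid, acknowledged in four_phase_cycle():
    print(phase, data_valid, acknowledged)
```

After phase 4 both signals are back in their initial state, so the next transfer can start.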

Considering that one bit has to be transferred through a channel using the four-phase protocol, three different values have to be encoded for this bit: invalid, valid at '1' and valid at '0'. A 1-of-2 (dual rail) encoding is required to encode these three states (Table 1).

Table 1. Dual rail encoding of the three states required to communicate 1 bit

Channel data   A1   A0
0              0    1
1              1    0
Invalid        0    0
Unused         1    1
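Table 1 can be expressed as a pair of lookup functions. This is a small illustrative sketch; the function names `encode` and `decode` are hypothetical, and rails are written as (A1, A0) pairs.

```python
def encode(bit):
    """Map a bit to its (A1, A0) dual-rail code from Table 1."""
    return (1, 0) if bit else (0, 1)


def decode(a1, a0):
    """Classify an (A1, A0) pair as 0, 1, 'invalid' or 'unused'."""
    table = {(0, 1): 0, (1, 0): 1, (0, 0): "invalid", (1, 1): "unused"}
    return table[(a1, a0)]


for rails in [(0, 1), (1, 0), (0, 0), (1, 1)]:
    print(rails, decode(*rails))
```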

2.2. Asynchronous stages

Figure 2 shows the general structure of an asynchronous stage. As in synchronous circuits, it is composed of a computational logic block and a memory block (registers). The computational block processes the data inputs. The memory block not only stands for registers but also implements the four-phase communication protocol. The next sub-section presents the structure of these two blocks.

Figure 2. Basic structure of an asynchronous stage

2.3. Structure of one stage

The structure of one asynchronous stage used in this work is illustrated in Figure 3. The circuit is composed of a computational part, which in this example implements a dual-rail XOR function between the input channels A(A0,A1) and B(B0,B1), and a four-phase half-buffer which generates the output channel S(S0,S1). The C-element (or Muller gate) is the memory cell used in asynchronous QDI circuits. It generates a rising transition when rising transitions occur at all the inputs and generates a falling transition when falling transitions occur at all the inputs [8].
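The behaviour of the C-element can be summarised by a short state-holding model. This is a behavioural sketch only; the class name `CElement` and its `update` method are hypothetical.

```python
class CElement:
    """Behavioural model of a Muller gate (C-element)."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, *inputs):
        if all(inputs):
            self.out = 1      # all inputs high: output rises
        elif not any(inputs):
            self.out = 0      # all inputs low: output falls
        # otherwise the inputs disagree and the previous output is held
        return self.out


c = CElement()
print(c.update(1, 0))  # inputs disagree, output holds its initial 0
print(c.update(1, 1))  # both inputs high, output rises
print(c.update(0, 1))  # inputs disagree, output holds 1
print(c.update(0, 0))  # both inputs low, output falls
```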

The definition of the computational and memory parts does not refer to a well defined statement. However, it is necessary to establish this distinction in order to define a relationship between these parts and the fault models used in this work.

Figure 3. Dual-rail XOR gate with a four-phase dual-rail half-buffer

2.3.1. Computational part

The computational part implements the logical function, using Muller gates and standard combinational gates (AND, OR, NOR …). The Muller gates noted M00 to M03 in Figure 3 are used as a logical “AND” operator to compute and synchronize incoming data events. Their state-holding nature is necessary to ensure the QDI properties of the circuits. This property would be lost by the use of standard “AND” gates.
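The role of the Muller gates in the computational part can be illustrated with a behavioural model of a dual-rail XOR. This is a sketch under assumptions: the assignment of rail pairs to the four gates follows the usual dual-rail XOR minterms and is not taken from the schematic, and the names `c_gate` and `DualRailXor` are hypothetical. Channels are written as (rail1, rail0) pairs.

```python
def c_gate(prev, a, b):
    """Two-input Muller gate: follow the inputs when they agree, else hold."""
    return a if a == b else prev


class DualRailXor:
    """Four Muller gates (one per minterm) followed by two OR gates."""

    def __init__(self):
        self.m = [0, 0, 0, 0]  # gate states; mapping to M00..M03 is assumed

    def step(self, a, b):
        a1, a0 = a
        b1, b0 = b
        # The first two minterms give a result of 1, the last two a result of 0.
        pairs = [(a0, b1), (a1, b0), (a1, b1), (a0, b0)]
        self.m = [c_gate(m, x, y) for m, (x, y) in zip(self.m, pairs)]
        o1 = self.m[0] | self.m[1]
        o0 = self.m[2] | self.m[3]
        return (o1, o0)


xor = DualRailXor()
print(xor.step((1, 0), (0, 1)))  # A = 1, B = 0
print(xor.step((0, 0), (0, 1)))  # A returns to zero first: output is held
print(xor.step((0, 0), (0, 0)))  # both inputs invalid: output returns to zero
```

The held output in the second step is what a plain AND gate would not provide: the output only returns to zero once both input channels have done so.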

2.3.2. Memory part

In the memory block, Muller gates are used to implement the communication protocol between two consecutive asynchronous stages.

The global circuit state is defined as the state of all its Muller gates implemented in memory blocks. These gates hold data information at the behavioral level. Because they need an initial state, all Muller gates of memory parts are Muller gates with Set (MS) or Muller gates with Reset (MR).

3. Reference DES architecture

3.1. Global architecture

The asynchronous DES crypto-processor is implemented using the technique described above: four-phase protocol and 1-of-N encoded data. Its architecture is described in Figure 4. It is basically an iterative structure, based on three self-timed loops synchronized through communicating channels. Channel Sub-Key synchronizes the ciphering data-path. CTRL is a set of channels generated by the Controller block (a finite state machine) which controls the data-path along sixteen iterations as specified by the DES algorithm [9].

The 1-bit input channel CRYPT/DECRYPT is used by the Controller to configure the chip and trigger the ciphering. The 64-bit channels DATA and KEY are used to respectively enter the plain text and the key. The ciphered text is output through the 64-bit channel OUTPUT.

[Figure 4 block diagram: DATA (64 bits), KEY (64 bits) and CRYPT/DECRYPT (1 bit) inputs; IP, Ciphering Data-path and IP-1 blocks; PC1, Sub-Key Data-path and PC2 blocks; Controller; Sub-Key and CTRL channels; OUTPUT (64 bits)]

Figure 4. Asynchronous DES architecture

3.2. Modules architecture

3.2.1. Ciphering module

Figure 5(a) presents the architecture of the ciphering module. Each box represents an asynchronous stage as described above (a computational part and a memory part), except the EXPANSION and SBOX modules, which are only composed of a computational part. Arrows represent a handshaking communication (request and acknowledgment) between stages.

MUX blocks are used to control the flow of inputs and outputs during the sixteen rounds. Half-Buffers (HB) are added to improve the circuit performance by increasing the number of stages in the loop [12]. Thus, the ciphering module is based on two iterative structures: the first one is composed of three stages (MUX_R, XOR48 and XOR32); the second one is composed of six stages (MUX_L, HB, XOR32, MUX_R, HB, HB).

3.2.2. Sub-Key module

Figure 5(b) shows the architecture of the Sub-key module, which is in charge of generating sixteen 48-bit sub-keys from the initial 56-bit key.

As in the ciphering module, MUX and DMUX structures are used to control data flow in the loop, and two half buffers are added to improve circuit performance. SHIFTREG is a rotation table used to generate sub-keys. Thus, the sub-key module is composed of five stages.
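As an illustration of the rotation performed in the sub-key loop, the sketch below applies the standard DES left-rotation schedule (encryption direction, as defined in the DES specification [9]) to one 28-bit half of the key. It assumes that SHIFTREG follows this standard schedule; the names `SHIFTS`, `rotl28` and `half_key_states` are hypothetical, and the PC1/PC2 permutations are omitted.

```python
# Number of left rotations applied in each of the sixteen rounds (DES key schedule).
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]
MASK28 = (1 << 28) - 1


def rotl28(x, n):
    """Rotate a 28-bit value left by n positions."""
    return ((x << n) | (x >> (28 - n))) & MASK28


def half_key_states(c0):
    """Return the sixteen successive values of one 28-bit key half."""
    states = []
    c = c0
    for shift in SHIFTS:
        c = rotl28(c, shift)
        states.append(c)
    return states


states = half_key_states(0x1234567)
print(len(states), hex(states[-1]))
```

The rotations sum to 28, so each half returns to its initial value after the sixteenth round.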

Figure 5. (a) Ciphering module. (b) Sub-key module of the reference DES circuit.

3.2.3. Controller module

The controller module (not shown in the figure) is a finite state machine that controls a set of channels to send commands to the ciphering and the sub-key modules. It is composed of three stages. A 1-of-16 code is used to encode the round counter.

4. Hardening techniques

4.1. Fault models

Many fault models based on different abstraction levels (transistor-level, gate-level, macro-cells ...) have been proposed in the test domain. In the present paper, the fault effect is considered as a logical perturbation in the circuit. Therefore, whatever the physical effects causing faults are, we assume that the fault eventually becomes one or more logical errors. However, it is important to clearly define these models when applied to asynchronous circuits.

4.1.1. Transient faults

A transient fault is in most cases a current transient, for instance induced by the hit of a particle, which can propagate in the circuit, thereby corresponding at a logical level to a signal toggle with a short duration.

The structures defined in the previous section are similar to the structure of a synchronous circuit (combinational logic blocks and registers). Thus we are able to define a transient fault as a fault injected in the computational part of the asynchronous circuit. More details about transient fault injection in the computational part can be found in [10]. This fault can propagate through the logic to the memory part inputs. If the transient is captured by a memory cell, it may lead to a soft error, as shown in Figure 6(a).

4.1.2. Soft errors

A soft error is an abnormal modification of the global state of the circuit, without destructive effect. It can be either the memorization of a transient fault which has propagated to a Muller gate in a memory block, or a fault injected directly into this Muller gate (Figure 6(b)). The latter can be compared to a Single Event Upset (SEU) in synchronous circuits (extended to the Multiple Bit Upset, MBU, model). In any case, a soft error results in one or more memory bit-flips.

4.1.3. Delay faults

A gate delay fault modifies the time needed for a transition to occur at the gate’s output. This fault can be considered as a temporary “stuck-at” fault for as long as the fault is active. Delay faults do not affect the logical function of the circuit, except if they occur on an isochronic branch, since it is the only timing assumption in QDI circuits. Since QDI circuits are highly tolerant to this class of faults, this work is focused on hardening the DES circuit against transient faults and soft errors. A counter-measure for delay faults on isochronic branches is described in [6].

Figure 6. (a) A transient fault injection in the computational part. (b) A soft error injected in the memory part.

4.2. Hardening technique against transient faults: rail synchronization

Figure 7 presents a 2-bit data C whose channels C(0) and C(1) are synchronized. This synchronization ensures that a channel cannot memorize a data without the presence of a data in the other channel. When both data are ready they are memorized and both acknowledgment signals are generated.

Figure 7. A 2-bit data channel hardened with the synchronization method

The synchronization is implemented at the memory block level. Figure 8 shows a 1-bit dual-rail XOR representing C(0). This block is synchronized with C(1), not shown in the figure, by means of signals D2(0), D2(1), Ack(1) and Ack(0). Assume that a fault is injected somewhere in C(0) and propagates to O0 while correct data propagate through C(1). The data generate a rising transition on D2(1), and MR00 fires. The fault injected in C(0) has actually generated valid data. This leads to an incorrect global state: MR00 fired because of the faulty transition, and MR01 fired because of the correct data. This invalid code “11” can easily be detected by means of an alarm that monitors the output (S0, S1) [5].

Figure 8. A synchronized 1-bit dual-rail XOR gate

In conclusion, this technique improves the circuit tolerance by filtering transient faults. When the fault cannot be filtered, a wrong data code is generated. This error can easily be detected in most cases by adding an alarm cell. This hardening technique leads to an acceptable area overhead because functional redundancy is used. Only a few gates are added to implement the synchronization function. The circuit speed is slightly affected, due to the propagation time through the OR gate and because the transition time of the hardened Muller gate is greater than that of the original gate.
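The filtering effect of rail synchronization can be sketched with a simplified behavioural model. This is not the gate-level circuit of Figure 8: acknowledgment signals are ignored, the neighbouring channel is reduced to a single "data present" signal (standing in for D2(1)), and the names `c_gate` and `synchronized_latch` are hypothetical. Rails and stored states are written as (rail1, rail0) pairs.

```python
def c_gate(prev, *inputs):
    """Muller gate: rise when all inputs are 1, fall when all are 0, else hold."""
    if all(inputs):
        return 1
    if not any(inputs):
        return 0
    return prev


def synchronized_latch(state, rails, neighbour_has_data):
    """Memorize each rail only if the neighbouring channel also holds data."""
    return tuple(c_gate(s, r, neighbour_has_data) for s, r in zip(state, rails))


empty = (0, 0)
# A glitch on rail 0 while the neighbouring channel is empty is filtered.
print(synchronized_latch(empty, (0, 1), 0))
# Correct data on rail 1 plus a glitch on rail 0 while the neighbour holds data:
# both memory gates fire and the illegal code "11" is stored.
print(synchronized_latch(empty, (1, 1), 1))
```

In the second case the fault is not filtered, but the stored code no longer belongs to the dual-rail alphabet, so an alarm monitoring the output can flag it.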

4.3. Hardening strategy against soft errors

4.3.1. Token game

At a high level of abstraction, an asynchronous circuit architecture is modeled using the so-called “token game” [11]. A token carries information and is stored in a memory element. When using a four-phase protocol, the asynchronous data-path processes a stream of alternating valid (denoted V) and invalid (return to zero, denoted I) tokens. The following alphabet is used: Alphabet = {V, I}.

Data flow is controlled by two rules:

- Token rule: a memory may receive and store a new token (valid or invalid) from its predecessor if and only if it has a bubble.

- Bubble rule: a memory becomes empty (bubble) if and only if its successor has received and stored the token that it was holding.

Let us consider the following notations:

- N: the number of stages in the loop

- K: the number of data in the loop (a data is composed of a valid token plus an invalid token)


It has been shown in [12] that if N < 2K + 1 then the loop deadlocks.

For instance, the first loop of the ciphering module is composed of three stages and contains one data (K=1 and N=3). The second loop is composed of six stages and two data (K=2 and N=6). This architecture ensures the liveness property of the structure.
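The deadlock condition from [12] is simple enough to check mechanically. The sketch below applies it to the two ciphering loops; the function names `loop_deadlocks` and `min_stages` are hypothetical.

```python
def loop_deadlocks(n_stages, k_data):
    """A loop of N stages holding K data deadlocks if N < 2K + 1."""
    return n_stages < 2 * k_data + 1


def min_stages(k_data):
    """Smallest number of stages that keeps a loop with K data alive."""
    return 2 * k_data + 1


for name, n, k in [("first ciphering loop", 3, 1), ("second ciphering loop", 6, 2)]:
    print(name, "deadlocks" if loop_deadlocks(n, k) else "is live")
```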

4.3.2. Soft error and tokens

A soft error is an abnormal modification of a token. Two possible failures are considered: token corruption and token modification.

- Token corruption means that a token belonging to the alphabet is turned into a token that no longer belongs to this alphabet which means that an illegal token was generated. By using the dual-rail logic, the illegal token “11” can easily be detected by means of an alarm. This alarm signal is to be used by the environment to provoke a deadlock or reset the circuit, or to inform the outside environment, depending on the security policy adopted.

- Token modification means that a token is turned into a token that still belongs to the alphabet. The global state of the circuit is changed, but it is still a reachable correct state. Therefore this token may not be detected. Basically, a token modification consists in generating a valid token (I → V), vanishing a valid token (V → I), or substituting a token (V → V’).
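The two failure classes can be made concrete for a dual-rail token. The sketch below is illustrative: the names `classify` and `flipped_wires` are hypothetical, tokens are written as (A1, A0) pairs, and the token before the fault is assumed to be legal.

```python
LEGAL = {(0, 0), (0, 1), (1, 0)}  # I and the two V codes
INVALID = (0, 0)


def classify(before, after):
    """Classify the effect of a fault on a stored dual-rail token."""
    if after == before:
        return "unchanged"
    if after not in LEGAL:
        return "corruption"      # illegal code "11", detectable by an alarm
    if before == INVALID:
        return "generation"      # I -> V
    if after == INVALID:
        return "vanishing"       # V -> I
    return "substitution"        # V -> V'


def flipped_wires(before, after):
    """Number of wires that differ between the two codes."""
    return sum(b != a for b, a in zip(before, after))


print(classify((0, 1), (1, 1)), flipped_wires((0, 1), (1, 1)))
print(classify((0, 1), (1, 0)), flipped_wires((0, 1), (1, 0)))
```

A substitution changes both wires of the dual-rail code, whereas corruption, generation and vanishing each change a single wire.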

4.3.3. Hardening strategy

The hardening strategy consists in tuning the architecture so that a token modification leads to a deadlock as often as possible. The next section shows how to adapt the DES architecture for this purpose. Moreover, alarms are used to detect token corruption. This strategy does not improve the circuit tolerance. However, it prevents an attacker from exploiting an erroneous result, thus improving the circuit robustness to fault attacks.

4.3.4. Alarms implementation

A corrupted token can be detected by means of an alarm cell. Since a 1-of-N encoding is used, the alarm cell generates a rising transition when more than one wire is set to one. This function is achieved by an asymmetric Muller gate. Only a ‘reset’ signal is able to generate a falling transition on this gate, which means that the alarm signal remains high even if the faulty token later returns to a valid 1-of-N code. The alarm output signal is stored in an outside register. Thus, the environment is able to read the alarm status.
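The latching behaviour of the alarm can be sketched as follows. This is a behavioural model only, not the asymmetric Muller gate itself; the class name `AlarmCell` and its methods are hypothetical.

```python
class AlarmCell:
    """Sticky alarm for a 1-of-N channel: set by an illegal code, cleared by reset."""

    def __init__(self):
        self.alarm = 0

    def observe(self, wires):
        if sum(wires) > 1:   # more than one wire set: illegal 1-of-N code
            self.alarm = 1
        return self.alarm    # a later legal code does not clear the alarm

    def reset(self):
        self.alarm = 0


cell = AlarmCell()
print(cell.observe((0, 1)))  # legal code, alarm stays low
print(cell.observe((1, 1)))  # illegal code, alarm rises
print(cell.observe((0, 1)))  # legal again, alarm is still high
cell.reset()
print(cell.alarm)
```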

5. Secure DES Architecture

Figure 9 shows the hardened architecture. Computational blocks are protected against transient faults with the rail synchronization technique (denoted with a filled rectangle). Computational blocks are the EXPANSION module, the SBOXES, and the computational parts of the XOR48 and XOR32 blocks. Rails are synchronized as shown in Figure 8. The output of the IP_1 block is also protected against transient faults in order to harden the last round. Moreover, an alarm is implemented with each rail synchronization.

A set of alarms (denoted with a filled circle) is implemented to check the validity of each control channel generated by the Controller module. Each of these alarms is stored in an alarm register bit. The sixteen wires of the controller round counter are also protected with an alarm (not shown in the figure).

Figure 9. Secure DES circuit architecture

The sub-key module loop is now composed of four stages and contains exactly one data. Indeed, at least three stages are needed in the loop (K=1 => N≥3), and no more than four stages may be used. As a result, a valid token generation (I → V) eventually leads the circuit to deadlock (K=2 and N=4). A valid token vanishing (V → I) freezes the loop: outside blocks cannot synchronize with this loop anymore because its valid token has vanished, which also leads the circuit to deadlock. If five stages were used, a second (faulty) token could be inserted without deadlocking the loop, so the fault would go undetected.

We are aware that there are some weaknesses left: a valid token modification (V → V’) cannot be detected. However, a single fault is not enough to turn a valid token into another valid token. Therefore, the attacker needs to modify several memorized data simultaneously.

Because of its iterative structure, the DES architecture is naturally robust against token generation/vanishing. Only one Half Buffer stage of the sub-key loop of the reference DES architecture had to be removed to harden the whole architecture with respect to the token rules.
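The choice of stage counts can be checked against the deadlock condition N < 2K + 1 of Section 4.3.1. The sketch below lists, for a loop holding K data, the stage counts for which the loop is live but deadlocks as soon as one extra valid token is generated. It only covers token generation (vanishing is handled by the loss of synchronization with outside blocks), and the names `loop_deadlocks` and `hardened_stage_counts` are hypothetical.

```python
def loop_deadlocks(n_stages, k_data):
    """A loop of N stages holding K data deadlocks if N < 2K + 1."""
    return n_stages < 2 * k_data + 1


def hardened_stage_counts(k_data, max_stages=16):
    """Stage counts that are live with K data and deadlock with K + 1 data."""
    return [
        n
        for n in range(1, max_stages + 1)
        if not loop_deadlocks(n, k_data) and loop_deadlocks(n, k_data + 1)
    ]


print(hardened_stage_counts(1))  # sub-key loop and first ciphering loop (K=1)
print(hardened_stage_counts(2))  # second ciphering loop (K=2)
```

With K=1 the admissible stage counts are 3 and 4, which is why the five-stage sub-key loop of the reference design had to lose one stage, while the ciphering loops (N=3 with K=1, N=6 with K=2) already fall inside the admissible ranges.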

6. Results

The technology used is the 0.13 µm CMOS process from ST Microelectronics. The layout and floorplan of the secure circuit are shown in Figure 10 and the main characteristics of the secure and reference circuits are reported in Table 2. A few dedicated asynchronous cells were used in both circuits [13]. However, more specific hardened cells could reduce the overhead due to the rail synchronization and alarm cells in the secure circuit.

The placement and routing of each block of the circuit were constrained in order to provide more effective attack localization for the future validation experiments. Both reference and secure circuits were fabricated and verified to be functional.

Figure 10. Secure circuit layout (a) and floorplan (b).

Asynchronous logic is known to be expensive in terms of area. The reference version presented in this work is five times larger than the unhardened synchronous version. However, different ways of reducing this overhead to a factor of two are currently being studied [13].

The secure version of the DES is only 7.7% larger than the reference one and 18% slower. The rail synchronization technique slightly increases the area as well as the transition time of signals. The hardening technique against soft errors may reduce the area overhead in some cases because some stages are removed, but the circuit performance consequently decreases [12].

Table 2. Circuit characteristics

Circuit     Area / gate count          Computation time, simple DES (1.2 V)
Reference   0.156 mm² / 8.4 Kgates     165 ns
Secure      0.168 mm² / 9.4 Kgates     201 ns
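The overhead figures quoted above can be recomputed from Table 2. In this sketch the variable names are hypothetical, and reading the 18% figure as a loss of speed (operations per unit time) rather than an increase in computation time is an assumption.

```python
ref_area_mm2, sec_area_mm2 = 0.156, 0.168
ref_time_ns, sec_time_ns = 165.0, 201.0

# Relative area increase of the secure circuit.
area_overhead_pct = (sec_area_mm2 - ref_area_mm2) / ref_area_mm2 * 100
# Relative increase of the computation time.
time_increase_pct = (sec_time_ns - ref_time_ns) / ref_time_ns * 100
# Relative loss of speed (speed is proportional to 1 / computation time).
speed_loss_pct = (1 - ref_time_ns / sec_time_ns) * 100

print(f"area overhead: {area_overhead_pct:.1f}%")
print(f"computation time increase: {time_increase_pct:.1f}%")
print(f"speed loss: {speed_loss_pct:.1f}%")
```

Under that reading, the area overhead rounds to 7.7% and the speed loss to 18%, while the computation time itself grows by about 22%.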

7. Conclusion

This paper presented a reference asynchronous DES processor architecture. Techniques were proposed to harden this architecture against transient faults and soft errors with a very low area cost and a reasonable performance penalty, exploiting QDI asynchronous circuit properties. The first technique improves circuit tolerance to transient fault injection. The second technique consists in using the deadlock properties of asynchronous circuits when token rules are violated. These techniques were applied to the design of a secure asynchronous DES processor. Both circuits have been fabricated and they are currently being attacked to evaluate the resistance of the reference circuit to fault injection, and the relative resistance of the hardened version with respect to the reference.

8. References

[1] R. Leveugle, K. Hadjiat, “Multi-level fault injections in VHDL descriptions: alternative approaches and experiments”, Journal of Electronic Testing: Theory and Applications (JETTA), Kluwer, vol. 19, no. 5, October 2003, pp. 559-575.

[2] D. Alexandrescu, L. Anghel, M. Nicolaidis, “New methods for evaluating the impact of single event transients in VDSM ICs”, The IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, Vancouver, Canada, November 6-8, 2002, IEEE Computer Society Press, Los Alamitos, California, 2002, pp. 99-107.

[3] M. Sonza-Reorda, M. Violante, “Accurate and efficient analysis of single event transients in VLSI circuits”, 9th IEEE International On-Line Testing Symposium, Kos, Greece, July 7-9, 2003, pp. 101-105.

[4] T. Verdel, Y. Makris, “Duplication-Based Concurrent Error Detection in Asynchronous Circuits: Shortcomings and Remedies”, 17th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, 2002, pp. 345-353.

[5] S. Moore, R. Anderson, R. Mullins, G. Taylor, J. J. A. Fournier, “Balanced self-checking asynchronous logic for smart card applications”, Microprocessors and Microsystems, Elsevier Science Publishers, vol. 27, 2003, pp. 421-430.

[6] C. LaFrieda, R. Manohar, “Fault Detection and Isolation Techniques for Quasi Delay-Insensitive Circuits”, International Conference on Dependable Systems and Networks (DSN'04), Florence, Italy, June 28 - July 01, 2004, p. 41.

[7] Wonjin Jang, Alain J. Martin, “SEU-Tolerant QDI Circuits”, 11th IEEE International Symposium on Asynchronous Circuits and Systems, New York City, USA, March 13-16, 2005, pp. 156-165.

[8] M. Renaudin, “Asynchronous Circuits and Systems: a promising design alternative”, Microelectronics-Engineering Journal, Elsevier Science, Guest Editors: P. Senn, M. Renaudin, J. Boussey, Vol. 54, N°1-2, December 2000, pp. 133-149.

[9] NIST, Data Encryption Standard (DES), FIPS PUB 46-2, National Institute of Standards and Technology, http://csrc.nist.gov/csrc/fedstandards.html

[10] Y. Monnet, M. Renaudin, R. Leveugle, “Asynchronous circuits transient faults sensitivity evaluation”, 42nd Design Automation Conference, Anaheim, USA, June 13-17, 2005.

[11] M. Renaudin, J. Fragoso, “Asynchronous Circuits Design: An Architectural Approach”, chapter in “V Escola de Microeletrônica da SBC-Sul”, Edited by José Gunzel & Ricardo Reis, Rio Grande, Sept. 17-20, 2003.

[12] T.E. Williams, “Performance of iterative computation in self timed rings”, Journal of VLSI Signal Processing, N°7, pp. 17-31, Feb. 1994.

[13] A. Razafindraibe, Ph. Maurine, M. Robert, F. Bouesse, B. Folco, M. Renaudin, “Secured structures for secured asynchronous QDI circuits”, XIX Conference on Design of Circuits and Integrated Systems, Bordeaux, France, November 24-26, 2004.
