
Asbjørn Djupdal

Evolving Static Hardware Redundancy for Defect Tolerant FPGAs

Doctoral thesis for the degree of philosophiae doctor

Trondheim, April 2008

Norwegian University of Science and Technology
Faculty of Information Technology, Mathematics and Electrical Engineering
Department of Computer and Information Science

NTNU Norwegian University of Science and Technology

© Asbjørn Djupdal

ISBN 978-82-471-6874-5 (printed version)
ISBN 978-82-471-6888-2 (electronic version)
ISSN 1503-8181

Doctoral theses at NTNU, 2008:48

Printed by NTNU-trykk

Typeset with LaTeX 2ε in Computer Modern 10pt

Abstract

Integrated circuits have been in constant progression since the first prototype in 1958. The semiconductor industry has maintained a constant rate of miniaturisation of transistors and wires, resulting in ever increasing speed, size and complexity of circuits. One challenge that has always been present is reduced yield due to production defects. A certain number of chips must be scrapped because production defects have rendered the chips unusable. Recent predictions suggest that the average number of production defects per chip will rise drastically in the future as CMOS scaling approaches the physical limits of what is possible to manufacture. If these predictions are true, circuits should exhibit some level of tolerance to defects so as to keep yield at acceptable levels.

The main contribution of the thesis is to the field of defect tolerance, with a focus on FPGAs. Apart from the widespread employment of FPGAs, two technical reasons make the FPGA especially suited for inclusion of defect tolerance techniques. The regular structure of the FPGA can be exploited for efficient redundancy techniques. In addition, the FPGA can be seen as a bridge between production and the application designer. Through defect tolerance techniques incorporated transparently in the FPGA, a fully functioning gate array can be provided to the application designer despite defects from production.

The approach taken in this thesis is to search for new ways of introducing static hardware redundancy in a circuit through the application of artificial evolution. However, the challenge of applying evolutionary techniques provided a secondary contribution. The work provides a contribution to the field of artificial evolution and the subfield evolvable hardware (EHW) by addressing ways in which such techniques may be applied to search for non-specifiable structures. The work also bridges the fields of EHW and traditional hardware design, and reliability metrics have been investigated for the purpose of comparing evolved and traditionally designed circuits.

Redundant structures are first evolved for gate level circuits where both voter based solutions and more intricate non-voter based solutions are achieved. Transistor level redundancy structures are targeted next to approach the main goal of defect tolerance for FPGAs. A defect tolerant inverter is evolved which forms the basis of a general defect tolerance technique, termed the Multiple Short-Open (MSO) technique. The FPGA look-up table (LUT) is one of the essential components of the FPGA and a defect tolerant LUT is, therefore, constructed applying the MSO technique. An evolutionary experiment is also conducted where a defect tolerant 1-input LUT is evolved directly.

Preface

This thesis was submitted to the Norwegian University of Science and Technology (NTNU) in partial fulfilment of the requirements for the degree of philosophiae doctor (PhD).

The work presented herein was conducted at the Department of Computer and Information Science, NTNU, under the supervision of Associate Professor Pauline C. Haddow. The work was funded by the Faculty of Information Technology, Mathematics and Electrical Engineering, NTNU.

Acknowledgements

First of all, I would like to thank my supervisor Assoc. Prof. Pauline C. Haddow. Without her support and advice, this work would not have been possible.

I would like to thank the members of my PhD committee, Assoc. Prof. Snorre Aunet and Prof. Kjetil Nørvag, for helpful input at my evaluation meetings. I would also like to thank Snorre Aunet for cooperation on some of the early papers and many discussions.

I would like to thank Assoc. Prof. Morten Hartmann for valuable research discussions and refreshing tea breaks, Assoc. Prof. Gunnar Tufte for always sharing his many ideas, and Prof. Lasse Natvig for his support. I would also like to thank all the other members of the Computer Architecture and Design Group at NTNU for providing a friendly working environment.

Finally, I would like to thank my family for moral support and optimism.

Asbjørn Djupdal
February 1, 2008

Contents

Abstract

Preface

List of Figures

Abbreviations

1 Introduction
  1.1 Introduction
  1.2 Research Questions
  1.3 Thesis Outline

2 Background
  2.1 Field Programmable Gate Arrays
  2.2 Chip Production
    2.2.1 Photolithography
    2.2.2 Production Defects
    2.2.3 Yield
  2.3 Reliability
  2.4 Defect Tolerance
    2.4.1 Triple Modular Redundancy
    2.4.2 Interwoven Logic
    2.4.3 Transistor Level Redundancy
  2.5 Evolutionary Algorithms

3 Research Summary
  3.1 Research Process
    3.1.1 Background
    3.1.2 Initial Investigations
    3.1.3 Evolving Redundancy
    3.1.4 Gate Level Redundancy
    3.1.5 Transistor Level Redundancy
    3.1.6 Transistor Level Redundancy for FPGAs
  3.2 List of Publications
  3.3 Paper Abstracts
    3.3.1 Paper I
    3.3.2 Paper II
    3.3.3 Paper III
    3.3.4 Paper IV
    3.3.5 Paper V
    3.3.6 Paper VI
    3.3.7 Paper VII
    3.3.8 Paper VIII

4 Concluding Remarks
  4.1 Conclusion
  4.2 Research Contributions
  4.3 Limitations
  4.4 Future Work

Bibliography

Papers
  I Yield Enhancing Defect Tolerance Techniques for FPGAs
  II Addressing the Metric Challenge: Evolved versus Traditional Fault Tolerant Circuits
  III Evolving Redundant Structures for Reliable Circuits - Lessons Learned
  IV Evolving and Analysing “Useful” Redundant Logic
  V Defect Tolerant Ganged CMOS Minority Gate
  VI Evolving Efficient Redundancy by Exploiting the Analogue Nature of CMOS Transistors
  VII Defect Tolerance Inspired by Artificial Evolution
  VIII The Route to a Defect Tolerant LUT through Artificial Evolution

List of Figures

2.1 Simplified example of an FPGA with 16 CLBs
2.2 Photomasking
2.3 Cross section of CMOS inverter
2.4 Triple Modular Redundancy (TMR)
2.5 Interwoven logic
2.6 Series and parallel replication of transistors

3.1 Research process and relation of papers
3.2 “Fake” redundancy from paper III
3.3 Evolved redundancy from paper IV
3.4 Defect tolerant minority gate from paper V
3.5 Multiple Short-Open (MSO) technique from paper VII
3.6 Evolved LUT1 from paper VIII

Abbreviations

AE Artificial Evolution

CB Connection Block

CLB Configurable Logic Block

CMOS Complementary Metal Oxide Semiconductor

CRAB Complex, Reconfigurable, Adaptive, Bio-inspired Hardware

EA Evolutionary Algorithm

EC Evolutionary Computation

EHW Evolvable Hardware

FF Flip Flop

FPGA Field Programmable Gate Array

GA Genetic Algorithm

HDL Hardware Description Language

IC Integrated Circuit

IOB Input/Output Block

ITRS International Technology Roadmap for Semiconductors

LUT Look Up Table

MSO Multiple Short-Open

MTTF Mean Time To Failure

NMR N-Modular Redundancy

OPC Optical Proximity Correction

RC Reconfigurable Computing

SB Switch Block

SRAM Static Random Access Memory

TMR Triple Modular Redundancy

UV Ultra Violet

VLSI Very Large Scale Integration

Chapter 1

Introduction

1.1 Introduction

Field Programmable Gate Arrays (FPGAs) have become more and more popular in recent years and are now widely used, not only as a prototyping device, as was the original purpose, but also as a component in end user products. High end FPGAs are produced in the latest technology processes and are now, in number of transistors, among the largest Integrated Circuits (ICs) that are produced.

The lithographic process employed when producing ICs cannot be perfectly controlled, resulting in some defective devices. Large, high-end ICs are most susceptible to defects and the ITRS roadmap [20] predicts this situation is growing worse as CMOS technology scales further. Today, an FPGA is typically scrapped when tests at the factory reveal one or more defects in the FPGA. If the ITRS predictions are true, further technology scaling will make it practically impossible to produce FPGAs that are 100% defect free. When looking towards future technologies, the predictions are even more pessimistic and a significant portion of each produced chip is expected to be defective [27].

Production defects can be tolerated with redundant components, resulting in fewer scrapped FPGAs. There exist several known techniques for introducing redundancy in a system with the purpose of tolerating defects. All redundancy techniques do, however, come with a price. There is a balance between area and the number of defects that can be tolerated. If the technique is specialised, for example towards handling production defects in an FPGA, it is often possible to either achieve better defect tolerance or more area efficient defect tolerance [23].

This thesis represents a search for new redundancy techniques that can help tolerate defects in FPGAs. The biologically inspired technique Artificial Evolution (AE) is employed as a tool for searching for new redundancy structures, structures that either alone or together with existing techniques will provide improved defect tolerance. Although the goal is a new redundancy technique for traditional designs, a large part of the research behind this thesis deals with how to set up an evolutionary experiment such that redundancy structures emerge, structures that not only exhibit redundant elements but are also useful for the purpose of tolerating defects.

1.2 Research Questions

The main research question for this thesis is:

How can the FPGA architecture be designed such that production defects in the FPGA do not affect the application design running on the FPGA?

Production defects are here defined as any physical deviation from the original design due to imperfect production techniques, even if not detected at the factory test.

In the search for an answer to the main research question, several smaller and more specific research questions have been addressed:

1. How can artificial evolution be employed in the search for new static hardware redundancy structures?

2. Which redundancy structures can evolution find at the transistor level and how can the redundancy structures be combined with existing traditional design techniques?

3. How can the defect tolerance of an FPGA be enhanced through the redundancy techniques that resulted from the evolutionary experiments?

1.3 Thesis Outline

This thesis is a collection of papers. The main part of this thesis and all relevant research results are, therefore, found in the papers. The papers were written with this thesis in mind and, therefore, build naturally on each other, leading towards the conclusion. Chapter 2 presents the necessary background material. Chapter 3 provides an overview of the research process and the papers. Chapter 4 concludes the thesis, summarises research contributions and suggests future work.

Chapter 2

Background

2.1 Field Programmable Gate Arrays

Field Programmable Gate Arrays (FPGAs) represent the most general purpose reconfigurable digital devices. A reconfigurable device is a device that has no predetermined functionality but can be configured to the desired functionality at any time. A reconfigurable device contains a number of configurable functional primitives that together implement the desired functionality. In addition, there is a configurable interconnect that connects the functional primitives.

Architecture and terminology for FPGAs differ between vendors. This section presents a simplified FPGA architecture, shown in figure 2.1. The functional primitives of an FPGA are in this thesis termed Configurable Logic Blocks (CLBs) and the FPGA consists of a regular array of CLBs. Each CLB contains at least one Look Up Table (LUT) and Flip Flop (FF). A LUT consists of an SRAM and typically has four to six inputs addressing the SRAM. A LUT with k inputs can thus implement any logic function of those k inputs. To implement a function that is too large to fit in one CLB, the function is split up and placed in several CLBs, connected through the configurable interconnect. The interconnect consists of lines, Switch Blocks (SBs) and Connection Blocks (CBs). Each switch block is configurable and connects lines entering and leaving the switch block. A connection block has a structure similar to the switch block but connects the lines to the inputs and outputs of a CLB. To be able to connect the configured circuit to the outside world, the FPGA also contains Input/Output Blocks (IOBs). An IOB is often similar in structure to a CLB but has additional circuitry for connecting to a physical pin on the FPGA.
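The way a LUT realises a logic function by addressing an SRAM can be illustrated with a minimal Python sketch. The class name `LUT`, the helper `from_function` and the choice of treating the first input as the most significant address bit are assumptions made for illustration only; they do not describe any particular vendor architecture.

```python
from itertools import product


class LUT:
    """A k-input look-up table: 2**k configuration bits addressed by the inputs."""

    def __init__(self, num_inputs, config_bits):
        if len(config_bits) != 2 ** num_inputs:
            raise ValueError("need exactly 2**num_inputs configuration bits")
        self.num_inputs = num_inputs
        self.config_bits = list(config_bits)  # models the SRAM contents

    @classmethod
    def from_function(cls, num_inputs, func):
        """Fill the SRAM with the truth table of func (hypothetical helper)."""
        bits = [int(bool(func(*inputs)))
                for inputs in product((0, 1), repeat=num_inputs)]
        return cls(num_inputs, bits)

    def evaluate(self, *inputs):
        """Use the inputs as an address into the SRAM (first input is the MSB)."""
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.config_bits[address]
```

Configuring the same structure with a different set of 16 bits, for example `LUT.from_function(4, lambda a, b, c, d: a ^ b ^ c ^ d)`, yields a different four-input function without changing the hardware model.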

A modern FPGA is more complex than the FPGA shown in figure 2.1. The interconnect is more flexible, with long lines that bypass several switch blocks for reduced delay. The CLBs are often clustered to reduce delay for local connections. Each CLB also contains several configurable multiplexers to increase the flexibility of internal CLB routing and dedicated carry chains to reduce delay when implementing adder circuits. A LUT can also be configured as a small memory block or a shift register. The FPGA may also contain specialised units like dedicated RAM blocks, multipliers and complete processor cores, all embedded in the array of CLBs.

Figure 2.1: Simplified example of an FPGA with 16 CLBs. [Figure: a 4 × 4 array of CLBs, each with a LUT and an FF, together with Switch Blocks (SBs) and Input/Output Blocks (IOBs); insets show a switch block, a single switch and a configurable logic block.]

Figure 2.2: Photomasking. [Figure labels: UV light, chrome pattern, quartz glass, exposed photoresist, unexposed photoresist, wafer.]

The FPGA is almost always SRAM based, which means that all configurable elements are controlled by at least one SRAM cell. The set of all configuration SRAM cells is called the configuration memory of the FPGA. When an FPGA is to be programmed, a bit file containing a value for every SRAM cell in the configuration memory is uploaded to the FPGA. This bit file is the result of an automated design flow, where a circuit described in a Hardware Description Language (HDL) is synthesised, placed, routed and converted to a suitable bit file for the device.

A more comprehensive overview of FPGAs is given by Oldfield and Dorf [29]. For examples of modern high end FPGAs, see [3, 37].

2.2 Chip Production

2.2.1 Photolithography

ICs, including FPGAs, are produced through the process of photolithography where the layers of the chip are formed on an extremely pure silicon wafer, known as the substrate, pre-doped to either p-type or n-type.

Each layer on the wafer is defined with photomasks. The application of photomasks in photolithography is shown in figure 2.2. A wafer is covered with a photoresist which is to be patterned. The photomask consists of a quartz glass with a chrome pattern. UV light floods through the photomask such that certain areas of the photoresist on the wafer are exposed and hardened. Unexposed photoresist is removed with a solvent. A photomask is typically smaller than the wafer. A stepper moves the photomask across the wafer.

The cross section of an inverter fabricated on a p-type substrate is shown in figure 2.3. A simple fabrication process for the inverter in figure 2.3 may go through the following steps:


Figure 2.3: Cross section of CMOS inverter in an Integrated Circuit (IC), showing the different layers. [Figure labels: oxide, metal, polysilicon, p+ and n+ regions, n-well, p-substrate.]

1. Form n-well

(a) Perform oxidation of wafer, to form SiO2 on the surface

(b) Apply photoresist over oxide layer

(c) UV-illuminate with n-well mask to harden photoresist

(d) Remove unexposed photoresist with solvent to expose oxide

(e) Etch exposed oxide

(f) Remove the rest of the photoresist

(g) Form well by adding dopants with diffusion process or ion implantation

(h) Remove remaining oxide with acids

2. Form transistor gates

(a) Perform oxidation to form thin gate oxide

(b) Grow heavily doped polysilicon with the chemical vapor deposition process where heated gases react with the wafer and deposit polysilicon on the surface

(c) Pattern wafer with photoresist and the polysilicon mask, to leave polysilicon gates (as in step 1)

3. Form n-diffusion (n+ regions), similarly as for the n-wells

4. Form p-diffusion (p+ regions), similarly as for the n-diffusion and n-wells

5. Form thick field oxide to insulate wafer from metal

(a) Pattern wafer with photoresist and the contact mask to remove oxide where metal should be allowed to make contact

6. Form metal layer

(a) Sputter aluminium onto the wafer


(b) Pattern wafer with the metal mask and plasma etch to remove unwanted metal

More metal layers are usually provided by repeating steps 5 and 6. Sub-micron processes must also take into account that the smallest features are smaller than the wavelength of light. Optical Proximity Correction (OPC) of the originally designed patterns is then needed to ensure printed features have the correct shape. One example is line ends, which receive less light than the centre of the line. The OPC techniques then add “hammerheads” to the line ends in the mask to compensate for this effect.

A wafer is typically up to 300 mm in diameter and several dies (the sub parts of a wafer representing single chips) are produced on the same wafer at the same time. After the wafer is finished, it is cut into the individual dies which are then packaged to form the finished chips.

For further information on CMOS VLSI design and fabrication, see Weste and Harris [36].

2.2.2 Production Defects

The chip fabrication process outlined in section 2.2.1 cannot be perfectly controlled. The result is defects in some of the formed structures. Production defects may be put into three main categories [12, 23]: random defects, systematic defects and gross defects.

• Random defects are small local defects that appear at random places on the wafer. Random defects typically occur due to impurities in the materials and airborne particles deposited on the wafer or obstructing the illumination steps. Random defects may occur in any layer of the wafer and between layers. The result is typically extra or missing material.

As an example from the fabrication process described in section 2.2.1, an impurity in the gate oxide in step 2a could result in an oxide pinhole shorting the gate and the channel of one of the transistors. Another example is an airborne particle obstructing the illumination in step 6b when patterning the metal layer. The result would be missing metal in the spot where the particle was.

• Systematic defects are those related to the inability of a process to correctly manufacture a specific design. One cause of systematic defects is the matching problem. Optical effects cause matching structures in different areas of the die to end up with slightly different shapes and thus have different electrical characteristics. Another cause is variations across a wafer due to an inaccurate mask stepper. Slightly differently aligned masks for the different dies on the wafer result in differing device characteristics.

• Gross defects are global defects like scratches in the wafer, rendering the whole wafer unusable.


Traditionally, random defects have been the most important factor limiting yield. Systematic defects are, however, a growing challenge [4, 5]. As feature size is reduced to the limits of what can be controlled optically, failure to form features will dominate over random defects as a yield limiting mechanism [11]. Although these defects are systematic, they may appear random because of the complexity of the conditions required for their occurrence [20].

Defect Models

When analysing how well a circuit tolerates defects, the effect of likely defects in the circuit can be simulated. The effect of a production defect can be complex. Accurate defect modelling based on layout and geometrical considerations, as in [35], is normally not an option when the effect of production defects is to be analysed. For this reason, several defect models have been proposed at different levels of abstraction.

At gate level the effect of a production defect is often modelled with the well known stuck-at defect model, where the output or one of the inputs of the gate is either stuck at 0 or stuck at 1. The stuck-at model is widely used for creating test patterns. However, the stuck-at model does not reflect real behaviour from production defects. Random defects typically result in resistive shorts and opens at random locations in the circuits, which are rarely seen as a constant 0 or 1. Instead, the observed behaviour might be degraded voltage levels, increased propagation delays or transition faults [15]. More advanced gate level models have been proposed [34] but often with the disadvantage of requiring complex additions to the original gate level design to support simulations, reducing the simplicity advantage of the gate level.
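The stuck-at model can be made concrete with a small fault simulation. The following Python sketch is an illustration only: the netlist for y = (a AND b) OR c, the net names and the function names are hypothetical, and a defect is injected on a whole net (a primary input or a gate output) rather than on an individual gate input pin.

```python
from itertools import product

# Hypothetical netlist for y = (a AND b) OR c, listed in topological order.
NETLIST = [
    ("n1", "AND", ("a", "b")),
    ("y", "OR", ("n1", "c")),
]
GATES = {"AND": lambda x, y: x & y, "OR": lambda x, y: x | y}


def simulate(inputs, stuck_at=None):
    """Evaluate the netlist; stuck_at is an optional (net, value) pair."""
    def force(net, value):
        # A stuck-at defect overrides whatever value the net would carry.
        if stuck_at is not None and stuck_at[0] == net:
            return stuck_at[1]
        return value

    nets = {name: force(name, value) for name, value in inputs.items()}
    for out, kind, (in0, in1) in NETLIST:
        nets[out] = force(out, GATES[kind](nets[in0], nets[in1]))
    return nets["y"]


def detecting_patterns(stuck_at):
    """Input patterns whose output differs between good and faulty circuit."""
    patterns = []
    for a, b, c in product((0, 1), repeat=3):
        inputs = {"a": a, "b": b, "c": c}
        if simulate(inputs) != simulate(inputs, stuck_at):
            patterns.append((a, b, c))
    return patterns
```

For this example circuit, `detecting_patterns(("n1", 0))` returns only the pattern (1, 1, 0), which shows why test pattern generation has to search for inputs that both activate a defect and propagate its effect to an output.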

More realistic defect models may be applied at the transistor level [1]. The most widely used transistor level defect models are the stuck-open and stuck-closed defects. A stuck-open transistor is never conducting while a stuck-closed transistor is always conducting. More advanced transistor level defect models include shorts and opens on any of the three terminals of the transistor. The most advanced models include effects such as increased delay and other parametric deviations from the ideal behaviour.
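The stuck-open and stuck-closed models can be sketched for a single CMOS inverter using a purely switch-level view. The function names and the symbolic output values 'Z' (floating output) and 'X' (pull-up and pull-down conducting at the same time) are assumptions introduced for this illustration; a real analysis would use analogue simulation to resolve such cases.

```python
def transistor_conducts(kind, gate, defect=None):
    """Switch-level transistor: kind is 'nmos' or 'pmos', gate is 0 or 1."""
    if defect == "stuck-open":
        return False   # never conducting
    if defect == "stuck-closed":
        return True    # always conducting
    return gate == 1 if kind == "nmos" else gate == 0


def inverter(a, pmos_defect=None, nmos_defect=None):
    """Return 0, 1, 'Z' (floating output) or 'X' (pull-up fights pull-down)."""
    pull_up = transistor_conducts("pmos", a, pmos_defect)
    pull_down = transistor_conducts("nmos", a, nmos_defect)
    if pull_up and pull_down:
        return "X"
    if pull_up:
        return 1
    if pull_down:
        return 0
    return "Z"
```

Even this coarse model shows that a single transistor defect only disturbs the inverter for one of the two input values, which a constant stuck-at 0 or stuck-at 1 at the gate level does not capture.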

2.2.3 Yield

Yield can be defined as the ratio of the number of usable items after production to the number of potentially usable items [12]. The main contributor to low yield for ICs is defects during photolithography. Yield is an important measure because only usable items are sellable. Low yield can make production prohibitively expensive.

For chip production, the total yield is the product of wafer process yield, device yield and module test yield. Wafer process yield is the ratio of usable wafers. Device yield is the ratio of usable dies after photolithography and module test yield is the ratio of usable chips after packaging. Device yield is the most important component, and the only one that is dependent on the specific circuit.


Redundancy techniques, such as the ones explained in section 2.4, can improve device yield by tolerating a certain number of defects. The yield improvement of introducing redundancy to a circuit can be defined as $YI = Y_R / Y_{NR}$, where $Y_{NR}$ is the yield for a given nonredundant circuit and $Y_R$ is the yield for the redundant version of the circuit. If yield is the prime reason for introducing redundancy, redundancy should only be introduced if $YI > 1$. However, redundancy also increases the area of the die, resulting in fewer dies per wafer. Although device yield is increased, a large amount of redundancy might therefore reduce the total number of usable dies on a wafer. A more suitable measure for redundant circuits is therefore effective yield. Effective yield at wafer level is the ratio of working dies with redundancy to the total number of nonredundant dies that would fit on the wafer. If redundancy is to increase effective yield, $YI > RR$ is required, where $YI$ is the device yield improvement and $RR$ is the redundancy ratio. The redundancy ratio is the amount of hardware required by a redundant system, divided by the amount required by a nonredundant system performing the same function [21]. It follows that redundancy only helps effective yield if $Y_R > Y_{NR} \cdot RR$. As an example, consider a chip whose device yield is $Y_{NR} = 0.2$. Introducing redundancy with factor $RR = 4$ would only help effective yield if the new chip has a device yield $Y_R > 0.8$.
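The effective yield condition above is simple arithmetic and can be checked with a short Python sketch. The function names are hypothetical, and `effective_yield` assumes that the number of dies per wafer scales inversely with the redundancy ratio, which ignores wafer edge effects.

```python
def yield_improvement(y_r, y_nr):
    """YI = Y_R / Y_NR: device yield with redundancy relative to without."""
    return y_r / y_nr


def effective_yield(y_r, rr):
    """Working redundant dies per nonredundant die site on the wafer.

    Assumes dies per wafer shrink by exactly the redundancy ratio rr.
    """
    return y_r / rr


def redundancy_helps_effective_yield(y_r, y_nr, rr):
    """True if Y_R > Y_NR * RR, i.e. if YI > RR."""
    return y_r > y_nr * rr
```

With the numbers from the text, `redundancy_helps_effective_yield(0.9, 0.2, 4)` is true while `redundancy_helps_effective_yield(0.7, 0.2, 4)` is false, since the redundant design must exceed a device yield of 0.8.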

Yield may be estimated with a yield formula. Early yield formulas were based on the assumption that defects occur independently. It has, however, been observed that defects often occur in clusters. The negative binomial yield formula [23], given in equation (2.1), takes clustering into account and is today the most widely used yield formula.

$$Y = \left(1 + \frac{\lambda}{\alpha}\right)^{-\alpha} \qquad (2.1)$$

In equation (2.1), λ is the average number of faults on the chip and α is the amount of fault clustering. Smaller values of α indicate increased clustering. When the yield of a new chip is to be estimated, the usual practice is to get λ and α from a reference design with known yield information and scale as well as possible to the new design. A design with redundancy requires more elaborate yield calculations because of the ability to tolerate some defects [23].
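Equation (2.1) is straightforward to evaluate numerically. The sketch below uses hypothetical function names and adds the Poisson formula $Y = e^{-\lambda}$ for comparison; the Poisson formula is not given in the text but is the standard model for independently occurring defects and is the limit of equation (2.1) as α grows large.

```python
import math


def negative_binomial_yield(lam, alpha):
    """Equation (2.1): lam is the average number of faults, alpha the clustering."""
    return (1.0 + lam / alpha) ** (-alpha)


def poisson_yield(lam):
    """Yield when defects occur independently (comparison only, see lead-in)."""
    return math.exp(-lam)
```

For the same average number of faults, a smaller α gives a higher estimated yield, because clustered defects tend to land on the same few chips and leave more chips defect free.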

Device yield depends on the design of the chip to be produced. Even when basing λ and α on known information from a similar design, calculations based on simple yield formulas like equation (2.1) are seldom accurate. Another way of estimating yield involves Monte Carlo simulations on a finished layout where geometric features modelling production defects are placed randomly onto the layout before simulation [35]. Monte Carlo simulations for yield are also applicable to designs with redundancy and remove the need to scale yield information from a previous representative design. Process specific information about type and distribution of defects must still be available. A complex layout may be too time consuming to simulate for Monte Carlo simulations to be practical.


2.3 Reliability

The reliability of a system can be defined as the ability to perform the specified function under stated conditions. When the reliability of a system is to be evaluated, one of several possible evaluation criteria is chosen and the system is evaluated against this criterion [31].

For hardware systems, the most common way of evaluating reliability is to apply a probabilistic reliability function $R(t)$ that gives the probability that a system is working correctly between time 0 and time t, given certain conditions and correct behaviour at time 0. If the failure rate of the system is constant over time, the reliability function is $R(t) = e^{-\lambda t}$ where λ is the constant failure rate for one unit of time. When λt is small, $R(t) \approx 1 - \lambda t$.

In a system composed of several subcomponents, all of which must be working, the reliability of the system is given as $R = \prod_{c=1}^{n} R_c$ where $R_c$ is the reliability of subcomponent c and n is the number of subcomponents. A defect tolerant system can continue to operate despite a certain number of defects. For such systems, where not all subcomponents need to be working, more elaborate reliability calculations must be performed or, more realistically for complex systems, Monte Carlo simulations need to be employed.
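Both approaches can be sketched in a few lines of Python. The function names are hypothetical, subcomponent failures are assumed to be independent, and the defect tolerant system is modelled in the simplest possible way: it works whenever at least `min_working` subcomponents work. Real redundant structures usually need a more detailed model.

```python
import math
import random


def series_reliability(component_reliabilities):
    """R = product of R_c: every subcomponent must work (independence assumed)."""
    return math.prod(component_reliabilities)


def monte_carlo_reliability(component_reliabilities, min_working,
                            trials=100_000, seed=0):
    """Estimate the probability that at least min_working subcomponents work."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    successes = 0
    for _ in range(trials):
        working = sum(rng.random() < r for r in component_reliabilities)
        if working >= min_working:
            successes += 1
    return successes / trials
```

Setting `min_working` to the number of subcomponents reproduces the product formula up to sampling noise, which gives a simple sanity check before the simulation is used for systems that tolerate defects.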

An alternative evaluation criterion is Mean Time To Failure (MTTF) which is the average time a system will run before failing. MTTF is linked to the failure rate in the following way: $MTTF = 1/\lambda$. If λ is the failure rate per hour, MTTF is the average number of hours before failing.
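The relations between a constant failure rate, the reliability function and MTTF can be checked numerically. The following Python sketch uses hypothetical function names and assumes a constant failure rate throughout.

```python
import math


def reliability(failure_rate, t):
    """R(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-failure_rate * t)


def reliability_linear(failure_rate, t):
    """First-order approximation R(t) ~ 1 - lambda * t, valid for small lambda * t."""
    return 1.0 - failure_rate * t


def mttf(failure_rate):
    """Mean time to failure, in the same time unit as the failure rate."""
    return 1.0 / failure_rate
```

One consequence worth noting is that the reliability at t = MTTF is $e^{-1}$, so under this model only about 37% of systems are still working after one MTTF.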

When considering how reliable a system is in the presence of production defects, time is not relevant. MTTF is therefore not applicable and reliability is simply $R = 1 - \lambda$ where λ is the probability of failing under stated conditions. It should be noted that reliability in this case is similar to, but not the same as, yield. Yield is the percentage of chips that can be sold. Reliability is a probability of working given certain conditions. These conditions need not be directly linked to what actually causes unsellable chips. However, if the stated conditions are realistic and relevant for what constitutes a sellable chip, high reliability will lead to high yield.

2.4 Defect Tolerance

A defect tolerant circuit is a circuit that functions correctly even if there are defective subcomponents, for example defective transistors and/or wires. Defect tolerance can be seen as a special case of fault tolerance where only permanent defects are considered. Transient faults that do not result in permanent damage are not an issue.

A defect tolerant circuit is designed to tolerate a certain number of defective components. The term defect coverage refers to the percentage of all possible defects a defect tolerant system can tolerate. 100% defect coverage means that any possible single defect anywhere in the system is tolerated. Often, defect coverage is less than 100%, either because not all defect types are tolerated or because some parts of the system are not defect tolerant.

Figure 2.4: Triple Modular Redundancy (TMR). (a) Single TMR system. (b) Cascaded TMR system with triplicated voters.

In the beginning of the history of digital electronic circuits, logic was built from unreliable vacuum tubes. As a result, there was a significant amount of research on how to build reliable computers from unreliable components and many of the most well known defect tolerance techniques date from the early period of computing. After the introduction of the IC, failure rates dropped drastically and reduced the importance of defect and fault tolerance techniques, except for a few extreme cases such as for space exploration. Recent predictions on failure rates in future production processes have renewed interest in defect tolerance.

Defect tolerance is achieved through the use of redundancy techniques. Redundancy techniques relevant for tolerating hardware defects can be classified as static hardware redundancy, dynamic hardware redundancy or information redundancy [24]. Static hardware redundancy involves having redundant hardware components connected in such a way that defects are tolerated without any need to first detect the defects. Dynamic hardware redundancy involves first detecting a defect and then applying measures, for example reconfiguration, for avoiding the detected defect. With information redundancy, redundant information is added to an existing data set, e.g. enabling error correction by adding error correcting codes to the original data set.

2.4.1 Triple Modular Redundancy

Triple Modular Redundancy (TMR) [25] is a static hardware redundancy technique derived from the work of von Neumann [33]. The original scheme by von Neumann is shown in figure 2.4(a). Three equal modules perform the same function and a voter outputs the value the majority of the modules output. If there is a defect in one of the modules, there are still two modules outputting the correct value. TMR is a special case of the technique N-Modular Redundancy (NMR) where there are N equal modules. An NMR system is able to tolerate $\lfloor (N-1)/2 \rfloor$ defective modules. If the modules are very large compared to the voter, the redundancy ratio of NMR can be approximated to N.
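The voting step and the tolerance bound can be illustrated with a minimal Python sketch. The function names `majority_vote` and `tolerable_defects` are hypothetical and chosen only for this illustration; module outputs are modelled as plain values in a list.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value output by a strict majority of modules, or None if no majority exists."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

def tolerable_defects(n):
    """Number of defective modules an NMR system can outvote: floor((N - 1) / 2)."""
    return (n - 1) // 2

# TMR with one defective module: the middle module outputs 0 instead of 1.
print(majority_vote([1, 0, 1]))
# Tolerable defective modules for N = 3, 5 and 7.
print([tolerable_defects(n) for n in (3, 5, 7)])
```

With an even number of modules a tie is possible, which is why the sketch returns `None` when no strict majority exists.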

The voter in figure 2.4(a) is a single point of failure. If the voter is defective, the result will be an incorrect output. For this reason, the voter may be triplicated as shown in figure 2.4(b). Figure 2.4(b) also shows the concept of cascaded TMR systems. The probabilistic reliability of a system is only improved with NMR (including TMR) if the probability of a single module working is more than 0.5. If the probability is less than 0.5, adding more redundant modules to a system will only decrease the reliability of the system. For this reason, a system constructed from highly unreliable components must be split up into several smaller subsystems until each subsystem consists of so few basic components that the submodule reliability is more than 0.5. Each submodule can then be made redundant with TMR and cascaded to form the total system.
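The 0.5 threshold can be checked numerically. The sketch below assumes independent module failures and a perfect voter, which are simplifications introduced here, and computes the probability that a strict majority of the N modules work. The function name `nmr_reliability` is hypothetical.

```python
from math import comb

def nmr_reliability(p, n=3):
    """Probability that a strict majority of n independent modules work.

    p is the probability that a single module works. The voter is assumed
    to be perfect, which is a simplification made for this sketch.
    """
    need = n // 2 + 1  # smallest number of working modules forming a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

for p in (0.9, 0.5, 0.4):
    print(p, round(nmr_reliability(p, 3), 4), round(nmr_reliability(p, 5), 4))
```

For p = 0.9 the TMR system is more reliable than a single module, at p = 0.5 nothing is gained, and for p = 0.4 both N = 3 and N = 5 are worse than a single module, with N = 5 the worst of the three.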

2.4.2 Interwoven Logic

TMR is mostly used when the modules are much larger and more error prone than the voter [31]. TMR is not suitable when the size of each module is only a few gates because the voter then becomes the dominant part of the system. For this reason, several gate level techniques have been proposed. Pierce [30] combined these into a general theory called interwoven logic.

Interwoven logic is based on the concept of critical and subcritical input faults. A critical input fault to a gate is a fault that by itself is enough to propagate as a fault on the output of the gate. A subcritical fault, on the other hand, does not alone cause a gate output error. If the gate is a NAND gate, a critical fault would be an input whose value is incorrectly 0. That would force the output of the NAND to 1 no matter what the other inputs to the gate are. A subcritical fault would be an incorrect input 1 which would only force the output to 0 if all other inputs to the NAND gate are also 1.

Note that in a circuit constructed entirely from NAND gates, any critical fault will be converted to a subcritical fault in the next layer. Interwoven logic involves creating the layers of a circuit such that any critical fault is converted to a subcritical fault in the next layer and then completely removed within two layers of logic.
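The distinction between critical and subcritical faults on a NAND gate can be traced with a few lines of Python. This is only an illustrative sketch: the `nand` helper and the two-layer arrangement are assumptions made here and do not reproduce the circuit in figure 2.5.

```python
def nand(*inputs):
    """NAND over any number of 0/1 inputs: 0 only when every input is 1."""
    return 0 if all(inputs) else 1

# Critical fault: an input that should be 1 is incorrectly 0.
# The output is forced to 1 whatever the other input is.
good_first = nand(1, 1)   # fault-free first-layer output
bad_first = nand(0, 1)    # faulty first-layer output, an incorrect 1

# In the next NAND layer the incorrect 1 is only a subcritical fault.
# It is masked whenever some other input of that gate is 0 ...
masked = nand(bad_first, 0) == nand(good_first, 0)
# ... and it propagates only when all other inputs are 1.
propagates = nand(bad_first, 1) != nand(good_first, 1)

print(good_first, bad_first, masked, propagates)
```

Interwoven logic exploits the masked case: the interconnect is arranged so that a gate receiving a subcritical fault also receives a fault-free copy of the signal that keeps its output correct.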

An example is given in figure 2.5. Each gate in a nonredundant NAND-based circuit is quadrupled and the interconnect between layers is interwoven such that any critical input fault on the X and Y inputs is completely removed before the Z output. A critical fault on the X1 input propagates as a subcritical fault on the V1 and V4 outputs and is completely removed on the Z outputs. To make sure the subcritical faults on the V and W outputs are not allowed to propagate further, it is important that the interweaving pattern to the inputs of the second layer of gates is different from the pattern on the first layer of gates. If ignoring interconnect resources and assuming a four-input NAND is twice the size of a two-input NAND, the redundancy ratio for the circuit in figure 2.5(b) is 8 (four times as many gates, each twice the size).

2.4.3 Transistor Level Redundancy

Redundancy may also be applied at the transistor level. The most well known transistor level technique is the series-parallel transistor replication technique, also known as the quadrupled transistor technique. Series-parallel replication for improving reliability of a system was first described for relays by Moore and Shannon [28]. The purpose is to tolerate both stuck-open (permanently non-conducting)


(a) Nonredundant circuit (b) Interwoven circuit

Figure 2.5: Interwoven logic. Example from Siewiorek and Swarz [31]

Figure 2.6: Series and parallel replication of transistors


Algorithm 1 Evolutionary Algorithm
 1: procedure EA
 2:     Generate random initial population
 3:     repeat
 4:         Evaluate all individuals in population to obtain fitness score
 5:         repeat
 6:             Select two good parents
 7:             Recombine parents to get children
 8:             Mutate children
 9:             Place children in next generation
10:         until Complete Generation is produced
11:     until Ending criteria is met
12:     Return most fit individual
13: end procedure

and stuck-closed (permanently conducting) transistors. Stuck-open tolerance is achieved with transistors connected in parallel. If a transistor is stuck-open, there is still another, parallel transistor that can conduct. Likewise, two transistors in series will tolerate a stuck-closed transistor. If one transistor is stuck-closed, there is still another transistor available that can stop the unwanted flow of current.

The series-parallel technique combines the ideas of series and parallel transistors in order to tolerate both stuck-open and stuck-closed defects. Several possible topologies exist with slightly different properties. A typical topology is shown in figure 2.6 where two series connected paths are connected in parallel. The technique in figure 2.6 has a redundancy ratio of 4.
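The tolerance of the two-paths-in-parallel topology to any single stuck-open or stuck-closed defect can be verified exhaustively with a small switch-level model. The sketch below is an abstraction introduced here, not a circuit simulation: each transistor is treated as an ideal switch, and the state names "ok", "open" and "closed" are hypothetical labels.

```python
def conducts(state, gate_on):
    """Ideal switch model of one transistor."""
    if state == "open":      # stuck-open: never conducts
        return False
    if state == "closed":    # stuck-closed: always conducts
        return True
    return gate_on           # defect free: follows the gate signal

def quad_conducts(states, gate_on):
    """Two series pairs (T1-T2 and T3-T4) connected in parallel."""
    t1, t2, t3, t4 = (conducts(s, gate_on) for s in states)
    return (t1 and t2) or (t3 and t4)

def tolerates(states):
    """True if the network still behaves like one good transistor."""
    return all(quad_conducts(states, g) == g for g in (False, True))

# Every way of placing exactly one defect in the four transistors.
single_defects = [
    tuple(defect if i == pos else "ok" for i in range(4))
    for pos in range(4)
    for defect in ("open", "closed")
]
print(all(tolerates(s) for s in single_defects))
```

The same model shows the limits of the technique: two stuck-open transistors in different paths, or two stuck-closed transistors in the same path, are not tolerated.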

2.5 Evolutionary Algorithms

When searching for solutions to problems where an exact algorithmic method is too time consuming, some sort of heuristic must be employed. An Evolutionary Algorithm (EA) is one such heuristic based on the idea of a population of individuals where the population is repeatedly changed by reproduction and mutation of the individuals in the population. A fitness function estimates an individual’s ability to solve the given problem and is the basis for a mechanism selecting which parents are fit enough to be allowed to reproduce. The result is, hopefully, an increasingly fit population. A problem solving process involving the application of EAs is called Artificial Evolution (AE). EAs are studied in the field of Evolutionary Computation (EC) and an introduction is given by Eiben and Smith [9].

Several EAs have been proposed. The pseudo code of a variant known as generational Genetic Algorithm (GA) is given in algorithm 1. First, the population is initialised with random individuals. Then, in each iteration of the algorithm, a new generation of the population is created by selecting good parents from the population and creating the new individuals for the next generation by recombining the parents and mutating the resulting children.


The selection mechanism determines how to select good parents for reproduction (line 6 in algorithm 1). Tournament selection is one selection mechanism where a group of g randomly selected individuals is created from which a parent is to be selected. There is then a probability p that the most fit individual in this group is chosen as the parent and a probability 1 − p that a random individual in the group is chosen as the parent.
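Tournament selection as described above can be sketched in a few lines of Python. The function name `tournament_select` and its parameters are hypothetical; individuals can be any objects for which the supplied `fitness` function returns a comparable score.

```python
import random

def tournament_select(population, fitness, g, p, rng=random):
    """Pick one parent using a tournament of size g.

    With probability p the fittest member of the group is returned,
    otherwise a uniformly random member of the group is returned.
    """
    group = rng.sample(population, g)  # g distinct, randomly chosen individuals
    if rng.random() < p:
        return max(group, key=fitness)
    return rng.choice(group)

# Usage: individuals are plain numbers and fitness is the number itself.
parent = tournament_select([3, 1, 4, 1, 5], fitness=lambda x: x, g=3, p=0.8)
print(parent)
```

Larger groups and a higher p increase the selection pressure, since the fittest individuals are then chosen more often.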

Some individuals may also be copied unaltered to the next generation. Elitism is a concept where the best individual in a population is guaranteed to be copied unchanged to the next generation. Elitism ensures that the maximum fitness in a population is never decreasing.

Evolvable Hardware (EHW) [18] is the application of AE for hardware design. When EHW is applied for circuit design, each individual represents an electronic circuit. The genotype is the circuit description manipulated by the EA and the phenotype is the circuit itself in a form that can be tested for fitness. Intrinsic evolution is evolution of circuits where fitness evaluation is performed on actual hardware while extrinsic evolution relies on a circuit simulator for fitness evaluations [6].
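The generational GA of algorithm 1, extended with elitism, might be sketched as follows. Everything specific in the sketch is an assumption made for illustration: bit string genotypes, one-point crossover, bit-flip mutation, a fixed number of generations as ending criterion, and a deterministic tournament of three as the selection mechanism. The function name `evolve` is hypothetical.

```python
import random

def evolve(fitness, genome_len=20, pop_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Generational GA with elitism over bit string genotypes."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        elite = max(population, key=fitness)
        best_per_generation.append(fitness(elite))
        next_gen = [elite[:]]  # elitism: best individual copied unchanged
        while len(next_gen) < pop_size:
            # Select two good parents, each the best of a random group of three.
            p1 = max(rng.sample(population, 3), key=fitness)
            p2 = max(rng.sample(population, 3), key=fitness)
            # Recombine with one-point crossover.
            cut = rng.randrange(1, genome_len)
            for child in (p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]):
                # Mutate by flipping each bit with a small probability.
                child = [b ^ 1 if rng.random() < mutation_rate else b
                         for b in child]
                if len(next_gen) < pop_size:
                    next_gen.append(child)
        population = next_gen
    return max(population, key=fitness), best_per_generation

# Usage: maximise the number of ones in the bit string.
best, history = evolve(sum)
print(sum(best), history[0], history[-1])
```

Because the elite individual is always carried over, the recorded best fitness per generation never decreases, which is the property elitism is meant to guarantee.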

Chapter 3

Research Summary

This chapter gives a summary of the research behind this thesis. The chapter starts in section 3.1 with a description of the research process that led to this thesis. Section 3.2 lists all publications by the author. However, not all these publications are relevant to the topic of this thesis. The eight papers included in this thesis represent the research path taken towards finding answers to the research hypothesis of this work. Section 3.3 presents the abstract and retrospective comments for these papers.

3.1 Research Process

This section presents the research process that led to this thesis. The major choices and the motivations behind them are presented together with some of the possible avenues that were not followed. An illustration of how the published papers relate to each other is shown in figure 3.1. A rounded box represents a major topic and a circle represents one of the papers included in this thesis.

3.1.1 Background

The funding for this PhD was originally for research on applying FPGAs as a basis for efficient ultrasound imaging algorithms, in cooperation with Prof. Bjørn Angelsen at the Department of Circulation and Medical Imaging. As such, the FPGA was chosen as the main topic from the start. However, after an initial phase of investigations, it became clear that the research challenges were mainly confined to ultrasound imaging. While an FPGA implementation was important, the project was considered to be mainly implementation and not research. It was therefore decided to find an alternative project which presented an FPGA research challenge.

The next topic that was seriously considered was Reconfigurable Computing (RC). RC is a field with several FPGA related research avenues. However, instead of rushing towards a decision on which topic to choose, it was decided to participate


Figure 3.1: Research process and relation of papers


in an external project on low power nano electronics with Snorre Aunet from the University of Oslo and Valeriu Beiu from Washington State University. Working on this external project for a few months provided some basic training in doing scientific research. It also gave a deeper understanding of issues and challenges with deep submicron CMOS, such as power consumption and defect tolerance.

Based on knowledge gained during the external project and a growing personal interest in such issues, the topic was decided to be defect tolerance for FPGAs. The growing importance of tolerating production defects indicated that more research on defect tolerance was needed. Defect tolerance and FPGA technology were also considered to be a good match. The wide acceptance and employment of FPGAs increases the importance of handling defects in such devices. Defect tolerance means employing redundancy techniques and the possibility of separating the redundancy techniques from application design seemed to be a good idea. If redundancy is included in the FPGA architecture itself, production defects can be tolerated without affecting the application design running on the FPGA. In addition, several researchers had already demonstrated that the regular and flexible architecture of the FPGA is well suited for implementing area effective redundancy [8, 14, 17, 19].

3.1.2 Initial Investigations

After having selected a topic, a literature study was needed to get an overview of the state of the art in FPGA defect tolerance. This initial study resulted in a survey paper, presented as paper I in this thesis. This survey showed that there had already been a fair amount of research on high level architectural techniques, for example the redundant row/column technique [17]. However, at the lower level of local redundancy, less work had been conducted. If an FPGA is to tolerate very high defect densities, it is probably necessary to introduce redundancy at several levels. It was therefore decided to focus on static hardware redundancy techniques for creating defect tolerant FPGA building blocks, such as defect tolerant CLBs and switch blocks.

3.1.3 Evolving Redundancy

The goal for this research was a new static hardware redundancy technique to be applied by traditional hardware designers when implementing FPGA building blocks. However, in the process of searching for this new redundancy technique, it was decided to employ the more untraditional technique of artificial evolution. Instead of trying to design a new redundancy technique by hand and then evaluate it in an FPGA context, the idea was to conduct experiments where redundant circuits were evolved and then manually analysed. If the analysis revealed some new form of redundancy structure, this was to be used as inspiration for creating a new redundancy technique. The benefit of artificial evolution is that evolution is not bound by the conventional design techniques and, therefore, exploits properties of the technology or architecture a human designer might not think of. The choice of employing artificial evolution was also natural at the CRAB lab because the main activity at the CRAB lab is research on biologically inspired techniques for


Figure 3.2: “Fake” redundancy structure typical for circuits found in paper III. f is a nonredundant circuit implementing the desired function and r represents “fake” redundant gates with no useful purpose.

hardware design. Related work on evolution of reliable circuits had also been performed in the CRAB lab [16].

One of the first challenges to overcome was the need to bridge the two different fields of EHW and traditional hardware design. Most of the research in EHW concentrates on the evolutionary process itself and how to apply evolution in a hardware setting. Comparison with traditional designs is rarely performed. As such, there is a need for ways to evaluate evolved designs to enable comparison with traditional designs. The evolved results must be measured in a way that is understandable and realistic with respect to traditional designs. For this PhD, the reliability of the evolved solutions is a key property. Therefore, we had to find some way to compare the reliability of evolved circuits with traditional circuits. A reliability measure suitable for a fitness function during evolution is not necessarily suitable for comparison with traditional circuits. On the other hand, traditional metrics are often not suited to evolutionary methods. Paper II discusses different reliability metrics and how evolved and traditional gate level circuits compare with the different reliability metrics.

3.1.4 Gate Level Redundancy

For the first attempt at evolving redundant structures, it was decided to evolve circuits at the gate level. Although efficient implementation of FPGA components must be performed at the transistor level, gate level experiments would provide an easier start where some aspects about evolving redundant circuits could be understood before handling the more complex transistor level. The CRAB lab also had experience with evolving hardware at the gate level [16] and the existing simulator could be extended to run the necessary experiments. In addition, any new redundancy structures found at the gate level would probably be useful for a defect tolerant FPGA, although these initial experiments did not target FPGA circuits directly.

While the idea of evolving redundant circuits sounds easy, it became evident that much of the difficulty in this research would be to tune evolution towards creating redundant structures that were useful for tolerating defects. Both from the work in Hartmann and Haddow [16] and from the work on paper II it was clear that evolution chose to create smaller circuits so as to avoid defects instead of creating larger circuits with redundancy to tolerate defects. The first challenge on evolving gate level redundancy was, therefore, to find a way to encourage evolution


(a) Voter technique (b) Redundant XOR2

Figure 3.3: Evolved redundancy from paper IV. “X” represents defect tolerant gates while “Y” represents gates not tolerant to defects.

to create larger circuits while retaining a focus on reliability. Although area efficient redundancy techniques are preferable, an evolutionary experiment that favours small circuits will probably not find any redundant circuits at all.

Paper III addresses this issue. Paper III not only represents several months of research, but is also an important contribution as it documents the difficulties of evolving redundant structures. Unlike most other evolutionary experiments, the goal is not a specific functionality or to achieve a certain structure known in advance. Instead, the goal is to find a new and previously unknown redundancy structure, a structure which cannot be directly specified in the fitness function. This can be compared to some of the challenges in evolving art [32]. However, the difficulty of redundant structures lies in the fact that the structure that is sought is not only unknown, but also has the requirement of enhancing the reliability of a given digital circuit. Specifying these requirements in a way that at the same time steers evolution towards good solutions presented a challenge no other work known to the author has addressed. Paper III is an exploratory paper where several experimental setups are investigated for the purpose of finding at least one setup which is successful at generating circuits with redundant gates. The most interesting results came from changing the fault model from gate reliability to the single fault model. In the gate reliability fault model, each gate fails independently and with a certain probability. With a gate reliability based fitness function, the probability of having defective gates increases with the size of the circuit. Instead of introducing redundant gates, evolution shrinks the circuits to reduce the probability of having defective gates. A gate reliability based fitness function implicitly favours small circuits, which works against the goal of creating circuits exhibiting redundancy. In the single fault model, there is only one defective gate in a circuit at a time and the number of defective gates is, therefore, not dependent on the size of the circuit. The change to a single fault based fitness function had a large impact on the results. Instead of creating small, nonredundant circuits, evolution introduced a large number of redundant gates to the circuits.
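The size dependence of the gate reliability fault model can be made concrete with a short calculation. The sketch below assumes independent gate failures with the same reliability for every gate. The value 0.99 and the function names are illustrative assumptions and are not taken from paper III.

```python
def p_no_defective_gate(gate_reliability, n_gates):
    """Probability that all n_gates independent gates work."""
    return gate_reliability ** n_gates

def expected_defective_gates(gate_reliability, n_gates):
    """Mean number of defective gates under the gate reliability model."""
    return n_gates * (1.0 - gate_reliability)

# Single fault model: exactly one defective gate, whatever the circuit size.
SINGLE_FAULT_DEFECTS = 1

for n in (5, 20, 80):
    print(n,
          round(p_no_defective_gate(0.99, n), 3),
          round(expected_defective_gates(0.99, n), 2),
          SINGLE_FAULT_DEFECTS)
```

Under the gate reliability model, every added gate lowers the chance of a defect free circuit, so a redundant gate must tolerate more defects than it introduces before it pays off. Under the single fault model the number of defects is fixed, so adding gates carries no such penalty.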

As a first step, the results in paper III were encouraging. However, the redundant gates from the single fault experiments did not perform any useful function


Figure 3.4: Defect tolerant minority gate from paper V

in the circuit. Evolution found a way to cheat by creating circuit constructs that enabled the introduction of large amounts of “fake” redundant gates that did not affect the output in any way, yet scoring highly on fitness. One such example is shown in figure 3.2.

To enhance the results in paper III, the single fault setup had to be improved to make sure that the introduced redundancy was really useful for our purpose. One way, which was briefly attempted in paper III, was to actively guide evolution away from known unwanted redundancy structures. This was, however, soon discovered to be an impossible task as there is an unlimited number of unwanted redundancy structures. Instead, a more general method was found and published in paper IV. An algorithm was constructed to classify redundant gates as either useful or fake. When making sure evolution was only rewarded for introducing useful redundant gates, interesting redundancy structures emerged. Simple voter based solutions, as shown in figure 3.3(a), were evolved when the target functionality was simple. For more complex target functionalities, more intricate structures resembling interwoven logic, as shown in figure 3.3(b), were evolved.
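Why “fake” gates can score highly under a single fault model is easy to see with a toy calculation. The scoring rule below is a hypothetical stand-in, not the fitness function used in paper III: it assumes the score is the fraction of single-gate defects that leave the output correct, that every defect in the nonredundant core f is fatal, and that defects in the fake gates r never reach the output.

```python
def tolerated_fraction(core_gates, fake_gates):
    """Fraction of single-gate defects that leave the output correct.

    Hypothetical model: a defect in any core gate breaks the circuit,
    while a defect in a fake gate has no effect on the output.
    """
    return fake_gates / (core_gates + fake_gates)

# A five gate nonredundant core padded with more and more fake gates.
for fake in (0, 5, 45):
    print(fake, tolerated_fraction(5, fake))
```

In this model the score approaches 1 as fake gates are added, even though the core circuit is no more defect tolerant than before, which is the kind of cheating described above.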

3.1.5 Transistor Level Redundancy

At this point several possible avenues could be followed. While it was interesting to see that evolution created voter based solutions, those solutions were known and not very likely to be improved upon. It is more likely to see new and better gate level redundancy structures of the interwoven logic kind, and further research on evolving such redundancy structures could prove fruitful. However, to approach our goal of defect tolerant FPGAs, it was decided to take what was learned at the gate level and step down to try to evolve redundancy at the transistor level. This decision was based on the fact that FPGA components are designed at the transistor level; gate level designs are too inefficient.

Moving from a digital gate level simulator to an analogue circuit simulator increases both simulation time and the solution space, resulting in drastically increased evolution time. However, the advantage is that the transistor level has more possibilities for evolution to exploit. The world of analogue circuits is opened


Figure 3.5: Multiple Short-Open (MSO) technique from paper VII

up for evolution to play with. As an example of the benefit of moving to the transistor level, a smaller project on transistor level redundancy for minority gates was published and is presented as paper V in this thesis. Paper V originated as an idea during the work on low power electronics with Aunet and Beiu. Although paper V does not involve evolution, it does show that it is possible to exploit the analogue nature of a circuit for reliability purposes and presents a redundancy technique that is impossible at the digital gate level. The resulting defect tolerant minority gate is shown in figure 3.4.

Several new aspects must be considered when moving to the transistor level. Paper VI takes a first look at evolving transistor level redundancy for digital circuits. The paper presents and discusses a successful experimental setup and demonstrates the setup with an experiment that evolves a stuck-open tolerant digital inverter. While successful at evolving useful redundancy at the transistor level, the evolved stuck-open tolerant inverter from paper VI did not lead to any new technique. Its redundancy was based on parallel replication of transistors. Later experiments on stuck-closed tolerant inverters also just reinvented the previously known series replication technique. It seemed that evolving an inverter for defects where the series-parallel replication technique is successful would not lead to any interesting results. Two possible avenues were considered to take the research one step further. One possibility was to aim for something similar to the technique published in paper V by changing the target functionality to something more complex than an inverter and hope for a more effective solution than the series-parallel technique. Another possibility was to stick to the inverter but evolve tolerance to defect situations where the series-parallel technique is unsuccessful. The latter possibility was chosen, both because of the possibility of finding solutions to the weaknesses of the series-parallel technique and to avoid the need to evolve more complex target functions which could prove to be too time consuming to evolve.

The chosen defect type for the next experiment was shorts between the transistor gate and either source or drain. This is a defect type not handled by the series-parallel replication technique. Based on the setup in paper VI, a new defect tolerant inverter was evolved that successfully tolerates gate shorts. This evolved redundant circuit is presented in paper VII together with an analysis and generalisation into a technique, the MSO technique shown in figure 3.5, that can be employed when designing traditional circuits.


Figure 3.6: Evolved LUT1 from paper VIII (transistor sizes shown in the figure: M1 pmos w=1000nm l=1000nm; M3 nmos w=583nm l=426nm; M5 pmos w=30nm l=776nm; M6 pmos w=93nm l=520nm; M10 nmos w=30nm l=30nm)

3.1.6 Transistor Level Redundancy for FPGAs

Based on the positive results in paper VII, it was time to link the work to FPGA technology. One of the most important components in an FPGA is the LUT. It was decided to create a new LUT at the transistor level that was to be tolerant to transistor stuck-open, stuck-closed and gate-short defects.

Two possibilities were thought of. The first was to apply the MSO technique from paper VII for creating a defect tolerant LUT. The second possibility was to evolve a LUT and see if evolution could exploit some property unique to the LUT and end up with more efficient redundancy than the MSO technique. A LUT is, however, a rather complex component providing a serious challenge to the EA. It was decided to try both and compare the results. Paper VIII presents the evolutionary experiment, the resulting evolved LUT, the MSO LUT and a comparison of the new LUTs with traditional implementations.

Both the evolved solution and the MSO solution have advantages and disadvantages. The evolved solution, shown in figure 3.6, is very small, yet still exhibits some tolerance to defects. As such, the evolved solution is an interesting example for further research. However, high delay, low output voltage swing and the fact that the LUT relies on dynamic storage make the evolved solution unrealistic in real FPGAs without further improvements. The MSO LUT has the advantage of high output voltage swing, reasonably low delay and tolerates all single transistor defects of the four allowed defect types. An extremely high area requirement is the main disadvantage of the MSO LUT, which motivates further research on evolving defect tolerant LUTs.


3.2 List of Publications

Papers Included in Thesis

I A. Djupdal and P. C. Haddow. Yield enhancing defect tolerance techniques for FPGAs. In Military and Aerospace Programmable Logic Device International Conference (MAPLD), paper ID 203, 2006.

II P. C. Haddow, M. Hartmann and A. Djupdal. Addressing the metric challenge: Evolved versus traditional fault tolerant circuits. In Adaptive Hardware and Systems, pages 431–438, 2007.

III A. Djupdal and P. C. Haddow. Evolving redundant structures for reliable circuits – lessons learned. In Adaptive Hardware and Systems, pages 455–462, 2007.

IV A. Djupdal and P. C. Haddow. Evolving and analysing “useful” redundant logic. In International Conference on Evolvable Systems (ICES), pages 256–267, 2007.

V A. Djupdal and P. C. Haddow. Defect tolerant ganged CMOS minority gate. In IEEE NORCHIP, 2007.

VI A. Djupdal and P. C. Haddow. Evolving efficient redundancy by exploiting the analogue nature of CMOS transistors. In International Conference on Computational Intelligence, Robotics and Autonomous Systems (CIRAS), pages 81–86, 2007.

VII A. Djupdal and P. C. Haddow. Defect tolerance inspired by artificial evolution. Accepted at IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2008.

VIII A. Djupdal and P. C. Haddow. The route to a defect tolerant LUT through artificial evolution. Submitted to IEEE Transactions on Circuits and Systems I, 2008.

Papers on Subthreshold Logic and Low Power

• V. Beiu, A. Djupdal and S. Aunet. Ultra Low-Power Neural Inspired Addition: When Serial Might Outperform Parallel Architectures. In International Work-Conference on Artificial Neural Networks (IWANN), pages 486–493, 2005.

• V. Beiu, S. Aunet, R. R. Rydberg III, A. Djupdal and J. Nyathi. The Vanishing Majority Gate: Trading Power and Speed for Reliability. In IEEE International Workshop on Design and Test of Defect-Tolerant Nanoscale Architectures, 2005.

• V. Beiu, S. Aunet, J. Nyathi, R. R. Rydberg III and A. Djupdal. On the advantages of serial architectures for low-power reliable computations. In IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pages 276–281, 2005.

Other Papers

• L. Natvig, S. Line and A. Djupdal. Age of Computers: An Innovative Combination of History and Computer Game Elements for Teaching Computer Fundamentals. In Frontiers in Education Conference (FIE), pages S2F1–S2F6, 2004.

• A. Djupdal and L. Natvig. Age of Computers II - An Improved System for Game Based Teaching. In Norsk Informatikk Konferanse (NIK), pages 158–167, 2004.

• L. Natvig, G. Sindre and A. Djupdal. A Compulsory yet Motivating Question/Answer Game to Teach Computer Fundamentals. To be published in Journal on Computer Applications in Engineering Education.

3.3 Paper Abstracts

This section presents the abstract for each paper included in this thesis. In addition, retrospective comments are given for each paper except for the most recent ones.

3.3.1 Paper I

Yield Enhancing Defect Tolerance Techniques for FPGAs

Abstract

As technology scales, the problem of production defects is expected to increase. This makes maintaining device yield a challenge. Also, it may be expected that more and more defect circuits will pass the production tests as the device testing challenge grows due to more and more transistors being compacted onto a single chip.

Reconfigurable technology has experienced an increasing popularity in recent years. Similar to ASIC design, reconfigurable technology suffers from production defects. However, unlike ASIC design, reconfigurable technology provides a bridge between production and the application designer. The inclusion of defect tolerance in the FPGA architecture could provide a functionally correct FPGA for the application designer, despite production defects. As such, the application designer is relieved of the extra complexity of designing for imperfect devices.

This paper presents a survey of known approaches to making defect tolerant FPGAs and discusses their advantages and disadvantages, especially in the context of maintaining FPGA yield and device correctness.


Retrospective View

In this paper, the topic was defect tolerance techniques for FPGAs with a focus on production defects. A survey paper is always a compromise and even with a narrow topic, not everything can be included. For this reason it was decided to leave out run-time defect tolerance techniques, for example the work on roving STARs [10]. In addition, the whole issue of fault detection and fault diagnosis was omitted as the survey focused on the defect tolerance techniques themselves.

The title mentions “yield” specifically, yet there is no quantitative comparison of the effect on yield for the different techniques. This was deliberately omitted due to the difficulty of obtaining good yield estimates. Rough yield estimates can be calculated based on several simplifications and assumptions and are found in several of the surveyed papers. Good estimates, however, require basing the estimate on representative manufactured chips which do not exist for most of the surveyed techniques. The discussion and table 1 indirectly deal with yield through the discussion of defect coverage and area overhead.

One useful reference not found in the paper is the Altera variant of the redundant row/column technique for enhancing yield in some of their larger FPGAs [2]. Another reference that could have been included is a similar survey by Doumar and Ito [7], although with more focus on fault detection and fault diagnosis techniques.

Errata

• Table 1 in the paper has a row for “Extra HW required”. Local redundancy techniques are marked “low” in this row. While true for the switch block techniques, this is not true for LUTs with error correcting codes where a significant increase in the LUT area is expected for the support logic on the LUT output [22]. For this reason, local redundancy should be marked “low–high”.

3.3.2 Paper II

Addressing the Metric Challenge: Evolved versus Traditional Fault Tolerant Circuits

Abstract

The field of Evolvable Hardware, applying artificial evolution to the design of digital and analogue hardware, is around ten years old. However, the field is far from reaching main stream electronics, although some few examples exist. One cause may be that the problems that are addressed in the field are, in general but not always, relatively simple designs which may be regarded as “toy problems”, this work being no exception.

Interest in the possibilities inherent in evolved designs is growing, as may be seen from the inclusion of evolvable hardware as a topic in a number of more traditional electronics conferences. However, how good are the designs that are evolved? How can they be compared to their traditional counterparts? Suitable metrics are needed which enable comparison between these two fields of design and that can provide an accurate and fair evaluation of the given design technique. In this work the issue of fault tolerance is addressed together with the design metric reliability.

Retrospective View

The paper discusses and compares the reliability metrics Rtrad and Rehw. Later papers in this thesis have, however, discovered the importance of the fault model for the evolved circuits. A similar comparison for the applied fault model, gate reliability or single fault, could therefore have proved useful for later work.

As this paper is a quantitative comparison based on metrics, the structure of the evolved circuits is not analysed. Experiments conducted in later papers, especially paper III, show that it is unlikely that gate reliability experiments as performed in paper II result in any structural redundancy. The last section of the conclusion assumes the larger circuits with high Rehw exhibit some kind of useful redundancy. In retrospect, the observed effect is probably a kind of graceful degradation as measured by Rehw, rather than real redundancy of the kind searched for in later papers.

The conclusion of the paper gives the impression that it is unrealistic to evolve circuits with an Rtrad based fitness function due to Rtrad being too coarse grained. This is true if Rtrad is the only component in the fitness function. Experiments in later papers in this thesis have, however, successfully evolved circuits with Rtrad in the fitness function through the use of multipart fitness functions.

3.3.3 Paper III

Evolving Redundant Structures for Reliable Circuits — Lessons Learned

Abstract

Fault Tolerance is an increasing challenge for integrated circuits due to semiconductor technology scaling. This paper looks at how artificial evolution may be tuned to the creation of novel redundancy structures which may be applied to meet this challenge. However, as these structures are unknown it is a challenge in itself to tune evolution to create them. As such, no solution has yet been found. This paper provides a discussion about the issues addressed and experiments conducted and thus provides an overview of the lessons learned in this work.

Retrospective View

This paper introduces the concept of a fault model and implicitly defines it, through the examples of the gate reliability and single fault models, to be a model of which fault scenarios are possible and their probabilities of occurring. This definition is used consistently in all papers in this thesis. Although explicitly defined in paper VI, this definition is unfortunate as it conflicts with the more common usage of the term fault model as a model of how a component fails (see [1, 7, 26] for examples).

Errata

• The experiment summarised in table 3 and the TMR Rtrad single experiment had an error in the mechanism selecting random seeds, resulting in only five different random seeds for the ten different evolutionary runs in each experiment. This error does not affect the discussion or any conclusions.

• Reference 3 should have been to paper I in this thesis.

3.3.4 Paper IV

Evolving and Analysing “Useful” Redundant Logic

Abstract

Fault Tolerance is an increasing challenge for integrated circuits due to semiconductor technology scaling. This paper looks at how artificial evolution may be tuned to the creation of novel redundancy structures which may be applied to meet this challenge. An experimental setup and results for creating “useful” redundant structures are presented.

Retrospective View

The evolved redundant XOR2 does in many ways resemble interwoven logic. A discussion of the evolved redundancy in comparison with interwoven logic could have been included in the paper.

3.3.5 Paper V

Defect Tolerant Ganged CMOS Minority Gate

Abstract

Production defects, resulting in faulty transistors, provide a challenge for the semiconductor industry in terms of reduced yield. As defect densities are expected to increase as the semiconductor feature size decreases, some form of transistor level defect tolerance is desirable to reduce this increasing production challenge. This paper proposes a solution, based on the ganged CMOS minority gate, for transistor level defect tolerance for minority gates.

Retrospective View

The ganged CMOS style implementation of gates is not very well known and rarely used in real designs. An advantage of ganged gates is speed, while high power consumption and bad noise margins are some of the disadvantages. These gates are also more affected by parameter variations than traditional CMOS gates [13].

After publishing this paper, it was discovered that the simulation in figure 6 b) is not the worst case scenario. There are situations where a single defective transistor results in more degraded output than what is shown in figure 6 b), although still correct. The worst case scenario is where one of the pMOS transistors is stuck-closed.

The reference for the quadrupling technique is to a recent paper describing quadrupling for future nanotechnologies. However, their technique seems to be similar to the original series-parallel configuration of relays as described by Moore and Shannon [28] which, therefore, seems to be a more appropriate reference.

3.3.6 Paper VI

Evolving Efficient Redundancy by Exploiting the Analogue Nature of CMOS Transistors

Abstract

Fault tolerance is an increasing challenge for integrated circuits due to semiconductor technology scaling. Triple modular redundancy is often used to achieve fault tolerance in digital circuits, but this method is inefficient. By exploiting the analogue nature of CMOS transistors, more efficient redundancy techniques may be applied.

This paper looks at how artificial evolution may be guided towards the creation of redundancy structures at the CMOS transistor level. A preliminary experiment is performed that successfully evolves redundant stuck-open defect tolerant digital inverters.

Retrospective View

The evolved stuck-open tolerant inverters can be seen as a special case of the series-parallel replication technique [28]. The paper clearly states that the evolved solution is a reinvention of a known technique. A direct reference to the series-parallel technique is, however, not provided in the paper.
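The idea behind series-parallel replication can be illustrated with a small switch-level sketch. It assumes an idealised model that is not taken from the papers: each transistor is a perfect switch, a stuck-open switch never conducts, a stuck-closed switch always conducts, and one switch is replaced by two parallel branches of two series switches each. The function names are hypothetical.

```python
from itertools import product

def quad_conducts(gate_on, defects):
    """Return True if a series-parallel quad of four ideal switches conducts.

    The quad consists of two parallel branches, each with two switches in
    series. `defects` maps a switch index (0-3) to "open" or "closed".
    """
    def switch(i):
        state = defects.get(i)
        if state == "open":      # stuck-open: never conducts
            return False
        if state == "closed":    # stuck-closed: always conducts
            return True
        return gate_on           # healthy switch follows the gate input
    return (switch(0) and switch(1)) or (switch(2) and switch(3))

def tolerates_all_single_defects():
    """Check every single stuck-open or stuck-closed defect exhaustively."""
    for index, kind in product(range(4), ("open", "closed")):
        for gate_on in (False, True):
            if quad_conducts(gate_on, {index: kind}) != gate_on:
                return False
    return True
```

Under these assumptions the quad behaves like a single healthy switch for any one defect, while two stuck-open defects in different branches already break it. The model ignores analogue effects such as degraded output levels.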

3.3.7 Paper VII

Defect Tolerance Inspired by Artificial Evolution

Abstract

Defect densities in integrated circuits are expected to increase as the semiconductor feature size decreases. Some form of transistor level defect tolerance is, therefore, desirable to reduce this increasing production challenge. Series and parallel replication of transistors can be applied to a circuit for tolerating stuck-open and stuck-closed transistors. The circuit is, however, still damaged by gate/drain and gate/source shorts.

This paper applies an evolutionary algorithm to evolve a circuit tolerant to any single short between two transistor terminals. The evolved circuit is then analysed and a general defect tolerance technique is formed based on the evolved circuit. Applying the new technique to a circuit results in tolerance to any single stuck-open, stuck-closed, gate/drain shorted or gate/source shorted transistor. A Monte Carlo experiment compares the reliability of the new technique applied to a NAND gate with other redundant NAND gate implementations.

3.3.8 Paper VIII

The Route to a Defect Tolerant LUT through Artificial Evolution

Abstract

The challenge of production defects for integrated circuits is expected to increase as the feature size is scaled towards the limits of what is possible to manufacture. To handle the increasing number of defects, some form of redundancy can be employed for defect tolerance.

The FPGA can be seen as a bridge between production and application designer. Introduction of defect tolerance techniques to the FPGA itself could provide a defect free gate array to the application designer, despite production defects.

This paper describes a search for transistor level defect tolerance for FPGA look-up tables (LUTs) through the application of artificial evolution. Two different strategies result in two defect tolerant LUT implementations. Through simulations, the new LUT implementations are compared to a traditional non-redundant LUT and a TMR version of the traditional LUT.

Chapter 4

Concluding Remarks

4.1 Conclusion

The work in this thesis has addressed the challenge of tolerating production defects in FPGAs. The main approach has been to apply artificial evolution to create circuits with previously unknown static hardware redundancy structures. Through manual analysis of these circuits, the evolved redundancy structures have been explained such that they can be applied by traditional designers when designing the components of an FPGA.

Section 1.2 formulated the following main research question for this thesis:

How can the FPGA architecture be designed such that production defects in the FPGA do not affect the application design running on the FPGA?

This thesis does not give the full and complete answer to this question but represents one contribution towards the goal of a defect tolerant FPGA. Paper VII presents a new static hardware redundancy technique and paper VIII directly approaches the challenge of building a defect tolerant LUT. An FPGA consists of much more than LUTs. A defect tolerant LUT is, however, one step towards the goal.

Section 1.2 also formulated three more specific questions.

1. How can artificial evolution be employed in the search for new static hardware redundancy structures?

Of the three more specific questions, question one has been the most challenging one to find an answer to and is addressed directly or indirectly in all papers except paper I. Papers III and IV are, however, the most important contributions to this question and not only discuss the difficulties of achieving useful redundancy but also demonstrate that it is possible.

2. Which redundancy structures can evolution find at the transistor level and how can the redundancy structures be combined with existing traditional design techniques?

Papers VI, VII and VIII address the challenge of evolving redundancy at the transistor level. In addition to reinventing the known series-parallel redundancy technique, evolution also finds a new solution that inspires the invention of the MSO technique. Paper VIII shows how the MSO technique can be applied in a traditional setting.

3. How can the defect tolerance of an FPGA be enhanced through the redundancy techniques that resulted from the evolutionary experiments?

Paper VIII addresses question three by showing two LUT implementations, one designed traditionally applying the MSO technique from paper VII and the other evolved directly.

4.2 Research Contributions

The following are the main contributions of this thesis:

1. A step towards a defect tolerant FPGA has been taken through the creation of a defect tolerant LUT.

2. The research has contributed to an understanding of how artificial evolution can be applied for the creation of non-specified structures, especially in the context of redundancy for defect tolerant circuits.

3. A new transistor level redundancy technique has been found and formulated, the MSO technique, that can be applied to circuits to enable them to tolerate gate/source short, gate/drain short, stuck-closed and stuck-open defective transistors.

The following are other contributions:

• A LUT exhibiting partial tolerance to transistor defects has been evolved.

• A new transistor level redundancy technique specific to the ganged CMOS minority gate has been found and described. Defect coverage is the same as for the general series-parallel replication technique but with fewer transistors.

• Reliability metrics have been investigated for the purpose of comparing evolved and traditionally designed circuits.

4.3 Limitations

There are some factors limiting the more practical results of this work. For the purpose of enhancing yield, area is important because larger dies result in fewer dies per wafer. The techniques that result from this thesis are very area inefficient. The exception is the evolved defect tolerant LUT which for other reasons is not yet a realistic alternative. The nonredundant device yield must be extremely low for these techniques to be of any help. Increased area also results in longer local interconnect which negatively affects wire delay.
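The area argument can be made concrete with a rough sketch. It assumes that the number of dies per wafer scales inversely with die area (edge losses are ignored) and that the number of good dies is the die count multiplied by the die yield. The function names and the example figures are hypothetical and not taken from the papers.

```python
def good_dies_per_wafer(wafer_area, die_area, die_yield):
    """Approximate number of working dies, ignoring wafer edge losses."""
    return (wafer_area / die_area) * die_yield

def redundancy_pays_off(base_yield, redundant_yield, area_factor):
    """True if the redundant design gives more good dies per wafer.

    `area_factor` is the redundant die area divided by the original die area.
    """
    return redundant_yield / area_factor > base_yield
```

With an area factor of three, as for TMR, even a redundant design with perfect yield only pays off when the nonredundant yield is below one third, for example `redundancy_pays_off(0.5, 1.0, 3.0)` is false while `redundancy_pays_off(0.2, 0.9, 3.0)` is true.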

The MSO technique is tested for four different transistor level defects. There exist other possible VLSI defects that are not investigated in the papers which could affect the reliability results.

4.4 Future Work

This section suggests some directions for extending the research in this thesis.

• The work in paper V could be extended with Monte Carlo simulations to investigate the effects of parameter variations. A comparison with the standard CMOS implementation of a minority gate would also be interesting.

• The work on gate level redundancy in paper IV could be continued. In particular, the results on the evolved XOR2 motivate further research. If successful, non-voter based solutions could result that are more area efficient than the interwoven logic technique known today.

• Research is needed to find more area efficient transistor level redundancy techniques.

• Existing transistor level redundancy techniques, including the ones described in this thesis, are designed for a subset of all possible defects. These techniques should be analysed for a broader range of defects. If found to perform poorly for other defects, new techniques should be researched.

• More research should be performed on evolving redundant structures to see if useful redundancy can be evolved more efficiently than in the experiments in this thesis.

• One factor limiting what kind of redundancy structures are possible to evolve is the size and complexity of the circuits. EHW is still immature in this respect and quickly grows too computationally intensive if targeting too complex functionality. Any improvement with respect to the scalability issues of EHW will directly affect the possibilities for evolving better redundancy.

Bibliography

[1] J. A. Abraham and W. K. Fuchs. Fault and error models for VLSI. Proceedings of the IEEE, 74(5):639–654, May 1986.

[2] Altera. Apex redundancy. http://www.altera.com/products/devices/apex/features/apx-redundancy.html, Accessed January, 2008.

[3] Stratix III Device Handbook, Volume 1. Altera, 2007. http://www.altera.com/literature/hb/stx3/stx3_siii5v1.pdf.

[4] C. N. Berglund. Trends in systematic non-particle yield loss mechanisms and the implication for IC design. In Proceedings of the SPIE, pages 457–465, 2003.

[5] M. Cote and P. Hurat. Layout printability optimization using a silicon simulation methodology. In Symposium on Quality Electronic Design, pages 159–164, 2004.

[6] H. de Garis. Artificial life: Growing an artificial brain with a million neural net modules inside a trillion cell cellular automata machine. In International Symposium on Micro Machines and Human Science, 1993.

[7] A. Doumar and H. Ito. Detecting, diagnosing, and tolerating faults in SRAM-based field programmable gate arrays: a survey. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 11(3):386–405, 2003.

[8] A. Doumar, S. Kaneko, and H. Ito. Defect and fault tolerance FPGAs by shifting the configuration data. In International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT), pages 377–385, 1999.

[9] A. E. Eiben and J. E. Smith. Introduction to Evolutionary Computing. Springer, 2003.

[10] J. M. Emmert, C. E. Stroud, and M. Abramovici. Online fault tolerance for FPGA logic blocks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 15(2), 2007.

[11] J. Ferguson. Shifting methods: Adopting a design for manufacture flow. In Symposium on Quality Electronic Design, pages 171–175, 2004.

[12] A. V. Ferris-Prabhu. Introduction to Semiconductor Device Yield Modeling. Artech House, 1992.

[13] K. Granhaug and S. Aunet. Improving yield and defect tolerance in multifunction subthreshold CMOS gates. In IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT), pages 20–28, 2006.

[14] F. Hanchek and S. Dutt. Methodologies for tolerating cell and interconnect faults in FPGAs. IEEE Transactions on Computers, 47(1):15–33, 1998.

[15] H. Hao and E. J. McCluskey. “resistive shorts” within CMOS gates. In International Test Conference (ITC), pages 292–301, 1991.

[16] M. Hartmann and P. C. Haddow. Evolution of fault-tolerant and noise-robust digital designs. IEE Proceedings - Computers and Digital Techniques, 151(4):287–294, July 2004.

[17] F. Hatori, T. Sakurai, K. Nogami, K. Sawada, M. Takahashi, M. Ichida, M. Uchida, I. Yoshii, Y. Kawahara, T. Hibi, Y. Saeki, H. Muraoga, A. Tanaka, and K. Kanzaki. Introducing redundancy in field programmable gate arrays. In IEEE Custom Integrated Circuits Conference, pages 7.1.1–7.1.4, 1993.

[18] T. Higuchi, T. Niwa, T. Tanaka, H. Iba, H. de Garis, and T. Furuya. Evolving hardware with genetic learning: a first step towards building a darwin machine. In From Animals to Animats: Simulation of Adaptive Behavior, pages 417–424, 1993.

[19] N. J. Howard, A. M. Tyrrell, and N. M. Allinson. The yield enhancement of field-programmable gate arrays. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2(1):115–123, Mar. 1994.

[20] ITRS. International technology roadmap for semiconductors. Technical report, ITRS, 2005.

[21] B. W. Johnson. Design and Analysis of Fault Tolerant Digital Systems. Addison Wesley, 1989.

[22] A. KleinOsowski and D. J. Lilja. The NanoBox project: Exploring fabrics of self-correcting logic blocks for high defect rate molecular device technologies. In IEEE Symposium on VLSI, pages 1–10, 2004.

[23] I. Koren and Z. Koren. Defect tolerance in VLSI circuits: Techniques and yield analysis. Proceedings of the IEEE, 86(9):1819–1837, Sept. 1998.

[24] P. K. Lala. Self-Checking and Fault Tolerant Digital Design. Morgan Kaufmann Publishers, 2001.

[25] R. E. Lyons and W. Vanderkulk. The use of triple-modular redundancy to improve computer reliability. IBM Journal, pages 200–209, Apr. 1962.

[26] E. J. McCluskey and C.-W. Tseng. Stuck-fault tests vs. actual defects. In International Test Conference (ITC), pages 336–344. IEEE Computer Society, 2000.

[27] M. Mishra and S. C. Goldstein. Nano, Quantum and Molecular Computing, Implications to High Level Design and Validation, chapter 3: Defect Tolerance at the End of the Roadmap. Kluwer Academic Publishers, 2004.

[28] E. F. Moore and C. E. Shannon. Reliable circuits using less reliable relays. J. Franklin Inst., pages 191–208, 291–297, 1956.

[29] J. V. Oldfield and R. C. Dorf, editors. Field Programmable Gate Arrays: Reconfigurable Logic for Rapid Prototyping and Implementation of Digital Systems. John Wiley and Sons Inc, 1995.

[30] W. Pierce. Failure-Tolerant Computer Design. Academic Press, 1965.

[31] D. P. Siewiorek and R. S. Swarz. Reliable Computer Systems, Design and Evaluation. Digital Press, 2nd edition, 1992.

[32] K. Sims. Artificial evolution for computer graphics. Computer Graphics, 25(4):319–328, 1991.

[33] J. von Neumann. Probabilistic logics and synthesis of reliable organisms from unreliable components. Automata Studies, 34:43–98, 1956.

[34] R. L. Wadsack. Fault modeling and logic simulation of CMOS and MOS integrated circuits. Bell System Technical Journal, pages 1449–1474, May 1978.

[35] H. Walker. VLASIC: A catastrophic fault simulator for integrated circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 5(4):541–556, Oct. 1986.

[36] N. H. E. Weste and D. Harris. CMOS VLSI Design. Addison Wesley, 2005.

[37] Virtex-5 FPGA User Guide. Xilinx, 2007. http://www.xilinx.com/support/documentation/user_guides/ug190.pdf.

Papers

I A. Djupdal and P. C. Haddow. Yield enhancing defect tolerance techniques for FPGAs. In Military and Aerospace Programmable Logic Device International Conference (MAPLD), paper ID 203, 2006.

II P. C. Haddow, M. Hartmann and A. Djupdal. Addressing the metric challenge: Evolved versus traditional fault tolerant circuits. In Adaptive Hardware and Systems, pages 431–438, 2007.

III A. Djupdal and P. C. Haddow. Evolving redundant structures for reliable circuits – lessons learned. In Adaptive Hardware and Systems, pages 455–462, 2007.

IV A. Djupdal and P. C. Haddow. Evolving and analysing “useful” redundant logic. In International Conference on Evolvable Systems (ICES), pages 256–267, 2007.

V A. Djupdal and P. C. Haddow. Defect tolerant ganged CMOS minority gate. In IEEE NORCHIP, 2007.

VI A. Djupdal and P. C. Haddow. Evolving efficient redundancy by exploiting the analogue nature of CMOS transistors. In International Conference on Computational Intelligence, Robotics and Autonomous Systems (CIRAS), pages 81–86, 2007.

VII A. Djupdal and P. C. Haddow. Defect tolerance inspired by artificial evolution. Accepted at IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2008.

VIII A. Djupdal and P. C. Haddow. The route to a defect tolerant LUT through artificial evolution. Submitted to IEEE Transactions on Circuits and Systems I, 2008.

Paper I

Yield Enhancing Defect Tolerance Techniques for FPGAs
Asbjørn Djupdal and Pauline C. Haddow
In Military and Aerospace Programmable Logic Device International Conference (MAPLD), paper ID 203, 2006

Yield Enhancing Defect Tolerance Techniques for FPGAs

Asbjoern [email protected]

Pauline C. [email protected]

CRAB Lab (http://crab.idi.ntnu.no)
Department of Computer and Information Science
Norwegian University of Science and Technology

Abstract

As technology scales, the problem of production defects is expected to increase. This makes maintaining device yield a challenge. Also, it may be expected that more and more defective circuits will pass the production tests as the device testing challenge grows due to more and more transistors being compacted onto a single chip.

Reconfigurable technology has experienced an increasing popularity in recent years. Similar to ASIC design, reconfigurable technology suffers from production defects. However, unlike ASIC design, reconfigurable technology provides a bridge between production and the application designer. The inclusion of defect tolerance in the FPGA architecture could provide a functionally correct FPGA for the application designer, despite production defects. As such, the application designer is relieved of the extra complexity of designing for imperfect devices.

This paper presents a survey of known approaches to making defect tolerant FPGAs and discusses their advantages and disadvantages, especially in the context of maintaining FPGA yield and device correctness.

1 Introduction

Defect Tolerance for reconfigurable devices may be said to date back to the early 1990s. However, research focus moved more towards run-time fault tolerance in the late 1990s. Lately, renewed interest in defect tolerance has arisen due, in part, to increasing research into nano-computers where defect densities might be high.

Production of integrated circuits using optical lithography has always had problems with production defects, resulting in yield less than 100%. While these problems have been manageable, they are expected to increase. The ITRS roadmap states that yield enhancement will become a major challenge [12] and that “fabrication of chips with 100% working transistors and interconnects becomes prohibitively expensive.”[11] They further state that “relaxing the requirement of 100% correctness for devices and interconnects may dramatically reduce costs of manufacturing, verification, and test.”[11]

There are several different causes of faulty operation in integrated circuits. Non-permanent errors mostly occur from radiation. Permanent errors might be the result of radiation, of wear-out phenomena such as electromigration and cracks from temperature fluctuations, or of production defects. The latter is addressed in this paper with respect to yield. Contamination (dust) during the manufacturing process may result in production defects, typically shorted or broken wires. These defects are called random particle defects. Another kind of defect is systematic defects seen as malformed structures, often resulting from optical effects.

Reconfigurable technology, represented by field programmable gate arrays (FPGAs), has become more and more popular in recent years. Although originally a prototyping device, the FPGA today is, in addition, widely used as a component in the final product. Just like all other lithographically produced chips, FPGAs suffer from production defects. However, reconfigurable technology provides a bridge between chip production and the application designer. The inclusion of defect tolerance in the generic FPGA architecture would provide a functionally correct FPGA for the application designer, despite production defects. As such, the application designer is relieved of the extra complexity of designing for imperfect devices.

This paper presents a survey of known approaches for defect tolerant FPGAs and discusses their advantages and disadvantages, especially in the context of maintaining FPGA yield and device correctness. Only techniques targeting FPGA yield are discussed.

The rest of this paper is organised as follows: Section 2 gives the necessary background information on defect models, yield estimation and defect tolerance techniques in general. Sections 3, 4, 5 and 6 present surveys of configuration approaches, architecture approaches using node redundancy, local redundancy and application specific FPGAs respectively. Finally, section 7 gives a summary and discussion.

2 Background

2.1 Defect Models

The common way to model production defects is to assume that each defect is like a disc with a certain diameter ranging from the minimum feature size up to some assumed maximum defect size. If a disc makes a short or an open on a die, it produces a fault on that die.

In order to estimate yield for a given chip design, several assumptions about the production process must be made. One assumption is the spatial distribution of defects on a wafer, i.e. the degree of defect clustering. Another assumption is the defect size distribution, which is important as not all defects are large enough to make a fault in every chip design.

Based on a given chip design, the fault probability kernel (the probability that a defect of a given type and size results in a fault) may be found. Monte Carlo simulations on the chip layout, together with the defect size distribution, may be used to find the probability that a defect produces a fault. Knowing the fault probability and having an assumption about the spatial distribution of defects, yield may be estimated. One commonly applied yield equation is the negative binomial yield model.
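As an illustration, the negative binomial model is commonly written as Y = (1 + λ/α)^(−α), where λ is the average number of faults per chip and α is a clustering parameter. The sketch below uses this textbook form, which is not spelled out in the paper itself, and the function names are chosen here for illustration only.

```python
import math

def negative_binomial_yield(avg_faults, alpha):
    """Yield under the negative binomial model.

    avg_faults: average number of faults per chip (lambda).
    alpha: clustering parameter; small values mean strong defect clustering.
    """
    return (1.0 + avg_faults / alpha) ** (-alpha)

def poisson_yield(avg_faults):
    """Yield when defects are spread uniformly (no clustering)."""
    return math.exp(-avg_faults)
```

For large α the negative binomial yield approaches the Poisson yield, while stronger clustering (small α) gives a higher yield for the same average number of faults because the faults concentrate on fewer chips.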

Estimating the yield of a given chip design is not very accurate. The problem lies in the strong dependency on both the design and the manufacturing process. Yield data from the manufacturing process is in general not available. The general spatial distribution of defects and the defect size distribution for a manufacturing process are very difficult to find, as it is typically only faults in a produced chip that are detected. Even when both design and manufacturing process are known, it is difficult to get accurate yield numbers without actually manufacturing the device. This is because of all the assumptions behind the calculations, hiding the complex physical circumstances that lead to defects. The best yield estimates are often found by having an older similar chip design with known yield numbers and then scaling the parameters for the yield equation.

While the most important work regarding production defects has to do with improving the manufacturing process itself, including defect tolerance in the specific chip design to be produced might allow the design to tolerate a certain amount of defective logic.

2.2 Defect Tolerance Techniques

Techniques for making defect tolerant designs involve some form of redundancy. Defect tolerance in hardware can be achieved by either static or dynamic techniques.

Static redundancy has the advantage of masking faults without the need to detect them first. However, several equal modules implementing the same functionality are typically required, thus consuming a much larger die area than a non defect tolerant design. An example of such a technique is Triple Modular Redundancy (TMR) with tripled area and increased power consumption. Information redundancy is a static redundancy technique that involves adding redundant information to the original data. Information redundant techniques such as error correcting codes may be used efficiently with respect to die area, but only for parts of a typical hardware design.
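The masking behaviour of TMR can be sketched in a few lines. This is a behavioural illustration only, assuming a fault-free majority voter and a fault that simply inverts the output of one module copy. The function names are hypothetical.

```python
def majority(a, b, c):
    """Majority vote over three Boolean values."""
    return (a and b) or (a and c) or (b and c)

def tmr(module, x, faulty_copy=None):
    """Evaluate three copies of `module` and vote on the result.

    `faulty_copy` optionally names one copy (0, 1 or 2) whose output is
    inverted to model a defect in that copy.
    """
    outputs = [module(x) for _ in range(3)]
    if faulty_copy is not None:
        outputs[faulty_copy] = not outputs[faulty_copy]
    return majority(*outputs)
```

A defect in any single copy is masked without being detected, which is the defining property of static redundancy. A defect in the voter itself is not covered by this sketch.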

To avoid reducing the total number of usable dies from a wafer (effective yield), area efficient defect tolerance techniques are needed. Dynamic redundancy has a mechanism for fault detection and actively recovers from the detected effect of a fault. Dynamic defect tolerance techniques may be more area efficient than static techniques because there is no need to mask any possible faults. Only detected faults need special treatment.

When requiring area efficient defect tolerance, the typical approach is to exploit regularity in the design [15]. The problem with techniques like modular redundancy is that an enormous amount of redundancy must be introduced. A regular design may use the regularity to introduce only a small or moderate number of redundant elements and still give high defect tolerance. For example: The 16 bit Hyeti microprocessor [19] contains a bit sliced datapath with 17 slices, of which one is redundant. This datapath organisation has an area overhead of roughly 17/16 and still makes the processor function correctly with defects in one of the slices. TMR would have the much larger overhead of roughly three times the original area. Two other better known examples are RAM ICs and hard drives where only a small amount of extra storage can be used to mask defects. The technique involves relocating data from the defective areas to the redundant ones.
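The benefit of a single spare slice can be estimated with a simple calculation. The sketch assumes that slices fail independently with the same probability, which ignores defect clustering and any defects in the logic that switches in the spare. The function names and the default of 16 required slices are illustrative.

```python
def yield_without_spare(slice_ok, needed=16):
    """Probability that all `needed` slices are defect free."""
    return slice_ok ** needed

def yield_with_one_spare(slice_ok, needed=16):
    """Probability that at most one of `needed + 1` slices is defective."""
    total = needed + 1
    all_good = slice_ok ** total
    exactly_one_bad = total * slice_ok ** needed * (1.0 - slice_ok)
    return all_good + exactly_one_bad
```

Under these assumptions the spare slice raises the datapath yield noticeably for an area cost of only 17/16, whereas TMR would triple the area.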

Similar to RAM, the FPGA has a regular struc-ture. This originally motivated the search for effi-cient defect tolerance techniques for FPGAs in the1990’s.

3 Configuration Approaches

This group of methods consists of the approaches that tolerate defects by introducing changes to the tool chain or chip configuration.

3.1 Chip-specific Bitfiles

Commercial tools for place-and-route have the option of specifying placement constraints. This means that the designer can specify which parts of the device are not to be used. This can be exploited for defect tolerance. If the device is first tested for defects, place-and-route can generate a bit file for the device that avoids the defective areas (see figure 1). Examples of this are Kumar et al. [16], the Teramac project [1], NanoFabric [22] and a somewhat different method using JBits [25].
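The idea of feeding a defect map to placement as a constraint can be sketched as below. This is a toy illustration, not the algorithm of any of the cited tools: the function `place_avoiding_defects`, the site coordinates and the cell names are all hypothetical, and real place-and-route also optimises wire length and timing.

```python
from typing import Dict, List, Set, Tuple

Site = Tuple[int, int]  # (row, column) of a CLB site


def place_avoiding_defects(cells: List[str], rows: int, cols: int,
                           defect_map: Set[Site]) -> Dict[str, Site]:
    """Assign each netlist cell to a site that is not in the defect map."""
    usable = [(r, c) for r in range(rows) for c in range(cols)
              if (r, c) not in defect_map]
    if len(cells) > len(usable):
        raise ValueError("not enough defect-free sites for this design")
    # Naive row-major assignment; only the constraint is illustrated here.
    return dict(zip(cells, usable))


# A 2 x 2 device with one defective site and a three-cell design.
placement = place_avoiding_defects(["a", "b", "c"], 2, 2, {(0, 1)})
```

Because the defect map differs from chip to chip, the resulting placement, and hence the bit file, is specific to one device.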

Kumar et al. [16] described one of the first defect tolerance methods for FPGAs, with retesting of each chip to discover defects. Prior to configuration, changes were made to the layout so that defects were avoided without requiring a full place-and-route.

Figure 1: Chip specific bitfiles. The design is synthesised to a netlist, a tester produces a defect map of the FPGA, and defect aware place-and-route combines the two into a bitfile.

The Teramac project [1] was a large custom computer built from many partly defective FPGAs. One of their major contributions was to develop methods to precisely locate defects in generic FPGAs. In addition, they showed with real hardware that many partly defective FPGAs could be used successfully. The NanoFabric is a more recent project that resembles the Teramac approach but addresses reconfigurable nano-computers [22].

In Sundararajan and Guccione [25], JBits is used to generate run-time parametrisable cores for FPGAs. These cores are not fixed data objects, but code sequences describing how to construct circuits. Important design decisions like bus width can be decided at run-time, enabling circuits to be modified while running. By including defect testing and avoidance in the core generating process, defect tolerance is achieved.

Chip-specific bit files provide a high degree of flexibility with respect to creating a new layout that avoids defects. As testing and reconfiguration are conducted offline (except for the JBits approach), few on-line resources are required.

However, chip specific bit files are not well suited to mass production. Each device produced must have its own bit file, making it resource demanding to create the bit files for high volume products. In addition, distributing firmware upgrades to end-users is difficult as each end-user needs a tailor made bit file. The JBits approach does not suffer from the same problems because the method is designed for run-time parametrisable systems.

Figure 2: Precompiled configuration [17], showing one tile in the FPGA.

3.2 Precompiled Configuration

A set of different configurations may be compiled, with the aim that at least one of these solutions will function correctly in the presence of a defect. In the example shown in figure 2, taken from Lach et al. [17], the FPGA is divided into tiles of 2 × 2 CLBs each. For each tile, one CLB is chosen as a spare and four configurations are made where the spare CLB has different positions. When a specific chip is to be configured with a defect in a tile, the tile configuration that does not use the defective CLB is chosen.
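Selecting among precompiled tile configurations amounts to a table lookup. The sketch below assumes that configuration k of a 2 × 2 tile leaves the CLB at `TILE_POSITIONS[k]` unused; this numbering and the function name are illustrative assumptions, not taken from Lach et al. [17].

```python
from typing import Optional, Set, Tuple

Site = Tuple[int, int]

# Configuration k keeps the CLB at TILE_POSITIONS[k] as the unused spare.
TILE_POSITIONS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def select_tile_config(defects: Set[Site]) -> Optional[int]:
    """Pick the precompiled configuration that avoids the defective CLB."""
    if len(defects) > 1:
        return None  # one spare per tile cannot cover two defects
    if not defects:
        return 0  # any configuration works, take the first
    return TILE_POSITIONS.index(next(iter(defects)))
```

No place-and-route is run at configuration time, but all four variants of every tile must be stored, and a tile with two defective CLBs cannot be repaired.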

The advantage of this method compared to chip specific bit files is that there is no need to run place-and-route for each chip, although defects in the individual chips can still be avoided. One disadvantage is larger bit files, i.e. an increased need for external storage. This disadvantage could be reduced with bit file compression, as described by Huang and McCluskey [10]. Another disadvantage is reduced flexibility in how many defects can be covered. As an example, the method used in figure 2 can at most tolerate one defect in each tile.

Figure 3: Shifting entire design [3]. Two examples of how spare nodes may be distributed on the FPGA.

3.3 Adaptive Configuration

3.3.1 Shifting Entire Design

Doumar et al. [3] have an interesting approach where defect tolerance is achieved by embedding spare nodes into the design as well as shifting the entire design vertically and/or horizontally. The chip is tested at power-on and if a defect is found, the entire design is shifted such that a spare node covers the defect.

There are several possible ways of embedding spare nodes into the design. Two possible examples are shown in figure 3. The left one shows how to embed spare nodes so that any single defect can be covered by shifting at most one step horizontally and/or one step vertically. The right example shows how to embed spare nodes so that at most one shift step, either horizontally or vertically, is needed to cover a defect.
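The relation between spare pattern and required shift can be made concrete with a small sketch. The two spare patterns below are assumptions chosen to satisfy the two properties just described; they are not necessarily the patterns drawn in figure 3. Array edges are ignored and shifts in both directions along each axis are assumed to be allowed.

```python
from itertools import product
from typing import Callable, List, Optional, Tuple

Shift = Tuple[int, int]

# Shifts of at most one step horizontally and/or vertically.
BOTH_AXES: List[Shift] = list(product((-1, 0, 1), repeat=2))
# Shifts of at most one step along a single axis.
ONE_AXIS: List[Shift] = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]


def spare_sparse(r: int, c: int) -> bool:
    """Assumed pattern: one spare per 3 x 3 block of nodes."""
    return r % 3 == 1 and c % 3 == 1


def spare_dense(r: int, c: int) -> bool:
    """Assumed pattern: one spare per 5 nodes, on a diagonal lattice."""
    return (r + 2 * c) % 5 == 0


def find_shift(defect: Tuple[int, int], is_spare: Callable[[int, int], bool],
               shifts: List[Shift]) -> Optional[Shift]:
    """Return a shift of the whole design that puts a spare on the defect."""
    for dr, dc in shifts:
        # A spare originally at (r - dr, c - dc) lands on (r, c) after the shift.
        if is_spare(defect[0] - dr, defect[1] - dc):
            return (dr, dc)
    return None
```

The denser pattern needs fewer shift steps but spends one node in five on spares, compared with one in nine for the sparser pattern, which illustrates the trade-off between relocation simplicity and the number of spares.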

The advantage of this approach is the simplicity of the relocation algorithm. No rerouting is required (except very simple rerouting at chip I/O pins). This method can, however, only cover one defect (unless the defects happen to be favourably located), at the expense of a relatively large number of spares.

3.3.2 Dynamic Place-And-Route

A radical approach to defect tolerance is to have an adaptive way of “growing” circuits onto a defective medium, e.g. the Cell Matrix reconfigurable device [5]. Cell Matrix is self-reconfigurable: each logic block has the possibility to reconfigure its neighbours. Macias and Durbeck [20] describe an adaptive configuration process for Cell Matrix that takes care of chip testing as well as placement and routing that avoid defective regions.

Figure 4: Cell Matrix [20]. Black areas are defective. Left: Initial Cell Matrix. Middle: After stage 1, configuration of supercells. Right: After stage 2, supercells have found an implementation of the target circuit.

Figure 4 illustrates the configuration process starting with a defective chip (left). The first stage in the configuration process is to configure the Cell Matrix as a matrix of superblocks, each a defect free block of n × n logic blocks. Once a superblock is configured, it starts testing neighbouring n × n blocks. If found to be defect free, the block becomes a superblock (holding a netlist of the desired circuit), otherwise it is marked as defective.
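Stage 1 behaves like a breadth-first wavefront over the grid of n × n blocks. The following sketch is a sequential software model of that idea under stated assumptions: the block-level defect set is given in advance, testing a block is a simple set lookup, and the names `grow_superblocks` and `seed` are hypothetical. The real process runs in parallel on the Cell Matrix hardware itself.

```python
from collections import deque
from typing import Set, Tuple

Block = Tuple[int, int]


def grow_superblocks(defective: Set[Block], rows: int, cols: int,
                     seed: Block = (0, 0)) -> Tuple[Set[Block], Set[Block]]:
    """Spread superblocks outwards from a seed, marking defective blocks."""
    if seed in defective:
        raise ValueError("the seed block must be defect free")
    superblocks = {seed}
    marked_defective: Set[Block] = set()
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if (nr, nc) in superblocks or (nr, nc) in marked_defective:
                continue
            if (nr, nc) in defective:  # stands in for the on-chip test
                marked_defective.add((nr, nc))
            else:
                superblocks.add((nr, nc))
                frontier.append((nr, nc))
    return superblocks, marked_defective
```

One consequence visible in this model is that a defect free block which is completely surrounded by defective blocks is never reached by the wavefront and therefore remains unused.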

In the second stage, a decentralised and distributed place-and-route algorithm runs, which results in each superblock being assigned a part of the target circuit and communication paths being set up between relevant superblocks (right part of figure 4).

The distributed wavefront method of doing testing and configuration of superblocks (stage 1) has the advantage of having some degree of parallelism. This could be important in a nano-computer context where the size of the reconfigurable array prevents a more conventional and slower sequential form of configuration. The distributed algorithm for doing place-and-route (stage 2) is currently sequential, removing much of the speed advantage of this method, but future work may result in a distributed and parallel place-and-route algorithm.

A disadvantage of this method is the area requirement for a full copy of the netlist in each supercell and the inefficiency inherent in decentralised place-and-route.

4 Architecture Approaches through Node Redundancy

Nodes in the FPGA architecture may be reserved as spare nodes, together with the necessary resources for making the spare nodes take over for defective ones.

4.1 Redundant Row and/or Column

The FPGA may be made defect tolerant with the use of redundant rows and/or columns. One of the rows and/or columns is reserved for spare nodes and if a defective node is found in one of the normal rows or columns, that row or column is bypassed and the redundant row or column is put into use. Variants of the redundant row/column method have been investigated by Hatori et al. [8], Howard et al. [9], Durand and Piguet [4] and Shibayama et al. [24].

Hatori et al. [8] were the first to present this method. Figure 5(a) shows an FPGA with a spare row. Horizontal wiring is unmodified, whilst vertical wiring segments span one extra row. In the case of a defect, shown in figure 5(b), the defective row is disconnected, vertical wiring is set up to bypass the disconnected row and all lower rows are shifted one row down.
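The row shifting can be viewed as a mapping from logical rows, as seen by the design, to physical rows on the die. The sketch below assumes a single spare row placed at the bottom of the array; the function name `map_rows` is hypothetical.

```python
from typing import List, Optional


def map_rows(logical_rows: int, defective_row: Optional[int]) -> List[int]:
    """Physical row used by each logical row; the die has one extra row."""
    if defective_row is None:
        return list(range(logical_rows))
    # Rows above the defect stay in place, the rest move one row down.
    return [r if r < defective_row else r + 1 for r in range(logical_rows)]
```

For a four-row design with physical row 1 defective, `map_rows(4, 1)` gives `[0, 2, 3, 4]`: the defective row is skipped and the spare row at index 4 is put into use. A second defective row cannot be handled with a single spare.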

Howard et al. [9] introduce a variant of this which they call a block structured defect tolerant FPGA. The idea is to divide the array of nodes into several large blocks of nodes and remove all global signals. This is similar to having several independent and interconnected FPGAs. Defect tolerance is achieved with redundant rows and columns of these large blocks. The rationale behind the block structuring is that many defects are unlikely to be confined to the node they occur in (fault containment); instead they are likely to affect the whole array. The block structured FPGA will confine any defect within the block. In addition, timing within a block will be unaffected. They use Monte Carlo simulations to back up their claims about lack of fault containment in typical FPGAs.

Figure 5: Row redundancy [8]. (a) Without defective row. (b) With defective row.

Durand and Piguet [4] use a binary decision tree based FPGA with redundant columns. In a binary decision tree, only neighbouring test cells are connected together, which is reflected in their FPGA architecture where only neighbouring cells may communicate. This simplifies the logic needed to bypass defective columns. Each cell has two identical configuration registers with parity bits, holding the configuration data for that cell so as to tolerate defects in the configuration registers. When defects are detected at run-time, the configuration data of all columns to the right of the defect is shifted to the right, bypassing the defective column. To avoid shifting in the full column height, the reconfigurable array is divided horizontally into several subarrays with routers in between. Shifting to avoid a defect in a subarray can then be confined to within the subarray.

Shibayama et al. [24] have a physical implementation of a defect tolerant FPGA with both a spare row and a spare column. Run-time self checking triggers shifting and bypassing of rows and columns.

An advantage of the redundant row or column approach is the simplicity of defect avoidance by turning off and bypassing rows. This could be implemented using laser-blown fuses at the factory (as suggested by Hatori et al. [8]) or with run-time self reconfiguration logic, making the method completely transparent to the customer. The high overhead is a disadvantage: entire rows or columns are discarded for single defects. Extra switches and longer wires used for row or column bypassing lead to longer routing delays.

Figure 6: Node covering [7]

4.2 Redundant Single Nodes

Instead of invalidating entire rows or columns, schemes exist where a single redundant node can take over the functionality of a defective one. This can be implemented by introducing a number of spare nodes in each row and if there is one defective node in a row, nodes in that row are relocated so that the defective node is unused. With one spare node in each row, each row can tolerate one defective node.

Hanchek and Dutt [7] describe this method, which they call node covering. Relocating nodes in a row when a node is defective is achieved as in figure 6, where nodes to the right of the defective node are shifted one step further to the right. Spare wiring segments (cover segments) exist that allow signals to bypass defective nodes and to support correct routing between rows after a row has been restructured due to a defective node. The same technique is used for wiring segments between switch blocks or for entire “grids” (sets of interconnectable tracks, one track from each horizontal and vertical channel). The node covering method was extended to a dynamic method with less overhead, where no extra cover segments are used. Instead, interconnect resources are incrementally rerouted and, if necessary, the layout is incrementally modified [6, 21].

A variant is described by Kelly and Ivey [13]. Off-chip testing generates a defect map for inclusion in the configuration bitstream. When configuration starts, on-chip modifications are performed on the array according to the defect map so that defective nodes are avoided in the same manner as in Hanchek and Dutt [7].

Compared to the redundant row or column approach, the general method of using redundant single nodes is less wasteful in that a single defective node does not discard the entire row or column. This method may thus be able to tolerate more defects in total than the redundant row or column approach. Some form of post-production configuration could be performed by the factory to make defect hiding transparent to the user. Just as for the redundant row or column method, there are extra delays due to switches and longer wires. A disadvantage is the extra complexity in the routing between rows, which gives higher total overhead than the redundant row or column method.
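The difference in tolerated defect patterns between the two node redundancy schemes can be sketched as a repairability check. The model below is a simplification introduced here for illustration: defects are given as (row, column) positions of defective nodes, spares themselves are assumed defect free, and interconnect defects are ignored.

```python
from collections import Counter
from typing import Set, Tuple

Node = Tuple[int, int]  # (row, column)


def repairable_with_spare_rows(defects: Set[Node], spare_rows: int = 1) -> bool:
    """Each defective row must be replaced by a whole spare row."""
    return len({row for row, _ in defects}) <= spare_rows


def repairable_with_node_spares(defects: Set[Node],
                                spares_per_row: int = 1) -> bool:
    """Each row can cover as many defects as it has spare nodes."""
    per_row = Counter(row for row, _ in defects)
    return all(count <= spares_per_row for count in per_row.values())
```

Two defects in different rows, e.g. `{(0, 2), (3, 1)}`, defeat a single spare row but are covered by one spare node per row. The opposite case also exists: two defects in the same row are covered by one spare row but not by one spare node per row.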

5 Local Redundancy

5.1 Modifications to CLBs

Local modifications to a CLB can make the CLB more robust against production defects. The behaviour of a CLB is implemented using look-up tables (LUTs) and these can be made defect tolerant using error correcting codes. This has been suggested for use in the Cell Matrix [23] and in the NanoBox project [14], a reconfigurable array for nanotechnology.

Another possibility is to allow inputs to a LUT to change place. If a LUT is only using three of its four inputs and there is one defective bit in the LUT SRAM memory, swapping inputs can be used to avoid activating the defective bit [18].
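A minimal sketch of the input swapping idea follows. It assumes that the unused LUT input is tied to a constant value, so that only half of the 16 SRAM bits can ever be addressed; the details of the method in [18] may differ, and both function names are hypothetical.

```python
from typing import List, Optional


def pick_unused_pin(defect_addr: int, n_pins: int = 4,
                    tie_value: int = 0) -> Optional[int]:
    """Choose which physical pin to leave unused so the bad bit is never read."""
    for pin in range(n_pins):
        # With this pin tied to tie_value, only addresses whose bit `pin`
        # equals tie_value are reachable. The defect must not be among them.
        if (defect_addr >> pin) & 1 != tie_value:
            return pin
    return None


def reachable_addresses(unused_pin: int, n_pins: int = 4,
                        tie_value: int = 0) -> List[int]:
    """SRAM addresses that can be read when one pin is tied to a constant."""
    return [a for a in range(1 << n_pins)
            if (a >> unused_pin) & 1 == tie_value]
```

With a defective bit at address 0b0100 and unused inputs tied to 0, leaving pin 2 unused keeps the defective bit out of reach. A defect at address 0 cannot be avoided under this tie value, which illustrates that the method only works for some defect locations.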

An advantage of error correcting codes in the LUTs is the ability to mask defects in the LUT using a static redundancy method. A separate testing phase is not necessary for this to work. Another advantage is the independence of LUTs: a defect in one LUT will not preclude the masking of defects in another. This is unlike the methods using a limited number of spare nodes, where there is a limited number of defects in total that can be tolerated. A disadvantage is that it requires extra logic in every LUT all over the FPGA.
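How an error correcting code masks a defective memory bit can be shown with a Hamming(7,4) code. This particular code and the grouping of LUT contents into 4-bit words are assumptions made for illustration; the cited projects may use other codes.

```python
from typing import List

DATA_POSITIONS = (3, 5, 6, 7)  # 1-based positions of data bits
PARITY_POSITIONS = (1, 2, 4)   # 1-based positions of parity bits


def encode(data4: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    code = [0] * 8  # index 0 is unused so that indices match positions
    for pos, bit in zip(DATA_POSITIONS, data4):
        code[pos] = bit
    for p in PARITY_POSITIONS:
        for pos in range(1, 8):
            if pos != p and pos & p:
                code[p] ^= code[pos]
    return code[1:]


def decode(code7: List[int]) -> List[int]:
    """Correct up to one wrong bit and return the 4 data bits."""
    syndrome = 0
    for pos, bit in enumerate(code7, start=1):
        if bit:
            syndrome ^= pos
    fixed = list(code7)
    if syndrome:
        fixed[syndrome - 1] ^= 1  # the syndrome is the faulty position
    return [fixed[pos - 1] for pos in DATA_POSITIONS]


def read_with_stuck_bit(code7: List[int], index: int, value: int) -> List[int]:
    """Model a memory cell (0-based index) stuck at a fixed value."""
    stored = list(code7)
    stored[index] = value
    return stored
```

A stuck memory cell either happens to hold the intended value or causes a single-bit error, and both cases decode to the original data. No test phase is involved, at the cost of three extra memory bits and decoding logic per four data bits in this sketch.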

The advantage of swapping LUT inputs is that resources unused in a specific design on the FPGA can be used for defect tolerance in a simple way. A disadvantage is the limited use of the method: it can only be used for defects that occur in LUTs that happen to have unused inputs. This reduces the usefulness compared to error correcting codes.

5.2 Modifications to Switch Blocks and Local Routing

Local modifications to the routing system can be used to tolerate defects in switch blocks or wiring segments. Doumar and Ito [2] add an extra wire to the switch block, making it possible to connect any two wires and thus to bypass a faulty switch. Yu and Lemieux [28] have multiplexers on the inputs and outputs of a switch block, together with spare lines between switch blocks. With corresponding changes in connection blocks, this can then be used to bypass defective wires between switch blocks, which is quite similar to the switch blocks in Hanchek and Dutt [7].

Xu et al. [27] introduce extra wires and switches and a routing procedure to replace faulty CLBs using the spare routing capacity.

An advantage of these techniques is that interconnect defects can be tolerated without doing complex rerouting of the design on the FPGA. As a large part of the chip area of a modern FPGA is occupied by interconnect, many defects are likely to occur in the interconnect.



6 Application Specific FPGAs

One approach to using partly defective FPGAs is to make the defects fit the application. This is called Application Specific FPGA (ASFPGA) and is provided by Xilinx in their EasyPath program [26]. The idea is that the designer creates a bitfile as usual and then Xilinx selects defective FPGAs that function with the given bitfile.
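Conceptually, a partly defective chip is acceptable for a given bitfile if none of its defects touch a resource that the bitfile uses. The sketch below expresses this as a set operation; how Xilinx actually screens devices is not described in the source, and the resource names and function name are hypothetical.

```python
from typing import Dict, List, Set


def select_chips(used_resources: Set[str],
                 chips: Dict[str, Set[str]]) -> List[str]:
    """Return ids of chips whose defects do not touch any used resource."""
    return sorted(chip_id for chip_id, defects in chips.items()
                  if used_resources.isdisjoint(defects))


# Hypothetical resource names for a tiny design and three tested chips.
used = {"clb_0_0", "clb_0_1", "wire_3"}
tested = {
    "chip_a": {"clb_5_5"},            # defect in an unused CLB
    "chip_b": {"wire_3"},             # defect in a used wire
    "chip_c": set(),                  # defect free
}
accepted = select_chips(used, tested)
```

The same chip may be rejected for another bitfile that uses different resources, which is why the reconfigurability of the device is reduced.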

The advantage here is that there is no need for any change in the FPGA architecture, bitfile, tool chain or methods used when developing the design. The disadvantage is that the reconfigurability aspect is lost or reduced because the FPGA is not guaranteed to work with any bitfile.

7 Discussion

Table 1 gives an overview of all surveyed techniques for achieving defect tolerance in FPGAs with respect to the different criteria discussed below.

Defect coverage. Defect coverage is the ability to tolerate defects. The ultimate goal of defect tolerance is to be able to tolerate all defects that may occur. This is not possible; only a limited number of defects and a limited set of defect types can be tolerated by any defect tolerance technique. As can be seen in table 1, defect coverage is highest for the techniques involving a full place-and-route after defects have been detected: chip specific bitfiles and the dynamic place-and-route method of the Cell Matrix. In addition, from a customer point of view, the application specific FPGAs have perfect defect coverage in that no FPGAs will have defects affecting the target application.

The redundant row/column technique has low defect coverage due to the inability to tolerate defects in more rows or columns than there are spare ones, typically only one. Similarly, the technique of shifting the entire design can in most situations tolerate only one defect.

Area overhead. This criterion represents the amount of extra chip area needed for redundancy. It is not applicable for chip specific bitfiles and application specific FPGAs because these techniques do not require allocation of spare resources.

The area overhead is high for the precompiled configuration technique because there is one spare node for every tile. The dynamic place-and-route of the Cell Matrix also has a very high area overhead due to the requirement of storing the entire netlist in every supercell.

The method of shifting the entire design is also quite expensive in terms of chip area due to the large number of spares needed.

Node redundancy techniques have low area overhead because the number of redundant nodes is small. The area overhead for the local redundancy techniques is also small: neither error correcting codes nor extra lines between switch blocks take a lot of space.

Timing overhead. Timing overhead is low for most of the configuration techniques because the signal paths will only be marginally longer. An exception is the dynamic place-and-route method of the Cell Matrix, which will have a large timing overhead simply because the supercells are so physically large, making signals travel far.

Timing overhead is medium for the node redundancy techniques due to the extra wire lengths and switches needed to bypass faulty nodes.

There might be some timing overhead for the local redundancy techniques, but it is likely to be small. Extra multiplexers on the switch blocks might, however, contribute significantly to the interconnect delay.

Bitfile size. Most of the surveyed techniques have no effect on the bitfile size compared to a typical non defect tolerant FPGA. The exceptions are the precompiled configuration technique, which has several configurations for each tile, and the dynamic place-and-route technique, where the bitfile might be smaller as only the supercell configuration and a netlist of the target circuit are needed.

Extra hardware required. Several of the surveyed techniques need extra chip area in the form of on-chip support of the technique itself. The techniques of chip specific bitfile and application specific FPGAs do not have any on-chip hardware support whereas all others have some. However, a large portion of each supercell in the Cell Matrix is dedicated to performing the configuration algorithm. The single redundant nodes have a moderate amount of support hardware, more than the redundant row/column approach, because of the more complex routing support between rows.

Table 1: Overview of surveyed techniques

(a) Configuration techniques

                          Chip specific               Adaptive configuration
                          bitfile        Precompiled  Shifting      Dynamic PAR
Defect coverage           High           Medium       Low           High
Area overhead             -              High         Medium/high   High
Timing overhead           Low            Low          Low           High
Bitfile size              Medium         High         Medium        Low
Extra HW required         -              Low          Low           High
Maturity                  High           Medium       Medium        Low
Mass production friendly  Low            High         High          High

(b) Other techniques

                          Node redundancy                  Local
                          Redundant row/col  Single nodes  redundancy  ASFPGA
Defect coverage           Low                Medium        Medium      High
Area overhead             Low                Low           Low         -
Timing overhead           Medium             Medium        Low         -
Bitfile size              Medium             Medium        Medium      Medium
Extra HW required         Low                Medium        Low         -
Maturity                  High               Medium        High        High
Mass production friendly  High               High          High        High

Maturity. This criterion refers to how close the technique is to being put into use, i.e. in a commercial setting. The dynamic place-and-route method is perhaps the least mature: it is furthest from being finished and perhaps needs another technology base (huge reconfigurable nanoarrays) to be useful. The chip specific bitfile and the application specific FPGA techniques are the most mature and have been or are in use today. In addition, error correcting codes in LUTs and the node redundancy techniques have been researched by several research groups for several years and are relatively mature.

Mass production friendly. This criterion reflects how well the technique suits FPGAs that are to be used in mass produced end-user products. The only technique that really does not suit mass production is the chip specific bitfile technique, due to the problems associated with a tailor made bitfile for every end-user product.

All presented techniques rely on exploitation of the generality and regular structure of the FPGA, either at the top level or lower levels. At the top level, the generality of the FPGA components is exploited. Every CLB is equal and can, therefore, take over the functionality of a defective one. The regular structure of the CLBs makes it possible to use simple techniques such as redundant row for defect avoidance. At CLB level, the regularity of the look-up tables is exploited by introducing error correcting codes. Similarly, regularity in the interconnect makes it possible to introduce redundant wires.

One of the local methods (error correcting codes in LUT memory) does not require any post-production modifications. The other methods do, in order to actively avoid a detected defect.

The effect of production defects on FPGAs affects the kind of defect tolerance techniques that can be used. Howard et al. [9] show that fault containment should be studied for FPGAs designed for defect tolerance. In a typical FPGA, a significant number of defects will have so large an effect that they cannot be tolerated. This indicates that approaches without any architectural changes to the FPGA might not be the best way to achieve good defect tolerance. Still, some real world examples (Teramac and Xilinx EasyPath) demonstrate the usefulness of these methods.

The defect distribution is also important with respect to the suitability of different defect tolerant methods. If defects distribute in a highly clustered fashion, the defect tolerance method used should be able to tolerate more than one defect: if there is one defect on a die, there are probably other defects on the same die. Similarly, it matters which parts of the FPGA consume large areas on the chip. There is little point in concentrating efforts on making defect tolerant CLBs if the interconnect occupies most of the die area.

Bibliography

[1] W. B. Culbertson, R. Amerson, R. J. Carter, P. Kuekes, and G. Snider. Defect tolerance on the Teramac custom computer. In Proc. IEEE Symposium on FPGA-Based Custom Computing Machines (FCCM), page 116, 1997.

[2] A. Doumar and H. Ito. Design of switching blocks tolerating defects/faults in FPGA interconnection resources. In IEEE Symposium on Defect and Fault-Tolerance, pages 134–142, 2000.

[3] Abderrahim Doumar, Satoshi Kaneko, and Hideo Ito. Defect and fault tolerance FPGAs by shifting the configuration data. In Proc. International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT), pages 377–385, 1999.

[4] Serge Durand and Christian Piguet. FPGA with self-repair capabilities. In Proc. International ACM/SIGDA Workshop on Field Programmable Gate Arrays, 1994.

[5] Lisa J. K. Durbeck and Nicholas J. Macias. The cell matrix: an architecture for nanocomputing. Nanotechnology, Institute of Physics, 2001.

[6] Shantanu Dutt, Vimalvel Shanmugavel, and Steve Trimberger. Efficient incremental rerouting for fault reconfiguration in field programmable gate arrays. In ICCAD, pages 173–176, 1999.

[7] Fran Hanchek and Shantanu Dutt. Methodologies for tolerating cell and interconnect faults in FPGAs. IEEE Transactions on Computers, 47(1):15–33, 1998.

[8] F. Hatori, T. Sakurai, K. Nogami, K. Sawada, M. Takahashi, M. Ichida, M. Uchida, I. Yoshii, Y. Kawahara, T. Hibi, Y. Saeki, H. Muraoga, A. Tanaka, and K. Kanzaki. Introducing redundancy in field programmable gate arrays. In Proc. IEEE Custom Integrated Circuits Conference, pages 7.1.1–7.1.4, 1993.

[9] Neil J. Howard, Andrew M. Tyrrell, and Nigel M. Allinson. The yield enhancement of field-programmable gate arrays. IEEE Transactions on Very Large Scale Integration (VLSI), 2(1):115–123, Mar. 1994.

[10] Wei-Je Huang and Edward J. McCluskey. Column-based precompiled configuration techniques for FPGA fault tolerance. In Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 137–146, 2001.

[11] ITRS. Design. Technical report, ITRS, 2005.

[12] ITRS. Lithography. Technical report, ITRS, 2005.

[13] Jason L. Kelly and Peter A. Ivey. Defect tolerant SRAM based FPGAs. In Proc. IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD), pages 479–482, 1994.

[14] AJ KleinOsowski and David J. Lilja. The NanoBox project: Exploring fabrics of self-correcting logic blocks for high defect rate molecular device technologies. In IEEE Symposium on VLSI, pages 1–10, 2004.

[15] Israel Koren and Zahava Koren. Defect tolerance in VLSI circuits: Techniques and yield analysis. Proceedings of the IEEE, 86(9):1819–1837, Sep. 1998.

[16] Vijay Kumar, Anton Dahbura, Fred Fischer, and Patrick Juola. An approach for the yield enhancement of programmable gate arrays. In International Conference on Computer-Aided Design, pages 226–229, 1989.



[17] John Lach, William H. Mangione-Smith, and Miodrag Potkonjak. Low overhead fault-tolerant FPGA systems. IEEE Trans. Very Large Scale Integr. Syst., 6(2):212–221, 1998. ISSN 1063-8210. doi: http://dx.doi.org/10.1109/92.678870.

[18] Vijay Lakamraju and Russell Tessier. Tolerating operational faults in cluster-based FPGAs. In Proc. International Symposium on Field Programmable Gate Arrays, pages 187–194, 2000.

[19] R. Leveugle, Z. Koren, I. Koren, G. Saucier, and N. Wehn. The Hyeti defect tolerant microprocessor: A practical experiment and its cost-effectiveness analysis. IEEE Transactions on Computers, 43(12):1398–1406, 1994.

[20] N. J. Macias and L. J. K. Durbeck. Adaptive methods for growing electronic circuits on an imperfect synthetic matrix. Biosystems, 73(3):173–204, 2004.

[21] Nihar R. Mahapatra and Shantanu Dutt. Efficient network-flow based techniques for dynamic fault reconfiguration in FPGAs. In International Symposium on Fault-Tolerant Computing, pages 122–129, 1999.

[22] Mahim Mishra and Seth C. Goldstein. Nano, Quantum and Molecular Computing, Implications to High Level Design and Validation, chapter 3: Defect Tolerance at the End of the Roadmap. Kluwer Academic Publishers, 2004.

[23] C. R. Saha, S. J. Bellis, A. Mathewson, and E. M. Popovici. Performance enhancement defect tolerance in the cell matrix architecture. In Proc. International Conference on Microelectronics, pages 777–780, 2004.

[24] Atsufumi Shibayama, Hiroyuki Igura, Masayuki Mizuno, and Masakazu Yamashina. An autonomous reconfigurable cell array for fault-tolerant LSIs. In Proc. IEEE International Solid-State Circuits Conference, pages 230–232, 1997.

[25] Prasanna Sundararajan and Steven A. Guccione. Run-time defect tolerance using JBits. In Proc. FPGA, pages 193–198, 2001.

[26] Xilinx. EasyPath FPGAs. http://www.xilinx.com/products/easypath.

[27] Jian Xu, Weikang Huang, and Fabrizio Lombardi. A novel fault tolerant approach for SRAM-based FPGAs. In Proc. Pacific Rim International Symposium on Dependable Computing, pages 40–44, 1999.

[28] Anthony J. Yu and Guy G. F. Lemieux. Defect-tolerant FPGA switch block and connection block with fine-grain redundancy for yield enhancement. In Proc. Field Programmable Logic and Applications, pages 255–262, 2005.



Paper II

Addressing the Metric Challenge: Evolved versus Traditional Fault Tolerant Circuits
Pauline C. Haddow, Morten Hartmann and Asbjørn Djupdal
In Adaptive Hardware and Systems, pages 431–438, 2007

Addressing the Metric Challenge: Evolved versus Traditional Fault Tolerant Circuits

Pauline C Haddow, Morten Hartmann and Asbjoern Djupdal
Complex, Reconfigurable, Adaptive and Bio-inspired Hardware (CRAB) Lab

Dept. of Computer and Information Science
Norwegian University of Science and Technology

(pauline,mortehar,djupdal)@idi.ntnu.no

Abstract

The field of Evolvable Hardware, applying artificial evolution to the design of digital and analogue hardware, is around ten years old. However, the field is far from reaching mainstream electronics, although a few examples exist. One cause may be that the problems that are addressed in the field are, in general but not always, relatively simple designs which may be regarded as “toy problems”, this work being no exception.

Interest in the possibilities inherent in evolved designs is growing, as may be seen from the inclusion of evolvable hardware as a topic in a number of more traditional electronics conferences. However, how good are the designs that are evolved? How can they be compared to their traditional counterparts? Suitable metrics are needed which enable comparison between these two fields of design and that can provide an accurate and fair evaluation of the given design technique. In this work the issue of fault tolerance is addressed together with the design metric reliability.

1 Introduction

The need for fault tolerance is an important issue of modern electronic design. High density chips increase the possibility of failing components and the complexity of design increases the probability of human errors. Space exploration in unknown and dynamic environments places a greater pressure on fault tolerance in terms of the faults themselves and the cost of repair. The need for fault tolerant designs is stated amongst the long term grand challenges of the International Technology Roadmap for Semiconductors, ITRS (2005) [5].

The introduction of reconfigurable technology has, in some ways, reduced the need for fault tolerance as the reconfigurable nature of these devices enables run-time correction, assuming a fault detection mechanism is available. However, when one considers space applications, run-time correction is not so easy. Reconfiguration requires uploading the new configuration when access windows permit access to the device. For small satellites, a 10 minute access window may be available 2 to 3 times a day, with transfer rates of perhaps only 10 to 19 kbit/s [16]. With, for example, the 10 Mbit configuration stream of Xilinx’s Virtex-II, even with the possibility of partial reconfiguration and techniques such as data compression and splitting and fusing of configuration data, fault tolerance still needs to be addressed to reduce the need to reconfigure.
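The figures quoted above can be turned into a rough upload time estimate. The sketch assumes 1 Mbit = 10^6 bits and 1 kbit = 10^3 bits, a full uncompressed configuration, and no protocol overhead; the function name is hypothetical.

```python
import math


def upload_minutes(config_bits: float, link_bits_per_second: float) -> float:
    """Time needed to upload a configuration over the link, in minutes."""
    return config_bits / link_bits_per_second / 60


CONFIG_BITS = 10e6      # 10 Mbit configuration stream
WINDOW_MINUTES = 10     # length of one access window

slowest = upload_minutes(CONFIG_BITS, 10e3)  # about 16.7 minutes
fastest = upload_minutes(CONFIG_BITS, 19e3)  # about 8.8 minutes

windows_slowest = math.ceil(slowest / WINDOW_MINUTES)
windows_fastest = math.ceil(fastest / WINDOW_MINUTES)
```

Under these assumptions a full reconfiguration takes nearly all of one access window at the highest rate and spills into a second window at the lowest rate, so a faulty circuit could remain in place for hours before a corrected configuration is loaded.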

The field of evolvable hardware, EHW, where artificial evolution is applied to the design of electronic circuits, is a promising field in the area of fault tolerant design for dynamic environments. The field of fault-tolerance in EHW was pioneered by Thompson, who in 1995 evolved fault-tolerant electronic control systems [14]. More recent work on evolved fault-tolerance includes that of The Intelligent Systems Research Group at The University of York [15, 3], which in cooperation with L’Ecole Polytechnique Federale de Lausanne (EPFL) [10] also investigates other bio-inspired methods for achieving fault-tolerance [6]. Stefatos and Arslan [12] proposed a two-layered fault-tolerant VLSI architecture where the second layer provides for detection and correction of the first layer. At NASA, field-programmable transistor arrays (FPTA) are evolved to be fault-tolerant [7] and fault-recovering [17]. Genetic representations are also investigated with regard to fault recovery on FPGAs [9]. Branke has investigated the workings of evolutionary systems in dynamic environments [1, 2]. Zhang et al. are investigating competitive and consensus-based evolution as an approach to handling faults [18]. A large variety of static circuit topologies were extrinsically evolved to be fault-tolerant and/or robust to noise by Hartmann and Haddow [4].

However, how good are the results being produced? One major problem is that not only the area of fault tolerance but the field of evolvable hardware itself is still in its infancy, creating relatively simple designs and evaluating these designs using internal metrics, e.g. number of generations, which have no meaning at all in terms of traditional designs. The problems associated with the evolution of less simple designs are well known in the field and there is much work addressing these issues. In this paper the focus is to address the challenge of finding ways to evaluate evolved designs such that they may be fairly compared to traditional designs evaluated by traditional metrics, and vice versa. To avoid the challenge of evolving larger designs, these initial investigations are based on evaluations of a relatively simple design, a multiplier, and the limitations that such a decision brings to the results are presented.

In this paper, the fault tolerance metric reliability is addressed, using two variations of this metric. The first, Rtrad, as used in traditional design evaluation, is a measure of the probability that a system will not fail under specified conditions. This reliability measure is, of course, a measure of the probability that a circuit is 100% functional and says nothing about how dysfunctional the circuit is when it is not 100% functional. The second reliability measure applied, Rehw, most often applied as the fitness measure in design by evolution (see section 2), represents how correct the solution is, on average, over a spectrum of faults. All experiments are conducted on simulations of a simple electronic circuit, a 2-bit multiplier, and the traditional designs evaluated are based on a non-redundant multiplier and a multiplier design with triple modular redundancy (TMR).

The paper provides an introduction to evolvable hardware and, in particular, evolving fault tolerant designs in section 2. Section 3 explains how circuits designed using traditional techniques are incorporated in our experiments. The representation used in evolving circuits is presented in section 4. Details about how circuits were evolved and tested, and how faults are applied, are given in section 5. The experiments conducted, their results and analysis of these results are given in section 6. Finally, section 7 draws some conclusions from this work.

2 Evolvable Hardware

The principles of artificial evolution are based on Darwin’s theory of evolution by natural selection. His theory was first adapted to the field of artificial evolution in 1975. The main features of natural evolution, consisting of reproduction by cloning, mutation and crossover through the exchange and alteration of genetic material, are included in today’s evolutionary algorithms. However, an artificial selection mechanism was introduced which, unlike that in nature, steers artificial evolution towards a given goal, i.e. a solution to a given problem.

The application of evolutionary techniques to hardware design is termed evolvable hardware (EHW) [13], the main goal being to replace traditional design methods with evolutionary techniques for given hardware applications. These applications are either not achievable using traditional methods or would benefit from an evolutionary approach. A number of algorithms have been developed for evolutionary design and many have been applied to evolvable hardware. To highlight some of the main concepts behind an evolutionary algorithm, a genetic algorithm is described in section 2.1. It should be noted, however, that the evolutionary algorithm applied in this work is termed Cartesian Genetic Programming [11].

2.1 Genetic Algorithms

[Figure 1 diagram. Labels: Population, Individual, Copy population, Fitness function, Selection, Clone, Crossover, Crossover point, Mutation, Generation n, Generation n+1.]

Figure 1. A Genetic Algorithm

The evolutionary process is a dynamic process where at a given point in time we have what is termed in biology a generation: a population of individuals for the given species. Biologically, a generation change is not an event but a constantly unfolding process. However, in artificial evolution, we introduce new generations through the application of genetic operators to selected individuals within the current population to form a new population of individuals, i.e. a new generation. Unlike biological evolution, in artificial evolution we have a concrete goal, that is, to create a correct solution, i.e. a functionally correct circuit which solves a given problem. The goal is specified in terms of the fitness measure. This measure may define the functionality of the circuit, i.e. how correct a given circuit solution is. However, it may also take into account other features such as area usage and power consumption and weight these different factors. It is the purpose of the selection mechanism to steer evolution towards the goal described by the fitness measure.

Figure 1 illustrates the process of a genetic algorithm. The set of individuals (circuit solutions) makes up the population in the current generation. Individuals are selected and genetic operators such as cloning (copying), crossover (swapping information between individuals) or mutation (inverting a bit) may be applied before fitness is calculated and the individuals are placed in the new population. When the new population is full, the next generation can begin. The process continues until an individual in the population achieves 100% fitness or the simulator is stopped after a certain number of generations.
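The generation loop described above can be sketched in a few lines. This is a minimal illustration and not the simulator used in this work: the bit-string genome, the toy `onemax` fitness function, the tournament size of two and all names are assumptions made for the example.

```python
import random

def evolve(fitness, genome_len=6, pop_size=20, max_gens=200,
           p_cross=0.2, p_mut=0.05, seed=0):
    """Evolve bit strings until one reaches fitness 1.0 or max_gens is hit."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for gen in range(max_gens):
        best = max(pop, key=fitness)
        if fitness(best) == 1.0:
            return best, gen                  # 100% fitness reached
        new_pop = [best[:]]                   # elitism: clone the best individual
        while len(new_pop) < pop_size:
            # Tournament selection (size 2, an assumption) of two parents.
            a = max(rng.sample(pop, 2), key=fitness)
            b = max(rng.sample(pop, 2), key=fitness)
            child = a[:]                      # cloning
            if rng.random() < p_cross:        # one-point crossover
                cut = rng.randrange(1, genome_len)
                child = a[:cut] + b[cut:]
            # Mutation: invert each bit with probability p_mut.
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            new_pop.append(child)
        pop = new_pop                         # the next generation begins
    return max(pop, key=fitness), max_gens    # stopped after max_gens

def onemax(genome):
    """Toy fitness: fraction of bits set to 1 (stands in for circuit correctness)."""
    return sum(genome) / len(genome)

best, generations = evolve(onemax)
```

Because the best individual is always cloned into the next population, the best fitness never decreases from one generation to the next.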

2.2 Evolving Fault-tolerant Designs

Biological organisms are, in comparison to electronic systems, extremely tolerant to sudden variations and changes such as failing elements. While the failure of a single transistor in a standard processor often has critical consequences, biological systems are constantly subject to similar or often much more severe failures, but usually continue to operate unaffected. Biologically inspired fault tolerance tries to transfer such properties of biological systems to engineered systems by identifying and using key features in biology that contribute to those properties.

In the process of evolving a design, the mutation operator of the evolutionary algorithm, which applies change to individuals based on a mutation probability, may be said to act as noise in the evolution process. In some cases, this may give rise to a certain amount of implicit fault tolerance in the evolved design. As such, even with 0% faults applied, a certain amount of fault tolerance may be present in the evolved design. In this work, evolved designs are both implicitly and explicitly designed for fault tolerance.

Explicit fault tolerance is herein achieved by testing individuals during evolution with various fault scenarios and evaluating how well the individual functions under these scenarios. These results, Rehw, are used to provide a fitness measure for the individual. The goal is that this fitness measure, combined with the selection method applied, is suitably defined so as to drive evolution towards finding good fault tolerant solutions.

Rehw provides evolution with information not only on the correct solutions but also on how good the incorrect solutions are. As such, the Rehw measure provides more fine-grained information to the evolutionary process about the quality of different solutions than would be achievable with Rtrad.

3 Introducing Traditional Redundancy

The most common form of redundancy in traditional design is Triple Modular Redundancy (TMR). Logic is encased (virtually) in a module and replicated three times, and a majority voter is applied to these modules. The voter outputs a vector of bits where each bit is the bit produced by the majority of the modules. TMR ensures correct output as long as faults occur in only one of the modules at a time and none of the faults occur in the majority voter. The voter is thus a critical part of a TMR circuit. However, the voter may be replicated, but at the cost of tripled voter logic. Further, at some point in the remainder of the circuit, these output signals will need to be voted to a single output.
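The bitwise majority vote described above can be written compactly. The sketch below is an illustration only; the function name and the 4-bit example words are assumptions, and the voter itself is treated as fault-free.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three module output words."""
    return (a & b) | (a & c) | (b & c)

correct = 0b0110

# One faulty module is outvoted by the two correct ones.
one_fault = majority_vote(correct, correct, 0b1001)

# Two modules failing in the same way outvote the correct one.
two_same = majority_vote(correct, 0b1001, 0b1001)

# Two modules failing on different bits are still masked, bit by bit.
two_different = majority_vote(correct, 0b0111, 0b1110)
```

Here `one_fault` and `two_different` both equal the correct word `0b0110`, whereas `two_same` yields the wrong word `0b1001`. Because the vote is taken per bit, faults in two modules are only harmful when they hit the same output bit.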

To incorporate traditional fault tolerance, TMR was applied to the traditional multiplier circuit used herein. TMR was applied to the multiplier and not to the voter in order to avoid a drastic increase in the number of gates in the TMR circuit and to avoid the tripled output problem.

4 Evolved Circuit Representation

[Figure 2 diagram: example circuit with GND and VCC sources and signals numbered 0 to 43.]

Figure 2. Example Circuit

what           label   type   input A   input B
input          0
input          1
input          2
input          3
gate           4       AND    3         1
gate           5       NOT    4
gate           6       AND    0         0
gate           7       AND    2         0
...
gate           38      VCC
gate           39      OR     4         9
gate           40      NOR    30        22
gate, output   41      AND    15        27
gate, output   42      OR     37        34
gate, output   43      NAND   38        24

Figure 3. Symbolic Netlist of the Example Circuit

Both evolved and traditional circuits were expressed in the simulator using a symbolic netlist representation. Figure 3 provides a symbolic description of the circuit illustrated in figure 2. Each row represents a gate, with its type and the sources of its two inputs. Connections can only be made to lower numbered gates (or the circuit inputs). This simple scheme ensures a pure feed-forward network and thus combinatorial circuits. The last gates in the netlist are connected to the circuit output.
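A netlist of this kind can be evaluated in a single pass, since every gate only reads lower numbered signals. The sketch below is an assumption-laden illustration, not the simulator used here: the tuple layout, the `evaluate` function and the 11-gate multiplier `MULT` are hypothetical (the latter is not the 9 gate design used in the experiments), and the VCC and GND sources are omitted.

```python
from itertools import product

GATES = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR":  lambda a, b: 1 - (a | b),
    "NOT":  lambda a, b: 1 - a,   # single-input gate: second source ignored
}

def evaluate(netlist, n_outputs, inputs):
    """Evaluate a feed-forward netlist; the last n_outputs gates are outputs."""
    values = list(inputs)                     # signals 0..n_inputs-1
    for kind, src_a, src_b in netlist:
        if src_a >= len(values) or src_b >= len(values):
            raise ValueError("gate reads a signal that is not lower numbered")
        values.append(GATES[kind](values[src_a], values[src_b]))
    return values[-n_outputs:]

# Hypothetical 2-bit multiplier. Signals 0..3 are a0, a1, b0, b1.
MULT = [
    ("AND", 0, 2),    # 4:  a0 AND b0
    ("AND", 1, 2),    # 5:  a1 AND b0
    ("AND", 0, 3),    # 6:  a0 AND b1
    ("AND", 1, 3),    # 7:  a1 AND b1
    ("OR", 5, 6),     # 8
    ("NAND", 5, 6),   # 9
    ("NOT", 4, 4),    # 10: NOT (a0 AND b0)
    ("AND", 4, 7),    # 11: output p3
    ("AND", 7, 10),   # 12: output p2
    ("AND", 8, 9),    # 13: output p1 (XOR of signals 5 and 6)
    ("AND", 4, 4),    # 14: output p0 (buffer, so that outputs come last)
]

def multiplier_ok(netlist):
    """Check the netlist against the full 2-bit multiplication truth table."""
    for a, b in product(range(4), repeat=2):
        p3, p2, p1, p0 = evaluate(netlist, 4, [a & 1, a >> 1, b & 1, b >> 1])
        if 8 * p3 + 4 * p2 + 2 * p1 + p0 != a * b:
            return False
    return True
```

The range check in `evaluate` enforces the feed-forward rule, so a netlist with a feedback connection is rejected instead of being silently mis-evaluated.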

5 Experimental Setup

All experiments were conducted on simulations of 2-bit multiplier designs. In earlier evolved experiments [4], faults were applied as stuck-at faults at the inputs or at the outputs. As such, a given stuck-at fault may or may not create a logical fault. In the work herein, faults were applied as an inversion of the gate output, i.e. a worst case scenario, and applied on a fault rate per gate basis.

For the evolved circuits, a population of 20 randomly generated individuals was given to the evolution process. The selection method applied was tournament selection with elitism [8], crossover was applied at a rate of 0.2 and mutation at a rate of 0.05. In these experiments, mutation was applied at the gate level. A mutated gate either has one of its input connections rewired or its type changed. The evolution of each solution, i.e. each evolved circuit, was allowed to continue for a maximum of 100,000 generations and the maximum size of an individual (the genome) was 80 gates.
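Gate-level mutation as described above might look like the following sketch. The netlist layout (a list of `(type, input A, input B)` tuples), the gate type list and the equal split between the three kinds of change are assumptions for illustration; the text does not state how the choice between rewiring and retyping is made.

```python
import random

GATE_TYPES = ["AND", "OR", "NAND", "NOR", "NOT"]

def mutate(netlist, n_inputs, rate, rng):
    """Return a mutated copy; each gate is mutated with probability `rate`."""
    result = []
    for index, (kind, src_a, src_b) in enumerate(netlist):
        if rng.random() < rate:
            # Only lower numbered signals may be read (feed-forward rule).
            limit = n_inputs + index
            choice = rng.randrange(3)
            if choice == 0:
                kind = rng.choice(GATE_TYPES)     # change the gate type
            elif choice == 1:
                src_a = rng.randrange(limit)      # rewire input A
            else:
                src_b = rng.randrange(limit)      # rewire input B
        result.append((kind, src_a, src_b))
    return result

rng = random.Random(1)
parent = [("AND", 0, 1), ("OR", 2, 4), ("NAND", 4, 5)]   # 4 circuit inputs
child = mutate(parent, n_inputs=4, rate=0.05, rng=rng)
```

Restricting rewired inputs to lower numbered signals keeps every mutated individual a valid feed-forward circuit, so no repair step is needed after mutation.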

The goal of evolving a fault tolerant circuit solution can be interpreted as the evolution of a circuit that successfully produces the correct mapping between input and output vectors despite faults. The target behavior of the multiplier was specified by a truth table. Each individual of the evolutionary process was evaluated by applying the complete set of input vectors and calculating the Hamming distance between the output of the circuit and the target truth table. The fitness was then the average of the Hamming distance in 20 fault scenarios, with faults being applied randomly for each scenario according to the per gate probability of failure for the given experiment. The fitness result was normalised between 0 and 1, 0 being no correct outputs and 1 being all correct outputs.
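The fitness evaluation described above can be sketched as follows. This is an illustrative reading of the text, not the original simulator: the netlist layout, the function names and the normalisation as one minus the relative Hamming distance are assumptions, and faults are modelled as inverted gate outputs drawn independently per gate.

```python
import random
from itertools import product

GATES = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR":  lambda a, b: 1 - (a | b),
    "NOT":  lambda a, b: 1 - a,
}

def evaluate(netlist, n_outputs, inputs, faulty=frozenset()):
    """Evaluate the netlist; gates listed in `faulty` have inverted outputs."""
    values = list(inputs)
    for index, (kind, src_a, src_b) in enumerate(netlist):
        out = GATES[kind](values[src_a], values[src_b])
        if index in faulty:
            out ^= 1                      # worst case fault: invert the output
        values.append(out)
    return values[-n_outputs:]

def correctness(netlist, target, n_inputs, n_outputs, faulty):
    """1 minus the normalised Hamming distance to the target truth table."""
    wrong = 0
    for vector in product((0, 1), repeat=n_inputs):
        got = evaluate(netlist, n_outputs, vector, faulty)
        wrong += sum(g != t for g, t in zip(got, target[vector]))
    return 1.0 - wrong / (n_outputs * 2 ** n_inputs)

def fitness(netlist, target, n_inputs, n_outputs, rate, rng, scenarios=20):
    """Average correctness over randomly drawn fault scenarios."""
    total = 0.0
    for _ in range(scenarios):
        faulty = {i for i in range(len(netlist)) if rng.random() < rate}
        total += correctness(netlist, target, n_inputs, n_outputs, faulty)
    return total / scenarios

# Tiny usage example: a single AND gate scored against the AND truth table.
and_target = {(a, b): [a & b] for a in (0, 1) for b in (0, 1)}
and_net = [("AND", 0, 1)]
rng = random.Random(0)
no_faults = fitness(and_net, and_target, 2, 1, rate=0.0, rng=rng)
all_faults = fitness(and_net, and_target, 2, 1, rate=1.0, rng=rng)
```

With no faults the single gate scores 1.0; with every gate inverted it behaves as a NAND and scores 0.0, the two ends of the normalised range.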

Figure 4. Experiment setup

Figure 4 illustrates the experimental setup. As shown, the evolutionary process (“Genetic Algorithm”) was repeated 20 times to achieve 20 evolved circuit solutions and thus 20 best individuals for the given fault rate. Each best solution was then tested with 1000 test scenarios. The results of each of the 1000 tests were used to calculate Rehw and Rtrad for each of the 20 best individuals. The results plotted on the graphs for a given fault rate represent the best reliability result among the 20 best evolved circuits after the post evolution testing. It should be noted that there are two ways in which the 1000 test scenarios are applied. The first (a) is to apply the 1000 tests at the same fault rate as the fault rate applied when the individuals were evolved. The second (b) is to apply a different fault rate during the 1000 tests. The first case is termed, herein, increasing fault rate whereas the second is termed fixed fault rate. Fault rates investigated in this work range from 0.0 to 0.2.
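Given the per-scenario correctness of one circuit, the two reliability figures can be computed as sketched below. The definitions follow the descriptions in this paper (Rehw as average correctness, Rtrad as the share of scenarios in which the circuit is 100% functional), but the function names and the example scores are assumptions for illustration.

```python
def r_ehw(scores):
    """Average correctness over all tested fault scenarios."""
    return sum(scores) / len(scores)

def r_trad(scores):
    """Fraction of fault scenarios in which the circuit is 100% functional."""
    return sum(1 for s in scores if s == 1.0) / len(scores)

def best_of(individual_scores, metric):
    """Best value of a metric over several evolved individuals."""
    return max(metric(scores) for scores in individual_scores)

# Hypothetical correctness values from four fault scenarios of one circuit.
scores = [1.0, 1.0, 0.75, 0.5]
```

For these example scores Rehw is 0.8125 while Rtrad is 0.5: a circuit that is mostly correct in the failing scenarios still counts as failed under Rtrad.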

One important consideration when applying faults as faults per gate is that the bigger the circuit, i.e. the more gates present, the larger the challenge one faces with respect to reliability. As such, a traditionally designed multiplier using a minimum number of gates was included in the experiments. A 9 gate multiplier was hand-designed using Karnaugh maps and manual optimisation of joint logic between separate outputs.

6 Experiments and Results

The goal of these experiments was to understand and compare the effect of using the two metrics, traditional reliability Rtrad and evolvable reliability Rehw, with respect to digital circuits, in particular a 2-bit multiplier.

6.1 Reliability of Evolved Circuits

In earlier work on evolving reliable circuits [4], Rehw was the reliability metric applied. Faults were applied both at the inputs and the output of a gate, providing the possibility that a fault would not necessarily result in a gate failure. Fault scenarios applied during evolution were based on the fault rate per gate under investigation. The best individuals from 20 evolutionary runs were then each tested 1000 times using the same fault rate as applied during evolution. Rehw was calculated for each of the 20 results and the average as well as the best and worst results were plotted. Output gates were protected from faults, since such faults could not be remedied by evolution.

These experiments are repeated, herein, but with a worst case fault scenario (inverting the output to cause a gate failure) and applying the “increasing fault rate” scenario. The results are displayed in figure 5 as Rehw. As stated, Rehw represents the best result of the 20 individuals. The best result was chosen, rather than the average, to illustrate what evolution is capable of rather than what it achieves on average. As shown, evolution achieves circuits with over 70% reliability even with 20% faults in a worst-case fault scenario.

These figures may, of course, be said to be misleading in terms of a more traditional view of reliability. As such, it was important to evaluate these circuits in terms of Rtrad. Rtrad was calculated, similar to Rehw, as the best result for the 20 individuals under the 1000 post evolution fault scenarios (see Rtrad in figure 5). As can be seen from the results, at a fault rate of 0.04 and greater, evolution does not achieve any circuits that are 100% functional.

The break at 0.04 would indicate that at this point the task of achieving 100% correctness becomes too hard for evolution. However, as indicated by Rehw, evolution manages to retain a reasonable sub-optimal solution, degrading gracefully from around 90% Rehw at 0.04 to around 75% at 0.2.

[Figure 5 plot. x-axis: fault rate pr. gate (0.00 to 0.20); y-axis: reliability (0.0 to 1.0); series: Rehw, Rtrad.]

Figure 5. Circuits Evolved with Increasing Fault Rates

[Figure 6 plot. x-axis: fault rate pr. gate (0.00 to 0.20); y-axis: reliability (Rehw) (0.5 to 1.0); one series per evolution fault rate: 0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06.]

Figure 6. Rehw: Circuits Evolved with Fixed Fault Rates (0 to 0.06)

Looking again at Rtrad in figure 5 raised the question: could the better results achieved at fault rates from 0 to 0.03 be exploited? How reliable were these circuits at higher fault rates than those that the circuits were exposed to under evolution? What would happen if a circuit evolved to tolerate 0.01 faults were tested for tolerance to a range of faults? To investigate either side of the breakpoint in the graph, the results from evolving circuits at 0.0 to 0.06 were given a new set of post evolution testing. For each of these fault rates, the 20 best individuals were tested against the whole range of fault rates, i.e. exposed to 1000 fault scenarios for each fault rate. This fixed rate fault testing resulted in 7 sets of results for all fault rates. Both Rehw and Rtrad were measured, as illustrated in figures 6 and 7 respectively. For the case of Rehw, as shown, circuits evolved with lower fault rates perform well when tested at fault rates below 0.04. On the other hand, circuits evolved with higher fault rates perform well when tested at fault rates above 0.04. It should be noted that the scale on the y-axis has been enlarged compared to the other graphs so that the results are more visible. There is a substantial improvement in Rtrad when fixed fault rates of 0.0 to 0.04 are applied. One may assume that fault rates of 0.05 and 0.06 provide too difficult a task for evolution: even at a fault rate of 0.00, the best circuits evolved at 0.05 and 0.06 are not 100% functional in any of the 1000 fault scenarios. The best results from fixed fault rates for both Rehw and Rtrad are displayed in figure 8.

[Figure 7 plot. x-axis: fault rate pr. gate (0.00 to 0.20); y-axis: reliability (Rtrad) (0.0 to 1.0); one series per evolution fault rate: 0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06.]

Figure 7. Rtrad: Circuits Evolved with Fixed Fault Rates (0 to 0.06)

Comparing the fixed fault rate solution (“Best of 0.00-0.06 fault rate”) with that of increasing fault rates (“Evolved with given fault rate”) in figure 9, one can see that there is little difference in Rehw. However, Rtrad is substantially improved (see figure 10). Even at a fault rate of 0.2 some 100% functional solutions may be found.

[Figure 8 plot. x-axis: fault rate pr. gate (0.00 to 0.20); y-axis: reliability (0.0 to 1.0); series: Rehw, Rtrad.]

Figure 8. Circuits Evolved with Fixed Fault Rates (0 to 0.06)

[Figure 9 plot. x-axis: fault rate pr. gate (0.00 to 0.20); y-axis: reliability (Rehw) (0.5 to 1.0); series: Best of 0.00-0.06 fault rate, Evolved with given fault rate.]

Figure 9. Rehw: Evolved Fixed vs Evolved Increasing

[Figure 10 plot. x-axis: fault rate pr. gate (0.00 to 0.20); y-axis: reliability (Rtrad) (0.0 to 1.0); series: Best of 0.00-0.06 fault rate, Evolved with given fault rate.]

Figure 10. Rtrad: Evolved Fixed vs Evolved Increasing

6.2 Reliability of Traditional Circuits

To calculate Rehw for the traditional circuit, 1000 fault scenarios were applied and their average calculated for each of the fault rates from 0 to 0.2. Figure 11 illustrates the results for both the traditional circuit and its TMR version. Unfortunately, but not unexpectedly, the weighting of the 16 vulnerable gates in the voter compared to the small 9 gate multiplier modules is a big disadvantage to TMR as a methodology. Also, the sheer size of the circuit makes the average number of occurring faults relatively large, due to faults being generated based on gate reliability. The traditional circuits without redundancy are, therefore, more reliable than those with TMR.

Figure 12 illustrates Rtrad for both the traditional circuit and its TMR version. Statistical estimates which illustrate the validity of the results received are also plotted. It should be noted that the TMR multiplier results deviate from the statistical estimates. The reason for this is that the statistical estimates do not take into account a number of features inherent in TMR. One such feature is the fact that two faults may occur on a TMR triple gate without resulting in gate failure. As such, the measured reliability will be higher than the estimated results. It may also be noticed that the traditional circuit with TMR performs, as in the Rehw case, poorer than the non-redundant version.

[Figure 11 plot. x-axis: fault rate pr. gate (0.00 to 0.20); y-axis: reliability (Rehw) (0.5 to 1.0); series: Traditional, TMR.]

Figure 11. Rehw: Traditional Circuits

[Figure 12 plot. x-axis: fault rate pr. gate (0.00 to 0.20); y-axis: reliability (Rtrad) (0.0 to 1.0); series: Traditional mul (est), TMR mul (est), Traditional mul, TMR mul.]

Figure 12. Rtrad: Traditional Circuits

Similar to the evolved circuits, it may be seen that Rehw results give a more positive impression of reliability than the Rtrad results.

6.3 Traditional vs Evolved

[Figure 13 plot. x-axis: fault rate pr. gate (0.00 to 0.20); y-axis: reliability (Rehw) (0.5 to 1.0); series: TMR, Traditional, Best of 0.00-0.06 fault rate.]

Figure 13. Rehw: Traditional vs Evolved

Rehw for the traditional circuits is somewhat poorer than the evolved results (see figure 13), indicating that the evolved solutions perhaps exhibit more graceful degradation. On the other hand, one can see from figure 14 that the evolved circuits are less reliable, with respect to Rtrad, than the traditional solutions without TMR. As stated, in the evaluation phase the evolving circuits are not being optimised against a specific fault scenario. Instead, a sub-optimal solution is being sought with respect to individual fault scenarios, which gives rise to a more optimal solution for the set of fault scenarios. Further, the evolved results achieve a better Rtrad than the traditional circuits with TMR.

[Figure 14 plot. x-axis: fault rate pr. gate (0.00 to 0.20); y-axis: reliability (Rtrad) (0.0 to 1.0); series: TMR, Traditional, Best of 0.00-0.06 fault rate.]

Figure 14. Rtrad: Traditional vs Evolved

6.4 Gate Usage

[Figure 15 plot. x-axis: fault rate pr. gate (0.00 to 0.20); y-axis: number of used gates (0 to 50); series: Evolved (best by Rehw), Traditional, TMR.]

Figure 15. Rehw: Number of Gates versus Fault Rate (per gate)

Figure 15 illustrates gate usage for traditional, traditional with TMR and evolved with fixed fault rate circuits. Each point in the evolved with fixed fault rate experiments represents the best result, with respect to Rehw, of the 7 result sets for each fault rate. Although the results are somewhat unstable in nature, there is a tendency for evolution to reduce the size of the circuits to achieve better Rehw values. Looking at the gate count for Rtrad (see figure 16), more stability can be seen in the gate count for the evolved circuits. This is due to the fact that, for most of the points in the graph, it is the same circuit of the 20 evolved circuits which achieves the best Rtrad.

[Figure 16 plot. x-axis: fault rate pr. gate (0.00 to 0.20); y-axis: number of used gates (0 to 50); series: Evolved (best by Rtrad), Traditional, TMR.]

Figure 16. Rtrad: Number of Gates versus Fault Rate (per gate)

7 Conclusion

This paper has addressed the issue of applying reliability metrics to both evolved and traditional circuit solutions and compared the performance of circuits with regard to those metrics in terms of a worst case fault scenario applied on a per gate basis.

When considering the results it is worth noting that the fault scenarios used herein are hard to solve. Redundancy techniques, in general, can cope with single faults. The traditional circuit with TMR, for instance, would handle any single fault not occurring in the voter element. The faults herein are, however, applied as the probability of a gate failing on a per gate basis. All circuits are tested 1000 times in arbitrary fault scenarios. Several of these scenarios will present the circuit with a large number of failing gates, often including the output gates. Several of the generated scenarios may, thus, be fault scenarios that are impossible to solve, regardless of whether the circuit architecture is evolved or designed using traditional techniques.
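How hard these scenarios are can be illustrated with two simple quantities for independent per gate faults. The sketch below is a back-of-the-envelope aid, not the statistical estimate plotted in figure 12: the function names are hypothetical, and the 43 gate figure for the TMR circuit is an assumption derived from three 9 gate modules plus the 16 voter gates mentioned in section 6.2.

```python
def p_fault_free(n_gates: int, rate: float) -> float:
    """Probability that no gate at all fails in one fault scenario."""
    return (1.0 - rate) ** n_gates

def expected_faults(n_gates: int, rate: float) -> float:
    """Average number of failing gates per fault scenario."""
    return n_gates * rate

SIZES = {"traditional": 9, "TMR (assumed 3 * 9 + 16)": 43}

for name, gates in SIZES.items():
    print(f"{name}: P(no fault) = {p_fault_free(gates, 0.2):.5f}, "
          f"expected faults = {expected_faults(gates, 0.2):.1f}")
```

At a fault rate of 0.2 the 9 gate circuit is fault-free in only about 13% of scenarios and the assumed 43 gate TMR circuit sees 8.6 failing gates on average. Since a fault-free scenario always gives a fully functional result for a correct design, the fault-free probability is a lower bound on Rtrad; fault masking can only raise it.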

Evolved circuits, herein, are evolved with the goal of improving Rehw in the presence of faults. On the other hand, Rtrad is a design metric tuned to traditional design methods. As a consequence it is not so surprising that evolved circuits are more reliable in terms of Rehw and traditional circuits are more reliable in terms of Rtrad.

The circuits used in the experiments conducted may be said to be small toy problems. For larger problems observed results may differ. This is especially true in the case of TMR, as the technique is not suited for problems of the current size. Further investigations would, therefore, benefit from scaling up the application. Scaling up solutions is, however, currently one of the greater challenges faced by the evolvable hardware community as a whole.

The field of evolvable hardware is still far from maturity, and applying evolvable techniques is not a straightforward science. As shown, applying fixed fault rates during evolution led to greatly improved reliability, especially in terms of Rtrad. One might expect that one could use Rtrad to drive evolution instead of Rehw and thus improve Rtrad for evolved designs. However, Rtrad does not provide sufficiently fine-grained circuit feedback to separate potential solutions and, therefore, limits evolution’s search for a solution to a rougher search.

Another topic that needs to be addressed in the future is the suitability of Rehw and Rtrad for different applications. Whilst Rtrad has obvious advantages as a metric, Rehw is also beneficial in certain cases such as in circuit evolution. Other cases may include those where partially correct outputs would still be of greater value than a completely incorrect output. This could, for instance, be correctness for certain input vectors in a multiplier or reduced resolution or precision in general. Also, as technology scales, production defects and lifetime faults are likely to become an increasing problem. As such, one can expect Rtrad to fall as it is going to be harder and harder to tolerate faults 100%. It may, therefore, become important to investigate not only Rtrad but also Rehw in traditional designs to study circuit degradation in the presence of faults.

Another interesting fact observed was that, in certain cases, evolved circuits were seen to have high tolerance to faults with up to 3 times the number of gates of the traditional non-TMR solution. In these cases a question arises: what type of redundancy is evolution using so as not to experience the negative effect of a large number of gates experienced by the TMR solution? Although this is not a general result, these cases are interesting in their own right and worth further investigation to find out whether we can improve traditional redundancy techniques based on architectural observations from evolved circuits.

References

[1] J. Branke. Evolutionary Optimization in Dynamic Environments. Kluwer, 2001.

[2] J. Branke and H. Schmeck. Designing evolutionary algorithms for dynamic optimization problems. Theory and Application of Evolutionary Computation: Recent Trends, 31, 2002.

[3] R. O. Canham and A. Tyrrell. Evolved fault tolerance in evolvable hardware. In Proc. Congress on Evolutionary Computation (CEC’02), pages 1267–1272. IEEE, May 2002.

[4] M. Hartmann and P. C. Haddow. Evolution of fault-tolerant and noise-robust digital designs. IEE Proc.-Comput. Digit. Tech., 151(4):287–294, 2004.

[5] International Roadmap Committee. Executive summary. In International Technology Roadmap for Semiconductors. 2005. http://public.itrs.net.

[6] A. H. Jackson, R. Canham, and A. M. Tyrrell. Robot fault-tolerance using an embryonic array. In J. Lohn, R. Zebulum, J. Steincamp, D. Keymeulen, A. Stoica, and M. I. Ferguson, editors, Proc. The 2003 NASA/DoD Conference on Evolvable Hardware, pages 91–100. IEEE Computer Society, 2003.

[7] D. Keymeulen, R. S. Zebulum, Y. Jin, and A. Stoica. Fault-tolerant evolvable hardware using field-programmable transistor arrays. IEEE Transactions on Reliability, 49(3):305–316, 2000.

[8] J. Koza. Genetic Programming. The MIT Press, 1993.

[9] J. Lohn, G. Larchev, and R. DeMara. A genetic representation for evolutionary fault recovery in Virtex FPGAs. In A. M. Tyrrell, P. C. Haddow, and J. Torresen, editors, Evolvable Systems: From Biology to Hardware. Fifth Int. Conf., ICES 2003, volume 2606 of Lecture Notes in Computer Science, pages 47–56. Springer, 2003.

[10] D. Mange, M. Sipper, A. Stauffer, and G. Tempesti. Toward self-repairing and self-replicating hardware: the embryonics approach. In J. Lohn, A. Stoica, D. Keymeulen, and S. Colombano, editors, Proc. The Second NASA/DoD Workshop on Evolvable Hardware, EH 2000, pages 205–214. IEEE Computer Society, 2000.

[11] J. F. Miller, D. Job, and V. K. Vassilev. Principles in the evolutionary design of digital circuits – part I. Journal of Genetic Programming and Evolvable Machines, 1(1):8–35, 2000.

[12] E. F. Stefatos and T. Arslan. An efficient fault-tolerant, VLSI architecture using parallel evolvable hardware technology. In Proc. 2004 NASA/DoD Conference on Evolvable Hardware, pages 1–7, 2004.

[13] A. Thompson. Evolving electronic robot controllers that exploit hardware resources. In The 3rd European Conference on Artificial Life (ECAL95), 1995.

[14] A. Thompson. Evolving fault tolerant systems. In Proc. 1st IEE/IEEE Int. Conf. on Genetic Algorithms in Engineering Systems: Innovations and Applications (GALESIA’95), pages 524–529. IEE Conf. Publication No. 414, 1995.

[15] A. M. Tyrrell, G. Hollingworth, and S. L. Smith. Evolutionary strategies and intrinsic fault tolerance. In D. Keymeulen, A. Stoica, J. Lohn, and R. S. Zebulum, editors, Proc. The Third NASA/DoD Workshop on Evolvable Hardware, EH 2001, pages 98–106. IEEE Computer Society, 2001.

[16] T. Vladimirova, X. Wu, K. Sidibeh, D. Bernhart, and A.-H. Jallad. Enabling technologies for distributed picosatellite missions in LEO. In Proc. First NASA/ESA Conference on Adaptive Hardware and Systems (AHS’06), pages 330–337. IEEE, 2006.

[17] R. S. Zebulum, D. Keymeulen, V. Duong, X. Guo, M. I. Ferguson, and A. Stoica. Experimental results in evolutionary fault-recovery for field programmable analog devices. In J. Lohn, R. Zebulum, J. Steincamp, D. Keymeulen, A. Stoica, and M. I. Ferguson, editors, Proc. The 2003 NASA/DoD Conference on Evolvable Hardware, pages 182–188. IEEE Computer Society, 2003.

[18] K. Zhang, R. F. DeMara, and C. A. Sharma. Consensus-based evaluation for fault isolation and on-line evolutionary regeneration. In Proc. International Conference in Evolvable Systems (ICES’05), volume 3637, pages 12–24. Springer, 2005.


Paper III

Evolving Redundant Structures for Reliable Circuits — Lessons Learned
Asbjørn Djupdal and Pauline C. Haddow
In Adaptive Hardware and Systems, pages 455–462, 2007

Evolving Redundant Structures for Reliable Circuits — Lessons Learned

Asbjoern Djupdal and Pauline C. Haddow
CRAB Lab (http://crab.idi.ntnu.no)
Department of Computer and Information Science
Norwegian University of Science and Technology

djupdal,[email protected]

Abstract

Fault tolerance is an increasing challenge for integrated circuits due to semiconductor technology scaling. This paper looks at how artificial evolution may be tuned to the creation of novel redundancy structures which may be applied to meet this challenge. However, as these structures are unknown it is a challenge in itself to tune evolution to create them. As such, no solution has yet been found. This paper provides a discussion about the issues addressed and experiments conducted and thus provides an overview of the lessons learned in this work.

1 Introduction

As the semiconductor feature size decreases and the number of transistors on a single chip increases, one of the growing challenges facing the electronic design community is faulty behaviour [11]. This challenge may be met by improved detection and repair techniques, by improved fault tolerance methods or by a combination of the two.

The semiconductor fault challenge may be, in general, a long term challenge but is here today for large ICs such as Field Programmable Gate Arrays (FPGAs). The mass production of FPGAs enables FPGAs to be produced in the newest technologies. The Xilinx Virtex 5 [18] is an example of a new FPGA series produced in 65nm technology with up to 330,000 logic cells.

If faults are expected to occur in a digital circuit, fault tolerance (the ability to function correctly in the presence of faults) may be achieved by incorporating redundancy (additional resources) in some form. These additional resources may be in the form of additional hardware, in which case it is called hardware redundancy [14], which is the focus of this paper.

One well known hardware fault tolerance method is Triple Modular Redundancy (TMR) [14]. This method involves tripling logic and using a voter to choose the correct solution. Although TMR is a very successful redundancy technique, its accepted weaknesses are the tripling of area and the susceptibility of the voter to faults. Also, if such a technique were to be applied to all the logic on an FPGA, around two thirds of the available logic would be applied to redundancy. This would drastically reduce the amount of primary logic on the device.

When introducing redundancy in a circuit, there is often a trade-off between area and fault tolerance. TMR may be said to trade area for higher fault tolerance. Much research is currently looking at more area efficient ways of achieving fault tolerance and section 2 provides a short summary of some of this work.

The goal for the work behind this paper is to find new ways of introducing redundancy in a circuit. The ultimate goal for this work is fault tolerance in FPGAs. While the work presented in this paper is not specifically targeting FPGAs, the goal is to gain knowledge on novel redundancy techniques that may later be adapted for use in an FPGA context. To find new redundancy techniques it is important to free oneself from the constraints brought upon us by thinking in the way of traditional redundancy. The whole way one thinks about designing, at either the circuit design or technology architecture level, is influenced by the way that one has been taught electronics, the way one has designed electronics and the tools used in the design process. One way of freeing oneself from these human and design automation constraints is to search for ideas using some sort of heuristic search process. One such process is that of evolutionary algorithms (EA) [4].

The application of EA to the design of hardware is termed evolvable hardware [10]. The goal is either to explore for unique solutions or to optimise existing solutions. However, in both cases, the goal is usually to obtain a given behaviour, e.g. a binary adder [9]. Further, evolution may be applied when seeking some sort of structure, like evolving the French flag [16]. In both these cases the goal may be explicitly defined and given to the EA for comparison between the evolving solutions and the sought solutions. In the former case it is the functionality that needs to be explicitly defined whereas in the latter case the structure needs to be explicitly defined.

When evolving redundant circuits for the purpose of finding novel redundancy techniques, one is looking for redundant structures. However, these structures are unknown, unlike the case of the earlier mentioned French flag problem [16]. It is not possible to explicitly describe the structure that one is seeking, only the functionality of the sought circuit, perhaps in terms of the truth table.

In the earlier work of Hartmann and Haddow [7], fault tolerant circuits were evolved. While achieving high fitness on a reliability based fitness function, they did not focus on creating 100% functional circuits where reliability is achieved through redundancy. In this work, the goal is to push evolution to retain 100% functionality and to find ways of introducing redundancy for fault tolerance in the circuit. To the surprise of the authors themselves, this problem is much more challenging than it might first appear. As such, the paper presents some of the approaches that have been applied to address this challenge and discusses these and further possibilities.

Section 2 gives a summary of the state of the art in area efficient redundancy techniques for FPGAs. Section 3 presents some important issues that must be addressed when evolving redundant circuits. Experimental setup, results and discussion are given in section 4 and the paper concludes in section 5.

2 Redundancy in FPGAs

To achieve area efficient defect tolerance, the typical approach is to exploit structural regularity [12]. The FPGA has a regular structure, which has inspired several techniques for defect tolerance in FPGAs. Techniques, especially in the context of enhancing yield, are reviewed in detail in [3]. Selected techniques may be classified under Node Redundancy, Configuration, Precompiled Configuration and Local Redundancy Techniques.

2.1 Node Redundancy

The node redundancy class of techniques contains the most widely studied techniques for redundancy in FPGAs and has been used with success to enhance yield in commercial Altera FPGAs [1]. The idea is to reserve spare nodes (logical blocks) in the FPGA architecture and enable spare nodes to take over for defective ones using on-chip resources. An early example [8] provided redundancy in the form of a redundant row. In the case of a defect, found during factory tests, the defective row is disconnected, vertical wiring is set up to bypass the disconnected row and all lower rows are shifted one row down. This reconfiguration is performed once, and may therefore be completed at the factory with antifuses or similar write-once technology.

The node redundancy technique has since been generalised to applying individual spare nodes instead of entire rows. A single defect then results in using one of the spare nodes [6], instead of discarding a whole row of nodes.

A difficulty with using such a method is the trade-off between hardware simplicity and good defect coverage. Moving functionality away from faulty nodes to a different physical location requires rerouting, and flexible rerouting is expensive in terms of both area and delay and is difficult to achieve on-chip. Lack of flexibility either means that fewer defects are tolerated or that more redundancy is wasted on each of them.

2.2 Configuration

The FPGA, or a system external to it, may change its configuration so that defective portions of the chip are left unused. The concept here is to use spare resources that are naturally present in an FPGA design, as no FPGA design uses all of the FPGA's resources. Doing a new place-and-route is computationally very expensive and is, therefore, often completed off-chip using standard synthesis tools (the Teramac project [2]). However, the Cell Matrix [15] provides an example of an on-chip solution incorporating a complex cell solution.

2.3 Precompiled Configuration

A further alternative to node redundancy is to share some of the work of rerouting with the synthesis tools. The bitstream may contain several different configurations, each assuming a defect in a different position. At configuration time, the chip may select the configuration that fits the current defect map [13].

2.4 Local Redundancy

The redundancy methods presented may be said to work at the system level. Local redundancy, on the other hand, introduces redundancy that affects only the area local to it, for example by adding extra local routing between two switch blocks. If a defective wire is found, the redundant one takes over its functionality [6, 19].

3 Issues on Evolving Redundant Structures

As stated, the goal of this work is to evolve redundant structures. There are, however, a number of issues that must be addressed and this section gives an introduction to these issues.


3.1 Behaviour (Function)

In this paper, the goal is not to evolve a multiplier, a flip flop or some other specified functionality. Instead our goal is to create redundant structures that enhance fault tolerance. However, we cannot explicitly define these structures but instead implicitly define them through a function that performs well in the presence of faults. As such, we need to define some kind of function and expose the function to faults. However, if the function is a challenge for evolution, evolution will use much time trying to achieve the function. This is, of course, undesirable. Instead, the function should be relatively easily evolvable so that evolution time is focused on the problem in hand, achieving redundant structures.

The size of the minimum representation of a given function is an important criterion when choosing a function. If there are very few gates, any form of redundancy will provide a substantial overhead, so a large circuit would probably be better suited for redundancy structures. However, this has to be traded off against the challenge of evolving large circuits and wasting much evolution time on achieving the function itself rather than fault tolerance.

3.2 Fault Models

Two fault models are considered in this work: the gate reliability model and the single fault model. In the gate reliability model, each gate has a certain probability of failing. A fault scenario is one possible configuration of faulty gates for a given circuit. If a fault scenario for the gate reliability model is to be created, each gate in the circuit is tested against a random number generator and selected to be faulty or not based on a chosen gate reliability. This is a reasonable model of reality as the expected number of failing gates in a circuit is directly proportional to the number of gates in the circuit.

In the single fault model, a circuit has exactly one fault at any time. If a fault scenario for the single fault model is to be created, one of the gates is selected to fail. Further, there are two cases of the single fault model. Either every possible gate failure is tested or a subset of possible gate failures may be tested to assess the reliability of the circuit in hand. The former is, of course, a more thorough and accurate test but uses significant resources. The latter is introduced with a view to reducing evaluation time.
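The two fault models can be sketched as scenario generators. This is a minimal Python sketch and not the authors' implementation: a fault scenario is represented as a set of faulty gate indices, and the function names, the `rng` argument and the `sample_size` argument are assumptions made for illustration.

```python
import random


def gate_reliability_scenario(n_gates, gate_reliability, rng):
    """Gate reliability model: each gate fails independently of the others."""
    return {g for g in range(n_gates) if rng.random() >= gate_reliability}


def single_fault_scenarios(n_gates, sample_size=None, rng=None):
    """Single fault model: every scenario contains exactly one faulty gate.

    With sample_size=None every possible gate failure is returned;
    otherwise a random subset of that size is drawn to save evaluation time.
    """
    scenarios = [{g} for g in range(n_gates)]
    if sample_size is None:
        return scenarios
    rng = rng or random.Random()
    return rng.sample(scenarios, sample_size)


rng = random.Random(42)  # fixed seed only to make the illustration repeatable
print(gate_reliability_scenario(20, 0.9, rng))
print(single_fault_scenarios(5))
print(single_fault_scenarios(20, sample_size=3, rng=rng))
```

Under the gate reliability model the size of a scenario varies from draw to draw, whereas every single fault scenario has size one regardless of circuit size.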

3.3 Measuring Functionality and Reliability

The functionality of a circuit is found by trying all possible input values and recording the respective output values of the circuit. If all recorded output values correspond exactly to the desired truth table for the function, the circuit is working perfectly, otherwise 100% functionality is not achieved. Traditionally, the result of such a test for functionality is either “not working” (0) or “working” (1), referred to as fbool herein.

Table 1. Naming convention for reliability metrics together with fault models applied

Name           Meaning
Rtrad_single   Rtrad using single fault model
Rtrad_gate     Rtrad using gate reliability
Rehw_single    Rehw using single fault model
Rehw_gate      Rehw using gate reliability

When using artificial evolution to create circuits, a measure of functionality is usually included in the fitness function. Since fbool provides little information as to the quality of the solutions that are not 100% functional, evolution is unable to distinguish between two solutions that do not reach 100% functionality. Evolution needs to separate a circuit that is almost working from a circuit that is far from working, even though both of these circuits will have fbool = 0 (“not working”). One way of achieving more information is to measure the Hamming distance between the circuit output and the desired output, i.e. the number of bits that are different between these two solutions. In this paper, the Hamming distance is normalised to the interval [0, 1] where 1 is 100% working. This measure of functionality is called fham. If a circuit is working, both fbool and fham will be 1.
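The difference between the two functionality measures can be illustrated with a short Python sketch. The function names and the example bit strings are hypothetical; the normalisation 1 − distance / length is one reading of "normalised to the interval [0, 1] where 1 is 100% working".

```python
def f_bool(outputs, truth_table):
    """1 if every recorded output matches the truth table, otherwise 0."""
    return 1 if list(outputs) == list(truth_table) else 0


def f_ham(outputs, truth_table):
    """Hamming distance to the truth table, normalised so 1.0 is 100% working."""
    if len(outputs) != len(truth_table):
        raise ValueError("outputs and truth table must have the same length")
    distance = sum(o != t for o, t in zip(outputs, truth_table))
    return 1.0 - distance / len(truth_table)


target = "0110"   # hypothetical two-input truth table
almost = "0111"   # one wrong output bit
far_off = "1001"  # every output bit wrong

for candidate in (target, almost, far_off):
    print(candidate, f_bool(candidate, target), f_ham(candidate, target))
```

Both faulty candidates receive fbool = 0, while fham separates the almost working circuit from the one that is far from working.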

A reliability metric measures how well a circuit functions in the presence of faults. The traditional reliability metric Rtrad is the average of all fbool results after having tested a number of randomly selected fault scenarios. A second reliability measure can be formulated based on fham. The reliability metric Rehw is the average of all fham results after having tested a number of randomly selected fault scenarios.
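As a sketch of the two averages, assume that each tested fault scenario has already been simulated and has produced one recorded output string. The names `r_trad`, `r_ehw` and the `responses` list are assumptions for illustration, and the data is made up.

```python
def r_trad(responses, truth_table):
    """Average of fbool over the tested fault scenarios."""
    return sum(r == truth_table for r in responses) / len(responses)


def r_ehw(responses, truth_table):
    """Average of fham over the tested fault scenarios."""
    bits = len(truth_table)
    scores = [sum(o == t for o, t in zip(r, truth_table)) / bits
              for r in responses]
    return sum(scores) / len(scores)


target = "0110"
# One hypothetical output string per tested fault scenario.
responses = ["0110", "0111", "1001", "0110"]

print(r_trad(responses, target))
print(r_ehw(responses, target))
```

Rtrad only counts the scenarios in which the circuit still works perfectly, while Rehw also gives partial credit for the scenario with a single wrong output bit.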

The possible fault scenarios depend on the fault model chosen. The reliability of the circuit will depend on both the reliability metric and the fault model applied. To aid readability of the discussions and experiments, the naming convention in table 1 is applied in this paper.

It is important to note that these reliability measures may be applied to provide a measure as to how well the circuit, when treated as a black box, performs in the presence of faults. They say nothing about the redundancy structures themselves, which are, of course, the goal of this work. So how can one measure the presence of redundancy in a circuit? Automatic Test Pattern Generation (ATPG) tools may be used to identify redundancy as gates that cannot be tested by any test vector and therefore represent redundant gates. However, this test detects redundancy whether useful (contributing to fault tolerance) or not. The question of detecting useful redundancy structures is in fact quite a complex one as highlighted herein.

4 Discussion, Experiments and Results

The purpose of this section is not only to present the experimental work in this paper, but also to present the process of analysing the problem itself and the intermediate results which led to the lessons learned.

4.1 Experimental Setup

All experiments are conducted on simulations of circuits in a digital feed forward circuit simulator. Only Boolean logic is allowed and the following gates are available: AND, OR, NAND, NOR, NOT. A faulty gate is simulated by inverting the output of the gate. The EA is Cartesian Genetic Programming [17] with the following parameters:

• Maximum number of gates: 200

• Population size: 20

• Tournament selection with elitism (g = 3, p = 0.7)

• Crossover rate: 0.2

• Mutation rate: 0.05 (mutation applied at the gate level)
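The simulation side of this setup can be sketched as follows. This is a minimal illustration and not the simulator used in the experiments: the netlist encoding (a list of `(op, a, b)` tuples whose indices refer to primary inputs first and earlier gates after that), the function names and the XOR example are assumptions, and the Cartesian Genetic Programming search itself is not shown.

```python
from itertools import product

# Gate set from the experimental setup; NOT ignores its second input.
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
    "NOT": lambda a, b: 1 - a,
}


def simulate(circuit, inputs, faulty=frozenset()):
    """Evaluate a feed forward netlist; the last gate drives the output."""
    values = list(inputs)
    for index, (op, a, b) in enumerate(circuit):
        # Feed forward only: a gate reads primary inputs or earlier gates.
        if a >= len(values) or b >= len(values):
            raise ValueError("feedback is not allowed")
        out = OPS[op](values[a], values[b])
        if index in faulty:
            out = 1 - out  # a faulty gate has its output inverted
        values.append(out)
    return values[-1]


def truth_table(circuit, n_inputs, faulty=frozenset()):
    """Outputs for all input combinations, in counting order."""
    return [simulate(circuit, bits, faulty)
            for bits in product((0, 1), repeat=n_inputs)]


# Hypothetical example: XOR built from four NAND gates.
# Values 0 and 1 are the inputs; gate k writes value 2 + k.
xor = [("NAND", 0, 1), ("NAND", 0, 2), ("NAND", 1, 2), ("NAND", 3, 4)]

print(truth_table(xor, 2))
print(truth_table(xor, 2, faulty={3}))
```

Passing a set of gate indices as `faulty` applies one fault scenario, so the same routine serves both fault models.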

4.2 Single Fault Experiments

In earlier work by Hartmann and Haddow [7], experiments were performed using the gate reliability model and circuits were evolved with a fitness function based on Rehw. These experiments provided evidence that evolution traded off functionality for improving fitness. The number of gates was reduced to minimise the probability of having failing gates in the circuit, and instead of creating circuits that were 100% correct, simpler functions were created giving correct output for most of the possible input values.

Why did earlier experiments not lead to redundant structures, even though they were evolved with a reliability metric as fitness?

It seems that evolution chose the simplest way of attacking the problem: avoiding it by shrinking the circuit. When using the gate reliability model, the probability of having a failing gate is reduced when the number of gates is reduced. There is, therefore, an implicit size factor in the fitness evaluation that encourages small circuits, which thus inhibits larger circuits with redundancy.

These earlier experiments focused on making circuits that score high on the chosen reliability metric for given functions, namely multipliers and adders. Further work [5] also looked at the reliability metrics themselves and how they compare and may be applied in the context of traditional and evolved designs. None of this work explicitly searched for redundancy structures, and the goal herein is to find ways to either implicitly or explicitly specify a search leading to redundancy structures. Thus functionality and reliability in terms of a given metric are not in focus in this paper.

How could evolution be forced to create larger and more interesting redundant circuits?

One way of encouraging large circuits is to introduce a size factor in the fitness function that evens out the negative effect that a large circuit has on reliability. Introducing a size factor would require some sort of weighting between functionality and size. However, a size factor does not, in itself, encourage useful redundancy, just more gates.

Redundancy techniques typically introduce some kind of overhead, in terms of the number of gates, and this overhead is especially dominating for smaller circuits, such as those experimented with herein. Further, the number of gates in the circuit affects the probability that the circuit will have a gate that fails when applying the gate reliability model. However, applying the single fault model removes any bias towards smaller circuits.

The single fault model may be applied with either the Rehw or the Rtrad metric. As stated, Rehw provides a measure of how a circuit degrades in the presence of faults. Rehw may have a non-zero value even with faults present, whether or not the circuit has any redundancy. Rtrad, on the other hand, is zero in the presence of a fault unless redundancy is present. As such Rtrad may be said to be a better indicator of redundancy. Rtrad does not, however, provide evolution with sufficiently fine-grained information. A two part fitness function was thus created, as presented in equation (1). For the purpose of the experiments herein, k1 was set to 0.3 and k2 was set to 0.7.

$$ f = k_1 \cdot f_{ham} + k_2 \cdot R_{trad\_single} \qquad (1) $$

The first part containing fham takes care of building a functional circuit. Before fham is 1.0, Rtrad_single will remain zero and thus does not contribute to the total fitness. When fitness = 0.3 (that is, k1 · fham with fham = 1.0), a fully functional circuit has been evolved and Rtrad_single will be the part that evolution will have to increase in order to improve fitness.

As stated, a particular function is not the focus of the work herein. However, a function is needed for fitness evaluation. What function should be applied?

In this work, redundant structures are sought, requiring analysis of the evolved circuits after evolution. Earlier work [7] focused on 2 · 2 multipliers and adders. Having more than one output is, from a functional point of view, the same as having several circuits (even though logic may be shared between different outputs). Concentrating the evolutionary efforts on only one output thus seemed reasonable. In addition, analysis of a single output circuit is somewhat simpler than for a multiple output circuit.

Table 2. Results from Rtrad_single experiments

#   Size   Fitness   Rtrad_single   Rtrad_single (opt)
0   72     0.862     0.792          0
1   111    0.936     0.901          0
2   90     0.906     0.856          0
3   40     0.803     0.700          0
4   132    0.941     0.909          0
5   99     0.907     0.859          0
6   68     0.854     0.779          0
7   96     0.926     0.885          0
8   46     0.798     0.696          0
9   65     0.891     0.831          0

Figure 1. Redundancy structure exploiting unreachable gate (diagram labels: inputs, subcircuits f and r, gates 1 and 2, output)

A suitable function was constructed with the truth table “1001011101100110” (bit zero to the right). It is relatively easy to evolve, has four inputs and one output, and has a non-redundant implementation of nine gates.

The results of ten evolution runs are given in table 2. Note that it would seem that evolution has been forced to use more gates and a reasonable fitness is achieved with Rtrad_single lying between 0.7 and 0.9.

The results seemed promising until a manual inspection identified that what was being exploited by evolution was the concept of unreachable gates.

An example of unreachable gates is shown in figure 1, where the ellipse marked f is a circuit performing our desired function. The OR gate marked 1 is unreachable. It will always have a constant one as output, no matter what the main circuit inputs are. The subcircuit having this unreachable gate as its output, marked r, will not affect the main circuit output in any way, so any faults in this area will be swallowed without creating a wrong output. Making a small functioning subcircuit f and making a large variant of such an unwanted subcircuit r will result in a high Rtrad_single.

These unreachable gate structures are, however, not useful for our purpose. They contain no real redundancy. The last column in table 2, “Rtrad_single (opt)”, shows the value of Rtrad_single after the circuits have been optimised in such a way that the unreachable gate structures have been removed. The results show clearly that no other form of redundancy is present in these circuits.

Table 3. Results from Rtrad_single experiments where unreachable gate structures are avoided

#   Size   Fitness   Rtrad_single   Rtrad_single (opt)
0   106    0.933     0.896          0.896
1   135    0.937     0.904          0.904
2   135    0.937     0.904          0.904
3   61     0.883     0.820          0.820
4   61     0.883     0.820          0.820
5   108    0.921     0.880          0.880
6   109    0.922     0.881          0.881
7   108    0.921     0.880          0.880
8   101    0.923     0.881          0.881
9   101    0.923     0.881          0.881

The fitness value for circuits with unreachable gate structures is high. Such circuits are therefore the kind of circuits that are likely to be promoted to future generations. Even if circuits with good redundancy structures exist in early generations, they are probably discarded because the fitness function is, again, not explicit enough as to what a redundant structure is.

4.3 Single Fault, Excluding Unreachable Gates

Growing the aforementioned unreachable gate structures is probably the easy solution for the EA. Those structures should therefore be discouraged in some way so that other, more useful redundancy structures may emerge.

How can evolution be forced to increase the size of the circuits without exploiting unreachable gate structures?

Again, one might say that evolution has found a way to avoid the problem of gate faults but has not solved it. Removing the possibility of evolution creating unreachable gates would constrain evolution’s freedom to explore for circuits. It was, therefore, deemed more appropriate to modify the fitness function such that unreachable gate structures do not contribute positively to the fitness value. This was achieved by detecting all gates that are part of a subcircuit with unreachable gates as outputs. When applying the single faults, faults were only applied to gates outside these subcircuits.
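One possible way to perform this detection is sketched below. It is an assumption about how the exclusion could work rather than the procedure actually used: gates with a constant output over all inputs are found by exhaustive simulation, and the gates kept as fault targets are those reached by walking backwards from the output without passing through a constant gate. The netlist encoding and all names are hypothetical.

```python
from itertools import product

OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
    "NOT": lambda a, b: 1 - a,
}


def constant_gates(circuit, n_inputs):
    """Indices of gates whose output is the same for every input vector."""
    seen = [set() for _ in circuit]
    for bits in product((0, 1), repeat=n_inputs):
        values = list(bits)
        for g, (op, a, b) in enumerate(circuit):
            values.append(OPS[op](values[a], values[b]))
            seen[g].add(values[-1])
    return {g for g, outputs in enumerate(seen) if len(outputs) == 1}


def fault_targets(circuit, n_inputs):
    """Gates that may receive a single fault (assumed exclusion rule)."""
    constant = constant_gates(circuit, n_inputs)
    targets, stack = set(), [len(circuit) - 1]  # start at the output gate
    while stack:
        g = stack.pop()
        if g in targets or g in constant:
            continue  # do not walk into a subcircuit behind a constant gate
        targets.add(g)
        op, a, b = circuit[g]
        for source in ((a,) if op == "NOT" else (a, b)):
            if source >= n_inputs:
                stack.append(source - n_inputs)
    return targets


# Hypothetical circuit in the spirit of figure 1; gate k writes value 2 + k.
circuit = [
    ("AND", 0, 1),  # gate 0: f, the wanted function
    ("NOR", 0, 1),  # gate 1: part of r
    ("NOT", 3, 3),  # gate 2: part of r
    ("OR", 3, 4),   # gate 3: constant one, the unreachable gate
    ("AND", 2, 5),  # gate 4: combines f with the constant
]
print(constant_gates(circuit, 2), fault_targets(circuit, 2))
```

Because the walk stops at constant gates, a subcircuit that only feeds such a gate no longer inflates Rtrad_single, however large it grows.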

A summary of the results of adapting the fault model to apply faults at only reachable gates may be found in table 3. As shown, it would seem that the problem of unreachable gates was solved and reasonable fitness was again reached. However, evolution once again found a way to cheat.

Figure 2. Reachable gates that do not contribute to the output (diagram labels: inputs, subcircuits f and r, gates 1 and 2, output)

When known unreachable gate structures were made unprofitable, another and similar kind of structure was invented by the EA. Instead of introducing unreachable gates at the exit point for a large random subcircuit, structures such as the example shown in figure 2 were created. Similar to the previous example, the ellipse marked f represents a circuit performing our desired function. Circuit f is connected to an AND gate (marked 2). Evolution exploits the fact that the second input to this AND gate has “input don’t cares” whenever the first input is a logical 1. By introducing a structure, such as the one represented by gate 1, evolution can once again grow a large circuit r that does not contribute to the output in any way, yet scores positively on fitness.
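The effect can be reproduced with a small example. The exact gates in figure 2 are not recoverable from the text, so the sketch below assumes one structure consistent with the description, the absorption identity AND(f, OR(f, r)) = f. The netlist encoding and names are hypothetical.

```python
from itertools import product

OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
    "NOT": lambda a, b: 1 - a,
}


def table(circuit, n_inputs, faulty=frozenset()):
    """Truth table of the last gate, with the given gates inverted."""
    rows = []
    for bits in product((0, 1), repeat=n_inputs):
        values = list(bits)
        for index, (op, a, b) in enumerate(circuit):
            out = OPS[op](values[a], values[b])
            values.append(1 - out if index in faulty else out)
        rows.append(values[-1])
    return rows


# Values 0 and 1 are the inputs; gate k writes value 2 + k.
circuit = [
    ("AND", 0, 1),  # gate 0: f, the wanted function
    ("NOR", 0, 1),  # gate 1: r, an arbitrary subcircuit
    ("OR", 2, 3),   # gate 2: assumed role of "gate 1" in figure 2
    ("AND", 2, 4),  # gate 3: assumed role of "gate 2" in figure 2
]

reference = table(circuit, 2)
swallowed = {g for g in range(len(circuit))
             if table(circuit, 2, {g}) == reference}
print(reference, swallowed)
print(table(circuit[:3], 2))  # the OR gate is not constant
```

Here the OR gate takes both values over the input space, so a test for constant outputs does not flag it, yet a fault in r never reaches the output.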

This is also an unwanted structure. It is, however, not as easy to detect automatically as the unreachable gate structures because there is an unlimited number of ways to construct similar solutions. It is, therefore, a challenge to exclude such structures from fitness evaluation.

4.4 Redundant Subcircuits

How can evolution be forced to put redundant structures within the circuit itself?

An alternative way of evolving redundant circuits is to split the target circuit into smaller subcircuits and evolve redundant versions of these subcircuits. This may provide evolution with a simpler problem to evolve and analysing these smaller circuits for redundancy might be simpler.

The problem of selecting subcircuits is not an easy one. What granularity to use? How to avoid illegal structures, like feedback loops? Since the goal of this work has nothing to do with partitioning, basic logic gates were selected as subcircuits so as to avoid the partitioning problem. Redundant versions of the basic logic gates were evolved and these, together with the logic gates themselves, were available to evolution to create a redundant circuit.

Again, evolution chose to introduce non-contributing gate structures into the redundant logic gates, in the same way as it had done for the complete circuit (figure 2).

4.5 Larger Gate Reliability Experiments

When it was clear that using Rtrad_single in the fitness function did not result in any useful redundancy structures, it was decided to try looking once again at the gate reliability model.

How could the complexity of the function sought be increased whilst avoiding the implicit size reduction in the fitness function and achieving a reasonable evaluation time despite applying the gate reliability model?

One of the challenges in using a gate reliability model is that there are a large number of possible fault scenarios. Using only a few fault scenarios drastically reduces the evaluation time at the expense of a very noisy fitness evaluation. A noisy fitness evaluation makes the task harder for evolution.

One possibility is to exploit the fact that the number of faulty gates in a fault scenario with the gate reliability model is binomially distributed. If X is a random variable for the number of faults in a fault scenario, x is the number of faults, n is the number of gates in the circuit and p is the fail rate for the gates (1 − gate reliability), equation (2) may be used to find the probability of having a specific number of faulty gates in a fault scenario.

$$ P[X = x] = b(x; n, p) = \binom{n}{x} p^{x} (1 - p)^{n - x} \qquad (2) $$

The circuit may be evaluated with the zero fault scenario and all the single fault scenarios and the results may be scaled by the probability for that number of faults (x0 and x1). However, the case of more than one fault would still be computationally expensive and, as such, it was chosen to express reliability excluding a component for more than one fault (see equations (3) and (4) for Rtrad and Rehw respectively).

$$ R_{trad\_gate} = x_0 \cdot f_{bool} + x_1 \cdot R_{trad\_single} \qquad (3) $$

$$ R_{ehw\_gate} = x_0 \cdot f_{ham} + x_1 \cdot R_{ehw\_single} \qquad (4) $$
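A short Python sketch of equations (2) to (4) is given below. The function names are hypothetical and the fail rate of 0.01 is an illustrative value chosen here, not one stated in the text.

```python
from math import comb


def fault_count_probability(x, n, p):
    """Equation (2): probability of exactly x faulty gates out of n."""
    return comb(n, x) * p**x * (1 - p)**(n - x)


def quick_estimate(n_gates, fail_rate, f_no_fault, r_single):
    """Equations (3) and (4): weight the zero fault and single fault results.

    Pass fbool and Rtrad_single to estimate Rtrad_gate, or fham and
    Rehw_single to estimate Rehw_gate. Scenarios with two or more
    faults are left out, exactly as in the equations.
    """
    x0 = fault_count_probability(0, n_gates, fail_rate)
    x1 = fault_count_probability(1, n_gates, fail_rate)
    return x0 * f_no_fault + x1 * r_single


# A working 17 gate circuit without redundancy (Rtrad_single = 0),
# evaluated at an assumed fail rate of 0.01.
print(quick_estimate(17, 0.01, f_no_fault=1, r_single=0.0))
```

Since x0 + x1 is less than 1 whenever p > 0 and the circuit has more than one gate, the estimate can never reach 1, and the shortfall grows with circuit size.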

The third output of the 3 · 3 multiplier was chosen because its non-redundant implementation needs 17 gates, as opposed to 9 in the earlier experiments. A fitness function was designed that ensures that a fully functioning circuit always scores higher than a circuit that does not function 100% correctly. This function is shown in equation (5). As shown, as long as the functionality is not 100% (fham is not 1.0), the Rehw_gate part of the fitness function does not contribute to the fitness value.

$$ f = k_1 \cdot f_{ham} + k_2 \cdot \begin{cases} 0 & \text{if } f_{ham} < 1.0 \\ R_{ehw\_gate} & \text{if } f_{ham} = 1.0 \end{cases} \qquad (5) $$
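The gating in equation (5) can be written as a small Python function. The default weights 0.3 and 0.7 are the values quoted for equation (1); using them here is an assumption, as the text does not restate k1 and k2 for this experiment.

```python
def gated_fitness(f_ham, r_ehw_gate, k1=0.3, k2=0.7):
    """Equation (5): reliability only counts once the circuit is 100% functional."""
    reliability_term = r_ehw_gate if f_ham == 1.0 else 0.0
    return k1 * f_ham + k2 * reliability_term


# An almost working circuit with excellent reliability still scores
# below a fully working circuit with no measured reliability at all.
print(gated_fitness(15 / 16, 0.99))
print(gated_fitness(1.0, 0.0))
print(gated_fitness(1.0, 0.9))
```

With these weights every non-functional circuit scores below k1, while every functional circuit scores at least k1, which is the ordering the text asks for.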


Table 4. Results from larger Rehw_gate experiments

#   Size   Fitness   fham   Rehw_gate   Rtrad_single
0   23     0.941     1      0.921       0
1   22     0.946     1      0.928       0
2   23     0.944     1      0.928       0
3   23     0.946     1      0.928       0
4   23     0.936     1      0.916       0
5   26     0.940     1      0.924       0
6   24     0.940     1      0.923       0
7   23     0.928     1      0.907       0
8   25     0.943     1      0.928       0
9   24     0.943     1      0.928       0

The best fit individuals from ten evolutionary runs are shown in table 4. It is clear from the results that while evolution is able to make circuits that work 100% when no faults are applied (fham equals 1.0), they do not include any redundancy: Rtrad_single is zero so no gates may fail without damaging the output. Instead of introducing redundancy, evolution tries to minimise the number of gates while still maintaining functionality. This is also obvious from the gate counts which are again significantly lower than those in tables 2 and 3, despite the larger circuit being evolved.

Is Rehw_gate good enough at rewarding redundancy?

A benchmark that may be used for investigating how well the fitness function rewards redundancy is to test the fitness function on a circuit and on a TMR-version of the same circuit. TMR is a known redundancy structure, and if a fitness function scores lower on the TMR version of the circuit it is an indication that the fitness function might also score lower for other redundant structures that might be considered interesting. A 17 gate example of the third output of the 3 · 3 multiplier was investigated in this context.
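The benchmark idea can be tried on a toy circuit. The sketch below is an illustration under stated assumptions: the netlist encoding, the four gate majority voter OR(AND(a, b), AND(c, OR(a, b))) and the four NAND XOR used as the plain circuit are choices made here, not the 17 gate multiplier output or the voter used in the paper.

```python
from itertools import product

OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
    "NOT": lambda a, b: 1 - a,
}


def truth_table(circuit, n_inputs, faulty=frozenset()):
    rows = []
    for bits in product((0, 1), repeat=n_inputs):
        values = list(bits)
        for index, (op, a, b) in enumerate(circuit):
            out = OPS[op](values[a], values[b])
            values.append(1 - out if index in faulty else out)
        rows.append(values[-1])
    return rows


def r_trad_single(circuit, n_inputs, target):
    """Fraction of single gate faults that leave the truth table intact."""
    tolerated = sum(truth_table(circuit, n_inputs, {g}) == target
                    for g in range(len(circuit)))
    return tolerated / len(circuit)


def tmr(circuit, n_inputs):
    """Three copies of the circuit followed by a four gate majority voter."""
    size = len(circuit)
    result = []
    for copy in range(3):
        shift = copy * size  # re-index references to gates, not to inputs
        for op, a, b in circuit:
            result.append((op,
                           a if a < n_inputs else a + shift,
                           b if b < n_inputs else b + shift))
    m0, m1, m2 = (n_inputs + copy * size + size - 1 for copy in range(3))
    v0 = n_inputs + 3 * size  # value index of the first voter gate
    result += [("OR", m0, m1), ("AND", m2, v0),
               ("AND", m0, m1), ("OR", v0 + 2, v0 + 1)]
    return result


plain = [("NAND", 0, 1), ("NAND", 0, 2), ("NAND", 1, 2), ("NAND", 3, 4)]
target = truth_table(plain, 2)
protected = tmr(plain, 2)

print(len(plain), r_trad_single(plain, 2, target))
print(len(protected), r_trad_single(protected, 2, target))
```

In this toy case the plain circuit tolerates no single fault, while the TMR version tolerates every fault in the three copies and only some voter faults break it. A metric intended to reward redundancy should therefore rank the TMR version higher.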

This benchmarking was performed both on Rehw_gate and Rtrad_gate, and the results are shown in table 5. The “quick” way of estimating the metrics is that described in equations (3) and (4) and applied during evolution. “MC” reflects a thorough Monte Carlo simulation of the circuit. It can be seen in the table that both Rehw_gate and Rtrad_gate (MC) correctly favour the TMR circuit, but that the “quick estimate” of Rehw_gate rates the TMR circuit lower than the non-redundant version. This would indicate that the fitness function seems reasonable for Rtrad_gate but not for Rehw_gate. This may be explained by the fact that with two or more faults appearing in a circuit, Rtrad_gate is close to zero. However, Rehw_gate will be significantly more than zero even for two or more faults. As such, the choice to exclude the “more than one fault” component of (4) is detrimental to Rehw_gate.
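How much probability mass the quick estimate leaves out can be computed directly from the binomial distribution. In the sketch below the circuit sizes 17 and 55 are taken from table 5, whereas the fail rate of 0.01 is an assumed illustrative value, since the rate used in the experiments is not stated here. The function name is hypothetical.

```python
from math import comb


def covered_mass(n_gates, fail_rate):
    """Probability of zero or one fault: x0 + x1 in equations (3) and (4)."""
    return sum(comb(n_gates, x) * fail_rate**x * (1 - fail_rate)**(n_gates - x)
               for x in (0, 1))


for name, size in (("Plain", 17), ("TMR", 55)):
    dropped = 1 - covered_mass(size, 0.01)
    print(name, size, round(dropped, 3))
```

Under this assumption the ignored multiple fault scenarios carry roughly 1% of the probability for the 17 gate circuit but roughly 10% for the 55 gate TMR circuit. For Rehw_gate, where those scenarios would still contribute partial credit, the larger circuit therefore loses more in the quick estimate.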

Table 5. Benchmarking Rehw_gate and Rtrad_gate using 3 · 3 multiplier, output 3

                Rehw_gate         Rtrad_gate
        Size    Quick    MC       Quick    MC
Plain   17      0.893    0.900    0.843    0.842
TMR     55      0.882    0.950    0.878    0.905

When using the quick estimate, Rtrad_gate seems better suited for evolving redundant structures and this was tried experimentally. The fitness function applied is shown in equation (6) and the results are shown in table 6.

Table 6. Results from larger Rtrad_gate experiments

#   Size   Fitness   fham   Rtrad_gate   Rtrad_single
0   18     0.884     1      0.843        0
1   18     0.884     1      0.829        0
2   21     0.867     1      0.809        0
3   22     0.861     1      0.805        0
4   20     0.873     1      0.817        0
5   21     0.867     1      0.809        0
6   18     0.884     1      0.832        0
7   21     0.867     1      0.812        0
8   21     0.867     1      0.812        0
9   21     0.867     1      0.811        0

$$ f = k_1 \cdot f_{ham} + k_2 \cdot R_{trad\_gate} \qquad (6) $$

For these experiments, 100% functioning circuits are evolved but without any redundancy (shown by the zero values in the Rtrad_single column). Instead, evolution tries its best to make the circuit as small as possible without removing functionality.

4.6 TMR Seeded Population

If it is hard to make evolution create redundant structures, why not start at the other end and give evolution redundant structures and let it prune them?

The TMR experiments in this paper were conducted by seeding the starting population with one TMR-circuit. The rest of the population was random circuits. The experiments presented in the previous sections were rerun with TMR-seeded populations.

When using the Rtrad_single based fitness function in equation (1) and avoiding unreachable gates, the original TMR structure of the seeding individual was kept. In addition, the EA introduced to the TMR circuit the kind of structure shown in figure 2, making it score highly on fitness. This experiment did not, however, result in anything new or useful.


Evolving for Rehw_gate using the fitness function in equation (5) was also tried with a TMR-seeded population and, interestingly, the TMR structure was not kept in this case. Instead, redundancy was removed. This is most likely caused by the fact that the quick estimator of Rehw_gate is not favouring the TMR circuit over a non-redundant one.

When using Rtrad_gate and the fitness function in equation (6), the TMR circuit is kept without any change at all. Evolution is unable to find any way of changing the TMR circuit that gets better fitness, and because of elitism, the same TMR circuit is kept as the best one.

5 Concluding Remarks and Future Work

Several different attempts at evolving redundancy structures were tried in this paper and the results illustrate the difficulty inherent in evolving redundant structures. When the single fault model was applied, new structures were created that scored high on fitness, but that provided no useful redundancy. When the gate reliability model was applied, evolution responded by making small circuits without any redundancy at all.

The challenge is to specify a fitness function that correctly scores high on circuits with useful redundancy structures and scores low on structures that are useless from a fault tolerance point of view, whilst still maintaining evolvability and encouraging the creation of such structures. One approach in this paper has been to actively avoid a known unwanted structure, only to discover that other forms of unwanted structures were invented instead. A better approach would be a more general way of classifying redundancy as useful or not, so that only useful redundant gates are included when calculating fitness. Further work will investigate better algorithms for detecting useful redundancy.

The functionality of the evolved circuits is not the focus herein, but the functionality may still affect how easily evolution can create redundancy structures. An interesting experiment would be to co-evolve the function itself together with the reliable circuits implementing this function.

In summary, this work has shown that evolution “cheats” by avoiding the problem instead of attacking the problem aggressively.

References

[1] Altera. Apex redundancy. http://www.altera.com/products/devices/apex/features/apx-redundancy.html.

[2] W. B. Culbertson, R. Amerson, R. J. Carter, P. Kuekes, andG. Snider. Defect tolerance on the teramac custom com-puter. In Proc. IEEE Symposium on FPGA-Based CustomComputing Machines (FCCM), page 116, 1997.

[3] A. Djupdal and P. C. Haddow. Yield enhancing defect toler-ance techniques for FPGAs. In FPL 2007, 2007. Submittedto FPL 2007.

[4] A. E. Eiben and J. E. Smith. Introduction to EvolutionaryComputing. Springer, 2003.

[5] P. C. Haddow, M. Hartmann, and A. Djupdal. Addressingthe metric challange: Evolved versus traditional fault tol-erant circuits. In Adaptive Hardware and Systems (AHS),2007.

[6] F. Hanchek and S. Dutt. Methodologies for tolerating celland interconnect faults in FPGAs. IEEE Transactions onComputers, 47(1):15–33, 1998.

[7] M. Hartmann and P. C. Haddow. Evolution of fault-tolerantand noise-robust digital designs. IEE Proceedings - Com-puters and Digital Techniques, 151(4):287–294, jul 2004.

[8] F. Hatori, T. Sakurai, K. Nogami, K. Sawada, M. Taka-hashi, M. Ichida, M. Uchida, I. Yoshii, Y. Kawahara, T. Hibi,Y. Saeki, H. Muraoga, A. Tanaka, and K. Kanzaki. Introduc-ing redundancy in field programmable gate arrays. In Proc.IEEE Custom Integrated Circuits Conference, pages 7.1.1–7.1.4, 1993.

[9] H. Hemmi, J. Mizoguchi, and K. Shimohara. Developmentand evolution of hardware behaviors. In Artificial Life IV:Proc. 4th Int. Workshop Synthesis Simulation Living Syst.,pages 371–376. MIT Press, 1994.

[10] T. Higuchi, T. Niwa, T. Tanaka, H. Iba, H. de Garis, andT. Furuya. Evolving hardware with genetic learning: a firststep towards building a darwin machine. In Proc. 2nd int.conf. From animals to animats: simulation of adaptive be-havior, pages 417–424, 1993.

[11] ITRS. International technology roadmap for semiconduc-tors. Technical report, ITRS, 2005.

[12] I. Koren and Z. Koren. Defect tolerance in VLSI circuits:Techniques and yield analysis. Proceedings of the IEEE,86(9):1819–1837, sep 1998.

[13] J. Lach, W. H. Mangione-Smith, and M. Potkonjak. Lowoverhead fault-tolerant FPGA systems. IEEE Trans. VeryLarge Scale Integr. Syst., 6(2):212–221, 1998.

[14] P. K. Lala. Self-Checking and Fault Tolerant Digital Design.Morgan Kaufmann Publishers, 2001.

[15] N. J. Macias and L. J. K. Durbeck. Adaptive methods forgrowing electronic circuits on an imperfect synthetic matrix.Biosystems, 73(3):173–204, 2004.

[16] J. F. Miller. Evolving a self-repairing, self-regulating, frenchflag organism. In Genetic and Evolutionary Computation(GECCO), pages 129–139, 2004.

[17] J. F. Miller, D. Job, and V. K. Vassilev. Principles in theevolutionary design of digital circuits part i. Journal ofGenetic Programming and Evolvable Machines, 1(1):8–35,2000.

[18] Xilinx. Xilinx virtex 5 overview. http://www.xilinx.com/products/virtex5/index.htm.

[19] A. J. Yu and G. G. F. Lemieux. Defect-tolerant FPGA switchblock and connection block with fine-grain redundancy foryield enhancement. In Proc. Field Programmable Logic andApplications, pages 255–252, 2005.

76

Paper IV

Evolving and Analysing “Useful” Redundant Logic
Asbjørn Djupdal and Pauline C. Haddow
In International Conference on Evolvable Systems (ICES), pages 256–267, 2007

Evolving and Analysing Useful Redundant Logic

Asbjoern Djupdal and Pauline C. Haddow

CRAB Lab (http://crab.idi.ntnu.no)
Department of Computer and Information Science
Norwegian University of Science and Technology

djupdal,[email protected]

Abstract. Fault tolerance is an increasing challenge for integrated circuits due to semiconductor technology scaling. This paper looks at how artificial evolution may be tuned to the creation of novel redundancy structures which may be applied to meet this challenge. An experimental setup and results for creating useful redundant structures are presented.

1 Introduction

As the semiconductor feature size decreases and the number of transistors on a single chip increases, one of the growing challenges facing the electronic design community is faulty behaviour [1]. This challenge may be met by improved fault tolerance methods. The semiconductor fault challenge may, in general, be a long term challenge, but it is here today for large ICs such as FPGAs. The mass production of FPGAs enables them to be produced in the newest technologies. Xilinx Virtex 5 [2] is an example of a new FPGA series from Xilinx, produced in 65nm technology with up to 330,000 logic cells.

If faults are expected to occur in a digital circuit, fault tolerance (the ability to function correctly in the presence of faults) may be achieved by incorporating redundancy (additional resources) in some form. These additional resources may be in the form of additional hardware, in which case it is called hardware redundancy [3], the focus of this paper.

To find new redundancy techniques it is important to free oneself from the constraints imposed by thinking in terms of traditional redundancy techniques. The way one thinks when designing circuits is influenced by the way that one is taught electronics, the electronics one has designed and the tools used in the design process. One way of freeing oneself from these human and design automation constraints is to search for ideas using some sort of heuristic search process. One such process is that of evolutionary algorithms [4].

The application of evolutionary algorithms to the design of hardware is termed evolvable hardware [5]. The goal is either to explore for unique solutions or to optimise existing solutions. However, in both cases, the goal is usually to obtain a given behaviour, e.g. a binary adder [6]. Further, evolution may be applied when seeking some sort of structure, such as evolving the French flag [7]. In both these cases the goal may be explicitly defined and given to the evolutionary algorithm for comparison between the evolving solutions and the sought solutions. In the former case it is the functionality that needs to be explicitly defined, whereas in the latter case it is the structure that needs to be explicitly defined.

In this work, the goal is to push evolution to find useful redundant structures for achieving fault tolerance whilst retaining full functionality. However, these redundant structures are unknown, unlike the case of the earlier mentioned French flag problem. It is not possible to explicitly describe the structure that one is seeking, only the functionality of the sought circuit, perhaps in terms of a truth table.

Section 2 gives an overview of necessary background material. Section 3 presents relevant previous work. The experimental setup is found in section 4, with results and discussion in section 5. The paper concludes in section 6.

2 Background

2.1 Fault Models and Simulated Faults

Two fault models are considered in this work: the gate reliability model and the single fault model. In the gate reliability model, each gate has a certain probability of failing. A fault scenario is one possible configuration of faulty gates for a given circuit. To create a fault scenario for the gate reliability model, each gate in the circuit is tested against a random number generator and selected to be faulty or not based on a chosen gate reliability. This may be said to be a reasonable model of reality, as the probability of having failing gates in a circuit is directly proportional to the number of gates in the circuit.

In the single fault model, a circuit has exactly one fault at any time. To create a fault scenario for the single fault model, one and only one of the gates is selected to fail.

A failing gate can be modelled in several ways. This paper models a failing gate by inverting its output, which can be regarded as a worst-case scenario. Although an inverted output is not a realistic fault for a defective CMOS gate, this fault model is useful for simulation and analysis purposes because it ensures a wrong output for all possible input values.
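The two fault models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the representation of a fault scenario as a set of gate indices, and the use of Python's `random.Random` are choices made here and are not taken from the paper.

```python
import random

def gate_reliability_scenario(n_gates, reliability, rng):
    """Gate reliability model: every gate fails independently.

    A gate is marked faulty when a uniform draw from [0, 1) is not below
    `reliability`, so `reliability` is the per-gate survival probability.
    """
    return {g for g in range(n_gates) if rng.random() >= reliability}

def single_fault_scenario(n_gates, rng):
    """Single fault model: one and only one gate is selected to fail."""
    return {rng.randrange(n_gates)}

def gate_output(correct_value, is_faulty):
    """A failing gate is modelled by inverting its (binary) output."""
    return 1 - correct_value if is_faulty else correct_value

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
many_faults = gate_reliability_scenario(100, 0.95, rng)
one_fault = single_fault_scenario(100, rng)
```

Under the gate reliability model the size of the scenario varies from draw to draw, whereas the single fault scenario always contains exactly one gate.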

2.2 Redundancy

A redundant gate in a circuit is a gate that may fail without damaging the circuit's outputs. To find out whether a gate in a circuit is redundant, a gate redundancy test may be performed in which the gate is temporarily made defective. If this does not affect the circuit outputs, the gate is redundant. Finding all redundant gates in a circuit involves applying the redundancy test to all gates one by one.
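A minimal sketch of the gate redundancy test follows. The circuit encoding (a list of `(op, a, b)` tuples evaluated in order, with the last gate as the single output) and the example circuit are assumptions made for illustration; the paper does not prescribe a data structure.

```python
from itertools import product

# Two-input gate functions; NOT ignores its second operand.
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
    "NOT": lambda a, b: 1 - a,
}

def truth_table(gates, n_inputs, faulty=None):
    """Output of the last gate for every input vector.

    Node k < n_inputs is primary input k; node n_inputs + i is gate i.
    The gate with index `faulty` (if any) has its output inverted.
    """
    table = []
    for bits in product((0, 1), repeat=n_inputs):
        values = list(bits)
        for i, (op, a, b) in enumerate(gates):
            out = OPS[op](values[a], values[b])
            values.append(1 - out if i == faulty else out)
        table.append(values[-1])
    return table

def redundant_gates(gates, n_inputs):
    """Apply the redundancy test to every gate, one by one."""
    reference = truth_table(gates, n_inputs)
    return [g for g in range(len(gates))
            if truth_table(gates, n_inputs, faulty=g) == reference]

# Hypothetical example: four AND modules (gates 0-3) feeding an
# OR/OR/AND output stage (gates 4-6).
circuit = [("AND", 0, 1)] * 4 + [("OR", 2, 3), ("OR", 4, 5), ("AND", 6, 7)]
print(redundant_gates(circuit, 2))  # -> [0, 1, 2, 3]
```

In this hypothetical circuit the four module gates pass the test, while the three gates of the output stage do not.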

The ultimate goal of this work is not redundancy, but reliability. Some forms of redundancy are known to enhance a circuit's reliability, while other forms of redundancy consist of “dead meat” that does not contribute and should be optimised away from the circuit. In this paper the term useful redundancy is used for redundant gates that have a useful purpose in the circuit, while fake redundancy is used for gates that have no useful purpose.


2.3 Measuring Functionality and Reliability

The functionality of a circuit is found by trying all possible input values and recording the respective output values of the circuit. If all recorded output values correspond exactly to the desired truth table for the function, the circuit is working perfectly; otherwise 100% functionality is not achieved. Traditionally, the result of such a test for functionality is either not working (0) or working (1), referred to as fbool herein.

When using artificial evolution to create circuits, fbool is too coarse grained to be used for guiding evolution towards a working circuit. One way of giving evolution more information about how far an individual is from a working solution is to measure the Hamming distance between the circuit output and the desired output, i.e. the number of bits that differ between the two. This is then normalised to the interval [0, 1], where 1 is 100% working. This measure of functionality is called fham in this paper.
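The two functionality measures can be written down directly. The sketch below assumes that truth tables are given as equal-length lists of 0/1 values; the names `f_bool` and `f_ham` mirror fbool and fham but are otherwise illustrative.

```python
def f_bool(actual, desired):
    """1 if every recorded output matches the desired truth table, else 0."""
    return 1 if list(actual) == list(desired) else 0

def f_ham(actual, desired):
    """Hamming-distance based functionality, normalised to [0, 1].

    1.0 means every output bit is correct; each wrong bit lowers the
    score by 1 / (number of truth table entries).
    """
    differing = sum(a != d for a, d in zip(actual, desired))
    return 1.0 - differing / len(desired)

xor_table = [0, 1, 1, 0]  # desired function
or_table = [0, 1, 1, 1]   # measured outputs of a candidate circuit
print(f_bool(or_table, xor_table), f_ham(or_table, xor_table))  # -> 0 0.75
```

The candidate differs from the target in one of four entries, so fbool rejects it outright while fham still rewards the three correct entries.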

A reliability metric measures how well a circuit functions in the presence of faults. The traditional reliability metric used in this paper is called Rtrad and is the average of all fbool results after having tested a number of randomly selected fault scenarios. The possible fault scenarios depend on the fault model chosen. In this paper the traditional reliability metric Rtrad is used together with the single fault model and is named Rtrad_single.

A reliability metric may also be based on fham and is called Rehw. Rehw is the average of all fham results after having tested a number of randomly selected fault scenarios.
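Both reliability metrics are averages of a functionality score over fault scenarios. The sketch below is an illustration under assumptions: it uses the single fault model for both metrics and enumerates every single-fault scenario once instead of sampling them randomly, and the circuit encoding and example circuit are hypothetical.

```python
from itertools import product

OPS = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b,
       "NOT": lambda a, b: 1 - a}

def truth_table(gates, n_inputs, faulty=None):
    """Last gate's output for all inputs; gate `faulty` is inverted."""
    table = []
    for bits in product((0, 1), repeat=n_inputs):
        values = list(bits)
        for i, (op, a, b) in enumerate(gates):
            out = OPS[op](values[a], values[b])
            values.append(1 - out if i == faulty else out)
        table.append(values[-1])
    return table

def f_bool(actual, desired):
    return 1 if actual == desired else 0

def f_ham(actual, desired):
    return 1.0 - sum(a != d for a, d in zip(actual, desired)) / len(desired)

def reliability(gates, n_inputs, desired, score):
    """Average `score` over all single-fault scenarios (one per gate)."""
    results = [score(truth_table(gates, n_inputs, faulty=g), desired)
               for g in range(len(gates))]
    return sum(results) / len(results)

# Hypothetical circuit: four AND modules feeding an OR/OR/AND output stage.
circuit = [("AND", 0, 1)] * 4 + [("OR", 2, 3), ("OR", 4, 5), ("AND", 6, 7)]
and2 = [0, 0, 0, 1]
r_trad_single = reliability(circuit, 2, and2, f_bool)  # fbool-based average
r_ehw_single = reliability(circuit, 2, and2, f_ham)    # fham-based average
```

For this circuit the fbool-based average is 4/7, because four of the seven gates can fail without changing the output, while the fham-based average is higher because partially correct outputs still earn credit.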

3 Previous Work

In the earlier work of Hartmann and Haddow [8], circuits were evolved with an Rehw based fitness function using the gate reliability model. The results provided clear evidence that evolution traded off functionality for reliability. Instead of making 100% functional circuits and tolerating faults using redundancy, evolution shrunk the circuits. For the gate reliability model, the probability of having a faulty gate in a circuit is directly proportional to the number of gates in the circuit. Evolution took the easiest path to tolerating the faults: it avoided many of them by removing gates to a point where the circuit was no longer 100% functional. While [8] only looked at Rehw, [9] investigated and compared both the Rehw and Rtrad reliability metrics for evolved and traditional circuits.

In traditional electronics 100% functionality is considered essential. In previous work [10] the problem of evolving 100% functional circuits with redundancy was investigated. As in this paper, reliability in itself was not the main goal, but rather the creation of redundant structures. To ensure 100% functionality, the fitness function was designed such that fham was the only contributor to fitness unless functionality was 100%. Thus reliability only affected fitness after 100% functionality was reached.

Several experimental setups were tried, using both the gate reliability model and the single fault model. When using the gate reliability model, no form of redundancy was achieved, as the simplest solution for evolution was to minimise the number of gates used in implementing a fully functional circuit. The single fault model experiments, on the other hand, created larger circuits containing redundant gates. It was concluded that the single fault model does not discourage large circuits and evolution can therefore more easily introduce new redundant structures.

Figure 1. Structures evolved in [10]. Panels (a) and (b) each show the inputs, the subcircuits f and r, the gates labelled 1 and 2, and the output.

The first evolved structure in [10] containing redundant gates had the form shown in figure 1(a). The subcircuit marked f implements the desired function and the subcircuit marked r implements any function. All gates in r are redundant. The three gates in the figure make sure that r does not have any impact on the output at all: the output of gate 1 is constant 1 no matter what r evaluates to. This gate is called unreachable because no input vector has any impact on the output of the gate. This structure was evolved using the fitness function f = k1 · fham + k2 · Rtrad_single, and evolution achieved high fitness by making r as large as possible and f as small as possible, thus scoring high on Rtrad_single. The redundant gates in r are fake and thus not useful for any purpose. They do not, in any way, influence the output and could just as well be removed.

One way of avoiding the structure in figure 1(a) is to detect unreachable gates. This was also tried in [10]. Any subcircuit with unreachable gates as the only outputs can be excluded when Rtrad_single is calculated. In this way, such structures do not contribute to fitness, i.e. Rtrad_single, and evolution is encouraged to find another way to improve fitness. The result is typically a structure as in figure 1(b). Here there are no unreachable gates, but the redundant gates in r are still just as useless for the same reason: r contains only fake redundancy and could just as well be removed from the circuit without affecting functionality or reliability.

The work in [10] managed to create several circuits with redundant gates. However, the method used did not manage to evolve any circuits with useful redundancy. It was concluded that evolution chooses the easiest way to solve the problem, and the easiest way in the experimental setup in [10] was fake redundancy. When the fitness function is not good enough at separating circuits with useful redundancy from circuits with fake redundancy, the result is large amounts of fake redundancy and no useful redundancy. The goal of this paper is to tune the evolutionary process further in order to be able to evolve useful redundancy.


Figure 2. Circuit partition after selecting any gate g (diagram labels: inputs, X, Y, g, output)

4 Experiments

This paper builds on the lessons learned in [10]. In [10], a fitness function using Rtrad_single seemed most promising with regard to introducing redundancy, and Rtrad_single is therefore chosen in this paper.

A key point for improving on the previous experiments is to correctly separate useful redundancy from fake and only include useful redundant gates when Rtrad_single is to be calculated. Detecting known unwanted structures, like the unreachable gate subcircuits in figure 1(a), is not the answer. Experiments in [10] show that evolution is only going to come up with new ways of cheating by introducing new forms of fake redundancy.

The solution chosen in this paper is to use a more general way of classifying redundancy as useful or fake. Instead of detecting unwanted structures, a gate is simply classified as useful redundant if it has some observable influence on the circuit's output. More specifically, a gate is said to be useful redundant if, when the gate becomes defective, some other redundant gate becomes non-redundant in order to maintain correct circuit functionality.

4.1 Algorithm for Classifying Redundant Gates

Algorithm 1, FindFake, is a heuristic for classifying the redundant gates in a given circuit as being either useful redundant or fake redundant. The algorithm works on a given circuit. First, all redundant gates are marked as useful redundant. Then a gate g is selected. For the selected gate g, the circuit can be partitioned into two sets of gates X and Y, both of which may be the empty set, as shown in figure 2. The gates in Y can be disconnected from the circuit by changing the chosen gate g to either Vcc or Gnd, both of which are tried. If this change does not damage the output of the circuit, the number of redundant gates in X after the change is compared to the number of redundant gates in X before the change. If the number of redundant gates in X is unchanged, the gates in Y have no impact on the output and are useless. They are then marked as fake. This is repeated for all the gates in the circuit.
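The classification heuristic can be sketched as runnable code. Several points below are assumptions made for illustration and are not confirmed by the paper: the circuit encoding, the reading of Y as the gates that reach the output only through g, the reading of X as all remaining gates except g, and the use of the prose rule "number of redundant gates in X is unchanged" as the test for marking Y as fake. All gates are assumed to be connected to the output.

```python
from itertools import product

OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NOT": lambda a, b: 1 - a,
    "VCC": lambda a, b: 1,  # constants used to cut a gate off from its inputs
    "GND": lambda a, b: 0,
}

def truth_table(gates, n_in, faulty=None):
    """Output of the last gate for all inputs; gate `faulty` is inverted."""
    table = []
    for bits in product((0, 1), repeat=n_in):
        values = list(bits)
        for i, (op, a, b) in enumerate(gates):
            out = OPS[op](values[a], values[b])
            values.append(1 - out if i == faulty else out)
        table.append(values[-1])
    return table

def redundant(gates, n_in, candidates, target):
    """Candidates whose failure leaves the outputs equal to `target`."""
    return {g for g in candidates
            if truth_table(gates, n_in, faulty=g) == target}

def fan_in(gates, n_in, root, blocked=None):
    """Gates in the input cone of `root` (inclusive), never passing `blocked`."""
    seen, stack = set(), [root]
    while stack:
        g = stack.pop()
        if g in seen or g == blocked:
            continue
        seen.add(g)
        _, a, b = gates[g]
        stack.extend(node - n_in for node in (a, b) if node >= n_in)
    return seen

def find_fake(gates, n_in):
    """Return (useful, fake) sets of redundant gate indices."""
    target = truth_table(gates, n_in)
    everything = set(range(len(gates)))
    output_gate = len(gates) - 1
    fake = set()
    for g in everything:
        still_connected = fan_in(gates, n_in, output_gate, blocked=g)
        y = fan_in(gates, n_in, g) - {g} - still_connected
        x = everything - y - {g}
        before = len(redundant(gates, n_in, x, target))
        for constant in ("VCC", "GND"):
            trial = list(gates)
            trial[g] = (constant, 0, 0)
            if truth_table(trial, n_in) != target:
                continue  # the substitution damaged the outputs
            if len(redundant(trial, n_in, x, target)) == before:
                fake |= y
    all_redundant = redundant(gates, n_in, everything, target)
    return all_redundant - fake, all_redundant & fake

# Hypothetical circuits: AND modules with an OR/OR/AND output stage, and an
# AND gate padded with a subcircuit that only feeds a constant-1 gate.
voting = [("AND", 0, 1)] * 4 + [("OR", 2, 3), ("OR", 4, 5), ("AND", 6, 7)]
padded = [("AND", 0, 1), ("OR", 0, 1), ("NOT", 3, 3), ("OR", 3, 4), ("AND", 2, 5)]
```

Under these assumptions, `find_fake(voting, 2)` reports gates 0 to 3 as useful redundant and nothing as fake, whereas `find_fake(padded, 2)` reports gates 1 and 2 as fake, since replacing the constant-1 gate with Vcc changes nothing in the rest of the circuit.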

4.2 Rtrad_single Based on Measured Redundancy

Algorithm 1 Classifying redundant gates as useful or fake

procedure FindFake(circuit)
    markAllRedundantGatesAsUseful
    for all gates g do
        partitionCircuit(X, Y, g)                  ▷ Find gate sets X and Y given g
        redundantInX ← numberRedundant(X)
        g ← vcc                                    ▷ Disconnect Y by substituting g with Vcc
        if outputsUnchanged then                   ▷ If circuit is still working
            redundantVcc ← numberRedundant(X)
            if redundantInX ≥ redundantVcc then
                markAsFake(Y)
            end if
        end if
        g ← gnd                                    ▷ Disconnect Y by substituting g with Gnd
        if outputsUnchanged then
            redundantGnd ← numberRedundant(X)
            if redundantInX ≥ redundantGnd then
                markAsFake(Y)
            end if
        end if
        restoreCircuit                             ▷ Change circuit back to the original
    end for
end procedure

A measure like Rtrad_single depends on the function the circuit is supposed to perform: Rtrad_single is 0 when functionality is not 100%. To encourage redundancy early during evolution, before the individuals reach 100% functionality, the current behaviour of the individual is measured. The measured behaviour is then used when calculating Rtrad_single instead of the desired target behaviour, resulting in a score for Rtrad_single even when 100% functionality is not reached.
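The effect of scoring against measured rather than target behaviour can be illustrated with a small sketch. It makes simplifying assumptions: every single-fault scenario is enumerated once, the useful/fake filtering of redundant gates is left out, and the circuit (OR modules evolved while the target is AND2) is a hypothetical individual.

```python
from itertools import product

OPS = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b}

def truth_table(gates, n_inputs, faulty=None):
    """Last gate's output for all inputs; gate `faulty` is inverted."""
    table = []
    for bits in product((0, 1), repeat=n_inputs):
        values = list(bits)
        for i, (op, a, b) in enumerate(gates):
            out = OPS[op](values[a], values[b])
            values.append(1 - out if i == faulty else out)
        table.append(values[-1])
    return table

def r_trad_single(gates, n_inputs, reference):
    """Fraction of single-fault scenarios whose outputs equal `reference`."""
    hits = sum(truth_table(gates, n_inputs, faulty=g) == reference
               for g in range(len(gates)))
    return hits / len(gates)

# Hypothetical individual: four OR modules behind an OR/OR/AND output stage.
individual = [("OR", 0, 1)] * 4 + [("OR", 2, 3), ("OR", 4, 5), ("AND", 6, 7)]
target = [0, 0, 0, 1]                  # desired AND2 behaviour
measured = truth_table(individual, 2)  # what the individual does today

against_target = r_trad_single(individual, 2, target)
against_measured = r_trad_single(individual, 2, measured)
```

Scored against the target, this individual gets 0 because it never implements AND2; scored against its own measured behaviour it gets 4/7, so its redundancy is visible to evolution before functionality is complete.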

4.3 Experimental Setup

All experiments are conducted on simulations of circuits in a digital feed forward circuit simulator. Only Boolean logic is allowed and the following gates are available: AND, OR, NAND, NOR, NOT. Cartesian genetic programming [11] is applied with the following GA parameters:

Maximum number of gates: 100

Population size: 20

Tournament selection with elitism (g = 3, p = 0.7)

Crossover rate: 0.2

Mutation rate: 0.05 (mutation applied at the gate level)

The experiments in this paper use the single fault model. The algorithm explained in section 4.1 classifies redundant gates as either useful or fake, and only useful redundant gates are included when Rtrad_single is calculated. Rtrad_single is calculated based on the current measured behaviour and not the target behaviour.


Evolving function and redundancy at the same time: For experiments evolving functionality and redundancy at the same time, the following fitness function is used:

f1 = 0.7 · fham + 0.3 · Rtrad_single (1)

Three sets of experiments are performed using the fitness function in equation (1), differing in target functionality: two input AND, two input OR and two input XOR.

Evolving function first, then redundancy: If 100% functionality is required before evolving redundancy, the following fitness function is used:

f_2 = 0.7 \cdot f_{ham} + 0.3 \cdot \begin{cases} 0 & \text{if } f_{ham} < 1.0 \\ R_{trad\_single} & \text{if } f_{ham} = 1.0 \end{cases} \qquad (2)

The fitness function in (2) is used when evolving CIR4, a four input, one output function with the truth table 1001011101100110 (bit zero to the right).

Evolving with unspecified function: If the target functionality is not specified but instead evolved together with the circuits, the following fitness function is used:

f3 = Rtrad_single (3)
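The three fitness functions in equations (1) to (3) translate directly into code. The function and parameter names below are illustrative; the inputs are assumed to be fham and Rtrad_single values already computed for an individual.

```python
def fitness_1(f_ham, r_trad_single):
    """Equation (1): function and redundancy rewarded at the same time."""
    return 0.7 * f_ham + 0.3 * r_trad_single

def fitness_2(f_ham, r_trad_single):
    """Equation (2): redundancy counts only once functionality is 100%."""
    reliability_term = r_trad_single if f_ham == 1.0 else 0.0
    return 0.7 * f_ham + 0.3 * reliability_term

def fitness_3(r_trad_single):
    """Equation (3): no target function, only measured redundancy."""
    return r_trad_single

# An individual that is 75% functional with Rtrad_single = 0.9:
partial_1 = fitness_1(0.75, 0.9)  # redundancy already contributes
partial_2 = fitness_2(0.75, 0.9)  # redundancy ignored until f_ham = 1.0
```

For the same partially working individual, equation (1) credits the redundancy immediately, while equation (2) withholds that credit until the truth table is fully correct.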

5 Results and Discussion

The results and their discussions are separated into three subsections, based on the complexity and type of target behaviour.

5.1 Simple Functionality

The chosen functionality for the simple experiments is a two-input Boolean function that can be implemented with a single gate circuit. Both AND2 and OR2 have been tried. The reason for evolving these very simple functions is to see what redundancy structures emerge when the function requires little effort to evolve.

Table 1 shows the best individuals after running ten independent experiments for both AND2 and OR2. The best results from these experiments all have the same basic idea behind the introduced redundancy: a voter structure similar to figure 3(a) is introduced just before the output of the circuit. Four independent circuit modules, all performing the desired function, are connected to this voter. If three of the four modules work correctly, the voter outputs the correct value. This voter structure is created by the evolutionary algorithm to solve the problem; nothing in the experimental setup predefines a voter as the preferred result.


Table 1. Results, simple functionality. Type indicates redundancy type: voter or something else (shown as -). Red. is the number of redundant gates. Non-red. is the number of non-redundant gates.

(a) AND2

#  Type   Red.  Non-red.
0  Voter  23    3
1  Voter  32    3
2  -      33    5
3  -      37    7
4  -      50    4
5  Voter  39    3
6  Voter  40    3
7  Voter  38    3
8  -      35    5
9  -      23    7

(b) OR2

#  Type   Red.  Non-red.
0  -      18    7
1  Voter  21    3
2  Voter  38    3
3  Voter  17    3
4  Voter  28    5
5  -      23    5
6  -      33    6
7  Voter  29    4
8  -      33    6
9  -      29    5

This design may be compared to the most well known traditional fault tolerance method, Triple Modular Redundancy (TMR), which has three modules and a majority voter. It is interesting to see that evolution in fact finds a voter as the best solution. Of all the possible solutions that evolution could have found, it chose something close to the traditional solution. The evolved voter is smaller than the TMR voter (three gates as opposed to four), but needs more working modules. This is no disadvantage when simulating using the single fault model; in fact a three-gate four-input voter is the best solution in this case. In the more realistic gate reliability model, TMR is better as it requires fewer gates in total and, therefore, has fewer gates that may fail.

It is also clear from table 1 that when evolution has managed to create redundancy, the redundant subcircuits are expanded. This can be explained by the use of the single fault model. It is favourable for fitness to have as many redundant gates as possible because Rtrad_single is the same as the number of redundant gates divided by the total number of gates.

Analysis of Evolved Voter: Why is the voter structure in figure 3(a) successful at hiding single defects in the modules connected to the voter's inputs? The voter can be explained by doing a “don't care” (DC) analysis of the circuit.

If one input to an AND-gate is zero, the other input is DC because, no matter what it is, the output of the AND-gate is zero. Likewise, if one input to an OR-gate is one, the other input is DC. In addition, an input DC is in most cases propagated to the subcircuit connected to this input, meaning that all gates in the subcircuit have a DC for this specific case. This is not true for all possible circuits, but is true for the voter in figure 3(a).

These simple rules can now be used to explain the voter. All four modules connected to the voter should perform the same function, so every wire in figure 3(a) has the same value. The purpose of the voter is to make sure that any single fault in any of the modules is tolerated. The voter should therefore be designed such that if any three of the four inputs to the voter are correct, the fourth input is DC. To see if the voter fulfils this requirement, one should separately examine the two possible cases of voter operation: when the voter output should be zero, and when the voter output should be one.

Figure 3. Evolved voter. (a) The voter with its four modules. (b) Zero-case, with wires marked 0 or DC. (c) One-case, with wires marked 1 or DC.

Zero-case: This case is illustrated in figure 3(b). When the result of the voter should be zero, only one input to the AND-gate of the voter needs to be zero. This means the other input and both modules that are indirectly connected to the input are DC.

One-case: This case is illustrated in figure 3(c). When the result of the voter should be one, both inputs to the AND-gate must also be one. This case must therefore be handled by the OR-gates. For each of the OR-gates to output one, only one of the inputs to each OR-gate needs to be one. This means the other input and the module connected to it are DC.

These two cases show that the voter outputs the correct value even when one of the four modules connected to the voter fails. Note the symmetry in figures 3(b) and 3(c). For example, in figure 3(b) it is just as correct to mark the lower two modules as having a DC output and the upper two modules as having output 0. It can now be seen that if a single module is selected as faulty and the three other modules work correctly, the output will still be correct.
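The don't care argument can be checked exhaustively. The sketch below models the voter as two OR gates feeding one AND gate, as described in the zero-case and one-case above, and treats each module as a single bit that is either correct or inverted; the function names are illustrative.

```python
from itertools import combinations

def voter(a, b, c, d):
    """Three-gate voter: two OR gates feeding one AND gate."""
    return (a | b) & (c | d)

def wrong_outputs(n_faulty_modules):
    """Fault patterns of the given size for which the voter output is wrong."""
    failures = []
    for correct in (0, 1):  # the value every healthy module delivers
        for faulty in combinations(range(4), n_faulty_modules):
            inputs = [1 - correct if i in faulty else correct
                      for i in range(4)]
            if voter(*inputs) != correct:
                failures.append((correct, faulty))
    return failures

print(wrong_outputs(1))       # -> []
print(len(wrong_outputs(2)))  # -> 6
```

No single faulty module can corrupt the output, which matches the analysis above. With two faulty modules the voter fails in six of the twelve patterns: when both faults share an OR gate in the one-case, or sit on different OR gates in the zero-case.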

5.2 Complex Functionality

If the functionality of the circuit is more complex, it becomes harder to evolve a functional circuit. How does this affect the redundancy structures that are evolved?

XOR2 is a step up in functionality. XOR is not among the gates available for evolution and requires an implementation of at least three gates. In the XOR case evolution has a much harder time finding a solution as efficient as the voter in figure 3(a). Table 2(a) shows the best individuals after running ten independent experiments for XOR2. The same kind of voter was observed in one of the evolved XOR circuits, but mostly functionality and the structures used for introducing redundancy in the circuit were mixed together in an intricate way. An example of this is given later in this section.

The most complex functionality evolved in this paper is the four-input CIR4 circuit, which requires a nine gate minimum implementation. To ensure 100% functionality, it was necessary to apply the fitness function in equation (2), which forces evolution of functionality first and then redundancy. Table 2(b) shows the best individuals after running ten independent experiments for CIR4. In this case no voters are evolved and, as in most of the XOR2 evolutionary runs, functionality and redundancy are mixed together. As can be seen from the number of non-redundant gates in the evolved circuits, the introduced redundancy is not very efficient.

Table 2. Results, complex functionality. Same layout as in table 1

(a) XOR2

#  Type   Red.  Non-red.
0  -      22    6
1  -      38    8
2  -      19    10
3  -      22    6
4  (not working)
5  -      32    7
6  Voter  43    3
7  -      19    10
8  -      42    11
9  -      36    7

(b) CIR4

#  Type   Red.  Non-red.
0  -      17    13
1  -      24    21
2  -      24    15
3  -      21    16
4  -      11    14
5  -      24    15
6  -      25    18
7  -      15    17
8  -      18    13
9  -      32    16

Figure 4. Redundant XOR2 without voter. IN0 and IN1 are the main circuit inputs. (Diagram labels: four subcircuits with truth tables 1100, 1100, 0100 and 1100; regions A and B; output out.)

Although not as efficient as the voter solutions, these solutions are still interesting. The purpose of this work is not to evolve the voter but to find new ways of introducing redundancy to a circuit. The solutions in table 2 do represent new redundancy solutions. The inefficiency might come from the fact that the fitness function forces 100% functionality before redundancy. The evolutionary runs were also stopped after a certain amount of time. More efficient redundancy might have been the result if the experiments were allowed to run longer.

Example of Non-Voter Based Redundant Circuit: What does a non-voter based redundant circuit look like? An example of such a circuit is the XOR circuit number nine in table 2(a). This circuit is illustrated in figure 4. The four rounded boxes are subcircuits having the truth table written inside the box (bit zero to the right). All gates in region A (to the left of the dotted line) are redundant while all gates in region B are non-redundant.

Table 3. Results, evolving function together with redundancy. Same layout as in table 1, with the addition of column Function which is the evolved functionality. IN0 and IN1 are the circuit inputs.

#  Function  Type     Red.  Non-red.
0  IN0       Voter    28    3
1  ¬IN0      -        39    5
2  AND       -        23    4
3  ¬IN1      (Voter)  32    3
4  IN1       Voter    59    3
5  IN0       Voter    28    3
6  IN0       -        49    5
7  IN1       Voter    40    3
8  ¬IN1      Voter    17    3
9  IN1       Voter    31    3

The redundant gates in figure 4 are useful redundant: they do have an impact on the circuit output. The XOR functionality is, however, not produced exclusively in the redundant part of the circuit. None of the rounded boxes in the redundant part of the circuit represent XOR. Instead, XOR is formed with a combination of the redundant and non-redundant gates. An analysis similar to the DC analysis for the voter in section 5.1 can be used to understand why the gates in region A are redundant.

5.3 No Specified Functionality

From the previous experiments in this paper it is clear that functionality affects how redundancy is achieved and how effective this redundancy is. As the complexity of the functionality increases, more focus is placed on getting a circuit working and it becomes harder to find an efficient way of creating a redundant version of the circuit.

The evolved redundancy structures are the goal for this paper, not a specific functionality. A set of experiments is performed that does not explicitly state what function the evolved circuits should perform. The only requirement is that the circuit must have two inputs and one output. Evolution is thus free to create any function and focus all efforts on creating circuits with redundancy. This is accomplished by using the fitness function in equation (3). As Rtrad_single is the only factor in this fitness function and because Rtrad_single is based on the current measured functionality of an individual, the target functionality of the circuits is evolved together with the redundant circuits themselves. It is likely that the resulting function is something that can easily be made redundant in an efficient way. This is backed up by the results. Table 3 shows the best individuals after running ten independent experiments where the target functionality is not specified. The evolved functions are very simple (typically cloning an input or being the equivalent of a single Boolean gate) and most individuals use a voter similar to figure 3(a).

6 Conclusion and Further Work

This paper has presented an experimental setup that successfully uses artificial evolution to create digital circuits with useful redundancy. The purpose of this experimental setup is to find new ways of building redundant circuits.

The results show that although there is no explicit guiding towards creating a voter structure, evolution does in some cases create a voter resembling the voter used in traditionally designed reliable circuits. This is typically the result when evolving circuits with simple functionality. The voter is a known way of making redundant structures and, while it is interesting that evolution creates voter-like structures, the main goal is to find new ways of introducing redundancy. When evolving more complex functions, the result is non-voter based redundancy. Although not as efficient as a voter based solution, these results are interesting examples of how to do redundancy without the traditional voter.

Planned further work includes experiments where evolution is allowed to leave the strict Boolean logic domain and exploit the analog properties of the CMOS technology.

References

1. ITRS: Int. techn. roadmap for semiconductors. Technical report, ITRS (2005)

2. Xilinx: Xilinx virtex 5 overview. http://www.xilinx.com/virtex5

3. Lala, P.K.: Self-Checking and Fault Tolerant Digital Design. Morgan Kaufmann Publishers (2001)

4. Eiben, A.E., Smith, J.E.: Introduction to Evolutionary Computing. Springer (2003)

5. Higuchi, T., Niwa, T., Tanaka, T., Iba, H., de Garis, H., Furuya, T.: Evolving hardware with genetic learning: a first step towards building a darwin machine. In: Proc. Int. Conf. From animals to animats: simulation of adaptive behavior. (1993) 417–424

6. Hemmi, H., Mizoguchi, J., Shimohara, K.: Development and evolution of hardware behaviors. In: Artificial Life IV: Proc. 4th Int. Workshop Synthesis Simulation Living Syst., MIT Press (1994) 371–376

7. Miller, J.F.: Evolving a self-repairing, self-regulating, french flag organism. In: Genetic and Evolutionary Computation (GECCO). (2004) 129–139

8. Hartmann, M., Haddow, P.C.: Evolution of fault-tolerant and noise-robust digital designs. IEE Proc. - Computers and Digital Techniques 151(4) (jul 2004) 287–294

9. Haddow, P.C., Hartmann, M., Djupdal, A.: Addressing the metric challenge: Evolved versus traditional fault tolerant circuits. In: Adaptive Hardware and Systems. (2007)

10. Djupdal, A., Haddow, P.C.: Evolving redundant structures for reliable circuits - lessons learned. In: Adaptive Hardware and Systems. (2007)

11. Miller, J.F., Job, D., Vassilev, V.K.: Principles in the evolutionary design of digital circuits - part i. Journal of Genetic Programming and Evolvable Machines 1(1) (2000) 8–35


Paper V

Defect Tolerant Ganged CMOS Minority Gate
Asbjørn Djupdal and Pauline C. Haddow
In IEEE NORCHIP, 2007

Defect Tolerant Ganged CMOS Minority Gate

Asbjoern Djupdal
CRAB Lab, IDI, NTNU
Email: [email protected]

Pauline C. Haddow
CRAB Lab, IDI, NTNU
Email: [email protected]

Abstract: Production defects, resulting in faulty transistors, provide a challenge for the semiconductor industry in terms of reduced yield. As defect densities are expected to increase as the semiconductor feature size decreases, some form of transistor level defect tolerance is desirable to reduce this increasing production challenge. This paper proposes a solution, based on the ganged CMOS minority gate, for transistor level defect tolerance for minority gates.

I. INTRODUCTION

As the semiconductor feature size decreases and the number of transistors on a single chip increases, one of the growing challenges facing the electronic design community is defective chips resulting in faulty behaviour [1].

There are several causes of defects, and defects may appear in different parts of an integrated circuit. This paper concentrates on transistor defects. A defective transistor may be modelled in several ways [2]. This paper considers stuck-open and stuck-closed defective transistors. A stuck-open transistor is a transistor that is never conducting, no matter what gate voltage is applied. A stuck-closed transistor is, on the other hand, always conducting.

The challenge of faulty transistors may be met by improved fault tolerance methods. Fault tolerance methods often involve the use of redundant hardware resources. Redundant hardware may be introduced at different levels. At the system level, one of the most popular redundancy techniques is Triple Modular Redundancy (TMR) [3]. Three equal modules calculate the same function and a voter outputs the majority output. TMR may also be applied at the gate level where each module is a smaller part of the complete system and where a cascade of TMR subsystems makes up the complete system.

Defects may occur in any part of the system, including the voter. One disadvantage of TMR is the need for a perfectly working voter or, if a perfect voter is not likely, triplicating the voter itself. The need for a voter makes TMR practical only when each of the modules is large compared to the voter. For TMR to function, the probability of having a functioning module must be more than 0.5. If the expected defect density of the IC is high, the modules must be small to ensure that the probability of working is more than 0.5. If the defect density is high enough, TMR is no longer suitable because each module must be so small that the voter dominates both in terms of area and susceptibility to defects.
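The 0.5 threshold can be checked with a small calculation. The sketch below assumes a perfect voter and three independent modules that each work with probability p; the function name is illustrative and not taken from the paper.

```python
def tmr_reliability(p: float) -> float:
    """Probability that at least two of three independent modules work.

    Assumes a perfect voter; p is the probability that a single module works.
    P(all three work) + P(exactly two work) = p^3 + 3 p^2 (1 - p) = 3p^2 - 2p^3.
    """
    return 3 * p**2 - 2 * p**3


# Compare a single module with the triplicated system for a few values of p.
for p in (0.4, 0.5, 0.6, 0.9):
    print(f"p={p:.1f}  single={p:.3f}  tmr={tmr_reliability(p):.3f}")
```

Under these assumptions the triplicated system is better than a single module only for p above 0.5, and worse below it.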

A gate level alternative to TMR is interwoven logic [4] or quadded logic [5]. Quadded logic involves constructing the network of logic gates in a way such that it masks defects. Defect masking is achieved by quadrupling every gate in the system and connecting the gates in a specific way so as to avoid the need for a voter. The lack of a voter makes quadded logic useful at higher defect densities than TMR.

Fig. 1. Quadrupling transistors

When the expected defect density is so high that it is probable that a large number of the digital gates are defective, gate level techniques like interwoven logic fail to mask all the defects. This makes it useful to introduce redundancy at the transistor level, i.e. introducing redundant transistors when implementing the basic logic gates. Redundancy at the transistor level would help the system's reliability by providing robust gates. To get even higher reliability, these robust gates could be used together with gate level or system level redundancy techniques. Another benefit of introducing redundancy at the transistor level is the ability to exploit some non-digital properties of the transistor level architecture of the gate. These non-digital properties might lead to solutions that are more area-efficient than what is possible at the Boolean gate level.

The focus of this paper is defect tolerance at the transistor level. More specifically, the focus is to make a three-input minority gate tolerant to all single stuck-open and stuck-closed defects.

Aunet and Hartmann [6] propose a solution where two or more identical minority gates drive the same output. Their solution is tolerant to stuck-open transistor faults. By requiring only twice the number of transistors, the method used by Aunet and Hartmann is an example of how the use of non-Boolean techniques may provide more efficient redundancy than what is possible at the Boolean gate level.

Anghel and Nicolaidis [7] propose a general transistor level defect tolerance method that handles both stuck-open and stuck-closed defects. This is achieved by quadrupling every transistor in the circuit, as shown in figure 1. By having two transistors in series, stuck-closed defects are tolerated. Two transistors in parallel tolerate stuck-open defects, much in the same way as in Aunet and Hartmann's work. Combining these as in figure 1 results in tolerance to both stuck-open and stuck-closed defects.

Fig. 2. Ganged CMOS minority gate implementation [8]
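The series and parallel argument can be checked exhaustively with a switch-level model. The sketch below is only an illustration: it assumes the quadrupled transistor consists of two series pairs connected in parallel, which may differ from the exact wiring in figure 1, and all names are hypothetical.

```python
from itertools import product

OK, STUCK_OPEN, STUCK_CLOSED = "ok", "stuck-open", "stuck-closed"


def conducts(gate_on: bool, state: str) -> bool:
    """Switch-level behaviour of one transistor under the two defect types."""
    if state == STUCK_OPEN:
        return False      # never conducting
    if state == STUCK_CLOSED:
        return True       # always conducting
    return gate_on


def quad_conducts(gate_on: bool, states) -> bool:
    """Assumed topology: (T1 in series with T2) in parallel with (T3 in series with T4)."""
    c = [conducts(gate_on, s) for s in states]
    return (c[0] and c[1]) or (c[2] and c[3])


def tolerates_all_single_defects() -> bool:
    """True if every single stuck-open or stuck-closed defect is masked."""
    for position, defect in product(range(4), (STUCK_OPEN, STUCK_CLOSED)):
        states = [OK] * 4
        states[position] = defect
        for gate_on in (False, True):
            if quad_conducts(gate_on, states) != gate_on:
                return False
    return True


print(tolerates_all_single_defects())
```

In this model every single defect is masked, while some double defects, such as one stuck-open transistor in each series pair, are not.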

This paper starts in section II with a description and analysis of the ganged CMOS minority gate. The ganged CMOS minority gate is a specific minority gate implementation that is fundamental for the rest of the paper. Section III proposes a new minority gate implementation that is tolerant to all single stuck-open and stuck-closed transistors by building on the analysis in section II and previously known redundancy techniques. Section IV presents a simulation that compares different minority gate implementations with respect to reliability. A discussion of the properties of the proposed gate is given in section V and the paper concludes in section VI.

II. GANGED CMOS MINORITY GATE

The term ganged CMOS [9] refers to a CMOS circuit where the outputs of several inverters are wired together. Instead of acting as switches (standard digital CMOS), the transistors act as variable resistors controlled by their gate voltages. The circuit may thus be represented as a resistor network where conducting transistors are represented by small resistors and non-conducting transistors by large resistors [10]. Figure 2 illustrates a ganged CMOS minority gate, proposed in [11] as a majority gate (extra inverter at the output). Further, figure 3(a) presents the same circuit as a resistor circuit for the case where all inputs are zero.

Is it possible to exploit the concept of a resistor network for the minority gate so as to achieve defect tolerance? The approach taken in this section is to analyse what effect stuck-open faults could have on such a network and how these faults might be tolerated by sizing the transistors accordingly.

A. Characterisation of Minority Gate

The following analysis assumes three types of transistor behaviour: conducting (resistance r), non-conducting (resistance R) and stuck-open (resistance ∞), where R is much larger than r. Further it is assumed that only one pMOS transistor is stuck-open at a given time. The symmetric nature of the circuit implies that it is not necessary to check every combination of high inputs, but rather the four general cases: zero, one, two or three high inputs.

For each of the four cases, the required relation between the combined resistance of the pMOS transistors and that of the nMOS transistors is expressed as a condition. If a stuck-open transistor, represented as infinite resistance, provides a disadvantage to the pMOS/nMOS resistance ratio, the condition is adjusted to this worst-case scenario.

Fig. 3. Equivalent resistor network for gate in figure 2 for different inputs. In each panel the three pMOS resistances connect Vdd to Out and the three nMOS resistances connect Out to Vss. (a) Zero high inputs: rp, rp, rp and Rn, Rn, Rn. (b) One high input: rp, rp, Rp and Rn, Rn, rn. (c) Two high inputs: rp, Rp, Rp and Rn, rn, rn. (d) Three high inputs: Rp, Rp, Rp and rn, rn, rn.

Zero high inputs: When all inputs are zero, all pMOS transistors and no nMOS transistors conduct, resulting in the resistor network shown in figure 3(a) with an output close to Vdd. A stuck-open pMOS, represented as 0 in equation (1), provides a stricter condition (worst case) than that which directly represents the resistor network.

$$\left(0 + \frac{1}{r_p} + \frac{1}{r_p}\right)^{-1} < \left(\frac{1}{R_n} + \frac{1}{R_n} + \frac{1}{R_n}\right)^{-1} \tag{1}$$

One high input: When one input is high, two pMOS and one nMOS are conducting, as illustrated in figure 3(b), and the output is close to Vdd. Similar to zero inputs, representing stuck-open in equation (2) provides a worst case condition.

$$\left(0 + \frac{1}{r_p} + \frac{1}{R_p}\right)^{-1} < \left(\frac{1}{R_n} + \frac{1}{R_n} + \frac{1}{r_n}\right)^{-1} \tag{2}$$

Two high inputs: As shown in figure 3(c), in this case, there is one conducting pMOS and two conducting nMOS transistors, resulting in an output close to Vss. Any stuck-open pMOS will increase the left hand side of the condition, thus positively affecting the condition. As such, a worst case condition is where no pMOS are stuck-open and thus equation (3) directly reflects the resistance network of figure 3(c).

$$\left(\frac{1}{r_p} + \frac{1}{R_p} + \frac{1}{R_p}\right)^{-1} > \left(\frac{1}{R_n} + \frac{1}{r_n} + \frac{1}{r_n}\right)^{-1} \tag{3}$$

Three high inputs: When all inputs are one, no pMOS and all nMOS transistors are conducting and the output is close to Vss. When no pMOS transistors conduct, a stuck-open pMOS can only positively affect the output and is thus not reflected in equation (4).

$$\left(\frac{1}{R_p} + \frac{1}{R_p} + \frac{1}{R_p}\right)^{-1} > \left(\frac{1}{r_n} + \frac{1}{r_n} + \frac{1}{r_n}\right)^{-1} \tag{4}$$

Fig. 4. Simulation of gate in figure 2, showing gate output for all input combinations from 0 to 7. Inputs change every 20ns. (a) No faulty transistor. (b) One pMOS stuck-open fault. Both panels plot output [V] (0 to 1.2) against time [ns] (0 to 160), with the input value (000 to 111) marked along the top.

When condition (2) is satisfied, (1) is also satisfied because (2) is a tighter constraint. Likewise, if (3) is satisfied, (4) is also satisfied. In conclusion, this analysis suggests that when the conditions (2) and (3) are satisfied, the minority gate in figure 2 should have the correct output even when one of the pMOS transistors is stuck-open.
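The four conditions can be evaluated numerically for candidate resistance values. The sketch below is an illustration under stated assumptions: conditions (1) and (2) are read as requiring the pull-up side to be the smaller resistance (output close to Vdd), conditions (3) and (4) as requiring it to be the larger (output close to Vss), and the resistance values are made up rather than taken from the simulations.

```python
import math


def parallel(*resistances: float) -> float:
    """Equivalent resistance of parallel resistors; math.inf models stuck-open."""
    conductance = sum(0.0 if math.isinf(r) else 1.0 / r for r in resistances)
    return math.inf if conductance == 0 else 1.0 / conductance


def conditions(rp: float, Rp: float, rn: float, Rn: float) -> dict:
    """Evaluate conditions (1) to (4) with one pMOS stuck-open where that is worst case.

    The inequality directions are an assumption based on the expected output level.
    """
    inf = math.inf
    return {
        1: parallel(inf, rp, rp) < parallel(Rn, Rn, Rn),
        2: parallel(inf, rp, Rp) < parallel(Rn, Rn, rn),
        3: parallel(rp, Rp, Rp) > parallel(Rn, rn, rn),
        4: parallel(Rp, Rp, Rp) > parallel(rn, rn, rn),
    }


# Hypothetical values: conducting pMOS slightly stronger than conducting nMOS.
print(conditions(rp=1.0, Rp=1000.0, rn=1.5, Rn=1000.0))
# Equal conducting resistances violate condition (2).
print(conditions(rp=1.0, Rp=1000.0, rn=1.0, Rn=1000.0))
```

With these made-up numbers all four conditions hold when a conducting pMOS has somewhat lower resistance than a conducting nMOS, which is in line with the wider pMOS found in the case study below.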

B. Case Study A

If transistors are sized properly, it is possible to satisfy conditions (2) and (3). To verify the above analysis and that the conditions are satisfied, a ganged CMOS minority gate was simulated. Ngspice [12] with the 22nm Berkeley Predictive Technology Models (BPTM) [13] and a supply voltage of 1V is employed as the simulator. Stuck-open transistors are modelled by removing the transistor from the SPICE netlist.

The transistor dimensions depend on the chosen technology. To find suitable transistor sizes for 22nm BPTM, a SPICE netlist representing conditions (2) and (3) was created and transistor sizes were manually adjusted until the conditions were satisfied. For this experiment, the following transistor dimensions were found suitable: WP = 90nm, LP = WN = LN = 30nm.

Figure 4(a) shows a simulation of the minority gate when all transistors are working. All input combinations from 0 to 7 are shown, starting with input 0 at time 0 and changing input every 20ns. The output of the minority gate should be 11101000 (least significant bit first). Figure 4(a) shows that the gate output is correct and differs from the ideal digital voltage by less than 0.25V. Figure 4(b) illustrates the operation of the minority gate in the presence of a stuck-open pMOS transistor. As can be seen, the output is still correct and less than 0.3V from the ideal output voltage. Symmetry ensures the same result for any stuck-open pMOS.
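The expected output sequence follows directly from the definition of a three-input minority gate. A minimal sketch, with illustrative function names:

```python
def minority3(value: int) -> int:
    """Return 1 when fewer than two of the three input bits are high."""
    return 1 if bin(value & 0b111).count("1") < 2 else 0


def minority_sequence() -> str:
    """Ideal outputs for input values 0 to 7, in that order."""
    return "".join(str(minority3(v)) for v in range(8))


print(minority_sequence())  # 11101000
```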

C. Limitations of Sizing for Defect Tolerance

To achieve tolerance to stuck-open nMOS transistors, a similar analysis and sizing adjustment is required. Analysing pMOS and nMOS defects at the same time provides conflicting conditions, as shown in equation (5) (pMOS) and equation (6) (nMOS). As these conditions cannot both be fulfilled, no sizing adjustment will provide tolerance to both pMOS and nMOS stuck-open defects. As such, this technique is limited to tolerating single stuck-open defects in 50% of its transistors but requires no additional transistors.

Fig. 5. New minority gate

$$\left(0 + \frac{1}{r_p} + \frac{1}{R_p}\right)^{-1} < \left(\frac{1}{R_n} + \frac{1}{R_n} + \frac{1}{r_n}\right)^{-1} \tag{5}$$

$$\left(\frac{1}{r_p} + \frac{1}{R_p} + \frac{1}{R_p}\right)^{-1} > \left(\frac{1}{R_n} + \frac{1}{r_n} + 0\right)^{-1} \tag{6}$$
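Why the two conditions conflict can be made explicit. The derivation below reads (5) as requiring the pull-up side to be the smaller resistance (one high input, output close to Vdd) and (6) as requiring it to be the larger (two high inputs, output close to Vss); this reading of the inequality directions is an assumption based on the expected output levels. Inverting both sides of each condition gives:

```latex
\begin{align*}
(5):\quad & \frac{1}{r_p} + \frac{1}{R_p} > \frac{2}{R_n} + \frac{1}{r_n}
  && \Rightarrow \quad \frac{1}{r_p} - \frac{1}{r_n} > \frac{2}{R_n} - \frac{1}{R_p} \\
(6):\quad & \frac{1}{r_p} + \frac{2}{R_p} < \frac{1}{R_n} + \frac{1}{r_n}
  && \Rightarrow \quad \frac{1}{r_p} - \frac{1}{r_n} < \frac{1}{R_n} - \frac{2}{R_p} \\
\text{both:}\quad & \frac{2}{R_n} - \frac{1}{R_p} < \frac{1}{R_n} - \frac{2}{R_p}
  && \Rightarrow \quad \frac{1}{R_n} + \frac{1}{R_p} < 0
\end{align*}
```

The last inequality cannot hold for positive resistances, so under this reading no choice of transistor sizes satisfies both conditions at once.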

III. CONSTRUCTING A DEFECT TOLERANT MINORITY GATE

To construct a minority gate tolerant to all single stuck-open and stuck-closed defects, the ganged CMOS minority gate and the work in section II are used as a basis. Each pMOS is sized, duplicated and placed in series so as to allow for both stuck-open and stuck-closed defects. To allow for both stuck-open and stuck-closed nMOS transistors, they are quadrupled in the same way as in Anghel and Nicolaidis [7], shown in figure 1. The resulting minority gate, tolerant to stuck-open and stuck-closed defects, is illustrated in figure 5.

A. Case Study B

The same experimental setup is applied as in section II-B. Stuck-closed defects are simulated by substituting the faulty transistor with a 1Ω resistor between source and drain. Transistor dimensions are found manually, see section II-B. For the circuit in figure 5, the following transistor dimensions were found suitable: WP = 140nm, LP = WN = LN = 30nm.
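The two defect models translate into simple netlist edits. The sketch below shows one possible way to apply them to a netlist held as text; it assumes the standard SPICE MOSFET line format (name, drain, gate, source, bulk, model), and the example netlist, node names and model names are hypothetical rather than the ones used in the experiments.

```python
def inject_fault(netlist: str, transistor: str, fault: str) -> str:
    """Return a copy of a SPICE netlist with one MOSFET made defective.

    stuck-open:   the MOSFET line is removed.
    stuck-closed: the MOSFET is replaced by a 1 ohm resistor from drain to source.
    """
    out = []
    for line in netlist.splitlines():
        fields = line.split()
        if fields and fields[0].lower() == transistor.lower():
            if fault == "stuck-open":
                continue
            if fault == "stuck-closed":
                drain, source = fields[1], fields[3]
                out.append(f"R{transistor} {drain} {source} 1")
                continue
            raise ValueError(f"unknown fault type: {fault}")
        out.append(line)
    return "\n".join(out)


# Hypothetical two-transistor example (a plain inverter).
EXAMPLE = """\
MP1 out a vdd vdd pmos W=90n L=30n
MN1 out a vss vss nmos W=30n L=30n"""

print(inject_fault(EXAMPLE, "MN1", "stuck-closed"))
```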

Figure 6(a) shows simulated results with no faults applied. Figure 6(b) shows the same results with one stuck-closed nMOS. As seen, the output is correct in both cases. Only one of the possible defect configurations is shown. Simulations have shown that the output is correct with any single stuck-open or stuck-closed defective transistor. The gain is, however, dependent on the defect configuration, with defective pMOS transistors having the worst effect on gain.

IV. COMPARING RELIABILITY

The reliability R of a circuit is, in this paper, defined as the probability of having a correct output given a probability Rt that a transistor is functional. Further, it is assumed that each transistor fails independently with a certain probability (1 − Rt). When failing, the transistor is either stuck-open or stuck-closed, each defect type being equally probable.

Fig. 6. Simulation of gate in figure 5, showing gate output for all input combinations from 0 to 7. Inputs change every 20ns. (a) No faulty transistor. (b) One nMOS stuck-closed fault. Both panels plot output [V] (0 to 1.2) against time [ns] (0 to 160), with the input value (000 to 111) marked along the top.

Fig. 7. Reliability of different minority gate implementations. R (0 to 1) is plotted against Rt (0.5 to 1) for: proposed gate (fig. 5), ganged quad, pseudonmos, ganged (fig. 2), ganged double and pseudonmos quad.

Monte Carlo simulations for different levels of transistor reliability Rt have been performed on the proposed minority gate of figure 5. For each Rt, 10000 different simulations are performed with random fault scenarios. The results have further been compared against similar simulations on five other minority gate implementations: the gate in figure 2 (“ganged”); a doubled version of figure 2 like in Aunet and Hartmann [6] (“ganged double”); a version of figure 2 quadrupled in a similar way as the circuits of Anghel and Nicolaidis [7] (“ganged quad”); a pseudo-nMOS minority gate (“pseudonmos”); and a quadrupled version of the pseudo-nMOS gate (“pseudonmos quad”). The results are presented in figure 7.
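The structure of such a Monte Carlo experiment can be sketched as follows. This is an illustration only: in the paper each scenario is judged by a SPICE simulation, whereas the evaluator here is a placeholder for a non-redundant gate that fails on any defect, and all function names are hypothetical.

```python
import random


def random_scenario(n: int, rt: float, rng: random.Random) -> list:
    """One fault scenario: each transistor works with probability rt, otherwise
    it is stuck-open or stuck-closed with equal probability."""
    scenario = []
    for _ in range(n):
        if rng.random() < rt:
            scenario.append("ok")
        else:
            scenario.append(rng.choice(("stuck-open", "stuck-closed")))
    return scenario


def estimate_reliability(n, rt, is_correct, runs=10000, seed=0) -> float:
    """Fraction of random scenarios for which is_correct(scenario) is true."""
    rng = random.Random(seed)
    good = sum(is_correct(random_scenario(n, rt, rng)) for _ in range(runs))
    return good / runs


def no_faults(scenario) -> bool:
    """Placeholder evaluator: a non-redundant gate that any defect breaks."""
    return all(state == "ok" for state in scenario)


# For the placeholder evaluator the estimate should be close to rt ** n.
print(estimate_reliability(4, 0.9, no_faults), 0.9 ** 4)
```

Replacing `no_faults` with a function that runs the fault-injected netlist through a simulator and compares the outputs with the truth table would give the kind of experiment described above.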

As can be seen from the Monte Carlo experiment, the new minority gate presented herein is more reliable than the other minority gate implementations for the given defect conditions. It is even better than “ganged quad”. This can be explained by its smaller size (18 transistors for the proposed gate vs. 24 transistors for the quadrupled one). Fewer transistors in a gate means that the expected number of defective transistors is lower.

The redundant “ganged double” implementation is actually worse than the non-redundant “ganged” implementation. The doubling technique only tackles stuck-open defects and this experiment allows both stuck-open and stuck-closed. The larger size of “ganged double” compared to “ganged” makes it perform worse.

The non-redundant “pseudonmos” performs better than “ganged” because it uses only four transistors. It does not, however, benefit from quadrupling, as seen in the figure where “pseudonmos quad” has the lowest reliability of all the gates.

V. DISCUSSION

As with other ganged CMOS implementations, this gate suffers from bad gain. Bad gain can be a problem if cascading several of these gates with no driver in between. A simple inverter on the output removes this problem and transforms the gate into a majority gate. If the inverter is quadrupled to make it defect tolerant, the resulting majority gate will have 26 transistors, still significantly smaller than the 32 transistors needed for the fully quadrupled version.

VI. CONCLUSION AND FURTHER WORK

This paper has analysed the ganged CMOS minority gate with respect to tolerance to defects. The analysis has provided conditions which, when met through transistor sizing, result in a circuit that exhibits tolerance to single stuck-open pMOS or nMOS defects. Further, a revised ganged CMOS minority gate has been proposed with triple the transistor count but tolerant to both single stuck-open and stuck-closed faults. The results presented show that the proposed circuit is more reliable than the other ganged minority gate implementations it has been compared to, including the quadrupled implementation.

Further work will involve a study of how the proposed gate is affected by parameter variations. In addition, the Monte Carlo experiment should be repeated with other ways of modelling stuck-open and stuck-closed transistors as this might affect the reliability results.

REFERENCES

[1] ITRS, “International technology roadmap for semiconductors,” ITRS, Tech. Rep., 2005.

[2] J. Abraham and W. Fuchs, “Fault and error models for VLSI,” Proceedings of the IEEE, vol. 74, no. 5, pp. 639–654, May 1986.

[3] R. E. Lyons and W. Vanderkulk, “The use of triple-modular redundancy to improve computer reliability,” IBM Journal, pp. 200–209, April 1962.

[4] W. Pierce, Failure-Tolerant Computer Design. Academic Press, 1965.

[5] J. G. Tryon, Redundancy Techniques for Computing Systems. Spartan Books, 1965, ch. Quadded Logic, pp. 205–228.

[6] S. Aunet and M. Hartmann, “Real-time reconfigurable linear threshold elements and some applications to neural hardware,” in Proc. International Conference on Evolvable Systems: From Biology to Hardware (ICES), 2003, pp. 365–376.

[7] L. Anghel and M. Nicolaidis, “Defects tolerant logic gates for unreliable future nanotechnologies,” in IWANN, 2007, pp. 422–429.

[8] M. Johnson, “A symmetric CMOS NOR gate for high-speed applications,” IEEE Journal of Solid-State Circuits, vol. 23, no. 5, pp. 1233–1236, Oct. 1988.

[9] K. J. Schultz, R. J. Francis, and K. C. Smith, “Ganged CMOS: Trading standby power for speed,” IEEE Journal of Solid-State Circuits, vol. 25, no. 3, pp. 870–873, 1990.

[10] V. Beiu, J. Quintana, and M. Avedillo, “VLSI implementations of threshold logic - a comprehensive survey,” IEEE Transactions on Neural Networks, vol. 14, no. 5, pp. 1217–1243, Sept. 2003.

[11] J. B. Lerch, “Threshold gate circuits employing field-effect transistors,” USPTO, Tech. Rep., Feb. 1973, U.S. Patent 3 715 603.

[12] GEDA, “Ngspice homepage,” http://ngspice.sourceforge.net/, 2007.

[13] W. Zhao and Y. Cao, “New generation of predictive technology model for sub-45nm design exploration,” in 7th International Symposium on Quality Electronic Design (ISQED), 2006, pp. 585–590.


Paper VI

Evolving Efficient Redundancy by Exploiting the Analogue Nature of CMOS Transistors
Asbjørn Djupdal and Pauline C. Haddow
In International Conference on Computational Intelligence, Robotics and Autonomous Systems (CIRAS), pages 81–86, 2007

Evolving Efficient Redundancy by Exploiting the Analogue Nature of CMOS Transistors

Asbjoern Djupdal and Pauline C. Haddow
CRAB Lab (http://crab.idi.ntnu.no)
Department of Computer and Information Science
Norwegian University of Science and Technology
[email protected], [email protected]

Abstract

Fault tolerance is an increasing challenge for integrated circuits due to semiconductor technology scaling. Triple modular redundancy is often used to achieve fault tolerance in digital circuits, but this method is inefficient. By exploiting the analogue nature of CMOS transistors, more efficient redundancy techniques may be applied. This paper looks at how artificial evolution may be guided towards the creation of redundancy structures at the CMOS transistor level. A preliminary experiment is performed that successfully evolves redundant stuck-open defect tolerant digital inverters.

Keywords: Evolvable hardware, Redundancy, Fault tolerance, FPGA

1 Introduction

As the semiconductor feature size decreases and the number of transistors on a single chip increases, one of the growing challenges facing the electronic design community is faulty behaviour [1]. This challenge may be met by improved fault tolerance methods.

If faults are expected to occur in a digital circuit, fault tolerance (the ability to function correctly in the presence of faults) may be achieved by incorporating redundancy (additional resources) in some form. These additional resources may be in the form of additional hardware, in which case it is called hardware redundancy [2]. One form of hardware redundancy is static hardware redundancy. Static redundancy involves introducing extra components in a way that masks defects, thus without any need to detect and repair the defects. Triple Modular Redundancy (TMR) [3] is one well known static redundancy technique.

The semiconductor fault challenge may be, in general, a long term challenge but is here today for large ICs, like FPGAs. The mass production of FPGAs enables FPGAs to be produced in the newest technologies. Xilinx Virtex 5 [4] is an example of a new FPGA series from Xilinx produced in 65nm technology with up to 330,000 logic cells. Just like other large lithographically produced chips, FPGAs suffer from production defects and would, from a fault tolerance point of view, benefit from redundancy.

Redundancy for improving fault tolerance may be included at the design level, i.e. incorporating redundancy into the FPGA application. The generality of the FPGA architecture makes it possible for the application designer to use traditional Boolean fault tolerance techniques, e.g. TMR, in the circuit design. However, this increases the complexity of the circuit design. Further, such techniques are today only applied to chips that have already passed the yield test. If the problem of production defects increases to a point where known defective chips must be shipped, the requirement that the application designer must take care of tolerating these defects will be a hard one. The FPGA provides a bridge between chip production and the application designer. The inclusion of fault tolerance in the FPGA architecture itself would provide a functionally correct FPGA for the application designer, despite production defects. The application designer would thus be relieved from the complexity of designing for imperfect hardware.

One approach to achieving a fault tolerant FPGA architecture would be to change the high level FPGA architecture. One such high level architecture technique is to include a redundant row or column of logic blocks in the FPGA architecture. Such blocks are applied to take over if a defective row or column is detected [5]. Several such techniques are reviewed in [6].

Another way to make the FPGA fault tolerant is to attack the problem from the CMOS transistor level. Instead of looking at how high level components, such as logic blocks, may be structured to support a fault tolerance technique, the VLSI implementation of these high level blocks might itself include fault tolerance. By attacking the fault problem at the transistor level, solutions not possible using Boolean techniques can now be considered. While the Boolean TMR technique is useful in many contexts, the technique is inefficient in that it requires a working voter in addition to three equal modules performing the desired function, of which two modules must work perfectly. If redundancy is introduced at the transistor level, more efficient redundancy techniques may be achieved.


Aunet and Hartmann [7] provide an example of efficient redundancy at transistor level, illustrating that by having two equal circuits drive the same output, single stuck-open transistor defects are tolerated. The ganged CMOS minority gate was analysed in [8] and a version of the gate was presented where careful sizing of transistors together with redundant transistors results in tolerance to both stuck-closed and stuck-open defective transistors. The defect tolerance was achieved using three times the number of transistors and no voter was required. The Boolean alternative would be to use TMR, which would be less reliable and have larger area requirements due to the voter. In Schmid and Leblebici [9], modular redundancy, resembling TMR, is presented where the digital voter is substituted by an analogue averaging unit to improve reliability.

The focus of this paper is to find a way to create new static redundancy techniques for FPGAs by attacking the problem at the transistor level. To find new redundancy techniques it is important to free oneself from the constraints brought upon us by thinking in the way of traditional redundancy techniques. The way one thinks when designing circuits is influenced by how one was taught electronics, by the electronics one has designed and by the tools used in the design process. One way of freeing oneself from these human and design automation constraints is to search for ideas using some sort of heuristic search process. One such process is that of evolutionary algorithms [10]. The application of evolutionary algorithms to the design of hardware is termed evolvable hardware (EHW) [11].

Previous work [12] showed that it is possible to tune the evolutionary algorithm to create useful hardware redundancy. Not only was a TMR-like redundancy structure created, but also some completely new redundancy structures. However, the work only considered Boolean logic and was constrained by the limitations of the Boolean world. Layzell and Thompson [13], on the other hand, have evolved circuits at the transistor level. Although they did not explicitly try to evolve redundancy, a stuck-open fault tolerant digital inverter was evolved that contained redundant transistors in an efficient way similar to what was presented in [7]. Unlike Layzell and Thompson, who did not try to force evolution to create efficient and useful redundancy, the goal of this paper is to find a way to explicitly evolve efficient redundancy structures at the transistor level. The purpose is not to evolve a specific fault tolerant circuit or to analyse some aspect of artificial evolution. Instead, the purpose is to seek examples of ways to use redundancy at the transistor level for achieving fault tolerance. Analysing these examples might then provide insight and inspiration that may lead to new techniques for creating more fault tolerant FPGAs.

The remainder of this paper is organised as follows: Section 2 discusses several aspects regarding evolution of redundancy structures using the analogue SPICE simulator. An experimental setup is presented in section 3 together with a preliminary experiment that demonstrates its usefulness. The paper concludes in section 4.

2 Evolving Circuits with Redundancy using SPICE

2.1 Fault Models

A fault scenario is one possible configuration of faulty transistors for a given circuit. A fault model dictates how the fault scenarios can be constructed and how probable the different fault scenarios are. Two fault models are considered in this work: the transistor reliability model and the single fault model.

In the transistor reliability model (the transistor equivalent of the gate reliability model in [12]), each transistor has a certain probability of failing and the transistors fail independently of each other. If a fault scenario for the transistor reliability model is to be created, each transistor in the circuit is tested against a random number generator and selected to be faulty or not, based on a chosen fault rate.

In the single fault model, a circuit has exactly one fault at any time and all single fault scenarios are equally probable. One and only one of the transistors is selected to fail for any given fault scenario.
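The two fault models can be sketched as scenario generators. In this illustration a fault scenario is represented as the set of indices of defective transistors; that representation and the function names are assumptions made for the example.

```python
import random


def single_fault_scenarios(n: int) -> list:
    """All scenarios of the single fault model: exactly one defective transistor."""
    return [frozenset({i}) for i in range(n)]


def transistor_reliability_scenario(n: int, fault_rate: float,
                                    rng: random.Random) -> frozenset:
    """Transistor reliability model: each transistor is tested against the
    random number generator and fails independently with probability fault_rate."""
    return frozenset(i for i in range(n) if rng.random() < fault_rate)


rng = random.Random(42)
print(single_fault_scenarios(3))
print(transistor_reliability_scenario(8, 0.2, rng))
```

Because the single fault model has only as many scenarios as there are transistors, it can be enumerated completely, while the transistor reliability model has to be sampled.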

A failing transistor can be modelled in several ways [14]. A well known model is the stuck-at model, where a failing transistor results in a wire being clamped to Vss or Vdd. Another and more realistic model is where a failing transistor is either stuck-closed or stuck-open (permanently on or off). In this paper, stuck-open defective transistors are considered.

2.2 Measuring Functionality and Reliability

One functionality metric, fbool, simply states whether the circuit functions correctly (fbool = 1) or not (fbool = 0). The output of a circuit is defined to be true if having a value larger than Vdd/2. If the output is less than Vdd/2, it is defined to be false. A circuit is said to be fully functioning if, for all possible input values, the output is correctly true or false according to a given truth table.

When evolving circuits, fitness may represent functionality in terms of how close the circuit’s functionality is to the desired functionality. The main functionality metric for this paper, frms, is based on the Root-Mean-Square (RMS) error between the simulated output and the ideal output, for all n possible input values i.


$$f_{rms} = 1 - \sqrt{\frac{\sum_{i}^{n} \left(sim(i) - ideal(i)\right)^2}{n}} \tag{1}$$
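Both functionality metrics are straightforward to compute from a list of simulated output voltages. The sketch below is an illustration that assumes voltages are normalised so that Vdd = 1 and Vss = 0; the function names and the example voltages are hypothetical.

```python
import math


def f_bool(sim, truth, vdd=1.0) -> int:
    """1 if every simulated output is on the correct side of Vdd/2, else 0.

    sim holds output voltages, truth the expected logic values (0 or 1),
    both ordered by input value. vdd = 1.0 is an assumption.
    """
    return int(all((v > vdd / 2) == bool(t) for v, t in zip(sim, truth)))


def f_rms(sim, ideal) -> float:
    """One minus the RMS error between simulated and ideal output voltages."""
    n = len(sim)
    return 1 - math.sqrt(sum((s - i) ** 2 for s, i in zip(sim, ideal)) / n)


# Hypothetical inverter: inputs 0 and 1 give outputs of 0.9V and 0.2V.
print(f_bool([0.9, 0.2], [1, 0]))
print(f_rms([0.9, 0.2], [1.0, 0.0]))
```

A circuit can be fully functioning in the fbool sense while still having frms below 1, which is what makes frms usable as a graded fitness measure.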

A reliability metric indicates how well a circuit functions in the presence of faults. Reliability may be measured by testing the circuit against a number of randomly selected fault scenarios. The possible fault scenarios depend on the chosen fault model. The Rtrad metric, which is used in this paper, is the percentage of these tests where the circuit is fully functioning (fbool = 1). When Rtrad is applied with the single fault model, it is named Rtrad_single. Rtrad_single may be calculated exactly by testing all possible single faults. When applied with the transistor reliability model, the metric is named Rtrad_trans.

Rtrad_trans = x0 · fbool + x1 · Rtrad_single (2)

Rtrad_trans may be estimated using equation (2). Here x0 and x1 are the probabilities of having, respectively, zero and one defective transistor in a randomly chosen fault scenario. An explanation and discussion of this estimator is given in [15].
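As a worked example of equation (2), the sketch below computes x0 and x1 from the binomial distribution. This is an assumption that follows from transistors failing independently with a common fault rate; the exact form used in [15] is not given here, and the function name is illustrative.

```python
def rtrad_trans_estimate(n: int, fault_rate: float,
                         f_bool: int, rtrad_single: float) -> float:
    """Estimate Rtrad_trans as x0 * fbool + x1 * Rtrad_single.

    x0 and x1 are taken as binomial probabilities of zero and exactly one
    defective transistor among n (an assumption, see the text above).
    """
    x0 = (1 - fault_rate) ** n
    x1 = n * fault_rate * (1 - fault_rate) ** (n - 1)
    return x0 * f_bool + x1 * rtrad_single


# Four transistors, 10% fault rate, fully functioning, all single faults tolerated.
print(rtrad_trans_estimate(4, 0.1, 1, 1.0))
```

The estimator ignores scenarios with two or more defects, so it is most useful when the fault rate is low enough that such scenarios are rare.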

2.3 Evolving Redundancy

A redundant transistor in a circuit is a transistor that may fail without damaging the circuit's outputs. To find out whether a transistor is redundant or not, a redundancy test may be performed where the transistor is temporarily made defective. If the circuit output is unaffected, the transistor is called redundant. Finding all redundant transistors in a circuit involves applying the redundancy test on all transistors one at a time.
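The redundancy test can be sketched as a loop over all transistors. In this illustration the circuit is hidden behind a caller-supplied `evaluate` function, and the toy inverter is a hypothetical switch-level stand-in for a SPICE simulation; none of these names come from the paper.

```python
def redundant_transistors(n: int, evaluate) -> list:
    """Indices of transistors whose individual failure leaves the outputs unaffected.

    evaluate(defective) must return the circuit outputs for all input values,
    given a set of defective transistor indices.
    """
    reference = evaluate(frozenset())
    return [t for t in range(n) if evaluate(frozenset({t})) == reference]


def toy_inverter(defective) -> tuple:
    """Inverter with two parallel pMOS (0, 1) and one nMOS (2), stuck-open faults only."""
    pull_up = any(t not in defective for t in (0, 1))
    pull_down = 2 not in defective
    out_for_0 = 1 if pull_up else None    # None marks a floating output
    out_for_1 = 0 if pull_down else None
    return (out_for_0, out_for_1)


print(redundant_transistors(3, toy_inverter))  # [0, 1]
```

With analogue simulation results, "unaffected" would have to be defined through a threshold such as Vdd/2 rather than exact equality.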

Earlier work on evolving redundant structures [12, 15] has concentrated on evolving redundancy for Boolean logic. These experiments showed that it is far from straightforward to evolve useful redundancy. The work in [15] concluded that evolution chooses the easiest solution to the problem. When measuring fitness based on gate reliability in [15], the easiest solution was to minimise a 100% working non-redundant circuit. On the other hand, applying a single fault based fitness function resulted in large numbers of gates connected to the functioning circuit in a way that does not influence the output. Such gates are redundant, but not in a useful way.

The single fault results in [15] were improved in [12] by introducing an algorithm that classifies a redundant gate as either “useful” or “fake”. When evaluating fitness in this case, evolution started creating redundancy that had the potential to enhance fault tolerance. It was concluded that by carefully guiding evolution, evolution is able to produce circuits with useful redundancy. The experiments in [12] resulted in both a voter structure resembling the TMR voter and some new redundancy structures.

2.4 SPICE Considerations

SPICE provides different ways of analysing circuits. For the work in this paper, two analyses are relevant: operating point analysis and transient analysis.

For performing an operating point analysis, the circuit inputs are modelled as voltage sources with voltage equal to either Vdd or Vss. The operating point analysis provides the output voltage the circuit would have, assuming stable input values, and is performed for all possible input combinations.

When performing a transient analysis, the circuit inputs are modelled as Piece Wise Linear (PWL) voltage sources. PWL is used to analyse a specific input transition. As the circuit output may depend on what the previous input was, analysing different input transitions is important. When analysing an input transition, the PWL source starts at an initial input voltage and is then, during 1ns, swept to the input voltage that is to be analysed. This voltage is then kept stable for a specified amount of time. This specified amount of time is in effect the delay requirement for the circuit. SPICE analyses the circuit’s behaviour, reporting the circuit’s output voltage at the end of the specified time interval. To find the functionality of a circuit using transient analysis, all possible transitions for all possible input combinations are performed. The least correct output is chosen and applied in equation (1). This is considerably slower than operating point analysis, but makes it possible to set a delay limit when evolving circuits because the output value is measured at a certain point in time.
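As an illustration of the transient test, the following Python sketch builds a SPICE PWL source line for one input transition and enumerates the transitions to test. The helper names, the node naming and the choice to skip "transitions" between identical input combinations are assumptions made for this example.

```python
from itertools import product


def pwl_source(name, node, v_from, v_to, hold_ns, ramp_ns=1.0):
    """SPICE line for a PWL source: start at v_from, sweep to v_to during
    ramp_ns, then hold v_to for hold_ns (the delay requirement)."""
    end_ns = ramp_ns + hold_ns
    return (f"V{name} {node} 0 "
            f"PWL(0 {v_from:g} {ramp_ns:g}n {v_to:g} {end_ns:g}n {v_to:g})")


def all_transitions(n_inputs):
    """All (previous, next) pairs of distinct input combinations."""
    combos = list(product((0, 1), repeat=n_inputs))
    return [(a, b) for a in combos for b in combos if a != b]
```

For a single-input circuit such as an inverter this gives two transitions, and `pwl_source("in", "in", 0, 1, 5)` yields the line `Vin in 0 PWL(0 0 1n 1 6n 1)`, so the output would be sampled at t = 6ns for a 5ns delay requirement.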

3 Experiments and Results

3.1 Experimental Setup

Circuits are evolved using the SPICE simulator. A SPICE netlist is created each time a circuit is to be tested. This netlist is then written to a file to be read by the BSD licensed SPICE simulator ngspice [16], which is based on the original Berkeley SPICE3.

A representation resembling Cartesian Genetic Programming [17] is applied, with the modification that in a gene, both inputs and output of the component are explicitly defined, in addition to component type (nMOS or pMOS transistor) and transistor dimensions. A (1+4) evolutionary strategy is applied with mutation rate 0.1. Mutation is applied independently for each information block (inputs, output, type, sizes) inside each gene in the genome.
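A minimal sketch of such a (1+4) evolutionary strategy with per-block mutation is given below. The gene encoding (which nets count as inputs and output), the acceptance of offspring with equal fitness and all names are assumptions for illustration; the sketch also does not enforce the ban on feedback loops.

```python
import random
from dataclasses import dataclass

MUTATION_RATE = 0.1          # per information block, as stated in the text
MIN_NM, MAX_NM = 30, 1000    # allowed transistor sizes, as stated in the text


@dataclass(frozen=True)
class Gene:
    inputs: tuple   # (gate_net, source_net): hypothetical encoding
    output: int     # drain net: hypothetical encoding
    kind: str       # "nmos" or "pmos"
    size: tuple     # (width_nm, length_nm)


def random_gene(rng, n_nets):
    return Gene(
        inputs=(rng.randrange(n_nets), rng.randrange(n_nets)),
        output=rng.randrange(n_nets),
        kind=rng.choice(["nmos", "pmos"]),
        size=(rng.randint(MIN_NM, MAX_NM), rng.randint(MIN_NM, MAX_NM)),
    )


def mutate(genome, rng, n_nets):
    """Redraw each information block of each gene with probability 0.1."""
    child = []
    for gene in genome:
        fresh = random_gene(rng, n_nets)
        child.append(Gene(
            inputs=fresh.inputs if rng.random() < MUTATION_RATE else gene.inputs,
            output=fresh.output if rng.random() < MUTATION_RATE else gene.output,
            kind=fresh.kind if rng.random() < MUTATION_RATE else gene.kind,
            size=fresh.size if rng.random() < MUTATION_RATE else gene.size,
        ))
    return child


def evolve(fitness, n_genes, n_nets, generations, seed=0):
    """(1+4) strategy: one parent, four mutated offspring per generation."""
    rng = random.Random(seed)
    parent = [random_gene(rng, n_nets) for _ in range(n_genes)]
    parent_fit = fitness(parent)
    for _ in range(generations):
        scored = [(fitness(c), c)
                  for c in (mutate(parent, rng, n_nets) for _ in range(4))]
        best_fit, best = max(scored, key=lambda fc: fc[0])
        if best_fit >= parent_fit:   # assumption: ties replace the parent
            parent, parent_fit = best, best_fit
    return parent, parent_fit
```

In the experiments described here, `fitness` would wrap netlist generation and an ngspice run; any callable that maps a genome to a number can be plugged in for testing.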

The BPTM 22nm CMOS transistor models are applied [18] with transistor sizes from 30nm to 1000nm. Feedback loops are not allowed, but several transistors may drive the same line.


The circuits are evolved in two phases. The purpose of the first phase is to generate redundancy. Earlier work [15] has shown that the single fault model is best suited for generating redundancy. The first phase, the exploratory phase, starts evolving circuits from scratch using a single fault based fitness function:

f1 = 0.3frms + 0.2Rtrad_single + 0.5fbool (3)

The first part in equation (3) is the functionality metric. This is to ensure correct functionality is achieved. The second part is the reliability metric. An Rtrad reliability metric equals 0 until 100% functionality is achieved. To reward redundancy in a circuit before 100% functionality is reached, the functionality of the circuit is measured each time fitness is to be evaluated. The reliability metric is then calculated based on the current behaviour of the circuit, not the desired target behaviour. The last part of equation (3) is to ensure that once 100% functionality is reached, it stays there. Evolution is stopped after 12 hours run time.
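The effect of the 0.5 weight on fbool can be checked with a short bound. The derivation below is a sketch under assumptions not stated in the paper: voltages are normalised so that Vdd = 1, outputs lie in [0, 1], and ideal outputs are exactly 0 or 1.

```latex
% Non-working circuit: f_bool = 0 and at least one output is on the wrong
% side of Vdd/2, so at least one error is >= 1/2.
f_{rms} \le 1 - \sqrt{\tfrac{(1/2)^2}{n}} = 1 - \tfrac{1}{2\sqrt{n}}
\quad\Rightarrow\quad
f_1 \le 0.3\left(1 - \tfrac{1}{2\sqrt{n}}\right) + 0.2 \cdot 1 + 0.5 \cdot 0 < 0.5

% Working circuit: f_bool = 1 and every error is < 1/2.
f_{rms} > 1 - \sqrt{\tfrac{n\,(1/2)^2}{n}} = \tfrac{1}{2}
\quad\Rightarrow\quad
f_1 > 0.3 \cdot \tfrac{1}{2} + 0.2 \cdot 0 + 0.5 \cdot 1 = 0.65
```

Under these assumptions any fully functioning circuit scores above 0.65 and any non-functioning circuit scores below 0.5, so a working parent is never replaced by a non-working offspring.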

As in [15], fake redundancy may result from evolving with a single fault fitness function. The algorithm applied in [12] for classifying redundancy as useful or fake cannot be used in these experiments because it is too computationally expensive and not easily adapted to circuits where several components may drive the same line. Therefore, a second phase, allowed to run for 12 hours, applies a transistor reliability fitness function (equation (4)) so as to remove any fake redundant transistors. The best individual from phase one seeds the population.

f2 = 0.3frms + 0.2Rtrad_trans + 0.5fbool (4)

The chosen target functionality for these experiments is a digital inverter. This inverter is to tolerate as many single stuck-open transistor faults as possible in an effective way. Stuck-open faults are chosen because the results are then easily compared to the techniques in [13] and [7]. It is known that these faults are possible to tolerate 100% using effective analogue “tricks”, and for a preliminary experiment it is useful to know there is a potential for evolution to exploit. A stuck-open transistor is simulated by removing the transistor completely from the SPICE netlist.

Three different experiments are performed, one using operating point analysis and two using transient analysis with 99ns and 5ns delay requirements respectively. All three experiments involve running the same evolutionary setup ten times with different random seeds, to produce ten different individuals.

Table 1: Experiment summary

                      Op      Trans99   Trans5
Avg. # generations    27270   8334      8506
Avg. fbool            1       1         1
Avg. Rtrad_single     1       1         1

Table 2: Best evolved inverters, operating point analysis, after 10 runs

#   Size 1   Size 2   Solution
0   5        4        Variant of fig. 2
1   8        4        Fig. 1
2   7        4        Fig. 1
3   7        4        Fig. 1
4   11       4        Fig. 1
5   9        4        Fig. 1
6   5        3        Fig. 2
7   7        4        Fig. 1
8   8        4        Fig. 1
9   5        3        Fig. 2

Table 3: Best evolved inverters, transient analysis, 99ns delay requirement, after 10 runs

#   Size 1   Size 2   Solution
0   3        3        Fig. 2
1   4        3        Fig. 2
2   6        4        Fig. 1
3   8        4
4   5        5        Variant of fig. 1
5   6        5        Variant of fig. 1
6   9        5
7   5        3        Fig. 2
8   6        4        Variant of fig. 2
9   6        5        Variant of fig. 1

Table 4: Best evolved inverters, transient analysis, 5ns delay requirement, after 10 runs

#   Size 1   Size 2   Solution
0   5        4        Fig. 1
1   6        4        Fig. 1
2   4        4        Fig. 1
3   8        4        Fig. 1
4   6        5        Variant of fig. 1
5   5        4        Fig. 1
6   5        4        Fig. 1
7   7        4        Fig. 1
8   6        4        Fig. 1
9   5        4        Fig. 1

Figure 1: Best evolved stuck-open tolerant inverter (schematic: transistors P1, P2, N1 and N2 between Vdd and Vss)

3.2 Results and Discussion

Table 1 presents a summary of the results from the three experiments, showing average values for the best individuals in all ten evolutionary runs in each experiment. As can be seen, all evolutionary runs resulted in perfectly working inverters (fbool = 1) able to tolerate all single stuck-open faults (Rtrad_single = 1).

Table 2 shows a summary of the best individuals after ten evolutionary runs for the operating point analysis experiment. Tables 3 and 4 are similar but evolved using transient analysis for 99ns and 5ns respectively. “Size 1” is the number of transistors after phase 1 and “Size 2” is the number of transistors after having run the optimising phase 2. It is clear that in most cases, phase one introduces redundant transistors of which some do not help improve Rtrad_trans and are therefore removed during phase 2.

Two main inverter variants are evolved, with some variations. The best evolved inverters resemble the one in figure 1. This solution showed up in all three experiments. This inverter is similar to a standard CMOS inverter, except that every transistor is duplicated such that two equal transistors drive the same output in parallel. This means that if one of the transistors is stuck-open but should have been conducting, the inverter output will still be driven correctly by the duplicate transistor.

The solution in figure 1 is the same as the one presented in [7]. This demonstrates that by rewarding redundancy in the way presented in this paper, it is possible to tune evolution to the creation of efficient redundancy structures. The solution in figure 1 is also the solution evolved in [13]. The redundant transistors in [13] were, however, introduced by evolution as a side effect of trying to improve fitness by creating a circuit with a slightly higher voltage swing. In contrast, the inverters in this paper contain redundancy because redundancy has been explicitly rewarded. While the result for stuck-open tolerant inverters is the same, it is expected that the experimental setup in this paper may also be applied to evolving redundant circuits tolerating other kinds of transistor faults, such as stuck-closed.

Figure 2 shows the other variant: an inverter with only three transistors. This solution showed up in both the operating point analysis experiment and the 99ns transient analysis experiment, but not in the 5ns transient analysis experiment. It is clear that the duplicated pMOS transistors provide reliability in the same way as the inverter in figure 1. However, can this inverter tolerate a stuck-open nMOS when only one nMOS is present? A transient analysis of this situation is given in figure 3. When the input changes from 0 to 1, the output is not driven by any active transistor. The output load, however, slowly discharges the output to a value slightly less than 0.5V. The circuit is therefore classified as working. This discharging takes longer than 5ns, which is why this solution is not created in the 5ns transient analysis experiment.

Figure 2: Smallest evolved stuck-open tolerant inverter (schematic: transistors P1, P2 and N1 between Vdd and Vss)

Figure 3: Transient analysis of the inverter in figure 2 when transistor N1 is stuck-open. The input changes from 0 to 1 at t = 100ns. (Plot: output voltage [V] against time [ns], 0 to 300ns)

Tables 2–4 indicate which inverter variants are created in the different evolutionary runs. Although simpler and faster, the operating point analysis experiment is just as successful at creating redundant inverters as the transient analysis experiments. When timing is of importance, transient analysis should be applied to ensure the circuit delay is small enough. When the goal is to produce many different examples of redundancy, operating point analysis might be better suited because of its shorter run time.

4 Conclusion and Further Work

This paper has shown that a technique for evolving redundant structures, successfully applied earlier at the Boolean gate level [12], can also be applied with success at the transistor level.

Experiments have been performed where digital inverters are evolved that are tolerant to stuck-open defective transistors. All evolved inverters have achieved tolerance to stuck-open transistors by connecting several redundant transistors in parallel.

Further work is to include extensive experiments with the purpose of generating new and efficient redundancy structures.

5 References

[1] ITRS, “International Technology Roadmap for Semiconductors”, Technical report, ITRS (2005).

[2] P. K. Lala, Self-Checking and Fault Tolerant Digital Design, Morgan Kaufmann Publishers (2001).

[3] J. von Neumann, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components”, in C. Shannon, editor, Automata Studies, pp 43–98 (1956).

[4] Xilinx, “Xilinx Virtex 5 Overview”, http://www.xilinx.com/products/virtex5/index.htm (2007).

[5] F. Hatori, T. Sakurai, K. Nogami, K. Sawada, M. Takahashi, M. Ichida, M. Uchida, I. Yoshii, Y. Kawahara, T. Hibi, Y. Saeki, H. Muraoga, A. Tanaka and K. Kanzaki, “Introducing Redundancy in Field Programmable Gate Arrays”, in Proc. IEEE Custom Integrated Circuits Conference, pp 7.1.1–7.1.4 (1993).

[6] A. Djupdal and P. C. Haddow, “Yield Enhancing Defect Tolerance Techniques for FPGAs”, in Int. MAPLD Conference (2006), paper ID 203.

[7] S. Aunet and M. Hartmann, “Real-time Reconfigurable Linear Threshold Elements and Some Applications to Neural Hardware”, in Proc. Int. Conf. Evolvable Systems: From Biology to Hardware (ICES), pp 365–376 (2003).

[8] A. Djupdal and P. Haddow, “Defect Tolerant Ganged CMOS Minority Gate”, Submitted to NORCHIP 2007.

[9] A. Schmid and Y. Leblebici, “Robust and Fault-Tolerant Circuit Design for Nanometer-Scale Devices and Single-Electron Transistors”, in ISCAS, pp 685–688 (2004).

[10] A. E. Eiben and J. E. Smith, Introduction to Evolutionary Computing, Springer (2003).

[11] X. Yao and T. Higuchi, “Promises and challenges of evolvable hardware”, in Evolvable Systems: From Biology to Hardware (ICES 96), number 1259 in LNCS, pp 87–97, Springer (1996).

[12] A. Djupdal and P. C. Haddow, “Evolving and Analysing “Useful” Redundant Logic”, in ICES, pp 256–267 (2007).

[13] P. J. Layzell and A. Thompson, “Understanding Inherent Qualities of Evolved Circuits: Evolutionary History as a Predictor of Fault Tolerance”, in Proc. International Conference on Evolvable Systems (ICES), pp 133–144 (2000).

[14] J. A. Abraham and W. K. Fuchs, “Fault and Error Models for VLSI”, Proc. IEEE, 74(5), pp 639–654 (1986).

[15] A. Djupdal and P. C. Haddow, “Evolving Redundant Structures for Reliable Circuits – Lessons Learned”, in Adaptive Hardware and Systems, pp 455–462 (2007).

[16] GEDA, “Ngspice homepage”, http://ngspice.sourceforge.net/ (2007).

[17] J. F. Miller and P. Thomson, “Cartesian Genetic Programming”, in Genetic Programming, Proceedings of EuroGP’2000, pp 121–132 (2000).

[18] W. Zhao and Y. Cao, “New generation of predictive technology model for sub-45nm design exploration”, in 7th International Symposium on Quality Electronic Design (ISQED), pp 585–590 (2006).


Paper VII

Defect Tolerance Inspired by Artificial Evolution
Asbjørn Djupdal and Pauline C. Haddow
Accepted at IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2008

Defect Tolerance Inspired by Artificial Evolution
Asbjoern Djupdal and Pauline C. Haddow
CRAB Lab (http://crab.idi.ntnu.no)
Department of Computer and Information Science
Norwegian University of Science and Technology
Email: [email protected], [email protected]

Abstract— Defect densities in integrated circuits are expected to increase as the semiconductor feature size decreases. Some form of transistor level defect tolerance is, therefore, desirable to reduce this increasing production challenge. Series and parallel replication of transistors can be applied to a circuit for tolerating stuck-open and stuck-closed transistors. The circuit is, however, still damaged by gate/drain and gate/source shorts.

This paper applies an evolutionary algorithm to evolve a circuit tolerant to any single short between two transistor terminals. The evolved circuit is then analysed and a general defect tolerance technique is formed based on the evolved circuit. Applying the new technique to a circuit results in tolerance to any single stuck-open, stuck-closed, gate/drain shorted or gate/source shorted transistor. A Monte Carlo experiment compares the reliability of the new technique applied to a NAND gate with other redundant NAND gate implementations.

I. INTRODUCTION AND MOTIVATION

As the semiconductor feature size decreases and the number of transistors on a single chip increases, one of the growing challenges facing the electronic design community is faulty behaviour [1].

If defects are expected to occur in a digital circuit, defect tolerance, the ability to function correctly in the presence of defective components, may be achieved by incorporating redundancy (additional resources) in some form. These additional resources may be in the form of additional hardware, in which case it is called hardware redundancy [2]. One form of hardware redundancy is static hardware redundancy. Static redundancy involves introducing extra components in a way that masks faults, without any need to detect and repair the defects.

There are several causes of defects, and defects may appear in different parts of an integrated circuit. This paper concentrates on transistor defects. A defective transistor may be modelled in several ways [3]. This paper considers stuck-open and stuck-closed, as well as any short between two of the terminals of a transistor. A stuck-open transistor is a transistor that is never conducting, no matter what gate voltage is applied. A stuck-closed transistor is, on the other hand, always conducting.

Redundant hardware may be introduced at different levels. At the system level, one of the most popular redundancy techniques is Triple Modular Redundancy (TMR) [4]. Three equal modules calculate the same function and a voter outputs the majority output. TMR may also be applied at the gate level where each module is a smaller part of the complete system and where a cascade of TMR subsystems make up the complete system.

Fig. 1. Series and parallel replication of transistors

Defects may occur in any part of the system, including the voter. One disadvantage of TMR is the need for a perfectly working voter or, if a perfect voter is not likely, triplicating the voter itself. The need for a voter makes TMR only practical when each of the modules is large compared to the voter. For TMR to function, the probability of having a functioning module must be more than 0.5. If the expected defect density of the circuit is high, the modules must be small to ensure that the probability of working is more than 0.5. If the defect density is high enough, TMR is no longer suitable because each module must be so small that the voter will dominate in terms of susceptibility to defects.
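The 0.5 threshold follows from a standard calculation, sketched here under the assumptions of a perfect voter and three modules that each work independently with probability p.

```latex
% TMR works if at least two of the three modules work.
R_{TMR} = p^3 + 3p^2(1 - p) = 3p^2 - 2p^3

% Gain over a single module:
R_{TMR} - p = 3p^2 - 2p^3 - p = p\,(1 - p)\,(2p - 1)

% For 0 < p < 1 the factors p and (1 - p) are positive, so
R_{TMR} > p \iff p > \tfrac{1}{2}
```

For example, p = 0.9 gives R_TMR = 0.972, whereas p = 0.4 gives R_TMR = 0.352, which is worse than a single module.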

A gate level alternative to TMR is interwoven logic [5] or quadded logic [6]. Quadded logic involves constructing the network of logic gates in such a way that it masks defects. Defect masking is achieved by quadrupling every gate in the system and connecting the gates in a specific way so as to avoid the need for a voter. The lack of a voter makes quadded logic useful at higher defect densities than TMR.

When the expected defect density is so high that it is probable that a large number of the digital gates are defective, gate level techniques, like interwoven logic, fail to mask all the defects. This makes it useful to introduce redundancy at the transistor level, i.e. introducing redundant transistors when implementing the basic logic gates. Redundancy at the transistor level would help the system’s reliability by providing robust gates. To get even higher reliability, these robust gates could be applied together with gate level or system level redundancy techniques. A further benefit of introducing redundancy at the transistor level is to be able to exploit some non-digital properties of the transistor.

A general transistor level redundancy technique is shown in figure 1. Originally described by Moore and Shannon [7] for relays, the technique provides redundant transistors in series for tolerating stuck-closed defects and redundant transistors in parallel to tolerate stuck-open defects. Combining these, as in figure 1, results in tolerance to both stuck-open and stuck-closed faults.

When considering only stuck-open and stuck-closed defective transistors at high defect rates, the series–parallel technique results in more reliable circuits than TMR. Shorts between the transistor gate and source or drain are, however, in general not tolerated by the series–parallel technique. To tolerate gate shorts, either TMR must be applied with triplicated voters, or a new transistor level redundancy technique is needed. Further, the series–parallel technique is area inefficient in that it quadruples the area needed for implementing a circuit.

The main objective of this paper is to provide tolerance to gate/drain and gate/source shorts through a new transistor level redundancy technique. When trying to find new redundancy techniques it is important to free oneself from the constraints imposed by thinking in terms of traditional redundancy techniques. The way one thinks when designing circuits is influenced by the way that one is taught electronics, by the electronics one has designed and by the tools used in the design process. One way of freeing oneself from these human and design automation constraints is to search for ideas using some sort of heuristic search process. One such process is that of evolutionary algorithms [8]. The application of evolutionary algorithms to the design of hardware is termed evolvable hardware (EHW) [9]. Previous work [10] showed that it is possible to tune an evolutionary algorithm to create useful hardware redundancy at the transistor level.

This paper builds on the work in [10] and applies an evolutionary algorithm to find a circuit tolerant to gate/source and gate/drain shorts. The evolved circuit then forms the basis for a new redundancy technique. The paper starts in section II with a discussion of different aspects regarding evolution of transistor level redundancy. Section III explains the evolutionary experiment and provides an analysis of the best evolved circuit. A new redundancy technique based on the evolved circuit is presented in section IV and several redundant NAND implementations are compared with respect to reliability in section V. The paper concludes in section VI.

II. EVOLVING REDUNDANT CIRCUITS

The approach taken for evolution of redundant circuits in this paper follows the technique outlined in [10]. When fitness is to be evaluated, the circuit is tested repeatedly with different injected faults. Fault injection during fitness evaluation provides a means to calculate a reliability metric for the circuit. The quality of the evolved redundancy depends on the way fault injection is performed and how the reliability metric is included in the fitness function. This section presents the approach taken for the evolutionary experiment in section III.

A. Measuring Functionality

The output of a circuit is defined to be true if it has a value larger than Vdd/2. If the output is less than Vdd/2, it is defined to be false. One functionality metric, fbool, simply states whether the circuit has correct Boolean output for all possible input combinations (fbool = 1) or not (fbool = 0).

fbool is an important functionality metric when reliability is to be determined. However, fbool might be too coarse grained when evolving towards a specific functionality. Fitness should therefore include a functionality metric that represents functionality in terms of how close the circuit’s output voltage is to the desired output voltage. The main functionality metric for this paper, frms, is based on the Root-Mean-Square (RMS) error between the simulated output and the ideal output, for all n possible input values i.

$$f_{rms} = 1 - \sqrt{\frac{\sum_{i}^{n} \left(sim(i) - ideal(i)\right)^2}{n}} \qquad (1)$$

B. Fault Models

A fault scenario is one possible configuration of faulty transistors for a given circuit. A fault model dictates how the fault scenarios can be constructed and how probable the different fault scenarios are. Two fault models are considered in this work: the transistor reliability model and the single fault model.

In the transistor reliability model, each transistor has a certain probability of failing and the transistors fail independently of each other. If a fault scenario for the transistor reliability model is to be created, each transistor in the circuit is tested against a random number generator and selected to be faulty or not, based on a chosen fault rate.

In the single fault model, a circuit can have exactly one fault at any time and any single fault scenario is equally probable. One and only one of the transistors is selected to fail for any given fault scenario.
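The two fault models can be sketched as scenario generators. In this illustration a scenario is simply the list of indices of faulty transistors; the function names and the small Monte Carlo helper are assumptions for the example, not the authors' implementation.

```python
import random


def sample_scenario(n_transistors, reliability, rng):
    """Transistor reliability model: each transistor fails independently
    with probability 1 - reliability."""
    return [i for i in range(n_transistors) if rng.random() >= reliability]


def single_fault_scenarios(n_transistors):
    """Single fault model: exactly one faulty transistor per scenario,
    all scenarios equally probable."""
    return [[i] for i in range(n_transistors)]


def monte_carlo_reliability(n_transistors, reliability, works, trials, seed=0):
    """Fraction of sampled scenarios in which works(scenario) is true."""
    rng = random.Random(seed)
    ok = sum(bool(works(sample_scenario(n_transistors, reliability, rng)))
             for _ in range(trials))
    return ok / trials
```

Here `works` stands in for a simulation of the circuit with the given transistors made defective. Passing `lambda s: len(s) <= 1` models a circuit that tolerates every single fault but no double fault.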

C. Failing Transistors

A transistor may fail in several ways. In this paper, several types of transistor defects are considered: Stuck-open transistors are permanently off and are modelled by removing the transistor from the SPICE netlist. Stuck-closed transistors are permanently on and are modelled by shorting the source and drain with a 1Ω resistor. In addition, there may be a short between gate/drain or gate/source, both of which are modelled with a 1Ω resistor shorting the respective transistor terminals.
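A fault injector along these lines could operate directly on netlist lines. The sketch below assumes the usual SPICE MOSFET element form `Mname drain gate source bulk model ...`; the defect names, the `Rfault_` prefix and the decision to keep the transistor in place next to the shorting resistor are assumptions for this example.

```python
def inject_fault(netlist, transistor, defect):
    """Return a copy of the netlist (list of element lines) with one defect.

    stuck_open removes the transistor; the other defects add a 1 ohm
    resistor between two of its terminals.
    """
    shorts = {  # indices into (drain, gate, source)
        "stuck_closed": (0, 2),
        "gate_drain_short": (1, 0),
        "gate_source_short": (1, 2),
    }
    result = []
    for line in netlist:
        fields = line.split()
        if not fields or fields[0] != transistor:
            result.append(line)
            continue
        if defect == "stuck_open":
            continue  # drop the transistor entirely
        nodes = fields[1:4]  # drain, gate, source
        a, b = shorts[defect]
        result.append(line)
        result.append(f"Rfault_{transistor} {nodes[a]} {nodes[b]} 1")
    return result
```

Applied to a standard inverter, `inject_fault(netlist, "M2", "gate_drain_short")` leaves both transistors in place and appends a resistor line between the gate and drain nets of M2.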

D. Measuring Reliability

A reliability metric indicates how well a circuit functions in the presence of faults. Reliability may be measured by testing the circuit against a number of randomly selected fault scenarios. The possible fault scenarios depend on the chosen fault model. The Rtrad metric, which is used in this paper, is the percentage of these tests where the circuit is fully functioning (fbool = 1). When Rtrad is applied with the single fault model, it is named Rtrad_single. When applied with the transistor reliability model, the metric is named Rtrad_trans. Rtrad_single may be calculated exactly by testing all possible single faults.

Rtrad_trans may be estimated using a Monte Carlo simulation. A thorough Monte Carlo simulation is too time consuming during evolution. Instead, equation (2) is applied to estimate Rtrad_trans, see [11]. x0 and x1 are the probabilities for having zero and one defective transistor, respectively, in a randomly chosen fault scenario.

Rtrad_trans = x0 · fbool + x1 ·Rtrad_single (2)

E. Fitness Function

Earlier work on evolving transistor level redundancy [10] achieved best results when evolving the circuits in two phases. The first phase generates redundancy using an Rtrad_single based fitness function. As concluded in [11], an Rtrad_single based fitness function is better suited for generating redundancy than an Rtrad_trans based fitness function.

Phase one typically generates very bloated circuits. The evolved circuit is, therefore, optimised in a second evolutionary phase using an Rtrad_trans based fitness function. Rtrad_trans is much less forgiving for transistors without any real purpose.

The following two fitness functions, f1 and f2, are applied in this paper for phase one (equation (3)) and phase two (equation (4)):

$$f_1 = k_1 f_{rms} + k_2 \overline{f_{rms}} + k_3 R_{trad\_single} + k_4 f_{bool} \qquad (3)$$

$$f_2 = k_1 f_{rms} + k_2 \overline{f_{rms}} + k_3 R_{trad\_trans} + k_4 f_{bool} \qquad (4)$$

The first component, frms, is measured in a single test with no defective transistors and is included to encourage high gain circuits. The second component, the average frms (written with an overline in equations (3) and (4)), represents the average frms after having tested the circuit for all single faults. The second component is included to make sure the circuit performs as well as possible, also when there are defective transistors. The third component is the reliability metric and the fourth component, fbool, is to make sure a working circuit is always rewarded more than a non-working circuit.
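Both fitness functions are the same weighted sum with a different reliability term, as the following Python sketch shows. The coefficient values are the ones reported in the experimental setup of section III; the function name and argument names are assumptions for illustration.

```python
# Coefficients k1..k4 as reported in the experimental setup (section III-A).
K = (0.2, 0.2, 0.2, 0.4)


def fitness(f_rms, f_rms_avg, reliability, f_bool, k=K):
    """Weighted sum used in equations (3) and (4).

    f_rms       -- frms with no defective transistors
    f_rms_avg   -- average frms over all single faults
    reliability -- Rtrad_single (phase one) or Rtrad_trans (phase two)
    f_bool      -- 1 for a fully functioning circuit, else 0
    """
    k1, k2, k3, k4 = k
    return k1 * f_rms + k2 * f_rms_avg + k3 * reliability + k4 * f_bool
```

Plugging in the values listed in table I for the evolved inverter gives a phase one fitness of roughly 0.995 and a phase two fitness of roughly 0.989. These two numbers are computed here from the table entries and are not reported in the paper.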

III. EVOLVING A CIRCUIT TOLERANT TO GATE SHORTS

An evolutionary experiment is performed where evolution is steered towards creating redundancy tackling gate/drain and gate/source shorts. The purpose is to generate at least one example that illustrates how evolution tolerates a gate short in a circuit. This section explains the experimental setup and provides an analysis of the best circuit that resulted from the experiment.

A. Experimental Setup

To keep the size small and complexity (and thus the evolution time) low, the chosen target functionality for the evolutionary experiment is a digital inverter.

The test setup employed when measuring a circuit’s functionality is illustrated in figure 2. To make sure the evolved circuits are able to drive a representative load, the output of the circuit under test is connected to a chain of two inverters. Inverters are also driving the inputs to the circuit under test to avoid using perfect voltage sources as inputs. Perfect voltage sources would not be representative when an injected fault results in a short between input and either Vdd or Vss.

Fig. 2. Functionality test setup (schematic: input voltage sources driving inputs in_1 to in_n of the circuit under test through inverters; the measured output is taken from out)

TABLE I
CHARACTERISTICS OF INVERTER IN FIGURE 3

Property                          Value
size                              15 transistors
frms                              0.998929
frms (average, single faults)     0.978471
Rtrad_single                      1.000000
Rtrad_trans | Rt = 0.99           0.965700

When the functionality of a circuit is to be tested, all possible input transitions are tested in turn by setting the Piece Wise Linear (PWL) input voltage sources in figure 2 to correspond to the input transition to be tested. A transient analysis of the test setup is then performed in the BSD licensed SPICE simulator ngspice [12].

The circuit output is measured after inputs have been stable for 50ns. Circuit components allowed are nMOS and pMOS transistors. The V1.0 BPTM 22nm CMOS transistor models [13] are applied with allowed transistor sizes from 30nm to 1000nm. The supply voltage is Vdd = 1V. Feedback loops are not allowed, but several transistors may drive the same wire.

A representation resembling Cartesian Genetic Programming [14] is applied, with the modification that in a gene, both inputs and the output of the component are explicitly defined, in addition to component type (nMOS or pMOS transistor) and transistor dimensions. A (1+4) evolutionary strategy is applied with mutation rate 0.1. Mutation is applied independently for each information block (inputs, output, type, sizes) inside each gene in the genome.

Evolution may create the circuit from a maximum of 50 transistors and 55 nets for each circuit. A net is an internal wire in the circuit, including inputs and output. Fault scenarios are created employing the following defect types: drain-source short (stuck-closed), gate-drain short and gate-source short. Rtrad_trans, a component in the fitness function for evolution phase two, can only be found given a certain transistor reliability. The transistor reliability applied in this experiment is 0.99. Coefficients used for the fitness functions for both evolution phases are k1 = k2 = k3 = 0.2 and k4 = 0.4. The high value for k4 is there to favour fully functioning circuits over non-functioning circuits.

Fig. 3. Evolved defect tolerant inverter (schematic: pull-up network, pull-down network and input network between Vdd and Vss. Legible transistor labels: M1 pmos w=268nm l=30nm; M6 pmos w=79nm l=30nm; M3 pmos w=65nm l=30nm; M7 pmos w=30nm l=30nm; M8 nmos w=226nm l=58nm; M12 nmos w=72nm l=30nm; M9 nmos w=65nm l=247nm; M5 nmos w=100nm l=30nm)

Fig. 4. Standard inverter (schematic: pMOS M1 and nMOS M2 between Vdd and Vss)

B. Analysis of Best Evolved Inverter

The best circuit found by evolution is shown in figure 3. The evolved inverter is fully functional and different metrics for the inverter are shown in table I. As seen by the Rtrad_single metric in table I, the inverter is tolerant to all possible single gate/drain, gate/source and source/drain shorts on any transistor present in the circuit. As such, the circuit in figure 3 is suited for further analysis regarding how to tolerate gate/source and gate/drain shorts.

The standard CMOS inverter is shown in figure 4 and consists of a pull-up pMOS transistor (M1) and a pull-down nMOS transistor (M2). The first step towards understanding the evolved circuit is to identify the corresponding pull-up and pull-down transistor networks. Transistors M1, M2, M6 and M7 in figure 3 represent the pull-up network, while transistors M12 and M15 represent the pull-down network.

The pull-up and pull-down structures are interesting by themselves. First, they show that evolution has introduced redundant transistors in series (M1/M6, M2/M7, M12/M15) to tolerate drain/source shorts (stuck-closed transistors). As the drain/source short defect was one of the defects injected during evolution, redundant transistors in series were expected.

However, evolution also introduced redundant transistors in parallel in the pull-up network (the M1–M6 chain is parallel to M2–M7), a structure known to tolerate stuck-open defective transistors. Stuck-open was not one of the defects injected during evolution, so why did evolution introduce these parallel structures? When there is a short between gate and source on transistor M1 or M2, the transistor is effectively stuck-open, resulting in the need for a parallel chain of transistors. The same reasoning applies for the pull-down network. However, instead of introducing parallelism in the pull-down network, evolution has relied on the output slowly discharging to the correct value. Unfortunately, such a solution is both suboptimal and technology dependent and is therefore not suited for a traditional solution.

Fig. 5. Resistor implementation (schematic: one pMOS and one nMOS transistor)

Fig. 6. New defect tolerance method, shown for pMOS. nMOS transistors are substituted in the same way. Resistors are implemented as in figure 5.

The next step is to understand the purpose of the transistors in the input network, i.e. the transistors that connect the inverter input with the transistor gates in the pull-up and pull-down networks. None of these transistors are connected to Vdd or Vss, but are instead just passing on the inverter input. SPICE simulations showed that all nets in the input network are more or less degraded versions of the inverter input. It seems that evolution has tried to separate the inverter input from the pull-up and pull-down networks with a resistive circuit. To tolerate a short between, for example, the transistor M1 gate and source, the inverter input must be separated from the gate to avoid clamping the input to Vdd and thus resulting in the inverter output stuck-at-0. If the input is separated from the shorted gate with a resistor, the result is a slightly degraded input signal whilst retaining correct output.

A resistor in a CMOS IC is usually formed by using an nMOS transistor with gate connected to Vdd. Evolution never introduced such resistors in the input network in figure 3 because those resistors are not themselves tolerant to gate/source and gate/drain shorts. Instead, evolution has created the input network without any connections to Vdd or Vss, thus avoiding the problem of gate shorts in the input network.

IV. A GENERALISED DEFECT TOLERANCE TECHNIQUE

The analysis in section III-B can now be used as a basis to form a new redundancy technique: (1) To allow for stuck-open and stuck-closed transistors, redundant transistors should be introduced to the pull-up and pull-down networks both in series and parallel, as in figure 1. (2) To tolerate gate/source and gate/drain shorts, the transistor gates in the pull-up and pull-down networks must be isolated from the inverter input using a defect tolerant resistor. A defect tolerant resistor can be formed using the transistor configuration in figure 5. Combining these two elements results in the technique summarised in figure 6.

Fig. 7. Defect tolerant inverter. Resistor sizing given as W/L. (Schematic between Vdd and Vss: pMOS transistors M1 to M4, each w=50nm l=30nm; nMOS transistors M5 to M8, each w=30nm l=30nm; resistors R1 to R8, with the sizing pmos=30/30 nmos=30/30 legible at R2)

Fig. 8. Simulation of inverter in figure 7 (no defects). Panels: (a) Input, (b) Output; voltage [V] against time [ns].

To demonstrate how the technique summarised in figure 6 can be used to create a defect tolerant inverter, figure 7 shows the result after having applied the substitution in figure 6 to the standard inverter in figure 4. The resistance of the resistors must be large enough to achieve isolation. For the circuit in figure 7, minimum sized transistor gates are suitable for most of the resistors, except for R7 and R8, which should be sized for larger resistance to reduce the impact of a gate short to Vss.

Figure 8 shows a SPICE simulation of the inverter in figure 7 when no transistors are defective. For this simulation and all later simulations in this paper, the V2.0 BPTM 22nm transistor models are applied. Figure 9 shows the same simulation when transistor M7 has a short between gate and Vss, one of the most damaging shorts. As seen in figure 9(a), the shorted gate pulls down the circuit input, but the isolating resistor R7 ensures this pull-down is so small that it does not result in incorrect output. Figure 9 shows only one fault scenario. Further simulations were performed to verify that the inverter is actually capable of tolerating all single transistor defects of types gate/source short, gate/drain short, stuck-open and stuck-closed. A summary of the performance of the gate is given in table II.

V. RELIABILITY ANALYSIS

To investigate the quality of the proposed defect tolerance technique, the technique has been applied to a standard NAND gate. This new defect tolerant NAND gate is called NANDnew in this section. The reliability of NANDnew is then compared with the reliability of three other NAND implementations: NAND (standard four-transistor CMOS NAND), NANDser−par (a series–parallel version of CMOS NAND) and NANDtmr (a TMR version of NAND, see figure 10). To make the comparison with NANDtmr fair, the voter is triplicated to tolerate faults. Triplicated voters do, however, mean that the circuit will have three outputs instead of one and give NANDtmr the added advantage of only needing two of the three outputs to be correct. Each voter in NANDtmr consists of a mirrored adder [15] connected to an inverter.

Fig. 9. Simulation of inverter in figure 7 (M7 gate to Vss shorted). Panels: (a) Input, (b) Output; voltage [V] against time [ns].

TABLE II: CHARACTERISTICS OF INVERTER IN FIGURE 7


The reliability metric Rtrad_trans can be interpreted as the probability of having a correct output given a probability Rt that a transistor is functional. Rtrad_trans is therefore a more realistic reliability metric for a real circuit than Rtrad_single and is thus chosen for the following experiments. It is further assumed that each transistor fails independently with a certain probability (1−Rt). When failing, the transistor may be stuck-open, stuck-closed, or gate/source, gate/drain or source/drain shorted. Each defect type is equally probable.

B. Results

Monte Carlo simulations for different levels of transistor reliability Rt have been performed on the different NAND implementations and the results are shown in figure 11. Each point in the graph represents the result after 10000 simulations with random fault scenarios for the given Rt.
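The Monte Carlo procedure described above can be sketched as follows. This is a minimal illustration only: the function names, the dictionary encoding of a fault scenario and the `no_redundancy` stand-in evaluator are assumptions made here, and in the actual experiments each scenario would be evaluated with a SPICE simulation rather than a Python predicate.

```python
import random

# Defect types as listed in the text; each is assumed equally probable.
DEFECT_TYPES = ("stuck_open", "stuck_closed", "gate_source_short",
                "gate_drain_short", "source_drain_short")


def sample_fault_scenario(n_transistors, r_t, rng):
    """Draw one fault scenario: each transistor fails independently
    with probability 1 - r_t and then gets a uniformly chosen defect."""
    scenario = {}
    for index in range(n_transistors):
        if rng.random() >= r_t:  # true with probability 1 - r_t
            scenario[index] = rng.choice(DEFECT_TYPES)
    return scenario


def estimate_reliability(is_functional, n_transistors, r_t,
                         trials=10000, seed=0):
    """Fraction of random fault scenarios in which the circuit works."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        if is_functional(sample_fault_scenario(n_transistors, r_t, rng)):
            passed += 1
    return passed / trials


def no_redundancy(scenario):
    """Hypothetical stand-in for a SPICE run: a circuit without
    redundancy is taken to work only when no transistor is defective."""
    return len(scenario) == 0


if __name__ == "__main__":
    for r_t in (0.9, 0.95, 0.99):
        print(r_t, estimate_reliability(no_redundancy, 4, r_t))
```

Replacing `no_redundancy` with a function that injects the sampled defects into a netlist and checks the simulated output would give one point of a curve like those in figure 11.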

As seen in figure 11, NANDnew is the most reliable given the assumption that all defect types are equally probable. NANDtmr is slightly more reliable than plain NAND for Rt ≥ 0.95. NANDser−par is not suited at all for gate shorts.

The large number of extra transistors in the proposed redundancy technique might have a bad impact on the reliability if the probability of having shorted transistor gates is much less than that of stuck-open or stuck-closed defects. To investigate how the extra transistors affect reliability when the probability of having gate shorts is low, a new Monte Carlo experiment was performed. The reliability of NANDnew is compared with NANDser−par for different probabilities of having a gate/source or gate/drain short when a transistor is defective, with Rt = 0.96. The results are shown in figure 12.

Fig. 10. Triplicated NAND gates and voters (inputs in_1a, in_2a, in_1b, in_2b, in_1c, in_2c; three voters V; outputs out_a, out_b, out_c)

Fig. 11. Reliability of different NAND implementations (reliability R against transistor reliability Rt from 0.5 to 1; curves: NAND, NANDtmr, NANDser-par, NANDnew)

As seen in figure 12, NANDnew is not affected by the defect type while the reliability of NANDser−par decreases as the probability of having a gate short increases. When there are no gate shorts, the reliability of the two NAND implementations is about the same. As such, the large amount of area devoted to the isolation of inputs in the proposed technique does not seem to be a disadvantage to reliability.

VI. CONCLUSION AND FURTHER WORK

An evolutionary experiment has been performed where an inverter was evolved for tolerance to shorted transistor terminals. Using an analysis of the evolved inverter as basis, a new redundancy technique was proposed. A NAND gate implemented applying the proposed technique is shown to perform well compared to other NAND gate implementations, with regard to reliability.

The proposed redundancy technique is an augmentation of the series-parallel technique and is therefore even less area efficient than the original series-parallel technique. Further work should look at more area efficient ways of achieving tolerance to gate shorts.

None of the experiments in this paper have considered shorts and opens in the metal layers implementing wiring. Defects in the metal layers are an important class of defects and, although many metal defects will have an effect similar to one of the transistor defect types considered in this paper, further work should look at how defects in the metal layers affect reliability.

Fig. 12. Reliability for different probabilities for a defective transistor being either gate/source short or gate/drain short. 0 means gate shorts never occur, 0.5 means gate/drain and gate/source shorts are as probable as stuck-open and stuck-closed. (Reliability R against P(transistor defect is gate short) from 0 to 0.5; curves: NANDnew, NANDser-par)

REFERENCES


[2] P. K. Lala, Self-Checking and Fault Tolerant Digital Design. Morgan Kaufmann Publishers, 2001.

[3] J. Abraham and W. Fuchs, “Fault and error models for VLSI,” Proceedings of the IEEE, vol. 74, no. 5, pp. 639–654, May 1986.

[4] R. E. Lyons and W. Vanderkulk, “The use of triple-modular redundancy to improve computer reliability,” IBM Journal, pp. 200–209, April 1962.

[5] W. Pierce, Failure-Tolerant Computer Design. Academic Press, 1965.

[6] J. G. Tryon, Redundancy Techniques for Computing Systems. Spartan Books, 1965, ch. Quadded Logic, pp. 205–228.

[7] E. F. Moore and C. E. Shannon, “Reliable circuits using less reliable relays,” J. Franklin Inst., pp. 191–208, 291–297, 1956.

[8] A. E. Eiben and J. E. Smith, Introduction to Evolutionary Computing. Springer, 2003.

[9] X. Yao and T. Higuchi, “Promises and challenges of evolvable hardware,” in Int. Conf. Evolvable Systems (ICES). Springer, 1996.

[10] A. Djupdal and P. C. Haddow, “Evolving efficient redundancy by exploiting the analogue nature of CMOS transistors,” in CIRAS, 2007.

[11] A. Djupdal and P. C. Haddow, “Evolving redundant structures for reliable circuits – lessons learned,” in AHS, 2007, pp. 455–462.

[12] GEDA, “Ngspice homepage,” http://ngspice.sourceforge.net/, 2007.

[13] W. Zhao and Y. Cao, “New generation of predictive technology model for sub-45nm design exploration,” in Int. Symp. Quality Electronic Design (ISQED), 2006, pp. 585–590.

[14] J. F. Miller and P. Thomson, “Cartesian genetic programming,” in Genetic Programming, Proc. EuroGP, 2000, pp. 121–132.

[15] D. Hampel, K. J. Prost, and N. R. Scheinberg, “Threshold logic using complementary MOS device,” Jun 1974, U.S. Patent 3 900 742.


Paper VIII

The Route to a Defect Tolerant LUT through Artificial Evolution

Asbjørn Djupdal and Pauline C. Haddow

Submitted to IEEE Transactions on Circuits and Systems I, 2008


The Route to a Defect Tolerant LUT through Artificial Evolution

Asbjoern Djupdal and Pauline C. Haddow, Member, IEEE

Abstract— The challenge of production defects for integrated circuits is expected to increase as the feature size is scaled towards the limits of what is possible to manufacture. To handle the increasing number of defects, some form of redundancy can be employed for defect tolerance.

The FPGA can be seen as a bridge between production and application designer. Introduction of defect tolerance techniques to the FPGA itself could provide a defect free gate array to the application designer, despite production defects.

This paper describes a search for transistor level defect tolerance for FPGA look-up tables (LUTs) through the application of artificial evolution. Two different strategies result in two defect tolerant LUT implementations. Through simulations, the new LUT implementations are compared to a traditional non-redundant LUT and a TMR version of the traditional LUT.

Index Terms— Defect tolerance, FPGA, transistor level redundancy, LUT, artificial evolution, EHW

I. INTRODUCTION

The continued miniaturisation of features in CMOS integrated circuits (ICs) has resulted in larger, more complex and faster devices. The feature size in today’s high-end chips is already at the nanoscale and is expected to scale further [1]. The lithographic process employed for the production of ICs cannot be perfectly controlled, resulting in reduced yield due to production defects. The ITRS roadmap [1] predicts that this situation will worsen as CMOS is scaled down, resulting in a significant percentage of produced chips having defects. Predictions for future technologies are even more pessimistic where a significant portion of each produced chip is expected to be defective [2]. To handle the increasing defect rates and avoid yield levels resulting in prohibitively expensive chips, circuits should be designed to tolerate a certain amount of defective components. Such defect tolerant circuits could be achieved through the introduction of redundant components.

The Field Programmable Gate Array (FPGA) is a suitable target for redundancy techniques. FPGAs are today widely used, both for the original purpose as a prototyping device and as a component in end user products. High end FPGAs are produced with the most advanced production processes and are, therefore, among the first ICs that will encounter the expected increase in production defects. FPGAs can be seen as a bridge between production and the application designer. In a scenario where ICs must exhibit some kind of defect tolerance, an FPGA built with transparent redundancy techniques can provide the application designer with a functionally correct gate array, despite production defects. In addition, specialising the redundancy techniques towards the FPGA architecture can result in more efficient redundancy. To achieve area efficient defect tolerance, the typical approach is to exploit structural regularity [3]. The FPGA has a regular structure, which has inspired the search for effective defect tolerance techniques for FPGAs.

Both authors are from the CRAB Lab at the Department of Computer Science, Norwegian University of Science and Technology

This paper represents a search for a defect tolerant look-up table (LUT), one of the essential components in the FPGA. A defect tolerant LUT is, therefore, one step towards a more defect tolerant FPGA.

A new transistor level defect tolerance technique, termed herein the Multiple Short-Open (MSO) technique, was presented by the authors in [4]. The MSO technique was the result of a manual analysis of circuits created with an Evolutionary Algorithm (EA). The process towards this technique, as well as the technique itself, is presented in section V so as to highlight how EAs were applied in this process.

Two different strategies towards a defect tolerant LUT are followed in this paper. The first strategy is to apply the MSO technique to a traditionally designed LUT. The second strategy is to apply an EA in an attempt at evolving a defect tolerant LUT directly. This paper thus considers both the creation of defect tolerant LUTs and the application of EAs for the purpose of achieving defect tolerance.

Section II presents an overview of related work on traditional defect tolerance. Section III gives an introduction to artificial evolution and presents related work on evolved defect tolerance. Section IV discusses a number of issues and practical considerations important for the evolution of transistor level circuits exhibiting redundancy. Section V presents the process that led to the MSO technique. Section VI applies the MSO technique for constructing a defect tolerant LUT. Section VII presents an evolutionary experiment where a defect tolerant LUT is evolved directly. A comparison of LUT implementations and a discussion is given in section VIII and the paper concludes in section IX.

II. REDUNDANCY TECHNIQUES FOR DEFECT TOLERANT SYSTEMS

Fig. 1. Series and parallel replication of transistors (schematic with terminals A, B and C)

One of the most popular and well known redundancy techniques is Triple Modular Redundancy (TMR) [5]. Three equal modules calculate the same function and a voter outputs the majority output. TMR is most often applied at the system level for critical systems that require a very low probability of failing. However, the reliability of a system is typically increased if redundancy is introduced at a more fine grained level. A system can be split up into smaller subsystems, each made redundant with TMR and cascaded to form the complete system. In a cascaded TMR system with small modules, a significant part of the system is devoted to voting. As such, the voter may need to be triplicated as well.
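The role of the voter in a TMR stage can be illustrated with the standard two-out-of-three reliability expression. The sketch below is not taken from the paper: it assumes independent module failures and a single, non-redundant voter, and the names `tmr_reliability`, `r_module` and `r_voter` are introduced here for illustration only.

```python
def tmr_reliability(r_module: float, r_voter: float = 1.0) -> float:
    """Probability that at least two of three independent modules
    work and that the (single) voter works as well."""
    two_of_three = 3 * r_module**2 * (1 - r_module) + r_module**3
    return r_voter * two_of_three


if __name__ == "__main__":
    # Compare an ideal voter with a voter as unreliable as one module.
    for r in (0.99, 0.9, 0.5):
        print(r, tmr_reliability(r), tmr_reliability(r, r_voter=r))
```

With an ideal voter the stage is more reliable than one module whenever the module reliability exceeds 0.5, while an unreliable voter multiplies directly into the result, which is one way to see why the voter becomes the weak point for small modules.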

At the gate level, when the size of each submodule is only a few gates, TMR is unsuitable due to the voter becoming a dominant factor with respect to susceptibility to defects. A gate level alternative to TMR is interwoven logic [6]. Interwoven logic involves constructing the network of logic gates in a way such that it masks defects. Defect masking is achieved by quadrupling every gate in the system and connecting the gates in a specific way so as to avoid the need for a voter.

If very high reliability is required, redundancy can be introduced at the transistor level. Transistor level redundancy can be applied to build reliable components that form the basis for higher level redundancy techniques. One transistor level redundancy technique known as series-parallel transistor replication is shown in figure 1. Originally described by Moore and Shannon for relays [7], the technique provides redundant transistors in series for tolerating stuck-closed defects and redundant transistors in parallel to tolerate stuck-open defects. Combining these, as in figure 1, results in tolerance to both stuck-open and stuck-closed faults. TMR may also be applied at the transistor level. However, when considering only stuck-open and stuck-closed defective transistors at high defect rates, the series-parallel technique results in more reliable circuits than TMR [8]. Although the series-parallel technique tolerates all single stuck-open and stuck-closed defects, other possible defects, such as transistors with shorted gate and source, can still be catastrophic.
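The single-defect tolerance of series-parallel replication can be checked with a small switch-level model. The sketch below is an idealisation under stated assumptions: each transistor is treated as a perfect switch, and the replicated structure is assumed to be two parallel branches of two series transistors. The exact topology of figure 1 may differ, and all names are hypothetical.

```python
from itertools import product

DEFECTS = ("stuck_open", "stuck_closed")


def conducts(state, gate_on):
    """Ideal switch: a stuck-open device never conducts, a stuck-closed
    device always conducts, a healthy device follows its gate."""
    if state == "stuck_open":
        return False
    if state == "stuck_closed":
        return True
    return gate_on


def quad_conducts(states, gate_on):
    """Assumed topology: (T0 in series with T1) parallel to (T2 in series with T3)."""
    a, b, c, d = (conducts(s, gate_on) for s in states)
    return (a and b) or (c and d)


def quad_works(states):
    """The replicated switch must conduct when on and block when off."""
    return quad_conducts(states, True) and not quad_conducts(states, False)


def tolerates_all_single_defects():
    for index, defect in product(range(4), DEFECTS):
        states = ["ok"] * 4
        states[index] = defect
        if not quad_works(states):
            return False
    return True


if __name__ == "__main__":
    print("single defects tolerated:", tolerates_all_single_defects())
    # Two stuck-closed devices in the same branch defeat the structure.
    print(quad_works(["stuck_closed", "stuck_closed", "ok", "ok"]))
```

The model deliberately omits gate/source and gate/drain shorts, since those affect the driving signal rather than a single switch and cannot be captured by this abstraction.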

Bolchini et al. [9] present another example of transistor level redundancy targeting multiple output static CMOS circuits. Single stuck-closed defects are tolerated and a number of other defects are detected through the application of Berger codes.

One benefit of introducing redundancy at the transistor level is the possibility of exploiting non-digital properties of the technology [10]. One example is found in earlier work by the authors [11] where a minority gate was made tolerant to stuck-open and stuck-closed defects through a redundancy technique not possible at the digital gate level. Bolchini et al. [12] present another example of transistor level redundancy not possible at the gate level, providing tolerance to stuck-closed defective transistors.

A. Defect Tolerance for FPGAs

The most widely researched defect tolerance techniques for FPGAs are variants of the redundant row technique, originally proposed by Hatori et al. [13]. One row of logic blocks is reserved as a spare row. If a defect is found in a row, the defective row is bypassed and the spare row is put into use. A variant of the redundant row technique is employed by Altera to enhance yield in some of their commercial FPGAs [14].

Most of the redundancy techniques for FPGAs, including the redundant row technique, may be said to work at the chip level. However, as is the goal of this paper, it is also possible to apply redundancy internal to the basic building blocks of the FPGA, such as the logic blocks and switch blocks. KleinOsowski and Lilja [15] explore both the application of error correcting codes and TMR to enhance the reliability of LUTs. Saha et al. [16] suggest introducing error correcting codes to the LUTs in the Cell Matrix architecture. Doumar and Ito [17] add an extra wire to the switch block to provide a bypass for a faulty switch.

Defect tolerance techniques for FPGAs, especially in the context of enhancing yield, are reviewed in detail by the authors in [18].

III. EVOLUTION OF DEFECT TOLERANT CIRCUITS

The task of finding an optimal circuit given a set of requirements is often so complex that an exact algorithmic method is too time consuming. For that reason, fully automatic hardware design typically employs a heuristic search process. One such heuristic search process is artificial evolution (AE) [19] and its application for hardware design is termed evolvable hardware (EHW) [20]. The algorithm employed in AE systems is called an evolutionary algorithm (EA) and one of the most well known EAs is the genetic algorithm (GA) [21].

GAs mimic aspects of natural evolution where a population of individuals is repeatedly changed by reproduction and mutation. The result is hopefully an increasingly fit population. A fitness function estimates an individual's ability to solve the given problem and is the basis for a mechanism that selects good parents allowed to reproduce. One selection mechanism is tournament selection, where a group of individuals is randomly selected from the population; with probability p the fittest individual in the group is chosen and with probability 1−p a random individual is chosen. In EHW when applied for circuit design, each individual represents a circuit and the search is typically stopped when one individual in the population is a circuit considered to be good enough at solving the given problem.
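Tournament selection as described above can be sketched in a few lines. The function name and parameters are hypothetical, and the sketch assumes that the random individual chosen with probability 1−p is drawn from the tournament group; the text does not state whether it is drawn from the group or from the whole population.

```python
import random


def tournament_select(population, fitness, group_size, p, rng):
    """Pick one parent: sample a group without replacement, then return
    its fittest member with probability p, else a random group member."""
    group = rng.sample(population, group_size)
    if rng.random() < p:
        return max(group, key=fitness)
    return rng.choice(group)


if __name__ == "__main__":
    rng = random.Random(42)
    population = list(range(20))  # toy individuals: fitness is the value itself
    parents = [tournament_select(population, lambda x: x, 4, 0.8, rng)
               for _ in range(5)]
    print(parents)
```

A larger group or a larger p increases the selection pressure towards the fittest individuals.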

AE is not bound by traditional design techniques and can exploit properties of the technology a human designer would not think of [22]. As such, new and interesting circuits can result. However, AE techniques are currently resource bound, limiting the size of circuits that may be evolved.

Several researchers have applied EHW in the search for fault or defect tolerant circuits. One approach is to evolve new solutions when the old one no longer functions, e.g. [23], [24]. A defect that results in faulty behaviour triggers a mechanism that starts the evolution of a new circuit that can cope with the defect.


One way to achieve defect tolerant FPGAs is through static hardware redundancy. The goal of this paper is static redundancy for LUTs and the static strategies for evolved fault and defect tolerance are, therefore, more relevant to this paper. Thompson [25], [26] pioneered the field of evolved fault tolerance with an evolved state machine for a robot controller, capable of tolerating stuck-at faults in the 32-bit RAM implementing the state machine. Thompson was able to evolve a static solution that was tolerant to any single stuck-at fault in the RAM. Canham and Tyrrell [27] evolved an oscillator for a Xilinx Virtex FPGA exhibiting redundancy to tolerate simulated stuck-at and bridging faults. Hartmann and Haddow [28] have evolved gate level circuits targeting tolerance to both noise and faults. Previous work by the authors [29] has directly targeted evolution of static hardware redundancy structures at the gate level, resulting in voter based solutions and more intricate solutions resembling interwoven logic. Keymeulen et al. [23] have evolved fault tolerant transistor level circuits for a field programmable transistor array (FPTA), resulting in analog multipliers and digital XNOR gates tolerant to six predefined defects in the FPTA. Layzell and Thompson [30] present another example of evolved transistor level defect tolerance for digital gates. Although a byproduct of their fitness function, they evolved digital inverters where parallel replication of transistors provides tolerance to stuck-open defects.

IV. ISSUES ON EVOLVING TRANSISTOR LEVEL REDUNDANCY

The approach taken for evolution of redundant circuits in this paper follows the technique outlined in [10]. When fitness is to be evaluated, the circuit is tested repeatedly with different injected defects. Defect injection during fitness evaluation provides a means to calculate a reliability metric for the circuit. The quality of the evolved redundancy depends on the way defect injection is performed and how the reliability metric is included in the fitness function.

A. Measuring Functionality

To measure if a circuit is functional according to a specification, test vectors are applied at the circuit’s inputs and the response on the output is monitored with SPICE simulations. The output of a circuit is defined to be true if having a value larger than Vdd/2. If the output is less than Vdd/2, it is defined to be false. One functionality metric, fbool, simply states whether the circuit has a correct Boolean response to all tested input vectors (fbool = 1) or not (fbool = 0).

fbool is an important functionality metric when reliability is to be determined. However, fbool might be too coarse grained when evolving towards a specific functionality. To prevent the evolutionary algorithm from becoming a random search, the fitness function must provide enough information for separating good and poorer individuals, even when no individual in the population has reached 100% functionality. Fitness should, therefore, include a functionality metric that represents functionality in terms of how close the circuit’s output voltage is to the desired output voltage. The main functionality metric for this paper, frms, is based on the Root-Mean-Square (RMS) error between the simulated output and the ideal output, for all n output measurements.

$$f_{rms} = 1 - \sqrt{\frac{\sum_{i=1}^{n} \left(sim(i) - ideal(i)\right)^{2}}{n}} \qquad (1)$$

Fig. 2. Functionality test setup (inputs in_1, in_2, ..., in_n of the circuit under test driven from the input voltage sources; measured output taken after the output out)

An illustration of the functionality test setup is given in figure 2. To make sure the evolved circuits are able to drive a representative load, the output of the circuit under test is connected to a chain of two inverters. Inverters are also driving the inputs to the circuit under test to avoid using perfect voltage sources as inputs. Perfect voltage sources would not be representative when an injected fault results in a short between input and either Vdd or Vss.
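The two functionality metrics of this section can be sketched as below. The sketch assumes that the simulated and ideal output voltages are given as equal-length lists and that Vdd = 1 V, so that equation (1) yields a value between 0 and 1 without further normalisation; the function names mirror the metric names but are otherwise introduced here.

```python
import math


def f_bool(sim, ideal, vdd=1.0):
    """1 if every measured output lies on the same side of Vdd/2
    as the ideal output, otherwise 0."""
    half = vdd / 2
    return int(all((s > half) == (i > half) for s, i in zip(sim, ideal)))


def f_rms(sim, ideal):
    """Equation (1): one minus the RMS error over all n measurements."""
    n = len(sim)
    squared_error = sum((s - i) ** 2 for s, i in zip(sim, ideal))
    return 1.0 - math.sqrt(squared_error / n)


if __name__ == "__main__":
    ideal = [1.0, 0.0]        # ideal inverter output for inputs 0 and 1
    degraded = [0.9, 0.2]     # logically correct, reduced voltage swing
    print(f_bool(degraded, ideal), f_rms(degraded, ideal))
```

A degraded but logically correct output keeps fbool at 1 while frms drops below 1, which is the finer gradation the fitness function relies on.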

B. Fault Models

A fault scenario is one possible configuration of faulty transistors for a given circuit. The term fault model is applied in this paper as the specification of how the fault scenarios can be constructed and how probable the different fault scenarios are of occurring. Two fault models are considered in this work: the transistor reliability model and the single fault model.

In the transistor reliability model, each transistor has a certain probability of failing and the transistors fail independently of each other. If a fault scenario for the transistor reliability model is to be created, each transistor in the circuit is tested against a random number generator and selected to be faulty or not, based on a chosen fault rate.

In the single fault model, a circuit can have exactly one fault at any time and any single fault scenario is equally probable. One and only one of the transistors is selected to fail for any given fault scenario.

C. Failing Transistors

A transistor may fail in several ways. In this paper, several types of transistor defects are considered: Stuck-open transistors are permanently off and are modelled by removing the transistor from the SPICE netlist. Stuck-closed transistors are permanently on and are modelled by shorting the source and drain with a 1Ω resistor. In addition, there may be a short between gate/drain or gate/source, both of which are modelled with a 1Ω resistor shorting the respective transistor terminals.
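The defect models above translate into simple netlist edits. The following sketch is an illustration, not the authors' tooling: it assumes MOSFET lines in the usual SPICE form `Mname drain gate source bulk model ...`, and the function name, the defect labels and the `Rdefect_` prefix are hypothetical.

```python
# Which two terminals the 1 ohm resistor connects for each short.
SHORTS = {
    "stuck_closed": ("drain", "source"),
    "gate_drain": ("gate", "drain"),
    "gate_source": ("gate", "source"),
}


def inject_defect(netlist_lines, transistor, defect):
    """Return a copy of the netlist with one defect injected.

    stuck_open removes the transistor line; the other defects keep the
    transistor and add a 1 ohm resistor across the shorted terminals."""
    if defect != "stuck_open" and defect not in SHORTS:
        raise ValueError(f"unknown defect type: {defect}")
    result, found = [], False
    for line in netlist_lines:
        fields = line.split()
        if fields and fields[0].lower() == transistor.lower():
            found = True
            if defect == "stuck_open":
                continue  # drop the transistor from the netlist
            nodes = dict(zip(("drain", "gate", "source"), fields[1:4]))
            first, second = SHORTS[defect]
            result.append(line)
            result.append(f"Rdefect_{fields[0]} {nodes[first]} {nodes[second]} 1")
        else:
            result.append(line)
    if not found:
        raise ValueError(f"transistor {transistor} not in netlist")
    return result


if __name__ == "__main__":
    inverter = ["M1 out in vdd vdd pmos W=50n L=30n",
                "M2 out in 0 0 nmos W=30n L=30n"]
    print("\n".join(inject_defect(inverter, "M1", "gate_source")))
```

Calling the function once per transistor and defect type enumerates the fault scenarios of the single fault model.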

D. Measuring Reliability

A reliability metric indicates how well a circuit functions in the presence of faults. Reliability may be measured by testing the circuit against a number of randomly selected fault scenarios. The possible fault scenarios depend on the chosen fault model. The Rtrad metric, which is used in this paper, is the percentage of these tests where the circuit is fully functioning (fbool = 1). When Rtrad is applied with the single fault model, it is named Rtrad_single. When applied with the transistor reliability model, the metric is named Rtrad_trans. Rtrad_trans can be said to be the probability of functioning 100%, given a certain transistor fail rate.

Rtrad_single may be calculated exactly by testing all possible single faults. Rtrad_trans may be estimated using a Monte Carlo simulation. Rtrad_trans must, however, be estimated once for every fitness evaluation for every individual in the population during the entire evolutionary experiment. The result is that thousands of Rtrad_trans estimations must be performed for every evolutionary run. A thorough Monte Carlo simulation is, therefore, too time consuming during evolution. One possibility is to exploit the fact that the number of defective transistors in a fault scenario with the transistor reliability model is binomially distributed. If X is a random variable for the number of faults in a fault scenario, x is the number of faults, n is the number of transistors in the circuit and p is the fail rate for the transistors, equation (2) may be applied to find the probability of having a specific number of defective transistors in a fault scenario.

$$P[X = x] = b(x; n, p) = \binom{n}{x} p^{x} (1 - p)^{n-x} \qquad (2)$$

To find the reliability of a circuit, the circuit is evaluated with the zero fault scenario and all the single fault scenarios, and the results are scaled by the probabilities of having that number of defects (x0 for zero defects and x1 for one defect). This is shown in equation (3).

$$R_{trad\_trans} = x_0 \cdot f_{bool} + x_1 \cdot R_{trad\_single} + (1 - x_0 - x_1) \cdot R_{MC>1} \qquad (3)$$

The reliability of the circuit when having more than one defect (RMC>1) must still be found through Monte Carlo simulations. However, if (1−x0−x1) is small, the number of Monte Carlo tests can be greatly reduced. If (1−x0−x1) is close to zero, the RMC>1 part of equation (3) may be ignored completely.
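Equations (2) and (3) can be evaluated directly, as in the sketch below. The function and parameter names are introduced here for illustration; `r_mc_gt1` stands for RMC>1 and defaults to 0, which corresponds to ignoring the multi-defect term and therefore gives a slightly pessimistic value.

```python
from math import comb


def binom_pmf(x, n, p):
    """Equation (2): probability of exactly x defective transistors
    out of n when each fails independently with probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def r_trad_trans(n, p, f_bool, r_trad_single, r_mc_gt1=0.0):
    """Equation (3): weight the zero-defect and single-defect results by
    their probabilities; the remainder is weighted by r_mc_gt1."""
    x0 = binom_pmf(0, n, p)
    x1 = binom_pmf(1, n, p)
    return x0 * f_bool + x1 * r_trad_single + (1 - x0 - x1) * r_mc_gt1


if __name__ == "__main__":
    n, p = 16, 0.01  # example values only, not taken from the paper
    x0, x1 = binom_pmf(0, n, p), binom_pmf(1, n, p)
    print("P[more than one defect] =", 1 - x0 - x1)
    print("R_trad_trans (MC term ignored) =", r_trad_trans(n, p, 1, 1.0))
```

Printing 1 − x0 − x1 first shows how much probability mass is left for the Monte Carlo term, and hence whether it can reasonably be dropped for a given circuit size and fail rate.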

E. Fitness Function

Earlier work on evolving transistor level redundancy [10] achieved best results when evolving the circuits in two phases. The first phase generates redundancy using an Rtrad_single based fitness function. As concluded in [31], an Rtrad_single based fitness function is better suited for generating redundancy than an Rtrad_trans based fitness function. Phase one typically generates very bloated circuits. The evolved circuit is, therefore, optimised in a second evolutionary phase using an Rtrad_trans based fitness function. Rtrad_trans is much less forgiving for transistors without any real purpose.

The following two fitness functions, f1 and f2, are applied in this paper for phase one (equation (4)) and phase two (equation (5)):

$$f_1 = k_1 f_{rms} + k_2 \overline{f_{rms}} + k_3 R_{trad\_single} + k_4 f_{bool} \qquad (4)$$

$$f_2 = k_1 f_{rms} + k_2 \overline{f_{rms}} + k_3 R_{trad\_trans} + k_4 f_{bool} \qquad (5)$$

The first component, frms, is for a single test with no defective transistors and is included to guide evolution towards a functioning circuit with high output voltage swing. The second component, the averaged frms, represents the average frms after having tested the circuit for all single faults. The second component is included to encourage high output voltage swing also when there are defective transistors. The third component is the reliability metric and the fourth component, fbool, is to make sure a working circuit is always rewarded more than a non-working circuit.
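The weighted sum in equations (4) and (5) can be written as one function, with the reliability argument being Rtrad_single in phase one and Rtrad_trans in phase two. The weights k1 to k4 are not given in this section, so the values below are placeholders chosen only for illustration; choosing k4 larger than k1 + k2 + k3 is one possible way, assumed here, to guarantee that a working circuit always scores higher than a non-working one when all metrics lie in [0, 1].

```python
def fitness(f_rms_no_defect, f_rms_single_avg, reliability, f_bool,
            k=(1.0, 1.0, 1.0, 4.0)):
    """Weighted sum of the four fitness components.

    k holds the placeholder weights (k1, k2, k3, k4); the real values
    used in the experiments are not stated in this section."""
    k1, k2, k3, k4 = k
    return (k1 * f_rms_no_defect + k2 * f_rms_single_avg
            + k3 * reliability + k4 * f_bool)


if __name__ == "__main__":
    working = fitness(0.6, 0.5, 0.4, 1)
    broken = fitness(0.95, 0.9, 0.0, 0)
    print(working, broken, working > broken)
```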

F. Genetic Algorithm

A representation resembling Cartesian Genetic Programming [32] is applied, with the modification that in a gene, both inputs and the output of the component are explicitly defined, in addition to component type (nMOS or pMOS transistor) and transistor dimensions. Mutation is applied independently for each information block (inputs, output, type, sizes) inside each gene in the genome. Crossover is performed on gene boundaries.
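A gene of this kind, with per-block mutation and crossover on gene boundaries, could be sketched as follows. This is an assumed encoding for illustration: the field names, the integer node numbering, one-point crossover and the 30 to 1000 nm size range used as defaults are choices made here, and the sketch does not enforce the constraint that feedback loops are disallowed.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    inputs: tuple      # two node indices feeding the transistor
    output: int        # node index driven by the transistor
    kind: str          # "nmos" or "pmos"
    width_nm: int
    length_nm: int


def mutate_gene(gene, rate, n_nodes, rng, size_range=(30, 1000)):
    """Mutate each information block independently with probability rate."""
    inputs, output, kind = gene.inputs, gene.output, gene.kind
    width, length = gene.width_nm, gene.length_nm
    if rng.random() < rate:
        inputs = (rng.randrange(n_nodes), rng.randrange(n_nodes))
    if rng.random() < rate:
        output = rng.randrange(n_nodes)
    if rng.random() < rate:
        kind = "pmos" if kind == "nmos" else "nmos"
    if rng.random() < rate:
        width, length = rng.randint(*size_range), rng.randint(*size_range)
    return Gene(inputs, output, kind, width, length)


def crossover(parent_a, parent_b, rng):
    """One-point crossover that only cuts between genes."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


if __name__ == "__main__":
    rng = random.Random(0)
    genome = [Gene((0, 1), 2, "pmos", 50, 30), Gene((0, 3), 2, "nmos", 30, 30)]
    print([mutate_gene(g, 0.1, n_nodes=8, rng=rng) for g in genome])
```

Because genes are immutable and crossover slices the genome list, a child never contains a gene whose blocks come from two different parents.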

V. MULTIPLE SHORT-OPEN TECHNIQUE

The series-parallel technique described in section II is designed for tolerance to stuck-open and stuck-closed defective transistors. Other defect types can, however, still be catastrophic. One example is transistors where the gate is shorted to either source or drain. To tolerate such defects, a new redundancy technique was introduced in [4]. The technique provides tolerance to any short between two of the three transistor terminals and any open on any of the three transistor terminals. The new technique is herein termed the Multiple Short-Open (MSO) technique and is presented in section V-C. The process towards the technique is also presented, as the process highlights one way of applying AE for the creation of redundancy structures.


The first step towards the new redundancy technique is to evolve a circuit with successful redundancy. To keep the size small and complexity (and thus the evolution time) low, the chosen target functionality for the evolutionary experiment is a digital inverter.

When the functionality of a circuit is to be tested, all possible input transitions are tested in turn by setting the Piece Wise Linear (PWL) input voltage sources in figure 2 to correspond to the input transition to be tested. A transient analysis of the test setup is then performed in the BSD licensed SPICE simulator ngspice [33].
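The enumeration of input transitions and the matching PWL source lines can be sketched as below. The sketch assumes that a transition is an ordered pair of distinct input vectors and uses illustrative timing values; the helper names are hypothetical, and only the general SPICE form of a piecewise-linear source (`PWL(t1 v1 t2 v2 ...)`) is relied upon.

```python
from itertools import product


def input_transitions(n_inputs):
    """All ordered pairs (before, after) of distinct input vectors."""
    vectors = list(product((0, 1), repeat=n_inputs))
    return [(a, b) for a in vectors for b in vectors if a != b]


def pwl_points(before, after, vdd=1.0, t_switch=10e-9, t_edge=0.1e-9):
    """(time, voltage) breakpoints for one input bit going before -> after."""
    return [(0.0, before * vdd),
            (t_switch, before * vdd),
            (t_switch + t_edge, after * vdd)]


def pwl_source(name, node, points):
    """Format breakpoints as a SPICE piecewise-linear voltage source line."""
    body = " ".join(f"{t:g} {v:g}" for t, v in points)
    return f"V{name} {node} 0 PWL({body})"


if __name__ == "__main__":
    # The inverter has one input, hence the two transitions 0->1 and 1->0.
    for before, after in input_transitions(1):
        print(pwl_source("in", "in", pwl_points(before[0], after[0])))
```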

The circuit output is measured after inputs have been stable for 50ns. Circuit components allowed are nMOS and pMOS transistors. The V1.0 BPTM 22nm CMOS transistor models [34] are applied with allowed transistor sizes from 30nm to 1000nm. Supply voltage Vdd = 1V. Feedback loops are not allowed, but several transistors may drive the same wire.

A (1+4) evolutionary strategy is applied with mutation rate 0.1. Evolution may create the circuit from a maximum of 50 transistors and 55 nets for each circuit. A net is an internal wire in the circuit, including inputs and output. Fault scenarios are created employing the following defect types: drain-source short (stuck-closed), gate-drain short and gate-source short.

Fig. 3. Evolved defect tolerant inverter. [Schematic with pull-up, pull-down and input networks between Vdd and Vss; labelled transistors: M1 pmos w=268nm l=30nm; M6 pmos w=79nm l=30nm; M3 pmos w=65nm l=30nm; M7 pmos w=30nm l=30nm; M8 nmos w=226nm l=58nm; M12 nmos w=72nm l=30nm; M9 nmos w=65nm l=247nm; M5 nmos w=100nm l=30nm.]

TABLE I
CHARACTERISTICS OF INVERTER IN FIGURE 3
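A (1+4) evolutionary strategy keeps a single parent and replaces it with the best of four mutated offspring whenever that offspring is at least as fit. The sketch below shows the loop on a toy bit-string problem; accepting ties, the toy fitness and all names are assumptions made for illustration and do not reproduce the circuit simulator.

```python
import random


def one_plus_lambda(parent, mutate, fitness, generations, n_offspring=4, rng=None):
    """(1+lambda) strategy: one parent, n_offspring mutants per generation."""
    rng = rng or random.Random()
    parent_fitness = fitness(parent)
    for _ in range(generations):
        offspring = [mutate(parent, rng) for _ in range(n_offspring)]
        scored = [(fitness(child), child) for child in offspring]
        best_fitness, best_child = max(scored, key=lambda pair: pair[0])
        # Ties go to the child (an assumption), which allows neutral drift.
        if best_fitness >= parent_fitness:
            parent, parent_fitness = best_child, best_fitness
    return parent, parent_fitness


def flip_bits(bits, rng, rate=0.1):
    """Toy mutation: flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in bits]


if __name__ == "__main__":
    best, score = one_plus_lambda([0] * 20, flip_bits, sum, 200,
                                  rng=random.Random(1))
    print(score, best)
```

Since the parent is only replaced by an offspring that is at least as fit, the best fitness never decreases from one generation to the next.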

Rtrad_trans, a component in the fitness function for evolution phase two (equation (5)), can only be found given a certain transistor reliability. The transistor reliability applied in this experiment is 0.99, a relatively low number chosen because of the small size of the circuits in this paper. Coefficients used for the fitness functions for both evolution phases are k1 = k2 = k3 = 0.2 and k4 = 0.4. The high value for k4 is there to favour fully functioning circuits over non-functioning circuits.
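With these coefficients the weighted sum in equations (4) and (5) can be evaluated directly. The sketch below assumes that fbool is 1 for a working circuit and 0 otherwise, which is not stated explicitly in the text; the function and parameter names are hypothetical. The example values are taken from the first row of table II and reproduce the fitness listed there.

```python
K1, K2, K3, K4 = 0.2, 0.2, 0.2, 0.4  # coefficients stated in the text


def fitness(f_rms, f_rms_avg, reliability, works):
    """Weighted sum of the four fitness components.

    f_rms:       output quality with no defective transistors
    f_rms_avg:   average output quality over all tested single faults
    reliability: Rtrad_single (phase one) or Rtrad_trans (phase two)
    works:       True if the defect-free circuit is functional
    """
    f_bool = 1.0 if works else 0.0  # assumed 0/1 encoding of f_bool
    return K1 * f_rms + K2 * f_rms_avg + K3 * reliability + K4 * f_bool


if __name__ == "__main__":
    # Component values of the seed circuit in the first row of table II.
    print(f"{fitness(0.996812, 0.997838, 1.0, True):.6f}")
```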

B. Analysis of Best Evolved Inverter

The best circuit found by evolution is shown in figure 3. The evolved inverter is fully functional and characteristics of the inverter are shown in table I. As seen by the Rtrad_single metric, the inverter is tolerant to all possible single gate/drain, gate/source and source/drain shorts on any transistor present in the circuit. As such, the circuit in figure 3 is suited for further analysis.

The standard CMOS inverter is shown in figure 4 and consists of a pull-up pMOS transistor (M1) and a pull-down nMOS transistor (M2). The first step towards understanding the evolved circuit is to identify the corresponding pull-up and pull-down transistor networks. Transistors M1, M2, M6 and M7 in figure 3 represent the pull-up network, while transistors M12 and M15 represent the pull-down network.

The pull-up and pull-down structures are interesting by themselves. First, they show that evolution has introduced redundant transistors in series (M1/M6, M2/M7, M12/M15) to tolerate drain/source shorts (stuck-closed transistors). As the drain/source short defect was one of the defects injected during evolution, redundant transistors in series were expected.

Fig. 4. Standard inverter. [Schematic: pMOS transistor M1 and nMOS transistor M2 between Vdd and Vss.]

However, evolution also introduced redundant transistors in parallel in the pull-up network (the M1–M6 chain is parallel to M2–M7), a structure known to tolerate stuck-open defective transistors. Stuck-open was not one of the defects injected during evolution, so why did evolution introduce these parallel structures? When there is a short between gate and source on transistor M1 or M2, the transistor is effectively stuck-open, resulting in the need for a parallel chain of transistors. The same reasoning applies for the pull-down network. However, instead of introducing parallelism in the pull-down network, evolution has relied on the output slowly discharging to the correct value. Unfortunately, such a solution is suboptimal because the delay will increase and the output will just barely reach a value less than Vdd/2.

The next step is to understand the purpose of the transistors in the input network, i.e. the transistors that connect the inverter input with the transistor gates in the pull-up and pull-down networks. None of these transistors are connected to Vdd or Vss, but are instead just passing on the inverter input. SPICE simulations showed that all nets in the input network are more or less degraded versions of the inverter input. It seems that evolution has tried to separate the inverter input from the pull-up and pull-down networks with a resistive circuit. To tolerate a short between, for example, the transistor M1 gate and source, the inverter input must be separated from the gate to avoid clamping the input to Vdd and thus resulting in the inverter output stuck-at-0. If the input is separated from the shorted gate with a resistor, the result is a slightly degraded input signal whilst retaining correct output.

A resistor in a CMOS IC can be formed with an nMOS transistor with gate connected to Vdd. Evolution never introduced such resistors in the input network in figure 3 because those resistors are not themselves tolerant to gate/source and gate/drain shorts. Instead, evolution has created the input network without any connections to Vdd or Vss, thus avoiding the problem of gate shorts in the input network.

C. A Generalised Defect Tolerance Technique

The analysis in section V-B was then used as a basis to form the new redundancy technique: (1) To allow for stuck-open and stuck-closed transistors, redundant transistors should be introduced to the pull-up and pull-down networks both in series and parallel, as in figure 1. (2) To tolerate gate/source and gate/drain shorts, the transistor gates in the pull-up and pull-down networks must be isolated from the inverter input using a defect tolerant resistor. A defect tolerant resistor can be formed with a high-resistivity polysilicon meander structure but is here implemented with two transistors as shown in figure 5. Combining figure 1 and the defect tolerant resistor results in the MSO technique summarised in figure 6.

Fig. 5. Resistor implementation. [Schematic: one pmos and one nmos transistor.]

Fig. 6. The Multiple Short-Open (MSO) Technique, shown for pMOS. nMOS transistors are substituted in the same way. Resistors can be implemented as in figure 5. [Schematic with terminals A, B and C before and after the substitution.]

To demonstrate how the MSO technique can be used to create a defect tolerant inverter, figure 7 shows the result after having applied the substitution in figure 6 to the standard inverter in figure 4. The resistance of the resistors must be large enough to achieve isolation. For the circuit in figure 7, minimum sized transistor gates are suitable for most of the resistors, except for R7 and R8, which should be sized for larger resistance to reduce the impact of a gate short to Vss.
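The area cost of the substitution can be estimated with a small calculation. The sketch assumes, in line with figure 7 (eight transistors M1–M8 and eight resistors R1–R8 for a two-transistor inverter), that every original transistor becomes four transistors, each with its own gate resistor built from two transistors; the constant and function names are hypothetical.

```python
SERIES_PARALLEL_COPIES = 4    # each transistor becomes a 2x2 series-parallel group
TRANSISTORS_PER_RESISTOR = 2  # each gate resistor is built from two transistors


def mso_transistor_count(n_original: int) -> int:
    """Transistor count after applying the MSO substitution to every transistor."""
    per_transistor = SERIES_PARALLEL_COPIES * (1 + TRANSISTORS_PER_RESISTOR)
    return n_original * per_transistor


if __name__ == "__main__":
    print(mso_transistor_count(2))   # standard inverter
    print(mso_transistor_count(22))  # 22-transistor LUT1
```

Under these assumptions a 22-transistor LUT1 grows to 264 transistors, which agrees with the MSO size listed in table III.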

VI. DEFECT TOLERANT LUT BASED ON THE MSO TECHNIQUE

The MSO technique described in section V can now be applied when implementing a LUT with traditional design techniques. To keep the size and complexity of the LUT at a manageable level for the evolutionary experiment in section VII, the chosen LUT specification for this paper is one with only one address input, referred to as LUT1. A traditional implementation of LUT1 is shown in figure 8, consisting of two standard 6-transistor SRAM cells and a standard 8-transistor static CMOS multiplexer. Figure 8 also shows which inputs and outputs are required for LUT1. Asserting “W0” or “W1” causes the value on “D” (and its complement “˜D”) to be written to the respective SRAM cell. “A” chooses which SRAM value should be reflected on the output.
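The intended behaviour of LUT1 can be captured in a few lines of behavioural code. This is a purely logical sketch with no electrical detail; the class name, the power-up value of the cells and the convention that A = 0 selects SRAM0 are assumptions made for illustration.

```python
class Lut1:
    """Behavioural model of a 1-input LUT with two storage cells."""

    def __init__(self):
        # Index 0 is SRAM0, index 1 is SRAM1 (initial contents assumed 0).
        self.cells = [0, 0]

    def step(self, d, w0, w1, a):
        """Apply one input vector and return the output value."""
        if w0:
            self.cells[0] = d  # W0 asserted: write D to SRAM0
        if w1:
            self.cells[1] = d  # W1 asserted: write D to SRAM1
        return self.cells[a]   # A selects which cell drives the output


if __name__ == "__main__":
    lut = Lut1()
    lut.step(d=1, w0=1, w1=0, a=0)         # store 1 in SRAM0
    print(lut.step(d=0, w0=0, w1=0, a=0))  # read SRAM0
    print(lut.step(d=0, w0=0, w1=0, a=1))  # read SRAM1
```

Because the output depends on earlier writes, a single input vector is not enough to test the circuit, which is why a waveform over time is required.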

When the test setup in figure 2 is applied for testing a LUT1, it is not possible to test all possible input values as was the case for the inverter experiment in section V-A. The fact that a LUT consists of storage elements means that a waveform must be employed, testing the effect of input vectors over time. A functionality test waveform for LUT1 is shown in figure 9. Figure 9 shows at which times the input signals are asserted and also shows the expected output value at different times. For calculating frms, the output is measured at the following times: 300ns, 400ns, 650ns, 750ns, 1000ns, 1100ns, 1350ns and 1450ns.

Fig. 7. Defect tolerant inverter. Resistor sizing given as W/L. [Schematic between Vdd and Vss: pMOS transistors M1–M4 (w=50nm, l=30nm), nMOS transistors M5–M8 (w=30nm, l=30nm) and resistors R1–R8.]

Fig. 8. Traditional implementation of LUT1. [Schematic with inputs D, ~D, W0, W1 and A, and storage cells SRAM0 and SRAM1.]

The output of a simulation of the LUT1 for the test vectors in figure 9 is shown in figure 10 and it can be seen that the LUT1 is functioning as intended. There is a glitch in the output at 800ns which is the consequence of the SRAM cell delay. The delay is visible as a glitch due to changing the “A” input at the same time as “W0” is asserted.

VII. EVOLVED DEFECT TOLERANT LUT

While the LUT1 in section VI is highly defect tolerant, the transistor count is very high and is thus not meeting our second goal of area efficiency. As such, a second strategy towards a defect tolerant LUT is to directly evolve a LUT with redundancy. The hypothesis is that more efficient redundancy techniques can result if specialising towards the LUT functionality and letting evolution play with larger and more complex circuits.

A. Exploring Experiment

A LUT1, even though much less complex than the more common 4-input LUTs, still presents a significant challenge to evolution. As such, the successful setup from section V-A may not be suitable. An introductory and exploring experiment is, therefore, conducted where different experimental setups are tried. To keep complexity down on this introductory experiment, the only defects injected are gate-source shorts in nMOS transistors. The experiment in section V-A showed that gate-source short tolerance can lead to tolerance also to other possible defect types. Concentrating on nMOS cuts the number of evaluations in half. In addition, the functionality test in figure 9 is reduced to the part between 450ns and 1150ns.

It should be noted that the purpose of this experiment is not to conclude on the EA's ability to evolve circuits. Many more experiments are needed to draw any conclusion on such matters. Instead, the purpose is to generate at least one promising circuit to serve as a starting point for a refining experiment in section VII-B.

Fig. 9. Functionality test of LUT1. [Timing diagram of the inputs W0, W1, D and A and the expected output O over 0 to 1500 ns.]

Fig. 10. Output of MSO LUT1, according to test vectors in figure 9. Vertical dotted lines indicate where the output is measured. [Plot of output voltage [V] against time [ns], 0 to 1500 ns, annotated with the values 0 0 0 1 1 0 1 1.]

Eight different experimental setups are tried and a total of 20 different evolutionary runs are conducted. The experimental setups differ in three areas: seeding, elitism and test coverage. Seeding refers to how the initial population is created. The initial population is either completely random or seeded with five LUT1 circuits where nMOS transistors are made defect tolerant with the MSO technique from section V. The purpose of seeding is to optimise or enhance a given circuit, as opposed to starting evolving a circuit from just random individuals. Elitism is a feature of some GAs where the most fit individual is copied unaltered to the next generation. Elitism is either present or not. Elitism ensures that the fitness of the best individual in a population is never less than in the previous generation. Test coverage refers to how many of all possible single defects are tested during fitness evaluation. Complete test coverage provides the most accurate fitness estimation but is time consuming. Partial test coverage speeds up the evolutionary experiment. For runs with partial test coverage, the defects that are tested for are randomly selected.

For seeded experiments, the goal is to optimise the given redundant circuit and fitness function (5) is therefore applied. Non-seeded experiments start with a random initial population and, therefore, need fitness function (4) to provide sufficient information to evolve.

A GA is applied with population size 20, mutation rate 0.02, crossover rate 0.2 and tournament selection with group size 3 and selection probability 0.7. The maximum number of transistors is 400 and the maximum number of nets is 407. Feedback is allowed to give evolution the possibility of evolving static storage elements. All other details of the experimental setup are as for the experiment in section V-A.
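Tournament selection with a selection probability can be implemented in several ways. The sketch below shows one common reading, assumed here: a random group of three is ranked by fitness, the best is chosen with probability 0.7, otherwise the next best is considered with the same probability, and the last member is taken if all others are passed over. The function name and the toy population are hypothetical.

```python
import random


def tournament_select(population, fitness, rng, group_size=3, p_select=0.7):
    """Pick one individual by probabilistic tournament selection."""
    group = sorted(rng.sample(population, group_size), key=fitness, reverse=True)
    for candidate in group[:-1]:
        if rng.random() < p_select:
            return candidate
    return group[-1]  # fallback: weakest member of the group


if __name__ == "__main__":
    rng = random.Random(0)
    population = list(range(20))  # toy individuals, fitness = value
    picks = [tournament_select(population, lambda x: x, rng) for _ in range(10)]
    print(picks)
```

A selection probability below 1 lowers the selection pressure, so weaker individuals still have a chance of contributing to the next generation.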

Fig. 11. Evolved LUT1. [Schematic between Vdd and Vss with inputs D, ~D, W0, W1 and A; labelled transistors: M10 nmos w=30nm l=30nm; M5 pmos w=30nm l=776nm; M6 pmos w=93nm l=520nm.]

Each evolutionary run is stopped after one week. Characteristics of the best individual from each run are shown in table II. In addition, the characteristics of the circuit applied as seed for the seeded experiments are shown in the first row of table II. The numbers in table II are based on 100% test coverage, even for the individuals that were evolved with partial test coverage.

It is clear from table II that the seeded experiments with complete test coverage did not manage to improve the seed in any way. The resulting individual has the same characteristics as the seeding individual. Some of the seeded experiments with reduced test coverage removed some redundant transistors, with the effect of reducing Rtrad_single. This is probably due to the fact that reduced test coverage results in a noisy fitness. “Noisy fitness” means that if fitness is evaluated twice for the exact same individual, the fitness value may vary. In this case, the reason is that a removed transistor may reduce the circuit's ability to tolerate defects, but this is not necessarily detected by the fitness function as the circuit is not tested for all possible defects. One interesting result is run 9, where the reduction in transistor count is larger than the reduction in Rtrad_single.
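The effect of partial test coverage on the fitness estimate can be illustrated with a toy model. In the sketch below the set of tolerated defects is simply given rather than obtained from simulation, and the function name and numbers are hypothetical; the point is only that sampling 10% of the defects yields different estimates for the same circuit.

```python
import random


def estimated_tolerance(defects, tolerated, coverage, rng):
    """Fraction of tolerated defects among a random sample of all defects."""
    n_tests = max(1, round(coverage * len(defects)))
    sample = rng.sample(defects, n_tests)
    return sum(1 for d in sample if d in tolerated) / n_tests


if __name__ == "__main__":
    rng = random.Random(0)
    defects = list(range(100))   # 100 possible single defects
    tolerated = set(range(93))   # the circuit tolerates 93 of them
    print(estimated_tolerance(defects, tolerated, 1.0, rng))  # full coverage
    print([estimated_tolerance(defects, tolerated, 0.1, rng) for _ in range(5)])
```

With full coverage the estimate is exact, while repeated 10% samples scatter around the true value and often report complete tolerance, so the loss of a useful redundant transistor can go unnoticed.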

For the non-seeded experiments, evolution was unable to find any working circuits, except in one case (run 13) with elitism and complete test coverage.

B. Refining Experiment

The exploring experiment in section VII-A resulted in at least two interesting circuits. Run 13 resulted in a working circuit with complete tolerance to all single gate-source shorts in nMOS gates, yet consisting of only eight transistors. As the motivation for these experiments is a more area efficient redundant LUT than what was constructed in section VI, run 13 was selected for a refining experiment.

The same experimental setup was applied, except that the full functionality test in figure 9 was used and gate-source short defects were injected in both nMOS and pMOS transistors. The initial population for the refining experiment was seeded with the individual from run 13.

As explained in section IV-E, evolution was conducted in two phases. A parallel variant of the EHW simulator was run on 20 compute nodes on a cluster and each evolutionary phase was run for several days. Figure 11 shows the resulting evolved LUT1 and figure 12 shows a simulation for the test vectors in figure 9.

TABLE II
RESULTS FROM THE EXPLORING LUT1 EXPERIMENT

Run               Seed  Elitism  Test Cov.  f         Rtrad_single  frms      frms      Size  Works
(trad. designed)                            0.998930  1.000000      0.996812  0.997838  165   Yes
1                 Yes   No       100%       0.998930  1.000000      0.996812  0.997838  165   Yes
2                 Yes   No       100%       0.993349  1.000000      0.996812  0.997838  165   Yes
3                 Yes   Yes      100%       0.998930  1.000000      0.996812  0.997838  165   Yes
4                 Yes   Yes      100%       0.998930  1.000000      0.996812  0.997838  165   Yes
5                 Yes   Yes      10%        0.998930  1.000000      0.996812  0.997838  165   Yes
6                 Yes   Yes      10%        0.988609  0.960396      0.986770  0.997712  158   Yes
7                 Yes   Yes      10%        0.998930  1.000000      0.996812  0.997838  165   Yes
8                 Yes   No       10%        0.998930  1.000000      0.996812  0.997838  165   Yes
9                 Yes   No       10%        0.976675  0.934783      0.980497  0.997325  143   Yes
10                Yes   No       10%        0.998930  1.000000      0.996812  0.997838  165   Yes
11                No    No       100%       0.443721  1.000000      0.609302  0.609302  4     No
12                No    No       100%       0.425508  0.000000      0.551135  0.596305  3     No
13                No    Yes      100%       0.871550  1.000000      0.673611  0.684732  8     Yes
14                No    Yes      100%       0.415692  0.000000      0.502494  0.585967  4     No
15                No    Yes      10%        0.424682  0.000000      0.537003  0.616108  4     No
16                No    Yes      10%        0.424385  0.000000      0.552871  0.588953  3     No
17                No    Yes      10%        0.419506  0.000000      0.534065  0.593164  4     No
18                No    No       10%        0.433093  0.000000      0.585882  0.609286  4     No
19                No    No       10%        0.421290  0.000000      0.503640  0.612811  4     No
20                No    No       10%        0.427660  0.000000      0.518157  0.640041  5     No

Fig. 12. Output of evolved LUT1, according to test vectors in figure 9. Vertical dotted lines indicate where the output is measured. [Plot of output voltage [V] against time [ns], 0 to 1500 ns, annotated with the values 0 0 0 1 1 0 1 1.]

VIII. DISCUSSION

To evaluate the two LUT1 implementations in sections VI (MSO) and VII (Evolved), two other LUT1 implementations have been constructed and simulated. The first is a traditional non-redundant LUT1 (Non-red.), as shown in figure 8. The second is a TMR implementation (TMR) where the non-redundant LUT1 is triplicated and a mirrored adder [35] applied as the voter. The mirrored adder has the very useful property of being tolerant to single stuck-closed and stuck-open defects when applied as a TMR voter, as long as there are no defective modules.

Characteristics for all four LUT1 implementations are given in table III for comparison. Rtrad_trans is estimated for a transistor reliability of 0.99 and is based on standard Monte Carlo simulations with 10000 tests. Delay is the time the output needs for stabilising after the address input “A” changes. Delay measurements are based on the slowest transition for the test in figure 9 with no injected defects. Although not entirely accurate, these delay numbers provide an estimate for discussions. Timing requirements for writing to the LUT are not considered, based on the assumption that configuration of the LUT is rarely on the critical path for an FPGA application and, therefore, less important. The numbers in table III are based on simulations where the full range of defect types are injected: gate-source short, gate-drain short, stuck-open and stuck-closed.

TABLE III
COMPARISON OF LUT1 IMPLEMENTATIONS

Property      Non-red.   TMR        MSO         Evolved
size          22 trans.  76 trans.  264 trans.  14 trans.
frms          0.998717   0.999346   0.997665    0.733561
frms          0.779882   0.961079   0.994218    0.643702
Rtrad_single  0.102273   0.858553   1.000000    0.446429
Rtrad_trans   0.820000   0.882200   0.986700    0.921900
delay         < 1ns      < 1ns      < 1ns       ≈15ns
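A Monte Carlo estimate of Rtrad_trans of the kind mentioned above can be sketched as follows. The sketch assumes independent transistor defects and replaces the SPICE simulation with a caller-supplied predicate; `circuit_works` and the other names are hypothetical stand-ins, and the choice of defect type per transistor is left out for brevity.

```python
import random


def monte_carlo_reliability(n_transistors, p_transistor, circuit_works,
                            n_tests=10000, rng=None):
    """Estimate the probability that a circuit works under random defects."""
    rng = rng or random.Random()
    passed = 0
    for _ in range(n_tests):
        # Each transistor is defective with probability 1 - p_transistor.
        defective = frozenset(i for i in range(n_transistors)
                              if rng.random() > p_transistor)
        if circuit_works(defective):
            passed += 1
    return passed / n_tests


if __name__ == "__main__":
    # Toy circuit without any redundancy: it works only if nothing is defective.
    estimate = monte_carlo_reliability(22, 0.99, lambda defective: not defective,
                                       rng=random.Random(0))
    print(f"{estimate:.3f}  (analytic value {0.99 ** 22:.3f})")
```

For a circuit without redundancy the estimate should approach 0.99 raised to the number of transistors, which provides a simple sanity check of the sampling loop.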

When considering Rtrad_single, the MSO LUT1 is the most reliable and tolerates all possible single defective transistors (Rtrad_single = 1). The TMR LUT1 comes in second and tolerates 86% of all single defects. The reason TMR does not tolerate all possible single defects is a lack of tolerance to gate shorts. Some gate shorts in the TMR modules either pull up or pull down one of the module inputs so much that the other modules also fail. The voter also fails for some gate shorts.

For Rtrad_trans, the MSO LUT1 is again the most reliable, estimated to 0.99. This means that given a transistor reliability of 0.99, the probability that the MSO LUT1 is working 100% is 0.99. TMR with 0.88 is a considerable improvement over the non-redundant LUT1 with 0.82. The evolved LUT1 is even better than TMR and is estimated to 0.92. The reason for the high Rtrad_trans of the evolved solution, despite the lower number of tolerated single defects, is the small size. The evolved LUT1 consists of only 14 transistors, resulting in a lower probability of having one or more defective transistors. 14 transistors is even fewer than the standard non-redundant LUT1, which is interesting, considering the higher number of tolerated single defects. The MSO LUT1 is by far the largest implementation in number of transistors, with 3.5 times the number of transistors of the TMR LUT1.
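The size argument can be made concrete with a short calculation. Assuming that transistor defects are independent (an assumption made here for illustration), the probability $P_0(n)$ that a circuit with $n$ transistors contains no defective transistor at a transistor reliability of 0.99 is:

```latex
\begin{align*}
P_0(n)   &= 0.99^{\,n} \\
P_0(14)  &\approx 0.869 && \text{(Evolved)} \\
P_0(22)  &\approx 0.802 && \text{(Non-red.)} \\
P_0(76)  &\approx 0.466 && \text{(TMR)} \\
P_0(264) &\approx 0.070 && \text{(MSO)}
\end{align*}
```

Under this assumption the evolved LUT1 starts from the highest defect-free probability of the four implementations, whereas the MSO LUT1 depends almost entirely on its redundancy to reach the Rtrad_trans value in table III.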

Although the evolved solution is both small and scores well on the reliability metrics, there are several disadvantages that separate it from the three other LUT1 implementations. As evident from the simulation in figure 12, the evolved solution relies on some form of dynamic storage that is discharged by the output load. Some form of refresh is, therefore, needed. The output voltage swing is also very low, shown as a low value for frms in table III, and a restoring gate must, therefore, be present at the output. In addition, the evolved solution has a high delay.

This paper has concentrated on the 1-input LUT. Most FPGA LUTs today have from four to six inputs. The technique applied when constructing the LUT in section VI can easily be applied to any LUT implementation with more inputs. A multiple input LUT is, however, far too complex to be evolved and must therefore be constructed from several evolved 1-input LUTs and a defect tolerant multiplexer.

Two factors limit the accuracy of the results in table III. All simulations are based on the functionality test in figure 9 where only a limited number of test cases are present. It is possible that a faulty LUT may be classified as working. In addition, the results are valid for the four types of transistor defects considered in this paper. If other defects are considered, such as shorts in the interconnect between two transistors, the results could change.

IX. CONCLUSION

This paper has looked at how artificial evolution may be employed as a tool for creating defect tolerant look-up tables for FPGAs. Two strategies have been followed. The first strategy employed the Multiple Short-Open (MSO) technique and described how such a technique can be achieved through a process involving artificial evolution. The second strategy was to evolve a defect tolerant LUT directly.

Both the MSO LUT1 and the evolved LUT1 have advantages and disadvantages. The evolved solution is very small, yet still exhibits some tolerance to defects. As such, the evolved solution is an interesting example for further research. However, high delay, low output voltage swing and the fact that the LUT relies on dynamic storage make the evolved solution unrealistic in real FPGAs without further improvements. The MSO LUT1 has the advantage of high output voltage swing and low delay, and it tolerates all single transistor defects of the four types this paper has concentrated on. An extremely high area requirement is the main disadvantage, which motivates further research on the approach of directly evolving defect tolerant LUTs.

The purpose of this paper was not only to create defect tolerant LUTs but also to investigate the task of applying artificial evolution as a tool for achieving defect tolerance. As such, the two LUTs in this paper represent two different approaches where artificial evolution at some level plays a role in achieving defect tolerance.

REFERENCES


[2] M. Mishra and S. C. Goldstein, Nano, Quantum and Molecular Computing, Implications to High Level Design and Validation. Kluwer Academic Publishers, 2004, ch. 3: Defect Tolerance at the End of the Roadmap.

[3] I. Koren and Z. Koren, “Defect tolerance in VLSI circuits: Techniques and yield analysis,” Proceedings of the IEEE, vol. 86, no. 9, pp. 1819–1837, Sept. 1998.

[4] A. Djupdal and P. C. Haddow, “Defect tolerance inspired by artificial evolution,” submitted to ISVLSI, 2008.

[5] R. E. Lyons and W. Vanderkulk, “The use of triple-modular redundancy to improve computer reliability,” IBM Journal, pp. 200–209, Apr. 1962.

[6] W. Pierce, Failure-Tolerant Computer Design. Academic Press, 1965.

[7] E. F. Moore and C. E. Shannon, “Reliable circuits using less reliable relays,” J. Franklin Inst., pp. 191–208, 291–297, 1956.

[8] C. Bolchini, G. Buonanno, D. Sciuto, and R. Stefanelli, “Static redundancy techniques for CMOS gates,” in IEEE International Symposium on Circuits and Systems (ISCAS), 1996, pp. 576–579.

[9] C. Bolchini, G. Buonanno, D. Sciuto, and R. Stefanelli, “A CMOS fault tolerant architecture for switch-level faults,” in IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT), 1994, pp. 10–18.

[10] A. Djupdal and P. C. Haddow, “Evolving efficient redundancy by exploiting the analogue nature of CMOS transistors,” in International Conference on Computational Intelligence, Robotics and Autonomous Systems (CIRAS), 2007, pp. 81–86.

[11] A. Djupdal and P. C. Haddow, “Defect tolerant ganged CMOS minority gate,” in NORCHIP, 2007.

[12] C. Bolchini, G. Buonanno, D. Sciuto, and R. Stefanelli, “An improved fault tolerant architecture at CMOS level,” in IEEE International Symposium on Circuits and Systems (ISCAS), 1997, pp. 2737–2740.

[13] F. Hatori, T. Sakurai, K. Nogami, K. Sawada, M. Takahashi, M. Ichida, M. Uchida, I. Yoshii, Y. Kawahara, T. Hibi, Y. Saeki, H. Muraoga, A. Tanaka, and K. Kanzaki, “Introducing redundancy in field programmable gate arrays,” in Proc. IEEE Custom Integrated Circuits Conference, 1993, pp. 7.1.1–7.1.4.

[14] Altera, “Apex redundancy,” http://www.altera.com/products/devices/apex/features/apx-redundancy.html.

[15] A. J. KleinOsowski and D. J. Lilja, “The NanoBox project: Exploring fabrics of self-correcting logic blocks for high defect rate molecular device technologies,” in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2004, pp. 1–10.

[16] C. R. Saha, S. J. Bellis, A. Mathewson, and E. M. Popovici, “Performance enhancement defect tolerance in the cell matrix architecture,” in International Conference on Microelectronics, 2004, pp. 777–780.

[17] A. Doumar and H. Ito, “Design of switching blocks tolerating defects/faults in FPGA interconnection resources,” in IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT), 2000, pp. 134–142.

[18] A. Djupdal and P. C. Haddow, “Yield enhancing defect tolerance techniques for FPGAs,” in Military and Aerospace PLD International Conference (MAPLD), 2006, paper ID 203.

[19] A. E. Eiben and J. E. Smith, Introduction to Evolutionary Computing. Springer, 2003.

[20] T. Higuchi, T. Niwa, T. Tanaka, H. Iba, H. de Garis, and T. Furuya, “Evolving hardware with genetic learning: a first step towards building a darwin machine,” in From Animals to Animats: Simulation of Adaptive Behavior, 1993, pp. 417–424.

[21] J. H. Holland, Adaptation in Natural and Artificial Systems. MIT Press, 1992.

[22] A. Thompson, “An evolved circuit, intrinsic in silicon, entwined with physics,” in International Conference on Evolvable Systems (ICES), 1996, pp. 390–405.

[23] D. Keymeulen, R. S. Zebulum, Y. Jin, and A. Stoica, “Fault-tolerant evolvable hardware using field-programmable transistor arrays,” IEEE Transactions on Reliability, vol. 49, no. 3, pp. 305–316, 2000.

[24] K. Zhang, R. F. DeMara, and C. A. Sharma, “Consensus-based evaluation for fault isolation and on-line evolutionary regeneration,” in International Conference on Evolvable Systems (ICES), 2005, pp. 12–24.

[25] A. Thompson, “Evolving fault tolerant systems,” in Genetic Algorithms in Engineering Systems: Innovations and Applications (GALESIA), 1995, pp. 524–529.

[26] A. Thompson, “Evolving inherently fault-tolerant systems,” Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, vol. 211, no. 5, pp. 365–371, 1997.

[27] R. O. Canham and A. M. Tyrrell, “Evolved fault tolerance in evolvable hardware,” in Congress on Evolutionary Computation (CEC), 2002, pp. 1267–1271.

[28] M. Hartmann and P. C. Haddow, “Evolution of fault-tolerant and noise-robust digital designs,” IEE Proceedings - Computers and Digital Techniques, vol. 151, no. 4, pp. 287–294, July 2004.

[29] A. Djupdal and P. C. Haddow, “Evolving and analysing “useful” redundant logic,” in International Conference on Evolvable Systems (ICES), 2007, pp. 256–267.

[30] P. J. Layzell and A. Thompson, “Understanding inherent qualities of evolved circuits: Evolutionary history as a predictor of fault tolerance,” in International Conference on Evolvable Systems (ICES), 2000, pp. 133–144.

[31] A. Djupdal and P. C. Haddow, “Evolving redundant structures for reliable circuits – lessons learned,” in Adaptive Hardware and Systems, 2007, pp. 455–462.

[32] J. F. Miller and P. Thomson, “Cartesian genetic programming,” in Genetic Programming, Proc. EuroGP, 2000, pp. 121–132.

[33] GEDA, “Ngspice homepage,” http://ngspice.sourceforge.net/, 2007.

[34] W. Zhao and Y. Cao, “New generation of predictive technology model for sub-45nm design exploration,” in International Symposium on Quality Electronic Design (ISQED), 2006, pp. 585–590.

[35] D. Hampel, K. J. Prost, and N. R. Scheinberg, “Threshold logic using complementary MOS device,” June 1974, U.S. Patent 3 900 742.
