
A Framework for Analyzing Architecture-Level

Fault Tolerance Behavior in Applications

by

Harshad Sane

B.S., University of Pune, India 2004

A thesis submitted to the

Faculty of the Graduate School of the

University of Colorado in partial fulfillment

of the requirements for the degree of

Master of Science

Department of Electrical and Computer Engineering

2008


This thesis entitled: A Framework for Analyzing Architecture-Level Fault Tolerance Behavior in Applications, written by Harshad Sane,

has been approved for the Department of Electrical and Computer Engineering

Professor Daniel A. Connors

Professor Manish Vachharajani

Professor Li Shang

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above-mentioned discipline.



Sane, Harshad (M.S., Computer Engineering)

A Framework for Analyzing Architecture-Level Fault Tolerance Behavior in Applications

Thesis directed by Professor Daniel A. Connors, Ph.D.

Radiation-induced transient faults, also known as single-event upsets, are a major concern in the design and operation of modern computer systems. Transient errors first impact the circuit and logic levels of a processor, and may propagate to the micro-architecture and architecture state of a processor. When undetected, transient errors in architecture state can lead to incorrect and undefined application behavior.

Detailed simulation is a vital component of the design process of modern processors and the exploration of new design concepts. However, high-level architectural simulators typically run 100-1000x slower than native execution, making detailed simulation of most programs prohibitively long. Due to these deficiencies in simulation technology, current architecture-level transient fault studies are primarily based on sampling techniques. Current program fault behavior analysis uses architecture-level injection of random bits selected over a time line. Such injection methods allow only a limited number of injections per unit time, a high percentage of which may not expose the true fault susceptibility of a program.

This thesis introduces an accurate and fast fault-injection framework for studying inherent code execution properties of a program that correlate to different levels of fault susceptibility. The framework utilizes a three-step approach consisting of trace logging, fault injection, and replay execution to emulate the effect of a transient fault in the architectural registers of a running application. Finally, by correlating the injection analysis results to dependence graph patterns across sets of applications, an analysis methodology is constructed to accurately predict the fault tolerance of an application without performing any error injections.


Dedication

I dedicate this thesis to my family for their unconditional support.



Acknowledgements

First, I would like to thank my adviser, Professor Dan Connors, who guided me through my academic endeavor at CU. I would like to thank him for all the knowledge and encouragement he bestowed upon me. This work has been possible because of his guidance and prompt feedback.

I would like to thank all the members of the DRACO research group for their insight and ideas.

I would like to thank my friends, who made sure that I enjoyed life along with my career. Most importantly, I would like to thank my family, who have supported me unconditionally throughout my life.



Contents

Chapter

1 Introduction
  1.1 Contributions

2 Background
  2.1 Single Event Phenomenon
  2.2 Architecture-level Fault Analysis

3 Motivation
  3.1 Overview of current injection methodologies
  3.2 Fault penetration and point of injection
  3.3 Natural fault resilience and ineffective injections

4 Fault Emulation Framework and Results
  4.1 Overview and Goals
  4.2 The TEFIS Framework
    4.2.1 Execution Tracing
    4.2.2 Fault Generation
    4.2.3 Fault Emulation
  4.3 Trace Emulation Experimental Results and Analysis
    4.3.1 Fault tolerance categories
    4.3.2 Accuracy of emulations
    4.3.3 Average Execution Time of Fault Emulations Against Full Injections

5 Analysis of Fault Tolerance Program Behavior
  5.1 Source Code Analysis
    5.1.1 Algorithm Level Effects
    5.1.2 Dynamic Source Code Behavior
  5.2 Dynamic Program Trace Behavior
    5.2.1 Inter-Procedural Fault Tolerance
  5.3 Dynamic Dependence Graph Representation
  5.4 Estimating Program Fault Tolerance

6 Future Work

7 Conclusion

Bibliography



Tables

Table

3.1 References.
3.2 Fault tolerance characteristics of logical operations.



Figures

Figure

2.1 A neutron strike.
2.2 Feature size vs. soft error rate.
3.1 Current random injection methodologies.
3.2 Fault penetration.
3.3 Fault injection distribution in time.
3.4 A highly fault tolerant case from 186.crafty with 80.6% correct results of emulation.
4.1 Fractional emulation captures the effect of fractional execution with traces against full injections in the entire application. The figure shows the exclusion of build-up and monitoring time.
4.2 Context dump.
4.3 Framework overview.
4.4 Fault emulation.
4.5 Result categories.
4.6 Emulation accuracy.
4.7 A comparison of execution time.
5.1 Fault tolerance analysis of sorting routines (a) Heapsort (b) Quicksort.
5.2 A trace from 164.gzip with 91.4% incorrect results of emulation.
5.3 A trace from 300.twolf with 100% segmentation fault results of emulation.
5.4 Fault tolerance correlation with program counter similarity.
5.5 Fault susceptibility correlation calculated using similar program code points.
5.6 Inter-procedural fault tolerance.
5.7 Example: Dependency graph.
5.8 Dependence graph similarity correlation.
5.9 Dependence graph similarity correlation - All benchmarks.
5.10 Fault tolerance prediction.


Chapter 1

Introduction

Scaling trends in technology that lead to faster and smaller transistors, lower voltage levels, and reduced noise margins also increase the susceptibility of a circuit to transient faults. Shielding hardware from radiation, cosmic rays, and crosstalk is difficult from a high-speed design perspective and is costly in terms of active power consumption. Transient errors first impact the circuit and logic levels of a processor, and can propagate to the micro-architecture and architecture state. Architecture state errors lead to invalid and unpredictable software behavior. As many program phases are tolerant of some single-bit architecture errors, software shielding becomes an attractive solution due to its low cost and flexibility. Many software-based fault-tolerant techniques [24, 23, 10] have been proposed to balance performance with error detection and recovery.

Current studies involving the observation of transient fault effects utilize a limited number of random fault injections in which micro-architecture (pipeline registers, intermediate logic) or architecture (register, memory) state is modified during simulated execution. As these error injection campaigns involve emulating machine execution, each injection run includes substantial experimental time leading up to the point of injection, followed by the remaining execution of the application to determine program correctness. Collectively, the excessive time for simulated injections limits the points of program execution that can be studied, thus reducing the ability to understand the fault tolerance behavior of applications.

There are a number of ways to improve the accuracy and execution time of fault analysis over traditional fault injection systems. First, as applications are characterized by repeating phases [5], there are opportunities to reduce the number of fault injections by studying representative phases of execution. Furthermore, random injections often do not expose relevant program behavior, as significant portions of program execution involve dead code [6] and value locality [16]. There is substantial potential to model the fault behavior of code sequences by correlating the results of fault injections to dynamic code regions. In this way, a fault analysis modeling framework can be constructed to anticipate the fault susceptibility of an application based on the execution profile of code regions that do not require fault injections.

1.1 Contributions

This thesis presents an experimental study of current architecture-level injection techniques for evaluating fault tolerance. Based on this study, the first half of the thesis motivates the need for and the design of a new fault injection methodology. The framework is based on logging traces and performing all possible architecture faults within a trace through an injection emulation system (TEFIS). An analysis of how this approach improves the accuracy and timeliness of the system is examined. The second half focuses on correlating the properties of a program with its inherent fault tolerance characteristics. Based on this correlation, an experimental model is constructed to predict the fault tolerance of a program without any injections. This model provides a baseline for making fault tolerance predictions of an application. The following contributions are made in this thesis:



(1) Development of a fast and accurate fault modeling framework: A new methodology for evaluating the transient fault tolerance of program regions is presented. The framework deploys replay execution of selected program traces that provides the same results as full-scale fault injection in a fraction of the experimental evaluation time.

(2) Demonstration of correlation between fault tolerance and program structure: The fundamental code property of the dynamic dependence graph of architecture state is analyzed to expose patterns exhibiting various levels of fault susceptibility. Case studies are examined to reveal relations between source code structures and fault tolerance behavior.

(3) Fault tolerance prediction based on studied program behavior: The fault tolerance of an application is estimated by constructing an analytical model that assigns fault tolerance estimates to code regions.

The following sections elaborate on each of these contributions. The thesis concludes by proposing possible enhancements to the framework and the scope for future work with this system.


Chapter 2

Background

2.1 Single Event Phenomenon

Radiation effects on processors are a major concern for architects as transistor features shrink. Among these effects, bit flips resulting from ionization by neutron strikes from cosmic rays and by alpha particles are considered a critical source of upsets owing to their random occurrence. These effects, called Single Event Upsets (SEUs), constitute a serious threat to the reliability of digital equipment built on advanced circuits. Single event phenomena can be classified into three effects:

I Single event upset (soft error)

II Single event latch up (soft or hard error)

III Single event burnout (hard failure)

A single event upset is defined by NASA as "radiation induced errors in microelectronic circuits when charged particles lose energy by ionizing the medium through which they pass, leaving behind a wake of electron-hole pairs" [2]. These electron-hole pairs generate charge as they combine, and if this charge is greater than the critical charge of the device, it results in a change of state. Transient faults fall under this category of SEUs. Figure 2.1 shows an example of a neutron strike on a transistor.



Figure 2.1: A neutron strike.

Transient faults are emerging as a major concern for architects designing reliable computer systems [3, 17]. Trends in silicon process technology paint a bleak picture of future fault susceptibility. While the error rate of a single transistor is expected to stay relatively constant [11, 13], the number of transistors per chip continues to increase at an exponential rate. As a result, the overall error rate for processors is expected to increase dramatically, making fault tolerance as important a design characteristic as performance, power consumption, and heat dissipation. Figure 2.2 shows the effect of feature size reduction in the years to come. The degradation rate is about 8% per bit per generation [4] and follows the curve shown in the figure.

Hardware designs can be customized for fault-tolerant execution with redundant resources such as latches or extended pipelines. Providing fault tolerance may require the addition of hundreds of thousands of delay latches and 20-30% more logic to an existing processor [28]. Other more specialized approaches create even more sophisticated systems requiring both hardware and software integration [1, 31]. While these approaches work well in their specific scientific computing domain, the general purpose design field must adapt to the need for fault tolerance in fundamentally different ways. As design cycle time is critical, many chip designers propose implementing redundancy-based fault tolerance using existing multi-core and multi-threaded processor extensions [9, 18]. The driving motivation is to extend the engineering investment in multi-context processors to provide fault tolerance.

Figure 2.2: Feature size vs. soft error rate.

2.2 Architecture-level Fault Analysis

Most recent architecture research focuses on using performance models to provide Architecture Vulnerability Factor (AVF) estimates of processor reliability rather than deploying detailed fault injection into hardware RTL models. AVF is defined as the probability that a fault in a particular structure will result in an error in the final output of a program [19]. A structure's error rate is the product of its raw error rate, as determined by process and circuit technology, and its AVF. Processor designers can use AVF analysis to determine the processor structures in probabilistic need of protection (e.g., structures with high AVF are likely to be protected). Some structures, such as the branch predictor, have no effect on whether an error will propagate to the output of the program. Other structures are at the opposite end of the spectrum, such as the instruction issue window, load-store queue, and re-order buffer. The majority of hardware structures fall between the two extremes.
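As a brief worked illustration of this relation (the numbers are hypothetical and not taken from the thesis): a structure with a raw error rate of 1000 FIT and an AVF of 30% contributes an effective error rate of

\[ \text{SER}_{\text{effective}} = \text{SER}_{\text{raw}} \times \text{AVF} = 1000\ \text{FIT} \times 0.30 = 300\ \text{FIT}. \]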

While AVF analysis provides support for investigating new fault-tolerant architecture techniques, program execution characteristics are largely missing from the determination of periods of software error susceptibility. A software-centric view makes this key insight: although faults occur at the hardware level, the only faults which matter are the faults which affect software correctness. By changing the boundary of output comparison to software, a software-centric model shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, only faults which affect correctness are detected; benign faults are safely ignored. A software-centric system with both detection and recovery will not need to invoke the recovery mechanism for faults which do not affect correctness. The primary problem with AVF is that software periods of vulnerability substantially differ from micro-architecture periods of vulnerability. As research trends point toward selectively enabling transient fault tolerance mechanisms, run-time and off-line experimental techniques must be guided equally by program behavior and by hardware. As such, it is important to determine and predict when program susceptibility and hardware susceptibility differ.


Chapter 3

Motivation

3.1 Overview of current injection methodologies

A major problem in the development of fault-tolerant systems is the accurate determination of the dependability properties of the system. Unlike performance, which can be evaluated through the use of benchmark programs, the degree of fault tolerance and reliability of a system cannot be evaluated in such a manner, because we rarely have the luxury of allowing systems to run for a very long time to observe their behavior under fault effects. The generally preferred solution to this problem is to inject the effects of faults into a simulation model or a prototype implementation, and to observe the behavior of the system under the injected faults. Fault injection in a simulation is very flexible but far too time consuming. On the other hand, it is much more difficult to inject accurate (i.e., realistic) faults into a prototype, though their effect is readily observable.

Figure 3.1 shows the current methods of fault injection, where a random injection procedure has been adopted. The graph shows the number of injections per benchmark used in recent papers that adopted a random fault injection methodology. Table 3.1 lists the references for each of the published articles numbered in Figure 3.1.



Figure 3.1: Current random injection methodologies.

Number   Short Description                                                              Reference
(1)      Soft-error detection through software TFT techniques (1999)                    [22]
(2)      Y-Branches (2003)                                                              [29]
(3)      Characterizing transient fault effects on the processor pipeline (2004)        [30]
(4)      Configurable transient fault detection via dynamic binary translation (2006)   [24]
(5)      SymPLFIED (2008)                                                               [20]
(6)      Using PLR to exploit multiple cores for transient fault tolerance (2007)       [27]

Table 3.1: References

A recent paper focusing on instruction-level error derating adopted an interval injection methodology [7]. The injection campaign includes 100 uniformly distributed points of injection in trace lengths of 100 instructions, in 32-bit as well as 64-bit registers, resulting in up to 224 injections per instruction. This method does provide a representative set of experiments without having to simulate the entire benchmark.

3.2 Fault penetration and point of injection

Fault injections can be performed at the geometric layout, circuit, gate, or block level of a design model. The block-level model is a functional view defining the data and control paths of the application. A fault can pass through several levels of masking, such as electrical, latching-window, and logical masking, before it can affect the behavior of an application [26]. From a user's point of view, it only matters whether a transient fault causes undesirable effects in the application. Hence this study can utilize fault injections at the block level as long as they emulate the same effect as the propagation of a hardware fault. The process of injecting faults into the architectural registers captures this notion, as shown in Figure 3.2, although logical masking would still persist between the architectural and application layers.

Figure 3.2: Fault penetration.

Current injection techniques corrupt a single bit at 1000 random execution points [25, 27]. As discussed previously, such tests do not regard program behavior and have substantial variation. Figure 3.3 demonstrates the cumulative time for 1000 random fault injections. The injection campaign times are sorted from the longest-running injections to the shortest. Some of the runs take only a matter of seconds, while others take several minutes to complete before it can be determined whether the program's behavior changed. Based on Figure 3.3, performing statistically significant fault injection using random or interval-based schemes would require substantial computational effort. Clearly, the use of fault injections into program state must be strategically guided to gather fault outcomes for only certain regions of interest.



Figure 3.3: Fault injection distribution in time.

3.3 Natural fault resilience and ineffective injections

As stated above, though electrical and latching-window masking is covered by injecting faults at the architectural level, logical masking of faults is overlooked in random injection methods. Examples of logical masking effects include logical operations, conditional operations, overwriting faulted values before use, binary return values, dynamically dead instructions [6], and silent stores [15]. This is why a large number of random injections do not expose the fault susceptibility of a program.

For example, Table 3.2 shows the probable levels of tolerance for logical operations at the bit level. The table compares the logical operation of a register either with itself or with another operand. The values presented in this table follow directly from the truth tables of the operations. From the table it can be seen that an XOR operation of a register with itself has 100% fault tolerance, since it simply clears the register; any bit perturbation before this operation would have no effect whatsoever.
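To make this masking behavior concrete, the short C++ sketch below (an illustration written for this discussion, not part of the TEFIS tool chain) flips each bit of a register value in turn and applies the two operand patterns from Table 3.2. The XOR of the faulted value with itself always equals the XOR of the original value with itself, so every flip is masked, while an AND with another operand exposes the flip only where the operand bit is one.

    #include <cstdint>
    #include <iostream>

    int main() {
        const uint32_t reg     = 0x12345678u;  // original register value
        const uint32_t operand = 0x0000FFFFu;  // second operand for the AND case

        int xorMismatches = 0, andMismatches = 0;
        for (int bit = 0; bit < 32; ++bit) {
            const uint32_t faulted = reg ^ (1u << bit);  // single-bit upset

            // XOR of a register with itself clears it, so the fault is always masked.
            xorMismatches += ((faulted ^ faulted) != (reg ^ reg));

            // AND with another operand masks the fault only where the operand bit is 0.
            andMismatches += ((faulted & operand) != (reg & operand));
        }
        std::cout << "XOR with self: " << xorMismatches << "/32 flips visible\n"      // always 0
                  << "AND with operand: " << andMismatches << "/32 flips visible\n";  // 16 for this operand
        return 0;
    }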

Operation              Other Operand   Itself
AND, OR, NAND, NOR     50%             0%
XOR, XNOR              0%              100%
NOT                    0%              -

Table 3.2: Fault tolerance characteristics of logical operations.

Figure 3.4 is an example of logical masking obtained from a trace of 186.crafty. The figure shows the source code of a function from the benchmark along with its dynamic block execution. It can be seen from the source code that all this function does is return a binary value (true or false) based on the movement of a white or black piece in a game of chess. The analysis shows that with such a dependence flow, the effect of flipping bits would hardly impact the return values.

Random methods of fault injection impose a limitation on the number of injections per unit time, and, as the above example shows, a high percentage of those injections may fall on code with low susceptibility. These limitations motivate the development of a system that is not only orders of magnitude faster but also as accurate as full injections. It should also be able to capture the properties of the program that affect its fault susceptibility.

TEFIS (Trace Emulation Fault Injecting System) is a technique for emulating hardware faults in software. It uses PIN, a binary instrumentation tool, for tracing and emulation, which form the two major portions of the framework. Traces are captured along with their state information from a running executable based on various user-defined parameters. This is followed by a rigorous fault injection procedure and replay execution of the faulted trace, but only up to the length of the trace. This approach provides the desired flexibility and, at the same time, allows execution of many experimental runs in a relatively short period of time.



(a) Source code (b) Dynamic control flow

Figure 3.4: A highly fault tolerant case from 186.crafty with 80.6% correct results of emulation.


Chapter 4

Fault Emulation Framework and Results

4.1 Overview and Goals

Software-implemented fault injection methodologies can be broadly classified into compile-time and run-time injections. A run-time fault injection system most truly emulates the effect of a real transient fault occurrence. Current methods of run-time software fault injection are either time based or interrupt driven. These are program-level fault injection techniques which are generally unguided: faults are injected randomly, at intervals, or within phases of the application. Phase-based injection exploits some of the properties of an application. In any case, these methods suffer from the following drawbacks:

• Experimental time for a single injection: This time consists of two parts, the build-up time and the monitoring time. Build-up time is the time required for full execution up until the fault point, while monitoring time is what follows the injection until an outcome is reached.

• Multiplied impact of the single-injection overhead: The study of the complete application is limited by the single-injection overhead multiplied over the number of injections applied to the application. Evaluating the fault tolerance of the application requires a statistically significant number of injections, which consumes a lot of time and limits the application's exposure to a smaller number of faults.

• Inaccuracy in determining the dependability properties of a system: Due to the combined effect of the above two points, a limited number of injections provides low confidence in understanding the dependability behavior of the system under all kinds of faults.

Keeping the above limitations in mind, there is a need for a system which overcomes the requirement of running a full execution of the application for each injection, and which also exposes the dependability properties of a system that affect its fault susceptibility. The trace emulation fault injecting system (TEFIS) has been developed with the following goals in mind:

(1) Accuracy in fault tolerance with fractional execution: The framework employs fractional traces for execution within the binary. Traces of any length can be obtained at any time point within the application. Accuracy is determined by comparing results of this fractionally injected execution to those of full injections. Figure 4.1 shows a comparison between full injection and fractional emulation on trace lengths of 100 and 200 instructions, where the dark regions show the fractional length of execution.

(2) Very high speed injections and summary generation: The length of a single run extends from the point of injection to the end of the trace. This eliminates the build-up time before the point of injection and the monitoring time after the end of the trace, limiting the execution time to only the number of instructions per trace. This method not only speeds up result collection but also provides room for a large number of emulations.

(3) Expose deeper understanding of program behavior: Program properties are studied based on their possible impact on the fault susceptibility of the application. Two methods of correlation are described and used in this analysis: code region similarity and data dependence similarity.

Figure 4.1: Fractional emulation captures the effect of fractional execution with traces against full injections in the entire application. The figure shows the exclusion of build-up and monitoring time.

4.2 The TEFIS Framework

The framework has been designed to have functionally independent units. Each of these units is presented in the order in which it is executed within the framework. The three units of this procedure are execution tracing, fault generation, and fault emulation.

The tracing and emulation system uses PIN [21], a binary instrumentation tool developed at Intel. PIN is used in this framework for generating and loading context information at the instruction level.



4.2.1 Execution Tracing

Execution tracing is a one-time procedure involving the generation of traces from the binary with a number of controllable parameters. The framework provides operational flexibility in tracing with the following options:

• Length - Traces of any length can be generated

• Time

  ∗ Any point in time

  ∗ Interval - uniform intervals with uniform lengths

• Phase - Integration with SimPoint [5]

• Function-specific tracing

This system uses a tool called ExecutionTracer for this purpose. The ExecutionTracer is a PIN [21] tool capable of dumping snippets of the binary at run time, with predefined knobs for the user to control the length and starting point of the trace. Once these parameters are passed in, every instruction is instrumented before its execution. This instrumentation gathers the following information:

• Trace context: Disassembly with the instruction pointer, which gives control flow information

• Register context: The value stored in each of the 32-bit registers

• Memory context: Any values read from or written to memory, along with their addresses

• Dependence graphs: Data flow dependence information generated in graphical format with the help of dot [12]

• Edge context: Edge information at each basic block (not currently used)

An example format for the context dumps within a trace is shown in Figure 4.2. The figure shows a register context, which is generated for each instruction, a memory context if one exists for that instruction, and the disassembly of that instruction in the form of a trace context with its instruction pointer information.

The naming convention for each stored context is the context name followed by the trace number. For trace generation these files form static storage, but for fault emulations they are created and destroyed on the fly as required.

Figure 4.2: Context dump.
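As a rough sketch of what the per-instruction instrumentation in such a tool can look like, the following Pin tool logs only the trace context (instruction pointer and disassembly) for every executed instruction. It is an illustration written for this discussion, not the actual ExecutionTracer: the knobs for trace start and length and the dumping of register, memory, and dependence contexts are omitted.

    // tracectx.cpp - illustrative Pin tool that logs a trace context per instruction.
    #include "pin.H"
    #include <fstream>
    #include <string>

    static std::ofstream TraceFile;

    // Analysis routine: runs before every executed instruction.
    static VOID LogTraceContext(ADDRINT ip, const std::string *dis)
    {
        TraceFile << std::hex << ip << "  " << *dis << "\n";
    }

    // Instrumentation routine: called once per instruction when it is first translated.
    static VOID Instruction(INS ins, VOID *)
    {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)LogTraceContext,
                       IARG_INST_PTR,
                       IARG_PTR, new std::string(INS_Disassemble(ins)),
                       IARG_END);
    }

    static VOID Fini(INT32, VOID *) { TraceFile.close(); }

    int main(int argc, char *argv[])
    {
        if (PIN_Init(argc, argv)) return 1;
        TraceFile.open("trace.ctx");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();   // never returns
        return 0;
    }

The real tool would additionally expose Pin knobs for the starting point and length of the trace and write out the register and memory contexts described above.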

4.2.2 Fault Generation

Fault generation follows the one-time tracing procedure. This is a highly iterative process that repeats many times within a trace. The bit flips required for fault emulation are generated in this step. The fault generation tool unfolds through a number of levels, listed sequentially below (a sketch of this enumeration follows the list):

• Trace - A trace generated from the binary forms the top-level entry

• Instruction - Each instruction within a trace forms the next level of entry

• Source registers - Every 32-bit source register of an instruction, if present, forms the next level of entry

• Bits - Every bit in the 32-bit source register is flipped, and each flip forms a single emulation
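The nesting described above can be summarized by the following sketch. The data structures are hypothetical stand-ins for the framework's context files; only the enumeration order matters here.

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical stand-ins for the per-instruction context captured during tracing.
    struct Instr { std::vector<std::string> srcRegs; std::vector<uint32_t> srcVals; };
    struct Trace { std::vector<Instr> instrs; };

    // A single emulation: which instruction, which source register, which bit was flipped.
    struct Emulation { std::size_t instr; std::size_t reg; int bit; uint32_t faultedValue; };

    // Enumerate every bit flip of every 32-bit source register of every instruction in a trace.
    std::vector<Emulation> generateFaults(const Trace &trace)
    {
        std::vector<Emulation> out;
        for (std::size_t i = 0; i < trace.instrs.size(); ++i)                 // instruction level
            for (std::size_t r = 0; r < trace.instrs[i].srcRegs.size(); ++r)  // source-register level
                for (int bit = 0; bit < 32; ++bit)                            // bit level: one emulation each
                    out.push_back({i, r, bit, trace.instrs[i].srcVals[r] ^ (1u << bit)});
        return out;
    }

The trace level of the hierarchy corresponds to calling this routine once per generated trace.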

Figure 4.3 gives an overview of the tracing and fault generation procedure. The right half of the figure depicts the one-time tracing system, while the left-hand side shows the fault generation and emulation steps. The box in the middle shows a possible future step of filtering less susceptible instructions. The right-half block has three columns representing the binary, a trace, and the context. The binary is shown split into traces, with each trace having a number of instructions and each instruction having a context associated with it. The fault generation portion is associated with a single trace. The figure shows the number of emulations for each instruction in the trace; within each instruction, 32 flips per source register are shown for each emulation. This structure repeats over all the traces within the executable.

Faulted contexts are saved for each emulation in a flat directory structure, with extensions to each file name identifying the trace and emulation it belongs to. This step prepares the faulted contexts for their actual runs during fault emulation.

Figure 4.3: Framework overview.

4.2.3 Fault Emulation

This step runs the binary loaded with a faulted context. Each run consists of a context with a single bit flipped within a source register of an instruction within the trace. Though there are a large number of runs of the executable, their execution length is limited to the end of the trace to which the emulation belongs. Faulted register contexts from the fault generation stage, along with the memory context as dumped during tracing, are loaded at run time using instrumentation via PIN [21]. Since the instruction pointer is known from the trace, the dumped contexts for an instruction are loaded at the exact same point, but with one of the source register bits flipped. After this initial loading, the binary is run from this state for the same number of instructions that follow it in the original trace. This process is repeated over all emulations up to half of the trace length, i.e., the first 50% of the instructions within the trace. The framework lets each trace correspond to a folder, with all the emulations for the trace residing in that folder.

Figure 4.4 shows the fault emulation system, with the reference trace on the left and the emulation trace on the right. The reference trace is shown for the purpose of result comparison. The figure shows a point of injection, which consists of loading the memory context and the faulted register context based on the context gathered for the same instruction in the reference trace. When the binary is run with the state set in this manner, there are five different possibilities in execution:

• Data flow deviation: This results in a context mismatch, with or without a control flow deviation

• Control flow deviation: The instruction pointers show a mismatch between the traces, generally accompanied by a data flow deviation as expected, since a totally new set of instructions is encountered

• Signal fault: Trace execution aborts due to an invalid memory access signaled as an error by the operating system

• Timeout: Cases where the trace gets stuck in a loop or hangs, in which case execution is forced to halt by a set timer. These cases are generally very rare due to the design of the system

• None: The emulation exactly matches the reference trace in all contexts

Figure 4.4: Fault emulation.

The results are based on comparing the contexts of the reference trace with the emulated trace at the end of execution of the last instruction. The results are placed in bins indicating mismatches, signal faults, or correct execution.
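A minimal sketch of this outcome classification is shown below. The types are hypothetical stand-ins; the real framework compares the full register, memory, and trace contexts dumped by PIN, and the signal and timeout conditions are reported by its handlers.

    #include <vector>

    enum class Outcome { DataFlowDeviation, ControlFlowDeviation, SignalFault, Timeout, None };

    // Hypothetical snapshot of the state compared at the end of a run.
    struct Snapshot { unsigned long ip; std::vector<unsigned> regs; std::vector<unsigned> mem; };

    // Classify one emulation by comparing its final state against the reference trace.
    Outcome classify(const Snapshot &ref, const Snapshot &emu, bool signalled, bool timedOut)
    {
        if (signalled) return Outcome::SignalFault;     // OS reported an invalid memory access
        if (timedOut)  return Outcome::Timeout;         // watchdog timer fired
        if (ref.ip != emu.ip)                           return Outcome::ControlFlowDeviation;
        if (ref.regs != emu.regs || ref.mem != emu.mem) return Outcome::DataFlowDeviation;
        return Outcome::None;                           // all contexts match the reference trace
    }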

4.3 Trace Emulation Experimental Results and Analysis

Twelve integer benchmarks from the SPEC2000 suite were used as candidate applications for the fault emulation experiments. Full injections were also performed on the same set of traces that were used for emulation. 1000 traces per benchmark were generated for the experiment. Trace lengths of 100 and 200 instructions were selected, and emulations were run across half the length of each trace. Since the 100- and 200-instruction traces start at the same points, the 200-instruction traces cover the characteristics of the shorter traces and more.

4.3.1 Fault tolerance categories

Each emulation was compared with its reference trace on the fly, and results were generated. The grouping of emulation results can be summarized as follows:

• Mismatch: This case covers a mismatch in the register context, the memory context, or the control flow

• Signal fault: The emulation tool is equipped with handlers to catch any signal faults from the operating system

• None: All contexts and the control flow match at the end of the trace

Figure 4.5 shows the fault tolerance characteristics of each benchmark. The figure depicts fault tolerance based on the overall categories of results mentioned above. The groupings in the left figure are made based on cut-off percentages: "Mostly incorrect" signifies over 50% incorrect results with fewer than 10% correct entries in that trace, and the segmentation fault grouping is defined the same way, while the "Correct" bin accumulates all traces with correct entries over 10%. This figure shows an overall characteristic of each benchmark displaying its fault tolerance behavior. From the figure it can be seen that 186.crafty and 256.perlbmk show highly fault tolerant behavior.
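Read literally, the grouping rule can be sketched as follows. The thresholds beyond those stated above, and the order in which the bins are checked, are assumptions made for this illustration rather than details given in the text.

    #include <string>

    // Per-trace percentages of emulation outcomes.
    struct TraceSummary { double incorrect; double segfault; double correct; };

    // Hedged reading of the grouping: an outcome dominates a trace when it exceeds 50%
    // while correct entries stay under 10%; otherwise over 10% correct entries bin as "Correct".
    std::string groupTrace(const TraceSummary &t)
    {
        if (t.incorrect > 50.0 && t.correct < 10.0) return "Mostly incorrect";
        if (t.segfault  > 50.0 && t.correct < 10.0) return "Mostly segmentation fault";
        if (t.correct   > 10.0)                     return "Correct";
        return "Other";
    }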



Figure 4.5: Result categories.

4.3.2 Accuracy of emulations

As described previously, the goals of the framework are to demonstrate two contributions: the accuracy of the framework and its speed of execution, based on comparison with full injections. For this purpose the same traces used for emulation were taken, and full injections were performed on each bit of every source register found in every instruction of the trace. Since the full injections run the complete binary and base the results on the entire execution, the results of emulation are compared against them to determine accuracy.

Figure 4.6 shows the accuracy of emulations with two trace lengths against full injections. The figure shows three vertical bars per benchmark, the first two being emulations of trace lengths of 100 and 200 instructions respectively, while the last bar shows full injections. The categories are in the same order as mentioned before. The Y axis shows the percentage contribution per category. The accuracy of emulations is determined by the closeness of the correct entries for the 100- and 200-instruction trace length emulations to the full injections. From the figure it can be seen that the accuracy is fairly good for trace lengths of 100 and 200 instructions. Though application dependent, the real injection data is fairly close to the emulation accuracy observed with these trace lengths.

An enhancement to this would include trace lengths increasing beyond 200 instructions. An experiment was conducted to monitor the position of the control flow deviation due to fault injections. It was observed that a control flow deviation would generally occur within 100 to 200 basic blocks. That gives an approximate length of 1000 instructions, considering an average of 5 instructions per basic block. A saturating curve for correctness should be observed with increasing trace length, somewhere around this value. This would be followed by an analysis to see whether an increasing trend of correct entries, closer to the full injections, is observed with this method.

Figure 4.6: Emulation accuracy.

4.3.3 Average Execution Time of Fault Emulations Against Full Injections

Figure 4.7 shows a comparison between the execution times of 1000 emulations and the same number of full injections. Notice that the Y axis is on a logarithmic scale. It can be seen that the execution times for emulations are much smaller than for the full injections. This emulation execution time also remains roughly constant across all benchmarks, since the number of instructions executed remains more or less the same.

Figure 4.7: A comparison of execution time.

The above two results are the most important in establishing the accuracy of the emulation system as well as its ability to achieve execution times orders of magnitude lower than those of full injections. The system also has the flexibility to add the extra block for ineffective instruction filtering, which would further enhance its ability to capture the fault susceptibility of an application as well as reduce execution times.


Chapter 5

Analysis of Fault Tolerance Program Behavior

The previous chapter demonstrated the accuracy and reduced execution time of the emulation framework. This chapter looks into correlating the fault tolerance observed from the system with program behavior. This is necessary for understanding the relation between the run-time properties of a program and its fault susceptibility. The criterion applied for selecting such properties is based on the assumption that similar code execution should have similar fault susceptibility.

5.1 Source Code Analysis

Finding properties of a program that affect its fault tolerance requires a much deeper understanding of what is going on within the source. An application may be analyzed either at the algorithm level or through its dynamic assembly to understand its fault susceptibility. This section looks at algorithm-level effects on fault tolerance and performs a source code analysis in order to understand program behavior from a reliability point of view.

5.1.1 Algorithm Level Effects

To observe algorithm-level effects on fault tolerance, fault injections were applied to sorting routines that perform the same function on the same inputs but use different algorithms. These applications were also compiled under different versions to observe the effect of optimizations on each. A portion of the results is displayed in Figure 5.1 to show the effect of the algorithm on the fault tolerance of an application.

(a) Heapsort

(b) Quicksort

Figure 5.1: Fault tolerance analysis of sorting routines (a) Heapsort (b) Quicksort.



5.1.2 Dynamic Source Code Behavior

A PIN [21] tool was written for the purpose of tracking the dynamic flow of instructions in basic blocks along with their disassembly. The tool is capable of taking in a function name and dumping a static control flow graph with markings for its dynamic execution. The function names were taken from the functions that occurred in the traces, which were extracted using a script built around another simple PIN [21] tool. This gives an idea of what the trace is actually executing and a complete view of how a fault propagates.

Sections of these traces were processed through this tool to analyze which instructions a fault encounters as it flows dynamically through the program. A highly fault tolerant trace has already been illustrated in the motivation chapter. Cases of traces with low fault tolerance and with high segmentation fault rates are examined in this section.

Figure 5.2 shows a case of low fault tolerance, about 91.4% incorrect behavior, in 164.gzip. The trace covers a function called updcrc(), which performs a CRC check and hence is compute intensive. Compared to the highly fault tolerant case, this layout has no binary decisions or logic to mask values off. Instead, the operation just before returning from the function performs an XOR with all "F"s, which necessarily flips all bits. From the context point of view, this information will never match after a fault injection.

Figure 5.3 shows a case with a high segmentation fault rate, in this case a function from 300.twolf. The source code shows that this function performs a linked list traversal within which it assigns values. Most of the assignments use pointers, making this a very likely candidate for invalid memory accesses.

/* Source function */
ulg updcrc(s, n)
    uch *s;       /* pointer to bytes to pump through */
    unsigned n;   /* number of bytes in s[] */
{
    register ulg c;                     /* temporary variable */
    static ulg crc = (ulg)0xffffffffL;  /* shift register contents */

    if (s == NULL) {
        c = 0xffffffffL;
    } else {
        c = crc;
        if (n) do {
            c = crc_32_tab[((int)c ^ (*s++)) & 0xff] ^ (c >> 8);
        } while (--n);
    }
    crc = c;
    return c ^ 0xffffffffL;  /* (instead of ~c for 64-bit machines) */
}

(a) Source code (b) Dynamic control flow

Figure 5.2: A trace from 164.gzip with 91.4% incorrect results of emulation.

The study shown above calls for properties closer to the source, chosen based on their impact on fault tolerance. Two such properties and their effects on the fault susceptibility of a program have been chosen. The following sections analyze code region similarity and the data dependencies of a program as tools to correlate its fault tolerance.

5.2 Dynamic Program Trace Behavior

This section analyzes the behavior of a program based on the similarity of code regions and its effect on the fault tolerance of the application. The analysis method consists of comparing the existing traces with each other in an attempt to find matches in program counter values. A match gives an indication that the program is executing in the same code region, although probably in a different dynamic state.



(a) Source code (b) Dynamic control flow

Figure 5.3: A trace from 300.twolf with 100% segmentation fault results of emulation.

The correlation with fault tolerance is based on generating a similarity score for a pair of traces. This score indicates how similar the two traces are. The score generation procedure consists of taking two traces, finding the percentage of matching program counters from one trace to the other and vice versa, and averaging the two values to produce a similarity score. This procedure is iterated over all pairs of traces. Figure 5.4 shows the overall correlation of fault tolerance with program counter similarity across all benchmarks. The different bins represent the delta in fault tolerance for similarity scores above 65%.
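A sketch of this scoring step is shown below. The trace representation is a hypothetical stand-in (a list of program counter values); the real traces carry full context information.

    #include <cstddef>
    #include <unordered_set>
    #include <vector>

    // Fraction of program counters in `a` that also appear somewhere in `b`.
    static double matchFraction(const std::vector<unsigned long> &a,
                                const std::vector<unsigned long> &b)
    {
        if (a.empty()) return 0.0;
        std::unordered_set<unsigned long> inB(b.begin(), b.end());
        std::size_t hits = 0;
        for (unsigned long pc : a) hits += inB.count(pc);
        return static_cast<double>(hits) / a.size();
    }

    // Symmetric program counter similarity between two traces, in percent:
    // the average of the two directional match fractions.
    double pcSimilarity(const std::vector<unsigned long> &t1,
                        const std::vector<unsigned long> &t2)
    {
        return 100.0 * 0.5 * (matchFraction(t1, t2) + matchFraction(t2, t1));
    }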

Figure 5.4: Fault tolerance correlation with program counter similarity.

To observe the correlation, the difference in fault tolerance between pairs of traces with similarity scores of 65% and above is plotted as shown in Figure 5.5. The figure plots the program counter similarity score against a function of the fault tolerance delta between the pair of traces. The curve shown is a third-degree polynomial fit depicting the trend of program fault susceptibility with code region similarity.

Two trends are observable. The plots for 175.vpr and 181.mcf show a very promising decrease in the fault tolerance delta with increasing similarity score. On the other hand, compression applications such as 164.gzip and 256.bzip2 show very irregular behavior. This could be due to large variations in dynamic state: similar instructions are being executed, but at different times. These applications also tend to have more inter-procedural calls than others.

5.2.1 Inter-Procedural Fault Tolerance



(a) 175.vpr (b) 256.bzip2

(c) 181.mcf (d) 164.gzip

Figure 5.5: Fault susceptibility correlation calculated using similar program code points.

The existing trace information was used to correlate the number of procedures in a trace with its fault susceptibility. As expected, a decreasing trend in fault tolerance was observed with an increasing number of inter-procedural calls within a trace. Figure 5.6 shows the average fault tolerance across all traces with an increasing number of procedures per trace. The figure also shows the percentage of times each procedure count occurs among a set of 1000 traces, scaled by a factor of 0.2 for visibility in the same region. Traces containing a single function dominate, while the fewer traces with a higher number of procedures show the highest fault susceptibility.

Figure 5.6: Inter-procedural fault tolerance.

The above study gives some indication that although certain applications show a favorable trend of fault tolerance with code similarity, others depend on the dynamic state of the system. This is taken into consideration in the following section, where the data dependencies, which are the primary propagators of a fault, are analyzed and compared among traces.

5.3 Dynamic Dependence Graph Representation

Program counter correlation captures similarity in terms of functional execution, but fails to capture the dynamic data flow that propagates or masks the effect of a transient fault. The propagation of a fault depends on the data dependencies of a program, and this information needs to be captured and used to find similar trends in order to correlate with the fault susceptibility of an application. Figure 5.7 shows an example of a data dependency graph in 'dot' format. The reliability of the system is inherent in the connectivity of the graph.

Figure 5.7: Example dependency graph.

For this purpose, data dependency graphs are generated from the traces by a graph clustering tool developed by Dennis Shasha at NYU [8]. The tool takes a set of directed graphs produced by a PIN [21] tool that instruments the program and dumps data dependency information. The graph clustering tool uses SUBDUE [14] to find common sub-structures in a given trace, then iterates over all traces to find similar structures and clusters them together. A tightness measure, which defines how precise the similarity within a cluster must be, can also be specified.
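As an illustration of the kind of graph involved, the sketch below derives a register def-use dependence graph from a dynamic instruction trace and prints it in 'dot' format; the trace record layout is an assumption for illustration and is not the actual output of the PIN tool used here:

    def build_dependence_dot(trace):
        """Emit a register def-use dependence graph in 'dot' format.

        Each trace entry is assumed to be (instr_id, regs_read, regs_written);
        an edge p -> c means instruction c consumes a value produced by p.
        """
        last_writer = {}    # register name -> id of the last writing instruction
        edges = set()
        for instr_id, regs_read, regs_written in trace:
            for reg in regs_read:
                if reg in last_writer:
                    edges.add((last_writer[reg], instr_id))
            for reg in regs_written:
                last_writer[reg] = instr_id

        lines = ["digraph dependencies {"]
        lines += ['    "%s" -> "%s";' % edge for edge in sorted(edges)]
        lines.append("}")
        return "\n".join(lines)

    # Example: i2 and i3 both consume the value produced by i1.
    print(build_dependence_dot([("i1", [], ["eax"]),
                                ("i2", ["eax"], ["ebx"]),
                                ("i3", ["eax", "ebx"], ["ecx"])]))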

The above graph clustering procedure was iterated over all the generated traces. A similarity score was calculated for each pair of traces based on the closeness of the dependence graphs contained in the traces. This analysis directly correlates fault tolerance with the dynamic dependencies of the program, which are the active ingredient in fault propagation. For a broader view, similarity scores of 35% and above have been plotted against their respective fault tolerances. The graph on the left gives the raw numbers, while the one on the right is a polynomial fit of the graph on the left.
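The actual scoring relies on the SUBDUE-based clusters; as a simplified stand-in, a pairwise score can be sketched as the overlap between two graphs' edge sets (an illustrative approximation, not the clustering tool's metric):

    from itertools import combinations

    def graph_similarity(edges_a, edges_b):
        """Similarity score (%) between two dependence graphs given as edge sets."""
        edges_a, edges_b = set(edges_a), set(edges_b)
        if not edges_a and not edges_b:
            return 100.0
        return 100.0 * len(edges_a & edges_b) / len(edges_a | edges_b)

    def similar_pairs(graphs, threshold=35.0):
        """Return pairs of trace ids whose graphs score at or above the threshold."""
        scored = {}
        for i, j in combinations(sorted(graphs), 2):
            score = graph_similarity(graphs[i], graphs[j])
            if score >= threshold:
                scored[(i, j)] = score
        return scored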

The cases in Figure 5.8 are the same ones shown for program counter similarity, yet here they show a decreasing trend in fault tolerance with increasing dependence similarity score. A similar trend was observed among all the other benchmarks, with very few outliers.

(a) 164.gzip. (b) 164.gzip - Trend.

(c) 175.vpr. (d) 175.vpr - Trend.

Figure 5.8: Dependence graph similarity correlation.

Figure 5.9 shows the dependence similarity correlation over all the benchmarks. From the figure, it can be observed that the trend remains similar when averaged over all the benchmarks: the delta in fault tolerance decreases with increasing similarity, although a slight rise in the curve can be observed at the end.

(a) All Benchmarks. (b) All Benchmarks - Trend.

Figure 5.9: Dependence graph similarity correlation - all benchmarks.


5.4 Estimating Program Fault Tolerance

Based on the two correlation techniques presented in the previous sections, it is evident that dependency graph similarity provides a much more accurate correlation with fault tolerance. This technique is therefore used to predict fault tolerance, based on the consistent trends of fault susceptibility observed against graph similarity scores. If the prediction accuracy falls within an acceptable range, one could examine only the dependence structure of a program and predict its fault tolerance without performing any injections.

The existing results for the traces from the emulation framework, along with their dependency graphs, were used to build prediction models. Each model uses an increasing number of the graphs most similar to the reference graph in order to predict its fault tolerance; when more than one graph is used, their fault tolerance values are averaged. Ten models were selected: the first predicts the fault tolerance of a trace from the fault tolerance of the single trace with the most similar dependency graph, while the remaining models select the designated number of most similar graphs and average their fault tolerance scores for the prediction. The accuracy of the prediction models is then observed as the number of similarity graphs in use increases.
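A minimal sketch of this family of models, assuming a precomputed pairwise similarity table and measured per-trace fault tolerance values (all names are illustrative):

    def predict_fault_tolerance(target, similarity, fault_tolerance, k):
        """Predict a trace's fault tolerance from its k most similar traces.

        similarity[(a, b)] holds the dependence-graph similarity of a trace pair;
        fault_tolerance[t] holds the measured fault tolerance of trace t.
        """
        def score(other):
            return similarity.get((target, other),
                                  similarity.get((other, target), 0.0))

        neighbors = sorted((t for t in fault_tolerance if t != target),
                           key=score, reverse=True)[:k]
        return sum(fault_tolerance[t] for t in neighbors) / len(neighbors)

    def model_error(similarity, fault_tolerance, k):
        """Average |actual - predicted| over all traces for the k-graph model."""
        errors = [abs(fault_tolerance[t] -
                      predict_fault_tolerance(t, similarity, fault_tolerance, k))
                  for t in fault_tolerance]
        return sum(errors) / len(errors)

    # One model per k, as in Figure 5.10 (smaller error = higher accuracy):
    # errors_by_model = {k: model_error(similarity, fault_tolerance, k)
    #                    for k in range(1, 11)}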

Figure 5.10 gives the accuracy of each prediction model averaged over all the traces of 12 benchmarks. The X axis identifies each model by the number of similarity graphs used for its prediction, and the Y axis represents the average prediction error. Since the Y axis is the difference between the actual and predicted values of fault tolerance averaged over all traces, a value closer to the X axis represents higher prediction accuracy.

Figure 5.10: Fault tolerance prediction.

Based on the figure, it is evident that the accuracy of fault tolerance prediction increases with the number of similarity graphs used by the model. The models here use a simple averaging approach; better models need to be studied to improve prediction accuracy. Nevertheless, this study provides a baseline from which one can estimate the fault susceptibility of an application based on the behavior of its dependency graphs.


Chapter 6

Future Work

The TEFIS framework justifies a fractional execution system with accuracy close to that of the full injection methodology, but further investigation into the trace length of execution is required. The longer the trace, the closer its emulation accuracy is to full injection, but at the cost of execution speed. In general, one would expect an increasing accuracy curve with trace length for any application, as shown in Figure 4.6. Future work would include examining these curves for various applications.

This thesis presented an understanding of the fault tolerance of a program by looking at its source code and determining possible points of vulnerability. Observations based on logical masking effects showed that a large percentage of injections could prove to be ineffective. Adding support for selective injections based on this study would be fairly easy within the existing framework.

Dependency graphs appear to be a good representation for correlating with the fault tolerance of a program. The precision of cluster formation based on the level of similarity between graphs is something that requires further investigation. Also, the fault tolerance predictions made in this thesis are based on an averaging method; better solutions may include using a larger set of graphs with a prediction method based on standard deviation.


Applying this framework and correlation methodology to a number of compiled versions of an executable would be an interesting study for observing the effects a compiler can have on an application, as shown in Figure 5.1.


Chapter 7

Conclusion

Semiconductor transient faults (soft errors) are a critical design concern in the reliability of computer systems. Most recent architecture research is focused on using performance models to provide Architecture Vulnerability Factor (AVF) estimates of processor reliability rather than deploying detailed fault injection into hardware RTL models. While AVF analysis provides support for investigating new fault tolerant architecture techniques, program execution characteristics are largely missing from the determination of periods of soft error susceptibility. The primary problem with AVF is that software periods of vulnerability substantially differ from micro-architecture periods of vulnerability. As research trends dictate finding ways to selectively enable transient fault tolerant mechanisms, run-time and off-line experimental techniques must be guided equally by program behavior and hardware.


Bibliography

[1] Multiple Instruction Issue in the NonStop Cyclone Processor, 1990.

[2] S. Baloch, T. Arslan, and A. Stoica. Design of a single event upset (SEU) mitigation technique for programmable devices. In ISQED '06: Proceedings of the 7th International Symposium on Quality Electronic Design, pages 330–345, Washington, DC, USA, 2006. IEEE Computer Society.

[3] Robert C. Baumann. Soft errors in commercial semiconductor technology: Overview and scaling trends. In IEEE 2002 Reliability Physics Tutorial Notes, Reliability Fundamentals, pages 121 01.1 – 121 01.14, April 2002.

[4] Shekhar Borkar. Designing reliable systems from unreliable components: The challenges of transistor variability and degradation. IEEE Micro, 25(6):10–16, 2005.

[5] Brad Calder, Greg Hamerly, and Tim Sherwood. Simpoint.

[6] J. Adam Butts and Guri Sohi. Dynamic dead-instruction detection and elimination. In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.

[7] Jeffrey J. Cook and Craig Zilles. A characterization of instruction-level derating and its implications for error detection. 2008.

[8] Dennis Shasha and Diego Reforgiato Recupero. Graphclust.

[9] Mohamed Gomaa et al. Transient-fault recovery for chip multiprocessors. In ISCA, 2003.

[10] Mohamed A. Gomaa and T. N. Vijaykumar. Opportunistic transient-fault detection. In ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 172–183, Washington, DC, USA, 2005. IEEE Computer Society.

[11] S. Hareland et al. Impact of CMOS Scaling and SOI on Software Error Rates of Logic Processes. In VLSI Technology Digest of Technical Papers, 2001.

[12] John Ellson, Emden Gansner, Yehuda Koren, Eleftherios Koutsofios, John Mocenigo, Stephen North, and Gordon Woodhull. Graphviz.


[13] T. Karnik et al. Scaling Trends of Cosmic Rays Induced Soft Errors in Static Latches Beyond 0.18µ. In VLSI Circuit Digest of Technical Papers, 2001.

[14] Nikhil S. Ketkar, Lawrence B. Holder, and Diane J. Cook. Subdue: compression-based frequent pattern discovery in graph data. In OSDM '05: Proceedings of the 1st International Workshop on Open Source Data Mining, pages 71–76, New York, NY, USA, 2005. ACM.

[15] Kevin M. Lepak and Mikko H. Lipasti. Silent stores for free. In MICRO 33: Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 22–31, New York, NY, USA, 2000. ACM.

[16] M. Lipasti. Value locality and speculative execution, 1997.

[17] Sarah E. Michalak et al. Predicting the Number of Fatal Soft Errors in Los Alamos National Laboratory's ASC Q Supercomputer. IEEE Transactions on Device and Materials Reliability, 5(3):329–335, September 2005.

[18] Shubhendu S. Mukherjee et al. Detailed design and evaluation of redundant multithreading alternatives. In ISCA, 2002.

[19] Shubhendu S. Mukherjee, Christopher Weaver, Joel Emer, Steven K. Reinhardt, and Todd Austin. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, page 29. IEEE Computer Society, 2003.

[20] Karthik Pattabiraman, Nithin Nakka, Zbigniew Kalbarczyk, and Ravishankar Iyer. Symplfied: Symbolic program-level fault-injection and error-detection framework. In Proceedings of the International Conference on Dependable Systems and Networks (DSN). IEEE, June 2008. This paper won the William Carter award for best paper.

[21] PIN Dynamic Instrumentation Tool. http://rogue.colorado.edu/pin/.

[22] Maurizio Rebaudengo, Matteo Sonza Reorda, Marco Torchiano, and Massimo Violante. Soft-error detection through software fault-tolerance techniques. In DFT '99: Proceedings of the 14th International Symposium on Defect and Fault-Tolerance in VLSI Systems, pages 210–218, Washington, DC, USA, 1999. IEEE Computer Society.

[23] Steven K. Reinhardt and Shubhendu S. Mukherjee. Transient fault detection via simultaneous multithreading. SIGARCH Comput. Archit. News, 28(2):25–36, 2000.

[24] George A. Reis, Jonathan Chang, David I. August, Robert Cohn, and Shubhendu S. Mukherjee. Configurable transient fault detection via dynamic binary translation. In Proceedings of the 2nd Workshop on Architectural Reliability (WAR), December 2006.

[25] George A. Reis et al. SWIFT: Software implemented fault tolerance. In CGO, 2005.


[26] Premkishore Shivakumar, Michael Kistler, Stephen W. Keckler, Doug Burger, and Lorenzo Alvisi. Modeling the effect of technology trends on the soft error rate of combinational logic. In Proceedings of the 2002 International Conference on Dependable Systems and Networks, pages 389–399, June 2002.

[27] Alex Shye, Tipp Moseley, Vijay Janapa Reddi, Joseph Blomstedt, and Daniel A. Connors. Using process-level redundancy to exploit multiple cores for transient fault tolerance. In DSN '07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 297–306, Washington, DC, USA, 2007. IEEE Computer Society.

[28] Timothy J. Slegel et al. IBM's S/390 G5 Microprocessor design. In IEEE Micro, volume 19, pages 12–23, March 1999.

[29] Nicholas Wang, Michael Fertig, and Sanjay Patel. Y-branches: When you come to a fork in the road, take it. In PACT '03: Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, page 56, Washington, DC, USA, 2003. IEEE Computer Society.

[30] Nicholas J. Wang, Justin Quek, Todd M. Rafacz, and Sanjay J. Patel. Characterizing the effects of transient faults on a high-performance processor pipeline. In DSN '04: Proceedings of the 2004 International Conference on Dependable Systems and Networks, page 61, Washington, DC, USA, 2004. IEEE Computer Society.

[31] Y.C. Yeh. Triple-triple redundant 777 primary flight computer. In Proceedings of the 1996 IEEE Aerospace Applications Conference, volume 1, pages 293–307, February 1996.