Filter Creation for Application Level Fault Tolerance and Detection
Eric Ciocca, Israel Koren, C.M. Krishna
ECE Department, UMass Amherst
Overview
Our approach to fault detection and tolerance relies on an application's inherent familiarity with its own data.
Fault detection and tolerance are handled at the application level: applications do not need hardware or middleware to provide fault recovery.
To recognize trends in application data, the developer must be familiar with what the data represents.
The existing trend can be used for fault detection, but it must be quantitatively defined so the application can apply it.
What is ALFTD?
ALFTD complements existing system- or algorithm-level fault tolerance by leveraging information available only at the application level.
Using such application-level semantic information significantly reduces the overall cost of providing fault tolerance.
ALFTD may be used alone or to supplement other fault detection schemes.
ALFTD is tunable: it lets users trade fault tolerance against computation overhead; allowing more overhead for ALFTD produces better results.
Principles of ALFTD
Every physical node runs its own work (P, primary) as well as a scaled-down copy of a neighboring node's work (S, secondary).
If a fault corrupts a process, the corresponding secondary of that task still produces output, albeit at a lower (but acceptable) quality.
Node 1: P1, S4
Node 2: P2, S1
Node 3: P3, S2
Node 4: P4, S3
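The ring layout above can be sketched in code. This is an illustrative sketch, not the poster authors' implementation; the function names and the failover lookup are assumptions.

```python
# Sketch of ALFTD's primary/secondary layout (illustrative, not the
# authors' code): node i runs its own primary task P_i plus a scaled-down
# secondary copy S_{i-1} of its neighbor's task, arranged in a ring, so a
# single node failure still yields (lower-quality) output for every task.

def assign_tasks(num_nodes):
    """Map each node to its (primary, secondary) task labels, 1-based."""
    layout = {}
    for i in range(1, num_nodes + 1):
        neighbor = i - 1 if i > 1 else num_nodes  # wrap around the ring
        layout[i] = (f"P{i}", f"S{neighbor}")
    return layout

def covering_node(layout, failed_node):
    """After failed_node dies, which node's secondary covers its task?"""
    for node, (_, secondary) in layout.items():
        if node != failed_node and secondary == f"S{failed_node}":
            return node
    return None

layout = assign_tasks(4)
# layout == {1: ("P1", "S4"), 2: ("P2", "S1"), 3: ("P3", "S2"), 4: ("P4", "S3")}
covering_node(layout, 1)   # → 2: node 2's secondary S1 covers node 1's work
```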
Fault Detection
Faults do not always completely disable a node; malformed and corrupted data are also possible.
Hardware-disabling faults are easy to detect with watchdog hardware and "I am alive" messages; faulty data is difficult to detect without application-level knowledge of the data.
Fault detection is a necessary condition for ALFTD to schedule which secondary tasks to run.
Secondary processes can provide verification for ambiguously faulty data.
Principles of ALFTD Filters
Faults are detected by passing results through one or more acceptance filters.
Filters are unique to applications with certain data characteristics: value-bound tests are applicable to most applications, while sanity checks require knowledge of the expected output value and format.
Figure: results from the primary pass through Filter 1 and Filter 2; data that passes all filters is accepted ("data is OK"), while failing results are placed on the secondary task queue.
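The acceptance-filter pipeline can be sketched as follows. This is a minimal sketch, assuming a simple filter-chain design; the filter thresholds below are made up for illustration.

```python
# Hedged sketch of the acceptance-filter pipeline: a primary result is
# passed through each filter in turn; if every filter accepts it the data
# is OK, otherwise the task is queued for its secondary.
from collections import deque

secondary_queue = deque()

def check_result(value, filters):
    """Return True if all filters accept; otherwise enqueue for the secondary."""
    if all(f(value) for f in filters):
        return True
    secondary_queue.append(value)
    return False

# Illustrative filters (thresholds are stand-ins, not calibrated values):
filter1 = lambda v: 250.0 < v < 350.0   # value-bound test
filter2 = lambda v: v == v              # trivial sanity check (rejects NaN)

check_result(300.0, [filter1, filter2])   # passes both filters → True
check_result(9999.0, [filter1, filter2])  # fails, routed to secondary queue
```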
OTIS Characteristics
ALFTD was applied to OTIS (Orbital Thermal Imaging Spectrometer), part of the REE application suite.
OTIS reads radiation values from various bands and calculates temperature data; the output can be viewed graphically or numerically.
OTIS lends itself to ALFTD because the output data (temperature) has:
Local Correlation: data changes gradually over an area.
Absolute Bounds: data falls within some expected realistic range.
ALFTD in OTIS
Local Correlation and Absolute Bounds on the data led to the creation of two data fault filters.
Spatial Locality Filter: if the difference between pixel (x, y) and (x-1, y) is greater than some threshold, the pixel may be the result of faulty data.
Absolute Bounds Filter: any pixel whose value falls outside an expected range (lower bound < value < upper bound) may be the result of faulty data.
The filter thresholds are set based on the sample datasets provided.
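The two filters can be sketched directly from their definitions. The thresholds and the sample image below are illustrative stand-ins, not the calibrated OTIS values.

```python
# Minimal sketch of the two OTIS data-fault filters described above.
# The delta and (lo, hi) thresholds here are illustrative, not calibrated.

def spatial_locality_faults(image, delta):
    """Flag pixels whose difference from the left neighbor (x-1, y) exceeds delta."""
    flagged = []
    for y, row in enumerate(image):
        for x in range(1, len(row)):
            if abs(row[x] - row[x - 1]) > delta:
                flagged.append((x, y))
    return flagged

def bounds_faults(image, lo, hi):
    """Flag pixels falling outside the expected temperature range (lo, hi)."""
    return [(x, y) for y, row in enumerate(image)
            for x, v in enumerate(row) if not (lo < v < hi)]

image = [[300.0, 301.0, 302.0],
         [300.5, 450.0, 301.5]]              # 450.0 is an injected fault
spatial_locality_faults(image, delta=10.0)   # → [(1, 1), (2, 1)]
bounds_faults(image, lo=250.0, hi=350.0)     # → [(1, 1)]
```

Note that the spatial locality filter flags both the faulty pixel and its right neighbor, since both adjacent differences exceed the threshold; the bounds filter pinpoints only the out-of-range pixel.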
OTIS Datasets
Figure: faulty and faultless versions of the "Blob", "Stripe", and "Spots" datasets.
OTIS Datasets with ALFTD
Figure: faulty and ALFTD-corrected versions of the "Blob", "Stripe", and "Spots" datasets.
Problem
ALFTD filters require calibration, and the calibration constants are context sensitive.
Filter values can be approximated, but detection efficiency improves with well-tuned filters.
Heuristics are created based on characteristics of the most frequent data.
Frequency Plots (bounds filter)
Frequency of temperature values
Frequency Plots (spatial locality filter)
Frequency of differences between adjacent pixels
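One way to turn such frequency plots into cutoffs is percentile-based: keep a chosen fraction of fault-free values inside the bounds. This is a sketch of that idea, assuming a symmetric tail cut; the coverage knob and sample data are illustrative.

```python
# Sketch of threshold calibration from a frequency distribution: given a
# fault-free sample dataset, place the bounds filter's cutoffs at the tails
# of the observed value distribution. The coverage fraction is a tunable
# knob, traded off against false alarms (tighter bounds = more alarms).

def calibrate_bounds(values, coverage=0.98):
    """Pick (lo, hi) cutoffs keeping roughly `coverage` of values inside."""
    ordered = sorted(values)
    tail = (1.0 - coverage) / 2.0
    lo_idx = int(tail * (len(ordered) - 1))
    hi_idx = int((1.0 - tail) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]

# Stand-in fault-free sample: temperatures 300..319, uniformly frequent
faultless = [300 + (i % 20) for i in range(200)]
calibrate_bounds(faultless, coverage=0.9)   # → (300, 318)
```

The same procedure applies to the spatial locality filter by feeding it the differences between adjacent pixels instead of raw values.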
Approach
To test the detection characteristics of a scheme, an erroneous case and a control case of the same data are needed.
Errors may produce different kinds and intensities of faults, so it is important to decide what sort of errors we want to detect.
In the case of OTIS, intensely faulty data (set-to-zero errors, memory gibberish) is easily detected, as it seldom falls inside the prescribed filters.
Our experiments include moderately faulty data: offsets in input values of up to 30%.
These faults tend to blend in with non-faulty data, making them especially hard to detect.
Approach
Filters can be adjusted in steps of increasing complexity.
A single filter has a high and a low cutoff; the "left" and "right" bounds of the data are usually exclusive, so their detections act cumulatively.
In each filter, a tradeoff must be resolved between the desired fault detection rate and the number of incurred false alarms.
Multiple filters are independently calibrated, and multiple filters will not necessarily detect different faults.
Many filters working at a low expected detection rate may detect as many or more faults for a system than a single filter working at a high expected detection rate.
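The cumulative effect of several filters can be sketched under an independence assumption: if filters catch largely disjoint fault populations, the combined detection rate is 1 minus the product of the individual miss rates. This is an idealized upper bound of my own construction, not the poster's measured numbers; real filters overlap, which is exactly why distinct detection domains matter.

```python
# Idealized combination of independent filters (an upper bound, since real
# filters partly detect the same faults): combined = 1 - prod(1 - d_i).

def combined_detection(rates):
    """Detection rate of several filters, assuming independent coverage."""
    miss = 1.0
    for d in rates:
        miss *= (1.0 - d)   # probability every filter misses the fault
    return 1.0 - miss

combined_detection([0.6, 0.6])   # 0.84: two weak filters can beat one at 0.8
```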
Detection Plots (single side)
Fault detections and false alarms on a left-sided filter
Detection Plots (both sides)
By overlaying the left and right filter plots, general detection traits can be observed
Fault Detections, Numerically
Bounds Filter Fault Detections (columns = left filter cutoff, rows = right filter cutoff)

       300     304     306     310     314     318
315   98.9%   99.1%   99.2%   99.2%   99.4%    -
317   96.6%   96.8%   96.8%   96.9%   97.1%    -
319   93.8%   93.9%   94.0%   94.0%   94.3%   98.5%
321   91.0%   91.1%   91.2%   91.3%   91.5%   95.7%
323   88.2%   88.3%   88.4%   88.4%   88.7%   92.9%
325   83.6%   83.7%   83.8%   83.9%   84.1%   88.3%
327   78.5%   78.7%   78.8%   78.8%   79.0%   83.3%
329   71.2%   71.4%   71.5%   71.5%   71.7%   76.0%
331   64.0%   64.2%   64.3%   64.3%   64.5%   68.8%
333   61.4%   61.5%   61.6%   61.7%   61.9%   66.1%
335   60.9%   61.0%   61.1%   61.2%   61.4%   65.6%
337   60.2%   60.4%   60.4%   60.5%   60.7%   64.9%
339   59.2%   59.4%   59.5%   59.5%   59.7%   64.0%
This table is used to find the possible configurations that satisfy a minimum fault detection rate.
False Alarms, Numerically
Bounds Filter False Alarms (columns = left filter cutoff, rows = right filter cutoff)

       300     304     306     310     314     318
315   92.4%   92.4%   92.4%   92.4%   96.3%    -
317   84.7%   84.7%   84.7%   84.7%   88.7%    -
319   78.6%   78.6%   78.6%   78.6%   82.5%   97.1%
321   72.5%   72.5%   72.5%   72.5%   76.5%   91.0%
323   64.8%   64.8%   64.8%   64.8%   68.7%   83.2%
325   54.1%   54.1%   54.1%   54.1%   58.1%   72.6%
327   41.2%   41.2%   41.2%   41.2%   45.2%   59.7%
329   23.9%   23.9%   23.9%   23.9%   27.8%   42.3%
331    5.0%    5.0%    5.0%    5.0%    9.0%   23.5%
333    0.0%    0.0%    0.0%    0.0%    4.0%   18.5%
335    0.0%    0.0%    0.0%    0.0%    3.9%   18.4%
337    0.0%    0.0%    0.0%    0.0%    3.9%   18.4%
339    0.0%    0.0%    0.0%    0.0%    3.9%   18.4%
From the configurations identified in the previous table, we choose the one with the fewest false alarms.
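The two-table selection procedure can be sketched as a small search. The entries below are a small excerpt of the bounds filter tables; the function name and the dictionary representation are my own.

```python
# Sketch of the selection procedure: among (left, right) cutoff pairs whose
# fault detection meets a minimum target, pick the pair with the fewest
# false alarms. Data is an excerpt of the bounds filter tables.

detection = {(300, 319): 0.938, (318, 319): 0.985,
             (300, 321): 0.910, (318, 321): 0.957,
             (300, 323): 0.882, (318, 323): 0.929}
false_alarm = {(300, 319): 0.786, (318, 319): 0.971,
               (300, 321): 0.725, (318, 321): 0.910,
               (300, 323): 0.648, (318, 323): 0.832}

def pick_config(detection, false_alarm, min_detection):
    """Return the cutoff pair meeting min_detection with minimal false alarms."""
    candidates = [c for c, d in detection.items() if d >= min_detection]
    return min(candidates, key=lambda c: false_alarm[c])

pick_config(detection, false_alarm, 0.90)   # → (300, 321), 72.5% false alarms
```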
Detection Plots (both sides, spatial locality filter)
By overlaying the left and right filter plots, general detection traits can be observed
Multiple Filters
Combined results when the bounds filter and spatial locality filter are used together (rows and columns give each filter's individual target detection rate):

Fault Detection:
        60.0%   70.0%   90.0%
40.0%   63.7%   71.9%   89.6%
50.0%   64.0%   72.1%   89.7%
60.0%   67.5%   72.7%   90.2%
70.0%   76.3%   80.1%   94.2%
80.0%   84.1%   87.4%   96.8%
90.0%   93.0%   94.3%   98.7%

False Alarms:
        60.0%   70.0%   90.0%
40.0%   15.7%   22.6%   76.0%
50.0%   15.7%   22.6%   76.0%
60.0%   15.7%   22.6%   76.0%
70.0%   36.3%   42.2%   84.2%
80.0%   59.9%   64.6%   90.5%
90.0%   77.1%   79.0%   94.5%

By combining multiple filters, fault detection is increased. To be effective, filters should have distinct fault detection domains.
Relation Between Datasets
"Blob" is an average dataset, but we also need to analyze the behavior of the other datasets.
"Stripe": any filter settings achieve the same false alarm and fault detection rates, within a few percent.
"Spots": this does not hold for the bounds filter.
It has an average temperature 10 K lower than the others, pushing it closer to the "faulty" region of the bounds filter.
We can relax the filter and accept the cut in efficiency, or predict when the "Spots" climate should be expected and use modified filters.
This is the downfall of using absolute, instead of differential, data as criteria for the filters.
Extensions to Other Applications
OTIS was a likely candidate for ALFTD due to the regularity of its data; natural phenomena tend to have regular and predictable behavior.
Other applications dealing with temperature, imaging (NGST), or even geological surveys could have success with these two basic filters.
These filter settings are only useful in environments similar to our sample datasets, but the method of calibrating filters is general enough to apply to other datasets and similar applications.
Extensions to Novel Datasets
Once a working set of filters is devised, it should be applicable to any dataset with the same characteristics.
Precalculated filter calibrations could be created to allow for higher fault detection in very specific, localized datasets.
General-purpose filters can also be extracted by running through many datasets, but they incur performance penalties.
Dynamic Filter Calibration
Approximate settings are possible, but these may perform poorly when encountering new data cases.
The application may need to reconfigure its filters for the new data.
This process could be automated, assuming the calibrating computer can obtain at least one control (fault-free) dataset.
Without prior exposure to these novel datasets, automated dynamic reconfiguration should be implemented as a numerically based decision process.
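One possible numerically based decision process is sketched below: measure the current filter's false-alarm rate on the control dataset, and refit the cutoffs only if it exceeds a budget. The function names and the 5% budget are illustrative assumptions, not the poster's design.

```python
# Sketch of automated dynamic recalibration: given one fault-free control
# dataset for the new environment, check the current bounds filter against
# it; if too many fault-free values are flagged, refit the cutoffs.
# The 5% false-alarm budget is an illustrative assumption.

def false_alarm_rate(control_values, lo, hi):
    """Fraction of known-good values the current bounds filter would flag."""
    flagged = sum(1 for v in control_values if not (lo <= v <= hi))
    return flagged / len(control_values)

def maybe_recalibrate(control_values, lo, hi, budget=0.05):
    """Return (lo, hi), refit from the control set if false alarms exceed budget."""
    if false_alarm_rate(control_values, lo, hi) <= budget:
        return lo, hi                                    # calibration still fits
    return min(control_values), max(control_values)      # refit to control range

# "Spots"-like control data: roughly 10 K colder than the calibration climate
control = [290 + i for i in range(10)]        # temperatures 290..299
maybe_recalibrate(control, lo=300, hi=340)    # → (290, 299)
```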
Conclusion
Filters are a critical part of ALFTD: the efficiency of the ALFTD method is contingent on having a successful method of fault detection.
Careful calibration of filters can greatly improve the fault detection capability of ALFTD.
Options for novel datasets:
General-purpose filter calibrations
Precalculated filter calibrations
Dynamic calibration
Thank you!
For further information, please contact:
Israel Koren ([email protected])
C.M. Krishna ([email protected])
Eric Ciocca ([email protected])