Filter Creation for Application Level Fault Tolerance and Detection
Eric Ciocca, Israel Koren, C.M. Krishna
ECE Department, UMass Amherst
Overview
Our approach to fault detection and tolerance relies on an application's inherent familiarity with its own data.
Fault detection and tolerance are handled at the application level: applications do not need hardware or middleware to provide fault recovery.
To recognize trends in application data, the developer must be familiar with what the data represents.
The existing trend can be used for fault detection, but it must be quantitatively defined so the application can apply it.
What is ALFTD?
ALFTD complements existing system- or algorithm-level fault tolerance by leveraging information available only at the application level.
Using such application-level semantic information significantly reduces the overall cost of providing fault tolerance.
ALFTD may be used alone or to supplement other fault detection schemes.
ALFTD is tunable: it lets users trade fault tolerance against computation overhead; allowing more overhead for ALFTD produces better results.
Principles of ALFTD
Every physical node runs its own work (P, primary) as well as a scaled-down copy of a neighboring node's work (S, secondary).
If a fault corrupts a process, the corresponding secondary of that task still produces output, albeit at a lower (but acceptable) quality.
Node 1: P1, S4
Node 2: P2, S1
Node 3: P3, S2
Node 4: P4, S3
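The ring layout above can be sketched in code. This is an illustrative sketch, not the poster authors' implementation; the function names and the failover lookup are assumptions.

```python
# Sketch of ALFTD's primary/secondary layout (illustrative, not the
# authors' code): node i runs its own primary task P_i plus a scaled-down
# secondary copy S_{i-1} of its neighbor's task, arranged in a ring, so a
# single node failure still yields (lower-quality) output for every task.

def assign_tasks(num_nodes):
    """Map each node to its (primary, secondary) task labels, 1-based."""
    layout = {}
    for i in range(1, num_nodes + 1):
        neighbor = i - 1 if i > 1 else num_nodes  # wrap around the ring
        layout[i] = (f"P{i}", f"S{neighbor}")
    return layout

def covering_node(layout, failed_node):
    """After failed_node dies, which node's secondary covers its task?"""
    for node, (_, secondary) in layout.items():
        if node != failed_node and secondary == f"S{failed_node}":
            return node
    return None

layout = assign_tasks(4)
# layout == {1: ("P1", "S4"), 2: ("P2", "S1"), 3: ("P3", "S2"), 4: ("P4", "S3")}
covering_node(layout, 1)   # → 2: node 2's secondary S1 covers node 1's work
```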
Fault Detection
Faults do not always completely disable a node; malformed and corrupted data are also possible.
Hardware-disabling faults are easy to detect with watchdog hardware and "I am alive" messages; faulty data is difficult to detect without application-level knowledge of the data.
Fault detection is a necessary condition for ALFTD to schedule which secondary tasks to run.
Secondary processes can provide verification for ambiguously faulty data.
Principles of ALFTD Filters
Faults are detected by passing results through one or more acceptance filters.
Filters are unique to applications with certain data characteristics: value-bound tests are applicable to most applications, while sanity checks require knowledge of the expected output value and format.
Figure: results from the primary pass through Filter 1 and Filter 2; data that passes all filters is accepted ("data is OK"), while failing results are placed on the secondary task queue.
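The acceptance-filter pipeline can be sketched as follows. This is a minimal sketch, assuming a simple filter-chain design; the filter thresholds below are made up for illustration.

```python
# Hedged sketch of the acceptance-filter pipeline: a primary result is
# passed through each filter in turn; if every filter accepts it the data
# is OK, otherwise the task is queued for its secondary.
from collections import deque

secondary_queue = deque()

def check_result(value, filters):
    """Return True if all filters accept; otherwise enqueue for the secondary."""
    if all(f(value) for f in filters):
        return True
    secondary_queue.append(value)
    return False

# Illustrative filters (thresholds are stand-ins, not calibrated values):
filter1 = lambda v: 250.0 < v < 350.0   # value-bound test
filter2 = lambda v: v == v              # trivial sanity check (rejects NaN)

check_result(300.0, [filter1, filter2])   # passes both filters → True
check_result(9999.0, [filter1, filter2])  # fails, routed to secondary queue
```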
OTIS Characteristics
ALFTD was applied to OTIS (Orbital Thermal Imaging Spectrometer), part of the REE application suite.
OTIS reads radiation values from various bands and calculates temperature data; the output can be viewed graphically or numerically.
OTIS lends itself to ALFTD because the output data (temperature) has:
Local Correlation: data changes gradually over an area.
Absolute Bounds: data falls within some expected realistic range.
ALFTD in OTIS
Local Correlation and Absolute Bounds on the data led to the creation of two data fault filters.
Spatial Locality Filter: if the difference between pixel (x, y) and (x-1, y) is greater than some threshold, the pixel may be the result of faulty data.
Absolute Bounds Filter: any pixel whose value falls outside an expected range (lower bound < value < upper bound) may be the result of faulty data.
The filter thresholds are set based on the sample datasets provided.
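The two filters can be sketched directly from their definitions. The thresholds and the sample image below are illustrative stand-ins, not the calibrated OTIS values.

```python
# Minimal sketch of the two OTIS data-fault filters described above.
# The delta and (lo, hi) thresholds here are illustrative, not calibrated.

def spatial_locality_faults(image, delta):
    """Flag pixels whose difference from the left neighbor (x-1, y) exceeds delta."""
    flagged = []
    for y, row in enumerate(image):
        for x in range(1, len(row)):
            if abs(row[x] - row[x - 1]) > delta:
                flagged.append((x, y))
    return flagged

def bounds_faults(image, lo, hi):
    """Flag pixels falling outside the expected temperature range (lo, hi)."""
    return [(x, y) for y, row in enumerate(image)
            for x, v in enumerate(row) if not (lo < v < hi)]

image = [[300.0, 301.0, 302.0],
         [300.5, 450.0, 301.5]]              # 450.0 is an injected fault
spatial_locality_faults(image, delta=10.0)   # → [(1, 1), (2, 1)]
bounds_faults(image, lo=250.0, hi=350.0)     # → [(1, 1)]
```

Note that the spatial locality filter flags both the faulty pixel and its right neighbor, since both adjacent differences exceed the threshold; the bounds filter pinpoints only the out-of-range pixel.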
OTIS Datasets
Figure: faulty and faultless versions of the "Blob", "Stripe", and "Spots" datasets.
OTIS Datasets with ALFTD
Figure: faulty and ALFTD-corrected versions of the "Blob", "Stripe", and "Spots" datasets.
Problem
ALFTD filters require calibration, and the calibration constants are context sensitive.
Filter values can be approximated, but detection efficiency improves with well-tuned filters.
Heuristics are created based on characteristics of the most frequent data.
Frequency Plots (bounds filter)
Frequency of temperature values
Frequency Plots (spatial locality filter)
Frequency of differences between adjacent pixels
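One way to turn such frequency plots into cutoffs is percentile-based: keep a chosen fraction of fault-free values inside the bounds. This is a sketch of that idea, assuming a symmetric tail cut; the coverage knob and sample data are illustrative.

```python
# Sketch of threshold calibration from a frequency distribution: given a
# fault-free sample dataset, place the bounds filter's cutoffs at the tails
# of the observed value distribution. The coverage fraction is a tunable
# knob, traded off against false alarms (tighter bounds = more alarms).

def calibrate_bounds(values, coverage=0.98):
    """Pick (lo, hi) cutoffs keeping roughly `coverage` of values inside."""
    ordered = sorted(values)
    tail = (1.0 - coverage) / 2.0
    lo_idx = int(tail * (len(ordered) - 1))
    hi_idx = int((1.0 - tail) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]

# Stand-in fault-free sample: temperatures 300..319, uniformly frequent
faultless = [300 + (i % 20) for i in range(200)]
calibrate_bounds(faultless, coverage=0.9)   # → (300, 318)
```

The same procedure applies to the spatial locality filter by feeding it the differences between adjacent pixels instead of raw values.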
Approach
To test the detection characteristics of a scheme, an erroneous case and a control case of the same data are needed.
Errors may produce different kinds and intensities of faults, so it is important to decide what sort of errors we want to detect.
In the case of OTIS, intensely faulty data (set-to-zero errors, memory gibberish) is easily detected, as it seldom falls inside the prescribed filters.
Our experiments include moderately faulty data: offsets in input values of up to 30%.
These faults tend to blend in with non-faulty data, making them especially hard to detect.
Approach
Filters can be adjusted in steps of increasing complexity.
A single filter has a high and a low cutoff; the "left" and "right" bounds of the data are usually exclusive, so their detections act cumulatively.
In each filter, a tradeoff must be resolved between the desired fault detection rate and the number of incurred false alarms.
Multiple filters are independently calibrated, and multiple filters will not necessarily detect different faults.
Many filters working at a low expected detection rate may detect as many or more faults for a system than a single filter working at a high expected detection rate.
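The cumulative effect of several filters can be sketched under an independence assumption: if filters catch largely disjoint fault populations, the combined detection rate is 1 minus the product of the individual miss rates. This is an idealized upper bound of my own construction, not the poster's measured numbers; real filters overlap, which is exactly why distinct detection domains matter.

```python
# Idealized combination of independent filters (an upper bound, since real
# filters partly detect the same faults): combined = 1 - prod(1 - d_i).

def combined_detection(rates):
    """Detection rate of several filters, assuming independent coverage."""
    miss = 1.0
    for d in rates:
        miss *= (1.0 - d)   # probability every filter misses the fault
    return 1.0 - miss

combined_detection([0.6, 0.6])   # 0.84: two weak filters can beat one at 0.8
```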
Detection Plots (single side)
Fault detections and false alarms on a left-sided filter
Detection Plots (both sides)
By overlaying the left and right filter plots, general detection traits can be observed
Fault Detections, Numerically
Bounds Filter Fault Detections (columns = left filter cutoff, rows = right filter cutoff)

       300     304     306     310     314     318
315   98.9%   99.1%   99.2%   99.2%   99.4%    -
317   96.6%   96.8%   96.8%   96.9%   97.1%    -
319   93.8%   93.9%   94.0%   94.0%   94.3%   98.5%
321   91.0%   91.1%   91.2%   91.3%   91.5%   95.7%
323   88.2%   88.3%   88.4%   88.4%   88.7%   92.9%
325   83.6%   83.7%   83.8%   83.9%   84.1%   88.3%
327   78.5%   78.7%   78.8%   78.8%   79.0%   83.3%
329   71.2%   71.4%   71.5%   71.5%   71.7%   76.0%
331   64.0%   64.2%   64.3%   64.3%   64.5%   68.8%
333   61.4%   61.5%   61.6%   61.7%   61.9%   66.1%
335   60.9%   61.0%   61.1%   61.2%   61.4%   65.6%
337   60.2%   60.4%   60.4%   60.5%   60.7%   64.9%
339   59.2%   59.4%   59.5%   59.5%   59.7%   64.0%
This table is used to find the possible configurations that satisfy a minimum fault detection rate.
False Alarms, Numerically
Bounds Filter False Alarms (columns = left filter cutoff, rows = right filter cutoff)

       300     304     306     310     314     318
315   92.4%   92.4%   92.4%   92.4%   96.3%    -
317   84.7%   84.7%   84.7%   84.7%   88.7%    -
319   78.6%   78.6%   78.6%   78.6%   82.5%   97.1%
321   72.5%   72.5%   72.5%   72.5%   76.5%   91.0%
323   64.8%   64.8%   64.8%   64.8%   68.7%   83.2%
325   54.1%   54.1%   54.1%   54.1%   58.1%   72.6%
327   41.2%   41.2%   41.2%   41.2%   45.2%   59.7%
329   23.9%   23.9%   23.9%   23.9%   27.8%   42.3%
331    5.0%    5.0%    5.0%    5.0%    9.0%   23.5%
333    0.0%    0.0%    0.0%    0.0%    4.0%   18.5%
335    0.0%    0.0%    0.0%    0.0%    3.9%   18.4%
337    0.0%    0.0%    0.0%    0.0%    3.9%   18.4%
339    0.0%    0.0%    0.0%    0.0%    3.9%   18.4%
From the configurations identified in the previous table, we choose the one with the fewest false alarms.
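The two-table selection procedure can be sketched as a small search. The entries below are a small excerpt of the bounds filter tables; the function name and the dictionary representation are my own.

```python
# Sketch of the selection procedure: among (left, right) cutoff pairs whose
# fault detection meets a minimum target, pick the pair with the fewest
# false alarms. Data is an excerpt of the bounds filter tables.

detection = {(300, 319): 0.938, (318, 319): 0.985,
             (300, 321): 0.910, (318, 321): 0.957,
             (300, 323): 0.882, (318, 323): 0.929}
false_alarm = {(300, 319): 0.786, (318, 319): 0.971,
               (300, 321): 0.725, (318, 321): 0.910,
               (300, 323): 0.648, (318, 323): 0.832}

def pick_config(detection, false_alarm, min_detection):
    """Return the cutoff pair meeting min_detection with minimal false alarms."""
    candidates = [c for c, d in detection.items() if d >= min_detection]
    return min(candidates, key=lambda c: false_alarm[c])

pick_config(detection, false_alarm, 0.90)   # → (300, 321), 72.5% false alarms
```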
Detection Plots (both sides, spatial locality filter)
By overlaying the left and right filter plots, general detection traits can be observed
Multiple Filters
Combined results when the bounds filter and spatial locality filter are used together (rows and columns give each filter's individual target detection rate):

Fault Detection:
        60.0%   70.0%   90.0%
40.0%   63.7%   71.9%   89.6%
50.0%   64.0%   72.1%   89.7%
60.0%   67.5%   72.7%   90.2%
70.0%   76.3%   80.1%   94.2%
80.0%   84.1%   87.4%   96.8%
90.0%   93.0%   94.3%   98.7%

False Alarms:
        60.0%   70.0%   90.0%
40.0%   15.7%   22.6%   76.0%
50.0%   15.7%   22.6%   76.0%
60.0%   15.7%   22.6%   76.0%
70.0%   36.3%   42.2%   84.2%
80.0%   59.9%   64.6%   90.5%
90.0%   77.1%   79.0%   94.5%

By combining multiple filters, fault detection is increased. To be effective, filters should have distinct fault detection domains.
Relation Between Datasets
"Blob" is an average dataset, but we also need to analyze the behavior of the other datasets.
"Stripe": any filter settings achieve the same false alarm and fault detection rates, within a few percent.
"Spots": this does not hold for the bounds filter.
It has an average temperature 10 K lower than the others, pushing it closer to the "faulty" region of the bounds filter.
We can relax the filter and accept the cut in efficiency, or predict when the "Spots" climate should be expected and use modified filters.
This is the downfall of using absolute, instead of differential, data as criteria for the filters.
Extensions to Other Applications
OTIS was a likely candidate for ALFTD due to the regularity of its data; natural phenomena tend to have regular and predictable behavior.
Other applications dealing with temperature, imaging (NGST), or even geological surveys could have success with these two basic filters.
These filter settings are only useful in environments similar to our sample datasets, but the method of calibrating filters is general enough to apply to other datasets and similar applications.
Extensions to Novel Datasets
Once a working set of filters is devised, it should be applicable to any dataset with the same characteristics.
Precalculated filter calibrations could be created to allow for higher fault detection in very specific, localized datasets.
General-purpose filters can also be extracted by running through many datasets, but they incur performance penalties.
Dynamic Filter Calibration
Approximate settings are possible, but these may perform poorly when encountering new data cases.
The application may need to reconfigure its filters for the new data.
This process could be automated, assuming the calibrating computer can obtain at least one control (fault-free) dataset.
Without prior exposure to these novel datasets, automated dynamic reconfiguration should be implemented as a numerically based decision process.
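One possible numerically based decision process is sketched below: measure the current filter's false-alarm rate on the control dataset, and refit the cutoffs only if it exceeds a budget. The function names and the 5% budget are illustrative assumptions, not the poster's design.

```python
# Sketch of automated dynamic recalibration: given one fault-free control
# dataset for the new environment, check the current bounds filter against
# it; if too many fault-free values are flagged, refit the cutoffs.
# The 5% false-alarm budget is an illustrative assumption.

def false_alarm_rate(control_values, lo, hi):
    """Fraction of known-good values the current bounds filter would flag."""
    flagged = sum(1 for v in control_values if not (lo <= v <= hi))
    return flagged / len(control_values)

def maybe_recalibrate(control_values, lo, hi, budget=0.05):
    """Return (lo, hi), refit from the control set if false alarms exceed budget."""
    if false_alarm_rate(control_values, lo, hi) <= budget:
        return lo, hi                                    # calibration still fits
    return min(control_values), max(control_values)      # refit to control range

# "Spots"-like control data: roughly 10 K colder than the calibration climate
control = [290 + i for i in range(10)]        # temperatures 290..299
maybe_recalibrate(control, lo=300, hi=340)    # → (290, 299)
```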
Conclusion
Filters are a critical part of ALFTD: the efficiency of the ALFTD method is contingent on having a successful method of fault detection.
Careful calibration of filters can greatly improve the fault detection capability of ALFTD.
Options for novel datasets:
General-purpose filter calibrations
Precalculated filter calibrations
Dynamic calibration
Thank you!
For further information, please contact:
Israel Koren ([email protected])
C.M. Krishna ([email protected])
Eric Ciocca ([email protected])