Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct....

23
Oct. 2007 Terminology, Models, and Measures Slide 1 Fault-Tolerant Computing Basic Concepts and Tools

Transcript of Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct....

Page 1: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 1

Fault-Tolerant ComputingBasic Concepts and Tools

Page 2: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 2

About This Presentation

Edition Released Revised Revised

First Oct. 2006 Oct. 2007

This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. © Behrooz Parhami

Page 3: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 3

Terminology, Models, and Measures for Dependability

Page 4: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 4

Page 5: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 5

Impairments to Dependability

ERROR

Malfunction

Degradation

Failure

Fault

Intrusion

Hazard

Defect

Flaw

Bug

Crash

Page 6: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 6

The Fault-Error-Failure Cycle

Schematic diagram of the Newcastle hierarchical model and the impairments within one level.

Failure

Aspect ImpairmentStructure Fault

⇓ ⇓State Error

⇓ ⇓Behavior

Includes both components and design

0 0

0Fault Correct

signal

Replaced with NAND?

Page 7: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 7

The Four-Universe Model

Cause-effect diagram for Avižienis’ four-universe model of impairments to dependability.

Universe ImpairmentPhysical Failure

⇓ ⇓Logical Fault

⇓ ⇓Informational Error

⇓ ⇓External Crash

Page 8: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 8

Unrolling the Fault-Error-Failure Cycle

Cause-effect diagram for an extended six-level view of impairments to dependability.

Abstraction ImpairmentComponent Defect

⇓ ⇓Logic Fault

⇓ ⇓Information Error

⇓ ⇓System Malfunction

⇓ ⇓Service Degradation

⇓ ⇓Result Failure

Low- Level

Mid- Level

High- Level

First Cycle

Second Cycle

Failure

Aspect ImpairmentStructure Fault

⇓ ⇓State Error

⇓ ⇓Behavior

Page 9: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 9

Multilevel Model

Component

Logic

Service

Result

Information

System

Low-Level Impaired

Mid-Level Impaired

High-Level Impaired

Initial Entry

Deviation

Remedy

Legned:

Ideal

Defective

Faulty

Erroneous

Malfunctioning

Degraded

Failed

Legend:

Tolerance

Entry

Page 10: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 10

Analogy for the Multilevel Model

An analogy for our multi-level model of dependable computing.Defects, faults, errors, malfunctions, degradations, and failures are represented by pouring water from above. Valves represent avoidance and tolerance techniques. The goal is to avoid overflow.

Wall heights represent inter-level latencies

Drain valves represent tolerance techniques

Concentric reservoirs are analogs of the six model levels, with defect being innermost

IIIIII

I I I I I I

Inlet valves represent avoidance techniques

Page 11: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 11

Why Our Concern with Dependability?Reliability of n-transistor system, each having failure rate λ

R(t) = e–nλt

There are only 3 ways of making systems more reliable

Reduce λ

Reduce n

1.0

0.8

0.6

0.4

0.2

0.0

e–n tλ

.9999 .9990 .9900

.9048

.3679

1010 810 610 410 nt

Reduce t

Alternative:Change the reliability formula by introducing redundancy in system

Page 12: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 12

Highly Dependable Computer Systems

Long-life systems: Fail-slow, Rugged, High-reliabilitySpacecraft with multiyear missions, systems in inaccessible locationsMethods: Replication (spares), error coding, monitoring, shielding

Safety-critical systems: Fail-safe, Sound, High-integrityFlight control computers, nuclear-plant shutdown, medical monitoringMethods: Replication with voting, time redundancy, design diversity

Non-stop systems: Fail-soft, Robust, High-availabilityTelephone switching centers, transaction processing, e-commerceMethods: HW/info redundancy, backup schemes, hot-swap, recovery

Just as performance enhancement techniques gradually migrate from supercomputers to desktops, so too dependability enhancement methods find their way from exotic systems into personal computers

Page 13: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 13

Aspects of Dependability

RELIABILITY

Maintainability

Availability

Perform

ability

Security

Integrity

Serviceability

Testability

Safety

Robustness

Resilience

Reliability, MTTF = MTFF

Risk, consequence

Controllability,

observability

Perform

ability, M

CBFPointwise av., In

terval av.,

MTBF, MTTR

Page 14: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 14

Concepts from Probability Theory

Cumulative distribution function: CDFF(t) = prob[x ≤ t] = ∫0 f(x)dxt

Probability density function: pdff(t) = prob[t ≤ x ≤ t + dt] / dt = dF(t) / dt

Time0 10 20 30 40 50

Time0 10 20 30 40 50

Time0 10 20 30 40 50

1.00.80.60.4

0.20.0

CDF

pdf0.050.040.030.020.010.00

F(t)

f(t)

Expected value of xEx = ∫−∞ x f(x)dx = ∑k xk f(xk)

+∞

Covariance of x and yψx,y = E [(x – Ex)(y – Ey)]

= E [x y] – Ex Ey

Variance of xσx = ∫−∞ (x – Ex)2 f(x)dx

= ∑k (xk – Ex)2 f(xk)

+∞2

Lifetimes of 20 identical systems

Page 15: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 15

Some Simple Probability Distributions

CDF

pdf

F(x)

f(x)

1

Uniform Exponential Normal Binomial

CDF

pdf

CDF

pdf

CDF

Page 16: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 16

Reliability and MTTFReliability: R(t)Probability that system remains in the “Good” state through the interval [0, t]

Two-state nonrepairablesystem

R(t + dt) = R(t) [1 – z(t)dt]

Hazard function

Constant hazard function z(t) = λ ⇒ R(t) = e–λt

(system failure rate is independent of its age)

R(t) = 1 – F(t) CDF of the system lifetime, or its unreliability

Exponential reliability law

Mean time to failure: MTTFMTTF = ∫ 0 t f(t)dt = ∫0 R(t)dt

+∞ +∞

Expected value of lifetime

Area under the reliability curve(easily provable)

Start state FailureUp Down

Page 17: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 17

Failure Distributions of Interest

Exponential: z(t) = λR(t) = e–λt MTTF = 1/λ

Weibull: z(t) = αλ(λt) α–1

R(t) = e(−λt)α MTTF = (1/λ) Γ(1 + 1/α)

Erlang:MTTF = k/λ

Gamma:Erlang and exponential are special cases

Normal:Reliability and MTTF formulas are complicated

Rayleigh: z(t) = 2λ(λt)R(t) = e(−λt)2 MTTF = (1/λ) √π / 2

Discrete versionsGeometric

Binomial

Discrete Weibull

R(k) = q k

Page 18: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 18

Comparing Reliabilities

Reliability gain: R2 / R1

Reliability difference: R2 – R1

Reliability functionsfor Systems 1/2

Reliability improv. indexRII = log R1(tM) / log R2(tM)

1.0

0.0

Time (t)

R (t)

R (t)

1

2

MTTF1MTTF2t

R (t )1

R (t )2r

T (r )1 T (r )2M

G

G G

M

M

System Reliability (R)

Mission time extensionMTE2/1(rG) = T2(rG) – T1(rG)

Mission time improv. factor:MTIF2/1(rG) = T2(rG) / T1(rG)

Reliability improvement factorRIF2/1 = [1–R1(tM)] / [1–R2(tM)]Example:[1 – 0.9] / [1 – 0.99] = 10

Page 19: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 19

Availability, MTTR, and MTBF(Interval) Availability: A(t)Fraction of time that system is in the “Up” state during the interval [0, t]

Two-state repairable system

Availability = Reliability, when there is no repair

Availability is a function not only of how rarely a system fails (reliability) but also of how quickly it can be repaired (time to repair)

MTTF MTTF μMTTF + MTTR MTBF λ + μ

Pointwise availability: a(t)Probability that system available at time tA(t) = (1/t) ∫ 0 a(x)dxt

Steady-state availability: A = limt→∞ A(t)

A = = =Repair rate1/μ = MTTR(Will justify thisequation later)In general, μ >> λ, leading to A ≅ 1

RepairStart state

Failure

Up Down

Page 20: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 20

System Up and Down Times

Time

Up

Down0 t

Time to first failure Time between failuresRepair time

t1 t2t'1 t'2

Short repair time implies good maintainability (serviceability)

RepairStart state

Failure

Up Down

Page 21: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 21

Performability and MCBFPerformability: PComposite measure, incorporating both performance and reliability

P = 2pUp2 + pUp1

Simple exampleWorth of “Up2” twice that of “Up1”pUpi = probability system is in state Upit

Three-state degradable system

pUp2 = 0.92, pUp1 = 0.06, pDown = 0.02, P = 1.90 (system performance equiv. To that of 1.9 processors on average)

Performability improvement factor of this system (akin to RIF) relative to a fail-hard system that goes down when either processor fails:PIF = (2 – 2 × 0.92) / (2 – 1.90) = 1.6

Question:What is system availability here?

Repair Partial repairStart state

FailurePartial failure

Up 1 DownUp 2

Page 22: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 22

Time

Up

Down0 tt1 t2 t'2 t'1 t3 t'3

Partial Failure

Total Failure

Partial Repair

Partially Up

System Up, Partially Up, and Down Times

Important to prevent direct transitions to the “Down” state (coverage)

MCBF

Repair Partial repairStart state

FailurePartial failure

Up 1 DownUp 2

Page 23: Basic Concepts and Tools - UCSBparhami/pres_folder/f33-ft-computing...Basic Concepts and Tools Oct. 2007 Terminology, Models, and Measures Slide 2 About This Presentation Edition Released

Oct. 2007 Terminology, Models, and Measures Slide 23

Integrity and SafetyRisk: Prob. of being in “Unsafe Failed” stateThere may be multiple unsafe states, each with a different consequence (cost)

Simple analysisLump “Safe Failed” state with “Good”state; proceed as in reliability analysis

More detailed analysisEven though “Safe Failed” state is more desirable than “Unsafe Failed”, it is still not as desirable as the “Good” state; so keeping it separate makes sense

Three-state fail-safe system

For example, if a repair transition is introduced between “Safe Failed”and “Good” states, we can tackle questions such as the expected outage of the system in safe mode, and thus its availability

Safefailed

Unsafefailed

Failure

Failure

Start state

Good