EEE499 -Real-Time Embedded System Design

42
EEE499 - Real-Time Embedded System Design Software Fault Tolerance

Transcript of EEE499 -Real-Time Embedded System Design

Page 1: EEE499 -Real-Time Embedded System Design

EEE499- Real-TimeEmbeddedSystemDesign

SoftwareFaultTolerance

Page 2: EEE499 -Real-Time Embedded System Design

Failures

• whenthebehaviorofasystemdeviatesfromthatofitsspecification.

• normallyassociatedwiththenotionthatsomethingwhichoncefunctionedproperly,nolongerdoesso.

• inthisclassical sense,softwaredoesnottrulyfail;ifthesoftwaredoesthewrongthing,itwillalwaysdothewrongthingunderidenticalcircumstances

Page 3: EEE499 -Real-Time Embedded System Design

Failures&Specifications

• traditionally(software)requirementsareviewedasapositive specificationofasystem

• anattempttodefinethefailuresofasystemcanbeviewedasanegative specificationofasystem

• whilebotharelikelyessential,thelatterisrarelyformallyidentifiedinprojects– systemengineerswilloftenneedthefailuredefinitions

Page 4: EEE499 -Real-Time Embedded System Design

Faults

• anunsatisfactoryconditionorstate– actions,timing,sequenceoramount

• randomfaults - inthecontextofabove,failuresarerandomfaultswhichoccurwhenacomponentbreaksinthefield.Randomfaultsonlyoccurinphysicalentities.

• systematicfaults - areintrinsicinthedesignorimplementationofacomponent.Errorsorsoftwaredefects(bugs)aresystematicfaults.

Page 5: EEE499 -Real-Time Embedded System Design

Faults(cont’d)

• hardware- faultscanberandomorsystematic– traditionalbath-tubcurvephenomenon

• software- faultsarealwayssystematic– softwareneitherbreaksnorwearsout– allsoftwarecontainslatentdefects(bugs)– estimates:• 20,000defectsperMillionLOC• 90%foundintesting• 200foundinearlylife• 1800latentdefects(0.18%defectrate)

Page 6: EEE499 -Real-Time Embedded System Design

FailureModes

• theclassificationofasystem’sfailuresaccordingtotheimpacttheyhaveontheservicesitdelivers.

• softwarefailuremodes:– valuedomain– timingdomain- early,omission,late– arbitrary(combinationofvalueandtiming)

Page 7: EEE499 -Real-Time Embedded System Design

FailureTypes

• singlepointfailures - thefailureofasystemduetoasinglecomponentfailure.– h/w: carbrakemastercylinder– s/w:?

• commoncausefailures - thefailureofmorethanonecomponentduetoacommoneventorcause.– h/w: powersupplyfailure– s/w:?

Page 8: EEE499 -Real-Time Embedded System Design

FaultPrevention

• faultavoidance - limittheintroductionoffaultycomponentsduringconstruction– rigorous(formal)specifications– provendesignmethodologies– properselectionofprogramminglanguage– engineering/developmentenvironments(IDEs)

• faultremoval - findingandremovingcausesoferror– reviews,inspections,testing,verification,validation

• butwhatiffaultscannotbeprevented?

Page 9: EEE499 -Real-Time Embedded System Design

FaultTolerance

• theabilityofasystemtocontinuefunctioningeveninthepresenceoffaults.– fullfaulttolerance - systemcontinuestooperatewithnosignificantlossoffunctionality(foralimitedtime)

– gracefuldegradation - systemcontinuestooperatebutinadegradedmode

– failsafe - systemtransitionstoasafestatepriortoshut-downorasaresultofaparticularfault

Page 10: EEE499 -Real-Time Embedded System Design

Redundancy

• inordertoachievefaulttoleranceinsoftwaresomedegreeofredundancy isrequired.– {Inthissense,thedefinitionofredundancyextendstoincludeadditionalcodethatwouldnotberequiredfornormaloperation}

• goal- minimizeredundancy,maximizeR(t)• paradox- toomuchmaydecreaseR(t)

Page 11: EEE499 -Real-Time Embedded System Design

StaticRedundancy

• redundantcomponentsareusedtomaskerrors• h/w:NModularRedundancy(NMR)

• s/w:N-versionprogramming– Nredundantversionsofthesamemoduleunderacommoncontrollerandavotingscheme

Page 12: EEE499 -Real-Time Embedded System Design

StaticRedundancy

• assumptions– complete,consistent,unambiguousspecification– independentfailuremodes• separatedevelopmentteams,processors,languages,fault-tolerantcommunicationlines,…

• issues(problems)– specificationsareamajorcontributortodefects– independenceisoftennotareasonableassumption

– $$$

Page 13: EEE499 -Real-Time Embedded System Design

DynamicRedundancy

• redundantcomponentsusedtodetecterrors• h/w:dataparitybitchecks

• s/w:recoveryblocks• phases– errordetection– damageassessment/control– errorrecovery– faulttreatment/continue

RP1

RP2

RP3

recovery points

Process P1

Page 14: EEE499 -Real-Time Embedded System Design

DynamicRedundancy

• assumptions– alternatemodulesonlyexecutebaseduponerrordetection

– stilllargelydependentuponthespecification– knownandunknownfailuremodesmay behandled

• issues(problems)– backwarderrorrecoverycan’talwaysundodamage– $$

Page 15: EEE499 -Real-Time Embedded System Design

AFewFaultTolerancePatterns

• HomogeneousRedundancyPattern– protectsagainsthardware(random)failuresonly

• DiverseRedundancyPattern– protectsagainstsystematic&randomfaults

• MonitorActuatorPattern– specializationofdiverseredundancy

• WatchdogPattern

Page 16: EEE499 -Real-Time Embedded System Design

HomogeneousRedundancyPattern

Page 17: EEE499 -Real-Time Embedded System Design

HomogeneousRedundancyPattern

Page 18: EEE499 -Real-Time Embedded System Design

DiverseRedundancyPattern

Page 19: EEE499 -Real-Time Embedded System Design

Monitor-ActuatorPattern

Page 20: EEE499 -Real-Time Embedded System Design

Monitor-ActuatorPattern

Page 21: EEE499 -Real-Time Embedded System Design

WatchdogPattern

Page 22: EEE499 -Real-Time Embedded System Design

WatchdogPattern

Can be hardware or software

Page 23: EEE499 -Real-Time Embedded System Design

Reliability

• reliability,R(t)- theprobabilitythat,whenoperatingunderstatedenvironmentalconditions,asystemwillperformitsintendedfunctionadequatelyforaspecifiedintervaloftime.

• ameasureofthesuccesswithwhichasystemconformstosomeauthoritativespecificationofitsbehavior

• mostfrequenthardwaremetric- MTBF– failurerateismoreuniversalinsoftware

Page 24: EEE499 -Real-Time Embedded System Design

Reliability&Time• asstatedinitsdefinition,reliabilityisevaluatedonatimeinterval– R(t=6hours)=95%

• intermsofsoftwarereliability,wemustdistinguishthreeformsoftime:– calendartime- standardelapsedtime– clocktime- thetimeaprocessorisactuallyrunningduringanycalendartime

– executiontime- thetimeanysoftwareprogramisexecutingonthegivenprocessor

Page 25: EEE499 -Real-Time Embedded System Design

Reliability&Environment

• again,asstatedinitsdefinition,reliabilityisafunctionoftheoperationalenvironment– morecommonlyusedisasystem’soperationalprofile

• anoperationalprofileofapieceofsoftwareisarepresentationofthevariousinputstatesandtheirrespectiveprobabilities– oftentheseprofilescanbefurtherdividedalongmodesofoperation

– forexampleanapplicationhasadifferentprofileduringstart-upmodethaninfulloperation

• testcases/testdataisselectedaccordingly

Page 26: EEE499 -Real-Time Embedded System Design

ReliabilityDefinition

• reliability,R(t)- theprobabilitythat,whenoperatingunderstatedenvironmentalconditions,asystemwillperformitsintendedfunctionforaspecifiedintervaloftime.

• reliabilityisevaluatedonatimeinterval– R(t=6hours)=95%

• thus,reliabilityisafunctionofsomefailurerate(intensity)andadesiredtimeinterval

Page 27: EEE499 -Real-Time Embedded System Design

UsesofReliability

• ameasureofdevelopmentstatus– givenareliabilitygoal(asetfailurerate),youmaydetermineremainingschedule,costs,tests,etcpriortoreleaseofthesoftware

• tomonitoroperationalperformance– isthereliabilitysuchthatmajormaintenanceisjustified

• providesinsightintothesoftwareproduct(quality)andthedevelopmentprocess

Page 28: EEE499 -Real-Time Embedded System Design

QuantifyingComponentReliability

• GivencomponentsfromaHomogeneousPoissonProcesswithafailureintensity(λ),thenreliabilitycanbeexpressedas:

R(T)=e- λTR

elia

bilit

y

Time

1.0reliability decreases exponentially with “mission” time

Page 29: EEE499 -Real-Time Embedded System Design

ComponentReliability- Example1

• Assumethatacapacitorhasbeentestedanddeterminedtohaveamean-time-between-failure(MTBF)of200hours,– whatistheprobabilitythatthecapacitorwillnotfailduring8hoursof(normal)useinthelab?

Solution:

λ = 1/ MTBF = 0.005 failures per hour

R(T=8 hours) = e -(0.005)(8)

= 0.961 or 96.1%

Page 30: EEE499 -Real-Time Embedded System Design

ComponentReliability- Example2

• Givenanunmannedspaceprobewitharequirementtooperatefailurefreefora25yearmissionwithaprobabilityof95%,– whatistherequiredsystemMTBF?

Solution:

λ = - ln (R(T)) / T

= -ln(0.95) / [(25)(365)(24)]

= 0.0000002 failures per hour

MTBF = 5,000,000 hours!

Page 31: EEE499 -Real-Time Embedded System Design

SoftwareFailureIntensity

• Asmostsoftwarereliabilityvaluesarestatedintermsofexecutiontime,weneedtobeabletoconvertthesevaluestoclocktime:– ifwedefine:• executiontimeasλτ• clocktimeasλt,

– then, λt=ρcλτ

– where, ρc representsaverageutilization

Page 32: EEE499 -Real-Time Embedded System Design

SoftwareFailureIntensity- Example

• Givenaperiodictaskwithanestimatedonehour(executiontime)reliabilityof99.5%.Inaddition,weknowthetaskrunsevery200msecwithanaveragecomputationtimeof2000µsec.– determinetheclock-time-basedfailureintensityλτ = - ln(R(τ=1))/τ = - ln(0.995)/1 = 0.005 f/hr

λt = ρc λτ = (0.01) (0.005) = 0.00005 f/hr

Page 33: EEE499 -Real-Time Embedded System Design

SystemReliability- SerialSystems

• Rsys =Π Ri forallicomponents

• Example– GivenRA =90%,RB =97.5%,RC =99.25%– Rsys =(.9)(.975)(.9925)=87.1%

*Assumes that all components are reliability-wise independent.

A B C

Page 34: EEE499 -Real-Time Embedded System Design

SerialSystems- Example

• Givenaspaceprobewith10,000identicalcomponentsanda25year95%reliabilityrequirement:– whatistherequiredcomponentfailurerate?

c1 c2 c10000•••

Solution:

RC = (Rsys)1/10000 = 0.999995

λc = -ln(.999995)/[(25)(365)(24)] = 2.28 E-11

Page 35: EEE499 -Real-Time Embedded System Design

• DefinetheprobabilityoffailureasQi=1- Ri,then

• Qsys =ΠQi

– itfollowsthatRsys =1– Qsys,or

• Rsys =1-Π (1- Ri)

SystemReliability- ParallelSystems

A

B

C

§ Example§ Given RA = 90%, RB = 97.5%, RC = 99.25%§ Rsys = 1 - (.1)(.025)(.0075) = 99.998%

Page 36: EEE499 -Real-Time Embedded System Design

• aspecialcaseoftheparallelsystemconfigurationwhereinitisrequiredthatonlykoutofnidenticalcomponentsareneededforsuccess

• Rsys =Σ [ ]Rci (1- Rc)n-i

SystemReliability- koutofn

A

A

Anii=k

n

§ where “n choose k”, [ ] = n! / (k!(n-k)!) nk

Page 37: EEE499 -Real-Time Embedded System Design

• Example– assumethatthereliabilityofc1 is90%andyouneedatleasttwoforthesystemtofunction

SystemReliability- koutofn

c1

c1

c1

solution:

Rsys = Σ [ ] Rci (1 - Rc)3-i

= [ ] Rc2 (1 - Rc) + [ ] Rc

3 (1 - Rc)0

= (3)(.81)(.1) + (1)(.729)(1)

= 97.2%

3ii=2

3

32

133

Page 38: EEE499 -Real-Time Embedded System Design

Exercise- SystemComponents• ConsideranEWsystemmadeupofthefollowing

components:– 2touchdisplays,onlyoneofwhichneedfunctionforthesystemto

function– eachtouchdisplaycontainsidentical software(meaningthatyou

shouldtreatthes/wasastand-alonemodule)– onesystemprocessorandthesystemintegrationsoftware– anECMwithembeddedjammersoftware

• Assumethatallhardwareutilizationis100%– inotherwordsallhardwarefunctionsfortheentiremissionduration

Page 39: EEE499 -Real-Time Embedded System Design

Exercise- ComponentData

Component Failure Rate(execution hours)

Utilization R(t=4)

TD - - .95

TD s/w 0.002 0.65 ?

SP - - .98

SP s/w .010 .95 ?

ECM - - .925

ECM s/w .012 .35 ?

Page 40: EEE499 -Real-Time Embedded System Design

Exercise- Questions

• a)Determinetheoverallsystemreliabilityfora4hourmission?

• b)Whatistheprobabilityofhavingatleastonefunctionaldisplayfora2hourmission?

• c)Whatistheweakestlinkinthesystem?Whatcouldbedonetoimprovetheoverallsystemreliability(assumingthatyoucannotsignificantlychangethecomponentreliabilitieswithoutseriousredesign).DrawthenewreliabilityblockdiagramandfindRsys

Page 41: EEE499 -Real-Time Embedded System Design

Exercise- Questions(continued)

• d)Assumethatwenowhave4equivalentECMsubsystems(combinedhardwareandsoftware)andthatforthesystemtobefunctionalany3ofthe4mustbefunctional.

• Drawthenewreliabilityblockdiagramandcalculatethesystemreliability.

Page 42: EEE499 -Real-Time Embedded System Design

References

[1]BurnsandWellings,“Real-TimeSystemsandProgrammingLanguages”,Chap5.

[2]Douglass,“DoingHardTime:DevelopingReal-TimeSystemswithUML,Objects,Frameworks,andPatterns”,Chap3.

[3]Musa,J.D.,Iannino,A.,Okumoto,K.“SoftwareReliability- Measurement,Prediction,Application”,Chapter4,McGraw-Hill1987.