Model-based diagnosis in SOPHIE III’ - PARC Based... · Model-based diagnosis in SOPHIE III ......

Model-based diagnosis in SOPHIE III’

Johan de Kleer and John Seely BrownXerox Palo Alto ResearchCenter

3333 Coyote Hill Road, Palo Alto CA 94304 USAEmail: dekleer, brown~parcxerox.com

Abstract

This paperdescribesthemodel-baseddiagnosticianused in SOPHIE III. This diagnosticianis capableof localizing faults in analog circuits such as DCpower supplies. Although it presumesthat the cir-cuit containsa single fault, it introducesa varietyof important model-basedtechniques: (1) it usesmodels of componentbehaviors to perform first-principles reasoningto detectconflicts andcorrobo-rationswith observations,(2) it considersa compo-nent to be faulted if its removal would eliminateallconflicts yet preserveall corroborations,(3) it usespropagationof constraintswith numerical ranges,(4) it usesfault modeinformation to eliminate com-ponentsfrom consideration,and (5) it providesasmoothintegrationwith rule-baseddiagnosticruleswhenfirst-principles reasoningis inadequateto per-form the diagnosis.

1 Introduction and history of theSOPHIE project

The researchdescribedin this papertook place over afive-yearperiod andcenteredaroundthreedifferent SO-PHIE systems(known asSOPHIE I, SOPHIE II andSOPHIE III). Work beganon what wasto becomeSO-PHIE I in early 1973 at the University of Californiaat Irvine and was completedin 1977 at Bolt Baranekand Newman in Massachusetts.SOPHIE was a largeproject whichproducedresultsin diversedomains,Thispaperfocuseson onesmall part of this project: the in-ferential machineryfor drawing diagnosticinferencesinSOPHIE III. The following is a brief history of the SO-PHIE project. For a fuller overview of the project see[1].

The Air Force had expressed in interest in usingcomputersin their advancedelectronic-troubleshootingcourse,particularly in the laboratory section; and we

This paperintegratesexcerptsfrom a numberof earlierpapers. Two of theseare: “Pedagogical,natural languageandknowledgeengineeringtechniquesin SOPHIEI, II andIII” by J.S. Brown, H. H. Burton and J. de Kleer which ap-pearedin Intelligent Tutoring Systemseditedby D. SleemanandJ.S.Brown publishedby AcademicPress(1982)[1]; “Lo-cal methodsof localizing faults in electroniccircuits,” by J.de IKleer published as M.I.T. Artificial IntelligenceLabora-tory Memo AIM-394 (1976)[2].

were interested in exploring interactive learning envi-ronmentswhich encouragedexplicit developmentof hy-pothesesduring fault solving by facilitating the com-munication of the student’s ideas to the machineandenablingthe machine to critique them. From this mo-tivation came the original SOPHIE I which providedthe student an electronic troubleshootingenvironment.The instructor can insert faults into the simulated cir-cuit (using SPICE [6]) and the student, through a nat-ural languageinterface,makemeasurementsto pinpointthe faulty component. The SOPHIE project focusedon troubleshootingDC power supplies(an exampleofwhich is illustrated in Figure 1). The student was notrestricted to a few predeterminedmeasurementpoints,but rather couldmakeanymeasuretnenthe/shewanted.A fundamentaladvantageof usinga simulatedover anactual circuit is that it makesit easyfor the instructorto insert any conceivablefault, not just thosethat areeasyto insert in a physicalcircuit, or thosewhich do notdestroythe entire circuit. SOPHIE I also includesex-tensivepedagogicalmachineryandcoursematerial withwhich we do not concernourselveshere.

In SOPHIE II we began to explore the idea of acomputer-basedtroubleshootingcoachwhich could ad-vise and guide the student’sactions with the objectiveof improving their troubleshootingskills. Of the manyideas explored, three bear on the current issue of di-agnosticinference we are focusing on here. First, weprovided a module which was capable of expert-leveltroubleshootingof any fault; thus the student could ob-serve an expert diagnosinga fault. Second, we pro-videdamodulewhichwould tell thestudent which faultscould explain the measurementsmadeso far. Finally,we provideda module which critiqued the student’sac-tual measurements.Although all of these capabilities,on the surface, appearto require substantialinferentialmachinery,they wereall achievedby very simple under-lying mechanisms.

The expert troubleshootingof SOPHIE II module isbasedon an underlyingdecisiontree. However, thede-cision tree is difficult to constructand must be modifiedfor eachnew circuit. The decisiontree typically cannotaccommodatethe measurementsthe studenthasalreadymade. It cannot tell the student what should be donenext, but only what an expert would havedone start-ing from the outset. Finally, the decision tree cannotprovide pedagogicallyuseful rationale for its decisions.

Q~tcUlr~t ContzolC~ntn,I Trnnpot

QurentRangeSwitch

Figure 1: Schematicof the IP-28

The fault identificationmoduleof SOPHIE II is moreinteresting. It is completely general, but is computa-tionally intractableand thereforeof limited use. It usesSPICE in a novel wayby addingall the observationstotheequationset beingsolved andsystematicallyleavingeachof the componentparametersunspecified. By em-ploying SPICE in a root-finding loop SOPHIE II canidentifywhat componentparametershifts would explainthe symptoms.For example,R3 hasa nominal value of1500 ohms. By iterating SPICE runs we might deter-mine that if R3 were50 ohms, then the resultswould beconsistentwith the observations.Thus, R3 is a possiblefault. On the other hand, if no resistancevalue for R3can be found that is consistent with the observations,then no fault in R3 alone can explain the symptoms.This approachhas two major disadvantages.First, thetechniqueis extremely slow (even for 1992 machines)anddoes not scale well — the student doesn’t want towait forever to havehis questionanswered.Second,thisapproachis incapableof explaining why a componentisor is not faulted in terms the student is familiar with.Thus, it is of limited pedagogicalvalue.

The measurementcritiquing module is built on thefault identification module. If a measurementdid notsignificantly reducethe numberof possiblefaults, thenthe measurementis a poor one (but see discussion inSection 2.4). This module suffered from similar prob-lems as the fault identification module. First, it is toocomputationallyintensive. Second,it is incapableof ex-plaining why a particular measurementis good or bad.Third, if the measurementwasbad, then it is incapableof proposinga better onefor the student to make.

1.1 SOPHIE IIIIn building SOPHIE III we decidedto adopt a com-pletely different approachto achievingthe required di-

agnosticinferences. Our approachwas to baseits in-ference techniqueson those that we observedexpertsand studentsusing. This allows SOPHIE III to pro-vide explanationsin terms the student is familiar with.By making SOPHIE III’s inferencingstrategiesmoreakin to those used by the student, we could begin todeterminewhich deductionsthestudent wasusing,con-struct a model of his abilities, then use this model togenerateexplanationsin familiar terms, For example,SOPHIE III might discover that the student repeat-edly mademeasurementsof both the current and thevoltage through resistors; this would be evidencethatthe studentdid not completelyunderstandOhm’s Law.

We also neededa fundamentallydifferent and morehuman-orientedinference schemebecausewe wantedto investigateusing SOPHIE III as a computer-basedconsultantfor on-the-jobtraining as an intelligent job-performanceaid. We thus wantedSOPHIE III to beable to work from measurementsbeing performed onreal, physical equipment, completely independentlyofany kind of circuit simulator.

It is important to note that this paper is not aboutpedagogyor cognitive science. Although we constantlyrefer to explanationandcritiquing capabilities,this doesnot at all suggest that the student should be immedi-ately critiqued or interrupted— that decisionis the jobof thecoach.What we want to provide is ageneralinfer-ential framework which is powerfulenoughto makethenecessaryinferencesandcritiques— someother modulebasedon someother theory makesthe ultimate deci-sions on whether the student should or should not beinterruptedandwhat should be said to thestudentwithrelatively few measurements.

The remainder of this paper describes the inferen-tial machinerynecessaryto achieve the diagnosticob-jectives. The inferential machineryshould:

NodeNwnbui

VoltageRangeSwitch

outpw

• determine what other voltagesand currents followfrom the measurementsand he able to explain theseconclusionsin termsfamiliar to the student.

• determinethe diagnosticconsequencesof a measure-ment (i.e., the component faults that explain thesymptoms)andto provide explanationsfor thesecon-clusions.

• givenwhatever themeasurementsthe studenthasal-readymade,rate the measurementsthat the studentcan now makeand provide explanationsfor theserat-ings. Notice that by repeatedlypicking the best mea-surementthesystemcan diagnosethe circuit withoutthe student.

• be ascircuit independentas possible. We don’t wantto facea largeknowledgeengineeringtask everytimewe want to troubleshoota new circuit.

• be efficient and be able to provide its results to thecoachin a timely manner.

Although the concepts apply to both AC and DCcircuits, at presentSOPHIE III only models DC be-havior. This is adequatefor modeling simple powersuppliessince they (except for switching regulators)canbe understoodalmost entirely from the quiescentpointof view. The circuits consist of resistors,diodes, zenerdiodes,capacitors,transistors,switches,potentiometersandDC voltagesources.

There is an important classof inferenceswe were notableto incorporatein SOPHIE III. In somesituations,it is more informativeto changethe front panelsettingsandthen to makea measurement,than it is to makean-other measurementunder the currentcontrol panelset-tings. A studentis alwaysfree to changethefront panelcontrolsof the IP-28. SOPHIE III’s inferential powersaresufficient to draw the conclusionsfrom the informa-tion garneredfrom variouscontrol settings. However,wedid not developgood inferentialtechniquesfor initiatingchangesto the front panelcontrols.

The typical knowledgeengineeredsystem hasthe ad-vantagethat it controlsthe initiative (or it knowsa pri-ori what datawill begiven to it) on what strategiesandactionsto take. The majorquestionfor suchsystemsiswhethersufficient knowledgehasbeencodified to handleall the casesthat arise, while thequestionfor SOPHIEIII is whetherasufficientamountof knowledgehasbeencodified to track any student. The student may makesomevery bizarre measurementsand SOPHIE III hasto be preparedto commenton them and make sugges-tions basedon the information obtainedby that poten-tial poor measurementeventhough it would havenevermadethat measurementif it hadthe initiative. Becausethe systemdoesnot havecontrol overwhat informationit must reasonwith, it needsa deep understandingofelectronicsandtroubleshooting.

1.2 The architecture of the electronicsexpert

Symptomatic circuit behavior is causedby somecom-ponent(s) failing to behaveas it was designedto. Thetask of troubleshootingis to identify the failing compo-nent(s). The difficulty is that we usually cannotdirectlyexaminecomponentsto see whether they are faulted,and instead must reason indirectly about their behav-iors. Troubleshooting proceeds by making measure-mentsin the faultedcircuit andgarneringasmuchinfor-mation as possiblefrom the results. The local propaga-tor (LOCAL) forms the basisfor the electronicsexpert.It uses general knowledge of circuit laws to determinewhat further voltagesand currents can be determinedfrom the measurementsthat havebeenmade. Becausethe local propagatorknows, for example,Ohm’s Law, itcandeducegiven thecurrent througha resistorwhat thevoltage across it must be. In order for a measurementto provide useful diagnosticinformation we must havesomeexpectationabout its value. If nothing is knownabout the value, then no information is gainedby themeasurement.

Considerthe Ohm’s Law exampleagain. Having mea-sured the current throughthe resistor,Ohm’s Law tellsus what the voltageacross it should be. Having madethis calculation it now makessomesenseto make themeasurementof the voltageacross the resistor. If thisvoltage is what we expected, then we can be reasonablyconfident that there is nothing wrong with the resistor.If this voltage is different than expected,then we knowfor certain that somethingis wrong with the resistor.SOPHIE III’s electronic expert is designedaroundthisfundamentalidea that confirmationsand refutationsofpredictions pinpoint the faulty components. LOCALconsists of two interacting modules: (1) a predictionmodule which predictswhat other circuit quantities areknown, given the observationsso far, and (2) an inter-pretation module which determines(using local propa-gation of constraints— hencethe nameLOCAL) of thediagnosticconsequencesof confirmationsor violationsofpredictions.

If somecomponentis not behavingas it should, thenthe prediction module will eventuallyencountera irrec-oncilablecontradictionsince its model of the circuit willdiffer from the actual faulty instancethat is being de-bugged. Every inference the prediction module makesimplicitly involves someassumption(the componentisbehavingasspecifiedby its manufacturer)abouta com-ponent or piece of circuit wiring. In troubleshooting,theseassumptionsmust be madeexplicit. For example,given the voltageacrossa resistor, the current throughit can be easilydeducedby assumingthat its resistanceis as specified—thatis not faulted. A contradiction isan informative event since it indicates which assump-tions are violated, narrowingthe field of possiblefaultycomponents.

The cost incurred by this elegantschemeis that theinferencemechanismof the prediction module must al-

ways supply accurateand exhaustivejustifications andassumptionsfor all of its deductions. A single excep-tion could renderthe troubleshooterimpotent for someclassesof faults. To some extent this sameorganiza-tional cleanlinessis also demandedfor generatingcoher-ent explanations,but the addedconstraint of this trou-bleshootingschemenow necessitatesthat the predictionmodule’sdeductionsbe void of any hidden presupposi-tions.

A major complication stemsfrom the fact that elec-tronics (and electronic troubleshooting) is a complexproblemdomain,part of which hasbeenformalized,partof which hasnot, especiallyin termsof the causalcal-culii tacitly usedby humanexperts. Becauseof thecom-plexity, we were faced with either restricting ourselvestoformalizedaspectsof thedomainor workingout aframe-work to systematicallyinclude the collection of ad hocrules and inference mechanismsneededfor the poorlyunderstoodpart of the domain. We chose the lattercourse,and our strategy for building SOPHIE III wasto encodeasmuch of the knowledge in the most generalform possible.

The circuit-specific rules bridge betweenthe predic-tion module and the interpretationmodule. For exam-ple, one rule is: “If the output voltage is low, then oneof ... must be faulted.” The explanationsof the circuit-specificknowledgearesuppliedby the electricalengineerandare attachedto eachrule. In contrast,explanationsfor deductionsmade from the genericelectroniclaws donot presumeany particular circuit andthus can be con-structed from the steps taken by the inference mecha-nism used to makethe deduction.

The circuit-specific knowledge is aimed directly atthose situations in which the general knowledge fails.Because of its known limited context, circuit-specificknowledgecan be put in a canonicalform, the numberof ad hoc rules can be minimized andthe determinationof whetherenoughcircuit-specific knowledgehas beenincludedto succeedis mademucheasier.

The circuit-specific knowledge is organized aroundastructural decompositionof the circuit. A circuit is adesignedartifact, consistingof acollectionof weakly in-teractingmodules.Eachmodule is, itself, a circuit in itsown right and can be likewise decomposed.The mod-ulesbehavecooperativelyto producethe behaviorof theoverallcircuit. The keyto the circuit-specificknowledgeis having terms to expressthe behavior of these mod-ules. The circuit-specific knowledgeconsists of rulesabout how the behaviorsof modules affect the behav-iors of other modules: Neighboringmodulesaffect eachother aswell a being influencedby the behaviorof theirlower-level, constituentmodules.

Our objective in designing and using the circuit-specific rule system was not to facilitate building rulebasesfor new circuits — although that was certainly asubgoal.Our overallobjectivewasto developathoroughenoughtheoryof diagnosticreasoningto makeLOCALpowerfulenoughto diagnosecircuits without using anycircuit-specific rules at all. Therefore, we added rules

only whenabsolutelynecessary.We obtaineda test suite of about 1000 troubleshoot-

ing scenariosof studentsusing SOPHIE land SOPHIEII. For eachfault and at eachstepof eachof thescenar-ios we usedthe brute force algorithm of SOPHIE II toidentify the faults which explained the symptoms. Us-ing this databaseas a gold-standard,we ensuredthatSOPHIE III wasableto achievethe same,perfect, di-agnosticprecision. For everyoneof the 50 rules requiredto performthenecessarydiagnosticinferenceson the IP-28, there exists a scenariofor which a fault mode willfail to be eliminatedwithout its presence.

Over the courseof the project the numberof circuit-specific rules both increasedand decreased. We con-stantly analyzedthe set of circuit-specific rules for pat-terns of inferencewhich could he supportedby extend-ing LOCAL. Whenever that occurred, the number ofcircuit-specific rules droppeddramatically. Other timeswe woulddiscoversomepathologicalmeasurementorderthat somestudent used which defeated LOCAL’s in-ferentialcapacity. Then additional circuit-specific ruleswould be added.

If the goal of SOPHIE III hadsolely beento build asystemto diagnosethe IP-28, then the ruleset could bereducedto about a dozen. The additional rules are re-quired to gleaninformation from sequencesof measure-mentsthat an optimal troubleshooterwould nevermake(and thereforenot neededin the core dozen). Thus, ifSOPHIE III were able to control which measurementsto make, then these rules would havebeenunnecessary.

In order to make it simpler to draw diagnosticinfer-ences,SOPHIE III makesa numberof presuppositions:

1. The circuit containsa single fault.

2. Faults only occurin components;not in circuit topol-ogy.

3. All componentfault modesare known.

4. The circuit is non-intermittent.

5. If a globalsymptom is observed(e.g., IP-28 output islower than its front panel controlsindicate), then thisis causedby somecomponentpresentlymanifestingasymptom.

6. Only DC behavior is important.

7. All faults are equally likely.

8. All measurementsare equally easyto make.

9. All fault modesareequally likely.

As SOPHIE III wasprimarily usedin a simulatedlab-oratorysetting, we could control the situation suchthatthesepresuppositionsare almost neverviolated. In realcircuits, thesepresuppositionsareeasilyviolated. Someof thesepresuppositionsare re-examinedin more detailin Section6.1.

Threetypesof reasoningare involved in the electron-ics expert eachwith its own knowledgestructure,com-plete with inferencemechanismand database(see Fig-ure 2). The propagationdatabasecontainsquantitative

Figure 3: A Current Limiter

voltage and currents (e.g., output voltage is 30 volts)operatingon by the propagatorusing the modelsof thecomponents. The qualitative databasecontainsasser-tions about the operating regions of the components,voltages,andcurrents(e.g., output current is low, tran-sistor Q5 is off). The databasefor the third knowledgestructure, the behaviortree, consistsof the possiblebe-havioralmodesof componentsandcircuit modules(e.g.,R5 is open,the current sourceis anemic).

The fundamentalproblem of intercommunicationbe-tweendifferent reasoningtypesis elegantlysolvedin SO-PHIE III with a commonlanguageof justifications andassumptions:Eachdeductionmadeby anyof the reason-ing typessimply recordsthereasonsfor andassumptionsunder which the deduction was made. This justifica-tion/assu.rnptiondatabase,just as in a general-purposetruth-maintenancesystem, can be oblivious to the dif-ferent kinds of reasoningthat underlie each deductionstep.

2 The local propagator

Thelocal propagatormayappearsimple, but thevarietyof subtle problemswith which it must grapplemake itquite complex. The biggest obstacle is the necessity forgenerality — it must deal with arbitrary measurementsin arbitrary circuits. But the profit we gain is great;because it is the only part of SOPHIE III that hasto reason upon the measurements and circuit topologydirectly.

The propagationscausedby asinglemeasurementcanbe quite deep. Consider the circuit describedby Fig-ure 3. Supposewe measuredthe current in R13 to be1 milliampere. Sincethe resistanceof R13 is set at 100ohms, the voltageacrossit must be 0.1 volts. But thevoltageacrossR13 is the sameas the voltageacrossthebase-emitterjunction of Q6. We know that a silicontransistor conducts no current when the base-emittervoltage is lessthan.6 volts so the current flowing in eachof the transistorterminals must be zero. Now considernode N9 (we precedeevery node by “N”, so the nodelabeled “9” on schematics is referred to as N9) which

connectsto R13, Q6 and R16. The current flowing outof R13 andQ6 must be flowing into resistor R16 so thecurrent flowing through R16 is 1 ma. As just shown,a single measurementmight lead to the calculation ofmany other circuit values. LOCAL formalizes this “IfA = x, then B = y” kind of reasoning.It keepsrecordsof eachdeductionso that explanationscanbe generatedandexaminedwhen troubleshooting.

Each componenthas an expert associatedwith itwhich constantlychecksto seewhether the currentsorvoltageson its terminalsare measuredor calculated;assoonasa value is discoveredfor one, it is usedto deducevoltagesandcurrentson the others. Since the terminalsandnodesaresharedwith other components,thesenewcurrentsand voltagesin turn trigger other componentexperts. The processcontinues until no new informa-tion can be determined.

Each propagation is recorded by constructing a jus-tification which lists the antecedentsand adetailed de-scription of the propagationitself. These descriptionsare kept in the following form:

(<type> <location> <reason> <assumptions>) = <value>.

<type> is either “v” or “I.” <location> (for location) isa pair of nodes for a voltage, a terminal for a current.The circuit consists of components whose terminals arejoined at nodes. Since terminals, unlike nodes, are al-ways attachedto componentswe adopt the conventionof labeling them by <terminal-type>/<component>.Cur-rents are normally associatedwith terminals, voltageswith nodes. In Figure 3, L/R13, R/R13, B/Q6, E/Q6,and C/Q6 are terminals, and N5 and N9 are nodes.<reason> describes how a componentexpert computedthe propagation and its form varies by component type.<assumptions> is a list of components which must beworking correctly to produce this propagation. This listis constructedby adding the component(if any) to theunion of the assumptionsof all the antecedentsto thepropagation.

The resistor, one of the simplest componenttypes,obeys Ohm’s Law. If LOCAL determinesthe voltage

Deductions InfetenceEngme

GeneralKnowledge 5 L/R13 R13 R/R13 9 R16 ii

Figure 2: ElectronicsExpert

c /Q6

acrossthe resistor from the current through it (via u =

iR), then the <reason>for the propagationis the pair(RESISTORI <resistor>). Conversely, if the current isdeterminedfrom the voltage (via i = ~),then the pairis (RESISTORV <resistor>).

Our example(Figure 3) proceedsas follows:(I R/13 (MEASURENENT) 0) = .001

The current in terminal R/Rl~3is measuredtohe 1 milliampere.

(V (115 119) (RESISTORI R13) (R13))=. 1The voltage across the resistor is determined

from the current through it by an instanceof Ohm’s Law (abbreviatedRESISTORI). Thispropagation assumesthe resistance of the re-sistor (set at 100 ohms) has not changedand thus the assumptionR13 is added to the<assumptions>.

The next propagation is (we discuss transistors inmore detail later):(I E/t~6(TRANOFF Q6) (qi R13)) = 0

The voltage across the base-emitterjunction ofQi indicatesit is off and therefore conductingno current.

The simplest propagationsare those involving Kir-choff voltageand current laws. Kirchoff’s current lawstates that the sum of all the currents flowing intoa node must be zero. Therefore, if all but one ofthe terminal currents of a component or node areknown, then the remaining onecan be deduced. There-fore, in our examplewe can now propagate(Kirchoff’svoltage law propagationsare given the <reason> (KCL<component—or—node>)):(I L/R9 (KCL N9) (R13 p6)) = —.001

The current in terminal L/R9 is determinedtobe -.001 milliamperes by using Kirchoff’s cur-rent law (abbreviatedKCL) to node N9.

(V (N9 Nil) (RESISTORI R16) (R16 R13 Q6)) = .000SKirchoff’s voltage law statesthat the sum of the volt-

age differences around a collection of nodes must bezero. This allows the deductionthat if two voltagesareknown relative to a commonpoint, the voltagebetweenthe other nodescan be computed. Kirchoff’s voltagelaw propagationsare given the <reason> (KVL <nodel><node2> <node3>). For example:

(V (MS Nil) (KVL MS N9 Nil) 0) = .1005

The voltage from node N5 to node Nil is cal-culated by Kirchoff’s voltage law (abbreviatedKVL) by summing the voltage from N5 to N9with the voltagefrom N9 to Nil.

LOCAL’s propagationmachineryis implementedbya collection of experts one for eachcomponenttype aswell asonesfor KCL and KVL. Every propagationstepwhich introducesno new assumptionsis executedimme-diately. Thus, KVL is immediately executedwheneveranynewvoltageis discovered.Likewise,KCL is immedi-atelyexecutedwheneverany new current is determined.The componentexperts are invoked in a breadth-firstmannerwith the use of a queue. Thus, the responseto

a new measurementis as follows. First, KCL or KVLis run on the new measurement.Then, the new mea-surementand any deductionswhich follow from KVLand KCL form the initial queue. At eachcycle of thepropagation, LOCAL removesa propagationfrom thequeueand invokesthe applicablecomponentexpertsonit. If any of theseexpertsproducenew currentsor volt-ages,then KCL and KVL areimmediately invokedandthe propagationsplacedon thequeue.This propagationcycle continuesuntil the queueis empty2

2.1 Component experts

LOCAL has an expert for each type of componentitmodels. Theseexperts are basedon the mathematicalelectrical engineeringmodels,but areimplementedwiththepropagationschemeoutlined in theprevioussection.Eachcomponentof a giventype must be modeledin thesame way. If, for example,one transistor in the cir-cuit were modeled differently from the others withoutthere being somesignificant physical difference in thetransistors,then the modelingwould havepresumedthefunctionality of the overallcircuit. Presumingthe func-tionality of a circuit is particularly dangerousfor a trou-bleshooting system becausea fault may so modify thecircuit as to significantly changeits functionality. Pre-supposingfunctionality would also bias LOCAL’s de-ductions to assumethe circuit functioned correctly. Ofcourse, different componentswill be connectedto dif-ferent nodesand thus respondto different voltagesandcurrents,but given thesedifferencesand individual dif-ferencesin componentparameters(e.g., the resistanceofa resistor) eachcomponentmust modeledby the sameprototypical expert.

Transistors, diodes, and zener diodes have multipleoperating regions. Eachregionis characterizedby a dis-tinct behaviorand the region usually needs to be identi-fied before any valuescan be propagated.Table 1 enu-meratesthe possiblecomponentregions. Note that thismodel of a transistor does not completely conform tothe traditional one which hasonly one SATURATEDmode, but it makes certain deductions more convenient.ON and OFF are determined from collector currentonly. SATURATED is determinedfrom the collector-base voltage only. While ON and OFF are exclusivestates,a transistor can be SATURATED and ON, orSATURATED andOFF.

The diode is the simplest kind of semiconductorde-vice. Its model is very simple: when it is reverse-biased(current can flow in only onedirection througha diode),the current through it must be zero:

2The actual implementationneedsto be careful to avoidduplicatepropagations. For example, any propagation pro-duced by a componentexpert should not immediately beapplied to the samecomponentas the result is guaranteedto be redundant. LOCAL enforcesa more general versionof this condition by not allowing anycomponentexpert torun if any of the triggering propagationsdependon thecom-ponentalready(this is easilycheckedthroughamembershiptest on the<assumptions>).

Component Thpe rating regionsDiode ON, OFFZenerDiode

~TransistorONON,

(at breakdown),OFFOFF, SATURATED-ON, SATURATED-OFF

Table 1: Componentoperatingregions.

(I D (DIODEV) (D)) = 0.

From the zener diode we know that if the currentthroughit is greaterthan somethreshold, then the volt-ageacrossit must be at its breakdownvoltage:

(V Z (ZENERI) (Z)) =

Conversely if the voltage across the zener diode isat less than its breakdown voltage, then the currentthrough the diode must be zero:

(I Z (ZENERV) (Z)) = 0.

The transistor is the most difficult component tomodel. This is both becauseit has the discontinuouscharacteristicsof a semiconductorcomponent,and be-causeit is a three-terminal component. If the currentthrough any one of the terminals is known, then thecurrent throughthe other two can be determinedusingthe gaincharacteristicsof thecomponent:

(I C/Qi (BETA Qi B/Qi) (Qi)) =

The collector currentof Ql is deducedby apply-ing its gain to the basecurrent (ic = l

3iB)

Furthermore,if the voltageacrossthebase-emitterjunc-tion is less than some threshold (.55 volts for silicontransistors),then the current flowing throughany of itsterminalsshould be zero:

(I C/Ql (TRAMOFF Qi) (Qi)) =

Considerthe fragmentof the IP-28 illustrated in Fig-ure 4. Supposevoltagemeasurementsat theoutput andacrossD5 havejust been made:

The following propagationsensue.For brevity, currentsin two-terminal componentsaresimplified to refer to thecomponentinsteadof the terminal,

(V (MiS N14)(V (N16 N14)

(MEASUREMENT) 0) =

(MEASURENENT) 0) =

(V (N16 MiS) (KVL Nl6 514 515) 0) =

(I R5 (RESISTORVRE) (R5)) =

(I D5 (ZENERV D5) (D5)) =

(I R4(V (N24(V (N24(V (N24(I D4

400

The voltage across thezenerD5 is lessthan itsbreakdown,thereforethecurrent through it mustbe zero.

(KCL N16) (RE D5)) =0.003N16)(RESISTORI R4) (R4 R5 D5)) =7.18M14)(KVL N24 N16 M14)(R4 R5 D5)) =41.18MiS) (KVL N24 Ni6 N1S)(R4 RE DE)) =11.18

(ZENERV D4) (D4 R4 RE D5))=0.

The voltage across zener is less than itsbreakdown.

Figure 4: ConstantVoltage Reference

(I R3 (KCL N24) (D4 R4 RE D5)) =0.003(V (M24 N2E)(RESISTORI R3) (R3 D4 R4 R5 D5))=4.90(V (l’125 M14)(KVL N25 M24 N14)(R3 D4 R4 RE DE))=46.l(V (N2E N16)(KVL N2E N24 N16)(R3 D4 R4 RE DEfl=12.0l(V (N2E MiS) (KVL M25 N24 Ml5)(R3 D4 R4 RE DE))=16.01.

More will be said concerning the component expertslater.

2.2 Coincidences,conflicts andcorroborations

In addition to simple propagation, two other activitiesare requiredto troubleshoot. Passivetroubleshootingin-terpretsthe results of the propagationsto draw conclu-sionsabout the correctnessof components.Active trou-bleshootingchoosesnew measurementsto make. Passivetroubleshooting knowledge is incorporated into LOCAL

.003 as an extension of the propagator.Active troubleshoot-ing is discussedin Section2.4.

The discovery of a value for a circuit quantity forwhich we already know a predicted, propagated value iscalled a coincidence. When the two propagations havethe same values, we call the coincidencea corrobora-tion; when they differ we call it a conflict. Coincidencesprovide information about theassumptionsmadein thepropagation: Corroborationsverify them and conflictsindicate at least one of them is in error.

Continuingthe exampleof the previous section, con-sider what we would learn if we measuredthe voltage

14

SOPU *~t.vet~g..Ni5

betweennodesN15 andN25 to he 20 volts insteadof thepredicted 16.01 volts. This conflict indicatesthat one ofthe assumptionsunderlying the prediction is violated:One of (R3 D4 R4 R5 D5) is faulted. (Sets of assump-tions underlying conflicts are called nogoods.) Supposewe insteadmeasuredthe voltage betweenN15 and N25to be the predicted 16,01 volts. In this casewe can beassuredthat R3, D4, R4, R5 and D5 are working cor-rectly. (Sets of assumptionsunderlying corroborationsare called goods.)

Although a componentis faulted, the overallsystemmay still be functioning correctly. (The fault may onlymanifest itself under certain load conditions, for exam-ple.) Thereforea corroborationprovidesno informationif the system is not manifesting a symptom under thespecifiedexternalconditions (e.g., load and control set-tings). To facilitate this deduction, LOCAL maintainsa specialflag which indicateswhetherthe circuit is vio-lating its overall function.

Any troubleshootingscenarioproducesa numberofconflicts and coincidences. Eachof theconflicts leadstoa nogood, and each of the coincidencesleads to a good.Under the presuppositionthat the circuit contains atmost one fault, each component not mentioned by anynogoodmust be unfaulted. This observationreducestheset of possibly faulty componentsto the intersectionofall the nogoods. In addition, any componentsappear-ing in any corroborationmust also be removed from thisfinal set. Equivalently, the set of possibly faulty com-ponentsis the intersectionof all the nogoodsand thecomplementsof all the goods. Intuitively, a componentis possibly faulted if its removal from the circuit wouldremoveeveryconflict while maintaining everycorrobo-ration. When there are multiple faults theseconceptsmust be generalized.Although a nogood indicatesthatat leastoneof its assumptionsis faulted, theother faultsmay lie outside the nogood and cannot be immediatelyverified. Corroborationsinvolving only one unverifiedcomponentalwaysguaranteethecomponentis ok. How-ever,corroborationsinvolving morethan onecomponentmay be causedby multiple faults cancelingeachother.

2.3 RepresentingpropagationsSo far we havebeendescribingLOCAL as if every cir-cuit quantity hasonly one propagationassociatedwithit. However, LOCAL maintains a set of propagationsfor each circuit quantity. These propagationsare dis-tinguishedby their value and their underlying assump-tions. LOCAL does not select out one of the valuesto propagatefurther, but propagatesall values. Twoconventions help control the proliferation of propagatedvalues. First, if the assumptionsunderlying a propaga-tion area supersetof a nogood,then that propagation’svalue is known to be invalid and is discarded. Second,iftwo propagationshaveidentical valuesand the assump-tionsof onepropagationis a propersubsetof the other,then the propagationwith the larger assumptionset isdiscarded.

One of the consequencesof theseobservationsis that

therecanhe relativelymanycoincidences.Supposesomemeasurementproducesa conflict with a long propaga-tion chain. After the conflict has been processed,themeasurementwill likely propagateto produce anotherconflict earlier in the long propagationchain. Theseco-incidencesbetweenpropagationsare usually redundantand produce no new information. However, thereare afew important circumstanceswhere they do, so they allhave to he processed. BecauseLOCAL does not usesymbolic algebra but only propagatesnumeric quanti-ties it is not logically complete. As a consequenceofthis incompleteness,a new measurementmay not pro-ducean initial coincidence,but propagationscausedbyit do (this is illustrated later in Table3).

In the worst case,if acircuit containsn possiblefaults,then eachcircuit quantitycan haveup to adistinct prop-agationsassociatedwith it. For example,if the only pos-sible faults are A, B, C and D, then a quantity couldhave4 propagationswith different valuesdependingonthe assumptionsets: {A, B, C}, {A, B, D}, {A, C, D}and {B, C, D}. The nogood produced by the union ofany two of thesesets is alwaysthe setof all faulty com-ponents, {A, B, C, D}, so provide no new information.Fortunately,this casedoesnot commonlyarisein prac-tice. Usually there are no more than two propagationsassociatedwith eachcircuit quantity.

The effect of conflicts and corroborations betweenpropagations is as follows. Supposecoincidingpropaga-tions A and B have underlying assumption sets PA andPB (e.g., thetwo propagationspredict thesamevalueforthe samecircuit quantity under differing assumptions).A conflict yields a nogood PA U PB. A corroborationyields the good PA U PB — [PA fl PB]. The intuition isas follows. A fault lying in the intersectionof the an-tecedentassumptionscould result in both propagationsbeingincorrect yetcorroboratoryso thesecannotbever-ified. Considerthe remaining assumptions. Supposethat somecomponentC E PA — PB is faulted. Prop-agation B must be correct becauseit doesnot dependon C. On the other hand asC contributessignificantlyto propagationA, thereforeits value must deviate sig-nificantly from the correct value. As the coincidencewas a corroborationandnot a conflict this cannot oc-cur. HenceC is unfaulted. Notice that the combinedeffect of a corroborationbetweentwo propagationsis toremovefrom either propagationany assumptionsnot incommon. For example,if {X, Y} underliespropagationA, and {Y, Z} underliescorroboratingpropagationB,then X and Y are eliminatedwith a result that bothA and B becomeessentiallyidenticaland dependon Yalone.

In the course of troubleshooting the student canchangethe front panelsettings. As we presumethecir-cuit is non-intermittentandthe fault doesn’tchange,thenogoodsand goodsobtainedfrom different control set-tingscan becombinedasusual. However, it is importantto keepthe propagationsdistinct. Therefore, for everyfront panelsettings, LOCAL createsadummy assump-tion which is addedto every measurementmadeunder

thosesettings. Coincidencesbetweenpropagationsob-tainedunderdifferent settingsare ignored, anddummyassumptionsare removedfrom all goods and nogoods.However,for simplicity, we do not discussthesedummyassumptionsfurther in this paper.

2.4 Active troubleshooting

The goalof troubleshootingis to removeasmanycom-ponentsfrom suspicionas efficiently as possible. Thus,the quality (score) of a potential measurementis theexpectednumber of componentsthat it would removefrom suspicion. The higher the score, the better themeasurement.LOCAL computesthesescoresfor everycircuit quantity having a propagatedvalue. Supposeacircuit quantity hasexactly one propagation. The ex-pectedvalue is the sum of two products: the numberofcomponentsthat would be verified by a corroborationtimes the probability of a corroboration,plus the num-ber of componentsthat would he verified by a conflicttimes the probability of a conflict.

In thefollowing analysiswe presumethat we havede-terminedthecircuit hasa symptomandthat the faultybehavior is due to some, as yet undetermined,singlefault. Supposethere are vi componentsunder suspi-cion and p unverified componentsunderlying a partic-ular propagation. We assumethat every componentisequally likely to fail. Therefore, theprobability that thepropagationis faulty is which will producea nogoodof size p therebyverifying the remaining n — p compo-nents. The probability that the propagafionis correctis 1 — which will produceagood of sizep. Combiningboth contributionswe find that the expectednumber ofcomponentsverified by the measurementis:

(n — p) + (1—?) x

2p(n — p)

vi

which simplifies to:

The best possible score is always half the number ofcomponentscurrently under suspicion.

Supposewe havemeasuredthe current and voltageof resistorsin seriesand determinedthat oneof the re-sistors is faulted. The familiar half-split methodwouldpropose a measurement midway within the resistor net-work. Ourscoringfunction, which is maximumat p =

proposesthe samemeasurement.Moreover,our scoringfunction applies even when the circuit hasa complextopology for which it is not obvious how to apply theconventionalhalf-split method. Notice that if there arealways measurementsavailable with p = ~, then thefault is alwaysisolatablein log~a measurements.

Consider the example of the previous section (Fig-ure 4). The following is an simplified exampleof SO-PHIE III troubleshootingthis circuit fragment. Sup-posethat theonly possiblefaultsareR3, D4, R4, R5 andD5 and that the actual fault is that D4 has its break-down voltagetoo low andthus is drawinga greatdealof

current. The possiblepoints to measureand their scoresare listed in Table 2.

The table shows the best measurementto make isone of (I R4), (V (N24 M16)), (V (M24 N14)) or (V (1124

1115)). The current through R4 is measuredand pro-duces a corroboration. Therefore R5 and D5 are ver-ified to be correct and one of R3, D4 or R4 must befaulted. This leaves5 interestingplacesto measure: (V

(1124 1116)), (V (M24 1114)), (V (N24 MiS)), (I D4)), or(I R3). (For brevity we do not show to revisedscores.)Supposewe measurethecurrent in D4. As D4 is shortedit drawsa great deal of current. Thus, this producesaconflict whose nogood is (R4 D4). Thus R3 is verified.

The two possiblefaultsarenow that R4 could be faultedhigh or D4 faulted low. Any of the measurements(V

(N24 N16)), (V (M24 M14)) or (V (N24 MiS)) isolate thefault to D4.

The objective is to localizethe fault asquickly aspos-sible on average. Notice that the student can make avery poor measurementwhich yields a lot of informa-tion. Consideran extremeexample. Supposethere are40 unverified components. Supposeone quantity hasa propagationwith 20 underlying assumptionsand an-other quantity has a propagationwith 1. Supposethestudentmakesthe secondmeasurementwhich turns outto be a conflict and thus instantly identifies the fault.The student should be critiqued for his poor measure-mentbecauseit wasveryimprobablethat theoutcomeofthis measurementwould producea conflict — far morelikely it would have produced a corroboration whichwould only haveeliminated one component. The stu-dent was just lucky, since the expectedvalue of thatmeasurementwas low. The expectednumberof compo-nentseliminated by the first measurementis 20, whilethe expectednumber of componentseliminated by thesecondmeasurementis 1.95.

The scoringfunction mayyield alargenumberof goodplacesto measure,manyof which are topologically dis-tant from the suspectcomponents.This occurs becausethere can be long propagationchainsthrough compo-nents which havebeen verified. Therefore, when SO-PHIE III is askedby the student to propose a nextmeasurement,it returns a measurementwith highestscorebut which also dependson the fewest numberofverified assumptions.The fewer the numberof verifiedcomponentsthe propagationdependson, the closer it istopologically to the suspectcomponentsand thereforemore intuitive to the student.

A circuit quantity may havemany propagationsasso-ciatedwith it. The more propagationsacircuit quantityhasthe better its scorecan be. Reconsiderthe previousexamplewherea quantity has4 conflicting propagationswith assumptionsets: {A, B, C}, {A, B, D}, {A, C, D}and {B, C, D}. Measuring this quantity will immedi-ately identify the faulty componentas there is a one-to-one correspondencebetweenfaulty componentsandvalues. In general, in order to determine the score ofa multi-propagationquantity, careful analysisof the as-sumptionsunderlying thepropagationsmust be madeto

Measurement ~nflictEliminates

CorroborationEliminates

Score

(V (1116 MiS)) impossible 0(I RS) R3,D4,R4,D5 R5 1.6(I OS) R3,D4,R4,R5 D5 1.6(I R4) R3,D4,R4 R5,D5 2.4(V (1124 Mi6)) R3,D4 R4,R5,D5 2.4(V (1124 1114)) R3,D4 R4,R5,D5 2.4(V (1124 MiS)) R3,D4 R4,R5,D5 2.4(I D4) D4,R4,R5,D5 R3 1.6(I R3) D4,R4,R5,D5 R3 1.6(V (1124 1125)) impossible 0(V (1125 1114)) impossible 0(V (1125 1116)) impossible J(V (1125 1115)) impossible 0

Table 2: Proposed measurements

determinethedegreeto which the propagationsare in-dependent.However,thereare manyothereffectswhichinfluencethescoreof a tentativemeasurement(seeSec-tion 4.2).

2.5 Primary and secondaryassumptionsThis strategyof exploiting conflicts and corroborationsto localize the faulty component forms the essential coreof LOCAL’s troubleshootingstrategy. Unfortunately, inpractice, we have to make an additional caveat as thisidealistic framework is inadequate.

The circuit may be exhibiting a symptom, but theway the faulted component causing the symptom is usedin a particular propagationmay not be significant. Acomponentcontributessignificantly to a propagationifa fault in the componentwould produce a value suffi-ciently different from the one propagatedthat it wouldbedetected.Therefore, LOCAL partitionsthe setof as-sumptionsas~ociatedwith eachpropagationinto a pri-mary and a secondaryset. Primary assumptionscon-tribute significantly, while secondaryassumptionsmayor may not contributesignificantly. Thus, when a cor-roboration occurs only the componentsnamedby theprimary assumptionsare eliminated. The componentsof the secondaryassumptionscannot be eliminated be-causewe cannot guaranteethe value is sensitive to afault in them. Therefore, LOCAL mustensurethat as-sumptionsare primary only if they significantly impactthe propagation’svalue.

Unless overridden by some exception, every com-ponent expert’s assumption is primary and the sec-ondary(primary)assumptionsof thepropagationincludethe union of all the secondary(primary)assumptionsofthe antecedents.Someof the exceptionsto this generalrule are:

• When a large quantity is added to a small quantitythe smaller quantity may vary significantly withoutaffecting the sum. The primary assumptionsof thesmallerquantity aresecondaryin the sum.

• When a quantity is multiplied by zero, the assump-tions of the non-zeroquantity becomesecondary.

• In a propagationwhich uses the same componenttwice, two effectsmay canceleachother out. There-fore, whenever two antecedentssharea commonas-sumption, this assumptionis madesecondaryfor theensuingpropagation.

• The assumptionsunderlying a comparisonmust be-comesecondary.If the base-emittervoltageof a tran-sistor is below 0.55 volts, then the collector currentis 0 amps. If the collector current is measuredto be0, this doesnot guaranteethat the base-emittervolt-age was correct becauseit could be wildly incorrectbut just less than 0.55 volts. As we needto err onthe conservativeside, the assumptionsmust thereforebecomesecondary.

In all casesthe primary and secondaryassumptionscan-not overlap and all secondaryassumptionsare alwaysremovedfrom the primary set.

Consider a coincidencebetweena measurementandpropagation. If the coincidenceis a conflict, then thefault must lie in the unionof the primary andsecondaryassumptions(the nogoodset). If thecoincidenceis acor-roboration,then all the primary assumptionsareelimi-nated(the good set).

The general caseis as follows. Supposethe primaryandsecondaryassumptionsunderlying the two coincid-ing propagationsare PA, PB, SA and SB’ If the co-incidence is a conflict, then we have a new nogood:PAUPBUSA U SB. If thecoincidenceis acorroboration,then wehaveanewgood:PAUPB—[SAUSBU(PAflPB)].The reasoningis analogousto that of the previoussec-tions, except that secondaryassumptionsshould neverbe verified by acorroboration. As aconsequenceof thenewly discoveredgood, eachof the corroboratingprop-agationswill propagateonwardwith their primary non-intersectingassumptionseliminated (just as before inSectionsection:Propagations).

It is important to control the proliferation of propa-gations as the introduction of the distinction betweenprimary and secondaryassumptionscausesadditionalpropagations.Considera corroborationin which PA =

{X}, SA = {Y, Z}, PB = {Z}, SB = {X, Y}. With-out the distinction betweenprimary and secondaryas-sumptionsoneof thesepropagationswould be discarded.Now, both must bepropagated.To control theprolifera-tion of propagations,LOCAL removesall propagationssubsumedby others. For example, if PA C PB U SBandSA C SB, then propagationB is discardedbecauselittle additional diagnostic information can be obtainedby propagatingit further.

3 Ranges

All measurementsin the circuit andall circuit param-eters havesomedegreeof error. The errors in circuitparametersoriginate from the fact that mar. ~facturerscannotmakeperfect componentsand insteadguaranteetheir specifications to within certain tolerances. (Typi-cally the resistanceof a resistor is only within 10 % ofits specifiedvalue.) Similarly, themeterusedto measurecircuit quantitiescan only measurevoltagesandcurrentsto certain accuracies(typically 2 % error). A furtherlimitation is that the meterhasa minimum range;SO-PHIE’s meter cannotaccurately measurebelow .1 voltor below 1 ~iA. These effects, though artificially intro-duced into SOPHIE, are representativeof what is en-counteredin realcircuits. Becausethenumericalcompu-tations performed by the componentexperts introducetruncation and roundoff errors, it makesmore sensetopropagateeither valuesand their tolerances,or rangesofvalues.Consequently,all thebasic arithmeticoperationsperformedby LOCAL are modified to accommodatethetolerancesassociatedwith eachcircuit quantity.

LOCAL representseachquantity by a rangeindicat-ing its extremevalues(which may be +co or —co). Aquantity Q is described by [QL,QH] which indicatesQL ~ Q ~ QH. No probability distributions are main-tained. Consideran example. If we know the currentthrough a resistor is between2.9 amperesand 3.1 am-peres(I = [2.9, 3.1]), andits resistanceis 100 ohmswitha 10% tolerance(R = [90, 110]), then Ohm’s Law tellsus that the voltageacross the resistor must be between261 and 310 volts (V = IR = [2.9, 3.1] x [90, 110] =

[261,341]).SupposeRu of the IP-28hasa higher resistancethan

specified. The resulting propagationtrace is illustratedin Table 3. Each propagationincludestwo lists of as-sumptions;the first aretheprimary assumptionsandthesecondthesecondaryassumptions.This propagationse-quence illustrates two additional interestingproperties.First, acoincidencearisesbetweentwo propagationsnei-ther of which is a measurement.Second,an assumptionbecomessecondarybecausea small rangeis addedto alarge one. Careful analysisof the trace illustrates thatthis propagationwould be muchsimpler if we addedin-formation about maximum basecurrents. We discuss

this issue next.

By using rangenotation, a great deal of additionalknowledge can be included in the componentmodels.For example,knowledgethat thecurrent throughadiodeis always positive can be expressedby the range[0, co].A silicon transistor in the ON state must havea base-emitter voltage of [.55 , 1]. Componentshave certainmaximum limits which cannotbe exceededwithout de-struction. For example,a current of 2 amperesthroughdiode D6 well exceedsits rating, indicating that either itis shorted,or the 2 amperepropagationis incorrect andoneof the underlyingassumptionused in its derivationhasbeen violated. Supposethe base-emitterof a two-transistor darlington pair (such as Q3 and Q4 of theIP-28) measuresgreaterthan 2 volts. Neither of the in-dividual base-emitterdropsare known. If the databaseis initialized with a [—co, .7] drop for eachbase-emitterjunction the 2 volt measurementwould conflict with the[—co 1.4] volt propagationand the problem will havebeen localized to the darlington. Before any measure-ment is ever made, LOCAL propagatesall minimumsand maximums. Thus, when a circuit quantity is dis-coveredto exceedsomelimit, a conflict is triggered.

Rangesalso admit a third kind of innovation. Some-times we would like a component’model to propagatea range with different explanationsfor eachend of therange. This can be accomplishedas follows. Insteadofpropagating{a,bJ, we propagatetwo ranges:[—oo, b] and[a, +co] eachwith its own justification.

If all voltage and current sources in a DC circuit areknown, then absolutelimits for all voltagesandcurrentscan also be determinedfrom basic principles. For ex-ample, supposethe current through a 2 MfZ resistor ispropagatedto be .5 ma. The resistormodel propagatesthe voltageacross it as 1000 volts by Ohm’s Law. Noproblemis detectedsincethe resistorcanoperateat 1000volts and the .5 watts that it is dissipating falls withinits rating. If this resistor were part of a low-voltagepower-supplysuch as the IP-28, then therecould be nopossibility that any voltage inside of it was 1000 volts.If this measurementturned up either the resistorwouldbe shortedor the propagationof the .5 ma would beincorrect.

Primary and secondaryassumptionsplay the samerole as before. The basic intuition is that an assump-tion p is primary to a propagatedvalue with range v,if everyfault in p causesthe actual circuit value to beoutside of v. As we see in a moment, this distinctionis important becauseif a later measurementis withinv, then we can be guaranteedthat p cannotbe faulted.Thus, an assumptionis primary if I > ~ where w~

is the width of v’s rangeandw~,is the toleranceof thecomponent. As this condition is difficult to guarantee,LOCAL always errs to the conservativeside, makingassumptionssecondaryunlessit can be definitely deter-mined that they are primary.

Table 3: Propagationtrace for Ru high. The conflicting propagationsare highlighted.

CONFLICT A:B:

CORROBORATION A:B: [

A SPLITS B A:B:

BSPLITSA A:B:

INCOMPARABLE A:B:

Figure 5: RangeComparisons

3.1 Consequencesof ranges oncoincidences

A comparisonbetweenrangesA and B canhaveoneoffive outcomes(seeFigure 5): (1) valuesconflict, (2) val-uescorroborate,(3) A splits B, (4) B splits A, and (5)no comparisonpossible. The comparisonprocedurefirstattemptsto determinewhether the rangescorroborate.A tolerancefor the comparisonis computedby choosingthe minimum width if the widths arevery different andchoosinghalfthe width if the widths areapproximatelythe same. If the width of either range is co, then thecoincidencecannotbe a corroboration. Depending onthecircuit andwhetherthe coincidenceis betweenvolt-agesor currents, a minimum toleranceis chosen. Theminimum tolerancefor a typical circuit is .1 ~.iAand .1volts. Then the differences betweenthe correspondingendsof the rangesare determined.If both differ withinthe tolerance,then the valuesaredeterminedto be cor-roboratory. For example,[.1 , .2] volts and[.15 , .3] voltsarejudgedto be corroboratory.If only onesideis withintolerancethe toleranceis relaxed by 50% andthe failing

side is checkedagain. This ad hoc rule worked well for

all our examples.

If this still doesnot match,then we cannotreally claima corroboration; insteadwe can only saythat onevaluesplits the other. For example, [0 , 1] splits [0 , 10].The two remainingcasesoccurwhen the valuesarecom-pletely disjoint (e.g., [0 , 1] and [3, 4]) and when onecontainsthe other (e.g., [0 , 6] and[3 , 4]). The contain-ment caseis treated as a split. Rangesare considereddisjoint only if they differ by greaterthan the tolerance.If noneof theseconditionsaremet, thenthe coincidenceis neithera corroborationnora conflict. For example,[0

.1] volts and[.2 , .3] neithercontradictnorcorroborate.

Conflicts andcorroborationsproducethe samegoodsand nogoodsas in the non-rangecase. We can alsodraw useful conclusions from the splitting case. LetPA, PB, SA,SB the primary and secondaryassumptionsof A and B. If A splits B, then we havethe new goodPB — [PA U 5

A]~ The rationaleis as follows. If the rangefor A splits the rangefor B and thereforeare subsets,then if A is valid (i.e., must include the actual circuitvalue), then B must be valid, For example,since A:[3

4] splits B:[0 , 10], the validity of A implies the valid-ity of B. Considersomepossibly faultedcomponentof

~B’ By definition, this will causetheactual valueof Bto be outsideof its current range,andof necessity,con-flict with A’s current range. If C doesn’t contribute A,then by the single-faultassumptionA’s rangeis correct.Therefore,anyfailure in C would havecausedaconflictand not a split and thereforeC cannot be faulted. No-tice that the notion of corroborationis now redundant— a corroborationof A with B as the sameeffect assimultaneouslysplitting A with B and B with A.

When the coinciding valuesare incomparable,LO-CAL synthesizesa new propagationwhich combinesthem. The range of the synthesizedpropagation isthe intersectionof the two ranges(if the intersectionis empty, the new propagationis not constructed).Theprimary assumptionsof the syntesizedpropagationaretheunionof theprimary assumptionsof theantecedents.The secondaryassumptionsof the synthesizedpropa-gation are the union of the secondaryassumptionsoftheantecedents,with the primary assumptionsremoved.This synthesizedpropagationis usually propagatesfur-

Location Reason Primary SecondaryAssumptions Assumptions

(I C/Q2 (MEASURENENT) 0 0) = [.00017 .00019](I B/Q2 (BETA Q2 C/Q2) (Q2) 0) = [i.1E—6 3.8E-6](I E/Q2 (BETA Q2 C/Q2) (Q2) 0) = [—.00019 —.00017](V (112 G) (MEASUREMENT) 0 0) = [45 49](I R9 (RESISTORV R9) (R9) 0) = [.012 .017]

—>(I C/Qi (KCL 112) (R9) (Q2)) = [.012 .017](I B/Q1 (BETA Qi C/Q1) (Qi R9) (Q2)) = [8.1E—S 33E—S](I E/Q1 (BETA Qi C/Qi) (Qi R9) (Q2)) = [—.017 -.012](I Rh (KCL M3) (Qi R9) (Q2)) = [—.00015 .00011](V (Ml 113) (RESISTORI Ru) (Qi R9 Ru) (Q2)) = [—.26 .i8]

—>(I C/Qi (TRANOFF Qi) (Ru Qi R9) (Q2)) = [—i.E—6 4.OE—5]

-co +00

[ I[1

[1

[ I

[ I

ther than its antecedentsas it hasasmaller width.As ranges allow one to express far more information

than before, thereare now typically manypropagationsper circuit quantity. LOCAL incorporates a variety oftactics to reduce the number of propagations. Propaga-tions with very wide ranges are arbitrarily stopped. New

values which are only marginally better (only marginallynarrower ranges) than old ones are ignored since theyprobably are the result of iteration. Kirchoff’s VoltageLaw produces so many propagations and voltages be-tweennodesof no interest that all KVL deductionsareperformedby a separatemodule which in onestep pro-ducesthebest valuefor all interestingvoltages. Finally,subsumedpropagationsare discarded. Propagation~4subsumespropagationB if the primary and secondaryassumptionssubsumeeach other as in the non-rangecase,and if the rangeof propagationA corroboratesorsplits the rangeof propagationB. Subsumedpropaga-tionsare removedafter all the fault information hasbeenextractedfrom the corroboration.

3.2 The model for a diode

Although every componentof a given type behavesinthe same basic way, the fine details of its behavior arecontrolled by its parameters.Theseparametersare de-terminedby the manufacturer,not by the component’suse in the circuit. For example, every resistor obeysOhm’s Law, but with varying resistance.Eachof LO-CAL’s componentmodels is complicatedand lengthyand there is no point in giving every one of them inthis paper. Instead,we only describethe diodeas it isthe simplestnon-linear device. The transistor model, incontrast, is dramaticallymore complicated.

The model for the diode is specified by eight param-eters, each parameteris listed with its value for D6 ofthe IP-28:

‘MIN, ‘MAX, VMIN, VMAX are all limits beyondwhichD6 would be destroyed. ‘OFF, ION, VOFF, VON helpspecify the operatingregionsof the diode.

The maximum limits are propagatedbeforeany mea-surementis ever made. Thus, LOCAL immediatelypropagates[IMIN,IMAX] with secondary assumptionD6. If the current is propagated(or measured)to lieoutside this range, then a conflict is triggered. For ex-ample, this allows LOCAL to dealwith diodesin series.If thevoltageacrosstwo diodesis measuredto be greaterthan2 VMAX, for example,LOCAL immediatelydetectsthat one of the two diodes is open. Note that a corrob-oration can never verify D6, both becauseit is only asecondaryassumptionof the initial propagationandthe

ON~

1off’on

SOP~UE-4iode~

Figure 7: Diode Current Parameters

rangewidths are too wide.If a newly discoveredvoltage is lessthan thethreshold

required to turn on thediode, then the diodecannotbeconductingmuchcurrent;on the other hand, if the newvoltage indicates the diode is on, then the diode mustbe conductinga significant amount of current (seeFig-ure 6). As the maximum andminimum limits imposedby the diodeare propagatedbeforeany measurementistaken, we do not include them in thesemodels:

A new voltageV = [VL, Vg] causesthe propagation: IfVH < VOFF, then propagatethe rangeI = [—~,ION],otherwise if VL > VdFF, then propagatethe rangeI = [ION, +00].

(Note that we could not evenexpressthis rule withoutranges.)

The propagationswhich result from a new currentareanalogous(seeFigure 7):

A new current I = [IL, IH] causes the propaga-tion: If ‘H ~ ‘OFF, then propagatethe rangeV =

[—co, VOFF], otherwiseif IL � ION, then propagatethe rangeV = [VON, +co].

This modelhasanumberof shortcomings.It could bemore accurate. For example,in the active (on) region,although the voltage-current relationship is roughly lin-ear, the model only propagatesthe region’sextremes.

3.3 Comparison to other propagationschemes

LOCAL is similar to StallmanandSussman’sEL whichis also basedon propagationof constraints[8]. LOCALdiffers from EL in three fundamentalways. First, sincethe task is troubleshootingwe needto makeassumptionsaboutthe faultednessof components,not just aboutop-erating regions of components. Henceour nogood sets(which representinconsistentsets of assumptions)in-clude assumptionsaboutfaultedness.Second,sinceval-uesare known only approximately, the quantitiesprop-agatedare ranges insteadof numbers. Third, since the

Voltage I-50 0.3 0.45 0.8

OFF ONVfl,~ ~ff Von V~

SOPHIE-diodr-volt2ge

Figure 6: Diode Voltage Parameters

-l~A 111A211A IACurrent

‘MIN : maximum reversecurrent flow, —1 ~A.‘MAX : maximum forward current flow, 1 A.VMIN : minimum voltageacross the diode, —50 VVMAX : maximum voltage acrossthe diode, .8 V.

‘OFF : definesthe diodeOFF state, 1 ~A.ION : defines thediode ON state,2 ~A.VOFF : definesthe diode OFF state, .3 V.VON : definesthe diodeON state, .45 V.

quantities are ranges, two quantities can split or overlap;one of them can be “better” or “worse” than another;and like EL they can be equal and unequal. One of theconsequences of incomparable (e.g., overlapping)rangesand the possibility of differing underlying assumptionsfor the same circuit value is that LOCAL must oftenpropagate multiple values for the same circuit quantitywhich leads to substantial complexity. For instance, thekind of assumption made about the faultednessof a com-ponent dependson the rangesof the quantities beingpropagatedthrough the component. We discussthis inmore detail in the following sections. These general-izationsprecludedLOCAL from propagatingvariables(known asanonymousobjects in EL).

4 Fault modes

So far we have placed almost no restrictions on how

componentscan fail. Nevertheless,the frameworkof co-incidences,conflicts and corroborationsprovidesa verygeneralframework for eliminating componentsfrom sus-picion. However, physical structure dictates that eachcomponentcanonly fail in a small numberof character-istic ways. Knowing thesefault modesgreatly increasesLOCAL’s diagnostic power. For example, if none ofthe possiblefault modesof a componentare consistentwith the observations,then that componentcannotbefaulted. This principle is at the heart of this section.

The fault modesfor componentsare enumeratedinTable4. The transistorhasso manypossiblefault modesthat it is impossible to choosesuccinct labels for each.The modesare listed in the form EB/EC/BC whereeachpair refers to a transistor “junction” which is oneofthreemodes: (1) OK, the junction is working correctly,(2) SH, thejunction is shorted,or (3) OP, the junctionis open. The possibilities are explainedin Table 5.

Eachfault modecanbe describedby adistinct model.We couldintroduceanassumptionfor eachpossiblefaultmodeand generalizeour theoryof conflict and corrob-orationsextract fault modeinformation. However, thisis computationallyinfeasible. Instead,LOCAL usesex-plicit fault models, but exploits a set of new inferentialmechanismsto exploit them. Thus, theassumptionsetsunderlyingpropagationsremainexactlyas theywerebe-fore. We presumethat if a componentis faulted, thenit remainsin the samefault modethroughoutthe entiretroubleshootingscenario.

LOCAL associatesa list of possiblefault modeswitheachcomponent.(In orderto be ableto later explain itsdeductions,LOCAL also recordsthe reasonsfor eachin-ferenceit makesaboutfault modes.) If all of thepossiblemodesof acomponentareeliminated,then that compo-nent is consideredverified (i.e., a singleton good). Onthe other hand, if a componentis definitely determinedto be in a particular fault mode,then that componentisthe fault (i.e., a singletonnogood). Knowledgeof faultmodesusually doesnot improve the goodsor nogoodsproducedby any individual coincidence. Rather, infor-mation gleanedfrom combinationsof coincidencesnow

combinenon-linearly. For example,oneconflict may in-dicate that if the resistor is faulted, then it must behigh or open while another indicatesthat if the resistoris faulted, then it must be low or shorted. The com-bination of these two conflicts indicate the resistor isunfaulted.

LOCAL incorporates three fault mode detectionstrategies: teleology, modeconsistency,and qualitativepropagation. We describeeach in turn.Teleology. If a particular component is faulted and theoverallcircuit is manifestinga symptom,then the faultycomponentmust be manifestinga symptom. It followsthat, if a componentis not displaying a symptomofa known fault mode, then it isn’t in that fault mode.For example, if a diode is open and causing a circuitsymptom, then the voltage across it should be eithervery high or very low (far below zero) — if the voltageisn’t, then the diodecan’t be faulted. Notice that thisdepends critically on knowing the purpose of the circuit;after all, any open diodewith no voltageacrossit wouldbe verified by this rule. Thus, teleologicalinferencesareenabled only if the flag (see Section 2.2) which indicatesthe circuit is malfunctioning is set.Mode consistency. Fault modes whose behavior isinconsistent with any propagationsis eliminated. Forinstance, if there is any voltageacrossa diode, then itcannot be shorted (since if it wereshorted the voltagewould be zero). The fault mode is eliminated regard-lessof whetherthe propagationhasassumptions.If thepropagationis incorrect, then oneof its underlyingcom-ponentsmust be faulted in which case the componentin question isn’t faulted. Therefore it is valid to elim-inate the fault mode independentof the correctnessofthepropagation(unlessof coursethecomponentitself isan assumptionof the propagationin whichcasenothingcan be deducedabout the fault modes).Qualitative propagation. This strategyexaminestheconflicting propagationsto identify what fault modescouldhavecausedthe particular conflict. Supposethatthroughalong propagationchain,we determinethe cur-rent througharesistoranduseOhm’sLaw to predict thevoltageacross it. A subsequentmeasurementindicatesthis predictedvoltage is too low. If the resistor is trulyfaulted, then its resistancemust be too high or must beopen. On the other hand, if it is not faulted, then thecurrent supplied to it must havebeentoo low; the pro-cessthen recursesexaminingthe componentthat wasused to deducethat high current. (In either case, theresistorcannotbe in the fault modesshortedor low.)

The justification structure produced by LOCALrecords how eachcomponentcontributes to the prop-agation.This information is used to determinethe faultmodeof the component,or which erroneousinput prop-agations(if any) could causethe observedsymptoms.Considerthe exampleof a high voltagededucedby ap-plying Ohm’s Law. The resistorexpert recordsthejusti-fication of the voltagewith an annotation(RESISTORI)that it used a current to deduce a voltage by Ohm’sLaw. The qualitativepropagationrule is: “All current-

Type Fault ModesRESISTOR open,shorted,high or low j

CAPACITOR shortedor leakyDIODE openor shorted IZENER DIODE breakdownhigh or breakdownlow

TRANSISTOR~

ok/op/op, op/op/ok, op/ok/op, op/op/op,sh/op/op sh/sh/sh, beta-low or beta-high.

op/sh/op, op/op/sh,~

Table 4: Table of common fault modes.

Mode j Explanationok/op/op collector is open,but BE junction is functional.op/op/ok emitter is open,but BC junction is functional.op/ok/op baseis open,but EC junction is functional.op/op/op I all terminals open.op/sh/op baseis open,and the emitter is shortedto the collector.

sh/op/op collector is open, and the baseis shortedto theemitter.op/op/sh emitter is open, andthe collector is shortedto the base.sh/sh/sh all terminalsareshortedwith eachother.

Table5: Transistorfault modes.

to-voltagepropagations,producedby a resistor,that aretoo high indicate either that the resistor is shorted orlow, or that the resistor’sinput current is too high.”

The qualitative propagator, instead of propagatingranges,throughthe circuit topology, propagatesthe to-kens “high” and “low” in reversedirection through thejustification of the problematic propagation. The to-kensrefer to whetherthe propagationis strictly higheror lower than wasactually measured.The potential dif-ficulty is that, if a componentoccursmore than once, itis not easyto tell whichcontribution dominates.To ac-count for this, LOCAL performs the entire qualitativepropagationas one unit, identifying which fault modesin which componentscould have causedthe observedsymptomsandlocalizing the fault to the unionof these.Thus, if thesameresistor contributestwice to the sameconflict, one explainedby high or open and the otherexplainedby low or shorted, no deduction about thatresistoris madeother than it is undersuspicion.

The qualitativepropagator is invoked for every con-flict, whether it is betweentwo propagationsor amea-surementanda propagation. For example,if the rangeof propagationA lies entirely below the rangeof propa-gation B, then it propagates“high” down B and “low”down A. This will collect all fault modeswhich couldremovethe conflict by movingA up or B down. In theusual case, one of A or B is a measurement,so it isonly necessaryto propagate “high” or “low” down onepropagation.

4.1 Diode fault modes

For thesakeof brevity, we only analyzethe fault modesof a diode in detail. The following teleologicalrule forthe diodeis basedon thepresuppositionthat thecircuit

-50 0 0.8

Voltage

Open~, Openmis max

-l~A IQurent j L

Shorted 1 1 ShortedI_

Figure 8: Diode Ratings

containsonly onefault andthat it is currentlymanifest-ing a symptom. If the voltage acrossthe diode is lessthan somemaximum, then the diode cannot be opensince it cannot be causinga symptom. If the currentthrough the diode is less thanenoughto warrant it be-ing consideredon, then it cannotbe shortedsince if itwereshortedandcausinga symptom, then it would beconductinga significantamountof current:

If VH < VMAX, then the diode cannotbe open. If‘H ~ ‘ON, then the diode cannotbe shorted.

The modeconsistencyrule for a diode is as follows.If the voltageacrossa diode is greaterthan zero, thenit cannot be shortedsince the definition of shorted isthat the voltageacross the diode is zero. Similarly ifthe current througha diodeis greaterthan zero, then itcannotbe open:

rlf VL > .1, then thediode cannotbe shorted. If IL ~‘OFF, then thediode cannotbe open.

When the valuepropagatedby the diode leads to alater conflict, the fact that the propagationwas high or

low can be used to determine the fault modes in whichthe diodecould be. The diode model can make six dif-ferentpropagations:(1) from a new voltagededucethatthediode is on andpropagatethecorrespondingcurrent,(2) from a new voltage,deducethat the diode is off andpropagate the corresponding current, (3) from a newcurrent, deducethat the diode is off and propagatethecorresponding voltage, (4) from a new current, deducethat the diode is on and propagatethe correspondingvoltage, (5) addextrememaximumand minimum volt-agesto the propagationdatabasebefore any measure-mentsare performed,(6) add extremeinitial maximumandminimum currents.

For brevity we discussonly the rulesfor cases(1), (5)and (6). (These qualitative propagationrules will pre-sumeall the precedingdioderules.) If the current prop-agatedby the diode model is high (i.e., the predictedis higher than the measured),then the diode must beshortedsincethe current flowing through it must be lessthan IMIN. No erroneousvalue of the voltage propa-gatedinto thediodecan causethis symptomsincethereis no valid regionof operationin which the diodecurrentis deducedto be less than IMIN (IMIN is negative,andrepresentsthe maximum reversecurrent flow).

On the other hand, if the propagatedcurrent flowingthroughthediode is low (i.e., the predictedis lower thanthemeasured),thediodemust be shorted,or thevoltageacrossthe diode is high (indicating that the diode is offrather than it being on).

For the propagationV * I for diode off: If I = high,then thediodemustbe shorted,and thesymptomcan-not be causedby V. If I = low, then the diodemustbe shorted, or V = high.

The diode being open never arisessince a voltage lowenoughto indicate the diode is off will also trigger thebehavioraldeduction that the diode cannotbe open.

The notion of state is crucial in understandingthebehaviorof the diode. For example, the propagationrule “If T/H < VOFF, then propagatethe range I =

[—co, ION]” is better thought of as “If Vif < VOFF, thenthediodemust beoff, andif the diodeis off, then thecur-rent flowing throughit must be I = [—co, ION].” There-fore thestateof the diode is explicitly recorded(with itsunderlyingassumptions,of course)andthe record is re-ferred to in explanations.

Cases (5) and (6) are much simpler.

If I = ‘MIN =high, then the diode must be shorted.If I = ‘MAX =low, then the diode must be shorted.If V = VMAX low, then the diode must be open. IfV = VMIN =high, then the diodemust be open.

4.2 Active troubleshooting extendedThe simple scoring procedure for proposed measure-mentsof Section 2.4 needsto be considerablyextendedto accommodatethe additional inferential complexitiesintroduced into LOCAL. The actual scoring procedureused expandsthe simple approachto account for thefollowing effects:

• There may be multiple propagationsfor each circuitquantity.

• Propagations can have secondary assumptions. These

assumptionscontribute to a conflict but not to a cor-roboration.

• Coincidencescan no longerbe simply divided into con-flicts andcorroborations(see Figure 5). Somehypo-thetical coincidencesturn out to be neither corrob-orations nor conflicts, thereby making the scoresofhypothetical measurementshigher than they shouldbe. The wider the rangeof the propagatedvalue, theless probable an interesting coincidence will occur.

• A measurement may eliminate some but not all of thefault modesof a component.

• Due to incompleteness,important coincidencesmayoccur only amongpropagationscausedby hypotheti-cal measurements.LOCAL cannotcalculatetheseef-fects,making thescoreof certainmeasurementslowerthan they should be.

Theactualscoring procedureusedby LOCAL attemptsto accommodateall theseaffects. However, the proce-dure is ad hoc andis not describedhere.

5 Circuit-specific knowledge

The first measurement,one even a neophytedebuggerwould perform, is to measurethe output. This is alwaysthe single most informative measurement,but LOCALcan draw no diagnosticconclusions from it. LOCALcan rarely draw any diagnostic inferencesfrom the sec-ond measurementaswell as it is usually too far from theoutput to produceany coincidences.Only after thestu-dent hasmadea numberof measurementshaveenoughvaluesbeenpropagatedto produce someuseful coinci-dences.

One approach to improving LOCAL’s diagnosticpowerswould be to usemoreprecisecomponentmodelscoupledwith more powerfulconstraintpropagationtech-niques which could solve simultaneousequationsoverranges. We decidednot to adopt this approachfor anumberof reasons:

• The requiredresearchtask is very difficult. Even un-der thesimplification that circuit valuesaresingleval-uedandcomponentmodelsarelinear, troubleshootingis still difficult. Under this simplification asystemlikeEL could do the propagation,but initially everycoin-cidencewould dependon everycomponentsinceeverycomponentwill in someway (usuallysmall) influenceevery value, thus making most coincidences (and cer-tainly all conflicts) uninteresting.

• The resulting algorithms would more than likely becomputationally intractable.

• LOCAL’s reasoningwould diverge so far from a hu-man’sthat it would no longerbeableto produceusefulexplanationsfor its deductions.

So how do technicians troubleshoot successfully?Technicians and engineers seemto draw extensivelyoncausal,qualitativeandteleologicalreasoningwhen trou-bleshooting circuits. They exploit an intuitive “un-derstandinghow the circuit works” to troubleshoot it.These forms of reasoningare exactly what are missingfrom LOCAL andcontribute to its poor performanceinthe early phasesof troubleshooting. LOCAL employsa myopic view of the circuit andthus fails to utilize theglobal mechanism through which the circuit functions.Its myopic view is appropriatewhen the fault hasbeenisolatedto somemodule or group of components;it isentirely inappropriatefor making initial measurementswhennot enoughpropagationshavebeenmade.

Unfortunately, in 1977, the stateof knowledgeaboutcausaland teleological reasoningwas not deepenoughto permit their inclusion in SOPHIE III (evenin 1992the field has not made enough advancesto do with-out circuit-specific knowledge). Our approachwas, in-stead, to augmentLOCAL by a simple rule-basedsys-tem whichdraws the missing conclusionsabout compo-nent faults. This rule-basedsystem, unlike the propa-gator, is circuit specific and must be changedfor everynew circuit SOPHIE III encounters.

The ensuingloss of generalityis unfortunate.In orderto mitigate this loss of generality we impose two prin-ciples when we encode this circuit-specific knowledge.First, weimposeastrict discipline uponits form, makingaddition, modificationandexplanationof circuit-specificknowledgemore stylized and less error-prone. Second,the propagatorand its underlying mechanismsfor han-dling assumptionsand fault modesare usedas a wayof simplifying thedeductionsthat the rule-basedsystemneedsto make. Considera simple example,supposewehadthe rule “if output voltage is low, then the regulatoris shorted.” Supposewe measurethe loadcurrent, andLOCAL propagatesit through the load to determinethat the output voltage is low. This triggers the regu-lator shortedrule underthe assumptionthat the loadiscorrect. Thereforewe havelearnedthat either regulatoris shortedor the load is faulted. Utilizing LOCAL weneedonly one rule to handleboth of thesecases. Be-causeof LOCAL’s propagatingpower we needonly afew of the thousandsof rules we would otherwiseneedto have.

In addition the fact that circuit-specific rules lose usgenerality,we alsolosesomecapacityto makeintelligentexplanations since the rules are produced by a personandnot by LOCAL. Fortunately, thereare four effectsthat mitigate this loss. First, the rulesare very simple,andthus any particular troubleshootingdeductionusesasequenceof them, thus admitting a kind of deductiveargumentbut with prestoredexplanationfor eachstepin the sequence.Second,the rigid discipline on the formof the rules makes it much easier to impose a coher-ent structure on them. Third, LOCAL’s propagatingpower reducesthe required rule set to an essentialcore.Fourth, the contribution LOCAL makesto eachdeduc-tion is easilyexplainedthroughits existing explanation

facilities.Although a circuit is constructed from components, it

is more easilyunderstoodasconsistingof a small num-

ber of interactingmoduleswhich in turn, consistof othermodules.The hierarchyof modulesof a circuit are rep-resentedby a tree, the root of which is the circuit itself,thenodesof which areits modulesandsubmodules,andthe leavesof which are its components. Figure 9 illus-trates the modulesof the IP-28.

5.1 The qualitative-quantitative interface

LOCAL continuouslychecksfor any propagationwhichaffects a key setof voltagesand currents (called watch-points). When a value is found for one of these watch-points, it maps the quantitative range to a qualita-tive valueandassertsthe qualitativevalue in a distinctdatabase. This assertionis recordedalong with the as-sumptionsunderlying the triggering propagation. Thiscorrespondenceis specified by a translationtable. Con-sider the watch-pointat theoutput of the IP-28 voltagereference(the voltagebetweennodesNil andN 14). Thetranslationtable is (high� 31.0> eq> 27.0> low). Thenumbersin the table delineatethe different qualitativeranges. If the voltage is above 31.0 volts, then (V (Nil1114) high) is addedto the qualitativedatabase. If thevoltage lies between31 and 27, then (V (Nil Ni4) eq)is asserted.The voltage [30 , 32] overlapshigh and eq,so this voltageresultsin the assertion(V (Nil 1114) (ORhigh aq)). The 12 watchpointsandtheir translationta-bles for the IP-28 aregiven by Table 6.

It is important to distinguish high and low used inthis teleologicalsense,to the “high” and “low” used inthe qualitative propagator. In the teleological sense highmeansthe valueis higher than it should becomparedtoa working (i.e., unfaulted) circuit; while “high” in thepropagatormeansthe valuewas higher than predicted,basedon previously made measurementsin the same(i.e., faulted) circuit which may already be higher thanit should be in a working circuit.

Semiconductorcomponentscan havemanyoperatingregions or states,eachmodeledby a distinctly differentcharacteristicbehavior. WheneverLOCAL determinesanoperatingregionof acomponent,it addsan assertionof the form (STATE <component> <state>) to the qualita-tive database. The possible states were summarized inTable 1.

5.2 Module modesJust like components, modules can have behavioralmodes. The module mode describes both the physicalcondition of the module (faulted or valid) and its be-havior. A module may havemultiple valid andfaultedmodes. In writing down the rules one needsto be veryclearwhat the modesactually mean,otherwisethe ruleset may become inconsistent and SOPHIE III will en-counteran irreconcilable contradiction(i.e., manifestedasan empty nogood). Modules typically havemanyter-minals, thereforewe adopt the conventionthat module

IP-28

Figure 9: Hierarchicaldecompositionof IP-28 modules. SeeAppendix for a descriptionof the modules.

Watchpoint Translation tableB/Q4 (normal> -.00055 � on)C/Q2 (high > .0008 > normal � .00056� low)L/R8 (high > 1.3 > heavy> .8 > light)(Ni GROUND) (high> 52.0 > ok > 45.0> low> 30.0> verylow)(Nil GROUND) Function of front panel settings(Nil N14) Function of front panel settingsN15 N16) (high > 10.0 > normal � 1.0 ~ low)Ni6 N14) (high > 37.0� eq > 35.0 � low)

(N21 Ni) (high > 2.6 � normal)(N23 N14) (high > 85.0 > normal > 60.0 > low> 1.0 � zero > 0.0)(N24 N14) (high > 58.0 > eq > 52.0 > low)(N4 GROUND) (high > 45.0 > normal)

T1~S1A>

SOP~UE-lP2$-~2

R14c

R13

cR-S

OR R15 Ru

Table6: Watchpointsand translationsfor IP-28.

modesare associatedwith the quantities at one desig-nated module port. For example,the modeof the con-stantcurrentsourceis completelydeterminedby its out-put voltageand current. The modes used in SOPHIEIII are chosenby the knowledgeengineeron the basisof simplicity, andthe majority of the moduleshaveonlyoneor two (e.g., high and low) faulted modes.

A module can be unfaultedyet be exhibiting the be-havior of a faulted mode becausethe behaviorof someexternalcomponentmakesthe module’sbehaviorappearto be faulty. Thereforewe adopt the conventionthat amodule is operatingin a faulted modeif it is exhibitingfaulty behaviorand containsa fault causingthat symp-tom. A module which containsno faults is by definitionoperatingin a valid mode. For example,the regulatorsectionof the IP-28 has four faulty modes: over-v-un— regulator output voltage is higher than the voltagecontrol setting, over—i—lim — regulator output currentis higher than the current control setting, low — regu-lator output is lower than either the voltageor currentcontrol setting, and very-low — regulator output volt-age is lessthan 1.2 volts and the output is lower thanthecontrol settings.The regulatorcanexhibit the faultybehaviorlow either becauseit containsthe fault or be-causeit is being supplied with a faulty signal (e.g., theDC sourceis low). The possiblefault modesneednot bedisjoint; the modeslow and very-low overlap. The samecomponent fault can causetwo different modesdepend-ing on thepreciseshift in parametervalue (e.g., R8 highcanmeanR8 hasaresistanceanywherebetween3 and00

ohms) and the front panel control settings. As modulebehaviorsneednot be disjoint, it is not only necessaryto knowall the possiblebehaviorsof a module, but alsowhich behaviorsare inconsistentwith eachother. Forexample,the regulator behavior low is inconsistentwithbehaviorhigh, but not with behavior very-low.

The knowledgeengineermust constructa knowledgestructure which indicates how each module’s behaviorcontributesto the behaviorsof its parent. We call thisstructure the behaviortree (although it is really a lat-tice). Every nodein the behavior tree correspondsto abehavioralmodeof a module. The incoming edges(seeFigure 10) to a nodecorrespondto the behaviorsof itssubmodulesthat could causethe node’s own behavior;theoutgoingedgesfrom a nodecorrespondto the parentbehaviorsthat the nodecan cause.As amodule behav-ior can producedifferent behaviorsin its parent,a nodemay havemultiple outgoing edges.The terminals of thebehaviortree are the fault modesof the components.

A node in the behavior tree is specified by the mod-ule nameandthe modeit is in, abbreviated(<module><node>). Figure 11 describesa small part of the IP-28behaviortree. (The actual tree takes about two pagesto print so we havenot includedit in this paper.)

In orderto be usedsuccessfullythe behaviortree mustobey the following principles:

Parentmodulebehaviors

Submodule/ A

behaviors

Figure 10: A Node in the BehaviorTree

(IP-28veiylow) (IP-28low)

(REGvery low) (DCS zein) (DCS verylow) (REGlow)sOPHIE-1P2s-~e

Figure 11: Part of the IP-28 behavior tree. The behav-ior (IP-28 very-low) can be causedby (REG very-low),(DCS zero), or (DCS very-low). The behavior (IP-28low) can be causedby (DCS zero), (DCS very-low), or(REG low). Note that (DCS zero) and (DCS very-low)can causeboth (IP-28 very-low) or (IP-28 low).

included3.

• Every node(except for the overallmodule)must havea parent.

• Includeall possiblewaysin whicha module can affecta parent.

5.3 Mode detection rules

The assertionsin the qualitativedatabasetrigger ruleswhich determinemodule modes. Determining a mod-ule’s fault mode tells us a lot about the circuit as theonly remainingfault modesarethosewhich can causeit(i.e., arebelow the observednode in the behaviortree).This sectionandthe next fleshesout this intuition.

The modedetectionrules neednot pinpoint the exactbehavioral modeof the module. They can determinethat the module is in one of a set of possiblemodes (alocalization), or determinethat the module cannot bein somemodes(an elimination). The rules for VARI-ABLE/REF (the variable voltage referenceof the IP-28) are:

(V (Nil N14) high) ->

(LOCALIZE VARIABLE/REF high ok)(V (1111 1114) eq) -> (LOCALIZE VARIABLE/REF ok)(V (Nil Ni4) low) -> (LOCALIZE VARIABLE/REF low)

3Some faults immediately causeothers, so theseare leftout of the tree by the single-fault presupposition.

VSOPHIE-bthavicii

• Every possible mode (faulty or valid) must be

(LOCALI~<module> <model> <node2> . . .) indicatesthat the module must be in one of the given modes.The first rule statesthat if the voltage betweennodesNil and N14 is high, then the module VARIABLE/REF, iffaulted, must be in modehigh. ok is addedin the firstcasebecausethe IP-28 voltage-referenceis designedtoactively correct a low voltage on its output terminal,but not a high voltage. Thus if a low voltage appears,then thevoltage referencemust be faulted, but if a highvoltageappears,then some externalmodule could alsobe supplyinga voltage to its port. This latter possibil-ity neverarisesin normal operation, but can in faultyoperation.

The rules for a constant current source are a goodexample of the expressive power of using multiple validmodes.A peculiarityof anon-idealcurrentsourceis thatit cannot supply a constant current into an arbitrarilyhigh-impedanceload. Its behavior when presented witha high-impedanceload is deceiving. The high (higherthan it should be teleologically) output voltage suggeststhat thecurrent sourceis high, while the low output cur-rent suggestsit is too low. The constantcurrent source(CCs) is modeledby four modes,twoof whicharefaultedand two of which are valid: high(faulted), low(faulted),saturated(ok),andnormal(ok):

(V (114 GROUND) high) ->

(LOCALIZE CCS saturatedhigh)(V (114 GROUND)normal) ->

(LOCALIZE CCS high low normal)(I C/Q2 high) -> (LOCALIZE CCS high)(I C/Q2 normal) —> (LOCALIZE CCS normal)(I C/Q2 low) —> (LOCALIZE CCS saturatedlow)

In the casewherethe fault lies outside the constantcur-rent sourceresulting in a high impedanceload, the highvoltage measurementindicatesthat the sourceis eitherin modeshigh or saturated, and the low current mea-surementindicatesthat thesourceis either in modeslowor saturated.Hencea high voltagecombinedwith a lowcurrent indicatesthat it is in the saturatedmode andthereforenot faulted.

When a propagationtriggers a mode-detectionrule,its assumptions,if any, modify the effect of the rule (seenext section). In addition, in limited circumstances,thequalitative propagatoris invoked. For example, if therule,

(V (Nil 11i4) low) -> (LOCALIZE VARIABLE/REF low),

is triggered, then the fault either lies in VARI-ABLE/REF or the triggering propagationis low. Thequalitativepropagatorcan then be invokedon the trig-ger restricting the fault modesof its underlyingassump-tions which arenot in VARIABLE/REF. The situationis quite different for the rule:

(V (1111 Ni4) high) —>

(LOCALIZE VARIABLE/REF high ok)

In this case, there is little point to invoking the qualita-tive propagatoras the fault might well lie outsideVARI-ABLE/REF or the assumptionsunderlying the trigger-ing propagation(becauseof the ok in the secondrule

consequence).The generalprinciple is that the qualita-tive propagatorcan only restrict thefault modesof thosecomponentsthat would otherwisebe verified by thecon-sequent. These conditionsare clearly overly restrictive,andfurther researchwould be requiredto determineex-actly under which conditionsthe qualitativepropagatorshould be invoked.

Although the behavior tree contains a large number ofmodules and their modes, it is only necessary to includea handful of rules to detect module modes — the re-maining module modesareeither entailedby LOCAL’spropagationsor only affected indirectly. This can beseen from the simple fact that 12 watchpointsarecom-pletely adequateto diagnoseall faults in the IP-28.

5.4 Reasoning with the behavior tree

Once the mode detection rules identify a module’smode(s), reasoning over the behavior tree identifieswhich faults can causethe observations.The behaviortree is primarily a datastructure to make it convenientto write andexplain circuit-specific rules. The presenceof the behaviortreedoesnot directly reducethenumberof rules needed(except in so far as it makesrule writ-ing a muchsimpler task). All of the consequentsof themodedetectionrules could be replacedby a setof com-ponent mode eliminations or localizations. For exam-ple, (LOCALIZE VARIABLE/REF low) could be replacedbya list of all thecomponentfault modeswhich causetheVARIABLE/REF to be low andthere would be no changeindiagnosticprecision. However,with this cruderepresen-tation one is far more proneto omit componentmodes.

The inference mechanismoperates on a databasewhich keeps a record of the remaining possible modesfor eachmodule. All localizationsareconverted to theirequivalenteliminations so we needonly consider infer-encesresultingfrom anelimination. A distinct suchdatabaseis kept for eachcontrol panelsetting. However,onecommondatabaseis kept for the componentfault modesas we presumethe physicalfault cannot change. Thisis sufficient to flow all the deductionsfrom one set ofcontrol settings to another. (For example, if a moduleis actually faulted under onesetting it must be in somefaulted modeunderall settings.)

Considertheelimination of a faulty mode. If afaultynodeis eliminated,the nodesbelow it can no longer bethe cause;eachof theselower nodesis examinedto seewhether it still has a parent (i.e., can still causeanyother behavior), and if not, it too is eliminated. Analo-gously,if amodeis eliminatedit can no longercauseanyof its parents,so eachparentbehavior it can causeis ex-aminedto seewhether it still can be causedby anotherbehavior,andif not it too is eliminated. Oneeliminationcan thus recursivelygeneratemanymore. If all possiblefaulty behaviorsof a module are eliminated, then themodule cannotbe faulted. It is unecessaryto explicitlyimplementthe intuitive principlethat if a module is un-faulted, then all of its submodulesare unfaulted. Thisis a simple consequenceof the fault mode eliminationmechanismjust outlined. If all of a module’s faulty be-

haviorshavebeeneliminated, then its submoduleshaveno possiblefaulty behaviorseither.

Elimination of valid modes is simpler. While a faultybehavior can be caused by a faulty behavior in any one of

the submodules, a valid behavior only results if all of thesubmodulesare behavingvalidly. Thus the eliminationof a valid nodehasno direct consequenceson the modesbelow (unless there is only one)it since the eliminationof any one of these alone would be sufficient to elimi-nate the mode itself. On the other hand, when a validmode is eliminated its parent cannot happen unless itcould also be caused by an alternative valid mode of themodule. If all possible valid behaviorsare eliminated,then the module must be faulted. Under the single-faultassumptionand the fact that modulesdo not overlap,any sibling modules are automatically unfaulted. If a

module is determined to contain a fault, then all of itsparent modules are, by definition, also faulted. Sincethe particular faulty mode need not be known, this isachievedby eliminating all of the valid modesof theparent modules.

The propagationswhich causethe originaleliminationmay haveunderlyingassumptionswhichmodify thecon-sequenteliminations and localizations. These assump-tions must be carried alongduring the recursiveelim-inations, continuouslyenforcing the restriction that anelimination of a faulty modecannot dependon an as-sumptionabout themodule in question. Eliminationsofvalid modescan,of course,dependon the module itself.An attempt to eliminatea modeof a moduleuponwhichthe triggering propagationdependsis ignored, however,the recursiveeliminations are still attempted. Usuallythe underlying assumptionsrefer to only someof thecomponentsof themodule, in which casetheeliminationstill leads to deductionsabout that part of the modulewhichdoesnot overlapwith the assumptionsunderlyingthe triggeringpropagation.

By far the majority of SOPHIE III’s circuit-specificrules detect module modes. The behavior tree, unfor-tunately, can only representhierarchical module inter-actions,andcannotrepresentcausalinteractionsamongsame-levelmodules. SOPHIE III employsthree addi-tional rule types to capture theselatter type of knowl-edge. We discusseach individually in the following sub-sections.

5.5 Good neighbor rulesSometimeswhena module is discoveredto be unfaultedandbehavingnormally (asdistinguishedfrom beingun-faulted but behavingsymptomatically) its “suppliers”and“consumers”can also be determinedto be behavingnormally, and henceto be unfaulted. Consideran ex-amplefrom the IP-28. The constant-voltagereferenceof this power-supply(seeFigure 4) consistsof a raw DCvoltagewhich is first regulatedto 56 volts andthen reg-ulateddown to 36 volts. The only supplier of power tothe 36-volt regulator is the 56-volt regulator, and theonly consumerof the 56 volt output is the 36-volt regu-lator. If voltageof thesecondreferenceis 36 volts, then

it is operating normally and unfaulted, and the 56-voltregulator must be unfaulted and operating normally. Tofacilitate good neighbor rules SOPHIE III maintains an

additional modefor every module,operating-ok, whichis consistent with every other good mode, andinconsis-tent with every other faulty mode. The operating-okmodeindicatesa module is unfaulted and behavingnor-mally. Thus, localizingamodule’smodeto operating-okis stronger thanjust localizing its modeto just ok.

The module-detectionrule,

(V (Nil Ni6) eq) -> (LOCALIZE 36V/REG operating—ok)

also triggers the good neighbor rule,

(36V/REG operating-ok) ->

(LOCALIZE 56V/REG operating—ok),

which, in turn, triggers the good neighbor rule,

(56V/REG operating—ok) ->

(LOCALIZE LQ/DCS operating—ok).

5.6 Bad neighbor rules

Faulty behaviorcan be causedby neighboringfaulty be-haviors. This fact can be stated by bad neighborrulesof the form:

(BEHAVIOR <modulei> <model>) —>

(BEHAVIOR <module2> <mode2>)

Sucha rule statesthat if <modulel is observedbehavingin faulty mode <model> that this can be explained byfaulty module <module2>behavingin <mode2>. Considerthe voltage-referenceexample again. If the output ofthe 36-volt regulator is low, then it is either faulted,beingsupplied with a badsignal (the 56-volt regulatoris low), or beingpresenteda badload (the output filterof the voltage referenceis shorted). One exampleof abadneighbor rule is:

(BEHAVIOR 36V/REG low) -> (BEHAVIOR 56V/REG low)

Bad neighbor rules are invoked by mode detectionrulesor other badneighborrules. For example,themodedetection rule:

(V (Mu MiS) low) —> (LOCALIZE 36V/REG low Ok),

is rewritten as:

(V (1111 1116) low) -> (BEHAVIOR 36V/REG low).

This rule now has the additional effect that (BEHAVIOR36V/REG low) needsto be explained. As a BEHAVIOR as-sertion statesthat the behavioris in a particular modewhich may or maynot originatewithin the module, theok mode is always implicit asa possibility. Another badneighborrule is:

(BEHAVIOR 56V/REG low) -> (BEHAVIOR LQ/DCS low).

All bad neighbor rules are triggered and run until qui-escent and the fault is localized to one of the faultsmentionedin one of the bad neighbor rules. Thus, theresult is a disjunctive localize, e.g., the above exam-ple becomesa (OR (LOCALIZE 56V/REG low) (LOCALIZE36V/REG low) (LOCALIZE LQ/DCS zero) ...).

5.7 Exception rules (REG NOT-uMrFn’Jo)

Thethird andlast typeof ruledoesnot fit anyparticularconceptualpattern. They areall of the form:

<measurement>-> (ELIMINATE <module> <mode>).

Thesearelike modedetectionrules,hut the watchpointis not at the module’s boundary. For example,one ofthe threesuch rules for the IP-28 is:(V (Nil 1114) EQ) -> (ELIMINATE Q5 sh/op/op).

There are two componentsbetweenQ5 (see Figure 4)andthe nodesNil and N14. Unfortunately, this ratherinelegantrule is needed.This particular fault mode forQ5 makes the voltage regulator inoperative, and alsoshorts theoutput of the voltagereferenceto the outputof theoverallpower-supply.The voltagereferenceis veryweak,so the main power-supplywill dominate.Thus, ifthe output of the voltage-referenceis normal, Q5 mustnot be shorted in this way. (If it were, then it wouldnot be voltage regulating, causing the voltage across thereferenceto be high.) This is a very subtle inferencewhich would not be generallypossiblewithout a rathersophisticatedcausalandteleologicalmodel of the IP-28.

5.8 Handling disjunction

As rangescan overlap multiple qualitative values, thedatabasecan contain disjunctive assertionssuch as (V(Nil 1114) (OR high eq)). Such assertionsare handledby processingeach of the individual disjuncts individ-ually andthen intersectingany resulting module elim-inations. In the example, the disjunction is processedby first identifying all the modeeliminations of (V (Nil1114) high) and then identifying the modeeliminationsof (V (NIl 1114) eq), and only permanentlyrecordingthose consequenceswhich are common to both. Thisparadigmis usedfor all typesof circuit-specific rules.

5.9 Example of reasoning with thebehavior tree

The top-levelof the behaviortree for the IP-28 is il-lustratedin Figure 12. Valid modesareomitted. Sup-pose the output of the IP-28 is measuredto be toolow. A mode detection rule monitoring the outputapplies and doesa localization of (IP-28 low). Thisis transformed into the equivalent elimination of allthose modesinconsistent with low: (IP-28 high) and(IP—28 ok). Considerthe elimination of (IP—28 high).Looking at the tree we see that (IP—28 high) can becausedby (REG not-limiting), (REG over-v-lim), and(REG over-i-lim). Thesemodes haveonly one parentso they can be eliminated. The eliminations recurseun-til the componentlevel is reached. For example:

(ELIMINATE REG over-v-lim)—> (ELIMINATE V/DIF/CONT always—off)-> (ELIMINATE VDC.THRESH/AMP always-off)-> (ELIMINATE Q5 op/op/op)

SupposeanothermeasurementlocalizesVARIABLE/REFbe low. (VARIABLE/REF ok) is inconsistent with it andis eliminated. There is no way to tell which submodule

(W-2$}U~ (REQO~-V-~

(REQoVRR-I-1.E~)

(DOSVERY LOW)

(DOS ZERO)

(TP-28 LOW)

(REQLOW)

(IP-2$VERY-LOW)

(REQVERY-LOW-TO-ZERO)

Figure 12: Top of IP-28’s behavior tree

of VARIABLE/REF is faulted, so the eliminations do notrecurse. Reasoningup the behaviortree we obtain:

(LOCALIZE VARIABLE/REF low)—> (LOCALIZE V/DIF/CONT low)-> (LOCALIZE REG low)—> (LOCALIZE IP—28 limit—too—low)

The (LOCALIZE V/DIF/CONT low) is anew pieceof infor-mation we did not know before. Presumingthat the cir-cuit hadonly asinglefault, all componentsnot explicitlylisted as faultedor containedwithin VARIABLE/REF mustbe ok. Figure 13 presentsthe faulty behaviorsthat arestill possiblealongwith thoseconnectionsin thebehav-ior tree that are relevantto them.

5.10 Troubleshooting with circuit-specificknowledge

The sametechniquesfor using the generalknowledgetoscoremeasurementsapply to the circuit-specific knowl-edge. Recall that the best strategy is to identify ameasurementwith the maximum expectedinformationvalue. SOPHIE III computesthe expecteddiagnosticscoresfor everypossiblecoincidenceandat everywatch-point. The scoreis computedon the basisof the com-bineddiagnosticresultsfrom both LOCAL andthe cir-cuit specific rules. Obviously, early in the troubleshoot-ing the measurementsat the watch-pointshavehigherscores;later in the troubleshootingthescoresfor hypo-theticalcoincidenceswith propagationswill be preferred.

Although SOPHIE III is now a nearly optimal trou-bleshooter of the IP-28, its scoring of early studentmeasurementsremains poor. Considerthe case wherethe student makesa measurementwhich is neither ata watch-point, nor coincideswith an existing propaga-tion. Our schemedoesnot computeanyexpectedscore

(LQ/DCS ZERO) — (LQJREGZERO)— (03 OPEN)

(R3 OP~

(04BREAKDOWNLOW)(R4 oc~(V/~ LOW)~(56V~EG LOW) < (R3 HIGH)

(36V/PEG LOW) (R4 HIGH)

(05BREAKDOWNLOW)

(02SHORTED)— (02 LEAKY)

(VREF.FILT LOW) (R5 LOW) (R6OPEN)

(yR-S~. (R6HIGH)

Figure 13: Final IP-28 behaviortree which showsthe only10 components).

for such measurements.Also, it does not consider theinfluence of propagationsinto multiple watchpoints,orinto a watchpointbut with underlyingassumptions.Weimplementeda versionof SOPHIE III whichcomputesbetterscoresby actuallysimulatingthepropagationandbehaviortree effects of various outcomesfor eachmea-surementpoint. Unfortunately, this versionwastoo slowto be of practical use. Therefore, as a consequencewedid not allow SOPHIE III to critique astudent’smea-surementon the basis of easilycomputedbut approxi-matescores.

We used a alternative, cruder, techniqueto critiquethe student’searly measurements.We relied on the ob-servationthat SOPHIE III’s scorefor ameasurementisalways too low (as it doesn’t simulatethe propagation,it cannot foreseeall the results of measurements).Onthe other hand, once a student hasmade a particularmeasurementSOPHIE III can preciselydeterminetheactual information gain produced by the measurement.Given the topology of the IP-28 every early measure-ment, no matterwhat its outcome,should providesub-stantial diagnostic information. Therefore,when a stu-dent’s actual measurementproducessubstantially lessinformation than theexpectedscorefor someothermea-surement,we can be certainthe measurementis substan-tially suboptimal. At this point the coachcan critiquethe student for his poor measurement.

6 Conclusions

In the introduction we laid out a set of five desideratafor the inferential machinery. SOPHIE III’s electronic

expert meetsall of the criteria fairly well. However, we

remainingpossiblecomponentfaults are the 16 leaves(in

do not yet havea generaldiagnosticianfor arbitrarycir-cuits, but we havemadea major step. The necessityofcircuit-specific rules is unfortunate,so let us re-examinethose. The following table enumeratesthe types andnumberof circuit-specific rules used in SOPHIE III:

I Mode detectionrules J 33Goodneighborrules I

I Bad neighborrules I 4

Exception rules II Total I 50

The fact that SOPHIE III uses only 50 circuit-specificrules deservesfurther discussionas this numberseemsboth too high and too low. If we added modedetectionrules for every possiblemodule, as well as allpossiblegood and bad neighbor rules, then we wouldprobablyrequire on theorder of 1000 rules. Almost allof theserules are unnecessarybecauseLOCAL propa-gation mechanismmakesit necessaryto representonlythe core few.

Although SOPHIE III requiresonly a few rules toadequatelytroubleshootthe IP-28, doesnot meanthattheserulesareeasyto obtain. Evenafterextensiveexpe-riencewith SOPHIE III we cannotbe totally confidentthat it can handleeverypossiblefault. It would be farmore useful if we could build a diagnosticianwhich didnot dependon anycircuit-specific rulesat all. This sug-geststwo directionsfor future research.First, LOCAL’sinferential capabilitiescould be expandedin somewayto be more sophisticated.Second,there might be someway to construct the necessaryrules in another morecomputationally intensive processwhich examinedthe

(1P-28LOW)

(REGLOW)

(V/D1F/OONTLm.trr-Too.LoW) /(VARLLE~ei~’LOW)

(R7A SHORTED)

\ (R7AOPEN)

(VO/POTLOW)

(R7B SHORTED)

(RiB LOW) SOPHIE-1P28-uow

circuit much more carefully.

6.1 A re-examination of presuppositionsIn order to draw diagnostic inferences, SOPHIE IIImakesa number of presuppositions. These presuppo-sitions impose a restriction on its capabilities becauseall of them are violated in somesituations. The con-sequencesof violating them rangefrom encounteringir-reconcilablecontradictions(i.e., emptynogoods)to veryinefficient troubleshooting. In this section,we briefly re-examinesomeof thosepresuppositions.

The presuppositionslisted in the introduction fall intotwo classes.Thefirst classconsistsof good troubleshoot-ing heuristics. Theseshould be assumedto hold onlyuntil evidenceis discoveredto the contrary. This classincludes:

• The circuit containsa single fault.

• Faults only occur in components;not in circuit topol-ogy.

• All componentfault modesare known.

• The circuit is non-intermittent.

• The faulty component is currently manifesting asymptom.

In a moregeneraltheoryof troubleshooting,they shouldbe treatedas defeasibleassumptionssubject to explicitreasoning,muchlike the componentassumptionsof LO-CAL. They are, in essence,a deeperlevel of assump-tions, examinedonly when all the componentassump-tions havebeeneliminated. Every deductionmade byLOCAL which dependson oneof thesepresuppositionshould explicitly include this fact as an underlyingas-sumption of that deduction. Unlike the componentas-sumptions,thesepresuppositionalassumptionsinfluencetheinferencemechanismmaking thedeductions.Withinthis extendedframework, LOCAL, would becomejust aspecialcaseof a more powerful reasonerthat wascapa-ble of “deliberating” over its own actions and theories.

The single-faultpresuppositionis the most pervasive.Many of SOPHIE III’s deductionsdependon it. Theideasof coincidence,conflict andcorroborationare gen-eralizableto multiple faults, but manyof the the spe-cific inferencestrategiesusedby LOCAL do not applydirectly. For example,without the single-faultpresup-position a corroborationcan not verify the underlyingcomponentsnor a conflict verify the componentsthataren’t mentionedby its underlying assumptions. Also,much of the reasoningmechanismof the behavior treeis no longer valid. The combinationof two faults mightcausean unusualbehaviorwhich is not capturedby anyof themodule’s fault modes.

For a very nonobvious reason,multiple independentfaults occur more frequently than one otherwisewouldexpect within our framework. Designerstypically de-sign circuits so that they can handle wide variationsin componentparameters.This allows designersto usecheapercomponentswith larger tolerancesas well as to

copewith changesin parametervalueswith temperatureand age. Manufacturerstypically producecomponentsin only a few fixed tolerancelevelsandthereforethe tol-erancespecified by the designer is alwaystighter thanthat of the component actually used. Thus a componentmay be faulted (exceed its tolerance), but the circuit still

function correctly. The component definitely is faulted:The samecomponentmight not work in anothercircuitwhere the designer’stoleranceis closer to the manufac-turer’s. The difficulty ensueswhenanothercomponentfault does causethe circuit to manifesta global symp-tom. The circuit now hasmultiple independentfaults,and SOPHIE III may well encounteran irreconcilablecontradictiontroubleshootingit. This suggeststhat LO-CAL should use the designer’s, not the manufacturer’stolerances,for componentparameters.

We havebeendiscussingonealternativeto thesingle-fault case,that of multiple independentfaults. A morecommon case is that of multiple consequentialfaults.For example, an output transistor may short causinganother transistor to open. The single-fault heuristicoften works well for such multiple faults. Frequently,the componentthat originally faulted will no longer bemanifestingasymptomsincetheconsequentialfault haseffectivelymadetheoriginally faultedcomponentappearunfaulted by isolating it. In the example,the voltagesandcurrentsaroundtheoutput transistorwould be neg-ligible if the transistorit causedto open wasresponsiblefor delivering it its power. Since the output transistornow hasno power, it will appearto befault-free, therebytemporarily reducingthesituationto asingle-faultcase.

The secondpresupposition, that SOPHIE III pre-sumesall possiblefaultsareknown, is againgood heuris-tic, but againdoesnot alwayshold. Possibletopologicalfaults include a short betweentwo nodeswhich do notshareacommoncomponent(e.g., two geometricallyad-jacent,but topologically distant andcomponentswhichhavebeenphysically bent so that their leads touch) oran opennode which disconnectssomecomponentsfromothers.

A more subtle violation occurs with componentfaultmodes, since it is hard to hypothesizeall the ways inwhich acomponentcan fault. For example,if the diodefaultedso asto becomea resistor,with thevoltageacrossit greater than VMAX andthecurrent throughit greaterthan ‘MIN, then the propagatorwould eventuallyen-countera conflict with no underlying assumptions.Anexampleof such a conflict, a measurementof the volt-age across the diode would indicate it wasopen, whilea measurementof the current throughit would indicatethat it couldn’t be. Fortunately, this is an extremelyrare fault modefor diodes. Diodesusually fault by be-comingeither nearly shortedor completelyopen— andtheseare the modescurrently modeled.

The non-intermittencypresuppositionis equivalenttoassumingthat nothingchangesbetweenthe times mea-surementsare takenand that the fault doesnot changewhen the control panelor load is changed.

The presuppositionthat thecircuit’s presentsymptom

is a direct consequenceof somecomponentpresentlybe-having symptomatically,is only true for circuits whichdo not havesomekind of “memory.” Supposethe circuithada circuit breakeron its input which blew everytimethepowersupply waspluggedin. At the time we observethe power supply output it is manifesting a symptom (no

output), but every component is behaving correctly. Asa consequenceLOCAL would encounteran irreconcil-able contradiction. The problemis, of course,that thecircuit breaker “remembers” that some component wasshorting out when the power supply was first pluggedin, even though the componentmight not be doing soat present.This is not a isolatedcase,most good powersupplies,for example,employ a kind of electronic cir-cuit breaker in addition to the usual regulation as anextrameasureof protection for the circuitry connectedto the power supply. This type of fault is notoriouslyhard to find since the troubleshooterdoesnot get theopportunity to see the faulted componentmanifest itssymptom.

The remaining presuppositionsare limitations whichneedto be addressedby some futuretheory:

• Only DC behavioris important.

• All faultsare equally likely.

• All measurementsare equally easyto make.

• All fault modesareequally likely.

In manycircuits AC behavioris important, so LOCALneedsto be extendedto deal with time and AC effects.As transistorsare far more likely to fail than resistors,the simplemeasurementscoringfunction LOCAL doesnot zero in on more probablefaults earlier. In the sim-ulated electronicslaboratoryall measurementscost thesame, but in real circuits some measurementsare farmoreeasyto makethan others. For example,DC cur-rentsare much harderto measurethanvoltagesbecausethey usually requirephysically removingoneof thecom-ponent’sterminals. Resistorsare far morelikely to openthan to short, thereforeif we haveeliminatedOpeningasa fault mode for a resistor which shouldtreat shortingas a very low probabilityfault.

6.2 Final observationsThere are many promising directions along which theworkdescribedin thispapercould havecontinued.How-ever, two very different problems,one pragmaticandone intellectual,stood in our way. The pragmaticis-sue was the needfor large addressspaceand personalLisp machines;the intellectual issue was our increasedawarenessthat we did not really know what it meantto“understand”how a complexpieceof equipmentworks.In particular we did not know what mentalmodels theexpertshad of a givensystem’s functioning, nor did weknow how thesemodelswere learned,for they certainlyweren’t explicitly taught.

The computationalissuethat we facedwith SOPHIEIII wasthat it barelyfits into the (256K words)addressspaceof a PDP-iO. SPICE,SOPHIE land SOPHIE

II eachalreadyoccupiedits own separateaddressspaceso nothing could be gained by running SOPHIE IIIstand-alone.This made it extremelydifficult to extendSOPHIE III, especiallyto expand its coachingcapa-bilities along the lines developingout of our coachingresearchon mathematicalgames.Addressspacelimita-tions werenot ouronly concern:To do “formative” eval-uations,discoveringthe reasonsfor the pedagogicalsuc-cessand failuresof our system, we neededthe speedofefficient, dedicatedLisp machines. Without this speed,student reactionsto the long and unpredictabledelaysoften swampedour experimentalprobes.

The issue concerningthe needfor a theory of humanunderstandingof complexsystems,in particular circuits,was clearly the more challengingone. Indeed, much ofour recentresearchhas beendirectedat attacking thisproblem. It quickly becameclear to us that the workthat went into SOPHIE II and III on explanationputthe cart beforethehorse. We hadno adequatetheoryofwhat it meantto understandacircuit andhenceno welldefined “target” model of what we wantedthe studentto learn. As a consequenceno realtheory of explanationwasforthcoming.

In our experimentswith SOPHIE I and II we sub-stantiated that the beginner and expert alike preferto reasonabout the circuit in qualitative and causalterms. The studentspreferred qualitativeexplanationsand wereonly comfortableabout their understandingifit was in terms of a qualitative causalmechanism,butSOPHIE III’s only referenceto causalityis in the pre-compiled explanationsfor its circuit-specific rules. Itcould not generatenew ones nor could it expand oldones. A significant step toward the necessary robusttheory of causalitywas achievedthe following year byde Kleer, in his doctoraldissertationpresentinga theoryof qualitativecausalreasoningaboutelectronics [3]. Histheoryhasturnedout to providea basisfor constructinghuman-orientedexplanations.

7 Acknowledgments

The SOPHIE project has beeninfluenced and helpedby numerouspeople. We areespeciallyindebtedto AlanBell and RichardBurton for the substantialamountofwork he did on SOPHIE I. Richard RubensteinandNed Benhaimhelpeda great dealon SOPHIE II. Weare also gratefulto Ed Gardnerfor making possibletheoriginal SOPHIE projectandto Harry O’Neil andDex-ter Fletcherfor their constantencouragementand will-ingnessto shelterthe project from variousbureaucraticconstraints. The project was initially funded by theAir Force Human ResourceLaboratory at Lowry AirForce Baseand later by DARPA/CTO and the Tn-Service training laboratories. Numerouspeople haveread preliminary drafts and havemadeinvaluablesug-gestions. Daniel G. Bobrow, Bruce Buchanan, BillClancey,JaimeCarbonell,Ken Forbus,Olivier Raimonand Rachel Rutherford deservespecial thanks for theirpatienceand insights.

References Appendix: IP-28 Modules

[1] Brown, J.S.,Burton, R. R. andde Kleer, J., Peda-gogical, natural languageand knowledgeengineer-ing techniquesin SOPHIE I, II andIII, in: D. Slee-man and J.S. Brown (Eds.), Intelligent TutoringSyslems,(Academic Press, New York, 1982) 227—282.

12] de Kleer, J., Local methods of localizing faultsin electronic circuits, Artificial Intelligence Labo-ratory, AIM-394, Cambridge: M.I.T., 1976.

[3] de Kleer, J., How circuits work, Artificial In~elli-gence24 (1984) 205—280; also in: D.G. Bobrow

(Ed.), ReasoningAbout Physical Systems(MITPress and North Holland, 1985) 205—280. 11.April 24,

1992.April 24, ~992

[4] de Kleer, J. and Brown, J.S., Mental models ofphysicalmechanismsand their acquisition, in: J.R.Anderson(Ed.), Cognitive Skills and their Acquisi-tion, (Erlbaum, Hillsdale, NJ, 1981) 285—310.

[5] Doyle, J., A truth maintenance system, ArtificialIntelligence12 (1979) 231—272.

[6] Nagel, L.W. and D.O. Pederson,Simulation pro-gramwith integratedcircuit emphasis,Proceedingsof the SixteenthMidwest Symposiumon. CircuitTheory,Waterloo,Canada,1973.

[7] Rieger, C. and Gninberg, M., The declarativerep-resentationand proceduralsimulation of causalityin physicalmechanisms,in: ProceedingsIJCAI- 77,Cambridge,MA, (August, 1977), 250—256.

[8] Stallman,R. andSussman,G.J., Forwardreasoningand dependency-directedbacktrackingin a systemfor computer-aidedcircuit analysis,Artificial Intel-ligence9 (1977) 135—196.

DCS The main DC sourcefor the power supply.

DCS.RECT The rectified DC power.

DCS.FILT Filters the raw DC provided by the rectifiersection.

REG Regulatesthe output basedon the information fedbackby theoutput sensors.

DARL The darlingtonpairwhich is thecentralcontrollingelement in the power supply.

DARL. 2OXAMP The secondstage of the darlington whichamplifies approximately20 times.

DARL. LIMIT A resistancewhich limits thegain of thesec-ond darlington transistor.

DARL. 100XAMP The first stageof thedanlingtonwhicham-plifies approximately100 times.

SAFETY/LOAD The minimal load on the darlington.

I/VOL/CNT This controlstheamountof currentprovidedto the load.

IVC. SENSE Senseswhether the amount of current pro-vided to the load is over the limit.

CC A variablecontrol which setsthe maximumamountof current provided to the load.

CR—S A switch which sets the rangeof the current con-trol.

IVC. THRESH/AMP Amplifies the signal provided by thecurrent sensor.

OPF Filters theoutput after regulation.

CCS A sourceof constantcurrentused to establisha ref-erence.

CCS.VOL/CONT A constantcurrentsourceusedto controlanothercurrent source.

CCS . ISENSE A resistancewhich is used to sensethe out-put currentof the CCS.VOL/CONT

CCS.VC.AMP Amplifies thesignal providedby thecurrentsensor.

INVERTER Becauseof the arrangementof components,the currentsourcecan be inverted.

CCS . AMP Amplifies the output of the simple currentsourceto provide astrongreference.

V/DIF/CONT This computesthe difference between thereference and the output.

VARIABLE/REF Providesa variable referencevoltage forcomparison.

VREF Providesafixed referencevoltage.

LQ/DCS A separatesource of unregulated DC used toestablisha reference.

Lq/RECT Providesthe raw DC for the reference.

LQ/FILT Filters the raw DC providedby the rectifier sec-tion.

SSV/REG A 56 volt referencesource.

36V/REG A 36 volt referencesource.

VREF.FILT An output filter for the fixed voltage refer-ence.

VR-S Sets the range of the voltage control.

VC/POT Controls the reference voltage and hence themaximum output voltage.

VDC . THRESH/AM? Computes the difference between theoutput and the reference and amplifies it.

FEEDBACK/DAMPERDamps out any potential oscillationcausedby the feedback.

Model-based diagnosis in SOPHIE III’ - PARC Based... · Model-based diagnosis in SOPHIE III ......

Documents

Transcript of Model-based diagnosis in SOPHIE III’ - PARC Based... · Model-based diagnosis in SOPHIE III ......