Local probabilistic models*
Handout #10
Daphne Koller
Stanford University
January 13, 2000
In the previous chapter, we discussed the representation of global properties of independence by graphs. These properties of independence specified a factorization of the joint distribution into a product of CPDs. Until now, we have mostly ignored CPDs. However, it is clear that much of the representational power of networks lies in the ability to represent CPDs. In this chapter we will examine CPDs in more detail. We will describe a range of representations and consider their implications in terms of local properties of independence.
1 Tabular CPDs
When dealing with a joint probability of discrete random variables, we can always resort to a tabular representation. Simply put, we can represent P(X | Pa_X) as a table that contains an entry for each joint assignment to X and Pa_X. In order for this to be a proper CPD, we require that all the values are non-negative, and that for each value pa_X, we have

    Σ_{x ∈ Val(X)} P(x | pa_X) = 1.    (1)
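As a concrete sketch, a tabular CPD can be stored as a dictionary keyed by joint assignments, with the normalization requirement of Eq. (1) checked up front. The variable names and encoding here are ours, not the text's:

```python
# A minimal sketch of a tabular CPD P(X | Pa_X), stored as a dictionary
# mapping (parent assignment, x) -> probability.

from itertools import product

def make_tabular_cpd(table, x_values, parent_values):
    """Check that `table` is a proper CPD: entries are non-negative and,
    for every parent assignment pa_X, the values P(x | pa_X) sum to 1."""
    for pa in product(*parent_values):
        total = 0.0
        for x in x_values:
            p = table[(pa, x)]
            assert p >= 0.0, "probabilities must be non-negative"
            total += p
        assert abs(total - 1.0) < 1e-9, f"P(X | {pa}) does not sum to 1"
    return lambda x, pa: table[(pa, x)]

# P(X | Y) for a binary X with a single binary parent Y.
table = {
    ((0,), 0): 0.7, ((0,), 1): 0.3,
    ((1,), 0): 0.2, ((1,), 1): 0.8,
}
cpd = make_tabular_cpd(table, x_values=[0, 1], parent_values=[[0, 1]])
print(cpd(1, (0,)))  # 0.3
```

Note that the check itself already illustrates the footnote: given the first |Val(X)| − 1 entries for a parent assignment, the last one is forced by Eq. (1).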
It is quite clear that this representation is fully general: we can represent every possible discrete CPD using such a table. As we shall also see, tabular CPDs can be used in a natural way in inference algorithms.
However, aside from having these desirable properties, the tabular representation also suffers from several disadvantages. The most obvious one is that the representation can become large and unwieldy. The number of values we need to describe a CPD is the number of joint assignments to X and Pa_X. Thus, we need |Val(Pa_X)| · |Val(X)| values in a tabular representation.¹ For example, if we have 5 binary parents of a binary variable X, we need to specify 2^5 = 32 values. Once we have 10 parents, we need to specify 2^10 = 1024 values. Clearly, the number of values grows exponentially in the number of parents.
This can quickly become a serious problem. Consider a medical domain where a symptom, say "high fever", depends on 10 diseases. It would be quite tiresome to ask our expert 1024 questions of the form: "What is the probability of high fever when the patient has disease A, does not have disease B, ...?" Clearly, our expert will lose patience with us at some point!
*Part of a draft for a textbook, co-authored with Nir Friedman, Hebrew University of Jerusalem.
¹We can save some space by storing the values for |Val(X)| − 1 values of X and deducing the probability of the remaining value via Eq. (1).
Daphne Koller, Stanford University CS228 Notes, Handout #20 2
This example shows another problem with the tabular representation: it ignores structure within the CPD. If the CPD is such that there is no similarity between the various cases, i.e., each combination of diseases has a drastically different probability of high fever, then the expert might be more patient. However, in this example, like many others, there is a regularity in the parameters for different values of the parents of X. For example, it might be that if the patient suffers from disease A, then she is certain to have high fever, and thus P(X | pa_X) is the same for all values pa_X in which A is true. Indeed, many of the representations we will consider below attempt to explicitly describe such regularities and exploit them to reduce the number of parameters needed to specify a CPD.
Finally, it is clear that if we consider random variables with infinite domains, we cannot store each possible conditional probability in a table.
To avoid these problems we should view CPDs not as tables of all the conditional probabilities, but rather as functions that, given values of pa_X and x, return the conditional probability P(x | pa_X). This is all we need in order to have a well-defined representation of a BN. In the remainder of the chapter we will explore some possible representations of such functions.
2 Deterministic nodes
One of the simplest types of regular CPDs is one where X is a deterministic function of Pa_X. That is, there is a function f : Val(Pa_X) → Val(X), such that

    P(x | pa_X) = 1 if x = f(pa_X), and 0 otherwise.
For example, X might be the "or" of its parents. Or, in a continuous domain, we might represent P(X | Y, Z) by the function f(y, z) = sin(y + e^z). Of course, the extent to which this representation is more compact than a table (i.e., takes less space in the computer) depends on the expressive power that our language offers us for representing deterministic functions. That is, we might use expressions over a set of basic logical and arithmetic operations to represent f.
It is clear that deterministic relations are useful in modeling many domains. They often allow us to simplify the representation of dependencies (we will see such an example shortly). In addition, in some domains they occur naturally. This is particularly true in "artificial domains" such as models of machines and electrical circuits. However, we can also find them in so-called "natural" domains. A simple example is genetics. Recall that the genotype of a person is determined by two copies of each gene. The person's phenotypes are often functions of these values. For example, the gene responsible for determining blood type has three values a, b, and o. If we represent by G1 and G2 the two copies of the gene, and by T the blood type, then we have that:

    T = ab  if one of G1, G2 is a and the other is b
    T = a   if at least one of G1, G2 is equal to a and the other is either a or o
    T = b   if at least one of G1, G2 is equal to b and the other is either b or o
    T = o   if G1 = o and G2 = o
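The case analysis above translates directly into a deterministic CPD, following the definition at the start of this section. A minimal sketch (the string encoding of gene values is ours):

```python
# The blood-type function from the text, as a deterministic CPD: given the
# two gene copies g1, g2 in {"a", "b", "o"}, P(T = t | g1, g2) is 1 for
# t = blood_type(g1, g2) and 0 otherwise.

def blood_type(g1, g2):
    if {g1, g2} == {"a", "b"}:
        return "ab"
    if "a" in (g1, g2):
        return "a"   # the other copy is a or o
    if "b" in (g1, g2):
        return "b"   # the other copy is b or o
    return "o"       # g1 = g2 = o

def det_cpd(t, g1, g2):
    """P(T = t | G1 = g1, G2 = g2) for the deterministic blood-type CPD."""
    return 1.0 if t == blood_type(g1, g2) else 0.0

print(blood_type("a", "o"))     # a
print(det_cpd("ab", "b", "a"))  # 1.0
```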
Aside from a compact representation, we get an additional advantage from making the structure explicit: we can represent additional properties of independence. Recall that conditional independence is a numerical property: it is defined using equality of probabilities. However, the
Figure 1: A simple example of a network with a deterministic CPD. The double-line notation represents the fact that C is a deterministic function of A and B.
procedure det-sep(
    Graph,    // network structure
    D,        // set of deterministic nodes
    X, Y, Z   // query
)
  while there is an Xi such that
      (1) Xi ∈ D         // Xi has a deterministic CPD
      (2) Pa_Xi ⊆ Z
    Z ← Z ∪ {Xi}
  return d-sep_G(X; Y | Z)

Figure 2: Procedure for computing d-separation in the presence of deterministic CPDs.
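The evidence-closure loop of det-sep can be sketched in a few lines. The d-sep subroutine itself is from the previous chapter and is not repeated here; the network structure below is a hypothetical reading of Figure 1 (D and E as children of C), used only to exercise the loop:

```python
# A sketch of the closure step of det-sep (Figure 2): repeatedly add to Z
# any deterministic node whose parents are all already in Z, then hand the
# enlarged evidence set to ordinary d-separation.

def deterministic_closure(parents, det_nodes, z):
    """parents: dict node -> set of parents; det_nodes: nodes with
    deterministic CPDs; z: initial evidence set. Returns the closure of z."""
    z = set(z)
    changed = True
    while changed:
        changed = False
        for xi in det_nodes:
            if xi not in z and parents[xi] <= z:
                z.add(xi)          # Xi is fully determined by Z
                changed = True
    return z

# A hypothetical encoding of Figure 1: C is a deterministic function of
# A and B, with D and E as children of C.
parents = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}, "E": {"C"}}
print(sorted(deterministic_closure(parents, det_nodes={"C"}, z={"A", "B"})))
# ['A', 'B', 'C']
```

The loop handles chains of determinism as well: a node determined by already-determined nodes is added on a later pass.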
graphical structure made certain properties of a distribution explicit. This allowed us to deduce that some independencies hold without looking at the numbers. By making structure explicit in the CPD, we can do even more of the same.
Example 2.1: Consider the simple network structure in Figure 1. If C is a deterministic function of A and B, what new conditional independencies do we have? Suppose that we are given the values of A and B. Then, since C is deterministic, we also know the value of C. As a consequence, we get that D and E are independent. Thus, we conclude that I(D; E | A, B) holds in the distribution.
Note that if C were not a deterministic function of A and B, then this independence would not necessarily hold. Indeed, d-separation would not deduce that D and E are independent given A and B.
Can we augment the d-separation procedure to discover independencies I(X; Y | Z) such as this? In our example, the fix is to consider C to be part of the evidence once we have A and B in the evidence. In some situations, we might have variables that are deterministic functions of variables that are themselves deterministic functions of the evidence. Thus, we have to iteratively extend the set of evidence variables to contain all the variables that are determined by it.
This discussion suggests the simple procedure shown in Figure 2. It is easy to convince ourselvesthat this algorithm is sound, in the same sense that d-separation is sound.
Figure 3: A slightly more complex example with deterministic CPDs.
Theorem 2.2 (Soundness of det-sep): Let G be a network structure, and let D, X, Y, Z be sets of variables. If det-sep(G, D, X, Y, Z) returns true, then P ⊨ I(X; Y | Z) for all distributions P such that P ⊨ Markov(G) and, for each X ∈ D, P(X | Pa_X) is a deterministic CPD.
Does this procedure capture all of the independencies implied by the deterministic functions? As with d-separation, the answer has to be qualified. Given only the graph structure and the set of deterministic CPDs, we cannot find additional independencies.
Theorem 2.3 (Completeness of det-sep): Let G be a network structure, and let D, X, Y, Z be sets of variables. If det-sep(G, D, X, Y, Z) returns false, then there is a distribution P such that P ⊭ I(X; Y | Z), but P ⊨ Markov(G) and, for each X ∈ D, P(X | Pa_X) is a deterministic CPD.
Of course, particular deterministic functions can imply additional independencies.
Example 2.4: Consider the network of Figure 3, where C is the exclusive or of A and B. What additional independencies do we have here? In the case of XOR (though not for arbitrary deterministic functions), the values of C and B fully determine that of A. Therefore, we have that I(D; E | B, C) holds in the distribution.
Specific deterministic functions can also induce other independencies, albeit of a different type than the ones we discussed in Chapter ??.
Example 2.5: Consider the following Bayesian network:
[Figure: X and Y are the parents of a deterministic OR node D, which in turn is the parent of Z.]
and consider what happens if we are given that Y = y1. In this case, we also know that the deterministic node D necessarily has value d1. And, as the value of D is fixed, we can conclude that X and Z are independent. In other words, we have that
    P(Z | X, Y = y1) = P(Z | Y = y1).
On the other hand, if we are given Y = y0, the value of D is not determined: it depends on the value of X. Hence, the corresponding statement conditioned on y0 is false.
This example shows that deterministic nodes induce a form of independence, but it is different from the standard notion on which we have focused so far. Up to now, we have restricted attention to independence properties of the form I(X; Y | Z), which imply that P(X | Y, Z) = P(X | Z) for all values of X, Y and Z. Deterministic functions imply a type of independence that only holds for particular values of some variables.
Definition 2.6: Let X, Y, Z be pairwise disjoint sets of variables, let C be a set of variables (that might overlap with X ∪ Y ∪ Z), and let c ∈ Val(C). We say that X and Y are contextually independent given Z and the context c, denoted Ic(X; Y | Z, c), if

    P(X | Y, Z, c) = P(X | Z, c) whenever P(Y, Z, c) > 0.

We call this form of independence context-specific independence (CSI). In the example above, we would say that Ic(X; Z | y1).
Example 2.7: Consider the network of Figure 3 again, but assume that C is the deterministic OR of A and B. In this case, knowing C and B does not always tell us the value of A. However, if C is known to be false, then A and B are both known to be false, and therefore they are independent. Thus, we have that Ic(A; B | c0). As a consequence, we also have that Ic(D; E | c0).
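To make the CSI statement of this example concrete, the following sketch enumerates a small joint distribution in which C is the deterministic OR of A and B, and checks numerically that A and B factorize in the context c0. The uniform prior over A and B is our assumption, made only for illustration:

```python
# A numeric check of Example 2.7 under a hypothetical joint distribution:
# A and B are independent fair coin flips and C = A or B (deterministic).

from itertools import product

def joint():
    """Yield (a, b, c, probability) with P(a) = P(b) = 0.5 and c = a or b."""
    for a, b in product([0, 1], repeat=2):
        yield a, b, max(a, b), 0.25

def p(pred):
    return sum(pr for a, b, c, pr in joint() if pred(a, b, c))

pc0 = p(lambda a, b, c: c == 0)
# The claim: P(A=a, B=b | C=0) = P(A=a | C=0) * P(B=b | C=0) for all a, b.
for a0, b0 in product([0, 1], repeat=2):
    lhs = p(lambda a, b, c: (a, b, c) == (a0, b0, 0)) / pc0
    rhs = (p(lambda a, b, c: (a, c) == (a0, 0)) / pc0) * \
          (p(lambda a, b, c: (b, c) == (b0, 0)) / pc0)
    assert abs(lhs - rhs) < 1e-12
print("A and B are contextually independent given c0")
```

As the text notes, the check is trivial here: C = 0 forces both A and B to 0, so the conditional distribution is degenerate and factorizes.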
3 Asymmetric dependencies
Aside from deterministic functions, what other types of regularity can we find in CPDs? A common type of regularity arises when we have precisely the same effect in several contexts. We can see such a regularity in a modified version of the Alarm example.
Example 3.1: Suppose that the house owner often forgets to turn on the alarm. To model this, we add to the network of Example ?? an additional variable "On" that denotes whether the alarm was turned on on that day. The structure of the modified network is shown in Figure 4.
Now we need to describe the CPD P(A | O, B, E). Clearly, if the alarm was not turned on, i.e., O = o0, then it will not be active regardless of the values of B and E. This implies that in the four cases corresponding to values of O, B, E in which O = o0, the probability of alarm is zero (or a very, very small number ε = 10^{-10} if we want to account for extremely unlikely occurrences such as lightning strikes temporarily powering the alarm). That is, P(a1 | o0, b, e) = ε for all values b and e.
Figure 4: (a) The Alarm example modified to include the probability that the alarm was turned on; the variables are Earthquake, Burglary, Alarm, Call, Radio, and On. (b) The reduced graph after we remove spurious arcs given the context o0.
Figure 5: Two tree representations for the CPD P(A | O, B, E). Internal nodes in the tree denote tests on parent variables. Leaves are annotated with the probability of A = a1.
In this simple example, we have a CPD in which four possible values of Pa_A yield the same conditional probability over A. How do we represent such regularity? A simple approach is to use a tree representation.
For example, Figure 5 shows two trees we might consider for the CPD of A in Example 3.1. Given a tree, we find P(A | o, b, e) by traversing the tree from the root downward. At each internal node, we see a test on one of the attributes. For example, in the root node of the tree in Figure 5(a) there is a test on the value of O. We then follow the branch that is labeled with the value of O in the case we are interested in. Thus, if O = o0, we would reach the leaf labeled with ε. Once we reach a leaf, we return the conditional distribution associated with it.
Formally, we use the following recursive de�nition of trees.
Definition 3.2: A CPD-tree representing a CPD for variable X is a rooted tree; each t-node in the tree is either a leaf t-node or an interior t-node. Each leaf is labeled with a distribution over X. Each interior t-node is (a) labeled with some variable Z ∈ Pa_X, and (b) associated with a set of arcs to its children, one arc for each zi ∈ Val(Z), with each arc labeled by its zi.
A branch through a CPD-tree is a sequence of t-nodes and arcs beginning at the root and proceeding to a leaf node. The assignment induced by a branch b is the assignment to the set Z ⊆ Pa_X where each element Z ∈ Z labels an interior node of b and is assigned the value z that labels the corresponding arc lying on b. We generally assume that a decision tree is irredundant, that is, no branch b contains two interior nodes labeled by the same variable.
Note that, to avoid confusion, we use t-nodes and arcs for a CPD-tree, as opposed to nodes and edges for a BN.
To illustrate this definition, consider the tree in Figure 5(a). There are five branches in this tree. One induces the assignment o0, and corresponds to the situation where the alarm was turned off. The other four induce complete assignments to all the parents of A: ⟨o1, b0, e0⟩, ⟨o1, b1, e0⟩, ⟨o1, b0, e1⟩, and ⟨o1, b1, e1⟩. Thus, this representation breaks down the conditional distribution of A given its parents into five conditions, grouping several of the possible conditions into one.
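A CPD-tree such as Figure 5(a) can be encoded with nested tagged tuples and traversed exactly as described above. The leaf probabilities below are illustrative stand-ins chosen to match the noisy-or table of Section 4, since the figure's exact values are not stated in the text:

```python
# A minimal CPD-tree for Figure 5(a): an interior t-node is
# ("test", variable, {value: child}); a leaf holds a distribution over A.

eps = 1e-6  # stand-in for the near-zero probability at the o0 leaf

def leaf(p_a1):
    return ("leaf", {"a1": p_a1, "a0": 1.0 - p_a1})

tree = ("test", "O", {
    "o0": leaf(eps),                       # alarm not turned on
    "o1": ("test", "B", {
        "b0": ("test", "E", {"e0": leaf(0.0001), "e1": leaf(0.6)}),
        "b1": ("test", "E", {"e0": leaf(0.9),    "e1": leaf(0.96)}),
    }),
})

def cpd_lookup(tree, assignment):
    """Traverse from the root, following the arc labeled with the value the
    assignment gives to the tested variable, until a leaf is reached."""
    while tree[0] == "test":
        _, var, children = tree
        tree = children[assignment[var]]
    return tree[1]

print(cpd_lookup(tree, {"O": "o1", "B": "b1", "E": "e1"})["a1"])  # 0.96
```

Note that once O = o0 is followed, the values of B and E are never consulted; this is precisely the regularity the tree captures.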
To elaborate the representation a little, consider a somewhat different example.
Example 3.3: Suppose we now have a different alarm system in which the wires cannot move, so an earthquake cannot directly cause a contact in the wires and trigger the alarm. This alarm system uses a sensitive motion sensor, one that is set off even by the motion of objects caused by an earthquake. A burglary causes the alarm if the burglar did not manage to disable the alarm. However, once the burglar disables the alarm, an earthquake can no longer set it off. What type of interaction do we get now? We know that P(A | o1, b1, e0) and P(A | o1, b1, e1) are the same: if a burglary attempt succeeded, the alarm is disabled and would not be triggered by an earthquake. On the other hand, if it failed, the alarm is set off, and again, an earthquake would not change the final outcome.
This type of regularity is represented by the tree in Figure 5(b). In this tree there is one branch that induces the assignment o1, b1. Thus, for both cases we mention above, we would use the same conditional distribution.
Regularities of this type occur in many other situations. As one example, we can have a Wet variable, denoting whether I get wet; that variable would depend on the Raining variable, but only in the context where I am outside. Another very common situation where this type of regularity occurs is when we have actions in our model; in these cases, the set of parents for a variable may vary considerably based on my action. For example, let us revisit the Travel-time example from
Figure 6: (a) A network for the travel-time example, with Road, T101, and T280 as parents of Time, and (b) a tree representation of the CPD P(Time | Road, T101, T280).
Figure 7: Two equivalent trees.
the previous chapter. Recall that Time, the travel time for getting to work, may depend on both Traffic101 and Traffic280, but only on the one for the road I actually took. Figure 6 shows how we might represent this example. Configuration variables also result in such situations. As a real-life example, in a printer-diagnosis BN, the printer can be hooked up to the network either via an ethernet cable or via a local cable. The status of the ethernet cable only affects the printer if the printer is hooked up to it.
What is the semantics of the tree representation? As we have seen, to compute P(X | pa_X) we need to find the unique branch that is consistent with pa_X and return the distribution associated with it. Thus, the form of the tree is not crucial; only the assignments defined by the branches are. The two trees in Figure 7 are equivalent in the sense that they define the same branches, and assign the same conditional probabilities to each branch.
If we abstract away from the details of the tree representation, we see that what we are representing are the partitions of Val(Pa_X) defined by the branches of a tree. This also allows us to see what can be represented as a tree. Every block of a partition defined by a tree must have a description via an assignment to a subset of the variables. Thus, we cannot represent a partition that contains only o1, b0, e1 and o1, b1, e0. (Of course, we can use two branches with the same conditional probability in this example, but then we are not capturing some parts of the structure of the CPD.)
This immediately suggests other possible representations of partitions. For example, we might use logical formulas to describe partitions. This is a very flexible representation that can describe any partition we might consider, but the formulas might get quite long.
Can we characterize the regularities represented by a tree, or more generally, by any representation of partitions? As for deterministic CPDs, these structures induce properties of context-specific independence. If we consider Example 3.1, then once we know that the alarm is off, A is independent of B and E. Again, this is a context-specific independence that holds only for a particular value of O. In other words, the CPD P(A | O, B, E) satisfies Ic(A; B, E | o0). The tree of Figure 5(b) describes another CSI, Ic(A; E | o1, b1): if the alarm is on and there was a burglary, the alarm sound cannot be influenced by an earthquake.
These two examples might suggest that the only contexts that induce CSI are those defined by complete branches, ones that go all the way from the root of a CPD-tree to a leaf. This is not
necessarily the case. Consider the CPD of Figure 6. In this example, once I decide to drive via highway 101, my travel time does not depend on the traffic load on highway 280. Thus, we have the property Ic(Time; T280 | Road = 101).
Of course, we want a systematic way of deducing CSI properties from a tree representation. To do so, we need to consider how a specific context influences a tree. Consider again the tree of Figure 5(b), and suppose we are given the context b1. Clearly, we now should focus only on branches that are consistent with this value. There are two such branches: one induces the assignment o0 and the other the assignment o1, b1. We can immediately see that the choice between these two branches does not depend on the value of E. Thus, we conclude that Ic(A; E | b1) holds in this case.
This line of reasoning can be generalized by using the following de�nition.
Definition 3.4: Let T be a decision tree over some set of variables Z, and let c ∈ Val(C), for C ⊆ Z, be a context. The reduced tree with respect to c, denoted T^c, is defined recursively as follows. Let r be the root of T.
- If r is a leaf node: T^c = T.
- If r is an interior node, then it is labeled with some variable Z, and T consists of r together with immediate subtrees T_1, ..., T_k, associated with the values z_1, ..., z_k of Val(Z):
  - if Z is not in C: we set T^c to be r together with the subtrees T_1^c, ..., T_k^c;
  - if Z is in C: we set T^c = T_j^c, where T_j is the subtree associated with the arc labeled with the value z_j consistent with c.
The reduced tree is the tree we need to traverse in order to get to the conditional probability if we know that C = c. If a variable does not appear in the reduced tree, then the choice of conditional distribution does not depend on it.
Proposition 3.5: Let P(X | Pa_X) be a CPD that can be represented by a CPD-tree T, let c ∈ Val(C) for C ⊆ Pa_X be a context, and let Z ⊆ Pa_X. If T^c does not test any variable in Z, then P ⊨ Ic(X; Z | Pa_X − Z, c).

This proposition provides a computational tool for deducing "local" CSI relations from the tree representation: we can check in linear time whether a variable Z is tested in the reduced tree given a context. This procedure, however, is incomplete in two ways. First, since the procedure does not examine the actual parameter values, it can miss additional independencies that are true for the specific parameter assignments. However, as in the case of completeness for d-separation in BNs, this violation only occurs in degenerate cases. Here, the degeneracy required to induce a violation of completeness is even more obvious than for BNs: if P satisfies an independence of the form Ic(X; Z | Pa_X − Z, c) that is not reported by this procedure, then two of the distributions at the leaves of the CPD-tree must be identical.
Proposition 3.6: Let P(X | Pa_X) be a CPD that can be represented by a CPD-tree T, where all of the distributions at the leaves of the tree are distinct. Then for any C, Z ⊆ Pa_X and c ∈ Val(C), we have that T^c does not test any variable in Z if and only if P ⊨ Ic(X; Z | Pa_X − Z, c).
The more severe limitation of this procedure is that it only tests for independencies between X and some of its parents, given a context and the other parents. Are there other, more global,
procedure CSI-sep(
    Graph,    // network structure
    P,        // a distribution that satisfies Markov(G)
    c,        // a context
    X, Y, Z   // query
)
  let G' be a duplicate of G
  for each edge Y → X in G
    if Y → X is spurious given c in P then
      remove Y → X in G'
  return d-sep_{G'}(X; Y | Z, C)

Figure 8: Procedure for computing d-separation in the presence of asymmetric dependencies in CPDs.
implications of such CSI relations? Consider Example 3.1 again. Suppose we know that the alarm is off (i.e., O = o0). Then, our intuition is that hearing a radio report about an earthquake would not affect the probability of receiving a phone call from the neighbor: since the alarm is off, an earthquake cannot trigger it, and so the probability of alarm does not increase due to the higher probability that there was an earthquake. (Note that when the alarm is on, we should anticipate a phone call after hearing the news report; see Section ??.)
Can we capture this intuition formally? Consider the dependence structure in the context O = o0. Intuitively, in this context the edge E → A is redundant, since we know that Ic(A; E | o0). Thus, our intuition is that we should check for d-separation in the graph without this edge. Indeed, we can show that this is a sound check for CSI conditions.
We start by formally defining the set of parents that are irrelevant given a context. Intuitively, we want to say that Y is irrelevant if X is independent of Y given the context and the other parents. We have to be careful, though, since the context might include other variables outside the family of X that can cause X and Y to be dependent in a non-local fashion (e.g., c contains a common descendant of both X and Y). Thus, we use the following definition.
Definition 3.7: Let G be a network structure, let P be a distribution such that P ⊨ Markov(G), and let c be a context. Define c|Z to be the context restricted to the variables in Z. An edge Y → X in G is spurious in the context c if Ic(X; Y | Pa_X − {Y}, c|Pa_X) holds in P.
It is easy to see that if we represent CPDs with decision trees, then we can determine whether an edge is spurious by examining the reduced tree: an edge Y → X is spurious if Y does not appear in the reduced tree for P(X | Pa_X). Thus, for trees, this definition has an efficient procedural implementation. For many other representations of asymmetric CPDs we also have efficient procedures for identifying spurious edges.
Now we can de�ne a variant of d-separation that takes CSI into account. This procedure isstraightforward: we use local considerations to remove spurious edges, and then apply standardd-separation to the resulting graph. See Figure 8 for pseudo-code for this procedure.
As an example, reconsider the modified Alarm example, with the context O = o0. In this case, we get that the arcs B → A and E → A are spurious, and thus the reduced graph is the one shown in Figure 4(b). As we can see, R and C are d-separated in the reduced graph. Thus, using CSI-separation we get that R and C are d-separated given the context o0.
An immediate question that we should address is whether this procedure is reliable. That is, is it sound? As expected, it is not hard to show that it is indeed sound.
Theorem 3.8: Let G be a network structure, let P be a distribution such that P ⊨ Markov(G), let c be a context, and let X, Y, Z be sets of variables. If CSI-sep(G, P, c, X, Y, Z) returns true, then P ⊨ Ic(X; Y | Z, c).
Proof: See Exercise ??
Of course, we also want to know whether CSI-separation is complete. That is, does it report all the independencies in the distribution? Here the answer is more complex. In general, CSI-separation is not complete.
To see a simple counterexample, consider the example of Figure 6. In this example, CSI-separation will report that T101 and T280 are separated given Time and the context Road = 101. (To see this, note that T280 → Time is spurious given Road = 101, and thus there is no path between the two variables.) Similarly, if we consider the context Road = 280, we also have that T101 and T280 are separated given Time. Thus, reasoning by cases, we conclude that once we know the value of Road, we have that T101 and T280 are independent given Time.
Can we get this conclusion using a single invocation of CSI-separation? Unfortunately, in general, the answer is no. If we invoke CSI-separation with the empty context, then no edges are spurious and CSI-separation reduces to d-separation. Since both T101 and T280 are parents of Time, we conclude that they are not separated given Time and Road.
The problem here is that CSI-separation does not perform reasoning by cases. Of course, if we want to determine whether X and Y are independent given Z and a context c, we can invoke CSI-separation on the context c, z for each possible value z of Z, and see if X and Y are separated in all of these contexts. This procedure, however, is exponential in the number of variables in Z. Thus, it is practical only for small evidence sets. Can we do better than reasoning by cases? The answer is that sometimes we cannot. See Exercise ?? for a more detailed examination of this issue.
4 Independence of causal influence
We now describe another, very different, type of structure in the local probability model. Let us reconsider the Alarm example, but now make different assumptions about the alarm. Why does a burglary cause the alarm to go off? Perhaps because it activates the motion sensors. Why does an earthquake cause the alarm to go off? Perhaps because it jiggles some wires. But what happens if both occur? We can assume that these are two independent causal mechanisms, and that the alarm fails to go off only if neither of these two mechanisms worked.
Assume that P(a1 | b1, e0) = 0.9 and P(a1 | b0, e1) = 0.6. In the case b1, e1, the burglary fails to set off the alarm with probability 0.1, the earthquake fails to set it off with probability 0.4, the alarm fails to go off only if both mechanisms fail, and these failures occur independently; hence, the alarm fails to go off with probability 0.1 × 0.4 = 0.04. In other words, our CPD for P(A | B, E) is:
         b0,e0   b0,e1   b1,e0   b1,e1
    a1   0       0.6     0.9     0.96
    a0   1       0.4     0.1     0.04
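The table above can be recomputed from the two noise parameters, since the alarm fails only when every active mechanism fails independently. A small sketch (the variable names are ours):

```python
# Reconstructing the noisy-or table for P(A | B, E): the alarm fails to go
# off only if both independent causal mechanisms fail.

lam_b, lam_e = 0.9, 0.6   # noise parameters from the text

def p_alarm(b, e):
    """P(a1 | b, e) with b, e in {0, 1}; no leak term yet."""
    p_fail = (1 - lam_b) ** b * (1 - lam_e) ** e
    return 1 - p_fail

for b in (0, 1):
    for e in (0, 1):
        print(f"P(a1 | b{b}, e{e}) = {p_alarm(b, e):.2f}")
# In particular, P(a1 | b1, e1) = 1 - 0.1 * 0.4 = 0.96, as in the table.
```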
Here, we assume for simplicity that there are no spontaneous alarms, i.e., alarms not caused by one of these mechanisms. We relax this assumption later on.
Figure 9: Decomposition of the noisy-or model for Alarm. Burglary is the parent of a Motion node, with P(m1 | b1) = 0.9 and P(m1 | b0) = 0; Earthquake is the parent of a Wire-jiggle node, with P(j1 | e1) = 0.6 and P(j1 | e0) = 0; Alarm is a deterministic function of Motion and Wire jiggle.
An alternative way of understanding this interaction is by assuming that the behavior of the alarm is the one induced by a more elaborate probabilistic model, as represented by the network fragment in Figure 9. This figure represents the conditional distribution for the Alarm node given Burglary and Earthquake; it also uses two intermediate nodes that reveal the associated causal mechanisms. It is easy to verify that the conditional distribution P(A | B, E) induced by this network is precisely the one shown above.
The probability that B causes A (0.9 in this example) is called the noise parameter, denoted λB. In the context of our decomposition, λB = P(m1 | b1). Similarly, we have a noise parameter λE, which in this context is λE = P(j1 | e1). We can also introduce a leak probability that represents the probability that the alarm goes off spontaneously, by adding another node to the network. This node has no parents, and is true with probability λ0 = 0.0001. It is also a parent of the Alarm node, which remains a deterministic or.
The decomposition of this CPD clearly shows why this local probability model is called a noisy-or model. The basic interaction of the effect with its causes is that of an or, but there is some noise in the "effective value" of each cause.
We can define this model in a more general setting:
Definition 4.1: Let A be a binary-valued random variable with n binary-valued parents X1, ..., Xn. The CPD P(A | X1, ..., Xn) is a noisy-or if there are n + 1 noise parameters λ0, λ1, ..., λn such that

    P(a0 | X1, ..., Xn) = (1 − λ0) ∏_{i : Xi = xi¹} (1 − λi)
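Definition 4.1 can be implemented directly, including the leak term. A sketch (the 0/1 encoding of parent values is ours):

```python
# A direct implementation of the noisy-or CPD of Definition 4.1:
# P(a0 | x) = (1 - lam0) * prod over active parents of (1 - lam_i).

def noisy_or(lam0, lams, xs):
    """P(a1 | x_1, ..., x_n) for leak lam0 and noise parameters lams,
    with parent values xs encoded as 0/1."""
    p_a0 = 1 - lam0
    for lam_i, x_i in zip(lams, xs):
        if x_i == 1:
            p_a0 *= 1 - lam_i
    return 1 - p_a0

# The Alarm example, with a leak of lam0 = 0.0001:
print(noisy_or(0.0001, [0.9, 0.6], [1, 1]))   # close to 0.96
print(noisy_or(0.0001, [0.9, 0.6], [0, 0]))   # the leak alone
```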
The noisy-or interaction is a special case of a general class of local probability models, called causal independence, or independence of causal influence. These models all share the property that the influence of multiple causes can be decomposed into separate influences of each one. More precisely:
Definition 4.2: Let A be a random variable with parents X1, ..., Xn. The CPD P(A | X1, ..., Xn) exhibits independence of causal influence if it can be induced by a network fragment with the structure shown in Figure 10, where the CPD of A is a deterministic function.
Independence of causal influence is a very useful model, with many instantiations, corresponding to different noise models P(Yi | Xi) and different deterministic functions. For example, the
Figure 10: Independence of causal influence: each parent Xi has a corresponding noise node Yi, and A is a deterministic function of Y1, ..., Yn.
noisy-max model is very useful in medical diagnosis. Here, the parents X_i correspond to the severity of various diseases that the patient might have, the Y_i correspond to the extent to which these diseases influence a particular symptom, and the severity of the symptom (e.g., Fever) is a max of these severities.
These types of models turn out to be very useful in practice, both because of their cognitive plausibility and because they provide a significant reduction in the number of parameters required to represent the distribution. The number of parameters in the CPD is linear in the number of parents, as opposed to the usual exponential. Consider, for example, the CPCS network, developed for the diagnosis of various internal diseases, shown in Figure 11. The network is specified using 8,254 parameters, as opposed to almost 134 million (133,931,430) for a network with full CPTs. Causal independence also has computational benefits, which we will discuss later.
5 Hierarchical Models
Another very useful type of local probability model is one where the CPD is, itself, defined via a Bayesian network fragment. As a very simple example, consider our decomposition of the noisy-or CPD for the Alarm variable. There, our decomposition represented a model of how the alarm really worked on the inside. The model included explicit variables for the various relevant attributes of the alarm, along with their dependency model. These internal variables and their dependencies were encapsulated inside the Alarm model. Externally, to the rest of the network, we could still view the alarm as a single node with its two inputs: Burglary and Earthquake. All of the internal structure was encapsulated within the alarm. The entire network fragment was a structured description of something which, for the rest of the network, behaved exactly like a simple CPD.
In general, we can have a hierarchical model where the CPD for a variable X given its parents Y_1, ..., Y_k is defined via a separate Bayesian network fragment. That fragment has Y_1, ..., Y_k as inputs; i.e., the fragment doesn't specify a distribution over these variables. Rather, the fragment specifies a conditional distribution over the rest of the variables in the fragment, given the inputs. The output of the fragment is the variable X. By marginalizing over all of the internal variables (all except the inputs and outputs), the network represents a conditional distribution P(X | Y_1, ..., Y_k), as desired. Again, the definition of the distribution is implicit, but can be computed when necessary. This implicit definition can be much more compact than a full CPT.
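A minimal sketch of computing such an implicit CPD, using the alarm decomposition as the fragment: we enumerate the fragment's internal variables and sum them out. The leak 0.0001 and λ_B = 0.9 come from the text; the intermediate-node names and the earthquake noise value 0.7 are invented for illustration.

```python
from itertools import product

LAM_B, LAM_E, LAM_0 = 0.9, 0.7, 0.0001  # lambda_E = 0.7 is an invented value

def p_mech(m, cause, lam):
    """P of an internal 'mechanism fires' node given its cause (0/1)."""
    p_fire = lam if cause else 0.0
    return p_fire if m else 1.0 - p_fire

def p_leak(l):
    return LAM_0 if l else 1.0 - LAM_0

def p_alarm(b, e):
    """P(A = 1 | B = b, E = e): marginalize the fragment's internal nodes."""
    total = 0.0
    for m, j, l in product([0, 1], repeat=3):
        a = 1 if (m or j or l) else 0  # Alarm is a deterministic or
        if a:
            total += p_mech(m, b, LAM_B) * p_mech(j, e, LAM_E) * p_leak(l)
    return total
```

Summing out the internals recovers exactly the noisy-or CPD; e.g., p_alarm(1, 1) equals 1 − (1 − 0.9)(1 − 0.7)(1 − 0.0001).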
This type of hierarchical model is clearly useful in device diagnosis tasks. There, the device is composed of many other devices. As far as the rest of the model is concerned, the internals of
Figure 11: The CPCS network for diagnosis of internal diseases. The network contains 448 nodes and 906 links.
Figure 12: Four levels of hierarchy in an OOBN model of a computer system.
the component are not relevant. Only its external behavior is. We can encapsulate the internal attributes of a component, making only its external behavior observable to the rest of the model. There is no reason to restrict our model to a single output attribute: the external model might depend on several aspects of the component status.
Furthermore, by defining a probability model for a type of object, say a disk drive, we can reuse it several times, e.g., if we have several disk drives in our computer system.
In Figure 12 we show a simple hierarchical model for a computer system. This language contains probabilistic classes for Computer, Motherboard, OS, Hard-Drive, Drive-Mechanism, Drive-Motor, and Disk-Surface. The Computer model has an attribute Has-Hard-Drive of class Hard-Drive; the Hard-Drive class, in turn, has attribute Has-Drive-Mechanism of class Drive-Mechanism. We can reuse our model to easily represent situations where an object has several components of the same type; for example, the Hard-Drive model contains attributes Has-Surface-1, Has-Surface-2, Has-Surface-3 and Has-Surface-4, all of class Disk-Surface. There are also a large number of simple attributes, such as Hard-Drive.Status with values {Good, Minor-Damage, Major-Damage, Unreadable}. Most of the different components, in fact, have a Status attribute. Although they have the same name, they are in fact different attributes, because they are attributes of different objects.
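The reuse idea can be sketched in code. The following Python fragment is an illustration only (not part of any OOBN system, and with invented probabilities): one class model, Disk-Surface, is instantiated four times inside Hard-Drive, and only the Status output is visible outside the class.

```python
import random

class DiskSurface:
    """One class model, reusable wherever a surface attribute is needed."""
    def sample_damage(self, temperature_high):
        # internal attribute; the probabilities are invented for illustration
        p = 0.10 if temperature_high else 0.01
        return random.random() < p

class HardDrive:
    STATUS = ["Good", "Minor-Damage", "Major-Damage", "Unreadable"]

    def __init__(self):
        # Has-Surface-1 ... Has-Surface-4: four attributes of class Disk-Surface
        self.surfaces = [DiskSurface() for _ in range(4)]

    def sample_status(self, temperature_high):
        # the input comes from the enclosing Computer object; only the
        # output Status is observable to the rest of the model
        damaged = sum(s.sample_damage(temperature_high) for s in self.surfaces)
        return self.STATUS[min(damaged, 3)]
```

Each DiskSurface instance carries its own encapsulated state, even though all four share one class definition, mirroring the Has-Surface-1 through Has-Surface-4 attributes above.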
The Hard-Drive class has inputs Temperature, Age and OS-Status, and outputs Status and Full. Although the hard drive has a rich internal state, the only aspects of its state that influence objects outside the hard drive are whether or not it is working properly and whether or not it is full. The value of the Temperature input of the hard drive in a computer will be obtained from the value of the Temperature attribute of the computer itself. A similar process happens for other inputs.
Besides showing the dependency graph for the classes Computer, Hard-Drive, Drive-Mechanism and Drive-Motor, the figure also indicates other aspects of the class model. Complex attributes
(ones with a hierarchical model) are shown as rectangles, while simple attributes are ellipses. Each class model is contained in a box. Input attributes intersect the top edge of the box, indicating the fact that their values are received from outside the class, while output attributes intersect the bottom. The rectangles representing the complex components also have little bubbles on their borders, showing that attributes are passed into and out of those components.
6 Continuous Variables
So far, we have restricted attention to discrete variables with finitely many values. What if one or more variables have infinitely many values? Clearly, we can't even consider the idea of using tables as a representation. This situation is quite common: many of the attributes we want to represent actually take values in a continuous space: temperature, velocity, location, pressure, etc. One solution, which is often used, is to discretize these variables. While this is often done, it is not ideal. In order to get a reasonably accurate model, we often have to use a fairly fine discretization, leading to very large CPTs. For example, in the application of probabilistic models to robot localization (which we will discuss later on), the resolution required for the discretized version was 2° for the angle (resulting in 180 values for the variable) and 15cm for the x and y location variables. For a reasonably sized environment, the resulting representation had around 150 million states.
The view of a probability distribution as a function allows us to provide an alternative solution. All we need is a way of representing the CPD P(X | Pa_X) in some computer-readable form.
6.1 Density functions
To understand this issue better, let's consider what a continuous distribution looks like. A probability density function (PDF) p is shown in Figure 13.
The probability of the variable being in some range [a, b] is simply

P(X ∈ [a, b]) = ∫_a^b p(x) dx.

In particular,

∫_{−∞}^{∞} p(x) dx = 1.
It is important to understand the difference between the density function p and the associated probability distribution P. At one level, we can view the height of the density function p at each point as representing the "probability" of the variable taking that value. However, that perspective is somewhat simplified. First, the actual probability of any given value x is 0. Furthermore, the value p(x) is not necessarily in the range [0, 1]. (The only requirement is that the function be non-negative and integrate to 1.) A somewhat more accurate intuition is that the "height" p(x) is the "contribution" that the value x adds to the integral that allows us to compute P.
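To make the point concrete, here is a small numeric check (an illustration, not from the notes): a narrow Gaussian has density greater than 1 at its peak, yet it still integrates to 1.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu; sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# N(0; 0.01) has standard deviation 0.1, so its peak height is about 3.99 > 1 ...
peak = gaussian_pdf(0.0, 0.0, 0.01)

# ... but a crude Riemann sum over [-2, 2] (20 standard deviations) is still ~1.
dx = 0.001
total = sum(gaussian_pdf(i * dx, 0.0, 0.01) for i in range(-2000, 2000)) * dx
```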
As is usual for continuous functions, we represent them using some algebraic formula. There are many classes of density functions, each associated with some particular template for the algebraic formula. Specific densities are instantiations of this template, with actual values substituted for certain parameters in the template. The most commonly used density function is the Gaussian (normal) distribution. In the univariate case, the Gaussian distribution is parameterized by two
Figure 13: Three univariate Gaussians. (a) Mean 0 and variance 1. (b) Mean 1 and variance 1. (c) Mean 0 and variance 4.
parameters: a mean μ and a variance σ². The template has the following form:

p(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)).

We typically denote this density function using the notation N(μ; σ²). Intuitively, we can view the expression inside the exponent as (half) the squared number of standard deviations σ that x is away from the mean μ. The more standard deviations x is from the mean, the lower its probability. In fact, the probability gets exponentially lower as x gets further away from the mean. Figure 13 shows three examples of Gaussian distributions, for different values of the parameters.
6.2 Conditional distributions
Marginal PDFs are a useful building block, but a BN node is associated with a conditional distribution. In a hybrid probabilistic model (one involving both discrete and continuous variables), there are four types of dependencies we should think about representing:

- a discrete node with a discrete parent
- a continuous node with a discrete parent
- a continuous node with a continuous parent
- a discrete node with a continuous parent
Of these, the first is the case that we have been exploring until now. We give only one example of each of the others, simply to illustrate the basic principles.
Let us first consider a continuous node with a discrete parent. As we discussed above, one possible CPD for a single continuous node X is the Gaussian distribution; this can be represented using two parameters: the mean and the variance. The simplest way of making the continuous node X depend on a discrete node U is to define a different set of parameters for every value of the discrete parent. More precisely, for every value u ∈ Val(U), the CPD for X has parameters μ_u and σ²_u. The CPD for X is then:

p(X | u) = N(μ_u; σ²_u).

It is clear that this model extends easily to multiple discrete parents U: we simply have a different set of parameters for every instantiation of values u ∈ Val(U).
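This CPD can be sketched in a few lines of Python; the (μ_u, σ²_u) pairs below are invented values, one per value of the discrete parent.

```python
import random

# invented parameters: (mu_u, sigma^2_u) for each value u of the discrete parent U
PARAMS = {
    "u0": (0.0, 1.0),
    "u1": (5.0, 2.0),
}

def sample_x_given(u):
    """Sample X ~ N(mu_u; sigma^2_u) for the given value u of U."""
    mu, sigma2 = PARAMS[u]
    return random.gauss(mu, sigma2 ** 0.5)  # random.gauss takes a std. deviation
```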
Now, let's consider a continuous node X with a continuous parent Y. Again, one simple solution is to decide to model the distribution of X as a Gaussian, whose parameters depend on the value of Y. In this case, we need to have a set of parameters for every one of the infinitely many values y ∈ Val(Y). The simplest and most common solution is to decide that the mean of X is a linear function of Y, and that the variance of X does not depend on Y. For example, we might have that

p(X | y) = N(−2y + 0.9; 1).
This type of dependence is called a linear Gaussian model. It extends to multiple continuous parents in a straightforward way:

Definition 6.1: Let X be a continuous node with continuous parents Y_1, ..., Y_k. We say that X has a linear Gaussian model if there exist parameters a_0, ..., a_k and σ² such that

p(X | y_1, ..., y_k) = N(a_0 + a_1 y_1 + ··· + a_k y_k; σ²).
We can easily extend this model, of course, to have the mean and variance of X depend on the value y of Y in any way we want. For example, we might have that the mean of X is sin(y) and the variance y²/7. However, the linear Gaussian model is a very natural one, which is useful in many practical applications. One reason is that this type of linear dependence is often quite natural: the position of a robot at time t can often be viewed as a linear function of its position at time t − 1 and its velocity at time t − 1, with some white (Gaussian) noise. Another reason is that linear Gaussian dependencies give rise to multivariate Gaussian joint distributions.
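The robot-motion case can be sketched directly from Definition 6.1; the time step, coefficients and noise variance below are invented for illustration.

```python
import random

def sample_linear_gaussian(ys, a, sigma2):
    """Sample X ~ N(a_0 + sum_i a_i * y_i ; sigma^2)  (Definition 6.1)."""
    mean = a[0] + sum(ai * yi for ai, yi in zip(a[1:], ys))
    return random.gauss(mean, sigma2 ** 0.5)

# Position at time t as a linear function of the previous position and velocity:
# X_t = X_{t-1} + dt * V_{t-1} + white noise, with invented dt = 0.1 and
# noise variance 0.01.
x_t = sample_linear_gaussian([2.0, 1.0], a=[0.0, 1.0, 0.1], sigma2=0.01)
```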
More precisely, let X_1, ..., X_n be a set of random variables. We say that a joint PDF over X_1, ..., X_n is a multivariate Gaussian if it has the form:

p(x_1, ..., x_n) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp(−(1/2) (x − μ)^T Σ^{−1} (x − μ)),

where μ is the n-dimensional mean vector and Σ is an n × n covariance matrix, in which Σ_{i,i} represents the variance of X_i and Σ_{i,j} for i ≠ j represents the covariance of X_i and X_j. Figure 14 shows two multivariate Gaussians, one where the covariances are zero, and one where they are positive.
It turns out that continuous Bayesian networks with linear Gaussian models are equivalent to multivariate Gaussians:

Theorem 6.2: Every continuous Bayesian network where all of the dependency models are linear Gaussian defines a multivariate Gaussian distribution. Conversely, every multivariate Gaussian distribution can be represented as a Bayesian network with linear Gaussian models.
In fact, every multivariate Gaussian distribution (except the one where all variables are independent) has multiple representations as a BN, with different structures. For example, the distribution in Figure 14(b) can be represented either as the network where X → Y or as the network where Y → X.
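For the two-node case, the forward direction of this equivalence can be worked out by hand; the following sketch (with invented parameter values) computes the joint mean and covariance induced by the network X → Y, where X ~ N(μ_x; s²_x) and Y | X = x ~ N(ax + b; s²_y).

```python
# X ~ N(mu_x; s2_x),  Y | X = x ~ N(a*x + b; s2_y)   (invented values)
mu_x, s2_x = 0.0, 1.0
a, b, s2_y = 2.0, 1.0, 0.5

# The induced joint over (X, Y) is the multivariate Gaussian with:
mean = [mu_x, a * mu_x + b]
cov = [[s2_x,     a * s2_x],
       [a * s2_x, a * a * s2_x + s2_y]]
```

The same joint could equally well be produced by a network with the edge Y → X and suitably chosen linear Gaussian parameters, consistent with the multiple-representations observation above.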
We can extend the class of linear Gaussian networks to allow the continuous nodes to have discrete parents. The idea is the same as the one used above. If a node X has continuous parents Y_1, ..., Y_k and discrete parents U, we simply parameterize it as follows: for every u ∈ Val(U), we have a_{u,0}, ..., a_{u,k} and σ²_u. Then

p(X | u, y) = N(a_{u,0} + Σ_{i=1}^k a_{u,i} y_i; σ²_u).
Figure 14: Gaussians over two variables X and Y. (a) X and Y uncorrelated. (b) X and Y correlated.
This dependency model is called a conditional linear Gaussian. It induces joint distributions that are mixtures (weighted averages) of Gaussians, with one component in the mixture for each value of the discrete network variables, and the weight of the component being the probability of that value. Note that the conditional linear Gaussian model does not allow for continuous nodes to have discrete children.
Finally, we move to the case of a discrete child with a continuous parent. The simplest model is a threshold model. Assume we have a binary discrete node U with a continuous parent Y. We can define:

P(u^1 | y) = 0.9 if y ≤ 65, and 0.05 otherwise.
Such a model may be appropriate, for example, if Y is the temperature (in degrees Fahrenheit) and U is the thermostat turning the heater on.
The problem with the threshold model is that the change in probability is discontinuous as a function of Y. A somewhat more reasonable model is the softmax model. Intuitively, the softmax CPD defines a set of R regions (for some parameter R of our choice). The regions are defined by a set of R linear functions over the continuous variables. A region is characterized as that part of the space where one particular linear function is higher than all the others. Each region is also associated with some distribution over the values of the discrete child; this distribution is the one used for the variable within this region. The actual CPD is a continuous version of this region-based idea, allowing for smooth transitions between the distributions in neighboring regions of the space.
More precisely, let U be a discrete variable, with continuous parents Y = {Y_1, ..., Y_k}. Assume that U has m possible values, {u_1, u_2, ..., u_m}. Each of the R regions is defined via two vectors of parameters θ^r, p^r. The vector θ^r is a vector of weights θ^r_0, θ^r_1, ..., θ^r_k specifying the linear function associated with the region. The vector p^r = {p^r_1, ..., p^r_m} is the probability distribution over u_1, ..., u_m associated with the region (i.e., Σ_{j=1}^m p^r_j = 1). The CPD is now defined as:

P(U = u_j | Y) = Σ_{r=1}^R w^r p^r_j,  where  w^r = exp(θ^r_0 + Σ_{i=1}^k θ^r_i Y_i) / Σ_{q=1}^R exp(θ^q_0 + Σ_{i=1}^k θ^q_i Y_i).

In other words, the distribution is a weighted average of the region distributions, where the weight of each "region" depends exponentially on
Figure 15: Expressive power of a generalized softmax CPD. (b) plots P(C = low | X), P(C = medium | X) and P(C = high | X) for a three-valued sensor.
how high the value of its defining linear function is, relative to the rest. The choice of the θ^r determines both the regions and the slope of the transitions between them; the choice of the p^r determines the distribution defining each region.
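The definition above transcribes directly into Python. The region parameters in the usage example are invented, loosely mimicking the three-valued sensor of Figure 15(b).

```python
import math

def softmax_cpd(ys, regions):
    """P(U = u_j | y) = sum_r w^r p^r_j, with softmax weights w^r.

    ys      -- values [y_1, ..., y_k] of the continuous parents
    regions -- list of (theta, p) pairs: theta = [theta_0, ..., theta_k] is the
               region's linear function, p = [p_1, ..., p_m] its distribution
    """
    scores = [t[0] + sum(ti * yi for ti, yi in zip(t[1:], ys))
              for t, _ in regions]
    top = max(scores)                   # subtract the max for numerical stability
    ws = [math.exp(s - top) for s in scores]
    z = sum(ws)
    m = len(regions[0][1])
    return [sum(w / z * p[j] for w, (_, p) in zip(ws, regions)) for j in range(m)]

# Three-valued sensor (low/medium/high), R = 3 regions over one continuous parent:
regions = [
    ([0.0, -2.0], [0.90, 0.09, 0.01]),  # "low" region: score high for small y
    ([0.0,  0.0], [0.05, 0.90, 0.05]),  # "medium" region
    ([0.0,  2.0], [0.01, 0.09, 0.90]),  # "high" region: score high for large y
]
dist = softmax_cpd([-2.0], regions)
```

Making the slopes ±2 shallower smooths the transitions between regions, and moving the p^r entries away from 0 and 1 makes the sensor inherently noisier, exactly the two knobs discussed below.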
The power to choose the number of regions R to be as large as we wish is the key to the rich expressive power of the generalized softmax CPD. Figure 15 demonstrates this expressivity. In Figure 15(a), we present an example CPD for a binary variable with R = 4 regions. In Figure 15(b), we show how this CPD can be used to represent a simple classifier. Here, U is a sensor with three values: low, medium and high. The probability of each of these values depends on the value of the continuous parent Y. Note that we can easily accommodate a variety of noise models for the sensor: we can make it less reliable in borderline situations by making the transitions between regions more moderate; we can make it inherently more noisy by having the probabilities of the different values in each of the regions be farther away from 0 and 1.
As for the conditional linear Gaussian CPD, our softmax CPD will have a separate component for each instantiation of the discrete parents.
We have chosen to focus on a small set of models. Of course, there is an unlimited range of representations that we can use: any parametric representation for a function of the appropriate type is fine in principle. Indeed, the continuous distributions used for the robot grid described at the beginning of this section were not all linear Gaussian models. The only difficulty, as far as representation is concerned, is in creating a language that allows for it. Other tasks, such as inference and learning, are a different issue. As we will see, these tasks are difficult even for very simple linear Gaussian hybrid models.