Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)


Transcript of Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Page 1: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Computer Science CPSC 502

Uncertainty Probability and Bayesian Networks

(Ch. 6)

Page 2: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Outline

• Uncertainty and Probability
• Marginal and Conditional Independence
• Bayesian networks

Page 3: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Where are we?

[Course-map diagram relating Environment (Deterministic vs. Stochastic), Problem Type (Static: Query; Sequential: Planning), Representations (Logics, STRIPS, Vars + Constraints, Belief Nets, Decision Nets, Markov Processes), and Reasoning Techniques (Search, Arc Consistency, Variable Elimination, Value Iteration)]

Done with Deterministic Environments

Page 4: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Where are we?

[Course-map diagram relating Environment (Deterministic vs. Stochastic), Problem Type (Static: Query; Sequential: Planning), Representations (Logics, STRIPS, Vars + Constraints, Belief Nets, Decision Nets, Markov Processes), and Reasoning Techniques (Search, Arc Consistency, Variable Elimination, Value Iteration)]

Second Part of the Course

Page 5: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Where Are We?

[Course-map diagram relating Environment (Deterministic vs. Stochastic), Problem Type (Static: Query; Sequential: Planning), Representations (Logics, STRIPS, Vars + Constraints, Belief Nets, Decision Nets, Markov Processes), and Reasoning Techniques (Search, Arc Consistency, Variable Elimination, Value Iteration)]

We’ll focus on Belief Nets

Page 6: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Two main sources of uncertainty (from Lecture 2)

• Sensing Uncertainty: The agent cannot fully observe a state of interest.
  For example:
  • Right now, how many people are in this building?
  • What disease does this patient have?
  • Where is the soccer player behind me?

• Effect Uncertainty: The agent cannot be certain about the effects of its actions.
  For example:
  • If I work hard, will I get an A?
  • Will this drug work for this patient?
  • Where will the ball go when I kick it?

Page 7: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Motivation for uncertainty

• To act in the real world, we almost always have to handle uncertainty (both effect and sensing uncertainty)
• Deterministic domains are an abstraction
  • Sometimes this abstraction enables more powerful inference
• Now we don’t make this abstraction anymore
• AI’s main focus shifted from logic to probability in the 1980s
  • The language of probability is very expressive and general
  • New representations enable efficient reasoning
  • We will see some of these, in particular Bayesian networks
• Reasoning under uncertainty is part of the ‘new’ AI
• This is not a dichotomy: the framework for probability is logical!
• New frontier: combine logic and probability

Page 8: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Probability as a measure of uncertainty/ignorance

• Probability measures an agent’s degree of belief in the truth of propositions about states of the world
• Belief in a proposition f can be measured in terms of a number between 0 and 1
  • this is the probability of f
  • e.g., P(“roll of fair die came out as a 6”) = 1/6 ≈ 16.7% = 0.167
  • P(f) = 0 means that f is believed to be definitely false
  • P(f) = 1 means that f is believed to be definitely true
• Using probabilities between 0 and 1 is purely a convention

Page 9: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Probability Theory and Random Variables

• Probability Theory
  • a system of logical axioms and formal operations for sound reasoning under uncertainty
• Basic element: random variable X
  • X is a variable like the ones we have seen in CSP/Planning/Logic, but the agent can be uncertain about the value of X
  • As usual, the domain of a random variable X, written dom(X), is the set of values X can take
• Types of variables
  • Boolean: e.g., Cancer (does the patient have cancer or not?)
  • Categorical: e.g., CancerType could be one of {breastCancer, lungCancer, skinMelanomas}
  • Numeric: e.g., Temperature (integer or real)
• We will focus on Boolean and categorical variables

Page 10: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Possible Worlds

• A possible world specifies an assignment to each random variable
• Example: weather in Vancouver, represented by random variables
  - Temperature: {hot, mild, cold}
  - Weather: {sunny, cloudy}
• There are 6 possible worlds:
• w ╞ f means that proposition f is true in world w
• A probability measure µ(w) over possible worlds w is a nonnegative real number such that
  - µ(w) sums to 1 over all possible worlds w
  (because for sure we are in one of these worlds!)

Weather Temperature

w1 sunny hot

w2 sunny mild

w3 sunny cold

w4 cloudy hot

w5 cloudy mild

w6 cloudy cold

Page 11: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Possible Worlds

• A possible world specifies an assignment to each random variable
• Example: weather in Vancouver, represented by random variables
  - Temperature: {hot, mild, cold}
  - Weather: {sunny, cloudy}
• There are 6 possible worlds:
• w ╞ f means that proposition f is true in world w
• A probability measure µ(w) over possible worlds w is a nonnegative real number such that
  - µ(w) sums to 1 over all possible worlds w
  (because for sure we are in one of these worlds!)
  - The probability of proposition f is defined by: P(f) = Σ_{w ╞ f} µ(w), i.e., the sum of the probabilities of the worlds w in which f is true

Weather Temperature µ(w)

w1 sunny hot 0.10

w2 sunny mild 0.20

w3 sunny cold 0.10

w4 cloudy hot 0.05

w5 cloudy mild 0.35

w6 cloudy cold 0.20
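To make this concrete, here is a minimal Python sketch of the six-world example (my own illustration, not from the slides): µ is stored as a table mapping each possible world to its probability, and P(f) sums µ(w) over the worlds where proposition f holds.

```python
# Each possible world is a (weather, temperature) assignment;
# mu maps each world to its probability measure (the values sum to 1).
mu = {
    ("sunny", "hot"): 0.10,
    ("sunny", "mild"): 0.20,
    ("sunny", "cold"): 0.10,
    ("cloudy", "hot"): 0.05,
    ("cloudy", "mild"): 0.35,
    ("cloudy", "cold"): 0.20,
}

def prob(f):
    """P(f) = sum of mu(w) over all worlds w in which proposition f is true."""
    return sum(p for w, p in mu.items() if f(w))

assert abs(sum(mu.values()) - 1.0) < 1e-9  # for sure we are in one of these worlds

print(prob(lambda w: w[0] == "sunny"))  # P(Weather=sunny) = 0.40
```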

Page 12: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example

• What’s the probability of it being cloudy or cold?
• µ(w3) + µ(w4) + µ(w5) + µ(w6) = 0.10 + 0.05 + 0.35 + 0.20 = 0.70

Weather Temperature µ(w)

w1 sunny hot 0.10

w2 sunny mild 0.20

w3 sunny cold 0.10

w4 cloudy hot 0.05

w5 cloudy mild 0.35

w6 cloudy cold 0.20

• Remember:
  - The probability of proposition f is defined by: P(f) = Σ_{w ╞ f} µ(w)
  - i.e., the sum of the probabilities of the worlds w in which f is true

Page 13: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Joint Probability Distribution

• Joint distribution over random variables X1, …, Xn:
  • a probability distribution over the joint random variable <X1, …, Xn> with domain dom(X1) × … × dom(Xn) (the Cartesian product)
• Think of a joint distribution over n variables as the n-dimensional table of the corresponding possible worlds
  • Each row corresponds to an assignment X1=x1, …, Xn=xn and its probability P(X1=x1, …, Xn=xn)
• E.g., the {Weather, Temperature} example:

Weather Temperature µ(w)

sunny hot 0.10

sunny mild 0.20

sunny cold 0.10

cloudy hot 0.05

cloudy mild 0.35

cloudy cold 0.20

Definition (probability distribution)
A probability distribution P on a random variable X is a function dom(X) → [0,1] such that Σ_{x ∈ dom(X)} P(X=x) = 1

Page 14: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Marginalization

• Given the joint distribution, we can compute distributions over subsets of the variables through marginalization:

  P(X=x) = Σ_{z ∈ dom(Z)} P(X=x, Z=z)    (marginalization over Z)

• We also write this as P(X) = Σ_{z ∈ dom(Z)} P(X, Z=z)
• This is simply an application of the definition of a probability measure!
• Remember?
  - The probability of proposition f is defined by: P(f) = Σ_{w ╞ f} µ(w)
  - i.e., the sum of the probabilities of the worlds w in which f is true

Page 15: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Marginalization

• Given the joint distribution, we can compute distributions over subsets of the variables through marginalization:
• We also write this as P(X) = Σ_{z ∈ dom(Z)} P(X, Z=z)
• This corresponds to summing out a dimension in the table.
• Does the new table still sum to 1?

Temperature µ(w)

hot ?

mild ?

cold ?

Weather Temperature µ(w)

sunny hot 0.10

sunny mild 0.20

sunny cold 0.10

cloudy hot 0.05

cloudy mild 0.35

cloudy cold 0.20

P(X=x) = Σ_{z ∈ dom(Z)} P(X=x, Z=z)    (marginalization over Z)

Page 16: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Marginalization

• Given the joint distribution, we can compute distributions over subsets of the variables through marginalization:
• We also write this as P(X) = Σ_{z ∈ dom(Z)} P(X, Z=z)
• This corresponds to summing out a dimension in the table.
• The new table still sums to 1. It must, since it’s a probability distribution!

Temperature µ(w)

hot ?

mild ?

cold ?

Weather Temperature µ(w)

sunny hot 0.10

sunny mild 0.20

sunny cold 0.10

cloudy hot 0.05

cloudy mild 0.35

cloudy cold 0.20

P(X=x) = Σ_{z ∈ dom(Z)} P(X=x, Z=z)    (marginalization over Z)

Page 17: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Marginalization

• Given the joint distribution, we can compute distributions over subsets of the variables through marginalization:
• We also write this as P(X) = Σ_{z ∈ dom(Z)} P(X, Z=z)
• This corresponds to summing out a dimension in the table.
• The new table still sums to 1. It must, since it’s a probability distribution!

Temperature µ(w)

hot 0.15

mild

cold

Weather Temperature µ(w)

sunny hot 0.10

sunny mild 0.20

sunny cold 0.10

cloudy hot 0.05

cloudy mild 0.35

cloudy cold 0.20

P(Temperature=hot) = P(Weather=sunny, Temperature=hot) + P(Weather=cloudy, Temperature=hot) = 0.10 + 0.05 = 0.15

P(X=x) = Σ_{z ∈ dom(Z)} P(X=x, Z=z)    (marginalization over Z)
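The summing-out step above translates directly into code; here is a short sketch (again my own illustration, not from the slides) that marginalizes Weather out of the joint table:

```python
from collections import defaultdict

# Joint distribution P(Weather, Temperature), as in the table above.
joint = {
    ("sunny", "hot"): 0.10, ("sunny", "mild"): 0.20, ("sunny", "cold"): 0.10,
    ("cloudy", "hot"): 0.05, ("cloudy", "mild"): 0.35, ("cloudy", "cold"): 0.20,
}

# P(T=t) = sum over z in dom(Weather) of P(Weather=z, Temperature=t)
p_temp = defaultdict(float)
for (weather, temp), p in joint.items():
    p_temp[temp] += p  # summing out the Weather dimension

print(dict(p_temp))  # {'hot': 0.15, 'mild': 0.55, 'cold': 0.30} -- still sums to 1
```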

Page 18: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Marginalization

• We can also marginalize over more than one variable at once

Weather µ(w)

sunny 0.40

cloudy

Wind Weather Temperature µ(w)

yes sunny hot 0.04

yes sunny mild 0.09

yes sunny cold 0.07

yes cloudy hot 0.01

yes cloudy mild 0.10

yes cloudy cold 0.12

no sunny hot 0.06

no sunny mild 0.11

no sunny cold 0.03

no cloudy hot 0.04

no cloudy mild 0.25

no cloudy cold 0.08

P(X=x) = Σ_{z1 ∈ dom(Z1)} … Σ_{zn ∈ dom(Zn)} P(X=x, Z1=z1, …, Zn=zn)

i.e., marginalization over Temperature and Wind

Page 19: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Marginalization

• We can also get marginals for more than one variable

Wind Weather Temperature µ(w)

yes sunny hot 0.04

yes sunny mild 0.09

yes sunny cold 0.07

yes cloudy hot 0.01

yes cloudy mild 0.10

yes cloudy cold 0.12

no sunny hot 0.06

no sunny mild 0.11

no sunny cold 0.03

no cloudy hot 0.04

no cloudy mild 0.25

no cloudy cold 0.08

Weather Temperature µ(w)

sunny hot 0.10

sunny mild

sunny cold

cloudy hot

cloudy mild

cloudy cold

P(X=x, Y=y) = Σ_{z1 ∈ dom(Z1)} … Σ_{zn ∈ dom(Zn)} P(X=x, Y=y, Z1=z1, …, Zn=zn)

Page 20: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Marginalization

• We can also get marginals for more than one variable
• This is still simply an application of the definition of a probability measure!

P(X=x, Y=y) = Σ_{z1 ∈ dom(Z1)} … Σ_{zn ∈ dom(Zn)} P(X=x, Y=y, Z1=z1, …, Zn=zn)

• Remember:
  - The probability of proposition f is defined by: P(f) = Σ_{w ╞ f} µ(w)
  - i.e., the sum of the probabilities of the worlds w in which f is true

Page 21: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Conditioning

Conditioning: revise beliefs based on new observations

• Build a probabilistic model (the joint probability distribution, JPD)
  • Take into account all background information
  • This is called the prior probability distribution
  • Denote the prior probability for proposition h as P(h)
• Observe new information about the world
  • Call all information we receive subsequently the evidence e
• Integrate the two sources of information to compute the conditional probability P(h|e)
  • This is also called the posterior probability of h given e

Page 22: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example for conditioning

• You have a prior for the joint distribution of weather and temperature:

Possible world Weather Temperature µ(w)

w1 sunny hot 0.10

w2 sunny mild 0.20

w3 sunny cold 0.10

w4 cloudy hot 0.05

w5 cloudy mild 0.35

w6 cloudy cold 0.20

T P(T|W=sunny)

hot 0.10/0.40=0.25

mild

cold

• Now, you look outside and see that it’s sunny
• You are now certain that you’re in one of worlds w1, w2, or w3
• To get the conditional probability P(T|W=sunny)
  • renormalize µ(w1), µ(w2), µ(w3) to sum to 1
  • µ(w1) + µ(w2) + µ(w3) = 0.10 + 0.20 + 0.10 = 0.40

Page 23: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example for conditioning

• You have a prior for the joint distribution of weather and temperature:

Possible world Weather Temperature µ(w)

w1 sunny hot 0.10

w2 sunny mild 0.20

w3 sunny cold 0.10

w4 cloudy hot 0.05

w5 cloudy mild 0.35

w6 cloudy cold 0.20

• Now, you look outside and see that it’s sunny
• You are now certain that you’re in one of worlds w1, w2, or w3
• To get the conditional probability P(T|W=sunny)
  • renormalize µ(w1), µ(w2), µ(w3) to sum to 1
  • µ(w1) + µ(w2) + µ(w3) = 0.10 + 0.20 + 0.10 = 0.40

T P(T|W=sunny)

hot 0.10/0.40=0.25

mild 0.20/0.40=0.50

cold 0.10/0.40=0.25

P(T | W=sunny) = P(T ∧ W=sunny) / P(W=sunny)

Page 24: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Conditional Probability

Definition (conditional probability)
The conditional probability of proposition h given evidence e is

  P(h | e) = P(h ∧ e) / P(e)

• P(e): sum of the probabilities of all worlds in which e is true
• P(h ∧ e): sum of the probabilities of all worlds in which both h and e are true
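Continuing the earlier world-table sketch (my own illustration, not from the slides), the definition turns into a one-line computation: sum µ over the worlds satisfying h ∧ e and divide by the sum over the worlds satisfying e.

```python
mu = {
    ("sunny", "hot"): 0.10, ("sunny", "mild"): 0.20, ("sunny", "cold"): 0.10,
    ("cloudy", "hot"): 0.05, ("cloudy", "mild"): 0.35, ("cloudy", "cold"): 0.20,
}

def prob(f):
    return sum(p for w, p in mu.items() if f(w))

def cond_prob(h, e):
    """P(h|e) = P(h and e) / P(e); undefined when P(e) = 0."""
    p_e = prob(e)
    if p_e == 0:
        raise ValueError("P(e) = 0: conditional probability undefined")
    return prob(lambda w: h(w) and e(w)) / p_e

# P(Temperature=hot | Weather=sunny) = 0.10 / 0.40 = 0.25, as on the previous slide
print(cond_prob(lambda w: w[1] == "hot", lambda w: w[0] == "sunny"))
```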

Page 25: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Conditional Probability among Random Variables

P(X | Y) = P(Temperature | Weather) = P(Temperature, Weather) / P(Weather)

P(X | Y) = P(X, Y) / P(Y) expresses the conditional probability of each possible value for X given each possible value for Y

T = hot T = cold

W = sunny P(hot|sunny) P(cold|sunny)

W = cloudy P(hot|cloudy) P(cold|cloudy)

Which of the following is true?

1. The probabilities in each row should sum to 1

2. The probabilities in each column should sum to 1

3. Both of the above

4. None of the above

Crucial that you can answer this question. Think about it at home and let me know if you have questions next time

Page 26: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Inference by Enumeration

Great, we can compute arbitrary probabilities now!

• Given:
  • the prior joint probability distribution (JPD) on a set of variables X
  • specific values e for the evidence variables E (a subset of X)
• We want to compute:
  • the posterior joint distribution of the query variables Y (a subset of X), given evidence e
• Step 1: Condition to get distribution P(X|e)
• Step 2: Marginalize to get distribution P(Y|e)

Page 27: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Inference by Enumeration: example

• Given P(W,C,T) as the JPD below, and evidence e: “Wind=yes”
• What is the probability that it is hot? I.e., P(Temperature=hot | Wind=yes)
• Step 1: condition to get distribution P(X|e)

Windy W Cloudy C Temperature T P(W, C, T)

yes no hot 0.04

yes no mild 0.09

yes no cold 0.07

yes yes hot 0.01

yes yes mild 0.10

yes yes cold 0.12

no no hot 0.06

no no mild 0.11

no no cold 0.03

no yes hot 0.04

no yes mild 0.25

no yes cold 0.08

Page 28: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Inference by Enumeration: example

• Given P(X) as the JPD below, and evidence e: “Wind=yes”

• What is the probability that it is hot? I.e., P(Temperature=hot | Wind=yes)

• Step 1: condition to get distribution P(X|e)

• P(X|e):

Cloudy C Temperature T P(C, T | W=yes)

no hot
no mild
no cold
yes hot
yes mild
yes cold

Windy W Cloudy C Temperature T P(W, C, T)

yes no hot 0.04

yes no mild 0.09

yes no cold 0.07

yes yes hot 0.01

yes yes mild 0.10

yes yes cold 0.12

no no hot 0.06

no no mild 0.11

no no cold 0.03

no yes hot 0.04

no yes mild 0.25

no yes cold 0.08


Page 29: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Inference by Enumeration: example

• Given P(X) as the JPD below, and evidence e: “Wind=yes”

• What is the probability that it is hot? I.e., P(Temperature=hot | Wind=yes)

• Step 1: condition to get distribution P(X|e)

Cloudy C Temperature T P(C, T | W=yes)

no hot 0.04/0.43 ≈ 0.10
no mild 0.09/0.43 ≈ 0.21
no cold 0.07/0.43 ≈ 0.16
yes hot 0.01/0.43 ≈ 0.02
yes mild 0.10/0.43 ≈ 0.23
yes cold 0.12/0.43 ≈ 0.28

(where P(Wind=yes) = 0.04+0.09+0.07+0.01+0.10+0.12 = 0.43)

Windy W Cloudy C Temperature T P(W, C, T)

yes no hot 0.04

yes no mild 0.09

yes no cold 0.07

yes yes hot 0.01

yes yes mild 0.10

yes yes cold 0.12

no no hot 0.06

no no mild 0.11

no no cold 0.03

no yes hot 0.04

no yes mild 0.25

no yes cold 0.08


Page 30: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Inference by Enumeration: example

• Given P(X) as the JPD below, and evidence e: “Wind=yes”

• What is the probability that it is hot? I.e., P(Temperature=hot | Wind=yes)

• Step 2: marginalize to get distribution P(Y|e)

Cloudy C Temperature T P(C, T | W=yes)

no hot 0.10
no mild 0.21
no cold 0.16
yes hot 0.02
yes mild 0.23
yes cold 0.28

Temperature T P(T | W=yes)

hot 0.10+0.02 = 0.12
mild 0.21+0.23 = 0.44
cold 0.16+0.28 = 0.44
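The two steps just shown (condition, then marginalize) fit in a few lines of Python; this is my own sketch of the procedure over the Wind/Cloudy/Temperature JPD from the slides, not course code:

```python
from collections import defaultdict

# JPD P(Windy, Cloudy, Temperature) from the slides, keyed by (w, c, t).
jpd = {
    ("yes", "no", "hot"): 0.04, ("yes", "no", "mild"): 0.09, ("yes", "no", "cold"): 0.07,
    ("yes", "yes", "hot"): 0.01, ("yes", "yes", "mild"): 0.10, ("yes", "yes", "cold"): 0.12,
    ("no", "no", "hot"): 0.06, ("no", "no", "mild"): 0.11, ("no", "no", "cold"): 0.03,
    ("no", "yes", "hot"): 0.04, ("no", "yes", "mild"): 0.25, ("no", "yes", "cold"): 0.08,
}

# Step 1: condition on the evidence Wind=yes (keep matching rows, renormalize).
rows = {w: p for w, p in jpd.items() if w[0] == "yes"}
norm = sum(rows.values())                        # P(Wind=yes) = 0.43
posterior = {w: p / norm for w, p in rows.items()}

# Step 2: marginalize out Cloudy to get P(Temperature | Wind=yes).
p_t = defaultdict(float)
for (_, _, temp), p in posterior.items():
    p_t[temp] += p

print(dict(p_t))  # approximately {'hot': 0.12, 'mild': 0.44, 'cold': 0.44}
```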

Page 31: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Problems of Inference by Enumeration

• If we have n variables, and d is the size of the largest domain:
• What is the space complexity of storing the joint distribution?
  • We need to store the probability of each possible world
  • There are O(d^n) possible worlds, so the space complexity is O(d^n)
• How do we find the numbers for O(d^n) entries?
• Time complexity is O(d^n)
  • In the worst case, we need to sum over all entries in the JPD
• We will look at an alternative way to perform inference, Bayesian networks
  • a formalism to exploit (conditional) independence between variables
• But first, let’s look at a couple more definitions

Page 32: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Product Rule

• By definition, we know that:

  P(f2 | f1) = P(f2 ∧ f1) / P(f1)

• We can rewrite this to:

  P(f2 ∧ f1) = P(f2 | f1) × P(f1)

• In general:

  P(fn ∧ … ∧ f1) = P(fn | fn-1 ∧ … ∧ f1) × P(fn-1 ∧ … ∧ f1)

Page 33: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Chain Rule

Applying the product rule repeatedly:

P(fn ∧ … ∧ f1) = P(fn | fn-1 ∧ … ∧ f1) × P(fn-1 | fn-2 ∧ … ∧ f1) × … × P(f2 | f1) × P(f1)
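As a quick numeric check of the chain rule on the weather example from earlier (my own addition, not from the slides):

```python
# P(W=sunny, T=hot) should equal P(W=sunny) * P(T=hot | W=sunny).
p_sunny = 0.10 + 0.20 + 0.10            # marginal over temperature
p_hot_given_sunny = 0.10 / p_sunny      # 0.25, as computed by conditioning
assert abs(p_sunny * p_hot_given_sunny - 0.10) < 1e-9  # matches the joint entry
```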

Page 34: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Why does the chain rule help us?

We will see how, under specific circumstances (variable independence), this rule helps us gain compactness:

• We can represent the JPD as a product of marginal distributions
• We can simplify some terms when the variables involved are marginally or conditionally independent

Page 35: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Outline

• Uncertainty and Probability
• Marginal and Conditional Independence
• Bayesian networks

Page 36: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Marginal Independence

• Intuitively: if X and Y are marginally independent, then
  • learning that Y=y does not change your belief in X
  • and this is true for all values y that Y could take
• For example, the weather is marginally independent of the result of a die throw

Definition: X and Y are marginally independent iff P(X=x | Y=y) = P(X=x) for all values x and y; equivalently, P(X, Y) = P(X) P(Y)

Page 37: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Examples for marginal independence

• Intuitively (without numbers):
  • Boolean random variable “Canucks win the Stanley Cup this season”
  • Numerical random variable “Canucks’ revenue last season”
  • Are the two marginally independent?

Page 38: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Examples for marginal independence

• Intuitively (without numbers):
  • Boolean random variable “Canucks win the Stanley Cup this season”
  • Numerical random variable “Canucks’ revenue last season”
  • Are the two marginally independent?

No! Without revenue they cannot afford to keep their best players

Page 39: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Exploiting marginal independence

If X1, …, Xn are all marginally independent, then P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi), so we only need to specify n single-variable distributions.

Exponentially fewer numbers than the JPD!

Page 40: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Follow-up Example

• We said that “Canucks win the Stanley Cup this season” and “Canucks’ revenue last season” are not marginally independent
• But they are conditionally independent given the Canucks line-up
  • Once we know who is playing, learning their revenue last year won’t change our belief in their chances

Page 41: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Conditional Independence

• Intuitively: if X and Y are conditionally independent given Z, then
  • learning that Y=y does not change your belief in X when we already know Z=z
  • and this is true for all values y that Y could take and all values z that Z could take

Definition: X and Y are conditionally independent given Z iff P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all values x, y, and z

Page 42: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example for Conditional Independence

• Whether light l1 is lit and the position of switch s2 are not marginally independent
  • The position of the switch determines whether there is power in the wire w0 connected to the light

[Circuit diagram: Up s2, Lit l1]

Page 43: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example for Conditional Independence

• Whether light l1 is lit and the position of switch s2 are not marginally independent
  • The position of the switch determines whether there is power in the wire w0 connected to the light
• However, whether light l1 is lit is conditionally independent of the position of switch s2 given whether there is power in wire w0
• Once we know Power w0, learning values for any other variable will not change our beliefs about Lit l1
  • I.e., Lit l1 is independent of any other variable given Power w0

[Circuit diagram: Up s2, Power w0, Lit l1]

Page 44: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Exploiting Conditional Independence

• Recall the chain rule:

  P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | X1, …, Xi-1)

• When Xi is conditionally independent of some of its predecessors given the others, each factor P(Xi | X1, …, Xi-1) can be simplified to condition only on the relevant variables

Page 45: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Belief Networks

• Belief networks and their extensions are R&R (representation and reasoning) systems explicitly defined to exploit independence in probabilistic reasoning

Page 46: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Outline

• Uncertainty and Probability
• Marginal and Conditional Independence
• Bayesian networks

Page 47: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Bayesian Network Motivation

• We want a representation and reasoning system that is based on conditional (and marginal) independence
  • Compact yet expressive representation
  • Efficient reasoning procedures
• Bayesian (Belief) Networks are such a representation
  • Named after Thomas Bayes (ca. 1702–1761)
  • Term coined in 1985 by Judea Pearl (1936– )
  • Their invention changed the primary focus of AI from logic to probability!

[Portraits: Thomas Bayes, Judea Pearl]

Pearl recently received the very prestigious ACM Turing Award for his contributions to Artificial Intelligence! And he is going to give a DLS talk on November 8!

Page 48: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Belief (or Bayesian) networks

Def. A Belief network consists of
• a directed acyclic graph (DAG) in which each node is associated with a random variable Xi
• a domain for each variable Xi
• a set of conditional probability distributions, one for each node Xi given its parents Pa(Xi) in the graph:

  P(Xi | Pa(Xi))

• The parents Pa(Xi) of a variable Xi are those variables upon which Xi depends directly
• A Bayesian network is a compact representation of the JPD for a set of variables (X1, …, Xn):

  P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))

Page 49: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

How to build a Bayesian network

• Define a total order over the random variables: (X1, …, Xn)
• If we apply the chain rule, we have

  P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | X1, …, Xi-1)

• For each Xi, select as its parents in the Belief network a minimal set Pa(Xi) of its predecessors (in the total order defined over the variables) such that

  P(Xi | X1, …, Xi-1) = P(Xi | Pa(Xi))

  i.e., Xi is conditionally independent of all its other predecessors given Pa(Xi)

• Putting it all together, in a Belief network

  P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))

• This is a compact representation of the JPD: a factorization of the JPD based on existing conditional independencies among the variables

Page 50: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example for BN construction: Fire Diagnosis

• You want to diagnose whether there is a fire in a building
• You receive a noisy report about whether everyone is leaving the building
• If everyone is leaving, this may have been caused by a fire alarm
• If there is a fire alarm, it may have been caused by a fire or by tampering
• If there is a fire, there may be smoke

Let’s construct the Bayesian network for this

Page 51: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example for BN construction: Fire Diagnosis

First you choose the variables. In this case, all are Boolean:
• Tampering is true when the alarm has been tampered with
• Fire is true when there is a fire
• Alarm is true when there is an alarm
• Smoke is true when there is smoke
• Leaving is true if there are lots of people leaving the building
• Report is true if the sensor reports that lots of people are leaving the building

Page 52: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example for BN construction: Fire Diagnosis

• Next, define a total ordering of the variables:
  • Let’s say Fire (F), Tampering (T), Alarm (A), Smoke (S), Leaving (L), Report (R)
• The chain rule gives us:
  P(F,T,A,S,L,R) = P(F) P(T|F) P(A|F,T) P(S|F,T,A) P(L|F,T,A,S) P(R|F,T,A,S,L)
• Now choose the parents for each variable by evaluating conditional independencies

Page 53: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

P(F) P(T|F) P(A|F,T) P(S|F,T,A) P(L|F,T,A,S) P(R|F,T,A,S,L)

Fire (F) is the first variable in the ordering, X1. It does not have parents.

Fire Diagnosis Example

Fire

Page 54: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

P(F) P(T|F) P(A|F,T) P(S|F,T,A) P(L|F,T,A,S) P(R|F,T,A,S,L)

• Tampering (T) is independent of fire (learning that one is true would not change your beliefs about the probability of the other)

Fire Diagnosis Example

Tampering Fire

Page 55: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

P(F) P(T) P(A|F,T) P(S|F,T,A) P(L|F,T,A,S) P(R|F,T,A,S,L)

• Alarm (A) depends on both Fire and Tampering: it could be caused by either or both

Fire Diagnosis Example

Tampering Fire

Alarm

Page 56: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

P(F) P(T) P(A|F,T) P(S|F,T,A) P(L|F,T,A,S) P(R|F,T,A,S,L)

• Smoke (S) is caused by Fire, and so is independent of Tampering and Alarm given whether there is a Fire

Fire Diagnosis Example

Tampering Fire

Alarm Smoke

Page 57: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

P(F) P(T) P(A|F,T) P(S|F) P(L|F,T,A,S) P(R|F,T,A,S,L)

• Leaving (L) is caused by Alarm, and thus is independent of the other variables given Alarm

Fire Diagnosis Example

Tampering Fire

Alarm Smoke

Leaving

Page 58: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

P(F) P(T) P(A|F,T) P(S|F) P(L|A) P(R|F,T,A,S,L)

• Report (R) is caused by Leaving, and thus is independent of the other variables given Leaving

Fire Diagnosis Example

Tampering Fire

Alarm Smoke

Leaving

Report

Page 59: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

The result is the following Bnet and its corresponding, very compact factorization of the original JPD:

Fire Diagnosis Example

Tampering Fire

Alarm Smoke

Leaving

Report

P(F) P(T) P(A|F,T) P(S|F) P(L|A) P(R|L)

Page 60: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example for BN construction: Fire Diagnosis

• We are not done yet: we must specify the Conditional Probability Table (CPT) for each variable. All variables are Boolean.
• How many probabilities do we need to specify for this Bayesian network?

Page 61: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example for BN construction: Fire Diagnosis

• We are not done yet: we must specify the Conditional Probability Table (CPT) for each variable. All variables are Boolean.
• How many probabilities do we need to specify for this Bayesian network?
  • P(Tampering): 1 probability
  • P(Fire): 1 probability
  • P(Alarm | Tampering, Fire): 4 independent probabilities, 1 for each of the 4 instantiations of the parents
  • P(Smoke | Fire), P(Leaving | Alarm), P(Report | Leaving): 2 each
• In total: 1+1+4+2+2+2 = 12 (compared to 2^6 − 1 = 63 for the full JPD!)

Page 62: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example for BN construction: Fire Diagnosis


Page 63: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

P(Tampering=t) P(Tampering=f)

0.02 0.98

Example for BN construction: Fire Diagnosis

We don’t need to store P(Tampering=f) since probabilities sum to 1


P(Tampering=t)

0.02

Page 64: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example for BN construction: Fire Diagnosis

P(Tampering=t)

0.02

P(Fire=t)

0.01

Page 65: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Tampering T Fire F P(Alarm=t|T,F) P(Alarm=f|T,F)

t t 0.5 0.5

t f 0.85 0.15

f t 0.99 0.01

f f 0.0001 0.9999

Example for BN construction: Fire Diagnosis


P(Tampering=t)

0.02

P(Fire=t)

0.01

We don’t need to store P(Alarm=f|T,F) since probabilities sum to 1

Each row of this table is a conditional probability distribution

Page 66: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example for BN construction: Fire Diagnosis


Tampering T Fire F P(Alarm=t|T,F)

t t 0.5

t f 0.85

f t 0.99

f f 0.0001

P(Tampering=t)

0.02

P(Fire=t)

0.01

Fire F P(Smoke=t |F)

t 0.9

f 0.01

Alarm A P(Leaving=t|A)

t 0.88

f 0.001

Leaving L P(Report=t|L)

t 0.75

f 0.01

In total: 1+1+4+2+2+2 = 12 (compared to 2^6 − 1 = 63 for the full JPD!)

Page 67: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Compactness

• A CPT for a Boolean variable Xi with k Boolean parents has …… rows for the combinations of parent values
• If each variable has no more than k parents, the complete network with n nodes requires us to specify …… numbers
• For k << n, this is a substantial improvement:
  • the numbers required grow …… with n, vs. …… for the full joint distribution
• E.g., if we have a Bnet with 30 Boolean variables, each with 5 parents:
  • we need to specify …… probabilities
  • but we need …… for the JPD

Page 68: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Compactness

• A CPT for a Boolean variable Xi with k Boolean parents has 2^k rows for the combinations of parent values
• Each row requires one number p for Xi = true (the number for Xi = false is just 1 − p)
• If each variable has no more than k parents, the complete network requires us to specify O(n · 2^k) numbers
• For k << n, this is a substantial improvement:
  • the numbers required grow linearly with n, vs. O(2^n) for the full joint distribution
• E.g., if we have a Bnet with 30 Boolean variables, each with 5 parents:
  • we need to specify 30 · 2^5 = 960 probabilities
  • but we need 2^30 (over a billion) for the JPD
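The arithmetic on this slide as a small check (my own addition, not from the slides):

```python
# Numbers to specify for n Boolean variables with at most k Boolean parents,
# versus the full joint distribution (one number per world, minus one).
n, k = 30, 5
bnet_numbers = n * 2**k   # 960: one number per CPT row, per variable
jpd_numbers = 2**n - 1    # 1073741823: over a billion
print(bnet_numbers, jpd_numbers)
```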

Page 69: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Realistic BNet: Liver Diagnosis Source: Onisko et al., 1999

Page 70: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Compactness

• What happens if the network is fully connected, or k ≈ n?
  • Not much saving compared to the numbers needed to specify the full JPD
• Bnets are useful in sparse (or locally structured) domains
  • domains in which each component interacts with (is related to) a small fraction of the other components
• What if this is not the case in a domain we need to reason about?
  • We may need to make simplifying assumptions to reduce the dependencies in the domain

Page 71: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

“Where do the numbers (CPTs) come from?”

From experts
• Tedious
• Costly
• Not always reliable

From data => Machine Learning
• There are algorithms to learn both structure and numbers (later in the course)
• It can be hard to get enough data

Still, usually better than specifying the full JPD

Page 72: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example of probability computation

How do we compute, for instance,

P(Tampering=t, Fire=f, Alarm=t, Smoke=f, Leaving=t, Report=t) =


Tampering T Fire F P(Alarm=t|T,F)

t t 0.5

t f 0.85

f t 0.99

f f 0.0001

P(Tampering=t)

0.02

P(Fire=t)

0.01

Fire F P(Smoke=t |F)

t 0.9

f 0.01

Alarm A P(Leaving=t|A)

t 0.88

f 0.001

Leaving L P(Report=t|L)

t 0.75

f 0.01

Page 73: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example for BN construction: Fire Diagnosis

P(Tampering=t, Fire=f, Alarm=t, Smoke=f, Leaving=t, Report=t) =

P(Tampering=t) × P(Fire=f) × P(Alarm=t | Tampering=t, Fire=f) × P(Smoke=f | Fire=f) × P(Leaving=t | Alarm=t) × P(Report=t | Leaving=t) =

0.02 × 0.99 × 0.85 × 0.99 × 0.88 × 0.75 ≈ 0.011

Tampering T Fire F P(Alarm=t|T,F)

t t 0.5

t f 0.85

f t 0.99

f f 0.0001

P(Tampering=t)

0.02

P(Fire=t)

0.01

Fire F P(Smoke=t |F)

t 0.9

f 0.01

Alarm A P(Leaving=t|A)

t 0.88

f 0.001

Leaving L P(Report=t|L)

t 0.75

f 0.01
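Here is a minimal Python sketch of these CPTs (my own illustration, not course code), computing any entry of the joint as the product of the local conditional probabilities; the query on this slide comes out at roughly 0.011.

```python
# CPTs store P(X=t | parents); P(X=f | parents) is 1 minus the stored value.
p_tampering = 0.02
p_fire = 0.01
p_alarm = {(True, True): 0.5, (True, False): 0.85,
           (False, True): 0.99, (False, False): 0.0001}  # keyed by (T, F)
p_smoke = {True: 0.9, False: 0.01}      # keyed by Fire
p_leaving = {True: 0.88, False: 0.001}  # keyed by Alarm
p_report = {True: 0.75, False: 0.01}    # keyed by Leaving

def val(p_true, x):
    """P(X=x) from the stored P(X=t)."""
    return p_true if x else 1 - p_true

def joint(t, f, a, s, l, r):
    """P(T,F,A,S,L,R) = P(T) P(F) P(A|T,F) P(S|F) P(L|A) P(R|L)."""
    return (val(p_tampering, t) * val(p_fire, f) * val(p_alarm[(t, f)], a) *
            val(p_smoke[f], s) * val(p_leaving[a], l) * val(p_report[l], r))

print(joint(True, False, True, False, True, True))  # ~0.011
```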

Page 74: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

What if we use a different ordering?

• Say we use the following order: Leaving; Tampering; Report; Smoke; Alarm; Fire

[Resulting network diagram over Leaving, Report, Tampering, Smoke, Alarm, and Fire]

• We end up with a completely different network structure!
• Which of the two structures is better (think computationally)?

Page 75: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Which Structure is Better?

[Non-causal network diagram over Leaving, Report, Tampering, Smoke, Alarm, and Fire]

Page 76: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Which Structure is Better?

• The non-causal network is less compact: …… numbers needed
• Deciding on conditional independence is hard in non-causal directions
  • Causal models and conditional independence seem hardwired for humans!
• Specifying the conditional probabilities may be harder than in the causal direction
  • For instance, we have lost the direct dependency between the alarm and one of its causes, which essentially describes the alarm’s reliability (info often provided by the maker)

[Non-causal network diagram over Leaving, Report, Tampering, Smoke, Alarm, and Fire]

Page 77: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Which Structure is Better?

• The non-causal network is less compact: 1+2+2+2+8+8 = 23 numbers needed
• Deciding on conditional independence is hard in non-causal directions
  • Causal models and conditional independence seem hardwired for humans!
• Specifying the conditional probabilities may be harder than in the causal direction
  • For instance, we have lost the direct dependency between the alarm and one of its causes, which essentially describes the alarm’s reliability (info often provided by the maker)

[Non-causal network diagram over Leaving, Report, Tampering, Smoke, Alarm, and Fire]

Page 78: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Example contd.

• Other than that, our two Bnets for the Alarm problem are equivalent as long as they represent the same probability distribution:

P(T,F,A,S,L,R) = P(T) P(F) P(A|T,F) P(S|F) P(L|A) P(R|L)
              = P(L) P(T|L) P(R|L) P(S|L) P(A|S,L,T) P(F|S,A,T)

i.e., they are equivalent if the corresponding CPTs are specified so that they satisfy the equation above

[Non-causal network diagram over Leaving, Report, Tampering, Smoke, Alarm, and Fire]

Page 79: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Are there wrong network structures?

• Given an order of variables, a network with arcs in excess of those required by the direct dependencies implied by that order is still OK
  • Just not as efficient

[Network diagram over Leaving, Report, Tampering, Smoke, Alarm, and Fire, with an extra arc]

P(L) P(T|L) P(R|L) P(S|L,R,T) P(A|S,L,T) P(F|S,A,T) = P(L) P(T|L) P(R|L) P(S|L) P(A|S,L,T) P(F|S,A,T)

• One extreme: the fully connected network is always correct but rarely the best choice
  • Essentially we are not leveraging conditional independencies to simplify the factorization of the JPD

P(L) P(T|L) P(R|L,T) P(S|L,T,R) P(A|S,L,T,R) P(F|S,A,T,L,R)

Page 80: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Are there wrong network structures?

• How can a network structure be wrong?
  • If it misses directed edges that are required
  • E.g., an edge is missing below: Fire is not conditionally independent of Alarm given {Tampering, Smoke}

[Network diagram over Leaving, Report, Tampering, Smoke, Alarm, and Fire, with a required edge missing]

But remember what we said a few slides back: sometimes we may need to make simplifying assumptions (e.g., assume conditional independence when it does not actually hold) in order to reduce complexity

Page 81: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Deciding on Structure

In general, the direction of a direct dependency can always be changed using Bayes’ rule:

• Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
• Bayes’ rule: P(a | b) = P(b | a) P(a) / P(b)

Useful for assessing diagnostic probability from causal probability (or vice versa):
• P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)

Page 82: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Structure (contd.)

The two simple Bnets below are equivalent as long as the CPTs are related via Bayes’ rule:

P(A | B) = P(B | A) P(A) / P(B)

Which structure to choose depends, among other things, on which CPT is easier to specify.

[Diagrams: Burglar → Alarm, and Alarm → Burglar]

Page 83: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Structure (contd.)

CPTs for causal relationships represent knowledge of the mechanisms underlying the process of interest
• e.g., how an alarm works, why a disease generates certain symptoms

CPTs for diagnostic relations can be defined only based on past observations.

E.g., let m be meningitis, s be stiff neck:

P(m|s) = P(s|m) P(m) / P(s)

• P(s|m) can be defined based on medical knowledge of the workings of meningitis
• P(m|s) requires statistics on how often the symptom of stiff neck appears in conjunction with meningitis
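A two-line numeric sketch of this computation; the numbers below are hypothetical, chosen only for illustration (they are not from the slides):

```python
# Hypothetical numbers, for illustration only (not from the slides):
p_s_given_m = 0.7      # assumed P(stiff neck | meningitis), from causal knowledge
p_m = 1 / 50000        # assumed prior P(meningitis)
p_s = 0.01             # assumed P(stiff neck)

# Bayes' rule: P(m|s) = P(s|m) P(m) / P(s)
p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)     # 0.0014: still small despite the strong causal link
```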

Page 84: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

In Summary

• A Belief network is such that the JPD of the variables involved is defined as the product of the local conditional distributions:

  P(X1, …, Xn) = ∏_i P(Xi | X1, …, Xi-1) = ∏_i P(Xi | Pa(Xi))

• Once we know the JPD, we can answer any query about any subset of the variables
• Thus, a Belief network allows one to answer any query on any subset of the variables

Page 85: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Bayesian Networks: Types of Inference

• Diagnostic: People are leaving (L=t). P(F | L=t) = ?
• Predictive: Fire happens (F=t). P(L | F=t) = ?
• Intercausal: Alarm goes off (A=t) and there is tampering (T=t). P(F | A=t, T=t) = ?
• Mixed: People are leaving (L=t) and there is no fire (F=f). P(A | F=f, L=t) = ?

[Four network fragments over Fire, Tampering, Alarm, and Leaving, one per inference type]

We will use the same reasoning procedure for all of these types
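As an illustration of a diagnostic query (my own sketch, using the CPTs from the earlier fire-diagnosis slides, not course code), P(Fire=t | Leaving=t) can be computed by enumerating worlds; the unobserved leaves Smoke and Report sum out to 1 and are omitted here to keep the sketch short.

```python
from itertools import product

# CPTs from the fire-diagnosis example: P(X=t | parents).
pT, pF = 0.02, 0.01
pA = {(True, True): 0.5, (True, False): 0.85,
      (False, True): 0.99, (False, False): 0.0001}  # keyed by (Tampering, Fire)
pL = {True: 0.88, False: 0.001}                     # keyed by Alarm

def v(p, x):  # P(X=x) from the stored P(X=t)
    return p if x else 1 - p

# Diagnostic query P(Fire=t | Leaving=t): sum the joint over all worlds
# consistent with the evidence, then normalize.
num = den = 0.0
for t, f, a in product([True, False], repeat=3):
    p = v(pT, t) * v(pF, f) * v(pA[(t, f)], a) * pL[a]  # Leaving fixed to True
    den += p
    if f:
        num += p

print(num / den)  # roughly 0.35: observing that people leave raises belief in fire
```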

Page 86: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Dependencies in a Bayesian Network

By construction, a node X is conditionally independent of its non-descendant nodes (e.g., the Zij in the picture) given its parents. The gray area “blocks” probability propagation.

[Diagram: node X with its parents shaded gray; gray areas represent evidence]

Page 87: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

• A node X is conditionally independent of all other nodes in the network given its Markov blanket (the gray area in the picture). It “blocks” probability propagation.
• Note that node X is conditionally dependent on non-descendant nodes in its Markov blanket (e.g., its children’s parents, like Z1j) given their common descendants (e.g., Y1j).
• This allows, for instance, explaining away one cause (e.g., X) given evidence of its effect (e.g., Y1) and of another potential cause (e.g., Z1j).

[Diagram: node X with its Markov blanket (parents, children, and children’s parents) shaded gray]

Page 88: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Dependencies: another way to see them

[Three configurations of a path between X and Y through an intermediate node Z, with evidence E marked: cases 1, 2, and 3]

• In 1, 2, and 3, X and Y are dependent
• In 3, X and Y become dependent as soon as there is evidence on Z or on any of its descendants
  • This is because knowledge of one possible cause, given evidence of the effect, explains away the other cause

Page 89: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Or Conditional Independencies

Or, blocking paths for probability propagation: three ways in which a path between Y and X (or vice versa) can be blocked, given evidence E

[The three path configurations between X and Y through Z (cases 1, 2, and 3), each shown with the evidence E that blocks it]

Page 90: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Conditional Independence in a Bnet

Is H conditionally independent of E given I?

[Quiz network diagram over nodes A–I, with the three blocking configurations as a legend]

Page 91: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Conditional Independence in a Bnet

Is H conditionally independent of E given I?

[Quiz network diagram over nodes A–I, with the three blocking configurations as a legend]

True: since we have no evidence on F, changes in the belief of G (caused by changes in the belief of H) do not affect the belief on E

Page 92: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

(In)Dependencies in a Bnet

Is A conditionally independent of I given F?

[Quiz network diagram over nodes A–I, with the three blocking configurations as a legend]

Page 93: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

(In)Dependencies in a Bnet

Is A conditionally independent of I given F?

[Quiz network diagram over nodes A–I, with the three blocking configurations as a legend]

False: since we have evidence on F, changes to the belief on G due to changes in I affect the belief on E, which in turn propagates to A via C and B

What if node C is not there?

Page 94: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

(In)Dependencies in a Bnet

Is A conditionally independent of I given F (without node C)?

[Quiz network diagram over nodes A–I, without node C]

Page 95: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

(In)Dependencies in a Bnet

Is A conditionally independent of I given F (without node C)?

[Quiz network diagram over nodes A–I, without node C]

True: the lack of evidence on D blocks the path that goes through E, D, and B

Page 96: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

What does this mean in terms of choosing structure?

That you need to double-check the appropriateness of the indirect dependencies/independencies generated by your chosen structures.

Example: representing a domain for an intelligent system that acts as a tutor (aka an Intelligent Tutoring System)
• Topics are divided into sub-topics
• Student knowledge of a topic depends on student knowledge of its sub-topics
• We can never observe student knowledge directly; we can only observe it indirectly via student test answers

Page 97: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Two Ways of Representing Knowledge

[Two network diagrams over Overall Proficiency, Topic 1, Sub-topic 1.1, Sub-topic 1.2, and Answers 1–4, with the arcs drawn in opposite directions]

Which one should I pick?

Page 98: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Two Ways of Representing Knowledge

• In one structure, a change in probability for a given node always propagates to its siblings, because we never get direct evidence on knowledge
• In the other, a change in probability for a given node does not propagate to its siblings, because we never get direct evidence on knowledge

[The two network diagrams over Overall Proficiency, Topic 1, Sub-topic 1.1, Sub-topic 1.2, and Answers 1–4]

Which one you want to choose depends on the domain you want to represent

Page 99: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Bayesian Networks - AISpace

• Try the queries in the previous slides with the Belief Networks applet in AISpace:
  • Load the Fire Diagnosis example (ex. 6.10 in the textbook)
  • Compare the probability of each query node before and after the new evidence is observed
  • Try first to predict how they should change, using your understanding of the domain
• Make sure you explore and understand the various other example networks in the textbook and AISpace:
  • Electrical Circuit example (textbook ex. 6.11)
  • Patient’s wheezing and coughing example (ex. 6.14): not in the applet, but you can easily insert it
  • Several other examples in AIspace

Page 100: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Test your understanding of dependencies in a Bnet

• Use the AISpace applet for Belief and Decision networks (http://www.aispace.org/bayes/index.shtml, from http://www.aispace.org/mainApplets.shtml)
• Load the “conditional independence quiz” network
• Go into “Solve” mode and select “Independence Quiz”

Page 101: Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Learning Goals For Today’s Class

• Given a JPD:
  • Marginalize over specific variables
  • Compute distributions over any subset of the variables
• Use inference by enumeration to compute joint posterior probability distributions over any subset of variables given evidence
• Define and use marginal and conditional independence
• Build a Bayesian Network for a given domain
• Compute the representational savings in terms of the number of probabilities required
• Identify dependencies/independencies between nodes in a Bayesian Network