Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson...
-
Upload
magnus-short -
Category
Documents
-
view
220 -
download
5
Transcript of Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson...
![Page 1: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/1.jpg)
Copyright © 2001, Andrew W. Moore
Slide 1
Probabilistic and Bayesian Analytics
Brigham S. Anderson
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~brigham
![Page 2: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/2.jpg)
2
Probability
• The world is a very uncertain place
• 30 years of Artificial Intelligence and Database research danced around this fact
• And then a few AI researchers decided to use some ideas from the eighteenth century
![Page 3: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/3.jpg)
3
What we’re going to do
• We will review the fundamentals of probability.
• It’s really going to be worth it
• You’ll see examples of probabilistic analytics in action: • Inference, • Anomaly detection, and • Bayes Classifiers
![Page 4: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/4.jpg)
4
Discrete Random Variables
• A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs.
• Examples• A = The US president in 2023 will be male• A = You wake up tomorrow with a headache• A = You have influenza
![Page 5: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/5.jpg)
5
Probabilities
• We write P(A) as “the probability that A is true”
• We could at this point spend 2 hours on the philosophy of this.
• We’ll spend slightly less...
![Page 6: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/6.jpg)
6
Sample Space
Definition 1.The set, S, of all possible outcomes of a particular experiment is called the sample space for the experiment
The elements of the sample space are called outcomes.
![Page 7: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/7.jpg)
7
Sample Spaces
Sample space of a coin flip:
S = {H, T}
H
T
![Page 8: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/8.jpg)
8
Sample Spaces
Sample space of a die roll:
S = {1, 2, 3, 4, 5, 6}
![Page 9: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/9.jpg)
9
Sample Spaces
Sample space of three die rolls?
S = {111,112,113,…,
…,664,665,666}
![Page 10: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/10.jpg)
10
Sample SpacesSample space of a single draw from a
deck of cards:
S={As,Ac,Ah,Ad,2s,2c,2h,…
…,Ks,Kc,Kd,Kh}
![Page 11: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/11.jpg)
11
So Far…
Definition ExampleThe sample space is the set of all possible worlds.
{As,Ac,Ah,Ad,2s,2c,2h,… …,Ks,Kc,Kd,Kh}
An outcome is an element of the sample space.
2c
![Page 12: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/12.jpg)
12
Events
Definition 2.An event is any subset of S (including S itself)
![Page 13: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/13.jpg)
13
Events
Event: “Jack”
Sample Space of card draw
• The Sample Space is the set of all outcomes.
• An Outcome is a possible world.
• An Event is a set of outcomes
![Page 14: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/14.jpg)
14
Events
Event: “Hearts”
Sample Space of card draw
• The Sample Space is the set of all outcomes.
• An Outcome is a possible world.
• An Event is a set of outcomes
![Page 15: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/15.jpg)
15
Events
Event: “Red and Face”
Sample Space of card draw
• The Sample Space is the set of all outcomes.
• An Outcome is a possible world.
• An Event is a set of outcomes
![Page 16: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/16.jpg)
16
Definitions
Definition Example
The sample space is the set of all possible worlds.
{As,Ac,Ah,Ad,2s,2c,2h,… …,Ks,Kc,Kd,Kh}
An outcome is a single point in the sample space.
2c
An event is a set of outcomes from the sample space.
{2h,2c,2s,2d}
![Page 17: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/17.jpg)
17
Events
Definition 3.Two events A and B are mutually exclusive if A^B=Ø.
Definition 4.If A1, A2, … are mutually exclusive and A1 A2 … = S, then the collection A1, A2, … forms a partition of S.
clubs
hearts spades
diamonds
![Page 18: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/18.jpg)
18
Probability
Definition 5.Given a sample space S, a probability function is a function that maps each event in S to a real number, and satisfies
• P(A) ≥ 0 for any event A in S• P(S) = 1• For any number of mutually exclusive events A1, A2, A3 …, we have P(A1 A2 A3 …) = P(A1) + P(A2) + P(A3) +…
** This definition of the domain of this function is
not 100% sufficient, but it’s close enough for our purposes… (I’m sparing you Borel Fields)
![Page 19: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/19.jpg)
19
Definitions
Definition Example
The sample space is the set of all possible worlds.
{As,Ac,Ah,Ad,2s,2c,2h,… …,Ks,Kc,Kd,Kh}
An outcome is a single point in the sample space.
4c
An event is a set of one or more outcomes
Card is “Red”
P(E) maps event E to a real number and satisfies the axioms of probability
P(Red) = 0.50P(Black) = 0.50
![Page 20: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/20.jpg)
20
A
~A
Misconception
• The relative area of the events determines their probability
• …in a Venn diagram it does, but not in general.• However, the “area equals probability” rule is guaranteed
to result in axiom-satisfying probability functions.
We often assume, for example, that the probability of “heads” is equal to
“tails” in absence of other information…
But this is totally outside the axioms!
![Page 21: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/21.jpg)
21
Creating a Valid P()
• One convenient way to create an axiom-satisfying probability function:
1. Assign a probability to each outcome in S
2. Make sure they sum to one
3. Declare that P(A) equals the sum of outcomes in event A
![Page 22: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/22.jpg)
22
Everyday Example
Assume you are a doctor.
This is the sample space of “patients you might see on any given day”.
Non-smoker, female, diabetic, headache, good insurance, etc…
Smoker, male, herniated disk, back pain, mildly schizophrenic, delinquent medical bills, etc…
Outcomes
![Page 23: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/23.jpg)
23
Everyday Example
Number of elements in the “patient space”:
100 jillion
Are these patients equally likely to occur?
Again, generally not. Let’s assume for the moment that they are, though.
…which roughly means “area equals probability”
![Page 24: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/24.jpg)
24
Everyday Example
jillion100
jillion2
F
Event: Patient has Flu
Size of set “F”:2 jillion(Exactly 2 jillion of the points in the sample space have flu.)
Size of “patient space”:100 jillion
= 0.02PpatientSpace(F) =
![Page 25: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/25.jpg)
25
Everyday Example
jillion100
jillion2
F
= 0.02PpatientSpace(F) =
From now on, the subscript on P() willbe omitted…
![Page 26: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/26.jpg)
26
These Axioms are Not to be Trifled With
• There have been attempts to do different methodologies for uncertainty
• Fuzzy Logic• Three-valued logic• Dempster-Shafer• Non-monotonic reasoning
• But the axioms of probability are the only system with this property:
If you gamble using them you can’t be unfairly exploited by an opponent using some other system [di Finetti 1931]
![Page 27: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/27.jpg)
27
Theorems from the AxiomsAxioms• P(A) ≥ 0 for any event A in S• P(S) = 1• For any number of mutually exclusive events A1, A2, A3 …, we have P(A1 A2 A3 …) = P(A1) + P(A2) + P(A3) +…
Theorem.If P is a probability function and A is an event in S, thenP(~A) = 1 – P(A)
Proof:(1) Since A and ~A partition S, P(A ~A) = P(S) = 1
(2) Since A and ~A are disjoint, P(A ~A) = P(A) + P(~A)
Combining (1) and (2) gives the result
![Page 28: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/28.jpg)
28
Multivalued Random Variables
• Suppose A can take on more than 2 values
• A is a random variable with arity k if it can take on exactly one value out of {A1,A2, ... Ak}, and
• The events {A1,A2,…,Ak} partition S, so
jiAAP ji if 0),(
1)...( 21 kAAAP
![Page 29: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/29.jpg)
29
Elementary Probability in Pictures
P(~A) + P(A) = 1
A
~A
![Page 30: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/30.jpg)
30
Elementary Probability in Pictures
P(B) = P(B, A) + P(B, ~A)
A
~A
B
![Page 31: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/31.jpg)
31
Elementary Probability in Pictures
1)(1
k
jjAP
A1
A2
A3
![Page 32: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/32.jpg)
32
Elementary Probability in Pictures
),()(1
k
jjABPBP
A1
A2
A3
B
Useful!
![Page 33: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/33.jpg)
33
Conditional Probability
Assume once more that you are a doctor.
Again, this is the sample space of “patients you might see on any given day”.
![Page 34: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/34.jpg)
34
Conditional Probability
F
Event: Flu
P(F) = 0.02
![Page 35: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/35.jpg)
35
Conditional Probability
Event: Headache
P(H) = 0.10H
F
![Page 36: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/36.jpg)
36
Conditional Probability
P(F) = 0.02P(H) = 0.10
…we still need to specify the interaction between flu and headache…
Define
P(H|F) = Fraction of F’s outcomes which are also in H
H
F
![Page 37: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/37.jpg)
37
H
Conditional Probability
F
P(F) = 0.02P(H) = 0.10P(H|F) = 0.50
0.01 0.01
0.89
0.09
H = “headache”F = “flu”
![Page 38: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/38.jpg)
38
Conditional Probability
H = “headache”F = “flu”
P(H|F) = Fraction of flu worlds in which patient has a headache
= #worlds with flu and headache ------------------------------------ #worlds with flu
= Size of “H and F” region ------------------------------ Size of “F” region
= P(H, F) ---------- P(F)
F
0.01 0.01
0.89
0.09H
![Page 39: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/39.jpg)
39
Conditional Probability
Definition.If A and B are events in S, and P(B) > 0, then the conditional probability of A given B, written P(A|B), is
)(
),()|(
BP
BAPBAP
The Chain RuleA simple rearrangement of the above equation yields
)()|(),( BPBAPBAP Main BayesNet concept!
![Page 40: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/40.jpg)
40
Probabilistic Inference
H = “Have a headache”F = “Coming down with Flu”
P(H) = 0.10P(F) = 0.02P(H|F) = 0.50
One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches so I must have a 50-50 chance of coming down with flu”
Is this reasoning good?
H
F
![Page 41: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/41.jpg)
41
Probabilistic Inference
)(
),()|(
HP
HFPHFP
)(
)()|(
HP
FPFHP
H
F
10.01.0
)02.0()50.0(
H = “Have a headache”F = “Coming down with Flu”
P(H) = 0.10P(F) = 0.02P(H|F) = 0.50
![Page 42: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/42.jpg)
42
What we just did…
P(A,B) P(A|B) P(B)
P(B|A) = ----------- = ---------------
P(A) P(A)
This is Bayes Rule
Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
![Page 43: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/43.jpg)
43
More General Forms of Bayes Rule
)(~)|~()()|(
)()|()|(
APABPAPABP
APABPBAP
),(
),(),|(),|(
CBP
CAPCABPCBAP
![Page 44: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/44.jpg)
44
More General Forms of Bayes Rule
An
kkk
iii
APABP
APABPBAP
1
)()|(
)()|()|(
![Page 45: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/45.jpg)
45
Independence
Definition.Two events, A and B, are statistically independent if
)()(),( BPAPBAP
Which is equivalent to
)()|( APBAP
Important forBayes Nets
![Page 46: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/46.jpg)
46
Representing P(A,B,C)
• How can we represent the function P(A)?• P(A,B)?• P(A,B,C)?
![Page 47: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/47.jpg)
47
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M boolean variables then the table will have 2M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
Example: P(A, B, C)
A
B
C0.050.25
0.10 0.050.05
0.10
0.100.30
The Joint Probability Table
![Page 48: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/48.jpg)
48
Using the Joint
One you have the JPT you can ask for the probability of any logical expression
E
PEP matching rows
)row()(
…what is P(Poor,Male)?
![Page 49: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/49.jpg)
49
Using the Joint
P(Poor, Male) = 0.4654 E
PEP matching rows
)row()(
…what is P(Poor)?
![Page 50: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/50.jpg)
50
Using the Joint
P(Poor) = 0.7604 E
PEP matching rows
)row()(
…what is P(Poor|Male)?
![Page 51: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/51.jpg)
51
Inference with the
Joint
2
2 1
matching rows
and matching rows
2
2121 )row(
)row(
)(
),()|(
E
EE
P
P
EP
EEPEEP
![Page 52: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/52.jpg)
52
Inference with the
Joint
2
2 1
matching rows
and matching rows
2
2121 )row(
)row(
)(
),()|(
E
EE
P
P
EP
EEPEEP
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
![Page 53: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/53.jpg)
53
Inference is a big deal• I’ve got this evidence. What’s the chance that this
conclusion is true?• I’ve got a sore neck: how likely am I to have meningitis?• I see my lights are out and it’s 9pm. What’s the chance my spouse
is already asleep?
• There’s a thriving set of industries growing based around Bayesian Inference. Highlights are: Medicine, Pharma, Help Desk Support, Engine Fault Diagnosis
![Page 54: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/54.jpg)
54
Where do Joint Distributions come from?
• Idea One: Expert Humans• Idea Two: Simpler probabilistic facts and some algebraExample: Suppose you knew
P(A) = 0.5P(B|A) = 0.2P(B|~A) = 0.1
P(C|A,B) = 0.1P(C|A,~B) = 0.8P(C|~A,B) = 0.3P(C|~A,~B) = 0.1
Then you can automatically compute the JPT using the chain rule
P(A,B,C) = P(A) P(B|A) P(C|A,B)
Bayes Nets are a systematic way to do this.
![Page 55: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/55.jpg)
55
Where do Joint Distributions come from?
• Idea Three: Learn them from data!
Prepare to witness an impressive learning algorithm….
![Page 56: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/56.jpg)
56
Learning a JPT
Build a Joint Probability table for your attributes in which the probabilities are unspecified
Then fill in each row with
records ofnumber total
row matching records)row(ˆ P
A B C Prob
0 0 0 ?
0 0 1 ?
0 1 0 ?
0 1 1 ?
1 0 0 ?
1 0 1 ?
1 1 0 ?
1 1 1 ?
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
Fraction of all records in whichA and B are True but C is False
![Page 57: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/57.jpg)
57
Example of Learning a JPT
• This JPT was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]
![Page 58: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/58.jpg)
58
Where are we?
• We have recalled the fundamentals of probability
• We have become content with what JPTs are and how to use them
• And we even know how to learn JPTs from data.
![Page 59: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/59.jpg)
59
Density Estimation
• Our Joint Probability Table (JPT) learner is our first example of something called Density Estimation
• A Density Estimator learns a mapping from a set of attributes to a Probability
DensityEstimator
ProbabilityInput
Attributes
![Page 60: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/60.jpg)
60
• Given a record x, a density estimator M can tell you how likely the record is:
• Given a dataset with R records, a density estimator can tell you how likely the dataset is:(Under the assumption that all records were independently generated
from the probability function)
Evaluating a density estimator
R
kkR |MP|MP|MP
121 )(ˆ),,(ˆ)dataset(ˆ xxxx
)(ˆ |MP x
![Page 61: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/61.jpg)
61
A small dataset: Miles Per Gallon
From the UCI repository (thanks to Ross Quinlan)
192 Training Set Records
mpg modelyear maker
good 75to78 asiabad 70to74 americabad 75to78 europebad 70to74 americabad 70to74 americabad 70to74 asiabad 70to74 asiabad 75to78 america: : :: : :: : :bad 70to74 americagood 79to83 americabad 75to78 americagood 79to83 americabad 75to78 americagood 79to83 americagood 79to83 americabad 70to74 americagood 75to78 europebad 75to78 europe
![Page 62: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/62.jpg)
62
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
good 75to78 asiabad 70to74 americabad 75to78 europebad 70to74 americabad 70to74 americabad 70to74 asiabad 70to74 asiabad 75to78 america: : :: : :: : :bad 70to74 americagood 79to83 americabad 75to78 americagood 79to83 americabad 75to78 americagood 79to83 americagood 79to83 americabad 70to74 americagood 75to78 europebad 75to78 europe
![Page 63: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/63.jpg)
63
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
good 75to78 asiabad 70to74 americabad 75to78 europebad 70to74 americabad 70to74 americabad 70to74 asiabad 70to74 asiabad 75to78 america: : :: : :: : :bad 70to74 americagood 79to83 americabad 75to78 americagood 79to83 americabad 75to78 americagood 79to83 americagood 79to83 americabad 70to74 americagood 75to78 europebad 75to78 europe
203-1
21
10 3.4
)(ˆ),,(ˆ)dataset(ˆ
R
kkR |MP|MP|MP xxxx
![Page 64: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/64.jpg)
64
Log Probabilities
Since probabilities of datasets get so small we usually use log probabilities
R
kk
R
kk |MP|MP|MP
11
)(ˆlog)(ˆlog)dataset(ˆlog xx
![Page 65: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/65.jpg)
65
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
good 75to78 asiabad 70to74 americabad 75to78 europebad 70to74 americabad 70to74 americabad 70to74 asiabad 70to74 asiabad 75to78 america: : :: : :: : :bad 70to74 americagood 79to83 americabad 75to78 americagood 79to83 americabad 75to78 americagood 79to83 americagood 79to83 americabad 70to74 americagood 75to78 europebad 75to78 europe
466.19
)(ˆlog)(ˆlog)dataset(ˆlog11
R
kk
R
kk |MP|MP|MP xx
![Page 66: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/66.jpg)
66
Summary: The Good News
The JPT allows us to learn P(X) from data.
• Can do inference: P(E1|E2)Automatic Doctor, Recommender, etc
• Can do anomaly detection spot suspicious / incorrect records
(e.g., credit card fraud)
• Can do Bayesian classificationPredict the class of a record
(e.g, predict cancerous/not-cancerous)
![Page 67: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/67.jpg)
67
Summary: The Bad News
• Density estimation with JPTs is trivial, mindless and dangerous
![Page 68: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/68.jpg)
68
Using a test set
An independent test set with 196 cars has a much worse log likelihood than it had on the training set
(actually it’s a billion quintillion quintillion quintillion quintillion times less likely)
….Density estimators can overfit. And the JPT estimator is the overfittiest of them all!
![Page 69: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/69.jpg)
69
Overfitting Density Estimators
If this ever happens, it means there are certain combinations that we learn are “impossible”
![Page 70: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/70.jpg)
70
Using a test set
The only reason that our test set didn’t score -infinity is that Andrew’s code is hard-wired to always predict a probability of at least one in 1020
We need Density Estimators that are less prone to overfitting
![Page 71: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/71.jpg)
71
Is there a better way?
The problem with the JPT is that it just mirrors the training data.
In fact, it is just another way of storing the data: we could reconstruct the original dataset perfectly from it!
We need to represent the probability function with fewer parameters…
![Page 72: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/72.jpg)
72
Aside:Bayes Nets
![Page 73: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/73.jpg)
73
Bayes Nets• What are they?
• Bayesian nets are a framework for representing and analyzing models involving uncertainty
• What are they used for?• Intelligent decision aids, data fusion, 3-E feature recognition,
intelligent diagnostic aids, automated free text understanding, data mining
• How are they different from other knowledge representation and probabilistic analysis tools?• Uncertainty is handled in a mathematically rigorous yet efficient
and simple way
![Page 74: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/74.jpg)
74
Bayes Net Concepts
1.Chain RuleP(A,B) = P(A) P(B|A)
2.Conditional IndependenceP(A|B,C) = P(A|B)
![Page 75: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/75.jpg)
75
A Simple Bayes Net
• Let’s assume that we already have P(Mpg,Horse)
How would you rewrite this using the Chain rule?
0.480.12bad
0.040.36good
highlowP(good, low) = 0.36P(good,high) = 0.04P( bad, low) = 0.12P( bad,high) = 0.48
P(Mpg, Horse) =
![Page 76: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/76.jpg)
76
Review: Chain Rule
0.480.12bad
0.040.36good
highlow
P(Mpg, Horse)
P(good, low) = 0.36P(good,high) = 0.04P( bad, low) = 0.12P( bad,high) = 0.48
P(Mpg, Horse) P(good) = 0.4P( bad) = 0.6
P( low|good) = 0.89P( low| bad) = 0.21P(high|good) = 0.11P(high| bad) = 0.79
P(Mpg)
P(Horse|Mpg)
*
![Page 77: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/77.jpg)
77
Review: Chain Rule
0.480.12bad
0.040.36good
highlow
P(Mpg, Horse)
P(good, low) = 0.36P(good,high) = 0.04P( bad, low) = 0.12P( bad,high) = 0.48
P(Mpg, Horse) P(good) = 0.4P( bad) = 0.6
P( low|good) = 0.89P( low| bad) = 0.21P(high|good) = 0.11P(high| bad) = 0.79
P(Mpg)
P(Horse|Mpg)
*
= P(good) * P(low|good) = 0.4 * 0.89
= P(good) * P(high|good) = 0.4 * 0.11
= P(bad) * P(low|bad) = 0.6 * 0.21
= P(bad) * P(high|bad) = 0.6 * 0.79
![Page 78: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/78.jpg)
78
How to Make a Bayes Net
P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
Mpg
Horse
![Page 79: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/79.jpg)
79
How to Make a Bayes Net
P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
Mpg
Horse
P(good) = 0.4P( bad) = 0.6
P(Mpg)
P( low|good) = 0.90P( low| bad) = 0.21P(high|good) = 0.10P(high| bad) = 0.79
P(Horse|Mpg)
![Page 80: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/80.jpg)
80
How to Make a Bayes Net
Mpg
Horse
P(good) = 0.4P( bad) = 0.6
P(Mpg)
P( low|good) = 0.90P( low| bad) = 0.21P(high|good) = 0.10P(high| bad) = 0.79
P(Horse|Mpg)
• Each node is a probability function
• Each arc denotes conditional dependence
![Page 81: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/81.jpg)
81
How to Make a Bayes Net
So, what have we accomplished thus far?
Nothing; we’ve just “Bayes Net-ified” the
P(Mpg, Horse) JPT using the Chain rule.
…the real excitement starts when we wield conditional independence
Mpg
Horse
P(Mpg)
P(Horse|Mpg)
![Page 82: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/82.jpg)
82
How to Make a Bayes Net
Before we continue, we need a worthier opponent than puny P(Mpg, Horse)…
We’ll use P(Mpg, Horse, Accel):
P(good, low,slow) = 0.37P(good, low,fast) = 0.01P(good,high,slow) = 0.02P(good,high,fast) = 0.00P( bad, low,slow) = 0.10P( bad, low,fast) = 0.12P( bad,high,slow) = 0.16P( bad,high,fast) = 0.22
P(Mpg,Horse,Accel)
* Note: I made these up…
![Page 83: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/83.jpg)
83
How to Make a Bayes Net
Step 1: Rewrite joint using the Chain rule.
P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)
Note:Obviously, we could have written this 3!=6 different ways…
P(M, H, A) = P(M) * P(H|M) * P(A|M,H) = P(M) * P(A|M) * P(H|M,A) = P(H) * P(M|H) * P(A|H,M) = P(H) * P(A|H) * P(M|H,A) = … = …
![Page 84: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/84.jpg)
84
How to Make a Bayes Net
Mpg
Horse
Accel
Step 1: Rewrite joint using the Chain rule.
P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)
![Page 85: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/85.jpg)
85
How to Make a Bayes Net
Mpg
Horse
Accel
P(Mpg)
P(Horse|Mpg)
P(Accel|Mpg,Horse)
![Page 86: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/86.jpg)
86
How to Make a Bayes Net
Mpg
Horse
Accel
P(good) = 0.4P( bad) = 0.6
P(Mpg)
P( low|good) = 0.90P( low| bad) = 0.21P(high|good) = 0.10P(high| bad) = 0.79
P(Horse|Mpg)
P(slow|good, low) = 0.97P(slow|good,high) = 0.15P(slow| bad, low) = 0.90P(slow| bad,high) = 0.05P(fast|good, low) = 0.03P(fast|good,high) = 0.85P(fast| bad, low) = 0.10P(fast| bad,high) = 0.95
P(Accel|Mpg,Horse)
* Note: I made these up too…
![Page 87: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/87.jpg)
87
How to Make a Bayes Net
Mpg
Horse
Accel
P(good) = 0.4P( bad) = 0.6
P(Mpg)
P( low|good) = 0.89P( low| bad) = 0.21P(high|good) = 0.11P(high| bad) = 0.79
P(Horse|Mpg)
P(slow|good, low) = 0.97P(slow|good,high) = 0.15P(slow| bad, low) = 0.90P(slow| bad,high) = 0.05P(fast|good, low) = 0.03P(fast|good,high) = 0.85P(fast| bad, low) = 0.10P(fast| bad,high) = 0.95
P(Accel|Mpg,Horse)
A Miracle Occurs!
You are told by God (or another domain expert)that Accel is independent of Mpg given Horse!
i.e., P(Accel | Mpg, Horse) = P(Accel | Horse)
![Page 88: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/88.jpg)
88
How to Make a Bayes Net
Mpg
Horse
Accel
P(good) = 0.4P( bad) = 0.6
P(Mpg)
P( low|good) = 0.89P( low| bad) = 0.21P(high|good) = 0.11P(high| bad) = 0.79
P(Horse|Mpg)
P(slow| low) = 0.22P(slow|high) = 0.64P(fast| low) = 0.78P(fast|high) = 0.36
P(Accel|Horse)
![Page 89: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/89.jpg)
89
How to Make a Bayes Net
Mpg
Horse
Accel
P(good) = 0.4P( bad) = 0.6
P(Mpg)
P( low|good) = 0.89P( low| bad) = 0.21P(high|good) = 0.11P(high| bad) = 0.79
P(Horse|Mpg)
P(slow| low) = 0.22P(slow|high) = 0.64P(fast| low) = 0.78P(fast|high) = 0.36
P(Accel|Horse)
Thank you, domain expert!
Now I only need to learn 5 parameters
instead of 7 from my data!
My parameter estimateswill be more accurate as
a result!
![Page 90: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/90.jpg)
90
Independence“The Acceleration does not depend on the Mpg once I know the Horsepower.”
This can be specified very simply:
P(Accel Mpg, Horse) = P(Accel | Horse)
This is a powerful statement!
It required extra domain knowledge. A different kind of knowledge than numerical probabilities. It needed an understanding of causation.
![Page 91: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/91.jpg)
91
Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair V , E where:
• V is a set of vertices.• E is a set of directed edges joining vertices. No loops
of any length are allowed.
Each vertex in V contains the following information:• A Conditional Probability Table (CPT) indicating how
this variable’s probabilities depend on all possible combinations of parental values.
![Page 92: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/92.jpg)
92
Bayes Nets Summary
• Bayes nets are a factorization of the full JPT which uses the chain rule and conditional independence.
• They can do everything a JPT can do (like quick, cheap lookups of probabilities)
![Page 93: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/93.jpg)
93
The good news
We can do inference.
We can compute any conditional probability:
P( Some variable Some other variable values )
2
2 1
matching entriesjoint
and matching entriesjoint
2
2121 )entryjoint (
)entryjoint (
)(
)()|(
E
EE
P
P
EP
EEPEEP
![Page 94: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/94.jpg)
94
The good news
We can do inference.
We can compute any conditional probability:
P( Some variable Some other variable values )
2
2 1
matching entriesjoint
and matching entriesjoint
2
2121 )entryjoint (
)entryjoint (
)(
)()|(
E
EE
P
P
EP
EEPEEP
Suppose you have m binary-valued variables in your Bayes Net and expression E2 mentions k variables.
How much work is the above computation?
![Page 95: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/95.jpg)
95
The sad, bad news
Doing inference “JPT-style” by enumerating all matching entries in the joint are expensive:
Exponential in the number of variables.
But perhaps there are faster ways of querying Bayes nets?• In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find
there are often many tricks to save you time.• So we’ve just got to program our computer to do those tricks too, right?
Sadder and worse news:General querying of Bayes nets is NP-complete.
![Page 96: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/96.jpg)
96
Case Study I
Pathfinder system. (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA).
• Diagnostic system for lymph-node diseases.
• 60 diseases and 100 symptoms and test-results.
• 14,000 probabilities
• Expert consulted to make net.
• 8 hours to determine variables.
• 35 hours for net topology.
• 40 hours for probability table values.
• Apparently, the experts found it quite easy to invent the causal links and probabilities.
• Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.
![Page 97: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/97.jpg)
97
Bayes Net Info
GUI Packages:• Genie -- Free • Netica -- $$• Hugin -- $$
Non-GUI Packages:• All of the above have APIs• BNT for MATLAB• AUTON code (learning extremely large networks of tens
of thousands of nodes)
![Page 98: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/98.jpg)
98
Bayes Nets andMachine Learning
![Page 99: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/99.jpg)
99
Machine Learning Tasks
ClassifierData point x
AnomalyDetector
Data point x P(x)
P(C | x)
Inference
Engine
Evidence e1P(e2 | e1) Missing Variables e2
![Page 100: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/100.jpg)
100
What is an Anomaly?
• An irregularity that cannot be explained by simple domain models and knowledge
• Anomaly detection only needs to learn from examples of “normal” system behavior.
• Classification, on the other hand, would need examples labeled “normal” and “not-normal”
![Page 101: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/101.jpg)
101
Anomaly Detection in Practice
• Monitoring computer networks for attacks.
• Monitoring population-wide health data for outbreaks or attacks.
• Looking for suspicious activity in bank transactions
• Detecting unusual eBay selling/buying behavior.
![Page 102: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/102.jpg)
102
JPTAnomaly Detector
• Suppose we have the following model:
P(good, low) = 0.36P(good,high) = 0.04P( bad, low) = 0.12P( bad,high) = 0.48
P(Mpg, Horse) =
• We’re trying to detect anomalous cars.
• If the next example we see is <good,high>, how anomalous is it?
![Page 103: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/103.jpg)
103
JPTAnomaly Detector
04.0
),(),(
highgoodPhighgoodlikelihood
P(good, low) = 0.36P(good,high) = 0.04P( bad, low) = 0.12P( bad,high) = 0.48
P(Mpg, Horse) = How likely is
<good,high>?
Could not be easier! Just look up the entry in the JPT!
Smaller numbers are more anomalous in that themodel is more surprised to see them.
![Page 104: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/104.jpg)
104
Bayes NetAnomaly Detector
04.0
)|()(
),(),(
goodhighPgoodP
highgoodPhighgoodlikelihood
How likely is
<good,high>?
Mpg
Horse
P(good) = 0.4P( bad) = 0.6
P(Mpg)
P( low|good) = 0.90P( low| bad) = 0.21P(high|good) = 0.10P(high| bad) = 0.79
P(Horse|Mpg)
![Page 105: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/105.jpg)
105
Bayes NetAnomaly Detector
04.0
)|()(
),(),(
goodhighPgoodP
highgoodPhighgoodlikelihood
How likely is
<good,high>?
Mpg
Horse
P(good) = 0.4P( bad) = 0.6
P(Mpg)
P( low|good) = 0.90P( low| bad) = 0.21P(high|good) = 0.10P(high| bad) = 0.79
P(Horse|Mpg)
Again, trivial!We need to do one tiny lookup
for each variable in the network!
![Page 106: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/106.jpg)
106
Machine Learning Tasks
ClassifierData point x
AnomalyDetector
Data point x P(x)
P(C | x)
Inference
Engine
Evidence e1P(E2 | e1) Missing Variables E2
![Page 107: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/107.jpg)
107
Bayes Classifiers
• A formidable and sworn enemy of decision trees
DT BC
![Page 108: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/108.jpg)
108
Bayes Classifiers in 1 Slide
Bayes classifiers just do inference.
That’s it.
The “algorithm”1. Learn P(class,X)
2. For a given input x, infer P(class|x)
3. Choose the class with the highest probability
![Page 109: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/109.jpg)
109
JPTBayes Classifier
• Suppose we have the following model:
P(good, low) = 0.36P(good,high) = 0.04P( bad, low) = 0.12P( bad,high) = 0.48
P(Mpg, Horse) =
• We’re trying to classify cars as Mpg = “good” or “bad”
• If the next example we see is Horse = “low”, how do we classify it?
![Page 110: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/110.jpg)
110
JPTBayes Classifier
)(
),()|(
lowP
lowgoodPlowgoodP
P(good, low) = 0.36P(good,high) = 0.04P( bad, low) = 0.12P( bad,high) = 0.48
P(Mpg, Horse) =
),(),(
),(
lowbadPlowgoodP
lowgoodP
739.012.036.0
36.0
How do we classify
<Horse=low>?
The P(good | low) = 0.75,so we classify the example
as “good”
![Page 111: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/111.jpg)
111
Bayes Net Classifier
Mpg
Horse
Accel
P(good) = 0.4P( bad) = 0.6
P(Mpg)
P( low|good) = 0.89P( low| bad) = 0.21P(high|good) = 0.11P(high| bad) = 0.79
P(Horse|Mpg)
P(slow| low) = 0.95P(slow|high) = 0.11P(fast| low) = 0.05P(fast|high) = 0.89
P(Accel|Horse)
• We’re trying to classify cars as Mpg = “good” or “bad”
• If the next example we see is <Horse=low,Accel=fast> how do we classify it?
![Page 112: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/112.jpg)
112Suppose we get a
<Horse=low, Accel=fast> example?
),(
),,(),|(
fastlowP
fastlowgoodPfastlowgoodP
Mpg
Horse
Accel
P(good) = 0.4P( bad) = 0.6
P(Mpg)
P( low|good) = 0.89P( low| bad) = 0.21P(high|good) = 0.11P(high| bad) = 0.79
P(Horse|Mpg)
P(slow| low) = 0.95P(slow|high) = 0.11P(fast| low) = 0.05P(fast|high) = 0.89
P(Accel|Horse)
),(
)|()|()(
fastlowP
lowfastPgoodlowPgoodP
),(
0178.0
),(
)05.0)(89.0)(4.0(
fastlowPfastlowP
),,(),,(
0178.0
fastlowbadPfastlowgoodP
75.0 Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…
Bayes NetBayes Classifier
![Page 113: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/113.jpg)
113
Mpg
Horse
Accel
P(good) = 0.4P( bad) = 0.6
P(Mpg)
P( low|good) = 0.89P( low| bad) = 0.21P(high|good) = 0.11P(high| bad) = 0.79
P(Horse|Mpg)
P(slow| low) = 0.95P(slow|high) = 0.11P(fast| low) = 0.05P(fast|high) = 0.89
P(Accel|Horse)
The P(good | low, fast) = 0.75,so we classify the example
as “good”.
…but that seems somehow familiar…
Wasn’t that the same answer asP(Mpg=good | Horse=low)?
Bayes NetBayes Classifier
![Page 114: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/114.jpg)
114
Bayes Classifiers
• OK, so classification can be posed as inference
• In fact, virtually all machine learning tasks are a form of inference
• Anomaly detection: P(x)• Classification: P(Class | x)• Regression: P(Y | x)• Model Learning: P(Model | dataset)• Feature Selection: P(Model | dataset)
![Page 115: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/115.jpg)
115
The Naïve Bayes Classifier
ASSUMPTION: all the attributes are conditionally independent
given the class variable
![Page 116: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/116.jpg)
116
At least 256 parameters!You better have the data
to support them…
A mere 25 parameters!(the CPTs are tiny because the attribute
nodes only have one parent.)
The Naïve Bayes Advantage
![Page 117: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/117.jpg)
117
What is the Probability Functionof the Naïve Bayes?
P(Mpg,Cylinders,Weight,Maker,…) =
P(Mpg) P(Cylinders|Mpg) P(Weight|Mpg) P(Maker|Mpg) …
![Page 118: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/118.jpg)
118
What is the Probability Functionof the Naïve Bayes?
i
i classxPclassPclassP )|()(),( x
This is another great feature of Bayes Nets; you can graphically
see your model assumptions
![Page 119: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/119.jpg)
119
Bayes Classifier Results: “MPG”:
392 records
The Classifier
learned by “Naive BC”
![Page 120: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/120.jpg)
120
Bayes Classifier Results: “MPG”:
40 records
![Page 121: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/121.jpg)
121
More Facts About Bayes Classifiers
• Many other density estimators can be slotted in
• Density estimation can be performed with real-valued inputs
• Bayes Classifiers can be built with real-valued inputs
• Rather Technical Complaint: Bayes Classifiers don’t try to be maximally discriminative---they merely try to honestly model what’s going on
• Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words “Dirichlet Prior”) can help.
• Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully!
![Page 122: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/122.jpg)
122
Summary
• Axioms of Probability
• Bayes nets are created by• chain rule• conditional independence
• Bayes Nets can do• Inference• Anomaly Detection• Classification
![Page 123: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/123.jpg)
123
![Page 124: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/124.jpg)
124
Using Bayes Rule to Gamble
The “Win” envelope has a dollar and four beads in it
$1.00
The “Lose” envelope has three beads and no money
Trivial question: someone draws an envelope at random and offers to sell it to you. How much should you pay?
![Page 125: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/125.jpg)
125
Using Bayes Rule to Gamble
The “Win” envelope has a dollar and four beads in it
$1.00
The “Lose” envelope has three beads and no moneyInteresting question: before deciding, you are allowed to see
one bead drawn from the envelope.Suppose it’s black: How much should you pay? Suppose it’s red: How much should you pay?
![Page 126: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/126.jpg)
126
Calculation…$1.00
![Page 127: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/127.jpg)
127
Probability Model Uses
ClassifierData point x
AnomalyDetector
Data point x P(x)
P(C | x)
Inference
Engine
Evidence e1P(E2 | e1) Missing Variables E2
How do we evaluate a particular density estimator?
![Page 128: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/128.jpg)
128
Probability Models
Full Prob.Table Naïve Prob.
• No assumptions• Overfitting-prone• Scales horribly
• Strong assumptions• Overfitting-resistant• Scales incredibly well
Bayes Nets
• Carefully chosen assumptions• Overfitting and scaling
properties depend on assumptions
![Page 129: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/129.jpg)
129
What you should know
• Probability• Fundamentals of Probability and Bayes Rule• What’s a Joint Distribution• How to do inference (i.e. P(E1|E2)) once you have a JD
• Density Estimation• What is DE and what is it good for• How to learn a Joint DE• How to learn a naïve DE
![Page 130: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/130.jpg)
130
How to build a Bayes Classifier• Assume you want to predict output Y which has arity nY and values v1, v2, …
vny.
• Assume there are m input attributes called X1, X2, … Xm
• Break dataset into nY smaller datasets called DS1, DS2, … DSny.
• Define DSi = Records in which Y=vi
• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.
![Page 131: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/131.jpg)
131
How to build a Bayes Classifier• Assume you want to predict output Y which has arity nY and values v1, v2, …
vny.
• Assume there are m input attributes called X1, X2, … Xm
• Break dataset into nY smaller datasets called DS1, DS2, … DSny.
• Define DSi = Records in which Y=vi
• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.
• Mi estimates P(X1, X2, … Xm | Y=vi )
![Page 132: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/132.jpg)
132
How to build a Bayes Classifier• Assume you want to predict output Y which has arity nY and values v1, v2, …
vny.
• Assume there are m input attributes called X1, X2, … Xm
• Break dataset into nY smaller datasets called DS1, DS2, … DSny.
• Define DSi = Records in which Y=vi
• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.
• Mi estimates P(X1, X2, … Xm | Y=vi )
• Idea: When a new set of input values (X1 = u1, X2 = u2, …. Xm = um) come along to be evaluated predict the value of Y that makes P(X1, X2, … Xm | Y=vi ) most likely
)|(argmax 11predict vYuXuXPY mm
v
Is this a good idea?
![Page 133: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/133.jpg)
133
How to build a Bayes Classifier• Assume you want to predict output Y which has arity nY and values v1, v2, …
vny.
• Assume there are m input attributes called X1, X2, … Xm
• Break dataset into nY smaller datasets called DS1, DS2, … DSny.
• Define DSi = Records in which Y=vi
• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.
• Mi estimates P(X1, X2, … Xm | Y=vi )
• Idea: When a new set of input values (X1 = u1, X2 = u2, …. Xm = um) come along to be evaluated predict the value of Y that makes P(X1, X2, … Xm | Y=vi ) most likely
)|(argmax 11predict vYuXuXPY mm
v
Is this a good idea?
This is a Maximum Likelihood classifier.
It can get silly if some Ys are very unlikely
![Page 134: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/134.jpg)
134
How to build a Bayes Classifier• Assume you want to predict output Y which has arity nY and values v1, v2, …
vny.
• Assume there are m input attributes called X1, X2, … Xm
• Break dataset into nY smaller datasets called DS1, DS2, … DSny.
• Define DSi = Records in which Y=vi
• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.
• Mi estimates P(X1, X2, … Xm | Y=vi )
• Idea: When a new set of input values (X1 = u1, X2 = u2, …. Xm = um) come along to be evaluated predict the value of Y that makes P(Y=vi | X1, X2, … Xm) most likely
)|(argmax 11predict
mmv
uXuXvYPY
Is this a good idea?
Much Better Idea
![Page 135: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/135.jpg)
135
Terminology
• MLE (Maximum Likelihood Estimator):
• MAP (Maximum A-Posteriori Estimator):
)|(argmax 11predict
mmv
uXuXvYPY
)|(argmax 11predict vYuXuXPY mm
v
![Page 136: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/136.jpg)
136
Getting what we need
)|(argmax 11predict
mmv
uXuXvYPY
![Page 137: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/137.jpg)
137
Getting a posterior probability
Yn
jjjmm
mm
mm
mm
mm
vYPvYuXuXP
vYPvYuXuXP
uXuXP
vYPvYuXuXP
uXuXvYP
111
11
11
11
11
)()|(
)()|(
)(
)()|(
)|(
![Page 138: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/138.jpg)
138
Bayes Classifiers in a nutshell
)()|(argmax
)|(argmax
11
11predict
vYPvYuXuXP
uXuXvYPY
mmv
mmv
1. Learn the distribution over inputs for each value Y.
2. This gives P(X1, X2, … Xm | Y=vi ).
3. Estimate P(Y=vi ). as fraction of records with Y=vi .
4. For a new prediction:
![Page 139: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/139.jpg)
139
Bayes Classifiers in a nutshell
)()|(argmax
)|(argmax
11
11predict
vYPvYuXuXP
uXuXvYPY
mmv
mmv
1. Learn the distribution over inputs for each value Y.
2. This gives P(X1, X2, … Xm | Y=vi ).
3. Estimate P(Y=vi ). as fraction of records with Y=vi .
4. For a new prediction:
We can use our favorite Density Estimator here.
Right now we have two options:
•Joint Density Estimator•Naïve Density Estimator
![Page 140: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/140.jpg)
140
Joint Density Bayes Classifier
)()|(argmax 11predict vYPvYuXuXPY mm
v
In the case of the joint Bayes Classifier this degenerates to a very simple rule:
Ypredict = the most common value of Y among records in which X1 = u1, X2 = u2, …. Xm = um.
Note that if no records have the exact set of inputs X1 = u1, X2 = u2, …. Xm = um, then P(X1, X2, … Xm | Y=vi ) = 0 for all values of Y.
In that case we just have to guess Y’s value
![Page 141: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/141.jpg)
141
Naïve Bayes Classifier
)()|(argmax 11predict vYPvYuXuXPY mm
v
In the case of the naive Bayes Classifier this can be simplified:
Yn
jjj
vvYuXPvYPY
1
predict )|()(argmax
![Page 142: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/142.jpg)
142
Naïve Bayes Classifier
)()|(argmax 11predict vYPvYuXuXPY mm
v
In the case of the naive Bayes Classifier this can be simplified:
Yn
jjj
vvYuXPvYPY
1
predict )|()(argmax
Technical Hint:If you have 10,000 input attributes that product will underflow in floating point math. You should use logs:
Yn
jjj
vvYuXPvYPY
1
predict )|(log)(logargmax
![Page 143: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/143.jpg)
143
What you should know
• Bayes Classifiers• How to build one• How to predict with a BC• Contrast between naïve and joint BCs
![Page 144: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/144.jpg)
144
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.
E.G. What could we do to compute P(R T,~S)?
S M
RL
T
P(s)=0.3P(M)=0.6
P(RM)=0.3P(R~M)=0.6
P(TL)=0.3P(T~L)=0.8
P(LM^S)=0.05P(LM^~S)=0.1P(L~M^S)=0.1P(L~M^~S)=0.2
![Page 145: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/145.jpg)
145
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.
E.G. What could we do to compute P(R T,~S)?
S M
RL
T
P(s)=0.3P(M)=0.6
P(RM)=0.3P(R~M)=0.6
P(TL)=0.3P(T~L)=0.8
P(LM^S)=0.05P(LM^~S)=0.1P(L~M^S)=0.1P(L~M^~S)=0.2
Step 1: Compute P(R ^ T ^ ~S)
Step 2: Compute P(~R ^ T ^ ~S)
Step 3: Return
P(R ^ T ^ ~S)
-------------------------------------
P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)
![Page 146: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/146.jpg)
146
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.
E.G. What could we do to compute P(R T,~S)?
S M
RL
T
P(s)=0.3P(M)=0.6
P(RM)=0.3P(R~M)=0.6
P(TL)=0.3P(T~L)=0.8
P(LM^S)=0.05P(LM^~S)=0.1P(L~M^S)=0.1P(L~M^~S)=0.2
Step 1: Compute P(R ^ T ^ ~S)
Step 2: Compute P(~R ^ T ^ ~S)
Step 3: Return
P(R ^ T ^ ~S)
-------------------------------------
P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)
Sum of all the rows in the Joint that match R ^ T ^ ~S
Sum of all the rows in the Joint that match ~R ^ T ^ ~S
![Page 147: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/147.jpg)
147
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.
E.G. What could we do to compute P(R T,~S)?
S M
RL
T
P(s)=0.3P(M)=0.6
P(RM)=0.3P(R~M)=0.6
P(TL)=0.3P(T~L)=0.8
P(LM^S)=0.05P(LM^~S)=0.1P(L~M^S)=0.1P(L~M^~S)=0.2
Step 1: Compute P(R ^ T ^ ~S)
Step 2: Compute P(~R ^ T ^ ~S)
Step 3: Return
P(R ^ T ^ ~S)
-------------------------------------
P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)
Sum of all the rows in the Joint that match R ^ T ^ ~S
Sum of all the rows in the Joint that match ~R ^ T ^ ~S
Each of these obtained by the “computing a joint probability entry” method of the earlier slides
4 joint computes
4 joint computes
![Page 148: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/148.jpg)
148
IndependenceWe’ve stated:
P(M) = 0.6P(S) = 0.3P(S M) = P(S)
M S Prob
T T
T F
F T
F F
And since we now have the joint pdf, we can make any queries we like.
From these statements, we can derive the full joint pdf.
![Page 149: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/149.jpg)
149
Classic Machine Learning Tasks
ClassifierData point x
AnomalyDetector
Data point x P(x)
P(C | x)
Inference
Engine
Evidence e1P(E2 | e1) Missing Variables E2
![Page 150: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/150.jpg)
150
Anomaly Detection on 1 page
![Page 151: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/151.jpg)
151
A note about independence
• Assume A and B are Boolean Random Variables. Then
“A and B are independent”
if and only if
P(A|B) = P(A)
• “A and B are independent” is often notated as
BA
![Page 152: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/152.jpg)
152
Independence Theorems
• Assume P(A|B) = P(A)• Then P(A^B) =
= P(A) P(B)
• Assume P(A|B) = P(A)• Then P(B|A) =
= P(B)
![Page 153: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/153.jpg)
153
Independence Theorems
• Assume P(A|B) = P(A)• Then P(~A|B) =
= P(~A)
• Assume P(A|B) = P(A)• Then P(A|~B) =
= P(A)
![Page 154: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/154.jpg)
154
Multivalued Independence
For multivalued Random Variables A and B,
BAif and only if
)()|(:, uAPvBuAPvu from which you can then prove things like…
)()()(:, vBPuAPvBuAPvu )()|(:, vBPvAvBPvu
![Page 155: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/155.jpg)
155
Back to Naïve Density Estimation
• Let x[i] denote the i’th field of record x:• Naïve DE assumes x[i] is independent of {x[1],x[2],..x[i-1], x[i+1],…x[M]}• Example:
• Suppose that each record is generated by randomly shaking a green dice and a red dice
• Dataset 1: A = red value, B = green value
• Dataset 2: A = red value, B = sum of values
• Dataset 3: A = sum of values, B = difference of values
• Which of these datasets violates the naïve assumption?
![Page 156: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/156.jpg)
156
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily compute any row of the joint distribution.
• Suppose A, B, C and D are independently distributed. What is P(A^~B^C^~D)?
![Page 157: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/157.jpg)
157
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily compute any row of the joint distribution.
• Suppose A, B, C and D are independently distributed. What is P(A^~B^C^~D)?
= P(A|~B^C^~D) P(~B^C^~D)
= P(A) P(~B^C^~D)
= P(A) P(~B|C^~D) P(C^~D)
= P(A) P(~B) P(C^~D)
= P(A) P(~B) P(C|~D) P(~D)
= P(A) P(~B) P(C) P(~D)
![Page 158: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/158.jpg)
158
Naïve Distribution General Case
• Suppose x[1], x[2], … x[M] are independently distributed.
M
kkM ukxPuMxuxuxP
121 )][()][,]2[,]1[(
• So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand.
• So we can do any inference • But how do we learn a Naïve Density
Estimator?
![Page 159: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/159.jpg)
159
Learning a Naïve Density Estimator
records ofnumber total
][in which records#)][(ˆ uixuixP
Another trivial learning algorithm!
![Page 160: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/160.jpg)
160
Contrast
Joint DE Naïve DE
Can model anything Can model only very boring distributions
No problem to model “C is a noisy copy of A”
Outside Naïve’s scope
Given 100 records and more than 6 Boolean attributes will screw up badly
Given 100 records and 10,000 multivalued attributes will be fine
![Page 161: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/161.jpg)
161
A tiny part of the DE
learned by “Joint”
Empirical Results: “MPG”The “MPG” dataset consists of 392 records and 8 attributes
The DE learned by
“Naive”
![Page 162: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/162.jpg)
162
A tiny part of the DE
learned by “Joint”
Empirical Results: “MPG”The “MPG” dataset consists of 392 records and 8 attributes
The DE learned by
“Naive”
![Page 163: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/163.jpg)
163
The DE learned by
“Joint”
Empirical Results: “Weight vs. MPG”Suppose we train only from the “Weight” and “MPG” attributes
The DE learned by
“Naive”
![Page 164: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/164.jpg)
164
The DE learned by
“Joint”
Empirical Results: “Weight vs. MPG”Suppose we train only from the “Weight” and “MPG” attributes
The DE learned by
“Naive”
![Page 165: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/165.jpg)
165
The DE learned by
“Joint”
“Weight vs. MPG”: The best that Naïve can do
The DE learned by
“Naive”
![Page 166: Slide 1 Copyright © 2001, Andrew W. Moore Probabilistic and Bayesian Analytics Brigham S. Anderson School of Computer Science Carnegie Mellon University.](https://reader034.fdocuments.us/reader034/viewer/2022052509/56649cfa5503460f949cbe95/html5/thumbnails/166.jpg)
166
The Axioms Of Probability