Transcript of Bayesian Learning. No reading assignment for this topic.
- Slide 1
- Bayesian Learning No reading assignment for this topic
- Slide 2
- Conditional Probability: the probability of an event given the
occurrence of some other event. E.g., consider choosing a card from
a well-shuffled standard deck of 52 playing cards. Given that the
first card chosen is an ace, what is the probability that the
second card chosen will be an ace?
- Slide 3
- [Figure: event space = all possible pairs of cards, with regions X and Y]
- Slide 4
- [Figure: Y = first card is an Ace, within the event space of all possible pairs of cards]
- Slide 5
- [Figure: Y = first card is an Ace, X = second card is an Ace, within the event space of all possible pairs of cards]
- Slide 6
- P(Y) = 4/52. P(X,Y) = (# of possible ordered pairs of aces) / (total # of
ordered pairs) = (4 × 3) / (52 × 51) = 12/2652. P(X | Y) = P(X,Y) / P(Y) =
(12/2652) / (4/52) = 3/51.
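The slide's arithmetic can be checked by brute force. This is a minimal sketch that enumerates all ordered pairs of distinct cards; representing the four aces as cards 0–3 is an assumption of the sketch, not something the slide specifies.

```python
# Verify P(second card is ace | first card is ace) by enumerating
# all ordered pairs of distinct cards from a 52-card deck.
from itertools import permutations

deck = range(52)
aces = {0, 1, 2, 3}   # stand-ins for the four aces (illustrative choice)

pairs = list(permutations(deck, 2))            # all 52 * 51 ordered pairs
first_ace = [p for p in pairs if p[0] in aces]
both_aces = [p for p in first_ace if p[1] in aces]

p_y = len(first_ace) / len(pairs)              # P(Y)   = 4/52
p_xy = len(both_aces) / len(pairs)             # P(X,Y) = 12/2652
print(p_xy / p_y)                              # P(X|Y) = 3/51, about 0.0588
```

The counts match the slide: 12 ace-ace pairs out of 2652 total pairs, and conditioning on Y divides by 4/52.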
- Slide 7
- Deriving Bayes Rule: from the definition of conditional probability,
P(X | Y) = P(X,Y) / P(Y) and P(Y | X) = P(X,Y) / P(X). Substituting
P(X,Y) = P(Y | X) P(X) into the first equation gives Bayes rule:
P(X | Y) = P(Y | X) P(X) / P(Y).
- Slide 8
- Bayesian Learning
- Slide 9
- Application to Machine Learning: in machine learning we have a
space H of hypotheses: h1, h2, ..., hn. We also have a set D of
data. We want to calculate P(h | D).
- Slide 10
- Terminology: Prior probability of h, P(h): the probability that hypothesis h is
true given our prior knowledge. If we have no prior knowledge, all h ∈ H are
equally probable. Posterior probability of h, P(h | D): the probability
that hypothesis h is true, given the data D. Likelihood of D, P(D | h):
the probability that we will see data D, given that hypothesis h is true.
- Slide 11
- Bayes Rule: Machine Learning Formulation: P(h | D) = P(D | h) P(h) / P(D)
- Slide 12
- Example
- Slide 13
- The Monty Hall Problem: You are a contestant on a game show.
There are 3 doors: A, B, and C. There is a new car behind one of
them and goats behind the other two. Monty Hall, the host, asks you
to pick a door, any door. You pick door A. Monty tells you he will
open a door, different from A, that has a goat behind it. He opens
door B: behind it there is a goat. Monty now gives you a choice:
stick with your original choice A, or switch to C. Should you
switch? http://math.ucsd.edu/~crypto/Monty/monty.html
- Slide 14
- Bayesian probability formulation: Hypothesis space H: h1 = car
is behind door A; h2 = car is behind door B; h3 = car is behind
door C. Data D = Monty opened B. What is P(h1 | D)? What is P(h2 |
D)? What is P(h3 | D)?
- Slide 15
- Event space: all possible configurations of cars and goats behind
doors A, B, and C. Y = goat behind door B; X = car behind door A.
- Slide 16
- [Figure: Bayes rule applied to this event space, with Y = goat behind
door B and X = car behind door A]
- Slide 17
- Using Bayes Rule to solve the Monty Hall problem: You pick door A.
Data D = Monty opened door B. Hypothesis space H: h1 = car is behind
door A; h2 = car is behind door C; h3 = car is behind door B. What is
P(h1 | D)? What is P(h2 | D)? What is P(h3 | D)? Prior probabilities:
P(h1) = P(h2) = P(h3) = 1/3. Likelihoods: P(D | h1) = 1/2, P(D | h2) = 1,
P(D | h3) = 0. P(D) = P(D | h1)P(h1) + P(D | h2)P(h2) + P(D | h3)P(h3)
= 1/6 + 1/3 + 0 = 1/2. By Bayes rule: P(h1 | D) = P(D | h1)P(h1) / P(D)
= (1/2 × 1/3) / (1/2) = 1/3, and P(h2 | D) = P(D | h2)P(h2) / P(D)
= (1 × 1/3) / (1/2) = 2/3. So you should switch!
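The computation above can be sketched in a few lines. The likelihoods follow the slide's setup: when the car is behind the picked door A, Monty is assumed to open B or C with equal probability; he never reveals the car.

```python
# Monty Hall posterior via Bayes rule, using exact fractions.
# D = "Monty opened door B" after you picked door A.
from fractions import Fraction

p_opens_B = {"A": Fraction(1, 2),  # car behind A: Monty opens B or C equally
             "B": Fraction(0),     # Monty never reveals the car
             "C": Fraction(1)}     # only B is left to open

prior = Fraction(1, 3)                                # uniform over doors
p_D = sum(p_opens_B[car] * prior for car in "ABC")    # P(D) = 1/2
posterior = {car: p_opens_B[car] * prior / p_D for car in "ABC"}
print(posterior)   # A: 1/3, B: 0, C: 2/3 -> switching to C doubles your chance
```

Using `Fraction` keeps the arithmetic exact, so the 1/3 vs. 2/3 split matches the slide without floating-point noise.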
- Slide 18
- MAP (maximum a posteriori) Learning. Bayes rule: P(h | D) =
P(D | h) P(h) / P(D). Goal of learning: find the maximum a posteriori
hypothesis h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h),
because P(D) is a constant independent of h.
- Slide 19
- Note: if every h ∈ H is equally probable, then h_MAP =
argmax_{h ∈ H} P(D | h). This is called the maximum likelihood
hypothesis, h_ML.
- Slide 20
- A Medical Example: Toby takes a test for leukemia. The test has
two outcomes: positive and negative. It is known that if the
patient has leukemia, the test is positive 98% of the time. If the
patient does not have leukemia, the test is positive 3% of the
time. It is also known that 0.8% of the population has leukemia.
Toby's test is positive. Which is more likely: Toby has leukemia or
Toby does not have leukemia?
- Slide 21
- Hypothesis space: h1 = Toby has leukemia; h2 = Toby does not have
leukemia. Prior: 0.8% of the population has leukemia, thus P(h1)
= 0.008 and P(h2) = 0.992. Likelihood: P(+ | h1) = 0.98, P(− | h1)
= 0.02; P(+ | h2) = 0.03, P(− | h2) = 0.97. Posterior knowledge:
the blood test is + for this patient.
- Slide 22
- In summary: P(h1) = 0.008, P(h2) = 0.992; P(+ | h1) = 0.98,
P(− | h1) = 0.02; P(+ | h2) = 0.03, P(− | h2) = 0.97. Thus:
P(+ | h1) P(h1) = 0.98 × 0.008 ≈ 0.0078 and P(+ | h2) P(h2) =
0.03 × 0.992 ≈ 0.0298, so h_MAP = h2: Toby most likely does not
have leukemia.
- Slide 23
- What is P(leukemia | +)? By Bayes rule, P(leukemia | +) =
0.0078 / (0.0078 + 0.0298) ≈ 0.21, so P(no leukemia | +) ≈ 0.79.
These are called the posterior probabilities.
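The leukemia posteriors can be reproduced directly from the slide's numbers; this is a small sketch of the same total-probability and Bayes-rule steps.

```python
# Posterior probabilities for the leukemia test, using the slide's numbers.
p_h1, p_h2 = 0.008, 0.992          # priors: leukemia / no leukemia
p_pos_h1, p_pos_h2 = 0.98, 0.03    # P(+ | h1), P(+ | h2)

p_pos = p_pos_h1 * p_h1 + p_pos_h2 * p_h2      # P(+) by total probability
p_h1_pos = p_pos_h1 * p_h1 / p_pos             # P(leukemia | +)
p_h2_pos = p_pos_h2 * p_h2 / p_pos             # P(no leukemia | +)
print(round(p_h1_pos, 3), round(p_h2_pos, 3))  # roughly 0.21 vs 0.79
```

Note how a 98%-accurate test still leaves only about a 21% chance of leukemia, because the prior P(h1) = 0.008 is so small.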
- Slide 24
- In-Class Exercise: Suppose you receive an e-mail message with
the subject "Hi". You have been keeping statistics on your e-mail,
and have found that while only 10% of the total e-mail messages you
receive are spam, 50% of the spam messages have the subject "Hi"
and 2% of the non-spam messages have the subject "Hi". What is the
probability that the message is spam?
- Slide 25
- Bayesianism vs. Frequentism. Classical (frequentist) probability:
the probability of a particular event is defined relative to its
frequency in a sample space of events. E.g., the probability that the
coin will come up heads on the next trial is defined relative to
the frequency of heads in a sample space of coin tosses. Bayesian
probability: combine the measure of prior belief you have in a
proposition with your subsequent observations of events. Example:
a Bayesian can assign a probability to the statement "There was life
on Mars a billion years ago" but a frequentist cannot.
- Slide 26
- Independence and Conditional Independence: Two random variables,
X and Y, are independent if P(X, Y) = P(X) P(Y). Two random
variables, X and Y, are independent given Z if P(X, Y | Z) =
P(X | Z) P(Y | Z). Examples?
- Slide 27
- Naive Bayes Classifier: Let f(x) be a target function for
classification: f(x) ∈ {+1, −1}. Let x = ⟨x1, x2, ..., xn⟩. We want
to find the most probable class value, h_MAP, given the data x.
- Slide 28
- By Bayes' theorem: h_MAP = argmax_class P(x1, x2, ..., xn | class)
P(class). P(class) can be estimated from the training data. How?
However, in general, it is not practical to use the training data
to estimate P(x1, x2, ..., xn | class). Why not?
- Slide 29
- Naive Bayes classifier: Assume P(x1, x2, ..., xn | class) =
Π_i P(xi | class). Is this a good assumption? Given this assumption,
here's how to classify an instance x = ⟨x1, x2, ..., xn⟩ with the
naive Bayes classifier: class_NB = argmax_class P(class) Π_i
P(xi | class). Estimate the values of these various probabilities
over the training set.
- Slide 30
- Training data:
  Day  Outlook   Temp  Humidity  Wind    PlayTennis
  D1   Sunny     Hot   High      Weak    No
  D2   Sunny     Hot   High      Strong  No
  D3   Overcast  Hot   High      Weak    Yes
  D4   Rain      Mild  High      Weak    Yes
  D5   Rain      Cool  Normal    Weak    Yes
  D6   Rain      Cool  Normal    Strong  No
  D7   Overcast  Cool  Normal    Strong  Yes
  D8   Sunny     Mild  High      Weak    No
  D9   Sunny     Cool  Normal    Weak    Yes
  D10  Rain      Mild  Normal    Weak    Yes
  D11  Sunny     Mild  Normal    Strong  Yes
  D12  Overcast  Mild  High      Strong  Yes
  D13  Overcast  Hot   Normal    Weak    Yes
  D14  Rain      Mild  High      Strong  No
  Test data:
  D15  Sunny     Cool  High      Strong  ?
- Slide 31
- In practice, use the training data to compute a probabilistic
model:
- Slide 32
- Estimating probabilities. Recap: in the previous example, we had a
training set and a new example, and we asked: what classification is
given by a naive Bayes classifier? Let n(c) be the number of
training instances with class c, and n(xi = ai, c) be the number
of training instances with attribute value xi = ai and class c.
Then P(xi = ai | c) = n(xi = ai, c) / n(c).
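The counting estimates above can be applied to the PlayTennis data to classify the test day D15 = ⟨Sunny, Cool, High, Strong⟩. This sketch uses the table from the earlier slide; the helper names are mine, not the slide's.

```python
# Naive Bayes with count-based estimates on the PlayTennis training set,
# classifying D15 = <Sunny, Cool, High, Strong>.
data = [  # (Outlook, Temp, Humidity, Wind, PlayTennis)
    ("Sunny","Hot","High","Weak","No"),      ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),  ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),   ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),  ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

def n(c):                      # n(c): # of training instances with class c
    return sum(1 for row in data if row[-1] == c)

def n_attr(i, a, c):           # n(x_i = a_i, c)
    return sum(1 for row in data if row[i] == a and row[-1] == c)

def score(x, c):               # P(c) * prod_i P(x_i = a_i | c)
    p = n(c) / len(data)
    for i, a in enumerate(x):
        p *= n_attr(i, a, c) / n(c)
    return p

x = ("Sunny", "Cool", "High", "Strong")            # D15
print({c: score(x, c) for c in ("Yes", "No")})     # "No" scores higher
```

With 9 Yes and 5 No instances, the No score (about 0.0206) beats the Yes score (about 0.0053), so the classifier predicts PlayTennis = No for D15.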
- Slide 33
- Problem with this method: if n(c) is very small, or an attribute
value never occurs with class c in the training set, it gives a poor
estimate. E.g., P(Outlook = Overcast | no) = 0.
- Slide 34
- Now suppose we want to classify a new instance. Then any zero
factor, such as P(Outlook = Overcast | no) = 0, wipes out the whole
product; this incorrectly gives us zero probability due to the
small sample.
- Slide 35
- One solution: Laplace smoothing (also called add-one smoothing).
For each class cj and attribute xi with value ai, add one
virtual instance. That is, recalculate: P(xi = ai | cj) =
(n(xi = ai, cj) + 1) / (n(cj) + k), where k is the number of
possible values of attribute xi.
- Slide 36
- Training data: same table as in Slide 30. Add virtual instances
for Outlook:
  Outlook = Sunny: Yes; Outlook = Overcast: Yes; Outlook = Rain: Yes
  Outlook = Sunny: No; Outlook = Overcast: No; Outlook = Rain: No
  P(Outlook = Overcast | No) = (0 + 1) / (5 + 3) = 1/8
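The smoothed estimate above reduces to a one-line formula; this sketch plugs in the counts from the slide.

```python
# Laplace (add-one) smoothing for P(Outlook = Overcast | No),
# using the counts from the PlayTennis training data.
count = 0          # n(Outlook = Overcast, No): no "No" day is Overcast
n_class = 5        # n(No): five "No" days in the training set
k = 3              # Outlook takes 3 values: Sunny, Overcast, Rain

p_raw = count / n_class                    # 0 -- zeroes out the whole product
p_smoothed = (count + 1) / (n_class + k)   # (0 + 1) / (5 + 3) = 1/8
print(p_raw, p_smoothed)
```

Adding one virtual instance per attribute value keeps every conditional probability strictly positive while still letting the real counts dominate as the training set grows.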
- Slide 37
- Etc.
- Slide 38
- In-class exercises
- Slide 39
- Naive Bayes on continuous-valued attributes. How to deal with
continuous-valued attributes? Two possible solutions: (1) discretize;
(2) assume a particular probability distribution of classes over
values, and estimate its parameters from the training data.
- Slide 40
- Simplest discretization method: for each attribute xi, create k
equal-size bins in the interval from min(xi) to max(xi), and choose
thresholds in between the bins. P(Humidity < 40 | yes) P(40
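The binning method above can be sketched as follows; the humidity readings are made up for illustration, not taken from the training table.

```python
# Equal-width discretization of a continuous attribute (illustrative values).
def make_bins(values, k):
    """Split [min, max] into k equal-width bins; return the k-1 thresholds."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * j for j in range(1, k)]

def bin_index(v, thresholds):
    """Index of the bin that value v falls into (0 .. k-1)."""
    return sum(1 for t in thresholds if v >= t)

humidity = [65, 70, 70, 75, 78, 80, 80, 85, 90, 90, 95, 96]  # made-up readings
thresholds = make_bins(humidity, 4)
print(thresholds)                    # [72.75, 80.5, 88.25]
print(bin_index(78, thresholds))     # 78 falls in bin 1
```

Once each continuous value is mapped to a bin index, the earlier count-based estimates P(xi = ai | c) apply unchanged, with the bins playing the role of discrete attribute values.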