P(A|B) = P(B|A) * P(A) / P(B) · prior · 2019-12-10 · An essay towards solving a problem in the doctrine of chances…


Transcript of: P(A|B) = P(B|A) * P(A) / P(B) · prior · 2019-12-10 · An essay towards solving a problem in the doctrine of chances…

Page 1:

P(A|B) = P(B|A) * P(A) / P(B)

Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370–418.

…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter… necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…

Bayes' rule

we call P(A) the "prior" and P(A|B) the "posterior"

Page 2:

Other Forms of Bayes' Rule

P(A|B) = P(B|A) * P(A) / [P(B|A) * P(A) + P(B|~A) * P(~A)]

P(A|B ∧ X) = P(B|A ∧ X) * P(A|X) / P(B|X)
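As a quick numeric illustration (a sketch in Python; the test-for-a-condition numbers are made up, not from the slides), the expanded form lets us compute a posterior without knowing P(B) directly:

```python
# A minimal sketch of Bayes' rule with the denominator expanded by
# total probability. All numbers are illustrative assumptions.
p_a = 0.01              # P(A): prior probability of the hypothesis
p_b_given_a = 0.95      # P(B|A): probability of the evidence if A holds
p_b_given_not_a = 0.05  # P(B|~A): probability of the evidence if A does not hold

# P(B) = P(B|A)P(A) + P(B|~A)P(~A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")  # ~0.161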

Page 3:

Parameter Estimation

• How to estimate parameters from data?


Maximum Likelihood Principle:

Choose the parameters that maximize the probability of the observed data!

Page 4:

Maximum Likelihood Estimation Recipe

1. Use the log-likelihood
2. Differentiate with respect to the parameters
3. Equate to zero and solve*


*Often requires numerical approximation (no closed-form solution)
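A sketch of this recipe applied to the bent-coin likelihood θ^#H (1 − θ)^#T, using sympy for the symbolic steps (the names h, t, theta are ours, not from the slides):

```python
# Steps 1-3 of the MLE recipe for a Bernoulli coin, done symbolically.
import sympy as sp

theta = sp.symbols("theta", positive=True)
h, t = sp.symbols("h t", positive=True)  # counts of heads and tails

log_likelihood = h * sp.log(theta) + t * sp.log(1 - theta)  # step 1
derivative = sp.diff(log_likelihood, theta)                  # step 2
solution = sp.solve(sp.Eq(derivative, 0), theta)             # step 3
print(solution)  # [h/(h + t)] -- the count-based estimate
```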

Pages 5–6:

An Example

• Let's start with the simplest possible case
– Single observed variable
– Flipping a bent coin

• We Observe:
– Sequence of heads or tails
– HTTTTTHTHT

• Goal:
– Estimate the probability that the next flip comes up heads

Page 7:

Assumptions

• Fixed parameter θ
– Probability that a flip comes up heads

• Each flip is independent
– Doesn't affect the outcome of other flips

• (IID) Independent and Identically Distributed

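A minimal sketch of what IID means operationally: every flip is drawn independently from one fixed parameter (θ = 0.3 is just an illustrative choice):

```python
# IID flips: one fixed parameter, each draw independent of the others.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                      # fixed P(heads), the same for every flip
flips = rng.random(10) < theta   # each comparison is an independent flip
print("".join("H" if f else "T" for f in flips))
```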

Pages 8–9:

Example

• Let's assume we observe the sequence:
– HTTTTTHTHT

• What is the best value of θ?
– θ = probability of heads

• Intuition: θ should be 0.3 (3 out of 10)
• Question: how do we justify this?

Pages 10–12:

Maximum Likelihood Principle

• The value of θ which maximizes the probability of the observed data is best!

• Based on our assumptions, the probability of "HTTTTTHTHT" is:

P("HTTTTTHTHT" | θ) = θ^3 (1 − θ)^7

This is the Likelihood Function
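A sketch that evaluates this likelihood function (3 heads, 7 tails) on a grid of θ values; the maximum lands at 0.3, matching the intuition:

```python
# Grid search over theta for the likelihood of "HTTTTTHTHT".
import numpy as np

thetas = np.linspace(0.001, 0.999, 999)     # grid with step 0.001
likelihood = thetas**3 * (1 - thetas)**7    # P("HTTTTTHTHT" | theta)
best = thetas[np.argmax(likelihood)]
print(f"argmax theta = {best:.3f}")         # 0.300
```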

Pages 13–16:

Maximum Likelihood Principle

• Probability of "HTTTTTHTHT" as a function of θ

[Plot of the likelihood θ^3 (1 − θ)^7 over θ ∈ [0, 1]; the curve peaks at θ = 0.3]

Page 17:

Maximum Likelihood value of θ

Log Identities:
log(a · b) = log a + log b
log(a^b) = b · log a

Pages 18–19:

Maximum Likelihood value of θ
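The equations on these two slides did not survive extraction; a standard sketch of the derivation they correspond to, using the log identities above:

L(θ) = θ^#H (1 − θ)^#T
log L(θ) = #H log θ + #T log(1 − θ)
d/dθ log L(θ) = #H/θ − #T/(1 − θ) = 0
⇒ θ_ML = #H / (#H + #T)

For HTTTTTHTHT this gives 3/10 = 0.3, matching the intuition.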

Pages 20–24:

The problem with Maximum Likelihood

• What if the coin doesn't look very bent?
– Should θ be somewhere around 0.5?

• What if we saw 3,000 heads and 7,000 tails?
– Should this really be the same as 3 out of 10?

• Maximum Likelihood:
– No way to quantify our uncertainty.
– No way to incorporate our prior knowledge!

Q: how to deal with this problem?

Page 25:

Bayesian Parameter Estimation

• Let's just treat θ like any other variable

• Put a prior on it!
– Encode our prior knowledge about possible values of θ using a probability distribution

• Now consider two probability distributions:

Pages 26–28:

Posterior Over θ

P(θ | D) = P(D | θ) P(θ) / P(D)

"My rule is so cool!"
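A sketch of the posterior computed numerically, as likelihood times prior on a grid (the Beta(5,5) prior is our illustrative choice; by conjugacy the exact posterior here is Beta(8, 12), with mode 7/18 ≈ 0.389):

```python
# Posterior over theta: proportional to likelihood * prior.
import numpy as np
from scipy.stats import beta

thetas = np.linspace(0.001, 0.999, 999)
prior = beta.pdf(thetas, 5, 5)              # P(theta): mass near 0.5
likelihood = thetas**3 * (1 - thetas)**7    # P(D | theta) for HTTTTTHTHT
posterior = likelihood * prior              # unnormalized P(theta | D)
print(f"posterior mode: {thetas[np.argmax(posterior)]:.3f}")  # 0.389
```

Note how the mode (0.389) sits between the ML estimate (0.3) and the prior's center (0.5): ten flips pull the estimate toward the data, but the prior still has a say.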

Pages 29–33:

How can we encode prior knowledge?

• Example: The coin doesn't look very bent
– Assign higher probability to values of θ near 0.5

• Solution: The Beta Distribution

Beta(θ; α, β) = Γ(α + β) / (Γ(α) Γ(β)) · θ^(α−1) (1 − θ)^(β−1)

α and β are the Hyper-Parameters; Γ (Gamma) is a continuous generalization of the Factorial Function
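A sketch of this density written out directly, with math.gamma playing the role of the continuous factorial (the function name beta_pdf is ours, for illustration):

```python
# The Beta density with its Gamma-function normalizing constant.
from math import gamma

def beta_pdf(theta: float, a: float, b: float) -> float:
    """Beta(theta; alpha, beta) density, written out with Gamma functions."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * theta**(a - 1) * (1 - theta)**(b - 1)

print(beta_pdf(0.5, 5, 5))  # ~2.46: high density near 0.5
print(beta_pdf(0.9, 5, 5))  # ~0.04: low density far from 0.5
```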

Pages 34–35:

Beta Distribution

[Plots of the Beta density for Beta(0.5, 0.5), Beta(1, 1), Beta(5, 5), and Beta(100, 100)]

Pages 36–37:

MAP Estimate

θ_MAP = argmax_θ P(θ | D) = (#H + α − 1) / (#H + #T + α + β − 2)

– Add-N smoothing
– Pseudo-counts
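A sketch of this MAP formula as code, with the Beta hyper-parameters acting as pseudo-counts (Beta(100,100) is our illustrative stand-in for a coin that "doesn't look very bent"):

```python
# MAP estimate for the bent coin with a Beta(alpha, beta) prior.
# map_estimate is our name; the priors below are illustrative choices.
def map_estimate(heads: int, tails: int, alpha: float, beta: float) -> float:
    return (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

print(map_estimate(3, 7, 100, 100))        # ~0.490: 10 flips barely move a strong prior
print(map_estimate(3000, 7000, 100, 100))  # ~0.304: enough data overwhelms the prior
print(map_estimate(3, 7, 1, 1))            # 0.3: a flat Beta(1,1) prior recovers ML
```

This addresses both complaints about Maximum Likelihood: 3 out of 10 and 3,000 out of 10,000 now give different answers, and prior knowledge enters through α and β.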