Machine Learning using Bayesian Approaches
Transcript of slides: cedar.buffalo.edu/~srihari/talks/IISc-Sept2008.pdf (36 pages)

Page 1

Machine Learning using Bayesian Approaches

Sargur N. Srihari, University at Buffalo, State University of New York

Page 2

Outline
1. Progress in ML and PR
2. Fully Bayesian Approach
   1. Probability theory: Bayes theorem, conjugate distributions
   2. Making predictions using marginalization
   3. Role of sampling methods
   4. Generalization of neural networks, SVMs
3. Biometric/Forensic Application
4. Concluding Remarks

Page 3

Example of Machine Learning: Handwritten Digit Recognition
• Handcrafted rules result in a large number of rules and exceptions
• Better to have a machine that learns from a large training set
• Provides a probabilistic prediction

[Figure: wide variability of the same numeral]

Page 4

Progress in PRML Theory
• Unified theory involving
  – Regression and classification
  – Linear models, neural networks, SVMs
• Basis vectors
  – polynomial, non-linear functions of weighted inputs, centered around samples, etc.
• The fully Bayesian approach
• Models from statistical physics

Page 5

The Fully Bayesian Approach
• Bayes Rule
• Bayesian Probabilities
• Concept of Conjugacy
• Monte Carlo Sampling

Page 6

Rules of Probability
• Given random variables X and Y
• Sum Rule gives the marginal probability:
  $p(X) = \sum_Y p(X, Y)$
• Product Rule gives the joint probability in terms of conditional and marginal:
  $p(X, Y) = p(Y \mid X)\, p(X)$
• Combining the two, we get Bayes Rule (checked numerically in the sketch below):
  $p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$, where $p(X) = \sum_Y p(X \mid Y)\, p(Y)$
• Viewed as: posterior ∝ likelihood × prior

[Figure: the joint distribution P(X, Y) visualized over a grid of X and Y values]
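A minimal numeric sketch (not from the slides) checking the sum, product, and Bayes rules on a small made-up joint table:

```python
import numpy as np

# Hypothetical joint distribution p(X, Y): rows index X (2 states), columns Y (3 states).
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

p_x = p_xy.sum(axis=1)             # sum rule: p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=0)             # sum rule: p(Y) = sum_X p(X, Y)
p_y_given_x = p_xy / p_x[:, None]  # product rule rearranged: p(Y|X) = p(X, Y) / p(X)

# Bayes rule: p(X|Y) = p(Y|X) p(X) / p(Y); each column is a distribution over X.
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]
assert np.allclose(p_x_given_y.sum(axis=0), 1.0)
print(p_x_given_y)
```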

Page 7

Bayesian Probabilities
• Classical or frequentist view of probabilities
  – Probability is the frequency of a random, repeatable event
• Bayesian view
  – Probability is a quantification of uncertainty
    • Whether Shakespeare's plays were written by Francis Bacon
    • Whether the moon was once in its own orbit around the sun
• Use of probability to represent uncertainty is not an ad hoc choice
  – Cox's Theorem: if numerical values represent degrees of belief, then a simple set of axioms for manipulating degrees of belief leads to the sum and product rules of probability

Page 8

Role of the Likelihood Function
• Uncertainty in w is expressed as
  $p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)}$, where $p(D) = \int p(D \mid w)\, p(w)\, dw$ by the sum rule
• The denominator is a normalization factor; it involves marginalization over w
• Bayes theorem in words: posterior ∝ likelihood × prior
• The likelihood function is used in both the Bayesian and frequentist paradigms
  – Frequentist: w is a fixed parameter determined by an estimator; error bars on the estimate come from the distribution of possible data sets D
  – Bayesian: there is a single data set D, and uncertainty is expressed as a probability distribution over w

Page 9

Bayesian versus Frequentist Approach
• Inclusion of prior knowledge arises naturally
• Coin toss example
  – Suppose we need to learn from coin tosses so that we can predict whether a coin toss will land heads
  – A fair-looking coin is tossed three times and lands heads each time
  – The classical maximum likelihood estimate of the probability of landing heads is 1, implying all future tosses will land heads
  – A Bayesian approach with a reasonable prior leads to a less extreme conclusion

[Figure: prior p(µ) and posterior p(µ|D)]

The Beta distribution has the property of conjugacy.

Page 10

Bayesian Inference with the Beta Distribution
• The posterior is obtained by multiplying the Beta prior, $\mathrm{Beta}(\mu \mid a, b) \propto \mu^{a-1}(1-\mu)^{b-1}$, by the binomial likelihood, $\binom{N}{m}\mu^{m}(1-\mu)^{N-m}$, which yields
  $p(\mu \mid m, l, a, b) \propto \mu^{m+a-1}(1-\mu)^{l+b-1}$
  – where m is the number of heads and l = N − m the number of tails
• It is another Beta distribution (see the sketch below)
  – Effectively increases the value of a by m and of b by l
  – As the number of observations increases, the distribution becomes more peaked

[Figure: one step of the process — prior with a = 2, b = 2; likelihood $\mu^{1}(1-\mu)^{0}$ for a single observation N = m = 1 (x = 1); posterior with a = 3, b = 2]
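A minimal sketch of this conjugate update using the slide's numbers (prior a = 2, b = 2 and one observation x = 1); scipy is assumed for the Beta distribution:

```python
from scipy import stats

a, b = 2, 2    # Beta prior hyperparameters, as in the slide's illustration
m, l = 1, 0    # one head observed (N = m = 1, so no tails)

# Conjugacy: the posterior is Beta(a + m, b + l) = Beta(3, 2), matching the slide.
posterior = stats.beta(a + m, b + l)
print(posterior.mean())   # E[mu | D] = (a + m) / (a + m + b + l) = 0.6
```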

Page 11

Predicting the Next Trial Outcome
• We need the predictive distribution of x given the observed data D
  – From the sum and product rules:
  $p(x=1 \mid D) = \int_0^1 p(x=1, \mu \mid D)\, d\mu = \int_0^1 p(x=1 \mid \mu)\, p(\mu \mid D)\, d\mu = \int_0^1 \mu\, p(\mu \mid D)\, d\mu = E[\mu \mid D]$
• The expected value of the posterior distribution can be shown to be (worked numerically below)
  $p(x=1 \mid D) = \frac{m+a}{m+a+l+b}$
  – which is the fraction of observations (both fictitious and real) that correspond to x = 1
• Maximum likelihood and Bayesian results agree in the limit of infinitely many observations
  – On average, uncertainty (variance) decreases with observed data
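Worked through for the earlier coin example (three heads, so m = 3, l = 0), under an assumed Beta prior with a = b = 2:

$$p(x=1 \mid D) = \frac{m+a}{m+a+l+b} = \frac{3+2}{3+2+0+2} = \frac{5}{7} \approx 0.71$$

a far less extreme conclusion than the maximum likelihood estimate of 1.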

Page 12

Bayesian Regression (Curve Fitting)
• Estimating the polynomial weight vector w
  – In the fully Bayesian approach, consistently apply the sum and product rules and integrate over all values of w
• Given training inputs $\mathbf{x}$ and targets $\mathbf{t}$ and a new test point x, the goal is to predict the value of t
  – i.e., we wish to evaluate the predictive distribution $p(t \mid x, \mathbf{x}, \mathbf{t})$ (a closed-form sketch follows)
• Applying the sum and product rules of probability:
  $p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t, w \mid x, \mathbf{x}, \mathbf{t})\, dw$   (sum rule: marginalizing over w)
  $= \int p(t \mid x, w, \mathbf{x}, \mathbf{t})\, p(w \mid x, \mathbf{x}, \mathbf{t})\, dw$   (product rule)
  $= \int p(t \mid x, w)\, p(w \mid \mathbf{x}, \mathbf{t})\, dw$   (eliminating unnecessary conditioning variables)
  where $p(t \mid x, w) = \mathcal{N}(t \mid y(x, w), \beta^{-1})$ and $p(w \mid \mathbf{x}, \mathbf{t})$ is the posterior distribution over parameters, also a Gaussian
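When the prior over w is Gaussian, this marginalization has a standard closed form. A minimal sketch under assumed settings — polynomial basis, prior $p(w) = \mathcal{N}(0, \alpha^{-1}I)$, known noise precision β, and synthetic data:

```python
import numpy as np

def phi(x, M=4):
    # Polynomial basis functions 1, x, ..., x^{M-1} for each input.
    return np.vander(np.atleast_1d(x), M, increasing=True)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)

alpha, beta = 2.0, 25.0   # assumed prior precision and noise precision
Phi = phi(x_train)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)  # posterior covariance
m_N = beta * S_N @ Phi.T @ t_train                                      # posterior mean

# Predictive distribution at a new point is Gaussian with:
x_new = 0.5
phi_new = phi(x_new)[0]
mean = m_N @ phi_new                          # predictive mean
var = 1.0 / beta + phi_new @ S_N @ phi_new    # predictive variance (noise + parameter uncertainty)
print(f"p(t | x=0.5, data) = N({mean:.3f}, {var:.3f})")
```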

Page 13

Conjugate Distributions

[The slide's diagram of distributions and their conjugacy relationships, flattened here into a list:]
• Discrete, binary: Bernoulli (single binary variable); Binomial (N samples of a Bernoulli; N = 1 recovers the Bernoulli); conjugate prior: Beta (continuous variable on [0, 1])
• Discrete, multi-valued: Multinomial (one of K values = K-dimensional binary vector); conjugate prior: Dirichlet (K random variables on [0, 1]); K = 2 relates these to the binary case
• Continuous: Gaussian (also the large-N limit of the binomial); conjugate priors: Gamma (univariate Gaussian precision), Wishart (multivariate Gaussian precision matrix), Gaussian-Gamma (univariate Gaussian with unknown mean and precision), Gaussian-Wishart (multivariate Gaussian with unknown mean and precision matrix)
• Related continuous distributions: Student's-t (generalization of the Gaussian robust to outliers; an infinite mixture of Gaussians); Exponential (special case of the Gamma); Uniform
• Angular: Von Mises

Page 14

Bayesian Approaches in Classical Pattern Recognition
• Bayesian Neural Networks
  – Classical neural networks use maximum likelihood to determine network parameters (weights and biases)
  – The Bayesian version marginalizes over the distribution of parameters in order to make predictions
• Bayesian Support Vector Machines

Page 15

Practicality of the Bayesian Approach
• Marginalization over the whole parameter space is required to make predictions or compare models
• Factors making it practical:
  – Sampling methods such as Markov Chain Monte Carlo
  – Increased speed and memory of computers
  – Deterministic approximation schemes
    • Variational Bayes is an alternative to sampling

Page 16

Markov Chain Monte Carlo (MCMC)
• Rejection sampling and importance sampling for evaluating expectations of functions suffer from severe limitations, particularly in high dimensions
• MCMC is a very general and powerful framework
  – "Markov" refers to the sequence of samples, not to the model being Markovian
• Allows sampling from a large class of distributions
• Scales well with dimensionality
• MCMC originated in statistical physics (Metropolis, 1949)

Page 17

Gibbs Sampling
• A simple and widely applicable Markov Chain Monte Carlo algorithm (see the sketch after this list)
• It is a special case of the Metropolis-Hastings algorithm
• Consider a distribution $p(z) = p(z_1, \ldots, z_M)$ from which we wish to sample, and choose an initial state for the Markov chain
• Each step replaces the value of one variable $z_i$ by a value drawn from $p(z_i \mid z_{\setminus i})$, where $z_{\setminus i}$ denotes $z_1, \ldots, z_M$ with $z_i$ omitted
• The procedure is repeated by cycling through the variables in some particular order
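A minimal sketch of this cycle, assuming a target whose conditionals are available in closed form (a standard bivariate Gaussian with correlation ρ):

```python
import numpy as np

rho, n_samples = 0.8, 5000
rng = np.random.default_rng(0)
z1, z2 = 0.0, 0.0              # initial state of the Markov chain
samples = []
for _ in range(n_samples):
    # Replace each variable in turn by a draw from its conditional p(z_i | z_{\i}):
    # for a standard bivariate Gaussian, z1 | z2 ~ N(rho * z2, 1 - rho^2).
    z1 = rng.normal(rho * z2, np.sqrt(1 - rho**2))
    z2 = rng.normal(rho * z1, np.sqrt(1 - rho**2))
    samples.append((z1, z2))

samples = np.array(samples[500:])    # discard burn-in
print(np.corrcoef(samples.T)[0, 1])  # sample correlation approaches rho = 0.8
```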

Page 18

Machine Learning in Forensics
• Questioned-Known (Q-K) comparison
  – involves significant uncertainty in some modalities, e.g., handwriting
• Learn intra-class variations
  – so as to determine whether a questioned sample is genuine

Page 19

[Figure: Genuine and Questioned Distributions. Known signatures (a, b, c, d) are compared with each other, giving a genuine distribution of similarity values (.4, .1, .2, .3, .2, .1). The questioned signature is compared with the knowns (a, b, c, d), giving a questioned distribution (0, .4, .1, .2). The two distributions are then compared.]

Page 20

Methods for Comparing Distributions
• Kolmogorov-Smirnov Test
• Kullback-Leibler Divergence
• Bayesian Formulation

Page 21

Kolmogorov-Smirnov Test
• To determine whether two data sets were drawn from the same distribution
• Compute the statistic D, the maximum absolute difference between the two c.d.f.s
  – K-S statistic for two c.d.f.s $S_{N_1}(x)$ and $S_{N_2}(x)$:
  $D = \max_{-\infty < x < \infty} \left| S_{N_1}(x) - S_{N_2}(x) \right|$
• D is mapped to a probability according to
  $P_{KS} = Q_{KS}\!\left(\left[\sqrt{N_e} + 0.12 + 0.11/\sqrt{N_e}\,\right] D\right)$
  where $N_e$ is the effective number of data points and $Q_{KS}$ is the K-S probability function (see the sketch below)
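A minimal sketch of the two-sample test on hypothetical score sets; scipy.stats.ks_2samp returns both the statistic D and the associated probability:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
s1 = rng.normal(0.0, 1.0, 200)   # hypothetical similarity scores, set 1
s2 = rng.normal(0.3, 1.0, 200)   # hypothetical similarity scores, set 2

# D is the maximum absolute difference between the two empirical c.d.f.s.
D, p_value = stats.ks_2samp(s1, s2)
print(f"D = {D:.3f}, P = {p_value:.4f}")  # small P: unlikely to share one distribution
```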

Page 22

Kullback-Leibler (KL) Divergence
• If an unknown distribution p(x) is modeled by an approximating distribution q(x)
• The average additional information required to specify the value of x as a result of using q(x) instead of p(x) is
  $KL(p \,\|\, q) = -\int p(x) \ln q(x)\, dx - \left( -\int p(x) \ln p(x)\, dx \right) = -\int p(x) \ln \frac{q(x)}{p(x)}\, dx$

Page 23

K-L Properties
• A measure of the dissimilarity of p(x) and q(x)
• Non-symmetric: $KL(p \,\|\, q) \neq KL(q \,\|\, p)$
• $KL(p \,\|\, q) \geq 0$, with equality iff p(x) = q(x)
  – Proof uses convex functions (Jensen's inequality)
• Not a metric
• For binned data, the divergence (computed in the sketch below)
  $D_{KL} = \sum_{b=1}^{B} P_b \log\!\left( \frac{P_b}{Q_b} \right)$
  is converted to a probability using
  $P_{KL} = e^{-\varsigma D_{KL}}$
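A minimal sketch of the binned divergence and its conversion to a probability; the bin values and the constant ζ are made up for illustration:

```python
import numpy as np

P = np.array([0.4, 0.1, 0.2, 0.3])   # binned "genuine" distribution (hypothetical)
Q = np.array([0.1, 0.4, 0.2, 0.3])   # binned "questioned" distribution (hypothetical)

eps = 1e-12                          # guard against log(0) in empty bins
D_kl = np.sum(P * np.log((P + eps) / (Q + eps)))
zeta = 1.0                           # assumed tuning constant
P_kl = np.exp(-zeta * D_kl)
print(D_kl, P_kl)

# Non-symmetric: KL(P||Q) differs from KL(Q||P) in general.
D_kl_rev = np.sum(Q * np.log((Q + eps) / (P + eps)))
print(D_kl_rev)
```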

Page 24

Bayesian Comparison of Distributions
• S1 – samples from one distribution; S2 – samples from another distribution
• D1, D2 – the theoretical distributions from which S1 and S2 came
• Task: determine the statistical similarity between S1 and S2, i.e., determine p(D1 = D2) given that we only observe samples from them
• Method:
  – Determine p(D1 = D2 = a particular distribution f)
  – Marginalize over all f (non-informative prior)
  – If f is Gaussian, we obtain a closed-form solution for the marginalization over the family of Gaussians F

Page 25

Bayesian Comparison of Distributions

$P(D_1 = D_2 \mid S_1, S_2) = \int_F P(D_1 = D_2 = f \mid S_1, S_2)\, df$   (marginalizing over all f, a member of family F)

$= \int_F \frac{p(S_1, S_2 \mid D_1 = D_2 = f)\, p(D_1 = D_2 = f)}{p(S_1, S_2)}\, df$   (Bayes rule)

$= \int_F \frac{p(S_1, S_2 \mid D_1 = D_2 = f)}{p(S_1, S_2)}\, df$   (assuming all distributions equally likely, i.e., uniform p(f))

$= \int_F \frac{p(S_1, S_2 \mid D_1 = D_2 = f)}{\int_F p(S_1, S_2 \mid D_1 = D_2 = f)\, p(D_1 = D_2 = f)\, df}\, df$   (marginalizing the denominator over all f)

$= \int_F \frac{p(S_1, S_2 \mid D_1 = D_2 = f)}{\int_F p(S_1, S_2 \mid D_1 = D_2 = f)\, df}\, df$   (due to uniform p(f))

$= \int_F \frac{p(S_1, S_2 \mid D_1 = D_2 = f)}{\int_F p(S_1 \mid D_1 = f_1)\, df_1 \int_F p(S_2 \mid D_2 = f_2)\, df_2}\, df$   (since S1 and S2 are conditionally independent given f)

$= \frac{Q(S_1 \cup S_2)}{Q(S_1) \times Q(S_2)}$, where $Q(S) = \int_F p(S \mid f)\, df$ and $S_1 \cup S_2$ denotes the pooled samples

Page 26

Bayesian Comparison of Distributions (F = Gaussian Family)

$p(D_1 = D_2 \mid S_1, S_2) = \frac{Q(S_1 \cup S_2)}{Q(S_1) \times Q(S_2)}$, where $Q(S) = \int_F p(S \mid f)\, df$

For Gaussians, a closed form for Q is obtainable (evaluated in the sketch below):

$Q(S) = \int_0^\infty \int_{-\infty}^{\infty} p(S \mid \mu, \sigma)\, d\mu\, d\sigma$

$Q(S) = 2^{-3/2} \times \pi^{(1-n)/2} \times n^{-1/2} \times C^{(1 - n/2)} \times \Gamma(n/2 - 1)$

where $C = \sum_{x \in S} x^2 - \frac{\left( \sum_{x \in S} x \right)^2}{n}$ and n is the cardinality of S.
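A minimal sketch of this closed form as reconstructed above, computed in log space for numerical stability (the Gamma term requires n ≥ 3 samples per set); the score Q(S1∪S2)/(Q(S1)·Q(S2)) is larger when the pooled samples plausibly share a single Gaussian. The sample values are made up:

```python
import numpy as np
from scipy.special import gammaln

def log_Q(S):
    # log of Q(S) = 2^{-3/2} pi^{(1-n)/2} n^{-1/2} C^{(1-n/2)} Gamma(n/2 - 1)
    S = np.asarray(S, dtype=float)
    n = len(S)
    C = np.sum(S**2) - np.sum(S)**2 / n   # scatter of the sample about its mean
    return (-1.5 * np.log(2) + (1 - n) / 2 * np.log(np.pi)
            - 0.5 * np.log(n) + (1 - n / 2) * np.log(C)
            + gammaln(n / 2 - 1))

rng = np.random.default_rng(0)
s1 = rng.normal(0.5, 0.1, 10)   # hypothetical genuine similarity scores
s2 = rng.normal(0.5, 0.1, 8)    # hypothetical questioned similarity scores

log_score = log_Q(np.concatenate([s1, s2])) - (log_Q(s1) + log_Q(s2))
print(np.exp(log_score))
```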

Page 27

Signature Data Set

[Figure: genuine and forged signature samples for one person; the data set covers 55 individuals]

Page 28

Feature Computation

[Figure: feature vector computation and the correlation similarity measure]

Page 29

Bayesian Comparison

[Figure: score distributions for genuines and forgeries under the Bayesian comparison, showing clear separation]

Page 30

Information-Theoretic (KS + KL)

Page 31

Graph Matching

Page 32

Comparison of KS and KL

Verification accuracy vs. number of known signatures (threshold methods*):

Knowns | KS     | KL     | KS & KL
-------|--------|--------|--------
4      | 76.38% | 75.10% | 79.50%
7      | 80.10% | 79.50% | 81.00%
10     | 83.53% | 84.27% | 86.78%

* Threshold methods: S. N. Srihari, A. Xu, and M. K. Kalera, "Learning strategies and classification methods for off-line signature verification," Proc. 7th Int. Workshop on Frontiers in Handwriting Recognition (IWFHR), pages 161-166, 2004.
** Information-theoretic: Harish Srinivasan, Sargur N. Srihari, and Matthew J. Beal, "Machine learning for person identification and verification," in Machine Learning in Document Analysis and Recognition (S. Marinai and H. Fujisawa, eds.), Springer, 2008.

Page 33

Error Rates of Three Methods (CEDAR Data Set)

[Figure: percentage error vs. number of training samples (5, 10, 15, 20) for the K-L, K-S, KS+KL, and Bayesian methods]

Page 34

Why Does the Bayesian Formalism Do Much Better?
• Few samples are available for determining the distributions
• Significant uncertainty exists
• The Bayesian approach models uncertainty better than the information-theoretic methods, which assume that the distributions are fixed

Page 35

Summary
• ML is a systematic approach, rather than a collection of ad hoc methods, for designing information processing systems
• There is much progress in ML and PR
  – developing a unified theory of different machine learning approaches, e.g., neural networks, SVMs
• The fully Bayesian approach involves making predictions using complex distributions, made feasible by sampling methods
• The Bayesian approach leads to higher-performing systems with small sample sizes

Page 36

Lecture Slides
• PRML lecture slides at http://www.cedar.buffalo.edu/~srihari/CSE574/index.html