Informatics and Mathematical Modelling / Lars Kai Hansen
Adv. Signal Proc. 2006
Variational Bayes 101
The Bayes scene
• Exact averaging in discrete/small models (Bayes networks)
• Approximate averaging:
  - Monte Carlo methods
  - Ensemble/mean field
  - Variational Bayes methods
Resources: variational-bayes.org, MLpedia, Wikipedia
• ISP Bayes:
  - ICA: mean field, Kalman, dynamical systems
  - NeuroImaging: optimal signal detector
  - Approximate inference
  - Machine learning methods
Bayes’ methodology
The minimal error rate is obtained when the detector is based on the posterior probability (Bayes decision theory):

$$P(M|D) = \frac{P(D|M)\,P(M)}{P(D)}, \qquad D = \{x_n \mid n = 1,\dots,N\}$$

The likelihood may contain unknown parameters $\theta$, which are integrated out:

$$P(D|M) = \int P(D|\theta)\, p(\theta|M)\, d\theta = \int \left[\prod_n P(x_n|\theta)\right] p(\theta|M)\, d\theta$$
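As a concrete illustration of integrating out the parameters, the sketch below estimates the marginal likelihood $P(D|M)$ by simple Monte Carlo over the prior. The specific model ($x_n \sim N(\theta,1)$ with prior $\theta \sim N(0,1)$) and the sample count are assumptions for illustration, not from the slides.

```python
import math
import random

random.seed(0)

def log_gauss(x, mu, var):
    # log N(x | mu, var)
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def marginal_likelihood(data, n_samples=20000):
    # Monte Carlo estimate of P(D|M) = E_{p(theta|M)}[ prod_n P(x_n|theta) ]
    # Assumed model: x_n ~ N(theta, 1), prior theta ~ N(0, 1).
    total = 0.0
    for _ in range(n_samples):
        theta = random.gauss(0.0, 1.0)   # draw from the prior p(theta|M)
        loglik = sum(log_gauss(x, theta, 1.0) for x in data)
        total += math.exp(loglik)
    return total / n_samples

data = [0.1, -0.4, 0.3, 0.2]
print(marginal_likelihood(data))
```

For this conjugate case the estimate can be checked against the closed-form marginal (a zero-mean normal with covariance $I + \mathbf{1}\mathbf{1}^\top$), which is about 0.0098 for the data above.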
Bayes’ methodology
The conventional approach is to use the most probable parameters:

$$P(D|M) \approx \prod_n P(x_n|\theta^*, M), \qquad \theta^* = \arg\max_\theta \prod_n P(x_n|\theta)\, p(\theta|M)$$

However, the averaged model is generalization optimal (Hansen, 1999), i.e.:

$$P_{\text{BayesianAverage}}(x|D) = \int P(x|\theta)\, p(\theta|D)\, d\theta = \arg\max_q \left\langle \log q(x|D) \right\rangle_D$$
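A minimal sketch of the difference between the plug-in and averaged predictive distributions, for an assumed conjugate normal model ($x \sim N(\theta, 1)$, prior $\theta \sim N(0,1)$; the data values are made up): the Bayesian average widens the predictive variance by the posterior uncertainty, which the plug-in predictive ignores.

```python
# Assumed conjugate normal example: x ~ N(theta, 1), prior theta ~ N(0, 1).
data = [1.2, 0.8, 1.5, 0.9]
N = len(data)
post_var = 1.0 / (N + 1)          # posterior variance of theta
post_mean = sum(data) * post_var  # posterior mean of theta

# Plug-in ("most probable parameters"): predictive is N(post_mean, 1).
# Bayesian average: predictive is N(post_mean, 1 + post_var) -- wider,
# because it accounts for remaining parameter uncertainty.
plugin_var = 1.0
bayes_var = 1.0 + post_var
print(post_mean, plugin_var, bayes_var)
```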
The hidden agenda of learning
Typically learning proceeds by generalization from a limited set of samples... but we would really like to identify the model that generated the data.
... Choose the least complex model compatible with the data.
"That I figured out in 1386"
Generalization!
Generalizability is defined as the expected performance on a random new sample... the mean performance of a model on a "fresh" data set is an unbiased estimate of generalization.
Typical loss functions: ⟨-log p(x)⟩, ⟨# prediction errors⟩, ⟨[g(x) - ĝ(x)]²⟩, ⟨log p(x,g)/(p(x)p(g))⟩, etc.
Results can be presented as "bias-variance trade-off curves" or "learning curves".
Generalization optimal predictive distribution
"The game of guessing a pdf." Assume a random teacher $\theta$ drawn from $P(\theta)$ and a random data set $D$ drawn from $P(x|\theta)$. The prediction/generalization error is

$$\Gamma(\theta, D, A) = \int \left[-\log p(x|D,A)\right] P(x|\theta)\, dx$$
$$\Gamma(A) = \int \Gamma(\theta, D, A)\, P(\theta)\, P(D|\theta)\, d\theta\, dD$$

where $p(x|D,A)$ is the predictive distribution of model A and $P(x|\theta)$ is the test sample distribution.
Generalization optimal predictive distribution
We define the "generalization functional" (Hansen, NIPS 1999)

$$H[q(\cdot|\cdot)] = -\int \log q(x|D)\, P(x|\theta)\, dx\, P(D|\theta)\, dD\, P(\theta)\, d\theta + \int \lambda(D)\left[\int q(x|D)\, dx - 1\right] dD$$

It is minimized by the "Bayesian averaging" predictive distribution

$$q(x|D) = \int P(x|\theta)\, \frac{P(D|\theta)\,P(\theta)}{\int P(D|\theta')\,P(\theta')\, d\theta'}\, d\theta$$
Bias-variance trade-off and averaging
Now, averaging is good, but can we average "too much"? Define the family of tempered posterior distributions

$$q(x|D,T) = \int P(x|\theta)\, \frac{\left(P(D|\theta)\,P(\theta)\right)^{1/T}}{\int \left(P(D|\theta')\,P(\theta')\right)^{1/T} d\theta'}\, d\theta$$

Case: univariate normal distribution with unknown mean parameter...
High temperature: widened posterior average. Low temperature: narrow average.
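For the univariate normal case just mentioned, a small sketch of the temperature effect (assuming unit observation variance and a flat prior for illustration): tempering scales the posterior variance of the mean by $T$, so the predictive variance becomes $1 + T/N$.

```python
# Tempered posterior for a unit-variance normal with unknown mean
# (flat prior assumed): P(D|mu)^(1/T) is proportional to N(mu | xbar, T/N),
# so the tempered predictive is N(xbar, 1 + T/N).
def predictive_variance(N, T):
    return 1.0 + T / N

for T in (0.1, 1.0, 10.0):
    print(T, predictive_variance(20, T))
```

High temperature widens the predictive; as $T \to 0$ the predictive collapses onto the plug-in distribution.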
Bayes' model selection, example
Let three models A, B, C be given:
A) x is normal N(0,1)
B) x is normal N(0,σ²), σ² is uniform U(0,∞)
C) x is normal N(μ,σ²), μ and σ² are uniform U(0,∞)

The sufficient statistics are

$$m_x^2 = \frac{1}{N}\sum_{n=1}^{N} x_n^2, \qquad \bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \sigma_x^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n - \bar{x})^2$$
Model A
The likelihood of N samples is given by

$$P(D|A) = \left(\frac{1}{2\pi}\right)^{N/2} \exp\!\left(-\frac{N m_x^2}{2}\right)$$
Model B
The likelihood of N samples is given by

$$P(D|B) = \int_0^{\infty} P(D|0,\sigma^2)\, p(\sigma^2)\, d\sigma^2 = \int_0^{\infty} \left(\frac{1}{2\pi\sigma^2}\right)^{N/2} \exp\!\left(-\frac{N m_x^2}{2\sigma^2}\right) d\sigma^2$$
$$= \left(\frac{1}{2\pi}\right)^{N/2} \Gamma\!\left(\frac{N}{2}-1\right) \left(\frac{N m_x^2}{2}\right)^{-(N/2-1)}$$
Model C
The likelihood of N samples is given by

$$P(D|C) = \int_0^{\infty}\!\!\int P(D|\mu,\sigma^2)\, p(\mu,\sigma^2)\, d\mu\, d\sigma^2 = \int_0^{\infty}\!\!\int \left(\frac{1}{2\pi\sigma^2}\right)^{N/2} \exp\!\left(-\frac{N\left[(\bar{x}-\mu)^2 + \sigma_x^2\right]}{2\sigma^2}\right) d\mu\, d\sigma^2$$
$$= \left(\frac{1}{2\pi}\right)^{(N-1)/2} \frac{1}{\sqrt{N}}\, \Gamma\!\left(\frac{N-3}{2}\right) \left(\frac{N \sigma_x^2}{2}\right)^{-(N-3)/2}$$
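The three evidences can be compared numerically. The sketch below codes the likelihood expressions for models A, B and C (using `lgamma` for the $\Gamma$ terms); the small data set is made up for illustration, and since the uniform priors are improper, the B and C evidences are only defined up to the common prior constant.

```python
import math

def log_evidence(data):
    # Log evidence of models A, B, C for i.i.d. normal data,
    # using the sufficient statistics m_x^2, xbar, sigma_x^2.
    N = len(data)
    xbar = sum(data) / N
    m2 = sum(x * x for x in data) / N             # m_x^2
    s2 = sum((x - xbar) ** 2 for x in data) / N   # sigma_x^2
    logA = -(N / 2) * math.log(2 * math.pi) - N * m2 / 2
    logB = (-(N / 2) * math.log(2 * math.pi)
            + math.lgamma(N / 2 - 1) - (N / 2 - 1) * math.log(N * m2 / 2))
    logC = (-((N - 1) / 2) * math.log(2 * math.pi) - 0.5 * math.log(N)
            + math.lgamma((N - 3) / 2) - ((N - 3) / 2) * math.log(N * s2 / 2))
    return logA, logB, logC

# Zero-mean data with variance 0.03, far from the unit variance assumed by A,
# so model B (free variance, zero mean) should score highest here.
data = [0.1, -0.2, 0.15, -0.05, 0.3, -0.25, 0.05, -0.1]
print(log_evidence(data))
```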
• Bayesian model selection: C (green) is the correct model; what if only A (red) and B (blue) are known?
• Bayesian model selection: A (red) is the correct model.
Bayesian inference
• Bayesian averaging:

$$p(g|x,D) = \int p(g|x,\theta)\, p(\theta|x,D)\, d\theta, \qquad \hat{p}(g|x,D) = p(g|x,\hat{\theta}(D))$$

• Caveats: Bayes can rarely be implemented exactly. It is not optimal if the model family is incorrect: "Bayes cannot detect bias". However, it is still asymptotically optimal if the observation model is correct and the prior is "weak" (Hansen, 1999).
Hierarchical Bayes models
• Multi-level models in Bayesian averaging:

$$p(g|x,D) = \iint p(g|x,\theta)\, p(\theta|x,D,\alpha)\, p(\alpha|x,D)\, d\theta\, d\alpha, \qquad \hat{p}(g|x,D) = p(g|x,\hat{\theta}(D,\hat{\alpha}(D)))$$
References:
C.P. Robert: The Bayesian Choice - A Decision-Theoretic Motivation. Springer Texts in Statistics, Springer Verlag, New York (1994).
G. Golub, M. Heath and G. Wahba: Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21, pp. 215-223 (1979).
K. Friston: A Theory of Cortical Responses. Phil. Trans. R. Soc. B 360:815-836 (2005).
Hierarchical Bayes models

$$p(g|x,D) = \iint p(g|x,\theta)\, p(\theta|D,\alpha)\, p(\alpha|D)\, d\theta\, d\alpha$$

Posterior: $p(\theta|D,\alpha) = \dfrac{p(D|\theta)\, p(\theta|\alpha)}{p(D|\alpha)}$

Prior: $p(\theta|\alpha) = C(\alpha)\, \exp(-\alpha f(\theta))$

"Evidence": $p(D|\alpha) = \int p(D|\theta)\, p(\theta|\alpha)\, d\theta$

Target at maximal evidence: $\langle f(\theta) \rangle_{\mathrm{posterior}(\theta)} = \langle f(\theta) \rangle_{\mathrm{prior}(\theta)}$

"Learning hyperparameters by adjusting prior expectations" - empirical Bayes: MacKay (1992); Hansen et al. (Eusipco, 2006); cf. Boltzmann learning (Hinton et al., 1983).
Hyperparameter dynamics

Gaussian prior with adaptive hyperparameter:

$$p(\theta_j|\alpha_j) = C(\alpha_j)\, \exp(-\alpha_j \theta_j^2)$$

Maximizing the evidence with respect to $\alpha_j$ gives

$$\frac{1}{\alpha_j^{OPT}} = \theta_{j,ML}^2 \left(1 - \frac{1}{A\, \theta_{j,ML}^2}\right), \qquad A = \frac{N}{\sigma^2}$$

where $\theta_{j,ML}$ is the maximum likelihood estimate and $\theta_{j,ML}^2 A$ is a signal-to-noise measure.

Discontinuity: the parameter is pruned ($\alpha_j \to \infty$) at low signal-to-noise, $\theta_{j,ML}^2 A \leq 1$.
Hansen & Rasmussen, Neural Comp (1994); Tipping, "Relevance vector machine" (1999).
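A sketch of the pruning mechanism, coding the evidence-maximizing update above (the exact form of the update and of the signal-to-noise measure is a reconstruction, so treat the numbers as illustrative):

```python
def alpha_opt(theta_ml, sigma2, N):
    # Evidence-maximizing hyperparameter for a Gaussian prior exp(-alpha*theta^2),
    # with SNR = N * theta_ml^2 / sigma2 (reconstructed form, for illustration).
    snr = N * theta_ml ** 2 / sigma2
    if snr <= 1.0:
        return float('inf')   # parameter pruned: prior collapses onto theta = 0
    return 1.0 / (theta_ml ** 2 * (1.0 - 1.0 / snr))

print(alpha_opt(0.05, 1.0, 100))  # low SNR -> pruned
print(alpha_opt(1.0, 1.0, 100))   # high SNR -> finite alpha
```

Note the discontinuity: below the SNR threshold the optimal prior precision jumps to infinity, switching the parameter off entirely rather than merely shrinking it.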
Hyperparameter dynamics
Dynamically updated hyperparameters imply pruning.
Pruning decisions are based on SNR.
A mechanism for cognitive selection, attention?
Hansen & Rasmussen, Neural Comp (1994)
Approximations needed for posteriors:
• Approximations using asymptotic expansions (Laplace etc.) - JL
• Approximation of posteriors using tractable (factorized) pdf's by KL fitting
• Approximation of products using EP - AH, Wednesday
• Approximation by MCMC - OWI, Thursday
Illustration of approximation by a Gaussian pdf. (P. Højen-Sørensen: Thesis, 2001)
Variational Bayes
Notation: $\mathbf{y}$ are observables and $\mathbf{x}$ are hidden variables - we analyse the log likelihood of a mixture model:

$$\log p(\mathbf{y}|M) = \log \int p(\mathbf{y}, \mathbf{x}, \boldsymbol{\theta}|M)\, d\boldsymbol{\theta}\, d\mathbf{x}, \qquad \mathbf{y} = \{y_n\},\ \mathbf{x} = \{x_n\}$$

$$p(\mathbf{y}, \mathbf{x}, \boldsymbol{\theta}|M) = p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta}, M)\, p(\mathbf{x}|\boldsymbol{\theta}, M)\, p(\boldsymbol{\theta}|M)$$
Variational Bayes
$$\log p(\mathbf{y}|M) = \log \int p(\mathbf{y}, \mathbf{x}, \boldsymbol{\theta}|M)\, d\boldsymbol{\theta}\, d\mathbf{x}$$
$$= \log \int q(\mathbf{x})\, r(\boldsymbol{\theta})\, \frac{p(\mathbf{y}, \mathbf{x}, \boldsymbol{\theta}|M)}{q(\mathbf{x})\, r(\boldsymbol{\theta})}\, d\boldsymbol{\theta}\, d\mathbf{x}$$
$$\geq \int q(\mathbf{x})\, r(\boldsymbol{\theta}) \log \frac{p(\mathbf{y}, \mathbf{x}, \boldsymbol{\theta}|M)}{q(\mathbf{x})\, r(\boldsymbol{\theta})}\, d\boldsymbol{\theta}\, d\mathbf{x} \quad \text{(Jensen's inequality)}$$
$$= \int q(\mathbf{x})\, r(\boldsymbol{\theta}) \log \frac{p(\mathbf{x}, \boldsymbol{\theta}|\mathbf{y}, M)}{q(\mathbf{x})\, r(\boldsymbol{\theta})}\, d\boldsymbol{\theta}\, d\mathbf{x} + \log p(\mathbf{y}|M)$$
Variational Bayes:

$$q(\mathbf{x}) \propto \exp \left\langle \log p(\mathbf{x}, \boldsymbol{\theta}, \mathbf{y}|M) \right\rangle_{r(\boldsymbol{\theta})}$$
$$r(\boldsymbol{\theta}) \propto \exp \left\langle \log p(\mathbf{x}, \boldsymbol{\theta}, \mathbf{y}|M) \right\rangle_{q(\mathbf{x})}$$
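The alternating updates above can be sketched on a standard conjugate example: a normal with unknown mean and precision, $x_n \sim N(\mu, 1/\tau)$, with priors $p(\mu|\tau) = N(\mu_0, (\lambda_0\tau)^{-1})$ and $p(\tau) = \mathrm{Gamma}(a_0, b_0)$, and a factorized posterior $q(\mu)\,q(\tau)$ playing the roles of $q(\mathbf{x})\,r(\boldsymbol{\theta})$. The data and hyperparameter values are assumptions for illustration.

```python
data = [1.1, 0.9, 1.3, 0.8, 1.0, 1.2]
N = len(data)
xbar = sum(data) / N
sumsq = sum(x * x for x in data)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0   # assumed hyperparameters

E_tau = a0 / b0                           # initial guess for <tau>
for _ in range(50):
    # Update q(mu) = N(m, 1/lam): expectation of the log joint under q(tau)
    lam = (lam0 + N) * E_tau
    m = (lam0 * mu0 + N * xbar) / (lam0 + N)
    # Update q(tau) = Gamma(a, b): expectation of the log joint under q(mu)
    E_mu, E_mu2 = m, m * m + 1.0 / lam
    a = a0 + (N + 1) / 2
    b = b0 + 0.5 * (sumsq - 2 * N * xbar * E_mu + N * E_mu2
                    + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0 * mu0))
    E_tau = a / b

print(m, E_tau)
```

Each update is the exponentiated expected log joint under the current estimate of the other factor, so the iteration monotonically increases the variational lower bound.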
Conjugate exponential families
$$p(\mathbf{y}, \mathbf{x}|\boldsymbol{\theta}, M) = f(\mathbf{y}, \mathbf{x})\, g(\boldsymbol{\theta}) \exp\!\left[\boldsymbol{\phi}(\boldsymbol{\theta})^\top \mathbf{u}(\mathbf{y}, \mathbf{x})\right]$$
$$p(\boldsymbol{\theta}|\eta, \boldsymbol{\nu}) = h(\eta, \boldsymbol{\nu})\, g(\boldsymbol{\theta})^{\eta} \exp\!\left(\boldsymbol{\phi}(\boldsymbol{\theta})^\top \boldsymbol{\nu}\right)$$
$$p(\boldsymbol{\theta}|\mathbf{y}, \mathbf{x}, \eta, \boldsymbol{\nu}) = h(\eta', \boldsymbol{\nu}')\, g(\boldsymbol{\theta})^{\eta'} \exp\!\left[\boldsymbol{\phi}(\boldsymbol{\theta})^\top \boldsymbol{\nu}'\right]$$
$$\eta' = \eta + 1, \qquad \boldsymbol{\nu}' = \boldsymbol{\nu} + \mathbf{u}(\mathbf{y}, \mathbf{x})$$
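A minimal instance of the conjugate-exponential update $\eta' = \eta + 1$, $\boldsymbol{\nu}' = \boldsymbol{\nu} + \mathbf{u}$: Bernoulli observations with a Beta prior written in natural-parameter form (the prior values and observations are assumptions for illustration).

```python
# Beta(alpha, beta) prior in natural form: nu = (alpha - 1, beta - 1).
def update(nu, x):
    # Sufficient statistic of one Bernoulli observation: u(x) = (x, 1 - x);
    # conjugacy means the posterior just accumulates it into nu.
    return (nu[0] + x, nu[1] + 1 - x)

nu = (1.0, 1.0)           # Beta(2, 2) prior
for x in [1, 1, 0, 1]:
    nu = update(nu, x)
print(nu)                 # -> (4.0, 2.0), i.e. a Beta(5, 3) posterior
```

This closure under observation is exactly why VB updates stay tractable for conjugate-exponential models.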
Mini exercise: What are the natural parameters for a Gaussian? What are the natural parameters for a MoG?
• Observation model and “Bayes factor”
• “Normal inverse gamma” prior - the conjugate prior for the GLM observation model
• The Bayes factor is the ratio between the normalization constants of the NIGs.
Exercises
• Matthew Beal's Mixture of Factor Analyzers code - code available at variational-bayes.org
• Code a VB version of the BGML for signal detection - code available for the exact posterior