Statistical Computing, University of Notre Dame, Notre Dame, IN, USA (Fall 2017, N. Zabaras)
Bayesian Regression: Basis Functions, MLE & Regularized Least Squares, Multiple Outputs, Inference with σ² Unknown, Zellner's g-Prior, Uninformative Priors
Prof. Nicholas Zabaras
University of Notre Dame
Notre Dame, IN, USA
Email: nzabaras@gmail.com
URL: https://www.zabaras.com/
September 18, 2017
Contents

Linear basis function models, Maximum likelihood and least squares, Geometry of least squares, Convexity of the NLL, Sequential learning, Robust linear regression, Regularized least squares, Multiple outputs

Bayesian linear regression, Parameter posterior distribution, A note on data centering, Numerical example, Predictive distribution, Bayesian inference in linear regression when σ² is unknown, Zellner's g-prior, Uninformative (semi-conjugate) prior, Evidence approximation

Following closely:
Chris Bishop's PRML book, Chapter 3
Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 7
Regression using parametric discriminative models in PMTK3 (run TutRegr.m in PMTK3)
Linear Regression

We already considered in an earlier lecture an example of linear regression – polynomial curve fitting.

We are interested in a linear combination – regression – of a "fixed set" of nonlinear basis functions.

Supervised learning: N observations {x_n} with corresponding target values {t_n} are provided. The goal is to predict t for a new value x.

We construct a function y(x) such that y(x) is a prediction of t.

We follow a Bayesian perspective and model the predictive distribution p(t|x).
Linear Regression

From the conditional distribution p(t|x), we can make point estimates of t for a given x by minimizing a loss function.

For a quadratic loss function, the point estimate is the conditional mean, y(x, w) = E[t|x].
Linear Regression

The simplest linear model for regression is one that involves a linear combination of the input variables
$$ y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_D x_D = \sum_{i=0}^{D} w_i x_i = \mathbf{w}^T\mathbf{x}, $$
where $\mathbf{x} = (x_1, x_2, \ldots, x_D)^T$ and we have defined $x_0 \equiv 1$.

This is often simply known as linear regression. D is the input dimensionality.
Linear Basis Function Models

More generally:
$$ y(\mathbf{x}, \mathbf{w}) = w_0 + w_1\phi_1(\mathbf{x}) + \ldots + w_{M-1}\phi_{M-1}(\mathbf{x}) = \sum_{i=0}^{M-1} w_i \phi_i(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}), \quad \text{where } \phi_0(\mathbf{x}) \equiv 1, $$
where the $\phi_i(\mathbf{x})$ are known as basis functions and
$$ \boldsymbol{\phi}(\mathbf{x}) = \left(\phi_0(\mathbf{x}), \phi_1(\mathbf{x}), \ldots, \phi_{M-1}(\mathbf{x})\right)^T, \qquad \mathbf{w} = \left(w_0, w_1, \ldots, w_{M-1}\right)^T. $$

The parameter $w_0$ allows for any fixed offset in the data and is called the bias parameter.

For convenience, we define an additional dummy 'basis function' $\phi_0(\mathbf{x}) = 1$ so that the sum can start at $i = 0$.

Often the $\phi_i(\mathbf{x})$ represent features extracted from the data $\mathbf{x}$.
Polynomial and Gaussian Basis Functions

Polynomial basis functions (scalar input, global support):
$$ \phi_j(x) = x^j $$

Gaussian basis functions (local support):
$$ \phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{2s^2}\right) $$

MatLab code: $\mu_j = -1:0.2:1$, $s = 0.2$.

[Figure: polynomial basis functions and Gaussian basis functions plotted over $x \in [-1, 1]$.]
Logistic Sigmoidal Basis Functions

Sigmoidal basis functions:
$$ \phi_j(x) = \sigma\!\left(\frac{x-\mu_j}{s}\right), \qquad \sigma(a) = \frac{1}{1+e^{-a}} \;\; \text{(the logistic sigmoid function)}. $$

An equivalent family is obtained from the tanh function, since $\tanh(a) = \dfrac{e^{a}-e^{-a}}{e^{a}+e^{-a}} = 2\sigma(2a) - 1$.

MatLab code: $\mu_j = -1:0.2:1$, $s = 0.1$.

[Figure: sigmoidal basis functions $\phi_j(x) = \sigma((x-\mu_j)/s)$ plotted over $x \in [-1, 1]$.]
Sigmoidal and Tanh Basis Functions

The sigmoidal and tanh basis functions are related:
$$ \tanh(a) = \frac{e^{a}-e^{-a}}{e^{a}+e^{-a}} = 2\sigma(2a) - 1, \qquad \sigma(a) = \frac{1}{1+e^{-a}}. $$

A general linear combination of logistic sigmoidal functions is therefore equivalent to a linear combination of tanh functions:
$$ y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j\,\sigma\!\left(\frac{x-\mu_j}{s}\right) = w_0 + \sum_{j=1}^{M} \frac{w_j}{2}\left[\tanh\!\left(\frac{x-\mu_j}{2s}\right) + 1\right] = u_0 + \sum_{j=1}^{M} u_j \tanh\!\left(\frac{x-\mu_j}{2s}\right), $$
$$ \text{where } u_j = \frac{w_j}{2}, \qquad u_0 = w_0 + \sum_{j=1}^{M} \frac{w_j}{2}. $$
Choice of Basis Functions
We are interested in functions of local support to explore
adaptivity.
Local support functions comprise a spectrum of different
spatial frequencies.
An example is wavelets that are local both spatially and in
frequency.
They are however useful only when the input is defined
on a lattice.
Maximum Likelihood and Least Squares

Assume observations from a deterministic function with added Gaussian noise:
$$ t = y(\mathbf{x}, \mathbf{w}) + \epsilon, \qquad \text{where } p(\epsilon \,|\, \beta) = \mathcal{N}\!\left(\epsilon \,|\, 0, \beta^{-1}\right), $$
which is the same as saying
$$ p(t \,|\, \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\!\left(t \,|\, y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right). $$

Here $\beta$ is the precision. This is based on a squared loss function for which $\mathbb{E}[t|\mathbf{x}] = y(\mathbf{x}, \mathbf{w})$. An example for 2D $\mathbf{x}$ is shown below.

[Figure: surfaces fitted to 2D data for the linear model $\mathbb{E}[t|\mathbf{x}] = w_0 + w_1 x_1 + w_2 x_2$ and the quadratic-in-features model $\mathbb{E}[t|\mathbf{x}] = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2$.]

Run surfaceFitDemo from PMTK3
Maximum Likelihood and Least Squares

Given observed inputs $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and targets $\mathbf{t} = (t_1, \ldots, t_N)^T$, we obtain the likelihood function
$$ p(\mathbf{t} \,|\, X, \mathbf{w}, \beta) = \prod_{n=1}^{N}\mathcal{N}\!\left(t_n \,|\, \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\right). $$

We often use the log-likelihood:
$$ \ell(\mathbf{w}, \beta) = \log p(\mathcal{D} \,|\, \mathbf{w}, \beta) = \sum_{n=1}^{N}\log\mathcal{N}\!\left(t_n \,|\, \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\right). $$

Instead of maximizing the log-likelihood, one can equivalently minimize the negative log-likelihood (NLL):
$$ \mathrm{NLL}(\mathbf{w}, \beta) = -\sum_{n=1}^{N}\log\mathcal{N}\!\left(t_n \,|\, \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\right). $$
Maximum Likelihood and Least Squares

Taking the log of the likelihood, we obtain:
$$ \ln p(\mathbf{t} \,|\, \mathbf{w}, \beta) = \sum_{n=1}^{N}\ln\mathcal{N}\!\left(t_n \,|\, \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\right) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w}), $$
where with $E_D(\mathbf{w})$ we have denoted:
$$ E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2, \qquad RSS = \sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2, \qquad MSE = RSS/N. $$

RSS is often known as the residual sum of squares or sum of squared errors (SSE), and MSE is the mean squared error.

Computing w via MLE is the same as least squares. The NLL is a quadratic bowl with a unique minimum (the MLE estimate).

[Figure: truth and prediction for a linear fit to data, and contours of the sum-of-squares error for linear regression in the (w0, w1) plane.]

Run contoursSSEdemo and residualsDemo from PMTK3
Maximum Likelihood and Least Squares

Recall that
$$ \ln p(\mathbf{t} \,|\, \mathbf{w}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w}), \qquad E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2. $$

Setting the gradient of the log-likelihood (written here as a row vector) with respect to $\mathbf{w}$ equal to zero:
$$ \nabla_{\mathbf{w}}\ln p(\mathbf{t} \,|\, \mathbf{w}, \beta) = \beta\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)\boldsymbol{\phi}(\mathbf{x}_n)^T = 0 \;\;\Rightarrow\;\; \sum_{n=1}^{N} t_n\boldsymbol{\phi}(\mathbf{x}_n)^T = \mathbf{w}^T\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T. $$

This equation can be solved for $\mathbf{w}$.
Maximum Likelihood and Least Squares

We obtain (normal equation; ordinary least squares solution):
$$ \mathbf{w}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t} = \boldsymbol{\Phi}^{\dagger}\mathbf{t}, \qquad \boldsymbol{\Phi}^{\dagger} := \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T \;\;\text{(Moore–Penrose pseudo-inverse)}, $$
where we have defined the $N \times M$ design matrix $\boldsymbol{\Phi}$ with rows $\boldsymbol{\phi}(\mathbf{x}_i)^T$:
$$ \boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \ldots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \ldots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \ldots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}, \qquad \boldsymbol{\phi}(\mathbf{x}_i) = \left(\phi_0(\mathbf{x}_i), \phi_1(\mathbf{x}_i), \ldots, \phi_{M-1}(\mathbf{x}_i)\right)^T. $$

Note that indeed the earlier condition is the matrix form of the normal equations:
$$ \sum_{n=1}^{N} t_n\boldsymbol{\phi}(\mathbf{x}_n)^T = \mathbf{w}^T\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T \;\;\Leftrightarrow\;\; \mathbf{t}^T\boldsymbol{\Phi} = \mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}. $$
Maximum Likelihood and Least Squares

Maximizing now the log of the likelihood with respect to $\beta$ gives:
$$ \ln p(\mathbf{t} \,|\, \mathbf{w}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w}) \;\;\Rightarrow\;\; \frac{1}{\beta_{ML}} = \frac{2}{N}E_D(\mathbf{w}_{ML}) = \frac{1}{N}\sum_{n=1}^{N}\left(t_n - \mathbf{w}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2. $$

So the MLE noise variance $1/\beta_{ML}$ is equal to the residual variance of the target values around the regression function.
Computing the Bias Parameter

If we make the bias parameter $w_0$ explicit, then the error function becomes
$$ E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - w_0 - \sum_{j=1}^{M-1}w_j\phi_j(\mathbf{x}_n)\right)^2. $$

Setting the derivative with respect to $w_0$ equal to zero, and solving for $w_0$, we obtain
$$ w_0 = \bar{t} - \sum_{j=1}^{M-1}w_j\bar{\phi}_j, \qquad \bar{t} = \frac{1}{N}\sum_{n=1}^{N}t_n, \quad \bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N}\phi_j(\mathbf{x}_n). $$

The bias parameter $w_0$ compensates for the difference between the average of the target values and the weighted sum of the averages of the basis function values.
Geometry of Least Squares

We look for a geometrical interpretation of the least-squares solution in an N-dimensional space. $\mathbf{t}$ is a vector in that space with components $t_1, \ldots, t_N$ (N > M).

The least-squares regression function is obtained by finding the orthogonal projection of the data vector $\mathbf{t}$ onto the subspace spanned by the basis vectors $\varphi_j$, where $\varphi_j$ is the $j$th column of $\boldsymbol{\Phi}$:
$$ \varphi_j = \left(\phi_j(\mathbf{x}_1), \ldots, \phi_j(\mathbf{x}_N)\right)^T, \qquad \mathbf{y} = \boldsymbol{\Phi}\mathbf{w} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \ldots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \ldots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \ldots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}\mathbf{w}. $$
Geometry of Least Squares

We are looking for $\mathbf{w}$ such that the projection error
$$ \mathbf{t} - \mathbf{y} = \mathbf{t} - \boldsymbol{\Phi}\mathbf{w}, \qquad y_n = y(\mathbf{x}_n, \mathbf{w}), $$
is orthogonal to the basis vectors $\varphi_j$, i.e. such that:
$$ \boldsymbol{\Phi}^T\left(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\right) = 0. $$

These are the normal equations we derived earlier. The prediction $\mathbf{y}$ lies in $\mathcal{S}$, the M-dimensional subspace spanned by the vectors $\varphi_0, \varphi_1, \ldots, \varphi_{M-1}$ (the columns of $\boldsymbol{\Phi}$).
Convexity of the NLL

Convexity of the NLL (positive definite Hessian) leads to a unique globally optimal MLE.

Some models of interest don't have concave likelihoods, and locally optimal MLE estimates are needed.

Examples of convex functions: $x^2$, $e^{x}$, $-\log x$ $(x>0)$; an example of a concave function: $\log x$ $(x>0)$. A function that is neither convex nor concave can have several local minima (points A and B in the demo).

Run convexFnHand from PMTK3
Sequential Learning: LMS Algorithm

If the data set is large, we use sequential (on-line) algorithms. We apply the technique of stochastic (sequential) gradient descent.

If the error function comprises a sum over data points, $E = \sum_n E_n$, as in
$$ E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2, $$
then after presentation of pattern $n$, the stochastic gradient descent algorithm updates the parameter vector $\mathbf{w}$ using
$$ \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\nabla E_n = \mathbf{w}^{(\tau)} + \eta\left(t_n - \mathbf{w}^{(\tau)T}\boldsymbol{\phi}(\mathbf{x}_n)\right)\boldsymbol{\phi}(\mathbf{x}_n). $$

Here $\tau$ is the iteration number and $\eta$ the learning rate parameter. This is known as least-mean-squares or the LMS algorithm. A minimal sketch of the update loop is given below.
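A minimal MATLAB sketch of the LMS update, assuming the design matrix Phi and targets t from the earlier least-squares example; the learning rate eta is illustrative:

eta = 0.05;                                % learning rate (illustrative)
w = zeros(size(Phi, 2), 1);                % initial parameter vector
for n = 1:size(Phi, 1)                     % one sequential pass over the data
    phi_n = Phi(n, :)';                    % basis vector for pattern n
    err   = t(n) - w' * phi_n;             % prediction error for pattern n
    w     = w + eta * err * phi_n;         % stochastic-gradient (LMS) update
end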
Robust Linear Regression

Using a Gaussian distribution for the noise,
$$ t = y(\mathbf{x}, \mathbf{w}) + \epsilon, \qquad \epsilon \sim \mathcal{N}\!\left(\epsilon \,|\, 0, \beta^{-1}\right), $$
can result in a poor fit, especially if we have outliers in the data.

Squared error penalizes deviations quadratically, so points far from the line have more effect on the fit than points near the line.

To achieve robustness to outliers, one can replace the Gaussian with a distribution that has heavy tails (e.g. the Laplace distribution). Such a distribution assigns higher likelihood to outliers, without having to perturb the regression line to "explain" them:
$$ p(t \,|\, \mathbf{x}, \mathbf{w}, b) = \mathrm{Lap}\!\left(t \,|\, y(\mathbf{x}, \mathbf{w}), b\right) \propto \exp\!\left(-\frac{1}{b}\left|t - y(\mathbf{x}, \mathbf{w})\right|\right), \qquad \mathrm{NLL}(\mathbf{w}) = \sum_i |r_i(\mathbf{w})|, \quad r_i(\mathbf{w}) \equiv t_i - y(\mathbf{x}_i, \mathbf{w}). $$

[Figure: fits to data with outliers using least squares, the Laplace likelihood, a Student-t likelihood (dof = 0.630), and the Huber loss with δ = 1.0 and δ = 5.0.]

Run linregRobustDemoCombined from PMTK3
Robust Linear Regression

Using the Laplace distribution leads to an L1 error norm (a non-differentiable objective function) that is difficult to optimize.

A solution is to transform the problem (by increasing its dimension to 2N + M) to a linear programming problem:
$$ \min_{\mathbf{w}, \mathbf{r}^+, \mathbf{r}^-}\sum_i\left(r_i^+ + r_i^-\right) \quad \text{s.t.} \quad r_i^+ \ge 0, \;\; r_i^- \ge 0, \;\; r_i^+ - r_i^- = t_i - \mathbf{w}^T\mathbf{x}_i. $$

Here $r_i \triangleq r_i^+ - r_i^-$. Note that with our definition above:
$$ r_i^+ = \frac{1}{2}\left(|r_i| + r_i\right) = \begin{cases} r_i & \text{if } r_i \ge 0 \\ 0 & \text{otherwise} \end{cases}, \qquad r_i^- = \frac{1}{2}\left(|r_i| - r_i\right) = \begin{cases} -r_i & \text{if } r_i \le 0 \\ 0 & \text{otherwise} \end{cases}, $$
so that at the optimum $r_i^+ + r_i^- = |r_i|$.

Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press.
Huber Loss Function

The Huber loss is defined as
$$ L_H(r, \delta) = \begin{cases} r^2/2 & \text{if } |r| \le \delta \\ \delta|r| - \delta^2/2 & \text{if } |r| > \delta \end{cases} $$

This is equivalent to L2 for errors that are smaller than δ, and is equivalent to L1 for larger errors.

This loss function is everywhere differentiable, using the fact that d/dr |r| = sign(r) if r ≠ 0. The function is also C1 continuous, since the gradients of the two parts of the function match at r = ±δ.

Optimizing the Huber loss is much faster than using the Laplace likelihood, since we can use standard smooth optimization methods (e.g. quasi-Newton) rather than linear programming.

The Huber method also has a probabilistic interpretation, although it is rather unnatural (Pontil et al. 1998).

[Figure: L2, L1 and Huber losses as functions of r. Run huberLossDemo from PMTK3.]

Pontil, M., S. Mukherjee, and F. Girosi (1998). On the Noise Model of Support Vector Machine Regression. Technical report, MIT AI Lab.
Huber, P. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics 35(1), 73–101.
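A one-line MATLAB sketch of the Huber loss defined above (delta is the threshold; the function handle name is illustrative):

huber = @(r, delta) (abs(r) <= delta) .* (r.^2 / 2) + ...
        (abs(r) >  delta) .* (delta*abs(r) - delta^2/2);   % L2 near zero, L1 in the tails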
Regularized LS – Ridge Regression

Consider the error function (data term + regularization term):
$$ E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w}). $$

With the sum-of-squares error function and a quadratic regularizer, we get
$$ E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}. $$

Specifically, setting the gradient with respect to w to zero, and solving for w as before, we obtain
$$ \mathbf{w} = \left(\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}. $$

This is a simple extension of the least-squares solution we encountered earlier (regularized least squares – ridge regression).
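A minimal MATLAB ridge-regression sketch, reusing Phi and t from the earlier least-squares example; the value of lambda is illustrative:

lambda = 1e-3;                                     % regularization coefficient (illustrative)
M = size(Phi, 2);
wRidge = (lambda*eye(M) + Phi'*Phi) \ (Phi'*t);    % regularized least-squares (ridge) solution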
Regularized Least Squares

Regularized solution:
$$ \mathbf{w} = \left(\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}. $$

Regularization limits the effective model complexity (the appropriate number of basis functions).

The problem of choosing the number of basis functions is replaced with the problem of finding a suitable value of the regularization coefficient λ.

λ controls how many non-zero w's (i.e. effective basis functions) you have.
Regularized Least Squares

With a more general regularizer, we have
$$ \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q. $$

q = 1 is known as the lasso regularizer.

[Figure: contour plots of the regularizer term alone for q = 0.5, q = 1, q = 2 and q = 4, with λ = 0.7334. MatLab code.]
Regularized Least Squares

With a more general regularizer, we have
$$ \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q. $$

q = 2 corresponds to the quadratic (ridge) regularizer.

MatLab code
Regularized Least Squares

The lasso tends to generate sparser solutions than a quadratic regularizer – if λ is large, some of the $w_j \to 0$ (in the figure, $w_1 = 0$).

Here, we use the fact that the regularized least squares solution is equivalent to minimizing the unregularized sum of squares subject to the constraint
$$ \sum_{j=1}^{M}|w_j|^q \le \eta $$
for some η (see proof next).

[Figure: contours of the unregularized error function together with the constraint region for q = 2 and q = 1.]
Regularized Least Squares

Let us write the constraint in the equivalent form:
$$ \frac{1}{2}\left(\sum_{j=1}^{M}|w_j|^q - \eta\right) \le 0. $$

This leads to the following Lagrangian function:
$$ L(\mathbf{w}, \lambda) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 + \frac{\lambda}{2}\left(\sum_{j=1}^{M}|w_j|^q - \eta\right). $$

This is identical, in its dependence on w, to our regularized least squares (RLS) objective
$$ \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q. \qquad (*) $$

For a particular λ > 0, let $\mathbf{w}^\star(\lambda)$ be the solution of the RLS problem (*). From the Kuhn–Tucker optimality conditions for $L(\mathbf{w}, \lambda)$ we then see:
$$ \eta = \sum_{j=1}^{M}\left|w_j^\star(\lambda)\right|^q. $$
Kuhn–Tucker Optimality Conditions

Consider the following constrained minimization problem:
$$ \min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g(\mathbf{x}) \ge 0. $$

This is equivalent to the minimization with respect to x and λ of the following Lagrangian:
$$ \min_{\mathbf{x}, \lambda} L(\mathbf{x}, \lambda) = f(\mathbf{x}) - \lambda g(\mathbf{x}), $$
subject to the following (Kuhn–Tucker) conditions:
$$ \lambda \ge 0, \qquad g(\mathbf{x}) \ge 0, \qquad \lambda\,g(\mathbf{x}) = 0. $$

Note that for maximization problems, the Lagrangian should be modified as:
$$ L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda g(\mathbf{x}). $$
Multiple Outputs – Isotropic Covariance

If we want to predict K > 1 target variables, we use the same basis for all components of the target vector:
$$ p(\mathbf{t} \,|\, \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}\!\left(\mathbf{t} \,|\, \mathbf{y}(\mathbf{x}, \mathbf{W}), \beta^{-1}\mathbf{I}\right) = \mathcal{N}\!\left(\mathbf{t} \,|\, \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}), \beta^{-1}\mathbf{I}\right). $$

W is an M × K matrix of parameters and t is K-dimensional.

Given observed inputs $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and targets $\mathbf{T} = (\mathbf{t}_1, \ldots, \mathbf{t}_N)^T$, we obtain the log-likelihood function
$$ \ln p(\mathbf{T} \,|\, X, \mathbf{W}, \beta) = \sum_{n=1}^{N}\ln\mathcal{N}\!\left(\mathbf{t}_n \,|\, \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\mathbf{I}\right) = \frac{NK}{2}\ln\frac{\beta}{2\pi} - \frac{\beta}{2}\sum_{n=1}^{N}\left\|\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\|^2. $$
K Independent Regression Problems

As before, we can maximize this function with respect to W, giving
$$ \mathbf{W}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{T}, $$
where $\boldsymbol{\Phi}$ is $N \times M$ and $\mathbf{T}$ is $N \times K$.

If we examine this result for each target variable $t_k$ (take the k-th column of W and T), we have
$$ \mathbf{w}_k = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}_k = \boldsymbol{\Phi}^{\dagger}\mathbf{t}_k, $$
which is identical with the single-output case (so there is decoupling between the target variables).

As expected, we obtain K independent regression problems.
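A minimal MATLAB sketch of the K independent regressions, assuming an N x K target matrix T and the design matrix Phi from before:

% Each column of WML solves an independent single-output least-squares problem.
WML = (Phi'*Phi) \ (Phi'*T);     % M x K matrix of ML parameters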
Multiple Outputs – Full Covariance

Let us repeat the earlier formulation but with a covariance matrix Σ. If we want to predict K > 1 target variables, we use the same basis for all components of the target vector:
$$ p(\mathbf{t} \,|\, \mathbf{x}, \mathbf{W}, \boldsymbol{\Sigma}) = \mathcal{N}\!\left(\mathbf{t} \,|\, \mathbf{y}(\mathbf{x}, \mathbf{W}), \boldsymbol{\Sigma}\right) = \mathcal{N}\!\left(\mathbf{t} \,|\, \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\Sigma}\right), $$
where W is an M × K matrix of parameters.

Given observed inputs $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and targets $\mathbf{T} = (\mathbf{t}_1, \ldots, \mathbf{t}_N)^T$, we obtain the log-likelihood function:
$$ \ln p(\mathbf{T} \,|\, X, \mathbf{W}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N}\ln\mathcal{N}\!\left(\mathbf{t}_n \,|\, \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n), \boldsymbol{\Sigma}\right) = -\frac{N}{2}\ln\left|\boldsymbol{\Sigma}\right| - \frac{1}{2}\sum_{n=1}^{N}\left(\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^T\boldsymbol{\Sigma}^{-1}\left(\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right) + \text{const}. $$
Multiple Outputs – Full Covariance

As before, we maximize this function with respect to W:
$$ \nabla_{\mathbf{W}}\ln p(\mathbf{T} \,|\, X, \mathbf{W}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\left(\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^T\boldsymbol{\Sigma}^{-1} = 0 \;\;\Rightarrow\;\; \mathbf{W}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{T}. $$

For the ML estimate of Σ, use the result for the MLE of the covariance of a multivariate Gaussian:
$$ \boldsymbol{\Sigma}_{ML} = \frac{1}{N}\sum_{n=1}^{N}\left(\mathbf{t}_n - \mathbf{W}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)\left(\mathbf{t}_n - \mathbf{W}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^T. $$

Note that each column of $\mathbf{W}_{ML}$ is of the form
$$ \mathbf{w}_k = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}_k, $$
seen for the isotropic noise distribution, and is independent of Σ!
Bayesian Linear Regression

Effective model complexity in MLE is governed by the number of basis functions and is controlled by the size of the data set.

With regularization, the effective model complexity is controlled mainly by λ, and still by the number and form of the basis functions.

Thus the model complexity for a particular problem cannot be decided simply by maximizing the likelihood function, as this leads to excessively complex models and over-fitting.

Independent hold-out data can be used to determine model complexity, but this wastes data and is computationally expensive.
Bayesian Linear Regression

A Bayesian treatment of linear regression avoids the over-fitting of maximum likelihood.

Bayesian approaches lead to automatic methods of determining model complexity using the training data alone.
Bayesian Linear Regression

Assume additive Gaussian noise with known precision β. The likelihood function p(t|w) is the exponential of a quadratic function of w,
$$ p(\mathbf{t} \,|\, X, \mathbf{w}, \beta) = \prod_{n=1}^{N}\mathcal{N}\!\left(t_n \,|\, \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\right) = \left(\frac{\beta}{2\pi}\right)^{N/2}\exp\!\left(-\frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2\right), $$
and its conjugate prior is Gaussian:
$$ p(\mathbf{w}) = \mathcal{N}\!\left(\mathbf{w} \,|\, \mathbf{m}_0, \mathbf{S}_0\right). $$

Combining this with the likelihood and using results for marginal and conditional Gaussian distributions gives the posterior
$$ p(\mathbf{w} \,|\, \mathbf{t}) = \mathcal{N}\!\left(\mathbf{w} \,|\, \mathbf{m}_N, \mathbf{S}_N\right), $$
where
$$ \mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^T\mathbf{t}\right), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}. $$
Posterior Distribution: Derivation

We now have the product of two Gaussians, and the posterior is easily computed by completing the square in w:
$$ p(\mathbf{w} \,|\, \mathbf{m}_0, \mathbf{S}_0) \propto \exp\!\left(-\frac{1}{2}\left(\mathbf{w} - \mathbf{m}_0\right)^T\mathbf{S}_0^{-1}\left(\mathbf{w} - \mathbf{m}_0\right)\right), $$
$$ p(\mathbf{t} \,|\, \mathbf{x}, \mathbf{w}, \beta) \propto \exp\!\left(-\frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2\right) = \exp\!\left(-\frac{\beta}{2}\left(\mathbf{w}^T\sum_n\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T\mathbf{w} - 2\mathbf{w}^T\sum_n\boldsymbol{\phi}(\mathbf{x}_n)t_n\right) + \text{const}\right). $$

Multiplying the two and collecting the terms quadratic and linear in w,
$$ p(\mathbf{w} \,|\, \mathbf{x}, \mathbf{t}, \beta) \propto \exp\!\left(-\frac{1}{2}\mathbf{w}^T\left(\mathbf{S}_0^{-1} + \beta\sum_n\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T\right)\mathbf{w} + \mathbf{w}^T\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\sum_n\boldsymbol{\phi}(\mathbf{x}_n)t_n\right)\right), $$
so that
$$ p(\mathbf{w} \,|\, \mathbf{x}, \mathbf{t}, \beta) = \mathcal{N}\!\left(\mathbf{w} \,|\, \mathbf{m}_N, \mathbf{S}_N\right), \qquad \mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^T\mathbf{t}\right), \quad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\sum_n\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T. $$
Sequential Posterior Calculation

Because the posterior distribution
$$ p(\mathbf{w} \,|\, \mathbf{t}) = \mathcal{N}\!\left(\mathbf{w} \,|\, \mathbf{m}_N, \mathbf{S}_N\right), \qquad \mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^T\mathbf{t}\right), \quad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}, $$
is Gaussian, its posterior mode coincides with its mean: $\mathbf{w}_{MAP} = \mathbf{m}_N$.

The above expressions for the posterior mean and covariance can also be written for a sequential calculation (we have already observed N data points and are now considering an additional data point $(\mathbf{x}_{N+1}, t_{N+1})$). In this case, we have:
$$ p(\mathbf{w} \,|\, \mathbf{x}_{N+1}, t_{N+1}, \mathbf{m}_N, \mathbf{S}_N) = \mathcal{N}\!\left(\mathbf{w} \,|\, \mathbf{m}_{N+1}, \mathbf{S}_{N+1}\right), $$
$$ \mathbf{m}_{N+1} = \mathbf{S}_{N+1}\left(\mathbf{S}_N^{-1}\mathbf{m}_N + \beta\boldsymbol{\phi}(\mathbf{x}_{N+1})t_{N+1}\right), \qquad \mathbf{S}_{N+1}^{-1} = \mathbf{S}_N^{-1} + \beta\boldsymbol{\phi}(\mathbf{x}_{N+1})\boldsymbol{\phi}(\mathbf{x}_{N+1})^T. $$
Bayesian Linear Regression

Let us consider for a prior a zero-mean isotropic Gaussian governed by a single precision parameter α, so that
$$ p(\mathbf{w} \,|\, \alpha) = \mathcal{N}\!\left(\mathbf{w} \,|\, \mathbf{0}, \alpha^{-1}\mathbf{I}\right), $$
and the corresponding posterior distribution over w is then given by
$$ \mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}. $$

The log of the posterior is the sum of the log-likelihood and the log of the prior and, as a function of w, takes the form
$$ \ln p(\mathbf{w} \,|\, \mathbf{t}) = -\frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 - \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} + \text{const}. $$

Thus the MAP estimate is the same as regularized least squares (ridge regression) with $\lambda = \alpha/\beta$.
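A minimal MATLAB sketch of this posterior computation with the isotropic prior, reusing Phi and t from the earlier example; the values of alpha and beta are illustrative:

alpha = 2.0; beta = 25;                          % prior and noise precisions (illustrative)
M = size(Phi, 2);
SNinv = alpha*eye(M) + beta*(Phi'*Phi);          % posterior precision S_N^{-1}
SN = inv(SNinv);                                 % posterior covariance S_N
mN = beta * (SNinv \ (Phi'*t));                  % posterior mean m_N (= MAP = ridge solution)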
A Note on Data Centering

In linear regression, it helps to center the data in a way that does not require us to compute the offset term μ. Write the likelihood as:
$$ p(\mathbf{t} \,|\, \mathbf{x}, \mathbf{w}, \mu, \beta) \propto \exp\!\left(-\frac{\beta}{2}\left(\mathbf{t} - \mu\mathbf{1} - \boldsymbol{\Phi}\mathbf{w}\right)^T\left(\mathbf{t} - \mu\mathbf{1} - \boldsymbol{\Phi}\mathbf{w}\right)\right), $$
where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix whose $i$-th row is $\left(\phi_1(\mathbf{x}_i), \ldots, \phi_M(\mathbf{x}_i)\right)$ (no constant column).

Let us assume that the input data are centered in each dimension such that:
$$ \sum_{i=1}^{N}\phi_j(\mathbf{x}_i) = 0, \qquad j = 1, \ldots, M. $$

The mean of the output is equally likely to be positive or negative. Let us put an improper prior $p(\mu) \propto 1$ on μ and integrate μ out.
A Note on Data Centering

Introducing $\bar{t} = \frac{1}{N}\sum_{i=1}^{N}t_i$, the marginal likelihood becomes:
$$ p(\mathbf{t} \,|\, \mathbf{x}, \mathbf{w}, \beta) \propto \int\exp\!\left(-\frac{\beta}{2}\left(\mathbf{t} - \mu\mathbf{1} - \boldsymbol{\Phi}\mathbf{w}\right)^T\left(\mathbf{t} - \mu\mathbf{1} - \boldsymbol{\Phi}\mathbf{w}\right)\right)d\mu. $$

Completing the square in μ gives (using the centering of the input):
$$ p(\mathbf{t} \,|\, \mathbf{x}, \mathbf{w}, \beta) \propto \int\exp\!\left(-\frac{\beta}{2}\left(N\mu^2 - 2\mu N\bar{t}\right)\right)\exp\!\left(-\frac{\beta}{2}\left(\ldots\right)\right)d\mu \propto \exp\!\left(-\frac{\beta}{2}\left(\mathbf{t} - \bar{t}\mathbf{1} - \boldsymbol{\Phi}\mathbf{w}\right)^T\left(\mathbf{t} - \bar{t}\mathbf{1} - \boldsymbol{\Phi}\mathbf{w}\right)\right). $$

Our model is now simplified if, instead of t, we use the centered output $\mathbf{t}_c = \mathbf{t} - \bar{t}\mathbf{1}$, and the likelihood is simply written as:
$$ p(\mathbf{t}_c \,|\, \mathbf{x}, \mathbf{w}, \beta) \propto \exp\!\left(-\frac{\beta}{2}\left(\mathbf{t}_c - \boldsymbol{\Phi}\mathbf{w}\right)^T\left(\mathbf{t}_c - \boldsymbol{\Phi}\mathbf{w}\right)\right). $$

Recall that the MLE estimate for μ is:
$$ \hat{\mu} = \bar{t} - \sum_{j=1}^{D}\bar{\boldsymbol{\phi}}_j^T w_j, \qquad \text{where } \bar{\boldsymbol{\phi}}_1, \ldots, \bar{\boldsymbol{\phi}}_D \text{ is formed by averaging each column of } \boldsymbol{\Phi}. $$
A Note on Data Centering

To simplify the earlier notation, consider a linear regression model of the form
$$ y(\mathbf{x} \,|\, \mathbf{w}) = w_0 + \mathbf{w}^T\mathbf{x}. $$

In the context e.g. of MLE, we need to minimize
$$ \min_{w_0, \mathbf{w}}\sum_{i=1}^{N}\left(t_i - w_0 - \mathbf{w}^T\mathbf{x}_i\right)^2. $$

Minimization with respect to $w_0$ gives:
$$ \frac{\partial}{\partial w_0}\sum_{i=1}^{N}\left(t_i - w_0 - \mathbf{w}^T\mathbf{x}_i\right)^2 = 0 \;\;\Rightarrow\;\; \hat{w}_0 = \bar{t} - \bar{\mathbf{x}}^T\mathbf{w}, $$
where:
$$ \bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i, \qquad \bar{t} = \frac{1}{N}\sum_{i=1}^{N}t_i. $$
A Note on Data Centering

Substituting the bias term in our objective function gives:
$$ \min_{\mathbf{w}}\sum_{i=1}^{N}\left(t_i - \bar{t} - \mathbf{w}^T\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\right)^2. $$

Minimization with respect to w gives:
$$ \sum_{i=1}^{N}\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)^T\hat{\mathbf{w}} = \sum_{i=1}^{N}\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\left(t_i - \bar{t}\right). $$

We thus first compute the MLE of w using the centered input $\mathbf{X}_c$ (rows $(\mathbf{x}_i - \bar{\mathbf{x}})^T$) and the centered output $\mathbf{t}_c = \mathbf{t} - \bar{t}\mathbf{1}$ as follows:
$$ \hat{\mathbf{w}} = \left(\mathbf{X}_c^T\mathbf{X}_c\right)^{-1}\mathbf{X}_c^T\mathbf{t}_c = \left(\sum_{i=1}^{N}\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)^T\right)^{-1}\sum_{i=1}^{N}\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\left(t_i - \bar{t}\right). $$

We can then estimate the MLE of $w_0$ as: $\hat{w}_0 = \bar{t} - \bar{\mathbf{x}}^T\hat{\mathbf{w}}$.
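A minimal MATLAB sketch of the centered computation above, assuming a raw N x D input matrix X and targets t are given:

xbar = mean(X, 1); tbar = mean(t);     % column means of the inputs and mean target
Xc = X - xbar;                         % centered inputs
tc = t - tbar;                         % centered outputs
w  = Xc \ tc;                          % MLE of w from the centered data
w0 = tbar - xbar * w;                  % bias recovered from the means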
Bayesian Regression: Example
We generate synthetic data from the function f(x, a) = a0+a1x
with parameter values a0 = −0.3 and a1 = 0.5 by first choosing
values of xn from the uniform distribution U(x|−1, 1), then
evaluating f(xn, a), and finally adding Gaussian noise with
standard deviation of 0.2 to obtain the target values tn.
We assume β=(1/0.2)2=25 and α=2.0.
We perform Bayesian inference sequentially – one point at a
time – so the posterior at each level becomes the new prior.
We show results after 1, 2 and 22 points have been collected.
The results include the likelihood contours (for 1 point), the
posterior and samples of the regression function from the
posterior.
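A minimal MATLAB sketch of this sequential experiment (the parameter values are those quoted above; the random data are generated on the fly):

a = [-0.3; 0.5]; beta = 25; alpha = 2.0; N = 22;
x = 2*rand(N, 1) - 1;                           % x_n ~ U(-1, 1)
t = a(1) + a(2)*x + sqrt(1/beta)*randn(N, 1);   % targets with noise std 0.2
m = [0; 0]; Sinv = alpha*eye(2);                % prior: N(w | 0, alpha^{-1} I)
for n = 1:N                                     % posterior after n points = prior for point n+1
    phi_n = [1; x(n)];                          % basis [1, x]
    Sinv_new = Sinv + beta*(phi_n*phi_n');      % update the posterior precision
    m = Sinv_new \ (Sinv*m + beta*phi_n*t(n));  % update the posterior mean
    Sinv = Sinv_new;
end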
Bayesian Regression: Example

[Figure: prior over (w0, w1) before any data are observed, and y(x, w) curves drawn using samples of w from the prior. MatLab Code.]
Example: One Data Point Collected

[Figure: likelihood contours for the first data point, contours of the posterior, and y(x, w) curves drawn using samples of w from the posterior.]

Note that the regression lines pass close to the data point (shown with a circle). MatLab Code.
Example: 2nd Data Point Collected

[Figure: likelihood contours for the second data point, contours of the updated posterior, and y(x, w) curves drawn using samples of w from the posterior.]

Note that the regression lines now pass close to both data points. MatLab Code.
Example: 22 Data Points Collected

[Figure: likelihood contours for the latest data point, contours of the posterior, and y(x, w) curves drawn using samples of w from the posterior after 22 data points have been collected. MatLab Code.]
Summary of Results

[Figure: grid of panels showing, before any data and after successive data points, the likelihood of the latest point, the prior/posterior over (w0, w1), and samples from the posterior in data space. MatLab Code.]
Summary of Results

[Figure: the same sequential-update experiment produced with PMTK3, shown up to 20 data points: likelihood of the latest point, prior/posterior in (W0, W1) space, and samples in (x, y) data space.]

Run bayesLinRegDemo2d from PMTK3
Predictive Distribution

We are usually not interested in w itself but in making predictions of t for new values of x. This requires the predictive distribution
$$ p(t \,|\, x, \mathbf{x}, \mathbf{t}, \alpha, \beta) = \int p(t \,|\, x, \mathbf{w}, \beta)\,p(\mathbf{w} \,|\, \mathbf{x}, \mathbf{t}, \alpha, \beta)\,d\mathbf{w} = \mathcal{N}\!\left(t \,|\, \mathbf{m}_N^T\boldsymbol{\phi}(x), \sigma_N^2(x)\right), $$
where
$$ \sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x), \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}. $$

The 1st term represents the noise on the data, whereas the 2nd term reflects the uncertainty associated with w. Because the noise process and the distribution of w are independent Gaussians, their variances are additive.

The error bars get larger as we move away from the training points. By contrast, in the plug-in approximation, the error bars are of constant size.

As additional data points are observed, the posterior distribution becomes narrower.
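A minimal MATLAB sketch of the predictive mean and pointwise variance, reusing mN, SN and beta computed earlier (assuming the same polynomial basis phi_j(x) = x^j used for training):

xs = linspace(0, 1, 100)';                           % test inputs
PhiS = xs.^(0:size(SN, 1)-1);                        % test design matrix (same basis as training)
mPred = PhiS * mN;                                   % predictive mean m_N' phi(x)
vPred = 1/beta + sum((PhiS * SN) .* PhiS, 2);        % sigma_N^2(x) = 1/beta + phi(x)' S_N phi(x)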
Predictive Distribution

In a full Bayesian treatment, we want to compute the predictive distribution, i.e. given the training data x and t and a new test point x, we want the distribution:
$$ p(t \,|\, x, \mathbf{x}, \mathbf{t}) = \int p(t \,|\, x, \mathbf{w})\,p(\mathbf{w} \,|\, \mathbf{x}, \mathbf{t})\,d\mathbf{w}, $$
where
$$ p(t \,|\, x, \mathbf{w}) = \mathcal{N}\!\left(t \,|\, y(x, \mathbf{w}), \beta^{-1}\right), $$
$$ p(\mathbf{w} \,|\, \mathbf{x}, \mathbf{t}) \propto \exp\!\left(-\frac{\alpha}{2}\mathbf{w}^T\mathbf{w} - \frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2\right) = \mathcal{N}\!\left(\mathbf{w} \,|\, \mathbf{m}_N, \mathbf{S}_N\right), $$
$$ \mathbf{m}_N = \beta\,\mathbf{S}_N\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)t_n, \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I}_{M\times M} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T. $$

To compute the marginal, we use the standard equations for Gaussian linear systems from an earlier lecture.
Appendix: Useful Result

For a linear Gaussian model, we proved in an earlier lecture that the following very useful results about marginal and conditional Gaussian models hold. If
$$ p(\mathbf{x}) = \mathcal{N}\!\left(\mathbf{x} \,|\, \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1}\right), \qquad p(\mathbf{y} \,|\, \mathbf{x}) = \mathcal{N}\!\left(\mathbf{y} \,|\, \mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1}\right), $$
then
$$ p(\mathbf{y}) = \mathcal{N}\!\left(\mathbf{y} \,|\, \mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \;\mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^T\right), $$
$$ p(\mathbf{x} \,|\, \mathbf{y}) = \mathcal{N}\!\left(\mathbf{x} \,|\, \boldsymbol{\Sigma}\left(\mathbf{A}^T\mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\right), \;\boldsymbol{\Sigma}\right), \qquad \boldsymbol{\Sigma} = \left(\boldsymbol{\Lambda} + \mathbf{A}^T\mathbf{L}\mathbf{A}\right)^{-1}. $$
Predictive Distribution

Thus for our problem, we identify:
$$ \mathbf{x} \to \mathbf{w}, \quad \boldsymbol{\mu} \to \mathbf{m}_N = \beta\,\mathbf{S}_N\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)t_n, \quad \boldsymbol{\Lambda}^{-1} \to \mathbf{S}_N, \quad \mathbf{y} \to t, \quad \mathbf{A} \to \boldsymbol{\phi}(x)^T, \quad \mathbf{b} \to 0, \quad \mathbf{L}^{-1} \to \beta^{-1}, $$
with
$$ p(\mathbf{w} \,|\, \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}\!\left(\mathbf{w} \,|\, \mathbf{m}_N, \mathbf{S}_N\right), \qquad p(t \,|\, x, \mathbf{w}, \beta) = \mathcal{N}\!\left(t \,|\, y(x, \mathbf{w}), \beta^{-1}\right). $$

Using $p(\mathbf{y}) = \mathcal{N}\!\left(\mathbf{y} \,|\, \mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^T\right)$, the predictive distribution now takes the form:
$$ p(t \,|\, x, \mathbf{t}) = \mathcal{N}\!\left(t \,|\, \boldsymbol{\phi}(x)^T\mathbf{m}_N, \;\beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x)\right). $$
Predictive Distribution

In a full Bayesian treatment, we want to compute the predictive distribution, i.e. given the training data x and t and a new test point x, we want the distribution:
$$ p(t \,|\, x, \mathbf{x}, \mathbf{t}) = \int p(t \,|\, x, \mathbf{w})\,p(\mathbf{w} \,|\, \mathbf{x}, \mathbf{t})\,d\mathbf{w} = \mathcal{N}\!\left(t \,|\, m(x), \sigma_N^2(x)\right), \qquad p(t \,|\, x, \mathbf{w}) = \mathcal{N}\!\left(t \,|\, y(x, \mathbf{w}), \beta^{-1}\right), $$
where the mean and variance are given by
$$ m(x) = \boldsymbol{\phi}(x)^T\mathbf{m}_N = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)t_n, \qquad \sigma_N^2(x) = \underbrace{\beta^{-1}}_{\text{uncertainty in the data}} + \underbrace{\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x)}_{\text{uncertainty in }\mathbf{w}}, $$
$$ \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi} = \alpha\mathbf{I} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T. $$

Note that:
$$ \sigma_{N+1}^2(x) \le \sigma_N^2(x), $$
i.e. the predictive variance shrinks (or stays the same) as additional data are observed.
Predictive Distribution

It is easy to show that $\sigma_{N+1}^2(x) \le \sigma_N^2(x)$.

Note that
$$ \mathbf{S}_{N+1}^{-1} = \mathbf{S}_N^{-1} + \beta\,\boldsymbol{\phi}(\mathbf{x}_{N+1})\boldsymbol{\phi}(\mathbf{x}_{N+1})^T, $$
and use the (Sherman–Morrison) identity:
$$ \left(\mathbf{M} + \mathbf{v}\mathbf{v}^T\right)^{-1} = \mathbf{M}^{-1} - \frac{\mathbf{M}^{-1}\mathbf{v}\mathbf{v}^T\mathbf{M}^{-1}}{1 + \mathbf{v}^T\mathbf{M}^{-1}\mathbf{v}}. $$

Using these results, we can write:
$$ \sigma_{N+1}^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_{N+1}\boldsymbol{\phi}(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x) - \frac{\beta\left(\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_{N+1})\right)^2}{1 + \beta\,\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_{N+1})} \;\le\; \sigma_N^2(x). $$
Predictive Distribution: Summary

The notation used here is as follows:
$$ p(t \,|\, x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\!\left(t \,|\, m(x), \sigma_N^2(x)\right), $$
$$ m(x) = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)t_n, \qquad \sigma_N^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x), \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T. $$

Note: the predictive mean and variance are functions of x.

For polynomial regression:
$$ \boldsymbol{\phi}(x_n) = \left(1, x_n, x_n^2, \ldots, x_n^M\right)^T. $$

In general:
$$ \boldsymbol{\phi}(\mathbf{x}_n) = \left(\phi_0(\mathbf{x}_n), \phi_1(\mathbf{x}_n), \phi_2(\mathbf{x}_n), \ldots, \phi_M(\mathbf{x}_n)\right)^T, \qquad \mathbf{I} \text{ is the } M \times M \text{ unit matrix}. $$
Pointwise Uncertainty in the Predictions

[Figure: predictive distribution for data generated from sin(2πx), using M = 9 Gaussian basis functions, for N = 1 and N = 2 data points. Each panel shows the generating function sin(2πx), the random data points used for fitting, the predictive mean, and the predictive standard deviation band. MatLab code.]

The model uses M = 9 Gaussians (10 parameters), with the scale of the Gaussians adjusted to the data, α = 5×10⁻³ and β = 11.1, for N = 1, 2, 4, 10 data points.

The predictive uncertainty is smaller near the data, and the level of uncertainty decreases with N.
Pointwise Uncertainty in the Predictions

[Figure: the same predictive distribution (M = 9) for N = 4 and N = 10 data points. MatLab code.]
Summary of Results

[Figure: predictive distributions (M = 9) for N = 1, 2, 4 and 10 data points, each showing the generating function sin(2πx), the data points, the predictive mean, and the predictive standard deviation band. MatLab code.]
Plugin Approximation

In the plugin approximation, the posterior over w is replaced by a point mass at the estimate $\hat{\mathbf{w}}$:
$$ p(t \,|\, x, \mathbf{x}, \mathbf{t}) = \int p(t \,|\, x, \mathbf{w})\,\delta_{\hat{\mathbf{w}}}(\mathbf{w})\,d\mathbf{w} = p(t \,|\, x, \hat{\mathbf{w}}). $$

[Figure: plugin approximation (MLE) vs. the posterior predictive with known variance, and functions sampled from the plugin approximation to the posterior vs. functions sampled from the posterior. The plugin error bars are of constant size, while the posterior predictive error bars grow away from the training data.]

Run linregPostPredDemo from PMTK3
Covariance Between the Predictions

Draw samples from the posterior of w and then plot y(x, w). We use the same data and basis functions as in the earlier example.

We are visualizing the joint uncertainty in the posterior distribution between the y values at two or more x values.

[Figure: plots of y(x, w) where w is a sample from the posterior over w, for N = 1 and N = 2 data points. MatLab Code.]
Covariance Between the Predictions

Draw samples from the posterior of w and then plot y(x, w). We are visualizing the joint uncertainty in the posterior distribution between the y values at two or more x values.

[Figure: plots of y(x, w) where w is a sample from the posterior over w, for N = 4 and N = 12 data points. MatLab Code.]
Summary of Results

[Figure: samples of y(x, w) from the posterior over w for increasing numbers of data points. MatLab Code.]
Gaussian Basis vs. Gaussian Process

If we use localized basis functions such as Gaussians, then in regions away from the basis function support, the contribution from the second term in the predictive variance will go to zero, leaving only the noise contribution β⁻¹:
$$ \sigma_N^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x) \;\longrightarrow\; \beta^{-1} \quad \text{away from the support of } \boldsymbol{\phi}(x). $$

The model becomes very confident in its predictions when extrapolating outside the region occupied by the basis functions. This is an undesirable behavior.

This problem can be avoided by adopting an alternative Bayesian approach to regression (Gaussian processes).
Bayesian Inference when σ² is Unknown

Let us extend the previous results for linear regression assuming now that σ² is unknown. In the remainder of this lecture, the response is denoted as y, the dimensionality of w as D, and the design matrix as X.

Assume a likelihood of the form:
$$ p(\mathbf{y} \,|\, \mathbf{X}, \mathbf{w}, \sigma^2) = \mathcal{N}\!\left(\mathbf{y} \,|\, \mathbf{X}\mathbf{w}, \sigma^2\mathbf{I}_N\right) = \left(2\pi\sigma^2\right)^{-N/2}\exp\!\left(-\frac{1}{2\sigma^2}\left(\mathbf{y} - \mathbf{X}\mathbf{w}\right)^T\left(\mathbf{y} - \mathbf{X}\mathbf{w}\right)\right). $$

A conjugate prior has the following normal–inverse-gamma (NIG) form:
$$ p(\mathbf{w}, \sigma^2) = \mathcal{NIG}\!\left(\mathbf{w}, \sigma^2 \,|\, \mathbf{w}_0, \mathbf{V}_0, a_0, b_0\right) \triangleq \mathcal{N}\!\left(\mathbf{w} \,|\, \mathbf{w}_0, \sigma^2\mathbf{V}_0\right)\,\mathcal{IG}\!\left(\sigma^2 \,|\, a_0, b_0\right) $$
$$ = \frac{b_0^{a_0}}{(2\pi)^{D/2}\left|\mathbf{V}_0\right|^{1/2}\Gamma(a_0)}\left(\sigma^2\right)^{-\left(a_0 + D/2 + 1\right)}\exp\!\left(-\frac{\left(\mathbf{w} - \mathbf{w}_0\right)^T\mathbf{V}_0^{-1}\left(\mathbf{w} - \mathbf{w}_0\right) + 2b_0}{2\sigma^2}\right). $$

The posterior is derived next.
Bayesian Inference when σ² is Unknown

Let us define the following:
$$ \mathbf{V}_N = \left(\mathbf{V}_0^{-1} + \mathbf{X}^T\mathbf{X}\right)^{-1}, \qquad \mathbf{w}_N = \mathbf{V}_N\left(\mathbf{V}_0^{-1}\mathbf{w}_0 + \mathbf{X}^T\mathbf{y}\right), $$
$$ a_N = a_0 + N/2, \qquad b_N = b_0 + \frac{1}{2}\left(\mathbf{w}_0^T\mathbf{V}_0^{-1}\mathbf{w}_0 + \mathbf{y}^T\mathbf{y} - \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N\right). $$

With these definitions, simple algebra shows that the posterior is again NIG:
$$ p(\mathbf{w}, \sigma^2 \,|\, \mathcal{D}) = \mathcal{NIG}\!\left(\mathbf{w}, \sigma^2 \,|\, \mathbf{w}_N, \mathbf{V}_N, a_N, b_N\right) \triangleq \mathcal{N}\!\left(\mathbf{w} \,|\, \mathbf{w}_N, \sigma^2\mathbf{V}_N\right)\,\mathcal{IG}\!\left(\sigma^2 \,|\, a_N, b_N\right) $$
$$ \propto \left(\sigma^2\right)^{-\left(a_N + D/2 + 1\right)}\exp\!\left(-\frac{\left(\mathbf{w} - \mathbf{w}_N\right)^T\mathbf{V}_N^{-1}\left(\mathbf{w} - \mathbf{w}_N\right) + 2b_N}{2\sigma^2}\right). $$

The posterior marginals can now be derived explicitly:
$$ p(\sigma^2 \,|\, \mathcal{D}) = \mathcal{IG}\!\left(\sigma^2 \,|\, a_N, b_N\right), $$
$$ p(\mathbf{w} \,|\, \mathcal{D}) = \mathcal{T}_D\!\left(\mathbf{w} \,\middle|\, \mathbf{w}_N, \frac{b_N}{a_N}\mathbf{V}_N, 2a_N\right) \propto \left(1 + \frac{\left(\mathbf{w} - \mathbf{w}_N\right)^T\mathbf{V}_N^{-1}\left(\mathbf{w} - \mathbf{w}_N\right)}{2b_N}\right)^{-\left(a_N + D/2\right)}. $$
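A minimal MATLAB sketch of this NIG posterior update, assuming an N x D design matrix X and a response vector y are given; the prior hyper-parameter values are illustrative:

[N, D] = size(X);
w0 = zeros(D, 1); V0 = 100*eye(D); a0 = 0.01; b0 = 0.01;   % weak NIG prior (illustrative)
VN = inv(inv(V0) + X'*X);                                  % posterior V_N
wN = VN * (V0\w0 + X'*y);                                  % posterior mean w_N
aN = a0 + N/2;
bN = b0 + 0.5*(w0'*(V0\w0) + y'*y - wN'*(VN\wN));          % posterior b_N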
Posterior Marginals

The marginal posterior of w can be directly written as:
$$ p(\mathbf{w} \,|\, \mathcal{D}) = \int_0^{\infty}\left(\sigma^2\right)^{-\left(a_N + D/2 + 1\right)}\exp\!\left(-\frac{\left(\mathbf{w} - \mathbf{w}_N\right)^T\mathbf{V}_N^{-1}\left(\mathbf{w} - \mathbf{w}_N\right) + 2b_N}{2\sigma^2}\right)d\sigma^2 \propto \left(1 + \frac{\left(\mathbf{w} - \mathbf{w}_N\right)^T\mathbf{V}_N^{-1}\left(\mathbf{w} - \mathbf{w}_N\right)}{2b_N}\right)^{-\left(a_N + D/2\right)} $$
$$ = \mathcal{T}_D\!\left(\mathbf{w} \,\middle|\, \mathbf{w}_N, \frac{b_N}{a_N}\mathbf{V}_N, 2a_N\right). $$

To compute the integral above, simply set $\lambda = \sigma^{-2}$, $d\sigma^2 = -\lambda^{-2}d\lambda$, and use the normalizing factor of the Gamma distribution:
$$ \int_0^{\infty}\lambda^{a-1}e^{-b\lambda}d\lambda = \Gamma(a)\,b^{-a} \sim b^{-a}. $$
Posterior Predictive Distribution

Consider the posterior predictive for m new test inputs $\tilde{\mathbf{X}}$:
$$ p(\mathbf{y} \,|\, \tilde{\mathbf{X}}, \mathcal{D}) \propto \iint\frac{\left(\sigma^2\right)^{-m/2}}{(2\pi)^{m/2}}\exp\!\left(-\frac{\left(\mathbf{y} - \tilde{\mathbf{X}}\mathbf{w}\right)^T\left(\mathbf{y} - \tilde{\mathbf{X}}\mathbf{w}\right)}{2\sigma^2}\right)\left(\sigma^2\right)^{-\left(a_N + D/2 + 1\right)}\exp\!\left(-\frac{\left(\mathbf{w} - \mathbf{w}_N\right)^T\mathbf{V}_N^{-1}\left(\mathbf{w} - \mathbf{w}_N\right) + 2b_N}{2\sigma^2}\right)d\mathbf{w}\,d\sigma^2. $$

As a first step, let us integrate in w by completing the square:
$$ \left(\mathbf{y} - \tilde{\mathbf{X}}\mathbf{w}\right)^T\left(\mathbf{y} - \tilde{\mathbf{X}}\mathbf{w}\right) + \left(\mathbf{w} - \mathbf{w}_N\right)^T\mathbf{V}_N^{-1}\left(\mathbf{w} - \mathbf{w}_N\right) + 2b_N = \left(\mathbf{w} - \mathbf{A}^{-1}\mathbf{c}\right)^T\mathbf{A}\left(\mathbf{w} - \mathbf{A}^{-1}\mathbf{c}\right) - \mathbf{c}^T\mathbf{A}^{-1}\mathbf{c} + \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N + \mathbf{y}^T\mathbf{y} + 2b_N, $$
where $\mathbf{A} = \tilde{\mathbf{X}}^T\tilde{\mathbf{X}} + \mathbf{V}_N^{-1}$ and $\mathbf{c} = \tilde{\mathbf{X}}^T\mathbf{y} + \mathbf{V}_N^{-1}\mathbf{w}_N$. The quadratic term in w cancels out in the integration over w; let us denote the remaining term as
$$ 2\beta = -\mathbf{c}^T\mathbf{A}^{-1}\mathbf{c} + \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N + \mathbf{y}^T\mathbf{y} + 2b_N. $$
Posterior Predictive Distribution

The posterior predictive
$$ p(\mathbf{y} \,|\, \tilde{\mathbf{X}}, \mathcal{D}) \propto \iint\left(\sigma^2\right)^{-\left(m/2 + a_N + D/2 + 1\right)}\exp\!\left(-\frac{\left(\mathbf{w} - \mathbf{A}^{-1}\mathbf{c}\right)^T\mathbf{A}\left(\mathbf{w} - \mathbf{A}^{-1}\mathbf{c}\right) + 2\beta}{2\sigma^2}\right)d\mathbf{w}\,d\sigma^2 $$
is now simplified using $\lambda = 1/\sigma^2$ and recalling the normalization of the Gamma distribution:
$$ p(\mathbf{y} \,|\, \tilde{\mathbf{X}}, \mathcal{D}) \propto \int\lambda^{m/2 + a_N - 1}\exp\!\left(-\beta\lambda\right)d\lambda \sim \beta^{-\left(m/2 + a_N\right)}. $$

Substituting β and comparing the two expressions, one can verify that:
$$ p(\mathbf{y} \,|\, \tilde{\mathbf{X}}, \mathcal{D}) \propto \left(-\mathbf{c}^T\mathbf{A}^{-1}\mathbf{c} + \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N + \mathbf{y}^T\mathbf{y} + 2b_N\right)^{-\left(\frac{m}{2}+a_N\right)} \propto \left(1 + \frac{\left(\mathbf{y} - \tilde{\mathbf{X}}\mathbf{w}_N\right)^T\left[\frac{b_N}{a_N}\left(\mathbf{I}_m + \tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T\right)\right]^{-1}\left(\mathbf{y} - \tilde{\mathbf{X}}\mathbf{w}_N\right)}{2a_N}\right)^{-\left(\frac{m}{2}+a_N\right)}. $$

Use the Sherman–Morrison–Woodbury formula here to show that (symmetry of $\mathbf{V}_0$ is assumed):
$$ \left(\mathbf{I}_m + \tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T\right)^{-1} = \mathbf{I}_m - \tilde{\mathbf{X}}\left(\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} + \mathbf{V}_N^{-1}\right)^{-1}\tilde{\mathbf{X}}^T. $$
Bayesian Inference when σ² is Unknown

The posterior predictive is therefore a (multivariate) Student's T:
$$ p(\mathbf{y} \,|\, \tilde{\mathbf{X}}, \mathcal{D}) = \mathcal{T}_m\!\left(\mathbf{y} \,\middle|\, \tilde{\mathbf{X}}\mathbf{w}_N, \;\frac{b_N}{a_N}\left(\mathbf{I}_m + \tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T\right), \;2a_N\right). $$

The predictive variance has two terms: $\frac{b_N}{a_N}\mathbf{I}_m$ due to the measurement noise, and $\frac{b_N}{a_N}\tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T$ due to the uncertainty in w. The second term depends on how close a test input is to the training data.
Zellner's g-Prior

Recall the conjugate prior
$$ p(\mathbf{w}, \sigma^2) = \mathcal{NIG}\!\left(\mathbf{w}, \sigma^2 \,|\, \mathbf{w}_0, \mathbf{V}_0, a_0, b_0\right) \triangleq \mathcal{N}\!\left(\mathbf{w} \,|\, \mathbf{w}_0, \sigma^2\mathbf{V}_0\right)\,\mathcal{IG}\!\left(\sigma^2 \,|\, a_0, b_0\right). $$

It is common to set $a_0 = b_0 = 0$, corresponding to an uninformative prior for σ², and to set $\mathbf{w}_0 = \mathbf{0}$ and $\mathbf{V}_0 = g\left(\mathbf{X}^T\mathbf{X}\right)^{-1}$ for any positive value g:
$$ p(\mathbf{w}, \sigma^2) = \mathcal{NIG}\!\left(\mathbf{w}, \sigma^2 \,\middle|\, \mathbf{0}, g\left(\mathbf{X}^T\mathbf{X}\right)^{-1}, 0, 0\right) \triangleq \mathcal{N}\!\left(\mathbf{w} \,\middle|\, \mathbf{0}, \sigma^2 g\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\right)\,\mathcal{IG}\!\left(\sigma^2 \,|\, 0, 0\right). $$

This is called Zellner's g-prior. Here g plays a role analogous to 1/λ in ridge regression. However, the prior covariance is proportional to $(\mathbf{X}^T\mathbf{X})^{-1}$ rather than I. This ensures that the posterior is invariant to scaling of the inputs.

Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques, Studies in Bayesian Econometrics and Statistics, volume 6. North-Holland.
Minka, T. (2000). Bayesian linear regression. Technical report, MIT.
Unit Information Prior

We will see below that if we use an uninformative prior, the posterior precision given N measurements is $\mathbf{V}_N^{-1} = \mathbf{X}^T\mathbf{X}$.

The unit information prior is defined to contain as much information as one sample. To create a unit information prior for linear regression, we need to use $\mathbf{V}_0^{-1} = \frac{1}{N}\mathbf{X}^T\mathbf{X}$, which is equivalent to the g-prior with g = N:
$$ p(\mathbf{w}, \sigma^2) = \mathcal{NIG}\!\left(\mathbf{w}, \sigma^2 \,\middle|\, \mathbf{0}, N\left(\mathbf{X}^T\mathbf{X}\right)^{-1}, 0, 0\right). $$

Zellner's prior depends on the data: this is contrary to much of our Bayesian inference discussion!

Kass, R. and L. Wasserman (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. of the Am. Stat. Assoc. 90(431), 928–934.
Uninformative Prior

An uninformative prior can be obtained by considering the uninformative limit of the conjugate g-prior, which corresponds to setting g = ∞. This is equivalent to an improper NIG prior with $\mathbf{w}_0 = \mathbf{0}$, $\mathbf{V}_0 = \infty\mathbf{I}$, $a_0 = 0$ and $b_0 = 0$, which gives
$$ p(\mathbf{w}, \sigma^2) = \mathcal{NIG}\!\left(\mathbf{w}, \sigma^2 \,|\, \mathbf{0}, \infty\mathbf{I}, 0, 0\right) \propto \sigma^{-(D+2)}. $$

Alternatively, we can start with the semi-conjugate prior $p(\mathbf{w}, \sigma^2) = p(\mathbf{w})\,p(\sigma^2)$, and take each term to its uninformative limit individually, which gives $p(\mathbf{w}, \sigma^2) \propto \sigma^{-2}$. This is equivalent to an improper NIG prior with $\mathbf{w}_0 = \mathbf{0}$, $\mathbf{V}_0 = \infty\mathbf{I}$, $a_0 = -D/2$ and $b_0 = 0$:
$$ p(\mathbf{w}, \sigma^2) = \mathcal{NIG}\!\left(\mathbf{w}, \sigma^2 \,\middle|\, \mathbf{0}, \infty\mathbf{I}, -\frac{D}{2}, 0\right) \propto \sigma^{-2}. $$
Uninformative Prior

Using the uninformative prior $p(\mathbf{w}, \sigma^2) \propto \sigma^{-2}$, the corresponding posterior and marginal posteriors are given by
$$ p(\mathbf{w}, \sigma^2 \,|\, \mathcal{D}) = \mathcal{NIG}\!\left(\mathbf{w}, \sigma^2 \,|\, \mathbf{w}_N, \mathbf{V}_N, a_N, b_N\right), \qquad p(\mathbf{w} \,|\, \mathcal{D}) = \mathcal{T}_D\!\left(\mathbf{w}_N, \frac{b_N}{a_N}\mathbf{V}_N, 2a_N\right) = \mathcal{T}_D\!\left(\mathbf{w} \,\middle|\, \hat{\mathbf{w}}_{MLE}, \frac{s^2}{N-D}\mathbf{C}, N-D\right), $$
where
$$ \mathbf{V}_N = \mathbf{C} = \left(\mathbf{V}_0^{-1} + \mathbf{X}^T\mathbf{X}\right)^{-1} \to \left(\mathbf{X}^T\mathbf{X}\right)^{-1}, \qquad \mathbf{w}_N = \mathbf{V}_N\left(\mathbf{V}_0^{-1}\mathbf{w}_0 + \mathbf{X}^T\mathbf{y}\right) \to \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y} = \hat{\mathbf{w}}_{MLE}, $$
$$ a_N = a_0 + N/2 = (N-D)/2, \qquad b_N = b_0 + \frac{1}{2}\left(\mathbf{w}_0^T\mathbf{V}_0^{-1}\mathbf{w}_0 + \mathbf{y}^T\mathbf{y} - \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N\right) = s^2/2, \qquad s^2 \equiv \left(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}_{MLE}\right)^T\left(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}_{MLE}\right). $$

Note in the calculation of s²:
$$ s^2 = \left(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}_{MLE}\right)^T\left(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}_{MLE}\right) = \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{y}^T\mathbf{y} - \hat{\mathbf{w}}_{MLE}^T\mathbf{V}_N^{-1}\hat{\mathbf{w}}_{MLE}. $$
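A minimal MATLAB sketch of the marginal 95% intervals under this uninformative prior, assuming X and y are given (tinv is from the Statistics and Machine Learning Toolbox):

[N, D] = size(X);
wN  = (X'*X) \ (X'*y);                               % posterior mean = MLE
C   = inv(X'*X);
s2  = sum((y - X*wN).^2);                            % residual sum of squares
se  = sqrt(diag(C) * s2 / (N - D));                  % marginal standard errors
tq  = tinv(0.975, N - D);                            % Student-t quantile with N-D dof
CI  = [wN - tq*se, wN + tq*se];                      % 95% credible (= confidence) intervals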
Frequentist Confidence Interval vs. Bayesian Marginal Credible Interval

The use of a (semi-conjugate) uninformative prior is quite interesting since the resulting posterior turns out to be equivalent to the results obtained from frequentist statistics:
$$ p(w_j \,|\, \mathcal{D}) = \mathcal{T}\!\left(w_j \,\middle|\, \hat{w}_j, \frac{C_{jj}s^2}{N-D}, N-D\right). $$

This is equivalent to the sampling distribution of the MLE, which is given by the following:
$$ \frac{w_j - \hat{w}_j}{s_j} \sim \mathcal{T}_{N-D}, \qquad s_j = \sqrt{\frac{C_{jj}s^2}{N-D}}, $$
where $s_j$ is the standard error of the estimated parameter.

The frequentist confidence interval and the Bayesian marginal credible interval for the parameters are the same in this case.

Rice, J. (1995). Mathematical Statistics and Data Analysis. Duxbury, 2nd edition (page 542).
Casella, G. and R. Berger (2002). Statistical Inference. Duxbury, 2nd edition (page 554).
The Caterpillar Example

As a worked example of the uninformative prior, consider the caterpillar dataset. We can compute the posterior mean and standard deviation, and the 95% credible intervals (CI) for the regression coefficients.

coeff mean stddev 95pc CI sig
w0 10.998 3.06027 [ 4.652, 17.345] *
w1 -0.004 0.00156 [ -0.008, -0.001] *
w2 -0.054 0.02190 [ -0.099, -0.008] *
w3 0.068 0.09947 [ -0.138, 0.274]
w4 -1.294 0.56381 [ -2.463, -0.124] *
w5 0.232 0.10438 [ 0.015, 0.448] *
w6 -0.357 1.56646 [ -3.605, 2.892]
w7 -0.237 1.00601 [ -2.324, 1.849]
w8 0.181 0.23672 [ -0.310, 0.672]
w9 -1.285 0.86485 [ -3.079, 0.508]
w10 -0.433 0.73487 [ -1.957, 1.091]

The 95% credible intervals are identical to the 95% confidence intervals computed using standard frequentist methods.

Run linregBayesCaterpillar from PMTK3

Marin, J.-M. and C. Robert (2007). Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer.
The Caterpillar Example

We can use these marginal posteriors to assess whether the coefficients are significantly different from 0: check whether the 95% CI excludes 0.

The CIs for coefficients 0, 1, 2, 4, 5 are all significant.

These results are the same as those produced by a frequentist approach using p-values at the 5% level.

But note that the MLE does not even exist when N < D, so standard frequentist inference theory breaks down in this setting. Bayesian inference theory still works using proper priors.

Maruyama, Y. and E. George (2008). A g-prior extension for p > n. Technical report, U. Tokyo.
Empirical Bayes for Linear Regression

We describe next an empirical Bayes procedure for picking the hyper-parameters in the prior (we will come back to this and relevance determination in a forthcoming lecture).

More precisely, we choose η = (α, λ) to maximize the marginal likelihood, where λ = 1/σ² is the precision of the observation noise and α is the precision of the prior, p(w) = N(w|0, α⁻¹I).

This is known as the evidence procedure.

MacKay, D. (1995). Probable networks and plausible predictions — a review of practical Bayesian methods for supervised neural networks. Network.
Buntine, W. and A. Weigend (1991). Bayesian backpropagation. Complex Systems 5, 603–643.
MacKay, D. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation 11(5), 1035–1068.
Empirical Bayes for Linear Regression

The evidence procedure provides an alternative to using cross-validation.

In the figure, the log marginal likelihood (log evidence) is plotted for different values of log α, as well as the maximum value found by the optimizer.

[Figure: log evidence vs. log alpha. Run linregPolyVsRegDemo from PMTK3.]
Empirical Bayes for Linear Regression

[Figure: log evidence vs. log alpha, and negative log marginal likelihood vs. log lambda compared with the CV estimate of MSE. Run linregPolyVsRegDemo from PMTK3.]

We obtain the same result as 5-fold CV (λ = 1/σ² is fixed in both methods).

The key advantage of the evidence procedure over CV is that it allows a different αj to be used for every feature.
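A minimal MATLAB sketch of the log marginal likelihood (evidence) for fixed alpha and beta, in the notation of the earlier Gaussian-prior slides; this follows the standard closed-form expression (cf. Bishop, PRML, Ch. 3), and Phi and t are assumed given:

[N, M] = size(Phi);
A     = alpha*eye(M) + beta*(Phi'*Phi);              % posterior precision matrix
mN    = beta * (A \ (Phi'*t));                       % posterior mean
EmN   = beta/2*sum((t - Phi*mN).^2) + alpha/2*(mN'*mN);
logEv = M/2*log(alpha) + N/2*log(beta) - EmN ...
        - 0.5*log(det(A)) - N/2*log(2*pi);           % log p(t | alpha, beta)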
Automatic Relevancy Determination

The evidence procedure can be used to perform feature selection (automatic relevancy determination or ARD).

The evidence procedure is also useful when comparing different kinds of models:
$$ p(\mathcal{D} \,|\, m) = \iint p(\mathcal{D} \,|\, \mathbf{w}, m)\,p(\mathbf{w} \,|\, \boldsymbol{\eta}, m)\,p(\boldsymbol{\eta} \,|\, m)\,d\mathbf{w}\,d\boldsymbol{\eta} \approx \max_{\boldsymbol{\eta}}\int p(\mathcal{D} \,|\, \mathbf{w}, m)\,p(\mathbf{w} \,|\, \boldsymbol{\eta}, m)\,p(\boldsymbol{\eta} \,|\, m)\,d\mathbf{w}. $$

It is important to (at least approximately) integrate over η rather than setting it arbitrarily.

Using variational Bayes models our uncertainty in η rather than computing point estimates.