Bayesian Analysis for Machine Learning
AI Friends Seminar
Ganguk Hwang
Department of Mathematical Sciences, KAIST
AI Friends Seminar Ganguk Hwang Bayesian Analysis for Machine Learning January 2019 1 / 54
Bayesian Model for Linear Regression
The Standard Linear Model
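Let $\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid i = 1, \dots, n\}$ be a training set of $n$ observations, where $\mathbf{x}_i$ is an input vector of dimension $D$ and $y_i$ is a scalar output.
Let $X = [\mathbf{x}_1, \dots, \mathbf{x}_n]$ be the $D \times n$ matrix whose columns are the inputs.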
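First, we consider the standard linear regression model with Gaussian noise
\[ f(\mathbf{x}_i) = \mathbf{x}_i^\top \mathbf{w}, \qquad y_i = f(\mathbf{x}_i) + \varepsilon_i, \]
where $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$ and $\{\varepsilon_i\}_{i=1}^n$ are independent.
Then, the likelihood can be computed as
\[ p(\mathbf{y} \mid X, \mathbf{w}) = \prod_{i=1}^n p(y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left(-\frac{(y_i - \mathbf{x}_i^\top \mathbf{w})^2}{2\sigma_n^2}\right) = \frac{1}{(2\pi\sigma_n^2)^{n/2}} \exp\left(-\frac{1}{2\sigma_n^2} \|\mathbf{y} - X^\top \mathbf{w}\|^2\right), \]
that is, $\mathbf{y} \mid X, \mathbf{w} \sim \mathcal{N}(X^\top \mathbf{w}, \sigma_n^2 I)$.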
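In the Bayesian treatment, we need to specify a prior over the parameters. Suppose $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$. By Bayes' rule,
\[ p(\mathbf{w} \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w}, X)}{p(\mathbf{y}, X)} = \frac{p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid X)}, \]
where the second equality uses that $\mathbf{w}$ is independent of $X$ under the prior.
Here, $p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w}$ is the normalizing constant. (It is called the marginal likelihood.)
Since it does not depend on $\mathbf{w}$, we neglect $p(\mathbf{y} \mid X)$ when we compute the posterior $p(\mathbf{w} \mid X, \mathbf{y})$.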
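\[ p(\mathbf{w} \mid X, \mathbf{y}) \propto p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w}) \propto \exp\left(-\frac{1}{2\sigma_n^2} (\mathbf{y} - X^\top \mathbf{w})^\top (\mathbf{y} - X^\top \mathbf{w})\right) \exp\left(-\frac{1}{2} \mathbf{w}^\top \Sigma_p^{-1} \mathbf{w}\right) \]
Here,
\[ \frac{1}{\sigma_n^2} (\mathbf{y} - X^\top \mathbf{w})^\top (\mathbf{y} - X^\top \mathbf{w}) + \mathbf{w}^\top \Sigma_p^{-1} \mathbf{w} = \mathbf{w}^\top A \mathbf{w} - B \mathbf{w} - \mathbf{w}^\top C + D, \]
where
\[ A = \frac{1}{\sigma_n^2} X X^\top + \Sigma_p^{-1}, \quad B = \frac{1}{\sigma_n^2} (X\mathbf{y})^\top, \quad C = \frac{1}{\sigma_n^2} X \mathbf{y}, \quad D = \frac{1}{\sigma_n^2} \mathbf{y}^\top \mathbf{y}. \]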
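Observe that, for any fixed vector $\bar{\mathbf{w}}$,
\[ (\mathbf{w} - \bar{\mathbf{w}})^\top A (\mathbf{w} - \bar{\mathbf{w}}) = \mathbf{w}^\top A \mathbf{w} - \bar{\mathbf{w}}^\top A \mathbf{w} - \mathbf{w}^\top A \bar{\mathbf{w}} + \bar{\mathbf{w}}^\top A \bar{\mathbf{w}}. \]
Now, set $\bar{\mathbf{w}} = A^{-1} C$. Then, since $A$ is symmetric,
\[ \bar{\mathbf{w}} = \frac{1}{\sigma_n^2} A^{-1} X \mathbf{y}, \qquad \bar{\mathbf{w}}^\top A = \frac{1}{\sigma_n^2} (X\mathbf{y})^\top = B, \qquad A \bar{\mathbf{w}} = C. \]
By neglecting the constant term, we get
\[ p(\mathbf{w} \mid X, \mathbf{y}) \propto \exp\left(-\frac{1}{2} (\mathbf{w} - \bar{\mathbf{w}})^\top A (\mathbf{w} - \bar{\mathbf{w}})\right). \]
In other words,
\[ \mathbf{w} \mid X, \mathbf{y} \sim \mathcal{N}\left(\frac{1}{\sigma_n^2} A^{-1} X \mathbf{y},\; A^{-1}\right). \]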
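The posterior formulas above can be checked numerically. The following is a minimal NumPy sketch, not the seminar's own code: the data are synthetic, and the names `w_true`, `sigma_n`, `Sigma_p`, `w_bar`, and `post_cov` are assumptions made for illustration. As in the slides, the inputs are stored as the columns of `X`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): D = 2 input dimensions, n = 50 points.
D, n = 2, 50
sigma_n = 0.3                       # noise standard deviation
w_true = np.array([1.5, -0.7])      # hypothetical ground-truth weights
X = rng.normal(size=(D, n))         # columns are the inputs x_i
y = X.T @ w_true + sigma_n * rng.normal(size=n)

Sigma_p = np.eye(D)                 # prior covariance of w

# A = sigma_n^{-2} X X^T + Sigma_p^{-1}
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)

# Posterior mean w_bar = sigma_n^{-2} A^{-1} X y, computed with a linear solve.
w_bar = np.linalg.solve(A, X @ y) / sigma_n**2

# Posterior covariance A^{-1}.
post_cov = np.linalg.inv(A)

print("posterior mean:", w_bar)
print("posterior covariance:")
print(post_cov)
```

With this much data and a fairly wide prior, the posterior mean should land close to the weights that generated the data, and the posterior covariance should be small.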
Gaussian Identities
Theorem 1
The product of two Gaussians gives another Gaussian
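\[ \mathcal{N}(\mathbf{x} \mid \mathbf{a}, A)\, \mathcal{N}(\mathbf{x} \mid \mathbf{b}, B) = Z^{-1} \mathcal{N}(\mathbf{x} \mid \mathbf{c}, C) \]
where
\[ \mathbf{c} = C(A^{-1}\mathbf{a} + B^{-1}\mathbf{b}), \qquad C = (A^{-1} + B^{-1})^{-1}, \]
and
\[ Z^{-1} = (2\pi)^{-D/2}\, |A + B|^{-1/2} \exp\left(-\frac{1}{2} (\mathbf{a} - \mathbf{b})^\top (A + B)^{-1} (\mathbf{a} - \mathbf{b})\right). \]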
Here, Z is just the normalizing constant.
The Standard Linear Model: Predictive distribution
Definition 1
The (posterior) predictive distribution is the distribution of possible unobserved values (test data) conditional on the observed values (training data).
To make a prediction for a test point, we use the predictive distribution. Let $\mathbf{x}_*$ be a test case and $f_* = f(\mathbf{x}_*) = \mathbf{x}_*^\top \mathbf{w}$. Then, the predictive distribution for $f_*$ at $\mathbf{x}_*$ is given as
\[ f_* \mid \mathbf{x}_*, X, \mathbf{y} \sim \mathcal{N}\left(\frac{1}{\sigma_n^2} \mathbf{x}_*^\top A^{-1} X \mathbf{y},\; \mathbf{x}_*^\top A^{-1} \mathbf{x}_*\right). \]
We use the mean of the predictive distribution as our estimator for $f(\mathbf{x}_*)$.
The predictive distribution is also Gaussian, as follows from our previous derivation.
The predictive variance $\mathbf{x}_*^\top A^{-1} \mathbf{x}_*$ is a quadratic form of the test input with the posterior covariance matrix. This implies that the predictive uncertainty grows with the magnitude of the test input.
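The growth of the predictive variance with the size of the test input can be illustrated with a small sketch. This assumes synthetic data and a hypothetical helper `predictive`; it is not taken from the seminar. Because the variance is a quadratic form in the test input, doubling the test input should multiply the variance by four and the mean by two.

```python
import numpy as np

def predictive(x_star, X, y, sigma_n, Sigma_p):
    """Mean and variance of f_* = x_*^T w under the posterior over w."""
    A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
    mean = x_star @ np.linalg.solve(A, X @ y) / sigma_n**2
    var = x_star @ np.linalg.solve(A, x_star)
    return mean, var

rng = np.random.default_rng(0)

# Synthetic training data (assumed): columns of X are the inputs.
D, n, sigma_n = 2, 30, 0.5
X = rng.normal(size=(D, n))
y = X.T @ np.array([1.0, -2.0]) + sigma_n * rng.normal(size=n)
Sigma_p = np.eye(D)

x_star = np.array([0.5, 1.0])
mean_1, var_1 = predictive(x_star, X, y, sigma_n, Sigma_p)
mean_2, var_2 = predictive(2.0 * x_star, X, y, sigma_n, Sigma_p)

# The variance is quadratic in the test input, the mean is linear.
print("variance ratio:", var_2 / var_1)
print("mean ratio:", mean_2 / mean_1)
```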
Projections of Inputs into Feature Space
Consider the function $\phi : \mathbb{R}^D \to \mathbb{R}^N$ which maps an input $\mathbf{x}$ into an $N$-dimensional feature space.
Let $\Phi(X) = [\phi(\mathbf{x}_1), \dots, \phi(\mathbf{x}_n)]$.
The model is now given as $f(\mathbf{x}_i) = \phi(\mathbf{x}_i)^\top \mathbf{w}$, $y_i = f(\mathbf{x}_i) + \varepsilon_i$, where the $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$ are independent.
This model linearly approximates the outputs using feature functions which form a basis of a high-dimensional space.
Why feature functions?
Figure: Feature function
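Projections of Inputs into Feature Space: The predictive distribution
The analysis for this model is analogous to the standard linear model. Here, $X$ and $\mathbf{x}_*$ are replaced by $\Phi(X)$ and $\phi(\mathbf{x}_*)$. For simplicity, we write $\Phi = \Phi(X)$ and $\phi_* = \phi(\mathbf{x}_*)$. Then, the predictive distribution becomes
\[ f_* \mid \mathbf{x}_*, X, \mathbf{y} \sim \mathcal{N}\left(\frac{1}{\sigma_n^2} \phi_*^\top A^{-1} \Phi \mathbf{y},\; \phi_*^\top A^{-1} \phi_*\right) \]
where $A = \frac{1}{\sigma_n^2} \Phi \Phi^\top + \Sigma_p^{-1}$.
Alternatively, we can rewrite the predictive distribution in the following way:
\[ f_* \mid \mathbf{x}_*, X, \mathbf{y} \sim \mathcal{N}\left(\phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \mathbf{y},\; \phi_*^\top \left(\Sigma_p - \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \Phi^\top \Sigma_p\right) \phi_*\right) \]
where $K = \Phi^\top \Sigma_p \Phi$. (This form is the link to Gaussian process regression.)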
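The two forms of the predictive distribution can be compared numerically. The sketch below is illustrative only: the polynomial feature map `phi`, the synthetic data, and all variable names are assumptions, not part of the seminar. It evaluates the weight-space form (using $A$, an $N \times N$ matrix) and the kernel form (using $K + \sigma_n^2 I$, an $n \times n$ matrix) at one test point.

```python
import numpy as np

def phi(x):
    """Hypothetical polynomial feature map from R to R^3: (1, x, x^2)."""
    return np.array([1.0, x, x**2])

rng = np.random.default_rng(1)

# Synthetic 1-D training data (assumed for illustration).
n, sigma_n = 20, 0.2
x_train = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(x_train) + sigma_n * rng.normal(size=n)

Phi = np.column_stack([phi(x) for x in x_train])   # N x n, columns are phi(x_i)
Sigma_p = np.eye(3)                                # prior covariance of w
phi_s = phi(0.5)                                   # features of the test point

# Weight-space form: uses the N x N matrix A.
A = Phi @ Phi.T / sigma_n**2 + np.linalg.inv(Sigma_p)
mean_w = phi_s @ np.linalg.solve(A, Phi @ y) / sigma_n**2
var_w = phi_s @ np.linalg.solve(A, phi_s)

# Kernel form: uses the n x n matrix K + sigma_n^2 I.
K = Phi.T @ Sigma_p @ Phi
G = K + sigma_n**2 * np.eye(n)
k_s = Phi.T @ Sigma_p @ phi_s
mean_f = k_s @ np.linalg.solve(G, y)
var_f = phi_s @ Sigma_p @ phi_s - k_s @ np.linalg.solve(G, k_s)

print("means:", mean_w, mean_f)
print("variances:", var_w, var_f)
```

The two means and the two variances should agree up to floating-point error. Which form is cheaper depends on whether the feature dimension $N$ or the number of observations $n$ is smaller.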
Regression-Examples
Examples - Introduction
There are two examples which use GP regression for prediction.
The first one is a noise-free case where the objective function is given as $f(x) = x\cos(x)$.
The second one is a noisy case where the objective function is not given.
The training data for the second example are the monthly numbers of international airline passengers in the USA, which are commonly used to test the performance of regression methods.
Python 3.5 is used to implement the examples. We use the following libraries:
1 NumPy: It makes vector computation in Python easy.
2 scikit-learn (sklearn) GaussianProcessRegressor and linear_model: The first one performs GP regression by optimizing the log-marginal likelihood with respect to the hyper-parameters. The second one provides linear regression, which is needed to predict the mean function of the test data.
There are also other good tools for GP regression, such as gpytorch and gpflow.
Noise free case
In this example, the objective function is given as $f(x) = x\cos(x)$. The training data are given as $\{(x_1, f(x_1)), \dots, (x_n, f(x_n))\}$ where $x_1, \dots, x_n$ are randomly chosen.
Here, we use the squared exponential covariance function
\[ k(x_i, x_j) = \sigma_f^2 \exp\left(-\frac{1}{2l^2} (x_i - x_j)^2\right). \]
The following figure shows the GP prediction obtained from the randomly chosen points on $y = f(x)$.
Figure: Noise-free Case
The red dotted line is our objective function. The blue line is our GP regressor.
Hyper-parameter tuning is done by maximizing the log-marginal likelihood function with a numerical optimization method.
The above figure is obtained from the optimized kernel
\[ k(x_i, x_j) = 8.06^2 \exp\left(-\frac{1}{2(2.19)^2} (x_i - x_j)^2\right) \]
with the log-marginal likelihood value $-13.67$.
In this case the GP regressor is very accurate.
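A noise-free fit of this kind can be sketched with scikit-learn as follows. This is not the seminar's script: the input range $[0, 10]$, the number of training points, the seed, and the initial hyper-parameters are all assumptions, so the optimized kernel and log-marginal likelihood printed here will generally differ from the values quoted above. The product `ConstantKernel * RBF` plays the role of $\sigma_f^2$ times the squared exponential.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

rng = np.random.RandomState(0)

# Randomly chosen training inputs on an assumed interval [0, 10].
X_train = rng.uniform(0.0, 10.0, size=(8, 1))
y_train = (X_train * np.cos(X_train)).ravel()      # noise-free targets f(x) = x cos(x)

# sigma_f^2 * exp(-(x_i - x_j)^2 / (2 l^2)); the initial values are arbitrary.
kernel = ConstantKernel(1.0, (1e-3, 1e3)) * RBF(length_scale=1.0,
                                                 length_scale_bounds=(1e-2, 1e2))

# fit() maximizes the log-marginal likelihood over the kernel hyper-parameters;
# restarts reduce the risk of stopping in a poor local optimum.
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5, random_state=0)
gpr.fit(X_train, y_train)

X_test = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)

print("optimized kernel:", gpr.kernel_)
print("log-marginal likelihood:", gpr.log_marginal_likelihood_value_)
```

Plotting `mean` with a band of a few `std` around it against $x\cos(x)$ would give a picture of the same kind as the figure described above.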
Noisy case
This example deals with real data on international airline passengers in the USA from January 1949 to December 1960.
Here, we use the first 85 percent of the data as training data and the remaining data as a validation set. The validation set is used to test the estimator.
Here, we use the following covariance function
\[ k(x_i, x_j) = \sigma_1^2 \left(x_i \cdot x_j + \sigma_2^2\right) \exp\left(-\frac{2 \sin^2\left(\pi (x_i - x_j)^2 / \sigma_3\right)}{\sigma_4^2}\right) \]
and for noisy data we use
\[ k_y(x_i, x_j) = k(x_i, x_j) + \sigma_5^2 \delta_{ij}. \]
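A kernel with this structure (linear factor, periodic factor, additive noise) can be composed from scikit-learn's kernel classes. The sketch below makes several assumptions: the series is synthetic rather than the airline data, the initial hyper-parameters and the noise bounds are arbitrary, and the centering is a plain mean subtraction. Note also that scikit-learn's `ExpSineSquared` uses $\sin^2(\pi d / p)$ with the Euclidean distance $d$, so it is the library's standard periodic kernel and may not match the formula above term by term.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    ConstantKernel, DotProduct, ExpSineSquared, WhiteKernel)

rng = np.random.RandomState(0)

# Synthetic monthly series (assumed): a growing trend with a yearly cycle.
t = np.arange(72) / 12.0                                   # time in years
y = (1.0 + 0.5 * t) * (1.0 + 0.3 * np.sin(2 * np.pi * t)) + 0.05 * rng.randn(72)
X = t.reshape(-1, 1)

# First 85 percent for training, the rest for validation.
n_train = int(0.85 * len(t))
y_centered = y - y[:n_train].mean()                        # centre with training data only

# sigma_1^2 * (x_i . x_j + sigma_2^2) * periodic term, plus a noise term.
signal_kernel = (ConstantKernel(1.0)
                 * DotProduct(sigma_0=1.0)
                 * ExpSineSquared(length_scale=1.0, periodicity=1.0))
noise_kernel = WhiteKernel(noise_level=0.1, noise_level_bounds=(1e-3, 1e1))
kernel = signal_kernel + noise_kernel

K_y = kernel(X)                                            # covariance matrix with noise

gpr = GaussianProcessRegressor(kernel=kernel, random_state=0)
gpr.fit(X[:n_train], y_centered[:n_train])
mean, std = gpr.predict(X[n_train:], return_std=True)

print("optimized kernel:", gpr.kernel_)
print("log-marginal likelihood:", gpr.log_marginal_likelihood_value_)
```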
Figure: Noisy Case
The red line is our validation set (equivalently, the test data) and the blue line is our training data. The black line is our GP regressor.
The above figure is obtained from the optimized kernel
\[ k_y(x_i, x_j) = (0.94)^2 \left(x_i \cdot x_j + 10^{-5}\right) \exp\left(-\frac{2 \sin^2\left(\pi (x_i - x_j)^2 / 0.902\right)}{(0.255)^2}\right) + 279\, \delta_{ij} \]
with the log-marginal likelihood value $-539.31$.
Remarks
There are some remarks.
In general, the log-marginal likelihood function may not be concave with respect to its parameters. So, there may be many local optima, and gradient-based optimization methods may get stuck in one of them.
In GP regression, choosing a proper covariance function is the main issue. For both examples, we chose our covariance functions heuristically. So, there may exist other covariance functions which make the GP regressor more accurate.
Recall that our GP regression is based on the GP with zero mean function. So, we need to center our training data. In our second example, the training data are centered by subtracting their moving average
\[ \mathrm{MA}(x_i) = \frac{x_{i-[k/2]} + \cdots + x_{i+[k/2]}}{k} \qquad (\text{we use } k = 9), \]
where the window is chosen so that it contains exactly $k$ terms for odd $k$.
There are other centering methods, such as subtracting the sample mean. You may get better results by choosing another window size $k$.
The predictor must not be trained on any information from the validation set.
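The centering step can be sketched in NumPy as follows. The helper `moving_average`, the synthetic series, and the choice to shorten the window at the two ends of the series are assumptions made for illustration; the seminar does not say how the boundaries were handled. The moving average is computed from the training part only, so nothing from the validation part leaks into the centering.

```python
import numpy as np

def moving_average(values, k=9):
    """Centered moving average with odd window size k.

    Near the two ends the window is shortened to the available points
    (an assumed boundary rule).
    """
    values = np.asarray(values, dtype=float)
    half = k // 2
    out = np.empty_like(values)
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out[i] = values[lo:hi].mean()
    return out

rng = np.random.default_rng(0)

# Synthetic series (assumed): linear trend plus a cycle plus noise.
t = np.arange(100)
series = 0.5 * t + 5.0 * np.sin(2 * np.pi * t / 12) + rng.normal(size=100)

# Use only the first 85 percent (the training part) when centering.
n_train = int(0.85 * len(series))
train = series[:n_train]
train_centered = train - moving_average(train, k=9)

print("mean before centering:", train.mean())
print("mean after centering:", train_centered.mean())
```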
Bayesian Model for Classification
Classification
We consider classification problems.
x : data, y : the class label
For the joint distribution $p(y, \mathbf{x})$, Bayes' theorem gives two approaches.
The generative approach models $p(y, \mathbf{x}) = p(\mathbf{x} \mid y)\, p(y)$.
The discriminative approach focuses on modeling $p(y \mid \mathbf{x})$ directly.
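Classification Problems: The generative approach
For $y = C_1, C_2, \dots, C_C$, the posterior probability of each class is
\[ p(y \mid \mathbf{x}) = \frac{p(y, \mathbf{x})}{p(\mathbf{x})} = \frac{p(\mathbf{x} \mid y)\, p(y)}{\sum_{c=1}^{C} p(C_c)\, p(\mathbf{x} \mid C_c)}. \]
A simple and common choice for the class-conditional density is
\[ p(\mathbf{x} \mid C_c) = \mathcal{N}(\boldsymbol{\mu}_c, \Sigma_c). \]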
However, it is unclear whether this choice is appropriate.
Classification Problems: The discriminative approach
Basic idea
\[ \mathbf{x} \xrightarrow{\ \text{GP}\ } f(\mathbf{x}) \xrightarrow{\ \sigma\ } \sigma(f(\mathbf{x})) \]
For the binary case, we usually use a response function $\sigma(z)$ which squashes its argument into $[0, 1]$, guaranteeing a valid probabilistic interpretation.
The response function $\sigma(z)$ can be any sigmoid function. (A sigmoid function is a monotonically increasing function mapping from $\mathbb{R}$ to $[0, 1]$.)
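Two examples of response functions are:
Linear logistic regression model
\[ p(C_1 \mid \mathbf{x}) = \lambda(\mathbf{w}^\top \mathbf{x}), \quad \text{where } \lambda(z) = \frac{1}{1 + \exp(-z)} \]
Probit regression model
\[ p(C_1 \mid \mathbf{x}) = \Phi(\mathbf{w}^\top \mathbf{x}), \quad \text{where } \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx. \]
From now on we only consider the discriminative approach.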
Classification Problems: Decision Theory for Classification
Let $L(c, c')$ be the loss incurred by making decision $c'$ if the true class is $C_c$. Usually $L(c, c') = 0$ when $c = c'$.
Then the expected loss is
\[ R_L(c' \mid \mathbf{x}) = \sum_{c} L(c, c')\, p(C_c \mid \mathbf{x}). \]
The optimal class is
\[ c^* = \operatorname*{argmin}_{c} R_L(c \mid \mathbf{x}). \]
One common choice for the loss function is the zero-one loss, i.e.,
\[ L(c, c') = 1 - \delta_{cc'}. \]
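The decision rule can be written in a few lines. In this sketch the posterior probabilities and the asymmetric loss matrix are made-up numbers chosen for illustration, the helper name `optimal_class` is hypothetical, and classes are indexed from 0. Under the zero-one loss the expected loss is one minus the posterior probability, so the rule reduces to picking the most probable class; an asymmetric loss can change the decision.

```python
import numpy as np

def optimal_class(posterior, loss):
    """Return the decision minimizing the expected loss, and all expected losses.

    posterior[c] = p(C_c | x); loss[c, c_prime] = L(c, c').
    """
    risk = posterior @ loss            # risk[c'] = sum_c L(c, c') p(C_c | x)
    return int(np.argmin(risk)), risk

posterior = np.array([0.5, 0.3, 0.2])  # hypothetical class posteriors

# Zero-one loss: L(c, c') = 1 - delta_{cc'}.
zero_one = 1.0 - np.eye(3)
c_zero_one, risk_zero_one = optimal_class(posterior, zero_one)

# Asymmetric loss (assumed): missing the last class costs 10 instead of 1.
asymmetric = zero_one.copy()
asymmetric[2, :] = [10.0, 10.0, 0.0]
c_asym, risk_asym = optimal_class(posterior, asymmetric)

print("zero-one loss:", c_zero_one, risk_zero_one)
print("asymmetric loss:", c_asym, risk_asym)
```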
Linear Models for Classification
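\[ \mathbf{x} \xrightarrow{\ \mathbf{w} \sim \mathcal{N}\ } \mathbf{w}^\top \mathbf{x} \xrightarrow{\ \sigma\ } \sigma(\mathbf{w}^\top \mathbf{x}) \]
As we did in regression, we start with a linear model, $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$.
Let the labels and the likelihood be
\[ y = \pm 1, \qquad p(y = 1 \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^\top \mathbf{x}), \]
where we use a sigmoid function $\sigma$ as the response function.
When $\sigma(z) = \lambda(z)$, the model is called linear logistic regression. When $\sigma(z) = \Phi(z)$, the model is called linear probit regression.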
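For symmetric $\sigma(z)$, i.e., $\sigma(-z) = 1 - \sigma(z)$,
\[ p(y_i \mid \mathbf{x}_i, \mathbf{w}) = \sigma(y_i f_i), \]
where $f_i = f(\mathbf{x}_i) = \mathbf{w}^\top \mathbf{x}_i$, because
\[ p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = \sigma(\mathbf{w}^\top \mathbf{x}_i) = \sigma(f_i), \]
\[ p(y_i = -1 \mid \mathbf{x}_i, \mathbf{w}) = 1 - \sigma(\mathbf{w}^\top \mathbf{x}_i) = 1 - \sigma(f_i) = \sigma(-f_i). \]
So we can write $p(y \mid \mathbf{x}, \mathbf{w})$ in a single form regardless of the value of $y$.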
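Remark: The logit transformation is defined as
\[ \operatorname{logit}(\mathbf{x}) = \log \frac{p(y = 1 \mid \mathbf{x})}{p(y = -1 \mid \mathbf{x})}. \]
For linear logistic regression, $\operatorname{logit}(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$.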
In binary classification, we want to predict $y$ conditional on $\mathbf{x}$. To do this we assume the existence of a function $a(\mathbf{x})$ which models the logit as a function of $\mathbf{x}$. Thus
\[ P(y = 1 \mid \mathbf{x}, a(\mathbf{x})) = \sigma(a(\mathbf{x})). \]
There are two approaches to complete this model.
One way is to treat $a(\mathbf{x})$ as a parametrized function $a(\mathbf{x}; \mathbf{w})$ and to place a prior on $\mathbf{w}$; this is the approach developed in the following slides.
The other approach is to model $a(\mathbf{x})$ using a Gaussian process, which is omitted here.
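Given a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid i = 1, \dots, n\}$ with inputs $X$ and labels $\mathbf{y}$, we assume that the labels are generated independently, conditional on $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$. Using the Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$,
\[ p(\mathbf{w} \mid X, \mathbf{y}) \propto p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w}) \propto \prod_{i=1}^{n} \sigma(y_i \mathbf{w}^\top \mathbf{x}_i)\, \exp\left(-\frac{1}{2} \mathbf{w}^\top \Sigma_p^{-1} \mathbf{w}\right). \]
So, for some constant $c$,
\[ \log p(\mathbf{w} \mid X, \mathbf{y}) = c - \frac{1}{2} \mathbf{w}^\top \Sigma_p^{-1} \mathbf{w} + \sum_{i=1}^{n} \log \sigma(y_i \mathbf{w}^\top \mathbf{x}_i). \]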
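Linear Models for Classification: Maximum Likelihood
For $\sigma(z) = \lambda(z)$, the log posterior is a concave function of $\mathbf{w}$ for fixed $\mathcal{D}$.
Proposition 1
$f(\mathbf{w}) = -\frac{1}{2} \mathbf{w}^\top \Sigma_p^{-1} \mathbf{w}$ is concave in $\mathbf{w}$.
Proposition 2
$\log \lambda(y \mathbf{w}^\top \mathbf{x})$ is concave in $\mathbf{w}$, where $\lambda(z) = \frac{1}{1 + \exp(-z)}$.
For $\sigma(z) = \Phi(z)$, the log posterior is also concave.
Proposition 3
$\log \Phi(y \mathbf{w}^\top \mathbf{x})$ is concave in $\mathbf{w}$.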
Since the log posterior is a concave function of $\mathbf{w}$, it is relatively easy to find its unique maximum. We can use Newton's method to find the maximum.
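Newton's method for the logistic case can be sketched as below. This is an illustrative implementation under stated assumptions, not the seminar's code: the data are synthetic, the function names are hypothetical, the inputs are stored as rows of `X` (an $n \times D$ matrix, unlike the column convention of the regression slides), and a plain Newton step without line search is used. The gradient uses $\frac{d}{dz}\log\lambda(z) = 1 - \lambda(z)$, and the Hessian uses the symmetry of $\lambda$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_posterior_grad(w, X, y, Sigma_p_inv):
    """Gradient of log p(w | X, y): sum_i y_i (1 - sigma(y_i f_i)) x_i - Sigma_p^{-1} w."""
    f = X @ w
    return X.T @ (y * (1.0 - sigmoid(y * f))) - Sigma_p_inv @ w

def map_logistic_newton(X, y, Sigma_p, n_iter=50, tol=1e-10):
    """MAP weights for linear logistic regression with labels y in {-1, +1}."""
    Sigma_p_inv = np.linalg.inv(Sigma_p)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = log_posterior_grad(w, X, y, Sigma_p_inv)
        s = sigmoid(X @ w)
        # Hessian: -X^T diag(s (1 - s)) X - Sigma_p^{-1}, negative definite.
        hess = -(X.T * (s * (1.0 - s))) @ X - Sigma_p_inv
        step = np.linalg.solve(hess, grad)
        w = w - step
        if np.linalg.norm(step) < tol:
            break
    return w

rng = np.random.default_rng(0)

# Synthetic binary data (assumed): labels from a noisy linear rule.
n, D = 40, 2
X = rng.normal(size=(n, D))
y = np.where(X @ np.array([2.0, -1.0]) + 0.5 * rng.normal(size=n) > 0, 1.0, -1.0)
Sigma_p = np.eye(D)

w_map = map_logistic_newton(X, y, Sigma_p)
grad_at_map = log_posterior_grad(w_map, X, y, np.linalg.inv(Sigma_p))

print("MAP weights:", w_map)
print("gradient norm at the maximum:", np.linalg.norm(grad_at_map))
```

Because the log posterior is strictly concave here, a vanishing gradient identifies the unique maximum. A production implementation would usually add a line search or damping to the Newton step.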
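Linear Models for Classification: Predictions
To make predictions for a test point $\mathbf{x}_*$ based on the training set $\mathcal{D}$, we have
\[ p(y_* = 1 \mid \mathbf{x}_*, \mathcal{D}) = \int p(y_* = 1 \mid \mathbf{w}, \mathbf{x}_*)\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w}. \]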
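In the multi-class case, we use the softmax function
\[ p(y = C_c \mid \mathbf{x}, W) = \frac{\exp(\mathbf{x}^\top \mathbf{w}_c)}{\sum_{c'} \exp(\mathbf{x}^\top \mathbf{w}_{c'})} \]
where $\mathbf{w}_c$ is the weight vector for class $c$, and $W = (\mathbf{w}_1, \dots, \mathbf{w}_C)$.
The corresponding log likelihood is of the form
\[ \sum_{i=1}^{n} \sum_{c=1}^{C} \delta_{c, y_i} \left[\mathbf{x}_i^\top \mathbf{w}_c - \log\left(\sum_{c'} \exp(\mathbf{x}_i^\top \mathbf{w}_{c'})\right)\right] \]
which is also a concave function of $W$.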
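The softmax log likelihood can be evaluated directly from this formula. The sketch below is an assumption-laden illustration: the function name, the synthetic data, the row-wise layout of `X`, and the use of 0-based integer labels are all choices made here. The log of the normalizing sum is computed with the usual max-subtraction trick for numerical stability.

```python
import numpy as np

def softmax_log_likelihood(X, y, W):
    """Log likelihood of the softmax model.

    X: (n, D) inputs as rows; y: (n,) integer labels in {0, ..., C-1};
    W: (D, C) matrix whose columns are the class weight vectors w_c.
    """
    scores = X @ W                                   # scores[i, c] = x_i^T w_c
    m = scores.max(axis=1, keepdims=True)
    log_norm = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))   # log-sum-exp
    return float(np.sum(scores[np.arange(len(y)), y] - log_norm))

rng = np.random.default_rng(0)

# Synthetic data (assumed): n = 30 points, D = 4 features, C = 3 classes.
n, D, C = 30, 4, 3
X = rng.normal(size=(n, D))
y = rng.integers(0, C, size=n)

W_zero = np.zeros((D, C))
W_a = rng.normal(size=(D, C))
W_b = rng.normal(size=(D, C))

ll_zero = softmax_log_likelihood(X, y, W_zero)
ll_a = softmax_log_likelihood(X, y, W_a)
ll_b = softmax_log_likelihood(X, y, W_b)
ll_mid = softmax_log_likelihood(X, y, 0.5 * (W_a + W_b))

print("log likelihood at W = 0:", ll_zero, "expected -n log C =", -n * np.log(C))
print("midpoint check:", ll_mid, ">=", 0.5 * (ll_a + ll_b))
```

With all weights equal to zero every class has probability $1/C$, so the log likelihood is $-n \log C$. The midpoint comparison is a spot check of concavity along one line segment, not a proof.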
In Gaussian process classification, we place a Gaussian process prior on the latent function, and there are two common approaches to approximate inference:
Laplace Approximation
Expectation Propagation
For the details, please refer to
C.E. Rasmussen and C.K.I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.
Classification Examples
Example - Introduction
There are two examples which use Gaussian processes for classification.
The first one is just a toy problem from pages 60 and 61 of our textbook.
The second one is a multi-class classification problem which classifies the types of iris. The training data for the second example are the iris data set, which is commonly used to test the performance of classifiers.
Python 3.5 is also used to implement the examples. We use the following libraries:
1. NumPy: It makes vector computation in Python easy.
2. scikit-learn (sklearn) GaussianProcessClassifier: It performs GP classification by optimizing the log-marginal likelihood with respect to the hyper-parameters. It uses the logistic response function and the Laplace approximation to handle the non-Gaussian likelihood.
A Toy Problem
15 data points are randomly chosen in the square $[0, 1]^2$. The 2 classes are labeled as x ($-1$) and o ($+1$). We use the squared exponential covariance function
\[ k(\mathbf{x}_i, \mathbf{x}_j) = \sigma^2 \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2l^2}\right). \]
The following figure shows the contour plots of the predictive probability $\mathbb{E}_q[\pi(\mathbf{x}_*) \mid \mathbf{f}]$ and the training data points.
Figure: Classification results
Here, the test data are all points in $[0, 1]^2$.
The value on each contour denotes the predictive probability of class $+1$.
As in GP regression, hyper-parameter tuning is done by optimizing the approximate log-marginal likelihood function.
The above figure is obtained from the optimized kernel
\[ k(\mathbf{x}_i, \mathbf{x}_j) = 1.64^2 \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2(0.116)^2}\right) \]
with the log-marginal likelihood value $-10.142$.
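A toy problem of this kind can be set up with scikit-learn's `GaussianProcessClassifier`. The sketch below is not the seminar's data or script: the 15 points are drawn with an arbitrary seed, the labelling rule (splitting at the median of $x_1 + x_2$) is invented so that both classes are present, and the initial hyper-parameters are arbitrary. The optimized kernel and log-marginal likelihood will therefore differ from the values quoted above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

rng = np.random.RandomState(1)

# 15 random points in the unit square with a hypothetical labelling rule.
X_train = rng.uniform(0.0, 1.0, size=(15, 2))
s = X_train.sum(axis=1)
y_train = np.where(s > np.median(s), 1, -1)

# sigma^2 * exp(-||x_i - x_j||^2 / (2 l^2)); initial values are arbitrary.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)

# fit() uses the Laplace approximation and maximizes the approximate
# log-marginal likelihood over the kernel hyper-parameters.
gpc = GaussianProcessClassifier(kernel=kernel, random_state=0)
gpc.fit(X_train, y_train)

# Test data: a grid covering [0, 1]^2.
xx, yy = np.meshgrid(np.linspace(0.0, 1.0, 50), np.linspace(0.0, 1.0, 50))
X_grid = np.column_stack([xx.ravel(), yy.ravel()])
proba = gpc.predict_proba(X_grid)

# Column of predict_proba that corresponds to class +1.
plus_col = list(gpc.classes_).index(1)
p_plus = proba[:, plus_col].reshape(xx.shape)

print("optimized kernel:", gpc.kernel_)
print("approximate log-marginal likelihood:", gpc.log_marginal_likelihood_value_)
```

Passing `xx`, `yy`, and `p_plus` to a contour plotting routine would produce a picture of the same kind as the classification figure.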
Classification of iris data: Introduction
This example deals with real data on the types of irises. There are 3 main types of irises: setosa, versicolor, and virginica. The iris data set consists of 150 samples of irises with their types and the lengths and widths of their sepals and petals. The 3 classes, 'setosa', 'versicolor', and 'virginica', are labeled as red, green, and blue.
We want to classify a given iris from the lengths and widths of its sepal and petal using GP classification.
Since each sample has 4 features (the lengths and widths of the sepal and petal), it is hard to visualize the results. So, we select 2 features: the widths and lengths of the sepal.
Next, to test our classifier, we divide the data set into 2 parts of size 100 and 50. 100 samples are used to train our model, and 50 samples are used for validation.
We again use the squared exponential covariance function with a noise term
\[ k_y(\mathbf{x}_i, \mathbf{x}_j) = \sigma_1^2 \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2l^2}\right) + \sigma_2^2 \delta_{ij}. \]
The following figure shows the locations of the training data and the classification results with respect to the widths and lengths of sepals.
Figure: Classification with the widths and lengths of sepals
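Here, the test data are all points in the 2-D box
\[ \left[\min_{\text{training data}}(\text{sepal length}) - 1,\; \max_{\text{training data}}(\text{sepal length}) + 1\right] \times \left[\min(\text{sepal width}) - 1,\; \max(\text{sepal width}) + 1\right]. \]
The above figure is obtained from the optimized kernel
\[ k_y(\mathbf{x}_i, \mathbf{x}_j) = (8.84)^2 \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2(2.71)^2}\right) + (2.66)^2 \delta_{ij} \]
with log-marginal likelihood value $-31.892$.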
From the above figure, we can say that our GP classifier works well for classifying setosa. However, it does not work well for separating versicolor from virginica. Actually, its average error rate on the validation set is 20.5 percent. Does this mean that our GP classification works poorly? Let's consider the following figure.
Figure: Comparison between two different feature sets
The right panel shows the locations of the training data with respect to the widths and lengths of petals. Comparing the two panels, we can observe that the widths and lengths of petals are a better feature set for learning the GP classifier. By using the new features we get better results, which are provided below.
Figure: Classification with the widths and lengths of petals
The test data are all points in the 2-D box
\[ \left[\min_{\text{training data}}(\text{petal length}) - 1,\; \max_{\text{training data}}(\text{petal length}) + 1\right] \times \left[\min(\text{petal width}) - 1,\; \max(\text{petal width}) + 1\right]. \]
The above figure is obtained from the optimized kernel
\[ k_y(\mathbf{x}_i, \mathbf{x}_j) = (7.77)^2 \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2(4.49)^2}\right) + (5.55)^2 \delta_{ij} \]
with log-marginal likelihood value $-10.560$, which is much larger than in the previous case. Moreover, its average error rate on the validation set is only 4 percent.
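The comparison between the two feature sets can be reproduced in outline with scikit-learn. The sketch below is an approximation under stated assumptions: the 100/50 split is a random stratified split with an arbitrary seed, the helper `validation_error` is hypothetical, and the multi-class problem is handled by scikit-learn's default one-versus-rest scheme, for which `log_marginal_likelihood_value_` is an average over the binary classifiers. The error rates and likelihood values will therefore not match the quoted numbers exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel
from sklearn.model_selection import train_test_split

iris = load_iris()
# Columns: sepal length, sepal width, petal length, petal width.
X_all, y_all = iris.data, iris.target

def validation_error(feature_columns, seed=0):
    """Train on 100 samples, validate on 50, using only the given feature columns."""
    X = X_all[:, feature_columns]
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y_all, train_size=100, test_size=50, random_state=seed, stratify=y_all)
    # Squared exponential kernel plus a noise term; initial values are arbitrary.
    kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
    gpc = GaussianProcessClassifier(kernel=kernel, random_state=seed)
    gpc.fit(X_tr, y_tr)
    error = 1.0 - gpc.score(X_va, y_va)
    return error, gpc.log_marginal_likelihood_value_

sepal_error, sepal_lml = validation_error([0, 1])
petal_error, petal_lml = validation_error([2, 3])

print("sepal features: error", sepal_error, "log-marginal likelihood", sepal_lml)
print("petal features: error", petal_error, "log-marginal likelihood", petal_lml)
```

Averaging the error over several seeds would give a closer analogue of the average error rates reported in the slides.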
References
The iris data: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris
K.P. Murphy, Machine Learning, The MIT Press, 2012.
C.E. Rasmussen and C.K.I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.