Algorithms
This section contains concise descriptions of almost all of the models and algorithms in this book. This includes additional details, variations of algorithms, and implementation concerns that were omitted from the main text to improve readability. The goal is to provide sufficient information to implement a naive version of each method, and the reader is encouraged to do exactly this.
WARNING! These algorithms have not been checked very well. I'm looking for volunteers to help me with this - please mail [email protected] if you can help. In the meantime, treat them with suspicion and send me any problems you find.
Copyright ©2011 by Simon Prince; to be published by Cambridge University Press 2012. For personal use only, not for distribution.
0.1 Fitting probability distributions
0.1.1 ML learning of Bernoulli parameters
The Bernoulli distribution is a probability model suitable for describing discrete binary data x ∈ {0, 1}. It has pdf

Pr(x) = λ^x (1 − λ)^(1−x),

where the parameter λ ∈ [0, 1] denotes the probability of success.
Algorithm 1: Maximum likelihood learning for Bernoulli distribution

Input : Binary training data {x_i}_{i=1}^I
Output: ML estimate of Bernoulli parameter θ = λ
begin
  λ = Σ_{i=1}^I x_i / I
end
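As a concrete sketch, Algorithm 1 reduces to a single sample mean. Python/NumPy is assumed and the function name is my own:

```python
import numpy as np

def bernoulli_ml(x):
    """ML estimate of the Bernoulli parameter: the fraction of successes."""
    x = np.asarray(x)
    return x.sum() / len(x)   # lambda = sum(x_i) / I
```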
0.1.2 MAP learning of Bernoulli parameters
The conjugate prior to the Bernoulli distribution is the Beta distribution,
Pr(λ) = (Γ[α + β] / (Γ[α] Γ[β])) λ^(α−1) (1 − λ)^(β−1),

where Γ[•] is the Gamma function and α, β are hyperparameters.
Algorithm 2: MAP learning for Bernoulli distribution with conjugate prior

Input : Binary training data {x_i}_{i=1}^I, Hyperparameters α, β
Output: MAP estimate of parameter θ = λ
begin
  λ = (Σ_{i=1}^I x_i + α − 1) / (α + β + I − 2)
end
0.1.3 Bayesian approach to Bernoulli distribution
Algorithm 3: Predictive distribution for Bernoulli fit (Bayesian)

Input : Binary training data {x_i}_{i=1}^I, Hyperparameters α, β
Output: Posterior parameters α, β, predictive distribution Pr(x*|x_{1...I})
begin
  // Compute Beta posterior over λ
  α = α + Σ_{i=1}^I x_i
  β = β + I − Σ_{i=1}^I x_i
  // Evaluate new datapoint under predictive distribution
  Pr(x* = 1|x_{1...I}) = α / (α + β)
  Pr(x* = 0|x_{1...I}) = 1 − Pr(x* = 1|x_{1...I})
end
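A minimal sketch of Algorithm 3 in Python/NumPy (function and variable names are my own):

```python
import numpy as np

def bernoulli_bayes(x, alpha, beta):
    """Beta posterior update followed by the Bernoulli predictive distribution."""
    x = np.asarray(x)
    alpha_post = alpha + x.sum()           # alpha + sum(x_i)
    beta_post = beta + len(x) - x.sum()    # beta + I - sum(x_i)
    p1 = alpha_post / (alpha_post + beta_post)   # Pr(x* = 1 | x_1...I)
    return alpha_post, beta_post, p1
```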
0.1.4 ML learning of univariate normal parameters
The univariate normal distribution is a probability density model suitable for describing continuous data x in one dimension. It has pdf

Pr(x) = (1/√(2πσ²)) exp[−0.5 (x − µ)²/σ²],

where the parameter µ denotes the mean and σ² denotes the variance.
Algorithm 4: Maximum likelihood learning for normal distribution

Input : Training data {x_i}_{i=1}^I
Output: Maximum likelihood estimates of parameters θ = {µ, σ²}
begin
  // Set mean parameter
  µ = Σ_{i=1}^I x_i / I
  // Set variance
  σ² = Σ_{i=1}^I (x_i − µ)² / I
end
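Algorithm 4 sketched in Python/NumPy (the function name is my own; note the ML estimator divides by I, not I − 1):

```python
import numpy as np

def normal_ml(x):
    """ML estimates of mean and variance for the univariate normal."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    var = ((x - mu) ** 2).mean()   # divides by I, not I - 1
    return mu, var
```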
0.1.5 MAP learning of univariate normal parameters
The conjugate prior to the normal distribution is the normal-scaled inverse gamma, which has pdf

Pr(µ, σ²) = (√γ / (σ√(2π))) (β^α / Γ(α)) (1/σ²)^(α+1) exp[−(2β + γ(δ − µ)²) / (2σ²)],

with hyperparameters α, β, γ > 0 and δ ∈ (−∞, ∞).
Algorithm 5: MAP learning for normal distribution with conjugate prior

Input : Training data {x_i}_{i=1}^I, Hyperparameters α, β, γ, δ
Output: MAP estimates of parameters θ = {µ, σ²}
begin
  // Set mean parameter
  µ = (Σ_{i=1}^I x_i + γδ) / (I + γ)
  // Set variance
  σ² = (Σ_{i=1}^I (x_i − µ)² + 2β + γ(δ − µ)²) / (I + 3 + 2α)
end
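Algorithm 5 as a Python/NumPy sketch (names are my own):

```python
import numpy as np

def normal_map(x, alpha, beta, gamma, delta):
    """MAP estimates under a normal-scaled inverse gamma prior."""
    x = np.asarray(x, dtype=float)
    I = len(x)
    mu = (x.sum() + gamma * delta) / (I + gamma)
    var = (((x - mu) ** 2).sum() + 2 * beta + gamma * (delta - mu) ** 2) \
          / (I + 3 + 2 * alpha)
    return mu, var
```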
0.1.6 Bayesian approach to univariate normal distribution
In the Bayesian approach to the univariate normal distribution we again use a normal-scaled inverse gamma prior. In the learning stage we compute a probability distribution over the mean and variance parameters. The predictive distribution for a new datum is based on all possible values of these parameters.
Algorithm 6: Bayesian approach to normal distribution

Input : Training data {x_i}_{i=1}^I, Hyperparameters α, β, γ, δ, Test data x*
Output: Posterior parameters α̃, β̃, γ̃, δ̃, predictive distribution Pr(x*|x_{1...I})
begin
  // Compute normal-scaled inverse gamma posterior over parameters from training data
  α̃ = α + I/2
  β̃ = Σ_i x_i²/2 + β + γδ²/2 − (γδ + Σ_i x_i)² / (2γ + 2I)
  γ̃ = γ + I
  δ̃ = (γδ + Σ_i x_i) / (γ + I)
  // Compute intermediate parameters
  ᾰ = α̃ + 1/2
  β̆ = x*²/2 + β̃ + γ̃δ̃²/2 − (γ̃δ̃ + x*)² / (2γ̃ + 2)
  γ̆ = γ̃ + 1
  // Evaluate new datapoint under predictive distribution
  Pr(x*|x_{1...I}) = (√γ̃ β̃^α̃ Γ[ᾰ]) / (√(2π) √γ̆ β̆^ᾰ Γ[α̃])
end
0.1.7 ML learning of multivariate normal parameters
The multivariate normal distribution is a probability density model suitable for describing continuous data x in D dimensions. It has pdf

Pr(x) = (1/((2π)^(D/2) |Σ|^(1/2))) exp[−0.5 (x − µ)^T Σ^(−1) (x − µ)],

where µ denotes the mean vector and Σ denotes the covariance matrix.
Algorithm 7: Maximum likelihood learning for multivariate normal

Input : Training data {x_i}_{i=1}^I
Output: Maximum likelihood estimates of parameters θ = {µ, Σ}
begin
  // Set mean parameter
  µ = Σ_{i=1}^I x_i / I
  // Set covariance
  Σ = Σ_{i=1}^I (x_i − µ)(x_i − µ)^T / I
end
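Algorithm 7 in Python/NumPy (the function name is my own; datapoints are stored as rows here):

```python
import numpy as np

def mvn_ml(X):
    """ML mean and covariance; X is an I x D array with datapoints as rows."""
    mu = X.mean(axis=0)
    diff = X - mu
    Sigma = diff.T @ diff / X.shape[0]   # sum of outer products / I
    return mu, Sigma
```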
0.1.8 MAP learning of multivariate normal parameters
The conjugate prior to the multivariate normal distribution is the normal inverse Wishart

Pr(µ, Σ) = (|Ψ|^(α/2) |Σ|^(−(α+D+2)/2) exp[−0.5 (Tr(ΨΣ^(−1)) + γ(µ − δ)^T Σ^(−1) (µ − δ))]) / (2^(αD/2) (2π)^(D/2) Γ_D[α/2]),

with hyperparameters α, Ψ, γ, and δ.
Algorithm 8: MAP learning for multivariate normal with conjugate prior

Input : Training data {x_i}_{i=1}^I, Hyperparameters α, Ψ, γ, δ
Output: MAP estimates of parameters θ = {µ, Σ}
begin
  // Compute posterior parameters
  α̃ = α + I
  Ψ̃ = Ψ + γδδ^T + Σ_{i=1}^I x_i x_i^T − (γδ + Σ_i x_i)(γδ + Σ_i x_i)^T / (γ + I)
  γ̃ = γ + I
  δ̃ = (Σ_{i=1}^I x_i + γδ) / (I + γ)
  // Set mean and covariance
  µ = δ̃
  Σ = Ψ̃ / (α̃ + D + 2)
end
0.1.9 Bayesian approach to multivariate normal distribution
In the Bayesian approach to the multivariate normal distribution we again use a normal inverse Wishart prior. In the learning stage we compute a probability distribution over the mean and covariance parameters. The predictive distribution for a new datum is based on all possible values of these parameters.
Algorithm 9: Bayesian approach to multivariate normal distribution

Input : Training data {x_i}_{i=1}^I, Hyperparameters α, Ψ, γ, δ, Test data x*
Output: Posterior parameters α̃, Ψ̃, γ̃, δ̃, predictive distribution Pr(x*|x_{1...I})
begin
  // Compute normal inverse Wishart posterior over parameters
  α̃ = α + I
  Ψ̃ = Ψ + γδδ^T/2 + Σ_{i=1}^I x_i x_i^T/2 − (γδ + Σ_i x_i)(γδ + Σ_i x_i)^T / (2γ + 2I)
  γ̃ = γ + I
  δ̃ = (Σ_{i=1}^I x_i + γδ) / (I + γ)
  // Compute intermediate parameters
  ᾰ = α̃ + 1
  Ψ̆ = Ψ̃ + γ̃δ̃δ̃^T + x* x*^T − (γ̃δ̃ + x*)(γ̃δ̃ + x*)^T / (γ̃ + 1)
  γ̆ = γ̃ + 1
  // Evaluate new datapoint under predictive distribution
  Pr(x*|x_{1...I}) = (|Ψ̃|^(α̃/2) Γ_D[ᾰ/2]) / (π^(D/2) |Ψ̆|^(ᾰ/2) Γ_D[α̃/2])
end
0.1.10 ML learning of categorical parameters
The categorical distribution is a probability model suitable for describing discrete multi-valued data x ∈ {1, 2, . . . , K}. It has pdf

Pr(x = k) = λ_k,

where the parameter λ_k denotes the probability of observing category k.
Algorithm 10: Maximum likelihood learning for categorical distribution

Input : Multi-valued training data {x_i}_{i=1}^I
Output: ML estimates of categorical parameters θ = {λ_1 . . . λ_K}
begin
  for k = 1 to K do
    λ_k = Σ_{i=1}^I δ[x_i − k] / I
  end
end
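Algorithm 10 as a Python/NumPy sketch (the function name is my own; labels are assumed to be in {1..K}):

```python
import numpy as np

def categorical_ml(x, K):
    """ML estimate of lambda_1..lambda_K: per-category label frequencies."""
    x = np.asarray(x)
    return np.array([(x == k).mean() for k in range(1, K + 1)])
```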
0.1.11 MAP learning of categorical parameters
The conjugate prior to the categorical distribution is the Dirichlet distribution,

Pr(λ_1 . . . λ_K) = (Γ[Σ_{k=1}^K α_k] / Π_{k=1}^K Γ[α_k]) Π_{k=1}^K λ_k^(α_k − 1),    (1)

where Γ[•] is the Gamma function and {α_k}_{k=1}^K are hyperparameters.
Algorithm 11: MAP learning for categorical distribution with conjugate prior

Input : Categorical training data {x_i}_{i=1}^I, Hyperparameters {α_k}_{k=1}^K
Output: MAP estimates of parameters θ = {λ_k}_{k=1}^K
begin
  for k = 1 to K do
    λ_k = (Σ_{i=1}^I δ[x_i − k] + α_k − 1) / (I + Σ_{m=1}^K α_m − K)
  end
end
0.1.12 Bayesian approach to categorical distribution
Algorithm 12: Bayesian approach to categorical distribution

Input : Categorical training data {x_i}_{i=1}^I, Hyperparameters {α_k}_{k=1}^K
Output: Posterior parameters {α_k}_{k=1}^K, predictive distribution Pr(x*|x_{1...I})
begin
  // Compute Dirichlet posterior over λ
  for k = 1 to K do
    α_k = α_k + Σ_{i=1}^I δ[x_i − k]
  end
  // Evaluate new datapoint under predictive distribution
  for k = 1 to K do
    Pr(x* = k|x_{1...I}) = α_k / Σ_{m=1}^K α_m
  end
end
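Algorithm 12 sketched in Python/NumPy (names are my own):

```python
import numpy as np

def categorical_bayes(x, alpha):
    """Dirichlet posterior update and categorical predictive distribution.
    x holds labels in {1..K}; alpha is the length-K hyperparameter vector."""
    x = np.asarray(x)
    counts = np.array([(x == k).sum() for k in range(1, len(alpha) + 1)])
    alpha_post = np.asarray(alpha, dtype=float) + counts
    predictive = alpha_post / alpha_post.sum()
    return alpha_post, predictive
```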
0.2 Machine learning for machine vision
0.2.1 Basic generative classifier
Consider the situation where we wish to assign a label w ∈ {1, 2, . . . , K} based on an observed multivariate measurement vector x. We model the class conditional density functions as normal distributions so that

Pr(x_i|w_i = k) = Norm_{x_i}[µ_k, Σ_k]    (2)

with prior probabilities over the world state defined by

Pr(w_i) = Cat_{w_i}[λ].    (3)
Algorithm 13: Basic generative classifier

Input : Training data {x_i, w_i}_{i=1}^I, new data example x*
Output: ML parameters θ = {λ_{1...K}, µ_{1...K}, Σ_{1...K}}, posterior probability Pr(w*|x*)
begin
  // Learning of model
  for k = 1 to K do
    // Set mean
    µ_k = Σ_{i=1}^I x_i δ[w_i − k] / Σ_{i=1}^I δ[w_i − k]
    // Set covariance
    Σ_k = Σ_{i=1}^I (x_i − µ_k)(x_i − µ_k)^T δ[w_i − k] / Σ_{i=1}^I δ[w_i − k]
    // Set prior
    λ_k = Σ_{i=1}^I δ[w_i − k] / I
  end
  // Compute likelihoods for new datapoint
  for k = 1 to K do
    l_k = Norm_{x*}[µ_k, Σ_k]
  end
  // Classify new datapoint
  for k = 1 to K do
    Pr(w* = k|x*) = l_k λ_k / Σ_{m=1}^K l_m λ_m
  end
end
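Algorithm 13 sketched in Python/NumPy (function names are my own; datapoints are rows of X and labels lie in {1..K}):

```python
import numpy as np

def fit_generative_classifier(X, w, K):
    """Per-class normal fits plus a categorical prior; X is I x D."""
    params = []
    for k in range(1, K + 1):
        Xk = X[w == k]
        mu = Xk.mean(axis=0)
        diff = Xk - mu
        Sigma = diff.T @ diff / len(Xk)
        lam = len(Xk) / len(X)
        params.append((lam, mu, Sigma))
    return params

def classify(params, x_star):
    """Posterior Pr(w* = k | x*) via Bayes' rule over the class likelihoods."""
    def norm_pdf(x, mu, Sigma):
        D = len(mu)
        diff = x - mu
        e = -0.5 * diff @ np.linalg.solve(Sigma, diff)
        return np.exp(e) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    scores = np.array([lam * norm_pdf(x_star, mu, S) for lam, mu, S in params])
    return scores / scores.sum()
```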
0.3 Fitting complex densities
0.3.1 Mixture of Gaussians
The mixture of Gaussians (MoG) is a probability density model suitable for data x in D dimensions. The data is described as a weighted sum of K normal distributions

Pr(x|θ) = Σ_{k=1}^K λ_k Norm_x[µ_k, Σ_k],

where µ_{1...K} and Σ_{1...K} are the means and covariances of the normal distributions and λ_{1...K} are positive valued weights that sum to one. The MoG is fit using the EM algorithm.
Algorithm 14: Maximum likelihood learning for mixtures of Gaussians

Input : Training data {x_i}_{i=1}^I, number of clusters K
Output: ML estimates of parameters θ = {λ_{1...K}, µ_{1...K}, Σ_{1...K}}
begin
  Initialize θ = θ^0 (a)
  repeat
    // Expectation step
    for i = 1 to I do
      for k = 1 to K do
        l_ik = λ_k Norm_{x_i}[µ_k, Σ_k]   // numerator of Bayes' rule
      end
      // Compute posterior (responsibilities) by normalizing
      for k = 1 to K do
        r_ik = l_ik / Σ_{m=1}^K l_im
      end
    end
    // Maximization step (b)
    for k = 1 to K do
      λ_k^[t+1] = Σ_{i=1}^I r_ik / (Σ_{m=1}^K Σ_{i=1}^I r_im)
      µ_k^[t+1] = Σ_{i=1}^I r_ik x_i / (Σ_{i=1}^I r_ik)
      Σ_k^[t+1] = Σ_{i=1}^I r_ik (x_i − µ_k^[t+1])(x_i − µ_k^[t+1])^T / (Σ_{i=1}^I r_ik)
    end
    // Compute data log likelihood and EM bound
    L = Σ_{i=1}^I log[Σ_{k=1}^K λ_k Norm_{x_i}[µ_k, Σ_k]]
    B = Σ_{i=1}^I Σ_{k=1}^K r_ik log[λ_k Norm_{x_i}[µ_k, Σ_k] / r_ik]
  until no further improvement in L
end

(a) One possibility is to set the weights λ• = 1/K, the means µ• to the values of K randomly chosen datapoints, and the variances Σ• to the variance of the whole dataset.
(b) For a diagonal covariance retain only the diagonal of the Σ_k update.
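A compact NumPy sketch of Algorithm 14 (names and the fixed iteration count are my own choices; a production version would monitor L for convergence rather than running a fixed number of iterations):

```python
import numpy as np

def mog_em(X, K, n_iter=40, seed=0):
    """EM for a mixture of Gaussians. X is I x D. Initialization follows
    footnote (a): uniform weights, K random datapoints as means, and the
    covariance of the whole dataset for every component."""
    rng = np.random.default_rng(seed)
    I, D = X.shape
    lam = np.full(K, 1.0 / K)
    mu = X[rng.choice(I, K, replace=False)].copy()
    Sigma = np.array([np.cov(X.T, bias=True).reshape(D, D)] * K)

    def norm_pdf(X, m, S):
        diff = X - m
        Sinv = np.linalg.inv(S)
        e = -0.5 * np.einsum('id,de,ie->i', diff, Sinv, diff)
        return np.exp(e) / np.sqrt((2 * np.pi) ** D * np.linalg.det(S))

    for _ in range(n_iter):
        # E-step: responsibilities r_ik by normalizing Bayes' rule numerators
        lik = np.stack([lam[k] * norm_pdf(X, mu[k], Sigma[k])
                        for k in range(K)], axis=1)
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: reweighted mixture weights, means and covariances
        Nk = r.sum(axis=0)
        lam = Nk / Nk.sum()
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
    L = np.log(lik.sum(axis=1)).sum()   # data log likelihood
    return lam, mu, Sigma, L
```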
0.3.2 t-distribution
The t-distribution is a robust (long-tailed) distribution with pdf

Pr(x) = (Γ((ν + D)/2) / ((νπ)^(D/2) |Σ|^(1/2) Γ(ν/2))) (1 + (x − µ)^T Σ^(−1) (x − µ)/ν)^(−(ν+D)/2).

We use the EM algorithm to fit the parameters θ = {µ, Σ, ν} of the t-distribution.
Algorithm 15: Maximum likelihood learning for t-distribution

Input : Training data {x_i}_{i=1}^I
Output: Maximum likelihood estimates of parameters θ = {µ, Σ, ν}
begin
  Initialize θ = θ^0 (a)
  repeat
    // Expectation step
    for i = 1 to I do
      δ_i = (x_i − µ)^T Σ^(−1) (x_i − µ)
      E[h_i] = (ν + D)/(ν + δ_i)
      E[log h_i] = Ψ[ν/2 + D/2] − log[ν/2 + δ_i/2]
    end
    // Maximization step
    µ = Σ_{i=1}^I E[h_i] x_i / Σ_{i=1}^I E[h_i]
    Σ = Σ_{i=1}^I E[h_i](x_i − µ)(x_i − µ)^T / Σ_{i=1}^I E[h_i]
    ν = optimize[tCost[ν, {E[h_i], E[log h_i]}_{i=1}^I], ν]
    // Compute data log likelihood
    for i = 1 to I do
      δ_i = (x_i − µ)^T Σ^(−1) (x_i − µ)
    end
    L = I log[Γ[(ν + D)/2]] − I(D/2) log[νπ] − I log[|Σ|]/2 − I log[Γ[ν/2]]
    L = L − Σ_{i=1}^I ((ν + D)/2) log[1 + δ_i/ν]
  until no further improvement in L
end

(a) One possibility is to initialize the parameters µ and Σ to the mean and variance of the data and set the initial degrees of freedom to a large value, say ν = 1000.
The optimization of the degrees of freedom ν uses the criterion

tCost[ν, {E[h_i], E[log h_i]}_{i=1}^I] = Σ_{i=1}^I ((ν/2) log[ν/2] − log[Γ[ν/2]] + (ν/2 − 1) E[log h_i] − (ν/2) E[h_i]).
0.3.3 Factor analyzer
The factor analyzer is a probability density model suitable for data x in D dimensions. It has pdf

Pr(x_i|θ) = Norm_{x_i}[µ, ΦΦ^T + Σ],

where µ is a D×1 mean vector, Φ is a D×K matrix containing the K factors {φ_k}_{k=1}^K in its columns, and Σ is a diagonal matrix of size D×D. The factor analyzer is fit using the EM algorithm.
Algorithm 16: Maximum likelihood learning for factor analyzer

Input : Training data {x_i}_{i=1}^I, number of factors K
Output: Maximum likelihood estimates of parameters θ = {µ, Φ, Σ}
begin
  Initialize θ = θ^0 (a)
  // Set mean
  µ = Σ_{i=1}^I x_i / I
  repeat
    // Expectation step
    for i = 1 to I do
      E[h_i] = (Φ^T Σ^(−1) Φ + I)^(−1) Φ^T Σ^(−1) (x_i − µ)
      E[h_i h_i^T] = (Φ^T Σ^(−1) Φ + I)^(−1) + E[h_i] E[h_i]^T
    end
    // Maximization step
    Φ = (Σ_{i=1}^I (x_i − µ) E[h_i]^T)(Σ_{i=1}^I E[h_i h_i^T])^(−1)
    Σ = (1/I) Σ_{i=1}^I diag[(x_i − µ)(x_i − µ)^T − Φ E[h_i](x_i − µ)^T]
    // Compute data log likelihood (b)
    L = Σ_{i=1}^I log[Norm_{x_i}[µ, ΦΦ^T + Σ]]
  until no further improvement in L
end

(a) It is usual to initialize Φ to random values. The D diagonal elements of Σ can be initialized to the variances of the D data dimensions.
(b) In high dimensions it is worth reformulating this covariance using the Woodbury relation (section ??).
0.4 Regression models
0.4.1 Linear regression model
The linear regression model describes the world y as a normal distribution. The mean of this distribution is a linear function φ_0 + φ^T x and the variance is constant. In practice we add a 1 to the start of every data vector x_i ← [1 x_i^T]^T and attach the y-intercept φ_0 to the start of the gradient vector φ ← [φ_0 φ^T]^T and write

Pr(y_i|x_i, θ) = Norm_{y_i}[φ^T x_i, σ²].

To learn the model, we will work with the matrix X = [x_1, x_2 . . . x_I] which contains all of the training data examples in its columns and the world vector y = [y_1, y_2 . . . y_I]^T which contains the training world states.
Algorithm 17: Maximum likelihood learning for linear regression

Input : (D + 1)×I data matrix X, I×1 world vector y
Output: Maximum likelihood estimates of parameters θ = {φ, σ²}
begin
  // Set gradient parameter
  φ = (XX^T)^(−1) X y
  // Set variance parameter
  σ² = (y − X^T φ)^T (y − X^T φ) / I
end
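Algorithm 17 sketched in Python/NumPy (the function name is my own; np.linalg.solve is used instead of forming the explicit inverse):

```python
import numpy as np

def linreg_ml(X, y):
    """X is (D+1) x I with a leading row of ones; y has length I."""
    phi = np.linalg.solve(X @ X.T, X @ y)   # phi = (X X^T)^(-1) X y
    resid = y - X.T @ phi
    var = resid @ resid / len(y)            # sigma^2 = ||y - X^T phi||^2 / I
    return phi, var
```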
0.4.2 Bayesian linear regression
To implement the Bayesian version we define a prior

Pr(φ) = Norm_φ[0, σ²_p I],

which contains one hyperparameter σ²_p that determines the prior variance.
Algorithm 18: Bayesian formulation of linear regression

Input : (D + 1)×I data matrix X, I×1 world vector y, Hyperparameter σ²_p
Output: Distribution Pr(y*|x*) over world given new data example x*
begin
  // Fit variance parameter σ² with line search
  σ² = argmax_{σ²} Norm_y[0, σ²_p X^T X + σ² I]
  // Compute variance of posterior over φ
  A^(−1) = σ²_p I − σ²_p X (X^T X + (σ²/σ²_p) I)^(−1) X^T
  // Compute mean of prediction for new example x*
  µ_{y*|x*} = x*^T A^(−1) X y / σ²
  // Compute variance of prediction for new example x*
  σ²_{y*|x*} = x*^T A^(−1) x* + σ²
end
0.4.3 Bayesian non-linear regression (Gaussian process regression)
Algorithm 19: Gaussian process regression

Input : (D + 1)×I data matrix X, I×1 world vector y, Hyperparameter σ²_p, kernel function K[•, •]
Output: Distribution Pr(y*|x*) over world given new data example x*
begin
  // Fit variance parameter σ² with line search
  σ² = argmax_{σ²} Norm_y[0, σ²_p K[X, X] + σ² I]
  // Compute inverse term
  A^(−1) = (K[X, X] + (σ²/σ²_p) I)^(−1)
  // Compute mean of prediction for new example x*
  µ_{y*|x*} = (σ²_p/σ²) K[x*, X] y − (σ²_p/σ²) K[x*, X] A^(−1) K[X, X] y
  // Compute variance of prediction for new example x*
  σ²_{y*|x*} = σ²_p K[x*, x*] − σ²_p K[x*, X] A^(−1) K[X, x*] + σ²
end
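A sketch of the prediction step of Algorithm 19 in Python/NumPy (names are my own; the line search over σ² is omitted and σ² is passed in, so treat this as illustrative rather than complete):

```python
import numpy as np

def gp_predict(X, y, x_star, sig2_p, sig2, kernel):
    """Mean and variance of Pr(y*|x*); X is (D+1) x I with examples as columns."""
    I = X.shape[1]
    K_XX = np.array([[kernel(X[:, i], X[:, j]) for j in range(I)]
                     for i in range(I)])
    k_star = np.array([kernel(x_star, X[:, i]) for i in range(I)])
    A = K_XX + (sig2 / sig2_p) * np.eye(I)   # A^(-1) applied via solve below
    mu = (sig2_p / sig2) * (k_star @ y - k_star @ np.linalg.solve(A, K_XX @ y))
    var = (sig2_p * kernel(x_star, x_star)
           - sig2_p * k_star @ np.linalg.solve(A, k_star) + sig2)
    return mu, var
```

With a small noise variance the prediction at a training input reproduces the training target, which is a useful sanity check.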
0.4.4 Sparse linear regression
Algorithm 20: Sparse linear regression

Input : (D + 1)×I data matrix X, I×1 world vector w, degrees of freedom ν
Output: Distribution Pr(w*|x*) over world given new data example x*
begin
  // Initialize variables
  H = diag[1, 1, . . . , 1]
  for t = 1 to T do
    // Maximize marginal likelihood w.r.t. variance parameter σ² with line search
    σ² = argmax_{σ²} Norm_w[0, X^T H^(−1) X + σ² I]
    // Maximize marginal likelihood w.r.t. relevance parameters H
    Σ = σ²(XX^T + H)^(−1)
    µ = Σ X w / σ²
    for d = 1 to D do
      if Method 1 then
        h_d = (1 + ν)/(µ_d² + Σ_dd + ν)
      else
        h_d = (1 − h_d Σ_dd + ν)/(µ_d² + ν)
      end
    end
  end
  // Remove columns of X, rows of w, and rows and columns of H where the
  // contribution is low (perhaps h_d > 1000)
  [H, X, w] = prune[H, X, w]
  // Compute variance of posterior over φ
  A^(−1) = H^(−1) − H^(−1) X (X^T H^(−1) X + σ² I)^(−1) X^T H^(−1)
  // Compute mean of prediction for new example x*
  µ_{w*|x*} = x*^T A^(−1) X w / σ²
  // Compute variance of prediction for new example x*
  σ²_{w*|x*} = x*^T A^(−1) x* + σ²
end
0.4.5 Dual Bayesian linear regression
To implement the Bayesian version we represent the parameter vector φ as a weighted sum

φ = Xψ

of the data examples X. We define a prior over the new parameters ψ so that

Pr(ψ) = Norm_ψ[0, σ²_p I],

which contains one hyperparameter σ²_p that determines the prior variance.
Algorithm 21: Dual formulation of linear regression

Input : (D + 1)×I data matrix X, I×1 world vector y, Hyperparameter σ²_p
Output: Distribution Pr(y*|x*) over world given new data example x*
begin
  // Fit variance parameter σ² with line search
  σ² = argmax_{σ²} Norm_y[0, σ²_p X^T X X^T X + σ² I]
  // Compute inverse variance of posterior over ψ
  A = X^T X X^T X/σ² + I/σ²_p
  // Compute mean of prediction for new example x*
  µ_{y*|x*} = x*^T X A^(−1) X^T X y / σ²
  // Compute variance of prediction for new example x*
  σ²_{y*|x*} = x*^T X A^(−1) X^T x* + σ²
end
0.4.6 Dual Gaussian process regression
Algorithm 22: Dual Gaussian process regression

Input : (D + 1)×I data matrix X, I×1 world vector y, Hyperparameter σ²_p, kernel function K[•, •]
Output: Distribution Pr(y*|x*) over world given new data example x*
begin
  // Fit variance parameter σ² with line search
  σ² = argmax_{σ²} Norm_y[0, σ²_p K[X, X] K[X, X] + σ² I]
  // Compute inverse term
  A = K[X, X] K[X, X]/σ² + I/σ²_p
  // Compute mean of prediction for new example x*
  µ_{y*|x*} = K[x*, X] A^(−1) K[X, X] y / σ²
  // Compute variance of prediction for new example x*
  σ²_{y*|x*} = K[x*, X] A^(−1) K[X, x*] + σ²
end
0.4.7 Relevance vector regression
Algorithm 23: Relevance vector regression

Input : (D + 1)×I data matrix X, I×1 world vector w, kernel function K[•, •], degrees of freedom ν
Output: Distribution Pr(w*|x*) over world given new data example x*
begin
  // Initialize variables
  H = diag[1, 1, . . . , 1]
  for t = 1 to T do
    // Maximize marginal likelihood w.r.t. variance parameter σ² with line search
    σ² = argmax_{σ²} Norm_w[0, K[X, X] H^(−1) K[X, X] + σ² I]
    // Maximize marginal likelihood w.r.t. relevance parameters H
    Σ = σ²(K[X, X] K[X, X] + H)^(−1)
    µ = Σ K[X, X] w / σ²
    for i = 1 to I do
      if Method 1 then
        h_i = (1 + ν)/(µ_i² + Σ_ii + ν)
      else
        h_i = (1 − h_i Σ_ii + ν)/(µ_i² + ν)
      end
    end
  end
  // Remove columns of X, rows of w, and rows and columns of H where the
  // contribution is low (perhaps h_i > 1000)
  [H, X, w] = prune[H, X, w]
  // Compute inverse term
  A = K[X, X] K[X, X]/σ² + I/σ²_p
  // Compute mean of prediction for new example x*
  µ_{w*|x*} = K[x*, X] A^(−1) K[X, X] w / σ²
  // Compute variance of prediction for new example x*
  σ²_{w*|x*} = K[x*, X] A^(−1) K[X, x*] + σ²
end
0.5 Classification models
0.5.1 Logistic regression
The logistic regression model is defined as

Pr(w|φ, x) = Bern_w[1/(1 + exp[−φ^T x])].

This is a straightforward optimization problem. We prepend a 1 to the start of each data example x_i and then optimize the log binomial probability. To do this we need to compute this value, and the derivative and Hessian with respect to the parameter φ.
Algorithm 24: Compute cost function, derivative and Hessian

Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, parameters φ
Output: cost L, gradient g, Hessian H
begin
  // Initialize cost, gradient, Hessian
  L = 0
  g = zeros[D + 1, 1]
  H = zeros[D + 1, D + 1]
  // For each data point
  for i = 1 to I do
    // Compute prediction y
    y_i = 1/(1 + exp[−φ^T x_i])
    // Update log likelihood, gradient and Hessian
    if w_i == 1 then
      L = L + log[y_i]
    else
      L = L + log[1 − y_i]
    end
    g = g + (y_i − w_i) x_i
    H = H + y_i(1 − y_i) x_i x_i^T
  end
end
Don't forget to multiply L, g and H by −1 if you are optimizing with a routine that minimizes a cost function rather than maximizes it.
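A vectorized sketch of Algorithm 24 (Python/NumPy assumed; here X stores the datapoints as rows, with the leading 1 already prepended, and the function name is my own):

```python
import numpy as np

def log_reg_crit(X, w, phi):
    """Log likelihood, gradient and Hessian as in Algorithm 24."""
    y = 1.0 / (1.0 + np.exp(-X @ phi))           # per-point predictions
    L = np.sum(w * np.log(y) + (1 - w) * np.log(1 - y))
    g = X.T @ (y - w)                            # sum of (y_i - w_i) x_i
    H = X.T @ (X * (y * (1 - y))[:, None])       # sum of y_i(1 - y_i) x_i x_i^T
    return L, g, H
```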
0.5.2 MAP Logistic Regression
This is a straightforward optimization problem and very similar to the original logistic regression model, except that we now also have a prior over the parameters

Pr(φ) = Norm_φ[0, σ²_p I].    (4)

We prepend a 1 to the start of each data example x_i and then optimize the log binomial probability. To do this we need to compute this value, and the derivative and Hessian with respect to the parameter φ.
Algorithm 25: Compute cost function, derivative and Hessian

Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, parameters φ, prior variance σ²_p
Output: cost L, gradient g, Hessian H
begin
  // Initialize cost, gradient, Hessian with prior terms
  L = −(D + 1) log[2πσ²_p]/2 − φ^T φ/(2σ²_p)
  g = −φ/σ²_p
  H = −I/σ²_p
  // For each data point
  for i = 1 to I do
    // Compute prediction y
    y_i = 1/(1 + exp[−φ^T x_i])
    // Update log likelihood, gradient and Hessian
    if w_i == 1 then
      L = L + log[y_i]
    else
      L = L + log[1 − y_i]
    end
    g = g + (y_i − w_i) x_i
    H = H + y_i(1 − y_i) x_i x_i^T
  end
end
Don't forget to multiply L, g and H by −1 if you are optimizing with a routine that minimizes a cost function rather than maximizes it.
0.5.3 Bayesian logistic regression
In Bayesian logistic regression, we aim to compute the predictive distribution Pr(w*|x*) over the binary world state w* for a new data example x*. This takes the form of a Bernoulli distribution and is hence summarized by the single parameter λ* = Pr(w* = 1|x*).
Algorithm 26: Bayesian logistic regression

Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, new data x*
Output: Bernoulli parameter λ* from Pr(w*|x*) for new data x*
begin
  // Prepend a 1 to the start of each data vector
  for i = 1 to I do
    x_i = [1; x_i]
  end
  // Initialize parameters
  φ = zeros[D + 1, 1]
  // Optimization using cost function of algorithm ??
  φ = optimize[logRegCrit[{x_i, w_i}_{i=1}^I, φ], φ]
  // Compute Hessian at peak (algorithm ??)
  [L, g, H] = logRegCrit[{x_i, w_i}_{i=1}^I, φ]
  // Set mean and variance of Laplace approximation
  µ = φ
  Σ = −H^(−1)
  // Compute mean and variance of activation
  µ_a = µ^T x*
  σ²_a = x*^T Σ x*
  // Compute approximate prediction
  λ* = 1/(1 + exp[−µ_a/√(1 + πσ²_a/8)])
end
0.5.4 MAP dual logistic regression
The dual logistic regression model is defined as

Pr(w|ψ, x) = Bern_w[1/(1 + exp[−ψ^T X^T x])]
Pr(ψ) = Norm_ψ[0, σ²_p I].

This is a straightforward optimization problem. We prepend a 1 to the start of each data example x_i and then optimize the log binomial probability. To do this we need to compute this value, and the derivative and Hessian with respect to the parameter ψ.
Algorithm 27: Compute cost function, derivative and Hessian

Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, parameters ψ
Output: cost L, gradient g, Hessian H
begin
  // Initialize cost, gradient, Hessian with prior terms
  L = −I log[2πσ²_p]/2 − ψ^T ψ/(2σ²_p)
  g = −ψ/σ²_p
  H = −I/σ²_p
  // Form compound data matrix
  X = [x_1, x_2, . . . , x_I]
  // For each data point
  for i = 1 to I do
    // Compute prediction y
    y_i = 1/(1 + exp[−ψ^T X^T x_i])
    // Update log likelihood, gradient and Hessian
    if w_i == 1 then
      L = L + log[y_i]
    else
      L = L + log[1 − y_i]
    end
    g = g + (y_i − w_i) X^T x_i
    H = H + y_i(1 − y_i) X^T x_i x_i^T X
  end
end
Don't forget to multiply L, g and H by −1 if you are optimizing with a routine that minimizes a cost function rather than maximizes it.
0.5.5 Dual Bayesian logistic regression
In dual Bayesian logistic regression, we aim to compute the predictive distribution Pr(w*|x*) over the binary world state w* for a new data example x*. This takes the form of a Bernoulli distribution and is hence summarized by the single parameter λ* = Pr(w* = 1|x*).
Algorithm 28: Dual Bayesian logistic regression

Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, new data x*
Output: Bernoulli parameter λ* from Pr(w*|x*) for new data x*
begin
  // Prepend a 1 to the start of each data vector
  for i = 1 to I do
    x_i = [1; x_i]
  end
  // Initialize parameters
  ψ = zeros[I, 1]
  // Optimization using cost function of algorithm ??
  ψ = optimize[logRegCrit[{x_i, w_i}_{i=1}^I, ψ], ψ]
  // Compute Hessian at peak (algorithm ??)
  [L, g, H] = logRegCrit[{x_i, w_i}_{i=1}^I, ψ]
  // Set mean and variance of Laplace approximation
  µ = ψ
  Σ = −H^(−1)
  // Compute mean and variance of activation
  µ_a = µ^T X^T x*
  σ²_a = x*^T X Σ X^T x*
  // Compute approximate prediction
  λ* = 1/(1 + exp[−µ_a/√(1 + πσ²_a/8)])
end
0.5.6 MAP Kernel logistic regression
Algorithm 29: Compute cost function, derivative and Hessian

Input : World states {w_i}_{i=1}^I, data {x_i}_{i=1}^I, parameters ψ, kernel function K[•, •]
Output: cost L, gradient g, Hessian H
begin
  // Initialize cost, gradient, Hessian with prior terms
  L = −I log[2πσ²_p]/2 − ψ^T ψ/(2σ²_p)
  g = −ψ/σ²_p
  H = −I/σ²_p
  // Form compound data matrix
  X = [x_1, x_2, . . . , x_I]
  // For each data point
  for i = 1 to I do
    // Compute prediction y
    y_i = 1/(1 + exp[−ψ^T K[X, x_i]])
    // Update log likelihood, gradient and Hessian
    if w_i == 1 then
      L = L + log[y_i]
    else
      L = L + log[1 − y_i]
    end
    g = g + (y_i − w_i) K[X, x_i]
    H = H + y_i(1 − y_i) K[X, x_i] K[x_i, X]
  end
end
0.5.7 Bayesian kernel logistic regression (Gaussian process classification)
In Bayesian kernel logistic regression, we aim to compute the predictive distribution Pr(w*|x*) over the binary world state w* for a new data example x*. This takes the form of a Bernoulli distribution and is hence summarized by the single parameter λ* = Pr(w* = 1|x*).
Algorithm 30: Bayesian kernel logistic regression

Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, new data x*
Output: Bernoulli parameter λ* from Pr(w*|x*) for new data x*
begin
  // Prepend a 1 to the start of each data vector
  for i = 1 to I do
    x_i = [1; x_i]
  end
  // Initialize parameters
  ψ = zeros[I, 1]
  // Optimization using cost function of algorithm ??
  ψ = optimize[logRegKernelCrit[{x_i, w_i}_{i=1}^I, ψ], ψ]
  // Compute Hessian at peak (algorithm ??)
  [L, g, H] = logRegKernelCrit[{x_i, w_i}_{i=1}^I, ψ]
  // Set mean and variance of Laplace approximation
  µ = ψ
  Σ = −H^(−1)
  // Compute mean and variance of activation
  µ_a = µ^T K[X, x*]
  σ²_a = K[x*, X] Σ K[X, x*]
  // Compute approximate prediction
  λ* = 1/(1 + exp[−µ_a/√(1 + πσ²_a/8)])
end
0.5.8 Relevance vector classification
0.5.9 Incremental fitting for logistic regression
The incremental fitting approach to logistic regression fits the model

Pr(w|φ, x) = Bern_w[1/(1 + exp[−φ_0 − Σ_{k=1}^K φ_k f[x, ξ_k]])].

The method is to set all the weight parameters φ_k to zero initially and to optimize them one by one. At the first stage we optimize φ_0, φ_1 and ξ_1. Then we optimize φ_0, φ_2 and ξ_2 and so on.
Algorithm 31: Incremental logistic regression

Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I
Output: ML parameters φ_0, {φ_k, ξ_k}_{k=1}^K
begin
  // Initialize parameters
  φ_0 = 0
  for k = 1 to K do
    φ_k = 0
    ξ_k = ξ_k^(0)
  end
  // Initialize activations
  for i = 1 to I do
    a_i = 0
  end
  for k = 1 to K do
    // Reset offset parameters
    for i = 1 to I do
      a_i = a_i − φ_0
    end
    φ_0 = 0
    [φ_0, φ_k, ξ_k] = optimize[logRegOffsetCrit[φ_0, φ_k, ξ_k, {a_i, x_i}_{i=1}^I], φ_0, φ_k, ξ_k]
    for i = 1 to I do
      a_i = a_i + φ_0 + φ_k f[x_i, ξ_k]
    end
  end
end
At each stage the optimization procedure improves the criterion

logRegOffsetCrit[φ_0, φ_k, ξ_k, {a_i}_{i=1}^I] = Σ_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−a_i − φ_0 − φ_k f[x_i, ξ_k]])]]

with respect to the parameters φ_0, φ_k, ξ_k.
0.5.10 Logit-boost
The logit-boost model is

Pr(w|φ, x) = Bern_w[1/(1 + exp[−φ_0 − Σ_{k=1}^K φ_k heaviside[f[x, ξ_{c_k}]]])].
Algorithm 32: Logit-boost

Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, functions {f_m[x, ξ_m]}_{m=1}^M
Output: ML parameters φ_0, {φ_k}_{k=1}^K, c_k ∈ {1 . . . M}
begin
  // Initialize parameters
  φ_0 = 0
  for k = 1 to K do
    φ_k = 0
  end
  for i = 1 to I do
    a_i = 0
  end
  for k = 1 to K do
    // Find best weak classifier by looking at magnitude of gradient
    c_k = argmax_m [(Σ_{i=1}^I (a_i − w_i) f[x_i, ξ_m])²]
    // Reset offset parameters
    for i = 1 to I do
      a_i = a_i − φ_0
    end
    φ_0 = 0
    // Perform optimization
    [φ_0, φ_k] = optimize[logitBoostCrit[φ_0, φ_k], φ_0, φ_k]
    for i = 1 to I do
      a_i = a_i + φ_0 + φ_k f[x_i, ξ_{c_k}]
    end
  end
end
At each stage the optimization procedure improves the criterion

logitBoostCrit[φ_0, φ_k] = Σ_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−a_i − φ_0 − φ_k f[x_i, ξ_{c_k}]])]]

with respect to the parameters φ_0, φ_k.
0.5.11 Multi-class logistic regression
The multiclass logistic regression model is defined as

Pr(w|φ_{1...K}, x) = Cat_w[softmax[φ_1^T x, φ_2^T x, . . . , φ_K^T x]],

where we have prepended a 1 to the start of each data vector x. This is a straightforward optimization problem over the log probability. We need to compute this value, and the derivative and Hessian with respect to the parameters φ_k.
Algorithm 33: Cost function, derivative and Hessian for multi-class logistic regression

Input : World states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, parameters {φ_k}_{k=1}^K
Output: cost L, gradient g, Hessian H
begin
  // Initialize cost, gradient, Hessian
  L = 0
  for k = 1 to K do
    g_k = 0
    for l = 1 to K do
      H_kl = 0
    end
  end
  // For each data point
  for i = 1 to I do
    // Compute prediction y
    y_i = softmax[φ_1^T x_i, φ_2^T x_i, . . . , φ_K^T x_i]
    // Update log likelihood
    L = L + log[y_{i,w_i}]
    // Update gradient and Hessian
    for k = 1 to K do
      g_k = g_k + x_i(y_{ik} − δ[w_i − k])
      for l = 1 to K do
        H_kl = H_kl + x_i x_i^T y_{ik}(δ[k − l] − y_{il})
      end
    end
  end
  // Assemble final gradient and Hessian
  g = [g_1; g_2; . . . ; g_K]
  for k = 1 to K do
    H_k = [H_k1, H_k2, . . . , H_kK]
  end
  H = [H_1; H_2; . . . ; H_K]
end
Don't forget to multiply L, g and H by −1 if you are optimizing with a routine that minimizes a cost function rather than maximizes it.
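The cost and gradient computations above can be sketched in numpy. The Hessian is omitted for brevity, and all function names here are illustrative, not from the book:

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax along the last axis
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def cost_and_gradient(Phi, X, w):
    """Negative log likelihood and gradient for multi-class logistic regression.

    Phi : K x D parameter matrix (one row phi_k per class)
    X   : I x D data matrix (first column already set to 1)
    w   : length-I vector of class labels in {0, ..., K-1}
    """
    Y = softmax(X @ Phi.T)                        # I x K predictions y_i
    L = -np.sum(np.log(Y[np.arange(len(w)), w]))  # negative log likelihood
    delta = np.zeros_like(Y)
    delta[np.arange(len(w)), w] = 1.0             # one-hot encoding of labels
    G = (Y - delta).T @ X                         # K x D gradient (rows g_k)
    return L, G
```

The gradient can be checked against a finite-difference approximation of the cost, which is a useful sanity test before handing both to an optimizer.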
0.5.12 Multi-class logistic tree

Algorithm 34: Multi-class classification tree

Input : World states {w_i}_{i=1}^{I}, data {x_i}_{i=1}^{I}, classifiers {g[•, ω_m]}_{m=1}^{M}
Output: Categorical parameters at leaves {λ_p}_{p=1}^{J+1}, classifier indices {c_j}_{j=1}^{J}

begin
  Enqueue[x_{1...I}, w_{1...I}]
  // For each node in tree
  for j = 1 to J do
    [x_{1...I}, w_{1...I}] = Dequeue[]
    for m = 1 to M do
      // Count frequency of each class passing into either branch
      for k = 1 to K do
        n_k^{(l)} = Σ_{i=1}^{I} δ[g[x_i, ω_m] − 0] δ[w_i − k]
        n_k^{(r)} = Σ_{i=1}^{I} δ[g[x_i, ω_m] − 1] δ[w_i − k]
      end
      // Compute log likelihood
      l_m = Σ_{k=1}^{K} n_k^{(l)} log[n_k^{(l)} / Σ_{q=1}^{K} n_q^{(l)}]
      l_m = l_m + Σ_{k=1}^{K} n_k^{(r)} log[n_k^{(r)} / Σ_{q=1}^{K} n_q^{(r)}]
    end
    c_j = argmax_m [l_m]   // Store best classifier
    S_l = {}; S_r = {}     // Partition into two sets
    for i = 1 to I do
      if g[x_i, ω_{c_j}] == 0 then
        S_l = S_l ∪ i
      else
        S_r = S_r ∪ i
      end
    end
    Enqueue[x_{S_l}, w_{S_l}]   // Add to queue
    Enqueue[x_{S_r}, w_{S_r}]
  end
  // Recover categorical parameters at leaves
  for p = 1 to J+1 do
    [x_{1...I}, w_{1...I}] = Dequeue[]
    for k = 1 to K do
      n_k = Σ_{i=1}^{I} δ[w_i − k]
    end
    λ_p = n / Σ_{k=1}^{K} n_k
  end
end
0.5.13 Random classification tree
0.5.14 Random classification fern
0.6 Graphical models
0.6.1 Gibbs sampling from a discrete undirected model
Algorithm 35: Gibbs sampling from undirected model

Input : Potential functions {φ_c[S_c]}_{c=1}^{C}
Output: Samples {x_t}_{t=1}^{T}

begin
  // Initialize first sample in chain
  x_0 = x^{(0)}
  // For each time sample
  for t = 1 to T do
    x_t = x_{t−1}
    // For each dimension
    for d = 1 to D do
      // For each possible value
      for k = 1 to K do
        λ_k = 1
        x_{td} = k
        for each c such that d ∈ S_c do
          λ_k = λ_k φ_c[S_c]
        end
      end
      λ = λ / Σ_{k=1}^{K} λ_k
      // Draw from categorical distribution
      x_{td} = DrawFromCategorical[λ]
    end
  end
end
It is normal to discard the first few thousand samples so that the initial conditions are forgotten. The remaining entries are then chosen spaced well apart to reduce correlation between the samples.
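A minimal sketch of the sampler above, applied to a toy pairwise model with two binary variables that prefer to agree; the `potential` function and all names are illustrative. For clarity each update evaluates the full product of potentials, whereas a real implementation would touch only the cliques containing dimension d:

```python
import numpy as np

def gibbs_sample(potential, D, K, T, rng):
    """Gibbs sampling from an undirected model (sketch of Algorithm 35).

    potential(x) returns the unnormalized probability of configuration x.
    """
    x = np.zeros(D, dtype=int)            # initial sample x^(0)
    samples = np.empty((T, D), dtype=int)
    for t in range(T):
        for d in range(D):
            lam = np.empty(K)
            for k in range(K):
                x[d] = k                  # try each value of dimension d
                lam[k] = potential(x)
            lam /= lam.sum()
            x[d] = rng.choice(K, p=lam)   # draw from categorical distribution
        samples[t] = x
    return samples

# Hypothetical toy model: two binary variables that prefer to agree
def potential(x):
    return 2.0 if x[0] == x[1] else 1.0

rng = np.random.default_rng(0)
S = gibbs_sample(potential, D=2, K=2, T=20000, rng=rng)
S = S[1000::5]                            # discard burn-in, then thin
agree = np.mean(S[:, 0] == S[:, 1])       # exact value for this model is 2/3
```

The burn-in and thinning in the last lines implement the advice above about discarding early samples and spacing the rest apart.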
0.6.2 Contrastive divergence for learning undirected models
Algorithm 36: Contrastive divergence learning of undirected model

Input : Training data {x_i}_{i=1}^{I}, learning rate α
Output: ML parameters θ

begin
  // Initialize parameters
  θ = θ^{(0)}
  repeat
    for i = 1 to I do
      // Take a single Gibbs sample step from the ith data point
      x*_i = GibbsSample[x_i, θ]
    end
    // Update parameters
    // Function gradient[•, •] returns derivative of log of unnormalized probability
    θ = θ + α Σ_{i=1}^{I} (gradient[x_i, θ] − gradient[x*_i, θ])
  until no further average change in θ
end
0.7 Models for chains and trees
0.7.1 Dynamic programming for chain model
Algorithm 37: Dynamic programming in chain

Input : Unary costs {U_{n,k}}_{n=1,k=1}^{N,K}, pairwise costs {P_{n,k,l}}_{n=2,k=1,l=1}^{N,K,K}
Output: Minimum cost path {y_n}_{n=1}^{N}

begin
  // Initialize cumulative sums S_{n,k}
  for k = 1 to K do
    S_{1,k} = U_{1,k}
  end
  // Work forward through chain
  for n = 2 to N do
    for k = 1 to K do
      // Find minimum cost to get to this node
      S_{n,k} = U_{n,k} + min_l [S_{n−1,l} + P_{n,k,l}]
      // Store route by which we got here
      R_{n,k} = argmin_l [S_{n−1,l} + P_{n,k,l}]
    end
  end
  // Find label y_N with overall minimum cost
  y_N = argmin_k [S_{N,k}]
  // Trace back to retrieve route
  for n = N to 2 do
    y_{n−1} = R_{n,y_n}
  end
end
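The forward pass and trace-back above translate directly into numpy; this is a minimal sketch with illustrative names:

```python
import numpy as np

def min_cost_path(U, P):
    """Dynamic programming in a chain (sketch of Algorithm 37).

    U : N x K unary costs.
    P : N x K x K pairwise costs, where P[n, k, l] is the cost of label k
        at node n given label l at node n-1 (P[0] is unused).
    Returns the minimum-cost label sequence.
    """
    N, K = U.shape
    S = np.zeros((N, K))                 # cumulative sums
    R = np.zeros((N, K), dtype=int)      # back-pointers
    S[0] = U[0]
    for n in range(1, N):                # work forward through chain
        for k in range(K):
            totals = S[n - 1] + P[n, k]
            R[n, k] = np.argmin(totals)  # store route by which we got here
            S[n, k] = U[n, k] + totals.min()
    y = np.zeros(N, dtype=int)
    y[-1] = np.argmin(S[-1])             # best final label
    for n in range(N - 1, 0, -1):        # trace back to retrieve route
        y[n - 1] = R[n, y[n]]
    return y
```

On small problems the result can be verified against brute-force enumeration of all K^N label sequences.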
0.7.2 Dynamic programming for tree model
This algorithm relies on pre-computing an order in which to traverse the nodes, so that the children of each node in the graph are visited before the parent. It also uses the notation ψ_{n,k}[y_{ch[n]}] to represent the logarithm of the factor in the probability distribution that includes node n and its children, for y_n = k and some values of the children.

Algorithm 38: Dynamic programming in tree

Input : Unary costs {U_{n,k}}_{n=1,k=1}^{N,K}, joint cost functions {ψ_{n,k}[y_{ch[n]}]}_{n=1}^{N}
Output: Minimum cost path {y_n}_{n=1}^{N}

begin
  repeat
    // Retrieve nodes in an order so children always come before parents
    n = GetNextNode[]
    // Add unary costs to cumulative sums
    for k = 1 to K do
      S_{n,k} = U_{n,k} + min_{y_{ch[n]}} ψ_{n,k}[y_{ch[n]}]
      R_{n,k} = argmin_{y_{ch[n]}} ψ_{n,k}[y_{ch[n]}]
    end
    // Push node index onto stack
    Push[n]
  until pa[n] = {}
  // Find label y_n at the root with overall minimum cost
  y_n = argmin_k [S_{n,k}]
  // Trace back to retrieve route
  for c = 1 to N do
    n = Pop[]
    if ch[n] ≠ {} then
      y_{ch[n]} = R_{n,y_n}
    end
  end
end
0.7.3 Sum Product Algorithm
Algorithm 39: Sum product algorithm: distribute

Input : Observed data {z*_n}_{n∈S_obs}, functions {φ_k[C_k]}_{k=1}^{K}
Output: Marginal probability distributions {q_n[y_n]}_{n=1}^{N}

begin
  // Distribute evidence
  repeat
    // Retrieve edges in order
    n = GetNextEdge[]
    // Test for type of edge
    if isEdgeToFunction[e_n] then
      // If this data was observed
      if n ∈ S_obs then
        m_{e_n} = δ[z*_n]
      else
        // Find set of edges that are incoming to data node
        S = {k : e_{n1} ∈ e_k \ e_n}
        // Take product of messages
        m_{e_n} = Π_{k∈S} m_{e_k}
      end
      // Add edge to stack
      Push[n]
    else
      // Find set of edges incoming to function node
      S = {k : e_{n1} ∈ e_k \ e_n}
      // Take sum of function times product of messages
      m_{e_n} = Σ_y φ_n[S ∪ n] Π_{k∈S} m_{e_k}
      // Add edge to stack
      Push[n]
    end
  until no edges remain
end
Algorithm 40: Sum product algorithm: collate and compute distributions

Input : Observed data {z*_n}_{n∈S_obs}, functions {φ_k[C_k]}_{k=1}^{K}
Output: Marginal probability distributions {q_n[y_n]}_{n=1}^{N}

begin
  // Collate evidence
  repeat
    // Retrieve edges in opposite order
    n = Pop[]
    // Test for type of edge
    if !isEdgeToFunction[e_n] then
      // Find set of edges that are incoming to data node
      S = {k : e_{n2} ∈ e_k \ e_n}
      // Take product of messages
      b_{e_n} = Π_{k∈S} b_{e_k}
    else
      // Find set of edges incoming to function node
      S = {k : e_{n2} ∈ e_k \ e_n}
      // Take sum of function times product of messages
      b_{e_n} = Σ_y φ_n[S ∪ n] Π_{k∈S} b_{e_k}
    end
  until stack empty
  // Compute distributions at nodes
  for k = 1 to K do
    // Find sets of edges that are incoming to data node
    S_1 = {n : k ∈ e_{n2}}
    S_2 = {n : k ∈ e_{n1}}
    q[k] = Π_{n∈S_1} m_n Π_{n∈S_2} b_n
  end
end
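For a chain-structured model the two passes above reduce to forward and backward messages; this sketch, with illustrative names, computes exact marginals that can be checked against brute-force enumeration:

```python
import numpy as np

def chain_marginals(unary, pairwise):
    """Sum-product on a chain (a special case of Algorithms 39/40).

    unary    : N x K array of unary factors.
    pairwise : list of N-1 arrays, where pairwise[n][k, l] is the factor
               linking y_n = k and y_{n+1} = l.
    Returns N x K normalized marginals.
    """
    N, K = unary.shape
    fwd = np.ones((N, K))   # messages arriving from the left (distribute)
    bwd = np.ones((N, K))   # messages arriving from the right (collate)
    for n in range(1, N):
        fwd[n] = (fwd[n - 1] * unary[n - 1]) @ pairwise[n - 1]
    for n in range(N - 2, -1, -1):
        bwd[n] = pairwise[n] @ (bwd[n + 1] * unary[n + 1])
    q = fwd * unary * bwd   # product of incoming messages at each node
    return q / q.sum(axis=1, keepdims=True)
```

The marginal at each node is the product of the messages arriving from both directions, mirroring the final loop of Algorithm 40.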
0.8 Models for grids
0.8.1 Binary graph cuts
Algorithm 41: Binary graph cut algorithm

Input : Unary costs {U_n(k)}_{n,k=1}^{N,K}, pairwise costs {P_{mn}(k,l)}_{n,m,k,l=1}^{N,N,K,K}, binary edge flags {E_{mn}}_{n=1,m=1}^{N,N}
Output: Label assignments {y_n}

begin
  // Create edges from source and to sink
  for n = 1 to N do
    MakeLink[source, n]
    MakeLink[n, sink]
    for m = 1 to n−1 do
      if E_{mn} = 1 then
        MakeLink[n, m]
        MakeLink[m, n]
      end
    end
  end
  // Add costs to edges
  for n = 1 to N do
    AddToLink[source, n, U_n(0)]
    AddToLink[n, sink, U_n(1)]
    for m = 1 to n−1 do
      if E_{mn} = 1 then
        AddToLink[n, m, P_{mn}(1, 0) − P_{mn}(0, 0) − P_{mn}(1, 1)]
        AddToLink[m, n, P_{mn}(0, 1)]
        AddToLink[source, m, P_{mn}(0, 0)]
        AddToLink[n, sink, P_{mn}(1, 1)]
      end
    end
  end
  Reparameterize[]
  ComputeMinCut[]
  // Read off values
  for n = 1 to N do
    if isConnected[n, source] then
      y_n = 1
    else
      y_n = 0
    end
  end
end
0.8.2 Reparameterization for graph cuts
Algorithm 42: Reparameterization for binary graph cut

Input : Graph
Output: Modified graph

begin
  for n = 1 to N do
    for m = 1 to n−1 do
      if E_{mn} = 1 then
        β = 0
        if GetEdgeCost[n, m] < 0 then
          β = β + GetEdgeCost[n, m]
        else
          if GetEdgeCost[m, n] < 0 then
            β = β − GetEdgeCost[m, n]
          end
        end
        AddToLink[n, m, −β]
        AddToLink[m, n, β]
        AddToLink[source, m, β]
        AddToLink[n, sink, β]
        α = min[GetEdgeCost[source, n], GetEdgeCost[n, sink]]
        AddToLink[source, n, −α]
        AddToLink[n, sink, −α]
      end
    end
  end
end
0.8.3 Multi-label graph cuts
Algorithm 43: Multi-way graph cut algorithm

Input : Unary costs {U_n(k)}_{n,k=1}^{N,K}, pairwise costs {P_{n,m}(k,l)}_{n,m,k,l=1}^{N,N,K,K}, binary edge flags {E_{mn}}_{n=1,m=1}^{N,N}
Output: Label assignments {y_n}

begin
  // Create a chain of K+1 nodes per pixel, with edges from source and to sink
  for n = 1 to N do
    MakeLink[source, (n−1)(K+1) + 1, ∞]
    MakeLink[n(K+1), sink, ∞]
    for k = 1 to K do
      MakeLink[(n−1)(K+1) + k, (n−1)(K+1) + k + 1, U_n(k)]
      MakeLink[(n−1)(K+1) + k + 1, (n−1)(K+1) + k, ∞]
    end
    for m = 1 to n−1 do
      if E_{mn} = 1 then
        for k = 1 to K do
          for l = 2 to K+1 do
            C = P_{n,m}(k, l−1) + P_{n,m}(k−1, l) − P_{n,m}(k, l) − P_{n,m}(k−1, l−1)
            MakeLink[(n−1)(K+1) + k, (m−1)(K+1) + l, C]
          end
        end
      end
    end
  end
  Reparameterize[]
  ComputeMinCut[]
  // Read off values
  for n = 1 to N do
    for k = 1 to K do
      if isConnected[(n−1)(K+1) + k, source] then
        y_n = k
      end
    end
  end
end
0.8.4 Alpha-Expansion Algorithm
Algorithm 44: Alpha expansion algorithm (main loop)

Input : Unary costs {U_n(k)}_{n,k=1}^{N,K}, pairwise costs {P_{n,m}(k,l)}_{n,m,k,l=1}^{N,N,K,K}, binary edge flags {E_{mn}}_{n=1,m=1}^{N,N}
Output: Label assignments {y_n}

begin
  y = y_0
  L = Σ_{n=1}^{N} U_n(y_n) + Σ_{n=1}^{N} Σ_{m=1}^{N} E_{mn} P_{n,m}(y_n, y_m)
  L_0 = −∞
  repeat
    L_0 = L
    for k = 1 to K do
      y = AlphaExpand[y, k]
    end
    L = Σ_{n=1}^{N} U_n(y_n) + Σ_{n=1}^{N} Σ_{m=1}^{N} E_{mn} P_{n,m}(y_n, y_m)
  until L = L_0
end
Algorithm 45: Alpha expansion algorithm (expand)

Input : Unary costs {U_n(k)}_{n,k=1}^{N,K}, pairwise costs {P_{n,m}(k,l)}_{n,m,k,l=1}^{N,N,K,K}, binary edge flags {E_{mn}}_{n=1,m=1}^{N,N}, expansion label k, label assignments {y_n}
Output: New label assignments {y_n}

begin
  t = 0   // counter for auxiliary nodes
  for n = 1 to N do
    MakeLink[source, n, U_n(k)]
    if y_n = k then
      MakeLink[n, sink, ∞]
    else
      MakeLink[n, sink, U_n(y_n)]
    end
    for m = 1 to n−1 do
      if E_{mn} == 1 then
        if y_n == k or y_m == k then
          if y_n ≠ k and y_m == k then
            MakeLink[n, m, P_{n,m}(y_m, y_n)]
          end
          if y_n == k and y_m ≠ k then
            MakeLink[m, n, P_{n,m}(y_n, y_m)]
          end
        else
          if y_n == y_m then
            MakeLink[n, m, P_{n,m}(y_m, y_n)]
            MakeLink[m, n, P_{n,m}(y_n, y_m)]
          else
            // Add auxiliary node t
            t = t + 1
            MakeLink[n, t, P_{n,m}(y_n, k)]
            MakeLink[m, t, P_{n,m}(k, y_m)]
            MakeLink[t, sink, P_{n,m}(y_m, y_n)]
          end
        end
      end
    end
  end
  Reparameterize[]
  ComputeMinCut[]
  // Read off values
  for n = 1 to N do
    if isConnected[n, sink] then
      y_n = k
    end
  end
end
0.9 The pinhole camera
0.9.1 ML learning of camera extrinsic parameters
Given a known object with I distinct three-dimensional points {w_i}_{i=1}^{I}, their corresponding projections in the image {x_i}_{i=1}^{I} and known camera parameters Λ, estimate the geometric relationship between the camera and the object, determined by the rotation Ω and the translation τ.
Algorithm 46: ML learning of extrinsic parameters

Input : Intrinsic matrix Λ, pairs of points {x_i, w_i}_{i=1}^{I}
Output: Extrinsic parameters: rotation Ω and translation τ

begin
  for i = 1 to I do
    // Convert to normalized camera coordinates
    x'_i = Λ^{−1} [x_i, y_i, 1]^T
    // Compute linear constraints
    a_{1i} = [u_i, v_i, w_i, 1, 0, 0, 0, 0, −u_i x'_i, −v_i x'_i, −w_i x'_i, −x'_i]
    a_{2i} = [0, 0, 0, 0, u_i, v_i, w_i, 1, −u_i y'_i, −v_i y'_i, −w_i y'_i, −y'_i]
  end
  // Stack linear constraints
  A = [a_{11}; a_{21}; a_{12}; a_{22}; . . . ; a_{1I}; a_{2I}]
  // Solve with SVD
  [U, L, V] = svd[A]
  b = v_{12}   // extract last column of V
  // Extract estimates up to unknown scale
  Ω̃ = [b_1, b_2, b_3; b_5, b_6, b_7; b_9, b_{10}, b_{11}]
  τ = [b_4; b_8; b_{12}]
  // Find closest rotation using Procrustes method
  [U, L, V] = svd[Ω̃]
  Ω = UV^T
  // Rescale translation by average ratio of rotation elements
  τ = τ (Σ_{i=1}^{3} Σ_{j=1}^{3} Ω_{ij}/Ω̃_{ij}) / 9
  // Refine parameters with non-linear optimization
  [Ω, τ] = optimize[projCost[Ω, τ], {Ω, τ}]
end
The final optimization minimizes the least squares error between the predicted projections of the points w_i into the image and the observed data x_i, so

projCost[Ω, τ] = Σ_{i=1}^{I} (x_i − pinhole[w_i, Λ, Ω, τ])^T (x_i − pinhole[w_i, Λ, Ω, τ]).

This optimization should be carried out while enforcing the constraint that Ω remains a valid rotation matrix.
0.9.2 ML learning of intrinsic parameters (camera calibration)
Given a known object with I distinct 3D points {w_i}_{i=1}^{I} and their corresponding projections in the image {x_i}_{i=1}^{I}, establish the camera parameters Λ.
Algorithm 47: ML learning of intrinsic parameters

Input : World points {w_i}_{i=1}^{I}, image points {x_i}_{i=1}^{I}, initial Λ
Output: Intrinsic parameters Λ

begin
  // Main loop for alternating optimization
  for k = 1 to K do
    // Compute extrinsic parameters
    [Ω, τ] = calcExtrinsic[Λ, {w_i, x_i}_{i=1}^{I}]
    // Compute intrinsic parameters
    for i = 1 to I do
      // Compute matrix A_i
      a_i = (ω_1^T w_i + τ_x)/(ω_3^T w_i + τ_z)
      b_i = (ω_2^T w_i + τ_y)/(ω_3^T w_i + τ_z)
      A_i = [a_i, b_i, 1, 0, 0; 0, 0, 0, b_i, 1]
    end
    // Concatenate matrices and data points
    x = [x_1; x_2; . . . ; x_I]
    A = [A_1; A_2; . . . ; A_I]
    // Compute parameters
    θ = (A^T A)^{−1} A^T x
    Λ = [θ_1, θ_2, θ_3; 0, θ_4, θ_5; 0, 0, 1]
  end
  // Refine parameters with non-linear optimization
  [Ω, τ, Λ] = optimize[projCost[Ω, τ, Λ], {Ω, τ, Λ}]
end
The final optimization minimizes the squared error between the projections of the w_i and the observed data x_i, respecting the constraints on the rotation matrix Ω:

projCost[Ω, τ, Λ] = Σ_{i=1}^{I} (x_i − pinhole[w_i, Λ, Ω, τ])^T (x_i − pinhole[w_i, Λ, Ω, τ]).
0.9.3 Inferring 3D world points (reconstruction)
Given J calibrated cameras in known positions (i.e., cameras with known Λ, Ω, τ) viewing the same three-dimensional point w, and knowing the corresponding projections {x_j}_{j=1}^{J} in the images, establish the position of the point in the world.
Algorithm 48: Inferring 3D world position

Input : Image points {x_j}_{j=1}^{J}, camera parameters {Λ_j, Ω_j, τ_j}_{j=1}^{J}
Output: 3D world point w

begin
  for j = 1 to J do
    // Convert to normalized camera coordinates
    x'_j = Λ_j^{−1} [x_j, y_j, 1]^T
    // Compute linear constraints
    a_{1j} = [ω_{31j} x'_j − ω_{11j}, ω_{32j} x'_j − ω_{12j}, ω_{33j} x'_j − ω_{13j}]
    a_{2j} = [ω_{31j} y'_j − ω_{21j}, ω_{32j} y'_j − ω_{22j}, ω_{33j} y'_j − ω_{23j}]
    b_j = [τ_{xj} − τ_{zj} x'_j; τ_{yj} − τ_{zj} y'_j]
  end
  // Stack linear constraints
  A = [a_{11}; a_{21}; a_{12}; a_{22}; . . . ; a_{1J}; a_{2J}]
  b = [b_1; b_2; . . . ; b_J]
  // Least squares solution for parameters
  w = (A^T A)^{−1} A^T b
  // Refine parameters with non-linear optimization
  w = optimize[projCost[w, {x_j, Λ_j, Ω_j, τ_j}_{j=1}^{J}], w]
end
The final optimization minimizes the squared error between the projections of w and the observed data x_j,

projCost[w, {x_j, Λ_j, Ω_j, τ_j}_{j=1}^{J}] = Σ_{j=1}^{J} (x_j − pinhole[w, Λ_j, Ω_j, τ_j])^T (x_j − pinhole[w, Λ_j, Ω_j, τ_j]).
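The linear step of the reconstruction can be sketched in numpy; the refinement stage is omitted and all names are illustrative:

```python
import numpy as np

def reconstruct_point(xs, Lambdas, Omegas, taus):
    """Linear step of Algorithm 48: triangulate a 3D point w from J views.

    xs : list of 2D pixel positions, one per camera.
    Lambdas, Omegas, taus : per-camera intrinsics, rotations and translations.
    """
    rows, rhs = [], []
    for x, Lam, Om, tau in zip(xs, Lambdas, Omegas, taus):
        # Convert to normalized camera coordinates
        xp, yp, _ = np.linalg.inv(Lam) @ np.array([x[0], x[1], 1.0])
        # Constraints (x' omega_3 - omega_1) . w = tau_x - tau_z x', etc.
        rows.append(xp * Om[2] - Om[0])
        rows.append(yp * Om[2] - Om[1])
        rhs.append(tau[0] - tau[2] * xp)
        rhs.append(tau[1] - tau[2] * yp)
    # Least squares solution of the stacked linear system
    w, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return w
```

With exact synthetic projections from two cameras the linear solution already recovers the point, so the non-linear refinement only matters under noise.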
0.10 Transformation models
0.10.1 ML learning of Euclidean transformation
The Euclidean transformation model maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} with a rotation Ω and a translation τ, so that

Pr(x_i|w_i, Ω, τ, σ^2) = Norm_{x_i}[euc[w_i, Ω, τ], σ^2 I],

where the Euclidean transformation is defined as

euc[w_i, Ω, τ] = Ω w_i + τ,

and Ω is constrained to be a rotation matrix so that Ω^T Ω = I and det[Ω] = 1.
Algorithm 49: Maximum likelihood learning of Euclidean transformation

Input : Training data pairs {x_i, w_i}_{i=1}^{I}
Output: Rotation Ω, translation τ, variance σ^2

begin
  // Compute mean of two data sets
  µ_w = Σ_{i=1}^{I} w_i / I
  µ_x = Σ_{i=1}^{I} x_i / I
  // Concatenate data into matrix form
  W = [w_1 − µ_w, w_2 − µ_w, . . . , w_I − µ_w]
  X = [x_1 − µ_x, x_2 − µ_x, . . . , x_I − µ_x]
  // Solve for rotation
  [U, L, V] = svd[XW^T]
  Ω = UV^T
  // Solve for translation
  τ = Σ_{i=1}^{I} (x_i − Ω w_i) / I
  // Solve for variance
  σ^2 = Σ_{i=1}^{I} (x_i − Ω w_i − τ)^T (x_i − Ω w_i − τ) / 2I
end
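The closed-form solution above is a few lines of numpy. This sketch adds a determinant check to guard against reflections, which the listing above does not include; all names are illustrative:

```python
import numpy as np

def fit_euclidean(W, X):
    """ML Euclidean transform (sketch of Algorithm 49): x_i ≈ Ω w_i + τ.

    W, X : 2 x I arrays of corresponding points.
    """
    mw = W.mean(axis=1, keepdims=True)
    mx = X.mean(axis=1, keepdims=True)
    # Solve for rotation from the SVD of the cross-covariance
    U, L, Vt = np.linalg.svd((X - mx) @ (W - mw).T)
    Omega = U @ Vt
    if np.linalg.det(Omega) < 0:          # extra guard against reflections
        Omega = U @ np.diag([1, -1]) @ Vt
    # Solve for translation and variance
    tau = mx - Omega @ mw
    var = np.mean(np.sum((X - Omega @ W - tau) ** 2, axis=0)) / 2
    return Omega, tau, var
```

Fitting to points generated from a known rotation and translation recovers the parameters exactly, which is an easy regression test.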
0.10.2 ML learning of similarity transformation
The similarity transformation model maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} with a rotation Ω, a translation τ and a scaling ρ, so that

Pr(x_i|w_i, Ω, τ, ρ, σ^2) = Norm_{x_i}[sim[w_i, Ω, τ, ρ], σ^2 I],

where the similarity transformation is defined as

sim[w_i, Ω, τ, ρ] = ρ Ω w_i + τ,

and Ω is constrained to be a rotation matrix so that Ω^T Ω = I and det[Ω] = 1.
Algorithm 50: Maximum likelihood learning of similarity transformation

Input : Training data pairs {x_i, w_i}_{i=1}^{I}
Output: Rotation Ω, translation τ, scale ρ, variance σ^2

begin
  // Compute mean of two data sets
  µ_w = Σ_{i=1}^{I} w_i / I
  µ_x = Σ_{i=1}^{I} x_i / I
  // Concatenate data into matrix form
  W = [w_1 − µ_w, w_2 − µ_w, . . . , w_I − µ_w]
  X = [x_1 − µ_x, x_2 − µ_x, . . . , x_I − µ_x]
  // Solve for rotation
  [U, L, V] = svd[XW^T]
  Ω = UV^T
  // Solve for scaling
  ρ = (Σ_{i=1}^{I} (x_i − µ_x)^T Ω (w_i − µ_w)) / (Σ_{i=1}^{I} (w_i − µ_w)^T (w_i − µ_w))
  // Solve for translation
  τ = Σ_{i=1}^{I} (x_i − ρ Ω w_i) / I
  // Solve for variance
  σ^2 = Σ_{i=1}^{I} (x_i − ρ Ω w_i − τ)^T (x_i − ρ Ω w_i − τ) / 2I
end
0.10.3 ML learning of affine transformation
The affine transformation model maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} with a linear transformation Φ and an offset τ, so that

Pr(x_i|w_i, Φ, τ, σ^2) = Norm_{x_i}[aff[w_i, Φ, τ], σ^2 I],

where the affine transformation is defined as

aff[w_i, Φ, τ] = Φ w_i + τ.
Algorithm 51: Maximum likelihood learning of affine transformation

Input : Training data pairs {x_i, w_i}_{i=1}^{I}
Output: Linear transformation Φ, offset τ, variance σ^2

begin
  // Solve for translation
  τ = Σ_{i=1}^{I} (x_i − w_i) / I
  // Compute intermediate 2×4 matrices A_i
  for i = 1 to I do
    A_i = [w_i^T, 0^T; 0^T, w_i^T]
  end
  // Concatenate matrices A_i into 2I×4 matrix A
  A = [A_1; A_2; . . . ; A_I]
  // Concatenate output points into 2I×1 vector c
  c = [x_1 − τ; x_2 − τ; . . . ; x_I − τ]
  // Solve for linear transformation
  φ = (A^T A)^{−1} A^T c
  Φ = [φ_1, φ_2; φ_3, φ_4]
  // Solve for variance
  σ^2 = Σ_{i=1}^{I} (x_i − Φ w_i − τ)^T (x_i − Φ w_i − τ) / 2I
end
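An affine fit can also be sketched in a single least squares solve. Note that, unlike the two-step listing above, this sketch estimates Φ and τ jointly by appending a constant 1 to each w_i; the names are illustrative:

```python
import numpy as np

def fit_affine(W, X):
    """Least squares affine fit x_i ≈ Φ w_i + τ (cf. Algorithm 51).

    Solves for Φ and τ jointly using homogeneous source points.
    W, X : 2 x I arrays of corresponding points.
    """
    Wh = np.vstack([W, np.ones(W.shape[1])])     # 3 x I homogeneous points
    Theta = X @ np.linalg.pinv(Wh)               # 2 x 3, minimizes ||X - Theta Wh||
    Phi, tau = Theta[:, :2], Theta[:, 2]
    var = np.mean(np.sum((X - Phi @ W - tau[:, None]) ** 2, axis=0)) / 2
    return Phi, tau, var
```

Solving jointly avoids any bias from fixing the translation before the linear part is known.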
0.10.4 ML learning of projective transformation (homography)
The projective transformation model maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} with a non-linear transformation with 3×3 parameter matrix Φ, so that

Pr(x_i|w_i, Φ, σ^2) = Norm_{x_i}[proj[w_i, Φ], σ^2 I],

where the homography is defined as

proj[w_i, Φ] = [ (φ_11 u + φ_12 v + φ_13) / (φ_31 u + φ_32 v + φ_33),  (φ_21 u + φ_22 v + φ_23) / (φ_31 u + φ_32 v + φ_33) ]^T.
Algorithm 52: Maximum likelihood learning of projective transformation

Input : Training data pairs {x_i, w_i}_{i=1}^{I}
Output: Parameter matrix Φ, variance σ^2

begin
  // Convert data to homogeneous representation
  for i = 1 to I do
    w̃_i = [w_i; 1]
  end
  // Compute intermediate 2×9 matrices A_i
  for i = 1 to I do
    A_i = [0^T, −w̃_i^T, y_i w̃_i^T; w̃_i^T, 0^T, −x_i w̃_i^T]
  end
  // Concatenate matrices A_i into 2I×9 matrix A
  A = [A_1; A_2; . . . ; A_I]
  // Solve for approximate parameters
  [U, L, V] = svd[A]
  Φ_0 = [v_19, v_29, v_39; v_49, v_59, v_69; v_79, v_89, v_99]
  // Refine parameters with non-linear optimization
  Φ = optimize[homCostFn[Φ], Φ_0]
  // Solve for variance
  σ^2 = Σ_{i=1}^{I} (x_i − proj[w_i, Φ])^T (x_i − proj[w_i, Φ]) / 2I
end
The cost function for the non-linear optimization is based on the least squares error of the model:

homCostFn[Φ] = Σ_{i=1}^{I} (x_i − proj[w_i, Φ])^T (x_i − proj[w_i, Φ]).

The optimization should be carried out subject to the constraint ||Φ||_F = 1, i.e. that the sum of the squares of the elements of Φ is one.
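The linear (direct linear transform) step of the algorithm above can be sketched as follows; the non-linear refinement is omitted and all names are illustrative:

```python
import numpy as np

def fit_homography(W, X):
    """Direct linear transform (linear step of Algorithm 52).

    W, X : 2 x I arrays with I >= 4 correspondences x_i ~ proj[w_i, Phi].
    Returns Phi with unit Frobenius norm.
    """
    rows = []
    for (u, v), (x, y) in zip(W.T, X.T):
        wt = [u, v, 1.0]                                    # homogeneous w_i
        rows.append([0, 0, 0] + [-c for c in wt] + [y * c for c in wt])
        rows.append(list(wt) + [0, 0, 0] + [-x * c for c in wt])
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(3, 3)   # right singular vector of smallest singular value

def apply_homography(Phi, W):
    """Apply proj[w, Phi] to each column of W."""
    Wh = np.vstack([W, np.ones(W.shape[1])])
    Xh = Phi @ Wh
    return Xh[:2] / Xh[2]
```

Because the solution is the unit-norm null vector of the stacked constraint matrix, the ||Φ||_F = 1 constraint mentioned above is satisfied automatically.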
0.10.5 ML Inference for transformation models
Consider a transformation model that maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} so that

Pr(x_i|w_i, Φ) = Norm_{x_i}[trans[w_i, Φ], σ^2 I].

In inference we are given a new data point x = [x, y]^T and wish to compute the most likely point w = [u, v]^T that was responsible for it. To make progress, we consider the transformation model trans[w_i, Φ] in homogeneous form

λ [x; y; 1] = [φ_11, φ_12, φ_13; φ_21, φ_22, φ_23; φ_31, φ_32, φ_33] [u; v; 1],

or x̃ = Φ w̃. The Euclidean, similarity, affine and projective transformations can all be expressed as a 3×3 matrix of this kind.
Algorithm 53: Maximum likelihood inference for transformation models

Input : Transformation parameters Φ, new point x
Output: Point w

begin
  // Convert data to homogeneous representation
  x̃ = [x; 1]
  // Apply inverse transform
  a = Φ^{−1} x̃
  // Convert back to Cartesian coordinates
  w = [a_1/a_3, a_2/a_3]^T
end
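The three steps above amount to one linear solve; a minimal sketch with illustrative names:

```python
import numpy as np

def infer_w(Phi, x):
    """ML inference for transformation models (sketch of Algorithm 53)."""
    a = np.linalg.solve(Phi, np.append(x, 1.0))  # apply inverse transform
    return a[:2] / a[2]                          # back to Cartesian coordinates
```

Using `solve` rather than explicitly inverting Φ is slightly more accurate and is the idiomatic way to apply Φ^{−1} once.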
0.10.6 Learning extrinsic parameters (planar scene)
Consider a calibrated camera with known intrinsic parameters Λ viewing a planar scene. We are given a set of 2D positions {w_i}_{i=1}^{I} on the plane (measured in real-world units such as cm) and their corresponding 2D pixel positions {x_i}_{i=1}^{I}. The goal of this algorithm is to learn the 3D rotation Ω and translation τ that map a point w = [u, v, w]^T in the frame of reference of the plane (w = 0 on the plane) into the frame of reference of the camera.
Algorithm 54: ML learning of extrinsic parameters (planar scene)

Input : Intrinsic matrix Λ, pairs of points {x_i, w_i}_{i=1}^{I}
Output: Extrinsic parameters: rotation Ω and translation τ

begin
  // Compute homography between pairs of points
  Φ = LearnHomography[{x_i}_{i=1}^{I}, {w_i}_{i=1}^{I}]
  // Eliminate effect of intrinsic parameters
  Φ = Λ^{−1} Φ
  // Compute SVD of first two columns of Φ
  [U, L, V] = svd[[φ_1, φ_2]]
  // Estimate first two columns of rotation matrix
  [ω_1, ω_2] = [u_1, u_2] V^T
  // Estimate third column by taking cross product
  ω_3 = ω_1 × ω_2
  Ω = [ω_1, ω_2, ω_3]
  // Check that determinant is one
  if det[Ω] < 0 then
    Ω = [ω_1, ω_2, −ω_3]
  end
  // Compute scaling factor for translation vector
  λ = (Σ_{i=1}^{3} Σ_{j=1}^{2} ω_{ij}/φ_{ij}) / 6
  // Compute translation
  τ = λ φ_3
  // Refine parameters with non-linear optimization
  [Ω, τ] = optimize[projCost[Ω, τ], {Ω, τ}]
end
The final optimization minimizes the least squares error between the predicted projections of the points w_i into the image and the observed data x_i, so

projCost[Ω, τ] = Σ_{i=1}^{I} (x_i − pinhole[[w_i, 0], Λ, Ω, τ])^T (x_i − pinhole[[w_i, 0], Λ, Ω, τ]).

This optimization should be carried out while enforcing the constraint that Ω remains a valid rotation matrix.
0.10.7 Learning intrinsic parameters (planar scene)
This is also known as camera calibration from a plane. The camera is presented with J views of a plane with unknown poses {Ω_j, τ_j}_{j=1}^{J}. For each image we know I points {w_i}_{i=1}^{I}, where w_i = [u_i, v_i, 0]^T, and we know their imaged positions {x_ij}_{i=1,j=1}^{I,J} in each of the J scenes. The goal is to compute the intrinsic matrix Λ.
Algorithm 55: ML learning of intrinsic parameters (planar scene)

Input : World points {w_i}_{i=1}^{I}, image points {x_ij}_{i=1,j=1}^{I,J}, initial Λ
Output: Intrinsic parameters Λ

begin
  // Main loop for alternating optimization
  for k = 1 to K do
    // Compute extrinsic parameters for each image
    for j = 1 to J do
      [Ω_j, τ_j] = calcExtrinsic[Λ, {w_i, x_ij}_{i=1}^{I}]
    end
    // Compute intrinsic parameters
    for i = 1 to I do
      for j = 1 to J do
        // Compute matrix A_ij
        a_ij = (ω_{1j}^T w_i + τ_{xj})/(ω_{3j}^T w_i + τ_{zj})
        b_ij = (ω_{2j}^T w_i + τ_{yj})/(ω_{3j}^T w_i + τ_{zj})
        A_ij = [a_ij, b_ij, 1, 0, 0; 0, 0, 0, b_ij, 1]
      end
    end
    // Concatenate matrices and data points
    x = [x_11; x_12; . . . ; x_IJ]
    A = [A_11; A_12; . . . ; A_IJ]
    // Compute parameters
    θ = (A^T A)^{−1} A^T x
    Λ = [θ_1, θ_2, θ_3; 0, θ_4, θ_5; 0, 0, 1]
  end
  // Refine parameters with non-linear optimization
  [{Ω_j, τ_j}_{j=1}^{J}, Λ] = optimize[projCost[{Ω_j, τ_j}_{j=1}^{J}, Λ], {{Ω_j, τ_j}_{j=1}^{J}, Λ}]
end
The final optimization minimizes the squared error between the projections of the w_i and the observed data x_ij, respecting the constraints on the rotation matrices Ω_j:

projCost[{Ω_j, τ_j}_{j=1}^{J}, Λ] = Σ_{i=1}^{I} Σ_{j=1}^{J} (x_ij − pinhole[[w_i, 0], Λ, Ω_j, τ_j])^T (x_ij − pinhole[[w_i, 0], Λ, Ω_j, τ_j]).
0.10.8 Robust learning of projective transformation with RANSAC
The goal of this algorithm is to fit a homography that maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} in the case where some of the point matches are known to be wrong (outliers). The algorithm also returns the true matches and the outliers.
Algorithm 56: Robust ML learning of homography

Input : Point pairs {x_i, w_i}_{i=1}^{I}, number of RANSAC steps N, threshold τ
Output: Homography Φ, inlier indices I

begin
  // Initialize best inlier set to empty
  I = {}
  for n = 1 to N do
    // Draw 4 different random integers between 1 and I
    R = RandomSubset[1 . . . I, 4]
    // Compute homography (algorithm 52)
    Φ_n = LearnHomography[{x_i}_{i∈R}, {w_i}_{i∈R}]
    // Initialize set of inliers to empty
    S_n = {}
    for i = 1 to I do
      // Compute squared distance
      d = (x_i − proj[w_i, Φ_n])^T (x_i − proj[w_i, Φ_n])
      // If small enough then add to inliers
      if d < τ^2 then
        S_n = S_n ∪ i
      end
    end
    // If best inlier set so far then store
    if |S_n| > |I| then
      I = S_n
    end
  end
  // Compute homography from all inliers
  Φ = LearnHomography[{x_i}_{i∈I}, {w_i}_{i∈I}]
end
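The RANSAC loop above is independent of the model being fitted. This sketch keeps the same hypothesize/score/re-fit structure but, for brevity, substitutes a translation-only model in which a single pair determines a hypothesis; all names are illustrative:

```python
import numpy as np

def ransac_fit(W, X, n_steps, tau, rng):
    """RANSAC loop of Algorithm 56, shown with a translation-only model
    instead of a homography so that one pair is a minimal sample."""
    best_inliers = np.array([], dtype=int)
    I = W.shape[1]
    for _ in range(n_steps):
        i = rng.integers(I)                      # draw a minimal sample
        t = X[:, i] - W[:, i]                    # hypothesized translation
        d2 = np.sum((X - (W + t[:, None])) ** 2, axis=0)
        inliers = np.flatnonzero(d2 < tau ** 2)  # score the hypothesis
        if len(inliers) > len(best_inliers):     # keep the best inlier set
            best_inliers = inliers
    # Re-fit the model from all inliers
    t = np.mean(X[:, best_inliers] - W[:, best_inliers], axis=1)
    return t, best_inliers

rng = np.random.default_rng(0)
W = rng.random((2, 30)) * 10
X = W + np.array([[2.0], [-1.0]])                # true translation (2, -1)
X[:, :6] += rng.random((2, 6)) * 20 + 5          # corrupt first six pairs
t, inliers = ransac_fit(W, X, n_steps=50, tau=0.5, rng=rng)
```

Swapping `LearnHomography` and the reprojection distance into the hypothesize and score steps recovers the full algorithm.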
0.10.9 Sequential RANSAC for fitting homographies
The goal of this algorithm is to fit K homographies between subsets of the point pairs {w_i, x_i}_{i=1}^{I} using sequential RANSAC.
Algorithm 57: Robust sequential learning of homographies

Input : Point pairs {x_i, w_i}_{i=1}^{I}, number of RANSAC steps N, inlier threshold τ, number of homographies to fit K
Output: K homographies {Φ_k} and associated inlier indices {I_k}

begin
  // Initialize set of indices of remaining point pairs
  S = {1 . . . I}
  for k = 1 to K do
    // Compute homography using RANSAC (algorithm 56)
    [Φ_k, I_k] = LearnHomographyRobust[{x_i}_{i∈S}, {w_i}_{i∈S}, N, τ]
    // Remove inliers from remaining points
    S = S \ I_k
    // Check that there are enough remaining points
    if |S| < 4 then
      break
    end
  end
end
0.10.10 PEaRL for fitting homographies
The propose, expand and re-learn (PEaRL) algorithm first suggests a large number of possible homographies relating the point pairs {w_i, x_i}_{i=1}^{I}. These then compete for the point pairs to be assigned to them, and they are re-learnt based on these assignments.
Algorithm 58: PEaRL learning of homographies

Input : Point pairs {x_i, w_i}_{i=1}^{I}, number of initial models M, inlier threshold τ, minimum number of inliers l, number of iterations J, neighborhood system {N_i}_{i=1}^{I}, pairwise cost P
Output: Set of homographies {Φ_m} and associated inlier indices {I_m}

begin
  // Propose step: generate M hypotheses
  m = 1   // hypothesis number
  repeat
    // Draw 4 different random integers between 1 and I
    R = RandomSubset[1 . . . I, 4]
    // Compute homography (algorithm 52)
    Φ_m = LearnHomography[{x_i}_{i∈R}, {w_i}_{i∈R}]
    I_m = {}   // Initialize inlier set to empty
    for i = 1 to I do
      d_im = (x_i − proj[w_i, Φ_m])^T (x_i − proj[w_i, Φ_m])
      if d_im < τ^2 then   // if distance small, add to inliers
        I_m = I_m ∪ i
      end
    end
    if |I_m| ≥ l then   // if enough inliers, move to next hypothesis
      m = m + 1
    end
  until m > M
  for j = 1 to J do
    // Expand step: returns I×1 label vector L
    L = AlphaExpand[D, P, {N_i}_{i=1}^{I}]
    // Re-learn step: re-estimate homographies with support
    for m = 1 to M do
      I_m = find[L == m]   // Extract points with label m
      // If enough support then re-learn and update distances
      if |I_m| ≥ 4 then
        Φ_m = LearnHomography[{x_i}_{i∈I_m}, {w_i}_{i∈I_m}]
        for i = 1 to I do
          d_im = (x_i − proj[w_i, Φ_m])^T (x_i − proj[w_i, Φ_m])
        end
      end
    end
  end
end
0.11 Multiple cameras
0.11.1 Camera geometry from point matches
This algorithm computes the rotation and translation (up to scale) between the cameras, given a set of I point matches {x_i1, x_i2}_{i=1}^{I} between two images.
Algorithm 59: Extracting relative camera position from point matches

Input : Point pairs {x_i1, x_i2}_{i=1}^{I}, intrinsic matrices Λ_1, Λ_2
Output: Rotation Ω and translation τ between cameras

begin
  // Compute fundamental matrix (algorithm 60)
  F = ComputeFundamental[{x_i1, x_i2}_{i=1}^{I}]
  // Compute essential matrix
  E = Λ_2^T F Λ_1
  // Extract four possible rotations and translations from E
  W = [0, −1, 0; 1, 0, 0; 0, 0, −1]
  [U, L, V] = svd[E]
  τ_1 = ULWU^T ; Ω_1 = UW^{−1}V^T
  τ_2 = ULW^{−1}U^T ; Ω_2 = UWV^T
  τ_3 = −τ_1 ; Ω_3 = Ω_1
  τ_4 = −τ_2 ; Ω_4 = Ω_2
  // For each possibility
  for k = 1 to 4 do
    FailFlag = 0
    // For each point
    for i = 1 to I do
      // Reconstruct point (algorithm 48)
      w = Reconstruct[x_i1, x_i2, Λ_1, Λ_2, 0, I, Ω_k, τ_k]
      // Test if point reconstructed behind camera
      if w_3 < 0 then
        FailFlag = 1
      end
    end
    // If all points in front of camera then return solution
    if FailFlag == 0 then
      Ω = Ω_k
      τ = τ_k
      return
    end
  end
end
0.11.2 Eight point algorithm for fundamental matrix
This algorithm takes a set of I ≥ 8 point correspondences {x_i1, x_i2}_{i=1}^{I} between two images and computes the fundamental matrix using the eight point algorithm. To improve the numerical stability of the computation, the points are transformed before the calculation, and the resulting fundamental matrix is modified to compensate for this transformation.
Algorithm 60: Eight point algorithm for fundamental matrix

Input : Point pairs {x_i1, x_i2}_{i=1}^{I}
Output: Fundamental matrix F

begin
  // Compute statistics of data
  µ_1 = Σ_{i=1}^{I} x_i1 / I
  Σ_1 = Σ_{i=1}^{I} (x_i1 − µ_1)(x_i1 − µ_1)^T / I
  µ_2 = Σ_{i=1}^{I} x_i2 / I
  Σ_2 = Σ_{i=1}^{I} (x_i2 − µ_2)(x_i2 − µ_2)^T / I
  for i = 1 to I do
    // Compute transformed coordinates
    x̂_i1 = Σ_1^{−1/2} (x_i1 − µ_1)
    x̂_i2 = Σ_2^{−1/2} (x_i2 − µ_2)
    // Compute constraint
    A_i = [x̂_i2 x̂_i1, x̂_i2 ŷ_i1, x̂_i2, ŷ_i2 x̂_i1, ŷ_i2 ŷ_i1, ŷ_i2, x̂_i1, ŷ_i1, 1]
  end
  // Append constraints and solve
  A = [A_1; A_2; . . . ; A_I]
  [U, L, V] = svd[A]
  F̂ = [v_19, v_29, v_39; v_49, v_59, v_69; v_79, v_89, v_99]
  // Compensate for transformation
  T_1 = [Σ_1^{−1/2}, −Σ_1^{−1/2} µ_1; 0^T, 1]
  T_2 = [Σ_2^{−1/2}, −Σ_2^{−1/2} µ_2; 0^T, 1]
  F = T_2^T F̂ T_1
  // Ensure that matrix has rank 2
  [U, L, V] = svd[F]
  l_33 = 0
  F = ULV^T
end
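The final rank-2 projection is a common point of confusion and is worth isolating; this sketch implements just that last step, with an arbitrary full-rank matrix standing in for the estimated F:

```python
import numpy as np

def enforce_rank2(F):
    """Final step of Algorithm 60: project F to the nearest rank-2 matrix
    (in the Frobenius sense) by zeroing its smallest singular value."""
    U, l, Vt = np.linalg.svd(F)
    l[2] = 0.0
    return U @ np.diag(l) @ Vt

# Stand-in full-rank matrix for illustration
F = np.array([[1.0, 2.0, 0.5],
              [0.3, 1.5, 2.0],
              [0.7, 0.2, 1.0]])
F2 = enforce_rank2(F)
```

A valid fundamental matrix must be rank 2 so that it has a well-defined epipole (a non-trivial null vector); the linear estimate rarely satisfies this exactly, which is why the projection is needed.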
0.11.3 Robust computation of fundamental matrix with RANSAC
The goal of this algorithm is to estimate the fundamental matrix from 2D point pairs {x_i1, x_i2}_{i=1}^{I} in the case where some of the point matches are known to be wrong (outliers). The algorithm also returns the true matches.
Algorithm 61: Robust ML fitting of fundamental matrix

Input : Point pairs {x_i1, x_i2}_{i=1}^{I}, number of RANSAC steps N, threshold τ
Output: Fundamental matrix F, inlier indices I

begin
  // Initialize best inlier set to empty
  I = {}
  for n = 1 to N do
    // Draw 8 different random integers between 1 and I
    R = RandomSubset[1 . . . I, 8]
    // Compute fundamental matrix (algorithm 60)
    F_n = ComputeFundamental[{x_i1}_{i∈R}, {x_i2}_{i∈R}]
    // Initialize set of inliers to empty
    S_n = {}
    for i = 1 to I do
      // Compute epipolar line in first image
      x̃_i2 = [x_i2; 1]
      l = x̃_i2^T F_n
      // Compute squared distance to epipolar line
      d_1 = (l_1 x_i1 + l_2 y_i1 + l_3)^2 / (l_1^2 + l_2^2)
      // Compute epipolar line in second image
      x̃_i1 = [x_i1; 1]
      l' = F_n x̃_i1
      // Compute squared distance to epipolar line
      d_2 = (l'_1 x_i2 + l'_2 y_i2 + l'_3)^2 / (l'_1^2 + l'_2^2)
      // If both small enough then add to inliers
      if (d_1 < τ^2) and (d_2 < τ^2) then
        S_n = S_n ∪ i
      end
    end
    // If best inlier set so far then store
    if |S_n| > |I| then
      I = S_n
    end
  end
  // Compute fundamental matrix from all inliers
  F = ComputeFundamental[{x_i1}_{i∈I}, {x_i2}_{i∈I}]
end
0.11.4 Planar rectification
This algorithm computes homographies that can be used to rectify the two images. The homography for the second image is chosen so that it moves the epipole to infinity. The homography for the first image is chosen so that the matches lie on the same horizontal lines as in the second image and the distance between the matches is smallest in a least squares sense (i.e., the disparity is smallest).
Algorithm 62: Planar rectification

Input : Point pairs {x_i1, x_i2}_{i=1}^I
Output: Homographies Φ_1, Φ_2 to transform first and second images
begin
  // Compute fundamental matrix (algorithm ??)
  F = ComputeFundamental[{x_i1, x_i2}_{i=1}^I]
  // Compute epipole in image 2
  [U, L, V] = svd[F]
  e = [u_13, u_23, u_33]^T
  // Compute three transformation matrices, where (δ_x, δ_y) is the image center
  T_1 = [1, 0, −δ_x; 0, 1, −δ_y; 0, 0, 1]
  θ = atan2[e_y − δ_y, e_x − δ_x]
  T_2 = [cos θ, sin θ, 0; −sin θ, cos θ, 0; 0, 0, 1]
  T_3 = [1, 0, 0; 0, 1, 0; −1/(cos θ (e_x − δ_x) + sin θ (e_y − δ_y)), 0, 1]
  // Compute homography for second image
  Φ_2 = T_3 T_2 T_1
  // Compute factorization of fundamental matrix
  L̃ = diag[l_11, l_22, (l_11 + l_22)/2]
  W = [0, −1, 0; 1, 0, 0; 0, 0, 1]
  M = U L̃ W V^T
  for i = 1 to I do
    // Transform points
    x'_i1 = hom[x_i1, Φ_2 M]
    x'_i2 = hom[x_i2, Φ_2]
    // Create elements of A and b
    A_i = [x'_i1, y'_i1, 1]
    b_i = x'_i2 − x'_i1
  end
  // Concatenate elements of A and b
  A = [A_1; A_2; . . . ; A_I]
  b = [b_1; b_2; . . . ; b_I]
  // Solve for α in the least-squares sense
  α = (A^T A)^{−1} A^T b
  // Calculate homography for first image
  Φ_1 = (I + [1, 0, 0]^T α^T) Φ_2 M
end
0.12 Shape Models
0.12.1 Snake
0.12.2 Template model
0.12.3 Generalized Procrustes analysis
The goal of generalized Procrustes analysis is to align a set of shape vectors {w_i}_{i=1}^I with respect to a given transformation family (Euclidean, similarity, affine, etc.). Each shape vector consists of a set of N 2D points, w_i = [w_i1^T, w_i2^T, . . . , w_iN^T]^T. In the algorithm below, we use the example of registering with respect to a similarity transformation, which consists of a rotation Ω, scaling ρ, and translation τ.
Algorithm 63: Generalized Procrustes analysis

Input : Shape vectors {w_i}_{i=1}^I, number of iterations K
Output: Template w̄, transformations {Ω_i, ρ_i, τ_i}_{i=1}^I
begin
  Initialize w̄ = w_1
  // Main iteration loop
  for k = 1 to K do
    // For each shape
    for i = 1 to I do
      // Compute transformation to template (algorithm ??)
      [Ω_i, ρ_i, τ_i] = EstimateSimilarity[{w̄_n}_{n=1}^N, {w_in}_{n=1}^N]
    end
    // Update template points (average of inverse-transformed shapes)
    for n = 1 to N do
      w̄_n = (1/I) Σ_{i=1}^I Ω_i^T (w_in − τ_i)/ρ_i
    end
  end
end
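The following NumPy sketch implements this alternation; `similarity_align` is our stand-in for the EstimateSimilarity routine referenced above (an orthogonal Procrustes fit with scale), and the gauge-fixing normalization of the template is an added detail that prevents it from drifting or shrinking:

```python
import numpy as np

def similarity_align(w, template):
    """Similarity (Omega, rho, tau) minimizing sum_n ||rho*Omega w_n + tau - template_n||^2."""
    mw, mt = w.mean(axis=0), template.mean(axis=0)
    A, B = w - mw, template - mt
    U, s, Vt = np.linalg.svd(B.T @ A)                    # SVD of the cross-covariance
    D = np.diag([1.0, np.sign(np.linalg.det(U @ Vt))])   # guard against reflections
    Omega = U @ D @ Vt
    rho = (s * np.diag(D)).sum() / (A ** 2).sum()
    tau = mt - rho * Omega @ mw
    return Omega, rho, tau

def generalized_procrustes(shapes, n_iters=10):
    """shapes: list of (N, 2) point arrays.  Alternates between aligning each shape
    to the template and re-estimating the template, as in the algorithm above."""
    template = shapes[0].copy()
    for _ in range(n_iters):
        aligned = []
        for w in shapes:
            Omega, rho, tau = similarity_align(w, template)
            aligned.append(rho * w @ Omega.T + tau)      # map shape into template frame
        template = np.mean(aligned, axis=0)
        template -= template.mean(axis=0)                # fix translation/scale gauge
        template /= np.linalg.norm(template)
    # Final transforms relative to the converged template
    transforms = [similarity_align(w, template) for w in shapes]
    return template, transforms
```

Mapping each shape forward into the template frame is equivalent (up to parameterization) to the inverse-transform average in the pseudocode.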
0.12.4 Probabilistic principal components analysis
The probabilistic principal components analysis algorithm describes a set of I D×1 data examples {x_i}_{i=1}^I with the model

Pr(x_i) = Norm_{x_i}[µ, ΦΦ^T + σ^2 I],

where µ is the D×1 mean vector and Φ is a D×K matrix containing the K principal components in its columns. The principal components define a K-dimensional subspace, and the parameter σ^2 explains the variation of the data around this subspace.
Algorithm 64: ML learning of PPCA model

Input : Training data {x_i}_{i=1}^I, number of principal components K
Output: Parameters µ, Φ, σ^2
begin
  // Estimate mean parameter
  µ = Σ_{i=1}^I x_i / I
  // Form matrix of mean-zero data
  X = [x_1 − µ, x_2 − µ, . . . , x_I − µ]
  // Decompose X^T X to find V and the eigenvalues l_jj
  [V, L, V^T] = svd[X^T X]
  U = XVL^{−1/2}
  // Estimate noise parameter from the discarded eigenvalues
  σ^2 = Σ_{j=K+1}^D l_jj / (D − K)
  // Estimate principal components
  U_K = [u_1, u_2, . . . , u_K]
  L_K = diag[l_11, l_22, . . . , l_KK]
  Φ = U_K (L_K − σ^2 I)^{1/2}
end
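The closed-form solution above (due to Tipping and Bishop) can be sketched directly in NumPy; for simplicity this version uses an SVD of the centred data rather than the dual decomposition, and assumes I ≥ D:

```python
import numpy as np

def ppca_ml(X, K):
    """Closed-form ML fit of PPCA.  X: (I, D), rows are data, assumes I >= D.
    Returns mu, Phi (D, K), sigma2."""
    I, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    # Eigendecomposition of the sample covariance via an SVD of the centred data
    _, s, Vt = np.linalg.svd(Xc / np.sqrt(I), full_matrices=False)
    lam = s ** 2                                   # covariance eigenvalues, descending
    sigma2 = lam[K:].sum() / (D - K)               # average discarded variance
    Phi = Vt[:K].T * np.sqrt(lam[:K] - sigma2)     # principal subspace, rescaled
    return mu, Phi, sigma2
```

A useful property for checking an implementation: the fitted covariance ΦΦ^T + σ^2 I reproduces the top K sample-covariance eigenvalues exactly and replaces the rest with σ^2.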
0.12.5 Active shape model
0.13 Models for style and identity
0.13.1 ML learning of subspace identity model
This model describes the jth of J data examples from the ith of I identities as

x_ij = µ + Φ h_i + ε_ij,

where x_ij is the D×1 observed data, µ is the D×1 mean vector, Φ is the D×K factor matrix, h_i is the K×1 hidden variable representing the identity, and ε_ij is a D×1 additive multivariate normal noise term with diagonal covariance Σ.
Algorithm 65: Maximum likelihood learning for identity subspace model

Input : Training data {x_ij}_{i=1,j=1}^{I,J}, number of factors K
Output: Maximum likelihood estimates of parameters θ = {µ, Φ, Σ}
begin
  Initialize θ = θ_0 (see note a)
  // Set mean
  µ = Σ_{i=1}^I Σ_{j=1}^J x_ij / (IJ)
  repeat
    // Expectation step
    for i = 1 to I do
      E[h_i] = (J Φ^T Σ^{−1} Φ + I)^{−1} Φ^T Σ^{−1} Σ_{j=1}^J (x_ij − µ)
      E[h_i h_i^T] = (J Φ^T Σ^{−1} Φ + I)^{−1} + E[h_i] E[h_i]^T
    end
    // Maximization step
    Φ = (Σ_{i=1}^I Σ_{j=1}^J (x_ij − µ) E[h_i]^T)(Σ_{i=1}^I J E[h_i h_i^T])^{−1}
    Σ = (1/IJ) Σ_{i=1}^I Σ_{j=1}^J diag[(x_ij − µ)(x_ij − µ)^T − Φ E[h_i](x_ij − µ)^T]
    // Compute data log likelihood
    for i = 1 to I do
      x'_i = [x_i1^T, x_i2^T, . . . , x_iJ^T]^T    // compound data vector, JD×1
    end
    µ' = [µ^T, µ^T, . . . , µ^T]^T                  // compound mean vector, JD×1
    Φ' = [Φ^T, Φ^T, . . . , Φ^T]^T                  // compound factor matrix, JD×K
    Σ' = diag[Σ, Σ, . . . , Σ]                      // compound covariance, JD×JD
    L = Σ_{i=1}^I log[Norm_{x'_i}[µ', Φ'Φ'^T + Σ']] (see note b)
  until No further improvement in L
end

a It is usual to initialize Φ to random values. The D diagonal elements of Σ can be initialized to the variances of the D data dimensions.
b In high dimensions it is worth reformulating this covariance using the Woodbury relation (section ??).
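A compact NumPy sketch of this EM procedure follows; the vectorization over identities and the function name are ours, and the log likelihood is evaluated directly on the compound JD×JD covariance (practical only when JD is small):

```python
import numpy as np

def identity_subspace_em(X, K, n_iters=30, seed=0):
    """EM for x_ij = mu + Phi h_i + eps_ij.  X: (I, J, D).
    Returns mu, Phi (D, K), diagonal noise variances Sig (D,), log-likelihood trace."""
    rng = np.random.default_rng(seed)
    I, J, D = X.shape
    mu = X.mean(axis=(0, 1))
    Phi = rng.normal(size=(D, K))                        # random initialization
    Sig = X.reshape(-1, D).var(axis=0)                   # data variances
    Xc = X - mu
    Xsum = Xc.sum(axis=1)                                # (I, D): per-identity sums
    lls = []
    for _ in range(n_iters):
        # E-step: posterior over the per-identity hidden variable h_i
        SinvPhi = Phi / Sig[:, None]
        A = np.linalg.inv(np.eye(K) + J * Phi.T @ SinvPhi)   # posterior covariance
        Eh = Xsum @ SinvPhi @ A.T                            # (I, K) posterior means
        sumEhh = I * A + Eh.T @ Eh                           # sum_i E[h_i h_i^T]
        # M-step
        Phi = Xsum.T @ Eh @ np.linalg.inv(J * sumEhh)
        Sig = ((Xc ** 2).sum(axis=(0, 1))
               - ((Eh @ Phi.T) * Xsum).sum(axis=0)) / (I * J)
        # Log likelihood of compound vectors x'_i ~ Norm[mu', Phi' Phi'^T + Sigma']
        PhiC = np.tile(Phi, (J, 1))                          # (JD, K) stacked factors
        C = PhiC @ PhiC.T + np.kron(np.eye(J), np.diag(Sig))
        Xf = Xc.reshape(I, J * D)
        _, logdet = np.linalg.slogdet(C)
        quad = np.einsum('id,id->', Xf, np.linalg.solve(C, Xf.T).T)
        lls.append(-0.5 * (I * (J * D * np.log(2 * np.pi) + logdet) + quad))
    return mu, Phi, Sig, lls
```

Since this is an exact EM algorithm, the log likelihood trace should be non-decreasing; that is a convenient correctness check.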
0.13.2 Identity matching subspace identity model
To perform inference about the identities of newly observed data {x_n}_{n=1}^N, we build M competing models that explain the data in terms of different identities and which correspond to world states y = 1 . . . M. We define a prior Pr(y = m) = λ_m for each model. Then we compute the posterior over world states using Bayes' rule:

Pr(y = m|x_{1...N}) = Pr(x_{1...N}|y = m) Pr(y = m) / Σ_{p=1}^M Pr(x_{1...N}|y = p) Pr(y = p).   (5)

Let the mth model divide the data into Q non-overlapping partitions {S_q}_{q=1}^Q, where each subset S_q is assumed to belong to the same identity. We compute the likelihood Pr(x_{1...N}|y = m) as

Pr(x_{1...N}|y = m) = Π_{q=1}^Q Pr(S_q|θ).   (6)

The likelihood of the qth subset is given by

Pr(S_q|θ) = Norm_{x'}[µ', Φ'Φ'^T + Σ'],   (7)

where x' is a compound data vector formed by stacking all of the data associated with cluster S_q on top of each other. If there are |S_q| data vectors associated with S_q then this will be a |S_q|D×1 vector. Similarly, the vector µ' is a |S_q|D×1 compound mean formed by stacking |S_q| copies of the mean vector µ on top of each other, Φ' is a |S_q|D×K compound factor matrix formed by stacking |S_q| copies of Φ on top of each other, and Σ' is a |S_q|D×|S_q|D compound covariance matrix which is block diagonal with each block equal to Σ. In high dimensions it is worth reformulating the covariance using the Woodbury relation (section ??).
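The Woodbury reformulation mentioned here can be sketched concretely: for a Gaussian with low-rank-plus-diagonal covariance, the log density can be evaluated through a K×K system instead of a D×D one. The function name and the diagonal-Σ assumption below are ours:

```python
import numpy as np

def lowrank_gauss_logpdf(x, mu, Phi, sig_diag):
    """log Norm_x[mu, Phi Phi^T + Sigma] via the Woodbury identity and the matrix
    determinant lemma, for diagonal Sigma (length-D vector of variances).
    Cost is O(D K^2) rather than O(D^3)."""
    D, K = Phi.shape
    d = x - mu
    SinvPhi = Phi / sig_diag[:, None]                 # Sigma^{-1} Phi
    M = np.eye(K) + Phi.T @ SinvPhi                   # K x K capacitance matrix
    v = SinvPhi.T @ d                                 # Phi^T Sigma^{-1} (x - mu)
    # Woodbury: d^T (Sigma + Phi Phi^T)^{-1} d = d^T Sigma^{-1} d - v^T M^{-1} v
    quad = d @ (d / sig_diag) - v @ np.linalg.solve(M, v)
    # Determinant lemma: log|Sigma + Phi Phi^T| = log|M| + sum log sigma_d
    logdet = np.linalg.slogdet(M)[1] + np.log(sig_diag).sum()
    return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)
```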
0.13.3 ML learning of PLDA model
PLDA describes the jth of J data examples from the ith of I identities as

x_ij = µ + Φ h_i + Ψ s_ij + ε_ij,

where all terms are the same as in the subspace identity model, but we now add Ψ, the D×L within-individual factor matrix, and s_ij, the L×1 style variable.
Algorithm 66: Maximum likelihood learning for PLDA model

Input : Training data {x_ij}_{i=1,j=1}^{I,J}, numbers of factors K, L
Output: Maximum likelihood estimates of parameters θ = {µ, Φ, Ψ, Σ}
begin
  Initialize θ = θ_0 (see note a)
  // Set mean
  µ = Σ_{i=1}^I Σ_{j=1}^J x_ij / (IJ)
  repeat
    µ' = [µ^T, µ^T, . . . , µ^T]^T     // compound mean vector, JD×1
    Φ' = [Φ^T, Φ^T, . . . , Φ^T]^T     // compound factor matrix 1, JD×K
    Ψ' = diag[Ψ, Ψ, . . . , Ψ]         // compound factor matrix 2, JD×JL
    Φ'' = [Φ', Ψ']                      // concatenated matrices, JD×(K+JL)
    Σ' = diag[Σ, Σ, . . . , Σ]         // compound covariance, JD×JD
    // Expectation step
    for i = 1 to I do
      x'_i = [x_i1^T, x_i2^T, . . . , x_iJ^T]^T   // compound data vector, JD×1
      E[h'_i] = (Φ''^T Σ'^{−1} Φ'' + I)^{−1} Φ''^T Σ'^{−1} (x'_i − µ')
      E[h'_i h'_i^T] = (Φ''^T Σ'^{−1} Φ'' + I)^{−1} + E[h'_i] E[h'_i]^T
      for j = 1 to J do
        // Extract indices relevant to x_ij
        S_ij = [1 . . . K, K+(j−1)L+1 . . . K+jL]
        E[h''_ij] = E[h'_i](S_ij)
        E[h''_ij h''_ij^T] = E[h'_i h'_i^T](S_ij, S_ij)
      end
    end
    // Maximization step
    Φ̂ = (Σ_{i=1}^I Σ_{j=1}^J (x_ij − µ) E[h''_ij]^T)(Σ_{i=1}^I Σ_{j=1}^J E[h''_ij h''_ij^T])^{−1}
    Φ = Φ̂(:, 1:K)          // Extract identity factor matrix
    Ψ = Φ̂(:, K+1:K+L)      // Extract style factor matrix
    Σ = (1/IJ) Σ_{i=1}^I Σ_{j=1}^J diag[(x_ij − µ)(x_ij − µ)^T − Φ̂ E[h''_ij](x_ij − µ)^T]
    // Compute data log likelihood
    L = Σ_{i=1}^I log[Norm_{x'_i}[µ', Φ''Φ''^T + Σ']]
  until No further improvement in L
end

a Initialize Ψ to random values; initialize the other variables as in the identity subspace model.
0.13.4 Identity matching using PLDA model
To perform inference about the identities of newly observed data {x_n}_{n=1}^N, we build M competing models that explain the data in terms of different identities and which correspond to world states y = 1 . . . M. We define a prior Pr(y = m) = λ_m for each model. Then we compute the posterior over world states using Bayes' rule:

Pr(y = m|x_{1...N}) = Pr(x_{1...N}|y = m) Pr(y = m) / Σ_{p=1}^M Pr(x_{1...N}|y = p) Pr(y = p).   (8)

Let the mth model divide the data into Q non-overlapping partitions {S_q}_{q=1}^Q, where each subset S_q is assumed to belong to the same identity. We compute the likelihood Pr(x_{1...N}|y = m) as

Pr(x_{1...N}|y = m) = Π_{q=1}^Q Pr(S_q|θ).   (9)

The likelihood of the qth subset is given by

Pr(S_q|θ) = Norm_{x'}[µ', Φ'Φ'^T + Σ'],   (10)

where x' is a compound data vector formed by stacking all of the data associated with cluster S_q on top of each other. If there are |S_q| data vectors associated with S_q then this will be a |S_q|D×1 vector. Similarly, the vector µ' is a |S_q|D×1 compound mean formed by stacking |S_q| copies of the mean vector µ on top of each other. The matrix Φ' is a |S_q|D×(K + |S_q|L) compound factor matrix which is constructed as

Φ' = [ Φ  Ψ  0  . . .  0
       Φ  0  Ψ  . . .  0
       ⋮   ⋮   ⋮    ⋱    ⋮
       Φ  0  0  . . .  Ψ ].   (11)

Finally, Σ' is a |S_q|D×|S_q|D compound covariance matrix which is block diagonal with each block equal to Σ. In high dimensions it is worth reformulating the covariance using the Woodbury relation (section ??).
0.13.5 ML learning of asymmetric bilinear model
This model describes the jth data example from the ith identity in the sth style as

x_ijs = µ_s + Φ_s h_i + ε_ijs,

where the terms have the same interpretation as for the subspace identity model, except that there is now one set of parameters θ_s = {µ_s, Φ_s, Σ_s} per style s.
Algorithm 67: Maximum likelihood learning for asymmetric bilinear model

Input : Training data {x_ijs}_{i=1,j=1,s=1}^{I,J,S}, number of factors K
Output: ML estimates of parameters θ = {µ_{1...S}, Φ_{1...S}, Σ_{1...S}}
begin
  Initialize θ = θ_0
  for s = 1 to S do
    // Set mean
    µ_s = Σ_{i=1}^I Σ_{j=1}^J x_ijs / (IJ)
  end
  repeat
    // Expectation step
    for i = 1 to I do
      E[h_i] = (I + J Σ_{s=1}^S Φ_s^T Σ_s^{−1} Φ_s)^{−1} Σ_{s=1}^S Φ_s^T Σ_s^{−1} Σ_{j=1}^J (x_ijs − µ_s)
      E[h_i h_i^T] = (I + J Σ_{s=1}^S Φ_s^T Σ_s^{−1} Φ_s)^{−1} + E[h_i] E[h_i]^T
    end
    // Maximization step
    for s = 1 to S do
      Φ_s = (Σ_{i=1}^I Σ_{j=1}^J (x_ijs − µ_s) E[h_i]^T)(Σ_{i=1}^I J E[h_i h_i^T])^{−1}
      Σ_s = (1/IJ) Σ_{i=1}^I Σ_{j=1}^J diag[(x_ijs − µ_s)(x_ijs − µ_s)^T − Φ_s E[h_i](x_ijs − µ_s)^T]
    end
    // Compute data log likelihood
    for s = 1 to S do
      Φ'_s = [Φ_s^T, Φ_s^T, . . . , Φ_s^T]^T   // J copies, JD×K
      Σ'_s = diag[Σ_s, Σ_s, . . . , Σ_s]       // J copies, JD×JD
      µ'_s = [µ_s^T, µ_s^T, . . . , µ_s^T]^T   // J copies, JD×1
      for i = 1 to I do
        x'_is = [x_i1s^T, x_i2s^T, . . . , x_iJs^T]^T
      end
    end
    for i = 1 to I do
      x'_i = [x'_i1^T, x'_i2^T, . . . , x'_iS^T]^T   // compound data vector, JSD×1
    end
    µ' = [µ'_1^T, µ'_2^T, . . . , µ'_S^T]^T           // compound mean vector, JSD×1
    Φ' = [Φ'_1^T, Φ'_2^T, . . . , Φ'_S^T]^T           // compound factor matrix, JSD×K
    Σ' = diag[Σ'_1, Σ'_2, . . . , Σ'_S]               // compound covariance, JSD×JSD
    L = Σ_{i=1}^I log[Norm_{x'_i}[µ', Φ'Φ'^T + Σ']]
  until No further improvement in L
end
0.13.6 Identity matching with asymmetric bilinear model
This formulation assumes that the style s of each observed data example is known. To perform inference about the identities of newly observed data {x_n}_{n=1}^N, we build M competing models that explain the data in terms of different identities and which correspond to world states y = 1 . . . M. We define a prior Pr(y = m) = λ_m for each model. Then we compute the posterior over world states using Bayes' rule:

Pr(y = m|x_{1...N}) = Pr(x_{1...N}|y = m) Pr(y = m) / Σ_{p=1}^M Pr(x_{1...N}|y = p) Pr(y = p).   (12)

Let the mth model divide the data into Q non-overlapping partitions {S_q}_{q=1}^Q, where each subset S_q is assumed to belong to the same identity. We compute the likelihood Pr(x_{1...N}|y = m) as

Pr(x_{1...N}|y = m) = Π_{q=1}^Q Pr(S_q|θ).   (13)

The likelihood of the qth subset is given by

Pr(S_q|θ) = Norm_{x'}[µ', Φ'Φ'^T + Σ'],   (14)

where x' is a compound data vector formed by stacking all of the data associated with cluster S_q on top of each other. If there are |S_q| data vectors associated with S_q then this will be a |S_q|D×1 vector. Similarly, the vector µ' is a |S_q|D×1 compound mean formed by stacking the appropriate mean vectors µ_s for the style of each example on top of each other. The matrix Φ' is a |S_q|D×K compound factor matrix constructed by stacking the factor matrices Φ_s on top of each other, where each style matches that of the corresponding data example. Finally, Σ' is a |S_q|D×|S_q|D compound covariance matrix which is block diagonal with each block equal to Σ_s, where the style is again chosen to match the data. In high dimensions it is worth reformulating the covariance using the Woodbury relation (section ??).
0.13.7 Style translation with asymmetric bilinear model
Algorithm 68: Style translation with asymmetric bilinear model

Input : Example x in style s_1, model parameters θ
Output: Prediction x* for the data in style s_2
begin
  // Estimate hidden variable
  E[h] = (I + Φ_{s1}^T Σ_{s1}^{−1} Φ_{s1})^{−1} Φ_{s1}^T Σ_{s1}^{−1} (x − µ_{s1})
  // Predict in different style
  x* = µ_{s2} + Φ_{s2} E[h]
end
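This two-line computation translates almost directly into NumPy; the function name is ours, and the diagonal noise covariance is passed as a vector of variances:

```python
import numpy as np

def style_translate(x, mu1, Phi1, sig1, mu2, Phi2):
    """Map an example x observed in style 1 to its prediction in style 2.
    sig1: diagonal of the style-1 noise covariance (length-D vector)."""
    K = Phi1.shape[1]
    SinvPhi = Phi1 / sig1[:, None]                       # Sigma^{-1} Phi
    Eh = np.linalg.solve(np.eye(K) + Phi1.T @ SinvPhi,   # posterior mean of h
                         SinvPhi.T @ (x - mu1))
    return mu2 + Phi2 @ Eh
```

In the limit of vanishing noise, the posterior mean recovers the true hidden variable, so a noise-free example maps exactly to its counterpart in the second style.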
0.13.8 Symmetric bilinear model
0.14 Temporal models
0.14.1 Kalman filter
0.14.2 Kalman smoother
0.14.3 Extended Kalman filter
0.14.4 Iterated extended Kalman filter
0.14.5 Unscented Kalman filter
0.14.6 Condensation algorithm
0.14.7 Bag of features model
The bag of features model treats each object class as a distribution over discrete features f, regardless of their position in the image. Assume that there are I images with J_i features in the ith image, and denote the jth feature in the ith image by f_ij. Then we have

Pr(X_i|w = n) = Π_{j=1}^{J_i} Cat_{f_ij}[λ_n].   (15)
Algorithm 69: Learn bag of words model

Input : Features {f_ij}_{i=1,j=1}^{I,J_i}, labels {w_i}_{i=1}^I, Dirichlet parameter α
Output: Model parameters {λ_n}_{n=1}^N
begin
  // For each object class
  for n = 1 to N do
    // For each feature value
    for k = 1 to K do
      // Compute number of times feature k is observed for object class n
      N^f_nk = Σ_{i=1}^I Σ_{j=1}^{J_i} δ[w_i − n] δ[f_ij − k]
    end
    // Compute MAP estimates of the parameters
    for k = 1 to K do
      λ_nk = (N^f_nk + α − 1)/(Σ_{k'=1}^K N^f_{nk'} + Kα − K)
    end
  end
end
We can then define a prior Pr(w) over the N object classes and classify a new image using Bayes' rule:

Pr(w = n|X) = Pr(X|w = n) Pr(w = n) / Σ_{n'=1}^N Pr(X|w = n') Pr(w = n').   (16)
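Learning and classification together fit in a few lines of NumPy. The function names are ours; the classifier sums log probabilities to avoid underflow on long feature lists:

```python
import numpy as np

def learn_bof(features, labels, n_classes, n_feats, alpha=2.0):
    """features: list of int arrays (feature ids 0..n_feats-1), one per image;
    labels: class of each image in 0..n_classes-1.
    MAP estimate of per-class categoricals under a Dirichlet(alpha) prior."""
    counts = np.zeros((n_classes, n_feats))
    for f, w in zip(features, labels):
        np.add.at(counts[w], f, 1.0)          # histogram the features per class
    lam = counts + alpha - 1.0                # MAP: add alpha - 1 pseudo-counts
    return lam / lam.sum(axis=1, keepdims=True)

def classify_bof(f, lam, prior):
    """Posterior over classes for a new bag of features f."""
    logp = np.log(prior) + np.log(lam[:, f]).sum(axis=1)
    logp -= logp.max()                        # stabilize before exponentiating
    p = np.exp(logp)
    return p / p.sum()
```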
0.14.8 Latent Dirichlet Allocation
The LDA model describes a discrete set of features f_ij ∈ {1 . . . K} as a mixture of M categorical distributions (parts), where the categorical distributions themselves are shared, but the mixture weights π_i differ from image to image.
Algorithm 70: Learn latent Dirichlet allocation model

Input : Features {f_ij}_{i=1,j=1}^{I,J_i}, Dirichlet parameters α, β
Output: Model parameters {λ_m}_{m=1}^M, {π_i}_{i=1}^I
begin
  // Initialize categorical parameters
  θ = θ_0 (see note a)
  // Initialize count parameters
  N^(f) = 0
  N^(p) = 0
  for i = 1 to I do
    for j = 1 to J_i do
      // Initialize hidden variables (part labels)
      p_ij = randInt[M]
      // Update count parameters
      N^(f)_{p_ij, f_ij} = N^(f)_{p_ij, f_ij} + 1
      N^(p)_{i, p_ij} = N^(p)_{i, p_ij} + 1
    end
  end
  // Main MCMC loop
  for t = 1 to T do
    p^(t) = MCMCSample[p, f, N^(f), N^(p), {λ_m}_{m=1}^M, {π_i}_{i=1}^I, M, K]
  end
  // Choose samples to use for parameter estimates
  S_t = [BurnInTime : SkipTime : LastSample]
  for i = 1 to I do
    for m = 1 to M do
      π_im = Σ_{j=1}^{J_i} Σ_{t∈S_t} δ[p^(t)_ij − m] + α
    end
    π_i = π_i / Σ_{m=1}^M π_im
  end
  for m = 1 to M do
    for k = 1 to K do
      λ_mk = Σ_{i=1}^I Σ_{j=1}^{J_i} Σ_{t∈S_t} δ[p^(t)_ij − m] δ[f_ij − k] + β
    end
    λ_m = λ_m / Σ_{k=1}^K λ_mk
  end
end

a One way to do this is to set the categorical parameters {λ_m}_{m=1}^M, {π_i}_{i=1}^I to random values by generating positive random vectors and normalizing them to sum to one.
Algorithm 71: MCMC sampling for LDA

Input : Part labels p, features f, counts N^(f), N^(p), {λ_m}_{m=1}^M, {π_i}_{i=1}^I, M, K
Output: Part label sample p
begin
  repeat
    // Choose next feature
    (a, b) = ChooseFeature[J_1, J_2, . . . , J_I]
    // Remove this feature from the statistics
    N^(f)_{p_ab, f_ab} = N^(f)_{p_ab, f_ab} − 1
    N^(p)_{a, p_ab} = N^(p)_{a, p_ab} − 1
    // Compute conditional distribution over parts
    for m = 1 to M do
      q_m = (N^(f)_{m, f_ab} + β)(N^(p)_{a, m} + α) / (Σ_{k=1}^K (N^(f)_{m,k} + β))
    end
    // Normalize
    q = q / Σ_{m=1}^M q_m
    // Draw new part label
    p_ab = DrawCategorical[q]
    // Replace the feature in the statistics
    N^(f)_{p_ab, f_ab} = N^(f)_{p_ab, f_ab} + 1
    N^(p)_{a, p_ab} = N^(p)_{a, p_ab} + 1
  until all part labels p_ij have been updated
end
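Algorithms 70 and 71 can be combined into a single collapsed Gibbs sampler. The following NumPy sketch is our own condensation (it sweeps tokens in order rather than calling a separate ChooseFeature routine, and estimates the parameters from the final sample only):

```python
import numpy as np

def lda_gibbs(docs, M, K, alpha=0.5, beta=0.1, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA.  docs: list of int arrays (feature ids
    0..K-1), one per image; M parts.  Returns estimates lam (M, K) and pi (I, M)."""
    rng = np.random.default_rng(seed)
    I = len(docs)
    Nf = np.zeros((M, K))                      # part-feature counts N^(f)
    Np = np.zeros((I, M))                      # image-part counts N^(p)
    z = [rng.integers(M, size=len(d)) for d in docs]
    for i, d in enumerate(docs):               # initialize counts
        for j, f in enumerate(d):
            Nf[z[i][j], f] += 1
            Np[i, z[i][j]] += 1
    for _ in range(n_iters):
        for i, d in enumerate(docs):
            for j, f in enumerate(d):
                m = z[i][j]
                Nf[m, f] -= 1; Np[i, m] -= 1   # remove token from statistics
                q = (Nf[:, f] + beta) / (Nf.sum(axis=1) + K * beta) * (Np[i] + alpha)
                m = rng.choice(M, p=q / q.sum())
                z[i][j] = m                    # draw new part label
                Nf[m, f] += 1; Np[i, m] += 1   # replace token in statistics
    lam = (Nf + beta) / (Nf + beta).sum(axis=1, keepdims=True)
    pi = (Np + alpha) / (Np + alpha).sum(axis=1, keepdims=True)
    return lam, pi
```

With clearly separable data (e.g., two parts using disjoint halves of the vocabulary), each learned part should concentrate on one vocabulary half, up to label switching.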
0.15 Preprocessing
0.15.1 Principal components analysis
The goal of PCA is to approximate a set of multivariate data {x_i}_{i=1}^I with a second set of variables {h_i}_{i=1}^I of reduced dimension, so that

x_i ≈ µ + Φ h_i,

where Φ is a rectangular matrix whose columns are unit length and orthogonal to one another, so that Φ^T Φ = I.
Algorithm 72: Principal components analysis

Input : Training data {x_i}_{i=1}^I, number of components K
Output: Mean µ, PCA basis functions Φ, low-dimensional data {h_i}_{i=1}^I
begin
  // Estimate mean
  µ = Σ_{i=1}^I x_i / I
  // Form mean-zero data matrix
  X = [x_1 − µ, x_2 − µ, . . . , x_I − µ]
  // Do spectral decomposition
  [U, L, U^T] = svd[X^T X]
  // Compute dual principal components
  Ψ = [u_1, u_2, . . . , u_K]
  L_K = diag[l_11, l_22, . . . , l_KK]
  // Compute principal components (scaled so that Φ^T Φ = I)
  Φ = XΨL_K^{−1/2}
  // Convert data to low-dimensional representation
  for i = 1 to I do
    h_i = Φ^T (x_i − µ)
  end
end
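A NumPy sketch of this dual formulation follows (the function name is ours; the Gram-matrix route is the point of the dual form, since the I×I Gram matrix is cheaper than the D×D scatter matrix when D ≫ I):

```python
import numpy as np

def pca(X, K):
    """Dual-form PCA.  X: (I, D), rows are data.
    Returns mean mu, orthonormal basis Phi (D, K), coordinates H (I, K)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    G = Xc @ Xc.T                          # Gram matrix, I x I
    lam, V = np.linalg.eigh(G)
    order = np.argsort(lam)[::-1][:K]      # top-K eigenpairs
    lam, V = lam[order], V[:, order]
    Phi = Xc.T @ V / np.sqrt(lam)          # unit-length principal directions, D x K
    H = Xc @ Phi                           # low-dimensional coordinates, I x K
    return mu, Phi, H
```

With K equal to the data dimension (and I > D), the reconstruction µ + Φh_i is exact, which makes a convenient sanity check.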
0.15.2 k-means algorithm
The goal of the K-means algorithm is to partition a set of data {x_i}_{i=1}^I into K clusters. It can be thought of as approximating each data point with the associated cluster mean µ_k, so that

x_i ≈ µ_{h_i},

where h_i ∈ {1, 2, . . . , K} is a discrete variable that indicates which cluster the ith point belongs to.
Algorithm 73: K-means algorithm

Input : Data {x_i}_{i=1}^I, number of clusters K, data dimension D
Output: Cluster means {µ_k}_{k=1}^K, cluster assignment indices {h_i}_{i=1}^I
begin
  // Initialize cluster means (one of many possible heuristics)
  µ = Σ_{i=1}^I x_i / I
  for i = 1 to I do
    d_i = (x_i − µ)^T (x_i − µ)
  end
  Σ = (Σ_{i=1}^I d_i / I) I_D
  for k = 1 to K do
    µ_k = µ + Σ^{1/2} randn[D, 1]
  end
  // Main loop
  repeat
    // Compute distance from data points to cluster means
    for i = 1 to I do
      for k = 1 to K do
        d_ik = (x_i − µ_k)^T (x_i − µ_k)
      end
      // Update cluster assignment
      h_i = argmin_k d_ik
    end
    // Update cluster means
    for k = 1 to K do
      µ_k = Σ_{i=1}^I δ[h_i − k] x_i / (Σ_{i=1}^I δ[h_i − k])
    end
  until No further change in {µ_k}_{k=1}^K
end
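A minimal NumPy version of the main loop follows. For determinism this sketch uses a greedy farthest-point initialization instead of the Gaussian-draw heuristic in the pseudocode (the function name and that substitution are ours):

```python
import numpy as np

def kmeans(X, K, n_iters=100):
    """Basic K-means.  X: (I, D).  Returns cluster means (K, D) and assignments (I,)."""
    # Greedy farthest-point initialization (a deterministic k-means++-style variant)
    mu = [X[0]]
    for _ in range(K - 1):
        d = np.min([((X - m) ** 2).sum(axis=1) for m in mu], axis=0)
        mu.append(X[d.argmax()])                 # farthest point from current means
    mu = np.array(mu, dtype=float)
    h = np.full(len(X), -1)
    for _ in range(n_iters):
        # Assign each point to its nearest cluster mean
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        h_new = d.argmin(axis=1)
        if np.array_equal(h_new, h):
            break                                # assignments stable: converged
        h = h_new
        # Update cluster means (keep the old mean if a cluster is empty)
        for k in range(K):
            if np.any(h == k):
                mu[k] = X[h == k].mean(axis=0)
    return mu, h
```

On well-separated clusters this converges in a couple of iterations; in general K-means only finds a local optimum, so multiple restarts are common.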