Bayesian k-nearest-neighbour classification

Bayesian k-nearest-neighbour classification

Christian P. Robert

Université Paris Dauphine & CREST, INSEE

http://www.ceremade.dauphine.fr/∼xian

Joint work with G. Celeux, J.M. Marin, & D.M. Titterington

Bayesian k-nearest-neighbour classification

Outline

1 MRFs

2 Bayesian inference in Gibbs random fields

3 Perfect sampling

4 k-nearest-neighbours

5 Pseudo-likelihood reassessed

6 Variable selection

Bayesian k-nearest-neighbour classification

MRFs

MRFs

Markov random fields: natural spatial generalisation of Markov chains

They can be derived from graph structures, when ignoring time directionality/causality

E.g., a Markov chain is also a chain graph of random variables, where each variable in the graph has the property that it is independent of all (both past and future) others given its two nearest neighbours.

Bayesian k-nearest-neighbour classification

MRFs

MRFs (cont’d)

Definition (MRF)

A general Markov random field is the extension of the above to any graph structure on the random variables, i.e., a collection of rv's such that each one is independent of all the others given its immediate neighbours in the corresponding graph.

[Cressie, 1993]

Bayesian k-nearest-neighbour classification

MRFs

A formal definition

Take y1, . . . , yn, rv's with values in a finite set S, and let G = (N, E) be a finite graph with N = {1, ..., n} the collection of nodes and E the collection of edges, made of pairs from N. For A ⊆ N, δA denotes the set of neighbours of A, i.e. the collection of all points in N\A that have a neighbour in A.

Definition (MRF)

y = (y1, . . . , yn) is a Markov random field associated with the graph G if its full conditionals satisfy

f(y_i | y_{−i}) = f(y_i | y_{δi}) .

Cliques are sets of points that are all neighbours of one another.

Bayesian k-nearest-neighbour classification

MRFs

Gibbs distributions

Special case of MRF: y = (y1, . . . , yn) is a Gibbs random field associated with the graph G if

f(y) = (1/Z) exp{ −∑_{c∈C} V_c(y_c) } ,

where Z is the normalising constant, C is the set of cliques and V_c is any function, also called a potential (and U(y) = ∑_{c∈C} V_c(y_c) is the energy function)

Bayesian k-nearest-neighbour classification

MRFs

Statistical perspective

Introduce a parameter θ ∈ Θ in the Gibbs distribution:

f(y|θ) = exp{ Q_θ(y) } / Z(θ)

and estimate it from observed data y.

Bayesian approach: put a prior distribution π(θ) on θ and use the posterior distribution

π(θ|y) ∝ f(y|θ) π(θ) = exp{ Q_θ(y) } / Z(θ) × π(θ)

Bayesian k-nearest-neighbour classification

MRFs

Potts model

Example (Boltzmann dependence)

Case when Q_θ(y) is of the form

Q_θ(y) = θ S(y) = θ ∑_{l∼i} δ_{y_l = y_i}


Bayesian k-nearest-neighbour classification

Bayesian inference in Gibbs random fields

1 MRFs

2 Bayesian inference in Gibbs random fields
   Pseudo-posterior inference
   Path sampling
   Auxiliary variables

3 Perfect sampling

4 k-nearest-neighbours

5 Pseudo-likelihood reassessed

6 Variable selection

Bayesian k-nearest-neighbour classification

Bayesian inference in Gibbs random fields

Bayesian inference in Gibbs random fields

Use the posterior π(θ|y) to draw inference

Problem

Z(θ) = ∑_y exp{ Q_θ(y) }

not available analytically & exact computation not feasible.

Solutions:

Pseudo-posterior inference

Path sampling approximations

Auxiliary variable method

Bayesian k-nearest-neighbour classification

Bayesian inference in Gibbs random fields

Pseudo-posterior inference

Pseudo-posterior inference

Oldest solution: replace the likelihood with the pseudo-likelihood

pseudo-like(y|θ) = ∏_{i=1}^n f(y_i | y_{−i}, θ) .

Then define the pseudo-posterior

pseudo-post(θ|y) ∝ ∏_{i=1}^n f(y_i | y_{−i}, θ) π(θ)

and resort to MCMC methods to derive a sample from pseudo-post
[Besag, 1974-75]
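As a concrete illustration (not from the talk), the pseudo-(log-)likelihood can be coded directly from the full conditionals; the sketch below assumes a Potts-type field with labels in {1, . . . , G} and a neighbour list nb, where nb[[i]] holds the indices of the neighbours of site i. All names are illustrative.

## Pseudo-log-likelihood: sum over sites of log f(y_i | y_{-i}, theta),
## with f(y_i | y_{-i}, theta) proportional to exp(theta * n_{y_i}(i)).
pseudo_loglik <- function(theta, y, nb, G) {
  sum(sapply(seq_along(y), function(i) {
    counts <- tabulate(y[nb[[i]]], nbins = G)   # n_g(i): neighbour counts per colour
    theta * counts[y[i]] - log(sum(exp(theta * counts)))
  }))
}
## e.g. a maximum pseudo-likelihood estimate of theta:
## optimise(pseudo_loglik, c(0, 5), y = y, nb = nb, G = G, maximum = TRUE)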

Bayesian k-nearest-neighbour classification

Bayesian inference in Gibbs random fields

Path sampling

Path sampling

Generate a sample from [the true] π(θ|y) by a Metropolis-Hastings algorithm, with acceptance probability

MH1(θ′|θ) = { Z(θ)/Z(θ′) } { exp[Q_{θ′}(y)] π(θ′) / exp[Q_θ(y)] π(θ) } { q1(θ|θ′) / q1(θ′|θ) }

where q1(θ′|θ) is an [arbitrary] proposal density.

[Robert & Casella, 1999/2004]

Bayesian k-nearest-neighbour classification

Bayesian inference in Gibbs random fields

Path sampling

Path sampling (cont’d)

When Q_θ(y) = θ S(y) [cf. Gibbs/Potts distribution],

Z(θ) = ∑_y exp[θ S(y)]

and

dZ(θ)/dθ = ∑_y S(y) exp[θ S(y)] = Z(θ) ∑_y S(y) exp{θ S(y)}/Z(θ) = Z(θ) E_θ[S(y)] .

Derivative expressed as an expectation under f(y|θ)

Bayesian k-nearest-neighbour classification

Bayesian inference in Gibbs random fields

Path sampling

Path sampling identity

Therefore, the ratio Z(θ)/Z(θ′) can be derived from an integral, since

log{ Z(θ)/Z(θ′) } = ∫_{θ′}^{θ} E_u[S(y)] du .

[Gelman & Meng, 1998]
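As a sketch of how the identity is used in practice (not from the talk): estimate E_u[S(y)] by MCMC on a grid of u values and integrate numerically. Here est_ES is an assumed user-supplied Monte Carlo estimator of E_u[S(y)]; names are illustrative.

## Path-sampling estimate of log{Z(theta)/Z(theta_prime)} by trapezoidal
## quadrature of u -> E_u[S(y)] between theta_prime and theta.
log_Z_ratio <- function(theta, theta_prime, est_ES, ngrid = 20) {
  grid <- seq(theta_prime, theta, length.out = ngrid)
  es   <- sapply(grid, est_ES)                      # MCMC estimates of E_u[S(y)]
  sum(diff(grid) * (head(es, -1) + tail(es, -1)) / 2)
}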

Bayesian k-nearest-neighbour classification

Bayesian inference in Gibbs random fields

Path sampling

Implementation for Potts


Step X: Monte Carlo approximation of E_θ[S(X)] derived from an MCMC sequence of X's for fixed θ

Potts Metropolis-Hastings Sampler

Iteration t (t ≥ 1):

1 Generate u = (u_i)_{i∈I}, a random permutation of I;
2 For 1 ≤ ℓ ≤ |I|, generate

   x̃^{(t)}_{u_ℓ} ∼ U({ 1, . . . , x^{(t−1)}_{u_ℓ} − 1, x^{(t−1)}_{u_ℓ} + 1, . . . , G }) ,

   compute the n^{(t)}_{u_ℓ,g}'s and

   ρ_ℓ = exp( θ [ n^{(t)}_{u_ℓ, x̃_{u_ℓ}} − n^{(t)}_{u_ℓ, x_{u_ℓ}} ] ) ∧ 1 ,

   and set x^{(t)}_{u_ℓ} equal to x̃_{u_ℓ} with probability ρ_ℓ.
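A minimal R sketch of this single-site update, for a G-colour Potts model on an n × n lattice with first-order (4-neighbour) structure; function and variable names are illustrative, not from the original code.

## One Metropolis-Hastings sweep over all sites, in a random permutation.
potts_mh_sweep <- function(x, theta, G) {
  n <- nrow(x)
  for (idx in sample(length(x))) {                 # random permutation of the sites
    i <- (idx - 1) %% n + 1
    j <- (idx - 1) %/% n + 1
    nb <- c(if (i > 1) x[i - 1, j], if (i < n) x[i + 1, j],
            if (j > 1) x[i, j - 1], if (j < n) x[i, j + 1])
    cand <- setdiff(1:G, x[i, j])                  # uniform proposal on the other colours
    prop <- cand[sample.int(length(cand), 1)]
    rho  <- min(1, exp(theta * (sum(nb == prop) - sum(nb == x[i, j]))))
    if (runif(1) < rho) x[i, j] <- prop
  }
  x
}
## Averaging S(x) (number of identical neighbour pairs) over many sweeps gives
## the Monte Carlo estimate of E_theta[S(y)] used in the path-sampling step.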

Bayesian k-nearest-neighbour classification

Bayesian inference in Gibbs random fields

Path sampling

Implementation for Potts (2)

Step θ: Use (a) importance sampling recycling when changing the value of θ and (b) numerical quadrature for integral approximation. Illustration: Approximation of E_{β,k}[S(y)] for Ripley's benchmark, for k = 1, 125


Bayesian k-nearest-neighbour classification

Bayesian inference in Gibbs random fields

Auxiliary variables

Auxiliary variables

Introduce z, an auxiliary/extraneous variable on the same state space as y, with conditional density g(z|θ, y), and consider the [artificial] joint posterior

π(θ, z|y) ∝ π(θ, z, y) = g(z|θ, y) f(y|θ) π(θ)

Explanation: Integrating out z gets us back to π(θ|y)
[Møller et al., 2006]

Bayesian k-nearest-neighbour classification

Bayesian inference in Gibbs random fields

Auxiliary variables

Auxiliary variables (cont’d)

For q1 an [arbitrary] proposal density on θ and

q2((θ′, z′)|(θ, z)) = q1(θ′|θ) f(z′|θ′) ,

(i.e., simulating z′ from the likelihood), the Metropolis-Hastings ratio associated with q2 is

MH2((θ′, z′)|(θ, z)) = ( Z(θ)/Z(θ′) ) ( exp{Q_{θ′}(y)} π(θ′) / exp{Q_θ(y)} π(θ) ) ( g(z′|θ′, y) / g(z|θ, y) )
                       × ( q1(θ|θ′) exp{Q_θ(z)} / q1(θ′|θ) exp{Q_{θ′}(z′)} ) ( Z(θ′)/Z(θ) )

and ...

Bayesian k-nearest-neighbour classification

Bayesian inference in Gibbs random fields

Auxiliary variables

Auxiliary variables (cont’d)

... Z(θ) vanishes:

MH2((θ′, z′)|(θ, z)) = ( exp{Q_{θ′}(y)} π(θ′) / exp{Q_θ(y)} π(θ) ) ( g(z′|θ′, y) / g(z|θ, y) )
                       × ( q1(θ|θ′) exp{Q_θ(z)} / q1(θ′|θ) exp{Q_{θ′}(z′)} )

Choice of

g(z|θ, y) = exp{ Q_{θ̂}(z) } / Z(θ̂)

where θ̂ is the maximum pseudo-likelihood estimate of θ.

New problem: Need to simulate from f(y|θ)
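To make the cancellation concrete, here is a sketch (illustrative names, not from the talk) of the resulting log acceptance ratio, assuming Q(theta, y) computes Q_θ(y), z_prime has been drawn from f(·|θ′), theta_hat is the maximum pseudo-likelihood estimate used in g, and log_q1(a, b) stands for log q1(a|b).

## Log acceptance ratio of the auxiliary-variable move; Z(theta) has cancelled.
log_MH2 <- function(theta, theta_prime, y, z, z_prime, Q, theta_hat,
                    log_prior, log_q1) {
  (Q(theta_prime, y) + log_prior(theta_prime)) -
  (Q(theta,       y) + log_prior(theta))       +
  (Q(theta_hat, z_prime) - Q(theta_hat, z))    +           # g(z'|theta',y) / g(z|theta,y)
  (log_q1(theta, theta_prime) + Q(theta, z)) -             # reverse proposal
  (log_q1(theta_prime, theta) + Q(theta_prime, z_prime))   # forward proposal
}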

Bayesian k-nearest-neighbour classification

Perfect sampling

Perfect sampling

Coupling From The Past: an algorithm that allows for exact and iid sampling from a given distribution while using basic steps from an MCMC algorithm

Underlying concept: run coupled Markov chains that start from all possible states in the state space. Once all chains have met/coalesced, they stick to the same path; the effect of the initial state has “vanished”.

[Propp & Wilson, 1996]

Bayesian k-nearest-neighbour classification

Perfect sampling

Implementation for Potts (3)

In the case of a two-colour Ising model, existence of a perfect sampler by virtue of monotonicity properties:

Ising Metropolis-Hastings Perfect Sampler

For T large enough,

1 Start two chains x^{0,−T} and x^{1,−T} from the saturated states
2 For t = −T, . . . , −1, couple both chains:
   if missing, generate the basic uniforms u^{(t)}
   use u^{(t)} to update both x^{0,t} into x^{0,t+1} and x^{1,t} into x^{1,t+1}
3 Check coalescence at time 0: if x^{0,0} = x^{1,0}, stop; else increase T and recycle the younger u^{(t)}'s

Limitation: slows down more and more as θ increases

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a probability distribution

1 MRFs

2 Bayesian inference in Gibbs random fields

3 Perfect sampling

4 k-nearest-neighbours
   KNN's as a clustering rule
   KNN's as a probabilistic model
   Bayesian inference on KNN's
   MCMC implementation

5 Pseudo-likelihood reassessed

6 Variable selection

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a clustering rule

KNN’s as a clustering rule

The k-nearest-neighbour procedure is a supervised clustering method that allocates [new] subjects to one of G categories based on the most frequent class [within a learning sample] in their neighbourhood.

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a clustering rule

Supervised classification

Infer from a partitioned dataset the classes of a new dataset

Data: training dataset

(y^tr_i, x^tr_i)_{i=1,...,n}

with class label 1 ≤ y^tr_i ≤ Q and predictor covariates x^tr_i,

and testing dataset

(y^te_i, x^te_i)_{i=1,...,m}

with unknown y^te_i's

[Figure: scatterplot of the training dataset]

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a clustering rule

Classification


Principle

Prediction for a new point (y^te_j, x^te_j) (j = 1, . . . , m): the most common class amongst the k nearest neighbours of x^te_j in the training set

Neighbourhood based on the Euclidean metric

[Figure: k nearest neighbours of a new point in the training sample]

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a clustering rule

Standard procedure

Example: help(knn)

library(class)
data(iris3)
# training and test sets: 25 plants of each of the three species
train <- rbind(iris3[1:25, , 1], iris3[1:25, , 2], iris3[1:25, , 3])
test  <- rbind(iris3[26:50, , 1], iris3[26:50, , 2], iris3[26:50, , 3])
# class labels of the training set
cl <- factor(c(rep("s", 25), rep("c", 25), rep("v", 25)))
# 3-nearest-neighbour classification of the test set
knn(train, test, cl, k = 3, prob = TRUE)
attributes(.Last.value)

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a clustering rule

Model choice perspective


Choice of k?

Usually chosen by minimising the cross-validated misclassification rate (non-parametric or even non-probabilist!)
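For instance, with the class package, the leave-one-out choice can be written in a couple of lines (a sketch, with train and cl as in the iris3 example above):

## Leave-one-out cross-validated misclassification rate over a grid of k,
## using knn.cv() from the class package.
library(class)
ks <- 1:60
cv <- sapply(ks, function(k) mean(knn.cv(train, cl, k = k) != cl))
ks[which.min(cv)]        # k minimising the cross-validated error rate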

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a clustering rule

Influence of k

Dataset of Ripley (1994), with two classes where each population of x_i's is from a mixture of two bivariate normal distributions. Training set of n = 250 points and testing set of m = 1,000 points

[Figure: k-nearest-neighbour classification of Ripley's dataset for k = 1, 11, 57, 137]

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a clustering rule

Influence of k (cont’d)

k-nearest-neighbour leave-one-out cross-validation:

Solutions: 17, 18, 35, 36, 45, 46, 51, 52, 53, 54 (29)

Procedure   Misclassification error rate
1-nn        0.150 (150)
3-nn        0.134 (134)
15-nn       0.095 (095)
17-nn       0.087 (087)
54-nn       0.081 (081)

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a probabilistic model

KNN’s as a probabilistic model

k-nearest-neighbour model

Based on the full conditional distributions (1 ≤ ω ≤ Q)

P(y^tr_i = ω | y^tr_{−i}, x^tr, β, k) ∝ exp( β ∑_{l ∼_k i} δ_ω(y^tr_l) / k ) ,   β > 0 ,

where l ∼_k i is the k-nearest-neighbour relation
[Holmes & Adams, 2002]

This can also be seen as a Potts model.
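As an illustrative sketch (not from the talk), the full conditional only needs the neighbour class counts; nn below is assumed to be an n × k matrix with nn[i, ] the indices of the k nearest neighbours of x^tr_i.

## P(y_i = omega | y_{-i}, x, beta, k) for omega = 1, ..., Q in the k-nn model.
knn_full_conditional <- function(i, y, nn, beta, k, Q) {
  counts <- tabulate(y[nn[i, ]], nbins = Q)   # n_omega(i) for each class omega
  p <- exp(beta * counts / k)
  p / sum(p)                                  # probabilities over the Q classes
}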

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a probabilistic model

Motivations

β does not exist in the original k-nn procedure.

It is only relevant from a statistical point of view as a measure of uncertainty about the model:

β = 0 corresponds to a uniform distribution on all classes;

β = +∞ leads to a point mass distribution on the prevalent class.

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a probabilistic model

MRF-like expression

Closed-form expression for the full conditionals:

P(y^tr_i = ω | y^tr_{−i}, x^tr, β, k) = exp( β n_ω(i)/k ) / ∑_q exp( β n_q(i)/k )

where n_ω(i) is the number of neighbours of i with class label ω


Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a probabilistic model

Drawback

Because the neighbourhood structure is not symmetric (x_i may be one of the k nearest neighbours of x_j while x_j is not one of the k nearest neighbours of x_i), there usually is no joint probability distribution corresponding to these “full conditionals”!

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a probabilistic model

Drawback (2)

Note: Holmes & Adams (2002) solve this problem by directly defining the joint as the pseudo-likelihood

f(y^tr | x^tr, β, k) ∝ ∏_{i=1}^n exp( β n_{y_i}(i)/k ) / ∑_q exp( β n_q(i)/k ) . . .

[with a missing constant Z(β)]

... but they are still using the same [wrong] predictive

P(y^te_j = ω | y^tr, x^tr, x^te_j, β, k) = exp( β n_ω(j)/k ) / ∑_q exp( β n_q(j)/k )

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a probabilistic model

Resolution

Symmetrise the neighbourhood relation:

Principle: if x^tr_i belongs to the k-nearest-neighbour set of x^tr_j and x^tr_j does not belong to the k-nearest-neighbour set of x^tr_i, then x^tr_j is added to the set of neighbours of x^tr_i
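A small sketch of the symmetrisation (illustrative, not the authors' code), starting from the Euclidean distances between the rows of a covariate matrix x:

## Symmetrised k-nearest-neighbour sets: the neighbours of i are its own k
## nearest neighbours plus every j that has i among its k nearest neighbours.
symmetrised_knn <- function(x, k) {
  d  <- as.matrix(dist(x))                                        # Euclidean distances
  nn <- lapply(1:nrow(x), function(i) order(d[i, ])[2:(k + 1)])   # drop the point itself
  lapply(1:nrow(x), function(i)
    union(nn[[i]], which(sapply(nn, function(s) i %in% s))))
}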

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a probabilistic model

Consequence

Given the full conditionals

P(y^tr_i = ω | y^tr_{−i}, x^tr, β, k) ∝ exp( β ∑_{l #_k i} δ_ω(y^tr_l) / k )

where l #_k i is the symmetrised k-nearest-neighbour relation,

there exists a corresponding joint distribution

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

KNN’s as a probabilistic model

Extension to unclassified points

Predictive distribution of y^te_j (j = 1, . . . , m) defined as

P(y^te_j = ω | x^te_j, y^tr, x^tr, β, k) ∝ exp( β ∑_{l #_k j} δ_ω(y^tr_l) / k )

where l #_k j is the symmetrised k-nearest-neighbour relation wrt the training set {x^tr_1, . . . , x^tr_n}

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

Bayesian inference on KNN’s

Bayesian modelling

Within the Bayesian paradigm, assign a prior π(β, k) like

π(β, k) ∝ I(1 ≤ k ≤ k_max) I(0 ≤ β ≤ β_max)

because there is a maximum value (e.g., β_max = 15) after which the distribution is Dirac [as in the Potts model] and because it can be argued that k_max = n/2

Note

β is dimension-less because of the use of the frequencies n_ω(i)/k as covariates

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

Bayesian inference on KNN’s

Bayesian global inference

Use the marginal predictive distribution of y^te_j given x^te_j (j = 1, . . . , m),

∫ P(y^te_j = ω | x^te_j, y^tr, x^tr, β, k) π(β, k | y^tr, x^tr) dβ dk

where

π(β, k | y^tr, x^tr) ∝ f(y^tr | x^tr, β, k) π(β, k)

is the posterior distribution of (β, k) given the training dataset y^tr

[ŷ^te_j = MAP estimate]

Note

Model choice with no varying dimension because β is the same for all models

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

MCMC implementation

MCMC implementation

A Markov chain Monte Carlo (MCMC) approximation of

f(y_{n+1} | x_{n+1}, y, X)

is provided by

M^{−1} ∑_{i=1}^{M} f( y_{n+1} | x_{n+1}, y, X, (β, k)^{(i)} )

where {(β, k)^{(1)}, . . . , (β, k)^{(M)}} is an MCMC output associated with the stationary distribution π(β, k | y, X).
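In code, this is a plain Monte Carlo average over the MCMC draws; the sketch below assumes a function pred_prob(x_new, beta, k) returning the Q class probabilities for fixed (β, k) (illustrative names, not from the original code).

## Average the predictive class probabilities over the MCMC draws (betas, ks).
predictive_probs <- function(x_new, betas, ks, pred_prob) {
  M <- length(betas)
  Reduce(`+`, Map(function(b, k) pred_prob(x_new, b, k), betas, ks)) / M
}
## which.max(predictive_probs(x_new, betas, ks, pred_prob)) gives the predicted class.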

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

MCMC implementation

Auxiliary variable version

Random walk Metropolis–Hastings algorithm on both β and k

Since β ∈ (0, β_max), use the logistic reparameterisation

β = β_max e^θ / (1 + e^θ) ,

and run the random walk θ ∼ N(θ^{(t)}, τ²) on θ

For k, uniform proposal on the 2r neighbours of k^{(t)},

{ k^{(t)} − r, . . . , k^{(t)} − 1, k^{(t)} + 1, . . . , k^{(t)} + r } ∩ { 1, . . . , K } .

Simulation of f(z^tr | x^tr, β, k) by perfect sampling, taking advantage of monotonicity properties [but may get stuck for too large values of β]
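A sketch of the corresponding proposal step (illustrative names, not the authors' code): the random walk acts on θ = log(β/(β_max − β)) and k moves uniformly on a window of half-width r, truncated to {1, . . . , K}.

## Random-walk proposal on (beta, k) under the logistic reparameterisation.
propose_beta_k <- function(beta, k, beta_max, K, tau, r) {
  theta     <- log(beta / (beta_max - beta))   # beta = beta_max e^theta / (1 + e^theta)
  theta_new <- rnorm(1, theta, tau)            # Gaussian random walk on theta
  beta_new  <- beta_max * exp(theta_new) / (1 + exp(theta_new))
  cand <- setdiff((k - r):(k + r), k)          # 2r neighbours of k ...
  cand <- cand[cand >= 1 & cand <= K]          # ... intersected with {1, ..., K}
  k_new <- cand[sample.int(length(cand), 1)]
  list(beta = beta_new, k = k_new)
}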

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

MCMC implementation

Choice of (β, k) paramount:

Illustration on Ripley's dataset: (k, β) = (53, 2.28) versus (k, β) = (13, 1.45)

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

MCMC implementation

Diabetes in Pima Indian women

Example (R benchmark)

“A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix (AZ), was tested for diabetes according to WHO criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin.”

number of pregnancies

plasma glucose concentration in an oral glucose tolerance test

diastolic blood pressure

triceps skin fold thickness

body mass index

diabetes pedigree function

age

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

MCMC implementation

Diabetes in Pima Indian women

MCMC output for β_max = 1.5, β = 1.15, k = 40, and 20,000 simulations.

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

MCMC implementation

Diabetes in Pima Indian women

Example (Error rate & k selection)

k     Misclassification error rate
1     0.316
3     0.229
15    0.226
31    0.211
57    0.205
66    0.208

Bayesian k-nearest-neighbour classification

k-nearest-neighbours

MCMC implementation

Predictive output

The approximate Bayesian prediction of y_{n+1} is

ŷ_{n+1} = arg max_g M^{−1} ∑_{i=1}^{M} f( g | x_{n+1}, y, X, β^{(i)}, k^{(i)} ) .

E.g., Ripley’s dataset misclassification error rate: 0.082.

Bayesian k-nearest-neighbour classification

Pseudo-likelihood reassessed

A reassessment of pseudo-likelihood

1 MRFs

2 Bayesian inference in Gibbs random fields

3 Perfect sampling

4 k-nearest-neighbours

5 Pseudo-likelihood reassessed

6 Variable selection

Bayesian k-nearest-neighbour classification

Pseudo-likelihood reassessed

Pseudo-likelihood

Pseudo-likelihood leads to (almost) straightforward MCMC implementation

Bayesian k-nearest-neighbour classification

Pseudo-likelihood reassessed

Magnitude of the approximation

Since perfect and path sampling approaches also are available for small datasets, possibility of evaluation of the pseudo-likelihood approximation

Bayesian k-nearest-neighbour classification

Pseudo-likelihood reassessed

Ripley’s benchmark (1)

Approximations to the posterior of β based on the pseudo (green), the path (red) and the perfect (yellow) schemes with k = 1, 10, 70, 125, for 20,000 iterations:

Bayesian k-nearest-neighbour classification

Pseudo-likelihood reassessed

Ripley’s benchmark (2)

Approximations of posteriors of β (top) and k (bottom)

Bayesian k-nearest-neighbour classification

Variable selection

Variable selection

1 MRFs

2 Bayesian inference in Gibbs random fields

3 Perfect sampling

4 k-nearest-neighbours

5 Pseudo-likelihood reassessed

6 Variable selection

Bayesian k-nearest-neighbour classification

Variable selection

Goal: Selection of the components of the predictor vector that best contribute to the classification

Parsimony (dimension of the predictor may be larger than the training sample size n)

Efficiency (more components blur class differences)

[Figure: predictions on Ripley's dataset for gamma=(1,1) (err=78), gamma=(1,0) (err=284), gamma=(0,1) (err=116), gamma=(1,1,1) (err=159)]

Bayesian k-nearest-neighbour classification

Variable selection

Component indicators

Completion of (β, k) with indicator variables γ_j ∈ {0, 1} (1 ≤ j ≤ p) that determine which components of x are active in the model:

P(y_i = C_j | y_{−i}, X, β, k, γ) ∝ exp( β ∑_{l ∈ v_k(i)} δ_{C_j}(y_l) / k )

with v_k(i) the (symmetrised) k nearest neighbourhood of x_i for the distance

d(x_i, x_ℓ)² = ∑_{j=1}^{p} γ_j (x_{ij} − x_{ℓj})²
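A sketch (illustrative names, not the authors' code) of the γ-weighted distance and of the resulting neighbourhood computation:

## gamma-weighted squared distance between two covariate vectors.
weighted_dist2 <- function(xi, xl, gamma) sum(gamma * (xi - xl)^2)

## k nearest neighbours of point i among the rows of X under the active components gamma.
knn_gamma <- function(i, X, gamma, k) {
  d2 <- apply(X, 1, weighted_dist2, xl = X[i, ], gamma = gamma)
  order(d2)[2:(k + 1)]                          # drop the point itself
}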

Bayesian k-nearest-neighbour classification

Variable selection

Variable selection

Formal similarity with the usual variable selection in regression models.
Use of a uniform prior on the γ_j's on {0, 1}, independently for all j's.
Exploration of a range of models M_γ of size 2^p, which may be too large (see, e.g., the vision dataset with p = 200)

Bayesian k-nearest-neighbour classification

Variable selection

Implementation

Use of a naive “reversible jump” MCMC, where

1 (β, k) are changed conditional on γ and

2 γ is changed one component at a time conditional on (β, k) [and the data]

Note

Validation of simple jumps due to (a) saturation of the dimension by associating a γ_j with each variable and (b) the hierarchical structure of the (β, k) part.

This is not a varying dimension problem

Bayesian k-nearest-neighbour classification

Variable selection

MCMC algorithm

Variable selection k-nearest-neighbours

At time 0, generate γ_j^{(0)} ∼ B(1/2), log β^{(0)} ∼ N(0, τ²) and k^{(0)} ∼ U{1, . . . , K}

At time 1 ≤ t ≤ T,

1 Generate log β̃ ∼ N(log β^{(t−1)}, τ²) and k̃ ∼ U({ k − r, k − r + 1, . . . , k + r − 1, k + r })

2 Calculate the Metropolis-Hastings acceptance probability ρ(β̃, k̃, β^{(t−1)}, k^{(t−1)})

3 Move to (β^{(t)}, k^{(t)}) by a Metropolis-Hastings step

4 For j = 1, . . . , p, generate γ_j^{(t)} ∼ π(γ_j | y, X, γ_{−j}^{(t)}, β^{(t)}, k^{(t)})

Bayesian k-nearest-neighbour classification

Variable selection

Benchmark 1

Ripley's dataset with 8 additional potential [useless] covariates simulated from N(0, .05²)

Using the 250 datapoints for variable selection, comparison of the 2^10 = 1024 models by pseudo-maximum-likelihood estimation of (k, β) and by comparison of pseudo-likelihoods leads to selecting the proper submodel

γ_1 = γ_2 = 1 and γ_3 = · · · = γ_10 = 0

with k = 3.1 and β = 3.8. Forward and backward selection procedures lead to the same conclusion.

MCMC algorithm produces γ_1 = γ_2 = 1 and γ_3 = · · · = γ_10 = 0 as the MMAP, with very similar values for k and β [Hardly any move away from (1, 1, 0, . . . , 0) is accepted]

Bayesian k-nearest-neighbour classification

Variable selection

Benchmark 2

Ripley’s dataset with now 28 additional covariates simulated fromN (0, .052)

Using the 250 datapoints for variable selection, direct comparisonof the 230 models by pseudo-maximum likelihood estimationimpossible!

Forward and backward selection procedures both lead to the propersubmodel γ = (1, 1, 0, . . . , 0)

Bayesian k-nearest-neighbour classification

Variable selection

Benchmark 2

Ripley's dataset with now 28 additional covariates simulated from N(0, .05²)

Using the 250 datapoints for variable selection, direct comparison of the 2^30 models by pseudo-maximum-likelihood estimation is impossible!

Forward and backward selection procedures both lead to the proper submodel γ = (1, 1, 0, . . . , 0)

MCMC algorithm again produces γ_1 = γ_2 = 1 and γ_3 = · · · = γ_10 = 0 as the MMAP, with more moves around γ = (1, 1, 0, . . . , 0)