Post on 10-Aug-2018
Bayesian k-nearest-neighbour classification
Christian P. Robert
Université Paris Dauphine & CREST, INSEE
http://www.ceremade.dauphine.fr/~xian
Joint work with G. Celeux, J.M. Marin, & D.M. Titterington
Outline
1 MRFs
2 Bayesian inference in Gibbs random fields
3 Perfect sampling
4 k-nearest-neighbours
5 Pseudo-likelihood reassessed
6 Variable selection
MRFs

Markov random fields: natural spatial generalisation of Markov chains.

They can be derived from graph structures, when ignoring time directionality/causality.

E.g., a Markov chain is also a chain graph of random variables, where each variable in the graph has the property that it is independent of all (both past and future) others given its two nearest neighbours.
MRFs (cont'd)

Definition (MRF)
A general Markov random field is the extension of the above to any graph structure on the random variables, i.e., a collection of rv's such that each one is independent of all the others given its immediate neighbours in the corresponding graph.
[Cressie, 1993]
A formal definition

Take y1, . . . , yn, rv's with values in a finite set S, and let G = (N, E) be a finite graph with N = {1, . . . , n} the collection of nodes and E the collection of edges, made of pairs from N. For A ⊆ N, δA denotes the set of neighbours of A, i.e. the collection of all points in N\A that have a neighbour in A.

Definition (MRF)
y = (y1, . . . , yn) is a Markov random field associated with the graph G if its full conditionals satisfy

f(yi | y−i) = f(yi | yδi) .

Cliques are sets of points that are all neighbours of one another.
Gibbs distributions

Special case of MRF: y = (y1, . . . , yn) is a Gibbs random field associated with the graph G if

f(y) = (1/Z) exp{ −∑_{c∈C} Vc(yc) } ,

where Z is the normalising constant, C is the set of cliques, and Vc is an arbitrary function called the potential (and U(y) = ∑_{c∈C} Vc(yc) is the energy function).
Statistical perspective

Introduce a parameter θ ∈ Θ in the Gibbs distribution:

f(y|θ) = exp{Qθ(y)} / Z(θ)

and estimate it from observed data y.

Bayesian approach: put a prior distribution π(θ) on θ and use the posterior distribution

π(θ|y) ∝ f(y|θ) π(θ) = exp{Qθ(y)} / Z(θ) × π(θ)
Potts model

Example (Boltzmann dependence)
Case when Qθ(y) is of the form

Qθ(y) = θS(y) = θ ∑_{l∼i} δ_{yl = yi}
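As a concrete illustration (not from the slides), the sufficient statistic S(y) of this Potts model can be computed directly; the 2-D grid layout, the 4-neighbour adjacency and all function names below are assumptions of this sketch:

```python
import numpy as np

def potts_S(y):
    """Sufficient statistic S(y) = sum over neighbour pairs l ~ i of
    delta_{y_l = y_i}, on a 2-D grid with 4-neighbour adjacency
    (each edge counted once)."""
    horiz = np.sum(y[:, :-1] == y[:, 1:])   # horizontal agreements
    vert = np.sum(y[:-1, :] == y[1:, :])    # vertical agreements
    return int(horiz + vert)

def potts_unnormalised(y, theta):
    """exp{Q_theta(y)} = exp{theta * S(y)}; dividing by the intractable
    normalising constant Z(theta) would give the actual Potts density."""
    return float(np.exp(theta * potts_S(y)))

# toy 3x3 configuration with G = 2 colours
y = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])
print(potts_S(y))  # -> 8 agreeing neighbour pairs
```

Evaluating exp{θS(y)} is cheap; the whole difficulty of the model lies in Z(θ), which sums this quantity over all G^n configurations.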
Outline

1 MRFs
2 Bayesian inference in Gibbs random fields
    Pseudo-posterior inference
    Path sampling
    Auxiliary variables
3 Perfect sampling
4 k-nearest-neighbours
5 Pseudo-likelihood reassessed
6 Variable selection
Bayesian inference in Gibbs random fields

Use the posterior π(θ|y) to draw inference.

Problem:

Z(θ) = ∑_y exp{Qθ(y)}

is not available analytically and exact computation is not feasible.

Solutions:
    Pseudo-posterior inference
    Path sampling approximations
    Auxiliary variable method
Pseudo-posterior inference

Oldest solution: replace the likelihood with the pseudo-likelihood

pseudo-like(y|θ) = ∏_{i=1}^n f(yi | y−i, θ) .

Then define the pseudo-posterior

pseudo-post(θ|y) ∝ ∏_{i=1}^n f(yi | y−i, θ) π(θ)

and resort to MCMC methods to derive a sample from the pseudo-posterior.
[Besag, 1974-75]
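For a grid Potts model each full conditional only involves neighbour counts, so the pseudo-likelihood is available in closed form. A minimal sketch (the grid layout, colour coding and function names are this example's assumptions, not the slides'):

```python
import numpy as np

def neighbour_counts(y, i, j, G):
    """n_g: number of 4-neighbours of grid site (i, j) with colour g."""
    n = np.zeros(G)
    for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        ii, jj = i + di, j + dj
        if 0 <= ii < y.shape[0] and 0 <= jj < y.shape[1]:
            n[y[ii, jj]] += 1
    return n

def log_pseudo_likelihood(y, theta, G=2):
    """Log of Besag's pseudo-likelihood: sum over sites of the log full
    conditionals log f(y_i | y_{-i}, theta), each of which is available
    in closed form -- no Z(theta) required."""
    lpl = 0.0
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            n = neighbour_counts(y, i, j, G)
            lpl += theta * n[y[i, j]] - np.log(np.sum(np.exp(theta * n)))
    return lpl
```

At θ = 0 every conditional is uniform over the G colours, so the log pseudo-likelihood reduces to −n log G, a quick sanity check.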
Path sampling

Generate a sample from [the true] π(θ|y) by a Metropolis-Hastings algorithm, with acceptance probability

MH1(θ′|θ) = {Z(θ)/Z(θ′)} {exp[Qθ′(y)] π(θ′) / exp[Qθ(y)] π(θ)} {q1(θ|θ′) / q1(θ′|θ)}

where q1(θ′|θ) is an [arbitrary] proposal density.
[Robert & Casella, 1999/2004]
Path sampling (cont'd)

When Qθ(y) = θS(y) [cf. Gibbs/Potts distribution],

Z(θ) = ∑_y exp[θS(y)]

and

dZ(θ)/dθ = ∑_y S(y) exp[θS(y)]
         = Z(θ) ∑_y S(y) exp{θS(y)}/Z(θ)
         = Z(θ) Eθ[S(y)] .

© Derivative expressed as an expectation under f(y|θ)
Path sampling identity

Therefore, the ratio Z(θ′)/Z(θ) can be derived from an integral, since

log{Z(θ′)/Z(θ)} = ∫_θ^θ′ Eu[S(y)] du .

[Gelman & Meng, 1998]
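This identity suggests a simple numerical scheme: approximate E_u[S(y)] on a grid of u values and integrate by quadrature. A sketch follows; the toy model with independent binary sites and its closed-form Z(θ), used only as a correctness check, are assumptions of this example:

```python
import numpy as np

def log_Z_ratio(expected_S, theta, theta_prime, n_grid=200):
    """Path-sampling estimate of log{Z(theta')/Z(theta)}: trapezoidal
    quadrature of u -> E_u[S(y)] over [theta, theta'].  In practice
    `expected_S(u)` would be a Monte Carlo approximation of E_u[S(y)]
    obtained from an MCMC sampler run at parameter u."""
    us = np.linspace(theta, theta_prime, n_grid)
    vals = np.array([expected_S(u) for u in us])
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(us)))

# sanity check on a toy model with n independent binary sites, where
# Q_theta(y) = theta * sum_i y_i gives Z(theta) = (1 + e^theta)^n exactly
n = 4
expected_S = lambda u: n * np.exp(u) / (1 + np.exp(u))
exact = n * (np.log(1 + np.exp(2.0)) - np.log(2.0))
approx = log_Z_ratio(expected_S, 0.0, 2.0, n_grid=2000)
```

With exact values of E_u[S(y)] the quadrature error is negligible; with Monte Carlo estimates the accuracy is driven by the MCMC error at each grid point.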
Implementation for Potts

Step X: Monte Carlo approximation of Eθ[S(X)] derived from an MCMC sequence of X's for fixed θ.

Potts Metropolis-Hastings Sampler
At iteration t (t ≥ 1):
1 Generate u = (ui)_{i∈I}, a random permutation of I;
2 For 1 ≤ ℓ ≤ |I|, generate

    x̃_{uℓ}^(t) ∼ U({1, . . . , x_{uℓ}^(t−1) − 1, x_{uℓ}^(t−1) + 1, . . . , G}) ,

  compute the n_{uℓ,g}^(t)'s and

    ρℓ = exp(θ[n_{uℓ,x̃}^(t) − n_{uℓ,xuℓ}^(t)]) ∧ 1 ,

  and set x_{uℓ}^(t) equal to x̃_{uℓ} with probability ρℓ.
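A sketch of one sweep of this sampler in Python; the grid geometry, the number of colours G and the RNG handling are implementation choices of this example, not prescriptions from the slides:

```python
import numpy as np

def potts_mh_sweep(x, theta, G, rng):
    """One sweep of the single-site Metropolis-Hastings Potts sampler:
    visit the sites in random order, propose a new colour uniformly
    amongst the other G-1 colours, and accept with probability
    exp(theta * (n_proposed - n_current)) ∧ 1."""
    nrow, ncol = x.shape
    for s in rng.permutation(nrow * ncol):
        i, j = divmod(int(s), ncol)
        cur = int(x[i, j])
        prop = int(rng.integers(G - 1))
        if prop >= cur:            # uniform over the other G-1 colours
            prop += 1
        n_cur = n_prop = 0         # neighbour counts for both colours
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            ii, jj = i + di, j + dj
            if 0 <= ii < nrow and 0 <= jj < ncol:
                n_cur += int(x[ii, jj] == cur)
                n_prop += int(x[ii, jj] == prop)
        if rng.random() < min(1.0, np.exp(theta * (n_prop - n_cur))):
            x[i, j] = prop
    return x

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=(8, 8))
for _ in range(100):               # MCMC sequence of X's at fixed theta
    x = potts_mh_sweep(x, 0.5, 3, rng)
```

Averaging S(x) over such sweeps (after burn-in) yields the Monte Carlo approximation of Eθ[S(X)] needed in Step X.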
Implementation for Potts (2)

Step θ: Use (a) importance sampling recycling when changing the value of θ and (b) numerical quadrature for integral approximation.

Illustration: approximation of Eβ,k[S(y)] for Ripley's benchmark, for k = 1, 125.
Auxiliary variables

Introduce an auxiliary/extraneous variable z on the same state space as y, with conditional density g(z|θ,y), and consider the [artificial] joint posterior

π(θ, z|y) ∝ π(θ, z, y) = g(z|θ,y) f(y|θ) π(θ)

Explanation: integrating out z gets us back to π(θ|y).
[Møller et al., 2006]
Auxiliary variables (cont'd)

For q1 an [arbitrary] proposal density on θ and

q2((θ′, z′)|(θ, z)) = q1(θ′|θ) f(z′|θ′) ,

(i.e., simulating z from the likelihood), the Metropolis-Hastings ratio associated with q2 is

MH2((θ′, z′)|(θ, z)) = (Z(θ)/Z(θ′)) (exp{Qθ′(y)} π(θ′) / exp{Qθ(y)} π(θ)) (g(z′|θ′,y) / g(z|θ,y))
                       × (q1(θ|θ′) exp{Qθ(z)} / q1(θ′|θ) exp{Qθ′(z)}) (Z(θ′)/Z(θ))

and...
Auxiliary variables (cont'd)

...Z(θ) vanishes:

MH2((θ′, z′)|(θ, z)) = (exp{Qθ′(y)} π(θ′) / exp{Qθ(y)} π(θ)) (g(z′|θ′,y) / g(z|θ,y))
                       × (q1(θ|θ′) exp{Qθ(z)} / q1(θ′|θ) exp{Qθ′(z)})

Choice of

g(z|θ,y) = exp{Qθ̂(z)} / Z(θ̂) ,

where θ̂ is the maximum pseudo-likelihood estimate of θ.

New problem: need to simulate from f(y|θ).
Perfect sampling

Coupling From The Past: an algorithm that allows for exact and iid sampling from a given distribution while using basic steps from an MCMC algorithm.

Underlying concept: run coupled Markov chains that start from all possible states in the state space. Once all chains have met/coalesced, they stick to the same path; the effect of the initial state has "vanished".
[Propp & Wilson, 1996]
Implementation for Potts (3)

In the case of a two-colour Ising model, a perfect sampler exists by virtue of monotonicity properties.

Ising Metropolis-Hastings Perfect Sampler
For T large enough,
1 Start two chains x^{0,−T} and x^{1,−T} from the saturated states;
2 For t = −T, . . . , −1, couple both chains:
    if missing, generate the basic uniforms u(t);
    use u(t) to update both x^{0,t} into x^{0,t+1} and x^{1,t} into x^{1,t+1};
3 Check coalescence at time 0: if x^{0,0} = x^{1,0}, stop; else increase T and recycle the younger u(t)'s.

Limitation: slows down more and more as θ increases.
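A compact sketch of monotone CFTP for the two-colour case, using heat-bath single-site updates rather than Metropolis-Hastings for simplicity; the grid shape, the update rule and all names are assumptions of this example:

```python
import numpy as np

def heat_bath_update(x, site, u, theta):
    """Update one site by heat bath, driven by a shared uniform u;
    reusing the same u in both chains gives the monotone coupling."""
    i, j = site
    n1 = n0 = 0
    for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        ii, jj = i + di, j + dj
        if 0 <= ii < x.shape[0] and 0 <= jj < x.shape[1]:
            n1 += int(x[ii, jj] == 1)
            n0 += int(x[ii, jj] == 0)
    p1 = np.exp(theta * n1) / (np.exp(theta * n1) + np.exp(theta * n0))
    x[i, j] = 1 if u < p1 else 0

def cftp_ising(shape, theta, seed=0):
    """Coupling From The Past: run an upper chain (all 1) and a lower
    chain (all 0) from time -T with shared randomness; if they have
    coalesced at time 0, the common state is an exact draw.  Otherwise
    double T, keeping (recycling) the randomness of the younger steps."""
    rng = np.random.default_rng(seed)
    moves = []                 # moves[m] drives the update at time -(m+1)
    T = 16
    while True:
        while len(moves) < T:  # extend the randomness further into the past
            site = (int(rng.integers(shape[0])), int(rng.integers(shape[1])))
            moves.append((site, rng.random()))
        hi = np.ones(shape, dtype=int)    # saturated states
        lo = np.zeros(shape, dtype=int)
        for site, u in reversed(moves[:T]):   # times -T, ..., -1
            heat_bath_update(hi, site, u, theta)
            heat_bath_update(lo, site, u, theta)
        if np.array_equal(hi, lo):            # coalescence at time 0
            return hi
        T *= 2

sample = cftp_ising((4, 4), theta=0.4, seed=1)
```

Note that the randomness attached to a given past time never changes across the doublings of T, which is what makes the output an exact draw rather than a forward-coupling approximation.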
1 MRFs
2 Bayesian inference in Gibbs random fields
3 Perfect sampling
4 k-nearest-neighbours
    KNN's as a clustering rule
    KNN's as a probabilistic model
    Bayesian inference on KNN's
    MCMC implementation
5 Pseudo-likelihood reassessed
6 Variable selection
KNN's as a clustering rule

The k-nearest-neighbour procedure is a supervised clustering method that allocates [new] subjects to one of G categories based on the most frequent class [within a learning sample] in their neighbourhood.
Supervised classification

Infer from a partitioned dataset the classes of a new dataset.

Data: a training dataset

(yi^tr, xi^tr)_{i=1,...,n}

with class labels 1 ≤ yi^tr ≤ Q and predictor covariates xi^tr, and a testing dataset

(yi^te, xi^te)_{i=1,...,m}

with unknown yi^te's.

[Figure: training sample plotted in the (x1, x2) plane]
Classification

Principle
Prediction for a new point (yj^te, xj^te) (j = 1, . . . , m): the most common class amongst the k-nearest-neighbours of xj^te in the training set.

Neighbourhood based on the Euclidean metric.

[Figure: animation of the k-nearest-neighbour allocation in the (x1, x2) plane]
Standard procedure

Example: help(knn)

library(class)
data(iris3)
# training set: first 25 flowers of each of the three species
train = rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
# test set: remaining 25 flowers of each species
test = rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
# class labels: s = setosa, c = versicolor, v = virginica
cl = factor(c(rep("s",25), rep("c",25), rep("v",25)))
knn(train, test, cl, k=3, prob=TRUE)
attributes(.Last.value)
Model choice perspective

Choice of k?
Usually chosen by minimising the cross-validated misclassification rate (non-parametric or even non-probabilistic!)
Influence of k

Dataset of Ripley (1994), with two classes where each population of xi's is from a mixture of two bivariate normal distributions. Training set of n = 250 points and testing set of m = 1,000 points.

[Figure: k-nn classification boundaries on Ripley's dataset for k = 1, 11, 57, 137]
Influence of k (cont'd)

k-nearest-neighbour leave-one-out cross-validation:
Solutions: 17 18 35 36 45 46 51 52 53 54 (29)

Procedure    Misclassification error rate
1-nn         0.150 (150)
3-nn         0.134 (134)
15-nn        0.095 (95)
17-nn        0.087 (87)
54-nn        0.081 (81)
KNN's as a probabilistic model

k-nearest-neighbour model
Based on full conditional distributions (1 ≤ ω ≤ Q)

P(yi^tr = ω | y−i^tr, x^tr, β, k) ∝ exp( β ∑_{l ∼k i} δω(yl^tr) / k ) ,    β > 0,

where l ∼k i denotes the k-nearest-neighbour relation.
[Holmes & Adams, 2002]

This can also be seen as a Potts model.
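These full conditionals are straightforward to evaluate. A sketch (the list-based neighbour representation and the function names are this example's assumptions):

```python
import numpy as np

def knn_full_conditional(i, y, neighbours, beta, k, Q):
    """P(y_i = w | y_{-i}, x, beta, k) ∝ exp(beta * n_w(i) / k), where
    n_w(i) counts the k-nearest-neighbours of i carrying label w."""
    counts = np.zeros(Q)
    for l in neighbours[i]:        # k-nearest-neighbour relation
        counts[y[l]] += 1
    w = np.exp(beta * counts / k)
    return w / w.sum()             # normalise over the Q classes

# toy example: point 0 has neighbours 1, 2, 3 with labels 1, 1, 0
y = [0, 1, 1, 0]
neighbours = {0: [1, 2, 3]}
p = knn_full_conditional(0, y, neighbours, beta=1.0, k=3, Q=2)
```

At β = 0 the conditional is uniform over the Q classes; as β grows it concentrates on the class that is prevalent amongst the neighbours.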
Motivations

β does not exist in the original k-nn procedure. It is only relevant from a statistical point of view as a measure of uncertainty about the model:

    β = 0 corresponds to a uniform distribution on all classes;
    β = +∞ leads to a point mass distribution on the prevalent class.
MRF-like expression

Closed-form expression for the full conditionals:

P(yi^tr = ω | y−i^tr, x^tr, β, k) = exp(β nω(i)/k) / ∑_q exp(β nq(i)/k) ,

where nω(i) is the number of neighbours of i with class label ω.
Drawback

Because the neighbourhood structure is not symmetric (xi may be one of the k nearest neighbours of xj while xj is not one of the k nearest neighbours of xi), there usually is no joint probability distribution corresponding to these "full conditionals"!
Drawback (2)

Note: Holmes & Adams (2002) solve this problem by directly defining the joint as the pseudo-likelihood

f(y^tr | x^tr, β, k) ∝ ∏_{i=1}^n exp(β n_{yi}(i)/k) / ∑_q exp(β nq(i)/k) . . .

[with a missing constant Z(β)]

... but they are still using the same [wrong] predictive

P(yj^te = ω | y^tr, x^tr, xj^te, β, k) = exp(β nω(j)/k) / ∑_q exp(β nq(j)/k)
Resolution

Symmetrise the neighbourhood relation.

Principle: if xi^tr belongs to the k-nearest-neighbour set for xj^tr and xj^tr does not belong to the k-nearest-neighbour set for xi^tr, then xj^tr is added to the set of neighbours of xi^tr.
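A sketch of this symmetrisation, using a brute-force distance computation; the function names and the toy data are assumptions of this example:

```python
import numpy as np

def knn_neighbours(X, k):
    """Plain (possibly asymmetric) k-nearest-neighbour sets under the
    Euclidean metric, for an (n, d) data matrix X."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    return [set(np.argsort(d[i])[:k].tolist()) for i in range(len(X))]

def symmetrise(neigh):
    """If i is a neighbour of j but not conversely, add j to the
    neighbour set of i, making the relation symmetric."""
    sym = [set(s) for s in neigh]
    for j, s in enumerate(neigh):
        for i in s:
            sym[i].add(j)
    return sym

# 1-D toy data: the outlier at 5.0 chooses 1.1 as its nearest neighbour,
# but not conversely, so symmetrisation adds it back
X = np.array([[0.0], [1.0], [1.1], [5.0]])
sym = symmetrise(knn_neighbours(X, k=1))
```

After symmetrisation, j ∈ sym[i] if and only if i ∈ sym[j], which is exactly the property needed for a joint distribution to exist.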
Consequence

Given the full conditionals

P(yi^tr = ω | y−i^tr, x^tr, β, k) ∝ exp( β ∑_{l #k i} δω(yl^tr) / k ) ,

where l #k i is the symmetrised k-nearest-neighbour relation, there exists a corresponding joint distribution.
Extension to unclassified points

Predictive distribution of yj^te (j = 1, . . . , m) defined as

P(yj^te = ω | xj^te, y^tr, x^tr, β, k) ∝ exp( β ∑_{l #k j} δω(yl^tr) / k ) ,

where l #k j is the symmetrised k-nearest-neighbour relation wrt the training set {x1^tr, . . . , xn^tr}.
Bayesian modelling

Within the Bayesian paradigm, assign a prior π(β, k) like

π(β, k) ∝ I(1 ≤ k ≤ kmax) I(0 ≤ β ≤ βmax)

because there is a maximum value (e.g., βmax = 15) after which the distribution is [essentially] a Dirac mass [as in the Potts model], and because it can be argued that kmax = n/2.

Note
β is dimension-less because of the use of the frequencies nω(i)/k as covariates.
Bayesian global inference

Use the marginal predictive distribution of yj^te given xj^te (j = 1, . . . , m),

∫ P(yj^te = ω | xj^te, y^tr, x^tr, β, k) π(β, k | y^tr, x^tr) dβ dk ,

where

π(β, k | y^tr, x^tr) ∝ f(y^tr | x^tr, β, k) π(β, k)

is the posterior distribution of (β, k) given the training dataset y^tr.
[ŷj^te = MAP estimate]

Note
Model choice with no varying dimension, because β is the same for all models.
MCMC implementation

A Markov chain Monte Carlo (MCMC) approximation of

f(yn+1 | xn+1, y, X)

is provided by

M^−1 ∑_{i=1}^M f(yn+1 | xn+1, y, X, (β, k)^(i)) ,

where {(β, k)^(1), . . . , (β, k)^(M)} is the MCMC output associated with the stationary distribution π(β, k | y, X).
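This average is a one-liner once the MCMC output is stored; in the sketch below the `cond_prob` callback, which would evaluate the k-nn predictive for a given (β, k), is replaced by a dummy placeholder (an assumption of this example):

```python
import numpy as np

def predictive_probs(x_new, chain, cond_prob):
    """MCMC approximation of f(y_{n+1} | x_{n+1}, y, X): average the
    conditional predictive probabilities f(. | x_{n+1}, y, X, beta, k)
    over the simulated values {(beta, k)^(1), ..., (beta, k)^(M)}."""
    return np.mean([cond_prob(x_new, b, k) for b, k in chain], axis=0)

# dummy two-class illustration with a placeholder callback
chain = [(1.0, 3), (1.2, 5), (0.8, 7)]
cond_prob = lambda x, b, k: np.array([0.3, 0.7])
probs = predictive_probs(None, chain, cond_prob)
```

The Bayes classifier then reports the class maximising these averaged probabilities.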
Auxiliary variable version

Random walk Metropolis-Hastings algorithm on both β and k.

Since β ∈ (0, βmax), a logistic reparameterisation of β is

β = βmax e^θ / (1 + e^θ) ,

and the random walk θ ∼ N(θ^(t), τ²) is on θ.

For k, uniform proposal on the 2r neighbours of k^(t):
{k^(t) − r, . . . , k^(t) − 1, k^(t) + 1, . . . , k^(t) + r} ∩ {1, . . . , K}.

Simulation of f(z^tr | x^tr, β, k) by perfect sampling, taking advantage of monotonicity properties [but may get stuck for too large values of β].
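The two proposal moves can be sketched as follows; the values of βmax, τ, r and K and all function names are assumptions of this example:

```python
import numpy as np

beta_max = 15.0

def theta_of_beta(beta):
    """Logit reparameterisation theta = log(beta / (beta_max - beta))."""
    return np.log(beta / (beta_max - beta))

def beta_of_theta(theta):
    """Inverse map beta = beta_max * e^theta / (1 + e^theta) in (0, beta_max)."""
    return beta_max * np.exp(theta) / (1 + np.exp(theta))

def propose(beta, k, rng, tau=0.5, r=2, K=125):
    """Gaussian random walk on theta, plus a uniform draw amongst the
    2r neighbours of k truncated to {1, ..., K}."""
    theta_new = rng.normal(theta_of_beta(beta), tau)
    candidates = [k + d for d in range(-r, r + 1)
                  if d != 0 and 1 <= k + d <= K]
    return beta_of_theta(theta_new), int(rng.choice(candidates))

rng = np.random.default_rng(0)
beta_new, k_new = propose(1.0, 3, rng)
```

The logit map keeps every proposed β inside (0, βmax), so the random walk never needs rejection for range violations.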
Choice of (β, k) paramount

Illustration on Ripley's dataset: (k, β) = (53, 2.28) versus (k, β) = (13, 1.45)
Diabetes in Pima Indian women

Example (R benchmark)
"A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix (AZ), was tested for diabetes according to WHO criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin."

Covariates:
    number of pregnancies
    plasma glucose concentration in an oral glucose tolerance test
    diastolic blood pressure
    triceps skin fold thickness
    body mass index
    diabetes pedigree function
    age
Diabetes in Pima Indian women (cont'd)

MCMC output for βmax = 1.5, β = 1.15, k = 40, and 20,000 simulations.
Diabetes in Pima Indian women (cont'd)

Example (error rate & k selection)

k     Misclassification error rate
1     0.316
3     0.229
15    0.226
31    0.211
57    0.205
66    0.208
Predictive output

The approximate Bayesian prediction of yn+1 is

ŷn+1 = arg max_g M^−1 ∑_{i=1}^M f(g | xn+1, y, X, β^(i), k^(i)) .

E.g., Ripley's dataset misclassification error rate: 0.082.
A reassessment of pseudo-likelihood
1 MRFs
2 Bayesian inference in Gibbs random fields
3 Perfect sampling
4 k-nearest-neighbours
5 Pseudo-likelihood reassessed
6 Variable selection
Pseudo-likelihood

Pseudo-likelihood leads to an (almost) straightforward MCMC implementation.
Magnitude of the approximation

Since the perfect and path sampling approaches are also available for small datasets, the quality of the pseudo-likelihood approximation can be evaluated against them.
Ripley's benchmark (1)

Approximations to the posterior of β based on the pseudo-likelihood (green), path sampling (red) and perfect sampling (yellow) schemes, with k = 1, 10, 70, 125, for 20,000 iterations.
Ripley's benchmark (2)

Approximations of the posteriors of β (top) and k (bottom).
Variable selection
1 MRFs
2 Bayesian inference in Gibbs random fields
3 Perfect sampling
4 k-nearest-neighbours
5 Pseudo-likelihood reassessed
6 Variable selection
Goal: selection of the components of the predictor vector that best contribute to the classification.

    Parsimony (the dimension of the predictor may be larger than the training sample size n)
    Efficiency (more components blur class differences)

[Figure: classification boundaries and error counts for γ = (1,1) (err = 78), γ = (1,0) (err = 284), γ = (0,1) (err = 116) and γ = (1,1,1) (err = 159)]
Component indicators

Completion of (β, k) with indicator variables γj ∈ {0, 1} (1 ≤ j ≤ p) that determine which components of x are active in the model:

P(yi = Cj | y−i, X, β, k, γ) ∝ exp( β ∑_{l∈vk(i)} δCj(yl) / k )

with vk(i) the (symmetrised) k nearest neighbourhood of xi for the distance

d(xi, xℓ)² = ∑_{j=1}^p γj (xij − xℓj)² .
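The γ-dependent metric is straightforward to implement; the function name and toy values below are assumptions of this sketch:

```python
import numpy as np

def gamma_distance_sq(xi, xl, gamma):
    """Squared distance restricted to the active components:
    d(x_i, x_l)^2 = sum_j gamma_j * (x_ij - x_lj)^2, so switching a
    gamma_j off removes covariate j from the neighbourhood structure."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xl, dtype=float)
    return float(np.sum(np.asarray(gamma, dtype=float) * diff ** 2))

# with gamma = (1, 0, 1), the middle covariate is ignored
d = gamma_distance_sq([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [1, 0, 1])  # -> 10.0
```

Changing γ thus changes the k-nearest-neighbour sets themselves, which is why the neighbourhoods must be recomputed whenever a γj flips.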
Variable selection

Formal similarity with the usual variable selection in regression models. Use of a uniform prior on the γj's on {0, 1}, independently for all j's.

This amounts to exploring a range of models Mγ of size 2^p, which may be too large (see, e.g., the vision dataset with p = 200).
Implementation

Use of a naive "reversible jump" MCMC, where
1 (β, k) are changed conditional on γ, and
2 γ is changed one component at a time conditional on (β, k) [and the data].

Note
Validation of the simple jumps is due to (a) saturation of the dimension, by associating a γj with each variable, and (b) the hierarchical structure of the (β, k) part.

© This is not a varying dimension problem
MCMC algorithm

Variable selection k-nearest-neighbours
At time 0, generate γj^(0) ∼ B(1/2), log β^(0) ∼ N(0, τ²) and k^(0) ∼ U{1, . . . , K}.
At time 1 ≤ t ≤ T:
1 Generate log β̃ ∼ N(log β^(t−1), τ²) and k̃ ∼ U({k − r, k − r + 1, . . . , k + r − 1, k + r});
2 Calculate the Metropolis-Hastings acceptance probability ρ(β̃, k̃, β^(t−1), k^(t−1));
3 Move to (β^(t), k^(t)) by a Metropolis-Hastings step;
4 For j = 1, . . . , p, generate γj^(t) ∼ π(γj | y, X, γ−j^(t), β^(t), k^(t)).
Benchmark 1

Ripley's dataset with 8 additional potential [useless] covariates simulated from N(0, .05²).

Using the 250 datapoints for variable selection, comparison of the 2^10 = 1024 models by pseudo-maximum likelihood estimation of (k, β) and by comparison of pseudo-likelihoods leads to selecting the proper submodel

γ1 = γ2 = 1 and γ3 = · · · = γ10 = 0

with k = 3.1 and β = 3.8. Forward and backward selection procedures lead to the same conclusion.

The MCMC algorithm produces γ1 = γ2 = 1 and γ3 = · · · = γ10 = 0 as the MMAP, with very similar values for k and β [hardly any move away from (1, 1, 0, . . . , 0) is accepted].
Benchmark 2

Ripley's dataset with now 28 additional covariates simulated from N(0, .05²).

Using the 250 datapoints for variable selection, direct comparison of the 2^30 models by pseudo-maximum likelihood estimation is impossible!

Forward and backward selection procedures both lead to the proper submodel γ = (1, 1, 0, . . . , 0).

The MCMC algorithm again produces γ1 = γ2 = 1 and γ3 = · · · = γ30 = 0 as the MMAP, with more moves around γ = (1, 1, 0, . . . , 0).