Transcript of: Classification Problem
Classification Problem
• Given: training data $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, with $\mathbf{x}_n \in \mathbb{R}^q$ and class labels $y_n \in \{1, \dots, J\}$
• Predict the class label of a given query $\mathbf{x}_0$
[Figure: training points from two classes, '+' and '-', scattered in the $(x_1, x_2)$ plane, with the query point $\mathbf{x}_0$ marked.]
Classification Problem
• The joint probability distribution $P(\mathbf{x}, y)$ is unknown
• We need to estimate the posteriors: $f_j(\mathbf{x}_0) = P(j \mid \mathbf{x}_0)$, for $j = 1, \dots, J$
The Bayesian Classifier
• Loss function: $\lambda(j \mid k)$, the loss incurred when class $j$ is predicted and the true class is $k$
• Expected loss (conditional risk) associated with class $j$:
$$R(j \mid \mathbf{x}) = \sum_{k=1}^{J} \lambda(j \mid k) \, P(k \mid \mathbf{x})$$
• Bayes rule: $j^* = \arg\min_{1 \le j \le J} R(j \mid \mathbf{x})$
• Zero-one loss function:
$$\lambda(j \mid k) = \begin{cases} 0 & \text{if } j = k \\ 1 & \text{if } j \neq k \end{cases}$$
• Under zero-one loss, the Bayes rule becomes $j^* = \arg\max_{1 \le j \le J} P(j \mid \mathbf{x})$
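The decision rule above can be sketched numerically. The loss matrix and posterior values below are made-up illustrative numbers, not from the slides:

```python
import numpy as np

# Posterior probabilities P(k | x) for J = 3 classes (illustrative values).
posterior = np.array([0.5, 0.3, 0.2])

# Loss matrix: lam[j, k] = loss of predicting class j when the true class is k.
# Here: zero-one loss, i.e. 0 on the diagonal and 1 off it.
lam = 1.0 - np.eye(3)

# Conditional risk R(j | x) = sum_k lam[j, k] * P(k | x) for each candidate j.
risk = lam @ posterior

# Bayes rule: pick the class with minimum conditional risk.
j_star = int(np.argmin(risk))

# Under zero-one loss this coincides with the maximum-posterior class.
assert j_star == int(np.argmax(posterior))
```

With a non-uniform loss matrix, the argmin of the risk can differ from the argmax of the posterior, which is why the general rule is stated in terms of $R(j \mid \mathbf{x})$.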
The Bayesian Classifier
• The Bayes rule $j^* = \arg\max_{1 \le j \le J} P(j \mid \mathbf{x})$ achieves the minimum error rate
• How do we estimate the posterior probabilities $P(j \mid \mathbf{x})$, $j = 1, \dots, J$?
• In practice we use the plug-in rule: $\hat{j} = \arg\max_{1 \le j \le J} \hat{P}(j \mid \mathbf{x})$
Density Estimation
• Use Bayes theorem to estimate the posterior probability values:
$$P(j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid j) \, P(j)}{\sum_{k=1}^{J} p(\mathbf{x} \mid k) \, P(k)}$$
• $p(\mathbf{x} \mid j)$ is the probability density function of $\mathbf{x}$ given class $j$
• $P(j)$ is the prior probability of class $j$
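A minimal numerical sketch of the Bayes-theorem computation above; the Gaussian form of the class-conditional densities and all parameter values are assumptions for illustration only:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """One-dimensional Gaussian density (an assumed form for p(x | j))."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative two-class model: priors P(j) and per-class density parameters.
priors = np.array([0.6, 0.4])
mus = np.array([0.0, 2.0])
sigmas = np.array([1.0, 1.0])

def posterior(x0):
    """Bayes theorem: P(j | x0) = p(x0 | j) P(j) / sum_k p(x0 | k) P(k)."""
    likelihoods = gaussian_pdf(x0, mus, sigmas)  # p(x0 | j) for each class
    joint = likelihoods * priors                  # numerator, per class
    return joint / joint.sum()                    # normalize by the denominator

p = posterior(1.0)
assert np.isclose(p.sum(), 1.0)  # posteriors sum to one
```

A query far to the right of both means (e.g. $x_0 = 5$) yields a posterior dominated by the class with the larger mean, as expected.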
Naïve Bayes Classifier
• Makes the assumption of independence of the features given the class:
$$p(\mathbf{x} \mid j) = p(x_1, x_2, \dots, x_q \mid j) = \prod_{i=1}^{q} p(x_i \mid j)$$
• The task of estimating a $q$-dimensional density function is reduced to the estimation of $q$ one-dimensional density functions. Thus, the complexity of the task is drastically reduced.
• The use of Bayes theorem becomes much simpler.
• Proven to be effective in practice.
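The independence assumption can be sketched as a small classifier. The Gaussian form of each one-dimensional density $p(x_i \mid j)$ is an assumption here (the slides leave the density family unspecified), and the toy data are invented:

```python
import numpy as np

class GaussianNaiveBayes:
    """Minimal sketch: class priors plus q one-dimensional Gaussian densities
    per class, multiplied together via the independence assumption."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors = np.array([np.mean(y == c) for c in self.classes])
        # One mean/std per (class, feature): q one-dimensional estimates per class.
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.sigma = np.array([X[y == c].std(axis=0) + 1e-9 for c in self.classes])
        return self

    def predict(self, X):
        labels = []
        for x in X:
            # log p(x | j) = sum_i log p(x_i | j)   (independence assumption)
            log_lik = -0.5 * np.sum(((x - self.mu) / self.sigma) ** 2
                                    + np.log(2 * np.pi * self.sigma ** 2), axis=1)
            labels.append(self.classes[np.argmax(log_lik + np.log(self.priors))])
        return np.array(labels)

# Usage on a toy two-class problem with q = 3 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(4, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
model = GaussianNaiveBayes().fit(X, y)
pred = model.predict(np.array([[0.1, -0.2, 0.3], [4.2, 3.9, 4.1]]))
```

Working in log space avoids the numerical underflow that multiplying many small densities would cause.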
Nearest-Neighbor Methods
• Predict the class label of $\mathbf{x}_0$ as the most frequent one occurring among its $K$ nearest neighbors
[Figure: the two-class scatter plot in the $(x_1, x_2)$ plane, with a neighborhood drawn around the query point $\mathbf{x}_0$.]
Nearest-Neighbor Methods
• Predict the class label of $\mathbf{x}_0$ as the most frequent one occurring among its $K$ nearest neighbors
• Requires a distance metric
• Basic assumption: $f(\mathbf{x}) \approx f(\mathbf{x}')$ when the distance between $\mathbf{x}$ and $\mathbf{x}'$ is small, i.e. the target function varies slowly over small neighborhoods
[Figure: the two-class scatter plot with a neighborhood of the query point $\mathbf{x}_0$, illustrating the role of the distance metric.]
Example: Letter Recognition
[Figure: sample letter images; extracted features include the edge count and the first statistical moment.]
Asymptotic Properties of K-NN Methods
$$\lim_{N \to \infty} \hat{f}_j(\mathbf{x}) = f_j(\mathbf{x}) \quad \text{if } \lim_{N \to \infty} K = \infty \text{ and } \lim_{N \to \infty} K/N = 0$$
• The first condition ($K \to \infty$) reduces the variance by making the estimation independent of the accidental characteristics of the $K$ nearest neighbors.
• The second condition ($K/N \to 0$) reduces the bias by assuring that the $K$ nearest neighbors are arbitrarily close to the query point.
Asymptotic Properties of K-NN Methods
$$\lim_{N \to \infty} E_1 \le 2 E^*$$
• $E_1$: classification error rate of the 1-NN rule
• $E^*$: classification error rate of the Bayes rule
• In the asymptotic limit, no decision rule is more than twice as accurate as the 1-NN rule.
Finite-Sample Settings
• If the number of training data $N$ is large and the number of input features $q$ is small, then the asymptotic results may still be valid.
• However, for a moderate to large number of input variables, the sample size required for their validity is beyond feasibility.
• How well does the 1-NN rule work in finite-sample settings?
Curse-of-Dimensionality
• This phenomenon is known as the curse-of-dimensionality.
• It refers to the fact that in high-dimensional spaces data become extremely sparse and are far apart from each other.
• It affects any estimation problem with high dimensionality.
Curse of Dimensionality
[Figure: for a sample of size $N = 500$ uniformly distributed in $[0,1]^q$, plots of the maximum distance DMAX, the minimum distance DMIN, and the ratio DMAX/DMIN from a given point as the dimensionality $q$ varies.]
• The distribution of the ratio DMAX/DMIN converges to 1 as the dimensionality increases.
• The variance of distances from a given point converges to 0 (relative to their magnitude) as the dimensionality increases.
• Distance values from a given point flatten out as the dimensionality increases.
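The concentration of distances described above is easy to reproduce empirically. This is a sketch of the slide's experiment, using the same setup ($N = 500$ points uniform in $[0,1]^q$) but with dimensions chosen here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

def dmax_dmin_ratio(q):
    """Draw N uniform points in [0,1]^q and compare the farthest and
    closest distances from a random query point."""
    X = rng.uniform(size=(N, q))
    x0 = rng.uniform(size=q)
    d = np.linalg.norm(X - x0, axis=1)
    return d.max() / d.min()

ratio_low = dmax_dmin_ratio(q=2)     # low dimension: huge spread
ratio_high = dmax_dmin_ratio(q=500)  # high dimension: ratio close to 1
assert ratio_high < ratio_low
```

In two dimensions the nearest point is far closer than the farthest one, while in hundreds of dimensions all 500 points sit at nearly the same distance from the query, which is exactly what undermines nearest-neighbor methods.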
Computing Radii of Nearest Neighborhoods
• $d(q, N)$: median radius of a nearest neighborhood, for the uniform distribution on the unit cube $[-.5, .5]^q$
• Random sample of size $N$ from the uniform distribution in the $q$-dimensional unit hypercube
• Diameter of a neighborhood ($K = 1$) using Euclidean distance: $d(q, N) = O(N^{-1/q})$

q        4     4     6     6     10    10     20     20     20
N      100  1000   100  1000  1000  10000  10000   10^6  10^10
d(q,N) 0.42  0.23  0.71  0.48  0.91  0.72   1.51   1.20   0.76

• As the dimensionality increases, the distance from the closest point increases faster, since $N^{-1/q}$ shrinks ever more slowly with $N$.
• Large $d(q, N)$ ⇒ highly biased estimations.
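The small-$N$ entries of the table can be checked by simulation; the trial count and the specific $(q, N)$ pairs below are choices made here, and the table's large-$N$ entries come from the $O(N^{-1/q})$ formula rather than from sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def median_nn_radius(q, N, trials=200):
    """Median radius of the 1-NN neighborhood of a random query point,
    for N points uniform in the unit cube [-.5, .5]^q."""
    radii = []
    for _ in range(trials):
        X = rng.uniform(-0.5, 0.5, size=(N, q))
        x0 = rng.uniform(-0.5, 0.5, size=q)
        radii.append(np.linalg.norm(X - x0, axis=1).min())
    return float(np.median(radii))

# d(q, N) shrinks slowly with N (like N^(-1/q)) but grows with q.
assert median_nn_radius(4, 1000) < median_nn_radius(4, 100)
assert median_nn_radius(10, 1000) > median_nn_radius(4, 1000)
```

Even multiplying the sample size by ten barely shrinks the neighborhood radius in high dimension, which is the table's point: the "nearest" neighbor is not actually near.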
Curse-of-Dimensionality
• It is a serious problem in many real-world applications:
• Microarray data: 3,000-4,000 genes;
• Documents: 10,000-20,000 words in the dictionary;
• Images, face recognition, etc.
How can we deal with the curse of dimensionality?
Covariance Matrix
• For a sample $\mathbf{x}_1, \dots, \mathbf{x}_N$, $\mathbf{x}_i = (x_{i1}, x_{i2})^T \in \mathbb{R}^2$:
$$\boldsymbol{\mu} = E[\mathbf{x}] \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i$$
$$\Sigma = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] \approx \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T$$
• Writing out the two-dimensional case:
$$\Sigma \approx \frac{1}{N} \sum_{i=1}^{N} \begin{pmatrix} (x_{i1} - \mu_1)^2 & (x_{i1} - \mu_1)(x_{i2} - \mu_2) \\ (x_{i2} - \mu_2)(x_{i1} - \mu_1) & (x_{i2} - \mu_2)^2 \end{pmatrix}$$
• The diagonal entries are the variances of the individual features; the off-diagonal entries are their covariance.
[Figure: example scatter plots with their estimated $2 \times 2$ covariance matrices, illustrating different degrees of correlation between the two features.]
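The sample mean and covariance formulas above translate directly to code; the mixing matrix used to generate correlated toy data is an arbitrary choice here:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0], [1.5, 0.5]])   # mixing matrix -> correlated features
X = rng.normal(size=(10_000, 2)) @ A.T   # N samples in R^2

mu = X.mean(axis=0)                       # mu ~= (1/N) sum x_i
centered = X - mu
Sigma = centered.T @ centered / len(X)    # (1/N) sum (x_i - mu)(x_i - mu)^T

# Diagonal entries are the feature variances; off-diagonals the covariance.
assert np.allclose(Sigma, Sigma.T)
assert np.allclose(Sigma, np.cov(X.T, bias=True))
```

`np.cov(..., bias=True)` uses the same $1/N$ normalization as the slide's estimator (the default divides by $N - 1$ instead).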
Dimensionality Reduction
• Many dimensions are often interdependent (correlated).
We can:
• Reduce the dimensionality of problems;
• Transform interdependent coordinates into significant and independent ones.
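One standard way to carry out such a transformation is principal component analysis via the eigendecomposition of the covariance matrix; the slides do not name a method, so PCA is an assumption here, and the toy data are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 0.0], [2.0, 0.3]])
X = rng.normal(size=(5000, 2)) @ A.T          # correlated 2-d sample

Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(Xc)                   # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigenvalues in ascending order

# Rotate onto the eigenvectors: the new coordinates are uncorrelated,
# and dropping small-eigenvalue directions reduces the dimensionality.
Z = Xc @ eigvecs
Sigma_Z = Z.T @ Z / len(Z)
assert abs(Sigma_Z[0, 1]) < 1e-8              # off-diagonal covariance vanishes
```

Keeping only the columns of `eigvecs` with the largest eigenvalues projects the data onto the few directions that carry most of the variance.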