Lecture 13
Outline of Rule-Based Classification
1. Overview of KNN
2. Parameter and Distance Choices
3. Characteristics of Instance-Based Learning
4. Example of KNN Classifiers
1. Nearest Neighbor Classifiers
Basic idea:
If it walks like a duck, quacks like a duck, then it’s probably a duck
[Figure: given a test record, compute its distance to the training records, then choose the k “nearest” records]
Nearest-Neighbor Classifiers
Requires three things:
– The set of stored records
– Distance Metric to compute distance between records
– The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
– Compute distance to other training records
– Identify k nearest neighbors
– Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
Definition of Nearest Neighbor
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of an unknown record x]
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
1. Overview of KNN Classifier
Instance-based learning; k-NN = k-nearest neighbors algorithm
– The training phase consists only of storing the feature vectors and class labels of the training samples
– k is a user-defined parameter
– A test point is selected, and the distances from the test point to every point in the training set are calculated
– The class label is assigned via majority vote of the k nearest neighbors to the test point
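The procedure above (store the training set, compute distances, take a majority vote) can be sketched as follows; the function name and toy data are illustrative, not from the lecture:

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by majority vote of its k nearest training points."""
    # Euclidean distance from the test point to every training point
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' class labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_train = np.array(["duck", "duck", "duck", "goose", "goose", "goose"])

print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # duck
```

Note that nothing is "trained" here beyond keeping `X_train` and `y_train` in memory, which is exactly the lazy-learning behavior described above.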
2. KNN Parameter Selection
Best choice for the value of k:
– Depends upon the data
– Larger values generally reduce noise effects
– But larger values make class boundaries less distinct
– Can be selected via heuristic techniques
– Can also use evolutionary algorithms to optimize the k value
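The slides mention heuristic techniques for choosing k; one common heuristic, shown here only as an illustrative sketch, is leave-one-out cross-validation over a few candidate values (the synthetic data is assumed):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: hold each point out, classify it from the rest."""
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        hits += knn_predict(X[mask], y[mask], X[i], k) == y[i]
    return hits / len(X)

# Two synthetic Gaussian clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Keep the candidate k with the best held-out accuracy
best_k = max([1, 3, 5, 7, 9], key=lambda k: loo_accuracy(X, y, k))
print("best k:", best_k)
```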
Nearest Neighbor Classification: choosing the value of k
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes
2. KNN Distance Measure Selection
The choice of distance measure is important:
– Nearest neighbors can produce incorrect predictions unless appropriate proximity measures are used and pre-processing steps are taken
– The scale of the attributes must be taken into consideration to avoid the distance being dominated by less important attributes
Nearest Neighbor Classification: compute the distance between two points
Euclidean distance:

    d(p, q) = sqrt( Σ_i (p_i − q_i)^2 )

Determine the class from the nearest-neighbor list:
– take the majority vote of class labels among the k nearest neighbors
– or weigh each vote according to distance, using the weight factor w = 1/d^2
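A minimal sketch of the distance-weighted vote with w = 1/d², assuming a small epsilon to guard against zero distances (the data is illustrative):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=3, eps=1e-12):
    """k-NN where each neighbor's vote is weighted by w = 1/d^2."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        w = 1.0 / (dists[i] ** 2 + eps)  # eps guards against d = 0
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]])
y_train = np.array(["a", "a", "b"])
print(weighted_knn_predict(X_train, y_train, np.array([0.05, 0.0]), k=3))  # a
```

With k = 3 every training point votes, but the two nearby "a" points carry far more weight than the distant "b" point, so the prediction is "a".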
Nearest Neighbor Classification: scaling issues
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. Example:
– height of a person may vary from 1.5 m to 1.8 m
– weight of a person may vary from 90 lb to 300 lb
– income of a person may vary from $10K to $1M
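One common way to address this (the specific min-max rescaling shown is an assumption, not prescribed by the slides) is to rescale every attribute to [0, 1] before computing distances:

```python
import numpy as np

def min_max_scale(X):
    """Rescale each attribute (column) to [0, 1] so no single
    attribute dominates the Euclidean distance."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

# Columns: height (m), weight (lb), income ($)
X = np.array([[1.5,  90.0,    10_000.0],
              [1.8, 300.0, 1_000_000.0],
              [1.6, 150.0,    50_000.0]])

X_scaled = min_max_scale(X)
print(X_scaled.min(axis=0))  # [0. 0. 0.]
print(X_scaled.max(axis=0))  # [1. 1. 1.]
```

Without scaling, the income column (range ~$1M) would swamp the height column (range 0.3 m) in every distance computation.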
Nearest Neighbor Classification: problem with the Euclidean measure
– High-dimensional data: the curse of dimensionality
– Can produce counter-intuitive results, e.g. for binary vectors:

    1 1 1 1 1 1 1 1 1 1 1 0        1 0 0 0 0 0 0 0 0 0 0 0
         vs                             vs
    0 1 1 1 1 1 1 1 1 1 1 1        0 0 0 0 0 0 0 0 0 0 0 1

    d = 1.4142                     d = 1.4142

Solution: normalize the vectors to unit length.
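A quick check of this example, using the two pairs of binary vectors above; normalizing to unit length makes the nearly identical pair much closer, while the disjoint pair stays far apart:

```python
import numpy as np

def euclid(a, b):
    return float(np.sqrt(((a - b) ** 2).sum()))

def unit(v):
    return v / np.linalg.norm(v)

v1 = np.array([1]*11 + [0], dtype=float)   # 111111111110
v2 = np.array([0] + [1]*11, dtype=float)   # 011111111111
v3 = np.array([1] + [0]*11, dtype=float)   # 100000000000
v4 = np.array([0]*11 + [1], dtype=float)   # 000000000001

# Raw Euclidean distance treats both pairs as equally far apart:
print(euclid(v1, v2), euclid(v3, v4))      # both 1.4142...

# After normalizing to unit length, the nearly identical pair is much closer:
print(euclid(unit(v1), unit(v2)))          # ≈ 0.4264
print(euclid(unit(v3), unit(v4)))          # still 1.4142...
```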
3. Characteristics of Instance-Based Learning
– No model is built; short training time, but classifying test points can be expensive
– Uses local information to make predictions
– Requires a distance measure and a classification function
– Decision boundaries can be of arbitrary shape = a more flexible model representation
– Decision boundaries are also highly variable
Instance-Based Classifiers
[Figure: a set of stored cases with attributes Atr1 … AtrN and class labels A, B, C, alongside an unseen case with the same attributes]
• Store the training records
• Use the training records to predict the class label of unseen cases
Instance-Based Classifiers: examples
– Rote-learner: memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly
– Nearest neighbor: uses the k “closest” points (nearest neighbors) for performing classification
Nearest Neighbor Classification: k-NN classifiers are lazy learners
– They do not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems
– Classifying unknown records is relatively expensive
Eager Learners: Characteristics
– Build a global model of the entire input space
– Long training time to develop the model, but short time classifying test points
– Decision tree and rule-based classifiers are restricted to rectilinear decision boundaries
4. Example: PEBLS
PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
– Works with both continuous and nominal features
– For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM)
– Each record is assigned a weight factor
– Number of nearest neighbors: k = 1
Example: PEBLS

Class counts by Marital Status:

    Class | Single | Married | Divorced
    Yes   |   2    |    0    |    1
    No    |   2    |    4    |    1

Class counts by Refund:

    Class | Yes | No
    Yes   |  0  |  3
    No    |  3  |  4

Distance between nominal attribute values (MVDM), where n_1i is the number of class-i records with value V1 and n_1 is the total number of records with value V1:

    d(V1, V2) = Σ_i | n_1i/n_1 − n_2i/n_2 |

    d(Single, Married)       = | 2/4 − 0/4 | + | 2/4 − 4/4 | = 1
    d(Single, Divorced)      = | 2/4 − 1/2 | + | 2/4 − 1/2 | = 0
    d(Married, Divorced)     = | 0/4 − 1/2 | + | 4/4 − 1/2 | = 1
    d(Refund=Yes, Refund=No) = | 0/3 − 3/7 | + | 3/3 − 4/7 | = 6/7

Training data:

    Tid | Refund | Marital Status | Taxable Income | Cheat
     1  |  Yes   |  Single        |  125K          |  No
     2  |  No    |  Married       |  100K          |  No
     3  |  No    |  Single        |   70K          |  No
     4  |  Yes   |  Married       |  120K          |  No
     5  |  No    |  Divorced      |   95K          |  Yes
     6  |  No    |  Married       |   60K          |  No
     7  |  Yes   |  Divorced      |  220K          |  No
     8  |  No    |  Single        |   85K          |  Yes
     9  |  No    |  Married       |   75K          |  No
    10  |  No    |  Single        |   90K          |  Yes
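The MVDM computations above can be sketched as follows, with the class counts hard-coded from the tables (the function name is illustrative):

```python
def mvdm(counts_v1, counts_v2):
    """Modified value difference metric between two nominal values.
    counts_v1[i] = number of records with value V1 in class i."""
    n1, n2 = sum(counts_v1), sum(counts_v2)
    return sum(abs(a / n1 - b / n2) for a, b in zip(counts_v1, counts_v2))

# Class counts (Yes, No) per attribute value, from the tables above
single, married, divorced = (2, 2), (0, 4), (1, 1)
refund_yes, refund_no = (0, 3), (3, 4)

print(mvdm(single, married))        # 1.0
print(mvdm(single, divorced))       # 0.0
print(mvdm(married, divorced))      # 1.0
print(mvdm(refund_yes, refund_no))  # 6/7 ≈ 0.857
```

Note that Single and Divorced come out at distance 0 because they have identical class distributions (50% Yes, 50% No), even though the values themselves differ.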
Example: PEBLS

    Tid | Refund | Marital Status | Taxable Income | Cheat
     X  |  Yes   |  Single        |  125K          |  No
     Y  |  No    |  Married       |  100K          |  No

Distance between record X and record Y:

    Δ(X, Y) = w_X · w_Y · Σ_{i=1..d} d(X_i, Y_i)^2

where:

    w_X = (number of times X is used for prediction) / (number of times X predicts correctly)

– w_X ≈ 1 if X makes accurate predictions most of the time
– w_X > 1 if X is not reliable for making predictions
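A sketch of the PEBLS record distance, restricted to the two nominal attributes and assuming weights of 1 (i.e., both records are treated as reliable); the MVDM values are taken from the calculations above:

```python
def pebls_distance(x, y, w_x, w_y, nominal_dist):
    """Delta(X, Y) = w_X * w_Y * sum_i d(X_i, Y_i)^2 over nominal features.
    nominal_dist maps a frozenset of two values to their MVDM distance."""
    total = 0.0
    for xi, yi in zip(x, y):
        d = 0.0 if xi == yi else nominal_dist[frozenset((xi, yi))]
        total += d ** 2
    return w_x * w_y * total

# MVDM distances computed on the slide
nominal_dist = {
    frozenset(("Single", "Married")): 1.0,
    frozenset(("Single", "Divorced")): 0.0,
    frozenset(("Married", "Divorced")): 1.0,
    frozenset(("Yes", "No")): 6 / 7,   # Refund
}

X = ("Yes", "Single")   # (Refund, Marital Status); Taxable Income ignored here
Y = ("No", "Married")
print(pebls_distance(X, Y, w_x=1.0, w_y=1.0, nominal_dist=nominal_dist))
# (6/7)^2 + 1^2 ≈ 1.7347
```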
Example 2: distance between two documents

    Word  | Touchdown | Capital | Murder | Juror | Profit | Score
    Doc A |     2     |    0    |   0    |   0   |   1    |   3
    Doc B |     0     |    3    |   1    |   0   |   5    |   0
    Doc C |     0     |    1    |   4    |   2   |   0    |   1
    Doc E |     3     |    0    |   1    |   1   |   1    |   4
    Doc F |     1     |    4    |   3    |   3   |   1    |   0
    Doc G |     0     |    1    |   0    |   0   |   3    |   1
Application in natural language processing:
Use term frequencies to compare documents and classify documents in a corpus.
Problem: raw counts are dominated by closed-class words like ‘the’, ‘to’, ‘in’, etc.
Solution: use tf.idf to normalize word ‘importance’ when counting term frequencies.
Significance:
– tf: term frequency
– idf: inverse document frequency = log(N/df), where N is the number of documents in the training corpus and df is the number of documents in which term i appears
– tf.idf = tf × log(N/df)
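A minimal tf.idf sketch (natural log is used here; the choice of base only rescales the weights). The toy documents are illustrative:

```python
import math

def tf_idf(term_counts_per_doc):
    """tf.idf = tf * log(N / df) for each term in each document."""
    N = len(term_counts_per_doc)
    # df: number of documents in which each term appears
    df = {}
    for counts in term_counts_per_doc:
        for term, c in counts.items():
            if c > 0:
                df[term] = df.get(term, 0) + 1
    return [{term: c * math.log(N / df[term]) for term, c in counts.items()}
            for counts in term_counts_per_doc]

docs = [
    {"touchdown": 2, "profit": 1, "score": 3},
    {"capital": 3, "murder": 1, "profit": 5},
    {"capital": 1, "murder": 4, "juror": 2, "score": 1},
]
weights = tf_idf(docs)
# "profit" appears in 2 of 3 docs, so each occurrence is damped
# relative to "touchdown", which appears in only 1 of 3
print(weights[0])
```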
Example (one news document about the arrest of Augusto Pinochet): words ranked by normalized term frequency (left two columns) versus the same document ranked by tf.idf weight (right column):

    Word     | N-TF  || Word    | N-TF  || Word       | tf.idf
    the      | 1.00  || Chile   | 0.167 || Pinochet   | 1.0
    of       | 0.703 || his     | 0.159 || Spanish    | 0.377
    in       | 0.536 || have    | 0.152 || Chile      | 0.254
    a        | 0.504 || has     | 0.149 || Chilean    | 0.245
    and      | 0.489 || British | 0.127 || Garzon     | 0.238
    to       | 0.482 || The     | 0.127 || Argentina  | 0.176
    Pinochet | 0.380 || arrest  | 0.120 || diplomatic | 0.163
    that     | 0.308 || who     | 0.120 || British    | 0.154
    for      | 0.250 || Chilean | 0.105 || immunity   | 0.141
    said     | 0.214 || Argent. | 0.098 || dictator   | 0.133
    was      | 0.203 || by      | 0.098 || arrest     | 0.123
    he       | 0.196 || from    | 0.098 || military   | 0.117
    Spanish  | 0.185 || as      | 0.094 || genocide   | 0.114
    ‘s       | 0.174 || with    | 0.094 || Augusto    | 0.105
    on       | 0.170 || Surgery | 0.090 || crime      | 0.094

By raw frequency the ranking is dominated by closed-class words (‘the’, ‘of’, ‘in’); after tf.idf weighting, content words such as ‘Pinochet’ and ‘Spanish’ rise to the top.