SEEM4630 2013-2014 Tutorial 2
Classification:
Decision tree, Naïve Bayes & k-NN
Wentao TIAN, [email protected]
Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes: decision tree, Naïve Bayes, k-NN.
Goal: previously unseen records should be assigned a class as accurately as possible.
Decision Tree
Goal: construct a tree so that instances belonging to different classes are separated.
Basic algorithm (a greedy algorithm):
The tree is constructed in a top-down recursive manner.
At the start, all the training examples are at the root.
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Examples are partitioned recursively based on the selected attributes.
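As a rough sketch of this greedy, top-down procedure (the names build_tree and select_attribute are illustrative, not part of the tutorial; select_attribute stands for any of the measures introduced below):

```python
from collections import Counter

def build_tree(records, attributes, select_attribute):
    """Greedy top-down induction. records: list of (attribute-dict, label) pairs."""
    labels = [label for _, label in records]
    if len(set(labels)) == 1 or not attributes:
        # Pure node, or no attributes left to test: return the majority class as a leaf.
        return Counter(labels).most_common(1)[0][0]
    best = select_attribute(records, attributes)  # e.g. the attribute with the highest gain
    tree = {best: {}}
    for value in {rec[best] for rec, _ in records}:
        subset = [(rec, lab) for rec, lab in records if rec[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, select_attribute)
    return tree
```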
Attribute Selection Measure 1: Information Gain
Let $p_i$ be the probability that a tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.
Expected information (entropy) needed to classify a tuple in D:
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
Information needed (after using A to split D into v partitions) to classify D:
$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$
Information gained by branching on attribute A:
$$Gain(A) = Info(D) - Info_A(D)$$
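As a quick check of these formulas, here is a minimal sketch in Python (the names entropy and info_gain are illustrative); it reproduces the Info(S) = 0.94 and Gain(Outlook) = 0.25 values from the worked example later in this tutorial:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): expected bits needed to classify a tuple drawn from D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D), where values is the column of attribute A."""
    n = len(labels)
    info_a = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        info_a += (len(subset) / n) * entropy(subset)
    return entropy(labels) - info_a

# Play Tennis labels and the Outlook column, in the row order of the table below
play    = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
           "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]

print(round(entropy(play), 2))             # 0.94  = Info(S)
print(round(info_gain(outlook, play), 2))  # 0.25  = Gain(Outlook)
```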
Attribute Selection Measure 2: Gain Ratio
Information gain measure is biased towards attributes with a large number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
GainRatio(A) = Gain(A) / SplitInfo(A), where
$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\!\left(\frac{|D_j|}{|D|}\right)$$
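Continuing the sketch above (it reuses info_gain and the outlook/play lists defined there), hypothetical split_info and gain_ratio helpers might look like:

```python
from collections import Counter
from math import log2

def split_info(values):
    """SplitInfo_A(D): entropy of the partition induced by attribute A's values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(values, labels):
    return info_gain(values, labels) / split_info(values)

print(round(gain_ratio(outlook, play), 3))  # 0.156, roughly 0.247 / 1.577
```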
Attribute Selection Measure 3: Gini index
If a data set D contains examples from n classes, the gini index gini(D) is defined as
$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$
where $p_j$ is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the gini index $gini_A(D)$ is defined as
$$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$$
Reduction in impurity:
$$\Delta gini(A) = gini(D) - gini_A(D)$$
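A similar sketch for the gini index, again reusing the outlook/play lists from the earlier block (the binary split chosen here, Overcast versus the rest, is only for illustration):

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum of squared relative class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """gini_A(D) for a binary split of D into two label lists D1 and D2."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

overcast = [lab for o, lab in zip(outlook, play) if o == "Overcast"]
rest     = [lab for o, lab in zip(outlook, play) if o != "Overcast"]
print(round(gini(play) - gini_split(overcast, rest), 3))  # reduction in impurity, about 0.102
```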
Example
Outlook Temperature Humidity Wind Play Tennis
Sunny >25 High Weak No
Sunny >25 High Strong No
Overcast >25 High Weak Yes
Rain 15-25 High Weak Yes
Rain <15 Normal Weak Yes
Rain <15 Normal Strong No
Overcast <15 Normal Strong Yes
Sunny 15-25 High Weak No
Sunny <15 Normal Weak Yes
Rain 15-25 Normal Weak Yes
Sunny 15-25 Normal Strong Yes
Overcast 15-25 High Strong Yes
Overcast >25 Normal Weak Yes
Rain 15-25 High Strong No
Tree induction example
Entropy of data S:
Info(S) = -9/14(log2(9/14)) - 5/14(log2(5/14)) = 0.94
Split data by attribute Outlook:
S[9+, 5-] → Outlook → Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]
Gain(Outlook) = 0.94 – 5/14[-2/5(log2(2/5)) - 3/5(log2(3/5))] – 4/14[-4/4(log2(4/4)) - 0/4(log2(0/4))] – 5/14[-3/5(log2(3/5)) - 2/5(log2(2/5))] = 0.94 – 0.69 = 0.25
Split data by attribute Temperature:
S[9+, 5-] → Temperature → <15 [3+, 1-], 15-25 [4+, 2-], >25 [2+, 2-]
Gain(Temperature) = 0.94 – 4/14[-3/4(log2(3/4)) - 1/4(log2(1/4))] – 6/14[-4/6(log2(4/6)) - 2/6(log2(2/6))] – 4/14[-2/4(log2(2/4)) - 2/4(log2(2/4))] = 0.94 – 0.91 = 0.03
Split data by attribute Humidity:
S[9+, 5-] → Humidity → High [3+, 4-], Normal [6+, 1-]
Gain(Humidity) = 0.94 – 7/14[-3/7(log2(3/7)) - 4/7(log2(4/7))] – 7/14[-6/7(log2(6/7)) - 1/7(log2(1/7))] = 0.94 – 0.79 = 0.15
Split data by attribute Wind:
S[9+, 5-] → Wind → Weak [6+, 2-], Strong [3+, 3-]
Gain(Wind) = 0.94 – 8/14[-6/8(log2(6/8)) - 2/8(log2(2/8))] – 6/14[-3/6(log2(3/6)) - 3/6(log2(3/6))] = 0.94 – 0.89 = 0.05
Gain(Outlook) = 0.25, Gain(Temperature) = 0.03, Gain(Humidity) = 0.15, Gain(Wind) = 0.05
Outlook gives the largest gain, so it becomes the root:
Outlook → Overcast: Yes; Sunny: ??; Rain: ??
The Sunny and Rain branches are still impure, so each is split further using only the training records that reach it.
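These root-level gains can be reproduced with the info_gain sketch from earlier, given the remaining attribute columns (listed in the same row order as the table above):

```python
temperature = [">25", ">25", ">25", "15-25", "<15", "<15", "<15",
               "15-25", "<15", "15-25", "15-25", "15-25", ">25", "15-25"]
humidity    = ["High", "High", "High", "High", "Normal", "Normal", "Normal",
               "High", "Normal", "Normal", "Normal", "High", "Normal", "High"]
wind        = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
               "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]

for name, column in [("Outlook", outlook), ("Temperature", temperature),
                     ("Humidity", humidity), ("Wind", wind)]:
    print(name, round(info_gain(column, play), 2))
# Outlook 0.25, Temperature 0.03, Humidity 0.15, Wind 0.05
```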
Entropy of branch Sunny:
Info(Sunny) = -2/5(log2(2/5)) - 3/5(log2(3/5)) = 0.97
Split Sunny branch by attribute Temperature:
Sunny [2+, 3-] → Temperature → <15 [1+, 0-], 15-25 [1+, 1-], >25 [0+, 2-]
Gain(Temperature) = 0.97 – 1/5[-1/1(log2(1/1)) - 0/1(log2(0/1))] – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))] – 2/5[-0/2(log2(0/2)) - 2/2(log2(2/2))] = 0.97 – 0.4 = 0.57
Split Sunny branch by attribute Humidity:
Sunny [2+, 3-] → Humidity → High [0+, 3-], Normal [2+, 0-]
Gain(Humidity) = 0.97 – 3/5[-0/3(log2(0/3)) - 3/3(log2(3/3))] – 2/5[-2/2(log2(2/2)) - 0/2(log2(0/2))] = 0.97 – 0 = 0.97
Split Sunny branch by attribute Wind:
Sunny [2+, 3-] → Wind → Weak [1+, 2-], Strong [1+, 1-]
Gain(Wind) = 0.97 – 3/5[-1/3(log2(1/3)) - 2/3(log2(2/3))] – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))] = 0.97 – 0.95 = 0.02
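The same helpers can be restricted to the records that reach the Sunny branch, reproducing the three gains above:

```python
sunny_idx = [i for i, o in enumerate(outlook) if o == "Sunny"]
for name, column in [("Temperature", temperature), ("Humidity", humidity), ("Wind", wind)]:
    branch_values = [column[i] for i in sunny_idx]
    branch_labels = [play[i] for i in sunny_idx]
    print(name, round(info_gain(branch_values, branch_labels), 2))
# Temperature 0.57, Humidity 0.97, Wind 0.02
```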
Humidity gives the largest gain on the Sunny branch, so the tree so far is:
Outlook → Overcast: Yes; Sunny: Humidity (High: No, Normal: Yes); Rain: ??
Entropy of branch Rain:
Info(Rain) = -3/5(log2(3/5)) - 2/5(log2(2/5)) = 0.97
Split Rain branch by attribute Temperature:
Rain [3+, 2-] → Temperature → <15 [1+, 1-], 15-25 [2+, 1-], >25 [0+, 0-]
Gain(Temperature) = 0.97 – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))] – 3/5[-2/3(log2(2/3)) - 1/3(log2(1/3))] – 0/5[0] = 0.97 – 0.95 = 0.02 (the >25 partition is empty)
Split Rain branch by attribute Humidity:
Rain [3+, 2-] → Humidity → High [1+, 1-], Normal [2+, 1-]
Gain(Humidity) = 0.97 – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))] – 3/5[-2/3(log2(2/3)) - 1/3(log2(1/3))] = 0.97 – 0.95 = 0.02
Split Rain branch by attribute Wind:
Rain [3+, 2-] → Wind → Weak [3+, 0-], Strong [0+, 2-]
Gain(Wind) = 0.97 – 3/5[-3/3(log2(3/3)) - 0/3(log2(0/3))] – 2/5[-0/2(log2(0/2)) - 2/2(log2(2/2))] = 0.97 – 0 = 0.97
Wind gives the largest gain on the Rain branch, so the final tree is:
Outlook → Overcast: Yes
Outlook → Sunny: Humidity → High: No, Normal: Yes
Outlook → Rain: Wind → Weak: Yes, Strong: No
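Read as a set of rules, the finished tree could be written out as follows (a minimal sketch; play_tennis is an illustrative name):

```python
def play_tennis(outlook, humidity, wind):
    """The decision tree above, expressed as nested rules."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    return "Yes" if wind == "Weak" else "No"  # outlook == "Rain"

print(play_tennis("Sunny", "High", "Weak"))   # "No", matching the first training record
```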
Bayesian Classification
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities $P(C_i \mid x_1, x_2, \ldots, x_n)$, where $x_i$ is the value of attribute $A_i$.
Choose the class label that has the highest probability.
Foundation: based on Bayes' Theorem:
$$P(C_i \mid x_1, x_2, \ldots, x_n) = \frac{P(x_1, x_2, \ldots, x_n \mid C_i)\, P(C_i)}{P(x_1, x_2, \ldots, x_n)}$$
Here $P(C_i \mid x_1, \ldots, x_n)$ is the posterior probability, $P(x_1, \ldots, x_n \mid C_i)$ is the likelihood, and $P(C_i)$ is the prior probability.
Model: compute $P(x_1, x_2, \ldots, x_n \mid C_i)$ from data.
Naïve Bayes Classifier
Problem: joint probabilities are difficult to estimate.
Naïve Bayes assumption: attributes are conditionally independent given the class:
$$P(x_1, x_2, \ldots, x_n \mid C_i) = P(x_1 \mid C_i) \times \cdots \times P(x_n \mid C_i)$$
$$P(C_i \mid x_1, x_2, \ldots, x_n) = \frac{\left(\prod_{j=1}^{n} P(x_j \mid C_i)\right) P(C_i)}{P(x_1, x_2, \ldots, x_n)}$$
Training data (attributes A, B and class C):
A B C
m b t
m s t
g q t
h s t
g q t
g q f
g s f
h b f
h q f
m b f
Example: Naïve Bayes Classifier
P(C=t) = 1/2, P(C=f) = 1/2
P(A=m|C=t) = 2/5, P(A=m|C=f) = 1/5, P(B=q|C=t) = 2/5, P(B=q|C=f) = 2/5
Test record: A=m, B=q, C=?
For C = t: P(A=m|C=t) * P(B=q|C=t) * P(C=t) = 2/5 * 2/5 * 1/2 = 2/25, so P(C=t|A=m, B=q) = (2/25) / P(A=m, B=q)
For C = f: P(A=m|C=f) * P(B=q|C=f) * P(C=f) = 1/5 * 2/5 * 1/2 = 1/25, so P(C=f|A=m, B=q) = (1/25) / P(A=m, B=q)
2/25 > 1/25, so the posterior for C=t is higher. Conclusion: for A=m, B=q, predict C=t.
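A minimal sketch that reproduces these numbers (the record list and function name are illustrative; each tuple is in attribute order A, B, C):

```python
records = [("m", "b", "t"), ("m", "s", "t"), ("g", "q", "t"), ("h", "s", "t"), ("g", "q", "t"),
           ("g", "q", "f"), ("g", "s", "f"), ("h", "b", "f"), ("h", "q", "f"), ("m", "b", "f")]

def naive_bayes_scores(a, b, records):
    """Unnormalised posterior P(A=a|C) * P(B=b|C) * P(C) for each class label C."""
    scores = {}
    for c in sorted({r[2] for r in records}):
        subset = [r for r in records if r[2] == c]
        p_a = sum(r[0] == a for r in subset) / len(subset)
        p_b = sum(r[1] == b for r in subset) / len(subset)
        scores[c] = p_a * p_b * len(subset) / len(records)
    return scores

scores = naive_bayes_scores("m", "q", records)
print(scores, "->", max(scores, key=scores.get))  # about {'f': 0.04, 't': 0.08} -> t
```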
Nearest Neighbor Classification
Input: a set of stored records; k: the number of nearest neighbors.
Output: compute the distance from the unknown record to every stored record, identify the k nearest neighbors, and determine the class label of the unknown record from the class labels of its nearest neighbors (i.e., by taking a majority vote).
Euclidean distance:
$$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$$
A Discrete Example
Input: given 8 training instances:
P1 (4, 2) Orange
P2 (0.5, 2.5) Orange
P3 (2.5, 2.5) Orange
P4 (3, 3.5) Orange
P5 (5.5, 3.5) Orange
P6 (2, 4) Black
P7 (4, 5) Black
P8 (2.5, 5.5) Black
k = 1 & k = 3
New instance: Pn (4, 4) ?
Calculate the distances, e.g. $d(P_1, P_n) = \sqrt{(4-4)^2 + (2-4)^2} = 2$:
d(P1, Pn) = 2, d(P2, Pn) = 3.80, d(P3, Pn) = 2.12, d(P4, Pn) = 1.12, d(P5, Pn) = 1.58, d(P6, Pn) = 2, d(P7, Pn) = 1, d(P8, Pn) = 2.12
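A minimal k-NN sketch over these eight points (list order P1 to P8; knn_predict is an illustrative name), using the distances above to take the vote for k = 1 and k = 3:

```python
from math import dist          # Euclidean distance (Python 3.8+)
from collections import Counter

training = [((4, 2), "Orange"), ((0.5, 2.5), "Orange"), ((2.5, 2.5), "Orange"),
            ((3, 3.5), "Orange"), ((5.5, 3.5), "Orange"), ((2, 4), "Black"),
            ((4, 5), "Black"), ((2.5, 5.5), "Black")]   # P1 .. P8

def knn_predict(query, training, k):
    """Majority vote among the k stored points closest to the query."""
    neighbours = sorted(training, key=lambda point_label: dist(point_label[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

pn = (4, 4)
print(knn_predict(pn, training, k=1))  # Black  (P7 is the single nearest neighbour)
print(knn_predict(pn, training, k=3))  # Orange (P7, P4, P5 -> Black, Orange, Orange)
```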
[Figure: the eight training points and Pn plotted in the plane, showing the neighborhood used for k = 1 and for k = 3.]
Nearest Neighbor Classification…
Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
• Each attribute should fall within the same range
• Min-Max normalization
Example: two data records a = (1, 1000), b = (0.5, 1); dis(a, b) = ? Without scaling, the second attribute dominates the distance; see the sketch below.
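A minimal sketch of min-max normalization applied to the two records above (min_max_scale is an illustrative name; each attribute is rescaled to [0, 1] using its observed min and max, assuming max > min for every attribute):

```python
from math import dist

def min_max_scale(records):
    """Rescale each attribute to [0, 1] via (x - min) / (max - min)."""
    columns = list(zip(*records))
    lows  = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [tuple((x - lo) / (hi - lo) for x, lo, hi in zip(r, lows, highs))
            for r in records]

a, b = (1, 1000), (0.5, 1)
print(round(dist(a, b), 1))    # about 999.0: the distance is dominated by the second attribute
print(min_max_scale([a, b]))   # [(1.0, 1.0), (0.0, 0.0)]: both attributes now contribute equally
```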
Classification: Lazy & Eager Learning
Two types of learning methodologies:
Lazy learning
• Instance-based learning (k-NN)
Eager learning
• Decision tree and Bayesian classification
• ANN & SVM
Differences Between Lazy & Eager Learning
Lazy learning
a. Does not require model building
b. Less time training but more time predicting
c. A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
Eager learning
a. Requires model building
b. More time training but less time predicting
c. Must commit to a single hypothesis that covers the entire instance space
Thank you & Questions?