COMP307 Week 1 - Victoria University of Wellington
COMP307 Week 3 (Tutorial)
1. Announcements
• Assignment 1 (15%)
• Helpdesk sessions
2. Sets
• Training and Test sets
• Validation set
3. Datasets
• Instances
• Features and feature vectors
• Class label
4. 3-K Techniques
• k-Nearest Neighbour
• k-fold Cross Validation
• k-Means Clustering
5. Decision Trees
• DT learning vs learned DT
• Impurity measure conditions
• Pruning
6. Other Questions
Over-fitting
Validation Set
• What?
• Why?
• How?
Datasets and Instances
Example instance (feature vector with class label):
12.7  2.6  55.3  . . .  15.0  A

f1    f2    f3    f4    f5    . . .  class
10    2.2   45    3.7   22.1  . . .  A
3.7   7.9   12    2.1   17.5  . . .  A
22.8  27.9  11.4  36    77    . . .  B
90.4  6.34  2.77  15.8  53.7  . . .  A
74.6  4.78  84.9  15.9  103   . . .  B
2.89  14.7  3.11  10    52    . . .  B
K-Nearest Neighbour
Training Set
f1    f2    f3    class  Euclidean distance to the new instance
10.3  45.7  2.7   A      14.84
7.1   80.5  1.1   A      47.40
22.3  20.4  9.6   B      24.57
30.5  21.2  17.9  B      33.65
5.2   67.1  7.7   A      33.88
15.6  18.6  11.4  B      21.19
11.9  53.4  6.3   A      22.24

New (unlabelled) instance
f1    f2    f3    class
2.1   33.5  4.7   ?

After classification
2.1   33.5  4.7   A
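
The k-NN example above can be reproduced with a short Python sketch. It computes the Euclidean distance from the new instance to every training instance and takes a majority vote among the closest ones. The slide does not state which k was used, so k=3 here is an assumption, and the function names are illustrative only.

```python
import math
from collections import Counter

# Training set copied from the slide: (f1, f2, f3) -> class label.
TRAINING_SET = [
    ((10.3, 45.7, 2.7), "A"),
    ((7.1, 80.5, 1.1), "A"),
    ((22.3, 20.4, 9.6), "B"),
    ((30.5, 21.2, 17.9), "B"),
    ((5.2, 67.1, 7.7), "A"),
    ((15.6, 18.6, 11.4), "B"),
    ((11.9, 53.4, 6.3), "A"),
]


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, training_set, k=3):
    """Return the majority class among the k nearest training instances."""
    # Sort training instances by distance to the query and keep the k closest.
    neighbours = sorted(training_set, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


query = (2.1, 33.5, 4.7)
# Distances in the same order as the training set rows.
distances = [round(euclidean(query, features), 2) for features, _ in TRAINING_SET]
predicted = knn_classify(query, TRAINING_SET, k=3)  # k=3 is an assumption
print(distances, predicted)
```

With an odd k a two-class vote cannot tie, which is one common reason for choosing k=3 or k=5 over an even value.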
k-fold Cross Validation
Results per fold (10 folds): 75 %, 92 %, 43 %, 76 %, 30 %, 82 %, 83 %, 47 %, 50 %, 73 %
Average: 65.10 %
Dataset split: Training 50 %, Test 50 %
• How do we specify the number of instances in each fold?
• How do we select those instances?
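
One common way to approach the two questions above is to shuffle the instance indices and cut them into k folds whose sizes differ by at most one. The sketch below does that and also averages the ten per-fold percentages from the slide. The helper names, the seed, and the dataset size of 23 instances are assumptions made for illustration; the slide does not prescribe a particular method.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k folds of near-equal size."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # random selection of instances
    base, extra = divmod(n_instances, k)  # the first `extra` folds get one more
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Per-fold results from the slide, in percent.
fold_results = [75, 92, 43, 76, 30, 82, 83, 47, 50, 73]
average = sum(fold_results) / len(fold_results)

folds = k_fold_indices(n_instances=23, k=10)  # 23 is a hypothetical dataset size
print(f"average = {average:.2f} %", [len(f) for f in folds])
```

A plain shuffle ignores class proportions; a stratified split that keeps each class's share roughly constant across folds is a frequent alternative.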
k-Means Clustering
Decision Trees (DT)
• DT learning ≠ learned DT
• Impurity measure
1. 0, if all instances belong to one class
2. Max, if there is an equal number of instances of both classes
3. Continuous (smooth)
• Large vs. small trees?
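
The three impurity conditions listed above can be checked against a concrete measure. The sketch below uses the product of the two class proportions, P(A) × P(B), which is one measure that satisfies them for two classes. The slide does not name a specific measure, so this choice is an assumption.

```python
def impurity(count_a, count_b):
    """Two-class impurity as the product of the class proportions.

    Assumption: P(A) * P(B) is used here as one example of a measure that is
    0 for a pure node, maximal for an even split, and smooth in between.
    """
    total = count_a + count_b
    if total == 0:
        return 0.0  # an empty node is treated as pure
    return (count_a / total) * (count_b / total)


# Condition 1: zero when all instances belong to one class.
pure = impurity(9, 0)
# Condition 2: maximal when both classes are equally represented.
even = impurity(5, 5)
# Condition 3: values change gradually as the class mix changes.
profile = [impurity(a, 10 - a) for a in range(11)]
print(pure, even, profile)
```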
Example (Training) Dataset
• Approve/Reject a loan application?
Applicant Job Deposit Family Class
1 true low single Approve
2 true low couple Approve
3 true low single Approve
4 true high single Approve
5 false high couple Approve
6 false low couple Reject
7 true low children Reject
8 false low single Reject
9 false high children Reject
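
To see how a DT learner might choose the first split on the loan dataset above, the sketch below computes a weighted average impurity for each feature and picks the lowest. It assumes the two-class impurity P(Approve) × P(Reject) and weights each branch by its share of the instances; the slide does not specify either choice, and the function names are illustrative.

```python
from collections import defaultdict

FEATURES = ("Job", "Deposit", "Family")

# Rows copied from the slide: (Job, Deposit, Family, Class).
ROWS = [
    ("true", "low", "single", "Approve"),
    ("true", "low", "couple", "Approve"),
    ("true", "low", "single", "Approve"),
    ("true", "high", "single", "Approve"),
    ("false", "high", "couple", "Approve"),
    ("false", "low", "couple", "Reject"),
    ("true", "low", "children", "Reject"),
    ("false", "low", "single", "Reject"),
    ("false", "high", "children", "Reject"),
]


def impurity(labels):
    """P(Approve) * P(Reject) for a list of class labels (assumed measure)."""
    n = len(labels)
    approve = sum(1 for label in labels if label == "Approve")
    return (approve / n) * ((n - approve) / n)


def weighted_impurity(rows, feature_index):
    """Average impurity of the branches, weighted by branch size."""
    branches = defaultdict(list)
    for row in rows:
        branches[row[feature_index]].append(row[-1])
    return sum(len(b) / len(rows) * impurity(b) for b in branches.values())


scores = {name: weighted_impurity(ROWS, i) for i, name in enumerate(FEATURES)}
best_feature = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("lowest weighted impurity:", best_feature)
```

Under these assumptions Family gives the lowest weighted impurity, largely because its "children" branch contains only Reject instances.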
Numeric Features
• Can split on a simple comparison
– Which split point?
– Consider class boundaries
Example split node: Deposit < $10K (branches: True, False)
Applicant Job Deposit Family Class
8 false $3K single Reject
3 true $4K single Approve
6 false $6K couple Reject
2 true $7K couple Approve
7 true $8K children Reject
1 true $10K single Approve
4 true $16K single Approve
5 false $18K couple Approve
9 false $30K children Reject
Candidate split points at class boundaries: <$4K, <$6K, <$7K, <$8K, <$10K, <$30K
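
The candidate split points above can be found mechanically: sort the instances by the numeric feature and keep every value at which the class label differs from the previous instance. A minimal sketch follows, assuming deposits are given in thousands of dollars and that no two instances share the same deposit value; the function name is illustrative.

```python
# (Deposit in $K, Class) pairs copied from the slide.
DEPOSITS = [
    (3, "Reject"),
    (4, "Approve"),
    (6, "Reject"),
    (7, "Approve"),
    (8, "Reject"),
    (10, "Approve"),
    (16, "Approve"),
    (18, "Approve"),
    (30, "Reject"),
]


def candidate_thresholds(pairs):
    """Return thresholds t such that the test 'value < t' sits on a class boundary."""
    ordered = sorted(pairs, key=lambda pair: pair[0])
    thresholds = []
    for (_, prev_class), (value, cur_class) in zip(ordered, ordered[1:]):
        if prev_class != cur_class:
            # The class changes here, so 'value' separates the two neighbours.
            thresholds.append(value)
    return thresholds


thresholds = candidate_thresholds(DEPOSITS)
print([f"<${t}K" for t in thresholds])
```

Only six of the eight gaps between consecutive values need to be evaluated, because splitting between $10K, $16K, and $18K would separate instances of the same class.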
Other Questions
• Can k-NN cope with categorical data/features?
• Can k-Means lead to the same clusters?
• How do we select/initialise the seeds in k-Means?
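
For the two k-Means questions above, a small experiment can help. The sketch below is a minimal k-Means that takes its initial seeds as an explicit argument, so different initialisations can be compared. The toy points and both seed choices are made up for illustration. On this well-separated data the two initialisations converge to the same partition, but in general k-Means can settle in different local optima depending on the seeds.

```python
import math


def k_means(points, seeds, max_iterations=100):
    """Minimal k-Means: returns (cluster index per point, final centroids)."""
    centroids = [tuple(seed) for seed in seeds]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


# Hypothetical toy data: two well-separated groups of three points.
POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]

# Seeds taken from different groups versus seeds taken from the same group.
spread_assignment, _ = k_means(POINTS, seeds=[POINTS[0], POINTS[3]])
close_assignment, _ = k_means(POINTS, seeds=[POINTS[0], POINTS[1]])
print(spread_assignment, close_assignment)
```

Picking seeds at random from the data and repeating the run several times is a common practical way to reduce the dependence on any single initialisation.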