Fall 2004 1
Supervised Learning
Fall 2004 2
Introduction
Key idea
- Known target concept (predict a certain attribute)
- Find out how the other attributes can be used to predict it
Algorithms
- Rudimentary rules (e.g., 1R)
- Statistical modeling (e.g., Naïve Bayes)
- Divide and conquer: decision trees
- Instance-based learning
- Neural networks
- Support vector machines
Fall 2004 3
1-Rule (1R)
Generate a one-level decision tree over a single attribute (it performs quite well!)
Basic idea:
- Build rules that test a single attribute
- Classify according to the most frequent class in the training data
- Evaluate the error rate of each attribute's rule set
- Choose the best attribute
That's all folks!
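As a concrete illustration, a minimal 1R learner might look like the sketch below (Python used for brevity; the function and variable names are mine, not from the course).

```python
# Minimal 1R sketch: for each attribute, build one rule per attribute value
# predicting the majority class, then keep the attribute whose rule set makes
# the fewest errors on the training data.
from collections import Counter

def one_r(instances, attributes, target):
    best = None
    for attr in attributes:
        counts = {}  # attribute value -> Counter of classes
        for row in instances:
            counts.setdefault(row[attr], Counter())[row[target]] += 1
        # Majority class per value; errors = instances not in the majority
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best  # (attribute, {value: predicted class}, training errors)

# On the weather data, one_r(rows, ["outlook", "temperature", "humidity",
# "windy"], "play") should pick "outlook" with 4/14 errors.
```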
Fall 2004 4
The Weather Data (again)

Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      FALSE  No
Sunny     Hot    High      TRUE   No
Overcast  Hot    High      FALSE  Yes
Rainy     Mild   High      FALSE  Yes
Rainy     Cool   Normal    FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Mild   High      FALSE  No
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   Normal    FALSE  Yes
Sunny     Mild   Normal    TRUE   Yes
Overcast  Mild   High      TRUE   Yes
Overcast  Hot    Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No
Fall 2004 5
Apply 1R

  Attribute      Rules                Errors   Total
1 outlook        sunny -> no          2/5      4/14
                 overcast -> yes      0/4
                 rainy -> yes         2/5
2 temperature    hot -> no            2/4      5/14
                 mild -> yes          2/6
                 cool -> yes          1/4
3 humidity       high -> no           3/7      4/14
                 normal -> yes        1/7
4 windy          false -> yes         2/8      5/14
                 true -> no           3/6
Fall 2004 6
Other Features
Numeric values
- Discretization: sort the training data, then split the range into categories
Missing values
- Treat "missing" as a separate ("dummy") attribute value

Example (temperature values sorted, with the play class underneath):
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y  N  Y  Y  Y  N  N  Y  Y  Y  N  Y  Y  N
Fall 2004 7
Naïve Bayes Classifier
Allow all attributes to contribute equally
Assumes
- All attributes equally important
- All attributes independent
Realistic?
Contrast with selection of attributes (as in 1R)
Fall 2004 8
Bayes Theorem

P[H | E] = P[E | H] P[H] / P[E]

H: hypothesis
E: evidence
P[H | E]: posterior probability (conditional probability of H given E)
P[H]: prior probability
Fall 2004 9
Maximum a Posteriori (MAP)

H_MAP = argmax_H P[H | E]
      = argmax_H P[E | H] P[H] / P[E]
      = argmax_H P[E | H] P[H]

Maximum Likelihood (ML)

H_ML = argmax_H P[E | H]
Fall 2004 10
Classification
Want to classify a new instance (a1, a2, ..., an) into a finite number of categories from the set V.
Bayesian approach: assign the most probable category v_MAP given (a1, a2, ..., an):

v_MAP = argmax_{v in V} P[v | a1, a2, ..., an]
      = argmax_{v in V} P[a1, a2, ..., an | v] P[v] / P[a1, a2, ..., an]
      = argmax_{v in V} P[a1, a2, ..., an | v] P[v]

Can we estimate the probabilities from the training data?
Fall 2004 11
Naïve Bayes Classifier
The second probability, P[v], is easy to estimate. How?
The first probability, P[a1, a2, ..., an | v], is difficult to estimate. Why?
Assume independence (this is the naïve bit):

v_MAP = argmax_{v in V} P[v] prod_i P[a_i | v]
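A minimal sketch of this classifier for nominal attributes follows (names are mine, not from the slides); it ignores the zero-frequency issue fixed later by the Laplace estimator.

```python
# Naive Bayes sketch: estimate P[v] and P[a_i | v] by counting, then classify
# a new instance with argmax_v P[v] * prod_i P[a_i | v].
from collections import Counter, defaultdict

def train_nb(instances, attributes, target):
    class_counts = Counter(row[target] for row in instances)
    cond_counts = defaultdict(Counter)  # (attribute, class) -> Counter of values
    for row in instances:
        for attr in attributes:
            cond_counts[(attr, row[target])][row[attr]] += 1
    return class_counts, cond_counts

def classify_nb(x, attributes, class_counts, cond_counts):
    n = sum(class_counts.values())
    best_v, best_score = None, -1.0
    for v, nv in class_counts.items():
        score = nv / n  # P[v]
        for attr in attributes:
            score *= cond_counts[(attr, v)][x[attr]] / nv  # P[a_i | v]
        if score > best_score:
            best_v, best_score = v, score
    return best_v
```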
Fall 2004 12
The Weather Data (yet again)

          Outlook           Temperature         Humidity            Windy            Play
          Yes  No           Yes  No             Yes  No             Yes  No          Yes  No
Sunny      2    3   Hot      2    2    High      3    4    FALSE     6    2           9    5
Overcast   4    0   Mild     4    2    Normal    6    1    TRUE      3    3
Rainy      3    2   Cool     3    1

P̂[Play = yes] = 9/14
P̂[Outlook = sunny | Play = yes] = 2/9
P̂[Temperature = cool | Play = yes] = 3/9
P̂[Humidity = high | Play = yes] = 3/9
P̂[Windy = true | Play = yes] = 3/9
Fall 2004 13
Estimation
Given a new instance with
  outlook = sunny, temperature = cool, humidity = high, windy = true

P[Play = yes] * prod_i P[a_i | Play = yes]
  = 9/14 * 2/9 * 3/9 * 3/9 * 3/9 = 0.0053
Fall 2004 14
Calculations continued ...
Similarly

P[Play = no] * prod_i P[a_i | Play = no]
  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5 = 0.0206

Thus

v_MAP = argmax_{v in {Play = yes, Play = no}} P[v] prod_i P[a_i | v] = {Play = no}
Fall 2004 15
Normalization
Note that we can normalize to get the probabilities:

P[v | a1, a2, ..., an] = P[a1, a2, ..., an | v] P[v] / P[a1, a2, ..., an]

v = Play = yes:  0.0053 / (0.0053 + 0.0206) = 0.205
v = Play = no:   0.0206 / (0.0053 + 0.0206) = 0.795
Fall 2004 16
Problems ...
Suppose we had the following training data:

          Outlook           Temperature         Humidity            Windy            Play
          Yes  No           Yes  No             Yes  No             Yes  No          Yes  No
Sunny      0    5   Hot      2    2    High      3    4    FALSE     6    2           9    5
Overcast   4    0   Mild     4    2    Normal    6    1    TRUE      3    3
Rainy      3    2   Cool     3    1

P̂[Play = yes] = 9/14
P̂[Outlook = sunny | Play = yes] = 0/9
P̂[Temperature = cool | Play = yes] = 3/9
P̂[Humidity = high | Play = yes] = 3/9
P̂[Windy = true | Play = yes] = 3/9

Now what?
Fall 2004 17
Laplace Estimator
Replace the estimates

P̂[Outlook = sunny | Play = yes]    = 2/9
P̂[Outlook = overcast | Play = yes] = 4/9
P̂[Outlook = rainy | Play = yes]    = 3/9

with

P̂[Outlook = sunny | Play = yes]    = (2 + μ p1) / (9 + μ)
P̂[Outlook = overcast | Play = yes] = (4 + μ p2) / (9 + μ)
P̂[Outlook = rainy | Play = yes]    = (3 + μ p3) / (9 + μ)

where p1 + p2 + p3 = 1 and μ weights the prior (the classic Laplace estimator uses μ = 3 and p_i = 1/3, i.e., add one to every count).
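A hedged sketch of this smoothed estimate (the function name is mine):

```python
# Laplace/m-estimate correction: with mu = 3 and equal priors p_i = 1/3,
# every count effectively gets +1.
def laplace_estimate(count, class_total, prior=1.0 / 3.0, mu=3.0):
    """Estimate P[attribute value | class] with a Laplace-style correction."""
    return (count + mu * prior) / (class_total + mu)

# The zero count from the "Problems" slide no longer yields zero:
# laplace_estimate(0, 9) == 1/12 instead of 0/9.
```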
Fall 2004 18
Numeric Values
Assume a probability distribution for the numeric attributes
- density f(x), typically normal:

  f(x) = 1 / (sqrt(2 pi) sigma) * exp( -(x - mu)^2 / (2 sigma^2) )

- fit a distribution (better)
Similarly as before:

v_MAP = argmax_{v in V} P[v] prod_i f(a_i | v)
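A minimal sketch of the density plug-in (my own naming; the example numbers are illustrative and would come from fitting the training data):

```python
# Normal density used for numeric attributes in Naive Bayes: fit mu and sigma
# per class, then use f(a_i | v) in the product in place of P[a_i | v].
import math

def normal_density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# e.g. with temperature | play=yes fitted as mu=73, sigma=6.2,
# normal_density(66, 73, 6.2) is roughly 0.034.
```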
Fall 2004 19
Discussion
- Simple methodology
- Powerful: good results in practice
- Missing values are no problem
- Not so good if the independence assumption is severely violated
  - Extreme case: multiple attributes with the same values
- Solutions:
  - Preselect which attributes to use
  - Non-naïve Bayesian methods: Bayesian networks
Fall 2004 20
Decision Tree Learning
Basic algorithm:
- Select an attribute to be tested
- If classification achieved, return the classification
- Otherwise, branch by setting the attribute to each of its possible values
- Repeat with each branch as your new tree
Main issue: how to select attributes
Fall 2004 21
Deciding on Branching
What do we want to accomplish?
- Make good predictions
- Obtain rules that are simple to interpret
Diversity (impurity):
- No diversity is best: all instances in the same class
- Maximum diversity is worst: all classes equally likely
Goal: select attributes that reduce impurity
Fall 2004 22
Measuring Impurity/Diversity
Let's say we only have two classes, with proportions p1 and p2:

Minimum:                                min(p1, p2)
Gini index / Simpson diversity index:   2 p1 p2 = 2 p1 (1 - p1)
Entropy:                                -p1 log2(p1) - p2 log2(p2)
Fall 2004 23
Impurity Functions
[Plot of the three impurity functions (entropy, Gini index, minimum) against the class proportion p1 from 0 to 1]
Fall 2004 24
Entropy

Entropy(S) = -sum_{i=1}^{c} p_i log2(p_i)

where S is the training data (instances), c is the number of classes, and p_i is the proportion of S classified as i.

Entropy is a measure of impurity in the training data S
- Measured in bits of information needed to encode the classification of a member of S
Extreme cases
- All members have the same classification: entropy 0 (note: 0 * log 0 = 0)
- All classifications equally frequent: maximum entropy
Fall 2004 25
Expected Information Gain

Gain(S, a) = Entropy(S) - sum_{v in Values(a)} (|S_v| / |S|) Entropy(S_v)

where Values(a) is the set of all possible values for attribute a, and S_v = { s in S : a(s) = v }.

Gain(S, a) is the expected information provided about the classification from knowing the value of attribute a (the reduction in the number of bits needed).
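Illustrative helpers computing Entropy(S) and Gain(S, a) as defined above (the names and data layout are assumptions; instances are dicts of nominal attribute values):

```python
import math
from collections import Counter

def entropy(instances, target):
    counts = Counter(row[target] for row in instances)
    n = len(instances)
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

def info_gain(instances, attr, target):
    n = len(instances)
    gain = entropy(instances, target)
    for value in {row[attr] for row in instances}:
        subset = [row for row in instances if row[attr] == value]
        gain -= (len(subset) / n) * entropy(subset, target)
    return gain

# On the weather data, info_gain(rows, "outlook", "play") comes out near
# 0.25 (0.94 - 0.69), the largest of the four attributes.
```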
Fall 2004 26
The Weather Data (yet again)

Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      FALSE  No
Sunny     Hot    High      TRUE   No
Overcast  Hot    High      FALSE  Yes
Rainy     Mild   High      FALSE  Yes
Rainy     Cool   Normal    FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Mild   High      FALSE  No
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   Normal    FALSE  Yes
Sunny     Mild   Normal    TRUE   Yes
Overcast  Mild   High      TRUE   Yes
Overcast  Hot    Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No
Fall 2004 27
Decision Tree: Root Node

Outlook
- Sunny:    Yes Yes No No No
- Overcast: Yes Yes Yes Yes
- Rainy:    Yes Yes Yes No No
Fall 2004 28
Calculating the Entropy

Entropy(S) = -sum_{i=1}^{2} p_i log2(p_i)
           = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94

Entropy(S_sunny)    = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97
Entropy(S_overcast) = -(4/4) log2(4/4) - (0/4) log2(0/4) = 0.00
Entropy(S_rainy)    = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.97
Fall 2004 29
Calculating the Gain

Gain(S, a) = Entropy(S) - sum_{v in Values(a)} (|S_v| / |S|) Entropy(S_v)

Gain(S, outlook)  = 0.94 - [ (5/14) 0.97 + (4/14) 0.00 + (5/14) 0.97 ]
                  = 0.94 - 0.69 = 0.25    <- Select!
Gain(S, temp)     = 0.03
Gain(S, humidity) = 0.15
Gain(S, windy)    = 0.05
Fall 2004 30
Next Level

Outlook = Sunny branch, testing Temperature:
- Hot:  No No
- Mild: Yes No
- Cool: Yes
(Overcast and Rainy branches as before)
Fall 2004 31
Calculating the Entropy

Entropy(S)      = 0.97   (the Sunny subset)
Entropy(S_hot)  = -(0/2) log2(0/2) - (2/2) log2(2/2) = 0
Entropy(S_mild) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
Entropy(S_cool) = -(1/1) log2(1/1) - (0/1) log2(0/1) = 0
Fall 2004 32
Calculating the Gain

Gain(S, a) = Entropy(S) - sum_{v in Values(a)} (|S_v| / |S|) Entropy(S_v)

Gain(S, temp)     = 0.97 - [ (2/5) 1 + (2/5) 0 + (1/5) 0 ]
                  = 0.97 - 0.40 = 0.57
Gain(S, humidity) = 0.97    <- Select!
Gain(S, windy)    = 0.02
Fall 2004 33
Final Tree

Outlook
- Sunny -> Humidity
    - High   -> No
    - Normal -> Yes
- Overcast -> Yes
- Rainy -> Windy
    - True  -> No
    - False -> Yes
Fall 2004 34
What's in a Tree?
Our final decision tree correctly classifies every training instance. Is this good?
Two important concepts:
- Overfitting
- Pruning
Fall 2004 35
Overfitting
Two sources of abnormalities
- Noise (randomness)
- Outliers (measurement errors)
Chasing every abnormality causes overfitting
- The tree becomes too large and complex
- It does not generalize to new data
Solution: prune the tree
Fall 2004 36
Pruning
Prepruning
- Halt construction of the decision tree early
- Use the same measure as for selecting attributes, e.g., halt if InfoGain < K
- The most frequent class becomes the leaf node
Postpruning
- Construct the complete decision tree, then prune it back
- Prune to minimize expected error rates
- Prune to minimize bits of encoding (Minimum Description Length principle)
Fall 2004 37
Scalability
Need to design for large amounts of data
Two things to worry about:
- Large number of attributes
  - Leads to a large tree (prepruning?)
  - Takes a long time
- Large amounts of data
  - Can the data be kept in memory?
  - Some new algorithms do not require all the data to be memory resident
Fall 2004 38
Discussion: Decision Trees
- Among the most popular methods
- Quite effective
- Relatively simple
- We have discussed the ID3 algorithm in detail:
  - Information gain to select attributes
  - No pruning
  - Only handles nominal attributes
Fall 2004 39
Selecting Split Attributes
Other univariate splits
- Gain ratio: C4.5 algorithm (J48 in Weka)
- CART (not in Weka)
Multivariate splits
- May be possible to obtain better splits by considering two or more attributes simultaneously
Fall 2004 40
Instance-Based Learning
Classification without constructing an explicit description of how to classify
- Store all training data (this is the "learning")
- New example: find the most similar stored instance
- Computing is done at classification time
- k-nearest neighbor
Fall 2004 41
K-Nearest Neighbor
Each instance lives in n-dimensional space, described by its attribute values

( a1(x), a2(x), ..., an(x) )

Distance between instances x1 and x2:

d(x1, x2) = sqrt( sum_{r=1}^{n} ( a_r(x1) - a_r(x2) )^2 )
Fall 2004 42
Example: nearest neighbor

[Scatter plot of positive (+) and negative (-) instances with a query point xq]
1-Nearest neighbor?
6-Nearest neighbor?
Fall 2004 43
Normalizing
Some attributes may take large values and others small
Normalize to put all attributes on an equal footing:

a_i = (v_i - min v_i) / (max v_i - min v_i)
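An illustrative k-nearest-neighbor sketch combining this normalization with the Euclidean distance above (names and data layout are mine):

```python
import math
from collections import Counter

def min_max_normalize(rows):
    """rows: list of (features, label) with numeric feature lists."""
    n = len(rows[0][0])
    lo = [min(r[0][i] for r in rows) for i in range(n)]
    hi = [max(r[0][i] for r in rows) for i in range(n)]
    scale = lambda x: [(x[i] - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0
                       for i in range(n)]
    return [(scale(f), y) for f, y in rows], scale

def knn_classify(query, rows, k=1):
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(rows, key=lambda r: dist(query, r[0]))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]  # majority vote

# Usage: normalized, scale = min_max_normalize(training_rows)
#        label = knn_classify(scale(new_features), normalized, k=6)
```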
Fall 2004 44
Other Methods for Supervised Learning
Neural networks
Support vector machines
Optimization
Rough set approach
Fuzzy set approach
Fall 2004 45
Evaluating the Learning
Measure of performance
- Classification: error rate
Resubstitution error
- Performance on the training set
- Poor predictor of future performance (overfitting)
- Useless for evaluation
Fall 2004 46
Test Set
Need a set of test instances
- Independent of the training set instances
- Representative of the underlying structure
Sometimes: validation data
- Used to fine-tune parameters
- Independent of training and test data
Plentiful data: no problem!
Fall 2004 47
Holdout Procedures
Common case: data set large but limited
Usual procedure:
- Reserve some data for testing
- Use the remaining data for training
Problems:
- Want both sets as large as possible
- Want both sets to be representative
Fall 2004 48
"Smart" Holdout
Simple check: Are the proportions of classes about the same in each data set?
Stratified holdout Guarantee that classes are (approximately)
proportionally represented
Repeated holdout Randomly select holdout set several times and
average the error rate estimates
Fall 2004 49
Holdout w/ Cross-Validation
Cross-validation
- Fixed number of partitions of the data (folds)
- In turn: each partition is used for testing and the remaining instances for training
- May use stratification and randomization
Standard practice:
- Stratified tenfold cross-validation
- Instances divided randomly into the ten partitions
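A hedged sketch of stratified k-fold cross-validation (function names and the train/predict interface are assumptions):

```python
import random
from collections import defaultdict

def stratified_folds(rows, label_of, k=10, seed=0):
    # Split each class round-robin into k folds so class proportions are kept.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for i, row in enumerate(members):
            folds[i % k].append(row)
    return folds

def cross_validate(rows, label_of, train_fn, predict_fn, k=10):
    # Train on k-1 folds, test on the held-out fold, average the error rates.
    folds = stratified_folds(rows, label_of, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        model = train_fn(train)
        wrong = sum(predict_fn(model, r) != label_of(r) for r in test)
        errors.append(wrong / len(test))
    return sum(errors) / k, errors
```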
Fall 2004 50
Cross Validation

Fold 1: train a model on 90% of the data, test on the remaining 10% -> error rate e1
Fold 2: train a model on 90% of the data, test on the remaining 10% -> error rate e2
... and so on for each fold
Fall 2004 51
Cross-Validation
Final estimate of the error:

ê = (1/k) sum_{i=1}^{k} e_i

Quality of the estimate:

s^2 = (1 / (k(k-1))) sum_{i=1}^{k} (e_i - ê)^2,   interval ê ± t_{1-α, k-1} · s
Fall 2004 52
Leave-One-Out Holdout
n-fold cross-validation, where n is the number of instances
- Use all but one instance for training
- Maximum use of the data
- Deterministic
- High computational cost
- Non-stratified sample
Fall 2004 53
Bootstrap
- Sample with replacement n times
- Use the sample as training data
- Use the instances not in the training data for testing
How many test instances are there?

lim_{n -> infinity} (1 - 1/n)^n = e^{-1}
Fall 2004 54
0.632 Bootstrap
On average, e^{-1} · n ≈ 0.368 n instances will be in the test set
Thus, on average, 63.2% of the instances are in the training set
Estimate the error rate as

e = 0.632 · e_test + 0.368 · e_train
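A hedged sketch of this estimate (names and the train/error interface are mine):

```python
import random

def bootstrap_632(rows, train_fn, error_fn, repeats=10, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(repeats):
        # Sample n instances with replacement for training
        train = [rows[rng.randrange(n)] for _ in range(n)]
        chosen = {id(r) for r in train}
        test = [r for r in rows if id(r) not in chosen]  # out-of-bag instances
        if not test:
            continue
        model = train_fn(train)
        # Blend test and training error with the 0.632 / 0.368 weights
        estimates.append(0.632 * error_fn(model, test) + 0.368 * error_fn(model, train))
    return sum(estimates) / len(estimates)
```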
Fall 2004 55
Accuracy of our Estimate?
Suppose we observe s successes in a testing set of n_test instances ...
We then estimate the success rate R_success = s / n_test.
Each instance is either a success or a failure (a Bernoulli trial with success probability p)
- Mean: p
- Variance: p(1 - p)
Fall 2004 56
Properties of Estimate
We have
E[R_success] = p
Var[R_success] = p(1 - p) / n_test
If n_test is large enough, the Central Limit Theorem (CLT) states that, approximately,
R_success ~ Normal( p, p(1 - p) / n_test )
Fall 2004 57
Confidence Interval
CI for the normal distribution:

P[ -z <= (R_success - p) / sqrt( p(1 - p) / n_test ) <= z ] = c

where c is the confidence level and z is looked up in a table.

Solving for p gives the CI for p:

p = ( R_success + z^2 / (2 n_test)
      ± z sqrt( R_success / n_test - R_success^2 / n_test + z^2 / (4 n_test^2) ) )
    / ( 1 + z^2 / n_test )
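An illustrative check of the interval above (the function name is mine; z = 1.96 corresponds to c = 95%):

```python
import math

def wilson_interval(r, n_test, z=1.96):
    # Lower/upper bounds for p given the observed success rate r on n_test instances
    center = r + z * z / (2 * n_test)
    spread = z * math.sqrt(r / n_test - r * r / n_test + z * z / (4 * n_test * n_test))
    denom = 1 + z * z / n_test
    return (center - spread) / denom, (center + spread) / denom

# e.g. 75 successes out of 100: wilson_interval(0.75, 100) is roughly (0.66, 0.82)
```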
Fall 2004 58
Comparing Algorithms
We know how to evaluate the results of our data mining algorithms (classification)
How should we compare different algorithms?
- Evaluate each algorithm
- Rank them
- Select the best one
We don't know if this ranking is reliable
Fall 2004 59
Assessing Other Learning
We developed these procedures for classification
Association rules
- Evaluated based on accuracy
- Same methods as for classification
Numerical prediction
- Error rate no longer applies
- Same principles apply:
  - use an independent test set and hold-out procedures
  - cross-validation or bootstrap
Fall 2004 60
Measures of Effectiveness
Need to compare:
- Predicted values p1, p2, ..., pn
- Actual values a1, a2, ..., an
Most common measure: mean-squared error

(1/n) sum_{i=1}^{n} (p_i - a_i)^2
Fall 2004 61
Other Measures

Mean absolute error:      (1/n) sum_{i=1}^{n} |p_i - a_i|
Relative squared error:   sum_{i=1}^{n} (p_i - a_i)^2 / sum_{i=1}^{n} (a_i - ā)^2
Relative absolute error:  sum_{i=1}^{n} |p_i - a_i| / sum_{i=1}^{n} |a_i - ā|
Correlation

where ā is the mean of the actual values.
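Illustrative computations of these measures (names are mine; p and a are equal-length lists of predicted and actual values):

```python
import math

def mse(p, a):
    return sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / len(a)

def mae(p, a):
    return sum(abs(pi - ai) for pi, ai in zip(p, a)) / len(a)

def relative_squared_error(p, a):
    mean_a = sum(a) / len(a)
    return sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / sum((ai - mean_a) ** 2 for ai in a)

def relative_absolute_error(p, a):
    mean_a = sum(a) / len(a)
    return sum(abs(pi - ai) for pi, ai in zip(p, a)) / sum(abs(ai - mean_a) for ai in a)

def correlation(p, a):
    mp, ma = sum(p) / len(p), sum(a) / len(a)
    cov = sum((pi - mp) * (ai - ma) for pi, ai in zip(p, a))
    return cov / math.sqrt(sum((pi - mp) ** 2 for pi in p) * sum((ai - ma) ** 2 for ai in a))
```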
Fall 2004 62
What to Do?
"Large" amounts of data
- Hold out 1/3 of the data for testing
- Train a model on 2/3 of the data
- Estimate the error (or success) rate and calculate a CI
"Moderate" amounts of data
- Estimate the error rate: use 10-fold cross-validation with stratification, or use the bootstrap
- Train the model on the entire data set
Fall 2004 63
Predicting Probabilities
Classification into k classes
- Predict probabilities p1, p2, ..., pk for the classes
- Actual values a1, a2, ..., ak (1 for the correct class, 0 for the others)
No longer a 0-1 error. Quadratic loss function:

sum_{j=1}^{k} (p_j - a_j)^2 = sum_{j != i} p_j^2 + (1 - p_i)^2

where i is the correct class.
Fall 2004 64
Information Loss Function
Instead of the quadratic function, use

-log2(p_j)

where the j-th prediction is the correct one.
This is the information (in bits) required to communicate which class is correct, with respect to the predicted probability distribution.
Fall 2004 65
Occam's Razor
Given a choice of theories that are equally good, the simplest theory should be chosen
Physical sciences: any theory should be consistent with all empirical observations
Data mining:
- theory = predictive model
- good theory = good prediction
- What is good? Do we minimize the error rate?
Fall 2004 66
Minimum Description Length
MDL principle:
- Minimize the size of the theory plus the information needed to specify the exceptions
Suppose the training set E is mined, resulting in a theory T
We want to minimize

L[T] + L[E | T]
Fall 2004 67
Most Likely Theory
Suppose we want to maximize P[T | E]
Bayes' rule:

P[T | E] = P[E | T] P[T] / P[E]

Take logarithms:

log P[T | E] = log P[E | T] + log P[T] - log P[E]
Fall 2004 68
Information Function
Maximizing P[T | E] is equivalent to minimizing

-log P[T | E] = -log P[E | T] - log P[T] + log P[E]

where -log P[T] is the number of bits it takes to transmit the theory and -log P[E | T] is the number of bits it takes to transmit the exceptions.
That is, the MDL principle!
Fall 2004 69
Applications to Learning
Classification, association, numeric prediction
- Several predictive models with 'similar' error rates (usually as small as possible)
- Select between them using Occam's razor
- Simplicity is subjective, so use the MDL principle
Clustering
- Important type of learning that is difficult to evaluate
- Can use the MDL principle
Fall 2004 70
Comparing Mining Algorithms
We know how to evaluate the results
Suppose we have two algorithms
- Obtain two different models
- Estimate the error rates ê(1) and ê(2)
- Compare the estimates:  ê(1) < ê(2) ?
- Select the better one
Problem?
Fall 2004 71
Weather Data Example
Suppose we learn the rule
  If outlook = rainy then play = yes
  Otherwise play = no
Test it on the following test set:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Hot    High      FALSE  No
Sunny    Hot    High      TRUE   No
Rainy    Mild   High      FALSE  Yes
Rainy    Cool   Normal    FALSE  Yes
Sunny    Mild   High      FALSE  No

We have a zero error rate!
Fall 2004 72
Different Test Set 2
Again, suppose we learn the rule
  If outlook = rainy then play = yes
  Otherwise play = no
Test it on a different test set:

Outlook   Temp.  Humidity  Windy  Play
Overcast  Hot    High      FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No

We have a 100% error rate!
Fall 2004 73
Comparing Random Estimates
The estimated error rate is just an estimate (it is random)
We need the variance as well as point estimates
Construct a t-test statistic for H0: difference = 0

t = d̄ / sqrt( s^2 / k )

where d̄ is the average of the differences in error rates, s^2 is the estimated variance of the differences, and k is the number of estimates.
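A hedged sketch of this paired test (the function name is mine): given the per-fold error rates of two algorithms from the same cross-validation, compute t and compare it against a t-table with k-1 degrees of freedom.

```python
import math

def paired_t_statistic(errors_a, errors_b):
    diffs = [ea - eb for ea, eb in zip(errors_a, errors_b)]
    k = len(diffs)
    mean_d = sum(diffs) / k
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (k - 1)
    return mean_d / math.sqrt(var_d / k)

# e.g. with 10-fold CV error rates for two learners, |t| > 2.26 (the 5% value
# for 9 degrees of freedom) would suggest a significant difference.
```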
Fall 2004 74
Discussion
We now know how to compare two learning algorithms and select the one with the better error rate
We also know to select the simplest model that has a 'comparable' error rate
Is it really better?
- Minimizing the error rate can be misleading
Fall 2004 75
Examples of 'Good Models'
Application: loan approval
- Model: no applicants default on loans
- Evaluation: simple, low error rate
Application: cancer diagnosis
- Model: all tumors are benign
- Evaluation: simple, low error rate
Application: information assurance
- Model: all visitors to the network are well intentioned
- Evaluation: simple, low error rate
Fall 2004 76
What's Going On?
Many (most) data mining applications can be thought of as detecting exceptions
Ignoring the exceptions does not significantly increase the error rate!
Ignoring the exceptions often leads to a simple model!
Thus, we can find a model that we evaluate as good but completely misses the point
Need to account for the cost of error types
Fall 2004 77
Accounting for Cost of Errors
Explicit modeling of the cost of each error
- costs may not be known
- often not practical
Look at trade-offs
- visual inspection
- semi-automated learning
Cost-sensitive learning
- assign costs to classes a priori
Fall 2004 78
Explicit Modeling of Cost

Confusion matrix (displayed in Weka):

                          Predicted class
                          Yes               No
Actual class   Yes        True positive     False negative
               No         False positive    True negative
Fall 2004 79
Cost Sensitive Learning
So far we have used cost information to evaluate learning
Better: use cost information to learn
Simple idea:
- Increase the weight of (or duplicate) instances that demonstrate the important behavior (e.g., those classified as exceptions)
- Applies to any learning algorithm
Fall 2004 80
Discussion
Evaluate learning
- Estimate the error rate
- Minimum description length principle / Occam's razor
Comparison of algorithms
- Based on evaluation
- Make sure the difference is significant
Cost of making errors may differ
- Use evaluation procedures with caution
- Incorporate costs into learning
Fall 2004 81
Engineering the Output
Prediction based on one model
- The model performs well on one training set, but poorly on others
- New data becomes available -> new model
Combine models
- Bagging
- Boosting
- Stacking
Improves prediction but complicates the structure
Fall 2004 82
Bagging
Bias: the error we would make even with all the data in the world!
Variance: the error due to limited data
Intuitive idea of bagging:
- Assume we have several data sets
- Apply the learning algorithm to each set
- Vote on the prediction (classification) or average it (numeric)
What type of error does this reduce? When is this beneficial?
Fall 2004 83
Bootstrap Aggregating
In practice: only one training data set
Create many sets from the one
- Sample with replacement (remember the bootstrap)
Does this work?
- Often gives improvements in predictive performance
- Never degrades performance
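A minimal bagging sketch (names and the train/predict interface are assumptions): build several models on bootstrap samples of the one training set and let them vote.

```python
import random
from collections import Counter

def bagging_train(rows, train_fn, n_models=10, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    # One model per bootstrap sample (drawn with replacement)
    return [train_fn([rows[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_models)]

def bagging_classify(models, predict_fn, x):
    votes = Counter(predict_fn(m, x) for m in models)
    return votes.most_common(1)[0][0]  # majority vote
```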
Fall 2004 84
Boosting
Assume a stable learning procedure
- Low variance
- Bagging does very little
Combine structurally different models
Intuitive motivation:
- Any given model may be good for a subset of the training data
- Encourage each model to explain part of the data
Fall 2004 85
AdaBoost.M1
Generate models:
- Assign equal weight to each training instance
- Iterate:
  - Apply the learning algorithm and store the model; let e be its error
  - If e = 0 or e > 0.5, terminate
  - For every instance: if classified correctly, multiply its weight by e/(1-e)
  - Normalize the weights
- Until STOP
Fall 2004 86
AdaBoost.M1
Classification:
- Assign zero weight to each class
- For every model: add

  log( (1 - e) / e )

  to the class predicted by that model
- Return the class with the highest weight
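A hedged sketch of AdaBoost.M1 as described on the two slides above (names and the weighted-training interface are mine):

```python
import math
from collections import defaultdict

def adaboost_m1_train(rows, labels, train_fn, predict_fn, rounds=10):
    weights = [1.0 / len(rows)] * len(rows)  # equal initial weights
    models = []
    for _ in range(rounds):
        model = train_fn(rows, weights)
        wrong = [predict_fn(model, x) != y for x, y in zip(rows, labels)]
        e = sum(w for w, bad in zip(weights, wrong) if bad)
        if e == 0 or e > 0.5:
            break
        models.append((model, e))
        # Down-weight correctly classified instances, then normalize
        weights = [w * (e / (1 - e)) if not bad else w
                   for w, bad in zip(weights, wrong)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return models

def adaboost_m1_classify(models, predict_fn, x):
    votes = defaultdict(float)
    for model, e in models:
        votes[predict_fn(model, x)] += math.log((1 - e) / e)
    return max(votes, key=votes.get)
```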
Fall 2004 87
Performance Analysis
The training error of the combined classifier converges to zero at an exponential rate (very fast)
- Of questionable value due to possible overfitting
- Must use independent test data
Fails on test data if
- the classifier is more complex than the training data justifies
- the training error becomes too large too quickly
Must achieve a balance between model complexity and the fit to the data
Fall 2004 88
Fitting versus Overfitting
Overfitting is very difficult to assess here
- Assume we have reached zero training error: it may still be beneficial to continue boosting!
- Occam's razor?
Boosting builds complex models from simple ones
- It offers very significant improvement
- Can hope for more improvement than with bagging
- But it can degrade performance (this never happens with bagging)
Fall 2004 89
Stacking
Combine models of different types
Meta learner:
- Learn which learning algorithms are good
- Combine the learning algorithms intelligently

Level-0 models (e.g., decision tree, Naïve Bayes, instance-based) feed their predictions into a level-1 model built by the meta learner.
Fall 2004 90
Meta Learning
- Hold out part of the training set
- Use the remaining data for training the level-0 methods
- Use the holdout data to train the level-1 learner
- Retrain the level-0 algorithms with all the data
Comments:
- Level-1 learning: use a very simple algorithm (e.g., a linear model)
- Can use cross-validation to allow the level-1 algorithm to train on all the data
Fall 2004 91
Supervised Learning
Two types of learning
- Classification
- Numerical prediction
Classification learning algorithms
- Decision trees
- Naïve Bayes
- Instance-based learning
- Many others are part of Weka: browse!
Fall 2004 92
Other Issues in Supervised Learning
Evaluation
- Accuracy: hold-out, bootstrap, cross-validation
- Simplicity: MDL principle
- Usefulness: cost-sensitive learning
Metalearning
- Bagging, boosting, stacking