Fall 2004 1
Supervised Learning
Fall 2004 2
Introduction
Key idea
- Known target concept (predict a certain attribute)
- Find out how the other attributes can be used to predict it
Algorithms
- Rudimentary rules (e.g., 1R)
- Statistical modeling (e.g., Naïve Bayes)
- Divide and conquer: decision trees
- Instance-based learning
- Neural networks
- Support vector machines
Fall 2004 3
1-Rule (1R)
Generate a one-level decision tree over a single attribute (it performs quite well!)
Basic idea:
- Build rules that test a single attribute
- Classify according to the most frequent class in the training data
- Evaluate the error rate of each attribute's rule set
- Choose the best attribute
That's all folks!
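As a concrete illustration, a minimal 1R learner might look like the sketch below (Python used for brevity; the function and variable names are mine, not from the course).

```python
# Minimal 1R sketch: for each attribute, build one rule per attribute value
# predicting the majority class, then keep the attribute whose rule set makes
# the fewest errors on the training data.
from collections import Counter

def one_r(instances, attributes, target):
    best = None
    for attr in attributes:
        counts = {}  # attribute value -> Counter of classes
        for row in instances:
            counts.setdefault(row[attr], Counter())[row[target]] += 1
        # Majority class per value; errors = instances not in the majority
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best  # (attribute, {value: predicted class}, training errors)

# On the weather data, one_r(rows, ["outlook", "temperature", "humidity",
# "windy"], "play") should pick "outlook" with 4/14 errors.
```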
Fall 2004 4
The Weather Data (again)

Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      FALSE  No
Sunny     Hot    High      TRUE   No
Overcast  Hot    High      FALSE  Yes
Rainy     Mild   High      FALSE  Yes
Rainy     Cool   Normal    FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Mild   High      FALSE  No
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   Normal    FALSE  Yes
Sunny     Mild   Normal    TRUE   Yes
Overcast  Mild   High      TRUE   Yes
Overcast  Hot    Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No
Fall 2004 5
Apply 1R

  Attribute      Rules                Errors   Total
1 outlook        sunny -> no          2/5      4/14
                 overcast -> yes      0/4
                 rainy -> yes         2/5
2 temperature    hot -> no            2/4      5/14
                 mild -> yes          2/6
                 cool -> yes          1/4
3 humidity       high -> no           3/7      4/14
                 normal -> yes        1/7
4 windy          false -> yes         2/8      5/14
                 true -> no           3/6
Fall 2004 6
Other Features
Numeric values
- Discretization: sort the training data, then split the range into categories
Missing values
- Treat "missing" as a separate ("dummy") attribute value

Example (temperature values sorted, with the play class underneath):
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y  N  Y  Y  Y  N  N  Y  Y  Y  N  Y  Y  N
Fall 2004 7
Naïve Bayes Classifier
Allow all attributes to contribute equally
Assumes
- All attributes equally important
- All attributes independent
Realistic?
Contrast with selection of attributes (as in 1R)
Fall 2004 8
Bayes Theorem

P[H | E] = P[E | H] P[H] / P[E]

H: hypothesis
E: evidence
P[H | E]: posterior probability (conditional probability of H given E)
P[H]: prior probability
Fall 2004 9
Maximum a Posteriori (MAP)

H_MAP = argmax_H P[H | E]
      = argmax_H P[E | H] P[H] / P[E]
      = argmax_H P[E | H] P[H]

Maximum Likelihood (ML)

H_ML = argmax_H P[E | H]
Fall 2004 10
Classification
Want to classify a new instance (a1, a2, ..., an) into a finite number of categories from the set V.
Bayesian approach: assign the most probable category v_MAP given (a1, a2, ..., an):

v_MAP = argmax_{v in V} P[v | a1, a2, ..., an]
      = argmax_{v in V} P[a1, a2, ..., an | v] P[v] / P[a1, a2, ..., an]
      = argmax_{v in V} P[a1, a2, ..., an | v] P[v]

Can we estimate the probabilities from the training data?
Fall 2004 11
Naïve Bayes Classifier
The second probability, P[v], is easy to estimate. How?
The first probability, P[a1, a2, ..., an | v], is difficult to estimate. Why?
Assume independence (this is the naïve bit):

v_MAP = argmax_{v in V} P[v] prod_i P[a_i | v]
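A minimal sketch of this classifier for nominal attributes follows (names are mine, not from the slides); it ignores the zero-frequency issue fixed later by the Laplace estimator.

```python
# Naive Bayes sketch: estimate P[v] and P[a_i | v] by counting, then classify
# a new instance with argmax_v P[v] * prod_i P[a_i | v].
from collections import Counter, defaultdict

def train_nb(instances, attributes, target):
    class_counts = Counter(row[target] for row in instances)
    cond_counts = defaultdict(Counter)  # (attribute, class) -> Counter of values
    for row in instances:
        for attr in attributes:
            cond_counts[(attr, row[target])][row[attr]] += 1
    return class_counts, cond_counts

def classify_nb(x, attributes, class_counts, cond_counts):
    n = sum(class_counts.values())
    best_v, best_score = None, -1.0
    for v, nv in class_counts.items():
        score = nv / n  # P[v]
        for attr in attributes:
            score *= cond_counts[(attr, v)][x[attr]] / nv  # P[a_i | v]
        if score > best_score:
            best_v, best_score = v, score
    return best_v
```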
Fall 2004 12
The Weather Data (yet again)

          Outlook           Temperature         Humidity            Windy            Play
          Yes  No           Yes  No             Yes  No             Yes  No          Yes  No
Sunny      2    3   Hot      2    2    High      3    4    FALSE     6    2           9    5
Overcast   4    0   Mild     4    2    Normal    6    1    TRUE      3    3
Rainy      3    2   Cool     3    1

P̂[Play = yes] = 9/14
P̂[Outlook = sunny | Play = yes] = 2/9
P̂[Temperature = cool | Play = yes] = 3/9
P̂[Humidity = high | Play = yes] = 3/9
P̂[Windy = true | Play = yes] = 3/9
Fall 2004 13
Estimation
Given a new instance with
  outlook = sunny, temperature = cool, humidity = high, windy = true

P[Play = yes] * prod_i P[a_i | Play = yes]
  = 9/14 * 2/9 * 3/9 * 3/9 * 3/9 = 0.0053
Fall 2004 14
Calculations continued ...
Similarly

P[Play = no] * prod_i P[a_i | Play = no]
  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5 = 0.0206

Thus

v_MAP = argmax_{v in {Play = yes, Play = no}} P[v] prod_i P[a_i | v] = {Play = no}
Fall 2004 15
Normalization
Note that we can normalize to get the probabilities:

P[v | a1, a2, ..., an] = P[a1, a2, ..., an | v] P[v] / P[a1, a2, ..., an]

v = Play = yes:  0.0053 / (0.0053 + 0.0206) = 0.205
v = Play = no:   0.0206 / (0.0053 + 0.0206) = 0.795
Fall 2004 16
Problems ...
Suppose we had the following training data:

          Outlook           Temperature         Humidity            Windy            Play
          Yes  No           Yes  No             Yes  No             Yes  No          Yes  No
Sunny      0    5   Hot      2    2    High      3    4    FALSE     6    2           9    5
Overcast   4    0   Mild     4    2    Normal    6    1    TRUE      3    3
Rainy      3    2   Cool     3    1

P̂[Play = yes] = 9/14
P̂[Outlook = sunny | Play = yes] = 0/9
P̂[Temperature = cool | Play = yes] = 3/9
P̂[Humidity = high | Play = yes] = 3/9
P̂[Windy = true | Play = yes] = 3/9

Now what?
Fall 2004 17
Laplace Estimator
Replace the estimates

P̂[Outlook = sunny | Play = yes]    = 2/9
P̂[Outlook = overcast | Play = yes] = 4/9
P̂[Outlook = rainy | Play = yes]    = 3/9

with

P̂[Outlook = sunny | Play = yes]    = (2 + μ p1) / (9 + μ)
P̂[Outlook = overcast | Play = yes] = (4 + μ p2) / (9 + μ)
P̂[Outlook = rainy | Play = yes]    = (3 + μ p3) / (9 + μ)

where p1 + p2 + p3 = 1 and μ weights the prior (the classic Laplace estimator uses μ = 3 and p_i = 1/3, i.e., add one to every count).
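A hedged sketch of this smoothed estimate (the function name is mine):

```python
# Laplace/m-estimate correction: with mu = 3 and equal priors p_i = 1/3,
# every count effectively gets +1.
def laplace_estimate(count, class_total, prior=1.0 / 3.0, mu=3.0):
    """Estimate P[attribute value | class] with a Laplace-style correction."""
    return (count + mu * prior) / (class_total + mu)

# The zero count from the "Problems" slide no longer yields zero:
# laplace_estimate(0, 9) == 1/12 instead of 0/9.
```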
Fall 2004 18
Numeric Values
Assume a probability distribution for the numeric attributes
- density f(x), typically normal:

  f(x) = 1 / (sqrt(2 pi) sigma) * exp( -(x - mu)^2 / (2 sigma^2) )

- fit a distribution (better)
Similarly as before:

v_MAP = argmax_{v in V} P[v] prod_i f(a_i | v)
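A minimal sketch of the density plug-in (my own naming; the example numbers are illustrative and would come from fitting the training data):

```python
# Normal density used for numeric attributes in Naive Bayes: fit mu and sigma
# per class, then use f(a_i | v) in the product in place of P[a_i | v].
import math

def normal_density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# e.g. with temperature | play=yes fitted as mu=73, sigma=6.2,
# normal_density(66, 73, 6.2) is roughly 0.034.
```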
Fall 2004 19
Discussion
- Simple methodology
- Powerful: good results in practice
- Missing values are no problem
- Not so good if the independence assumption is severely violated
  - Extreme case: multiple attributes with the same values
- Solutions:
  - Preselect which attributes to use
  - Non-naïve Bayesian methods: Bayesian networks
Fall 2004 20
Decision Tree Learning
Basic algorithm:
- Select an attribute to be tested
- If classification achieved, return the classification
- Otherwise, branch by setting the attribute to each of its possible values
- Repeat with each branch as your new tree
Main issue: how to select attributes
Fall 2004 21
Deciding on Branching
What do we want to accomplish?
- Make good predictions
- Obtain rules that are simple to interpret
Diversity (impurity):
- No diversity is best: all instances in the same class
- Maximum diversity is worst: all classes equally likely
Goal: select attributes that reduce impurity
Fall 2004 22
Measuring Impurity/Diversity
Let's say we only have two classes, with proportions p1 and p2:

Minimum:                                min(p1, p2)
Gini index / Simpson diversity index:   2 p1 p2 = 2 p1 (1 - p1)
Entropy:                                -p1 log2(p1) - p2 log2(p2)
Fall 2004 23
Impurity Functions
[Plot of the three impurity functions (entropy, Gini index, minimum) against the class proportion p1 from 0 to 1]
Fall 2004 24
Entropy

Entropy(S) = -sum_{i=1}^{c} p_i log2(p_i)

where S is the training data (instances), c is the number of classes, and p_i is the proportion of S classified as i.

Entropy is a measure of impurity in the training data S
- Measured in bits of information needed to encode the classification of a member of S
Extreme cases
- All members have the same classification: entropy 0 (note: 0 * log 0 = 0)
- All classifications equally frequent: maximum entropy
Fall 2004 25
Expected Information Gain

Gain(S, a) = Entropy(S) - sum_{v in Values(a)} (|S_v| / |S|) Entropy(S_v)

where Values(a) is the set of all possible values for attribute a, and S_v = { s in S : a(s) = v }.

Gain(S, a) is the expected information provided about the classification from knowing the value of attribute a (the reduction in the number of bits needed).
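Illustrative helpers computing Entropy(S) and Gain(S, a) as defined above (the names and data layout are assumptions; instances are dicts of nominal attribute values):

```python
import math
from collections import Counter

def entropy(instances, target):
    counts = Counter(row[target] for row in instances)
    n = len(instances)
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

def info_gain(instances, attr, target):
    n = len(instances)
    gain = entropy(instances, target)
    for value in {row[attr] for row in instances}:
        subset = [row for row in instances if row[attr] == value]
        gain -= (len(subset) / n) * entropy(subset, target)
    return gain

# On the weather data, info_gain(rows, "outlook", "play") comes out near
# 0.25 (0.94 - 0.69), the largest of the four attributes.
```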
Fall 2004 26
The Weather Data (yet again)

Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      FALSE  No
Sunny     Hot    High      TRUE   No
Overcast  Hot    High      FALSE  Yes
Rainy     Mild   High      FALSE  Yes
Rainy     Cool   Normal    FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Mild   High      FALSE  No
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   Normal    FALSE  Yes
Sunny     Mild   Normal    TRUE   Yes
Overcast  Mild   High      TRUE   Yes
Overcast  Hot    Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No
Fall 2004 27
Decision Tree: Root Node

Outlook
- Sunny:    Yes Yes No No No
- Overcast: Yes Yes Yes Yes
- Rainy:    Yes Yes Yes No No
Fall 2004 28
Calculating the Entropy

Entropy(S) = -sum_{i=1}^{2} p_i log2(p_i)
           = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94

Entropy(S_sunny)    = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97
Entropy(S_overcast) = -(4/4) log2(4/4) - (0/4) log2(0/4) = 0.00
Entropy(S_rainy)    = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.97
Fall 2004 29
Calculating the Gain

Gain(S, a) = Entropy(S) - sum_{v in Values(a)} (|S_v| / |S|) Entropy(S_v)

Gain(S, outlook)  = 0.94 - [ (5/14) 0.97 + (4/14) 0.00 + (5/14) 0.97 ]
                  = 0.94 - 0.69 = 0.25    <- Select!
Gain(S, temp)     = 0.03
Gain(S, humidity) = 0.15
Gain(S, windy)    = 0.05
Fall 2004 30
Next Level

Outlook = Sunny branch, testing Temperature:
- Hot:  No No
- Mild: Yes No
- Cool: Yes
(Overcast and Rainy branches as before)
Fall 2004 31
Calculating the Entropy

Entropy(S)      = 0.97   (the Sunny subset)
Entropy(S_hot)  = -(0/2) log2(0/2) - (2/2) log2(2/2) = 0
Entropy(S_mild) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
Entropy(S_cool) = -(1/1) log2(1/1) - (0/1) log2(0/1) = 0
Fall 2004 32
Calculating the Gain

Gain(S, a) = Entropy(S) - sum_{v in Values(a)} (|S_v| / |S|) Entropy(S_v)

Gain(S, temp)     = 0.97 - [ (2/5) 1 + (2/5) 0 + (1/5) 0 ]
                  = 0.97 - 0.40 = 0.57
Gain(S, humidity) = 0.97    <- Select!
Gain(S, windy)    = 0.02
Fall 2004 33
Final Tree

Outlook
- Sunny -> Humidity
    - High   -> No
    - Normal -> Yes
- Overcast -> Yes
- Rainy -> Windy
    - True  -> No
    - False -> Yes
Fall 2004 34
What's in a Tree?
Our final decision tree correctly classifies every training instance. Is this good?
Two important concepts:
- Overfitting
- Pruning
Fall 2004 35
Overfitting
Two sources of abnormalities
- Noise (randomness)
- Outliers (measurement errors)
Chasing every abnormality causes overfitting
- The tree becomes too large and complex
- It does not generalize to new data
Solution: prune the tree
Fall 2004 36
Pruning
Prepruning
- Halt construction of the decision tree early
- Use the same measure as for selecting attributes, e.g., halt if InfoGain < K
- The most frequent class becomes the leaf node
Postpruning
- Construct the complete decision tree, then prune it back
- Prune to minimize expected error rates
- Prune to minimize bits of encoding (Minimum Description Length principle)
Fall 2004 37
Scalability
Need to design for large amounts of data
Two things to worry about:
- Large number of attributes
  - Leads to a large tree (prepruning?)
  - Takes a long time
- Large amounts of data
  - Can the data be kept in memory?
  - Some new algorithms do not require all the data to be memory resident
Fall 2004 38
Discussion: Decision Trees
- Among the most popular methods
- Quite effective
- Relatively simple
- We have discussed the ID3 algorithm in detail:
  - Information gain to select attributes
  - No pruning
  - Only handles nominal attributes
Fall 2004 39
Selecting Split Attributes
Other univariate splits
- Gain ratio: C4.5 algorithm (J48 in Weka)
- CART (not in Weka)
Multivariate splits
- May be possible to obtain better splits by considering two or more attributes simultaneously
Fall 2004 40
Instance-Based Learning
Classification without constructing an explicit description of how to classify
- Store all training data (this is the "learning")
- New example: find the most similar stored instance
- Computing is done at classification time
- k-nearest neighbor
Fall 2004 41
K-Nearest Neighbor
Each instance lives in n-dimensional space, described by its attribute values

( a1(x), a2(x), ..., an(x) )

Distance between instances x1 and x2:

d(x1, x2) = sqrt( sum_{r=1}^{n} ( a_r(x1) - a_r(x2) )^2 )
Fall 2004 42
Example: nearest neighbor

[Scatter plot of positive (+) and negative (-) instances with a query point xq]
1-Nearest neighbor?
6-Nearest neighbor?
Fall 2004 43
Normalizing
Some attributes may take large values and others small
Normalize to put all attributes on an equal footing:

a_i = (v_i - min v_i) / (max v_i - min v_i)
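An illustrative k-nearest-neighbor sketch combining this normalization with the Euclidean distance above (names and data layout are mine):

```python
import math
from collections import Counter

def min_max_normalize(rows):
    """rows: list of (features, label) with numeric feature lists."""
    n = len(rows[0][0])
    lo = [min(r[0][i] for r in rows) for i in range(n)]
    hi = [max(r[0][i] for r in rows) for i in range(n)]
    scale = lambda x: [(x[i] - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0
                       for i in range(n)]
    return [(scale(f), y) for f, y in rows], scale

def knn_classify(query, rows, k=1):
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(rows, key=lambda r: dist(query, r[0]))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]  # majority vote

# Usage: normalized, scale = min_max_normalize(training_rows)
#        label = knn_classify(scale(new_features), normalized, k=6)
```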
Fall 2004 44
Other Methods for Supervised Learning
Neural networks
Support vector machines
Optimization
Rough set approach
Fuzzy set approach
Fall 2004 45
Evaluating the Learning
Measure of performance
- Classification: error rate
Resubstitution error
- Performance on the training set
- Poor predictor of future performance (overfitting)
- Useless for evaluation
Fall 2004 46
Test Set
Need a set of test instances
- Independent of the training set instances
- Representative of the underlying structure
Sometimes: validation data
- Used to fine-tune parameters
- Independent of training and test data
Plentiful data: no problem!
Fall 2004 47
Holdout Procedures
Common case: data set large but limited
Usual procedure:
- Reserve some data for testing
- Use the remaining data for training
Problems:
- Want both sets as large as possible
- Want both sets to be representative
Fall 2004 48
"Smart" Holdout
Simple check: Are the proportions of classes about the same in each data set?
Stratified holdout Guarantee that classes are (approximately)
proportionally represented
Repeated holdout Randomly select holdout set several times and
average the error rate estimates
Fall 2004 49
Holdout w/ Cross-Validation
Cross-validation
- Fixed number of partitions of the data (folds)
- In turn: each partition is used for testing and the remaining instances for training
- May use stratification and randomization
Standard practice:
- Stratified tenfold cross-validation
- Instances divided randomly into the ten partitions
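A hedged sketch of stratified k-fold cross-validation (function names and the train/predict interface are assumptions):

```python
import random
from collections import defaultdict

def stratified_folds(rows, label_of, k=10, seed=0):
    # Split each class round-robin into k folds so class proportions are kept.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for i, row in enumerate(members):
            folds[i % k].append(row)
    return folds

def cross_validate(rows, label_of, train_fn, predict_fn, k=10):
    # Train on k-1 folds, test on the held-out fold, average the error rates.
    folds = stratified_folds(rows, label_of, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        model = train_fn(train)
        wrong = sum(predict_fn(model, r) != label_of(r) for r in test)
        errors.append(wrong / len(test))
    return sum(errors) / k, errors
```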
Fall 2004 50
Cross Validation

Fold 1: train a model on 90% of the data, test on the remaining 10% -> error rate e1
Fold 2: train a model on 90% of the data, test on the remaining 10% -> error rate e2
... and so on for each fold
Fall 2004 51
Cross-Validation
Final estimate of the error:

ê = (1/k) sum_{i=1}^{k} e_i

Quality of the estimate:

s^2 = (1 / (k(k-1))) sum_{i=1}^{k} (e_i - ê)^2,   interval ê ± t_{1-α, k-1} · s
Fall 2004 52
Leave-One-Out Holdout
n-fold cross-validation, where n is the number of instances
- Use all but one instance for training
- Maximum use of the data
- Deterministic
- High computational cost
- Non-stratified sample
Fall 2004 53
Bootstrap
- Sample with replacement n times
- Use the sample as training data
- Use the instances not in the training data for testing
How many test instances are there?

lim_{n -> infinity} (1 - 1/n)^n = e^{-1}
Fall 2004 54
0.632 Bootstrap
On average, e^{-1} · n ≈ 0.368 n instances will be in the test set
Thus, on average, 63.2% of the instances are in the training set
Estimate the error rate as

e = 0.632 · e_test + 0.368 · e_train
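A hedged sketch of this estimate (names and the train/error interface are mine):

```python
import random

def bootstrap_632(rows, train_fn, error_fn, repeats=10, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(repeats):
        # Sample n instances with replacement for training
        train = [rows[rng.randrange(n)] for _ in range(n)]
        chosen = {id(r) for r in train}
        test = [r for r in rows if id(r) not in chosen]  # out-of-bag instances
        if not test:
            continue
        model = train_fn(train)
        # Blend test and training error with the 0.632 / 0.368 weights
        estimates.append(0.632 * error_fn(model, test) + 0.368 * error_fn(model, train))
    return sum(estimates) / len(estimates)
```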
Fall 2004 55
Accuracy of our Estimate?
Suppose we observe s successes in a testing set of n_test instances ...
We then estimate the success rate R_success = s / n_test.
Each instance is either a success or a failure (a Bernoulli trial with success probability p)
- Mean: p
- Variance: p(1 - p)
Fall 2004 56
Properties of Estimate
We have
E[R_success] = p
Var[R_success] = p(1 - p) / n_test
If n_test is large enough, the Central Limit Theorem (CLT) states that, approximately,
R_success ~ Normal( p, p(1 - p) / n_test )
Fall 2004 57
Confidence Interval
CI for the normal distribution:

P[ -z <= (R_success - p) / sqrt( p(1 - p) / n_test ) <= z ] = c

where c is the confidence level and z is looked up in a table.

Solving for p gives the CI for p:

p = ( R_success + z^2 / (2 n_test)
      ± z sqrt( R_success / n_test - R_success^2 / n_test + z^2 / (4 n_test^2) ) )
    / ( 1 + z^2 / n_test )
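An illustrative check of the interval above (the function name is mine; z = 1.96 corresponds to c = 95%):

```python
import math

def wilson_interval(r, n_test, z=1.96):
    # Lower/upper bounds for p given the observed success rate r on n_test instances
    center = r + z * z / (2 * n_test)
    spread = z * math.sqrt(r / n_test - r * r / n_test + z * z / (4 * n_test * n_test))
    denom = 1 + z * z / n_test
    return (center - spread) / denom, (center + spread) / denom

# e.g. 75 successes out of 100: wilson_interval(0.75, 100) is roughly (0.66, 0.82)
```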
Fall 2004 58
Comparing Algorithms
We know how to evaluate the results of our data mining algorithms (classification)
How should we compare different algorithms?
- Evaluate each algorithm
- Rank them
- Select the best one
We don't know if this ranking is reliable
Fall 2004 59
Assessing Other Learning
We developed these procedures for classification
Association rules
- Evaluated based on accuracy
- Same methods as for classification
Numerical prediction
- Error rate no longer applies
- Same principles apply:
  - use an independent test set and hold-out procedures
  - cross-validation or bootstrap
Fall 2004 60
Measures of Effectiveness
Need to compare:
- Predicted values p1, p2, ..., pn
- Actual values a1, a2, ..., an
Most common measure: mean-squared error

(1/n) sum_{i=1}^{n} (p_i - a_i)^2
Fall 2004 61
Other Measures

Mean absolute error:      (1/n) sum_{i=1}^{n} |p_i - a_i|
Relative squared error:   sum_{i=1}^{n} (p_i - a_i)^2 / sum_{i=1}^{n} (a_i - ā)^2
Relative absolute error:  sum_{i=1}^{n} |p_i - a_i| / sum_{i=1}^{n} |a_i - ā|
Correlation

where ā is the mean of the actual values.
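Illustrative computations of these measures (names are mine; p and a are equal-length lists of predicted and actual values):

```python
import math

def mse(p, a):
    return sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / len(a)

def mae(p, a):
    return sum(abs(pi - ai) for pi, ai in zip(p, a)) / len(a)

def relative_squared_error(p, a):
    mean_a = sum(a) / len(a)
    return sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / sum((ai - mean_a) ** 2 for ai in a)

def relative_absolute_error(p, a):
    mean_a = sum(a) / len(a)
    return sum(abs(pi - ai) for pi, ai in zip(p, a)) / sum(abs(ai - mean_a) for ai in a)

def correlation(p, a):
    mp, ma = sum(p) / len(p), sum(a) / len(a)
    cov = sum((pi - mp) * (ai - ma) for pi, ai in zip(p, a))
    return cov / math.sqrt(sum((pi - mp) ** 2 for pi in p) * sum((ai - ma) ** 2 for ai in a))
```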
Fall 2004 62
What to Do?
"Large" amounts of data
- Hold out 1/3 of the data for testing
- Train a model on 2/3 of the data
- Estimate the error (or success) rate and calculate a CI
"Moderate" amounts of data
- Estimate the error rate: use 10-fold cross-validation with stratification, or use the bootstrap
- Train the model on the entire data set
Fall 2004 63
Predicting Probabilities
Classification into k classes
- Predict probabilities p1, p2, ..., pk for the classes
- Actual values a1, a2, ..., ak (1 for the correct class, 0 for the others)
No longer a 0-1 error. Quadratic loss function:

sum_{j=1}^{k} (p_j - a_j)^2 = sum_{j != i} p_j^2 + (1 - p_i)^2

where i is the correct class.
Fall 2004 64
Information Loss Function
Instead of the quadratic function, use

-log2(p_j)

where the j-th prediction is the correct one.
This is the information (in bits) required to communicate which class is correct, with respect to the predicted probability distribution.
Fall 2004 65
Occam's Razor
Given a choice of theories that are equally good, the simplest theory should be chosen
Physical sciences: any theory should be consistent with all empirical observations
Data mining:
- theory = predictive model
- good theory = good prediction
- What is good? Do we minimize the error rate?
Fall 2004 66
Minimum Description Length
MDL principle:
- Minimize the size of the theory plus the information needed to specify the exceptions
Suppose the training set E is mined, resulting in a theory T
We want to minimize

L[T] + L[E | T]
Fall 2004 67
Most Likely Theory
Suppose we want to maximize P[T | E]
Bayes' rule:

P[T | E] = P[E | T] P[T] / P[E]

Take logarithms:

log P[T | E] = log P[E | T] + log P[T] - log P[E]
Fall 2004 68
Information Function
Maximizing P[T | E] is equivalent to minimizing

-log P[T | E] = -log P[E | T] - log P[T] + log P[E]

where -log P[T] is the number of bits it takes to transmit the theory and -log P[E | T] is the number of bits it takes to transmit the exceptions.
That is, the MDL principle!
Fall 2004 69
Applications to Learning
Classification, association, numeric prediction
- Several predictive models with 'similar' error rates (usually as small as possible)
- Select between them using Occam's razor
- Simplicity is subjective, so use the MDL principle
Clustering
- Important type of learning that is difficult to evaluate
- Can use the MDL principle
Fall 2004 70
Comparing Mining Algorithms
We know how to evaluate the results
Suppose we have two algorithms
- Obtain two different models
- Estimate the error rates ê(1) and ê(2)
- Compare the estimates:  ê(1) < ê(2) ?
- Select the better one
Problem?
Fall 2004 71
Weather Data Example
Suppose we learn the rule
  If outlook = rainy then play = yes
  Otherwise play = no
Test it on the following test set:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Hot    High      FALSE  No
Sunny    Hot    High      TRUE   No
Rainy    Mild   High      FALSE  Yes
Rainy    Cool   Normal    FALSE  Yes
Sunny    Mild   High      FALSE  No

We have a zero error rate!
Fall 2004 72
Different Test Set 2
Again, suppose we learn the rule
  If outlook = rainy then play = yes
  Otherwise play = no
Test it on a different test set:

Outlook   Temp.  Humidity  Windy  Play
Overcast  Hot    High      FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No

We have a 100% error rate!
Fall 2004 73
Comparing Random Estimates
The estimated error rate is just an estimate (it is random)
We need the variance as well as point estimates
Construct a t-test statistic for H0: difference = 0

t = d̄ / sqrt( s^2 / k )

where d̄ is the average of the differences in error rates, s^2 is the estimated variance of the differences, and k is the number of estimates.
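A hedged sketch of this paired test (the function name is mine): given the per-fold error rates of two algorithms from the same cross-validation, compute t and compare it against a t-table with k-1 degrees of freedom.

```python
import math

def paired_t_statistic(errors_a, errors_b):
    diffs = [ea - eb for ea, eb in zip(errors_a, errors_b)]
    k = len(diffs)
    mean_d = sum(diffs) / k
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (k - 1)
    return mean_d / math.sqrt(var_d / k)

# e.g. with 10-fold CV error rates for two learners, |t| > 2.26 (the 5% value
# for 9 degrees of freedom) would suggest a significant difference.
```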
Fall 2004 74
Discussion
We now know how to compare two learning algorithms and select the one with the better error rate
We also know to select the simplest model that has a 'comparable' error rate
Is it really better?
- Minimizing the error rate can be misleading
Fall 2004 75
Examples of 'Good Models'
Application: loan approval
- Model: no applicants default on loans
- Evaluation: simple, low error rate
Application: cancer diagnosis
- Model: all tumors are benign
- Evaluation: simple, low error rate
Application: information assurance
- Model: all visitors to the network are well intentioned
- Evaluation: simple, low error rate
Fall 2004 76
What's Going On?
Many (most) data mining applications can be thought of as detecting exceptions
Ignoring the exceptions does not significantly increase the error rate!
Ignoring the exceptions often leads to a simple model!
Thus, we can find a model that we evaluate as good but completely misses the point
Need to account for the cost of error types
Fall 2004 77
Accounting for Cost of Errors
Explicit modeling of the cost of each error
- costs may not be known
- often not practical
Look at trade-offs
- visual inspection
- semi-automated learning
Cost-sensitive learning
- assign costs to classes a priori
Fall 2004 78
Explicit Modeling of Cost

Confusion matrix (displayed in Weka):

                          Predicted class
                          Yes               No
Actual class   Yes        True positive     False negative
               No         False positive    True negative
Fall 2004 79
Cost Sensitive Learning
So far we have used cost information to evaluate learning
Better: use cost information to learn
Simple idea:
- Increase the weight of (or duplicate) instances that demonstrate the important behavior (e.g., those classified as exceptions)
- Applies to any learning algorithm
Fall 2004 80
Discussion
Evaluate learning
- Estimate the error rate
- Minimum description length principle / Occam's razor
Comparison of algorithms
- Based on evaluation
- Make sure the difference is significant
Cost of making errors may differ
- Use evaluation procedures with caution
- Incorporate costs into learning
Fall 2004 81
Engineering the Output
Prediction based on one model
- The model performs well on one training set, but poorly on others
- New data becomes available -> new model
Combine models
- Bagging
- Boosting
- Stacking
Improves prediction but complicates the structure
Fall 2004 82
Bagging
Bias: the error we would make even with all the data in the world!
Variance: the error due to limited data
Intuitive idea of bagging:
- Assume we have several data sets
- Apply the learning algorithm to each set
- Vote on the prediction (classification) or average it (numeric)
What type of error does this reduce? When is this beneficial?
Fall 2004 83
Bootstrap Aggregating
In practice: only one training data set
Create many sets from the one
- Sample with replacement (remember the bootstrap)
Does this work?
- Often gives improvements in predictive performance
- Never degrades performance
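A minimal bagging sketch (names and the train/predict interface are assumptions): build several models on bootstrap samples of the one training set and let them vote.

```python
import random
from collections import Counter

def bagging_train(rows, train_fn, n_models=10, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    # One model per bootstrap sample (drawn with replacement)
    return [train_fn([rows[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_models)]

def bagging_classify(models, predict_fn, x):
    votes = Counter(predict_fn(m, x) for m in models)
    return votes.most_common(1)[0][0]  # majority vote
```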
Fall 2004 84
Boosting
Assume a stable learning procedure
- Low variance
- Bagging does very little
Combine structurally different models
Intuitive motivation:
- Any given model may be good for a subset of the training data
- Encourage each model to explain part of the data
Fall 2004 85
AdaBoost.M1
Generate models:
- Assign equal weight to each training instance
- Iterate:
  - Apply the learning algorithm and store the model; let e be its error
  - If e = 0 or e > 0.5, terminate
  - For every instance: if classified correctly, multiply its weight by e/(1-e)
  - Normalize the weights
- Until STOP
Fall 2004 86
AdaBoost.M1
Classification:
- Assign zero weight to each class
- For every model: add

  log( (1 - e) / e )

  to the class predicted by that model
- Return the class with the highest weight
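A hedged sketch of AdaBoost.M1 as described on the two slides above (names and the weighted-training interface are mine):

```python
import math
from collections import defaultdict

def adaboost_m1_train(rows, labels, train_fn, predict_fn, rounds=10):
    weights = [1.0 / len(rows)] * len(rows)  # equal initial weights
    models = []
    for _ in range(rounds):
        model = train_fn(rows, weights)
        wrong = [predict_fn(model, x) != y for x, y in zip(rows, labels)]
        e = sum(w for w, bad in zip(weights, wrong) if bad)
        if e == 0 or e > 0.5:
            break
        models.append((model, e))
        # Down-weight correctly classified instances, then normalize
        weights = [w * (e / (1 - e)) if not bad else w
                   for w, bad in zip(weights, wrong)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return models

def adaboost_m1_classify(models, predict_fn, x):
    votes = defaultdict(float)
    for model, e in models:
        votes[predict_fn(model, x)] += math.log((1 - e) / e)
    return max(votes, key=votes.get)
```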
Fall 2004 87
Performance Analysis
The training error of the combined classifier converges to zero at an exponential rate (very fast)
- Of questionable value due to possible overfitting
- Must use independent test data
Fails on test data if
- the classifier is more complex than the training data justifies
- the training error becomes too large too quickly
Must achieve a balance between model complexity and the fit to the data
Fall 2004 88
Fitting versus Overfitting
Overfitting is very difficult to assess here
- Assume we have reached zero training error: it may still be beneficial to continue boosting!
- Occam's razor?
Boosting builds complex models from simple ones
- It offers very significant improvement
- Can hope for more improvement than with bagging
- But it can degrade performance (this never happens with bagging)
Fall 2004 89
Stacking
Combine models of different types
Meta learner:
- Learn which learning algorithms are good
- Combine the learning algorithms intelligently

Level-0 models (e.g., decision tree, Naïve Bayes, instance-based) feed their predictions into a level-1 model built by the meta learner.
Fall 2004 90
Meta Learning
- Hold out part of the training set
- Use the remaining data for training the level-0 methods
- Use the holdout data to train the level-1 learner
- Retrain the level-0 algorithms with all the data
Comments:
- Level-1 learning: use a very simple algorithm (e.g., a linear model)
- Can use cross-validation to allow the level-1 algorithm to train on all the data
Fall 2004 91
Supervised Learning
Two types of learning
- Classification
- Numerical prediction
Classification learning algorithms
- Decision trees
- Naïve Bayes
- Instance-based learning
- Many others are part of Weka: browse!
Fall 2004 92
Other Issues in Supervised Learning
Evaluation
- Accuracy: hold-out, bootstrap, cross-validation
- Simplicity: MDL principle
- Usefulness: cost-sensitive learning
Metalearning
- Bagging, boosting, stacking