02 Classification


IE 527

Intelligent Engineering Systems

Basic concepts

Model/performance evaluation

Overfitting



Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.

Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes (the target) is the class:

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Examples of classification tasks:

Predicting tumor cells as benign or malignant

Classifying credit card transactions as legitimate or fraudulent

Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

Categorizing news stories as finance, weather, entertainment, sports, etc.

Descriptive modeling: to explain what features define the class label.

Predictive modeling: to predict the class label of unknown records.


Systematic approaches build classification models from an input data set: employ a learning algorithm to identify a model that best fits the relationship between the attribute set and the class label. Example techniques include:

Decision Tree based methods

Artificial Neural Networks

Naïve Bayes and Bayesian Belief Networks

Support Vector Machines

The model should both fit the input data well and correctly predict the class labels of unknown records (generalization).
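As a concrete sketch of this build-then-apply loop (not from the slides: it assumes scikit-learn is available and uses synthetic placeholder data), a decision tree can be induced from a training set and then applied to held-out records:

```python
# A sketch of the induction/deduction loop, assuming scikit-learn.
# X and y are synthetic placeholders, not the slides' data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # attribute sets x
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # class labels y

# Split into a training set (build the model) and a test set (validate it).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

model = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)  # induction
y_pred = model.predict(X_te)                                 # deduction
print("test accuracy:", accuracy_score(y_te, y_pred))
```

The same skeleton applies to the other techniques listed above; only the estimator changes.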

[Diagram: the Training Set feeds a Learning algorithm, which learns a Model (induction); the Model is then applied to the Test Set (deduction).]

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?


Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Induced Model (Decision Tree):

Refund?
├─ Yes → NO
└─ No → MarSt?
        ├─ Married → NO
        └─ Single, Divorced → TaxInc?
                              ├─ < 80K → NO
                              └─ > 80K → YES

Root/internal nodes carry the splitting attributes; leaf nodes carry the class labels.

A different tree induced from the same training data:

MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
                      ├─ Yes → NO
                      └─ No → TaxInc?
                              ├─ < 80K → NO
                              └─ > 80K → YES

There could be more than one tree that fits the same data!


Applying the first tree to test data:

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Starting from the root: Refund = No, so take the No branch; MarSt = Married, so reach the NO leaf. Assign Cheat to "No".
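Since the induced tree is just a nested set of attribute tests, it can be hand-coded directly. A minimal illustrative sketch (the function name is mine, not from the slides):

```python
# The first induced decision tree, written as nested conditionals.
def predict_cheat(refund: str, marital_status: str, taxable_income: float) -> str:
    if refund == "Yes":
        return "No"                       # leaf: NO
    if marital_status == "Married":
        return "No"                       # leaf: NO
    # Single or Divorced: split on taxable income
    return "No" if taxable_income < 80_000 else "Yes"

# The test record from the slide: Refund=No, Married, 80K -> "No"
print(predict_cheat("No", "Married", 80_000))
```

Note that the Married test fires before the income split, so the borderline 80K value never has to be resolved.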


Multiple methods are available to classify or predict. For each method, multiple choices are available for parameter settings. To choose the best model, we need to assess each model's performance.

Metrics for Performance Evaluation: How do we evaluate the performance of a model?

Methods for Performance Evaluation: How do we obtain reliable estimates of the metrics?

Methods for Model Comparison: How do we compare the relative performance among competing models?


Error = classifying a record as belonging to one class when it belongs to another class.

Error rate = (no. of misclassified records) / (total no. of records)

Other measures of error can also be used (especially for prediction, where the error of each instance is $e_i = y_i - \hat{y}_i$), such as:

Total SSE (Sum of Squared Errors) $= \sum_{i=1}^{n} e_i^2$

RMSE (Root Mean Squared Error) $= \sqrt{\sum_{i=1}^{n} e_i^2 / n}$

Naïve rule: classify all records as belonging to the most prevalent class, or classify at random (50-50) (or predict the average value, for prediction tasks).

Often used as a benchmark: we hope to do better than that. (Exception: when the goal is to identify high-value but rare outcomes, we may do well by doing worse than the naïve rule.)
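As a small sketch of these measures (assuming NumPy; the arrays are made-up placeholders), error rate, SSE/RMSE, and the naïve majority-class benchmark can be computed directly:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # placeholder labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # placeholder predictions

error_rate = np.mean(y_pred != y_true)        # misclassified / total
print("error rate:", error_rate)

# For prediction (numeric targets): e_i = y_i - yhat_i
y_num = np.array([3.0, 5.0, 2.5])
yhat  = np.array([2.5, 5.5, 2.0])
e = y_num - yhat
print("SSE :", np.sum(e**2))
print("RMSE:", np.sqrt(np.mean(e**2)))

# Naïve rule benchmark: always predict the most prevalent class.
majority = np.bincount(y_true).argmax()
print("naïve error rate:", np.mean(y_true != majority))
```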

Performance of a model w.r.t. predictive capability: the confusion matrix.

                          PREDICTED CLASS
                          Class = Yes (1)   Class = No (0)
ACTUAL   Class = Yes (1)  201 (TP)          85 (FN)
CLASS    Class = No (0)   25 (FP)           2689 (TN)

Performance metrics:

Error rate = (no. of wrong predictions) / (total no. of predictions)
           = (FP + FN) / (TP + TN + FP + FN) = (25 + 85) / 3000 = 3.67%

Accuracy = (no. of correct predictions) / (total no. of predictions)
         = (TP + TN) / (TP + TN + FP + FN) = (201 + 2689) / 3000 = 96.33%
         = 1 - (error rate)


Consider a 2-class problem: no. of class 0 = 9990; no. of class 1 = 10. If a model predicts everything to be class 0, accuracy = 9990/10000 = 99.9%.

Accuracy is misleading because the model does not detect any class 1 object! Accuracy may not be well suited for evaluating models derived from imbalanced data sets.

Often a correct classification of the rare class (class 1) has a greater value than a correct classification of the majority class. In other words, the misclassification cost is asymmetric: FP (or FN) is acceptable, but FN (or FP) must not be allowed.

Examples: tax fraud, identity theft, response to promotions, network intrusion, predicting flight delay, etc.

In such cases, we want to tolerate greater overall error (reduced accuracy) in return for better classifying the important class.

+: rare but more important; -: majority but less important.

TPR (sensitivity) = TP/(TP+FN) = % of + class correctly classified
TNR (specificity) = TN/(FP+TN) = % of - class correctly classified
FPR = FP/(FP+TN) = 1 - TNR
FNR = FN/(TP+FN) = 1 - TPR

Oversample the important class for training (but do not do so for validation/testing).

           PREDICTED CLASS
           +          -
ACTUAL  +  f++ (TP)   f+- (FN)
CLASS   -  f-+ (FP)   f-- (TN)
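A quick sketch computing these four rates, reusing the counts from the earlier confusion matrix as illustrative values:

```python
# Sensitivity/specificity from confusion-matrix counts.
TP, FN, FP, TN = 201, 85, 25, 2689   # counts from the earlier example

tpr = TP / (TP + FN)   # sensitivity: % of + correctly classified
tnr = TN / (FP + TN)   # specificity: % of - correctly classified
fpr = FP / (FP + TN)   # = 1 - TNR
fnr = FN / (TP + FN)   # = 1 - TPR
print(f"TPR={tpr:.3f} TNR={tnr:.3f} FPR={fpr:.3f} FNR={fnr:.3f}")
```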


+: rare but important; -: less important.

Ctotal(M) = TP*C(+,+) + FP*C(-,+) + FN*C(+,-) + TN*C(-,-)

C(i,j) (or C(j|i)): the cost of (mis)classifying a class i object as class j.

For a symmetric 0/1 cost matrix (C(+,+) = C(-,-) = 0, C(+,-) = C(-,+) = 1):
Ctotal(M) = FP + FN = n * (error rate)

Find a model that yields the lowest cost. If FN are the most costly, reduce the FN errors by extending the decision boundary toward the negative class to cover more positives, at the expense of generating additional false alarms (FP).

Cost matrix:

           PREDICTED CLASS
C(i,j)     +             -
ACTUAL  +  C(+,+) (TP)   C(+,-) (FN)
CLASS   -  C(-,+) (FP)   C(-,-) (TN)

Computing the cost of classification:

Cost Matrix:

           PREDICTED CLASS
C(i,j)     +      -
ACTUAL  +  -1     100
CLASS   -  1      0

Model M1 (or Attr. A1):

           PREDICTED CLASS
           +      -
ACTUAL  +  150    40
CLASS   -  60     250

Accuracy = 400/500 = 80%
Cost = 3910

Model M2 (or Attr. A2):

           PREDICTED CLASS
           +      -
ACTUAL  +  250    45
CLASS   -  5      200

Accuracy = 450/500 = 90%
Cost = 4255 (larger due to more FN)

Select M1 (or A1): despite its lower accuracy, it has the lower total cost.
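As a worked check (a sketch, not from the slides) that plugs both models' counts into the Ctotal formula and reproduces the figures above:

```python
# Total cost: Ctotal(M) = TP*C(+,+) + FP*C(-,+) + FN*C(+,-) + TN*C(-,-)
COST = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}

def total_cost(tp, fn, fp, tn):
    return (tp * COST[("+", "+")] + fn * COST[("+", "-")]
            + fp * COST[("-", "+")] + tn * COST[("-", "-")])

print("M1:", total_cost(tp=150, fn=40, fp=60, tn=250))  # -> 3910
print("M2:", total_cost(tp=250, fn=45, fp=5, tn=200))   # -> 4255
```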


For m classes, the confusion matrix has m rows and m columns.

Theoretically, there are m(m-1) misclassification costs, since any case could be misclassified in m-1 ways. That is practically too many to work with.

In a decision-making context, though, such complexity rarely arises: one class is usually of primary interest, so the classification may reduce to "important" vs. "unimportant".

Metrics for Performance Evaluation: How do we evaluate the performance of a model?

Methods for Performance Evaluation: How do we obtain reliable estimates of the metrics?

Methods for Model Comparison: How do we compare the relative performance among competing models?


Holdout
Reserve 2/3 for training and 1/3 for testing.
Drawbacks: fewer training records; highly dependent on the composition of the training/test sets; the training and test sets are no longer independent of each other.

Random subsampling
Repeat k holdouts; acc = (Σ acc_i) / k, where acc_i is the accuracy at the i-th iteration.
Drawback: cannot control how many times each record is used for testing and training.

Cross validation
Partition the data into k equal-sized disjoint subsets.
k-fold: train on k-1 partitions, test on the remaining one; repeat k times. Obtain the total error by summing up the errors of all k runs.
Leave-one-out: a special case where k = n; good for small samples. It utilizes as much data as possible for training, and the test sets are mutually exclusive, but it is computationally expensive and has high variance (only one record in each test set).
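As a sketch of k-fold cross validation (assuming scikit-learn; the data is a synthetic placeholder), note that every record lands in exactly one test fold:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)

# k-fold: train on k-1 partitions, test on the remaining one; repeat k times.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=cv)
print("per-fold accuracy:", np.round(scores, 3))
print("mean accuracy    :", scores.mean())
```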

Stratified sampling
For imbalanced classes, e.g., consider 100 + and 1000 -:
Undersampling for -: take a random sample of 100 -, or use focused undersampling (targeting underrepresented -).
Oversampling for +: replicate + until (no. of +) = (no. of -), or generate new + by interpolation; overfitting is possible.
Hybrid: combine undersampling and oversampling.

Bootstrap
The training set is composed by sampling with replacement (possible duplicates); the records never drawn can become part of the test set.
Good for small samples (like leave-one-out); low variance.
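A minimal sketch of the bootstrap split (NumPy only; the record count is a placeholder): draw n training indices with replacement, and let the never-drawn records form the test set:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20                                   # number of records (placeholder)
boot = rng.integers(0, n, size=n)        # training: n draws with replacement
test = np.setdiff1d(np.arange(n), boot)  # records never drawn -> test set
print("training indices (duplicates possible):", np.sort(boot))
print("test indices (the rest):", test)
```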


Metrics for Performance Evaluation: How do we evaluate the performance of a model?

Methods for Performance Evaluation: How do we obtain reliable estimates of the metrics?

Methods for Model Comparison: How do we compare the relative performance among competing models?

ROC (Receiver Operating Characteristic) curves were developed in the 1950s for signal detection theory, to analyze noisy signals; they characterize the trade-off between positive hits and false alarms.

An ROC curve plots TPR (on the y-axis) against FPR (on the x-axis).

The performance of each classifier is represented as a point on the ROC curve: changing the algorithm's threshold, the sample distribution, or the cost matrix changes the location of the point.


An ROC curve traces (TPR, FPR) along cutoff values from 0 to 1:
(0,0): the model predicts everything to be - (cutoff = 1).
(1,1): the model predicts everything to be + (cutoff = 0).
(1,0): the ideal model (hitting the upper-left corner of the plot; area under the ROC = 1).

Diagonal line: random guessing (the naïve classifier). Classifying as + with a fixed probability p gives TPR (= p·n+/n+) = FPR (= p·n-/n-) = p.

Below the diagonal line: prediction is worse than guessing!

Comparing two models M1 vs. M2 whose curves cross: M1 is better for small FPR; M2 is better for large FPR.

Area under the ROC curve (AUC): ideal model: AUC = 1; random guessing: AUC = 0.5. The larger the AUC, the better the model.

To construct an ROC curve:

Apply the classifier to each test instance to produce its posterior probability of being +, P(+).
Sort the instances in increasing order of P(+).
Apply a cutoff at each unique value of P(+): assign + to instances with P(+) ≥ the cutoff and - to instances below it. Initially (cutoff at the lowest P(+)) TPR = FPR = 1.
Count the number of TP, FP, TN, FN at each cutoff.
Increase the cutoff to the next higher value; repeat until the highest.
Plot TPR against FPR.

Example test instances:

Instance  P(+)   True Class
1         0.95   +
2         0.93   +
3         0.87   -
4         0.85   +
5         0.85   -
6         0.85   -
7         0.76   -
8         0.53   +
9         0.43   -
10        0.25   +


Cutoff table (instances sorted by increasing P(+); each column is one cutoff) and the resulting ROC curve of (FPR, TPR) points:

Class  +     -     +     -     -     -     +     -     +     +
P      0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP     5     4     4     3     3     3     3     2     2     1     0
FP     5     5     4     4     3     2     1     1     0     0     0
TN     0     0     1     1     2     3     4     4     5     5     5
FN     0     1     1     2     2     2     2     3     3     4     5
TPR    1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR    1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0
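The sweep above can be reproduced in a few lines (a sketch using only NumPy, with the P(+) values and labels from the instance table):

```python
import numpy as np

p = np.array([0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25])
y = np.array([1, 1, 0, 1, 0, 0, 0, 1, 0, 1])   # 1 = +, 0 = -
n_pos, n_neg = y.sum(), (1 - y).sum()

# Sweep the cutoff over each unique P(+) value, plus 1.0 at the end.
for cutoff in sorted(set(p)) + [1.0]:
    pred = (p >= cutoff).astype(int)            # + if P(+) >= cutoff
    tp = ((pred == 1) & (y == 1)).sum()
    fp = ((pred == 1) & (y == 0)).sum()
    print(f"cutoff={cutoff:.2f}  TPR={tp / n_pos:.1f}  FPR={fp / n_neg:.1f}")
```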

Given two models:
Model M1: accuracy = 85%, tested on 30 instances.
Model M2: accuracy = 75%, tested on 5000 instances.
Can we say M1 is better than M2?

Estimate confidence intervals for accuracy. Each prediction can be regarded as a Bernoulli trial (2 possible outcomes), so the number of correct predictions follows a binomial distribution with parameter p (the true accuracy). For large test sets, the empirical accuracy is approximately normal, acc ~ N(p, p(1-p)/N), so

$$P\left(-Z_{\alpha/2} \le \frac{acc - p}{\sqrt{p(1-p)/N}} \le Z_{\alpha/2}\right) = 1 - \alpha$$

and the confidence interval for the true accuracy p is

$$p = \frac{2 \cdot N \cdot acc + Z_{\alpha/2}^{2} \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^{2} + 4 \cdot N \cdot acc - 4 \cdot N \cdot acc^{2}}}{2\left(N + Z_{\alpha/2}^{2}\right)}$$

To compare the performance of two models, test statistical significance with a Z- or t-test: H0: d = e1 - e2 = 0 vs. H1: d ≠ 0. See Section 4.6 in Tan et al. (2006) for more details.
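As a worked sketch (my own arithmetic, obtained by plugging M1 and M2 into the interval formula above with Z_{α/2} = 1.96 for a 95% interval):

```python
from math import sqrt

def acc_interval(acc, n, z=1.96):
    """95% confidence interval for the true accuracy p (binomial-based)."""
    center = 2 * n * acc + z**2
    spread = z * sqrt(z**2 + 4 * n * acc - 4 * n * acc**2)
    denom = 2 * (n + z**2)
    return (center - spread) / denom, (center + spread) / denom

print("M1:", acc_interval(0.85, 30))    # roughly (0.68, 0.94)
print("M2:", acc_interval(0.75, 5000))  # roughly (0.74, 0.76)
```

The small test set makes M1's interval far wider than M2's, and the two overlap heavily, so M1's higher point estimate alone does not establish that it is the better model.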


Generalization: a good classification model must not only fit the training data well but also accurately classify unseen records (test/new data).

Overfitting: a model that fits the training data too well can have poorer generalization than a model with a higher training error.

Underfitting: when a model is too simple, both the training and test errors are large (the model has yet to learn the data).

Overfitting: once the tree becomes too large, its test error begins to increase while its training error continues to decrease.
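This effect can be reproduced with a short sketch (assuming scikit-learn; the noisy synthetic data and depth grid are my choices, not the slides'): as max_depth grows, the training error keeps falling while the test error eventually turns back up:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
y ^= rng.random(600) < 0.15              # label noise invites overfitting

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
for depth in (1, 2, 4, 8, 16, None):     # None = grow the tree fully
    t = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"depth={depth}: train err={1 - t.score(X_tr, y_tr):.3f}, "
          f"test err={1 - t.score(X_te, y_te):.3f}")
```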


Overfitting due to noise: the decision boundary is distorted by a (mislabeled) noise point that should have been ignored by the decision tree.

Overfitting due to insufficient examples: a lack of data points makes it difficult to predict the class labels correctly. The decision boundary is determined by only the few records falling in a region, and the insufficient number of training records there causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.


Overfitting results in decision trees that are more complex than necessary, and the chance of overfitting increases as the model becomes more complex. The training error then no longer provides a good estimate of how well the tree will perform on previously unseen records, so we need new ways of estimating generalization errors.

Occam's Razor: given two models with similar generalization errors, one should prefer the simpler model over the more complex one. For a complex model, there is a greater chance that it was fitted by chance or by noise in the data, and/or that it overfits the data. Therefore, one should include model complexity when evaluating a model, e.g., by reducing the number of nodes in a decision tree (pruning).