Classification 021411


Transcript of Classification 021411


Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

    Lecture Notes for Chapter 4

    Introduction to Data Mining

    by

    Tan, Steinbach, Kumar



    Classification: Definition

Given a collection of records (training set), each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label.

x: attribute, predictor, independent variable, input
y: class, response, dependent variable, output

Task: Learn a model that maps each attribute set x into one of the predefined class labels y.


    Examples of Classification Task

Task                        | Attribute set, x                                          | Class label, y
Categorizing email messages | Features extracted from email message header and content | spam or non-spam
Identifying tumor cells     | Features extracted from MRI scans                         | malignant or benign cells
Cataloging galaxies         | Features extracted from telescope images                  | elliptical, spiral, or irregular-shaped galaxies


General Approach for Building Classification Model

[Figure: a training set is used to learn a model (Learn Model), which is then applied to a test set with unknown class labels (Apply Model).]

Training set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1   | Yes     | Large   | 125K    | No
2   | No      | Medium  | 100K    | No
3   | No      | Small   | 70K     | No
4   | Yes     | Medium  | 120K    | No
5   | No      | Large   | 95K     | Yes
6   | No      | Medium  | 60K     | No
7   | Yes     | Large   | 220K    | No
8   | No      | Small   | 85K     | Yes
9   | No      | Medium  | 75K     | No
10  | No      | Small   | 90K     | Yes

Test set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No      | Small   | 55K     | ?
12  | Yes     | Medium  | 80K     | ?
13  | Yes     | Large   | 110K    | ?
14  | No      | Small   | 95K     | ?
15  | No      | Large   | 67K     | ?


    Classification Techniques

Base Classifiers:
- Decision Tree based Methods
- Rule-based Methods
- Nearest-neighbor
- Neural Networks
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines

Ensemble Classifiers:
- Boosting, Bagging, Random Forests


    Example of a Decision Tree

Training data:

ID | Home Owner | Marital Status | Annual Income | Defaulted Borrower
1  | Yes        | Single         | 125K          | No
2  | No         | Married        | 100K          | No
3  | No         | Single         | 70K           | No
4  | Yes        | Married        | 120K          | No
5  | No         | Divorced       | 95K           | Yes
6  | No         | Married        | 60K           | No
7  | Yes        | Divorced       | 220K          | No
8  | No         | Single         | 85K           | Yes
9  | No         | Married        | 75K           | No
10 | No         | Single         | 90K           | Yes

Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income):
- Home Owner = Yes -> NO
- Home Owner = No, MarSt = Married -> NO
- Home Owner = No, MarSt = Single or Divorced, Income < 80K -> NO
- Home Owner = No, MarSt = Single or Divorced, Income > 80K -> YES
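Read as code, the tree on this slide is just a chain of nested tests. The sketch below is illustrative only; the dict keys and writing the 80K cut-off as a plain numeric comparison are assumptions, not part of the slide.

```python
def classify(record):
    """The decision tree above written as nested if/else tests.
    `record` is assumed to be a dict with keys 'HomeOwner',
    'MaritalStatus' and 'AnnualIncome' (income in thousands)."""
    if record['HomeOwner'] == 'Yes':
        return 'No'                                   # does not default
    if record['MaritalStatus'] == 'Married':
        return 'No'
    # Single or Divorced: split on Annual Income at 80K
    return 'No' if record['AnnualIncome'] < 80 else 'Yes'

# Record 8 of the training data (No, Single, 85K) is classified as a defaulter
print(classify({'HomeOwner': 'No', 'MaritalStatus': 'Single', 'AnnualIncome': 85}))  # Yes
```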


    Another Example of Decision Tree

Model: Decision Tree (splitting on MarSt first):
- MarSt = Married -> NO
- MarSt = Single or Divorced, Home Owner = Yes -> NO
- MarSt = Single or Divorced, Home Owner = No, Income < 80K -> NO
- MarSt = Single or Divorced, Home Owner = No, Income > 80K -> YES

There could be more than one tree that fits the same data!



    Decision Tree Induction

Many algorithms:
- Hunt's Algorithm (one of the earliest)
- CART
- ID3, C4.5
- SLIQ, SPRINT


General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach a node t.

General Procedure:
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset (see the sketch below).

[Figure: a node t with its training subset Dt being split into child subsets, illustrated on the loan training data from the earlier slides.]
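A minimal sketch of the recursion described above. The choice of splitting attribute (simply the next available one) and the majority-class fallback are simplifying assumptions for illustration, not part of the slide.

```python
from collections import Counter

def hunt(records, attributes):
    """Sketch of Hunt's algorithm. `records` is a list of
    (attribute_dict, label) pairs; `attributes` lists the attributes
    still available for testing."""
    labels = [y for _, y in records]
    # Case 1: all records belong to the same class -> leaf node
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to test -> label with the majority class (assumption)
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Case 2: pick an attribute test (here: simply the first remaining one),
    # split the records on its values, and recurse on each subset
    attr = attributes[0]
    partitions = {}
    for x, y in records:
        partitions.setdefault(x[attr], []).append((x, y))
    return {'attribute': attr,
            'children': {value: hunt(subset, attributes[1:])
                         for value, subset in partitions.items()}}
```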


Hunt's Algorithm

[Figure: step-by-step growth of the decision tree by Hunt's algorithm on the loan training data (ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower) shown on the earlier slides.]


    How to determine the Best Split

Before splitting: 10 records of class 0, 10 records of class 1.
Which test condition is the best?


    How to determine the Best Split

Greedy approach: nodes with purer class distribution are preferred.

    Need a measure of node impurity:

    High degree of impurity Low degree of impurity


    Measures of Node Impurity

    Gini Index

    Entropy

    Misclassification error

GINI(t) = 1 - Σj [p(j|t)]²

Entropy(t) = -Σj p(j|t) log p(j|t)

Error(t) = 1 - max_i P(i|t)
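All three measures can be computed from the class counts at a node; a small sketch (the function names are mine, not from the slides):

```python
import math

def gini(counts):
    """Gini index: 1 minus the sum of squared class frequencies at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy: -sum of p*log2(p) over classes with nonzero counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def misclassification_error(counts):
    """Classification error: 1 minus the frequency of the majority class."""
    n = sum(counts)
    return 1.0 - max(counts) / n

# Example: node with 3 records of C1 and 3 of C2
print(gini([3, 3]), entropy([3, 3]), misclassification_error([3, 3]))
# -> 0.5, 1.0, 0.5
```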


    Comparison among Impurity Measures

    For a 2-class problem:


    Measure of Impurity: GINI

Gini Index for a given node t:
(NOTE: p(j|t) is the relative frequency of class j at node t.)

Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information.
Minimum (0.0) when all records belong to one class, implying most interesting information.

GINI(t) = 1 - Σj [p(j|t)]²

C1 = 0, C2 = 6: Gini = 0.000
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
C1 = 1, C2 = 5: Gini = 0.278


Binary Attributes: Computing Gini Index

Splits into two partitions. Effect of weighting partitions: larger and purer partitions are sought.

Split on B? (Yes -> Node N1, No -> Node N2)

Parent: C1 = 6, C2 = 6, Gini = 0.500

     | N1 | N2
C1   | 5  | 2
C2   | 1  | 4
Gini (children) = 0.361

Gini(N1) = 1 - (5/6)² - (1/6)² = 0.278
Gini(N2) = 1 - (2/6)² - (4/6)² = 0.444
Gini(Children) = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
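The weighted computation on this slide, as a small sketch that reproduces the numbers above:

```python
def gini(counts):
    """Gini index from a list of class counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split: each child node contributes in
    proportion to the number of records it receives."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# The split on attribute B from the slide: N1 = (5, 1), N2 = (2, 4)
print(round(gini([5, 1]), 3))                   # 0.278
print(round(gini([2, 4]), 3))                   # 0.444
print(round(gini_split([[5, 1], [2, 4]]), 3))   # 0.361
```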


    Categorical Attributes: Computing Gini Index

For each distinct value, gather counts for each class in the dataset.
Use the count matrix to make decisions.

Two-way split (find the best partition of values):

CarType | {Sports, Luxury} | {Family}
C1      | 9                | 1
C2      | 7                | 3
Gini = 0.468

CarType | {Sports} | {Family, Luxury}
C1      | 8        | 2
C2      | 0        | 10
Gini = 0.167

Multi-way split:

CarType | Family | Sports | Luxury
C1      | 1      | 8      | 1
C2      | 3      | 0      | 7
Gini = 0.163
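For a two-way split, the best grouping of values can be found by brute force over all partitions. The sketch below (function names are mine) reproduces the 0.167 value for the {Sports} vs {Family, Luxury} grouping:

```python
from itertools import combinations

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_binary_partition(count_matrix):
    """Search every two-way grouping of a categorical attribute's values
    and return the one with the lowest weighted Gini.
    `count_matrix` maps value -> [count of C1, count of C2]."""
    values = list(count_matrix)
    best = None
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            sides = []
            for group in (left, right):
                c1 = sum(count_matrix[v][0] for v in group)
                c2 = sum(count_matrix[v][1] for v in group)
                sides.append([c1, c2])
            total = sum(sum(s) for s in sides)
            score = sum(sum(s) / total * gini(s) for s in sides)
            if best is None or score < best[0]:
                best = (score, left, right)
    return best

# Counts from the slide: Family (1, 3), Sports (8, 0), Luxury (1, 7)
print(best_binary_partition({'Family': [1, 3], 'Sports': [8, 0], 'Luxury': [1, 7]}))
# -> (0.1666..., ('Sports',), ('Family', 'Luxury'))
```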


    Decision Tree Based Classification

Advantages:
- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Accuracy is comparable to other classification techniques for many simple data sets


    Rule-Based Classifier

Classify records by using a collection of if...then rules.

Name          | Blood Type | Give Birth | Can Fly | Live in Water | Class
human         | warm       | yes        | no      | no            | mammals
python        | cold       | no         | no      | no            | reptiles
salmon        | cold       | no         | no      | yes           | fishes
whale         | warm       | yes        | no      | yes           | mammals
frog          | cold       | no         | no      | sometimes     | amphibians
komodo        | cold       | no         | no      | no            | reptiles
bat           | warm       | yes        | yes     | no            | mammals
pigeon        | warm       | no         | yes     | no            | birds
cat           | warm       | yes        | no      | no            | mammals
leopard shark | cold       | yes        | no      | yes           | fishes
turtle        | cold       | no         | no      | sometimes     | reptiles
penguin       | warm       | no         | no      | sometimes     | birds
porcupine     | warm       | yes        | no      | no            | mammals
eel           | cold       | no         | no      | yes           | fishes
salamander    | cold       | no         | no      | sometimes     | amphibians
gila monster  | cold       | no         | no      | no            | reptiles
platypus      | warm       | no         | no      | no            | mammals
owl           | warm       | no         | yes     | no            | birds
dolphin       | warm       | yes        | no      | yes           | mammals
eagle         | warm       | no         | yes     | no            | birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
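Applied in order, the rules above behave like a chain of if statements; a sketch (the dict-based record format is an assumption):

```python
def classify_vertebrate(animal):
    """Rules R1-R5 from this slide, applied in order.
    `animal` is assumed to be a dict keyed by the table's attribute names."""
    if animal['Give Birth'] == 'no' and animal['Can Fly'] == 'yes':
        return 'Birds'          # R1
    if animal['Give Birth'] == 'no' and animal['Live in Water'] == 'yes':
        return 'Fishes'         # R2
    if animal['Give Birth'] == 'yes' and animal['Blood Type'] == 'warm':
        return 'Mammals'        # R3
    if animal['Give Birth'] == 'no' and animal['Can Fly'] == 'no':
        return 'Reptiles'       # R4
    if animal['Live in Water'] == 'sometimes':
        return 'Amphibians'     # R5
    return None                 # no rule fires

# salmon: cold-blooded, does not give birth, cannot fly, lives in water
print(classify_vertebrate({'Blood Type': 'cold', 'Give Birth': 'no',
                           'Can Fly': 'no', 'Live in Water': 'yes'}))  # Fishes
```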


    Nearest Neighbor Classifiers

Basic idea: If it walks like a duck and quacks like a duck, then it's probably a duck.

[Figure: compute the distance between the test record and the training records, then choose the k nearest records.]
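A minimal k-nearest-neighbor sketch along these lines, assuming numeric attributes and Euclidean distance (the toy data is made up for illustration):

```python
import math
from collections import Counter

def knn_classify(training, test_point, k=3):
    """Compute the distance from the test record to every training record,
    take the k closest, and let them vote on the class label.
    `training` is a list of (feature_vector, label) pairs."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(training, key=lambda rec: dist(rec[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy example with two numeric attributes
train = [((1.0, 1.0), 'duck'), ((1.2, 0.9), 'duck'), ((5.0, 5.0), 'goose')]
print(knn_classify(train, (1.1, 1.0), k=3))  # 'duck'
```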


    Bayes Classifier

    A probabilistic framework for solving classification problems

    Key idea is that certain attribute values are more likely

    (probable) for some classes than for others

    Example: Probability an individual is a male or female if the

    individual is wearing a dress

Conditional Probability:
P(Y|X) = P(X, Y) / P(X)
P(X|Y) = P(X, Y) / P(Y)

Bayes theorem:
P(Y|X) = P(X|Y) P(Y) / P(X)
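For the dress example, Bayes theorem combines the class-conditional probability with the class prior; the numbers below are assumed purely for illustration, not taken from the slides:

```python
def bayes_posterior(p_x_given_y, p_y, p_x_given_not_y):
    """Bayes theorem for a two-class problem:
    P(Y|X) = P(X|Y) P(Y) / P(X), where
    P(X) = P(X|Y) P(Y) + P(X|not Y) P(not Y)."""
    p_not_y = 1.0 - p_y
    p_x = p_x_given_y * p_y + p_x_given_not_y * p_not_y
    return p_x_given_y * p_y / p_x

# Illustrative (assumed) numbers: P(dress|female) = 0.3, P(female) = 0.5,
# P(dress|male) = 0.001 -> the posterior strongly favors female
print(bayes_posterior(0.3, 0.5, 0.001))   # ~0.997
```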


Evaluating Classifiers

Confusion Matrix:

                   PREDICTED CLASS
                   Class=Yes   Class=No
ACTUAL  Class=Yes  a           b
CLASS   Class=No   c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)


    Most widely-used metric:

                   PREDICTED CLASS
                   Class=Yes   Class=No
ACTUAL  Class=Yes  a (TP)      b (FN)
CLASS   Class=No   c (FP)      d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
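As a quick check of the formula (the counts below are illustrative, not from the slide):

```python
def accuracy(tp, fn, fp, tn):
    """Accuracy from the confusion matrix cells a=TP, b=FN, c=FP, d=TN."""
    return (tp + tn) / (tp + fn + fp + tn)

# Illustrative counts: 40 TP, 10 FN, 10 FP, 40 TN
print(accuracy(40, 10, 10, 40))  # 0.8
```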


Methods for Classifier Evaluation

Holdout: reserve k% for training and (100-k)% for testing
Random subsampling: repeated holdout
Cross validation:
- Partition data into k disjoint subsets
- k-fold: train on k-1 partitions, test on the remaining one
- Leave-one-out: k = n
Bootstrap: sampling with replacement
- .632 bootstrap: acc_boot = (1/b) Σi (0.632 × acc_i + 0.368 × acc_s)
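A generic k-fold cross-validation loop along the lines described above; the `train_fn` / `.predict` interface is an assumption made for the sketch:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Partition record indices 0..n-1 into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(records, labels, train_fn, k=5):
    """Each fold is used once for testing while the other k-1 folds
    train the model; returns the average test accuracy."""
    folds = k_fold_indices(len(records), k)
    accuracies = []
    for i in range(k):
        test_idx = set(folds[i])
        train_x = [records[j] for j in range(len(records)) if j not in test_idx]
        train_y = [labels[j] for j in range(len(records)) if j not in test_idx]
        model = train_fn(train_x, train_y)          # assumed to return an object
        correct = sum(model.predict(records[j]) == labels[j] for j in folds[i])
        accuracies.append(correct / len(folds[i]))
    return sum(accuracies) / k
```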


Problem with Accuracy

Consider a 2-class problem:
- Number of Class 0 examples = 9990
- Number of Class 1 examples = 10

If a model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%.
This is misleading because the model does not detect any class 1 example.
Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.)


    Example of classification accuracy measures

                   PREDICTED CLASS
                   Class=Yes   Class=No
ACTUAL  Class=Yes  35 (TP)     5 (FN)
CLASS   Class=No    5 (FP)     5 (TN)

Precision (p) = TP / (TP + FP)
Recall (r) = TP / (TP + FN)
F-measure (F) = 2rp / (r + p) = 2TP / (2TP + FP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy = 0.8
For Yes class: precision = 0.875, recall = 0.875, F-measure = 0.875
For No class: precision = 0.5, recall = 0.5, F-measure = 0.5
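The numbers on this slide can be reproduced from the formulas; the TP count is partly illegible in the transcript, so TP = 35 below is inferred from the reported accuracy of 0.8:

```python
def precision_recall_f1(tp, fn, fp, tn):
    """Precision, recall and F-measure for the positive class,
    computed from confusion matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * r * p / (r + p)          # equivalently 2*TP / (2*TP + FP + FN)
    return p, r, f

# Counts reconstructed from the slide: TP=35, FN=5, FP=5, TN=5
print(precision_recall_f1(35, 5, 5, 5))   # (0.875, 0.875, 0.875) for the Yes class
print(precision_recall_f1(5, 5, 5, 35))   # (0.5, 0.5, 0.5) treating No as positive
```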


    Example of classification accuracy measures

                   PREDICTED CLASS
                   Class=Yes   Class=No
ACTUAL  Class=Yes   9 (TP)     1 (FN)
CLASS   Class=No   10 (FP)     0 (TN)

Accuracy = 0.450
Sensitivity = 0.9
Specificity = 0.0


    ROC (Receiver Operating Characteristic)

A graphical approach for displaying the trade-off between detection rate and false alarm rate.
Developed in the 1950s for signal detection theory to analyze noisy signals.
The ROC curve plots True Positive Rate (TPR) against False Positive Rate (FPR).
Performance of a model is represented as a point on an ROC curve.
Changing the threshold parameter of the classifier changes the location of the point.

http://commonsenseatheism.com/wp-content/uploads/2011/01/Swets-Better-Decisions-Through-Science.pdf


ROC Curve

(TPR, FPR):
- (0, 0): declare everything to be negative class
- (1, 1): declare everything to be positive class
- (1, 0): ideal
- Diagonal line: random guessing
- Below diagonal line: prediction is opposite of the true class


    Using ROC for Model Comparison

No model consistently outperforms the other:
- M1 is better for small FPR
- M2 is better for large FPR

Area Under the ROC Curve (AUC):
- Ideal: Area = 1
- Random guess: Area = 0.5
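The area under the curve can be computed from the (FPR, TPR) points with the trapezoidal rule; a small sketch:

```python
def auc(points):
    """Area under an ROC curve given (FPR, TPR) points, via the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Diagonal line (random guessing) -> 0.5; ideal classifier -> 1.0
print(auc([(0, 0), (1, 1)]))             # 0.5
print(auc([(0, 0), (0, 1), (1, 1)]))     # 1.0
```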


    ROC Curve Example

- 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive
- At threshold t: TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88


    How to Construct an ROC curve

Instance | score(+|A) | True Class
1        | 0.95       | +
2        | 0.93       | +
3        | 0.87       | -
4        | 0.85       | -
5        | 0.85       | -
6        | 0.85       | +
7        | 0.76       | -
8        | 0.53       | +
9        | 0.43       | -
10       | 0.25       | +

Use a classifier that produces a continuous-valued output score(+|A) for each test instance A.
Sort the instances according to score(+|A) in decreasing order.
Apply a threshold at each unique value of score(+|A).
Count the number of TP, FP, TN, FN at each threshold:
- TPR = TP / (TP + FN)
- FPR = FP / (FP + TN)
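A sketch of this procedure on the ten instances listed above; it applies the threshold only at distinct score values, so the three instances tied at 0.85 contribute a single ROC point:

```python
def roc_points(scores, labels):
    """Sweep a threshold over the sorted scores: at each unique score,
    everything with score >= threshold is predicted positive, and
    (FPR, TPR) is recomputed."""
    pos = sum(1 for y in labels if y == '+')
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '+')
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '-')
        points.append((fp / neg, tp / pos))   # (FPR, TPR)
    return points

# Scores and classes from the slide's 10 test instances
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']
print(roc_points(scores, labels))
```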


    How to construct an ROC curve

Class        |       +     -     +     -     -     -     +     -     +     +
Threshold >= | 0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP           | 5     4     4     3     3     3     3     2     2     1     0
FP           | 5     5     4     4     3     2     1     1     0     0     0
TN           | 0     0     1     1     2     3     4     4     5     5     5
FN           | 0     1     1     2     2     2     2     3     3     4     5
TPR          | 1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR          | 1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve: [Figure: the ROC curve plotted from the (FPR, TPR) pairs above.]
