Classification 021411


Transcript of Classification 021411


Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

    Lecture Notes for Chapter 4

    Introduction to Data Mining

    by

    Tan, Steinbach, Kumar



    Classification: Definition

Given a collection of records (training set), each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label.

x: attribute, predictor, independent variable, input
y: class, response, dependent variable, output

Task: Learn a model that maps each attribute set x into one of the predefined class labels y.


    Examples of Classification Task

Task                        | Attribute set, x                                          | Class label, y
Categorizing email messages | Features extracted from email message header and content | spam or non-spam
Identifying tumor cells     | Features extracted from MRI scans                         | malignant or benign cells
Cataloging galaxies         | Features extracted from telescope images                  | elliptical, spiral, or irregular-shaped galaxies


General Approach for Building Classification Model

[Figure: a training set is used to learn a model (Learn Model), which is then applied to a test set with unknown class labels (Apply Model).]

Training set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1   | Yes     | Large   | 125K    | No
2   | No      | Medium  | 100K    | No
3   | No      | Small   | 70K     | No
4   | Yes     | Medium  | 120K    | No
5   | No      | Large   | 95K     | Yes
6   | No      | Medium  | 60K     | No
7   | Yes     | Large   | 220K    | No
8   | No      | Small   | 85K     | Yes
9   | No      | Medium  | 75K     | No
10  | No      | Small   | 90K     | Yes

Test set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No      | Small   | 55K     | ?
12  | Yes     | Medium  | 80K     | ?
13  | Yes     | Large   | 110K    | ?
14  | No      | Small   | 95K     | ?
15  | No      | Large   | 67K     | ?


    Classification Techniques

Base Classifiers:
- Decision Tree based Methods
- Rule-based Methods
- Nearest-neighbor
- Neural Networks
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines

Ensemble Classifiers:
- Boosting, Bagging, Random Forests


    Example of a Decision Tree

Training data:

ID | Home Owner | Marital Status | Annual Income | Defaulted Borrower
1  | Yes        | Single         | 125K          | No
2  | No         | Married        | 100K          | No
3  | No         | Single         | 70K           | No
4  | Yes        | Married        | 120K          | No
5  | No         | Divorced       | 95K           | Yes
6  | No         | Married        | 60K           | No
7  | Yes        | Divorced       | 220K          | No
8  | No         | Single         | 85K           | Yes
9  | No         | Married        | 75K           | No
10 | No         | Single         | 90K           | Yes

Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income):
- Home Owner = Yes -> NO
- Home Owner = No, MarSt = Married -> NO
- Home Owner = No, MarSt = Single or Divorced, Income < 80K -> NO
- Home Owner = No, MarSt = Single or Divorced, Income > 80K -> YES
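Read as code, the tree on this slide is just a chain of nested tests. The sketch below is illustrative only; the dict keys and writing the 80K cut-off as a plain numeric comparison are assumptions, not part of the slide.

```python
def classify(record):
    """The decision tree above written as nested if/else tests.
    `record` is assumed to be a dict with keys 'HomeOwner',
    'MaritalStatus' and 'AnnualIncome' (income in thousands)."""
    if record['HomeOwner'] == 'Yes':
        return 'No'                                   # does not default
    if record['MaritalStatus'] == 'Married':
        return 'No'
    # Single or Divorced: split on Annual Income at 80K
    return 'No' if record['AnnualIncome'] < 80 else 'Yes'

# Record 8 of the training data (No, Single, 85K) is classified as a defaulter
print(classify({'HomeOwner': 'No', 'MaritalStatus': 'Single', 'AnnualIncome': 85}))  # Yes
```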


    Another Example of Decision Tree

Model: Decision Tree (splitting on MarSt first):
- MarSt = Married -> NO
- MarSt = Single or Divorced, Home Owner = Yes -> NO
- MarSt = Single or Divorced, Home Owner = No, Income < 80K -> NO
- MarSt = Single or Divorced, Home Owner = No, Income > 80K -> YES

There could be more than one tree that fits the same data!



    Decision Tree Induction

Many algorithms:
- Hunt's Algorithm (one of the earliest)
- CART
- ID3, C4.5
- SLIQ, SPRINT


General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach a node t.

General Procedure:
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset (see the sketch below).

[Figure: a node t with its training subset Dt being split into child subsets, illustrated on the loan training data from the earlier slides.]
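A minimal sketch of the recursion described above. The choice of splitting attribute (simply the next available one) and the majority-class fallback are simplifying assumptions for illustration, not part of the slide.

```python
from collections import Counter

def hunt(records, attributes):
    """Sketch of Hunt's algorithm. `records` is a list of
    (attribute_dict, label) pairs; `attributes` lists the attributes
    still available for testing."""
    labels = [y for _, y in records]
    # Case 1: all records belong to the same class -> leaf node
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to test -> label with the majority class (assumption)
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Case 2: pick an attribute test (here: simply the first remaining one),
    # split the records on its values, and recurse on each subset
    attr = attributes[0]
    partitions = {}
    for x, y in records:
        partitions.setdefault(x[attr], []).append((x, y))
    return {'attribute': attr,
            'children': {value: hunt(subset, attributes[1:])
                         for value, subset in partitions.items()}}
```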


Hunt's Algorithm

[Figure: step-by-step growth of the decision tree by Hunt's algorithm on the loan training data (ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower) shown on the earlier slides.]


    How to determine the Best Split

Before splitting: 10 records of class 0, 10 records of class 1.
Which test condition is the best?


    How to determine the Best Split

Greedy approach: nodes with purer class distribution are preferred.

    Need a measure of node impurity:

    High degree of impurity Low degree of impurity


    Measures of Node Impurity

    Gini Index

    Entropy

    Misclassification error

GINI(t) = 1 - Σj [p(j|t)]²

Entropy(t) = -Σj p(j|t) log p(j|t)

Error(t) = 1 - max_i P(i|t)
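All three measures can be computed from the class counts at a node; a small sketch (the function names are mine, not from the slides):

```python
import math

def gini(counts):
    """Gini index: 1 minus the sum of squared class frequencies at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy: -sum of p*log2(p) over classes with nonzero counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def misclassification_error(counts):
    """Classification error: 1 minus the frequency of the majority class."""
    n = sum(counts)
    return 1.0 - max(counts) / n

# Example: node with 3 records of C1 and 3 of C2
print(gini([3, 3]), entropy([3, 3]), misclassification_error([3, 3]))
# -> 0.5, 1.0, 0.5
```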


    Comparison among Impurity Measures

    For a 2-class problem:


    Measure of Impurity: GINI

Gini Index for a given node t:
(NOTE: p(j|t) is the relative frequency of class j at node t.)

Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information.
Minimum (0.0) when all records belong to one class, implying most interesting information.

GINI(t) = 1 - Σj [p(j|t)]²

C1 = 0, C2 = 6: Gini = 0.000
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
C1 = 1, C2 = 5: Gini = 0.278


Binary Attributes: Computing Gini Index

Splits into two partitions. Effect of weighting partitions: larger and purer partitions are sought.

Split on B? (Yes -> Node N1, No -> Node N2)

Parent: C1 = 6, C2 = 6, Gini = 0.500

     | N1 | N2
C1   | 5  | 2
C2   | 1  | 4
Gini (children) = 0.361

Gini(N1) = 1 - (5/6)² - (1/6)² = 0.278
Gini(N2) = 1 - (2/6)² - (4/6)² = 0.444
Gini(Children) = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
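The weighted computation on this slide, as a small sketch that reproduces the numbers above:

```python
def gini(counts):
    """Gini index from a list of class counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split: each child node contributes in
    proportion to the number of records it receives."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# The split on attribute B from the slide: N1 = (5, 1), N2 = (2, 4)
print(round(gini([5, 1]), 3))                   # 0.278
print(round(gini([2, 4]), 3))                   # 0.444
print(round(gini_split([[5, 1], [2, 4]]), 3))   # 0.361
```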


    Categorical Attributes: Computing Gini Index

For each distinct value, gather counts for each class in the dataset.
Use the count matrix to make decisions.

Two-way split (find the best partition of values):

CarType | {Sports, Luxury} | {Family}
C1      | 9                | 1
C2      | 7                | 3
Gini = 0.468

CarType | {Sports} | {Family, Luxury}
C1      | 8        | 2
C2      | 0        | 10
Gini = 0.167

Multi-way split:

CarType | Family | Sports | Luxury
C1      | 1      | 8      | 1
C2      | 3      | 0      | 7
Gini = 0.163
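For a two-way split, the best grouping of values can be found by brute force over all partitions. The sketch below (function names are mine) reproduces the 0.167 value for the {Sports} vs {Family, Luxury} grouping:

```python
from itertools import combinations

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_binary_partition(count_matrix):
    """Search every two-way grouping of a categorical attribute's values
    and return the one with the lowest weighted Gini.
    `count_matrix` maps value -> [count of C1, count of C2]."""
    values = list(count_matrix)
    best = None
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            sides = []
            for group in (left, right):
                c1 = sum(count_matrix[v][0] for v in group)
                c2 = sum(count_matrix[v][1] for v in group)
                sides.append([c1, c2])
            total = sum(sum(s) for s in sides)
            score = sum(sum(s) / total * gini(s) for s in sides)
            if best is None or score < best[0]:
                best = (score, left, right)
    return best

# Counts from the slide: Family (1, 3), Sports (8, 0), Luxury (1, 7)
print(best_binary_partition({'Family': [1, 3], 'Sports': [8, 0], 'Luxury': [1, 7]}))
# -> (0.1666..., ('Sports',), ('Family', 'Luxury'))
```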


    Decision Tree Based Classification

Advantages:
- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Accuracy is comparable to other classification techniques for many simple data sets


    Rule-Based Classifier

Classify records by using a collection of if...then rules.

Name          | Blood Type | Give Birth | Can Fly | Live in Water | Class
human         | warm       | yes        | no      | no            | mammals
python        | cold       | no         | no      | no            | reptiles
salmon        | cold       | no         | no      | yes           | fishes
whale         | warm       | yes        | no      | yes           | mammals
frog          | cold       | no         | no      | sometimes     | amphibians
komodo        | cold       | no         | no      | no            | reptiles
bat           | warm       | yes        | yes     | no            | mammals
pigeon        | warm       | no         | yes     | no            | birds
cat           | warm       | yes        | no      | no            | mammals
leopard shark | cold       | yes        | no      | yes           | fishes
turtle        | cold       | no         | no      | sometimes     | reptiles
penguin       | warm       | no         | no      | sometimes     | birds
porcupine     | warm       | yes        | no      | no            | mammals
eel           | cold       | no         | no      | yes           | fishes
salamander    | cold       | no         | no      | sometimes     | amphibians
gila monster  | cold       | no         | no      | no            | reptiles
platypus      | warm       | no         | no      | no            | mammals
owl           | warm       | no         | yes     | no            | birds
dolphin       | warm       | yes        | no      | yes           | mammals
eagle         | warm       | no         | yes     | no            | birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
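Applied in order, the rules above behave like a chain of if statements; a sketch (the dict-based record format is an assumption):

```python
def classify_vertebrate(animal):
    """Rules R1-R5 from this slide, applied in order.
    `animal` is assumed to be a dict keyed by the table's attribute names."""
    if animal['Give Birth'] == 'no' and animal['Can Fly'] == 'yes':
        return 'Birds'          # R1
    if animal['Give Birth'] == 'no' and animal['Live in Water'] == 'yes':
        return 'Fishes'         # R2
    if animal['Give Birth'] == 'yes' and animal['Blood Type'] == 'warm':
        return 'Mammals'        # R3
    if animal['Give Birth'] == 'no' and animal['Can Fly'] == 'no':
        return 'Reptiles'       # R4
    if animal['Live in Water'] == 'sometimes':
        return 'Amphibians'     # R5
    return None                 # no rule fires

# salmon: cold-blooded, does not give birth, cannot fly, lives in water
print(classify_vertebrate({'Blood Type': 'cold', 'Give Birth': 'no',
                           'Can Fly': 'no', 'Live in Water': 'yes'}))  # Fishes
```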


    Nearest Neighbor Classifiers

Basic idea: If it walks like a duck and quacks like a duck, then it's probably a duck.

[Figure: compute the distance between the test record and the training records, then choose the k nearest records.]
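A minimal k-nearest-neighbor sketch along these lines, assuming numeric attributes and Euclidean distance (the toy data is made up for illustration):

```python
import math
from collections import Counter

def knn_classify(training, test_point, k=3):
    """Compute the distance from the test record to every training record,
    take the k closest, and let them vote on the class label.
    `training` is a list of (feature_vector, label) pairs."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(training, key=lambda rec: dist(rec[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy example with two numeric attributes
train = [((1.0, 1.0), 'duck'), ((1.2, 0.9), 'duck'), ((5.0, 5.0), 'goose')]
print(knn_classify(train, (1.1, 1.0), k=3))  # 'duck'
```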


    Bayes Classifier

    A probabilistic framework for solving classification problems

    Key idea is that certain attribute values are more likely

    (probable) for some classes than for others

    Example: Probability an individual is a male or female if the

    individual is wearing a dress

Conditional Probability:
P(Y|X) = P(X, Y) / P(X)
P(X|Y) = P(X, Y) / P(Y)

Bayes theorem:
P(Y|X) = P(X|Y) P(Y) / P(X)
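For the dress example, Bayes theorem combines the class-conditional probability with the class prior; the numbers below are assumed purely for illustration, not taken from the slides:

```python
def bayes_posterior(p_x_given_y, p_y, p_x_given_not_y):
    """Bayes theorem for a two-class problem:
    P(Y|X) = P(X|Y) P(Y) / P(X), where
    P(X) = P(X|Y) P(Y) + P(X|not Y) P(not Y)."""
    p_not_y = 1.0 - p_y
    p_x = p_x_given_y * p_y + p_x_given_not_y * p_not_y
    return p_x_given_y * p_y / p_x

# Illustrative (assumed) numbers: P(dress|female) = 0.3, P(female) = 0.5,
# P(dress|male) = 0.001 -> the posterior strongly favors female
print(bayes_posterior(0.3, 0.5, 0.001))   # ~0.997
```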


Evaluating Classifiers

Confusion Matrix:

                   PREDICTED CLASS
                   Class=Yes   Class=No
ACTUAL  Class=Yes  a           b
CLASS   Class=No   c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)


    Most widely-used metric:

                   PREDICTED CLASS
                   Class=Yes   Class=No
ACTUAL  Class=Yes  a (TP)      b (FN)
CLASS   Class=No   c (FP)      d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
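As a quick check of the formula (the counts below are illustrative, not from the slide):

```python
def accuracy(tp, fn, fp, tn):
    """Accuracy from the confusion matrix cells a=TP, b=FN, c=FP, d=TN."""
    return (tp + tn) / (tp + fn + fp + tn)

# Illustrative counts: 40 TP, 10 FN, 10 FP, 40 TN
print(accuracy(40, 10, 10, 40))  # 0.8
```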


Methods for Classifier Evaluation

Holdout: reserve k% for training and (100-k)% for testing
Random subsampling: repeated holdout
Cross validation:
- Partition data into k disjoint subsets
- k-fold: train on k-1 partitions, test on the remaining one
- Leave-one-out: k = n
Bootstrap: sampling with replacement
- .632 bootstrap: acc_boot = (1/b) Σi (0.632 × acc_i + 0.368 × acc_s)
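A generic k-fold cross-validation loop along the lines described above; the `train_fn` / `.predict` interface is an assumption made for the sketch:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Partition record indices 0..n-1 into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(records, labels, train_fn, k=5):
    """Each fold is used once for testing while the other k-1 folds
    train the model; returns the average test accuracy."""
    folds = k_fold_indices(len(records), k)
    accuracies = []
    for i in range(k):
        test_idx = set(folds[i])
        train_x = [records[j] for j in range(len(records)) if j not in test_idx]
        train_y = [labels[j] for j in range(len(records)) if j not in test_idx]
        model = train_fn(train_x, train_y)          # assumed to return an object
        correct = sum(model.predict(records[j]) == labels[j] for j in folds[i])
        accuracies.append(correct / len(folds[i]))
    return sum(accuracies) / k
```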


Problem with Accuracy

Consider a 2-class problem:
- Number of Class 0 examples = 9990
- Number of Class 1 examples = 10

If a model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%.
This is misleading because the model does not detect any class 1 example.
Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.)


    Example of classification accuracy measures

                   PREDICTED CLASS
                   Class=Yes   Class=No
ACTUAL  Class=Yes  35 (TP)     5 (FN)
CLASS   Class=No    5 (FP)     5 (TN)

Precision (p) = TP / (TP + FP)
Recall (r) = TP / (TP + FN)
F-measure (F) = 2rp / (r + p) = 2TP / (2TP + FP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy = 0.8
For Yes class: precision = 0.875, recall = 0.875, F-measure = 0.875
For No class: precision = 0.5, recall = 0.5, F-measure = 0.5
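The numbers on this slide can be reproduced from the formulas; the TP count is partly illegible in the transcript, so TP = 35 below is inferred from the reported accuracy of 0.8:

```python
def precision_recall_f1(tp, fn, fp, tn):
    """Precision, recall and F-measure for the positive class,
    computed from confusion matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * r * p / (r + p)          # equivalently 2*TP / (2*TP + FP + FN)
    return p, r, f

# Counts reconstructed from the slide: TP=35, FN=5, FP=5, TN=5
print(precision_recall_f1(35, 5, 5, 5))   # (0.875, 0.875, 0.875) for the Yes class
print(precision_recall_f1(5, 5, 5, 35))   # (0.5, 0.5, 0.5) treating No as positive
```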


    Example of classification accuracy measures

                   PREDICTED CLASS
                   Class=Yes   Class=No
ACTUAL  Class=Yes   9 (TP)     1 (FN)
CLASS   Class=No   10 (FP)     0 (TN)

Accuracy = 0.450
Sensitivity = 0.9
Specificity = 0.0


    ROC (Receiver Operating Characteristic)

A graphical approach for displaying the trade-off between detection rate and false alarm rate.
Developed in the 1950s for signal detection theory to analyze noisy signals.
The ROC curve plots True Positive Rate (TPR) against False Positive Rate (FPR).
Performance of a model is represented as a point on an ROC curve.
Changing the threshold parameter of the classifier changes the location of the point.

http://commonsenseatheism.com/wp-content/uploads/2011/01/Swets-Better-Decisions-Through-Science.pdf


ROC Curve

(TPR, FPR):
- (0, 0): declare everything to be negative class
- (1, 1): declare everything to be positive class
- (1, 0): ideal
- Diagonal line: random guessing
- Below diagonal line: prediction is opposite of the true class


    Using ROC for Model Comparison

No model consistently outperforms the other:
- M1 is better for small FPR
- M2 is better for large FPR

Area Under the ROC Curve (AUC):
- Ideal: Area = 1
- Random guess: Area = 0.5
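The area under the curve can be computed from the (FPR, TPR) points with the trapezoidal rule; a small sketch:

```python
def auc(points):
    """Area under an ROC curve given (FPR, TPR) points, via the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Diagonal line (random guessing) -> 0.5; ideal classifier -> 1.0
print(auc([(0, 0), (1, 1)]))             # 0.5
print(auc([(0, 0), (0, 1), (1, 1)]))     # 1.0
```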


    ROC Curve Example

- 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive
- At threshold t: TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88


    How to Construct an ROC curve

Instance | score(+|A) | True Class
1        | 0.95       | +
2        | 0.93       | +
3        | 0.87       | -
4        | 0.85       | -
5        | 0.85       | -
6        | 0.85       | +
7        | 0.76       | -
8        | 0.53       | +
9        | 0.43       | -
10       | 0.25       | +

Use a classifier that produces a continuous-valued output score(+|A) for each test instance A.
Sort the instances according to score(+|A) in decreasing order.
Apply a threshold at each unique value of score(+|A).
Count the number of TP, FP, TN, FN at each threshold:
- TPR = TP / (TP + FN)
- FPR = FP / (FP + TN)
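A sketch of this procedure on the ten instances listed above; it applies the threshold only at distinct score values, so the three instances tied at 0.85 contribute a single ROC point:

```python
def roc_points(scores, labels):
    """Sweep a threshold over the sorted scores: at each unique score,
    everything with score >= threshold is predicted positive, and
    (FPR, TPR) is recomputed."""
    pos = sum(1 for y in labels if y == '+')
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '+')
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '-')
        points.append((fp / neg, tp / pos))   # (FPR, TPR)
    return points

# Scores and classes from the slide's 10 test instances
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']
print(roc_points(scores, labels))
```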


    How to construct an ROC curve

Class        |       +     -     +     -     -     -     +     -     +     +
Threshold >= | 0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP           | 5     4     4     3     3     3     3     2     2     1     0
FP           | 5     5     4     4     3     2     1     1     0     0     0
TN           | 0     0     1     1     2     3     4     4     5     5     5
FN           | 0     1     1     2     2     2     2     3     3     4     5
TPR          | 1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR          | 1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve: [Figure: the ROC curve plotted from the (FPR, TPR) pairs above.]
