Transcript of: Decision Trees - SLIQ, a fast scalable classifier

  • Slide 1

    Decision Trees

    SLIQ: a fast, scalable classifier
    Group 12 - Vaibhav Chopda, Tarun Bahadur

  • Slide 2

  • Slide 3

  • Slide 4

    Agenda

    What is classification?
    Why decision trees?
    The ID3 algorithm
    Limitations of the ID3 algorithm
    SLIQ: a fast scalable classifier for data mining
    SPRINT: the successor of SLIQ

  • Slide 5

    Classification Process: Model Construction

    Training Data

    NAME   RANK             YEARS   TENURED
    Mike   Assistant Prof   3       no
    Mary   Assistant Prof   7       yes
    Bill   Professor        2       yes
    Jim    Associate Prof   7       yes
    Dave   Assistant Prof   6       no
    Anne   Associate Prof   3       no

    Classification Algorithms

    IF rank = Professor OR years > 6 THEN tenured = yes

    Classifier (Model)

    CSE634 course notes, Prof. Anita Wasilewska

  • Slide 6

    Testing and Prediction (by a classifier)

    Classifier

    Testing Data

    NAME      RANK             YEARS   TENURED
    Tom       Assistant Prof   2       no
    Merlisa   Associate Prof   7       no
    George    Professor        5       yes
    Joseph    Assistant Prof   7       yes

    Unseen Data

    (Jeff, Professor, 4)

    Tenured?

    CSE634 course notes, Prof. Anita Wasilewska
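
    To make the testing and prediction steps concrete, here is a minimal Python sketch (not part of the original slides) that encodes the rule learned on the previous slide and applies it to the testing data and to the unseen record for Jeff; the function name is a hypothetical choice for illustration:

    # Rule from the model-construction slide:
    # IF rank = Professor OR years > 6 THEN tenured = yes
    def predict_tenured(rank, years):
        return "yes" if rank == "Professor" or years > 6 else "no"

    testing_data = [
        ("Tom", "Assistant Prof", 2, "no"),
        ("Merlisa", "Associate Prof", 7, "no"),
        ("George", "Professor", 5, "yes"),
        ("Joseph", "Assistant Prof", 7, "yes"),
    ]

    # Testing: compare the classifier's predictions against the known labels.
    for name, rank, years, actual in testing_data:
        print(name, predict_tenured(rank, years), "(actual:", actual + ")")

    # Prediction on unseen data: (Jeff, Professor, 4) -> 'yes'
    print("Jeff:", predict_tenured("Professor", 4))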

  • Slide 7

    Classification by Decision Tree Induction

    Decision tree (tuples flow along the tree structure):
    An internal node denotes an attribute.
    A branch represents a value of the node's attribute.
    Leaf nodes represent class labels or a class distribution.

    CSE634 course notes, Prof. Anita Wasilewska

  • Slide 8

    Classification by Decision Tree Induction

    Decision tree generation consists of two phases:

    Tree construction
    - At the start we choose one attribute as the root and put all its values as branches.
    - We then recursively choose internal nodes (attributes), with their proper values as branches.
    - We stop when:
      - all the samples (records) at a node are of the same class; the node then becomes a leaf labeled with that class,
      - or there are no samples left,
      - or there are no new attributes left to be put as nodes; in this case we apply MAJORITY VOTING to classify the node.

    Tree pruning
    - Identify and remove branches that reflect noise or outliers.

    CSE634 course notes, Prof. Anita Wasilewska

  • Slide 9

    Classification by Decision Tree Induction

    Where is the challenge?
    A good choice of the root attribute and a good choice of the internal node attributes is a crucial point.

    Decision tree induction algorithms differ in their methods of evaluating and choosing the root and internal node attributes.

    CSE634 course notes, Prof. Anita Wasilewska

  • Slide 10

    Basic Idea of the ID3/C4.5 Algorithm
    - a greedy algorithm
    - constructs decision trees in a top-down, recursive, divide-and-conquer manner

    The tree STARTS as a single node (root) representing the whole training dataset (samples).
    IF the samples are ALL in the same class, THEN the node becomes a LEAF and is labeled with that class.
    OTHERWISE, the algorithm uses an entropy-based measure known as information gain as a heuristic for selecting the ATTRIBUTE that will BEST separate the samples into individual classes. This attribute becomes the node name (the test, or tree-split decision attribute).

    CSE634 course notes, Prof. Anita Wasilewska
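
    The following is a minimal Python sketch (not part of the original slides) of the entropy-based information-gain heuristic just described; the record format, a list of (attribute_dict, class_label) pairs, is an assumption made only for illustration:

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of a list of class labels.
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def information_gain(records, attribute):
        # Expected drop in entropy from splitting `records` on `attribute`.
        # Each record is an (attribute_dict, class_label) pair (assumed format).
        labels = [label for _, label in records]
        before = entropy(labels)
        after = 0.0
        for value in set(attrs[attribute] for attrs, _ in records):
            subset = [label for attrs, label in records if attrs[attribute] == value]
            after += len(subset) / len(records) * entropy(subset)
        return before - after

    # ID3 chooses, at each node, the attribute with the highest information gain.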

  • Slide 11

    Basic Idea of the ID3/C4.5 Algorithm (2)

    A branch is created for each value of the node attribute (and is labeled by this value - this is syntax), and the samples are partitioned accordingly (this is semantics; see the example which follows).

    The algorithm uses the same process recursively to form a decision tree at each partition. Once an attribute has occurred at a node, it need not be considered in any of that node's descendants.

    The recursive partitioning STOPS only when any one of the following conditions is true:
    - All records (samples) for the given node belong to the same class, or
    - There are no remaining attributes on which the records (samples) may be further partitioned; in this case we convert the given node into a LEAF and label it with the class in the majority among the samples (majority voting), or
    - There are no records (samples) left; a leaf is created with the majority vote for the training sample.

    A recursive sketch of this procedure follows below.

    CSE634 course notes, Prof. Anita Wasilewska
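
    Below is a minimal recursive sketch in Python (not part of the original slides) of the procedure just described; it assumes the hypothetical `information_gain` helper and the (attribute_dict, class_label) record format from the previous sketch are already defined:

    from collections import Counter

    def majority_class(records):
        # Majority voting over the class labels of the records.
        return Counter(label for _, label in records).most_common(1)[0][0]

    def build_tree(records, attributes):
        labels = [label for _, label in records]
        # Stop: all records belong to the same class -> leaf with that class.
        if len(set(labels)) == 1:
            return labels[0]
        # Stop: no remaining attributes -> leaf labeled by majority voting.
        if not attributes:
            return majority_class(records)
        # Choose the attribute with the highest information gain as the split attribute.
        best = max(attributes, key=lambda a: information_gain(records, a))
        remaining = [a for a in attributes if a != best]
        tree = {best: {}}
        # One branch per value of the chosen attribute; samples are partitioned accordingly.
        for value in set(attrs[best] for attrs, _ in records):
            subset = [(attrs, label) for attrs, label in records if attrs[best] == value]
            tree[best][value] = build_tree(subset, remaining)
        return tree

    # (If a branch received no records, the slide's rule is to create a leaf by majority
    # vote of the parent; here branch values are drawn from the records themselves, so
    # that case does not arise in this simplified sketch.)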

  • Slide 12

    Example from Professor Anita's slides: a sample training table with the attributes age, income, student and credit_rating.

  • Slide 13

    Shortcomings of ID3

    Scalability: it requires a lot of computation at every stage of decision tree construction.
    Scalability: it needs all the training data to be in memory.
    It does not suggest any standard splitting index for range (numeric) attributes.

  • Slide 14

    SLIQ - a decision tree classifier

    Features of SLIQ:
    Applies to both numerical and categorical attributes.
    Builds compact and accurate trees.
    Uses a pre-sorting technique in the tree-growing phase and an inexpensive pruning algorithm.
    Suitable for classification of large disk-resident datasets, independently of the number of classes, attributes and records.

  • Slide 15

    SLIQ Methodology:

    Start -> Generate an attribute list for each attribute -> Sort the attribute lists for NUMERIC attributes -> Create the decision tree by partitioning records -> End

  • Slide 16

    Example:

    Driver's Age   CarType   Class
    23             Family    HIGH
    17             Sports    HIGH
    43             Sports    HIGH
    68             Family    LOW
    32             Truck     LOW
    20             Family    HIGH

  • Slide 17

    Attribute listing phase:

    Age attribute list (Age, Class, RecId):
    23   HIGH   0
    17   HIGH   1
    43   HIGH   2
    68   LOW    3
    32   LOW    4
    20   HIGH   5

    Original records (RecId, Age, CarType, Class):
    0   23   Family   HIGH
    1   17   Sports   HIGH
    2   43   Sports   HIGH
    3   68   Family   LOW
    4   32   Truck    LOW
    5   20   Family   HIGH

    CarType attribute list (CarType, Class, RecId):
    Family   HIGH   0
    Sports   HIGH   1
    Sports   HIGH   2
    Family   LOW    3
    Truck    LOW    4
    Family   HIGH   5

    Age is a NUMERIC attribute; CarType is a CATEGORICAL attribute.

  • Slide 18

    Presorting phase:

    Age attribute list, sorted (Age, Class, RecId):
    17   HIGH   1
    20   HIGH   5
    23   HIGH   0
    32   LOW    4
    43   HIGH   2
    68   LOW    3

    CarType attribute list, unchanged (CarType, Class, RecId):
    Family   HIGH   0
    Sports   HIGH   1
    Sports   HIGH   2
    Family   LOW    3
    Truck    LOW    4
    Family   HIGH   5

    Only NUMERIC attribute lists are sorted; CATEGORICAL attribute lists need not be sorted.
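
    A minimal Python sketch (not part of the original slides) of the attribute-listing and presorting phases on this example; the list-of-tuples representation is an assumption made for illustration:

    # Example records: (RecId, Age, CarType, Class)
    records = [
        (0, 23, "Family", "HIGH"),
        (1, 17, "Sports", "HIGH"),
        (2, 43, "Sports", "HIGH"),
        (3, 68, "Family", "LOW"),
        (4, 32, "Truck", "LOW"),
        (5, 20, "Family", "HIGH"),
    ]

    # Attribute listing phase: one (attribute value, class, record id) list per attribute.
    age_list = [(age, cls, rid) for rid, age, car, cls in records]
    car_list = [(car, cls, rid) for rid, age, car, cls in records]

    # Presorting phase: only the NUMERIC attribute list is sorted, once, by attribute value.
    age_list.sort(key=lambda entry: entry[0])

    print(age_list)   # [(17, 'HIGH', 1), (20, 'HIGH', 5), (23, 'HIGH', 0), ...]
    print(car_list)   # the categorical list is left unsorted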

  • Slide 19

    Constructing the decision tree

  • Slide 20

    Constructing the decision tree

    (Block 20) For each leaf node being examined, the method determines a split test that best separates the records at the examined node, using the attribute lists (block 21).

    (Block 22) The records at the examined leaf node are partitioned according to the best split test at that node to form new leaf nodes, which are also child nodes of the examined node.

    The records at each new leaf node are checked at block 23 to see if they are of the same class. If this condition has not been achieved, the splitting process is repeated, starting with block 24, for each newly formed leaf node until each leaf node contains records from one class.

    In finding the best split test (or split point) at a leaf node, a splitting index corresponding to a criterion used for splitting the records may be used to help evaluate possible splits. This splitting index indicates how well the criterion separates the record classes. The splitting index is preferably a gini index.

  • Slide 21

    Gini Index

    The gini index is used to evaluate the goodness of the alternative splits for an attribute.

    If a data set T contains examples from n classes, gini(T) is defined as

        gini(T) = 1 - sum_{j=1..n} (pj)^2

    where pj is the relative frequency of class j in the data set T.

    After splitting T into two subsets T1 and T2, of sizes n1 and n2 (n = n1 + n2), the gini index of the split data is defined as

        gini_split(T) = (n1/n) * gini(T1) + (n2/n) * gini(T2)
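
    A minimal Python sketch of these two definitions (not part of the original slides); representing each partition by a dictionary of class counts is an assumption made for illustration:

    def gini(class_counts):
        # gini(T) = 1 - sum_j pj^2, with pj the relative frequency of class j in T.
        n = sum(class_counts.values())
        return 1.0 - sum((c / n) ** 2 for c in class_counts.values()) if n else 0.0

    def gini_split(counts_t1, counts_t2):
        # gini_split(T) = (n1/n) * gini(T1) + (n2/n) * gini(T2).
        n1, n2 = sum(counts_t1.values()), sum(counts_t2.values())
        n = n1 + n2
        return (n1 / n) * gini(counts_t1) + (n2 / n) * gini(counts_t2)

    # Example: the split Age <= 27.5 on the six driver records gives
    # T1 = {HIGH: 3} and T2 = {HIGH: 1, LOW: 2}.
    print(gini_split({"HIGH": 3}, {"HIGH": 1, "LOW": 2}))   # ~0.222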

  • Slide 22

    Gini Index: the preferred splitting index

    Advantage of the gini index: its calculation requires only the distribution of the class values in each record partition.

    To find the best split point for a node, the node's attribute lists are scanned to evaluate the possible splits for the attributes. The attribute containing the split point with the lowest value of the gini index is used for splitting the node's records.

    The splitting test is shown on the next slides; its flow chart fits into block 21 of the decision tree construction.
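
    A self-contained Python sketch (not part of the original slides) of how a pre-sorted numeric attribute list can be scanned once to find the split point with the lowest gini index, using the example Age list:

    from collections import Counter

    def gini(counts):
        n = sum(counts.values())
        return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

    def best_numeric_split(sorted_attr_list):
        # Scan a pre-sorted (value, class, rec_id) attribute list once, maintaining
        # class histograms below and above each candidate split point, and return
        # the (gini_split, split point) pair with the lowest gini index.
        total = Counter(cls for _, cls, _ in sorted_attr_list)
        below = Counter()
        n = len(sorted_attr_list)
        best = (float("inf"), None)
        for i, (value, cls, _) in enumerate(sorted_attr_list[:-1]):
            below[cls] += 1
            above = total - below
            n1 = i + 1
            g = (n1 / n) * gini(below) + ((n - n1) / n) * gini(above)
            midpoint = (value + sorted_attr_list[i + 1][0]) / 2
            if g < best[0]:
                best = (g, midpoint)
        return best

    age_list = [(17, "HIGH", 1), (20, "HIGH", 5), (23, "HIGH", 0),
                (32, "LOW", 4), (43, "HIGH", 2), (68, "LOW", 3)]
    print(best_numeric_split(age_list))   # lowest gini (~0.222) at the split Age <= 27.5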

  • Slide 23

  • Slide 24

    Numeric attributes: splitting index

  • Slide 25

    Splitting for categorical attributes

  • Slide 26

    Determining the subset of highest index

    The logic of finding the best subset substitutes for block 39.

    A greedy algorithm may be used here, as sketched below.
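
    For a categorical attribute the candidate splits have the form "value in S", where S is a subset of the attribute's values; since trying every subset is exponential in the number of values, a greedy search can be used, as the slide notes. A self-contained Python sketch (not part of the original slides), run on the CarType list from the example:

    from collections import Counter

    def gini(counts):
        n = sum(counts.values())
        return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

    def gini_split(in_counts, out_counts):
        n1, n2 = sum(in_counts.values()), sum(out_counts.values())
        n = n1 + n2
        return (n1 / n) * gini(in_counts) + (n2 / n) * gini(out_counts)

    def greedy_best_subset(attr_list):
        # Greedily grow the subset S for the split "value in S": start with S empty
        # and keep adding the value that lowers the gini index of the split the most,
        # stopping as soon as no addition improves it.
        per_value = {}
        for value, cls, _ in attr_list:
            per_value.setdefault(value, Counter())[cls] += 1
        total = Counter(cls for _, cls, _ in attr_list)

        subset, in_counts = set(), Counter()
        best_score = float("inf")
        improved = True
        while improved:
            improved = False
            for value in set(per_value) - subset:
                trial_in = in_counts + per_value[value]
                score = gini_split(trial_in, total - trial_in)
                if score < best_score:
                    best_score, best_value = score, value
                    improved = True
            if improved:
                subset.add(best_value)
                in_counts += per_value[best_value]
        return subset, best_score

    car_list = [("Family", "HIGH", 0), ("Sports", "HIGH", 1), ("Sports", "HIGH", 2),
                ("Family", "LOW", 3), ("Truck", "LOW", 4), ("Family", "HIGH", 5)]
    print(greedy_best_subset(car_list))   # on this data: ({'Truck'}, ~0.267)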

  • Slide 27

    The decision tree being constructed (level 0)

  • Slide 28

    Decision tree (level 1)

  • Slide 29

    The classification at level 1

  • Slide 30

    Performance:

  • Slide 31

    Performance: Classification Accuracy

  • Slide 32

    Performance: Decision Tree Size

  • Slide 33

    Performance: Execution Time

  • Slide 34

    Performance: Scalability

  • Slide 35

    Conclusion:

  • Slide 36

    THANK YOU !!!