PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Page 1: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Rajeev Rastogi, Kyuseok Shim

Presented by: Alon Keinan

Page 2: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Presentation layout

• Introduction: Classification and Decision Trees

• Decision Tree Building Algorithms

• SPRINT & MDL

• PUBLIC

• Performance Comparison

• Conclusions

Page 3: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Introduction: Classification

• Classification in data mining:
  – Training sample set
  – Classifying future records

• Techniques: Bayesian classifiers, neural networks (NN), genetic algorithms, decision trees, …

Page 4: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Introduction: Decision Trees

[Figure not preserved in the transcript; surviving label: "training"]

Page 5: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Presentation layout

• Introduction: Classification and Decision Trees

• Decision Tree Building Algorithms

• SPRINT & MDL

• PUBLIC

• Performance Comparison

• Conclusions

Page 6: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Decision Tree Building Algorithms

• Two phases:
  – The building phase
  – The pruning phase

• The building phase constructs a “perfect” tree, one that classifies every training record correctly

• The pruning phase prevents “overfitting”

Page 7: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Building Phase Algorithms

• Differ in the selection of the test criterion for partitioning
  – CLS
  – ID3 & C4.5
  – CART, SLIQ & SPRINT

• Differ in their ability to handle large training sets

• All consider only “guillotine-cut” splits

Page 8: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Pruning Phase Algorithms

• MDL – Minimum Description Length

• Cost-Complexity Pruning

Page 9: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Presentation layout

• Introduction: Classification and Decision Trees

• Decision Tree Building Algorithms

• SPRINT & MDL

• PUBLIC

• Performance Comparison

• Conclusions

Page 10: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

SPRINT

• Initialize root node

• Initialize queue Q to contain root node

• While Q is not empty do
  – dequeue the first node N in Q
  – if N is not pure
    • for each attribute evaluate splits
    • use least entropy split to split node N into N1 and N2
    • append N1 and N2 to Q
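The loop above can be sketched in Python. This is a minimal illustration, not SPRINT itself: it assumes numeric attributes only, keeps all records in memory, and leaves out SPRINT's attribute lists and pre-sorting. The names `entropy`, `best_split` and `build_tree` are made up for this example.

```python
import math
from collections import Counter, deque

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(records):
    """Return (attribute index, threshold) of the least-entropy binary split, or None."""
    n = len(records)
    best, best_e = None, float("inf")
    for attr in range(len(records[0][0])):
        # Candidate thresholds: every distinct value except the smallest,
        # so that both sides of the split are non-empty.
        for t in sorted({x[attr] for x, _ in records})[1:]:
            left = [y for x, y in records if x[attr] < t]
            right = [y for x, y in records if x[attr] >= t]
            e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if e < best_e:
                best, best_e = (attr, t), e
    return best

def build_tree(records):
    """Breadth-first building loop; records are (numeric feature tuple, label) pairs."""
    root = {"records": records}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        recs = node["records"]
        if len({y for _, y in recs}) <= 1:
            continue  # pure node: stays a leaf
        split = best_split(recs)
        if split is None:
            continue  # all feature vectors identical: cannot split further
        attr, t = split
        node["split"] = split
        node["left"] = {"records": [r for r in recs if r[0][attr] < t]}
        node["right"] = {"records": [r for r in recs if r[0][attr] >= t]}
        queue.extend([node["left"], node["right"]])
    return root
```

For example, `build_tree([((1.0,), "a"), ((2.0,), "a"), ((3.0,), "b"), ((4.0,), "b")])` splits the root on attribute 0 at threshold 3.0 and leaves both children as pure leaves.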

Page 11: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Entropy
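The formula on this slide did not survive in the transcript. The standard definition, which is presumably what the slide showed, is given below. Here p_j is the relative frequency of class j in the record set S, k is the number of classes, and n_1 and n_2 are the sizes of the two partitions S_1 and S_2 of the n records in S; these symbol names are chosen for this note.

```latex
E(S) = -\sum_{j=1}^{k} p_j \log_2 p_j,
\qquad
E(S_1, S_2) = \frac{n_1}{n}\, E(S_1) + \frac{n_2}{n}\, E(S_2)
```

The split with the smallest E(S_1, S_2) is the "least entropy split" referred to on the SPRINT slide.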

Page 12: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

MDL

• The best tree is the one that can be encoded using the fewest number of bits

• Cost of encoding data records: [formula not preserved in the transcript]

• Cost of encoding tree:
  – The structure of the tree
  – The splits
  – The classes in the leaves
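The record-cost formula is missing from the transcript. As a hedged sketch, the function below uses the form I recall from the Rastogi and Shim paper, C(S) = Σ n_i log(n/n_i) + (k−1)/2 · log(n/2) + log(π^(k/2) / Γ(k/2)), where n_i is the number of records of class i, n the total and k the number of classes. Treat both the formula and the name `record_cost` as assumptions to check against the paper.

```python
import math

def record_cost(class_counts):
    """Bits needed to encode the class labels of a record set S.

    class_counts holds the number of records per class (zeros allowed).
    Assumed form, recalled from the PUBLIC paper rather than taken from the slide:
    C(S) = sum_i n_i*log2(n/n_i) + (k-1)/2*log2(n/2) + log2(pi^(k/2) / Gamma(k/2))
    """
    n = sum(class_counts)
    k = len(class_counts)
    data_bits = sum(c * math.log2(n / c) for c in class_counts if c > 0)
    model_bits = (k - 1) / 2 * math.log2(n / 2)
    # log2(pi^(k/2) / Gamma(k/2)), computed via natural logs to avoid overflow
    norm_bits = (k / 2 * math.log(math.pi) - math.lgamma(k / 2)) / math.log(2)
    return data_bits + model_bits + norm_bits
```

A pure set is cheap to encode and a mixed one is expensive: under this formula `record_cost([8, 0])` is smaller than `record_cost([4, 4])`.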

Page 13: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Pruning algorithm

• computeCost&Prune(Node N)
  – If N is a leaf return (C(S)+1)
  – minCostLeft := computeCost&Prune(Nleft)
  – minCostRight := computeCost&Prune(Nright)
  – minCost := min{C(S)+1, Csplit(N)+1+minCostLeft+minCostRight}
  – If minCost = C(S)+1
    • Prune child nodes Nleft and Nright
  – return minCost
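A direct Python rendering of the procedure above, as a sketch. The `Node` class and its field names are assumptions made for this example; `leaf_cost` stands for C(S) and `split_cost` for Csplit(N), both taken as precomputed numbers.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    leaf_cost: float                  # C(S): bits to encode the records at this node
    split_cost: float = 0.0           # Csplit(N): bits to encode the split test
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def compute_cost_and_prune(node: Node) -> float:
    """Return the minimum MDL cost of the subtree rooted at node, pruning in place."""
    if node.left is None and node.right is None:
        return node.leaf_cost + 1     # one extra bit for the tree structure
    min_left = compute_cost_and_prune(node.left)
    min_right = compute_cost_and_prune(node.right)
    as_leaf = node.leaf_cost + 1
    as_subtree = node.split_cost + 1 + min_left + min_right
    if as_leaf <= as_subtree:
        node.left = node.right = None  # prune both children
        return as_leaf
    return as_subtree
```

With `Node(3, 2, Node(1), Node(1))` the subtree costs 2 + 1 + 2 + 2 = 7 bits against 4 bits as a leaf, so the children are pruned; with `Node(10, 2, Node(1), Node(1))` the subtree is kept.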

Page 14: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Presentation layout

• Introduction: Classification and Decision Trees

• Decision Tree Building Algorithms

• SPRINT & MDL

• PUBLIC

• Performance Comparison

• Conclusions

Page 15: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

PUBLIC

• PUBLIC = PrUning and BuiLding Integrated in Classification

• Uses SPRINT for building

• Prunes periodically while the tree is being built!

• Basically uses MDL for pruning

• Distinguishes three types of leaves:
  – “not expandable”
  – “pruned”
  – “yet to be expanded”

• Produces exactly the same tree as building fully and then pruning
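A sketch of how the three leaf types could change the pruning pass when it runs on a partially built tree. For a "yet to be expanded" leaf the true subtree cost is unknown, so a lower bound is used in its place. The `Node` class, the `kind` field and the `lower_bound` parameter are assumptions for this example. The default bound of 1 corresponds to PUBLIC(1). A real implementation would also have to remove pruned descendants from the build queue, which is omitted here.

```python
from dataclasses import dataclass
from typing import Optional

NOT_EXPANDABLE = "not expandable"
PRUNED = "pruned"
YET_TO_EXPAND = "yet to be expanded"

@dataclass
class Node:
    leaf_cost: float                  # C(S)
    split_cost: float = 0.0           # Csplit(N)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    kind: str = YET_TO_EXPAND         # only meaningful while the node is a leaf

def public_prune(node, lower_bound=lambda n: 1.0):
    """Prune a partially built tree in place and return the cost estimate at node."""
    if node.left is None and node.right is None:
        if node.kind == YET_TO_EXPAND:
            return lower_bound(node)  # cost of the future subtree is at least this
        return node.leaf_cost + 1     # final leaf: exact cost
    min_left = public_prune(node.left, lower_bound)
    min_right = public_prune(node.right, lower_bound)
    as_leaf = node.leaf_cost + 1
    as_subtree = node.split_cost + 1 + min_left + min_right
    if as_leaf <= as_subtree:
        node.left = node.right = None
        node.kind = PRUNED            # never expand this node again
        return as_leaf
    return as_subtree
```

Because the value used for an unexpanded leaf never exceeds the true cost of its subtree, a node is pruned here only if the full build-then-prune procedure would also have pruned it, which is why the final tree is unchanged.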

Page 16: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Lower Bound Computation

• PUBLIC(1) – Bound=1

• PUBLIC(S) – Incorporating split costs

• PUBLIC(V) – Incorporating split values

Page 17: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

PUBLIC(S)

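• Calculates a lower bound for each number of splits s = 0, …, k−1 (k = number of classes)
  – For s = 0: C(S)+1
  – For s > 0: [formula not preserved in the transcript]

• Takes the minimum of the bounds

• Computed by iterative addition

• Complexity: O(k log k)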
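The bound for s > 0 is missing from the transcript. The sketch below uses the form I recall from the paper: with the class counts sorted so that n_1 ≥ n_2 ≥ … ≥ n_k, a subtree with s splits costs at least 2s + 1 + s·log(a) + Σ_{i=s+2..k} n_i, where a is the number of attributes. Treat that formula as an assumption to verify. The function name and parameters are hypothetical.

```python
import math

def public_s_lower_bound(class_counts, num_attributes, leaf_cost):
    """Lower bound on the MDL cost of any subtree rooted at a node.

    class_counts: records per class at the node
    num_attributes: a, the number of attributes
    leaf_cost: C(S) + 1, the cost of keeping the node as a leaf (the s = 0 case)
    """
    counts = sorted(class_counts, reverse=True)   # n_1 >= n_2 >= ... >= n_k
    k = len(counts)
    log_a = math.log2(num_attributes)
    best = leaf_cost                              # s = 0
    tail = sum(counts[2:])                        # sum of n_i for i >= s + 2, with s = 1
    for s in range(1, k):
        best = min(best, 2 * s + 1 + s * log_a + tail)
        if s + 1 < k:
            tail -= counts[s + 1]                 # drop n_(s+2) before moving to s + 1
    return best
```

The sort dominates the running time, which matches the O(k log k) stated on the slide; the loop itself updates the tail sum by one subtraction per step. For class counts 50, 30, 20 with a = 4, the candidates are 25 for s = 1 and 9 for s = 2.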

Page 18: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

PUBLIC(V)

• PUBLIC(S) estimates the cost of each split as log(a) bits, where a is the number of attributes

• PUBLIC(V) estimates the cost of each split as log(a) bits plus the cost of encoding the splitting value(s)

• Complexity: O(k·(log k + a))

Page 19: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Lower Bound Computation Summary

• PUBLIC(1): fixed bound of 1; O(1)

• PUBLIC(S): incorporates split costs; O(k log k)

• PUBLIC(V): incorporates split value costs; O(k·(log k + a))

Page 20: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Presentation layout

• Introduction: Classification and Decision Trees

• Decision Tree Building Algorithms

• SPRINT & MDL

• PUBLIC

• Performance Comparison

• Conclusions

Page 21: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Performance Comparisons

• Algorithms:
  – SPRINT
  – PUBLIC(1)
  – PUBLIC(S)
  – PUBLIC(V)

• Data sets:
  – Real-life
  – Synthetic

Page 22: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Real-life Data Sets

[Chart not preserved in the transcript. Series: SPRINT, PUBLIC(1), PUBLIC(S), PUBLIC(V); axis ticks from 0 to 5000, units not recoverable.]

Page 23: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Synthetic Data Sets

[Chart not preserved in the transcript. Series: SPRINT, PUBLIC(1), PUBLIC(S), PUBLIC(V); axis ticks from 0 to 2000, units not recoverable.]

Page 24: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Noise

Page 25: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Other Parameters

• No. of Attributes

• No. of Classes

• Size of training set

Page 26: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Presentation layout

• Introduction: Classification and Decision Trees

• Decision Tree Building Algorithms

• SPRINT & MDL

• PUBLIC

• Performance Comparison

• Conclusions

Page 27: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Conclusion

• Pruning is integrated into the building phase

• Lower bounds are computed on the cost of “yet to be expanded” leaves

• Improved performance

• Open:
  – How often to invoke the pruning procedure?
  – Extending the approach to other algorithms …
  – Developing a tighter lower bound …