Scaling Decision Tree Induction


Outline

• Why do we need scaling?

• Cover state of the art methods

• Details on my research (which is one of the state of the art methods)


Problems Scaling Decision Trees

• Data doesn’t fit in RAM

• Numeric attributes require repeated sorting

• Noisy datasets lead to very large trees

• Large datasets fundamentally different from smaller ones
– Can't store the entire dataset
– Underlying phenomenon changes over time


Current State-Of-The-Art

• Disk-based methods
– SPRINT
– SLIQ

• Sampling methods
– BOAT
– VFDT & CVFDT

• Data stream methods
– VFDT & CVFDT


SPRINT/SLIQ

• Shafer, Agrawal, Mehta

• In the IBM Intelligent Miner for Data

• Learns the same tree as traditional methods, but works with data on disk

• One scan over the data per level of the induced tree


SPRINT/SLIQ Details

• Split the dataset into one file per attribute
– (value, record ID) pairs

• Pre-sort each numeric attribute's file

• Do one scan over each file to find the best split point

• Use hash tables to split the files, maintaining sort order

• Recur


SPRINT/SLIQ Splitting Example

Test attribute file (pre-sorted), with the best split point found between values 6 and 9:

val | rec
3 | 3
5 | 2
6 | 5
9 | 1
10 | 4
12 | 6

The split produces a 'hash table' mapping each record ID to its branch:

rec | branch
1 | >
2 | <
3 | <
4 | >
5 | <
6 | >

Another attribute file to split (pre-sorted):

val | rec
10 | 1
14 | 6
20 | 2
25 | 4
30 | 3
40 | 5

Probing the hash table routes each (value, record ID) pair to one branch's file; a single linear pass keeps both output files in sorted order:

'<' branch: 20 | 2, 30 | 3, 40 | 5
'>' branch: 10 | 1, 14 | 6, 25 | 4
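
A minimal Python sketch of this splitting step (illustrative, not the original SPRINT/SLIQ code), using the numbers from the example above:

# Each attribute "file" is a list of (value, record ID) pairs,
# kept sorted by value.
test_attrib = [(3, 3), (5, 2), (6, 5), (9, 1), (10, 4), (12, 6)]
split_value = 6   # best split point found by the scan

# Build the 'hash table': record ID -> branch ('<' or '>').
branch = {rec: ('<' if val <= split_value else '>')
          for val, rec in test_attrib}

# Split another pre-sorted attribute file by probing the hash table;
# one linear pass preserves sort order in both output files.
to_split = [(10, 1), (14, 6), (20, 2), (25, 4), (30, 3), (40, 5)]
left, right = [], []
for val, rec in to_split:
    (left if branch[rec] == '<' else right).append((val, rec))

print(left)    # [(20, 2), (30, 3), (40, 5)]
print(right)   # [(10, 1), (14, 6), (25, 4)]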


BOAT

• Gehrke, Ganti, Ramakrishnan, Loh

• Learns the same tree as traditional methods but can be as much as 3x faster than SPRINT/SLIQ

• When things work out, learns more than one level of the tree in one scan over the database


BOAT Details

• Read a sample of data into memory

• Learn N trees via traditional methods on bootstrap samples from this sample

• Keep the portions of the N trees that are exactly the same

• Verify the subtree with a scan over all data

• When verification fails, revert to SPRINT/SLIQ
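
A rough sketch of the bootstrap step, using scikit-learn trees as the in-memory "traditional method" (a simplification: agreement is checked only at the root here, while BOAT keeps the whole agreed subtree; the function name is my own):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_root_split(X_sample, y_sample, n_trees=10, seed=0):
    """Learn n_trees trees on bootstrap resamples of the in-memory
    sample; keep the root split only if every tree agrees on the
    attribute. A numeric split point is kept only as an interval."""
    rng = np.random.default_rng(seed)
    n = len(X_sample)
    attribs, thresholds = [], []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)           # bootstrap resample
        t = DecisionTreeClassifier().fit(X_sample[idx], y_sample[idx])
        attribs.append(t.tree_.feature[0])         # root split attribute
        thresholds.append(t.tree_.threshold[0])    # root split point
    if len(set(attribs)) == 1:
        # Agreed attribute; split point known only to an interval,
        # to be verified with one scan over the full data.
        return attribs[0], (min(thresholds), max(thresholds))
    return None   # no agreement: fall back to SPRINT/SLIQ here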


BOAT Example

[Figure: three bootstrap trees, each splitting on x1 (male/female) at the root and on x2 below it, but with different numeric split points (> 65, > 67, > 61); one tree also adds an x3 test. The combined tree keeps the agreed structure, with the x2 split point known only to lie between 61 and 67.]


VFDT/CVFDT

• Hulten, Spencer, Domingos

• With high probability, learns what traditional methods would learn, but much faster

• Learns from a data stream instead of a database

• CVFDT is an extension to time-changing concepts


Motivation

• Why use a data stream model?
– High data rate
– Essentially infinite data
– Data collected in varied circumstances

• Need algorithms that are:
– Constant time per example & use each example once
– Incremental
– Anytime
– Produce results 'equivalent' to traditional methods


Hoeffding Trees

• In order to pick the split attribute for a node, looking at a few examples may be sufficient

• Given a stream of examples:
– Use the first ones to pick the split at the root
– Sort succeeding ones to the leaves
– Pick the best attribute there
– Continue…

• Leaves predict the most common class


How Much Data?

• Make sure the best attribute is better than the second
– That is: G(X1) – G(X2) > 0

• Using a statistical result, the Hoeffding bound:
– Collect data until G(X1) – G(X2) > ε, where

ε = sqrt( R² ln(1/δ) / 2n )
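
As a quick sketch, the bound is one line of Python (R is the range of the split measure G, e.g. log2(#classes) for information gain; n is the number of examples seen):

import math

def hoeffding_epsilon(R, delta, n):
    # With probability 1 - delta, the true mean of a variable with
    # range R lies within epsilon of its average over n observations.
    return math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))

# Split when the observed gap beats the bound:
#   G(best) - G(second_best) > hoeffding_epsilon(R, delta, n)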


Hoeffding Tree Algorithm

Procedure HoeffdingTree(Stream, δ)
  Let HT = tree with a single leaf (the root)
  Initialize sufficient statistics at the root
  For each example (X, y) in Stream:
    Sort (X, y) to a leaf using HT
    Update sufficient statistics at the leaf
    Compute G for each attribute
    If G(best) – G(2nd best) > ε, then
      Split the leaf on the best attribute
      For each branch:
        Start a new leaf and initialize its sufficient statistics
  Return HT

[Figure: a small tree grown this way; the root splits on x1 (male/female), one branch predicts y=0, and the other splits on x2, predicting y=0 for > 65 and y=1 for <= 65.]
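
A compact, runnable sketch of the leaf bookkeeping behind this loop, for discrete attributes and two classes (class and method names are my own; the real VFDT adds the refinements on the next slide):

import math
from collections import defaultdict

def entropy(counts):
    # counts: mapping class -> count
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

class Leaf:
    def __init__(self, n_attribs):
        self.n = 0
        self.class_counts = defaultdict(int)
        # sufficient statistics: stats[attrib][value][cls] -> count
        self.stats = [defaultdict(lambda: defaultdict(int))
                      for _ in range(n_attribs)]

    def update(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for a, v in enumerate(x):
            self.stats[a][v][y] += 1

    def gain(self, a):
        # Information gain of splitting this leaf on attribute a.
        before = entropy(self.class_counts)
        after = sum(sum(cc.values()) / self.n * entropy(cc)
                    for cc in self.stats[a].values())
        return before - after

    def try_split(self, delta, R=1.0):
        # R = 1 is the range of information gain for two classes.
        if self.n == 0 or len(self.stats) < 2:
            return None
        gains = sorted((self.gain(a), a) for a in range(len(self.stats)))
        eps = math.sqrt(R * R * math.log(1 / delta) / (2 * self.n))
        if gains[-1][0] - gains[-2][0] > eps:
            return gains[-1][1]   # attribute to split on
        return None               # not enough evidence yet

# Driver: sort each (x, y) to a leaf, then
#   leaf.update(x, y); attrib = leaf.try_split(delta=1e-7)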


Properties of Hoeffding Trees

• The model may contain incorrect splits; is it still useful?

• Bound the difference with the infinite-data tree
– The chance that an arbitrary example takes a different path

• Intuition: an example at level i of the tree has i chances to go through a mistaken node

• Formally: Δ(HTδ, DT∞) ≤ δ / p, where p is the probability that an example reaches a leaf


VFDT (Very Fast Decision Tree)

• Memory management
– Memory dominated by sufficient statistics
– Deactivate less promising leaves when needed

• Ties:
– Wasteful to decide between identical attributes

• Check for splits periodically

• Pre-pruning (optional)
– Only make splits that improve the value of G(.)

• Early stop on bad attributes
• Bootstrap with a traditional learner
• Rescan old data when time is available
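
The tie mechanism is usually implemented with a user threshold τ (the τ = 5% setting used in the experiments below); a minimal sketch, assuming G values and ε computed as on the previous slides:

def should_split(g_best, g_second, epsilon, tau=0.05):
    # Normal case: the best attribute beats the runner-up by more than
    # epsilon. Tie case: epsilon has shrunk below tau, so the top two
    # attributes are close enough that waiting longer is wasteful.
    return (g_best - g_second > epsilon) or (epsilon < tau)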


Experiments

• Compared VFDT and C4.5 (Quinlan, 1993)

• Same memory limit for both (40 MB)
– 100k examples for C4.5

• VFDT settings: δ = 10^-7, τ = 5%

• Domains: 2 classes, 100 binary attributes

• Fifteen synthetic trees with 2.2k – 500k leaves

• Noise from 0% to 30%


Running Times

• Pentium III at 500 MHz running Linux

• C4.5 takes 35 seconds to read and process 100k examples; VFDT takes 47 seconds

• VFDT takes 6377 seconds for 20 million examples: 5752s to read, 625s to process

• VFDT processes 32k examples per second (excluding I/O)


Time-Changing Data Streams

• The underlying concept often changes over time
– Seasonal effects
– Economic cycles
– Etc.

• Many KDD systems assume the data is a sample from a stationary distribution

• CVFDT extends VFDT to time-changing data streams


Dealing with Time Changing Concepts

• Out-of-date data misleads the learner, resulting in larger or less accurate models

• Maintain a window of the most recent examples
– When new data arrives, update the window and reapply the learner
– Effective when the window size is similar to the concept drift rate

• Extremely inefficient!


Concept adapting VFDT

• Keep up to date with a window of size w
– Incrementally incorporate and forget examples

• Smoothly change the induced tree
– Grow speculative structure
– Change structure when more accurate

• Incorporates new examples in constant time, instead of relearning on the window in O(w) time


Window (Forgetting Examples)

• Keep sufficient statistics at every node

• Update with new & old examples
– Keep an ID and only forget where needed
– Quickly update leaf predictions

• Periodically check for any invalid splits
– Some portion due to incorrect initial splits
– The rest due to changes in the data stream
– A split has become invalid when G(Xsplit) < G(X2) for some other attribute X2
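
A sketch of the window bookkeeping (illustrative; CVFDT also tags each example with an ID so that only the nodes it actually passed through are decremented):

from collections import deque

class WindowedStats:
    """Sufficient statistics over a sliding window of w examples."""
    def __init__(self, w):
        self.w = w
        self.window = deque()    # (x, y) pairs, oldest first
        self.counts = {}         # (attrib, value, cls) -> count

    def _adjust(self, x, y, sign):
        for a, v in enumerate(x):
            key = (a, v, y)
            self.counts[key] = self.counts.get(key, 0) + sign

    def add(self, x, y):
        self.window.append((x, y))
        self._adjust(x, y, +1)            # incorporate the new example
        if len(self.window) > self.w:     # forget the oldest one
            old_x, old_y = self.window.popleft()
            self._adjust(old_x, old_y, -1)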


Alternate Sub-Trees

• When a new test looks better, grow an alternate sub-tree
• Replace the old one when the new is more accurate
• This smoothly adjusts to changing concepts

[Figure: a tree rooted at Gender? with an alternate sub-tree (rooted at College?) grown alongside an existing sub-tree (rooted at Pets?); true/false leaves and a Hair? test below.]
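
A sketch of the replacement rule (class name and margin are my own; the idea is to score both sub-trees on recent examples and promote the alternate once it is reliably better):

class MonitoredNode:
    """A node's current sub-tree plus a speculative alternate."""
    def __init__(self, subtree, alternate=None):
        self.subtree, self.alternate = subtree, alternate
        self.sub_errs = self.alt_errs = self.n = 0

    def record(self, subtree_correct, alternate_correct):
        # Score both sub-trees on each example reaching this node.
        self.n += 1
        self.sub_errs += not subtree_correct
        self.alt_errs += not alternate_correct

    def maybe_replace(self, margin=0.01):
        if self.alternate is None or self.n == 0:
            return
        if self.alt_errs / self.n < self.sub_errs / self.n - margin:
            # The alternate is reliably more accurate: swap it in.
            self.subtree, self.alternate = self.alternate, None
            self.sub_errs = self.alt_errs = self.n = 0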


CVFDT Details

• Memory requirements
– When drift is present, CVFDT uses fewer nodes than VFDT
– Observed good results with relatively few alternate trees

• Update time
– O(#attributes × #values × #classes × path length)
– Independent of training set and window size!


Other things

• Dynamic window size
– Drastic changes in the data stream
– Drastic changes in the induced model
– No apparent changes (learn more detail)


Synthetic Experiments

• Concept based on parallel hyperplanes
• Axis-aligned attributes make better split attributes; rotating the hyperplanes changes the structure of the 'true' tree

[Figure: points labeled + and – separated by parallel hyperplanes; concept drift rotates the hyperplanes.]


Synthetic Experiments (cont.)

• Compare CVFDT with VFDT
• 5 million training examples
• Drift inserted by periodically rotating the hyperplanes
– About 8% of test points change label each drift
• 100,000 examples in the window
• 5% noise
• Results sampled every 10k examples throughout the run and averaged


Error Rate vs. # Attributes


Tree Size vs. # Attributes


Detailed View of Single Run


Varying Levels of Drift


Details of Adaptation


Comparison With VFDT-window

• CVFDT gets most of the accuracy gain of the window approach

• VFDT: 10 min
• CVFDT: 46 min
• VFDT-window
– Est. 548 days!

[Figure: accuracy (70–90%) of VFDT, CVFDT, and VFDT-window; CVFDT's accuracy is close to VFDT-window's.]


Application: Web Data

• Trace of all web requests from UW campus

• 82.8 million requests over one-week period

• Goal: to predict which pages to cache

• CVFDT does better for the first 70% of the run

• VFDT's performance improved near the end

• Data seems to contain drift, but more study is needed


Open Issues

• Continuous Attributes

• Batch version of VFDT

• Very Fast Post Pruning

• Extending general method to other algorithms


Summary

• Decision trees are important, but need more work to scale to today's problems

• Disk-based methods
– About one scan per level of the tree

• Sampling can produce equivalent trees much faster