Scaling Decision Tree Induction
Outline
• Why do we need scaling?
• Cover state-of-the-art methods
• Details on my research (which is one of the state-of-the-art methods)
Problems Scaling Decision Trees
• Data doesn’t fit in RAM
• Numeric attributes require repeated sorting
• Noisy datasets lead to very large trees
• Large datasets are fundamentally different from smaller ones
– Can’t store the entire dataset
– The underlying phenomenon changes over time
Current State-Of-The-Art
• Disk-based methods
– SPRINT
– SLIQ
• Sampling methods
– BOAT
– VFDT & CVFDT
• Data stream methods
– VFDT & CVFDT
SPRINT/SLIQ
• Shafer, Agrawal, Mehta
• In the IBM Intelligent Miner for Data
• Learns the same tree as traditional methods, but works with data on disk
• One scan over the data per level of the induced tree
SPRINT/SLIQ Details
• Split the dataset into one file per attribute
– (value, record ID) pairs
• Pre-sort each numeric attribute’s file
• Do one scan over each file to find the best split point
• Use hash tables to split the files while maintaining sort order (see the sketch after the example below)
• Recur
SPRINT/SLIQ Splitting Example
[Figure: the test attribute’s pre-sorted (val | rec) file (3|3, 5|2, 6|5, 9|1, 10|4, 12|6) is scanned once to find the split point. A ‘hashtable’ records each record ID’s side of the split (1→>, 2→<, 3→<, 4→>, 5→<, 6→>) and is then used to partition every other attribute’s sorted file, such as 10|1, 14|6, 20|2, 25|4, 30|3, 40|5, without re-sorting.]
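To make the mechanics concrete, here is a minimal Python sketch of the hash-table split, assuming the (value, record ID) file layout above; the function names are mine, not SPRINT’s:

def build_hash_table(test_attrib_file, split_value):
    # One scan over the test attribute's sorted file records, for each
    # record ID, which side of the split it falls on.
    return {rec_id: '<' if value <= split_value else '>'
            for value, rec_id in test_attrib_file}

def split_attribute_file(attrib_file, side):
    # Partition another attribute's file using the hash table. A single
    # in-order pass keeps both halves sorted, so no re-sorting is needed
    # at the child nodes.
    left, right = [], []
    for value, rec_id in attrib_file:
        (left if side[rec_id] == '<' else right).append((value, rec_id))
    return left, right

# Reproducing the example above, with the split point between 6 and 9:
test_file  = [(3, 3), (5, 2), (6, 5), (9, 1), (10, 4), (12, 6)]
other_file = [(10, 1), (14, 6), (20, 2), (25, 4), (30, 3), (40, 5)]
table = build_hash_table(test_file, split_value=6)  # matches the 'hashtable' above
left, right = split_attribute_file(other_file, table)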
BOAT
• Gehrke, Ganti, Ramakrishnan, Loh
• Learns the same tree as traditional methods but can be as much as 3x faster than SPRINT/SLIQ
• When things work out, it learns more than one level of the tree in a single scan over the database
BOAT Details
• Read a sample of data into memory
• Learn N trees via traditional methods on bootstrap samples from this sample
• Keep any part of the tree that is exactly the same across all N trees (sketched below)
• Verify the subtree with a scan over all data
• When this fails revert to SPRINT/SLIQ
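A minimal sketch of the bootstrap-agreement step at a single node; the names and the crude purity measure are illustrative stand-ins (real BOAT also builds confidence intervals for numeric split points):

import random
from collections import Counter

def best_split_attribute(sample):
    # Stand-in for any traditional split criterion: pick the attribute
    # whose value-wise majority vote misclassifies the fewest examples.
    n_attribs = len(sample[0][0])
    def errors(a):
        buckets = {}
        for x, y in sample:
            buckets.setdefault(x[a], Counter())[y] += 1
        return sum(sum(c.values()) - max(c.values()) for c in buckets.values())
    return min(range(n_attribs), key=errors)

def agreed_split(in_memory_sample, n_trees=10):
    # Learn N split choices on bootstrap samples; keep the split only if
    # all N agree. (It must still be verified with a scan over all data,
    # falling back to SPRINT/SLIQ when verification fails.)
    choices = {best_split_attribute(
                   random.choices(in_memory_sample, k=len(in_memory_sample)))
               for _ in range(n_trees)}
    return choices.pop() if len(choices) == 1 else None  # None: no consensus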
BOAT Example
[Figure: three bootstrap trees share the same structure (x1? splits male/female; the female branch tests x2?) but differ in the numeric threshold (> 65, > 67, > 61), and one tree grows an extra x3? (no/yes) subtree. The combined coarse tree keeps the agreed structure, with the x2 split point narrowed to the interval 61–67.]
VFDT/CVFDT
• Hulten, Spencer, Domingos
• With high probability, learns what traditional methods would learn, but much faster
• Learns from a data stream instead of a database
• CVFDT is an extension for time-changing concepts
Motivation
• Why use a data stream model?
– High data rate
– Essentially infinite data
– Data collected in varied circumstances
• Need algorithms that are:
– Constant time per example, using each example only once
– Incremental
– Anytime
– Able to produce results ‘equivalent’ to traditional methods
Hoeffding Trees
• To pick the split attribute for a node, looking at a few examples may be sufficient
• Given a stream of examples:
– Use the first ones to pick the split at the root
– Sort succeeding ones to the leaves
– Pick the best attribute there
– Continue…
• Leaves predict the most common class
How Much Data?
• Make sure the best attribute is better than the second best, that is:
ΔG = G(X₁) − G(X₂) ≥ 0
• Using a statistical result, the Hoeffding bound: collect data until
ΔG > ε, where ε = √(R² ln(1/δ) / 2n)
with R the range of G, δ the allowed failure probability, and n the number of examples seen.
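A small helper expressing the bound, with names of my choosing (for information gain over c classes the range is R = log₂ c, so R = 1 for two classes):

import math

def hoeffding_epsilon(R, delta, n):
    # With probability 1 - delta, the true mean of a range-R quantity is
    # within epsilon of the mean observed over n independent examples.
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def confident_split(g_best, g_second, R, delta, n):
    # Split when the observed gain gap exceeds epsilon: the best attribute
    # then beats the second best with probability at least 1 - delta.
    return (g_best - g_second) > hoeffding_epsilon(R, delta, n)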
Hoeffding Tree Algorithm
Procedure HoeffdingTree(Stream, δ)
  Let HT = tree with a single leaf (the root)
  Initialize sufficient statistics at the root
  For each example (X, y) in Stream:
    Sort (X, y) to a leaf using HT
    Update sufficient statistics at that leaf
    Compute G for each attribute
    If G(best) − G(2nd best) > ε, then:
      Split the leaf on the best attribute
      For each branch:
        Start a new leaf and initialize its sufficient statistics
  Return HT
[Example tree: the root tests x1?; male predicts y=0, female tests x2?; > 65 predicts y=0, <= 65 predicts y=1.]
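The following is a compact, runnable rendering of the pseudocode above. For brevity it assumes (my choices, not the algorithm’s) binary 0/1 attributes, at least two attributes, two classes (so R = 1 for information gain), and it recomputes G on every example rather than periodically as VFDT does:

import math
from collections import defaultdict

class Leaf:
    def __init__(self, n_attribs):
        self.n = 0
        self.class_counts = defaultdict(int)
        # counts[a][v][y]: examples seen with attribute a = v and class y
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(n_attribs)]
        self.split_attrib = None
        self.children = {}

def entropy(class_counts):
    total = sum(class_counts.values())
    return -sum(c / total * math.log2(c / total)
                for c in class_counts.values() if c)

def gain(leaf, a):
    after = sum(sum(vc.values()) / leaf.n * entropy(vc)
                for vc in leaf.counts[a].values())
    return entropy(leaf.class_counts) - after

def hoeffding_tree(stream, n_attribs, delta=1e-7):
    root = Leaf(n_attribs)
    for x, y in stream:
        leaf = root
        while leaf.split_attrib is not None:          # sort to a leaf
            leaf = leaf.children[x[leaf.split_attrib]]
        leaf.n += 1                                   # update statistics
        leaf.class_counts[y] += 1
        for a in range(n_attribs):
            leaf.counts[a][x[a]][y] += 1
        gains = sorted((gain(leaf, a) for a in range(n_attribs)), reverse=True)
        epsilon = math.sqrt(math.log(1 / delta) / (2 * leaf.n))  # R = 1
        if gains[0] - gains[1] > epsilon:             # confident split
            best = max(range(n_attribs), key=lambda a: gain(leaf, a))
            leaf.split_attrib = best
            leaf.children = {v: Leaf(n_attribs) for v in (0, 1)}
    return root

def predict(tree, x):
    node = tree
    while node.split_attrib is not None:
        node = node.children[x[node.split_attrib]]
    cc = node.class_counts
    return max(cc, key=cc.get) if cc else 0           # most common class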
Properties of Hoeffding Trees
• The model may contain incorrect splits; is it still useful?
• Bound the difference with the infinite-data tree
– The chance that an arbitrary example takes a different path
• Intuition: an example at level i of the tree has i chances to go through a mistaken node
• Result: Δ(HT_δ, DT_∞) ≤ δ/p in expectation, where p is the probability that an example reaches a leaf
VFDT (Very Fast Decision Tree)
• Memory management
– Memory is dominated by the sufficient statistics
– Deactivate less promising leaves when needed
• Ties
– Wasteful to decide between identical attributes (see the sketch below)
• Check for splits periodically
• Pre-pruning (optional)
– Only make splits that improve the value of G(.)
• Early stop on bad attributes
• Bootstrap with a traditional learner
• Rescan old data when time is available
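A sketch of the tie rule, assuming the Hoeffding ε from earlier and a user-set tie threshold τ (the experiments below use τ = 5%): when the top attributes are essentially identical, waiting for their gap to exceed ε is wasteful, so split as soon as ε itself is small.

def should_split(g_best, g_second, epsilon, tau=0.05):
    # Split when the gap is statistically significant, or when epsilon has
    # shrunk below tau (the remaining uncertainty no longer matters).
    return (g_best - g_second) > epsilon or epsilon < tau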
Experiments
• Compared VFDT and C4.5 (Quinlan, 1993)
• Same memory limit for both (40 MB)
– 100k examples for C4.5
• VFDT settings: δ = 10^-7, τ = 5%
• Domains: 2 classes, 100 binary attributes
• Fifteen synthetic trees with 2.2k – 500k leaves
• Noise from 0% to 30%
Running Times
• Pentium III at 500 MHz running Linux
• C4.5 takes 35 seconds to read and process 100k examples; VFDT takes 47 seconds
• VFDT takes 6377 seconds for 20 million examples: 5752 s to read, 625 s to process
• VFDT processes 32k examples per second excluding I/O (20,000,000 / 625 s)
Time-Changing Data Streams
• The underlying concept often changes over time
– Seasonal effects
– Economic cycles
– Etc.
• Many KDD systems assume the data is a sample from a stationary distribution
• CVFDT extends VFDT to time-changing data streams
Dealing with Time Changing Concepts
• Out-of-date data misleads the learner and results in larger or less accurate models
• Maintain a window of the most recent examples
– When new data arrives, update the window and reapply the learner
– Effective when the window size is matched to the concept drift rate
• But relearning on every window update is extremely inefficient!
Concept adapting VFDT
• Keeps up to date with a window of size w
– Incrementally incorporates and forgets examples
• Smoothly changes the induced tree
– Grows speculative structure
– Changes structure when it is more accurate
• Incorporates new examples in constant time, instead of relearning on the window in O(w) time
Window (Forgetting Examples)
• Keep sufficient statistics at every node
• Update with new & old examples
– Keep an ID and only forget where needed
– Quickly update leaf predictions
• Periodically check for any invalid splits (a sketch of the windowed update follows below)
– Some are due to incorrect initial splits
– The rest are due to changes in the data stream
– A split becomes questionable when another attribute now looks better: G(X_split) < G(X₂)
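A sketch of the windowed update; the structure and names are mine, but it shows why forgetting is cheap: each stored example remembers which nodes it updated, so removing it touches only those nodes.

from collections import Counter, deque

class NodeStats:
    # Stand-in for full per-attribute sufficient statistics.
    def __init__(self):
        self.class_counts = Counter()
    def add(self, y):
        self.class_counts[y] += 1
    def forget(self, y):
        self.class_counts[y] -= 1

def window_update(window, w, stats, example, visited_node_ids):
    # window: deque of (class, node IDs); stats: dict node ID -> NodeStats
    x, y = example
    for node_id in visited_node_ids:      # incorporate the new example
        stats[node_id].add(y)
    window.append((y, visited_node_ids))
    if len(window) > w:                   # forget the oldest example
        old_y, old_ids = window.popleft()
        for node_id in old_ids:
            if node_id in stats:          # node may have been replaced since
                stats[node_id].forget(old_y)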
Alternate Sub-Trees
• When a new test looks better, grow an alternate sub-tree (a minimal replacement test is sketched below, after the figure)
• Replace the old sub-tree when the new one is more accurate
• This smoothly adjusts to changing concepts
[Figure: a tree rooted at Gender? with sub-trees Pets? and College? and true/false leaves; a speculative alternate sub-tree rooted at Hair? grows alongside the existing structure.]
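A minimal sketch of the replacement test, with names and mechanics of my choosing: both sub-trees classify the same recent examples, and the alternate wins only if it is more accurate.

def alternate_is_better(current, alternate, recent_examples, predict):
    # Score both sub-trees on the same recent examples; predict is any
    # classifier function, such as the predict() sketched earlier.
    cur_err = sum(predict(current, x) != y for x, y in recent_examples)
    alt_err = sum(predict(alternate, x) != y for x, y in recent_examples)
    return alt_err < cur_err  # True: swap the alternate in for the old sub-tree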
CVFDT Details
• Memory requirements
– When drift is present, CVFDT uses fewer nodes than VFDT
– Good results observed with relatively few alternate trees
• Update time
– O(#attributes × #values × #classes × path length)
– Independent of training-set and window size!
Other things
• Dynamic window size, adjusted in response to:
– Drastic changes in the data stream
– Drastic changes in the induced model
– No apparent changes (learn more detail)
Synthetic Experiments
• Concept based on parallel hyper-planes
• The better-aligned axis is the better split attribute; rotating the hyper-planes changes the structure of the ‘true’ tree
[Figure: positive (+) and negative (−) examples separated by parallel hyper-planes; rotating the planes produces concept drift.]
Synthetic Experiments (cont.)
• Compare CVFDT with VFDT
• 5 million training examples
• Drift inserted by periodically rotating the hyper-planes (a data generator is sketched below)
– About 8% of test points change label at each drift
• 100,000 examples in the window
• 5% noise
• Results sampled every 10k examples throughout the run and averaged
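A sketch of a drifting stream generator in the spirit of these experiments; the parameters are illustrative, and it uses a single hyper-plane through the origin rather than the parallel hyper-planes described above.

import math
import random

def hyperplane_stream(n_attribs, n_examples, drift_every=50_000,
                      angle=0.1, noise=0.05, seed=0):
    rng = random.Random(seed)
    w = [1.0] + [0.0] * (n_attribs - 1)         # start axis-aligned
    for i in range(n_examples):
        if i and i % drift_every == 0:          # rotate in the (x0, x1) plane
            c, s = math.cos(angle), math.sin(angle)
            w[0], w[1] = c * w[0] - s * w[1], s * w[0] + c * w[1]
        x = [rng.uniform(-1.0, 1.0) for _ in range(n_attribs)]
        y = int(sum(wi * xi for wi, xi in zip(w, x)) > 0)
        if rng.random() < noise:                # label noise
            y = 1 - y
        yield x, y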
Error Rate vs. # Attributes
Tree Size vs. # Attributes
Detailed View of Single Run
Varying Levels of Drift
Details of Adaptation
Comparison With VFDT-window
• CVFDT gets most of the accuracy gain of relearning on the window
• Running times: VFDT 10 min; CVFDT 46 min; VFDT-window estimated 548 days!
[Figure: accuracy over the run for VFDT, CVFDT, and VFDT-window; the accuracy axis spans 70–90%.]
Application: Web Data
• Trace of all web requests from UW campus
• 82.8 million requests over one-week period
• Goal: predict which pages to cache
• CVFDT does better for the first 70% of the run
• VFDT’s performance improved near the end
• The data seems to contain drift, but more study is needed
Open Issues
• Continuous Attributes
• Batch version of VFDT
• Very Fast Post Pruning
• Extending general method to other algorithms
Summary
• Decision trees are important, but need more work to scale to today’s problems
• Disk-based methods
– About one scan per level of the tree
• Sampling can produce equivalent trees much faster