Machine Learning in Real World: CART
Outline
CART Overview and Gymtutor Tutorial Example
Splitting Criteria
Handling Missing Values
Pruning
Finding Optimal Tree
CART – Classification And Regression Trees
Developed 1974-1984 by four statistics professors:
Leo Breiman (Berkeley), Jerry Friedman (Stanford), Charles Stone (Berkeley), Richard Olshen (Stanford)
Focused on accurate assessment when data is noisy
Currently distributed by Salford Systems
CART Tutorial Data: Gymtutor
(CART HELP, Sec. 3 in CARTManual.pdf)
ANYRAQT – Racquet ball usage (binary indicator coded 0, 1)
ONAER – Number of on-peak aerobics classes attended
NSUPPS – Number of supplements purchased
OFFAER – Number of off-peak aerobics classes attended
NFAMMEM – Number of family members
TANNING – Number of visits to tanning salon
ANYPOOL – Pool usage (binary indicator coded 0, 1)
SMALLBUS – Small business discount (binary indicator coded 0, 1)
FIT – Fitness score
HOME – Home ownership (binary indicator coded 0, 1)
PERSTRN – Personal trainer (binary indicator coded 0, 1)
CLASSES – Number of classes taken
SEGMENT – Member's market segment (1, 2, 3) – the target
View data
CART Menu: View -> Data Info …
CART Example: Gymtutor
CART Model Setup
Target -- required
Predictors (default – all)
Categorical: ANYRAQT, ANYPOOL, SMALLBUS, HOME
  (a field is treated as categorical if its name ends in "$", or based on its values)
Testing: default is 10-fold cross-validation
…
Sample Tree
Color-coding using class
Decision Tree: Splitters
Tree Details
Tree Summary Reports
Pruning the tree
Keeping only important variables
Revised Tree
Automating CART: Command Log

Key CART features
Automated field selection: handles any number of fields, automatically selects relevant fields
No data preprocessing needed: does not require any kind of variable transforms
Impervious to outliers
Missing value tolerant: only a moderate loss of accuracy due to missing values
CART: Key Parts of Tree-Structured Data Analysis
Tree growing
  splitting rules to generate the tree
  stopping criteria: how far to grow?
  missing values: using surrogates
Tree pruning
  trimming off parts of the tree that don't work
  ordering the nodes of a large tree by contribution to tree accuracy ... which nodes come off first?
Optimal tree selection
  deciding on the best tree after growing and pruning
  balancing simplicity against accuracy
Data is split into two partitions
Q: Does C4.5 always have binary partitions?
Partitions can also be split into sub-partitions
  hence the procedure is recursive
A CART tree is generated by repeated partitioning of the data set
  the parent gets two children
  each child produces two grandchildren
  four grandchildren produce eight great-grandchildren
CART is a form of Binary Recursive Partitioning
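Conceptually, binary recursive partitioning can be sketched in a few lines of Python (all names here are illustrative, not CART's actual implementation):

    # Minimal sketch of binary recursive partitioning (hypothetical names).
    class Node:
        def __init__(self, rows):
            self.rows = rows          # the data partition at this node
            self.question = None      # a yes/no split question (callable)
            self.left = None          # cases answering YES
            self.right = None         # cases answering NO

    def grow(node, find_best_split, min_size=5):
        """Recursively split a partition into two sub-partitions."""
        if len(node.rows) < min_size:
            return                    # stopping criterion: node too small
        question = find_best_split(node.rows)
        if question is None:
            return                    # no split improves class purity
        node.question = question
        node.left = Node([r for r in node.rows if question(r)])
        node.right = Node([r for r in node.rows if not question(r)])
        grow(node.left, find_best_split, min_size)
        grow(node.right, find_best_split, min_size)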
Splits always determined by questions with YES/NO answers
Is continuous variable X ≤ c?
Does categorical variable D take on levels i, j, or k? e.g., is GENDER M or F?
Standard split: if the answer to the question is YES a case goes left; otherwise it goes right
  this is the form of all primary splits
  example: Is AGE ≤ 62.5?
More complex conditions are possible:
  Boolean combinations: AGE <= 62 OR BP <= 91
  Linear combinations: 0.66*AGE - 0.75*BP < -40
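To make the three split forms concrete, here is a small Python sketch (the variable names AGE and BP come from the example data below; the predicates themselves are illustrative):

    # Each CART split is a yes/no question about one case (a sketch).
    standard = lambda case: case["AGE"] <= 62.5              # primary split form
    boolean  = lambda case: case["AGE"] <= 62 or case["BP"] <= 91
    linear   = lambda case: 0.66*case["AGE"] - 0.75*case["BP"] < -40

    case = {"AGE": 45, "BP": 120}
    print(standard(case), boolean(case), linear(case))       # True True True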
Searching All Possible Splits
For any node CART will examine ALL possible splits
  CART allows search over a random sample if desired
Look at the first variable in our data set, AGE, with minimum value 40
  Test split: Is AGE ≤ 40?
  Will separate out the youngest persons to the left
  Could be many cases if many people have the same AGE
Next, increase the AGE threshold to the next youngest person: Is AGE ≤ 43?
  This will direct additional cases to the left
Continue increasing the splitting threshold value by value
  each value is tested for how good the split is ... how effective it is in separating the classes from each other
Q: Should we consider splits between values of the same class?
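A sketch of this threshold sweep in Python (hypothetical helper names; rows are dicts keyed by variable name):

    # Enumerate candidate split thresholds for one continuous variable (sketch).
    def candidate_thresholds(rows, var):
        """Distinct sorted values of var; each defines a test 'var <= t'."""
        return sorted({row[var] for row in rows})

    def sweep(rows, var):
        for t in candidate_thresholds(rows, var):
            left  = [r for r in rows if r[var] <= t]
            right = [r for r in rows if r[var] >  t]
            yield t, left, right      # each (threshold, partition) gets scored
    # Note: the largest value sends every case left, so it yields no real split.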
Split Tables
Q: Where do splits need to be evaluated?

Sorted by Age                      Sorted by Blood Pressure
AGE  BP   SINUST  SURVIVE          AGE  BP   SINUST  SURVIVE
40    91  0       SURVIVE          43    78  1       DEAD
40   110  0       SURVIVE          40    83  1       DEAD
40    83  1       DEAD             40    91  0       SURVIVE
43    99  0       SURVIVE          43    99  0       SURVIVE
43    78  1       DEAD             40   110  0       SURVIVE
43   135  0       SURVIVE          49   110  1       SURVIVE
45   120  0       SURVIVE          48   119  1       DEAD
48   119  1       DEAD             45   120  0       SURVIVE
48   122  0       SURVIVE          48   122  0       SURVIVE
49   150  0       DEAD             43   135  0       SURVIVE
49   110  1       SURVIVE          49   150  0       DEAD
CART Splitting Criteria: Gini Index
If a data set T contains examples from n classes, the gini index gini(T) is defined as

    gini(T) = 1 - Σ_{j=1..n} (p_j)²

where p_j is the relative frequency of class j in T.
gini(T) is minimized if the classes in T are skewed (i.e., the node is dominated by one class).
Advanced: CART also has other splitting criteria; Twoing is recommended for multi-class problems.
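A minimal Python sketch of the gini calculation, applied to the split-table rows above to score two candidate AGE splits (helper names are illustrative):

    from collections import Counter

    def gini(labels):
        """gini(T) = 1 - sum_j p_j^2 over the class frequencies in T."""
        n = len(labels)
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    # Rows from the split table: (AGE, BP, SINUST, SURVIVE)
    rows = [(40, 91, 0, "SURVIVE"), (40, 110, 0, "SURVIVE"), (40, 83, 1, "DEAD"),
            (43, 99, 0, "SURVIVE"), (43, 78, 1, "DEAD"),    (43, 135, 0, "SURVIVE"),
            (45, 120, 0, "SURVIVE"), (48, 119, 1, "DEAD"),  (48, 122, 0, "SURVIVE"),
            (49, 150, 0, "DEAD"),   (49, 110, 1, "SURVIVE")]

    def weighted_gini(left, right):
        """Size-weighted average gini of the two child partitions."""
        n = len(left) + len(right)
        return (len(left) * gini(left) + len(right) * gini(right)) / n

    # Score the split 'Is AGE <= 43?' against 'Is AGE <= 45?'
    for t in (43, 45):
        left  = [r[3] for r in rows if r[0] <= t]
        right = [r[3] for r in rows if r[0] > t]
        print(t, round(weighted_gini(left, right), 3))
    # prints: 43 0.461 and 45 0.442 -> AGE <= 45 is the better (lower gini) split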
Missing as a distinct splitter value
CHAID treats missing as a distinct categorical value, e.g. AGE is 25-44, 45-64, 65-95, or missing
  method also adopted by C4.5
If missing is a distinct value then all cases with missing go the same way in the tree
  Assumption: whatever the unknown value is, it is the same for all cases with a missing value
Problem: there can be more than one reason for a database field to be missing
  E.g. Income as a splitter wants to separate high from low
  Levels most likely to be missing? High income AND low income!
  We don't want to send both groups to the same side of the tree
CART Treatment of Missing Primary Splitters: Surrogates
CART uses a more refined method — a surrogate is used as a stand-in for a missing primary field
  the surrogate should be a valid replacement for the primary
Consider our example of INCOME
  other variables like Education or Occupation might work as good surrogates
  higher-education people usually have higher incomes
  people in high-income occupations will usually (though not always) have higher incomes
Using a surrogate means that cases missing on the primary are not all treated the same way
  whether a case goes left or right depends on its surrogate value
  thus record-specific ... some cases go left, others go right
Surrogates: Mimicking Alternatives to Primary Splitters
A primary splitter is the best splitter of a node
A surrogate is a splitter that splits in a fashion similar to the primary
  surrogate — a variable with near-equivalent information
Why useful? If the primary is expensive or difficult to gather and the surrogate is not,
  then consider using the surrogate instead
  the loss in predictive accuracy might be slight
If the primary splitter is MISSING then CART will use a surrogate
  if the top surrogate is missing, CART uses the 2nd best surrogate, etc.
  if all surrogates are missing too, CART uses the majority rule
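The fallback order can be sketched as follows (Python; the split and surrogate representations are hypothetical stand-ins for CART's internals):

    # Route one case through a node: primary, then surrogates, then majority.
    def send_left(case, primary, surrogates, majority_goes_left):
        """Return True if the case goes left at this node."""
        for split_var, test in [primary] + surrogates:   # best surrogate first
            value = case.get(split_var)
            if value is not None:                        # field observed
                return test(value)
        return majority_goes_left                        # all missing: majority rule

    # Example: primary on INCOME, surrogate on EDUCATION (illustrative values)
    primary    = ("INCOME",    lambda v: v <= 40000)
    surrogates = [("EDUCATION", lambda v: v <= 12)]
    print(send_left({"EDUCATION": 16}, primary, surrogates, True))   # False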
CART Pruning Method: Grow Full Tree, Then Prune
You will never know when to stop ... so don't!
Instead ... grow trees that are obviously too big
  the largest tree grown is called the "maximal" tree
  the maximal tree could have hundreds or thousands of nodes; usually we instruct CART to grow only moderately too big
  rule of thumb: grow trees about twice the size of the truly best tree
This becomes the first stage in finding the best tree
Next we will have to get rid of the parts of the overgrown tree that don't work (are not supported by test data)
Maximal Tree Example
Tree Pruning
Take a very large tree (the "maximal" tree)
  the tree may be radically over-fit
  it tracks all the idiosyncrasies of THIS data set
  it tracks patterns that may not be found in other data sets
  at the bottom of the tree, splits are based on very few cases
  analogous to a regression with a very large number of variables
PRUNE away branches from this large tree
  but which branch to cut first?
CART determines a pruning sequence:
  the exact order in which the nodes should be removed
  the pruning sequence is determined for EVERY node
  the sequence is determined all the way back to the root node
Order of Pruning: Weakest Link Goes First
Pruning: which nodes come off next?
Prune away the "weakest link" — the nodes that add least to the overall accuracy of the tree
  a node's contribution to the overall tree is a function of both its increase in accuracy and its size
  the accuracy gain is weighted by the node's share of the sample
  small nodes tend to get removed before large ones
If several nodes have the same contribution they are all pruned away simultaneously
  hence more than two terminal nodes could be cut off in one pruning step
The sequence is determined all the way back to the root node
  we need to allow for the possibility that the entire tree is bad
  if the target variable is unpredictable we will want to prune back to the root ... the no-model solution
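For reference, this is the cost-complexity score behind weakest-link pruning, in the standard notation of the CART monograph (R(t) is the resubstitution error of node t collapsed to a leaf, R(T_t) the error of the subtree rooted at t, and |T̃_t| its number of terminal nodes):

    g(t) = \frac{R(t) - R(T_t)}{|\tilde{T}_t| - 1}

The internal node with the smallest g(t) adds the least accuracy per terminal node and is pruned first; ties prune together, which is why several terminal nodes can come off in one step.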
Pruning Sequence Example
[Four trees from the pruning sequence, with 24, 21, 20, and 18 terminal nodes]
Now we test every tree in the pruning sequence
Take a test data set, drop it down the largest tree in the sequence, and measure its predictive accuracy
  how many cases right and how many wrong
  measure accuracy overall and by class
Do the same for the 2nd largest tree, 3rd largest tree, etc.
The performance of every tree in the sequence is measured
  results are reported in table and graph formats
Note that this critical stage is impossible to complete without test data
  the CART procedure requires test data to guide tree evaluation
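A sketch of this evaluation loop (Python; pruned_trees, predict, and the row format are hypothetical placeholders):

    # Measure test error for every tree in the pruning sequence (sketch).
    def test_error(tree, test_rows, predict):
        """Fraction of test cases the tree misclassifies."""
        wrong = sum(1 for row in test_rows if predict(tree, row) != row["target"])
        return wrong / len(test_rows)

    def error_curve(pruned_trees, test_rows, predict):
        """One error estimate per tree, largest tree first."""
        return [test_error(t, test_rows, predict) for t in pruned_trees]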
Compare error rates measured on
  learn data
  a large test set
Learn-data R(T) always decreases as the tree grows (Q: Why?)
Test R(T) first declines, then increases (Q: Why?)
Overfitting is the result of too much reliance on the learn-data R(T)
  it can lead to disasters when the tree is applied to new data
Training Data Vs. Test Data Error Rates

No. Terminal Nodes   R(T)   Rts(T)
        71           .00     .42
        63           .00     .40
        58           .03     .39
        40           .10     .32
        34           .12     .32
        19           .20     .31
      **10           .29     .30
         9           .32     .34
         7           .41     .47
         6           .46     .54
         5           .53     .61
         2           .75     .82
         1           .86     .91

(** marks the tree with the lowest test error Rts(T))
Why look at training data error rates (or cost) at all?
First, they provide a rough guide of how you are doing
  the truth will typically be WORSE than the training data measure
If the tree is performing poorly on training data, you may not want to pursue it further
The training data error rate is more accurate for smaller trees
  so it is a reasonable guide for smaller trees
  but a poor guide for larger trees
At the optimal tree, training and test error rates should be similar
  if not, something is wrong
  it is useful to compare not just the overall error rate but also within-node performance between training and test data
The Best Pruned Subtree: An Estimation Problem
Within a single CART run, which tree is best?
The process of pruning the maximal tree can yield many sub-trees
  a test data set or cross-validation measures the error rate of each tree
Current wisdom — select the tree with the smallest error rate
Only drawback — the minimum may not be precisely estimated
  the typical error rate as a function of tree size has a flat region
  the minimum could be anywhere in this region
CART: Optimal Tree
[Figure: error R(Tk) versus number of terminal nodes Tk (0 to 50), showing a flat region around the minimum]
One SE Rule — One Standard Error Rule
The original monograph recommends NOT choosing the minimum error tree, because of possible instability of results from run to run
Instead it suggests the SMALLEST TREE within 1 SE of the minimum error tree
  tends to provide very stable results from run to run
  is possibly as accurate as the minimum cost tree, yet simpler
Current learning — the one SE rule is good for small data sets
  for large data sets one should pick the most accurate tree
  known as the zero SE rule
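Both selection rules can be sketched as follows (Python; errors is a list like the one produced by the evaluation loop above, ordered largest tree first, and the SE formula assumes a simple binomial error estimate):

    import math

    def pick_tree(errors, n_test, se_rule=True):
        """errors[i] = test error of tree i, ordered largest tree first.
        Returns the index of the selected tree."""
        best = min(range(len(errors)), key=lambda i: errors[i])
        if not se_rule:
            return best                       # zero SE rule: most accurate tree
        e = errors[best]
        se = math.sqrt(e * (1 - e) / n_test)  # binomial SE of the error estimate
        # smallest tree (= latest in pruning order) within 1 SE of the minimum
        within = [i for i in range(len(errors)) if errors[i] <= e + se]
        return max(within)                    # later index = smaller tree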
In what sense is the optimal tree "best"?
The optimal tree has the lowest or near-lowest cost, as determined by a test procedure
  the tree should exhibit very similar accuracy when applied to new data
BUT the best tree is NOT necessarily the one that happens to be most accurate on a single test database
  trees somewhat larger or smaller than "optimal" may be preferred
There is room for user judgment
  judgment not about split variables or values
  judgment as to how much of the tree to keep
  determined by the story the tree is telling
  and by willingness to sacrifice a small amount of accuracy for simplicity
CART Summary
CART Key Features:
  binary splits
  gini index as the splitting criterion
  grow, then prune
  surrogates for missing values
  optimal tree – 1 SE rule
  lots of nice graphics
Decision Tree Summary
Decision Trees:
  splits – binary, multi-way
  split criteria – entropy, gini, ...
  missing value treatment
  pruning
  rule extraction from trees
Both C4.5 and CART are robust tools
No method is always superior – experiment!
Witten & Eibe