Machine Learning in Real World: CART
Outline
CART Overview and Gymtutor Tutorial Example
Splitting Criteria
Handling Missing Values
Pruning
Finding Optimal Tree
CART – Classification And Regression Trees
Developed 1974-1984 by four statistics professors:
Leo Breiman (Berkeley), Jerry Friedman (Stanford), Charles Stone (Berkeley), Richard Olshen (Stanford)
Focused on accurate assessment when data is noisy
Currently distributed by Salford Systems
CART Tutorial Data: Gymtutor
(CART HELP, Sec. 3 in CARTManual.pdf)
ANYRAQT – Racquet ball usage (binary indicator coded 0, 1)
ONAER – Number of on-peak aerobics classes attended
NSUPPS – Number of supplements purchased
OFFAER – Number of off-peak aerobics classes attended
NFAMMEM – Number of family members
TANNING – Number of visits to tanning salon
ANYPOOL – Pool usage (binary indicator coded 0, 1)
SMALLBUS – Small business discount (binary indicator coded 0, 1)
FIT – Fitness score
HOME – Home ownership (binary indicator coded 0, 1)
PERSTRN – Personal trainer (binary indicator coded 0, 1)
CLASSES – Number of classes taken
SEGMENT – Member's market segment (1, 2, 3) – the target
View data
CART Menu: View -> Data Info …
CART Example: Gymtutor
CART Model Setup
Target -- required
Predictors (default – all)
Categorical: ANYRAQT, ANYPOOL, SMALLBUS, HOME
  (a field is treated as categorical if its name ends in "$", or based on its values)
Testing: default is 10-fold cross-validation
…
Sample Tree
Color-coding using class
Decision Tree: Splitters
Tree Details
Tree Summary Reports
Pruning the tree
Keeping only important variables
Revised Tree
Automating CART: Command Log

Key CART features
Automated field selection: handles any number of fields, automatically selects relevant fields
No data preprocessing needed: does not require any kind of variable transforms
Impervious to outliers
Missing value tolerant: only a moderate loss of accuracy due to missing values
CART: Key Parts of Tree-Structured Data Analysis
Tree growing
  splitting rules to generate the tree
  stopping criteria: how far to grow?
  missing values: using surrogates
Tree pruning
  trimming off parts of the tree that don't work
  ordering the nodes of a large tree by contribution to tree accuracy ... which nodes come off first?
Optimal tree selection
  deciding on the best tree after growing and pruning
  balancing simplicity against accuracy
Data is split into two partitions
Q: Does C4.5 always have binary partitions?
Partitions can also be split into sub-partitions
  hence the procedure is recursive
A CART tree is generated by repeated partitioning of the data set
  the parent gets two children
  each child produces two grandchildren
  four grandchildren produce eight great-grandchildren
CART is a form of Binary Recursive Partitioning
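Conceptually, binary recursive partitioning can be sketched in a few lines of Python (all names here are illustrative, not CART's actual implementation):

    # Minimal sketch of binary recursive partitioning (hypothetical names).
    class Node:
        def __init__(self, rows):
            self.rows = rows          # the data partition at this node
            self.question = None      # a yes/no split question (callable)
            self.left = None          # cases answering YES
            self.right = None         # cases answering NO

    def grow(node, find_best_split, min_size=5):
        """Recursively split a partition into two sub-partitions."""
        if len(node.rows) < min_size:
            return                    # stopping criterion: node too small
        question = find_best_split(node.rows)
        if question is None:
            return                    # no split improves class purity
        node.question = question
        node.left = Node([r for r in node.rows if question(r)])
        node.right = Node([r for r in node.rows if not question(r)])
        grow(node.left, find_best_split, min_size)
        grow(node.right, find_best_split, min_size)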
Splits always determined by questions with YES/NO answers
Is continuous variable X ≤ c?
Does categorical variable D take on levels i, j, or k? e.g., is GENDER M or F?
Standard split: if the answer to the question is YES a case goes left; otherwise it goes right
  this is the form of all primary splits
  example: Is AGE ≤ 62.5?
More complex conditions are possible:
  Boolean combinations: AGE <= 62 OR BP <= 91
  Linear combinations: 0.66*AGE - 0.75*BP < -40
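To make the three split forms concrete, here is a small Python sketch (the variable names AGE and BP come from the example data below; the predicates themselves are illustrative):

    # Each CART split is a yes/no question about one case (a sketch).
    standard = lambda case: case["AGE"] <= 62.5              # primary split form
    boolean  = lambda case: case["AGE"] <= 62 or case["BP"] <= 91
    linear   = lambda case: 0.66*case["AGE"] - 0.75*case["BP"] < -40

    case = {"AGE": 45, "BP": 120}
    print(standard(case), boolean(case), linear(case))       # True True True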
Searching All Possible Splits
For any node CART will examine ALL possible splits
  CART allows search over a random sample if desired
Look at the first variable in our data set, AGE, with minimum value 40
  Test split: Is AGE ≤ 40?
  Will separate out the youngest persons to the left
  Could be many cases if many people have the same AGE
Next, increase the AGE threshold to the next youngest person: Is AGE ≤ 43?
  This will direct additional cases to the left
Continue increasing the splitting threshold value by value
  each value is tested for how good the split is ... how effective it is in separating the classes from each other
Q: Should we consider splits between values of the same class?
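A sketch of this threshold sweep in Python (hypothetical helper names; rows are dicts keyed by variable name):

    # Enumerate candidate split thresholds for one continuous variable (sketch).
    def candidate_thresholds(rows, var):
        """Distinct sorted values of var; each defines a test 'var <= t'."""
        return sorted({row[var] for row in rows})

    def sweep(rows, var):
        for t in candidate_thresholds(rows, var):
            left  = [r for r in rows if r[var] <= t]
            right = [r for r in rows if r[var] >  t]
            yield t, left, right      # each (threshold, partition) gets scored
    # Note: the largest value sends every case left, so it yields no real split.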
Split Tables
Q: Where do splits need to be evaluated?

Sorted by Age                      Sorted by Blood Pressure
AGE  BP   SINUST  SURVIVE          AGE  BP   SINUST  SURVIVE
40    91  0       SURVIVE          43    78  1       DEAD
40   110  0       SURVIVE          40    83  1       DEAD
40    83  1       DEAD             40    91  0       SURVIVE
43    99  0       SURVIVE          43    99  0       SURVIVE
43    78  1       DEAD             40   110  0       SURVIVE
43   135  0       SURVIVE          49   110  1       SURVIVE
45   120  0       SURVIVE          48   119  1       DEAD
48   119  1       DEAD             45   120  0       SURVIVE
48   122  0       SURVIVE          48   122  0       SURVIVE
49   150  0       DEAD             43   135  0       SURVIVE
49   110  1       SURVIVE          49   150  0       DEAD
CART Splitting Criteria: Gini Index
If a data set T contains examples from n classes, the gini index gini(T) is defined as

    gini(T) = 1 - Σ_{j=1..n} (p_j)²

where p_j is the relative frequency of class j in T.
gini(T) is minimized if the classes in T are skewed (i.e., the node is dominated by one class).
Advanced: CART also has other splitting criteria; Twoing is recommended for multi-class problems.
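A minimal Python sketch of the gini calculation, applied to the split-table rows above to score two candidate AGE splits (helper names are illustrative):

    from collections import Counter

    def gini(labels):
        """gini(T) = 1 - sum_j p_j^2 over the class frequencies in T."""
        n = len(labels)
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    # Rows from the split table: (AGE, BP, SINUST, SURVIVE)
    rows = [(40, 91, 0, "SURVIVE"), (40, 110, 0, "SURVIVE"), (40, 83, 1, "DEAD"),
            (43, 99, 0, "SURVIVE"), (43, 78, 1, "DEAD"),    (43, 135, 0, "SURVIVE"),
            (45, 120, 0, "SURVIVE"), (48, 119, 1, "DEAD"),  (48, 122, 0, "SURVIVE"),
            (49, 150, 0, "DEAD"),   (49, 110, 1, "SURVIVE")]

    def weighted_gini(left, right):
        """Size-weighted average gini of the two child partitions."""
        n = len(left) + len(right)
        return (len(left) * gini(left) + len(right) * gini(right)) / n

    # Score the split 'Is AGE <= 43?' against 'Is AGE <= 45?'
    for t in (43, 45):
        left  = [r[3] for r in rows if r[0] <= t]
        right = [r[3] for r in rows if r[0] > t]
        print(t, round(weighted_gini(left, right), 3))
    # prints: 43 0.461 and 45 0.442 -> AGE <= 45 is the better (lower gini) split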
Missing as a distinct splitter value
CHAID treats missing as a distinct categorical value, e.g. AGE is 25-44, 45-64, 65-95, or missing
  method also adopted by C4.5
If missing is a distinct value then all cases with missing go the same way in the tree
  Assumption: whatever the unknown value is, it is the same for all cases with a missing value
Problem: there can be more than one reason for a database field to be missing
  E.g. Income as a splitter wants to separate high from low
  Levels most likely to be missing? High income AND low income!
  We don't want to send both groups to the same side of the tree
CART Treatment of Missing Primary Splitters: Surrogates
CART uses a more refined method — a surrogate is used as a stand-in for a missing primary field
  the surrogate should be a valid replacement for the primary
Consider our example of INCOME
  other variables like Education or Occupation might work as good surrogates
  higher-education people usually have higher incomes
  people in high-income occupations will usually (though not always) have higher incomes
Using a surrogate means that cases missing on the primary are not all treated the same way
  whether a case goes left or right depends on its surrogate value
  thus record-specific ... some cases go left, others go right
Surrogates: Mimicking Alternatives to Primary Splitters
A primary splitter is the best splitter of a node
A surrogate is a splitter that splits in a fashion similar to the primary
  surrogate — a variable with near-equivalent information
Why useful? If the primary is expensive or difficult to gather and the surrogate is not,
  then consider using the surrogate instead
  the loss in predictive accuracy might be slight
If the primary splitter is MISSING then CART will use a surrogate
  if the top surrogate is missing, CART uses the 2nd best surrogate, etc.
  if all surrogates are missing too, CART uses the majority rule
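The fallback order can be sketched as follows (Python; the split and surrogate representations are hypothetical stand-ins for CART's internals):

    # Route one case through a node: primary, then surrogates, then majority.
    def send_left(case, primary, surrogates, majority_goes_left):
        """Return True if the case goes left at this node."""
        for split_var, test in [primary] + surrogates:   # best surrogate first
            value = case.get(split_var)
            if value is not None:                        # field observed
                return test(value)
        return majority_goes_left                        # all missing: majority rule

    # Example: primary on INCOME, surrogate on EDUCATION (illustrative values)
    primary    = ("INCOME",    lambda v: v <= 40000)
    surrogates = [("EDUCATION", lambda v: v <= 12)]
    print(send_left({"EDUCATION": 16}, primary, surrogates, True))   # False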
CART Pruning Method: Grow Full Tree, Then Prune
You will never know when to stop ... so don't!
Instead ... grow trees that are obviously too big
  the largest tree grown is called the "maximal" tree
  the maximal tree could have hundreds or thousands of nodes; usually we instruct CART to grow only moderately too big
  rule of thumb: grow trees about twice the size of the truly best tree
This becomes the first stage in finding the best tree
Next we will have to get rid of the parts of the overgrown tree that don't work (are not supported by test data)
Maximal Tree Example
Tree Pruning
Take a very large tree (the "maximal" tree)
  the tree may be radically over-fit
  it tracks all the idiosyncrasies of THIS data set
  it tracks patterns that may not be found in other data sets
  at the bottom of the tree, splits are based on very few cases
  analogous to a regression with a very large number of variables
PRUNE away branches from this large tree
  but which branch to cut first?
CART determines a pruning sequence:
  the exact order in which the nodes should be removed
  the pruning sequence is determined for EVERY node
  the sequence is determined all the way back to the root node
Order of Pruning: Weakest Link Goes First
Pruning: which nodes come off next?
Prune away the "weakest link" — the nodes that add least to the overall accuracy of the tree
  a node's contribution to the overall tree is a function of both its increase in accuracy and its size
  the accuracy gain is weighted by the node's share of the sample
  small nodes tend to get removed before large ones
If several nodes have the same contribution they are all pruned away simultaneously
  hence more than two terminal nodes could be cut off in one pruning step
The sequence is determined all the way back to the root node
  we need to allow for the possibility that the entire tree is bad
  if the target variable is unpredictable we will want to prune back to the root ... the no-model solution
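For reference, this is the cost-complexity score behind weakest-link pruning, in the standard notation of the CART monograph (R(t) is the resubstitution error of node t collapsed to a leaf, R(T_t) the error of the subtree rooted at t, and |T̃_t| its number of terminal nodes):

    g(t) = \frac{R(t) - R(T_t)}{|\tilde{T}_t| - 1}

The internal node with the smallest g(t) adds the least accuracy per terminal node and is pruned first; ties prune together, which is why several terminal nodes can come off in one step.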
Pruning Sequence Example
[Four trees from the pruning sequence, with 24, 21, 20, and 18 terminal nodes]
Now we test every tree in the pruning sequence
Take a test data set, drop it down the largest tree in the sequence, and measure its predictive accuracy
  how many cases right and how many wrong
  measure accuracy overall and by class
Do the same for the 2nd largest tree, 3rd largest tree, etc.
The performance of every tree in the sequence is measured
  results are reported in table and graph formats
Note that this critical stage is impossible to complete without test data
  the CART procedure requires test data to guide tree evaluation
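A sketch of this evaluation loop (Python; pruned_trees, predict, and the row format are hypothetical placeholders):

    # Measure test error for every tree in the pruning sequence (sketch).
    def test_error(tree, test_rows, predict):
        """Fraction of test cases the tree misclassifies."""
        wrong = sum(1 for row in test_rows if predict(tree, row) != row["target"])
        return wrong / len(test_rows)

    def error_curve(pruned_trees, test_rows, predict):
        """One error estimate per tree, largest tree first."""
        return [test_error(t, test_rows, predict) for t in pruned_trees]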
Compare error rates measured on
  learn data
  a large test set
Learn-data R(T) always decreases as the tree grows (Q: Why?)
Test R(T) first declines, then increases (Q: Why?)
Overfitting is the result of too much reliance on the learn-data R(T)
  it can lead to disasters when the tree is applied to new data
Training Data Vs. Test Data Error Rates

No. Terminal Nodes   R(T)   Rts(T)
        71           .00     .42
        63           .00     .40
        58           .03     .39
        40           .10     .32
        34           .12     .32
        19           .20     .31
      **10           .29     .30
         9           .32     .34
         7           .41     .47
         6           .46     .54
         5           .53     .61
         2           .75     .82
         1           .86     .91

(** marks the tree with the lowest test error Rts(T))
Why look at training data error rates (or cost) at all?
First, they provide a rough guide of how you are doing
  the truth will typically be WORSE than the training data measure
If the tree is performing poorly on training data, you may not want to pursue it further
The training data error rate is more accurate for smaller trees
  so it is a reasonable guide for smaller trees
  but a poor guide for larger trees
At the optimal tree, training and test error rates should be similar
  if not, something is wrong
  it is useful to compare not just the overall error rate but also within-node performance between training and test data
The Best Pruned Subtree: An Estimation Problem
Within a single CART run, which tree is best?
The process of pruning the maximal tree can yield many sub-trees
  a test data set or cross-validation measures the error rate of each tree
Current wisdom — select the tree with the smallest error rate
Only drawback — the minimum may not be precisely estimated
  the typical error rate as a function of tree size has a flat region
  the minimum could be anywhere in this region
CART: Optimal Tree
[Figure: error R(Tk) versus number of terminal nodes Tk (0 to 50), showing a flat region around the minimum]
One SE Rule — One Standard Error Rule
The original monograph recommends NOT choosing the minimum error tree, because of possible instability of results from run to run
Instead it suggests the SMALLEST TREE within 1 SE of the minimum error tree
  tends to provide very stable results from run to run
  is possibly as accurate as the minimum cost tree, yet simpler
Current learning — the one SE rule is good for small data sets
  for large data sets one should pick the most accurate tree
  known as the zero SE rule
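Both selection rules can be sketched as follows (Python; errors is a list like the one produced by the evaluation loop above, ordered largest tree first, and the SE formula assumes a simple binomial error estimate):

    import math

    def pick_tree(errors, n_test, se_rule=True):
        """errors[i] = test error of tree i, ordered largest tree first.
        Returns the index of the selected tree."""
        best = min(range(len(errors)), key=lambda i: errors[i])
        if not se_rule:
            return best                       # zero SE rule: most accurate tree
        e = errors[best]
        se = math.sqrt(e * (1 - e) / n_test)  # binomial SE of the error estimate
        # smallest tree (= latest in pruning order) within 1 SE of the minimum
        within = [i for i in range(len(errors)) if errors[i] <= e + se]
        return max(within)                    # later index = smaller tree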
In what sense is the optimal tree "best"?
The optimal tree has the lowest or near-lowest cost, as determined by a test procedure
  the tree should exhibit very similar accuracy when applied to new data
BUT the best tree is NOT necessarily the one that happens to be most accurate on a single test database
  trees somewhat larger or smaller than "optimal" may be preferred
There is room for user judgment
  judgment not about split variables or values
  judgment as to how much of the tree to keep
  determined by the story the tree is telling
  and by willingness to sacrifice a small amount of accuracy for simplicity
CART Summary
CART Key Features:
  binary splits
  gini index as the splitting criterion
  grow, then prune
  surrogates for missing values
  optimal tree – 1 SE rule
  lots of nice graphics
Decision Tree Summary
Decision Trees:
  splits – binary, multi-way
  split criteria – entropy, gini, ...
  missing value treatment
  pruning
  rule extraction from trees
Both C4.5 and CART are robust tools
No method is always superior – experiment!
Witten & Eibe