Using CART For Beginners with a Telco Example Dataset

Salford Systems Webex Training Salford Systems http://www.salford-systems.com

Transcript of Using CART For Beginners with a Telco Example Dataset

Page 1: Using CART For Beginners with A Teclo Example Dataset

Salford Systems Webex Training

Salford Systems

http://www.salford-systems.com

Page 2: Using CART For Beginners with A Teclo Example Dataset

CART® Decision Tree Basics

• We start with a simple analysis of some market research

data using CART

• This introduction assumes no background in data mining or

predictive analytics

• We do assume you have had some experience reviewing

data with the purpose of discovering interesting and/or

predictive patterns

© Copyright Salford Systems 2013

Page 3: Using CART For Beginners with A Teclo Example Dataset

Beginning with CART

• CART is the perfect place to start learning about data mining

• Widely regarded as one of the most important tools in data

mining and also the easiest to understand and master

– Decision trees are still the most popular data analysis tool among

experienced data miners

• Delivers easy to understand analyses of complex data

– Allows for very sophisticated analyses especially when a structured

series of trees are developed

– Effective Exploratory Data Analysis (EDA) to support more

conventional modeling (e.g. logistic regression)

© Copyright Salford Systems 2013

Page 4: Using CART For Beginners with A Teclo Example Dataset

Classification with CART®: Real world study, early 1990s

• Fixed line service provider offering a new mobile phone service

• Wants to identify customers most likely to accept new mobile offer

• Data set based on limited market trial

• 830 usable records

• 67 attributes and target including

– Demographics

– Attitudes and Needs

– Pricing for handset & minutes

© Copyright Salford Systems 2013

Page 5: Using CART For Beginners with A Teclo Example Dataset

Mobile Phone Offer

• Data is a sample of land line telephone customers of a

European telco

• At the time mobile phones were very rare in the country in

question

• The Company realized the time was right to introduce

mobile phones on a substantial scale to their existing fixed

line customer base

• Key questions:

– WHO to target with the marketing campaigns for the new product

– HOW MUCH to charge for the handset

© Copyright Salford Systems 2013

Page 6: Using CART For Beginners with A Teclo Example Dataset

Nature of the Research

• Company arranged to make real world offers to about 1,000

existing land line customers

• Everyone was presented the same offer (only one model of

phone and one service plan available)

• The PRICE of the handset was varied randomly over a large

range of prices from near zero to about $300

• Goal was to learn who responded positively and at what

price points

• Offers were made in person as part of a one hour visit in

which much was learned about the household (media

preferences, number of children, distance to work, etc)

© Copyright Salford Systems 2013

Page 7: Using CART For Beginners with A Teclo Example Dataset

Nature of the Data

• Target variable RESPONSE: Coded 0 or 1 (0 = NO, 1 = YES)

• 65 available predictors include variables like:

HANDPRIC   Cost of handset (one-time fee)
USEPRICE   Usage cost (per month, 100 minutes)
TELEBILC   Landline home phone bill average
CITY       Resident in which of 5 major cities
AGE        Coded in 5-year increments
HOUSIZ     Household size; possible proxy for income, coded 1-6
SEX        Male, Female, Unknown
EDUCATN    Education, coded 1-7 (up through postgraduate)

© Copyright Salford Systems 2013

Page 8: Using CART For Beginners with A Teclo Example Dataset

Analysis File Overview in CART 6.0

© Copyright Salford Systems 2013

Page 9: Using CART For Beginners with A Teclo Example Dataset

Set Up the Model

(Select Target, allowable predictors)

The only requirement is to select the TARGET (dependent) variable. CART will do everything else automatically.

© Copyright Salford Systems 2013

Page 10: Using CART For Beginners with A Teclo Example Dataset

CART:

Does its own variable selection

• Embedded variable (feature) selection means that modeler can let the software make its own choice of predictors

• Modeler will often want to limit the model to focus on selected inputs

– Exclude ID variables and merge keys

– Exclude clones of the dependent variable

– Exclude data pertaining to the future (relative to the dependent)

– E.g. restrict a model to easily available predictors

– Test predictive power of purchased external data

• Modeling automation can allow exploration of a vast space of pre-selected predictors (see later slides)

© Copyright Salford Systems 2013

Page 11: Using CART For Beginners with A Teclo Example Dataset

In this example we run the CART model

• CART completes analysis and gives access to all results from the NAVIGATOR

– Shown on next slide

• Upper section displays tree of a selected size

– number of terminal nodes

• Lower section displays error rate for trees of all possible sizes

• Green bar marks most accurate tree

• We display a compact 10 node tree for further scrutiny

© Copyright Salford Systems 2013

Page 12: Using CART For Beginners with A Teclo Example Dataset

CART Model Viewer

Access reports and drill into model details

The most accurate tree is marked with the green bar. Above we select the 10 node tree for the convenience of a more compact display. Note the train/test area under the ROC curve.

© Copyright Salford Systems 2013

Page 13: Using CART For Beginners with A Teclo Example Dataset

Root Node: Hover Mouse

Tree starts with all training data

Displays details of TARGET variable in overall training data

Above we see that 15.2% of the 830 households accepted the offer

Goal of the analysis is now to extract patterns characteristic of responders

© Copyright Salford Systems 2013

Page 14: Using CART For Beginners with A Teclo Example Dataset

Goal is to split node: separate responders

• Details of root node split

• If we could only use a single piece of information to separate responders from non-responders, CART chooses the HANDSET PRICE (HANDPRIC)

• Those offered the phone at a price > 130 contain only 9.9% responders

• Those offered a lower price respond at 21.9% (a quick check of rates like these is sketched below)

© Copyright Salford Systems 2013
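A quick way to verify rates like these outside of CART is a few lines of pandas. This is a minimal sketch with made-up records (the rates quoted in the comments are the ones reported on this slide), not Salford code:

```python
# Sketch: response rate on either side of a candidate split on handset price.
# The records are synthetic; the rates in the comments come from the slide above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"HANDPRIC": rng.uniform(0, 300, 830)})        # price offered
df["RESPONSE"] = (rng.random(830) <
                  np.where(df["HANDPRIC"] <= 130, 0.22, 0.10)).astype(int)

split_value = 130
low = df[df["HANDPRIC"] <= split_value]
high = df[df["HANDPRIC"] > split_value]

print("overall response rate:", round(df["RESPONSE"].mean(), 3))   # slide: 0.152
print("price <= 130:", round(low["RESPONSE"].mean(), 3))           # slide: 0.219
print("price  > 130:", round(high["RESPONSE"].mean(), 3))          # slide: 0.099
```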

Page 15: Using CART For Beginners with A Teclo Example Dataset

CART Splitting Rules

• We discuss the details later

• Here we just point out that the split CART displays is

– "the best of all possible splits"

• Subject to the splitting criteria you have chosen and any

constraints imposed

• How do we know this split is "best"?

• Because CART actually tries all possible splits looking for

the best

– Exhaustive brute force search

– Advanced algorithms used to make this search fast

– As much as 100 times faster than other decision trees

© Copyright Salford Systems 2013

Page 16: Using CART For Beginners with A Teclo Example Dataset

Grow progressively bigger tree: One split at a time

© Copyright Salford Systems 2013

• Binary recursive partitioning repeated until further splitting

impossible (e.g. data exhausted)

• This leads us to the largest possible or "maximal" tree
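As a rough illustration of the recursive mechanics (not the Salford implementation), here is a minimal Python sketch. The split search below is a crude placeholder (the median of the first varying column); the exhaustive CART-style search is sketched later in this transcript. The stopping rules mirror the ones above: too few records, a pure node, or nothing left to split on.

```python
# Minimal sketch of binary recursive partitioning; illustrative only.
# X is a pandas DataFrame of predictors, y a pandas Series of 0/1 outcomes.
import pandas as pd

def placeholder_split(X):
    # Crude stand-in for a real split search: median of the first varying column.
    for col in X.columns:
        if X[col].nunique() > 1:
            return col, X[col].median()
    return None                                   # all predictors constant: cannot split

def grow(X, y, atom=2):
    node = {"n": len(y), "response_rate": float(y.mean())}
    if len(y) < atom or y.nunique() < 2:          # data exhausted or node is pure
        return node                               # terminal node
    found = placeholder_split(X)
    if found is None:
        return node
    col, value = found
    go_left = X[col] <= value
    if go_left.all() or (~go_left).all():         # split separates nothing
        return node
    node["split"] = (col, float(value))
    node["left"] = grow(X[go_left], y[go_left], atom)     # recurse on each partition
    node["right"] = grow(X[~go_left], y[~go_left], atom)
    return node

# Tiny demonstration on made-up data
demo = pd.DataFrame({"HANDPRIC": [20, 80, 150, 220, 40, 190],
                     "AGE":      [30, 45, 30, 60, 25, 50]})
response = pd.Series([1, 1, 0, 0, 1, 0])
print(grow(demo, response))
```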

Page 17: Using CART For Beginners with A Teclo Example Dataset

Maximal tree is raw material for best model

© Copyright Salford Systems 2013

• Goal is to find optimal tree embedded inside maximal tree

• Will find optimal tree via "pruning"

• Like backwards stepwise regression

• Challenge: A tree with 100 terminal nodes can be pruned back to 99

terminal nodes by eliminating any one of the 99 penultimate nodes

• Now the 99 new terminal nodes can be cut back to 98 by eliminating

any one of the surviving 98 penultimate nodes

• Something like 99! possible trees. How do we find the best?

Page 18: Using CART For Beginners with A Teclo Example Dataset

Pruning Sequence

• CART automatically generates a pruning sequence which

develops a preferred sequence of progressively smaller

trees

• We can prove that for a given tree size the CART tree in the

sequence will be the best performing tree of all possible

trees of that size

• In our sequence, the 10 node tree is guaranteed to be more

accurate than any other 10 node tree you could extract from

the maximal tree

• You as the user never need to worry about this

• "Better" is defined in terms of performance on the training

data as we need the tree sequence before we can test

© Copyright Salford Systems 2013

Page 19: Using CART For Beginners with A Teclo Example Dataset

Error Curve: Plots Accuracy vs Model Size

© Copyright Salford Systems 2013

•Requires test data

•Can use cross-validation (sample reuse) if data is scarce

•Curve typically U-shaped

• Too small is not good and neither is too large

•Can look at any tree in the sequence of pruned subtrees

•Error is what BFOS call an “honest” estimate of model performance
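The same grow-then-prune-then-test idea can be illustrated with scikit-learn, whose trees implement CART-style minimal cost-complexity pruning. This is a hedged sketch on a stand-in dataset, not the Salford workflow; it prints the honest (test-data) error for each tree size in the pruning sequence so the U-shaped curve can be seen.

```python
# Sketch of a pruning sequence and its error curve via cost-complexity pruning.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)               # stand-in data set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Grow a large tree, then obtain the pruning sequence (one alpha per subtree)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    leaves = tree.get_n_leaves()
    test_error = 1.0 - tree.score(X_te, y_te)             # honest estimate on test data
    print(f"{leaves:4d} terminal nodes   test error {test_error:.3f}")
```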

Page 20: Using CART For Beginners with A Teclo Example Dataset

Pick a modest sized tree to examine

Note high response in this RED colored node

Response of 38.5% in this segment vs. 15.2% overall

Lift = 38.5 / 15.2 ≈ 2.53

© Copyright Salford Systems 2013

Page 21: Using CART For Beginners with A Teclo Example Dataset

Navigator allows access to all model info

• The terminal nodes are color coded to represent results

– RED nodes are "hot" and contain high concentrations of the class of interest (buyers)

– BLUE nodes are "cold" and contain very low concentrations of the class of interest

– PINK and WHITE nodes have moderate concentrations

• We first look to see if we have any RED nodes

– Explore any red nodes via mouse hover

• Then we drill down to see a tree schematic revealing the main drivers of the tree

© Copyright Salford Systems 2013

Page 22: Using CART For Beginners with A Teclo Example Dataset

Select "splitters" View

Selects a streamlined overview of the tree showing ONLY primary splitters

© Copyright Salford Systems 2013

Page 23: Using CART For Beginners with A Teclo Example Dataset

Model Overview: Main Drivers (Red = Good Response, Blue = Poor Response)

High values of a split variable always go to the right; low values go left

© Copyright Salford Systems 2013

Page 24: Using CART For Beginners with A Teclo Example Dataset

Examine Extreme Right-most Terminal Node

• Hover mouse over node to see inside

• Even though this node is on the "high price" side of the tree it still

exhibits the strongest response across all terminal node

segments (43.5% response)

• Rules defining this node are shown on the next slide

© Copyright Salford Systems 2013

Page 25: Using CART For Beginners with A Teclo Example Dataset

Rules can be extracted in a variety of languages

Here we select rules

expressed in C for

one node of interest

Entire tree can also

be rendered in Java,

XML/PMML, or

SAS

© Copyright Salford Systems 2013

Page 26: Using CART For Beginners with A Teclo Example Dataset

Continuing down the tree

• We note that even if the new product is offered at a high price we can still find prospects very interested:

– Those that have a high average landline bill and own a pager

– This group displays greatest probability of response (43.5%)

© Copyright Salford Systems 2013

Page 27: Using CART For Beginners with A Teclo Example Dataset

Classic Detailed Tree Display

Analyst can select details to be displayed

© Copyright Salford Systems 2013

Page 28: Using CART For Beginners with A Teclo Example Dataset

Control Over Details Displayed in Nodes

At left an example

showing the class

bar chart is

displayed

Separate controls for

internal and terminal

nodes

© Copyright Salford Systems 2013

Page 29: Using CART For Beginners with A Teclo Example Dataset

Configure Print Image Interactively

Shrink to one page, include header/footer

© Copyright Salford Systems 2013

Page 30: Using CART For Beginners with A Teclo Example Dataset

Tree Performance Measures

and Principal Message

• In addition to the details of the tree (splits, split values)

• Variable Importance Ranking

• Confusion Matrix (Prediction Success Matrix)

• Gains, ROC

© Copyright Salford Systems 2013

Page 31: Using CART For Beginners with A Teclo Example Dataset

Variable Importance Ranking

(Relative impact on outcomes)

Three major ways of computing variable importance. Above is the default display.

© Copyright Salford Systems 2013

Page 32: Using CART For Beginners with A Teclo Example Dataset

Predictive Accuracy

(How often right, how often wrong)

This model is not very accurate but ranks responders well

© Copyright Salford Systems 2013

Page 33: Using CART For Beginners with A Teclo Example Dataset

Gains Curve

In the top decile the model captures about 23% of responders

© Copyright Salford Systems 2013

Page 34: Using CART For Beginners with A Teclo Example Dataset

Performance Evaluation: ROC Curve

© Copyright Salford Systems 2013

Page 35: Using CART For Beginners with A Teclo Example Dataset

Observations on CART Tree

Contrasts with Conventional Stats

• CART leverages rank order of predictor to split

– Transforming predictor X into Log(X) will not change tree

– Of course tree will be expressed in terms of Log(X) but this will not

change the location of the split

– A traditional statistician's experiments with alternative transforms are unnecessary

• CART is immune to outliers in predictors

– Suppose X has values 1,2,3,…,100, 900

– To CART this is the same as 1,2,3,…,100, 101

– All CART "sees" is the rank order

• We will see later that CART has built-in missing value

handling

• So no worry about outliers, missing values, transformations
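A small check of the rank-order claim, using a scikit-learn stump as a stand-in for a single CART split and made-up data: the printed threshold changes under a log transform, but the resulting left/right partition of the records does not.

```python
# Sketch: a monotone transform changes the printed threshold, not the partition.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.uniform(1, 300, size=500)                    # positive predictor, e.g. a price
y = (x < 130).astype(int)                            # outcome driven by a cutoff on x

stump_x = DecisionTreeClassifier(max_depth=1).fit(x.reshape(-1, 1), y)
stump_log = DecisionTreeClassifier(max_depth=1).fit(np.log(x).reshape(-1, 1), y)

t_x = stump_x.tree_.threshold[0]
t_log = stump_log.tree_.threshold[0]
print("threshold on X:      ", t_x)
print("threshold on log(X): ", t_log)
print("same partition:", np.array_equal(x <= t_x, np.log(x) <= t_log))
```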

© Copyright Salford Systems 2013

Page 36: Using CART For Beginners with A Teclo Example Dataset

CART Methodology: Partition Data

Into Two Segments

• Partitioning line parallel to an axis

• Root node split first

– Split at 2.450

– Isolates all the type 1 species from the rest of the sample

• This gives us two child nodes

– One is a Terminal Node with only type 1 species

– The other contains only type 2 and 3

• Note: entire data set divided into two separate parts

© Copyright Salford Systems 2013

Page 37: Using CART For Beginners with A Teclo Example Dataset

Second Split: Partitions

Only Portion of the Data

• Again, partition with line parallel to one of the two axes

• CART selects PETALWID to split this NODE

– Split it at 1.75

– Gives a tree with a misclassification rate of 4%

• Split applies only to a single partition of the data

• Each partition is analyzed separately
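These two slides describe what appears to be the classic petal-length/petal-width example; a depth-2 scikit-learn tree on the iris data typically reproduces the same two axis-parallel partitions (petal length at 2.45, then petal width at 1.75) and the 4% training misclassification rate. A minimal sketch:

```python
# Sketch: a two-split CART-style tree on the classic iris data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

print(export_text(tree, feature_names=list(iris.feature_names)))
print("training misclassification rate:", 1.0 - tree.score(iris.data, iris.target))
```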

© Copyright Salford Systems 2013

Page 38: Using CART For Beginners with A Teclo Example Dataset

Discriminant Analysis Uses Oblique Lines

• Linear combinations are difficult to understand and explain

• CART does permit "oblique" splits based on linear combinations of small sets of variables but this is rarely desirable

© Copyright Salford Systems 2013

Page 39: Using CART For Beginners with A Teclo Example Dataset

CART Representation of a Surface

Model clearly non-linear

Height of bar represents probability of response

Remaining axes represent values of two predictors

Greatest probability of response is in the corner to the right

Page 40: Using CART For Beginners with A Teclo Example Dataset

CART Splitting Process

• Standard splits are based on ONE predictor and the form of

a database RULE

• A data record goes left if

splitter_variable <= split value

• Examples: A data record goes left

• if AGE<=35

• if CREDIT_SCORE <= 700

• if TELEPHONE_BILL <= 50

© Copyright Salford Systems 2013

Page 41: Using CART For Beginners with A Teclo Example Dataset

Searching all splits facilitated by sorting

• On left we sort by TELEBILC, on right by TRAVTIMR

• Test smallest value first, then next smallest, etc moving all the way down the column

• The arrow shows a split sending 10 cases to the left and all other data to the right
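A minimal sketch of the sort-and-scan idea for a single predictor, with made-up bill values: after sorting, every boundary between adjacent distinct values is a candidate split point, and each can be evaluated in turn.

```python
# Sketch: after sorting one predictor, candidate splits are the boundaries
# between adjacent distinct values (midpoints used here for readability).
import numpy as np

telebilc = np.array([12, 55, 31, 80, 31, 9, 47, 62, 25, 70])   # made-up bill values

values = np.sort(np.unique(telebilc))
candidate_splits = (values[:-1] + values[1:]) / 2.0             # midpoints between neighbours

for s in candidate_splits:
    n_left = int((telebilc <= s).sum())
    print(f"split at {s:5.1f}: {n_left} records left, {len(telebilc) - n_left} right")
```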

Page 42: Using CART For Beginners with A Teclo Example Dataset

Example Root Node Split: Continuous Splitter

© Copyright Salford Systems 2013

From our Euro_telco_mini.xls example

Split is TELEBILC <= 50

Page 43: Using CART For Beginners with A Teclo Example Dataset

Alternative Split Points

What if we split the data at TELEBILC <= 25?

© Copyright Salford Systems 2013

Note that the response rates of the two nodes under this split are very similar

They are much more different after splitting at the optimal value

Page 44: Using CART For Beginners with A Teclo Example Dataset

Two splits separate quite differently

© Copyright Salford Systems 2013

The first pane shows two segments with 14.3% and 15.5% response

The second pane shows two segments with 12.7% and 19.8%

Our goal in CART is to generate substantially different segments and we

accomplish this by experimenting with every possible split value for every predictor

Page 45: Using CART For Beginners with A Teclo Example Dataset

CART Splitting Process: More

• Splitter variables need not be numeric, they can be text

• Splitter variables need not be ordered

• A data record goes left

• if CITY$ = "London" OR "Madrid" OR "Paris"

• if DIAGNOSIS = 111 OR 35 OR 9999

© Copyright Salford Systems 2013

Page 46: Using CART For Beginners with A Teclo Example Dataset

Splits on K-level categorical predictors: 2^(K-1) − 1 ways to split

• CART considers all possible splits based on a categorical predictor

• Example: four regions – A, B, C, D – can be split 7 ways (2^3 − 1 = 7)

• Each decision is a possible split of the node and each is evaluated

• Note: A on the left and B, C, D on the right is the same split as its mirror image, A on the right and B, C, D on the left

• So we only list one version of this split – it is which cases stay together that matters, not which side of the tree they are on

     Left     Right
  1  A        B, C, D
  2  B        A, C, D
  3  C        A, B, D
  4  D        A, B, C
  5  A, B     C, D
  6  A, C     B, D
  7  A, D     B, C

© Copyright Salford Systems 2013
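The enumeration above is easy to reproduce; the sketch below (illustrative Python, not Salford code) fixes level A on the left so each split and its mirror image are counted once, which gives 2^(K-1) − 1 = 7 splits for K = 4.

```python
# Sketch: enumerate the 2**(K-1) - 1 distinct left/right splits of a
# categorical predictor (each split and its mirror image counted once).
from itertools import combinations

levels = ["A", "B", "C", "D"]
anchor, rest = levels[0], levels[1:]

splits = []
for r in range(len(rest) + 1):
    for combo in combinations(rest, r):
        left = {anchor, *combo}              # fix "A" on the left to skip mirror images
        right = set(levels) - left
        if right:                            # both children must be non-empty
            splits.append((sorted(left), sorted(right)))

for i, (left, right) in enumerate(splits, 1):
    print(i, left, "|", right)
print("count:", len(splits), "= 2**(K-1) - 1 =", 2 ** (len(levels) - 1) - 1)
```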

Page 47: Using CART For Beginners with A Teclo Example Dataset

Categorical Split Caution:

Dangers of HLCs (High Level Categoricals)

• Because categorical variables generate 2^(K-1) − 1 ways to split

the data high values of K can be problematic

• K=33 is not an unusually large number of levels yet allows

for about 4 billion ways to split the data

• When the number of possible splits exceeds the number of

records in the data the categorical variable has an

advantage over any continuous splitter

– A continuous variable with a unique value in every row of the data

gives us a choice of split points equal to the number of rows of data

• Later we will discuss several ways to deal with HLCs

including repackaging the high cardinality categoricals into

lower cardinality versions and penalties

© Copyright Salford Systems 2013

Page 48: Using CART For Beginners with A Teclo Example Dataset

Example Root Node Split: Categorical Splitter

© Copyright Salford Systems 2013

From our Euro_telco_mini.xls example

Observe that we have to LIST the values that go to each child node

Page 49: Using CART For Beginners with A Teclo Example Dataset

CART Competitor Splits

• The CART mechanism for splitting data is always the same

• We are given a block of data

– Could be all of our data and we are starting from scratch

– Could be a small part of our data obtained after already doing a lot of

slicing and dicing

• When we work with a block of data we do not take into

account how we got to that block of data

• We do not consider any information which might be

available outside of the block of data

• The block of data to be analyzed is our entire universe and

nothing else exists for us

© Copyright Salford Systems 2013

Page 50: Using CART For Beginners with A Teclo Example Dataset

Getting Ready to Split

• For a block of data to be split

– It must contain a sufficient number of data records (ATOM)

– We can tell CART what the minimum must be

– Default is just TWO records

– In large database analysis we might reasonably set the minimum

quite a bit higher

– ATOM values such as 10, 20, 50, 100, 200 have cropped up in our

practical work

• If you are working with a small database such as those

encountered in biomedical research (e.g. 200 records total) you

will want to allow the ATOM size to be small

• If you are working with hundreds of thousands or millions of

records there is no harm in trying a minimum size like 200

© Copyright Salford Systems 2013

Page 51: Using CART For Beginners with A Teclo Example Dataset

Still Getting Ready to Split

• If we have a classification problem such as modeling

response to a marketing offer where there are two outcomes

– Responded

– Did Not Respond

• To be splittable the block of data cannot be "pure", i.e.

composed of all responders or all non-responders

– True regardless of how large the block of data is

– Splitting is designed to separate the responders from the non-

responders so we need a mixture to have something to do

• The data records cannot all have exactly the same values

for the predictors

– CART will be looking for a useful difference in a predictor between

responders and non-responders

© Copyright Salford Systems 2013

Page 52: Using CART For Beginners with A Teclo Example Dataset

Observation on Dummy Variable Predictors

• If you split a node using a continuous variable there is

always the chance that this same variable is used again in a

subsequent split for descendent nodes

• Once a node is split with a dummy variable this variable can

never be used again in descendant nodes

– Because a descendant node will contain either all 0 or all 1 values for

this variable. Hence it cannot split.

• If a dummy variable is introduced into the tree below the root

it might appear in more than one location in the tree

– But one use will never be the ancestor of the other use

© Copyright Salford Systems 2013

Page 53: Using CART For Beginners with A Teclo Example Dataset

Making The Split

• To split the block of data (which we will henceforth refer to

as splitting the node) we search each available predictor

• For every predictor we make a trial split at every distinct

value of the predictor

• For each trial split we compute a goodness of split measure

normally referred to as the "improvement"

• For each predictor we find the split value that yields the best

improvement

• Once every predictor has been searched to find the best

split point we rank the splitters in descending order and then

use the best overall splitter to grow the tree
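A compact sketch of this search, using Gini impurity decrease as the "improvement" score (CART's default splitting criterion) and small made-up arrays; illustrative only, not the Salford implementation.

```python
# Sketch of the exhaustive split search with Gini impurity decrease as the
# "improvement" score; illustrative only.
import numpy as np

def gini(y):
    p = np.bincount(y, minlength=2) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """X: dict of {name: 1-d array}, y: 0/1 array. Returns (name, value, improvement)."""
    best = (None, None, 0.0)
    parent = gini(y)
    for name, x in X.items():
        for v in np.unique(x)[:-1]:                    # trial split at every distinct value
            left, right = y[x <= v], y[x > v]
            gain = parent - (len(left) / len(y)) * gini(left) \
                          - (len(right) / len(y)) * gini(right)
            if gain > best[2]:
                best = (name, v, gain)
    return best

rng = np.random.default_rng(1)
X = {"HANDPRIC": rng.integers(0, 300, 200), "AGE": rng.integers(18, 80, 200)}
y = ((X["HANDPRIC"] < 130) & (rng.random(200) < 0.6)).astype(int)
print(best_split(X, y))
```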

© Copyright Salford Systems 2013

Page 54: Using CART For Beginners with A Teclo Example Dataset

Ranked List of Splitters

• The ranked list of splitters is also known as the competitor

list

• CART always computes the entire list as this is the only way

to know for sure which split is best

• To save space CART normally only displays the top 5

competitors within a node

– You can request a larger number in your options settings

• The root node at the top of the tree always displays the

complete list of competitors even if there are thousands of

predictors

© Copyright Salford Systems 2013

Page 55: Using CART For Beginners with A Teclo Example Dataset

Why Care about Competitor Splits?

• Useful to know if the best splitter is far better than all the rest

or only slightly better

• Useful to know which predictors show up near the top

– Are they very different from each other or are they all reflecting the same

underlying information

• Useful to know if a strong but perhaps 2nd best predictor

splits the data more evenly than the best

– We might want to try FORCING that 2nd best predictor into the root to

see what happens

– Sometimes this yields an overall better tree

• Pattern of top splitters may reflect problems

– Top 3 competitors may all be "too good to be true" and we might

need to drop them all from the analysis

© Copyright Salford Systems 2013

Page 56: Using CART For Beginners with A Teclo Example Dataset

Surrogate Splits

• Surrogate splits were first introduced by the authors of

CART in their classic monograph Classification and

Regression Trees, 1984.

• Surrogate splits are mimics or substitutes for the primary

splitter of a node

• An ideal surrogate splits the data in exactly the same way as

the primary split

– The "association" measure reflects how close to perfect a given

surrogate is

© Copyright Salford Systems 2013

Page 57: Using CART For Beginners with A Teclo Example Dataset

Why Surrogates?

• Surrogates have two primary functions:

– To split data when the primary splitter is missing

– To reveal common patterns among predictors in a data set

• CART searches for surrogate splitters in every node in the

tree

– Surrogates are searched for even when there is no missing data

– No guarantee that useful surrogates can be found

– CART attempts to find at least five surrogates for every node but this

number can be modified

– Number of surrogates actually found normally varies from node to

node

© Copyright Salford Systems 2013

Page 58: Using CART For Beginners with A Teclo Example Dataset

CART and Missing Values in Deployment

• CART is the only learning machine that is prepared to deal

with any pattern of missing values in future data

• Even if the training data have no missings CART develops

strategies to deal with the eventuality of any variable or

variables being missing

• Some learning machines cannot handle missing values at all

• Other learning machines can only deal with missing value

patterns that they have been trained on (seen before)

– E.g. handle X5=missing only if X5 was ever missing in the training

data

• CART has no such restrictions and is always ready for any

pattern of missings

© Copyright Salford Systems 2013

Page 59: Using CART For Beginners with A Teclo Example Dataset

Surrogates in Action:

Euro_telco_mini.xls

© Copyright Salford Systems 2013

Remember to check off CITY, MARITAL and RESPONSE as "categorical"

Page 60: Using CART For Beginners with A Teclo Example Dataset

Manually Prune Back to the 10-node tree

© Copyright Salford Systems 2013

Just click on the blue curve in the lower panel to select a smaller, easier to manage tree. Then double click on the left child of the root node (see arrow above)

Page 61: Using CART For Beginners with A Teclo Example Dataset

Look at the Left Child of the "Root"

© Copyright Salford Systems 2013

The primary splitter predicting subscription to a new mobile phone offer is the

monthly telephone bill (TELEBILC), dividing the node into spenders of more or less

than $50 per month

Page 62: Using CART For Beginners with A Teclo Example Dataset

Surrogate for TELEBILC

• If this variable were missing for any reason (database error,

person recently moved, new customer) we do not know

whether to move down the tree to the left or to the right

• Surrogate variable can be used in place of the missing primary splitter. In this case the surrogate is of the form

go to the left if MARITAL=1

• Left is associated with LOW spending on the telephone bill

• CART suggests that single person households spend less

while households headed by married or divorced persons

spend more

© Copyright Salford Systems 2013

Page 63: Using CART For Beginners with A Teclo Example Dataset

Surrogates and Direction

• A surrogate is intended to be a substitute for the primary

splitter making similar left/right decisions

• But surrogates may work in the opposite direction so every continuous variable surrogate is supplied with a "tag"

– The letter "s" after the split point stands for "standard"

– The letter "r" after the split point stands for "reverse"

• If a surrogate is negatively correlated with the primary

splitter then it will split in the reverse direction

– Categorical splitters are always organized so that the levels that

correspond to left in the primary splitter go left in the surrogate

© Copyright Salford Systems 2013

Page 64: Using CART For Beginners with A Teclo Example Dataset

Normally Surrogates Make Sense

• Our primary splitter is the average monthly spend of a

household on a fixed line telephone account

• Our surrogates include marital status, commute time to

work, age, and the city of residence

– Longer commutes are associated with larger spend on the phone

– Older head of household also is associated with larger spend

– We cannot interpret the CITY variable at this point because we don’t

know the identity of the cities

• In general surrogates help us understand the primary splitter

– Especially helpful in survey research

© Copyright Salford Systems 2013

Page 65: Using CART For Beginners with A Teclo Example Dataset

How to Compute Surrogates?

• This is a technical question which we will not cover here

– The CART monograph contains a wealth of technical information

although it can be a challenging read

• However, we will discuss the main ideas

• The top surrogate is

– A single variable

– A single split (in the same format as any primary splitter)

– Intended to mimic as closely as possible how data is partitioned by the primary splitter into LEFT and RIGHT nodes

• To get a surrogate think of generating a one split CART tree

where the dependent variable is {LEFT or RIGHT} as

defined by the primary splitter. (There are many details)
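Following that recipe, a hedged sketch using a scikit-learn stump as a stand-in: the "target" is the LEFT/RIGHT assignment made by the primary splitter, and a stump fitted to another (made-up, correlated) predictor plays the role of the surrogate.

```python
# Sketch: find a surrogate by modeling the primary split's LEFT/RIGHT assignment
# with a one-split stump on another predictor (stand-in for CART's own search).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
telebilc = rng.uniform(0, 120, 1000)                      # primary splitter (made-up)
other = (telebilc <= 50).astype(float) + rng.normal(0, 0.3, 1000)   # correlated predictor

went_left = (telebilc <= 50).astype(int)                  # LEFT/RIGHT from the primary split

stump = DecisionTreeClassifier(max_depth=1).fit(other.reshape(-1, 1), went_left)
print("surrogate split value:", stump.tree_.threshold[0])
print("agreement with primary split:", stump.score(other.reshape(-1, 1), went_left))
```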

© Copyright Salford Systems 2013

Page 66: Using CART For Beginners with A Teclo Example Dataset

What is "Association"?

• Association is a measure of the strength of the surrogate

• The lowest possible reported score is 0 (useless)

• The highest possible score is 1 (perfect clone)

• CART starts from the default rule: if you don't know which way to send a

data record down a tree go with the majority (sometimes weighted majority)

• If when training the tree most cases went left then in the absence of

other information also go left

• The default makes mistakes of course because it always sends every

record to the same majority side

– Association measures how much better the surrogate is than the

default rule (percent reduction in errors made)

• Default rule is the "surrogate of last resort"

© Copyright Salford Systems 2013
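A minimal sketch of that definition: association is the surrogate's percent reduction in mismatches relative to the go-with-the-majority default rule. The counts below are illustrative, not from the study.

```python
# Sketch of the association measure: percent reduction in mismatches
# relative to the "send everything with the majority" default rule.
def association(n_left, n_right, surrogate_mismatches):
    default_mismatches = min(n_left, n_right)     # default rule errs on the minority side
    if default_mismatches == 0:
        return 0.0
    return (default_mismatches - surrogate_mismatches) / default_mismatches

# Illustrative: primary split sends 900 left / 1100 right; a surrogate
# disagrees with the primary split on 120 records.
print(round(association(900, 1100, 120), 3))      # about 0.867: a strong surrogate
```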

Page 67: Using CART For Beginners with A Teclo Example Dataset

Competitors and Surrogates:

Different Objectives

© Copyright Salford Systems 2013

Competitors yield the best possible split when using that variable

Surrogate yields the best possible mimic of the primary splitter and goodness of

split may be sacrificed to match some aspect of the primary splitter

Note that C2 is a competitor with one split point and a surrogate with a different

split point

Page 68: Using CART For Beginners with A Teclo Example Dataset

Grow another tree on GB2000.XLS

• We prefer this data set because it has no missing values

making working through examples much easier

• Don’t forget: CART always computes surrogates and in this

way the CART tree is always prepared for future missings

• We will not be trying to make sense of this tree

– will look just at the mechanics

• Note the root node splitter and the top surrogate

© Copyright Salford Systems 2013

Page 69: Using CART For Beginners with A Teclo Example Dataset

Root Node Split

© Copyright Salford Systems 2013

Root Splitter:

M1 <= -.04645

Top Surrogate:

C2 <= -.10835

Page 70: Using CART For Beginners with A Teclo Example Dataset

Main Splitter vs. Best Surrogate

            Main Splitter       Surrogate
            Left      Right     Left      Right
  Class 1    672       328       626       374
  Class 2    252       748       300       700
  Total      924      1076       926      1074

© Copyright Salford Systems 2013

Best Surrogate must closely match not only the record counts in the child nodes

but also the distribution of the target variable

Page 71: Using CART For Beginners with A Teclo Example Dataset

Modeling ROOTSPLIT with CART

© Copyright Salford Systems 2013

Observation: Modeling the root node split (we have to create a new variable

to reflect this) will not necessarily match the surrogate report

Other factors must be taken into account. Here we get the right variable but

not the right split point

Page 72: Using CART For Beginners with A Teclo Example Dataset

Main Splitter vs. Best Surrogate: Model Root Split As a Binary Target

            Main Splitter       Surrogate           Alternate
            Left      Right     Left      Right     Left      Right
  Class 1    672       328       626       374       598       402
  Class 2    252       748       300       700       288       712
  Total      924      1076       926      1074       886      1114

© Copyright Salford Systems 2013

Best Surrogate must closely match record counts in the child nodes and the

distribution of the target variable

Modeling root split on available predictors will not match surrogate exactly

Page 73: Using CART For Beginners with A Teclo Example Dataset

Variable Importance in CART

• It is hard to imagine now but in 1984 when the CART

monograph was first published data analysts did not

generally rank variables

• Although researchers would informally pay attention to t-statistics or p-values associated with the coefficients of regressions, the practice of ranking predictors was frowned upon

• Since the advent of modern data analytic methods

researchers expect to see a variable importance ranking for

all models

• It all started with CART!

© Copyright Salford Systems 2013

Page 74: Using CART For Beginners with A Teclo Example Dataset

CART concept of Variable Importance

• Variable importance is intended to measure how much work a

variable does in a particular tree

• Variable importance is thus tied to a specific model

• A variable might be most important in one model and not

important at all in a different model built on the same data

• The fact that a variable is important does not mean that we need

it! If we were deprived of the use of an important variable it might

be that other available variables could substitute for it or do the

same predictive work

• Variable Importance describes the role of a variable in a specific

tree

© Copyright Salford Systems 2013

Page 75: Using CART For Beginners with A Teclo Example Dataset

Variable Importance and Tree Size

• Every tree in the CART sequence has its own variable

importance list

• A small tree will typically have only a few important variables

• A large tree will typically have many more important

variables

– Because with more nodes there are more chances for more variables

to play a role in the tree

• Usually we focus on the tree CART has identified as optimal

but this should not deter you from selecting another (usually

smaller) tree

© Copyright Salford Systems 2013

Page 76: Using CART For Beginners with A Teclo Example Dataset

Splitter Improvement Scores

• Recall that every splitter (and every surrogate) has an associated

―improvement‖ score which measures how good a splitter is

• The improvement score for a splitter in a node is always scaled down by

the percent of data that actually pass through the node

• 100% of all data pass through the root node so the root node splitter is

always scaled by 100%

• But a child node of the root might have say 30% of the data pass through

– whatever improvement we compute for split of that node will be multiplied by 0.30

• Splits lower in the tree have only a small fraction of full data passing

through so their adjusted improvement scores tend to be small

© Copyright Salford Systems 2013

Page 77: Using CART For Beginners with A Teclo Example Dataset

Variable Importance Computation

• To construct a variable importance score for a variable we

start by locating every node that variable split

• We add up all of the improvement scores generated by that

variable in those nodes

• Then we go through every node where this variable acted as a

surrogate and add up all those improvement scores as well

• The grand total is the raw importance score

• After obtaining raw importance scores for every variable we

rescale the results so that the best score is always 100
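A sketch of that bookkeeping with hypothetical per-node records: each variable is credited with its improvement (scaled by the share of data passing through the node, as described on the previous slide) whether it acted as the primary splitter or as a surrogate, and the totals are rescaled so the best variable scores 100.

```python
# Sketch of the importance bookkeeping: sum node-share-weighted improvements
# earned as primary splitter or surrogate, then rescale so the top score is 100.
from collections import defaultdict

# Hypothetical per-node records: (variable, role, improvement, fraction of data in node)
node_records = [
    ("TELEBILC", "primary",   0.040, 1.00),   # root split
    ("MARITAL",  "surrogate", 0.030, 1.00),
    ("HANDPRIC", "primary",   0.025, 0.55),
    ("AGE",      "surrogate", 0.015, 0.55),
    ("AGE",      "primary",   0.020, 0.30),
]

raw = defaultdict(float)
for var, _role, improvement, node_share in node_records:
    raw[var] += improvement * node_share          # splits and surrogates credited alike

top = max(raw.values())
for var, score in sorted(raw.items(), key=lambda kv: -kv[1]):
    print(f"{var:10s} {100 * score / top:6.1f}")
```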

© Copyright Salford Systems 2013

Page 78: Using CART For Beginners with A Teclo Example Dataset

Variations on Importance Scores

• Breiman, Friedman, Olshen and Stone discuss one idea

they ultimately rejected:

– Including competitor improvement scores as well

• This turns out to be a bad idea because it leads to double-

counting

– If a variable is the 2nd best splitter in a node there is an excellent

chance that the same split will score well in the child nodes

– If we were to give the splitter credit in the parent node for being a

competitor we would probably end up giving the exact same split

credit again lower down in the tree

– Another way to think about this: a split is trying to enter the tree. If we

do not accept the split right away, the same split may keep trying to

enter the tree lower down

– We only want to give this split credit once

© Copyright Salford Systems 2013

Page 79: Using CART For Beginners with A Teclo Example Dataset

BATTERY LOVO

• Leave One Variable Out (LOVO)

– Available in SPM PRO EX versions but you can accomplish the

process manually as well

• Take your best modeling set up including your preferred list

of predictors

• BATTERY LOVO runs a set of models that are identical to

your preferred set up except that one variable has been

excluded

• To be complete we run a "drop just one variable" model for

each variable in your KEEP list

• If you have 20 variables then BATTERY LOVO will run 20

models (each of which will have 19 predictors)

– Now rank the models from worst to best
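The manual version of the procedure can be sketched with any learner; below, scikit-learn trees on a stand-in dataset. Each predictor is dropped in turn, the model is refit, and predictors are ranked by how much test accuracy falls.

```python
# Manual sketch of the leave-one-variable-out (LOVO) idea with a stand-in learner.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def fit_score(train_X, test_X):
    model = DecisionTreeClassifier(random_state=0).fit(train_X, y_tr)
    return model.score(test_X, y_te)

baseline = fit_score(X_tr, X_te)
drops = {col: baseline - fit_score(X_tr.drop(columns=col), X_te.drop(columns=col))
         for col in X.columns}

for col, drop in sorted(drops.items(), key=lambda kv: -kv[1])[:5]:
    print(f"dropping {col:25s} costs {drop:+.3f} accuracy")
```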

© Copyright Salford Systems 2013

Page 80: Using CART For Beginners with A Teclo Example Dataset

BATTERY LOVO Importance Ranking

• Using the LOVO procedure tests how much our model

deteriorates if we were to remove a given variable

• It is sensible to say that a variable is very important if losing

it damages the model substantially

• Conversely, if losing a variable does no harm then we could

conclude that the variable is useless

• CAUTION: the LOVO ranking could be quite different from

the CART internal ranking and both rankings are "right"

– CART measures how much work a variable actually does

– LOVO measures how much it hurts to lose a variable

© Copyright Salford Systems 2013

Page 81: Using CART For Beginners with A Teclo Example Dataset

Randomization Test

• Leo Breiman introduced yet another concept of variable

importance measure related to his work on tree ensembles

• Start with your test data

– Score this data with your preferred model to obtain baseline

performance

– Take the first predictor in the test data and randomly shuffle its

values in the column of data

– The values are unchanged but values are relocated to rows they do

not belong on

– Now score again. We would expect performance to drop because

one predictor has been damaged. Repeat say 100 times and

average the performance deterioration.

– Doing this for all variables will produce performance degradation

scores and the larger the score the more important the variable
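A sketch of the shuffle test on a held-out partition, with a scikit-learn tree as a stand-in learner and ROC area as the performance measure; scikit-learn also packages the same idea as sklearn.inspection.permutation_importance.

```python
# Sketch of the shuffle (permutation) test on a held-out test partition.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

def auc(features):
    return roc_auc_score(y_te, model.predict_proba(features)[:, 1])

baseline = auc(X_te)
rng = np.random.default_rng(0)
for j in range(3):                                   # first few predictors only
    drops = []
    for _ in range(30):                              # NPREPS-style repetitions
        shuffled = X_te.copy()
        rng.shuffle(shuffled[:, j])                  # damage one column at a time
        drops.append(baseline - auc(shuffled))
    print(f"column {j}: average AUC drop {np.mean(drops):.4f}")
```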

© Copyright Salford Systems 2013

Page 82: Using CART For Beginners with A Teclo Example Dataset

Randomization Test

• As of December 2011 this test is only available from the

command line of recent versions of SPM

• After growing a CART tree and saving the grove issue these

commands from the command line or an SPM Notepad

SCORE VARIMP=YES NPREPS=100

• You may readily run with NPREPS=30 but the results are

more reliable with a larger number of replications

© Copyright Salford Systems 2013

Page 83: Using CART For Beginners with A Teclo Example Dataset

Results from Random Shuffling:

Baseline ROC=.85320

© Copyright Salford Systems 2013

Rank   Score    ROC_After   Variable
  1    100      0.82144     M1
  2     63.21   0.83312     RES
  3     45.57   0.83873     LS
  4     25.9    0.84498     CR
  5     22.66   0.84601     C2
  6     21.29   0.84644     BU
  7      5.84   0.85135     DT
  8      4.25   0.85185     A1
  9      4.23   0.85186     PRE
 10      3.49   0.85209     OC
 11      3.18   0.85219     MAR
 12      2.29   0.85248     YM
 13      1.64   0.85268     LT
 14      0      0.8532      DP
 15      0      0.8532      TRA
 16      0      0.8532      GEN
 17      0      0.8532      A2
 18      0      0.8532      B
 19      0      0.8532      CP2
 20      0      0.8532      CD2
 21      0      0.8532      D1
 22      0      0.8532      E
 23      0      0.8532      M
 24      0      0.8532      CH
 25      0      0.8532      TY$

Page 84: Using CART For Beginners with A Teclo Example Dataset

Which Importance Score Should I Use?

• The internal CART variable importance scores are the

easiest and the fastest to obtain and are a great starting

point

• LOVO scores are useful when your goal is to assess

whether you can live without a predictor

© Copyright Salford Systems 2013

Page 85: Using CART For Beginners with A Teclo Example Dataset

Variable Importance Caution

• Importance is a function of the OVERALL tree including deepest nodes

• Suppose you grow a large exploratory tree and review importances

• Then find an optimal tree via test set or CV, yielding a smaller tree

• Optimal tree SAME as exploratory tree in the top nodes

• YET importances might be quite different.

• WHY? Because the larger tree uses more nodes to compute the importance

• When comparing results be sure to compare similar or same sized trees

© Copyright Salford Systems 2013

Page 86: Using CART For Beginners with A Teclo Example Dataset

Train/Test Consistency Checks

• Unlike classical statistics data mining models generally do

not rely on training data to assess model quality

• In the SPM data mining suite we are always focused on test

data model performance

– This is the only way to reliably protect against overfitting

• Every modeling method including our classical statistical

models in SPM 7.0 offers test data performance measures

• Generally these measures are overall model performance

indicators

– Measures say nothing about internal model details

© Copyright Salford Systems 2013

Page 87: Using CART For Beginners with A Teclo Example Dataset

CART Tree Assessment

• CART uses test data performance of every tree in the back-

pruned sequence of progressively smaller trees to identify

the overall best performer on classification accuracy

• CART also notes which tree achieves the best test data

Area Under the ROC (AUROC) curve on the Navigator

© Copyright Salford Systems 2013

Page 88: Using CART For Beginners with A Teclo Example Dataset

What more can we do?

• CART performance measures have always been overall-tree

scores

• No specific attention is paid to node-specific performance

• However, in real world applications we often want to pay

close attention to individual nodes

– Might use the rank order of the nodes in important decisions

– Prefer to rely on nodes that are most accurate in their predictions of

event rates (response)

• Therefore we need an additional tool for assessing CART

tree performance at the node level

• Provided by the PRO EX feature we call TTC

– Train/Test Consistency checks

© Copyright Salford Systems 2013

Page 89: Using CART For Beginners with A Teclo Example Dataset

Use the GB2000.XLS data set

© Copyright Salford Systems 2013

Model setup to select TARGET as the dependent variable

CART as the modeling method

On the TEST tab we opt for 50% randomly selected test partition

Page 90: Using CART For Beginners with A Teclo Example Dataset

TTC in CART and SPM PRO EX

• The TTC report is available from the navigator which

displays for every CART model

– Look for the TTC button near the bottom of the navigator

• TTC relies on separate train and test data partitions which

means that TTC is not available when using cross-validation

© Copyright Salford Systems 2013

Page 91: Using CART For Beginners with A Teclo Example Dataset

TTC Display

© Copyright Salford Systems 2013

Upper panel of TTC display contains one line in the table for every sized tree

Bottom row represents the 2 node tree. Top line is for largest tree grown

Page 92: Using CART For Beginners with A Teclo Example Dataset

TTC: Select Target Class

© Copyright Salford Systems 2013

In this case TARGET=2 represents BAD which is our focus class

You, the modeler, get to choose which class to focus on; there is no "right" class

Page 93: Using CART For Beginners with A Teclo Example Dataset

TTC Upper Panel

© Copyright Salford Systems 2013

Rank Match: Do the train and test samples rank order the nodes in the same way

(a statistical test allows for insignificant "wobbles")

Direction Agreement: Do the train and test samples agree as to whether a node is

"above average" or "below average" (response, lift, event rate). Again a statistical

test allows for insignificant violations

Page 94: Using CART For Beginners with A Teclo Example Dataset

Click on 14 node tree in TTC upper panel

© Copyright Salford Systems 2013

Red curve is training data and shows node specific lift (node response/ overall

response)

Dark Blue horizontal line is the LIFT=1.0 reference line

Light blue line with green triangles displays test data

3rd ranked node in train data would be ranked 1st or 2nd in test data

Page 95: Using CART For Beginners with A Teclo Example Dataset

TTC Details

© Copyright Salford Systems 2013

For the 14 node tree we are told that agreement on "direction" fails 1 time

And the rank order agreement fails 5 times (scroll to right to see this)

The statistical sensitivity of the test is controlled by the z-score selected in the

Thresholds area to the right of the display. Defaults are 1.00

Setting this threshold to 2.00 will allow much more train/test divergence

Page 96: Using CART For Beginners with A Teclo Example Dataset

Changing TTC Sensitivity Threshold

© Copyright Salford Systems 2013

Changing the thresholds to 2.00 permits moderate deviations and treats them as

statistical noise. After changing thresholds click on "Apply" if the display has not updated

We prefer to use the 1.00 threshold as this points us to trees with very high

consistency that decision makers like to see. It does point to rather small trees.

Page 97: Using CART For Beginners with A Teclo Example Dataset

TTC: Display for 6 node tree

© Copyright Salford Systems 2013

Much more defensible tree as train and test data align very well

Page 98: Using CART For Beginners with A Teclo Example Dataset

Summary

• TTC focuses on two types of train-test disagreement

• DIRECTION: Is this node a response node or not?

– We regard disagreement on this fundamental topic to be fatal

• RANK ORDER: Are the richest nodes as identified by the

training data confirmed in test data

– Without this we cannot defend deployment of a tree

• TTC allows us to quickly identify which tree in the pruning

sequence is the largest satisfying train/test consistency

• TTC optimal tree is often rather close in size to Breiman’s 1

SE rule tree

– But 1 SE rule does not look inside nodes at all

– 1 SE rule is available for cross-validation while TTC is not

© Copyright Salford Systems 2013

Page 99: Using CART For Beginners with A Teclo Example Dataset

Controlling Node Sizes In CART

With ATOM and MINCHILD

• Today’s topic is on the technical side but very easy to

understand

• Concepts are relevant to all Salford tree-based tools

including TreeNet and Random Forests

• Controlling the sizes of terminal nodes is a practical matter

• If you are using CART, for example, to segment a database

you might want to make it impossible to create segments

that are too small

• Altering terminal node size can also influence performance

details of the optimal tree

© Copyright Salford Systems 2013

Page 100: Using CART For Beginners with A Teclo Example Dataset

Background: Obtaining Optimal Trees

• CART theory teaches us that we cannot arrive at the optimal

tree via a stopping rule

• The CART authors devoted quite a bit of energy to

researching this topic

• For any stopping rule it is possible to construct data sets for

which that stopping rule will not work

• We will end up stopping too early and we will miss important

data structure

• Result discovered both by experimentation and via

mathematical construction

© Copyright Salford Systems 2013

Page 101: Using CART For Beginners with A Teclo Example Dataset

Grow First Then Prune

• CART methodology is thus to start with an unlimited growing

phase

• Grow the largest possible tree first

• Think of this as a search engine for discovering possibly

valuable trees

• THEN use pruning to arrive at the optimal tree or a set of

trees that yield both acceptable predictive performance and

simplicity

• CART also insists that we have a test method to make our

final tree selection. That is the topic of another session.

© Copyright Salford Systems 2013

Page 102: Using CART For Beginners with A Teclo Example Dataset

Maximum Tree Size

• CART theory tells us that trees should be grown to their

maximum size during the growing phase

• Thus, trees should be grown until we either

– Run out of data (1 record left and thus there is nothing to split)

– Node impossible to split because pure (all GOOD or all BAD)

– Node impossible to split because all records have identical values for

predictors

• Experience tells us that if you start with 1,000 records in a

typical binary classification problem you should expect about

500 terminal nodes in the largest possible tree

– But could be many less

• Let’s try for biggest possible tree with the GB2000.xls data

© Copyright Salford Systems 2013

Page 103: Using CART For Beginners with A Teclo Example Dataset

An Unlimited Tree Using GB2000.xls

© Copyright Salford Systems 2013

To get 349 nodes we set the test method to EXPLORE, ATOM=2, MINCHILD=1

Page 104: Using CART For Beginners with A Teclo Example Dataset

Terminal Node Sample Sizes

© Copyright Salford Systems 2013

We obtain this frequency chart by clicking the graph icon in the center left area

of the navigator. We can see that many but not all terminal nodes are small.

Page 105: Using CART For Beginners with A Teclo Example Dataset

Bottom Left Most Part of Tree

© Copyright Salford Systems 2013

We get a relatively large node to the

extreme left (all class 2)

Remaining three terminal nodes in this

snippet are also all "pure" but much

smaller

Obvious why the tree has to stop here

as there is nothing left to do once a

node is pure

Obtained by right clicking the node of interest and selecting "Display Tree"

Page 106: Using CART For Beginners with A Teclo Example Dataset

Practical Maximal Trees

• In real world practice it may not be necessary to push the

tree growth to the literal maximum

• Essential to grow a large tree

– Large enough to include the optimal tree

• We can control the size of the maximal CART tree in a

number of ways

– Some controls tell CART to stop early

– Other controls limit CART’s freedom to produce small nodes

© Copyright Salford Systems 2013

Page 107: Using CART For Beginners with A Teclo Example Dataset

Key Controls over Splits:

ATOM and MINCHILD

• ATOM

– ATOM terminates splitting along a branch of the tree when the node

sample size is too small

– If a node contains fewer than ATOM data records then STOP

– 10 is commonly used but you might set this much larger

• MINCHILD

– MINCHILD prevents creation of child nodes that are too small

– The smallest possible value is 1 meaning that in splitting a node we

would be permitted to send 1 solitary record to a child node and all

other records to the other child node

– Larger values are sensible and desirable. Values such as 5, 10, 20,

30, 50 could work well depending on the data. We have used values

as large as 200

© Copyright Salford Systems 2013

Page 108: Using CART For Beginners with A Teclo Example Dataset

Setting ATOM and MINCHILD

© Copyright Salford Systems 2013

On Advanced Tab of

Model Setup

Parent control

(ATOM)

Terminal node min

(MINCHILD)

Page 109: Using CART For Beginners with A Teclo Example Dataset

Setting ATOM and MINCHILD

• ATOM: Minimum size required for a node to be a parent

• MINCHILD: Minimum size allowed for a child

• We recommend that ATOM be set to three times MINCHILD

• ATOM must be at least twice MINCHILD to allow a split

consistent with MINCHILD

• If you set inconsistent values for ATOM and MINCHILD they

will be reset automatically to be consistent

• To get the control you want be sure that ATOM is at least

twice MINCHILD
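The relationship can be captured in a couple of lines; this is a hypothetical helper illustrating the rule of thumb above, not an SPM setting.

```python
# Sketch of the rule of thumb (hypothetical helper, not an SPM setting):
# a parent needs at least 2*MINCHILD records, and 3*MINCHILD leaves CART
# some freedom in where the final split can fall.
def check_atom_minchild(atom, minchild):
    if atom < 2 * minchild:
        atom = 2 * minchild                     # reset to the smallest consistent value
    recommended = 3 * minchild
    return atom, recommended

print(check_atom_minchild(atom=10, minchild=10))   # -> (20, 30)
print(check_atom_minchild(atom=30, minchild=10))   # -> (30, 30)
```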

© Copyright Salford Systems 2013

Page 110: Using CART For Beginners with A Teclo Example Dataset

ATOM and MINCHILD

• ATOM controls the right to be a parent

• Parent must generate two children

• Parent must contain enough data to be able to fill two child

nodes

• So parent must have at least 2*MINCHILD records

© Copyright Salford Systems 2013

Page 111: Using CART For Beginners with A Teclo Example Dataset

ATOM and MINCHILD

• By allowing ATOM to be three times MINCHILD you give

CART some flexibility in finding the split

         10 records                 10 records
  Min ------------------------------|------------------------------ Max
                                  split

Suppose ATOM=20 and MINCHILD=10. Then we must split

this node into two exactly equal child nodes of 10 records

each. There is no flexibility here

• If no such split can be found because of clumping of values

of the variable then the node cannot be split on that variable

© Copyright Salford Systems 2013

Page 112: Using CART For Beginners with A Teclo Example Dataset

ATOM is 3 times MINCHILD

       10 records              10 records              10 records
  Min ----------------*----------------|----------------*---------------- Max
        left child            split region             right child

• In the example above ATOM=30 and the region of possible splitting

points lies in between the two asterisks

• There can be just one split point. So long as the smaller side has at least

10 records (in this example of MINCHILD=10) there is freedom to

choose

• To give CART flexibility as to where to locate this last split (at the bottom

of the tree) we need to have ATOM > 2*MINCHILD

• Not mandatory but worth keeping in mind. So first choose MINCHILD

and then set ATOM sensibly

© Copyright Salford Systems 2013

Page 113: Using CART For Beginners with A Teclo Example Dataset

An Unappealing Node Split: Could be prevented by using a larger MINCHILD

© Copyright Salford Systems 2013

Only one record is sent to the right and the remaining 1999 records go left

Can prevent such splits with a control which does not allow a child to be created

with fewer than the specified number of records

Page 114: Using CART For Beginners with A Teclo Example Dataset

Experiment to get Best Settings

© Copyright Salford Systems 2013

SPM PRO EX

Battery Tab of Model Setup

Select ATOM and

MINCHILD

Modify values to be

tested, optionally

We used a 50% random

sample for testing

Page 115: Using CART For Beginners with A Teclo Example Dataset

Choosing ATOM and MINCHILD

© Copyright Salford Systems 2013

Settings of ATOM=10

and MINCHILD=5 yield

a Rel. error within 1% of

the literal best

Page 116: Using CART For Beginners with A Teclo Example Dataset

Direct Control Over Tree Size (Almost)

• You also have the option of LIMITing the tree in a variety of

ways including limiting the DEPTH of the tree

• To get to the LIMITS menu item you must first go to the

Classic Output

© Copyright Salford Systems 2013

Page 117: Using CART For Beginners with A Teclo Example Dataset

Growing Limits Dialog

© Copyright Salford Systems 2013

DEPTH=1 will allow just

one split

Controlling tree size via a DEPTH limit may yield inferior results

We tend to use it only when wanting extremely small trees such as one split

Page 118: Using CART For Beginners with A Teclo Example Dataset

LIMITS Details

• A tree of depth=1 can have only two terminal nodes

• With each additional depth level we allow for a doubling of

the number of terminal nodes

• Potential sizes are then 2,4,8,16 etc.

• However, depth limits do not guarantee a specific number of

terminal nodes, only that no terminal node will be deeper than

was allowed

© Copyright Salford Systems 2013

Page 119: Using CART For Beginners with A Teclo Example Dataset

LIMIT DEPTH=1

© Copyright Salford Systems 2013

We sometimes want to start a CART analysis by splitting just the ROOT node and

then reviewing the entire ranked list of potential splitters

Mostly useful for very large data sets as this reduces compute time substantially

Page 120: Using CART For Beginners with A Teclo Example Dataset

LIMIT DEPTH=2

© Copyright Salford Systems 2013

Maximum length of any branch will allow two splits between the root node and

any terminal node. But some branches might stop early due to pre-pruning.

Page 121: Using CART For Beginners with A Teclo Example Dataset

Depth Limit=3

Method GINI

© Copyright Salford Systems 2013

With METHOD GINI you may not get every branch of the tree exhibited to the full depth you wanted (due to a technical matter – "pre-pruning")

Page 122: Using CART For Beginners with A Teclo Example Dataset

Depth Limit=3

METHOD PROB

© Copyright Salford Systems 2013

You have a better chance of getting every branch grown out to full depth using

METHOD PROB

Page 123: Using CART For Beginners with A Teclo Example Dataset

Concluding Remarks

• Setting ATOM (smallest legal parent) and MINCHILD

(smallest legal child) can help to speed up large database

runs

• Modest limitations will not harm performance if we take care with the settings

• Can and should use experimentation to find best settings

• In some circumstances setting these controls to values

larger than their minimums can improve performance on test

data

© Copyright Salford Systems 2013

Page 124: Using CART For Beginners with A Teclo Example Dataset

CART and the PRIORS Parameter

• If you are a casual user of CART you probably can get by

without knowing anything about PRIORS

• The default settings of CART handle PRIORS in a way that

is well suited for almost all classification problems

• A casual user will probably not want to review or understand the more technical output which is printed to the plain-text "classic output" window

• BUT there are some very effective uses of CART that

require judicious manipulation of the PRIORS settings

• Therefore a basic understanding of PRIORS may be helpful

and worth the effort

© Copyright Salford Systems 2013

Page 125: Using CART For Beginners with A Teclo Example Dataset

Classic Reference

• The original CART monograph, published in 1984, remains one of the great classics of machine learning

• Classification and Regression Trees by Leo Breiman,

Jerome Friedman, Richard Olshen, and Charles Stone, CRC

Press

• Available also in paperback and as an e-book from Amazon:

– http://www.amazon.com/Classification-Regression-Trees-Leo-Breiman/dp/0412048418/

• Not the easiest reading but well worth having as a reference

and contains fascinating discussions regarding the decisions

the authors made in crafting CART

• Contains extensive discussion of priors as well as all major concepts relevant to CART. Still worthwhile reading.

© Copyright Salford Systems 2013

Page 126: Using CART For Beginners with A Teclo Example Dataset

CART Monograph Details

© Copyright Salford Systems 2013

Page 127: Using CART For Beginners with A Teclo Example Dataset

For The Casual User

• Thinking about a binary 0/1 classification problem we have two ways of evaluating a CART-generated segment

– Assign the segment to the majority class (more than 50%)

– If there are more 1s than 0s then the segment is labeled "1"

– Assign the segment to the class with a LIFT greater than 1

– We start with a baseline event rate (fraction of 1s in the data)

– Look at the ratio of the event rate in the node to the event rate in the sample

• Ratio of event rate in segment to event rate in root

– Any segment with a better-than-baseline event rate is labeled "1"

• CART by default uses the LIFT concept for making decisions (known in CART-speak as PRIORS EQUAL)

• You can elect to use the first method via PRIORS DATA (both rules are sketched below)

© Copyright Salford Systems 2013
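
A minimal sketch of the two labeling rules described above (plain Python; the node and root counts here are hypothetical, not taken from the Telco data):

# Hypothetical node: 60 records of class 0 and 40 of class 1
node_n0, node_n1 = 60, 40
# Hypothetical root (sample): 790 class 0 and 210 class 1 (about a 21% event rate)
root_n0, root_n1 = 790, 210

# Rule 1 (PRIORS DATA style): assign the segment to the majority class
majority_label = 1 if node_n1 > node_n0 else 0

# Rule 2 (PRIORS EQUAL / LIFT style): compare the node event rate to the baseline
node_rate = node_n1 / (node_n0 + node_n1)        # event rate in the segment
baseline_rate = root_n1 / (root_n0 + root_n1)    # event rate in the root
lift = node_rate / baseline_rate
lift_label = 1 if lift > 1 else 0

print(majority_label, lift_label, round(lift, 2))   # -> 0 1 1.9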

Page 128: Using CART For Beginners with A Teclo Example Dataset

Example Split: Priors Equal

© Copyright Salford Systems 2013

Almost 80% GOOD (Class 0), remainder BAD (Class 1)

Left child is considered a BAD-dominant node because 36% BAD > 21.4% BAD

Priors equal simply ensures that we think in these "relative to what we started with" terms

Page 129: Using CART For Beginners with A Teclo Example Dataset

PRIORS EQUAL or PRIORS DATA

• PRIORS EQUAL is almost always the right choice

– Is the DEFAULT and almost always yields useful results

• PRIORS DATA focuses on absolute majority and not relative

counts in the data

– Will rarely work with highly unbalanced data (e.g. a 10:1 ratio of 0 to 1)

• PRIORS can be expressed as a ratio

– Default 1:1

– You can set priors to whatever ratio you like

• 1.2:1 as we did in the previous example

• 5:1

• 10:1

– Changing priors usually changes results, sometimes dramatically

– Extreme priors often make getting any tree impossible

© Copyright Salford Systems 2013

Page 130: Using CART For Beginners with A Teclo Example Dataset

Setting PRIORS: Mechanics

© Copyright Salford Systems 2013

To set your own PRIORS

first click the SPECIFY

option

The default settings of 1:1

can now be changed

To the left, the dialog is allowing me to alter the entry for Class 0

Once entered, I will be given the opportunity to make a new entry for Class 1

Page 131: Using CART For Beginners with A Teclo Example Dataset

If PRIORS can change results then what is right?

• The results CART gives you are intended to reflect what you

consider important and what makes sense given your

objectives

• PRIORS EQUAL usually reflects what most people want

• If tweaking the PRIORS gives you better results given your objectives, then use the tweaked priors

© Copyright Salford Systems 2013

Page 132: Using CART For Beginners with A Teclo Example Dataset

Advice on PRIORS

• Start with the default of EQUAL

– Most users never get beyond this!

• BATTERY PRIORS

– CART PRO EX runs an automatic sweep across dozens of different

settings to display the consequences of tweaking the priors

– Results are then summarized in tables and charts

– Useful when you want to achieve a specific balance of accuracy

across the dependent variable classes

– Choose the setting that is practically best

• Otherwise, you can experiment manually to measure the

impact of a change

© Copyright Salford Systems 2013

Page 133: Using CART For Beginners with A Teclo Example Dataset

PRIORS: Under the Hood

• To understand how PRIORS affect core CART calculations

we need to start with a brief review of splitting rules

• We will only discuss the Gini to illustrate the key concepts

© Copyright Salford Systems 2013

Page 134: Using CART For Beginners with A Teclo Example Dataset

Start With Gini Splitting Rule:Two classes

• Very simple formula for the two class (binary) dependent variable

• Label the classes as Class 0 and Class 1 and in a specific node in

a tree we represent the shares of the data for the two classes as

p0 and p1

These two must sum to 1 (p0 + p1 = 1)

• The measure of diversity (or impurity) in a given subset of data (e.g. a node) is given by

Impurity = 1 – p0*p0 – p1*p1

• Impurity will equal 0 if either sample share is equal to 1 (100%)

• Impurity will equal 0.50 when both sample shares are equal (50%)

1 – (.5*.5) – (.5*.5) = 1 - .25 - .25 = .50

© Copyright Salford Systems 2013
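
A minimal sketch of this two-class impurity calculation (plain Python):

def gini_impurity(p0, p1):
    # Gini diversity for a two-class node: 1 - p0^2 - p1^2
    return 1.0 - p0 * p0 - p1 * p1

print(gini_impurity(1.0, 0.0))   # pure node -> 0.0
print(gini_impurity(0.5, 0.5))   # evenly mixed node -> 0.5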

Page 135: Using CART For Beginners with A Teclo Example Dataset

Splitting Criteria and Impurity

• The Gini measure is just a sensible way to represent how

diverse the data is in a node (for a classification problem)

– Extensive experience shows it works well, a good measure

– You do have a choice of 6 different splitting methods in CART

• Useful because it can be used for any number of classes

– Every class has a share

– Square the shares and subtract them all from 1

• We use the Gini measure as a way to rank competing splits

• Split A will be considered better if it produces child nodes with less diversity (on average) than does split B

• We measure the goodness of a split by looking at the reduction in impurity relative to the node being split (the parent)

© Copyright Salford Systems 2013

Page 136: Using CART For Beginners with A Teclo Example Dataset

Improvement Calculation

• Hypothetical Example

© Copyright Salford Systems 2013

Parent Node Impurity = 0.50

Left Child Impurity = .30 (20% of data)    Right Child Impurity = .20 (80% of data)

Left child improves diversity by 0.20 (0.50 – 0.30)

Right child improves diversity by 0.30 (0.50 – 0.20)

Weighted average impurity is .2*.3 + .8*.2 = .22

Improvement from parent is .5 – .22 = .28
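
A small sketch reproducing the hypothetical improvement calculation above (plain Python):

# Hypothetical example from the slide above
parent_impurity = 0.50
left_impurity, left_share = 0.30, 0.20     # 20% of the data goes left
right_impurity, right_share = 0.20, 0.80   # 80% of the data goes right

# Weighted average impurity of the two children
child_impurity = left_share * left_impurity + right_share * right_impurity   # 0.22

# Improvement (reduction in impurity) relative to the parent
improvement = parent_impurity - child_impurity                               # 0.28

print(round(child_impurity, 2), round(improvement, 2))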

Page 137: Using CART For Beginners with A Teclo Example Dataset

Graphing Gini Impurity (2 classes)

• Impurity formula here

simplifies to 2p(1-p)

• Impurity is greatest

when p=(1-p)= 0.5

• Impurity is low when p

is near either extreme

of 0 or 1 as the node is

dominated by one class

• Declines slowly near

p=.5 and accelerates as

it approaches 0 or 1


Graph is of 2*[2*p*(1-p)] to make it easier to read

© Copyright Salford Systems 2013
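
The simplification to 2p(1-p) mentioned above follows directly by writing p0 = p and p1 = 1 - p:

1 - p^2 - (1 - p)^2 = 1 - p^2 - (1 - 2p + p^2) = 2p - 2p^2 = 2p(1 - p)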

Page 138: Using CART For Beginners with A Teclo Example Dataset

Split Improvement Measurement

(No Missing Values for Splitter)

© Copyright Salford Systems 2013

(Node diagram: record counts and percentages shown for the parent, left child, and right child)

Parent Impurity = 0.50

Left Child Impurity = 0.3967, fraction of data in left child = 55%

Right Child Impurity = 0.3457, fraction of data in right child = 45%

Weighted average of child node diversity = .3737

Overall improvement of split = .1262

Page 139: Using CART For Beginners with A Teclo Example Dataset

As expressed in the CART monograph

Parent node impurity minus weighted average of the impurities in each

child node

• pL = probability of case going left (fraction of node going left)

• pR = probability of case going right (fraction of node going right)

• t = node

• s = splitting rule

• i = impurity

Δi(s, t) = i(t) – pL·i(tL) – pR·i(tR)

(parent impurity minus probL times the left-child impurity minus probR times the right-child impurity)

© Copyright Salford Systems 2013

Page 140: Using CART For Beginners with A Teclo Example Dataset

Unbalanced Data and PRIORS EQUAL

• Calculations for all key quantities become weighted when

we use the CART default and the original data is

unbalanced

• Weighting is used to calculate

– Fraction of the data belonging to each class

– Fraction of the data in the left and right child nodes

– Gini impurity in each node

– Resulting improvement of the split (reduction in impurity)

• We can no longer use simple ratios

• Good news is that the mechanism for weighting is very

simple and easy to remember

– All counts are expressed as count in the node divided by the

corresponding count in the root node

© Copyright Salford Systems 2013

Page 141: Using CART For Beginners with A Teclo Example Dataset

Calculations for Priors

• Our training sample starts with N0 examples of class 0 and

N1 examples of class 1

• Now look at any node t in the CART tree

– N0(t) examples of class 0

– N1(t) examples of class 1

• Fraction of class 0 will now be calculated as (simplified)

p(0 | t) = [N0(t) / N0] / ( [N0(t) / N0] + [N1(t) / N1] )

• In other words we convert every count to the ratio of a count in a node (t) to the corresponding count in the root (sample)

• Then the math is the same as usual

© Copyright Salford Systems 2013
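
A small sketch of this priors-equal class-share calculation (plain Python; the counts are hypothetical):

# Root (sample) counts and counts in some node t -- hypothetical numbers
N0, N1 = 790, 210        # class 0 and class 1 in the root
N0_t, N1_t = 100, 60     # class 0 and class 1 in node t

# Convert each count to its ratio against the corresponding root count
r0 = N0_t / N0
r1 = N1_t / N1

# Priors-equal share of class 0 in node t
p0_t = r0 / (r0 + r1)
print(round(p0_t, 3))    # about 0.31, much lower than the raw share 100/160 = 0.625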

Page 142: Using CART For Beginners with A Teclo Example Dataset

What fraction of the data is in a node

• Again we use ratios instead of counts to calculate

• For priors equal we just average

– Fraction of all the Class 0 in a node

– Fraction of all the Class 1 in a node

• If the priors are not equal then all ratios are first multiplied by

the corresponding prior (which acts as a weight)

• When priors are equal the terms all cancel out

© Copyright Salford Systems 2013

p(0 | t) = [π(0)·N0(t) / N0] / ( [π(0)·N0(t) / N0] + [π(1)·N1(t) / N1] )

where π(0) and π(1) are the priors (weights) on Class 0 and Class 1

Page 143: Using CART For Beginners with A Teclo Example Dataset

Priors Incorporated Into Splitting

• pi(t) = proportion of class i in node t

• Gini impurity in node t: Gini(t) = 1 – Σi [pi(t)]²

• If PRIORS DATA then the priors are the raw class shares in the data, π(i) = Ni / N, and the proportions of class i in node t reduce to the simple within-node shares:

pi(t) = Ni(t) / N(t)

• Otherwise proportions are always calculated as weighted shares using the priors-adjusted pi:

pi(t) = [π(i)·Ni(t) / Ni] / Σj [π(j)·Nj(t) / Nj]

© Copyright Salford Systems 2013

Page 144: Using CART For Beginners with A Teclo Example Dataset

Run a Real World Example: 79% Class 0 (Good), 21% Class 1 (Bad)

© Copyright Salford Systems 2013

Data set BAD_RARE_X.XLS; MODEL BAD = X15 (just one predictor)

Page 145: Using CART For Beginners with A Teclo Example Dataset

Test method: 20% random sample for test

© Copyright Salford Systems 2013

We only want to look at the root node split. But the tree is quite predictive!

Page 146: Using CART For Beginners with A Teclo Example Dataset

Root Node Split: Under PRIORS EQUAL

© Copyright Salford Systems 2013

Main splitter improvement is reported to be .06264

Observe that the left hand child is considered to be Class 1 because the

node Class 1 share of 41% is greater than the root share of 21.4%

Page 147: Using CART For Beginners with A Teclo Example Dataset

Classic Output: Typical user rarely consults classic output

© Copyright Salford Systems 2013

Start by confirming the total record counts in the parent and child nodes

This agrees with the previous diagram in the GUI

Page 148: Using CART For Beginners with A Teclo Example Dataset

Next Confirm Target Class Breakdown

© Copyright Salford Systems 2013

Here we see the same counts for Class 0 and Class 1 as in GUI

Page 149: Using CART For Beginners with A Teclo Example Dataset

Priors Adjusted Computations

© Copyright Salford Systems 2013

Note first that the parent node is reported to have 50% class 0 and 50% class 1

This is guaranteed for the root node under priors equal

With 2 classes each is treated as if it represented half the data

With 3 classes each would be treated as if it represented 1/3 of the data

Our calculations of the Gini impurity would be based on these priors adjusted

shares of the data (or node)

The class breakdowns in the child nodes (left and right) are priors adjusted

using the formulas presented earlier

Page 150: Using CART For Beginners with A Teclo Example Dataset

Spreadsheet to Reproduce Results

© Copyright Salford Systems 2013

Column C contains the counts for each class in the parent and child nodes

Column H at the top records the priors

Column G displays the priors adjusted shares (raw shares are in Column D)

Column F displays raw and priors adjusted child node probabilities

Column J displays the Gini diversity in the parent and child nodes and the

improvement generated by the weighted average of the child diversities

All we need to input are the class counts and the priors and formulas do the rest
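
For readers who prefer code to a spreadsheet, here is a minimal sketch of the same priors-adjusted Gini improvement calculation (plain Python; the class counts below are placeholders, not the actual counts from this run):

def priors_adjusted_improvement(root, parent, left, right, priors):
    """Nodes are dicts of raw class counts; priors maps each class to its prior weight."""
    def weights(node):
        # each class count expressed as prior * (count in node / count in root)
        return {c: priors[c] * node[c] / root[c] for c in priors}

    def gini(node):
        w = weights(node)
        total = sum(w.values())
        return 1.0 - sum((v / total) ** 2 for v in w.values())

    def p_node(node):
        # priors-adjusted probability mass of a node
        return sum(weights(node).values())

    p_left = p_node(left) / p_node(parent)
    p_right = p_node(right) / p_node(parent)
    return gini(parent) - p_left * gini(left) - p_right * gini(right)

# Placeholder class counts {class 0: ..., class 1: ...}; here the parent is the root itself
root   = {0: 790, 1: 210}
parent = {0: 790, 1: 210}
left   = {0: 300, 1: 145}
right  = {0: 490, 1: 65}
print(round(priors_adjusted_improvement(root, parent, left, right, {0: 0.5, 1: 0.5}), 5))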

Page 151: Using CART For Beginners with A Teclo Example Dataset

Conclusion

• Priors are an advanced control that the casual user need not

worry about

• The default setting is almost always reasonable and almost

always yields valuable results

• Tweaking the priors can change the details of the tree and

can alter results

– Sometimes considerably

– Can be worth running some experiments

• Further discussion in another tutorial

© Copyright Salford Systems 2013

Page 152: Using CART For Beginners with A Teclo Example Dataset

Modeling Automation Report

Develop the model using a variety of strategies

Here we display results for each of the 6 major tree-growing methods. Entropy yields the best performance here. This is one of 18 different automation schemes.

© Copyright Salford Systems 2013

Page 153: Using CART For Beginners with A Teclo Example Dataset

Summary of Variable Importance Results

Across alternative modeling strategies

© Copyright Salford Systems 2013

Page 154: Using CART For Beginners with A Teclo Example Dataset

Performance Curves of Alternative Models

Error plotted against model complexity

Four strategies yield similar results; one yields much worse results

© Copyright Salford Systems 2013

Page 155: Using CART For Beginners with A Teclo Example Dataset

Alternative Modeling Automation Strategies

Analyst Can Run All Strategies if desired

© Copyright Salford Systems 2013

Page 156: Using CART For Beginners with A Teclo Example Dataset

Automated Modeling:

Vary Penalty on False Positives

© Copyright Salford Systems 2013

Page 157: Using CART For Beginners with A Teclo Example Dataset

Accuracy among YES and NO groups

As penalty on false positive is varied (automatically)

© Copyright Salford Systems 2013

Page 158: Using CART For Beginners with A Teclo Example Dataset

Automatic Shaving:

Backwards Elimination of Least Important Feature

© Copyright Salford Systems 2013

Page 159: Using CART For Beginners with A Teclo Example Dataset

Hot Spot Detection:

Search many trees for high value segments

Lift in node plotted against sample size: Examination of individual nodes

from many different trees to find best segments

© Copyright Salford Systems 2013

Page 160: Using CART For Beginners with A Teclo Example Dataset

Tabular detail: Hot spot search for special nodes

Tree 18 Node 25 defines a segment with 85.3% of the target class

Sample size in this segment is N=265 in the test set

Clicking on any row brings up tree for examination and review

© Copyright Salford Systems 2013

Page 161: Using CART For Beginners with A Teclo Example Dataset

Constrained Trees

• Many predictive models can benefit from Salford's patent-pending "Structured Trees"

• Trees constrained in how they are grown to reflect decision support requirements

• In the mobile phone example: we want the tree to first segment on customer characteristics and then complete the tree using price variables

– Price variables are under the control of the company

– Customer characteristics are not under company control

© Copyright Salford Systems 2013

Page 162: Using CART For Beginners with A Teclo Example Dataset

Visualizing separate regions of tree

© Copyright Salford Systems 2013

Page 163: Using CART For Beginners with A Teclo Example Dataset

Constraint Dialog

Model set up specifying allowable ranges for predictors

Green indicates where in the tree the variables of a group are allowed to appear

© Copyright Salford Systems 2013

Page 164: Using CART For Beginners with A Teclo Example Dataset

Constrained Tree

Mobile Phone Price variables appear only at bottom

Demographic and spend information at top of tree

Handset (HANDPRIC) and per minute pricing (USEPRICE) at bottom

© Copyright Salford Systems 2013

Page 165: Using CART For Beginners with A Teclo Example Dataset

Model Deployment – I

Translate Model into Reusable Programming Code

New version supports JAVA, C, PMML, SQL, SAS®

© Copyright Salford Systems 2013

Page 166: Using CART For Beginners with A Teclo Example Dataset

Automatically Generated Code

Can be deployed directly

© Copyright Salford Systems 2013

Page 167: Using CART For Beginners with A Teclo Example Dataset

Deployment – II: Use Salford Scoring Engine/Server

Controllable via scripting; can be deployed in batch mode on a server

© Copyright Salford Systems 2013

Page 168: Using CART For Beginners with A Teclo Example Dataset

Cross-Validation: Part 1

• Built-in automatic method of self testing a model for

reliability

• Honest assessment of the performance characteristics of a

model

– Will model perform as expected on previously unseen (new) data

• Available for all principal Salford data mining engines

• The 1984 CART monograph was decisive in introducing cross-validation into data mining

• Many important details relevant to decision trees and sequences of models were developed in the monograph for the first time

© Copyright Salford Systems 2013

Page 169: Using CART For Beginners with A Teclo Example Dataset

Cross-Validation is a Testing Method

• Why go through special trouble to construct a sophisticated testing

method when we can just hold back some test data?

• When working with plentiful data it makes perfect sense to reserve a

good portion for testing

– E.g. Credit risk data set with 150,000 training records and 100,000 test

records, real world example

– Direct Marketing data sets with 300,000 training records and 50,000 test

records

• Not all analytical projects have access to large volumes of data

© Copyright Salford Systems 2013

Page 170: Using CART For Beginners with A Teclo Example Dataset

Principal Reason for Cross-Validation: Data Scarcity

• When relevant data is scarce we face a data allocation

dilemma

– If we reserve sufficient data to conduct a reliable test we find

ourselves lacking training data

– If we insist on having enough training data to build a good model we

will have little or nothing left for testing

• Train Test

• o---------------------------------------------------------------|-------------o

• A common division of data is 80% train 20% test

• With 300 data records in total this would amount to 240 train and 60 test

© Copyright Salford Systems 2013

Page 171: Using CART For Beginners with A Teclo Example Dataset

Tough decision: How much data to allocate to test

• Train Test

• o---------|-------------------------------------------------------------------o

• Train Test

• o------------------------------|----------------------------------------------o

• Train Test

• o-------------------------------------------------|---------------------------o

• Train Test

• o------------------------------------------------------------------------|----o

© Copyright Salford Systems 2013

Page 172: Using CART For Beginners with A Teclo Example Dataset

Unbalanced Target Data

• In most classification studies the target (dependent variable)

data distribution is unbalanced

• Usually one large data segment (non-event) and a smaller

data segment (event) which is the subject of the analysis

– Who purchases on an e-commerce website?

– Who clicks on a banner ad?

– Who benefits from a given medical treatment?

– What conditions lead to a manufacturing flaw?

• When the data is substantially unbalanced the sample size

problem is magnified dramatically

– Think of your sample size as being equal to the smaller class

– If you only have 100 clicks that is your data set size

– Does not matter much that you have 1 million non-clicks.

© Copyright Salford Systems 2013

Page 173: Using CART For Beginners with A Teclo Example Dataset

Cross-Validation Strategy: Sample Re-use

• Any one train/test partition of the data that leaves enough

data for training will yield weak test results

– based on just a fragment of the available data

• But what if we were to repeat this process many times

– using different test partitions?

• Imagine the following: we divide the data into many 90/10

train/test partitions and repeat the modeling and testing

• Suppose that in every trial we get at least 75% of the test

data events classified correctly

• This would increase our confidence dramatically in the

reliability of the model performance

– Because we have multiple at least slightly different tests

© Copyright Salford Systems 2013

Page 174: Using CART For Beginners with A Teclo Example Dataset

Cross-Validation Technical Details

• Cross-Validation requires a specialized preparation of the data

somewhat different than our example of repeated train/test partitioning

• We start by dividing the data into K partitions. In the original CART monograph Breiman, Friedman, Olshen, and Stone set K=10

• K=10 has become an industry standard due both to Breiman et al. and other studies that followed (see final slides for details)

• The K partitions should all have the same distribution of the target

variable (same fraction of events) and if possible be equal in size

Care is taken to get this right when the data cannot be evenly divided into K parts

• This is all done automatically for you in SPM software

© Copyright Salford Systems 2013
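
A minimal sketch of constructing stratified CV folds along these lines (plain Python; an illustration only, not the SPM implementation):

import random

def stratified_folds(labels, k=10, seed=17):
    """Return a fold index (0..k-1) for each record so that every fold has
    roughly the same size and the same class mix as the full sample."""
    rng = random.Random(seed)
    folds = [None] * len(labels)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, idx in enumerate(idxs):
            folds[idx] = j % k   # deal each class out round-robin across the folds
    return folds

# Toy usage: 830 records, 704 of class 0 and 126 of class 1 (as in the Telco fold table)
labels = [0] * 704 + [1] * 126
fold = stratified_folds(labels)
n_fold0 = fold.count(0)
n_fold0_class1 = sum(1 for f, y in zip(fold, labels) if f == 0 and y == 1)
print(n_fold0, n_fold0_class1)   # roughly 83 records in fold 0, about 13 of them class 1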

Page 175: Using CART For Beginners with A Teclo Example Dataset

Cross-Validation Train/Test Procedure: K mutually exclusive partitions, 1 Test, K-1 Train

(Diagram: the data is divided into partitions 1 through 10; in each cycle a different partition is marked Test while the remaining nine partitions are marked Learn.)

Above, each partition is in the train sample 9 times and in the test sample 1 time

Page 176: Using CART For Beginners with A Teclo Example Dataset

Build K Models

• Once the data has been partitioned into the K parts we are ready to build

K models

– If we have 10 data partitions then we will build 10 models

• Each model is constructed by reserving one part for test and the

remaining K-1 parts for training

– If K=5 then each model will be based on an 80/20 split of data

– If K=10 then each model will be based on a 90/10 split

– There is nothing wrong with considering K=15 or K=20 or more

• In this strategy it is important to observe that each of the K blocks of data

is used as a test sample exactly once

• If we could somehow combine all the test results we would have an

aggregated test sample equal in size to that of the training data

© Copyright Salford Systems 2013

Page 177: Using CART For Beginners with A Teclo Example Dataset

Euro_Telco_Mini.xls Data Set

Class=0 Class=1

CVCycle Learn Test Learn Test CVW

1 634 70 113 13 0.1026161

2 633 71 114 12 0.0960758

3 634 70 113 13 0.1026161

4 633 71 114 12 0.0960758

5 634 70 113 13 0.1026161

6 633 71 114 12 0.0960758

7 634 70 113 13 0.1026161

8 634 70 113 13 0.1026161

9 633 71 114 12 0.0960758

10 634 70 113 13 0.1026161

• Here we see the breakdown of the 830 record data set into the 10 CV folds

• Table shows sample counts for majority and minority classes for learn and test

partitions for each fold

• Observe that CART has succeeded in making each fold almost identical in the learn/test division and in the balance between TARGET=0 and TARGET=1

• The last column is the WEIGHT that CART uses on each fold for certain calculations

Page 178: Using CART For Beginners with A Teclo Example Dataset

Confusion Matrix: Prediction Success Matrix

• In two-class (e.g. Yes/No) classification test results can be

represented via the 2x2 confusion matrix

© Copyright Salford Systems 2013

Predicted Y=0 Predicted Y=1

Actual Y=0 20 4

Actual Y=1 1 5

Hypothetical results for the test set of a single Cross-validation fold

Note the test sample is quite small, but there will be a number of these (e.g. 10)

Page 179: Using CART For Beginners with A Teclo Example Dataset

Aligning the CV Trees

All automatic and the user never sees this

Main CV1 CV2 CV3 CV4 CV5 CV6 CV7 CV8 CV9 CV10

Nodes 2 2 3 2 2 2 2 2 2 2 2

Complexity 0.01523 0.11543 0.04915 0.12949 0.08684 0.1178 0.09157 0.11464 0.11911 0.11201 0.10531

Nodes 4 6 4 4 4 5 4 4 5 4 4

Complexity 0.01487 0.01736 0.02034 0.01598 0.03128 0.01518 0.03642 0.02188 0.01815 0.02083 0.02285

Nodes 5 7 4 4 4 5 4 4 5 4 7

Complexity 0.01189 0.01455 0.02034 0.01598 0.03128 0.01518 0.03642 0.02188 0.01815 0.02083 0.01342

Nodes 9 8 4 8 4 9 4 9 6 8 10

Complexity 0.00893 0.01118 0.02034 0.01042 0.03128 0.01219 0.03642 0.01229 0.0114 0.01259 0.01157

• We would expect that the trees would be aligned by number of nodes and this is

approximately what happens

• CART aligns the trees by a measure of "complexity" discussed in other sessions

• Alignment is required to determine the estimated error rate of the main tree when it has

been pruned to a specific size (complexity)

• Thus when the main tree is pruned to 4 terminal nodes we align each CV tree appropriately. Eight of the CV trees are also pruned to 4 nodes, but one CV tree is pruned to 5 nodes and one to 6 nodes

Page 180: Using CART For Beginners with A Teclo Example Dataset

Summing the Confusion Matrices

• Each CV fold generates a test confusion matrix based on a

completely separate subset of data

• When summed, the test partitions are equal to the entire original training data

• Summing the confusion matrices yields an aggregate matrix

that is based on a sample equal to the original data set

• If we started with 300 records the assembled confusion

matrix consists of 300 test records

• Not a "trick". Each record was genuinely reserved for test one time and was classified correctly or incorrectly in its fold

• We have thus arrived at the largest possible test sample we

could create: as if 100% of the data was used for test!

© Copyright Salford Systems 2013
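
A minimal sketch of summing the per-fold test confusion matrices (plain Python; the first matrix is the hypothetical one shown earlier, the others are made up):

# Each fold's test confusion matrix as [[TN, FP], [FN, TP]]
fold_matrices = [
    [[20, 4], [1, 5]],
    [[22, 2], [2, 4]],
    [[19, 5], [1, 5]],
]

# Sum cell by cell to get the aggregate test confusion matrix
total = [[0, 0], [0, 0]]
for m in fold_matrices:
    for i in range(2):
        for j in range(2):
            total[i][j] += m[i][j]

n_test = sum(sum(row) for row in total)
print(total, n_test)   # the aggregate is based on every test record exactly once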

Page 181: Using CART For Beginners with A Teclo Example Dataset

Test Results Extracted From Cross-Validation

• Cross-validation is not a method for building a model

• Cross-validation is a method for indirectly testing a model

that on its own has no test performance results

• In classic cross-validation we throw away the K models built

on parts of the data. We keep only test results.

• Modern options for using these K different models exist and

you can save them in SPM

– Could be used in a committee or ensemble of models

– One of the CV models might turn out to be more interesting than the

main model

© Copyright Salford Systems 2013

Page 182: Using CART For Beginners with A Teclo Example Dataset

Does Cross-Validation Really Work?

• We have tested CV by extracting a small training data set from a much larger database

• We used CV to obtain a "simulated" test performance

• We then tested our main model against a genuine large test

sample extracted from the larger database

• Our results were always remarkably in agreement. CV gave

essentially the same results as the true test set method

• The CART monograph also discusses similar experiments

conducted by Breiman Friedman Olshen and Stone (BFOS)

• They come to the same conclusion while observing that 5-fold cross-validation tends to understate model performance and that 20-fold may be slightly more accurate than 10-fold

© Copyright Salford Systems 2013

Page 183: Using CART For Beginners with A Teclo Example Dataset

How Many Folds?

• How many folds do we need to run to obtain reliable results

• Think about 2 fold CV

– Divide the data into two parts

– First train on part 1 and test on part 2

– Then reverse roles of train and test

– Assemble results

• Problem with 2-fold CV is that we train on only half the

available data

– This is a severe disadvantage to the learning process unless we

have a large amount of data

• The spirit of CV is to use as much data for training as possible

© Copyright Salford Systems 2013

Page 184: Using CART For Beginners with A Teclo Example Dataset

How many CV folds?

• In the original CART monograph the authors Breiman,

Friedman, Olshen and Stone discussed some experiments

• Using small numbers such as 5-fold was typically

pessimistic

– Results suggested the model was not as good as it really was

• Using a substantial number of folds such as 20 was

generally only slightly more accurate than 10-fold

– CART authors suggested 10-fold as a default

– Results hold for classification problems

• These classification model results were re-confirmed in a 1995 paper by Ronny Kohavi

– A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In International Joint Conference on Artificial Intelligence (IJCAI 1995)

© Copyright Salford Systems 2013

Page 185: Using CART For Beginners with A Teclo Example Dataset

Creating Your Own Folds: Needs to be done with care with smaller samples

• Suppose you have 100 records divided as

– 92 records Y=0

– 8 records Y=1

• Each fold must have at least one record for each target

class

• Best we can do then is to have 8 folds

• But we cannot divide 92 into 8 equal parts

– 7 parts with 11 records Y=0 (response rate=.0833)

– 1 part with 15 records Y=0 (response rate=.0625)

• Better to divide as

– 4 parts with 11 records Y=0 (response rate=.0833)

– 4 parts with 12 records Y=0 (response rate=.0769)

– More equal balance across the folds yields more stable results

© Copyright Salford Systems 2013
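
A tiny sketch of this fold-size arithmetic (plain Python):

n_class0, n_folds = 92, 8
base, extra = divmod(n_class0, n_folds)   # 11 per fold, with 4 folds taking one extra
fold_sizes = [base + 1 if i < extra else base for i in range(n_folds)]
# Each fold also receives exactly one of the 8 class-1 records
rates = [1 / (s + 1) for s in fold_sizes]
print(fold_sizes)                      # [12, 12, 12, 12, 11, 11, 11, 11]
print([round(r, 4) for r in rates])    # [0.0769, 0.0769, 0.0769, 0.0769, 0.0833, ...]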

Page 186: Using CART For Beginners with A Teclo Example Dataset

Points to Remember

• The "main" model in CV is always built on all the training data

– Nothing is held back for testing

• If you were to run CV in several different ways

– Vary the number of folds

– Vary construction of CV folds by varying random number seed

• You would always get the exact same main model

– Only the estimates of test performance could differ

• Are the results sensitive to these parameters?

– BATTERY CV re-runs the analysis with different numbers of folds

• Larger numbers should converge

– BATTERY CVR uses the same number of folds but creates the K partitions based on different random number seeds

• Is expected to yield reasonably stable results

• Unstable results suggest considerable uncertainty regarding your model

© Copyright Salford Systems 2013

Page 187: Using CART For Beginners with A Teclo Example Dataset

Cross-Validation: Part II

• In part I we reviewed the main ideas behind cross-validation

• We pointed out that CV is a method for testing a model

• Especially useful when there is a shortage of data but can

be used in any circumstance

• A main model is built on all training data with nothing held

back for testing

• An additional set of K different models are built on different

partitions of the data holding back some of the data for test

• The test results for the K models are aggregated and then

used as an estimate of the test set performance of the

"main" model

© Copyright Salford Systems 2013

Page 188: Using CART For Beginners with A Teclo Example Dataset

Cross-Validation Train/Test Procedure: K mutually exclusive partitions, 1 Test, K-1 Train

(Diagram: the data is divided into partitions 1 through 10; in each cycle a different partition is marked Test while the remaining nine partitions are marked Learn.)

Above, each partition is in the train sample 9 times and in the test sample 1 time

Page 189: Using CART For Beginners with A Teclo Example Dataset

Alignment of Results

• In this session we discuss a somewhat technical topic

related to the mechanics of aligning test results from K CV

models and the main model

• Recall that CART grows a large tree and then prunes it back

• Back pruning is conducted via "cost-complexity"

• Back pruning might prune off more than one terminal node

at a time

• Back pruning might prune back several nodes along the

same branch

• CV generates K different models each with its own maximal

tree and its own sequence of back-pruned trees

© Copyright Salford Systems 2013

Page 190: Using CART For Beginners with A Teclo Example Dataset

CV Mechanics

• Main Model has no test data; each CV Model has test data

(Diagram: the Main Model shown alongside CV Model 1 through CV Model 10)

Combine test results from all CV folds and attribute them to the main model

© Copyright Salford Systems 2013

Page 191: Using CART For Beginners with A Teclo Example Dataset

CART and CV Details

• A CART tree model is actually a family of progressively

smaller tree models one of which is normally deemed

―optimal‖

• So we don’t just have a main model and K CV models

• We have a main tree sequence and K CV tree sequences

• For every tree in the main sequence we need to match it up

with its corresponding tree in each CV sequence

• The most obvious way to do this is by tree size

• To estimate the error rate of the 2-node tree in the main tree

sequence match it up with the K 2-node trees found via CV

• Then proceed to match up every other tree size found

© Copyright Salford Systems 2013

Page 192: Using CART For Beginners with A Teclo Example Dataset

CART Tree Alignment

• Matching up trees from the different sequences is much

more complicated than this

• Each CV tree has its own sequence and its own maximal

size

• These sequences may not all contain the same tree sizes

• The main tree might contain a subtree with 8 terminal nodes

but not every CV tree will contain an 8 node tree

– Back pruning sometimes skips over certain sizes, jumping directly, say, from 9 terminal nodes to 7

• Not all tree sequences will have the same number of nodes

in the maximal tree

© Copyright Salford Systems 2013

Page 193: Using CART For Beginners with A Teclo Example Dataset

Alignment via Cost Complexity

• Cost complexity prunes trees by examining a trade off between error rate

(cost) and size of the tree (complexity)

• Error rate can be taken to be misclassification rate for this discussion (on

the training data)

• Suppose our maximal tree has a training data misclassification rate of

.00 (not uncommon on training data) but that the tree is very large (e.g.

1000 terminal nodes)

• Suppose we penalized terminal nodes at the rate of .0001

• Then the error rate of 0 would be counterbalanced by a penalty of

1000*(.0001)=0.10

• If we could prune off 500 nodes we would reduce the penalty to .05 but

of course our misclassification would probably increase

• If the increase in misclassification rate were say .04 then the total of misclassification rate + penalty would be only .04 + .05 = .09, a benefit!

© Copyright Salford Systems 2013
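
A small sketch of the cost-complexity comparison just worked through (plain Python):

def cost_complexity(misclass_rate, n_terminal_nodes, penalty_per_node):
    # total cost = error rate (cost) + penalty * number of terminal nodes (complexity)
    return misclass_rate + penalty_per_node * n_terminal_nodes

penalty = 0.0001
big_tree    = cost_complexity(0.00, 1000, penalty)   # 0.00 + 0.10 = 0.10
pruned_tree = cost_complexity(0.04, 500, penalty)    # 0.04 + 0.05 = 0.09
print(big_tree, pruned_tree)   # the pruned tree wins despite its higher error rate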

Page 194: Using CART For Beginners with A Teclo Example Dataset

CART Cost Complexity Pruning

• CART automatically tests different penalties to try to induce

a smaller tree

• We always start with a penalty of 0 and then start gradually

increasing it

• To prune back we prune off the so-called "weakest link", which is the node that increases the misclassification rate of the whole tree the least

• This means that the sample size of the node is taken into account

• A progressive search algorithm for finding the next penalty is

described in the CART monograph

© Copyright Salford Systems 2013

Page 195: Using CART For Beginners with A Teclo Example Dataset

Cost-Complexity is the key to Alignment

• For every CART tree sequence a specific penalty on nodes

(e.g. .001) leads immediately to exactly one tree of a specific

size

• We can only find this tree by going through the pruning

sequence (no shortcuts)

• We align the CART CV trees by the penalty (complexity)

rather than the tree size

• So for a given penalty we find the tree that corresponds to it

both in the main tree sequence and also in each CV tree

• These aligned trees are used to extract the performance

measures that will finally be assigned to the main tree of that

size

© Copyright Salford Systems 2013

Page 196: Using CART For Beginners with A Teclo Example Dataset

Table of Alignments: Special extract report not automatically generated

© Copyright Salford Systems 2013

• Table displays the aligned trees corresponding to each tree in the main sequence

• In the first row the main tree has been pruned to 2 nodes as have all but one of the CV

trees

• When the main tree is pruned to 7 nodes it is aligned with trees of varying sizes ranging

from 4 to 7 terminal nodes

• The complexity penalties appear under the terminal node counts

• Complexity penalties always increase as the tree becomes smaller