Week 15, Lecture 30: CART
Class Prep
bugs_waterchem.csv
library(rpart)
library(mvpart)
HabUse.csv
data(iris)
HW: CART readings on webpage
The mvpart package !!??
Reading from Guthery
Class project
What’s in a name?
Structural modeling
CART
Classification and Regression Trees
Recursive partitioning
Decision Trees
Constrained cluster analysis
Introduction to CART
Structural modeling technique
Data mining (exploration / description)
Partitions the response variable(s) based on the best predictor (with surrogates)
Very flexible, few assumptions
Great for complex data and relationships
Focuses on maximizing predictive ability rather than minimizing error
Introduction to CART
Recursively partitions the data into more
and more homogeneous subsets based on
certain levels of predictor variables
Divisive, constrained cluster analysis
“Supervised” divisive cluster analysis
Outcomes
Decision tree
Data description, patterns, relationships
Characteristics of the response clusters
Predictive model
Introduction to CART
Used in medical, industrial, and business fields
Breiman et al. 1984. Classification and regression trees. Chapman and Hall, New York.
Fairly new to ecology
Important Introductory
Papers in Ecology
De'ath G, Fabricius KE. 2000. Classification and regression trees: A powerful yet simple technique for ecological data analysis. Ecology 81: 3178-3192.
De'ath G. 2002. Multivariate regression trees: a new technique for modeling species-environment relationships. Ecology 83: 1105-1117.
Data Structure for CART
A response variable
– Continuous
– Categorical
Explanatory variables
– Continuous and/or categorical
A matrix of response variables --> MRT (multivariate regression tree)
Panacea, or Pandora’s box (cf. James and McCulloch 1990)?
How it Works
Purifies the response along a ranking of the explanatory variable(s)
– i.e., splits at values < or >= some level X
How it works
Split the response variable into the two most homogenous groups based on the best level of the best explanatory variable– Choose the level of the explanatory variable that maximizes
homogeneity of the two groups with respect to the values of the response variable
Do again on each separated, exclusive group
Do again on those groups
Tree grows longer until you have 1 observation per group or you stop growing it– Overgrow the tree and then prune it back ?
– Nodes = splitting levels
– Terminal node = tree leaf
Allied with divisive hierarchical cluster analysis, but is constrained explicitly on your explanatory variables
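The splitting step above can be sketched in a few lines. This is a minimal illustration for one continuous predictor and a continuous response (the regression-tree case), not the rpart internals; the function names, the `min_n` stopping rule, and the toy data are all invented for this example.

```python
def sse(ys):
    """Sum of squared errors around the group mean (node impurity)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    """Try each midpoint between sorted x values; return the threshold
    that makes the two child groups most homogeneous (lowest total SSE)."""
    best_t, best_score = None, float("inf")
    xs_sorted = sorted(set(xs))
    for lo, hi in zip(xs_sorted, xs_sorted[1:]):
        t = (lo + hi) / 2
        left  = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = sse(left) + sse(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

def grow(xs, ys, min_n=3):
    """Recursively partition until groups are small or pure."""
    if len(ys) < 2 * min_n or sse(ys) == 0:
        return sum(ys) / len(ys)              # leaf: predict the group mean
    t = best_split(xs, ys)
    left  = [(x, y) for x, y in zip(xs, ys) if x < t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= t]
    if not left or not right:
        return sum(ys) / len(ys)
    return {"thresh": t,
            "left":  grow(*zip(*left),  min_n=min_n),
            "right": grow(*zip(*right), min_n=min_n)}

xs = [1, 2, 3, 10, 11, 12]
ys = [1, 1, 1,  9,  9,  9]
print(best_split(xs, ys))   # 6.5: the split that makes both groups pure
```

Each recursive call sees only its own exclusive subset of the data, which is exactly the "do it again on each separated, exclusive group" step.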
[Tree figure: first split at MN < 0.02]
[Tree figure: splits at MN < 0.02, SO4 < 46, and AL < 0.02]
Over-grown Tree
[Figure: over-grown classification tree with the misclassification rate (MCR) at each node; overall MCR = 35/375]
Rel. error (training error); 1 - rel. error = % variance explained
Xerror = cross-validated (CV) error
Xstd = standard error of Xerror
CV MCR = cross-validated misclassification rate
Pruned Tree
[Figure: pruned classification tree with the MCR at each node]
Model Description
Categorical response
– Terminal leaves characterized by a distribution over the categories
– Proportions of observations in each group
– MCR
Continuous response
– Terminal leaves characterized by the mean of the response variable and summary stats
– Report the % of SS explained
All terminal leaves are characterized by group size, some measure of variation, the response value, and the values of the explanatory variables
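For the categorical case, a leaf summary can be sketched like this. The function name and the example labels (WVSCI-style categories) are illustrative, not rpart output; MCR is computed as 1 minus the proportion of the majority class.

```python
from collections import Counter

def leaf_summary(labels):
    """Summarize a terminal leaf for a categorical response:
    group size, class proportions, majority class, and MCR."""
    counts = Counter(labels)
    n = len(labels)
    majority, k = counts.most_common(1)[0]
    return {"n": n,
            "proportions": {c: cnt / n for c, cnt in counts.items()},
            "class": majority,            # the leaf's predicted class
            "mcr": 1 - k / n}             # misclassification rate at this leaf

leaf = leaf_summary(["Good", "Good", "Good", "Poor"])
print(leaf["class"], leaf["mcr"])   # Good 0.25
```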
How to do it in R
library(mvpart)
Modeling WVSCI Categories from Landscape Data (Classification Tree)
WV SCI
• Excellent
• Good
• Moderate
• Poor
Modeling EPT Scores from Landscape Data (Regression Tree)
It would be nice to visualize this variation
RPART Uses
Exploratory
Modeling
– Description
– Prediction: what is the value of the response variable given new observations of the explanatory variables?
IF…THEN statements
Map generation
Distinguishing groups in terms of species composition
– Change points
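The IF…THEN use can be sketched by flattening a fitted tree into rules. The tree structure, split values, and class labels below are invented for illustration (loosely echoing the MN / SO4 splits in the earlier figure), not actual rpart output.

```python
def tree_rules(node, conds=()):
    """Flatten a nested tree into IF...THEN rules, one per terminal leaf."""
    if not isinstance(node, dict):            # leaf: emit one finished rule
        return [f"IF {' AND '.join(conds)} THEN predict {node}"]
    v, t = node["var"], node["thresh"]
    return (tree_rules(node["left"],  conds + (f"{v} < {t}",)) +
            tree_rules(node["right"], conds + (f"{v} >= {t}",)))

# Hypothetical two-split tree
tree = {"var": "Mn", "thresh": 0.02,
        "left": "Reference",
        "right": {"var": "SO4", "thresh": 46,
                  "left": "Transitional", "right": "Impaired"}}
for rule in tree_rules(tree):
    print(rule)
# IF Mn < 0.02 THEN predict Reference
# IF Mn >= 0.02 AND SO4 < 46 THEN predict Transitional
# IF Mn >= 0.02 AND SO4 >= 46 THEN predict Impaired
```

Rules like these are what make the model easy to apply to un-sampled sites (and to turn into maps).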
Landscape Models to Predict
Water Quality Type
Application of WQ Models
Prediction of WQ type in un-sampled reaches

% by Rshed area
Type     Cheat    Tygart
Sev A     5.3 %    1.4 %
Mod A     2.2 %    4.2 %
Hard      2.8 %   14.2 %
Soft     27.1 %    6.4 %
Trans    23.4 %   23.8 %
Ref      39.2 %   50.0 %
Model Validation
How well does it really work to predict
new data?
How to Pick the “Best” Tree Size by Pruning
Test-set validation: external model testing
– Split into a model-building subset (3/4) and a model-testing subset (1/4), if you have enough data
– Drop the external data through different tree sizes to see which tree size predicts best
– Choose the tree size with the smallest prediction error
V-fold cross-validation
– Divide the data into 10 equal groups (V = 10)
– Build a tree with V2 through V10 and predict V1
– Build a tree with V1 and V3 through V10 and predict V2
– Etc.
– Calculate the estimated error over ALL subsets for EACH tree size (wow!)
– Repeat 50 times (at least! yikes!), because the best tree size varies across cross-validations
– Select the modal tree size with the lowest error rate
– Or select the smallest tree size within 1 SE of the minimum (the 1-SE rule)
– On average, this tree size should give the best prediction success for new data
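The 1-SE rule above can be sketched against an rpart-style cp table, where xerror is the cross-validated error and xstd its standard error. The function name and all the numbers below are made up for illustration; in practice these columns come from `printcp()`.

```python
def pick_size_1se(sizes, xerror, xstd):
    """Smallest tree size whose CV error is within 1 SE of the minimum."""
    i_min = min(range(len(xerror)), key=lambda i: xerror[i])
    threshold = xerror[i_min] + xstd[i_min]   # minimum CV error + 1 SE
    ok = [s for s, e in zip(sizes, xerror) if e <= threshold]
    return min(ok)                             # prefer the smallest such tree

sizes  = [1,    2,    3,    5,    8]           # number of terminal leaves
xerror = [1.00, 0.60, 0.45, 0.44, 0.47]        # hypothetical CV errors
xstd   = [0.06, 0.05, 0.05, 0.05, 0.05]
print(pick_size_1se(sizes, xerror, xstd))      # 3 (within 0.44 + 0.05 of the minimum)
```

Size 5 has the lowest CV error here, but size 3 is within one standard error of it, so the rule trades a little error for a simpler, more stable tree.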
Picking the “Best” Tree Size
Cross-validation relative error: decreases, then increases to a plateau
Relative error: decreases with tree size
Easily Done in R
Package rpart
Package mvpart
A word about publishing graphics
Multivariate Regression Trees
Extension of univariate regression trees
Multiple continuous response variables
Multiple continuous and/or categorical predictor variables
Species–environment relationships
Indicator species
Disadvantage: impossible to visualize for large assemblages
Conclusions
Linear models (OLS)
GLM and GAM (non-normal errors and non-linear relationships)
CART, ANN, BRT
rpart Summary Advantages
Non-parametric
Missing data ok
Surrogate splitters
Simple even for complex relationships
Scale invariant
Robust to outliers
Flexible
rpart Weaknesses
Over-fitting
– Finds the best splits for the data at hand, which is good for description but not for prediction
One single tree model
Can be unstable
– Very sensitive to the input data!
– Small changes can yield very different trees
– Difficulty with smooth responses
– GLM and GAM out-perform it there
Poor predictive models
Disadvantages of CART
CART does not use combinations of variables (each split uses a single variable)
Deceptive: if a variable is not included, it could be because it was “masked” by another variable acting as a surrogate
The tree is optimal at each split, but it may not be globally optimal (a sample/inference challenge)
Solutions
Stochastic Boosting
– randomForest
– gbm (Boosted Regression Trees)
– gbmplus (aggregated boosted trees)
– ada
Currently no multi-class classification response possible
– But there’s a work-around (tedious as hell):
– the one-versus-all approach
– a design matrix with dummy variables
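The one-versus-all work-around can be sketched as follows: expand the multi-class response into K dummy (0/1) columns and fit one binary model per column. The function name and the class labels are illustrative.

```python
def dummy_matrix(labels):
    """One 0/1 column per class: 1 where the observation is that class."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

y = ["Ref", "Soft", "Ref", "Trans"]
dummies = dummy_matrix(y)
print(dummies["Ref"])   # [1, 0, 1, 0]  -> fit a binary booster for "Ref" vs rest
# ...repeat for each class, then assign new observations to the class
# whose binary model gives the highest predicted score
```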
Boosting
Improves prediction (quite a lot)
– Fit trees to many (1000s of) random samples of the data (bagging)
– A random subset of predictors is used for each tree
– Successive trees are fit to the residuals of earlier trees
– Focus on hard-to-predict cases
– Average the predictions over all trees
– Machine learning
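The residual-fitting idea can be sketched for one predictor: each small tree (here a single-split stump) is fit to the residuals of everything fit so far, and the final prediction sums many small corrections. All names, the learning rate, and the toy data are illustrative, not the gbm defaults.

```python
def fit_stump(xs, rs):
    """Best single split of the residuals rs: returns a predictor function."""
    best = None
    xs_sorted = sorted(set(xs))
    for lo, hi in zip(xs_sorted, xs_sorted[1:]):
        t = (lo + hi) / 2
        left  = [r for x, r in zip(xs, rs) if x < t]
        right = [r for x, r in zip(xs, rs) if x >= t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - ml) ** 2 for r in left) +
               sum((r - mr) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

def boost(xs, ys, n_trees=50, lr=0.1):
    base = sum(ys) / len(ys)                   # start from the overall mean
    stumps, preds = [], [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        s = fit_stump(xs, residuals)           # fit the next tree to residuals
        stumps.append(s)
        preds = [p + lr * s(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

xs = [1, 2, 3, 10, 11, 12]
ys = [1, 1, 1,  9,  9,  9]
model = boost(xs, ys)
print(round(model(2), 2), round(model(11), 2))   # close to 1 and 9
```

The small learning rate means no single tree dominates; this slow, additive correction of residuals is the core of why boosting predicts so much better than one tree.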