Modern Trends in Data Mining
Trevor Hastie, Stanford University
February 2009.
Data Mining for Prediction
• We have a collection of data pertaining to our business, industry, production process, monitoring device, etc.
• Often the goals of data mining are vague, such as "look for patterns in the data" — not too helpful.
• In many cases a "response" or "outcome" can be identified as a good and useful target for prediction.
• Accurate prediction of this target can help the company make better decisions, and save a lot of money.
• Data mining is particularly good at building such prediction models — an area known as "supervised learning".
Example: Credit Risk Assessment
• Customers apply to a bank for a loan or credit card.
• They supply the bank with information such as age, income, employment history, education, bank accounts, existing debts, etc.
• The bank does further background checks to establish the credit history of the customer.
• Based on this information, the bank must decide whether to make the loan or issue the credit card.
Example continued: Credit Risk Assessment
• The bank has a large database of existing and past customers. Some of these defaulted on loans, others frequently made late payments, etc. An outcome variable "Status" is defined, taking the value "good" or "default". Each of the past customers is scored with a value for Status.
• Background information is available for all the past customers.
• Using supervised learning techniques, we can build a risk prediction model that takes as input the background information, and outputs a risk estimate (probability of default) for a prospective customer.
The California-based company Fair-Isaac uses a generalized additive model + boosting methods in the construction of their credit risk scores.
Example: Churn Prediction
• When a customer switches to another provider, we call this "churn". Examples are cell-phone service and credit-card providers.
• Based on customer information and usage patterns, we can predict
– the probability of churn
– the retention probability (as a function of time)
• This information can be used to evaluate
– prospective customers, to decide on acceptance
– present customers, to decide on an intervention strategy
Risk assessment and survival models are used by US cell-phone companies such as AT&T to manage churn.
Netflix Challenge
Netflix users rate movies from 1 to 5. Based on a history of ratings, predict the rating a viewer will give to a new movie.
• Training data: a sparse 400K (users) by 18K (movies) rating matrix, with 98.7% of the entries missing. About 100M movie/rater pairs.
• Quiz set of about 1.4M movie/viewer pairs, for which predictions of ratings are required (Netflix has held them back).
• Probe set of about 1.4 million movie/rater pairs, similar in composition to the quiz set, for which the ratings are known.
Grand Prize: one million dollars, if you beat Netflix's RMSE by 10%. After ≈ 2.5 years, the leaders are at 9.63%; 37,000 teams!
The Supervised Learning Problem
Starting point:
• Outcome measurement Y (also called the dependent variable, response, target, output).
• Vector of p predictor measurements X (also called inputs, regressors, covariates, features, independent variables).
• In the regression problem, Y is quantitative (e.g. price, blood pressure, rating).
• In classification, Y takes values in a finite, unordered set (default yes/no, churn/retain, spam/email).
• We have training data (x1, y1), . . . , (xN, yN). These are observations (examples, instances) of these measurements.
Objectives
On the basis of the training data we would like to:
• Accurately predict unseen test cases for which we know X but do not know Y.
• In the case of classification, predict the probability of an outcome.
• Understand which inputs affect the outcome, and how.
• Assess the quality of our predictions and inferences.
More Examples
• Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements.
• Determine whether an incoming email is "spam", based on frequencies of key words in the message.
• Identify the numbers in a handwritten zip code, from a digitized image.
• Estimate the probability that an insurance claim is fraudulent, based on client demographics, client history, and the amount and nature of the claim.
• Predict the type of cancer in a tissue sample using DNA expression values.
Email or Spam?
• Data from 4601 emails sent to an individual (named George, at HP Labs, before 2000). Each is labeled as "spam" or "email".
• Goal: build a customized spam filter.
• Input features: relative frequencies of 57 of the most commonly occurring words and punctuation marks in these email messages.
        george   you    hp  free     !   edu  remove
spam      0.00  2.26  0.02  0.52  0.51  0.01    0.28
email     1.27  1.27  0.90  0.07  0.11  0.29    0.01
Average percentage of words or characters in an email message equal to the indicated word or character. We have chosen the words and characters showing the largest difference between spam and email.
Handwritten Digit Identification
A sample of segmented and normalized handwritten digits, scanned from zip codes on envelopes. Each image has 16 × 16 pixels of greyscale values ranging from 0 to 255.
Microarray Cancer Data
Expression matrix of 6830 genes (rows) and 64 samples (columns), for the human tumor data (100 randomly chosen rows shown). The display is a heat map, ranging from bright green (underexpressed) to bright red (overexpressed). Goal: predict cancer class based on expression values.
Shameless self-promotion
All of the topics in this lecture are covered in the 2009 second edition of our 2001 book. The book blends traditional linear methods with contemporary nonparametric methods, and many in between the two.
Ideal Predictions
• For a quantitative output Y, the best prediction we can make when the input vector X = x is
f(x) = Ave(Y |X = x)
– This is the conditional expectation — deliver the Y-average of all those examples having X = x.
– This is best if we measure errors by the average squared error Ave[(Y − f(X))²].
• For a qualitative output Y taking values 1, 2, . . . , M, compute
– Pr(Y = m|X = x) for each value of m. This is the conditional probability of class m at X = x.
– Classify C(x) = j if Pr(Y = j|X = x) is the largest — the majority vote classifier.
Implementation with Training Data
The ideal prediction formulas suggest a data implementation. To predict at X = x, gather all the training pairs (xi, yi) having xi = x, then:
• For regression, use the mean of their yi to estimate f(x) = Ave(Y |X = x).
• For classification, compute the relative proportions of each class among these yi, to estimate Pr(Y = m|X = x); classify the new observation by majority vote.
Problem: in the training data, there may be NO observations having xi = x.
Nearest Neighbor Averaging
• Estimate Ave(Y |X = x) by averaging those yi whose xi are in a neighborhood of x.
• E.g. define the neighborhood to be the set of k observations having values xi closest to x in Euclidean distance ||xi − x||.
• For classification, compute the class proportions among these k closest points (a short R sketch follows below).
• Nearest-neighbor methods often outperform all other methods — about one in three times — especially for classification.
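As a concrete illustration, here is a minimal base-R sketch of nearest-neighbor averaging; the training matrix x, response y, query point x0 and neighborhood size k are placeholder names.

    # k-nearest-neighbor estimate of Ave(Y | X = x0).
    # x: N x p matrix of training inputs; y: length-N response vector.
    knn_average <- function(x, y, x0, k = 10) {
      d <- sqrt(rowSums(sweep(x, 2, x0)^2))  # Euclidean distances ||xi - x0||
      nbrs <- order(d)[1:k]                  # the k closest training points
      mean(y[nbrs])                          # average their y's
    }

    # For classification, return the class proportions among the k neighbors:
    knn_proportions <- function(x, y, x0, k = 10) {
      d <- sqrt(rowSums(sweep(x, 2, x0)^2))
      table(y[order(d)[1:k]]) / k
    }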
[Figure: scattered data points illustrating kernel smoothing at a target point x]
Kernel smoothing
• A smooth version of nearest-neighbor averaging.
• At each point x, the function f(x) = Ave(Y |X = x) is estimated by the weighted average of the y's.
• The weights die down smoothly with distance from the target point x (indicated by the shaded orange region in the figure).
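The same idea in code: a minimal Nadaraya-Watson sketch in base R for one-dimensional inputs, with placeholder names x, y, x0 and bandwidth h.

    # Gaussian-kernel smoother: weights die down smoothly away from x0.
    kernel_smooth <- function(x, y, x0, h = 0.1) {
      w <- dnorm((x - x0) / h)   # smooth weights for each training point
      sum(w * y) / sum(w)        # weighted average estimating Ave(Y | X = x0)
    }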
Structured Models
• When we have a lot of predictor variables, NN methods often fail because of the curse of dimensionality:
It is hard to find nearby points in high dimensions!
• Near-neighbor models offer little interpretation.
• We can overcome these problems by assuming some structure for the regression function Ave(Y |X = x) or the probability function Pr(Y = k|X = x). Typical structural assumptions:
– Linear Models
– Additive Models
– Low-order interaction models
– Restrict attention to a subset of predictors
– . . . and many more
Linear Models
• Linear models assume
Ave(Y |X = x) = β0 + β1x1 + β2x2 + . . . + βpxp
• For two-class classification problems, linear logistic regression has the form
log[ Pr(Y = +1|X = x) / Pr(Y = −1|X = x) ] = β0 + β1x1 + β2x2 + . . . + βpxp
• This translates to
Pr(Y = +1|X = x) = e^(β0+β1x1+...+βpxp) / (1 + e^(β0+β1x1+...+βpxp))
Chapters 3 and 4 of the book deal with linear models.
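In R, such a linear logistic regression can be fit with glm(); a sketch assuming a data frame train with a two-level factor response status, and a similar data frame test.

    fit <- glm(status ~ ., data = train, family = binomial)  # logistic regression
    summary(fit)   # the estimated coefficients beta_0, beta_1, ..., beta_p

    # predicted probabilities Pr(Y = +1 | X = x) for new cases:
    p_hat <- predict(fit, newdata = test, type = "response")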
Linear Model Complexity Control
With many inputs, linear regression can overfit the training data, leading to poor predictions on future data. Two general remedies are available:
• Variable selection: reduce the number of inputs in the model. For example, stepwise selection or best-subset selection.
• Regularization: leave all the variables in the model, but when fitting the model, restrict their coefficients.
– Ridge: ∑_{j=1}^p βj² ≤ s. All the coefficients are non-zero, but are shrunk toward zero (and each other).
– Lasso: ∑_{j=1}^p |βj| ≤ s. Some coefficients drop out of the model; others are shrunk toward zero.
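Both penalties are available in R through the glmnet package (assumed installed); note that glmnet expresses the constraint through an equivalent penalty parameter lambda rather than the bound s.

    library(glmnet)
    # x: numeric predictor matrix; y: response vector (placeholder names)
    ridge <- glmnet(x, y, alpha = 0)   # alpha = 0 gives the ridge penalty
    lasso <- glmnet(x, y, alpha = 1)   # alpha = 1 gives the lasso penalty
    plot(lasso)                        # coefficient paths over the whole sequence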
Best Subset Selection
[Figure: residual sum-of-squares on the training data versus subset size s, for s = 0 through 8]
Each point corresponds to a linear model involving a subset of the variables, and shows the residual sum-of-squares on the training data. The red models are the candidates, and we need to choose s.
[Figure: "Ridge" and "Lasso" coefficient paths β̂(s) versus shrinkage factor s, for the inputs lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45]
Both the ridge and lasso coefficient paths can be computed very efficiently for all values of s.
Overfitting and Model Assessment
• In all the cases above, the larger s is, the better we will fit the training data. Often we overfit the training data.
• Overfit models can perform poorly on test data (high variance).
• Underfit models can perform poorly on test data (high bias).
Model assessment aims to
1. Choose a value for a tuning parameter s for a technique.
2. Estimate the future prediction ability of the chosen model.
• For both of these purposes, the best approach is to evaluate the procedure on an independent test set, if one is available.
• If possible one should use different test data for (1) and (2) above: a validation set for (1) and a test set for (2).
K-Fold Cross-Validation
Primarily a method for estimating a tuning parameter s when data is scarce; we illustrate for the regularized linear regression models.
• Divide the data into K roughly equal parts (5 or 10).
[Diagram: the data split into 5 parts, labeled 1 to 5; one part is held out as "Test" and the remaining parts are "Train"]
• For each k = 1, 2, . . . , K, fit the model with parameter s to the other K − 1 parts, giving β̂^{−k}(s), and compute its error in predicting the kth part:
Ek(s) = ∑_{i ∈ kth part} (yi − xi β̂^{−k}(s))²
• This gives the overall cross-validation error
CV(s) = (1/K) ∑_{k=1}^K Ek(s)
• Do this for many values of s, and choose the value of s that makes CV(s) smallest; a bare-bones version of this loop is sketched below.
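A bare-bones version of this recipe for the lasso, sketched with the glmnet package; the matrix x, response y and the grid of candidate values are placeholders.

    library(glmnet)
    K      <- 10
    folds  <- sample(rep(1:K, length.out = nrow(x)))  # random fold labels
    s_grid <- 10^seq(0, -3, length.out = 50)          # candidate penalty values
    cv_err <- matrix(NA, K, length(s_grid))

    for (k in 1:K) {
      fit  <- glmnet(x[folds != k, ], y[folds != k], alpha = 1) # fit on K-1 parts
      pred <- predict(fit, newx = x[folds == k, ], s = s_grid)  # predict kth part
      cv_err[k, ] <- colMeans((y[folds == k] - pred)^2)         # E_k(s), as MSE
    }
    cv     <- colMeans(cv_err)        # CV(s), averaged over the K folds
    s_best <- s_grid[which.min(cv)]   # the value of s minimizing CV(s)

In practice cv.glmnet(x, y) automates exactly this loop.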
Cross-Validation Error Curve
[Figure: CV error curve versus tuning parameter s]
• 10-fold CV error curve using the lasso on some diabetes data (64 inputs, 442 samples).
• The thick curve is the CV error curve.
• The shaded region indicates the standard error of the CV estimate.
• The curve shows the effect of overfitting — errors start to increase above s = 0.2.
• This shows a trade-off between bias and variance.
Modern Structured Models in Data Mining
The following is a list of some of the more important and currently popular prediction models in data mining.
• Linear Models (often heavily regularized)
• Generalized Additive Models
• Neural Networks
• Trees, Random Forests and Boosted Tree Models — hot!
• Support Vector and Kernel Machines — hot!
Generalized Additive Models
Allow a compromise between linear models and more flexible local models (kernel estimates) when there are many inputs X = (X1, X2, . . . , Xp).
• Additive models for regression:
Ave(Y |X = x) = α0 + f1(x1) + f2(x2) + . . . + fp(xp).
• Additive models for classification:
log[ Pr(Y = +1|X = x) / Pr(Y = −1|X = x) ] = α0 + f1(x1) + f2(x2) + . . . + fp(xp).
Each of the functions fj(xj) (one for each input variable) can be a smooth function (à la kernel estimates), linear, or omitted.
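Such a model can be fit in R with, for example, the gam package; a sketch on hypothetical SPAM-style data, where the data frame spam_df and the chosen predictors are placeholders.

    library(gam)
    # smooth terms for three inputs, a linear term for a fourth, the rest omitted
    fit <- gam(spam ~ s(free) + s(remove) + s(george) + edu,
               family = binomial, data = spam_df)
    plot(fit, se = TRUE)   # draw each fitted f_j with standard-error bands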
[Figure: the fitted functions f̂ for 16 SPAM predictors: our, over, remove, internet, free, business, hp, hpl, george, 1999, re, edu, ch!, ch$, CAPMAX and CAPTOT]
GAM fit to SPAM data
• Shown are the most important predictors.
• Many show nonlinear behavior.
• Overall error rate 5.3%.
• Functions can be re-parametrized (e.g. log terms, quadratics, step functions), and then fit by a linear model.
• Produces a prediction per email, Pr(SPAM|X = x).
Neural Networks
[Figure: "Single (Hidden) Layer Perceptron": input layer X1, X2, X3; hidden layer Z1, . . . , Z4; output layer Y1, Y2]
• Like a complex regression or logistic regression model — more flexible, but less interpretable — a "black box".
• Hidden units Z1, Z2, . . . , Zm (4 here): Zj = σ(α0j + αjᵀX), where σ(z) = e^z / (1 + e^z) is the logistic sigmoid activation function.
• The output is a linear regression or logistic regression model in the Zj.
• Complexity is controlled by m, ridge regularization, and early stopping of the backpropagation algorithm used to fit the network (a sketch follows below).
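A network of this kind can be fit in R with the nnet package; a sketch with placeholder data names, where size is m and decay is the ridge penalty on the weights.

    library(nnet)
    fit <- nnet(class ~ ., data = train,
                size = 4,      # m = 4 hidden units
                decay = 0.01,  # ridge regularization of the weights
                maxit = 500)   # cap on the optimization iterations
    p_hat <- predict(fit, newdata = test, type = "raw")  # fitted probabilities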
Support Vector Machines
[Figure: two separable classes of points, with the decision boundary and the margin on either side]
• Maximize the gap (margin) between the two classes on the training data.
• If not separable:
– enlarge the feature space via basis expansions (e.g. polynomials).
– use a "soft" margin (allow limited overlap).
• The solution depends on a small number of points ("support vectors") — 3 in the figure.
Support Vector Machines
[Figure: two overlapping classes with a soft-margin decision boundary xᵀβ + β0 = 0; the overlaps are labeled ξ1, . . . , ξ5]
• Maximize the soft margin subject to a bound on the total overlap: ∑i ξi ≤ B.
• With yi ∈ {−1, +1}, this becomes a convex optimization problem:
min ‖β‖ subject to yi(xiᵀβ + β0) ≥ 1 − ξi for all i, and ∑_{i=1}^N ξi ≤ B.
Properties of SVMs
• Primarily used for classification problems. Builds a linear classifier f(X) = β0 + β1X1 + β2X2 + . . . + βpXp.
If f(X) > 0, classify as +1; else if f(X) < 0, classify as −1.
• Generalizations use kernels ("radial basis functions"):
f(X) = α0 + ∑_{i=1}^N αi K(X, xi)
– K is a symmetric function, e.g. K(X, xi) = e^(−γ‖X − xi‖²), and each xi is one of the samples (vectors).
– Many of the αi = 0; the rest are "support points".
• Extensions to regression, logistic regression, PCA, . . .
• Well-developed mathematics — function estimation in Reproducing Kernel Hilbert Spaces.
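In R such machines can be fit with the e1071 package (a wrapper for libsvm); a sketch with placeholder data names.

    library(e1071)
    fit <- svm(class ~ ., data = train,
               kernel = "radial",    # K(X, xi) = exp(-gamma * ||X - xi||^2)
               gamma = 1, cost = 1)  # cost controls the allowed overlap
    pred <- predict(fit, newdata = test)
    fit$tot.nSV  # the number of support points (nonzero alpha_i)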
SVM via Loss + Penalty
[Figure: the binomial log-likelihood and support-vector (hinge) losses, plotted against the margin yf(x)]
With f(x) = xᵀβ + β0 and yi ∈ {−1, 1}, consider
min over (β0, β) of ∑_{i=1}^N [1 − yi f(xi)]₊ + (λ/2)‖β‖²
This hinge loss criterion is equivalent to the SVM, with λ monotone in B. Compare with
min over (β0, β) of ∑_{i=1}^N log[1 + e^(−yi f(xi))] + (λ/2)‖β‖²
This is the binomial deviance loss, and the solution is "ridged" linear logistic regression.
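The two criteria differ only in the loss; a few lines of base R reproduce the comparison in the figure above.

    hinge    <- function(yf) pmax(1 - yf, 0)    # SVM hinge loss [1 - yf]+
    deviance <- function(yf) log(1 + exp(-yf))  # binomial deviance loss

    yf <- seq(-3, 3, length.out = 200)          # margin values yf(x)
    matplot(yf, cbind(hinge(yf), deviance(yf)), type = "l", lty = 1,
            xlab = "yf(x) (margin)", ylab = "Loss")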
Path algorithms for the SVM
• The two-class SVM classifier f(X) = α0 + ∑_{i=1}^N αi K(X, xi) yi can be seen to have a quadratic penalty and piecewise-linear loss. As the cost parameter C is varied, the Lagrange multipliers αi change piecewise-linearly.
• This allows the entire regularization path to be traced exactly. The active set is determined by the points exactly on the margin.
[Figure panels along the SVM path: "12 points, 6 per class, Separated" (Step 17, Error 0, Elbow Size 2, Loss 0); "Mixture Data − Radial Kernel Gamma=1.0" (Step 623, Error 13, Elbow Size 54, Loss 30.46); "Mixture Data − Radial Kernel Gamma=5" (Step 483, Error 1, Elbow Size 90, Loss 1.01)]
Classification and Regression Trees
✔ Can handle huge datasets
✔ Can handle mixed predictors—quantitative and qualitative
✔ Easily ignore redundant variables
✔ Handle missing data elegantly
✔ Small trees are easy to interpret
✖ Large trees are hard to interpret
✖ Often prediction performance is poor
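Trees of this kind can be grown in R with the rpart package; a sketch assuming a data frame spam_df with a factor response spam.

    library(rpart)
    fit <- rpart(spam ~ ., data = spam_df, method = "class")  # grow the tree
    printcp(fit)                       # cross-validated error by tree size
    pruned <- prune(fit, cp = 0.01)    # prune back to a small, readable tree
    plot(pruned); text(pruned)         # draw the tree with its split labels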
Tree fit to SPAM data
[Figure: classification tree fit to the SPAM data; internal nodes split on thresholds for ch$, remove, ch!, george, hp, CAPMAX, receive, edu, our, CAPAVE, free, business and 1999, with terminal nodes labeled spam or email]
Ensemble Methods and Boosting
Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
• Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
• Random Forests (Breiman, 1999): Improvements over bagging (see the sketch below).
• Boosting (Freund & Schapire, 1996): Fit many smallish trees to reweighted versions of the training data. Classify by weighted majority vote.
In general Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.
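Both bagging and random forests are available through the randomForest package; bagging is the special case in which all p inputs are candidates at every split. A sketch with placeholder data names:

    library(randomForest)
    p   <- ncol(train) - 1  # number of predictors (one column is the response)
    bag <- randomForest(class ~ ., data = train, mtry = p)  # bagging
    rf  <- randomForest(class ~ ., data = train)  # forest, default mtry = sqrt(p)
    rf            # printing the fit reports the out-of-bag error estimate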
[Figure: "Spam Data": test error versus number of trees, for Bagging, Random Forest and Gradient Boosting (5-node trees)]
[Figure: boosting schematic: classifiers C1(x), C2(x), C3(x), . . . , CM(x), each fit to the training sample or a successively re-weighted version of it]
Boosting
• Average many trees, each grown to re-weighted versions of the training data.
• Weighting decorrelates the trees, by focusing on regions missed by past trees.
• The final classifier is a weighted average of the classifiers:
C(x) = sign[∑_{m=1}^M αm Cm(x)]
Modern Gradient Boosting (Friedman, 2001)
• Fits an additive model
Fm(X) = T1(X) + T2(X) + T3(X) + . . . + Tm(X)
where each of the Tj(X) is a tree in X.
• Can be used for regression, logistic regression and more. For example, gradient boosting for regression works by repeatedly fitting trees to the residuals (a sketch in R follows below):
1. Fit a small tree T1(X) to Y.
2. Fit a small tree T2(X) to the residual Y − T1(X).
3. Fit a small tree T3(X) to the residual Y − T1(X) − T2(X), and so on.
• m is the tuning parameter, which must be chosen using a validation set (m too big will overfit).
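A bare-bones version of this recipe, sketched with rpart trees; x is a data frame of inputs, y a numeric response, and the shrinkage factor gamma anticipates the details that follow.

    library(rpart)
    gradient_boost <- function(x, y, m = 100, gamma = 0.1, depth = 2) {
      trees <- vector("list", m)
      resid <- y
      for (j in 1:m) {
        # fit a small tree to the current residuals
        trees[[j]] <- rpart(resid ~ ., data = x,
                            control = rpart.control(maxdepth = depth))
        # update the residuals, shrinking the new tree's contribution by gamma
        resid <- resid - gamma * predict(trees[[j]], newdata = x)
      }
      trees
    }

    # prediction sums the shrunken trees: F_m(x) = gamma * (T_1(x) + ... + T_m(x))
    boost_predict <- function(trees, newx, gamma = 0.1)
      gamma * Reduce(`+`, lapply(trees, predict, newdata = newx))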
Gradient Boosting - Details
• For a general loss function L[Y, Fm(X) + Tm+1(X)], fit a tree to the gradient ∂L/∂Fm rather than to the residual.
• Shrink the new contribution before adding it into the model: Fm(X) + γTm+1(X). This slows the forward stagewise algorithm, leading to improved performance.
• The tree depth determines the interaction order of the model.
• Boosting will eventually overfit; the number of terms m is a tuning parameter.
• As γ ↓ 0, the boosting path behaves like an ℓ1 regularization path in the space of trees.
[Figure: "Adaboost Stumps for Classification": test misclassification error versus iterations, for Adaboost stumps with and without shrinkage 0.1]
[Figure: "ROC curve for TREE, SVM and Boosting on SPAM data": sensitivity versus specificity; Boosting error 4.5%, SVM 6.7%, TREE 8.7%]
Boosting on SPAM
Boosting dominates all other methods on the SPAM data — 4.5% test error. We used 1000 trees (depth 6) with default settings for the gbm package in R. The ROC curve is obtained by varying the threshold of the classifier. Sensitivity: proportion of true spam identified. Specificity: proportion of true email identified.
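The threshold-varying ROC computation itself is a few lines of base R; p_hat (predicted spam probabilities) and truth (labels "spam"/"email") are placeholders.

    roc <- t(sapply(seq(0, 1, by = 0.01), function(thresh) {
      call_spam <- p_hat > thresh                   # classify at this threshold
      c(sens = mean(call_spam[truth == "spam"]),    # proportion of true spam caught
        spec = mean(!call_spam[truth == "email"]))  # proportion of true email kept
    }))
    plot(1 - roc[, "spec"], roc[, "sens"], type = "l",
         xlab = "1 - Specificity", ylab = "Sensitivity")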
Software
• R is free software for statistical modeling, graphics and a general programming environment. It works on PCs, Macs and Linux/Unix platforms. All the models here can be fit in R. R grew from its predecessor Splus, and both implement the S language developed at Bell Labs in the 80s.
• SAS and their Enterprise Miner can fit most of the models mentioned in this talk, with good data-handling capabilities and high-end user interfaces.
• Salford Systems has commercial versions of trees, random forests and gradient boosting.
• SVM software is all over, but beware of patent infringements if put to commercial use.
• There are many free versions of neural network software; Google will find them.