Modern Trends in Data Mining
Trevor Hastie, Stanford University
February 2009.
Data Mining for Prediction
• We have a collection of data pertaining to our business, industry, production process, monitoring device, etc.
• Often the goals of data mining are vague, such as "look for patterns in the data" — not too helpful.
• In many cases a "response" or "outcome" can be identified as a good and useful target for prediction.
• Accurate prediction of this target can help the company make better decisions, and save a lot of money.
• Data mining is particularly good at building such prediction models — an area known as "supervised learning".
Example: Credit Risk Assessment
• Customers apply to a bank for a loan or credit card.
• They supply the bank with information such as age, income, employment history, education, bank accounts, existing debts, etc.
• The bank does further background checks to establish the credit history of the customer.
• Based on this information, the bank must decide whether to make the loan or issue the credit card.
Example continued: Credit Risk Assessment
• The bank has a large database of existing and past customers. Some of these defaulted on loans, others frequently made late payments, etc. An outcome variable "Status" is defined, taking the value "good" or "default". Each of the past customers is scored with a value for Status.
• Background information is available for all the past customers.
• Using supervised learning techniques, we can build a risk prediction model that takes as input the background information, and outputs a risk estimate (probability of default) for a prospective customer.
The California-based company Fair-Isaac uses a generalized additive model + boosting methods in the construction of their credit risk scores.
Example: Churn Prediction
• When a customer switches to another provider, we call this "churn". Examples are cell-phone service and credit-card providers.
• Based on customer information and usage patterns, we can predict
– the probability of churn
– the retention probability (as a function of time)
• This information can be used to evaluate
– prospective customers, to decide on acceptance
– present customers, to decide on an intervention strategy
Risk assessment and survival models are used by US cell-phone companies such as AT&T to manage churn.
Netflix Challenge
Netflix users rate movies from 1 to 5. Based on a history of ratings, predict the rating a viewer will give to a new movie.
• Training data: a sparse 400K (users) by 18K (movies) rating matrix, with 98.7% of the entries missing. About 100M movie/rater pairs.
• Quiz set of about 1.4M movie/viewer pairs, for which predictions of ratings are required (Netflix has held them back).
• Probe set of about 1.4 million movie/rater pairs, similar in composition to the quiz set, for which the ratings are known.
Grand Prize: one million dollars, if you beat Netflix's RMSE by 10%. After ≈ 2.5 years, the leaders are at 9.63%; 37,000 teams!
The Supervised Learning Problem
Starting point:
• Outcome measurement Y (also called the dependent variable, response, target, output).
• Vector of p predictor measurements X (also called inputs, regressors, covariates, features, independent variables).
• In the regression problem, Y is quantitative (e.g. price, blood pressure, rating).
• In classification, Y takes values in a finite, unordered set (default yes/no, churn/retain, spam/email).
• We have training data (x1, y1), . . . , (xN, yN). These are observations (examples, instances) of these measurements.
Objectives
On the basis of the training data we would like to:
• Accurately predict unseen test cases for which we know X but do not know Y.
• In the case of classification, predict the probability of an outcome.
• Understand which inputs affect the outcome, and how.
• Assess the quality of our predictions and inferences.
More Examples
• Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements.
• Determine whether an incoming email is "spam", based on frequencies of key words in the message.
• Identify the numbers in a handwritten zip code, from a digitized image.
• Estimate the probability that an insurance claim is fraudulent, based on client demographics, client history, and the amount and nature of the claim.
• Predict the type of cancer in a tissue sample using DNA expression values.
Email or Spam?
• Data from 4601 emails sent to an individual (named George, at HP Labs, before 2000). Each is labeled as "spam" or "email".
• Goal: build a customized spam filter.
• Input features: relative frequencies of 57 of the most commonly occurring words and punctuation marks in these email messages.
        george   you    hp  free     !   edu  remove
spam      0.00  2.26  0.02  0.52  0.51  0.01    0.28
email     1.27  1.27  0.90  0.07  0.11  0.29    0.01
Average percentage of words or characters in an email message equal to the indicated word or character. We have chosen the words and characters showing the largest difference between spam and email.
Handwritten Digit Identification
A sample of segmented and normalized handwritten digits, scanned from zip codes on envelopes. Each image has 16 × 16 pixels of greyscale values ranging from 0 to 255.
Microarray Cancer Data
Expression matrix of 6830 genes (rows) and 64 samples (columns), for the human tumor data (100 randomly chosen rows shown). The display is a heat map, ranging from bright green (underexpressed) to bright red (overexpressed). Goal: predict cancer class based on expression values.
Shameless self-promotion
All of the topics in this lecture are covered in the 2009 second edition of our 2001 book. The book blends traditional linear methods with contemporary nonparametric methods, and many in between the two.
Ideal Predictions
• For a quantitative output Y, the best prediction we can make when the input vector X = x is
f(x) = Ave(Y |X = x)
– This is the conditional expectation — deliver the Y-average of all those examples having X = x.
– This is best if we measure errors by the average squared error Ave[(Y − f(X))²].
• For a qualitative output Y taking values 1, 2, . . . , M, compute
– Pr(Y = m|X = x) for each value of m. This is the conditional probability of class m at X = x.
– Classify C(x) = j if Pr(Y = j|X = x) is the largest — the majority vote classifier.
Implementation with Training Data
The ideal prediction formulas suggest a data implementation. To predict at X = x, gather all the training pairs (xi, yi) having xi = x, then:
• For regression, use the mean of their yi to estimate f(x) = Ave(Y |X = x).
• For classification, compute the relative proportions of each class among these yi, to estimate Pr(Y = m|X = x); classify the new observation by majority vote.
Problem: in the training data, there may be NO observations having xi = x.
Nearest Neighbor Averaging
• Estimate Ave(Y |X = x) by averaging those yi whose xi are in a neighborhood of x.
• E.g. define the neighborhood to be the set of k observations having values xi closest to x in Euclidean distance ||xi − x||.
• For classification, compute the class proportions among these k closest points (a short R sketch follows below).
• Nearest-neighbor methods often outperform all other methods — about one in three times — especially for classification.
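As a concrete illustration, here is a minimal base-R sketch of nearest-neighbor averaging; the training matrix x, response y, query point x0 and neighborhood size k are placeholder names.

    # k-nearest-neighbor estimate of Ave(Y | X = x0).
    # x: N x p matrix of training inputs; y: length-N response vector.
    knn_average <- function(x, y, x0, k = 10) {
      d <- sqrt(rowSums(sweep(x, 2, x0)^2))  # Euclidean distances ||xi - x0||
      nbrs <- order(d)[1:k]                  # the k closest training points
      mean(y[nbrs])                          # average their y's
    }

    # For classification, return the class proportions among the k neighbors:
    knn_proportions <- function(x, y, x0, k = 10) {
      d <- sqrt(rowSums(sweep(x, 2, x0)^2))
      table(y[order(d)[1:k]]) / k
    }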
[Figure: scattered data points illustrating kernel smoothing at a target point x]
Kernel smoothing
• A smooth version of nearest-neighbor averaging.
• At each point x, the function f(x) = Ave(Y |X = x) is estimated by the weighted average of the y's.
• The weights die down smoothly with distance from the target point x (indicated by the shaded orange region in the figure).
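The same idea in code: a minimal Nadaraya-Watson sketch in base R for one-dimensional inputs, with placeholder names x, y, x0 and bandwidth h.

    # Gaussian-kernel smoother: weights die down smoothly away from x0.
    kernel_smooth <- function(x, y, x0, h = 0.1) {
      w <- dnorm((x - x0) / h)   # smooth weights for each training point
      sum(w * y) / sum(w)        # weighted average estimating Ave(Y | X = x0)
    }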
Structured Models
• When we have a lot of predictor variables, NN methods often fail because of the curse of dimensionality:
It is hard to find nearby points in high dimensions!
• Near-neighbor models offer little interpretation.
• We can overcome these problems by assuming some structure for the regression function Ave(Y |X = x) or the probability function Pr(Y = k|X = x). Typical structural assumptions:
– Linear Models
– Additive Models
– Low-order interaction models
– Restrict attention to a subset of predictors
– . . . and many more
Linear Models
• Linear models assume
Ave(Y |X = x) = β0 + β1x1 + β2x2 + . . . + βpxp
• For two-class classification problems, linear logistic regression has the form
log[ Pr(Y = +1|X = x) / Pr(Y = −1|X = x) ] = β0 + β1x1 + β2x2 + . . . + βpxp
• This translates to
Pr(Y = +1|X = x) = e^(β0+β1x1+...+βpxp) / (1 + e^(β0+β1x1+...+βpxp))
Chapters 3 and 4 of the book deal with linear models.
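In R, such a linear logistic regression can be fit with glm(); a sketch assuming a data frame train with a two-level factor response status, and a similar data frame test.

    fit <- glm(status ~ ., data = train, family = binomial)  # logistic regression
    summary(fit)   # the estimated coefficients beta_0, beta_1, ..., beta_p

    # predicted probabilities Pr(Y = +1 | X = x) for new cases:
    p_hat <- predict(fit, newdata = test, type = "response")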
Linear Model Complexity Control
With many inputs, linear regression can overfit the training data, leading to poor predictions on future data. Two general remedies are available:
• Variable selection: reduce the number of inputs in the model. For example, stepwise selection or best-subset selection.
• Regularization: leave all the variables in the model, but when fitting the model, restrict their coefficients.
– Ridge: ∑_{j=1}^p βj² ≤ s. All the coefficients are non-zero, but are shrunk toward zero (and each other).
– Lasso: ∑_{j=1}^p |βj| ≤ s. Some coefficients drop out of the model; others are shrunk toward zero.
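Both penalties are available in R through the glmnet package (assumed installed); note that glmnet expresses the constraint through an equivalent penalty parameter lambda rather than the bound s.

    library(glmnet)
    # x: numeric predictor matrix; y: response vector (placeholder names)
    ridge <- glmnet(x, y, alpha = 0)   # alpha = 0 gives the ridge penalty
    lasso <- glmnet(x, y, alpha = 1)   # alpha = 1 gives the lasso penalty
    plot(lasso)                        # coefficient paths over the whole sequence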
Best Subset Selection
[Figure: residual sum-of-squares on the training data versus subset size s, for s = 0 through 8]
Each point corresponds to a linear model involving a subset of the variables, and shows the residual sum-of-squares on the training data. The red models are the candidates, and we need to choose s.
[Figure: "Ridge" and "Lasso" coefficient paths β̂(s) versus shrinkage factor s, for the inputs lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45]
Both the ridge and lasso coefficient paths can be computed very efficiently for all values of s.
Overfitting and Model Assessment
• In all the cases above, the larger s is, the better we will fit the training data. Often we overfit the training data.
• Overfit models can perform poorly on test data (high variance).
• Underfit models can perform poorly on test data (high bias).
Model assessment aims to
1. Choose a value for a tuning parameter s for a technique.
2. Estimate the future prediction ability of the chosen model.
• For both of these purposes, the best approach is to evaluate the procedure on an independent test set, if one is available.
• If possible one should use different test data for (1) and (2) above: a validation set for (1) and a test set for (2).
K-Fold Cross-Validation
Primarily a method for estimating a tuning parameter s when data is scarce; we illustrate for the regularized linear regression models.
• Divide the data into K roughly equal parts (5 or 10).
[Diagram: the data split into 5 parts, labeled 1 to 5; one part is held out as "Test" and the remaining parts are "Train"]
• For each k = 1, 2, . . . , K, fit the model with parameter s to the other K − 1 parts, giving β̂^{−k}(s), and compute its error in predicting the kth part:
Ek(s) = ∑_{i ∈ kth part} (yi − xi β̂^{−k}(s))²
• This gives the overall cross-validation error
CV(s) = (1/K) ∑_{k=1}^K Ek(s)
• Do this for many values of s, and choose the value of s that makes CV(s) smallest; a bare-bones version of this loop is sketched below.
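A bare-bones version of this recipe for the lasso, sketched with the glmnet package; the matrix x, response y and the grid of candidate values are placeholders.

    library(glmnet)
    K      <- 10
    folds  <- sample(rep(1:K, length.out = nrow(x)))  # random fold labels
    s_grid <- 10^seq(0, -3, length.out = 50)          # candidate penalty values
    cv_err <- matrix(NA, K, length(s_grid))

    for (k in 1:K) {
      fit  <- glmnet(x[folds != k, ], y[folds != k], alpha = 1) # fit on K-1 parts
      pred <- predict(fit, newx = x[folds == k, ], s = s_grid)  # predict kth part
      cv_err[k, ] <- colMeans((y[folds == k] - pred)^2)         # E_k(s), as MSE
    }
    cv     <- colMeans(cv_err)        # CV(s), averaged over the K folds
    s_best <- s_grid[which.min(cv)]   # the value of s minimizing CV(s)

In practice cv.glmnet(x, y) automates exactly this loop.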
Cross-Validation Error Curve
[Figure: CV error curve versus tuning parameter s]
• 10-fold CV error curve using the lasso on some diabetes data (64 inputs, 442 samples).
• The thick curve is the CV error curve.
• The shaded region indicates the standard error of the CV estimate.
• The curve shows the effect of overfitting — errors start to increase above s = 0.2.
• This shows a trade-off between bias and variance.
Modern Structured Models in Data Mining
The following is a list of some of the more important and currently popular prediction models in data mining.
• Linear Models (often heavily regularized)
• Generalized Additive Models
• Neural Networks
• Trees, Random Forests and Boosted Tree Models — hot!
• Support Vector and Kernel Machines — hot!
Generalized Additive Models
Allow a compromise between linear models and more flexible local models (kernel estimates) when there are many inputs X = (X1, X2, . . . , Xp).
• Additive models for regression:
Ave(Y |X = x) = α0 + f1(x1) + f2(x2) + . . . + fp(xp).
• Additive models for classification:
log[ Pr(Y = +1|X = x) / Pr(Y = −1|X = x) ] = α0 + f1(x1) + f2(x2) + . . . + fp(xp).
Each of the functions fj(xj) (one for each input variable) can be a smooth function (à la kernel estimates), linear, or omitted.
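Such a model can be fit in R with, for example, the gam package; a sketch on hypothetical SPAM-style data, where the data frame spam_df and the chosen predictors are placeholders.

    library(gam)
    # smooth terms for three inputs, a linear term for a fourth, the rest omitted
    fit <- gam(spam ~ s(free) + s(remove) + s(george) + edu,
               family = binomial, data = spam_df)
    plot(fit, se = TRUE)   # draw each fitted f_j with standard-error bands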
[Figure: the fitted functions f̂ for 16 SPAM predictors: our, over, remove, internet, free, business, hp, hpl, george, 1999, re, edu, ch!, ch$, CAPMAX and CAPTOT]
GAM fit to SPAM data
• Shown are the most important predictors.
• Many show nonlinear behavior.
• Overall error rate 5.3%.
• Functions can be re-parametrized (e.g. log terms, quadratics, step functions), and then fit by a linear model.
• Produces a prediction per email, Pr(SPAM|X = x).
Neural Networks
[Figure: "Single (Hidden) Layer Perceptron": input layer X1, X2, X3; hidden layer Z1, . . . , Z4; output layer Y1, Y2]
• Like a complex regression or logistic regression model — more flexible, but less interpretable — a "black box".
• Hidden units Z1, Z2, . . . , Zm (4 here): Zj = σ(α0j + αjᵀX), where σ(z) = e^z / (1 + e^z) is the logistic sigmoid activation function.
• The output is a linear regression or logistic regression model in the Zj.
• Complexity is controlled by m, ridge regularization, and early stopping of the backpropagation algorithm used to fit the network (a sketch follows below).
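A network of this kind can be fit in R with the nnet package; a sketch with placeholder data names, where size is m and decay is the ridge penalty on the weights.

    library(nnet)
    fit <- nnet(class ~ ., data = train,
                size = 4,      # m = 4 hidden units
                decay = 0.01,  # ridge regularization of the weights
                maxit = 500)   # cap on the optimization iterations
    p_hat <- predict(fit, newdata = test, type = "raw")  # fitted probabilities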
Support Vector Machines
[Figure: two separable classes of points, with the decision boundary and the margin on either side]
• Maximize the gap (margin) between the two classes on the training data.
• If not separable:
– enlarge the feature space via basis expansions (e.g. polynomials).
– use a "soft" margin (allow limited overlap).
• The solution depends on a small number of points ("support vectors") — 3 in the figure.
Support Vector Machines
[Figure: two overlapping classes with a soft-margin decision boundary xᵀβ + β0 = 0; the overlaps are labeled ξ1, . . . , ξ5]
• Maximize the soft margin subject to a bound on the total overlap: ∑i ξi ≤ B.
• With yi ∈ {−1, +1}, this becomes a convex optimization problem:
min ‖β‖ subject to yi(xiᵀβ + β0) ≥ 1 − ξi for all i, and ∑_{i=1}^N ξi ≤ B.
Properties of SVMs
• Primarily used for classification problems. Builds a linear classifier f(X) = β0 + β1X1 + β2X2 + . . . + βpXp.
If f(X) > 0, classify as +1; else if f(X) < 0, classify as −1.
• Generalizations use kernels ("radial basis functions"):
f(X) = α0 + ∑_{i=1}^N αi K(X, xi)
– K is a symmetric function, e.g. K(X, xi) = e^(−γ‖X − xi‖²), and each xi is one of the samples (vectors).
– Many of the αi = 0; the rest are "support points".
• Extensions to regression, logistic regression, PCA, . . .
• Well-developed mathematics — function estimation in Reproducing Kernel Hilbert Spaces.
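In R such machines can be fit with the e1071 package (a wrapper for libsvm); a sketch with placeholder data names.

    library(e1071)
    fit <- svm(class ~ ., data = train,
               kernel = "radial",    # K(X, xi) = exp(-gamma * ||X - xi||^2)
               gamma = 1, cost = 1)  # cost controls the allowed overlap
    pred <- predict(fit, newdata = test)
    fit$tot.nSV  # the number of support points (nonzero alpha_i)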
SVM via Loss + Penalty
[Figure: the binomial log-likelihood and support-vector (hinge) losses, plotted against the margin yf(x)]
With f(x) = xᵀβ + β0 and yi ∈ {−1, 1}, consider
min over (β0, β) of ∑_{i=1}^N [1 − yi f(xi)]₊ + (λ/2)‖β‖²
This hinge loss criterion is equivalent to the SVM, with λ monotone in B. Compare with
min over (β0, β) of ∑_{i=1}^N log[1 + e^(−yi f(xi))] + (λ/2)‖β‖²
This is the binomial deviance loss, and the solution is "ridged" linear logistic regression.
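The two criteria differ only in the loss; a few lines of base R reproduce the comparison in the figure above.

    hinge    <- function(yf) pmax(1 - yf, 0)    # SVM hinge loss [1 - yf]+
    deviance <- function(yf) log(1 + exp(-yf))  # binomial deviance loss

    yf <- seq(-3, 3, length.out = 200)          # margin values yf(x)
    matplot(yf, cbind(hinge(yf), deviance(yf)), type = "l", lty = 1,
            xlab = "yf(x) (margin)", ylab = "Loss")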
Path algorithms for the SVM
• The two-class SVM classifier f(X) = α0 + ∑_{i=1}^N αi K(X, xi) yi can be seen to have a quadratic penalty and piecewise-linear loss. As the cost parameter C is varied, the Lagrange multipliers αi change piecewise-linearly.
• This allows the entire regularization path to be traced exactly. The active set is determined by the points exactly on the margin.
[Figure panels along the SVM path: "12 points, 6 per class, Separated" (Step 17, Error 0, Elbow Size 2, Loss 0); "Mixture Data − Radial Kernel Gamma=1.0" (Step 623, Error 13, Elbow Size 54, Loss 30.46); "Mixture Data − Radial Kernel Gamma=5" (Step 483, Error 1, Elbow Size 90, Loss 1.01)]
Classification and Regression Trees
✔ Can handle huge datasets
✔ Can handle mixed predictors—quantitative and qualitative
✔ Easily ignore redundant variables
✔ Handle missing data elegantly
✔ Small trees are easy to interpret
✖ Large trees are hard to interpret
✖ Often prediction performance is poor
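Trees of this kind can be grown in R with the rpart package; a sketch assuming a data frame spam_df with a factor response spam.

    library(rpart)
    fit <- rpart(spam ~ ., data = spam_df, method = "class")  # grow the tree
    printcp(fit)                       # cross-validated error by tree size
    pruned <- prune(fit, cp = 0.01)    # prune back to a small, readable tree
    plot(pruned); text(pruned)         # draw the tree with its split labels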
Tree fit to SPAM data
[Figure: classification tree fit to the SPAM data; internal nodes split on thresholds for ch$, remove, ch!, george, hp, CAPMAX, receive, edu, our, CAPAVE, free, business and 1999, with terminal nodes labeled spam or email]
Ensemble Methods and Boosting
Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
• Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
• Random Forests (Breiman, 1999): Improvements over bagging (see the sketch below).
• Boosting (Freund & Schapire, 1996): Fit many smallish trees to reweighted versions of the training data. Classify by weighted majority vote.
In general Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.
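Both bagging and random forests are available through the randomForest package; bagging is the special case in which all p inputs are candidates at every split. A sketch with placeholder data names:

    library(randomForest)
    p   <- ncol(train) - 1  # number of predictors (one column is the response)
    bag <- randomForest(class ~ ., data = train, mtry = p)  # bagging
    rf  <- randomForest(class ~ ., data = train)  # forest, default mtry = sqrt(p)
    rf            # printing the fit reports the out-of-bag error estimate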
[Figure: "Spam Data": test error versus number of trees, for Bagging, Random Forest and Gradient Boosting (5-node trees)]
[Figure: boosting schematic: classifiers C1(x), C2(x), C3(x), . . . , CM(x), each fit to the training sample or a successively re-weighted version of it]
Boosting
• Average many trees, each grown to re-weighted versions of the training data.
• Weighting decorrelates the trees, by focusing on regions missed by past trees.
• The final classifier is a weighted average of the classifiers:
C(x) = sign[∑_{m=1}^M αm Cm(x)]
Modern Gradient Boosting (Friedman, 2001)
• Fits an additive model
Fm(X) = T1(X) + T2(X) + T3(X) + . . . + Tm(X)
where each of the Tj(X) is a tree in X.
• Can be used for regression, logistic regression and more. For example, gradient boosting for regression works by repeatedly fitting trees to the residuals (a sketch in R follows below):
1. Fit a small tree T1(X) to Y.
2. Fit a small tree T2(X) to the residual Y − T1(X).
3. Fit a small tree T3(X) to the residual Y − T1(X) − T2(X), and so on.
• m is the tuning parameter, which must be chosen using a validation set (m too big will overfit).
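A bare-bones version of this recipe, sketched with rpart trees; x is a data frame of inputs, y a numeric response, and the shrinkage factor gamma anticipates the details that follow.

    library(rpart)
    gradient_boost <- function(x, y, m = 100, gamma = 0.1, depth = 2) {
      trees <- vector("list", m)
      resid <- y
      for (j in 1:m) {
        # fit a small tree to the current residuals
        trees[[j]] <- rpart(resid ~ ., data = x,
                            control = rpart.control(maxdepth = depth))
        # update the residuals, shrinking the new tree's contribution by gamma
        resid <- resid - gamma * predict(trees[[j]], newdata = x)
      }
      trees
    }

    # prediction sums the shrunken trees: F_m(x) = gamma * (T_1(x) + ... + T_m(x))
    boost_predict <- function(trees, newx, gamma = 0.1)
      gamma * Reduce(`+`, lapply(trees, predict, newdata = newx))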
Gradient Boosting - Details
• For a general loss function L[Y, Fm(X) + Tm+1(X)], fit a tree to the gradient ∂L/∂Fm rather than to the residual.
• Shrink the new contribution before adding it into the model: Fm(X) + γTm+1(X). This slows the forward stagewise algorithm, leading to improved performance.
• The tree depth determines the interaction order of the model.
• Boosting will eventually overfit; the number of terms m is a tuning parameter.
• As γ ↓ 0, the boosting path behaves like an ℓ1 regularization path in the space of trees.
[Figure: "Adaboost Stumps for Classification": test misclassification error versus iterations, for Adaboost stumps with and without shrinkage 0.1]
[Figure: "ROC curve for TREE, SVM and Boosting on SPAM data": sensitivity versus specificity; Boosting error 4.5%, SVM 6.7%, TREE 8.7%]
Boosting on SPAM
Boosting dominates all other methods on the SPAM data — 4.5% test error. We used 1000 trees (depth 6) with default settings for the gbm package in R. The ROC curve is obtained by varying the threshold of the classifier. Sensitivity: proportion of true spam identified. Specificity: proportion of true email identified.
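The threshold-varying ROC computation itself is a few lines of base R; p_hat (predicted spam probabilities) and truth (labels "spam"/"email") are placeholders.

    roc <- t(sapply(seq(0, 1, by = 0.01), function(thresh) {
      call_spam <- p_hat > thresh                   # classify at this threshold
      c(sens = mean(call_spam[truth == "spam"]),    # proportion of true spam caught
        spec = mean(!call_spam[truth == "email"]))  # proportion of true email kept
    }))
    plot(1 - roc[, "spec"], roc[, "sens"], type = "l",
         xlab = "1 - Specificity", ylab = "Sensitivity")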
Software
• R is free software for statistical modeling, graphics and a general programming environment. It works on PCs, Macs and Linux/Unix platforms. All the models here can be fit in R. R grew from its predecessor Splus, and both implement the S language developed at Bell Labs in the 80s.
• SAS and their Enterprise Miner can fit most of the models mentioned in this talk, with good data-handling capabilities and high-end user interfaces.
• Salford Systems has commercial versions of trees, random forests and gradient boosting.
• SVM software is all over, but beware of patent infringements if put to commercial use.
• There are many free versions of neural network software; Google will find them.