8/11/2019 Notes_7 IE
1/134
Nonlinear Regression
Reference reading:
pp. 27-33 and 564-567 for MLE
Ch. 13. Except you can just skim 13.2 (optimization algorithms) and 13.6 (neural networks)
pp. 453-457 for regression trees
pp. 458-464, 529-536, and Lab 5 for bootstrapping
Maximum Likelihood Estimation (MLE)
Given a parametric model for a set of data, in general, how do you devise a good way to estimate the parameters? i.e., what criterion do you optimize?
MLE is a very general principle on which many parametric model estimators are derived
Many familiar estimators from a first course in statistics turn out to be MLEs
The model fitting criteria in linear and logistic regression can both be derived as applications of MLE; likewise for many supervised learning models
When a researcher proposes a new model for a problem, they usually start with the MLE principle to fit the model
In statistical modeling software, choosing a method of model fitting is often related to choosing a statistical model for which the method of fitting is the corresponding MLE
The MLE Principle
Suppose you have some parametric model to represent your data, with parameters denoted by θ = {θ1, θ2, . . ., θp}, and you want to fit the model (i.e., estimate the parameters) based on a random sample of data Y = {y1, y2, . . ., yn}.
Denote the joint distribution of the data by f(y1, y2, . . ., yn; θ1, θ2, . . ., θp), or f(Y; θ) for short. We call f(Y; θ):
the prob. distribution, when viewed as a function of Y, for fixed values of θ, or
the likelihood function, when viewed as a function of θ for the fixed values of Y in your actual data sample.
Basic MLE Principle: Take the estimates of θ to be the values that maximize the likelihood function f(Y; θ). We call these values the MLE of θ.
Example: Estimating μ and σ for a Normal Pop.
data: Y = {y1, y2, . . ., yn} (suppose i.i.d. sample)
model: Yi ~ NID(μ, σ²)
parameters: θ = {μ, σ} (p = 2)
marginal pdf of Yi: f(yi; μ, σ) = (2πσ²)^(−1/2) exp[ −(yi − μ)² / (2σ²) ]
joint pdf of Y1, . . ., Yn (aka likelihood function):
f(Y; μ, σ) = (2π)^(−n/2) σ^(−n) exp[ −(1/(2σ²)) Σ_{i=1}^n (yi − μ)² ]
MLEs of μ and σ are the values that maximize f(Y; μ, σ)
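For this normal example the MLEs have closed forms (sample mean, and the 1/n-divisor variance). A minimal sketch, in Python rather than the slides' R, checking that the closed-form estimates do maximize the log-likelihood on simulated data:

```python
import math
import random

def normal_mle(y):
    # closed-form MLEs: mu_hat = sample mean; sigma2_hat uses 1/n, not 1/(n-1)
    n = len(y)
    mu_hat = sum(y) / n
    sigma2_hat = sum((yi - mu_hat) ** 2 for yi in y) / n
    return mu_hat, math.sqrt(sigma2_hat)

def log_likelihood(y, mu, sigma):
    # log of the joint normal pdf f(Y; mu, sigma) from the slide
    n = len(y)
    return (-n / 2 * math.log(2 * math.pi) - n * math.log(sigma)
            - sum((yi - mu) ** 2 for yi in y) / (2 * sigma ** 2))

random.seed(1)
y = [random.gauss(5.3, 0.4) for _ in range(200)]
mu_hat, sigma_hat = normal_mle(y)
# nudging either parameter away from the MLE can only lower the likelihood
assert log_likelihood(y, mu_hat, sigma_hat) >= log_likelihood(y, mu_hat + 0.01, sigma_hat)
assert log_likelihood(y, mu_hat, sigma_hat) >= log_likelihood(y, mu_hat, sigma_hat + 0.01)
```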
Example: Estimating the Coefficients in Logistic Regression
data: Y = {y1, y2, . . ., yn} (suppose i.i.d. sample)
model: for i = 1, 2, . . ., n, Yi ~ Bernoulli with
pi = Pr(Yi = 1 | xi) = exp(β0 + β1 xi1 + . . . + βk xik) / [1 + exp(β0 + β1 xi1 + . . . + βk xik)] = exp(xiᵀβ) / [1 + exp(xiᵀβ)]
where xi = [1, xi1, . . ., xik]ᵀ (k predictor variables)
parameters: β = [β0, β1, . . ., βk]ᵀ (p = k+1)
marginal distribution of Yi: f(yi; β) = pi^yi (1 − pi)^(1−yi), i.e., pi if yi = 1 and 1 − pi if yi = 0
joint distribution of Y1, . . ., Yn:
f(Y; β) = Π_{i=1}^n pi^yi (1 − pi)^(1−yi) = Π_{i=1}^n exp(yi xiᵀβ) / [1 + exp(xiᵀβ)]
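The Bernoulli likelihood above can be maximized numerically. A hedged sketch (Python for illustration; toy data and the plain gradient-ascent updates are my own, not from the slides) of fitting a one-predictor logistic model by maximizing the log-likelihood:

```python
import math

def log_lik(b0, b1, x, y):
    # sum_i [ y_i*log(p_i) + (1-y_i)*log(1-p_i) ] with logistic p_i
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def fit_logistic(x, y, lr=0.05, steps=5000):
    # simple gradient ascent on the log-likelihood (real software uses
    # Newton-type iteratively reweighted least squares)
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p          # d logL / d b0
            g1 += (yi - p) * xi   # d logL / d b1
        b0 += lr * g0 / len(x)
        b1 += lr * g1 / len(x)
    return b0, b1

x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(x, y)
assert b1 > 0  # Pr(Y=1) increases with x in this toy data
```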
[Figure: scatterplot of income vs. car_age, points coded by the binary response y = 0/1]
Nonlinear Regression Models and Nonlinear Least Squares
A general form of nonlinear regression model is Yi = g(xi, θ) + εi, where:
Yi: response for observation i
xi: vector of predictors for observation i
θ: vector of model parameters
g(xi, θ): some parametric nonlinear function
εi: zero-mean random error for observation i
We will see shortly that if the random errors are Gaussian and independent of x, the MLE of θ is just nonlinear least squares
Example: Manufacturing Learning Curve
Y = relative efficiency of operation
x1 = facility indicator: 0 for the A facility (older), 1 for the B facility (modern)
x2 = week #
If there were only one facility, and the data looked like below, how would you model it?
[Figure: scatterplot of y vs. x2, with y rising toward 1.0]
Discussion Points and Questions
If facilities A and B had different asymptotic efficiencies as in Fig. 13.5, how would you modify the model?
If facilities A and B had different exponential rates, how would you modify the model?
If the objective was to determine if the two facilities had different asymptotic efficiencies, how could you do this?
Are the formulae for t-tests, standard errors, etc. in a linear regression still valid? If not, how would you calculate and use the analogous quantities in nonlinear regression?
MLE for General Nonlinear Regression Model with Normal Errors
Yi = g(xi, θ) + εi with error distribution: εi ~ NID(0, σ²)
view the xi's as deterministic, not random
write Yi = μi + εi with μi ≡ g(xi, θ) (to simplify notation), so Yi ~ NID(μi, σ²)
marginal pdf of Yi: f(yi; θ, σ) = (2πσ²)^(−1/2) exp[ −(yi − μi)² / (2σ²) ]
joint pdf of Y1, . . ., Yn (aka likelihood function):
f(Y; θ, σ) = (2π)^(−n/2) σ^(−n) exp[ −(1/(2σ²)) Σ_{i=1}^n (yi − μi)² ]
MLE of θ: Choose θ to maximize f(Y; θ, σ)
i.e., minimize Σ_{i=1}^n (yi − μi)² = Σ_{i=1}^n (yi − g(xi, θ))²
i.e., the MLE of θ for the general nonlinear regression model with i.i.d. Gaussian errors (that are independent of x) is "nonlinear least squares"
In general, we need optimization software to fit the model
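A minimal sketch of nonlinear least squares (Python rather than the slides' R; the one-parameter exponential model and the crude grid-search optimizer are illustrative stand-ins for the optimizers real software uses):

```python
import math

def sse(theta, x, y):
    # SSE(theta) = sum_i (y_i - g(x_i, theta))^2 for g(x, theta) = 1 - exp(theta*x)
    return sum((yi - (1 - math.exp(theta * xi))) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 4, 8, 16]
true_theta = -0.3
y = [1 - math.exp(true_theta * xi) for xi in x]  # noise-free for clarity

# crude 1-D optimizer: grid search over theta in (-1, 0)
# (real software uses Gauss-Newton / quasi-Newton methods instead)
grid = [-1.0 + 0.001 * j for j in range(1000)]
theta_hat = min(grid, key=lambda t: sse(t, x, y))
assert abs(theta_hat - true_theta) < 0.01
```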
Summary of Steps in General MLE
1) Write out the form of the statistical model that you are using to represent the data
2) Find the marginal distribution of each individual observation Yi (for regression problems the xi's are not treated as random, so you only need to find the marginal distribution of the Yi's given the xi's)
3) From the marginal distributions in step (2), find the joint distribution f(Y; θ) of the entire set of data Y
4) If tractable, find an analytical expression for the θ that maximizes the likelihood f(Y; θ). Otherwise, use optimization software to minimize −log f(Y; θ)
5) The MLE of θ is the minimizer in step (4), and the Hessian can be used to assess statistical uncertainty (next topic)
Relevant R Functions and Packages
nlm(): minimize a general nonlinear function, such as implementing MLE for a nonstandard model (but most specific statistical models in R have built-in MLE implementation)
nls(): nonlinear least squares
boot: bootstrapping package
cross-validation is built into many R modeling functions (as an optional argument or as a separate function like cv.tree or cv.glm), or it is not hard to write your own function
R commands for fitting learning curve example using the general optimizer nlm()
MLC
R commands for fitting learning curve example using the nonlinear LS function nls()
MLC
Statistical Uncertainty in Supervised Learning
With nonlinear regression models, the formulae for assessing statistical uncertainty in linear regression (e.g., F-tests and t-tests for significance of predictors, SEs and CIs for parameters, PIs and CIs for new observations, etc.) do not apply directly
Question: Why might we want to calculate SEs, CIs/PIs, do hypothesis tests, etc.?
For some nonlinear models, we can use approximate asymptotic analytical results (valid for sufficiently large sample size n) to assess statistical uncertainty
Fortunately, we have alternative computational approaches that apply to any nonlinear model:
Cross-validation for deciding which models are the best (which implies which terms belong in the model, among other things)
Bootstrap resampling (or bootstrapping for short) for SEs and CIs on the parameters and CIs and PIs on new observations
Overview of Bootstrapping
You are given a sample of data of size n observations.
You have estimated some parameter(s) θ (call the estimate θ̂)
Objective: Estimate the sampling distribution of θ̂ and quantities like SE(θ̂) that are derived from it.
Problem: Hypothetically, if we knew the entire population, we could consider using simulation to draw many random samples (each of size n) from the population and calculate a different θ̂ for each sample. We could construct a histogram of all the θ̂'s and take their sample standard deviation to be an estimate of SE(θ̂) for the single real sample. The problem is we only have the single sample and not the entire population.
Example: How you could use regular simulation to find the SE of a sample average, if you know the underlying distribution (for example, normal)
Generate say 10,000 samples, each of size n = 20, from an N(5.3, 0.4²) distribution
Calculate the averages { ȳ_sim^(j) : j = 1, 2, . . ., 10,000 } for the 10,000 replicates
Take SE(ȳ) ≈ sqrt[ (1/10,000) Σ_{j=1}^{10,000} ( ȳ_sim^(j) − ȳ̄_sim )² ], where ȳ̄_sim is the average of the 10,000 replicate averages
[Table: the original sample y (n = 20, average 5.30, SD 0.40) alongside four of the simulated samples y(1)-y(4), with averages 5.30, 5.21, 5.25, 5.32]
Example: How you could use bootstrapping to find the SE of a sample average, if you do NOT know the underlying distribution
Generate say 10,000 bootstrap samples, each of size n = 20, from your one real sample
Calculate the averages { ȳ^(b) : b = 1, 2, . . ., 10,000 } for the 10,000 replicates
Take SE(ȳ) ≈ sqrt[ (1/10,000) Σ_{b=1}^{10,000} ( ȳ^(b) − ȳ̄ )² ], where ȳ̄ is the average of the 10,000 bootstrap averages
[Table: the original sample y (n = 20, average 5.30, SD 0.40) alongside four bootstrap samples y(1)-y(4), each drawn with replacement from y, with averages 5.26, 5.32, 5.22, 5.35]
Bootstrapping overview continued
Solution: Make a pretend population that consists of your original sample of n observations, copied over and over, an infinite number of times. Then draw many "bootstrap" random samples (each of size n) from the pretend population and calculate a different θ̂ for each sample. You can construct a histogram of all the θ̂'s, take their sample standard deviation to be an estimate of SE(θ̂), etc.
How this is implemented: You do not have to actually copy your original sample over and over. The above construction of each bootstrap sample is equivalent to drawing a random sample of size n from the original sample of data (with replacement).
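The resampling step above can be sketched in a few lines (Python for illustration; B and the tolerance check are arbitrary choices):

```python
import math
import random

def bootstrap_se_of_mean(sample, B=2000, seed=0):
    # each bootstrap sample is a size-n draw WITH replacement from the data
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(B):
        boot = [rng.choice(sample) for _ in range(n)]
        means.append(sum(boot) / n)
    mbar = sum(means) / B
    return math.sqrt(sum((m - mbar) ** 2 for m in means) / (B - 1))

random.seed(2)
sample = [random.gauss(5.3, 0.4) for _ in range(20)]
se = bootstrap_se_of_mean(sample)

# for the sample average, the bootstrap SE should be close to s/sqrt(n)
s = math.sqrt(sum((v - sum(sample) / 20) ** 2 for v in sample) / 19)
assert abs(se - s / math.sqrt(20)) < 0.03
```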
A Different Example (that has nothing to do with nonlinear regression)
Pop0 = population of all grains
Pop1 = population of all grains with thickness < 0.3 and equivalent diameter > 0.6
mR = mean aspect ratio for all grains in Pop1
f = (area of all grains in Pop1) / (area of all grains in Pop0) = fraction projected area of grains in Pop1
The patent claim is violated if f > 0.5 AND mR > 8
Some Details: Bootstrapping in Nonlinear Regression
You have a sample of n observations {(xi, yi), i = 1, . . ., n} of a response variable and a set of predictor variables.
You fit a nonlinear regression model to the data to estimate a set of parameters θ
Let θ denote one of the parameters of interest and θ̂ its estimate.
Objective: Estimate the sampling distribution of θ̂, its standard error, a confidence interval for θ, etc.
To do this, follow the steps of the bootstrap procedure on the subsequent slides
Steps of the Bootstrap Procedure
1) Generate a "bootstrap" sample (with replacement) of n observations from {(xi, yi), i = 1, . . ., n}. Denote the bootstrap sample by {(xi^b, yi^b), i = 1, . . ., n}
2) Fit the same type of regression model (with the same set of parameters θ and parameter θ of special interest) to the bootstrapped sample. Denote the estimates for the bootstrapped sample by θ̂^b
3) Pick a large number B (e.g., B = 10,000), and repeat Steps (1) and (2) a total of B times, which produces {θ̂^b : b = 1, . . ., B}
Steps of the Bootstrap Procedure, continued
4) Construct a histogram of {θ̂^b : b = 1, . . ., B} and calculate:
θ̄ = (1/B) Σ_{b=1}^B θ̂^b : the average of all bootstrapped estimates
SE(θ̂) = sqrt[ (1/(B−1)) Σ_{b=1}^B (θ̂^b − θ̄)² ] : the standard error of θ̂
θ̂_{α/2} = upper α/2 quantile of the sample distribution of {θ̂^b : b = 1, . . ., B}
θ̂_{1−α/2} = lower α/2 quantile of the sample distribution of {θ̂^b : b = 1, . . ., B}
Steps of the Bootstrap Procedure, continued
5) A crude 1−α confidence interval for θ is: [ θ̂ − z_{α/2} SE(θ̂), θ̂ + z_{α/2} SE(θ̂) ]
6) A better 1−α confidence interval for θ is the reflected interval: [ 2θ̂ − θ̂_{α/2}, 2θ̂ − θ̂_{1−α/2} ]
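A hedged Python sketch of both intervals from this slide (the toy bootstrap estimates and the original estimate 1.05 are made up; quantile indexing is one simple convention among several):

```python
import math

def crude_ci(theta_hat, boot, z=1.96):
    # theta_hat +/- z * SE, with SE the sample SD of the bootstrap estimates
    B = len(boot)
    mbar = sum(boot) / B
    se = math.sqrt(sum((b - mbar) ** 2 for b in boot) / (B - 1))
    return theta_hat - z * se, theta_hat + z * se

def reflected_ci(theta_hat, boot, alpha=0.05):
    # "basic" bootstrap interval: reflect the quantiles about theta_hat
    s = sorted(boot)
    lo_q = s[int((alpha / 2) * len(s))]           # lower alpha/2 quantile
    hi_q = s[int((1 - alpha / 2) * len(s)) - 1]   # upper alpha/2 quantile
    return 2 * theta_hat - hi_q, 2 * theta_hat - lo_q

boot = [1.0 + 0.001 * j for j in range(101)]  # toy bootstrap estimates
lo1, hi1 = crude_ci(1.05, boot)
lo2, hi2 = reflected_ci(1.05, boot)
assert lo1 < 1.05 < hi1 and lo2 < 1.05 < hi2
```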
Example CI Calculations for θ0 for the Manu. Learning Curve
(from the left-most histogram two slides prior: θ̂ = 1.016, θ̄ = 1.015, SE(θ̂) = 0.004, θ̂_{.025} = 1.0231, θ̂_{.975} = 1.007)
Crude 95% CI: θ̂ ± z_{.025} SE(θ̂) = 1.016 ± 1.96(0.004) = (1.008, 1.024)
Reflected 95% CI: ( 2θ̂ − θ̂_{.025}, 2θ̂ − θ̂_{.975} ) = ( 2(1.016) − 1.0231, 2(1.016) − 1.007 ) = (1.009, 1.025)
Discussion Points and Questions
What is the difference between the two CIs (crude versus reflected) on the previous slide?
In general, when would the two confidence intervals differ?
What are the effects of increasing B on the bootstrapped histogram of a parameter estimate? Would the histogram become tighter?
What are the effects of increasing n on the bootstrapped histogram of a parameter estimate? Would the histogram become tighter?
Why must n for each bootstrapped sample be the same as n for the real sample?
R commands for bootstrapping parameter SEs/CIs for the manufacturing learning curve
library(boot) #need to load the boot package
MLC
> plot(MLCboot,index=1)
> boot.ci(MLCboot,conf=c(.9,.95,.99),index=1,type=c("norm","basic"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = MLCboot, conf = c(0.9, 0.95, 0.99), type = c("norm",
"basic"), index = 1)
Intervals :
Level Normal Basic
90% ( 1.010, 1.020 )   ( 1.011, 1.020 )
95% ( 1.010, 1.021 )   ( 1.010, 1.022 )
99% ( 1.008, 1.023 )   ( 1.009, 1.023 )
[Figure: histogram of t* (density scale) and a normal Q-Q plot of t* against quantiles of the standard normal]
Discussion Points and Questions
1) In boot.ci, type = "norm" gives our crude CI based on the SE and the normal percentiles, but translated by subtracting out the estimated Bias (taken to be the bootstrap average minus the original parameter estimate); type = "basic" gives the better CI obtained by reflecting the percentiles.
2) How can we determine if there is statistically significant evidence that the asymptotic relative efficiencies of the two manufacturing facilities differ?
3) What is a 95% CI on the asymptotic relative efficiency of the older facility (x1 = 0)?
4) What is a 95% CI on the asymptotic relative efficiency of the newer facility (x1 = 1)?
5) In general, given the covariance matrix Σ of a random vector Z, the variance of the linear combination aᵀZ is Var(aᵀZ) = aᵀΣa
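The identity Var(aᵀZ) = aᵀΣa in point (5) can be checked numerically. A hedged sketch (Python; the 2-d covariance matrix and the vector a are hypothetical values chosen for the demo):

```python
import math
import random

random.seed(3)
N = 200000
# construct Z = (Z1, Z2) with Var(Z1)=1, Var(Z2)=4, Cov(Z1,Z2)=1
zs = []
for _ in range(N):
    u1, u2 = random.gauss(0, 1), random.gauss(0, 1)
    z1 = u1
    z2 = u1 + math.sqrt(3) * u2  # Var = 1 + 3 = 4, Cov with z1 = 1
    zs.append((z1, z2))

a = (2.0, -1.0)
sigma = [[1.0, 1.0], [1.0, 4.0]]
theory = sum(a[i] * sigma[i][j] * a[j] for i in range(2) for j in range(2))

w = [a[0] * z1 + a[1] * z2 for z1, z2 in zs]
mean_w = sum(w) / N
var_w = sum((v - mean_w) ** 2 for v in w) / (N - 1)
assert abs(var_w - theory) < 0.1  # Monte Carlo estimate matches a'Sigma a
```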
Comments on Bootstrapping
Comparison of class versus textbook notation:
In R, use the "boot" command in the "boot" package
In Matlab, use the "bootstrp" command in the stats toolbox
Class / KNN text:
bootstrap parameter estimate: θ̂^b / b1*
upper α/2 percentile of distribution of bootstrapped parameters: θ̂_{α/2} / b1*(1 − α/2)
lower α/2 percentile of distribution of bootstrapped parameters: θ̂_{1−α/2} / b1*(α/2)
bootstrap sample of data: {(xi^b, yi^b), i = 1, . . ., n} / {(xi*, yi*), i = 1, . . ., n}
Some Common Blackbox Nonlinear Regression and Classification Models
If you have knowledge of the structure of the relationship between Y and x, then the best approach is to use it (e.g., if you think it is a linear, exponential, quadratic, etc. relationship, then fit that model)
For many data sets (especially large "data mining" applications), we might doubt a linear model will fit but have no idea of the structure of the nonlinearities.
In this case, unless there are only a few predictors, polynomial (e.g., quadratic) models are not the preferred next step to try beyond linear models
Why not?
There are many blackbox nonlinear modeling approaches
We will cover some common ones (neural networks, CART models, nearest neighbors) that span the spectrum of methods
Almost all can be used equally well for either regression or classification
Neural Networks
Clever original idea and memorable name; became very popular in the 1980s and 1990s.
They have evolved to have less resemblance to how the human brain processes information (but better effectiveness at modeling nonlinear relationships in complicated data sets)
To fit a neural network model (and all of the other blackbox models), the training data must be available in the same format as for linear/logistic regression:
A 2D array of observations
Each column is a different variable; each row a different case
One column is the response variable (Y) and the other columns are any number of predictor variables (X's)
The neural network hidden variables (H's) are internal variables that you do not enter or even care about
Standard Graphical Depiction of a Neural Network
Mathematical Definition of What a Neural Network Model Really Is
each "node" represents an activation function (labeled as the function output, with function input a linear combo of previous-layer function outputs)
X's: input (i.e., predictor) variables, in "input layer"
Y: output (i.e., response) variable, in "output layer"
H's: internal dummy variables, in "hidden layer"
α's and β's: model parameters, to be estimated
the NN model:
for m = 1, 2, . . ., M,
Hm = exp(α_{m,0} + α_{m,1}x1 + . . . + α_{m,k}xk) / [1 + exp(α_{m,0} + α_{m,1}x1 + . . . + α_{m,k}xk)]
Y = exp(β0 + β1H1 + . . . + βM HM) / [1 + exp(β0 + β1H1 + . . . + βM HM)] + ε
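The two equations above are just nested logistic functions. A minimal forward-pass sketch (Python for illustration; the parameter values are made up, not fitted):

```python
import math

def sigmoid(z):
    # logistic activation: exp(z)/(1+exp(z)) = 1/(1+exp(-z))
    return 1.0 / (1.0 + math.exp(-z))

def nn_predict(x, alphas, betas):
    # alphas: M rows [a_m0, a_m1, ..., a_mk] for the hidden nodes H_m
    # betas:  [b_0, b_1, ..., b_M] for the logistic output node
    H = [sigmoid(a[0] + sum(aj * xj for aj, xj in zip(a[1:], x)))
         for a in alphas]
    return sigmoid(betas[0] + sum(b * h for b, h in zip(betas[1:], H)))

alphas = [[0.1, 1.0, -0.5], [-0.2, 0.3, 0.8]]  # M = 2 hidden nodes, k = 2 inputs
betas = [0.0, 1.5, -1.0]
p = nn_predict([0.4, -0.7], alphas, betas)
assert 0.0 < p < 1.0  # a logistic output node always lands in (0, 1)
```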
Neural Network Activation Functions
For classification, it is common to use the same sigmoidal (logistic) activation function for each node:
h(z) = exp(z) / [1 + exp(z)] = 1 / [1 + exp(−z)]
where z = linear combo of variables from previous layer
For regression, it is usually preferable to use sigmoidal activation functions for all hidden nodes and a linear activation function [i.e., h(z) = z] for the output layer nodes:
Y = β0 + β1H1 + . . . + βM HM + ε
An S-shaped function with multivariate input
Recall that this is what the S-shaped logistic function looks like when there are multiple input variables
Discussion Points and Questions
Y is an S-shaped (or sometimes linear) function of the dummy variables (H's), which are in turn S-shaped functions of the predictors (X's)
When you combine them together, substituting for the H's to get Y as a function of the X's, you can think of the neural network model as Y = g(x, θ) + ε for some (very messy) g(x, θ), with θ = {all α's and β's}
What kind of functional X-Y relationships can you capture with the neural network model structure?
Fitting A Neural Network Model
1) Standardize predictors via x_ij → (x_ij − x̄_j) / s_j
x̄_j, s_j = average, stdev of jth predictor (jth column)
2) If using a logistic output activation function, scale the response to the interval [0,1] via y_i → (y_i − y_min) / (y_max − y_min)
Why do we need to do this rescaling for a logistic output activation function?
Fitting A Neural Network Model, continued
3) Choose:
# hidden layers
# nodes in each hidden layer
output activation function (usually linear or logistic)
other options and tuning parameters (e.g., λ)
4) Software estimates parameters to minimize (nonlinear LS with shrinkage):
Σ_{i=1}^n (y_i − g(x_i, θ))² + λ [ Σ_{m=1}^M Σ_{j=0}^k α_{m,j}² + Σ_{m=0}^M β_m² ]
g(x_i, θ) denotes the neural network response prediction
where
θ = {all α's and β's}
λ = user-chosen shrinkage parameter
g(x_i, θ) = exp(β0 + β1H_{i,1} + . . . + βM H_{i,M}) / [1 + exp(β0 + β1H_{i,1} + . . . + βM H_{i,M})]
H_{i,m} = exp(α_{m,0} + α_{m,1}x_{i,1} + . . . + α_{m,k}x_{i,k}) / [1 + exp(α_{m,0} + α_{m,1}x_{i,1} + . . . + α_{m,k}x_{i,k})]
The shrinkage term is analogous to the term that we add to the SSE in ridge regression
Why do we need to include the shrinkage term when fitting a neural network, even if we have no multicollinearity?
Example: Predictive Modeling of CPU Performance
Data in cpus.txt, which is the same as the cpus data in the MASS package
209 cases, with 9 variables and 6 predictor variables
perf is the response, which is CPU performance. Ignore estperf, which was another author's estimated performance.
The six numerical predictors are cycle time (nanoseconds), cache size (Kb), min and max main memory size (Kb), and min and max number of channels. See V&R for additional discussion
The objective is to learn the predictive relationship between CPU performance and the predictor variables
Example with a bigger data set coming up shortly
Neural Network Modeling of CPU data
#######R code for reading in cpus data set, taking log(response) and then converting to [0,1] interval, and standardizing predictors##############
CPUS
Matrix scatterplot of transformed cpus data
[Figure: matrix scatterplot of the transformed cpus variables syct, mmin, mmax, cach, chmin, chmax, and perf]
CPUS Example Continued
#############Fit a neural network model to the CPUS1 data####################
library(nnet)
cpus.nn1
Discussion Points and Questions
Why do we need to standardize the predictors (and the response variable when using a linear output activation function)?
How can we get r² for this example (the nnet function in R does not spit it out)?
Which predictor variables appear to be the most important, and what R output do we look at to determine this?
What value of λ will give us the smallest training SSE?
How can we decide the best value of λ?
CPUS Example Continued
#######A function to determine the indices in a CV partition##################
CVInd
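The CVInd helper referenced above (its R body was lost in extraction) just partitions the row indices into K random folds. A hedged Python analogue of that idea, with hypothetical function and argument names:

```python
import random

def cv_ind(n, K, seed=0):
    # randomly partition indices 0..n-1 into K folds of near-equal size
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::K] for k in range(K)]  # fold k takes every K-th shuffled index

folds = cv_ind(10, 3)
assert len(folds) == 3
# every observation lands in exactly one fold: a true partition
assert sorted(i for f in folds for i in f) == list(range(10))
```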
CPUS Example Continued
##Now use the same CV partition to compare Neural Net and linear reg models###
Ind
Discussion Points and Questions
The best value of λ is the value that results in the smallest CV SSE (or equivalently, the largest CV r², smallest CV SD(e), etc.).
How can we decide the best number of hidden layer nodes?
Why should we use the same CV partition when comparing two models?
What are the pros and cons of n-fold CV versus K-fold CV for some smaller K, e.g., 3, 5, or 10?
Example: Predictive Modeling of Income Data
Data in adult_train.csv is from the 1994 US Census (also see http://archive.ics.uci.edu/ml/datasets/Census+Income)
32561 cases, with 15 variables. This is a small sample from the US census with 15 potentially relevant variables. Each row represents a "similar" population segment with weight given by "fnlwgt"
income has been converted to a binary categorical variable (≤50K vs. >50K) with roughly a 75%/25% population split
Later we will fit predictive models to classify income based on the other variables (classification). Here, the objective is to predict the number of hours per week spent working based on the other variables (regression)
This is already a very cleaned data set, but we may need to do a little additional cleaning
What should we do about the missing "?" values?
The First Few Rows
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income
39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K
Read in the Data
XX
Some Preliminary Exploratory Analyses
##exploring individual variables
par(mfrow=c(2,3)); for (i in c(1,5,11,12,13)) hist(XX[[i]],xlab=names(XX)[i]); plot(XX[[15]])
par(mfrow=c(1,1)); plot(XX[[2]],cex.names=.7)
for (i in c(2,4,6,7,8,9,10,14,15)) print(table(XX[[i]])/nrow(XX))
Should we be concerned with anything here or do any further cleaning?
[Figure: histograms of age, education.num, capital.gain, capital.loss, and hours.per.week, plus a barplot of the binary income variable]
Some Preliminary Exploratory Analyses
##exploring pairwise predictor/response relationships
par(mfrow=c(2,1))
plot(jitter(XX$age,3),jitter(XX$hours.per.week,3),pch=16,cex=.5)
plot(jitter(XX$education.num,3),jitter(XX$hours.per.week,3),pch=16,cex=.5)
par(mfrow=c(1,1))
barplot(tapply(XX$hours.per.week,XX$education,mean),ylim=c(30,50),cex.names=.7,xpd=F)
for (i in c(2,4,6,7,8,9,14,15)) {print(tapply(XX$hours.per.week,XX[[i]],mean)); cat("\n")}
Some points to consider regarding correlation versus functional dependence (points that apply to ANY regression analysis):
If hours.per.week appears correlated with another variable, it does not mean that hours.per.week has a functional dependence on that variable
The two could appear correlated because they both depend on another variable (either one of the existing variables or an unrecorded nuisance variable)
If you have recorded enough nuisance variables, a multiple regression analysis can sometimes distinguish which correlations are truly due to a functional dependence
If your goal is pure prediction (and not explanatory), does it matter?
[Figure: barplot of mean hours.per.week (y-axis from 30 to 50) by education category: 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, Bachelors, Doctorate, HS-grad, Masters, Preschool, Prof-school, Some-college]
A Typical Next Step in Predictive Modeling
##linear regression with all predictors included
Inc.lm
Some Typical Next Steps
##linear regression including interactions
Inc.lm.full
Now Try a Neural Network Model
##Neural network model
library(nnet)
Inc.nn1
Multi-Response Neural Networks
Neural networks also apply to the situation in which we have more than one (say K) response variables
We handle this by including K nodes in the output layer (see the following slide)
This is different than fitting K separate neural networks, one for each response, because the K responses share the same hidden layer node functions
This is generally more effective than fitting K separate neural network models if the response variables have similar functional dependencies on the predictors. If the responses have completely different dependencies on the predictors, then you are better off fitting K separate neural network models
Graphical Depiction of Neural Network with K Response Variables
Neural Networks for Classification
The most common application of multi-response neural networks is for classification when we have a categorical response with K categories (aka classes). Note that this also applies to binary responses (K = 2)
To handle this (most software does this internally), make a K-length 0/1 response vector, e.g., for the fgl data:
Type  y1 y2 y3 y4 y5 y6
WinF  1 0 0 0 0 0
WinNF 0 1 0 0 0 0
Veh   0 0 1 0 0 0
Con   0 0 0 1 0 0
Tabl  0 0 0 0 1 0
Head  0 0 0 0 0 1
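The encoding in the table above (often called one-hot encoding) is straightforward to implement. A minimal Python sketch using the six fgl class labels:

```python
# the six glass types from the fgl data, in the column order y1..y6
CLASSES = ["WinF", "WinNF", "Veh", "Con", "Tabl", "Head"]

def one_hot(label):
    # K-length 0/1 vector with a single 1 marking the observed class
    return [1 if c == label else 0 for c in CLASSES]

assert one_hot("WinF") == [1, 0, 0, 0, 0, 0]
assert one_hot("Veh") == [0, 0, 1, 0, 0, 0]
assert sum(one_hot("Head")) == 1  # exactly one category is "on"
```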
Example: Predicting Glass Type in Forensics
Data in fgl.txt, which is the same as the fgl data in the MASS package. See V&R for additional discussion
214 cases, with 9 predictor variables and a categorical response
Each row contains the results of an analysis of a fragment of glass
type is the response, one of six different glass types: window float glass (WinF: 70 rows), window non-float glass (WinNF: 76 rows), vehicle window glass (Veh: 17 rows), containers (Con: 13 rows), tableware (Tabl: 9 rows) and vehicle headlamps (Head: 29 rows).
Eight of the predictors are the chemical composition of the fragment, and the ninth (RI) is the refractive index
The objective is to train a predictive model to predict the glass type based on a fragment of the glass, for forensic purposes
Read the Data and Transform some Variables
######Read data, convert response to binary, and standardize predictors#####
FGL
First Few Rows of fgl.txt data
RI Na Mg Al Si K Ca Ba Fe type
3.01 13.64 4.49 1.1 71.78 0.06 8.75 0 0 WinF
-0.39 13.89 3.6 1.36 72.73 0.48 7.83 0 0 WinF
-1.82 13.53 3.55 1.54 72.99 0.39 7.78 0 0 WinF
-0.34 13.21 3.69 1.29 72.61 0.57 8.22 0 0 WinF
-0.58 13.27 3.62 1.24 73.08 0.55 8.07 0 0 WinF
-2.04 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 WinF
-0.57 13.3 3.6 1.14 73.09 0.58 8.17 0 0 WinF
-0.44 13.15 3.61 1.05 73.24 0.57 8.24 0 0 WinF
1.18 14.04 3.58 1.37 72.08 0.56 8.3 0 0 WinF
-0.45 13 3.6 1.36 72.99 0.57 8.4 0 0.11 WinF
-2.29 12.72 3.46 1.56 73.2 0.67 8.09 0 0.24 WinF
-0.37 12.8 3.66 1.27 73.01 0.6 8.56 0 0 WinF
-2.11 12.88 3.43 1.4 73.28 0.69 8.05 0 0.24 WinF
-0.52 12.86 3.56 1.27 73.21 0.54 8.38 0 0.17 WinF
-0.37 12.61 3.59 1.31 73.29 0.58 8.5 0 0 WinF
-0.39 12.81 3.54 1.23 73.24 0.58 8.39 0 0 WinF
-0.16 12.68 3.67 1.16 73.11 0.61 8.7 0 0 WinF
3.96 14.36 3.85 0.89 71.36 0.15 9.15 0 0 WinF
1.11 13.9 3.73 1.18 72.12 0.06 8.89 0 0 WinF
-0.65 13.02 3.54 1.69 72.73 0.54 8.44 0 0.07 WinF
Mathematical Definition of K-Class Neural Network Model
for m = 1, 2, . . ., M, (same as before)
Hm = exp(α_{m,0} + α_{m,1}x1 + . . . + α_{m,k}xk) / [1 + exp(α_{m,0} + α_{m,1}x1 + . . . + α_{m,k}xk)]
for l = 1, 2, . . ., K, (multinomial logistic model)
Pr(Yl = 1 | x) = exp(β_{l,0} + β_{l,1}H1 + . . . + β_{l,M}HM) / Σ_{j=1}^K exp(β_{j,0} + β_{j,1}H1 + . . . + β_{j,M}HM)
Note: For K = 2, this reduces to:
Pr(Y1 = 1 | x) = exp(β0 + β1H1 + . . . + βM HM) / [1 + exp(β0 + β1H1 + . . . + βM HM)]
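The multinomial-logistic output layer above is a softmax over the K linear scores. A hedged Python sketch (toy score values; the K = 2 check mirrors the reduction noted on the slide):

```python
import math

def softmax(z):
    # multinomial logistic: exp(z_l) / sum_j exp(z_j)
    exps = [math.exp(v) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 0.5, -1.0])
assert abs(sum(probs) - 1.0) < 1e-12  # class probabilities sum to 1
assert probs[0] == max(probs)         # the largest score wins

# For K = 2 with the second score fixed at 0, softmax is the binary logistic:
z = 0.7
p = softmax([z, 0.0])[0]
assert abs(p - 1.0 / (1.0 + math.exp(-z))) < 1e-12
```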
Fitting A Neural Network Model for Classification
1)-3) The first three steps are the same as before
4) For classification, software estimates parameters to minimize (negative log-likelihood plus shrinkage penalty):
−Σ_{i=1}^n Σ_{l=1}^K y_{i,l} log P̂r(Y_{i,l} = 1 | x_i) + λ [ Σ_{all m} Σ_{all j} α_{m,j}² + Σ_{all l} Σ_{all m} β_{l,m}² ]
5) CV should be used to choose any tuning parameters (λ, number of nodes, etc.)
Fitting a Neural Net Classifier for the FGL Data (binary response case)
#############Fit a neural network classification model to the FGL1 data######
library(nnet)
fgl.nn1
response vs. predicted probability for fgl data
[Figure: scatterplot of jitter(y, 0.05) vs. phat, both axes ranging from 0.0 to 1.0]
Using CV to Compare Models for the FGL Data
Ind
Discussion Points and Questions
What is the best neural network model, in terms of the tuning parameters (decay, size, etc.)?
What is the best CV misclassification rate?
Is this good?
What other model(s) would you compare to the bestneural network?
Classification for the 6-Class FGL Response
#############Same, but use the original 6-category response######
library(nnet)
fgl.nn1
Neural Network Classification of Income Data
Reconsider the data in adult_train.csv
Instead of predicting the number of hours (regression), we will now predict the binary income categorization (≤50K vs. >50K) using the other predictor variables
Recall that for the entire sample, roughly 75% are ≤50K
Read the Data and Fit Models
XX
Discussion Points and Questions
Which model (neural network or logistic regression) appears to be better? How good does it appear?
Pros and Cons of Neural Networks
Pros:
very flexible; with enough nodes, can model almost any nonlinear relationship
can efficiently model linear behavior if the relationship is truly linear
often very good predictive power
Cons:
model fitting can be unstable and sensitive to initial guesses
for very large data sets, model fitting can be very slow relative to some methods like trees and linear models, which makes CV very computationally expensive
overfitting (but can avoid by using CV to choose λ)
sensitive to user-chosen "tuning parameters" (but can use CV to choose them wisely)
poor interpretability
Classification and Regression Tree (CART) Models
- Perhaps the single most widely used generic nonlinear modeling method
- Very simple idea, and very interpretable models
- They usually do not have the best predictive power, but they serve as the basis for many more advanced supervised learning methods (e.g., boosting, random forests) that have excellent predictive power
- As with neural networks (and most of the methods we will cover), you can use tree models for either regression or classification. We will start with regression.
Structure of a Regression Tree
- A final fitted CART model divides the predictor (x) space by successively splitting into rectangular regions and models the response (Y) as constant over each region
- This can be schematically represented as a "tree":
  - each interior node of the tree indicates on which predictor variable you split and where you split
  - each terminal node (aka leaf) represents one region and indicates the value of the predicted response in that region
- The following slide illustrates a fitted tree model for an example from the KNN text (Figure 11.12), in which the objective is to predict college GPA (the response) as a function of HS rank and ACT score (two predictors)
- To use a fitted CART for prediction, you start at the root node and follow the splitting rules down to a leaf
Mathematical Representation of Regression Tree
Can still view the tree model as Y = g(x; θ) + ε, with

  g(x; θ) = Σ_{m=1}^{M} c_m I(x ∈ R_m)

where:
- M = total number of regions (terminal nodes)
- R_m = mth region
- I(x ∈ R_m) = indicator function = 1 if x ∈ R_m; 0 if x ∉ R_m
- c_m = constant predictor over R_m
- θ = all parameters and structure (M, splits in R_m's, c_m's, etc.)

Note that for x_i ∈ R_j, g(x_i; θ) = Σ_{m=1}^{M} c_m I(x_i ∈ R_m) = c_j
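The piecewise-constant form above can be sketched directly in code. This is a minimal illustration (the three rectangular regions and their constants c_m are made up, not from the course data): exactly one indicator I(x ∈ R_m) is 1, so g(x) returns that region's constant.

```python
INF = float("inf")

# Hypothetical partition of a two-predictor space into M = 3 rectangles:
# each entry is ((x1_low, x1_high), (x2_low, x2_high)), c_m
regions = [
    (((-INF, 2.0), (-INF, 5.0)), 10.0),   # R_1: x1 < 2 and x2 < 5
    (((-INF, 2.0), (5.0, INF)), 20.0),    # R_2: x1 < 2 and x2 >= 5
    (((2.0, INF), (-INF, INF)), 30.0),    # R_3: x1 >= 2
]

def g(x):
    """g(x) = sum_m c_m * I(x in R_m): return c_m for the unique region containing x."""
    for ((lo1, hi1), (lo2, hi2)), c in regions:
        if lo1 <= x[0] < hi1 and lo2 <= x[1] < hi2:
            return c
    raise ValueError("regions must partition the predictor space")
```

Because the regions partition the space, the sum of indicators collapses to the single constant c_j, exactly as in the note above.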
Discussion Points and Questions
- What kind of functional x–Y relationships can you capture with a regression tree model structure?
- Can a regression tree represent a linear relationship? Can it represent a linear relationship as efficiently as a neural network?
- Which type of model (neural network or regression tree) is more interpretable?
- Which type of model (neural network or regression tree) is easier to fit?
- Given a set of regions, how would you estimate the coefficients {c_m: m = 1, 2, . . ., M}?
Fitting a Regression Tree
- A CART model is fit using an array of training data structured just like in regression (one response column and many predictor columns)
- Fitting the model entails growing the tree one node at a time (see next slide for an example)
  - At each step, the single best next split (which predictor and where to split) is the one that gives the biggest reduction in SSE
  - The fitted or predicted response over any region is simply the average response over that region. The errors used to calculate the SSE are the response values minus the fitted values.
  - Stop splitting when the reduction in SSE from the next split is below a specified threshold, all node sizes are below a threshold, etc.
  - Most algorithms overfit and then prune back branches
- After fitting a CART model, software spits out the final fitted tree, which can be used for prediction/interpretation
SSE is Calculated as Follows
For a given set of splits:

  ĉ_m = ave(y_i | x_i ∈ R_m) = (1/N_m) Σ_{x_i ∈ R_m} y_i

  N_m = #{x_i ∈ R_m} = "size" of mth terminal node (region)

  SSE = Σ_{m=1}^{M} Σ_{x_i ∈ R_m} (y_i − ĉ_m)²

Note that for x_i ∈ R_j, g(x_i; θ) = Σ_{m=1}^{M} c_m I(x_i ∈ R_m) = c_j
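The greedy split search can be sketched for a single predictor. This is a minimal illustration of the criterion (smallest SSE(left) + SSE(right) over candidate split points), not the implementation used by the R tree package:

```python
def sse(v):
    """SSE of a node: squared errors about the node average c_hat."""
    if not v:
        return 0.0
    c_hat = sum(v) / len(v)
    return sum((yi - c_hat) ** 2 for yi in v)

def best_split(x, y):
    """Return the split point s minimizing SSE(left) + SSE(right),
    searching between the distinct observed predictor values."""
    candidates = sorted(set(x))[1:]
    return min(candidates,
               key=lambda s: sse([yi for xi, yi in zip(x, y) if xi < s]) +
                             sse([yi for xi, yi in zip(x, y) if xi >= s]))
```

In a full tree fit, this search runs over every predictor at every node, and the winning (predictor, split point) pair becomes the next interior node.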
Pruning
- Pruning a branch means that you collapse one of the internal nodes into a single terminal node
- Pruning the tree means that you prune a number of branches
- Pruning algorithms in software will usually optimally prune back a tree in a manner that minimizes SSE + λM, where M and SSE are for the pruned tree. The best value for λ is determined via CV
- There is a nice computational trick ("weakest link pruning") that allows this optimal pruning to be done very fast. See HTF for further discussion.
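The SSE + λM criterion can be illustrated with a hypothetical nested sequence of pruned subtrees (the (M, SSE) pairs below are made up for illustration):

```python
# Hypothetical pruned-subtree sequence: (M terminal nodes, SSE of that subtree).
# Bigger trees fit the training data better (lower SSE) but pay a penalty lam * M.
subtrees = [(1, 100.0), (3, 40.0), (5, 25.0), (8, 20.0)]

def best_size(lam):
    """Size M of the subtree minimizing the penalized criterion SSE + lam * M."""
    return min(subtrees, key=lambda ms: ms[1] + lam * ms[0])[0]
```

Small λ favors the overgrown tree and large λ favors heavy pruning, which is why λ itself must be chosen by CV, as the slide notes.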
Regression Tree Ex. (cpus data)
#do not have to standardize or transform predictors to fit trees
library(tree)
control = tree.control(nobs=nrow(CPUS), mincut = 5, minsize = 10, mindev = 0.002)
#default is mindev = 0.01, which only gives a 10-node tree
cpus.tr
[Figure: deviance vs. size (and vs. the complexity parameter k, i.e., λ) from cv.tree() for the cpus regression tree, for sizes 2 through 14]
Discussion Points and Questions
- What is the best size tree for the CPUS example?
- Provide an interpretation of which predictor variables are most important
- Do there appear to be any interactions between mmax and cach?
- Why must minsize be at least twice mincut?
- The "deviance" measure that is plotted versus tree size is −2·log f(y; θ). Why does this correspond to the SSE for a nonlinear regression model with normal errors?
Classification Trees Overview
- Fitting and using classification trees with a K-category response is similar to fitting and using regression trees.
- For classification trees, we model p_k(x) = Pr{Y = k | x} (k = 1, 2, . . ., K) as constant over each region
- Compare to regression trees, for which we model g(x; θ) = E[Y | x] as constant over each region
- At each step in the fitting algorithm, the best next split is the one that most reduces some criterion measuring the impurity within the regions
Classification Trees: Some Details
In the region R_m, the fitted class probabilities and best class prediction are:

  p̂_{m,k} = (1/N_m) Σ_{x_i ∈ R_m} I(y_i = k)   (class-k sample fraction in region R_m)

  k̂_m = argmax_k p̂_{m,k}   (most common class in region R_m)

Some common impurity measures:
- Misclassification error: Σ_{m=1}^{M} Σ_{x_i ∈ R_m} I(y_i ≠ k̂_m) = Σ_{m=1}^{M} N_m (1 − p̂_{m,k̂_m})
- Gini index: Σ_{m=1}^{M} N_m Σ_{k=1}^{K} p̂_{m,k} (1 − p̂_{m,k})
- Deviance: −Σ_{m=1}^{M} N_m Σ_{k=1}^{K} p̂_{m,k} log p̂_{m,k}   (based on the log-likelihood)
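For a single node, all three impurity measures can be computed from the per-class counts. This is a per-node sketch (the slide's criteria then sum these across nodes, weighted by N_m); the counts in the test are just an example:

```python
from math import log

def node_impurities(counts):
    """Misclassification error, Gini index, and deviance (per observation)
    for one node, from per-class counts; N_m = sum(counts)."""
    n = sum(counts)
    p = [c / n for c in counts]               # p_hat_{m,k}
    misclass = 1.0 - max(p)                   # 1 - p_hat_{m, k_hat_m}
    gini = sum(pk * (1.0 - pk) for pk in p)
    deviance = -sum(pk * log(pk) for pk in p if pk > 0)
    return misclass, gini, deviance
```

A pure node (all cases in one class) gives 0 for all three measures, which is why splits that isolate classes are preferred.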
Example Illustrating the Notation
Suppose you have K = 4 classes, and the predictors for N_m = 100 training cases fall into a particular region R_m. For those 100 cases, suppose we have the following breakdown of the number of cases with response value that fell into the four categories:

  Class, k | # obsvns with Y in Class k
  1 | 10
  2 | 20
  3 | 65
  4 | 5

- What is p̂_{m,k} for k = 1, 2, 3, 4?
- What is k̂_m?
K = 2 Class Example Illustrating Notation and Splitting Based on Impurity
In the following plot of class label y_i (1 or 2) versus a single predictor x_i, where would the first split that minimizes the misclassification rate be, and what would the p̂_{m,k} and k̂_m be?

[Figure: y_i ∈ {1, 2} vs. x_i]
Classification Tree Ex. (fgl data)
library(tree)
control = tree.control(nobs=nrow(FGL), mincut = 5, minsize = 10, mindev = 0.005)
#default is mindev = 0.01, which only gives a 10-node tree
fgl.tr
[Figure: deviance vs. size from cv.tree() for the FGL classification tree, and the pruned tree: root split Mg < 2.695, then RI < 6.22; leaves labeled Other and Win]
Discussion Points and Questions
- What is the best tree size for the FGL data?
- Which predictors appear to be the most important, and what are their effect(s)?
- If you want a summary measure of the predictive quality of the tree model, what would it be?
- If you wanted to decide whether the neural network is better than the tree for predicting glass type, how would you do this?
- How can you tell what impurity measure R used to fit the model?
Same but for the original 6-category response
control = tree.control(nobs=nrow(FGL), mincut = 5, minsize = 10, mindev = 0.005)
#default is mindev = 0.01, which only gives a 10-node tree
fgl.tr
Discussion Points and Questions
- You choose the best size the same way: choose the size with the lowest CV deviance (the best M was about 7)
- Which predictors appear to be the most important, and does this seem to agree with the best predictors for the 2-class tree when we chose M = 3?
- If you want a summary measure of the predictive quality for the 6-class tree model, what would it be?
Numerical Assessment of Variable Importance
For a visual assessment of the importance of each predictor in a tree, inspect the tree graph (the importance of x_j is reflected by how many times it appears in internal nodes, how close they are to the root node, and the length of the branch for that split if using type = "p" in plot.tree):

##Replot 6-class FGL tree with branch lengths proportional to reduction in impurity##
fgl.tr1
Use CV to compare trees with any other model
- In the previous regression tree example (cpus data), the best value of the complexity parameter was λ = 0.4, which translated to M = 11 terminal nodes.
- To compare a regression tree with a neural network model we would:
  - Form a random CV partition (e.g., using the CVInd function)
  - Compute the CV SSE for a neural network model with 10 hidden layer nodes and λ = 0.05 (which CV earlier said was roughly the best value)
  - Compute the CV SSE for a regression tree model with either λ = 0.4 or M = 11, using the same partition
  - Repeat the previous 3 steps as many times as you can, averaging the results, and select the best model as the one with the lower average CV SSE
Return to the Income Data Example
- Reconsider the data in adult_train.csv
- Before, we fit a neural network model for regression, predicting the number of hours per week worked. And we also fit a neural network for classification, predicting the binary income categorization (≤50K vs. >50K)
- Here, we will fit similar regression and classification models, but using trees instead of neural networks
The Best-sized Regression Tree for INCOME Data
[Figure: CV deviance vs. size for the INCOME regression tree, and the pruned tree itself. The first split is age < 22.5, with further splits on age, education.num, relationship, sex, occupation, workclass, and income; leaf predictions range from about 23.1 to 55.5 hours]
Discussion Points and Questions
- When we include native.country, we get an error because no more than 32 categories are allowed for a categorical predictor. If you really wanted to include native.country, how would you handle this?
- Relative to the CPUS example, would you increase mincut, minsize, and mindev, or decrease them?
- As always, you should deliberately overgrow the tree and then prune it back. How do you know if you have overgrown the tree?
- Comparing the tree to the neural network:
  - Which was faster to fit?
  - Which had better predictive quality, and how can you tell?
  - Which was easier to interpret?
  - How about for comparing a tree to a linear regression?
Try a Classification Tree
control = tree.control(nobs=nrow(INCOME), mincut = 20, minsize = 40, mindev = 0.0005)
#default is mindev = 0.01, which only gives a 10-node tree
Inc.tr
K-Nearest Neighbors
- A generic nonlinear modeling tool that is extremely flexible
- Perhaps the simplest modeling idea of all
- For simple data sets with large n, small k, and no categorical predictors, almost as widely used as CART
- Based on the name, can you guess how K-nearest neighbors works?
Structure of 1-Nearest Neighbors (for regression)
- You need a set of training data {y_i, x_i: i = 1, 2, . . ., n}, but you do not fit a model.
- For 1-Nearest Neighbors, to predict Y for a new case with predictors x:
  - find the x_i in your training set that is the closest neighbor to x
  - then take the predicted Y to be the response value for that training observation
Illustration of K-NN for Gas Mileage data
library(scatterplot3d)
library(rgl)
GAS
[Figure: 3D scatterplot of Mpg vs. Displacement and Rear_axle_ratio (standardized), and a 2D scatterplot of Rear_axle_ratio vs. Displacement]
Calculating Distances to find Nearest Neighbors
If we want to predict Y(x) for a new case with predictors x, the distance between x and the predictors x_i for the ith training case (i = 1, 2, . . ., n) is measured via

  d²(x, x_i) = (x − x_i)ᵀ(x − x_i) = Σ_{j=1}^{k} (x_j − x_{ij})²

For 1-nearest neighbor, the prediction of Y(x) is

  ŷ(x) = y_{i₁(x)}, where i₁(x) = index of the closest neighbor of x
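The distance formula and 1-NN rule translate directly to code. A minimal sketch (it assumes the predictors are already standardized, per the discussion that follows):

```python
def one_nn_predict(x_new, X, y):
    """1-NN: return y_{i1(x)}, the response of the training case whose
    predictor vector minimizes the squared Euclidean distance d^2(x, x_i)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x_new, xi)) for xi in X]
    return y[d2.index(min(d2))]
```

Note that nothing is "fit" here: the training data themselves are the model, and all the work happens at prediction time.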
Discussion Points and Questions
- You should always standardize your predictors as a first step. Why?
- How do you handle categorical predictors?
- For 1-nearest neighbor, what would a plot of ŷ(x) versus x look like for the gas mileage example with only Displacement as a predictor?

[Figure: Mpg vs. standardized Displacement]
Structure of K-Nearest Neighbors (for regression)
More generally, for K-nearest neighbors, you use exactly the same procedure, except you:
- find the K closest training x_i's to x, and
- then take the predicted Y to be the average response value for these K training observations:

  ŷ(x) = (1/K) Σ_{l=1}^{K} y_{i_l(x)}

  where {i₁(x), i₂(x), . . ., i_K(x)} = indices of the K closest neighbors of x

The tradeoff of using large vs. small K is exactly the classic bias/variance tradeoff.
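The K-neighbor average generalizes the 1-NN rule by one line. An illustrative sketch (not the yaImpute implementation used in the example later); with K = 1 it reduces to 1-nearest neighbor:

```python
def knn_predict(x_new, X, y, K):
    """K-NN regression: average the responses of the K training cases
    closest to x_new in squared Euclidean distance."""
    order = sorted(range(len(X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(x_new, X[i])))
    return sum(y[i] for i in order[:K]) / K
```

Averaging over more neighbors smooths the prediction (lower variance) at the cost of averaging in farther-away, less representative cases (higher bias).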
Large Versus Small K (single predictor example)
- Why is the predictor in the left plot high variance and low bias?
- Why is the predictor in the right plot low variance and high bias?

[Figure: Mpg vs. standardized Displacement with K-NN fits, K = 1 (left) and K = 20 (right)]
Bias and Variance of K-Nearest Neighbors
Assume the true relationship is Y = g(x; θ) + ε, with fixed training x's and ε ~ i.i.d. (0, σ²) (not necessarily normal)

The predictor for fixed x is:

  ŷ(x) = (1/K) Σ_{l=1}^{K} y_{i_l(x)} = (1/K) Σ_{l=1}^{K} [g(x_{i_l(x)}; θ) + ε_{i_l(x)}]

MSE = E[(Y − ŷ(x))²] = σ² + Var[ŷ(x)] + Bias²[ŷ(x)], where

  Var[ŷ(x)] = σ²/K

  Bias[ŷ(x)] = E[ŷ(x)] − E[Y(x)] = (1/K) Σ_{l=1}^{K} g(x_{i_l(x)}; θ) − g(x; θ)

so that

  Var[Y − ŷ(x)] = σ² + σ²/K = σ²(1 + 1/K)
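The σ²/K variance term follows in one step from the i.i.d. error assumption, since the g terms are fixed and the K noise terms are independent:

```latex
\hat y(\mathbf{x})
  = \frac{1}{K}\sum_{l=1}^{K}\Bigl( g\bigl(\mathbf{x}_{i_l(\mathbf{x})};\boldsymbol\theta\bigr)
      + \varepsilon_{i_l(\mathbf{x})}\Bigr)
\;\Longrightarrow\;
\operatorname{Var}\bigl[\hat y(\mathbf{x})\bigr]
  = \frac{1}{K^2}\sum_{l=1}^{K}\operatorname{Var}\bigl[\varepsilon_{i_l(\mathbf{x})}\bigr]
  = \frac{\sigma^2}{K}.
```

This is the variance side of the tradeoff: it shrinks as K grows, while the bias term grows because farther neighbors pull the average of g away from g(x; θ).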
Another Example of Large Vs. Small K
This is a classification example from HTF with two response categories (blue or orange in the figures) and two predictors. The following scatterplots are x1 vs. x2, also showing the decision boundaries for the K-nearest-neighbors classifiers with K = 15 and K = 1.
K-NN for CPUS data
library(yaImpute)
CPUS
[Figure: ytest vs. fitted values for the CPUS data: K = 2 training fit (left) and K = 6 training fit (right)]
CV to Choose the Best K
Nrep
K
[Figure: y vs. n-fold CV predictions for the CPUS data: K = 2 (yhat2, left) and K = 6 (yhat1, right)]
Discussion Points and Questions
- Did the 6-nearest neighbors method do better than the best neural network model or the linear regression model?
- How can you tell which predictors have the largest effect on the response in the K-nearest neighbors model?
- What are the parameters of the fitted model?
K-Nearest Neighbors for Classification
- Like CART models, it is straightforward to use nearest neighbors for classification.
- For binary classification, to predict Pr{Y = 1 | x} for a new case, find the K nearest neighbors as before, and take the predicted Pr{Y = 1 | x} to be the fraction of the K nearest neighbors having y = 1 responses
- If we have more than two response categories, we take the predicted probability for each category to be the fraction of nearest neighbors with response values belonging to that category.
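The fraction-of-neighbors rule can be sketched in a few lines. An illustrative version (the class labels in the test are made-up examples); it handles two or more categories identically:

```python
from collections import Counter

def knn_class_probs(x_new, X, y, K):
    """Predicted class probabilities: the fraction of the K nearest
    neighbors (squared Euclidean distance) falling in each class."""
    order = sorted(range(len(X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(x_new, X[i])))
    counts = Counter(y[i] for i in order[:K])
    return {cls: c / K for cls, c in counts.items()}
```

A hard classification then just picks the class with the largest fraction, exactly as with the class proportions in a CART terminal node.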
K-NN for FGL data
FGL
[Figure: jittered response, jitter(as.numeric(ytest=="Win"), amount=0.05), vs. predicted probability (phat) for the FGL data; K = 10 training fit]
Discussion Points and Questions
- In the preceding, what was the training misclassification rate? What would happen to the training misclassification rate if we decreased K? What would it be for K = 1?
- To find the best K for K-nearest neighbors for classification problems, you must use CV, just like for any other method.
- What CV measure would you use to find the best K for the FGL data?
- Is finding the CV errors for K-nearest neighbors substantially more computationally expensive than finding the training errors, like it is for all of the other methods we have covered?
Pros and Cons of Nearest Neighbors

Pros:
- The most flexible of all: can represent any nonlinear relationship (as long as you have sufficiently large n).
- Easy to use. No model fitting required

Cons:
- There is no real fitted model (nor even any indication of which predictors are most important), so not suitable for interpretation or explanatory purposes.
- Because there is no fitted model, you need to retain all the training data to predict.
- With large k (the number of predictors), you need very large n, because neighbors get farther away in higher dimensions.
- For most supervised learning methods, large n increases the computational expense for training, but not for new case prediction. Large n is more problematic for nearest neighbors, because the "training" occurs for every new case prediction
- With very large n, we need computational tricks (e.g., tree-based methods) to efficiently search for nearest neighbors.
- Not well suited for categorical predictors
Effect of dimension (k) on distance between neighbors

[Figure: the same cases plotted in k = 1 dimension (Displacement only), k = 2 (scatterplot of Rear_axle_ratio vs. Displacement), and k = 3 (3D scatterplot of Comp_ratio vs. Rear_axle_ratio vs. Displacement), illustrating how neighbors get farther apart as k increases]
Software Implementation in Matlab (in case you want to know)
- Neural Networks: Neural Networks toolbox (IE computer lab does NOT have the NN toolbox)
- CART: CLASSREGTREE (part of the Stats toolbox)
- Nearest Neighbors: No model to fit. Easy to write your own script in Matlab.
- Cross-Validation: CROSSVAL (part of the Stats toolbox). You must write an appropriate function call for your specific model.
- Bootstrapping: BOOTSTRP (part of the Stats toolbox). You must write an appropriate function call for your specific model.
Some Other "Data Mining" Tools
- Two big categories of problems:
  - supervised learning (we have a response Y and predictors x and want to model Y as a function of x)
  - unsupervised learning (we have no Y; just an x, and we want to find relationships among the elements of x)
- IEMS 304 covered the foundations and primary tools of supervised learning. There are many more advanced methods, but most are extensions of what we have already covered
- Examples of unsupervised learning:
  - clustering
  - association rules