8/11/2019 Notes_7 IE
1/134
Nonlinear Regression
Reference reading:
pp. 27-33 and 564-567 for MLE
Ch. 13. Except you can just skim 13.2 (optimization algorithms) and 13.6 (neural networks)
pp. 453-457 for regression trees
pp. 458-464, 529-536, and Lab 5 for bootstrapping
Maximum Likelihood Estimation (MLE)
Given a parametric model for a set of data, in general, how do you devise a good way to estimate the parameters? i.e., what criterion do you optimize?
MLE is a very general principle on which many parametric model estimators are derived
Many familiar estimators from a first course in statistics turn out to be MLEs
The model fitting criteria in linear and logistic regression can both be derived as applications of MLE; likewise for many supervised learning models
When a researcher proposes a new model for a problem, they usually start with the MLE principle to fit the model
In statistical modeling software, choosing a method of model fitting is often related to choosing a statistical model for which the method of fitting is the corresponding MLE
The MLE Principle
Suppose you have some parametric model to represent your data, with parameters denoted by θ = {θ1, θ2, . . ., θp}, and you want to fit the model (i.e., estimate the parameters) based on a random sample of data Y = {y1, y2, . . ., yn}.
Denote the joint distribution of the data by f(y1, y2, . . ., yn; θ1, θ2, . . ., θp), or f(Y; θ) for short. We call f(Y; θ):
the prob. distribution, when viewed as a function of Y, for fixed values of θ, or
the likelihood function, when viewed as a function of θ for the fixed values of Y in your actual data sample.
Basic MLE Principle: Take the estimates of θ to be the values that maximize the likelihood function f(Y; θ). We call these values the MLE of θ.
Example: Estimating μ and σ for a Normal Pop.
data: Y = {y1, y2, . . ., yn} (suppose i.i.d. sample)
model: Yi ~ NID(μ, σ²)
parameters: θ = {μ, σ} (p = 2)
marginal pdf of Yi: f(yi; μ, σ) = (2πσ²)^(−1/2) exp[ −(yi − μ)² / (2σ²) ]
joint pdf of Y1, . . ., Yn (aka likelihood function):
f(Y; μ, σ) = (2π)^(−n/2) σ^(−n) exp[ −(1/(2σ²)) Σ_{i=1}^n (yi − μ)² ]
MLEs of μ and σ are the values that maximize f(Y; μ, σ)
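For this normal example the MLEs have closed forms (sample mean, and the 1/n-divisor variance). A minimal sketch, in Python rather than the slides' R, checking that the closed-form estimates do maximize the log-likelihood on simulated data:

```python
import math
import random

def normal_mle(y):
    # closed-form MLEs: mu_hat = sample mean; sigma2_hat uses 1/n, not 1/(n-1)
    n = len(y)
    mu_hat = sum(y) / n
    sigma2_hat = sum((yi - mu_hat) ** 2 for yi in y) / n
    return mu_hat, math.sqrt(sigma2_hat)

def log_likelihood(y, mu, sigma):
    # log of the joint normal pdf f(Y; mu, sigma) from the slide
    n = len(y)
    return (-n / 2 * math.log(2 * math.pi) - n * math.log(sigma)
            - sum((yi - mu) ** 2 for yi in y) / (2 * sigma ** 2))

random.seed(1)
y = [random.gauss(5.3, 0.4) for _ in range(200)]
mu_hat, sigma_hat = normal_mle(y)
# nudging either parameter away from the MLE can only lower the likelihood
assert log_likelihood(y, mu_hat, sigma_hat) >= log_likelihood(y, mu_hat + 0.01, sigma_hat)
assert log_likelihood(y, mu_hat, sigma_hat) >= log_likelihood(y, mu_hat, sigma_hat + 0.01)
```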
Example: Estimating the Coefficients in Logistic Regression
data: Y = {y1, y2, . . ., yn} (suppose i.i.d. sample)
model: for i = 1, 2, . . ., n, Yi ~ Bernoulli with
pi = Pr(Yi = 1 | xi) = exp(β0 + β1 xi1 + . . . + βk xik) / [1 + exp(β0 + β1 xi1 + . . . + βk xik)] = exp(xiᵀβ) / [1 + exp(xiᵀβ)]
where xi = [1, xi1, . . ., xik]ᵀ (k predictor variables)
parameters: β = [β0, β1, . . ., βk]ᵀ (p = k+1)
marginal distribution of Yi: f(yi; β) = pi^yi (1 − pi)^(1−yi), i.e., pi if yi = 1 and 1 − pi if yi = 0
joint distribution of Y1, . . ., Yn:
f(Y; β) = Π_{i=1}^n pi^yi (1 − pi)^(1−yi) = Π_{i=1}^n exp(yi xiᵀβ) / [1 + exp(xiᵀβ)]
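The Bernoulli likelihood above can be maximized numerically. A hedged sketch (Python for illustration; toy data and the plain gradient-ascent updates are my own, not from the slides) of fitting a one-predictor logistic model by maximizing the log-likelihood:

```python
import math

def log_lik(b0, b1, x, y):
    # sum_i [ y_i*log(p_i) + (1-y_i)*log(1-p_i) ] with logistic p_i
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def fit_logistic(x, y, lr=0.05, steps=5000):
    # simple gradient ascent on the log-likelihood (real software uses
    # Newton-type iteratively reweighted least squares)
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p          # d logL / d b0
            g1 += (yi - p) * xi   # d logL / d b1
        b0 += lr * g0 / len(x)
        b1 += lr * g1 / len(x)
    return b0, b1

x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(x, y)
assert b1 > 0  # Pr(Y=1) increases with x in this toy data
```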
[Figure: scatterplot of income vs. car_age, points coded by the binary response y = 0/1]
Nonlinear Regression Models and Nonlinear Least Squares
A general form of nonlinear regression model is Yi = g(xi, θ) + εi, where:
Yi: response for observation i
xi: vector of predictors for observation i
θ: vector of model parameters
g(xi, θ): some parametric nonlinear function
εi: zero-mean random error for observation i
We will see shortly that if the random errors are Gaussian and independent of x, the MLE of θ is just nonlinear least squares
Example: Manufacturing Learning Curve
Y = relative efficiency of operation
x1 = facility indicator: 0 for the A facility (older), 1 for the B facility (modern)
x2 = week #
If there were only one facility, and the data looked like below, how would you model it?
[Figure: scatterplot of y vs. x2, with y rising toward 1.0]
Discussion Points and Questions
If facilities A and B had different asymptotic efficiencies as in Fig. 13.5, how would you modify the model?
If facilities A and B had different exponential rates, how would you modify the model?
If the objective was to determine if the two facilities had different asymptotic efficiencies, how could you do this?
Are the formulae for t-tests, standard errors, etc. in a linear regression still valid? If not, how would you calculate and use the analogous quantities in nonlinear regression?
MLE for General Nonlinear Regression Model with Normal Errors
Yi = g(xi, θ) + εi with error distribution: εi ~ NID(0, σ²)
view the xi's as deterministic, not random
write Yi = μi + εi with μi ≡ g(xi, θ) (to simplify notation), so Yi ~ NID(μi, σ²)
marginal pdf of Yi: f(yi; θ, σ) = (2πσ²)^(−1/2) exp[ −(yi − μi)² / (2σ²) ]
joint pdf of Y1, . . ., Yn (aka likelihood function):
f(Y; θ, σ) = (2π)^(−n/2) σ^(−n) exp[ −(1/(2σ²)) Σ_{i=1}^n (yi − μi)² ]
MLE of θ: Choose θ to maximize f(Y; θ, σ)
i.e., minimize Σ_{i=1}^n (yi − μi)² = Σ_{i=1}^n (yi − g(xi, θ))²
i.e., the MLE of θ for the general nonlinear regression model with i.i.d. Gaussian errors (that are independent of x) is "nonlinear least squares"
In general, we need optimization software to fit the model
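A minimal sketch of nonlinear least squares (Python rather than the slides' R; the one-parameter exponential model and the crude grid-search optimizer are illustrative stand-ins for the optimizers real software uses):

```python
import math

def sse(theta, x, y):
    # SSE(theta) = sum_i (y_i - g(x_i, theta))^2 for g(x, theta) = 1 - exp(theta*x)
    return sum((yi - (1 - math.exp(theta * xi))) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 4, 8, 16]
true_theta = -0.3
y = [1 - math.exp(true_theta * xi) for xi in x]  # noise-free for clarity

# crude 1-D optimizer: grid search over theta in (-1, 0)
# (real software uses Gauss-Newton / quasi-Newton methods instead)
grid = [-1.0 + 0.001 * j for j in range(1000)]
theta_hat = min(grid, key=lambda t: sse(t, x, y))
assert abs(theta_hat - true_theta) < 0.01
```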
Summary of Steps in General MLE
1) Write out the form of the statistical model that you are using to represent the data
2) Find the marginal distribution of each individual observation Yi (for regression problems the xi's are not treated as random, so you only need to find the marginal distribution of the Yi's given the xi's)
3) From the marginal distributions in step (2), find the joint distribution f(Y; θ) of the entire set of data Y
4) If tractable, find an analytical expression for the θ that maximizes the likelihood f(Y; θ). Otherwise, use optimization software to minimize −log f(Y; θ)
5) The MLE of θ is the minimizer in step (4), and the Hessian can be used to assess statistical uncertainty (next topic)
Relevant R Functions and Packages
nlm(): minimize a general nonlinear function, such as implementing MLE for a nonstandard model (but most specific statistical models in R have built-in MLE implementation)
nls(): nonlinear least squares
boot: bootstrapping package
cross-validation is built into many R modeling functions (as an optional argument or as a separate function like cv.tree or cv.glm), or it is not hard to write your own function
R commands for fitting learning curve example using the general optimizer nlm()
MLC
R commands for fitting learning curve example using the nonlinear LS function nls()
MLC
Statistical Uncertainty in Supervised Learning
With nonlinear regression models, the formulae for assessing statistical uncertainty in linear regression (e.g., F-tests and t-tests for significance of predictors, SEs and CIs for parameters, PIs and CIs for new observations, etc.) do not apply directly
Question: Why might we want to calculate SEs, CIs/PIs, do hypothesis tests, etc.?
For some nonlinear models, we can use approximate asymptotic analytical results (valid for sufficiently large sample size n) to assess statistical uncertainty
Fortunately, we have alternative computational approaches that apply to any nonlinear model:
Cross-validation for deciding which models are the best (which implies which terms belong in the model, among other things)
Bootstrap resampling (or bootstrapping for short) for SEs and CIs on the parameters and CIs and PIs on new observations
Overview of Bootstrapping
You are given a sample of data of size n observations.
You have estimated some parameter(s) θ (call the estimate θ̂)
Objective: Estimate the sampling distribution of θ̂ and quantities like SE(θ̂) that are derived from it.
Problem: Hypothetically, if we knew the entire population, we could consider using simulation to draw many random samples (each of size n) from the population and calculate a different θ̂ for each sample. We could construct a histogram of all the θ̂'s and take their sample standard deviation to be an estimate of SE(θ̂) for the single real sample. The problem is we only have the single sample and not the entire population.
Example: How you could use regular simulation to find the SE of a sample average, if you know the underlying distribution (for example, normal)
Generate say 10,000 samples, each of size n = 20, from an N(5.3, 0.4²) distribution
Calculate the averages { ȳ_sim^(j) : j = 1, 2, . . ., 10,000 } for the 10,000 replicates
Take SE(ȳ) ≈ sqrt[ (1/10,000) Σ_{j=1}^{10,000} ( ȳ_sim^(j) − ȳ̄_sim )² ], where ȳ̄_sim is the average of the 10,000 replicate averages
[Table: the original sample y (n = 20, average 5.30, SD 0.40) alongside four of the simulated samples y(1)-y(4), with averages 5.30, 5.21, 5.25, 5.32]
Example: How you could use bootstrapping to find the SE of a sample average, if you do NOT know the underlying distribution
Generate say 10,000 bootstrap samples, each of size n = 20, from your one real sample
Calculate the averages { ȳ^(b) : b = 1, 2, . . ., 10,000 } for the 10,000 replicates
Take SE(ȳ) ≈ sqrt[ (1/10,000) Σ_{b=1}^{10,000} ( ȳ^(b) − ȳ̄ )² ], where ȳ̄ is the average of the 10,000 bootstrap averages
[Table: the original sample y (n = 20, average 5.30, SD 0.40) alongside four bootstrap samples y(1)-y(4), each drawn with replacement from y, with averages 5.26, 5.32, 5.22, 5.35]
Bootstrapping overview continued
Solution: Make a pretend population that consists of your original sample of n observations, copied over and over, an infinite number of times. Then draw many "bootstrap" random samples (each of size n) from the pretend population and calculate a different θ̂ for each sample. You can construct a histogram of all the θ̂'s, take their sample standard deviation to be an estimate of SE(θ̂), etc.
How this is implemented: You do not have to actually copy your original sample over and over. The above construction of each bootstrap sample is equivalent to drawing a random sample of size n from the original sample of data (with replacement).
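The resampling step above can be sketched in a few lines (Python for illustration; B and the tolerance check are arbitrary choices):

```python
import math
import random

def bootstrap_se_of_mean(sample, B=2000, seed=0):
    # each bootstrap sample is a size-n draw WITH replacement from the data
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(B):
        boot = [rng.choice(sample) for _ in range(n)]
        means.append(sum(boot) / n)
    mbar = sum(means) / B
    return math.sqrt(sum((m - mbar) ** 2 for m in means) / (B - 1))

random.seed(2)
sample = [random.gauss(5.3, 0.4) for _ in range(20)]
se = bootstrap_se_of_mean(sample)

# for the sample average, the bootstrap SE should be close to s/sqrt(n)
s = math.sqrt(sum((v - sum(sample) / 20) ** 2 for v in sample) / 19)
assert abs(se - s / math.sqrt(20)) < 0.03
```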
A Different Example (that has nothing to do with nonlinear regression)
Pop0 = population of all grains
Pop1 = population of all grains with thickness < 0.3 and equivalent diameter > 0.6
mR = mean aspect ratio for all grains in Pop1
f = (area of all grains in Pop1) / (area of all grains in Pop0) = fraction projected area of grains in Pop1
The patent claim is violated if f > 0.5 AND mR > 8
Some Details: Bootstrapping in Nonlinear Regression
You have a sample of n observations {(xi, yi), i = 1, . . ., n} of a response variable and a set of predictor variables.
You fit a nonlinear regression model to the data to estimate a set of parameters θ
Let θ denote one of the parameters of interest and θ̂ its estimate.
Objective: Estimate the sampling distribution of θ̂, its standard error, a confidence interval for θ, etc.
To do this, follow the steps of the bootstrap procedure on the subsequent slides
Steps of the Bootstrap Procedure
1) Generate a "bootstrap" sample (with replacement) of n observations from {(xi, yi), i = 1, . . ., n}. Denote the bootstrap sample by {(xi^b, yi^b), i = 1, . . ., n}
2) Fit the same type of regression model (with the same set of parameters θ and parameter θ of special interest) to the bootstrapped sample. Denote the estimates for the bootstrapped sample by θ̂^b
3) Pick a large number B (e.g., B = 10,000), and repeat Steps (1) and (2) a total of B times, which produces {θ̂^b : b = 1, . . ., B}
Steps of the Bootstrap Procedure, continued
4) Construct a histogram of {θ̂^b : b = 1, . . ., B} and calculate:
θ̄ = (1/B) Σ_{b=1}^B θ̂^b : the average of all bootstrapped estimates
SE(θ̂) = sqrt[ (1/(B−1)) Σ_{b=1}^B (θ̂^b − θ̄)² ] : the standard error of θ̂
θ̂_{α/2} = upper α/2 quantile of the sample distribution of {θ̂^b : b = 1, . . ., B}
θ̂_{1−α/2} = lower α/2 quantile of the sample distribution of {θ̂^b : b = 1, . . ., B}
Steps of the Bootstrap Procedure, continued
5) A crude 1−α confidence interval for θ is: [ θ̂ − z_{α/2} SE(θ̂), θ̂ + z_{α/2} SE(θ̂) ]
6) A better 1−α confidence interval for θ is the reflected interval: [ 2θ̂ − θ̂_{α/2}, 2θ̂ − θ̂_{1−α/2} ]
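A hedged Python sketch of both intervals from this slide (the toy bootstrap estimates and the original estimate 1.05 are made up; quantile indexing is one simple convention among several):

```python
import math

def crude_ci(theta_hat, boot, z=1.96):
    # theta_hat +/- z * SE, with SE the sample SD of the bootstrap estimates
    B = len(boot)
    mbar = sum(boot) / B
    se = math.sqrt(sum((b - mbar) ** 2 for b in boot) / (B - 1))
    return theta_hat - z * se, theta_hat + z * se

def reflected_ci(theta_hat, boot, alpha=0.05):
    # "basic" bootstrap interval: reflect the quantiles about theta_hat
    s = sorted(boot)
    lo_q = s[int((alpha / 2) * len(s))]           # lower alpha/2 quantile
    hi_q = s[int((1 - alpha / 2) * len(s)) - 1]   # upper alpha/2 quantile
    return 2 * theta_hat - hi_q, 2 * theta_hat - lo_q

boot = [1.0 + 0.001 * j for j in range(101)]  # toy bootstrap estimates
lo1, hi1 = crude_ci(1.05, boot)
lo2, hi2 = reflected_ci(1.05, boot)
assert lo1 < 1.05 < hi1 and lo2 < 1.05 < hi2
```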
Example CI Calculations for θ0 for the Manu. Learning Curve
(from the left-most histogram two slides prior: θ̂ = 1.016, θ̄ = 1.015, SE(θ̂) = 0.004, θ̂_{.025} = 1.0231, θ̂_{.975} = 1.007)
Crude 95% CI: θ̂ ± z_{.025} SE(θ̂) = 1.016 ± 1.96(0.004) = (1.008, 1.024)
Reflected 95% CI: ( 2θ̂ − θ̂_{.025}, 2θ̂ − θ̂_{.975} ) = ( 2(1.016) − 1.0231, 2(1.016) − 1.007 ) = (1.009, 1.025)
Discussion Points and Questions
What is the difference between the two CIs (crude versus reflected) on the previous slide?
In general, when would the two confidence intervals differ?
What are the effects of increasing B on the bootstrapped histogram of a parameter estimate? Would the histogram become tighter?
What are the effects of increasing n on the bootstrapped histogram of a parameter estimate? Would the histogram become tighter?
Why must n for each bootstrapped sample be the same as n for the real sample?
R commands for bootstrapping parameter SEs/CIs for the manufacturing learning curve
library(boot) #need to load the boot package
MLC
> plot(MLCboot,index=1)
> boot.ci(MLCboot,conf=c(.9,.95,.99),index=1,type=c("norm","basic"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = MLCboot, conf = c(0.9, 0.95, 0.99), type = c("norm",
"basic"), index = 1)
Intervals :
Level Normal Basic
90% ( 1.010, 1.020 )   ( 1.011, 1.020 )
95% ( 1.010, 1.021 )   ( 1.010, 1.022 )
99% ( 1.008, 1.023 )   ( 1.009, 1.023 )
[Figure: histogram of t* (density scale) and a normal Q-Q plot of t* against quantiles of the standard normal]
Discussion Points and Questions
1) In boot.ci, type = "norm" gives our crude CI based on the SE and the normal percentiles, but translated by subtracting out the estimated Bias (taken to be the bootstrap average minus the original parameter estimate); type = "basic" gives the better CI obtained by reflecting the percentiles.
2) How can we determine if there is statistically significant evidence that the asymptotic relative efficiencies of the two manufacturing facilities differ?
3) What is a 95% CI on the asymptotic relative efficiency of the older facility (x1 = 0)?
4) What is a 95% CI on the asymptotic relative efficiency of the newer facility (x1 = 1)?
5) In general, given the covariance matrix Σ of a random vector Z, the variance of the linear combination aᵀZ is Var(aᵀZ) = aᵀΣa
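The identity Var(aᵀZ) = aᵀΣa in point (5) can be checked numerically. A hedged sketch (Python; the 2-d covariance matrix and the vector a are hypothetical values chosen for the demo):

```python
import math
import random

random.seed(3)
N = 200000
# construct Z = (Z1, Z2) with Var(Z1)=1, Var(Z2)=4, Cov(Z1,Z2)=1
zs = []
for _ in range(N):
    u1, u2 = random.gauss(0, 1), random.gauss(0, 1)
    z1 = u1
    z2 = u1 + math.sqrt(3) * u2  # Var = 1 + 3 = 4, Cov with z1 = 1
    zs.append((z1, z2))

a = (2.0, -1.0)
sigma = [[1.0, 1.0], [1.0, 4.0]]
theory = sum(a[i] * sigma[i][j] * a[j] for i in range(2) for j in range(2))

w = [a[0] * z1 + a[1] * z2 for z1, z2 in zs]
mean_w = sum(w) / N
var_w = sum((v - mean_w) ** 2 for v in w) / (N - 1)
assert abs(var_w - theory) < 0.1  # Monte Carlo estimate matches a'Sigma a
```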
Comments on Bootstrapping
Comparison of class versus textbook notation:
In R, use the "boot" command in the "boot" package
In Matlab, use the "bootstrp" command in the stats toolbox
Class / KNN text:
bootstrap parameter estimate: θ̂^b / b1*
upper α/2 percentile of distribution of bootstrapped parameters: θ̂_{α/2} / b1*(1 − α/2)
lower α/2 percentile of distribution of bootstrapped parameters: θ̂_{1−α/2} / b1*(α/2)
bootstrap sample of data: {(xi^b, yi^b), i = 1, . . ., n} / {(xi*, yi*), i = 1, . . ., n}
Some Common Blackbox Nonlinear Regression and Classification Models
If you have knowledge of the structure of the relationship between Y and x, then the best approach is to use it (e.g., if you think it is a linear, exponential, quadratic, etc. relationship, then fit that model)
For many data sets (especially large "data mining" applications), we might doubt a linear model will fit but have no idea of the structure of the nonlinearities.
In this case, unless there are only a few predictors, polynomial (e.g., quadratic) models are not the preferred next step to try beyond linear models
Why not?
There are many blackbox nonlinear modeling approaches
We will cover some common ones (neural networks, CART models, nearest neighbors) that span the spectrum of methods
Almost all can be used equally well for either regression or classification
Neural Networks
Clever original idea and memorable name; became very popular in the 1980s and 1990s.
They have evolved to have less resemblance to how the human brain processes information (but better effectiveness at modeling nonlinear relationships in complicated data sets)
To fit a neural network model (and all of the other blackbox models), the training data must be available in the same format as for linear/logistic regression:
A 2D array of observations
Each column is a different variable; each row a different case
One column is the response variable (Y) and the other columns are any number of predictor variables (X's)
The neural network hidden variables (H's) are internal variables that you do not enter or even care about
Standard Graphical Depiction of a Neural Network
Mathematical Definition of What a Neural Network Model Really Is
each "node" represents an activation function (labeled as the function output, with function input a linear combo of previous-layer function outputs)
X's: input (i.e., predictor) variables, in "input layer"
Y: output (i.e., response) variable, in "output layer"
H's: internal dummy variables, in "hidden layer"
α's and β's: model parameters, to be estimated
the NN model:
for m = 1, 2, . . ., M,
Hm = exp(α_{m,0} + α_{m,1}x1 + . . . + α_{m,k}xk) / [1 + exp(α_{m,0} + α_{m,1}x1 + . . . + α_{m,k}xk)]
Y = exp(β0 + β1H1 + . . . + βM HM) / [1 + exp(β0 + β1H1 + . . . + βM HM)] + ε
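The two equations above are just nested logistic functions. A minimal forward-pass sketch (Python for illustration; the parameter values are made up, not fitted):

```python
import math

def sigmoid(z):
    # logistic activation: exp(z)/(1+exp(z)) = 1/(1+exp(-z))
    return 1.0 / (1.0 + math.exp(-z))

def nn_predict(x, alphas, betas):
    # alphas: M rows [a_m0, a_m1, ..., a_mk] for the hidden nodes H_m
    # betas:  [b_0, b_1, ..., b_M] for the logistic output node
    H = [sigmoid(a[0] + sum(aj * xj for aj, xj in zip(a[1:], x)))
         for a in alphas]
    return sigmoid(betas[0] + sum(b * h for b, h in zip(betas[1:], H)))

alphas = [[0.1, 1.0, -0.5], [-0.2, 0.3, 0.8]]  # M = 2 hidden nodes, k = 2 inputs
betas = [0.0, 1.5, -1.0]
p = nn_predict([0.4, -0.7], alphas, betas)
assert 0.0 < p < 1.0  # a logistic output node always lands in (0, 1)
```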
Neural Network Activation Functions
For classification, it is common to use the same sigmoidal (logistic) activation function for each node:
h(z) = exp(z) / [1 + exp(z)] = 1 / [1 + exp(−z)]
where z = linear combo of variables from previous layer
For regression, it is usually preferable to use sigmoidal activation functions for all hidden nodes and a linear activation function [i.e., h(z) = z] for the output layer nodes:
Y = β0 + β1H1 + . . . + βM HM + ε
An S-shaped function with multivariate input
Recall that this is what the S-shaped logistic function looks like when there are multiple input variables
Discussion Points and Questions
Y is an S-shaped (or sometimes linear) function of the dummy variables (H's), which are in turn S-shaped functions of the predictors (X's)
When you combine them together, substituting for the H's to get Y as a function of the X's, you can think of the neural network model as Y = g(x, θ) + ε for some (very messy) g(x, θ), with θ = {all α's and β's}
What kind of functional X-Y relationships can you capture with the neural network model structure?
Fitting A Neural Network Model
1) Standardize predictors via x_ij → (x_ij − x̄_j) / s_j
x̄_j, s_j = average, stdev of jth predictor (jth column)
2) If using a logistic output activation function, scale the response to the interval [0,1] via y_i → (y_i − y_min) / (y_max − y_min)
Why do we need to do this rescaling for a logistic output activation function?
Fitting A Neural Network Model, continued
3) Choose:
# hidden layers
# nodes in each hidden layer
output activation function (usually linear or logistic)
other options and tuning parameters (e.g., λ)
4) Software estimates parameters to minimize (nonlinear LS with shrinkage):
Σ_{i=1}^n (y_i − g(x_i, θ))² + λ [ Σ_{m=1}^M Σ_{j=0}^k α_{m,j}² + Σ_{m=0}^M β_m² ]
g(x_i, θ) denotes the neural network response prediction
where
θ = {all α's and β's}
λ = user-chosen shrinkage parameter
g(x_i, θ) = exp(β0 + β1H_{i,1} + . . . + βM H_{i,M}) / [1 + exp(β0 + β1H_{i,1} + . . . + βM H_{i,M})]
H_{i,m} = exp(α_{m,0} + α_{m,1}x_{i,1} + . . . + α_{m,k}x_{i,k}) / [1 + exp(α_{m,0} + α_{m,1}x_{i,1} + . . . + α_{m,k}x_{i,k})]
The shrinkage term is analogous to the term that we add to the SSE in ridge regression
Why do we need to include the shrinkage term when fitting a neural network, even if we have no multicollinearity?
Example: Predictive Modeling of CPU Performance
Data in cpus.txt, which is the same as the cpus data in the MASS package
209 cases, with 9 variables and 6 predictor variables
perf is the response, which is CPU performance. Ignore estperf, which was another author's estimated performance.
The six numerical predictors are cycle time (nanoseconds), cache size (Kb), min and max main memory size (Kb), and min and max number of channels. See V&R for additional discussion
The objective is to learn the predictive relationship between CPU performance and the predictor variables
Example with a bigger data set coming up shortly
Neural Network Modeling of CPU data
#######R code for reading in cpus data set, taking log(response) and then converting to [0,1] interval, and standardizing predictors##############
CPUS
Matrix scatterplot of transformed cpus data
[Figure: matrix scatterplot of the transformed cpus variables syct, mmin, mmax, cach, chmin, chmax, and perf]
CPUS Example Continued
#############Fit a neural network model to the CPUS1 data####################
library(nnet)
cpus.nn1
Discussion Points and Questions
Why do we need to standardize the predictors (and the response variable when using a linear output activation function)?
How can we get r² for this example (the nnet function in R does not spit it out)?
Which predictor variables appear to be the most important, and what R output do we look at to determine this?
What value of λ will give us the smallest training SSE?
How can we decide the best value of λ?
CPUS Example Continued
#######A function to determine the indices in a CV partition##################
CVInd
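The CVInd helper referenced above (its R body was lost in extraction) just partitions the row indices into K random folds. A hedged Python analogue of that idea, with hypothetical function and argument names:

```python
import random

def cv_ind(n, K, seed=0):
    # randomly partition indices 0..n-1 into K folds of near-equal size
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::K] for k in range(K)]  # fold k takes every K-th shuffled index

folds = cv_ind(10, 3)
assert len(folds) == 3
# every observation lands in exactly one fold: a true partition
assert sorted(i for f in folds for i in f) == list(range(10))
```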
CPUS Example Continued
##Now use the same CV partition to compare Neural Net and linear reg models###
Ind
Discussion Points and Questions
The best value of λ is the value that results in the smallest CV SSE (or equivalently, the largest CV r², smallest CV SD(e), etc.).
How can we decide the best number of hidden layer nodes?
Why should we use the same CV partition when comparing two models?
What are the pros and cons of n-fold CV versus K-fold CV for some smaller K, e.g., 3, 5, or 10?
Example: Predictive Modeling of Income Data
Data in adult_train.csv is from the 1994 US Census (also see http://archive.ics.uci.edu/ml/datasets/Census+Income)
32561 cases, with 15 variables. This is a small sample from the US census with 15 potentially relevant variables. Each row represents a "similar" population segment with weight given by "fnlwgt"
income has been converted to a binary categorical variable (≤50K vs. >50K) with roughly a 75%/25% population split
Later we will fit predictive models to classify income based on the other variables (classification). Here, the objective is to predict the number of hours per week spent working based on the other variables (regression)
This is already a very cleaned data set, but we may need to do a little additional cleaning
What should we do about the missing "?" values?
The First Few Rows
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income
39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K
Read in the Data
XX
Some Preliminary Exploratory Analyses
##exploring individual variables
par(mfrow=c(2,3)); for (i in c(1,5,11,12,13)) hist(XX[[i]],xlab=names(XX)[i]); plot(XX[[15]])
par(mfrow=c(1,1)); plot(XX[[2]],cex.names=.7)
for (i in c(2,4,6,7,8,9,10,14,15)) print(table(XX[[i]])/nrow(XX))
Should we be concerned with anything here or do any further cleaning?
[Figure: histograms of age, education.num, capital.gain, capital.loss, and hours.per.week, plus a barplot of the binary income variable]
Some Preliminary Exploratory Analyses
##exploring pairwise predictor/response relationships
par(mfrow=c(2,1))
plot(jitter(XX$age,3),jitter(XX$hours.per.week,3),pch=16,cex=.5)
plot(jitter(XX$education.num,3),jitter(XX$hours.per.week,3),pch=16,cex=.5)
par(mfrow=c(1,1))
barplot(tapply(XX$hours.per.week,XX$education,mean),ylim=c(30,50),cex.names=.7,xpd=F)
for (i in c(2,4,6,7,8,9,14,15)) {print(tapply(XX$hours.per.week,XX[[i]],mean)); cat("\n")}
Some points to consider regarding correlation versus functional dependence (points that apply to ANY regression analysis):
If hours.per.week appears correlated with another variable, it does not mean that hours.per.week has a functional dependence on that variable
The two could appear correlated because they both depend on another variable (either one of the existing variables or an unrecorded nuisance variable)
If you have recorded enough nuisance variables, a multiple regression analysis can sometimes distinguish which correlations are truly due to a functional dependence
If your goal is pure prediction (and not explanatory), does it matter?
[Figure: barplot of mean hours.per.week (y-axis from 30 to 50) by education category: 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, Bachelors, Doctorate, HS-grad, Masters, Preschool, Prof-school, Some-college]
A Typical Next Step in Predictive Modeling
##linear regression with all predictors included
Inc.lm
Some Typical Next Steps
##linear regression including interactions
Inc.lm.full
Now Try a Neural Network Model
##Neural network model
library(nnet)
Inc.nn1
Multi-Response Neural Networks
Neural networks also apply to the situation in which we have more than one (say K) response variables
We handle this by including K nodes in the output layer (see the following slide)
This is different than fitting K separate neural networks, one for each response, because the K responses share the same hidden layer node functions
This is generally more effective than fitting K separate neural network models if the response variables have similar functional dependencies on the predictors. If the responses have completely different dependencies on the predictors, then you are better off fitting K separate neural network models
Graphical Depiction of Neural Network with K Response Variables
Neural Networks for Classification
The most common application of multi-response neural networks is for classification when we have a categorical response with K categories (aka classes). Note that this also applies to binary responses (K = 2)
To handle this (most software does this internally), make a K-length 0/1 response vector, e.g., for the fgl data:
Type  y1 y2 y3 y4 y5 y6
WinF  1 0 0 0 0 0
WinNF 0 1 0 0 0 0
Veh   0 0 1 0 0 0
Con   0 0 0 1 0 0
Tabl  0 0 0 0 1 0
Head  0 0 0 0 0 1
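The encoding in the table above (often called one-hot encoding) is straightforward to implement. A minimal Python sketch using the six fgl class labels:

```python
# the six glass types from the fgl data, in the column order y1..y6
CLASSES = ["WinF", "WinNF", "Veh", "Con", "Tabl", "Head"]

def one_hot(label):
    # K-length 0/1 vector with a single 1 marking the observed class
    return [1 if c == label else 0 for c in CLASSES]

assert one_hot("WinF") == [1, 0, 0, 0, 0, 0]
assert one_hot("Veh") == [0, 0, 1, 0, 0, 0]
assert sum(one_hot("Head")) == 1  # exactly one category is "on"
```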
Example: Predicting Glass Type in Forensics
Data in fgl.txt, which is the same as the fgl data in the MASS package. See V&R for additional discussion
214 cases, with 9 predictor variables and a categorical response
Each row contains the results of an analysis of a fragment of glass
type is the response, one of six different glass types: window float glass (WinF: 70 rows), window non-float glass (WinNF: 76 rows), vehicle window glass (Veh: 17 rows), containers (Con: 13 rows), tableware (Tabl: 9 rows) and vehicle headlamps (Head: 29 rows).
Eight of the predictors are the chemical composition of the fragment, and the ninth (RI) is the refractive index
The objective is to train a predictive model to predict the glass type based on a fragment of the glass, for forensic purposes
Read the Data and Transform some Variables
######Read data, convert response to binary, and standardize predictors#####
FGL
First Few Rows of fgl.txt data
RI Na Mg Al Si K Ca Ba Fe type
3.01 13.64 4.49 1.1 71.78 0.06 8.75 0 0 WinF
-0.39 13.89 3.6 1.36 72.73 0.48 7.83 0 0 WinF
-1.82 13.53 3.55 1.54 72.99 0.39 7.78 0 0 WinF
-0.34 13.21 3.69 1.29 72.61 0.57 8.22 0 0 WinF
-0.58 13.27 3.62 1.24 73.08 0.55 8.07 0 0 WinF
-2.04 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 WinF
-0.57 13.3 3.6 1.14 73.09 0.58 8.17 0 0 WinF
-0.44 13.15 3.61 1.05 73.24 0.57 8.24 0 0 WinF
1.18 14.04 3.58 1.37 72.08 0.56 8.3 0 0 WinF
-0.45 13 3.6 1.36 72.99 0.57 8.4 0 0.11 WinF
-2.29 12.72 3.46 1.56 73.2 0.67 8.09 0 0.24 WinF
-0.37 12.8 3.66 1.27 73.01 0.6 8.56 0 0 WinF
-2.11 12.88 3.43 1.4 73.28 0.69 8.05 0 0.24 WinF
-0.52 12.86 3.56 1.27 73.21 0.54 8.38 0 0.17 WinF
-0.37 12.61 3.59 1.31 73.29 0.58 8.5 0 0 WinF
-0.39 12.81 3.54 1.23 73.24 0.58 8.39 0 0 WinF
-0.16 12.68 3.67 1.16 73.11 0.61 8.7 0 0 WinF
3.96 14.36 3.85 0.89 71.36 0.15 9.15 0 0 WinF
1.11 13.9 3.73 1.18 72.12 0.06 8.89 0 0 WinF
-0.65 13.02 3.54 1.69 72.73 0.54 8.44 0 0.07 WinF
Mathematical Definition of K-Class Neural Network Model
for m = 1, 2, . . ., M, (same as before)
Hm = exp(α_{m,0} + α_{m,1}x1 + . . . + α_{m,k}xk) / [1 + exp(α_{m,0} + α_{m,1}x1 + . . . + α_{m,k}xk)]
for l = 1, 2, . . ., K, (multinomial logistic model)
Pr(Yl = 1 | x) = exp(β_{l,0} + β_{l,1}H1 + . . . + β_{l,M}HM) / Σ_{j=1}^K exp(β_{j,0} + β_{j,1}H1 + . . . + β_{j,M}HM)
Note: For K = 2, this reduces to:
Pr(Y1 = 1 | x) = exp(β0 + β1H1 + . . . + βM HM) / [1 + exp(β0 + β1H1 + . . . + βM HM)]
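The multinomial-logistic output layer above is a softmax over the K linear scores. A hedged Python sketch (toy score values; the K = 2 check mirrors the reduction noted on the slide):

```python
import math

def softmax(z):
    # multinomial logistic: exp(z_l) / sum_j exp(z_j)
    exps = [math.exp(v) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 0.5, -1.0])
assert abs(sum(probs) - 1.0) < 1e-12  # class probabilities sum to 1
assert probs[0] == max(probs)         # the largest score wins

# For K = 2 with the second score fixed at 0, softmax is the binary logistic:
z = 0.7
p = softmax([z, 0.0])[0]
assert abs(p - 1.0 / (1.0 + math.exp(-z))) < 1e-12
```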
Fitting A Neural Network Model for Classification
1)-3) The first three steps are the same as before
4) For classification, software estimates parameters to minimize (negative log-likelihood plus shrinkage penalty):
−Σ_{i=1}^n Σ_{l=1}^K y_{i,l} log P̂r(Y_{i,l} = 1 | x_i) + λ [ Σ_{all m} Σ_{all j} α_{m,j}² + Σ_{all l} Σ_{all m} β_{l,m}² ]
5) CV should be used to choose any tuning parameters (λ, number of nodes, etc.)
Fitting a Neural Net Classifier for the FGL Data (binary response case)
#############Fit a neural network classification model to the FGL1 data######
library(nnet)
fgl.nn1
response vs. predicted probability for fgl data
[Figure: scatterplot of jitter(y, 0.05) vs. phat, both axes ranging from 0.0 to 1.0]
Using CV to Compare Models for the FGL Data
Ind
Discussion Points and Questions
What is the best neural network model, in terms of the tuning parameters (decay, size, etc.)?
What is the best CV misclassification rate?
Is this good?
What other model(s) would you compare to the bestneural network?
Classification for the 6-Class FGL Response
#############Same, but use the original 6-category response######
library(nnet)
fgl.nn1
Neural Network Classification of Income Data
Reconsider the data in adult_train.csv
Instead of predicting the number of hours (regression), we will now predict the binary income categorization (≤50K vs. >50K) using the other predictor variables
Recall that for the entire sample, roughly 75% are ≤50K
Read the Data and Fit Models
XX
Discussion Points and Questions
Which model (neural network or logistic regression) appears to be better? How good does it appear?
Pros and Cons of Neural Networks
Pros:
very flexible; with enough nodes, can model almost any nonlinear relationship
can efficiently model linear behavior if the relationship is truly linear
often very good predictive power
Cons:
model fitting can be unstable and sensitive to initial guesses
for very large data sets, model fitting can be very slow relative to some methods like trees and linear models, which makes CV very computationally expensive
overfitting (but can avoid by using CV to choose λ)
sensitive to user-chosen "tuning parameters" (but can use CV to choose them wisely)
poor interpretability
Classification and Regression Tree (CART) Models
- Perhaps the single most widely used generic nonlinear modeling method
- Very simple idea, and very interpretable models
- They usually do not have the best predictive power, but they serve as the basis for many more advanced supervised learning methods (e.g., boosting, random forests) that have excellent predictive power
- As with neural networks (and most of the methods we will cover), you can use tree models for either regression or classification. We will start with regression.
Structure of a Regression Tree
- A final fitted CART model divides the predictor (x) space by successively splitting into rectangular regions and models the response (Y) as constant over each region
- This can be schematically represented as a "tree":
  - each interior node of the tree indicates on which predictor variable you split and where you split
  - each terminal node (aka leaf) represents one region and indicates the value of the predicted response in that region
- The following slide illustrates a fitted tree model for an example from the KNN text (Figure 11.12), in which the objective is to predict college GPA (the response) as a function of HS rank and ACT score (two predictors)
- To use a fitted CART for prediction, you start at the root node and follow the splitting rules down to a leaf
Mathematical Representation of Regression Tree
Can still view the tree model as Y = g(x; θ) + ε, with

  g(x; θ) = Σ_{m=1}^{M} c_m I(x ∈ R_m)

where:
- M = total number of regions (terminal nodes)
- R_m = mth region
- I(x ∈ R_m) = indicator function = 1 if x ∈ R_m; 0 if x ∉ R_m
- c_m = constant predictor over R_m
- θ = all parameters and structure (M, splits in R_m's, c_m's, etc.)

Note that for x_i ∈ R_j, g(x_i; θ) = Σ_{m=1}^{M} c_m I(x_i ∈ R_m) = c_j
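The piecewise-constant form above can be sketched directly in code. This is a minimal illustration (the three rectangular regions and their constants c_m are made up, not from the course data): exactly one indicator I(x ∈ R_m) is 1, so g(x) returns that region's constant.

```python
INF = float("inf")

# Hypothetical partition of a two-predictor space into M = 3 rectangles:
# each entry is ((x1_low, x1_high), (x2_low, x2_high)), c_m
regions = [
    (((-INF, 2.0), (-INF, 5.0)), 10.0),   # R_1: x1 < 2 and x2 < 5
    (((-INF, 2.0), (5.0, INF)), 20.0),    # R_2: x1 < 2 and x2 >= 5
    (((2.0, INF), (-INF, INF)), 30.0),    # R_3: x1 >= 2
]

def g(x):
    """g(x) = sum_m c_m * I(x in R_m): return c_m for the unique region containing x."""
    for ((lo1, hi1), (lo2, hi2)), c in regions:
        if lo1 <= x[0] < hi1 and lo2 <= x[1] < hi2:
            return c
    raise ValueError("regions must partition the predictor space")
```

Because the regions partition the space, the sum of indicators collapses to the single constant c_j, exactly as in the note above.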
Discussion Points and Questions
- What kind of functional x–Y relationships can you capture with a regression tree model structure?
- Can a regression tree represent a linear relationship? Can it represent a linear relationship as efficiently as a neural network?
- Which type of model (neural network or regression tree) is more interpretable?
- Which type of model (neural network or regression tree) is easier to fit?
- Given a set of regions, how would you estimate the coefficients {c_m: m = 1, 2, . . ., M}?
Fitting a Regression Tree
- A CART model is fit using an array of training data structured just like in regression (one response column and many predictor columns)
- Fitting the model entails growing the tree one node at a time (see next slide for an example)
  - At each step, the single best next split (which predictor and where to split) is the one that gives the biggest reduction in SSE
  - The fitted or predicted response over any region is simply the average response over that region. The errors used to calculate the SSE are the response values minus the fitted values.
  - Stop splitting when the reduction in SSE from the next split is below a specified threshold, all node sizes are below a threshold, etc.
  - Most algorithms overfit and then prune back branches
- After fitting a CART model, software spits out the final fitted tree, which can be used for prediction/interpretation
SSE is Calculated as Follows
For a given set of splits:

  ĉ_m = ave(y_i | x_i ∈ R_m) = (1/N_m) Σ_{x_i ∈ R_m} y_i

  N_m = #{x_i ∈ R_m} = "size" of mth terminal node (region)

  SSE = Σ_{m=1}^{M} Σ_{x_i ∈ R_m} (y_i − ĉ_m)²

Note that for x_i ∈ R_j, g(x_i; θ) = Σ_{m=1}^{M} c_m I(x_i ∈ R_m) = c_j
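The greedy split search can be sketched for a single predictor. This is a minimal illustration of the criterion (smallest SSE(left) + SSE(right) over candidate split points), not the implementation used by the R tree package:

```python
def sse(v):
    """SSE of a node: squared errors about the node average c_hat."""
    if not v:
        return 0.0
    c_hat = sum(v) / len(v)
    return sum((yi - c_hat) ** 2 for yi in v)

def best_split(x, y):
    """Return the split point s minimizing SSE(left) + SSE(right),
    searching between the distinct observed predictor values."""
    candidates = sorted(set(x))[1:]
    return min(candidates,
               key=lambda s: sse([yi for xi, yi in zip(x, y) if xi < s]) +
                             sse([yi for xi, yi in zip(x, y) if xi >= s]))
```

In a full tree fit, this search runs over every predictor at every node, and the winning (predictor, split point) pair becomes the next interior node.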
Pruning
- Pruning a branch means that you collapse one of the internal nodes into a single terminal node
- Pruning the tree means that you prune a number of branches
- Pruning algorithms in software will usually optimally prune back a tree in a manner that minimizes SSE + λM, where M and SSE are for the pruned tree. The best value for λ is determined via CV
- There is a nice computational trick ("weakest link pruning") that allows this optimal pruning to be done very fast. See HTF for further discussion.
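The SSE + λM criterion can be illustrated with a hypothetical nested sequence of pruned subtrees (the (M, SSE) pairs below are made up for illustration):

```python
# Hypothetical pruned-subtree sequence: (M terminal nodes, SSE of that subtree).
# Bigger trees fit the training data better (lower SSE) but pay a penalty lam * M.
subtrees = [(1, 100.0), (3, 40.0), (5, 25.0), (8, 20.0)]

def best_size(lam):
    """Size M of the subtree minimizing the penalized criterion SSE + lam * M."""
    return min(subtrees, key=lambda ms: ms[1] + lam * ms[0])[0]
```

Small λ favors the overgrown tree and large λ favors heavy pruning, which is why λ itself must be chosen by CV, as the slide notes.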
Regression Tree Ex. (cpus data)
#do not have to standardize or transform predictors to fit trees
library(tree)
control = tree.control(nobs=nrow(CPUS), mincut = 5, minsize = 10, mindev = 0.002)
#default is mindev = 0.01, which only gives a 10-node tree
cpus.tr
[Figure: deviance vs. size (and vs. the complexity parameter k, i.e., λ) from cv.tree() for the cpus regression tree, for sizes 2 through 14]
Discussion Points and Questions
- What is the best size tree for the CPUS example?
- Provide an interpretation of which predictor variables are most important
- Do there appear to be any interactions between mmax and cach?
- Why must minsize be at least twice mincut?
- The "deviance" measure that is plotted versus tree size is −2·log f(y; θ). Why does this correspond to the SSE for a nonlinear regression model with normal errors?
Classification Trees Overview
- Fitting and using classification trees with a K-category response is similar to fitting and using regression trees.
- For classification trees, we model p_k(x) = Pr{Y = k | x} (k = 1, 2, . . ., K) as constant over each region
- Compare to regression trees, for which we model g(x; θ) = E[Y | x] as constant over each region
- At each step in the fitting algorithm, the best next split is the one that most reduces some criterion measuring the impurity within the regions
Classification Trees: Some Details
In the region R_m, the fitted class probabilities and best class prediction are:

  p̂_{m,k} = (1/N_m) Σ_{x_i ∈ R_m} I(y_i = k)   (class-k sample fraction in region R_m)

  k̂_m = argmax_k p̂_{m,k}   (most common class in region R_m)

Some common impurity measures:
- Misclassification error: Σ_{m=1}^{M} Σ_{x_i ∈ R_m} I(y_i ≠ k̂_m) = Σ_{m=1}^{M} N_m (1 − p̂_{m,k̂_m})
- Gini index: Σ_{m=1}^{M} N_m Σ_{k=1}^{K} p̂_{m,k} (1 − p̂_{m,k})
- Deviance: −Σ_{m=1}^{M} N_m Σ_{k=1}^{K} p̂_{m,k} log p̂_{m,k}   (based on the log-likelihood)
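For a single node, all three impurity measures can be computed from the per-class counts. This is a per-node sketch (the slide's criteria then sum these across nodes, weighted by N_m); the counts in the test are just an example:

```python
from math import log

def node_impurities(counts):
    """Misclassification error, Gini index, and deviance (per observation)
    for one node, from per-class counts; N_m = sum(counts)."""
    n = sum(counts)
    p = [c / n for c in counts]               # p_hat_{m,k}
    misclass = 1.0 - max(p)                   # 1 - p_hat_{m, k_hat_m}
    gini = sum(pk * (1.0 - pk) for pk in p)
    deviance = -sum(pk * log(pk) for pk in p if pk > 0)
    return misclass, gini, deviance
```

A pure node (all cases in one class) gives 0 for all three measures, which is why splits that isolate classes are preferred.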
Example Illustrating the Notation
Suppose you have K = 4 classes, and the predictors for N_m = 100 training cases fall into a particular region R_m. For those 100 cases, suppose we have the following breakdown of the number of cases with response value that fell into the four categories:

  Class, k | # obsvns with Y in Class k
  1 | 10
  2 | 20
  3 | 65
  4 | 5

- What is p̂_{m,k} for k = 1, 2, 3, 4?
- What is k̂_m?
K = 2 Class Example Illustrating Notation and Splitting Based on Impurity
In the following plot of class label y_i (1 or 2) versus a single predictor x_i, where would the first split that minimizes the misclassification rate be, and what would the p̂_{m,k} and k̂_m be?

[Figure: y_i ∈ {1, 2} vs. x_i]
Classification Tree Ex. (fgl data)
library(tree)
control = tree.control(nobs=nrow(FGL), mincut = 5, minsize = 10, mindev = 0.005)
#default is mindev = 0.01, which only gives a 10-node tree
fgl.tr
[Figure: deviance vs. size from cv.tree() for the FGL classification tree, and the pruned tree: root split Mg < 2.695, then RI < 6.22; leaves labeled Other and Win]
Discussion Points and Questions
- What is the best tree size for the FGL data?
- Which predictors appear to be the most important, and what are their effect(s)?
- If you want a summary measure of the predictive quality of the tree model, what would it be?
- If you wanted to decide whether the neural network is better than the tree for predicting glass type, how would you do this?
- How can you tell what impurity measure R used to fit the model?
Same but for the original 6-category response
control = tree.control(nobs=nrow(FGL), mincut = 5, minsize = 10, mindev = 0.005)
#default is mindev = 0.01, which only gives a 10-node tree
fgl.tr
Discussion Points and Questions
- You choose the best size the same way: choose the size with the lowest CV deviance (the best M was about 7)
- Which predictors appear to be the most important, and does this seem to agree with the best predictors for the 2-class tree when we chose M = 3?
- If you want a summary measure of the predictive quality for the 6-class tree model, what would it be?
Numerical Assessment of Variable Importance
For a visual assessment of the importance of each predictor in a tree, inspect the tree graph (the importance of x_j is reflected by how many times it appears in internal nodes, how close they are to the root node, and the length of the branch for that split if using type = "p" in plot.tree):

##Replot 6-class FGL tree with branch lengths proportional to reduction in impurity##
fgl.tr1
Use CV to compare trees with any other model
- In the previous regression tree example (cpus data), the best value of the complexity parameter was λ = 0.4, which translated to M = 11 terminal nodes.
- To compare a regression tree with a neural network model we would:
  - Form a random CV partition (e.g., using the CVInd function)
  - Compute the CV SSE for a neural network model with 10 hidden layer nodes and λ = 0.05 (which CV earlier said was roughly the best value)
  - Compute the CV SSE for a regression tree model with either λ = 0.4 or M = 11, using the same partition
  - Repeat the previous 3 steps as many times as you can, averaging the results, and select the best model as the one with the lower average CV SSE
Return to the Income Data Example
- Reconsider the data in adult_train.csv
- Before, we fit a neural network model for regression, predicting the number of hours per week worked. And we also fit a neural network for classification, predicting the binary income categorization (≤50K vs. >50K)
- Here, we will fit similar regression and classification models, but using trees instead of neural networks
The Best-sized Regression Tree for INCOME Data
[Figure: CV deviance vs. size for the INCOME regression tree, and the pruned tree itself. The first split is age < 22.5, with further splits on age, education.num, relationship, sex, occupation, workclass, and income; leaf predictions range from about 23.1 to 55.5 hours]
Discussion Points and Questions
- When we include native.country, we get an error because no more than 32 categories are allowed for a categorical predictor. If you really wanted to include native.country, how would you handle this?
- Relative to the CPUS example, would you increase mincut, minsize, and mindev, or decrease them?
- As always, you should deliberately overgrow the tree and then prune it back. How do you know if you have overgrown the tree?
- Comparing the tree to the neural network:
  - Which was faster to fit?
  - Which had better predictive quality, and how can you tell?
  - Which was easier to interpret?
  - How about for comparing a tree to a linear regression?
Try a Classification Tree
control = tree.control(nobs=nrow(INCOME), mincut = 20, minsize = 40, mindev = 0.0005)
#default is mindev = 0.01, which only gives a 10-node tree
Inc.tr
K-Nearest Neighbors
- A generic nonlinear modeling tool that is extremely flexible
- Perhaps the simplest modeling idea of all
- For simple data sets with large n, small k, and no categorical predictors, almost as widely used as CART
- Based on the name, can you guess how K-nearest neighbors works?
Structure of 1-Nearest Neighbors (for regression)
- You need a set of training data {y_i, x_i: i = 1, 2, . . ., n}, but you do not fit a model.
- For 1-Nearest Neighbors, to predict Y for a new case with predictors x:
  - find the x_i in your training set that is the closest neighbor to x
  - then take the predicted Y to be the response value for that training observation
Illustration of K-NN for Gas Mileage data
library(scatterplot3d)
library(rgl)
GAS
[Figure: 3D scatterplot of Mpg vs. Displacement and Rear_axle_ratio (standardized), and a 2D scatterplot of Rear_axle_ratio vs. Displacement]
Calculating Distances to find Nearest Neighbors
If we want to predict Y(x) for a new case with predictors x, the distance between x and the predictors x_i for the ith training case (i = 1, 2, . . ., n) is measured via

  d²(x, x_i) = (x − x_i)ᵀ(x − x_i) = Σ_{j=1}^{k} (x_j − x_{ij})²

For 1-nearest neighbor, the prediction of Y(x) is

  ŷ(x) = y_{i₁(x)}, where i₁(x) = index of the closest neighbor of x
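The distance formula and 1-NN rule translate directly to code. A minimal sketch (it assumes the predictors are already standardized, per the discussion that follows):

```python
def one_nn_predict(x_new, X, y):
    """1-NN: return y_{i1(x)}, the response of the training case whose
    predictor vector minimizes the squared Euclidean distance d^2(x, x_i)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x_new, xi)) for xi in X]
    return y[d2.index(min(d2))]
```

Note that nothing is "fit" here: the training data themselves are the model, and all the work happens at prediction time.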
Discussion Points and Questions
- You should always standardize your predictors as a first step. Why?
- How do you handle categorical predictors?
- For 1-nearest neighbor, what would a plot of ŷ(x) versus x look like for the gas mileage example with only Displacement as a predictor?

[Figure: Mpg vs. standardized Displacement]
Structure of K-Nearest Neighbors (for regression)
More generally, for K-nearest neighbors, you use exactly the same procedure, except you:
- find the K closest training x_i's to x, and
- then take the predicted Y to be the average response value for these K training observations:

  ŷ(x) = (1/K) Σ_{l=1}^{K} y_{i_l(x)}

  where {i₁(x), i₂(x), . . ., i_K(x)} = indices of the K closest neighbors of x

The tradeoff of using large vs. small K is exactly the classic bias/variance tradeoff.
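The K-neighbor average generalizes the 1-NN rule by one line. An illustrative sketch (not the yaImpute implementation used in the example later); with K = 1 it reduces to 1-nearest neighbor:

```python
def knn_predict(x_new, X, y, K):
    """K-NN regression: average the responses of the K training cases
    closest to x_new in squared Euclidean distance."""
    order = sorted(range(len(X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(x_new, X[i])))
    return sum(y[i] for i in order[:K]) / K
```

Averaging over more neighbors smooths the prediction (lower variance) at the cost of averaging in farther-away, less representative cases (higher bias).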
Large Versus Small K (single predictor example)
- Why is the predictor in the left plot high variance and low bias?
- Why is the predictor in the right plot low variance and high bias?

[Figure: Mpg vs. standardized Displacement with K-NN fits, K = 1 (left) and K = 20 (right)]
Bias and Variance of K-Nearest Neighbors
Assume the true relationship is Y = g(x; θ) + ε, with fixed training x's and ε ~ i.i.d. (0, σ²) (not necessarily normal)

The predictor for fixed x is:

  ŷ(x) = (1/K) Σ_{l=1}^{K} y_{i_l(x)} = (1/K) Σ_{l=1}^{K} [g(x_{i_l(x)}; θ) + ε_{i_l(x)}]

MSE = E[(Y − ŷ(x))²] = σ² + Var[ŷ(x)] + Bias²[ŷ(x)], where

  Var[ŷ(x)] = σ²/K

  Bias[ŷ(x)] = E[ŷ(x)] − E[Y(x)] = (1/K) Σ_{l=1}^{K} g(x_{i_l(x)}; θ) − g(x; θ)

so that

  Var[Y − ŷ(x)] = σ² + σ²/K = σ²(1 + 1/K)
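The σ²/K variance term follows in one step from the i.i.d. error assumption, since the g terms are fixed and the K noise terms are independent:

```latex
\hat y(\mathbf{x})
  = \frac{1}{K}\sum_{l=1}^{K}\Bigl( g\bigl(\mathbf{x}_{i_l(\mathbf{x})};\boldsymbol\theta\bigr)
      + \varepsilon_{i_l(\mathbf{x})}\Bigr)
\;\Longrightarrow\;
\operatorname{Var}\bigl[\hat y(\mathbf{x})\bigr]
  = \frac{1}{K^2}\sum_{l=1}^{K}\operatorname{Var}\bigl[\varepsilon_{i_l(\mathbf{x})}\bigr]
  = \frac{\sigma^2}{K}.
```

This is the variance side of the tradeoff: it shrinks as K grows, while the bias term grows because farther neighbors pull the average of g away from g(x; θ).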
Another Example of Large Vs. Small K
This is a classification example from HTF with two response categories (blue or orange in the figures) and two predictors. The following scatterplots are x1 vs. x2, also showing the decision boundaries for the K-nearest-neighbors classifiers with K = 15 and K = 1.
K-NN for CPUS data
library(yaImpute)
CPUS
[Figure: ytest vs. fitted values for the CPUS data: K = 2 training fit (left) and K = 6 training fit (right)]
CV to Choose the Best K
Nrep
K
[Figure: y vs. n-fold CV predictions for the CPUS data: K = 2 (yhat2, left) and K = 6 (yhat1, right)]
Discussion Points and Questions
- Did the 6-nearest neighbors method do better than the best neural network model or the linear regression model?
- How can you tell which predictors have the largest effect on the response in the K-nearest neighbors model?
- What are the parameters of the fitted model?
K-Nearest Neighbors for Classification
- Like CART models, it is straightforward to use nearest neighbors for classification.
- For binary classification, to predict Pr{Y = 1 | x} for a new case, find the K nearest neighbors as before, and take the predicted Pr{Y = 1 | x} to be the fraction of the K nearest neighbors having y = 1 responses
- If we have more than two response categories, we take the predicted probability for each category to be the fraction of nearest neighbors with response values belonging to that category.
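The fraction-of-neighbors rule can be sketched in a few lines. An illustrative version (the class labels in the test are made-up examples); it handles two or more categories identically:

```python
from collections import Counter

def knn_class_probs(x_new, X, y, K):
    """Predicted class probabilities: the fraction of the K nearest
    neighbors (squared Euclidean distance) falling in each class."""
    order = sorted(range(len(X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(x_new, X[i])))
    counts = Counter(y[i] for i in order[:K])
    return {cls: c / K for cls, c in counts.items()}
```

A hard classification then just picks the class with the largest fraction, exactly as with the class proportions in a CART terminal node.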
K-NN for FGL data
FGL
[Figure: jittered response, jitter(as.numeric(ytest=="Win"), amount=0.05), vs. predicted probability (phat) for the FGL data; K = 10 training fit]
Discussion Points and Questions
- In the preceding, what was the training misclassification rate? What would happen to the training misclassification rate if we decreased K? What would it be for K = 1?
- To find the best K for K-nearest neighbors for classification problems, you must use CV, just like for any other method.
- What CV measure would you use to find the best K for the FGL data?
- Is finding the CV errors for K-nearest neighbors substantially more computationally expensive than finding the training errors, like it is for all of the other methods we have covered?
Pros and Cons of Nearest Neighbors

Pros:
- The most flexible of all: can represent any nonlinear relationship (as long as you have sufficiently large n).
- Easy to use. No model fitting required

Cons:
- There is no real fitted model (nor even any indication of which predictors are most important), so not suitable for interpretation or explanatory purposes.
- Because there is no fitted model, you need to retain all the training data to predict.
- With large k (the number of predictors), you need very large n, because neighbors get farther away in higher dimensions.
- For most supervised learning methods, large n increases the computational expense for training, but not for new case prediction. Large n is more problematic for nearest neighbors, because the "training" occurs for every new case prediction
- With very large n, we need computational tricks (e.g., tree-based methods) to efficiently search for nearest neighbors.
- Not well suited for categorical predictors
Effect of dimension (k) on distance between neighbors

[Figure: the same cases plotted in k = 1 dimension (Displacement only), k = 2 (scatterplot of Rear_axle_ratio vs. Displacement), and k = 3 (3D scatterplot of Comp_ratio vs. Rear_axle_ratio vs. Displacement), illustrating how neighbors get farther apart as k increases]
Software Implementation in Matlab (in case you want to know)
- Neural Networks: Neural Networks toolbox (IE computer lab does NOT have the NN toolbox)
- CART: CLASSREGTREE (part of the Stats toolbox)
- Nearest Neighbors: No model to fit. Easy to write your own script in Matlab.
- Cross-Validation: CROSSVAL (part of the Stats toolbox). You must write an appropriate function call for your specific model.
- Bootstrapping: BOOTSTRP (part of the Stats toolbox). You must write an appropriate function call for your specific model.
Some Other "Data Mining" Tools
- Two big categories of problems:
  - supervised learning (we have a response Y and predictors x and want to model Y as a function of x)
  - unsupervised learning (we have no Y; just an x, and we want to find relationships among the elements of x)
- IEMS 304 covered the foundations and primary tools of supervised learning. There are many more advanced methods, but most are extensions of what we have already covered
- Examples of unsupervised learning:
  - clustering
  - association rules