Lecture 5: MACHINE LEARNING
Bruce E. Hansen
Summer School in Economics and Econometrics, University of Crete, July 22-26, 2019
Bruce Hansen (University of Wisconsin) Machine Learning July 22-26, 2019 1 / 99
Learning Machines: Daleks?
Everyone ready!
Today’s Schedule
Principal Component Analysis
Ridge Regression
Lasso
Regression Trees
Bagging
Random Forests
Ensembling
References
Hastie, Tibshirani, and Friedman (2008) The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- Today's Lecture is extracted from this textbook

James, Witten, Hastie, and Tibshirani (2013) An Introduction to Statistical Learning: with Applications in R
- Undergraduate level

Efron and Hastie (2016) Computer Age Statistical Inference: Algorithms, Evidence, and Data Science
- Also introductory
PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis
Large number p of regressors x_i

Do we need all p regressors?

If many regressors are very similar to one another, and highly correlated, maybe the question is not "which" regressor is important, but instead "which linear combination" is important

For example, if you have a data set with a large number of test scores taken by students, and you are trying to predict some outcome (grades in future classes, success at a university, future wages), what might be the best predictor is an average, or linear combination, of the test scores
This leads to the concept of latent factors and factor analysis
Single Factor Model
x_i = h f_i + u_i
- h and u_i are p × 1
- f_i is 1 × 1
- f_i is the factor
- h is the vector of factor loadings
- u_i is the vector of idiosyncratic errors

In this model, the factor f_i affects all regressors x_ji
- But the magnitude is specific to the regressor and is captured by h

Scaling:
- The scale of h and f_i is not separately identified, nor is the sign
- One option: h′h = 1
- Second option: var(f_i) = 1
Testscore Example
x_i is a set of test scores for an individual student

f_i is the student's latent ability

h is how ability affects the different test scores
- Some tests may be highly related to ability
- Some tests may be less related
- Some may be unrelated (random?)
Regressor Covariance Matrix
Σ_x = var(x_i) = h h′ σ²_f + I_p σ²_u

Σ_x h = (h h′ σ²_f + I_p σ²_u) h = h (σ²_f + σ²_u)
- Thus h is an eigenvector of Σ_x with eigenvalue σ²_f + σ²_u
- All other eigenvectors have eigenvalue σ²_u
- Thus h is the eigenvector of Σ_x associated with the largest eigenvalue

Estimation:
- Σ̂_x = sample covariance matrix of x_i
- ĥ = eigenvector associated with the largest eigenvalue of Σ̂_x
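A minimal numerical sketch of this estimator (all data simulated, names hypothetical): draw from the single factor model, form the sample covariance matrix, and take the eigenvector associated with its largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the single factor model x_i = h f_i + u_i with p regressors
p, n = 7, 2000
h = rng.normal(size=p)
h /= np.linalg.norm(h)               # impose the normalization h'h = 1
f = rng.normal(size=n)               # latent factor, var(f) = 1
u = 0.5 * rng.normal(size=(n, p))    # idiosyncratic errors, variance 0.25
X = np.outer(f, h) + u

# Estimate h by the eigenvector of the sample covariance matrix
# associated with its largest eigenvalue
Sigma_hat = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)   # eigenvalues in ascending order
h_hat = eigvecs[:, -1]

# An eigenvector is identified only up to sign
if h_hat @ h < 0:
    h_hat = -h_hat
```

In this simulation the largest sample eigenvalue estimates σ²_f + σ²_u = 1.25 while the remaining six cluster near σ²_u = 0.25, matching the eigenstructure derived above.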
The Largest Principal Component Captures the Greatest Variation

Figure: Principal components of some input data points
Multiple Factor Model
x_i = Σ_{m=1}^r h_m f_mi + u_i = H′ f_i + u_i

Interpretation in the test score example:
- There may be more than one form of "ability"
- i.e. literary and mathematical
- In labor economics, a distinction has been hypothesized between cognitive and non-cognitive ability, which has been very useful in explaining wage patterns (some jobs require one or the other, and some both, e.g. surgeon)

Loadings normalized to be orthonormal, factors uncorrelated

Σ_x = H′ Σ_f H + I_p σ²_u

The factor loadings h_m are the eigenvectors of Σ_x associated with the largest r eigenvalues

Estimation: Ĥ = eigenvectors of Σ̂_x associated with the largest r eigenvalues
Identification of the number of factors
A difficult practical problem

Standard practice is to examine the eigenvalues of Σ̂_x and look for breaks

The theory suggests there should be r "large" eigenvalues and the remainder "small"
Illustration
Data from Duflo, Dupas and Kremer (2011, AER)
Observations are test scores for first grade students in Kenya

The authors used the total test score, but the data set includes scores on subsections of the test
- Four literacy and three mathematics
How many factors explain performance?
Eigenvalues of the sample covariance matrix:

    Eigenvalue   Proportion
1      4.02         0.57
2      1.04         0.15
3      0.57         0.08
4      0.52         0.08
5      0.37         0.05
6      0.29         0.04
7      0.19         0.03

Factor loadings:

                First Factor   Second Factor
words               0.41           -0.32
sentences           0.32           -0.49
letters             0.40           -0.13
spelling            0.43           -0.28
addition            0.38            0.41
subtraction         0.35            0.52
multiplication      0.33            0.36
Implications
There appear to be two large eigenvalues relative to the others
The first eigenvector has similar loadings for all seven subjects
- A general ability/achievement factor

The second factor loads negatively on all four literacy subjects, and positively on all math subjects
- A "math vs literacy" factor!
Appears in first grade exams!
School Factors
Estimation of Factors
Factor Model:
- x_i = H′ f_i + u_i
- If you knew H (with orthonormal loadings, H H′ = I_r), an estimator of f_i would be f̂_i = H x_i = f_i + v_i, where v_i = H u_i
- The error is mean zero, so f̂_i is unbiased for f_i
- With estimated loadings, f̂_i = Ĥ x_i
- As p, n → ∞, f̂_i →_p f_i
Factor-Augmented Regression
y_i = f_i′ β + e_i

x_i = H′ f_i + u_i

Estimated in multiple steps:
- Estimate the loadings Ĥ from the covariance matrix of x_i
- Estimate the factors f̂_i
- Estimate the coefficient β by least squares of y_i on the estimated factors

Generated regressors:
- The problem diminishes as n, p → ∞
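The multi-step procedure can be sketched as follows on simulated data (a sketch under assumed dimensions; all names are hypothetical, and the estimated factors are identified only up to rotation, so the fit rather than β̂ itself is the target):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate y_i = f_i' beta + e_i and x_i = H' f_i + u_i
n, p, r = 1000, 20, 2
H = np.linalg.qr(rng.normal(size=(p, r)))[0].T      # r x p, orthonormal rows
F = rng.normal(size=(n, r))                         # latent factors
X = F @ H + 0.3 * rng.normal(size=(n, p))
beta = np.array([1.0, -0.5])
y = F @ beta + 0.1 * rng.normal(size=n)

# Step 1: estimate the loadings from the covariance matrix of x_i
Sigma_hat = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
H_hat = eigvecs[:, -r:][:, ::-1].T                  # top-r eigenvectors (r x p)

# Step 2: estimate the factors
F_hat = X @ H_hat.T

# Step 3: least squares of y on the estimated factors
beta_hat, *_ = np.linalg.lstsq(F_hat, y, rcond=None)
```

With a large eigenvalue gap, the fitted values F_hat @ beta_hat track the infeasible fit F @ beta closely even though beta_hat is only a rotation of beta.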
RIDGE REGRESSION
Bruce Hansen (University of Wisconsin) Machine Learning July 22-26, 2019 20 / 99
Linear Model
y_i = x_i′ β + e_i

Assume all variables are demeaned

x_i is p × 1

Classical OLS: β̂ = (X′X)^{-1}(X′y)
- Unstable if X′X is near singular
- Unstable if the regressors are nearly collinear
- These problems are not typical when p is small, but are typical when p is large and variables are correlated
- Infeasible if p > n
  - Many current applications (high-dimensional regression)
Wisconsin Regression
Ridge Regression
Classical solution to near singularity
β̂_ridge = (X′X + λI_p)^{-1}(X′y)

λ is a tuning parameter

Classical motivation:
- Solves multicollinearity
- Solves singularity
- Stabilizes the estimator

Modern motivation:
- Penalized least squares
- Regularized least squares
Ridge Regression
Penalized Least Squares
Penalized criterion:
- S_2(β) = (y − Xβ)′(y − Xβ) + λ β′β

SSE plus an L2 penalty on the coefficients

β̂_ridge = argmin_β S_2(β)

F.O.C.:
- 0 = −2X′(y − Xβ) + 2λβ

Solution:
- β̂_ridge = (X′X + λI_p)^{-1}(X′y)

Thus the ridge regression estimator is a penalized LS estimator

If the OLS estimator is "large", the penalty pushes it towards zero
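As a quick numerical check of this algebra (simulated data, hypothetical names), the closed-form estimator satisfies the first-order condition X′(y − Xβ̂) = λβ̂ and minimizes the penalized criterion:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical design: 50 observations, 10 demeaned regressors
n, p, lam = 50, 10, 2.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Ridge estimator: minimizer of S2(b) = (y - Xb)'(y - Xb) + lam * b'b
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def S2(b):
    """Penalized least squares criterion."""
    r = y - X @ b
    return r @ r + lam * (b @ b)
```

Since S2 is strictly convex, any perturbation of beta_ridge strictly increases the criterion.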
Dual Problem
The dual of a penalized minimization problem is a Lagrangian problem
Consider the problem:
- min (y − Xβ)′(y − Xβ) subject to β′β ≤ τ

This minimizes the SSE over a ball around zero with radius √τ

Lagrangian: L(β, λ) = (y − Xβ)′(y − Xβ) + λ(β′β − τ)

F.O.C. for β:
- 0 = −2X′(y − Xβ) + 2λβ
- The same as the penalized least squares problem
- They have the same solution

There is a mapping between λ and τ
Dual (Lagrangian) Constrained Minimization
Shrinkage Interpretation
Suppose X′X = I_p

Then β̂_ridge = (1 + λ)^{-1} β̂_ols
Shrinks OLS towards zero
Similar to Stein estimator
βridge is biased for β but has lower variance than least squares
MSE is not easy to characterize
Selection of Ridge Parameter
Mallows:
- μ̂_ridge = X(X′X + λI_p)^{-1}X′y is linear in y
- Penalty is 2σ² tr((X′X + λI_p)^{-1}X′X)

Cross-Validation:
- ẽ_i(λ) = y_i − x_i′ β̂_ridge,−i(λ)  (leave-one-out estimator)
- CV(λ) = Σ_{i=1}^n ẽ_i(λ)²

Theory: Li (1986, Annals of Statistics)
- Mallows/CV selection is asymptotically equivalent to the infeasible optimal λ
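A sketch of CV selection on simulated data (the grid of λ values is an illustrative choice, not from the lecture). Because ridge is a linear smoother ŷ = A(λ)y, the leave-one-out residual has the closed form ê_i/(1 − A_ii), so the CV criterion does not require n separate fits:

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 60, 8
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) * 0.3 + rng.normal(size=n)

def loo_cv(lam):
    """Leave-one-out CV criterion for ridge with penalty lam."""
    # For the linear smoother y_hat = A(lam) y, the exact LOO residual
    # is e_i / (1 - A_ii)  (Sherman-Morrison shortcut)
    A = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    e = y - A @ y
    return np.sum((e / (1 - np.diag(A))) ** 2)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
lam_cv = min(grid, key=loo_cv)
```

The shortcut can be verified by refitting with one observation deleted and comparing prediction errors.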
Interpretation via the SVD
The singular value decomposition of a matrix X is X = UDV′, where U and V are orthonormal and D is diagonal with the singular values of X on the diagonal

Apply it to the regressor matrix: X = UDV′

X′X = VDU′UDV′ = VD²V′

(X′X)^{-1} = VD^{-2}V′

X′y = VDU′y

β̂_ols = (X′X)^{-1}X′y = VD^{-2}V′VDU′y = VD^{-2}DU′y = VD^{-1}U′y

μ̂_ols = Xβ̂_ols = UDV′VD^{-1}U′y = UU′y = Σ_{j=1}^p u_j u_j′ y

The least squares estimator can be seen as a projection onto the orthonormal basis u_j
Interpretation via the SVD
Let Λ = λI_p

X′X + Λ = VD²V′ + Λ = V(D² + Λ)V′

(X′X + Λ)^{-1} = V(D² + Λ)^{-1}V′

β̂_ridge = (X′X + Λ)^{-1}X′y = V(D² + Λ)^{-1}V′VDU′y = V(D² + Λ)^{-1}DU′y

μ̂_ridge = Xβ̂_ridge = UDV′V(D² + Λ)^{-1}DU′y = UD(D² + Λ)^{-1}DU′y = Σ_{j=1}^p [d_j²/(d_j² + λ)] u_j u_j′ y

A shrunken projection

Each basis direction u_j is shrunk according to d_j²/(d_j² + λ)
- Smaller d_j² implies greater shrinkage
- Small singular values of X receive the greatest shrinkage
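The shrinkage formula can be checked numerically: the SVD-based fit with factors d_j²/(d_j² + λ) reproduces the direct ridge fit (simulated data, hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 40, 5, 3.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U D V'

# Ridge fit as a shrunken projection on the basis u_j
shrink = d**2 / (d**2 + lam)
mu_ridge_svd = U @ (shrink * (U.T @ y))

# Direct ridge fit for comparison
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
mu_ridge = X @ beta_ridge
```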
Ridge Coefficients Varying with Lambda
Features of Ridge Estimator
Well defined even if p > n

All p coefficients are estimated with non-zero estimates

Does not perform simultaneous selection
Shrinkage greatest on small singular values of regressors
Computation of Ridge Regression
R: package glmnet, function glmnet(x,y,alpha=0)
- Selects λ by cross-validation
MATLAB: ridge(y,X,λ)
LASSO (Least Absolute Shrinkage and Selection Operator)
Lasso
Penalized criterion:
- S_1(β) = (y − Xβ)′(y − Xβ) + λ‖β‖₁
- ‖β‖₁ = Σ_{j=1}^p |β_j|

SSE plus an L1 penalty on the coefficients

β̂_lasso = argmin_β S_1(β)

The minimizer has no closed-form solution

F.O.C. for non-zero β_j:
- 0 = −2X_j′(y − Xβ) + λ sgn(β_j)
Dual Problem
min (y − Xβ)′(y − Xβ) subject to ‖β‖₁ ≤ τ

This minimizes the SSE over a constraint set which looks like a square on its edge (a cross-polytope)

Lagrangian:
- L(β, λ) = (y − Xβ)′(y − Xβ) + λ(‖β‖₁ − τ)

F.O.C. for non-zero β_j:
- 0 = −2X_j′(y − Xβ) + λ sgn(β_j)
- The same as the penalized problem
- The solution is identical
Lasso vs Ridge
TABLE 3.4. Estimators of β̂_j in the case of orthonormal columns of X. M and λ are constants chosen by the corresponding techniques; sign denotes the sign of its argument (±1), and x₊ denotes the "positive part" of x.

Estimator              Formula
Best subset (size M)   β̂_j · I(|β̂_j| ≥ |β̂_(M)|)
Ridge                  β̂_j / (1 + λ)
Lasso                  sign(β̂_j)(|β̂_j| − λ)₊

FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β₁| + |β₂| ≤ t and β₁² + β₂² ≤ t², respectively, while the red ellipses are the contours of the least squares error function.
Lasso solution tends to hit a corner
Since the constraint region has corners, the lasso tends to hit one
In contrast, the ridge minimizer tends to hit an interior point on thesmooth constraint region
The corners represent parameter vectors where some coefficients equal zero

Hence the lasso solution sets some coefficients to zero
However, if the constraint is relaxed, then Lasso = OLS
- If τ ≥ ‖β̂_ols‖₁ then β̂_lasso = β̂_ols
As the Constraint Set Decreases, Lasso Estimates Shrink to Zero

FIGURE 3.10. Profiles of lasso coefficients, as the tuning parameter t is varied. Coefficients are plotted versus s = t / Σ_{1}^{p} |β̂_j|. A vertical line is drawn at s = 0.36, the value chosen by cross-validation. Compare Figure 3.8 on page 65; the lasso profiles hit zero, while those for ridge do not. The profiles are piece-wise linear, and so are computed only at the points displayed; see Section 3.4.4 for details.
Effect of changing Lasso Parameter
For λ = 0, Lasso = OLS

As λ increases, the estimates shrink together

At some point, one estimate hits zero. It remains zero

As λ increases further, the estimates continue to shrink at a new rate

One by one, the estimates hit and stick at zero

When λ is sufficiently large, all coefficients equal 0
Selection of Lasso Parameter
Most commonly by K-fold CV
Standard algorithms use CV by default or an option
Theoretical justification in development
- Chernozhukov, Chetverikov, and Liao, working paper
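A self-contained sketch of K-fold CV selection of λ, using a basic coordinate-descent lasso solver rather than a packaged routine (simulated sparse design; the grid, fold scheme, and iteration count are illustrative choices, not from the lecture):

```python
import numpy as np

def soft(z, t):
    """Soft-threshold: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min (y - Xb)'(y - Xb) + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    xx = np.sum(X**2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]          # partial residual
            b[j] = soft(X[:, j] @ r_j, lam / 2) / xx[j]
    return b

def kfold_cv(X, y, lam, k=5):
    """K-fold CV sum of squared prediction errors for a given lam."""
    n = X.shape[0]
    folds = np.arange(n) % k
    sse = 0.0
    for f in range(k):
        tr, te = folds != f, folds == f
        b = lasso_cd(X[tr], y[tr], lam)
        sse += np.sum((y[te] - X[te] @ b) ** 2)
    return sse

rng = np.random.default_rng(5)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [2.0, -1.5]                                # sparse truth
y = X @ beta + rng.normal(size=n)

grid = [0.5, 2.0, 8.0, 32.0, 128.0]
lam_cv = min(grid, key=lambda lam: kfold_cv(X, y, lam))
b_hat = lasso_cd(X, y, lam_cv)
```

In practice one would use glmnet (as the slides do later), which runs the same kind of coordinate descent over a full λ path.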
Nesting Selection, Lasso and Ridge
β̂_q = argmin_β (y − Xβ)′(y − Xβ) + λ‖β‖_q^q

‖β‖_q^q = Σ_{j=1}^p |β_j|^q

(q = 0) variable subset selection
(q = 1) lasso
(q = 2) ridge
The constraint region for ridge regression is the disk β₁² + β₂² ≤ t, while that for lasso is the diamond |β₁| + |β₂| ≤ t. Both methods find the first point where the elliptical contours hit the constraint region. Unlike the disk, the diamond has corners; if the solution occurs at a corner, then it has one parameter β_j equal to zero. When p > 2, the diamond becomes a rhomboid, and has many corners, flat edges and faces; there are many more opportunities for the estimated parameters to be zero.

We can generalize ridge regression and the lasso, and view them as Bayes estimates. Consider the criterion

β̃ = argmin_β { Σ_{i=1}^N (y_i − β₀ − Σ_{j=1}^p x_ij β_j)² + λ Σ_{j=1}^p |β_j|^q }    (3.53)

for q ≥ 0. The contours of constant value of Σ_j |β_j|^q are shown in Figure 3.12, for the case of two inputs.

Thinking of |β_j|^q as the log-prior density for β_j, these are also the equi-contours of the prior distribution of the parameters. The value q = 0 corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters; q = 1 corresponds to the lasso, while q = 2 to ridge regression. Notice that for q ≤ 1, the prior is not uniform in direction, but concentrates more mass in the coordinate directions. The prior corresponding to the q = 1 case is an independent double exponential (or Laplace) distribution for each input, with density (1/2τ) exp(−|β|/τ) and τ = 1/λ. The case q = 1 (lasso) is the smallest q such that the constraint region is convex; non-convex constraint regions make the optimization problem more difficult.

In this view, the lasso, ridge regression and best subset selection are Bayes estimates with different priors. Note, however, that they are derived as posterior modes, that is, maximizers of the posterior. It is more common to use the mean of the posterior as the Bayes estimate. Ridge regression is also the posterior mean, but the lasso and best subset selection are not.

Looking again at the criterion (3.53), we might try using other values of q besides 0, 1, or 2. Although one might consider estimating q from the data, our experience is that it is not worth the effort for the extra variance incurred. Values of q ∈ (1, 2) suggest a compromise between the lasso and ridge regression. Although this is the case, with q > 1, |β_j|^q is differentiable at 0, and so does not share the ability of lasso (q = 1) for

FIGURE 3.12. Contours of constant value of Σ_j |β_j|^q for given values of q (panels: q = 4, q = 2, q = 1, q = 0.5, q = 0.1).
Elastic Net: Compromise between Lasso and Ridge
β̂_net = argmin_β (y − Xβ)′(y − Xβ) + λ(α‖β‖₂² + (1 − α)‖β‖₁)

α = 0 is Lasso
α = 1 is Ridge
0 < α < 1 mixes the Lasso and Ridge penalties
FIGURE 3.13. Contours of constant value of Σ_j |β_j|^q for q = 1.2 (left plot), and the elastic-net penalty Σ_j (αβ_j² + (1 − α)|β_j|) for α = 0.2 (right plot). Although visually very similar, the elastic-net has sharp (non-differentiable) corners, while the q = 1.2 penalty does not.

setting coefficients exactly to zero. Partly for this reason as well as for computational tractability, Zou and Hastie (2005) introduced the elastic-net penalty

λ Σ_{j=1}^p (α β_j² + (1 − α)|β_j|),    (3.54)

a different compromise between ridge and lasso. Figure 3.13 compares the Lq penalty with q = 1.2 and the elastic-net penalty with α = 0.2; it is hard to detect the difference by eye. The elastic-net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like ridge. It also has considerable computational advantages over the Lq penalties. We discuss the elastic-net further in Section 18.4.

3.4.4 Least Angle Regression

Least angle regression (LAR) is a relative newcomer (Efron et al., 2004), and can be viewed as a kind of "democratic" version of forward stepwise regression (Section 3.3.2). As we will see, LAR is intimately connected with the lasso, and in fact provides an extremely efficient algorithm for computing the entire lasso path as in Figure 3.10.

Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables.

Least angle regression uses a similar strategy, but only enters "as much" of a predictor as it deserves. At the first step it identifies the variable most correlated with the response. Rather than fit this variable completely, LAR moves the coefficient of this variable continuously toward its least-squares value (causing its correlation with the evolving residual to decrease in absolute value). As soon as another variable "catches up" in terms of correlation with the residual, the process is paused. The second variable then joins the active set, and their coefficients are moved together in a way that keeps their correlations tied and decreasing. This process is continued.
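A sketch of the elastic-net compromise via coordinate descent, in the slide's parameterization λ(α‖β‖₂² + (1 − α)‖β‖₁). Note R's glmnet uses the opposite convention for α; all names and the simulated data here are hypothetical:

```python
import numpy as np

def enet_cd(X, y, lam, alpha, n_iter=200):
    """Coordinate descent for (y-Xb)'(y-Xb) + lam*(alpha*||b||_2^2 + (1-alpha)*||b||_1)."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r_j = y - X @ b + X[:, j] * b[j]          # partial residual
            z = X[:, j] @ r_j
            # soft-threshold for the L1 part, extra denominator for the L2 part
            b[j] = np.sign(z) * max(abs(z) - lam * (1 - alpha) / 2, 0.0) \
                   / (X[:, j] @ X[:, j] + lam * alpha)
    return b

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=50)
b_net = enet_cd(X, y, lam=4.0, alpha=0.5)             # mixed penalty
```

At α = 1 the update collapses to coordinate descent for ridge (and so matches the ridge closed form); at α = 0 it is the lasso soft-threshold update.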
Minimum Distance Representation
When p < n:
- (y − Xβ)′(y − Xβ) = ê′ê + (β̂_ols − β)′X′X(β̂_ols − β)  (an algebraic trick)

Thus:
- β̂_lasso = argmin_β (β̂_ols − β)′X′X(β̂_ols − β) + λ‖β‖₁

β̂_lasso minimizes the weighted Euclidean distance to β̂_ols, plus the penalty
Thresholding Representation
Suppose p < n and X′X = I_p

Then β̂_ridge = (1/(1 + λ)) β̂_ols evenly shrinks OLS towards zero

Selection using a critical value c (e.g. c = 1.96² σ²):
- β̂_test,j = β̂_ols,j · 1(β̂²_ols,j ≥ c)
- This is called a "hard thresholding" rule
Thresholding Representation
Lasso criterion under X′X = I_p:
- Σ_{j=1}^p ((β̂_ols,j − β_j)² + λ|β_j|)

F.O.C. is:
- −2(β̂_ols,j − β_j) + λ sgn(β_j) = 0

Solution:
- β̂_lasso,j = β̂_ols,j − λ/2  if β̂_ols,j > λ/2
- β̂_lasso,j = 0  if |β̂_ols,j| ≤ λ/2
- β̂_lasso,j = β̂_ols,j + λ/2  if β̂_ols,j < −λ/2
- This is called a "soft thresholding" rule
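Under an orthonormal design the shrinkage and thresholding rules above can be compared directly (simulated data; the penalty λ and the critical value c are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

# Orthonormal design: columns of X satisfy X'X = I_p
n, p, lam = 50, 5, 1.0
X = np.linalg.qr(rng.normal(size=(n, p)))[0]
y = rng.normal(size=n)

b_ols = X.T @ y                          # OLS when X'X = I_p

# Ridge: even shrinkage toward zero
b_ridge = b_ols / (1 + lam)

# Lasso: soft thresholding at lam / 2
b_lasso = np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2, 0.0)

# Hard thresholding at a critical value c: keep or kill
c = 0.5
b_hard = b_ols * (b_ols**2 >= c)
```

Since the lasso criterion separates across coefficients here, b_lasso is the exact global minimizer of the L1-penalized SSE, which can be checked against competing candidates.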
Selection, Ridge Regression and the Lasso
Scaling
The Lasso criterion (y − Xβ)′(y − Xβ) + λ‖β‖₁ is not invariant to re-scaling the regressors
- The penalty λ Σ_{j=1}^p |β_j| is identical for each coefficient
- If you rescale a regressor (e.g. change the units of measurement) then the penalty has a completely different meaning
- Hence the scale matters
- In practice, it is common to rescale the regressors so that all are mean zero and have the same variance
  - Unless variables are already scaled similarly (e.g. interest rates)
Which Regressors
The Lasso criterion (y − Xβ)′(y − Xβ) + λ‖β‖₁ is not invariant to linear transformations of the regressors

Suppose you have X₁ and X₂

OLS on (X₁, X₂) is the same as OLS on (X₁ − X₂, X₂)

Lasso on (X₁, X₂) is different than Lasso on (X₁ − X₂, X₂)

Orthogonality:
- Much theoretical insight arises from the case of orthogonal regressors
- It may therefore be useful to start with transformed regressors which are near orthogonal
- e.g. differences between interest rates (spreads) rather than levels

Getting the right zeros:
- Many theoretical results concern sparsity (more on this later)
- This occurs when the true regression has many 0 coefficients
- It is therefore useful to start with transformed regressors which are most likely to have many zero coefficients
Grouped Lasso
We can penalize groups of coefficients so that they are included/excluded as a group

Grouped Lasso criterion:
- (y − Σ_{ℓ=1}^L X_ℓ β_ℓ)′(y − Σ_{ℓ=1}^L X_ℓ β_ℓ) + λ Σ_{ℓ=1}^L √p_ℓ ‖β_ℓ‖₂
- p_ℓ = group size
- Note the penalty is ‖β_ℓ‖₂ = (Σ_j β²_ℓj)^{1/2}, the unsquared L2 norm
Statistical Properties
There are asymptotic results for the Lasso allowing for p >> n

The results rely on a sparsity assumption: the true regression has p₀ non-zero coefficients, where p₀ is fixed
- This assumption can be relaxed in some respects, but some form of sparsity lies at the core of current theory

Under regularity conditions, Lasso estimation identifies the true predictors with high probability
- Consistent model selection
- Similar to BIC selection

The non-zero coefficients, however, are not consistently estimated but are biased

Proposals to eliminate the bias:
- Least squares after Lasso selection
- SCAD (smoothly clipped absolute deviation)
- Adaptive Lasso
Sparsity
Sparsity is all the fashion in this literature
It seems to be an assumption used to justify theory which can beproved, not an assumption based on reality
People talk about "imposing sparsity" as if the theorist can influence the world

The world is the way it is.

What does sparsity mean in practical econometrics?
- That a few coefficients are "big", the remainder zero or small
- In a series regression, only a few coefficients are non-zero, the remainder zero or small
- This does not make much sense
- More reasonable: all coefficients are non-zero, and all are small
This is a challenge for Lasso-type theory
Computation via LAR algorithm
Least Angle Regression (LAR)
A modification of forward stepwise regression:
- Start with all coefficients equal to zero
- Find x_j most correlated with y
- Increase the coefficient β_j in the direction of its correlation with y
  - Take residuals along the way
  - Stop when some other x_ℓ has the same correlation with the residual as x_j
- Increase (β_j, β_ℓ) in their joint least squares direction, until some other x_m has as much correlation with the residual
- Continue until all predictors are in the model

This algorithm gives the Lasso expansion path

Used to produce an ordering, it yields Least Angle Regression Selection (LARS)
- Alternative to stepwise regression
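The full path can be computed with scikit-learn's lars_path, a Python counterpart to the R lars package cited later in these slides (simulated data; a sketch, not the lecture's own code):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(9)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5, 0.0]) + rng.normal(size=n)

# Entire lasso path via the (modified) LAR algorithm
alphas, active, coefs = lars_path(X, y, method="lasso")

# coefs has one column per step of the path: all zeros at the largest
# penalty, and the unpenalized least squares fit at the end
```

The `active` list records the order in which variables enter the active set, which is the LARS ordering described above.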
Least Angle Regression Solution Path
FIGURE 3.14. Progression of the absolute correlations during each step of the LAR procedure, using a simulated data set with six predictors. The labels at the top of the plot indicate which variables enter the active set at each step. The step lengths are measured in units of L1 arc length.

FIGURE 3.15. Left panel shows the LAR coefficient profiles on the simulated data, as a function of the L1 arc length. The right panel shows the Lasso profile. They are identical until the dark-blue coefficient crosses zero at an arc length of about 18.
LAR modification to make equal to Lasso
Modified LAR algorithm:
- Start with all coefficients equal to zero
- Find x_j most correlated with y
- Increase the coefficient β_j in the direction of its correlation with y
  - Take residuals along the way
  - Stop when some other x_ℓ has the same correlation with the residual as x_j
  - If a non-zero coefficient hits zero, drop it from the active set of variables and recompute the joint least squares direction
- Increase (β_j, β_ℓ) in their joint least squares direction, until some other x_m has as much correlation with the residual
- Continue until all predictors are in the model
Comparison of performance of methods
FIGURE 3.16. Comparison of LAR and lasso with forward stepwise, forward stagewise (FS) and incremental forward stagewise (FS0) regression. The setup is the same as in Figure 3.6, except N = 100 here rather than 300. Here the slower FS regression ultimately outperforms forward stepwise. LAR and lasso show similar behavior to FS and FS0. Since the procedures take different numbers of steps (across simulation replicates and methods), we plot the MSE as a function of the fraction of total L1 arc-length toward the least-squares fit.

adaptively fitted to the training data. This definition is motivated and discussed further in Sections 7.4-7.6.

Now for a linear regression with k fixed predictors, it is easy to show that df(ŷ) = k. Likewise for ridge regression, this definition leads to the closed-form expression (3.50) on page 68: df(ŷ) = tr(S_λ). In both these cases, (3.60) is simple to evaluate because the fit ŷ = H_λ y is linear in y. If we think about definition (3.60) in the context of a best subset selection of size k, it seems clear that df(ŷ) will be larger than k, and this can be verified by estimating Cov(ŷ_i, y_i)/σ² directly by simulation. However there is no closed form method for estimating df(ŷ) for best subset selection.

For LAR and lasso, something magical happens. These techniques are adaptive in a smoother way than best subset selection, and hence estimation of degrees of freedom is more tractable. Specifically it can be shown that after the kth step of the LAR procedure, the effective degrees of freedom of the fit vector is exactly k. Now for the lasso, the (modified) LAR procedure
Computation
The dual representation of Lasso and Elastic Net is a quadratic programming problem

Efficient when we have a fixed λ
- Numerically fast

The LARS algorithm provides the entire path as a function of the tuning parameter
- Useful for cross-validation
Computation
R (recommended):
- package glmnet
  - cv.glmnet(x,y)
  - Selects λ by cross-validation
- For ridge or elastic net:
  - cv.glmnet(x,y,alpha=a)
  - Set a = 0 for ridge, a = 1 for Lasso, 0 < a < 1 for elastic net
- package lars
  - lars(x,y,type="lasso")
  - lars(x,y,type="lar")

MATLAB:
- lasso(X,y)
- lasso(X,y,'CV',K)
Computation in R
library(glmnet)
mLasso <- cv.glmnet(X,y,family="gaussian",nfolds=200)
- beta <- coef(mLasso,mLasso$lambda.min)
- Useful to specify the number of folds (default is 10)
- More folds reduces instability, but takes longer
- If you do not specify "lambda.min" the package will use "lambda.1se", which is different than the minimizer

mRidge <- cv.glmnet(X,y,alpha=0,family="gaussian",nfolds=200)

mElastic <- cv.glmnet(X,y,alpha=1/2,family="gaussian",nfolds=200)
Illustration
CPS wage regression using a subsample of Asian women (n = 1149)

Regressors:
- education (linear), and dummies for education equalling 12, 13, 14, 16, 18, and 20
- experience in powers from 1 to 9
- marriage dummies (6 of 7 categories), 3 region dummies, union dummy

Lasso, with λ selected by minimizing 200-fold CV

Selected regressors:
- Education dummies
- Experience powers 1, 2, 3, 5, 6
- All remaining dummies

Coefficient estimates: most shrunk about 10% relative to least squares
REGRESSION TREES
Regression Tree
Partition the regressor space into rectangles:
- Split based on whether a regressor is below or exceeds a threshold
- Split again
- Split again

Each split point is a node
Each subset is a branch
On each branch, fit a simple model:
- Often just the sample mean of y_i
- Or a linear regression
[Figure: Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap 9]

FIGURE 9.2. Partitions and CART. Top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. Top left panel shows a general partition that cannot be obtained from recursive binary splitting. Bottom left panel shows the tree corresponding to the partition in the top right panel, and a perspective plot of the prediction surface appears in the bottom right panel.
[Figure: Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap 9]

FIGURE 9.5. The pruned tree for the spam example. The split variables are shown in blue on the branches, and the classification is shown in every node. The numbers under the terminal nodes indicate misclassification rates on the test data.
Estimation of nodes

Regression: minimizing SSE
This is the same as for threshold models in econometrics
I Similar to structural change estimation
Potential split points equal n (at each observation point)
Estimate up to n regressions, one for each possible split point
Find the split point which minimizes the SSE
This is the least squares estimator of the node
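A minimal sketch of this split search (Python rather than R, with made-up data): every observed x value is a candidate threshold, each candidate is scored by the SSE of the two branch means, and the minimizer is the least squares estimate of the node.

```python
def best_split(x, y):
    # score every observation's x-value as a candidate threshold
    pairs = sorted(zip(x, y))
    best_sse, best_t = float("inf"), None
    for k in range(1, len(pairs)):
        t = pairs[k][0]                       # split: x < t versus x >= t
        left = [b for a, b in pairs if a < t]
        right = [b for a, b in pairs if a >= t]
        if not left or not right:
            continue
        ml = sum(left) / len(left)            # branch means
        mr = sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if sse < best_sse:
            best_sse, best_t = sse, t
    return best_t, best_sse

# step function with a jump at x = 5: the estimated node should be at 5
x = list(range(10))
y = [1.0] * 5 + [4.0] * 5
t, sse = best_split(x, y)
```

With the jump exactly at x = 5, the split at t = 5 fits both branches perfectly, so the minimized SSE is zero.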
Tree Estimation

Given the nodes (the split points), you estimate the simple model (mean or linear regression) on each branch
The regression model is the tree structure plus the models for each branch
Prediction (estimation of the conditional mean at a point)
I Given the regressor, which branch are we on?
I Compute the conditional mean for that branch
Take, for example, a wage regression
I splits might be based on sex, race, region, education levels, experience levels, etc.
I Each split is binary
I A branch is a set of characteristics. The estimate (typically) is the mean for this group
How Many Nodes?

First fit (grow) a large tree, based on a pre-specified maximum number of nodes
Then prune back by minimizing a cost criterion
I T = tree
I |T| = number of terminal nodes
I ŷi = fitted value (mean or regression fit within each branch)
I êi = yi − ŷi
I C = Σi êi² + α|T|
I Penalized sum of squared errors, AIC-like
Penalty term α selected by CV
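The pruning criterion itself is one line of arithmetic; the sketch below compares two hypothetical subtrees under C = SSE + α|T|, with made-up numbers, to show how the penalty can favor the smaller tree even though the deep tree fits better.

```python
def cost(sse, n_terminal, alpha):
    # penalized sum of squared errors: C = SSE + alpha * |T|
    return sse + alpha * n_terminal

# hypothetical subtrees: the deep tree fits better but uses more leaves
deep = cost(sse=40.0, n_terminal=12, alpha=2.0)
pruned = cost(sse=55.0, n_terminal=3, alpha=2.0)
keep_pruned = pruned < deep
```

Here the penalty 2.0 × 12 = 24 outweighs the deep tree's fit advantage, so the pruned tree wins; a smaller α would reverse the choice, which is why α is selected by CV.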
Comments on Trees

Flexible non-parametric approach
Typically used for prediction
Can be useful for decisions
Consider a doctor deciding on a treatment:
I What is your gender?
I Is your cholesterol above 200?
I Is your blood pressure above 130?
I Is your age above 60?
I Given this information, we recommend you take the BLUE pill
BAGGING (BOOTSTRAP AGGREGATION)
Bagging

Bootstrap averaging for an estimator m̂(x) of the conditional mean m(x)
I Example: Regression tree
Generate B random samples of size n by sampling with replacement from the data
I On each bootstrap sample, fit the estimator m̂_b(x)
Average across the bootstrap samples
I m̂_bag(x) = B⁻¹ Σ_{b=1}^B m̂_b(x)
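The three steps above translate almost directly into code. A sketch under simplifying assumptions: the base estimator is a one-split stump (a stand-in for a full regression tree), and the data are a toy step function, all made up for illustration.

```python
import random

def fit_stump(x, y):
    # base learner: one split at the sample median, branch means as predictions
    t = sorted(x)[len(x) // 2]
    left = [b for a, b in zip(x, y) if a < t] or [0.0]
    right = [b for a, b in zip(x, y) if a >= t] or [0.0]
    ml, mr = sum(left) / len(left), sum(right) / len(right)
    return lambda q: ml if q < t else mr

def bag(x, y, B, seed=0):
    rng = random.Random(seed)
    n = len(x)
    fits = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        fits.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return lambda q: sum(f(q) for f in fits) / B     # m_bag(x)

x = [i / 10 for i in range(40)]                      # x in [0, 3.9]
y = [0.0 if a < 2.0 else 1.0 for a in x]             # step function at x = 2
m_bag = bag(x, y, B=100)
```

Each stump jumps abruptly at its own threshold, but the threshold varies across bootstrap samples, so the average rises gradually near x = 2: the smoothing effect discussed on the next slides.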
Bagging

If m̂(x) is linear then (with large B) m̂_bag(x) ≃ m̂(x)
If m̂(x) is nonlinear they are different
I Bagging reduces variance (but adds bias)
I Averaging smooths the estimator, and smoothing reduces variance
Use m̂_bag(x) for prediction
Bagging and Bias

If m̂(x) is biased then bagging increases the bias
A simple bootstrap estimator of the bias is
I bias(m̂) = B⁻¹ Σ_{b=1}^B m̂_b(x) − m̂(x) = m̂_bag(x) − m̂(x)
Thus a bias-corrected estimator of m(x) is
I m̂_bc(x) = m̂(x) − bias(m̂) = m̂(x) − (m̂_bag(x) − m̂(x)) = 2m̂(x) − m̂_bag(x)
I Not m̂_bag(x)!
Bagging does not reduce bias; it accentuates bias
Bagging is best applied to low-bias estimators
The goal is to reduce variance
A little intuition

θ̂ ∼ N(θ, I_p)
Thresholded estimator θ̃ = θ̂ 1(θ̂′θ̂ > c)
Bootstrap
I θ̂* ∼ N(θ̂, I_p)
I θ̃* = θ̂* 1(θ̂*′θ̂* > c)
Bagging estimator
I θ̃_bag = B⁻¹ Σ_{b=1}^B θ̃*_b ≃ µ(θ̂), where µ(θ) = E(θ̃)
θ̃ is a non-smooth function of θ̂
µ(θ) is a smooth function of θ
θ̃_bag ≃ µ(θ̂) is a smooth function of θ̂
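A one-dimensional (p = 1) numeric version of this intuition, as a sketch with arbitrary cutoff c = 1: the hard-threshold rule jumps at the cutoff, while its bagged counterpart, approximating µ(θ̂), changes only slightly across the same cutoff.

```python
import random

def threshold(t, c=1.0):
    # hard rule: keep t only when t^2 exceeds the cutoff c
    return t if t * t > c else 0.0

def bagged_threshold(theta_hat, B=2000, c=1.0, seed=0):
    # bootstrap world: draw theta* ~ N(theta_hat, 1), threshold, average
    rng = random.Random(seed)
    return sum(threshold(rng.gauss(theta_hat, 1.0), c) for _ in range(B)) / B

# the hard rule jumps at the cutoff; the bagged version barely moves
hard_below, hard_above = threshold(0.99), threshold(1.01)
smooth_below, smooth_above = bagged_threshold(0.99), bagged_threshold(1.01)
```

Moving θ̂ from 0.99 to 1.01 flips the hard rule from 0 to roughly 1, while the bagged estimate shifts only marginally, which is exactly the smoothing that reduces variance.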
When to use Bagging
Bagging is ideal for methods such as regression trees
Deep regression trees have low bias but high variance
Regression trees use hard thresholding rules
Bagging smooths the hard thresholding into smooth thresholding
Bagging reduces variance
Bagging is not useful for high-bias, smooth, or low-variance procedures
RANDOM FORESTS
Random Forests
Random Forests
Similar to bagging, but with an adjustment to reduce variance
When you do bagging, you are averaging identically distributed butcorrelated bootstrapped trees
The correlation means that the averaging does not reduce the variance as much as if the bootstrapped trees were uncorrelated
Random forests try to de-correlate the bootstrapped trees
Random Forest Algorithm for Regression

For b = 1, ..., B
I Draw a random sample of size n from the data set
I Grow a random forest tree Tb on the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree until the minimum node size nmin is reached (recommended nmin = 5)
F Select m variables at random from the p variables (recommended m = p/3)
F Pick the best variable/split point among the m
F Split the node into two daughter nodes
The forest prediction is m̂(x) = B⁻¹ Σ_{b=1}^B Tb(x)
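The algorithm above can be sketched with stump-depth trees, so nmin and the full recursion are elided; the key random forest ingredient shown is that each tree may only split on m of the p variables. All data and parameter choices are illustrative, not recommendations.

```python
import random

def best_split_on(X, y, j):
    # least-squares split search on feature j (as in the tree slides)
    best = (float("inf"), None, 0.0, 0.0)
    for t in sorted({row[j] for row in X}):
        left = [v for row, v in zip(X, y) if row[j] < t]
        right = [v for row, v in zip(X, y) if row[j] >= t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if sse < best[0]:
            best = (sse, t, ml, mr)
    return best

def forest(X, y, B=50, m=1, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    trees = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        best, best_j = (float("inf"), None, 0.0, 0.0), 0
        for j in rng.sample(range(p), m):            # only m of the p variables
            cand = best_split_on(Xb, yb, j)
            if cand[1] is not None and cand[0] < best[0]:
                best, best_j = cand, j
        trees.append((best_j, best[1] or 0.0, best[2], best[3]))
    def predict(row):
        # forest prediction: average of the B tree predictions
        return sum(ml if row[j] < t else mr for j, t, ml, mr in trees) / len(trees)
    return predict

random.seed(2)
X = [[random.random(), random.random()] for _ in range(80)]
y = [3.0 if row[0] > 0.5 else 0.0 for row in X]      # only feature 0 matters
m_hat = forest(X, y)
```

With m = 1 each tree is forced onto a random single variable, so some trees split on pure noise, yet the average still separates low from high values of the informative feature.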
Why Random Forest?

The randomization over the variables means that the bootstrapped regression trees are less correlated than under standard bagging
Very popular
Numerical studies show that it works well in many applications
Out-of-Bag Samples

Random forests have an evaluation device similar to cross-validation, called the out-of-bag (OOB) sample
Recall that a random forest predictor is calculated by averaging over B bootstrap samples
The probability that a given observation i is in a given bootstrap sample is about 63%
For the OOB sample
I For each observation i
F Construct its random forest predictor by averaging only over the (approximately) 37% of bootstrap samples where observation i does not appear
F Compute the OOB prediction error
I Take the sum of squared errors
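The OOB bookkeeping can be sketched with the simplest possible base fit, a bootstrap-sample mean, on made-up data; the point is only to show the roughly 63% inclusion rate (1 − (1 − 1/n)ⁿ ≈ 1 − e⁻¹) and the averaging over samples that omit observation i.

```python
import random

random.seed(0)
n, B = 200, 300
y = [random.gauss(5.0, 1.0) for _ in range(n)]

samples = [[random.randrange(n) for _ in range(n)] for _ in range(B)]
fits = [sum(y[i] for i in s) / n for s in samples]   # base fit: sample mean
in_bag = [set(s) for s in samples]

# each bootstrap sample contains about 63% of the distinct observations
inclusion = sum(len(s) for s in in_bag) / (B * n)

oob_sse, used = 0.0, 0
for i in range(n):
    # average only the ~37% of fits whose bootstrap sample omits i
    preds = [fits[b] for b in range(B) if i not in in_bag[b]]
    if preds:
        oob_sse += (y[i] - sum(preds) / len(preds)) ** 2
        used += 1
oob_mse = oob_sse / used
```

Because each OOB prediction never uses observation i, oob_mse behaves like a cross-validation error and here lands near the true noise variance of 1.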
ENSEMBLING
Ensembling Econometricians
Ensembling is Averaging

Suppose you have several estimators
I AIC-selected
I JMA model averaged
I Stein shrinkage
I Principal components regression
I Ridge
I Lasso
I Elastic Net
I Regression Tree
I Random Forest
What do you do?
Let’s look at our favorite models again
Model 1: Kendall Jenner
Model 2: Fabio
Model 3: Einstein
Ensembling is Averaging the Estimators

Weight Selection Methods:
Method 1: Elements of Statistical Learning recommends penalized regression (Ridge or Lasso penalty)
I Regress yi on the predictions from the models, with a penalty
I Regularization (penalty) is essential
I Unclear how to select λ
Method 2: Select weights by cross-validation
Considerable evidence indicates that ensembling (averaging) is better than using just one method
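Method 2 in miniature, as a hedged sketch: the two "predictors" are fixed stand-in fitted values rather than models refit within each training fold (a full implementation would refit each component model per fold), and the weight w on the first predictor is chosen from a grid by held-out SSE.

```python
import random

random.seed(3)
n = 100
x = [random.uniform(-2, 2) for _ in range(n)]
y = [a + 0.3 * a ** 2 + random.gauss(0, 0.3) for a in x]

# stand-in fitted values from two hypothetical component models
pred1 = [a for a in x]              # "linear model" predictions
pred2 = [0.3 * a ** 2 for a in x]   # "nonlinear model" predictions

def cv_sse(w, K=5):
    # score the weighted ensemble w*pred1 + (1-w)*pred2 fold by fold
    sse = 0.0
    for k in range(K):
        for i in range(k * n // K, (k + 1) * n // K):  # held-out fold
            yhat = w * pred1[i] + (1 - w) * pred2[i]
            sse += (y[i] - yhat) ** 2
    return sse

grid = [i / 10 for i in range(11)]          # candidate weights 0.0, 0.1, ..., 1.0
w_best = min(grid, key=cv_sse)
```

Since the true regression mixes both components, the selected weight is interior: the ensemble beats either model used alone.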
What’s Next?

Last time I taught here, my wife and I went to Gavdos
Gavdos
Assignment # 5

Use the cps dataset from before, but now use ALL observations
Create a large set of regressors that you believe are appropriate to potentially model wages
Estimate a regression for log(wage) using the following methods
I OLS
I Ridge Regression
I Lasso
I Elastic Net with α = 1/2
Report your coefficient estimates in a table
Comment on your findings
That’s It!