Linear Methods for Regression - University of Kansas
Transcript of slides (EECS 940)
Linear Methods for Regression
The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
Presented by Junyan Li
Input: $X^T = (X_1, \dots, X_p)$

Linear regression model:

$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$$

The linear model can be a reasonable approximation even when the inputs are derived features:
• Basis expansions: e.g. $X_2 = X_1^2$, giving a polynomial representation
• Interactions between variables: e.g. $X_3 = X_1 X_2$

The intercept $\beta_0$ is included so that $f(x)$ does not have to pass through the origin.
Linear regression model
Given a set of training data $(x_i, y_i)$, the residual sum of squares is

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\big(y_i - f(x_i)\big)^2 = (y - X\beta)^T (y - X\beta),$$

where X denotes the N×(p+1) matrix with each row an input vector (with a 1 in the first position).

Setting $\frac{\partial \mathrm{RSS}}{\partial \beta} = -2X^T(y - X\beta) = 0$ gives, if X is of full column rank (if not, remove the redundancies),

$$\hat{\beta} = (X^T X)^{-1} X^T y, \qquad \hat{y} = X\hat{\beta} = X(X^T X)^{-1} X^T y = Hy,$$

where $H = X(X^T X)^{-1} X^T$ is called the "hat" matrix.
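A minimal numpy sketch of the closed-form fit above (data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
# first column is the constant 1, as in the slide's definition of X
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# beta_hat = (X^T X)^{-1} X^T y, via a linear solve rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# the hat matrix H projects y onto the column space of X
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

assert np.allclose(y_hat, X @ beta_hat)               # same fitted values
assert np.allclose(X.T @ (y - y_hat), 0, atol=1e-8)   # residual orthogonal to columns of X
```

The two assertions check exactly the projection property described on the next slide.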
Linear regression model
$\hat{y}$ is the orthogonal projection of y onto the subspace spanned by the column vectors $x_0, \dots, x_p$ of X, with $x_0 \equiv 1$.

$X^T(y - \hat{y}) = 0$ => $y - \hat{y}$ is orthogonal to that subspace.
Linear regression model
Assume the $y_i$ are uncorrelated and have constant variance $\sigma^2$. Then

$$\mathrm{Var}(\hat{\beta}) = (X^T X)^{-1}\sigma^2.$$

Assume further that the deviations of Y around its expectation are additive and Gaussian, $\varepsilon \sim N(0, \sigma^2)$. Then $\hat{\beta} \sim N\big(\beta, (X^T X)^{-1}\sigma^2\big)$.
Linear regression model
The residuals are constrained by the p+1 equalities

$$X^T(y - \hat{y}) = 0$$

So $\hat{\sigma}^2 = \frac{1}{N-p-1}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$ is an unbiased estimator of $\sigma^2$, and $(N-p-1)\hat{\sigma}^2 \sim \sigma^2\chi^2_{N-p-1}$.
To test the hypothesis that a particular coefficient $\beta_j = 0$, use $z_j = \hat{\beta}_j/(\sigma\sqrt{v_j}) \sim N(0,1)$ (if $\sigma$ is given) or $t_j = \hat{\beta}_j/(\hat{\sigma}\sqrt{v_j}) \sim t_{N-p-1}$ (if $\sigma$ is not given), where $v_j$ is the jth diagonal element of $(X^T X)^{-1}$. When N−p−1 is large enough, the difference between the tail quantiles of a t-distribution and a standard normal distribution is negligible, so we use the standard normal distribution regardless of whether $\sigma$ is given or not.

Simultaneously, we can use the F statistic to test whether a group of parameters can be removed.
$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)}$$

where $\mathrm{RSS}_1$ is the residual sum-of-squares for the least squares fit of the bigger model with $p_1+1$ parameters, and $\mathrm{RSS}_0$ the same for the nested smaller model with $p_0+1$ parameters.
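A small numpy sketch of this F statistic on synthetic data (the data and the nesting are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p0, p1 = 60, 2, 4
X_big = np.column_stack([np.ones(N), rng.normal(size=(N, p1))])
X_small = X_big[:, : p0 + 1]            # nested model: intercept + first p0 predictors
# y truly depends only on the small model's predictors
y = X_big[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

rss0, rss1 = rss(X_small, y), rss(X_big, y)
F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
assert rss1 <= rss0 and F >= 0.0
```

Under the null (the extra coefficients are zero), F follows an $F_{p_1-p_0,\,N-p_1-1}$ distribution, against whose quantiles the computed value would be compared.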
Gauss Markov Theorem
The least squares estimates of the parameters β have the smallest variance among all linear unbiased estimates.

Let $\tilde{\beta} = \big((X^T X)^{-1}X^T + D\big)y$ be another linear estimator of β. Since E(e) = 0, unbiasedness $E(\tilde{\beta}) = \beta$ forces DX = 0, and then

$$\mathrm{Var}(\tilde{\beta}) = \sigma^2(X^T X)^{-1} + \sigma^2 DD^T \succeq \mathrm{Var}(\hat{\beta}).$$

However, there may exist a biased estimator with smaller mean squared error, and mean squared error is intimately related to prediction accuracy.
Multiple Regression from Simple Univariate Regression
If p = 1 and there is no intercept, $\hat{\beta} = \frac{\sum_i x_i y_i}{\sum_i x_i^2} = \frac{\langle x, y\rangle}{\langle x, x\rangle}$, with residual $r_i = y_i - x_i\hat{\beta}$, where $\langle x, y\rangle = \sum_{i=1}^{N} x_i y_i$.
Let $x_1, \dots, x_p$ be the columns of X. If $\langle x_i, x_j\rangle = 0$ (orthogonal) for each $i \neq j$, then $X^T X$ is diagonal:

$$X^T X = \begin{pmatrix} \langle x_1, x_1\rangle & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \langle x_p, x_p\rangle \end{pmatrix}$$

so each multiple regression coefficient reduces to a univariate fit:

$$\hat{\beta}_j = \frac{\langle x_j, y\rangle}{\langle x_j, x_j\rangle}$$
$$\langle r, x\rangle = \Big\langle y - \frac{\langle x, y\rangle}{\langle x, x\rangle}x,\; x\Big\rangle = \langle y, x\rangle - \frac{\langle x, y\rangle}{\langle x, x\rangle}\langle x, x\rangle = 0$$

=> r and x are orthogonal.
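A quick numpy check that orthogonal inputs make the multiple regression collapse into univariate fits (the orthonormal X here is manufactured via a QR factorization; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 40
# manufacture a design matrix with exactly orthogonal columns
Q, _ = np.linalg.qr(rng.normal(size=(N, 3)))
X = Q
y = rng.normal(size=N)

# full multiple regression
multi = np.linalg.lstsq(X, y, rcond=None)[0]
# univariate fits <x_j, y> / <x_j, x_j>, one column at a time
uni = np.array([X[:, j] @ y / (X[:, j] @ X[:, j]) for j in range(3)])

assert np.allclose(multi, uni)
```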
Multiple Regression from Simple Univariate Regression
Orthogonal inputs occur most often with balanced, designed experiments (where orthogonality is enforced), but almost never with observational data.
Regression by Successive Orthogonalization:
1. Initialize $z_0 = x_0 = 1$.
2. For j = 1, 2, . . . , p: regress $x_j$ on $z_0, z_1, \dots, z_{j-1}$ to produce coefficients

$$\hat{\gamma}_{lj} = \frac{\langle z_l, x_j\rangle}{\langle z_l, z_l\rangle}, \quad l = 0, \dots, j-1$$

and residual vector $z_j = x_j - \sum_{k=0}^{j-1}\hat{\gamma}_{kj}z_k$.
3. Regress y on the residual $z_p$ to give the estimate

$$\hat{\beta}_p = \frac{\langle z_p, y\rangle}{\langle z_p, z_p\rangle}$$

We can see that each of the $x_j$ is a linear combination of the $z_k$, $k \le j$, and the $z_j$ are orthogonal to each other.
Multiple Regression from Simple Univariate Regression

In matrix form:

$$X = Z\Gamma$$

where Z has columns $z_j$ and $\Gamma$ is the upper triangular matrix with entries $\hat{\gamma}_{kj}$. Introducing the diagonal matrix D with jth diagonal entry $D_{jj} = \|z_j\|$,

$$X = ZD^{-1}D\Gamma = QR, \qquad Q = ZD^{-1}, \quad R = D\Gamma.$$

The QR decomposition represents a convenient orthogonal basis for the column space of X. Q is an N×(p+1) orthogonal matrix ($Q^TQ = I$), and R is a (p+1)×(p+1) upper triangular matrix. Then

$$\hat{\beta} = R^{-1}Q^Ty, \qquad \hat{y} = QQ^Ty.$$
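Assuming numpy, the QR route to the least squares fit can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(25), rng.normal(size=(25, 2))])
y = rng.normal(size=25)

Q, R = np.linalg.qr(X)                  # reduced QR: Q is N x (p+1), R upper triangular
beta_qr = np.linalg.solve(R, Q.T @ y)   # beta = R^{-1} Q^T y (back-substitution)
y_hat = Q @ (Q.T @ y)                   # y_hat = Q Q^T y

assert np.allclose(beta_qr, np.linalg.lstsq(X, y, rcond=None)[0])
assert np.allclose(y_hat, X @ beta_qr)
```

Solving via R avoids forming $X^TX$, which is why QR is the standard numerical route for least squares.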
Subset Selection
There are two reasons why we are often not satisfied with the least squares estimates:
• Prediction accuracy: the least squares estimates often have low bias but large variance.
• Interpretation: with a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects.
1. Best‐Subset Selection
Best subset regression finds, for each k ∈ {0, 1, 2, . . . , p}, the subset of size k that gives the smallest residual sum of squares. Infeasible for p much larger than 40.
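For small p the exhaustive search is easy to write down; a sketch (function name and data are illustrative):

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    """Exhaustive best subset of size k (intercept always included).
    Feasible only for small p: the loop visits C(p, k) subsets."""
    N, p = X.shape
    best, best_rss = None, np.inf
    for S in itertools.combinations(range(p), k):
        Xs = np.column_stack([np.ones(N), X[:, S]])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = float(np.sum((y - Xs @ beta) ** 2))
        if rss < best_rss:
            best, best_rss = S, rss
    return best, best_rss

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 6))
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + 0.1 * rng.normal(size=40)
S, _ = best_subset(X, y, 2)
assert S == (1, 4)   # recovers the two truly active predictors
```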
Subset Selection
2. Forward‐ and Backward‐Stepwise Selection
1) Forward-stepwise selection is a greedy algorithm: it starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit.

Computational: for large p we cannot compute the best subset sequence, but we can always compute the forward stepwise sequence (even when p >> N).
Statistical: a price is paid in variance for selecting the best subset of each size; forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias.
2) Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit (use the z-score; requires N > p+1).
3) Hybrid
Subset Selection
3. Forward‐Stagewise Regression
Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression. It starts like forward-stepwise regression, with an intercept equal to ȳ, and centered predictors with coefficients initially all 0. At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable. This is continued until none of the variables have correlation with the residuals.

In forward-stepwise selection a variable's coefficient is fitted in full when it enters, but in FS the coefficients are built up partially over many steps, which works better in very high-dimensional problems.
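The stagewise loop described above can be sketched in a few lines of numpy (names illustrative; assumes standardized columns and centered y, as the algorithm requires):

```python
import numpy as np

def forward_stagewise(X, y, n_steps=5000, tol=1e-8):
    """Forward-stagewise regression: at each step, regress the residual
    on the most-correlated variable and add that simple coefficient."""
    p = X.shape[1]
    beta, r = np.zeros(p), y.astype(float).copy()
    for _ in range(n_steps):
        corr = X.T @ r
        j = int(np.argmax(np.abs(corr)))
        if abs(corr[j]) < tol:                 # no variable correlates with residual
            break
        delta = corr[j] / (X[:, j] @ X[:, j])  # simple regression coefficient
        beta[j] += delta
        r -= delta * X[:, j]
    return beta

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))
X -= X.mean(0); X /= np.linalg.norm(X, axis=0)
y = rng.normal(size=50); y -= y.mean()
beta_fs = forward_stagewise(X, y)
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(beta_fs, beta_ls, atol=1e-4)   # eventually reaches the LS fit
```

With correlated predictors this can take many small steps to converge, which is exactly the "slow fitting" character the text describes.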
Shrinkage Methods
Because subset selection is a discrete process (variables are either retained or discarded), it often exhibits high variance, and so doesn't reduce the prediction error of the full model. Shrinkage methods are more continuous, and don't suffer as much from high variability.
1. Ridge Regression
$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta}\left\{\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\right\}, \quad \lambda \ge 0$$

Equivalently, minimize $\sum_{i=1}^{N}\big(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\big)^2$ subject to $\sum_{j=1}^{p}\beta_j^2 \le t$: penalizing by the sum-of-squares of the coefficients.

There is a one-to-one correspondence between the λ in the penalized form and the t in the constrained form.
Shrinkage Methods
$$\mathrm{RSS}(\lambda) = (y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$$

$$\hat{\beta}^{\mathrm{ridge}} = (X^TX + \lambda I)^{-1}X^Ty,$$

which can be obtained by computing the first and second derivatives.
Singular value decomposition (SVD): $X = UDV^T$, where U and V are N×p and p×p orthogonal matrices, with the columns of U spanning the column space of X, and the columns of V spanning the row space. D is a p×p diagonal matrix, with diagonal entries $d_{11} \ge d_{22} \ge \cdots \ge d_{pp} \ge 0$.
$$X\hat{\beta}^{\mathrm{ridge}} = X(X^TX + \lambda I)^{-1}X^Ty = \sum_{j=1}^{p} u_j\,\frac{d_j^2}{d_j^2 + \lambda}\,u_j^Ty$$

Like linear regression, ridge regression computes the coordinates of y with respect to the orthonormal basis U. It then shrinks these coordinates by the factors $\frac{d_j^2}{d_j^2 + \lambda}$.
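A numpy check of the SVD view of ridge (data illustrative; inputs assumed centered so no intercept appears):

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, lam = 40, 3, 2.0
X = rng.normal(size=(N, p))
y = rng.normal(size=N)

# direct formula: (X^T X + lam I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# same fitted values via the SVD: sum_j u_j * d_j^2/(d_j^2 + lam) * u_j^T y
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)
fit_svd = U @ (shrink * (U.T @ y))
assert np.allclose(X @ beta_ridge, fit_svd)

# effective degrees of freedom df(lam) = sum of the shrinkage factors
df = float(shrink.sum())
assert df < p
```

The `df` computed here is exactly the $\mathrm{tr}[X(X^TX+\lambda I)^{-1}X^T]$ quantity defined on the next slide.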
Shrinkage Methods
Degrees of freedom: $\mathrm{df}(\lambda) = \mathrm{tr}\big[X(X^TX + \lambda I)^{-1}X^T\big] = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + \lambda}$

Hence the small singular values $d_j$ correspond to directions in the column space of X having small variance, and ridge regression shrinks these directions the most.

Eigen decomposition: $X^TX = VD^2V^T$. The eigenvectors $v_j$ (columns of V) are also called the principal components (or Karhunen–Loeve) directions of X.
Shrinkage Methods
2. Lasso Regression
$$\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta}\left\{\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\right\}, \quad \lambda \ge 0$$

The L1 constraint makes the solutions nonlinear in the $y_i$: there is no closed form expression as in ridge regression. Computing the lasso solution is a quadratic programming problem.

If the solution occurs at a corner of the constraint region $\sum_j|\beta_j| \le t$, then it has one parameter $\beta_j$ equal to zero.
Shrinkage Methods
Assume the columns of X are orthonormal, so $X^TX = I$ and the least squares estimate is $\hat{\beta} = X^Ty$.

Minimizing $(y - X\beta)^T(y - X\beta) + \lambda\sum_j|\beta_j|$ is then equivalent to minimizing $(\beta - \hat{\beta})^T(\beta - \hat{\beta}) + \lambda\sum_j|\beta_j|$, which separates by coordinate. For $\beta_j \neq 0$, setting the derivative to zero gives

$$2(\beta_j - \hat{\beta}_j) + \lambda\,\mathrm{sign}(\beta_j) = 0 \;\Rightarrow\; \beta_j = \hat{\beta}_j - \tfrac{\lambda}{2}\,\mathrm{sign}(\beta_j),$$

and since this requires $\mathrm{sign}(\beta_j) = \mathrm{sign}(\hat{\beta}_j)$,

$$\hat{\beta}_j^{\mathrm{lasso}} = \mathrm{sign}(\hat{\beta}_j)\big(|\hat{\beta}_j| - \tfrac{\lambda}{2}\big)_+$$

Three desirable properties of a penalized estimator:
1. Unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large, to avoid unnecessary modeling bias.
2. Sparsity: the resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.
3. Continuity: the resulting estimator is continuous, to avoid instability in model prediction.
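The soft-thresholding rule derived above, as a one-line numpy function (names illustrative):

```python
import numpy as np

def soft_threshold(b, t):
    """Lasso solution for orthonormal X: sign(b) * (|b| - t)_+,
    applied coordinate-wise to the least squares estimate b."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

b = np.array([3.0, -0.4, 1.2])
out = soft_threshold(b, 1.0)
assert np.allclose(out, [2.0, 0.0, 0.2])   # small coefficient is set exactly to zero
```

The example shows all three properties at once: large coefficients are shifted by a constant, small ones are zeroed (sparsity), and the map is continuous in `b`.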
Shrinkage Methods
3. Least Angle Regression
Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables. Least angle regression uses a similar strategy, but only enters "as much" of a predictor as it deserves.
1. Standardize the predictors to have mean zero and unit norm. Start with the residual $r = y - \bar{y}$, $\hat{\beta}_1, \dots, \hat{\beta}_p = 0$.
2. Find the predictor $x_j$ most correlated with r (cosine).
3. Move $\hat{\beta}_j$ from 0 towards its least-squares coefficient $\langle x_j, r\rangle$, until some other competitor $x_k$ has as much correlation with the current residual as does $x_j$.
4. Move $\hat{\beta}_j$ and $\hat{\beta}_k$ in the direction defined by their joint least squares coefficient of the current residual on $(x_j, x_k)$, until some other competitor $x_l$ has as much correlation with the current residual.
4a. If a non-zero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least squares direction. (LAR becomes the lasso if this step is added.)
5. Continue in this way until all p predictors have been entered. After min(N − 1, p) steps, we arrive at the full least-squares solution.
Linear Methods for Classification
Given K classes, fit linear models $\hat{f}_k(x) = \hat{\beta}_{k0} + \hat{\beta}_k^Tx$. The decision boundary between class k and l is the set of points for which $\hat{f}_k(x) = \hat{f}_l(x)$.

For two classes, a popular model is

$$P(G=1 \mid X=x) = \frac{\exp(\beta_0 + \beta^Tx)}{1 + \exp(\beta_0 + \beta^Tx)}, \qquad P(G=2 \mid X=x) = \frac{1}{1 + \exp(\beta_0 + \beta^Tx)},$$

so that

$$\log\frac{P(G=1 \mid X=x)}{P(G=2 \mid X=x)} = \beta_0 + \beta^Tx.$$

Hyperplanes: model the boundaries as linear.
Linear Methods for Classification

Thus if G has K classes, there will be K such indicators $Y_k$, k = 1, . . . , K, with $Y_k = 1$ if G = k, else 0. These are collected together in a vector $Y = (Y_1, \dots, Y_K)$.

Classify according to $\hat{G}(x) = \arg\max_{k \in \mathcal{G}} \hat{f}_k(x)$. Because of the rigid nature of the regression model, classes may be masked by others, even though they are perfectly separated.

A loose but general rule is that if K ≥ 3 classes are lined up, polynomial terms up to degree K − 1 might be needed to resolve them.
Linear Methods for Classification
Suppose $f_k(x)$ is the class-conditional density of X in class G = k, and $\pi_k$ is the prior probability of class k. Then

$$P(G=k \mid X=x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^{K}f_l(x)\pi_l}$$

Suppose each class density is multivariate Gaussian:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\Big(-\tfrac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\Big)$$

In linear discriminant analysis (LDA) we assume that $\Sigma_k = \Sigma$ for each k. Then

$$\log\frac{P(G=k \mid X=x)}{P(G=l \mid X=x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \log\frac{\pi_k}{\pi_l} - \tfrac{1}{2}(\mu_k + \mu_l)^T\Sigma^{-1}(\mu_k - \mu_l) + x^T\Sigma^{-1}(\mu_k - \mu_l),$$

which is linear in x: the decision boundaries are hyperplanes in p dimensions.
Linear Methods for Classification
The linear discriminant function:
$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$$

Here $x^T\Sigma^{-1}\mu_k = \mu_k^T\Sigma^{-1}x$, as $\Sigma^{-1}$ is symmetric and a scalar equals its own transpose. The quadratic term $x^T\Sigma^{-1}x$ is ignored, as it is constant across classes.

In practice, we estimate the parameters using training data:
• $\hat{\pi}_k = N_k/N$, where $N_k$ is the number of class-k observations
• $\hat{\mu}_k = \sum_{g_i=k}x_i/N_k$
• $\hat{\Sigma} = \sum_{k=1}^{K}\sum_{g_i=k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T/(N-K)$
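The plug-in estimates and the discriminant functions fit in a short numpy sketch (function names and data are illustrative):

```python
import numpy as np

def lda_fit(X, g, K):
    """Plug-in LDA estimates: priors, class means, pooled covariance."""
    N = len(g)
    pi = np.array([(g == k).mean() for k in range(K)])
    mu = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigma = sum((X[g == k] - mu[k]).T @ (X[g == k] - mu[k]) for k in range(K)) / (N - K)
    return pi, mu, Sigma

def lda_predict(X, pi, mu, Sigma):
    Sinv = np.linalg.inv(Sigma)
    # delta_k(x) = x^T Sinv mu_k - 0.5 mu_k^T Sinv mu_k + log pi_k
    delta = X @ Sinv @ mu.T - 0.5 * np.einsum('kp,pq,kq->k', mu, Sinv, mu) + np.log(pi)
    return np.argmax(delta, axis=1)

# two well-separated Gaussian classes with a common covariance
rng = np.random.default_rng(8)
X = np.vstack([rng.normal(size=(100, 2)), rng.normal(size=(100, 2)) + 4.0])
g = np.repeat([0, 1], 100)
pi, mu, Sigma = lda_fit(X, g, 2)
pred = lda_predict(X, pi, mu, Sigma)
assert (pred == g).mean() > 0.95
```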
Linear Methods for Classification
In quadratic discriminant analysis (QDA) we assume the $\Sigma_k$ are not equal:

$$\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log\pi_k$$

LDA in the enlarged quadratic polynomial space is quite similar to QDA.

Regularized discriminant analysis (RDA):

$$\hat{\Sigma}_k(\alpha) = \alpha\hat{\Sigma}_k + (1-\alpha)\hat{\Sigma}$$

In practice α can be chosen based on the performance of the model on validation data, or by cross-validation.
Computations can be simplified by the eigen-decomposition $\hat{\Sigma}_k = U_kD_kU_k^T$, where $U_k$ is p×p orthonormal and $D_k$ is a diagonal matrix of positive eigenvalues $d_{kl}$. Then

$$(x-\hat{\mu}_k)^T\hat{\Sigma}_k^{-1}(x-\hat{\mu}_k) = \big[U_k^T(x-\hat{\mu}_k)\big]^TD_k^{-1}\big[U_k^T(x-\hat{\mu}_k)\big]$$

$$\log|\hat{\Sigma}_k| = \sum_l \log d_{kl}$$
Linear Methods for Classification
Gaussian classification with common covariances leads to linear decision boundaries. Classification can be achieved by sphering the data with respect to W, and classifying to the closest centroid (modulo log πk) in the sphered space.

Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space.

This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.
Linear Methods for Classification
Find the linear combination $Z = a^TX$ such that the between-class variance is maximized relative to the within-class variance:

$$\max_a \frac{a^TBa}{a^TWa}$$

where B is the "between-classes scatter matrix" and W is the "within-classes scatter matrix".

Equivalently, maximize $a^TBa$ subject to $a^TWa = 1$. The Lagrangian is

$$L = \tfrac{1}{2}a^TBa - \tfrac{1}{2}\lambda(a^TWa - 1),$$

and setting $\partial L/\partial a = 0$ gives the generalized eigenvalue problem

$$Ba = \lambda Wa \;\Longleftrightarrow\; W^{-1}Ba = \lambda a,$$

so a is the leading eigenvector of $W^{-1}B$.
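A numpy sketch of the eigenproblem for two classes (data illustrative), checked against the closed-form two-class direction $W^{-1}(\mu_1 - \mu_0)$:

```python
import numpy as np

rng = np.random.default_rng(5)
X0 = rng.normal(size=(50, 2))
X1 = rng.normal(size=(50, 2)) + np.array([3.0, 1.0])
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# within-class and (two-class, rank-one) between-class scatter matrices
W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
d = mu1 - mu0
B = np.outer(d, d)

# leading eigenvector of W^{-1} B
vals, vecs = np.linalg.eig(np.linalg.inv(W) @ B)
a = vecs[:, np.argmax(vals.real)].real

# for K = 2 the Fisher direction is proportional to W^{-1}(mu1 - mu0)
a_closed = np.linalg.solve(W, d)
cos = abs(a @ a_closed) / (np.linalg.norm(a) * np.linalg.norm(a_closed))
assert abs(cos - 1.0) < 1e-6
```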
Logistic Regression
$$P(G=k \mid X=x) = \frac{\exp(\beta_{k0} + \beta_k^Tx)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^Tx)}, \quad k = 1, \dots, K-1$$

$$P(G=K \mid X=x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^Tx)}$$

Let $P(G=k \mid X=x) = p_k(x;\theta)$. The log-likelihood is

$$\ell(\theta) = \sum_{i=1}^{N}\log p_{g_i}(x_i;\theta)$$

In the two-class case, code the classes via a 0/1 response $y_i$, where $y_i = 1$ when $g_i = 1$, and $y_i = 0$ when $g_i = 2$. Then

$$\ell(\beta) = \sum_{i=1}^{N}\big\{y_i\log p(x_i;\beta) + (1-y_i)\log(1-p(x_i;\beta))\big\} = \sum_{i=1}^{N}\big\{y_i\beta^Tx_i - \log(1+e^{\beta^Tx_i})\big\}$$

where $\beta = \{\beta_{10}, \beta_1\}$ and $x_i$ includes the constant term 1. Setting the derivatives to zero:

$$\frac{\partial\ell}{\partial\beta} = \sum_{i=1}^{N}x_i\big(y_i - p(x_i;\beta)\big) = 0$$

$$\frac{\partial^2\ell}{\partial\beta\,\partial\beta^T} = -\sum_{i=1}^{N}x_ix_i^T\,p(x_i;\beta)\big(1-p(x_i;\beta)\big)$$

Newton–Raphson algorithm:

$$\beta^{\mathrm{new}} = \beta^{\mathrm{old}} - \left(\frac{\partial^2\ell}{\partial\beta\,\partial\beta^T}\right)^{-1}\frac{\partial\ell}{\partial\beta}$$
Logistic Regression
In matrix notation: let y denote the vector of $y_i$ values, X the N×(p+1) matrix of $x_i$ values, p the vector of fitted probabilities with ith element $p(x_i;\beta^{\mathrm{old}})$, and W an N×N diagonal matrix of weights with ith diagonal element $p(x_i;\beta^{\mathrm{old}})\big(1-p(x_i;\beta^{\mathrm{old}})\big)$. Then

$$\frac{\partial\ell}{\partial\beta} = X^T(y-p), \qquad \frac{\partial^2\ell}{\partial\beta\,\partial\beta^T} = -X^TWX$$

$$\beta^{\mathrm{new}} = \beta^{\mathrm{old}} + (X^TWX)^{-1}X^T(y-p) = (X^TWX)^{-1}X^TWz$$

where $z = X\beta^{\mathrm{old}} + W^{-1}(y-p)$ is the adjusted response (iteratively reweighted least squares).
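The Newton/IRLS update can be sketched directly from the matrix formulas (names and data are illustrative; no safeguards for separated data are included):

```python
import numpy as np

def logistic_irls(X, y, n_iter=15):
    """Newton-Raphson / IRLS for two-class logistic regression.
    X: N x (p+1) with a leading column of ones; y in {0, 1}."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # fitted probabilities
        w = p * (1.0 - p)                       # diagonal of the weight matrix W
        z = X @ beta + (y - p) / w              # adjusted response
        # beta_new = (X^T W X)^{-1} X^T W z
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

rng = np.random.default_rng(6)
N = 200
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y = (rng.random(N) < p_true).astype(float)
beta = logistic_irls(X, y)
assert beta[1] > 0.0    # slope estimate recovers the sign of the true effect
```

Each iteration is a weighted least squares solve, which is why the algorithm is called iteratively reweighted least squares.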
Because each Newton step is expensive, many software packages instead use a quadratic approximation to logistic regression, and L1-regularized logistic regression is also common.
LDA or Logistic Regression

LDA: $\log\frac{P(G=k \mid X=x)}{P(G=K \mid X=x)} = \alpha_{k0} + \alpha_k^Tx$

Logistic regression: $\log\frac{P(G=k \mid X=x)}{P(G=K \mid X=x)} = \beta_{k0} + \beta_k^Tx$

Although they have exactly the same form, the difference lies in the way the linear coefficients are estimated. The logistic regression model is more general, in that it makes fewer assumptions.

Logistic regression fits the parameters by maximizing the conditional likelihood, i.e. the multinomial likelihood with probabilities P(G = k|X), where P(X) is ignored.

LDA fits the parameters by maximizing the full log-likelihood, based on the joint density P(X, G=k) = φ(X; μk, Σ)πk, where P(X) does play a role, as P(X) = Σk P(X, G=k). By assuming the class densities are Gaussian, we gain efficiency.

It is generally felt that logistic regression is a safer, more robust bet than the LDA model, relying on fewer assumptions. It is our experience that the models give very similar results, even when LDA is used inappropriately.
Any Questions?