Page 1:

Linear Methods for Regression

The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman

Presented by Junyan Li

Page 2:

Input: $X^T = (X_1, \ldots, X_p)$

Linear regression model

$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$

The linear model can be a reasonable approximation even when the inputs are transformed versions of raw features:

• Basis expansions (e.g., squares and cubes of an input) => a polynomial representation

• Interactions between variables (e.g., the product of two inputs)

The intercept $\beta_0$ is added so that $f(x)$ does not have to pass through the origin.

Page 3:

Linear regression model

Given a set of training data $(x_i, y_i)$, $i = 1, \ldots, N$, the residual sum of squares is
$\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^2 = \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2$

Denote by X the N×(p+1) matrix with each row an input vector (with a 1 in the first position).

In matrix form, $\mathrm{RSS}(\beta) = (y - X\beta)^T(y - X\beta)$. Setting the derivative $-2X^T(y - X\beta)$ to zero gives the normal equations $X^T(y - X\beta) = 0$. If X is of full column rank (if not, remove the redundancies), $X^TX$ is invertible and

$\hat\beta = (X^TX)^{-1}X^Ty, \qquad \hat y = X\hat\beta = X(X^TX)^{-1}X^Ty = Hy$

where $H = X(X^TX)^{-1}X^T$ is called the "hat" matrix.
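As a concrete illustration (not from the slides), here is a minimal NumPy sketch of these formulas; the function name `fit_least_squares` and the assumption that X already carries a leading column of ones are mine:

```python
import numpy as np

def fit_least_squares(X, y):
    """Ordinary least squares via the normal equations.

    X : (N, p+1) design matrix whose first column is all ones.
    y : (N,) response vector.
    """
    XtX = X.T @ X                                # assumed to be of full rank
    beta_hat = np.linalg.solve(XtX, X.T @ y)     # beta = (X^T X)^{-1} X^T y
    H = X @ np.linalg.solve(XtX, X.T)            # "hat" matrix H = X (X^T X)^{-1} X^T
    y_hat = H @ y                                # fitted values y_hat = H y
    return beta_hat, y_hat, H
```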

Page 4:

Linear regression model

$\hat y$ is the orthogonal projection of y onto the subspace spanned by the column vectors $x_0, \ldots, x_p$ of X, with $x_0 \equiv 1$.

$X^T(y - \hat y) = 0 \;\Rightarrow\; y - \hat y$ is orthogonal to that subspace.

Page 5:

Linear regression model

Assume the observations $y_i$ are uncorrelated and have constant variance $\sigma^2$, and that the $x_i$ are fixed. Then $\mathrm{Var}(\hat\beta) = (X^TX)^{-1}\sigma^2$.

Assume further that the deviations of Y around its expectation are additive and Gaussian, with $e \sim N(0, \sigma^2)$. Then $\hat\beta \sim N\bigl(\beta, (X^TX)^{-1}\sigma^2\bigr)$.

Page 6:

Linear regression model

The residuals $y - \hat y$ are constrained by the $p+1$ equalities:

$X^T(y - \hat y) = 0$

So $\hat\sigma^2 = \frac{1}{N-p-1}\sum_{i=1}^{N}(y_i - \hat y_i)^2$ is an unbiased estimator of $\sigma^2$, and $(N-p-1)\,\hat\sigma^2 \sim \sigma^2\,\chi^2_{N-p-1}$.

To test the hypothesis that a particular coefficient $\beta_j = 0$, use $z_j = \hat\beta_j / (\sigma\sqrt{v_j}) \sim N(0,1)$ (if $\sigma$ is given) or $t_j = \hat\beta_j / (\hat\sigma\sqrt{v_j}) \sim t_{N-p-1}$ (if $\sigma$ is not given), where $v_j$ is the jth diagonal element of $(X^TX)^{-1}$. When $N-p-1$ is large enough, the difference between the tail quantiles of a t-distribution and a standard normal distribution is negligible, so we use the standard normal distribution regardless of whether $\sigma$ is given or not.
Simultaneously, we can use the F statistic to test whether a group of parameters can be removed.

$F = \dfrac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)}$

where $\mathrm{RSS}_1$ is the residual sum-of-squares for the least squares fit of the bigger model with $p_1+1$ parameters, and $\mathrm{RSS}_0$ the same for the nested smaller model with $p_0+1$ parameters.
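A hedged NumPy sketch of these tests follows; the helper names (`z_scores`, `f_statistic`) are hypothetical, and the z/t distinction is collapsed into the estimated-σ version, as the slide suggests for large N − p − 1:

```python
import numpy as np

def z_scores(X, y, beta_hat):
    """z_j = beta_j / (sigma_hat * sqrt(v_j)), with v_j the jth diagonal of (X^T X)^{-1}."""
    N, p1 = X.shape                                   # p1 = p + 1
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (N - p1)             # unbiased estimate of sigma^2
    v = np.diag(np.linalg.inv(X.T @ X))
    return beta_hat / np.sqrt(sigma2_hat * v)

def f_statistic(rss0, rss1, p0, p1, N):
    """F test for dropping the (p1 - p0) extra parameters of the bigger model."""
    return ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
```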

Page 7:

Gauss Markov Theorem

The least squares estimates of the parameters β have the smallest variance among all linear unbiased estimates.

Let $\tilde\theta = c^Ty$ be another linear unbiased estimator of $\theta = a^T\beta$, and write $c^T = a^T(X^TX)^{-1}X^T + D$. Since $E(e) = 0$, $E(c^Ty) = a^T\beta + DX\beta$, and unbiasedness for all β forces $DX = 0$. Then $\mathrm{Var}(c^Ty) = \sigma^2 c^Tc = \sigma^2\bigl(a^T(X^TX)^{-1}a + DD^T\bigr) \ge \mathrm{Var}(a^T\hat\beta)$.

However, there may exist a biased estimator with smaller mean squared error, and mean squared error is intimately related to prediction accuracy.

Page 8:

Multiple Regression from Simple Univariate Regression

If p = 1 and there is no intercept, $\hat\beta = \dfrac{\sum_{i=1}^{N} x_i y_i}{\sum_{i=1}^{N} x_i^2} = \dfrac{\langle x, y\rangle}{\langle x, x\rangle}$, with residuals $r_i = y_i - x_i\hat\beta$, where $\langle x, y\rangle = \sum_{i=1}^{N} x_i y_i$.

Let $x_1, \ldots, x_p$ be the columns of X. If $\langle x_i, x_j\rangle = 0$ (orthogonal) for each $i \ne j$, then

$X^TX = \begin{pmatrix} \langle x_1, x_1\rangle & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \langle x_p, x_p\rangle \end{pmatrix}$

so $\hat\beta = (X^TX)^{-1}X^Ty = \Bigl(\dfrac{\langle x_1, y\rangle}{\langle x_1, x_1\rangle}, \ldots, \dfrac{\langle x_p, y\rangle}{\langle x_p, x_p\rangle}\Bigr)^T$: each multiple regression coefficient equals the corresponding univariate estimate $\hat\beta_j = \langle x_j, y\rangle / \langle x_j, x_j\rangle$.

For the univariate fit, $\langle r, x\rangle = \Bigl\langle y - \dfrac{\langle x, y\rangle}{\langle x, x\rangle}\,x,\; x\Bigr\rangle = \langle y, x\rangle - \dfrac{\langle x, y\rangle}{\langle x, x\rangle}\langle x, x\rangle = 0$

=> the residual r and x are orthogonal.
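A small numerical check (mine, not the slides') that orthogonal columns reduce multiple regression to univariate regressions:

```python
import numpy as np

def univariate_coefs(X, y):
    """beta_j = <x_j, y> / <x_j, x_j> for each column x_j of X."""
    return (X.T @ y) / np.sum(X * X, axis=0)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(50, 3)))      # three mutually orthogonal columns
y = rng.normal(size=50)
multi = np.linalg.lstsq(Q, y, rcond=None)[0]       # multiple regression coefficients
assert np.allclose(multi, univariate_coefs(Q, y))  # they coincide
```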

Page 9:

Multiple Regression from Simple Univariate Regression

Orthogonal inputs occur most often with balanced, designed experiments (where orthogonality is enforced), but almost never with observational data.

Regression by Successive Orthogonalization:
1. Initialize $z_0 = x_0 = 1$.
2. For $j = 1, 2, \ldots, p$: regress $x_j$ on $z_0, z_1, \ldots, z_{j-1}$ to produce coefficients

$\hat\gamma_{lj} = \dfrac{\langle z_l, x_j\rangle}{\langle z_l, z_l\rangle}, \quad l = 0, \ldots, j-1$

and residual vector $z_j = x_j - \sum_{k=0}^{j-1}\hat\gamma_{kj} z_k$.
3. Regress y on the residual $z_p$ to give the estimate

$\hat\beta_p = \dfrac{\langle z_p, y\rangle}{\langle z_p, z_p\rangle}$

We can see that each of the $x_j$ is a linear combination of the $z_k$, $k \le j$, and the $z_k$ are orthogonal to each other.
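A minimal sketch of steps 1-3 in NumPy (the function name and the convention that X carries a leading column of ones are assumptions on my part):

```python
import numpy as np

def successive_orthogonalization(X, y):
    """Gram-Schmidt on the columns of X, then regress y on the last residual z_p."""
    N, p1 = X.shape                                  # p1 = p + 1, X[:, 0] = 1
    Z = np.zeros((N, p1))
    Z[:, 0] = X[:, 0]                                # z_0 = x_0 = 1
    for j in range(1, p1):
        zj = X[:, j].astype(float)
        for l in range(j):                           # regress x_j on z_0, ..., z_{j-1}
            gamma = Z[:, l] @ X[:, j] / (Z[:, l] @ Z[:, l])
            zj = zj - gamma * Z[:, l]
        Z[:, j] = zj                                 # residual vector z_j
    beta_p = Z[:, -1] @ y / (Z[:, -1] @ Z[:, -1])    # step 3: <z_p, y> / <z_p, z_p>
    return Z, beta_p
```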

Page 10:

In matrix form:

$X = Z\Gamma$, where Z has columns $z_j$ and $\Gamma$ is the upper triangular matrix with entries $\hat\gamma_{kj}$. Introducing the diagonal matrix D with jth diagonal entry $D_{jj} = \|z_j\|$, let $Q = ZD^{-1}$ and $R = D\Gamma$, so that

$X = Z\Gamma = ZD^{-1}D\Gamma = QR$

The QR decomposition represents a convenient orthogonal basis for the column space of X. Q is an N×(p+1) orthogonal matrix ($Q^TQ = I$), and R is a (p+1)×(p+1) upper triangular matrix. The least squares solution is then

$\hat\beta = R^{-1}Q^Ty, \qquad \hat y = QQ^Ty$

Multiple Regression from Simple Univariate Regression
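For reference, here is a sketch of the same least squares fit through NumPy's QR routine (names are mine):

```python
import numpy as np

def least_squares_qr(X, y):
    """Solve least squares via the decomposition X = QR."""
    Q, R = np.linalg.qr(X)                   # Q: N x (p+1), R upper triangular
    beta_hat = np.linalg.solve(R, Q.T @ y)   # beta = R^{-1} Q^T y
    y_hat = Q @ (Q.T @ y)                    # y_hat = Q Q^T y
    return beta_hat, y_hat
```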

Page 11:

Subset Selection

There are two reasons why we are often not satisfied with the least squares estimates:

• Prediction accuracy: the least squares estimates often have low bias but large variance.
• Interpretation: with a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects.

1. Best‐Subset Selection

Best subset regression finds, for each k ∈ {0, 1, 2, . . . , p}, the subset of size k that gives the smallest residual sum of squares. This is infeasible for p much larger than 40.
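A brute-force sketch of best-subset search (my own illustration, feasible only for small p; the intercept is handled here by centering):

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    """Return the size-k set of columns of X with the smallest residual sum of squares."""
    yc = y - y.mean()
    best_rss, best_cols = np.inf, None
    for cols in itertools.combinations(range(X.shape[1]), k):
        cols = list(cols)
        Xs = X[:, cols] - X[:, cols].mean(axis=0)
        beta, *_ = np.linalg.lstsq(Xs, yc, rcond=None)
        rss = np.sum((yc - Xs @ beta) ** 2)
        if rss < best_rss:
            best_rss, best_cols = rss, cols
    return best_cols, best_rss
```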

Page 12:

Subset Selection

2. Forward‐ and Backward‐Stepwise Selection

1) Forward-stepwise selection is a greedy algorithm: it starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit (a sketch follows after this list).

Computational: for large p we cannot compute the best subset sequence, but we can always compute the forward stepwise sequence (even when p >> N).
Statistical: a price is paid in variance for selecting the best subset of each size; forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias.

2) Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit (judged by its z-score; it requires N > p+1).

3) Hybrid
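The sketch below illustrates forward-stepwise selection under my own conventions (column 0 of X is the intercept; the fit criterion is the residual sum of squares):

```python
import numpy as np

def forward_stepwise(X, y, n_steps):
    """Greedily add, at each step, the predictor that most reduces the RSS."""
    N, p = X.shape
    active = [0]                                   # start with the intercept column
    for _ in range(n_steps):
        best_rss, best_j = np.inf, None
        for j in range(p):
            if j in active:
                continue
            cols = active + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        active.append(best_j)                      # add the best predictor found
    return active
```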

Page 13:

Subset Selection

3. Forward‐Stagewise Regression

Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression. It starts like forward-stepwise regression, with an intercept equal to ȳ, and centered predictors with coefficients initially all 0. At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and then adds it to the current coefficient for that variable. This is continued until none of the variables have correlation with the residuals.

In forward-stepwise selection, a variable's contribution is added all at once, but in forward-stagewise selection coefficients are built up in many small partial steps, which can work better in very high-dimensional problems.
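An illustrative sketch of forward-stagewise regression (the step-size argument `eps` and the stopping tolerance are my additions; with `eps=None` it takes the full simple regression coefficient as described above):

```python
import numpy as np

def forward_stagewise(X, y, n_iter=1000, eps=None):
    """Incremental forward-stagewise regression on standardized predictors."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    r = y - y.mean()                               # intercept = mean(y); work with residual
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        corr = X.T @ r
        j = np.argmax(np.abs(corr))                # variable most correlated with residual
        if np.abs(corr[j]) < 1e-10:                # stop: residual uncorrelated with all x_j
            break
        delta = corr[j] / (X[:, j] @ X[:, j])      # simple regression coefficient
        if eps is not None:
            delta = eps * np.sign(corr[j])         # or a small fixed step
        beta[j] += delta
        r = r - delta * X[:, j]
    return beta
```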

Page 14:

Shrinkage Methods

Because subset selection is a discrete process (variables are either retained or discarded), it often exhibits high variance, and so does not reduce the prediction error of the full model. Shrinkage methods are more continuous, and do not suffer as much from high variability.

1. Ridge Regression

$\hat\beta^{\text{ridge}} = \operatorname*{argmin}_{\beta}\Bigl\{\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\Bigr\}, \quad \lambda \ge 0$

or, equivalently,

$\hat\beta^{\text{ridge}} = \operatorname*{argmin}_{\beta}\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le t$

Ridge regression penalizes by the sum-of-squares of the coefficients. There is a one-to-one correspondence between λ in the penalized form and t in the constrained form.

Page 15:

Shrinkage Methods

$\mathrm{RSS}(\lambda) = (y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$

$\hat\beta^{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$, which is obtained by setting the first derivative to zero (the second derivative confirms a minimum).

Singular value decomposition (SVD): $X = UDV^T$, where U and V are N×p and p×p orthogonal matrices, with the columns of U spanning the column space of X and the columns of V spanning the row space. D is a p×p diagonal matrix, with diagonal entries $d_{11} \ge d_{22} \ge \cdots \ge d_{pp} \ge 0$.

$X\hat\beta^{\text{ridge}} = X(X^TX + \lambda I)^{-1}X^Ty = \sum_{j=1}^{p} u_j\,\dfrac{d_j^2}{d_j^2 + \lambda}\,u_j^Ty$

Like linear regression, ridge regression computes the coordinates of y with respect to the orthonormal basis U. It then shrinks these coordinates by the factors $\dfrac{d_j^2}{d_j^2 + \lambda}$.
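A sketch of this SVD view of ridge regression (assuming a centered X with no intercept column; names are mine):

```python
import numpy as np

def ridge_via_svd(X, y, lam):
    """Ridge coefficients and fitted values through X = U D V^T."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = d**2 / (d**2 + lam)                        # shrinkage factors d_j^2/(d_j^2+lam)
    y_hat = U @ (shrink * (U.T @ y))                    # shrink the coordinates u_j^T y
    beta_hat = Vt.T @ ((d / (d**2 + lam)) * (U.T @ y))  # equals (X^T X + lam I)^{-1} X^T y
    df = shrink.sum()                                   # effective degrees of freedom
    return beta_hat, y_hat, df
```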

Page 16:

Shrinkage Methods

Degrees of freedom: $\mathrm{df}(\lambda) = \mathrm{tr}\bigl[X(X^TX + \lambda I)^{-1}X^T\bigr] = \sum_{j=1}^{p}\dfrac{d_j^2}{d_j^2 + \lambda}$

Hence the small singular values $d_j$ correspond to directions in the column space of X having small variance, and ridge regression shrinks these directions the most.

The eigen decomposition of $X^TX$ is $X^TX = VD^2V^T$, and the eigenvectors $v_j$ (columns of V) are also called the principal component (or Karhunen–Loève) directions of X.

Page 17:

Shrinkage Methods

2. Lasso Regression

$\hat\beta^{\text{lasso}} = \operatorname*{argmin}_{\beta}\Bigl\{\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\Bigr\}, \quad \lambda \ge 0$, or equivalently minimize the sum of squares subject to $\sum_{j=1}^{p}|\beta_j| \le t$.

The L1 constraint makes the solutions nonlinear in the $y_i$; there is no closed-form expression as in ridge regression. Computing the lasso solution is a quadratic programming problem.

If the solution occurs at a corner of the constraint region, then it has one parameter $\beta_j$ equal to zero.

Page 18:

Shrinkage Methods

Assume that the columns of X are orthonormal, so that $X^TX = I$.

Minimizing the lasso criterion is then equivalent to minimizing, separately for each j, $\tfrac12(\beta_j - \hat\beta_j)^2 + \lambda|\beta_j|$, where $\hat\beta_j$ is the least squares estimate.

For $\beta_j \ne 0$, setting the derivative to zero gives $(\beta_j - \hat\beta_j) + \lambda\,\mathrm{sign}(\beta_j) = 0$, so $\beta_j = \hat\beta_j - \lambda\,\mathrm{sign}(\beta_j)$.

1. Unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large, to avoid unnecessary modeling bias.
2. Sparsity: the resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.
3. Continuity: the resulting estimator is continuous, to avoid instability in model prediction.

To minimize $\tfrac12(\beta_j - \hat\beta_j)^2 + \lambda|\beta_j|$, we can see that $\mathrm{sign}(\beta_j) = \mathrm{sign}(\hat\beta_j)$,

so $\beta_j = \hat\beta_j - \lambda\,\mathrm{sign}(\hat\beta_j) = \mathrm{sign}(\hat\beta_j)\bigl(|\hat\beta_j| - \lambda\bigr)$,

so $\hat\beta_j^{\text{lasso}} = \mathrm{sign}(\hat\beta_j)\bigl(|\hat\beta_j| - \lambda\bigr)_+$ (soft thresholding).
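The soft-thresholding rule in code, with ridge's proportional shrinkage shown for contrast (both specific to orthonormal X; a sketch of mine, not the slides' code):

```python
import numpy as np

def soft_threshold(beta_hat, lam):
    """Lasso with orthonormal X: sign(b) * (|b| - lam)_+ applied elementwise."""
    return np.sign(beta_hat) * np.maximum(np.abs(beta_hat) - lam, 0.0)

def ridge_orthonormal(beta_hat, lam):
    """Ridge with orthonormal X only rescales the least squares estimates."""
    return beta_hat / (1.0 + lam)
```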

Page 19:

Shrinkage Methods

3. Least Angle Regression

Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables. Least angle regression uses a similar strategy, but only enters "as much" of a predictor as it deserves.

1. Standardize the predictors to have mean zero and unit norm. Start with the residual $r = y - \bar y$ and $\hat\beta_1, \ldots, \hat\beta_p = 0$.
2. Find the predictor $x_j$ most correlated with r (cosine/correlation).
3. Move $\hat\beta_j$ from 0 towards its least-squares coefficient $\langle x_j, r\rangle$, until some other competitor $x_k$ has as much correlation with the current residual as does $x_j$.
4. Move $\hat\beta_j$ and $\hat\beta_k$ in the direction defined by their joint least squares coefficient of the current residual on $(x_j, x_k)$, until some other competitor $x_l$ has as much correlation with the current residual.
4a. If a non-zero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least squares direction (the procedure becomes the lasso if this step is added).
5. Continue in this way until all p predictors have been entered. After min(N − 1, p) steps, we arrive at the full least-squares solution.

Page 20:

Linear Methods for Classification

Suppose there are K classes and the fitted linear models are $\hat f_k(x) = \hat\beta_{k0} + \hat\beta_k^Tx$. The decision boundary between class k and l is the set of points for which $\hat f_k(x) = \hat f_l(x)$. More directly, we can model the posterior probabilities P(G = k | X = x). For two classes, a popular model is

$\log\dfrac{P(G=1 \mid X=x)}{P(G=2 \mid X=x)} = \beta_0 + \beta^Tx$

In both cases the decision boundaries are hyperplanes: we model the boundaries as linear.

Page 21:

Linear regression of an indicator matrix: if G has K classes, there will be K indicators $Y_k$, k = 1, . . . , K, with $Y_k = 1$ if G = k, else 0. These are collected together in a vector $Y = (Y_1, \ldots, Y_K)$.

Fit a linear model $\hat f_k(x)$ to each indicator and classify according to $\hat G(x) = \operatorname*{argmax}_{k \in \mathcal{G}} \hat f_k(x)$. Because of the rigid nature of the regression model, classes may be masked by others, even though they are perfectly separated.

A loose but general rule is that if K ≥ 3 classes are lined up, polynomial terms up to degree K − 1 might be needed to resolve them.

Linear Methods for Classification
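A minimal sketch of linear regression on an indicator matrix followed by the argmax rule (hypothetical names; classes coded 0, ..., K−1):

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Regress the N x K indicator matrix Y on X; X has a leading column of ones."""
    Y = np.eye(K)[g]                                  # indicator response matrix
    B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]      # (p+1) x K coefficients
    return B_hat

def classify(X, B_hat):
    """G_hat(x) = argmax_k f_hat_k(x)."""
    return np.argmax(X @ B_hat, axis=1)
```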

Page 22:

Linear Methods for Classification

Suppose $f_k(x)$ is the class-conditional density of X in class G = k and $\pi_k$ is the prior probability of class k.

$P(G=k \mid X=x) = \dfrac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$

Suppose each class density is multivariate Gaussian: $f_k(x) = \dfrac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\Bigl(-\tfrac12(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\Bigr)$

In linear discriminant analysis (LDA) we assume that $\Sigma_k = \Sigma$ for each k.

$\log\dfrac{P(G=k \mid X=x)}{P(G=l \mid X=x)} = \log\dfrac{f_k(x)}{f_l(x)} + \log\dfrac{\pi_k}{\pi_l} = \log\dfrac{\pi_k}{\pi_l} - \tfrac12(\mu_k + \mu_l)^T\Sigma^{-1}(\mu_k - \mu_l) + x^T\Sigma^{-1}(\mu_k - \mu_l)$

which is linear in x, so the decision boundary between any two classes is a hyperplane in p dimensions.

Page 23:

Linear Methods for Classification

The linear discriminant function:

$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac12\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$

Here we use the fact that $\Sigma^{-1}$ is symmetric and that $x^T\Sigma^{-1}\mu_k$ is a number, whose transpose $\mu_k^T\Sigma^{-1}x$ is the same number.

The term $x^T\Sigma^{-1}x$ is ignored, as it is constant across classes.

In practice, we estimate the parameters using training data:

$\hat\pi_k = N_k/N$, where $N_k$ is the number of class-k observations
$\hat\mu_k = \sum_{g_i=k} x_i / N_k$
$\hat\Sigma = \sum_{k=1}^{K}\sum_{g_i=k}(x_i - \hat\mu_k)(x_i - \hat\mu_k)^T / (N - K)$
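A sketch of these estimates and of the discriminant functions δ_k(x) (my own helper names; classes coded 0, ..., K−1):

```python
import numpy as np

def fit_lda(X, g, K):
    """Estimate (pi_k, mu_k, pooled Sigma) from the training data."""
    N, p = X.shape
    pi = np.array([np.mean(g == k) for k in range(K)])
    mu = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigma = sum((X[g == k] - mu[k]).T @ (X[g == k] - mu[k]) for k in range(K)) / (N - K)
    return pi, mu, Sigma

def lda_discriminants(x, pi, mu, Sigma):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    Sinv_mu = np.linalg.solve(Sigma, mu.T)                       # p x K
    return x @ Sinv_mu - 0.5 * np.sum(mu.T * Sinv_mu, axis=0) + np.log(pi)
```

Classification then picks the class with the largest discriminant, e.g. `np.argmax(lda_discriminants(x, pi, mu, Sigma))`.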

Page 24:

Linear Methods for Classification

In quadratic discriminant analysis (QDA) we do not assume that the $\Sigma_k$ are equal.

$\delta_k(x) = \log\pi_k - \tfrac12(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \tfrac12\log|\Sigma_k|$

LDA in the enlarged quadratic polynomial space gives results quite similar to QDA.

Regularized discriminant analysis (RDA): $\hat\Sigma_k(\alpha) = \alpha\,\hat\Sigma_k + (1-\alpha)\,\hat\Sigma$. In practice α can be chosen based on the performance of the model on validation data, or by cross-validation.

For the computations, write $\hat\Sigma_k = U_kD_kU_k^T$, where $U_k$ is a p×p orthonormal matrix and $D_k$ is a diagonal matrix of positive eigenvalues $d_{kl}$. Then

$(x - \hat\mu_k)^T\hat\Sigma_k^{-1}(x - \hat\mu_k) = \bigl[U_k^T(x - \hat\mu_k)\bigr]^T D_k^{-1}\bigl[U_k^T(x - \hat\mu_k)\bigr]$

$\log|\hat\Sigma_k| = \sum_l \log d_{kl}$
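A short sketch of the RDA covariance blend and of evaluating a QDA discriminant through the eigendecomposition (per-class estimates are assumed available; names are mine):

```python
import numpy as np

def rda_covariance(Sigma_k, Sigma_pooled, alpha):
    """Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * pooled Sigma."""
    return alpha * Sigma_k + (1.0 - alpha) * Sigma_pooled

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    """delta_k(x) evaluated through Sigma_k = U D U^T."""
    d, U = np.linalg.eigh(Sigma_k)          # positive eigenvalues d, orthonormal U
    z = U.T @ (x - mu_k)                    # rotate the centered point
    return np.log(pi_k) - 0.5 * np.sum(z**2 / d) - 0.5 * np.sum(np.log(d))
```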

Page 25:

Linear Methods for Classification

Gaussian classification with common covariances leads to linear decision boundaries. Classification can be achieved by sphering the data with respect to W, and classifying to the closest centroid (modulo log πk) in the sphered space.
Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space.
This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.

Page 26:

Linear Methods for Classification

Find the linear combination $Z = a^TX$ such that the between-class variance is maximized relative to the within-class variance:

$\max_a \dfrac{a^TBa}{a^TWa}$

B is the "between-classes scatter matrix" and W is the "within-classes scatter matrix".

Since W is a scatter matrix (positive definite), this is equivalent to maximizing $a^TBa$ subject to $a^TWa = 1$. Form the Lagrangian

$L = \tfrac12\,a^TBa - \tfrac12\,\lambda\,(a^TWa - 1)$

Setting its derivative to zero gives $Ba = \lambda Wa$, i.e. $W^{-1}Ba = \lambda a$: a is the leading eigenvector of $W^{-1}B$.
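A sketch of finding the discriminant directions from B and W by solving this eigenproblem numerically (constructing B and W is assumed done elsewhere; names are mine):

```python
import numpy as np

def fisher_directions(B, W, n_dirs=1):
    """Leading eigenvectors of W^{-1} B, i.e. solutions of B a = lambda W a."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(eigvals.real)[::-1]           # largest variance ratio first
    return eigvecs[:, order[:n_dirs]].real
```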

Page 27:

Logistic Regression

$P(G=k \mid X=x) = \dfrac{\exp(\beta_{k0} + \beta_k^Tx)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^Tx)}$, k = 1, ..., K−1

$P(G=K \mid X=x) = \dfrac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^Tx)}$

Let $P(G=k \mid X=x) = p_k(x;\theta)$.

The log-likelihood is $\ell(\theta) = \sum_{i=1}^{N}\log p_{g_i}(x_i;\theta)$.

In the two-class case, code the response via a 0/1 variable $y_i$, where $y_i = 1$ when $g_i = 1$, and $y_i = 0$ when $g_i = 2$.

$\ell(\beta) = \sum_{i=1}^{N}\bigl\{y_i\log p(x_i;\beta) + (1-y_i)\log(1 - p(x_i;\beta))\bigr\} = \sum_{i=1}^{N}\bigl\{y_i\,\beta^Tx_i - \log(1 + e^{\beta^Tx_i})\bigr\}$, where $\beta = \{\beta_{10}, \beta_1\}$ and $x_i$ includes the constant term 1.

$\dfrac{\partial\ell(\beta)}{\partial\beta} = \sum_{i=1}^{N} x_i\bigl(y_i - p(x_i;\beta)\bigr) = 0$

$\dfrac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T} = -\sum_{i=1}^{N} x_i x_i^T\, p(x_i;\beta)\bigl(1 - p(x_i;\beta)\bigr)$

Newton–Raphson algorithm: $\beta^{\text{new}} = \beta^{\text{old}} - \Bigl(\dfrac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T}\Bigr)^{-1}\dfrac{\partial\ell(\beta)}{\partial\beta}$

Page 28:

Logistic Regression

In matrix notation: let y denote the vector of $y_i$ values, X the N×(p+1) matrix of $x_i$ values, p the vector of fitted probabilities with ith element $p(x_i;\beta^{\text{old}})$, and W an N×N diagonal matrix of weights with ith diagonal element $p(x_i;\beta^{\text{old}})\bigl(1 - p(x_i;\beta^{\text{old}})\bigr)$.

$\dfrac{\partial\ell(\beta)}{\partial\beta} = X^T(y - p), \qquad \dfrac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T} = -X^TWX$

$\beta^{\text{new}} = \beta^{\text{old}} + (X^TWX)^{-1}X^T(y - p) = (X^TWX)^{-1}X^TWz$

where $z = X\beta^{\text{old}} + W^{-1}(y - p)$ is the adjusted response (iteratively reweighted least squares).

The exact Newton steps can be expensive for large problems, so much software uses a quadratic approximation to logistic regression, and L1-regularized logistic regression is also widely used.
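For completeness, a sketch of the two-class Newton–Raphson / IRLS iteration (no numerical safeguards; names are mine and X is assumed to carry a leading column of ones):

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Two-class logistic regression fitted by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # fitted probabilities
        W = p * (1.0 - p)                            # diagonal of the weight matrix
        z = X @ beta + (y - p) / W                   # adjusted response
        XtW = X.T * W                                # X^T W without forming the N x N matrix
        beta = np.linalg.solve(XtW @ X, XtW @ z)     # (X^T W X)^{-1} X^T W z
    return beta
```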

Page 29:

LDA: $\log\dfrac{P(G=k \mid X=x)}{P(G=K \mid X=x)} = \log\dfrac{\pi_k}{\pi_K} - \tfrac12(\mu_k + \mu_K)^T\Sigma^{-1}(\mu_k - \mu_K) + x^T\Sigma^{-1}(\mu_k - \mu_K) = \alpha_{k0} + \alpha_k^Tx$

Logistic regression: $\log\dfrac{P(G=k \mid X=x)}{P(G=K \mid X=x)} = \beta_{k0} + \beta_k^Tx$

Although they have exactly the same form, the difference lies in the way the linear coefficients are estimated. The logistic regression model is more general, in that it makes fewer assumptions.
Logistic regression fits the parameters by maximizing the conditional likelihood, the multinomial likelihood with probabilities P(G = k|X), in which P(X) is ignored.
LDA fits the parameters by maximizing the full log-likelihood, based on the joint density P(X, G=k) = φ(X; μk, Σ)πk, where P(X) does play a role since P(X) = Σk P(X, G=k). By assuming the f_k(x) are Gaussian, we gain efficiency.
It is generally felt that logistic regression is a safer, more robust bet than the LDA model, relying on fewer assumptions. It is our experience that the models give very similar results, even when LDA is used inappropriately.

LDA or Logistic Regression

Page 30:

Any Questions?