Page 1:

Linear Methods for Regression

The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman

Presented by Junyan Li

Page 2:

Input: $X^T = (X_1, \ldots, X_p)$

Linear regression model

$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$

The linear model can be a reasonable approximation even when the inputs are transformed versions of raw features:

• Basis expansions (e.g., squares and cubes of an input) => a polynomial representation

• Interactions between variables (e.g., the product of two inputs)

The intercept $\beta_0$ is added so that $f(x)$ does not have to pass through the origin.

Page 3:

Linear regression model

Given a set of training data $(x_i, y_i)$, $i = 1, \ldots, N$, the residual sum of squares is
$\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^2 = \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2$

Denote by X the N×(p+1) matrix with each row an input vector (with a 1 in the first position).

In matrix form, $\mathrm{RSS}(\beta) = (y - X\beta)^T(y - X\beta)$. Setting the derivative $-2X^T(y - X\beta)$ to zero gives the normal equations $X^T(y - X\beta) = 0$. If X is of full column rank (if not, remove the redundancies), $X^TX$ is invertible and

$\hat\beta = (X^TX)^{-1}X^Ty, \qquad \hat y = X\hat\beta = X(X^TX)^{-1}X^Ty = Hy$

where $H = X(X^TX)^{-1}X^T$ is called the "hat" matrix.
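As a concrete illustration (not from the slides), here is a minimal NumPy sketch of these formulas; the function name `fit_least_squares` and the assumption that X already carries a leading column of ones are mine:

```python
import numpy as np

def fit_least_squares(X, y):
    """Ordinary least squares via the normal equations.

    X : (N, p+1) design matrix whose first column is all ones.
    y : (N,) response vector.
    """
    XtX = X.T @ X                                # assumed to be of full rank
    beta_hat = np.linalg.solve(XtX, X.T @ y)     # beta = (X^T X)^{-1} X^T y
    H = X @ np.linalg.solve(XtX, X.T)            # "hat" matrix H = X (X^T X)^{-1} X^T
    y_hat = H @ y                                # fitted values y_hat = H y
    return beta_hat, y_hat, H
```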

Page 4:

Linear regression model

$\hat y$ is the orthogonal projection of y onto the subspace spanned by the column vectors $x_0, \ldots, x_p$ of X, with $x_0 \equiv 1$.

$X^T(y - \hat y) = 0 \;\Rightarrow\; y - \hat y$ is orthogonal to that subspace.

Page 5:

Linear regression model

Assume the observations $y_i$ are uncorrelated and have constant variance $\sigma^2$, and that the $x_i$ are fixed. Then $\mathrm{Var}(\hat\beta) = (X^TX)^{-1}\sigma^2$.

Assume further that the deviations of Y around its expectation are additive and Gaussian, with $e \sim N(0, \sigma^2)$. Then $\hat\beta \sim N\bigl(\beta, (X^TX)^{-1}\sigma^2\bigr)$.

Page 6:

Linear regression model

The residuals $y - \hat y$ are constrained by the $p+1$ equalities:

$X^T(y - \hat y) = 0$

So $\hat\sigma^2 = \frac{1}{N-p-1}\sum_{i=1}^{N}(y_i - \hat y_i)^2$ is an unbiased estimator of $\sigma^2$, and $(N-p-1)\,\hat\sigma^2 \sim \sigma^2\,\chi^2_{N-p-1}$.

To test the hypothesis that a particular coefficient $\beta_j = 0$, use $z_j = \hat\beta_j / (\sigma\sqrt{v_j}) \sim N(0,1)$ (if $\sigma$ is given) or $t_j = \hat\beta_j / (\hat\sigma\sqrt{v_j}) \sim t_{N-p-1}$ (if $\sigma$ is not given), where $v_j$ is the jth diagonal element of $(X^TX)^{-1}$. When $N-p-1$ is large enough, the difference between the tail quantiles of a t-distribution and a standard normal distribution is negligible, so we use the standard normal distribution regardless of whether $\sigma$ is given or not.
Simultaneously, we can use the F statistic to test whether a group of parameters can be removed.

$F = \dfrac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)}$

where $\mathrm{RSS}_1$ is the residual sum-of-squares for the least squares fit of the bigger model with $p_1+1$ parameters, and $\mathrm{RSS}_0$ the same for the nested smaller model with $p_0+1$ parameters.
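A hedged NumPy sketch of these tests follows; the helper names (`z_scores`, `f_statistic`) are hypothetical, and the z/t distinction is collapsed into the estimated-σ version, as the slide suggests for large N − p − 1:

```python
import numpy as np

def z_scores(X, y, beta_hat):
    """z_j = beta_j / (sigma_hat * sqrt(v_j)), with v_j the jth diagonal of (X^T X)^{-1}."""
    N, p1 = X.shape                                   # p1 = p + 1
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (N - p1)             # unbiased estimate of sigma^2
    v = np.diag(np.linalg.inv(X.T @ X))
    return beta_hat / np.sqrt(sigma2_hat * v)

def f_statistic(rss0, rss1, p0, p1, N):
    """F test for dropping the (p1 - p0) extra parameters of the bigger model."""
    return ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
```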

Page 7:

Gauss Markov Theorem

The least squares estimates of the parameters β have the smallest variance among all linear unbiased estimates.

Let $\tilde\theta = c^Ty$ be another linear unbiased estimator of $\theta = a^T\beta$, and write $c^T = a^T(X^TX)^{-1}X^T + D$. Since $E(e) = 0$, $E(c^Ty) = a^T\beta + DX\beta$, and unbiasedness for all β forces $DX = 0$. Then $\mathrm{Var}(c^Ty) = \sigma^2 c^Tc = \sigma^2\bigl(a^T(X^TX)^{-1}a + DD^T\bigr) \ge \mathrm{Var}(a^T\hat\beta)$.

However, there may exist a biased estimator with smaller mean squared error, and mean squared error is intimately related to prediction accuracy.

Page 8:

Multiple Regression from Simple Univariate Regression

If p = 1 and there is no intercept, $\hat\beta = \dfrac{\sum_{i=1}^{N} x_i y_i}{\sum_{i=1}^{N} x_i^2} = \dfrac{\langle x, y\rangle}{\langle x, x\rangle}$, with residuals $r_i = y_i - x_i\hat\beta$, where $\langle x, y\rangle = \sum_{i=1}^{N} x_i y_i$.

Let $x_1, \ldots, x_p$ be the columns of X. If $\langle x_i, x_j\rangle = 0$ (orthogonal) for each $i \ne j$, then

$X^TX = \begin{pmatrix} \langle x_1, x_1\rangle & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \langle x_p, x_p\rangle \end{pmatrix}$

so $\hat\beta = (X^TX)^{-1}X^Ty = \Bigl(\dfrac{\langle x_1, y\rangle}{\langle x_1, x_1\rangle}, \ldots, \dfrac{\langle x_p, y\rangle}{\langle x_p, x_p\rangle}\Bigr)^T$: each multiple regression coefficient equals the corresponding univariate estimate $\hat\beta_j = \langle x_j, y\rangle / \langle x_j, x_j\rangle$.

For the univariate fit, $\langle r, x\rangle = \Bigl\langle y - \dfrac{\langle x, y\rangle}{\langle x, x\rangle}\,x,\; x\Bigr\rangle = \langle y, x\rangle - \dfrac{\langle x, y\rangle}{\langle x, x\rangle}\langle x, x\rangle = 0$

=> the residual r and x are orthogonal.
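A small numerical check (mine, not the slides') that orthogonal columns reduce multiple regression to univariate regressions:

```python
import numpy as np

def univariate_coefs(X, y):
    """beta_j = <x_j, y> / <x_j, x_j> for each column x_j of X."""
    return (X.T @ y) / np.sum(X * X, axis=0)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(50, 3)))      # three mutually orthogonal columns
y = rng.normal(size=50)
multi = np.linalg.lstsq(Q, y, rcond=None)[0]       # multiple regression coefficients
assert np.allclose(multi, univariate_coefs(Q, y))  # they coincide
```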

Page 9:

Multiple Regression from Simple Univariate Regression

Orthogonal inputs occur most often with balanced, designed experiments (where orthogonality is enforced), but almost never with observational data.

Regression by Successive Orthogonalization:
1. Initialize $z_0 = x_0 = 1$.
2. For $j = 1, 2, \ldots, p$: regress $x_j$ on $z_0, z_1, \ldots, z_{j-1}$ to produce coefficients

$\hat\gamma_{lj} = \dfrac{\langle z_l, x_j\rangle}{\langle z_l, z_l\rangle}, \quad l = 0, \ldots, j-1$

and residual vector $z_j = x_j - \sum_{k=0}^{j-1}\hat\gamma_{kj} z_k$.
3. Regress y on the residual $z_p$ to give the estimate

$\hat\beta_p = \dfrac{\langle z_p, y\rangle}{\langle z_p, z_p\rangle}$

We can see that each of the $x_j$ is a linear combination of the $z_k$, $k \le j$, and the $z_k$ are orthogonal to each other.
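A minimal sketch of steps 1-3 in NumPy (the function name and the convention that X carries a leading column of ones are assumptions on my part):

```python
import numpy as np

def successive_orthogonalization(X, y):
    """Gram-Schmidt on the columns of X, then regress y on the last residual z_p."""
    N, p1 = X.shape                                  # p1 = p + 1, X[:, 0] = 1
    Z = np.zeros((N, p1))
    Z[:, 0] = X[:, 0]                                # z_0 = x_0 = 1
    for j in range(1, p1):
        zj = X[:, j].astype(float)
        for l in range(j):                           # regress x_j on z_0, ..., z_{j-1}
            gamma = Z[:, l] @ X[:, j] / (Z[:, l] @ Z[:, l])
            zj = zj - gamma * Z[:, l]
        Z[:, j] = zj                                 # residual vector z_j
    beta_p = Z[:, -1] @ y / (Z[:, -1] @ Z[:, -1])    # step 3: <z_p, y> / <z_p, z_p>
    return Z, beta_p
```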

Page 10:

In matrix form:

$X = Z\Gamma$, where Z has columns $z_j$ and $\Gamma$ is the upper triangular matrix with entries $\hat\gamma_{kj}$. Introducing the diagonal matrix D with jth diagonal entry $D_{jj} = \|z_j\|$, let $Q = ZD^{-1}$ and $R = D\Gamma$, so that

$X = Z\Gamma = ZD^{-1}D\Gamma = QR$

The QR decomposition represents a convenient orthogonal basis for the column space of X. Q is an N×(p+1) orthogonal matrix ($Q^TQ = I$), and R is a (p+1)×(p+1) upper triangular matrix. The least squares solution is then

$\hat\beta = R^{-1}Q^Ty, \qquad \hat y = QQ^Ty$

Multiple Regression from Simple Univariate Regression
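For reference, here is a sketch of the same least squares fit through NumPy's QR routine (names are mine):

```python
import numpy as np

def least_squares_qr(X, y):
    """Solve least squares via the decomposition X = QR."""
    Q, R = np.linalg.qr(X)                   # Q: N x (p+1), R upper triangular
    beta_hat = np.linalg.solve(R, Q.T @ y)   # beta = R^{-1} Q^T y
    y_hat = Q @ (Q.T @ y)                    # y_hat = Q Q^T y
    return beta_hat, y_hat
```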

Page 11:

Subset Selection

There are two reasons why we are often not satisfied with the least squares estimates:

• Prediction accuracy: the least squares estimates often have low bias but large variance.
• Interpretation: with a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects.

1. Best‐Subset Selection

Best subset regression finds, for each k ∈ {0, 1, 2, . . . , p}, the subset of size k that gives the smallest residual sum of squares. This is infeasible for p much larger than 40.
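A brute-force sketch of best-subset search (my own illustration, feasible only for small p; the intercept is handled here by centering):

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    """Return the size-k set of columns of X with the smallest residual sum of squares."""
    yc = y - y.mean()
    best_rss, best_cols = np.inf, None
    for cols in itertools.combinations(range(X.shape[1]), k):
        cols = list(cols)
        Xs = X[:, cols] - X[:, cols].mean(axis=0)
        beta, *_ = np.linalg.lstsq(Xs, yc, rcond=None)
        rss = np.sum((yc - Xs @ beta) ** 2)
        if rss < best_rss:
            best_rss, best_cols = rss, cols
    return best_cols, best_rss
```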

Page 12:

Subset Selection

2. Forward‐ and Backward‐Stepwise Selection

1) Forward-stepwise selection is a greedy algorithm: it starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit (a sketch follows after this list).

Computational: for large p we cannot compute the best subset sequence, but we can always compute the forward stepwise sequence (even when p >> N).
Statistical: a price is paid in variance for selecting the best subset of each size; forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias.

2) Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit (judged by its z-score; it requires N > p+1).

3) Hybrid
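The sketch below illustrates forward-stepwise selection under my own conventions (column 0 of X is the intercept; the fit criterion is the residual sum of squares):

```python
import numpy as np

def forward_stepwise(X, y, n_steps):
    """Greedily add, at each step, the predictor that most reduces the RSS."""
    N, p = X.shape
    active = [0]                                   # start with the intercept column
    for _ in range(n_steps):
        best_rss, best_j = np.inf, None
        for j in range(p):
            if j in active:
                continue
            cols = active + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        active.append(best_j)                      # add the best predictor found
    return active
```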

Page 13:

Subset Selection

3. Forward‐Stagewise Regression

Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression. It starts like forward-stepwise regression, with an intercept equal to ȳ, and centered predictors with coefficients initially all 0. At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and then adds it to the current coefficient for that variable. This is continued until none of the variables have correlation with the residuals.

In forward-stepwise selection, a variable's contribution is added all at once, but in forward-stagewise selection coefficients are built up in many small partial steps, which can work better in very high-dimensional problems.
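An illustrative sketch of forward-stagewise regression (the step-size argument `eps` and the stopping tolerance are my additions; with `eps=None` it takes the full simple regression coefficient as described above):

```python
import numpy as np

def forward_stagewise(X, y, n_iter=1000, eps=None):
    """Incremental forward-stagewise regression on standardized predictors."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    r = y - y.mean()                               # intercept = mean(y); work with residual
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        corr = X.T @ r
        j = np.argmax(np.abs(corr))                # variable most correlated with residual
        if np.abs(corr[j]) < 1e-10:                # stop: residual uncorrelated with all x_j
            break
        delta = corr[j] / (X[:, j] @ X[:, j])      # simple regression coefficient
        if eps is not None:
            delta = eps * np.sign(corr[j])         # or a small fixed step
        beta[j] += delta
        r = r - delta * X[:, j]
    return beta
```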

Page 14:

Shrinkage Methods

Because subset selection is a discrete process (variables are either retained or discarded), it often exhibits high variance, and so does not reduce the prediction error of the full model. Shrinkage methods are more continuous, and do not suffer as much from high variability.

1. Ridge Regression

$\hat\beta^{\text{ridge}} = \operatorname*{argmin}_{\beta}\Bigl\{\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\Bigr\}, \quad \lambda \ge 0$

or, equivalently,

$\hat\beta^{\text{ridge}} = \operatorname*{argmin}_{\beta}\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le t$

Ridge regression penalizes by the sum-of-squares of the coefficients. There is a one-to-one correspondence between λ in the penalized form and t in the constrained form.

Page 15:

Shrinkage Methods

$\mathrm{RSS}(\lambda) = (y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$

$\hat\beta^{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$, which is obtained by setting the first derivative to zero (the second derivative confirms a minimum).

Singular value decomposition (SVD): $X = UDV^T$, where U and V are N×p and p×p orthogonal matrices, with the columns of U spanning the column space of X and the columns of V spanning the row space. D is a p×p diagonal matrix, with diagonal entries $d_{11} \ge d_{22} \ge \cdots \ge d_{pp} \ge 0$.

$X\hat\beta^{\text{ridge}} = X(X^TX + \lambda I)^{-1}X^Ty = \sum_{j=1}^{p} u_j\,\dfrac{d_j^2}{d_j^2 + \lambda}\,u_j^Ty$

Like linear regression, ridge regression computes the coordinates of y with respect to the orthonormal basis U. It then shrinks these coordinates by the factors $\dfrac{d_j^2}{d_j^2 + \lambda}$.
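A sketch of this SVD view of ridge regression (assuming a centered X with no intercept column; names are mine):

```python
import numpy as np

def ridge_via_svd(X, y, lam):
    """Ridge coefficients and fitted values through X = U D V^T."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = d**2 / (d**2 + lam)                        # shrinkage factors d_j^2/(d_j^2+lam)
    y_hat = U @ (shrink * (U.T @ y))                    # shrink the coordinates u_j^T y
    beta_hat = Vt.T @ ((d / (d**2 + lam)) * (U.T @ y))  # equals (X^T X + lam I)^{-1} X^T y
    df = shrink.sum()                                   # effective degrees of freedom
    return beta_hat, y_hat, df
```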

Page 16:

Shrinkage Methods

Degrees of freedom: $\mathrm{df}(\lambda) = \mathrm{tr}\bigl[X(X^TX + \lambda I)^{-1}X^T\bigr] = \sum_{j=1}^{p}\dfrac{d_j^2}{d_j^2 + \lambda}$

Hence the small singular values $d_j$ correspond to directions in the column space of X having small variance, and ridge regression shrinks these directions the most.

The eigen decomposition of $X^TX$ is $X^TX = VD^2V^T$, and the eigenvectors $v_j$ (columns of V) are also called the principal component (or Karhunen–Loève) directions of X.

Page 17:

Shrinkage Methods

2. Lasso Regression

$\hat\beta^{\text{lasso}} = \operatorname*{argmin}_{\beta}\Bigl\{\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\Bigr\}, \quad \lambda \ge 0$, or equivalently minimize the sum of squares subject to $\sum_{j=1}^{p}|\beta_j| \le t$.

The L1 constraint makes the solutions nonlinear in the $y_i$; there is no closed-form expression as in ridge regression. Computing the lasso solution is a quadratic programming problem.

If the solution occurs at a corner of the constraint region, then it has one parameter $\beta_j$ equal to zero.

Page 18:

Shrinkage Methods

Assume that the columns of X are orthonormal, so that $X^TX = I$.

Minimizing the lasso criterion is then equivalent to minimizing, separately for each j, $\tfrac12(\beta_j - \hat\beta_j)^2 + \lambda|\beta_j|$, where $\hat\beta_j$ is the least squares estimate.

For $\beta_j \ne 0$, setting the derivative to zero gives $(\beta_j - \hat\beta_j) + \lambda\,\mathrm{sign}(\beta_j) = 0$, so $\beta_j = \hat\beta_j - \lambda\,\mathrm{sign}(\beta_j)$.

1. Unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large, to avoid unnecessary modeling bias.
2. Sparsity: the resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.
3. Continuity: the resulting estimator is continuous, to avoid instability in model prediction.

To minimize $\tfrac12(\beta_j - \hat\beta_j)^2 + \lambda|\beta_j|$, we can see that $\mathrm{sign}(\beta_j) = \mathrm{sign}(\hat\beta_j)$,

so $\beta_j = \hat\beta_j - \lambda\,\mathrm{sign}(\hat\beta_j) = \mathrm{sign}(\hat\beta_j)\bigl(|\hat\beta_j| - \lambda\bigr)$,

so $\hat\beta_j^{\text{lasso}} = \mathrm{sign}(\hat\beta_j)\bigl(|\hat\beta_j| - \lambda\bigr)_+$ (soft thresholding).
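The soft-thresholding rule in code, with ridge's proportional shrinkage shown for contrast (both specific to orthonormal X; a sketch of mine, not the slides' code):

```python
import numpy as np

def soft_threshold(beta_hat, lam):
    """Lasso with orthonormal X: sign(b) * (|b| - lam)_+ applied elementwise."""
    return np.sign(beta_hat) * np.maximum(np.abs(beta_hat) - lam, 0.0)

def ridge_orthonormal(beta_hat, lam):
    """Ridge with orthonormal X only rescales the least squares estimates."""
    return beta_hat / (1.0 + lam)
```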

Page 19:

Shrinkage Methods

3. Least Angle Regression

Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables. Least angle regression uses a similar strategy, but only enters "as much" of a predictor as it deserves.

1. Standardize the predictors to have mean zero and unit norm. Start with the residual $r = y - \bar y$ and $\hat\beta_1, \ldots, \hat\beta_p = 0$.
2. Find the predictor $x_j$ most correlated with r (cosine/correlation).
3. Move $\hat\beta_j$ from 0 towards its least-squares coefficient $\langle x_j, r\rangle$, until some other competitor $x_k$ has as much correlation with the current residual as does $x_j$.
4. Move $\hat\beta_j$ and $\hat\beta_k$ in the direction defined by their joint least squares coefficient of the current residual on $(x_j, x_k)$, until some other competitor $x_l$ has as much correlation with the current residual.
4a. If a non-zero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least squares direction (the procedure becomes the lasso if this step is added).
5. Continue in this way until all p predictors have been entered. After min(N − 1, p) steps, we arrive at the full least-squares solution.

Page 20:

Linear Methods for Classification

Suppose there are K classes and the fitted linear models are $\hat f_k(x) = \hat\beta_{k0} + \hat\beta_k^Tx$. The decision boundary between class k and l is the set of points for which $\hat f_k(x) = \hat f_l(x)$. More directly, we can model the posterior probabilities P(G = k | X = x). For two classes, a popular model is

$\log\dfrac{P(G=1 \mid X=x)}{P(G=2 \mid X=x)} = \beta_0 + \beta^Tx$

In both cases the decision boundaries are hyperplanes: we model the boundaries as linear.

Page 21:

Linear regression of an indicator matrix: if G has K classes, there will be K indicators $Y_k$, k = 1, . . . , K, with $Y_k = 1$ if G = k, else 0. These are collected together in a vector $Y = (Y_1, \ldots, Y_K)$.

Fit a linear model $\hat f_k(x)$ to each indicator and classify according to $\hat G(x) = \operatorname*{argmax}_{k \in \mathcal{G}} \hat f_k(x)$. Because of the rigid nature of the regression model, classes may be masked by others, even though they are perfectly separated.

A loose but general rule is that if K ≥ 3 classes are lined up, polynomial terms up to degree K − 1 might be needed to resolve them.

Linear Methods for Classification
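A minimal sketch of linear regression on an indicator matrix followed by the argmax rule (hypothetical names; classes coded 0, ..., K−1):

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Regress the N x K indicator matrix Y on X; X has a leading column of ones."""
    Y = np.eye(K)[g]                                  # indicator response matrix
    B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]      # (p+1) x K coefficients
    return B_hat

def classify(X, B_hat):
    """G_hat(x) = argmax_k f_hat_k(x)."""
    return np.argmax(X @ B_hat, axis=1)
```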

Page 22:

Linear Methods for Classification

Suppose $f_k(x)$ is the class-conditional density of X in class G = k and $\pi_k$ is the prior probability of class k.

$P(G=k \mid X=x) = \dfrac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$

Suppose each class density is multivariate Gaussian: $f_k(x) = \dfrac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\Bigl(-\tfrac12(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\Bigr)$

In linear discriminant analysis (LDA) we assume that $\Sigma_k = \Sigma$ for each k.

$\log\dfrac{P(G=k \mid X=x)}{P(G=l \mid X=x)} = \log\dfrac{f_k(x)}{f_l(x)} + \log\dfrac{\pi_k}{\pi_l} = \log\dfrac{\pi_k}{\pi_l} - \tfrac12(\mu_k + \mu_l)^T\Sigma^{-1}(\mu_k - \mu_l) + x^T\Sigma^{-1}(\mu_k - \mu_l)$

which is linear in x, so the decision boundary between any two classes is a hyperplane in p dimensions.

Page 23:

Linear Methods for Classification

The linear discriminant function:

$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac12\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$

Here we use the fact that $\Sigma^{-1}$ is symmetric and that $x^T\Sigma^{-1}\mu_k$ is a number, whose transpose $\mu_k^T\Sigma^{-1}x$ is the same number.

The term $x^T\Sigma^{-1}x$ is ignored, as it is constant across classes.

In practice, we estimate the parameters using training data:

$\hat\pi_k = N_k/N$, where $N_k$ is the number of class-k observations
$\hat\mu_k = \sum_{g_i=k} x_i / N_k$
$\hat\Sigma = \sum_{k=1}^{K}\sum_{g_i=k}(x_i - \hat\mu_k)(x_i - \hat\mu_k)^T / (N - K)$
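A sketch of these estimates and of the discriminant functions δ_k(x) (my own helper names; classes coded 0, ..., K−1):

```python
import numpy as np

def fit_lda(X, g, K):
    """Estimate (pi_k, mu_k, pooled Sigma) from the training data."""
    N, p = X.shape
    pi = np.array([np.mean(g == k) for k in range(K)])
    mu = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigma = sum((X[g == k] - mu[k]).T @ (X[g == k] - mu[k]) for k in range(K)) / (N - K)
    return pi, mu, Sigma

def lda_discriminants(x, pi, mu, Sigma):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    Sinv_mu = np.linalg.solve(Sigma, mu.T)                       # p x K
    return x @ Sinv_mu - 0.5 * np.sum(mu.T * Sinv_mu, axis=0) + np.log(pi)
```

Classification then picks the class with the largest discriminant, e.g. `np.argmax(lda_discriminants(x, pi, mu, Sigma))`.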

Page 24:

Linear Methods for Classification

In quadratic discriminant analysis (QDA) we do not assume that the $\Sigma_k$ are equal.

$\delta_k(x) = \log\pi_k - \tfrac12(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \tfrac12\log|\Sigma_k|$

LDA in the enlarged quadratic polynomial space gives results quite similar to QDA.

Regularized discriminant analysis (RDA): $\hat\Sigma_k(\alpha) = \alpha\,\hat\Sigma_k + (1-\alpha)\,\hat\Sigma$. In practice α can be chosen based on the performance of the model on validation data, or by cross-validation.

For the computations, write $\hat\Sigma_k = U_kD_kU_k^T$, where $U_k$ is a p×p orthonormal matrix and $D_k$ is a diagonal matrix of positive eigenvalues $d_{kl}$. Then

$(x - \hat\mu_k)^T\hat\Sigma_k^{-1}(x - \hat\mu_k) = \bigl[U_k^T(x - \hat\mu_k)\bigr]^T D_k^{-1}\bigl[U_k^T(x - \hat\mu_k)\bigr]$

$\log|\hat\Sigma_k| = \sum_l \log d_{kl}$
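A short sketch of the RDA covariance blend and of evaluating a QDA discriminant through the eigendecomposition (per-class estimates are assumed available; names are mine):

```python
import numpy as np

def rda_covariance(Sigma_k, Sigma_pooled, alpha):
    """Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * pooled Sigma."""
    return alpha * Sigma_k + (1.0 - alpha) * Sigma_pooled

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    """delta_k(x) evaluated through Sigma_k = U D U^T."""
    d, U = np.linalg.eigh(Sigma_k)          # positive eigenvalues d, orthonormal U
    z = U.T @ (x - mu_k)                    # rotate the centered point
    return np.log(pi_k) - 0.5 * np.sum(z**2 / d) - 0.5 * np.sum(np.log(d))
```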

Page 25:

Linear Methods for Classification

Gaussian classification with common covariances leads to linear decision boundaries. Classification can be achieved by sphering the data with respect to W, and classifying to the closest centroid (modulo log πk) in the sphered space.
Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space.
This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.

Page 26:

Linear Methods for Classification

Find the linear combination $Z = a^TX$ such that the between-class variance is maximized relative to the within-class variance:

$\max_a \dfrac{a^TBa}{a^TWa}$

B is the "between-classes scatter matrix" and W is the "within-classes scatter matrix".

Since W is a scatter matrix (positive definite), this is equivalent to maximizing $a^TBa$ subject to $a^TWa = 1$. Form the Lagrangian

$L = \tfrac12\,a^TBa - \tfrac12\,\lambda\,(a^TWa - 1)$

Setting its derivative to zero gives $Ba = \lambda Wa$, i.e. $W^{-1}Ba = \lambda a$: a is the leading eigenvector of $W^{-1}B$.
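A sketch of finding the discriminant directions from B and W by solving this eigenproblem numerically (constructing B and W is assumed done elsewhere; names are mine):

```python
import numpy as np

def fisher_directions(B, W, n_dirs=1):
    """Leading eigenvectors of W^{-1} B, i.e. solutions of B a = lambda W a."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(eigvals.real)[::-1]           # largest variance ratio first
    return eigvecs[:, order[:n_dirs]].real
```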

Page 27:

Logistic Regression

$P(G=k \mid X=x) = \dfrac{\exp(\beta_{k0} + \beta_k^Tx)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^Tx)}$, k = 1, ..., K−1

$P(G=K \mid X=x) = \dfrac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^Tx)}$

Let $P(G=k \mid X=x) = p_k(x;\theta)$.

The log-likelihood is $\ell(\theta) = \sum_{i=1}^{N}\log p_{g_i}(x_i;\theta)$.

In the two-class case, code the response via a 0/1 variable $y_i$, where $y_i = 1$ when $g_i = 1$, and $y_i = 0$ when $g_i = 2$.

$\ell(\beta) = \sum_{i=1}^{N}\bigl\{y_i\log p(x_i;\beta) + (1-y_i)\log(1 - p(x_i;\beta))\bigr\} = \sum_{i=1}^{N}\bigl\{y_i\,\beta^Tx_i - \log(1 + e^{\beta^Tx_i})\bigr\}$, where $\beta = \{\beta_{10}, \beta_1\}$ and $x_i$ includes the constant term 1.

$\dfrac{\partial\ell(\beta)}{\partial\beta} = \sum_{i=1}^{N} x_i\bigl(y_i - p(x_i;\beta)\bigr) = 0$

$\dfrac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T} = -\sum_{i=1}^{N} x_i x_i^T\, p(x_i;\beta)\bigl(1 - p(x_i;\beta)\bigr)$

Newton–Raphson algorithm: $\beta^{\text{new}} = \beta^{\text{old}} - \Bigl(\dfrac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T}\Bigr)^{-1}\dfrac{\partial\ell(\beta)}{\partial\beta}$

Page 28:

Logistic Regression

In matrix notation: let y denote the vector of $y_i$ values, X the N×(p+1) matrix of $x_i$ values, p the vector of fitted probabilities with ith element $p(x_i;\beta^{\text{old}})$, and W an N×N diagonal matrix of weights with ith diagonal element $p(x_i;\beta^{\text{old}})\bigl(1 - p(x_i;\beta^{\text{old}})\bigr)$.

$\dfrac{\partial\ell(\beta)}{\partial\beta} = X^T(y - p), \qquad \dfrac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T} = -X^TWX$

$\beta^{\text{new}} = \beta^{\text{old}} + (X^TWX)^{-1}X^T(y - p) = (X^TWX)^{-1}X^TWz$

where $z = X\beta^{\text{old}} + W^{-1}(y - p)$ is the adjusted response (iteratively reweighted least squares).

The exact Newton steps can be expensive for large problems, so much software uses a quadratic approximation to logistic regression, and L1-regularized logistic regression is also widely used.
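For completeness, a sketch of the two-class Newton–Raphson / IRLS iteration (no numerical safeguards; names are mine and X is assumed to carry a leading column of ones):

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Two-class logistic regression fitted by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # fitted probabilities
        W = p * (1.0 - p)                            # diagonal of the weight matrix
        z = X @ beta + (y - p) / W                   # adjusted response
        XtW = X.T * W                                # X^T W without forming the N x N matrix
        beta = np.linalg.solve(XtW @ X, XtW @ z)     # (X^T W X)^{-1} X^T W z
    return beta
```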

Page 29:

LDA: $\log\dfrac{P(G=k \mid X=x)}{P(G=K \mid X=x)} = \log\dfrac{\pi_k}{\pi_K} - \tfrac12(\mu_k + \mu_K)^T\Sigma^{-1}(\mu_k - \mu_K) + x^T\Sigma^{-1}(\mu_k - \mu_K) = \alpha_{k0} + \alpha_k^Tx$

Logistic regression: $\log\dfrac{P(G=k \mid X=x)}{P(G=K \mid X=x)} = \beta_{k0} + \beta_k^Tx$

Although they have exactly the same form, the difference lies in the way the linear coefficients are estimated. The logistic regression model is more general, in that it makes fewer assumptions.
Logistic regression fits the parameters by maximizing the conditional likelihood, the multinomial likelihood with probabilities P(G = k|X), in which P(X) is ignored.
LDA fits the parameters by maximizing the full log-likelihood, based on the joint density P(X, G=k) = φ(X; μk, Σ)πk, where P(X) does play a role since P(X) = Σk P(X, G=k). By assuming the f_k(x) are Gaussian, we gain efficiency.
It is generally felt that logistic regression is a safer, more robust bet than the LDA model, relying on fewer assumptions. It is our experience that the models give very similar results, even when LDA is used inappropriately.

LDA or Logistic Regression

Page 30:

Any Questions?