Linear Regression and least square error solution.


Linear Regression and least square error solution

What is “Linear”?

• Remember this: Y = mX + B? (m is the slope; B is the intercept)

Simple linear regression

The linear regression model:

Love of Math = 5 + .01*math SAT score

Here 5 is the intercept and .01 is the slope (P = .22; not significant).

Prediction

If you know something about X, this knowledge helps you predict something about Y. (Sound familiar?…sound like conditional probabilities?)

EXAMPLE

• The distribution of baby weights at Stanford ~ N(3400, 360000)

Your “Best guess” at a random baby’s weight, given no information about the baby, is what?

3400 grams

But, what if you have relevant information? Can you make a better guess?

Predictor variable

• X=gestation time

• Assume that babies that gestate for longer are born heavier, all other things being equal.

• Pretend (at least for the purposes of this example) that this relationship is linear.

• Example: suppose a one-week increase in gestation, on average, leads to a 100-gram increase in birth-weight

Y depends on X

[Scatterplot: Y = birth weight (g) versus X = gestation time (weeks)]

Best fit line is chosen such that the sum of the squared (why squared?) distances of the points (Yi’s) from the line is minimized:

Or mathematically (remember maxima and minima from calculus):

d/dm [ Σi=1..n (Yi − (mXi + b))² ] = 0
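As a quick illustration of this minimization (not from the slides; the data below are made up), here is a sketch that finds m and b numerically by minimizing the sum of squared errors:

```python
# Minimal sketch: fit Y = mX + b by numerically minimizing the sum of
# squared errors. Hypothetical data, for illustration only.
import numpy as np
from scipy.optimize import minimize

x = np.array([20.0, 25.0, 30.0, 35.0, 40.0])             # gestation time (weeks)
y = np.array([2050.0, 2480.0, 3020.0, 3460.0, 3980.0])   # birth weight (g)

def sum_squared_error(params):
    m, b = params                        # slope and intercept
    return np.sum((y - (m * x + b)) ** 2)

result = minimize(sum_squared_error, x0=[0.0, 0.0])   # derivative = 0 at the optimum
m_hat, b_hat = result.x
print(f"slope ~ {m_hat:.1f} g/week, intercept ~ {b_hat:.1f} g")
```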

But note that not every Y-value (Yi) sits on the line. There's variability.

[Scatterplot: Y = baby weights (g) versus X = gestation times (weeks), with the conditional distributions of Y drawn at X = 20, 30, and 40 weeks]

Y/X=40 weeks ~ N(4000, σ²)
Y/X=30 weeks ~ N(3000, σ²)
Y/X=20 weeks ~ N(2000, σ²)

Mean values fall on the line

• E(Y/X=40 weeks) = 4000
• E(Y/X=30 weeks) = 3000
• E(Y/X=20 weeks) = 2000

E(Y/X) = μ(Y/X) = 100 grams/week × X weeks
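For example, plugging X = 33.5 weeks into this model gives E(Y/X=33.5) = 100 grams/week × 33.5 weeks = 3350 grams, which matches the predicted value used in the residual example later on.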

Linear Regression Model

Y’s are modeled…

Yi = 100*Xi + random errori

The 100*Xi term is fixed (it sits exactly on the line); the random errori term follows a normal distribution.
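To make the model concrete, here is a small simulation sketch (illustrative only; the error standard deviation of 200 g is an arbitrary choice, not taken from the slides):

```python
# Simulate babies from the model Yi = 100*Xi + random errori,
# with normally distributed errors (sigma = 200 g chosen arbitrarily).
import numpy as np

rng = np.random.default_rng(0)
weeks = rng.uniform(20, 40, size=200)        # gestation times Xi
errors = rng.normal(0.0, 200.0, size=200)    # random errori ~ N(0, 200^2)
weights = 100.0 * weeks + errors             # Yi

# Mean weight of the simulated ~40-week babies should sit near 4000 g
print(weights[weeks > 39].mean())
```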

Assumptions (or the fine print)

• Linear regression assumes that:
  1. The relationship between X and Y is linear
  2. Y is distributed normally at each value of X
  3. The variance of Y at every value of X is the same (homogeneity of variances)

• Why? The math requires it. The process is called "least squares" because it fits the regression line by minimizing the squared errors from the line (mathematically easy, but not general; it relies on the assumptions above).

Non-homogeneous variance

[Scatterplot showing non-homogeneous variance: Y = birth weight (100g) versus X = gestation time (weeks)]

Least squares estimation

A little calculus… What are we trying to estimate? β, the slope, from the model y = βx + α (plus error).

What's the constraint? We are trying to minimize the squared distance (hence the "least squares") between the observations yi and the predicted values ŷi; the differences yi − ŷi are also called the "residuals", or left-over unexplained variability.

Differencei = yi − (βxi + α)
Differencei² = (yi − (βxi + α))²

Find the β that gives the minimum sum of the squared differences. How do you minimize a function? Take the derivative, set it equal to zero, and solve. A typical max/min problem from calculus…

From here it takes a little math trickery to solve for β…

d/dβ [ Σi=1..n (yi − (βxi + α))² ] = Σi=1..n 2(yi − (βxi + α))(−xi) = 0
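That trickery leads to the familiar closed form β̂ = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and α̂ = ȳ − β̂x̄. A minimal sketch (made-up data) that computes these and cross-checks them against numpy's least-squares fit:

```python
# Closed-form least-squares estimates of the slope (beta) and intercept (alpha),
# cross-checked against numpy's polynomial least-squares fit. Hypothetical data.
import numpy as np

x = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
y = np.array([2050.0, 2480.0, 3020.0, 3460.0, 3980.0])

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

m_check, b_check = np.polyfit(x, y, deg=1)   # returns [slope, intercept]
print(beta_hat, alpha_hat, m_check, b_check)
```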

Residual

Residual = observed value – predicted value

At 33.5 weeks gestation, the predicted baby weight is 3350 grams. This baby was actually 3380 grams, so his residual is +30 grams.

[Scatterplot: Y = baby weights (g) versus X = gestation times (weeks), with the observation at X = 33.5 weeks marked against the predicted value of 3350 grams]

The standard error of Y given X, written Sy/x, is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.

Residual Analysis: check assumptions

• The residual for observation i, ei, is the difference between its observed and predicted value

• Check the assumptions of regression by examining the residuals:
  – Examine for the linearity assumption
  – Examine for constant variance at all levels of X (homoscedasticity)
  – Evaluate the normal distribution assumption
  – Evaluate the independence assumption

• Graphical analysis of residuals:
  – Can plot residuals vs. X (a sketch follows below)

ei = Yi − Ŷi
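A minimal sketch of the residuals-vs-X plot described above (hypothetical data; matplotlib used for the plot):

```python
# Fit a line, compute residuals ei = Yi - Yhat_i, and plot them against X.
# Hypothetical data, for illustration only.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
y = np.array([2050.0, 2480.0, 3020.0, 3460.0, 3980.0])

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept
residuals = y - y_hat                  # ei = Yi - Yhat_i

plt.scatter(x, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("X = gestation time (weeks)")
plt.ylabel("residuals (g)")
plt.show()
```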

Residual Analysis for Linearity

[Figure: plots of Y vs. x and residuals vs. x for a non-linear relationship (left) and a linear relationship (right)]

Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Residual Analysis for Homoscedasticity

[Figure: plots of Y vs. x and residuals vs. x showing non-constant variance (left) and constant variance (right)]

Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Residual Analysis for Independence

[Figure: plots of residuals vs. X showing a pattern when residuals are not independent (left) and no pattern when they are independent (right)]

Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Other types of multivariate regression

Multiple linear regression is for normally distributed outcomes

Logistic regression is for binary outcomes

Cox proportional hazards regression is used when time-to-event is the outcome

Principal Component Analysis (PCA)

• Given: n d-dimensional points x1, . . . , xn
• Goal: find the "right" features from the data

Zero-D Representation

• Task: find x0 to "represent" x1, . . . , xn
• Criterion: find x0 such that the sum of the squared distances between x0 and the various xk is as small as possible
• The "best" zero-dimensional representation of the data set is the sample mean (checked below)
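A quick numerical check of that claim (illustrative only, with randomly generated points):

```python
# Check that the sample mean minimizes J(x0) = sum_k ||x_k - x0||^2.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))               # n = 100 points in d = 3 dimensions

def J(x0):
    return np.sum(np.linalg.norm(X - x0, axis=1) ** 2)

mean = X.mean(axis=0)
print(J(mean), J(mean + 0.1), J(np.zeros(3)))   # J(mean) is the smallest
```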

One-D Representation

• Consider: represent the set of points by a line through m
• x = m + a·e, where e is a unit vector along the line

[Slides "Cont'd", "Finding eigenvector problem", and "Geometrical Interpretation": equations and figures, not transcribed]

Finding the least-squares error solution

• Find the direction such that the least-squares error is minimized
• Solution: the eigenvector with the smallest eigenvalue (see the sketch below)
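A sketch of this statement, assuming the standard homogeneous least-squares setup (minimize ||Ax||² subject to ||x|| = 1, a formulation not spelled out on the slides): the minimizing direction is the eigenvector of AᵀA with the smallest eigenvalue.

```python
# Homogeneous least squares (assumed setup): minimize ||A x||^2 subject to
# ||x|| = 1. The minimizer is the eigenvector of A^T A with smallest eigenvalue.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 4))

eigenvalues, eigenvectors = np.linalg.eigh(A.T @ A)   # eigh sorts eigenvalues ascending
x_min = eigenvectors[:, 0]                            # smallest-eigenvalue eigenvector

print(np.linalg.norm(A @ x_min) ** 2)                 # smallest achievable ||Ax||^2
print(np.linalg.norm(A @ eigenvectors[:, -1]) ** 2)   # largest, for comparison
```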

Solving big matrix systems

• Ax = b
• You can use Matlab's \
  – But it is not very scalable
• There are also sparse matrix libraries in C/C++, e.g. TAUCS, that provide routines for solving this sparse linear system (a SciPy sketch follows below)
• Good news! You can use an existing library to avoid the "trouble" of implementing a linear equation solver yourself
• But you still need to understand what is happening inside the linear solver
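As an illustrative alternative to the libraries mentioned above (not one of them), the same kind of sparse direct solve in Python with SciPy might look like this:

```python
# Solve a sparse, symmetric positive-definite system Ax = b with SciPy's
# sparse direct solver (toy tridiagonal matrix, for illustration only).
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

n = 1000
main = 3.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csc")
b = np.ones(n)

x = spsolve(A, b)                 # direct sparse solve
print(np.linalg.norm(A @ x - b))  # residual should be ~0
```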


Conjugate gradient

• “The Conjugate Gradient Method is the most prominent iterative method for solving sparse systems of linear equations. Unfortunately, many textbook treatments of the topic are written with neither illustrations nor intuition, and their victims can be found to this day babbling senselessly in the corners of dusty libraries. For this reason, a deep, geometric understanding of the method has been reserved for the elite brilliant few who have painstakingly decoded the mumblings of their forebears. Nevertheless, the Conjugate Gradient Method is a composite of simple, elegant ideas that almost anyone can understand. Of course, a reader as intelligent as yourself will learn them almost effortlessly.”


Ax=b

• A is square, symmetric, and positive-definite
• When A is dense, you're stuck: use back-substitution
• When A is sparse, iterative techniques (such as Conjugate Gradient) are faster and more memory-efficient
• Simple example shown on the slide (yeah yeah, it's not sparse)


Turn Ax=b into a minimization problem

• Minimization is more logical for analyzing iteration (gradient ascent/descent)
• Quadratic form: f(x) = ½ x^T A x − b^T x + c (checked numerically below)
  – c can be ignored because we want to minimize
• Intuition:
  – the solution of a linear system is always the intersection of n hyperplanes
  – take the squared distance to them
  – A needs to be positive-definite so that we have a nice parabola
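A tiny numerical check of this correspondence (the 2×2 matrix below is just an arbitrary symmetric positive-definite example, not taken from the slides):

```python
# The quadratic form f(x) = 1/2 x^T A x - b^T x + c is minimized at the
# solution of Ax = b when A is symmetric positive-definite.
import numpy as np

A = np.array([[3.0, 2.0],
              [2.0, 6.0]])        # symmetric positive-definite
b = np.array([2.0, -8.0])

def f(x, c=0.0):
    return 0.5 * x @ A @ x - b @ x + c

x_star = np.linalg.solve(A, b)    # solution of Ax = b
print(f(x_star), f(x_star + [0.1, 0.0]), f(x_star + [0.0, 0.1]))  # smallest at x_star
```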


Gradient of the quadratic form

Since f(x) = ½ x^T A x − b^T x + c,

∇f(x) = ½ A^T x + ½ A x − b

and since A is symmetric,

∇f(x) = Ax − b

Not surprising: we turned Ax=b into the quadratic minimization. (If A is not symmetric, conjugate gradient finds the solution of ½ (A^T + A) x = b.)

– Not our image gradient!
– Multidimensional gradient (as many dimensions as there are rows in the matrix)

Steepest descent/ascent

• Pick the gradient direction
• Find the optimum in this direction

[Figure: contours of the quadratic form with the gradient direction marked, and the 1-D energy along the gradient]

Residual

• At iteration i, we are at a point x(i)
• Residual: r(i) = b − Ax(i)
• Cool property of the quadratic form: residual = −gradient (used in the sketch below)
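Putting the last two slides together, here is a sketch of steepest descent on the quadratic form, using the residual as the search direction and the exact line-search step α = (rᵀr)/(rᵀAr) (standard for this setting, though the formula is not shown above):

```python
# Steepest descent on f(x) = 1/2 x^T A x - b^T x, stepping along the residual
# r = b - Ax (the negative gradient) with an exact line search.
import numpy as np

A = np.array([[3.0, 2.0],
              [2.0, 6.0]])           # toy symmetric positive-definite system
b = np.array([2.0, -8.0])

x = np.zeros(2)
for i in range(50):
    r = b - A @ x                    # residual = -gradient
    if np.linalg.norm(r) < 1e-10:
        break
    alpha = (r @ r) / (r @ (A @ r))  # optimum step along the gradient direction
    x = x + alpha * r
print(x, np.linalg.solve(A, b))      # should agree
```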


Behavior of gradient descent

• Zigzags or goes straight, depending on whether we're lucky
  – Ends up taking multiple steps in the same direction

Conjugate gradient

• Smarter choice of direction
  – Ideally, step directions should be orthogonal to one another (no redundancy)
  – But that is tough to achieve
  – Next best thing: make them A-orthogonal (conjugate), that is, orthogonal when transformed by A: p(i)^T A p(j) = 0 for i ≠ j

Conjugate gradient

• For each step (sketched below):
  – Take the residual (gradient)
  – Make it A-orthogonal to the previous directions
  – Find the minimum along this direction
• Plus, life is good:
  – In practice, you only need the previous one
  – You can show that the new residual r(i+1) is already A-orthogonal to all previous directions p except p(i)
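A sketch of the standard conjugate gradient iteration these bullets describe (textbook CG for symmetric positive-definite A; the toy system is arbitrary):

```python
# Conjugate gradient: at each step, take the residual, make it A-orthogonal to
# the previous direction, and minimize exactly along the new direction.
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    x = np.zeros_like(b)
    r = b - A @ x                    # residual (= negative gradient)
    p = r.copy()                     # first search direction: the residual itself
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)    # exact minimization along direction p
        x = x + alpha * p
        r = r - alpha * Ap           # updated residual
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # A-orthogonalize against previous direction
        rs_old = rs_new
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(conjugate_gradient(A, b), np.linalg.solve(A, b))
```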
