Linear Regression with One Regressor
Introduction
Linear Regression Model
Measures of Fit
Least Squares Assumptions
Sampling Distribution of OLS Estimator
Introduction
Empirical problem: Class size and educational output.
Policy question: What is the effect of reducing class size by one student per class? By 8 students per class?
What is the right output (performance) measure? Parent satisfaction; student personal development; future adult welfare; future adult earnings; performance on standardized tests.
What do data say about class sizes and test scores?
The California Test Score Data Set: all K-6 and K-8 California school districts (n = 420).
Variables: 5th grade test scores (Stanford-9 achievement test, combined math and reading), district average; student-teacher ratio (STR) = number of students in the district divided by the number of full-time equivalent teachers.
An initial look at the California test score data:
Question:
Do districts with smaller classes (lower STR) have higher test scores? And by how much?
The class size/test score policy question: What is the effect of reducing STR by one student per teacher on test scores? Object of policy interest: $\beta_1 = \frac{\Delta \text{Test Score}}{\Delta STR}$. This is the slope of the line relating test score and STR.
This suggests that we want to draw a line through the Test Score vs. STR scatterplot, but how?
Linear Regression: Some Notation and Terminology
The population regression line is
$$\text{Test Score} = \beta_0 + \beta_1 \times STR.$$
β0 and β1 are “population” parameters.
We would like to know the population value of β1
We don’t know β1, so we must estimate it using data.
The Population Linear Regression Model — general notation
$$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \ldots, n$$
X is the independent variable or regressor. Y is the dependent variable. β0 = intercept. β1 = slope.
ui = the regression error. The regression error consists of omitted factors, or possibly measurement error in Y. In general, these omitted factors are factors other than the variable X that influence Y.
Example: the population regression line and the error term
Data and sampling. The population objects (“parameters”) β0 and β1 are unknown; to draw inferences about these unknown parameters we must collect relevant data.
Simple random sampling: choose n entities at random from the population of interest, and observe (record) X and Y for each entity.
Simple random sampling implies that {(Xi, Yi)}, i = 1,…, n, are independently and identically distributed (i.i.d.). (Note: (Xi, Yi) are distributed independently of (Xj, Yj) for different observations i and j.)
The Ordinary Least Squares Estimator
How can we estimate β0 and β1 from data?
We will focus on the least squares (“ordinary least squares” or “OLS”) estimator of the unknown parameters β0 and β1, which solves
$$\min_{b_0, b_1} \sum_{i=1}^n [Y_i - (b_0 + b_1 X_i)]^2.$$
The OLS estimator minimizes the sum of squared differences between the actual values of Yi and the predictions (predicted values) based on the estimated line.
This minimization problem can be solved. The result is the OLS estimators of β0 and β1.
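The resulting estimator formulas can be sketched in a few lines of Python; the data below are made-up numbers for illustration, not the California data.

```python
# OLS by hand on a small made-up data set (illustrative values only).
x = [10.0, 12.0, 15.0, 18.0, 20.0]       # hypothetical class-size values
y = [680.0, 675.0, 662.0, 655.0, 650.0]  # hypothetical test scores

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# beta1_hat = sum (Xi - Xbar)(Yi - Ybar) / sum (Xi - Xbar)^2
beta1_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
            sum((xi - xbar) ** 2 for xi in x)
beta0_hat = ybar - beta1_hat * xbar
```

Note that, by construction, the fitted line passes through the point of means $(\bar X, \bar Y)$.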
Why use OLS, rather than some other estimator?
The OLS estimator has some desirable properties: under certain assumptions, it is unbiased (that is, $E(\hat\beta_1) = \beta_1$), and it has a tighter sampling distribution than some other candidate estimators of β1.
This is what everyone uses— the common “language” of linear regression.
Why use OLS, rather than some other estimator?
OLS is a generalization of the sample average: if the “line” is just an intercept (no X), then the OLS estimator is just the sample average of Y1, …, Yn, namely $\bar Y$.
Like $\bar Y$, the OLS estimator $\hat\beta_1$ has some desirable properties: under certain assumptions, it is unbiased (that is, $E(\hat\beta_1) = \beta_1$), and it has a tighter sampling distribution than some other candidate estimators of β1 (more on this later).
Importantly, this is what everyone uses; it is the common “language” of linear regression.
Derivation of the OLS Estimators
Setting the derivatives of $\sum_{i=1}^n (Y_i - b_0 - b_1 X_i)^2$ with respect to b0 and b1 to zero yields the two normal equations:
$$\sum_{i=1}^n (Y_i - b_0 - b_1 X_i) = 0 \quad (1)$$
$$\sum_{i=1}^n X_i (Y_i - b_0 - b_1 X_i) = 0 \quad (2)$$
$\hat\beta_0$ and $\hat\beta_1$ are the values of b0 and b1 that solve these two normal equations.
Dividing each term in equations (1) and (2) by n gives
$$\bar Y - b_0 - b_1 \bar X = 0 \quad (3)$$
$$\frac{1}{n}\sum_{i=1}^n X_i Y_i - b_0 \bar X - b_1 \frac{1}{n}\sum_{i=1}^n X_i^2 = 0 \quad (4)$$
From (3), $b_0 = \bar Y - b_1 \bar X$. Substituting this into (4) and collecting terms, we have
$$\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}, \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X.$$
Application to the California Test Score-Class Size data
Estimated slope: $\hat\beta_1 = -2.28$. Estimated intercept: $\hat\beta_0 = 698.9$. Estimated regression line: $\widehat{\text{Test Score}} = 698.9 - 2.28 \times STR$.
Interpretation of the estimated slope and intercept
Districts with one more student per teacher on average have test scores that are 2.28 points lower; that is, $\hat\beta_1 = -2.28$.
The intercept, taken literally, means that according to this estimated line, districts with zero students per teacher would have a predicted test score of 698.9.
This interpretation of the intercept makes no sense: it extrapolates the line outside the range of the data. In this application, the intercept is not itself economically meaningful.
Predicted values and residuals:
One of the districts in the data set is Antelope, CA, for which STR = 19.33 and Test Score = 657.8.
Predicted value: $\hat Y_{\text{Antelope}} = 698.9 - 2.28 \times 19.33 = 654.8$. Residual: $\hat u_{\text{Antelope}} = 657.8 - 654.8 = 3.0$.
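Using the rounded coefficients reported in the text, the Antelope prediction and residual can be reproduced directly:

```python
# Estimated line from the text: TestScore-hat = 698.9 - 2.28 * STR
beta0_hat, beta1_hat = 698.9, -2.28

str_antelope, score_antelope = 19.33, 657.8  # Antelope, CA

predicted = beta0_hat + beta1_hat * str_antelope  # predicted test score
residual = score_antelope - predicted             # actual minus predicted
# predicted is about 654.8 and residual is about 3.0
```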
Measures of Fit
A natural question is how well the regression line “fits” or explains the data. There are two regression statistics that provide complementary measures of the quality of fit:
The regression R2 measures the fraction of the variance of Y that is explained by X; it is unitless and ranges between zero (no fit) and one (perfect fit).
The standard error of the regression (SER) measures the magnitude of a typical regression residual in the units of Y .
The R2:
The regression R2 is the fraction of the sample variance of Yi “explained” by the regression. Write $Y_i = \hat Y_i + \hat u_i$; the OLS predicted values $\hat Y_i$ have sample average $\bar Y$ and are uncorrelated with the residuals $\hat u_i$, which follows from equations (1) and (2).
Definition of R2:
$$R^2 = \frac{ESS}{TSS} = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}.$$
R2 = 0 means ESS = 0. R2 = 1 means ESS = TSS. 0 ≤ R2 ≤ 1.
For a regression with a single X, R2 = the square of the correlation coefficient between X and Y. (Exercise 4.12)
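Both the definition R² = ESS/TSS and the single-regressor identity R² = corr(X, Y)² can be checked numerically; the data below are made up for illustration.

```python
import math

x = [10.0, 12.0, 15.0, 18.0, 20.0]  # hypothetical regressor values
y = [681.0, 673.0, 664.0, 655.0, 651.0]  # hypothetical outcomes

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# OLS fit
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

ess = sum((yh - ybar) ** 2 for yh in yhat)  # explained sum of squares
tss = sum((yi - ybar) ** 2 for yi in y)     # total sum of squares
r2 = ess / tss

# Sample correlation of X and Y
corr = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / math.sqrt(
    sum((xi - xbar) ** 2 for xi in x) * sum((yi - ybar) ** 2 for yi in y))
```

With an intercept in the regression, `r2` equals `corr ** 2` exactly (up to floating-point error).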
The Standard Error of the Regression (SER)
The SER measures the spread of the distribution of u. The SER is (almost) the sample standard deviation of the OLS residuals:
$$SER = \sqrt{\frac{1}{n-2} \sum_{i=1}^n (\hat u_i - \bar{\hat u})^2} = \sqrt{\frac{1}{n-2} \sum_{i=1}^n \hat u_i^2}.$$
The second equality holds because $\bar{\hat u} = \frac{1}{n} \sum_{i=1}^n \hat u_i = 0$.
The SER has the units of u, which are the units of Y. It measures the average “size” of the OLS residual (the average “mistake” made by the OLS regression line).
The root mean squared error (RMSE) is closely related to the SER:
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n \hat u_i^2}.$$
This measures the same thing as the SER; the minor difference is division by n instead of n−2.
Technical note: why divide by n−2 instead of n−1?
Division by n−2 is a “degrees of freedom” correction, just like division by n−1 in the sample variance $s_Y^2$, except that for the SER two parameters have been estimated (β0 and β1, by $\hat\beta_0$ and $\hat\beta_1$), whereas in $s_Y^2$ only one has been estimated ($\mu_Y$, by $\bar Y$).
When n is large, it makes negligible difference whether n, n−1, or n−2 is used, although the conventional formula uses n−2 when there is a single regressor.
For details, see Section 17.4.
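The SER and RMSE differ only in the divisor, which the following sketch makes concrete (made-up data again):

```python
import math

x = [10.0, 12.0, 15.0, 18.0, 20.0]  # hypothetical regressor values
y = [681.0, 673.0, 664.0, 655.0, 651.0]  # hypothetical outcomes

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
ssr = sum(u ** 2 for u in resid)

ser = math.sqrt(ssr / (n - 2))   # degrees-of-freedom correction: n - 2
rmse = math.sqrt(ssr / n)        # divides by n instead
```

The RMSE is always slightly smaller than the SER, and the residuals sum to zero because the regression includes an intercept.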
Example of the R2 and the SER
R2 = 0.05, SER = 18.6. STR explains only a small fraction of the variation in test scores.
Does this make sense? Does this mean that STR is unimportant in a policy sense? No.
The Least Squares Assumptions
What, in a precise sense, are the properties of the OLS estimator? We would like it to be unbiased, and to have a small variance. Does it? Under what conditions is it an unbiased estimator of the true population parameters?
To answer these questions, we need to make some assumptions about how Y and X are related to each other, and about how they are collected (the sampling scheme).
These assumptions (there are three) are known as the Least Squares Assumptions.
The Least Squares Assumptions
1. The conditional distribution of u given X has mean zero, that is, E(u|X = x) = 0. This implies that $\hat\beta_1$ is unbiased.
2. $(X_i, Y_i)$, i = 1, …, n, are i.i.d. This is true if X, Y are collected by simple random sampling. This delivers the sampling distribution of $\hat\beta_0$ and $\hat\beta_1$.
3. Large outliers in X and/or Y are rare. Technically, X and u have four moments, that is: E(X4)<∞ and E(u4)<∞. Outliers can result in meaningless values of $\hat\beta_1$.
Least squares assumption #1: E(u|X=x)= 0.
For any given value of X, the mean of u is zero.
Example: Assumption #1 and the class size example.
“Other factors” include: parental involvement; outside learning opportunities (extra math classes, etc.); home environment.
Family income is a useful proxy for many such factors. So, E(u|X=x) = 0 means E(Family Income|STR) = constant (which implies that family income and STR are uncorrelated).
Least squares assumption #2:
(Xi , Yi) , i= 1, … , n, are i.i.d.
This arises automatically if the entity (individual, district) is sampled by simple random sampling. The entity is selected, and then, for that entity, X and Y are observed (recorded).
The main place we will encounter non-i.i.d. sampling is when data are recorded over time (“time series data”) – this will introduce some extra complications.
Least squares assumption #3: Large outliers are rare.
Technical statement: E(X4)<∞ and E(u4)<∞. A large outlier is an extreme value of X or Y. On a technical level, if X and Y are bounded, then they have finite fourth moments. (Standardized test scores automatically satisfy this; STR, family income, etc. satisfy this too.)
However, the substance of this assumption is that a large outlier can strongly influence the results.
OLS can be sensitive to an outlier
Is the lone point an outlier in X or Y? In practice, outliers are often data glitches (coding or recording problems), so check your data for outliers! The easiest way is to produce a scatterplot.
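The sensitivity to a single outlier is easy to demonstrate with made-up data: six points that lie exactly on a line, plus one miscoded observation.

```python
def ols_slope(x, y):
    """OLS slope estimate for a simple regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
           sum((xi - xbar) ** 2 for xi in x)

x = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
y = [2.0 * xi + 1.0 for xi in x]   # points exactly on a line of slope 2

slope_clean = ols_slope(x, y)      # recovers the slope exactly

# One "data glitch" (e.g. a miscoded Y value) added to the sample:
x_glitch = x + [22.0]
y_glitch = y + [200.0]
slope_glitch = ols_slope(x_glitch, y_glitch)  # pulled far from 2
```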
Sampling Distribution of OLS Estimator
The OLS estimator is computed from a sample of data; a different sample gives a different value of $\hat\beta_1$. This is the source of the “sampling uncertainty” of $\hat\beta_1$.
We want to:
quantify the sampling uncertainty associated with $\hat\beta_1$.
use $\hat\beta_1$ to test hypotheses such as β1 = 0.
construct a confidence interval for β1.
All these require figuring out the sampling distribution of the OLS estimator. Two steps to get there.
1. Probability framework for linear regression.
2. Distribution of the OLS estimator.
Probability Framework for Linear Regression
The probability framework for linear regression is summarized by the three least squares assumptions.
Population: the population of interest (ex: all possible school districts).
Random variables: Y, X (ex: Test Score, STR); the joint distribution of (Y, X).
The population regression function is linear. E(u|X) = 0. X, Y have finite fourth moments.
Data collection by simple random sampling: {(Xi, Yi)}, i = 1, …, n, are i.i.d.
The Sampling Distribution of $\hat\beta_1$
Like $\bar Y$, $\hat\beta_1$ has a sampling distribution.
What is $E(\hat\beta_1)$? (Where is it centered?)
What is $\mathrm{Var}(\hat\beta_1)$? (A measure of sampling uncertainty.)
What is its sampling distribution in small samples?
What is its sampling distribution in large samples?
The mean and variance of the sampling distribution of $\hat\beta_1$
Recall that
$$\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}.$$
Thus
$$Y_i - \bar Y = \beta_1 (X_i - \bar X) + (u_i - \bar u),$$
because $Y_i = \beta_0 + \beta_1 X_i + u_i$ and $\bar Y = \beta_0 + \beta_1 \bar X + \bar u$.
$\hat\beta_1$ is unbiased, which can be shown using the Law of Iterated Expectations: E(Y) = E(E(Y|X)).
so
$$\hat\beta_1 - \beta_1 = \frac{\sum_{i=1}^n (X_i - \bar X)(u_i - \bar u)}{\sum_{i=1}^n (X_i - \bar X)^2}.$$
We can simplify this formula by noting that:
$$\sum_{i=1}^n (X_i - \bar X)(u_i - \bar u) = \sum_{i=1}^n (X_i - \bar X) u_i - \bar u \sum_{i=1}^n (X_i - \bar X) = \sum_{i=1}^n (X_i - \bar X) u_i,$$
since $\sum_{i=1}^n (X_i - \bar X) = 0$.
Thus, letting $v_i = (X_i - \bar X) u_i$ and $s_X^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2$,
$$\hat\beta_1 - \beta_1 = \frac{\frac{1}{n} \sum_{i=1}^n v_i}{\frac{n-1}{n} s_X^2}.$$
Calculate the mean of $\hat\beta_1 - \beta_1$:
$$E(\hat\beta_1 - \beta_1) = E\left[\frac{\frac{1}{n} \sum_{i=1}^n v_i}{\frac{n-1}{n} s_X^2}\right] = \frac{n}{n-1} \, E\left[\frac{\frac{1}{n} \sum_{i=1}^n v_i}{s_X^2}\right].$$
Now $E(v_i / s_X^2) = E[(X_i - \bar X) u_i / s_X^2] = 0$, because $E(u_i | X_i = x) = 0$.
Thus $E(\hat\beta_1 - \beta_1) = 0$, so $E(\hat\beta_1) = \beta_1$.
That is, $\hat\beta_1$ is an unbiased estimator of $\beta_1$.
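The unbiasedness result can be illustrated by simulation: across many samples generated with E(u|X) = 0, the average of the slope estimates should be close to the true β1. All parameter values below are hypothetical, chosen to echo the class-size example.

```python
import random

random.seed(0)
beta0, beta1 = 698.9, -2.28  # "true" parameters for the simulation (hypothetical)

def ols_slope(x, y):
    """OLS slope estimate for a simple regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
           sum((xi - xbar) ** 2 for xi in x)

estimates = []
for _ in range(2000):
    x = [random.uniform(14.0, 26.0) for _ in range(50)]   # hypothetical STR draws
    u = [random.gauss(0.0, 10.0) for _ in range(50)]      # E(u|X) = 0 by construction
    y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]
    estimates.append(ols_slope(x, y))

mean_estimate = sum(estimates) / len(estimates)  # should be close to beta1
```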
Because $\frac{n-1}{n} \approx 1$ when n is large, and $v_i = (X_i - \bar X) u_i \approx (X_i - \mu_X) u_i$, which is i.i.d. and has two moments. That is, $\mathrm{Var}(v_i) < \infty$. Thus $\frac{1}{\sqrt{n}} \sum_{i=1}^n v_i$ is distributed $N(0, \mathrm{Var}(v))$ when n is large (central limit theorem). Also, $s_X^2$ is approximately equal to $\sigma_X^2$ when n is large. Putting these together we have:
Large-n approximation to the distribution of $\hat\beta_1$:
$$\hat\beta_1 - \beta_1 \approx \frac{\frac{1}{n} \sum_{i=1}^n v_i}{\sigma_X^2},$$
which is approximately distributed $N\left(0, \frac{\sigma_v^2}{n \sigma_X^4}\right)$. Because $v_i = (X_i - \mu_X) u_i$, we can write this as: $\hat\beta_1$ is approximately distributed $N\left(\beta_1, \frac{\mathrm{Var}[(X_i - \mu_X) u_i]}{n [\mathrm{Var}(X_i)]^2}\right)$.
The larger the variance of X, the smaller the variance of $\hat\beta_1$
The math:
$$\mathrm{Var}(\hat\beta_1) = \frac{1}{n} \cdot \frac{\mathrm{Var}[(X_i - \mu_X) u_i]}{\sigma_X^4},$$
where $\sigma_X^2 = \mathrm{Var}(X_i)$. The variance of X appears (squared) in the denominator, so increasing the spread of X decreases the variance of $\hat\beta_1$.
The intuition: if there is more variation in X, then there is more information in the data that you can use to fit the regression line. This is most easily seen in a figure.
The larger the variance of X, the smaller the variance of $\hat\beta_1$
There are the same number of black and blue dots; using which would you get a more accurate regression line?
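The claim that more variation in X yields a more precise slope estimate can also be checked by simulation (all parameter values below are hypothetical):

```python
import random
import statistics

random.seed(1)

def ols_slope(x, y):
    """OLS slope estimate for a simple regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
           sum((xi - xbar) ** 2 for xi in x)

def slope_estimates(x_sd, reps=1000, n=50):
    """Slope estimates across many samples, with X spread controlled by x_sd."""
    out = []
    for _ in range(reps):
        x = [random.gauss(20.0, x_sd) for _ in range(n)]
        y = [700.0 - 2.0 * xi + random.gauss(0.0, 10.0) for xi in x]
        out.append(ols_slope(x, y))
    return out

sd_narrow = statistics.stdev(slope_estimates(x_sd=1.0))  # little variation in X
sd_wide = statistics.stdev(slope_estimates(x_sd=4.0))    # more variation in X
```

The sampling standard deviation of the slope is markedly smaller when X is more spread out, as the variance formula predicts.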
Asymptotic Distribution Theory
Consistency and convergence in probability. Let S1, S2, …, Sn, … be a sequence of random variables. The sequence {Sn} is said to converge in probability to a limit μ (written $S_n \xrightarrow{p} \mu$) if the probability that Sn is within ±δ of μ tends to one as n→∞, for every constant δ > 0. That is, $S_n \xrightarrow{p} \mu$
if and only if Pr[|Sn − μ| ≥ δ] → 0
as n→∞ for every δ > 0. If $S_n \xrightarrow{p} \mu$, then Sn is said to be a consistent estimator of μ.
Law of Large Numbers: under certain conditions on Y1, …, Yn, the sample average converges in probability to the population mean. If Y1, …, Yn are i.i.d., $E(Y_i) = \mu_Y$, and $\mathrm{Var}(Y_i) < \infty$, then $\bar Y \xrightarrow{p} \mu_Y$.
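A quick illustration of the LLN; the exponential distribution and its mean are arbitrary choices for the sketch.

```python
import random

random.seed(2)
mu = 5.0  # population mean of an exponential distribution (illustrative)

def sample_mean(n):
    """Average of n i.i.d. draws from an exponential with mean mu."""
    return sum(random.expovariate(1.0 / mu) for _ in range(n)) / n

# The sample average gets close to mu as n grows.
errors = {n: abs(sample_mean(n) - mu) for n in (10, 1000, 100000)}
```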
Another approach to obtain an estimator: apply the Law of Large Numbers.
The least squares assumption #1, E(u|X) = 0, implies $E(u_i) = 0$ and $E(X_i u_i) = 0$.
Applying the LLN, we replace these population moments with their sample analogs:
$$\frac{1}{n} \sum_{i=1}^n (Y_i - b_0 - b_1 X_i) = 0, \qquad \frac{1}{n} \sum_{i=1}^n X_i (Y_i - b_0 - b_1 X_i) = 0.$$
Replacing the population mean with sample average is called the analogy principle.
This leads to the two normal equations in the bivariate least squares regression.
Convergence in distribution.
Let F1, …, Fn, … be a sequence of cumulative distribution functions corresponding to a sequence of random variables S1, …, Sn, … . Then the sequence of random variables Sn is said to converge in distribution to S (denoted $S_n \xrightarrow{d} S$) if the distribution functions {Fn} converge to F, the distribution function of S. That is,
$$S_n \xrightarrow{d} S \iff \lim_{n \to \infty} F_n(t) = F(t),$$
where the limit holds at all points t at which the limiting distribution F is continuous. The distribution F is called the asymptotic distribution of Sn.
The central limit theorem: if Y1, …, Yn are i.i.d. with $E(Y_i) = \mu_Y$ and $0 < \mathrm{Var}(Y_i) = \sigma_Y^2 < \infty$, then
$$\frac{\bar Y - \mu_Y}{\sigma_Y / \sqrt{n}} \xrightarrow{d} N(0, 1).$$
In other words, the asymptotic distribution of $\sqrt{n}(\bar Y - \mu_Y)/\sigma_Y$ is N(0, 1).
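A quick check of the CLT: standardized sample means of Uniform(0, 1) draws should fall inside ±1.96 about 95% of the time (the sample size and replication count below are arbitrary):

```python
import math
import random

random.seed(3)
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)  # mean and sd of Uniform(0, 1)
n, reps = 100, 2000

inside = 0
for _ in range(reps):
    ybar = sum(random.random() for _ in range(n)) / n
    z = (ybar - mu) / (sigma / math.sqrt(n))  # standardized sample mean
    if abs(z) < 1.96:
        inside += 1

coverage = inside / reps  # should be close to 0.95 if z is approximately N(0, 1)
```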
Slutsky’s theorem combines consistency and convergence in distribution.
Suppose that $a_n \xrightarrow{p} a$, where a is a constant, and $S_n \xrightarrow{d} S$. Then
$$a_n + S_n \xrightarrow{d} a + S, \qquad a_n S_n \xrightarrow{d} aS, \qquad S_n / a_n \xrightarrow{d} S/a \ \ (\text{if } a \neq 0).$$
Continuous mapping theorem: if g is a continuous function, then $S_n \xrightarrow{p} a$ implies $g(S_n) \xrightarrow{p} g(a)$, and $S_n \xrightarrow{d} S$ implies $g(S_n) \xrightarrow{d} g(S)$.
Summary for the OLS estimator $\hat\beta_1$:
Under the three Least Squares Assumptions,
The exact (finite sample) sampling distribution of $\hat\beta_1$ has mean $\beta_1$ ($\hat\beta_1$ is an unbiased estimator of $\beta_1$), and $\mathrm{Var}(\hat\beta_1)$ is inversely proportional to n.
Other than its mean and variance, the exact distribution of $\hat\beta_1$ is complicated and depends on the distribution of (X, u).
$\hat\beta_1 \xrightarrow{p} \beta_1$. (law of large numbers)
$\frac{\hat\beta_1 - E(\hat\beta_1)}{\sqrt{\mathrm{Var}(\hat\beta_1)}}$ is approximately distributed N(0, 1). (CLT)