Note 13 of 5E Statistics with Economics and Business Applications Chapter 11 Linear Regression and...

60
Note 13 of 5E Statistics with Statistics with Economics and Economics and Business Applications Business Applications Chapter 11 Linear Regression and Correlation Correlation Coefficient, Least Squares, ANOVA Table, Test and Confidence Intervals, Estimation and Prediction
  • date post

    21-Dec-2015
  • Category

    Documents

  • view

    221
  • download

    0

Transcript of Note 13 of 5E Statistics with Economics and Business Applications Chapter 11 Linear Regression and...

Note 13 of 5E

Statistics with Economics and Statistics with Economics and Business ApplicationsBusiness Applications

Chapter 11 Linear Regression and Correlation

Correlation Coefficient, Least Squares, ANOVA Table, Test and Confidence Intervals,

Estimation and Prediction

Note 13 of 5E

ReviewReview

I. I. What’s in last two lectures?What’s in last two lectures?Hypotheses tests for means and proportions Chapter 8

II. II. What's in the next three lectures?What's in the next three lectures?Correlation coefficient and linear regression Read Chapter 11

Note 13 of 5E

IntroductionIntroduction

So far we have done statistics on one variable at a time. We now interested in relationships between two variables and how to use one variable to predict another variable.

• Does weight depend on height?

• Does blood pressure level predict life expectancy?

• Do SAT scores predict college performance?

• Does taking PSTAT 5E make you a better person?

Note 13 of 5E

Example: Age and FatnessExample: Age and Fatness

The following data was collected in a study of age and fatness in humans.

* Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984) Total body composition by dual-photon (153Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839

One of the questions was, “What is the relationship between age and fatness?”

Age 23 23 27 27 39 41 45 49 50% Fat 9 .5 27 .9 7 .8 17 .8 31 .4 25 .9 27 .4 25 .2 31 .1

Age 53 53 54 56 57 58 58 60 61% Fat 34.7 42 29.1 32 .5 30 .3 33 33.8 41 .1 34 .5

Note 13 of 5E

Example: Age and FatnessExample: Age and FatnessThe following scatterplot shows that % fat in general tend to increase with age. The relationship is close, but not exactly, linear.

Note 13 of 5E

Example: Advertising and SaleExample: Advertising and Sale The following table contains sales (y) and advertising

expenditures (x) for 10 branches of a retail store.

x ($100) 18 7 14 31 21 5 11 16 26 29

y ($1000) 55 17 36 85 62 18 33 41 63 87

Note 13 of 5E

• What patternpattern or formform do you see?

• Straight line upward or downward

• Curve or no pattern at all

• How strongstrong is the pattern?

• Strong or weak

• Are there any unusual observationsunusual observations?

• Clusters or outliers

Describing the ScatterplotDescribing the Scatterplot

Note 13 of 5E

Positive linear - strong Negative linear -weak

Curvilinear No relationship

ExamplesExamples

Note 13 of 5E

Investigation of RelationshipInvestigation of Relationship

There are two approaches to investigate linear relationship

– Correlation coefficient: a numerical measure of the strengthstrength and directiondirection of the linear relationship between x and y.

– Linear regression: a linear equation expresses the relationship between x and y. It provides a formform of the relationship.

Note 13 of 5E

The Correlation CoefficientThe Correlation Coefficient The strength and direction of the relationship

between x and y are measured using the correlation coefficient (Pearson product correlation coefficient (Pearson product moment coefficient of correlation), moment coefficient of correlation), rr..

where

sx = standard deviation of the x’s

sy = standard deviation of the y’s

sx = standard deviation of the x’s

sy = standard deviation of the y’s

yx

xy

ss

sr

yx

xy

ss

sr

1

))((

nn

yxyx

s

iiii

xy 1

))((

nn

yxyx

s

iiii

xy

Note 13 of 5E

ExampleExample

The table shows the heights and weights ofn = 10 randomly selected college football players.Player 1 2 3 4 5 6 7 8 9 10

Height, x 73 71 75 72 72 75 67 69 71 69

Weight, y 185 175 200 210 190 195 150 170 180 175

Use your calculator to find the sums and sums of squares.

Use your calculator to find the sums and sums of squares.

8261.)2610)(4.60(

328

26104.60328

r

SSS yyxxxy

8261.)2610)(4.60(

328

26104.60328

r

SSS yyxxxy

Note 13 of 5E

Football PlayersFootball Players

Height

Weig

ht

75747372717069686766

210

200

190

180

170

160

150

Scatterplot of Weight vs Height

r = .8261

Strong positive correlation

As the player’s height increases, so

does his weight.

r = .8261

Strong positive correlation

As the player’s height increases, so

does his weight.

Note 13 of 5E

Interpreting Interpreting rr

All points fall exactly on a straight line.

Strong relationship; either positive or negative

No relationship; random scatter of points

• -1 r 1

• r 0

• r 1 or –1

• r = 1 or –1

Sign of r indicates direction of the linear relationship.

Note 13 of 5E

Example: Advertising Example: Advertising and Saleand Sale

Often we want to investigate the form of the relationship for the purpose of description, control and prediction. For the advertising and sale example, sale increases as advertising expenditure increase. The relationship is almost, but not exact linear.

• Description: how sales depends on advertising expenditure

• Control: how much to spend on advertising to reach certain goals on sales

• Prediction: how much sales do we expect if we spend certain amount of money on advertising

Note 13 of 5E

Linear Deterministic ModelLinear Deterministic Model• Denote x as independent variables and y as

dependent variable. For example, x=age and y=%fat in the first example; x=advertising expenditure, y=sales in the second example

• We want to find how y depends on x, or how to predict y using x

• One of the simplest deterministic mathematical relationship between two variables x and y is a linear relationship y = α + βx

00

= intercept

α: intercept

β: slope

Note 13 of 5E

RealityReality • In the real world, things are never so clean!

– Age influences fatness. But it is not the sole influence. There are other factors such as sex, body type and random variation (e.g. measurement error)

– Other factors such as time of year, state of economy and size of inventory, besides the advertising expenditure, can influence the sale

• Observations of (x, y) do not fall on a straight line

As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality. Albert Einstein

Note 13 of 5E

Probabilistic ModelProbabilistic Model • Probabilistic model: y = deterministic model + random error

• Random error represents random fluctuation from the deterministic model

• The probabilistic model is assumed for the population

• Simple linear regression model: y = α + βx + ε

• Without the random deviation ε, all observed points (x, y) points would fall exactly on the deterministic line. The inclusion of ε in the model equation allows points to deviate from the line by random amounts.

Note 13 of 5E

Simple Linear Regression ModelSimple Linear Regression Model

0

0x = x1 x = x2

ε1

Observation when x = x1

(positive deviation)

ε2

Observation when x = x2

(negativetive deviation)

= vertical intercept

Population regression line(Slope )

Note 13 of 5E

Basic Assumptions of the Simple Linear Basic Assumptions of the Simple Linear Regression ModelRegression Model

1. The distribution of ε at any particular x value has mean value 0.

2. The standard deviation of ε is the same for any particular value of x. This standard deviation is denoted by .

3. The distribution of ε at any particular x value is normal.

4. The random errors are independent of one another.

Note 13 of 5E

Interpretation of TermsInterpretation of Terms1. The line α + βx describes average value of y for any

fixed value of x.

2. The slope of the population regression line is the average change in y associated with a 1-unit increase in x.

3. The intercept is the height of the population line when x = 0.

4. The size of determines the extent to which (x, y) observations deviate from the population line.

Small Large

Note 13 of 5E

The Distribution of yThe Distribution of y

For any fixed x, y has normal distribution with mean xx and standard deviation and standard deviation .

Note 13 of 5E

DataData1. So far we have described the population

probabilistic model.2. Usually three population parameters, α, β and σ,

are unknown. We need to estimate them from data.

3. Data: n pairs of observations of independent and dependent variables

(x1,y1), (x2,y2), …, (xn,yn)

4. Probabilistic model yi = + xi + i, i=1,…,n i are independent normal with mean 0 and

standard deviation .

Note 13 of 5E

Steps in Regression AnalysisSteps in Regression Analysis When you perform simple regression analysis, use a step-by step approach:

1. Fit the model to data – estimate parameters.

2. Use the analysis of variance F test (or t test) and r2 to determine how well the model fits the data.

3. Use diagnostic plots to check for violation of the regression assumptions.

4. Proceed to estimate or predict the quantity of interest

We now discuss statistical methods for each step and use the fatness example to illustrate each step

Note 13 of 5E

The Method of Least SquaresThe Method of Least Squares• The equation of the best-fitting line is

calculated using n pairs of data (xi, yi). • We choose our estimates and to estimate and so that the vertical distances of the points from the line,are minimized.

n

1i

2ii

n

1i

2ii

)xβα(y

)y(ySSE

minimize toβ and α Choose

xβαy :line fittingBest

Note 13 of 5E

Least Squares EstimatorsLeast Squares Estimators

xy

Sn

yxyxS

n

yyS

n

xxS

n

yy

n

xx

xx

xy

iiiixy

iiyy

iixx

ii

ˆ - ofestimator point ˆ

S ofestimator point ˆ

Then .) )( (

,) (

,) (

, , Compute

22

22

Note 13 of 5E

Example: Age and FatnessExample: Age and Fatness

The following data was collected in a study of age and fatness in humans.

* Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984) Total body composition by dual-photon (153Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839

One of the questions was, “What is the relationship between age and fatness?”

Age 23 23 27 27 39 41 45 49 50% Fat 9 .5 27 .9 7 .8 17 .8 31 .4 25 .9 27 .4 25 .2 31 .1

Age 53 53 54 56 57 58 58 60 61% Fat 34.7 42 29.1 32 .5 30 .3 33 33.8 41 .1 34 .5

Note 13 of 5E

Example: Age and FatnessExample: Age and Fatness

2

n 18

X 834

y 515

X 41612

XY 25489.2

Age (x) % Fat y x2 xy23 9.5 529 218.523 27.9 529 641.727 7.8 729 210.627 17.8 729 480.639 31.4 1521 1224.641 25.9 1681 1061.945 27.4 2025 123349 25.2 2401 1234.850 31.1 2500 155553 34.7 2809 1839.153 42 2809 222654 29.1 2916 1571.456 32.5 3136 182057 30.3 3249 1727.158 33 3364 191458 33.8 3364 1960.460 41.1 3600 246661 34.5 3721 2104.5

834 515 41612 25489.2

Note 13 of 5E

Example: Age and FatnessExample: Age and Fatness

2

n 18, x 834, y 515

x 41612, xy 25489.2

2

2xx

2

xS x

n

83441612 2970

18

xy

x yS xy

n834 515

25489.2 1627.5318

xy

xy

S

xx

xy

55.22.3ˆ

22.318

83455.

18

515

ˆ - ˆ

55.2970

53.1627

S ˆ

Note 13 of 5E

The Analysis of VarianceThe Analysis of Variance• The total variation in the experiment is measured by

the total sum of squarestotal sum of squares:2)yyS yy ( SS Total

2)yyS yy ( SS Total

• The Total SSTotal SS is divided into two parts:

SSRSSR (sum of squares for regression): measures the variation explained by including the independent variable x in the model.

SSESSE (sum of squares for error): measures the leftover variation not explained by x.

xx

xy

S

S 2)(SSR

xx

xy

S

S 2)(SSR SSR - SS TotalSSE SSR - SS TotalSSE

Note 13 of 5E

The ANOVA TableThe ANOVA Table

Total df = Mean SquaresRegression df = Error df =

n -1

1

n –1 – 1 = n - 2

MSR = SSR/1

MSE = SSE/(n-2)

Source df SS MS F

Regression 1 SSR MSR MSR/MSE

Error n - 2 SSE MSE

Total n - 1 Total SS

Note 13 of 5E

The F TestThe F Test

We can test the overall usefulness of the linear model using an F test. If the model is useful, MSR will be large compared to the unexplained variation, MSE.

0 :H vs0 :H a0

df. 2-n and 1 with FF if HReject

MSE

MSRF :StatisticTest

α0

Note 13 of 5E

Coefficient of DeterminationCoefficient of Determination

• r2 is the square of correlation coefficient• r2 is a number between zero and one and a value close to zero suggests a poor model.• It gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.• A very high value of r² can arise even though the relationship between the two variables is non-linear. The fit of a model should never simply be judged from the r² value alone.

SS Total

SSE-1

SS Total

SSRr

as defined ision determinat oft coefficien The

2

Note 13 of 5E

Estimate of Estimate of σσ

An estimator of the variance 2 is

MSE 2-n

SSEˆ 2

Thus, an estimator of the standard deviation is

MSE2-n

SSEˆ

Note 13 of 5E

Example: Age and FatnessExample: Age and Fatness

5.75 33.11 ˆ

33.11 2-18

529.71

2-n

SSE ˆ

.627 1421.58

529.71 -1

SS Total

SSE-1 r

529.71 891.27-1421.58 SSR - SS Total SSE

27.8912970

53.1627)( SSR

58.142118

5153.16156

n

)( SS Total

2

2

22

222

xx

xy

S

S

yy

Note 13 of 5E

Example: Age and FatnessExample: Age and Fatness

Source df SS MS F

Regression 1 891.27 891.27 26.94

Error 16 529.71 33.11

Total 17 1421.58

ANOVA Table

• With r2=0.627 or 62.7%, we can say that 62.7% of the observed variation in %Fat can be attributed to the probabilistic linear relationship with human age.• The magnitude of a typical sample deviation from the least squares line is about 5.75(%) which is reasonably large compared to the y values themselves. • This would suggest that the model is only useful in the sense of provide gross ballpark estimates for %Fat for humans based on age.

Note 13 of 5E

Inference Concerning the Slope Inference Concerning the Slope ββ

• Do the data present sufficient evidence to indicate that y increases (or decreases) linearly as x increases?

• Is the independent variable x useful in predicting y?

• A no answer to above questions means that y does not change, regardless of the value of x. This implies that the slope of the line, , is zero.

0:0:0 aH versusH 0:0:0 aH versusH

Note 13 of 5E

Sampling Distribution Sampling Distribution When the four basic assumptions of the simple linear

regression model are satisfied, the following are true:

1. The mean value of is . That is, is unbiased

2. The standard deviation of the statistic is

3. has a normal distribution (a consequence of the error being normally distributed)

4. The probability distribution of the standardized variable

has the t distribution with df=n-2

xxS

xxSt

ˆ

Note 13 of 5E

Confidence Interval for Confidence Interval for When then four basic assumptions of the simple linear regression model are satisfied, a (1-α)100% confidence interval for is

where the t critical value is based on df = n - 2.

xxSt /ˆ ˆ2/

Note 13 of 5E

Example: Age and FatnessExample: Age and Fatness

.77) (.33,or

.22 .55 29705.75/ 2.12 .55 /ˆ ˆ

is for interval confidence 95%A

2/ xxSt

Based on sample data, the %Fat increases .55% on average with one year of age, and we are 95% confident that the true increase per year is between 0.33% and 0.77%.

Note 13 of 5E

Hypothesis Tests Concerning Hypothesis Tests Concerning Step 1: Specify the null and alternative hypothesis

– H0: β = βversus Ha: β βtwo-sided test)– H0: β = βversus Ha: β > β(one-sided test)– H0: β = βversus Ha: β < βone-sided test)

Step 2: Test statistic

Step 3: When four basic assumptions of the simple linear regression model are satisfied, under H0 , the sampling distribution of t has a Student’s t distribution with n-2 degrees of freedom

xxSt

ˆ0

Note 13 of 5E

Hypothesis Tests Concerning Hypothesis Tests Concerning Step 3: Find p-value. Compute

sample statistic

– Ha: two-sided test)

– Ha: > (one-sided test)

– Ha: < (one-sided test)

|)*|(2value-p ttP |)*|(2value-p ttP

*)(value-p ttP *)(value-p ttP

*)(value-p ttP *)(value-p ttP

P(t>|t*|), P(t>t*) and P(t<t*) can be found from the t tableP(t>|t*|), P(t>t*) and P(t<t*) can be found from the t table

xxSt

ˆ* 0

Note 13 of 5E

Example: Age and FatnessExample: Age and Fatness

fatness. and agebetween

iprelationshlinear t significan a is There .5

Hreject 4.

.005value-p 3.

162-ndf

21.52970/75.5

55.

0ˆ t*.2

0:H 0, :H 1.

0

a0

xxS

fatness. and agebetween

iprelationshlinear t significan a is There .5

Hreject 4.

.005value-p 3.

162-ndf

21.52970/75.5

55.

0ˆ t*.2

0:H 0, :H 1.

0

a0

xxS

Note 13 of 5E

Interpreting a Significant RegressionInterpreting a Significant Regression

• Even if you do not reject the null hypothesis

that the slope of the line equals 0, it does not

necessarily mean that y and x are unrelated.

• It may happen that y and x are perfectly

related in a nonlinear way.

• Causality—Do not conclude that x causes y.

There may be an unknown variable at work!

Note 13 of 5E

Checking the Checking the Regression AssumptionsRegression Assumptions

1. The relationship between x and y is linear, given by y = + x +

2. The random error terms are independent and, for any value of x, have a normal distribution with mean 0 and variance 2.

1. The relationship between x and y is linear, given by y = + x +

2. The random error terms are independent and, for any value of x, have a normal distribution with mean 0 and variance 2.

Remember that the results of a regression analysis are only valid when the necessary assumptions have been satisfied.

Note 13 of 5E

ResidualsResiduals• The residual corresponding to (xi,yi) is • The residual residual is the “leftover” variation in each data point after the variation explained by the regression model has been removed. • Residuals reflect random errors, thus we can use them to check violations in the assumptions about random errors.

iii xye ˆˆ iii xye ˆˆ

00 x = x1 x = x2

e1 e2

fitted line

Note 13 of 5E

Diagnostic Tools – Residual PlotsDiagnostic Tools – Residual Plots

1. Normal probability plot of residuals

2. Plot of residuals versus fit or residuals versus variables

1. Normal probability plot of residuals

2. Plot of residuals versus fit or residuals versus variables

We can check the normality and equal variance assumptions using

Note 13 of 5E

If the normality assumption is valid, the plot should resemble a straight line, sloping upward to the right.

If not, you will often see the pattern fail in the tails of the graph.

If the normality assumption is valid, the plot should resemble a straight line, sloping upward to the right.

If not, you will often see the pattern fail in the tails of the graph.

Normal Probability PlotNormal Probability Plot

Note 13 of 5E

If the equal variance assumption is valid, the plot should appear as a random scatter around the zero center line.

If not, you will see a pattern in the residuals.

If the equal variance assumption is valid, the plot should appear as a random scatter around the zero center line.

If not, you will see a pattern in the residuals.

Residuals versus FitsResiduals versus Fits

Note 13 of 5E

Estimation and Prediction Estimation and Prediction • Once we have

determined that the regression line is useful used the diagnostic plots to check for

violation of the regression assumptions.

• We are ready to use the regression line to Estimate the average value of y for a

given value of x Predict a particular value of y for a

given value of x.

Note 13 of 5E

Estimation and Prediction Estimation and Prediction

Estimating the average value of y when x = x0

Predicting a particular value of y when x = x0

Note 13 of 5E

Estimation and Prediction Estimation and Prediction • The best estimate of the average value of y and

prediction of y for a given value x = x0 is

• The prediction of y is more difficult, requiring a wider range of values in the prediction interval.

0 ˆˆˆ xy 0 ˆˆˆ xy

Note 13 of 5E

Estimation and Prediction Estimation and Prediction

xx

xx

S

xx

nty

xxy

S

xx

nty

xxy

20

2/

0

20

2/

0

)(11ˆˆ

: when of valueparticular apredict To

)(1ˆˆ

: when of valueaverage theestimate To

xx

xx

S

xx

nty

xxy

S

xx

nty

xxy

20

2/

0

20

2/

0

)(11ˆˆ

: when of valueparticular apredict To

)(1ˆˆ

: when of valueaverage theestimate To

Note 13 of 5E

Example: Age and FatnessExample: Age and Fatness

The fitted line is

If x0=45 is put into the equation for x, we have both an estimated average %Fat for 45 year old humans and a predicted %Fat for a 45 year old human

The two interpretations are quite different.

y 3.22 0.548x

3.22+0.548(45)=27.9%

Note 13 of 5E

Example: Age and FatnessExample: Age and Fatness

.much wider is interval prediction The 40.43). (15.37,or

53.129.272970

)18/83445(

18

1175.512.2ˆ

is 45at xy of prediction for the interval confidence 95%

30.78). (25.2,or

88.29.272970

)18/83445(

18

175.512.2ˆ

is 45at xy of average for the interval confidence 95%

2

0

2

0

y

y

.much wider is interval prediction The 40.43). (15.37,or

53.129.272970

)18/83445(

18

1175.512.2ˆ

is 45at xy of prediction for the interval confidence 95%

30.78). (25.2,or

88.29.272970

)18/83445(

18

175.512.2ˆ

is 45at xy of average for the interval confidence 95%

2

0

2

0

y

y

Note 13 of 5E

Steps in Regression AnalysisSteps in Regression Analysis When you perform simple regression analysis, use a step-by step approach:

1. Fit the model to data – estimate parameters.

2. Use the analysis of variance F test (or t test) and r2 to determine how well the model fits the data.

3. Use diagnostic plots to check for violation of the regression assumptions.

4. Proceed to estimate or predict the quantity of interest

Note 13 of 5E

Regression Analysis: y versus xThe regression equation is y = 40.8 + 0.766 xPredictor Coef SE Coef T PConstant 40.784 8.507 4.79 0.001x 0.7656 0.1750 4.38 0.002

S = 8.704 R-Sq = 70.5% R-Sq(adj) = 66.8%

Analysis of VarianceSource DF SS MS F PRegression 1 1450.0 1450.0 19.14 0.002Residual Error 8 606.0 75.8Total 9 2056.0

Regression coefficients, a and b

Minitab OutputMinitab Output

MSE

0: 0H test ToLeast squares regression line

Ft 2MSE

Note 13 of 5E

Key ConceptsKey ConceptsI. Scatter plotI. Scatter plot

Pattern and unusual observationsPattern and unusual observations

II. Correlation coefficient II. Correlation coefficient a numerical measure of the strengthstrength and direction.direction.

Interpretation of rInterpretation of r

Note 13 of 5E

Key ConceptsKey ConceptsIII.III. A Linear Probabilistic ModelA Linear Probabilistic Model

1. When the data exhibit a linear relationship, the appropriate model is y x .

2. The random error has a normal distribution with mean 0 and variance 2.

IV.IV. Method of Least SquaresMethod of Least Squares1. Estimates and , for and , are chosen to

minimize SSE, the sum of the squared deviations about the regression line,

2. The least squares estimates are and

. ˆˆˆ xy

xy ˆˆ xxxy SS /ˆ

Note 13 of 5E

Key ConceptsKey ConceptsV.V. Analysis of VarianceAnalysis of Variance

1. Total SS SSR SSE, where Total SS Syy and SSR (Sxy)2 Sxx.

2. The best estimate of 2 is MSE SSE (n 2).

VI.VI. Testing, Estimation, and PredictionTesting, Estimation, and Prediction

1. A test for the significance of the linear regression

—H0 : 0 can be implemented using statistic

ˆ

xxSt

Note 13 of 5E

Key ConceptsKey Concepts2. The strength of the relationship between x and y can be

measured using

which gets closer to 1 as the relationship gets stronger.3. Use residual plots to check for nonnormality, inequality of

variances, and an incorrectly fit model.4. Confidence intervals can be constructed to estimate the

slope of the regression line and to estimate the average value of y, for a given value of x.

5. Prediction intervals can be constructed to predict a particular observation, y, for a given value of x. For a given x, prediction intervals are always wider than confidence intervals.

SS Total

SSE12 r