Objectives (BPS chapter 24)

Inference for regression

Conditions for regression inference

Estimating the parameters

Using technology

Testing the hypothesis of no linear relationship

Testing lack of correlation

Confidence intervals for the regression slope

Inference about prediction

Checking the conditions for inference

The data in a scatterplot are a random sample from a population that may exhibit a linear relationship between x and y. Different sample → different plot.

ŷ = 0.125x − 41.4

Now we want to describe the population mean response μy as a function of the explanatory variable x: μy = α + βx.

And we want to assess whether the observed relationship is statistically significant (not entirely explained by chance events due to random sampling).

The regression model

The least-squares regression line ŷ = a + bx is a mathematical model of the form “sample data = fit + residual.” For each data point in the sample, the residual is the difference (y − ŷ).

At the population level, the model becomes yi = (α + βxi) + εi, with residuals εi independent and normally distributed N(0, σ).

The population mean response μy is

μy = α + βx

The intercept α, the slope β, and the standard deviation σ of y are the unknown parameters of the regression model. We rely on the random sample data to provide unbiased estimates of these parameters.

The value of ŷ from the least-squares regression line is really a prediction of the mean value of y (μy) for a given value of x.

The least-squares regression line [ŷ = a + bx] obtained from sample data is the best estimate of the true population regression line [μy = α + βx].

ŷ: unbiased estimate for the mean response μy

a: unbiased estimate for the intercept α

b: unbiased estimate for the slope β
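As a rough illustration of how a and b come out of sample data, the sketch below applies the standard least-squares formulas (b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², a = ȳ − b·x̄) to the bonus data from the example in this deck. The slides do not spell these formulas out, and the function name `least_squares` is made up for this sketch.

```python
# Data from the bonus example in this deck:
# years of service (x) and annual bonus in $1000 (y).
years = [1, 2, 3, 4, 5, 6]
bonus = [6, 1, 9, 5, 17, 12]


def least_squares(x, y):
    """Return (a, b): intercept and slope of the line y-hat = a + b*x.

    Hypothetical helper for illustration; uses the textbook formulas
    b = Sxy / Sxx and a = y_bar - b * x_bar.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares(years, bonus)
print(f"y-hat = {a:.3f} + {b:.3f} x")
```

Up to rounding, the result should agree with the Constant and Years coefficients in the MINITAB output of the example.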

Conditions for inference

The observations are independent.

The relationship is indeed linear.

The standard deviation of y, σ, is the same for all values of x.

The response y varies normally around its mean.

The regression standard error, s, for n sample data points is calculated from the residuals (yi − ŷi):

$s = \sqrt{\dfrac{\sum \text{residual}^2}{n - 2}} = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$

s estimates the regression standard deviation σ (s² is an unbiased estimate of σ²).

Regression assumes equal variance of y (σ is the same for all values of x).

The population standard deviation σ for y at any given value of x represents the spread of the normal distribution of the εi around the mean μy.

For any fixed x, the responses y follow a Normal distribution with standard deviation σ.
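A minimal sketch of the regression standard error calculation, assuming the bonus data from the example in this deck. It refits the line with the usual least-squares formulas, then takes the square root of the sum of squared residuals divided by n − 2.

```python
import math

# Bonus example data from this deck.
years = [1, 2, 3, 4, 5, 6]
bonus = [6, 1, 9, 5, 17, 12]
n = len(years)

# Least-squares fit (standard formulas).
x_bar = sum(years) / n
y_bar = sum(bonus) / n
sxx = sum((x - x_bar) ** 2 for x in years)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, bonus)) / sxx
a = y_bar - b * x_bar

# Residuals y_i - y-hat_i, then s with n - 2 degrees of freedom.
residuals = [y - (a + b * x) for x, y in zip(years, bonus)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
print(f"s = {s:.3f}")
```

The value should match the "S = 4.50291" line of the MINITAB output up to rounding.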

Confidence interval for the slope β

Estimating the regression parameter β for the slope is a case of one-sample inference with σ unknown. Hence we rely on t distributions.

The standard error of the slope b is:

$SE_b = \dfrac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$

(s is the regression standard error.)

Thus, a level C confidence interval for the slope is:

estimate ± t* SEestimate

b ± t* SEb

t* is the critical value for the t(df = n − 2) density curve with area C between −t* and +t*.
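The sketch below works through a 95% confidence interval for the slope with the bonus data from the example in this deck. The critical value t* = 2.776 (df = n − 2 = 4) is hard-coded from a t table, which is an assumption of this sketch; a statistics library would normally supply it.

```python
import math

# Bonus example data from this deck.
years = [1, 2, 3, 4, 5, 6]
bonus = [6, 1, 9, 5, 17, 12]
n = len(years)

x_bar = sum(years) / n
y_bar = sum(bonus) / n
sxx = sum((x - x_bar) ** 2 for x in years)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, bonus)) / sxx
a = y_bar - b * x_bar

# Regression standard error s and standard error of the slope SE_b.
s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(years, bonus)) / (n - 2))
se_b = s / math.sqrt(sxx)

# Assumed table value: 95% critical value of t with n - 2 = 4 df.
T_STAR = 2.776

lower = b - T_STAR * se_b
upper = b + T_STAR * se_b
print(f"SE_b = {se_b:.3f}, 95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

Because the interval contains 0, the data are compatible with a slope of zero at the 5% level.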

Testing the hypothesis of no relationship

To test for the existence of a significant relationship, we can test whether the slope parameter β is significantly different from zero using a one-sample t-test procedure.

The standard error of the slope b is:

$SE_b = \dfrac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$

We test the hypotheses H0: β = 0

Ha: β ≠ 0, β > 0, or β < 0 (two- or one-sided)

We calculate

t = b/SEb

which has the t(n − 2) distribution when H0 is true, and use it to find the P-value of the test.
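To make the P-value step concrete, here is a sketch that computes t = b/SEb from the coefficient and standard error reported in the example's MINITAB output, then obtains a two-sided P-value by numerically integrating the t density. The helpers `t_pdf` and `two_sided_p` are written for this illustration only; in practice a statistics package would provide the t distribution.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided P-value: 1 - 2 * (area under the density from 0 to |t|).

    The area is approximated with Simpson's rule (steps must be even).
    """
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * (h / 3.0) * total


# Slope and its standard error as reported in the example output.
b, se_b, n = 2.114, 1.076, 6

t_stat = b / se_b
p_value = two_sided_p(t_stat, df=n - 2)
print(f"t = {t_stat:.2f}, two-sided P-value = {p_value:.3f}")
```

For a one-sided alternative in the direction of the observed slope, the P-value would be half the two-sided value.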

Testing for lack of correlation

The regression slope b and the correlation coefficient r are related, and b = 0 ⇔ r = 0.

Similarly, the population parameter for the slope β is related to the population correlation coefficient ρ, and β = 0 ⇔ ρ = 0.

Thus, testing the hypothesis H0: β = 0 is the same as testing the

hypothesis of no correlation between x and y in the population from

which our data were drawn.

slope: $b = r \dfrac{s_y}{s_x}$
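The relation b = r·sy/sx can be checked numerically. The sketch below does so for the bonus data from the example in this deck, computing r from its definition and the sample standard deviations with Python's `statistics.stdev`.

```python
import math
import statistics

# Bonus example data from this deck.
years = [1, 2, 3, 4, 5, 6]
bonus = [6, 1, 9, 5, 17, 12]
n = len(years)

x_bar = sum(years) / n
y_bar = sum(bonus) / n
sxx = sum((x - x_bar) ** 2 for x in years)
syy = sum((y - y_bar) ** 2 for y in bonus)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, bonus))

b = sxy / sxx                   # least-squares slope
r = sxy / math.sqrt(sxx * syy)  # sample correlation coefficient
s_x = statistics.stdev(years)   # sample standard deviation of x
s_y = statistics.stdev(bonus)   # sample standard deviation of y

print(f"b = {b:.4f}, r * s_y / s_x = {r * s_y / s_x:.4f}, r^2 = {r ** 2:.3f}")
```

Since sy and sx are positive, b and r always share the same sign, and one is zero exactly when the other is.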

Inference about prediction

One use of regression is for predicting the value of y, ŷ, for any value of x within the range of data tested: ŷ = a + bx.

But the regression equation depends on the particular sample drawn. More reliable predictions require statistical inference.

To estimate an individual response y for a given value of x, we use a prediction interval. If we randomly sampled many times, we would obtain many different values of y for a particular x, with deviations following N(0, σ) around the mean response μy.

The level C prediction interval for a single observation on y when x takes the value x* is:

ŷ ± t* SEŷ (t* from the t distribution with n − 2 df)

The prediction interval represents mainly the error from the normal distribution of the residuals εi.

Graphically, the series of prediction intervals for the whole range of x values is shown as a continuous band on either side of ŷ.

[Figure: 95% prediction interval for y around the least-squares line; t* for the t distribution with n − 2 df]
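As a sketch of the prediction interval calculation, the code below uses the bonus data from the example in this deck at x* = 5, with the standard formula SEŷ = s·√(1 + 1/n + (x* − x̄)²/Σ(x − x̄)²). The critical value 2.776 (df = 4) is an assumed table value.

```python
import math

# Bonus example data from this deck.
years = [1, 2, 3, 4, 5, 6]
bonus = [6, 1, 9, 5, 17, 12]
n = len(years)

x_bar = sum(years) / n
y_bar = sum(bonus) / n
sxx = sum((x - x_bar) ** 2 for x in years)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, bonus)) / sxx
a = y_bar - b * x_bar
s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(years, bonus)) / (n - 2))

x_star = 5      # value of x at which we predict
T_STAR = 2.776  # assumed table value: 95% critical t for n - 2 = 4 df

fit = a + b * x_star
# Standard error for predicting a single new response at x_star.
se_pred = s * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / sxx)
pi_lower = fit - T_STAR * se_pred
pi_upper = fit + T_STAR * se_pred
print(f"fit = {fit:.2f}, 95% PI: ({pi_lower:.2f}, {pi_upper:.2f})")
```

The result should agree with the "95% PI" column of the MINITAB output for the new observation at Years = 5.00.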

Confidence interval for µy

We may also want to predict the population mean value of y, μy, for any value of x within the range of data tested.

Using inference, we calculate a level C confidence interval for the population mean μy of all responses y when x takes the value x*. This interval is centered on ŷ, the unbiased estimate of μy.

The true value of the population mean μy at a given value of x will indeed be within our confidence interval in C% of all intervals calculated from many different random samples.

The level C confidence interval for the mean response μy at a given value x* of x is centered on ŷ (unbiased estimate of μy):

ŷ ± t* SEμ̂ (t* from the t distribution with n − 2 df)

A separate confidence interval is calculated for μy at each value that x takes.

Graphically, the series of confidence intervals for the whole range of x values is shown as a continuous band on either side of ŷ.

[Figure: 95% confidence interval for μy around the least-squares line; t* for the t distribution with n − 2 df]
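A companion sketch for the mean response, again using the bonus data from the example in this deck at x* = 5. It uses the standard formula SEμ̂ = s·√(1/n + (x* − x̄)²/Σ(x − x̄)²), and t* = 2.776 (df = 4) is an assumed table value.

```python
import math

# Bonus example data from this deck.
years = [1, 2, 3, 4, 5, 6]
bonus = [6, 1, 9, 5, 17, 12]
n = len(years)

x_bar = sum(years) / n
y_bar = sum(bonus) / n
sxx = sum((x - x_bar) ** 2 for x in years)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, bonus)) / sxx
a = y_bar - b * x_bar
s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(years, bonus)) / (n - 2))

x_star = 5      # value of x at which we estimate the mean response
T_STAR = 2.776  # assumed table value: 95% critical t for n - 2 = 4 df

fit = a + b * x_star
# Standard error of the estimated mean response at x_star.
se_fit = s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / sxx)
ci_lower = fit - T_STAR * se_fit
ci_upper = fit + T_STAR * se_fit
print(f"SE fit = {se_fit:.2f}, 95% CI for the mean: ({ci_lower:.2f}, {ci_upper:.2f})")
```

The standard error grows as x* moves away from x̄, which is why the confidence band flares out toward the ends of the x range.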

The confidence interval for μy contains, with C% confidence, the population mean μy of all responses at a particular value of x.

The prediction interval contains, with C% confidence, an individual value taken by y at a particular value of x.

[Figure legend: least-squares regression line; 95% prediction interval for y; 95% confidence interval for μy]

Estimating μy gives a narrower interval than estimating an individual response in the population because the sampling distribution of the estimated mean is narrower than the population distribution of individual responses.
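One way to see the difference in width is to compare the two standard errors directly. The formulas below are the standard ones, the same ones applied in the worked example of this deck; the subscript notation is an assumption, not something the slides fix.

```latex
SE_{\hat{\mu}} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}},
\qquad
SE_{\hat{y}} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}
\quad\Longrightarrow\quad
SE_{\hat{y}}^{2} = s^{2} + SE_{\hat{\mu}}^{2}
```

The extra s² term is the variability of an individual response around the mean, so at the same x* the prediction interval is always wider than the confidence interval for μy.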

Checking the conditions for inference (residual plots)

Residuals are randomly scattered → good!

Curved pattern → the relationship is not linear.

Change in variability across the plot → σ is not equal for all values of x.
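As a small sketch of the first step in such a check, the code below computes the residuals that would go on the vertical axis of a residual plot, using the bonus data from the example in this deck. With only six points a plot is not very informative, so this is an illustration of the calculation rather than a real diagnostic.

```python
# Bonus example data from this deck.
years = [1, 2, 3, 4, 5, 6]
bonus = [6, 1, 9, 5, 17, 12]
n = len(years)

x_bar = sum(years) / n
y_bar = sum(bonus) / n
sxx = sum((x - x_bar) ** 2 for x in years)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, bonus)) / sxx
a = y_bar - b * x_bar

# Residual = observed y minus fitted y-hat, one per data point.
residuals = [y - (a + b * x) for x, y in zip(years, bonus)]

for x, e in zip(years, residuals):
    print(f"x = {x}: residual = {e:+.2f}")

# Least-squares residuals always sum to zero (up to floating-point error).
print(f"sum of residuals = {sum(residuals):.10f}")
```

Plotting these residuals against x and looking for curvature or a fan shape corresponds to the checks listed above.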

Example

The annual bonuses (in $1000) of six randomly selected employees and their years of service were recorded. We wish to analyze the relationship between the two variables. The data were analyzed using MINITAB; the output is shown below.

Years (X)   1   2   3   4   5   6
Bonus (Y)   6   1   9   5  17  12

Predictor   Coef    SE Coef   T      P
Constant    0.933   4.192     0.22   0.835
Years       2.114   1.076     1.96   0.121

S = 4.50291 R-Sq = 49.1% R-Sq(adj) = 36.4%

Predicted Values for New Observations

New Obs   Fit     SE Fit   95% CI           95% PI
1         11.50   2.45     (4.71, 18.30)    (-2.72, 25.73)

Values of Predictors for New Observations

New Obs   Years
1         5.00

a. What is the equation of the least-squares regression line?

ŷ = 0.933 + 2.114x

b. Calculate the 95% confidence interval for the true slope coefficient.

b ± t* SEb = 2.114 ± 2.776 × 1.076 = [−0.873, 5.10] (t* = 2.776 for df = n − 2 = 4)

c. Based on the above output, at the .05 level of significance, test whether the slope β is significantly different from zero.

H0: β = 0

Ha: β ≠ 0

t = b/SEb = 1.96, P-value = 0.121 > 0.05

The test is not significant; we fail to reject the null hypothesis.

d. What is the predicted annual bonus of an employee with 5 years of service?

ŷ = 0.933 + 2.114(5) = 11.50

e. What is the value of the residual for the data value (5, 17)?

residual = y − ŷ = 17 − 11.50 = 5.50

f. Construct a 95% prediction interval for the bonus of a single employee with 5 years of service.

$\hat{y} \pm t^* s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = [-2.72,\ 25.73]$

g. Construct a 95% confidence interval for the mean bonus when years of service is 5.

$\hat{y} \pm t^* s \sqrt{\dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = [4.71,\ 18.30]$
