Chapter 11
![Page 1: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/1.jpg)
Chapter 11
Simple Linear Regression and Correlation
![Page 2: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/2.jpg)
Learning Objectives
• Use simple linear regression to build empirical models
• Estimate the parameters in a linear regression model
• Determine whether the regression model is an adequate fit to the data
• Test statistical hypotheses and construct confidence intervals
• Predict a future observation
• Use simple transformations to achieve a linear regression model
• Understand correlation
![Page 3: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/3.jpg)
Regression Analysis
• Relationships between two or more variables
• Useful for these types of problems
• Predict a new observation
• Sometimes a regression model will arise from a theoretical relationship
• At other times there is no theoretical knowledge
• Choice of the model is based on inspection of a scatter diagram
• Empirical model
![Page 4: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/4.jpg)
Regression Model
• Mean of the random variable Y is related to x:
$$E[Y \mid x] = \mu_{Y|x} = \beta_0 + \beta_1 x$$
• $\beta_0$ and $\beta_1$ are called regression coefficients
• Appropriate way to generalize this to a probabilistic model
• Assume that the expected value of Y is a linear function of x
• Actual value of Y is determined by the mean value function plus a random error term:
$$Y = \beta_0 + \beta_1 x + \epsilon$$
• where $\epsilon$ is called the random error term
![Page 5: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/5.jpg)
$\beta_1$ and $\epsilon$
• Suppose that the mean and variance of $\epsilon$ are 0 and $\sigma^2$
• Slope, $\beta_1$, can be interpreted as the change in the mean of Y for a unit change in x
• Height of the line at any value of x is just the expected value of Y for that x
• Variability of Y at a particular value of x is determined by the error variance $\sigma^2$
• Implies that there is a distribution of Y-values at each x
![Page 6: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/6.jpg)
Graph of the Variability
• Distribution of Y for any given value of x
• Values of x are fixed, and Y is a random variable with the following mean and variance
Mean: $\mu_{Y|x} = \beta_0 + \beta_1 x$
Variance: $\sigma^2$
[Figure: true regression line, Y versus x]
![Page 7: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/7.jpg)
Simple Linear Regression
• Values of the intercept, slope and the error variance will not be known
• Must be estimated from sample data
• Fitted model is used in prediction of future observations of Y at a particular level of x
![Page 8: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/8.jpg)
Method of Least Squares
• True relationship between Y and x is a straight line
• Assume n pairs of observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
• Estimates of $\beta_0$ and $\beta_1$ result in a line that is a “best fit” to the data
• Called the method of least squares
![Page 9: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/9.jpg)
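Least Squares Method
• Assuming the n observations in the sample satisfy $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
• Sum of the squares of the deviations of the observations from the true regression line
$$L = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
• Taking the partial derivatives and setting them to zero at the estimates
$$\left.\frac{\partial L}{\partial \beta_0}\right|_{\hat\beta_0,\hat\beta_1} = -2\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$$
$$\left.\frac{\partial L}{\partial \beta_1}\right|_{\hat\beta_0,\hat\beta_1} = -2\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)\,x_i = 0$$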
![Page 10: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/10.jpg)
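Least Squares Method-Cont.
• Simplifying
$$n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
$$\hat\beta_0 \sum_{i=1}^{n} x_i + \hat\beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i$$
• Results are
$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$$
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} y_i x_i - \dfrac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}$$
• Fitted or estimated regression line is
$$\hat{y} = \hat\beta_0 + \hat\beta_1 x$$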
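The closed-form results above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `fit_simple_linear_regression` and the four data points are hypothetical, chosen to lie exactly on the line y = 1 + 2x so the expected estimates are obvious.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1): least squares intercept and slope for y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x = sum(x)
    sum_y = sum(y)
    # Denominator and numerator of the slope, via the computing formulas
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b1 = s_xy / s_xx
    b0 = sum_y / n - b1 * sum_x / n  # b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data lying exactly on y = 1 + 2x
b0, b1 = fit_simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)
```

With noise-free data the fitted line reproduces the generating line; with real data the same function returns the line that minimizes the sum of squared deviations.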
![Page 11: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/11.jpg)
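Using Special Symbols
• Convenient to use special symbols
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}$$
• Numerator
$$S_{xy} = \sum_{i=1}^{n} y_i (x_i - \bar{x}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$
• Denominator
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$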
![Page 12: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/12.jpg)
Residual Error
• Describes the error in the fit of the model to the ith observation $y_i$
• Each pair of observations satisfies
$$y_i = \hat\beta_0 + \hat\beta_1 x_i + e_i$$
• Denoted by $e_i$
$$e_i = y_i - \hat{y}_i$$
![Page 13: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/13.jpg)
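Estimating $\sigma^2$
• Another unknown parameter, $\sigma^2$, the variance of the error term
• Residuals $e_i$ are used to obtain an estimate of $\sigma^2$
• Sum of squares of the residuals, often called the error sum of squares
$$SS_E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
• Estimate of the error variance
$$\hat\sigma^2 = \frac{SS_E}{n-2}$$
• A more convenient computing formula
$$SS_E = SS_T - \hat\beta_1 S_{xy}$$
• $SS_T$ is the total sum of squares
$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2$$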
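As a rough check on the two formulas for the error sum of squares, the sketch below computes it both from the residuals and from the computing formula, then divides by n - 2. The data values and variable names are hypothetical and serve only to illustrate the arithmetic.

```python
# Hypothetical data, used only to illustrate the formulas
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Definition: SSE is the sum of squared residuals e_i = y_i - y_hat_i
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
ss_e = sum(e * e for e in residuals)

# Computing formula: SSE = SST - b1 * Sxy
ss_t = sum((yi - y_bar) ** 2 for yi in y)
ss_e_shortcut = ss_t - b1 * s_xy

# Estimate of the error variance uses n - 2 degrees of freedom
sigma2_hat = ss_e / (n - 2)
print(ss_e, ss_e_shortcut, sigma2_hat)
```

Both routes give the same error sum of squares up to floating-point rounding, which is why the computing formula is convenient when only summary quantities are available.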
![Page 14: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/14.jpg)
Example
• Regression methods were used to analyze the data from a study investigating the relationship between roadway surface temperature (x) and pavement deflection (y).
• Summary quantities were as follows
$$n = 20,\quad \sum y_i = 12.75,\quad \sum y_i^2 = 8.86,\quad \sum x_i = 1478,$$
$$\sum x_i^2 = 143{,}215.8,\quad \text{and}\quad \sum x_i y_i = 1083.67$$
![Page 15: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/15.jpg)
Questions
(a) Calculate the least squares estimates of the slope and intercept. Graph the regression line.
(b) Use the equation of the fitted line to predict what pavement deflection would be observed when the surface temperature is 85°F.
(c) What is the mean pavement deflection when the surface temperature is 90°F?
(d) What change in mean pavement deflection would be expected for a 1°F change in surface temperature?
![Page 16: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/16.jpg)
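Solution
• Need to have
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 143215.8 - \frac{1478^2}{20} = 33991.6$$
$$S_{xy} = 1083.67 - \frac{(1478)(12.75)}{20} = 141.445$$
• Hence, the slope and intercept
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{141.445}{33991.6} = 0.0041612$$
$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = \frac{12.75}{20} - (0.0041612)\left(\frac{1478}{20}\right) = 0.3299892$$
• Regression line
$$\hat{y} = 0.3299892 + 0.0041612\,x$$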
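The arithmetic in this solution can be reproduced directly from the summary quantities given in the example. The following is a small sketch; the variable names are illustrative only and not from the slides.

```python
# Summary quantities from the pavement deflection example
n = 20
sum_y, sum_y2 = 12.75, 8.86
sum_x, sum_x2 = 1478.0, 143215.8
sum_xy = 1083.67

s_xx = sum_x2 - sum_x ** 2 / n      # corrected sum of squares of x
s_xy = sum_xy - sum_x * sum_y / n   # corrected sum of cross products

b1 = s_xy / s_xx                    # slope estimate
b0 = sum_y / n - b1 * sum_x / n     # intercept estimate: y_bar - b1 * x_bar

print(f"Sxx = {s_xx:.1f}, Sxy = {s_xy:.3f}")
print(f"slope = {b1:.7f}, intercept = {b0:.7f}")
```

The printed values agree with the slide to the number of digits shown there.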
![Page 17: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/17.jpg)
Solution-Cont.
• Graph of the regression line
[Figure: fitted regression line, y versus x]
![Page 18: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/18.jpg)
Solution-Cont.
• Pavement deflection at 85°F
$$\hat{y} = 0.3299892 + 0.0041612(85) = 0.6836$$
• Mean pavement deflection at 90°F
$$\hat{y} = 0.3299892 + 0.0041612(90) = 0.7045$$
• Change in mean pavement deflection for a 1°F change in temperature
$$\hat\beta_1 = 0.00416$$
![Page 19: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/19.jpg)
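Properties of the Least Squares Estimators
• Assumed that the error term in the model is a random variable
• Estimators will be viewed as random variables
• Properties of the slope
$$E(\hat\beta_1) = \beta_1, \qquad V(\hat\beta_1) = \frac{\sigma^2}{S_{xx}}$$
• Properties of the intercept
$$E(\hat\beta_0) = \beta_0, \qquad V(\hat\beta_0) = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]$$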
![Page 20: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/20.jpg)
Analysis of Variance Approach
• Used to test for significance of regression
• Partitions the total variability in the response variable into two components
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
• First term on the right is called the regression sum of squares, SSR
• Second term on the right is called the error sum of squares, SSE
• Symbolically
SST=SSR+SSE
• SST is the total corrected sum of squares
![Page 21: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/21.jpg)
Analysis of Variance
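• SST, SSR, and SSE have n-1, 1, and n-2 degrees of freedom (d.o.f.), respectively
• $SS_R = \hat\beta_1 S_{xy}$ and $SS_E = SS_T - \hat\beta_1 S_{xy}$
• Divide each sum of squares by its d.o.f.: $MS_R = SS_R/1$ and $MS_E = SS_E/(n-2)$
• Then, when $\beta_1 = 0$, $F = MS_R/MS_E$ follows the $F_{1,n-2}$ distribution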
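The partition of the total sum of squares and the resulting F ratio can be sketched as follows. The helper name `anova_for_regression` and the data are hypothetical; the sums of squares are computed from their definitions so that the identity SST = SSR + SSE can be checked numerically.

```python
def anova_for_regression(x, y):
    """Return the ANOVA quantities for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]

    ss_t = sum((yi - y_bar) ** 2 for yi in y)               # total, n - 1 d.o.f.
    ss_r = sum((fi - y_bar) ** 2 for fi in fitted)          # regression, 1 d.o.f.
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) # error, n - 2 d.o.f.

    ms_r = ss_r / 1
    ms_e = ss_e / (n - 2)
    if ms_e == 0:
        raise ValueError("perfect fit: MSE is zero, so F is undefined")
    return {"SST": ss_t, "SSR": ss_r, "SSE": ss_e,
            "MSR": ms_r, "MSE": ms_e, "F": ms_r / ms_e}


# Hypothetical data, used only to illustrate the partition
table = anova_for_regression([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 8.0])
print(table)
```

A large F value relative to the tabulated F critical value with 1 and n - 2 degrees of freedom indicates that the slope differs from zero.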
![Page 22: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/22.jpg)
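Hypothesis Tests for Slope
• Used to assess the adequacy of a linear regression model
• Appropriate hypotheses for the slope are
$H_0: \beta_1 = \beta_{1,0}$
$H_1: \beta_1 \neq \beta_{1,0}$
• Test statistic (for the special case $\beta_{1,0} = 0$, i.e. significance of regression)
$$F_0 = \frac{SS_R/1}{SS_E/(n-2)} = \frac{MS_R}{MS_E}$$
• Follows the $F_{1,n-2}$ distribution when $H_0$ is true
• Reject $H_0$ if $f_0 > f_{\alpha,1,n-2}$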
![Page 23: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/23.jpg)
Analysis of Variance for Testing Significance of Regression
![Page 24: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/24.jpg)
Example
• Consider the data from the previous example on x = roadway surface temperature and y = pavement deflection.
• (a) Test for significance of regression using α = 0.05. What conclusions can you draw?
• (b) Estimate the standard errors of the slope and intercept.
![Page 25: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/25.jpg)
Solution
• Use the steps in hypothesis testing
1) Parameter of interest is the slope of the regression line, $\beta_1$
2) $H_0: \beta_1 = 0$
3) $H_1: \beta_1 \neq 0$
4) α = 0.05
5) The test statistic is
$$f_0 = \frac{MS_R}{MS_E} = \frac{SS_R/1}{SS_E/(n-2)}$$
6) Reject $H_0$ if $f_0 > f_{\alpha,1,18}$, where $f_{0.05,1,18} = 4.416$
![Page 26: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/26.jpg)
Solution
7) Using the results from the previous example
$$SS_R = \hat\beta_1 S_{xy} = (0.0041612)(141.445) = 0.5886$$
$$SS_E = S_{yy} - SS_R = \left(8.86 - \frac{12.75^2}{20}\right) - 0.5886 = 0.143275$$
• Hence, the test statistic
$$f_0 = \frac{0.5886}{0.143275/18} = 73.95$$
8) Since 73.95 > 4.416, reject $H_0$ and conclude the model specifies a useful relationship at α = 0.05
• Standard error
$$\hat\sigma^2 = MS_E = \frac{SS_E}{n-2} = \frac{0.143275}{18} = 0.00796$$
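The numbers in this solution, together with the standard errors asked for in part (b), can be reproduced with a short script. This is a sketch under the assumption that the standard errors are the square roots of the estimator variances with the error variance replaced by its estimate; variable names are illustrative. Because the script carries full precision instead of the rounded intermediate values on the slide, the last digits can differ slightly.

```python
import math

# Summary quantities from the pavement deflection example
n = 20
sum_y, sum_y2 = 12.75, 8.86
sum_x, sum_x2 = 1478.0, 143215.8
sum_xy = 1083.67

x_bar = sum_x / n
s_xx = sum_x2 - sum_x ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n
b1 = s_xy / s_xx

ss_t = sum_y2 - sum_y ** 2 / n   # total corrected sum of squares (Syy)
ss_r = b1 * s_xy                 # regression sum of squares
ss_e = ss_t - ss_r               # error sum of squares
ms_e = ss_e / (n - 2)            # estimate of the error variance
f0 = (ss_r / 1) / ms_e           # test statistic for significance of regression

# Estimated standard errors of the slope and intercept
se_b1 = math.sqrt(ms_e / s_xx)
se_b0 = math.sqrt(ms_e * (1 / n + x_bar ** 2 / s_xx))

print(f"SSR = {ss_r:.4f}, SSE = {ss_e:.6f}, f0 = {f0:.2f}")
print(f"sigma^2 estimate = {ms_e:.5f}")
print(f"se(slope) = {se_b1:.6f}, se(intercept) = {se_b0:.5f}")
```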
![Page 27: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/27.jpg)
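Confidence Intervals on the Slope and Intercept
• Interested in obtaining C.I. estimates of the parameters
• Width of these C.I. is a measure of the overall quality of the regression line
• 100(1-α)% C.I. on the slope β1
$$\hat\beta_1 - t_{\alpha/2,n-2}\sqrt{\frac{\hat\sigma^2}{S_{xx}}} \le \beta_1 \le \hat\beta_1 + t_{\alpha/2,n-2}\sqrt{\frac{\hat\sigma^2}{S_{xx}}}$$
• 100(1-α)% C.I. on the intercept β0
$$\hat\beta_0 - t_{\alpha/2,n-2}\sqrt{\hat\sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]} \le \beta_0 \le \hat\beta_0 + t_{\alpha/2,n-2}\sqrt{\hat\sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}$$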
![Page 28: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/28.jpg)
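Confidence Interval on the Mean Response
• Constructed on the mean response at a specified value of x, say, $x_0$
• Called a C.I. about the regression line
• C.I. about the mean response at the value of $x = x_0$
$$\hat\mu_{Y|x_0} - t_{\alpha/2,n-2}\sqrt{\hat\sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]} \le \mu_{Y|x_0} \le \hat\mu_{Y|x_0} + t_{\alpha/2,n-2}\sqrt{\hat\sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}$$
• Applies only to the interval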
![Page 29: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/29.jpg)
Prediction of New Observations
• An important application of a regression model
• New observation is independent of the observations used to develop the regression model
• C.I. for $\mu_{Y|x}$ is inappropriate
• Prediction interval on a future observation $Y_0$ at the value $x_0$
$$\hat{y}_0 - t_{\alpha/2,n-2}\sqrt{\hat\sigma^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]} \le Y_0 \le \hat{y}_0 + t_{\alpha/2,n-2}\sqrt{\hat\sigma^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}$$
• Always wider than the C.I. at $x_0$
• Depends on both the error from the fitted model and the error associated with future observations
![Page 30: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/30.jpg)
Example
• The first example presented data on roadway surface temperature x and pavement deflection y
• Find a 99% confidence interval on each of the following:
• (a) Slope
• (b) Intercept
• (c) Mean deflection when temperature x = 85°F
• (d) Find a 99% prediction interval on pavement deflection when the temperature is 90°F.
![Page 31: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/31.jpg)
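Solution
a) Confidence interval on the slope
• Critical value $t_{\alpha/2,n-2} = t_{0.005,18} = 2.878$
• Hence
$$\hat\beta_1 - t_{\alpha/2,n-2}\sqrt{\frac{\hat\sigma^2}{S_{xx}}} \le \beta_1 \le \hat\beta_1 + t_{\alpha/2,n-2}\sqrt{\frac{\hat\sigma^2}{S_{xx}}}$$
$$0.0041612 \pm (2.878)(0.000484)$$
$$0.0027682 \le \beta_1 \le 0.0055542$$
b) Confidence interval on the intercept
$$\hat\beta_0 - t_{\alpha/2,n-2}\sqrt{\hat\sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]} \le \beta_0 \le \hat\beta_0 + t_{\alpha/2,n-2}\sqrt{\hat\sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}$$
$$0.3299892 \pm (2.878)(0.04095)$$
$$0.2121351 \le \beta_0 \le 0.4478433$$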
![Page 32: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/32.jpg)
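Solution
c) 99% confidence interval on $\mu_{Y|x_0}$ when $x_0 = 85$°F
$$\hat\mu_{Y|x_0} = 0.683689$$
$$\hat\mu_{Y|x_0} \pm t_{0.005,18}\sqrt{\hat\sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}$$
$$0.683689 \pm (2.878)\sqrt{0.00796\left(\frac{1}{20} + \frac{(85 - 73.9)^2}{33991.6}\right)}$$
$$0.683689 \pm 0.0594607$$
$$0.6242283 \le \mu_{Y|x_0} \le 0.7431497$$
d) 99% prediction interval when $x_0 = 90$°F
$$\hat{y}_0 = 0.7044949$$
$$\hat{y}_0 \pm t_{0.005,18}\sqrt{\hat\sigma^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}$$
$$0.7044949 \pm (2.878)\sqrt{0.00796\left(1 + \frac{1}{20} + \frac{(90 - 73.9)^2}{33991.6}\right)}$$
$$0.7044949 \pm 0.2640665$$
$$0.4404284 \le y_0 \le 0.9685614$$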
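The four 99% intervals in this solution can be checked with the sketch below. The critical value 2.878 is taken from the slide, not computed; the helper names (`interval`, `mean_response_ci`, `prediction_interval`) are hypothetical. Since the script does not round intermediate quantities, the end points can differ from the slide values in the last few digits.

```python
import math

# Summary quantities from the pavement deflection example
n = 20
sum_y, sum_y2 = 12.75, 8.86
sum_x, sum_x2 = 1478.0, 143215.8
sum_xy = 1083.67
t_crit = 2.878  # t_{0.005,18}, as quoted on the slide

x_bar = sum_x / n
y_bar = sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
sigma2 = ((sum_y2 - sum_y ** 2 / n) - b1 * s_xy) / (n - 2)


def interval(center, variance):
    """Return (lower, upper) = center -/+ t_crit * sqrt(variance)."""
    half_width = t_crit * math.sqrt(variance)
    return center - half_width, center + half_width


def mean_response_ci(x0):
    """C.I. on the mean response at x0."""
    return interval(b0 + b1 * x0,
                    sigma2 * (1 / n + (x0 - x_bar) ** 2 / s_xx))


def prediction_interval(x0):
    """Prediction interval on a single future observation at x0."""
    return interval(b0 + b1 * x0,
                    sigma2 * (1 + 1 / n + (x0 - x_bar) ** 2 / s_xx))


ci_slope = interval(b1, sigma2 / s_xx)
ci_intercept = interval(b0, sigma2 * (1 / n + x_bar ** 2 / s_xx))
ci_mean_85 = mean_response_ci(85.0)
pi_90 = prediction_interval(90.0)

print("slope:", ci_slope)
print("intercept:", ci_intercept)
print("mean response at 85:", ci_mean_85)
print("prediction at 90:", pi_90)
```

The only difference between the last two helpers is the extra 1 inside the variance term, which is what makes a prediction interval wider than the confidence interval at the same $x_0$.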
![Page 33: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/33.jpg)
Residual Analysis
• Helpful in checking the assumption that the errors are approximately normally distributed with constant variance
• Useful in determining whether additional terms in the model are required
• Construct normal probability plot of residuals
• Patterns of residual plots
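One way to prepare the points for a normal probability plot of the residuals is sketched below. It assumes the plotting positions (j - 0.5)/n, which is one common convention and not something stated on the slide; the data and the function name `normal_probability_points` are hypothetical.

```python
from statistics import NormalDist


def normal_probability_points(residuals):
    """Pair each ordered residual with a standard normal quantile.

    Assumes plotting positions (j - 0.5) / n for j = 1, ..., n.
    """
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    quantiles = [std_normal.inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]
    return list(zip(quantiles, ordered))


# Hypothetical data: fit a line, then compute residuals e_i = y_i - y_hat_i
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

points = normal_probability_points(residuals)
for z, e in points:
    print(f"normal quantile {z:+.4f}   residual {e:+.4f}")
```

If the errors are approximately normal, plotting the residuals against the quantiles should give points that fall roughly along a straight line.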
![Page 34: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/34.jpg)
Coefficient of Determination ($R^2$)
• Used to judge the adequacy of a regression model
• Coefficient of determination
$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}$$
• Often referred to as the proportion of variability in the data explained by the regression model
• SSR is that portion of SST that is explained by the use of the regression model
• SSE is that portion of SST that is not explained by the use of the regression model
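As an illustration, the coefficient of determination for the pavement deflection example can be computed from the summary quantities given earlier in the chapter. The slides do not report this value, so the script is a sketch of the formula, with illustrative variable names.

```python
# Summary quantities from the pavement deflection example
n = 20
sum_y, sum_y2 = 12.75, 8.86
sum_x, sum_x2 = 1478.0, 143215.8
sum_xy = 1083.67

s_xx = sum_x2 - sum_x ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n
b1 = s_xy / s_xx

ss_t = sum_y2 - sum_y ** 2 / n   # total corrected sum of squares
ss_r = b1 * s_xy                 # explained by the regression
ss_e = ss_t - ss_r               # left unexplained

r_squared = ss_r / ss_t
r_squared_alt = 1 - ss_e / ss_t  # equivalent form
print(f"R^2 = {r_squared:.4f}")
```

Under these numbers roughly 80% of the variability in deflection is accounted for by the straight-line model in temperature.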
![Page 35: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/35.jpg)
Transformation of Data Points
• Inappropriateness of straight-line regression model
• Scatter diagram
• Consider the exponential function $Y = \beta_0 e^{\beta_1 x}\epsilon$, which can be transformed to a straight line
• By a logarithmic transformation
$$\ln Y = \ln\beta_0 + \beta_1 x + \ln\epsilon$$
• Another intrinsically linear function is $Y = \beta_0 + \beta_1(1/x) + \epsilon$
• By using the reciprocal transformation $z = 1/x$
$$Y = \beta_0 + \beta_1 z + \epsilon$$
• Assumes the transformed error terms are normally distributed
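The logarithmic transformation can be sketched as follows: regress ln y on x with ordinary least squares, then exponentiate the intercept to recover β0. The function names and the noise-free data, generated from Y = 2e^{0.5x}, are hypothetical and chosen so the recovered parameters are easy to verify.

```python
import math


def fit_line(x, y):
    """Least squares intercept and slope for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


def fit_exponential(x, y):
    """Fit Y = beta0 * exp(beta1 * x) by regressing ln(y) on x."""
    if any(yi <= 0 for yi in y):
        raise ValueError("the log transformation requires y > 0")
    ln_beta0, beta1 = fit_line(x, [math.log(yi) for yi in y])
    return math.exp(ln_beta0), beta1


# Hypothetical noise-free data generated from Y = 2 * exp(0.5 * x)
x = [0.0, 1.0, 2.0, 3.0]
y = [2.0 * math.exp(0.5 * xi) for xi in x]
beta0, beta1 = fit_exponential(x, y)
print(beta0, beta1)
```

The reciprocal model works the same way: replace each x by z = 1/x and fit a straight line of y on z.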
![Page 36: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/36.jpg)
Correlation
• Assumed so far that x is a mathematical variable and that Y is a random variable
• Many applications involve situations in which both X and Y are random variables
• Suppose observations are jointly distributed random variables
• The correlation coefficient, denoted by ρ, measures the strength of linear association between two variables
• Shows how closely the points in a scatter diagram are spread around the regression line
![Page 37: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/37.jpg)
Hypothesis Tests
• Useful to test the hypotheses
$H_0: \rho = 0$
$H_1: \rho \neq 0$
• Appropriate test statistic
$$T_0 = \frac{R\sqrt{n-2}}{\sqrt{1 - R^2}}$$
• Follows the t distribution with n-2 degrees of freedom when $H_0$ is true
• Reject the null hypothesis if
$$|t_0| > t_{\alpha/2,\,n-2}$$
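The test statistic can be illustrated with the pavement deflection summary quantities from earlier in the chapter. This sketch assumes R is the usual sample correlation coefficient, $S_{xy}/\sqrt{S_{xx}\,SS_T}$; the slides do not work this test for the example, so the numbers below are illustrative. The script also computes the F statistic for significance of regression to show that the square of $t_0$ equals $f_0$.

```python
import math

# Summary quantities from the pavement deflection example
n = 20
sum_y, sum_y2 = 12.75, 8.86
sum_x, sum_x2 = 1478.0, 143215.8
sum_xy = 1083.67

s_xx = sum_x2 - sum_x ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n
ss_t = sum_y2 - sum_y ** 2 / n

# Sample correlation coefficient (assumed definition of R)
r = s_xy / math.sqrt(s_xx * ss_t)

# Test statistic for H0: rho = 0
t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# F statistic for significance of regression, for comparison
ss_r = (s_xy / s_xx) * s_xy
ss_e = ss_t - ss_r
f0 = ss_r / (ss_e / (n - 2))

print(f"r = {r:.4f}, t0 = {t0:.3f}, t0^2 = {t0 ** 2:.2f}, f0 = {f0:.2f}")
```

Because $t_0^2 = f_0$, the test of ρ = 0 leads to the same conclusion as the test for significance of regression.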
![Page 38: Chapter 11](https://reader036.fdocuments.us/reader036/viewer/2022062302/568167d7550346895ddd30ca/html5/thumbnails/38.jpg)
Next Agenda
• Chapter 13 deals with designing and conducting engineering experiments
• ANOVA in designing single factor experiments will be emphasized