Transcript of Chapter 11: Correlation (websites.uwlax.edu/storibio/math405_fall15/Chapter11_4slides.pdf)
Chapter 11: Linear Regression and Correlation
• Regression analysis is a statistical tool that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others.
• Some Examples:
  • Height and weight of people
  • Income and expenses of people
  • Production size and production time
  • Soil pH and the rate of growth of plants
Correlation
• An easy way to determine if two quantitative variables are linearly related is by looking at their scatterplot.
• Another way is to calculate the correlation coefficient, usually denoted by r.
Note: -1 ≤ r ≤ 1.
• The Linear Correlation measures the strength of the linear relationship between the explanatory variable (x) and the response variable (y). An estimate of this correlation parameter is provided by the Pearson sample correlation coefficient, r.
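To make the definition concrete, here is a minimal sketch (the vectors x and y are made up for illustration) that computes r from its sums-of-squares definition, r = SSxy / sqrt(SSxx * SSyy), and checks it against R's built-in cor():

# Pearson's r computed from its definition, then checked against cor()
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 7.8, 10.1)
SSxy = sum((x - mean(x)) * (y - mean(y)))
SSxx = sum((x - mean(x))^2)
SSyy = sum((y - mean(y))^2)
SSxy / sqrt(SSxx * SSyy)   # same value as...
cor(x, y)                  # ...the built-in function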
Example Scatterplots with Correlations
If X and Y are independent, then their correlation is 0.
Correlation
• Some Guidelines in Interpreting r.
Value of |r|          Strength of linear relationship
|r| ≥ 0.95            Very strong
0.85 ≤ |r| < 0.95     Strong
0.65 ≤ |r| < 0.85     Moderate to strong
0.45 ≤ |r| < 0.65     Moderate
0.25 ≤ |r| < 0.45     Weak
|r| < 0.25            Very weak/close to none
If the correlation between X and Y is 0, it doesn't mean they are independent. It only means that they are not linearly related, as the sketch below shows.
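A quick illustration of this point in R: y below is a deterministic function of x, so the two variables are clearly dependent, yet their linear correlation is essentially zero.

# y depends on x exactly, but the relationship is not linear
x = seq(-3, 3, by = 0.1)
y = x^2
cor(x, y)   # ~0: no linear relationship, yet x and y are not independent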
One complaint about the correlation is that interpreting its value can be subjective: some people are very happy with r ≈ 0.6, while others are not.
Note: Correlation does not necessarily imply causation!
Computing Correlation in R
data.health = read.csv("HealthExam.csv", header=T)
head(data.health)
  Gender Age Height Weight Waist Pulse SysBP DiasBP Cholesterol BodyMass  Leg Elbow Wrist  Arm
1      F  12   63.3  156.3  81.4    64   104     41          89     27.5 41.0   6.8   5.5 33.0
2      F  16   57.0  100.7  68.7    64   106     64           2     21.9 33.8   5.6   4.6 26.4
3      M  17   63.0  156.3  86.7    96   109     65          78     27.8 44.2   7.1   5.3 31.7

attach(data.health)
plot(Height, Weight, pch=19, main="Scatterplot")

cor(Height, Weight)   # 0.544563
cor(Waist, Weight)    # 0.9083268
plot(Waist, Weight, pch=19, main="Scatterplot")
Simple Linear Regression
• Model: Yi = β0 + β1xi + εi, where
  • Yi is the ith value of the response variable.
  • xi is the ith value of the explanatory variable.
  • εi's are uncorrelated with a mean of 0 and constant variance σε².
[Figure: scatterplot of observed points around the unknown true line Y = β0 + β1x; at x1, the vertical gap between the observed point and the expected point on the line is the random error ε1.]
• How do we determine the underlying linear relationship?
• Well, since the points follow this linear trend, why don't we look for a line that "best" fits the points?
• But what do we mean by "best" fit? We need a criterion to help us determine which of two competing candidate lines is better.
[Figure: the same scatterplot with four competing candidate lines, labeled L1 through L4.]
Method of Least Squares
• Model: Yi = β0 + β1xi + εi, where
  • Yi is the ith value of the response variable.
  • xi is the ith value of the explanatory variable.
  • εi's are uncorrelated with a mean of 0 and constant variance σε².
[Figure: scatterplot with a candidate line Y = β0 + β1x; for the observed point P1(x1, y1), the vertical distance from the point to the line is the residual e1.]
• Residual = (Observed y-value) − (Predicted y-value), i.e., e1 = y1 − ŷ1.
  Example: for the candidate line ŷ = 2 + 0.8x, each observed point contributes a residual e1, e2, ....
Method of Least Squares: Choose the line that minimizes the SSE (the sum of squared residuals, SSE = e1² + e2² + ... + en²) as the "best" line. This line is known as the Least-Squares Regression Line.
Question: But there are infinitely many candidate lines; how can we find the one that minimizes the SSE?
Answer: Since the SSE is a continuous function of 2 variables (the intercept and the slope), we can use methods from calculus to minimize the SSE. Setting both partial derivatives to zero and solving gives β̂1 = SSxy/SSxx and β̂0 = ȳ − β̂1x̄, as the numerical sketch below illustrates.
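As a sanity check on this idea, here is a small sketch (simulated data, not the health exam data) that minimizes the SSE numerically with optim() and compares the answer to lm()'s closed-form least-squares fit:

# Search over (b0, b1) for the pair that minimizes the SSE
set.seed(1)
x = runif(50, 0, 10)
y = 3 + 2 * x + rnorm(50, sd = 2)
sse = function(b) sum((y - b[1] - b[2] * x)^2)
optim(c(0, 0), sse)$par   # numerical search: approximately the same as...
coef(lm(y ~ x))           # ...the closed-form least-squares solution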
Obtaining the Regression Line in R
data.health = read.csv("HealthExam.csv", header=T)
head(data.health)
  Gender Age Height Weight Waist Pulse SysBP DiasBP Cholesterol BodyMass  Leg Elbow Wrist  Arm
1      F  12   63.3  156.3  81.4    64   104     41          89     27.5 41.0   6.8   5.5 33.0
2      F  16   57.0  100.7  68.7    64   106     64           2     21.9 33.8   5.6   4.6 26.4
3      M  17   63.0  156.3  86.7    96   109     65          78     27.8 44.2   7.1   5.3 31.7

attach(data.health)
plot(Waist, Weight, pch=19, main="Scatterplot")

result = lm(Weight ~ Waist)
coef(result)
(Intercept)       Waist
  -51.72790     2.39469

abline(a=-51.7279, b=2.39469, lwd=2, col="blue")

So, for the first person, her predicted weight is about 143.2 pounds:
Predicted.1 = -51.728 + 2.395*81.4   # 143.225 pounds

Since her actual weight is 156.3 pounds:
Residual.1 = 156.3 - 143.2           # 13.1 pounds
As waist size increases by 1 cm, predicted weight goes up by about 2.4 pounds.
What else do we get from the 'lm' function?

data.health = read.csv("HealthExam.csv", header=T)
attach(data.health)
result = lm(Weight ~ Waist)
attributes(result)
$names
"coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr" "df.residual" "xlevels" "call" "terms" "model"

result$fit[1]   # 143.1999
result$res[1]   # 13.10011

summary(result)
lm(formula = Weight ~ Waist)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.7279    11.1288   -4.648 1.34e-05 ***
Waist         2.3947     0.1249   19.180  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.68 on 78 degrees of freedom
Multiple R-squared: 0.8251,  Adjusted R-squared: 0.8228
F-statistic: 367.9 on 1 and 78 DF, p-value: < 2.2e-16
Coefficient of Determination (R2): This index measures the amount of variability in the dependent variable (y) that can be explained by the regression line. Hence, about 82.51% of the variability of weight can be explained by the regression line involving the waist size.
Testing H0: β1 = 0 vs. H1: β1 ≠ 0. Since the p-value is extremely small (< 0.05), we reject the null hypothesis and conclude that waist has a significant effect on weight.
Model Assumptions
Since the underlying (green) line is unknown to us, we can’t calculate the values of the error terms (εi). The best that we can do is study the residuals (ei).
• Model: Yi = β0 + β1xi + εi, where
  • εi's are uncorrelated with a mean of 0 and constant variance σε².
  • εi's are normally distributed. (This is needed in the test for the slope.)
[Figure: at x1, the observed point, the predicted point on the fitted line, and the expected point on the true line Y = β0 + β1x; the residual e1 and the error ε1 are the corresponding vertical gaps.]
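Since the errors are only observable when we generate the data ourselves, a short simulation sketch (all numbers made up) makes the distinction concrete:

# When we simulate from the model, the true line is known, so we can
# compare the unobservable errors (eps) with the residuals (e) from the fit
set.seed(2)
x   = runif(40, 0, 10)
eps = rnorm(40, mean = 0, sd = 2)   # the eps_i: uncorrelated, mean 0, constant variance
y   = 3 + 1.5 * x + eps             # Y_i = beta0 + beta1*x_i + eps_i
fit = lm(y ~ x)
head(cbind(error = eps, residual = fit$residuals))   # close, but not equal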
Estimating the Variance of the Error Terms
• The unbiased estimator for σε² is MSE = SSE/(n − 2) = Σ(yi − ŷi)²/(n − 2).
sse = sum(result$residuals^2)   # 16811.16
mse = sse/(80-2)                # 215.5277
sigma.hat = sqrt(mse)           # 14.68086

anova(result)
Response: Weight
          Df Sum Sq Mean Sq F value    Pr(>F)
Waist      1  79284   79284  367.86 < 2.2e-16 ***
Residuals 78  16811     216
Total     79  96095

summary(result)
lm(formula = Weight ~ Waist)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.7279    11.1288   -4.648 1.34e-05 ***
Waist         2.3947     0.1249   19.180  < 2e-16 ***

Residual standard error: 14.68 on 78 degrees of freedom
Multiple R-squared: 0.8251,  Adjusted R-squared: 0.8228
F-statistic: 367.9 on 1 and 78 DF, p-value: < 2.2e-16
[Figure: scatterplot with the fitted line Y = β0 + β1x and the horizontal line ȳ; for a point Pi(xi, yi), the total deviation yi − ȳ splits into a residual part and a regression part.]
SSTO = SSE + SSR, and R2 = SSR/SSTO. Since the p-value is less than 0.05, we conclude that the regression model accounts for a significant amount of the variability in weight.
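As a check, the decomposition can be verified numerically; this sketch assumes result and Weight from the earlier slides are still in the workspace:

# Verify SSTO = SSE + SSR and R2 = SSR/SSTO from the fitted model
SSE  = sum(result$residuals^2)            # 16811.16
SSTO = sum((Weight - mean(Weight))^2)     # 96095.4
SSR  = SSTO - SSE                         # 79284.24
SSR / SSTO                                # 0.8251 = Multiple R-squared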
Things that affect the slope estimate
• Watch the regression podcast by Dr. Will posted on our course webpage.
• Three things that affect the slope estimate:
  1. Sample size (n).
  2. Variability of the error terms (σε²).
  3. Spread of the independent variable.
summary(result)
lm(formula = Weight ~ Waist)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.7279    11.1288   -4.648 1.34e-05 ***
Waist         2.3947     0.1249   19.180  < 2e-16 ***
SS = function(x, y){ sum((x - mean(x)) * (y - mean(y))) }
SSxy = SS(Waist, Weight)    # 33108.35
SSxx = SS(Waist, Waist)     # 13825.73
SSyy = SS(Weight, Weight)   # 96095.4 = SSTO
Beta1.hat = SSxy/SSxx       # 2.39469

MSE = anova(result)$Mean[2]    # 215.5277
SE.beta1 = sqrt(MSE/SSxx)      # 0.1248554

t.obs = (Beta1.hat - 0)/SE.beta1     # 19.17971
p.value = 2*(1 - pt(19.18, df=78))   # virtually 0
Testing H0: β1 = 0 vs. H1: β1 ≠ 0.
As n increases, the standard error of the slope estimate decreases. The smaller σε is, the smaller the standard error. Likewise, the more spread out the x values are (the larger SSxx), the smaller the standard error, since SE(β̂1) = sqrt(MSE/SSxx). The simulation sketch below illustrates the first and third of these effects.
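A small simulation sketch of those two effects; se.slope is a hypothetical helper written just for this illustration:

# SE of the slope estimate shrinks with larger n and with wider x-spread
se.slope = function(n, spread) {
  x = runif(n, 0, spread)
  y = 1 + 2 * x + rnorm(n, sd = 5)
  summary(lm(y ~ x))$coefficients["x", "Std. Error"]
}
set.seed(3)
se.slope(n = 20,  spread = 10)   # baseline
se.slope(n = 200, spread = 10)   # larger n  -> smaller SE
se.slope(n = 20,  spread = 50)   # wider x   -> smaller SE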
Effect of Outliers on the Slope Estimate
Three types of outliers:
1. Outlier in the x direction. This type of outlier is said to be a high leverage point.
2. Outlier in the y direction.
3. Outlier in both the x and y directions. This point is said to be a high influence point.
[Figures: the effect of a high influence point; the effect of a point with an outlying y value.]
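A minimal sketch of the influence idea (all data simulated): a single point that is outlying in both x and y can drag the fitted slope substantially.

# One high influence point pulls the slope away from the true value of 2
set.seed(4)
x = runif(30, 0, 10)
y = 1 + 2 * x + rnorm(30)
coef(lm(y ~ x))                      # slope near 2
x.out = c(x, 30); y.out = c(y, 5)    # one added point, outlying in x and y
coef(lm(y.out ~ x.out))              # slope pulled well below 2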
Confidence Intervals
• The (1 − α)100% C.I. for β1 is β̂1 ± t(α/2; n−2) · SE(β̂1). Hence, the 90% C.I. for β1 in our example is:
Lower = Beta1.hat - qt(0.95, df=78)*SE.beta1   # 2.186853
Upper = Beta1.hat + qt(0.95, df=78)*SE.beta1   # 2.602528
confint(result, level=.90)
                   5 %       95 %
(Intercept) -70.253184 -33.202619
Waist         2.186853   2.602528

• Estimating the mean response (µy) at a specified value of x:
predict(result, newdata=data.frame(Waist=c(80,90)))
       1        2
139.8473 163.7942

• Confidence interval for the mean response (µy) at a specified value of x:
predict(result, newdata=data.frame(Waist=c(80,90)), interval="confidence")
       fit      lwr      upr
1 139.8473 136.0014 143.6932
2 163.7942 160.4946 167.0938
Prediction Intervals
• Predicting the value of the response variable at a specified value of x:
predict(result, newdata=data.frame(Waist=c(80,90)))
       1        2
139.8473 163.7942

• Prediction interval for a new response value (yn+1) at a specified value of x:
predict(result, newdata=data.frame(Waist=c(80,90)), interval="prediction")
       fit      lwr      upr
1 139.8473 110.3680 169.3266
2 163.7942 134.3812 193.2072

predict(result, newdata=data.frame(Waist=c(80,90)), interval="prediction", level=.99)
       fit      lwr      upr
1 139.8473 100.7507 178.9439
2 163.7942 124.7855 202.8029
Note that the only difference between the prediction interval and the confidence interval for the mean response is the addition of 1 inside the square root: the C.I. uses ŷ0 ± t(α/2; n−2) · sqrt(MSE·(1/n + (x0 − x̄)²/SSxx)), while the P.I. uses ŷ0 ± t(α/2; n−2) · sqrt(MSE·(1 + 1/n + (x0 − x̄)²/SSxx)). This makes the prediction intervals wider than the confidence intervals for the mean response.
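To see where the "1" enters, this sketch (assuming the health-exam objects from the earlier slides are still attached) reproduces predict()'s two intervals at Waist = 80 by hand:

# Manual confidence and prediction intervals at x0 = 80
x0   = 80
n    = 80
MSE  = anova(result)$Mean[2]
SSxx = sum((Waist - mean(Waist))^2)
y0   = sum(coef(result) * c(1, x0))   # 139.8473
t.crit  = qt(0.975, df = 78)
se.mean = sqrt(MSE * (    1/n + (x0 - mean(Waist))^2 / SSxx))
se.pred = sqrt(MSE * (1 + 1/n + (x0 - mean(Waist))^2 / SSxx))
y0 + c(-1, 1) * t.crit * se.mean   # matches the confidence interval
y0 + c(-1, 1) * t.crit * se.pred   # matches the (wider) prediction interval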
Confidence and Prediction Bands
• Working-Hotelling (1 − α)100% confidence band:

result = lm(Weight ~ Waist)
CI = predict(result, se.fit=TRUE)   # se.fit = SE(mean)
W = sqrt(2*qf(0.95, 2, 78))         # 2.495513
band.lower = CI$fit - W*CI$se.fit
band.upper = CI$fit + W*CI$se.fit

plot(Waist, Weight, xlab="Waist", ylab="Weight", main="Confidence Band")
abline(result)
points(sort(Waist), sort(band.lower), type="l", lwd=2, lty=2, col="Blue")
points(sort(Waist), sort(band.upper), type="l", lwd=2, lty=2, col="Blue")

• The (1 − α)100% Prediction Band:

mse = anova(result)$Mean[2]
se.pred = sqrt(CI$se.fit^2 + mse)
band.lower.pred = CI$fit - W*se.pred
band.upper.pred = CI$fit + W*se.pred

points(sort(Waist), sort(band.lower.pred), type="l", lwd=2, lty=2, col="Red")
points(sort(Waist), sort(band.upper.pred), type="l", lwd=2, lty=2, col="Red")
Tests for Correlations
• Testing H0: ρ = 0 vs. H1: ρ ≠ 0.

cor(Waist, Weight)   # Computes the Pearson correlation coefficient, r
0.9083268
cor.test(Waist, Weight, conf.level=.99)   # Tests Ho: rho=0 and also constructs a C.I. for rho

    Pearson's product-moment correlation
data: Waist and Weight
t = 19.1797, df = 78, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
99 percent confidence interval:
 0.8409277 0.9479759

• Testing H0: ρ = 0 vs. H1: ρ ≠ 0 using the (nonparametric) Spearman's method.

cor.test(Waist, Weight, method="spearman")   # Test of independence using the Spearman rank correlation

    Spearman's rank correlation rho
data: Waist and Weight
S = 8532, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.9
Note that the Pearson results (t = 19.1797, df = 78, p-value < 2.2e-16) are exactly the same as what we got when testing H0: β1 = 0 vs. H1: β1 ≠ 0.
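That agreement is no accident: the Pearson test statistic can be computed from r alone via the identity t = r·sqrt(n − 2)/sqrt(1 − r²), which we can verify directly (again assuming the health-exam data are attached):

# The correlation test statistic equals the slope test statistic
r = cor(Waist, Weight)          # 0.9083268
r * sqrt(78) / sqrt(1 - r^2)    # 19.18, the same t as in the test for the slope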
Model Diagnostics
• Model: Yi = β0 + β1xi + εi, where
  • εi's are uncorrelated with a mean of 0 and constant variance σε².
  • εi's are normally distributed. (This is needed in the test for the slope.)
• Assessing uncorrelatedness of the error terms:
plot(result$residuals, type='b')

• Assessing normality:
qqnorm(result$residuals); qqline(result$residuals)
shapiro.test(result$residuals)
W = 0.9884, p-value = 0.6937   # normality is not rejected

• Assessing constant variance:
plot(result$fitted, result$residuals)
levene.test(result$residuals, Waist)   # levene.test is provided by e.g. the lawstat package
Test Statistic = 2.1156, p-value = 0.06764   # constant variance is not rejected at the 0.05 level
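As a convenience, base R can also draw the standard diagnostic plots for an lm fit (residuals vs. fitted, normal Q-Q, and others) in a single call:

# Built-in diagnostic plots for a fitted lm object, shown in a 2x2 grid
par(mfrow = c(2, 2))
plot(result)
par(mfrow = c(1, 1))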