Topic 7: Analysis of Variance
Transcript of Topic 7: Analysis of Variance
Outline
• Partitioning sums of squares
• Breakdown of degrees of freedom
• Expected mean squares (EMS)
• F test
• ANOVA table
• General linear test
• Pearson Correlation / R2
Analysis of Variance
• Organize results arithmetically
• Total sum of squares in Y is SSTO = Σ(Yi − Ȳ)²
• Partition this into two sources
– Model (explained by regression)
– Error (unexplained / residual)
• Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi)
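The lecture's code is SAS, but the partition is easy to check directly; a minimal Python sketch using made-up data (the x, y values below are illustrative only, not the Pisa or Toluca data):

```python
# Check that SSTO = SSR + SSE for a least-squares fit (toy data)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Least-squares slope and intercept
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SSTO = sum((yi - ybar) ** 2 for yi in y)               # total
SSR = sum((yh - ybar) ** 2 for yh in yhat)             # model (explained)
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # error (residual)

# The partition SSTO = SSR + SSE holds exactly (up to rounding)
assert abs(SSTO - (SSR + SSE)) < 1e-9
```

The cross-product term vanishes because the residuals are orthogonal to the fitted values, which is why the partition is exact.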
Total Sum of Squares
• SSTO = Σ(Yi − Ȳ)²; dfT = n − 1; MST = SSTO/dfT
• MST is the usual estimate of the variance of Y if there are no explanatory variables
• SAS uses the term Corrected Total for this source
• The uncorrected total is ΣYi²
• “Corrected” means that we subtract off the mean before squaring Y
Model Sum of Squares
• SSR = Σ(Ŷi − Ȳ)²
• dfR = 1 (due to the addition of the slope)
• MSR = SSR/dfR
• KNNL uses “regression” for what SAS calls “model”
• So SSR (KNNL) is the same as SS Model (SAS)
Error Sum of Squares
• SSE = Σ(Yi − Ŷi)²
• dfE = n − 2 (estimating both slope and intercept)
• MSE = SSE/dfE
• MSE is an estimate of the variance of Y taking into account (or conditioning on) the explanatory variable(s)
• MSE = s²
ANOVA Table
Source       df     SS                   MS
Regression   1      SSR = Σ(Ŷi − Ȳ)²     SSR/dfR
Error        n−2    SSE = Σ(Yi − Ŷi)²    SSE/dfE
________________________________
Total        n−1    SSTO = Σ(Yi − Ȳ)²    SSTO/dfT
Expected Mean Squares
• MSR, MSE are random variables
• E(MSR) = σ² + β1² Σ(Xi − X̄)²
• E(MSE) = σ²
• When H0 : β1 = 0 is true, E(MSR) = E(MSE)
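The EMS facts can be illustrated by simulation; a rough sketch (toy setup, not from the lecture) that generates data under H0: β1 = 0 and averages MSR and MSE over many replicates, so both should be near σ²:

```python
import random

# Simulate simple linear regression data under H0: beta1 = 0
# and check that MSR and MSE both average to about sigma^2.
random.seed(1)
sigma2 = 1.0           # true error variance (assumed for the simulation)
x = list(range(10))    # fixed X values
n = len(x)
xbar = sum(x) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)

msr_vals, mse_vals = [], []
for _ in range(2000):
    # beta0 = 5, beta1 = 0, Normal errors
    y = [5.0 + random.gauss(0.0, sigma2 ** 0.5) for _ in x]
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    msr_vals.append(b1 ** 2 * Sxx)   # SSR = b1^2 * Sxx, dfR = 1
    mse_vals.append(sse / (n - 2))

mean_msr = sum(msr_vals) / len(msr_vals)
mean_mse = sum(mse_vals) / len(mse_vals)
# Under H0, E(MSR) = E(MSE) = sigma^2, so both averages are near 1.0
```

Under an alternative with β1 ≠ 0, the same simulation would show mean MSR exceeding mean MSE by roughly β1² Σ(Xi − X̄)².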
F test
• F*=MSR/MSE ~ F(dfR, dfE) = F(1, n-2)
• See KNNL pgs 69-71
• When H0: β1=0 is false, MSR tends to be larger than MSE
• We reject H0 when F is large
If F* ≥ F(1−α; dfR, dfE) = F(0.95; 1, n−2)
• In practice we use P-values
F test
• When H0: β1 = 0 is false, F* has a noncentral F distribution
• This can be used to calculate power
• Recall t* = b1/s(b1) tests H0 : β1=0
• It can be shown that (t*)2 = F* (pg 71)
• Two approaches give same P-value
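The identity (t*)² = F* can be checked numerically; a minimal Python sketch on made-up data (the values are illustrative, not the lecture's examples):

```python
import math

# Toy data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
SSR = b1 ** 2 * Sxx                   # = sum of (yhat - ybar)^2
MSE = SSE / (n - 2)
MSR = SSR / 1

F_star = MSR / MSE                    # F statistic, df = (1, n-2)
t_star = b1 / math.sqrt(MSE / Sxx)    # t* = b1 / s(b1)

# (t*)^2 equals F* up to floating-point rounding
assert abs(t_star ** 2 - F_star) < 1e-9 * F_star
```

This is the same check the slides make with the SAS output, where (30.07)² reproduces the F value 904.12 up to rounding.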
ANOVA Table
Source   df     SS     MS     F         P
Model    1      SSM    MSM    MSM/MSE   0.##
Error    n−2    SSE    MSE
Total    n−1    SSTO
**Note: “Model” instead of “Regression” is used here; more similar to SAS
Examples
• Tower of Pisa study (n=13 cases)
proc reg data=a1;
  model lean=year;
run;
• Toluca lot size study (n=25 cases)
proc reg data=toluca;
  model hours=lotsize;
run;
Pisa Output
Number of Observations Read  13
Number of Observations Used  13

Analysis of Variance
Source           DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             1   15804            15804         904.12    <.0001
Error            11   192.28571       17.48052
Corrected Total  12   15997
Pisa Output
Root MSE        4.18097    R-Square  0.9880
Dependent Mean  693.69231  Adj R-Sq  0.9869
Coeff Var       0.60271

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   -61.12088            25.12982         -2.43     0.0333
year         1   9.31868              0.30991          30.07     <.0001

(30.07)² = 904.2 (matches F Value up to rounding error)
Toluca Output
Number of Observations Read  25
Number of Observations Used  25

Analysis of Variance
Source           DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             1   252378           252378        105.88    <.0001
Error            23   54825            2383.71562
Corrected Total  24   307203
Toluca Output
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   62.36586             26.17743         2.38      0.0259
lotsize      1   3.57020              0.34697          10.29     <.0001

Root MSE        48.82331   R-Square  0.8215
Dependent Mean  312.28000  Adj R-Sq  0.8138
Coeff Var       15.63447

(10.29)² = 105.88
General Linear Test
• A different view of the same problem
• We want to compare two models
– Yi = β0 + β1Xi + ei (full model)
– Yi = β0 + ei (reduced model)
• Compare the two models using their error sums of squares; the better model will have a smaller mean square error
General Linear Test
• Let
SSE(F) = SSE for the full model
SSE(R) = SSE for the reduced model
• F* = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / [ SSE(F) / dfF ]
• Compare F* with F(1−α; dfR − dfF, dfF)
Simple Linear Regression
• SSE(R) = Σ(Yi − Ȳ)² = SSTO
• SSE(F) = Σ(Yi − Ŷi)² = SSE
• dfR = n − 1, dfF = n − 2
• dfR − dfF = 1
• F* = (SSTO − SSE)/MSE = SSR/MSE
• Same test as before
• This approach is more general
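That the general linear test reduces to the usual ANOVA F in simple regression can be verified directly; a sketch in Python on made-up data (illustrative values only):

```python
# Toy data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

# Reduced model Yi = b0 + ei fits b0 = ybar, so SSE(R) = SSTO
SSE_R = sum((yi - ybar) ** 2 for yi in y)
SSE_F = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
df_R, df_F = n - 1, n - 2

# General linear test statistic
F_glt = ((SSE_R - SSE_F) / (df_R - df_F)) / (SSE_F / df_F)
# Usual ANOVA F statistic: SSR / MSE
F_anova = (b1 ** 2 * Sxx) / (SSE_F / df_F)

assert abs(F_glt - F_anova) < 1e-9 * F_anova
```

The key step is that the reduced model's least-squares fit is just Ȳ, so SSE(R) − SSE(F) = SSTO − SSE = SSR.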
Pearson Correlation
• r is the usual correlation coefficient
• It is a number between –1 and +1 and measures the strength of the linear relationship between two variables
r = Σ(Xi − X̄)(Yi − Ȳ) / √[ Σ(Xi − X̄)² · Σ(Yi − Ȳ)² ]
Pearson Correlation
• Notice that
b1 = r √[ Σ(Yi − Ȳ)² / Σ(Xi − X̄)² ] = r (sY / sX)
• Test of H0: β1 = 0 is similar to a test of H0: ρ = 0
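The relationship between r and b1 can be checked numerically; a minimal Python sketch on made-up data (illustrative values only):

```python
import math

# Toy data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = Sxy / math.sqrt(Sxx * Syy)   # Pearson correlation coefficient
b1 = Sxy / Sxx                    # least-squares slope

# b1 = r * sqrt(Syy / Sxx), i.e. b1 = r * (sY / sX)
assert abs(b1 - r * math.sqrt(Syy / Sxx)) < 1e-12
assert -1.0 <= r <= 1.0
```

Since b1 is a positive multiple of r, the slope and the correlation always share the same sign, which is why the tests of β1 = 0 and ρ = 0 line up.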
R2 and r2
• Ratio of explained to total variation:
r² = b1² Σ(Xi − X̄)² / Σ(Yi − Ȳ)² = SSR / SSTO
R2 and r2
• We use R2 when the number of explanatory variables is arbitrary (simple and multiple regression)
• r2=R2 only for simple regression
• R2 is often multiplied by 100 and thereby expressed as a percent
R2 and r2
• R2 always increases when additional explanatory variables are added to the model
• Adjusted R2 “penalizes” larger models
• It does not necessarily get larger when variables are added
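R² and adjusted R² are simple to compute by hand; a sketch on made-up data (illustrative values; p denotes the number of regression parameters, here 2 for intercept and slope):

```python
# Toy data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n, p = len(x), 2                  # p = number of parameters (b0, b1)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

SSTO = sum((yi - ybar) ** 2 for yi in y)
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

r2 = 1 - SSE / SSTO               # = SSR / SSTO
# Adjusted R^2 replaces sums of squares with mean squares
adj_r2 = 1 - (SSE / (n - p)) / (SSTO / (n - 1))

# The penalty means adjusted R^2 is never larger than R^2
assert adj_r2 <= r2
```

The SAS output's Adj R-Sq column (0.9869 for Pisa, 0.8138 for Toluca) is computed this way, which is why it sits just below R-Square.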
Pisa Output
Analysis of Variance
Source           DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             1   15804            15804         904.12    <.0001
Error            11   192.28571       17.48052
Corrected Total  12   15997

R-Square 0.9880 (SAS) = SSM/SSTO = 15804/15997 = 0.9879
Toluca Output
R-Square 0.8215 (SAS) = SSM/SSTO = 252378/307203 = 0.8215

Analysis of Variance
Source           DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             1   252378           252378        105.88    <.0001
Error            23   54825            2383.71562
Corrected Total  24   307203
Background Reading
• You may find Sections 2.10 and 2.11 interesting
• 2.10 provides cautionary remarks
– Will discuss these as they arise
• 2.11 discusses the bivariate Normal distribution
– Similarities and differences
– Confidence interval for r
• Program topic7.sas has the code to generate the ANOVA output
• Read Chapter 3