COURSE: Applied Regression Analysis
Lecture 1: Review of Simple Linear Regression
Fundamental elements of statistics:
• Population: set of units
• Sample: a subset of the population
• Variable of interest:
  - systolic blood pressure (continuous)
  - number of errors on an exam (discrete)
  - diabetic/non-diabetic (categorical binary)
  - marital status: married, divorced, single (nominal)
  - degree of pain: minimal/moderate/severe (ordinal)
• Statistical inference: estimate, prediction or other generalization about the population based on the information contained in the sample
• Reliability of statistical inference: degree of uncertainty associated with statistical inference
Types of variables in regression
• Dependent (response) variable: affected by one or more independent variables, assumed to have a probability distribution at each value of the independent variable
• Independent (explanatory, covariate, predictor, regressor) variable: can be set to a desired level by the experimenter, or its values can be recorded as they occur in a population
• Example: consumption of saturated fatty acids and plasma cholesterol levels, plasma cholesterol levels and heart disease
• Types of relationships between variables: association and causal
Distributions:
• Probability distribution is specified mathematically and
is used to calculate the theoretical probability of different
values occurring. It is described by a mathematical formula
with parameters.
• Examples of probability distributions: normal, log-normal,
binomial, t-distribution, etc.
Some graphs of normal distributions
Normal density
f(y) = (1/√(2π)) exp(−y²/2)   (standard normal)
f(y) = (1/√(2πσ²)) exp{−(y − μ)²/(2σ²)}   (general normal)
T-distribution
• T-distribution is like normal but with heavier tails
• As the degrees of freedom of the t-distribution increase it more closely resembles the normal distribution
• For df=30 or higher the two are almost indistinguishable
Parameters
• Mean: sum of measurements divided by the total number of measurements
• Population mean: μ = E(Y) = (1/N) Σ Yi
• Variance: average of the squared deviations from the mean
• Population variance: σ² = Var(Y) = (1/N) Σ (Yi − μ)²
• Population standard deviation: σ = SD(Y) = √Var(Y)
Other important parameters
• Median (the middle value when the observations are ordered from the lowest to the highest)
• Mode (the measurement that occurs most often)
• pth percentile: the value such that at most p% of the measurements fall below it and (100-p)% above it when the measurements are ordered from lowest to highest
• Skewness
Estimates and sampling
• Estimate: a quantity computed from the sample which is intended to estimate a population parameter
• Example: Sample mean Ȳ = (1/n) Σ Yi
• Variability of an estimate: standard error
• Example: Standard error of the mean: σ{Ȳ} = σ/√n
• If σ is unknown it is estimated by the sample standard deviation s, where s² = (1/(n−1)) Σ (Yi − Ȳ)²; then s{Ȳ} = s/√n
• Sampling distribution of an estimate: if many samples of fixed size could be obtained, what would the histogram of the estimator look like?
• Example: The distribution of (Ȳ − μ)/s{Ȳ} is either standard normal or t
Confidence intervals
• “Definition:” a range of values which we can be confident (90%, 95%, 99%, never 100%!) includes the true value
• Basic idea: the confidence interval covers a large proportion of the sampling distribution of the statistic of interest
• Example: To obtain a confidence interval for a population mean take estimator +/- 2*standard error (approximately 95% CI)
• Interpretation: About 95 out of 100 confidence intervals based on different random samples from the same population will include the true mean. (5% will not include the true mean).
One sample confidence interval
• CI: Ȳ ± t(1−α/2; n−1) · s/√n
• 95% CI means α = 0.05
• Interpretation of confidence interval: We are 95% confident that the interval contains the true population mean. That is, if we were to construct many 95% CIs based on different random samples from the same population, 95% of these intervals will contain the true mean and 5% won't. Our particular interval either does or does not contain the true mean.
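A minimal Python sketch of this interval (illustration only; the course's worked examples use SAS, and the sample y below is hypothetical):

    import numpy as np
    from scipy import stats

    y = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4])   # hypothetical sample
    n = len(y)
    ybar, s = y.mean(), y.std(ddof=1)               # sample mean and sd
    alpha = 0.05
    tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)    # t(1-alpha/2; n-1)
    half = tcrit * s / np.sqrt(n)
    ci = (ybar - half, ybar + half)                 # 95% CI for the mean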
Hypothesis tests
• Null hypothesis (Ho): usually the opposite of the research hypothesis generating data (no difference between treatments, no linear relationship between X and Y)
• Alternative hypothesis (Ha): research hypothesis (treatment A is better than treatment B, Y increases with X)
• Alpha level: tolerance for mistakenly declaring Ha to be true (usually set at 0.05, but can use 0.01, 0.10). This type of mistake is called type I error.
• Test statistic: numerical summary of evidence contained in the sample in favor of Ha under the assumption that Ho is true.
• P-value: the probability of obtaining as much or more evidence as contained in the test statistic in favor of Ha under the assumption that Ho is true.
Hypothesis testing
• Ho: μ = μ0 vs Ha: μ ≠ μ0
• TS: t* = (Ȳ − μ0)/(s/√n)
• Test statistic has t distribution with n−1 degrees of freedom
• Rejection region: t* > t(1−α/2, n−1) or t* < −t(1−α/2, n−1)
• p-value = the probability of getting a more extreme value than the observed value of the test statistic, based on the t-distribution with n−1 degrees of freedom. Note: this probability is computed under Ho.
• Reject Ho if p-value < α (0.05) or equivalently if t* falls in the RR.
Rejection regions (critical values) of
t-distribution
Relationship between variables
[Scatter plots: Fahrenheit vs Celsius (an exact, functional relationship); Resting Metabolic Rate (in kcal/24 hrs) vs Weight (in kg) (a scattered, statistical relationship)]
Functional vs Statistical
• Functional: Yi = f(Xi); the relationship is perfect; data points fall exactly on the curve of the relationship; systematic part only
• Statistical: Yi = f(Xi) + εi; the relationship is not perfect; data points are scattered around the curve; systematic plus random part
Model definition
• Yi – dependent variable for subject i
• Xi – independent variable for subject i
The regression equation is then:
  Yi = β0 + β1Xi + εi, i = 1, …, n
- β0 is the unknown intercept and β1 is the unknown slope (β0 and β1 are called regression parameters)
- εi are independent, identically distributed errors with mean 0 and variance σ² (E(εi) = 0, Var(εi) = σ² > 0, Cov(εi, εj) = 0 for i ≠ j)
Properties of SLR
• The model is linear in the parameters
• There is a single independent variable
• The model is also linear in the independent variable
• The mean of the dependent variable depends on the predictor according to the systematic part of the model
E(Yi)= β0 + β1 Xi
• The variance of the dependent variable is constant
Var(Yi) = σ²
Estimation of regression parameters
• Least squares is the standard method. The least squares
regression line minimizes the sums of squares of the
vertical distances from the observations to the line
(residuals).
• The obtained estimates of the regression parameters are
called “ordinary least squares” estimates
• These are the values that minimize
  Q = Σ (i=1 to n) (Yi − β0 − β1Xi)²
Estimation of regression parameters cont
• The estimates of the regression parameters are obtained by solving the normal equations and are as follows:
• b1 is interpreted as the estimated mean change in the response variable per unit change in the predictor
• b0 is interpreted as the estimated mean response at 0 value of the predictor. Note that if 0 is not in the range of values of the predictor variable then this estimate is meaningless.
  b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
  b0 = Ȳ − b1X̄
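A short Python sketch of these closed-form estimates (hypothetical data, not from the course; SSE and MSE from the following slides are included for completeness):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # hypothetical predictor
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])       # hypothetical response

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    e = y - (b0 + b1 * x)                          # residuals
    sse = np.sum(e ** 2)                           # error sum of squares
    mse = sse / (len(x) - 2)                       # estimates sigma^2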
Regression line
  Ŷ = b0 + b1X
or equivalently
  Ŷ = Ȳ + b1(X − X̄)
Estimating the variance
Fitted response: Ŷi = b0 + b1Xi
Residual: ei = Yi − Ŷi
Error sum of squares: SSE = Σ (Yi − Ŷi)² = Σ ei²
Mean squared error: MSE = SSE/(n−2) = Σ (Yi − Ŷi)²/(n−2)
The residual variance σ² is estimated by the MSE
Properties of least squares estimates
• Unbiased (on average do not overestimate or underestimate the true value):
  E(b0) = β0, E(b1) = β1, E(MSE) = σ²
• Estimated regression coefficients have minimum variance among all unbiased linear estimators, i.e. among all estimators that are linear functions of the Yi's
• For inference about the regression parameters we need an additional assumption: εi ~ N(0, σ²)
Maximum likelihood estimation
• This is an alternative method to ordinary least squares to obtain estimates of the regression parameters and the error variance
• It requires that the errors εi ~ i.i.d. N(0, σ²)
• Based on this method we can also do inference (hypotheses tests, confidence intervals) for the parameters of interest
• The main idea is to find the values of the parameters that maximize the likelihood of the observed data
Normal density
f(y) = (1/√(2π)) exp(−y²/2)   (standard normal)
f(y) = (1/√(2πσ²)) exp{−(y − μ)²/(2σ²)}   (general normal)
Likelihood function
• The density of each individual Yi is:
  f(Yi) = (1/√(2πσ²)) exp{−(Yi − β0 − β1Xi)²/(2σ²)}
• The likelihood is the product of the individual densities:
  L(β0, β1, σ²) = Π (i=1 to n) (1/√(2πσ²)) exp{−(Yi − β0 − β1Xi)²/(2σ²)}
Maximum likelihood estimates
• The estimates of the regression parameters are exactly the same as the OLS estimates b0 and b1:
  β̂1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²,  β̂0 = Ȳ − β̂1X̄
• The estimate of the variance is different (biased estimate):
  σ̂² = SSE/n = Σ(Yi − Ŷi)²/n = ((n − 2)/n)·MSE
Interpretation of SLR
• The intercept is the value of E(Y) for X=0
• The regression line consists of the estimated mean responses (systematic part) over the range of the fixed X's
• Regression line passes through the point (X̄, Ȳ)
• Sum of residuals is zero: Σ ei = Σ (Yi − Ŷi) = 0
• Therefore Σ Yi = Σ Ŷi
• Sum of squared residuals Q is minimum
• Also Σ Xiei = 0 and Σ Ŷiei = 0
Data example
• Assessment of the strength of association
between body weight (kg) and resting
metabolic rate (RMR) (kcal/24hr). Can
body weight be used to predict RMR?
• Now please look at attached SAS code
Results from data example
• b0 = 811.23
• b1 = 7.06
• Regression equation:
RMR=811.23+7.06*weight
• SSE = 1047230.71, MSE = 24934.06
• Point estimate of the mean RMR for weight = 70 kg: Ŷ = 811.23 + 7.06·70 = 1305.43
• Residual for the first data point: ei = -84.50
• Note: sum of residuals is 0
Distinctions between regression and
correlation models
• Both simple linear regression and correlation can be used to assess the linear relationship between two continuous variables.
• In regression one of the variables is a response, the other one is a predictor. Correlation treats the variables equally.
• Correlation assumes X to be a random variable
• Regression is more informative than correlation: it allows us to predict values of the response (dependent variable) from values of the independent variable.
Joint and marginal distributions
of X and Y
• The bivariate normal distribution describes the joint distribution of two continuous variables
• μ1 and σ1 are the mean and standard deviation of the marginal distribution of X
• μ2 and σ2 are the mean and standard deviation of the marginal distribution of Y
• ρ is the correlation coefficient between X and Y
• Marginally X ~ N(μ1, σ1²) and Y ~ N(μ2, σ2²)
• Jointly: (X, Y)' ~ N2[(μ1, μ2)', Σ], where Σ = [σ1², ρσ1σ2; ρσ1σ2, σ2²]
Graph of bivariate normal density
Conditional distribution of Y given
X and SLR
• Y|X ~ N(μ2 + ρ(σ2/σ1)(X − μ1), σ2²(1 − ρ²)), i.e. for every value of X the conditional distribution of Y given X is normal with mean and variance that depend on the parameters of the joint distribution
• The mean of the distribution of Y|X is a LINEAR function of X
• For all values of X the variance of the conditional distribution Y|X is the same
• Same observations can be made for the distribution of X|Y
• For making inferences for Y conditional on X (X conditional on Y) the normal regression model is appropriate
• Even if X is not normally distributed, but Xi are independent and the distribution of X does not involve the regression parameters, we can still use simple linear regression
Estimation of correlation coefficient
• Pearson product-moment correlation coefficient
• -1≤ρ≤1, ρ=0 when X and Y are independent
• Testing H0: ρ=0 versus Ha: ρ≠0 is equivalent to
testing whether the slope of the regression line of Y|X
or X|Y is 0
• Setting up confidence interval for ρ is based on the
normal distribution and Fisher’s z-transform
  r = Σ(Xi − X̄)(Yi − Ȳ) / [Σ(Xi − X̄)² · Σ(Yi − Ȳ)²]^(1/2)
Nonparametric estimation of the
correlation coefficient
• When the joint distribution of X and Y is not bivariate
normal and can not be transformed to normality, the
nonparametric rank correlation procedure can be used
• Spearman rank correlation coefficient:
• That is, we first obtain the ranks of the Yi’s (RYi) and
Xi’s (RXi) separately and then compute the Pearson
correlation coefficient on the ranks
  rs = Σ(RXi − R̄X)(RYi − R̄Y) / [Σ(RXi − R̄X)² · Σ(RYi − R̄Y)²]^(1/2)
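As an illustration only (not part of the slides), both coefficients are available in SciPy; the data arrays below are hypothetical:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(size=50)                        # hypothetical data
    y = 2 * x + rng.normal(size=50)

    r, p_pearson = stats.pearsonr(x, y)            # Pearson product-moment r
    rs, p_spearman = stats.spearmanr(x, y)         # Spearman rank correlation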
Lecture 2: Inference in Simple
Linear Regression
Inferences concerning β1
• The test of linear association between
predictor and response variable:
H0: β1= 0 vs Ha: β1≠ 0
• Confidence interval for β1
Sampling distribution of the
estimator of β1
• Remember b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
• Since the estimated slope is a linear combination of the Yi's it is N(β1, σ²(b1)), where
  σ²(b1) = σ² / Σ(Xi − X̄)²
• Hence, z* = (b1 − β1)/σ(b1) ~ N(0, 1)
• The variance σ² is unknown but we have an unbiased estimate of it …
Sampling distribution of the estimator of β1 cont
• … the MSE
• Hence the estimated variance will be: s²(b1) = MSE / Σ(Xi − X̄)²
• and we can use the following statistic: t* = (b1 − β1)/s(b1)
• Since we are estimating the variance, the sampling distribution is no longer normal
• t* ~ t(n−2), i.e. the sampling distribution of the standardized b1 is t with n−2 degrees of freedom
Two-sided hypothesis test for β1
• H0: β1= 0 vs Ha: β1≠ 0
• Compute TS under H0: t* = (b1 − 0)/s(b1)
• If t* > t(1−α/2, n−2) or if t* < −t(1−α/2, n−2) then reject H0 in favor of Ha, otherwise fail to reject H0
• Equivalently compute p-value as 2·P(t(n−2) > |t*|) and reject H0 if p-value < α
Rejection regions (critical values) of
t-distribution
One-sided hypotheses tests for β1
• Test for negative slope
• H0: β1≥ 0 vs Ha: β1< 0
• Compute TS under H0
• If t* < -t(1-α,n-2) then reject
H0 in favor of Ha, otherwise
fail to reject H0
• Equivalently compute p-value
as P(t(n-2) < t*) and reject H0
if p-value < α
• Test for positive slope
• H0: β1≤ 0 vs Ha: β1> 0
• Compute TS under H0
• If t* > t(1-α,n-2) then reject
H0 in favor of Ha, otherwise
fail to reject H0
• Equivalently compute p-value
as P(t(n-2) > t*) and reject H0
if p-value < α
Confidence interval for β1
• Consider α = 0.05. Then with probability 0.95:
  −t(1−α/2, n−2) ≤ (b1 − β1)/s(b1) ≤ t(1−α/2, n−2)
• Therefore a 95% confidence interval is as follows:
  (b1 − t(1−α/2, n−2)·s(b1), b1 + t(1−α/2, n−2)·s(b1)), i.e. b1 ± t(1−α/2, n−2)·s(b1)
Inference concerning β0
• Estimator: b0 = Ȳ − b1X̄
• Sampling distribution of b0 is N(β0, σ²(b0))
• Variance: σ²(b0) = σ² [1/n + X̄² / Σ(Xi − X̄)²]
• Estimated variance: s²(b0) = MSE [1/n + X̄² / Σ(Xi − X̄)²]
• Therefore we have: (b0 − β0)/s(b0) ~ t(n−2)
Inference concerning β0
• CI: b0 ± t(1−α/2, n−2)·s(b0)
• HT: H0: β0 = 0 vs Ha: β0 ≠ 0
  TS: t* = b0/s(b0)
  RR: |t*| > t(1−α/2, n−2)
RMR data example (cont'd from Lecture 1)
• Assessment of the strength of association between body weight (kg) and resting metabolic rate (RMR) (kcal/24hr).
• Construct 90% CI for the slope.
• Test at alpha=0.05 whether the slope is significantly positive.
RMR data example cont’d
• 90% CI: b1 ± t(1−0.10/2, 42)·s{b1}
  s²{b1} = MSE/[(n−1)sx²] = 24934.06/(43 · 24.632²) = 0.956
  s{b1} = 0.978
  7.06 ± (1.68)(0.98) = (5.35, 8.71)
• We are 90% confident that the mean increase in RMR per 1 kg increase in body weight is between 5.35 and 8.71 kcal/24hrs
• HT: H0: β1 = 0 vs Ha: β1 > 0
• TS: t* = (b1-0)/s{b1}=7.06/0.98=7.20
• RR: t*>1.68
• Conclusion: t* > 1.68 and hence we reject H0 and conclude that average RMR significantly increases on average as body weight increases
Inference concerning E(Yh)
• Point estimator: Ŷh = b0 + b1Xh
• Expectation: E(Ŷh) = E(Yh)
• Variance: σ²(Ŷh) = σ² [1/n + (Xh − X̄)² / Σ(Xi − X̄)²]
• Estimated variance: s²(Ŷh) = MSE [1/n + (Xh − X̄)² / Σ(Xi − X̄)²]
• Sampling distribution: (Ŷh − E(Yh))/s(Ŷh) ~ t(n−2)
Inference concerning E(Yh) cont’d
• CI: Ŷh ± t(1−α/2, n−2)·s(Ŷh)
• HT:
  H0: E(Yh) = E0 vs Ha: E(Yh) ≠ E0
  TS: t* = (Ŷh − E0)/s(Ŷh) ~ t(n−2)
  RR: |t*| > t(1−α/2, n−2)
Prediction of a new observation Yh(new)
• Point estimator is the same as in mean estimation: Ŷh(new) = b0 + b1Xh(new)
• Expectation is the same but can't be used in inference since it is unknown: E(Ŷh(new)) = E(Yh(new))
• Rather we base inference on the following distribution: (Yh(new) − Ŷh(new))/s(pred) ~ t(n−2)
• Variance of prediction: σ²(pred) = σ² [1 + 1/n + (Xh(new) − X̄)² / Σ(Xi − X̄)²]
• Estimated variance of prediction: s²(pred) = MSE [1 + 1/n + (Xh(new) − X̄)² / Σ(Xi − X̄)²]
Inference for Yh(new)
• Prediction interval for Yh(new): Ŷh(new) ± t(1−α/2, n−2)·s{pred}
• Note that this interval is wider than the interval around the
estimated mean at the same value of X
RMR example: mean estimation and
prediction of a new observation
• 95% CI for mean RMR at weight = 90kg: CI = Ŷ90 ± t(0.975,42)s{Ŷ90}
Ŷ90 = b0+b1*90=811.23+7.06(90)=1446.63
s{Ŷ90} = 28.021
CI = 1446.63 ± 2.018 (28.021) = (1390.08, 1503.18)
• 95% prediction interval for RMR for an individual who weighs 90kg:
CI = Ŷ90 ± t(0.975,42)s{pred}, Ŷ90 = b0+b1*90=1446.63
s{pred} = 160.37
CI = 1446.63 ± 2.018 (160.372) = (1121.9, 1771.29)
  s²{Ŷ90} = MSE[1/n + (Xh − X̄)²/Σ(Xi − X̄)²] = 24934.06 [1/44 + (90 − 74.88)²/(43 · 606.74)] = 785.166
  s²{pred} = MSE[1 + 1/n + (Xh − X̄)²/Σ(Xi − X̄)²] = 24934.06 [1 + 1/44 + (90 − 74.88)²/(43 · 606.74)] = 25719.23
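These two intervals can be reproduced with a few lines of Python from the summary quantities on the slides (MSE = 24934.06, n = 44, X̄ = 74.88, Σ(Xi − X̄)² = 43·606.74); a sketch for illustration only:

    import numpy as np
    from scipy import stats

    b0, b1 = 811.23, 7.06
    mse, n = 24934.06, 44
    xbar, sxx = 74.88, 43 * 606.74                  # sxx = sum((Xi - Xbar)^2)

    xh = 90.0
    yhat = b0 + b1 * xh                             # 1446.63
    t = stats.t.ppf(0.975, n - 2)                   # roughly 2.018
    s_mean = np.sqrt(mse * (1 / n + (xh - xbar) ** 2 / sxx))
    s_pred = np.sqrt(mse * (1 + 1 / n + (xh - xbar) ** 2 / sxx))

    ci = (yhat - t * s_mean, yhat + t * s_mean)     # CI for the mean response
    pi = (yhat - t * s_pred, yhat + t * s_pred)     # prediction interval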
Analysis of variance approach to regression
• Total variation (variation around the mean of the response variable): SSTO
• Variation around the fitted regression line: SSE
• Variation of fitted regression line from the mean of the response variable: SSR
  SSTO = Σ (Yi − Ȳ)²
  SSE = Σ (Yi − Ŷi)²
  SSR = Σ (Ŷi − Ȳ)²
Partitioning of total sum of squares and
degrees of freedom
• SSTO = SSE + SSR
• That is,
• Total df = n-1
• Regression df = 1
• Error df = n-2
• Total df = Error df + Regression df
• n-1 = n-2 + 1
  Σ (Yi − Ȳ)² = Σ (Yi − Ŷi)² + Σ (Ŷi − Ȳ)²
Mean squares and ANOVA table
• MSE = SSE/(n-2)
• MSR = SSR/1 = SSR
------------------------------------------------------------
Source of
variation df SS MS F p
------------------------------------------------------------
Regression 1 SSR MSR MSR/MSE
Error n-2 SSE MSE
------------------------------------------------------------
Total n-1 SSTO
F-test for β1=0
• H0: β1= 0 vs Ha: β1≠ 0
• TS: F* = MSR/MSE ~F(1,n-2)
• If F* > F(1-α;1,n-2) reject H0 in favor of Ha, otherwise fail to reject H0
Intuition:
  E(MSE) = σ²,  E(MSR) = σ² + β1² Σ(Xi − X̄)²
Under Ho the ratio E(MSR)/E(MSE) is 1, under the alternative it is > 1
Note: F* = t*², hence the F-test is equivalent to the two-sided t-test
General Linear Test
• Full model: Yi = β0 + β1Xi + εi
• Reduced model: Yi = β0 + εi
• H0: β1= 0 vs Ha: β1≠ 0
• SSE(F) = SSE
• SSE(R) = SSTO
• TS: F* = {[SSE(R) − SSE(F)]/(dfR − dfF)} / [SSE(F)/dfF]
• If F* > F(1−α; dfR−dfF, dfF) then reject the null and conclude that the full model fits the data better than the reduced model
RMR example: ANOVA table
------------------------------------------------------------------
Source of
variation       df    SS             MS             F       p
------------------------------------------------------------------
Regression       1    1300241.18     1300241.18     52.15   <.0001
Error           42    1047230.71       24934.06
------------------------------------------------------------------
Total           43    2347471.89
Measures of association between
X and Y
• Coefficient of determination:
• r2 is interpreted as the proportionate reduction of total variation in Y associated with the use of predictor X
• 0 ≤ r2 ≤ 1
• r2 = 1 when all the points fall on the fitted regression line (and the regression line is not horizontal)
• r2 = 0 when the fitted regression line is horizontal
  r² = SSR/SSTO = 1 − SSE/SSTO
Measures of association between
X and Y cont’d
• Correlation coefficient (Pearson product-moment correlation coefficient): r = ±√r², with the sign of r matching the sign of the slope
• −1 ≤ r ≤ 1
• r > 0 indicates positive linear association between X and Y
• r < 0 indicates negative linear association between X and Y
• Note that small r does not necessarily mean no relationship between X and Y (it means no LINEAR relationship)
• High r does not necessarily indicate good fit or that good predictions can be made
Relationship between r and the estimated
regression slope b1
  b1 = r · [Σ(Yi − Ȳ)² / Σ(Xi − X̄)²]^(1/2) = r · sy/sx
• sx and sy are the sample standard deviations of X and Y respectively
• Note that b1 = 0 implies r=0 and vice versa
• The signs of b1 and r are also the same
• The value of r is affected by the spacing of the X values
Results from data example
• Testing b1 =0: t*=7.22, p<.0001
• F*=52.15,p<.0001
• 95% CI for b1 : (5.09, 9.03)
• R2 = 0.55
• 95% CI for mean RMR for weight = 90 kg:
(1390.0, 1503.1)
• 95% CI for individual RMR for weight = 90kg: (1122.9, 1770.2)
Lecture 3: Diagnostics and
remedial measures
Departures from model and remedial measures
Departures from model:
• Non-linear regression function
• Non-normal errors
• Non-constant error variance
• Non-independent error terms
• Presence of outliers
• Omission of important predictor variables
Remedial measures:
• Nonparametric regression
• Quadratic or higher order polynomial
• Transformations
• Weighted least squares
• Models for correlated data (mixed models, time series models)
• Robust regression
• Multiple regression
Consequences of model departures
• Nonlinearity and/or omission of predictor variables lead to biased estimates of the parameters (serious)
• Non-constant error variance leads to less efficient estimates and to invalid error variance estimates (less serious)
• Presence of outliers may or may not be serious. Depends on how influential outliers are for the regression estimation and on the size of the data set.
• Non-independence of errors results in biased variances (estimates are unbiased). May be serious.
Diagnostics for predictor variable
• Dot plot: useful when number of observations in the data set is not large. Helps identify outlying cases.
• Sequence plot: useful when data on the predictor variable are obtained in sequence. Helps identify patterns.
• Stem-and-Leaf Plot: provides information similar to a frequency histogram. Useful to identify outliers and skewness.
• Box plot: shows minimum, maximum value, first, second and third quartiles. Helps identify skewness and outliers. Most useful for large data sets.
Residuals
• Remember ei = Yi − Ŷi
• Basic idea: If the assumptions of the regression model are satisfied the distribution of the residuals should resemble the error distribution
• The mean of the residuals is zero: ē = Σei/n = 0
• The variance of the residuals is approximated by the MSE
• The residuals are not independent because the fitted values are based on the same regression line. The residuals are subject to 2 constraints: their sum is zero and the products Xiei sum to 0. However when n is large the dependency can be ignored.
Standardized residuals
• The idea is to standardize the residuals by their
standard deviation
• However the latter is unknown
• We can use the MSE
• Semistudentized residuals: ei* = (ei − ē)/√MSE = ei/√MSE
• Studentized residuals: use a different denominator
Several prototype situations for residual
vs predictor plot
• Figure 3.4 on p.106 in textbook
Non-linear regression function
• Can be assessed via plot of
the residuals vs predictor
variable
(or equivalently residuals vs
fitted values)
Non-normal error terms
• To detect gross departures from normality for a
relatively large sample a histogram, dot plot, box
plot or stem-and-leaf plot of the residuals can be
helpful
• Otherwise a normal probability plot of the
residuals is more informative
Normal probability plot
• Order the residuals
• Plot each residual against its expected value under normality
• Expected value of the kth smallest residual is √MSE · z[(k − 0.375)/(n + 0.25)]
• z(a) is the 100a-th percentile of the standard normal distribution
• Plot should be nearly linear
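A small Python sketch of these expected values (hypothetical helper function; plot the sorted residuals against its output and look for a straight line):

    import numpy as np
    from scipy import stats

    def expected_normal_scores(e, mse):
        n = len(e)
        k = np.arange(1, n + 1)
        # expected value of the k-th smallest residual under normality
        return np.sqrt(mse) * stats.norm.ppf((k - 0.375) / (n + 0.25))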
Good normal probability plot of
residuals
Examples of “not so good” normal
probability plots
Caution in assessing normality
• Normal probability plots may reflect other violations of the assumptions:
• For example a wrong choice of regression function
• Or non-constant error variance
• Investigate other possible violations first!
Non-constant error variance
• Can be detected using plot of residuals versus values of the predictor variable
• Equal variability around zero indicates constant variances
• Increasing spread of the residuals around zero indicates that variance increases with the mean (“megaphone type”)
• Plot of absolute residuals or squared residuals against X may be even more useful to detect nonconstancy of error variance
Brown-Forsythe test
• Appropriate for SLR
• Can be used even when errors are non-normal
• Requires large sample size to be able to ignore
dependency among residuals
• Main idea is to separate the residuals in two
groups and compare the average absolute
deviations from the center in the two groups
Brown-Forsythe test cont’d
  di1 = |ei1 − ẽ1|,  di2 = |ei2 − ẽ2|,  where ẽ1 and ẽ2 are the medians of residual groups 1 and 2
  s² = [Σ(di1 − d̄1)² + Σ(di2 − d̄2)²] / (n − 2)
  t*BF = (d̄1 − d̄2) / (s·√(1/n1 + 1/n2)) ~ t(n − 2)
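A sketch of this computation in Python (hypothetical helper; e1 and e2 are the residuals in the two groups, e.g. split at the median of X):

    import numpy as np
    from scipy import stats

    def brown_forsythe(e1, e2):
        d1 = np.abs(e1 - np.median(e1))
        d2 = np.abs(e2 - np.median(e2))
        n1, n2 = len(d1), len(d2)
        s2 = (np.sum((d1 - d1.mean()) ** 2)
              + np.sum((d2 - d2.mean()) ** 2)) / (n1 + n2 - 2)
        t_bf = (d1.mean() - d2.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2))
        p = 2 * stats.t.sf(abs(t_bf), n1 + n2 - 2)
        return t_bf, p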
Non-independent error terms
• When the data are recorded in a temporal or
spatial fashion a sequence plot may be useful to
detect non-independent error terms
• This plot may reveal trend effect or cyclical non-
independence
Correlated errors
Presence of outliers
• Outliers are extreme observations
• Can be identified based on plots of residuals against X or fitted values, box plots, stem-and-leaf plots or dot plots of residuals
• It is better to plot semistudentized residuals
• Semistudentized residuals with values of four or more are considered outliers
• Outliers may result from error in recording, malfunction, miscalculation, etc and may need to be discarded
• Otherwise they may provide important information
• Note that the estimates of regression parameters may or may not be greatly affected by outliers
Omission of important predictor variables
• Plot residuals against variables omitted from the model that might have important effects on the response
• If a pattern emerges, including one or more of these additional variables in the model may lead to a substantially better fit
F test for lack of fit
• Tests whether the chosen regression function
adequately fits the data
• Assumes that Y|X ~ independent and normally
distributed and that the error variances are
constant
• Requires repeat observations on some X
F-test for lack of fit cont’d
• X1, …Xc – levels of the predictor
• n1, … nc – replicates at each X level
• Yij – observed response for the ith replicate at the
jth level of X
• Full model: Yij = μj + εij, εij ~ indep. N(0, σ²)
• Let SSPE (sum of squares for pure error) denote the SSE for the full model
• Reduced model: Yij = β0 + β1Xj + εij, εij ~ indep. N(0, σ²)
F-test of lack of fit cont’d
  SSLF = SSE − SSPE
  F* = {[SSE(R) − SSE(F)]/[(n−2) − (n−c)]} / [SSE(F)/(n−c)]
     = [SSLF/(c−2)] / [SSPE/(n−c)] = MSLF/MSPE
  If F* > F(1−α; c−2, n−c) reject H0
Transformations
• Transformation on X should be attempted when the error terms are approximately normally distributed with constant variance but the relationship between X and Y is non-linear
• Otherwise transformation on Y is more appropriate
• For some transformations (log, sqrt, inverse sqrt) adding a constant may be necessary to make numbers positive
Prototype plots for
transformation of X
• Use X’ = X2 or X’ = exp(X) if the X-Y plot suggests an arc from lower left to upper right with bulge below the straight line (1st plot on right)
• Use X’ = square root of X or X’ = log10(X) if the X-Y plot suggests an arc from lower left to upper right with bulge above the straight line
• Use X’ = 1/X or X’ = exp(-X) if the X-Y plot suggests an arc from upper left to lower right with bulge below the straight line (2nd plot on right)
Prototype plots suggesting transformation on Y
Transformations on Y
• The choice of a transformation of Y may be suggested by examining the plot of residuals against fitted values. If this appears linear, but the variance of the residuals increases as fitted Y increases, suggesting a wedge or megaphone shape, then taking square roots, logarithms, or reciprocals of the Y values may promote homogeneity of variance
• Note that a simultaneous transformation on X may be necessary
Transformations on Y
• Use Y’ = square root of Y if there is an arc from
lower left to upper right with bulge below the
straight line, and the variance of the residuals
increases as fitted Y increases
• Use Y’ = log(Y) if there is an arc from upper left
to lower right, and the variance of the residuals
increases as fitted Y increases
• Use Y’ = 1/Y if variance of the residuals increases
as fitted Y increases
Box-Cox transformations
• Automatically identifies a transformation from the family of power transformations Y’ = Yλ, where λ is identified from the data
• λ = 2 corresponds to Y’ = Y2
• λ = ½ corresponds to Y’ = √Y
• λ = 0 corresponds to Y’ = loge(Y)
• λ = -½ corresponds to Y’ = 1/ √Y
• λ = -1 corresponds to Y’ = 1/Y
• Regression model: Yi^(λ) = β0 + β1Xi + εi
• The Box-Cox procedure finds the MLE of λ
Comments regarding transformations
• Theoretical considerations may prevail
• Residual plots and tests needed to ascertain appropriateness of transformation
• Interpretation and properties of regression coefficients are with respect to the transformed scale
• A more convenient value of λ may be selected for interpretation purposes
• When the confidence interval for λ includes 1 it may be better to stay with the original scale
Nonparametric regression
• Idea: fit a smoothed curve to the data to explore or
confirm regression relationship
• For time-series data popular methods are: the
method of moving averages, the method of
running medians
• For regression data popular methods are band
regression and locally weighted regression scatter
plot smoothing (Lowess)
Lowess method
• Obtains a smoothed curve by fitting successive linear regression functions in local neighborhoods
• The smoothed Y value at a given X is equal to the fitted Y value for the regression in that local neighborhood
• For example if a neighborhood of 3 values is used, the smoothed value of Y at X2 is the fitted value for Y at X2 based on the regression fitted to (X1,Y1), (X2,Y2), (X3,Y3)
• Similarly the smoothed value at X3 will be the fitted value for Y at X3 based on the regression fitted to (X2,Y2) (X3,Y3) and (X4,Y4)
Steps in obtaining final smoothed curve
• 1. The linear regression is weighted with smaller
weights given to X values further from the middle
X level in each neighborhood
• 2. Linear regression fitting is repeated with revised
weights so that cases with large residuals in the
first fitting receive smaller weights in the second
fitting
• 3. Additional iterations of step 2 may be needed
Choices in lowess method
• Size of successive neighborhoods (the larger the size the smoother the function but essential features may be lost)
• In SAS PROC LOESS a smoothing parameter s is chosen. When s < 1 the s fraction of values closest to X are chosen in each neighborhood
• Weight function for X values
• SAS PROC LOESS uses a tricube weight function:
  wi = (32/5)·(1 − (di/dq)³)³, where d1, …, dq are the distances from the 1st, 2nd, …, qth closest X value to Xi
• Weight function for residuals
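Besides SAS PROC LOESS, the same smoother is available in Python's statsmodels; a sketch with hypothetical data (frac plays the role of the smoothing parameter s, and it requests the reweighting iterations that downweight large residuals):

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 10, 100))              # hypothetical data
    y = np.sin(x) + rng.normal(scale=0.3, size=100)

    smoothed = lowess(y, x, frac=0.5, it=3)           # columns: x, smoothed y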
Important points regarding the lowess method
• No analytical expression is provided for the
functional form of the regression relationship
• Higher order polynomials may be used to smooth
out in local neighborhoods
• If lowess curve falls in confidence bands of
regression line then it can be considered
confirmatory of the chosen regression relationship
Lecture 4: Simultaneous
inference and other topics
Joint estimation of β0 and β1
• Statement confidence coefficient: reflects the proportion of confidence intervals that contain the true value of a parameter in repeated sampling
• Setting two separate confidence intervals for the slope and the intercept in SLR assures that each of the two statement confidence coefficients is correct
• However, the probability that both confidence intervals contain their respective parameters is less than 95%
• Family confidence coefficient corresponds to the proportion of repeated samples in which both the true intercept and the true slope fall in their respective confidence intervals
Bonferroni joint confidence intervals
for intercept and slope
• Separate confidence intervals:
  b0 ± t(1−α/2; n−2)·s{b0}
  b1 ± t(1−α/2; n−2)·s{b1}
• Let A1 denote the event that β0 does not belong to the first CI, and A2 denote the event that β1 does not belong to the second CI. Therefore P(A1) = P(A2) = α
• We want P(neither A1 nor A2 occurs) ≥ 1 − α
• But P(neither A1 nor A2) = 1 − P(A1 ∪ A2) ≥ 1 − P(A1) − P(A2) = 1 − 2α
Bonferroni joint confidence
intervals cont’d
• Therefore, if we construct two 100(1-α/2)%
confidence intervals we will get at least
100(1- α)% family confidence coefficient for
both parameters
• The joint confidence intervals are as follows:
  b0 ± t(1−α/4; n−2)·s{b0}
  b1 ± t(1−α/4; n−2)·s{b1}
An example
• Remember for RMR data we had
• b0 = 811.23, s(b0)=76.98
• b1 = 7.06, s(b1)=0.98
• Therefore joint 90% confidence intervals will be:
• b0 ± t(1-0.10/4;42)s{b0}=811.23 ± 2.02 (76.98) = (655.73,966.73)
• b1 ± t(1-0.10/4;42)s{b1}=7.06 ± 2.02 (0.98) = (5.08,9.04)
• Interpretation: we are at least 90% confident that both the intercept and the slope fall in their respective confidence interval above
Comments
• The Bonferroni family confidence coefficient is not an exact coefficient but is rather a lower bound to the desired probability
• The Bonferroni inequality is extended to more than two events
• Useful when number of confidence intervals is not too large
• Joint confidence intervals can be used for testing (for example of whether both slope and intercept are equal to zero)
• Each confidence interval can have its own confidence level to reflect its relative importance
  P(none of A1, …, Ag occurs) ≥ 1 − Σ P(Ai) = 1 − gα
Simultaneous estimation of mean
responses
• Separate estimates of the mean response at different levels of the predictor need not be simultaneously right or simultaneously wrong.
• It is possible that the confidence interval for the mean response is right only over some part of the range
• We’ll consider two procedures:
• Working-Hotelling procedure
• Bonferroni procedure
Working-Hotelling procedure
• The confidence bounds at each value of X are the values of the confidence band for the entire regression line: Ŷh ± W·s{Ŷh}
• Here W² = 2F(1−α; 2, n−2)
RMR example
Xh     Ŷh        s{Ŷh}
60     1234.80   27.90
82     1390.11   24.80
103    1538.36   36.37
• Working-Hotelling 90% CI for mean:
• W2 = 2F(0.9,2,42)=4.87
• W = 2.21
• (1234.80±2.21(27.90))=(1173.14,1296.46)
• (1390.11±2.21(24.80))
=(1335.30,1444.92)
• (1538.36±2.21(36.37))=(1457.98,1618.74)
Bonferroni procedure
• Same idea as for simultaneous confidence
intervals for the intercept and slope
• For simultaneous confidence intervals on g
means:
  Ŷh ± B·s{Ŷh}, where B = t(1−α/(2g); n−2)
RMR example
Xh     Ŷh        s{Ŷh}
60     1234.80   27.90
82     1390.11   24.80
103    1538.36   36.37
• Bonferroni 94% CI for mean:
• 1-α/(2g)=0.99
• B = t(0.99,42)=2.42
• (1234.80±2.42(27.90))
=(1167.28,1302.32)
• (1390.11±2.42(24.80))
=(1330.09,1450.13)
• (1538.36±2.42(36.37))
=(1450.34,1626.38)
Working-Hotelling vs Bonferroni procedures
• For large g WH provides tighter bounds and hence
is preferred
• Both WH and Bonferroni provide lower bounds
• When the levels of the predictor variable at which
mean estimation is of interest are not known a
priori WH is more appropriate since WH provides
simultaneous protection for all levels of X
Regression through origin
• The regression line may be forced to go
through the origin (or a particular value)
• Regression model: Yi = β1Xi + εi
• Then the point estimates are:
  b1 = Σ XiYi / Σ Xi²
  s² = MSE = Σ (Yi − Ŷi)² / (n − 1)
Regression through origin cont’d
• CI for the slope then is: b1 ± t(1−α/2; n−1)·s{b1}
• where s²{b1} = MSE / Σ Xi²
• CI for the mean response is: Ŷh ± t(1−α/2; n−1)·s{Ŷh}
• where s²{Ŷh} = MSE·Xh² / Σ Xi²
• CI for a predicted response is: Ŷh(new) ± t(1−α/2; n−1)·s{Ŷh(new)}
• where s²{Ŷh(new)} = MSE·(1 + Xh(new)² / Σ Xi²)
Example: Typographical errors
(problem 4.12)
• X – number of galleys for a manuscript
• Y – total dollar cost of correcting typographical errors
• Estimated regression function: Y = 18.03X
• 95% CI for the slope: 18.03±2.20(0.08)= (17.85,18.21)
• 95% prediction interval for X=10: (170.21,190.36)
• Note residuals do not sum to 0
Typographical errors example cont’d
• Test for lack of fit of regression through the origin: full
model (SLR), reduced model (regression through the
origin)
• F*=(223.42-219.99)/22.00=0.16
• F* is not in the rejection region for any meaningful test
and hence ok to use reduced model
  F* = {[SSE(R) − SSE(F)]/[(n−1) − (n−2)]} / [SSE(F)/(n−2)]
  If F* > F(1−α; 1, n−2) reject H0
Caveats of regression through the origin
• Sum of residuals is not zero
• SSE may exceed SSTO
• R2 may be negative
• The intervals for mean response and new prediction
widen at X far away from the origin
• Uncorrected total and corrected total sums of squares
are related as follows:
• SSTOU = SSRU + SSE
• where SSTOU = Σ Yi², SSRU = Σ Ŷi² = b1² Σ Xi², and SSE = Σ (Yi − b1Xi)²
Effects of measurement error
• Measurement errors in Y do not affect the estimates as long as these errors are uncorrelated and not biased
• Measurement error in X does affect the estimate of the slope by attenuating it (except in Berkson’s case). The magnitude of the bias depends on the relative sizes of the errors in X and in Y.
• When measurement error in X is present a special approach such as the use of instrumental variable approach is needed
Berkson’s model
• The observation Xi* is a fixed quantity
while the underlying Xi is a random
variable.
• In that case the errors have expectation 0
and the predictor variable is a constant
(therefore the errors are uncorrelated with
it) and hence SLR is valid and can be used.
Choice of X levels
• When the levels of X are under the control of the experimenter the following considerations should be used:
• If the main purpose of the regression analysis is to estimate the slope use well spread levels of X
• If the main purpose is to estimate the intercept the mean of X values should be 0.
• If the main purpose is to predict a new observation at Xh(new) the mean of the X values should be at Xh(new)
• Choose as many levels as needed to estimate shape (for example at least 2 for straight line, at least 3 for quadratic trend, etc)
Formulae to aid understanding of
selection of X levels
  σ²{b1} = σ² / Σ(Xi − X̄)²
  σ²{b0} = σ² [1/n + X̄² / Σ(Xi − X̄)²]
  σ²{Ŷh} = σ² [1/n + (Xh − X̄)² / Σ(Xi − X̄)²]
  σ²{Ŷh(new)} = σ² [1 + 1/n + (Xh(new) − X̄)² / Σ(Xi − X̄)²]
Lecture 5: Matrix representation of
simple linear regression
Matrices
• Matrix: a rectangular array of elements
• Dimension: r×c means r rows by c columns
• Example: A = [aij], i = 1,2,3; j = 1,2:
      a11  a12     3  5
  A = a21  a22  =  4  2
      a31  a32     1  0
• In general: A = [aij], i = 1,…r; j = 1,…c:
      a11  a12  ...  a1c
  A = a21  a22  ...  a2c
      ...
      ar1  ar2  ...  arc
Types of matrices
• A square matrix (r = c):
      a11  a12  a13     3  5  2
  A = a21  a22  a23  =  4  2  4
      a31  a32  a33     1  0  2
• Symmetric matrix (square, with the upper triangle a mirror image of the lower triangle):
      3  4  1
  A = 4  2  0
      1  0  2
Types of matrices cont’d
• Row vector: r = 1
• Column vector: c = 1
• Transpose of a matrix (A' or Aᵀ): A = [aij], A' = [aji]
• If A is an r by c matrix then A' is a c by r matrix
• Example:
      3  5  2          3  4  1
  A = 4  2  4,   A' =  5  2  0
      1  0  2          2  4  2
• Symmetric matrix: A = A'
Types of matrices cont’d
• Diagonal matrix: a square matrix with all
off-diagonal elements equal to 0
• Identity matrix: a diagonal matrix with all
diagonal elements equal to 1
• Note that A*I=I*A=A
• Scalar matrix: a diagonal matrix with all
diagonal elements equal to a scalar
• Examples:
      3  0  0        1  0  0         2  0  0
  A = 0  2  0,   I = 0  1  0,   2I = 0  2  0
      0  0  2        0  0  1         0  0  2
Types of matrices cont’d
• Vector of ones: 1 = (1, 1, …, 1)'
• Vector of zeros: 0 = (0, 0, …, 0)'
• Note that 1'1 = n
• J denotes the n×n matrix formed by multiplying a vector of ones by its transpose:
            1  1  ...  1
  J = 11' = 1  1  ...  1
            ...
            1  1  ...  1
Equality of matrices
• Two matrices are equal only if they have
the same dimensions and all corresponding
elements are the same
• That is, if A = [aij], i = 1,…r; j = 1,…c and B = [bij], i = 1,…k; j = 1,…l, then for A to be equal to B we need:
• k = r, l = c and aij = bij for all i and j
Operations on matrices: addition
  A = 1  3     B = 2  0
      5  4         1  1

  A + B = 1+2  3+0  =  3  3
          5+1  4+1     6  5
Operations on matrices: subtraction
  A − B = 1−2  3−0  =  −1  3
          5−1  4−1      4  3
Operations on matrices: multiplication
  A·B = (1)(2)+(3)(1)  (1)(0)+(3)(1)  =   5  3
        (5)(2)+(4)(1)  (5)(0)+(4)(1)     14  4
Cautions about matrix operations
• For addition and subtraction number of rows of
the matrices must be the same and number of
columns must be the same
• For multiplication number of columns of the first
matrix (left multiplier) must be the same as
number of rows of second matrix (right multiplier)
• Note that A + B = B + A but A*B ≠ B*A in
general
Inverse of a matrix
• Inverse for a square matrix A is A-1 such that:
• A A-1 = A-1 A = I
• A-1 is unique and has the same rank as A
• To have an inverse a matrix needs to be full rank
(or nonsingular)
• To check whether a matrix is of full rank we check whether its determinant is nonzero
Finding the inverse
• For a 2×2 matrix:
  A = a  b ,   A⁻¹ = (1/D) ·  d  −b ,   where D = ad − bc
      c  d                   −c   a
• Example:
  A = 1  4 ,   D = (1)(2) − (4)(3) = −10
      3  2
  A⁻¹ = (1/(−10)) ·  2  −4  =  −0.2   0.4
                    −3   1     0.3  −0.1
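The example can be checked numerically; a NumPy sketch (illustration only):

    import numpy as np

    A = np.array([[1.0, 4.0],
                  [3.0, 2.0]])
    print(np.linalg.det(A))          # -10: nonzero, so A is invertible
    A_inv = np.linalg.inv(A)         # [[-0.2, 0.4], [0.3, -0.1]]
    print(A @ A_inv)                 # identity matrix, up to rounding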
Simple linear regression in matrix terms
  Y = Xβ + ε, where
       Y1          1  X1                ε1
  Y =  Y2 ,   X =  1  X2 ,   β = β0 ,   ε = ε2
       ...         ...           β1         ...
       Yn          1  Xn                εn
  Y is n×1, X is n×2, β is 2×1, ε is n×1
  E(ε) = 0,  Var(ε) = σ²I
  E(Y) = Xβ
Regression parameter estimate in
matrix notation
• Normal equations: X'X b = X'Y
• Least squares estimates: b = (X'X)⁻¹ X'Y
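A minimal NumPy sketch of these normal equations (hypothetical data; solving X'Xb = X'Y directly is numerically preferable to forming the inverse):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # hypothetical data
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    X = np.column_stack([np.ones_like(x), x])        # design matrix
    b = np.linalg.solve(X.T @ X, X.T @ y)            # b = (X'X)^{-1} X'Y
    H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix
    y_fitted = H @ y                                 # equals X @ b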
Fitted values in matrix notation
  Ŷ = (Ŷ1, Ŷ2, …, Ŷn)' = Xb = X(X'X)⁻¹X'Y = HY, where H = X(X'X)⁻¹X'
• The H matrix is called the hat matrix
• H is symmetric and idempotent (HH = H)
• H is very useful in residual diagnostics
Residuals in matrix notation
  e = (e1, e2, …, en)' = Y − Ŷ = Y − HY = (I − H)Y
  s²{e} = MSE·(I − H)
• I − H is also symmetric and idempotent
Analysis of variance results in
matrix notation
  SSTO = Y'Y − (1/n)·Y'JY = Y'[I − (1/n)J]Y
  SSE = Y'Y − b'X'Y = Y'(I − H)Y
  SSR = b'X'Y − (1/n)·Y'JY = Y'[H − (1/n)J]Y
Inference about (i) regression coefficients, (ii) mean
response and (iii) prediction of a new observation
(i)  b = (X'X)⁻¹X'Y
     σ²(b) = σ²·(X'X)⁻¹
     s²(b) = MSE·(X'X)⁻¹
(ii) and (iii): let Xh' = (1, Xh). Then
     Ŷh = Xh'b
     s²{Ŷh} = MSE·Xh'(X'X)⁻¹Xh
     s²{Ŷh(new)} = MSE·(1 + Xh'(X'X)⁻¹Xh)
Example: Flavor deterioration
(problem 5.4 on p.210)
• Will work out in class
Lecture 6: Multiple Linear
Regression
Model formulation
• Yi is the response for the ith subject
• Xi1, Xi2, …Xi,p-1 are the values of the predictor variables for the ith subject
• β1, β2,… βp-1 are unknown parameters to be estimated from the data (they are also called partial regression coefficients)
• Regression model: Yi = β0 + β1Xi1 + β2Xi2 + … + βp−1Xi,p−1 + εi, i = 1,…n
• Regression (response) surface: E(Yi) = β0 + β1Xi1 + β2Xi2 + … + βp−1Xi,p−1
• E(εi) = 0, Cov(εi, εj) = 0 for i≠j, Var(εi) = σ² > 0
Comments on model formulation
• When p-1 = 1 multiple linear regression reduces to simple linear regression
• If we assume Xi0 = 1 then we can write the regression equation as follows:
  Yi = β0Xi0 + β1Xi1 + … + βp−1Xi,p−1 + εi = Σ (k=0 to p−1) βkXik + εi
• βk is interpreted as the mean change in Y per unit change in the kth predictor when the remaining predictors are held constant
• The predictors need not be all different variables
• When there are two different predictors the response surface is a plane
• For inference we require εi ~ N(0, σ²)
First order regression model
• All predictors represent different variables
• Some of the predictors can be qualitative (dummy variables): For example let X1 be age and
  X2 = 1 if male, 0 if female
• Then a first order regression model is:
  Yi = β0 + β1Xi1 + β2Xi2 + εi
  E(Yi) = β0 + β1Xi1 for females, and E(Yi) = (β0 + β2) + β1Xi1 for males
Polynomial regression
• All predictors are powers of the same variable:
  Yi = β0 + β1Xi1 + β2Xi2 + … + βp−1Xi,p−1 + εi, where Xi1 = Xi, Xi2 = Xi², …, Xi,p−1 = Xi^(p−1)
• This implies a curvilinear relationship between the predictor and the response
Regression with transformed
variables, interaction effects
• Ex: Transformation of the response: Yi' = log(Yi) = Σ (k=0 to p−1) βkXik + εi
• First order (two-way) interaction:
  Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + εi, where Xi3 = Xi1Xi2
• Any possible combinations of the above, as long as the model is linear in the parameters:
  Yi = β0ci0 + β1ci1 + β2ci2 + … + βp−1ci,p−1 + εi
• Example of nonlinear regression: Yi = exp(β0 + β1Xi) + εi
Example: Lung transplantation
data (Altman et al, p.364)
• It is difficult to measure total lung capacity (TLC) and hence it is useful to be able to predict TLC from other information.
• The data set contains measurements of pre-transplant TLC of 32 recipients of heart-lung transplant, obtained by whole-body plethysmography, and their age, sex and height.
• Sample data:
• Age Sex Height(cm) TLC(l)
• 35 F 149 3.40
• 11 F 138 3.41
• 12 M 148 3.80
Example cont’d:
[Scatter plot matrix of TLC, height, sex, and age]
Model formulation in matrix notation
  Yi = β0 + β1Xi1 + β2Xi2 + … + βp−1Xi,p−1 + εi, i = 1,…n
  can be rewritten as
  Yi = Xi'β + εi,
  where β = (β0, β1, …, βp−1)' and Xi' = (1, Xi1, Xi2, …, Xi,p−1)
Model formulation in matrix notation cont’d
  For the ith subject: Yi = Xi'β + εi
  For all subjects: Y = Xβ + ε, where Y is n×1, X is n×p, β is p×1, and ε is n×1:
       Y1          1  X11   ...  X1,p−1           β0            ε1
  Y =  Y2 ,   X =  1  X21   ...  X2,p−1 ,   β =   β1 ,    ε =   ε2
       ...         ...                            ...           ...
       Yn          1  Xn1   ...  Xn,p−1           βp−1          εn
• Y – vector of responses
• β – vector of parameters
• X – matrix of constants (design matrix)
• ε ~ N(0, σ²I) and hence Y ~ N(Xβ, σ²I)
Estimation of regression coefficients
• Least squares estimates are obtained by minimizing the sum of squared vertical distances from the points to the regression surface:
  Q = Σ (Yi − β0 − β1Xi1 − … − βp−1Xi,p−1)²
• Denote the vector of the least squares estimated regression coefficients as b = (b0, b1, …, bp−1)'
• Least squares normal equations: X'X b = X'Y
• Least squares estimates (!mistake in textbook – eq 6.25):
  b (p×1) = (X'X)⁻¹ X'Y
• Maximum-likelihood estimates are the same
Fitted values and residuals
  Ŷ = (Ŷ1, …, Ŷn)' = Xb = X(X'X)⁻¹X'Y = HY, where H = X(X'X)⁻¹X'
  e = (e1, …, en)' = Y − Ŷ = Y − Xb = (I − H)Y
  Var{e} = σ²(I − H),  s²{e} = MSE·(I − H)
Sums of squares and mean squares
  SSR = b'X'Y − (1/n)·Y'JY = Y'[H − (1/n)J]Y,   MSR = SSR/(p−1)
  SSE = Y'Y − b'X'Y = Y'(I − H)Y,               MSE = SSE/(n−p)
  SSTO = Y'Y − (1/n)·Y'JY = Y'[I − (1/n)J]Y
ANOVA table
--------------------------------------------------------------------
Source of
variation df SS MS F
--------------------------------------------------------------------
Regression p-1 SSR MSR MSR/MSE
Error n-p SSE MSE
--------------------------------------------------------------------
Total n-1 SSTO
F-test for regression relation
• H0: β1= β2=…= βp-1 =0
• Ha: not all βk (k=1,…p-1) equal to 0
• TS: F* = MSR/MSE
• Rejection region: If F* > F(1-α;p-1,n-p) reject H0
• Note that under H0 E(MSR)=E(MSE)=σ2
• And under Ha E(MSR)>σ2 and hence the test has intuitive sense
• When p-1=1 the test reduces to the F-test in SLR
Coefficients of multiple determination
and correlation
• Usual definition: R² = SSR/SSTO = 1 − SSE/SSTO
• Note that the coefficient is between 0 and 1 and increases as variables are added to the model
• Modified measure that adjusts for the number of variables in the model:
• Adjusted R-square: Ra² = 1 − [SSE/(n−p)] / [SSTO/(n−1)] = 1 − [(n−1)/(n−p)]·(SSE/SSTO)
• Adjusted R-square can decrease as the number of variables increases
• Coefficient of multiple correlation: R = √R²
Inferences about regression parameters
• The LSE of the regression parameters are unbiased: E(b)=β
• Variance is σ2{b}=σ2 (X’X)-1
• Estimated variance is s2{b}=MSE(X’X)-1
• 100(1-α)% CI for βk is bk±t(1- α/2;n-p)s{bk}
• HT: H0: βk=0 vs Ha: βk≠0
• TS: t* = bk/s{bk}
• Rejection region: |t*|>t(1- α/2;n-p)
• Joint inferences on several βk’s: bk±Bs{bk} where B=t(1- α/(2g);n-p)
Estimation of mean response
• Let Xh' = (1, Xh1, …, Xh,p−1)
• Mean response: E{Yh} = Xh'β
• Point estimator: Ŷh = Xh'b
• Expectation: E{Ŷh} = Xh'β = E{Yh}
• Variance: σ²{Ŷh} = σ²·Xh'(X'X)⁻¹Xh
• Estimated variance: s²{Ŷh} = MSE·Xh'(X'X)⁻¹Xh
• 100(1−α)% CI: Ŷh ± t(1−α/2; n−p)·s{Ŷh}
Other inference regarding mean response
• Confidence region for regression surface (an extension of Working-Hotelling band). Also used for simultaneous confidence interval for several mean responses:
• Bonferroni simultaneous confidence intervals for several mean responses:
  Working-Hotelling: Ŷh ± W·s{Ŷh}, where W² = p·F(1−α; p, n−p)
  Bonferroni: Ŷh ± B·s{Ŷh}, where B = t(1−(α/(2g)); n−p)
Prediction of new observation
• Individual response: Yh(new) = Xh'β + εh(new)
• Point estimator: Ŷh(new) = Xh'b
• Estimated variance: s²{pred} = MSE·[1 + Xh'(X'X)⁻¹Xh]
• 100(1−α)% CI: Ŷh(new) ± t(1−α/2; n−p)·s{pred}
Caution about hidden extrapolations
• Picture from textbook
Diagnostic and remedial measures
• Most of the diagnostics and remedial procedures
from simple linear regression carry over to
multiple linear regression
• Scatter plot matrix: a collection of scatter plots
between predictors and response variables. It is
useful to assess bivariate relationships and identify
outliers.
• Correlation matrix: a matrix of correlation
coefficients
Residual plots
• Plot of residuals against fitted values is useful for assessing the appropriateness of the multiple linear regression function, the constancy of the variances of the error terms and the presence of outliers
• Normal probability plots of the residuals can help assess normality of errors.
• Residual plots against each predictor variable can help assess the adequacy of the regression model with respect to that variable.
• Residuals can also be plotted against variables, or interactions of variables not in the model to assess whether these variables/interactions are needed.
• Brown-Forsythe test and Breusch-Pagan test can be applied for a particular predictor variable suspected to be associated with increase or decrease of error variance.
• F-test for lack of fit is also applied as in SLR except that we now require replications over all predictor variables simultaneously.
Remedial measures
• Remedial measures in MLR are applied as
in SLR:
• Definition of more complex polynomial or
interaction models
• Transformations on response and predictor
variables
• Box-Cox transformation
Lecture 8: Statistical inference in
multiple regression
Extra sums of squares
• Due to the relationship among predictor variables
and between the predictors and the response, it is
possible that the relationship between one
predictor and the response is affected by other
predictors in the model
• SSE decreases as more and more predictors are
added to the model
• The extra sum of squares measures such marginal
reduction in SSE
TLC example (extra sums of squares)
• Suppose we first consider gender (X1) as a predictor of TLC: Then SSE (X1) =56.40, SSR(X1)=25.31
• Then suppose we add height (X2) to the model: Then
SSE (X1, X2) =38.92, SSR(X1, X2) =42.79
• We see that SSE (X1) > SSE (X1, X2) and SSR (X1) < SSR (X1, X2), that is residual variability decreases by adding a variable to the model while systematic variability increases
• SSR(X2|X1) is called the extra sum of squares when adding X2 to the model that already contains X1:
SSR(X2|X1) = SSR (X1 , X2) - SSR (X1)=42.79-25.31=17.48
SSR(X2|X1) = SSE (X1) - SSE (X1, X2)=56.40-38.92=17.48
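This bookkeeping is easy to reproduce from two fits; a hedged Python sketch with hypothetical stand-ins for sex (x1), height (x2) and TLC (y):

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.integers(0, 2, 32).astype(float)        # hypothetical sex indicator
    x2 = rng.normal(165, 10, 32)                     # hypothetical height
    y = 1.0 + 0.5 * x1 + 0.03 * x2 + rng.normal(size=32)

    def sse(X, y):
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        r = y - X @ b
        return float(r @ r)

    one = np.ones_like(y)
    sse_1 = sse(np.column_stack([one, x1]), y)           # SSE(X1)
    sse_12 = sse(np.column_stack([one, x1, x2]), y)      # SSE(X1, X2)
    ssr_2_given_1 = sse_1 - sse_12                       # SSR(X2|X1)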
Extra sums of squares cont’d
• SSR(X2|X1) measures the reduction in errors
variability (SSE) when X2 is added to the model that
already includes X1
• Thus it measures the marginal effect of adding X2 to
the model that already includes X1
  SSR(X2|X1) = SSE(X1) − SSE(X1, X2)
  Since SSTO = SSR + SSE, equivalently
  SSR(X2|X1) = SSR(X1, X2) − SSR(X1)
Extra Sums of Squares cont’d
• Additionally:
• SSR(X2,X3|X1) measures the reduction in
errors variability (SSE) when X2 and X3 are
added to the model that already includes X1
  SSR(X2, X3|X1) = SSE(X1) − SSE(X1, X2, X3)
  Since SSTO = SSR + SSE, equivalently
  SSR(X2, X3|X1) = SSR(X1, X2, X3) − SSR(X1)
Extra Sums of Squares
• Also:
• SSR(X3|X1, X2) measures the reduction in
errors variability (SSE) when X3 is added to
the model that already includes X1 and X2
  SSR(X3|X1, X2) = SSE(X1, X2) − SSE(X1, X2, X3)
  SSR(X3|X1, X2) = SSR(X1, X2, X3) − SSR(X1, X2)
Decomposition of SSR into extra sum of
squares
• Different decompositions can be considered:
  SSR(X1, X2, X3) = SSR(X1) + SSR(X2|X1) + SSR(X3|X1, X2)
  SSR(X1, X2, X3) = SSR(X2) + SSR(X3|X2) + SSR(X1|X2, X3)
  SSR(X1, X2, X3) = SSR(X1) + SSR(X2, X3|X1)
  …
• MSR(X3|X1, X2) = SSR(X3|X1, X2)/1
• MSR(X2, X3|X1) = SSR(X2, X3|X1)/2
• That is, an extra sum of squares for q additional variables is associated with q degrees of freedom
ANOVA table with decomposition
of sums of squares
-----------------------------------------------------------------------------------------
Source of
variation df SS MS F
-----------------------------------------------------------------------------------------
Regression p-1 SSR MSR MSR/MSE
X1 1 SSR(X1) MSR(X1) MSR(X1)/MSE
X2|X1 1 SSR(X2|X1) MSR(X2|X1) MSR(X2|X1)/MSE
X3|X1,X2 1 SSR(X3|X1,X2) MSR(X3|X1,X2) MSR(X3|X1,X2)/MSE
…
Xp-1|X1,…Xp-2 1 SSR(Xp-1|X1,…Xp-2) MSR(Xp-1|X1,…Xp-2) MSR(Xp-1|X)/MSE
Error n-p SSE MSE
-----------------------------------------------------------------------------------------
Total n-1 SSTO
Tests of multiple regression coefficients
  H0: βq = βq+1 = … = βp−1 = 0
  Ha: at least one of βq, …, βp−1 is not 0
  This is equivalent to testing the following reduced vs full model:
  R: Yi = β0 + β1Xi1 + … + βq−1Xi,q−1 + εi
  F: Yi = β0 + β1Xi1 + … + βq−1Xi,q−1 + βqXiq + … + βp−1Xi,p−1 + εi
  Test statistic: F* = {[SSE(R) − SSE(F)]/(dfR − dfF)} / [SSE(F)/dfF]
     = {[SSE(X1, …, Xq−1) − SSE(X1, …, Xp−1)]/(p − q)} / [SSE(X1, …, Xp−1)/(n − p)]
     = MSR(Xq, …, Xp−1 | X1, …, Xq−1) / MSE(X1, …, Xp−1) ~ F(p − q, n − p)
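A generic sketch of this test in Python (hypothetical helper; X_reduced and X_full are design matrices with intercept columns, the reduced columns a subset of the full ones):

    import numpy as np
    from scipy import stats

    def general_linear_test(X_reduced, X_full, y):
        def sse(X):
            b = np.linalg.lstsq(X, y, rcond=None)[0]
            r = y - X @ b
            return float(r @ r)
        n = len(y)
        df_r, df_f = n - X_reduced.shape[1], n - X_full.shape[1]
        f_star = ((sse(X_reduced) - sse(X_full)) / (df_r - df_f)) \
                 / (sse(X_full) / df_f)
        p = stats.f.sf(f_star, df_r - df_f, df_f)
        return f_star, p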
Example (TLC data)
• Testing whether age contributes to the model after
accounting for height and sex:
• Can do t-test or F-test:
• T-test: H0: β3 = 0 vs Ha: β3 ≠ 0
  Test statistic: t* = b3/s{b3} = 0.025/0.024 = 1.06
• F-test:
  R: TLC = β0 + β1·height + β2·sex + ε
  F: TLC = β0 + β1·height + β2·sex + β3·age + ε
  Test statistic: F* = MSR(age|height, sex)/MSE(height, sex, age) = (1.51/1)/(37.40/28) = 1.51/1.34 = 1.13
  RR: F* > F(0.95; 1, 28) = 4.2
• Conclusion: Fail to reject H0
Example (TLC data cont’d)
• Testing whether age and sex contribute to the model
after accounting for height:
• Only F-test is appropriate:
  R: TLC = β0 + β1·height + ε
  F: TLC = β0 + β1·height + β2·sex + β3·age + ε
  Test statistic: F* = MSR(sex, age|height)/MSE(height, sex, age) = [(42.16 − 37.40)/2]/(37.40/28) = 2.38/1.34 = 1.78
  RR: F* > F(0.95; 2, 28) = 3.3
• Conclusion: Fail to reject H0
Type I and Type III sums of squares
• Type I sums of squares are sequential and order-dependent, they adjust only for variables already in the model
• In TLC example Type I SS in SAS table are as follows: SSR(height), SSR(sex|height), SSR(age|height,sex)
• If we had specified the variable in different order in the model statement we would have gotten different type I SS: for example “model TLC= height age sex” leads to SSR(height), SSR(age|height), SSR(sex|age,height)
• Type III sums of squares are order-independent (they adjust for all of the remaining variables in the model)
• In TLC example the type III sums of squares are SSR(height|sex,age), SSR(sex|age,height), SSR(age|height,sex)
Coefficients of partial determination
• Recall that coefficient of multiple determination R2 measures the proportionate reduction in the variation of Y achieved by all X variables
• Coefficients of partial determination measure the contribution of one X variable once all the remaining variables are already in the model
• The coefficient of partial determination between Y and X1 given that X2 is in the model measures:
• the relative reduction in SSE when X1 is added to the model that already contains X2
• the relative marginal reduction in the variation in Y associated with X1when X2 is already in the model
  Model: Yi = β0 + β1Xi1 + β2Xi2 + εi
  R²Y1|2 = [SSE(X2) − SSE(X1, X2)] / SSE(X2) = SSR(X1|X2) / SSE(X2)
Coefficients of partial determination cont’d
• The coefficient of partial determination
between Y and X1 can be interpreted as a
coefficient of simple determination between
the residuals from the regression of Y on X2
and the residuals from the regression of X1
on X2
Coefficients of partial determination and
correlation
  Model: Yi = β0 + β1Xi1 + … + βp−1Xi,p−1 + εi
  R²Yk|1,…,k−1,k+1,…,p−1 = SSR(Xk | X1, …, Xk−1, Xk+1, …, Xp−1) / SSE(X1, …, Xk−1, Xk+1, …, Xp−1)
• Coefficients of partial correlation: rYk|1,…,k−1,k+1,…,p−1 = ±√(R²Yk|1,…,k−1,k+1,…,p−1)
• In TLC example R²Y,age|height,sex = 1.51/38.92 = 0.04
• rY,age|height,sex = √0.04 = 0.2
Standardized multiple regression model
• It is desirable to be able to compare the magnitude
of the regression coefficients and based on that to
judge the relative importance of the predictor
variables
• However the magnitude of the regression
coefficients depends on the scale of X and hence
rescaling is in order
• Also, standardizing the regression coefficients
improves numerical stability
Correlation transformation
• All X and Y variables are standardized by
subtracting the sample mean and dividing by the
sample standard deviation:
  Yi* = (1/√(n−1)) · (Yi − Ȳ)/sY, where sY² = Σ(Yi − Ȳ)²/(n−1)
  Xik* = (1/√(n−1)) · (Xik − X̄k)/sk, where sk² = Σ(Xik − X̄k)²/(n−1), k = 1, …, p−1
Standardized regression model
• Regression model in terms of the standardized variables
• Note that there is no intercept in this model
• Coefficients in SRM are interpreted as estimated change in standard deviations in Y for one standard deviation change in X
Yi* = β1*Xi1* + ... + β*_{p-1}X*_{i,p-1} + εi*
It can be shown that
βk = (sY/sk) βk*,  k = 1, ..., p-1
β0 = Ȳ - β1X̄1 - ... - β_{p-1}X̄_{p-1}
185
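A sketch of the correlation transformation and the back-transformation of the standardized coefficients, on simulated data (all names and values are illustrative):

import numpy as np

def correlation_transform(v):
    # subtract the mean, divide by the standard deviation, scale by 1/sqrt(n-1)
    return (v - v.mean(axis=0)) / (np.sqrt(len(v) - 1) * v.std(axis=0, ddof=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=50)

Xs, ys = correlation_transform(X), correlation_transform(y)

# The standardized model has no intercept
b_star, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

# Back-transform: beta_k = (s_Y / s_k) * beta_k^*
b = b_star * y.std(ddof=1) / X.std(axis=0, ddof=1)
print(b_star, b)  # b should be close to (2.0, -1.5)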
Polynomial regression (one variable)
• All variables are powers of the same variable
• This implies curvilinear relationship between the predictor and the response when p > 2
Yi = β0 + β1Xi1 + β2Xi2 + ... + β_{p-1}X_{i,p-1} + εi,
where Xi2 = Xi1², ..., X_{i,p-1} = Xi1^{p-1}
186
Polynomial regression: applicability and
caution
• Polynomial regression models are useful when the true curvilinear function is indeed polynomial, or when polynomial is a good approximation
• Extrapolation outside of the range of the data is dangerous, especially with high-order polynomials
• Data that consists of n distinct X values can always be fitted perfectly with a polynomial of degree n-1
• Polynomial regression models may contain one, two or more continuous predictor variables, and each predictor may enter at a different power
• Each predictor is often centered to remove the high correlation between linear and quadratic terms (e.g. X and X² are usually highly correlated)
187
Polynomial regression models for a
single predictor variable
188
Polynomial models for a single
predictor variable cont’d
Second-order model: Yi = β0 + β1xi + β11xi² + εi, where xi = Xi - X̄
Third-order model: Yi = β0 + β1xi + β11xi² + β111xi³ + εi, where xi = Xi - X̄
189
Second-order polynomial regression models
with 2 predictors
Yi = β0 + β1xi1 + β2xi2 + β11xi1² + β22xi2² + β12xi1xi2 + εi
where xi1 = Xi1 - X̄1 and xi2 = Xi2 - X̄2
E(Y) = β0 + β1x1 + β2x2 + β11x1² + β22x2² + β12x1x2 defines a conic section
190
Second-order polynomial regression models
with 3 predictors
Yi = β0 + β1xi1 + β2xi2 + β3xi3 + β11xi1² + β22xi2² + β33xi3² + β12xi1xi2 + β13xi1xi3 + β23xi2xi3 + εi
where xi1 = Xi1 - X̄1, xi2 = Xi2 - X̄2, and xi3 = Xi3 - X̄3
191
Hierarchical approach to fitting
polynomial models
• Polynomial models are special cases of MLR and hence can be fitted using the usual approach
• If a higher order term is present in a regression model, the lower order terms need also to be present (e.g. quadratic model should have a linear term)
• It is usually of interest to test whether a simpler polynomial model adequately represents the data. This can be achieved using the extra sums of squares approach
192
Hierarchical approach to fitting
polynomial models cont’d
For example, consider testing whether the cubic term is needed in a polynomial regression model:
Model: Yi = β0 + β1xi + β11xi² + β111xi³ + εi
SSR = SSR(x) + SSR(x² | x) + SSR(x³ | x, x²)
To check whether β111 = 0, use F* = MSR(x³ | x, x²)/MSE
To check whether β11 = 0 and β111 = 0, use F* = MSR(x², x³ | x)/MSE
193
Comments on polynomial regression
• Expensive in terms of degrees of freedom
• Collinearity may still exist and then
orthogonal polynomials can be used
• Test of lower order effect is meaningless
when a higher order term is present
194
Muscle mass example
To explore the relationship between muscle mass
and age in women, a nutritionist randomly
selected 15 women from each 10-year age group,
beginning with age 40 and ending with age 79. X
is age and Y is a measure of muscle mass
Consider polynomial regression
195
Muscle mass example cont'd
• We find (in terms of centered age x = age - 59.983): Y = 82.936 - 1.184x + 0.015x².
• R² = SSR/SSTO = 11,830.621/15,501.933 = 0.763.
• Test whether or not there is a regression relation (α = .01).
H0: β1 = β2 = 0
Ha: at least one is not zero
TS: F* = MSR/MSE = 5,915.3/64.409 = 91.84
p-value < .0001
196
Muscle mass example continued
197
Muscle mass example cont’d
• Test whether the quadratic term can be dropped from the model
R: Y = ß0 + ß1 X
F: Y = ß0 + ß1 X + ß2X2
TS: F* = MSR(X2|X)/MSE(X,X2)= 203.135/64.409 = 3.154
RR: F* > F(0.95, 1, 57) = 4.01, R model is adequate
• Express the fitted regression function in terms of age:
Y = β0 + β1(age - 59.983) + β2(age - 59.983)²
We want Y = β0* + β1* age + β2* age²
β0* = β0 - β1(59.983) + β2(59.983)² = 82.936 + 1.184(59.983) + 0.015(59.983)² = 207.926
β1* = β1 - 2β2(59.983) = -1.184 - 2(0.015)(59.983) = -2.983
β2* = β2 = 0.015
198
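A sketch of the same centering-and-back-transformation steps, assuming hypothetical arrays age and mass (the data are not reproduced on the slides):

import numpy as np
import statsmodels.api as sm

x = age - age.mean()                      # centered age
fit = sm.OLS(mass, sm.add_constant(np.column_stack([x, x ** 2]))).fit()
b0, b1, b2 = fit.params

# Convert the fitted function back to the uncentered scale
c = age.mean()
b0_star = b0 - b1 * c + b2 * c ** 2
b1_star = b1 - 2 * b2 * c
b2_star = b2
print(b0_star, b1_star, b2_star)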
Lecture 8: Multicollinearity and
interaction models
199
Multicollinearity
• In MLR frequently asked questions are:
1. What is the relative importance of the effects of different predictor variables?
2. What is the magnitude of the effect of a given predictor variable on the response variable?
3. Can any predictor variable be dropped from the model because it has little or no effect on the response variable?
4. Should any predictor variable not yet included in the model be considered for inclusion?
These questions have easy answers when the predictor variables are not intercorrelated (correlated with one another)
200
Uncorrelated predictor variables
• When the predictors are uncorrelated the estimated effects are the same no matter what other variables are in the model.
• Also SSR(X1|X2)=SSR(X1) and SSR(X2|X1)=SSR(X2) (That is, type III SS = type I SS)
• So direct comparisons of standardized regression coefficients and simple t-tests for each predictor variable can help answer questions 1 through 4.
201
Effects of multicollinearity
• Our ability to obtain a good fit, and to estimate the mean response and predict individual responses, is not inhibited
• However estimated regression coefficients have large sampling variability
• Common interpretation of regression coefficients as the effect of one predictor while holding the others constant may not be realistic, since we may not be able to vary one variable while keeping another variable highly correlated with it constant.
202
Body fat example (from
textbook)
• Study of the relationship between amount of body fat (Y) and several possible predictors: triceps skinfold thickness (X1), thigh circumference (X2), and midarm circumference (X3).
• Predictors are highly intercorrelated: r(X1, X2) = 0.924, r(X1, X3) = 0.458, r(X2, X3) = 0.085, and R² = 0.998 when X3 is regressed on X1 and X2
• Regression coefficients vary widely according to what other predictors are in the model:

Variables in model    b1      b2
X1                    0.86    --
X2                    --      0.86
X1, X2                0.22    0.66
X1, X2, X3            4.33   -2.86
203
Body fat example cont’d
• Type I and Type III SS can be very different:
SSR(X1)=352.27, SSR(X1|X2)=3.47.
• The interpretation in this example is that by itself
triceps skinfold thickness is an important predictor
of body fat but it does not add much new
information after thigh circumference is accounted
for (and vice versa)
204
Body fat example cont’d
• Multicollinearity also affects standard errors of the
regression coefficients
Variables in model    s{b1}   s{b2}
X1                    0.13    --
X2                    --      0.11
X1, X2                0.30    0.29
X1, X2, X3            3.02    2.58
205
Body fat example cont’d
• Multicollinearity does not significantly
affect the fitted values and predictions
Variables in model    Fitted Y at X1=25.0      s{fitted Y} at X1=25.0
                      (X2=50.0, X3=29.0)       (X2=50.0, X3=29.0)
X1                    19.93                    0.63
X1, X2                19.36                    0.62
X1, X2, X3            19.19                    0.62
206
Interaction models with
quantitative variables
Model with additive effects:
E{Y} = f1(X1) + f2(X2) + ... + fp(Xp)
E.g.: E{Y} = β0 + β1X1 + β11X1² + β2X2
Model with multiplicative effects (interactions):
E.g.: E{Y} = β0 + β1X1 + β2X2 + β12X1X2
E{Y} = β0 + β1X1 + β2X2 + β3X3 + β12X1X2 + β13X1X3 + β23X2X3
207
Interpretation of regression coefficients in
interaction models with quantitative variables
• Model: Yi = β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2 + εi
• Intercept for the relationship between Y and X1: β0 + β2Xi2
• Slope for the relationship between Y and X1: β1 + β3Xi2
• Intercept for the relationship between Y and X2: β0 + β1Xi1
• Slope for the relationship between Y and X2: β2 + β3Xi1
208
Example:
a. E{Y} = 10 + 2X1 + 5X2
b. E{Y} = 10 + 2X1 + 5X2 + 0.5X1X2
c. E{Y} = 10 + 2X1 + 5X2 - 0.5X1X2
209
Comments on interactive effects for quantitative variables
• Curvilinear effects may be present
• Since some of the predictors may be highly correlated with the interaction terms, it is a good idea to center the predictor variables
• When the number of predictor variables in the regression model is large, the potential number of interactions may be large. Either a priori knowledge or residual plots of fitted values (based on the main effects model) vs interaction terms can be used then to guide the choice of interaction terms to include in the model.
• F-tests based on extra sums of squares can be used to test whether interaction effects are needed in the model.
210
Models with qualitative predictors
• Qualitative predictors with 2 classes (binary):
  X1 = 1 if female, 0 if male
• Qualitative predictors with 3 or more classes (categorical nominal):
  X1 = 1 if more than high school education, 0 otherwise
  X2 = 1 if less than high school education, 0 otherwise
• A nominal categorical variable with c classes is represented by c-1 indicator (dummy) variables
211
Interpretation of regression
coefficients
• Example: X1 - age, X2 – gender (X2=1 if female)
• β1 – common slope
• β0 – intercept for males
• β0+ β2 – intercept for females
• First-order model implies parallel regression lines at each level of the categorical predictor
Yi = β0 + β1Xi1 + β2Xi2 + εi
E(Yi) = β0 + β1Xi1 for males, and
E(Yi) = (β0 + β2) + β1Xi1 for females
212
Interpretation of regression
coefficients cont’d
• Example: X1 - age, X2 and X3 – education level (X2=1 if > HS, X3=1 if < HS)
• β1 – common slope
• β0 – intercept for HS
• β0+ β2 – intercept for > HS
• β0+ β3 – intercept for < HS
• First-order model implies parallel regression lines at each level of the categorical predictor
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + εi
E(Yi) = β0 + β1Xi1 for HS
E(Yi) = (β0 + β2) + β1Xi1 for > HS
E(Yi) = (β0 + β3) + β1Xi1 for < HS
213
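A sketch of fitting such a model, assuming a hypothetical DataFrame df with columns Y, age and a three-level factor educ taking values "HS", ">HS", "<HS"; patsy generates the c-1 dummies automatically:

import statsmodels.formula.api as smf

# patsy expands C(educ, ...) into c-1 = 2 indicator variables, with "HS"
# as the reference category absorbed into the intercept
fit = smf.ols('Y ~ age + C(educ, Treatment(reference="HS"))', data=df).fit()
print(fit.params)  # common age slope plus an intercept shift per education level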
Considerations in using indicator variables
• Allocation codes may be used in some cases instead of indicator variables: e.g. education defined as +1 if >HS, 0 if HS, -1 if <HS, and then treating the variable as continuous for the purposes of regression. That, however, implies a metric which may not correspond to reality
• Sometimes quantitative variables may be used based on intervals defined for qualitative variables
• Different type of coding may be used: for example X1=1 if female, -1 if male
• Intercept may be dropped and c indicator variables may be used for a categorical variable with c classes
214
Interactions between qualitative and
quantitative predictors
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2 + εi
E(Yi) = β0 + β1Xi1 for males, and
E(Yi) = (β0 + β2) + (β1 + β3)Xi1 for females
This implies that the regression lines for the two levels of the qualitative variable intersect (are not parallel)
Testing β3 = 0 is equivalent to testing for additive effects
215
Comments
• We can have models with any combination
of quantitative and qualitative variables
• If interactions are present the corresponding
main effects should also be included
216
Comparison of two or more
regression functions
• Soap production line example:
• Y – amount of scrap
• X1 – line speed
• X2 – production line (1 if line 1, 0 if line 2)
• Model: Yi = β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2 + εi
• Fitted model: Ŷ = 7.57 + 1.322X1 + 90.39X2 - 0.1767X1X2
217
Soap production example cont’d
218
Soap production example cont’d
• One important assumption to be able to
estimate the regression relationship for
both production lines simultaneously is that
the residual variances for the two
production lines are the same
• Hence perform Brown-Forsythe test for
equality of variance
219
Soap production example cont’d
220
Soap production example cont’d
H0: Equal variances for the 2 production lines
Ha: unequal variances
s² = (2,952.20 + 2,045.82)/25 = 199.921, so s = 14.139
t*BF = (16.132 - 12.648) / [14.139 √(1/15 + 1/12)] = 0.636
RR: |t*BF| > t(0.975; 25) = 2.06
Conclusion: Fail to reject H0
221
Soap production example cont’d
222
Soap production example cont’d
• Test whether the two regression lines are
identical
H0: β2 = β3 = 0
Ha: at least one of β2 or β3 is not zero
TS: F* = [SSR(X2, X1X2 | X1)/2] / [SSE(X1, X2, X1X2)/(n - 4)]
RR: F* > F(0.99; 2, 23) = 5.67
Since in the example F* = [(18,694 + 810)/2] / (9,904/23) = 22.65 > 5.67,
Reject H0: the lines are not identical
223
Soap production example cont’d
• Test whether the slopes of the two
regression lines are the same
H0: β3 = 0
Ha: β3 ≠ 0
TS: F* = SSR(X1X2 | X1, X2) / [SSE(X1, X2, X1X2)/(n - 4)]
RR: F* > F(0.99; 1, 23) = 7.88
Since in the example F* = 810/(9,904/23) = 1.88 < 7.88,
Fail to reject H0: the lines are parallel
224
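Both tests can be phrased as nested-model comparisons; a minimal sketch assuming a hypothetical DataFrame df with columns scrap, speed and a 0/1 indicator line:

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# speed * line expands to speed + line + speed:line
full = smf.ols("scrap ~ speed * line", data=df).fit()

# Identical lines: H0: beta2 = beta3 = 0
print(anova_lm(smf.ols("scrap ~ speed", data=df).fit(), full))

# Parallel lines (equal slopes): H0: beta3 = 0
print(anova_lm(smf.ols("scrap ~ speed + line", data=df).fit(), full))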
Lecture 10: Model selection and
validation
Applied Regression Analysis (BIS623a)
Fall 2005
Instructor: Ralitza Gueorguieva
225
Model-building process
• Data collection and preparation
• Reduction of explanatory variables
• Model refinement and selection
• Model validation
226
Data collection and preparation
• Data collection varies by type of study:
- Controlled experiments
- Controlled experiments with covariates
- Confirmatory observational studies
- Exploratory observational studies
- exclude explanatory variables not fundamental to the problem
- may exclude variables subject to large measurement error
- exclude duplicate variables
Data preparation involves edit checks, plots to identify gross errors
Preliminary model investigation: identify functional form of explanatory variables, important interactions, may rely on prior knowledge
227
Reduction of explanatory
variables
• Omitting important variables may bias estimates and may damage the explanatory power of the model
• Overfitted model may result in large variances of estimates of parameters
• Variable subset should be manageable and large enough for adequate description
228
Model refinement and selection
• Tentative regression model or several good
regression models need to be checked for
curvature and interaction effects. Residual plots,
formal tests for violations of assumptions and lack
of fit can be used
• Remedial actions may need to be applied
229
Model validation
• Model validity refers to:
- the stability and reasonableness of the regression
coefficients
- usability of the regression function
- ability to generalize inferences
230
Criteria for model selection
• Examination of all models is virtually impossible
(For P-1 potential predictors there are 2^(P-1) possible models, not considering interactions or higher order terms among the predictors)
• Model selection procedures are used to identify a
small group of regression models that are “good”
according to a specified criterion
• These “good” models are then examined in detail to
come up with “best” model(s)
• We will consider 6 criteria: R²p, R²a,p, Cp, AICp, SBCp, PRESSp
231
Notation and assumptions
• Number of potential X variables: P-1
• All models contain an intercept β0
• The number of X variables in a subset is denoted by p-1, 1≤p≤P
• Number of observations n is greater than maximum number of potential parameters n>P
• It is highly desirable that n is much larger than P (n>>P)
232
R²p or SSEp criterion
• R²p is based on p parameters, p-1 variables:
  R²p = 1 - SSEp/SSTO
• The goal is to identify models with a high coefficient of determination
• R²p always increases as we add variables to the model, but a small increase may not be worthwhile
233
R²a,p or MSEp criterion
• R²p does not take into account the number of variables in the model
• The adjusted coefficient of determination R²a,p is a better choice:
  R²a,p = 1 - [(n-1)/(n-p)] SSEp/SSTO = 1 - MSEp/[SSTO/(n-1)]
• The adjusted coefficient of determination does not always increase; it increases only if MSEp decreases
234
Mallows’ Cp criterion
Goal: minimize the total mean squared error of the n fitted values for each subset regression model
Deviations of the fitted values from the true mean responses: Ŷi - μi
Ŷi - μi = (Ŷi - E(Ŷi)) + (E(Ŷi) - μi) = random error + bias
E(Ŷi - μi)² = σ²{Ŷi} + (E(Ŷi) - μi)² = variance of Ŷi + squared bias
Total mean squared error: Σ E(Ŷi - μi)² = Σ σ²{Ŷi} + Σ (E(Ŷi) - μi)²
Cp = SSEp/MSE(X1, ..., X_{P-1}) - (n - 2p)
235
Mallows’ Cp criterion cont’d
When E{Ŷi} = μi (no bias in the subset model), E(Cp) ≈ p
• Models without bias will fall near the line Cp=p
• Models above the line show substantial bias
• Models below the line are there due to random error
• Hence we are looking for values of Cp that are small and near the line Cp=p
• For the full model containing all P-1 variables, Cp = P by definition
• The choice of the P-1 potential variables is very important, since whether MSE(X1, ..., X_{P-1}) is an unbiased estimator of the error variance depends on that choice
236
AICp and SBCp criteria
• Akaike's information criterion (AICp) and Schwarz's Bayesian criterion (SBCp) also penalize for the number of variables in the model.
• We look for models with small AICp and SBCp
• SBCp favors more parsimonious models
AICp = n ln(SSEp) - n ln(n) + 2p
SBCp = n ln(SSEp) - n ln(n) + [ln(n)]p
237
PRESSp criterion
• Prediction sum of squares criterion is a measure of
how well the use of the fitted values for a subset
model can predict the observed responses Yi
Ŷi(i) = predicted value for the i-th case based on the regression line fitted with the i-th case deleted
PRESS prediction error for case i: Yi - Ŷi(i)
PRESSp = Σ (Yi - Ŷi(i))²
Models with small PRESSp values are preferred
PRESSp values can be calculated from a single regression run
238
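Using the identity di = ei/(1 - hii), PRESSp indeed needs only one fit; a sketch assuming hypothetical arrays y and X:

import numpy as np
import statsmodels.api as sm

fit = sm.OLS(y, sm.add_constant(X)).fit()
h = fit.get_influence().hat_matrix_diag      # leverages h_ii

# Deleted residual d_i = e_i / (1 - h_ii), so no refitting is needed
press = np.sum((fit.resid / (1 - h)) ** 2)
print(press)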
Surgical unit example
• A hospital surgical unit is interested in predicting survival in patients undergoing a liver operation. A random sample of 108 patients was available for analysis.
• Potential predictor variables: blood clotting score (X1), prognostic index (X2), enzyme function test score (X3), liver function test score (X4), age (X5), gender (X6), indicator variables for history of alcohol use (X7,X8)
• Demonstrates the utility of the 6 different criteria for model selection (graph on p. 362).
239
240
Automatic search procedures for
model selection
• Best subset selection
• Stepwise regression model
• LASSO
241
“Best” subsets algorithms
• Provide best subsets of models according to a particular criterion
• Also provide good models for any number of variables in the model
• When the pool of X variables is large (30, 40 or more) even such algorithms can take excessive amount of time
• When several good models are identified, the final choice of variables for the model is based on model-building, residual analyses, other diagnostics, and the investigator's knowledge, and is confirmed by validation studies
242
Stepwise regression models
• This method develops a sequence of regression models, at each step adding or deleting an X variable according to an equivalent criterion: reduction in SSE, coefficient of partial correlation, t* statistic, or F* statistic
• Ends with a single “best” regression model
• Different stepwise procedures can lead to different “best” models
• Can use this method to obtain the right size of the regression model and then identify other good regression models using “best” subset procedures
243
Forward stepwise regression
• 1. Fit SLR for each of the P-1 potential variables. Look at the t-test statistic for testing that the slope is 0: tk*=bk/s{bk}. The X variable with the largest t* value is added to the model first provided that its value exceeds a predetermined threshold tcrit (usually critical value corresponding to a prespecified significance level, e.g. 0.05). If none of the t* values are greater than tcrit the algorithm stops. Equivalently decision can be based on p-value. Hence we select for inclusion the variable with smallest p-value.
• 2. Let Xk be the variable entered at step 1. We now fit all possible models with Xk and one additional X variable in and we compute the t* statistics and corresponding p values for that new variable. We enter the variable with the smallest p-value (or equivalently the largest t* value) as long as that p-value is smaller than the predetermined significance level alpha.
244
Forward stepwise regression cont'd
• 3. Check whether Xk can be dropped from the model. The criterion for dropping is similar to the one for adding a variable: drop if the corresponding p-value from the latest model is GREATER than a predetermined level (need not be the same as the level for adding a variable in).
• 4. Try to add one more variable to the model according to criteria in step 2 and then try to drop one of the variables already in the model (except the last one to be entered) as in step 3. Continue until no more variables can be added or dropped from the model.
245
Forward stepwise regression cont'd
• Choices of α-to-enter and α-to-drop are important.
• Large α-to-enter values result in models with too many
variables.
• Models with small α-to-enter values may be underspecified
and hence estimate of the error variance may be too large.
• α-to-enter should never be larger than α-to-drop.
• The order in which the variables are listed in the model statement is not important.
246
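A simplified sketch of the entry step (steps 1-2 above; the drop check of step 3 is omitted, so this is really the forward selection variant described on the next slide), assuming a hypothetical DataFrame df whose response column is named Y and whose other columns are numeric candidate predictors:

import statsmodels.formula.api as smf

def forward_selection(df, response="Y", alpha_enter=0.05):
    remaining = [c for c in df.columns if c != response]
    selected = []
    while remaining:
        # p-value of each candidate's t-test when added to the current model
        pvals = {}
        for cand in remaining:
            formula = f"{response} ~ {' + '.join(selected + [cand])}"
            pvals[cand] = smf.ols(formula, data=df).fit().pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] > alpha_enter:
            break                      # no candidate clears alpha-to-enter
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_selection(df))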
Other stepwise procedures
• Forward selection: a simplified version of forward stepwise regression without the test whether a variable once entered into the model should be dropped.
• Backward elimination: the opposite of forward selection. Starts with a model with all variables in and drops the one with the largest p-value (provided the p-value is larger than the prespecified threshold). Then fits the model to all remaining variables and drops the variable with the largest p-value based on that second model. The process continues until no more variables can be dropped.
247
Comments on stepwise procedures
• Variations of procedures are possible: for example stagewise regression to include interactions
• No uniformity across software packages
• Variables can be retained in the model “by force”
• Dummy variables coding a single categorical variable should be kept together
• At each step the model should be hierarchically well-defined
• Methods based on deletion of cases and on bootstrapping have been proposed more recently.
248
Model validation
• Three approaches:
• 1. Checking the model and its predictive ability on a new sample (Note, it may be difficult to replicate the study).
• 2. Comparison of results with theoretical expectations, earlier empirical results and simulation results.
• 3. Use of a holdout sample to check the model and its predictive ability.
249
Methods of checking validity
Re-estimate the model form using the new data and compare the estimated regression coefficients and various characteristics of the fitted models to the new and the old data
Assess the predictive ability of the selected regression model by using the original model to predict each case in the new data set and then to calculate the mean of squared prediction errors (MSPR)
MSPR = Σ (Yi - Ŷi)²/n*, where n* is the number of cases in the new data set
If MSPR ≈ MSE for the model-building data set, then predictive capability is good.
If MSPR >> MSE, then MSPR is a better estimate of predictive ability.
250
Data splitting
• Used when the collection of new data is not feasible and the original sample is large enough (6 to 10 cases per variable)
• Involves splitting the original data set into a model-building set (training set), and validation set (prediction set)
• This validation is often called “cross-validation”
• The splits of the data can be made at random, by time, or pairs of data points can be created and each placed in different set
251
Comments on data splitting
• Data can be split so that the two data sets have similar statistical properties
• Different refinements of data splitting are possible (for example K-fold cross-validation)
• The PRESS criterion can also be used for data validation
• A disadvantage of not using the whole data set to develop the model is that the parameter estimates are more imprecise
252
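A sketch of a random split with MSPR computed on the holdout set, assuming a hypothetical DataFrame df and an illustrative model formula:

import numpy as np
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
mask = rng.random(len(df)) < 0.7            # ~70% model-building set
train, valid = df[mask], df[~mask]

fit = smf.ols("Y ~ X1 + X2", data=train).fit()
mspr = np.mean((valid["Y"] - fit.predict(valid)) ** 2)
print(fit.mse_resid, mspr)                   # MSPR >> MSE suggests overfitting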
Lecture 11: Building the Regression
Model: Diagnostics
Applied Regression Analysis (BIS623a)
Fall 2005
Instructor: Ralitza Gueorguieva
253
Added-variable plots
• Added-variable plots (partial regression plots, adjusted variable plots) are residual plots that provide graphical information about the marginal importance of a variable Xk given the other predictor variables already in the model.
• To create added-variable plots two regressions must be performed: Y on all other predictor variables, and Xk on all other predictor variables. Then the residuals from these two regressions are plotted against each other.
• The plots help decide whether the additional variable Xk should be added to the regression model and in what form.
• They also provide an idea about the strength of the marginal relationship between Y and Xk and about possible outliers.
254
Comments on added-variable
plots
• Only suggest the nature of the relationship
between Y and the predictor variable but do
not show the functional form.
• May be misleading if there are complex
interrelationships between the predictor
variables.
255
Example (added-variable plot)
• Life-insurance example:
• Y – amount of insurance carried
• X1 – annual income
• X2 – measure of risk aversion
• SAS program/output
256
Identifying outlying Y
observations
Residual: ei = Yi - Ŷi
Semistudentized residual: ei* = ei/√MSE
Hat matrix: H = X(X'X)⁻¹X'
Vector of residuals: e = (I - H)Y
Variance of residuals: σ²{e} = σ²(I - H)
Variance of the i-th residual: σ²{ei} = σ²(1 - hii), where hii is the i-th diagonal element of H
Covariance between residuals: σ{ei, ej} = -σ²hij (i ≠ j)
Estimators: s²{ei} = MSE(1 - hii), s{ei, ej} = -MSE·hij
257
Studentized residuals
• Also known as internally studentized residuals
• Unlike semistudentized residuals have constant
variances
ri = ei/s{ei}, where s{ei} = √[MSE(1 - hii)]
258
Deleted residuals
• More useful for detecting outlying Y values, since if a particular case is highly influential it will pull the regression line towards itself and its internal residual will be small
Deleted residual: di = Yi - Ŷi(i)
Equivalent value: di = ei/(1 - hii)
Estimated variance: s²{di} = MSE(i)/(1 - hii)
Distribution: di/s{di} ~ t(n - p - 1)
259
Studentized deleted residual
• Externally studentized deleted residual
• Can be calculated without fitting the regression
line
ti = di/s{di} = ei/√[MSE(i)(1 - hii)] = ei √[(n - p - 1)/(SSE(1 - hii) - ei²)]
260
Test for outliers
• Idea: identify cases with large externally
studentized residuals and perform
Bonferroni procedure for n residuals
• Consider a residual ti:
• If |ti|>t(1-α/(2n);n-p-1) then declare that
residual an outlier
261
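A sketch of the Bonferroni outlier test, assuming hypothetical arrays y and X; statsmodels also offers the test directly via outlier_test():

import statsmodels.api as sm
from scipy import stats

fit = sm.OLS(y, sm.add_constant(X)).fit()
t_i = fit.get_influence().resid_studentized_external

n, p = int(fit.nobs), int(fit.df_model) + 1
t_crit = stats.t.ppf(1 - 0.05 / (2 * n), n - p - 1)   # Bonferroni critical value
print((abs(t_i) > t_crit).sum(), "outliers")

# statsmodels provides the same test directly
print(fit.outlier_test(method="bonf"))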
Identifying outlying X
observations
• Also based on the hat matrix
• hii (LEVERAGE) measures the distance between the X values for the i-th case and the mean of the X values
• 0 ≤ hii ≤ 1, Σ hii = p
• High leverage values substantially influence Ŷ
• Outlying cases with respect to X: hii > 2p/n, or hii > 0.5 for big data sets
262
Identifying influential cases
• A case is influential if its exclusion causes
major changes in the fitted regression
function
263
Identifying influential cases
Influence on a single fitted value:
DFFITSi = (Ŷi - Ŷi(i))/√[MSE(i)hii] = ti [hii/(1 - hii)]^{1/2}
Influential case: |DFFITS| > 1 for small/medium data sets
                  |DFFITS| > 2√(p/n) for big data sets
264
Identifying influential cases
Influence on all fitted values: Cook's distance
Di = Σj (Ŷj - Ŷj(i))²/(p·MSE) = ei²hii/[p·MSE·(1 - hii)²]
A case may be influential by having a large residual, a large leverage, or both
Influential case: relate Di to the F(p, n-p) distribution and assess its percentile
If the percentile is near 50% or more, the i-th case has substantial influence on the regression line
265
Identifying influential cases
• Influence on the regression coefficients -
DFBETAS
DFBETASk(i) = (bk - bk(i))/√[MSE(i)ckk], k = 0, 1, ..., p-1
where ckk is the k-th diagonal element of (X'X)⁻¹
Influential case: |DFBETAS| > 1 for small or medium data sets
                  |DFBETAS| > 2/√n for large data sets
266
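All of these diagnostics are available from one fitted model in statsmodels; a sketch assuming hypothetical arrays y and X:

import statsmodels.api as sm

fit = sm.OLS(y, sm.add_constant(X)).fit()
infl = fit.get_influence()

h = infl.hat_matrix_diag             # leverages; flag h_ii > 2p/n
dffits, dffits_thresh = infl.dffits  # influence on the case's own fitted value
cooks_d, _ = infl.cooks_distance     # influence on all fitted values
dfbetas = infl.dfbetas               # influence on each coefficient
print(h.max(), abs(dffits).max(), cooks_d.max(), abs(dfbetas).max())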
Influences on inferences
• Compare regression models and fits with
and without influential observations
• Note that one influential observation may
mask another so that dropping an influential
observation may reveal another influential
observation
267
Multicollinearity diagnostics
• Informal diagnostics:
- large changes in the estimated regression coefficients when a variable is added or deleted, or when an observation is added or deleted
- Nonsignificant individual tests for regression coefficients for important predictor variables
- Estimated regression coefficients with an opposite sign to theoretical considerations or prior experience
- Large correlations between predictor variables
- Wide confidence intervals for important predictor variables
268
Multicollinearity diagnostics
• Variance inflation factors: measure the inflation of
variances of estimated regression coefficients as
compared to when the predictor variables are not
linearly related
(VIF)k = (1 - R²k)⁻¹, where R²k is the coefficient of multiple determination of Xk on the remaining X variables
(VIF)k = 1 when R²k = 0
(VIF)k is infinity when Xk is perfectly correlated with the other predictor variables
269
VIF: diagnostic uses
• Largest VIF > 10 is indication that
multicollinearity may be unduly influencing
the least squares estimates
• Mean VIF considerably larger than 1 is
indicative of serious multicollinearity
problems
270
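A sketch of computing the VIFs, assuming a hypothetical predictor DataFrame X:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

Xc = sm.add_constant(X)             # intercept column plus the predictors
vifs = [variance_inflation_factor(Xc.values, k) for k in range(1, Xc.shape[1])]
print(vifs)   # largest VIF > 10, or mean VIF far above 1, signals trouble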
Lecture 12: Building the Regression
Model: Remedial measures
Applied Regression Analysis (BIS623a)
Fall 2005
Instructor: Ralitza Gueorguieva
271
Weighted Least Squares
• A remedial measure for unequal error variances (heteroscedasticity) when the regression relationship has been appropriately identified.
• Takes into account that observations with small variances provide more information than observations with large variances.
Yi = β0 + β1Xi1 + ... + β_{p-1}X_{i,p-1} + εi
εi ~ indep. N(0, σi²)
272
Weighted least squares cont’d
bw = (X'WX)⁻¹X'WY
W = diag(1/ŝ1², ..., 1/ŝn²)  or  W = diag(1/v̂1, ..., 1/v̂n)
where the ŝi and v̂i are estimates of σi and σi², usually based on a hypothesized relationship between the error variance and a predictor variable Xk
273
Weighted least squares cont’d
• Steps:
• 1. First fit the regression model using ordinary (unweighted) least squares and obtain residuals
• 2. Depending on the residual plots, either regress the absolute residuals against appropriate predictor variables to obtain estimates si of σi, or regress the squared residuals against appropriate predictor variables to obtain estimates si² (vi) of σi²
• 3. Use the estimated variances in part 2 in the weight matrix and obtain weighted least squares estimates of the regression coefficients.
• Note: the estimated variances/standard deviations are just the fitted values of the regression in part 2.
274
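A sketch of steps 1-3 for the "absolute residuals" variance function, assuming hypothetical arrays y and X:

import numpy as np
import statsmodels.api as sm

Xc = sm.add_constant(X)

ols_fit = sm.OLS(y, Xc).fit()                        # step 1: OLS residuals
sd_fit = sm.OLS(np.abs(ols_fit.resid), Xc).fit()     # step 2: estimate s_i
s_hat = sd_fit.fittedvalues

wls_fit = sm.WLS(y, Xc, weights=1.0 / s_hat ** 2).fit()   # step 3: w_i = 1/s_i^2
print(wls_fit.params, wls_fit.bse)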
Weighted least squares cont’d
Possible variance and standard deviation functions:
• If residual plot against Xk exhibits megaphone shape then regress the absolute residuals against Xk
• If residual plot against Yhat exhibits megaphone shape then regress the absolute residuals against Yhat
• If a plot of squared residuals against Xk exhibits an upward tendency regress the squared residuals against Xk
• If a plot of the residuals against Xk suggests that the variance increases rapidly with increases in Xk up to a point and then increases more slowly, regress the absolute residuals against Xk and Xk²
275
Weighted least squares cont’d
• Iteratively reweighted least squares (IRLS): If the weighted least squares estimates (in step 3) are very different from the unweighted least squares estimates then the residuals from step 3 (weighted regression) can be regressed on the appropriate predictor variables to re-estimate the variance/standard deviation and the process can be repeated to obtain more stable estimates of the regression coefficients. Usually one or two iterative steps are sufficient.
• In designed experiments replicates may be present at each combination of levels of X variables and sample variances/standard deviations at each replicate point can be used to estimate the weights.
276
Weighted least squares cont’d
• Inference when weights are estimated:
bw = (X'WX)⁻¹X'WY
s²{bw} = MSEw (X'WX)⁻¹
MSEw = Σ wi(Yi - Ŷi)²/(n - p) = Σ wiei²/(n - p)
277
Blood pressure example
• A researcher studied the relationship between
blood pressure (dbp) and age (age) in healthy
women 20 to 60 years old
• Sample size: 54
• SAS program to illustrate the use of weighted least
squares
278
Remedial measures for multicollinearity
• Drop one of two highly correlated variables from the model
• Use centered data
• Form independent indexes (linear combinations of the predictor variables) and use these as predictors: principal component analysis
• Ridge regression
279
Ridge regression
• Idea: obtain biased estimates of the regression parameters
with smaller variance than the LSE
• They will be much better than the LSE in terms of mean
squared error
E{(bR - β)²} = σ²{bR} + (E{bR} - β)² = variance + bias²
280
Ridge estimators
After the correlation transformation: Yi* = β1*Xi1* + ... + β*_{p-1}X*_{i,p-1} + εi*
OLS: rXX b = rYX, so b = rXX⁻¹ rYX
Ridge estimators: bR = (rXX + cI)⁻¹ rYX, c ≥ 0 a constant
c reflects the amount of bias in the estimators
If c = 0 then bR = b (OLS); if c > 0, the bR are biased but more stable
There always exists a value of c for which MSE{bR} < MSE{b}
The ridge trace (plot of bR against c) is useful in determining the proper c
(VIF)k values can also help identify a useful c
281
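A sketch of computing ridge estimators over a grid of c values (a numeric ridge trace), assuming hypothetical arrays X and y:

import numpy as np

def correlation_transform(v):
    return (v - v.mean(axis=0)) / (np.sqrt(len(v) - 1) * v.std(axis=0, ddof=1))

Xs, ys = correlation_transform(X), correlation_transform(y)
rXX, rYX = Xs.T @ Xs, Xs.T @ ys      # correlation matrices of the transformed data

for c in [0.0, 0.001, 0.01, 0.1, 1.0]:
    b_R = np.linalg.solve(rXX + c * np.eye(rXX.shape[0]), rYX)
    print(c, b_R)                    # watch where the coefficients stabilize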
Body fat example
• Study of the relationship between amount of body fat (Y) and several possible predictors:
• triceps skinfold thickness (X1)
• thigh circumference (X2)
• midarm circumference (X3)
• SAS program/output
282
Comments on ridge regression
• Ridge regression estimates are stable, i.e., they are little affected by changes in the data on which the fitted regression model is based.
• Predictions of new observations based on ridge estimated regression functions are more precise than predictions based on OLS
• Advantages of ridge regression increase as degree of multicollinearity increases
• Limitation of ridge regression is that ordinary inference procedures do not apply. Bootstrapping may be employed to evaluate the precision of ridge regression coefficients
• Another limitation is that choice of c is judgemental
• Ridge regression can be modified to use different c values for different regression coefficients
• Ridge traces can be used to reduce the number of predictor variables. Variables with unstable traces are candidates for dropping and variables whose traces tend to zero are also candidates for dropping
283
Remedial measures for influential cases
• 1. Check if influential cases are recording errors, due to breakdown of instruments, etc.
• 2. If no obvious errors are observed, the adequacy of the model should be examined – are interactions or higher order terms omitted, is the functional form appropriate
• 3. Influential cases that are not obvious errors should be discarded only with extreme caution
• 4. If it is not desirable to discard influential cases, robust regression can be used
284
Robust regression
• Idea: dampens the influence of outlying cases to provide a better fit to the majority of cases
• Types of robust regression:
• - Least absolute residuals (LAR) or least absolute deviations (LAD) regression: minimizes
  L1 = Σ |Yi - (β0 + β1Xi1 + ... + β_{p-1}X_{i,p-1})|
• - IRLS robust regression
• - Least median of squares (LMS) regression: minimizes
  median{[Yi - (β0 + β1Xi1 + ... + β_{p-1}X_{i,p-1})]²}
• - other procedures involving trimming, or ranks
285
IRLS Robust regression
• WLS used to reduce the influence of outlying cases by employing
weights that vary inversely with the size of the residual
• Weight functions:
Huber:    w = 1 if |u| ≤ 1.345;  w = 1.345/|u| if |u| > 1.345
Bisquare: w = [1 - (u/4.685)²]² if |u| ≤ 4.685;  w = 0 if |u| > 4.685
286
IRLS robust regression
• 1. Choose a weighting function for weighting the cases
• 2. Obtain starting weights for all cases
• 3. Use starting weights in WLS and obtain scaled residuals from the fitted regression function
• 4. Use the residuals in step 3 to obtain revised weights
• 5. Continue the iterations until convergence is obtained (small change in weights, residuals, estimated regression coefficients or fitted values)
Scaled residual: ui = ei/MAD
MAD = (1/0.6745) median{|ei - median{ei}|}
287
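statsmodels implements this IRLS scheme directly; a sketch assuming hypothetical arrays y and X (HuberT uses the 1.345 constant above, and TukeyBiweight corresponds to the bisquare function with 4.685):

import statsmodels.api as sm

rlm_fit = sm.RLM(y, sm.add_constant(X), M=sm.robust.norms.HuberT()).fit()
print(rlm_fit.params)
print(rlm_fit.weights.min())   # cases with small final weights are outlying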
Comments about robust regression
• - requires knowledge of the regression function
• - can identify outliers in case of multiple outliers masking each other's effects. Cases with small final weights are outlying.
• - can confirm the appropriateness of OLS
• - mainly reduces the influence of cases outlying with respect to Y. To make the procedure more sensitive to outlying X observations, studentized residuals may be used, or the weights can incorporate the leverage of observations
• - a limitation is that evaluation of the precision of estimates is more complicated. The bootstrap can be used.
288
Bootstrapping
• Can be used as a remedial measure for evaluating
precision in non-standard situations
• Computationally intensive
289
Bootstrapping cont’d
• Suppose that we have a sample of size n on which we want to fit a regression model
• Consider an estimated regression coefficient b1
• To estimate the precision of this estimate we generate many samples with replacement from the original sample and fit regression models to them
• From each bootstrap sample we obtain one additional estimate of the same regression coefficient
• The standard deviation of all the bootstrap estimates s*{b1*} is a measure of the precision of b1
290
Bootstrap sampling
• Fixed X sampling: used when the regression function is appropriate to the data, the errors have constant variance and the predictor variables may be regarded as fixed
- residuals ei from original fitting are regarded as the sample data and are sampled with replacement
- the bootstrap Y values are then obtained by adding the original fitted values and these bootstrapped residuals
- the bootstrapped Y are then regressed on the original X to obtain bootstrap estimates of the regression coefficients
Yi(m)* = Ŷi + ei(m)*
291
Bootstrap sampling cont’d
• Random X sampling: used when there are
questions regarding the adequacy of the regression
function, the error variances are not constant,
and/or the predictor variables can not be regarded
as fixed
- For SLR (X,Y) pairs are sampled with
replacement from the original sample
292
Bootstrap confidence intervals
• Reflection method: based on the (α/2)100 and (1- α/2)100 percentiles of the bootstrap distribution of b1*: b1*(α/2) and b1*(1-α/2)
• Requires large number of bootstrap samples (at least 500)
d1 = b1 - b1*(α/2)
d2 = b1*(1 - α/2) - b1
Approx. (1-α)100% CI for β1: (b1 - d2, b1 + d1)
293
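A sketch of fixed X bootstrap sampling with the reflection interval, assuming hypothetical arrays y and X for a simple linear regression:

import numpy as np
import statsmodels.api as sm

Xc = sm.add_constant(X)
fit = sm.OLS(y, Xc).fit()
b1 = fit.params[1]

rng = np.random.default_rng(0)
B = 1000
boot = np.empty(B)
for m in range(B):
    e_star = rng.choice(fit.resid, size=len(y), replace=True)  # resample residuals
    y_star = fit.fittedvalues + e_star                          # Y* = Yhat + e*
    boot[m] = sm.OLS(y_star, Xc).fit().params[1]

lo, hi = np.percentile(boot, [2.5, 97.5])        # b1*(alpha/2), b1*(1-alpha/2)
d1, d2 = b1 - lo, hi - b1
print((b1 - d2, b1 + d1))                        # 95% reflection interval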