COURSE: Applied Regression Analysis
Lecture 1: Review of Simple Linear Regression
Fundamental elements of statistics:
• Population: set of units
• Sample: a subset of the population
• Variable of interest:
  - systolic blood pressure (continuous)
  - number of errors on an exam (discrete)
  - diabetic/non-diabetic (categorical binary)
  - marital status: married, divorced, single (nominal)
  - degree of pain: minimal/moderate/severe (ordinal)
• Statistical inference: estimate, prediction or other generalization about the population based on the information contained in the sample
• Reliability of statistical inference: degree of uncertainty associated with statistical inference
Types of variables in regression
• Dependent (response) variable: affected by one or more independent variables, assumed to have a probability distribution at each value of the independent variable
• Independent (explanatory, covariate, predictor, regressor) variable: can be set to a desired level by the experimenter, or its values can be recorded as they occur in a population
• Example: consumption of saturated fatty acids and plasma cholesterol levels, plasma cholesterol levels and heart disease
• Types of relationships between variables: association and causal
Distributions:
• Probability distribution is specified mathematically and
is used to calculate the theoretical probability of different
values occurring. It is described by a mathematical formula
with parameters.
• Examples of probability distributions: normal, log-normal,
binomial, t-distribution, etc.
Some graphs of normal distributions
Normal density
f(y) = (1/√(2π)) exp(−y²/2)   (standard normal)
f(y) = (1/√(2πσ²)) exp{−(y − μ)²/(2σ²)}   (general normal)
T-distribution
• T-distribution is like normal but with heavier tails
• As the degrees of freedom of the t-distribution increase it more closely resembles the normal distribution
• For df=30 or higher the two are almost indistinguishable
Parameters
• Mean: sum of measurements divided by the total number of measurements
• Population mean: μ = E(Y) = (1/N) Σ Yi
• Variance: average of the squared deviations from the mean
• Population variance: σ² = Var(Y) = (1/N) Σ (Yi − μ)²
• Population standard deviation: σ = SD(Y) = √Var(Y)
Other important parameters
• Median (the middle value when the observations are ordered from the lowest to the highest)
• Mode (the measurement that occurs most often)
• pth percentile: the value such that at most p% of the measurements fall below it and (100-p)% above it when the measurements are ordered from lowest to highest
• Skewness
Estimates and sampling
• Estimate: a quantity computed from the sample which is intended to estimate a population parameter
• Example: Sample mean Ȳ = (1/n) Σ Yi
• Variability of an estimate: standard error
• Example: Standard error of the mean: σ{Ȳ} = σ/√n
• If σ is unknown it is estimated by the sample standard deviation s, where s² = (1/(n−1)) Σ (Yi − Ȳ)²; then s{Ȳ} = s/√n
• Sampling distribution of an estimate: if many samples of fixed size could be obtained, what would the histogram of the estimator look like?
• Example: The distribution of (Ȳ − μ)/s{Ȳ} is either standard normal or t
Confidence intervals
• “Definition:” a range of values which we can be confident (90%, 95%, 99%, never 100%!) includes the true value
• Basic idea: the confidence interval covers a large proportion of the sampling distribution of the statistic of interest
• Example: To obtain a confidence interval for a population mean take estimator +/- 2*standard error (approximately 95% CI)
• Interpretation: About 95 out of 100 confidence intervals based on different random samples from the same population will include the true mean. (5% will not include the true mean).
One sample confidence interval
• CI: Ȳ ± t(1−α/2; n−1) · s/√n
• 95% CI means α = 0.05
• Interpretation of confidence interval: We are 95% confident that the interval contains the true population mean. That is, if we were to construct many 95% CIs based on different random samples from the same population, 95% of these intervals will contain the true mean and 5% won't. Our particular interval either does or does not contain the true mean.
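A minimal Python sketch of this interval (illustration only; the course's worked examples use SAS, and the sample y below is hypothetical):

    import numpy as np
    from scipy import stats

    y = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4])   # hypothetical sample
    n = len(y)
    ybar, s = y.mean(), y.std(ddof=1)               # sample mean and sd
    alpha = 0.05
    tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)    # t(1-alpha/2; n-1)
    half = tcrit * s / np.sqrt(n)
    ci = (ybar - half, ybar + half)                 # 95% CI for the mean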
Hypothesis tests
• Null hypothesis (Ho): usually the opposite of the research hypothesis generating data (no difference between treatments, no linear relationship between X and Y)
• Alternative hypothesis (Ha): research hypothesis (treatment A is better than treatment B, Y increases with X)
• Alpha level: tolerance for mistakenly declaring Ha to be true (usually set at 0.05, but can use 0.01, 0.10). This type of mistake is called type I error.
• Test statistic: numerical summary of evidence contained in the sample in favor of Ha under the assumption that Ho is true.
• P-value: the probability of obtaining as much or more evidence as contained in the test statistic in favor of Ha under the assumption that Ho is true.
Hypothesis testing
• Ho: μ = μ0 vs Ha: μ ≠ μ0
• TS: t* = (Ȳ − μ0)/(s/√n)
• Test statistic has t distribution with n−1 degrees of freedom
• Rejection region: t* > t(1−α/2, n−1) or t* < −t(1−α/2, n−1)
• p-value = the probability of getting a more extreme value than the observed value of the test statistic, based on the t-distribution with n−1 degrees of freedom. Note: this probability is computed under Ho.
• Reject Ho if p-value < α (0.05) or equivalently if t* falls in the RR.
Rejection regions (critical values) of
t-distribution
Relationship between variables
[Scatter plots: Fahrenheit vs Celsius (an exact, functional relationship); Resting Metabolic Rate (in kcal/24 hrs) vs Weight (in kg) (a scattered, statistical relationship)]
Functional vs Statistical
• Functional: Yi = f(Xi); the relationship is perfect; data points fall exactly on the curve of the relationship; systematic part only
• Statistical: Yi = f(Xi) + εi; the relationship is not perfect; data points are scattered around the curve; systematic plus random part
Model definition
• Yi – dependent variable for subject i
• Xi – independent variable for subject i
The regression equation is then:
  Yi = β0 + β1Xi + εi, i = 1, …, n
- β0 is the unknown intercept and β1 is the unknown slope (β0 and β1 are called regression parameters)
- εi are independent, identically distributed errors with mean 0 and variance σ² (E(εi) = 0, Var(εi) = σ² > 0, Cov(εi, εj) = 0 for i ≠ j)
Properties of SLR
• The model is linear in the parameters
• There is a single independent variable
• The model is also linear in the independent variable
• The mean of the dependent variable depends on the predictor according to the systematic part of the model
E(Yi)= β0 + β1 Xi
• The variance of the dependent variable is constant
Var(Yi) = σ²
Estimation of regression parameters
• Least squares is the standard method. The least squares
regression line minimizes the sums of squares of the
vertical distances from the observations to the line
(residuals).
• The obtained estimates of the regression parameters are
called “ordinary least squares” estimates
• These are the values that minimize
  Q = Σ (i=1 to n) (Yi − β0 − β1Xi)²
Estimation of regression parameters cont
• The estimates of the regression parameters are obtained by solving the normal equations and are as follows:
• b1 is interpreted as the estimated mean change in the response variable per unit change in the predictor
• b0 is interpreted as the estimated mean response at 0 value of the predictor. Note that if 0 is not in the range of values of the predictor variable then this estimate is meaningless.
  b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
  b0 = Ȳ − b1X̄
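A short Python sketch of these closed-form estimates (hypothetical data, not from the course; SSE and MSE from the following slides are included for completeness):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # hypothetical predictor
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])       # hypothetical response

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    e = y - (b0 + b1 * x)                          # residuals
    sse = np.sum(e ** 2)                           # error sum of squares
    mse = sse / (len(x) - 2)                       # estimates sigma^2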
Regression line
  Ŷ = b0 + b1X
or equivalently
  Ŷ = Ȳ + b1(X − X̄)
Estimating the variance
Fitted response: Ŷi = b0 + b1Xi
Residual: ei = Yi − Ŷi
Error sum of squares: SSE = Σ (Yi − Ŷi)² = Σ ei²
Mean squared error: MSE = SSE/(n−2) = Σ (Yi − Ŷi)²/(n−2)
The residual variance σ² is estimated by the MSE
Properties of least squares estimates
• Unbiased (on average do not overestimate or underestimate the true value):
  E(b0) = β0, E(b1) = β1, E(MSE) = σ²
• Estimated regression coefficients have minimum variance among all unbiased linear estimators, i.e. among all estimators that are linear functions of the Yi's
• For inference about the regression parameters we need an additional assumption: εi ~ N(0, σ²)
Maximum likelihood estimation
• This is an alternative method to ordinary least squares to obtain estimates of the regression parameters and the error variance
• It requires that the errors εi ~ i.i.d. N(0, σ²)
• Based on this method we can also do inference (hypotheses tests, confidence intervals) for the parameters of interest
• The main idea is to find the values of the parameters that maximize the likelihood of the observed data
Normal density
f(y) = (1/√(2π)) exp(−y²/2)   (standard normal)
f(y) = (1/√(2πσ²)) exp{−(y − μ)²/(2σ²)}   (general normal)
Likelihood function
• The density of each individual Yi is:
  f(Yi) = (1/√(2πσ²)) exp{−(Yi − β0 − β1Xi)²/(2σ²)}
• The likelihood is the product of the individual densities:
  L(β0, β1, σ²) = Π (i=1 to n) (1/√(2πσ²)) exp{−(Yi − β0 − β1Xi)²/(2σ²)}
Maximum likelihood estimates
• The estimates of the regression parameters are exactly the same as the OLS estimates b0 and b1:
  β̂1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²,  β̂0 = Ȳ − β̂1X̄
• The estimate of the variance is different (biased estimate):
  σ̂² = SSE/n = Σ(Yi − Ŷi)²/n = ((n − 2)/n)·MSE
Interpretation of SLR
• The intercept is the value of E(Y) for X=0
• The regression line consists of the estimated mean responses (systematic part) over the range of the fixed X's
• Regression line passes through the point (X̄, Ȳ)
• Sum of residuals is zero: Σ ei = Σ (Yi − Ŷi) = 0
• Therefore Σ Yi = Σ Ŷi
• Sum of squared residuals Q is minimum
• Also Σ Xiei = 0 and Σ Ŷiei = 0
Data example
• Assessment of the strength of association
between body weight (kg) and resting
metabolic rate (RMR) (kcal/24hr). Can
body weight be used to predict RMR?
• Now please look at attached SAS code
Results from data example
• b0 = 811.23
• b1 = 7.06
• Regression equation:
RMR=811.23+7.06*weight
• SSE = 1047230.71, MSE = 24934.06
• Point estimate of the mean RMR for weight = 70 kg: Ŷ = 811.23 + 7.06·70 = 1305.43
• Residual for the first data point: ei = -84.50
• Note: sum of residuals is 0
Distinctions between regression and
correlation models
• Both simple linear regression and correlation can be used to assess the linear relationship between two continuous variables.
• In regression one of the variables is a response, the other one is a predictor. Correlation treats the variables equally.
• Correlation assumes X to be a random variable
• Regression is more informative than correlation: it allows us to predict values of the response (dependent variable) from values of the independent variable.
Joint and marginal distributions
of X and Y
• The bivariate normal distribution describes the joint distribution of two continuous variables
• μ1 and σ1 are the mean and standard deviation of the marginal distribution of X
• μ2 and σ2 are the mean and standard deviation of the marginal distribution of Y
• ρ is the correlation coefficient between X and Y
• Marginally X ~ N(μ1, σ1²) and Y ~ N(μ2, σ2²)
• Jointly: (X, Y)' ~ N2[(μ1, μ2)', Σ], where Σ = [σ1², ρσ1σ2; ρσ1σ2, σ2²]
Graph of bivariate normal density
Conditional distribution of Y given
X and SLR
• Y|X ~ N(μ2 + ρ(σ2/σ1)(X − μ1), σ2²(1 − ρ²)), i.e. for every value of X the conditional distribution of Y given X is normal with mean and variance that depend on the parameters of the joint distribution
• The mean of the distribution of Y|X is a LINEAR function of X
• For all values of X the variance of the conditional distribution Y|X is the same
• Same observations can be made for the distribution of X|Y
• For making inferences for Y conditional on X (X conditional on Y) the normal regression model is appropriate
• Even if X is not normally distributed, but Xi are independent and the distribution of X does not involve the regression parameters, we can still use simple linear regression
Estimation of correlation coefficient
• Pearson product-moment correlation coefficient
• -1≤ρ≤1, ρ=0 when X and Y are independent
• Testing H0: ρ=0 versus Ha: ρ≠0 is equivalent to
testing whether the slope of the regression line of Y|X
or X|Y is 0
• Setting up confidence interval for ρ is based on the
normal distribution and Fisher’s z-transform
  r = Σ(Xi − X̄)(Yi − Ȳ) / [Σ(Xi − X̄)² · Σ(Yi − Ȳ)²]^(1/2)
Nonparametric estimation of the
correlation coefficient
• When the joint distribution of X and Y is not bivariate
normal and can not be transformed to normality, the
nonparametric rank correlation procedure can be used
• Spearman rank correlation coefficient:
• That is, we first obtain the ranks of the Yi’s (RYi) and
Xi’s (RXi) separately and then compute the Pearson
correlation coefficient on the ranks
  rs = Σ(RXi − R̄X)(RYi − R̄Y) / [Σ(RXi − R̄X)² · Σ(RYi − R̄Y)²]^(1/2)
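As an illustration only (not part of the slides), both coefficients are available in SciPy; the data arrays below are hypothetical:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(size=50)                        # hypothetical data
    y = 2 * x + rng.normal(size=50)

    r, p_pearson = stats.pearsonr(x, y)            # Pearson product-moment r
    rs, p_spearman = stats.spearmanr(x, y)         # Spearman rank correlation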
Lecture 2: Inference in Simple
Linear Regression
Inferences concerning β1
• The test of linear association between
predictor and response variable:
H0: β1= 0 vs Ha: β1≠ 0
• Confidence interval for β1
Sampling distribution of the
estimator of β1
• Remember b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
• Since the estimated slope is a linear combination of the Yi's it is N(β1, σ²(b1)), where
  σ²(b1) = σ² / Σ(Xi − X̄)²
• Hence, z* = (b1 − β1)/σ(b1) ~ N(0, 1)
• The variance σ² is unknown but we have an unbiased estimate of it …
Sampling distribution of the estimator of β1 cont
• … the MSE
• Hence the estimated variance will be: s²(b1) = MSE / Σ(Xi − X̄)²
• and we can use the following statistic: t* = (b1 − β1)/s(b1)
• Since we are estimating the variance, the sampling distribution is no longer normal
• t* ~ t(n−2), i.e. the sampling distribution of the standardized b1 is t with n−2 degrees of freedom
Two-sided hypothesis test for β1
• H0: β1= 0 vs Ha: β1≠ 0
• Compute TS under H0: t* = (b1 − 0)/s(b1)
• If t* > t(1−α/2, n−2) or if t* < −t(1−α/2, n−2) then reject H0 in favor of Ha, otherwise fail to reject H0
• Equivalently compute p-value as 2·P(t(n−2) > |t*|) and reject H0 if p-value < α
Rejection regions (critical values) of
t-distribution
One-sided hypotheses tests for β1
• Test for negative slope
• H0: β1≥ 0 vs Ha: β1< 0
• Compute TS under H0
• If t* < -t(1-α,n-2) then reject
H0 in favor of Ha, otherwise
fail to reject H0
• Equivalently compute p-value
as P(t(n-2) < t*) and reject H0
if p-value < α
• Test for positive slope
• H0: β1≤ 0 vs Ha: β1> 0
• Compute TS under H0
• If t* > t(1-α,n-2) then reject
H0 in favor of Ha, otherwise
fail to reject H0
• Equivalently compute p-value
as P(t(n-2) > t*) and reject H0
if p-value < α
Confidence interval for β1
• Consider α = 0.05. Then with probability 0.95:
  −t(1−α/2, n−2) ≤ (b1 − β1)/s(b1) ≤ t(1−α/2, n−2)
• Therefore a 95% confidence interval is as follows:
  (b1 − t(1−α/2, n−2)·s(b1), b1 + t(1−α/2, n−2)·s(b1)), i.e. b1 ± t(1−α/2, n−2)·s(b1)
Inference concerning β0
• Estimator: b0 = Ȳ − b1X̄
• Sampling distribution of b0 is N(β0, σ²(b0))
• Variance: σ²(b0) = σ² [1/n + X̄² / Σ(Xi − X̄)²]
• Estimated variance: s²(b0) = MSE [1/n + X̄² / Σ(Xi − X̄)²]
• Therefore we have: (b0 − β0)/s(b0) ~ t(n−2)
Inference concerning β0
• CI: b0 ± t(1−α/2, n−2)·s(b0)
• HT: H0: β0 = 0 vs Ha: β0 ≠ 0
  TS: t* = b0/s(b0)
  RR: |t*| > t(1−α/2, n−2)
RMR data example (cont'd from Lecture 1)
• Assessment of the strength of association between body weight (kg) and resting metabolic rate (RMR) (kcal/24hr).
• Construct 90% CI for the slope.
• Test at alpha=0.05 whether the slope is significantly positive.
RMR data example cont’d
• 90% CI: b1 ± t(1−0.10/2, 42)·s{b1}
  s²{b1} = MSE/[(n−1)sx²] = 24934.06/(43 · 24.632²) = 0.956
  s{b1} = 0.978
  7.06 ± (1.68)(0.98) = (5.35, 8.71)
• We are 90% confident that the mean increase in RMR per 1 kg increase in body weight is between 5.35 and 8.71 kcal/24hrs
• HT: H0: β1 = 0 vs Ha: β1 > 0
• TS: t* = (b1-0)/s{b1}=7.06/0.98=7.20
• RR: t*>1.68
• Conclusion: t* > 1.68 and hence we reject H0 and conclude that average RMR significantly increases on average as body weight increases
Inference concerning E(Yh)
• Point estimator: Ŷh = b0 + b1Xh
• Expectation: E(Ŷh) = E(Yh)
• Variance: σ²(Ŷh) = σ² [1/n + (Xh − X̄)² / Σ(Xi − X̄)²]
• Estimated variance: s²(Ŷh) = MSE [1/n + (Xh − X̄)² / Σ(Xi − X̄)²]
• Sampling distribution: (Ŷh − E(Yh))/s(Ŷh) ~ t(n−2)
Inference concerning E(Yh) cont’d
• CI: Ŷh ± t(1−α/2, n−2)·s(Ŷh)
• HT:
  H0: E(Yh) = E0 vs Ha: E(Yh) ≠ E0
  TS: t* = (Ŷh − E0)/s(Ŷh) ~ t(n−2)
  RR: |t*| > t(1−α/2, n−2)
Prediction of a new observation Yh(new)
• Point estimator is the same as in mean estimation: Ŷh(new) = b0 + b1Xh(new)
• Expectation is the same but can't be used in inference since it is unknown: E(Ŷh(new)) = E(Yh(new))
• Rather we base inference on the following distribution: (Yh(new) − Ŷh(new))/s(pred) ~ t(n−2)
• Variance of prediction: σ²(pred) = σ² [1 + 1/n + (Xh(new) − X̄)² / Σ(Xi − X̄)²]
• Estimated variance of prediction: s²(pred) = MSE [1 + 1/n + (Xh(new) − X̄)² / Σ(Xi − X̄)²]
Inference for Yh(new)
• Prediction interval for Yh(new): Ŷh(new) ± t(1−α/2, n−2)·s{pred}
• Note that this interval is wider than the interval around the
estimated mean at the same value of X
RMR example: mean estimation and
prediction of a new observation
• 95% CI for mean RMR at weight = 90kg: CI = Ŷ90 ± t(0.975,42)s{Ŷ90}
Ŷ90 = b0+b1*90=811.23+7.06(90)=1446.63
s{Ŷ90} = 28.021
CI = 1446.63 ± 2.018 (28.021) = (1390.08, 1503.18)
• 95% prediction interval for RMR for an individual who weighs 90kg:
CI = Ŷ90 ± t(0.975,42)s{pred}, Ŷ90 = b0+b1*90=1446.63
s{pred} = 160.37
CI = 1446.63 ± 2.018 (160.372) = (1121.9, 1771.29)
  s²{Ŷ90} = MSE[1/n + (Xh − X̄)²/Σ(Xi − X̄)²] = 24934.06 [1/44 + (90 − 74.88)²/(43 · 606.74)] = 785.166
  s²{pred} = MSE[1 + 1/n + (Xh − X̄)²/Σ(Xi − X̄)²] = 24934.06 [1 + 1/44 + (90 − 74.88)²/(43 · 606.74)] = 25719.23
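These two intervals can be reproduced with a few lines of Python from the summary quantities on the slides (MSE = 24934.06, n = 44, X̄ = 74.88, Σ(Xi − X̄)² = 43·606.74); a sketch for illustration only:

    import numpy as np
    from scipy import stats

    b0, b1 = 811.23, 7.06
    mse, n = 24934.06, 44
    xbar, sxx = 74.88, 43 * 606.74                  # sxx = sum((Xi - Xbar)^2)

    xh = 90.0
    yhat = b0 + b1 * xh                             # 1446.63
    t = stats.t.ppf(0.975, n - 2)                   # roughly 2.018
    s_mean = np.sqrt(mse * (1 / n + (xh - xbar) ** 2 / sxx))
    s_pred = np.sqrt(mse * (1 + 1 / n + (xh - xbar) ** 2 / sxx))

    ci = (yhat - t * s_mean, yhat + t * s_mean)     # CI for the mean response
    pi = (yhat - t * s_pred, yhat + t * s_pred)     # prediction interval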
Analysis of variance approach to regression
• Total variation (variation around the mean of the response variable): SSTO
• Variation around the fitted regression line: SSE
• Variation of fitted regression line from the mean of the response variable: SSR
  SSTO = Σ (Yi − Ȳ)²
  SSE = Σ (Yi − Ŷi)²
  SSR = Σ (Ŷi − Ȳ)²
Partitioning of total sum of squares and
degrees of freedom
• SSTO = SSE + SSR
• That is,
• Total df = n-1
• Regression df = 1
• Error df = n-2
• Total df = Error df + Regression df
• n-1 = n-2 + 1
  Σ (Yi − Ȳ)² = Σ (Yi − Ŷi)² + Σ (Ŷi − Ȳ)²
Mean squares and ANOVA table
• MSE = SSE/(n-2)
• MSR = SSR/1 = SSR
------------------------------------------------------------
Source of
variation df SS MS F p
------------------------------------------------------------
Regression 1 SSR MSR MSR/MSE
Error n-2 SSE MSE
------------------------------------------------------------
Total n-1 SSTO
F-test for β1=0
• H0: β1= 0 vs Ha: β1≠ 0
• TS: F* = MSR/MSE ~F(1,n-2)
• If F* > F(1-α;1,n-2) reject H0 in favor of Ha, otherwise fail to reject H0
Intuition:
  E(MSE) = σ²,  E(MSR) = σ² + β1² Σ(Xi − X̄)²
Under Ho the ratio E(MSR)/E(MSE) is 1, under the alternative it is > 1
Note: F* = t*², hence the F-test is equivalent to the two-sided t-test
General Linear Test
• Full model: Yi = β0 + β1Xi + εi
• Reduced model: Yi = β0 + εi
• H0: β1= 0 vs Ha: β1≠ 0
• SSE(F) = SSE
• SSE(R) = SSTO
• TS: F* = {[SSE(R) − SSE(F)]/(dfR − dfF)} / [SSE(F)/dfF]
• If F* > F(1−α; dfR−dfF, dfF) then reject the null and conclude that the full model fits the data better than the reduced model
RMR example: ANOVA table
------------------------------------------------------------------
Source of
variation       df    SS             MS             F       p
------------------------------------------------------------------
Regression       1    1300241.18     1300241.18     52.15   <.0001
Error           42    1047230.71       24934.06
------------------------------------------------------------------
Total           43    2347471.89
Measures of association between
X and Y
• Coefficient of determination:
• r2 is interpreted as the proportionate reduction of total variation in Y associated with the use of predictor X
• 0 ≤ r2 ≤ 1
• r2 = 1 when all the points fall on the fitted regression line (and the regression line is not horizontal)
• r2 = 0 when the fitted regression line is horizontal
  r² = SSR/SSTO = 1 − SSE/SSTO
Measures of association between
X and Y cont’d
• Correlation coefficient (Pearson product-moment correlation coefficient): r = ±√r², with the sign of r matching the sign of the slope
• −1 ≤ r ≤ 1
• r > 0 indicates positive linear association between X and Y
• r < 0 indicates negative linear association between X and Y
• Note that small r does not necessarily mean no relationship between X and Y (it means no LINEAR relationship)
• High r does not necessarily indicate good fit or that good predictions can be made
Relationship between r and the estimated
regression slope b1
  b1 = r · [Σ(Yi − Ȳ)² / Σ(Xi − X̄)²]^(1/2) = r · sy/sx
• sx and sy are the sample standard deviations of X and Y respectively
• Note that b1 = 0 implies r=0 and vice versa
• The signs of b1 and r are also the same
• The value of r is affected by the spacing of the X values
Results from data example
• Testing b1 =0: t*=7.22, p<.0001
• F*=52.15,p<.0001
• 95% CI for b1 : (5.09, 9.03)
• R2 = 0.55
• 95% CI for mean RMR for weight = 90 kg:
(1390.0, 1503.1)
• 95% CI for individual RMR for weight = 90kg: (1122.9, 1770.2)
Lecture 3: Diagnostics and
remedial measures
Departures from model and remedial measures
Departures from model:
• Non-linear regression function
• Non-normal errors
• Non-constant error variance
• Non-independent error terms
• Presence of outliers
• Omission of important predictor variables
Remedial measures:
• Nonparametric regression
• Quadratic or higher order polynomial
• Transformations
• Weighted least squares
• Models for correlated data (mixed models, time series models)
• Robust regression
• Multiple regression
Consequences of model departures
• Nonlinearity and/or omission of predictor variables lead to biased estimates of the parameters (serious)
• Non-constant error variance leads to less efficient estimates and to invalid error variance estimates (less serious)
• Presence of outliers may or may not be serious. Depends on how influential outliers are for the regression estimation and on the size of the data set.
• Non-independence of errors results in biased variances (estimates are unbiased). May be serious.
Diagnostics for predictor variable
• Dot plot: useful when number of observations in the data set is not large. Helps identify outlying cases.
• Sequence plot: useful when data on the predictor variable are obtained in sequence. Helps identify patterns.
• Stem-and-Leaf Plot: provides information similar to a frequency histogram. Useful to identify outliers and skewness.
• Box plot: shows minimum, maximum value, first, second and third quartiles. Helps identify skewness and outliers. Most useful for large data sets.
Residuals
• Remember ei = Yi − Ŷi
• Basic idea: If the assumptions of the regression model are satisfied the distribution of the residuals should resemble the error distribution
• The mean of the residuals is zero: ē = Σei/n = 0
• The variance of the residuals is approximated by the MSE
• The residuals are not independent because the fitted values are based on the same regression line. The residuals are subject to 2 constraints: their sum is zero and the products Xiei sum to 0. However when n is large the dependency can be ignored.
Standardized residuals
• The idea is to standardize the residuals by their
standard deviation
• However the latter is unknown
• We can use the MSE
• Semistudentized residuals: ei* = (ei − ē)/√MSE = ei/√MSE
• Studentized residuals: use a different denominator
Several prototype situations for residual
vs predictor plot
• Figure 3.4 on p.106 in textbook
Non-linear regression function
• Can be assessed via plot of
the residuals vs predictor
variable
(or equivalently residuals vs
fitted values)
Non-normal error terms
• To detect gross departures from normality for a
relatively large sample a histogram, dot plot, box
plot or stem-and-leaf plot of the residuals can be
helpful
• Otherwise a normal probability plot of the
residuals is more informative
Normal probability plot
• Order the residuals
• Plot each residual against its expected value under normality
• Expected value of the kth smallest residual is √MSE · z[(k − 0.375)/(n + 0.25)]
• z(a) is the 100a-th percentile of the standard normal distribution
• Plot should be nearly linear
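A small Python sketch of these expected values (hypothetical helper function; plot the sorted residuals against its output and look for a straight line):

    import numpy as np
    from scipy import stats

    def expected_normal_scores(e, mse):
        n = len(e)
        k = np.arange(1, n + 1)
        # expected value of the k-th smallest residual under normality
        return np.sqrt(mse) * stats.norm.ppf((k - 0.375) / (n + 0.25))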
Good normal probability plot of
residuals
Examples of “not so good” normal
probability plots
Caution in assessing normality
• Normal probability plots may reflect other violations of the assumptions:
• For example a wrong choice of regression function
• Or non-constant error variance
• Investigate other possible violations first!
Non-constant error variance
• Can be detected using plot of residuals versus values of the predictor variable
• Equal variability around zero indicates constant variances
• Increasing spread of the residuals around zero indicates that variance increases with the mean (“megaphone type”)
• Plot of absolute residuals or squared residuals against X may be even more useful to detect nonconstancy of error variance
Brown-Forsythe test
• Appropriate for SLR
• Can be used even when errors are non-normal
• Requires large sample size to be able to ignore
dependency among residuals
• Main idea is to separate the residuals in two
groups and compare the average absolute
deviations from the center in the two groups
Brown-Forsythe test cont’d
  di1 = |ei1 − ẽ1|,  di2 = |ei2 − ẽ2|,  where ẽ1 and ẽ2 are the medians of residual groups 1 and 2
  s² = [Σ(di1 − d̄1)² + Σ(di2 − d̄2)²] / (n − 2)
  t*BF = (d̄1 − d̄2) / (s·√(1/n1 + 1/n2)) ~ t(n − 2)
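A sketch of this computation in Python (hypothetical helper; e1 and e2 are the residuals in the two groups, e.g. split at the median of X):

    import numpy as np
    from scipy import stats

    def brown_forsythe(e1, e2):
        d1 = np.abs(e1 - np.median(e1))
        d2 = np.abs(e2 - np.median(e2))
        n1, n2 = len(d1), len(d2)
        s2 = (np.sum((d1 - d1.mean()) ** 2)
              + np.sum((d2 - d2.mean()) ** 2)) / (n1 + n2 - 2)
        t_bf = (d1.mean() - d2.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2))
        p = 2 * stats.t.sf(abs(t_bf), n1 + n2 - 2)
        return t_bf, p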
Non-independent error terms
• When the data are recorded in a temporal or
spatial fashion a sequence plot may be useful to
detect non-independent error terms
• This plot may reveal trend effect or cyclical non-
independence
Correlated errors
Presence of outliers
• Outliers are extreme observations
• Can be identified based on plots of residuals against X or fitted values, box plots, stem-and-leaf plots or dot plots of residuals
• It is better to plot semistudentized residuals
• Semistudentized residuals with values of four or more are considered outliers
• Outliers may result from error in recording, malfunction, miscalculation, etc and may need to be discarded
• Otherwise they may provide important information
• Note that the estimates of regression parameters may or may not be greatly affected by outliers
Omission of important predictor variables
• Plot residuals against variables omitted from the model that might have important effects on the response
• If a pattern emerges, including one or more of these additional variables in the model may lead to a substantially better fit
F test for lack of fit
• Tests whether the chosen regression function
adequately fits the data
• Assumes that Y|X ~ independent and normally
distributed and that the error variances are
constant
• Requires repeat observations on some X
F-test for lack of fit cont’d
• X1, …Xc – levels of the predictor
• n1, … nc – replicates at each X level
• Yij – observed response for the ith replicate at the
jth level of X
• Full model: Yij = μj + εij, εij ~ indep. N(0, σ²)
• Let SSPE (sum of squares for pure error) denote the SSE for the full model
• Reduced model: Yij = β0 + β1Xj + εij, εij ~ indep. N(0, σ²)
F-test of lack of fit cont’d
  SSLF = SSE − SSPE
  F* = {[SSE(R) − SSE(F)]/[(n−2) − (n−c)]} / [SSE(F)/(n−c)]
     = [SSLF/(c−2)] / [SSPE/(n−c)] = MSLF/MSPE
  If F* > F(1−α; c−2, n−c) reject H0
Transformations
• Transformation on X should be attempted when the error terms are approximately normally distributed with constant variance but the relationship between X and Y is non-linear
• Otherwise transformation on Y is more appropriate
• For some transformations (log, sqrt, inverse sqrt) adding a constant may be necessary to make numbers positive
Prototype plots for
transformation of X
• Use X’ = X2 or X’ = exp(X) if the X-Y plot suggests an arc from lower left to upper right with bulge below the straight line (1st plot on right)
• Use X’ = square root of X or X’ = log10(X) if the X-Y plot suggests an arc from lower left to upper right with bulge above the straight line
• Use X’ = 1/X or X’ = exp(-X) if the X-Y plot suggests an arc from upper left to lower right with bulge below the straight line (2nd plot on right)
Prototype plots suggesting transformation on Y
Transformations on Y
• The choice of a transformation of Y may be suggested by examining the plot of residuals against fitted values. If this appears linear, but the variance of the residuals increases as fitted Y increases, suggesting a wedge or megaphone shape, then taking square roots, logarithms, or reciprocals of the Y values may promote homogeneity of variance
• Note that a simultaneous transformation on X may be necessary
Transformations on Y
• Use Y’ = square root of Y if there is an arc from
lower left to upper right with bulge below the
straight line, and the variance of the residuals
increases as fitted Y increases
• Use Y’ = log(Y) if there is an arc from upper left
to lower right, and the variance of the residuals
increases as fitted Y increases
• Use Y’ = 1/Y if variance of the residuals increases
as fitted Y increases
Box-Cox transformations
• Automatically identifies a transformation from the family of power transformations Y’ = Yλ, where λ is identified from the data
• λ = 2 corresponds to Y’ = Y2
• λ = ½ corresponds to Y’ = √Y
• λ = 0 corresponds to Y’ = loge(Y)
• λ = -½ corresponds to Y’ = 1/ √Y
• λ = -1 corresponds to Y’ = 1/Y
• Regression model: Yi^(λ) = β0 + β1Xi + εi
• The Box-Cox procedure finds the MLE of λ
Comments regarding transformations
• Theoretical considerations may prevail
• Residual plots and tests needed to ascertain appropriateness of transformation
• Interpretation and properties of regression coefficients are with respect to the transformed scale
• A more convenient value of λ may be selected for interpretation purposes
• When the confidence interval for λ includes 1 it may be better to stay with the original scale
Nonparametric regression
• Idea: fit a smoothed curve to the data to explore or
confirm regression relationship
• For time-series data popular methods are: the
method of moving averages, the method of
running medians
• For regression data popular methods are band
regression and locally weighted regression scatter
plot smoothing (Lowess)
Lowess method
• Obtains a smoothed curve by fitting successive linear regression functions in local neighborhoods
• The smoothed Y value at a given X is equal to the fitted Y value for the regression in that local neighborhood
• For example if a neighborhood of 3 values is used, the smoothed value of Y at X2 is the fitted value for Y at X2 based on the regression fitted to (X1,Y1), (X2,Y2), (X3,Y3)
• Similarly the smoothed value at X3 will be the fitted value for Y at X3 based on the regression fitted to (X2,Y2) (X3,Y3) and (X4,Y4)
Steps in obtaining final smoothed curve
• 1. The linear regression is weighted with smaller
weights given to X values further from the middle
X level in each neighborhood
• 2. Linear regression fitting is repeated with revised
weights so that cases with large residuals in the
first fitting receive smaller weights in the second
fitting
• 3. Additional iterations of step 2 may be needed
Choices in lowess method
• Size of successive neighborhoods (the larger the size the smoother the function but essential features may be lost)
• In SAS PROC LOESS a smoothing parameter s is chosen. When s < 1 the s fraction of values closest to X are chosen in each neighborhood
• Weight function for X values
• SAS PROC LOESS uses a tricube weight function:
  wi = (32/5)·(1 − (di/dq)³)³, where d1, …, dq are the distances from the 1st, 2nd, …, qth closest X value to Xi
• Weight function for residuals
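Besides SAS PROC LOESS, the same smoother is available in Python's statsmodels; a sketch with hypothetical data (frac plays the role of the smoothing parameter s, and it requests the reweighting iterations that downweight large residuals):

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 10, 100))              # hypothetical data
    y = np.sin(x) + rng.normal(scale=0.3, size=100)

    smoothed = lowess(y, x, frac=0.5, it=3)           # columns: x, smoothed y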
Important points regarding the lowess method
• No analytical expression is provided for the
functional form of the regression relationship
• Higher order polynomials may be used to smooth
out in local neighborhoods
• If lowess curve falls in confidence bands of
regression line then it can be considered
confirmatory of the chosen regression relationship
Lecture 4: Simultaneous
inference and other topics
Joint estimation of β0 and β1
• Statement confidence coefficient: reflects the proportion of confidence intervals that contain the true value of a parameter in repeated sampling
• Setting two separate confidence intervals for the slope and the intercept in SLR assures that each of the two statement confidence coefficients is correct
• However, the probability that both confidence intervals contain their respective parameters is less than 95%
• Family confidence coefficient corresponds to the proportion of repeated samples in which both the true intercept and the true slope fall in their respective confidence intervals
Bonferroni joint confidence intervals
for intercept and slope
• Separate confidence intervals:
  b0 ± t(1−α/2; n−2)·s{b0}
  b1 ± t(1−α/2; n−2)·s{b1}
• Let A1 denote the event that β0 does not belong to the first CI, and A2 denote the event that β1 does not belong to the second CI. Therefore P(A1) = P(A2) = α
• We want P(neither A1 nor A2 occurs) ≥ 1 − α
• But P(neither A1 nor A2) = 1 − P(A1 ∪ A2) ≥ 1 − P(A1) − P(A2) = 1 − 2α
Bonferroni joint confidence
intervals cont’d
• Therefore, if we construct two 100(1-α/2)%
confidence intervals we will get at least
100(1- α)% family confidence coefficient for
both parameters
• The joint confidence intervals are as follows:
  b0 ± t(1−α/4; n−2)·s{b0}
  b1 ± t(1−α/4; n−2)·s{b1}
An example
• Remember for RMR data we had
• b0 = 811.23, s(b0)=76.98
• b1 = 7.06, s(b1)=0.98
• Therefore joint 90% confidence intervals will be:
• b0 ± t(1-0.10/4;42)s{b0}=811.23 ± 2.02 (76.98) = (655.73,966.73)
• b1 ± t(1-0.10/4;42)s{b1}=7.06 ± 2.02 (0.98) = (5.08,9.04)
• Interpretation: we are at least 90% confident that both the intercept and the slope fall in their respective confidence interval above
Comments
• The Bonferroni family confidence coefficient is not an exact coefficient but is rather a lower bound to the desired probability
• The Bonferroni inequality is extended to more than two events
• Useful when number of confidence intervals is not too large
• Joint confidence intervals can be used for testing (for example of whether both slope and intercept are equal to zero)
• Each confidence interval can have its own confidence level to reflect its relative importance
  P(none of A1, …, Ag occurs) ≥ 1 − Σ P(Ai) = 1 − gα
Simultaneous estimation of mean
responses
• Separate estimates of the mean response at different levels of the predictor need not be simultaneously right or simultaneously wrong.
• It is possible that the confidence interval for the mean response is right only over some part of the range
• We’ll consider two procedures:
• Working-Hotelling procedure
• Bonferroni procedure
Working-Hotelling procedure
• The confidence bounds at each value of X are the values of the confidence band for the entire regression line: Ŷh ± W·s{Ŷh}
• Here W² = 2F(1−α; 2, n−2)
RMR example
Xh     Ŷh        s{Ŷh}
60     1234.80   27.90
82     1390.11   24.80
103    1538.36   36.37
• Working-Hotelling 90% CI for mean:
• W2 = 2F(0.9,2,42)=4.87
• W = 2.21
• (1234.80±2.21(27.90))=(1173.14,1296.46)
• (1390.11±2.21(24.80))
=(1335.30,1444.92)
• (1538.36±2.21(36.37))=(1457.98,1618.74)
Bonferroni procedure
• Same idea as for simultaneous confidence
intervals for the intercept and slope
• For simultaneous confidence intervals on g
means:
  Ŷh ± B·s{Ŷh}, where B = t(1−α/(2g); n−2)
RMR example
Xh     Ŷh        s{Ŷh}
60     1234.80   27.90
82     1390.11   24.80
103    1538.36   36.37
• Bonferroni 94% CI for mean:
• 1-α/(2g)=0.99
• B = t(0.99,42)=2.42
• (1234.80±2.42(27.90))
=(1167.28,1302.32)
• (1390.11±2.42(24.80))
=(1330.09,1450.13)
• (1538.36±2.42(36.37))
=(1450.34,1626.38)
Working-Hotelling vs Bonferroni procedures
• For large g WH provides tighter bounds and hence
is preferred
• Both WH and Bonferroni provide lower bounds
• When the levels of the predictor variable at which
mean estimation is of interest are not known a
priori WH is more appropriate since WH provides
simultaneous protection for all levels of X
Regression through origin
• The regression line may be forced to go
through the origin (or a particular value)
• Regression model: Yi = β1Xi + εi
• Then the point estimates are:
  b1 = Σ XiYi / Σ Xi²
  s² = MSE = Σ (Yi − Ŷi)² / (n − 1)
Regression through origin cont’d
• CI for the slope then is: b1 ± t(1−α/2; n−1)·s{b1}
• where s²{b1} = MSE / Σ Xi²
• CI for the mean response is: Ŷh ± t(1−α/2; n−1)·s{Ŷh}
• where s²{Ŷh} = MSE·Xh² / Σ Xi²
• CI for a predicted response is: Ŷh(new) ± t(1−α/2; n−1)·s{Ŷh(new)}
• where s²{Ŷh(new)} = MSE·(1 + Xh(new)² / Σ Xi²)
Example: Typographical errors
(problem 4.12)
• X – number of galleys for a manuscript
• Y – total dollar cost of correcting typographical errors
• Estimated regression function: Y = 18.03X
• 95% CI for the slope: 18.03±2.20(0.08)= (17.85,18.21)
• 95% prediction interval for X=10: (170.21,190.36)
• Note residuals do not sum to 0
Typographical errors example cont’d
• Test for lack of fit of regression through the origin: full
model (SLR), reduced model (regression through the
origin)
• F*=(223.42-219.99)/22.00=0.16
• F* is not in the rejection region for any meaningful test
and hence ok to use reduced model
  F* = {[SSE(R) − SSE(F)]/[(n−1) − (n−2)]} / [SSE(F)/(n−2)]
  If F* > F(1−α; 1, n−2) reject H0
Caveats of regression through the origin
• Sum of residuals is not zero
• SSE may exceed SSTO
• R2 may be negative
• The intervals for mean response and new prediction
widen at X far away from the origin
• Uncorrected total and corrected total sums of squares
are related as follows:
• SSTOU = SSRU + SSE
• where SSTOU = Σ Yi², SSRU = Σ Ŷi² = b1² Σ Xi², and SSE = Σ (Yi − b1Xi)²
Effects of measurement error
• Measurement errors in Y do not affect the estimates as long as these errors are uncorrelated and not biased
• Measurement error in X does affect the estimate of the slope by attenuating it (except in Berkson’s case). The magnitude of the bias depends on the relative sizes of the errors in X and in Y.
• When measurement error in X is present a special approach such as the use of instrumental variable approach is needed
Berkson’s model
• The observation Xi* is a fixed quantity
while the underlying Xi is a random
variable.
• In that case the errors have expectation 0
and the predictor variable is a constant
(therefore the errors are uncorrelated with
it) and hence SLR is valid and can be used.
Choice of X levels
• When the levels of X are under the control of the experimenter the following considerations should be used:
• If the main purpose of the regression analysis is to estimate the slope use well spread levels of X
• If the main purpose is to estimate the intercept the mean of X values should be 0.
• If the main purpose is to predict a new observation at Xh(new) the mean of the X values should be at Xh(new)
• Choose as many levels as needed to estimate shape (for example at least 2 for straight line, at least 3 for quadratic trend, etc)
Formulae to aid understanding of
selection of X levels
  σ²{b1} = σ² / Σ(Xi − X̄)²
  σ²{b0} = σ² [1/n + X̄² / Σ(Xi − X̄)²]
  σ²{Ŷh} = σ² [1/n + (Xh − X̄)² / Σ(Xi − X̄)²]
  σ²{Ŷh(new)} = σ² [1 + 1/n + (Xh(new) − X̄)² / Σ(Xi − X̄)²]
Lecture 5: Matrix representation of
simple linear regression
Matrices
• Matrix: a rectangular array of elements
• Dimension: r×c means r rows by c columns
• Example: A = [aij], i = 1,2,3; j = 1,2:
      a11  a12     3  5
  A = a21  a22  =  4  2
      a31  a32     1  0
• In general: A = [aij], i = 1,…r; j = 1,…c:
      a11  a12  ...  a1c
  A = a21  a22  ...  a2c
      ...
      ar1  ar2  ...  arc
Types of matrices
• A square matrix (r = c):
      a11  a12  a13     3  5  2
  A = a21  a22  a23  =  4  2  4
      a31  a32  a33     1  0  2
• Symmetric matrix (square, with the upper triangle a mirror image of the lower triangle):
      3  4  1
  A = 4  2  0
      1  0  2
Types of matrices cont’d
• Row vector: r = 1
• Column vector: c = 1
• Transpose of a matrix (A' or Aᵀ): A = [aij], A' = [aji]
• If A is an r by c matrix then A' is a c by r matrix
• Example:
      3  5  2          3  4  1
  A = 4  2  4,   A' =  5  2  0
      1  0  2          2  4  2
• Symmetric matrix: A = A'
Types of matrices cont’d
• Diagonal matrix: a square matrix with all
off-diagonal elements equal to 0
• Identity matrix: a diagonal matrix with all
diagonal elements equal to 1
• Note that A*I=I*A=A
• Scalar matrix: a diagonal matrix with all
diagonal elements equal to a scalar
• Examples:
      3  0  0        1  0  0         2  0  0
  A = 0  2  0,   I = 0  1  0,   2I = 0  2  0
      0  0  2        0  0  1         0  0  2
Types of matrices cont’d
• Vector of ones: 1 = (1, 1, …, 1)'
• Vector of zeros: 0 = (0, 0, …, 0)'
• Note that 1'1 = n
• J denotes the n×n matrix formed by multiplying a vector of ones by its transpose:
            1  1  ...  1
  J = 11' = 1  1  ...  1
            ...
            1  1  ...  1
Equality of matrices
• Two matrices are equal only if they have
the same dimensions and all corresponding
elements are the same
• That is, if A = [aij], i = 1,…r; j = 1,…c and B = [bij], i = 1,…k; j = 1,…l, then for A to be equal to B we need:
• k = r, l = c and aij = bij for all i and j
Operations on matrices: addition
  A = 1  3     B = 2  0
      5  4         1  1

  A + B = 1+2  3+0  =  3  3
          5+1  4+1     6  5
Operations on matrices: subtraction
  A − B = 1−2  3−0  =  −1  3
          5−1  4−1      4  3
Operations on matrices: multiplication
  A·B = (1)(2)+(3)(1)  (1)(0)+(3)(1)  =   5  3
        (5)(2)+(4)(1)  (5)(0)+(4)(1)     14  4
Cautions about matrix operations
• For addition and subtraction number of rows of
the matrices must be the same and number of
columns must be the same
• For multiplication number of columns of the first
matrix (left multiplier) must be the same as
number of rows of second matrix (right multiplier)
• Note that A + B = B + A but A*B ≠ B*A in
general
Inverse of a matrix
• Inverse for a square matrix A is A-1 such that:
• A A-1 = A-1 A = I
• A-1 is unique and has the same rank as A
• To have an inverse a matrix needs to be full rank
(or nonsingular)
• To check whether a matrix is of full rank we check whether its determinant is nonzero
Finding the inverse
• For a 2×2 matrix:
  A = a  b ,   A⁻¹ = (1/D) ·  d  −b ,   where D = ad − bc
      c  d                   −c   a
• Example:
  A = 1  4 ,   D = (1)(2) − (4)(3) = −10
      3  2
  A⁻¹ = (1/(−10)) ·  2  −4  =  −0.2   0.4
                    −3   1     0.3  −0.1
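The example can be checked numerically; a NumPy sketch (illustration only):

    import numpy as np

    A = np.array([[1.0, 4.0],
                  [3.0, 2.0]])
    print(np.linalg.det(A))          # -10: nonzero, so A is invertible
    A_inv = np.linalg.inv(A)         # [[-0.2, 0.4], [0.3, -0.1]]
    print(A @ A_inv)                 # identity matrix, up to rounding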
Simple linear regression in matrix terms
  Y = Xβ + ε, where
       Y1          1  X1                ε1
  Y =  Y2 ,   X =  1  X2 ,   β = β0 ,   ε = ε2
       ...         ...           β1         ...
       Yn          1  Xn                εn
  Y is n×1, X is n×2, β is 2×1, ε is n×1
  E(ε) = 0,  Var(ε) = σ²I
  E(Y) = Xβ
Regression parameter estimate in
matrix notation
• Normal equations: X'X b = X'Y
• Least squares estimates: b = (X'X)⁻¹ X'Y
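A minimal NumPy sketch of these normal equations (hypothetical data; solving X'Xb = X'Y directly is numerically preferable to forming the inverse):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # hypothetical data
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    X = np.column_stack([np.ones_like(x), x])        # design matrix
    b = np.linalg.solve(X.T @ X, X.T @ y)            # b = (X'X)^{-1} X'Y
    H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix
    y_fitted = H @ y                                 # equals X @ b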
Fitted values in matrix notation
  Ŷ = (Ŷ1, Ŷ2, …, Ŷn)' = Xb = X(X'X)⁻¹X'Y = HY, where H = X(X'X)⁻¹X'
• The H matrix is called the hat matrix
• H is symmetric and idempotent (HH = H)
• H is very useful in residual diagnostics
Residuals in matrix notation
  e = (e1, e2, …, en)' = Y − Ŷ = Y − HY = (I − H)Y
  s²{e} = MSE·(I − H)
• I − H is also symmetric and idempotent
Analysis of variance results in
matrix notation
  SSTO = Y'Y − (1/n)·Y'JY = Y'[I − (1/n)J]Y
  SSE = Y'Y − b'X'Y = Y'(I − H)Y
  SSR = b'X'Y − (1/n)·Y'JY = Y'[H − (1/n)J]Y
Inference about (i) regression coefficients, (ii) mean
response and (iii) prediction of a new observation
(i)  b = (X'X)⁻¹X'Y
     σ²(b) = σ²·(X'X)⁻¹
     s²(b) = MSE·(X'X)⁻¹
(ii) and (iii): let Xh' = (1, Xh). Then
     Ŷh = Xh'b
     s²{Ŷh} = MSE·Xh'(X'X)⁻¹Xh
     s²{Ŷh(new)} = MSE·(1 + Xh'(X'X)⁻¹Xh)
Example: Flavor deterioration
(problem 5.4 on p.210)
• Will work out in class
Lecture 6: Multiple Linear
Regression
Model formulation
• Yi is the response for the ith subject
• Xi1, Xi2, …Xi,p-1 are the values of the predictor variables for the ith subject
• β1, β2,… βp-1 are unknown parameters to be estimated from the data (they are also called partial regression coefficients)
• Regression model: Yi = β0 + β1Xi1 + β2Xi2 + … + βp−1Xi,p−1 + εi, i = 1,…n
• Regression (response) surface: E(Yi) = β0 + β1Xi1 + β2Xi2 + … + βp−1Xi,p−1
• E(εi) = 0, Cov(εi, εj) = 0 for i≠j, Var(εi) = σ² > 0
Comments on model formulation
• When p-1 = 1 multiple linear regression reduces to simple linear regression
• If we assume Xi0 = 1 then we can write the regression equation as follows:
  Yi = β0Xi0 + β1Xi1 + … + βp−1Xi,p−1 + εi = Σ (k=0 to p−1) βkXik + εi
• βk is interpreted as the mean change in Y per unit change in the kth predictor when the remaining predictors are held constant
• The predictors need not be all different variables
• When there are two different predictors the response surface is a plane
• For inference we require εi ~ N(0, σ²)
First order regression model
• All predictors represent different variables
• Some of the predictors can be qualitative (dummy variables): For example let X1 be age and
  X2 = 1 if male, 0 if female
• Then a first order regression model is:
  Yi = β0 + β1Xi1 + β2Xi2 + εi
  E(Yi) = β0 + β1Xi1 for females, and E(Yi) = (β0 + β2) + β1Xi1 for males
Polynomial regression
• All predictors are powers of the same variable:
  Yi = β0 + β1Xi1 + β2Xi2 + … + βp−1Xi,p−1 + εi, where Xi1 = Xi, Xi2 = Xi², …, Xi,p−1 = Xi^(p−1)
• This implies a curvilinear relationship between the predictor and the response
Regression with transformed
variables, interaction effects
• Ex: Transformation of the response: Yi' = log(Yi) = Σ (k=0 to p−1) βkXik + εi
• First order (two-way) interaction:
  Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + εi, where Xi3 = Xi1Xi2
• Any possible combinations of the above, as long as the model is linear in the parameters:
  Yi = β0ci0 + β1ci1 + β2ci2 + … + βp−1ci,p−1 + εi
• Example of nonlinear regression: Yi = exp(β0 + β1Xi) + εi
Example: Lung transplantation
data (Altman et al, p.364)
• It is difficult to measure total lung capacity (TLC) and hence it is useful to be able to predict TLC from other information.
• The data set contains measurements of pre-transplant TLC of 32 recipients of heart-lung transplant, obtained by whole-body plethysmography, and their age, sex and height.
• Sample data:
• Age Sex Height(cm) TLC(l)
• 35 F 149 3.40
• 11 F 138 3.41
• 12 M 148 3.80
Example cont’d:
[Scatter plot matrix of TLC, height, sex, and age]
Model formulation in matrix notation
  Yi = β0 + β1Xi1 + β2Xi2 + … + βp−1Xi,p−1 + εi, i = 1,…n
  can be rewritten as
  Yi = Xi'β + εi,
  where β = (β0, β1, …, βp−1)' and Xi' = (1, Xi1, Xi2, …, Xi,p−1)
Model formulation in matrix notation cont’d
  For the ith subject: Yi = Xi'β + εi
  For all subjects: Y = Xβ + ε, where Y is n×1, X is n×p, β is p×1, and ε is n×1:
       Y1          1  X11   ...  X1,p−1           β0            ε1
  Y =  Y2 ,   X =  1  X21   ...  X2,p−1 ,   β =   β1 ,    ε =   ε2
       ...         ...                            ...           ...
       Yn          1  Xn1   ...  Xn,p−1           βp−1          εn
• Y – vector of responses
• β – vector of parameters
• X – matrix of constants (design matrix)
• ε ~ N(0, σ²I) and hence Y ~ N(Xβ, σ²I)
Estimation of regression coefficients
• Least squares estimates are obtained by minimizing the sum of squared vertical distances from the points to the regression surface:
  Q = Σ (Yi − β0 − β1Xi1 − … − βp−1Xi,p−1)²
• Denote the vector of the least squares estimated regression coefficients as b = (b0, b1, …, bp−1)'
• Least squares normal equations: X'X b = X'Y
• Least squares estimates (!mistake in textbook – eq 6.25):
  b (p×1) = (X'X)⁻¹ X'Y
• Maximum-likelihood estimates are the same
Fitted values and residuals
  Ŷ = (Ŷ1, …, Ŷn)' = Xb = X(X'X)⁻¹X'Y = HY, where H = X(X'X)⁻¹X'
  e = (e1, …, en)' = Y − Ŷ = Y − Xb = (I − H)Y
  Var{e} = σ²(I − H),  s²{e} = MSE·(I − H)
Sums of squares and mean squares
  SSR = b'X'Y − (1/n)·Y'JY = Y'[H − (1/n)J]Y,   MSR = SSR/(p−1)
  SSE = Y'Y − b'X'Y = Y'(I − H)Y,               MSE = SSE/(n−p)
  SSTO = Y'Y − (1/n)·Y'JY = Y'[I − (1/n)J]Y
ANOVA table
--------------------------------------------------------------------
Source of
variation df SS MS F
--------------------------------------------------------------------
Regression p-1 SSR MSR MSR/MSE
Error n-p SSE MSE
--------------------------------------------------------------------
Total n-1 SSTO
F-test for regression relation
• H0: β1= β2=…= βp-1 =0
• Ha: not all βk (k=1,…p-1) equal to 0
• TS: F* = MSR/MSE
• Rejection region: If F* > F(1-α;p-1,n-p) reject H0
• Note that under H0 E(MSR)=E(MSE)=σ2
• And under Ha E(MSR)>σ2 and hence the test has intuitive sense
• When p-1=1 the test reduces to the F-test in SLR
Coefficients of multiple determination
and correlation
• Usual definition: R² = SSR/SSTO = 1 − SSE/SSTO
• Note that the coefficient is between 0 and 1 and increases as variables are added to the model
• Modified measure that adjusts for the number of variables in the model:
• Adjusted R-square: Ra² = 1 − [SSE/(n−p)] / [SSTO/(n−1)] = 1 − [(n−1)/(n−p)]·(SSE/SSTO)
• Adjusted R-square can decrease as the number of variables increases
• Coefficient of multiple correlation: R = √R²
Inferences about regression parameters
• The LSE of the regression parameters are unbiased: E(b)=β
• Variance is σ2{b}=σ2 (X’X)-1
• Estimated variance is s2{b}=MSE(X’X)-1
• 100(1-α)% CI for βk is bk±t(1- α/2;n-p)s{bk}
• HT: H0: βk=0 vs Ha: βk≠0
• TS: t* = bk/s{bk}
• Rejection region: |t*|>t(1- α/2;n-p)
• Joint inferences on several βk’s: bk±Bs{bk} where B=t(1- α/(2g);n-p)
Estimation of mean response
• Let Xh' = (1, Xh1, …, Xh,p−1)
• Mean response: E{Yh} = Xh'β
• Point estimator: Ŷh = Xh'b
• Expectation: E{Ŷh} = Xh'β = E{Yh}
• Variance: σ²{Ŷh} = σ²·Xh'(X'X)⁻¹Xh
• Estimated variance: s²{Ŷh} = MSE·Xh'(X'X)⁻¹Xh
• 100(1−α)% CI: Ŷh ± t(1−α/2; n−p)·s{Ŷh}
Other inference regarding mean response
• Confidence region for regression surface (an extension of Working-Hotelling band). Also used for simultaneous confidence interval for several mean responses:
• Bonferroni simultaneous confidence intervals for several mean responses:
  Working-Hotelling: Ŷh ± W·s{Ŷh}, where W² = p·F(1−α; p, n−p)
  Bonferroni: Ŷh ± B·s{Ŷh}, where B = t(1−(α/(2g)); n−p)
Prediction of new observation
• Individual response: Yh(new) = Xh'β + εh(new)
• Point estimator: Ŷh(new) = Xh'b
• Estimated variance: s²{pred} = MSE·[1 + Xh'(X'X)⁻¹Xh]
• 100(1−α)% CI: Ŷh(new) ± t(1−α/2; n−p)·s{pred}
Caution about hidden extrapolations
• Picture from textbook
Diagnostic and remedial measures
• Most of the diagnostics and remedial procedures
from simple linear regression carry over to
multiple linear regression
• Scatter plot matrix: a collection of scatter plots
between predictors and response variables. It is
useful to assess bivariate relationships and identify
outliers.
• Correlation matrix: a matrix of correlation
coefficients
Residual plots
• Plot of residuals against fitted values is useful for assessing the appropriateness of the multiple linear regression function, the constancy of the variances of the error terms and the presence of outliers
• Normal probability plots of the residuals can help assess normality of errors.
• Residual plots against each predictor variable can help assess the adequacy of the regression model with respect to that variable.
• Residuals can also be plotted against variables, or interactions of variables not in the model to assess whether these variables/interactions are needed.
• Brown-Forsythe test and Breusch-Pagan test can be applied for a particular predictor variable suspected to be associated with increase or decrease of error variance.
• F-test for lack of fit is also applied as in SLR except that we now require replications over all predictor variables simultaneously.
Remedial measures
• Remedial measures in MLR are applied as
in SLR:
• Definition of more complex polynomial or
interaction models
• Transformations on response and predictor
variables
• Box-Cox transformation
Lecture 8: Statistical inference in
multiple regression
Extra sums of squares
• Due to the relationship among predictor variables
and between the predictors and the response, it is
possible that the relationship between one
predictor and the response is affected by other
predictors in the model
• SSE decreases as more and more predictors are
added to the model
• The extra sum of squares measures such marginal
reduction in SSE
TLC example (extra sums of squares)
• Suppose we first consider gender (X1) as a predictor of TLC: Then SSE (X1) =56.40, SSR(X1)=25.31
• Then suppose we add height (X2) to the model: Then
SSE (X1, X2) =38.92, SSR(X1, X2) =42.79
• We see that SSE (X1) > SSE (X1, X2) and SSR (X1) < SSR (X1, X2), that is residual variability decreases by adding a variable to the model while systematic variability increases
• SSR(X2|X1) is called the extra sum of squares when adding X2 to the model that already contains X1:
SSR(X2|X1) = SSR (X1 , X2) - SSR (X1)=42.79-25.31=17.48
SSR(X2|X1) = SSE (X1) - SSE (X1, X2)=56.40-38.92=17.48
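This bookkeeping is easy to reproduce from two fits; a hedged Python sketch with hypothetical stand-ins for sex (x1), height (x2) and TLC (y):

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.integers(0, 2, 32).astype(float)        # hypothetical sex indicator
    x2 = rng.normal(165, 10, 32)                     # hypothetical height
    y = 1.0 + 0.5 * x1 + 0.03 * x2 + rng.normal(size=32)

    def sse(X, y):
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        r = y - X @ b
        return float(r @ r)

    one = np.ones_like(y)
    sse_1 = sse(np.column_stack([one, x1]), y)           # SSE(X1)
    sse_12 = sse(np.column_stack([one, x1, x2]), y)      # SSE(X1, X2)
    ssr_2_given_1 = sse_1 - sse_12                       # SSR(X2|X1)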
Extra sums of squares cont’d
• SSR(X2|X1) measures the reduction in errors
variability (SSE) when X2 is added to the model that
already includes X1
• Thus it measures the marginal effect of adding X2 to
the model that already includes X1
  SSR(X2|X1) = SSE(X1) − SSE(X1, X2)
  Since SSTO = SSR + SSE, equivalently
  SSR(X2|X1) = SSR(X1, X2) − SSR(X1)
Extra Sums of Squares cont’d
• Additionally:
• SSR(X2,X3|X1) measures the reduction in
errors variability (SSE) when X2 and X3 are
added to the model that already includes X1
  SSR(X2, X3|X1) = SSE(X1) − SSE(X1, X2, X3)
  Since SSTO = SSR + SSE, equivalently
  SSR(X2, X3|X1) = SSR(X1, X2, X3) − SSR(X1)
Extra Sums of Squares
• Also:
• SSR(X3|X1, X2) measures the reduction in
errors variability (SSE) when X3 is added to
the model that already includes X1 and X2
  SSR(X3|X1, X2) = SSE(X1, X2) − SSE(X1, X2, X3)
  SSR(X3|X1, X2) = SSR(X1, X2, X3) − SSR(X1, X2)
Decomposition of SSR into extra sum of
squares
• Different decompositions can be considered:
  SSR(X1, X2, X3) = SSR(X1) + SSR(X2|X1) + SSR(X3|X1, X2)
  SSR(X1, X2, X3) = SSR(X2) + SSR(X3|X2) + SSR(X1|X2, X3)
  SSR(X1, X2, X3) = SSR(X1) + SSR(X2, X3|X1)
  …
• MSR(X3|X1, X2) = SSR(X3|X1, X2)/1
• MSR(X2, X3|X1) = SSR(X2, X3|X1)/2
• That is, an extra sum of squares for q additional variables is associated with q degrees of freedom
ANOVA table with decomposition
of sums of squares
-----------------------------------------------------------------------------------------
Source of
variation df SS MS F
-----------------------------------------------------------------------------------------
Regression p-1 SSR MSR MSR/MSE
X1 1 SSR(X1) MSR(X1) MSR(X1)/MSE
X2|X1 1 SSR(X2|X1) MSR(X2|X1) MSR(X2|X1)/MSE
X3|X1,X2 1 SSR(X3|X1,X2) MSR(X3|X1,X2) MSR(X3|X1,X2)/MSE
…
Xp-1|X1,…Xp-2 1 SSR(Xp-1|X1,…Xp-2) MSR(Xp-1|X1,…Xp-2) MSR(Xp-1|X)/MSE
Error n-p SSE MSE
-----------------------------------------------------------------------------------------
Total n-1 SSTO
Tests of multiple regression coefficients
  H0: βq = βq+1 = … = βp−1 = 0
  Ha: at least one of βq, …, βp−1 is not 0
  This is equivalent to testing the following reduced vs full model:
  R: Yi = β0 + β1Xi1 + … + βq−1Xi,q−1 + εi
  F: Yi = β0 + β1Xi1 + … + βq−1Xi,q−1 + βqXiq + … + βp−1Xi,p−1 + εi
  Test statistic: F* = {[SSE(R) − SSE(F)]/(dfR − dfF)} / [SSE(F)/dfF]
     = {[SSE(X1, …, Xq−1) − SSE(X1, …, Xp−1)]/(p − q)} / [SSE(X1, …, Xp−1)/(n − p)]
     = MSR(Xq, …, Xp−1 | X1, …, Xq−1) / MSE(X1, …, Xp−1) ~ F(p − q, n − p)
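A generic sketch of this test in Python (hypothetical helper; X_reduced and X_full are design matrices with intercept columns, the reduced columns a subset of the full ones):

    import numpy as np
    from scipy import stats

    def general_linear_test(X_reduced, X_full, y):
        def sse(X):
            b = np.linalg.lstsq(X, y, rcond=None)[0]
            r = y - X @ b
            return float(r @ r)
        n = len(y)
        df_r, df_f = n - X_reduced.shape[1], n - X_full.shape[1]
        f_star = ((sse(X_reduced) - sse(X_full)) / (df_r - df_f)) \
                 / (sse(X_full) / df_f)
        p = stats.f.sf(f_star, df_r - df_f, df_f)
        return f_star, p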
Example (TLC data)
• Testing whether age contributes to the model after
accounting for height and sex:
• Can do t-test or F-test:
• T-test: H0: β3 = 0 vs Ha: β3 ≠ 0
  Test statistic: t* = b3/s{b3} = 0.025/0.024 = 1.06
• F-test:
  R: TLC = β0 + β1·height + β2·sex + ε
  F: TLC = β0 + β1·height + β2·sex + β3·age + ε
  Test statistic: F* = MSR(age|height, sex)/MSE(height, sex, age) = (1.51/1)/(37.40/28) = 1.51/1.34 = 1.13
  RR: F* > F(0.95; 1, 28) = 4.2
• Conclusion: Fail to reject H0
Example (TLC data cont’d)
• Testing whether age and sex contribute to the model
after accounting for height:
• Only F-test is appropriate:
  R: TLC = β0 + β1·height + ε
  F: TLC = β0 + β1·height + β2·sex + β3·age + ε
  Test statistic: F* = MSR(sex, age|height)/MSE(height, sex, age) = [(42.16 − 37.40)/2]/(37.40/28) = 2.38/1.34 = 1.78
  RR: F* > F(0.95; 2, 28) = 3.3
• Conclusion: Fail to reject H0
Type I and Type III sums of squares
• Type I sums of squares are sequential and order-dependent, they adjust only for variables already in the model
• In TLC example Type I SS in SAS table are as follows: SSR(height), SSR(sex|height), SSR(age|height,sex)
• If we had specified the variable in different order in the model statement we would have gotten different type I SS: for example “model TLC= height age sex” leads to SSR(height), SSR(age|height), SSR(sex|age,height)
• Type III sums of squares are order-independent (they adjust for all of the remaining variables in the model)
• In TLC example the type III sums of squares are SSR(height|sex,age), SSR(sex|age,height), SSR(age|height,sex)
Coefficients of partial determination
• Recall that coefficient of multiple determination R2 measures the proportionate reduction in the variation of Y achieved by all X variables
• Coefficients of partial determination measure the contribution of one X variable once all the remaining variables are already in the model
• The coefficient of partial determination between Y and X1 given that X2 is in the model measures:
• the relative reduction in SSE when X1 is added to the model that already contains X2
• the relative marginal reduction in the variation in Y associated with X1when X2 is already in the model
  Model: Yi = β0 + β1Xi1 + β2Xi2 + εi
  R²Y1|2 = [SSE(X2) − SSE(X1, X2)] / SSE(X2) = SSR(X1|X2) / SSE(X2)
Coefficients of partial determination cont’d
• The coefficient of partial determination
between Y and X1 can be interpreted as a
coefficient of simple determination between
the residuals from the regression of Y on X2
and the residuals from the regression of X1
on X2
Coefficients of partial determination and
correlation
  Model: Yi = β0 + β1Xi1 + … + βp−1Xi,p−1 + εi
  R²Yk|1,…,k−1,k+1,…,p−1 = SSR(Xk | X1, …, Xk−1, Xk+1, …, Xp−1) / SSE(X1, …, Xk−1, Xk+1, …, Xp−1)
• Coefficients of partial correlation: rYk|1,…,k−1,k+1,…,p−1 = ±√(R²Yk|1,…,k−1,k+1,…,p−1)
• In TLC example R²Y,age|height,sex = 1.51/38.92 = 0.04
• rY,age|height,sex = √0.04 = 0.2
Standardized multiple regression model
• It is desirable to be able to compare the magnitude
of the regression coefficients and based on that to
judge the relative importance of the predictor
variables
• However the magnitude of the regression
coefficients depends on the scale of X and hence
rescaling is in order
• Also, standardizing the regression coefficients
improves numerical stability
Correlation transformation
• All X and Y variables are standardized by
subtracting the sample mean and dividing by the
sample standard deviation:
  Yi* = (1/√(n−1)) · (Yi − Ȳ)/sY, where sY² = Σ(Yi − Ȳ)²/(n−1)
  Xik* = (1/√(n−1)) · (Xik − X̄k)/sk, where sk² = Σ(Xik − X̄k)²/(n−1), k = 1, …, p−1
Standardized regression model
• Regression model in terms of the standardized variables
• Note that there is no intercept in this model
• Coefficients in SRM are interpreted as estimated change in standard deviations in Y for one standard deviation change in X
Yi* = β1*Xi1* + ... + β*_{p-1}X*_{i,p-1} + εi*
It can be shown that
βk = (sY/sk) βk*,  k = 1, ..., p-1
β0 = Ȳ - β1X̄1 - ... - β_{p-1}X̄_{p-1}
185
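A sketch of the correlation transformation and the back-transformation of the standardized coefficients, on simulated data (all names and values are illustrative):

import numpy as np

def correlation_transform(v):
    # subtract the mean, divide by the standard deviation, scale by 1/sqrt(n-1)
    return (v - v.mean(axis=0)) / (np.sqrt(len(v) - 1) * v.std(axis=0, ddof=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=50)

Xs, ys = correlation_transform(X), correlation_transform(y)

# The standardized model has no intercept
b_star, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

# Back-transform: beta_k = (s_Y / s_k) * beta_k^*
b = b_star * y.std(ddof=1) / X.std(axis=0, ddof=1)
print(b_star, b)  # b should be close to (2.0, -1.5)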
Polynomial regression (one variable)
• All variables are powers of the same variable
• This implies curvilinear relationship between the predictor and the response when p > 2
Yi = β0 + β1Xi1 + β2Xi2 + ... + β_{p-1}X_{i,p-1} + εi,
where Xi2 = Xi1², ..., X_{i,p-1} = Xi1^{p-1}
186
Polynomial regression: applicability and
caution
• Polynomial regression models are useful when the true curvilinear function is indeed polynomial, or when polynomial is a good approximation
• Extrapolation outside of the range of the data is dangerous, especially with high-order polynomials
• Data that consists of n distinct X values can always be fitted perfectly with a polynomial of degree n-1
• Polynomial regression models may contain one, two or more continuous predictor variables, and each predictor may enter at a different power
• Each predictor is often centered to remove the high correlation between linear and quadratic terms (e.g. X and X² are usually highly correlated)
187
Polynomial regression models for a
single predictor variable
188
Polynomial models for a single
predictor variable cont’d
Second-order model: Yi = β0 + β1xi + β11xi² + εi, where xi = Xi - X̄
Third-order model: Yi = β0 + β1xi + β11xi² + β111xi³ + εi, where xi = Xi - X̄
189
Second-order polynomial regression models
with 2 predictors
Yi = β0 + β1xi1 + β2xi2 + β11xi1² + β22xi2² + β12xi1xi2 + εi
where xi1 = Xi1 - X̄1 and xi2 = Xi2 - X̄2
E(Y) = β0 + β1x1 + β2x2 + β11x1² + β22x2² + β12x1x2 defines a conic section
190
Second-order polynomial regression models
with 3 predictors
Yi = β0 + β1xi1 + β2xi2 + β3xi3 + β11xi1² + β22xi2² + β33xi3² + β12xi1xi2 + β13xi1xi3 + β23xi2xi3 + εi
where xi1 = Xi1 - X̄1, xi2 = Xi2 - X̄2, and xi3 = Xi3 - X̄3
191
Hierarchical approach to fitting
polynomial models
• Polynomial models are special cases of MLR and hence can be fitted using the usual approach
• If a higher order term is present in a regression model, the lower order terms need also to be present (e.g. quadratic model should have a linear term)
• It is usually of interest to test whether a simpler polynomial model adequately represents the data. This can be achieved using the extra sums of squares approach
192
Hierarchical approach to fitting
polynomial models cont’d
For example, consider testing whether the cubic term is needed in a polynomial regression model:
Model: Yi = β0 + β1xi + β11xi² + β111xi³ + εi
SSR = SSR(x) + SSR(x² | x) + SSR(x³ | x, x²)
To check whether β111 = 0, use F* = MSR(x³ | x, x²)/MSE
To check whether β11 = 0 and β111 = 0, use F* = MSR(x², x³ | x)/MSE
193
Comments on polynomial regression
• Expensive in terms of degrees of freedom
• Collinearity may still exist and then
orthogonal polynomials can be used
• Test of lower order effect is meaningless
when a higher order term is present
194
Muscle mass example
To explore the relationship between muscle mass
and age in women, a nutritionist randomly
selected 15 women from each 10-year age group,
beginning with age 40 and ending with age 79. X
is age and Y is a measure of muscle mass
Consider polynomial regression
195
Muscle mass example cont'd
• We find (in terms of centered age x = age - 59.983): Y = 82.936 - 1.184x + 0.015x².
• R² = SSR/SSTO = 11,830.621/15,501.933 = 0.763.
• Test whether or not there is a regression relation (α = .01).
H0: β1 = β2 = 0
Ha: at least one is not zero
TS: F* = MSR/MSE = 5,915.3/64.409 = 91.84
p-value < .0001
196
Muscle mass example continued
197
Muscle mass example cont’d
• Test whether the quadratic term can be dropped from the model
R: Y = ß0 + ß1 X
F: Y = ß0 + ß1 X + ß2X2
TS: F* = MSR(X2|X)/MSE(X,X2)= 203.135/64.409 = 3.154
RR: F* > F(0.95, 1, 57) = 4.01, R model is adequate
• Express the fitted regression function in terms of age:
Y = β0 + β1(age - 59.983) + β2(age - 59.983)²
We want Y = β0* + β1* age + β2* age²
β0* = β0 - β1(59.983) + β2(59.983)² = 82.936 + 1.184(59.983) + 0.015(59.983)² = 207.926
β1* = β1 - 2β2(59.983) = -1.184 - 2(0.015)(59.983) = -2.983
β2* = β2 = 0.015
198
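A sketch of the same centering-and-back-transformation steps, assuming hypothetical arrays age and mass (the data are not reproduced on the slides):

import numpy as np
import statsmodels.api as sm

x = age - age.mean()                      # centered age
fit = sm.OLS(mass, sm.add_constant(np.column_stack([x, x ** 2]))).fit()
b0, b1, b2 = fit.params

# Convert the fitted function back to the uncentered scale
c = age.mean()
b0_star = b0 - b1 * c + b2 * c ** 2
b1_star = b1 - 2 * b2 * c
b2_star = b2
print(b0_star, b1_star, b2_star)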
Lecture 8: Multicollinearity and
interaction models
199
Multicollinearity
• In MLR frequently asked questions are:
1. What is the relative importance of the effects of different predictor variables?
2. What is the magnitude of the effect of a given predictor variable on the response variable?
3. Can any predictor variable be dropped from the model because it has little or no effect on the response variable?
4. Should any predictor variable not yet included in the model be considered for inclusion?
These questions have easy answers when the predictor variables are not intercorrelated (correlated with one another)
200
Uncorrelated predictor variables
• When the predictors are uncorrelated the estimated effects are the same no matter what other variables are in the model.
• Also SSR(X1|X2)=SSR(X1) and SSR(X2|X1)=SSR(X2) (That is, type III SS = type I SS)
• So direct comparisons of standardized regression coefficients and simple t-tests for each predictor variable can help answer questions 1 through 4.
201
Effects of multicollinearity
• Our ability to obtain a good fit, and to estimate the mean response and predict individual responses, is not inhibited
• However estimated regression coefficients have large sampling variability
• Common interpretation of regression coefficients as the effect of one predictor while holding the others constant may not be realistic, since we may not be able to vary one variable while keeping another variable highly correlated with it constant.
202
Body fat example (from
textbook)
• Study of the relationship between amount of body fat (Y) and several possible predictors: triceps skinfold thickness (X1), thigh circumference (X2), and midarm circumference (X3).
• Predictors are highly intercorrelated: r(X1, X2) = 0.924, r(X1, X3) = 0.458, r(X2, X3) = 0.085, and R² = 0.998 when X3 is regressed on X1 and X2
• Regression coefficients vary widely according to what other predictors are in the model:

Variables in model    b1      b2
X1                    0.86    --
X2                    --      0.86
X1, X2                0.22    0.66
X1, X2, X3            4.33   -2.86
203
Body fat example cont’d
• Type I and Type III SS can be very different:
SSR(X1)=352.27, SSR(X1|X2)=3.47.
• The interpretation in this example is that by itself
triceps skinfold thickness is an important predictor
of body fat but it does not add much new
information after thigh circumference is accounted
for (and vice versa)
204
Body fat example cont’d
• Multicollinearity also affects standard errors of the
regression coefficients
Variables in model    s{b1}   s{b2}
X1                    0.13    --
X2                    --      0.11
X1, X2                0.30    0.29
X1, X2, X3            3.02    2.58
205
Body fat example cont’d
• Multicollinearity does not significantly
affect the fitted values and predictions
Variables in model    Fitted Y at X1=25.0      s{fitted Y} at X1=25.0
                      (X2=50.0, X3=29.0)       (X2=50.0, X3=29.0)
X1                    19.93                    0.63
X1, X2                19.36                    0.62
X1, X2, X3            19.19                    0.62
206
Interaction models with
quantitative variables
Model with additive effects:
E{Y} = f1(X1) + f2(X2) + ... + fp(Xp)
E.g.: E{Y} = β0 + β1X1 + β11X1² + β2X2
Model with multiplicative effects (interactions):
E.g.: E{Y} = β0 + β1X1 + β2X2 + β12X1X2
E{Y} = β0 + β1X1 + β2X2 + β3X3 + β12X1X2 + β13X1X3 + β23X2X3
207
Interpretation of regression coefficients in
interaction models with quantitative variables
• Model: Yi = β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2 + εi
• Intercept for the relationship between Y and X1: β0 + β2Xi2
• Slope for the relationship between Y and X1: β1 + β3Xi2
• Intercept for the relationship between Y and X2: β0 + β1Xi1
• Slope for the relationship between Y and X2: β2 + β3Xi1
208
Example:
a. E{Y} = 10 + 2X1 + 5X2
b. E{Y} = 10 + 2X1 + 5X2 + 0.5X1X2
c. E{Y} = 10 + 2X1 + 5X2 - 0.5X1X2
209
Comments on interactive effects for quantitative variables
• Curvilinear effects may be present
• Since some of the predictors may be highly correlated with the interaction terms, it is a good idea to center the predictor variables
• When the number of predictor variables in the regression model is large, the potential number of interactions may be large. Either a priori knowledge or residual plots of fitted values (based on the main effects model) vs interaction terms can be used then to guide the choice of interaction terms to include in the model.
• F-tests based on extra sums of squares can be used to test whether interaction effects are needed in the model.
210
Models with qualitative predictors
• Qualitative predictors with 2 classes (binary):
  X1 = 1 if female, 0 if male
• Qualitative predictors with 3 or more classes (categorical nominal):
  X1 = 1 if more than high school education, 0 otherwise
  X2 = 1 if less than high school education, 0 otherwise
• A nominal categorical variable with c classes is represented by c-1 indicator (dummy) variables
211
Interpretation of regression
coefficients
• Example: X1 - age, X2 – gender (X2=1 if female)
• β1 – common slope
• β0 – intercept for males
• β0+ β2 – intercept for females
• First-order model implies parallel regression lines at each level of the categorical predictor
Yi = β0 + β1Xi1 + β2Xi2 + εi
E(Yi) = β0 + β1Xi1 for males, and
E(Yi) = (β0 + β2) + β1Xi1 for females
212
Interpretation of regression
coefficients cont’d
• Example: X1 - age, X2 and X3 – education level (X2=1 if > HS, X3=1 if < HS)
• β1 – common slope
• β0 – intercept for HS
• β0+ β2 – intercept for > HS
• β0+ β3 – intercept for < HS
• First-order model implies parallel regression lines at each level of the categorical predictor
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + εi
E(Yi) = β0 + β1Xi1 for HS
E(Yi) = (β0 + β2) + β1Xi1 for > HS
E(Yi) = (β0 + β3) + β1Xi1 for < HS
213
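A sketch of fitting such a model, assuming a hypothetical DataFrame df with columns Y, age and a three-level factor educ taking values "HS", ">HS", "<HS"; patsy generates the c-1 dummies automatically:

import statsmodels.formula.api as smf

# patsy expands C(educ, ...) into c-1 = 2 indicator variables, with "HS"
# as the reference category absorbed into the intercept
fit = smf.ols('Y ~ age + C(educ, Treatment(reference="HS"))', data=df).fit()
print(fit.params)  # common age slope plus an intercept shift per education level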
Considerations in using indicator variables
• Allocation codes may be used in some cases instead of indicator variables: e.g. education defined as +1 if >HS, 0 if HS, -1 if <HS, and then treating the variable as continuous for the purposes of regression. That, however, implies a metric which may not correspond to reality
• Sometimes quantitative variables may be used based on intervals defined for qualitative variables
• Different type of coding may be used: for example X1=1 if female, -1 if male
• Intercept may be dropped and c indicator variables may be used for a categorical variable with c classes
214
Interactions between qualitative and
quantitative predictors
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2 + εi
E(Yi) = β0 + β1Xi1 for males, and
E(Yi) = (β0 + β2) + (β1 + β3)Xi1 for females
This implies that the regression lines for the two levels of the qualitative variable intersect (are not parallel)
Testing β3 = 0 is equivalent to testing for additive effects
215
Comments
• We can have models with any combination
of quantitative and qualitative variables
• If interactions are present the corresponding
main effects should also be included
216
Comparison of two or more
regression functions
• Soap production line example:
• Y – amount of scrap
• X1 – line speed
• X2 – production line (1 if line 1, 0 if line 2)
• Model: Yi = β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2 + εi
• Fitted model: Ŷ = 7.57 + 1.322X1 + 90.39X2 - 0.1767X1X2
217
Soap production example cont’d
218
Soap production example cont’d
• One important assumption to be able to
estimate the regression relationship for
both production lines simultaneously is that
the residual variances for the two
production lines are the same
• Hence perform Brown-Forsythe test for
equality of variance
219
Soap production example cont’d
220
Soap production example cont’d
H0: Equal variances for the 2 production lines
Ha: unequal variances
s² = (2,952.20 + 2,045.82)/25 = 199.921, so s = 14.139
t*BF = (16.132 - 12.648) / [14.139 √(1/15 + 1/12)] = 0.636
RR: |t*BF| > t(0.975; 25) = 2.06
Conclusion: Fail to reject H0
221
Soap production example cont’d
222
Soap production example cont’d
• Test whether the two regression lines are
identical
H0: β2 = β3 = 0
Ha: at least one of β2 or β3 is not zero
TS: F* = [SSR(X2, X1X2 | X1)/2] / [SSE(X1, X2, X1X2)/(n - 4)]
RR: F* > F(0.99; 2, 23) = 5.67
Since in the example F* = [(18,694 + 810)/2] / (9,904/23) = 22.65 > 5.67,
Reject H0: the lines are not identical
223
Soap production example cont’d
• Test whether the slopes of the two
regression lines are the same
H0: β3 = 0
Ha: β3 ≠ 0
TS: F* = SSR(X1X2 | X1, X2) / [SSE(X1, X2, X1X2)/(n - 4)]
RR: F* > F(0.99; 1, 23) = 7.88
Since in the example F* = 810/(9,904/23) = 1.88 < 7.88,
Fail to reject H0: the lines are parallel
224
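Both tests can be phrased as nested-model comparisons; a minimal sketch assuming a hypothetical DataFrame df with columns scrap, speed and a 0/1 indicator line:

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# speed * line expands to speed + line + speed:line
full = smf.ols("scrap ~ speed * line", data=df).fit()

# Identical lines: H0: beta2 = beta3 = 0
print(anova_lm(smf.ols("scrap ~ speed", data=df).fit(), full))

# Parallel lines (equal slopes): H0: beta3 = 0
print(anova_lm(smf.ols("scrap ~ speed + line", data=df).fit(), full))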
Lecture 10: Model selection and
validation
Applied Regression Analysis (BIS623a)
Fall 2005
Instructor: Ralitza Gueorguieva
225
Model-building process
• Data collection and preparation
• Reduction of explanatory variables
• Model refinement and selection
• Model validation
226
Data collection and preparation
• Data collection varies by type of study:
- Controlled experiments
- Controlled experiments with covariates
- Confirmatory observational studies
- Exploratory observational studies
- exclude explanatory variables not fundamental to the problem
- may exclude variables subject to large measurement error
- exclude duplicate variables
Data preparation involves edit checks, plots to identify gross errors
Preliminary model investigation: identify functional form of explanatory variables, important interactions, may rely on prior knowledge
227
Reduction of explanatory
variables
• Omitting important variables may bias estimates and may damage the explanatory power of the model
• Overfitted model may result in large variances of estimates of parameters
• Variable subset should be manageable and large enough for adequate description
228
Model refinement and selection
• Tentative regression model or several good
regression models need to be checked for
curvature and interaction effects. Residual plots,
formal tests for violations of assumptions and lack
of fit can be used
• Remedial actions may need to be applied
229
Model validation
• Model validity refers to:
- the stability and reasonableness of the regression
coefficients
- usability of the regression function
- ability to generalize inferences
230
Criteria for model selection
• Examination of all models is virtually impossible
(For P-1 potential predictors there are 2^(P-1) possible models, not considering interactions or higher order terms among the predictors)
• Model selection procedures are used to identify a
small group of regression models that are “good”
according to a specified criterion
• These “good” models are then examined in detail to
come up with “best” model(s)
• We will consider 6 criteria: R²p, R²a,p, Cp, AICp, SBCp, PRESSp
231
Notation and assumptions
• Number of potential X variables: P-1
• All models contain an intercept β0
• The number of X variables in a subset is denoted by p-1, 1≤p≤P
• Number of observations n is greater than maximum number of potential parameters n>P
• It is highly desirable that n is much larger than P (n>>P)
232
R²p or SSEp criterion
• R²p is based on p parameters, p-1 variables:
  R²p = 1 - SSEp/SSTO
• The goal is to identify models with a high coefficient of determination
• R²p always increases as we add variables to the model, but a small increase may not be worthwhile
233
R²a,p or MSEp criterion
• R²p does not take into account the number of variables in the model
• The adjusted coefficient of determination R²a,p is a better choice:
  R²a,p = 1 - [(n-1)/(n-p)] SSEp/SSTO = 1 - MSEp/[SSTO/(n-1)]
• The adjusted coefficient of determination does not always increase; it increases only if MSEp decreases
234
Mallows’ Cp criterion
Goal: minimize the total mean squared error of the n fitted values for each subset regression model
Deviations of the fitted values from the true mean responses: Ŷi - μi
Ŷi - μi = (Ŷi - E(Ŷi)) + (E(Ŷi) - μi) = random error + bias
E(Ŷi - μi)² = σ²{Ŷi} + (E(Ŷi) - μi)² = variance of Ŷi + squared bias
Total mean squared error: Σ E(Ŷi - μi)² = Σ σ²{Ŷi} + Σ (E(Ŷi) - μi)²
Cp = SSEp/MSE(X1, ..., X_{P-1}) - (n - 2p)
235
Mallows’ Cp criterion cont’d
When E{Ŷi} = μi (no bias in the subset model), E(Cp) ≈ p
• Models without bias will fall near the line Cp=p
• Models above the line show substantial bias
• Models below the line are there due to random error
• Hence we are looking for values of Cp that are small and near the line Cp=p
• For the full model containing all P-1 variables, Cp = P by definition
• The choice of the P-1 potential variables is very important, since whether MSE(X1, ..., X_{P-1}) is an unbiased estimator of the error variance depends on that choice
236
AICp and SBCp criteria
• Akaike's information criterion (AICp) and Schwarz's Bayesian criterion (SBCp) also penalize for the number of variables in the model.
• We look for models with small AICp and SBCp
• SBCp favors more parsimonious models
AICp = n ln(SSEp) - n ln(n) + 2p
SBCp = n ln(SSEp) - n ln(n) + [ln(n)]p
237
PRESSp criterion
• Prediction sum of squares criterion is a measure of
how well the use of the fitted values for a subset
model can predict the observed responses Yi
Ŷi(i) = predicted value for the i-th case based on the regression line fitted with the i-th case deleted
PRESS prediction error for case i: Yi - Ŷi(i)
PRESSp = Σ (Yi - Ŷi(i))²
Models with small PRESSp values are preferred
PRESSp values can be calculated from a single regression run
238
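Using the identity di = ei/(1 - hii), PRESSp indeed needs only one fit; a sketch assuming hypothetical arrays y and X:

import numpy as np
import statsmodels.api as sm

fit = sm.OLS(y, sm.add_constant(X)).fit()
h = fit.get_influence().hat_matrix_diag      # leverages h_ii

# Deleted residual d_i = e_i / (1 - h_ii), so no refitting is needed
press = np.sum((fit.resid / (1 - h)) ** 2)
print(press)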
Surgical unit example
• A hospital surgical unit is interested in predicting survival in patients undergoing a liver operation. A random sample of 108 patients was available for analysis.
• Potential predictor variables: blood clotting score (X1), prognostic index (X2), enzyme function test score (X3), liver function test score (X4), age (X5), gender (X6), indicator variables for history of alcohol use (X7,X8)
• Demonstrates the utility of the 6 different criteria for model selection (graph on p. 362).
239
240
Automatic search procedures for
model selection
• Best subset selection
• Stepwise regression model
• LASSO
241
“Best” subsets algorithms
• Provide best subsets of models according to a particular criterion
• Also provide good models for any number of variables in the model
• When the pool of X variables is large (30, 40 or more) even such algorithms can take excessive amount of time
• When several good models are identified, the final choice of variables for the model is based on model-building, residual analyses, other diagnostics, and the investigator's knowledge, and is confirmed by validation studies
242
Stepwise regression models
• This method develops a sequence of regression models, at each step adding or deleting an X variable according to an equivalent criterion: reduction in SSE, coefficient of partial correlation, t* statistic, or F* statistic
• Ends with a single “best” regression model
• Different stepwise procedures can lead to different “best” models
• Can use this method to obtain the right size of the regression model and then identify other good regression models using “best” subset procedures
243
Forward stepwise regression
• 1. Fit SLR for each of the P-1 potential variables. Look at the t-test statistic for testing that the slope is 0: tk*=bk/s{bk}. The X variable with the largest t* value is added to the model first provided that its value exceeds a predetermined threshold tcrit (usually critical value corresponding to a prespecified significance level, e.g. 0.05). If none of the t* values are greater than tcrit the algorithm stops. Equivalently decision can be based on p-value. Hence we select for inclusion the variable with smallest p-value.
• 2. Let Xk be the variable entered at step 1. We now fit all possible models with Xk and one additional X variable in and we compute the t* statistics and corresponding p values for that new variable. We enter the variable with the smallest p-value (or equivalently the largest t* value) as long as that p-value is smaller than the predetermined significance level alpha.
244
Forward stepwise regression cont'd
• 3. Check whether Xk can be dropped from the model. The criterion for dropping is similar to the one for adding a variable: drop if the corresponding p-value from the latest model is GREATER than a predetermined level (need not be the same as the level for adding a variable in).
• 4. Try to add one more variable to the model according to criteria in step 2 and then try to drop one of the variables already in the model (except the last one to be entered) as in step 3. Continue until no more variables can be added or dropped from the model.
245
Forward stepwise regression cont'd
• Choices of α-to-enter and α-to-drop are important.
• Large α-to-enter values result in models with too many
variables.
• Models with small α-to-enter values may be underspecified
and hence estimate of the error variance may be too large.
• α-to-enter should never be larger than α-to-drop.
• The order in which the variables are listed in the model statement is not important.
246
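A simplified sketch of the entry step (steps 1-2 above; the drop check of step 3 is omitted, so this is really the forward selection variant described on the next slide), assuming a hypothetical DataFrame df whose response column is named Y and whose other columns are numeric candidate predictors:

import statsmodels.formula.api as smf

def forward_selection(df, response="Y", alpha_enter=0.05):
    remaining = [c for c in df.columns if c != response]
    selected = []
    while remaining:
        # p-value of each candidate's t-test when added to the current model
        pvals = {}
        for cand in remaining:
            formula = f"{response} ~ {' + '.join(selected + [cand])}"
            pvals[cand] = smf.ols(formula, data=df).fit().pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] > alpha_enter:
            break                      # no candidate clears alpha-to-enter
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_selection(df))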
Other stepwise procedures
• Forward selection: a simplified version of forward stepwise regression without the test whether a variable once entered into the model should be dropped.
• Backward elimination: the opposite of forward selection. Starts with a model with all variables in and drops the one with the largest p-value (provided the p-value is larger than the prespecified threshold). Then fits the model to all remaining variables and drops the variable with the largest p-value based on that second model. The process continues until no more variables can be dropped.
247
Comments on stepwise procedures
• Variations of procedures are possible: for example stagewise regression to include interactions
• No uniformity across software packages
• Variables can be retained in the model “by force”
• Dummy variables coding a single categorical variable should be kept together
• At each step the model should be hierarchically well-defined
• Methods based on deletion of cases and on bootstrapping have been proposed more recently.
248
Model validation
• Three approaches:
• 1. Checking the model and its predictive ability on a new sample (Note, it may be difficult to replicate the study).
• 2. Comparison of results with theoretical expectations, earlier empirical results and simulation results.
• 3. Use of a holdout sample to check the model and its predictive ability.
249
Methods of checking validity
Re-estimate the model form using the new data and compare the estimated regression coefficients and various characteristics of the fitted models to the new and the old data
Assess the predictive ability of the selected regression model by using the original model to predict each case in the new data set and then to calculate the mean of squared prediction errors (MSPR)
MSPR = Σ (Yi - Ŷi)²/n*, where n* is the number of cases in the new data set
If MSPR ≈ MSE for the model-building data set, then predictive capability is good.
If MSPR >> MSE, then MSPR is a better estimate of predictive ability.
250
Data splitting
• Used when the collection of new data is not feasible and the original sample is large enough (6 to 10 cases per variable)
• Involves splitting the original data set into a model-building set (training set), and validation set (prediction set)
• This validation is often called “cross-validation”
• The splits of the data can be made at random, by time, or pairs of data points can be created and each placed in different set
251
Comments on data splitting
• Data can be split so that the two data sets have similar statistical properties
• Different refinements of data splitting are possible (for example K-fold cross-validation)
• The PRESS criterion can also be used for data validation
• A disadvantage of not using the whole data set to develop the model is that the parameter estimates are more imprecise
252
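A sketch of a random split with MSPR computed on the holdout set, assuming a hypothetical DataFrame df and an illustrative model formula:

import numpy as np
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
mask = rng.random(len(df)) < 0.7            # ~70% model-building set
train, valid = df[mask], df[~mask]

fit = smf.ols("Y ~ X1 + X2", data=train).fit()
mspr = np.mean((valid["Y"] - fit.predict(valid)) ** 2)
print(fit.mse_resid, mspr)                   # MSPR >> MSE suggests overfitting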
Lecture 11: Building the Regression
Model: Diagnostics
Applied Regression Analysis (BIS623a)
Fall 2005
Instructor: Ralitza Gueorguieva
253
Added-variable plots
• Added-variable plots (partial regression plots, adjusted variable plots) are residual plots that provide graphical information about the marginal importance of a variable Xk given the other predictor variables already in the model.
• To create added-variable plots two regressions must be performed: Y on all other predictor variables, and Xk on all other predictor variables. Then the residuals from these two regressions are plotted against each other.
• The plots help decide whether the additional variable Xk should be added to the regression model and in what form.
• They also provide an idea about the strength of the marginal relationship between Y and Xk and about possible outliers.
254
Comments on added-variable
plots
• Only suggest the nature of the relationship
between Y and the predictor variable but do
not show the functional form.
• May be misleading if there are complex
interrelationships between the predictor
variables.
255
Example (added-variable plot)
• Life-insurance example:
• Y – amount of insurance carried
• X1 – annual income
• X2 – measure of risk aversion
• SAS program/output
256
Identifying outlying Y
observations
Residual: ei = Yi - Ŷi
Semistudentized residual: ei* = ei/√MSE
Hat matrix: H = X(X'X)⁻¹X'
Vector of residuals: e = (I - H)Y
Variance of residuals: σ²{e} = σ²(I - H)
Variance of the i-th residual: σ²{ei} = σ²(1 - hii), where hii is the i-th diagonal element of H
Covariance between residuals: σ{ei, ej} = -σ²hij (i ≠ j)
Estimators: s²{ei} = MSE(1 - hii), s{ei, ej} = -MSE·hij
257
Studentized residuals
• Also known as internally studentized residuals
• Unlike semistudentized residuals have constant
variances
ri = ei/s{ei}, where s{ei} = √[MSE(1 - hii)]
258
Deleted residuals
• More useful for detecting outlying Y values, since if a particular case is highly influential it will pull the regression line towards itself and its internal residual will be small
Deleted residual: di = Yi - Ŷi(i)
Equivalent value: di = ei/(1 - hii)
Estimated variance: s²{di} = MSE(i)/(1 - hii)
Distribution: di/s{di} ~ t(n - p - 1)
259
Studentized deleted residual
• Externally studentized deleted residual
• Can be calculated without fitting the regression
line
ti = di/s{di} = ei/√[MSE(i)(1 - hii)] = ei √[(n - p - 1)/(SSE(1 - hii) - ei²)]
260
Test for outliers
• Idea: identify cases with large externally
studentized residuals and perform
Bonferroni procedure for n residuals
• Consider a residual ti:
• If |ti|>t(1-α/(2n);n-p-1) then declare that
residual an outlier
261
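A sketch of the Bonferroni outlier test, assuming hypothetical arrays y and X; statsmodels also offers the test directly via outlier_test():

import statsmodels.api as sm
from scipy import stats

fit = sm.OLS(y, sm.add_constant(X)).fit()
t_i = fit.get_influence().resid_studentized_external

n, p = int(fit.nobs), int(fit.df_model) + 1
t_crit = stats.t.ppf(1 - 0.05 / (2 * n), n - p - 1)   # Bonferroni critical value
print((abs(t_i) > t_crit).sum(), "outliers")

# statsmodels provides the same test directly
print(fit.outlier_test(method="bonf"))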
Identifying outlying X
observations
• Also based on the hat matrix
• hii (LEVERAGE) measures the distance between the X values for the i-th case and the mean of the X values
• 0 ≤ hii ≤ 1, Σ hii = p
• High leverage values substantially influence Ŷ
• Outlying cases with respect to X: hii > 2p/n, or hii > 0.5 for big data sets
262
Identifying influential cases
• A case is influential if its exclusion causes
major changes in the fitted regression
function
263
Identifying influential cases
Influence on a single fitted value:
DFFITSi = (Ŷi - Ŷi(i))/√[MSE(i)hii] = ti [hii/(1 - hii)]^{1/2}
Influential case: |DFFITS| > 1 for small/medium data sets
                  |DFFITS| > 2√(p/n) for big data sets
264
Identifying influential cases
Influence on all fitted values: Cook's distance
Di = Σj (Ŷj - Ŷj(i))²/(p·MSE) = ei²hii/[p·MSE·(1 - hii)²]
A case may be influential by having a large residual, a large leverage, or both
Influential case: relate Di to the F(p, n-p) distribution and assess its percentile
If the percentile is near 50% or more, the i-th case has substantial influence on the regression line
265
Identifying influential cases
• Influence on the regression coefficients -
DFBETAS
DFBETASk(i) = (bk - bk(i))/√[MSE(i)ckk], k = 0, 1, ..., p-1
where ckk is the k-th diagonal element of (X'X)⁻¹
Influential case: |DFBETAS| > 1 for small or medium data sets
                  |DFBETAS| > 2/√n for large data sets
266
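All of these diagnostics are available from one fitted model in statsmodels; a sketch assuming hypothetical arrays y and X:

import statsmodels.api as sm

fit = sm.OLS(y, sm.add_constant(X)).fit()
infl = fit.get_influence()

h = infl.hat_matrix_diag             # leverages; flag h_ii > 2p/n
dffits, dffits_thresh = infl.dffits  # influence on the case's own fitted value
cooks_d, _ = infl.cooks_distance     # influence on all fitted values
dfbetas = infl.dfbetas               # influence on each coefficient
print(h.max(), abs(dffits).max(), cooks_d.max(), abs(dfbetas).max())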
Influences on inferences
• Compare regression models and fits with
and without influential observations
• Note that one influential observation may
mask another so that dropping an influential
observation may reveal another influential
observation
267
Multicollinearity diagnostics
• Informal diagnostics:
- large changes in the estimated regression coefficients when a variable is added or deleted, or when an observation is added or deleted
- Nonsignificant individual tests for regression coefficients for important predictor variables
- Estimated regression coefficients with an opposite sign to theoretical considerations or prior experience
- Large correlations between predictor variables
- Wide confidence intervals for important predictor variables
268
Multicollinearity diagnostics
• Variance inflation factors: measure the inflation of
variances of estimated regression coefficients as
compared to when the predictor variables are not
linearly related
(VIF)k = (1 - R²k)⁻¹, where R²k is the coefficient of multiple determination of Xk on the remaining X variables
(VIF)k = 1 when R²k = 0
(VIF)k is infinity when Xk is perfectly correlated with the other predictor variables
269
VIF: diagnostic uses
• Largest VIF > 10 is indication that
multicollinearity may be unduly influencing
the least squares estimates
• Mean VIF considerably larger than 1 is
indicative of serious multicollinearity
problems
270
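A sketch of computing the VIFs, assuming a hypothetical predictor DataFrame X:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

Xc = sm.add_constant(X)             # intercept column plus the predictors
vifs = [variance_inflation_factor(Xc.values, k) for k in range(1, Xc.shape[1])]
print(vifs)   # largest VIF > 10, or mean VIF far above 1, signals trouble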
Lecture 12: Building the Regression
Model: Remedial measures
Applied Regression Analysis (BIS623a)
Fall 2005
Instructor: Ralitza Gueorguieva
271
Weighted Least Squares
• A remedial measure for unequal error variances (heteroscedasticity) when the regression relationship has been appropriately identified.
• Takes into account that observations with small variances provide more information than observations with large variances.
Yi = β0 + β1Xi1 + ... + β_{p-1}X_{i,p-1} + εi
εi ~ indep. N(0, σi²)
272
Weighted least squares cont’d
bw = (X'WX)⁻¹X'WY
W = diag(1/ŝ1², ..., 1/ŝn²)  or  W = diag(1/v̂1, ..., 1/v̂n)
where the ŝi and v̂i are estimates of σi and σi², usually based on a hypothesized relationship between the error variance and a predictor variable Xk
273
Weighted least squares cont’d
• Steps:
• 1. First fit the regression model using ordinary (unweighted) least squares and obtain residuals
• 2. Depending on the residual plots, either regress the absolute residuals against appropriate predictor variables to obtain estimates si of σi, or regress the squared residuals against appropriate predictor variables to obtain estimates si² (vi) of σi²
• 3. Use the estimated variances in part 2 in the weight matrix and obtain weighted least squares estimates of the regression coefficients.
• Note: the estimated variances/standard deviations are just the fitted values of the regression in part 2.
274
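A sketch of steps 1-3 for the "absolute residuals" variance function, assuming hypothetical arrays y and X:

import numpy as np
import statsmodels.api as sm

Xc = sm.add_constant(X)

ols_fit = sm.OLS(y, Xc).fit()                        # step 1: OLS residuals
sd_fit = sm.OLS(np.abs(ols_fit.resid), Xc).fit()     # step 2: estimate s_i
s_hat = sd_fit.fittedvalues

wls_fit = sm.WLS(y, Xc, weights=1.0 / s_hat ** 2).fit()   # step 3: w_i = 1/s_i^2
print(wls_fit.params, wls_fit.bse)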
Weighted least squares cont’d
Possible variance and standard deviation functions:
• If residual plot against Xk exhibits megaphone shape then regress the absolute residuals against Xk
• If residual plot against Yhat exhibits megaphone shape then regress the absolute residuals against Yhat
• If a plot of squared residuals against Xk exhibits an upward tendency regress the squared residuals against Xk
• If a plot of the residuals against Xk suggests that the variance increases rapidly with increases in Xk up to a point and then increases more slowly, regress the absolute residuals against Xk and Xk²
275
Weighted least squares cont’d
• Iteratively reweighted least squares (IRLS): If the weighted least squares estimates (in step 3) are very different from the unweighted least squares estimates then the residuals from step 3 (weighted regression) can be regressed on the appropriate predictor variables to re-estimate the variance/standard deviation and the process can be repeated to obtain more stable estimates of the regression coefficients. Usually one or two iterative steps are sufficient.
• In designed experiments replicates may be present at each combination of levels of X variables and sample variances/standard deviations at each replicate point can be used to estimate the weights.
276
Weighted least squares cont’d
• Inference when weights are estimated:
bw = (X'WX)⁻¹X'WY
s²{bw} = MSEw (X'WX)⁻¹
MSEw = Σ wi(Yi - Ŷi)²/(n - p) = Σ wiei²/(n - p)
277
Blood pressure example
• A researcher studied the relationship between
blood pressure (dbp) and age (age) in healthy
women 20 to 60 years old
• Sample size: 54
• SAS program to illustrate the use of weighted least
squares
278
Remedial measures for multicollinearity
• Drop one of two highly correlated variables from the model
• Use centered data
• Form independent indexes (linear combinations of the predictor variables) and use these as predictors: principal component analysis
• Ridge regression
279
Ridge regression
• Idea: obtain biased estimates of the regression parameters
with smaller variance than the LSE
• They will be much better than the LSE in terms of mean
squared error
E{(bR - β)²} = σ²{bR} + (E{bR} - β)² = variance + bias²
280
Ridge estimators
After the correlation transformation: Yi* = β1*Xi1* + ... + β*_{p-1}X*_{i,p-1} + εi*
OLS: rXX b = rYX, so b = rXX⁻¹ rYX
Ridge estimators: bR = (rXX + cI)⁻¹ rYX, c ≥ 0 a constant
c reflects the amount of bias in the estimators
If c = 0 then bR = b (OLS); if c > 0, the bR are biased but more stable
There always exists a value of c for which MSE{bR} < MSE{b}
The ridge trace (plot of bR against c) is useful in determining the proper c
(VIF)k values can also help identify a useful c
281
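A sketch of computing ridge estimators over a grid of c values (a numeric ridge trace), assuming hypothetical arrays X and y:

import numpy as np

def correlation_transform(v):
    return (v - v.mean(axis=0)) / (np.sqrt(len(v) - 1) * v.std(axis=0, ddof=1))

Xs, ys = correlation_transform(X), correlation_transform(y)
rXX, rYX = Xs.T @ Xs, Xs.T @ ys      # correlation matrices of the transformed data

for c in [0.0, 0.001, 0.01, 0.1, 1.0]:
    b_R = np.linalg.solve(rXX + c * np.eye(rXX.shape[0]), rYX)
    print(c, b_R)                    # watch where the coefficients stabilize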
Body fat example
• Study of the relationship between amount of body fat (Y) and several possible predictors:
• triceps skinfold thickness (X1)
• thigh circumference (X2)
• midarm circumference (X3)
• SAS program/output
282
Comments on ridge regression
• Ridge regression estimates are stable, i.e., they are little affected by changes in the data on which the fitted regression model is based.
• Predictions of new observations based on ridge estimated regression functions are more precise than predictions based on OLS
• Advantages of ridge regression increase as degree of multicollinearity increases
• Limitation of ridge regression is that ordinary inference procedures do not apply. Bootstrapping may be employed to evaluate the precision of ridge regression coefficients
• Another limitation is that choice of c is judgemental
• Ridge regression can be modified to use different c values for different regression coefficients
• Ridge traces can be used to reduce the number of predictor variables. Variables with unstable traces are candidates for dropping and variables whose traces tend to zero are also candidates for dropping
283
Remedial measures for influential cases
• 1. Check if influential cases are recording errors, due to breakdown of instruments, etc.
• 2. If no obvious errors are observed, the adequacy of the model should be examined – are interactions or higher order terms omitted, is the functional form appropriate
• 3. Influential cases that are not obvious errors should be discarded only with extreme caution
• 4. If it is not desirable to discard influential cases, robust regression can be used
284
Robust regression
• Idea: dampens the influence of outlying cases to provide a better fit to the majority of cases
• Types of robust regression:
• - Least absolute residuals (LAR) or least absolute deviations (LAD) regression: minimizes
  L1 = Σ |Yi - (β0 + β1Xi1 + ... + β_{p-1}X_{i,p-1})|
• - IRLS robust regression
• - Least median of squares (LMS) regression: minimizes
  median{[Yi - (β0 + β1Xi1 + ... + β_{p-1}X_{i,p-1})]²}
• - other procedures involving trimming, or ranks
285
IRLS Robust regression
• WLS used to reduce the influence of outlying cases by employing
weights that vary inversely with the size of the residual
• Weight functions:
Huber:    w = 1 if |u| ≤ 1.345;  w = 1.345/|u| if |u| > 1.345
Bisquare: w = [1 - (u/4.685)²]² if |u| ≤ 4.685;  w = 0 if |u| > 4.685
286
IRLS robust regression
• 1. Choose a weighting function for weighting the cases
• 2. Obtain starting weights for all cases
• 3. Use starting weights in WLS and obtain scaled residuals from the fitted regression function
• 4. Use the residuals in step 3 to obtain revised weights
• 5. Continue the iterations until convergence is obtained (small change in weights, residuals, estimated regression coefficients or fitted values)
Scaled residual: ui = ei/MAD
MAD = (1/0.6745) median{|ei - median{ei}|}
287
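statsmodels implements this IRLS scheme directly; a sketch assuming hypothetical arrays y and X (HuberT uses the 1.345 constant above, and TukeyBiweight corresponds to the bisquare function with 4.685):

import statsmodels.api as sm

rlm_fit = sm.RLM(y, sm.add_constant(X), M=sm.robust.norms.HuberT()).fit()
print(rlm_fit.params)
print(rlm_fit.weights.min())   # cases with small final weights are outlying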
Comments about robust regression
• - requires knowledge of the regression function
• - can identify outliers in case of multiple outliers masking each other's effects. Cases with small final weights are outlying.
• - can confirm the appropriateness of OLS
• - mainly reduces the influence of cases outlying with respect to Y. To make the procedure more sensitive to outlying X observations, studentized residuals may be used, or the weights can incorporate the leverage of observations
• - a limitation is that evaluation of the precision of estimates is more complicated. The bootstrap can be used.
288
Bootstrapping
• Can be used as a remedial measure for evaluating
precision in non-standard situations
• Computationally intensive
289
Bootstrapping cont’d
• Suppose that we have a sample of size n on which we want to fit a regression model
• Consider an estimated regression coefficient b1
• To estimate the precision of this estimate we generate many samples with replacement from the original sample and fit regression models to them
• From each bootstrap sample we obtain one additional estimate of the same regression coefficient
• The standard deviation of all the bootstrap estimates s*{b1*} is a measure of the precision of b1
290
Bootstrap sampling
• Fixed X sampling: used when the regression function is appropriate to the data, the errors have constant variance and the predictor variables may be regarded as fixed
- residuals ei from original fitting are regarded as the sample data and are sampled with replacement
- the bootstrap Y values are then obtained by adding the original fitted values and these bootstrapped residuals
- the bootstrapped Y are then regressed on the original X to obtain bootstrap estimates of the regression coefficients
Yi(m)* = Ŷi + ei(m)*
291
Bootstrap sampling cont’d
• Random X sampling: used when there are
questions regarding the adequacy of the regression
function, the error variances are not constant,
and/or the predictor variables can not be regarded
as fixed
- For SLR (X,Y) pairs are sampled with
replacement from the original sample
292
Bootstrap confidence intervals
• Reflection method: based on the (α/2)100 and (1- α/2)100 percentiles of the bootstrap distribution of b1*: b1*(α/2) and b1*(1-α/2)
• Requires large number of bootstrap samples (at least 500)
d1 = b1 - b1*(α/2)
d2 = b1*(1 - α/2) - b1
Approx. (1-α)100% CI for β1: (b1 - d2, b1 + d1)
293
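A sketch of fixed X bootstrap sampling with the reflection interval, assuming hypothetical arrays y and X for a simple linear regression:

import numpy as np
import statsmodels.api as sm

Xc = sm.add_constant(X)
fit = sm.OLS(y, Xc).fit()
b1 = fit.params[1]

rng = np.random.default_rng(0)
B = 1000
boot = np.empty(B)
for m in range(B):
    e_star = rng.choice(fit.resid, size=len(y), replace=True)  # resample residuals
    y_star = fit.fittedvalues + e_star                          # Y* = Yhat + e*
    boot[m] = sm.OLS(y_star, Xc).fit().params[1]

lo, hi = np.percentile(boot, [2.5, 97.5])        # b1*(alpha/2), b1*(1-alpha/2)
d1, d2 = b1 - lo, hi - b1
print((b1 - d2, b1 + d1))                        # 95% reflection interval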