Simple Linear Regression: An Introduction Dr. Tuan V. Nguyen Garvan Institute of Medical Research...

Simple Linear Regression:An Introduction

Dr. Tuan V. Nguyen

Garvan Institute of Medical Research

Sydney

Give a man three weapons – correlation, regression and a pen – and he will use all three

(Anon, 1978)

An exampleID Age Chol (mg/ml)

1 463.5

2 201.9

3 524.0

4 302.6

5 574.5

6 253.0

7 282.9

8 363.8

9 222.1

10 433.8

11 574.1

12 333.0

13 222.5

14 634.6

15 403.2

16 484.2

17 282.3

18 494.0

Age and cholesterol levels in 18 individuals

Read data into R

id <- seq(1:18)age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)plot(chol ~ age, pch=16)

20 30 40 50 60

2.0

2.5

3.0

3.5

4.0

4.5

age

cho

l

Questions of interest

• Association between age and cholesterol levels• Strength of association• Prediction of cholesterol for a given age

Correlation and Regression analysis

Variance and covariance: algebra

• Let x and y be two random variables from a sample of n obervations.

• Measure of variability of x and y: variance

n

i

i

n

xxx

1

2

1var

n

i

i

n

yyy

1

2

1var

• Measure of covariation between x and y ?

• Algebraically:

var(x + y) = var(x) + var(y)

var(x + y) = var(x) + var(y) + 2cov(x,y)

Where:

n

iii yyxx

nyx

11

1,cov

Variance and covariance: geometry

• The independence or dependence between x and y can be represented geometrically:

y

x

h

h2 = x2 + y2

x

yh

h2 = x2 + y2 – 2xycos(H)

H

Meaning of variance and covariance

• Variance is always positive

• If covariance = 0, x and y are independent.• Covariance is sum of cross-products: can be positive or

negative.

• Negative covariance = deviations in the two distributions in are opposite directions, e.g. genetic covariation.

• Positive covariance = deviations in the two distributions in are in the same direction.

• Covariance = a measure of strength of association.

Covariance and correlation• Covariance is unit-depenent. • Coefficient of correlation (r) between x and y is a standardized

covariance.• r is defined by:

yx SDSD

yx

yx

yxr

,cov

varvar

,cov

Positive and negative correlation

8 10 12 14 16

-30

-25

-20

-15

x

y

8 10 12 14 16

1520

2530

x

y

r = 0.9 r = -0.9

Test of hypothesis of correlation• Hypothesis: Ho: r = 0 versus Ho: r not equal to 0.

• Standard error of r is: • The t-statistic:

21

2

r

nrt

• This statistic has a t distribution with n – 2 degrees of freedom.

• Fisher’s z-transformation:

• Standard error of z:

• Then 95% CI of z can be constructed as:

2

1 2

n

rrSE

r

rz

1

1ln

2

1

3

1

nzSE

3

1

nz

An illustration of correlation analysisID Age Cholesterol

(x) (y; mg/100ml)

1 46 3.52 20 1.93 52 4.04 30 2.65 57 4.56 25 3.07 28 2.98 36 3.89 22 2.110 43 3.811 57 4.112 33 3.013 22 2.514 63 4.615 40 3.216 48 4.217 28 2.318 49 4.0Mean 38.83 3.33SD 13.60 0.84

Cov(x, y) = 10.68

94.0

84.060.13

68.10,cov

yx SDSD

yxr

56.094.01

94.01ln

2

1

z

26.015

1

3

1

n

zSE

t-statistic = 0.56 / 0.26 = 2.17

Critical t-value with 17 df and alpha = 5% is 2.11

Conclusion: There is a significant association between age and cholesterol.

Simple linear regression analysis

• Assessment:– Quantify the relationship between two variables

• Prediction– Make prediction and validate a test

• Control– Adjusting for confounding effect (in the case of multiple variables)

• Only two variables are of interest: one response variable and one predictor variable

• No adjustment is needed for confounding or covariate

Relationship between age and cholesterol

Linear regression: model

• Y : random variable representing a response• X : random variable representing a predictor variable

(predictor, risk factor)– Both Y and X can be a categorical variable (e.g., yes / no) or a

continuous variable (e.g., age). – If Y is categorical, the model is a logistic regression model; if Y is

continuous, a simple linear regression model.

• ModelY = + X +

: intercept : slope / gradient : random error (variation between subjects in y even if x is constant, e.g.,

variation in cholesterol for patients of the same age.)

Linear regression: assumptions

• The relationship is linear in terms of the parameter;

• X is measured without error;

• The values of Y are independently from each other (e.g., Y1 is not correlated with Y2) ;

• The random error term () is normally distributed with mean 0 and constant variance.

Expected value and variance

• If the assumptions are tenable, then: • The expected value of Y is: E(Y | x) = + x

• The variance of Y is: var(Y) = var() = 2

Given two points A(x1, y1) and B(x2, y2) in a two-dimensional space, we can derive an equation connecting the points.

A(x1,y1)

B(x2,y2)

Gradient:12

12

xx

yy

dx

dym

Equation: y = mx + a

What happen if we have more than 2 points?

a

x

y

0

dy

dx

Estimation of model parameters

Estimation of and

• For a series of pairs: (x1, y1), (x2, y2), (x3, y3), …, (xn, yn)

• Let a and b be sample estimates for parameters and ,

• We have a sample equation: Y* = a + bx

• Aim: finding the values of a and b so that (Y – Y*) is minimal.

• Let SSE = sum of (Yi – a – bxi)2.

• Values of a and b that minimise SSE are called least square estimates.

Criteria of estimation

Chol

Age

ii bxay ˆ

iii yyd ˆyi

The goal of least square estimator (LSE) is to find a and b such that the sum of d2 is minimal.

Estimation of and • After some calculus operations, the results can be shown

to be:

xx

xy

S

Sb

xbya

n

iixx xxS

1

2

n

iiixy yyxxS

1

Where:

• When the regression assumptions are valid, the estimators of and have the following properties:

– Unbiased

– Uniformly minimal variance (eg efficient)

Goodness-of-fit

• Now, we have the equation Y = a + bX + e

• Question: how well the regression equation describe the actual data?

• Answer: coefficient of determination (R2): the amount of variation in Y is explained by the variation in X.

Partitioning of variations: concept

• SST = sum of squared difference between yi and the mean of y.

• SSR = sum of squared difference between the predicted value of y and the mean of y.

• SSE = sum of squared difference between the observed and predicted value of y.

SST = SSR + SSE

The the coefficient of determination is:

R2 = SSR / SST

Partitioning of variations: geometry

Chol (Y)

Age (X)

mean

SSR

SSE

SST

Partitioning of variations: algebra

• Some statistics:• Total variation:• Attributed to the model:• Residual sum of square: • SST = SSR + SSE• SSR = SST – SSE

n

ii yySST

1

2

n

ii yySSR

1

2ˆ

n

iii yySSE

1

2ˆ

Analysis of variance

• SS increases in proportion to sample size (n)

• Mean squares (MS): normalise for degrees of freedom (df)

– MSR = SSR / p (where p = number of degrees of freedom)

– MSE = SSE / (n – p – 1)

– MST = SST / (n – 1)

• Analysis of variance (ANOVA) table:

Source d.f. Sum of squares (SS)

Mean squares (MS)

F-test

Regression

Residual

Total

p

N–p –1

n – 1

SSR

SSE

SST

MSR

MSE

MSR/MSE

Hypothesis tests in regression analysis

• Now, we have

Sample data: Y = a + bX + ePopulation: Y = + X +

• Ho: = 0. There is no linear association between the outcome and predictor variable.

• In layman language: “what is the chance, given the sample data that we observed, of observing a sample of data that is less consistent with the null hypothesis of no association?”

Inference about slope (parameter )

• Recall that is assumed to be normally distributed with mean 0 and variance = 2.

• Estimate of 2 is MSE (or s2)• It can be shown that

– The expected value of b is , i.e. E(b) = – The standard error of b is:

• Then the test whether = 0 is: t = b / SE(b) which follows a t-distribution with n-1 degrees of freedom.

xxSsbSE /

Confidence interval around predicted valued

• Observed value is Yi.

• Predicted value is • The standard error of the predicted value is:

• Interval estimation for Yi values

xx

ii S

xx

nsYSE

211ˆ

2/1,1ˆˆ

pnii tYSEY

ii bxaY ˆ

Checking assumptions

• Assumption of constant variance• Assumption of normality• Correctness of functional form• Model stability

• All can be conducted with graphical analysis. The residuals from the model or a function of the residuals play an important role in all of the model diagnostic procedures.

Checking assumptions

• Assumption of constant variance– Plot the studentized residuals versus their predicted values. Examine

whether the variability between residuals remains relatively constant across the range of fitted values.

• Assumption of normality– Plot the residuals versus their expected values under normality (Normal

probability plot). If the residuals are normally distributed, it should fall along a 45o line.

• Correct functional form? – Plot the residuals versus fitted values. Examine whether the residual

plot for evidence of a non-linear trend in the value of the residual across the range of fitted values.

• Model stability– Check whether one or more observations are influential. Use Cook’s

distance.

Checking assumptions (Cont)

• Cook’s distance (D) is a measure of the magnitude by which the fitted values of the regression model change if the ith observation is removed from the data set.

• Leverage is a measure of how extreme the value of xi is relative to the remaining value of x.

• The Studentized residual provides a measure of how extreme the value of yi is relative to the remaining value of y.

Remedial measures

• Non-constant variance– Transform the response variable (y) to a new scale (e.g. logarithm) is

often helpful.– If no transformation can achieve the non-constant variance problem,

use a more robust estimator such as iterative weighted least squares.

• Non-normality– Non-normality and non-constant variance go hand-in-hand.

• Outliers– Check for accuracy– Use robust estimator

Regression analysis using R

id <- seq(1:18)

age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,

43, 57, 33, 22, 63, 40, 48, 28, 49)

chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,

3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

#Fit linear regression model

reg <- lm(chol ~ age)

summary(reg)

ANOVA result

> anova(reg)Analysis of Variance Table

Response: chol Df Sum Sq Mean Sq F value Pr(>F) age 1 10.4944 10.4944 114.57 1.058e-08 ***Residuals 16 1.4656 0.0916 ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Results of R analysis> summary(reg)

Call:lm(formula = chol ~ age)

Residuals: Min 1Q Median 3Q Max -0.40729 -0.24133 -0.04522 0.17939 0.63040

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.089218 0.221466 4.918 0.000154 ***age 0.057788 0.005399 10.704 1.06e-08 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3027 on 16 degrees of freedomMultiple R-Squared: 0.8775, Adjusted R-squared: 0.8698 F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08

Diagnostics: influential data

par(mfrow=c(2,2))plot(reg)

2.5 3.0 3.5 4.0 4.5

-0.4

0.0

0.2

0.4

0.6

Fitted values

Re

sid

ua

ls

Residuals vs Fitted

8

6

17

-2 -1 0 1 2

-10

12

Theoretical Quantiles

Sta

nd

ard

ize

d r

es

idu

als

Normal Q-Q

8

6

17

2.5 3.0 3.5 4.0 4.5

0.0

0.5

1.0

1.5

Fitted values

Sta

ndar

dize

d re

sidu

als

Scale-Location8

617

0.00 0.05 0.10 0.15 0.20 0.25-1

01

2

Leverage

Sta

nd

ard

ize

d r

es

idu

als

Cook's distance0.5

0.5

1

Residuals vs Leverage

6

2

8

A non-linear illustration: BMI and sexual attractiveness

– Study on 44 university students

– Measure body mass index (BMI)

– Sexual attractiveness (SA) score

id <- seq(1:44)bmi <- c(11.00, 12.00, 12.50, 14.00, 14.00, 14.00, 14.00, 14.00, 14.00, 14.80, 15.00, 15.00, 15.50, 16.00, 16.50, 17.00, 17.00, 18.00, 18.00, 19.00, 19.00, 20.00, 20.00, 20.00, 20.50, 22.00, 23.00, 23.00, 24.00, 24.50, 25.00, 25.00, 26.00, 26.00, 26.50, 28.00, 29.00, 31.00, 32.00, 33.00, 34.00, 35.50, 36.00, 36.00) sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5, 3.2, 3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3, 6.5, 4.9, 5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7, 3.5, 4.0, 3.7, 3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1, 2.1, 2.0, 1.8, 1.7)

Linear regression analysis of BMI and SA

reg <- lm (sa ~ bmi)

summary(reg)

Residuals: Min 1Q Median 3Q Max -2.54204 -0.97584 0.05082 1.16160 2.70856

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.92512 0.64489 7.637 1.81e-09 ***bmi -0.05967 0.02862 -2.084 0.0432 * ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.354 on 42 degrees of freedomMultiple R-Squared: 0.09376, Adjusted R-squared: 0.07218 F-statistic: 4.345 on 1 and 42 DF, p-value: 0.04323

BMI and SA: analysis of residuals

plot(reg)

3.0 3.5 4.0

-3-2

-10

12

3

Fitted values

Re

sid

ua

ls

Residuals vs Fitted

21

10

20

-2 -1 0 1 2

-2-1

01

2

Theoretical Quantiles

Sta

nd

ard

ize

d r

es

idu

als

Normal Q-Q

21

10

20

3.0 3.5 4.0

0.0

0.4

0.8

1.2

Fitted values

Sta

ndar

dize

d re

sidu

als

Scale-Location21 1020

0.00 0.02 0.04 0.06 0.08 0.10 0.12

-2-1

01

2

Leverage

Sta

nd

ard

ize

d r

es

idu

als

Cook's distance

Residuals vs Leverage

1310

BMI and SA: a simple plotpar(mfrow=c(1,1))reg <- lm(sa ~ bmi)plot(sa ~ bmi, pch=16)abline(reg)

10 15 20 25 30 35

23

45

6

bmi

sa

# Fit 3 regression modelslinear <- lm(sa ~ bmi)quad <- lm(sa ~ poly(bmi, 2))cubic <- lm(sa ~ poly(bmi, 3))

# Make new BMI axisbmi.new <- 10:40

# Get predicted valuesquad.pred <- predict(quad,data.frame(bmi=bmi.new))cubic.pred <- predict(cubic,data.frame(bmi=bmi.new))

# Plot predicted valuesabline(reg)lines(bmi.new, quad.pred, col="blue",lwd=3)lines(bmi.new, cubic.pred, col="red",lwd=3)

Re-analysis of sexual attractiveness data

10 15 20 25 30 35

23

45

6

bmi

sa

Some comments: Interpretation of correlation

• Correlation lies between –1 and +1. A very small correlation does not mean that no linear association between the two variables. The relationship may be non-linear.

• For curlinearity, a rank correlation is better than the Pearson’s correlation.

• A small correlation (eg 0.1) may be statistically significant, but clinically unimportant.

• R2 is another measure of strength of association. An r = 0.7 may sound impressive, but R2 is 0.49!

• Correlation does not mean causation.

Some comments: Interpretation of correlation

• Be careful with multiple correlations. For p variables, there are p(p – 1)/2 possible pairs of correlation, and false positive is a problem.

• Correlation can not be inferred directly from association.– r(age, weight) = 0.05; r(weight, fat) = 0.03; it does not mean that

r(age, fat) is near zero.

– In fact, r(age, fat) = 0.79.

Some comments: Interpretation of regression

• The fitted line (regression) is only an estimated of the relation between these variables in the population.

• Uncertainty associated with estimated parameters.

• Regression line should not be used to make prediction of x values outside the range of values in the observed data.

• A statistical model is an approximation; the “true” relation may be nonlinear, but a linear is a reasonable approximation.

Some comments: Reporting results

• Results should be reported in sufficient details: nature of response variable, predictor variable; any transformation; checking assumptions, etc.

• Regression coefficients (a, b), their associated standard errors, and R2 are useful summary.

Some final comments

• Equations are the cornerstone on which the edifice of science rests.

• Equations are like poems, or even an onion.

• So, be careful with your building of equations!

Simple Linear Regression: An Introduction Dr. Tuan V. Nguyen Garvan Institute of Medical Research...

Documents

Transcript of Simple Linear Regression: An Introduction Dr. Tuan V. Nguyen Garvan Institute of Medical Research...