
Transcript of Lecture 8: Correlation and Intro to Linear Regression (eckel/biostat2/slides/lecture8.pdf)

Lecture 8: Correlation and Intro to Linear Regression

Sandy Eckel, [email protected]

5 May 2008

1 / 40

Quantifying association

Goal: Express the strength of the relationship between two variables

Metric depends on the nature of the variables

For now, we’ll focus on continuous variables (e.g. height, weight)

Important! Association does not imply causation

To describe the relationship between two continuous variables, use:

Correlation analysis: measures the strength and direction of the linear relationship between two variables

Regression analysis: concerns prediction or estimation of an outcome variable, based on the value of another variable (or variables)

2 / 40

Correlation analysis

Plot the data, or have a computer do so (see the sketch below)

Visually inspect the relationship between two continuous variables

Is there a linear relationship (correlation)?

Are there outliers?

Are the distributions skewed?
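A minimal sketch of this first step, assuming matplotlib is available and using made-up height/weight values purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 50 made-up (height, weight) pairs, for illustration only
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=50)                    # cm
weight = 0.9 * height - 90 + rng.normal(0, 8, size=50)   # kg, roughly linear in height

plt.scatter(height, weight)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Visual check: linear trend? outliers? skewness?")
plt.show()
```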

3 / 40

Correlation Coefficient I

Measures the strength and direction of the linear relationship between two variables X and Y

Population correlation coefficient:

ρ = cov(X, Y) / √(var(X) · var(Y)) = E[(X − µX)(Y − µY)] / √(E[(X − µX)²] · E[(Y − µY)²])

Sample correlation coefficient (obtained by plugging in sample estimates):

r = sample cov(X, Y) / √(s²X · s²Y)
  = [ Σ(Xi − X̄)(Yi − Ȳ)/(n−1) ] / √( [ Σ(Xi − X̄)²/(n−1) ] · [ Σ(Yi − Ȳ)²/(n−1) ] ), with sums over i = 1, …, n
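A small numpy sketch (with made-up data, not anything from the lecture) that computes r directly from this formula and checks it against numpy's built-in np.corrcoef:

```python
import numpy as np

# Made-up linearly related data, for illustration only
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)  # sample covariance
s2_x = np.sum((x - x.mean()) ** 2) / (n - 1)                # sample variance of X
s2_y = np.sum((y - y.mean()) ** 2) / (n - 1)                # sample variance of Y

r = cov_xy / np.sqrt(s2_x * s2_y)
print(r, np.corrcoef(x, y)[0, 1])  # the two values agree
```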

4 / 40

Correlation Coefficient II

The correlation coefficient, ρ, takes values between -1 and +1

-1: Perfect negative linear relationship

0: No linear relationship

+1: Perfect positive linear relationship

5 / 40

Correlation Coefficient III

Plot standardized Y versus standardized X

Observe an ellipse (elongated circle)

Correlation is the slope of the major axis

6 / 40

Correlation Notes

Other names for r

Pearson correlation coefficient
Product-moment correlation coefficient

Characteristics of r

Measures *linear* association
The value of r is independent of the units used to measure the variables
The value of r is sensitive to outliers
r² tells us what proportion of the variation in Y is explained by the linear relationship with X
(the unit-independence and r² properties are illustrated in the sketch below)
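A short sketch, again with made-up data: converting X to different units leaves r unchanged, and r² equals the proportion of the variation in Y explained by the least-squares line.

```python
import numpy as np

# Made-up data, for illustration only
rng = np.random.default_rng(2)
x = rng.normal(50, 5, 200)                  # say, a weight measured in kg
y = 3 + 0.4 * x + rng.normal(0, 1, 200)

r = np.corrcoef(x, y)[0, 1]
r_lb = np.corrcoef(x * 2.20462, y)[0, 1]    # same X re-expressed in pounds
print(r, r_lb)                              # identical: r does not depend on units

# r^2 equals the proportion of Var(Y) explained by the fitted line
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x
print(r ** 2, 1 - np.var(y - y_hat) / np.var(y))  # these agree
```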

7 / 40

Several levels of correlation

8 / 40

Examples of the Correlation Coefficient I

Perfect positive correlation, r ≈ 1


9 / 40

Examples of the Correlation Coefficient II

Perfect negative correlation, r ≈ -1


10 / 40

Examples of the Correlation Coefficient III

Imperfect positive correlation, 0 < r < 1


11 / 40

Examples of the Correlation Coefficient IV

Imperfect negative correlation, -1 < r < 0


12 / 40

Examples of the Correlation Coefficient V

No relation, r ≈ 0


13 / 40

Examples of the Correlation Coefficient VI

Some relation but little *linear* relationship, r ≈ 0
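One way such a pattern can arise, sketched with made-up data: a strong but purely quadratic relationship has essentially no linear trend, so r comes out near 0.

```python
import numpy as np

# Made-up U-shaped data: y depends strongly on x, but not linearly
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 500)
y = x ** 2 + rng.normal(0, 0.5, 500)

print(np.corrcoef(x, y)[0, 1])  # close to 0 despite the clear relationship
```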


14 / 40

Association and Causality

In general, association between two variables means there is some form of relationship between them

The relationship is not necessarily causal. Association does not imply causation, no matter how much we would like it to

Example: hot days, ice cream sales, and drownings (both rise on hot days, so ice cream sales and drownings are associated without one causing the other)

15 / 40

Sir Bradford Hill’s Criteria for Causality

Strength: magnitude of association

Consistency of association: repeated observation of the association in different situations

Specificity: uniqueness of the association

Temporality: cause precedes effect

Biologic gradient: dose-response relationship

Biologic plausibility: known mechanisms

Coherence: makes sense based on other known facts

Experimental evidence: from designed (randomized) experiments

Analogy: with other known associations

16 / 40

Simple Linear Regression (SLR): Main idea

Linear regression can be used to study a continuous outcome variable as a linear function of a predictor variable

Example: 60 cities in the US were evaluated for numerous characteristics, including:

Outcome variable (y): the % of the population with low income

Predictor variable (x): median education level

Linear regression can help us to model the association between median education and the % of the population with low income

17 / 40

Example: Boxplot of % low income by education level

Education level is coded as a binary variable with values ‘low’ and ‘high’

18 / 40

Simple linear regression, t-tests and ANOVA

Mean in low education group: 15.7%

Mean in high education group: 13.2%

The two means could be compared by a t-test or ANOVA, but regression provides a unified equation:

yi = β0 + β1xi

yi = 15.7− 2.5xi

where

xi = 1 for high education and 0 for low education (x is called a dummy variable or indicator variable that designates group)

yi is our estimate of the mean % low income for the given value of education
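A sketch of this point, using simulated data in place of the 60-city dataset (which isn't reproduced here): regressing the outcome on a 0/1 dummy variable recovers the two group means, so the fitted intercept and slope play the roles of 15.7 and −2.5 above.

```python
import numpy as np

# Simulated stand-in for the 60 cities: x = 0 for 'low', 1 for 'high' education
rng = np.random.default_rng(4)
x = np.repeat([0, 1], 30)
y = np.where(x == 0, 15.7, 13.2) + rng.normal(0, 2, 60)  # % low income (made up)

slope, intercept = np.polyfit(x, y, 1)        # least-squares fit of y on the dummy
print(intercept, intercept + slope)           # ≈ the low- and high-group means
print(y[x == 0].mean(), y[x == 1].mean())     # the same two means computed directly
```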

What about the β’s?

19 / 40

Review: equation for a line

Recall that back in geometry class, you learned that a line could be represented by the equation

y = mx + b

where

m = slope of the line (rise/run)

b = y-intercept (value of y when x = 0)

[Plot of the line y = mx + b over x from 0 to 5, with slope m = 2 and intercept b = 5]

20 / 40

Regression analysis represented by equation for a line

In simple linear regression, we use the equation for a line

y = mx + b

but we write it slightly differently:

y = β0 + β1x

β0 = y-intercept (value of y when x = 0)

β1 = slope of the line (rise/run)

21 / 40

Example: the model components

yi = β0 + β1xi

yi = 15.7 + (−2.5)xi

yi is the predicted mean of the outcome for observation i, given xi

β0 is the intercept, or the value of yi when xi = 0

β1 is the slope, or the change in yi for a 1-unit increase in xi

xi is the indicator variable of low or high education for observation i

22 / 40

Example: fill in covariate value to help interpretation

yi = β0 + β1xi

yi = 15.7− 2.5xi

xi = 0 (low education)

yi = 15.7− 2.5× 0

= 15.7 = β0

xi = 1 (high education)

yi = 15.7− 2.5× 1

= 13.2 = β0 + β1

23 / 40

Interpretations

Intercept

β0 is the mean outcome for the reference group, or the group for which xi = 0.

Here, β0 is the average percent of the population that is low income for cities with low education.

Slope

β1 is the difference in the mean outcome between the two groups (when xi = 1 vs. when xi = 0)

Here, β1 is the difference in the average percent of the population that is low income for cities with high education compared to cities with low education.

24 / 40

Why use linear regression?

Linear regression is very powerful. It can be used for many things:

Binary X

Continuous X

Categorical X

Adjustment for confounding

Interaction

Curved relationships between X and Y

25 / 40

Regression analysis

A regression is a description of a response measure, Y, the dependent variable, as a function of an explanatory variable, X, the independent variable.

Goal: prediction or estimation of the value of one variable, Y, based on the value of the other variable, X.

A simple relationship between the two variables is a linear relationship (straight-line relationship)

Other names: linear, simple linear, least squares regression

26 / 40

Foundational example: Galton’s study on height

1000 records of heights of family groups

Really tall fathers tend on average to have tall sons, but not quite as tall as the really tall fathers

Really short fathers tend on average to have short sons, but not quite as short as the really short fathers

There is a regression of a son’s height toward the mean height for sons

27 / 40

Example: Galton’s data and resulting regression

28 / 40

Regression analysis: population model

Probability model: Independent responses y1, y2, . . . , yn are sampled from

yi ∼ N(µi , σ2)

Systematic model: µi = E (yi |xi ) = β0 + β1xi

where

β0 = intercept

β1 = slope

29 / 40

Another way to write the model

Systematic: yi = β0 + β1xi + εi

Probability (random): εi ∼ N(0, σ2)

The response yi is a linear function of xi plus some random, normally distributed error, εi

data = signal + noise
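A brief simulation sketch of this decomposition, with arbitrary values of β0, β1, and σ chosen only for illustration: generate responses as signal plus normal noise, then recover the coefficients by least squares.

```python
import numpy as np

rng = np.random.default_rng(5)
beta0, beta1, sigma = 10.0, 2.0, 1.5     # illustrative 'true' parameters
x = rng.uniform(0, 10, 200)

eps = rng.normal(0, sigma, 200)          # random part: eps ~ N(0, sigma^2)
y = beta0 + beta1 * x + eps              # systematic part (signal) plus noise

b1_hat, b0_hat = np.polyfit(x, y, 1)     # least-squares estimates
print(b0_hat, b1_hat)                    # close to 10 and 2
```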

30 / 40

Geometric interpretation

31 / 40

Remember: two (equivalent) ways to write the model

Probability: yi ∼ N(µi , σ2)

Systematic: µi = E (yi |xi ) = β0 + β1xi

where

β0 = intercept

β1 = slope

OR

Systematic: yi = β0 + β1xi + εi

Probability: εi ∼ N(0, σ2)

The response, yi, is a linear function of xi plus some random, normally distributed error, εi

32 / 40

Interpretation of coefficients

Intercept (β0)
Mean model: µ = E(y|x) = β0 + β1x

β0 = expected response when x = 0, since E(y|x = 0) = β0 + β1(0) = β0

Slope (β1)
β1 = change in expected response per 1-unit increase in x

33 / 40

From Galton’s example

E (y |x) = β0 + β1x

E(y|x) = 33.7 + 0.52x
where: y = son’s height (inches)

x = father’s height (inches)

Expected son’s height = 33.7 inches when father’s height is 0 inches

Expected difference in heights for sons whose fathers’ heights differ by one inch = 0.52 inches
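A quick check of these two statements, using only the fitted coefficients from the slide (the helper function and the example father's heights are hypothetical):

```python
def expected_son_height(father_height):
    """Fitted Galton line from the slide: E(y|x) = 33.7 + 0.52x, heights in inches."""
    return 33.7 + 0.52 * father_height

print(expected_son_height(0))                              # 33.7, the intercept
print(expected_son_height(71) - expected_son_height(70))   # 0.52, the slope
print(expected_son_height(75))                             # 72.7: tall father, son closer to the mean
```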

34 / 40

City education/income example

When education is a continuous variable (not binary)

35 / 40

City education/income model

Using the continuous variable for median education in city i (xi ):

E (yi |xi ) = β0 + β1xi

E (yi |xi ) = 36.2− 2.0xi

When xi = 0:
E(yi|xi) = 36.2 − 2.0(0) = 36.2 = β0

When xi = 1:
E(yi|xi) = 36.2 − 2.0(1) = 34.2 = β0 + β1

When xi = 2:
E(yi|xi) = 36.2 − 2.0(2) = 32.2 = β0 + β1 × 2

36 / 40

City education/income model interpretation

Intercept (β0)

β0 is the mean outcome for the reference group, or the group for which xi = 0.

Here, β0 is the average percent of the population that is low income for cities with a median education level of 0.

Slope (β1)

β1 is the difference in the mean outcome for a one-unit change in x.

Here, β1 is the difference in the average percent of the population that is low income between two cities, when the first city has a 1-unit higher median education level than the second city.

37 / 40

Finding β’s from the graph

β0 is the y-intercept of the line, or the average value of y when x = 0.

β1 is the slope of the line, or the average change in y per unit change in x.

y = mx + b

b = β0, m = β1

β1 = rise/run = (y1 − y2)/(x1 − x2)

Note on notation:

β1 represents the true slope (in the population)

β̂1 (or b1) is the sample estimate of the slope

38 / 40

Where is our intercept?

The intercept isn’t in the range of our observed data. This means:

The intercept isn’t very interpretable, since the average of y when x = 0 was never observed.
Possible solution: we might want to center our x variable (see the sketch below).
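A minimal sketch of that centering idea, with made-up data echoing the city example: subtracting the mean of x before fitting leaves the slope unchanged but makes the intercept the mean outcome at the average x, which lies inside the observed range.

```python
import numpy as np

# Made-up data: x (median education) is never anywhere near 0
rng = np.random.default_rng(6)
x = rng.normal(12, 1, 60)
y = 36.2 - 2.0 * x + rng.normal(0, 1, 60)

slope, intercept = np.polyfit(x, y, 1)
slope_c, intercept_c = np.polyfit(x - x.mean(), y, 1)

print(intercept, slope)      # intercept extrapolates far outside the data
print(intercept_c, slope_c)  # same slope; intercept is now the mean y at the average x
```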

39 / 40

Summary

Today we’ve discussed

Correlation

Linear regression with continuous y variables

Simple linear regression (just one x variable)

Binary x (‘dummy’ or ‘indicator’ variable for group)
β1: mean difference in outcome between groups
Continuous x
β1: mean difference in outcome corresponding to a 1-unit increase in x

Interpretation of regression coefficients (intercept and slope)

How to write the regression model (2 ways)

Next time we’ll discuss multiple linear regression (more than one x variable) and confounding

40 / 40