Lecture 8: Correlation and Intro to Linear Regressioneckel/biostat2/slides/lecture8.pdf ·...
-
Upload
nguyenkhue -
Category
Documents
-
view
216 -
download
2
Transcript of Lecture 8: Correlation and Intro to Linear Regressioneckel/biostat2/slides/lecture8.pdf ·...
Quantifying association
Goal: Express the strength of the relationship between twovariables
Metric depends on the nature of the variables
For now, we’ll focus on continuous variables(e.g. height, weight)
Important! association does not imply causation
To describe the relationship between two continuous variables, use:
Correlation analysisMeasures strength and direction of the linear relationshipbetween two variables
Regression analysisConcerns prediction or estimation of outcome variable, basedon value of another variable (or variables)
2 / 40
Correlation analysis
Plot the data (or have a computer to do so)
Visually inspect the relationship between two continousvariables
Is there a linear relationship (correlation)?
Are there outliers?
Are the distributions skewed?
3 / 40
Correlation Coefficient I
Measures the strength and direction of the linear relationshipbetween two variables X and Y
Population correlation coefficient:
ρ =cov(X ,Y )√
var(X ) · var(Y )=
E [(X − µX )(Y − µY )]√E [(X − µX )2] · E[(Y − µY )2]
Sample correlation coefficient:(obtained by plugging in sample estimates)
r =sample cov(X ,Y )√
s2x · s2
Y
=
∑ni=1
(Xi−X )(Yi−Y )n−1√∑n
i=1(Xi−X )2
n−1 ·∑n
i=1(Yi−Y )2
n−1
4 / 40
Correlation Coefficient II
The correlation coefficient, ρ, takes values between -1 and +1
-1: Perfect negative linear relationship
0: No linear relationship
+1: Perfect positive relationship
5 / 40
Correlation Coefficient III
Plot standardized Y versusstandardized X
Observe an ellipse(elongated circle)
Correlation is the slope ofthe major axis
6 / 40
Correlation Notes
Other names for r
Pearson correlation coefficientProduct moment of correlation
Characteristics of r
Measures *linear* associationThe value of r is independent of units used to measure thevariablesThe value of r is sensitive to outliersr2 tells us what proportion of variation in Y is explained bylinear relationship with X
7 / 40
Examples of the Correlation Coefficient I
Perfect positive correlation, r ≈ 1
● ●●
●●
●●
●●
● ●
●
● ●●
●●
● ●●
●
● ●●
● ●●
●
●
● ●
●●
●
●●
●
● ●●
9 / 40
Examples of the Correlation Coefficient II
Perfect negative correlation, r ≈ -1
● ●●
●●
●●
●
● ●●
●
●●
● ●
●●
●●
●
●● ●
●
●●
●● ●
●
●●
● ●
●●
●●
●
10 / 40
Examples of the Correlation Coefficient III
Imperfect positive correlation, 0< r <1
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
11 / 40
Examples of the Correlation Coefficient IV
Imperfect negative correlation, -1<r <0
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●●
●
●
●
●
●
●
●
●
12 / 40
Examples of the Correlation Coefficient V
No relation, r ≈ 0
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●
●●
●
●
● ●
●
●
●
●●
13 / 40
Examples of the Correlation Coefficient VI
Some relation but little *linear* relationship, r ≈ 0
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●
● ●
●
●
● ●●
●
●
●
●
●
●
●
●
●●
●
14 / 40
Association and Causality
In general, association between two variables means there issome form of relationship between them
The relationship is not necessarily causalAssociation does not imply causation, no matter how much wewould like it to
Example: Hot days, ice cream, drowning
15 / 40
Sir Bradford Hill’s Criteria for Causality
Strength: magnitude of association
Consistency of association: repeated observation of theassociation in different situations
Specificity: uniqueness of the association
Temporality: cause precedes effect
Biologic gradient: dose-response relationship
Biologic plausibility: known mechanisms
Coherence: makes sense based on other known facts
Experimental evidence: from designed (randomized)experiments
Analogy: with other known associations
16 / 40
Simple Linear Regression (SLR): Main idea
Linear regression can be used to study a continuous outcomevariable as a linear function of a predictor variable
Example: 60 cities in the US were evaluated for numerouscharacteristics, including:
Outcome variable (y) the % of the population with low income
Predictor variable (x) median education level
Linear regression can help us to model the association betweenmedian education and % of the population with low income
17 / 40
Example: Boxplot of % low income by education level
Education level is coded as a binary variable with values‘low’ and ‘high’
18 / 40
Simple linear regression, t-tests and ANOVA
Mean in low education group: 15.7%
Mean in high education group: 13.2%
The two means could be compared by a t-test or ANOVA, butregression provides a unified equation:
yi = β0 + β1xi
yi = 15.7− 2.5xi
where
xi = 1 for high education and 0 for low education (x is calleda dummy variable or indicator variable that designates group)
yi is our estimate of the mean % low income for the given thevalue of education
what about the β’s?19 / 40
Review: equation for a line
Recall that back in geometry class, you learned that a line could berepresented by the equation
y = mx + b
where
m = slope of the line (rise/run)
b = y-intercept (value y when x=0)
0 1 2 3 4 5
05
1015
20
X
Y
y=mx+b
m = 2b = 5
20 / 40
Regression analysis represented by equation for a line
In simple linear regression, we use the equation for a line
y = mx + b
but we write it slightly differently:
y = β0 + β1x
β0 = y-intercept (value y when x=0)
β1 = slope of the line (rise/run)
21 / 40
Example: the model components
yi = β0 + β1xi
yi = 15.7 + (−2.5)xi
yi is the predicted mean of the outcome yi for xi
β0 is the intercept, or the value of yi when xi = 0
β1 is the slope, or the change in yi for a 1 unit increase inxi = 0
xi is the indicator variable of low or high education forobservation i
22 / 40
Example: fill in covariate value to help interpretation
yi = β0 + β1xi
yi = 15.7− 2.5xi
xi = 0 (low education)
yi = 15.7− 2.5× 0
= 15.7 = β0
xi = 1 (high education)
yi = 15.7− 2.5× 1
= 13.2 = β0 + β1
23 / 40
Interpretations
Intercept
β0 is the mean outcome for the reference group, or thegroup for which xi = 0.
Here, β0 is the average percent of the population that is lowincome for cities with low education.
Slope
β1 is the difference in the mean outcome between the twogroups (when xi = 1 vs. when xi = 0)
Here, β1 is difference in the average percent of thepopulation that is low income for cities with high educationcompared to cities with low education.
24 / 40
Why use linear regression?
Linear regression is very powerful. It can be used for many things:
Binary X
Continuous X
Categorical X
Adjustment for confounding
Interaction
Curved relationships between X and Y
25 / 40
Regression analysis
A regression is a description of a response measure, Y, thedependent variable, as a function of an explanatory variable,X, the independent variable.
Goal: prediction or estimation of the value of one variable, Y,based on the value of the other variable, X.
A simple relationship between the two variables is alinear relationship (straight line relationship)
Other names: linear, simple linear, least squares regression
26 / 40
Foundational example: Galton’s study on height
1000 records of heights of family groups
Really tall fathers tend on average to have tall sons but notquite as tall as the really tall fathers
Really short fathers tend on average to have short sons butnot quite as short as the really short fathers
There is a regression of a sons height toward the mean heightfor sons
27 / 40
Regression analysis: population model
Probability model: Independent responses y1, y2, . . . , yn aresampled from
yi ∼ N(µi , σ2)
Systematic model: µi = E (yi |xi ) = β0 + β1xi
where
β0 = intercept
β1 = slope
29 / 40
Another way to write the model
Systematic: yi = β0 + β1xi + εi
Probability (random): εi ∼ N(0, σ2)
The response yi is a linear function of xi plus some random,normally distributed error, εi
data = signal + noise
30 / 40
Remember: two (equivalent) ways to write the model
Probability: yi ∼ N(µi , σ2)
Systematic: µi = E (yi |xi ) = β0 + β1xi
where
β0 = intercept
β1 = slope
OR
Systematic: yi = β0 + β1xi + εi
Probability: εi ∼ N(0, σ2)
The response, yi , is a linear function of xi plus some random,normally distributed error, εi
32 / 40
Interpretation of coefficients
Intercept (β0)Mean model: µ = E (y |x) = β0 + β1x
β0 = expected response when x = 0Since E (y |x = 0) = β0 + β1(0) = β0
Slope (β1)β1 = change in expected response per 1 unit increase in x
33 / 40
From Galton’s example
E (y |x) = β0 + β1x
E (y |x) = 33.7 + 0.52xwhere: y = son’s height (inches)
x = father’s height (inches)
Expected son’s height = 33.7 inches when father’s height is 0inches
Expected difference in heights for sons whose fathers’ heightsdiffer by one inch = 0.52 inches
34 / 40
City education/income model
Using the continuous variable for median education in city i (xi ):
E (yi |xi ) = β0 + β1xi
E (yi |xi ) = 36.2− 2.0xi
When xi = 0E (yi |xi ) = 36.2− 2.0(0)
= 36.2 = β0
When xi = 1
E (yi |xi ) = 36.2− 2.0(1)
= 34.2 = β0 + β1
When xi = 2
E (yi |xi ) = 36.2− 2.0(2)
= 32.2 = β0 + β1 × 236 / 40
City education/income model interpretation
Intercept (β0)
β0 is the mean outcome for the reference group, or the groupfor which xi = 0.
Here, β0 is the average percent of the population that is lowincome for cities with median education level of 0.
Slope (β1)
β1 is the difference in the mean outcome for a one unitchange in x.
Here, β1 is difference in the average percent of the populationthat is low income between two cities, when the first city has1 unit higher median education level than the second city.
37 / 40
Finding β’s from the graph
β0 is the y -intercept of the line, or the average value of ywhen x = 0.
β1 is the slope of the line, or the average change in y per unitchange in x .
y = mx + b
b = β0, m = β1
β1 =rise
run=
y1 − y2
x1 − x2
Note on notation:
β1 represents the trueslope (in the population)
β1 (or b1) is the sampleestimate of the slope
38 / 40
Where is our intercept?
The intercept isn’t in the range of our observed data. This means:
The intercept isn’t very interpretable since the average of ywhen x = 0 was never observedPossible solution: we might want to center our x variable
39 / 40
Summary
Today we’ve discussed
Correlation
Linear regression with continuous y variables
Simple linear regression (just one x variable)
Binary x (‘dummy’ or ‘indicator’ variable for group)β1: mean difference in outcome between groupsContinuous xβ1: mean difference in outcome corresponding to a 1-unitincrease in x
Interpretation of regression coefficients (intercept and slope)
How to write the regression model (2 ways)
Next time we’ll discuss multiple linear regression(more than one x variable) and confounding
40 / 40