Biostatistics in Practice Session 5: Associations and confounding Youngju Pak, Ph.D. Biostatistician...


Biostatistics in Practice
Session 5: Associations and confounding

Youngju Pak, Ph.D.
Biostatistician

http://research.LABioMed.org/Biostat

Page 2:

Revisiting the Food Additives Study

Unadjusted

Adjusted

What does “adjusted” mean?

How is it done?

From Table 3

Page 3:

Goal One of Session 5

Earlier: Compare means for a single measure among groups.

Use t-test, ANOVA.

Session 5: Relate two or more measures.

Use correlation or regression.

Qu et al. (2005), JCEM 90:1563-1569.

ΔY/ΔX

Page 4:

Goal Two of Session 5

Try to isolate the effects of different characteristics on an outcome.

Previous slide:

Gender

BMI

GH Peak

Page 5:

Correlation

Standard English word "correlate":
• a: to establish a mutual or reciprocal relation between <correlate activities in the lab and the field>
• b: to show correlation or a causal relationship between

In statistics, it has a more precise meaning.

Page 6:

Correlation in Statistics

Correlation: a measure of the strength of LINEAR association.

Positive correlation: the two variables move in the same direction. As one variable increases, the other also tends to increase LINEARLY, and vice versa.
• Example: Weight vs. Height

Negative correlation: the two variables move in opposite directions. As one variable increases, the other tends to decrease LINEARLY, and vice versa (inverse relationship).
• Example: Physical Activity level vs. Abdominal height (Visceral Fat)

Page 7:

Pearson r correlation coefficient

r can be any value from -1 to +1.

• r = -1 indicates a perfect negative LINEAR relationship between the two variables.
• r = 1 indicates a perfect positive LINEAR relationship between the two variables.
• r = 0 indicates that there is no LINEAR relationship between the two variables.
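The three benchmark values of r can be checked numerically. The sketch below uses made-up data (not from the deck) and NumPy's `corrcoef`; the parabola case is included to stress that r = 0 rules out only a LINEAR relationship, not every relationship.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])

# Made-up X values, symmetric about 0: -5, -4, ..., 5
x = np.arange(-5.0, 6.0)

r_pos = pearson_r(x, 3.0 + 2.0 * x)   # points exactly on an increasing line
r_neg = pearson_r(x, 3.0 - 2.0 * x)   # points exactly on a decreasing line
r_curve = pearson_r(x, x ** 2)        # points exactly on a parabola (non-linear)

print(f"increasing line: r = {r_pos:.3f}")
print(f"decreasing line: r = {r_neg:.3f}")
print(f"parabola:        r = {abs(r_curve):.3f}")
```

In the parabola case Y is completely determined by X, yet the linear association is zero because the rising and falling halves cancel.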

Page 8:

Scatter Plot: r = 1.0

[Figure: scatter plot of data with r = 1.0]

Page 9:

Scatter Plot: r = -1.0

[Figure: scatter plot of data with r = -1.0]

Page 10:

Scatter Plot: r = 0

[Figure: scatter plot of data with r = 0]

Page 11:

Anemic women: Anemia.sav, n=20

Hb (g/dl)   PCV (%)
11.1        35
10.7        45
12.4        47
13.1        31
10.5        30
9.6         25
12.5        33
13.5        35
…           …

r expresses how well the data fit a straight line. Here, Pearson's r = 0.673.

Page 12:

Correlations in real data

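Page 13:

Logic for Value of Correlation

$$ \text{Pearson's } r = \frac{\sum (X - X_{mean})(Y - Y_{mean})}{\sqrt{\sum (X - X_{mean})^2 \, \sum (Y - Y_{mean})^2}} $$

[Figure: scatter plot split into four quadrants at (Xmean, Ymean). The product (X-Xmean)(Y-Ymean) is + in the upper-right and lower-left quadrants and - in the upper-left and lower-right quadrants.]

Statistical software gives r.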
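The formula can be applied by hand in a few lines. The sketch below uses only the eight (Hb, PCV) pairs listed on the anemia slide. The remaining 12 of the 20 observations are not shown in the deck, so the result is not expected to reproduce r = 0.673.

```python
import math

# The 8 (Hb, PCV) pairs listed on the anemia slide; the other 12 are not shown.
hb = [11.1, 10.7, 12.4, 13.1, 10.5, 9.6, 12.5, 13.5]   # g/dl
pcv = [35, 45, 47, 31, 30, 25, 33, 35]                  # %

def pearson_r(x, y):
    """Pearson's r: sum of cross-products over the root of the product of sums of squares."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

r_partial = pearson_r(hb, pcv)
print(f"r from the 8 listed pairs: {r_partial:.3f}")
```

Each term in the numerator is positive when a point lies above both means or below both means, and negative otherwise, which is the quadrant logic of the slide.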

Page 14:

Correlation Depends on Ranges of X & Y

[Figure: panels A and B]

Graph B contains only the graph A points in the ellipse.

Correlation is reduced in graph B.

Thus: correlations for the same quantities X and Y may be quite different in different study populations.
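The range effect can be reproduced with simulated data. This is a sketch under stated assumptions: the data are simulated rather than taken from the slide, and the ellipse is approximated by keeping only points in a narrow band of X values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# "Graph A": simulated X and Y with a population correlation of about 0.8.
x = rng.normal(0.0, 1.0, n)
y = 0.8 * x + 0.6 * rng.normal(0.0, 1.0, n)
r_full = float(np.corrcoef(x, y)[0, 1])

# "Graph B": the same points, restricted to a narrow range of X (assumed stand-in for the ellipse).
keep = np.abs(x) < 0.5
r_restricted = float(np.corrcoef(x[keep], y[keep])[0, 1])

print(f"full range of X:       r = {r_full:.2f}")
print(f"restricted range of X: r = {r_restricted:.2f}")
```

The X-Y relationship is the same in both sets of points. Only the spread of X differs, and that alone lowers r.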

Page 15:

Simple Linear Regression (SLR)

X and Y now assume unique roles:
• Y is an outcome, response, output, dependent variable.
• X is an input, predictor, explanatory, independent variable.

Regression analysis is used to:
• Measure more than X-Y association, as with correlation.
• Fit a straight line through the scatter plot, for:
  • Prediction of Ymean from X.
  • Estimation of Δ in Ymean for a unit change in X = rate of change of Ymean per unit change in X (slope = regression coefficient; it measures the "effect" of X on Y).

Page 16:

SLR Example

[Figure: scatter plot with fitted line. Annotations: residual ei; the line minimizes Σei²; range for mean; range for individuals.]

Statistical software gives all this info.

Page 17:

Hypothesis testing for the true slope = 0

H0: true slope = 0 vs. Ha: true slope ≠ 0, with the rule:

Claim association (slope ≠ 0) if

tc = |slope/SE(slope)| > t ≈ 2.

With this rule, there is a 5% chance of claiming an X-Y association that really does not exist.

Note the similarity to the t-test for means:

tc = |mean/SE(mean)|

The formula for SE(slope) is in statistics books.
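For readers who want to see the calculation, here is a minimal sketch. It assumes the usual textbook least-squares formulas: SE(slope) = s / sqrt(Σ(X-Xmean)²), where s is the residual standard deviation on n-2 degrees of freedom. The data are simulated and the helper name `slr` is hypothetical; neither comes from the deck.

```python
import numpy as np

def slr(x, y):
    """Least-squares line with the standard error and t ratio of the slope."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    s = np.sqrt(np.sum(residuals ** 2) / (n - 2))   # residual standard deviation
    se_slope = s / np.sqrt(sxx)
    return {"intercept": intercept, "slope": slope,
            "se_slope": se_slope, "t": slope / se_slope}

# Simulated data with a true slope of 1.5 (made-up values).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, 40)
y = 5.0 + 1.5 * x + rng.normal(0.0, 2.0, 40)

fit = slr(x, y)
print(f"slope = {fit['slope']:.3f}, SE(slope) = {fit['se_slope']:.3f}, t = {fit['t']:.2f}")
print("claim association" if abs(fit["t"]) > 2 else "do not claim association")
```

The cutoff of 2 is the slide's approximation; statistical software uses the exact t critical value for n-2 degrees of freedom.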

Page 18:

Example Software Output

The regression equation is: Ymean = 81.6 + 2.16 X

Predictor   Coeff    StdErr   T       P
Constant    81.64    11.47    7.12    <0.0001
X           2.1557   0.1122   19.21   <0.0001

S = 21.72   R-Sq = 79.0%

Predicted Values:

X     Fit      SE(Fit)   95% CI            95% PI
100   297.21   2.17      292.89 - 301.52   253.89 - 340.52

Notes on the output:

• "Constant" refers to the intercept.
• T for X: 19.21 = 2.16/0.112; it should be between about -2 and 2 if the "true" slope = 0.
• Fit: predicted y = 81.6 + 2.16(100).
• The 95% CI and 95% PI are ranges of Ys with 95% assurance for, respectively, the mean of all subjects with x = 100 and an individual with x = 100.
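The numbers in this output can be re-derived from each other. The sketch below copies the printed values and checks the fit, the t ratio, and the two interval widths. The standard error for an individual prediction, sqrt(S² + SE(Fit)²), is the usual textbook formula and is an assumption here, since the deck does not state it. Small discrepancies come from rounding in the printed output.

```python
import math

# Values copied from the software output on the slide.
b0, b1 = 81.64, 2.1557        # intercept and slope
se_b1 = 0.1122                # standard error of the slope
s = 21.72                     # residual standard deviation (S)
se_fit = 2.17                 # SE(Fit) at X = 100
ci_low, ci_high = 292.89, 301.52
pi_low, pi_high = 253.89, 340.52
x_new = 100

fit = b0 + b1 * x_new                    # predicted mean Y at X = 100
t_slope = b1 / se_b1                     # t ratio for the slope

ci_half = (ci_high - ci_low) / 2         # half-width of the interval for the mean
multiplier = ci_half / se_fit            # implied t multiplier, close to 2

# Assumed textbook formula: an individual varies around the mean with SD S.
se_pred = math.sqrt(s ** 2 + se_fit ** 2)
pi_half = (pi_high - pi_low) / 2         # half-width of the interval for an individual

print(f"Fit at X=100: {fit:.2f}")
print(f"t for slope:  {t_slope:.2f}")
print(f"CI half-width / SE(Fit):  {multiplier:.2f}")
print(f"PI half-width / SE(pred): {pi_half / se_pred:.2f}")
```

Using the rounded equation 81.6 + 2.16(100) gives 297.6 rather than 297.21, which is why the software's Fit is based on the unrounded coefficients.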

Page 19:

Multiple Regression

We now generalize to prediction from multiple characteristics.

The next slide gives a geometric view of prediction from two factors simultaneously.

Page 20:

Multiple Linear Regression: Geometric View

Suppose multiple predictors are continuous.

Geometrically, this is fitting a slanted plane to a cloud of points:

[Figure: plane fitted to a three-dimensional cloud of points; source: www.StatisticalPractice.com]

LHCY is the Y (homocysteine) to be predicted from the two X's: LCLC (folate) and LB12 (B12).

LHCY = b0 + b1·LCLC + b2·LB12 is the equation of the plane.
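Fitting the plane is an ordinary least-squares problem. The sketch below is illustrative only: the data are simulated, the "true" coefficients are arbitrary made-up values, and the variable names merely echo the slide's LHCY, LCLC, and LB12.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Simulated predictors and outcome (made-up values, not the homocysteine data).
lclc = rng.normal(2.0, 0.5, n)
lb12 = rng.normal(6.0, 0.4, n)
true_b = np.array([3.0, -0.30, -0.15])          # arbitrary b0, b1, b2
lhcy = true_b[0] + true_b[1] * lclc + true_b[2] * lb12 + rng.normal(0.0, 0.1, n)

# Design matrix: a column of ones for b0, then one column per predictor.
X = np.column_stack([np.ones(n), lclc, lb12])

# Least squares picks the plane that minimizes the sum of squared residuals.
b, *_ = np.linalg.lstsq(X, lhcy, rcond=None)
b0, b1, b2 = b

print(f"LHCYmean = {b0:.2f} + ({b1:.2f})*LCLC + ({b2:.2f})*LB12")
```

With two predictors the fitted surface is a plane; with more predictors the same code applies, with one extra column of `X` per predictor.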

Page 21:

Multiple Regression: Software

Page 22:

Multiple Regression: Software

Output: Values of b0, b1, and b2 for

LHCYmean = b0 + b1·LCLC + b2·LB12

Page 23:

How Are Coefficients Interpreted?

LHCYmean = b0 + b1·LCLC + b2·LB12
(Outcome: LHCY. Predictors: LCLC and LB12.)

[Diagram: LCLC and LB12 each point to LHCY, labeled "b1 ?" and "b2 ?"; LCLC and LB12 are linked by "Correlation".]

LB12 may have both an independent and an indirect (via LCLC) association with LHCY.

Page 24:

Coefficients: Meaning of their Values

LHCYmean = b0 + b1·LCLC + b2·LB12
(Outcome: LHCY. Predictors: LCLC and LB12.)

Mean LHCY increases by b2 for a 1-unit increase in LB12

… if other factors (LCLC) remain constant, or

… adjusting for other factors in the model (LCLC).

It may be physiologically impossible to hold one predictor constant while changing the other by 1 unit.
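A small simulation can show how the coefficient of LB12 changes once LCLC enters the model. Everything here is an assumption for illustration: the data are simulated, the coefficients (-0.5 for LCLC, -0.2 for LB12) are made up, and the helper `ols` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Simulated, correlated predictors: LCLC partly tracks LB12.
lb12 = rng.normal(0.0, 1.0, n)
lclc = 0.7 * lb12 + rng.normal(0.0, 0.7, n)
lhcy = 1.0 - 0.5 * lclc - 0.2 * lb12 + rng.normal(0.0, 0.5, n)

def ols(y, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_unadj = ols(lhcy, lb12)[1]         # LB12 alone: direct plus indirect (via LCLC) association
b_adj = ols(lhcy, lclc, lb12)[2]     # LB12 with LCLC held constant

print(f"unadjusted coefficient of LB12: {b_unadj:.2f}")
print(f"adjusted coefficient of LB12:   {b_adj:.2f}")
```

In this setup the unadjusted coefficient is near -0.2 + (0.7)(-0.5) = -0.55, while the adjusted coefficient recovers the direct value of about -0.2.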

Page 25:

IGF1 Adjustment for Age - Simulated Data

[Figure: IGF1 (ug/L) plotted against Age (Years), with age axis ticks at 15 and 30, for two groups (Caucasian and African). Annotations: means of 140 and 155 with Diff = 15; means of 160 and 157 with Diff = 3; labels "Unadjusted" and "Adjusted"; 22.2 (Mean).]

Page 26:

* for age, gender, and BMI.

Figure 2.

Determine the relative and combined explanatory power of age, gender, BMI, ethnicity, and sport type on the markers.

Page 27:

Another Example: HDL Cholesterol

Output:

            Coefficient   Std Error   t       Pr > |t|
Intercept    1.16448      0.28804     4.04    <.0001
AGE         -0.00092      0.00125    -0.74    0.4602
BMI         -0.01205      0.00295    -4.08    <.0001
BLC          0.05055      0.02215     2.28    0.0239
PRSSY       -0.00041      0.00044    -0.95    0.3436
DIAST        0.00255      0.00103     2.47    0.0147
GLUM        -0.00046      0.00018    -2.50    0.0135
SKINF        0.00147      0.00183     0.81    0.4221
LCHOL        0.31109      0.10936     2.84    0.0051

The predictors of log(HDL) are age, body mass index, blood vitamin C, systolic and diastolic blood pressures, skinfold thickness, and the log of total cholesterol. The equation is:

Log(HDL) mean = 1.16 - 0.00092(Age) + … + 0.311(LCHOL)

Source: www.StatisticalPractice.com

Page 28:

HDL Example: Coefficients

Interpretation of coefficients on previous slide:

1. Need to use entire equation for making predictions.

2. Each coefficient measures the difference in mean LHDL between 2 subjects if the factor differs by 1 unit between the two subjects, and if all other factors are the same. E.g., expected LHDL is 0.012 lower in a subject whose BMI is 1 unit greater, but is the same as the other subject on other factors.

Continued …

Page 29:

HDL Example: Coefficients

Interpretation of coefficients two slides back:

3. P-values measure the strength of evidence for an association of a factor with Log(HDL), if other factors do not change.

This is sometimes expressed as "after accounting for other factors" or "adjusting for other factors", and is called independent association.

SKINF alone probably is associated with Log(HDL). Its p=0.42 says that it has no additional info to predict Log(HDL), after accounting for other factors such as BMI.
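The HDL output table can be tied back to the slope test from simple regression: each t is the coefficient divided by its standard error. The sketch below copies the table values and recomputes t. The recomputed ratios differ slightly from the printed ones because the printed coefficients are rounded. The 5-unit BMI comparison at the end is an illustrative choice, not a figure from the deck.

```python
# (coefficient, standard error, reported t) copied from the HDL output table.
rows = {
    "Intercept": (1.16448, 0.28804, 4.04),
    "AGE": (-0.00092, 0.00125, -0.74),
    "BMI": (-0.01205, 0.00295, -4.08),
    "BLC": (0.05055, 0.02215, 2.28),
    "PRSSY": (-0.00041, 0.00044, -0.95),
    "DIAST": (0.00255, 0.00103, 2.47),
    "GLUM": (-0.00046, 0.00018, -2.50),
    "SKINF": (0.00147, 0.00183, 0.81),
    "LCHOL": (0.31109, 0.10936, 2.84),
}

beyond_2 = []
for name, (coef, se, t_reported) in rows.items():
    t_calc = coef / se                      # same ratio as slope/SE(slope)
    if abs(t_calc) > 2:
        beyond_2.append(name)
    print(f"{name:<10} t recomputed = {t_calc:6.2f}   t reported = {t_reported:6.2f}")

print("|t| > 2:", ", ".join(beyond_2))

# Expected difference in Log(HDL) for a subject whose BMI is 5 units higher
# (illustrative choice of 5), with all other factors the same.
diff_bmi_5 = 5 * rows["BMI"][0]
print(f"Log(HDL) difference for +5 BMI units: {diff_bmi_5:.4f}")
```

The predictors with |t| > 2 are the same ones whose printed p-values fall below 0.05.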

Page 30:

Special Cases of Multiple Regression

So far, our predictors were all measured over a continuum, like age or concentration.

This is simply called multiple regression.

When some predictors are grouping factors like gender or ethnicity, regression has other special names:

ANOVA

Analysis of Covariance

Page 31:

Analysis of Variance

• All predictors are grouping factors.

• One-way ANOVA: Only 1 predictor that may have only 2 “levels”, such as gender, or more levels, such as ethnicity.

• Two-way ANOVA: Two grouping predictors, such as decade of age and genotype.

Page 32:

Two-way ANOVA

• Interaction in 2-way ANOVA: Measures whether the effect of one factor depends on the other factor. It is a difference of a difference in outcome. E.g.,

(Trt. – control)_Female – (Trt. – control)_Male

• The effect of treatment, adjusted for gender, is a weighted average of the group differences over the two gender groups, i.e., of:

(Trt. – control)_Female and (Trt. – control)_Male
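Both quantities reduce to simple arithmetic on the four cell means. The numbers below are hypothetical, and weighting by the number of subjects in each gender is only one possible choice; the slide does not say which weights are used.

```python
# Hypothetical mean outcomes in the four gender-by-treatment cells (made-up numbers).
cell_mean = {
    ("Female", "Trt"): 12.0, ("Female", "control"): 8.0,
    ("Male", "Trt"): 11.0, ("Male", "control"): 10.0,
}
# Hypothetical numbers of subjects per gender, used as weights (an assumption).
n_by_gender = {"Female": 60, "Male": 40}

# Treatment effect within each gender: (Trt. - control).
effect = {
    g: cell_mean[(g, "Trt")] - cell_mean[(g, "control")]
    for g in ("Female", "Male")
}

# Interaction: difference of the two differences.
interaction = effect["Female"] - effect["Male"]

# Gender-adjusted treatment effect: weighted average of the two differences.
total_n = sum(n_by_gender.values())
adjusted_effect = sum(n_by_gender[g] * effect[g] for g in effect) / total_n

print(f"effect in females: {effect['Female']:.1f}")
print(f"effect in males:   {effect['Male']:.1f}")
print(f"interaction:       {interaction:.1f}")
print(f"adjusted effect:   {adjusted_effect:.1f}")
```

When the interaction is large, as in these made-up numbers, a single adjusted effect hides the fact that the treatment works differently in the two groups.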

Page 33:

Analysis of Covariance

• At least one primary predictor is a grouping factor, such as treatment group, and at least one predictor is continuous, such as age, called a "covariate".

• Interest is often on comparing the groups.

• The covariate is often a nuisance.

Confounder: A covariate that both co-varies with the outcome and is distributed differently in the groups.
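Confounding and its adjustment can be demonstrated with a simulation. This is a sketch with made-up parameters, not the deck's data: the outcome falls by 4 units per year of age, one group is about 6 years older on average, and the true group difference at any fixed age is +3.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000                                   # subjects per group

# Age is distributed differently in the two groups (group 1 is older).
age0 = rng.normal(20.0, 3.0, n)
age1 = rng.normal(26.0, 3.0, n)

def outcome(age, group_effect):
    """Simulated outcome: declines with age, plus a group effect and noise."""
    return 250.0 - 4.0 * age + group_effect + rng.normal(0.0, 8.0, age.size)

y0 = outcome(age0, 0.0)
y1 = outcome(age1, 3.0)                    # true group difference at equal age: +3

# Unadjusted comparison: plain difference in group means.
unadjusted = y1.mean() - y0.mean()

# Analysis of covariance: regress the outcome on group (0/1) and age together.
y = np.concatenate([y0, y1])
group = np.concatenate([np.zeros(n), np.ones(n)])
age = np.concatenate([age0, age1])
X = np.column_stack([np.ones(2 * n), group, age])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
adjusted = coef[1]                         # group difference at the same age

print(f"unadjusted difference:   {unadjusted:.1f}")
print(f"age-adjusted difference: {adjusted:.1f}")
```

Age meets both parts of the confounder definition here: it co-varies with the outcome and is distributed differently in the groups, so the unadjusted difference (about -21) even has the opposite sign from the adjusted one (about +3).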