
Chapter 2

Looking at Data - Relationships


Relations Among Variables

• Response variable - Outcome measurement (or characteristic) of a study. Also called: dependent variable, outcome, and endpoint. Labelled as y.

• Explanatory variable - Condition that explains or causes changes in response variables. Also called: independent variable and predictor. Labelled as x.

• Theories usually are generated about relationships among variables and statistical methods can be used to test them.

• Research questions are stated such as: Do changes in x cause changes in y?


Scatterplots

• Identify the explanatory and response variables of interest, and label them as x and y

• Obtain a set of n individuals and observe the pair (x_i, y_i) for each individual. There will be n pairs.

• Statistical convention has the response variable (y) placed on the vertical (up/down) axis and the explanatory variable (x) placed on the horizontal (left/right) axis. (Note: economists reverse axes in price/quantity demand plots)

• Plot the n pairs of points (x,y) on the graph


France August 2003 Heat Wave Deaths

• Individuals: 13 cities in France

• Response (y): Excess deaths (%), Aug 1-19, 2003 vs. 1999-2002

• Explanatory variable (x): Change in mean temperature in the period (°C)

• Data:

City         Dth03   Dth9902   %chng (y)   Degchg (x)
Lille          200     192.3           4          4.0
Marseilles     571     456.8          25          4.3
Grenoble       148     115.6          28          6.3
Rennes         156     114.7          36          5.6
Toulouse       315     231.6          36          6.6
Bordeaux       318     222.4          43          6.2
Strasbourg     253     167.5          51          5.9
Nice           341     222.9          53          4.3
Poitiers       184     102.8          79          7.3
Lyon           447     248.3          80          6.8
Le Mans        204     112.1          82          7.0
Dijon          168      87.0          93          7.4
Paris         1854     766.1         142          6.7


France August 2003 Heat Wave Deaths

[Figure: scatterplot titled "2003 France Heat Wave Mortality". Horizontal axis: Change in Mean Temp (Celsius), 3 to 8. Vertical axis: Excess Mortality (%), 0 to 160. One point is labelled "Possible Outlier".]


Example - Pharmacodynamics of LSD

• Response (y) - Math score (mean among 5 volunteers)

• Explanatory (x) - LSD tissue concentration (mean of 5 volunteers)

• Raw data and scatterplot of Score vs LSD concentration:

Score (y)   LSD Conc (x)
78.93       1.17
58.20       2.97
67.47       3.26
37.47       4.69
45.65       5.83
32.92       6.00
29.97       6.41

[Figure: scatterplot of SCORE (vertical axis, 20 to 80) vs LSD_CONC (horizontal axis, 1 to 7).]

Source: Wagner, et al (1968)


Manufacturer Production/Cost Relation

Month   Prod    Cost     Month   Prod    Cost     Month   Prod    Cost
 1      46.75   92.64    17      36.54   91.56    33      32.26   66.71
 2      42.18   88.81    18      37.03   84.12    34      30.97   64.37
 3      41.86   86.44    19      36.60   81.22    35      28.20   56.09
 4      43.29   88.80    20      37.58   83.35    36      24.58   50.25
 5      42.12   86.38    21      36.48   82.29    37      20.25   43.65
 6      41.78   89.87    22      38.25   80.92    38      17.09   38.01
 7      41.47   88.53    23      37.26   76.92    39      14.35   31.40
 8      42.21   91.11    24      38.59   78.35    40      13.11   29.45
 9      41.03   81.22    25      40.89   74.57    41       9.50   29.02
10      39.84   83.72    26      37.66   71.60    42       9.74   19.05
11      39.15   84.54    27      38.79   65.64    43       9.34   20.36
12      39.20   85.66    28      38.78   62.09    44       7.51   17.68
13      39.52   85.87    29      36.70   61.66    45       8.35   19.23
14      38.05   85.23    30      35.10   77.14    46       6.25   14.92
15      39.16   87.75    31      33.75   75.47    47       5.45   11.44
16      38.59   92.62    32      34.29   70.37    48       3.79   12.69

x = Amount Produced, y = Total Cost, n = 48 months (not in order)


Manufacturer Production/Cost Relation

[Figure: scatterplot titled "Production (x) / Cost (y) Relation". Horizontal axis: Total Production, 0 to 50. Vertical axis: Total Cost, 0 to 100.]


Correlation

• Numerical measure to summarize the strength of the linear (straight-line) association between two variables

• Bounded between -1 and +1 (labelled as r)

– Values near -1: Strong negative association

– Values near 0: Weak or no association

– Values near +1: Strong positive association

• Not affected by linear transformations of either x or y, such as a change of units (a transformation with negative slope reverses only the sign of r)

• Does not distinguish between response and explanatory variable (x and y can be interchanged)

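$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{COV(x,y)}{s_x s_y}, \qquad COV(x,y) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}$$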
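The definition and the two properties listed above (symmetry in x and y, and invariance under linear transformation) can be checked numerically. The following is a minimal sketch in plain Python using the LSD example data; the helper names `sample_sd`, `covariance` and `correlation` are illustrative choices, not taken from the slides.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Sample standard deviation with the n - 1 divisor
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def covariance(x, y):
    # COV(x, y) = sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    # r = COV(x, y) / (s_x * s_y)
    return covariance(x, y) / (sample_sd(x) * sample_sd(y))

# LSD example: tissue concentration (x) and math score (y)
lsd_conc = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
score = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]

r = correlation(lsd_conc, score)
r_swapped = correlation(score, lsd_conc)                         # roles of x and y exchanged
r_rescaled = correlation([10 * x + 3 for x in lsd_conc], score)  # positive-slope transformation
r_flipped = correlation([-x for x in lsd_conc], score)           # negative-slope transformation

print(f"r = {r:.3f}, swapped = {r_swapped:.3f}, "
      f"rescaled = {r_rescaled:.3f}, flipped = {r_flipped:.3f}")
```

Swapping the variables or rescaling x with a positive slope leaves r unchanged, while negating x only changes its sign.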


Excess French Heatwave Deaths

City         Degchg (x)   %chng (y)   x-xbar   y-ybar   (x-xbar)(y-ybar)
Lille           4.0            4      -2.03    -53.85       109.3155
Marseilles      4.3           25      -1.73    -32.85        56.8305
Grenoble        6.3           28       0.27    -29.85        -8.0595
Rennes          5.6           36      -0.43    -21.85         9.3955
Toulouse        6.6           36       0.57    -21.85       -12.4545
Bordeaux        6.2           43       0.17    -14.85        -2.5245
Strasbourg      5.9           51      -0.13     -6.85         0.8905
Nice            4.3           53      -1.73     -4.85         8.3905
Poitiers        7.3           79       1.27     21.15        26.8605
Lyon            6.8           80       0.77     22.15        17.0555
Le Mans         7.0           82       0.97     24.15        23.4255
Dijon           7.4           93       1.37     35.15        48.1555
Paris           6.7          142       0.67     84.15        56.3805
Total          78.4        752.0       0.0       0.0        333.7

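$$\bar{x} = 6.03 \qquad s_x = 1.16 \qquad \bar{y} = 57.85 \qquad s_y = 36.46 \qquad n = 13$$

$$COV(x,y) = \frac{333.7}{13-1} = 27.81 \qquad r = \frac{27.81}{(1.16)(36.46)} = \frac{27.81}{42.29} = 0.66$$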
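The hand calculation for the heat wave data can be reproduced in a few lines. This is a sketch using only the Python standard library; the variable names are illustrative, and the data are the Degchg (x) and %chng (y) columns from the table.

```python
from statistics import mean, stdev

# Change in mean temperature (x) and excess deaths in % (y) for the 13 cities
deg_change = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]
pct_change = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]

n = len(deg_change)
x_bar, y_bar = mean(deg_change), mean(pct_change)
s_x, s_y = stdev(deg_change), stdev(pct_change)  # sample SDs (n - 1 divisor)

# Sum of cross-products of deviations (the "Total" of the last table column)
cross_products = sum((x - x_bar) * (y - y_bar)
                     for x, y in zip(deg_change, pct_change))
cov_xy = cross_products / (n - 1)
r = cov_xy / (s_x * s_y)

print(f"x_bar = {x_bar:.2f}, s_x = {s_x:.2f}, y_bar = {y_bar:.2f}, s_y = {s_y:.2f}")
print(f"COV(x, y) = {cov_xy:.2f}, r = {r:.2f}")
```

Small differences in the last digit relative to the slide are possible, because the slide rounds intermediate quantities before combining them.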


Least-Squares Regression

• Goal: Fit a line that “best fits” the relationship between the response variable and the explanatory variable

• Equation of a straight line: y = a + bx

– a - y-intercept (value of y when x = 0)

– b - slope (amount y changes as x increases by 1 unit)

• Prediction: Often want to predict what y will be at a given level of x (e.g. how much will it cost to fill an order of 1000 t-shirts?)

• Extrapolation: Using a fitted line outside the range of the explanatory variable observed in the sample: BAD IDEA


Least-Squares Regression

• y = a + bx is a deterministic equation

• Sample data don’t fall on a straight line, but rather around one

• Obtain equation that “best fits” a sample of data points

• Error - Difference between observed response and predicted response (from equation)

• Least squares criterion: Choose the line that minimizes the sum of squared errors. Resulting regression line:

$$\hat{y} = a + bx \qquad b = r\,\frac{s_y}{s_x} \qquad a = \bar{y} - b\,\bar{x}$$


Excess French Heatwave Deaths

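$$\bar{x} = 6.03 \qquad s_x = 1.16 \qquad \bar{y} = 57.85 \qquad s_y = 36.46 \qquad r = 0.66$$

$$b = r\,\frac{s_y}{s_x} = 0.66\left(\frac{36.46}{1.16}\right) = 0.66(31.43) = 20.74$$

$$a = \bar{y} - b\,\bar{x} = 57.85 - 20.74(6.03) = 57.85 - 125.06 = -67.21$$

$$\hat{y} = -67.21 + 20.74x$$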

[Figure: plot titled "2003 France Heat Wave Mortality". Horizontal axis: Change in Mean Temp (Celsius), 3 to 8. Vertical axis: Excess Mortality (%), 0 to 160.]

For each 1°C increase in the change in mean temperature, predicted excess mortality increases by about 20 percentage points.
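The slope and intercept can also be computed without rounding r first. The sketch below applies b = r·s_y/s_x and a = ȳ − b·x̄ to the heat wave data using the Python standard library; the variable names and the example temperature change of 6.0 °C are illustrative assumptions. Carried at full precision this gives roughly b ≈ 20.59 and a ≈ −66.31, slightly different from the hand-calculated 20.74 and −67.21, which use r rounded to 0.66.

```python
from statistics import mean, stdev

deg_change = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]
pct_change = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]

n = len(deg_change)
x_bar, y_bar = mean(deg_change), mean(pct_change)
s_x, s_y = stdev(deg_change), stdev(pct_change)
cov_xy = sum((x - x_bar) * (y - y_bar)
             for x, y in zip(deg_change, pct_change)) / (n - 1)
r = cov_xy / (s_x * s_y)

b = r * s_y / s_x      # slope
a = y_bar - b * x_bar  # intercept

def predict(x):
    # Fitted value y-hat = a + b*x; only meaningful within the observed x range
    return a + b * x

print(f"y-hat = {a:.2f} + {b:.2f} x")
print(f"Predicted excess mortality at a 6.0 C change: {predict(6.0):.1f}%")
```

The example prediction is made at x = 6.0, inside the observed range of 4.0 to 7.4, in keeping with the warning against extrapolation.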


Effect of an Outlier (Paris)

• Re-fitting the model without Paris, which had a very high excess mortality (Using EXCEL):

$$r^* = 0.76 \qquad \hat{y}^* = -52.78 + 17.34x$$

(* denotes the fit without Paris)

[Figure: scatterplot titled "Heat Wave Mortality (No Paris)". Horizontal axis: Temp Change, 3 to 8. Vertical axis: Excess Mortality, 0 to 100.]
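The slide obtains the re-fit in EXCEL; the same comparison can be sketched in Python by fitting the line twice, once on all 13 cities and once with Paris dropped. The helper `fit_line` is a hypothetical name introduced here for illustration.

```python
from statistics import mean, stdev

def fit_line(x, y):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar)
                 for xi, yi in zip(x, y)) / (len(x) - 1)
    r = cov_xy / (s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b, r

cities = ["Lille", "Marseilles", "Grenoble", "Rennes", "Toulouse", "Bordeaux",
          "Strasbourg", "Nice", "Poitiers", "Lyon", "Le Mans", "Dijon", "Paris"]
deg_change = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]
pct_change = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]

# Fit on all 13 cities
a_all, b_all, r_all = fit_line(deg_change, pct_change)

# Fit again with Paris removed
keep = [i for i, city in enumerate(cities) if city != "Paris"]
a_no, b_no, r_no = fit_line([deg_change[i] for i in keep],
                            [pct_change[i] for i in keep])

print(f"All cities: r = {r_all:.2f}, y-hat = {a_all:.2f} + {b_all:.2f} x")
print(f"No Paris:   r = {r_no:.2f}, y-hat = {a_no:.2f} + {b_no:.2f} x")
```

Removing the single high-mortality point raises the correlation and flattens the slope, which is the effect this slide illustrates.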


Squared Correlation

• The squared correlation represents the fraction of the variation in the response variable that is “explained” by the explanatory variable

• Represents the improvement (reduction in sum of squared errors) by using x (and fitted equation y-hat) to predict y as opposed to ignoring x (and simply using the sample mean y-bar) to predict y

• 0 ≤ r² ≤ 1

– Values near 0: x does not help predict y (regression line flat)

– Values near 1: x predicts y well (data near regression line)

$$r^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$
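The two readings of r² (fraction of variation explained, and relative reduction in the sum of squared errors) can be confirmed on the heat wave data. This is a minimal sketch; the slope is computed as S_xy/S_xx, which is algebraically the same as r·s_y/s_x, and the names `sst`, `sse` and `ssr` are illustrative.

```python
from statistics import mean

x = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]
y = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]

x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b = s_xy / s_xx          # least-squares slope
a = y_bar - b * x_bar    # least-squares intercept
y_hat = [a + b * xi for xi in x]

sst = s_yy                                              # errors when predicting with y-bar
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # errors when predicting with y-hat
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # variation captured by the line

r = s_xy / (s_xx * s_yy) ** 0.5
r2_explained = ssr / sst
r2_reduction = (sst - sse) / sst

print(f"r^2 = {r ** 2:.3f}, explained = {r2_explained:.3f}, "
      f"reduction = {r2_reduction:.3f}")
```

All three numbers agree, so for these data the temperature change accounts for a bit under half of the variation in excess mortality.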


Residual Analysis

• Residuals: Difference between observed responses and their predicted values:

$$e = y - \hat{y}$$

• Useful to plot the residuals versus the level of the explanatory variable (x)

• Outliers: Large (positive or negative) residuals. Values of y that are inconsistent with prediction

• Influential observations: Cases where the level of the explanatory variable is far away from the other individuals (extreme x values)


France Heatwave Mortality

  y      x     yhat    e=y-yhat
  4     4.0    16.04    -12.04
 25     4.3    22.22      2.78
 28     6.3    63.39    -35.39
 36     5.6    48.98    -12.98
 36     6.6    69.56    -33.56
 43     6.2    61.33    -18.33
 51     5.9    55.15     -4.15
 53     4.3    22.22     30.78
 79     7.3    83.98     -4.98
 80     6.8    73.68      6.32
 82     7.0    77.80      4.20
 93     7.4    86.03      6.97
142     6.7    71.62     70.38

[Figure: "Residual Plot". Horizontal axis: Temp Change (x), 3 to 8. Vertical axis: Residual, -60 to 80. The point for Paris is labelled as an outlier.]
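The residual table can be regenerated from the data. The sketch below fits the line at full precision (the fitted values in the table correspond to unrounded coefficients, roughly −66.31 + 20.59x) and flags the largest residual; the variable names are illustrative.

```python
from statistics import mean

cities = ["Lille", "Marseilles", "Grenoble", "Rennes", "Toulouse", "Bordeaux",
          "Strasbourg", "Nice", "Poitiers", "Lyon", "Le Mans", "Dijon", "Paris"]
x = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]
y = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]

x_bar, y_bar = mean(x), mean(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

y_hat = [a + b * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]  # e = y - y-hat

for city, yi, xi, yh, e in zip(cities, y, x, y_hat, residuals):
    print(f"{city:<11} y = {yi:>3}  x = {xi:.1f}  yhat = {yh:6.2f}  e = {e:7.2f}")

# The observation with the largest absolute residual is the outlier candidate
worst = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
print(f"Largest residual: {cities[worst]} ({residuals[worst]:.2f})")
```

Least-squares residuals sum to zero (up to floating-point error), so a single large positive residual such as the one for Paris is balanced by several smaller negative ones.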


Miscellaneous Topics

• Lurking variable: Variable not included in the regression analysis that may influence the association between y and x. An association produced this way is sometimes referred to as a spurious association between y and x.

• Association does not imply causation (it is one of various steps to demonstrating cause-and-effect)

• Do not extrapolate outside the range of x observed in the study

• Some relationships are not linear, and may show low correlation even when the relation is strong

• Correlations based on averages across individuals tend to be higher than those based on individuals
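The point about nonlinear relationships can be made concrete with a small constructed example (not from the slides): y is determined exactly by x through y = x², yet the correlation is zero because r only measures straight-line association.

```python
import math

# Constructed data: y is an exact function of x, but the relation is a parabola
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)
print(f"r = {r:.3f}")
```

A scatterplot would reveal the curve immediately, which is why the plot should be inspected before r is interpreted.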


Causation

• Association between x and y demonstrated

• Time order confirmed (x “occurs” before y)

• Alternative explanations are considered and explained away:

– Lurking variables - Another variable causes both x and y

– Confounding - Two explanatory variables are highly related, and which causes y cannot be determined

• Dose-response effect

• Plausible cause