What is an association between variables? Explanatory and response variables Key characteristics...

48
2.1 Relationships What is an association between variables? Explanatory and response variables Key characteristics of a data set 1

Transcript of What is an association between variables? Explanatory and response variables Key characteristics...

Page 1: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

2.1 Relationships

What is an association between variables? Explanatory and response variables Key characteristics of a data set

1

Page 2: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Association between a pair of variables

Association: Some values of one variable tend to occur more often with certain values of the other variable

Both variables measured on same set of individuals

Examples:◦ Height and weight of same individual◦ Smoking habits and life expectancy◦ Age and bone-density of individuals

Page 3: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Causation?

Caution: Often there are spurious, other variables lurking in the background ◦ Shorter women have lower risk of heart attack◦ Countries with more TV sets have better life expectancy

rates◦ More deaths occur when ice cream sales peak

Just explore association or investigate a causal relationship?

Page 4: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

4

Key Characteristics of a Data Set

Certain characteristics of a data set are key to exploring the relationship between two variables. These should include the following:

Cases: Identify the cases and how many there are in the data set.

Label: Identify what is used as a label variable if one is present.

Categorical or quantitative: Classify each variable as categorical or

quantitative.

Values: Identify the possible values for each variable.

Explanatory or response: If appropriate, classify each variable as

explanatory or response.

Page 5: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

2.2 Scatterplots

Scatterplots Interpreting scatterplots Categorical variables in scatterplots

5

Page 6: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Examples

Does fidgeting keep you slim? Some people don’t gain weight

even when they overeat. Perhaps fidgeting and other “nonexercise activity” explains why, here is the data:

We want to plot Y vs. X◦ Which is Y?◦ Which is X?

Page 7: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.
Page 8: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Things to look for on scatterplot:

Form (linear, curve, exponential, parabola) Direction: ◦ Positive Association: Y increases as X increases◦ Negative Association: Y decreases as X increases

Strength: Do the points follow the form quite closely or scattered?

Outliers: deviations from overall relationship

Let’s look again…

Page 9: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Example: State mean SAT math score plotted against the percent of seniors taking the exam

Page 10: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Adding a categorical variable or grouping

May enhance understanding of the data

Categorical variable is (region):◦“e” is for northeastern states◦“m” is for midwestern states ◦All others states excluded

Page 11: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Example: Adding categorical variable

Page 12: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Other things:

Plotting different categories via different symbols may throw light on data

Read examples 2.7-2.9 for more examples of scatterplots

Existence of a relationship does not imply causation ◦ SAT math and SAT verbal scores have strong

relationship◦ But a person’s intelligence is causing both

The relationship does not have to hold true for every subject, it is random

Page 13: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

2.3 Correlation

The correlation coefficient r Properties of r Influential points

13

Page 14: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Correlation Linear relationships are quite common

Correlation coefficient r measures strength and direction of a linear relationship between two quantitative variables X and Y

Data structure: (X,Y) pairs measured on n individuals◦ Weight and blood pressure ◦ Age and bone-density

Page 15: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

15

We say a linear relationship is strong if the points lie close to a straight line and weak if they are widely scattered about a line. The following facts about r help us further interpret the strength of the linear relationship.

Properties of Correlation r is always a number between –1 and 1.

r > 0 indicates a positive association.

r < 0 indicates a negative association.

Values of r near 0 indicate a very weak linear relationship.

The strength of the linear relationship increases as r moves away from 0 toward –1 or 1.

The extreme values r = –1 and r = 1 occur only in the case of a perfect linear relationship.

Measuring Linear Association

Page 16: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.
Page 17: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Formula: 1

1i i

x y

x x y yr

n s s

Calculate means and standard deviations of data Standardize X and Y:

take off respective mean divide by corresponding standard deviation

Take products of X(standardized)*Y (standardized) for each subject

Add up and divide by n-1

Or just ask your calculator nicely!

Page 18: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Issues: r is affected by outliers Captures only the strength of the “linear”

relationship◦ it could be true that Y and X have a very strong non-

linear relationship but r is close to zero r = +1 or -1 only when points lie perfectly on a

straight line. (Y=2X+3) SAS program: correlation.doc◦ proc corr is the procedure

Page 19: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

19

For each graph, estimate the correlation r and interpret it in context.

Correlation Examples

Page 20: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

20

For each graph, estimate the correlation r and interpret it in context.

Correlation Examples

Page 21: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

21

For each graph, estimate the correlation r and interpret it in context.

Correlation Examples

Page 22: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

22

For each graph, estimate the correlation r and interpret it in context.

Correlation Examples

Page 23: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

2.4 Least-Squares Regression

Regression lines Least-squares regression line Predictions Facts about least-squares regression Correlation and regression

23

Page 24: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Regression Line Straight line which describes best how the response

variable y changes when the explanatory variable x changes◦ We do distinguish between Y and X cannot switch their roles

Equation of straight line: ŷ = b0 + b1x◦ ŷ is the predicted value (the line at a given x value)◦ b0 is the intercept (where it crosses the y-axis)◦ b1 is the slope (rate)

Procedure◦ calculate best b0 and b1 for your data◦ Find the line that best fits your data ◦ Use this line to predict y for different values of x

Page 25: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Example: Regression line for NEA data. We can predict the mean fat gain at 400 calories

Page 26: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Prediction and Extrapolation

Fitted line for NEA data:

Pred. fat gain = 3.505 – 0.00344(NEA)

Prediction at 400 calories:

Pred. fat gain = 3.505 – 0.00344*400 = 2.13 kg

So when a person’s NEA increases by 400 calories when they overeat, they will have a predicted fat gain of 2.13 kilograms.

Page 27: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Prediction and Extrapolation

Warning: Extrapolation--predicting beyond the range of the data--is dangerous!

Prediction at 1500 calories

Pred. fat gain = 3.505 – 0.00344*1500 = -1.66 kg

So predicting for a 1500 NEA increase when overeating, the prediction is that they will lose 1.66 kilograms of weight Not trustworthy Far outside the range of the data

Page 28: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Least Squares Regression (LSR) Line

The line which makes the sum of squares of the vertical distances of the data points from the line as small as possible

y is the observed (actual) response ŷ is the predicted response by using the line Residuals◦ Error in prediction◦ y – ŷ

Page 29: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Formula for Least Squares Regression lineCalculate:

Y1

X

s

sslope b r

0 1intercept b Y b X

0 1y b b x

, , , ,x yX Y s s r

Page 30: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Example: (NEA data)

X

Y

324.8

s 257.66

2.388

s 1.1389

0.7786

X cal

cal

Y kg

kg

r

Y1

X

s

sb r

0 1b Y b X

y

Page 31: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Using the formula:

Slope: b1 = -.7786 * 1.1389/257.66 = -0.00344

Intercept: b0 = (mean of y) – slope * (mean of x) = 2.388 – (-0.00344)*324.8 = 3.505

Regression line: Predicted fat gain = 3.505 – 0.00344*cal

ŷ = 3.505 – 0.00344x

Page 32: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Example: Predicted values and Residuals

Predicted fat gain for observation 2 (-57 cal.)ŷ2 = 3.505 – 0.00344*(-57) = 3.70108 kg

Observed fat gain: y2 = 3.0 kg

Residual or error in prediction = y2 - ŷ2 = 3.0 – 3.70108 = -0.70108 kg

Page 33: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Residual practice

Residual is yi – ŷi

For NEA data observation 14 has NEA = x14 = 580◦ Find the predicted value, ŷ14

◦ Find the residual, y14 - ŷ14

Page 34: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Properties of regression line Cannot switch Y and X Passes through the mean of x and mean of y

Physical interpretation of the slope b1: ◦ with one unit increase in X, how much does Y change on

average?◦ Example: NEA data: with 1 calorie increase in NEA, fain

gain changes by -0.00344 kg◦ How about 100 increase in NEA?

Page 35: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Using software SAS will evaluate the least squares regression line

but you have to know where to find them in the output!◦ Residuals and predicted values are also printed

SAS program : regression.doc ◦ the regression procedure is proc reg

We will do a deeper analysis of regression in chapter 10

Page 36: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Correlation and Regression In correlation, X and Y are interchangeable, NOT

so in regression.

Slope (b1), depends on correlation (r)

R2—Coefficient of Determination ◦ Square of correlation◦ Fraction of variation in y explained by LSR line◦ Higher R2 suggests better fit◦ Example: R2 = 0.6062 for NEA data

means that 60.62% of the unexplained variation in fat gain is explained by your fitted regression line with x = NEA.

Page 37: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

R2—another example

less spreadtight fit

R2 = 0.989

Explains the part of the variation of y which comes from the linear relationship between y and x. In this case between Height and Age.

more scattermore error in prediction

R2 = 0.849

Page 38: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

2.5 Cautions About Correlation and Regression

Residuals and residual plots Outliers and influential observations Lurking variables Correlation and causation

38

Page 39: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

39

A residual is the difference between an observed value of the response variable and the value predicted by the regression line:

residual = observed y – predicted y

Residuals

y ˆ y

A regression line describes the overall pattern of a linear relationship between an explanatory variable and a response variable. Deviations from the overall pattern are also important. The vertical distances between the points and the least-squares regression line are called residuals.

Residuals add up to zero and have a mean of zero

Thus, a fit is considered good if the plot shows a random spread of points about the zero line but without any definitive pattern

Page 40: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Scatterplot of residuals against explanatory variable Helps assess the fit of regression line

Residual plot

Page 41: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.
Page 42: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.
Page 43: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.
Page 44: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Outliers and influential observations Outliers: Lies outside the pattern of other

observations◦ Y-outliers: large residual◦ X-outliers: often influential in regression

Influential points: Deleting this point changes your statistical analysis drastically◦ pull the regression line towards themselves

Least squares regression is NOT robust to presence of outliers

Page 45: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Example: Gesell data

r = 0.4819 Subject 15: ◦ Y-outlier ◦ Far from line◦ High residual

Subject 18: ◦ X-outlier ◦ Close to line◦ Small residual

Page 46: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Example: Gesell data

r = 0.4819 Drop 15: ◦ r = 0.5684

Drop 18: ◦ r = 0.3837

Both have some influence, but neither seems excessive

Page 47: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

Association, however strong, does NOT imply causation.

Some possible explanations for an observed association

The dashed lines show an association. The solid arrows show a cause-

and-effect link. x is explanatory, y is response, and z is a lurking variable.

Explaining Association: Causation

47

Page 48: What is an association between variables?  Explanatory and response variables  Key characteristics of a data set 1.

It appears that lung cancer is associated with smoking.

How do we know that both of these variables are not being affected by an unobserved third (lurking) variable?

For instance, what if there is a genetic predisposition that causes people to both get lung cancer and become addicted to smoking, but the smoking itself doesn’t CAUSE lung cancer?

1. The association is strong.

2. The association is consistent.

3. Higher doses are associated with stronger responses.

4. Alleged cause precedes the effect.

5. The alleged cause is plausible.

We can evaluate the association using the following criteria:

48

Establishing Causation