Lecture 4 Notes - Weebly

Lecture 4 Notes

Chapter 6. Scatterplots, Association, and Correlation

Lecture Outline

• Examine relationships between two quantitative variables with scatterplots

• Describe the form, direction, and strength of a relationship

• Use correlation to measure the strength of a relationship

• Check conditions for using correlation

• Correlation is not causation

Scatterplots

• A scatterplot is an ideal way to plot the values of two quantitative variables against each other in order to picture their association.

• Association: When certain values of one variable tend to go with certain values of the other.

• An analysis of relationship (association) between two variables is also called bivariate analysis.

• We can see patterns, trends, relationships, and extraordinary values apart from the overall pattern.

• Main features of the scatterplots: Direction, Form, Strength

• Direction: upward or downward trend

• Form: a straight-line relationship, a slight curve or bend, or strong curvature (the pattern curves up or down sharply)

• Strength: how tightly the points cluster around the overall form, that is, how much scatter there is

• Unusual features: an outlying point that stands away from the overall pattern of the plot

Roles of Variables in a Scatterplot

• Before plotting two quantitative variables against each other, determine which variable is the response variable and which is the explanatory (predictor) variable.

• To decide which variable is the response and which is explanatory, think about the context of the study and the research question it aims to investigate.

• Response variable (outcome measure): the variable whose values are compared across levels of the other variable. It is placed on the y-axis (and is also known as the y-variable).

• Explanatory (predictor) variable: a variable proposed to explain the variation in the outcome variable. It is placed on the x-axis (and is also known as the x-variable).

• Analysis: studies how the outcomes of the response variable depend on, or are explained by, the values of the explanatory variable.

Investigating a Scatterplot

Questions to consider when investigating a scatterplot:

• Is the general pattern roughly straight or curved?

• Are some observations unusual compared with the rest?

• Are there distinct groups or clusters?

• Are there lurking (hidden) variables standing behind the analysis that have an important influence on the relationship being studied?

Scatterplot: Relationship Between Drug Amount and Reaction Time

Suppose the researchers were interested in predicting mean reaction time (in seconds) from the amount of drug (in %) in the blood.

• Response variable: length of reaction time (in seconds) to a stimulus

• Explanatory variable: percentage of a certain drug in the blood

Interpreting the scatterplot of drug amount (%) and reaction time (in seconds):

There appears to be an upward trend. As the amount of drug (%) increases, the reaction time (in seconds) tends to increase.

Relationship Between Drug Amount and Reaction Time

• Can researchers explain 100% of the variation (differences) in reaction time using only the amount of drug?

• In other words, can the amount of drug account for 100% of the variation in reaction time?

• Do you think an exact relationship exists between amount of drug and reaction time?

• Do you think it is possible to state the exact length of time it takes a subject to respond if the amount of the drug in the blood is known?

Relationship Between Drug Amount and Reaction Time

❖ There may be lurking variables (not included in the analysis) that account for some of the unexplained variation in reaction time. These include:

• time of day;

• the amount of sleep the subject had the day before;

• subject’s visual acuity;

• subject’s reaction time without the drug;

• subject’s age.

❖ Some variation in reaction time will remain that we cannot explain.

❖ It is unlikely that we could predict a subject’s reaction time exactly, even if many variables were included in the model.

Relationship Between Smoking Rate and Year

The Centers for Disease Control and Prevention (CDC) tracks cigarette smoking in the United States for both men and women. The study examined the percentage of men and women aged 18-24 who smoked from 1965 to 2011. The relationship between smoking rate and year was investigated.

Interpreting the scatterplot of smoking rate and year:

• This type of scatterplot is also referred to as a time plot; time is placed on the x-axis.

• Smoking rates for both men and women in the United States appear to have decreased over the period from 1965 to 2011.

• There are two distinct subgroups in this study (men, women); this information can help explain the variation in smoking rate and thus should be accounted for in the analysis.

• Smoking rates are generally lower for women than for men.

The Estimated Correlation Coefficient, 𝒓

• The population correlation coefficient is denoted by the symbol 𝜌 (rho).

• We estimate the population correlation coefficient from our sample information; the estimate is denoted by 𝑟.

• 𝑟 is a number that measures the strength of linear association between two quantitative variables.

• The covariance of 𝑥 and 𝑦 depends on the units of measurement. One solution is to standardize it.

• That is, divide the covariance of 𝑥 and 𝑦 by the standard deviations of both variables to remove the scale dependence.

• This standardized version of the covariance is known as the correlation coefficient, 𝑟.

• Therefore, the estimated correlation coefficient can be obtained by $r = \dfrac{S_{xy}}{S_x S_y}$ (we can use R to calculate this value).

• $S_{xy}$: sample covariance of 𝑥 and 𝑦, that is, the sum of cross-deviations divided by $n - 1$: $S_{xy} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$ (it measures the tendency of 𝑥 and 𝑦 to vary together, or “covary”).

• $S_x$: standard deviation of 𝑥

• $S_y$: standard deviation of 𝑦
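The lecture computes this value in R. As a rough illustration of the covariance formula only, here is a minimal Python sketch. The function names and the five data points are assumptions made up for this example; they are not the lecture's data or code.

```python
from math import sqrt


def sample_covariance(x, y):
    """Sum of cross-deviations divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross / (n - 1)


def sample_sd(values):
    """Sample standard deviation: square root of the covariance of a variable with itself."""
    return sqrt(sample_covariance(values, values))


def correlation(x, y):
    """r = S_xy / (S_x * S_y)."""
    return sample_covariance(x, y) / (sample_sd(x) * sample_sd(y))


# Hypothetical illustrative data (NOT taken from the slides).
drug_pct = [1, 2, 3, 4, 5]        # drug amount in the blood (%)
reaction_sec = [1, 1, 2, 2, 4]    # reaction time (seconds)

r = correlation(drug_pct, reaction_sec)
print(round(r, 3))  # 0.904
```

For these made-up values the covariance is 1.75 and the two standard deviations are about 1.58 and 1.22, which gives 𝑟 of about 0.90.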


The Estimated Correlation Coefficient, 𝒓

• Instead of plotting $(x, y)$, we plot the standardized values $(z_x, z_y)$, which centres the plotted points around the origin and removes the original units, replacing them with standard deviation units.

• We measure the strength of association by adding up the products $z_x z_y$; the sum summarizes both the direction and the strength of the association.

• Points farther from the origin have larger z-scores, so they contribute more to the sum $\sum z_x z_y$. However, adding more data points also makes the sum grow.

• Solution: average the products by dividing the sum by $n - 1$: $r = \dfrac{\sum z_x z_y}{n - 1}$

• This scaling guarantees that $-1 \le r \le 1$.
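The z-score version of the formula can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names and the data are hypothetical, and `statistics.stdev` is used because it divides by $n - 1$, matching the sample standard deviation in the slides.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def correlation_from_z(x, y):
    """r = sum(z_x * z_y) / (n - 1)."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Hypothetical illustrative data (NOT taken from the slides).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

r = correlation_from_z(x, y)
print(round(r, 3))  # 0.904
```

Dividing the covariance by both standard deviations and averaging the products of z-scores are the same calculation arranged differently, so both routes give the same 𝑟 for the same data.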


The Estimated Correlation Coefficient, 𝒓

• 𝑟 is a number that measures the strength of linear association between two quantitative variables.

• 𝑟 has no unit of measurement.

• r is always between -1 and 1 (−1 ≤ 𝑟 ≤ 1).

• 𝑟 values near 0 indicate a very weak linear relationship.

• 𝑟 = 0 implies no linear association (a straight line in 𝑥 contributes no information for predicting 𝑦).

• The strength of the linear relationship increases as 𝑟 moves away from 0.

• 𝑟 values close to −1 or 1 indicate that the points lie close to a straight line, that is, a stronger linear relationship between the two quantitative variables.

• 𝑟 is sensitive to extreme outliers in the data.

Note:

• A high correlation does not necessarily imply that a causal relationship exists between 𝑥 and 𝑦 – only that a linear trend may exist.

• A low correlation does not necessarily imply that 𝑥 and 𝑦 are unrelated – only that they are not strongly linearly related.
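The last note can be illustrated with a small Python sketch. The data are a constructed example (an assumption, not from the lecture): 𝑦 is determined exactly by 𝑥 through a curve, yet the linear correlation is essentially zero.

```python
from math import sqrt


def correlation(x, y):
    """Pearson correlation from sums of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Constructed example: a perfect but curved (quadratic) relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r = correlation(x, y)
print(abs(r) < 1e-12)  # True: no linear association despite an exact relationship
```

This is why the scatterplot should be checked before relying on 𝑟: the coefficient only describes how close the points are to a straight line.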

Assumptions and Conditions for Correlations

1. Quantitative Variable condition: Both variables must be quantitative

2. Straight-enough line condition: Look at the scatterplot to see whether it looks reasonably straight.

3. No outlier condition: Outlying points can make a small correlation look large and a large one look small.
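The effect described in the no outlier condition can be sketched in Python. All values below are made up for illustration (an assumption, not lecture data): one added point turns a weak correlation into a strong one, and the same kind of point ruins a perfect straight-line pattern.

```python
from statistics import mean, stdev


def correlation(x, y):
    """r as the averaged product of z-scores (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)


# Hypothetical data, chosen only to illustrate the condition.
x = [1, 2, 3, 4, 5, 6]
y_weak = [3, 4, 2, 4, 2, 3]      # little linear pattern
y_line = [2, 4, 6, 8, 10, 12]    # exactly on a straight line

r_weak = correlation(x, y_weak)
r_weak_outlier = correlation(x + [20], y_weak + [15])   # one far-away point added
r_line = correlation(x, y_line)
r_line_outlier = correlation(x + [20], y_line + [0])    # one point far off the line

print(round(r_weak, 2), round(r_weak_outlier, 2))
print(round(r_line, 2), round(r_line_outlier, 2))
```

In both cases a single observation dominates the sum of z-score products because it lies far from the centre of the data.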

Values of 𝒓 and their Implications

Figure source: McClave, J., & Sincich, T. (2017). Statistics (13th ed.), Chapter 11: Simple Linear Regression, Figure 11.16. Pearson.

Strength of associations:

• Strong correlations: |𝑟| ≥ 0.8

• Moderate to strong correlations: 0.5 ≤ |𝑟| < 0.8

• Moderate correlations: 0.3 ≤ |𝑟| < 0.5

• Weak correlations: |𝑟| < 0.3
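The cut-offs above can be written as a small lookup function. This is a sketch of the slide's rule of thumb only; the function name is hypothetical.

```python
def describe_strength(r):
    """Label the strength of a correlation using the cut-offs listed above."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # strength ignores the sign; the sign gives the direction
    if size >= 0.8:
        return "strong"
    if size >= 0.5:
        return "moderate to strong"
    if size >= 0.3:
        return "moderate"
    return "weak"


for value in (0.92, -0.6, 0.35, -0.1):
    print(value, describe_strength(value))
```

The sign is dropped before comparing because strength and direction are described separately.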

Drug Reaction Example: The Estimated Correlation Coefficient, 𝒓

In R:

Describing the trend in the scatterplot:

There appears to be a strong, positive linear relationship between drug amount in the blood and reaction times.

• This 𝑟 value indicates that reaction times tend to increase as the amount of drug in the blood increases.

Note: When you calculate a correlation, it doesn’t matter which variable is 𝑥 (explanatory variable) and which is 𝑦 (response variable).
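The symmetry in the note can be checked with a short Python sketch. The data are hypothetical (not the lecture's values), and the helper is a plain implementation of the z-score formula.

```python
from statistics import mean, stdev


def correlation(x, y):
    """r as the averaged product of z-scores (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)


# Hypothetical illustrative data.
drug_pct = [1, 2, 3, 4, 5]
reaction_sec = [1, 1, 2, 2, 4]

r_xy = correlation(drug_pct, reaction_sec)
r_yx = correlation(reaction_sec, drug_pct)   # roles of x and y swapped

print(abs(r_xy - r_yx) < 1e-12)  # True
```

Swapping the roles only reorders the factors in each product of z-scores, so the sum is unchanged.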

Units of measurement shouldn’t affect 𝒓: Drug Reaction Example

Describing the trend in the scatterplot:

There appears to be a strong, positive linear relationship between drug amount in the blood and reaction times.

In the drug reaction time example:

• The product of the standardized coordinates of each $(x, y)$ point is positive.

• That is, the products $z_x z_y$ are all positive.

• This means that the sum of the products over all pairs $(z_x, z_y)$ is positive.

• Therefore, the estimated correlation coefficient is positive: $r = \dfrac{\sum z_x z_y}{n - 1}$

• Because z-scores are free of units, changing the units of 𝑥 or 𝑦 leaves 𝑟 unchanged.
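To see the unit invariance numerically, the following Python sketch rescales both variables and recomputes 𝑟. The data and the chosen unit conversions (seconds to milliseconds, percent to proportion) are assumptions for illustration.

```python
from statistics import mean, stdev


def correlation(x, y):
    """r as the averaged product of z-scores (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)


# Hypothetical illustrative data.
drug_pct = [1, 2, 3, 4, 5]          # percent
reaction_sec = [1, 1, 2, 2, 4]      # seconds

drug_prop = [d / 100 for d in drug_pct]          # percent -> proportion
reaction_ms = [t * 1000 for t in reaction_sec]   # seconds -> milliseconds

r_original = correlation(drug_pct, reaction_sec)
r_rescaled = correlation(drug_prop, reaction_ms)

print(abs(r_original - r_rescaled) < 1e-9)  # True
```

Multiplying a variable by a positive constant rescales its deviations and its standard deviation by the same factor, so the z-scores do not change.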

Example of Relationship Between Two Quantitative Variables

Let’s consider a data set on the states of the United States (geographic and socio-political areas) that includes information about:

• Percent of births to teen moms: quantitative variable

• Poverty rate: quantitative variable

Research Question: Is there a relationship between percent of births to teen moms and poverty rate?

• Response variable: percent of births to teen moms

• Explanatory variable: poverty rate

Example of Relationship Between Two Quantitative Variables

Research Question: Is there a relationship between percent of births to teen moms and poverty rate?

Recall: Describing a Scatterplot

• Look for the overall pattern and for striking deviations from that pattern.

• The overall pattern of a scatterplot can be described by the form, direction, and strength of the relationship.

• An important kind of deviation is an outlier, an individual value that falls outside the overall pattern.

Example of Relationship Between Two Quantitative Variables

Direction:

The points form an upward trend, so the association is positive.

Form:

Overall, the points form a trend that is close to a straight line. No curvature is observed in the overall pattern.

Strength:

There appears to be a moderate to strong linear association.

In the context of this study, we interpret this scatterplot as:

• There appears to be a moderate to strong positive linear association between percent of births to teen moms and poverty rate.

• Higher percentages of births to teen moms are, in general, associated with higher poverty rates.

• As poverty rate increases, the percent of births to teen moms tends to increase.

Possible High Outlier

Relationship Between Smoking Rate and Year

The Centers for Disease Control and Prevention (CDC) tracks cigarette smoking in the United States for both men and women. The study examined the percentage of men and women aged 18-24 who smoked from 1965 to 2011. The relationship between smoking rate and year was investigated.

There appears to be a moderate to strong negative association between smoking rate and year in this study. As the years increase, the smoking rate tends to decrease for both men and women.

Investigating Interesting Relationships Among a Set of Variables

Researchers are interested in investigating whether the percent of births to teen moms in the U.S.A. is related to poverty rate, welfare, government education spending, and percent African American population.

Response variable (Quantitative):

• 𝑦: Births to teen moms as a percent of all births

Explanatory variables (Quantitative):

• 𝑥1: Poverty rate

• 𝑥2: Average monthly temporary assistance to needy families (Welfare)

• 𝑥3: Per capita state and local government spending for education

• 𝑥4: Percent African American population

Correlation Does Not Imply Causation

Important note: Correlation does not imply causation.

Why? Because there are other variables (perhaps not included in the study’s data analysis) that could contribute to understanding the variation in the percent of births to teen moms. These variables are often called lurking or hidden variables.

• Can you propose some variables that you think could contribute to understanding the variation in percent of births to teen moms?

Teen Moms Data File: Read into R

Estimated Correlations, 𝒓 values: Correlation Matrix

Correlation between % Birth to Teen Moms and Possible Predictor Variables:

• Correlation (Poverty rate, % Birth to Teen Moms) = 0.85 (high positive correlation)

• Correlation (Welfare, % Birth to Teen Moms) = -0.66 (moderate to high negative correlation)

• Correlation (% African American, % Birth to Teen Moms) = 0.44 (moderate positive correlation)

• Correlation (Govt. Education Spending, % Birth to Teen Moms) = -0.30 (moderate negative correlation)
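The lecture obtains these values from a correlation matrix in R. As a rough picture of what such a matrix contains, here is a minimal Python sketch. The variable names and all numbers are made up for six hypothetical states; they are not the lecture's data and will not reproduce the values listed above.

```python
from statistics import mean, stdev


def correlation(x, y):
    """r as the averaged product of z-scores (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)


# Made-up values for six hypothetical states; NOT the lecture's data.
data = {
    "teen_births": [8.0, 10.5, 12.0, 9.0, 14.5, 11.0],
    "poverty": [9.5, 12.0, 15.5, 10.0, 19.0, 13.5],
    "welfare": [420, 350, 260, 400, 180, 310],
}

names = list(data)

# One entry per pair of variables; the diagonal pairs each variable with itself.
matrix = {a: {b: correlation(data[a], data[b]) for b in names} for a in names}

for a in names:
    print(a.ljust(12), " ".join(f"{matrix[a][b]:6.2f}" for b in names))
```

Each row is read against the response variable in the same way as the list above: the sign gives the direction and the absolute value gives the strength.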

Scatterplot Matrix

Correlation Does NOT Prove Causation

• In many studies of the relationship between two variables, the goal is to establish that changes in the explanatory variable cause changes in the response variable.

• Even a strong association between two variables does not necessarily imply a causal link between the variables.

• Some explanations for an observed association:

• The dashed double-arrow lines show an association.

• The solid arrows show a cause-and-effect link.

• The variable 𝑥 is explanatory, 𝑦 is the response, and 𝑧 is a lurking variable.

• In picture (c), the effects of 𝑥 and 𝑧 are confounded, making it difficult to determine which (if either) of the two confounded variables is the causal culprit. Maybe both have a causal link to the 𝑦-variable, maybe neither. Maybe some other undiscovered lurking variable is the real culprit, with its hidden effect confounded with the 𝑥-variable.
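One way a lurking variable can create an association is sketched below as a small Python simulation. Everything here is an assumption chosen for illustration: 𝑧 is simulated, 𝑥 and 𝑦 each depend on 𝑧 plus independent noise, and 𝑥 has no effect on 𝑦 at all.

```python
import random
from statistics import mean, stdev


def correlation(x, y):
    """r as the averaged product of z-scores (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)


rng = random.Random(42)   # fixed seed so the sketch is repeatable
n = 2000

# Simulated lurking variable z drives both x and y; x never influences y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

r_xy = correlation(x, y)

# Remove z's contribution from each variable; only independent noise remains.
x_without_z = [xi - zi for xi, zi in zip(x, z)]
y_without_z = [yi - zi for yi, zi in zip(y, z)]
r_after = correlation(x_without_z, y_without_z)

print(round(r_xy, 2), round(r_after, 2))
```

The first value is large even though 𝑥 does not cause 𝑦, and the second is close to zero once 𝑧 is accounted for. In real studies the lurking variable is usually unmeasured, so this subtraction is not available.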

Example from OECD-iLibrary

What does age have to do with skills proficiency?
http://www.oecd-ilibrary.org/education/what-does-age-have-to-do-with-skills-proficiency_5jm0mq158zjl-en

“The Survey of Adult Skills, a product of the OECD Programme for the International Assessment of Adult Competencies (PIAAC), shows that there are substantial differences in proficiency in information-processing skills across age groups. Proficiency is highest among adults in their late 20s and early 30s. From that point, proficiency declines with age. Depending on whether and how one tries to account for differences in individuals’ socio-demographic characteristics, 55-65 year-olds can be estimated to score between 18 and 32 points below 25-34 year-olds in literacy” (OECD, 2012).

“As the survey measures the proficiency of adults of different ages at a single point in time, the differences in proficiency related to age may reflect the impact of other factors in addition to biological ageing (the so-called “age effects”).

For example, the quantity and quality of education received by individuals born in different years may vary considerably – thus generating so-called “cohort effects”.

Yet, the results from the Survey of Adult Skills are broadly similar to those of other studies that are better able to separate age and cohort effects: ageing does have an impact on skills proficiency” (OECD, 2012).
