Lecture 4 Notes
Chapter 6. Scatterplots, Association, and Correlation
Lecture Outline
• Examine relationships between two quantitative variables with scatterplots
• Describe the form, direction, and strength of a relationship
• Use correlation to measure the strength of a relationship
• Check conditions for using correlation
• Correlation is not causation
Scatterplots
• An ideal way to plot values of two quantitative variables against each other in order to picture association.
• Association: When certain values of one variable tend to go with certain values of the other.
• An analysis of relationship (association) between two variables is also called bivariate analysis.
• We can see patterns, trends, relationships, and extraordinary values apart from the overall pattern.
• Main features of the scatterplots: Direction, Form, Strength
• Direction: upward or downward trend
• Form: straight-line relationship, slight curve or bend, or strong curvature (the pattern curves up or down sharply)
• Strength: how tightly the points cluster around the overall form, that is, how much scatter there is
• Unusual features: an outlying point that stands away from the overall pattern of the plot.
Roles of Variables in a Scatterplot
• Before plotting the two quantitative variables against each other, determine which variables are response variables and which
are explanatory (predictor) variables.
• In determining which variable is response, and which one is explanatory, think about the context of the study and the research
question that the study aims at investigating.
• Response variable (outcome measure): A variable on which comparisons are made across levels of the other variables.
It is placed on the y-axis (also known as the y-variable).
• Explanatory (predictor) variable: A variable that is proposed to explain the variation in the response variable in the
analysis. It is placed on the x-axis (also known as the x-variable).
• Analysis: Studies how the outcomes of the response variable depend on, or are explained by, the values of the
explanatory variable.
Investigating a Scatterplot
Questions to consider when investigating a scatterplot:
• Is the general pattern roughly straight or curved?
• Are some observations somewhat unusual compared with the rest?
• Are there distinct groups or clusters?
• Are there lurking (hidden) variables, not included in the analysis, that have an important influence
on the relationship being studied?
Scatterplot: Relationship Between Drug Amount and Reaction Time
Suppose the researchers were interested in predicting mean reaction time (in seconds) from the amount of drug (in %) in the blood.
• Response variable: length of reaction time (in sec) to a stimulus
• Explanatory variable: percentage of a certain drug in the blood
Interpreting the scatterplot of drug amount % and
reaction time in seconds:
There appears to be an upward trend. As the amount
of drug (%) increases, the reaction time (in seconds)
tends to increase.
Relationship Between Drug Amount and Reaction Time
• Can researchers explain 100% of the variation (differences) in reaction time by using only the
amount of drug?
• Can amount of drug account for 100% of variation in reaction time?
• Do you think an exact relationship exists between amount of drug and reaction time?
• Do you think it is possible to state the exact length of time it takes a subject to respond if the amount of the drug
in the blood is known?
Relationship Between Drug Amount and Reaction Time
❖ There would be (possibly) some lurking variables (not included in the analysis) that can account for unexplained
variation in reaction time. These include:
• time of the day;
• the amount of sleep the subject had the day before;
• subject’s visual acuity;
• subject’s reaction time without drug;
• subject’s age.
❖ There would be some variation in reaction time that we could not explain.
❖ It is unlikely that we would be able to predict the subject’s reaction time exactly even if many variables are included in the model.
Relationship Between Smoking Rate and Year
The Centers for Disease Control and Prevention tracks cigarette smoking in the United States for both men and women. The study
examined the percentage of men and women aged 18-24 who smoked from 1965 to 2011. The relationship between smoking
rate and year was investigated.
Interpreting the scatterplot of smoking rate
and year:
• This type of scatterplot is also referred to as a
time plot; time is placed on the x-axis.
• Smoking rates for both men and women in the
United States appear to have decreased over
the time period from 1965 to 2011.
• There are two distinct subgroups in this study
(men, women); this information can help
explain the variation in smoking rate and thus
should be accounted for in the analysis.
• Smoking rates are generally lower for women
than for men.
The Estimated Correlation Coefficient, 𝒓
• The population correlation coefficient is denoted by symbol 𝜌 (rho).
• We estimate the population correlation coefficient from our sample information; the estimate is denoted by 𝑟.
• 𝑟 is a number that measures the strength of linear association between two quantitative variables.
• The covariance of 𝑥 and 𝑦 depends on the units of measurement. One solution is to standardize it.
• That is, divide the covariance of 𝑥 and 𝑦 by the standard deviations of both variables to remove the scale dependence.
• This standardized version of the covariance is known as the correlation coefficient, 𝑟.
• Therefore, the estimated correlation coefficient can be obtained by: $r = \frac{S_{xy}}{S_x S_y}$ (we can use R to calculate this value)
• $S_{xy}$: sample covariance of 𝑥 and 𝑦, that is, the sum of cross-deviations $\sum (x - \bar{x})(y - \bar{y})$ divided by $n - 1$ (it measures the tendency of 𝑥 and 𝑦 to vary together, to “covary”).
• $S_x$: standard deviation of 𝑥
• $S_y$: standard deviation of 𝑦
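The covariance-based formula can be sketched in a few lines. The notes use R for this calculation; the Python below is only an illustration, and the data values and function names are assumptions made for the example, not the lecture's data set.

```python
from math import sqrt

def sample_covariance(x, y):
    """S_xy: sum of cross-deviations divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / (n - 1)

def sample_sd(values):
    """Sample standard deviation (the covariance of a variable with itself, square-rooted)."""
    return sqrt(sample_covariance(values, values))

def correlation(x, y):
    """r = S_xy / (S_x * S_y)."""
    return sample_covariance(x, y) / (sample_sd(x) * sample_sd(y))

# Illustrative values only (assumed for this sketch).
drug_pct = [1, 2, 3, 4, 5]
reaction_sec = [1, 1, 2, 2, 4]

r = correlation(drug_pct, reaction_sec)
print(round(r, 3))  # 0.904
```

Because the same divisor n − 1 appears in the covariance and in both standard deviations, it cancels, which is one way to see that 𝑟 carries no units.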
The Estimated Correlation Coefficient, 𝒓
• Instead of plotting (𝑥, 𝑦), we plot the z-scores (𝑧𝑥, 𝑧𝑦), which centres the plotted points around the origin and removes
the original units, replacing them with standard deviation units.
• We measure the strength of association by adding up the products 𝑧𝑥𝑧𝑦, which summarizes both the direction and
strength of association.
• Points farther from the origin have larger z-scores, so they contribute more to the sum $\sum z_x z_y$. Also, adding
more data points makes the sum grow.
• Solution: Average the products by dividing the sum by 𝑛 − 1: $r = \frac{\sum z_x z_y}{n - 1}$
• This strategy ensures that −1 ≤ 𝑟 ≤ 1.
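The z-score version of the calculation can be sketched the same way. This is a minimal Python illustration with assumed data and helper names; it standardizes each variable with the sample standard deviation (divisor n − 1) and then averages the products.

```python
from math import sqrt

def z_scores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def correlation_from_z(x, y):
    """r = sum(z_x * z_y) / (n - 1)."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

# Illustrative values only (assumed for this sketch).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

r_z = correlation_from_z(x, y)
print(round(r_z, 3))  # 0.904

# Points exactly on a line give the extreme values of r.
print(round(correlation_from_z([1, 2, 3], [2, 4, 6]), 3))   # 1.0
print(round(correlation_from_z([1, 2, 3], [6, 4, 2]), 3))   # -1.0
```

Each z-score is a deviation divided by a standard deviation, so this formula is algebraically the same as dividing the covariance by the product of the two standard deviations.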
The Estimated Correlation Coefficient, 𝒓
• 𝑟 is a number that measures the strength of linear association between two quantitative variables.
• 𝑟 has no unit of measurement.
• r is always between -1 and 1 (−1 ≤ 𝑟 ≤ 1).
• 𝑟 values near 0 indicate a very weak linear relationship.
• 𝑟 = 0 implies no linear association (𝑥 contributes no information for linearly predicting 𝑦).
• The strength of the linear relationship increases as 𝑟 moves away from 0.
• 𝑟 values close to -1 or 1 indicate that the points lie close to a straight line, indicating a stronger linear relationship between the two
quantitative variables.
• 𝑟 is sensitive to extreme outliers in the data.
Note:
• A high correlation does not necessarily imply that a causal relationship exists between 𝑥 and 𝑦 – only that a linear trend may exist.
• A low correlation does not necessarily imply that 𝑥 and 𝑦 are unrelated – only that they are not strongly linearly related.
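A small constructed example may help with the last note. In the Python sketch below (data chosen for illustration), 𝑦 is determined exactly by 𝑥 through a curve, yet 𝑟 is 0 because the relationship has no linear component.

```python
from math import sqrt

def correlation(x, y):
    """Pearson r; the n - 1 divisors cancel, so sums of deviations are enough."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# y depends perfectly on x, but through a symmetric curve, not a line.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r_curve = correlation(x, y)
print(r_curve)  # 0.0
```

This is why the scatterplot should be checked before relying on 𝑟: the number alone cannot distinguish "no relationship" from "a strong relationship that is not linear".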
Assumptions and Conditions for Correlations
1. Quantitative Variable condition: Both variables must be quantitative
2. Straight-enough line condition: Look at the scatterplot to see whether it looks reasonably straight.
3. No outlier condition: Outlying points can make a small correlation large and a large one look small.
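The no-outlier condition can be illustrated with a sketch in Python. The numbers are invented for the example: six points with a weak association, then the same points plus one extreme point.

```python
from math import sqrt

def correlation(x, y):
    """Pearson r; the n - 1 divisors cancel, so sums of deviations are enough."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented data with a weak association.
x = [1, 2, 3, 4, 5, 6]
y = [3, 1, 2, 3, 1, 2]
r_without = correlation(x, y)

# The same data plus a single outlying point far from the rest.
r_with = correlation(x + [20], y + [20])

print(round(r_without, 2))  # -0.24
print(round(r_with, 2))     # 0.95
```

Here a single point turns a weak negative correlation into a strong positive one. The sketch shows only this direction; an outlier placed against an existing trend can pull a large correlation toward 0 in the same way.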
Values of 𝒓 and their Implications
Figure source: McClave, J., & Sincich, T. (2017). Statistics (13th ed.), Chapter 11: Simple Linear Regression, Figure 11.16. Pearson.
Strength of associations:
• Strong correlations: |𝑟| ≥ 0.8
• Moderate to strong correlations: 0.5 ≤ |𝑟| < 0.8
• Moderate correlations: 0.3 ≤ |𝑟| < 0.5
• Weak correlations: |𝑟| < 0.3
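The cut-offs above can be written as a small helper. This Python sketch simply encodes the thresholds listed in these notes; the function name is an assumption, and the cut-offs are a rule of thumb rather than a universal standard.

```python
def describe_strength(r):
    """Label the strength of a correlation using the cut-offs in these notes."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # strength ignores the sign; the sign gives the direction
    if size >= 0.8:
        return "strong"
    if size >= 0.5:
        return "moderate to strong"
    if size >= 0.3:
        return "moderate"
    return "weak"

# Correlations reported for the teen moms data in these notes.
for value in (0.85, -0.66, 0.44, -0.30):
    print(value, describe_strength(value))
```

The labels depend only on |𝑟|, so -0.66 and 0.66 receive the same strength label and differ only in direction.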
Drug Reaction Example: The Est. Correlation Coefficient, 𝒓
In R:
Describing the trend in the Scatterplot:
There appears to be a strong, positive linear relationship
between drug amount in the blood and reaction times.
* This r value indicates that the reaction times tend to
increase as the amount of drug in the blood increases.
Note: When you calculate a correlation, it doesn’t
matter which variable is 𝑥 (explanatory variable)
and which is 𝑦 (response variable).
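The symmetry in the note can be checked directly. In this Python sketch (illustrative values, not the lecture's data), swapping the roles of the two variables leaves 𝑟 unchanged, because each product in the formula is the same either way round.

```python
from math import isclose, sqrt

def correlation(x, y):
    """Pearson r; the n - 1 divisors cancel, so sums of deviations are enough."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Illustrative values only (assumed for this sketch).
drug_pct = [1, 2, 3, 4, 5]
reaction_sec = [1, 1, 2, 2, 4]

r_xy = correlation(drug_pct, reaction_sec)
r_yx = correlation(reaction_sec, drug_pct)
print(isclose(r_xy, r_yx))  # True
```

The choice of response and explanatory variable still matters for the scatterplot axes and for prediction; it just does not change the value of 𝑟.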
Units of measurement shouldn’t affect 𝒓: Drug Reaction Example
In the drug reaction time example:
• For each point, the product of the standardized coordinates (𝑧𝑥, 𝑧𝑦) is positive.
• Therefore, the sum of the products 𝑧𝑥𝑧𝑦 over all pairs is positive.
• Therefore, the estimated correlation coefficient is positive:
$r = \frac{\sum z_x z_y}{n - 1}$
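To see why units of measurement do not affect 𝑟, the sketch below recomputes the correlation after converting seconds to milliseconds and percent to a proportion. It is written in Python with illustrative values; the variable names are assumptions for the example.

```python
from math import isclose, sqrt

def correlation(x, y):
    """Pearson r; the n - 1 divisors cancel, so sums of deviations are enough."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Illustrative values only (assumed for this sketch).
drug_pct = [1, 2, 3, 4, 5]
reaction_sec = [1, 1, 2, 2, 4]

# Change the units of both variables.
drug_proportion = [p / 100 for p in drug_pct]        # percent -> proportion
reaction_ms = [s * 1000 for s in reaction_sec]       # seconds -> milliseconds

r_original = correlation(drug_pct, reaction_sec)
r_rescaled = correlation(drug_proportion, reaction_ms)
print(isclose(r_original, r_rescaled))  # True
```

Rescaling a variable by a positive constant rescales its deviations and its standard deviation by the same factor, so the z-scores, and therefore 𝑟, stay the same.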
Example of Relationship Between Two Quantitative Variables
Let’s consider a data set on geographic socio-political areas in each state in the United States that includes information about:
• Percent of births to teen moms: quantitative variable
• Poverty rate: quantitative variable
Research Question:
Is there a relationship between
the percent of births to teen moms and the poverty rate?
• Response variable: percent of births to teen moms
• Explanatory variable: poverty rate
Example of Relationship Between Two Quantitative Variables
Research Question: Is there a relationship between the percent of births to teen moms and the poverty rate?
Recall: Describing a Scatterplot
• Look for the overall pattern and striking deviations from
that pattern.
• The overall pattern of a scatterplot can be described
by the form, direction, and strength of the
relationship.
• An important kind of deviation is an outlier, an
individual value that falls outside the overall pattern.
Example of Relationship Between Two Quantitative Variables
Direction:
The points form an upward trend.
The association is, thus, positive.
Form:
Overall, the points form a trend that is close to a straight line.
No curvature is observed in the overall pattern.
Strength:
There appears to be a moderate to strong linear association.
In the context of this study, we interpret this scatterplot as:
• There appears to be a moderate to strong positive linear
association between the percent of births to teen moms and
the poverty rate.
• Higher percentages of births to teen moms are, in general,
associated with higher poverty rates.
• As the poverty rate increases, the percent of births to teen moms
tends to increase.
The scatterplot also shows a possible high outlier.
Relationship Between Smoking Rate and Year
There appears to be a moderate to strong negative
association between smoking rate and year in
this study. As the years increase, the smoking rate
tends to decrease for both men and women.
Investigating Interesting Relationships Among a Set of Variables
Researchers are interested in investigating whether the percent of births to teen moms in the U.S.A. is related to poverty rate,
welfare, government education spending, and percent African American population.
Response variable (Quantitative):
• 𝑦: Births to teen moms as a percent of all births
Explanatory variables (Quantitative):
• 𝑥1: Poverty rate
• 𝑥2: Average monthly temporary assistance to needy families (Welfare)
• 𝑥3: Per Capita State and Local Government Spending for Education
• 𝑥4: Percent African American population
Correlation Does Not Imply Causation
Important note:
Correlation does not imply causation.
Why? Because there are other variables (perhaps not
included in the study’s data analysis) that could contribute
to understanding the variation in the percent of births to
teen moms. These variables are often called lurking or
hidden variables.
• Can you propose some variables that you think
could contribute to understanding the variation in the
percent of births to teen moms?
Teen Moms Data File: Read into R
Estimated Correlations, 𝒓 values: Correlation Matrix
Correlation between % Births to Teen Moms and Possible Predictor Variables:
• Correlation (Poverty rate, % Births to Teen Moms) = 0.85 (high positive correlation)
• Correlation (Welfare, % Births to Teen Moms) = -0.66 (moderate to high negative correlation)
• Correlation (% African American, % Births to Teen Moms) = 0.44 (moderate positive correlation)
• Correlation (Govt. Education Spending, % Births to Teen Moms) = -0.30 (moderate negative correlation)
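A correlation matrix is just the table of 𝑟 values for every pair of variables. The Python sketch below shows the idea on a tiny invented data set; the column names and numbers are placeholders for illustration and are not the teen moms data.

```python
from itertools import product
from math import sqrt

def correlation(x, y):
    """Pearson r; the n - 1 divisors cancel, so sums of deviations are enough."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every ordered pair of columns."""
    names = list(columns)
    return {(a, b): correlation(columns[a], columns[b])
            for a, b in product(names, repeat=2)}

# Invented placeholder data: one response and two explanatory variables.
toy = {
    "y":  [10, 12, 15, 11, 18, 14],
    "x1": [8, 11, 14, 9, 17, 12],
    "x2": [30, 25, 20, 28, 15, 24],
}

matrix = correlation_matrix(toy)
for row in toy:
    print(row, [round(matrix[(row, col)], 2) for col in toy])
```

The diagonal entries are 1 (each variable is perfectly correlated with itself) and the matrix is symmetric, so only the entries on one side of the diagonal need to be read.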
Scatterplot Matrix
Correlation Does NOT Prove Causation
• In many studies of the relationship between two variables the goal is to establish that changes in the explanatory
variable cause changes in response variable.
• Even a strong association between two variables does not necessarily imply a causal link between the variables.
• Some explanations for an observed association:
• The dashed double arrow lines show an association.
• The solid arrows show a cause and effect link.
• The variable 𝑥 is explanatory, 𝑦 is response, and 𝑧 is
a lurking variable.
• In picture (c), the effects of 𝑥 and 𝑧 are confounded, making it difficult to determine which (if either) of the
two confounded variables is the causal culprit. Maybe both have a causal link to the 𝑦-variable, maybe neither. Maybe
some other undiscovered lurking variable is the real culprit, with its hidden effect confounded with the 𝑥-variable.
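One of these explanations can be simulated. In the Python sketch below, a lurking variable 𝑧 drives both 𝑥 and 𝑦, while 𝑥 has no effect on 𝑦 at all. The set-up (normal noise, the chosen standard deviations, the seed) is an assumption made purely for illustration.

```python
import random
from math import sqrt

def correlation(x, y):
    """Pearson r; the n - 1 divisors cancel, so sums of deviations are enough."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

random.seed(1)  # fixed seed so the sketch is reproducible
n = 2000

# Lurking variable z, plus independent noise for x and for y.
z = [random.gauss(0, 1) for _ in range(n)]
noise_x = [random.gauss(0, 0.5) for _ in range(n)]
noise_y = [random.gauss(0, 0.5) for _ in range(n)]

# x and y each depend on z only; neither depends on the other.
x = [zi + e for zi, e in zip(z, noise_x)]
y = [zi + e for zi, e in zip(z, noise_y)]

r_xy = correlation(x, y)                 # strong, although x does not cause y
r_noise = correlation(noise_x, noise_y)  # near 0 once z is taken out

print(round(r_xy, 2), round(r_noise, 2))
```

The strong correlation between 𝑥 and 𝑦 comes entirely from their shared dependence on 𝑧; the parts of 𝑥 and 𝑦 that do not involve 𝑧 are essentially uncorrelated.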
Example from OECD-iLibrary
What does age have to do with skills proficiency?
http://www.oecd-ilibrary.org/education/what-does-age-have-to-do-with-skills-proficiency_5jm0mq158zjl-en
“The Survey of Adult Skills, a product of the OECD Programme for the International Assessment of Adult Competencies (PIAAC), shows that there are substantial differences in proficiency in information-processing skills across age groups. Proficiency is highest among adults in their late 20s and early 30s. From that point, proficiency declines with age. Depending on whether and how one tries to account for differences in individuals’ socio-demographic characteristics, 55-65 year-olds can be estimated to score between 18 and 32 points below 25-34 year-olds in literacy” (OECD, 2012).
“As the survey measures the proficiency of adults of different ages at a single point in
time, the differences in proficiency related to age may reflect the impact of other
factors in addition to biological ageing (the so-called “age effects”).
For example, the quantity and quality of education received by individuals born in
different years may vary considerably – thus generating so-called “cohort effects”.
Yet, the results from the Survey of Adult Skills are broadly similar to those of other
studies that are better able to separate age and cohort effects: ageing does have an
impact on skills proficiency” (OECD, 2012).