Two Variable Statistics Correlation Coefficient & Pearson’s Correlation Coefficient.
Correlation in Statistics
![Page 1: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/1.jpg)
CORRELATION Avjinder Singh Kaler and Kristi Mai
![Page 2: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/2.jpg)
The linear correlation coefficient, r, is a number that measures how well
paired sample data fit a straight-line pattern when graphed.
Using paired sample data (sometimes called bivariate data), we find
the value of r (usually using technology), then we use that value to
conclude that there is (or is not) a linear correlation between the two
variables.
In this section we will consider only linear relationships, which means
that when graphed, the points approximate a straight-line pattern.
We will discuss methods of hypothesis testing for correlation.
![Page 3: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/3.jpg)
Correlation – a correlation exists between two variables when the
values of one variable are somehow associated with the values of the
other variable.
• Can be positive, negative, non-existent, or non-linear
• A linear correlation exists between two variables when there is a
correlation and the plotted points of paired data result in a pattern that
can be approximated by a straight line.
![Page 4: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/4.jpg)
We can often see a relationship between two variables by
constructing a scatterplot.
Scatter plots of paired data
![Page 5: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/5.jpg)
![Page 6: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/6.jpg)
1. The sample of paired data is a Simple Random Sample of quantitative data
2. The pairs of data (𝑥, 𝑦) have a bivariate normal distribution, which in practice is checked as follows:
• Visual examination of the scatter plot confirms that the sample points approximate a straight-line pattern
• Because results can be strongly affected by outliers, any outliers should be removed if they are known to be errors (Note: Use caution when removing data points)
Note: These are the same as the Requirements for Simple Linear Regression.
![Page 7: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/7.jpg)
• Linear Correlation Coefficient (𝑟) – measures the strength of the linear
correlation between the paired quantitative 𝑥 and 𝑦 values in a sample
• Also known as the Pearson Product Moment Correlation Coefficient in honor of
Karl Pearson
• This is a sample statistic that estimates the linear correlation between 𝑥 and 𝑦 in the population
• If this value is squared, the result is the Coefficient of Determination (𝑟²)
• Notation:
• 𝑟 : linear correlation coefficient for sample data
• 𝜌 : linear correlation coefficient for a population of paired data
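• Formula for calculating 𝑟:
$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\left(\sum x^2\right) - \left(\sum x\right)^2}\,\sqrt{n\left(\sum y^2\right) - \left(\sum y\right)^2}}$$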
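The sums formula for 𝑟 can be evaluated directly. The following is a minimal Python sketch of that calculation; the function name `pearson_r` and the small data set are illustrative assumptions and do not come from the slides.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient r computed from the sums formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Made-up illustrative data (NOT the data from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 3))  # a moderately strong positive correlation
```

For this made-up sample the sums are Σx = 15, Σy = 20, Σxy = 66, Σx² = 55, and Σy² = 86, so 𝑟 = 30 / √1500 ≈ 0.775.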
![Page 8: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/8.jpg)
1. −1 ≤ r ≤ 1
2. If all values of either variable are converted to a different scale, the value of r does not change.
3. The value of r is not affected by the choice of x and y: interchanging all x-values and y-values leaves r unchanged.
4. r measures strength of a linear relationship.
5. r is very sensitive to outliers
• A single outlier can dramatically affect the value of r
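Several of these properties can be checked numerically. The sketch below uses made-up data (not from the slides) and a hypothetical helper `pearson_r` to illustrate properties 2, 3, and 5; it is a demonstration on one sample, not a proof.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient r from the sums formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    )

# Made-up, nearly linear illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

r = pearson_r(xs, ys)

# Property 2: converting x to a different scale (e.g. cm to inches) leaves r unchanged.
r_rescaled = pearson_r([x / 2.54 for x in xs], ys)

# Property 3: interchanging x and y leaves r unchanged.
r_swapped = pearson_r(ys, xs)

# Property 5: one outlying pair can change r dramatically.
r_outlier = pearson_r(xs + [7.0], ys + [0.0])

print(round(r, 3), round(r_rescaled, 3), round(r_swapped, 3), round(r_outlier, 3))
```

In this sample, adding the single pair (7, 0) pulls 𝑟 from above 0.99 down to roughly 0.25.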
![Page 9: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/9.jpg)
• The value of 𝑟² is the proportion of the variation in 𝑦 that is explained by the linear relationship that exists between 𝑥 and 𝑦
• Thus, 𝑟² is also the proportion of the variation in 𝑦 that is explained by the regression line itself
• We may use 𝑟² to describe the predictive power of the regression equation
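The "proportion of variation explained" reading can be made concrete: for a least-squares line, the explained variation divided by the total variation in 𝑦 equals 𝑟². The sketch below assumes made-up data and hypothetical helper names (`least_squares`, `r_squared`).

```python
def least_squares(xs, ys):
    """Return (intercept b0, slope b1) of the least-squares regression line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def r_squared(xs, ys):
    """Explained variation divided by total variation in y."""
    b0, b1 = least_squares(xs, ys)
    y_bar = sum(ys) / len(ys)
    total = sum((y - y_bar) ** 2 for y in ys)                # total variation
    explained = sum((b0 + b1 * x - y_bar) ** 2 for x in xs)  # explained by the line
    return explained / total

# Made-up illustrative data (NOT the data from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(r_squared(xs, ys), 3))
```

Here the total variation is 6 and the explained variation is 3.6, so 𝑟² = 0.6: about 60% of the variation in 𝑦 is explained by the line for this made-up sample.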
![Page 10: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/10.jpg)
• To conclude that correlation implies causality
• Using data based on averages
• This type of data causes an inflated correlation coefficient
• To conclude that if there is no linear correlation, there is no correlation at all
![Page 11: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/11.jpg)
Hypotheses:
𝐻₀: 𝜌 = 0 (no linear correlation)
𝐻₁: 𝜌 ≠ 0 (linear correlation exists)
These hypotheses can be equivalently tested with the following hypotheses:
𝐻₀: 𝛽₁ = 0 (no linear correlation)
𝐻₁: 𝛽₁ ≠ 0 (linear correlation exists)
Note: This equivalence will be important for the interpretation of the
technological output.
![Page 12: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/12.jpg)
• Use Critical Value from Table A-6 (this is a simpler approach) and think of the Linear Correlation Coefficient (𝑟) as a ‘test statistic’
• OR, use the following t-score test statistic with 𝑑𝑓 = 𝑛 − 2:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
• Most statistical software reports this 𝑡 test statistic as the test of significance for the slope of the regression line (i.e., the second set of equivalent hypotheses listed above)
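The t-score test statistic divides 𝑟 by the square root of (1 − 𝑟²)/(𝑛 − 2). A minimal Python sketch follows; the function name `t_statistic` is an assumption, and the inputs 𝑟 = 0.591 and 𝑛 = 5 are taken from the shoe print example later in these slides.

```python
import math

def t_statistic(r, n):
    """t-score for testing rho = 0, with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need n > 2 so that df = n - 2 is positive")
    if abs(r) >= 1:
        raise ValueError("t is undefined when |r| = 1")
    return r / math.sqrt((1 - r ** 2) / (n - 2))

t = t_statistic(0.591, 5)   # values from the shoe print example
print(round(t, 3))          # 1.269, matching the slides
```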
![Page 13: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/13.jpg)
• If using critical values from Table A-6:
If |𝑟| > critical value, reject 𝐻₀
If |𝑟| ≤ critical value, fail to reject 𝐻₀
Because the alternative hypothesis is 𝜌 ≠ 0, this is a two-tailed test. Visualize this by plotting the possible values for 𝑟 on a number line with a labeled critical region in each tail.
• If using the t-score test statistic:
• Use statistical software to calculate the correct p-value that corresponds with the test statistic. Then, base the conclusion on comparison between the p-value and 𝛼
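Both decision rules reduce to a single comparison. The sketch below is one way to express them for the two-tailed test, where the critical-value comparison is made on the absolute value of 𝑟; the function names are hypothetical, and the numbers plugged in (𝑟 = 0.591, critical value 0.878, p-value 0.2937, 𝛼 = 0.05) are those reported in the shoe print example later in these slides.

```python
def decide_by_critical_value(r, critical_value):
    """Two-tailed rule: reject H0 when |r| exceeds the table's critical value."""
    return "reject H0" if abs(r) > critical_value else "fail to reject H0"

def decide_by_p_value(p_value, alpha):
    """P-value rule: reject H0 when the p-value is at most alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

# Values reported in the shoe print example (n = 5, alpha = 0.05).
print(decide_by_critical_value(0.591, 0.878))  # fail to reject H0
print(decide_by_p_value(0.2937, 0.05))         # fail to reject H0
```

The two methods agree for that example, as they should, since they are two views of the same test.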
![Page 14: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/14.jpg)
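One-tailed tests can occur with a claim of a positive linear correlation or a claim of a negative linear correlation. In such cases, the null hypothesis remains 𝐻₀: 𝜌 = 0 and the alternative hypothesis becomes 𝐻₁: 𝜌 > 0 (claim of positive correlation, right-tailed test) or 𝐻₁: 𝜌 < 0 (claim of negative correlation, left-tailed test). For these one-tailed tests, the P-value method can be used as well.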
![Page 15: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/15.jpg)
• Construct a scatter plot and verify that the pattern of the points is approximately a straight line pattern without outliers
• Assess the linear correlation between two variables of interest and create a regression equation
• Consider any effects of a pattern over time
• Perform a Residual Analysis:
• Construct a residual plot and verify that there is no pattern (other than a straight line pattern) and also verify that the residual plot does not become thicker or thinner
• Use a histogram, normal quantile plot, or Shapiro Wilk test of normality to confirm that the values of the residuals have a distribution that is approximately normal
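The residuals used in this analysis are the observed 𝑦 values minus the values predicted by the regression line. The sketch below computes them for made-up data with hypothetical helpers (`fit_line`, `residuals`); drawing the residual plot and running a normality check are left to whatever plotting or statistics tool is at hand.

```python
def fit_line(xs, ys):
    """Return (intercept b0, slope b1) of the least-squares regression line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def residuals(xs, ys):
    """Observed y minus predicted y for each pair."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Made-up illustrative data (NOT the data from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
res = residuals(xs, ys)

# Listing (x, residual) pairs is the raw material for a residual plot.
for x, e in zip(xs, res):
    print(x, round(e, 2))
```

Residuals from a least-squares line with an intercept always sum to zero (up to rounding), so a pattern or a changing spread across 𝑥, not the average level, is what the residual plot is checked for.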
![Page 16: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/16.jpg)
• Measurement Error – could be described as ‘explainable’ outliers
• Nonlinear Associations – ignoring possible nonlinear relationships
• Extrapolation – predicting far beyond the scope of our available data
![Page 17: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/17.jpg)
The paired shoe / height data from five males are listed below.
Using StatCrunch, find the value of the correlation coefficient r.
![Page 18: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/18.jpg)
Requirement Check:
The data are a simple random sample of quantitative data, the plotted
points appear to roughly approximate a straight-line pattern, and there
are no outliers.
![Page 19: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/19.jpg)
A few technologies are displayed below, used to calculate the
value of r.
![Page 20: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/20.jpg)
We found previously for the shoe and height example that r = 0.591.
With r = 0.591, we get r² = 0.591² ≈ 0.349.
We conclude that about 34.9% of the variation in height can be
explained by the linear relationship between lengths of shoe prints and
heights.
![Page 21: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/21.jpg)
Conduct a formal hypothesis test of the claim that there is a linear
correlation between the two variables.
Use a 0.05 significance level.
![Page 22: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/22.jpg)
We test the claim:
𝐻₀: 𝛽₁ = 0 (no linear correlation)
𝐻₁: 𝛽₁ ≠ 0 (linear correlation exists)
![Page 23: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/23.jpg)
We calculate the test statistic:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.591}{\sqrt{\dfrac{1 - 0.591^2}{5 - 2}}} = 1.269$$
Table A-3 shows this test statistic yields a p-value that is greater than 0.20.
![Page 24: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/24.jpg)
StatCrunch provides a P-value of 0.2937. Because the p-value of 0.2937 is greater than the significance level of 0.05, we fail to reject the null hypothesis. We conclude there is not sufficient evidence to support the claim that there is a linear correlation between shoe print length and heights of males.
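The reported p-value can be approximately reproduced without statistical software by integrating the Student t density numerically. The sketch below is one such approach (Simpson's rule with the standard library only); it is not how StatCrunch computes the value, and the function names are assumptions. Because it starts from the rounded 𝑟 = 0.591, the result agrees with 0.2937 only to about three decimal places.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=10_000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # P(0 <= T <= |t|)
    return 1 - 2 * area           # both tails, by symmetry

r, n = 0.591, 5                   # values from the shoe print example
t = r / math.sqrt((1 - r ** 2) / (n - 2))
p = two_tailed_p(t, n - 2)
print(round(t, 3), round(p, 2))   # t is about 1.269, p is about 0.29
```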
![Page 25: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/25.jpg)
Alternatively, treat r = 0.591 as the test statistic. The critical values of r = ±0.878 are found in Table A-6 with n = 5 and α = 0.05. Because |0.591| < 0.878, we fail to reject the null hypothesis and conclude there is not sufficient evidence to support the claim that there is a linear correlation between shoe print length and heights of males.
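The critical values of 𝑟 are tied to the t test: solving 𝑡 = 𝑟 / √((1 − 𝑟²)/(𝑛 − 2)) for 𝑟 gives 𝑟 = 𝑡 / √(𝑡² + 𝑛 − 2). The sketch below applies that relation. The critical t value 3.182 (two-tailed, 𝛼 = 0.05, df = 3) is a standard t-table entry assumed here and does not appear in the slides; the function name `critical_r` is likewise hypothetical.

```python
import math

def critical_r(t_critical, n):
    """Critical value of r corresponding to a critical t value with df = n - 2."""
    df = n - 2
    return t_critical / math.sqrt(t_critical ** 2 + df)

# Assumed standard t-table entry: two-tailed alpha = 0.05, df = 3.
T_CRITICAL_DF3 = 3.182

r_crit = critical_r(T_CRITICAL_DF3, 5)
print(round(r_crit, 3))  # 0.878, consistent with the Table A-6 value quoted above
```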
![Page 26: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/26.jpg)
Use the 5 pairs of shoe print lengths and heights to predict the height of a person with a shoe print length of 29 cm.
The regression line does not fit the points well. The correlation is r = 0.591, and the p-value of 0.2937 indicates there is not sufficient evidence of a linear correlation.
The best predicted height is therefore simply the mean of the sample heights. From StatCrunch,
$\bar{y} = 177.3$ cm
![Page 27: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/27.jpg)
Use the 40 pairs of shoe print lengths and heights from Data Set 2 in Appendix B to predict the height of a person with a shoe print length of 29 cm. Now the regression line does fit the points well, and the correlation of r = 0.813 indicates a linear correlation, since the p-value is less than 0.0001.
![Page 28: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/28.jpg)
The regression equation and scatterplot are shown below:
![Page 29: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/29.jpg)
The given shoe print length of 29 cm is not beyond the scope of the available data, so substitute 29 cm into the regression model. Using StatCrunch,
$$\hat{y} = 80.9 + 3.22x = 80.9 + 3.22(29) = 174.3\ \text{cm}$$
A person with a shoe print length of 29 cm is predicted to be 174.3 cm tall.
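The two prediction cases follow one rule: use the regression equation when the linear correlation is significant, otherwise fall back on the sample mean of 𝑦. The sketch below encodes that rule with the numbers reported in the slides (mean height 177.3 cm and p-value 0.2937 for the 5 pairs; intercept 80.9, slope 3.22, and p-value below 0.0001 for the 40 pairs). The function name and parameters are assumptions, the mean height of the 40-pair sample is not given in the slides so `None` stands in for it, and the check that 𝑥 lies within the scope of the data is not modeled.

```python
def predict_height(shoe_print_cm, p_value, alpha, intercept=None, slope=None, y_bar=None):
    """Best predicted height: regression line if the correlation is significant, else the mean."""
    if p_value <= alpha:
        if intercept is None or slope is None:
            raise ValueError("regression coefficients are required when the test is significant")
        return intercept + slope * shoe_print_cm
    if y_bar is None:
        raise ValueError("the sample mean is required when the test is not significant")
    return y_bar

# 5 pairs: p-value 0.2937 > 0.05, so the prediction is the sample mean.
small_sample = predict_height(29, p_value=0.2937, alpha=0.05, y_bar=177.3)

# 40 pairs: p-value < 0.0001 (0.0001 used as an upper bound), so use y-hat = 80.9 + 3.22x.
large_sample = predict_height(29, p_value=0.0001, alpha=0.05, intercept=80.9, slope=3.22)

print(small_sample, round(large_sample, 1))  # 177.3 174.3
```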
![Page 30: Correlation in Statistics](https://reader034.fdocuments.us/reader034/viewer/2022052418/587087961a28ab57368b805b/html5/thumbnails/30.jpg)
What if we have two or more explanatory variables? Do we have a method for this? YES!! Of course, we do. We may want to predict a sea turtle’s lifespan by more variables than simply length of shell! I want to use variables that account for the diet, exercise, mental health, and captivity status of the turtle! What about variables that also account for the water quality in the turtle’s surrounding environment? I want that too!!! To do this, we can use something known as Multiple Linear Regression!