Correlation
Covariance
The variance of a variable X provides information on the variability of X.
The covariance of two variables X and Y provides information on the related variability of X and Y together.
Note the similarity in the structure of the formulas: instead of relating X to itself, as the variance does, the covariance relates X to the other variable Y.
$$\text{Sample Variance} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(X_i-\bar{X})}{n-1}$$
$$\text{Sample Covariance} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}$$
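As an illustration of the two formulas, here is a minimal Python sketch. The function names and the sample data are hypothetical, chosen only for this example; they do not come from the slides.

```python
def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    return sum((x - mean_x) * (x - mean_x) for x in xs) / (n - 1)


def sample_covariance(xs, ys):
    """Sum of cross-products of deviations, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

print(sample_variance(xs))        # variance of X
print(sample_covariance(xs, ys))  # covariance of X and Y
print(sample_covariance(xs, xs))  # covariance of X with itself equals its variance
```

The last line shows the structural similarity directly: relating X to itself in the covariance formula gives back the variance.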
Excel Note on Covariance
There is an Excel function to calculate covariance:
=COVAR(range1, range2)
Unfortunately for most common purposes, this function does not calculate the sample covariance. Instead it calculates what is known as the population covariance, which divides by n rather than n-1:
$$\text{Population Covariance} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{n}$$
Therefore, to transform Excel’s covariance calculation into the more useful sample covariance, it is necessary to multiply Excel’s result by the factor n/(n-1).
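The n/(n-1) adjustment can be checked with a short Python sketch. It assumes a hypothetical `population_covariance` helper that divides by n, mirroring the population formula; the names and data are illustrative only.

```python
def population_covariance(xs, ys):
    """Sum of cross-products of deviations, divided by n (not n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def to_sample_covariance(pop_cov, n):
    """Rescale a population covariance to a sample covariance."""
    return pop_cov * n / (n - 1)


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

pop_cov = population_covariance(xs, ys)
print(pop_cov)                                 # divides by n = 5
print(to_sample_covariance(pop_cov, len(xs)))  # equivalent to dividing by n - 1 = 4
```

For small samples the factor matters (here 5/4); as n grows, n/(n-1) approaches 1 and the two versions converge.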
Covariance
Covariance measures how much Y and X tend to vary in the same direction.
A high positive covariance means the highest values of Y tend to occur along with the highest values of X.
However, covariance is hard to interpret because it has no standard scale of reference: its size depends on the units of X and Y. A covariance of 300,000 could be trivial, while another of 2.1 could be fairly substantial.
Correlation Coefficient
A more useful way to express this relationship between X and Y is as a proportion of the product of the standard deviations of X and Y. This proportion is known as the “standardized” covariance, or the correlation coefficient (correlation for short), and is commonly denoted by r.
$$r = \frac{\text{Sample covariance between } X \text{ and } Y}{S_X \, S_Y}$$
In Excel, the correlation formula is =CORREL(range1, range2)
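A minimal Python sketch of the standardization, assuming sample (n-1) versions of both the covariance and the standard deviations. The function name and data are hypothetical.

```python
import math


def correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov_xy / (s_x * s_y)


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
print(round(correlation(xs, ys), 3))

# Changing the units of X (multiplying by 1000) changes the covariance
# but leaves the correlation unchanged.
print(round(correlation([1000 * x for x in xs], ys), 3))
```

The second call is the point of standardizing: unlike the covariance, r does not depend on the units in which X and Y are measured.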
Correlation Coefficient
The correlation coefficient (r) measures how much Y and X tend to vary in the same direction on a standard scale. (Varying in the same direction is implicitly a linear relationship.)
It will always be between -1 and +1:
r = +1 implies a perfect positive relationship
r = -1 implies a perfect negative relationship
r = 0 implies no linear relationship exists!
How Much is Enough?
Since it is unlikely that any real social data will have either a perfect positive correlation (r = 1) or a perfect negative correlation (r = -1), how does an analyst know if there is “enough” correlation?
A simple rule of thumb is that a correlation whose absolute value is below 30% (|r| < 0.3) suggests no linear relationship, whereas one above 70% (|r| > 0.7) suggests a strong linear relationship. Everything in between is, say, “somewhat of a relationship”.
Examples
A T-test for formalizing “good enough”
The hypotheses are:
H0: correlation = 0
versus
H1: correlation ≠ 0
Approximate the standard error using the formula:
$$s_r = \sqrt{\frac{1-r^2}{n-2}}$$
Calculate the t-statistic, with n-2 degrees of freedom. The formula is:
$$t_{\text{statistic}} = \frac{r}{s_r}$$
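The two formulas can be wrapped in a small Python helper. This is a sketch only: the function name is hypothetical, and the resulting statistic still has to be compared against a critical value from a t-table with n-2 degrees of freedom.

```python
import math


def correlation_t_statistic(r, n):
    """Return (s_r, t) for testing H0: correlation = 0 with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need more than 2 observations")
    if abs(r) >= 1:
        raise ValueError("the standard error is zero when |r| = 1")
    s_r = math.sqrt((1 - r * r) / (n - 2))  # approximate standard error of r
    t = r / s_r
    return s_r, t


# Hypothetical inputs, for illustration only: r = 0.5 from a sample of 27.
s_r, t = correlation_t_statistic(0.5, 27)
print(round(s_r, 4), round(t, 4))
```

Note that for a fixed r the statistic grows with n, so the same correlation can be insignificant in a small sample and significant in a large one.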
Example 1
Suppose for a sample of size 20, the sample covariance between two variables X and Y is 87, the sample variance of X is 100 and the sample variance of Y is 400. Is there a statistically significant linear relationship?
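Example 1 Calculations
$$r = \frac{87}{\sqrt{100}\,\sqrt{400}} = \frac{87}{10 \times 20} = 0.435$$
$$s_r = \sqrt{\frac{1-0.435^2}{20-2}} = \sqrt{\frac{0.811}{18}} = 0.212$$
$$t_{\text{statistic}} = \frac{0.435}{0.212} = 2.05$$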
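The t-statistic falls between the two-tailed critical values for 18 degrees of freedom, 1.734 (10% level) and 2.101 (5% level), so a linear relationship between the two variables is statistically significant at the 10% level but not at the 5% level.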
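The arithmetic of Example 1 can be reproduced with a few lines of Python. The inputs and the two critical values (1.734 and 2.101) are taken from the example; the variable names are illustrative. Carried at full precision, the standard error is about 0.212 and the t-statistic about 2.05, which lies between the two critical values.

```python
import math

# Inputs from Example 1.
n = 20
cov_xy = 87.0
var_x = 100.0
var_y = 400.0

# Correlation: covariance over the product of the standard deviations.
r = cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))

# Approximate standard error and t-statistic with n - 2 = 18 degrees of freedom.
s_r = math.sqrt((1 - r ** 2) / (n - 2))
t = r / s_r

# Two-tailed critical values for 18 degrees of freedom, as quoted in the example.
T_CRIT_10_PERCENT = 1.734
T_CRIT_5_PERCENT = 2.101

significant_at_10 = abs(t) > T_CRIT_10_PERCENT
significant_at_5 = abs(t) > T_CRIT_5_PERCENT

print(round(r, 3), round(s_r, 3), round(t, 2))
print(significant_at_10, significant_at_5)
```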
Example 2 – Readership Data
Excel’s Data Analysis Correlation
Correlation is most useful for quickly considering possible relationships between many different variables.
Suppose, for example, that the analysis is examining 10 different variables: X1 … X10.
Using Excel’s CORREL function would require entering 45 separate calculations, one for each pair of variables (10 × 9 / 2 = 45).
A better exploratory (one-time) way is to use Excel’s built-in Data Analysis ToolPak.
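Outside Excel, the same all-pairs idea can be sketched in Python. The code below is an illustration, not a reproduction of the ToolPak: the function names, the dictionary layout, and the three short columns of data are all hypothetical.

```python
import math
from itertools import combinations


def correlation(xs, ys):
    """Correlation coefficient; the (n - 1) factors cancel, so raw sums suffice."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Correlate every pair of named columns; returns a nested dict."""
    names = list(columns)
    matrix = {a: {b: 1.0 if a == b else None for b in names} for a in names}
    for a, b in combinations(names, 2):
        r = correlation(columns[a], columns[b])
        matrix[a][b] = r
        matrix[b][a] = r  # the matrix is symmetric
    return matrix


# Ten variables give 45 distinct pairs.
pair_count = len(list(combinations(range(10), 2)))
print(pair_count)

# Hypothetical data, for illustration only.
columns = {
    "X1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "X2": [2.0, 4.0, 5.0, 4.0, 5.0],
    "X3": [5.0, 4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(columns)
for name in columns:
    print(name, [round(matrix[name][other], 3) for other in columns])
```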
Correlation Toolpak Example
Complete the dialog box
Leads to the results
Measuring Nonlinear Behavior
The correlation coefficient measures linearity.
If there is a nonlinear relationship, r will underestimate the predictive power of the relationship between the two variables.
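A small Python sketch makes the point concrete. With the hypothetical data below, Y is determined exactly by X (Y = X squared), yet the correlation coefficient is zero because the relationship is not linear.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient computed from sums of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: X symmetric around zero, Y an exact function of X.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x * x for x in xs]

r = correlation(xs, ys)
print(r)  # zero linear correlation despite a perfect nonlinear relationship
```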
Rank Correlation
Rank correlation measures how two variables are related in a more general way.
A high rank correlation says that large values of X tend to occur with large values of Y, and small with small, whether or not the relationship is linear.
Generally, this type of correlation test might be applied to data that are highly skewed, in other words, data with a significant number of very extreme values.
Rank Correlation Computation
Compute the ranks for the set of X values, then for the Y values, low to high.
Compute the differences of the ranks and the squares of the differences.
The statistic then is:
$$r = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n^3 - n}$$
For simplicity, if some values are tied, interpolate the ranks (give each tied value the average of the ranks the tied values would otherwise occupy) and use the formula above. Technically, when there are ties, the previous correlation calculation should be applied to the rankings instead of the formula above.
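The ranking steps and the rank correlation formula can be sketched in Python as follows. The function names and data are hypothetical, and ties are handled with the simple averaged-rank shortcut rather than the technically correct approach of correlating the rankings.

```python
def average_ranks(values):
    """Rank values low to high (1-based); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def rank_correlation(xs, ys):
    """Rank correlation: 1 - 6 * sum(d_i^2) / (n^3 - n)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    rank_x = average_ranks(xs)
    rank_y = average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
cubes = [x ** 3 for x in xs]  # nonlinear but strictly increasing in X
print(rank_correlation(xs, cubes))

shuffled = [3.0, 1.0, 4.0, 2.0, 5.0]
print(rank_correlation(xs, shuffled))

print(average_ranks([10, 20, 20, 30]))  # the two tied values share one rank
```

Because only the ordering matters, the strictly increasing but nonlinear series gets a perfect rank correlation.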
Re-use the T-test for “good enough”
The hypotheses are:
H0: correlation = 0
versus
H1: correlation ≠ 0
Approximate the standard error using the formula:
$$s_r = \sqrt{\frac{1-r^2}{n-2}}$$
Calculate the t-statistic, with n-2 degrees of freedom. The formula is:
$$t_{\text{statistic}} = \frac{r}{s_r}$$
Example 3