Correlation Review and Extension. Questions to be asked… Is there a linear relationship between x...
Correlation
Review and Extension
Questions to be asked…
Is there a linear relationship between x and y?
What is the strength of this relationship?
  Pearson Product Moment Correlation Coefficient (r)
Can we describe this relationship and use it to predict y from x?
  y = bx + a
Is the relationship we have described statistically significant?
  Not a very interesting question if tested against a null of r = 0
Other stuff
Check scatterplots to see whether a Pearson r makes sense
Use both r and r² to understand the situation
If data are non-metric or non-normal, use “non-parametric” correlations
Correlation does not prove causation
  True relationship may be in the opposite direction, co-causal, or due to other variables
However, correlation is the primary statistic used in making an assessment of causality
  ‘Potential’ causation
Possible outcomes
-1 to +1
As one variable increases/decreases, the other variable increases/decreases
  Positive covariance
As one variable increases/decreases, another decreases/increases
  Negative covariance
No relationship (independence)
  r = 0
Non-linear relationship
Covariance
The variance shared by two variables
When X and Y move in the same direction (i.e. their deviations from the mean are similarly positive or negative), cov(x,y) is positive.
When X and Y move in opposite directions, cov(x,y) is negative.
When there is no consistent relationship, cov(x,y) = 0
\[ \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]
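As a rough illustration of the covariance formula, here is a minimal Python sketch. The function name `covariance` and the hours/score numbers are made up for this example and do not come from the slides. It also shows how simply changing the unit of x changes the covariance.

```python
from statistics import mean

def covariance(x, y):
    """Sample covariance: sum of cross-products of deviations over n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    x_bar, y_bar = mean(x), mean(y)
    cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross_products / (len(x) - 1)

# Made-up illustrative data (an assumption, not from the slides).
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

cov_hours = covariance(hours, score)

# Same data with x re-expressed in minutes: the covariance is 60 times
# larger even though the relationship itself has not changed.
cov_minutes = covariance([h * 60 for h in hours], score)

print(cov_hours, cov_minutes)
```

The sign tells us the direction of the relationship, but the magnitude depends on the units of measurement.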
Covariance is not very meaningful on its own and cannot be compared across different scales of measurement
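Solution: standardize this measure
Pearson’s r:
\[ r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x s_y} \]
\[ \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]
\[ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y} \]
\[ -1 \le r \le 1 \]
\[ r_{xy} = \frac{\sum_{i=1}^{n} Z_{x_i} Z_{y_i}}{n - 1} \]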
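The two routes to Pearson's r (covariance divided by the product of the standard deviations, and the averaged product of z-scores) can be checked against each other with a short sketch. The function names and data are assumptions made for illustration, and n - 1 is used throughout.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson r as cov(x, y) / (s_x * s_y), using n - 1 throughout."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))

def pearson_r_from_z(x, y):
    """Pearson r as the sum of z-score products divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    z_x = [(xi - x_bar) / s_x for xi in x]
    z_y = [(yi - y_bar) / s_y for yi in y]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# Made-up illustrative data (an assumption, not from the slides).
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

r = pearson_r(hours, score)
r_z = pearson_r_from_z(hours, score)

# Unlike the covariance, r is unchanged when x is rescaled to minutes.
r_minutes = pearson_r([h * 60 for h in hours], score)

print(round(r, 4), round(r_z, 4), round(r_minutes, 4))
```

Standardizing is what makes r comparable across different scales of measurement.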
Factors affecting Pearson r
Linearity
Heterogeneous subsamples
Range restrictions
Outliers
Linearity
Nonlinear relationships will have an adverse effect on a measure designed to find a linear relationship
[Figure: two scatterplots of VAR00002 against VAR00001, one fitted with an LLR smoother and one with a linear regression line]
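A small sketch of the linearity problem, using made-up data in which y is completely determined by x but the relationship is a symmetric curve. The helper `pearson_r` is written for this example only.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson r as cov(x, y) / (s_x * s_y), using n - 1 throughout."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))

# Made-up data: y depends perfectly on x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r_curve = pearson_r(x, y)
print(r_curve)
```

Here r is zero even though knowing x tells us y exactly, which is why the scatterplot should be checked before r is interpreted.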
Heterogeneous subsamples
Sub-samples may artificially increase or decrease overall r.
Solution - calculate r separately for sub-samples & overall, look for differences
[Figure: scatterplot of Miles per Gallon against Engine Displacement (cu. inches), with points marked by Country of Origin (American, European, Japanese)]
Correlations between Miles per Gallon and Engine Displacement (cu. inches), overall and by Country of Origin

Country of Origin    Pearson Correlation    Sig. (2-tailed)    N
Overall              -.789                  not shown          not shown
American             -.836**                .000               248
European             -.510**                .000               70
Japanese             -.366**                .001               79

** Correlation is significant at the 0.01 level (2-tailed).
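The suggested solution (calculate r separately for sub-samples and overall) might look like the following sketch. The two groups and their values are invented to make the effect easy to see and are not the car data above. In this deliberately extreme case the overall r even has the opposite sign from the r within each group.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson r as cov(x, y) / (s_x * s_y), using n - 1 throughout."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical sub-samples (made-up values): within each group y rises
# with x, but group B sits at higher x and much lower y than group A.
groups = {
    "A": ([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]),
    "B": ([6, 7, 8, 9, 10], [1, 2, 3, 4, 5]),
}

r_by_group = {name: pearson_r(x, y) for name, (x, y) in groups.items()}

all_x = [xi for x, _ in groups.values() for xi in x]
all_y = [yi for _, y in groups.values() for yi in y]
r_overall = pearson_r(all_x, all_y)

print(r_by_group, round(r_overall, 3))
```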
Range restriction
Limiting the variability of your data can in turn limit the possibility for covariability between two variables, thus attenuating r.
A common example occurs with Likert scales
  E.g. 1 - 4 vs. 1 - 9
However, restricting the range can also increase r if doing so keeps highly influential data points out
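A simulation sketch of attenuation under range restriction. Everything here is an assumption made for illustration: the data are simulated (y is x plus noise), the seed is arbitrary, and "restriction" means keeping only cases with x close to its mean.

```python
import random
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson r as cov(x, y) / (s_x * s_y), using n - 1 throughout."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))

random.seed(1)  # arbitrary seed so the run is repeatable

# Simulated data: y is x plus independent noise of the same size.
x = [random.gauss(0, 1) for _ in range(2000)]
y = [xi + random.gauss(0, 1) for xi in x]
r_full = pearson_r(x, y)

# Restrict the range: keep only cases whose x lies within 0.5 of zero.
kept = [(xi, yi) for xi, yi in zip(x, y) if abs(xi) < 0.5]
kept_x, kept_y = zip(*kept)
r_restricted = pearson_r(kept_x, kept_y)

print(round(r_full, 2), round(r_restricted, 2))
```

The underlying relationship is identical in both cases; only the variability available in x has changed.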
Effect of Outliers
Outliers can artificially and dramatically increase or decrease r
Options
  Compute r with and without outliers
  Conduct robustified R!
    For example, recode outliers as having more conservative scores (winsorize)
  Transform variables
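The first option, computing r with and without the outlier, can be sketched as follows. The ten base points and the single extreme point are made up for illustration.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson r as cov(x, y) / (s_x * s_y), using n - 1 throughout."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))

# Made-up data with essentially no linear relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5, 3, 6, 2, 7, 4, 6, 3, 5, 4]
r_without = pearson_r(x, y)

# Add one extreme case that is far out on both variables.
r_with = pearson_r(x + [30], y + [30])

print(round(r_without, 3), round(r_with, 3))
```

A single point moves r from roughly zero to a very strong positive value, so reporting both figures makes the outlier's influence visible.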
What else?
r is the starting point for any regression and related method
Both the slope and the magnitude of the residuals are reflective of r
  r = 0 means slope = 0
As such, a lone r provides little more than a starting point for understanding the relationship between two variables
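To make the link between r and the regression line concrete, the standard least-squares results for y = bx + a can be written in terms of r. These are textbook identities added here for reference rather than something taken from the slides; SS denotes a sum of squared deviations.

```latex
\[ b = r \, \frac{s_y}{s_x}, \qquad a = \bar{y} - b\,\bar{x} \]
\[ SS_{\text{residual}} = (1 - r^2)\, SS_y, \qquad SS_y = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
```

When r = 0 the slope is 0 and the residuals are as large as they can be; as |r| approaches 1 the residual sum of squares shrinks toward 0.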
Robust Approaches to Correlation
Rank approaches
Winsorized
Percentage Bend
Rank approaches: Spearman’s rho and Kendall’s tau
Spearman’s rho is calculated using the same formula as Pearson’s r, but with the variables in the form of ranks
  Simply rank the data available
  X = 10 15 5 35 25 becomes X = 2 3 1 5 4
  Do this for X and Y and calculate r as normal
Kendall’s tau is another rank-based approach, but the details of its calculation are different
For theoretical reasons it may be preferable to Spearman’s, but both should be consistent for the most part and perform better than Pearson’s r when dealing with non-normal data
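The ranking recipe can be sketched in a few lines. The X values are the ones from the example above; the Y values are made up. The Kendall function is the simple tau-a form (concordant minus discordant pairs over all pairs), which assumes no ties and is only one of several tau variants.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson r as cov(x, y) / (s_x * s_y), using n - 1 throughout."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))

def ranks(values):
    """Rank from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho: Pearson r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)

x = [10, 15, 5, 35, 25]   # X from the slide example
y = [1, 4, 2, 100, 9]     # made-up Y with one extreme value

x_ranks = ranks(x)
rho = spearman_rho(x, y)
tau = kendall_tau_a(x, y)
print(x_ranks, round(rho, 3), round(tau, 3))
```

Because only the ordering matters, the extreme Y value of 100 counts for no more than "largest".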
Winsorized Correlation
As mentioned before, Winsorizing data involves changing some decided-upon percentage of extreme scores (high and low) to the value of the most extreme score that is not Winsorized
  X = 1 2 3 4 5 6 becomes X = 2 2 3 4 5 5
Winsorize both X and Y values (without regard to each other) and compute Pearson’s r
This has an advantage over rank-based approaches in that the scale of measurement remains unchanged
For theoretical reasons (recall some of our earlier discussion regarding the standard error for trimmed means) a Winsorized correlation would be preferable to trimming
  Though trimming is preferable for group comparisons
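A minimal sketch of the Winsorized correlation. The 20% default proportion is an assumption chosen so that the six-value example replaces one score in each tail; the Y values are made up.

```python
from math import floor
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson r as cov(x, y) / (s_x * s_y), using n - 1 throughout."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))

def winsorize(values, proportion=0.2):
    """Pull the g lowest and g highest scores in to the nearest kept score."""
    n = len(values)
    g = floor(proportion * n)  # number of scores replaced in each tail
    ordered = sorted(values)
    low, high = ordered[g], ordered[n - g - 1]
    return [min(max(v, low), high) for v in values]

def winsorized_r(x, y, proportion=0.2):
    """Winsorize X and Y separately, then compute Pearson r."""
    return pearson_r(winsorize(x, proportion), winsorize(y, proportion))

x = [1, 2, 3, 4, 5, 6]    # X from the slide example
y = [2, 1, 4, 3, 6, 60]   # made-up Y with one extreme value

x_w = winsorize(x)
y_w = winsorize(y)
r_w = winsorized_r(x, y)
print(x_w, y_w, round(r_w, 3))
```

Each variable is Winsorized on its own, so a case's position in the order is kept.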
Methods Related to M-estimators
The percentage bend correlation utilizes the median and a generalization of MAD
A criticism of the Winsorized correlation is that the amount of Winsorizing is fixed in advance rather than determined by the data; the percentage bend correlation (r_pb) gets around that
While the details can get a bit technical, you can get some sense of what is going on by relying on what you know regarding the robust approach in general
With independent X and Y variables, the values of robust approaches to correlation will match the Pearson r
With nonnormal data, the robust approaches described guard against outliers on the respective X and Y variables while Pearson’s r does not
Problem
While these alternative methods help us in some sense, an issue remains
When dealing with correlation, we are not considering the variables in isolation
An outlier on one variable or the other might not be a bivariate outlier
Conversely, a bivariate outlier may not contain values that are outliers for X or Y themselves
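One way to see the second point is to compare each case's z-scores on X and Y with its distance from the centre of the point cloud once the correlation is taken into account. The sketch below uses the classical squared Mahalanobis distance for that purpose; this is a technique swapped in for illustration, not one named in the slides, and the data are made up.

```python
from statistics import mean, stdev

def mahalanobis_sq(x, y):
    """Squared Mahalanobis distance of each (x, y) pair from the centroid."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    var_x, var_y = stdev(x) ** 2, stdev(y) ** 2
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    det = var_x * var_y - cov ** 2  # determinant of the covariance matrix
    distances = []
    for xi, yi in zip(x, y):
        dx, dy = xi - x_bar, yi - y_bar
        distances.append((var_y * dx * dx - 2 * cov * dx * dy + var_x * dy * dy) / det)
    return distances

# Made-up data: ten points close to the line y = x, plus a final point
# (2, 9) whose x and y values are each well inside the observed range.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2]
y = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5, 7.5, 7.5, 9.5, 9.5, 9]

z_x_last = (x[-1] - mean(x)) / stdev(x)
z_y_last = (y[-1] - mean(y)) / stdev(y)

d2 = mahalanobis_sq(x, y)
most_extreme = d2.index(max(d2))

print(round(z_x_last, 2), round(z_y_last, 2), most_extreme)
```

The last point is unremarkable on X alone and on Y alone, yet it is the most extreme case once the two variables are considered together.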
Global measures of association
Measures are available that take into account the bivariate nature of the situation
  Minimum Volume Ellipsoid Estimator (MVE)
  Minimum Covariance Determinant Estimator (MCD)
Minimum Volume Ellipsoid Estimator
Robust elliptic plot (relplot)
  Relplots are like scatterplot boxplots for our data, where the inner ellipse contains half the values and anything outside the dotted ellipse would be considered an outlier
A strategy for robust estimation of correlation would be to find the ellipse with the smallest area that contains half the data points
Those points are then used to calculate the correlation
  The MVE
Minimum Covariance Determinant Estimator
The MCD is another alternative we might use. It involves the notion of a generalized variance, which is a measure of the overall variability among a cloud of points
  For the more adventurous, see my /6810 page for info on matrices and their determinants
  The determinant of the covariance matrix is the generalized variance
For the two-variable situation:
\[ s_g^2 = s_x^2 s_y^2 (1 - r^2) \]
As we can see, because r is a measure of linear association, the more tightly the points are packed around a line, the larger r is and consequently the smaller the generalized variance is
The MCD picks the half of the data that produces the smallest generalized variance and calculates r from that
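For a handful of points the MCD idea can be tried by brute force: look at every half-sized subset, keep the one with the smallest generalized variance, and compute r from it. The sketch below does exactly that. It is only an illustration: real implementations use much faster search algorithms, the rule for "half" (rounding up) is an assumption, and the data are made up.

```python
from itertools import combinations
from math import inf, sqrt
from statistics import mean

def moments(points):
    """Return (var_x, var_y, cov_xy) for (x, y) pairs, using n - 1."""
    n = len(points)
    x_bar = mean([p[0] for p in points])
    y_bar = mean([p[1] for p in points])
    var_x = sum((p[0] - x_bar) ** 2 for p in points) / (n - 1)
    var_y = sum((p[1] - y_bar) ** 2 for p in points) / (n - 1)
    cov = sum((p[0] - x_bar) * (p[1] - y_bar) for p in points) / (n - 1)
    return var_x, var_y, cov

def generalized_variance(points):
    """Determinant of the 2 x 2 covariance matrix: s_x^2 s_y^2 (1 - r^2)."""
    var_x, var_y, cov = moments(points)
    if var_x == 0 or var_y == 0:
        return inf  # skip degenerate subsets where r is undefined
    return var_x * var_y - cov ** 2

def r_from(points):
    """Pearson r for a collection of (x, y) pairs."""
    var_x, var_y, cov = moments(points)
    return cov / sqrt(var_x * var_y)

def brute_force_mcd_r(x, y):
    """r from the half of the data with the smallest generalized variance."""
    points = list(zip(x, y))
    half = (len(points) + 1) // 2  # assumption: round the half up
    best = min(combinations(points, half), key=generalized_variance)
    return r_from(best)

# Made-up data: eight points near a line plus two discordant points.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1, 1.0, 0.0]

r_pearson = r_from(list(zip(x, y)))
r_mcd = brute_force_mcd_r(x, y)
print(round(r_pearson, 3), round(r_mcd, 3))
```

The two discordant points drag the ordinary r down close to zero, while the r based on the tightest half of the data stays close to 1.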
Global measures of association
Note that both the MVE and MCD can be extended to situations with more than two variables
  We’d just be dealing with a larger matrix
Example using the Robust library in S-Plus
  OMG! Drop down menus even!
Remaining issues: Curvature
The fact is that straight lines may not capture the true story
We may often fail to find noticeable relationships because our r, whichever “Pearsonesque” method we choose, is trying to specify a linear relationship
There may still be a relationship, and a strong one, just more complex
Summary
Correlation, in terms of Pearson r, gives us a sense of the strength of a linear association between two variables
One data point can render it a useless measure, as it is not robust to outliers
Measures which are robust are available, and some take into account the bivariate nature of the data
However, curvilinear relationships may exist, and we should examine the data to see if alternative explanations are viable