Correlation Review and Extension. Questions to be asked… Is there a linear relationship between x...

Correlation

Review and Extension

Questions to be asked…

Is there a linear relationship between x and y?

What is the strength of this relationship?

Pearson Product Moment Correlation Coefficient (r)

Can we describe this relationship and use it to predict y from x?

y = bx + a

Is the relationship we have described statistically significant?

Not a very interesting question if tested against a null of r = 0

Other stuff

Check scatterplots to see whether a Pearson r makes sense

Use both r and r² to understand the situation

If data is non-metric or non-normal, use “non-parametric” correlations

Correlation does not prove causation

True relationship may be in the opposite direction, co-causal, or due to other variables

However, correlation is the primary statistic used in making an assessment of causality

‘Potential’ causation

Possible outcomes

r ranges from -1 to +1

As one variable increases/decreases, the other variable increases/decreases

Positive covariance

As one variable increases/decreases, another decreases/increases

Negative covariance

No relationship (independence)

r = 0

Non-linear relationship

Covariance

The variance shared by two variables

When X and Y move in the same direction (i.e. their deviations from the mean are similarly positive or negative), cov(x, y) is positive

When X and Y move in opposite directions, cov(x, y) is negative

When there is no constant relationship, cov(x, y) = 0

$$ \operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$

Covariance is not very meaningful on its own and cannot be compared across different scales of measurement

Solution: standardize this measure

Pearson’s r:

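$$ r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x s_y} $$

$$ \operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$

$$ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y} $$

$$ -1 \le r \le 1 $$

$$ r_{xy} = \frac{\sum_{i=1}^{n} Z_{x_i} Z_{y_i}}{n - 1} $$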
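The formulas above can be checked numerically. The following is a minimal sketch in plain Python, assuming sample statistics with n - 1 in the denominator throughout; the data values and function names are made up for illustration.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance: summed cross-products of deviations over n - 1."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)

def std_dev(values):
    """Sample standard deviation (the covariance of a variable with itself)."""
    return sqrt(covariance(values, values))

def pearson_r(x, y):
    """Covariance standardized by the two standard deviations."""
    return covariance(x, y) / (std_dev(x) * std_dev(y))

def pearson_r_from_z(x, y):
    """The same coefficient as summed z-score cross-products over n - 1."""
    mx, my, sx, sy = mean(x), mean(y), std_dev(x), std_dev(y)
    zx = [(xi - mx) / sx for xi in x]
    zy = [(yi - my) / sy for yi in y]
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)
r_xy = pearson_r(x, y)
r_z = pearson_r_from_z(x, y)
print(cov_xy, r_xy, r_z)
```

The two versions of r agree up to floating-point rounding, which is the point of the z-score form: r is just covariance computed on standardized variables.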

Factors affecting Pearson r

Linearity

Heterogeneous subsamples

Range restrictions

Outliers

Linearity

Nonlinear relationships will have an adverse effect on a measure designed to find a linear relationship

[Figure: two scatterplots of VAR00002 against VAR00001, one fitted with an LLR smoother and one with a linear regression line]

Heterogeneous subsamples

Sub-samples may artificially increase or decrease overall r.

Solution - calculate r separately for sub-samples & overall, look for differences

[Figure: scatterplot of Miles per Gallon against Engine Displacement (cu. inches), with points marked by Country of Origin: American, European, Japanese]

Correlations between Miles per Gallon and Engine Displacement (cu. inches), overall and by Country of Origin

Country of Origin    Pearson Correlation    Sig. (2-tailed)    N
Overall              -.789                  (not shown)        (not shown)
American             -.836**                .000               248
European             -.510**                .000               70
Japanese             -.366**                .001               79

** Correlation is significant at the 0.01 level (2-tailed).

N is the count for the correlation; Engine Displacement alone has N = 253 (American), 73 (European), and 79 (Japanese).
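The suggested check (r overall versus r within each sub-sample) can be sketched as follows. The data are made up and the group labels "A" and "B" are hypothetical; they are not the car data shown above.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from deviation cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up rows: (group label, x, y)
rows = [
    ("A", 300, 15), ("A", 350, 14), ("A", 400, 12), ("A", 450, 13),
    ("B", 100, 30), ("B", 120, 32), ("B", 140, 29), ("B", 160, 31),
]

# r computed on everything pooled together
overall = pearson_r([row[1] for row in rows], [row[2] for row in rows])

# r computed separately within each sub-sample
by_group = {}
for label in sorted({row[0] for row in rows}):
    xs = [row[1] for row in rows if row[0] == label]
    ys = [row[2] for row in rows if row[0] == label]
    by_group[label] = pearson_r(xs, ys)

print(overall, by_group)
```

Here the pooled r is strongly negative mostly because the two groups sit in different regions of the plot; within group B there is essentially no linear relationship at all.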

Range restriction

Limiting the variability of your data can in turn limit the possibility for covariability between two variables, thus attenuating r.

A common example occurs with Likert scales

E.g. 1 - 4 vs. 1 - 9

However, restricting the range can actually increase r if doing so keeps out highly influential data points
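A small sketch of the attenuation effect, using made-up scores on a 1 - 9 scale and then keeping only the cases with x between 1 and 4. The data are constructed for illustration and do not come from any real scale.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from deviation cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up data: y follows x with the same amount of noise everywhere
pairs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5), (7, 8), (8, 7), (9, 10)]

r_full = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])

# Restrict the range of x to 1 - 4 and recompute
restricted = [p for p in pairs if p[0] <= 4]
r_restricted = pearson_r([p[0] for p in restricted], [p[1] for p in restricted])

print(r_full, r_restricted)
```

The noise is the same size in both cases, but with less variability in x there is less room for covariability, so the restricted r comes out smaller.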

Effect of Outliers

Outliers can artificially and dramatically increase or decrease r

Options

Compute r with and without outliers

Conduct robustified R!

For example, recode outliers as having more conservative scores (winsorize)

Transform variables
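The first option, computing r with and without a suspect point, might look like the sketch below. The data are made up, and the last pair is deliberately extreme on both variables.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from deviation cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up data; the final case (20, 20) is the outlier
x = [1, 2, 3, 4, 5, 20]
y = [3, 1, 4, 2, 3, 20]

r_with = pearson_r(x, y)
r_without = pearson_r(x[:-1], y[:-1])

print(r_with, r_without)
```

One case moves r from a weak value to a very strong one, which is why reporting both versions is worth the extra line of output.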

What else?

r is the starting point for any regression and related method

Both the slope and the magnitude of the residuals are reflective of r

r = 0, slope = 0

As such a lone r doesn’t really provide much more than a starting point for understanding the relationship between two variables
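To make the link to regression concrete, here is a minimal sketch using the standard least-squares identities b = r(s_y / s_x) and a = mean(y) - b * mean(x) for the line y = bx + a. The data are made up.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sum_cross(x, y):
    """Sum of cross-products of deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = sum_cross(x, y) / sqrt(sum_cross(x, x) * sum_cross(y, y))
s_x = sqrt(sum_cross(x, x) / (len(x) - 1))
s_y = sqrt(sum_cross(y, y) / (len(y) - 1))

# Slope and intercept of y = bx + a, written in terms of r
b = r * s_y / s_x
a = mean(y) - b * mean(x)

# Residual variability as a share of total variability in y
residuals = [yi - (b * xi + a) for xi, yi in zip(x, y)]
ss_residual = sum(e ** 2 for e in residuals)
ss_total = sum_cross(y, y)

print(b, a, ss_residual / ss_total, 1 - r ** 2)
```

The last two printed values match: the proportion of variability left in the residuals is 1 - r², so a larger r means both a steeper standardized slope and smaller residuals.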

Robust Approaches to Correlation

Rank approaches

Winsorized

Percentage Bend

Rank approaches: Spearman’s rho and Kendall’s tau

Spearman’s rho is calculated using the same formula as Pearson’s r, but with the variables in the form of ranks

Simply rank the data available

X = 10 15 5 35 25 becomes X = 2 3 1 5 4

Do this for X and Y and calculate r as normal

Kendall’s tau is another rank-based approach, but the details of its calculation are different

For theoretical reasons it may be preferable to Spearman’s, but both should be consistent for the most part and perform better than Pearson’s r when dealing with non-normal data
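The Spearman recipe (rank, then compute r as normal) can be sketched in a few lines. The X values are the ones from the slide; the Y values are made up, and giving tied values the average of their ranks is an assumption, since the slide does not say how ties are handled.

```python
from math import sqrt

def ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Pearson r applied to the ranks of x and y."""
    return pearson_r(ranks(x), ranks(y))

X = [10, 15, 5, 35, 25]   # from the slide
Y = [12, 20, 9, 18, 100]  # made up; note the extreme last value

x_ranks = ranks(X)
rho = spearman_rho(X, Y)
print(x_ranks, rho)
```

Because only the ordering matters, the extreme value of 100 in Y counts the same as any other largest value would.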

Winsorized Correlation

As mentioned before, Winsorizing data involves changing some decided-upon percentage of extreme scores (high and low) to the value of the most extreme score that is not Winsorized

X = 1 2 3 4 5 6 becomes X = 2 2 3 4 5 5

Winsorize both X and Y values (without regard to each other) and compute Pearson’s r

This has an advantage over rank-based approaches in that the nature of the scales of measurement remains unchanged

For theoretical reasons (recall some of our earlier discussion regarding the standard error for trimmed means) a Winsorized correlation would be preferable to trimming

Though trimming is preferable for group comparisons
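A minimal sketch of the Winsorized correlation, assuming 20% Winsorizing in each tail (the slide does not fix the amount, though 20% reproduces its six-score example). The Y values are made up.

```python
from math import sqrt

def winsorize(values, proportion=0.2):
    """Pull the g lowest and g highest scores in to the nearest retained score."""
    n = len(values)
    g = int(proportion * n)  # number of scores changed in each tail
    ordered = sorted(values)
    low, high = ordered[g], ordered[n - g - 1]
    return [min(max(v, low), high) for v in values]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def winsorized_r(x, y, proportion=0.2):
    """Winsorize X and Y separately, then compute Pearson r."""
    return pearson_r(winsorize(x, proportion), winsorize(y, proportion))

X = [1, 2, 3, 4, 5, 6]   # from the slide
Y = [2, 1, 4, 3, 6, 50]  # made up; 50 is an outlier on Y

wx = winsorize(X)
r_plain = pearson_r(X, Y)
r_wins = winsorized_r(X, Y)
print(wx, r_plain, r_wins)
```

Note that each variable is Winsorized on its own, which is exactly the limitation raised later: a point can be unusual as a pair without being extreme on either variable.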

Methods Related to M-estimators

The percentage bend correlation utilizes the median and a generalization of MAD

A criticism of the Winsorized correlation is that the amount of Winsorizing is fixed in advance rather than determined by the data, and the rpb gets around that

While the details can get a bit technical, you can get some sense of what is going on by relying on what you know regarding the robust approach in general

With independent X and Y variables, the values of robust approaches to correlation will match the Pearson r

With nonnormal data, the robust approaches described guard against outliers on the respective X and Y variables while Pearson’s r does not

Problem

While these alternative methods help us in some sense, an issue remains

When dealing with correlation, we are not considering the variables in isolation

An outlier on one variable or the other might not be a bivariate outlier

Conversely what might be a bivariate outlier may not contain values that are outliers for X or Y themselves

Global measures of association

Measures are available that take into account the bivariate nature of the situation

Minimum Volume Ellipsoid Estimator (MVE)

Minimum Covariance Determinant Estimator (MCD)

Minimum Volume Ellipsoid Estimator

Robust elliptic plot (relplot)

Relplots are like scatterplot boxplots for our data, where the inner circle contains half the values and anything outside the dotted circle would be considered an outlier

A strategy for robust estimation of correlation would be to find the ellipse with the smallest area that contains half the data points

Those points are then used to calculate the correlation

This is the MVE

Minimum Covariance Determinant Estimator

The MCD is another alternative we might use; it involves the notion of a generalized variance, which is a measure of the overall variability among a cloud of points

For the more adventurous, see my /6810 page for info on matrices and their determinants

The determinant of the covariance matrix is the generalized variance

For the two-variable situation:

$$ s_g^2 = s_x^2 s_y^2 (1 - r^2) $$

As we can see, since r is a measure of linear association, the more tightly the points are packed the larger r will be, and consequently the smaller the generalized variance will be

The MCD picks the half of the data that produces the smallest generalized variance, and calculates r from that
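The idea can be sketched by brute force on a tiny made-up data set: compute the generalized variance as the determinant of the 2 x 2 covariance matrix, try every subset containing half the points, and keep the subset with the smallest determinant. This is an illustrative assumption about how to search, not how the Robust library does it; exhaustive search is only feasible for very small n.

```python
from itertools import combinations
from math import sqrt

def cov_entries(points):
    """Entries (sxx, sxy, syy) of the 2x2 sample covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy

def generalized_variance(points):
    """Determinant of the covariance matrix."""
    sxx, sxy, syy = cov_entries(points)
    return sxx * syy - sxy ** 2

def pearson_r(points):
    sxx, sxy, syy = cov_entries(points)
    return sxy / sqrt(sxx * syy)

def naive_mcd_r(points):
    """r from the half of the data with the smallest generalized variance."""
    h = (len(points) + 1) // 2  # half the points, rounded up
    best = min(combinations(points, h), key=generalized_variance)
    return pearson_r(best)

# Made-up data: six points near the line y = x plus two bivariate outliers
points = [(1, 1.1), (2, 1.9), (3, 3.1), (4, 3.9), (5, 5.1), (6, 5.9),
          (7, 1.0), (8, 0.0)]

sxx, sxy, syy = cov_entries(points)
r_all = pearson_r(points)
gv_all = generalized_variance(points)
r_mcd = naive_mcd_r(points)

# The determinant agrees with s_x^2 * s_y^2 * (1 - r^2)
print(gv_all, sxx * syy * (1 - r_all ** 2))
print(r_all, r_mcd)
```

With the two outliers included, Pearson r is close to zero; the half-sample with the smallest generalized variance consists of points near the line, so the r calculated from it is close to 1.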

Global measures of association

Note that both the MVE and MCD can be extended to situations with more than two variables

We’d just be dealing with a larger matrix

Example using the Robust library in S-Plus

OMG! Drop down menus even!

Remaining issues: Curvature

The fact is that straight lines may not capture the true story

We may often fail to find noticeable relationships because our r, whichever “Pearsonesque” method we choose, is trying to specify a linear relationship

There may still be a relationship, and a strong one, just more complex
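A quick illustration with made-up data: y is completely determined by x, yet r is zero because the relationship is a symmetric curve rather than a line.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up data: a perfect quadratic relationship, symmetric around x = 0
x = list(range(-3, 4))
y = [xi ** 2 for xi in x]

r_curve = pearson_r(x, y)
print(r_curve)
```

A scatterplot would show the pattern immediately, which is the reason for checking the plot before trusting the coefficient.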

Summary

Correlation, in terms of Pearson r, gives us a sense of the strength of a linear association between two variables

One data point can render it a useless measure, as it is not robust to outliers

Measures which are robust are available, and some take into account the bivariate nature of the data

However, curvilinear relationships may exist, and we should examine the data to see if alternative explanations are viable
