Correlation1. The variance of a variable X provides information on the variability of X. The...


Page 1

Correlation

Page 2

The variance of a variable X provides information on the variability of X.

The covariance of two variables X and Y provides information on the related variability of X and Y together.

Note the similarity of the structure of the formulas. Instead of relating X to itself as in the variance, X is related to the other variable Y.

Covariance

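$$\text{Sample Variance} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{n - 1} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

$$\text{Sample Covariance} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$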
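The parallel structure of the two formulas can be sketched in Python. This is a minimal illustration, not part of the original slides: the function names and the data values are made up for the example.

```python
def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    return sum((x - mean_x) * (x - mean_x) for x in xs) / (n - 1)


def sample_covariance(xs, ys):
    """Sum of cross-products of deviations, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data, chosen only to keep the arithmetic easy to follow.
xs = [2.0, 4.0, 6.0, 8.0]
ys = [1.0, 3.0, 2.0, 6.0]

print(sample_variance(xs))         # variance of X
print(sample_covariance(xs, xs))   # covariance of X with itself: same value
print(sample_covariance(xs, ys))   # covariance of X with Y
```

Relating X to itself in the covariance function reproduces the variance, which is the similarity the slide points out.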

Page 3

There is an Excel function to calculate covariance:

=COVAR(range1, range2)

Unfortunately, for most common purposes, this Excel function does not calculate the sample covariance but instead calculates what is known as the population covariance, which divides by n rather than n-1.

Therefore, to transform Excel’s covariance calculation into the more useful sample covariance, it is necessary to multiply Excel’s result by the factor n/(n-1).

Excel Note on Covariance

$$\text{Population Covariance} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n}$$
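The n/(n-1) adjustment can be checked with a short Python sketch. The function names and data are hypothetical; `population_covariance` here mimics the divide-by-n calculation described above and is not an Excel API.

```python
def population_covariance(xs, ys):
    """Divide the sum of cross-products of deviations by n (not n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def to_sample_covariance(pop_cov, n):
    """Rescale a population covariance to a sample covariance."""
    return pop_cov * n / (n - 1)


# Hypothetical data for illustration.
xs = [2.0, 4.0, 6.0, 8.0]
ys = [1.0, 3.0, 2.0, 6.0]

pop_cov = population_covariance(xs, ys)
samp_cov = to_sample_covariance(pop_cov, len(xs))
print(pop_cov, samp_cov)
```

For small samples the factor matters (4/3 with n = 4); as n grows it approaches 1 and the two versions nearly coincide.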

Page 4

Covariance measures how much Y and X tend to vary in the same direction.

A high positive covariance means the highest values of Y tend to occur along with the highest values of X.

However, it’s hard to interpret because it has no standard scale of reference. A covariance of 300,000 could be trivial while another of 2.1 fairly substantial.

Covariance
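The lack of a standard scale can be illustrated with a small Python sketch. The data are invented: the same amounts are expressed once in dollars and once in cents, and only the unit of measurement changes.

```python
def sample_covariance(xs, ys):
    """Sample covariance: cross-products of deviations divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: a price in dollars and the units sold.
price_dollars = [1.0, 2.0, 3.0, 4.0]
units_sold = [2.0, 4.0, 5.0, 9.0]

# The same prices expressed in cents.
price_cents = [100.0 * p for p in price_dollars]

cov_dollars = sample_covariance(price_dollars, units_sold)
cov_cents = sample_covariance(price_cents, units_sold)
print(cov_dollars, cov_cents)  # the second is 100 times the first
```

The relationship between the two variables is identical in both cases, yet the covariance differs by a factor of 100, so its magnitude alone says little.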

Page 5

A more useful expression of this relationship between X and Y is to express the covariance as a fraction of the product of the standard deviations of X and Y. This fraction is known as the “standardized” covariance, or the correlation coefficient (correlation for short), and is commonly denoted by r.

In Excel, the correlation formula is:

=CORREL(range1, range2)

Correlation Coefficient

$$r = \frac{\text{Sample covariance between } X \text{ and } Y}{S_X S_Y}$$
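A minimal Python sketch of this standardization follows. The function name and data are assumptions made for the example; the calculation divides the sample covariance by the product of the two sample standard deviations.

```python
import math


def correlation(xs, ys):
    """Sample covariance divided by the product of sample standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov_xy / (s_x * s_y)


# Hypothetical data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(xs, ys)
r_rescaled = correlation(xs, [100.0 * y for y in ys])  # change the units of Y
print(r, r_rescaled)
```

Unlike the covariance, r is unchanged when a variable is rescaled, which is what makes it comparable across data sets.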

Page 6

The correlation coefficient (r) measures how much Y and X tend to vary in the same direction on a standard scale. (Varying in the same direction is implicitly a linear relationship.)

It will always be between -1 and +1.

r = +1 implies a perfect positive relationship.

r = –1 implies a perfect negative relationship.

r = 0 implies no linear relationship exists!

Correlation Coefficient

Page 7

Since it is unlikely that any real social data will have either a perfect positive correlation (r=1) or a perfect negative correlation (r=-1), how does an analyst know if there is “enough” correlation?

A simple rule of thumb is that a correlation whose magnitude is less than 0.3 (30%) suggests no linear relationship, whereas a magnitude of more than 0.7 (70%) suggests a strong linear relationship. Everything in between is, say, “somewhat of a relationship”.

How Much is Enough?
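The rule of thumb can be written as a small Python helper. This is only a sketch: the function name is made up, and applying the thresholds to the absolute value of r (so that strong negative correlations also count as strong) is an assumption of this example.

```python
def describe_correlation(r):
    """Classify r using the 0.3 / 0.7 rule of thumb (applied to |r|)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and +1")
    strength = abs(r)  # assumption: the sign is ignored when judging strength
    if strength < 0.3:
        return "no linear relationship"
    if strength > 0.7:
        return "strong linear relationship"
    return "somewhat of a relationship"


for value in (0.1, 0.435, -0.85):
    print(value, describe_correlation(value))
```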

Page 8

Examples

Page 9

The hypotheses are:

H0: correlation = 0

versus

H1: correlation ≠ 0

Approximate the standard error using the formula:

$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$

Calculate the t-statistic, which has n - 2 degrees of freedom. The formula is:

$$t_{\text{statistic}} = \frac{r}{s_r}$$

A T-test for formalizing “good enough”
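The two steps of the test translate directly into Python. The sketch below is illustrative: the function name and the input values (r = 0.5, n = 27) are hypothetical, and the resulting t-statistic would still need to be compared against a critical value from a t-table with n - 2 degrees of freedom.

```python
import math


def correlation_t_test(r, n):
    """Return (standard error, t-statistic, degrees of freedom) for H0: correlation = 0."""
    if n <= 2:
        raise ValueError("need more than 2 observations")
    if abs(r) >= 1.0:
        raise ValueError("the standard error is zero when |r| = 1")
    dof = n - 2
    s_r = math.sqrt((1.0 - r * r) / dof)  # approximate standard error of r
    t_statistic = r / s_r
    return s_r, t_statistic, dof


# Hypothetical inputs for illustration.
s_r, t_statistic, dof = correlation_t_test(r=0.5, n=27)
print(s_r, t_statistic, dof)
```

Because the standard error shrinks as n grows, the same r can be insignificant in a small sample and significant in a large one.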

Page 10

Suppose for a sample of size 20, the sample covariance between two variables X and Y is 87, the sample variance of X is 100 and the sample variance of Y is 400. Is there a statistically significant linear relationship?

Example 1

Page 11

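Example 1 Calculations

$$r = \frac{87}{\sqrt{100}\,\sqrt{400}} = \frac{87}{10 \times 20} = 0.435$$

$$s_r = \sqrt{\frac{1 - 0.435^2}{20 - 2}} = \sqrt{\frac{0.811}{18}} = 0.212$$

$$t_{\text{statistic}} = \frac{0.435}{0.212} = 2.05$$

A linear relationship between the two variables is statistically significant at the 10% level but not at the 5% level: the two-tailed critical values for 18 degrees of freedom are 1.734 and 2.101, and the t-statistic falls between them.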
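The Example 1 arithmetic can be reproduced in a few lines of Python as a check. The inputs come from the example; the variable names are made up, and the critical values are the two quoted in the example rather than computed here. Small rounding differences in the intermediate steps do not change the conclusion.

```python
import math

# Inputs stated in Example 1.
n = 20
cov_xy = 87.0
var_x = 100.0
var_y = 400.0

# Correlation: covariance divided by the product of standard deviations.
r = cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))

# Standard error and t-statistic with n - 2 = 18 degrees of freedom.
s_r = math.sqrt((1.0 - r ** 2) / (n - 2))
t_statistic = r / s_r

# Two-tailed critical values quoted in the example (10% and 5% levels).
t_crit_10, t_crit_05 = 1.734, 2.101

print(round(r, 3), round(s_r, 3), round(t_statistic, 2))
print("significant at 10%:", abs(t_statistic) > t_crit_10)
print("significant at 5%: ", abs(t_statistic) > t_crit_05)
```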

Page 12

Example 2 – Readership Data

Page 13

Correlation is most useful for quickly considering possible relationships between many different variables.

Suppose for example that the analysis is examining 10 different variables: X1 … X10

Using Excel’s CORREL function would require entering 45 such calculations, one for each pair of variables.

A better exploratory (one-time) way is to use Excel’s built-in Data Analysis ToolPak.

Excel’s Data Analysis Correlation
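Outside Excel, the same all-pairs idea can be sketched in Python. This is an illustration rather than a description of what the ToolPak does internally: the function names and the three small data columns are invented, and the count of 45 is simply the number of distinct pairs among 10 variables.

```python
import math
from itertools import combinations


def correlation(xs, ys):
    """Pearson correlation; the n - 1 divisors cancel, so plain sums are used."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_table(columns):
    """Map each pair of column names to the correlation of those columns."""
    return {
        (a, b): correlation(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Ten variables give 10 * 9 / 2 distinct pairs.
names = [f"X{i}" for i in range(1, 11)]
pair_count = len(list(combinations(names, 2)))
print(pair_count)

# Hypothetical data with three variables, to keep the output short.
columns = {
    "X1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "X2": [2.0, 4.0, 5.0, 4.0, 5.0],
    "X3": [5.0, 3.0, 4.0, 1.0, 2.0],
}
table = correlation_table(columns)
for pair, r in table.items():
    print(pair, round(r, 3))
```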

Page 14

Correlation Toolpak Example

Complete the dialog box

Leads to the results

Page 15

The correlation coefficient measures linearity.

If there is a nonlinear relationship, r will underestimate the predictive power of the relationship between the two variables.

Measuring Nonlinear Behavior
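A small Python sketch shows how far r can understate a nonlinear relationship. The data are constructed for the purpose: Y is determined exactly by X through Y = X squared, yet the linear correlation comes out as zero because the X values are symmetric around zero.

```python
import math


def correlation(xs, ys):
    """Pearson correlation computed from sums of deviations."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Constructed example: Y depends perfectly, but not linearly, on X.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
print(r)  # essentially zero despite the exact relationship
```

A scatter plot of the data is a useful companion to r for exactly this reason.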

Page 16

Rank correlation measures how two variables are related in a more general way.

A high rank correlation says that large values of X tend to occur with large values of Y, and low with low, whether or not the relationship is linear.

Generally, this type of correlation test might be applied to data that are highly skewed, in other words, data with a substantial number of very extreme values.

Rank Correlation

Page 17

Rank Correlation Computation

Compute the ranks for the set of X values, then for the Y values, low to high.

Compute the differences of the ranks and the square of the differences.

The statistic then is:

$$r = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n}$$

For simplicity, if some values are tied, interpolate the ranks (give each tied value the average of the ranks it spans) and use the formula above. Technically, when there are ties, the earlier correlation calculation should be applied to the ranks themselves rather than the formula above.
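The computation can be sketched in Python as follows. The function names and data are hypothetical. Two versions are shown: the shortcut based on squared rank differences, and the ordinary correlation applied to the ranks, which is the one to prefer when ties are present.

```python
import math


def average_ranks(values):
    """Rank low to high (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def rank_correlation_shortcut(xs, ys):
    """1 - 6 * sum(d^2) / (n^3 - n), with d the difference in ranks."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(average_ranks(xs), average_ranks(ys)))
    return 1 - 6 * d_squared / (n ** 3 - n)


def rank_correlation_on_ranks(xs, ys):
    """Ordinary correlation coefficient applied to the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data without ties: both versions agree.
xs = [10.0, 20.0, 30.0, 40.0, 50.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
print(rank_correlation_shortcut(xs, ys), rank_correlation_on_ranks(xs, ys))

# Tied values receive the average of the ranks they span.
print(average_ranks([3.0, 1.0, 3.0, 2.0]))
```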

Page 18

The hypotheses are:

H0: correlation = 0

versus

H1: correlation ≠ 0

Approximate the standard error using the formula:

$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$

Calculate the t-statistic, which has n - 2 degrees of freedom. The formula is:

$$t_{\text{statistic}} = \frac{r}{s_r}$$

Re-use the T-test for “good enough”

Page 19

Example 3