Correlation

Dr Kirsten Challinor. Acknowledgment to Andy Field, chapter 7: Correlation

Transcript of Correlation

Page 1: Correlation

Dr Kirsten Challinor. Acknowledgment to Andy Field, chapter 7.

Correlation

Page 2: Correlation

COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969

WARNING

This material has been reproduced and communicated to you by or on behalf of the University of New South Wales pursuant to Part VB of the Copyright Act 1968 (the Act).

The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act.

Do not remove this notice.

Page 3: Correlation

Lecture outline

• Why do we need stats?
• Evidence based practice
  - Appraisal
• Statistical models
• The mean as a model
• Sums of squares/fit/Variance
• Correlation
• Graphs
• Assumptions
• Measuring Relationships
• Pearson r
• R squared
• Non-parametric

Page 4: Correlation

http://www.uk.sagepub.com/field4e/main.htm


Page 5: Correlation

Why do we need stats?

Page 6: Correlation

Evidence based practice (EBP)

EBP is the combination of the best available evidence from research, the patient’s preferences/circumstances, the clinical environment and the practitioner’s expertise (Hoffman, 2010).

Page 7: Correlation

Process of EBP

ASK: formulating answerable questions

ACQUIRE: searching for the best evidence

APPRAISE: critically assess the evidence

APPLY: the appraised evidence to patient / practice

AUDIT: evaluating outcome of the EBP process

(Dawes, 2005)

Page 8: Correlation


Page 9: Correlation


Page 10: Correlation

Remember that Appraisal is

Evaluating the relevant research evidence, to find the highest quality (most reliable, or valid) evidence available relevant to your question.

Critical appraisal is the process of assessing and interpreting evidence by systematically considering its validity and its relevance to the question. Internal validity: the extent to which the research is reliable. External validity: an indication of the generalisability of the findings.


Page 11: Correlation

Simplified version of CA worksheet


Each question is followed by the interpretation if the answer is Yes and if it is No.

Were subjects randomized?
Yes: The study is not likely to be biased by subject grouping.
No: Subject allocation may cause bias.

Was there a control? Is the control group within this study, or historical?
Yes: There is unlikely to be a placebo effect in the treatment group. We can be less sure of this, though, if the control group data are taken from a previous study.
No: Subjects were in therapy, but there is no comparison with those not in therapy, so we cannot know to what extent any treatment effect is due to the treatment.

Is the population clinically relevant for my application?
No: Findings may be population-specific. The findings may apply to one population but not to the population in which the therapy is to be applied.

Is attrition described?
Yes: If the attrition rate is low, the findings are not confounded by this factor.
No: We do not know the results in subjects who withdraw from the study.

Were experimenters and subjects “blind” in this trial?
Yes: The findings are not biased by expectation of outcomes.
No: The experimenters and the subjects may have unintentionally or otherwise affected the outcome.

Are the subject groups comparable?
Yes: The subject groups were equal at baseline, so are likely to have been similarly affected.
No: Outcomes in the groups may differ due to factors other than the treatment.

Was subject treatment equal across groups, apart from the therapy?
Yes: The subject groups were equal in all respects apart from the therapy.
No: Outcomes in the groups may differ due to factors other than the treatment.

Are the results both clinically and statistically significant?
Yes: The results are clinically relevant.
No: Results may be statistically significant, but have no clinical significance. They may not be statistically significant, in which case there is no effect.

Acknowledgement: Catherine Suttle

Page 12: Correlation


Page 13: Correlation
Page 14: Correlation

The Research Process

Page 15: Correlation

Stats Models…

Page 16: Correlation

Building Statistical Models

…must represent the data collected (the observed data) as closely as possible. The degree to which a statistical model represents the data collected is known as the fit of the model.

Figure 2.2 illustrates three models that an engineer might build to represent the real-world bridge that she wants to create. The first model is an excellent representation of the real-world situation and is said to be a good fit (i.e., the model is basically a very good replica of reality). If the engineer uses this model to make predictions about the real world then, because it so closely resembles reality, she can be confident that these predictions will be accurate. So, if the model collapses in a strong wind, then there is a good chance that the real bridge would collapse also. The second model has some similarities to the real world: the model includes some of the basic structural features, but there are some big differences from the real-world bridge (namely the absence of one of the supporting towers). We might consider this model to have a moderate fit (i.e., there are some similarities to reality but also some important differences). If the engineer uses this model to make predictions about the real world then these predictions may be inaccurate or even catastrophic (e.g., the model predicts that the bridge will collapse in a strong wind, causing the real bridge to be closed down, creating 100-mile tailbacks with everyone stranded in the snow; all of which was unnecessary because the real bridge was perfectly safe: the model was a bad representation of reality). We can have some confidence, but not complete confidence, in predictions from this model. The final model is completely different to the real-world situation; it bears no structural similarities to the real bridge and is a poor fit. As such, any predictions based on this model are likely to be completely inaccurate. Extending this analogy to science, it is important when we fit a statistical model to a set of data that it fits the data well. If our model is a poor fit of the observed data then the predictions we make from it will be equally poor.

Figure 2.2: Fitting models to real-world data (panels: The Real World; Good Fit; Moderate Fit; Poor Fit; see text for details)

Field, chapter 2

Page 17: Correlation

Populations and Samples

Population
• The collection of units (be they people, plankton, plants, cities, etc.) to which we want to generalize a set of findings or a statistical model.

Sample
• A smaller (but hopefully representative) collection of units from a population used to determine truths about that population.

Page 18: Correlation


A Simple Statistical Model

• In statistics we fit models to our data.
• The mean is a hypothetical value.
• The mean is a simple statistical model.

Page 19: Correlation

• Let’s measure the number of friends that lecturers have.

• The mean doesn’t have to be a value that is actually observed in the data set (e.g. 2.67 friends is not real).

The mean as a model

Lecturer: Number of Friends

Kirsten: 1

Jack: 3

Lily: 4

Mean: 8/3 = 2.67

Number of Friends (Kirsten) = Mean + Error related to (Kirsten)

1 = 2.6 + E

Page 20: Correlation


The Only Equation You Will Ever Need

The data we observe can be predicted from the model we choose to fit to the data, plus some amount of error.

Number of Friends (Kirsten) = Mean + Error related to (Kirsten)

1 = 2.6 + E

outcome_i = model + error_i

Page 21: Correlation


Measuring the ‘Fit’ of the Model

The mean is a model of what happens in the real world: the typical score.
It is not a perfect representation of the data.
How can we assess how well the mean represents reality?
“How good is the fit?”

Page 22: Correlation


Calculating ‘Error’

A deviation is the difference between the mean and an actual data point.
Deviations can be calculated by taking each score and subtracting the mean from it:

Number of Friends (Kirsten) = Mean + Error related to (Kirsten)
1 = 2.6 + E
E = 1 - 2.6
E = -1.6
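A minimal Python sketch of this calculation, using the friends data from the earlier slide (Kirsten 1, Jack 3, Lily 4). Note the slide rounds the mean of 2.67 to 2.6; the code keeps the full value.

```python
# Sketch of the mean-as-model idea using the friends data from the slide.
friends = {"Kirsten": 1, "Jack": 3, "Lily": 4}

mean = sum(friends.values()) / len(friends)   # 8 / 3 = 2.67 (a hypothetical value)

# outcome_i = model + error_i, so the error (deviation) is the score minus the mean.
for name, score in friends.items():
    error = score - mean
    print(f"{name}: {score} = {mean:.2f} + ({error:.2f})")
# Kirsten's deviation is 1 - 2.67 = -1.67 (about -1.6 with the slide's rounding).
```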

Page 23: Correlation


Textbook example, page 48.

Page 24: Correlation

• The sum of squares is a good measure of the accuracy of our model.
• It is also helpful to take the mean of the sum of squares.
• Both the sum of squares and the mean squared error tell us about the fit of the model. Large values indicate a bad fit.
• When the model is the mean, the mean squared error has a special name: variance.
• If you are into equations… what you have just learnt is represented by this equation (see page 49):

Sum of squares
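If you prefer code to equations, here is a minimal Python sketch of the sum of squares and the variance for the friends data, dividing by N - 1 as the textbook does (the N - 1 is explained later under Bessel’s correction):

```python
# Sum of squared errors (SS) and variance for the friends data on the earlier slide.
friends = [1, 3, 4]
mean = sum(friends) / len(friends)

squared_errors = [(x - mean) ** 2 for x in friends]
ss = sum(squared_errors)                 # sum of squares: total squared error
variance = ss / (len(friends) - 1)       # mean squared error, divided by N - 1

print(ss, variance)                      # large values would indicate a bad fit
```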

Page 25: Correlation

The special case of a more general principle:

That the fit of a model can be assessed by the sum of squared error or the mean squared error.

Variance

Page 26: Correlation

… just check out his sum of squares…

How fit is a model?

Page 27: Correlation

Correlation

Page 28: Correlation

Aims

Measuring Relationships
• Scatterplots
• Covariance
• Pearson’s Correlation Coefficient

Non-parametric measures
• Spearman’s Rho
• Kendall’s Tau

Interpreting Correlations
• Causality

Partial Correlations

Page 29: Correlation

What is a Correlation?

• It is a way of measuring the extent to which two variables are related.

• It measures the pattern of responses across variables.

• Observing what naturally goes on in the world without directly interfering with it.

Page 30: Correlation

The danger of mixing up causality and correlation: Ionica Smeets at TEDxDelft

Page 31: Correlation

The first thing to do with your data… always.

Look at graphs

Page 32: Correlation


Very small relationship

[Scatterplot: Age (x-axis, 10 to 90) vs Appreciation of Dimmu Borgir (y-axis)]

Page 33: Correlation


Positive relationship

[Scatterplot: Age (x-axis, 10 to 90) vs Appreciation of Dimmu Borgir (y-axis)]

Page 34: Correlation


Negative relationship

[Scatterplot: Age (x-axis, 10 to 90) vs Appreciation of Dimmu Borgir (y-axis)]

Page 35: Correlation

Bias

Assumptions

Page 36: Correlation

Something other than evidence is affecting your conclusion. A source of bias comes from violating assumptions. An assumption is a condition that ensures what you’re attempting to do works.

Bias

Page 37: Correlation

If Juno had 16 friends this would pull the mean up and incorrectly make KC and SK seem to be more popular. This affects our sums of squares and further calculations on the data. We spot outliers by looking at graphs.

Outliers
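A small Python sketch (using the made-up outlier of 16 friends mentioned on the slide) of how one extreme score drags the mean and inflates the sum of squares:

```python
import numpy as np

# How a single outlier distorts the mean and the sum of squares.
friends = np.array([1, 3, 4])
with_outlier = np.array([1, 3, 4, 16])   # hypothetical outlier from the slide

for data in (friends, with_outlier):
    mean = data.mean()
    ss = ((data - mean) ** 2).sum()
    print(f"data={data.tolist()}, mean={mean:.2f}, SS={ss:.2f}")
# The mean jumps from 2.67 to 6.00 and SS balloons, so everything built on the
# sums of squares (variance, covariance, r) is affected too.
```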

Page 38: Correlation

Additivity and Linearity

• The outcome variable is, in reality, linearly related to any predictors.
• If you have several predictors then their combined effect is best described by adding their effects together.
• If this assumption is not met then your model is invalid.

Page 39: Correlation

Normally Distributed Something or Other

The normal distribution is relevant to:
• Parameters
• Confidence intervals around a parameter
• Null hypothesis significance testing

This assumption tends to get incorrectly translated as ‘your data need to be normally distributed’. Usually it refers to the sampling distribution of whatever is being tested, which must be normal. More on this in tutorials and on page 168 of your textbook.

Page 40: Correlation

When does the Assumption of Normality Matter?

In small samples.
• The central limit theorem allows us to forget about this assumption in larger samples.
In practical terms, as long as your sample is fairly large, outliers are a much more pressing concern than normality.

Page 41: Correlation

Spotting Normality

We don’t have access to the sampling distribution, so we usually test the observed data.

Central Limit Theorem
• If N > 30, the sampling distribution is normal anyway

Graphical displays
• P-P plot (or Q-Q plot)
• Histogram

Values of skew/kurtosis
• 0 in a normal distribution
• Convert to z (by dividing the value by its SE)

Kolmogorov-Smirnov test
• Tests if data differ from a normal distribution
• Significant = non-normal data
• Non-significant = normal data
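For illustration, a rough Python sketch of these checks using scipy.stats. The data are simulated, and the standard error used for the z-conversion of skew (sqrt(6/N)) is the usual approximation rather than the exact value SPSS reports:

```python
import numpy as np
from scipy import stats

# Simulated scores stand in for your observed data here.
rng = np.random.default_rng(1)
scores = rng.normal(loc=50, scale=10, size=40)

skew = stats.skew(scores)          # about 0 for a normal distribution
kurt = stats.kurtosis(scores)      # excess kurtosis, also about 0 for a normal distribution

# Rough z-conversion: divide by an approximate standard error, sqrt(6/N) for skew.
z_skew = skew / np.sqrt(6 / len(scores))

# Kolmogorov-Smirnov test against a normal distribution with the sample's mean and SD.
ks_stat, ks_p = stats.kstest(scores, "norm", args=(scores.mean(), scores.std(ddof=1)))

print(skew, kurt, z_skew, ks_p)    # a significant p (< .05) suggests non-normal data
```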


Page 42: Correlation
Page 43: Correlation

Measuring Relationships

Page 44: Correlation

• We need to see whether, as one variable increases, the other increases, decreases or stays the same.
• This can be done by calculating the covariance.
• We look at how much each score deviates from the mean.
• If both variables deviate from the mean by the same amount, they are likely to be related.

Measure the relationship

Page 45: Correlation
Page 46: Correlation
Page 47: Correlation

• The variance tells us by how much scores deviate from the mean for a single variable.
• It is closely linked to the sum of squares.
• Covariance is similar: it tells us by how much scores on two variables differ from their respective means.

Revision of Variance

Page 48: Correlation

• Calculate the error between the mean and each subject’s score for the first variable (x).
• Calculate the error between the mean and their score for the second variable (y).
• Multiply these error values (e.g. -3 x -0.4).
• Add these values across people and you get the cross-product deviations.
• The covariance is the average of the cross-product deviations.

Covariance
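A short Python sketch of these steps. The numbers are assumed to be the adverts-watched / packets-of-toffee values from the textbook example (they reproduce the covariance of 4.25 and the deviation -3 x -0.4 quoted on the slide):

```python
import numpy as np

# Covariance computed step by step (assumed adverts/toffee values from the textbook).
adverts = np.array([5, 4, 4, 6, 8])
packets = np.array([8, 9, 10, 13, 15])

dev_x = adverts - adverts.mean()          # error for each score on variable x
dev_y = packets - packets.mean()          # error for each score on variable y

cross_products = dev_x * dev_y            # e.g. first person: -0.4 * -3 = 1.2
covariance = cross_products.sum() / (len(adverts) - 1)

print(covariance)                         # 4.25
print(np.cov(adverts, packets)[0, 1])     # the same value straight from NumPy
```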

Page 49: Correlation

See page 265 in text.

cov(x, y) = Σ(x_i - x̄)(y_i - ȳ) / (N - 1)

Page 50: Correlation

This is the 1-hour breakpoint.

Page 51: Correlation

https://en.wikipedia.org/wiki/Bessel%27s_correction#Proof_of_correctness_-_Alternate_3

http://www.analystforum.com/forums/cfa-forums/cfa-level-i-forum/9684494

http://stats.stackexchange.com/questions/3931/intuitive-explanation-for-dividing-by-n-1-when-calculating-standard-deviation

Why is it N-1? Aka what’s with Bessel’s correction?
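A quick simulation (a sketch, not a proof) of why the N - 1 matters: with small samples, dividing by N systematically underestimates the population variance, while dividing by N - 1 is right on average.

```python
import numpy as np

# Simulate many small samples from a population whose true variance is 100.
rng = np.random.default_rng(0)
population_sd = 10

biased, unbiased = [], []
for _ in range(10_000):
    sample = rng.normal(0, population_sd, size=5)
    biased.append(sample.var(ddof=0))    # divide by N
    unbiased.append(sample.var(ddof=1))  # divide by N - 1 (Bessel's correction)

print(np.mean(biased))    # noticeably below 100
print(np.mean(unbiased))  # close to 100 on average
```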

Page 52: Correlation

It depends upon the units of measurement.

• E.g. the covariance of two variables measured in miles might be 4.25, but if the same scores are converted to km, the covariance is 11.

One solution: standardise it!
• Divide by the standard deviations of both variables.
The standardised version of covariance is known as the correlation coefficient.
• It is relatively unaffected by units of measurement.

The problem with Covariance
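A small Python sketch of this point, using made-up distances: converting the same scores from miles to kilometres changes the covariance but leaves the correlation untouched.

```python
import numpy as np

# Covariance depends on the units; the correlation coefficient does not.
x_miles = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical values
y_miles = np.array([2.0, 1.5, 3.5, 3.0, 5.0])

x_km, y_km = x_miles * 1.609344, y_miles * 1.609344   # same scores, new units

print(np.cov(x_miles, y_miles)[0, 1])       # covariance in miles
print(np.cov(x_km, y_km)[0, 1])             # covariance changes with the units
print(np.corrcoef(x_miles, y_miles)[0, 1])  # correlation in miles
print(np.corrcoef(x_km, y_km)[0, 1])        # identical correlation in km
```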

Page 53: Correlation

Standardisation & the correlation coefficient

Page 54: Correlation

The Correlation Coefficient

r = cov_xy / (s_x × s_y) = 4.25 / (1.67 × 2.92) = .87
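The same calculation in Python, again assuming the textbook’s adverts/toffee values: dividing the covariance by both standard deviations recovers r of about .87, and scipy gives the matching r plus a p-value (squaring r gives the coefficient of determination discussed later).

```python
import numpy as np
from scipy import stats

adverts = np.array([5, 4, 4, 6, 8])      # assumed textbook values
packets = np.array([8, 9, 10, 13, 15])

cov_xy = np.cov(adverts, packets)[0, 1]  # 4.25
r = cov_xy / (adverts.std(ddof=1) * packets.std(ddof=1))

print(r)                                  # about .87
print(stats.pearsonr(adverts, packets))   # same r, plus a p-value
print(r ** 2)                             # coefficient of determination
```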

Page 55: Correlation

• r is between -1 and 1.
• r or R (mostly R is used in the context of regression).
• Bivariate correlation: a correlation between two variables. Partial correlation: looks at the relationship between two variables whilst controlling for one or more other variables.

Correlation coefficient

Page 56: Correlation

Things to know about the Correlation

It varies between -1 and +1
• 0 = no relationship

It is an effect size
• ±.1 = small effect
• ±.3 = medium effect
• ±.5 = large effect

Coefficient of determination, r²
• By squaring the value of r you get the proportion of variance in one variable shared by the other.

Page 57: Correlation

Correlation and Causality

The third-variable problem*:
• In any correlation, causality between two variables cannot be assumed because there may be other measured or unmeasured variables affecting the results.

Direction of causality:
• Correlation coefficients say nothing about which variable causes the other to change.

* Sometimes called Tertium Quid.

Page 58: Correlation

In your tutorial you will learn how to perform a correlation in SPSS.

How to perform a correlation

Page 59: Correlation

Correlation coefficient: r = .871. Positive relationship.

Significance of r: p = .054. This is greater than our criterion of .05, therefore the relationship is not statistically significant.

Pearson r

Page 60: Correlation

• There was no significant relationship between the number of adverts watched and the number of packets of toffee purchased, r = .87, p = .054.

• r = .87 is a large effect.
• The sign of r is positive. As one variable increases, so too does the other. Note that this doesn’t imply causation.

SPSS output

When interpreting a correlation coefficient there are 3 important things to consider:
• The significance of r
• The magnitude of r
• The +/- sign of r

Page 61: Correlation

Confidence intervals (CIs) tell us something about the likely value in the population. They give you a range with a lower bound and an upper bound. SPSS calculates a fancy version of CIs called bootstrap confidence intervals.

Confidence intervals for r
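A bare-bones percentile bootstrap for r, sketched in Python (SPSS’s bootstrap CIs can be more refined, but the resampling idea is the same); the data are again the assumed adverts/toffee values:

```python
import numpy as np

# Percentile bootstrap CI for r: resample the cases with replacement,
# keeping each x-y pair together, and recompute r each time.
rng = np.random.default_rng(42)
x = np.array([5, 4, 4, 6, 8])           # assumed adverts/toffee values
y = np.array([8, 9, 10, 13, 15])

boot_rs = []
for _ in range(2000):
    idx = rng.integers(0, len(x), size=len(x))
    xb, yb = x[idx], y[idx]
    if xb.std() == 0 or yb.std() == 0:  # degenerate resample; skip it
        continue
    boot_rs.append(np.corrcoef(xb, yb)[0, 1])

lower, upper = np.percentile(boot_rs, [2.5, 97.5])
print(lower, upper)                     # lower and upper bounds for r
```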

Page 62: Correlation

Correlation coefficient squared = R² = coefficient of determination.

It measures the amount of variability in one variable shared by the other. Example: exam performance and exam anxiety. There are loads of things that affect exam performance (variability); R² tells us how much of this variability is shared with anxiety.

r = -0.441, so R² = 0.194 (simply square r). Turn it into a percentage by multiplying by 100.

So exam anxiety shares 19.4% of the variability in exam performance.

Using R squared for interpretation

Page 63: Correlation

Spurious correlations http://tylervigen.com

Correlation: -0.93

Correlation: 0.95

Page 64: Correlation

Non-parametric

Page 65: Correlation

Parametric vs Non-parametric

Measurement scale
• Parametric: interval or ratio
• Non-parametric: nominal or ordinal

Information used
• Parametric correlation uses information about the mean and deviation from the mean.
• Non-parametric correlation will use only the ordinal position of pairs of scores.

You have to look at the distribution of your data. Check that the distributions are approximately normal*.

* The best way to do this is to check the skew and kurtosis measures from the frequency output in SPSS. For a relatively normal distribution: skew ~= 1.0, kurtosis ~= 1.0. If a distribution deviates markedly from normality then you take the risk that the statistic will be inaccurate. The safest thing to do is to use an equivalent non-parametric statistic.

Page 66: Correlation

• Non-parametric statistic based on ranked data.
• Minimises the effects of:
  - extreme scores
  - violations of the assumptions
• Ranks the data, then applies Pearson’s r to the ranks.

Spearman’s rho (rs)
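A Python sketch of the “rank the data, then apply Pearson’s r” idea; scipy’s spearmanr gives the same value (the data are the assumed adverts/toffee values, which conveniently contain a tie):

```python
import numpy as np
from scipy import stats

x = np.array([5, 4, 4, 6, 8])     # assumed values; note the tied scores of 4
y = np.array([8, 9, 10, 13, 15])

# Spearman's rho: Pearson's r applied to the ranks of the scores.
rho_by_hand = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]
rho, p = stats.spearmanr(x, y)

print(rho_by_hand, rho)           # the two values agree
```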

Page 67: Correlation

Use rather than Spearman’s rho when:
• you have a small data set
• with a large number of tied ranks
(That is, if you rank all the scores and many have the same rank.)

Kendall’s tau is less popular than Spearman’s rho but can be a better estimate.

Kendall’s tau (non-parametric)
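And for completeness, Kendall’s tau on the same small (assumed) data set with tied ranks, via scipy:

```python
from scipy import stats

x = [5, 4, 4, 6, 8]     # assumed values; small data set with tied ranks
y = [8, 9, 10, 13, 15]

tau, p = stats.kendalltau(x, y)
print(tau, p)
```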

Page 68: Correlation

Lecture outline

• Why do we need stats?
• Evidence based practice
  - Appraisal
• Statistical models
• The mean as a model
• Sums of squares/fit/Variance
• Correlation
• Graphs
• Assumptions
• Measuring Relationships
• Pearson r
• R squared
• Non-parametric

Page 69: Correlation

http://www.bbc.co.uk/podcasts/series/moreorless

Cartoon books: http://www.sumsar.net/blog/2014/06/statistics-comic-book-review/

For fun

Page 70: Correlation

http://www.uk.sagepub.com/field4e/main.htm
