16_corr_regres3.pdf

Transcript of 16_corr_regres3.pdf (8/10/2019)

    Correlation & Regression, III

9.07, 4/6/2004

    Review

Linear regression refers to fitting a best-fit line y = a + bx to the bivariate data (x, y), where

a = m_y − b·m_x
b = cov(x, y)/s_x² = ss_xy/ss_xx

Correlation, r, is a measure of the strength and direction (positive vs. negative) of the relationship between x and y:

r = cov(x, y)/(s_x·s_y)

(There are various other computational formulas, too.)

    Outline

Relationship between correlation and regression, along with notes on the correlation coefficient

Effect size, and the meaning of r

Other kinds of correlation coefficients

Confidence intervals on the parameters of correlation and regression

Relationship between r and regression

r = cov(x, y)/(s_x·s_y)

In regression, the slope b = cov(x, y)/s_x²

So we could also write b = r·(s_y/s_x). This means b = r when s_x = s_y.
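As a numerical sanity check (a sketch, not from the slides; the small data set is made up), the identity b = r·(s_y/s_x) can be verified directly:

```python
# Helper functions use the population (divide-by-N) forms, matching the
# slides' definitions of cov, s_x, and s_y.
def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def sd(v):
    return cov(v, v) ** 0.5

x = [1, 2, 3, 4, 5]   # hypothetical data, for illustration only
y = [2, 1, 4, 3, 6]

b = cov(x, y) / sd(x) ** 2        # regression slope, b = cov(x, y)/s_x^2
r = cov(x, y) / (sd(x) * sd(y))   # correlation coefficient

# b and r*(s_y/s_x) agree to floating-point precision
assert abs(b - r * (sd(y) / sd(x))) < 1e-12
```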

    1

  • 8/10/2019 16_corr_regres3.pdf

    2/16

Notes on the correlation coefficient, r

1. The correlation coefficient is the slope (b) of the regression line when both the X and Y variables have been converted to z-scores, i.e. when s_x = s_y = 1, or more generally, when s_x = s_y.

For a given s_x and s_y, the larger the size of the correlation coefficient, the steeper the slope.

Invariance of r to linear transformations of x and y

A linear change in scale of either x or y will not change r.

E.g. converting height to meters and weight to kilograms will not change r. This is just the sort of nice behavior we'd like from a measure of the strength of the relationship. If you can predict height in inches from weight in lbs, you can just as well predict height in meters from weight in kilograms.
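A quick sketch of this invariance, using the height/weight numbers from the worked example later in these notes (the unit-conversion factors are the only added assumption):

```python
def mean(v):
    return sum(v) / len(v)

def corr(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

height_in = [60, 62, 64, 66, 68, 70, 72, 74, 76]
weight_lb = [84, 95, 140, 155, 119, 175, 145, 197, 150]

# Linear changes of scale: inches -> meters, lbs -> kg
height_m = [h * 0.0254 for h in height_in]
weight_kg = [w * 0.453592 for w in weight_lb]

r_original = corr(height_in, weight_lb)
r_converted = corr(height_m, weight_kg)

assert abs(r_original - r_converted) < 1e-9   # r is unchanged
```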

Notes on the correlation coefficient, r

2. The correlation coefficient is invariant under linear transformations of x and/or y. (r is the average of z_x·z_y, and z_x and z_y are invariant to linear transformations of x and/or y.)

How do correlation (r) and regression differ?

While in regression the emphasis is on predicting one variable from the other, in correlation the emphasis is on the degree to which a linear model may describe the relationship between two variables.

The regression equation depends upon which variable we choose as the explanatory variable, and which as the variable we wish to predict.

The correlation equation is symmetric with respect to x and y: switch them and r stays the same.


Correlation is symmetric wrt x & y, but regression is not

Regressing y on x:  a = m_y − b·m_x,  b = cov(x, y)/s_x²,  r = cov(x, y)/(s_x·s_y)
Regressing x on y:  a = m_x − b·m_y,  b = cov(x, y)/s_y²,  r = cov(x, y)/(s_x·s_y)

To look out for, when calculating r:

In regression, we had to watch out for outliers and extreme points, because they could have an undue influence on the results. In correlation, the key thing to be careful of is not to artificially limit the range of your data, as this can lead to inaccurate estimates of the strength of the relationship (as well as give poor linear fits in regression). Often it gives an underestimate of r, though not always.

[Figures: two scatter plots of Weight (lbs) vs. Height (inches). Correlation over a normal range: r = 0.71. Correlation over a narrow range of heights: r = 0.62.]
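A sketch of this range-restriction effect, reusing the height/weight data from the worked example later in the notes. The cutoff at 68 inches is arbitrary, chosen only to illustrate that a narrower range can shrink r:

```python
def mean(v):
    return sum(v) / len(v)

def corr(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

height = [60, 62, 64, 66, 68, 70, 72, 74, 76]
weight = [84, 95, 140, 155, 119, 175, 145, 197, 150]

r_full = corr(height, weight)

# Artificially restrict the range: keep only heights of 68 inches or more
pairs = [(h, w) for h, w in zip(height, weight) if h >= 68]
h_narrow = [h for h, _ in pairs]
w_narrow = [w for _, w in pairs]
r_narrow = corr(h_narrow, w_narrow)

# For this data and cutoff, restriction underestimates the strength
assert r_narrow < r_full
```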


Correlation over a limited range

A limited range will often (though not always) lead to an underestimate of the strength of the association between the two variables.

[Figure: scatter plot of weight (lbs, 100-200) vs. height (inches, 60-75) over a limited range of heights.]

The meaning of r

We've already talked about r indicating both whether the relationship between x and y is positive or negative, and the strength of the relationship. The correlation coefficient, r, also has meaning as a measure of effect size.

Outline

Relationship between correlation and regression, along with notes on the correlation coefficient

Effect size, and the meaning of r

Other kinds of correlation coefficients

Confidence intervals on the parameters of correlation and regression

    Effect size

When we talked about effect size before, it was in the context of a two-sample hypothesis test for a difference in the mean.

If there was a significant difference, we decided it was likely there was a real systematic difference between the two samples. Measures of effect size attempt to get at how big this systematic effect is, in an attempt to begin to answer the question "how important is it?"


Effect size & regression

In the case of linear regression, the systematic effect refers to the linear relationship between x and y.

A measure of effect size should get at how important (how strong) this relationship is. The fact that we're talking about strength of relationship should be a hint that effect size will have something to do with r.

Predicting the value of y

If x is correlated with y, then the situation might look like this:

[Figure: scatter plot of y vs. x.]

The meaning of r and effect size

When we talked about two-sample tests, one particularly useful measure of effect size was the proportion of the variance in y accounted for by knowing x.

(You might want to review this, to see the similarity to the development on the following slides.)

The reasoning went something like this, where here it's been adapted to the case of linear regression:

Predicting the value of y

Suppose I pick a random individual from this scatter plot, don't tell you which, and ask you to estimate y for that individual.

It would be hard to guess! The best you could probably hope for is to guess the mean of all the y values (at least your error would be 0 on average).

[Figure: scatter plot of y vs. x.]


How far off would your guess be?

[Figure: scatter plot of y vs. x with a horizontal line at the mean of y.]

Predicting the value of y when you know x

Now suppose that I told you the value of x, and again asked you to predict y. This would be somewhat easier, because you could use regression to predict a good guess for y, given x.

Your best guess is ŷ, the predicted value of y given x.

(Recall that regression attempts to fit the best-fit line through the average y for each x. So the best guess is still a mean, but it's the mean y given x.)

[Figure: scatter plot of y vs. x with the regression line drawn in.]

How far off would your guess be, now?

The variance about the mean score for that value of x gives a measure of your uncertainty. Under the assumption of homoscedasticity, that measure of uncertainty is s_y'², where s_y' is the rms error = sqrt(Σ(y_i − ŷ_i)²/N).

(Compare: the variance about the mean y score, s_y², gives a measure of your uncertainty about the y scores overall.)


The strength of the relationship between x and y is reflected in the extent to which knowing x reduces your uncertainty about y.

Reduction in uncertainty = s_y² − s_y'²

Relative reduction in uncertainty: (s_y² − s_y'²)/s_y²

This is the proportion of variance in y accounted for by x:

(total variation − variation left over)/(total variation) = (variation accounted for)/(total variation)

Unpacking the equation for the proportion of variance accounted for, (s_y² − s_y'²)/s_y²

First, unpack s_y'²:

s_y'² = (1/N) Σ (y_i − ŷ_i)², where ŷ_i = a + b·x_i = (m_y − b·m_x) + b·x_i = m_y + [cov(x, y)/s_x²]·(x_i − m_x)

So:

s_y'² = (1/N) Σ ((y_i − m_y) − [cov(x, y)/s_x²]·(x_i − m_x))²
      = (1/N) Σ (y_i − m_y)² + [cov(x, y)/s_x²]²·(1/N) Σ (x_i − m_x)² − 2·[cov(x, y)/s_x²]·(1/N) Σ (x_i − m_x)(y_i − m_y)
      = s_y² + cov(x, y)²/s_x² − 2·cov(x, y)²/s_x²
      = s_y² − cov(x, y)²/s_x²

Finally:

(s_y² − s_y'²)/s_y² = (s_y² − s_y² + cov(x, y)²/s_x²)/s_y² = cov(x, y)²/(s_x²·s_y²) = r²  !!
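The algebra above can be checked numerically. This sketch fits the least-squares line to the worked example's height/weight data (from later in the notes) and confirms that (s_y² − s_y'²)/s_y² equals r²:

```python
def mean(v):
    return sum(v) / len(v)

x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
mx, my = mean(x), mean(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / N
sx2 = sum((a - mx) ** 2 for a in x) / N   # s_x^2
sy2 = sum((b - my) ** 2 for b in y) / N   # s_y^2

b = cov / sx2                 # slope
a = my - b * mx               # intercept

# s_y'^2: variance of the residuals about the regression line
resid_var = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / N

r2 = cov ** 2 / (sx2 * sy2)
explained = (sy2 - resid_var) / sy2

assert abs(explained - r2) < 1e-9   # proportion of variance explained = r^2
```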


r²

The squared correlation coefficient (r²) is the proportion of variance in y that can be accounted for by knowing x.

Conversely, since r is symmetric with respect to x and y, r² is also the proportion of variance in x that can be accounted for by knowing y.

r² is a measure of the size of the effect described by the linear relationship.

Weight as a Function of Height

[Figure: scatter plot of Weight (lbs., 50-350) vs. Height (in., 50-80), with best-fit line W = 4H − 117 and r² = 0.23.]

The linear relationship accounts for 23% of the variation in the data.

Height as a Function of Weight

[Figure: scatter plot of Height (in., 50-80) vs. Weight (lbs., 50-350), with best-fit line H = 0.06W + 58 and r² = 0.23.]

Again, accounting for 23% of the variability in the data.

Behavior and meaning of the correlation coefficient

Before, we talked about r as if it were a measure of the spread about the regression line, but this isn't quite true. If you keep the spread about the regression line the same, but increase the slope of the line, r² increases.

The correlation coefficient for zero slope will be 0, regardless of the amount of scatter about the line.


Behavior and meaning of the correlation coefficient

Best way to think of the correlation coefficient: r² is the % of variance in Y accounted for by the regression of Y on X. As you increase the slope of the regression line, the total variance to be explained goes up. Therefore the unexplained variance (the spread about the regression line) goes down relative to the total variance. Therefore, r² increases.

[Figures: two scatter plots with the same spread about the regression line; the shallower slope gives R² = 0.75, the steeper slope gives R² = 0.9231.]

Behavior and meaning of the correlation coefficient

Best way to think of the correlation coefficient: r² is the % of variance in Y accounted for by the regression of Y on X. If the slope of the regression line is 0, then 0% of the variability in the data is accounted for by the linear relationship. So r² = 0.

[Figure: scatter plot with a zero-slope regression line, R² = 0.]

This percent of variance accounted for is a useful concept. One thing it lets us do is compare effect sizes across very different kinds of experiments. Different kinds of experiments -> different correlation coefficients.

Other correlation coefficients

r_pb: The point-biserial correlation coefficient

This was the correlation coefficient we used when measuring effect size for a two-sample test (we talked about it before in talking about effect size). It is used when one of your variables takes on only two possible values (a dichotomous variable). In the two-sample case, the two possible values corresponded to the two experimental groups or conditions you wished to compare. E.g. are men significantly taller than women?
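Computationally, r_pb is just Pearson's r with the dichotomous variable coded numerically (e.g. 0/1). A minimal sketch, with hypothetical heights for two groups (all numbers invented for illustration):

```python
def mean(v):
    return sum(v) / len(v)

def corr(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

group = [0, 0, 0, 0, 1, 1, 1, 1]           # hypothetical: 0 = women, 1 = men
height = [63, 65, 62, 66, 70, 72, 69, 71]  # hypothetical heights (inches)

# Point-biserial r: ordinary Pearson r on the 0/1 codes
r_pb = corr(group, height)

# A large r_pb means group membership accounts for much of the height variance
assert 0 < r_pb < 1
```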


The correlation/regression associated with r_pb

[Figure: plot of ratings (0-10) vs. "Significant other? No/Yes", with the regression line for a dichotomous predictor.]

r_s: The Spearman rank-order correlation coefficient

Describes the linear relationship between two variables when measured by ranked scores. E.g. instead of using the actual height of a person, we might look at the rank of their height: are they the tallest in the group? The 2nd tallest? Compare this with their ranking in weight.

Often used in behavioral research because a variable is difficult to measure quantitatively. May be used to compare observer A's ranking (e.g. of the aggressiveness of each child) to observer B's ranking.

Scatter plot for ranking data

[Figure: Observer B's rankings vs. Observer A's rankings, for rankings of the aggressiveness of 9 children; 1 = most aggressive.]

Three main types of correlation coefficients: summary

Pearson product-moment correlation coefficient: the standard correlation coefficient, r. Used for typical quantitative variables.

Point-biserial correlation coefficient: r_pb. Used when one variable is dichotomous.

Spearman rank-order correlation coefficient: r_s. Used when data is ordinal (ranked) (1st, 2nd, 3rd, ...).
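A sketch of the Spearman coefficient: compute Pearson's r on the ranks. With no ties, this agrees with the shortcut formula r_s = 1 − 6·Σd²/(N(N² − 1)). The two observers' rankings below are invented for illustration:

```python
def mean(v):
    return sum(v) / len(v)

def corr(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(v):
    # rank 1 = smallest value; assumes no ties
    order = sorted(range(len(v)), key=lambda i: v[i])
    rk = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        rk[i] = rank
    return rk

obs_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # observer A's aggressiveness rankings
obs_b = [2, 1, 4, 3, 6, 5, 8, 7, 9]  # observer B's rankings

# Spearman r_s = Pearson r computed on the ranks
r_s = corr(ranks(obs_a), ranks(obs_b))

# Shortcut formula for untied ranks
n = len(obs_a)
d2 = sum((a - b) ** 2 for a, b in zip(obs_a, obs_b))
r_shortcut = 1 - 6 * d2 / (n * (n * n - 1))

assert abs(r_s - r_shortcut) < 1e-12
```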


Outline

Relationship between correlation and regression, along with notes on the correlation coefficient

Effect size, and the meaning of r

Other kinds of correlation coefficients

Confidence intervals on the parameters of correlation and regression

Best fit line for the data (the sample)

With regression analysis, we have found the line y = a + bx that best fits our data.

Our model for the data was that it was approximated by a linear relationship plus normally-distributed random error, e, i.e.

y = a + bx + e

Confidence intervals and hypothesis testing on a, b, ŷ, and r

Best fit line for the population

As you might imagine, we can think of our points (x_i, y_i) as samples from a population.

Then our best fit line is our estimate of the true best fit line for the population. The regression model for the population as a whole:

y = α + β·x + ε

where α and β are the parameters we want to estimate, and ε is the random error. Assume that for all x, the errors ε are independent and normal, with equal σ and mean = 0.


Confidence intervals for α and β

a and b are unbiased estimators for α and β, given our sample. Different samples yield different regression lines. These lines are distributed around

y = α + β·x + ε

How are a and b distributed around α and β? If we know this, we can construct confidence intervals and do hypothesis testing.

An estimator, s_e, for σ(ε)

The first thing we need is an estimator for the standard deviation of the error (the spread) about the true regression line. A decent guess would be our rms error, rms = sqrt(Σ(y_i − ŷ_i)²/N) = s_y'. We will use a modification of this estimate:

s_e = sqrt(Σ(y_i − ŷ_i)²/(N−2))

Why N−2? We used up two degrees of freedom in calculating a and b, leaving N−2 independent pieces of information to estimate σ(ε).

An alternative computational formula: s_e = sqrt((ss_yy − b·ss_xy)/(N−2))

Recall some notation:

ss_xx = Σ(x_i − m_x)²
ss_yy = Σ(y_i − m_y)²
ss_xy = Σ(x_i − m_x)(y_i − m_y)


Confidence intervals for α and β have the usual form:

α = a ± t_crit·SE(a)
β = b ± t_crit·SE(b)

where we use the t-distribution with N−2 degrees of freedom, for the same reason as given in the last slide.

Why are a and b distributed according to a t-distribution? This is complicated, and is offered without proof. Take it on faith (or, I suppose, try it out in MATLAB).

So, we just need to know what SE(a) and SE(b) are.

SE(a) and SE(b)

SE(a) and SE(b) look rather unfamiliar:

SE(a) = s_e·sqrt(1/N + m_x²/ss_xx)
SE(b) = s_e/sqrt(ss_xx)

Just take these equations on faith, too. But some intuition: where does this ss_xx come from? Usually we had a 1/sqrt(N) in our SE equations; now we have a 1/sqrt(ss_xx). Like N, ss_xx increases as we add more data points. ss_xx also takes into account the spread of the x data.

Why care about the spread of the x data? If all data points had the same x value, we'd be unjustified in drawing any conclusions about α or β, or in making predictions for any other value of x. If ss_xx = 0, our confidence intervals would be infinitely wide, to reflect that uncertainty. (See picture on board.)

So, you now know all you need to get confidence intervals for α and β: just plug into the equation from 3 slides ago. What about hypothesis testing?

Just as before when we talked about confidence intervals and hypothesis testing, we can turn our confidence-interval equation into the equation for hypothesis testing. E.g. is the population slope greater than 0?


Testing whether the slope is greater than 0

H_0: β = 0, H_a: β > 0

t_obt = b/SE(b)

Compare t_obt to t_crit at the desired level of significance (for, in this case, a one-tailed test). Look up t_crit in your t-tables with N−2 degrees of freedom.
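This slope test can be sketched on the lecture's height/weight example (N = 9); t_crit = 2.36 is the 95% t-table value for 7 df used later in the notes:

```python
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
mx, my = sum(x) / N, sum(y) / N
ss_xx = sum((a - mx) ** 2 for a in x)
ss_yy = sum((b - my) ** 2 for b in y)
ss_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))

b = ss_xy / ss_xx                              # slope (5 for this data)
s_e = ((ss_yy - b * ss_xy) / (N - 2)) ** 0.5   # estimator for sigma(eps)
SE_b = s_e / ss_xx ** 0.5                      # standard error of the slope

t_obt = b / SE_b                               # test statistic
t_crit = 2.36                                  # from the t-table, 7 df

assert t_obt > t_crit   # reject H0: beta = 0
```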

What about confidence intervals for the mean response ŷ at x = x_o?

The confidence interval for ŷ = a + b·x_o is

α + β·x_o = a + b·x_o ± t_crit·SE(ŷ)

where SE(ŷ) = s_e·sqrt(1/N + (x_o − m_x)²/ss_xx)

Note the (x_o − m_x) term: the regression line always passes through (m_x, m_y). If you wiggle the regression line (because you're unsure of a and b), it makes more of a difference the farther you are from the mean. (See picture on board.)

Versus confidence intervals on an individual's response, y_new, given x_new

The previous slide had confidence intervals for the predicted mean response for x = x_o. You can also find confidence intervals for an individual's predicted y value, y_new, given their x value, x_new. As you might imagine, SE(y_new) is bigger than SE(ŷ): first there's variability in the predicted mean (given by SE(ŷ)); then, on top of that, there's the variability of y about the mean. (See picture on board.)

SE(y_new)

These variances add, so

SE(y_new)² = SE(ŷ)² + s_e²
SE(y_new) = s_e·sqrt(1/N + (x_new − m_x)²/ss_xx + 1)


Homework notes

Problems 5 and 9 both sound like they could be asking for either confidence intervals on ŷ, or on y_new. They are somewhat ambiguously worded.

Assume that problem 5 is asking for confidence intervals on ŷ, and problem 9 is asking for confidence intervals on y_new.

Examples

Hypothesis testing on r

We often want to know if there is a significant correlation between x and y. For this, we want to test whether the population parameter, ρ, is significantly different from 0.

It turns out that for the null hypothesis H_0: ρ = 0, r·sqrt(N−2)/sqrt(1−r²) has a t distribution, with N−2 degrees of freedom. So, just compare t_obt = r·sqrt(N−2)/sqrt(1−r²) with t_crit from your t-tables.

Don't use this test to test any other hypothesis, H_0: ρ = ρ_0; for that you need a different test statistic.
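A sketch of this test on the same height/weight example (the exact t differs slightly from the slide's ≈3.10 because the slide rounds r to 0.76 before squaring):

```python
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
mx, my = sum(x) / N, sum(y) / N
ss_xx = sum((a - mx) ** 2 for a in x)
ss_yy = sum((b - my) ** 2 for b in y)
ss_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))

r = ss_xy / (ss_xx * ss_yy) ** 0.5                 # ~0.76

# t statistic for H0: rho = 0, with N-2 degrees of freedom
t_obt = r * (N - 2) ** 0.5 / (1 - r * r) ** 0.5    # ~3.08

t_crit = 2.36  # two-tailed 95%, 7 df
assert t_obt > t_crit   # rho is significantly different from 0
```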

Previous example: predicting weight from height

x_i   y_i   (x_i−m_x)  (y_i−m_y)  (x_i−m_x)²  (y_i−m_y)²  (x_i−m_x)(y_i−m_y)
60    84      −8         −56         64         3136          448
62    95      −6         −45         36         2025          270
64   140      −4           0         16            0            0
66   155      −2          15          4          225          −30
68   119       0         −21          0          441            0
70   175       2          35          4         1225           70
72   145       4           5         16           25           20
74   197       6          57         36         3249          342
76   150       8          10         64          100           80

Sum = 612, 1260;  m_x = 68, m_y = 140;  ss_xx = 240, ss_yy = 10426, ss_xy = 1200

b = ss_xy/ss_xx = 1200/240 = 5;  a = m_y − b·m_x = 140 − 5(68) = −200
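The table's bookkeeping can be reproduced in a few lines (a sketch using the same nine points):

```python
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
m_x, m_y = sum(x) / N, sum(y) / N                         # 68, 140

ss_xx = sum((a - m_x) ** 2 for a in x)                    # 240
ss_yy = sum((b - m_y) ** 2 for b in y)                    # 10426
ss_xy = sum((a - m_x) * (b - m_y) for a, b in zip(x, y))  # 1200

b = ss_xy / ss_xx      # slope: 1200/240 = 5
a = m_y - b * m_x      # intercept: 140 - 5(68) = -200

assert abs(b - 5.0) < 1e-9 and abs(a - (-200.0)) < 1e-9
```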


Confidence intervals on α and β

s_e = sqrt((ss_yy − b·ss_xy)/(N−2)) = sqrt((10426 − (5)(1200))/7) = 25.15

SE(a) = s_e·sqrt(1/N + m_x²/ss_xx) = 25.15·sqrt(1/9 + 68²/240) = 110.71

SE(b) = s_e/sqrt(ss_xx) = 25.15/sqrt(240) = 1.62

t_crit for 95% confidence = 2.36, so

α = −200 ± (2.36)(110.71)
β = 5 ± (2.36)(1.62) = 5 ± 3.82

It looks like β is significantly different from 0.

Testing the hypothesis that ρ ≠ 0

r = cov(x, y)/(s_x·s_y) = (ss_xy/N)/sqrt((ss_xx/N)(ss_yy/N)) = 1200/sqrt((240)(10426)) = 0.76

t_obt = r·sqrt(N−2)/sqrt(1−r²) = 0.76·sqrt(7)/sqrt(1−0.58) ≈ 3.10

t_crit = 2.36, so ρ is significantly different from 0.

Confidence intervals on ŷ and y_new, for x = 76

α + β·x_o = a + b·x_o ± t_crit·SE(ŷ)

SE(ŷ) = s_e·sqrt(1/N + (x_o − m_x)²/ss_xx) = (25.15)·sqrt(1/9 + (76−68)²/240) = 15.46

So α + β·x_o = −200 + 5(76) ± (2.36)·SE(ŷ) = 180 ± 36.49 lbs

SE(y_new) = (25.15)·sqrt(1 + 1/N + (x_new − m_x)²/ss_xx) = 29.52

So y_new = 180 ± (2.36)(29.52) ≈ 180 ± 70 lbs
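These interval computations can be sketched directly (same example data; t_crit = 2.36 as above):

```python
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
m_x, m_y = sum(x) / N, sum(y) / N
ss_xx = sum((a - m_x) ** 2 for a in x)
ss_yy = sum((b - m_y) ** 2 for b in y)
ss_xy = sum((a - m_x) * (b - m_y) for a, b in zip(x, y))
b = ss_xy / ss_xx
a = m_y - b * m_x
s_e = ((ss_yy - b * ss_xy) / (N - 2)) ** 0.5

x_o = 76
y_hat = a + b * x_o                                          # 180
SE_mean = s_e * (1 / N + (x_o - m_x) ** 2 / ss_xx) ** 0.5    # ~15.46
SE_new = s_e * (1 + 1 / N + (x_o - m_x) ** 2 / ss_xx) ** 0.5 # ~29.52

t_crit = 2.36
ci_mean = (y_hat - t_crit * SE_mean, y_hat + t_crit * SE_mean)
ci_new = (y_hat - t_crit * SE_new, y_hat + t_crit * SE_new)

# Individual predictions are more uncertain than the mean response
assert SE_new > SE_mean
```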

A last note

The test for ρ ≠ 0 is actually the same as the test for β ≠ 0. This should make sense to you, since if ρ = 0, you would also expect the slope to be 0.

Want an extra point added to your midterm score? Prove that the two tests are the same (in general, not just for this example). Submit it with your next homework.
