Transcript of 16_corr_regres3.pdf

8/10/2019 16_corr_regres3.pdf
Correlation & Regression, III
9.07, 4/6/2004
Review
Linear regression refers to fitting a best-fit line y = a + bx to the bivariate data (x, y), where

a = m_y - b m_x
b = cov(x, y)/s_x^2 = ss_xy/ss_xx

Correlation, r, is a measure of the strength and direction (positive vs. negative) of the relationship between x and y:

r = cov(x, y)/(s_x s_y)

(There are various other computational formulas, too.)
Outline
Relationship between correlation and regression, along with notes on the correlation coefficient
Effect size, and the meaning of r
Other kinds of correlation coefficients
Confidence intervals on the parameters of correlation and regression
Relationship between r and regression

r = cov(x, y)/(s_x s_y). In regression, the slope b = cov(x, y)/s_x^2.

So we could also write b = r (s_y/s_x). This means b = r when s_x = s_y.
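This identity is easy to check numerically. A minimal Python sketch, using the height/weight data from the worked example at the end of these slides and the slides' divide-by-N definitions of s_x, s_y, and cov:

```python
import math

# Height (in) and weight (lbs) data from the worked example later in these slides
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
mx, my = sum(x) / N, sum(y) / N

# Divide-by-N ("population"-style) statistics, as used in the slides
cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / N
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / N)
sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / N)

b = cov_xy / sx ** 2      # regression slope
r = cov_xy / (sx * sy)    # correlation coefficient

print(b, r * sy / sx)     # the same number: b = r * (s_y / s_x)
```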
1
-
8/10/2019 16_corr_regres3.pdf
2/16
Notes on the correlation coefficient, r

1. The correlation coefficient is the slope (b) of the regression line when both the X and Y variables have been converted to z-scores, i.e. when s_x = s_y = 1. Or more generally, when s_x = s_y. For a given s_x and s_y, the larger the size of the correlation coefficient, the steeper the slope.

2. The correlation coefficient is invariant under linear transformations of x and/or y. (r is the average of z_x z_y, and z_x and z_y are invariant to linear transformations of x and/or y.)

Invariance of r to linear transformations of x and y

A linear change in scale of either x or y will not change r. E.g. converting height to meters and weight to kilograms will not change r. This is just the sort of nice behavior we'd like from a measure of the strength of the relationship: if you can predict height in inches from weight in lbs, you can just as well predict height in meters from weight in kilograms.
How do correlation (r) and regression differ?

While in regression the emphasis is on predicting one variable from the other, in correlation the emphasis is on the degree to which a linear model may describe the relationship between two variables.

The regression equation depends upon which variable we choose as the explanatory variable, and which as the variable we wish to predict. The correlation equation is symmetric with respect to x and y: switch them and r stays the same.
Correlation is symmetric wrt x & y, but regression is not

Regressing y on x: a = m_y - b m_x, b = cov(x, y)/s_x^2, r = cov(x, y)/(s_x s_y)
Regressing x on y: a' = m_x - b' m_y, b' = cov(x, y)/s_y^2, r = cov(x, y)/(s_x s_y)

To look out for, when calculating r:

In regression, we had to watch out for outliers and extreme points, because they could have an undue influence on the results. In correlation, the key thing to be careful of is not to artificially limit the range of your data, as this can lead to inaccurate estimates of the strength of the relationship (as well as give poor linear fits in regression). Often it gives an underestimate of r, though not always.

[Scatter plots of weight (lbs) vs. height (inches): correlation over a normal range, r = 0.71; correlation over a narrow range of heights, r = 0.62.]
Correlation over a limited range

A limited range will often (though not always) lead to an underestimate of the strength of the association between the two variables.

[Scatter plot: weight vs. height (inches), over a limited range of heights, roughly 60-75 in.]
The meaning of r
We've already talked about r indicating both whether the relationship between x and y is positive or negative, and the strength of the relationship. The correlation coefficient, r, also has meaning as a measure of effect size.
Outline
Relationship between correlation and regression, along with notes on the correlation coefficient
Effect size, and the meaning of r
Other kinds of correlation coefficients
Confidence intervals on the parameters of correlation and regression
Effect size
When we talked about effect size before, it was in the context of a two-sample hypothesis test for a difference in the mean. If there was a significant difference, we decided it was likely there was a real systematic difference between the two samples.

Measures of effect size attempt to get at how big this systematic effect is, in an attempt to begin to answer the question "how important is it?"
Effect size & regression

In the case of linear regression, the systematic effect refers to the linear relationship between x and y. A measure of effect size should get at how important (how strong) this relationship is. The fact that we're talking about strength of relationship should be a hint that effect size will have something to do with r.

Predicting the value of y

If x is correlated with y, then the situation might look like this:

[Scatter plot of y vs. x]
The meaning of r and effect size

When we talked about two-sample tests, one particularly useful measure of effect size was the proportion of the variance in y accounted for by knowing x. (You might want to review this, to see the similarity to the development on the following slides.) The reasoning went something like this, where here it's been adapted to the case of linear regression:

Predicting the value of y

Suppose I pick a random individual from this scatter plot, don't tell you which, and ask you to estimate y for that individual. It would be hard to guess! The best you could probably hope for is to guess the mean of all the y values (at least your error would be 0 on average).

[Scatter plot of y vs. x]
How far off would your guess be?

The variance about the mean y score, s_y^2, gives a measure of your uncertainty about the y scores.

Predicting the value of y when you know x

Now suppose that I told you the value of x, and again asked you to predict y. This would be somewhat easier, because you could use regression to predict a good guess for y, given x.

Your best guess is ŷ, the predicted value of y given x. (Recall that regression attempts to fit the best-fit line through the average y for each x. So the best guess is still a mean, but it's the mean y given x.)

How far off would your guess be, now?

The variance about the mean score for that value of x gives a measure of your uncertainty. Under the assumption of homoscedasticity, that measure of uncertainty is s_y'^2, where s_y' is the rms error = sqrt(Σ(y_i - ŷ_i)^2 / N).
The strength of the relationship between x and y is reflected in the extent to which knowing x reduces your uncertainty about y.

Reduction in uncertainty = s_y^2 - s_y'^2
Relative reduction in uncertainty: (s_y^2 - s_y'^2)/s_y^2

This is the proportion of variance in y accounted for by x:

(total variation - variation left over)/(total variation) = (variation accounted for)/(total variation)

Unpacking the equation for the proportion of variance accounted for, (s_y^2 - s_y'^2)/s_y^2

First, unpack s_y'^2:

s_y'^2 = (1/N) Σ (y_i - ŷ_i)^2, where ŷ_i = a + b x_i = (m_y - b m_x) + b x_i = m_y + (cov(x, y)/s_x^2)(x_i - m_x)

So:

s_y'^2 = (1/N) Σ ((y_i - m_y) - (cov(x, y)/s_x^2)(x_i - m_x))^2
       = (1/N) Σ (y_i - m_y)^2 + (cov(x, y)^2/s_x^4) (1/N) Σ (x_i - m_x)^2 - 2 (cov(x, y)/s_x^2) (1/N) Σ (x_i - m_x)(y_i - m_y)
       = s_y^2 + cov(x, y)^2/s_x^2 - 2 cov(x, y)^2/s_x^2
       = s_y^2 - cov(x, y)^2/s_x^2

Therefore:

(s_y^2 - s_y'^2)/s_y^2 = (s_y^2 - s_y^2 + cov(x, y)^2/s_x^2)/s_y^2 = cov(x, y)^2/(s_x^2 s_y^2) = r^2 !!
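The conclusion of this derivation can be verified numerically. A sketch in Python, using the height/weight data from the worked example at the end of these slides (s_y'^2 here is the variance of the residuals about the regression line):

```python
import math

x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
mx, my = sum(x) / N, sum(y) / N

cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / N
var_x = sum((xi - mx) ** 2 for xi in x) / N
var_y = sum((yi - my) ** 2 for yi in y) / N

b = cov_xy / var_x
a = my - b * mx
y_hat = [a + b * xi for xi in x]

# s_y'^2: variance of y about the regression line
var_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / N

prop_explained = (var_y - var_resid) / var_y
r = cov_xy / math.sqrt(var_x * var_y)
print(prop_explained, r ** 2)  # equal: proportion of variance accounted for = r^2
```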
7
-
8/10/2019 16_corr_regres3.pdf
8/16
r^2

The squared correlation coefficient (r^2) is the proportion of variance in y that can be accounted for by knowing x. Conversely, since r is symmetric with respect to x and y, r^2 is also the proportion of variance in x that can be accounted for by knowing y. r^2 is a measure of the size of the effect described by the linear relationship.
Weight as a Function of Height

[Scatter plot: weight (lbs.) vs. height (in.), with best-fit line W = 4H - 117, r^2 = 0.23.]

The linear relationship accounts for 23% of the variation in the data.

Height as a Function of Weight

[Scatter plot: height (in.) vs. weight (lbs.), with best-fit line H = 0.06W + 58, r^2 = 0.23.]

Again, accounting for 23% of the variability in the data.
Behavior and meaning of the correlation coefficient

Before, we talked about r as if it were a measure of the spread about the regression line, but this isn't quite true.

If you keep the spread about the regression line the same, but increase the slope of the line, r^2 increases. The correlation coefficient for zero slope will be 0, regardless of the amount of scatter about the line.
Behavior and meaning of the correlation coefficient

Best way to think of the correlation coefficient: r^2 is the % of variance in Y accounted for by the regression of Y on X. As you increase the slope of the regression line, the total variance to be explained goes up. Therefore the unexplained variance (the spread about the regression line) goes down relative to the total variance. Therefore, r^2 increases.

[Two scatter plots with the same spread about the line but different slopes: R^2 = 0.75 vs. R^2 = 0.9231.]

If the slope of the regression line is 0, then 0% of the variability in the data is accounted for by the linear relationship. So r^2 = 0.

[Scatter plot with a zero-slope regression line: R^2 = 0.]
Other correlation coefficients

This percent of variance accounted for is a useful concept. We talked about it before in talking about effect size. One thing it lets us do is compare effect sizes across very different kinds of experiments. Different kinds of experiments -> different correlation coefficients.

r_pb: The point-biserial correlation coefficient

This was the correlation coefficient we used when measuring effect size for a two-sample test. It is used when one of your variables takes on only two possible values (a dichotomous variable). In the two-sample case, the two possible values corresponded to the two experimental groups or conditions you wished to compare. E.g. are men significantly taller than women?
9
-
8/10/2019 16_corr_regres3.pdf
10/16
The correlation/regression associated with r_pb

[Plot: MIT ratings vs. the dichotomous variable "Significant other?" (No/Yes).]

r_s: The Spearman rank-order correlation coefficient

Describes the linear relationship between two variables when measured by ranked scores. E.g. instead of using the actual height of a person, we might look at the rank of their height: are they the tallest in the group? The 2nd tallest? Compare this with their ranking in weight.

Often used in behavioral research because a variable is difficult to measure quantitatively. May be used to compare observer A's ranking (e.g. of the aggressiveness of each child) to observer B's ranking.

Scatter plot for ranking data

[Scatter plot: Observer B's rankings vs. Observer A's rankings, for rankings of the aggressiveness of 9 children; 1 = most aggressive.]

Three main types of correlation coefficients: summary

Pearson product-moment correlation coefficient: the standard correlation coefficient, r. Used for typical quantitative variables.
Point-biserial correlation coefficient, r_pb: used when one variable is dichotomous.
Spearman rank-order correlation coefficient, r_s: used when data is ordinal (ranked) (1st, 2nd, 3rd, ...).
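A sketch of the Spearman coefficient in Python: r_s is just Pearson's r computed on the ranks. The rankings below are hypothetical, made up for illustration. For already-ranked data with no ties, the standard shortcut formula r_s = 1 - 6Σd^2/(n(n^2-1)) gives the same answer:

```python
def pearson_r(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / n
    su = (sum((ui - mu) ** 2 for ui in u) / n) ** 0.5
    sv = (sum((vi - mv) ** 2 for vi in v) / n) ** 0.5
    return cov / (su * sv)

# Hypothetical rankings of 9 children's aggressiveness by two observers
# (1 = most aggressive)
obs_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
obs_b = [2, 1, 4, 3, 6, 5, 8, 7, 9]
n = len(obs_a)

r_s = pearson_r(obs_a, obs_b)  # Spearman r_s: Pearson's r on the ranks

# Shortcut formula for untied ranks: r_s = 1 - 6*sum(d^2) / (n*(n^2 - 1))
d2 = sum((ai - bi) ** 2 for ai, bi in zip(obs_a, obs_b))
r_s_shortcut = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(r_s, r_s_shortcut)  # the two computations agree
```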
10
-
8/10/2019 16_corr_regres3.pdf
11/16
Outline
Relationship between correlation and regression, along with notes on the correlation coefficient
Effect size, and the meaning of r
Other kinds of correlation coefficients
Confidence intervals on the parameters of correlation and regression
Confidence intervals and hypothesis testing on a, b, ŷ, and r

Best fit line for the data (the sample)

With regression analysis, we have found the line y = a + bx that best fits our data. Our model for the data was that it is approximated by a linear relationship plus normally-distributed random error, e, i.e.

y = a + bx + e

Best fit line for the population

As you might imagine, we can think of our points (x_i, y_i) as samples from a population. Then our best-fit line is our estimate of the true best-fit line for the population. The regression model for the population as a whole:

y = α + βx + ε

where α and β are the parameters we want to estimate, and ε is the random error. Assume that for all x, the errors ε are independent, normal, of equal σ, and mean = 0.
Confidence intervals for α and β

a and b are unbiased estimators for α and β, given our sample. Different samples yield different regression lines. These lines are distributed around

y = α + βx + ε

How are a and b distributed around α and β? If we know this, we can construct confidence intervals and do hypothesis testing.

An estimator, s_e, for σ

The first thing we need is an estimator for the standard deviation of the error (the spread), σ, about the true regression line. A decent guess would be our rms error, rms = sqrt(Σ(y_i - ŷ_i)^2 / N) = s_y'. We will use a modification of this estimate:

s_e = sqrt(Σ(y_i - ŷ_i)^2 / (N-2))

Why N-2? We used up two degrees of freedom in calculating a and b, leaving N-2 independent pieces of information to estimate σ.

An alternative computational formula: s_e = sqrt((ss_yy - b·ss_xy)/(N-2))

Recall some notation:

ss_xx = Σ(x_i - m_x)^2
ss_yy = Σ(y_i - m_y)^2
ss_xy = Σ(x_i - m_x)(y_i - m_y)
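The two formulas for s_e can be checked against each other. A Python sketch, using the height/weight data from the worked example below:

```python
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
mx, my = sum(x) / N, sum(y) / N

ss_xx = sum((xi - mx) ** 2 for xi in x)
ss_yy = sum((yi - my) ** 2 for yi in y)
ss_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

b = ss_xy / ss_xx
a = my - b * mx

# Direct definition: residual sum of squares divided by N-2, not N
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se_direct = (sse / (N - 2)) ** 0.5

# Alternative computational formula
se_shortcut = ((ss_yy - b * ss_xy) / (N - 2)) ** 0.5
print(se_direct, se_shortcut)  # both ~25.15
```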
Confidence intervals for α and β have the usual form:

α = a ± t_crit SE(a)
β = b ± t_crit SE(b)

where we use the t-distribution with N-2 degrees of freedom, for the same reason as given above. Why are a and b distributed according to a t-distribution? This is complicated, and is offered without proof. Take it on faith (or, I suppose, try it out in MATLAB). So, we just need to know what SE(a) and SE(b) are.

SE(a) and SE(b)

SE(a) = s_e sqrt(1/N + m_x^2/ss_xx)
SE(b) = s_e/sqrt(ss_xx)

These equations look rather unfamiliar; just take them on faith, too. But some intuition: where does this ss_xx come from? Usually we had a 1/sqrt(N) in our SE equations; now we have a 1/sqrt(ss_xx). Like N, ss_xx increases as we add more data points. ss_xx also takes into account the spread of the x data. Why care about the spread of the x data? If all data points had the same x value, we'd be unjustified in drawing any conclusions about α or β, or in making predictions for any other value of x. If ss_xx = 0, our confidence intervals would be infinitely wide, to reflect that uncertainty. (See picture on board.)

So, you now know all you need to get confidence intervals for α and β: just plug into the equations above. What about hypothesis testing? Just as before when we talked about confidence intervals and hypothesis testing, we can turn our confidence interval equation into the equation for hypothesis testing. E.g. is the population slope greater than 0?
13
-
8/10/2019 16_corr_regres3.pdf
14/16
Testing whether the slope is greater than 0

H_0: β = 0, H_a: β > 0
t_obt = b/SE(b)

Compare t_obt to t_crit at the desired level of significance (for, in this case, a one-tailed test). Look up t_crit in your t-tables with N-2 degrees of freedom.
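A sketch of this one-tailed slope test in Python, using the worked-example data from these slides. The critical value 1.895 (one-tailed, α = 0.05, df = 7) is taken from a standard t-table:

```python
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
mx, my = sum(x) / N, sum(y) / N

ss_xx = sum((xi - mx) ** 2 for xi in x)
ss_yy = sum((yi - my) ** 2 for yi in y)
ss_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

b = ss_xy / ss_xx
se = ((ss_yy - b * ss_xy) / (N - 2)) ** 0.5
SE_b = se / ss_xx ** 0.5

t_obt = b / SE_b
t_crit = 1.895  # one-tailed, alpha = 0.05, df = N - 2 = 7 (from a t-table)
print(t_obt, t_obt > t_crit)  # t_obt ~3.08 > t_crit: reject H0 in favor of beta > 0
```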
What about confidence intervals for the mean response ŷ at x = x_o?

The confidence interval for ŷ = a + b x_o is

α + β x_o = a + b x_o ± t_crit SE(ŷ), where SE(ŷ) = s_e sqrt(1/N + (x_o - m_x)^2/ss_xx)

Note the (x_o - m_x) term. The regression line always passes through (m_x, m_y). If you wiggle the regression line (because you're unsure of a and b), it makes more of a difference the farther you are from the mean. (See picture on board.)

Versus confidence intervals on an individual's response, y_new, given x_new

The above gives confidence intervals for the predicted mean response at x = x_o. You can also find confidence intervals for an individual's predicted y value, y_new, given their x value, x_new. As you might imagine, SE(y_new) is bigger than SE(ŷ): first there's variability in the predicted mean (given by SE(ŷ)); then, on top of that, there's the variability of y about the mean. (See picture on board.)

SE(y_new)

These variances add, so

SE(y_new)^2 = SE(ŷ)^2 + s_e^2
SE(y_new) = s_e sqrt(1/N + (x_new - m_x)^2/ss_xx + 1)
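The two interval widths can be compared directly. A Python sketch for the prediction at x = 76, using the worked-example data (t_crit = 2.365 for 95% confidence, df = 7, from a t-table):

```python
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
mx, my = sum(x) / N, sum(y) / N

ss_xx = sum((xi - mx) ** 2 for xi in x)
ss_yy = sum((yi - my) ** 2 for yi in y)
ss_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
b = ss_xy / ss_xx
a = my - b * mx
se = ((ss_yy - b * ss_xy) / (N - 2)) ** 0.5

x_o = 76
y_hat = a + b * x_o
SE_mean = se * (1 / N + (x_o - mx) ** 2 / ss_xx) ** 0.5     # mean response
SE_new = se * (1 + 1 / N + (x_o - mx) ** 2 / ss_xx) ** 0.5  # individual response

t_crit = 2.365  # 95% confidence, df = N - 2 = 7
print(y_hat, t_crit * SE_mean, t_crit * SE_new)  # 180, ~36.5, ~70
```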
14
-
8/10/2019 16_corr_regres3.pdf
15/16
Homework notes
Problems 5 and 9 both sound like they could be asking for either confidence intervals on ŷ, or on y_new. They are somewhat ambiguously worded.

Assume that problem 5 is asking for confidence intervals on ŷ, and problem 9 is asking for confidence intervals on y_new.
Examples

Hypothesis testing on r

We often want to know if there is a significant correlation between x and y. For this, we want to test whether the population parameter, ρ, is significantly different from 0. It turns out that for the null hypothesis H_0: ρ = 0,

r·sqrt(N-2)/sqrt(1-r^2)

has a t distribution, with N-2 degrees of freedom. So, just compare t_obt = r·sqrt(N-2)/sqrt(1-r^2) with t_crit from your t-tables.

Don't use this test to test any other null hypothesis (e.g. H_0: ρ = ρ_0 for some ρ_0 ≠ 0). For that you need a different test statistic.
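A sketch of this test in Python, again with the worked-example data (two-tailed t_crit = 2.365 for α = 0.05, df = 7, from a t-table):

```python
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
mx, my = sum(x) / N, sum(y) / N

ss_xx = sum((xi - mx) ** 2 for xi in x)
ss_yy = sum((yi - my) ** 2 for yi in y)
ss_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

r = ss_xy / (ss_xx * ss_yy) ** 0.5
t_obt = r * (N - 2) ** 0.5 / (1 - r ** 2) ** 0.5

t_crit = 2.365  # two-tailed, alpha = 0.05, df = N - 2 = 7
print(r, t_obt, abs(t_obt) > t_crit)  # significant correlation
```

Note that t_obt here comes out the same as b/SE(b) from the slope test, which is exactly the point made in the last slide of this lecture.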
Previous example: predicting weight from height

x_i   y_i   (x_i-m_x)  (y_i-m_y)  (x_i-m_x)^2  (y_i-m_y)^2  (x_i-m_x)(y_i-m_y)
60    84    -8         -56        64           3136         448
62    95    -6         -45        36           2025         270
64    140   -4         0          16           0            0
66    155   -2         15         4            225          -30
68    119   0          -21        0            441          0
70    175   2          35         4            1225         70
72    145   4          5          16           25           20
74    197   6          57         36           3249         342
76    150   8          10         64           100          80

Sum = 612, 1260; m_x = 68, m_y = 140; ss_xx = 240, ss_yy = 10426, ss_xy = 1200

b = ss_xy/ss_xx = 1200/240 = 5; a = m_y - b m_x = 140 - 5(68) = -200
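The table's sums and the resulting fit can be reproduced in a few lines of Python:

```python
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
mx, my = sum(x) / N, sum(y) / N               # 68, 140

ss_xx = sum((xi - mx) ** 2 for xi in x)        # 240
ss_yy = sum((yi - my) ** 2 for yi in y)        # 10426
ss_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # 1200

b = ss_xy / ss_xx        # 5
a = my - b * mx          # -200
print(mx, my, ss_xx, ss_yy, ss_xy, b, a)
```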
Confidence intervals on α and β

s_e = sqrt((ss_yy - b·ss_xy)/(N-2)) = sqrt((10426 - (5)(1200))/7) = 25.15
SE(a) = s_e sqrt(1/N + m_x^2/ss_xx) = 25.15 sqrt(1/9 + 68^2/240) = 110.71
SE(b) = s_e/sqrt(ss_xx) = 25.15/sqrt(240) = 1.62
t_crit for 95% confidence = 2.36, so
α = -200 ± (2.36)(110.71)
β = 5 ± (2.36)(1.62) = 5 ± 3.82

It looks like β is significantly different from 0.

Testing the hypothesis that ρ ≠ 0

r = cov(x, y)/(s_x s_y) = (ss_xy/N)/sqrt((ss_xx/N)(ss_yy/N)) = 1200/sqrt((240)(10426)) = 0.76
t_obt = r·sqrt(N-2)/sqrt(1-r^2) = 0.76 sqrt(7)/sqrt(1-0.58) ≈ 3.10
t_crit = 2.36, so ρ is significantly different from 0.

Confidence intervals on ŷ and y_new, for x = 76

α + β x_o = a + b x_o ± t_crit SE(ŷ)
SE(ŷ) = s_e sqrt(1/N + (x_o - m_x)^2/ss_xx) = (25.15) sqrt(1/9 + (76-68)^2/240) = (25.15)(0.61) = 15.46
So α + β x_o = -200 + 5(76) ± (2.36) SE(ŷ) = 180 ± 36.49 lbs

SE(y_new) = (25.15) sqrt(1 + 1/N + (x_new - m_x)^2/ss_xx) = 29.52
So y_new = 180 ± (2.36)(29.52) ≈ 180 ± 70 lbs
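All of the numbers above can be checked in Python (t_crit = 2.365 from a t-table, df = 7; small differences from the slide values come from rounding s_e to 25.15):

```python
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
mx, my = sum(x) / N, sum(y) / N

ss_xx = sum((xi - mx) ** 2 for xi in x)
ss_yy = sum((yi - my) ** 2 for yi in y)
ss_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
b = ss_xy / ss_xx
a = my - b * mx

se = ((ss_yy - b * ss_xy) / (N - 2)) ** 0.5        # ~25.15
SE_a = se * (1 / N + mx ** 2 / ss_xx) ** 0.5       # ~110.7
SE_b = se / ss_xx ** 0.5                           # ~1.62
r = ss_xy / (ss_xx * ss_yy) ** 0.5                 # ~0.76

t_crit = 2.365  # 95% confidence, df = N - 2 = 7
print(f"alpha = {a} +/- {t_crit * SE_a:.1f}")
print(f"beta  = {b} +/- {t_crit * SE_b:.2f}")
print(f"t_obt for rho: {r * (N - 2) ** 0.5 / (1 - r ** 2) ** 0.5:.2f}")
```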
A last note

The test for ρ ≠ 0 is actually the same as the test for β ≠ 0. This should make sense to you, since if ρ ≠ 0, you would also expect the slope to be ≠ 0.

Want an extra point added to your midterm score? Prove that the two tests are the same (in general, not just for this example). Submit it with your next homework.