16_corr_regres3.pdf

Transcript of 16_corr_regres3.pdf (8/10/2019)

    Correlation & Regression, III

9.07, 4/6/2004

    Review

Linear regression refers to fitting a best-fit line y = a + bx to the bivariate data (x, y), where

a = m_y − b·m_x
b = cov(x, y)/s_x² = ss_xy/ss_xx

Correlation, r, is a measure of the strength and direction (positive vs. negative) of the relationship between x and y:

r = cov(x, y)/(s_x·s_y)

(There are various other computational formulas, too.)

    Outline

Relationship between correlation and regression, along with notes on the correlation coefficient

Effect size, and the meaning of r

Other kinds of correlation coefficients

Confidence intervals on the parameters of correlation and regression

Relationship between r and regression

r = cov(x, y)/(s_x·s_y)

In regression, the slope b = cov(x, y)/s_x²

So we could also write b = r·(s_y/s_x). This means b = r when s_x = s_y.
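As a numerical sanity check (a sketch, not from the slides; the small data set is made up), the identity b = r·(s_y/s_x) can be verified directly:

```python
# Helper functions use the population (divide-by-N) forms, matching the
# slides' definitions of cov, s_x, and s_y.
def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def sd(v):
    return cov(v, v) ** 0.5

x = [1, 2, 3, 4, 5]   # hypothetical data, for illustration only
y = [2, 1, 4, 3, 6]

b = cov(x, y) / sd(x) ** 2        # regression slope, b = cov(x, y)/s_x^2
r = cov(x, y) / (sd(x) * sd(y))   # correlation coefficient

# b and r*(s_y/s_x) agree to floating-point precision
assert abs(b - r * (sd(y) / sd(x))) < 1e-12
```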

    1

  • 8/10/2019 16_corr_regres3.pdf

    2/16

Notes on the correlation coefficient, r

1. The correlation coefficient is the slope (b) of the regression line when both the X and Y variables have been converted to z-scores, i.e. when s_x = s_y = 1, or more generally, when s_x = s_y.

For a given s_x and s_y, the larger the size of the correlation coefficient, the steeper the slope.

Invariance of r to linear transformations of x and y

A linear change in scale of either x or y will not change r.

E.g. converting height to meters and weight to kilograms will not change r. This is just the sort of nice behavior we'd like from a measure of the strength of the relationship. If you can predict height in inches from weight in lbs, you can just as well predict height in meters from weight in kilograms.
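A quick sketch of this invariance, using the height/weight numbers from the worked example later in these notes (the unit-conversion factors are the only added assumption):

```python
def mean(v):
    return sum(v) / len(v)

def corr(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

height_in = [60, 62, 64, 66, 68, 70, 72, 74, 76]
weight_lb = [84, 95, 140, 155, 119, 175, 145, 197, 150]

# Linear changes of scale: inches -> meters, lbs -> kg
height_m = [h * 0.0254 for h in height_in]
weight_kg = [w * 0.453592 for w in weight_lb]

r_original = corr(height_in, weight_lb)
r_converted = corr(height_m, weight_kg)

assert abs(r_original - r_converted) < 1e-9   # r is unchanged
```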

Notes on the correlation coefficient, r

2. The correlation coefficient is invariant under linear transformations of x and/or y. (r is the average of z_x·z_y, and z_x and z_y are invariant to linear transformations of x and/or y.)

How do correlation (r) and regression differ?

While in regression the emphasis is on predicting one variable from the other, in correlation the emphasis is on the degree to which a linear model may describe the relationship between two variables.

The regression equation depends upon which variable we choose as the explanatory variable, and which as the variable we wish to predict.

The correlation equation is symmetric with respect to x and y: switch them and r stays the same.


Correlation is symmetric wrt x & y, but regression is not

Regressing y on x:  a = m_y − b·m_x,  b = cov(x, y)/s_x²,  r = cov(x, y)/(s_x·s_y)
Regressing x on y:  a = m_x − b·m_y,  b = cov(x, y)/s_y²,  r = cov(x, y)/(s_x·s_y)

To look out for, when calculating r:

In regression, we had to watch out for outliers and extreme points, because they could have an undue influence on the results. In correlation, the key thing to be careful of is not to artificially limit the range of your data, as this can lead to inaccurate estimates of the strength of the relationship (as well as give poor linear fits in regression). Often it gives an underestimate of r, though not always.

[Figures: two scatter plots of Weight (lbs) vs. Height (inches). Correlation over a normal range: r = 0.71. Correlation over a narrow range of heights: r = 0.62.]
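A sketch of this range-restriction effect, reusing the height/weight data from the worked example later in the notes. The cutoff at 68 inches is arbitrary, chosen only to illustrate that a narrower range can shrink r:

```python
def mean(v):
    return sum(v) / len(v)

def corr(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

height = [60, 62, 64, 66, 68, 70, 72, 74, 76]
weight = [84, 95, 140, 155, 119, 175, 145, 197, 150]

r_full = corr(height, weight)

# Artificially restrict the range: keep only heights of 68 inches or more
pairs = [(h, w) for h, w in zip(height, weight) if h >= 68]
h_narrow = [h for h, _ in pairs]
w_narrow = [w for _, w in pairs]
r_narrow = corr(h_narrow, w_narrow)

# For this data and cutoff, restriction underestimates the strength
assert r_narrow < r_full
```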


Correlation over a limited range

A limited range will often (though not always) lead to an underestimate of the strength of the association between the two variables.

[Figure: scatter plot of weight (lbs, 100-200) vs. height (inches, 60-75) over a limited range of heights.]

The meaning of r

We've already talked about r indicating both whether the relationship between x and y is positive or negative, and the strength of the relationship. The correlation coefficient, r, also has meaning as a measure of effect size.

Outline

Relationship between correlation and regression, along with notes on the correlation coefficient

Effect size, and the meaning of r

Other kinds of correlation coefficients

Confidence intervals on the parameters of correlation and regression

    Effect size

When we talked about effect size before, it was in the context of a two-sample hypothesis test for a difference in the mean.

If there was a significant difference, we decided it was likely there was a real systematic difference between the two samples. Measures of effect size attempt to get at how big this systematic effect is, in an attempt to begin to answer the question "how important is it?"


Effect size & regression

In the case of linear regression, the systematic effect refers to the linear relationship between x and y.

A measure of effect size should get at how important (how strong) this relationship is. The fact that we're talking about strength of relationship should be a hint that effect size will have something to do with r.

Predicting the value of y

If x is correlated with y, then the situation might look like this:

[Figure: scatter plot of y vs. x.]

The meaning of r and effect size

When we talked about two-sample tests, one particularly useful measure of effect size was the proportion of the variance in y accounted for by knowing x.

(You might want to review this, to see the similarity to the development on the following slides.)

The reasoning went something like this, where here it's been adapted to the case of linear regression:

Predicting the value of y

Suppose I pick a random individual from this scatter plot, don't tell you which, and ask you to estimate y for that individual.

It would be hard to guess! The best you could probably hope for is to guess the mean of all the y values (at least your error would be 0 on average).

[Figure: scatter plot of y vs. x.]


How far off would your guess be?

[Figure: scatter plot of y vs. x with a horizontal line at the mean of y.]

Predicting the value of y when you know x

Now suppose that I told you the value of x, and again asked you to predict y. This would be somewhat easier, because you could use regression to predict a good guess for y, given x.

Your best guess is ŷ, the predicted value of y given x.

(Recall that regression attempts to fit the best-fit line through the average y for each x. So the best guess is still a mean, but it's the mean y given x.)

[Figure: scatter plot of y vs. x with the regression line drawn in.]

How far off would your guess be, now?

The variance about the mean score for that value of x gives a measure of your uncertainty. Under the assumption of homoscedasticity, that measure of uncertainty is s_y'², where s_y' is the rms error = sqrt(Σ(y_i − ŷ_i)²/N).

(Compare: the variance about the mean y score, s_y², gives a measure of your uncertainty about the y scores overall.)


The strength of the relationship between x and y is reflected in the extent to which knowing x reduces your uncertainty about y.

Reduction in uncertainty = s_y² − s_y'²

Relative reduction in uncertainty: (s_y² − s_y'²)/s_y²

This is the proportion of variance in y accounted for by x:

(total variation − variation left over)/(total variation) = (variation accounted for)/(total variation)

Unpacking the equation for the proportion of variance accounted for, (s_y² − s_y'²)/s_y²

First, unpack s_y'²:

s_y'² = (1/N) Σ (y_i − ŷ_i)², where ŷ_i = a + b·x_i = (m_y − b·m_x) + b·x_i = m_y + [cov(x, y)/s_x²]·(x_i − m_x)

So:

s_y'² = (1/N) Σ ((y_i − m_y) − [cov(x, y)/s_x²]·(x_i − m_x))²
      = (1/N) Σ (y_i − m_y)² + [cov(x, y)/s_x²]²·(1/N) Σ (x_i − m_x)² − 2·[cov(x, y)/s_x²]·(1/N) Σ (x_i − m_x)(y_i − m_y)
      = s_y² + cov(x, y)²/s_x² − 2·cov(x, y)²/s_x²
      = s_y² − cov(x, y)²/s_x²

Finally:

(s_y² − s_y'²)/s_y² = (s_y² − s_y² + cov(x, y)²/s_x²)/s_y² = cov(x, y)²/(s_x²·s_y²) = r²  !!
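The algebra above can be checked numerically. This sketch fits the least-squares line to the worked example's height/weight data (from later in the notes) and confirms that (s_y² − s_y'²)/s_y² equals r²:

```python
def mean(v):
    return sum(v) / len(v)

x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
mx, my = mean(x), mean(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / N
sx2 = sum((a - mx) ** 2 for a in x) / N   # s_x^2
sy2 = sum((b - my) ** 2 for b in y) / N   # s_y^2

b = cov / sx2                 # slope
a = my - b * mx               # intercept

# s_y'^2: variance of the residuals about the regression line
resid_var = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / N

r2 = cov ** 2 / (sx2 * sy2)
explained = (sy2 - resid_var) / sy2

assert abs(explained - r2) < 1e-9   # proportion of variance explained = r^2
```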


r²

The squared correlation coefficient (r²) is the proportion of variance in y that can be accounted for by knowing x.

Conversely, since r is symmetric with respect to x and y, r² is also the proportion of variance in x that can be accounted for by knowing y.

r² is a measure of the size of the effect described by the linear relationship.

Weight as a Function of Height

[Figure: scatter plot of Weight (lbs., 50-350) vs. Height (in., 50-80), with best-fit line W = 4H − 117 and r² = 0.23.]

The linear relationship accounts for 23% of the variation in the data.

Height as a Function of Weight

[Figure: scatter plot of Height (in., 50-80) vs. Weight (lbs., 50-350), with best-fit line H = 0.06W + 58 and r² = 0.23.]

Again, accounting for 23% of the variability in the data.

Behavior and meaning of the correlation coefficient

Before, we talked about r as if it were a measure of the spread about the regression line, but this isn't quite true. If you keep the spread about the regression line the same, but increase the slope of the line, r² increases.

The correlation coefficient for zero slope will be 0, regardless of the amount of scatter about the line.


Behavior and meaning of the correlation coefficient

Best way to think of the correlation coefficient: r² is the % of variance in Y accounted for by the regression of Y on X. As you increase the slope of the regression line, the total variance to be explained goes up. Therefore the unexplained variance (the spread about the regression line) goes down relative to the total variance. Therefore, r² increases.

[Figures: two scatter plots with the same spread about the regression line; the shallower slope gives R² = 0.75, the steeper slope gives R² = 0.9231.]

Behavior and meaning of the correlation coefficient

Best way to think of the correlation coefficient: r² is the % of variance in Y accounted for by the regression of Y on X. If the slope of the regression line is 0, then 0% of the variability in the data is accounted for by the linear relationship. So r² = 0.

[Figure: scatter plot with a zero-slope regression line, R² = 0.]

This percent of variance accounted for is a useful concept. One thing it lets us do is compare effect sizes across very different kinds of experiments. Different kinds of experiments -> different correlation coefficients.

Other correlation coefficients

r_pb: The point-biserial correlation coefficient

This was the correlation coefficient we used when measuring effect size for a two-sample test (we talked about it before in talking about effect size). It is used when one of your variables takes on only two possible values (a dichotomous variable). In the two-sample case, the two possible values corresponded to the two experimental groups or conditions you wished to compare. E.g. are men significantly taller than women?
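Computationally, r_pb is just Pearson's r with the dichotomous variable coded numerically (e.g. 0/1). A minimal sketch, with hypothetical heights for two groups (all numbers invented for illustration):

```python
def mean(v):
    return sum(v) / len(v)

def corr(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

group = [0, 0, 0, 0, 1, 1, 1, 1]           # hypothetical: 0 = women, 1 = men
height = [63, 65, 62, 66, 70, 72, 69, 71]  # hypothetical heights (inches)

# Point-biserial r: ordinary Pearson r on the 0/1 codes
r_pb = corr(group, height)

# A large r_pb means group membership accounts for much of the height variance
assert 0 < r_pb < 1
```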


The correlation/regression associated with r_pb

[Figure: plot of ratings (0-10) vs. "Significant other? No/Yes", with the regression line for a dichotomous predictor.]

r_s: The Spearman rank-order correlation coefficient

Describes the linear relationship between two variables when measured by ranked scores. E.g. instead of using the actual height of a person, we might look at the rank of their height: are they the tallest in the group? The 2nd tallest? Compare this with their ranking in weight.

Often used in behavioral research because a variable is difficult to measure quantitatively. May be used to compare observer A's ranking (e.g. of the aggressiveness of each child) to observer B's ranking.

Scatter plot for ranking data

[Figure: Observer B's rankings vs. Observer A's rankings, for rankings of the aggressiveness of 9 children; 1 = most aggressive.]

Three main types of correlation coefficients: summary

Pearson product-moment correlation coefficient: the standard correlation coefficient, r. Used for typical quantitative variables.

Point-biserial correlation coefficient: r_pb. Used when one variable is dichotomous.

Spearman rank-order correlation coefficient: r_s. Used when data is ordinal (ranked) (1st, 2nd, 3rd, ...).
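A sketch of the Spearman coefficient: compute Pearson's r on the ranks. With no ties, this agrees with the shortcut formula r_s = 1 − 6·Σd²/(N(N² − 1)). The two observers' rankings below are invented for illustration:

```python
def mean(v):
    return sum(v) / len(v)

def corr(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(v):
    # rank 1 = smallest value; assumes no ties
    order = sorted(range(len(v)), key=lambda i: v[i])
    rk = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        rk[i] = rank
    return rk

obs_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # observer A's aggressiveness rankings
obs_b = [2, 1, 4, 3, 6, 5, 8, 7, 9]  # observer B's rankings

# Spearman r_s = Pearson r computed on the ranks
r_s = corr(ranks(obs_a), ranks(obs_b))

# Shortcut formula for untied ranks
n = len(obs_a)
d2 = sum((a - b) ** 2 for a, b in zip(obs_a, obs_b))
r_shortcut = 1 - 6 * d2 / (n * (n * n - 1))

assert abs(r_s - r_shortcut) < 1e-12
```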


Outline

Relationship between correlation and regression, along with notes on the correlation coefficient

Effect size, and the meaning of r

Other kinds of correlation coefficients

Confidence intervals on the parameters of correlation and regression

Best fit line for the data (the sample)

With regression analysis, we have found the line y = a + bx that best fits our data.

Our model for the data was that it was approximated by a linear relationship plus normally-distributed random error, e, i.e.

y = a + bx + e

Confidence intervals and hypothesis testing on a, b, ŷ, and r

Best fit line for the population

As you might imagine, we can think of our points (x_i, y_i) as samples from a population.

Then our best fit line is our estimate of the true best fit line for the population. The regression model for the population as a whole:

y = α + β·x + ε

where α and β are the parameters we want to estimate, and ε is the random error. Assume that for all x, the errors ε are independent and normal, with equal σ and mean = 0.


Confidence intervals for α and β

a and b are unbiased estimators for α and β, given our sample. Different samples yield different regression lines. These lines are distributed around

y = α + β·x + ε

How are a and b distributed around α and β? If we know this, we can construct confidence intervals and do hypothesis testing.

An estimator, s_e, for σ(ε)

The first thing we need is an estimator for the standard deviation of the error (the spread) about the true regression line. A decent guess would be our rms error, rms = sqrt(Σ(y_i − ŷ_i)²/N) = s_y'. We will use a modification of this estimate:

s_e = sqrt(Σ(y_i − ŷ_i)²/(N−2))

Why N−2? We used up two degrees of freedom in calculating a and b, leaving N−2 independent pieces of information to estimate σ(ε).

An alternative computational formula: s_e = sqrt((ss_yy − b·ss_xy)/(N−2))

Recall some notation:

ss_xx = Σ(x_i − m_x)²
ss_yy = Σ(y_i − m_y)²
ss_xy = Σ(x_i − m_x)(y_i − m_y)


Confidence intervals for α and β have the usual form:

α = a ± t_crit·SE(a)
β = b ± t_crit·SE(b)

where we use the t-distribution with N−2 degrees of freedom, for the same reason as given in the last slide.

Why are a and b distributed according to a t-distribution? This is complicated, and is offered without proof. Take it on faith (or, I suppose, try it out in MATLAB).

So, we just need to know what SE(a) and SE(b) are.

SE(a) and SE(b)

SE(a) and SE(b) look rather unfamiliar:

SE(a) = s_e·sqrt(1/N + m_x²/ss_xx)
SE(b) = s_e/sqrt(ss_xx)

Just take these equations on faith, too. But some intuition: where does this ss_xx come from? Usually we had a 1/sqrt(N) in our SE equations; now we have a 1/sqrt(ss_xx). Like N, ss_xx increases as we add more data points. ss_xx also takes into account the spread of the x data.

Why care about the spread of the x data? If all data points had the same x value, we'd be unjustified in drawing any conclusions about α or β, or in making predictions for any other value of x. If ss_xx = 0, our confidence intervals would be infinitely wide, to reflect that uncertainty. (See picture on board.)

So, you now know all you need to get confidence intervals for α and β: just plug into the equation from 3 slides ago. What about hypothesis testing?

Just as before when we talked about confidence intervals and hypothesis testing, we can turn our confidence-interval equation into the equation for hypothesis testing. E.g. is the population slope greater than 0?


Testing whether the slope is greater than 0

H_0: β = 0, H_a: β > 0

t_obt = b/SE(b)

Compare t_obt to t_crit at the desired level of significance (for, in this case, a one-tailed test). Look up t_crit in your t-tables with N−2 degrees of freedom.
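This slope test can be sketched on the lecture's height/weight example (N = 9); t_crit = 2.36 is the 95% t-table value for 7 df used later in the notes:

```python
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
mx, my = sum(x) / N, sum(y) / N
ss_xx = sum((a - mx) ** 2 for a in x)
ss_yy = sum((b - my) ** 2 for b in y)
ss_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))

b = ss_xy / ss_xx                              # slope (5 for this data)
s_e = ((ss_yy - b * ss_xy) / (N - 2)) ** 0.5   # estimator for sigma(eps)
SE_b = s_e / ss_xx ** 0.5                      # standard error of the slope

t_obt = b / SE_b                               # test statistic
t_crit = 2.36                                  # from the t-table, 7 df

assert t_obt > t_crit   # reject H0: beta = 0
```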

What about confidence intervals for the mean response ŷ at x = x_o?

The confidence interval for ŷ = a + b·x_o is

α + β·x_o = a + b·x_o ± t_crit·SE(ŷ)

where SE(ŷ) = s_e·sqrt(1/N + (x_o − m_x)²/ss_xx)

Note the (x_o − m_x) term: the regression line always passes through (m_x, m_y). If you wiggle the regression line (because you're unsure of a and b), it makes more of a difference the farther you are from the mean. (See picture on board.)

Versus confidence intervals on an individual's response, y_new, given x_new

The previous slide had confidence intervals for the predicted mean response for x = x_o. You can also find confidence intervals for an individual's predicted y value, y_new, given their x value, x_new. As you might imagine, SE(y_new) is bigger than SE(ŷ): first there's variability in the predicted mean (given by SE(ŷ)); then, on top of that, there's the variability of y about the mean. (See picture on board.)

SE(y_new)

These variances add, so

SE(y_new)² = SE(ŷ)² + s_e²
SE(y_new) = s_e·sqrt(1/N + (x_new − m_x)²/ss_xx + 1)


Homework notes

Problems 5 and 9 both sound like they could be asking for either confidence intervals on ŷ, or on y_new. They are somewhat ambiguously worded.

Assume that problem 5 is asking for confidence intervals on ŷ, and problem 9 is asking for confidence intervals on y_new.

Examples

Hypothesis testing on r

We often want to know if there is a significant correlation between x and y. For this, we want to test whether the population parameter, ρ, is significantly different from 0.

It turns out that for the null hypothesis H_0: ρ = 0, r·sqrt(N−2)/sqrt(1−r²) has a t distribution, with N−2 degrees of freedom. So, just compare t_obt = r·sqrt(N−2)/sqrt(1−r²) with t_crit from your t-tables.

Don't use this test to test any other hypothesis, H_0: ρ = ρ_0; for that you need a different test statistic.
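A sketch of this test on the same height/weight example (the exact t differs slightly from the slide's ≈3.10 because the slide rounds r to 0.76 before squaring):

```python
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
mx, my = sum(x) / N, sum(y) / N
ss_xx = sum((a - mx) ** 2 for a in x)
ss_yy = sum((b - my) ** 2 for b in y)
ss_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))

r = ss_xy / (ss_xx * ss_yy) ** 0.5                 # ~0.76

# t statistic for H0: rho = 0, with N-2 degrees of freedom
t_obt = r * (N - 2) ** 0.5 / (1 - r * r) ** 0.5    # ~3.08

t_crit = 2.36  # two-tailed 95%, 7 df
assert t_obt > t_crit   # rho is significantly different from 0
```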

Previous example: predicting weight from height

x_i   y_i   (x_i−m_x)  (y_i−m_y)  (x_i−m_x)²  (y_i−m_y)²  (x_i−m_x)(y_i−m_y)
60    84      −8         −56         64         3136          448
62    95      −6         −45         36         2025          270
64   140      −4           0         16            0            0
66   155      −2          15          4          225          −30
68   119       0         −21          0          441            0
70   175       2          35          4         1225           70
72   145       4           5         16           25           20
74   197       6          57         36         3249          342
76   150       8          10         64          100           80

Sum = 612, 1260;  m_x = 68, m_y = 140;  ss_xx = 240, ss_yy = 10426, ss_xy = 1200

b = ss_xy/ss_xx = 1200/240 = 5;  a = m_y − b·m_x = 140 − 5(68) = −200
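The table's bookkeeping can be reproduced in a few lines (a sketch using the same nine points):

```python
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
m_x, m_y = sum(x) / N, sum(y) / N                         # 68, 140

ss_xx = sum((a - m_x) ** 2 for a in x)                    # 240
ss_yy = sum((b - m_y) ** 2 for b in y)                    # 10426
ss_xy = sum((a - m_x) * (b - m_y) for a, b in zip(x, y))  # 1200

b = ss_xy / ss_xx      # slope: 1200/240 = 5
a = m_y - b * m_x      # intercept: 140 - 5(68) = -200

assert abs(b - 5.0) < 1e-9 and abs(a - (-200.0)) < 1e-9
```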


Confidence intervals on α and β

s_e = sqrt((ss_yy − b·ss_xy)/(N−2)) = sqrt((10426 − (5)(1200))/7) = 25.15

SE(a) = s_e·sqrt(1/N + m_x²/ss_xx) = 25.15·sqrt(1/9 + 68²/240) = 110.71

SE(b) = s_e/sqrt(ss_xx) = 25.15/sqrt(240) = 1.62

t_crit for 95% confidence = 2.36, so

α = −200 ± (2.36)(110.71)
β = 5 ± (2.36)(1.62) = 5 ± 3.82

It looks like β is significantly different from 0.

Testing the hypothesis that ρ ≠ 0

r = cov(x, y)/(s_x·s_y) = (ss_xy/N)/sqrt((ss_xx/N)(ss_yy/N)) = 1200/sqrt((240)(10426)) = 0.76

t_obt = r·sqrt(N−2)/sqrt(1−r²) = 0.76·sqrt(7)/sqrt(1−0.58) ≈ 3.10

t_crit = 2.36, so ρ is significantly different from 0.

Confidence intervals on ŷ and y_new, for x = 76

α + β·x_o = a + b·x_o ± t_crit·SE(ŷ)

SE(ŷ) = s_e·sqrt(1/N + (x_o − m_x)²/ss_xx) = (25.15)·sqrt(1/9 + (76−68)²/240) = 15.46

So α + β·x_o = −200 + 5(76) ± (2.36)·SE(ŷ) = 180 ± 36.49 lbs

SE(y_new) = (25.15)·sqrt(1 + 1/N + (x_new − m_x)²/ss_xx) = 29.52

So y_new = 180 ± (2.36)(29.52) ≈ 180 ± 70 lbs
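These interval computations can be sketched directly (same example data; t_crit = 2.36 as above):

```python
x = [60, 62, 64, 66, 68, 70, 72, 74, 76]
y = [84, 95, 140, 155, 119, 175, 145, 197, 150]
N = len(x)
m_x, m_y = sum(x) / N, sum(y) / N
ss_xx = sum((a - m_x) ** 2 for a in x)
ss_yy = sum((b - m_y) ** 2 for b in y)
ss_xy = sum((a - m_x) * (b - m_y) for a, b in zip(x, y))
b = ss_xy / ss_xx
a = m_y - b * m_x
s_e = ((ss_yy - b * ss_xy) / (N - 2)) ** 0.5

x_o = 76
y_hat = a + b * x_o                                          # 180
SE_mean = s_e * (1 / N + (x_o - m_x) ** 2 / ss_xx) ** 0.5    # ~15.46
SE_new = s_e * (1 + 1 / N + (x_o - m_x) ** 2 / ss_xx) ** 0.5 # ~29.52

t_crit = 2.36
ci_mean = (y_hat - t_crit * SE_mean, y_hat + t_crit * SE_mean)
ci_new = (y_hat - t_crit * SE_new, y_hat + t_crit * SE_new)

# Individual predictions are more uncertain than the mean response
assert SE_new > SE_mean
```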

A last note

The test for ρ ≠ 0 is actually the same as the test for β ≠ 0. This should make sense to you, since if ρ = 0, you would also expect the slope to be 0.

Want an extra point added to your midterm score? Prove that the two tests are the same (in general, not just for this example). Submit it with your next homework.
