bivariate EDA and regression analysis

Page 1: bivariate EDA and regression  analysis

bivariate EDA and regression analysis

Page 2: bivariate EDA and regression  analysis

[Figure: plot with axes labelled length and width]

Page 3: bivariate EDA and regression  analysis

[Figure: plot with axes labelled distance from quarry and weight of core]

Page 4: bivariate EDA and regression  analysis

[Figure: scatterplot of AG_C1_2 (y-axis) against AG_C1_1 (x-axis)]

Page 5: bivariate EDA and regression  analysis

[Figure: scatterplot of AG_C1_2 (y-axis) against AG_C1_1 (x-axis)]

Page 6: bivariate EDA and regression  analysis

[Figure: scatterplot of AG_C1_2 (y-axis) against AG_C1_1 (x-axis)]

Page 7: bivariate EDA and regression  analysis

[Figure: “scatterplot matrix” of AG_C1_1 through AG_C4_1 against AG_C1_2 through AG_C4_2]

Page 8: bivariate EDA and regression  analysis

[Figure: scatterplot matrix of all eight variables, AG_C1_1 through AG_C4_1 and AG_C1_2 through AG_C4_2]

Page 9: bivariate EDA and regression  analysis

[Figure: scatterplot of AG_C2_1 (y-axis) against AG_C1_1 (x-axis)]

Page 10: bivariate EDA and regression  analysis

scatterplots

• scatterplots provide the most detailed summary of a bivariate relationship, but they are not concise, and there are limits to what else you can do with them…

• simpler kinds of summaries may be useful
  – more compact; often capture less detail
  – may support more extended mathematical analyses
  – may reveal fundamental relationships…

[Figure: scatterplot of AG_C1_2 (y-axis) against AG_C1_1 (x-axis)]

Page 11: bivariate EDA and regression  analysis
Page 12: bivariate EDA and regression  analysis

y = a + bx

Page 13: bivariate EDA and regression  analysis

y = a + bx

[Figure: straight line on x and y axes (ticks 1 to 6), with two points (x1,y1) and (x2,y2) marked]

a = “y intercept”

b = “slope”

b = Δy/Δx

b = (y2-y1)/(x2-x1)

Page 14: bivariate EDA and regression  analysis

y = a + bx

• we can predict values of y from values of x

• predicted values of y are called “y-hat”: $\hat{y} = a + bx$

• the predicted values (y) are often regarded as “dependent” on the (independent) x values

• try to assign independent values to x-axis, dependent values to the y-axis…
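As a small illustration of the line equation, the sketch below derives b and a from two points and then computes y-hat for a new x. The two points and the function names are made up for the example.

```python
def line_through(p1, p2):
    """Return (a, b) for the line y = a + b*x passing through two points."""
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)   # slope: change in y over change in x
    a = y1 - b * x1             # intercept: value of y where x = 0
    return a, b

def predict(a, b, x):
    """Predicted value (y-hat) for a given x."""
    return a + b * x

a, b = line_through((1, 2), (5, 4))   # hypothetical points
y_hat = predict(a, b, 3)
print(a, b, y_hat)
```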

Page 15: bivariate EDA and regression  analysis

y = a + bx

• becomes a concise summary of a point distribution, and a model of a relationship

• may have important explanatory and predictive value

Page 16: bivariate EDA and regression  analysis
Page 17: bivariate EDA and regression  analysis

• how do we come up with these lines?

• various options:
  – by eye
  – calculating a “Tukey Line” (resistant to outliers)
  – ‘locally weighted regression’ (“LOWESS”)
  – least squares regression

Page 18: bivariate EDA and regression  analysis

linear regression

• linear regression and correlation analysis are generally concerned with fitting lines to real data

• least squares regression is one of the main tools

• attempts to minimize deviation of observed points from the regression line

• maximizes its potential for prediction

Page 19: bivariate EDA and regression  analysis

• standard approach minimizes the squared variation in y

• Note:
  – these are the vertical deviations
  – this is a “sum-squared-error approach”

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
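To make the sum-squared-error criterion concrete, here is a minimal sketch that scores two candidate lines against the same points. The data are hypothetical, and `sum_squared_error` is an illustrative name, not something from the slides.

```python
def sum_squared_error(xs, ys, a, b):
    """Sum of squared vertical deviations of the points from y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

sse_sloped = sum_squared_error(xs, ys, 2.2, 0.6)   # a sloped candidate line
sse_flat = sum_squared_error(xs, ys, 4.0, 0.0)     # a flat line at the mean of y
print(sse_sloped, sse_flat)
```

The line with the smaller sum is the better fit under this criterion; least squares regression looks for the a and b that make the sum as small as possible.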

Page 20: bivariate EDA and regression  analysis

• regressing x on y would involve defining the line $\hat{x}_i = c + d y_i$ by minimizing $\sum_i (x_i - \hat{x}_i)^2$

Page 21: bivariate EDA and regression  analysis

• calculating a line that minimizes the sum of squared vertical deviations in y is called “regressing y on x”

• appropriate when we are trying to predict y from x

• this is also called “Model I Regression”

Page 22: bivariate EDA and regression  analysis

• start by calculating the slope (b):

$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

(numerator: covariance)

Page 23: bivariate EDA and regression  analysis

• once you have the slope, you can calculate the y-intercept (a):

$$a = \bar{y} - b\bar{x} = \frac{\sum y_i - b \sum x_i}{n}$$
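The two steps (slope first, then intercept) can be sketched in a few lines. This is an illustration on hypothetical data, with `fit_line` as an assumed helper name; it also checks that the two ways of writing the intercept agree.

```python
def fit_line(xs, ys):
    """Least-squares regression of y on x: return (a, b) for y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: summed cross-products over summed squared deviations in x.
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    # Intercept: the fitted line passes through (x_bar, y_bar).
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
a_from_sums = (sum(ys) - b * sum(xs)) / len(xs)   # equivalent form of the intercept
print(a, b, a_from_sums)
```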

Page 24: bivariate EDA and regression  analysis

regression “pathologies”

• things to avoid in regression analysis

Page 25: bivariate EDA and regression  analysis
Page 26: bivariate EDA and regression  analysis
Page 27: bivariate EDA and regression  analysis
Page 28: bivariate EDA and regression  analysis
Page 29: bivariate EDA and regression  analysis

Tukey Line

• resistant to outliers

• divide cases into thirds, based on x-axis

• identify the median x and y values in upper and lower thirds

• slope (b)= (My3-My1)/(Mx3-Mx1)

• intercept (a) = median of all values yi-b*xi
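The recipe above can be sketched as follows. The slides do not say how to form the thirds when the number of cases is not divisible by three, so the split rule here (integer division, equal outer thirds) is an assumption, as are the data and the name `tukey_line`.

```python
from statistics import median

def tukey_line(xs, ys):
    """Resistant line: slope from medians of the outer thirds, intercept from a median."""
    pairs = sorted(zip(xs, ys))          # order cases along the x-axis
    k = len(pairs) // 3                  # cases per outer third (assumed split rule)
    lower, upper = pairs[:k], pairs[-k:]
    mx1, my1 = median([x for x, _ in lower]), median([y for _, y in lower])
    mx3, my3 = median([x for x, _ in upper]), median([y for _, y in upper])
    b = (my3 - my1) / (mx3 - mx1)
    a = median([y - b * x for x, y in pairs])   # median of all y_i - b*x_i
    return a, b

# Hypothetical data; the last case is an outlier in y.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2, 4, 3, 5, 6, 7, 8, 9, 30]

a, b = tukey_line(xs, ys)
print(a, b)
```

Because only medians are used, the single extreme y value does not pull the line toward itself.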

Page 30: bivariate EDA and regression  analysis
Page 31: bivariate EDA and regression  analysis

Correlation

• regression concerns fitting a linear model to observed data

• correlation concerns the degree of fit between observed data and the model...

• if most points lie near the line:
  – the ‘fit’ of the model is ‘good’
  – the two variables are ‘strongly’ correlated
  – values of y can be ‘well’ predicted from x

Page 32: bivariate EDA and regression  analysis

“Pearson’s r”

• this is assessed using the product-moment correlation coefficient:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

= covariance (the numerator), standardized by a measure of variation in both x and y

Page 33: bivariate EDA and regression  analysis

[Figure: x and y axes with a point (xi,yi) and regions marked + and -, shown alongside the formula for r]

Page 34: bivariate EDA and regression  analysis

• unlike the covariance, r is unit-less

• ranges between –1 and 1
  – 0 = no correlation
  – -1 and 1 = perfect negative and positive correlation (respectively)

• r is symmetrical
  – correlation between x and y is the same as between y and x
  – no question of independence or dependence…
  – recall, this symmetry is not true of regression…
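These properties are easy to check numerically. The sketch below computes Pearson's r on hypothetical measurements, swaps the variables, rescales one of them, and compares the two regression slopes. All names and data are illustrative assumptions.

```python
from math import sqrt

def cross_products(xs, ys):
    """Sum of (x_i - x_bar) * (y_i - y_bar)."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

def pearson_r(xs, ys):
    return cross_products(xs, ys) / sqrt(cross_products(xs, xs) * cross_products(ys, ys))

def slope(xs, ys):
    """Least-squares slope for regressing ys on xs."""
    return cross_products(xs, ys) / cross_products(xs, xs)

# Hypothetical measurements: lengths in cm, weights in g.
length_cm = [1, 2, 3, 4, 5]
weight_g = [2, 4, 5, 4, 5]
length_mm = [10 * v for v in length_cm]   # same lengths in another unit

r_xy = pearson_r(length_cm, weight_g)
r_yx = pearson_r(weight_g, length_cm)     # symmetric: same value
r_mm = pearson_r(length_mm, weight_g)     # unit-less: unchanged by rescaling

b_y_on_x = slope(length_cm, weight_g)     # slope when predicting weight from length
d_x_on_y = slope(weight_g, length_cm)     # slope when predicting length from weight

print(round(r_xy, 4), round(r_yx, 4), round(r_mm, 4))
print(b_y_on_x, 1 / d_x_on_y)             # the two regressions give different lines
```

The product of the two slopes equals r squared, so the two regression lines coincide only when the correlation is perfect.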

Page 35: bivariate EDA and regression  analysis

• regression/correlation
  – one can assess the strength of a relationship by seeing how knowledge of one variable improves the ability to predict the other

Page 36: bivariate EDA and regression  analysis

• if you ignore x, the best predictor of y will be the mean of all y values (y-bar)

• if the y measurements are widely scattered, prediction errors will be greater than if they are close together

• we can assess the dispersion of y values around their mean by:

$$\sum (y_i - \bar{y})^2$$

Page 37: bivariate EDA and regression  analysis

[Figure: plot marking $\bar{y}$ and $y_i$, annotated with $\sum (y_i - \bar{y})^2$ and $\sum (y_i - \hat{y}_i)^2$]

Page 38: bivariate EDA and regression  analysis

$$r^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

• “coefficient of determination” ($r^2$)

• describes the proportion of variation that is “explained” or accounted for by the regression line…

• $r^2 = .5$:
  – half of the variation is explained by the regression…
  – half of the variation in y is explained by variation in x…
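One common way to compute the coefficient of determination is to compare the prediction error that remains after using x with the error obtained when x is ignored. The sketch below does this on hypothetical data; the helper names are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares regression of y on x: return (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

def r_squared(xs, ys):
    a, b = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    total = sum((y - y_bar) ** 2 for y in ys)                        # error when x is ignored
    residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))   # error left after using x
    return 1 - residual / total

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r2 = r_squared(xs, ys)
print(round(r2, 3))
```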

Page 39: bivariate EDA and regression  analysis

[Figure: plot marking $\bar{y}$ and $y_i$]

Page 40: bivariate EDA and regression  analysis

correlation and percentages

• much of what we want to learn about association between variables can be learned from counts
  – ex: are high counts of bone needles associated with high counts of end scrapers?

• sometimes, similar questions are posed of percent-standardized data
  – ex: are high proportions of decorated pottery associated with high proportions of copper bells?

Page 41: bivariate EDA and regression  analysis

caution…

• these are different questions and have different implications for formal regression

• percents will show at least some level of correlation even if the underlying counts do not…
  – ‘spurious’ correlation (negative)
  – “closed-sum” effect

Page 42: bivariate EDA and regression  analysis

case  C_v1  C_v2  C_v3  C_v4  C_v5  C_v6  C_v7  C_v8  C_v9  C_v10
   1    15    14    94    59    76    13     8    97    10     95
   2    35     1    89    95    23    77    14     9    27     43
   3    20    96    73    31    90    65    74    60    85     27
   4    23    59     7    52    33    83    71    35    57     90
   5    36    90    86    15    97    54    52    41    34      3
   6    79     2    26     5    11    68    74    44    13     87
   7    40    99    28    66    77    23    69    22    63     36
   8    95    36    22    75    21    48    95    58    74     68
   9    27     0    58    99    32    30     5     5   100     75
  10    67    93    98    61    62    94     3    16    43     48

Column groups marked on the slide: 10 vars., 5 vars., 3 vars., 2 vars.
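The closed-sum effect is most extreme with two variables. The sketch below takes the first two columns of the table (C_v1 and C_v2), computes r on the counts, then converts each case to percents of its two-variable total and computes r again. The percent standardization used here is an assumption about how the slide's "2 vars." case was built.

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# First two columns of the table, cases 1 to 10.
c_v1 = [15, 35, 20, 23, 36, 79, 40, 95, 27, 67]
c_v2 = [14, 1, 96, 59, 90, 2, 99, 36, 0, 93]

# Assumed standardization: each count as a percent of its case total.
totals = [a + b for a, b in zip(c_v1, c_v2)]
p_v1 = [100 * a / t for a, t in zip(c_v1, totals)]
p_v2 = [100 * b / t for b, t in zip(c_v2, totals)]

r_counts = pearson_r(c_v1, c_v2)
r_percents = pearson_r(p_v1, p_v2)
print(round(r_counts, 2), round(r_percents, 2))
```

With only two variables the percents must sum to 100 in every case, so one is fully determined by the other and the correlation is forced to be perfectly negative, whatever the counts look like.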

Page 43: bivariate EDA and regression  analysis

[Figure: five plots of r (axis from -1.0 to 1.0): original counts; percents (10 vars.); percents (5 vars.); percents (3 vars.); percents (2 vars.)]

Page 44: bivariate EDA and regression  analysis

[Figure: five scatterplots: C_V2 against C_V1; P10_V2 against P10_V1; T5_V2 against T5_V1; T3_V2 against T3_V1; T2_V2 against T2_V1]

Page 45: bivariate EDA and regression  analysis

regression assumptions

• both variables are measured at the interval scale or above

• variation is the same at all points along the regression line (variation is homoscedastic)

Page 46: bivariate EDA and regression  analysis

residuals

• vertical deviations of points around the regression

• for case i, residual = $y_i - \hat{y}_i$, that is, $y_i - (a + bx_i)$

• residuals in y should not show patterned variation either with x or y-hat

• residuals should be normally distributed around the regression line

• residual error should not be autocorrelated (errors/residuals in y are independent…)
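A minimal sketch of computing residuals for a least-squares line, on hypothetical data with assumed helper names. It also checks two properties of a least-squares fit: the residuals sum to zero and have no linear association with x.

```python
def fit_line(xs, ys):
    """Least-squares regression of y on x: return (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]   # y_i - y-hat_i

residual_sum = sum(residuals)
residual_x_sum = sum(x * e for x, e in zip(xs, residuals))
print([round(e, 2) for e in residuals])
```

Plotting `residuals` against `xs` or against the predicted values is the usual way to look for the patterned variation the slide warns about.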

Page 47: bivariate EDA and regression  analysis

standard error of the regression

• recall: ‘standard error’ of an estimate (SEE) is like a standard deviation

• can calculate an SEE for residuals associated with a regression formula

$$\mathrm{SEE} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n}}$$

Page 48: bivariate EDA and regression  analysis

• to the degree that the regression assumptions hold, there is a 68% probability that true values of y lie within 1 SEE of y-hat

• 95% within 2 SEE…

• can plot lines showing the SEE…

• y-hat = a+bx +/- SEE
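The SEE and the band around a prediction can be sketched as below. The data and helper names are hypothetical. The divisor is n, following the slide; many texts divide by n - 2 instead, which gives a slightly larger value.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares regression of y on x: return (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

def standard_error(xs, ys, a, b):
    """Root of the mean squared residual (divisor n, as on the slide)."""
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / len(xs))

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
see = standard_error(xs, ys, a, b)

x_new = 3
y_hat = a + b * x_new
band = (y_hat - see, y_hat + see)   # y-hat +/- 1 SEE
print(round(see, 3), band)
```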

Page 49: bivariate EDA and regression  analysis
Page 50: bivariate EDA and regression  analysis

data transformations and regression

• read Shennan, Chapter 9 (esp. pp. 151-173)

Page 51: bivariate EDA and regression  analysis

[Figure: two scatterplots of VAR2 against VAR1 (both axes 0 to 200)]

Page 52: bivariate EDA and regression  analysis

[Figure: scatterplot of VAR2 against VAR1]

Page 53: bivariate EDA and regression  analysis

[Figure: scatterplot of VAR2 against VAR1T]

let VAR1T = sqr(VAR1)

Page 54: bivariate EDA and regression  analysis

• “distribution” and “fall-off” models

• ex: density of obsidian vs. distance from the quarry:

[Figure: scatterplot of DENSITY against DIST]

Page 55: bivariate EDA and regression  analysis
Page 56: bivariate EDA and regression  analysis

[Figure: scatterplot of DENSITY against DIST]

[Figure: Plot of Residuals against Predicted Values (RESIDUAL against ESTIMATE)]

Page 57: bivariate EDA and regression  analysis

[Figure: scatterplot of DENSITY against DIST]

[Figure: scatterplot of LG_DENS against DIST]

LG_DENS = log(DENSITY)

Page 58: bivariate EDA and regression  analysis

[Figure: scatterplot of LG_DENS against DIST]

y = 1.70-.05x

[remember y is logged density]

Page 59: bivariate EDA and regression  analysis

[Figure: two scatterplots of DENSITY against DISTANCE (x-axis 0 to 80, y-axis 0 to 6)]

log(y) = 1.70-.05x

“fplot y = exp(1.70-.05*x)”
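The transform-then-regress workflow can be sketched end to end. The densities below are not the slide's data: they are hypothetical, noise-free values generated from the slide's fitted curve, so the fit should simply recover the coefficients 1.70 and -.05. Natural logarithms are assumed, consistent with the use of exp in the fplot command.

```python
from math import exp, log

def fit_line(xs, ys):
    """Least-squares regression of y on x: return (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Hypothetical, noise-free densities generated from the slide's curve.
distances = [0, 10, 20, 30, 40, 50, 60, 70, 80]
densities = [exp(1.70 - 0.05 * d) for d in distances]

# Step 1: log-transform density, then fit a straight line against distance.
a, b = fit_line(distances, [log(v) for v in densities])

# Step 2: back-transform the linear prediction to the original density scale.
def predicted_density(distance):
    return exp(a + b * distance)

print(round(a, 2), round(b, 2), round(predicted_density(0), 2))
```

The fitted line is straight only on the log scale; back-transformed, it becomes the curved fall-off drawn over the original scatterplot.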

Page 60: bivariate EDA and regression  analysis

begin
PLOT DENSITY*DISTANCE / FILL=1,0,0
fplot y = exp(1.70-.05*x) ; XLABEL='' YLABEL='' XTICK=0 XPIP=0 YTICK=0 YPIP=0 XMIN=0 XMAX=80 YMIN=0 YMAX=6
end

Page 61: bivariate EDA and regression  analysis

transformation summary

• correcting left skew:
  – x^4: stronger
  – x^3: strong
  – x^2: mild

• correcting right skew:
  – √x: weak
  – log(x): mild
  – -1/x: strong
  – -1/x^2: stronger
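To see the ladder at work on right skew, the sketch below applies three of the transformations to a small, hypothetical right-skewed sample and reports a moment-based skewness for each. The sample and the choice of skewness measure are assumptions made for illustration.

```python
from math import sqrt, log

def skewness(values):
    """Moment-based skewness: positive for a long right tail, negative for a long left tail."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

raw = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20]   # hypothetical right-skewed sample

ladder = {
    "x": raw,
    "sqrt(x)": [sqrt(v) for v in raw],
    "log(x)": [log(v) for v in raw],
    "-1/x": [-1 / v for v in raw],
}
skews = {name: skewness(vals) for name, vals in ladder.items()}

for name, s in skews.items():
    print(f"{name:8s} skewness = {s:+.2f}")
```

Each step down the ladder pulls the right tail in further; on this sample the strongest step overcorrects and leaves the values skewed to the left, which is why the mildest transformation that works is usually preferred.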