Regression. Correlation measures the strength of the linear relationship Great! But what is that...

Post on 19-Dec-2015


Regression


• Correlation measures the strength of the linear relationship

• Great! But what is that relationship? How do we describe it?

– regression, regression line, regression equation

• Regression line is used for prediction

Predicting weights from heights

• Independent variable: height

• Dependent variable: weight

• How can we predict one from the other?

• Regression is to a scatter plot as the mean is to a histogram.

Weights vs. Heights

[Figure: Salary by years employed. Scatter plot; x-axis: years employed, y-axis: salary]

Regression by local averages

Approximation of local averages by regression line

Inappropriate use of regression line (use other methods)
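The idea of regression by local averages can be sketched as follows: average y within each value of x, then compare those averages with a straight line fitted to all the points. This is a minimal Python sketch; the data and the helper names `local_averages` and `fit_line` are made up for illustration and do not come from the slides.

```python
from collections import defaultdict

def local_averages(xs, ys):
    """Average of y for each distinct x value."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return {x: sum(v) / len(v) for x, v in sorted(groups.items())}

def fit_line(xs, ys):
    """Least-squares intercept a and slope b of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Hypothetical data: two observations at each x value.
xs = [1, 1, 2, 2, 3, 3]
ys = [2, 4, 3, 5, 6, 6]

averages = local_averages(xs, ys)
a, b = fit_line(xs, ys)
for x, avg in averages.items():
    print(x, avg, round(a + b * x, 3))  # local average next to the line's value
```

When the local averages fall close to a straight line, the regression line is a reasonable summary of them. When they follow a curve, a straight line is the wrong tool.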

The equation of a line

• a represents the y-intercept

– when x equals zero, y equals a

– Is this always meaningful in the context of a problem?

– Is it always useful in defining a line?

• b represents the slope of the line (rise/run)

– for every unit change in x, y changes by b.

– Does this mean that if we physically change x by one unit, y will change by b units? Say we gain another year of experience. Will our salary go up by 1107?

$y = a + bx$
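To make the roles of a and b concrete, here is a small Python sketch. The coefficients are an assumption: they are the salary line that appears on a later slide (intercept 28394, slope 1107), and the function name `line` is illustrative.

```python
def line(a, b, x):
    """Value of y = a + b*x."""
    return a + b * x

# Assumed coefficients, read from the salary line on a later slide.
a, b = 28394, 1107

at_zero = line(a, b, 0)                    # the intercept: y when x equals zero
one_step = line(a, b, 11) - line(a, b, 10) # the slope: change in y per unit of x
print(at_zero, one_step)
```

The slope compares the predicted salaries of people whose experience differs by one year. With observational data it does not by itself say that one person's salary will rise by 1107 after another year.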

Regression equation

• What is the predicted weight of somebody whose height is h cm?

• w = intercept + slope × h

• This is known as the regression equation.

• How do we get this formula?

• We have a statistical model

Regression line by minimising residual errors

[Figure: salary vs. years employed, with the fitted line $y = 28394 + 1107x$ and a residual marked]

$y_i = a + bx_i + \varepsilon_i$, where $\varepsilon_i$ = error of the i-th observation from the regression line

• The best candidate line will minimise these errors

• No line can make all errors vanish (some +ve, some –ve)

• Minimise the sum of squared errors, $\sum_i \varepsilon_i^2$

• Minimising gives the regression line
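Minimising the sum of squared errors has a closed-form solution: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the line passes through the point of averages. The Python sketch below uses made-up data; the names `least_squares` and `sse` are illustrative, not from the slides.

```python
def least_squares(xs, ys):
    """Intercept a and slope b that minimise the sum of squared errors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx  # the line passes through (mean x, mean y)
    return a, b

def sse(a, b, xs, ys):
    """Sum of squared errors of the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
best = sse(a, b, xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(round(a, 3), round(b, 3), round(best, 3))
print(sse(a + 0.1, b, xs, ys) > best, sse(a, b + 0.1, xs, ys) > best)
```

Shifting the intercept or the slope away from the fitted values increases the sum of squared errors. The residuals of the fitted line are partly positive and partly negative, and they sum to zero.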

Regression and correlation

• Want to predict weight for those people who are 1 SD more than avg. height.

• SD line says: pred. wt. = overall avg. wt. + SD of wt.

• Regression line says: predicted wt. = overall avg. wt. + r × SD of wt.

• For people who are k SDs away from avg. height: predicted wt. = overall avg. wt. + r × k × SD of wt.

• Clearly valid for r = 0 or r = 1
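The rule "average + r × k × SD" can be checked against a least-squares line: both give the same prediction for a point k SDs above the average of x. This Python sketch uses made-up data and illustrative helper names; SDs are computed with divisor n.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def sd(v):
    """Standard deviation with divisor n."""
    m = mean(v)
    return sqrt(sum((x - m) ** 2 for x in v) / len(v))

def corr(xs, ys):
    """Correlation coefficient r."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sd(xs) * sd(ys))

# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
k = 1  # one SD above the average of x

r = corr(xs, ys)
sd_line_pred = mean(ys) + k * sd(ys)         # SD line
regression_pred = mean(ys) + r * k * sd(ys)  # regression line

# The same prediction from the least-squares line, whose slope is r * SD_y / SD_x.
slope = r * sd(ys) / sd(xs)
intercept = mean(ys) - slope * mean(xs)
line_pred = intercept + slope * (mean(xs) + k * sd(xs))

print(round(sd_line_pred, 4), round(regression_pred, 4), round(line_pred, 4))
```

Whenever r is below 1, the regression prediction sits closer to the overall average than the SD line's prediction does.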

RMS error of regression

• RMS error = $\sqrt{1 - r^2}$ × SD of y

• RMS error shrinks as the correlation strengthens (as |r| approaches 1)

RMS error is to regression what SD is to average
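The RMS error formula can be checked numerically by computing the RMS of the residuals directly and comparing it with $\sqrt{1 - r^2}$ × SD of y. This is a minimal Python sketch on made-up data; the function name `rms_error_two_ways` is illustrative, and SDs use divisor n.

```python
from math import sqrt

def rms_error_two_ways(xs, ys):
    """Return (RMS of residuals, sqrt(1 - r^2) * SD of y) for a least-squares fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    r = sxy / sqrt(sxx * syy)
    direct = sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n)
    formula = sqrt(1 - r ** 2) * sqrt(syy / n)
    return direct, formula

# Hypothetical data for illustration only.
direct, formula = rms_error_two_ways([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(direct, 4), round(formula, 4))
```

The two numbers agree, so the typical size of a prediction error can be read off from r and the SD of y alone.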

Residuals

residual = observed − predicted

Example: ozone vs. temperature

> air[,c(1,3)]

  ozone temperature
   3.45          67
   3.30          72
   2.29          74
   2.62          62
   2.84          65
  . . .

> cor(ozone,temperature)

[1] 0.7531038

Fitting a regression model in S

> ozone.lm <- lm(ozone ~ temperature, data = air)

Coefficients:

              Value Std. Error t value Pr(>|t|)
(Intercept)   -2.23       0.46   -4.82   0.0000
temperature    0.07       0.01   11.95   0.0000

Multiple R-Squared: 0.5672

> var(ozone)

[1] 0.7928069

> var(resid(ozone.lm))

[1] 0.3431544

> cor(ozone,temperature)

[1] 0.7531038
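The three numbers printed above fit together: the multiple R-squared equals the squared correlation, and it also equals 1 minus the ratio of residual variance to the variance of ozone. The Python sketch below checks this using only the values shown in the S output; the variable names are illustrative.

```python
from math import sqrt

# Values copied from the S output above.
var_ozone = 0.7928069   # var(ozone)
var_resid = 0.3431544   # var(resid(ozone.lm))
r = 0.7531038           # cor(ozone, temperature)

r_squared = r ** 2
r_squared_from_var = 1 - var_resid / var_ozone
print(round(r_squared, 4), round(r_squared_from_var, 4))

# SD of the residuals, directly and via sqrt(1 - r^2) * SD of ozone.
# Both variances use the same divisor, so the comparison is like for like.
resid_sd = sqrt(var_resid)
resid_sd_from_r = sqrt(1 - r ** 2) * sqrt(var_ozone)
print(round(resid_sd, 4), round(resid_sd_from_r, 4))
```

Both routes reproduce the reported R-squared of 0.5672, and the residual SD matches the formula $\sqrt{1 - r^2}$ × SD of y.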

Checking model appropriateness

What assumptions have we made in the regression model?

Checking model assumptions in S-plus

> par(mfrow=c(2,3))

> plot(ozone.lm)

[Figure: Residual diagnostics for ozone data. Panels: residuals vs. fitted values; sqrt(abs(residuals)) vs. fitted values; ozone vs. fitted values; residuals vs. quantiles of standard normal; fitted values and residuals vs. f-value; Cook's distance vs. index]

Pizza party at the Frat.

• How many laps would you predict a pledge could run if he ate 6 slices of pizza?

• How many laps if he ate 9 slices of pizza?

• A pledge shows off and eats 35 slices of pizza. How many laps would you predict he would run?

[Figure: distance vs. slices of pizza (slices from 0 to 12, distance from 2 to 20), with fitted line $y = 20 - 1.5x$ and $r = -0.965$]

Beware of extrapolation
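The pizza questions can be worked through in a few lines of Python. This sketch assumes the fitted line is distance = 20 − 1.5 × slices and that the observed slices run from 0 to 12. Both are read off the figure, with the signs inferred from the axis ranges, so treat them as assumptions. The function names are illustrative.

```python
def predicted_distance(slices, intercept=20.0, slope=-1.5):
    """Prediction from the assumed line: distance = 20 - 1.5 * slices."""
    return intercept + slope * slices

def within_observed_range(slices, lowest=0, highest=12):
    """True if the x value lies inside the assumed range of the data."""
    return lowest <= slices <= highest

for slices in (6, 9, 35):
    note = "" if within_observed_range(slices) else "  <- extrapolation"
    print(slices, predicted_distance(slices), note)
```

Inside the observed range the predictions are 11 and 6.5. At 35 slices the same line gives −32.5, an impossible distance: the line describes the data only where data were observed.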