Regression

Regression Petter Mostad 2005.10.10


Transcript of Regression

Page 1: Regression

Regression

Petter Mostad

2005.10.10

Page 2: Regression

Some problems you might want to look at

• Given the annual number of cancers of a certain type, over a few decades, make a prediction for the future, with uncertainty.

• There seems to be a connection between efficiency and size for Norwegian hospitals. Given data from many hospitals, determine if there is a connection, and what it is.

• Investigate the connection between efficiency and a number of possible explanatory variables.

Page 3: Regression

Connection between variables

[Figure: two scatter plots, one of cost (kostnad) against area (areal) and one of cost (kostnad) against year (år)]

We would like to study the connection between x and y!

Page 4: Regression

Connection between variables

[Figure: the two scatter plots of cost (kostnad) against area (areal) and against year (år)]

Fit a line!

Page 5: Regression

What can you do with a fitted line?

• Interpolation

• Extrapolation (sometimes dangerous!)

• Interpret the parameters of the line


Page 6: Regression

How to define the line that ”fits best”?

• Note: many other ways to fit the line can be imagined

Choose the line that minimizes the sum of the squares of the ”errors”

= Least squares method!


Page 7: Regression

How to compute the line fit with the least squares method?

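• Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ denote the points in the plane.

• Find a and b so that $y = a + bx$ fits the points by minimizing

$$S = \sum_{i=1}^{n} (a + bx_i - y_i)^2 = (a + bx_1 - y_1)^2 + (a + bx_2 - y_2)^2 + \cdots + (a + bx_n - y_n)^2$$

• Solution:

$$b = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$$

$$a = \frac{\sum y_i - b\sum x_i}{n} = \bar{y} - b\bar{x}$$

where $\bar{x} = \frac{1}{n}\sum x_i$, $\bar{y} = \frac{1}{n}\sum y_i$, and all sums are done for i=1,...,n.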
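The solution formulas for a and b translate almost directly into code. The following is a minimal sketch in plain Python; the function name `fit_line` is an illustrative choice and not part of the original slides.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denom
    a = (sum_y - b * sum_x) / n
    return a, b


# Points lying exactly on y = 1 + 2x should give back a = 1 and b = 2.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

The check on `denom` covers the one case where the formula breaks down: if every x value is the same, no slope can be estimated.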

Page 8: Regression

How do you get this answer?

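• Differentiate S with respect to a and b, and set the results to 0:

$$\frac{\partial S}{\partial a} = \sum_{i=1}^{n} 2(a + bx_i - y_i) = 0$$

$$\frac{\partial S}{\partial b} = \sum_{i=1}^{n} 2(a + bx_i - y_i)\,x_i = 0$$

We get:

$$na + b\sum x_i - \sum y_i = 0$$

$$a\sum x_i + b\sum x_i^2 - \sum x_i y_i = 0$$

These are two equations with two unknowns, and solving them gives the answer.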
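The step from the two equations to the formulas for a and b can be written out as follows. This is a sketch of the algebra, with $\bar{x} = \frac{1}{n}\sum x_i$ and $\bar{y} = \frac{1}{n}\sum y_i$ denoting the averages.

```latex
% First equation, divided by n:
%   na + b \sum x_i - \sum y_i = 0
a = \bar{y} - b\bar{x}

% Substitute this into the second equation, using \sum x_i = n\bar{x}:
%   a \sum x_i + b \sum x_i^2 - \sum x_i y_i = 0
(\bar{y} - b\bar{x})\, n\bar{x} + b \sum x_i^2 - \sum x_i y_i = 0

% Collect the terms containing b:
b \left( \sum x_i^2 - n\bar{x}^2 \right) = \sum x_i y_i - n\bar{x}\bar{y}

% Hence
b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}
```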

Page 9: Regression

Example

• Some grasshoppers make sound by rubbing their wings against each other. There is a connection between the temperature and the frequency of the movements, unique for each species. Here are some data for Nemobius fasciatus fasciatus:

If you measure 18 movements per sec, what is the estimated temperature?

Moves / sec    Temp. (°C)
20.0           31.4
16.0           22.0
19.8           34.1
18.4           29.1
15.5           24.0
14.7           21.0
17.1           27.7
15.4           20.7
16.2           28.5
15.0           26.4
17.2           28.1
17.0           28.6
14.4           24.6

[Figure: scatter plot of temperature (temperatur) against movements per second (bevegelser / sekund)]

Data from Pierce, GW. The Songs of Insects. Cambridge, Mass.: Harvard University Press, 1949, pp. 12-21

Page 10: Regression

Example (cont.)

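Computation:

$$\sum x_i = 216.70, \quad \sum x_i^2 = 3652.15, \quad \sum y_i = 346.33, \quad \sum x_i y_i = 5847.34$$

$$b = \frac{13 \cdot 5847.34 - 216.70 \cdot 346.33}{13 \cdot 3652.15 - 216.70^2} \approx 1.859$$

$$a = \frac{346.33 - 1.859 \cdot 216.70}{13} \approx -4.348$$

Answer: Estimated temperature

$$-4.348 + 1.859 \cdot 18 \approx 29.1$$

[Figure: scatter plot of temperature (temperatur) against movements per second (bevegelser / sekund)]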
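The computation can be repeated in a few lines of Python. The sketch below uses the values exactly as printed in the data table; because those are rounded to one decimal, the sums and the coefficients come out slightly different from the ones on the slide, while the estimated temperature still rounds to 29.1. The variable names are illustrative.

```python
# Data as printed in the table (movements per second, temperature).
moves = [20.0, 16.0, 19.8, 18.4, 15.5, 14.7, 17.1, 15.4, 16.2, 15.0, 17.2, 17.0, 14.4]
temps = [31.4, 22.0, 34.1, 29.1, 24.0, 21.0, 27.7, 20.7, 28.5, 26.4, 28.1, 28.6, 24.6]

n = len(moves)
sum_x = sum(moves)
sum_y = sum(temps)
sum_xx = sum(x * x for x in moves)
sum_xy = sum(x * y for x, y in zip(moves, temps))

# Least squares slope and intercept for temperature = a + b * moves.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Estimated temperature at 18 movements per second.
predicted = a + b * 18
print(f"b = {b:.3f}, a = {a:.3f}, estimated temperature = {predicted:.1f}")
```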

Page 11: Regression

y against x ≠ x against y

• Linear regression of y against x does not give the same result as linear regression of x against y.

[Figure: scatter plot of y against x with two fitted lines, labelled "Regression of x against y" and "Regression of y against x"]
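A small numerical illustration of this asymmetry, as a sketch with made-up data (the five points and all names are assumptions for illustration only): fitting y against x and x against y gives two different lines when both are drawn in the same (x, y) plane.

```python
def fit_line(us, vs):
    """Least squares line v = intercept + slope * u; returns (intercept, slope)."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    slope = (sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
             / sum((u - u_bar) ** 2 for u in us))
    return v_bar - slope * u_bar, slope


# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

a_yx, b_yx = fit_line(xs, ys)   # y = a_yx + b_yx * x
a_xy, b_xy = fit_line(ys, xs)   # x = a_xy + b_xy * y

# Rewrite the second line as y in terms of x to compare slopes in the same plane.
slope_xy_in_plane = 1 / b_xy
print(b_yx, slope_xy_in_plane)
```

For these points the two slopes are 0.8 and 1.25. The lines coincide only when the points lie exactly on a straight line.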

Page 12: Regression

Centered variables

• Assume we subtract the average from both the x- and the y-values

• We get $\sum x_i = 0$ and $\sum y_i = 0$

• We get $a = 0$ and $b = \sum x_i y_i / \sum x_i^2$

• From the definitions of correlation and standard deviation we get

$$b = \mathrm{corr}(x, y)\,\frac{\mathrm{std}(y)}{\mathrm{std}(x)}$$

(even in the uncentered case)

• Note also: The residuals $a + bx_i - y_i$ sum to 0.
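Both facts, that the slope equals the correlation times std(y)/std(x) and that the residuals sum to 0, can be checked numerically. The following sketch uses made-up, uncentered data; the data and variable names are assumptions for illustration.

```python
import math

# Made-up illustrative data (not centered).
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least squares line y = a + b*x.
b = s_xy / s_xx
a = y_bar - b * x_bar

# Sample correlation and sample standard deviations.
corr = s_xy / math.sqrt(s_xx * s_yy)
std_x = math.sqrt(s_xx / (n - 1))
std_y = math.sqrt(s_yy / (n - 1))

residual_sum = sum(a + b * x - y for x, y in zip(xs, ys))
print(b, corr * std_y / std_x, residual_sum)
```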

Page 13: Regression

Analyzing the variance

• Define

– SSE: Error sum of squares, $\sum (a + bx_i - y_i)^2$

– SSR: Regression sum of squares, $\sum (a + bx_i - \bar{y})^2$

– SST: Total sum of squares, $\sum (y_i - \bar{y})^2$

• We can show that SST = SSR + SSE

• Define

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \mathrm{corr}(x, y)^2$$

• $R^2$ is the ”coefficient of determination”
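The decomposition of the total sum of squares and the resulting R² can be verified on a small example. This is a sketch with made-up data; the names are illustrative.

```python
# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

fitted = [a + b * x for x in xs]

sse = sum((f - y) ** 2 for f, y in zip(fitted, ys))   # error sum of squares
ssr = sum((f - y_bar) ** 2 for f in fitted)           # regression sum of squares
sst = sum((y - y_bar) ** 2 for y in ys)               # total sum of squares

r_squared = ssr / sst
print(sse, ssr, sst, r_squared)
```

For these points SSE is 3.6, SSR is 6.4 and SST is 10 (up to floating-point rounding), so R² is 0.64.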

Page 14: Regression

But how to answer questions like:

• Given that a positive slope (b) has been estimated: Does it give a reproducible indication that there is a positive trend, or is it a result of random variation?

• What is a confidence interval for the estimated slope?

• What is the prediction, with uncertainty, at a new x value?

Page 15: Regression

The standard simple regression model

• We have to do as before, and define a model

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where the $\varepsilon_i$ are independent, normally distributed, with equal variance $\sigma^2$

• We can then use data to estimate the model parameters, and to make statements about their uncertainty

Page 16: Regression

Confidence intervals for simple regression

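• In a simple regression model,

– a estimates $\beta_0$

– b estimates $\beta_1$

– $\hat{\sigma}^2 = SSE/(n-2)$ estimates $\sigma^2$

• Also,

$$(b - \beta_1)/S_b \sim t_{n-2}$$

where

$$S_b^2 = \frac{\hat{\sigma}^2}{(n-1)\,s_x^2}$$

estimates the variance of b

• So a confidence interval for $\beta_1$ is given by

$$b \pm t_{n-2,\,\alpha/2}\,S_b$$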
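As an illustration, the confidence interval for the slope can be computed for the grasshopper data from the earlier example. This is a sketch: the data are the rounded table values, and the critical value 2.201 (t distribution, 11 degrees of freedom, α/2 = 0.025, giving a 95% interval) is taken from a standard t table rather than computed.

```python
import math

# Data as printed in the grasshopper table (movements per second, temperature).
xs = [20.0, 16.0, 19.8, 18.4, 15.5, 14.7, 17.1, 15.4, 16.2, 15.0, 17.2, 17.0, 14.4]
ys = [31.4, 22.0, 34.1, 29.1, 24.0, 21.0, 27.7, 20.7, 28.5, 26.4, 28.1, 28.6, 24.6]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)   # equals (n - 1) * s_x^2
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = s_xy / s_xx
a = y_bar - b * x_bar

sse = sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))
sigma2_hat = sse / (n - 2)                 # estimates sigma^2
s_b = math.sqrt(sigma2_hat / s_xx)         # estimated standard deviation of b

T_CRIT = 2.201  # assumed t-table value: 11 degrees of freedom, alpha/2 = 0.025
lower = b - T_CRIT * s_b
upper = b + T_CRIT * s_b
print(f"b = {b:.3f}, 95% confidence interval: ({lower:.2f}, {upper:.2f})")
```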

Page 17: Regression

Hypothesis testing for simple regression

• Choose hypotheses:

$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$

• Test statistic:

$$b/S_b \sim t_{n-2}$$

• Reject $H_0$ if $b/S_b < -t_{n-2,\,\alpha/2}$ or $b/S_b > t_{n-2,\,\alpha/2}$
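The test can be sketched in code for the grasshopper data from the earlier example. As assumptions, the data are the rounded table values and the critical value 2.201 (11 degrees of freedom, α = 0.05 two-sided) is read from a t table.

```python
import math

# Data as printed in the grasshopper table (movements per second, temperature).
xs = [20.0, 16.0, 19.8, 18.4, 15.5, 14.7, 17.1, 15.4, 16.2, 15.0, 17.2, 17.0, 14.4]
ys = [31.4, 22.0, 34.1, 29.1, 24.0, 21.0, 27.7, 20.7, 28.5, 26.4, 28.1, 28.6, 24.6]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
a = y_bar - b * x_bar

sse = sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))
s_b = math.sqrt(sse / (n - 2) / s_xx)      # estimated standard deviation of b

t_stat = b / s_b                           # test statistic under H0: beta_1 = 0
T_CRIT = 2.201  # assumed t-table value: 11 degrees of freedom, alpha/2 = 0.025
reject_h0 = abs(t_stat) > T_CRIT
print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

With these data the statistic is well above the critical value, so the hypothesis of no linear trend is rejected at the 5% level.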

Page 18: Regression

Prediction from a simple regression model

• A regression model can be used to predict the response at a new value xn+1

• The uncertainty in this prediction comes from two sources:

– The uncertainty in the regression line

– The uncertainty of any response, given the regression line

• A confidence interval for the prediction:

$$a + bx_{n+1} \pm t_{n-2,\,\alpha/2}\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
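For the grasshopper example, the interval for the temperature at 18 movements per second can be sketched as follows. The data are the rounded table values, and the critical value 2.201 (11 degrees of freedom, α/2 = 0.025) is an assumption taken from a t table.

```python
import math

# Data as printed in the grasshopper table (movements per second, temperature).
xs = [20.0, 16.0, 19.8, 18.4, 15.5, 14.7, 17.1, 15.4, 16.2, 15.0, 17.2, 17.0, 14.4]
ys = [31.4, 22.0, 34.1, 29.1, 24.0, 21.0, 27.7, 20.7, 28.5, 26.4, 28.1, 28.6, 24.6]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
a = y_bar - b * x_bar

sse = sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))
sigma_hat = math.sqrt(sse / (n - 2))

x_new = 18.0
pred = a + b * x_new

# The "1" under the root is the uncertainty of a single new response;
# the other two terms are the uncertainty in the fitted line at x_new.
T_CRIT = 2.201  # assumed t-table value: 11 degrees of freedom, alpha/2 = 0.025
half_width = T_CRIT * sigma_hat * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / s_xx)
print(f"prediction {pred:.1f}, interval ({pred - half_width:.1f}, {pred + half_width:.1f})")
```

The interval is much wider than the uncertainty in the line alone, because the spread of individual responses around the line dominates.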

Page 19: Regression

Testing for correlation

• It is also possible to test whether a sample correlation r is large enough to indicate a nonzero population correlation

• Test statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}$$

• Note: The test only works for normal distributions and linear correlations: Always also investigate the scatter plot!
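A sketch of the correlation test for the grasshopper data from the earlier example (rounded table values, illustrative names). In simple regression this statistic is algebraically the same as the slope statistic $b/S_b$, which the code also checks.

```python
import math

# Data as printed in the grasshopper table (movements per second, temperature).
xs = [20.0, 16.0, 19.8, 18.4, 15.5, 14.7, 17.1, 15.4, 16.2, 15.0, 17.2, 17.0, 14.4]
ys = [31.4, 22.0, 34.1, 29.1, 24.0, 21.0, 27.7, 20.7, 28.5, 26.4, 28.1, 28.6, 24.6]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Sample correlation and the test statistic for zero population correlation.
r = s_xy / math.sqrt(s_xx * s_yy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# The slope statistic b / S_b, computed independently for comparison.
b = s_xy / s_xx
a = y_bar - b * x_bar
sse = sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))
t_slope = b / math.sqrt(sse / (n - 2) / s_xx)

print(f"r = {r:.3f}, t from correlation = {t_corr:.2f}, t from slope = {t_slope:.2f}")
```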

Page 20: Regression

Influence of extreme observations

• NOTE: The result of a regression analysis is very much influenced by points with extreme values, in either the x or the y direction.

• Always investigate visually, and determine if outliers are actually erroneous observations

Page 21: Regression

Example: Transformed variables

• The relationship between variables may not be linear

• Example: The natural model may be

$$y = ae^{bx}$$

• We want to find a and b so that the curve $y = ae^{bx}$ approximates the points as well as possible

[Figure: scatter plot of the points with the curve $y = ae^{bx}$]

Page 22: Regression

Example (cont.)

• When $y = ae^{bx}$ then

$$\log(y) = \log(a) + bx$$

• Use standard formulas on the pairs (x1,log(y1)), (x2, log(y2)), ..., (xn, log(yn))

• We get estimates for log(a) and b, and thus a and b

[Figure: scatter plot of the points with the fitted curve $y = ae^{bx}$]
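The log transformation can be sketched in code as follows. The function names and the test data are assumptions for illustration; the data are generated exactly from y = 2e^{0.5x}, so the fit should recover a = 2 and b = 0.5. Note that this minimizes the squared errors on the log scale, which is not the same as minimizing them on the original y scale.

```python
import math


def fit_line(xs, ys):
    """Least squares line y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b


def fit_exponential(xs, ys):
    """Fit y = a * exp(b*x) via a line fit on (x, log y). Requires all y > 0."""
    log_a, b = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(log_a), b


# Illustrative data lying exactly on y = 2 * exp(0.5 * x).
xs = [0, 1, 2, 3, 4]
ys = [2 * math.exp(0.5 * x) for x in xs]

a, b = fit_exponential(xs, ys)
print(a, b)
```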

Page 23: Regression

Another example of transformed variables

• Another natural model may be

$$y = ax^b$$

• We get that

$$\log(y) = \log(a) + b\log(x)$$

• Use standard formulas on the pairs (log(x1), log(y1)), (log(x2), log(y2)), ..., (log(xn), log(yn))

[Figure: scatter plot of the points with the curve $y = ax^b$]

Note: In this model, the curve goes through (0,0) (for b > 0)
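The same idea in code for the power model, as a sketch: take logs of both x and y, fit a line, and transform the intercept back. The function names and data are illustrative assumptions; the data lie exactly on y = 3x^{1.5}, and all x and y values must be positive for the logarithms to exist.

```python
import math


def fit_line(xs, ys):
    """Least squares line y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b


def fit_power(xs, ys):
    """Fit y = a * x**b via a line fit on (log x, log y). Requires x > 0 and y > 0."""
    log_a, b = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
    return math.exp(log_a), b


# Illustrative data lying exactly on y = 3 * x**1.5.
xs = [1, 2, 4, 8]
ys = [3 * x ** 1.5 for x in xs]

a, b = fit_power(xs, ys)
print(a, b)
```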

Page 24: Regression

More than one independent variable: Multiple regression

• Assume we have data of the type (x11, x12, x13, y1), (x21, x22, x23, y2), ...

• We want to ”explain” y from the x-values by fitting the following model:

$$y = a + bx_1 + cx_2 + dx_3$$

• Just like before, one can produce formulas for a,b,c,d minimizing the sum of the squares of the ”errors”.

• x1,x2,x3 can be transformations of different variables, or transformations of the same variable
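One way to obtain a, b, c and d is to set the partial derivatives of the sum of squares to zero, which gives a linear system (the normal equations), and solve it. The sketch below does this in plain Python with Gaussian elimination; the function names, the singularity threshold and the test data are assumptions for illustration, and in practice a numerical library would normally be used.

```python
def solve(A, rhs):
    """Solve the linear system A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: explanatory variables are linearly related")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):                 # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def fit_multiple(rows, ys):
    """Least squares coefficients [a, b, c, ...] for y = a + b*x1 + c*x2 + ..."""
    X = [[1.0] + list(row) for row in rows]        # leading 1 for the intercept a
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(p)]
    return solve(xtx, xty)


# Illustrative data generated exactly from y = 1 + 2*x1 - x2 + 0.5*x3.
rows = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 2), (2, 1, 1)]
ys = [1 + 2 * x1 - x2 + 0.5 * x3 for x1, x2, x3 in rows]

coef = fit_multiple(rows, ys)
print(coef)
```

The singularity check corresponds to the requirement that the explanatory variables must not be linearly related: if they are, the normal equations have no unique solution.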

Page 25: Regression

Multiple regression model

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_n x_{ni} + \varepsilon_i$$

• The errors $\varepsilon_i$ are independent random (normal) variables with expectation zero and variance $\sigma^2$

• The explanatory variables x1i, x2i, …, xni cannot be linearly related

Page 26: Regression

Use of multiple regression

• Versions of multiple regression are the most used models in econometrics, and in health economics

• It is a powerful tool to detect and verify connections between variables

Page 27: Regression

Doing a regression analysis

• Plot the data first, to investigate whether there is a natural relationship

• Linear or transformed model?

• Are there outliers which will unduly affect the result?

• Fit a model. Different models with the same number of parameters may be compared with $R^2$

• Make tests / confidence intervals for parameters

Page 28: Regression

Interpretation

• The parameters may have important interpretations

• The model may be used for prediction at new values (caution: Extrapolation can sometimes be dangerous!)

• Remember that subjective choices have been made, and interpret cautiously