Fitting the Data

Econ 140, Lecture 2


Today’s Plan

• Finishing off the examples from Lecture 1

• Introducing different types of data

• Fitting the data

– One of the most important lectures of the course

– There will be a question on this on a midterm and the final! (Almost guaranteed!)

– You can find this material in Appendix 4.2

Experimental vs Observational

• Because of financial/practical/ethical concerns, experiments in economics are rare (SIME/DIME, Tennessee STAR).

• Economists tend to use observational data, obtained from real-world behavior and collected using surveys or administrative records.

• Observational data pose problems for estimating causal effects: there is no random assignment, and the data definitions may not quite match what economic theory requires.

• Much of econometrics is devoted to handling the estimation problems encountered with observational data.

Cross-Section Data

• We have already seen 2 examples of cross-section data:

– Wages and years of education

– Voting polls in Florida

• Cross section data sets provide information about individual/agent behavior at a moment in time

• The Current Population Survey (CPS) is a cross-section survey that generates monthly detail about the US work force

• Data on counties, states, or even countries at a moment in time are also cross-section data.

Time Series Data Sets (1)

• Time series data sets provide information about individual/agent behavior over time

– A time unit of observation (day, week, month, year) defines a time series

• We hear about time series data every day:

– Nasdaq

– Financial Times Stock Exchange Index (FTSE)

– Dow Jones

– Government data: GDP/Unemployment/Inflation

Time Series Data Sets (2)

• Composition of unit can change

– FTSE gives information on the top 100 stocks each day, not necessarily the same 100 stocks every day

– CPS gives data each month on the number of people who are unemployed, but these are not the same people (we hope!) from month to month.

• Characteristics of time series data sets

– set of observations over time

– composition of unit can change

– compositional changes are dealt with using weighting schemes (Lecture 3)

Longitudinal Data Sets

• Longitudinal data sets provide information on a particular group of individuals/agents over time.

• For example: following the Econ 140 class of Fall 2002 over time, or following a set of firms over time.

• Example we will use: Production functions (Cobb-Douglas) - following firms over time.

• Book example: Traffic Deaths and Alcohol Taxes - following states over time.

Ordinary Least Squares (OLS)

• Learning how to calculate a straight line (Appendix 4.2)

– Recall the scatter plot of earnings vs. years of education: there was a mess of data!

• We can use Ordinary Least Squares (OLS) to fit a straight line through these data points

– This line is called the least squares line or line of best fit

– Why is it called the ‘least squares line’?

– The least squares line minimizes the errors: OLS picks the line with the smallest sum of squared vertical distances between the data points and the line

Two Parts to OLS

1. Derive estimators for a (intercept) and b (slope coefficient)

– this means using differential calculus!

2. Calculate values for a and b from data

– this means mechanically using the derived formulas for a & b

• How do we calculate a regression line through a mass of data points that do not necessarily lie on a straight line?

• Each data point is a pair of values $(X_i, Y_i)$.

OLS Line

• We’ll call the regression line $\hat{Y}$

– this is an estimate of the true $Y$

• The errors will be the difference between $\hat{Y}$ and $Y$:

$e_i = Y_i - \hat{Y}_i$

– errors can be positive or negative

• We can write the following general equations:

$\hat{Y}_i = a + bX_i$

$Y_i = \hat{Y}_i + e_i$

where $i = 1, \dots, n$.
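A minimal sketch of these definitions in Python. The data points, the candidate values of a and b, and the helper names `fitted_values` and `errors` are all made up for illustration; none of them comes from the course data set.

```python
# Hypothetical data, for illustration only (not the course data set).
X = [1.0, 2.0, 3.0, 4.0]
Y = [2.0, 2.5, 4.5, 5.0]


def fitted_values(a, b, xs):
    """Return Y-hat_i = a + b * X_i for each X_i."""
    return [a + b * x for x in xs]


def errors(a, b, xs, ys):
    """Return e_i = Y_i - Y-hat_i for each observation."""
    return [y - y_hat for y, y_hat in zip(ys, fitted_values(a, b, xs))]


# An arbitrary candidate line (assumed values, not OLS estimates).
a, b = 1.0, 1.0
y_hat = fitted_values(a, b, X)
e = errors(a, b, X, Y)

# Points above the line have positive errors, points below have negative ones.
for x, y, yh, ei in zip(X, Y, y_hat, e):
    print(f"X={x}, Y={y}, Y-hat={yh}, e={ei}")
```

Any pair (a, b) defines a line and a set of errors in this way; OLS is the rule for choosing the particular pair that makes the squared errors as small as possible in total.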

OLS Line

• A data set example is available at the course web site. It consists of five points. Using that output I can calculate the regression equation to be:

$\hat{Y}_i = 3.8 + 0.9X_i$

• Keeping this equation in mind, we can find estimates of a and b given our general formulas for $Y$ and $\hat{Y}$

• We derive a and b from two different types of regression equations:

a from $Y_i = a + e_i$

b from $Y_i = bX_i + e_i$

OLS Line: Deriving a (1)

• We can rewrite $Y_i = a + e_i$ as $e_i = Y_i - a$

– we could write the objective function for a as:

$g(a) = \sum_{i=1}^{n} e_i$

• Go back to the regression analysis example: notice that the sum of errors is zero!

– Why? The positive and negative errors from the line of best fit always cancel out

– For a minimum you need a first order condition (FOC) set to zero.

– We need a FOC for OLS that is set to zero, not a sum that is zero to start with!

OLS Line: Deriving a (2)

• We can’t just minimize the sum of the errors because

$\sum_{i=1}^{n} e_i = 0$

• Instead, we have to minimize the sum of the errors squared (hence ‘least squares’):

$g(a) = \sum_{i=1}^{n} e_i^2$

where $e_i = Y_i - a$, so that

$g(a) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - a)^2$

OLS Line: Deriving a (3)

• Differentiate $g(a) = \sum_{i=1}^{n} (Y_i - a)^2$ with respect to a to find the formula for the OLS estimator a

• Note that you set the first order condition to zero to find a minimum:

$-2\sum_{i=1}^{n} e_i = 0$

(don’t worry about the second order derivative, which will be positive).

• Remember that $e_i = Y_i - a$, so the condition becomes $\sum_{i=1}^{n} Y_i - na = 0$

• Solve for a:

$a = \frac{\sum_{i=1}^{n} Y_i}{n} = \bar{Y}$
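The result that the intercept-only estimator is the sample mean can be checked numerically. The sketch below uses a made-up list of Y values (not the course data set) and a hypothetical helper `sum_sq_errors`; it only illustrates the first order condition and is not a proof.

```python
# Hypothetical observations, for illustration only.
Y = [8.0, 7.0, 10.0, 10.0, 11.0]


def sum_sq_errors(a, ys):
    """Objective g(a) = sum over i of (Y_i - a)^2."""
    return sum((y - a) ** 2 for y in ys)


# OLS estimator for the intercept-only model: a = (sum of Y_i) / n.
a_hat = sum(Y) / len(Y)

# First order condition: the errors around a_hat sum to (numerically) zero.
sum_errors = sum(y - a_hat for y in Y)

# Moving a away from a_hat in either direction raises the objective.
g_at_a_hat = sum_sq_errors(a_hat, Y)
g_below = sum_sq_errors(a_hat - 0.5, Y)
g_above = sum_sq_errors(a_hat + 0.5, Y)

print(a_hat, sum_errors, g_at_a_hat, g_below, g_above)
```

The step size of 0.5 is arbitrary; any shift away from the mean should raise the sum of squared errors, which is what the positive second derivative implies.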

OLS Line: Deriving b (1)

• Now consider the slope regression where

$\hat{Y}_i = bX_i$ and $e_i = Y_i - bX_i$

• We use the same principles as before:

$g(b) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - bX_i)^2$

$-2\sum_{i=1}^{n} X_i(Y_i - bX_i) = -2\sum_{i=1}^{n} X_i e_i = 0$

Note: this condition only holds if there’s no correlation between X and the errors

So:

$\sum_{i=1}^{n} X_iY_i - b\sum_{i=1}^{n} X_i^2 = 0$

$b = \frac{\sum_{i=1}^{n} X_iY_i}{\sum_{i=1}^{n} X_i^2}$

(keep in mind that this expression only holds for the regression with a zero intercept and a non-zero slope)
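A short sketch of the zero-intercept slope formula in Python. The three data points and the function name `slope_through_origin` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical data, for illustration only.
X = [1.0, 2.0, 3.0]
Y = [2.0, 3.0, 7.0]


def slope_through_origin(xs, ys):
    """b = sum(X_i * Y_i) / sum(X_i^2) for the model Y_i = b * X_i + e_i."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / sum_xx


b = slope_through_origin(X, Y)
errors = [y - b * x for x, y in zip(X, Y)]

# First order condition for b: sum of X_i * e_i is (numerically) zero.
foc_b = sum(x * e for x, e in zip(X, errors))

# Without an intercept, the errors themselves need not sum to zero.
sum_errors = sum(errors)

print(b, foc_b, sum_errors)
```

In this example the X-weighted errors cancel but the plain sum of errors does not, which is one way to see why the full regression line needs both first order conditions.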

OLS Line: Collect a & b

• We know a regression line with a non-zero intercept and a non-zero slope coefficient looks like:

$\hat{Y}_i = a + bX_i$

• We also know:

$e_i = Y_i - \hat{Y}_i$

• From the derivations of a and b we have the necessary first order conditions:

$\sum_{i=1}^{n} e_i = 0$ and $\sum_{i=1}^{n} e_i X_i = 0$

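OLS Line: Collect a & b (2)

• Plugging $e_i = Y_i - a - bX_i$ into the FOC from the derivation of b:

$\sum_{i=1}^{n} X_iY_i - a\sum_{i=1}^{n} X_i - b\sum_{i=1}^{n} X_i^2 = 0$

• Plug the same expression for $e_i$ into the FOC from our derivation of a:

$\sum_{i=1}^{n} Y_i = na + b\sum_{i=1}^{n} X_i$

Solving for a:

$a = \bar{Y} - b\bar{X}$

• Substituting this a into the first equation and solving for b:

$b = \frac{\sum_{i=1}^{n} X_iY_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$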
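The two formulas can be put together in a few lines of Python. This is a sketch under stated assumptions: the function name `ols_line` and the four data points are hypothetical, and the code simply applies the formulas for a and b, then checks that both first order conditions hold on the resulting errors.

```python
def ols_line(xs, ys):
    """Return (a, b) for Y-hat_i = a + b * X_i using the derived formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Slope: (sum X_i Y_i - n * X-bar * Y-bar) / (sum X_i^2 - n * X-bar^2)
    b = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
    # Intercept: Y-bar - b * X-bar
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, for illustration only (not the course data set).
X = [1.0, 2.0, 3.0, 4.0]
Y = [2.0, 2.5, 4.5, 5.0]

a, b = ols_line(X, Y)
e = [y - (a + b * x) for x, y in zip(X, Y)]

foc_a = sum(e)                              # sum of e_i, numerically zero
foc_b = sum(x * ei for x, ei in zip(X, e))  # sum of X_i * e_i, numerically zero

print(a, b, foc_a, foc_b)
```

Note that the slope formula divides by $\sum X_i^2 - n\bar{X}^2$, which is zero when every X value is the same, so the function above would fail on such data.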

Example

• From the data set posted on the web

• To calculate the regression line you need:

$\sum_{i=1}^{n} Y_i, \; \sum_{i=1}^{n} X_i, \; \sum_{i=1}^{n} X_iY_i, \; \sum_{i=1}^{n} X_i^2, \; n$

• Solve for a & b given the formulas:

$b = \frac{\sum_{i=1}^{n} X_iY_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$

$a = \frac{\sum_{i=1}^{n} Y_i}{n} - b\,\frac{\sum_{i=1}^{n} X_i}{n}$

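Example (2)

$\sum_{i=1}^{n} Y_i = 46, \; \sum_{i=1}^{n} X_i = 30, \; \sum_{i=1}^{n} X_iY_i = 285, \; \sum_{i=1}^{n} X_i^2 = 190, \; n = 5$

$b = \frac{\sum_{i=1}^{n} X_iY_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{285 - 5\left(\frac{30}{5}\right)\left(\frac{46}{5}\right)}{190 - 5\left(\frac{30}{5}\right)^2} = 0.9$

$a = \frac{\sum_{i=1}^{n} Y_i}{n} - b\,\frac{\sum_{i=1}^{n} X_i}{n} = 9.2 - 0.9(6) = 3.8$

$\hat{Y}_i = 3.8 + 0.9X_i$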
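The arithmetic in this example can be reproduced from the five summary quantities alone. The sketch below uses only the sums quoted in the lecture; the function name `ols_from_sums` and its parameter names are assumptions made for illustration.

```python
def ols_from_sums(n, sum_y, sum_x, sum_xx, sum_xy):
    """Compute (a, b) from n, sum Y_i, sum X_i, sum X_i^2 and sum X_i Y_i."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    b = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b


# Sums quoted in the lecture example (five data points).
a, b = ols_from_sums(n=5, sum_y=46, sum_x=30, sum_xx=190, sum_xy=285)

# Rounded to two decimals to hide floating-point noise.
print(round(a, 2), round(b, 2))
```

Up to floating-point rounding, this should agree with the values a = 3.8 and b = 0.9 worked out by hand.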

Wrap Up

• Introduced three data types: cross-section, time series, and longitudinal

• Used the OLS technique to derive formulas for an intercept and a slope coefficient

– We estimated the regression lines $\hat{Y}_i = a$ and $\hat{Y}_i = bX_i$

– We found FOCs $\sum_{i=1}^{n} e_i = 0$ and $\sum_{i=1}^{n} e_i X_i = 0$

• Then we put everything together to estimate

$\hat{Y}_i = a + bX_i$