DSCI 5180: Introduction to the Business Decision Process

Spring 2013 – Dr. Nick Evangelopoulos

Lectures 4-5: Simple Regression Analysis (Ch. 3)

Chapter 3: Simple Regression Analysis (Part 1)

Terry Dielman, Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

3.1 Using Simple Regression to Describe a Relationship

Regression analysis is a statistical technique used to describe relationships among variables.

The simplest case is one where a dependent variable y may be related to an independent or explanatory variable x.

The equation expressing this relationship is the line:

$$y = b_0 + b_1 x$$

Slope and Intercept

For a given set of data, we need to calculate values for the slope $b_1$ and the intercept $b_0$.

Figure 3.1 shows the graph of a set of six (x, y) pairs that have an exact relationship.

Ordinary algebra is all you need to compute y = 1 + 2x.

Figure 3.1 Graph of an Exact Relationship

[Scatter plot of y against x; the six points lie exactly on a straight line.]

x    y
1    3
2    5
3    7
4    9
5    11
6    13
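Because the six points lie exactly on a line, any two of them determine the slope and intercept. A minimal Python sketch of that algebra follows; the helper name `line_through` is illustrative and not from the text.

```python
# Data from Figure 3.1 (exact relationship)
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 5, 7, 9, 11, 13]

def line_through(p, q):
    """Return (intercept, slope) of the line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return intercept, slope

# Use the first two points to recover the line.
b0, b1 = line_through((xs[0], ys[0]), (xs[1], ys[1]))

# Check that every point satisfies y = b0 + b1 * x.
exact = all(y == b0 + b1 * x for x, y in zip(xs, ys))
print(b0, b1, exact)
```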

Error in the Relationship

In real life, we usually do not have exact relationships.

Figure 3.2 shows a situation where y and x have a strong tendency to increase together, but the relationship is not perfect.

You can use a ruler to put a line in approximately the "right place" and use algebra again.

A good guess might be $\hat{y} = 1 + 2.5x$

Figure 3.2 Graph of a Relationship That Is NOT Exact

[Regression plot of y against x with fitted line y = -0.2 + 2.2 x; S = 1.48324, R-Sq = 90.6%, R-Sq(adj) = 88.2%]

x    y
1    3
2    2
3    8
4    8
5    11
6    13

Everybody Is Different

The drawback to this technique is that everybody will have their own opinion about where the line goes.

There would be ever greater differences if there were more data with a wider scatter.

We need a precise mathematical technique to use for this task.

Residuals

Figure 3.3 shows the previous graph where the "fit error" of each point is indicated.

These residuals are positive if the point is above the line and negative if the line is above the point.

We want a technique that will make the + and - even out.

Figure 3.3 Deviations From the Line

[Regression plot of y against x with fitted line y = -0.2 + 2.2 x (S = 1.48324, R-Sq = 90.6%, R-Sq(adj) = 88.2%); points above the line are marked as + deviations and points below the line as - deviations.]

Computation Ideas (1)

We can search for a line that minimizes the sum of the residuals:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)$$

While this is a good idea, it can be shown that any line passing through the point ($\bar{x}$, $\bar{y}$) will have this sum = 0.
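The claim that every line through ($\bar{x}$, $\bar{y}$) has a residual sum of zero can be checked numerically. The sketch below uses the Figure 3.2 data; the slopes tried are arbitrary choices made for illustration.

```python
# Data from Figure 3.2
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 2, 8, 8, 11, 13]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

def residual_sum(slope):
    """Sum of residuals for the line with this slope through (x_bar, y_bar)."""
    intercept = y_bar - slope * x_bar
    return sum(y - (intercept + slope * x) for x, y in zip(xs, ys))

# Very different slopes all give a residual sum of (numerically) zero.
sums = [residual_sum(s) for s in (-3.0, 0.0, 2.2, 10.0)]
print(sums)
```

Since the sum cannot tell these lines apart, it is not a usable fitting criterion on its own.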

Computation Ideas (2)

We can work with absolute values and search for a line that minimizes:

$$\sum_{i=1}^{n} |y_i - \hat{y}_i|$$

Such a procedure, called LAV or least absolute value regression, does exist but usually is found only in specialized software.

Computation Ideas (3)

By far the most popular approach is to square the residuals and minimize:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

This procedure is called least squares and is widely available in software. It uses calculus to solve for the $b_0$ and $b_1$ terms and gives a unique solution.

Least Squares Estimators

There are several formulas for the $b_1$ term. If doing it by hand, we might want to use:

$$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2}$$

The intercept is $b_0 = \bar{y} - b_1 \bar{x}$
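The hand-computation formula translates directly into code. This is a minimal sketch rather than a library routine; the function name `least_squares` is an assumption made for illustration, and it is applied here to the six points of Figure 3.2.

```python
def least_squares(xs, ys):
    """Return (b0, b1) using the hand-computation formula for b1."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: (sum xy - (1/n) sum x sum y) / (sum x^2 - (1/n) (sum x)^2)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: y_bar - b1 * x_bar
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4, 5, 6], [3, 2, 8, 8, 11, 13])
print(round(b0, 4), round(b1, 4))
```

The result agrees with the fitted line y = -0.2 + 2.2 x shown with Figure 3.2.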

Figure 3.5 Computations Required for $b_1$ and $b_0$

          x_i    y_i    x_i^2    x_i y_i
          1      3      1        3
          2      2      4        4
          3      8      9        24
          4      8      16       32
          5      11     25       55
          6      13     36       78
Totals    21     45     91       196

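Calculations

$$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2} = \frac{196 - \frac{1}{6}(21)(45)}{91 - \frac{1}{6}(21)^2} = \frac{38.5}{17.5} = 2.2$$

$$b_0 = \bar{y} - b_1 \bar{x} = 7.5 - 2.2(3.5) = -0.2$$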

The Unique Minimum

The line we obtained was:

$$\hat{y} = -0.2 + 2.2x$$

The sum of squared errors (SSE) is:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 8.80$$

No other linear equation will yield a smaller SSE. For the line 1 + 2.5x we guessed earlier, the SSE is 40.75.
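The SSE of any candidate line can be computed directly from the six data points. The sketch below evaluates the least squares line and the guessed line taken as stated (intercept 1, slope 2.5), then tries a few nearby lines; the perturbation sizes are arbitrary choices for illustration.

```python
# Data from Figure 3.2
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 2, 8, 8, 11, 13]

def sse(b0, b1):
    """Sum of squared errors for the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

sse_fit = sse(-0.2, 2.2)     # least squares line
sse_guess = sse(1.0, 2.5)    # line drawn by eye

# Nudging the least squares coefficients never lowers the SSE.
steps = (-0.1, 0.0, 0.1)
nearby = [sse(-0.2 + d0, 2.2 + d1) for d0 in steps for d1 in steps]
no_better = all(v >= sse_fit - 1e-12 for v in nearby)

print(round(sse_fit, 2), round(sse_guess, 2), no_better)
```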

3.2 Examples of Regression as a Descriptive Technique

Example 3.2 Pricing Communications Nodes

A Ft. Worth manufacturing company was concerned about the cost of adding nodes to a communications network. They obtained data on 14 existing nodes.

They did a regression of cost (the y) on number of ports (x).

Pricing Communications Nodes

[Scatter plot of COST against NUMPORTS with the fitted regression line.]

Cost = 16594 + 650 NUMPORTS

Example 3.3 Estimating Residential Real Estate Values

The Tarrant County Appraisal District uses data such as house size, location and depreciation to help appraise property.

Regression can be used to establish a weight for each factor. Here we look at how price depends on size for a set of 100 homes. The data are from 1990.

Tarrant County Real Estate

[Scatter plot of VALUE against SIZE with the fitted regression line.]

VALUE = -50035 + 72.8 SIZE

Example 3.4 Forecasting Housing Starts

Forecasts of various economic measures are important to the government and various industries.

Here we analyze the relationship between US housing starts and mortgage rates. The rate used is the US average for new home purchases.

Annual data from 1963 to 2002 are used.

US Housing Starts

[Scatter plot of STARTS against RATES with the fitted regression line.]

STARTS = 1726 - 22.2 RATES
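Each fitted equation in Section 3.2 describes the data by giving a value of y for any x. The sketch below evaluates the three lines at one x each; the inputs (30 ports, a size of 2000, a rate of 8) are illustrative choices inside the plotted ranges, not values taken from the text.

```python
def predict(b0, b1, x):
    """Value of the fitted line b0 + b1 * x."""
    return b0 + b1 * x

cost_30_ports = predict(16594, 650, 30)         # Example 3.2: Cost = 16594 + 650 NUMPORTS
value_size_2000 = predict(-50035, 72.8, 2000)   # Example 3.3: VALUE = -50035 + 72.8 SIZE
starts_rate_8 = predict(1726, -22.2, 8)         # Example 3.4: STARTS = 1726 - 22.2 RATES

print(cost_30_ports, round(value_size_2000, 1), round(starts_rate_8, 1))
```

The sign of each slope carries the description: cost and value rise with x, while housing starts fall as rates rise.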

3.3 Inferences From a Simple Regression Analysis

So far regression has been used as a way to describe the relationship between the two variables.

Here we will use our sample data to make inferences about what is going on in the underlying population.

To do that, we first need some assumptions about how things are.

3.3.1 Assumptions Concerning the Population Regression Line

Let's use the communications nodes example to illustrate. Costs ranged from roughly $23,000 to $57,000 and number of ports from 12 to 68.

Three times we had projects with 24 ports, but the three costs were all different. The same thing occurred at repeated observations at 52 and 56 ports.

This illustrates how we view things: at each value of x there is a distribution of potential y values that can occur.

The Conditional Mean

Our first assumption is that the means of these distributions all lie on a straight line:

$$\mu_{y|x} = \beta_0 + \beta_1 x$$

For example, at projects with 30 ports, we have:

$$\mu_{y|x=30} = \beta_0 + 30\beta_1$$

The actual costs of projects with 30 ports are going to be distributed about the mean. This also happens at other sizes of projects, so you might see something like the next slide.

Figure 3.12 Distribution of Costs around the Regression Line

[Plot of Cost against Nodes, with a distribution of costs drawn at 12, 30 and 68 nodes, each centered on the line $\beta_0 + \beta_1$ Nodes.]

The Disturbance Terms

Because of the variation around the regression line, it is convenient to view the individual costs as:

$$y_i = \beta_0 + \beta_1 x_i + e_i$$

The $e_i$ are called the disturbances and represent how $y_i$ differs from its conditional mean. If $y_i$ is above the mean, its disturbance has a + value.

Assumptions

1. We expect the average disturbance $e_i$ to be zero so the regression line passes through the conditional mean of y.

2. The $e_i$ have constant variance $\sigma_e^2$.

3. The $e_i$ are normally distributed.

4. The $e_i$ are independent.
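One way to see what the four assumptions describe is to simulate data that satisfy them. In the sketch below the population values `BETA0`, `BETA1` and `SIGMA_E` are hypothetical numbers chosen only for illustration; they are not estimates from the text.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population values (assumptions, not from the text)
BETA0, BETA1, SIGMA_E = 16000.0, 650.0, 4000.0

def simulate(xs):
    """Draw y_i = BETA0 + BETA1 * x_i + e_i with independent normal e_i."""
    # Each disturbance has mean 0 and the same standard deviation at every x.
    es = [random.gauss(0.0, SIGMA_E) for _ in xs]
    ys = [BETA0 + BETA1 * x + e for x, e in zip(xs, es)]
    return ys, es

xs = [random.randint(12, 68) for _ in range(10000)]  # ports between 12 and 68
ys, es = simulate(xs)

mean_e = sum(es) / len(es)  # should be small relative to SIGMA_E
print(round(mean_e, 1))
```

At any fixed x the simulated y values scatter around the same line, which is the picture drawn in Figure 3.12.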

3.3.2 Inferences About $\beta_0$ and $\beta_1$

We use our sample data to estimate $\beta_0$ by $b_0$ and $\beta_1$ by $b_1$. If we had a different sample, we would not be surprised to get different estimates.

Understanding how much they would vary from sample to sample is an important part of the inference process.

We use the assumptions, together with our data, to construct the sampling distributions for $b_0$ and $b_1$.

The Sampling Distributions

The estimators have many good statistical properties. They are unbiased, consistent and minimum variance.

They have normal distributions with standard errors that are functions of the x values and $\sigma_e^2$. Full details are in Section 3.3.2.

Estimate of $\sigma_e^2$

This is an unknown quantity that needs to be estimated from data.

We estimate it by the formula:

$$S_e^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} = \frac{SSE}{n-2} = MSE$$

The term MSE stands for mean squared error and is more or less the average squared residual.

Standard Error of the Regression

The divisor n-2 used in the previous calculation follows our general rule that the degrees of freedom equal the sample size minus the number of estimates we make ($b_0$ and $b_1$) before estimating the variance.

The square root of MSE is $S_e$, which we call the standard error of the regression.

$S_e$ can be roughly interpreted as the "typical" amount we miss in estimating each y value.
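For the six-point example of Figure 3.2, MSE and $S_e$ follow from the residuals of the fitted line. A minimal sketch, reusing the fitted coefficients -0.2 and 2.2 from that figure:

```python
import math

# Data and fitted line from Figure 3.2
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 2, 8, 8, 11, 13]
b0, b1 = -0.2, 2.2

n = len(xs)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
mse = sse / (n - 2)      # two estimates (b0, b1) were made, so divide by n - 2
s_e = math.sqrt(mse)     # standard error of the regression

print(round(mse, 2), round(s_e, 5))
```

The value of `s_e` agrees with S = 1.48324 reported on the regression plot.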

Inference About $\beta_1$

Interval estimates and hypothesis tests are constructed using the sampling distribution of $b_1$.

The standard error of $b_1$ is:

$$S_{b_1} = S_e \sqrt{\frac{1}{(n-1) S_x^2}}$$

where $S_x^2$ is the sample variance of the x values.

Computer programs routinely compute this and report its value.
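As a hedged illustration, the standard error of the slope for the six-point example can be computed from $S_e$ = 1.48324 (reported with Figure 3.2) and the sample variance of x. The variable names are illustrative.

```python
import math
import statistics

xs = [1, 2, 3, 4, 5, 6]   # x values from Figure 3.2
s_e = 1.48324             # standard error of the regression from the plot
n = len(xs)

s_x2 = statistics.variance(xs)                 # sample variance, divisor n - 1
s_b1 = s_e * math.sqrt(1 / ((n - 1) * s_x2))   # standard error of b1

print(s_x2, round(s_b1, 5))
```

A wider spread in the x values makes the denominator larger and the slope estimate more precise.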

Interval Estimate

The distribution we use is a t with n-2 degrees of freedom.

The interval is:

$$b_1 \pm t_{n-2} S_{b_1}$$

The value of t, of course, depends on the selected confidence level.
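A small sketch of the interval for the six-point example follows. The standard error 0.35456 is computed from $S_e$ and the x values of Figure 3.2, and the critical value 2.776 is the standard t-table value for 95% confidence with 4 degrees of freedom; both are supplied here as assumptions rather than quoted from the text.

```python
b1 = 2.2          # slope of the fitted line in Figure 3.2
s_b1 = 0.35456    # standard error of b1 (assumed, computed from Figure 3.2 data)
t_crit = 2.776    # t table value, 95% confidence, n - 2 = 4 df (assumed)

half_width = t_crit * s_b1
lower, upper = b1 - half_width, b1 + half_width

print(round(lower, 4), round(upper, 4))
```

With only six observations the interval is wide, although it still excludes zero.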

Tests About $\beta_1$

The most common test is that a change in the x variable does not induce a change in y, which can be stated:

$H_0$: $\beta_1 = 0$    $H_a$: $\beta_1 \neq 0$

If $H_0$ is true it implies the population regression equation is a flat line; that is, regardless of the value of x, y has the same distribution.

Test Statistic

The test would be performed by using the standardized test statistic:

$$t = \frac{b_1 - 0}{S_{b_1}}$$

Most computer programs compute this and its associated p-value and include them on the output.

The p-value is for the two-sided version of the test.

Inference About $\beta_0$

We can also compute confidence intervals and perform hypothesis tests about the intercept in the population equation.

Details about the tests and intervals are in Section 3.3.2, but in most problems we are not interested in this.

The intercept is the value of y at x=0 and in many problems this is not relevant; for example, we never see houses with zero square feet of floor space.

Sometimes it is relevant, anyway. If we are estimating costs, we could interpret the intercept as the fixed cost. Even though we never see communication nodes with zero ports, there is likely to be a fixed cost associated with setting up each project.

Example 3.6 Pricing Communications Nodes (continued)

Inference questions:

1. What is the equation relating NUMPORTS to COST?

2. Is the relationship significant?

3. What is an interval estimate of $\beta_1$?

4. Is the relationship positive?

5. Can we claim each port costs at least $1000?

6. What is our estimate of fixed cost?

7. Is the intercept 0?

Minitab Regression Output

Regression Analysis: COST versus NUMPORTS

The regression equation is
COST = 16594 + 650 NUMPORTS

Predictor    Coef  SE Coef     T      P
Constant    16594     2687  6.18  0.000
NUMPORTS   650.17    66.91  9.72  0.000

S = 4307 R-Sq = 88.7% R-Sq(adj) = 87.8%

Analysis of Variance

Source          DF          SS          MS      F      P
Regression       1  1751268376  1751268376  94.41  0.000
Residual Error  12   222594146    18549512
Total           13  1973862521
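The summary figures in this output can be recomputed from the coefficient and ANOVA tables. The following is a minimal sketch in Python that uses only the numbers printed in the output; the variable names are illustrative and are not Minitab's.

```python
import math

# Values copied from the Minitab output (names are illustrative).
coef = {"Constant": 16594.0, "NUMPORTS": 650.17}
se_coef = {"Constant": 2687.0, "NUMPORTS": 66.91}
ss_regression = 1751268376.0
ss_error = 222594146.0
ss_total = 1973862521.0
df_regression, df_error, df_total = 1, 12, 13

# T column: each coefficient divided by its standard error.
t_ratios = {name: coef[name] / se_coef[name] for name in coef}

# S is the square root of the mean square for residual error.
ms_error = ss_error / df_error
s = math.sqrt(ms_error)

# R-Sq is the share of total variation explained by the regression.
r_sq = ss_regression / ss_total

# R-Sq(adj) compares mean squares instead of sums of squares.
r_sq_adj = 1 - (ss_error / df_error) / (ss_total / df_total)

# F is the regression mean square over the error mean square.
f_stat = (ss_regression / df_regression) / ms_error

print(f"t ratios: {t_ratios}")
print(f"S = {s:.0f}, R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}, F = {f_stat:.2f}")
```

In simple regression the F statistic equals the square of the slope's t ratio, up to rounding in the printed values.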


Is the relationship significant?

\(H_0\colon \beta_1 = 0\) (Cost does not change when the number of ports increases)

\(H_a\colon \beta_1 \neq 0\) (Cost does change)

We will use a 5% level of significance and the \(t\) distribution with \(n - 2 = 12\) degrees of freedom.

Decision rule: Reject \(H_0\) if \(t > 2.179\) or if \(t < -2.179\).

From the Minitab output, \(t = 9.72\) (p-value = .000).

We conclude that there is a significant relationship between project size and cost.
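The printed p-value of .000 is a rounded figure. As a rough check, the two-sided p-value can be approximated by integrating the t density numerically. The sketch below uses only the Python standard library and Simpson's rule; the function names and the number of integration steps are my own choices, not part of the course material.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Approximate P(|T| >= |t_stat|) using Simpson's rule on [0, |t_stat|]."""
    b = abs(t_stat)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_b = total * h / 3
    # Half the distribution lies above 0; subtract the part between 0 and b.
    return max(0.0, 2 * (0.5 - area_zero_to_b))

p_value = two_sided_p(9.72, 12)
reject_h0 = abs(9.72) > 2.179  # decision rule from the slide

print(f"approximate two-sided p-value: {p_value:.2e}, reject H0: {reject_h0}")
```

Calling `two_sided_p(2.179, 12)` should return a value very close to 0.05, which is a quick way to confirm that 2.179 is the 5% two-sided critical value for 12 degrees of freedom.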


What is an interval estimate of \(\beta_1\)?

Interval is: \(b_1 \pm t_{n-2}\, s_{b_1}\)

For a 95% interval use \(t = 2.179\):

\(650.17 \pm 2.179(66.91) = 650.17 \pm 145.80\)

We are 95% sure that the average cost for each additional port is between $504 and $796.
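The interval arithmetic can be sketched in a few lines of Python. The helper name `t_interval` is hypothetical; the inputs are the slope, its standard error, and the critical value quoted on the slide.

```python
def t_interval(estimate, std_error, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin

# Slope and SE from the Minitab output; 2.179 is the slide's critical value for 12 d.f.
lower, upper = t_interval(650.17, 66.91, 2.179)
print(f"95% interval for the slope: {lower:.2f} to {upper:.2f}")
```

The same helper applies to any coefficient in the output, provided the matching standard error and degrees of freedom are used.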


Can we claim a positive relationship?

\(H_0\colon \beta_1 = 0\) (Cost does not change when size increases)

\(H_a\colon \beta_1 > 0\) (Cost increases when size increases)

We will use a 5% level of significance and the \(t\) distribution with \(n - 2 = 12\) degrees of freedom.

Decision rule: Reject \(H_0\) if \(t > 1.782\).

From the Minitab output, \(t = 9.72\) (p-value is half of the listed value of .000, which is still .000).

We conclude that the project cost does increase with project size.
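The "half of the listed value" rule holds only when the sign of the t statistic agrees with the direction of the alternative. A small sketch of the conversion follows; it assumes the software reports a two-sided p-value for a hypothesized value of 0, and the function name and the `alternative` labels are hypothetical.

```python
def one_sided_p(two_sided_p, t_stat, alternative):
    """Convert a two-sided p-value to a one-sided one.

    alternative: "greater" for Ha: beta > 0, "less" for Ha: beta < 0.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    half = two_sided_p / 2
    # The statistic supports Ha only when its sign matches the alternative.
    supports_ha = t_stat > 0 if alternative == "greater" else t_stat < 0
    return half if supports_ha else 1 - half

# Illustrative numbers (not from the Minitab output): two-sided p of 0.04.
print(one_sided_p(0.04, 2.0, "greater"))   # sign agrees with Ha, so p is halved
print(one_sided_p(0.04, -2.0, "greater"))  # sign disagrees, so p is large
```

In the communications nodes example the statistic is positive and the alternative is "greater", so halving the listed p-value is appropriate.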


Is the cost per port at least $1000?

\(H_0\colon \beta_1 \geq 1000\) (Cost per port at least $1000)

\(H_a\colon \beta_1 < 1000\) (Cost is less than $1000)

Again we will use a 5% level of significance and 12 degrees of freedom.

Decision rule: Reject \(H_0\) if \(t < -1.782\).

Here use \(t = \dfrac{b_1 - 1000}{s_{b_1}} = \dfrac{650.17 - 1000}{66.91} = -5.23\)

We conclude that the cost per port is (much) less than $1000.
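Minitab's T column only tests a hypothesized value of 0, so a test against $1000 has to be computed by hand. A minimal Python sketch, with a hypothetical helper name, using the slope and standard error from the output:

```python
def t_statistic(estimate, hypothesized, std_error):
    """t ratio for testing a coefficient against any hypothesized value."""
    return (estimate - hypothesized) / std_error

t_stat = t_statistic(650.17, 1000.0, 66.91)

# Lower-tail decision rule from the slide: 5% level, 12 degrees of freedom.
reject_h0 = t_stat < -1.782

print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

Setting `hypothesized` to 0 reproduces the T column of the output, so the same helper covers the earlier tests as well.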


What is our estimate of fixed cost?

We can interpret the intercept of the equation as fixed cost, and the slope as variable cost. For the intercept, an interval is:

\(b_0 \pm t_{n-2}\, s_{b_0}\)

\(16594 \pm 2.179(2687) = 16594 \pm 5855\)

We are 95% sure the fixed cost is between $10,739 and $22,449.
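The intercept interval can be recomputed directly from the Minitab output (Constant 16594, SE Coef 2687) and the critical value 2.179. This is a short sketch with illustrative variable names; it also checks whether the interval contains zero.

```python
# Intercept and its standard error from the Minitab output.
b0 = 16594.0
se_b0 = 2687.0
t_crit = 2.179  # 95% two-sided critical value for 12 degrees of freedom

margin = t_crit * se_b0
lower, upper = b0 - margin, b0 + margin

# An interval that excludes 0 points toward a nonzero fixed cost.
contains_zero = lower <= 0 <= upper

print(f"margin = {margin:.0f}, interval = {lower:.0f} to {upper:.0f}")
print(f"interval contains zero: {contains_zero}")
```

A 95% interval that excludes zero leads to the same conclusion as a two-sided test of the intercept at the 5% level.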


Is the intercept 0?

\(H_0\colon \beta_0 = 0\) (Fixed cost is 0)

\(H_a\colon \beta_0 \neq 0\) (Fixed cost is not 0)

Again, use a 5% level of significance and 12 d.f.

Decision rule: Reject \(H_0\) if \(t > 2.179\) or if \(t < -2.179\).

From the Minitab output, \(t = 6.18\) (p-value = .000).

We conclude that the fixed cost is not zero.


slide 47

DSCI 5180 Decision Making

HW 2 – Interpretation of regression coefficients

Y-intercept: A house of size 0 will have a value of -$50,035 (meaningless math artifact)

Slope: When Size increases by 1 sqft, value increases by $72.80
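Both interpretations can be read off the fitted line, value = -50,035 + 72.80 × Size. A brief sketch in Python follows; the function name and the 2,000 sq ft example size are hypothetical and are not taken from the homework data.

```python
INTERCEPT = -50035.0  # dollars; fitted value at Size = 0 (no practical meaning)
SLOPE = 72.80         # dollars per additional square foot

def predicted_value(size_sqft):
    """Fitted house value in dollars for a given size in square feet."""
    return INTERCEPT + SLOPE * size_sqft

# Intercept: the fitted value at a size of zero.
print(predicted_value(0))

# Slope: the change in fitted value for one extra square foot,
# evaluated at a hypothetical 2,000 sq ft house.
print(predicted_value(2001) - predicted_value(2000))
```

The one-unit difference is the same at any starting size, which is what makes the slope interpretable even though the intercept is not.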