DSCI 5340: Predictive Modeling and Business Forecasting Spring 2013 – Dr. Nick Evangelopoulos


Lecture 1: Introduction to Business Forecasting

Review of Simple Regression (Ch. 1-3)

Some material taken from: Michael Hand (Willamette University), Biz/Ed


Forecasting

http://www.dilbert.com/2010-07-02/

“It is far better to foresee even without certainty than not to foresee at all.”

Henri Poincaré (1854-1912), polymath and chaos theory pioneer, The Foundations of Science


Why Forecast?

The effectiveness of almost every human endeavor, every public initiative, depends in part upon unknown and uncertain future outcomes – the demand for services, the revenues to fund them.

The quality of decisions about whether or not to engage and at what level improves with the reliability of supporting forecasts.


The two types of Forecasting

Qualitative – seeking opinions on which to base decision making

Consumer panels, focus groups, etc.

Quantitative – using statistical data to help inform decision making

Identifying trends

Moving averages – seasonal, cyclical, random

Extrapolation – simple


Costs and Benefits of Forecasting

Benefits:

Aids decision making

Informs planning and resource allocation decisions

If data is of high quality, can be accurate


Costs:

Data not always reliable or accurate

Data may be out of date

The past is not always a guide to the future

Qualitative data may be influenced by peer pressure

Difficulty of coping with changes to external factors out of the business’s control – e.g. economic policy, political developments (9/11?), natural disasters – hurricanes, earthquakes, etc.



[Figure: Oregon Personal Income Tax Revenues (in $ Millions). Vertical axis: Data/Forecasts, 600 to 1500. Horizontal axis: Period, quarterly from 1996:01 to 2001:04.]

Example: Oregon Personal Income Taxes, 1996 – 2001

(see Class Tools > Sitewide > Hand Outs > Public Finance > MultDecompPIT.xls)


Example: Classical Multiplicative Decomposition

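Conceptual Decomposition:

\[ y_t = \text{Trend}_t \times \text{Cycle}_t \times \text{Seasonal}_t \times \text{Irregular}_t \]

Conceptual Forecast:

\[ \hat{y}_t = \text{Trend}_t \times \text{Seasonal}_t \]

Forecasting Model:

\[ \hat{y}_t = (b_0 + b_1 t) \times s_t, \qquad s_t \in \{ s_1, s_2, \ldots, s_L \} \]

Trend: Long-term growth/decline

Cycle: Long-term slow, irregular oscillation

Seasonal: Regular, periodic variation w/in calendar year

Irregular: Short-term, erratic variation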



Conceptual Decomposition: \( y_t = \text{Trend}_t \times \text{Cycle}_t \times \text{Seasonal}_t \times \text{Irregular}_t \)

[Figure: five panels plotted by Period, quarterly from 1996:01 to 2001:04. Data (600 to 1600), Trend (600 to 1600), Seasonal (0.85 to 1.25), Cyclical (0.85 to 1.25), Irregular (0.85 to 1.25).]


Example: Classical Multiplicative Decomposition: Model Interpretation

\[ \hat{y}_t = (b_0 + b_1 t) \times s_t = (731.9291 + 18.5017\,t) \times \{ 0.9794,\ 0.9236,\ 0.8913,\ 1.2057 \} \]

The four seasonal indexes apply to Q1, Q2, Q3 and Q4, respectively (L = 4).

Model Interpretation

Initial, time-zero (1995:Q4) level is $731.93 million

Increasing at $18.5 million per quarter

Seasonal pattern

Peak in Q4 21% over trend

Trough in Q3 11% below trend


Example: Classical Multiplicative Decomposition: Forecasts

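\[ \hat{y}_t = (b_0 + b_1 t) \times s_t = (731.9291 + 18.5017\,t) \times \{ 0.9794,\ 0.9236,\ 0.8913,\ 1.2057 \} \]

Forecasts

\[
\begin{aligned}
\text{1996:Q1},\ t = 1: \quad & \hat{y}_{1} = [731.9291 + 18.5017(1)] \times 0.9794 = 750.4309 \times 0.9794 = 735.00 \\
\text{2002:Q1},\ t = 25: \quad & \hat{y}_{25} = [731.9291 + 18.5017(25)] \times 0.9794 = 1194.4727 \times 0.9794 = 1169.90 \\
\text{2002:Q2},\ t = 26: \quad & \hat{y}_{26} = [731.9291 + 18.5017(26)] \times 0.9236 = 1212.9744 \times 0.9236 = 1120.32 \\
\text{2002:Q3},\ t = 27: \quad & \hat{y}_{27} = [731.9291 + 18.5017(27)] \times 0.8913 = 1231.4762 \times 0.8913 = 1097.60 \\
\text{2002:Q4},\ t = 28: \quad & \hat{y}_{28} = [731.9291 + 18.5017(28)] \times 1.2057 = 1249.9779 \times 1.2057 = 1507.07
\end{aligned}
\]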
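The forecast arithmetic for the Oregon model can be reproduced with a short Python sketch. The function and variable names here are illustrative assumptions, not part of the course materials. The coefficients and seasonal indexes are the rounded values shown on the slide, so the printed forecasts can differ from the slide's figures by a few cents in the second decimal place.

```python
# Rounded estimates as shown on the slide (assumed to be in $ millions)
B0 = 731.9291   # trend level at time zero (1995:Q4)
B1 = 18.5017    # trend increase per quarter
SEASONAL = {1: 0.9794, 2: 0.9236, 3: 0.8913, 4: 1.2057}  # index by quarter

def quarter_of(t):
    """Quarter (1-4) for period t, where t = 1 is 1996:Q1."""
    return (t - 1) % 4 + 1

def forecast(t):
    """Multiplicative forecast: linear trend at period t times the seasonal index."""
    trend = B0 + B1 * t
    return trend * SEASONAL[quarter_of(t)]

for t in (25, 26, 27, 28):  # the four quarters of 2002
    print(f"t={t} (2002:Q{quarter_of(t)}): {forecast(t):.2f}")
```

Because the seasonal index multiplies the trend, the same index produces a larger dollar swing in later periods than in earlier ones.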

Overview of Simple Regression Analysis


Regression Analysis

Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will study.

Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).

Dependent variable: denoted Y

Independent variables: denoted X1, X2, …, Xk


Correlation Analysis…

If we are interested only in determining whether a relationship exists, we employ correlation analysis, a technique introduced earlier.

This chapter will examine the relationship between two variables, sometimes called simple linear regression.

Mathematical equations describing these relationships are also called models, and they fall into two types: deterministic or probabilistic.


[Figure: a linear regression equation, y = a + bx, illustrating the geometrical interpretations of a and b. For a unit increase in X (x2 - x1 = 1), b is the amount of increase in Y.]


Simple Linear Regression: correlation = r

Positive r: y increases as x increases

r = 1: a perfect positive relationship between y and x


Simple Linear Regression

Negative r: y decreases as x increases

r = -1: a perfect negative relationship between y and x


Simple Linear Regression

r near zero: little or no linear relationship between y and x



A Model…

To create a probabilistic model, we start with a deterministic model that approximates the relationship we want to model and add a random term that measures the error of the deterministic component.

Deterministic Model: The cost of building a new house is about $75 per square foot and most lots sell for about $25,000. Hence the approximate selling price (y) would be:

y = $25,000 + ($75/ft²)(x)

(where x is the size of the house in square feet)


A Model…

A model of the relationship between house size (independent variable) and house price (dependent variable) would be:

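[Figure: House Price plotted against House size as a straight line. The intercept reflects that most lots sell for $25,000; the slope reflects that building a house costs about $75 per square foot.]

House Price = 25000 + 75(Size)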

In this model, the price of the house is completely determined by the size.


A Model…

In real life however, the house cost will vary even among the same size of house:

[Figure: House Price plotted against House size, with points scattered around a line that starts at $25K. Same square footage, but different price points (e.g. décor options, cabinet upgrades, lot location…). Lower vs. higher variability.]

House Price = 25,000 + 75(Size) + \(\varepsilon\)
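To make the difference between the deterministic and the probabilistic house-price model concrete, here is a minimal Python sketch. The normally distributed error term and its standard deviation (`noise_sd`) are assumptions chosen only for illustration; the slides do not specify either.

```python
import random

def deterministic_price(size_sqft):
    """Deterministic component: $25,000 lot plus $75 per square foot."""
    return 25_000 + 75 * size_sqft

def simulated_price(size_sqft, rng, noise_sd=10_000):
    """Probabilistic model: deterministic price plus a random error term.

    noise_sd is a hypothetical value, not taken from the slides.
    """
    return deterministic_price(size_sqft) + rng.gauss(0, noise_sd)

rng = random.Random(5340)  # fixed seed so repeated runs give the same draws
size = 2_000
print("Deterministic price:", deterministic_price(size))
for _ in range(3):
    print("Simulated price:   ", round(simulated_price(size, rng)))
```

Every simulated house has the same square footage, yet the prices differ, which is the variability the error term is meant to capture.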


Simple Linear Regression Model…

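A straight line model with one independent variable is called a first order linear model or a simple linear regression model. It is written as:

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

where y is the dependent variable, x is the independent variable, \(\beta_0\) is the y-intercept, \(\beta_1\) is the slope of the line, and \(\varepsilon\) is the error variable.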


Simple Linear Regression Model…

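Note that both \(\beta_0\) and \(\beta_1\) are population parameters which are usually unknown and hence estimated from the data.

[Figure: a straight line on x and y axes with a rise and a run marked. \(\beta_1\) = slope (= rise/run); \(\beta_0\) = y-intercept.]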


Which line has the best “fit” to the data?


Estimating the Coefficients…

In much the same way we base estimates of \(\mu\) on \(\bar{x}\), we estimate \(\beta_0\) using b0 and \(\beta_1\) using b1, the y-intercept and slope (respectively) of the least squares or regression line given by:

\[ \hat{y} = b_0 + b_1 x \]

(Recall: this is an application of the least squares method and it produces a straight line that minimizes the sum of the squared differences between the points and the line)


Least Squares Line…

[Figure: scatter diagram with the least squares line.]

This line minimizes the sum of the squared differences between the points and the line…

These differences are called residuals.

…but where did the line equation come from? How did we get .934 for a y-intercept and 2.114 for slope?


Slope and Correlation

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

\(H_0: \beta_1 = 0\) versus \(H_A: \beta_1 \neq 0\)

\(H_0: \rho = 0\) versus \(H_A: \rho \neq 0\)

where \(\rho\) is the population correlation between X and Y.
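The link between the slope test and the correlation test can be checked numerically. The sketch below uses the six data points that appear later in this deck and two standard textbook formulas that are not written on this slide, so treat them as assumptions: the standard error of the slope, \(s_{b_1} = s_\varepsilon / \sqrt{SS_{xx}}\), and the correlation test statistic, \(t = r\sqrt{n-2} / \sqrt{1-r^2}\).

```python
import math

# Six (x, y) data points used later in the deck
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_yy = sum((yi - y_bar) ** 2 for yi in y)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# t statistic for H0: beta1 = 0
b1 = ss_xy / ss_xx
sse = ss_yy - b1 * ss_xy                 # sum of squares for error
s_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(ss_xx)
t_slope = b1 / s_b1

# t statistic for H0: rho = 0
r = ss_xy / math.sqrt(ss_xx * ss_yy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"t (slope) = {t_slope:.4f}, t (correlation) = {t_corr:.4f}")
```

In simple linear regression the two statistics are algebraically identical, so any difference in the printed values would only be floating-point noise.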


Slope and Correlation

Warning: High correlation does not imply causality. If a large positive or negative value of the sample correlation coefficient r is observed, it is incorrect to conclude that a change in x causes a change in y. The only valid conclusion is that a linear trend may exist between x and y.


Formulas to use in Regression: Notation

\[ SS_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = \sum x_i y_i - n\bar{x}\bar{y} \]

(SCP is sometimes used for SSxy.)

\[ SS_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = \sum x_i^2 - n\bar{x}^2 \]

\[ SS_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = \sum y_i^2 - n\bar{y}^2 \]

(These numbers will always be given to you on the test – you will not have to calculate these.)

SCPxy = SSxy

SSx = SSxx

SSy = SSyy

SSyy is also called SST (total sum of squares)


Formulas for the Least Squares Estimate

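The values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the SSE are given by the following formulas:

Slope: \(\hat{\beta}_1 = SS_{xy} / SS_{xx}\)

y-intercept: \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), where \(\bar{x}\) and \(\bar{y}\) are the sample means of x and y.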


Least Squares Line…

The coefficients b1 and b0 for the least squares line \(\hat{y} = b_0 + b_1 x\) are calculated as:

\[ b_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]


From Data to Information

Data → Statistics → Information

Data Points:

x    y
1    6
2    1
3    9
4    5
5    17
6    12

Regression Line: y = .934 + 2.114x
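As a check on where .934 and 2.114 come from, the following Python sketch applies the sum-of-squares shortcut formulas to the six data points. Variable names are illustrative. Computed at full precision the intercept comes out near 0.9333, which the slide reports as .934.

```python
# The six data points from the slide
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

# Shortcut formulas: SSxy = sum(xy) - sum(x)sum(y)/n, SSxx = sum(x^2) - (sum x)^2/n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n

b1 = ss_xy / ss_xx                    # slope
b0 = sum(y) / n - b1 * sum(x) / n     # y-intercept: y-bar minus b1 times x-bar

print(f"SSxy = {ss_xy:.1f}, SSxx = {ss_xx:.1f}")
print(f"Least squares line: y-hat = {b0:.4f} + {b1:.4f}x")
```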


Sum of Squares for Error (SSE)…

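The sum of squares for error is calculated as:

\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

and is used in the calculation of the standard error of estimate:

\[ s_\varepsilon = \sqrt{\frac{SSE}{n - 2}} \]

If \(s_\varepsilon\) is zero, all the points fall on the regression line.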


Standard Error…

If \(s_\varepsilon\) is small, the fit is excellent and the linear model should be used for forecasting. If \(s_\varepsilon\) is large, the model is poor…

But what is small and what is large?
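One common rule of thumb, offered here as an assumption rather than something stated on the slide, is to judge the standard error of estimate against the sample mean of y. The sketch below computes SSE and the standard error for the six data points from the "From Data to Information" slide and prints both numbers side by side.

```python
import math

x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residuals: observed y minus the value predicted by the least squares line
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
s_eps = math.sqrt(sse / (n - 2))      # n - 2 degrees of freedom in simple regression

print(f"SSE = {sse:.3f}")
print(f"Standard error of estimate = {s_eps:.3f}, mean of y = {y_bar:.3f}")
```

Here the standard error is more than half the mean of y, which by this heuristic suggests a fairly loose fit.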


Analysis of Variance


An omnibus or global test of the overall contribution of the set of driver variables to the prediction of the response variable is carried out via the analysis of variance (ANOVA). A summary table for the ANOVA of regression follows:

Source of Variation (SV)   Degrees of Freedom (df)   Sum of Squares (SS)   Mean Square (MS)   Fcalc     Fcrit
Regression                 p                         SSR                   MSR                MSR/MSE   F(α, p, n-(p+1))
Residual                   n-(p+1)                   SSE                   MSE
Total                      n-1                       SST

Where F(α, p, n-(p+1)) is the value of F with p numerator df and n-(p+1) denominator df that places α in the upper tail of the distribution.

Analysis of Variance for Regression


In the ANOVA table we have the following:

• SSR = Sum of Squares due to Regression
• SSE = Sum of Squares due to Error or Residual
• SST = Sum of Squares Total
• MSR = SSR/p = Mean Square Regression
• MSE = SSE/[n-(p+1)] = Mean Square Error or Residual
• The sums of squares are derived from the algebraic identity:

\[ \sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2 \]

• That is: SST = SSR + SSE, so that R² = SSR/SST represents the proportion of variation in Y that is explained by the behavior of the driver variables. R² is the coefficient of determination.

Analysis of Variance for Regression
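The ANOVA quantities can be computed directly from their definitions. This Python sketch does so for the six data points from the "From Data to Information" slide, with p = 1 driver variable. Each sum of squares is computed separately so that the identity SST = SSR + SSE is verified, not assumed. Looking up Fcrit requires an F table or a statistics library and is left out.

```python
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)
p = 1  # number of driver (independent) variables

x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]    # fitted values

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)             # regression
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))    # error

msr = ssr / p
mse = sse / (n - (p + 1))
f_calc = msr / mse
r_squared = ssr / sst

print(f"Regression  df={p}  SS={ssr:.3f}  MS={msr:.3f}  F={f_calc:.3f}")
print(f"Residual    df={n - (p + 1)}  SS={sse:.3f}  MS={mse:.3f}")
print(f"Total       df={n - 1}  SS={sst:.3f}")
print(f"R-squared = {r_squared:.3f}")
```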


Coefficient of Determination…

Tests thus far have shown if a linear relationship exists; it is also useful to measure the strength of the relationship. This is done by calculating the coefficient of determination, R².

The coefficient of determination is the square of the coefficient of correlation (r), hence R² = r².


The coefficient of determination is

\[ r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}} \]

It represents the proportion of the sum of squares of deviations of the y values about their mean that can be attributed to a linear relationship between y and x. (In simple linear regression, it may also be computed as the square of the coefficient of correlation r.)

Coefficient of Determination…


Practical Interpretation of the Coefficient of Determination, R²

About 100(R²)% of the sample variation in y (measured by the total sum of squares of deviations of the sample y values about their mean \(\bar{y}\)) can be explained by (or attributed to) using x to predict y in the straight-line model.


Example: Data & Calculations

\(\sum x = 268\)

\(\bar{x} = 26.8\)

\(\sum x^2 = 7668\)

\(\sum y = 27.73\)

\(\bar{y} = 2.773\)

\(\sum y^2 = 83.8733\)


We need to calculate SSxy, SSxx, and SSyy as follows (equivalent formula on page 127 using R-square):

SSxy = 57.456

SSxx = 485.6

SSyy = 6.97801

Then, the coefficient of correlation is

\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{57.456}{\sqrt{(485.6)(6.97801)}} = \frac{57.456}{58.211} = .99 \]

Example: Data & Calculations
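The calculations in this example can be reproduced from the summary statistics. In the Python sketch below, n = 10 is inferred from the ratio of the sum of x to its mean (268 / 26.8) and is therefore an assumption. The sum of the cross products is not listed on the slide, so SSxy is taken as given.

```python
import math

# Summary statistics from the slide
sum_x, sum_x2 = 268, 7668
sum_y, sum_y2 = 27.73, 83.8733
n = 10            # inferred: 268 / 26.8
ss_xy = 57.456    # taken as given; the sum of x*y is not shown on the slide

ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n
r = ss_xy / math.sqrt(ss_xx * ss_yy)

print(f"SSxx = {ss_xx:.1f}, SSyy = {ss_yy:.5f}")
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")
```

The correlation evaluates to roughly 0.987, so about 97% of the sample variation in y is attributable to the linear relationship with x.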


Although (a) has a large slope and (b) has a small slope, both are scatter diagrams for r = 0.9


One thing to keep in mind is that statistical significance does not always imply practical significance.

In other words, rejection of \(H_0: \beta_1 = 0\) (statistical significance) does not mean that precise prediction (practical significance) follows. It does demonstrate to the researcher that, within the sample data at least, this particular independent variable has an association with the dependent variable.

Interpretation of beta coefficients


Confidence Interval for \(\beta_1\)

A \((1 - \alpha) \cdot 100\%\) confidence interval for \(\beta_1\) is

From: \(b_1 - t_{\alpha/2,\, n-2}\, s_{b_1}\) to: \(b_1 + t_{\alpha/2,\, n-2}\, s_{b_1}\)

For \(b_1 = .354\) and \(s_{b_1} = .0797\) and using \(t_{.05,\, 8} = 1.860\), the resulting confidence interval is

.354 - (1.860)(.0797) to .354 + (1.860)(.0797)

= .354 - .148 to .354 + .148

= .206 to .502


Confidence Interval for \(\beta_1\)

So we are 90% confident that the value of the estimated slope (b1 = .354) is within .148 of the actual slope, \(\beta_1\).

The large width of this interval is due in part to the lack of information (small sample size) used to derive the estimates; a larger sample would decrease the width of this confidence interval.
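The interval arithmetic is short enough to verify in a few lines of Python. The sketch uses the slope estimate, its standard error and the tabled t value exactly as quoted above; the variable names are illustrative.

```python
b1 = 0.354        # estimated slope
s_b1 = 0.0797     # standard error of the slope
t_crit = 1.860    # t value for alpha/2 = .05 with n - 2 = 8 degrees of freedom

margin = t_crit * s_b1
lower, upper = b1 - margin, b1 + margin

print(f"Margin of error: {margin:.3f}")
print(f"90% confidence interval: {lower:.3f} to {upper:.3f}")
```

Since the margin is the t value times the standard error, and the standard error tends to shrink as the sample grows, a larger sample narrows the interval.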


[Figure: Residual Plots for Res.Time, four panels. Normal Probability Plot of the Residuals (Percent vs. Standardized Residual); Residuals Versus the Fitted Values (Standardized Residual vs. Fitted Value); Histogram of the Residuals (Frequency vs. Standardized Residual); Residuals Versus the Order of the Data (Standardized Residual vs. Observation Order).]

Analysis of Residuals


Scatter diagram and least-squares line example

Geometric representation of residuals


Unexplained deviations (from the observed points to the line): \((y_i - \hat{y}_i)\)

Explained deviations (from y-bar to the points on the line): \((\hat{y}_i - \bar{y})\)

Total deviations (from the observed points to y-bar): \((y_i - \bar{y})\)

Scatter diagram: showing deviations about \(\bar{y}\) and the regression line

Total deviation = Explained deviation + Unexplained deviation

The breakdown of variability into explained and unexplained parts


1. The coefficient of determination is the ratio of SSR to SST. (True/ False)

2. The regression sum of squares (SSR) can never be greater than the total sum of squares (SST). (True/ False)

3. Regression analysis is used to measure the strength of the association between two numerical variables, while correlation analysis is used for prediction. (True/ False)

4. The value of the t-test for testing b1 = 0 gives the same value as the t-test for testing that correlation = 0. (True/ False)

5. A confidence interval for the true slope b1 can never be used to test if b1 is equal to 0. (True/ False)

True/False HW


6. The coefficient of determination is the percent of total variation explained by the regression model. (True/ False)

7. The coefficient of determination can never be greater than 1. (True/ False)

8. The predictor variable in regression analysis is referred to as the independent variable. (True/ False)

9. You give a pre-employment examination to your applicants. The test is scored from 1 to 100. You have data on their sales at the end of one year measured in dollars. You want to know if there is any linear relationship between pre-employment examination score and sales. An appropriate test to use is the t test on the population correlation coefficient. (True/ False)

10. Confidence intervals for the mean of Y are always narrower than prediction intervals for an individual Y for the same data set, X value and confidence level. (True/ False)

True/False HW
