DSCI 5340: Predictive Modeling and Business Forecasting Spring 2013 – Dr. Nick Evangelopoulos
Transcript of DSCI 5340: Predictive Modeling and Business Forecasting Spring 2013 – Dr. Nick Evangelopoulos
slide 1
DSCI 5340: Predictive Modeling and Business Forecasting
Spring 2013 – Dr. Nick Evangelopoulos
Lecture 1: Introduction to Business Forecasting
Review of Simple Regression (Ch. 1-3)
Some material taken from: Michael Hand (Willamette University), Biz/Ed
slide 2
DSCI 5340 FORECASTING
Forecasting
http://www.dilbert.com/2010-07-02/
“It is far better to foresee even without certainty than not to foresee at all.”
Henri Poincaré (1854-1912), polymath and chaos theory pioneer, The Foundations of Science
slide 3
Why Forecast?
The effectiveness of almost every human endeavor, every public initiative, depends in part upon unknown and uncertain future outcomes – the demand for services, the revenues to fund them.
The quality of decisions about whether or not to engage and at what level improves with the reliability of supporting forecasts.
slide 4
The two types of Forecasting
Qualitative – seeking opinions on which to base decision making
Consumer panels, focus groups, etc.
Quantitative – using statistical data to help inform decision making
Identifying trends
Moving averages – seasonal, cyclical, random
Extrapolation – simple
slide 5
Costs and Benefits of Forecasting
Benefits:
Aids decision making
Informs planning and resource allocation decisions
If data is of high quality, can be accurate
slide 6
Costs and Benefits of Forecasting
Costs:
Data not always reliable or accurate
Data may be out of date
The past is not always a guide to the future
Qualitative data may be influenced by peer pressure
Difficulty of coping with changes to external factors out of the business’s control – e.g. economic policy, political developments (9/11?), natural disasters – hurricanes, earthquakes, etc.
slide 7
Oregon Personal Income Tax Revenues (in $ Millions)
[Chart: Data/Forecasts by Period, quarterly from 1996:01 to 2001:04; vertical axis from 600 to 1500]
Example: Oregon Personal Income Taxes, 1996 – 2001
(see Class Tools > Sitewide > Hand Outs > Public Finance > MultDecompPIT.xls)
slide 8
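Example: Classical Multiplicative Decomposition
Conceptual Decomposition: \( y_t = \text{Trend}_t \times \text{Cycle}_t \times \text{Seasonal}_t \times \text{Irregular}_t \)
Conceptual Forecast: \( \hat{y}_t = \text{Trend}_t \times \text{Seasonal}_t \)
Forecasting Model: \( \hat{y}_t = (b_0 + b_1 t) \times s_t \), where \( s_t \) is whichever of the seasonal indices \( s_1, s_2, \ldots, s_L \) applies to period t
Trend: Long-term growth/decline
Cycle: Long-term slow, irregular oscillation
Seasonal: Regular, periodic variation w/in calendar year
Irregular: Short-term, erratic variation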
slide 9
Example: Classical Multiplicative Decomposition
Conceptual Decomposition: \( y_t = \text{Trend}_t \times \text{Cycle}_t \times \text{Seasonal}_t \times \text{Irregular}_t \)
[Chart: Data by Period, quarterly from 1996:01 to 2001:04; vertical axis from 600 to 1600]
slide 10
Example: Classical Multiplicative Decomposition
[Charts: four panels showing the Trend component (axis 600 to 1600) and the Seasonal, Cyclical, and Irregular components (each axis 0.85 to 1.25)]
slide 11
Example: Classical Multiplicative Decomposition: Model Interpretation
\[ \hat{y}_t = (b_0 + b_1 t) \times s_t = (731.9291 + 18.5017\,t) \times s_t, \qquad s_1 = 0.9794,\; s_2 = 0.9236,\; s_3 = 0.8913,\; s_4 = 1.2057 \]
Model Interpretation
Initial, time-zero (1995:Q4) level is $731.92 million
Increasing at $18.5 million per quarter
Seasonal pattern
Peak in Q4 21% over trend
Trough in Q3 11% below trend
slide 12
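Example: Classical Multiplicative Decomposition: Forecasts
\[ \hat{y}_t = (b_0 + b_1 t) \times s_t = (731.9291 + 18.5017\,t) \times s_t, \qquad s_1 = 0.9794,\; s_2 = 0.9236,\; s_3 = 0.8913,\; s_4 = 1.2057 \]
Forecasts
1996:Q1, t = 1: \( \hat{y}_{1} = (731.9291 + 18.5017(1)) \times 0.9794 = 750.4309 \times 0.9794 = 735.00 \)
2002:Q1, t = 25: \( \hat{y}_{25} = (731.9291 + 18.5017(25)) \times 0.9794 = 1194.4727 \times 0.9794 = 1169.90 \)
2002:Q2, t = 26: \( \hat{y}_{26} = (731.9291 + 18.5017(26)) \times 0.9236 = 1212.9744 \times 0.9236 = 1120.32 \)
2002:Q3, t = 27: \( \hat{y}_{27} = (731.9291 + 18.5017(27)) \times 0.8913 = 1231.4762 \times 0.8913 = 1097.60 \)
2002:Q4, t = 28: \( \hat{y}_{28} = (731.9291 + 18.5017(28)) \times 1.2057 = 1249.9779 \times 1.2057 = 1507.07 \)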
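The forecast arithmetic on this slide can be reproduced with a short script. This is a minimal sketch, not the course spreadsheet: the function name `forecast` and the constant names are made up for illustration, and the coefficients are the rounded values shown on the slide, so results can differ from the slide's figures in the second decimal place.

```python
# Seasonal indices by calendar quarter, as read off the slide (Q1..Q4).
SEASONAL_INDEX = {1: 0.9794, 2: 0.9236, 3: 0.8913, 4: 1.2057}
B0 = 731.9291  # time-zero (1995:Q4) trend level, in $ millions
B1 = 18.5017   # trend increase per quarter, in $ millions


def forecast(t):
    """Trend-times-seasonal forecast for period t, where t = 1 is 1996:Q1."""
    quarter = (t - 1) % 4 + 1   # map the period number to its quarter
    trend = B0 + B1 * t         # linear trend component
    return trend * SEASONAL_INDEX[quarter]


# Periods 25 to 28 are the four quarters of 2002.
for t in (1, 25, 26, 27, 28):
    print(t, round(forecast(t), 2))
```

Because the seasonal index multiplies the trend, the Q4 forecast sits well above the trend line and the Q3 forecast well below it, which is the pattern described in the model interpretation.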
slide 13
Overview of Simple Regression Analysis
slide 14
Regression Analysis
Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will study.
Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
Dependent variable: denoted Y
Independent variables: denoted X1, X2, …, Xk
slide 15
Correlation Analysis…
If we are interested only in determining whether a relationship exists, we employ correlation analysis, a technique introduced earlier.
This chapter will examine the relationship between two variables, sometimes called simple linear regression.
Mathematical equations describing these relationships are also called models, and they fall into two types: deterministic or probabilistic.
slide 16
A linear regression equation illustrating the geometrical interpretations of a and b
y = a + bx
[Figure: the line y = a + bx, with b marked as the amount of increase in Y for a unit increase in X (x2 - x1 = 1)]
slide 17
Simple Linear Regression
correlation = r
Positive r: y increases as x increases
r = 1: a perfect positive relationship between y and x
slide 18
Simple Linear Regression
Negative r: y decreases as x increases
r = -1: a perfect negative relationship between y and x
slide 19
Simple Linear Regression
r near zero: little or no linear relationship between y and x
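The three cases on the preceding slides (r = 1, r = -1, r near zero) can be checked numerically. The sketch below is illustrative only: `pearson_r` is a hypothetical helper name and the small data sets are invented to produce each case exactly.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient r between paired lists xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # points on a rising line
r_down = pearson_r(xs, [10 - 3 * x for x in xs])  # points on a falling line
r_flat = pearson_r(xs, [2, 0, 1, 0, 2])           # symmetric pattern, no linear trend
print(r_up, r_down, r_flat)
```

Note that the last data set has a clear pattern but no linear trend, which is why r measures linear association only.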
slide 20
A Model…
To create a probabilistic model, we start with a deterministic model that approximates the relationship we want to model and add a random term that measures the error of the deterministic component.
Deterministic Model: The cost of building a new house is about $75 per square foot and most lots sell for about $25,000. Hence the approximate selling price (y) would be:
y = $25,000 + ($75/ft²)(x)
(where x is the size of the house in square feet)
slide 21
A Model…
A model of the relationship between house size (independent variable) and house price (dependent variable) would be:
House Price = 25000 + 75(Size)
[Figure: House Price plotted against House Size as a straight line. Most lots sell for $25,000; building a house costs about $75 per square foot.]
In this model, the price of the house is completely determined by the size.
slide 22
A Model…
In real life, however, the cost will vary even among houses of the same size:
[Figure: House Price plotted against House Size, with an intercept of $25K. At a given size x, houses with the same square footage sell at different price points (e.g. décor options, cabinet upgrades, lot location…). Lower vs. higher variability.]
House Price = 25,000 + 75(Size) + ε
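The difference between the deterministic and probabilistic versions of the house-price model can be sketched in a few lines. The function names are hypothetical, and the normally distributed error term with the chosen standard deviations is an assumption made purely for illustration; the slides do not specify an error distribution at this point.

```python
import random


def deterministic_price(size_sqft):
    """Deterministic component: $25,000 lot plus $75 per square foot."""
    return 25_000 + 75 * size_sqft


def simulated_price(size_sqft, sigma, rng):
    """Deterministic component plus a random error term.

    The normal error with standard deviation sigma is an illustrative
    assumption, not something stated on the slides.
    """
    return deterministic_price(size_sqft) + rng.gauss(0, sigma)


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
low_variability = [simulated_price(2000, 5_000, rng) for _ in range(5)]
high_variability = [simulated_price(2000, 25_000, rng) for _ in range(5)]

print(deterministic_price(2000))
print([round(p) for p in low_variability])
print([round(p) for p in high_variability])
```

All ten simulated houses are 2,000 square feet, yet their prices differ; a larger sigma spreads them further from the deterministic line.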
slide 23
Simple Linear Regression Model…
A straight line model with one independent variable is called a first order linear model or a simple linear regression model. It is written as:
\[ y = \beta_0 + \beta_1 x + \varepsilon \]
y = dependent variable
x = independent variable
\( \beta_0 \) = y-intercept
\( \beta_1 \) = slope of the line
\( \varepsilon \) = error variable
slide 24
Simple Linear Regression Model…
Note that both \( \beta_0 \) and \( \beta_1 \) are population parameters which are usually unknown and hence estimated from the data.
[Figure: a line plotted on x-y axes with the rise and run marked. \( \beta_1 \) = slope (= rise/run); \( \beta_0 \) = y-intercept.]
slide 25
Which line has the best “fit” to the data?
slide 26
Estimating the Coefficients…
In much the same way we base estimates of \( \mu \) on \( \bar{x} \), we estimate \( \beta_0 \) using b0 and \( \beta_1 \) using b1, the y-intercept and slope (respectively) of the least squares or regression line given by:
\[ \hat{y} = b_0 + b_1 x \]
(Recall: this is an application of the least squares method and it produces a straight line that minimizes the sum of the squared differences between the points and the line)
slide 27
Least Squares Line…
This line minimizes the sum of the squared differences between the points and the line…
These differences are called residuals.
…but where did the line equation come from? How did we get .934 for a y-intercept and 2.114 for slope?
slide 28
Slope and Correlation
\( Y = \beta_0 + \beta_1 X + \varepsilon \)
\( H_0: \beta_1 = 0 \) versus \( H_A: \beta_1 \neq 0 \)
\( H_0: \rho = 0 \) versus \( H_A: \rho \neq 0 \)
where \( \rho \) is the population correlation between X and Y.
slide 29
Slope and Correlation
Warning: High correlation does not imply causality. If a large positive or negative value of the sample correlation coefficient r is observed, it is incorrect to conclude that a change in x causes a change in y. The only valid conclusion is that a linear trend may exist between x and y.
slide 30
Formulas to use in Regression
Notation
\[ SS_{xy} = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n} = \sum x_i y_i - n\,\bar{x}\,\bar{y} \]
(SCP is sometimes used for SSxy.)
\[ SS_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = \sum x_i^2 - n\,\bar{x}^2 \]
\[ SS_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} = \sum y_i^2 - n\,\bar{y}^2 \]
(These numbers will always be given to you on the test – you will not have to calculate these.)
SCPxy = SSxy
SSx = SSxx
SSy = SSyy
SSyy is also called SST (total sum of squares)
slide 31
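Formulas for the Least Squares Estimate
The values of \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) that minimize the SSE are given by the following formulas
Slope: \( \hat{\beta}_1 = SS_{xy} / SS_{xx} \)
y-intercept: \( \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \)
where \( \bar{x} = \sum x_i / n \) and \( \bar{y} = \sum y_i / n \)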
slide 32
Least Squares Line…
The coefficients b1 and b0 for the least squares line \( \hat{y} = b_0 + b_1 x \) are calculated as:
\[ b_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]
slide 33
From Data to Information
Data, Statistics, Information
Data Points:
x   y
1   6
2   1
3   9
4   5
5   17
6   12
Regression Line: \( \hat{y} = .934 + 2.114x \)
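The regression line for these six data points can be recomputed from the sum-of-squares formulas given earlier in the deck. This is a minimal sketch using only the standard library; the variable names are chosen for illustration.

```python
# The six (x, y) data points from the slide.
xs = [1, 2, 3, 4, 5, 6]
ys = [6, 1, 9, 5, 17, 12]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)

# Shortcut formulas: SSxy = sum(xy) - sum(x)sum(y)/n, SSxx = sum(x^2) - sum(x)^2/n
ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n

b1 = ss_xy / ss_xx                 # slope
b0 = sum_y / n - b1 * sum_x / n    # y-intercept: y-bar minus b1 times x-bar

print(f"y-hat = {b0:.3f} + {b1:.3f}x")
```

The slope agrees with the slide's 2.114. The unrounded intercept comes out very slightly below .934; the slide's value appears to result from rounding the slope to 2.114 before computing the intercept.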
slide 34
Sum of Squares for Error (SSE)…
The sum of squares for error is calculated as:
\[ SSE = \sum (y_i - \hat{y}_i)^2 \]
and is used in the calculation of the standard error of estimate:
\[ s_\varepsilon = \sqrt{\frac{SSE}{n - 2}} \]
If \( s_\varepsilon \) is zero, all the points fall on the regression line.
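As a worked illustration, SSE and the standard error of estimate can be computed for the six data points used in the earlier "From Data to Information" slide. The sketch assumes the simple regression divisor n - 2; variable names are illustrative.

```python
import math

xs = [1, 2, 3, 4, 5, 6]
ys = [6, 1, 9, 5, 17, 12]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least squares slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

# Residuals are observed y minus fitted y-hat.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)
s_eps = math.sqrt(sse / (n - 2))   # standard error of estimate

print(round(sse, 3), round(s_eps, 3))
```

With y values ranging from 1 to 17, a standard error of roughly 4.5 is sizeable, which motivates the question of how to judge "small" versus "large".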
slide 35
Standard Error…
If \( s_\varepsilon \) is small, the fit is excellent and the linear model should be used for forecasting. If \( s_\varepsilon \) is large, the model is poor…
But what is small and what is large?
slide 36
Analysis of Variance
slide 37
Analysis of Variance for Regression
An omnibus or global test of the overall contribution of the set of driver variables to the prediction of the response variable is carried out via the analysis of variance (ANOVA). A summary table for the ANOVA of regression follows:

Source of Variation (SV) | Degrees of Freedom (df) | Sum of Squares (SS) | Mean Square (MS) | Fcalc   | Fcrit
Regression               | p                       | SSR                 | MSR              | MSR/MSE | \( F_{\alpha,\,p,\,n-(p+1)} \)
Residual                 | n-(p+1)                 | SSE                 | MSE              |         |
Total                    | n-1                     | SST                 |                  |         |

Where \( F_{\alpha,\,p,\,n-(p+1)} \) is the value of F with p numerator df and n-(p+1) denominator df that places \( \alpha \) in the upper tail of the distribution.
slide 38
Analysis of Variance for Regression
In the ANOVA table we have the following:
• SSR = Sum of Squares due to Regression
• SSE = Sum of Squares due to Error or Residual
• SST = Sum of Squares Total
• MSR = SSR/p = Mean Square Regression
• MSE = SSE/[n-(p+1)] = Mean Square Error or Residual
The sums of squares are derived from the algebraic identity:
\[ \sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2 \]
That is: SST = SSR + SSE
So that R² = SSR/SST represents the proportion of variation in Y that is explained by the behavior of the driver variables. R² is the coefficient of determination.
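The entries of the ANOVA table can be filled in for a small example. The sketch below reuses the six data points from the earlier "From Data to Information" slide with p = 1 driver variable; it computes Fcalc only and leaves the critical value to an F table, since no significance level is given for this data.

```python
xs = [1, 2, 3, 4, 5, 6]
ys = [6, 1, 9, 5, 17, 12]
n, p = len(xs), 1            # p = number of driver variables
x_bar, y_bar = sum(xs) / n, sum(ys) / n

b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                 # total
ssr = sum((f - y_bar) ** 2 for f in fitted)             # regression
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))     # residual

msr = ssr / p
mse = sse / (n - (p + 1))
f_calc = msr / mse
r_squared = ssr / sst

print(f"SSR={ssr:.3f} SSE={sse:.3f} SST={sst:.3f}")
print(f"MSR={msr:.3f} MSE={mse:.3f} Fcalc={f_calc:.3f} R2={r_squared:.3f}")
```

The printed sums of squares satisfy SST = SSR + SSE, and R² here says that about half of the variation in y is explained by x.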
slide 39
Coefficient of Determination…
Tests thus far have shown if a linear relationship exists; it is also useful to measure the strength of the relationship. This is done by calculating the coefficient of determination – R².
The coefficient of determination is the square of the coefficient of correlation (r), hence R² = r².
slide 40
Coefficient of Determination…
The coefficient of determination is
\[ r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}} \]
It represents the proportion of the sum of squares of deviations of the y values about their mean that can be attributed to a linear relationship between y and x. (In simple linear regression, it may also be computed as the square of the coefficient of correlation r.)
slide 41
Practical Interpretation of the Coefficient of Determination, R²
About 100(R²)% of the sample variation in y (measured by the total sum of squares of deviations of the sample y values about their mean \( \bar{y} \)) can be explained by (or attributed to) using x to predict y in the straight-line model.
slide 42
Example: Data & Calculations
\( \sum x = 268 \)
\( \bar{x} = 26.8 \)
\( \sum x^2 = 7668 \)
\( \sum y = 27.73 \)
\( \bar{y} = 2.773 \)
\( \sum y^2 = 83.8733 \)
slide 43
Example: Data & Calculations
We need to calculate SSxy, SSxx, and SSyy as follows (equivalent formula on page 127 using R-square):
SSxy = 57.456
SSxx = 485.6
SSyy = 6.97801
Then, the coefficient of correlation is
\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{57.456}{\sqrt{(485.6)(6.97801)}} = \frac{57.456}{58.211} = .99 \]
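The sums of squares in this example can be checked from the totals on the previous slide. One assumption is needed: n = 10, which is implied by the sum of x being 268 and the mean being 26.8. The sum of the xy products does not appear in the transcript, so SSxy is taken directly from the slide rather than recomputed.

```python
import math

n = 10                      # assumption: implied by sum(x) = 268 and x-bar = 26.8
sum_x, sum_x2 = 268, 7668
sum_y, sum_y2 = 27.73, 83.8733
ss_xy = 57.456              # taken from the slide; sum(xy) is not available

ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n
r = ss_xy / math.sqrt(ss_xx * ss_yy)

print(round(ss_xx, 4), round(ss_yy, 5), round(r, 3))
```

The computed SSxx and SSyy agree with the slide, and r agrees with the slide's .99 once rounded to two decimals.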
slide 44
Although (a) has a large slope and (b) has a small slope, both are scatter diagrams for r = 0.9
slide 45
Interpretation of beta coefficients
One thing to keep in mind is that statistical significance does not always imply practical significance.
In other words, rejection of \( H_0: \beta_1 = 0 \) (statistical significance) does not mean that precise prediction (practical significance) follows. It does demonstrate to the researcher that, within the sample data at least, this particular independent variable has an association with the dependent variable.
slide 46
Confidence Interval for \( \beta_1 \)
A \( (1 - \alpha) \cdot 100\% \) confidence interval for \( \beta_1 \) is
From: \( b_1 - t_{\alpha/2,\,n-2}\, s_{b_1} \) to: \( b_1 + t_{\alpha/2,\,n-2}\, s_{b_1} \)
For \( b_1 = .354 \) and \( s_{b_1} = .0797 \), and using \( t_{.05,\,8} = 1.860 \), the resulting confidence interval is
.354 - (1.860)(.0797) to .354 + (1.860)(.0797)
= .354 - .148 to .354 + .148
= .206 to .502
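The interval arithmetic above can be wrapped in a small helper. This is a sketch with a hypothetical function name; the critical value 1.860 is taken from the slide rather than looked up from a t distribution.

```python
def slope_confidence_interval(b1, s_b1, t_crit):
    """Return (lower, upper) limits of b1 plus or minus t_crit * s_b1."""
    margin = t_crit * s_b1
    return b1 - margin, b1 + margin


# Values from the slide: b1 = .354, s_b1 = .0797, t(.05, 8) = 1.860
lower, upper = slope_confidence_interval(0.354, 0.0797, 1.860)
print(round(lower, 3), round(upper, 3))
```

Since the interval lies entirely above zero, it is consistent with a positive slope at this confidence level.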
slide 47
Confidence Interval for \( \beta_1 \)
So we are 90% confident that the value of the estimated slope (b1 = .354) is within .148 of the actual slope, \( \beta_1 \).
The large width of this interval is due in part to the lack of information (small sample size) used to derive the estimates; a larger sample would decrease the width of this confidence interval.
slide 48
Analysis of Residuals
[Figure: Residual Plots for Res.Time, four panels. Normal Probability Plot of the Residuals (Percent vs. Standardized Residual); Residuals Versus the Fitted Values (Standardized Residual vs. Fitted Value); Histogram of the Residuals (Frequency vs. Standardized Residual); Residuals Versus the Order of the Data (Standardized Residual vs. Observation Order)]
slide 49
Geometric representation of residuals
Scatter diagram and least-squares line example
slide 50
Geometric representation of residuals
Unexplained deviations (from the observed points to the line):
\( (y_i - \hat{y}_i) \)
slide 51
Geometric representation of residuals
Explained deviations (from y-bar to the points on the line):
\( (\hat{y}_i - \bar{y}) \)
slide 52
Geometric representation of residuals
Total deviations (from the observed points to y-bar):
\( (y_i - \bar{y}) \)
slide 53
The breakdown of variability into explained and unexplained parts
Scatter diagram: showing deviations about \( \bar{y} \) and the regression line
Total deviation = Explained deviation + Unexplained deviation
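The breakdown holds point by point, not only in total. The sketch below illustrates this with the six data points from the earlier "From Data to Information" slide; the tuple layout and names are chosen only for this example.

```python
xs = [1, 2, 3, 4, 5, 6]
ys = [6, 1, 9, 5, 17, 12]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

# One row per point: (total, explained, unexplained) deviation.
rows = []
for x, y in zip(xs, ys):
    y_hat = b0 + b1 * x
    rows.append((y - y_bar, y_hat - y_bar, y - y_hat))

for total, explained, unexplained in rows:
    print(f"{total:8.3f} = {explained:8.3f} + {unexplained:8.3f}")
```

Squaring and summing each column gives SST, SSR, and SSE respectively, which is how the pointwise breakdown leads to the sum-of-squares identity.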
slide 54
True/False HW
1. The coefficient of determination is the ratio of SSR to SST. (True/ False)
2. The regression sum of squares (SSR) can never be greater than the total sum of squares (SST). (True/ False)
3. Regression analysis is used to measure the strength of the association between two numerical variables, while correlation analysis is used for prediction. (True/ False)
4. The value of the t-test for testing \( \beta_1 = 0 \) gives the same value as the t-test for testing that correlation = 0. (True/ False)
5. A confidence interval for the true slope \( \beta_1 \) can never be used to test if \( \beta_1 \) is equal to 0. (True/ False)
slide 55
True/False HW
6. The coefficient of determination is the percent of total variation explained by the regression model. (True/ False)
7. The coefficient of determination can never be greater than 1. (True/ False)
8. The predictor variable in regression analysis is referred to as the independent variable. (True/ False)
9. You give a pre-employment examination to your applicants. The test is scored from 1 to 100. You have data on their sales at the end of one year measured in dollars. You want to know if there is any linear relationship between pre-employment examination score and sales. An appropriate test to use is the t test on the population correlation coefficient. (True/ False)
10. Confidence intervals for the mean of Y are always narrower than prediction intervals for an individual Y for the same data set, X value and confidence level. (True/ False)