7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G....
7 - 1
Chapter 7: Data Analysis in the Service of Modeling
The Art of Modeling with Spreadsheets
S.G. Powell and K.R. Baker
© John Wiley and Sons, Inc.
PowerPoint Slides Prepared By: Tava Olsen, Washington University in St. Louis
7 - 2
Data Analysis in the Context of Modeling
Supports the modeling process Improves accuracy of model Improves usefulness of conclusions
Modeling is the primary goal Data analysis is a means to that goal
7 - 3
Topics for Chapter
Finding facts in databases Searching, editing, sorting, and filtering
Estimating parameters Point estimates and interval estimates
Estimating relationships among variables Single, multiple, and nonlinear regression
Forecasting a single variable Time series methods
7 - 4
Databases
Tables of information Each row is a record in the database Each column is a field for the records Excel calls such a table a list
7 - 5
Excel Lists
First row contains names for each field Each successive row contains one record Lists may be:
Searched and edited Sorted Filtered Tabulated
7 - 6
Searching and Editing Lists
First assign a range name to entire list Include column titles
With list selected choose Data – Form Examine records one at a time:
Find Prev Find Next Enter new record with New button Delete record with Delete button
7 - 7
Database Form
***Insert Figure 7.8
7 - 8
Criteria Button
Found under Data – Form Allows for searching of records
Enter data into a field Click Find Next
7 - 9
Alternate Excel Search Techniques
Highlight entire database Use Edit – Find to search Use Find and Replace to edit entries In Find and Replace
“?” stands for any single symbol “*” stands for any sequence of symbols
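The two wildcards can be tried outside Excel. The sketch below uses Python's `fnmatch` module, whose `?` and `*` have the same meaning as described above. The names in the list are invented for illustration, and the matching here is case-sensitive and applies to the whole string, which may differ from Excel's Find settings.

```python
from fnmatch import fnmatchcase

# Hypothetical field values, invented for illustration only.
names = ["Smith", "Smyth", "Smithson", "Schmidt"]

# "?" matches exactly one character; "*" matches any run of characters.
one_char = [n for n in names if fnmatchcase(n, "Sm?th")]
any_run = [n for n in names if fnmatchcase(n, "Sm*")]

print(one_char)  # ['Smith', 'Smyth']
print(any_run)   # ['Smith', 'Smyth', 'Smithson']
```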
7 - 10
Sorting: Data – Sort Command
***insert figure 7.10
7 - 11
Filtering
Select database then Data – Filter – AutoFilter Will filter lists based on values
Found under arrow at the title of each column
Arrow on title turns blue to remind list is filtered Can remove filter by:
Select (All) using the list arrow; or Selecting Show All under Data – Filter
7 - 12
More Filtering
Top 10 option returns records with the smallest or largest values of a numerical field
Custom option allows filtering with compound criteria
More complicated compound criteria can be achieved with Data – Filter – Advanced Filter submenu
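The filtering options above map onto simple list operations. The following is a rough Python analogue, not Excel's own mechanism; the records and field names (`region`, `units`) are hypothetical.

```python
# Hypothetical records: each dict is one record, each key is one field.
records = [
    {"region": "East", "units": 120},
    {"region": "West", "units": 45},
    {"region": "East", "units": 80},
    {"region": "North", "units": 200},
]

# Filter on a single value (like picking "East" under a column arrow).
east = [r for r in records if r["region"] == "East"]

# Compound criterion (like the Custom option): 50 <= units <= 150.
mid_range = [r for r in records if 50 <= r["units"] <= 150]

# Top-N by a numerical field (like the Top 10 option, here with N = 2).
top2 = sorted(records, key=lambda r: r["units"], reverse=True)[:2]

print(east)
print(mid_range)
print(top2)
```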
7 - 13
Tabulating
Select Data – Pivot Table Creates summary tables
Layout button on third step of wizard creates the format for the table
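A pivot table is essentially a grouped total. As a minimal sketch of the idea (the sales records below are hypothetical, and this is not how Excel implements it), the same summary can be built with a dictionary keyed by the row and column fields:

```python
from collections import defaultdict

# Hypothetical records: (region, product, units).
sales = [
    ("East", "A", 10),
    ("East", "B", 5),
    ("West", "A", 7),
    ("East", "A", 3),
]

# Sum units for every (region, product) combination.
totals = defaultdict(int)
for region, product, units in sales:
    totals[(region, product)] += units

for key in sorted(totals):
    print(key, totals[key])
```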
7 - 14
Analyzing Sample Data
Data is unlikely to cover whole population Work with sample from population
Statistics are summary measures about sample Want to construct statistics that represent population
Convenience sampling Have easy access to information on subset of population Subset may not be representative
Random sampling All objects in population have equal chance of appearing in sample
7 - 15
Descriptive Statistics
Summarizes information in sample Gives numerical picture of observations Excel Tools – Data Analysis
Descriptive Statistics table produced based on data given as input
7 - 16
Inferential Statistics
Use information in sample to make inferences about population
Systematic Error If sample not representative of population Avoid by careful sampling
Sampling Error Sample is merely subset of population Mitigated by taking large samples
7 - 17
Point Estimates
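The sample average is calculated as:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
The sample variance is calculated as:
$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
and its square root is the sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$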
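The point estimates can be computed directly from their definitions. The sketch below uses a small hypothetical sample and checks the result against Python's `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

# Hypothetical sample, invented for illustration.
sample = [12.0, 15.0, 11.0, 14.0, 18.0]
n = len(sample)

# Sample average: sum of observations divided by n.
x_bar = sum(sample) / n

# Sample variance: squared deviations from the average, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# Sample standard deviation: square root of the sample variance.
s = math.sqrt(s2)

print(x_bar, s2, s)
print(statistics.stdev(sample))  # library value for comparison
```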
7 - 18
Interval Estimates
$P(L \le \mu \le U) = 1 - \alpha$
L and U represent the lower and upper limits of the interval
$1 - \alpha$ represents the confidence level Usually a large percentage like 95 or 99%
$\mu$ represents the (unknown) true value of the parameter.
7 - 19
Sampling Theory
Working with a population described by a Normal probability model Mean $\mu$ and standard deviation $\sigma$.
Take repeated samples of n items from population Calculate the sample average each time The sample averages will follow a Normal distribution with a mean of $\mu$ and a variance of $\sigma^2/n$
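The sampling result can be checked by simulation. This is a sketch under assumed values (population mean 50, standard deviation 10, samples of n = 25, 4000 repetitions); none of these numbers come from the slides.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Assumed population parameters and sample size (illustrative only).
mu, sigma, n = 50.0, 10.0, 25
repetitions = 4000

# Draw repeated samples of n items and record each sample average.
averages = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(repetitions)
]

mean_of_averages = statistics.fmean(averages)
var_of_averages = statistics.variance(averages)
theoretical_var = sigma ** 2 / n  # variance of the sample average

print(mean_of_averages, var_of_averages, theoretical_var)
```

The mean of the simulated averages should sit close to the population mean, and their variance close to the population variance divided by n.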
7 - 20
Estimates
Standard error: the standard deviation of some function being used to provide an estimate
Use the sample average to estimate the population mean
The standard deviation of the sample average is called the standard error of the mean:
$\sigma_{\bar{x}} = \sigma / \sqrt{n}$
7 - 21
Z-scores
The z-score measures the number of standard deviations away from the mean
The z-score corresponding to any particular sample average is:
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
Tells how many standard errors from the mean
90% of the sample averages will have z-scores between –1.64 and +1.64
The chances are 90% that the sample average will fall no more than 1.64 standard errors from the true mean
7 - 22
Confidence Intervals for Means
Upper and lower limits on estimate for mean:
$\bar{x} \pm z(\sigma / \sqrt{n})$
n>30 recommended unless original population resembles Normal
z can be computed using NORMSINV(1-$\alpha$/2) Replace $\sigma$ by the sample standard deviation s Provided that sample is larger than n = 30
Excel Descriptive Statistics also will calculate half-width of confidence interval
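A minimal sketch of the interval calculation follows. `NormalDist().inv_cdf` from Python's standard library plays the role of NORMSINV here; the function name `mean_confidence_interval` and the sample data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) limits: x_bar +/- z * s / sqrt(n)."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided z value
    x_bar = mean(sample)
    # The sample standard deviation s stands in for the unknown sigma.
    half_width = z * stdev(sample) / math.sqrt(len(sample))
    return x_bar - half_width, x_bar + half_width

# Hypothetical sample of 35 observations (n > 30).
sample = [20 + (i % 7) for i in range(35)]
lower, upper = mean_confidence_interval(sample, confidence=0.90)

z_90 = NormalDist().inv_cdf(0.95)  # roughly 1.64 for 90% confidence
print(lower, upper, z_90)
```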
7 - 23
Interval Estimates for a Proportion
To estimate a population proportion from the sample proportion p, the interval estimate is:
$p \pm z\sqrt{p(1 - p)/n}$
Sample size should be at least 50 for this formula to be reliable
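The proportion interval can be sketched the same way. The counts below (60 successes in 200 observations) are invented for illustration, and `proportion_interval` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Return (lower, upper) limits: p +/- z * sqrt(p(1 - p)/n)."""
    p = successes / n  # sample proportion
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width

# Hypothetical data: 60 of 200 sampled records have the attribute.
low, high = proportion_interval(60, 200)
print(round(low, 4), round(high, 4))
```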
7 - 24
Sample Size Determination
Suppose we want to estimate the population mean to within a range of ±R
$n = (z\sigma / R)^2$
Assumes: Sampling from Normal distribution Known variance – can begin with small sample to estimate standard deviation
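As a worked example of the sample-size formula, the sketch below assumes a standard deviation of 10 and a target range of ±2 at 95% confidence. These values are hypothetical, and the result is rounded up because a sample size must be a whole number.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, R, confidence=0.95):
    """Smallest whole n satisfying n >= (z * sigma / R)^2."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil((z * sigma / R) ** 2)

# Assumed inputs: sigma = 10 (e.g. from a small pilot sample), R = 2.
n_needed = sample_size_for_mean(sigma=10.0, R=2.0)
print(n_needed)  # 97
```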
7 - 25
Sample Size Determination for Proportions
Suppose want to estimate a proportion to within a range of ±R
$n = z^2 p(1 - p) / R^2$ Value maximized at p = 0.5 Conservative value:
$n = (z/2)^2 / R^2$
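The same calculation for proportions, sketched with a hypothetical target of ±0.05 at 95% confidence. Leaving p at 0.5 gives the conservative value; supplying a prior guess for p gives a smaller n.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(R, confidence=0.95, p=0.5):
    """Smallest whole n satisfying n >= z^2 * p(1 - p) / R^2."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil(z ** 2 * p * (1.0 - p) / R ** 2)

conservative_n = sample_size_for_proportion(R=0.05)        # p = 0.5
smaller_n = sample_size_for_proportion(R=0.05, p=0.2)      # assumed prior guess
print(conservative_n, smaller_n)  # 385 246
```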
7 - 26
Estimating Relationships
Scatter plot – visualize association Correlation:
$r = \frac{1}{n - 1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{s_x} \frac{(y_i - \bar{y})}{s_y}$
n – number of pairs of observations for x, y sx, sy – standard deviations of x, y r – measures strength of linear relationship between x and y
r-statistic
Independent of units of measurement Lies in range [-1, 1] r > 0 – positive association r < 0 – negative association r close to 1 (or –1) implies a strong association r close to 0 implies a weak association Excel function: CORREL(xrange,yrange)
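The correlation can be computed from its definition as the average product of standardized deviations. The sketch below uses hypothetical data and also checks that rescaling x (a change of units) leaves r unchanged.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) * sum of ((x - x_bar)/s_x) * ((y - y_bar)/s_y)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_rescaled = correlation([100 * x for x in xs], ys)  # different units for x
print(round(r, 4), round(r_rescaled, 4))
```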
7 - 28
Regression Relationships
Relationships based on empirical data Dependent variable – predicted from values
of one or more independent variables Regression models can be:
Linear or nonlinear Simple or multiple
7 - 29
Simple Linear Regression
y = a + bx + e
y - dependent variable x - independent variable e - an “error” term. Constants a and b represent the intercept and
slope, respectively, of the regression line
7 - 30
Error Term in Regression
Unexplained “noise” in the relationship May represent limitations of knowledge Or may represent random deviations of the
dependent variable from its mean, y
7 - 31
Regression Goal
Want to find line to most closely match the observed relationship between x and y
Define “most closely” as minimizing sum of squared differences between observed and model values Minimizing sum of differences would set y equal to its
mean Penalizes large differences more than small differences
7 - 32
Performing Regression
Residuals:
$e_i = y_i - \hat{y}_i = y_i - (a + b x_i)$
Sum of squared differences between observations and model:
$SS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$
The regression problem: choose a and b to minimize SS
7 - 33
Regression Analysis
Assumes residuals are normally distributed with mean 0
Regression parameters can be calculated directly from the data
Simpler to use Excel’s regression tool (Under Data Analysis menu)
$b = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2}$
$a = \bar{y} - b\bar{x}$
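The direct formulas for the slope and intercept translate into a few sums. This is a minimal sketch with hypothetical data; `fit_line` is an illustrative name, not an Excel or library function.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope from the closed-form formula, then intercept from the means.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return a, b

# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
print(round(a, 4), round(b, 4))  # 2.2 0.6
```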
7 - 34
Quantifying Regression Fit
Coefficient of determination: $R^2$
Lies in range [0, 1] Closer to one – better fit Measures how much of the variation in y-values is explained by model 1 – perfect match to model 0 – equation explains none of observed variation
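One common way to compute the coefficient of determination is one minus the ratio of unexplained to total variation. The sketch below assumes that definition; the data are hypothetical, and the coefficients a = 2.2 and b = 0.6 are the least-squares values for this particular data set.

```python
def r_squared(xs, ys, a, b):
    """1 - (sum of squared residuals) / (total sum of squares about the mean)."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical observations and their least-squares line y = 2.2 + 0.6x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

fit_quality = r_squared(xs, ys, a=2.2, b=0.6)
print(round(fit_quality, 4))  # 0.6
```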
7 - 35
Regression Window
*** insert Figure 7.28
7 - 36
Regression Output
R Squared
Degree of significance (under 0.1 is significant)
Estimate for a Estimate for b
P values of under 0.1 are statistically significant
7 - 37
Simple Nonlinear Regression
A straight line may not be the most plausible description of dependency, e.g., $y = ax^b$
Can follow previous ideas to minimize sum of squared differences No Excel functions or simple formulas
Or can transform non-linear relationship into linear one, e.g., log y = log a + b log x Give up some intuition for convenience
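The log transformation can be sketched as follows: take logs of both variables, fit a straight line, then convert the intercept back. The data are hypothetical and follow y = 3x^2 exactly, so the fit should recover a = 3 and b = 2. Note that this minimizes squared differences in log units, not in the original units.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on log y = log a + b log x."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    sum_x, sum_y = sum(lx), sum(ly)
    sum_xy = sum(u * v for u, v in zip(lx, ly))
    sum_x2 = sum(u * u for u in lx)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    log_a = sum_y / n - b * (sum_x / n)
    return math.exp(log_a), b  # undo the log on the intercept

# Hypothetical data generated from y = 3 * x**2 (all values positive).
xs = [1.0, 2.0, 4.0, 8.0]
ys = [3.0, 12.0, 48.0, 192.0]

a, b = fit_power_law(xs, ys)
print(round(a, 6), round(b, 6))
```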
7 - 38
Multiple Linear Regression
Multiple independent variables
y = a0 + a1x1 + a2x2 + … + amxm + e
Work with n observations – each has: One observation of dependent variable One observation each of the m independent variables
Seek to minimize the sum of squared differences Put all independent variables into x-range in Excel’s
regression tool
7 - 39
Regression Output
Coefficient of multiple determination
Coefficients of regression equation
P values of under 0.1 are statistically significant
Square root of R square
Accounts for presence of multiple variables
7 - 40
Values to Include in Regression
Ideally pick values that can be justified based on practical or theoretical grounds
Could choose set that generates largest value of adjusted $R^2$
Also could choose based on those with significant p-values for coefficients
Remember that good models require good forecasts for the independent variables
7 - 41
Regression Assumptions
Errors in the regression model Follow a Normal distribution Are mutually independent Have the same variance
Linearity is assumed to hold
7 - 42
Forecasting with Time Series Models
Use historical data Assume near-term future will resemble past Hypothesize a model with:
An average level: $x_t = \mu + e$
$\mu$ – mean value; e – random noise term A trend A seasonal or cyclic fluctuation
7 - 43
Measures of Forecast Accuracy
MSE – Mean Squared Error between forecast and actual
MAD – Mean Absolute Deviation between forecast and actual
MAPE – Mean Absolute Percent Error between forecast and actual
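The three accuracy measures differ only in how each forecast error is scored before averaging. A minimal sketch with hypothetical actual and forecast values:

```python
def mse(actual, forecast):
    """Mean squared error."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

def mad(actual, forecast):
    """Mean absolute deviation."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean absolute percent error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs(a - f) / abs(a)
                       for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical observations and forecasts for four periods.
actual = [100, 110, 120, 130]
forecast = [90, 115, 120, 143]

print(mse(actual, forecast))             # 73.5
print(mad(actual, forecast))             # 7.0
print(round(mape(actual, forecast), 2))  # 6.14
```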
7 - 44
Moving Average Model
xt: observation from period t n-period moving average forecast:
Ft = (xt + xt–1 + … + xt–n+1) / n
Under Excel Data Analysis Moving Average: interval = number of periods Pairs forecast Ft and observation xt
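The moving average forecast is the mean of the most recent n observations. In this sketch the series is hypothetical, and forecasts are produced only for periods where n observations are available.

```python
def moving_average_forecasts(xs, n):
    """Return F_t = (x_t + x_{t-1} + ... + x_{t-n+1}) / n for each feasible t."""
    return [sum(xs[t - n + 1 : t + 1]) / n for t in range(n - 1, len(xs))]

# Hypothetical series; with n = 3 the first forecast uses periods 1 to 3.
xs = [10, 12, 14, 16, 18]
forecasts = moving_average_forecasts(xs, n=3)
print(forecasts)  # [12.0, 14.0, 16.0]
```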
7 - 45
Exponential Smoothing
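Historic observations: $x_t, x_{t-1}, x_{t-2}$, etc. Forecast: $F_t = \alpha x_t + (1 - \alpha) F_{t-1}$ Smoothing constant: $\alpha$
Implies: $F_t = \alpha x_t + \alpha(1 - \alpha) x_{t-1} + \alpha(1 - \alpha)^2 x_{t-2} + \alpha(1 - \alpha)^3 x_{t-3} + \dots$ $F_t = F_{t-1} + \alpha(x_t - F_{t-1})$
Under Excel Data Analysis damping factor = $1 - \alpha$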
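A minimal sketch of the smoothing recursion follows. The slides do not say how to start the recursion, so this example assumes the initial forecast equals the first observation unless another value is supplied; the series is hypothetical.

```python
def exponential_smoothing(xs, alpha, initial=None):
    """Apply F_t = alpha * x_t + (1 - alpha) * F_{t-1} over the series."""
    # Assumption: start from the first observation if no initial value is given.
    f = xs[0] if initial is None else initial
    forecasts = []
    for x in xs:
        f = alpha * x + (1 - alpha) * f
        forecasts.append(f)
    return forecasts

xs = [10, 20, 30]  # hypothetical observations
forecasts = exponential_smoothing(xs, alpha=0.5)
print(forecasts)  # [10.0, 15.0, 22.5]
```

A larger smoothing constant weights recent observations more heavily, so the forecast reacts faster but is also noisier.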
7 - 46
Exponential Smoothing with a Trend
$x_t = \mu + \tau t + e$ (level $\mu$, trend slope $\tau$)
Forecast calculated after the observation for period t will be calculated as $(F_t + T_t)$
$\alpha$ and $\beta$ – smoothing constants
$F_t = \alpha x_t + (1 - \alpha)(F_{t-1} + T_{t-1})$ $T_t = \beta(F_t - F_{t-1}) + (1 - \beta)T_{t-1}$
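The two update equations can be sketched as a loop that carries the level and trend forward. The starting level and trend (`level0`, `trend0`) are assumptions supplied by the caller, since the slides do not specify an initialization. The hypothetical series below rises by exactly 2 per period, so the forecasts should track it without error.

```python
def smooth_with_trend(xs, alpha, beta, level0, trend0):
    """Return the one-step forecast F_t + T_t made after each observation."""
    f, t = level0, trend0
    forecasts = []
    for x in xs:
        f_prev = f
        # Level update blends the new observation with the prior projection.
        f = alpha * x + (1 - alpha) * (f_prev + t)
        # Trend update blends the latest change in level with the prior trend.
        t = beta * (f - f_prev) + (1 - beta) * t
        forecasts.append(f + t)
    return forecasts

xs = [12, 14, 16, 18]  # hypothetical series with a steady trend of 2
forecasts = smooth_with_trend(xs, alpha=0.3, beta=0.2, level0=10.0, trend0=2.0)
print([round(v, 6) for v in forecasts])  # [14.0, 16.0, 18.0, 20.0]
```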
7 - 47
Exponential Smoothing with Trend and Seasonality
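$x_t = (\mu + \tau t)S_t + e$ (level $\mu$, trend slope $\tau$, seasonal factor $S_t$) p = number of periods in a cycle
Forecast calculated after the observation for period t will be calculated as $(F_t + T_t)S_{t-p+1}$
$\alpha$, $\beta$ and $\gamma$ – smoothing constants
$F_t = \alpha x_t / S_{t-p} + (1 - \alpha)(F_{t-1} + T_{t-1})$ $T_t = \beta(F_t - F_{t-1}) + (1 - \beta)T_{t-1}$
$S_t = \gamma x_t / F_t + (1 - \gamma) S_{t-p}$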
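The seasonal version adds a third update for the seasonal factors. In the sketch below the starting level, trend, and the first p seasonal factors are assumptions supplied by the caller; the slides do not specify how to initialize them. The hypothetical series has a two-period cycle with factors 0.8 and 1.2 and no noise, so each forecast should match the next observation.

```python
def smooth_with_trend_and_season(xs, alpha, beta, gamma, p,
                                 level0, trend0, seasonals0):
    """Return the forecast (F_t + T_t) * S_{t-p+1} made after each observation."""
    f, t = level0, trend0
    s = list(seasonals0)  # s[i] holds S_{t-p} for the i-th observation (0-based)
    forecasts = []
    for i, x in enumerate(xs):
        s_old = s[i]                                    # S_{t-p}
        f_prev = f
        f = alpha * x / s_old + (1 - alpha) * (f_prev + t)
        t = beta * (f - f_prev) + (1 - beta) * t
        s.append(gamma * x / f + (1 - gamma) * s_old)   # S_t
        forecasts.append((f + t) * s[i + 1])            # uses S_{t-p+1}
    return forecasts

# Hypothetical series: level 100, trend 10, seasonal factors 0.8 and 1.2.
xs = [88.0, 144.0, 104.0, 168.0]
forecasts = smooth_with_trend_and_season(
    xs, alpha=0.4, beta=0.3, gamma=0.5, p=2,
    level0=100.0, trend0=10.0, seasonals0=[0.8, 1.2],
)
print([round(v, 6) for v in forecasts])  # [144.0, 104.0, 168.0, 120.0]
```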
7 - 48
Summary
Data collection and analysis should support modeling Locate relevant information Estimate parameters and relations Construct routine forecasts
Excel provides many tools Databases: searching, sorting, filtering, and tabulating Data Analysis: descriptive statistics, linear regression, moving average and exponentially smoothed forecasts