7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G....

7 - 17 - 1

Chapter 7: Data Analysis in the Service of Modeling

The Art of Modeling with Spreadsheets

S.G. Powell and K.R. Baker

© John Wiley and Sons, Inc.

PowerPoint Slides Prepared By:Tava Olsen Washington University in St. Louis

7 - 2

Data Analysis in the Context of Modeling

Supports the modeling process Improves accuracy of model Improves usefulness of conclusions

Modeling is the primary goal Data analysis is a means to that goal

7 - 3

Topics for Chapter

Finding facts in databases Searching, editing, sorting, and filtering

Estimating parameters Point estimates and interval estimates

Estimating relationships among variables Single, multiple, and nonlinear regression

Forecasting a single variable Time series methods

7 - 4

Databases

Tables of information Each row is a record in the database Each column is a field for the records Excel calls such a table a list

7 - 5

Excel Lists

First row contains names for each field Each successive row contains one record Lists may be:

Searched and edited Sorted Filtered Tabulated

7 - 6

Searching and Editing Lists

First assign a range name to entire list Include column titles

With list selected choose Data – Form Examine records one at a time:

Find Prev Find Next Enter new record with New button Delete record with Delete button

7 - 7

Database Form

***Insert Figure 7.8

7 - 8

Criteria Button

Found under Data – Form Allows for searching of records

Enter data into a field Click Find Next

7 - 9

Alternate Excel Search Techniques

Highlight entire database Use Edit – Find to search Use Find and Replace to edit entries In Find and Replace

“?” stands for any single symbol “*” stands for any sequence of symbols

7 - 10

Sorting: Data – Sort Command

***insert figure 7.10

7 - 11

Filtering

Select database then Data – Filter – AutoFilter Will filter lists based on values

Found under arrow at the title of each column

Arrow on title turns blue to remind list is filtered Can remove filter by:

Select (All) using the list arrow; or Selecting Show All under Data – Filter

7 - 12

More Filtering

Top 10 option returns records with smallest or largest value of a numerical record

Custom option allows filtering with compound criteria

More complicated compound criteria can be achieved with Data – Filter – Advanced Filter submenu

7 - 13

Tabulating

Select Data – Pivot Table Creates summary tables Layout button on

third step of wizard

creates the format

for the table

7 - 14

Analyzing Sample Data

Data is unlikely to cover whole population Work with sample from population

Statistics are summary measures about sample Want to construct statistics that represent population

Convenience sampling Have easy access to information on subset of population Subset may not be representative

Random sampling All objects in population have equal chance of appearing in

sample

7 - 15

Descriptive Statistics

Summarizes information in sample Gives numerical picture of observations Excel Tools – Data Analysis

Descriptive Statistics table produced based on data given as input

7 - 16

Inferential Statistics

Use information in sample to make inferences about population

Systematic Error If sample not representative of population Avoid by careful sampling

Sampling Error Sample is merely subset of population Mitigated by taking large samples

7 - 17

Point Estimates

The sample average is calculated as: The sample variance is calculated as:

and its square root is the sample standard deviation:

nxxn

i i

1

s 2 (xi x )2

n 1i 1

n

s (xi x )2

i 1

n

n 1

7 - 18

Interval Estimates

P(L <= <= U) = 1 –

L and U represent the lower and upper limits of the interval

1 – represents the confidence level Usually a large percentage like 95 or 99%

represents the (unknown) true value of the parameter.

7 - 19

Sampling Theory

Working with a population described by a Normal probability model Mean and standard deviation .

Take repeated samples of n items from population Calculate the sample average each time The sample averages will follow a Normal

distribution with a mean of and a variance of 2/n

7 - 20

Estimates

Standard error: the standard deviation of some function being used to provide an estimate

Use the sample average to estimate the population mean

The standard deviation of the sample average is called the standard error of the mean:

x / n

7 - 21

Z-scores

The z-score measures the number of standard deviations away from the mean

The z-score corresponding to any particular sample average is:

Tells how many standard errors from the mean 90% of the sample averages will have z-scores between

–1.64 and +1.64 The chances are 90% that the sample average will fall no

more than 1.64 standard errors from the true mean

z x x

x n

7 - 22

Confidence Intervals for Means

Upper and lower limits on estimate for mean:

n>30 recommended unless original population resembles Normal

z can be computed using NORMSINV(1-/2) Replace by the sample standard deviation s

Provided that sample is larger than n = 30 Excel Descriptive Statistics also will calculate half-

width of confidence interval

x z( / n )

7 - 23

Interval Estimates for a Proportion

To estimate the sample proportion p, the interval estimate is:

Sample size should be at least 50 for this

formula to be reliable

p zp(1 p)

n

7 - 24

Sample Size Determination

Suppose want to estimate mean of sample to within a range of ±R

n = (z/ R)2 Assumes:

Sampling from Normal distribution Known variance – can begin with small sample

to estimate standard deviation

7 - 25

Sample Size Determination for Proportions

Suppose want to estimate a proportion to within a range of ±R

n = z2p(1 – p) / R2 Value maximized at p = 0.5 Conservative value:

n = (z/2)2 / R2

7 - 26

Estimating Relationships

Scatter plot – visualize association Correlation:

n – number of pairs of observations for x, y sx, sy – standard deviations of x, y r – measures strength of linear relationship between

x and y

r 1

(n 1)

x i x sx

i1

n

(yi y )

sy

7 - 27

r-statistic

Independent of units of measurement Lies in range [-1, 1] r > 0 – positive association r < 0 – negative association r close to 1 (or –1) implies a strong association r close to 0 implies a weak association Excel function: CORREL(xrange,yrange)

7 - 28

Regression Relationships

Relationships based on empirical data Dependent variable – predicted from values

of one or more independent variables Regression models can be:

Linear or nonlinear Simple or multiple

7 - 29

Simple Linear Regression

y = a + bx + e

y - dependent variable x - independent variable e - an “error” term. Constants a and b represent the intercept and

slope, respectively, of the regression line

7 - 30

Error Term in Regression

Unexplained “noise” in the relationship May represent limitations of knowledge Or may represent random deviations of the

dependent variable from its mean, y

7 - 31

Regression Goal

Want to find line to most closely match the observed relationship between x and y

Define “most closely” as minimizing sum of squared differences between observed and model values Minimizing sum of differences would set y equal to its

mean Penalizes large differences more than small differences

7 - 32

Performing Regression

Residuals:

ei = yi – y = yi – (a + bxi)

Sum of squared differences between observations and model :

SS =

The regression problem: choose a and b to minimize SS

ei2

i 1

n

(yii 1

n

a bx1)2

7 - 33

Regression Analysis

Assumes residuals are normally distributed with mean 0

Regression parameters can be calculated directly from the data

Simpler to use Excel’s regression tool (Under Data Analysis menu)

b n x iyi x i

i1

n

yii1

n

i 1

n

n x i2 ( xi

i1

n

)2

i1

n

a y bx

7 - 34

Quantifying Regression Fit

Coefficient of determination: R2

Lies in range [0, 1] Closer to one – better fit Measures how much of the variation in y-

values is explained by model 1 – perfect match to model 0 – equation explains none of observed variation

7 - 35

Regression Window

*** insert Figure 7.28

7 - 36

Regression Output

R Squared

Degree of significance (under 0.1 is significant)

Estimate for a Estimate for b

P values of under 0.1 are statistically significant

7 - 37

Simple Nonlinear Regression

A straight line may not be the most plausible description of dependency, e.g., y = axb

Can follow previous ideas to minimize sum of squared differences No Excel functions or simple formulas

Or can transform non-linear relationship into linear one, e.g., log y = log a + b log x Give up some intuition for convenience

7 - 38

Multiple Linear Regression

Multiple independent variables

y = a0 + a1x1 + a2x2 + … + amxm + e

Work with n observations – each has: One observation of dependent variable One observation each of the m independent variables

Seek to minimize the sum of squared differences Put all independent variables into x-range in Excel’s

regression tool

7 - 39

Regression Output

Coefficient of multiple determination

Coefficients of regression equation

P values of under 0.1 are statistically significant

Square root of R square

Accounts for presence of multiple variables

7 - 40

Values to Include in Regression

Ideally pick values that can be justified based on practical or theoretical grounds

Could choose set that generates largest value of adjusted R2

Also could choose based on those with significant p-values for coefficients

Remember that good models require good forecasts for the independent variables

7 - 41

Regression Assumptions

Errors in the regression model Follow a Normal distribution Are mutually independent Have the same variance

Linearity is assumed to hold

7 - 42

Forecasting with Time Series Models

Use historical data Assume near-term future will resemble past Hypothesize a model with:

An average level: xt = + e

– mean value; e – random noise term A trend A seasonal or cyclic fluctuation

7 - 43

Measures of Forecast Accuracy

MSE – Mean Squared Error between forecast and actual

MAD – Mean Absolute Deviation between forecast and actual

MAPE – Mean Absolute Percent Error between forecast and actual

7 - 44

Moving Average Model

xt: observation from period t n-period moving average forecast:

Ft = (xt + xt–1 + … + xt–n+1) / n

Under Excel Data Analysis Moving Average: interval = number of periods Pairs forecast Ft and observation xt

7 - 45

Exponential Smoothing

Historic observations: xt, xt–1, xt–2, etc. Forecast: Ft = xt + (1 – )Ft–1 Smoothing constant:

Implies: Ft = xt + (1 – )xt–1 + (1 – )2xt–2 + (1 – )3xt–3 + … Ft = Ft–1 + (xt – Ft–1)

Under Excel Data Analysis damping factor = 1 -

7 - 46

Exponential Smoothing with a Trend

xt = + t + e

Forecast calculated after the observation for period t will be calculated as (Ft + Tt)

and – smoothing constants

Ft = xt + (1 – )(Ft–1 + Tt–1) Tt = (Ft – Ft–1) + (1 – )Tt–1

7 - 47

Exponential Smoothing with Trend and Seasonality

xt = ( + t)St + e p = number of periods in a cycle Forecast calculated after the observation for period t will be

calculated as (Ft + Tt)St–p+1 , and – smoothing constants

Ft = xt / St-p + (1 – )(Ft–1 + Tt–1) Tt = (Ft – Ft–1) + (1 – )Tt–1

St = xt / Ft + (1 – ) St-p

7 - 48

Summary

Data collection and analysis should support modeling Locate relevant information Estimate parameters and relations Construct routine forecasts

Excel provides many tools Databases: searching, sorting, filtering, and tabulating Data Analysis: descriptive statistics, linear regression,

moving average and exponentially smoothed forecasts

7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G....

Documents

Transcript of 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G....