7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G....

48
7 - 1 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc. PowerPoint Slides Prepared By: Tava Olsen Washington University in St. Louis

Transcript of 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G....

Page 1: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 17 - 1

Chapter 7: Data Analysis in the Service of Modeling

The Art of Modeling with Spreadsheets

S.G. Powell and K.R. Baker

© John Wiley and Sons, Inc.

PowerPoint Slides Prepared By:Tava Olsen Washington University in St. Louis

Page 2: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 2

Data Analysis in the Context of Modeling

Supports the modeling process Improves accuracy of model Improves usefulness of conclusions

Modeling is the primary goal Data analysis is a means to that goal

Page 3: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 3

Topics for Chapter

Finding facts in databases Searching, editing, sorting, and filtering

Estimating parameters Point estimates and interval estimates

Estimating relationships among variables Single, multiple, and nonlinear regression

Forecasting a single variable Time series methods

Page 4: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 4

Databases

Tables of information Each row is a record in the database Each column is a field for the records Excel calls such a table a list

Page 5: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 5

Excel Lists

First row contains names for each field Each successive row contains one record Lists may be:

Searched and edited Sorted Filtered Tabulated

Page 6: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 6

Searching and Editing Lists

First assign a range name to entire list Include column titles

With list selected choose Data – Form Examine records one at a time:

Find Prev Find Next Enter new record with New button Delete record with Delete button

Page 7: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 7

Database Form

***Insert Figure 7.8

Page 8: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 8

Criteria Button

Found under Data – Form Allows for searching of records

Enter data into a field Click Find Next

Page 9: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 9

Alternate Excel Search Techniques

Highlight entire database Use Edit – Find to search Use Find and Replace to edit entries In Find and Replace

“?” stands for any single symbol “*” stands for any sequence of symbols

Page 10: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 10

Sorting: Data – Sort Command

***insert figure 7.10

Page 11: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 11

Filtering

Select database then Data – Filter – AutoFilter Will filter lists based on values

Found under arrow at the title of each column

Arrow on title turns blue to remind list is filtered Can remove filter by:

Select (All) using the list arrow; or Selecting Show All under Data – Filter

Page 12: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 12

More Filtering

Top 10 option returns records with smallest or largest value of a numerical record

Custom option allows filtering with compound criteria

More complicated compound criteria can be achieved with Data – Filter – Advanced Filter submenu

Page 13: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 13

Tabulating

Select Data – Pivot Table Creates summary tables Layout button on

third step of wizard

creates the format

for the table

Page 14: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 14

Analyzing Sample Data

Data is unlikely to cover whole population Work with sample from population

Statistics are summary measures about sample Want to construct statistics that represent population

Convenience sampling Have easy access to information on subset of population Subset may not be representative

Random sampling All objects in population have equal chance of appearing in

sample

Page 15: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 15

Descriptive Statistics

Summarizes information in sample Gives numerical picture of observations Excel Tools – Data Analysis

Descriptive Statistics table produced based on data given as input

Page 16: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 16

Inferential Statistics

Use information in sample to make inferences about population

Systematic Error If sample not representative of population Avoid by careful sampling

Sampling Error Sample is merely subset of population Mitigated by taking large samples

Page 17: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 17

Point Estimates

The sample average is calculated as:   The sample variance is calculated as:

and its square root is the sample standard deviation:

nxxn

i i

1

s 2 (xi x )2

n 1i 1

n

s (xi x )2

i 1

n

n 1

Page 18: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 18

Interval Estimates

P(L <= <= U) = 1 –

L and U represent the lower and upper limits of the interval

1 – represents the confidence level Usually a large percentage like 95 or 99%

represents the (unknown) true value of the parameter.

Page 19: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 19

Sampling Theory

Working with a population described by a Normal probability model Mean and standard deviation .

Take repeated samples of n items from population Calculate the sample average each time The sample averages will follow a Normal

distribution with a mean of and a variance of 2/n

Page 20: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 20

Estimates

Standard error: the standard deviation of some function being used to provide an estimate

Use the sample average to estimate the population mean

The standard deviation of the sample average is called the standard error of the mean:

x / n

Page 21: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 21

Z-scores

The z-score measures the number of standard deviations away from the mean

The z-score corresponding to any particular sample average is:

Tells how many standard errors from the mean 90% of the sample averages will have z-scores between

–1.64 and +1.64 The chances are 90% that the sample average will fall no

more than 1.64 standard errors from the true mean

z x x

x n

Page 22: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 22

Confidence Intervals for Means

Upper and lower limits on estimate for mean:

n>30 recommended unless original population resembles Normal

z can be computed using NORMSINV(1-/2) Replace by the sample standard deviation s

Provided that sample is larger than n = 30 Excel Descriptive Statistics also will calculate half-

width of confidence interval

x z( / n )

Page 23: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 23

Interval Estimates for a Proportion

To estimate the sample proportion p, the interval estimate is:

 

  Sample size should be at least 50 for this

formula to be reliable

p zp(1 p)

n

Page 24: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 24

Sample Size Determination

Suppose want to estimate mean of sample to within a range of ±R

n = (z/ R)2 Assumes:

Sampling from Normal distribution Known variance – can begin with small sample

to estimate standard deviation

Page 25: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 25

Sample Size Determination for Proportions

Suppose want to estimate a proportion to within a range of ±R

n = z2p(1 – p) / R2 Value maximized at p = 0.5 Conservative value:

n = (z/2)2 / R2

Page 26: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 26

Estimating Relationships

Scatter plot – visualize association Correlation:

n – number of pairs of observations for x, y sx, sy – standard deviations of x, y r – measures strength of linear relationship between

x and y

r 1

(n 1)

x i x sx

i1

n

(yi y )

sy

Page 27: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 27

r-statistic

Independent of units of measurement Lies in range [-1, 1] r > 0 – positive association r < 0 – negative association r close to 1 (or –1) implies a strong association r close to 0 implies a weak association Excel function: CORREL(xrange,yrange)

Page 28: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 28

Regression Relationships

Relationships based on empirical data Dependent variable – predicted from values

of one or more independent variables Regression models can be:

Linear or nonlinear Simple or multiple

Page 29: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 29

Simple Linear Regression

y = a + bx + e

y - dependent variable x - independent variable e - an “error” term. Constants a and b represent the intercept and

slope, respectively, of the regression line

Page 30: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 30

Error Term in Regression

Unexplained “noise” in the relationship May represent limitations of knowledge Or may represent random deviations of the

dependent variable from its mean, y

Page 31: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 31

Regression Goal

Want to find line to most closely match the observed relationship between x and y

Define “most closely” as minimizing sum of squared differences between observed and model values Minimizing sum of differences would set y equal to its

mean Penalizes large differences more than small differences

Page 32: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 32

Performing Regression

Residuals:

ei = yi – y = yi – (a + bxi)

Sum of squared differences between observations and model :

SS =

The regression problem: choose a and b to minimize SS

ei2

i 1

n

(yii 1

n

a bx1)2

Page 33: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 33

Regression Analysis

Assumes residuals are normally distributed with mean 0

Regression parameters can be calculated directly from the data

Simpler to use Excel’s regression tool (Under Data Analysis menu)

b n x iyi x i

i1

n

yii1

n

i 1

n

n x i2 ( xi

i1

n

)2

i1

n

a y bx

Page 34: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 34

Quantifying Regression Fit

Coefficient of determination: R2

Lies in range [0, 1] Closer to one – better fit Measures how much of the variation in y-

values is explained by model 1 – perfect match to model 0 – equation explains none of observed variation

Page 35: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 35

Regression Window

*** insert Figure 7.28

Page 36: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 36

Regression Output

R Squared

Degree of significance (under 0.1 is significant)

Estimate for a Estimate for b

P values of under 0.1 are statistically significant

Page 37: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 37

Simple Nonlinear Regression

A straight line may not be the most plausible description of dependency, e.g., y = axb

Can follow previous ideas to minimize sum of squared differences No Excel functions or simple formulas

Or can transform non-linear relationship into linear one, e.g., log y = log a + b log x Give up some intuition for convenience

Page 38: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 38

Multiple Linear Regression

Multiple independent variables

y = a0 + a1x1 + a2x2 + … + amxm + e

Work with n observations – each has: One observation of dependent variable One observation each of the m independent variables

Seek to minimize the sum of squared differences Put all independent variables into x-range in Excel’s

regression tool

Page 39: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 39

Regression Output

Coefficient of multiple determination

Coefficients of regression equation

P values of under 0.1 are statistically significant

Square root of R square

Accounts for presence of multiple variables

Page 40: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 40

Values to Include in Regression

Ideally pick values that can be justified based on practical or theoretical grounds

Could choose set that generates largest value of adjusted R2

Also could choose based on those with significant p-values for coefficients

Remember that good models require good forecasts for the independent variables

Page 41: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 41

Regression Assumptions

Errors in the regression model Follow a Normal distribution Are mutually independent Have the same variance

Linearity is assumed to hold

Page 42: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 42

Forecasting with Time Series Models

Use historical data Assume near-term future will resemble past Hypothesize a model with:

An average level: xt = + e

– mean value; e – random noise term A trend A seasonal or cyclic fluctuation

Page 43: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 43

Measures of Forecast Accuracy

MSE – Mean Squared Error between forecast and actual

MAD – Mean Absolute Deviation between forecast and actual

MAPE – Mean Absolute Percent Error between forecast and actual

Page 44: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 44

Moving Average Model

xt: observation from period t n-period moving average forecast:

Ft = (xt + xt–1 + … + xt–n+1) / n

Under Excel Data Analysis Moving Average: interval = number of periods Pairs forecast Ft and observation xt

Page 45: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 45

Exponential Smoothing

Historic observations: xt, xt–1, xt–2, etc. Forecast: Ft = xt + (1 – )Ft–1 Smoothing constant:

Implies: Ft = xt + (1 – )xt–1 + (1 – )2xt–2 + (1 – )3xt–3 + … Ft = Ft–1 + (xt – Ft–1)

Under Excel Data Analysis damping factor = 1 -

Page 46: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 46

Exponential Smoothing with a Trend

xt = + t + e

Forecast calculated after the observation for period t will be calculated as (Ft + Tt)

and – smoothing constants

Ft = xt + (1 – )(Ft–1 + Tt–1) Tt = (Ft – Ft–1) + (1 – )Tt–1 

Page 47: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 47

Exponential Smoothing with Trend and Seasonality

xt = ( + t)St + e p = number of periods in a cycle Forecast calculated after the observation for period t will be

calculated as (Ft + Tt)St–p+1 , and – smoothing constants

Ft = xt / St-p + (1 – )(Ft–1 + Tt–1) Tt = (Ft – Ft–1) + (1 – )Tt–1

St = xt / Ft + (1 – ) St-p  

Page 48: 7 - 1 Chapter 7: Data Analysis in the Service of Modeling The Art of Modeling with Spreadsheets S.G. Powell and K.R. Baker © John Wiley and Sons, Inc.

7 - 48

Summary

Data collection and analysis should support modeling Locate relevant information Estimate parameters and relations Construct routine forecasts

Excel provides many tools Databases: searching, sorting, filtering, and tabulating Data Analysis: descriptive statistics, linear regression,

moving average and exponentially smoothed forecasts