
Spreadsheet Problem Solving: fitting models to data

Topics: straight-line regression, multilinear regression, nonlinear regression, model building and selection

using: Data Analysis Regression tool, Trendline, Solver


Review of Straight-line Linear Regression [from Class #6]

For each data point, there is an error between that point and the model line. Fitting the model has to do with minimizing these errors.

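[Figure: data points plotted on x-y axes with the model line y = ax + b. For a data point (x_1, y_1), the model predicts ŷ_1 at x_1, and e_1 is the error between y_1 and ŷ_1.]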

Finding the model parameters that give the best fit

For the straight-line model, the model parameters are the slope (a) and the intercept (b).

The problem is then to find the values of a and b that give the best fit. What is meant by the best fit?

The standard measure of goodness of fit is the sum of squares of the errors:

$$SSE = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad \hat{y}_i = a x_i + b$$

So, the problem reduces to finding the minimum of SSE by adjusting a and b.
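As a minimal sketch of what "adjusting a and b to minimize SSE" means, the Python function below evaluates SSE for a trial slope and intercept. The function name `sse` and the data values are made up for illustration; they are not the CO2 data used in the example.

```python
def sse(a, b, xs, ys):
    """Sum of squared errors between the measured ys and the line a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

poor_guess = sse(1.0, 1.0, xs, ys)     # one trial pair (a, b)
better_guess = sse(0.6, 2.2, xs, ys)   # another trial pair with a smaller SSE
print(poor_guess, better_guess)
```

Trying different pairs (a, b) and keeping the one with the smallest SSE is the whole idea; the calculus formulas and the Excel tools just find that pair directly.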


Fitting a straight-line model to data

The minimization of SSE can be solved by calculus to give formulas for the best values of a and b:

$$a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$$

$$b = \frac{\sum_{i=1}^{n} y_i}{n} - a \, \frac{\sum_{i=1}^{n} x_i}{n}$$

and Excel solves problems like this with either formulas or built-in tools (Data Analysis Regression & Trendline).
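The formulas for a and b translate almost line for line into code. The sketch below is one way to write them in Python; the name `fit_line` and the data are hypothetical and not taken from the slides.

```python
def fit_line(xs, ys):
    """Return (a, b) that minimize SSE for the model y = a*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - a * sum_x / n
    return a, b

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(f"slope a = {a:.3f}, intercept b = {b:.3f}")
```

In a spreadsheet the same sums can be built up in helper columns, and Excel's SLOPE and INTERCEPT worksheet functions should return the same two values.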


Example: straight-line fit

Transfer the data to an Excel spreadsheet and create a graph

[Figure: CO2 Emissions for the US. x-axis: Year, 1989 to 2000; y-axis: CO2 Emissions (MMT C), 1320 to 1520.]

Calculating the slope and intercept using Excel formulas

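$$a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$$

$$b = \frac{\sum_{i=1}^{n} y_i}{n} - a \, \frac{\sum_{i=1}^{n} x_i}{n}$$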

The formulas behind the numbers


Using the model straight-line equation to compute the predictions:

and copy these to the graph, displaying as a straight line


[Figure: CO2 Emissions (MMT C) vs. Year, 1989 to 2000, with the model line y = 21.32x - 41090 added.]

Using an alternate, shortcut approach: Trendline

[Figure: CO2 Emissions (MMT C) vs. Year, 1989 to 2000, simple graph of the data.]

Start with a simple graph of the data

Select the data series by clicking on it

Right-click on a data point to get the context-sensitive menu

Select the Add Trendline option

The Add Trendline dialog box

Linear selected by default

OK for this problem

Click on Options tab

Options tab

Set for "Display equation on chart"

Click OK


[Figure: CO2 Emissions (MMT C) vs. Year, 1989 to 2000, with trendline y = 21.315x - 41090.]

Initial form of graph with straight line added

Fix up equation display


[Figure: CO2 Emissions (MMT C) vs. Year, 1989 to 2000, with trendline y = 21.315x - 41090.]

Looks just like before, but we got there quicker

But neither of these approaches gives us much information about the model, how good it is, etc.

A 2nd alternate approach: Data Analysis Regression tool

Recall that, if Data Analysis does not appear on the Tools menu, you will need to check Analysis ToolPak in the Add-ins dialog box [if it’s not there, you will have to go back to Microsoft Office/Excel set-up]

Tools > Data Analysis

Initial, empty Regression dialog box

Regression dialog box set up for our problem

checking Residuals will also give us model predictions

Initial (poorly formatted) Regression output display [on new worksheet]

Format > AutoFormat > OK

and fix up display for appropriate significant figures

Final Display of Regression Output

[tons of info, most of which you will not understand for a couple years]

used to judge goodness of fit

intercept and slope values

used to judge whether terms “belong” in the model

add to data graph for visual comparison with model

Judging Goodness of Fit

correlation coefficient: if close to +1 or –1, indicates strong correlation between x and y [something we already know from the original graph!]

coefficient of determination (R²): %-age of the variability in y that’s accounted for by the model

Adjusted R²: adjustment to R² that penalizes the value for using a model with too many terms

Standard Error: gives an idea of how far off the model predictions will be

Adjusted R² or Standard Error can be used to compare different models and choose which fits best. The higher the value of Adjusted R² the better, the lower the value of Standard Error the better.
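To make these quantities concrete, here is a minimal sketch that computes them from measured and predicted values. It uses the standard textbook definitions, which are assumed (not confirmed by the slides) to be what the Regression tool reports; the name `fit_statistics` and the data are hypothetical.

```python
import math

def fit_statistics(ys, y_hats, n_predictors):
    """Return (R2, adjusted R2, standard error) for a model with an intercept.

    n_predictors counts the x terms in the model, not the intercept.
    """
    n = len(ys)
    y_mean = sum(ys) / n
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))   # unexplained variation
    sst = sum((y - y_mean) ** 2 for y in ys)                # total variation in y
    dof = n - n_predictors - 1                              # residual degrees of freedom
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / dof
    std_err = math.sqrt(sse / dof)
    return r2, adj_r2, std_err

# Hypothetical data with the predictions of its least-squares line y = 0.6x + 2.2
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hats = [2.8, 3.4, 4.0, 4.6, 5.2]
r2, adj_r2, std_err = fit_statistics(ys, y_hats, n_predictors=1)
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}, standard error = {std_err:.3f}")
```

Adding a term to the model raises `n_predictors`, which shrinks `dof`; unless SSE drops enough to compensate, adjusted R² falls and the standard error rises, which is the penalty described above.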


Judging whether terms belong in the model

P-values estimate the probability of getting a coefficient estimate at least this far from zero if the true value of the coefficient were zero

P-values that are quite small, like these, indicate that there is little question about the significance of the term coefficients. In our case here, that means that both the intercept term and the slope term belong in the model.

A P-value of 5% (0.05) or greater causes suspicion that the coefficient may not be significant and that the term should probably be dropped from the model
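Behind each P-value is a t statistic: the coefficient divided by its standard error. The sketch below computes both for a straight-line fit using the usual textbook formulas; the name `coefficient_t_stats` and the data are hypothetical. Turning t into a two-sided P-value needs the Student t distribution with n - 2 degrees of freedom, which is left out here to keep the example to the standard library.

```python
import math

def coefficient_t_stats(xs, ys):
    """Fit y = a*x + b and return ((a, se_a, t_a), (b, se_b, t_b))."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    a = sxy / sxx                          # slope
    b = y_mean - a * x_mean                # intercept
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))           # standard error of the regression
    se_a = s / math.sqrt(sxx)
    se_b = s * math.sqrt(1.0 / n + x_mean ** 2 / sxx)
    return (a, se_a, a / se_a), (b, se_b, b / se_b)

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope_stats, intercept_stats = coefficient_t_stats(xs, ys)
print("slope: value, std error, t =", slope_stats)
print("intercept: value, std error, t =", intercept_stats)
```

A large |t| means the estimate is many standard errors away from zero, which corresponds to a small P-value.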

The Data Analysis Regression tool appears much more complicated and involved than the shortcut Trendline tool, so . . .

Why use Data Analysis Regression?

1) It provides more information that lets us judge the goodness of fit and significance of model terms

2) It can handle model forms that cannot be handled by Trendline

So, generally, when using Excel, we prefer the Data Analysis Regression tool over Trendline

but Trendline is still quite good for “quick and dirty” looks at the data

Learn to use both!


More complicated models

Polynomial models

$$y = a + bx + cx^2 + dx^3$$

General linear models

$$y = a\,f_1(x) + b\,f_2(x) + c\,f_3(x) + d\,f_4(x)$$

Examples: polynomial models above

$$y = a + b\,\frac{1}{x} + c \ln x$$

Multilinear models

$$y = a\,f_1(x_1, x_2, \ldots) + b\,f_2(x_1, x_2, \ldots) + c\,f_3(x_1, x_2, \ldots)$$

Examples:

$$y = a + bx_1 + cx_2 + dx_1 x_2 \qquad\qquad y = a\,e^{x_1/x_2}$$

Note: it is called linear regression, even when there are nonlinear terms in x, because the terms are linear in the model parameters, a, b, c, etc.
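Because the model is linear in a, b, c, the best-fit parameters still come from solving a set of linear equations, whatever the functions of x are. The sketch below fits y = a + b(1/x) + c ln x by building one column per function and solving the normal equations. This is a stand-in for what the Regression tool does internally, not a description of it; the helper names and the noise-free synthetic data are assumptions made for the example.

```python
import math

def solve_linear_system(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                    # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_general_linear(basis, xs, ys):
    """Least-squares coefficients for y = sum of coef_k * basis[k](x)."""
    F = [[f(x) for f in basis] for x in xs]           # one row per data point
    m = len(basis)
    FtF = [[sum(row[i] * row[j] for row in F) for j in range(m)] for i in range(m)]
    Fty = [sum(row[i] * y for row, y in zip(F, ys)) for i in range(m)]
    return solve_linear_system(FtF, Fty)

# Columns for the model y = a + b*(1/x) + c*ln(x)
basis = [lambda x: 1.0, lambda x: 1.0 / x, math.log]
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0 + 2.0 / x + 3.0 * math.log(x) for x in xs]  # synthetic data, no noise
a, b, c = fit_general_linear(basis, xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}, c = {c:.4f}")
```

In the spreadsheet the same idea appears as extra columns (1/x, ln x, x², and so on) that are all selected as the X input range.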


Nonlinear models

Transformable to linear

$$y = a\,e^{bx} \quad \Rightarrow \quad \ln y = \ln a + b\,x$$

straight-line regression!

Not transformable

$$P = 10^{\,A - \frac{B}{T + C}}$$

We can use the Data Analysis Regression tool for everything except the nonlinear models that can’t be transformed into linear. For those, we can use the Solver.
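The transformable case can be sketched in a few lines: take ln y, fit a straight line, then convert the intercept back with exp. The name `fit_exponential` and the synthetic data are hypothetical.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a*exp(b*x) by straight-line regression of ln(y) on x."""
    ln_ys = [math.log(y) for y in ys]      # requires every y > 0
    n = len(xs)
    sum_x = sum(xs)
    sum_ly = sum(ln_ys)
    sum_xly = sum(x * ly for x, ly in zip(xs, ln_ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xly - sum_x * sum_ly) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_ly / n - slope * sum_x / n
    return math.exp(intercept), slope      # a = exp(ln a), b = slope

# Synthetic, noise-free data generated from a = 2.5, b = 0.3
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.5 * math.exp(0.3 * x) for x in xs]
a_fit, b_fit = fit_exponential(xs, ys)
print(f"a = {a_fit:.4f}, b = {b_fit:.4f}")
```

One caution worth knowing: this minimizes the squared errors in ln y rather than in y, so with noisy data the result can differ somewhat from a direct nonlinear fit.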


Example: polynomial regression

[Figure: Viscosity of Water at Atmospheric Pressure. x-axis: Temperature (degF), 0 to 250; y-axis: Viscosity (cp), 0.000 to 2.000.]

curvature evident

Setting up for polynomial fits

Select for quadratic model, etc


Data Analysis Regression tool

check Labels because headings are included in selections for Y and X

check Residuals

Quadratic model regression results

copy to graph

model coefficients

model performance: adjR2


[Figure: Viscosity (cp) vs. Temperature (degF), 0 to 250, showing the Data and the Quadratic model.]

Quadratic model really doesn’t “capture” behavior of data

Continue with fits of cubic, 4th- & 5th-order polynomials

Summary of results

Looks like 5th-order offers best performance, but improvement is marginal over 4th-order.

Resulting model:

$$Visc = 3.161 - 0.05699\,T + 5.023 \times 10^{-4}\,T^2 - 2.162 \times 10^{-6}\,T^3 + 3.593 \times 10^{-9}\,T^4$$


[Figure: Viscosity (cp) vs. Temperature (degF), 20 to 220, showing the Data with the Quadratic, Cubic, and 4th Order models.]

Precautions on polynomial fitting

Try to use the lowest-order model that gives a good fit.

Higher-order models will have “wiggles” between data points that will cause prediction errors.

In fact, an (n-1)th-order polynomial will provide a perfect fit to the n data points, but it will usually do bizarre things in between the data points.
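The "perfect fit, bizarre in between" behavior is easy to demonstrate. The sketch below passes a 10th-order polynomial exactly through 11 points (fitting an (n-1)th-order polynomial to n points is the same as interpolating them) and then checks how far it strays between the points. The smooth test curve 1/(1 + 25x²) is an assumption chosen for illustration, not data from the slides.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the polynomial of order len(xs)-1 that passes through all points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def smooth_curve(x):
    """Hypothetical smooth 'true' behavior used to generate the data points."""
    return 1.0 / (1.0 + 25.0 * x * x)

n = 11
xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]   # 11 evenly spaced points
ys = [smooth_curve(x) for x in xs]

# At the data points the 10th-order polynomial reproduces the data...
max_err_at_points = max(
    abs(lagrange_interpolate(xs, ys, x) - y) for x, y in zip(xs, ys)
)

# ...but on a fine grid between the points it can be far off.
grid = [-1.0 + 2.0 * k / 400 for k in range(401)]
max_err_between = max(
    abs(lagrange_interpolate(xs, ys, x) - smooth_curve(x)) for x in grid
)

print("largest error at the data points:", max_err_at_points)
print("largest error between data points:", max_err_between)
```

The data values here never exceed 1, so an error between points that is larger than the data itself is a clear case of the wiggles the slide warns about.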


Example: multi-linear regression

X-input range includes two independent variables: x1 and x2

Model 1: $y = a + b\,x_1 + c\,x_2$

Model 2: $y = b\,x_1 + c\,x_2$

High P value for intercept in Model 1 suggests Model 2 without intercept, but there is a significant loss in adjR2

Multilinear Model Performance

[Figure: Predicted y vs. Measured y, both 0 to 12, for Model 1 and Model 2.]

Model performance isn’t that great for either model, and Model 1 doesn’t appear dramatically better than Model 2

Note: for multi-linear models, we plot Predicted vs Measured y. A perfect model would place points directly on the 45-degree line.

Nonlinear Regression

Fitting the parameters of the van der Waals’ equation of state

Data for SO2

$$P = \frac{RT}{\hat{V} - b} - \frac{a}{\hat{V}^2}$$

Find the values of a and b that give the best predictions for P, when compared to the measured values of P

Strategy for Nonlinear Regression

1) estimate initial values for a and b

2) compute predicted P’s using data for $\hat{V}$ and T

3) compute errors between predicted P’s and measured P’s

4) sum the squares of these errors to compute SSE

5) have the Solver minimize SSE by adjusting the values of a and b
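The five steps above can be sketched outside Excel as well. In the Python below, a damped Gauss-Newton iteration stands in for the Solver (which uses its own optimization method), and the "measured" pressures are synthetic values generated from assumed constants of roughly handbook magnitude for SO2. None of the numbers are the data or results from the slides, and all function names are hypothetical.

```python
R = 8.314  # gas constant, J/(mol K)

def vdw_pressure(T, V, a, b):
    """van der Waals pressure (Pa) for T in K and molar volume V in m^3/mol."""
    return R * T / (V - b) - a / V ** 2

def sse(data, a, b):
    """Steps 2 to 4: predicted P, errors, and their sum of squares."""
    return sum((P - vdw_pressure(T, V, a, b)) ** 2 for T, V, P in data)

def fit_vdw(data, a, b, iterations=50):
    """Step 5: adjust a and b to reduce SSE using damped Gauss-Newton steps."""
    v_min = min(V for _, V, _ in data)
    for _ in range(iterations):
        # Accumulate the 2x2 normal equations (J^T J) d = J^T r
        saa = sab = sbb = ra = rb = 0.0
        for T, V, P in data:
            r = P - vdw_pressure(T, V, a, b)
            ja = -1.0 / V ** 2                # dP/da
            jb = R * T / (V - b) ** 2         # dP/db
            saa += ja * ja
            sab += ja * jb
            sbb += jb * jb
            ra += ja * r
            rb += jb * r
        det = saa * sbb - sab * sab
        if det == 0.0:
            break
        da = (ra * sbb - rb * sab) / det
        db = (saa * rb - sab * ra) / det
        # Halve the step until it keeps 0 <= b < min(V) and does not increase SSE
        step, current = 1.0, sse(data, a, b)
        while step > 1e-10:
            a_new, b_new = a + step * da, b + step * db
            if 0.0 <= b_new < v_min and sse(data, a_new, b_new) <= current:
                a, b = a_new, b_new
                break
            step *= 0.5
        else:
            break                              # no acceptable step: stop iterating
    return a, b

# Synthetic "measurements" from assumed constants (illustrative only)
A_TRUE, B_TRUE = 0.6803, 5.636e-5              # Pa m^6/mol^2 and m^3/mol
data = [(T, V, vdw_pressure(T, V, A_TRUE, B_TRUE))
        for T in (400.0, 450.0, 500.0)
        for V in (3e-4, 5e-4, 1e-3, 2e-3, 5e-3)]

# Step 1: initial estimates; a = b = 0 is the ideal gas law
a_fit, b_fit = fit_vdw(data, a=0.0, b=0.0)
print(f"a = {a_fit:.4f} Pa m^6/mol^2, b = {b_fit:.3e} m^3/mol")
```

The constraint `0 <= b_new` mirrors the b >= 0 constraint used in the Solver set-up, and the upper bound keeps the denominator V - b positive for every data point.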


Basic data

Calculated Pressure by both ideal gas law and van der Waals

Sum of squares of this column

Ideal Gas Calculation

van der Waals Calculation

Error Calculation

Sum of Squares Calculation

Setting up Solver Parameters

SSE as Target Cell

Minimize

by adjusting a and b

with b >= 0 constraint

Results

[Figure: Fit of van der Waals Eqn for SO2 and Comparison to Ideal Gas Law. x-axis: Measured Pressure (Pa), 0 to 12000000; y-axis: Predicted Pressure (Pa), 0 to 12000000; series: van der Waals, Ideal Gas.]

Note departure of ideal gas predictions at higher pressures