1
Spreadsheet Problem Solving: fitting models to data
straight-line regression, multilinear regression, nonlinear regression, model building and selection
using: Data Analysis Regression tool, Trendline, Solver
2
Review of Straight-line Linear Regression [from Class #6]
For each data point, there is an error between that point and the model line. Fitting the model has to do with minimizing these errors.
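[Figure: x-y plot of data points with the model line y = ax + b, marking for one data point its measured value y, the model prediction ŷ at the same x, and the error e between them]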
3
Finding the model parameters that give the best fit
For the straight-line model, the model parameters are the slope (a) and the intercept (b).
The problem is then to find the values of a and b that give the best fit. What is meant by the best fit?
The standard measure of goodness of fit is the sum of squares of the errors:
$$\mathrm{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad \hat{y}_i = a x_i + b$$
So, the problem reduces to finding the minimum of SSE by adjusting a and b.
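As a minimal sketch of the SSE definition, written in Python rather than as spreadsheet formulas: the function name `sse` and the sample numbers are illustrative assumptions, not taken from the slides.

```python
def sse(xs, ys, a, b):
    """Sum of squared errors between the measured ys and the line a*x + b."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = a * x + b          # model prediction for this data point
        total += (y - y_hat) ** 2  # squared error for this data point
    return total


# Made-up data, used only to compare two candidate lines.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
print(sse(xs, ys, a=2.0, b=0.0))
print(sse(xs, ys, a=2.5, b=-0.7))
```

The line with the smaller SSE is the better fit in the least-squares sense.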
4
Fitting a straight-line model to data
The minimization of SSE can be solved by calculus to give formulas for the best values of a and b:
$$a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$$
$$b = \frac{\sum_{i=1}^{n} y_i}{n} - a \, \frac{\sum_{i=1}^{n} x_i}{n}$$
and Excel solves problems like this with either formulas or built-in tools (Data Analysis Regression & Trendline).
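The closed-form formulas for a and b can be sketched in Python as follows. This is an illustration of the same arithmetic a spreadsheet would do, not the deck's Excel worksheet; `fit_line` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)  # slope
    b = sum_y / n - a * sum_x / n                                 # intercept
    return a, b


# Made-up points that lie exactly on y = 2x + 1.
a, b = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(a, b)
```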
5
Example: straight-line fit
6
Transfer the data to an Excel spreadsheet and create a graph
[Figure: CO2 Emissions for the US; x-axis: Year (1989-2000); y-axis: CO2 Emissions (MMT C)]
7
Calculating the slope and intercept using Excel formulas
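$$a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$$
$$b = \frac{\sum_{i=1}^{n} y_i}{n} - a \, \frac{\sum_{i=1}^{n} x_i}{n}$$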
8
The formulas behind the numbers
9
Using the model straight-line equation to compute the predictions:
and copy these to the graph, displaying as a straight line
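A small sketch of the prediction step in Python, using the slope and intercept as they are displayed on the chart in this deck (y = 21.315x - 41090). The displayed coefficients are rounded, so the predictions are approximate, and the list of years is chosen for illustration rather than copied from the data table.

```python
SLOPE = 21.315         # rounded slope as displayed on the chart
INTERCEPT = -41090.0   # rounded intercept as displayed on the chart


def predict(year):
    """Model prediction from the straight-line equation y = a*x + b."""
    return SLOPE * year + INTERCEPT


# Illustrative years only; the deck's data table is not reproduced here.
years = list(range(1990, 2000))
predictions = [predict(year) for year in years]
for year, value in zip(years, predictions):
    print(year, round(value, 1))
```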
10
[Figure: CO2 Emissions for the US, data with fitted straight line y = 21.32x - 41090; x-axis: Year (1989-2000); y-axis: CO2 Emissions (MMT C)]
11
Using an alternate, shortcut approach: Trendline
[Figure: CO2 Emissions for the US; x-axis: Year (1989-2000); y-axis: CO2 Emissions (MMT C)]
Start with a simple graph of the data
Select the data series by clicking on it
Right-click on a data point to get the context-sensitive menu
Select the Add Trendline option
12
The Add Trendline dialog box
Linear selected by default
OK for this problem
Click on Options tab
13
Options tab
Set for Display equation on chart
Click OK
14
[Figure: CO2 Emissions for the US, with trendline and displayed equation y = 21.315x - 41090; x-axis: Year (1989-2000); y-axis: CO2 Emissions (MMT C)]
Initial form of graph with straight line added
Fix up equation display
15
[Figure: CO2 Emissions for the US, with trendline y = 21.315x - 41090; x-axis: Year (1989-2000); y-axis: CO2 Emissions (MMT C)]
Looks just like before, but we got there quicker
But neither of these approaches gives us much information about the model, how good it is, etc.
16
A 2nd alternate approach: Data Analysis Regression tool
recall that, if Data Analysis does not appear on the Tools menu, you will need to check Analysis ToolPak in the Add-ins dialog box [if it's not there, you will have to go back to Microsoft Office/Excel set-up]
Tools > Data Analysis
Initial, empty Regression dialog box
17
Regression dialog box set up for our problem
checking Residuals will also give us model predictions
18
Initial (poorly formatted) Regression output display [on new worksheet]
Format > AutoFormat > OK
and fix up display for appropriate significant figures
19
Final Display of Regression Output
[ tons of info, most of which you will not understand for a couple years ]
used to judge goodness of fit
intercept and slope values
used to judge whether terms "belong" in the model
add to data graph for visual comparison with model
20
Judging Goodness of Fit
correlation coefficient: if close to +1 or –1, indicates strong correlation between x and y [something we already know from the original graph!]
coefficient of determination (R2): percentage of the variability in y that's accounted for by the model
Adjusted R2: adjustment to R2 that penalizes the value for using a model with too many terms
Standard Error: gives an idea of how far off the model predictions will be
Adjusted R2 or Standard Error can be used to compare different models and choose which fits best. The higher the value of Adjusted R2 the better; the lower the value of Standard Error the better.
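For readers who want to see what sits behind these numbers, here is a hedged Python sketch using the standard textbook definitions for a model that includes an intercept. It is assumed, not confirmed by the slides, that these are the definitions the Regression tool reports; the function name and the sample numbers are made up.

```python
import math


def goodness_of_fit(measured, predicted, n_terms):
    """Return (R2, adjusted R2, standard error).

    n_terms is the number of fitted terms NOT counting the intercept
    (1 for a straight line, 2 for a quadratic, and so on).
    """
    n = len(measured)
    mean_y = sum(measured) / n
    sse = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    sst = sum((y - mean_y) ** 2 for y in measured)  # total variability in y
    dof = n - n_terms - 1                           # residual degrees of freedom
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / dof       # penalizes extra terms
    std_error = math.sqrt(sse / dof)                # typical size of a prediction error
    return r2, adj_r2, std_error


# Made-up measured and predicted values for a straight-line model.
print(goodness_of_fit([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], n_terms=1))
```

Adding a term always lowers SSE, so R2 alone never gets worse; the degrees-of-freedom factor is what lets Adjusted R2 and Standard Error get worse when the extra term does not pay for itself.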
21
Judging whether terms belong in the model
P-values estimate the probability of obtaining a coefficient estimate this far from zero if the true value of the coefficient were zero
P-values that are quite small, like these, indicate that there is little question about the significance of the term coefficients. In our case here, that means that both the intercept term and the slope term belong in the model.
A P-value of 5% (0.05) or greater causes suspicion that the coefficient may not be significant and that the term should probably be dropped from the model
22
The Data Analysis Regression tool appears much more complicated and involved than the shortcut Trendline tool, so . . .
Why use Data Analysis Regression?
1) It provides more information that lets us judge the goodness of fit and significance of model terms
2) It can handle model forms that cannot be handled by Trendline
So, generally, when using Excel, we prefer the Data Analysis Regression tool over Trendline
but Trendline is still quite good for "quick and dirty" looks at the data
Learn to use both!
23
More complicated models
Polynomial models
$$y = a + b x + c x^2 + d x^3$$
General linear models
$$y = a\,f_1(x) + b\,f_2(x) + c\,f_3(x) + d\,f_4(x)$$
Examples: polynomial models above
$$y = a + b\,\frac{1}{x} + c \ln x$$
Multilinear models
$$y = a\,f_1(x_1, x_2, \ldots) + b\,f_2(x_1, x_2, \ldots) + c\,f_3(x_1, x_2, \ldots)$$
Examples:
$$y = a + b x_1 + c x_2 + d x_1 x_2$$
$$y = a\,e^{x_1/x_2}$$
Note: it is called linear regression, even when there are nonlinear terms in x, because the terms are linear in the model parameters, a, b, c, etc.
24
Nonlinear models
Transformable to linear
$$y = a\,e^{b x} \quad\Rightarrow\quad \ln y = \ln a + b\,x$$
straight-line regression!
Not transformable
$$P = 10^{\,A - \frac{B}{T + C}}$$
We can use the Data Analysis Regression tool for everything except the nonlinear models that can't be transformed into linear. For those, we can use the Solver.
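The transformable case, y = a e^(bx), can be sketched in Python as a straight-line fit of ln y against x. The helper names and the synthetic data are assumptions made for illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope*x + intercept."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return slope, intercept


def fit_exponential(xs, ys):
    """Fit y = a*exp(b*x) by regressing ln(y) on x; every y must be > 0."""
    slope, intercept = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(intercept), slope  # a = exp(ln a), b = slope


# Synthetic data generated from a = 2, b = 0.5 (no noise).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(a, b)
```

One caveat worth keeping in mind: the transformed fit minimizes the squared errors in ln y, not in y, so with noisy data it can give somewhat different parameters from a direct nonlinear fit.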
25
Example: polynomial regression
[Figure: Viscosity of Water at Atmospheric Pressure; x-axis: Temperature (degF), 0 to 250; y-axis: Viscosity (cp), 0 to 2.000]
curvature evident
26
Setting up for polynomial fits
Select for quadratic model, etc
27
Data Analysis Regression tool
check Labels because headings are included in selections for Y and X
check Residuals
28
Quadratic model regression results
copy to graph
model coefficients
model performance: adjR2
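Outside Excel, the same polynomial fit can be sketched by building the columns 1, x, x^2, ... and solving the least-squares normal equations. The Python below is an illustrative stand-in for the Regression tool, not a description of how Excel computes its results; the function names and data are made up.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def polyfit(xs, ys, order):
    """Least-squares coefficients [c0, c1, ...] of c0 + c1*x + c2*x^2 + ..."""
    m = order + 1
    X = [[x ** k for k in range(m)] for x in xs]  # one column per power of x
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(m)] for i in range(m)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(m)]
    return solve_linear_system(XtX, Xty)


# Made-up data lying exactly on y = 1 + 2x + 3x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
print(polyfit(xs, ys, order=2))
```

The normal equations become poorly conditioned for high orders or widely ranging x values, which is one more reason to prefer low-order models.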
29
[Figure: Viscosity of Water at Atmospheric Pressure; x-axis: Temperature (degF); y-axis: Viscosity (cp); series: Data, Quadratic]
Quadratic model really doesn’t “capture” behavior of data
30
Continue with fits of cubic, 4th- & 5th-order polynomials
Summary of results
Looks like 5th-order offers best performance
but improvement is marginal over 4th-order.
Resulting model:
$$\mathrm{Visc} = 3.161 - 0.05699\,T + 5.023\times10^{-4}\,T^2 - 2.162\times10^{-6}\,T^3 + 3.593\times10^{-9}\,T^4$$
31
[Figure: Viscosity of Water at Atmospheric Pressure; x-axis: Temperature (degF), 20 to 220; y-axis: Viscosity (cp); series: Data, Quadratic, Cubic, 4th Order]
32
Precautions on polynomial fitting
Try to use the lowest-order model that gives a good fit.
Higher-order models will have "wiggles" between data points that will cause prediction errors.
In fact, an (n-1)th-order polynomial will provide a perfect fit to the n data points, but it will usually do bizarre things in between the data points.
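The "perfect fit but bizarre in between" behavior can be demonstrated with a short Python sketch. It uses Lagrange interpolation to build the (n-1)th-order polynomial through n points sampled from a smooth test curve; the test curve and all names are assumptions chosen for illustration, not data from the slides.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the unique polynomial of order len(xs)-1 through all points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def smooth_curve(x):
    """A smooth, bell-shaped test curve standing in for 'true' behavior."""
    return 1.0 / (1.0 + 25.0 * x * x)


# 11 evenly spaced data points, so the polynomial is 10th order.
xs = [-1.0 + 0.2 * k for k in range(11)]
ys = [smooth_curve(x) for x in xs]

# Error at the data points themselves: essentially zero (a "perfect" fit).
fit_error_at_data = max(abs(lagrange_eval(xs, ys, x) - y) for x, y in zip(xs, ys))

# Error on a fine grid between the data points: large near the ends.
grid = [-1.0 + 0.01 * k for k in range(201)]
worst_error_between = max(abs(lagrange_eval(xs, ys, x) - smooth_curve(x)) for x in grid)

print(fit_error_at_data, worst_error_between)
```

The curve itself never exceeds 1, so an error of that size between data points means the polynomial is wildly off even though it matches every data point.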
33
Example: multi-linear regression
X-input range includes two independent variables: x1 and x2
Model 1: $y = a + b x_1 + c x_2$
Model 2: $y = b x_1 + c x_2$
High P value for intercept in Model 1 suggests Model 2 without intercept, but there is a significant loss in adjR2
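To make the difference between the two models concrete, here is a hedged Python sketch that fits both by least squares: Model 1 includes a column of ones for the intercept and Model 2 leaves it out. The data are synthetic, not the example data from the slide, and the helper names are hypothetical.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def fit_linear_model(columns, y):
    """Least-squares coefficients for y = sum(coef_k * columns[k])."""
    m = len(columns)
    XtX = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(m)]
           for i in range(m)]
    Xty = [sum(u * t for u, t in zip(columns[i], y)) for i in range(m)]
    return solve_linear_system(XtX, Xty)


def model_sse(columns, y, coefs):
    """Sum of squared errors of the fitted model."""
    total = 0.0
    for i, yi in enumerate(y):
        predicted = sum(c * col[i] for c, col in zip(coefs, columns))
        total += (yi - predicted) ** 2
    return total


# Synthetic data generated from y = 1 + 2*x1 + 3*x2 (no noise).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1.0 + 2.0 * u + 3.0 * v for u, v in zip(x1, x2)]
ones = [1.0] * len(y)

coefs1 = fit_linear_model([ones, x1, x2], y)  # Model 1: a, b, c
coefs2 = fit_linear_model([x1, x2], y)        # Model 2: b, c (no intercept)
sse1 = model_sse([ones, x1, x2], y, coefs1)
sse2 = model_sse([x1, x2], y, coefs2)
print(coefs1, sse1)
print(coefs2, sse2)
```

Because these synthetic data do have a nonzero intercept, dropping it raises SSE; with real data, the comparison is the judgment call the slide describes.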
34
Multilinear Model Performance
[Figure: Predicted y vs. Measured y (both axes 0 to 12) for Model 1 and Model 2]
Model performance isn't that great for either model, and Model 1 doesn't appear dramatically better than Model 2
Note: for multi-linear models, we plot Predicted vs Measured y.A perfect model would place points directly on the 45-degree line.
35
Nonlinear Regression
Fitting the parameters of the van der Waals' equation of state
Data for SO2
$$P = \frac{RT}{\hat{V} - b} - \frac{a}{\hat{V}^2}$$
Find the values of a and b that give the best predictions for P, when compared to the measured values of P
36
Strategy for Nonlinear Regression
1) estimate initial values for a and b
2) compute predicted P’s using data for and TV
3) compute errors between predicted P’s and measured P’s
4) sum the squares of these errors to compute SSE
5) have the Solver minimize SSE by adjusting the values of a and b
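The five steps can be sketched in Python. Since the Solver is an Excel add-in, a plain Gauss-Newton iteration stands in for it here; that substitution, the synthetic data, and the parameter values used to generate them (chosen only to be of a plausible magnitude, in SI units) are all assumptions, not the deck's worksheet or its SO2 data.

```python
R = 8.314  # gas constant, J/(mol K)


def vdw_pressure(T, V, a, b):
    """van der Waals pressure (Pa) for molar volume V (m^3/mol) at T (K)."""
    return R * T / (V - b) - a / V ** 2


def sse(data, a, b):
    """Steps 3 and 4: sum of squared errors over (T, V, measured P) rows."""
    return sum((P - vdw_pressure(T, V, a, b)) ** 2 for T, V, P in data)


def fit_vdw(data, a=0.0, b=0.0, iterations=20):
    """Step 5: adjust a and b to minimize SSE (Gauss-Newton, not Excel Solver).

    The default start a = b = 0 (step 1) is the ideal gas law.
    """
    for _ in range(iterations):
        saa = sab = sbb = ga = gb = 0.0
        for T, V, P in data:
            r = P - vdw_pressure(T, V, a, b)  # error for this row (step 3)
            ja = -1.0 / V ** 2                # dP/da
            jb = R * T / (V - b) ** 2         # dP/db
            saa += ja * ja
            sab += ja * jb
            sbb += jb * jb
            ga += ja * r
            gb += jb * r
        det = saa * sbb - sab * sab
        a += (ga * sbb - sab * gb) / det      # solve the 2x2 normal equations
        b += (saa * gb - sab * ga) / det
    return a, b


# Synthetic, noise-free "measurements" from assumed parameter values.
A_TRUE, B_TRUE = 0.68, 5.6e-5
data = [(T, V, vdw_pressure(T, V, A_TRUE, B_TRUE))
        for T in (300.0, 400.0, 500.0)
        for V in (5e-4, 1e-3, 2e-3, 5e-3)]

a_fit, b_fit = fit_vdw(data)
print(a_fit, b_fit, sse(data, a_fit, b_fit))
```

The data need more than one temperature: at a single T the two parameter sensitivities are proportional at the ideal-gas starting point, and the 2x2 system cannot be solved.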
37
Basic data
Calculated Pressure by both ideal gas law and van der Waals
Sum of squares of this column
38
Ideal Gas Calculation
van der Waals Calculation
Error Calculation
Sum of SquaresCalculation
39
Setting up Solver Parameters
SSE as Target Cell
Minimize
by adjusting a and b
with b >= 0 constraint
Results
40
Results
41
[Figure: Fit of van der Waals Eqn for SO2 and Comparison to Ideal Gas Law; Predicted Pressure (Pa) vs. Measured Pressure (Pa), both axes 0 to 12,000,000; series: van der Waals, Ideal Gas]
Note departure of ideal gas predictions at higher pressures