Data Fitting


Parameter Estimation

A common problem in chemical processing is to determine the parameters (constants) in an equation used to represent experimental data. Examples are fitting a straight line or curve to a set of instrument calibration data and fitting a process model equation to measured process variable data. The problem is always the same: how to obtain the “best fit” of the equation to the data. This means minimizing the deviations between the data points and the equation predictions. Because some deviations are positive and others negative, their simple sum is not an indicator of goodness of fit, since positive and negative deviations cancel. The sum of the absolute values would be a good indicator, but it is not the one traditionally used. The standard indicator of goodness of fit is the sum of the squares of the deviations, which is the basis of the “method of least squares”. The problem is then to find the set of constants that minimizes the sum of the squares of the deviations between the measured values and the calculated values.

Note: It is assumed here that the estimated uncertainties in all data points are the same. If this is not the case, a more sophisticated approach should be used that takes into account the estimated uncertainty of each point. This is usually done with “chi-square” fitting procedures.

Method of Least Squares

Assume we have a set of data involving only two variables (y and x). There are a given number of measured points available (N). Thus, we have a table containing N rows of y and x values available to work with. For example, the x values could be different values of time and the y values could be measured temperatures at a certain point in our chemical process. As another example, y could be the viscosity of a process liquid and x could be the corresponding temperature of the liquid. After preparing our table of data points, the next item of business is to assume a form for the equation we wish to fit to the data.

$y_{calc} = f(x, a, b, \ldots)$   general form

$y_{calc} = ax + b$   linear

$y_{calc} = ax^2 + bx + c$   quadratic

$y_{calc} = ax^b$   power law

$y_{calc} = ae^{bx}$   exponential

The form chosen may be based on theory, but it is often chosen empirically as a logical choice to fit the data. Methods of selecting equations, by plotting the y-x data in such a way that a straight line is obtained, are discussed in the Felder and Rousseau text. For example, when you plot y vs. x, does the resulting graph look like a straight line, or does y vs. 1/x, or y vs. ln(x), or how about y vs. exp(x)? This is the way most engineers select the form of an empirical equation to represent the data. Excel is helpful in this process, as there are options to plot x-y scatter diagrams in various standard forms. There are systematic methods for choosing the best terms to go into an equation, from a bank of possible terms, but such methods are beyond simple statistical analyses and not often used in chemical engineering practice. The method of least squares says we should determine the values of the constants in the equation that minimize the sum of the squares of the deviations between the values calculated from the equation and the experimental y values:

$\sum (y_{calc} - y)^2 = \min$
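For the straight-line form, the minimum can also be written in closed form. The following is a standard textbook result, included here as a sketch rather than as part of the spreadsheet procedure; S is a name introduced here for the sum of squares, and all sums run over the N data points.

```latex
S(a,b) = \sum_{i=1}^{N} (a x_i + b - y_i)^2

\frac{\partial S}{\partial a} = 2 \sum x_i (a x_i + b - y_i) = 0,
\qquad
\frac{\partial S}{\partial b} = 2 \sum (a x_i + b - y_i) = 0

a = \frac{N \sum x_i y_i - \sum x_i \sum y_i}{N \sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
b = \frac{\sum y_i - a \sum x_i}{N}
```

For model forms that are not linear in the constants, no such closed form exists in general, and a numerical minimization is needed.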

This minimization is easily accomplished in a spreadsheet like Excel by employing the “Solver” function. An example problem on an Excel spreadsheet is given below:

Least Squares Example -- Fitting Temperature-Time Data to a Straight Line (y = ax + b)

Row    A          B        C        D       E         F            G            H
       y          x        a,b      ycalc   ycalc-y   (ycalc-y)^2  (y-ymean)^2  x^2
       T (deg C)  t (min)  Consts   T calc  Devn      Devn^2
6      21.53      0.00     0.30831  21.34   -0.1871   0.0350       56.6         0.00
7      24.31      10.05    21.343   24.44   0.1315    0.0173       22.5         101.00
8      27.58      20.00             27.51   -0.0708   0.0050       2.2          400.00
9      30.22      29.75             30.52   0.2953    0.0872       1.4          885.06
10     33.67      40.05             33.69   0.0209    0.0004       21.3         1604.00
11     37.01      50.20             36.82   -0.1897   0.0360       63.3         2520.04
12                150.05                    SSD =     0.18091      167.3        5510.11
13     ymean =    29.05333 σa =     0.00507 σy =      0.21267
14                         σb =     0.154   R2 =      0.9989
15                                          F-obs =   3694

The first two columns (columns A and B, rows 6-11 in Excel) contain the original experimental data (typed in). The 3rd column (C) contains the constants (type in initial estimates). The 4th column (D) is calculated from the formula =$C$6*B6+$C$7 (drag down). The 5th column (E) is calculated from the formula =D6-A6 (drag down). The 6th column (F) is calculated from the formula =E6^2 (drag down). Cell F12 is the sum of the F6:F11 values (the sum of squares of deviations).

The Solver function (under the Tools menu) is then used to minimize cell F12 by changing the values in cells C6:C7, resulting in the values shown above. The 7th column (G) is needed only to calculate the R2 value (see below); it is calculated from the formula =(A6-$B$13)^2 (drag down). Cell B13 is the mean (average) of the A6:A11 y-values. Cell G12 is the sum of the G6:G11 values. The 8th column (H) is needed to calculate the standard uncertainties in the parameters; it is calculated from the formula =B6^2 (drag down). Cell B12 is the sum of the B6:B11 values, and cell H12 is the sum of the H6:H11 values.

The variance is the sum of the squares of the deviations divided by the "degrees of freedom" (N - C), where N is the number of data points and C is the number of constants that have been determined in the equation.

The standard deviation (σy) is the square root of the variance, calculated from the relationship ((Sum Devn^2)/(N-C))^0.5. Cell F13 is calculated from the formula =((F12)/(6-2))^0.5. The standard deviation is useful in estimating random errors, but it must be remembered that it depends on how well the model equation fits the data, as well as on how much scatter there is in the data. If the model appears to be satisfactory (sometimes called “chi by eye”), then 2σy is often taken as a measure of random errors (95 % level). Determination of random errors through replication of individual points is better, but this is not always possible.

The R2 value is calculated as one minus the sum of the squares of the deviations (cell F12) divided by the total sum of the squares (cell G12). Cell F14 above is calculated from the formula =(1-F12/G12), so R2 = 1 - 0.18/167.3 = 0.9989. This means that well over 99 % of the variation in the y values is accounted for by the resulting equation. R2 tells us only whether the model is the right shape to fit the data, not how much scatter is present in the data (i.e., not the magnitude of the random errors).

The F-observed value (cell F15) is calculated as =(G12-F12)/(F12/(6-2)), so F-observed = (167.3-0.18)/(0.18/4) = 3700 to two significant figures. When compared with an F-critical value from standard statistical tables, this F-observed value can help us determine whether the model chosen is valid.

The standard uncertainties in the fitted parameters a and b (cells D13 and D14) are calculated from =F13*(6/(6*H12-B12^2))^0.5 and =F13*(H12/(6*H12-B12^2))^0.5, respectively, where N=6 has been taken for the current problem.
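The spreadsheet calculations can be cross-checked outside Excel. The following is a minimal Python sketch, not part of the original spreadsheet: it uses the six data points from the example table and, in place of the Solver minimization, the closed-form least-squares solution for a straight line (a different technique that should reach the same minimum for this linear case). All variable names are illustrative.

```python
import math

# Temperature-time data from the example table (t in min, T in deg C)
x = [0.00, 10.05, 20.00, 29.75, 40.05, 50.20]
y = [21.53, 24.31, 27.58, 30.22, 33.67, 37.01]

n = len(x)  # number of data points, N
c = 2       # number of fitted constants, C

sx = sum(x)                                   # cell B12
sy = sum(y)
sxx = sum(xi ** 2 for xi in x)                # cell H12
sxy = sum(xi * yi for xi, yi in zip(x, y))

# Closed-form slope and intercept, used here instead of Solver
denom = n * sxx - sx ** 2
a = (n * sxy - sx * sy) / denom
b = (sy - a * sx) / n

ycalc = [a * xi + b for xi in x]                         # column D
ssd = sum((yc - yi) ** 2 for yc, yi in zip(ycalc, y))    # cell F12
ymean = sy / n                                           # cell B13
sst = sum((yi - ymean) ** 2 for yi in y)                 # cell G12

sigma_y = math.sqrt(ssd / (n - c))            # cell F13
r2 = 1 - ssd / sst                            # cell F14
f_obs = (sst - ssd) / (ssd / (n - c))         # cell F15
sigma_a = sigma_y * math.sqrt(n / denom)      # cell D13
sigma_b = sigma_y * math.sqrt(sxx / denom)    # cell D14

print(f"a = {a:.5f}, b = {b:.3f}")
print(f"SSD = {ssd:.5f}, sigma_y = {sigma_y:.5f}")
print(f"R2 = {r2:.4f}, F-obs = {f_obs:.0f}")
print(f"sigma_a = {sigma_a:.5f}, sigma_b = {sigma_b:.3f}")
```

The printed values can be compared against the cells noted in the comments.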

Verify the values shown in the table by preparing a graph of the data in Excel and using “Add Trendline” (under the Chart menu) with the options to display the equation and the R2 value. You have to format the box containing the equation to get the number of significant figures shown here. Whereas the Trendline function in Excel is limited to several standard forms for the equation, the spreadsheet method illustrated above places no limits on the equation chosen. It may be as complicated as you like, and it may contain any number of independent variables.

It is desirable to find the uncertainty in the coefficients determined through least squares. For example, experimental conversion versus time data might be used to determine kinetic rate parameters, such as the pre-exponential factor and activation energy for a chemical reaction. It is always good to know the uncertainties in these parameters, because they may lead to significant variations in operating conditions in a simulated chemical process. Parameter uncertainties and other statistical measures, such as the F-observed value, can be obtained in Excel using the LINEST function. See the Excel Help for details about this function and how to use it.

As an example, consider once again the sample problem given above in the Excel spreadsheet. First select an area of the worksheet 2 columns wide by 5 rows high where you want the results to appear. Then enter the function =LINEST(A6:A11,B6:B11,TRUE,TRUE) as an array formula (with Ctrl+Shift+Enter). The result is an array of 5 rows and 2 columns, as follows:

0.308313126    21.34293592
0.005072658    0.153723418
0.998918374    0.212665244
3694.135186    4
167.0728273    0.180906024

The items in this array are the least-squares values of a and b in the 1st row, the standard uncertainties in a and b in the 2nd row, the values of R2 and the standard deviation σy in the 3rd row, the F-observed value and the degrees of freedom (N-C) in the 4th row, and the regression sum of squares (the total sum of squares minus the sum of squares of deviations) and the sum of squares of deviations in the bottom row.

[Figure: Temperature History. Plot of T (°C) versus t (min) with linear trendline y = 0.30831x + 21.34294, R2 = 0.99892.]

Nonlinear Least Squares

If the relationship between the dependent variable y and the independent variable x (or variables x1,…,xK) is nonlinear, the spreadsheet method in Excel can still be used to find the constants and the standard deviation of the fit, but it will not give the uncertainties in the coefficients directly. Specialty statistical programs, such as SAS, will give these uncertainties. There is also a website, http://members.aol.com/johnp71/nonlin.html, where you can enter your equation and data, and it will do the calculations for you, including the uncertainties in the coefficients.

Mathcad can be used to solve nonlinear least-squares problems in at least two ways. First, the sum of squares of the errors can be set up in terms of the unknown constants (a,b,…); let us call it SSE. Then a Solve Block is set up with the equation SSE(a,b,…) = 0, followed by the assignment (a,b,…) := Minerr(a,b,…). This function finds the constants that minimize SSE. Second, the same problem can be solved in Mathcad using the “genfit” function. See Mathcad help for details. I have yet to find out how to obtain the standard uncertainties in the coefficients, but maybe you can help.
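One common way to estimate coefficient uncertainties in a nonlinear fit is a linearized approximation: the parameter variances are taken from σy² times the inverse of JᵀJ, where J holds the derivatives of ycalc with respect to the constants at the fitted values. The following Python sketch illustrates that idea under stated assumptions; it is not the Excel, SAS, or Mathcad procedure. It fits the exponential form ycalc = a·exp(bx) by Gauss-Newton iteration. The function name fit_exponential, the starting guesses, and the data (generated from a = 2.0 and b = 0.3 with small made-up offsets) are all hypothetical.

```python
import math

def fit_exponential(xs, ys, a, b, max_iter=50, tol=1e-12):
    """Gauss-Newton fit of ycalc = a*exp(b*x) from starting guesses a, b.

    Returns a, b, sigma_a, sigma_b, sigma_y (linearized uncertainty estimates).
    """
    n = len(xs)
    for _ in range(max_iter + 1):
        e = [math.exp(b * x) for x in xs]
        r = [y - a * ei for y, ei in zip(ys, e)]      # residuals y - ycalc
        ja = e                                        # d(ycalc)/da
        jb = [a * x * ei for x, ei in zip(xs, e)]     # d(ycalc)/db
        # Entries of the 2x2 matrix J^T J and its determinant
        saa = sum(u * u for u in ja)
        sab = sum(u * v for u, v in zip(ja, jb))
        sbb = sum(v * v for v in jb)
        det = saa * sbb - sab * sab
        # Solve (J^T J) [da, db] = J^T r for the parameter step
        ga = sum(u * ri for u, ri in zip(ja, r))
        gb = sum(v * ri for v, ri in zip(jb, r))
        da = (sbb * ga - sab * gb) / det
        db = (saa * gb - sab * ga) / det
        if abs(da) < tol and abs(db) < tol:
            break                                     # converged at current a, b
        a, b = a + da, b + db
    # Statistics evaluated at the last point where J and r were computed
    ssd = sum(ri * ri for ri in r)
    sigma_y = math.sqrt(ssd / (n - 2))                # two fitted constants
    sigma_a = sigma_y * math.sqrt(sbb / det)
    sigma_b = sigma_y * math.sqrt(saa / det)
    return a, b, sigma_a, sigma_b, sigma_y

# Synthetic illustration only: exact model values plus small made-up offsets
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
offsets = [0.02, -0.03, 0.01, 0.03, -0.02, -0.01]
ys = [2.0 * math.exp(0.3 * x) + d for x, d in zip(xs, offsets)]

a_fit, b_fit, sigma_a, sigma_b, sigma_y = fit_exponential(xs, ys, a=1.8, b=0.32)
print(f"a = {a_fit:.4f} +/- {sigma_a:.4f}")
print(f"b = {b_fit:.4f} +/- {sigma_b:.4f}")
print(f"sigma_y = {sigma_y:.4f}")
```

For a straight line, the same expressions reduce to the σa and σb formulas used in the linear example. Gauss-Newton without damping can diverge from poor starting guesses, so reasonable initial estimates are assumed here.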