Statistical Analysis of Data in the Linear RegimeFour of the most common probability distributions...

Statistical Analysis of Datain the Linear Regime

Robert DeSerio

University of Florida — Department of PhysicsPHY4803L — Advanced Physics Laboratory

Contents

1 Introduction 7Linear Algebra and Taylor Expansions . . . . . . . . . . . . . . . . 10Small Error Approximation . . . . . . . . . . . . . . . . . . . . . . 13

2 Random Variables 15Law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . 16Sample averages and expectation values . . . . . . . . . . . . . . . 17Properties of expectation values . . . . . . . . . . . . . . . . . . . . 18Normalization, mean and variance . . . . . . . . . . . . . . . . . . . 19

3 Probability Distributions 25The Gaussian distribution . . . . . . . . . . . . . . . . . . . . . . . 25The binomial distribution . . . . . . . . . . . . . . . . . . . . . . . 25The Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . 27The uniform distribution . . . . . . . . . . . . . . . . . . . . . . . . 29

4 Independence and Correlation 31Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

Product rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

The covariance matrix . . . . . . . . . . . . . . . . . . . . . . 39

5 Measurement Model 41Random errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41Systematic errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42Correlated data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3

4 CONTENTS

6 Propagation of Error 45Complete solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . 46Propagation of error . . . . . . . . . . . . . . . . . . . . . . . . . . 48Correction to the mean . . . . . . . . . . . . . . . . . . . . . . . . . 54

7 Central Limit Theorem 57Single variable central limit theorem . . . . . . . . . . . . . . . . . 58Multidimensional central limit theorem . . . . . . . . . . . . . . . . 60

8 Regression Analysis 65Principle of Maximum Likelihood . . . . . . . . . . . . . . . . . . . 66

Iteratively Reweighted Regression . . . . . . . . . . . . . . . . 69Least Squares Principle . . . . . . . . . . . . . . . . . . . . . . 70

Sample mean and variance . . . . . . . . . . . . . . . . . . . . . . . 71Weighted mean . . . . . . . . . . . . . . . . . . . . . . . . . . 74

Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76Equally-weighted linear regression . . . . . . . . . . . . . . . . 81

Nonlinear regression . . . . . . . . . . . . . . . . . . . . . . . . . . 82The Gauss-Newton algorithm . . . . . . . . . . . . . . . . . . 84

Uncertainties in independent variables . . . . . . . . . . . . . . . . 87Regression with correlated yi . . . . . . . . . . . . . . . . . . . . . . 89Calibrations and instrument constants . . . . . . . . . . . . . . . . 90

Fitting to a function of the measured variable . . . . . . . . . 94

9 Evaluating a Fit 97Graphical evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 97The χ2 distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 99The χ2 test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103Uncertainty in the uncertainty . . . . . . . . . . . . . . . . . . . . . 104The reduced χ2 distribution . . . . . . . . . . . . . . . . . . . . . . 107Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . 109

Student-T probabilities . . . . . . . . . . . . . . . . . . . . . . 111The ∆χ2 = 1 rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

10 Regression with Excel 117Linear regression with Excel . . . . . . . . . . . . . . . . . . . . . . 118General regression with Excel . . . . . . . . . . . . . . . . . . . . . 120

Parameter variances and covariances . . . . . . . . . . . . . . 123

CONTENTS 5

Cautions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

Probability tables 128Gaussian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128Reduced chi-square . . . . . . . . . . . . . . . . . . . . . . . . . . . 129Student-T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

6 CONTENTS

Chapter 1

Introduction

Data obtained through measurement always contain random error. Randomerror is readily observed by sampling—making repeated measurements whileall experimental conditions remain the same. For various reasons the mea-sured values will vary and thus they might be histogrammed as in Fig. 1.1.Each histogram bin represents a possible value or range of values as indicatedby its placement along the horizontal axis. The height of each bar gives thefrequency, or number of times a measurement falls in that bin.

The measurements are referred to as a sample, the number of measure-ments N is the sample size and the histogram is the sample frequency dis-

0

10

5

20

15

Frequency

2.58 2.662.62 (cm)y m

Figure 1.1: A sample frequency distribution for 100 measurements of the lengthof a rod.

7

8 CHAPTER 1. INTRODUCTION

tribution. Dividing the frequencies by the sample size yields the fraction ofmeasurements that fall in each bin. A graph of this kind is called the sampleprobability distribution because it provides an estimate for the probability tofall in each bin. Were new sample sets taken, the randomness of the measure-ment process would cause each new sample distribution to vary. However, asN →∞, the law of large numbers states that the sample probability distri-bution converges to the parent probability distribution—a complete statisticaldescription of that particular measurement.

Thus, a single measurement is simply one sample from a parent distri-bution. It is typically interpreted as the sum of a fixed “signal” componentand a random “noise” component. The signal will be taken as the mean ofthe parent distribution and the noise represents random processes that causeindividual measurements to deviate from that mean.

The experimenter gives physical meaning to the signal through an un-derstanding of the measuring instrument and its application to a particularapparatus. For example, a thermometer’s signal component might be inter-preted to be the temperature of the system to which it’s attached. Obviously,the interpretation is subject to possible deviations that are distinct from andin addition to the deviations associated with the measurement noise. For ex-ample, the thermometer may be out of calibration or it may not be in perfectthermal contact with the system. Such problems give rise to a non-randomdeviation between the measurement mean and the true value for the physicalquantity.

Measurement error refers to the difference between a measurement andthe true value and thus consists of two components. The deviation betweenan individual measurement and the mean of its distribution is called the ran-dom error, while the non-random deviation between the mean and the truevalue is called the systematic error. Measurement uncertainty refers to theexperimenter’s inability to provide specific values for either error in any par-ticular measurement. However, based on an understanding of the measuringinstrument and its application, the range of reasonably likely deviations canbe known and should be estimated. Indeed, the term measurement uncer-tainty often refers to such quantitative estimates.

Theoretical models provide relationships among physical variables. Forexample, the temperature, pressure, and volume of a quantity of gas mightbe measured to test various “equations of state” such as the ideal gas law orthe Van der Waals model, which predict specific relationships among thosevariables.

9

Broadly summarized, statistical analysis often amounts to a compatibilitytest between the measurements and the theory as specified in the followinghypotheses:

Experimental: The measurement uncertainties are well characterized.

Theoretical: The physical quantities follow the predicted relationships.

Experiment and theory are compatible if the deviations between the mea-surements and predictions can be accounted for by reasonable measurementerrors. If they are not compatible, at least one of the two hypotheses must berejected. The experimental hypothesis is usually first on the chopping blockbecause compatibility depends on how the random and systematic errors aremodeled and quantified. Only after careful assessment of both sources oferror can one conclude that predictions are the problem.

Even when experiment and theory appear compatible, there is still reasonto be cautious—one or both hypotheses can still be false. In particular,systematic errors are often difficult to disentangle from the theoretical model.Sorting out the behavior of measuring instruments from the behavior of thesystem under investigation is a basic goal of the experimental process.

The goal of this book is to present the statistical models, formulas, andprocedures needed to accomplish the compatibility test for a range of exper-imental situations commonly encountered in the physical sciences.

In Chapter 2 the basics of random variables and probability distributionsare presented and the law of large numbers is used to highlight the differencesbetween expectation values and sample averages.

Four of the most common probability distributions are described in Chap-ter 3. Chapter 4 introduces the joint probability distribution for multiple ran-dom variables and the related topics of statistical independence, correlation,and the covariance matrix.

Chapter 5 discusses systematic errors and other measurement issues.Chapter 6 provides propagation of error formulas for determining the un-certainty in variables defined from other variables. Chapter 7 demonstratesthe central limit theorem with one- and two-dimensional examples.

Chapter 8 discusses regression analysis based on the principle of maximumlikelihood. For data modeled by fitting functions with adjustable parameters,regression formulas and algorithms provide the fitting parameters and theiruncertainties based on the measurements and their uncertainties.


Chapter 9 discusses evaluation of regression results and the chi-squarerandom variable and Chapter 10 provides a guide to using Excel for regressionanalysis.

Linear Algebra and Taylor Expansions

Linear algebra is an essential tool for data analysis. Linear algebra turns setsof equations into vector equations and it replaces summation symbols withthe implied sums of linear algebra multiplication rules. A modest amount oflinear algebra is used throughout this book. The notation used is as follows.

Column and row vectors will be displayed in bold face type. For example,the main input data to an analysis will typically be represented by the setyi, i = 1...N . It will be referred to as the data set y, by the expression“the yi” or by the column vector

y =

y1

y2...yN

(1.1)

The transpose of a column vector is a row vector with the same elements inthe same order.

yT = (y1, y2, ..., yN) (1.2)

The product yTy is an inner product—a scalar given by

yTy = (y1, y2, ..., yN)

y1

y2...yN

=

N∑i=1

y2i (1.3)

11

whereas the product yyT is an outer product—the N ×N matrix given by

yyT =

y1

y2...yN

(y1, y2, ..., yN)

=

y2

1 y1y2 ... y1yNy2y1 y2

2 ... y2yN...

... ......

yNy1 yNy2 ... y2N

(1.4)

Matrices will be displayed in regular math fonts with square bracketssurrounding a descriptive name. For example, [σ2

y] will represent the N ×Ncovariance matrix describing one aspect of the probability distribution forthe yi as discussed in Chapter 4.

Non-square matrices arise when determining results based on input data.The output or results of an analysis will typically be represented by the seta, by ak, k = 1...M , where M < N (more data than results), or by thecolumn vector a. The N input yi are typically measurements (or derivedfrom measurements) and come with some uncertainty. Because of this, therewill be some uncertainty in the ak derived from them. Statistical analysismust not only provide the ak from the yi, but also the uncertainty in the akarising from the uncertainty in the yi.

The transformations between the input and output uncertainties is facili-tated by the Jacobian describing the relationship between the two data sets.The M × N (M rows by N columns) Jacobian [Jay ] is the matrix of partialderivatives of each ak with respect to each yi

[Jay ]ki =∂ak∂yi

(1.5)

Note the use of double subscripts k and i outside the square brackets labelingthe row and column, respectively, of the identified element. Note also theuse of a subscript and a superscript inside the square brackets as remindersof the variables involved. They also provide the physical units. For example,the element [Jay ]ki has the units of ak/yi. For a covariance matrix such as[σ2y ], the superscript 2 and subscript y should be taken as a reminder that its

elements have the units of yiyj or square yi units along the diagonal. (Often,


the ak have a variety of units and the yi all have the same units, but thisneed not always be the case.)

Matrix inversion will be signified by a superscript −1 outside the squarebrackets. The operation is only valid for certain square matrices and theinverse of the N ×N matrix [X] satisfies

[X][X]−1 = [X]−1[X] = [1] (1.6)

where [1] is the N × N identity matrix with ones along the diagonal andzeros elsewhere. It has the property that [Y ][1] = [Y ] for any M ×N matrix[Y ], and [1][Z] = [Z] for any N ×M matrix [Z]. The elements of [X]−1 haveunits inverse to those of [X].

The transpose of a matrix, signified by a superscript T outside the squarebrackets, has the same matrix elements but interchanges the rows of theoriginal matrix for the columns of its transpose and vice versa. Thus thetranspose of [Jay ] is the N ×M (N rows by M columns) matrix [Jay ]T withelements given by

[Jay ]Tik = [Jay ]ki (1.7)

=∂ak∂yi

When two matrices or a vector and a matrix are multiplied, their sizesmust be compatible and their ordering is important. Linear algebra multipli-cation does not necessarily commute. The adjacent indices in a multiplicationwill be summed over in forming the result and must be of the same size. Thus[Jay ][Jay ]T is an M ×M matrix with elements given by

[[Jay ][Jay ]T ]kj =N∑i=1

[Jay ]ki[Jay ]Tij (1.8)

How would one determine the elements of the Jacobian numerically?Their operational definition is similar to any numerical differentiation. Afterperforming an analysis, thereby finding the ak from the yi, change one yi toy′i, that is, change it by the (small) amount ∆yi = y′i − yi. Redo the analy-sis. Each ak will change by ∆ak (to a′k = ak + ∆ak). Make the change ∆yismaller and smaller until the ∆ak are proportional to ∆yi. For ∆yi in thislinear regime, the elements of the Jacobian are given by: [Jay ]ki = ∆ak/∆yi.

13

Of course, the Jacobian can also be obtained by explicit differentiation whenthe functional form: ak = fk(yi) is known.

With the Jacobian in hand, if all the yi are now simultaneously allowedto change—within the linear regime—to a new set y′, a first-order Taylorexpansion gives the new a′k for this set

a′k = ak +N∑i=1

∂ak∂yi

(y′i − yi) (1.9)

or

∆ak =N∑i=1

[Jay ]ki∆yi (1.10)

which is just the kth row of the vector equation

∆a = [Jay ]∆y (1.11)

The Jacobian evaluated at one set of yi describes how all the ak will changewhen any or all of the yi change by small amounts.

To similarly express the row vector, ∆aT , recall the rules for the trans-pose of a matrix-matrix or vector-matrix multiplication: The transpose of aproduct of terms is the product of the transpose of each term with the terms’ordering reversed: [[A][B]]T = [B]T [A]T . Thus, the equivalent transpose toEq. 1.11 is

∆aT = ∆yT [Jay ]T (1.12)

Small Error Approximation

The first-order Taylor expansion expressed by Eqs. 1.9-1.12 shows how changesto the input variables will propagate to changes in the output variables. Itis the basis for propagation of error and regression analysis—topics coveredin Chapters 6 and 8 where analytic expressions for the Jacobian are given.However, one issue is worthy of a brief discussion here.

In order for the first-order Taylor expansion to be valid, the effects ofhigher-order derivatives must be kept small. This happens for any size ∆yiwhen higher derivatives are absent, i.e., when the relationships between the


ak and the yi are linear. When the relationships are nonlinear, it requireskeeping the range of possible ∆yi small enough that throughout that range,the higher-order terms in the Taylor expansions would be small comparedto the first-order terms. The range of possible ∆yi is determined by themeasurement uncertainty and will be assumed small enough to satisfy thisrequirement. Linearity checks are presented in Chapter 9 and should be usedto verify this assumption.

Working with measurements having large uncertainties that could take∆yi outside the linear regime presents problems for error analysis that willonly be treated for a few special cases. If the uncertainties are too bigto treat with a first-order Taylor expansion, the best solution is to reducethe uncertainties by improving the measurements. Reducing the ∆yi intothe linear regime makes the uncertainty calculations more trustworthy andreduces the uncertainties in the final results.

Chapter 2

Random Variables

The experimental model treats each measurement as a random variable—a quantity whose value varies randomly as the procedure used to obtain itis repeated. Possible values occur randomly but with fixed probabilities asdescribed next.

When the possible yi form a discrete set, the probability P (yi) gives theprobability for that yi to occur. The complete set of probabilities for all yiis called a discrete probability function (or dpf). When the possible valuescover a continuous interval, their probabilities are described by a probabilitydensity function (or pdf). With the pdf p(y) specified for all possible valuesof y, the differential probability dP (y) of an outcome between y and y + dyis given by

dP (y) = p(y)dy (2.1)

Probabilities for outcomes in any finite range are obtained by integration.The probability of an outcome between y1 and y2 is given by

P (y1 < y < y2) =

∫ y2

y1

p(y) dy (2.2)

Continuous probability distributions become effectively discrete when thevariable is recorded with a chosen number of significant digits. The proba-bility of the measurement is then the integral of the pdf over a range ±1/2of the size, ∆y, of the least significant digit.

P (y) =

∫ y+∆y/2

y−∆y/2

p(y′) dy′ (2.3)

15

16 CHAPTER 2. RANDOM VARIABLES

Note how the values of P (y) for a complete set of non-overlapping intervalscovering the entire range of y-values would map the pdf into an associateddpf. Many statistical analysis procedures will be based on the assumptionthat P (y) is proportional to p(y). For this to be the case, ∆y must be smallcompared to the range of the distribution. More specifically, p(y) must havelittle curvature over the integration limits so that the integral becomes

P (y) = p(y) ∆y (2.4)

Both discrete probability functions and probability density functions arereferred to as probability distributions. The P (yi), being probabilities, mustbe between zero and one and are unitless. And because p(y)∆y is a probabil-ity, p(y) must be a “probability per unit y” and thus it must be non-negativewith units inverse to those of y.

Before discussing important properties of a distribution such as its meanand standard deviation, the related subject of sampling is addressed moregenerally. Sampling can be used to get an estimate of the shape and otherquantitative properties of a measurement distribution. The law of large num-bers is important because it defines ideal or expectation values with whichto compare those quantities.

Law of large numbers

P (y) for an unknown distribution can be determined to any degree of accu-racy by acquiring and histogramming a sample of sufficient size.

If histogramming a discrete probability distribution, the bins should belabeled by the allowed values yj. For a continuous probability distribution,the bins should be labeled by their midpoints yj and constructed as adjacent,non-overlapping intervals spaced ∆y apart and covering the complete rangeof possible outcomes. The sample, of size N , is then sorted to find thefrequencies f(yj) for each bin

The law of large numbers states that the sample probability, f(yj)/N ,for any bin will approach the predicted P (yj) more and more closely as thesample size increases. The limit satisfies

P (yj) = limN→∞

1

Nf(yj) (2.5)

17

Sample averages and expectation values

Let yi, i = 1..N represent sample values for a random variable y havingprobabilities of occurrence governed by a pdf p(y) or a dpf P (y). The sampleaverage of any function g(y) will be denoted with an overline so that g(y) isdefined as the value of g(y) averaged over all y-values in the sample set.

g(y) =1

N

N∑i=1

g(yi) (2.6)

For finite N , the sample average of any (non-constant) function g(y) isa random variable; taking a new sample set of yi would likely produce adifferent sample average. However, in the limit of infinite sample size, thelaw of large numbers implies that the sample average converges to a welldefined constant depending only on the parent probability distribution andthe particular function g(y). This constant is called the expectation value ofg(y) and will be denoted by putting angle brackets around the function

〈g(y)〉 = limN→∞

1

N

N∑i=1

g(yi) (2.7)

Equation 2.7 suggests that expectation values can be obtained to anydegree of uncertainty by making N large enough. Very large samples areeasily created with random number generators to test various aspects ofprobability theory. Such Monte-Carlo simulations are described in variouschapters and posted on the lab web site.

To obtain analytical expressions for expectation values, Eq. 2.7 can becast into a form suitable for use with a given probability distribution asfollows. Assume a large sample of size N has been properly histogrammed.If the variable is discrete, each possible value yj gets its own bin. If thevariable is continuous, the bins are labeled by their midpoints yj and theirsize ∆y is chosen small enough to ensure that (1) the probability for a y-valueto occur in any particular bin will be accurately given by P (yj) = p(yj)∆yand (2) all yi sorted into a bin at yj can be considered as contributing g(yj)—rather than g(yi)—to the sum in Eq. 2.7.

After sorting the sample yi-values into the bins, thereby finding the fre-quencies of occurrence f(yj) for each bin, the sum in Eq. 2.7 can be grouped


by bins and becomes

〈g(y)〉 = limN→∞

1

N

∑all yj

g(yj)f(yj) (2.8)

Note the change from a sum over all samples in Eq. 2.7 to a sum over allhistogram bins in Eq. 2.8.

Moving the limit and factor of 1/N inside the sum, the law of largenumbers (Eq. 2.5) can be used giving

〈g(y)〉 =∑all yj

g(yj)P (yj) (2.9)

Note that any reference to a sample is gone. Only the range of possibley-values, the probability distribution, and the arbitrary g(y) are involved.Equation 2.9 is called a weighted average of g(y); each value of g(yj) in thesum is weighted by the probability of its occurrence P (yj).

For a continuous probability density function, substitute P (yj) = p(yj)∆yin Eq. 2.9 and take the limit as ∆y → 0. This converts the sum to the integral

〈g(y)〉 =

∫ ∞−∞

g(y)p(y) dy (2.10)

Eq. 2.10 is a weighted integral with each g(y) weighted by its occurrenceprobability p(y) dy.

Properties of expectation values

Some frequently used properties of expectation values are given below. Theyall follow from simple substitutions for g(y) in Eqs. 2.9 or 2.10 or from theoperational definition of an expectation value as an average for an effectivelyinfinite data set (Eq. 2.7).

1. The expectation value of a constant is that constant: 〈c〉 = c. Substi-tute g(y) = c and use normalization condition (discussed in the nextsection). Guaranteed because the value c is averaged for every sampledyi.

19

2. Constants can be factored out of expectation value brackets: 〈cu(y)〉 =c 〈u(y)〉. Substitute g(y) = cu(y), where c is a constant. Guaranteed bythe distributive property of multiplication over addition for the termsinvolved in the averaging.

3. The expectation value of a sum of terms is the sum of the expectationvalue of each term: 〈u(y) + v(y)〉 = 〈u(y)〉+ 〈v(y)〉. Substitute g(y) =u(y)+v(y). Guaranteed by the associative property of addition for theterms involved in the averaging.

But also keep in mind the non-rule: The expectation value of a prod-uct is not necessarily the product of the expectation values: 〈u(y)v(y)〉 6=〈u(y)〉〈v(y)〉. Substituting g(y) = u(y)v(y) does not, in general, lead to〈u(y)v(y)〉 = 〈u(y)〉〈v(y)〉.

Normalization, mean and variance

Probability distributions are defined so that their sum or integral over anyrange of possible values gives the probability for an outcome in that range.Consequently, if the range includes all possible values, the probability of anoutcome in that range is 100% and the sum or integral must be equal to one.For a discrete probability distribution this normalization condition reads:∑

all yj

P (yj) = 1 (2.11)

and for a continuous probability distribution it becomes∫ ∞−∞

p(y) dy = 1 (2.12)

The normalization sum or integral is also called the zeroth moment of theprobability distribution—as it is the expectation value of y0. The other twomost important expectation values of a distribution are also moments of thedistribution.

The mean µy of a probability distribution is defined as the expectationvalue of y itself, that is, of y1. It is the first moment of the distribution.

µy = 〈y〉 (2.13)


If P (y) or p(y) is specified, µy could be evaluated using g(y) = y in Eq. 2.9or 2.10, respectively.

The mean is a measure of the central value of the distribution. It is apoint at the “center of probability” in analogy to a center of mass. Weremass distributed along the y-axis in proportion to P (y) (point masses) orp(y) (a mass distribution) µy would be the center of mass. The median andthe mode are also quantitative measures of the center of a distribution. Themedian is that value of y where there is equal probability above as below.Unlike the mean, moving a chunk of the probability distribution further fromthe median (in the same direction) does not change the median. The mode, ifit exists, is the point in the distribution where the probability is the highest.The mean is the only measure that will be considered further.

The quantity µy is sometimes called the true mean to distinguish it fromthe sample mean. The sample mean of a set of yi is simply the sample averageof y—defined by Eq. 2.6 with g(y) = y.

y =1

N

N∑i=1

yi (2.14)

The sample mean is often used as an estimate of the true mean because,by definition, it becomes exact as N → ∞. In addition, the sample meansatisfies another important property for any good estimate. Taking the ex-pectation value of both sides of Eq. 2.14 and noting that 〈yi〉 = µy for all Nsamples (Eq. 2.13) gives

µy = 〈y〉

=

⟨1

N

∑N

yi

⟩

=1

N

N∑i=1

〈yi〉

=1

N

N∑i=1

µy

=1

NNµy

= µy (2.15)

21

thereby demonstrating that the expectation value of the sample mean is equalto the true mean.

Any parameter estimate having an expectation value equal to theparameter it is estimating is said to be an unbiased estimate; itwill give the true parameter value “on average.”

Thus, the sample mean is an unbiased estimate of the true mean.After the mean, the next most important descriptor of a probability dis-

tribution is its standard deviation—a measure of how far away from the meanindividual values are likely to be. The quantity

δy = y − µy (2.16)

is called the deviation—the signed difference between a random variable’svalue and the mean of its parent distribution. One of its properties, true forany distribution, can be obtained by rewriting Eq. 2.13 in the form

〈y − µy〉 = 0 (2.17)

Deviations come with both signs and, for any distribution, by definition, themean deviation is always zero.

The mean absolute deviation is defined as the expectation value of the ab-solute value of the deviation. This quantity, 〈|y − µy|〉 for a random variabley, would be nonzero and a reasonable measure of the expected magnitudeof typical deviations. However, the mean absolute deviation does not arisenaturally when formulating the basic statistical procedures considered here.The mean squared deviation, on the other hand, plays a central role and sothe standard measure of a deviation, i.e., the standard deviation σy, is takenas the square root of the mean squared deviation.

The mean squared deviation is called the variance and written σ2y for a

random variable y. It is the second moment about the mean and defined asthe following expectation value

σ2y =

⟨(y − µy)2

⟩(2.18)

For a given probability distribution, the variance would then be evaluatedwith g(y) = (y − µy)2 in Eq. 2.9 or 2.10.

The standard deviation σy is the square root of the variance—the squareroot of the mean-squared deviation. Thus it is often referred to as the rms or


root-mean-square deviation—particularly when trying to emphasize its roleas the typical size of a deviation.

The variance has units of y2 while the standard deviation has the sameunits as y. The standard deviation is the most common measure of the widthof the distribution and the only one that will be considered further.

Expanding the right side of Eq. 2.18 gives σ2y =

⟨y2 − 2yµy + µ2

y

⟩and

then taking expectation values term by term, noting µy is a constant and〈y〉 = µy, gives

σ2y =

⟨y2⟩− µ2

y (2.19)

This equation is useful for evaluating the variance of a given probabilitydistribution and in the form ⟨

y2⟩

= µ2y + σ2

y (2.20)

shows that the expectation value of y2 (the second moment about the origin)exceeds the square of the mean by the variance.

The sample variance is then given by Eq. 2.6 with g(y) = (y − µy)2. Itwill be denoted s2

y and thus defined by

s2y =

1

N

N∑i=1

(yi − µy)2 (2.21)

Taking the expectation value of this equation shows that the sample varianceis an unbiased estimate of the true variance.⟨

s2y

⟩= σ2

y (2.22)

The proof is similar to that of Eq. 2.15, this time requiring an application ofEq. 2.18 to each term in the sum.

Typically, the true mean µy is not known and Eq. 2.21 can not be used.Can the sample mean y be used in place of µy? Yes, but making this substi-tution requires the following modification to Eq. 2.21.

s2y =

1

N − 1

N∑i=1

(yi − y)2 (2.23)

As will be proven later, the denominator is reduced by one so that thisdefinition of the sample variance will also be unbiased, i.e., will still satisfyEq. 2.22.

23

The sample mean and sample variance are random variables and eachfollows its own probability distribution. They are unbiased; the means oftheir distributions will be the true mean and true variance, respectively. Thestandard deviations of these distributions will be discussed in Chapters 8and 9.

Chapter 3

Probability Distributions

In this section, definitions and properties of a few fundamental probabilitydistributions are presented.

The Gaussian distribution

The Gaussian or normal probability density function has the form

p(y) =1√

2πσ2y

exp

[−(y − µy)2

2σ2y

](3.1)

and is parameterized by two quantities: the mean µy and the standard devi-ation σy.

Figure 3.1 shows the Gaussian pdf and gives various integral probabilities.Gaussian probabilities are described relative to the mean and standard de-viation. There is a 68% probability that a Gaussian random variable will bewithin one standard deviation of the mean, 95% probability it will be withintwo, and a 99.7% probability it will be within three. These “1-sigma,” “2-sigma,” and “3-sigma” probabilities should be committed to memory. Amore complete listing can be found in Table 10.2.

The binomial distribution

The binomial distribution results when an experiment, called a Bernoullitrial, is repeated a fixed number of times. A Bernoulli trial can have only

25

26 CHAPTER 3. PROBABILITY DISTRIBUTIONS

y

p(y)

0.340.34

0.140.140.020.02

σσσσ

σσσσ y

y

yµµµµ σσσσ y

Figure 3.1: The Gaussian distribution labeled with the mean µy, the standarddeviation σy and some areas, i.e., probabilities.

two outcomes. One outcome is termed a success and occurs with a probabilityp. The other, termed a failure, occurs with a probability 1 − p. Then, withN Bernoulli trials, the number of successes y can be any integer from zero(none of the N trials were a success) to N (all trials were successes).

The probability of y successes (and thus N − y failures) is given by

P (y) =N !

y!(N − y)!py(1− p)N−y (3.2)

The probability py(1− p)N−y would be the probability that the first y trialswere successes and the last N − y were not. Since the y successes andN − y failures can occur in any order and each distinct ordering would occurwith this probability, the extra multiplicative factor, called the binomialcoefficient, is needed to count the number of distinct orderings.

The binomial distribution has a mean

µy = Np (3.3)

and a varianceσ2y = Np(1− p) (3.4)

It will prove useful to rewrite the distribution and the variance in terms of

27

N and µy rather than N and p. Substituting µy/N for p, the results become

P (y) =N !

y!(N − y)!

1

NN(µy)

y(N − µy)N−y (3.5)

σ2y = µy

(1− µyN

)(3.6)

The binomial distribution arises, for example, when histogramming sam-ple frequency distributions. Consider N samples from a particular probabil-ity distribution for a random variable x. A particular bin at xj represents aparticular outcome or range of outcomes and the parent distribution woulddetermine the associated probability P (xj) for a result in that bin. Whileany distribution might give rise to the P (xj), the frequency in that partic-ular histogram bin would be governed by the binomial distribution. EachBernoulli trial consists of taking one new sample and either sorting it intothat bin—a success with a probability P (xj)—or not sorting it in that bin—afailure with a probability 1 − P (xj). After N samples, the number of suc-cesses (the bin frequency y) is a binomial random variable with that N andµy = NP (xj).

The Poisson distribution

Poisson-distributed variables arise in particle and photon counting experi-ments. For example, under unchanging experimental conditions and aver-aged over long times, pulses or “counts” from a nuclear radiation detectormight be occurring at an average rate of, say, one per second. Over manyten-second intervals, ten counts would be the average, but the actual numberin any particular interval will often be higher or lower.

More specifically, if µy is the average number of counts expected in aninterval (which need not be integer valued), then the counts y actually mea-sured in any such interval (which can only be zero or a positive integer) willoccur randomly with probabilities governed by the Poisson distribution.

P (y) = e−µy(µy)

y

y!(3.7)

The Poisson Variables addendum on the lab web site gives a derivation ofthe Poisson distribution and the related exponential distribution based on

http://www.phys.ufl.edu/courses/phy4803L/statistics/Poisson.pdf


0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0 1 2 3 4 5 6 7 8

y

P(y

)µ µ µ µ = 1.5y

0

0.005

0.01

0.015

0.02

0.025

0.03

0.035

0.04

0.045

60 70 80 90 100 110 120 130 140

y

P(y

)

µ µ µ µ = 100y

Figure 3.2: Poisson probability distributions for means of 1.5 and 100.

the assumption that the probability per unit time for an event to occur isconstant. Poisson probability distributions for µy = 1.5 and µy = 100 areshown in Fig. 3.2.

One can show (see Exercise 1) that the variance of a Poisson distributionis the mean.

σ2y = µy (3.8)

For large values of µy, the Poisson probability for a given y is very nearlyGaussian—given by Eq. 2.4 with ∆y = 1 and p(y) given by Eq. 3.1 (withσ2y = µy). That is,

P (y) ≈ 1√2πµy

exp

[−(y − µy)2

2µy

](3.9)

Eqs. 3.8 and 3.9 are the origin of the commonly accepted practice of ap-plying “square root statistics” or “counting statistics,” whereby a Poisson-distributed variable is treated as a Gaussian-distributed variable with thesame mean and with a variance chosen to be µy or some estimate of µy.

One common application of counting statistics arises when a single countis measured from a Poisson distribution of unknown mean and observed totake on a particular value y. With no additional information, that measuredy-value becomes an estimate of µy and thus it also becomes an estimate of thevariance of its own parent distribution. That is, y is assumed to be governedby a Gaussian distribution with a standard deviation given by

σy =√y (3.10)

Counting statistics is a good approximation for large values of y—greater

29

binomial Poisson uniform Gaussian

P (n) = P (n) = p(y) = p(y) =

formN !

n!(N − n)!pn(1− p)N−n e−µ

µn

n!

1

|b− a|1√

2πσ2exp

[− (y − µ)2

2σ2

]mean Np µ (a+ b)/2 µ

variance Np(1− p) µ (b− a)2/12 σ2

Table 3.1: Common probability distributions with their means and variances.

than about 30. Using it for values of y below 10 or so can lead to significanterrors in analysis.

The uniform distribution

The uniform probability distribution is often used for digital meters. Areading of 3.72 V on a 3-digit voltmeter might imply that the underlyingvariable is equally likely to be any value in the range 3.715 to 3.725 V. Avariable with a constant probability in the range from a to b has a pdf givenby

p(y) =1

|b− a|(3.11)

Exercise 1 Eqs. 2.13 and 2.18 provide the definitions of the mean µy andvariance σ2

y with Eqs. 2.9 or 2.10 used for their evaluation. Show that themeans and variances of the various probability distributions are as given inTable 3.1. Also show that they satisfy the normalization condition.

Do not use integral tables. Do the normalization sum or integral first,then the mean, then the variance. The earlier results can often be used in thelater calculations.

For the Poisson distribution, evaluation of the mean should thereby demon-strate that the parameter µy appearing in the distribution is, in fact, themean. For the Gaussian, evaluation of the mean and variance should therebydemonstrate that the parameters µy and σ2

y appearing in the distribution are,in fact, the mean and variance.

Hints: For the binomial distribution you may need the expansion

(a+ b)N =N∑n=0

N !

n!(N − n)!anbN−n (3.12)


For the Poisson distribution you may need the power series expansion

ea =∞∑n=0

an

n!(3.13)

For the Gaussian distribution be sure to always start by eliminating themean (with the substitution y′ = y−µy). The evaluation of the normalizationintegral I =

∫∞−∞ p(y) dy is most readily done by first evaluating the square

of the integral with one of the integrals using the dummy variable x and theother using y. (Both pdfs would use the same µ and σ.) That is, evaluate

I2 =

∫ ∞−∞

∫ ∞−∞

p(x)p(y) dx dy

and then take its square root. To evaluate the double integral, first eliminatethe mean and then convert from cartesian coordinates x′ and y′ to cylindricalcoordinates r and θ satisfying x′ = r cos θ, y′ = r sin θ. Convert the areaelement dx′ dy′ = r dr dθ, and set the limits of integration for r from 0 to ∞and for θ from 0 to 2π.

Exercise 2 (a) Use a software package to generate random samples froma Gaussian distribution with a mean µy = 0.5 and a standard deviationσy = 0.05. Use a large sample size N and well-chosen bins (make sureone bin is exactly centered at 0.5) to create a reasonably smooth, bell-shapedhistogram of the sample frequencies vs. the bin centers.(b) Consider the histogramming process with respect to the single bin at thecenter of the distribution—at µy. Explain why the probability for a sample tofall in that bin is approximately ∆y/

√2πσ2

y, where ∆y is the bin size, anduse that probability with your sample size to predict the mean and standarddeviation for that bin’s frequency. Compare your actual bin frequency at µywith this prediction. Is the difference between them reasonable? Hint: thebin frequency follows a binomial distribution, which has a mean of Np anda standard deviation equal to

√Np(1− p).

Chapter 4

Independence and Correlation

Statistical procedures typically involve multiple random variables as inputand produce multiple random variables as output. Probabilities associatedwith multiple random variables depend on whether the variables are statis-tically independent or not. Correlation describes a situation in which thedeviations for two random variables are related. Statistically independentvariables show no such relations. The consequences of independence andcorrelation affect all manner of statistical analysis.

Independence

Two events are statistically independent if knowing the outcome of one hasno effect on the outcomes of the other. For example, if you flip two coins, onein each hand, each hand is equally likely to hold a heads or a tails. Knowingthat the right hand holds a heads, say, does not change the equal probabilityfor heads or tails in the left hand. The two coin flips are independent.

Two events are statistically dependent if knowing the results of one affectsthe probabilities for the other. Consider a drawer containing two white socksand two black socks. You reach in without looking and pull out one sock ineach hand. Each hand is equally likely to hold a black sock or a white sock.However, if the right hand is known to hold a black sock, the left hand isnow twice as likely to hold a white sock as it is to hold a black sock. Thetwo sock pulls are dependent.

The unconditional probability of event A, expressed Pr(A), represents theprobability of event A occurring without regard to any other events. The

31

32 CHAPTER 4. INDEPENDENCE AND CORRELATION

conditional probability of “A given B,” expressed Pr(A|B), represents theprobability of event A occurring given that event B has occurred. Twoevents are statistically independent if and only if

Pr(A|B) = Pr(A) (4.1)

The multiplication rule for joint probabilities follows from Eq. 4.1 and ismore useful. The joint probability is the probability for both of two eventsto occur. The multiplication rule is that the joint probability for two inde-pendent events to occur is the product of the unconditional probability foreach to occur.

Whether events are independent or not, the joint probability of “A andB” occurring expressed Pr(A ∩ B), is logically the equivalent of Pr(B), theunconditional probability of B occurring without regard to A, multiplied bythe conditional probability of A given B.

Pr(A ∩B) = Pr(B) Pr(A|B) (4.2)

Then, substituting Eq. 4.1 gives the multiplication rule valid for independentevents.

Pr(A ∩B) = Pr(A) Pr(B) (4.3)

Equation 4.3 states the commonly accepted principle that the probability fortwo independent events to occur is simply the product of the probability foreach to occur.

And, of course, the roles of A and B can be interchanged in the logic orequations above.

For a random variable, an event can be defined as getting one particularvalue or getting within some range of values. Consistency with the multipli-cation rule for independent events then requires a product rule for the pdfsor dpfs governing the probabilities of independent random variables.

The joint probability distribution for two variables gives the probabili-ties for both variables to take on specific values. For independent, discreterandom variables x and y governed by the dpfs Px(x) and Py(y), the jointprobability P (x, y) for values of x and y to occur is given by the product ofeach variable’s probability

P (x, y) = Px(x)Py(y) (4.4)

33

And for independent, continuous random variables x and y governed by thepdfs px(x) and py(y), the differential joint probability dP (x, y) for x and y tobe in the intervals from x to x+ dx and y to y + dy is given by the productof each variable’s probability

dP (x, y) = px(x)py(y)dx dy (4.5)

The product rule for independent variables leads to the following impor-tant corollary. The expectation value of any function that can be expressedin the form f1(y1)f2(y2) will satisfy

〈f1(y1)f2(y2)〉 = 〈f1(y1)〉〈f2(y2)〉 (4.6)

if y1 and y2 are independent.For discrete random variables the proof proceeds from Eq. 4.4 as follows:

〈f1(y1)f2(y2)〉=

∑all y1,y2

f1(y1)f2(y2)P (y1, y2)

=∑all y1

∑all y2

f1(y1)f2(y2)P1(y1)P2(y2)

=∑all y1

f1(y1)P1(y1)∑all y2

f2(y2)P2(y2)

= 〈f1(y1)〉〈f2(y2)〉 (4.7)

And for continuous random variables it follows from Eq. 4.5:

〈f1(y1)f2(y2)〉

=

∫f1(y1)f2(y2) dP (y1, y2)

=

∫ ∫f1(y1)f2(y2)p1(y1)p2(y2) dy1 dy2

=

∫f1(y1)p1(y1) dy1

∫f2(y2)p2(y2) dy2

= 〈f1(y1)〉〈f2(y2)〉 (4.8)

A simple example of Eq. 4.6 is for the expectation value of the productof two independent variables, y1 and y2; 〈y1y2〉 = 〈y1〉〈y2〉 = µ1µ2. For the


special case where the independent samples yi and yj come from the samedistribution—having a mean µy and standard deviation σy, this becomes〈yiyj〉 = µ2

y for i 6= j. Coupling this result with Eq. 2.20 for the expecta-tion value of the square of any y-value: 〈y2

i 〉 = µ2y + σ2

y , gives the followingrelationship for independent samples from the same distribution

〈yiyj〉 = µ2y + σ2

yδij (4.9)

where δij is the Kronecker delta function: equal to 1 if i = j and zero if i 6= j.A related corollary arises from Eq. 4.6 with the substitutions: f1(y1) =

y1−µ1 and f2(y2) = y2−µ2 where y1 and y2 are independent random variables(though not necessarily from the same distribution)

〈(y1 − µ1)(y2 − µ2)〉 = 〈y1 − µ1〉〈y2 − µ2〉 (4.10)

Here µ1 and µ2 are the means of y1 and y2 and satisfy 〈yi − µi〉 = 0. Thusthe right-hand side of Eq. 4.10 is the product of two zeros and demonstratesthat

〈(y1 − µ1)(y2 − µ2)〉 = 0 (4.11)

for independent variables.Note that both y1−µ1 and y2−µ2 always have an expectation value of zero

whether or not y1 and y2 are independent. However, the expectation value oftheir product is guaranteed to be zero only if y1 and y2 are independent. Non-zero values for this quantity are possible if y1 and y2 are not independent.This issue will be addressed shortly.

Product rule

The product rule (Eqs. 4.4 and 4.5) can be extended—by repeated multipli-cation—to any number of independent random variables. The explicit formfor the joint probability for an entire data set yi, i = 1...N will be usefulfor our later treatment of regression analysis. This form depends on theparticular probability distributions for the yi. Often, all yi come from thesame kind of distribution: either Gaussian, binomial or Poisson. These kindsof data sets lead to the joint probability distributions considered next.

For N independent Gaussian random variables, with the distribution foreach yi having its own mean µi and standard deviation σi, the joint proba-bility distribution becomes the following product of terms—each having the

35

form of Eq. 2.4 with p(yi) having the Gaussian form of Eq. 3.1.

P (y) =N∏i=1

∆yi√2πσ2

i

exp

[−(yi − µi)2

2σ2i

](4.12)

where ∆yi represents the size of the least significant digit in yi, which are allassumed to be small compared to the σi.

For N independent binomial random variables, with the distribution foreach yi having its own number of trials Ni and mean µi, the joint proba-bility distribution becomes the following product of terms—each having thebinomial form of Eq. 3.5.

P (y) =N∏i=1

Ni!(Ni − yi)!yi!

1

NNii

(µi)yi(Ni − µi)Ni−yi (4.13)

For N independent Poisson random variables, with the distribution foreach yi having its own mean µi, the joint probability distribution becomesthe following product of terms—each having the Poisson form of Eq. 3.7.

P (y) =N∏i=1

e−µi (µi)yi

yi!(4.14)

The joint probability distributions of Eqs. 4.12-4.14 are the basis for re-gression analysis and produce amazingly similar expressions when applied tothat problem.

Correlation

Statistically independent random variables are always uncorrelated. Corre-lation describes relationships between pairs of random variables that are notstatistically independent.

The generic data set under consideration now consists of two random vari-ables, x and y, say—always measured or otherwise determined in unison—sothat a single sample consists of an x, y pair. They are sampled repeatedlyto make a set of ordered pairs: xi, yi, i = 1..N taken under unchanging ex-perimental conditions so that only random, but perhaps not independent,variations are expected.

36 CHAPTER 4. INDEPENDENCE AND CORRELATION14.06676681 4.037814 13.29383 3.959069 13.92504 4.131755 15.3175391913.99900682 3.968667 15.05385 4.006162 13.39006 4.091464 14.9146306415.95199061 4.035874 13.58051 4.037802 14.66628 4.05105 14.510502813.59616779 4.127463 15.56977 3.936418 13.93871 4.074706 14.7470480714.40648006 4.074184 14.51204 4.112669 14.91605 4.167671 15.6767000712.00449934 3.99743 13.18636 4.003715 13.72761 3.919488 13.1948701912.3859866 3.768247 13.20983 3.940571 14.37164 4.092471 14.92469686

12.92952701 3.963918 12.25293 3.990535 14.38602 3.842683 12.4268285214.84479046 3.883082 16.03258 3.96786 12.56186 4.066542 14.6654392414.2968161 3.843316 14.78141 4.00784 14.24232 3.913901 13.13901785

13.96867669 4.005338 14.43642 4.107398 14.60585 4.100188 15.0018662313.63509925 4.037445 13.08559 3.980415 13.37247 3.950136 13.5013528613.21259817 3.818129 14.19584 4.098536 14.95143 4.09648 14.9647898313.63248556 4.100148 13.11326 4.010504 13.15 3.964028 13.6403036714.12228724 3.98887 12.08419 4.179714 15.43139 3.927785 13.2778708213.42842939 4.015294 13.5981 4.058577 15.85655 3.954981 13.5498047214.44702706 4.112009 12.09149 3.904301 13.71838 4.112697 15.126989313.14688859 3.840876 13.28762 4.006037 14.33683 3.795298 11.9529748913.96653012 3.945918 13.99328 3.944161 13.57655 4.066088 14.6608836713.27230419 3.956455 13.79522 4.158466 15.05252 3.878676 12.7867628412.62221918 4.09005 12.34564 4.047978 14.19746 4.002671 14.0267288416.54043835 4.004634 12.14601 4.153113 14.86997 4.085522 14.8552066213.85817757 4.115209 14.42852 3.866664 12.27124 3.874497 12.7449551414.22370956 4.198661 16.12773 3.796669 12.25717 3.825973 12.2597331715.58978962 4.063404 13.4154 3.971138 14.77607 4.120464 15.2046324715.23186227 4.013308 15.14039 3.884708 11.96083 4.052102 14.5210280714.29895827 3.925827 14.35914 3.83001 13.11322 3.870446 12.7044533613.69447941 3.913251 14.55598 3.939723 13.32211 3.990799 13.9080103515.49113573 4.042911 11.93551 3.911886 13.189 3.930299 13.3029959912.66488983 4.020951 13.30007 4.094558 13.90913 4.020142 14.2014088914.43997335 4.075006 11.79499 3.858791 13.60738 4.181289 15.8128869515.03481675 4.02089 11.58601 3.973485 14.06732 3.987671 13.8766986913.47915529 3.956117 14.18133 4.011186 13.21782 3.917627 13.1762942414.09844199 4.012303 13.9663 4.022801 14.09632 3.889093 12.8909248416.2690998 3.818889 15.8687 3.903246 13.80229 4.027239 14.27237988

15.30751557 4.117519 12.89056 4.017416 14.92201 4.058938 14.5893835514.05078969 4.12934 13.30172 3.909139 13.28057 4.017353 14.1735422613.62457809 4.101789 13.11084 4.06433 13.72777 4.049084 14.4908131614.7414505 4.00057 14.30085 3.838909 12.37099 4.022532 14.2252952

13.23185368 4.034012 14.63632 3.8528 12.90707 3.908243 13.0824452814.31539103 3.947475 13.34191 3.95433 14.25499 3.940013 13.4001193514.33580844 4.042768 13.53226 4.023568 14.91744 3.925844 13.2584521713.79910428 3.889333 11.89796 3.99703 14.09841 3.97449 13.744890514.00855399 4.118357 13.57857 4.032541 14.83051 3.941887 13.4189017814.64000323 4.128445 14.37537 3.805828 12.69006 4.012514 14.1251328214.06267444 4.026661 14.70353 3.892252 13.62643 4.033152 14.331517313.41742227 4.014911 14.0009 3.977392 12.5653 3.938557 13.3855806114.62985393 3.953391 12.078 3.844711 13.52256 3.953312 13.5331240913.46927946 4.032691 15.11607 4.025409 14.28126 4.196286 15.9628606411.98172337 3.993828 12.80223 4.132471 14.0596 4.057236 14.57236642

10

11

12

13

14

15

16

17

18

3.6 3.8 4 4.2 4.4x

y

10

11

12

13

14

15

16

17

18

3.6 3.8 4 4.2 4.4x

y

10

11

12

13

14

15

16

17

18

3.6 3.8 4 4.2 4.4x

y

10

11

12

13

14

15

16

17

18

3.6 3.8 4 4.2 4.4x

y

10

11

12

13

14

15

16

17

18

3.6 3.8 4 4.2 4.4x

y

Figure 4.1: The behavior of uncorrelated and correlated Gaussian random vari-ables. The leftmost figure shows uncorrelated variables, the middle two showpartial correlation and the two on the right show total correlation. The upper twoshow positive correlations while the lower two show negative correlations.

The x- and y-variables themselves can represent physically different quan-tities with different units of measure. Considered separately, each variablevaries randomly according to an underlying probability distribution. Treatedas two separate sample sets: xi, i = 1..N and yi, i = 1..N , two different sam-ple probability distributions could be created—one for each set. The samplemeans x and y and the sample variances s2

x and s2y could be calculated and

would be estimates for the true means µx and µy and true variances σ2x and

σ2y for each variable’s parent distribution px(x) and py(y). These sample and

parent distributions would be considered unconditional because they provideprobabilities without regard to the other variable’s values.

The first look at the variables as pairs is typically with a scatter plotin which the N values of (xi, yi) are represented as points in the xy-plane.Figure 4.1 shows scatter plots for five different 1000-point samples of pairsof random variables. For all five sample sets, the unconditional parent pdfs,px(x) and py(y), are exactly the same, namely Gaussian distributions havingthe following parameters: µx = 4, σx = 0.1 and µy = 14, σy = 1. Eventhough the unconditional pdfs are all the same, the scatter plots clearly

37

show that the joint probability distributions are different. The set on the leftis uncorrelated and the other four are correlated.

For the uncorrelated case on the left, the probability for a given y isindependent of the value of x. For example, if only those points within somenarrow slice in x, say around x = 4.1, are analyzed—thereby making themconditional on that value of x, the values of y for that slice have the sameprobabilities as for the unconditional case—for example, there are just asmany y-values above the mean of 14 as below it.

The other four cases show correlation. Selecting different slices in onevariable will give different conditional probabilities for the other variable. Inparticular, the conditional mean for one variable goes up or down as the slicemoves up or down in the other variable.

The top two plots show positively correlated variables. The bottom twoshow negatively correlated variables. For positive correlation, the conditionalmean of one variable increases for slices at increasing values for the othervariable. When one variable is above (or below) its mean, the other is morelikely to be above (or below) its mean. The product (x − µx)(y − µy) ispositive more often than it is negative and its expectation value is positive.For negative correlation, these dependencies reverse—the variables are morelikely to be on opposite sides of their means and the expectation value of(x− µx)(y − µy) is negative. For independent variables, (x− µx)(y − µy) isequally likely to have either sign and its expectation value is zero.

One measure of correlation is, in fact, the covariance—defined as theexpectation value

σxy = 〈(x− µx)(y − µy)〉 (4.15)

It is limited by the size of σx and σy. The Cauchy-Schwarz inequality statesthat it can vary between

− σxσy ≤ σxy ≤ σxσy (4.16)

Thus, σxy is also often written

σxy = ρσxσy (4.17)

where ρ, called the correlation coefficient, is between -1 and 1. Correlationcoefficients at the two extremes represent perfect correlation where x andy follow a linear relationship exactly. The correlation coefficients used togenerate Fig. 4.1 were 0, ±0.7 and ±1.


The sample covariance of a data set is defined by

sxy =1

N − 1

N∑i=1

(xi − x)(yi − y) (4.18)

and is an unbiased estimate of the true covariance σxy, converging to it inthe limit of infinite sample size. The inequality expressed by Eq. 4.16 is alsotrue for the sample standard deviations and the sample covariance with thesubstitution of sx, sy and sxy for σx, σy and σxy, respectively. The samplecorrelation coefficient r is then defined by sxy = rsxsy and also varies between-1 and 1.

It is informative to see one method for generating two correlated randomvariables having a given correlation coefficient. Let R1(0, 1) and R2(0, 1)represent two independent random samples from any distributions with amean of zero and standard deviation of one. It is not hard to show thatrandom variables x and y generated by

x = µx + σxR1(0, 1) (4.19)

y = µy + σy

(ρR1(0, 1) +

√1− ρ2R2(0, 1)

)will have means µx and µy, standard deviations σx and σy, and correlationcoefficient ρ.

Equation set 4.19 shows that while the deviations in x arise from R1 only,the deviations in y arise from one component proportional to R1 plus a sec-ond independent component proportional to R2. The required correlation isachieved by setting the relative amplitude of those two components in pro-portion to ρ and

√1− ρ2, respectively. The Correlated RV.xls spreadsheet

uses these equations to generate correlated random variables with Gaussianand uniform distributions for R1 and R2.

Of course, a sample correlation coefficient from a particular data set is arandom variable. Its probability distribution depends on the true correlationcoefficient and the sample size. This distribution is of interest, for example,when looking for evidence of any correlation—even a weak one—between twovariables. A sample correlation coefficient near zero may be consistent withthe assumption that the variables are uncorrelated. A value too far from zero,however, might be too improbable under this assumption thereby implyinga correlation exists. These kinds of probabilities are more commonly neededin biological and social sciences and will not be considered further.

http://www.phys.ufl.edu/courses/phy4803L/statistics/Correlated RV.xls

39

The covariance matrix

The covariance matrix denoted [σ2] describes all the variances and covari-ances possible between two or more variables. For a set of 3 variablesy = y1, y2, y3, it would be

[σ2y] =

σ11 σ12 σ13

σ21 σ22 σ23

σ31 σ32 σ33

(4.20)

The variances and covariances can now be defined by the equation

[σ2y ]ij = σij

= 〈(yi − µi) (yj − µj)〉 (4.21)

where µi = 〈yi〉 is the mean of yi. Note how Eq. 4.21 properly definesboth the off-diagonal elements as the covariances and the diagonal elementsσii = σ2

i = 〈(yi − µi)2〉 as the variances. It also shows that the covariancematrix is symmetric about the diagonal with [σ2

y]ij = [σ2y]ji. In linear algebra

notation, the entire matrix can be written as the expectation value of anouter product:

[σ2y ] =

⟨(y − µ)(yT − µT )

⟩(4.22)

If all variables are independent, the covariances are zero and the covari-ance matrix is diagonal and given by

[σ2y] =

σ21 0 0

0 σ22 0

0 0 σ23

(4.23)

When variables are independent, their joint probability distribution fol-lows the product rule—leading to Eq. 4.12, for example, when they are allGaussian. What replaces the product rule for variables that are known tobe dependent—that have a covariance matrix with off-diagonal elements?No simple expression exists for the general case. However, the Gaussianjoint probability distribution (for N variables) having means µi and havingvariances and covariances satisfying Eq. 4.21 would be expressed

P (y) =

∏Ni=1 ∆yi√

(2π)N∣∣[σ2

y]∣∣ exp

[−1

2

(yT − µT

) [σ2y

]−1(y − µ)

](4.24)


where∣∣[σ2

y]∣∣ is the determinant of [σ2

y ] and[σ2y

]−1is its inverse. Normal

vector-matrix multiplication rules apply so that the argument of the expo-nential is a scalar.

Note that Eq. 4.24 is the general form for a Gaussian joint pdf and reducesto the special case of Eq. 4.12 for independent variables, i.e., for a diagonalcovariance matrix.

Chapter 5

Measurement Model

This chapter presents an idealized model for measurements and discussesseveral issues related to the use of measurements in statistical analysis pro-cedures.

Random errors

A measurement y can be expressed as the sum of the mean of its probabilitydistribution µy and a random error δy that scatters individual measurementsabove or below the mean.

y = µy + δy (5.1)

For most measurements the true mean is unknown and thus the actual ran-dom error (or deviation) δy = y − µy cannot be determined. Wheneverpossible, however, the experimentalist should supply the standard deviationσy to set the scale for the size of typical deviations that can be expected.The variance should be consistent with its definition as the mean squareddeviation and the covariances should be consistent with Eq. 4.21. As willbe discussed in Chapter 7, the values, variances, and covariances of a dataset are often the only quantities that will affect the results derived from thatdata.

One method for estimating variances and covariances for a measurementset is to take a large sample of such sets while all experimental conditionsremain constant. The resulting sample variances and covariances might thenbe calculated and assumed to be the true variances and covariances for anyfuture measurement sets of the same kind. Often, less rigorous estimates are

41

42 CHAPTER 5. MEASUREMENT MODEL

made. The experimenter may assume covariances are zero (perhaps becausethe measurements are expected to be independent) and the standard devi-ations might be estimated from the instrument scales or other informationabout the measurement.

Obviously, estimates of σy are not going to be 100% certain. For now,however, all variances and covariances entering into an analysis will be as-sumed exactly known. Issues associated with uncertainty in σy will be putoff until Chapter 9.

Systematic errors

In contrast to random errors, which cause measurement values to differ ran-domly from the mean of the measurement’s parent distribution, systematicerrors cause the mean of the parent distribution to differ systematically (non-randomly) from the true physical quantity the mean is interpreted to repre-sent. With yt representing this true value and δsys the systematic error, thisrelationship can be expressed

µy = yt + δsys (5.2)

Sometimes δsys is constant as yt varies. In such cases, it is called an offsetor zeroing error and µy will always be above or below the true value by thesame amount. Sometimes δsys is proportional to yt and it is then referredto as a scaling or gain error. For scaling errors, µy will always be above orbelow the true value by the same fractional amount, e.g., always 10% high.In some cases, δsys is a combination of an offset and a scaling error. Or, δsys

might vary in some arbitrary manner.Combining Eqs. 5.1 and 5.2

y = yt + δy + δsys (5.3)

demonstrates that both random and systematic errors contribute to everymeasurement. Both can be made smaller but neither can ever be entirelyeliminated. Accuracy refers to the size of possible systematic errors whileprecision refers to the size of possible random errors.

If systematic errors are small enough, they might safely be neglected inthe first round of data analysis in which results and their uncertainties areobtained taking into account random error only. Then, one examines how the

43

systematic errors would change those results. If the changes are found to besmall compared to the uncertainties determined in the first round, systematicerrors have been demonstrated to be inconsequential. If systematic errorscould change results at a level comparable to or larger than the uncertaintiesdetermined in the first round, those changes are significant and should bereported separately.

The possibility of systematic error can sometimes be included early in theanalysis as additional model parameters. For example, a possible offset errorcan be incorporated by assuming that measured y-values should be correctedby subtracting off some constant y0 (the offset error) before substituting forthe true quantity yt in the theory: yt = y− y0. A scaling error can be incor-porated by substituting yt = κy. Often, the “instrument constants” like y0

or κ are of little physical interest in themselves. However, one still needs toknow how any change in their values will affect the other, more theoreticallyrelevant model parameters. Sometimes, the instrument constants have littleor no effect on the theory parameters. Other times, they become mathe-matically inseparable from the theory parameters and including them in theanalysis helps show how any uncertainty in their value contributes to theuncertainty in the theory parameters.

The procedures to minimize systematic errors are called calibrations, andtheir design requires careful consideration of the particular instrument and itsapplication. Calibrations involve modeling instrument behavior and deter-mining instrument constants and their uncertainties. Careful calibrations arecritical when the uncertainties in the instrument constants affect the uncer-tainties in the theory parameters. Chapter 8 includes a section on analyzingdata sets with a calibration.

Correlated data

Measurement values are usually statistically independent. One measure-ment’s random error is unlikely to be related to another’s. However, there areoccasions when correlated measurement errors can be expected. For exam-ple, simultaneously measured voltages are prone to correlated random errors,called common mode noise, because electrical interference (e.g., from powerlines) might be picked up in both measurements. Measurement timing is acommon aspect of correlation because temporal fluctuations are a commonmanifestation of random error. Measurements made closely in time—shorter

44 CHAPTER 5. MEASUREMENT MODEL

than the time scale for the fluctuations—are likely to have correlated randomerrors.

Correlations can also arise by pre-treating uncorrelated data. They canbe the result of a true calibration or an implied one. For example, thecurrent I in some circuit element might be determined from measurementsof the voltage V across a resistor R wired in series with that element. Anapplication of Ohm’s law gives I = V/R. Both the values of V and Rhave uncertainty and will be assumed to be statistically independent. Withonly a single measurement of V , the uncertainty in I would be correctlydetermined according to the rules for propagation of error discussed in thenext chapter. It will depend on V , R and their uncertainties. However, if theexperiment involves making repeated current measurements using the samecurrent-measuring resistor, the current values Ii will be correlated even if themeasured Vi are statistically independent. A novice mistake is to treat theI-values determined this way as if they were all statistically independent.

Correctly dealing with correlated input data is discussed in the next chap-ter and in Chapter 8. In general, however, the simplest solution is to workdirectly with uncorrelated (usually raw) data whenever possible. Thus, themeasurements above should be analyzed by substituting Vi/R for Ii so thatonly the independent variables, Vi and R, appear directly in the analysis.The quantity R would likely combine algebraically with other model param-eters and only at the latest possible stage of the analysis (after any regressionanalysis, for example) should its value and uncertainty be used.

Chapter 6

Propagation of Error

Direct transformations of random variables will be treated in this chapter.A direct transformation gives M output variables ak, k = 1...M as definedfrom N input variables yi, i = 1...N according to M given functions

ak = fk(y1, y2, ..., yN) (6.1)

If the full set of N input variables are acquired repeatedly, they wouldvary according to their joint probability distribution py(y). And as eachnew set is acquired, the transformation Eqs. 6.1 lead to a new set of M outputvariables, which then would vary according to their joint distribution pa(a).Determining the relationship between the input and output distributions isa common analysis task.

Propagation of error is a procedure for finding the variances and covari-ances that can be expected for the output variables given the variances andcovariances for the input variables. Its validity will require that the inputvariations cause only proportional variations in the output variables, i.e.,that the output variables follow a first-order Taylor series in the input vari-ables. Measurement errors are often small enough to satisfy this restrictionand thus propagation of error is one of the more commonly used data analy-sis procedures. However, it deals only with the input and output covariancematrices and so a couple of special cases will be examined first that dealdirectly with the joint probability distributions themselves.

45

46 CHAPTER 6. PROPAGATION OF ERROR

Complete solutions

A complete solution to the problem would determine pa(a) given py(y)and Eqs. 6.1. This calculation is difficult when several variables are involved.Even with one input and one output variable, extra care is needed if a givenvalue of a can occur for more than one value of y or if a and y are notinvertible.

One case that can be expressed succinctly occurs when there is a singleinput variable y and a single output variable a, which are single-valued in-vertible functions of one another a = f(y), y = g(a), and g(f(y)) = y forall y. In this special case, the probability distributions pa(a) and py(y) willsatisfy

pa(a) =py(y)∣∣∣dady ∣∣∣ (6.2)

where the derivative da/dy would be determined directly from f(y) or im-plicitly from g(a). The inverse function y = g(a) is used to eliminate any yfrom the final expression for pa(a).

Figure 6.1 shows a simple example where a =√y and the input dis-

tribution py(y) is a Gaussian of mean µy and variance σ2y. Assuming the

probability for negative y-values is negligible, a =√y and y = a2 are single-

valued and da/dy = 1/2√y = 1/2a. Equation 6.2 then gives

pa(a) =py(y = a2)∣∣∣dady ∣∣∣

=2a√2πσ2

y

e−(a2−µy)2/2σ2y

Equation 6.2 can be extended to two or more variables, but the complex-ity and restrictions multiply quickly. The simplest multidimensional case in-volves two pairs of variables having a unique, one-to-one relationship betweeneach pair—where a1 = f1(y1, y2), a2 = f2(y1, y2) and the inverse equationsy1 = g1(a1, a2), y2 = g2(a1, a2) are all single valued. In this case,

pa(a1, a2) =py(y1, y2)∣∣∣∣∣∣∂a1∂y1

∂a1∂y2

∂a2∂y1

∂a2∂y2

∣∣∣∣∣∣(6.3)

47

20 1 2 3 4 5 6

a

p (a)a

da 1=

3

4

1.5

2a

p (a)

da

a

ya =

ydy

da

2

1=

2

3

1

1.5

dy

da

dy

da

p (y)

ya =

1

2

0.5

1dy

dy

da

p (y)y

p (y)y

0

1

0 0.5 1 1.5 2 2.5 30

0.5

y

p (y)y

00 0.5 1 1.5 2 2.5 3

0

y

Figure 6.1: Single variable probability density function transformation. py(y) isthe Gaussian distribution plotted along the horizontal axis. The transformationis a =

√y and results in pa(a)—the skewed distribution plotted along the vertical

axis. The probabilities in the black shaded bins in each distribution (areas py(y) dyand pa(a) da) must be equal because samples in that y-bin would have square rootsthat would put then into the corresponding a-bin. The same argument holds forthe blue shaded bins. Thus, the ratio pa(a)/py(y) must everywhere be equal to|dy/da| determined by the functional relationship between a and y.

The matrix of derivatives inside the determinant on the right is the Jacobianfor the transformation from the set y to the set a. For unique (one-to-one) transformations, the Jacobian matrix is square and the derivativesdescribe the relative stretching or squeezing caused by the transformation.Again, inverse transformations may be needed to express the result as afunction of a1 and a2 only.

A useful example for this 2 × 2 case is the Box-Muller transformation,which creates Gaussian random variables from uniform ones. For this trans-formation y1 and y2 are independent and uniformly distributed on the inter-val [0, 1]. Thus p(y1, y2) = 1. If y1 and y2 are randomly chosen from this


distribution and a1 and a2 are calculated according to

a1 =√−2 ln y1 sin 2πy2 (6.4)

a2 =√−2 ln y1 cos 2πy2

then a1 and a2 will be independent, Gaussian-distributed random variables—each with a mean of zero and a variance of one. This fact follows afterdemonstrating that the Jacobian determinant is 2π/y1 and that Eqs. 6.4give y1 = exp[−(a2

1 + a22)/2]. Equation 6.3 then gives the joint Gaussian

distribution: p(a1, a2) = (1/2π) exp[−(a21 + a2

2)/2] describing independentGaussian random variables of zero mean and unit variance.

One other exact transformation is often useful. An integral solution ariseswhen adding two continuous random variables, say, x and y. Specific valuesfor the sum z = x + y can be made with different combinations of x andy. The general form for the pdf for the sum, pz(z), can be expressed as aconvolution of the pdfs px(x) and py(y) for x and y.

pz(z) =

∫ ∞−∞

px(x)py(z − x)dx (6.5)

Note that pz(z) is the product of the x-probability density at any x timesthe y-probability density at y = z− x (so that x+ y = z) integrated over allpossible x.

Two such cases are illustrated in the bottom two graphs in Fig. 7.1, whichshow, respectively, the frequency distributions for a random variable obtainedas the sum of two or three uniform random variables. The solid curves showthe parent distributions predicted by Eq. 6.5. Adding two uniform randomvariables produces a triangular or piecewise linear distribution. Adding athird results in a piecewise quadratic distribution.

Propagation of error

Propagation of error refers to a restricted case of transformations where theranges for the input variables are small enough that Eq. 6.1 for each ak wouldbe well represented by a first-order Taylor series expansion about the meansof the yi.

If there were no random error in any of the yi and they were all equal totheir true means µi, Eq. 6.1 should then give the true values of the calculated

49

ak, which will be denoted αk

αk = fk(µ1, µ2, ..., µN) (6.6)

Assuming each yi is always in the range µi ± 3σi, propagation of errorformulas will be valid if, over such ranges, the ak are accurately representedby a first-order Taylor expansion of each fk about the values µ1, µ2, ..., µN .

ak = fk(y1, y2, ..., yN)

= fk(µ1, µ2, ..., µN) +

∂fk∂y1

(y1 − µ1) +∂fk∂y2

(y2 − µ2) + ...+∂fk∂yN

(yN − µN)

= αk +N∑i=1

∂fk∂yi

(yi − µi) (6.7)

where Eq. 6.6 has been used in the final step. In linear algebra form, Eq. 6.7becomes the kth element of the vector equation:

a = α+ [Jay ](y − µ) (6.8)

where the M ×N Jacobian has elements given by

[Jay ]ki =∂fk∂yi

(6.9)

Equation 6.8 is often used in the form

∆a = [Jay ]∆y (6.10)

where ∆a = a−α and ∆y = y − µ are the deviations from the means.To see how the first-order Taylor expansion simplifies the calculations,

consider the case where there is only one calculated variable, a, derived fromone random variable, y, according to a given function, a = f(y). Figure 6.2shows the situation where the standard deviation σy is small enough that fory-values in the range of their distribution, a = f(y) is well approximated bya straight line—the first-order Taylor expansion of f(y) about µy.

a = f(µy) +df

dy(y − µy) (6.11)


0

10

20

30

40

50

60

70

80

90

0 2 4 6 8 10y

a

y -distribution:µ y = 6.0, σ y = 0.2

a -distribution:

µ a = f( µ y ), σ a = σ y df/dy |Taylor expansion

∆ y

∆ a

f(y)

∆ a / ∆ y = df/dy |µ y

µ y

Figure 6.2: Single variable propagation of errors. Only the behavior of f(y) overthe region µy ± 3σy affects the distribution in a.

where the derivative is evaluated at µy. With a linear relation between aand y, the distribution in y will lead to an identically-shaped distribution ina (or one reversed in shape if the slope is negative). With either slope sign,µa = f(µy) and σa = σy|df/dy| would hold.

Any second-order term in the Taylor expansion—proportional to (y −µy)

2—would warp the linear mapping between a and y. In Fig. 6.2, forexample, a = f(y) is always below the tangent line and thus the mean of thea-distribution will be slightly less than f(µy). Such higher order correctionsto the mean will be addressed at the end of this chapter, but they will beassumed small enough to neglect more generally.

When more than one random variable contributes to a calculated vari-able, the one-to-one relationship between the shape of the input and outputdistributions is lost. The distribution for the calculated variable becomes amultidimensional convolution of the shapes of the distributions for the con-tributing variables. The central limit theorem discussed in the next chapterstates that with enough contributing variables, the calculated quantities willbe Gaussian-distributed no matter what distributions govern the contributingvariables. But whether the distributions for the ak turn out to be Gaussian

51

or not and no matter what distributions govern the yi, if the first-order Tay-lor expansion is accurate over the likely range of the yi, propagation of errorwill accurately predict the most important parameters of p(a)—the means,variances, and covariances.

Using the first-order Taylor expansion, the means, variances, and covari-ances for the joint probability distribution for the ak can easily be determined.The mean for the variable ak is defined as 〈ak〉 and this expectation value iseasily evaluated from Eq. 6.7.

〈ak〉 =

⟨αk +

N∑i=1

∂fk∂yi

(yi − µi)

⟩

= αk +N∑i=1

∂fk∂yi〈(yi − µi)〉

= αk (6.12)

where the expectation values 〈yi − µi〉 = 0 have been used to eliminate allterms in the sum. This demonstrates the important result that the quantityak = fk(y1, y2, ..., yM) will be an unbiased estimate of the true αk. The vectorequation whose kth element is Eq. 6.12 becomes

〈a〉 = α (6.13)

Recall that elements of the covariance matrix for the ak are defined by:

[σ2a]kl = 〈(ak − αk)(al − αl)〉 (6.14)

and that the entire covariance matrix (Eq. 4.22) can be expressed

[σ2a] =

⟨(a−α)(a−α)T

⟩(6.15)

The kl element of [σ2a] is the covariance between ak and al and the kk element

(kth diagonal element) is the variance of ak. And substituting Eq. 6.8 (andits transpose) for a (and aT ) then gives:

[σ2a] =

⟨[Jay ](y − µ)(y − µ)T [Jay ]T

⟩(6.16)

for which a general element is

[σ2a]kl =

⟨N∑i=1

N∑j=1

∂fk∂yi

(yi − µi)(yj − µj)∂fl∂yj

⟩(6.17)


Rearranging the terms in the sum, factoring constants (the derivatives) outfrom the expectation values and using the definitions for the input variancesand covariances [σ2

y ]ij = 〈(yi − µi)(yj − µj)〉, Eq. 6.17 then becomes:

[σ2a]kl =

N∑i=1

N∑j=1

∂fk∂yi

∂fl∂yj〈(yi − µi)(yj − µj)〉

=N∑i=1

N∑j=1

∂fk∂yi

∂fl∂yj

[σ2y ]ij (6.18)

Proceeding from Eq. 6.16, the same logic means the expectation anglebrackets can be moved through the Jacobians giving

[σ2a] = [Jay ]

⟨(y − µ)(y − µ)T

⟩[Jay ]T

= [Jay ][σ2y][J

ay ]T (6.19)

where Eq. 4.22 was used in the final step. Equations 6.18 and 6.19 givethe covariance matrix [σ2

a] associated with the ak in terms of the covariancematrix [σ2

y ] associated with the yi and the Jacobian describing the relation-ship between them. The partial derivatives in [Jay ] are simply constants thatshould be evaluated at the expansion point, µ1, µ2, ...µN . However, as thetrue means are typically unknown, the derivatives will have to be evaluatedat the measured point y1, y2, ...yN instead. This difference should not signifi-cantly affect the calculations as all fk are supposed to be linear over a rangeof several σi about each µi and thus the derivatives must be nearly constantfor any yi in that range.

Equations 6.18 and 6.19 are the same general formula for propagation oferror. Various formulas derived from them are often provided to treat lessgeneral cases. One such formula is simply a rewrite for a diagonal element ofthe covariance matrix. [σ2

a]kk, the variance of ak, denoted σ2ak, is especially

important because its square root is the standard deviation or uncertaintyin ak.

σ2ak =

N∑i=1

(∂fk∂yi

)2

σ2i + 2

N−1∑i=1

N∑j=i+1

∂fk∂yi

∂fk∂yj

σij (6.20)

The factor of 2 in the second sum arises because Eq. 6.18 would produce twoequivalent cross terms while the sum above includes each cross term only

53

once. The first sum includes all terms involving the variances of the yi, andthe second sum—a sum over all pairs i, j where j 6= i—includes all termsinvolving the covariances between the yi. Note that whenever correlatedvariables are used together as input to a calculation, the uncertainty in thecalculated quantity will have to take into account the input covariances viathis equation.

Now consider the case for the variance σ2ak when all yi are independent,

i.e., when their covariances σij, i 6= j are all zero. In this case, the formulasimplifies to

σ2ak =

N∑i=1

(∂fk∂yi

)2

σ2i (6.21)

This is the most common propagation of error formula, but it should beunderstood to apply only to uncorrelated yi.

Special conditions can lead to uncorrelated output variables. In gen-eral, however, any two output variables will be correlated (have non-zerooff-diagonal [σ2

a]kl) whether the input variables are correlated or not. For thespecial case where all yi are independent, Eq. 6.18 simplifies to

[σ2a]kl =

N∑i=1

∂fk∂yi

∂fl∂yi

σ2i (6.22)

which is not likely to be zero without fortuitous cancellations.

Exercise 3 Simulate 1000 pairs of simultaneous measurements of a currentI through a circuit element and the voltage V across it. Assume that thecurrent and voltage measurements are independent. Take I-values from aGaussian with a mean of 76 mA and a standard deviation of 3 mA, and takeV -values from a Gaussian with a mean of 12.2 V and a standard deviationof 0.2 V.

Calculate sample values for the element’s resistance R = V/I and powerdissipated P = IV for each pair of I and V and submit a scatter plot for the1000 R,P sample values. Calculate the predicted means (Eq. 6.6) and vari-ances (Eq. 6.21) for the R and P distributions and calculate their predictedcovariance (Eq. 6.22). Evaluate the sample means (Eq. 2.14) for the 1000 Rand P values, their sample variances (Eq. 2.23), and the sample covariancebetween R and P (Eq. 4.18).

To quantitatively compare the predictions with the sample values deter-mined above requires the probability distributions for sample means, sample


variances, and sample covariances. Some of these distributions will be dis-cussed later. Their standard deviations will be given here as an aid to thecomparison. The standard deviation of the mean of N sample resistances ispredicted to be σR = σR/

√N . Similarly for the power. Check if your two

sample means agree with predictions at the 95% or two-sigma level. A frac-tional standard deviation is the standard deviation of a quantity divided bythe mean of that quantity. The fractional standard deviation of the two sam-ple variances are predicted to be

√2/(N − 1). For N = 1000, this is about

4.5%, which is also roughly the fractional standard deviation of the samplecovariance in this case. So check if your sample variances and covarianceagree with predictions at the 9% or two-sigma level.

Take the 1000-point samples repeatedly while keeping an eye out for howoften R and P are above and below their predicted means. P should behave asexpected—equally likely to be above or below the predicted mean. However, Ris much more likely to be above the predicted mean than below it. The reasonfor this behavior is the nonlinear dependence of R on I as discussed next.

Correction to the mean

To check how nonlinearities will affect the mean, Eq. 6.12 is re-derived—thistime starting from a second-order Taylor expansion.

For a function a = f(y1, y2) of two random variables y1 and y2, thesecond-order Taylor series expansion about the means of y1 and y2 becomes

a = f(y1, y2)

= f(µ1, µ2) +∂f

∂y1

(y1 − µ1) +∂f

∂y2

(y2 − µ2) (6.23)

+1

2!

(∂2f

∂y21

(y1 − µ1)2 +∂2f

∂y22

(y2 − µ2)2 + 2∂2f

∂y1∂y2

(y1 − µ1)(y2 − µ2)

)Taking the expectation value of both sides of this equation noting that 〈a〉 =µa, 〈(y1 − µ1)〉 = 0, 〈(y2 − µ2)〉 = 0, 〈(y1 − µ1)2〉 = σ2

1, 〈(y2 − µ2)2〉 = σ22,

and 〈(y1 − µ1)(y2 − µ2)〉 = σ12, gives

µa = f(µ1, µ2) +1

2!

(∂2f

∂y21

σ21 +

∂2f

∂y22

σ22 + 2

∂2f

∂y1∂y2

σ12

)(6.24)

For the power values of Exercise 3, the three terms in parentheses areall zero—the first two because the second derivatives are zero, the third

55

because the covariance between I and V is zero. For the resistance values,however, the term in parentheses is non-zero (due to the (1/2!)(∂2R/∂I2)σ2

I

term only) and adds 0.25 Ω to µR—a relatively insignificant shift comparedto the standard deviation of the distribution for R in the exercise: σR ≈7 Ω. However, it is significant compared to the standard deviation for thethousand-point average R where σR ≈ 0.2 Ω is also small.

For the more general case in which a = f(y1, y2, ...yN) is a function of Nvariables, the second-order Taylor expansion becomes

a = f(µ1, µ2, ...µN) + [Jay ]∆y +1

2!

(∆yT [Ha

yy]∆y)

(6.25)

where [Jay ] is now a 1×N matrix (row vector) with the ith element given by∂f/∂yi and [Ha

yy], called the Hessian matrix, is the N ×N symmetric matrixof second derivatives

[Hayy]ij =

∂2f

∂yi∂yj(6.26)

Taking expectation values of both sides then gives a result that can beexpressed:

µa = f(µ1, µ2, ...µN) +1

2!

N∑i=1

N∑j=1

[Hayy]ij[σ

2y]ij (6.27)

Note that the sum of products involved in the second term is not equivalentto any matrix multiplication of the two matrices. [Ha

yy] and [σ2y ] have the

same N ×N shape and the sum in Eq. 6.27 is simply the sum of the elementby element product of the two.

Chapter 7

Central Limit Theorem

Proofs of the central limit theorem are typically restricted to sums or meansfrom a sample of random variables governed by an arbitrary distribution offinite variance. The theorem states that the distribution for the sum or meantends toward a Gaussian as the sample size N increases. Proofs are oftenbased on the fact that the probability distribution for the sum is then an N -fold convolution of the probability distributions for each variable in the sum.As the number of convolutions (sample size) tends to infinity, the result tendsto a Gaussian no matter what distribution governs each individual sample.

The following statement gives the essence of the central limit theorem ina way that is more relevant to experimental data analysis.

As the number of random variables contributing to a quantity in-creases, that quantity’s probability distribution will tend toward aGaussian no matter what distributions govern the contributions.

Here, the central limit theorem is applied to any quantity obtained from sam-ples of random variables. For example, it would apply to fitting parametersobtained from sufficiently large data sets.

How closely the actual distribution will agree with a Gaussian dependsnot only on the number of contributing variables, but also the shapes of theirdistributions and how each variable contributes to the result. Keep in mindthat the actual distribution for the results may have very nearly Gaussianprobabilities within one or two standard deviations of the mean, but havesignificant differences further from the mean. It typically takes more variablesto get agreement with a Gaussian in the tails of the distribution.

57

58 CHAPTER 7. CENTRAL LIMIT THEOREM

Going hand-in-hand with the central limit theorem are procedures suchas propagation of error (last chapter) and regression analysis (next chapter)which can predict the covariance matrix for the output variables given thecovariance matrix for the input variables. While both procedures require thevalidity of a first-order Taylor expansion, this condition is either guaranteedfor any size measurement errors (when the relationships between the inputand output variables are linear) or can be assured by getting measurementerrors small enough (nonlinear relationships).

For all simulations discussed in this chapter, for example, the outputvariables will be exact linear functions of the input variables. Because thefirst-order Taylor expansion is exact, the formulas for the output variableswill be unbiased and the calculated covariance matrix will be exact for anysize measurement errors having any distribution and whether the distributionfor the output has converged to a Gaussian or not. The spirit of the centrallimit theorem is then simply that as the number of input variables grows, thejoint probability distribution for the output variables should tend toward aGaussian with the calculated covariance matrix no matter what distributionsgovern the input variables.

Single variable central limit theorem

A typical example for the central limit theorem refers to a single outputvariable Y determined as the straight sum of any number N of independentrandom variables yi.

Y =N∑i=1

yi (7.1)

where each yi comes from a distribution with a mean µi and a variance σ2i

but is otherwise arbitrary. Propagation of error then gives the mean of thesum (Eq. 6.6)

µY =N∑i=1

µi (7.2)

and it gives the variance of the sum (Eq. 6.21)

σ2Y =

N∑i=1

σ2i (7.3)

590.549218 0.708879 0.102265 0.478931 0.912746 0.569894 0.701919 0.185155 0.338113 0.999189 0.196224 0.579505 6.322039 1.258097 1.360362 1.839294 4.3 4.2 1520.900447 0.379481 0.684273 0.657964 0.91288 0.329308 0.214574 0.177258 0.038428 0.463179 0.893452 0.860094 6.511339 1.279928 1.964201 2.622165 4.5 4.4 2570.636721 0.496792 0.201766 0.913622 0.187161 0.082754 0.458753 0.198628 0.641251 0.604067 0.545394 0.281607 5.248516 1.133514 1.33528 2.248901 4.7 4.6 3080.783856 0.940302 0.623901 0.939436 0.393966 0.866743 0.461955 0.505177 0.491725 0.625434 0.982783 0.220005 7.835285 1.724158 2.348058 3.287495 4.9 4.8 4100.791324 0.031858 0.888055 0.539481 0.063737 0.130715 0.500723 0.349786 0.202488 0.770527 0.751446 0.741094 5.761234 0.823182 1.711238 2.250719 5.1 5 4950.565465 0.745721 0.827416 0.792326 0.378362 0.150613 0.105477 0.064251 0.650948 0.685815 0.498134 0.650273 6.114799 1.311185 2.138601 2.930928 5.3 5.2 5850.790371 0.077466 0.615175 0.851586 0.190441 0.545497 0.000265 0.116597 0.483028 0.295689 0.021712 0.671833 4.65966 0.867837 1.483012 2.334598 5.5 5.4 6680.502615 0.901681 0.943388 0.193467 0.027664 0.238257 0.81187 0.7267 0.221031 0.213397 0.981614 0.049098 5.810782 1.404296 2.347684 2.541151 5.7 5.6 7520.488114 0.047857 0.548321 0.210618 0.784457 0.21819 0.257287 0.949337 0.55585 0.221175 0.580078 0.521725 5.383009 0.535971 1.084292 1.294909 5.9 5.8 7600.813964 0.171189 0.900846 0.318111 0.083493 0.737561 0.22078 0.654872 0.374203 0.846808 0.681197 0.312652 6.115675 0.985153 1.885999 2.20411 6.1 6 7760.695798 0.950467 0.061754 0.456724 0.397045 0.790515 0.624566 0.438515 0.424299 0.399616 0.57021 0.139906 5.949413 1.646264 1.708018 2.164742 6.3 6.2 7710.530938 0.269163 0.930554 0.143915 0.922634 0.64147 0.046875 0.583099 0.347708 0.963094 0.518927 0.304527 6.202905 0.800102 1.730656 1.874571 6.5 6.4 6670.262508 0.576702 0.097052 0.373326 0.100995 0.455908 0.387245 0.222799 0.769224 0.416431 0.80361 0.522063 4.987863 0.83921 0.936262 1.309588 6.7 6.6 6330.230014 0.179246 0.175126 0.189305 0.069349 0.635692 0.926501 0.5923 0.420381 0.964606 0.364274 0.642138 5.388932 0.40926 0.584386 0.773692 6.9 6.8 5780.287908 0.611184 0.690379 0.656041 0.296672 0.911632 0.523653 0.221596 0.228111 0.798281 0.818215 0.035917 6.079587 0.899092 1.589471 2.245512 7.1 7 5020.489897 0.952009 0.62866 0.10842 0.887067 0.608773 0.539598 0.835585 0.715264 0.989054 0.085871 0.640542 7.480739 1.441905 2.070566 2.178986 7.3 7.2 3770.283398 0.216401 0.232609 0.611532 0.177156 0.229104 0.331355 0.742153 0.468434 0.042314 0.565182 0.650059 4.549697 0.4998 0.732408 1.34394 7.5 7.4 3200.740429 0.215858 0.117261 0.989183 0.747717 0.871813 0.514121 0.911512 0.078318 0.726083 0.727885 0.687162 7.327343 0.956288 1.073549 2.062732 7.7 7.6 2070.290916 0.569326 0.819711 0.307307 0.210703 0.721532 0.440198 0.782941 0.69578 0.487948 0.703377 0.630738 6.660477 0.860242 1.679953 1.98726 7.9 7.8 1970.793434 0.89496 0.506501 0.248373 0.067739 0.084063 0.867874 0.887658 0.781462 0.755844 0.163627 0.232406 6.28394 1.688394 2.194895 2.443268 8.1 8 1140.952326 0.493439 0.873078 0.916642 0.870725 0.564618 0.624253 0.779977 0.283869 0.199357 0.236446 0.022912 6.817642 1.445765 2.318843 3.235485 8.3 8.2 740.113301 0.540413 0.456229 0.567699 0.776157 0.648 0.532054 0.920637 0.86246 0.603053 0.045763 0.137188 6.202955 0.653715 1.109944 1.677643 8.5 8.4 600.787755 0.767431 0.461881 0.691386 0.59639 0.795566 0.028383 0.916187 0.256697 0.184138 0.944636 0.806114 7.236564 1.555186 2.017067 2.708453 8.7 8.6 250.547368 0.569024 0.608583 0.01587 0.98106 0.886035 0.061066 0.056535 0.23433 0.32478 0.202644 0.863607 5.350903 1.116392 1.724975 1.740846 8.9 8.8 210.317079 0.355103 0.700674 0.255511 0.861135 0.373863 0.516838 0.440254 0.624917 0.131475 0.206243 0.419037 5.202131 0.672182 1.372856 1.628367 9.1 9 70.330256 0.292144 0.753761 0.513901 0.08599 0.941502 0.803729 0.425173 0.914495 0.989373 0.030332 0.813637 6.894294 0.6224 1.376161 1.890062 9.3 9.2 10.440832 0.89494 0.457497 0.645762 0.404615 0.954943 0.022456 0.780166 0.380203 0.205237 0.453215 0.551128 6.190993 1.335771 1.793268 2.43903 9.5 9.4 10.272339 0.147759 0.559299 0.686252 0.386761 0.715565 0.205012 0.766051 0.638571 0.243024 0.877909 0.126969 5.62551 0.420098 0.979397 1.665649 9.7 9.6 30.560327 0.385435 0.893224 0.925702 0.777227 0.645363 0.051221 0.451222 0.753866 0.718153 0.086213 0.72793 6.975884 0.945763 1.838987 2.764689 10.687854 0.650547 0.305896 0.045939 0.180787 0.051162 0.306349 0.817967 0.534278 0.035062 0.139301 0.859597 4.61474 1.3384 1.644297 1.6902360.52801 0.905161 0.574728 0.50898 0.331898 0.662768 0.425453 0.742936 0.943691 0.900881 0.435033 0.848159 7.807697 1.433172 2.0079 2.51688

0.826463 0.249625 0.953867 0.296567 0.56232 0.331523 0.873521 0.175551 0.660901 0.057462 0.548923 0.354561 5.891283 1.076088 2.029955 2.3265220.666087 0.144341 0.802329 0.659551 0.934168 0.486339 0.840911 0.108954 0.548591 0.982979 0.358135 0.883451 7.415836 0.810428 1.612757 2.2723080.456775 0.661871 0.041678 0.824903 0.141853 0.709929 0.71597 0.482016 0.354303 0.240062 0.876641 0.70291 6.208912 1.118646 1.160324 1.9852270.164108 0.129458 0.914809 0.283582 0.633719 0.554772 0.571189 0.120397 0.170341 0.397172 0.599348 0.158298 4.697192 0.293566 1.208375 1.4919570.746781 0.218335 0.196656 0.642305 0.772974 0.200859 0.453262 0.233451 0.701615 0.964555 0.50058 0.369869 6.001241 0.965117 1.161773 1.8040770.67459 0.412042 0.040475 0.664754 0.221163 0.621402 0.414368 0.076678 0.090105 0.332827 0.818305 0.518089 4.884798 1.086632 1.127107 1.791861

0.782712 0.467707 0.612215 0.621496 0.693173 0.018891 0.430089 0.112078 0.226736 0.924629 0.97928 0.132462 6.001467 1.250419 1.862634 2.484130.295866 0.264718 0.832344 0.737674 0.106702 0.929181 0.374736 0.23172 0.38501 0.78022 0.210237 0.846816 5.995225 0.560585 1.392929 2.1306030.719456 0.222856 0.396694 0.348059 0.387071 0.06009 0.662559 0.342465 0.050978 0.408776 0.881885 0.876975 5.357864 0.942312 1.339006 1.687065

sum of 2 uniform (0,1)

0

200

400

600

800

1000

1200

-0.5 0 0.5 1 1.5 2 2.5

value

Fre

qu

ency


0

100

200

300

400

500

600

700

800

900

0 0.5 1 1.5 2 2.5 3

value

Fre

qu

ency


0

100

200

300

400

500

600

700

800

900

2 3 4 5 6 7 8 9 10

value

Fre

qu

ency


0

100

200

300

400

500

600

700

800

-0.5 0 0.5 1 1.5

value

Fre

qu

ency

Figure 7.1: Sample frequency distributions for 10,000 samples of the sum of 1, 2,3 and 12 uniformly distributed random numbers (counterclockwise from top left).The solid lines are exact predictions based on convolutions. The dotted lines givethe prediction for a Gaussian random variable with the same mean and variance.

For a single output variable, there are no covariances and according tothe central limit theorem, the probability distribution in the limit of large Nis a single-variable Gaussian distribution having the form of Eq. 3.1 with amean and variance as given above.

Simulations of this kind are demonstrated in Fig. 7.1 where four samplefrequency distributions are shown. Counterclockwise from top left, the sam-ples for each histogram were created as the sum of N = 1, 2, 3, or 12 randomnumbers uniformly distributed on the interval from [0, 1]. This uniform dis-tribution has a mean of 1/2 and a variance of 1/12 and so propagation oferror implies the resulting Y -distributions have means of 0.5, 1.0, 1.5, and 6,and variances of 1/12, 2/12, 3/12, and 1, respectively.

For the first three cases, the true parent distribution is given as the solidline—the original uniform distribution (top left, N = 1) or ones obtainedfrom it by one or two convolutions according to Eq. 6.5 (bottom, N = 2, 3).

In all four cases, a parent Gaussian distribution with a mean and varianceequal to that of the true distribution is shown as a dotted curve. Note howthe approach to a Gaussian shape is becoming evident after summing as


few as three uniform random variables. But keep in mind that the truedistribution for this three-variable sum extends only from 0 to 3 and haszero probability density outside this±3-sigma range, whereas a true Gaussianrandom variable would have around 0.3% probability of being there.

Multidimensional central limit theorem

The simulations for a multidimensional random variable will come from acommon regression analysis task—fitting a slope m and intercept b to N datapoints (xi, yi), i = 1...N , all of which are predicted to lie along a straightline: y = mx+ b.

The simulated data are generated according to yi = 3xi + 15 + error.That is, for some fixed set of xi, the yi are created as the sum of a true meanµi = 3xi + 15, and a zero-mean random variable—the measurement error—created using a random number generator. Choosing the error to come froma distribution with a mean of zero is equivalent to assuming no systematicerrors.

To demonstrate the central limit theorem, the random error for the yi areall drawn from a uniform distribution in the interval from -5 to 5, which hasa variance σ2

y = 102/12 = 100/12. The equally-weighted regression formula,Eq. 8.53, is used to get the least-squares estimates for m and b for each dataset. With σ2

y = 100/12, Eq. 8.52 will give the true parameter covariancematrix in all cases.

Figures 7.2 and 7.3 summarize the results from two simulations. For eachsimulation, 10,000 data sets were created and fit to a straight line—producing10,000 pairs of (m, b) values, one for each set. Because of the added randomerror, the fitted slopes and intercepts are expected to deviate from the truevalues of 3 and 15, respectively. Because each data set has different randomerror, each set produces a different value for m and b; they are a randomvariable pair.

Each figure shows one sample data set (top right) as well as a scatter plot(1500 points, bottom left) and frequency distributions for both m (top left)and b (bottom right) for all 10,000 data sets. For Fig. 7.3, every data set has20 data points (for integer x-values from 1 to 20). Data sets for Fig. 7.2 haveonly three data points (for integer x-values of 1, 10, and 20).

Qualitatively, the negative correlation indicated in both scatter plots canbe understood by considering the top right graphs in the figures showing a

61

Figure 7.2: 3-point data sets. Top right: One sample data set with its fit-ted line. Top left and bottom right: Frequency distributions for the slopes andintercepts. Bottom left: Scatter plot for the slopes and intercepts.

typical data set and fitted line. Simultaneously high values for both the slopeand intercept would produce a fitted line lying entirely above the true linewith the deviation between them growing as x increases. Random errors areunlikely to reproduce this behavior by chance. A similar argument wouldapply when the slope and intercept are simultaneously low. However, with ahigh slope and a low intercept, or vice versa, the fitted line would cross thetrue line and the deviations between them would never get as large. Datasets showing this behavior are not as unlikely and lead to the negativelycorrelated scatter plots.

In both figures, the solid lines are associated with predictions based onthe assumption that m and b vary according to the correlated Gaussiandistribution of Eq. 4.24 with µm = 3 and µb = 15 and having a covariancematrix as given by the regression formula, Eq. 8.52. The solid vertical andhorizontal guidelines are positioned ±σm and ±σb about the predicted meansof 3 and 15, respectively, and would be expected to include 68% of the points


Figure 7.3: 20-point data sets. Top right: One sample data set with itsfitted line. Top left and bottom right: Frequency distributions for the slopes andintercepts. Bottom left: Scatter plot for the slopes and intercepts. Solid (pink)lines provide Gaussian predictions.

if m and b followed a Gaussian distribution. The ellipse in each scatter plotis an iso-probability contour; all values for m and b on the ellipse wouldmake the argument of the exponential in Eq. 4.24 equal to −1/2. That is,those m, b values would occur with a probability density that is down bya factor of exp(−1/2) from the peak density at m = 3 and b = 15. A twodimensional integration of a Gaussian joint pdf gives the probability enclosedby the ellipse and evaluates to 1 − exp(−1/2) = 0.39. Thus, if the resultsfollowed the Gaussian predictions, 39% should fall inside the ellipse.

Both the 3-point data sets and the 20-point data sets show non-Gaussiandistributions for m and b. Compared to a Gaussian (overlying solid curve),the histograms have fewer m- and b-values in both the peak and tails of thedistribution and more in the shoulders. The effect is quite pronounced forthe 3-point data sets. However, for the 20-point data sets, the effect is subtle,indicating m and b follow a near-Gaussian distribution and have converged

63

toward the conditions for the central limit theorem.Basically, the central limit theorem guarantees that if enough data go into

determining a set of results and their covariance matrix, then the distributionfor those results will be approximately Gaussian with that covariance matrix.Of course, if the shape of the output distribution is very important, it wouldprobably be wise to seek additional verification (typically via a Monte Carlosimulation) and not simply assume a Gaussian based on the central limittheorem.

Chapter 8

Regression Analysis

Regression analysis refers to techniques involving data sets with one or moredependent variables measured as a function of one or more independent vari-ables with the goal to compare that data with a theoretical model and extractmodel parameters and their uncertainties. A common example is a straight-line fit to a set of measured (xi, yi) data points predicted to obey a linearrelationship: y = mx + b. Regression analysis then provides an estimate ofthe slope m and the intercept b based on the data.

The dependent variables will be denoted yi, i = 1...N and each yi willbe modeled as an independent sample from either a Gaussian, binomial, orPoisson probability distribution. The independent variables (the xi in thestraight line fit) can also be random variables, but this possibility will onlybe considered after treating the simpler case where the independent variablesare known exactly.

With regard to statistical dependence, the yi in a regression analysisare typically assumed to be independent with no correlations in their errordistributions. If correlations exist, they would have to be taken into account.Possible treatments are addressed later in this chapter, but for now, all yiwill be considered statistically independent.

The model is that the mean of the distribution for each yi depends onthe independent variables associated with that data point through a fittingfunction with M unknown theory parameters αk, k = 1...M

µi = Fi(α1, α2, ..., αM) (8.1)

where the subscript i in Fi(α) denotes the independent variables. Equa-tion 8.1 is intentionally written without any explicit independent variables.

65

66 CHAPTER 8. REGRESSION ANALYSIS

In a regression analysis they simply distinguish the point by point dependen-cies of the µi on the αk. As far as regression is concerned, the fitting functionis simply N different equations for the µi in terms of the M values for theαk.

Equation 8.1 is written as defining the true means µi in terms of the truetheory parameters αk. Except in simulations, both are usually unknown.The version

yfiti = Fi(a1, a2, ...aM) (8.2)

gives the corresponding quantities determined by the fitting process. Theyfiti are the estimates of µi obtained via the same fitting function as Eq. 8.1

but with each αk replaced by a corresponding estimate ak as determined bythe fit. Even though the ak depend on the yi only indirectly via the fittingprocess, they are nonetheless associated with one particular data set and arerandom variables—the fitted ak would change if the yi were resampled. Incontrast, the sought-after true parameter values αk and true means µi areconstants and are not random variables.

Principle of Maximum Likelihood

The estimates ak are obtained according to the principle of maximum likeli-hood in that they are chosen so as to maximize the probability of the dataset from which they are derived. As a result, should the experiment andtheory be deemed incompatible, they will be incompatible regardless of theparameter values. Any other values will only make the data less likely. Anyparameter determined by this principle is called a maximum likelihood esti-mate or MLE.

With the yi statistically independent, the product rule applies. Variablesgoverned by Gaussian, binomial, or Poisson distributions have joint proba-bilities given by Eqs. 4.12, 4.13 or 4.14, respectively. These give the actualprobability for the data set—larger for data sets that are more likely andsmaller for sets that are less likely. This product probability becomes depen-dent on the ak when the µi are replaced with the estimates yfit

i as expressedthrough Eq. 8.2. The ak that produce the yfit

i that produce the largest pos-sible joint probability become the MLE’s for the αk. The yfit

i are commonlycalled the best fit (to the input yi) and thus the ak are also called the best-fitparameters.

67

Conditions for a global maximum can be complicated. For a continu-ous function f(a1, a2, ...), conditions for a local maximum with respect to allvariables is that ∂f/∂ak = 0 for all k. When f represents a joint probabil-ity, satisfying this condition usually (but not always) finds a global maxi-mum. To find the maximum, a useful trick is to first take the natural loga-rithm of f(a1, a2, ...) and maximize that. This works because ∂(ln f)/∂ak =(1/f) ∂f/∂ak and, because f will be non-zero and finite, where one derivativeis zero, so is the other.

The natural logarithm of the joint probability is called the log likelihoodL and thus defined by

L = lnP (8.3)

Using L simplifies the math because it transforms the products into sumswhich are easier to differentiate. The actual value of L is unimportant. Itsonly use is in maximizing the probability with respect to the fitting param-eters by imposing the condition

∂L∂ak

= 0 (8.4)

for each ak.Note the awkward use of the symbol ak for variables (whose values affect

L) and for the MLE values (where L is maximized and the ak satisfy Eq. 8.4).In general, ak—and the yfit

i they produce—will refer to MLE values. Wherederivatives are taken, they are normally evaluated at the MLE values and sothere is not much chance of confusion.

As mentioned, the dependence of L on the ak arises when the yfiti are

used as estimates of the µi in the joint probability. Because any term thatis independent of yfit

i will automatically have a zero derivative with respectto all ak, only terms that depend on µi need to be kept when evaluating Lfrom Eqs. 4.12-4.14

For a Gaussian data set where P (y) is given by Eq. 4.12, Eq. 8.3 gives(after dropping terms that are independent of µi and substituting yfit

i for µi)

L = −1

2

N∑i=1

(yi − yfit

i

)2

σ2i

(8.5)

For a binomial data set (Eq. 4.13) L becomes:

L =N∑i=1

yi ln yfiti + (Ni − yi) ln(Ni − yfit

i ) (8.6)


And for a Poisson data set (Eq. 4.14)

L =N∑i=1

yi ln yfiti − yfit

i (8.7)

Since L depends only indirectly on the ak through the yfiti , the derivatives

in Eq. 8.4 are evaluated according to the chain rule

∂L∂ak

=N∑i=1

∂L∂yfit

i

∂yfiti

∂ak(8.8)

where the partial derivatives ∂yfiti /∂ak are determined by the specific form

for the fitting function, Eq. 8.2.For Gaussian-distributed yi, where L is given by Eq. 8.5, Eq. 8.4 becomes

0 =∂

∂ak

[−1

2

N∑i=1

(yi − yfiti )2

σ2i

]

=N∑i=1

yi − yfiti

σ2i

∂yfiti

∂ak(8.9)

For binomial-distributed yi (Eq. 8.6), Eq. 8.4 becomes

0 =∂

∂ak

[N∑i=1

yi ln yfiti + (Ni − yi) ln(Ni − yfit

i )

]

=N∑i=1

(yiyfiti

− Ni − yiNi − yfit

i

)∂yfit

i

∂ak

=N∑i=1

yi − yfiti

yfiti (1− yfit

i /Ni)∂yfit

i

∂ak(8.10)

Remarkably, the numerator and the denominator in each term above are thesame as those of Eq. 8.9. The numerators are obviously the same and thedenominator in Eq. 8.10 is, in fact, the variance of a binomial distribution(Eq. 3.6) for Ni Bernoulli trials having a mean of yfit

i —and after all, the meanis exactly what yfit

i is estimating. Thus, with the understanding that

σ2i = yfit

i

(1− yfit

i

Ni

)(8.11)

69

when fitting binomial-distributed yi, Eq. 8.10 is exactly the same as Eq. 8.9.For Poisson-distributed yi (Eq. 8.7), Eq. 8.4 becomes

0 =∂

∂ak

[N∑i=1

(yi ln y

fiti − yfit

i

)]

=N∑i=1

(yiyfiti

− 1

)∂yfit

i

∂ak

=N∑i=1

yi − yfiti

yfiti

∂yfiti

∂ak(8.12)

Once again, the numerator is the same as in Eq. 8.9 and the denominator isthe variance of a Poisson distribution (Eq. 3.8) having a mean of yfit

i . Thuswith the understanding

σ2i = yfit

i (8.13)

when fitting Poisson-distributed yi, Eq. 8.12 is also the same as Eq. 8.9.Thus, with the understanding that the σ2

i are appropriately chosen forthe yi at hand, Eqs. 8.9, 8.10 and 8.12 can all be rewritten in the form

N∑i=1

yiσ2i

∂yfiti

∂ak=

N∑i=1

yfiti

σ2i

∂yfiti

∂ak(8.14)

This is a set of M simultaneous equations which must be satisfied by the Munknown ak in order for that set to be the MLEs for the αk.

Regression analysis must find the ak that maximize L or, equivalently,that satisfy Eq. 8.14. The analysis must also provide the covariance matrix(uncertainties) for those ak. This chapter will provide the formulas andalgorithms to solve Eq. 8.14. Chapter 10 discusses the use of Excel’s Solverprogram to directly maximize L.

Iteratively Reweighted Regression

To solve for the ak with binomial- or Poisson-distributed yi, one could goback to Eqs. 8.10 or 8.12. Solutions can be found for a few specific fittingfunctions such as the constant function. For general fitting functions, how-ever, these equations cannot be solved directly. On the other hand, solutions


to Eq. 8.14 can be found for almost any fitting function when the σ2i are spec-

ified constants that are independent of the yfiti as for Gaussian-distributed

yi. Procedures for finding these Gaussian solutions will be described shortly,but before presenting them, it is worthwhile demonstrating how they can beadapted to cases where the σ2

i depend on the yfiti .

The adaptation is called iteratively reweighted regression and would ap-ply to any yi governed by distributions for which the maximum likelihoodcondition can be put in the form of Eq. 8.14 with an appropriate σ2

i . Thisform has been demonstrated for Poisson and binomial distributions and italso applies to several other probability distributions.

Iteratively reweighted regression begins with an estimate of the fittingparameters and the resulting yfit

i , which are then used to get estimates forthe σ2

i , (σ2i = yfit

i , e.g., for Poisson yi). The σ2i are then fixed at these initial

estimates and Eq. 8.14 is solved for the ak. This solution then gives a newimproved set of ak and yfit

i which are then used to get new values for σ2i to

use in the next iteration. Additional iterations are performed until the akconverge and Eq. 8.14 is solved in a self-consistent manner with the σ2

i asevaluated at the best fit leading to the best-fit solution.

The iterations tend to converge quickly because the fitting parametersdepend strongly on the yi but only weakly on the σ2

i . The dependence isweak enough that for Poisson-distributed yi, it is often assumed that theinput yi should be good enough estimates of the yfit

i for the purposes ofcalculating the σ2

i . The fit then uses σ2i = yi without iteration. This is not

unreasonable if the yi are all 100 or more so that the errors in using yi for yfiti

when calculating σ2i are unlikely to be more than 10%. However, if many of

the yi are relatively low, using σ2i = yi can give significantly different results.

With today’s computers and programming tools, there is hardly a reason notto iterate the solution when appropriate.

Least Squares Principle

The Gaussian log likelihood and its use with the maximum likelihood princi-ple gain special significance in regression analysis not only because it can beapplied to binomial- and Poisson-distributed yi when augmented by iterativereweighting, but also because of its equivalence to a least squares principle.

Aside from an uninteresting negative factor of −1/2, the Gaussian log-likelihood is a sum of positive (squared) terms—one for each yi. This sum of

71

squares is called the chi-square random variable and is given by

χ2 =N∑i=1

(yi − yfit

i

)2

σ2i

(8.15)

Thus for Gaussian-distributed yi, maximizing L = −χ2/2 is the same asminimizing χ2 and the maximum likelihood principle becomes a least-squaresprinciple.

Note how the χ2 value is always positive and gets smaller as the deviationsbetween the fit and the data get smaller. It also has a sensible dependenceon the standard deviations—giving equal contributions to the sum for equaldeviations in units of σi. That is, deviations are always compared relativeto their uncertainty. With such behavior, minimizing the χ2 is intuitivelysatisfying no matter what distribution governs the yi.

In effect, any solution to Eq. 8.14 can be considered a least squares solu-tion because solving Eq. 8.14 for any given σ2

i is the same as minimizing theχ2 with those σ2

i fixed. For example, iteratively reweighted solutions couldbe determined by minimizing the χ2 with the current set of σ2

i fixed. Theresulting yfit

i are then used to calculate an improved set of σ2i and the χ2 is

again minimized with these σ2i fixed. Iterations would continue until a self

consistent set of ak are obtained. Note that this iteration procedure is notthe same as using Eq. 8.11 or 8.13 to eliminate σ2

i in the formula for χ2 andminimizing the value of that function. Simply put, that process would maxi-mize the wrong log likelihood. Iterative reweighting is different. Minimizingthe χ2 with the σ2

i fixed but ultimately calculated at the best fit is exactlyequivalent to directly maximizing the correct L of Eq. 8.6 or 8.7.

Sample mean and variance

Before presenting general solutions to Eq. 8.14, a small detour is now inorder—a return to the subject of distribution sampling in light of the principleof maximum likelihood. Recall the definitions for the sample mean (Eq. 2.14)and the sample variance (Eq. 2.23) for samples from a common distribution.Can they be demonstrated to satisfy the principle of maximum likelihood?

The input data in this case now consist of a sample set yi, i = 1...N ,all from the exact same probability distribution and the model is that thetrue means are the same for all yi. This is the simple case of a fit to the


constant function—yfiti = y for all i. The estimate is given the symbol y for

reasons that will be obvious shortly. As a one-parameter fit, M = 1 withyfiti = Fi(a1) = a1 = y and ∂yfit

i /∂a1 = 1 for all i. Because all yi are from thesame distribution, the σ2

i will also be the same for all yi. That is, σ2i = σ2

y

for all i.After bringing the constant σ2

y out of the summations on both sides ofEq. 8.14, this quantity cancels. Setting the derivatives to one, and settingyfiti = y for all i, this equation becomes simply:

N∑i=1

yi =N∑i=1

y (8.16)

The right side is simply Ny and solving for y then reproduces the standarddefinition of the sample mean, Eq. 2.14.

The sample mean has now been proven to be the MLE for the distributionmean for variables governed by a Gaussian, binomial or Poisson distribution.The sample mean y has previously been shown (see Eq. 2.15) to be an unbi-ased estimate of the true mean µy. Thus, for these three cases, the MLE is anunbiased estimate. The principle of maximum likelihood does not always giveunbiased estimates. A biased estimate will have a distribution mean aboveor below the true value and will not give the true value “on average.” Biasis considered a significant flaw and, consequently, checks for it are routinelymade and corrections are sought and applied.

When sampling yi from a common Gaussian distribution, σ2y becomes a

second parameter suitable for determination according to the principle ofmaximum likelihood. Is the sample variance s2

y of Eq. 2.23 an MLE forσ2y? For this case, the joint probability would now have to include all terms

dependent on µy and σy. Taking the log of Eq. 4.12 replacing all µi withtheir estimate y and replacing all σ2

i with their estimate s2y (and dropping all

terms independent of these two variables) gives

L = −N ln sy −1

2

N∑i=1

(yi − y)2

s2y

(8.17)

Now, the derivatives with respect to both y and sy must be set equal to zeroin order for them to be the MLEs. Nothing changes for the y equation andthe sample mean (Eq. 2.14) stays the MLE for µy. Setting the derivative

73

with respect to sy equal to zero then gives

0 =∂

∂sy

[−N ln sy −

1

2

N∑i=1

(yi − y)2

s2y

]

= −Nsy

+1

s3y

N∑i=1

(yi − y)2 (8.18)

with the solution

s2y =

1

N

N∑i=1

(yi − y)2 (8.19)

However, this s2y is seldom used because it is biased, having an expectation

value smaller than the true variance. As will be demonstrated in Exercise 5,the more common estimate given by Eq. 2.23, which uses N − 1 in place ofN in the denominator, is preferred because it is unbiased.

Note that the mean of a single sample (a sample of size N = 1) is welldefined. It is that sample value and thus also the MLE of the true mean of itsdistribution. No MLE of the true variance can be obtained with only a singlesample. Neither Eq. 8.19, which gives an unphysical estimate of s2

y = 0, norEq. 2.23, which gives an indeterminate value of 0/0, can be used. It takesat least two samples to get a sample variance, for which Eq. 2.23 gives theunbiased estimate s2

y = (y1 − y2)2/2.Of course, samples of size 1 or 2 are the smallest possible. Larger samples

would give sample means and sample variances which are more precise—moreclosely clustered around the true mean and the true variance. The varianceof y is given in the next exercise. The variance of s2

y is discussed in Chapter 9.

Exercise 4 The variance of y is most easily determined from Eq. 2.19—asthe difference between the second moment and the square of the first—in thiscase: σ2

y = 〈y2〉 − µ2y. Evaluate the right side of this equation to show that

σ2y =

σ2y

N(8.20)


Hint: Re-express y2 using

y2 =

(1

N

N∑i=1

yi

)(1

N

N∑j=1

yj

)

=1

N2

N∑i=1

N∑j=1

yiyj (8.21)

before taking the expectation value. Note how each y must use its own privatedummy index to clearly enumerate terms with i = j and i 6= j for use withEq. 4.9.

Eq. 8.20 indicates that the standard deviation of the mean of N samplesis√N times smaller than the standard deviation of a single sample, i.e., the

average of 100 samples is 10 times more precise an estimate of the true meanthan is a single sample.

If σy is not known and sy is used in its place on the right side of Eq. 8.20,the left side is then designated s2

y and called the sample variance of the mean

s2y =

s2y

N(8.22)

This quantity is then an unbiased estimate of σ2y . sy is called the sample

standard deviation of the mean.

Exercise 5 Show that Eq. 2.23 is unbiased and satisfies Eq. 2.22. Hint 1:Explain why each of the N terms in Eq. 2.23 has the same expectation valueand use this fact to get rid of the sum over i—replacing it with a factor of Ntimes the expectation value of one term (say i = 1). Hint 2: Expand (y1 − y)2

before taking the expectation value term by term. Then use Eqs. 2.14 and 4.9and/or results from Exercise 4 as needed for the individual terms.

Weighted mean

The sample mean y is an unweighted average; each yi has an equal effect onits value. Suppose that a sample of yi, i = 1...N are again all predicted tohave the same distribution mean µy. For example, they might be results forthe same quantity obtained from different data sets or by different researchgroups. But now, the yi don’t all have the same uncertainty—each comes

75

with its own standard deviation σi. The yi and the σi are given and the MLEfor µy is to be determined. Taking that MLE as my, the regression problemis again a simple fit to a constant—M = 1, with yfit

i = F (a1) = a1 = my and∂yfit

i /∂a1 = 1 for all i. The maximum likelihood solution must then satisfyEq. 8.14 which becomes the single equation

N∑i=1

yiσ2i

=N∑i=1

my

σ2i

(8.23)

my can be factored from the sum giving the result

my =

N∑i=1

yiσ2i

N∑i=1

1

σ2i

(8.24)

Eq. 8.24 is a weighted average of the yi

my =w1y1 + w2y2 + ...+ wNyN

w1 + w2 + ...+ wN(8.25)

with the weight for each yi given by the inverse of the variance for that yi:

wi =1

σ2i

(8.26)

Larger standard deviations indicate less precisely known y-values and, ap-propriately, smaller weights in the average. Weighting a data point’s contri-bution according to the inverse of its distribution’s variance persists in allregression problems. The larger the σi, the smaller the weighting of thatpoint in its effect on the fitting parameters.

Propagation of error then gives the variance of my via Eq. 6.21 withEq. 8.25 as


σ2my =

N∑i=1

(∂my

∂yi

)2

σ2i

=

N∑i=1

w2i σ

2i(

N∑i=1

wi

)2

=

N∑i=1

wi(N∑i=1

wi

)2

=1

N∑i=1

wi

(8.27)

This result is often expressed in a form somewhat easier to remember

1

σ2my

=N∑i=1

1

σ2i

(8.28)

Effectively, Eqs. 8.24 and 8.28 are a prescription for turning a group of inde-pendent samples (with known standard deviations) into a single sample my

with a reduced standard deviation σmy .

Linear Regression

The rest of this chapter is devoted to providing solutions to the maximumlikelihood condition (Eq. 8.14) for fixed σ2

i . It covers linear and nonlinearregression and then several specialized cases such as data sets with uncer-tainties in the independent variables, data sets with correlated yi, and datasets collected after an instrument calibration.

77

In linear algebra form, Eq. 8.14 is just the kth element of the vectorequation

[Jya ]T[σ2y

]−1y = [Jya ]T

[σ2y

]−1yfit (8.29)

where[σ2y

]−1—the inverse of the covariance matrix—is called the weighting

matrix and given by

[σ2y

]−1=

1/σ2

1 0 0 · · ·0 1/σ2

2 0 · · ·...

......

0 0 · · · 1/σ2N

(8.30)

and where [Jya ] is the N×M Jacobian of partial derivatives of yfiti with respect

to each ak as determined by the fitting function

[Jya ]ik =∂yfit

i

∂ak(8.31)

It is recommended that the reader check Eq. 8.29 and verify that it is, in-deed, a vector equation having M elements with the kth element reproducingEq. 8.14 including the proper summation over the i index.

The linear algebra formulas discussed next are demonstrated for a quad-ratic fit using array formulas in Excel in Linear Regression.xls available onthe lab web site.

Linear regression is used when the fitting function is linear in the fittingparameters. A linear fitting function with a single independent variable xifor each yfit

i would be of the form

yfiti = a1f1(xi) + a2f2(xi) + ...+ aMfM(xi)

=M∑k=1

akfk(xi) (8.32)

where the fk(x) are linearly independent functions of x with no unknownparameters. For example, a data set for a cart rolling on an inclined trackmight consist of the measured track position yi versus the time ti at eachmeasurement. This data might then be checked against a predicted quadraticbased on motion at constant acceleration:

yfiti = a1 + a2ti + a3t

2i (8.33)

http://www.phys.ufl.edu/courses/phy4803L/statistics/Linear Regression.xls


This model is linear in the three parameters a1, a2, and a3 associated withthe basis functions: f1(ti) = 1, f2(ti) = ti, and f3(ti) = t2i .

Note that when the true parameters αk are used in place of the fitted ak,Eq. 8.32 gives the true means µi of the distributions for the yi.

µi = α1f1(xi) + α2f2(xi) + ...+ αMfM(xi)

=M∑k=1

αkfk(xi) (8.34)

Equation 8.31 for the Jacobian is then the N ×M matrix given by

[Jya ] =

f1(x1) f2(x1) · · · fM(x1)f1(x2) f2(x2) · · · fM(x2)

......

...f1(xN) f2(xN) · · · fM(xN)

(8.35)

Note that [Jya ] is independent of the entire set of ak. It is this condition thatdetermines when linear regression is appropriate.

Equation 8.32 for the N values of yfiti can then be expressed by the vector

equations (with N elements)

yfit = [Jya ]a (8.36)

and Eq. 8.34 for the true means becomes

µ = [Jya ]α (8.37)

Substituting Eq. 8.36 into Eq, 8.29 then gives

[Jya ]T[σ2y

]−1y = [Jya ]T

[σ2y

]−1[Jya ]a (8.38)

The combination [Jya ]T[σ2y

]−1[Jya ] is an M ×M symmetric matrix that

will later be shown to be the inverse of the parameter covariance matrix. Fornow, however, it will simply be called the X-matrix as defined by

[X] = [Jya ]T[σ2y

]−1[Jya ] (8.39)

so that Eq. 8.38 becomes

[Jya ]T[σ2y

]−1y = [X]a (8.40)

79

Equation 8.40 is solved for the best-fit parameter vector a by determiningthe inverse matrix [X]−1 and multiplying it from the left of both sides ofEq. 8.40

a = [X]−1 [Jya ]T[σ2y

]−1y (8.41)

ora =

[Jay]†y (8.42)

where [Jay]†

= [X]−1 [Jya ]T[σ2y

]−1(8.43)

is an M ×N matrix called the (weighted) Moore-Penrose pseudo-inverse of[Jya ]. Note that: [

Jay]†

[Jya ] = [X]−1 [Jya ]T[σ2y

]−1[Jya ]

= [X]−1[X]

= [1] (8.44)

where [1] is the M×M identity matrix. This product of the Jacobian and itspseudo-inverse yields the M×M identity matrix—a key inverse-like property.

The ak are random variables. For each new input set of yi, the output setof ak would vary according to Eq. 8.42. What can be expected for the means,variances and covariances for the ak if the input data sets were re-sampledover and over again?

That the distribution for ak will be unbiased and have a mean of αk canbe demonstrated upon taking the expectation value of both sides of Eq. 8.42

〈a〉 =[Jay]† 〈y〉

=[Jay]†µ

=[Jay]†

[Jya ]α

= α (8.45)

where 〈y〉 = µ (from 〈yi〉 = µi) was used to get to line 2, Eq. 8.37 was usedto get to line 3, and Eq. 8.44 was used to get to line 4.

Keep in mind the kth row of Eq. 8.42 is

ak =[Jay]†k1y1 +

[Jay]†k2y2 + ...

[Jay]†kNyN (8.46)

and defines a direct linear relationship for each output parameter ak in termsof the input data set yi. Consequently, the propagation of error formula


(Eq. 6.19) can be used to determine the parameter covariance matrix [σ2a]

in terms of the data covariance matrix [σ2y]. Recall, the M × N Jacobian

appearing in Eq. 6.19 has elements defined by dak/dyi, which in this case are

just the elements of[Jay]†

giving

[σ2a] =

[Jay]†

[σ2y][Jay]†T

(8.47)

This equation simplifies using Eqs. 8.43 and 8.39, where the N×M Jacobian[Jya ] appearing therein is the Jacobian for the expansion of the yfit

i in terms ofthe ak; [Jya ]ik = ∂yfit

i /∂ak. Note that the rule for taking the transpose gives[Jay]†T

=[σ2y

]−1[Jya ] [X]−1 because

[σ2y

]−1and [X]−1 are symmetric about

the matrix diagonal and thus they are their own transpose. With these notes,proceeding from Eq. 8.47 gives

[σ2a] = [X]−1 [Jya ]T

[σ2y

]−1[σ2y][σ2y

]−1[Jya ] [X]−1

= [X]−1 [Jya ]T[σ2y

]−1[Jya ] [X]−1

= [X]−1[X][X]−1

= [X]−1 (8.48)

= [[Jya ]T[σ2y

]−1[Jya ]]−1

Taking the inverse of both sides of the last equation gives[σ2a

]−1= [Jya ]T

[σ2y

]−1[Jya ] (8.49)

Note how Eq. 8.46 demonstrates that every ak is a purely linear functionof the yi. Consequently, the first-order Taylor expansion for ak about any setof yi is exact. Recall from Chapter 6, this implies Eq. 6.19 (here, Eq. 8.47 andEq. 8.49) is exact. Linear fitting functions lead to linear direct relationships,and for both, the validity of the calculated parameter covariance matrix doesnot rely on keeping errors small. The calculated [σ2

a] is exact for any givenσi and for any distributions governing the yi.

As was demonstrated in Chapter 6, when the ak are nonlinear functions ofthe yi, Eq. 6.19 remains valid, but now relies on keeping the σi small enoughthat the first-order Taylor expansion, Eq. 6.7, remains valid. As will bedemonstrated later in this chapter, Eq. 8.49 also remains valid for nonlinearfitting functions if the σi are kept small enough.

Because of their wide applicability, Eqs. 6.19 and 8.49 are perhaps thetwo most important formulas in statistical analysis. Equation 6.19, [σ2

a] =

81

[Jay ][σ2y][J

ay ]T , uses an M×N Jacobian based on the direct relationships: ak =

fk(yi) with [Jay ]ki = ∂ak/∂yi. Equation 8.49, [σ2a]−1

= [Jya ]T[σ2y

]−1[Jya ],

uses an N ×M Jacobian based on the fitting model: yfiti = fi(ak) with

[Jya ]ik = ∂yfiti /∂ak.

Equally-weighted linear regression

Occasionally, all yi are obtained using the same technique and have the sameuncertainty. Or, the uncertainties are not known very well and assuming theyare equal is an appropriate starting point. For a data set where the standarddeviations are the same for all yi (σi = σy), the regression equations andresults are then called equally-weighted.

The regression equations simplify because the covariance and weighting

matrix are proportional to the identity matrix. [σ2y ] = σ2

y [1] and[σ2y

]−1=

(1/σ2y)[1], where [1] is the N × N identity matrix. With this substitution,

Eq. 8.39 becomes

[X] =1

σ2y

[Xu] (8.50)

where now [Xu] is the [X] matrix without the weighting matrix

[Xu] = [Jya ]T [Jya ] (8.51)

and is thus independent of σy. The inverse of Eq. 8.50 is then [X]−1 =σ2y[Xu]

−1 where [Xu]−1, the inverse of [Xu], is also independent of σy.

By Eq. 8.48, [X]−1 is the parameter covariance matrix. That is,

[σ2a] = σ2

y[Xu]−1 (8.52)

thereby demonstrating that every element of the parameter covariance matrixis proportional to σ2

y. Equation 8.52 thus further implies that the standarddeviation of every parameter (square root of the corresponding diagonal ele-ment) is proportional to the standard deviation (σy) of the y-values for thatdata set.

Equation 8.41 for the parameter values becomes

a = [Xu]−1 [Jya ]T y (8.53)

showing that σ2y has canceled and thus the parameter values themselves do

not depend on its value. The independence can also be inferred from the χ2


of Eq. 8.15, which becomes

χ2 =1

σ2y

N∑i=1

(yi − yfit

i

)2(8.54)

Recall that the best-fit parameters would be those that minimize this χ2.Because σ2

y factors from the sum, no matter its value, the same sum of squareddeviations must be minimized and thus the same ak will be obtained.

Nonlinear regression

Linear regression requires that the Jacobian be independent of the entireset of fitting parameters. Fitting functions that are nonlinear in the fittingparameters do not satisfy this requirement. For example, consider a radioac-tive sample emitting gamma rays and positioned in front of a Geiger-Mullertube. (A Geiger-Muller tube produces a click or “count” each time it detectsa gamma ray.) If the radioactive half-life of the sample is a few minutes orso, the number of counts yi might be measured over consecutive ten-secondintervals as a function of the interval starting time ti relative to the start ofthe experiment. The yi would decrease as the sample decays and a modelpredicting exponential decay would be represented

yfiti = a1e

−ti/a2 + a3 (8.55)

where a1 is proportional to the initial sample activity, a2 is the mean lifetime,and a3 is a constant arising from natural background radiation. For thisnonlinear fitting function, the derivative of yfit

i with respect to a1 depends ona2 and the derivative with respect to a2 depends on both a1 and a2.

The solution for the best-fit parameters must still satisfy Eq. 8.29, butbecause [Jya ] depends on the ak, an iterative solution must be sought. Theuser must provide initial guesses for the fitting parameters that will be usedas a starting point. From the initial guesses, a nonlinear fitting program willlocate other nearby parameter sets—evaluating the χ2 each set produces.Each time the program finds that χ2 has decreased, it uses those improvedparameter values as a new starting point for the next iteration. Iterationscontinue until the solution to Eq. 8.29 is self consistent—satisfied with [Jya ](and perhaps [σ2

y]) evaluated at the best- fit.

83

Various algorithms can be used to search through parameter space andlocate the set of ak that minimize the χ2. Of course, how quickly and reliablythey do that is the key issue. The grid search is an algorithm where the χ2 isevaluated on a grid over specified ranges for the ak. It is not very efficient butcan be useful when there are multiple local minima in the χ2. The Gauss-Newton algorithm is the quickest way to find the best-fit parameters whenthe starting parameters are already sufficiently close. Its big drawback is thatit tends to fail if the starting parameters are not close enough. The gradient-search algorithm is better at improving the fit parameters when the startingparameters are further from the best-fit parameters. Its big drawback is thatit tends to take a long time to find the best fit. The Levenberg-Marquardtalgorithm elegantly addresses the shortcomings of the Gauss-Newton andgradient search algorithms. In effect, it uses the gradient-search algorithmwhen the starting point is far from the best fit and switches to the Gauss-Newton algorithm as the parameters get closer to the best fit.

The Gauss-Newton algorithm begins by evaluating the Jacobian (Eq. 8.31)at the starting parameter values—either by numerical differentiation or fromformulas supplied by the user. If the starting point is near the best-fit values,the χ2 will be near its true minimum and will have a quadratic dependenceon the M fitting parameters—it will lie on an M -dimensional parabola. TheGauss-Newton algorithm uses the [Jya ] at the present starting point to deter-mine this parabola and having done so can jump directly to the predictedminimum.

For a linear fitting function, the parabolic shape is guaranteed to beaccurate—even when the starting parameters are far from the minimum. Ineffect, this is why linear regression formulas can find the best-fit parametersin a single step. For nonlinear fitting functions, if the starting parameters aremade close enough to the best-fit parameters, the parabolic shape is againguaranteed and the Gauss-Newton algorithm would jump almost directly tothe correct best-fit parameters in one try. However, if the starting point istoo far from the best-fit, the local derivatives may not predict where the trueminimum in χ2 will be. When the Gauss-Newton algorithm jumps to thepredicted best-fit parameters, it may find the χ2 has decreased, but not toits minimum. In this case, it can simply reiterate from there—starting with anew [Jya ]. However, if at any time it finds that the χ2 has actually increased,it would then have no recourse to improve the fit.

Each iteration of the gradient search algorithm is guaranteed to improvethe fit (lower the χ2) and so it can be used when the Gauss-Newton algorithm


fails. The gradient search ignores the curvature of the parabola and uses theJacobian only to determine the gradient of χ2 at the starting point. It thenmoves the parameters from the starting point (takes a step) in a directionopposite the gradient—along the direction of steepest descent for the χ2.Because it ignores the curvature, however, this algorithm can’t be sure howbig a step to take. So it takes a step and if χ2 does not decrease at thenew point, it goes back to the starting point, decreases the step size by somefactor, and tries again. Sooner or later the step size will be small enough thatχ2 will decrease—leading to a new improved starting point from which a newiteration can be performed. The problem with the gradient algorithm is thatnear the χ2 minimum, the gradient is not steep at all, the steps become verysmall and the algorithm proceeds very slowly toward the best-fit solution.

The elegance of the Levenberg-Marquardt algorithm is in how it monitorsthe χ2 during the iterations and, with a single parameter, smoothly switchesbetween the Gauss-Newton and gradient search algorithms and adjusts thegradient search step size.

If the σ2i depend on the best fit, iteratively reweighted least squares will

be needed. Recall that the σ2i must be held constant when minimizing the

χ2, but must also be the correct σ2i as evaluated at the best fit. One might

recalculate the σ2i after each successful χ2 decrease, or only after finding the

minimum χ2 with the current set of σ2i . Because self-consistency must ul-

timately be achieved for both σ2i and [Jya ], the choice is simply a matter of

whether and how fast convergence is achieved. With a minimization pro-gram such as in Excel’s Solver, intermediate results are not available andreweighting will have to be performed after each minimum χ2 is found withthe current σ2

i .

The Gauss-Newton algorithm

The Gauss-Newton algorithm is presented in some detail because it illustratesthe similarities between the nonlinear regression formulas and their linearregression counterparts. It also gives the parameter covariance matrix.

The treatment will require distinguishing between the best-fit parametersak, k = 1...M and another set—nearby, but otherwise arbitrary. This nearby“trial” solution will be labeled atrial

k , k = 1...M , and gives a trial fittingfunction

ytriali = Fi(atrial) (8.56)

85

This initial solution must be close enough to the best fit so that for everypoint i in the data set, a first-order Taylor series expansion about ytrial

i willaccurately reproduce the best-fit yfit

i .

yfiti = ytrial

i +M∑k=1

∂ytriali

∂atrialk

(ak − atrialk ) (8.57)

Where this expansion is accurate, the χ2 surface is parabolic.Differentiating this Taylor expansion shows that the elements of the Ja-

cobian, Eq. 8.31, [Jya ]ik = ∂yfiti /∂ak are

[Jya ]ik =∂ytrial

i

∂atrialk

(8.58)

That is, the first-order Taylor expansion implies the derivatives at the bestfit are the derivatives at the nearby starting point.

For any reasonable fitting function, the first-order Taylor expansion isguaranteed accurate for values of atrial

k that are sufficiently close to the ak.If the expansion remains valid for a wider range of atrial

k , so too will theGauss-Newton algorithm find the best-fit ak from those more distant trialsolutions. Moreover, the Gauss-Newton algorithm also provides the param-eter covariance matrix based on this expansion. Consequently, the range ofvalidity for the first-order Taylor expansion will have important implicationsfor parameter uncertainties.

To see how the first-order Taylor expansion will lead to linear regression-like formulas, we first define the modified input data ∆yi and the modifiedbest-fit ∆yfit

i as relative to the trial solution.

∆yi = yi − ytriali (8.59)

∆yfiti = yfit

i − ytriali (8.60)

or, in vector notation

∆y = y − ytrial (8.61)

∆yfit = yfit − ytrial (8.62)

Subtracting [Jya ]T[σ2y

]−1ytrial from both sides of the defining equation for

the maximum likelihood solution, Eq. 8.29, then gives

[Jya ]T[σ2y

]−1∆y = [Jya ]T

[σ2y

]−1∆yfit (8.63)


Next, we define the modified best-fit parameters (∆ak) as the differencebetween the actual best-fit parameters and the trial parameters.

∆ak = ak − atrialk (8.64)

or∆a = a− atrial (8.65)

With these definitions, the first-order Taylor expansion (Eq. 8.57) can bewritten

∆yfiti =

M∑k=1

[Jya ]ik ∆ak (8.66)

or∆yfit = [Jya ] ∆a (8.67)

Using Eq. 8.67 in Eq. 8.63 gives the linear regression-like result

[Jya ]T[σ2y

]−1∆y = [Jya ]T

[σ2y

]−1[Jya ] ∆a (8.68)

This equation is now in the linear regression form analogous to Eq. 8.38 andthe solution for the best-fit ∆ak is analogous to Eq. 8.42.

∆a =[Jay]†

∆y (8.69)

where[Jay]†

= [X]−1 [Jya ]T[σ2y

]−1and [X] = [Jya ]T

[σ2y

]−1[Jya ] as evaluated at

the trial solution.After Eq. 8.69 is applied to determine the best-fit ∆a, Eq. 8.64 must then

be applied to each element to find the best-fit ak

ak = atrialk + ∆ak (8.70)

The ending values for the ak should then be used as a new trial solutionfor another iteration of the algorithm. The Jacobian (and if necessary, theweighting matrix as well) should be reevaluated there and the Gauss-Newtonalgorithm reiterated. Iterations should continue until there are no significantchanges in the ak, i.e., until the atrial and ytrial are already at the best fit andEq, 8.69 gives ∆a = 0 with [Jya ] and [σ2

y] evaluated there. The requirementthat ∆a = 0 when atrial = a and ytrial = yfit implies that the left side ofEq. 8.68 is zero at the best fit. This gives the condition for iterations to end

[Jya ]T[σ2y

]−1(y − yfit) = 0 (8.71)

87

which is just the condition for the χ2 to be a minimum, namely, Eq. 8.29.Equation 8.48 gives [X]−1 as the covariance matrix for the ∆ak. Be-

cause of the constant offset transformation between ∆ak and ak expressed byEq. 8.70, the Jacobian matrix in either direction is the identity matrix andpropagation of error (Eq. 6.18) implies [X]−1 is also the covariance matrixfor the ak.

The Gauss-Newton, gradient search, and Levenberg-Marquardt algorithmsare demonstrated for (simulated) exponential decay data in the two Excelworkbooks Nonlinear Regression.xls and Nonlinear Regression Poisson.xlsfor Gaussian- and Poisson-distributed yi, respectively. (Only one of the twoshould be opened at one time.) The three algorithms differ on how they treatthe X-matrix before finding its inverse and using it to calculate ∆a. Leaving[X] unmodified corresponds to the Gauss-Newton algorithm. Using only itsdiagonal elements corresponds to using the gradient search algorithm. TheLevenberg-Marquardt algorithm uses a control parameter λ and multipliesdiagonal elements of [X] by a factor of 1 + λ. This makes ∆a follow a near-gradient search algorithm with smaller and smaller step size as λ increasesabove 1 and a more and more Gauss-Newton algorithm as λ decreases below1. λ is adjusted during iterations as follows: If the χ2 successfully decreases,besides keeping the new ak, λ is decreased by a factor of ten for the nextiteration. If the χ2 fails to decrease, besides keeping the old ak, λ is increasedby a factor of ten.

The parameter covariance matrix [σ2a] should always be obtained using

an unmodified X-matrix evaluated with the Jacobian [Jya ] and the inputcovariance matrix [σ2

y] evaluated at the best-fit.

Uncertainties in independent variables

Up to now, only the dependent variable had uncertainty—only the yi wererandom variables. What can be done when there are uncertainties in theindependent variable—when the xi are also random variables? There is norigorous treatment for the general case. However, if the xi are statisticallyindependent and have uncertainties that are small enough, a simple modifi-cation to the data point weightings can accurately model this complexity.

Only a single independent variable x will be considered here, i.e., where

yfiti = F (a;xi) (8.72)

http://www.phys.ufl.edu/courses/phy4803L/statistics/Nonlinear Regression.xls

http://www.phys.ufl.edu/courses/phy4803L/statistics/Nonlinear Regression Poisson


but the extension to additional independent variables should be obvious.Letting σxi represent the standard deviation of xi and letting µxi representits mean, F (a;xi) must be a nearly linear function of x throughout therange µxi ± 3σxi. That is, each yfit

i should be well described by a first-orderTaylor expansion

yfiti = F (a;µxi) +

∂F (a;xi)∂xi

(xi − µxi) (8.73)

for any xi in this range.Under these conditions, propagation of error implies that random varia-

tions in xi with a variance of σ2xi would cause random variations in yfit

i witha variance

σ2

yfiti

=

(∂F (a;xi)

∂xi

)2

σ2xi (8.74)

If xi is statistically independent from yi, the variations in yfiti will be uncor-

related with the variations in yi and propagation of errors implies that thequantity yi − yfit

i will have random variations with a variance given by

σ2i = σ2

yi +

(∂F (a;xi)

∂xi

)2

σ2xi (8.75)

where now σyi is the standard deviation associated with yi (σi previously).To account for uncertainty in an independent variable, simply replace the

σ2i appearing in the regression formulas with the modified values of Eq. 8.75.

These values will give the proper weighting matrix with the correct depen-dence on σxi and σyi. Most importantly, the adjusted σ2

i will give the correctcovariance matrix for the fitting parameters and when used in the χ2 ofEq. 8.15 will maintain its proper expectation value—a critical aspect of thechi-square test discussed in Chapter 9.

Once again, the fitting parameters of F (a;xi) would need to be knownin order to evaluate the partial derivative in Eq. 8.75. Consequently, aniterative solution would be needed. At the start of each iteration, σ2

i wouldbe recalculated from Eq. 8.75 using the current set of ak and would be heldfixed while a fit to find the next set of ak is performed. The iterations wouldcontinue until a self-consistent solution is obtained.

89

Regression with correlated yi

Performing a fit to a set of yi having a non-diagonal covariance matrix [σ2y] is

relatively simple if the joint probability distribution for the yi is reasonablywell-described by the correlated Gaussian of Eq. 4.24. In this case, the re-gression formulas already presented remain valid without modification. Oneneed only substitute the non-diagonal covariance matrix and its inverse forthe diagonal versions assumed up to now.

This simple substitution works because the log likelihood for the corre-lated joint probability of Eq. 4.24 (multiplied by −2 so it is to be minimizedto maximize the probability) depends on the µi only via a χ2 of the form

χ2 = (y − µ)T[σ2y

]−1(y − µ) (8.76)

When the best-fit parameters ak are found, they will determine the best-fit yfit

i via the fitting function, which when used for the µi in Eq. 8.76 mustproduce a minimum χ2 to maximize the log likelihood.

χ2 =(y − yfit

)T [σ2y

]−1 (y − yfit

)(8.77)

That this χ2 is a minimum with respect to all fitting parameters, impliesthat its derivative with respect to every ak is zero. Performing this chain-rule differentiation then gives

0 =∂

∂ak

(y − yfit

)T [σ2y

]−1 (y − yfit

)= −

(y − yfit

)T [σ2y

]−1 ∂yfit

∂ak− ∂yfitT

∂ak

[σ2y

]−1(y − yfit) (8.78)

The two terms in this last equation are scalars. In fact, they are the exactsame scalar, just formed from expression that are transposes of one another.Thus each must be zero at the best fit and choosing the second of these gives

∂yfitT

∂ak

[σ2y

]−1y =

∂yfitT

∂ak

[σ2y

]−1yfit (8.79)

This scalar equation must be true for each of the M fitting parameters akand with the definition of the Jacobian (Eq. 8.31), all M can be rewritten inthe vector form of Eq. 8.29. Because Eq. 8.29 was the starting point for theregression results already presented, and because its solution does not relyon [σ2

y] being diagonal, the equations for the best-fit parameters and theircovariance matrix do not change when [σ2

y ] is non-diagonal.


Calibrations and instrument constants

Systematic errors are expected when using scientific instruments. Reducingthem involves a calibration. As an example, consider a grating spectrometerused to determine wavelengths by measuring the grating angle where spectralfeatures such as emission lines from a source are observed. Optics, diffrac-tion theory, and spectrometer construction details predict the wavelengths(the dependent variable yi) according to a calibration function involving thegrating angle (the independent variable xi) and a set of apparatus parame-ters (the instrument constants bj). For a spectrometer, the grating groovespacing and a grating offset angle might be instrument constants.

Consider a second example. Many instruments involve electronic circuitswhich are susceptible to offset and gain errors. For this and other reasons,it is often appropriate to assume that instrument readings will suffer sucheffects and that corrected yi should be obtained from raw instrument readingsxi according to yi = b1 + b2xi, where b1 is the offset error and the deviationof b2 from unity is the gain error. Having no information on their actualvalues, but based on instrument specifications, a voltmeter offset and gainmight be assumed to be: b1 = 0±1 mV and b2 = 1±0.001. With b1 = 0 andb2 = 1, this “calibration” results in yi = xi and makes no corrections to theraw readings, but it does acknowledge that every yi may be systematicallyoffset by a few mV and that all differences in yi may be off in scale by a fewparts per thousand.

The analysis model is that the instrument (or calibration) constants donot change during the acquisition of a measurement set, but their exact valueis subject to some uncertainty. Making sure the instrument constants do notchange significantly is an important experimental consideration. For exam-ple, temperature affects all kinds of instrumentation and is often associatedwith gain and offset errors. This is why measurements should always be madeafter the electronics have warmed up and why swings in ambient temperatureshould be avoided.

For calibrations, the independent variables are always measurements—random variables. With one independent variable x, the calibration functionfor an entire data set will be expressed

yi = G(b;xi) (8.80)

where b represents L instrument constants, bj, j = 1...L. With a givenset of bj, this calibration function transforms the raw instrument readings xi

91

into the “instrument-measured” yi of some physical significance.Sometimes, calibrations involve careful investigation of the instrument

resulting in a statistically independent determination of each bj and its un-certainty. Other times, the bj and their covariance matrix are determined byusing the instrument to measure standards—samples or sources for which theyi are already known to high accuracy. With the corresponding xi measuredfor a set of standard yi, the (xi, yi) data points are then fit to the calibrationequation treating it as a fitting function

yfiti = G(b;xi) (8.81)

A regression analysis then determines the best-fit bj and their covariance ma-trix [σ2

b ]. The dependent yi for this fit are now the standard reference values,which are normally quite accurately known. Consequently, the uncertaintiesin the xi play a dominant role in the regression analysis and must be smallenough to perform a valid fit as described previously.

Mathematically, a calibration is simply a determination of the best esti-mates bj and their standard deviations (if independent) or, more generally,their L × L covariance matrix [σ2

b ]. These calibration constants should beconsidered a single sample from some joint probability distribution havingmeans given by the true instrument constants β and having a covariancematrix [σ2

b ].With the bj and [σ2

b ] from the calibration in hand, the instrument is thenused again on some new system of interest—one where the y-values are notknown in advance. Wavelengths measured with a spectrometer, for example,are used in all manner of studies associated with source conditions. To thisend, a new set of xi are measured and used (with the calibration constants)in Eq. 8.80 to directly determine a new set of yi-values. These yi will nowbe considered “measured” y-values for the new system and will be used asthe dependent variables in a some new regression analysis. How will theuncertainty in the instrument constants (as represented by [σ2

b ]) affect theuncertainty in the theory parameters of the new regression analysis?

This new, or main, analysis starts by performing a standard regressionanalysis assuming that the instrument constants are known exactly.

One issue that must be dealt with at the start is that the N input yi areobtained from yi = G(b;xi) and it is the xi that are measured. Independentvariations in the xi of variance σ2

xi propagate to independent variations in


the yi of variance

σ2i =

(∂yi∂xi

)2

σ2xi (8.82)

In matrix notation, the entire covariance matrix becomes

[σ2y ] = [Jyx ][σ2

x][Jyx ]T (8.83)

where the N ×N Jacobian [Jyx ] has only diagonal elements

[Jyx ]ii =∂G(b;xi)

∂xi(8.84)

Off-diagonal elements are all zero because of the one-to-one relationship be-tween the xi and yi in Eq. 8.80. The diagonality of [Jyx ] implies [σ2

y] ofEq. 8.83 will likewise be diagonal (and the yi can be treated as staticallyindependent) if, as is usually the case, the xi are statistically independentfrom one another.

The main regression analysis is performed with this [σ2y], but is otherwise

an ordinary regression analysis. It determines the M best-fit parameters akto the main fitting function yfit

i = Fi(a) and it determines their covariancematrix [σ2

a]. All previous notations, formulas, procedures and results apply.The effects of [σ2

b ] are determined only after this solution is obtained.Only a few of the relationships from the main regression are needed to

determine the effects of [σ2b ]. And only the difference forms will be involved—

those involving ∆y and ∆a. Thus, while all first-order Taylor expansionsmust be valid (errors may need to be sufficiently small), the treatment appliesto both linear and nonlinear fitting functions.

Note that the best fit ak will be unaffected by uncertainty in the bj. Thatis not to say that the ak do not depend on the bj. In fact, a first-order Taylorexpansion will be assumed to give the (small) changes to the yi that can beexpected from (small) changes to the bj. And of course, small changes tothe yi lead to small changes in the ak. Once these linear relationships arespecified, propagation of error can be used to determine the contribution to[σ2a] arising from [σ2

b ]. This new contribution will be in addition to the [σ2a]

determined from the main analysis and arising from the uncertainty in theyi. Although Eq. 8.80 shows that yi depends on xi and all the instrumentconstants bj, recall that the [σ2

y] for the main fit must only take account ofthe uncertainty due to yi (via Eq. 8.83).

93

Uncertainties in the ak due to uncertainties in the bj can be predictedfrom the calibration function, the fitting function, and the best fit ak and bjas follows. Changes in the yi due to (small) changes in the bj are assumed tobe well described by a first-order Taylor expansion. In linear algebra form itis simply

∆y = [Jyb ]∆b (8.85)

where

[Jyb ]ij =∂G(b;xi)

∂bj(8.86)

Equation 8.69 (∆a =[Jay]†

∆y) gives the changes in the main fitting pa-rameters ak that would occur with changes to the yi. Combining this withEq. 8.85 then gives (a “chain rule” version of) the first-order Taylor expansionfor the ak as a function of the bj.

∆a =[Jay]†

[Jyb ]∆b (8.87)

With Eq. 8.87 giving the change in the fitting parameters arising from achange in the calibration constants, propagation of error (Eq. 6.19) thengives the covariance matrix for the ak in terms of that for the bj

[σ2(b)a ] =

[Jay]†

[Jyb ][σ2b ][J

yb ]T[Jay]†T

(8.88)

where the superscript (b) on the left indicates that this is the contribution to[σ2a] arising from [σ2

b ] and must be added to that due to [σ2x]. The contribution

due to [σ2x] is obtained from Eq. 8.47 (with Eq. 8.83 for [σ2

y]) giving

[σ2(x)a ] =

[Jay]†

[σ2y][Jay]†T

=[Jay]†

[Jyx ][σ2x][J

yx ]T[Jay]†T

(8.89)

The total covariance matrix is the sum of Eqs. 8.88 and 8.89.

[σ2a] = [σ2(b)

a ] + [σ2(x)a ]

=[Jay]† (

[Jyb ][σ2b ][J

yb ] + [Jyx ][σ2

x][Jyx ]T) [Jay]†T

(8.90)

Note that the term in parentheses above is the final covariance matrix forthe yi

[σ2y] = [Jyb ][σ2

b ][Jyb ]T + [Jyx ][σ2

x]][Jyx ]T (8.91)


which is simply the propagation of error formula applied to yi = G(b;xi)assuming the xi and bj are statistically independent. Note that, becausechanges in the instrument constants propagate through all yi, the contribu-tion from [σ2

b ] will lead to correlated yi even if the xi are uncorrelated.While Eq. 8.91 is the final covariance matrix for the yi, keep in mind that

only the [σ2y ] due to [σ2

x] (Eq. 8.83) should be used in the main regressionanalysis—including the calculation of the χ2. This is just the normal proce-dure under the assumption that the bj are exact constants for that analysis.The reason for not including the effects of [σ2

b ] in the [σ2y] used for the main

fit is that the deviations in the bj from their true values do not add addi-tional scatter to the yi; they affect the yi systematically and predictably. Forexample, the χ2 should not vary significantly for any reasonable likely set ofbj. The first-order Taylor expansion of Eq. 8.87 shows how the ak can be con-sidered direct functions of the bj so that propagation of error (Eq. 8.88) thendescribes how the covariance matrix for the bj propagates to the covariancematrix for the ak.

Fitting to a function of the measured variable

Sometimes the dependent variables yi for the main analysis are calculatedfrom a specific function of other more directly measured variables, say zi,and it is the uncertainties in the zi that are known or given. For example,one might fit to yi = 1/zi or yi = ln zi or some arbitrary function:

yi = H(zi) (8.92)

If H(z) is invertible, one could simply adjust the fitting function to predictzi rather than yi. However, this is not always desirable and if deviations inthe zi are likely to stay in the linear regime for the yi, this situation can behandled by a simple transformation; use the calculated yi as the dependentvariable for the main fit and calculate the [σ2

y] for the fit from the uncertaintyin the zi using propagation of error.

A first-order Taylor expansion describes the change to yi arising fromchanges to the zi

∆y = [Jyz ]∆z (8.93)

where [Jyz ] has only diagonal elements

[Jyz ]ii =∂H(zi)

∂zi(8.94)

95

and thus the [σ2y] to use in the main fit is

[σ2y ] = [Jyz ][σ2

z ][Jyz ]T (8.95)

Eq. 8.69 (with Eq. 8.93) becomes

∆a = [Jay ]†∆y

= [Jay ]†[Jyz ]∆z (8.96)

from which propagation of error then gives the parameter covariance matrixfor the fit as

[σ2a] = [Jay ]†[Jyz ][σ2

z ][Jyz ]T [Jay ]†T (8.97)

If a calibration function (of xi) determines zi, then Eq. 8.83 becomes

[σ2z ] = [Jzx ][σ2

x][Jzx ]T (8.98)

and the covariance matrix to use in the main analysis and for calculating theχ2 for the fit (Eq. 8.95 with Eq. 8.98) becomes

[σ2y] = [Jyz ][Jzx ][σ2

x][Jzx ]T [Jyz ]T (8.99)

The parameter covariance matrix due to the uncertainties in the xi becomes

[σ2(x)a ] = [Jay ]†[Jyz ][Jzx ][σ2

x][Jzx ]T [Jyz ]T [Jay ]†T (8.100)

Coupling Eq. 8.96 with Eq. 8.85 (here, ∆z = [Jzb ]∆b), then gives

∆a = [Jay ]†[Jyz ]∆z

= [Jay ]†[Jyz ][Jzb ]∆b (8.101)

for which Eq. 6.19 gives the [σ2a] due to [σ2

b ]

[σ2(b)a ] = [Jay ]†[Jyz ][Jzb ][σ2

b ][Jzb ]T [Jyz ]T [Jay ]†T (8.102)

The final parameter covariance matrix is then again the sum of the con-tributions due to [σ2

x] (Eq. 8.100) and due to [σ2b ] (Eq. 8.102).

[σ2a] = [Jay ]†[Jyz ]

([Jzx ][σ2

x][Jzx ]T + [Jzb ][σ2

b ][Jzb ]T)

[Jyz ]T [Jay ]†T (8.103)

Chapter 9

Evaluating a Fit

Graphical evaluation

Evaluating the agreement between a fitting function and a data set shouldalways begin with a graph. The process will be illustrated for a fit with asingle independent variable xi associated with each yi.

The main graph should show the fitting function as a smooth curve with-out markers for a set of x-values that give a good representation of the best-fitcurve throughout the fitting region. Plotting it only for the xi of the datapoints may be insufficient if the xi are too widely spaced. It is also some-times desirable to extrapolate the theory curve above or below the range ofthe data set xi.

The input data points (xi, yi) should not be connected by lines. Theyshould be marked with an appropriately sized symbol and error bars—verticalline segments extending one standard deviation above and below each point.If there are x-uncertainties, horizontal error bars should also be placed oneach point. Alternatively, the x-uncertainties can be folded into a σi calcu-lated according to Eq. 8.75, which would then be used for vertical error barsonly.

Figure 9.1 shows a case where the error bars would be too small to showclearly on the main graph. The fix is shown below the main graph—a plotof the residuals yi−yfit

i vs. xi with error bars. If the σi vary too widely to allshow clearly on a residual plot, logarithmic or other nonlinear y-axis scalingmay fix the problem. Or, normalized residuals (yi − yfit

i )/σi (without errorbars) could be used.

97

98 CHAPTER 9. EVALUATING A FIT-0.14184141-0.41484745

-0.266988-0.158302160.1071460930.1604729660.0976536470.2149541420.667474072-0.09656279-0.24012021-0.36632796-0.466171670.5787352470.002512659-0.17282158

0.1675800250.0370966170.21433223

d(1/λλλλ)/d(mλ)λ)λ)λ) x 1/λλλλ pred ∆∆∆∆(1/λλλλ)

(1/nm)2 (1/nm) (1/nm) DOF5.30129E-06 0.21 0.002304117 -1.6643E-06 74.22393E-06 0.1875 0.002057247 -2.0273E-06 χ2

Mercury Calibration

-1500

-1000

-500

0

500

1000

1500

2000

2500

120 140 160 180 200 220 240 260

angle (degress)

mλλ λλ

(nm

)

raw deviations

-1

-0.5

0

0.5

1

1.5

120 140 160 180 200 220 240 260∆∆ ∆∆mλλ λλ

(nm

)

Figure 9.1: Top: main graph for a fit to a calibration function for data from ourvisible spectrometer. Bottom: corresponding residual plot.

The purpose of these graphs is to make it easy to see each data point’sresidual relative to its standard deviation and to assess the entire set ofresiduals for their expected randomness. If the sample size is large enough,residuals or normalized residuals can be histogrammed and checked againstthe expected distribution.

Specifically look for the following “bad fit” problems and possible causes.

• Residuals are non-random and show some kind of trend. For example,data points are mostly above or mostly below the fit, or mostly above atone end and mostly below at the other. Residuals should be random. Inparticular, positive and negative residuals are equally likely and shouldbe present in roughly equal numbers. Trends in residuals may indicatea systematic error, a problem with the fitting program, or a fittingmodel that is incomplete or wrong.

• Too many error bars miss the fitted curve. Approximately two-thirdsof the error bars should cross the fit. If the residuals are random andsimply appear larger than predicted, the σi may be underestimated.

• There are outliers. Outliers are points missing the fit by three or more

99

σi. These should be very rare and may indicate data entry mistakes,incorrect assignment of xi, or other problems.

• The fit goes through most data points near the middle of the errorbars. On average, the yi should miss the fit by one error bar and aboutone-third of the error bars should miss the fit entirely. This shouldprobably be called a “good fit” problem. It is not all that unlikely ifthere are only a few data points, but with a sufficient number of datapoints, it indicates the σi have been overestimated—the measurementsare more precise than expected.

The χ2 distribution

The χ2 appearing in Eq. 8.15 is sometimes called the “best-fit” chi-square todistinguish it from another version—the “true” chi-square that uses the truetheory—the true theory parameters αk and thus the true means µi.

χ2 =N∑i=1

(yi − µi)2

σ2i

(9.1)

The chi-squares of Eq. 8.15 and 9.1 are different random variables with dif-ferent distributions. Of course, except in Monte-Carlo simulations, the αkand µi are not likely known and the true chi-square cannot be calculated.

Both versions of the χ2 random variable are also well-defined for binomial-or Poisson-distributed yi with σ2

i given, respectively, by Eq. 3.6 or 3.8 forEq. 9.1 and by Eq. 8.11 or 8.13 for Eq. 8.15.

The best-fit χ2 value is typically used as a “goodness of fit” statistic. Ifthe σi and the fitting function are correct, the distribution of likely valuesfor the χ2 variable can be quite predictable. An unlikely, i.e., “bad” χ2 valueis an indication that theory and measurement are incompatible. To deter-mine whether or not a particular χ2 value is reasonable, the χ2 probabilitydistribution must be known.

The χ2 probability distribution depends on the particular probability dis-tributions governing the yi as well as the number of data points N and thenumber of fitting parameters M . The quantity N − M is called the chi-square’s degrees of freedom and plays a central role in describing the distri-bution. Each data point adds one to the degrees of freedom and each fittingparameter subtracts one.

100 CHAPTER 9. EVALUATING A FIT

Evaluating the expectation value or mean of the χ2 distribution beginswith the definition, Eq. 2.18, for the variance of each yi—rewritten in theform ⟨

(yi − µi)2

σ2i

⟩= 1 (9.2)

Summing over all N data points gives⟨N∑i=1

(yi − µi)2

σ2i

⟩= N (9.3)

The sum in this expression is the true χ2 of Eq. 9.1. Thus the mean of thedistribution for the true χ2 is equal to the number of data points N . Howdoes the mean change when χ2 must be evaluated with Eq. 8.15 using theyfiti in place of the µi? Is Eq. 9.3 valid if yfit

i is substituted for µi? It turnsout that the best-fit χ2 calculated with the yfit

i satisfies⟨N∑i=1

(yi − yfiti )2

σ2i

⟩= N −M (9.4)

The expectation value of best-fit χ2 is smaller than that of the true χ2 bythe number of fitting parameters M .

The proof of Eq. 9.4 is given in the Regression Analysis Addendum. It isstill based on Eq. 9.2 and thus valid for any distribution for the yi. But italso requires that the best fit be linear in the parameters—satisfy Eq. 8.57with atrial

k = αk. In fact, Exercise 5 demonstrated Eq. 9.4 for the case of theconstant fitting function (yfit

i = y, M = 1) with equally weighted y-values(σi = σy).

For fixed σi, it is easy to understand why some reduction in the χ2 shouldalways be expected when yfit

i replaces µi. After all, the fitted ak (resulting intheir corresponding yfit

i ) are specifically chosen to produce the lowest possibleχ2 for one particular data set—lower even than would be obtained (for thatdata set) with the true αk and their corresponding µi. Thus for any data set,the χ2 using yfit

i can only be equal to or less than the χ2 using the µi. Theaverage decrease is M , but it can be smaller or larger for any particular dataset.

According to Eq. 9.4, the mean of the chi-square distribution is N −M .How much above (or below) the expected mean does the χ2 value have to be

http://www.phys.ufl.edu/courses/phy4803L/statistics/matproof.pdf

101

before one must conclude that it is too big (or too small) to be reasonable?That question calls into play the width or variance of the χ2 distribution.

The variance of the χ2 distribution depends on the probability distribu-tion for the yi. For the true χ2, the variance is readily predicted startingfrom Eq. 2.19.

σ2χ2 =

⟨(χ2)2⟩−(⟨χ2⟩)2

=

⟨(N∑i=1

(yi − µi)2

σ2i

)(N∑j=1

(yj − µj)2

σ2j

)⟩−N2

=N∑i=1

N∑j=1

⟨((yi − µi)2

σ2i

)((yj − µj)2

σ2j

)⟩−N2 (9.5)

where Eq. 9.3 has been used for the expectation value of χ2.In the double sum, there are N2 − N terms where i 6= j and N terms

where i = j. Assuming all yi are statistically independent, Eq. 4.6 appliesand thus each of the terms with i 6= j has an expectation value of one—equalto the product of the expectation value of its two factors (each of which isunity by Eq. 9.2). The N terms with i = j become a single sum of terms ofthe form: 〈(yi − µi)4/σ4

i 〉. Making these substitutions in Eq. 9.5 gives

σ2χ2 =

N∑i=1

⟨(yi − µi)4

σ4i

⟩−N (9.6)

The quantity

βi =

⟨(yi − µi)4

σ4i

⟩(9.7)

is called the normalized kurtosis and is a measure of the peakedness of thedistribution for yi. The more probability density further from the mean (inunits of the standard deviation), the higher the kurtosis.

The expectation values determining βi are readily evaluated for variousprobability distributions using Eq. 2.9 or 2.10. The reader is invited to dothese sums or integrals; only the results are presented here. For yi governedby a Gaussian distribution, the result is βi = 3 and using this in Eq. 9.6 gives

σ2χ2 = 2N (9.8)


For yi governed by a binomial distribution, βi = 3 +Ni/µi(Ni − µi)− 6/Nigiving

σ2χ2 = 2N +

N∑i=1

[Ni

µi(Ni − µi)− 6

Ni

](9.9)

For yi governed by a Poisson distribution, βi = 3 + 1/µi giving

σ2χ2 = 2N +

N∑i=1

1

µi(9.10)

For yi governed by a uniform distribution, βi = 1.8 giving

σ2χ2 = 0.8N (9.11)

Of course, the mean and variance do not provide a complete description ofthe chi-square distribution. However, detailed shapes given in references andstatistics software are only for the standard chi-square distributions wherethe yi follow a Gaussian distribution. Chi-square distributions for y-variablesfollowing non-Gaussian distributions are not usually discussed in the litera-ture.

Equations 9.8-9.11 give the variance of the true χ2 distribution. Howshould the distribution for the best-fit χ2 compare? As already mentioned,no matter what distribution governs the yi, the mean of the distribution forthe best-fit χ2 is N −M . However, only for Gaussian-distributed yi is thevariance of this distribution well known—the best-fit χ2 follows the standardchi-square distribution with N−M degrees of freedom. It has an expectationvalue of N −M and a variance of 2(N −M).

The simulations demonstrated in Chi-Square distributions.xls on the labweb site check the predictions for the means and variances of the true χ2 andthe best-fit χ2 when twenty samples from the same distribution are fit to findthe MLE for the mean (yfit

i = y for all i, N = 20, M = 1). Clicking on thedifferent radio buttons generates results for variables governed by one of thefour distributions: Gaussian, binomial, Poisson, and uniform. In each case,2000 fits are simulated and used to create sample distributions for the trueχ2, the best-fit χ2, and their difference. The sample mean and the samplevariance for these three variables are compared with each other and with thepredictions for those four random variable types.

http://www.phys.ufl.edu/courses/phy4803L/statistics/Chi-square distributions.xls

103

Predictions for the means of the true χ2 (N) and the best-fit χ2 (N −M)agree well with the sample means for these quantities in all cases. Thevariances for the true χ2 given by Eqs. 9.8-9.11 also agree well with thecorresponding sample variances in all cases. For the one question: Whatis the variance of the best-fit χ2?—where the answer, 2(N − M), is onlyknown for the case of Gaussian-distributed yi—the simulations demonstratethe following behavior. As expected for Gaussian-distributed yi, the samplevariance of the best-fit χ2 agreed with the predicted value of 2(N−M)—thatis, it decreased by two in comparison to the variance of the true χ2. Keepin mind that for Poisson- and binomial-distributed yi, the variance of thetrue χ2 is larger than the variance of the true χ2 for Gaussian-distributedyi, while for uniformly-distributed yi, it is smaller. For the former two cases,the sample variance of the best-fit χ2 decreased by more than two, while forthe latter, there was actually a slight increase in the variance. Thus, in allthree cases, it appears that fitting tended to relax the χ2 distribution towardthat associated with Gaussian distributions.

As expected for Gaussian-distributed yi, the best-fit χ2 was always lowerthan the true χ2—not just on average, but for every data set. This was alsotrue for uniformly distributed yi. However, for yi governed by the binomialand Poisson distributions, this was not the case for all data sets. For somedata sets the true χ2 was lower. This is not unexpected. For these twodistributions, y maximizes the log-likelihood. It does not minimize the best-fit chi-square.

The standard chi-square distribution will be assumed appropriate in thefollowing discussions, but if evaluating χ2 probabilities is an important aspectof the analysis, keep this assumption in mind. For example, it would directlyaffect the famous chi-square test, discussed next.

The χ2 test

The chi-square test uses the χ2 distribution to decide whether a χ2 value froma fit is too large or too small to be reasonably probable. While reporting thebest-fit χ2 should be standard practice when describing experimental results,the test itself is no substitute for a graphical evaluation of a fit.

As an example, consider a fit with N −M = 50 degrees of freedom. Themean of the χ2 distribution would be 50 and its standard deviation would beσχ2 =

√2(N −M) = 10. While the χ2 distribution is not exactly Gaussian,


χ2 values outside the two-sigma range from 30-70 should be cause for concern.So suppose the actual χ2 value from the fit is significantly above N −M

and the analysis must decide if it is too big. To decide the issue, the chi-square distribution is used to determine the probability of getting a value aslarge or larger than the actual χ2 value from the fit. If this probability is toosmall to be accepted as a chance occurrence, one must conclude that the χ2

is unreasonably large.In rare cases, the χ2 value from the fit may come out too small—well

under the expected value of N − M . To check if it is too low, the chi-square distribution should be used to find the probability of getting a valuethat small or smaller. If this probability is too low to be accepted as achance occurrence, one must conclude that the χ2 is unreasonably small. Onecaveat seems reasonable. There should be at least three degrees of freedomto test for an undersized χ2. With only one or two degrees of freedom, theχ2 distribution is non-zero at χ2 = 0 and decreases monotonically as χ2

increases. Thus, for these two cases, smaller values are always more likelythan larger values.

If the χ2 is unacceptably large or small, the deviations are not in accordwith predictions and the experimental model and theoretical model are in-compatible. The same problems mentioned earlier for a graphical assessmentmay be applicable to an unacceptable χ2.

Uncertainty in the uncertainty

It is not uncommon to be unsure of the true σi associated with a set ofmeasurements. If the σi are unknown, the chi-square variable can not becalculated and the chi-square test is unusable. More importantly, the σidetermine the weighting matrix

[σ2y

]−1(Eq. 8.30) and via Eq. 8.48 with

Eq. 8.39 they also determine the parameter covariance matrix [σ2a]. If the σi

are unknown or unreliable, the parameter uncertainties would be unknownor unreliable as well.

What can be done in this case? One accepted practice is to adjust themeasurement uncertainties so that the χ2 is equal to its expectation value ofN −M . Forcing χ2 = N −M is a valid method for adjusting uncertain σito achieve a predictable level of confidence in the parameter uncertainties.

For an equally-weighted data set, the technique is straight-forward. Findthat one value for σy that gives χ2 = N−M . Equation 8.54 shows this would

105

happen if the following sample variance of the fit were used for σ2y :

s2y =

1

N −M

N∑i=1

(yi − yfit

i

)2(9.12)

Eq. 8.52 demonstrates that the parameter covariance matrix [σ2a]—every

element—would then be proportional to this sample variance.Note how Eq. 9.12 is a more general version of Eq. 2.23. Equation 9.12

is the general case where the yfiti are based on the M best-fit parameters

of some fitting function. Equation 2.23 can be considered a special casecorresponding to a fit to the constant function, where all yfit

i are the sameand given by the single (M = 1) best-fit parameter y.

Exercise 5 and Eq. 9.4 demonstrate that Eqs. 2.23 and 9.12 give a samplevariance s2

y that is an unbiased estimate of the true variance σ2y. For an

equally-weighted fit then, forcing χ2 = N −M is the simple act of using syfor σy in the formulas for the parameter covariance matrix.

If the σi vary from point to point, forcing χ2 = N −M proceeds a bitdifferently. The relative sizes of the σi must be known in advance so thatforcing χ2 = N −M determines a single overall scale factor for all of them.Relatively-sized σi might occur when measuring wide-ranging quantities withinstruments having multiple scales or ranges. In such cases, the measurementuncertainty might be assumed to scale in some known way with the instru-ment range used for the measurement.

Initial estimates for the input σi would be set in accordance with theknown ratios. The fit is performed and the χ2 is evaluated. Then a singlemultiplier would be determined to achieve a chi-square value of N − M .Scaling all the σi by a common factor κ =

√χ2/(N −M), scales the present

χ2 by a factor of 1/κ2 (to N −M). The input covariance matrix [σ2y] would

scale by κ2 and Eq. 8.48 with Eq. 8.39 implies that the parameter covariancematrix [σ2

a] would scale by κ2 as well. On the other hand, Eq. 8.42 with the

equations for[Jay]†

show that the fit parameters are unaffected by the scalefactor. There is no effect on the fit parameters because the scaling does notchange the relative weighting of the data points.

When the σi are known in advance, the normal randomness in the dataset deviations leads to a randomness in the χ2 value for the fit. When theσi are adjusted to make χ2 = N −M , that randomness is transferred to themeasurement uncertainties and to the parameter uncertainties. The covari-


ance matrices [σ2y ] and [σ2

a] become random variables. They become samplecovariance matrices and should be written [s2

y] and [s2a].

For example, parameter standard deviations obtained from the diagonalelements of [s2

a] become sample standard deviations. As discussed shortly,confidence intervals constructed with sample standard deviations have differ-ent confidence levels than those constructed with true standard deviations.This issue will be addressed in the section on Student-T probabilities.

Forcing χ2 = N −M uses the random fit residuals to determine the siused as estimates for the true σi. Consequently, the technique is better suitedfor large data sets where the high number of residuals ensure a reasonableprecision for those estimates. It is less suited for smaller data sets where it isnot all that unlikely to get residuals having rms values that are significantlylarger or smaller than expected.

To understand the issue more clearly, consider an experiment with N =M + 200 data points all with equal uncertainty. The expectation value of χ2

is then 200 with a standard deviation of 20. Suppose that an initial estimateof σy makes the the best-fit χ2 come out 100. The chi-square distributionsays that with 200 degrees of freedom, a value outside the range 200± 60 isless than 1% likely, so a value of 100 means something is amiss. Scaling theσy down by a factor of

√2 will raise the χ2 to 200—the predicted average

χ2 for this case. Your confidence in the original σy was not all that high andscaling it that amount is deemed reasonable. The residuals are examinedand seem random. This is a good case for forcing χ2 = N −M . The degreesof freedom are so high that using sy for σy may give a better estimate of theparameter covariance matrix than one obtained using a coarse estimate of σybased on other considerations.

Next consider an equally-weighted data set where N = M + 5 so that theexpectation value of χ2 is five. Suppose that using an initial estimate of σy,the best-fit chi-square comes out ten. The residuals are random and showno systematic deviations from the fit. This time, σy would have to be scaledup by a factor of

√2 to bring the χ2 down to five. Again, your confidence in

σy is not high and you are not uncomfortable scaling them this much. Butshould you? With five degrees of freedom, a χ2 value of 10 or larger is notall that rare and is expected to occur with about an 8% probability. Withfewer degrees of freedom, it is not as clear whether to abandon the originalσy and force χ2 = N −M .

When should the σi be scaled to give χ2 = N − M? The techniqueis appropriate whenever the σi are unknown or uncertain. On the other

107

hand, the ability to validate the theory is lost. Theory can’t be deemed inagreement with experiment if the measurement uncertainties are infinitelyadjustable. Consequently, an independent coarse estimate of the σi basedon experimental considerations should still be made even if they will thenbe scaled to force χ2 = N −M . Scaling the σi by factor of two might bedeemed acceptable, but scaling them by a factor of ten might not. Oneshould certainly be wary of accepting a scale factor much bigger than 3 ormuch smaller than 1/3. Even relatively sloppy experimental work should becapable of predicting measurement uncertainties at this level. Moreover, onemust still examine residuals for systematic deviations. Masking systematicerrors or incorrect models by arbitrarily increasing the σi is not what forcingχ2 = N −M is meant to do.

The reduced chi-square distribution

Dividing a chi-square random variable by its degrees of freedom N−M givesanother random variable called the reduced chi-square random variable.

χ2ν =

χ2

N −M(9.13)

Dividing any random variable by a constant results in a new randomvariable with a mean divided down from that of the original by that constantand with a variance divided down from that of the original by the square ofthat constant. Thus, the reduced chi-square distribution will have a mean ofone for all degrees of freedom and a variance equal to 2/(N−M). The reducedchi-square value is the preferred statistic by many researchers because forlarge degrees of freedom, good reduced chi-square values are always aroundone.

Reduced chi-square distributions for various degrees of freedom are shownin Fig. 9.2. A table of reduced chi-square values/probabilities is given inTable 10.3. The table can also be used for determining χ2 probabilities usingthe scaling described above. For example, with 100 degrees of freedom, theprobability a χ2 will exceed 120 is the same as the probability that a χ2

ν (with100 degrees of freedom) will exceed 1.2, which is about 8 percent.

Dividing both sides of Eq. 9.12 by σ2y and eliminating the sum using


0

1

2

3

0 1 2 3

p ()2

2

Figure 9.2: The reduced chi-square pdfs p(χ2ν) for degrees of freedom (dof)

1, 2, 4, 10, 30, 100. The tall distribution peaking at χ2ν = 1 is for dof = 100. The

curves get broader and lower as the dof decrease. For dof = 1, the distribution(red) is singular at zero.

Eq. 8.54 gives

s2y

σ2y

=χ2

N −M(9.14)

and shows that the ratio of the sample variance to the true variance is areduced chi-square variable with N −M degrees of freedom.

The fact that the variance of χ2ν decreases as the sample size N increases,

implies that the reduced chi-square distribution becomes more sharply peakedaround its expectation value of one. The peaking makes sense with respectto the law of large numbers. A reduced chi-square of one means s2

y/σ2y = 1

and so as N increases, the s2y determined from the data becomes a more and

more precise estimate of the true variance σ2y.

For large N −M , the chi-square and the reduced chi-square distributionsare approximately Gaussian—the former with a mean of N−M and standarddeviation of

√2(N −M), and the latter with a mean of one and a standard

deviation of√

2/(N −M). This approximation is used in the next exercise.

Exercise 6 It is often stated that uncertainties should be expressed with onlyone significant figure. Some say two figures should be kept if the first digitis 1. Roughly speaking, this suggests uncertainties are only good to about10%. Suppose you take a sample set of yi and evaluate the sample mean y.

109

For the uncertainty, you use the sample standard deviation of the mean sy.Show that it takes around 200 samples if one is to be about 95% confidentthat sy is within 10% of σy. Hint: sy will also have to be within 10% ofσy. Thus, you want to find the number of degrees of freedom such that theprobability P (0.9σy < sy < 1.1σy) ≈ 0.95. Convert this to a probability on χ2

ν

and use near-Gaussian limiting behavior appropriate for large sample sizes.Then show you can use Table 10.3 and check your results.

Confidence intervals

Consider a single set of M fitting parameters ak and their covariance matrix[σ2a] obtained after proper data collection and analysis procedures. Recall

that the set of ak should be regarded as one sample governed by some jointprobability distribution having the known covariance matrix and some un-known true means αk. In the absence of systematic error, and subject to thevalidity of the first-order Taylor expansions, those means should be the truevalues and the ultimate target of the experiment. What can be said aboutthem?

The ak and [σ2a] can simply be reported, leaving the reader to draw conclu-

sions about how big a deviation between the ak and αk would be consideredreasonable. More commonly, the results are reported using confidence in-tervals. Based on the best-fit ak and the [σ2

a], a confidence interval consistsof a range (or interval) of possible parameter values and a probability (orconfidence level) that the true mean will lie in that interval.

To create confidence intervals requires that the shape of the joint distribu-tion be known. Its covariance matrix alone is insufficient. Often, confidenceintervals are created assuming the variables are governed by a Gaussian jointdistribution, but they can also take into account other information, such asa known non-Gaussian distribution for the ak, or uncertainty in the [σ2

a].If a random variable y follows a Gaussian distribution, a y-value in the

range µy±σy occurs 68% of the time and a y-value in the range µy±2σy occurs95% of the time. This logic is invertible and one can construct confidenceintervals of the form

y ± zσyfor any value of z and the probability such an interval will include the truemean µy will be 68% for z = 1, 95% for z = 2, etc. Such confidence intervals


and associated probabilities are seldom reported because they are well knownand completely specified once y and σy are given.

Note that the discussion above is unmodified if y is one variable of acorrelated set having the given σ2

y but otherwise arbitrary covariance matrix.For a Gaussian joint distribution, the unconditional probability distributionfor y—obtained by integrating over all possible values for the other variablesin the set—can be shown to be the single-variable Gaussian distributionwith a variance of σ2

y and a mean µy. It does not depend on any of the otherrandom variable values, variances, or covariances.

Occasionally, one might want to describe confidence intervals associatedwith multiple variables. For example, what is the confidence level that both oftwo means are in some stated range. If the variables are statistically indepen-dent, the probabilities for each variable are independent and the probabilityfor both to be in range is simply the product of the probabilities for eachto be in range. When the variables are correlated, the calculations are moreinvolved and only one will be considered.

A constant-probability contour for two correlated Gaussian variables, sayy1 and y2, governed by the joint distribution of Eq. 4.24, is an ellipse wherethe argument of the exponential in the distribution is some given constant.The ellipse can be described as the solution to

χ2 =(yT − µT

) [σ2y

]−1(y − µ) (9.15)

where χ2 is a constant. Eq. 4.24 shows that e−χ2/2 would give the probability

density along the ellipse as a fraction of its peak value at y = µ where theellipse is centered. For example, on the χ2 = 2 ellipse the probability densityis down by a factor of 1/e.

The probability for (y1, y2) values to be inside this ellipse is the integralof the joint distribution over the area of the ellipse and is readily shown to be1− exp(−χ2/2). For χ2 = 1—akin to a one-sigma one-dimensional interval,the probability is about 39% and for χ2 = 4—akin to a two-sigma interval,the probability is about 86%.

Reversing the logic, this same ellipse can be centered on the measured(y1, y2) rather than on the means (µ1, µ2). If the (y1, y2) value were in-side the ellipse centered at (µ1, µ2), then (µ1, µ2) would be inside that sameellipse centered on (y1, y2). Since the former occurs with a probability of1− exp(−χ2/2), this is also the probability (confidence level) for the latter.

Two-dimensional confidence ellipses do not change if the two variables

111

involved are part of a larger set of correlated variables. One can show that in-tegrating the joint Gaussian distribution over all possible values for the othervariables leaves the two remaining variables described by a two-dimensionaljoint Gaussian distribution with the same means, variances, and covarianceas in the complete distribution.

Student-T probabilities

When the χ2 is forced to N−M , the parameter covariance matrix is a randomvariable—a sample covariance matrix—and confidence levels described abovechange somewhat. Only single-variable confidence intervals and Gaussiandistributed variables will be considered for this case.

When using a fitting parameter ak and its sample standard deviation sk,one can again express a confidence interval in the form

ak ± zsk

However, now that the interval is constructed with a sample sk rather thana true σk, z = 1 (or z = 2) are not necessarily 68% (or 95%) likely toinclude the true value. William Sealy Gosset, publishing around 1900 un-der the pseudonym “Student” was the first to determine these “Student-T”probabilities.

A difference arises because sk might, by chance, come out larger or smallerthan σk. Recall its size will be related to the random scatter of the data aboutthe best fit. When the probabilities for all possible values of sk are properlytaken into account, the confidence level for any z is always smaller than wouldbe predicted based on a known σk.

In effect, the uncertainty in how well sk estimates σk decreases the con-fidence level for any given z when compared to an interval constructed witha true σk. Because the uncertainty in sk depends on the degrees of freedom,the Student-T confidence intervals also depend on the degrees of freedom.The larger the degrees of freedom, the better the estimate sk becomes andthe closer the Student-T intervals will be to the corresponding Gaussian in-tervals.

Table 10.4 gives some Student-T probabilities. As an example of its use,consider five sample yi values obtained from the same Gaussian distribution,which are then used to calculate y and sy. There are four degrees of freedomfor an sy calculated from five samples. Looking in the row for four degrees


of freedom, the 95% probability interval for the true mean is seen to bey ± 2.78sy. If one were ignorant of the Student-T probabilities one mighthave assumed that a 95% confidence interval would be, as for a Gaussian,y ± 2 sy.

Exercise 7 Three sample values from a Gaussian pdf are 1.20, 1.24, and1.19. (a) Find the sample mean, sample standard deviation, and samplestandard deviation of the mean and give the 68% and 95% confidence intervalsfor the true mean based on this data alone. (b) Now assume those threesample values are known to come from a pdf with a standard deviation σy =0.02. With this assumption, what are the 68% and 95% confidence intervals?Determine the reduced chi-square and give the probability it would be this bigor bigger.

The ∆χ2 = 1 rule

An optimization routine, such as the one available in Excel’s Solver program,can be quite suitable for performing linear or nonlinear regression. A tutorialon using Solver for that purpose is given in Chapter 10. Unfortunately,Solver does not directly provide the fitting parameter uncertainties, but cando so via additional steps based on the “∆χ2 = 1” rule. The procedurefinds individual elements of [σ2

a] based on how fast the χ2 changes when thecorresponding ak are changed. As an added benefit, the same procedure candetermine the parameter covariances and it can provide a check on the rangeof validity of the first-order Taylor expansion.

The best-fit χ2 is the reference value for calculating ∆χ2. That is, ∆χ2

is defined as the difference between a chi-square calculated using some non-optimized trial solution and the chi-square at the best-fit. The best-fit chi-square is defined by Eq. 8.15 (or Eq. 8.77, which is also valid for non-diagonal[σ2y ]). The trial parameter set, denoted atrial

k , will be in the neighborhoodof the best-fit ak and when used with the fitting function, will give a non-optimized fit

ytriali = Fi(atrial

k ) (9.16)

The non-optimized χ2 is then determined from Eq. 8.15 or Eq. 8.77 usingthe ytrial

i for yfiti in those formulas. Using the linear algebra form of Eq. 8.77,

113

∆χ2 is thus defined

∆χ2 =(y − ytrial

)T [σ2y

]−1 (y − ytrial

)(9.17)

−(y − yfit

)T [σ2y

]−1 (y − yfit

)It is important to point out that

[σ2y

]−1, and thus the sample variances σ2

i ,must be kept the same when evaluating both halves of ∆χ2. If the σ2

i arefit-dependent, they should be fixed at their best-fit values when calculating[σ2y

]−1and both contributions to ∆χ2. For Poisson yi, for example, where

σ2i = yfit

i , continue using the best-fit values even when evaluating the non-best-fit χ2—do not use σ2

i = ytriali . The σi determine the covariance matrix

[σ2a] we are trying to determine and should be kept constant. Only the fitting

function should be allowed to vary.The first-order Taylor expansion for each ytrial

i as a function of atrial aboutthe best-fit becomes

ytriali = yfit

i +M∑k=1

∂Fi∂ak

(atrialk − ak) (9.18)

Or, in linear algebra form

ytrial = yfit + [Jya ] ∆a (9.19)

where the Jacobian is evaluated at the best fit and ∆a = atrial−a gives thedeviation of the fitting function parameters from their best-fit values.

Defining∆y = y − yfit (9.20)

and substituting Eq. 9.19 into Eq. 9.17 then gives

∆χ2 =(

∆yT −∆aT [Jya ]T) [σ2y

]−1(∆y − [Jya ] ∆a)

−∆yT[σ2y

]−1∆y

= ∆aT [Jya ]T[σ2y

]−1[Jya ] ∆a

−∆aT [Jya ]T[σ2y

]−1∆y −∆yT

[σ2y

]−1[Jya ] ∆a

= ∆aT [X]∆a

= ∆aT[σ2a

]−1∆a (9.21)


where the best-fit condition—Eq. 8.71, [Jya ]T[σ2y

]−1∆y = 0 and its transpose—

were used to eliminate the last two terms in the second equation. Eq. 8.39

([X] = [Jya ]T[σ2y

]−1[Jya ]) was used to get to the third equation and Eq. 8.48

([X]−1 = [σ2a]) was used to get the final result.

Because this derivation relies on the validity of the first-order Taylorexpansion (Eq. 9.19), Eq. 9.21 is exact for linear fitting functions, but fornonlinear fitting functions is limited to ∆a values that are small enough forthe expansion to be valid.

The ∆χ2 = 1 rule effectively solves Eq. 9.21 for particular elements of [σ2a]

from values of ∆χ2 evaluated using Eq. 9.17 for a given ∆a. The ∆χ2 = 1rule is derived in the Regression Algebra addendum and can be stated asfollows.

If a fitting parameter is offset from its best-fit value by its stan-dard deviation, i.e., from ak to ak±σk, and then fixed there whileall other fitting parameters are readjusted to minimize the χ2, thenew χ2 will be one higher than its best-fit minimum.

Where Eq. 9.21 is valid, so is the following equation—a more general formof the ∆χ2 = 1 rule showing explicitly the expected quadratic dependence of∆χ2 on the change in ak.

σ2k =

(atrialk − ak)2

∆χ2(9.22)

Here, ∆χ2 is the increase in χ2 after changing ak from its best-fit valueto atrial

k . However, it is not the increase immediately after the change. Itis important to appreciate ∆χ2 is the increase from the best-fit value onlyafter re-fitting all other fitting parameters for a minimum χ2 obtained whilekeeping the one selected atrial

k fixed. The immediate change in ∆χ2 is likelyto be larger than predicted by Eq. 9.22. However, depending on the degree ofcorrelation among the ak, some of the immediate increase upon changing toatrialk will be canceled after the re-fit. Re-minimizing by adjusting the other

parameters is needed to bring the ∆χ2 into agreement with Eq. 9.22.The covariances between ak and the other parameters can also be deter-

mined by keeping track of the parameter changes after the re-fit. If someother parameter with a best-fit value of am goes to a′m after the re-fit, thecovariance between ak and am, including its sign, is given by

σkm =(atrialk − ak)(a′m − am)

∆χ2(9.23)

http://www.phys.ufl.edu/courses/phy4803L/statistics/matproof.pdf

115

A coarse check on the validity of the first-order Taylor expansion is impor-tant when working with nonlinear fitting functions and is easily performedwith the same calculations. One simply checks that the variances and covari-ances obtained with Eq. 9.22 and 9.23 give the same result using any atrial

k

in the interval: ak ± 3σk—a range that would cover most of its probabilitydistribution. The check should be performed for each parameter of interest,varying atrial

k by small amounts both above and below the best-fit value andsized so that ∆χ2 values come out around 0.1 and around 10. If the first-order Taylor expansion is accurate over this range, all four values should benearly the same. If the σk vary significantly, the first-order Taylor expansionmay not be a good approximation over the ak ± 3σk range. Nonlinearitiescan skew the parent distribution and bias the fitting parameter.

Chapter 10

Regression with Excel

Excel is more business software than scientific software. Nonetheless, becauseit includes Visual Basic for Applications, or VBA—a programming languageavailable within—Excel makes a very reasonable platform for many statisticalanalysis tasks. Familiarity with Excel is assumed in the following discussion,e.g., the ability to enter and graph data and evaluate formulas.

Linear algebra formulas are evaluated using array formulas in Excel. Touse an array formula, select the appropriate rectangular block of cells (one- ortwo-dimensional), click in the edit box, enter the array formula, and end theprocess with the three-key combination Ctrl|Shift|Enter. The entire blockof cells is evaluated according to the formula and the results are placedin the block. Errors with various behaviors result if array dimensions areincompatible with the particular operation or if the selected block is not ofthe appropriate size. Excel will not allow you to modify parts of any arrayarea defined from an array formula. For example, Clear Contents only worksif the cell range covers the array completely.

The built-in array functions needed for linear regression formulas are:

TRANSPOSE(array): Returns the transpose of a vector or matrix.

MMULT(array1, array2): Returns the result of vector and matrix multi-plication.

MINVERSE(array): Returns the inverse of an invertible matrix.

MDETERM(array): Returns the determinant of a square matrix.

117

118 CHAPTER 10. REGRESSION WITH EXCEL

The StatsMats.xla Excel add-in can be found on the lab web site andadds the following array formulas specifically designed for regression:

CovarMat(array): Takes a one-dimensional input array (row or columnform) of σi values and returns the corresponding square diagonal co-variance matrix for independent variables.

WeightMat(array): Takes the same input as CovarMat, but returns thediagonal weighting matrix.

SDFromCovar(array): Takes a two-dimensional covariance matrix and re-turns a one-dimensional array of standard deviations from its diagonalelements.

Linear regression with Excel

Linear regression formulas are easily programmed using the built in arrayfunctions and those in the StatsMats add-in. For an example, see the LinearRegression.xls spreadsheet on the lab web site, which includes the StatsMatsfunctions as a VBA module.

Excel has its own linear regression program, but it is limited to equally-weighted fits. It does not allow the user to provide σy. It uses the sy ofEq. 9.12 for σy in determining the parameter covariance matrix. That is, itforces χ2 = N −M .

The starting point is to enter the N data points xi and yi, each as acolumn of length N . The steps will be illustrated for a quadratic fit: yfit

i =a1 + a2xi + a3x

2i (M = 3) having the basis functions: f1(xi) = 1, f2(xi) = xi,

and f3(xi) = x2i . Next, create a contiguous group of M more columns, one

for each of the fk(xi). For the example, a columns of 1’s, then a columnof xi and then a column of x2

i would make the parameters in the vectorsand matrices appear in the order a1, a2 then a3. In turns out unnecessaryto create the column of ones for the a1 parameter. The Excel regressionprogram can add a constant to any fitting function without one. If linearalgebra formulas will be used, however, creating the column of ones for theadditive constant would be required.

Excel’s linear regression program requires that the “Analysis Toolpak”add-in be installed and loaded. The location of the regression program in

http://www.phys.ufl.edu/courses/phy4803L/statistics/StatsMats.xla



119

Figure 10.1: Excel’s linear regression dialog box

the Excel menu system varies for different versions of Excel. Look for it in“Data Analysis Tools.” The dialog box appears as in Fig. 10.1.

Select the column containing the yi-values for the Input Y-Range. For theInput X-Range, select what must be a contiguous rectangular block containingall fk(xi) values—the Jacobian [Jya ]. For the quadratic fit, either the twocolumns containing xi and x2

i or three columns—if the column of ones is alsoincluded. Leave the Constant is Zero box unchecked if only the xi and x2

i

columns were provided. It would be checked if the fitting function did notinclude a constant term or if a constant term is included as a column of onesin the Jacobian. Leave the Labels box unchecked. If you would like, checkthe Confidence Level box and supply a probability for a Student-T intervalnext to it. Intervals for the 95% confidence level are provided automatically.Select the New Worksheet Ply radio button or the Output Range. For thelatter, also specify the upper left corner of an empty spreadsheet area for theresults. Then click OK.

The output includes an upper Regression Statistics area containing a pa-rameter labeled Standard Error. This is the sample standard deviation syfrom Eq. 9.12. The lower area contains information about the constant—labeled Intercept—and the fitting parameters ak—labeled X Variable k. Nextto the best-fit values—labeled Coefficients—are the parameter sample stan-


dard deviations sk—labeled Standard Error. The last two double columnsare for the lower and upper limits of Student-T intervals at confidence levelsof 95% and the user specified percentage.

General regression with Excel

Excel’s Solver can find the best-fit parameters for both linear or nonlinearfitting functions. It can handle unequally weighted data and even correlatedyi. The Solver requires the user to construct a cell containing either a chi-square to minimize or a log likelihood to maximize.

For pure Gaussian-, binomial-, or Poisson-distributed yi with negligibleuncertainty in the independent variables, one would construct the appropri-ate log likelihood function L and maximize that once as describe below.

For all other cases, iteratively reweighted least squares can be used. Recallthis technique is used when the σ2

i depend on the best fit. It requires iterationuntil minimizing the χ2—using fixed σ2

i that have been evaluated at the bestfit—leads to the best fit. It should give exactly the same result as maximizingL for pure Gaussian-, binomial- or Poisson-distributed yi and would be theonly technique to use if uncertainties in the independent variables also hadto be taken into account.

Of course, for Gaussian-distributed yi, maximizing L is the same as min-imizing χ2 and is the more common practice. But even if a binomial orPoisson L will be maximized, a χ2 calculation is still needed. Not only is it arandom variable associated with the goodness of fit, it is also needed to findthe parameter uncertainties using the ∆χ2 = 1 rule and it is used in checkingthe validity of the first-order Taylor expansions.

The following instructions describe a single iteration and the iterationmethod assuming that the σ2

i or the input covariance matrix [σ2y] depend on

the best fit.

1. Set up an area of the spreadsheet for the fitting parameters ak. Us-ing Solver will be somewhat easier if the parameters are confined to asingle block of cells. Enter initial guesses for the values of each fittingparameter.

2. Enter the data in columns, starting with xi and yi.

3. Use the fitting function to create a column for yfiti from the xi and the

initial guesses for the ak.

121

4. Add additional columns for other input information or derived quanti-ties as needed. For example, if the xi and/or yi are Gaussian-distributedrandom variables, with known uncertainties, add a column for the rawstandard deviations σxi and/or σyi. If the yi are from a binomial orPoisson distribution, create columns for σi and σ2

i according to Eq. 8.11or 8.13. If there is uncertainty in the independent variables, create acolumn for the derivative of the fitting function ∂Fi/∂xi, and anotherfor the final σ2

i of Eq. 8.75.

5. Create the main graph. Plot the (xi, yi) data points with error bars,but without connecting lines. Add a plot of yfit vs. x as a smoothcurve without markers. This plot may use the columns of yfit

i and xialready constructed, but if there are not enough xi to produce a smoothcurve or if you would like this curve to show values of yfit at values of xoutside the measured range, make another column with the appropriatex-values, evaluate the yfit-values there, and use these two columns forthe plot.

6. Verify that the column of yfit values and its graph depend on the valuesplaced in the cells for the fit parameters. Adjust the fit parametersmanually to get the fitted curve close to the data points. Solver maynot find the best-fit parameters if the starting values are not closeenough.

7. Columns for the deviations yi−yfiti (residuals) and normalized residuals

(yi − yfiti )/σi are also handy—both for evaluating χ2 and for making

plots of these quantities versus xi for fit evaluation.

8. Construct the necessary χ2 cell. If a log likelihood maximization is tobe performed, construct the necessary L cell as well.

If needed, the L cell is easily constructed according to Eq. 8.6 or 8.7. Anextra column for the N terms in L should be constructed for summing.

If the yi are correlated, the χ2 of Eq. 8.77 would be used and a repre-

sentation of the N × N covariance matrix [σ2y] or weighting matrix

[σ2y

]−1

would need to be specified. Often matrix calculations and array formulasare involved. Otherwise, the columns already constructed should make for astraightforward calculation of the χ2 according to Eqs. 8.15 or 8.77.

If the σ2i or [σ2

y] depend on the best fit, a constant copy is needed for χ2

minimization or for applying the ∆χ2 = 1 rule. Thus, after using the most


Figure 10.2: Excel’s Solver dialog box

recent ak and yfiti to create the calculated column for σ2

i or the N ×N blockfor [σ2

y], reserve a similarly sized block of cells for the copy.A simple way to make the constant copy is to use the spreadsheet Copy

command from the calculated block and then the Paste Special|Values com-mand to the block reserved for the copy. This sets the copy to fixed numberswhile leaving the formulas for the calculated values undisturbed. Be sureto use the constant copy for the calculation of the χ2 cell; when the χ2 isminimized with respect to the parameters, only the yfit

i can be allowed tovary—not the σ2

i or [σ2y].

To run one iteration of the Solver:

9. Invoke the Solver. Installing it and finding it in the various versions ofExcel differ. The dialog box is shown in Fig. 10.2.

10. Provide the cell address of the χ2 or L in the Set Target Cell: box. Clickthe Min radio button next to Equal To: for a χ2 minimization or clickthe Max button for an L maximization.

11. Provide the cell addresses for the fitting parameters in the By ChangingCells: box.

12. Click on the Solve button. The solver’s algorithms then start withyour initial fitting parameters, varying them to find those values whichminimize χ2 or maximize L.

13. Accept the final, changed, fitting parameters.

123

If iterative reweighting is needed, remember to repeat the Copy–PasteSpecial|Values procedure to update the constant copy of the σ2

i or [σ2y] with

the new values based on the most recent fit parameters. Repeat iterations ofthe Solver until there are no significant changes to the ak.

Parameter variances and covariances

Solver does not provide parameter variances or covariances. The best wayto get them is to construct the Jacobian matrix (according to Eq. 8.31) sothat the entire covariance matrix for the parameters [σ2

a] can be obtainedfrom Eq. 8.49 using array formulas. Otherwise, each parameter’s varianceand its covariances with the other parameters can be individually calculatedusing the following procedure based on the ∆χ2 = 1 rule. Even if Eq. 8.49is used, this procedure is still valuable for checking whether the probabilitydistribution for the parameters will be Gaussian.

The procedure is a numerical recipe for determining elements of [σ2a] by

checking how χ2 changes when parameter values are varied from the best-fitak. Remember, only the yfit

i should be allowed to change as the parametersare adjusted; the σ2

i should not. If the σ2i or [σ2

y ] are calculated, use theconstant copy evaluated at the best fit.

14. Write down or save to a separate area the best-fit values of all fitting pa-rameters and the best-fit χ2. Again, use the Copy–Paste Special|Valuesprocedure.

15. Change the value in the cell containing one of the fitting parameters,say ak, by a bit—try to change it by what you suspect will be its un-certainty. Call this new (unoptimized) value atrial

k . The χ2 will increasebecause it was originally at a minimum.

16. Remove the cell containing ak from the list of adjustable parametersand rerun the Solver. The other parameters might change a bit andthe χ2 might go down a bit, but it will still be larger than best-fit χ2.

17. If χ2 increased by one, then the amount that ak was changed is itsstandard deviation: σk = |atrial

k −ak|. If the increase in ∆χ2 is more (orless) than unity, the tested change in ak is larger (or smaller) than σk.Equation 9.22 can then be used to determine σk from the parameterchange atrial

k − ak and ∆χ2.


18. The covariances between ak and the other parameters can also be de-termined by keeping track of the changes in the other parameters afterthe re-optimization. Equation 9.23 can then be used to solve for thecovariances.

19. Check that these procedures gives roughly the same σk and σkm usingatrialk values both above and below ak and sized such that ∆χ2 values

are around 0.1 and around 10. If σk is roughly constant for all fourcases, one can be reasonably assured that confidence intervals ak± zσkshould be close to the Gaussian predictions up to z = 3.

Cautions

Determining when to stop the iteration process is not a simple matter. Thereis often a termination parameter giving the smallest fractional change in thetarget cell from one iteration to the next for which another iteration will beperformed. The Solver default is 10−4. There will also be a maximum numberof iterations to ensure the program stops eventually—even if the terminationcondition is never met. The Solver default is 100. These parameters can beadjusted by clicking on the Options button in the Solver dialog box. Alwaysmake sure the program has actually found the true extremum and has notterminated prematurely.

The Solver and other such programs may fail to find the best fit for avariety of reasons. If the initial parameter guesses are not close enough tothe best-fit values, they may need to be readjusted before the program willproceed to the solution. If a parameter wanders into unphysical domainsleading to an undefined function value, constraints may need to be placed onits allowed values.

Solver sometimes has problems with large or small parameters or targetcells and may need them to be rescaled. For example, if the y-values of ameasured exponential decay are of order 106 while the mean lifetime is oforder 10−6 s, and the background is around 103, rather than fit to Eq. 8.55,Solver performs better with

yfiti = 106a1e

−106ti/a2 + 103a3

so that all three fitting parameters are of order unity.

125

xi yi

2 2.43 6.75 27.86 43.28 80.79 104.5

Table 10.1: Data for regression exercises.

Regression exercises

In these last exercises you will perform equally-weighted fits for the data ofTable 10.1 to a quadratic formula: yfit

i = a1 + a2xi + a3x2i . You will solve it

three ways: using Excel’s linear regression program, programming the linearalgebra regression formulas, and using the Solver program. Augment andsave the spreadsheet as you work through the exercises, but the questionsmust be answered on a separate sheet clearly labeled with the question partand the answer. Feel free to cut and paste from the spreadsheet, but it willnot be opened or graded. This is a fit of N = 6 points to a formula withM = 3 fitting parameters; it is said to have N −M = 3 degrees of freedom.Keep this in mind when discussing χ2, χ2

ν or Student-T probabilities, whichdepend on the degrees of freedom.

Exercise 8 Start by creating the 6-row column vector for y, i.e., the yi inTable 10.1 and the 6×3 Jacobian matrix [Jya ], i.e., three side-by-side columnsof 1’s for a1, xi for a2, and x2

i for a3. This matrix will also be needed shortlywhen the regression formulas are used. Use it now as you solve the problemusing Excel’s linear regression program. Remember that if you include thecolumn of 1’s in the Input X Range, the Constant is Zero box must be checked.(a) Locate the parameters ak and their sample standard deviations sk. (b)Locate the sample standard deviation sy and show a calculation of its value

according to Eq. 9.12. Hint: Add a column for yfiti and another for yi − yfit

i .Then use the Excel SUMSQ(array) function to do the sum of squares.

Exercise 9 Now use the equally weighted linear regression formulas to solvethe problem. First create the 3 × 3 matrix for [Xu]

−1 based on Eq. 8.51 for


[Xu]. Hint: This formula and those to follow will require array formulas andExcel’s MMULT, TRANSPOSE, and MINVERSE array functions. Next,construct the 3-row column vector for the best fit parameters a accordingto Eq. 8.53 and show they agree with Excel’s regression results. Finally,construct the parameter covariance matrix [σ2

a] from Eq. 8.52 using σy =sy and show how the parameter sample standard deviations sk provided byExcel’s regression program can be obtained from this matrix.

Exercise 10 Excel’s linear regression program assumes no prior knowledgeof σy (except that it is the same for all yi). Because Excel uses the randomvariable sy as an estimate of σy, the [σ2

a] obtained with it is actually a sam-ple covariance matrix and the sk are sample standard deviations. It is alsowhy Student-T probabilities are used when Excel constructs confidence inter-vals. Give the linear regression program’s 95% confidence interval (for thequadratic coefficient only) and show how it can be obtained from a3, s3 andthe Student-T table.

Exercise 11 If the true σy is known, its value should be used in Eq. 8.52for [σ2

a], which then becomes a true covariance matrix. For this questionassume the yi of Table 10.1 are all known to come from distributions withσy = 0.5. The data are still equally weighted and thus the best fit parametervalues and the sample standard deviation sy do not change. Give the correctparameter standard deviations for this case. How should the sk from Excel’slinear regression program be scaled when σy is known? Would the parameterstandard deviations now be sample values or true values? Give the 95%confidence interval for the quadratic coefficient. Should you use Student-Tor Gaussian probabilities?

Exercise 12 (a) Do a graphical evaluation of the fit assuming σy = 0.5.Add a cell for σy and reference it to evaluate χ2 according to Eq. 8.15 orEq. 8.54. Evaluate χ2

ν according to Eq. 9.13 or χ2ν = s2

y/σ2y. Note that these

four formulas for χ2 and χ2ν are just different ways to calculate and describe

the exact same information about the actual and expected deviations. Whatis the probability that a χ2 or χ2

ν random variable would come out as big orbigger than it did here? (b) For this data and fit, how small would σy haveto get before one would have to conclude (at the 99% level, say) that the fitdeviations are too big to be reasonable?

127

Exercise 13 (a) Use Solver to do the fit. The Solver will be used in a singleχ2 minimization; no iteration will be needed. Demonstrate that the ak do notdepend on the value used for σy—that with any σy the optimized ak are thesame as those from the linear regression program.(b) Use the ∆χ2 = 1 rule to determine the parameter standard deviations.Start by assuming σy is unknown and use the sample standard deviation syfor σy in the calculations. Recall, this means the parameter covariance matrixand standard deviations will be sample values. What value of χ2 does thisproduce? Why is this value expected? Use the ∆χ2 = 1 rule to determinethe sample standard deviation s3 for the a3 fit parameter only. Show it is thesame as that obtained by Excel’s linear regression.(c) Now assume it is known that σy = 0.5 and use it with the ∆χ2 = 1 ruleto determine the standard deviation σ3 in the a3 parameter. Compare this σ3

value with the s3 value from part (b) and show that they scale as predicted byEq. 8.52.


Gaussian

probabilities µ µ LFσ

z 0.00 0.02 0.04 0.06 0.08

0.00 0.00000 0.00798 0.01595 0.02392 0.031880.10 0.03983 0.04776 0.05567 0.06356 0.071420.20 0.07926 0.08706 0.09483 0.10257 0.110260.30 0.11791 0.12552 0.13307 0.14058 0.148030.40 0.15542 0.16276 0.17003 0.17724 0.18439

0.50 0.19146 0.19847 0.20540 0.21226 0.219040.60 0.22575 0.23237 0.23891 0.24537 0.251750.70 0.25804 0.26424 0.27035 0.27637 0.282300.80 0.28814 0.29389 0.29955 0.30511 0.310570.90 0.31594 0.32121 0.32639 0.33147 0.336461.00 0.34134 0.34614 0.35083 0.35543 0.35993

1.10 0.36433 0.36864 0.37286 0.37698 0.381001.20 0.38493 0.38877 0.39251 0.39617 0.399731.30 0.40320 0.40658 0.40988 0.41308 0.416211.40 0.41924 0.42220 0.42507 0.42785 0.430561.50 0.43319 0.43574 0.43822 0.44062 0.44295

1.60 0.44520 0.44738 0.44950 0.45154 0.453521.70 0.45543 0.45728 0.45907 0.46080 0.462461.80 0.46407 0.46562 0.46712 0.46856 0.469951.90 0.47128 0.47257 0.47381 0.47500 0.476152.00 0.47725 0.47831 0.47932 0.48030 0.48124

2.10 0.48214 0.48300 0.48382 0.48461 0.485372.20 0.48610 0.48679 0.48745 0.48809 0.488702.30 0.48928 0.48983 0.49036 0.49086 0.491342.40 0.49180 0.49224 0.49266 0.49305 0.493432.50 0.49379 0.49413 0.49446 0.49477 0.49506

2.60 0.49534 0.49560 0.49585 0.49609 0.496322.70 0.49653 0.49674 0.49693 0.49711 0.497282.80 0.49744 0.49760 0.49774 0.49788 0.498012.90 0.49813 0.49825 0.49836 0.49846 0.498563.00 0.49865 0.49874 0.49882 0.49889 0.49896

Table 10.2: Half-sided integral of the Gaussian probability density function. Thebody of the table gives the integral probability P (µ < y < µ+ zσ) for values of zspecified by the first column and row.

129

Reduced

chi-square

probabilities χ 1ν

P 0.99 0.98 0.95 0.9 0.8 0.2 0.1 0.05 0.02 0.01 0.001ν

1 0.000 0.001 0.004 0.016 0.064 1.642 2.706 3.841 5.412 6.635 10.8272 0.010 0.020 0.051 0.105 0.223 1.609 2.303 2.996 3.912 4.605 6.9083 0.038 0.062 0.117 0.195 0.335 1.547 2.084 2.605 3.279 3.782 5.4224 0.074 0.107 0.178 0.266 0.412 1.497 1.945 2.372 2.917 3.319 4.6175 0.111 0.150 0.229 0.322 0.469 1.458 1.847 2.214 2.678 3.017 4.103

6 0.145 0.189 0.273 0.367 0.512 1.426 1.774 2.099 2.506 2.802 3.7437 0.177 0.223 0.310 0.405 0.546 1.400 1.717 2.010 2.375 2.639 3.4748 0.206 0.254 0.342 0.436 0.574 1.379 1.670 1.938 2.271 2.511 3.2659 0.232 0.281 0.369 0.463 0.598 1.360 1.632 1.880 2.187 2.407 3.097

10 0.256 0.306 0.394 0.487 0.618 1.344 1.599 1.831 2.116 2.321 2.959

11 0.278 0.328 0.416 0.507 0.635 1.330 1.570 1.789 2.056 2.248 2.84212 0.298 0.348 0.436 0.525 0.651 1.318 1.546 1.752 2.004 2.185 2.74213 0.316 0.367 0.453 0.542 0.664 1.307 1.524 1.720 1.959 2.130 2.65614 0.333 0.383 0.469 0.556 0.676 1.296 1.505 1.692 1.919 2.082 2.58015 0.349 0.399 0.484 0.570 0.687 1.287 1.487 1.666 1.884 2.039 2.513

16 0.363 0.413 0.498 0.582 0.697 1.279 1.471 1.644 1.852 2.000 2.45317 0.377 0.427 0.510 0.593 0.706 1.271 1.457 1.623 1.823 1.965 2.39918 0.390 0.439 0.522 0.604 0.714 1.264 1.444 1.604 1.797 1.934 2.35119 0.402 0.451 0.532 0.613 0.722 1.258 1.432 1.587 1.773 1.905 2.30620 0.413 0.462 0.543 0.622 0.729 1.252 1.421 1.571 1.751 1.878 2.266

22 0.434 0.482 0.561 0.638 0.742 1.241 1.401 1.542 1.712 1.831 2.19424 0.452 0.500 0.577 0.652 0.753 1.231 1.383 1.517 1.678 1.791 2.13226 0.469 0.516 0.592 0.665 0.762 1.223 1.368 1.496 1.648 1.755 2.07928 0.484 0.530 0.605 0.676 0.771 1.215 1.354 1.476 1.622 1.724 2.03230 0.498 0.544 0.616 0.687 0.779 1.208 1.342 1.459 1.599 1.696 1.990

32 0.511 0.556 0.627 0.696 0.786 1.202 1.331 1.444 1.578 1.671 1.95334 0.523 0.567 0.637 0.704 0.792 1.196 1.321 1.429 1.559 1.649 1.91936 0.534 0.577 0.646 0.712 0.798 1.191 1.311 1.417 1.541 1.628 1.88838 0.545 0.587 0.655 0.720 0.804 1.186 1.303 1.405 1.525 1.610 1.86140 0.554 0.596 0.663 0.726 0.809 1.182 1.295 1.394 1.511 1.592 1.835

42 0.563 0.604 0.670 0.733 0.813 1.178 1.288 1.384 1.497 1.576 1.81244 0.572 0.612 0.677 0.738 0.818 1.174 1.281 1.375 1.485 1.562 1.79046 0.580 0.620 0.683 0.744 0.822 1.170 1.275 1.366 1.473 1.548 1.77048 0.587 0.627 0.690 0.749 0.825 1.167 1.269 1.358 1.462 1.535 1.75150 0.594 0.633 0.695 0.754 0.829 1.163 1.263 1.350 1.452 1.523 1.733

60 0.625 0.662 0.720 0.774 0.844 1.150 1.240 1.318 1.410 1.473 1.66070 0.649 0.684 0.739 0.790 0.856 1.139 1.222 1.293 1.377 1.435 1.60580 0.669 0.703 0.755 0.803 0.865 1.130 1.207 1.273 1.351 1.404 1.56090 0.686 0.718 0.768 0.814 0.873 1.123 1.195 1.257 1.329 1.379 1.525

100 0.701 0.731 0.779 0.824 0.879 1.117 1.185 1.243 1.311 1.358 1.494

120 0.724 0.753 0.798 0.839 0.890 1.107 1.169 1.221 1.283 1.325 1.447140 0.743 0.770 0.812 0.850 0.898 1.099 1.156 1.204 1.261 1.299 1.410160 0.758 0.784 0.823 0.860 0.905 1.093 1.146 1.191 1.243 1.278 1.381180 0.771 0.796 0.833 0.868 0.910 1.087 1.137 1.179 1.228 1.261 1.358200 0.782 0.806 0.841 0.874 0.915 1.083 1.130 1.170 1.216 1.247 1.338

Table 10.3: Integral of the χ2ν probability density function for various values of

the number of degrees of freedom ν. The body of the table contains values ofχ2ν , such that the probability P of exceeding this value is given at the top of the

column.


Student-T probabilities

P 0.99 0.95 0.90 0.80 0.70 0.68 0.60 0.50ν1 63.6559 12.70615 6.31375 3.07768 1.96261 1.81899 1.37638 1.000002 9.92499 4.30266 2.91999 1.88562 1.38621 1.31158 1.06066 0.816503 5.84085 3.18245 2.35336 1.63775 1.24978 1.18893 0.97847 0.764894 4.60408 2.77645 2.13185 1.53321 1.18957 1.13440 0.94096 0.740705 4.03212 2.57058 2.01505 1.47588 1.15577 1.10367 0.91954 0.72669

6 3.70743 2.44691 1.94318 1.43976 1.13416 1.08398 0.90570 0.717567 3.49948 2.36462 1.89458 1.41492 1.11916 1.07029 0.89603 0.711148 3.35538 2.30601 1.85955 1.39682 1.10815 1.06022 0.88889 0.706399 3.24984 2.26216 1.83311 1.38303 1.09972 1.05252 0.88340 0.70272

10 3.16926 2.22814 1.81246 1.37218 1.09306 1.04642 0.87906 0.69981

11 3.10582 2.20099 1.79588 1.36343 1.08767 1.04149 0.87553 0.6974412 3.05454 2.17881 1.78229 1.35622 1.08321 1.03740 0.87261 0.6954813 3.01228 2.16037 1.77093 1.35017 1.07947 1.03398 0.87015 0.6938314 2.97685 2.14479 1.76131 1.34503 1.07628 1.03105 0.86805 0.6924215 2.94673 2.13145 1.75305 1.34061 1.07353 1.02853 0.86624 0.69120

16 2.92079 2.11990 1.74588 1.33676 1.07114 1.02634 0.86467 0.6901317 2.89823 2.10982 1.73961 1.33338 1.06903 1.02441 0.86328 0.6891918 2.87844 2.10092 1.73406 1.33039 1.06717 1.02270 0.86205 0.6883619 2.86094 2.09302 1.72913 1.32773 1.06551 1.02117 0.86095 0.6876220 2.84534 2.08596 1.72472 1.32534 1.06402 1.01980 0.85996 0.68695

21 2.83137 2.07961 1.72074 1.32319 1.06267 1.01857 0.85907 0.6863522 2.81876 2.07388 1.71714 1.32124 1.06145 1.01745 0.85827 0.6858123 2.80734 2.06865 1.71387 1.31946 1.06034 1.01643 0.85753 0.6853124 2.79695 2.06390 1.71088 1.31784 1.05932 1.01549 0.85686 0.6848525 2.78744 2.05954 1.70814 1.31635 1.05838 1.01463 0.85624 0.68443

26 2.77872 2.05553 1.70562 1.31497 1.05752 1.01384 0.85567 0.6840427 2.77068 2.05183 1.70329 1.31370 1.05673 1.01311 0.85514 0.6836928 2.76326 2.04841 1.70113 1.31253 1.05599 1.01243 0.85465 0.6833529 2.75639 2.04523 1.69913 1.31143 1.05530 1.01180 0.85419 0.6830430 2.74998 2.04227 1.69726 1.31042 1.05466 1.01122 0.85377 0.68276

31 2.74404 2.03951 1.69552 1.30946 1.05406 1.01067 0.85337 0.6824932 2.73849 2.03693 1.69389 1.30857 1.05350 1.01015 0.85300 0.6822333 2.73329 2.03452 1.69236 1.30774 1.05298 1.00967 0.85265 0.6820034 2.72839 2.03224 1.69092 1.30695 1.05249 1.00922 0.85232 0.6817735 2.72381 2.03011 1.68957 1.30621 1.05202 1.00879 0.85201 0.68156

36 2.71948 2.02809 1.68830 1.30551 1.05158 1.00838 0.85172 0.6813737 2.71541 2.02619 1.68709 1.30485 1.05116 1.00800 0.85144 0.6811838 2.71157 2.02439 1.68595 1.30423 1.05077 1.00764 0.85118 0.6810039 2.70791 2.02269 1.68488 1.30364 1.05040 1.00730 0.85093 0.6808340 2.70446 2.02107 1.68385 1.30308 1.05005 1.00697 0.85070 0.68067

∞ 2.57583 1.95996 1.64485 1.28155 1.03643 0.99446 0.84162 0.67449

Table 10.4: Student-T probabilities for various values of the number of degreesof freedom ν. The body of the table contains values of z, such that the probabilityP that the interval y ± zsy will include the mean µy is given at the top of thecolumn.

Statistical Analysis of Data in the Linear RegimeFour of the most common probability distributions...

Documents

Transcript of Statistical Analysis of Data in the Linear RegimeFour of the most common probability distributions...