
37. PROBABILITY

Revised September 2013 by G. Cowan (RHUL).

37.1. General [1–8]

An abstract definition of probability can be given by considering a set S, called the sample space, and possible subsets A, B, . . . , the interpretation of which is left open. The probability P is a real-valued function defined by the following axioms due to Kolmogorov [9]:

1. For every subset A in S, P(A) ≥ 0;
2. For disjoint subsets (i.e., A ∩ B = ∅), P(A ∪ B) = P(A) + P(B);
3. P(S) = 1.

In addition, one defines the conditional probability P(A|B) (read as P of A given B) as

P(A|B) = P(A ∩ B)/P(B) . (37.1)

From this definition and using the fact that A ∩ B and B ∩ A are the same, one obtains Bayes' theorem,

P(A|B) = P(B|A) P(A)/P(B) . (37.2)

From the three axioms of probability and the definition of conditional probability, one obtains the law of total probability,

P(B) = ∑_i P(B|A_i) P(A_i) , (37.3)

for any subset B and for disjoint A_i with ∪_i A_i = S. This can be combined with Bayes' theorem (Eq. (37.2)) to give

P(A|B) = P(B|A) P(A) / ∑_i P(B|A_i) P(A_i) , (37.4)

where the subset A could, for example, be one of the A_i.
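As a simple numerical illustration of Eqs. (37.3) and (37.4), the Python sketch below computes a posterior probability for a sample space with two disjoint hypotheses; the probabilities are invented numbers chosen only for illustration.

# Sketch: law of total probability, Eq. (37.3), and Bayes' theorem,
# Eq. (37.4), for two disjoint hypotheses (hypothetical numbers).
P_A = {"b": 0.1, "not b": 0.9}            # prior probabilities P(A_i)
P_B_given_A = {"b": 0.7, "not b": 0.02}   # P(B|A_i), e.g., a tagging probability

P_B = sum(P_B_given_A[a] * P_A[a] for a in P_A)     # Eq. (37.3)
P_A_given_B = P_B_given_A["b"] * P_A["b"] / P_B     # Eq. (37.4)
print(P_B, P_A_given_B)                             # 0.088 and ~0.795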

The most commonly used interpretation of the elements of the sample space is as outcomes of a repeatable experiment. The probability P(A) is assigned a value equal to the limiting frequency of occurrence of A. This interpretation forms the basis of frequentist statistics.

The elements of the sample space might also be interpreted as hypotheses, i.e., statements that are either true or false, such as 'The mass of the W boson lies between 80.3 and 80.5 GeV.' Upon repetition of a measurement, however, such statements are either always true or always false, i.e., the corresponding probabilities in the frequentist interpretation are either 0 or 1. Using subjective probability, however, P(A) is interpreted as the degree of belief that the hypothesis A is true. Subjective probability is used in Bayesian (as opposed to frequentist) statistics. Bayes' theorem can be written

P(theory|data) ∝ P(data|theory) P(theory) , (37.5)

where 'theory' represents some hypothesis and 'data' is the outcome of the experiment. Here P(theory) is the prior probability for the theory, which reflects the experimenter's degree of belief before carrying out the measurement, and P(data|theory) is the probability to have gotten the data actually obtained, given the theory, which is also called the likelihood.

Bayesian statistics provides no fundamental rule for obtaining the prior probability, which may depend on previous measurements, theoretical prejudices, etc. Once this has been specified, however, Eq. (37.5) tells how the probability for the theory must be modified in the light of the new data to give the posterior probability, P(theory|data). As Eq. (37.5) is stated as a proportionality, the probability must be normalized by summing (or integrating) over all possible hypotheses.

37.2. Random variables

A random variable is a numerical characteristic assigned to an element of the sample space. In the frequency interpretation of probability, it corresponds to an outcome of a repeatable experiment. Let x be a possible outcome of an observation. If x can take on any value from a continuous range, we write f(x; θ)dx as the probability that the measurement's outcome lies between x and x + dx. The function f(x; θ) is called the probability density function (p.d.f.), which may depend on one or more parameters θ. If x can take on only discrete values (e.g., the non-negative integers), then we use f(x; θ) to denote the probability to find the value x. In the following the term p.d.f. is often taken to cover both the continuous and discrete cases, although technically the term density should only be used in the continuous case.

The p.d.f. is always normalized to unity. Both x and θ may have multiple components and are then often written as vectors. If θ is unknown, we may wish to estimate its value from a given set of measurements of x; this is a central topic of statistics (see Sec. 38).

The cumulative distribution function F(a) is the probability that x ≤ a:

F(a) = ∫_{−∞}^{a} f(x) dx . (37.6)

Here and below, if x is discrete-valued, the integral is replaced by a sum. The endpoint a is expressly included in the integral or sum. Then 0 ≤ F(x) ≤ 1, F(x) is nondecreasing, and P(a < x ≤ b) = F(b) − F(a). If x is discrete, F(x) is flat except at allowed values of x, where it has discontinuous jumps equal to f(x).

Any function of random variables is itself a random variable, with (in general) a different p.d.f. The expectation value of any function u(x) is

E[u(x)] = ∫_{−∞}^{∞} u(x) f(x) dx , (37.7)

assuming the integral is finite. The expectation value is linear, i.e., for any two functions u and v of x and constants c₁ and c₂, E[c₁u + c₂v] = c₁E[u] + c₂E[v].

The nth moment of a random variable x is

α_n ≡ E[x^n] = ∫_{−∞}^{∞} x^n f(x) dx , (37.8a)

and the nth central moment of x (or moment about the mean, α₁) is

m_n ≡ E[(x − α₁)^n] = ∫_{−∞}^{∞} (x − α₁)^n f(x) dx . (37.8b)

The most commonly used moments are the mean µ and variance σ²:

µ ≡ α₁ , (37.9a)

σ² ≡ V[x] ≡ m₂ = α₂ − µ² . (37.9b)

The mean is the location of the "center of mass" of the p.d.f., and the variance is a measure of the square of its width. Note that V[cx + k] = c²V[x]. It is often convenient to use the standard deviation of x, σ, defined as the square root of the variance.

Any odd moment about the mean is a measure of the skewness of the p.d.f. The simplest of these is the dimensionless coefficient of skewness γ₁ = m₃/σ³.

The fourth central moment m₄ provides a convenient measure of the tails of a distribution. For the Gaussian distribution (see Sec. 37.4), one has m₄ = 3σ⁴. The kurtosis is defined as γ₂ = m₄/σ⁴ − 3, i.e., it is zero for a Gaussian, positive for a leptokurtic distribution with longer tails, and negative for a platykurtic distribution with tails that die off more quickly than those of a Gaussian.

The quantile x_α is the value of the random variable x at which the cumulative distribution is equal to α. That is, the quantile is the inverse of the cumulative distribution function, i.e., x_α = F⁻¹(α). An important special case is the median, x_med, defined by F(x_med) = 1/2, i.e., half the probability lies above and half lies below x_med.


(More rigorously, x_med is a median if P(x ≥ x_med) ≥ 1/2 and P(x ≤ x_med) ≥ 1/2. If only one value exists, it is called 'the median.')

Under a monotonic change of variable x → y(x), the quantiles of a distribution (and hence also the median) obey y_α = y(x_α). In general the expectation value and mode (most probable value) of a distribution do not, however, transform in this way.
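The sample versions of the quantities defined above are easy to compute. The following sketch, assuming NumPy and SciPy are available, estimates the mean, variance, skewness, kurtosis, and median from a standard Gaussian sample; all values here are illustrative.

# Sketch: sample moments and quantiles for a standard Gaussian sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

mean = x.mean()                                   # alpha_1
var = x.var()                                     # m_2
skew = ((x - mean) ** 3).mean() / var ** 1.5      # gamma_1 = m_3/sigma^3, ~0
kurt = ((x - mean) ** 4).mean() / var ** 2 - 3.0  # gamma_2, ~0 for a Gaussian

median = np.quantile(x, 0.5)        # sample estimate of x_med = F^{-1}(1/2)
print(mean, var, skew, kurt)
print(median, stats.norm.ppf(0.5))  # ppf is the quantile function F^{-1}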

Let x and y be two random variables with a joint p.d.f. f(x, y). The marginal p.d.f. of x (the distribution of x with y unobserved) is

f₁(x) = ∫_{−∞}^{∞} f(x, y) dy , (37.10)

and similarly for the marginal p.d.f. f₂(y). The conditional p.d.f. of y given fixed x (with f₁(x) ≠ 0) is defined by f₃(y|x) = f(x, y)/f₁(x), and similarly f₄(x|y) = f(x, y)/f₂(y). From these, we immediately obtain Bayes' theorem (see Eqs. (37.2) and (37.4)),

f₄(x|y) = f₃(y|x) f₁(x)/f₂(y) = f₃(y|x) f₁(x) / ∫ f₃(y|x′) f₁(x′) dx′ . (37.11)

The mean of x is

µ_x = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f(x, y) dx dy = ∫_{−∞}^{∞} x f₁(x) dx , (37.12)

and similarly for y. The covariance of x and y is

cov[x, y] = E[(x − µ_x)(y − µ_y)] = E[xy] − µ_x µ_y . (37.13)

A dimensionless measure of the covariance of x and y is given by the correlation coefficient,

ρ_xy = cov[x, y]/(σ_x σ_y) , (37.14)

where σ_x and σ_y are the standard deviations of x and y. It can be shown that −1 ≤ ρ_xy ≤ 1.

Two random variables x and y are independent if and only if

f(x, y) = f₁(x) f₂(y) . (37.15)

If x and y are independent, then ρ_xy = 0; the converse is not necessarily true. If x and y are independent, E[u(x)v(y)] = E[u(x)]E[v(y)], and V[x + y] = V[x] + V[y]; otherwise, V[x + y] = V[x] + V[y] + 2cov[x, y], and E[uv] does not necessarily factorize.

Consider a set of n continuous random variables x = (x₁, . . . , x_n) with joint p.d.f. f(x), and a set of n new variables y = (y₁, . . . , y_n), related to x by means of a function y(x) that is one-to-one, i.e., the inverse x(y) exists. The joint p.d.f. for y is given by

g(y) = f(x(y)) |J| , (37.16)

where |J| is the absolute value of the determinant of the square matrix J_ij = ∂x_i/∂y_j (the Jacobian determinant). If the transformation from x to y is not one-to-one, the x-space must be broken into regions where the function y(x) can be inverted, and the contributions to g(y) from each region summed.

Given a set of functions y = (y₁, . . . , y_m) with m < n, one can construct n − m additional independent functions, apply the procedure above, then integrate the resulting g(y) over the unwanted y_i to find the marginal distribution of those of interest.

For a one-to-one transformation of discrete random variables, the probability is obtained by simple substitution; no Jacobian is necessary because in this case f is a probability rather than a probability density. If the transformation is not one-to-one, then one must sum the probabilities for all values of the original variable that contribute to a given value of the transformed variable. If f depends on a set of parameters θ, a change to a different parameter set η(θ) is made by simple substitution; no Jacobian is used.
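Equation (37.16) can be checked by Monte Carlo. The sketch below, assuming NumPy/SciPy, uses the one-to-one map y = exp(x) with x standard Gaussian, so g(y) is the log-normal density of Table 37.1; the grid points are arbitrary.

# Sketch: Monte Carlo check of the change-of-variables formula, Eq. (37.16).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=200_000)
y = np.exp(x)                       # one-to-one map y = exp(x), x(y) = ln y

# Eq. (37.16): g(y) = f(x(y)) |J| with |J| = |dx/dy| = 1/y
ygrid = np.array([0.5, 1.0, 2.0, 3.0])
g = stats.norm.pdf(np.log(ygrid)) / ygrid

counts, edges = np.histogram(y, bins=200, range=(0.0, 5.0))
dens = counts / (len(y) * np.diff(edges))        # empirical density of y
centers = 0.5 * (edges[:-1] + edges[1:])
print(g)
print(np.interp(ygrid, centers, dens))           # agrees within fluctuations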

37.3. Characteristic functions

The characteristic function φ(u) associated with the p.d.f. f(x) is essentially its Fourier transform, or the expectation value of e^{iux}:

φ(u) = E[e^{iux}] = ∫_{−∞}^{∞} e^{iux} f(x) dx . (37.17)

Once φ(u) is specified, the p.d.f. f(x) is uniquely determined and vice versa; knowing one is equivalent to knowing the other. Characteristic functions are useful in deriving a number of important results about moments and sums of random variables.

It follows from Eqs. (37.8a) and (37.17) that the nth moment of a random variable x that follows f(x) is given by

i^{−n} (d^n φ/du^n)|_{u=0} = ∫_{−∞}^{∞} x^n f(x) dx = α_n . (37.18)

Thus it is often easy to calculate all the moments of a distribution defined by φ(u), even when f(x) cannot be written down explicitly.

If the p.d.f.s f₁(x) and f₂(y) for independent random variables x and y have characteristic functions φ₁(u) and φ₂(u), then the characteristic function of the weighted sum ax + by is φ₁(au)φ₂(bu). The rules of addition for several important distributions (e.g., that the sum of two Gaussian distributed variables also follows a Gaussian distribution) easily follow from this observation.
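The factorization φ₁(au)φ₂(bu) can be verified numerically from the definition (37.17). A minimal sketch, assuming NumPy and using two standard Gaussians with arbitrary weights a, b and frequency u:

# Sketch: characteristic function of ax + by for independent x, y.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=500_000)   # phi_1(u) = exp(-u^2/2) for a standard Gaussian
y = rng.normal(size=500_000)
a, b, u = 2.0, 3.0, 0.4

emp = np.mean(np.exp(1j * u * (a * x + b * y)))     # E[exp(iu(ax + by))]
exact = np.exp(-((a * u) ** 2 + (b * u) ** 2) / 2)  # phi_1(au) phi_2(bu)
print(emp, exact)   # agree up to Monte Carlo fluctuations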

Let the (partial) characteristic function corresponding to the conditional p.d.f. f₂(x|z) be φ₂(u|z), and the p.d.f. of z be f₁(z). The characteristic function after integration over the conditional value is

φ(u) = ∫ φ₂(u|z) f₁(z) dz . (37.19)

Suppose we can write φ₂ in the form

φ₂(u|z) = A(u) e^{ig(u)z} . (37.20)

Then

φ(u) = A(u) φ₁(g(u)) . (37.21)

The cumulants (semi-invariants) κ_n of a distribution with characteristic function φ(u) are defined by the relation

φ(u) = exp[ ∑_{n=1}^{∞} (κ_n/n!) (iu)^n ] = exp( iκ₁u − κ₂u²/2 + . . . ) . (37.22)

The values κ_n are related to the moments α_n and m_n. The first few relations are

κ₁ = α₁ (= µ, the mean) ,
κ₂ = m₂ = α₂ − α₁² (= σ², the variance) ,
κ₃ = m₃ = α₃ − 3α₁α₂ + 2α₁³ . (37.23)

37.4. Commonly used probability distributions

Table 37.1 gives a number of common probability density functions and corresponding characteristic functions, means, and variances. Further information may be found in Refs. [1–8], [10], and [11], which has particularly detailed tables. Monte Carlo techniques for generating each of them may be found in our Sec. 39.4 and in Ref. [10]. We comment below on all except the trivial uniform distribution.

37.4.1. Binomial and multinomial distributions :

A random process with exactly two possible outcomes which occur with fixed probabilities is called a Bernoulli process. If the probability of obtaining a certain outcome (a "success") in an individual trial is p, then the probability of obtaining exactly r successes (r = 0, 1, 2, . . . , N) in N independent trials, without regard to the order of the successes and failures, is given by the binomial distribution f(r; N, p) in Table 37.1. If r and s are binomially distributed with parameters (N_r, p) and (N_s, p), then t = r + s follows a binomial distribution with parameters (N_r + N_s, p).

If there are m possible outcomes for each trial having probabilities p₁, p₂, . . . , p_m, then the joint probability to find r₁, r₂, . . . , r_m of each outcome after a total of N independent trials is given by the multinomial distribution as shown in Table 37.1. We can regard outcome i as "success" and all the rest as "failure", so individually, any of the r_i follow a binomial distribution for N trials and a success probability p_i.


37.4.2. Poisson distribution :

The Poisson distribution f(n; ν) gives the probability of finding exactly n events in a given interval of x (e.g., space or time) when the events occur independently of one another and of x at an average rate of ν per the given interval. The variance σ² equals ν. It is the limiting case p → 0, N → ∞, Np = ν of the binomial distribution. The Poisson distribution approaches the Gaussian distribution for large ν.

For example, a large number of radioactive nuclei of a given type will result in a certain number of decays in a fixed time interval. If this interval is small compared to the mean lifetime, then the probability for a given nucleus to decay is small, and thus the number of decays in the time interval is well modeled as a Poisson variable.
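The binomial limit p → 0, N → ∞ with Np = ν fixed can be seen directly; a small sketch, assuming SciPy, with ν = 3 chosen arbitrarily:

# Sketch: the Poisson distribution as the limit of the binomial.
from scipy import stats

nu = 3.0
for N in (10, 100, 10_000):
    p = nu / N                       # p -> 0 with Np = nu fixed
    print(N, stats.binom.pmf(2, N, p), stats.poisson.pmf(2, nu))
# the binomial probabilities converge to the Poisson value as N grows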

Table 37.1. Some common probability density functions, with corresponding characteristic functions and means and variances. In the Table, Γ(k) is the gamma function, equal to (k − 1)! when k is an integer; ₁F₁ is the confluent hypergeometric function of the 1st kind [11].

Uniform:
f(x; a, b) = 1/(b − a) for a ≤ x ≤ b, 0 otherwise.
φ(u) = (e^{ibu} − e^{iau}) / ((b − a)iu); mean (a + b)/2; variance (b − a)²/12.

Binomial:
f(r; N, p) = N!/(r!(N − r)!) p^r q^{N−r}; r = 0, 1, 2, . . . , N; 0 ≤ p ≤ 1; q = 1 − p.
φ(u) = (q + p e^{iu})^N; mean Np; variance Npq.

Multinomial:
f(r₁, . . . , r_m; N, p₁, . . . , p_m) = N!/(r₁! · · · r_m!) p₁^{r₁} · · · p_m^{r_m}; r_k = 0, 1, 2, . . . , N; 0 ≤ p_k ≤ 1; ∑_{k=1}^{m} r_k = N.
φ(u) = (∑_{k=1}^{m} p_k e^{iu_k})^N; mean E[r_i] = Np_i; covariance cov[r_i, r_j] = Np_i(δ_ij − p_j).

Poisson:
f(n; ν) = ν^n e^{−ν}/n!; n = 0, 1, 2, . . . ; ν > 0.
φ(u) = exp[ν(e^{iu} − 1)]; mean ν; variance ν.

Normal (Gaussian):
f(x; µ, σ²) = (1/(σ√(2π))) exp(−(x − µ)²/2σ²); −∞ < x < ∞; −∞ < µ < ∞; σ > 0.
φ(u) = exp(iµu − σ²u²/2); mean µ; variance σ².

Multivariate Gaussian:
f(x; µ, V) = (1/((2π)^{n/2} √|V|)) exp[−½ (x − µ)^T V⁻¹ (x − µ)]; −∞ < x_j < ∞; −∞ < µ_j < ∞; |V| > 0.
φ(u) = exp[iµ · u − ½ u^T V u]; mean µ; covariance V_jk.

Log-normal:
f(x; µ, σ²) = (1/(σ√(2π))) (1/x) exp(−(ln x − µ)²/2σ²); 0 < x < ∞; −∞ < µ < ∞; σ > 0.
φ(u): —; mean exp(µ + σ²/2); variance exp(2µ + σ²)[exp(σ²) − 1].

χ²:
f(z; n) = z^{n/2−1} e^{−z/2} / (2^{n/2} Γ(n/2)); z ≥ 0.
φ(u) = (1 − 2iu)^{−n/2}; mean n; variance 2n.

Student's t:
f(t; n) = (1/√(nπ)) (Γ[(n + 1)/2]/Γ(n/2)) (1 + t²/n)^{−(n+1)/2}; −∞ < t < ∞; n not required to be an integer.
φ(u): —; mean 0 for n > 1; variance n/(n − 2) for n > 2.

Gamma:
f(x; λ, k) = x^{k−1} λ^k e^{−λx}/Γ(k); 0 ≤ x < ∞; k not required to be an integer.
φ(u) = (1 − iu/λ)^{−k}; mean k/λ; variance k/λ².

Beta:
f(x; α, β) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}; 0 ≤ x ≤ 1.
φ(u) = ₁F₁(α; α + β; iu); mean α/(α + β); variance αβ/((α + β)²(α + β + 1)).

37.4.3. Normal or Gaussian distribution :

The normal (or Gaussian) probability density function f(x; µ, σ²) given in Table 37.1 has mean E[x] = µ and variance V[x] = σ². Comparison of the characteristic function φ(u) given in Table 37.1 with Eq. (37.22) shows that all cumulants κ_n beyond κ₂ vanish; this is a unique property of the Gaussian distribution. Some other properties are:

P(x in range µ ± σ) = 0.6827,
P(x in range µ ± 0.6745σ) = 0.5,
E[|x − µ|] = √(2/π) σ = 0.7979σ,
half-width at half maximum = √(2 ln 2) σ = 1.177σ.


For a Gaussian with µ = 0 and σ² = 1 (the standard normal) the cumulative distribution, often written Φ(x), is related to the error function erf by

F(x; 0, 1) ≡ Φ(x) = ½ [1 + erf(x/√2)] . (37.24)

The error function and standard Gaussian are tabulated in many references (e.g., Refs. [11,12]) and are available in software packages such as ROOT [13]. For a mean µ and variance σ², replace x by (x − µ)/σ. The probability of x in a given range can be calculated with Eq. (38.65).
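Equation (37.24) and the Gaussian properties listed above can be reproduced with standard library functions; a minimal sketch, assuming Python with SciPy:

# Sketch: the standard Gaussian cumulative distribution via erf, Eq. (37.24).
import math
from scipy import stats

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(Phi(1.0), stats.norm.cdf(1.0))   # same value, ~0.8413
print(Phi(1.0) - Phi(-1.0))            # P(x in mu +/- sigma) = 0.6827
print(Phi(0.6745) - Phi(-0.6745))      # = 0.5, as quoted above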

For x and y independent and normally distributed, z = ax + by follows a normal p.d.f. f(z; aµ_x + bµ_y, a²σ_x² + b²σ_y²); that is, the weighted means and variances add.

The Gaussian derives its importance in large part from the central limit theorem: If independent random variables x₁, . . . , x_n are distributed according to any p.d.f. with finite mean and variance, then the sum y = ∑_{i=1}^{n} x_i will have a p.d.f. that approaches a Gaussian for large n. If the p.d.f.s of the x_i are not identical, the theorem still holds under somewhat more restrictive conditions. The mean and variance are given by the sums of corresponding terms from the individual x_i. Therefore, the sum of a large number of fluctuations x_i will be distributed as a Gaussian, even if the x_i themselves are not.

For a set of n Gaussian random variables x with means µ and covariances V_ij = cov[x_i, x_j], the p.d.f. for the one-dimensional Gaussian is generalized to

f(x; µ, V) = (1/((2π)^{n/2} √|V|)) exp[−½ (x − µ)^T V⁻¹ (x − µ)] , (37.25)

where the determinant |V| must be greater than 0. For diagonal V (independent variables), f(x; µ, V) is the product of the p.d.f.s of n Gaussian distributions.

For n = 2, f(x; µ, V) is

f(x₁, x₂; µ₁, µ₂, σ₁, σ₂, ρ) = (1/(2πσ₁σ₂√(1 − ρ²))) × exp{ −(1/(2(1 − ρ²))) [ (x₁ − µ₁)²/σ₁² − 2ρ(x₁ − µ₁)(x₂ − µ₂)/(σ₁σ₂) + (x₂ − µ₂)²/σ₂² ] } . (37.26)

The characteristic function for the multivariate Gaussian is

φ(u; µ, V) = exp[ iµ · u − ½ u^T V u ] . (37.27)

If the components of x are independent, then Eq. (37.27) is the product of the characteristic functions of n Gaussians.

For an n-dimensional Gaussian distribution for x with mean µ and covariance matrix V, the marginal distribution for any single x_i is a one-dimensional Gaussian with mean µ_i and variance V_ii. The equation (x − a)^T V⁻¹ (x − a) = C, where C is any positive number, defines an n-dimensional ellipsoid centered about a. If a is equal to the mean µ, then C is a random variable obeying the χ² distribution for n degrees of freedom, which is discussed in the following section. The probability that x lies outside the ellipsoid for a given value of C is given by 1 − F_{χ²}(C; n), where F_{χ²} is the cumulative χ² distribution. This may be read from Fig. 38.1. For example, the "s-standard-deviation ellipsoid" occurs at C = s². For the two-variable case (n = 2), the point x lies outside the one-standard-deviation ellipsoid with 61% probability. The use of these ellipsoids as indicators of probable error is described in Sec. 38.4.2.2; the validity of those indicators assumes that µ and V are correct.

37.4.4. Log-normal distribution :

If a random variable y follows a Gaussian distribution with mean µ and variance σ², then x = e^y follows a log-normal distribution, as given in Table 37.1. As a consequence of the central limit theorem described in Sec. 37.4.3, the distribution of the product of a large number of positive random variables approaches a log-normal. It is bounded below by zero and is thus well suited for modeling quantities that are intrinsically non-negative such as an efficiency. One can implement a log-normal model for a random variable x by defining y = ln x so that y follows a Gaussian distribution.

37.4.5. χ2 distribution :

If x₁, . . . , x_n are independent Gaussian random variables, the sum z = ∑_{i=1}^{n} (x_i − µ_i)²/σ_i² follows the χ² p.d.f. with n degrees of freedom, which we denote by χ²(n). More generally, for n correlated Gaussian variables as components of a vector X with covariance matrix V, z = X^T V⁻¹ X follows χ²(n) as in the previous section. For a set of z_i, each of which follows χ²(n_i), ∑ z_i follows χ²(∑ n_i). For large n, the χ² p.d.f. approaches a Gaussian with a mean and variance given by µ = n and σ² = 2n, respectively (here the formulae for µ and σ² are valid for all n).

The χ² p.d.f. is often used in evaluating the level of compatibility between observed data and a hypothesis for the p.d.f. that the data might follow. This is discussed further in Sec. 38.3.2 on significance tests.

37.4.6. Student’s t distribution :

Suppose that y and x₁, . . . , x_n are independent and Gaussian distributed with mean 0 and variance 1. We then define

z = ∑_{i=1}^{n} x_i² and t = y/√(z/n) . (37.28)

The variable z thus follows a χ²(n) distribution. Then t is distributed according to Student's t distribution with n degrees of freedom, f(t; n), given in Table 37.1.

If defined through gamma functions as in Table 37.1, the parameter n is not required to be an integer. As n → ∞, the distribution approaches a Gaussian, and for n = 1 it is a Cauchy or Breit–Wigner distribution.

As an example, consider the sample mean x̄ = ∑ x_i/n and the sample variance s² = ∑ (x_i − x̄)²/(n − 1) for normally distributed x_i with unknown mean µ and variance σ². The sample mean has a Gaussian distribution with a variance σ²/n, so the variable (x̄ − µ)/√(σ²/n) is normal with mean 0 and variance 1. The quantity (n − 1)s²/σ² is independent of this and follows χ²(n − 1). The ratio

t = [(x̄ − µ)/√(σ²/n)] / √[((n − 1)s²/σ²)/(n − 1)] = (x̄ − µ)/√(s²/n) (37.29)

is distributed as f(t; n − 1). The unknown variance σ² cancels, and t can be used to test the hypothesis that the true mean is some particular value µ.

37.4.7. Gamma distribution :

For a process that generates events as a function of x (e.g., space or time) according to a Poisson distribution, the distance in x from an arbitrary starting point (which may be some particular event) to the kth event follows a gamma distribution, f(x; λ, k). The Poisson parameter µ is λ per unit x. The special case k = 1 (i.e., f(x; λ, 1) = λe^{−λx}) is called the exponential distribution. A sum of k′ exponential random variables x_i is distributed as f(∑ x_i; λ, k′).

The parameter k is not required to be an integer. For λ = 1/2 and k = n/2, the gamma distribution reduces to the χ²(n) distribution.


37.4.8. Beta distribution :

The beta distribution describes a continuous random variable x in the interval [0, 1]. By scaling and translation one can easily generalize it to have arbitrary endpoints. In Bayesian inference about the parameter p of a binomial process, if the prior p.d.f. is a beta distribution f(p; α, β) then the observation of r successes out of N trials gives a posterior beta distribution f(p; r + α, N − r + β) (Bayesian methods are discussed further in Sec. 38). The uniform distribution is a beta distribution with α = β = 1.
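This conjugate update is one line of code; a sketch, assuming SciPy, with a uniform prior and invented counts r = 8, N = 10:

# Sketch: Bayesian update of a binomial efficiency p with a beta prior.
from scipy import stats

alpha, beta = 1.0, 1.0      # uniform prior: beta with alpha = beta = 1
r, N = 8, 10                # hypothetical: r successes in N trials

posterior = stats.beta(r + alpha, N - r + beta)   # f(p; r+alpha, N-r+beta)
print(posterior.mean())           # (r + alpha)/(N + alpha + beta) = 0.75
print(posterior.interval(0.68))   # central 68% credible interval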

References:

1. H. Cramer, Mathematical Methods of Statistics, (Princeton Univ. Press, New Jersey, 1958).
2. A. Stuart and J.K. Ord, Kendall's Advanced Theory of Statistics, Vol. 1 Distribution Theory, 6th Ed., (Halsted Press, New York, 1994), and earlier editions by Kendall and Stuart.
3. F.E. James, Statistical Methods in Experimental Physics, 2nd Ed., (World Scientific, Singapore, 2006).
4. L. Lyons, Statistics for Nuclear and Particle Physicists, (Cambridge University Press, New York, 1986).
5. B.R. Roe, Probability and Statistics in Experimental Physics, 2nd Ed., (Springer, New York, 2001).
6. R.J. Barlow, Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences, (John Wiley, New York, 1989).
7. S. Brandt, Data Analysis, 3rd Ed., (Springer, New York, 1999).
8. G. Cowan, Statistical Data Analysis, (Oxford University Press, Oxford, 1998).
9. A.N. Kolmogorov, Grundbegriffe der Wahrscheinlichkeitsrechnung, (Springer, Berlin, 1933); Foundations of the Theory of Probability, 2nd Ed., (Chelsea, New York, 1956).
10. Ch. Walck, Hand-book on Statistical Distributions for Experimentalists, University of Stockholm Internal Report SUF-PFY/96-01, available from www.physto.se/~walck.
11. M. Abramowitz and I. Stegun, eds., Handbook of Mathematical Functions, (Dover, New York, 1972).
12. F.W.J. Olver et al., eds., NIST Handbook of Mathematical Functions, (Cambridge University Press, 2010); a companion Digital Library of Mathematical Functions is available at dlmf.nist.gov.
13. Rene Brun and Fons Rademakers, Nucl. Inst. Meth. A 389, 81 (1997); see also root.cern.ch.
14. The CERN Program Library (CERNLIB); see cernlib.web.cern.ch/cernlib.


38. STATISTICS

Revised September 2013 by G. Cowan (RHUL).

This chapter gives an overview of statistical methods used in high-energy physics. In statistics, we are interested in using a given sample of data to make inferences about a probabilistic model, e.g., to assess the model's validity or to determine the values of its parameters. There are two main approaches to statistical inference, which we may call frequentist and Bayesian.

In frequentist statistics, probability is interpreted as the frequency of the outcome of a repeatable experiment. The most important tools in this framework are parameter estimation, covered in Section 38.2, statistical tests, discussed in Section 38.3, and confidence intervals, which are constructed so as to cover the true value of a parameter with a specified probability, as described in Section 38.4.2. Note that in frequentist statistics one does not define a probability for a hypothesis or for the value of a parameter.

In Bayesian statistics, the interpretation of probability is more general and includes degree of belief (called subjective probability). One can then speak of a probability density function (p.d.f.) for a parameter, which expresses one's state of knowledge about where its true value lies. Bayesian methods provide a natural means to include additional information, which in general may be subjective; in fact they require prior probabilities for the hypotheses (or parameters) in question, i.e., the degree of belief about the parameters' values before carrying out the measurement. Using Bayes' theorem (Eq. (37.4)), the prior degree of belief is updated by the data from the experiment. Bayesian methods for interval estimation are discussed in Sections 38.4.1 and 38.4.2.4.

For many inference problems, the frequentist and Bayesian approaches give similar numerical values, even though they answer different questions and are based on fundamentally different interpretations of probability. In some important cases, however, the two approaches may yield very different results. For a discussion of Bayesian vs. non-Bayesian methods, see references written by a statistician [1], by a physicist [2], or the more detailed comparison in Ref. 3.

Following common usage in physics, the word "error" is often used in this chapter to mean "uncertainty." More specifically it can indicate the size of an interval as in "the standard error" or "error propagation," where the term refers to the standard deviation of an estimator.

38.1. Fundamental concepts

Consider an experiment whose outcome is characterized by one or more data values, which we can write as a vector x. A hypothesis H is a statement about the probability for the data, often written P(x|H). (We will usually use a capital letter for a probability and lower case for a probability density. Often the term p.d.f. is used loosely to refer to either a probability or a probability density.) This could, for example, define completely the p.d.f. for the data (a simple hypothesis), or it could specify only the functional form of the p.d.f., with the values of one or more parameters not determined (a composite hypothesis).

If the probability P(x|H) for data x is regarded as a function of the hypothesis H, then it is called the likelihood of H, usually written L(H). Often the hypothesis is characterized by one or more parameters θ, in which case L(θ) = P(x|θ) is called the likelihood function.

In some cases one can obtain at least approximate frequentist results using the likelihood evaluated only with the data obtained. In general, however, the frequentist approach requires a full specification of the probability model P(x|H) both as a function of the data x and hypothesis H.

In the Bayesian approach, inference is based on the posterior probability for H given the data x, which represents one's degree of belief that H is true given the data. This is obtained from Bayes' theorem (37.4), which can be written

P(H|x) = P(x|H) π(H) / ∫ P(x|H′) π(H′) dH′ . (38.1)

Here P(x|H) is the likelihood for H, which depends only on the data actually obtained. The quantity π(H) is the prior probability for H, which represents one's degree of belief for H before carrying out the measurement. The integral in the denominator (or sum, for discrete hypotheses) serves as a normalization factor. If H is characterized by a continuous parameter θ then the posterior probability is a p.d.f. p(θ|x). Note that the likelihood function itself is not a p.d.f. for θ.

38.2. Parameter estimation

Here we review point estimation of parameters, first with an overview of the frequentist approach and its two most important methods, maximum likelihood and least squares, treated in Sections 38.2.2 and 38.2.3. The Bayesian approach is outlined in Sec. 38.2.4.

An estimator θ̂ (written with a hat) is a function of the data used to estimate the value of the parameter θ. Sometimes the word 'estimate' is used to denote the value of the estimator when evaluated with given data. There is no fundamental rule dictating how an estimator must be constructed. One tries, therefore, to choose that estimator which has the best properties. The most important of these are (a) consistency, (b) bias, (c) efficiency, and (d) robustness.

(a) An estimator is said to be consistent if the estimate θ̂ converges to the true value θ as the amount of data increases. This property is so important that it is possessed by all commonly used estimators.

(b) The bias, b = E[θ̂] − θ, is the difference between the expectation value of the estimator and the true value of the parameter. The expectation value is taken over a hypothetical set of similar experiments in which θ̂ is constructed in the same way. When b = 0, the estimator is said to be unbiased. The bias depends on the chosen metric, i.e., if θ̂ is an unbiased estimator of θ, then θ̂² is not in general an unbiased estimator for θ².

(c) Efficiency is the ratio of the minimum possible variance for any estimator of θ to the variance V[θ̂] of the estimator θ̂. For the case of a single parameter, under rather general conditions the minimum variance is given by the Rao-Cramer-Frechet bound,

σ²_min = (1 + ∂b/∂θ)² / I(θ) , (38.2)

where

I(θ) = E[ (∂ ln L/∂θ)² ] = −E[ ∂² ln L/∂θ² ] (38.3)

is the Fisher information, L is the likelihood, and the expectation value in (38.3) is carried out with respect to the data. For the final equality to hold, the range of allowed data values must not depend on θ.

The mean-squared error,

MSE = E[(θ̂ − θ)²] = V[θ̂] + b² , (38.4)

is a measure of an estimator's quality which combines bias and variance.

(d) Robustness is the property of being insensitive to departures from assumptions in the p.d.f., e.g., owing to uncertainties in the distribution's tails.

It is not in general possible to optimize simultaneously for all the measures of estimator quality described above. For example, there is in general a trade-off between bias and variance. For some common estimators, the properties above are known exactly. More generally, it is possible to evaluate them by Monte Carlo simulation. Note that they will often depend on the unknown θ.

38.2.1. Estimators for mean, variance, and median :

Suppose we have a set of n independent measurements, x₁, . . . , x_n, each assumed to follow a p.d.f. with unknown mean µ and unknown variance σ². The measurements do not necessarily have to follow a Gaussian distribution. Then

µ̂ = (1/n) ∑_{i=1}^{n} x_i , (38.5)

σ̂² = (1/(n − 1)) ∑_{i=1}^{n} (x_i − µ̂)² (38.6)


are unbiased estimators of µ and σ². The variance of µ̂ is σ²/n and the variance of σ̂² is

V[σ̂²] = (1/n) ( m₄ − ((n − 3)/(n − 1)) σ⁴ ) , (38.7)

where m₄ is the 4th central moment of x (see Eq. (37.8b)). For Gaussian distributed x_i, this becomes 2σ⁴/(n − 1) for any n ≥ 2, and for large n the standard deviation of σ̂ (the "error of the error") is σ/√(2n). For any n and Gaussian x_i, µ̂ is an efficient estimator for µ, and the estimators µ̂ and σ̂² are uncorrelated. Otherwise the arithmetic mean (38.5) is not necessarily the most efficient estimator; this is discussed further in Sec. 8.7 of Ref. 4.

If σ² is known, it does not improve the estimate µ̂, as can be seen from Eq. (38.5); however, if µ is known, one can substitute it for µ̂ in Eq. (38.6) and replace n − 1 by n to obtain an estimator of σ² still with zero bias but smaller variance. If the x_i have different, known variances σ_i², then the weighted average

µ̂ = (1/w) ∑_{i=1}^{n} w_i x_i , (38.8)

where w_i = 1/σ_i² and w = ∑_i w_i, is an unbiased estimator for µ with a smaller variance than an unweighted average. The standard deviation of µ̂ is 1/√w.

As an estimator for the median x_med, one can use the value x̂_med such that half the x_i are below and half above (the sample median). If the sample median lies between two observed values, it is set by convention halfway between them. If the p.d.f. of x has the form f(x − µ) and µ is both mean and median, then for large n the variance of the sample median approaches 1/[4nf²(0)], provided f(0) > 0. Although estimating the median can often be more difficult computationally than the mean, the resulting estimator is generally more robust, as it is insensitive to the exact shape of the tails of a distribution.
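The estimators (38.5), (38.6), and (38.8) take only a few lines; a sketch, assuming NumPy, with hypothetical measurement values and standard deviations:

# Sketch: estimators for the mean, variance, and a weighted average.
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(5.0, 2.0, size=100)

mu_hat = x.mean()              # Eq. (38.5)
var_hat = x.var(ddof=1)        # Eq. (38.6); the n-1 denominator gives zero bias
print(mu_hat, var_hat)

# Weighted average, Eq. (38.8), for invented values with known sigma_i
y = np.array([10.1, 9.8, 10.4])
sig = np.array([0.2, 0.1, 0.4])
w = 1.0 / sig**2
mu_w = np.sum(w * y) / np.sum(w)
print(mu_w, 1.0 / np.sqrt(np.sum(w)))   # estimate and its standard deviation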

38.2.2. The method of maximum likelihood :

Suppose we have a set of measured quantities x and the likelihood L(θ) = P(x|θ) for a set of parameters θ = (θ₁, . . . , θ_N). The maximum likelihood (ML) estimators for θ are defined as the values that give the maximum of L. Because of the properties of the logarithm, it is usually easier to work with ln L, and since both are maximized for the same parameter values θ, the ML estimators can be found by solving the likelihood equations,

∂ ln L/∂θ_i = 0 , i = 1, . . . , N . (38.9)

Often the solution must be found numerically. Maximum likelihood estimators are important because they are asymptotically unbiased and efficient for large data samples, under quite general conditions, and the method has a wide range of applicability.

In general the likelihood function is obtained from the probability of the data under assumption of the parameters. An important special case is when the data consist of i.i.d. (independent and identically distributed) values. Here one has a set of n statistically independent quantities x = (x₁, . . . , x_n), where each component follows the same p.d.f. f(x; θ). In this case the joint p.d.f. of the data sample factorizes and the likelihood function is

L(θ) = ∏_{i=1}^{n} f(x_i; θ) . (38.10)

In this case the number of events n is regarded as fixed. If however the probability to observe n events itself depends on the parameters θ, then this should be included in the likelihood. For example, if n follows a Poisson distribution with mean µ and the independent x values all follow f(x; θ), then the likelihood becomes

L(θ) = (µ^n/n!) e^{−µ} ∏_{i=1}^{n} f(x_i; θ) . (38.11)

Equation (38.11) is often called the extended likelihood (see, e.g., Refs. [6–8]). In general µ is a function of θ, and including the probability for n given θ in the likelihood provides additional information about the parameters and thus leads to a reduction in their statistical uncertainties.

In evaluating the likelihood function, it is important that any normalization factors in the p.d.f. that involve θ be included. However, we will only be interested in the maximum of L and in ratios of L at different values of the parameters; hence any multiplicative factors that do not involve the parameters that we want to estimate may be dropped, including factors that depend on the data but not on θ.
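As an illustration of solving the likelihood equations numerically, the sketch below maximizes ln L from Eq. (38.10) for i.i.d. exponential data, where the ML estimator is the sample mean; the data values and the SciPy-based minimization are illustrative choices, not part of the text.

# Sketch: numerical ML fit of the exponential p.d.f. f(x; tau) = (1/tau) e^{-x/tau}.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=1000)

def neg_log_L(tau):
    # -ln L from Eq. (38.10) for the exponential p.d.f.
    return -np.sum(-np.log(tau) - x / tau)

res = minimize_scalar(neg_log_L, bounds=(0.1, 10.0), method="bounded")
print(res.x, x.mean())   # numerical maximum and exact ML estimate agree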

Under a one-to-one change of parameters from θ to η, the ML estimators θ̂ transform to η(θ̂). That is, the ML solution is invariant under change of parameter. However, other properties of ML estimators, in particular the bias, are not invariant under change of parameter.

The inverse V⁻¹ of the covariance matrix V_ij = cov[θ̂_i, θ̂_j] for a set of ML estimators can be estimated by using

(V̂⁻¹)_ij = −(∂² ln L/∂θ_i ∂θ_j)|_{θ̂} ; (38.12)

for finite samples, however, Eq. (38.12) can result in an underestimate of the variances. In the large sample limit (or in a linear model with Gaussian errors), L has a Gaussian form and ln L is (hyper)parabolic. In this case, it can be seen that a numerically equivalent way of determining s-standard-deviation errors is from the hypersurface defined by the θ′ such that

ln L(θ′) = ln L_max − s²/2 , (38.13)

where ln L_max is the value of ln L at the solution point (compare with Eq. (38.68)). The minimum and maximum values of θ_i on the hypersurface then give an approximate s-standard-deviation confidence interval for θ_i (see Section 38.4.2.2).

38.2.2.1. ML with binned data:

If the total number of data values x_i, i = 1, . . . , n_tot, is small, the unbinned maximum likelihood method, i.e., use of equation (38.10) (or (38.11) for extended ML), is preferred since binning can only result in a loss of information, and hence larger statistical errors for the parameter estimates. If the sample is large, it can be convenient to bin the values in a histogram with N bins, so that one obtains a vector of data n = (n₁, . . . , n_N) with expectation values µ = E[n] and probabilities f(n; µ). Suppose the mean values µ can be determined as a function of a set of parameters θ. Then one may maximize the likelihood function based on the contents of the bins.

As mentioned in Sec. 38.2.2, the total number of events n_tot = ∑_i n_i can be regarded as fixed or as a random variable. If it is fixed, the histogram follows a multinomial distribution,

f_M(n; θ) = (n_tot!/(n₁! · · · n_N!)) p₁^{n₁} · · · p_N^{n_N} , (38.14)

where we assume the probabilities p_i are given functions of the parameters θ. The distribution can be written equivalently in terms of the expected number of events in each bin, µ_i = n_tot p_i. If the n_i are regarded as independent and Poisson distributed, then the data are described by a product of Poisson probabilities,

f_P(n; θ) = ∏_{i=1}^{N} (µ_i^{n_i}/n_i!) e^{−µ_i} , (38.15)

where the mean values µ_i are given functions of θ. The total number of events n_tot thus follows a Poisson distribution with mean µ_tot = ∑_i µ_i.

When using maximum likelihood with binned data, one can find the ML estimators and at the same time obtain a statistic usable for a test of goodness-of-fit (see Sec. 38.3.2). Maximizing the likelihood L(θ) = f_{M/P}(n; θ) is equivalent to maximizing the likelihood ratio λ(θ) = f_{M/P}(n; θ)/f(n; µ̂), where in the denominator f(n; µ̂) is a model with an adjustable parameter for each bin, µ = (µ₁, . . . , µ_N), and the corresponding estimators are µ̂ = (n₁, . . . , n_N). Often one minimizes instead the equivalent quantity −2 ln λ(θ). For independent Poisson distributed n_i this is [9]

−2 ln λ(θ) = 2 ∑_{i=1}^{N} [ µ_i(θ) − n_i + n_i ln(n_i/µ_i(θ)) ] , (38.16)

where for bins with n_i = 0, the last term in (38.16) is zero. The expression (38.16) without the terms µ_i − n_i also gives −2 ln λ(θ) for multinomially distributed n_i, i.e., when the total number of entries is regarded as fixed. In the limit of zero bin width, minimizing (38.16) is equivalent to maximizing the unbinned extended likelihood function (38.11); in the multinomial case without the µ_i − n_i terms one obtains Eq. (38.10).

A smaller value of −2 ln λ(θ) corresponds to better agreement between the data and the hypothesized form of µ(θ). The value of −2 ln λ(θ) can thus be translated into a p-value as a measure of goodness-of-fit, as described in Sec. 38.3.2. Assuming the model is correct, then according to Wilks' theorem, for sufficiently large µ_i and providing certain regularity conditions are met, the minimum of −2 ln λ as defined by Eq. (38.16) follows a χ² distribution (see, e.g., Ref. 9). If there are N bins and m fitted parameters, then the number of degrees of freedom for the χ² distribution is N − m if the data are treated as Poisson-distributed, and N − m − 1 if the n_i are multinomially distributed.

Suppose the n_i are Poisson-distributed and the overall normalization µ_tot = ∑_i µ_i is taken as an adjustable parameter, so that µ_i = µ_tot p_i(θ), where the probability to be in the ith bin, p_i(θ), does not depend on µ_tot. Then by minimizing Eq. (38.16), one obtains that the area under the fitted function is equal to the sum of the histogram contents, i.e., ∑_i µ̂_i = ∑_i n_i.

38.2.2.2. Frequentist treatment of nuisance parameters:

Suppose we want to determine the values of parameters θ using a set of measurements x described by a probability model P_x(x|θ). In general the model is not perfect, which is to say it cannot provide an accurate description of the data even at the most optimal point of its parameter space. As a result, the estimated parameters can have a systematic bias.

One can in general improve the model by including in it additional parameters. That is, it is extended to P_x(x|θ, ν), which depends on parameters of interest θ and nuisance parameters ν. The additional parameters are not of intrinsic interest but must be included for the model to be accurate for some point in the enlarged parameter space.

Although including additional parameters may eliminate or at least reduce the effect of systematic uncertainties, their presence will result in increased statistical uncertainties for the parameters of interest. This occurs because the estimators for the nuisance parameters and those of interest will in general be correlated, which results in an enlargement of the contour defined by Eq. (38.13).

To reduce the impact of the nuisance parameters one often tries to constrain their values by means of control or calibration measurements, say, having data y. For example, some components of y could represent estimates of the nuisance parameters, often from separate experiments. Suppose the measurements y are statistically independent from x and are described by a model P_y(y|ν). The joint model for both x and y is in this case therefore the product of the probabilities for x and y, and thus the likelihood function for the full set of parameters is

L(θ, ν) = P_x(x|θ, ν) P_y(y|ν) . (38.17)

Note that in this case if one wants to simulate the experiment by means of Monte Carlo, both the primary and control measurements, x and y, must be generated for each repetition under assumption of fixed values for the parameters θ and ν.

Using all of the parameters (θ, ν) in Eq. (38.13) to find the statistical errors in the parameters of interest θ is equivalent to using the profile likelihood, which depends only on θ. It is defined as

L_p(θ) = L(θ, ν̂̂(θ)) , (38.18)

where the double-hat notation indicates the profiled values of the parameters ν, defined as the values that maximize L for the specified θ. The profile likelihood is discussed further in Section 38.3.2.1 in connection with hypothesis tests.

38.2.3. The method of least squares :

The method of least squares (LS) coincides with the method of maximum likelihood in the following special case. Consider a set of N independent measurements y_i at known points x_i. The measurement y_i is assumed to be Gaussian distributed with mean µ(x_i; θ) and known variance σ_i². The goal is to construct estimators for the unknown parameters θ. The likelihood function contains the sum of squares

χ²(θ) = −2 ln L(θ) + constant = ∑_{i=1}^{N} (y_i − µ(x_i; θ))²/σ_i² . (38.19)

The parameter values that maximize L are the same as those which minimize χ².

The minimum of Equation (38.19) defines the least-squares estimators θ̂ for the more general case where the y_i are not Gaussian distributed as long as they are independent. If they are not independent but rather have a covariance matrix V_ij = cov[y_i, y_j], then the LS estimators are determined by the minimum of

χ²(θ) = (y − µ(θ))^T V⁻¹ (y − µ(θ)) , (38.20)

where y = (y₁, . . . , y_N) is the (column) vector of measurements, µ(θ) is the corresponding vector of predicted values, and the superscript T denotes the transpose.

Often one further restricts the problem to the case where µ(x_i; θ) is a linear function of the parameters, i.e.,

µ(x_i; θ) = ∑_{j=1}^{m} θ_j h_j(x_i) . (38.21)

Here the h_j(x) are m linearly independent functions, e.g., 1, x, x², . . . , x^{m−1} or Legendre polynomials. We require m < N and at least m of the x_i must be distinct.

Minimizing χ² in this case with m parameters reduces to solving a system of m linear equations. Defining H_ij = h_j(x_i) and minimizing χ² by setting its derivatives with respect to the θ_i equal to zero gives the LS estimators,

θ̂ = (H^T V⁻¹ H)⁻¹ H^T V⁻¹ y ≡ D y . (38.22)

The covariance matrix for the estimators U_ij = cov[θ̂_i, θ̂_j] is given by

U = D V D^T = (H^T V⁻¹ H)⁻¹ , (38.23)

or equivalently, its inverse U⁻¹ can be found from

(U⁻¹)_ij = ½ (∂²χ²/∂θ_i∂θ_j)|_{θ=θ̂} = ∑_{k,l=1}^{N} h_i(x_k) (V⁻¹)_{kl} h_j(x_l) . (38.24)

The LS estimators can also be found from the expression

θ̂ = U g , (38.25)

where the vector g is defined by

g_i = ∑_{j,k=1}^{N} y_j h_i(x_k) (V⁻¹)_{jk} . (38.26)

For the case of uncorrelated y_i, for example, one can use (38.25) with

(U⁻¹)_ij = ∑_{k=1}^{N} h_i(x_k) h_j(x_k)/σ_k² , (38.27)

g_i = ∑_{k=1}^{N} y_k h_i(x_k)/σ_k² . (38.28)


Expanding χ²(θ) about θ̂, one finds that the contour in parameter space defined by

χ²(θ) = χ²(θ̂) + 1 = χ²_min + 1 (38.29)

has tangent planes located at approximately plus-or-minus-one standard deviation σ_θ̂ from the LS estimates θ̂.

In constructing the quantity χ²(θ) one requires the variances or, in the case of correlated measurements, the covariance matrix. Often these quantities are not known a priori and must be estimated from the data; an important example is where the measured value y_i represents the event count in a histogram bin. If, for example, y_i represents a Poisson variable, for which the variance is equal to the mean, then one can either estimate the variance from the predicted value, µ(x_i; θ), or from the observed number itself, y_i. In the first option, the variances become functions of the fitted parameters, which may lead to calculational difficulties. The second option can be undefined if y_i is zero, and in both cases for small y_i, the variance will be poorly estimated. In either case, one should constrain the normalization of the fitted curve to the correct value, i.e., one should determine the area under the fitted curve directly from the number of entries in the histogram (see Ref. 8, Section 7.4). As noted in Sec. 38.2.2.1, this issue is avoided when using the method of extended maximum likelihood with binned data by minimizing Eq. (38.16). In that case if the expected number of events µ_tot does not depend on the other fitted parameters θ, then its extended ML estimator is equal to the observed total number of events.

As the minimum value of the χ² represents the level of agreement between the measurements and the fitted function, it can be used for assessing the goodness-of-fit; this is discussed further in Section 38.3.2.

38.2.4. The Bayesian approach :

In the frequentist methods discussed above, probability is associated only with data, not with the value of a parameter. This is no longer the case in Bayesian statistics, however, which we introduce in this section. For general introductions to Bayesian statistics see, e.g., Refs. [22–25].

Suppose the outcome of an experiment is characterized by a vector of data x, whose probability distribution depends on an unknown parameter (or parameters) θ that we wish to determine. In Bayesian statistics, all knowledge about θ is summarized by the posterior p.d.f. p(θ|x), whose integral over any given region gives the degree of belief for θ to take on values in that region, given the data x. It is obtained by using Bayes' theorem,

p(θ|x) = P(x|θ) π(θ) / ∫ P(x|θ′) π(θ′) dθ′ , (38.30)

where P(x|θ) is the likelihood function, i.e., the joint p.d.f. for the data viewed as a function of θ, evaluated with the data actually obtained in the experiment, and π(θ) is the prior p.d.f. for θ. Note that the denominator in Eq. (38.30) serves to normalize the posterior p.d.f. to unity.

As it can be difficult to report the full posterior p.d.f. p(θ|x), one would usually summarize it with statistics such as the mean (or median) values, and covariance matrix. In addition one may construct intervals with a given probability content, as is discussed in Sec. 38.4.1 on Bayesian interval estimation.

38.2.4.1. Priors:

Bayesian statistics supplies no unique rule for determining the prior π(θ); this reflects the analyst's subjective degree of belief (or state of knowledge) about θ before the measurement was carried out. For the result to be of value to the broader community, whose members may not share these beliefs, it is important to carry out a sensitivity analysis, that is, to show how the result changes under a reasonable variation of the prior probabilities.

One might like to construct π(θ) to represent complete ignoranceabout the parameters by setting it equal to a constant. A problemhere is that if the prior p.d.f. is flat in θ, then it is not flat for a

nonlinear function of θ, and so a different parametrization of theproblem would lead in general to a non-equivalent posterior p.d.f.

For the special case of a constant prior, one can see from Bayes’theorem (38.30) that the posterior is proportional to the likelihood,and therefore the mode (peak position) of the posterior is equal to theML estimator. The posterior mode, however, will change in generalupon a transformation of parameter. One may use as the Bayesianestimator a summary statistic other than the mode, such as themedian, which is invariant under parameter transformation. But thiswill not in general coincide with the ML estimator.

The difficult and subjective nature of encoding personal knowledge into priors has led to what is called objective Bayesian statistics, where prior probabilities are based not on an actual degree of belief but rather derived from formal rules. These give, for example, priors which are invariant under a transformation of parameters, or ones which result in a maximum gain in information for a given set of measurements. For an extensive review see, e.g., Ref. 26.

Objective priors do not in general reflect degree of belief, but they could in some cases be taken as possible, although perhaps extreme, subjective priors. The posterior probabilities as well therefore do not necessarily reflect a degree of belief. However one may regard investigating a variety of objective priors to be an important part of the sensitivity analysis. Furthermore, use of objective priors with Bayes’ theorem can be viewed as a recipe for producing estimators or intervals which have desirable frequentist properties.

An important procedure for deriving objective priors is due to Jeffreys. According to Jeffreys’ rule one takes the prior as

π(θ) ∝ √(det I(θ)) ,  (38.31)

where

Iij(θ) = −E[ ∂² ln P(x|θ) / ∂θi∂θj ]  (38.32)

is the Fisher information matrix. One can show that the Jeffreys prior leads to inference that is invariant under a transformation of parameters. One should note that the Jeffreys prior depends on the likelihood function, and thus contains information about the measurement model itself, which goes beyond one’s degree of belief about the value of a parameter. As examples, the Jeffreys prior for the mean µ of a Gaussian distribution is a constant, and for the mean of a Poisson distribution one finds π(µ) ∝ 1/√µ.

Neither the constant nor the 1/√µ prior can be normalized to unit area, and they are therefore said to be improper. This can be allowed because the prior always appears multiplied by the likelihood function, and if the likelihood falls to zero sufficiently quickly then one may have a normalizable posterior density.
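The Poisson example can be checked numerically; the following sketch (hypothetical values of µ, assuming the Poisson model) evaluates the Fisher information of Eq. (38.32) as an expectation over n and confirms I(µ) = 1/µ, so that π(µ) ∝ √I(µ) = 1/√µ:

    import numpy as np
    from scipy.stats import poisson

    def fisher_info(mu, nmax=300):
        n = np.arange(nmax)
        # d^2/dmu^2 of ln P(n|mu) = n ln(mu) - mu - ln(n!) is -n/mu^2
        d2 = -n / mu**2
        return -np.sum(poisson.pmf(n, mu) * d2)   # I(mu) = -E[d2 ln P / dmu^2]

    for mu in (1.0, 4.0, 9.0):
        print(mu, fisher_info(mu), 1.0 / mu)      # agrees with 1/mu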

An important type of objective prior is the reference prior due to Bernardo and Berger [27]. To find the reference prior for a given problem one considers the Kullback–Leibler divergence Dn[π, p] of the posterior p(θ|x) relative to a prior π(θ), obtained from a set of i.i.d. data x = (x1, . . . , xn):

Dn[π, p] = ∫ p(θ|x) ln( p(θ|x) / π(θ) ) dθ .  (38.33)

This is effectively a measure of the gain in information provided by the data. The reference prior is chosen so that the expectation value of this information gain is maximized for the limiting case of n → ∞, where the expectation is computed with respect to the marginal distribution of the data,

p(x) = ∫ p(x|θ)π(θ) dθ .  (38.34)

For a single, continuous parameter the reference prior is usually identical to the Jeffreys prior. In the multiparameter case an iterative algorithm exists, which requires sorting the parameters by order of inferential importance. Often the result does not depend on this order, but when it does, this can be part of a robustness analysis. Further discussion and applications to particle physics problems can be found in Ref. 28.


38.2.4.2. Bayesian treatment of nuisance parameters:

As discussed in Sec. 38.2.2, a model may depend on parameters of interest θ as well as on nuisance parameters ν, which must be included for an accurate description of the data. Knowledge about the values of ν may be supplied by control measurements, theoretical insights, physical constraints, etc. Suppose, for example, one has data y from a control measurement which is characterized by a probability Py(y|ν). Suppose further that before carrying out the control measurement one’s state of knowledge about ν is described by an initial prior π0(ν), which in practice is often taken to be a constant or in any case very broad. By using Bayes’ theorem (38.30) one obtains the updated prior π(ν) (i.e., now π(ν) = π(ν|y), the probability for ν given y),

π(ν|y) ∝ P (y|ν)π0(ν) . (38.35)

In the absence of a model for P(y|ν) one may make some reasonable but ad hoc choices. For a single nuisance parameter ν, for example, one might characterize the uncertainty by a p.d.f. π(ν) centered about its nominal value with a certain standard deviation σν. Often a Gaussian p.d.f. provides a reasonable model for one’s degree of belief about a nuisance parameter; in other cases, more complicated shapes may be appropriate. If, for example, the parameter represents a non-negative quantity, then a log-normal or gamma p.d.f. can be a more natural choice than a Gaussian truncated at zero. Note also that truncation of the prior of a nuisance parameter ν at zero will in general make π(ν) nonzero at ν = 0, which can lead to an unnormalizable posterior for a parameter of interest that appears multiplied by ν.

The likelihood function, prior, and posterior p.d.f.s then all depend on both θ and ν, and are related by Bayes’ theorem, as usual. Note that the likelihood here only refers to the primary measurement x. Once any control measurements y are used to find the updated prior π(ν) for the nuisance parameters, this information is fully encapsulated in π(ν) and the control measurements do not appear further.

One can obtain the posterior p.d.f. for θ alone by integrating over the nuisance parameters, i.e.,

p(θ|x) = ∫ p(θ, ν|x) dν .  (38.36)

Such integrals can often not be carried out in closed form, and if the number of nuisance parameters is large, then they can be difficult to compute with standard Monte Carlo methods. Markov Chain Monte Carlo (MCMC) techniques are often used for computing integrals of this type (see Sec. 39.5).
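A minimal Metropolis sketch of such a marginalization (a toy Gaussian model with hypothetical numbers; not a recommendation of a particular sampler): the chain moves in (θ, ν), and keeping only the θ coordinate of each sample performs the integral of Eq. (38.36):

    import numpy as np

    rng = np.random.default_rng(1)
    x_obs = 3.0                                    # hypothetical measurement

    def log_post(theta, nu):                       # log p(theta, nu | x), up to a constant
        log_like = -0.5 * (x_obs - theta - nu)**2  # x ~ Gaussian(theta + nu, 1)
        log_prior = -0.5 * (nu / 0.5)**2           # nu ~ Gaussian(0, 0.5); flat prior in theta
        return log_like + log_prior

    theta, nu = 0.0, 0.0
    samples = []
    for _ in range(50000):
        th_p = theta + rng.normal(0.0, 0.5)        # proposal step
        nu_p = nu + rng.normal(0.0, 0.5)
        if np.log(rng.random()) < log_post(th_p, nu_p) - log_post(theta, nu):
            theta, nu = th_p, nu_p
        samples.append(theta)                      # marginal posterior of theta alone

    print(np.mean(samples), np.std(samples))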

38.2.5. Propagation of errors :

Consider a set of n quantities θ = (θ1, . . . , θn) and a set of m functions η(θ) = (η1(θ), . . . , ηm(θ)). Suppose we have estimated θ̂ = (θ̂1, . . . , θ̂n), using, say, maximum-likelihood or least-squares, and we also know or have estimated the covariance matrix Vij = cov[θ̂i, θ̂j]. The goal of error propagation is to determine the covariance matrix for the functions, Uij = cov[η̂i, η̂j], where η̂ = η(θ̂). In particular, the diagonal elements Uii = V[η̂i] give the variances. The new covariance matrix can be found by expanding the functions η(θ) about the estimates θ̂ to first order in a Taylor series. Using this one finds

Uij ≈ Σ_{k,l} [∂ηi/∂θk ∂ηj/∂θl]_{θ̂} Vkl .  (38.37)

This can be written in matrix notation as U ≈ A V Aᵀ, where the matrix of derivatives A is

Aij = [∂ηi/∂θj]_{θ̂} ,  (38.38)

and Aᵀ is its transpose. The approximation is exact if η(θ) is linear (it holds, for example, in Eq. (38.23)). If this is not the case, the approximation can break down if, for example, η(θ) is significantly nonlinear close to θ̂ in a region of a size comparable to the standard deviations of θ̂.
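In matrix form the propagation is a one-line computation; the following numpy sketch (hypothetical estimates, covariance, and functions η) implements U ≈ A V Aᵀ of Eqs. (38.37)–(38.38):

    import numpy as np

    theta_hat = np.array([2.0, 3.0])             # estimates (hypothetical)
    V = np.array([[0.04, 0.01],
                  [0.01, 0.09]])                 # their covariance matrix

    # eta = (theta1 + theta2, theta1 * theta2); A_ij = d eta_i / d theta_j at theta_hat
    A = np.array([[1.0, 1.0],
                  [theta_hat[1], theta_hat[0]]])

    U = A @ V @ A.T                              # propagated covariance, Eq. (38.37)
    print(np.sqrt(np.diag(U)))                   # standard deviations of the eta_i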

38.3. Statistical tests

In addition to estimating parameters, one often wants to assess the validity of certain statements concerning the data’s underlying distribution. Frequentist hypothesis tests, described in Sec. 38.3.1, provide a rule for accepting or rejecting hypotheses depending on the outcome of a measurement. In significance tests, covered in Sec. 38.3.2, one gives the probability to obtain a level of incompatibility with a certain hypothesis that is greater than or equal to the level observed with the actual data. In the Bayesian approach, the corresponding procedure is based fundamentally on the posterior probabilities of the competing hypotheses. In Sec. 38.3.3 we describe a related construct called the Bayes factor, which can be used to quantify the degree to which the data prefer one or another hypothesis.

38.3.1. Hypothesis tests :

A frequentist test of a hypothesis (often called the null hypothesis, H0) is a rule that states for which data values x the hypothesis is rejected. A region of x-space called the critical region, w, is specified such that there is no more than a given probability under H0, α, called the size or significance level of the test, to find x ∈ w. If the data are discrete, it may not be possible to find a critical region with exact probability content α, and thus we require P(x ∈ w|H0) ≤ α. If the data are observed in the critical region, H0 is rejected.

The critical region is not unique. Choosing one should take into account the probabilities for the data predicted by some alternative hypothesis (or set of alternatives) H1. Rejecting H0 if it is true is called a type-I error, and occurs by construction with probability no greater than α. Not rejecting H0 if an alternative H1 is true is called a type-II error, and for a given test this will have a certain probability β = P(x ∉ w|H1). The quantity 1 − β is called the power of the test of H0 with respect to the alternative H1. A strategy for defining the critical region can therefore be to maximize the power with respect to some alternative (or alternatives) given a fixed size α.

In high-energy physics, the components of x might represent the measured properties of candidate events, and the critical region is defined by the cuts that one imposes in order to reject background and thus accept events likely to be of a certain desired type. Here H0 could represent the background hypothesis and the alternative H1 could represent the sought-after signal. In other cases, H0 could be the hypothesis that an entire event sample consists of background events only, and the alternative H1 may represent the hypothesis of a mixture of background and signal.

Often rather than using the full set of quantities x, it is convenient to define a scalar function of x called a test statistic, t(x). The critical region in x-space is bounded by a surface of constant t(x). Once the function t(x) is fixed, a given hypothesis for the distribution of x will determine a distribution for t.

To maximize the power of a test of H0 with respect to the alternative H1, the Neyman–Pearson lemma states that the critical region w should be chosen such that for all data values x inside w, the ratio

λ(x) = f(x|H1) / f(x|H0) ,  (38.39)

is greater than a given constant, the value of which is determined by the size of the test α. Here H0 and H1 must be simple hypotheses, i.e., they should not contain undetermined parameters.

The lemma is equivalent to the statement that (38.39) represents the optimal test statistic where the critical region is defined by a single cut on λ. This test will lead to the maximum power (i.e., the maximum probability to reject H0 if H1 is true) for a given probability α to reject H0 if H0 is in fact true. It can be difficult in practice, however, to determine λ(x), since this requires knowledge of the joint p.d.f.s f(x|H0) and f(x|H1).
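For two simple Gaussian hypotheses the lemma can be made concrete; in this sketch (hypothetical means, unit width) λ(x) is monotonic in x, so the cut λ > c is equivalent to a cut x > x_cut fixed by the size α:

    from scipy.stats import norm

    # f(x|H0) = Gaussian(0,1), f(x|H1) = Gaussian(2,1)  (hypothetical)
    def lam(x):
        return norm.pdf(x, 2, 1) / norm.pdf(x, 0, 1)   # likelihood ratio (38.39)

    alpha = 0.05
    x_cut = norm.ppf(1 - alpha, 0, 1)   # critical region x > x_cut has size alpha under H0
    power = norm.sf(x_cut, 2, 1)        # 1 - beta = P(x > x_cut | H1)
    print(x_cut, lam(x_cut), power)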

In the usual case where the likelihood ratio (38.39) cannot be used explicitly, there exist a variety of other multivariate classifiers that effectively separate different types of events. Methods often used in HEP include neural networks or Fisher discriminants (see Ref. 10). Recently, further classification methods from machine-learning have been applied in HEP analyses; these include probability density estimation (PDE) techniques, kernel-based PDE (KDE or Parzen window), support vector machines, and decision trees. Techniques such as “boosting” and “bagging” can be applied to combine a number of classifiers into a stronger one with greater stability with respect to fluctuations in the training data. Descriptions of these methods can be found in [11–13], and in the Proceedings of the PHYSTAT conference series [14]. Software for HEP includes the TMVA [15] and StatPatternRecognition [16] packages.

38.3.2. Tests of significance (goodness-of-fit) :

Often one wants to quantify the level of agreement between the data and a hypothesis without explicit reference to alternative hypotheses. This can be done by defining a statistic t, which is a function of the data whose value reflects in some way the level of agreement between the data and the hypothesis. The analyst must decide what values of the statistic correspond to better or worse levels of agreement with the hypothesis in question; the choice will in general depend on the relevant alternative hypotheses.

The hypothesis in question, H0, will determine the p.d.f. f(t|H0) for the statistic. The significance of a discrepancy between the data and what one expects under the assumption of H0 is quantified by giving the p-value, defined as the probability to find t in the region of equal or lesser compatibility with H0 than the level of compatibility observed with the actual data. For example, if t is defined such that large values correspond to poor agreement with the hypothesis, then the p-value would be

p = ∫_{tobs}^{∞} f(t|H0) dt ,  (38.40)

where tobs is the value of the statistic obtained in the actual experiment.

The p-value should not be confused with the size (significance level) of a test, or the confidence level of a confidence interval (Section 38.4), both of which are pre-specified constants. We may formulate a hypothesis test, however, by defining the critical region to correspond to the data outcomes that give the lowest p-values, so that finding p ≤ α implies that the data outcome was in the critical region. When constructing a p-value, one generally chooses the region of data space deemed to have lower compatibility with the model being tested as one having higher compatibility with a given alternative, such that the corresponding test will have a high power with respect to this alternative.

The p-value is a function of the data, and is therefore itself a random variable. If the hypothesis used to compute the p-value is true, then for continuous data p will be uniformly distributed between zero and one. Note that the p-value is not the probability for the hypothesis; in frequentist statistics, this is not defined. Rather, the p-value is the probability, under the assumption of a hypothesis H0, of obtaining data at least as incompatible with H0 as the data actually observed.

When searching for a new phenomenon, one tries to reject the hypothesis H0 that the data are consistent with known (e.g., Standard Model) processes. If the p-value of H0 is sufficiently low, then one is willing to accept that some alternative hypothesis is true. Often one converts the p-value into an equivalent significance Z, defined so that a Z standard deviation upward fluctuation of a Gaussian random variable would have an upper tail area equal to p, i.e.,

Z = Φ−1(1 − p) .  (38.41)

Here Φ is the cumulative distribution of the standard Gaussian, and Φ−1 is its inverse (quantile) function. Often in HEP the level of significance where an effect is said to qualify as a discovery is Z = 5, i.e., a 5σ effect, corresponding to a p-value of 2.87 × 10−7. One’s actual degree of belief that a new process is present, however, will depend in general on other factors as well, such as the plausibility of the new signal hypothesis and the degree to which it can describe the data, one’s confidence in the model that led to the observed p-value, and possible corrections for multiple observations out of which one focuses on the smallest p-value obtained (the “look-elsewhere effect”, discussed in Section 38.3.2.2).
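The conversion of Eq. (38.41) is available directly in standard libraries; for instance, with scipy (a sketch, not part of the review itself):

    from scipy.stats import norm

    def z_from_p(p):
        return norm.ppf(1.0 - p)     # Z = Phi^{-1}(1 - p); norm.isf(p) is equivalent

    def p_from_z(z):
        return norm.sf(z)            # upper tail area 1 - Phi(Z)

    print(p_from_z(5.0))             # 2.87e-7, the 5 sigma discovery threshold
    print(z_from_p(2.87e-7))         # ~5.0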

38.3.2.1. Treatment of nuisance parameters for frequentist tests:

Suppose one wants to test hypothetical values of parameters θ, but the model also contains nuisance parameters ν. To find a p-value for θ we can construct a test statistic qθ such that larger values constitute increasing incompatibility between the data and the hypothesis. Then for an observed value of the statistic qθ,obs, the p-value of θ is

pθ(ν) = ∫_{qθ,obs}^{∞} f(qθ|θ, ν) dqθ ,  (38.42)

which depends in general on the nuisance parameters ν. In the strict frequentist approach, θ is rejected only if the p-value is less than α for all possible values of the nuisance parameters.

The difficulty described above is effectively solved if we can define the test statistic qθ in such a way that its distribution f(qθ|θ) is independent of the nuisance parameters. Although exact independence is only found in special cases, it can be achieved approximately by use of the profile likelihood ratio. This is given by the profile likelihood from Eq. (38.18) divided by the value of the likelihood at its maximum, i.e., when evaluated with the ML estimators θ̂ and ν̂:

λp(θ) = L(θ, ν̂(θ)) / L(θ̂, ν̂) ,  (38.43)

where ν̂(θ) denotes the profiled values of the nuisance parameters for the given θ (cf. Section 38.2.2.2).

Wilks’ theorem states that, provided certain general conditions are satisfied, the distribution of −2 ln λp(θ), under assumption of θ, approaches a χ2 distribution in the limit where the data sample is very large, independent of the values of the nuisance parameters ν. Here the number of degrees of freedom is equal to the number of components of θ. More details on use of the profile likelihood are given in Refs. [36–37] and in contributions to the PHYSTAT conferences [14]; explicit formulae for special cases can be found in Ref. 38. Further discussion on how to incorporate systematic uncertainties into p-values can be found in Ref. 17.
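A toy illustration of Wilks’ theorem (Gaussian data with σ as the nuisance parameter and the mean µ tested; the numbers are hypothetical): for this model the profiling can be done in closed form, and −2 ln λp(µ) is referred to a χ2 distribution with one degree of freedom:

    import numpy as np
    from scipy.stats import chi2

    x = np.array([1.2, 0.4, 2.1, 1.7, 0.9, 1.5])   # hypothetical sample

    def q_mu(mu):
        n = len(x)
        sig2_prof = np.mean((x - mu)**2)           # sigma^2 profiled at fixed mu
        sig2_hat = np.mean((x - x.mean())**2)      # global ML estimate of sigma^2
        return n * np.log(sig2_prof / sig2_hat)    # -2 ln lambda_p(mu) for this model

    q = q_mu(0.0)                                  # test mu = 0
    print(q, chi2.sf(q, df=1))                     # p-value from the chi^2 approximation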

Even with use of the profile likelihood ratio, for a finite data sample the p-value of hypothesized parameters θ will retain in general some dependence on the nuisance parameters ν. Ideally one would find the maximum of pθ(ν) from Eq. (38.42) explicitly, but that is often impractical. An approximate and computationally feasible technique is to use pθ(ν̂(θ)), where ν̂(θ) are the profiled values of the nuisance parameters as defined in Section 38.2.2.2. The resulting p-value is correct if the true values of the nuisance parameters are equal to the profiled values used; otherwise it could be either too high or too low. This is discussed further in Section 38.4.2 on confidence intervals.

One may also treat model uncertainties in a Bayesian manner but then use the resulting model in a frequentist test. Suppose the uncertainty in a set of nuisance parameters ν is characterized by a Bayesian prior p.d.f. π(ν). This can be used to construct the marginal (also called the prior predictive) model for the data x and parameters of interest θ,

Pm(x|θ) = ∫ P(x|θ, ν)π(ν) dν .  (38.44)

The marginal model does not represent the probability of data that would be generated if one were really to repeat the experiment, as in that case one would assume that the nuisance parameters do not vary. Rather, the marginal model represents a situation in which every repetition of the experiment is carried out with new values of ν, randomly sampled from π(ν). It is in effect an average of models each with a given ν, where the average is carried out with respect to the prior p.d.f. π(ν).

The marginal model for the data x can be used to determine the distribution of a test statistic Q, which can be written

Pm(Q|θ) = ∫ P(Q|θ, ν)π(ν) dν .  (38.45)

In a search for a new signal process, the test statistic can be based on the ratio of likelihoods corresponding to the experiments where signal and background events are both present, Ls+b, to that of background only, Lb. Often the likelihoods are evaluated with the profiled values of the nuisance parameters, which may give improved performance. It is important to note, however, that it is through use of the marginal model for the distribution of Q that the uncertainties related to the nuisance parameters are incorporated into the result of the test. Different choices for the test statistic itself only result in variations of the power of the test with respect to different alternatives.

38.3.2.2. The look-elsewhere effect:

The “look-elsewhere effect” relates to multiple measurements used to test a single hypothesis. The classic example is when one searches in a distribution for a peak whose position is not predicted in advance. Here the no-peak hypothesis is tested using data in a given range of the distribution. In the frequentist approach the correct p-value of the no-peak hypothesis is the probability, assuming background only, to find a signal as significant as the one found or more so anywhere in the search region. This can be substantially higher than the probability to find a peak of equal or greater significance in the particular place where it appeared. There is in general some ambiguity as to what constitutes the relevant search region or even the broader set of relevant measurements. Although the desired p-value is well defined once the search region has been fixed, an exact treatment can require extensive computation.

The “brute-force” solution to this problem by Monte Carlo involves generating data under the background-only hypothesis and, for each data set, fitting a peak of unknown position and recording a measure of its significance. To establish a discovery one often requires a p-value less than 2.9 × 10−7, corresponding to a 5σ or larger effect. Determining this with Monte Carlo thus requires generating and fitting a very large number of experiments, perhaps several times 10^7. In contrast, if the position of the peak is fixed, then the fit to the distribution is much easier, and furthermore one can in many cases use formulae valid for sufficiently large samples that bypass completely the need for Monte Carlo (see, e.g., [38]). But this fixed-position or “local” p-value would not be correct in general, as it assumes the position of the peak was known in advance.

A method that allows one to modify the local p-value computed under assumption of a fixed position to obtain an approximation to the correct “global” value using a relatively simple calculation is described in Ref. 18. Suppose a test statistic q0, defined so that larger values indicate increasing disagreement with the data, is observed to have a value u. Furthermore suppose the model contains a nuisance parameter θ (such as the peak position) which is only defined under the signal model (there is no peak in the background-only model). An approximation for the global p-value is found to be

pglobal ≈ plocal + ⟨Nu⟩ ,  (38.46)

where ⟨Nu⟩ is the mean number of “upcrossings” of the statistic q0 above the level u in the range of the nuisance parameter considered (e.g., the mass range).

The value of ⟨Nu⟩ can be estimated from the number of upcrossings ⟨Nu0⟩ above some much lower value, u0, by using a relation due to Davies [19],

⟨Nu⟩ ≈ ⟨Nu0⟩ e^{−(u−u0)/2} .  (38.47)

By choosing u0 sufficiently low, the value of ⟨Nu⟩ can be estimated by simulating only a very small number of experiments, or even from the observed data, rather than the 10^7 needed if one is dealing with a 5σ effect.
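Numerically the recipe is a few lines; this sketch (hypothetical local significance, reference level u0, and upcrossing count; it assumes q0 behaves as Z² for one degree of freedom) combines Eqs. (38.46) and (38.47):

    import numpy as np
    from scipy.stats import chi2

    Z_local = 4.0
    u = Z_local**2                         # local value of q0 (assumed q0 = Z^2, one d.o.f.)
    p_local = 0.5 * chi2.sf(u, df=1)       # one-sided local p-value, equals 1 - Phi(Z)

    u0 = 1.0                               # much lower reference level
    N_u0 = 8.5                             # mean upcrossings at u0, e.g. counted from a few toys
    N_u = N_u0 * np.exp(-(u - u0) / 2.0)   # Eq. (38.47)
    print(p_local, p_local + N_u)          # Eq. (38.46): global ~ local + <N_u>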

38.3.2.3. Goodness-of-fit with the method of Least Squares:

When estimating parameters using the method of least squares, one obtains the minimum value of the quantity χ2 of Eq. (38.19). This statistic can be used to test the goodness-of-fit, i.e., the test provides a measure of the significance of a discrepancy between the data and the hypothesized functional form used in the fit. It may also happen that no parameters are estimated from the data, but that one simply wants to compare a histogram, e.g., a vector of Poisson distributed numbers n = (n1, . . . , nN), with a hypothesis for their expectation values µi = E[ni]. As the distribution is Poisson with variances σi² = µi, the χ2 of Eq. (38.19) becomes Pearson’s χ2 statistic,

χ2 = Σ_{i=1}^{N} (ni − µi)² / µi .  (38.48)

If the hypothesis µ = (µ1, . . . , µN) is correct, and if the expected values µi in (38.48) are sufficiently large (or equivalently, if the measurements ni can be treated as following a Gaussian distribution), then the χ2 statistic will follow the χ2 p.d.f. with the number of degrees of freedom equal to the number of measurements N minus the number of fitted parameters.

Alternatively, one may fit parameters and evaluate goodness-of-fit by minimizing −2 ln λ from Eq. (38.16). One finds that the distribution of this statistic approaches the asymptotic limit faster than does Pearson’s χ2, and thus computing the p-value with the χ2 p.d.f. will in general be better justified (see Ref. 9 and references therein).

Assuming the goodness-of-fit statistic follows a χ2 p.d.f., the p-value for the hypothesis is then

p = ∫_{χ2}^{∞} f(z; nd) dz ,  (38.49)

where f(z; nd) is the χ2 p.d.f. and nd is the appropriate number of degrees of freedom. Values are shown in Fig. 38.1 or obtained from the ROOT function TMath::Prob. If the conditions for using the χ2 p.d.f. do not hold, the statistic can still be defined as before, but its p.d.f. must be determined by other means in order to obtain the p-value, e.g., using a Monte Carlo calculation.

Since the mean of the χ2 distribution is equal to nd, one expects in a “reasonable” experiment to obtain χ2 ≈ nd. Hence the quantity χ2/nd is sometimes reported. Since the p.d.f. of χ2/nd depends on nd, however, one must report nd as well if one wishes to determine the p-value. The p-values obtained for different values of χ2/nd are shown in Fig. 38.2.
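As a small worked example (hypothetical histogram contents), Pearson’s χ2 of Eq. (38.48) and the p-value of Eq. (38.49) can be computed with scipy, chi2.sf playing the role of TMath::Prob:

    import numpy as np
    from scipy.stats import chi2

    n = np.array([13, 25, 18, 10, 4])          # observed counts (hypothetical)
    mu = np.array([15., 22., 17., 11., 5.])    # hypothesized expectation values

    chi2_val = np.sum((n - mu)**2 / mu)        # Pearson's chi^2, Eq. (38.48)
    nd = len(n)                                # no parameters fitted from these data
    print(chi2_val, chi2_val / nd, chi2.sf(chi2_val, nd))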

If one finds a χ2 value much greater than nd, and a correspondingly small p-value, one may be tempted to expect a high degree of uncertainty for any fitted parameters. Poor goodness-of-fit, however, does not mean that one will have large statistical errors for parameter estimates. If, for example, the error bars (or covariance matrix) used in constructing the χ2 are underestimated, then this will lead to underestimated statistical errors for the fitted parameters. The standard deviations of estimators that one finds from, say, Eq. (38.13) reflect how widely the estimates would be distributed if one were to repeat the measurement many times, assuming that the hypothesis and measurement errors used in the χ2 are also correct. They do not include the systematic error which may result from an incorrect hypothesis or incorrectly estimated measurement errors in the χ2.

Figure 38.1: One minus the χ2 cumulative distribution, 1 − F(χ2; n), for n degrees of freedom. This gives the p-value for the χ2 goodness-of-fit test as well as one minus the coverage probability for confidence regions (see Sec. 38.4.2.2).


Figure 38.2: The ‘reduced’ χ2, equal to χ2/n, for n degrees of freedom. The curves show as a function of n the χ2/n that corresponds to a given p-value.

38.3.3. Bayes factors :

In Bayesian statistics, all of one’s knowledge about a model is contained in its posterior probability, which one obtains using Bayes’ theorem (38.30). Thus one could reject a hypothesis H if its posterior probability P(H|x) is sufficiently small. The difficulty here is that P(H|x) is proportional to the prior probability P(H), and there will not be a consensus about the prior probabilities for the existence of new phenomena. Nevertheless one can construct a quantity called the Bayes factor (described below), which can be used to quantify the degree to which the data prefer one hypothesis over another, and is independent of their prior probabilities.

Consider two models (hypotheses), Hi and Hj, described by vectors of parameters θi and θj, respectively. Some of the components will be common to both models and others may be distinct. The full prior probability for each model can be written in the form

π(Hi, θi) = P (Hi)π(θi|Hi) . (38.50)

Here P(Hi) is the overall prior probability for Hi, and π(θi|Hi) is the normalized p.d.f. of its parameters. For each model, the posterior probability is found using Bayes’ theorem,

P(Hi|x) = ∫ P(x|θi, Hi)P(Hi)π(θi|Hi) dθi / P(x) ,  (38.51)

where the integration is carried out over the internal parameters θi of the model. The ratio of posterior probabilities for the models is therefore

P(Hi|x) / P(Hj|x) = [ ∫ P(x|θi, Hi)π(θi|Hi) dθi / ∫ P(x|θj, Hj)π(θj|Hj) dθj ] × [ P(Hi) / P(Hj) ] .  (38.52)

The Bayes factor is defined as

Bij = ∫ P(x|θi, Hi)π(θi|Hi) dθi / ∫ P(x|θj, Hj)π(θj|Hj) dθj .  (38.53)

This gives what the ratio of posterior probabilities for models i and j would be if the overall prior probabilities for the two models were equal. If the models have no nuisance parameters, i.e., no internal parameters described by priors, then the Bayes factor is simply the likelihood ratio. The Bayes factor therefore shows by how much the probability ratio of model i to model j changes in the light of the data, and thus can be viewed as a numerical measure of evidence supplied by the data in favour of one hypothesis over the other.

Although the Bayes factor is by construction independent of the overall prior probabilities P(Hi) and P(Hj), it does require priors for all internal parameters of a model, i.e., one needs the functions π(θi|Hi) and π(θj|Hj). In a Bayesian analysis where one is only interested in the posterior p.d.f. of a parameter, it may be acceptable to take an unnormalizable function for the prior (an improper prior) as long as the product of likelihood and prior can be normalized. But improper priors are only defined up to an arbitrary multiplicative constant, and so the Bayes factor would depend on this constant. Furthermore, although the range of a constant normalized prior is unimportant for parameter determination (provided it is wider than the likelihood), this is not so for the Bayes factor when such a prior is used for only one of the hypotheses. So to compute a Bayes factor, all internal parameters must be described by normalized priors that represent meaningful probabilities over the entire range where they are defined.

An exception to this rule may be considered when the identical parameter appears in the models for both numerator and denominator of the Bayes factor. In this case one can argue that the arbitrary constants would cancel. One must exercise some caution, however, as parameters with the same name and physical meaning may still play different roles in the two models.

Both integrals in Eq. (38.53) are of the form

m = ∫ P(x|θ)π(θ) dθ ,  (38.54)

which is the marginal likelihood seen previously in Eq. (38.44) (in some fields this quantity is called the evidence). A review of Bayes factors can be found in Ref. 30. Computing marginal likelihoods can be difficult; in many cases it can be done with the nested sampling algorithm [31] as implemented, e.g., in the program MultiNest [32].
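For low-dimensional problems the marginal likelihoods can simply be integrated numerically; this sketch (a hypothetical Poisson counting model with known background and a flat normalized signal prior) computes the Bayes factor of a signal-plus-background model relative to background only:

    import numpy as np
    from scipy.stats import poisson

    n_obs, b = 10, 4.0                        # observed count, known background (hypothetical)

    # H0: background only, no internal parameters
    m0 = poisson.pmf(n_obs, b)

    # H1: signal mean s with normalized flat prior pi(s) = 1/20 on [0, 20]
    s = np.linspace(0.0, 20.0, 2001)
    ds = s[1] - s[0]
    m1 = np.sum(poisson.pmf(n_obs, b + s) / 20.0) * ds   # Eq. (38.54) for H1

    print(m1 / m0)                            # Bayes factor B_10, Eq. (38.53)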

38.4. Intervals and limits

When the goal of an experiment is to determine a parameter θ, the result is usually expressed by quoting, in addition to the point estimate, some sort of interval which reflects the statistical precision of the measurement. In the simplest case, this can be given by the parameter’s estimated value θ̂ plus or minus an estimate of the standard deviation of θ̂, σθ̂. If, however, the p.d.f. of the estimator is not Gaussian or if there are physical boundaries on the possible values of the parameter, then one usually quotes instead an interval according to one of the procedures described below.

In reporting an interval or limit, the experimenter may wish to

• communicate as objectively as possible the result of the experiment;

• provide an interval that is constructed to cover the true value of the parameter with a specified probability;

• provide the information needed by the consumer of the result to draw conclusions about the parameter or to make a particular decision;

• draw conclusions about the parameter that incorporate stated prior beliefs.

With a sufficiently large data sample, the point estimate and standard deviation (or for the multiparameter case, the parameter estimates and covariance matrix) satisfy essentially all of these goals. For finite data samples, no single method for quoting an interval will achieve all of them.

In addition to the goals listed above, the choice of method may be influenced by practical considerations such as ease of producing an interval from the results of several measurements. Of course the experimenter is not restricted to quoting a single interval or limit; one may choose, for example, first to communicate the result with a confidence interval having certain frequentist properties, and then in addition to draw conclusions about a parameter using a judiciously chosen subjective Bayesian prior.

It is recommended, however, that there be a clear separation between these two aspects of reporting a result. In the remainder of this section, we assess the extent to which various types of intervals achieve the goals stated here.


38.4.1. Bayesian intervals :

As described in Sec. 38.2.4, a Bayesian posterior probability may be used to determine regions that will have a given probability of containing the true value of a parameter. In the single parameter case, for example, an interval (called a Bayesian or credible interval) [θlo, θup] can be determined which contains a given fraction 1 − α of the posterior probability, i.e.,

1 − α = ∫_{θlo}^{θup} p(θ|x) dθ .  (38.55)

Sometimes an upper or lower limit is desired, i.e., θlo or θup can be set to a physical boundary or to plus or minus infinity. In other cases, one might be interested in the set of θ values for which p(θ|x) is higher than for any θ not belonging to the set, which may constitute a single interval or a set of disjoint regions; these are called highest posterior density (HPD) intervals. Note that HPD intervals are not invariant under a nonlinear transformation of the parameter.

If a parameter is constrained to be non-negative, then the prior p.d.f. can simply be set to zero for negative values. An important example is the case of a Poisson variable n, which counts signal events with unknown mean s, as well as background with mean b, assumed known. For the signal mean s, one often uses the prior

π(s) = 0 for s < 0,  π(s) = 1 for s ≥ 0 .  (38.56)

This prior is regarded as providing an interval whose frequentist properties can be studied, rather than as representing a degree of belief. For example, to obtain an upper limit on s, one may proceed as follows. The likelihood for s is given by the Poisson distribution for n with mean s + b,

P(n|s) = ((s + b)^n / n!) e^{−(s+b)} ,  (38.57)

along with the prior (38.56) in (38.30) gives the posterior density for s. An upper limit sup at confidence level (or here, rather, credibility level) 1 − α can be obtained by requiring

1 − α = ∫_{−∞}^{sup} p(s|n) ds = ∫_{−∞}^{sup} P(n|s)π(s) ds / ∫_{−∞}^{∞} P(n|s)π(s) ds ,  (38.58)

where the lower limit of integration is effectively zero because of the cut-off in π(s). By relating the integrals in Eq. (38.58) to incomplete gamma functions, the solution for the upper limit is found to be

sup = (1/2) F^{-1}_{χ2}[p; 2(n + 1)] − b ,  (38.59)

where F^{-1}_{χ2} is the quantile of the χ2 distribution (inverse of the cumulative distribution). Here the quantity p is

p = 1 − α(1 − F_{χ2}[2b; 2(n + 1)]) ,  (38.60)

where F_{χ2} is the cumulative χ2 distribution. For both F_{χ2} and F^{-1}_{χ2} above, the argument 2(n + 1) gives the number of degrees of freedom. For the special case of b = 0, the limit reduces to

sup = (1/2) F^{-1}_{χ2}(1 − α; 2(n + 1)) .  (38.61)

It happens that for the case of b = 0, the upper limit from Eq. (38.61) coincides numerically with the frequentist upper limit discussed in Section 38.4.2.3. Values for 1 − α = 0.9 and 0.95 are given by the values µup in Table 38.3. The frequentist properties of confidence intervals for the Poisson mean found in this way are discussed in Refs. [2] and [21].
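Eqs. (38.59)–(38.61) are easily evaluated with a χ2 quantile function; a scipy sketch (with hypothetical n and b) that reproduces µup = 3.00 of Table 38.3 for n = 0 at 95% credibility level:

    from scipy.stats import chi2

    def s_up(n, b, alpha=0.05):
        p = 1.0 - alpha * (1.0 - chi2.cdf(2.0 * b, 2 * (n + 1)))  # Eq. (38.60)
        return 0.5 * chi2.ppf(p, 2 * (n + 1)) - b                 # Eq. (38.59)

    print(s_up(n=0, b=0.0))   # 3.00, as in Table 38.3 (95% CL)
    print(s_up(n=3, b=1.5))   # a known background lowers the limit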

As in any Bayesian analysis, it is important to show how the result changes under assumption of different prior probabilities. For example, one could consider the Jeffreys prior as described in Sec. 38.2.4. For this problem one finds the Jeffreys prior π(s) ∝ 1/√(s + b) for s ≥ 0 and zero otherwise. As with the constant prior, one would not regard this as representing one’s prior beliefs about s, both because it is improper and also as it depends on b. Rather it is used with Bayes’ theorem to produce an interval whose frequentist properties can be studied.

If the model contains nuisance parameters then these are eliminated by marginalizing, as in Eq. (38.36), to obtain the p.d.f. for the parameters of interest. For example, if the parameter b in the Poisson counting problem above were to be characterized by a prior p.d.f. π(b), then one would first use Bayes’ theorem to find p(s, b|n). This is then marginalized to find p(s|n) = ∫ p(s, b|n) db, from which one may determine an interval for s. One may not be certain whether to extend a model by including more nuisance parameters. In this case, a Bayes factor may be used to determine to what extent the data prefer a model with additional parameters, as described in Section 38.3.3.

38.4.2. Frequentist confidence intervals :

The unqualified phrase “confidence intervals” refers to frequentist intervals obtained with a procedure due to Neyman [29], described below. These are intervals (or in the multiparameter case, regions) constructed so as to include the true value of the parameter with a probability greater than or equal to a specified level, called the coverage probability. It is important to note that in the frequentist approach, such coverage is not meaningful for a fixed interval. A confidence interval, however, depends on the data and thus would fluctuate if one were to repeat the experiment many times. The coverage probability refers to the fraction of intervals in such a set that contain the true parameter value. In this section, we discuss several techniques for producing intervals that have, at least approximately, this property.

38.4.2.1. The Neyman construction for confidence intervals:

Consider a p.d.f. f(x; θ) where x represents the outcome of the experiment and θ is the unknown parameter for which we want to construct a confidence interval. The variable x could (and often does) represent an estimator for θ. Using f(x; θ), we can find for a pre-specified probability 1 − α, and for every value of θ, a set of values x1(θ, α) and x2(θ, α) such that

P(x1 < x < x2; θ) = 1 − α = ∫_{x1}^{x2} f(x; θ) dx .  (38.62)

This is illustrated in Fig. 38.3: a horizontal line segment [x1(θ, α), x2(θ, α)] is drawn for representative values of θ. The union of such intervals for all values of θ, designated in the figure as D(α), is known as the confidence belt. Typically the curves x1(θ, α) and x2(θ, α) are monotonic functions of θ, which we assume for this discussion.

Figure 38.3: Construction of the confidence belt (see text).

Upon performing an experiment to measure x and obtaining a value x0, one draws a vertical line through x0. The confidence interval for θ is the set of all values of θ for which the corresponding line segment [x1(θ, α), x2(θ, α)] is intercepted by this vertical line. Such confidence intervals are said to have a confidence level (CL) equal to 1 − α.

Now suppose that the true value of θ is θ0, indicated in the figure. We see from the figure that θ0 lies between θ1(x) and θ2(x) if and only if x lies between x1(θ0) and x2(θ0). The two events thus have the same probability, and since this is true for any value θ0, we can drop the subscript 0 and obtain

1 − α = P (x1(θ) < x < x2(θ)) = P (θ2(x) < θ < θ1(x)) . (38.63)

In this probability statement, θ1(x) and θ2(x), i.e., the endpoints of the interval, are the random variables and θ is an unknown constant. If the experiment were to be repeated a large number of times, the interval [θ1, θ2] would vary, covering the fixed value θ in a fraction 1 − α of the experiments.

The condition of coverage in Eq. (38.62) does not determine x1 and x2 uniquely, and additional criteria are needed. One possibility is to choose central intervals such that the probabilities excluded below x1 and above x2 are each α/2. In other cases, one may want to report only an upper or lower limit, in which case the probability excluded below x1 or above x2 can be set to zero. Another principle based on likelihood ratio ordering for determining which values of x should be included in the confidence belt is discussed below.

When the observed random variable x is continuous, the coverage probability obtained with the Neyman construction is 1 − α, regardless of the true value of the parameter. If x is discrete, however, it is not possible to find segments [x1(θ, α), x2(θ, α)] that satisfy Eq. (38.62) exactly for all values of θ. By convention, one constructs the confidence belt requiring the probability P(x1 < x < x2) to be greater than or equal to 1 − α. This gives confidence intervals that include the true parameter with a probability greater than or equal to 1 − α.

An equivalent method of constructing confidence intervals is to consider a test (see Sec. 38.3) of the hypothesis that the parameter’s true value is θ (assume one constructs a test for all physical values of θ). One then excludes all values of θ where the hypothesis would be rejected in a test of size α or less. The remaining values constitute the confidence interval at confidence level 1 − α. If the critical region of the test is characterized by having a p-value pθ ≤ α, then the endpoints of the confidence interval are found in practice by solving pθ = α for θ.

In this procedure, one is still free to choose the test to be used; this corresponds to the freedom in the Neyman construction as to which values of the data are included in the confidence belt. One possibility is to use a test statistic based on the likelihood ratio,

λ = f(x; θ) / f(x; θ̂) ,  (38.64)

where θ̂ is the value of the parameter which, out of all allowed values, maximizes f(x; θ). This results in the intervals described in Ref. 33 by Feldman and Cousins. The same intervals can be obtained from the Neyman construction described above by including in the confidence belt those values of x which give the greatest values of λ.

If the model contains nuisance parameters ν, then these can be incorporated into the test (or the p-values) used to determine the limit by profiling as discussed in Section 38.3.2.1. As mentioned there, the strict frequentist approach is to regard the parameter of interest θ as excluded only if it is rejected for all possible values of ν. The resulting interval for θ will then cover the true value with a probability greater than or equal to the nominal confidence level for all points in ν-space.

If the p-value is based on the profiled values of the nuisance parameters, i.e., with ν = ν̂(θ) used in Eq. (38.42), then the resulting interval for the parameter of interest will have the correct coverage if the true values of ν are equal to the profiled values. Otherwise the coverage probability may be too high or too low. This procedure has been called profile construction in HEP [20] (see also [17]).

38.4.2.2. Gaussian distributed measurements:

An important example of constructing a confidence interval is when the data consists of a single random variable x that follows a Gaussian distribution; this is often the case when x represents an estimator for a parameter and one has a sufficiently large data sample. If there is more than one parameter being estimated, the multivariate Gaussian is used. For the univariate case with known σ, the probability that the measured value x will fall within ±δ of the true value µ is

1 − α = (1/√(2π)σ) ∫_{µ−δ}^{µ+δ} e^{−(x−µ)²/2σ²} dx = erf(δ/(√2 σ)) = 2Φ(δ/σ) − 1 ,  (38.65)

where erf is the Gaussian error function, which is rewritten in the final equality using Φ, the Gaussian cumulative distribution. Fig. 38.4 shows a δ = 1.64σ confidence interval unshaded. The choice δ = σ gives an interval called the standard error which has 1 − α = 68.27% if σ is known. Values of α for other frequently used choices of δ are given in Table 38.1.

Figure 38.4: Illustration of a symmetric 90% confidence interval (unshaded) for a measurement of a single quantity with Gaussian errors. Integrated probabilities, defined by α = 0.1, are as shown.

Table 38.1: Area of the tails α outside ±δ from the mean of a Gaussian distribution.

   α           δ        α       δ
   0.3173      1σ       0.2     1.28σ
   4.55×10−2   2σ       0.1     1.64σ
   2.7×10−3    3σ       0.05    1.96σ
   6.3×10−5    4σ       0.01    2.58σ
   5.7×10−7    5σ       0.001   3.29σ
   2.0×10−9    6σ       10−4    3.89σ

We can set a one-sided (upper or lower) limit by excluding above x + δ (or below x − δ). The values of α for such limits are half the values in Table 38.1.

The relation (38.65) can be re-expressed using the cumulative distribution function for the χ2 distribution as

α = 1 − F(χ2; n) ,  (38.66)

for χ2 = (δ/σ)² and n = 1 degree of freedom. This can be seen as the n = 1 curve in Fig. 38.1 or obtained by using the ROOT function TMath::Prob.

For multivariate measurements of, say, n parameter estimates θ̂ = (θ̂1, . . . , θ̂n), one requires the full covariance matrix Vij = cov[θ̂i, θ̂j], which can be estimated as described in Sections 38.2.2 and 38.2.3. Under fairly general conditions with the methods of maximum-likelihood or least-squares in the large sample limit, the estimators will be distributed according to a multivariate Gaussian centered about the true (unknown) values θ, and furthermore, the likelihood function itself takes on a Gaussian shape.

The standard error ellipse for the pair (θ̂i, θ̂j) is shown in Fig. 38.5, corresponding to a contour χ2 = χ2min + 1 or ln L = ln Lmax − 1/2. The ellipse is centered about the estimated values θ̂, and the tangents to the ellipse give the standard deviations of the estimators, σi and σj. The angle of the major axis of the ellipse is given by

tan 2φ = 2ρij σi σj / (σj² − σi²) ,  (38.67)


Table 38.2: Values of ∆χ2 or 2∆ ln L corresponding to a coverage probability 1 − α in the large data sample limit, for joint estimation of m parameters.

   (1 − α) (%)   m = 1   m = 2   m = 3
   68.27          1.00    2.30    3.53
   90.            2.71    4.61    6.25
   95.            3.84    5.99    7.82
   95.45          4.00    6.18    8.03
   99.            6.63    9.21   11.34
   99.73          9.00   11.83   14.16

where ρij = cov[θi, θj ]/σiσj is the correlation coefficient.

The correlation coefficient can be visualized as the fraction of the distance σi from the ellipse’s horizontal center-line at which the ellipse becomes tangent to vertical, i.e., at the distance ρijσi below the center-line as shown. As ρij goes to +1 or −1, the ellipse thins to a diagonal line.

It could happen that one of the parameters, say, θj, is known from previous measurements to a precision much better than σj, so that the current measurement contributes almost nothing to the knowledge of θj. However, the current measurement of θi and its dependence on θj may still be important. In this case, instead of quoting both parameter estimates and their correlation, one sometimes reports the value of θi, which minimizes χ2 at a fixed value of θj, such as the PDG best value. This θi value lies along the dotted line between the points where the ellipse becomes tangent to vertical, and has statistical error σinner as shown on the figure, where σinner = (1 − ρij²)^{1/2} σi. Instead of the correlation ρij, one reports the dependency dθi/dθj, which is the slope of the dotted line. This slope is related to the correlation coefficient by dθi/dθj = ρij × (σi/σj).

Figure 38.5: Standard error ellipse for the estimators θ̂i and θ̂j. In this case the correlation is negative.

As in the single-variable case, because of the symmetry of the Gaussian function between θ and θ̂, one finds that contours of constant ln L or χ2 cover the true values with a certain, fixed probability. That is, the confidence region is determined by

ln L(θ) ≥ ln Lmax − ∆ ln L ,  (38.68)

or where a χ2 has been defined for use with the method of least-squares,

χ2(θ) ≤ χ2min + ∆χ2 .  (38.69)

Values of ∆χ2 or 2∆ ln L are given in Table 38.2 for several values of the coverage probability and number of fitted parameters.

For non-Gaussian data samples, the probability for the regions determined by Eqs. (38.68) or (38.69) to cover the true value of θ becomes independent of θ only in the large-sample limit. So for a finite data sample these are not exact confidence regions according to our previous definition. Nevertheless, they can still have a coverage probability only weakly dependent on the true parameter, and approximately as given in Table 38.2. In any case, the coverage probability of the intervals or regions obtained according to this procedure can in principle be determined as a function of the true parameter(s), for example, using a Monte Carlo calculation.

One of the practical advantages of intervals that can be constructed from the log-likelihood function or χ2 is that it is relatively simple to produce the interval for the combination of several experiments. If N independent measurements result in log-likelihood functions ln Li(θ), then the combined log-likelihood function is simply the sum,

ln L(θ) = Σ_{i=1}^{N} ln Li(θ) .  (38.70)

This can then be used to determine an approximate confidence interval or region with Eq. (38.68), just as with a single experiment.

38.4.2.3. Poisson or binomial data:

Another important class of measurements consists of counting a certain number of events, n. In this section, we will assume these are all events of the desired type, i.e., there is no background. If n represents the number of events produced in a reaction with cross section σ, say, in a fixed integrated luminosity L, then it follows a Poisson distribution with mean µ = σL. If, on the other hand, one has selected a larger sample of N events and found n of them to have a particular property, then n follows a binomial distribution where the parameter p gives the probability for the event to possess the property in question. This is appropriate, e.g., for estimates of branching ratios or selection efficiencies based on a given total number of events.

For the case of Poisson distributed n, the upper and lower limits on the mean value µ can be found from the Neyman procedure to be

µlo = (1/2) F^{-1}_{χ2}(αlo; 2n) ,  (38.71a)

µup = (1/2) F^{-1}_{χ2}(1 − αup; 2(n + 1)) ,  (38.71b)

where the upper and lower limits are at confidence levels of 1 − αlo and 1 − αup, respectively, and F^{-1}_{χ2} is the quantile of the χ2 distribution (inverse of the cumulative distribution). The quantiles F^{-1}_{χ2} can be obtained from standard tables or from the ROOT routine TMath::ChisquareQuantile. For central confidence intervals at confidence level 1 − α, set αlo = αup = α/2.
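A scipy sketch of Eqs. (38.71a,b) (chi2.ppf standing in for TMath::ChisquareQuantile) that reproduces the one-sided limits of Table 38.3 below:

    from scipy.stats import chi2

    def poisson_limits(n, cl=0.90):
        alpha = 1.0 - cl                  # one-sided limits, each at confidence level cl
        mu_lo = 0.5 * chi2.ppf(alpha, 2 * n) if n > 0 else 0.0
        mu_up = 0.5 * chi2.ppf(1.0 - alpha, 2 * (n + 1))
        return mu_lo, mu_up

    print(poisson_limits(0))   # (0.0, 2.30)
    print(poisson_limits(3))   # (1.10, 6.68)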

Table 38.3: Lower and upper (one-sided) limits for the mean µ of a Poisson variable given n observed events in the absence of background, for confidence levels of 90% and 95%.

         1 − α = 90%         1 − α = 95%
    n    µlo     µup         µlo     µup
    0    –       2.30        –       3.00
    1    0.105   3.89        0.051   4.74
    2    0.532   5.32        0.355   6.30
    3    1.10    6.68        0.818   7.75
    4    1.74    7.99        1.37    9.15
    5    2.43    9.27        1.97   10.51
    6    3.15   10.53        2.61   11.84
    7    3.89   11.77        3.29   13.15
    8    4.66   12.99        3.98   14.43
    9    5.43   14.21        4.70   15.71
   10    6.22   15.41        5.43   16.96

It happens that the upper limit from Eq. (38.71b) coincides numerically with the Bayesian upper limit for a Poisson parameter, using a uniform prior p.d.f. for µ. Values for confidence levels of 90% and 95% are shown in Table 38.3. For the case of binomially distributed n successes out of N trials with probability of success p, the upper and lower limits on p are found to be

plo = n F^{-1}_{F}[αlo; 2n, 2(N − n + 1)] / ( N − n + 1 + n F^{-1}_{F}[αlo; 2n, 2(N − n + 1)] ) ,  (38.72a)

pup = (n + 1) F^{-1}_{F}[1 − αup; 2(n + 1), 2(N − n)] / ( (N − n) + (n + 1) F^{-1}_{F}[1 − αup; 2(n + 1), 2(N − n)] ) .  (38.72b)


Here F^{-1}_{F} is the quantile of the F distribution (also called the Fisher–Snedecor distribution; see Ref. 4).
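The binomial limits can likewise be evaluated with scipy’s F quantile; a sketch (hypothetical n and N) of Eqs. (38.72a,b), with the conventional boundary cases n = 0 and n = N handled explicitly:

    from scipy.stats import f

    def binomial_limits(n, N, alpha_lo=0.05, alpha_up=0.05):
        if n == 0:
            p_lo = 0.0
        else:
            q = f.ppf(alpha_lo, 2 * n, 2 * (N - n + 1))
            p_lo = n * q / (N - n + 1 + n * q)              # Eq. (38.72a)
        if n == N:
            p_up = 1.0
        else:
            q = f.ppf(1.0 - alpha_up, 2 * (n + 1), 2 * (N - n))
            p_up = (n + 1) * q / ((N - n) + (n + 1) * q)    # Eq. (38.72b)
        return p_lo, p_up

    print(binomial_limits(3, 10))   # one-sided 95% CL lower and upper limits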

38.4.2.4. Parameter exclusion in cases of low sensitivity:

An important example of a statistical test arises in the search for a new signal process. Suppose the parameter µ is defined such that it is proportional to the signal cross section. A statistical test may be carried out for hypothesized values of µ, which may be done by computing a p-value, pµ, for all µ. Those values not rejected in a test of size α, i.e., for which one does not find pµ ≤ α, constitute a confidence interval with confidence level 1 − α.

In general one will find that for some regions in the parameter space of the signal model, the predictions for data are almost indistinguishable from those of the background-only model. This corresponds to the case where µ is very small, as would occur, e.g., if one searches for a new particle with a mass so high that its production rate in a given experiment is negligible. That is, one has essentially no experimental sensitivity to such a model.

One would prefer that if the sensitivity to a model (or a point in a model’s parameter space) is very low, then it should not be excluded. Even if the outcomes predicted with or without signal are identical, however, the probability to reject the signal model will equal α, the type-I error rate. As one often takes α to be 5%, this would mean that in a large number of searches covering a broad range of a signal model’s parameter space, there would inevitably be excluded regions in which the experimental sensitivity is very small, and thus one may question whether it is justified to regard such parameter values as disfavored.

Exclusion of models to which one has little or no sensitivity occurs, for example, if the data fluctuate very low relative to the expectation of the background-only hypothesis. In this case the resulting upper limit on the predicted rate (cross section) of a signal model may be anomalously low. As a means of controlling this effect one often determines the mean or median limit under assumption of the background-only hypothesis using a simplified Monte Carlo simulation of the experiment. An upper limit found significantly below the background-only expectation may indicate a strong downward fluctuation of the data, or perhaps as well an incorrect estimate of the background rate.

One way to mitigate the problem of excluding models to which one is not sensitive is the CLs method, where the measure used to test a parameter is increased for decreasing sensitivity [34,35]. The procedure is based on a statistic called CLs, which is defined as

CLs = pµ / (1 − pb) ,  (38.73)

where pb is the p-value of the background-only hypothesis. In the usual formulation of the method, both pµ and pb are defined using a single test statistic, and the definition of CLs above assumes this statistic is continuous; more details can be found in Refs. [34,35].

A point in a model’s parameter space is regarded as excludedif one finds CLs ≤ α. As the denominator in Eq. (38.73) is alwaysless than or equal to unity, the exclusion criterion based on CLs

is more stringent than the usual requirement pµ ≤ α. In this sensethe CLs procedure is conservative, and the coverage probability ofthe corresponding intervals will exceed the nominal confidence level1− α. If the experimental sensitivity to a given value of µ is very low,then one finds that as pµ decreases, so does the denominator 1 − pb,and thus the condition CLs ≤ α is effectively prevented from beingsatisfied. In this way the exclusion of parameters in the case of lowsensitivity is suppressed.

The CLs procedure has the attractive feature that the resulting intervals coincide with those obtained from the Bayesian method in two important cases: the mean value of a Poisson or Gaussian distributed measurement with a constant prior. The CLs intervals overcover for all values of the parameter µ, however, by an amount that depends on µ.

The problem of excluding parameter values to which one has little sensitivity is particularly acute when one wants to set a one-sided limit, e.g., an upper limit on a cross section. Here one tests a value of a rate parameter µ against the alternative of a lower rate, and therefore the critical region of the test is taken to correspond to data outcomes with a low event yield. If the number of events found in the search region fluctuates low enough, however, it can happen that all physically meaningful signal parameter values, including those to which one has very little sensitivity, are rejected by the test.

Another solution to this problem, therefore, is to replace the one-sided test by one based on the likelihood ratio, where the critical region is not restricted to low rates. This is the approach followed in the Feldman-Cousins procedure described in Section 38.4.2.1. The critical region for the test of a given value of µ contains data values characteristic of both higher and lower rates. As a result, for a given observed rate one can in general obtain a two-sided interval. If, however, the parameter estimate µ̂ is sufficiently close to the lower limit of zero, then only high values of µ are rejected, and the lower edge of the confidence interval is at zero. Note, however, that the coverage property of 1 − α pertains to the entire interval, not to the probability for the upper edge µup to be greater than the true value µ. For parameter estimates increasingly far away from the boundary, i.e., for increasing signal significance, the point µ = 0 is excluded and the interval has nonzero upper and lower edges.

An additional difficulty arises when a parameter estimate is not significantly far away from the boundary, in which case it is natural to report a one-sided confidence interval (often an upper limit). It is straightforward to force the Neyman prescription to produce only an upper limit by setting x2 = ∞ in Eq. (38.62). Then x1 is uniquely determined and the upper limit can be obtained. If, however, the data come out such that the parameter estimate is not so close to the boundary, one might wish to report a central confidence interval (i.e., an interval based on a two-sided test with equal upper and lower tail areas). As pointed out by Feldman and Cousins [33], however, if the decision to report an upper limit or two-sided interval is made by looking at the data ("flip-flopping"), then in general there will be parameter values for which the resulting intervals have a coverage probability less than 1 − α. The confidence intervals suggested in [33] avoid this: the prescription determines whether the interval is one- or two-sided in a way which preserves the coverage probability (such intervals are thus said to be unified).

The intervals according to this method for the mean of a Poisson variable in the absence of background are given in Table 38.4. (Note that α in Ref. 33 is defined following Neyman [29] as the coverage probability; this is opposite the modern convention used here, in which the coverage probability is 1 − α.) The values of 1 − α given here refer to the coverage of the true parameter by the whole interval [µ1, µ2]. In Table 38.3 for the one-sided upper limit, however, 1 − α refers to the probability to have µup ≥ µ (or µlo ≤ µ for lower limits).

Table 38.4: Unified confidence intervals [µ1, µ2] for the mean of a Poisson variable given n observed events in the absence of background, for confidence levels of 90% and 95%.

          1 − α = 90%        1 − α = 95%
 n        µ1      µ2         µ1      µ2
 0       0.00    2.44       0.00    3.09
 1       0.11    4.36       0.05    5.14
 2       0.53    5.91       0.36    6.72
 3       1.10    7.42       0.82    8.25
 4       1.47    8.60       1.37    9.76
 5       1.84    9.99       1.84   11.26
 6       2.21   11.47       2.21   12.75
 7       3.56   12.53       2.58   13.81
 8       3.96   13.99       2.94   15.29
 9       4.36   15.30       4.36   16.77
10       5.50   16.50       4.75   17.82


A potential difficulty with unified intervals arises if, for example, one constructs such an interval for a Poisson parameter s of some yet to be discovered signal process with, say, 1 − α = 0.9. If the true signal parameter is zero, or in any case much less than the expected background, one will usually obtain a one-sided upper limit on s. In a certain fraction of the experiments, however, a two-sided interval for s will result. Since, however, one typically chooses 1 − α to be only 0.9 or 0.95 when setting limits, the value s = 0 may be found below the lower edge of the interval before the existence of the effect is well established. It must then be communicated carefully that in excluding s = 0 at, say, 90% or 95% confidence level from the interval, one is not necessarily claiming to have discovered the effect, for which one would usually require a higher level of significance (e.g., 5σ).

Another possibility is to construct a Bayesian interval as described in Section 38.4.1. The presence of the boundary can be incorporated simply by setting the prior density to zero in the unphysical region. More specifically, the prior may be chosen using formal rules such as the reference prior or Jeffreys prior mentioned in Sec. 38.2.4.

In HEP a widely used prior for the mean µ of a Poisson distributed measurement has been uniform for µ ≥ 0. This prior does not follow from any fundamental rule nor can it be regarded as reflecting a reasonable degree of belief, since the prior probability for µ to lie between any two finite limits is zero. It is more appropriately regarded as a procedure for obtaining intervals with frequentist properties that can be investigated. The resulting upper limits have a coverage probability that depends on the true value of the Poisson parameter, and is nowhere smaller than the stated probability content. Lower limits and two-sided intervals for the Poisson mean based on flat priors undercover, however, for some values of the parameter, although to an extent that in practical cases may not be too severe [2,21]. Intervals constructed in this way have the advantage of being easy to derive; if several independent measurements are to be combined then one simply multiplies the likelihood functions (cf. Eq. (38.70)).

In any case, it is important to always report sufficient information so that the result can be combined with other measurements. Often this means giving an unbiased estimator and its standard deviation, even if the estimated value is in the unphysical region.

It can also be useful with a frequentist interval to calculate its subjective probability content using the posterior p.d.f. based on one or several reasonable guesses for the prior p.d.f. If it turns out to be significantly less than the stated confidence level, this warns that it would be particularly misleading to draw conclusions about the parameter's value from the interval alone.

References:

1. B. Efron, Am. Stat. 40, 11 (1986).
2. R.D. Cousins, Am. J. Phys. 63, 398 (1995).
3. A. Stuart, J.K. Ord, and S. Arnold, Kendall's Advanced Theory of Statistics, Vol. 2A: Classical Inference and the Linear Model, 6th ed., Oxford Univ. Press (1999), and earlier editions by Kendall and Stuart. The likelihood-ratio ordering principle is described at the beginning of Ch. 23. Chapter 26 compares different schools of statistical inference.
4. F.E. James, Statistical Methods in Experimental Physics, 2nd ed., (World Scientific, Singapore, 2007).
5. H. Cramer, Mathematical Methods of Statistics, Princeton Univ. Press, New Jersey (1958).
6. L. Lyons, Statistics for Nuclear and Particle Physicists, (Cambridge University Press, New York, 1986).
7. R. Barlow, Nucl. Instrum. Methods A297, 496 (1990).
8. G. Cowan, Statistical Data Analysis, (Oxford University Press, Oxford, 1998).
9. For a review, see S. Baker and R. Cousins, Nucl. Instrum. Methods 221, 437 (1984).
10. For information on neural networks and related topics, see e.g., C.M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford (1995); C. Peterson and T. Rognvaldsson, An Introduction to Artificial Neural Networks, in Proceedings of the 1991 CERN School of Computing, C. Verkerk (ed.), CERN 92-02 (1992).
11. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed., (Springer, New York, 2009).
12. A. Webb, Statistical Pattern Recognition, 2nd ed., (Wiley, New York, 2002).
13. L.I. Kuncheva, Combining Pattern Classifiers, (Wiley, New York, 2004).
14. Links to the Proceedings of the PHYSTAT conference series (Durham 2002, Stanford 2003, Oxford 2005, and Geneva 2007, 2011) can be found at phystat.org.
15. A. Hocker et al., TMVA Users Guide, physics/0703039 (2007); software available from tmva.sf.net.
16. I. Narsky, StatPatternRecognition: A C++ Package for Statistical Analysis of High Energy Physics Data, physics/0507143 (2005); software available from sourceforge.net/projects/statpatrec.
17. L. Demortier, P-Values and Nuisance Parameters, Proceedings of PHYSTAT 2007, CERN-2008-001, p. 23.
18. E. Gross and O. Vitells, Eur. Phys. J. C70, 525 (2010); arXiv:1005.1891.
19. R.B. Davis, Biometrika 74, 33 (1987).
20. K. Cranmer, Statistical Challenges for Searches for New Physics at the LHC, in Proceedings of PHYSTAT 2005, L. Lyons and M. Karagoz Unel (eds.), Oxford (2005); arXiv:physics/0511028.
21. B.P. Roe and M.B. Woodroofe, Phys. Rev. D63, 13009 (2000).
22. A. O'Hagan and J.J. Forster, Bayesian Inference, 2nd ed., volume 2B of Kendall's Advanced Theory of Statistics, Arnold, London (2004).
23. D. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, (Oxford University Press, 2006).
24. P.C. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, (Cambridge University Press, 2005).
25. J.M. Bernardo and A.F.M. Smith, Bayesian Theory, (Wiley, 2000).
26. R.E. Kass and L. Wasserman, J. Am. Stat. Assoc. 91, 1343 (1996).
27. J.M. Bernardo, J. R. Statist. Soc. B41, 113 (1979); J.M. Bernardo and J.O. Berger, J. Am. Stat. Assoc. 84, 200 (1989). See also J.M. Bernardo, Reference Analysis, in Handbook of Statistics, 25 (D.K. Dey and C.R. Rao, eds.), 17–90, Elsevier (2005) and references therein.
28. L. Demortier, S. Jain, and H. Prosper, Phys. Rev. D82, 034002 (2010); arXiv:1002.1111.
29. J. Neyman, Phil. Trans. Royal Soc. London, Series A, 236, 333 (1937), reprinted in A Selection of Early Statistical Papers on J. Neyman, (University of California Press, Berkeley, 1967).
30. R.E. Kass and A.E. Raftery, J. Am. Stat. Assoc. 90, 773 (1995).
31. J. Skilling, Nested Sampling, AIP Conference Proceedings 735, 395–405 (2004).
32. F. Feroz, M.P. Hobson, and M. Bridges, Mon. Not. Roy. Astron. Soc. 398, 1601–1614 (2009); arXiv:0809.3437.
33. G.J. Feldman and R.D. Cousins, Phys. Rev. D57, 3873 (1998). This paper does not specify what to do if the ordering principle gives equal rank to some values of x. Eq. 21.6 of Ref. 3 gives the rule: all such points are included in the acceptance region (the domain D(α)). Some authors have assumed the contrary, and shown that one can then obtain null intervals.
34. A.L. Read, Modified frequentist analysis of search results (the CLs method), in F. James, L. Lyons, and Y. Perrin (eds.), Workshop on Confidence Limits, CERN Yellow Report 2000-005, available through cdsweb.cern.ch.
35. T. Junk, Nucl. Instrum. Methods A434, 435 (1999).
36. N. Reid, Likelihood Inference in the Presence of Nuisance Parameters, Proceedings of PHYSTAT 2003, L. Lyons, R. Mount, and R. Reitmeyer (eds.), eConf C030908, Stanford, 2003.
37. W.A. Rolke, A.M. Lopez, and J. Conrad, Nucl. Instrum. Methods A551, 493 (2005); physics/0403059.
38. G. Cowan, K. Cranmer, E. Gross, and O. Vitells, Eur. Phys. J. C71, 1554 (2011).


39. MONTE CARLO TECHNIQUES

Revised September 2011 by G. Cowan (RHUL).

Monte Carlo techniques are often the only practical way to evaluate difficult integrals or to sample random variables governed by complicated probability density functions. Here we describe an assortment of methods for sampling some commonly occurring probability density functions.

39.1. Sampling the uniform distribution

Most Monte Carlo sampling or integration techniques assume a "random number generator," which generates uniform statistically independent values on the half-open interval [0, 1); for reviews see, e.g., [1,2].

Uniform random number generators are available in software libraries such as CERNLIB [3], CLHEP [4], and ROOT [5]. For example, in addition to a basic congruential generator TRandom (see below), ROOT provides three more sophisticated routines: TRandom1 implements the RANLUX generator [6] based on the method by Luscher, and allows the user to select different quality levels, trading off quality with speed; TRandom2 is based on the maximally equidistributed combined Tausworthe generator by L'Ecuyer [7]; the TRandom3 generator implements the Mersenne twister algorithm of Matsumoto and Nishimura [8]. All of the algorithms produce a periodic sequence of numbers, and to obtain effectively random values, one must not use more than a small subset of a single period. The Mersenne twister algorithm has an extremely long period of 2^19937 − 1.

The performance of the generators can be investigated with tests such as DIEHARD [9] or TestU01 [10]. Many commonly available congruential generators fail these tests and often have sequences (typically with periods less than 2^32) which can be easily exhausted on modern computers. A short period is a problem for the TRandom generator in ROOT, which, however, has the advantage that its state is stored in a single 32-bit word. The generators TRandom1, TRandom2, and TRandom3 have much longer periods, with TRandom3 being recommended by the ROOT authors as providing the best combination of speed and good random properties. For further information see, e.g., Ref. 11.

39.2. Inverse transform method

If the desired probability density function is f(x) on the range −∞ < x < ∞, its cumulative distribution function (expressing the probability that x ≤ a) is given by Eq. (37.6). If a is chosen with probability density f(a), then the integrated probability up to point a, F(a), is itself a random variable which will occur with uniform probability density on [0, 1]. Suppose u is generated according to a uniform distribution in (0, 1). If x can take on any value, and ignoring the endpoints, we can then find a unique x chosen from the p.d.f. f(x) for a given u if we set

u = F(x) , (39.1)

provided we can find an inverse of F, defined by

x = F^{−1}(u) . (39.2)

This method is shown in Fig. 39.1a. It is most convenient when one can calculate by hand the inverse function of the indefinite integral of f. This is the case for some common functions f(x) such as exp(x), (1 − x)^n, and 1/(1 + x²) (Cauchy or Breit-Wigner), although it does not necessarily produce the fastest generator. Standard libraries contain software to implement this method numerically, working from functions or histograms in one or more dimensions, e.g., the UNU.RAN package [12], available in ROOT.
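As a concrete illustration, the sketch below applies Eqs. (39.1)-(39.2) to the Cauchy (Breit-Wigner) density f(x) = 1/[π(1 + x²)], whose cumulative distribution inverts in closed form to x = tan(π(u − 1/2)). The function name is ours.

```python
# A minimal sketch of the inverse transform method for the Cauchy density.
import math
import random

def cauchy_variate(rng=random.random):
    # F(x) = 1/2 + arctan(x)/pi, so F^{-1}(u) = tan(pi*(u - 1/2))
    return math.tan(math.pi * (rng() - 0.5))
```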

For a discrete distribution, F(x) will have a discontinuous jump of size f(xk) at each allowed xk, k = 1, 2, · · ·. Choose u from a uniform distribution on (0,1) as before. Find xk such that

F(x_{k−1}) < u ≤ F(x_k) ≡ Prob(x ≤ x_k) = ∑_{i=1}^{k} f(x_i) ; (39.3)

then xk is the value we seek (note: F(x0) ≡ 0). This algorithm is illustrated in Fig. 39.1b.
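A minimal sketch of this discrete inverse transform, assuming the allowed values and their probabilities are supplied as lists (the function name is ours):

```python
# Discrete inverse transform per Eq. (39.3): find the first k with F(x_k) >= u.
import bisect
import itertools
import random

def sample_discrete(xs, probs, rng=random.random):
    cdf = list(itertools.accumulate(probs))  # F(x_1), F(x_2), ...
    u = rng()
    k = bisect.bisect_left(cdf, u)           # first index with cdf[k] >= u
    return xs[k]

# Example: sample_discrete([1, 2, 3], [0.2, 0.5, 0.3])
```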

Figure 39.1: Use of a random number u chosen from a uniform distribution (0,1) to find a random number x from a distribution with cumulative distribution function F(x): (a) continuous distribution, with x = F^{−1}(u); (b) discrete distribution.

39.3. Acceptance-rejection method (Von Neumann)

Very commonly an analytic form for F(x) is unknown or too complex to work with, so that obtaining an inverse as in Eq. (39.2) is impractical. We suppose that for any given value of x, the probability density function f(x) can be computed, and further that enough is known about f(x) that we can enclose it entirely inside a shape which is C times an easily generated distribution h(x), as illustrated in Fig. 39.2. That is, Ch(x) ≥ f(x) must hold for all x.

Figure 39.2: Illustration of the acceptance-rejection method. Random points are chosen inside the upper bounding figure Ch(x), and rejected if the ordinate exceeds f(x). The lower figure illustrates a method to increase the efficiency (see text).

Frequently h(x) is uniform or is a normalized sum of uniform distributions. Note that both f(x) and h(x) must be normalized to unit area, and therefore, the proportionality constant C > 1. To generate f(x), first generate a candidate x according to h(x). Calculate f(x) and the height of the envelope Ch(x); generate u and test if uCh(x) ≤ f(x). If so, accept x; if not, reject x and try again. If we regard x and uCh(x) as the abscissa and ordinate of a point in a two-dimensional plot, these points will populate the entire area Ch(x) in a smooth manner; then we accept those which fall under f(x). The efficiency is the ratio of areas, which must equal 1/C; therefore we must keep C as close as possible to 1.0. Therefore, we try to choose Ch(x) to be as close to f(x) as convenience dictates, as in the lower part of Fig. 39.2.
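A minimal sketch of the method, assuming the simplest choice of a uniform envelope h(x) on a finite interval, so that Ch(x) is the constant f_max ≥ max f(x); the function name and example density are ours.

```python
# Acceptance-rejection sampling with a uniform envelope on [x_min, x_max].
import random

def sample_rejection(f, x_min, x_max, f_max, rng=random.random):
    while True:
        x = x_min + (x_max - x_min) * rng()  # candidate from uniform h(x)
        u = rng()
        if u * f_max <= f(x):                # accept with probability f(x)/f_max
            return x

# Example: sample from f(x) = 2x on [0, 1], where f_max = 2.
xs = [sample_rejection(lambda x: 2.0 * x, 0.0, 1.0, 2.0) for _ in range(5)]
```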


39.4. Algorithms

Algorithms for generating random numbers belonging to many different distributions are given for example by Press [13], Ahrens and Dieter [14], Rubinstein [15], Devroye [16], Walck [17], and Gentle [18]. For many distributions, alternative algorithms exist, varying in complexity, speed, and accuracy. For time-critical applications, these algorithms may be coded in-line to remove the significant overhead often encountered in making function calls.

In the examples given below, we use the notation for the variables and parameters given in Table 37.1. Variables named "u" are assumed to be independent and uniform on [0,1). Denominators must be verified to be non-zero where relevant.

39.4.1. Exponential decay:

This is a common application of the inverse transform method, and uses the fact that if u is uniformly distributed in [0, 1], then (1 − u) is as well. Consider an exponential p.d.f. f(t) = (1/τ) exp(−t/τ) that is truncated so as to lie between two values, a and b, and renormalized to unit area. To generate decay times t according to this p.d.f., first let α = exp(−a/τ) and β = exp(−b/τ); then generate u and let

t = −τ ln(β + u(α − β)). (39.4)

For (a, b) = (0,∞), we have simply t = −τ ln u. (See also Sec. 39.4.6.)
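A minimal sketch of Eq. (39.4), with the untruncated case recovered by the default arguments (the function name is ours):

```python
# Truncated exponential decay times per Eq. (39.4).
import math
import random

def decay_time(tau, a=0.0, b=math.inf, rng=random.random):
    alpha = math.exp(-a / tau)
    beta = math.exp(-b / tau) if math.isfinite(b) else 0.0
    return -tau * math.log(beta + rng() * (alpha - beta))
    # For (a, b) = (0, inf) this reduces to t = -tau * ln(u).
```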

39.4.2. Isotropic direction in 3D:

Isotropy means the density is proportional to solid angle, the differential element of which is dΩ = d(cos θ)dφ. Hence cos θ is uniform (2u1 − 1) and φ is uniform (2πu2). For alternative generation of sin φ and cos φ, see the next subsection.
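A minimal sketch returning the corresponding unit vector (the function name and Cartesian output convention are ours):

```python
# Isotropic direction in 3D: cos(theta) uniform on (-1,1), phi uniform on (0, 2*pi).
import math
import random

def isotropic_direction(rng=random.random):
    cos_theta = 2.0 * rng() - 1.0
    phi = 2.0 * math.pi * rng()
    sin_theta = math.sqrt(1.0 - cos_theta ** 2)
    return (sin_theta * math.cos(phi), sin_theta * math.sin(phi), cos_theta)
```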

39.4.3. Sine and cosine of random angle in 2D:

Generate u1 and u2. Then v1 = 2u1 − 1 is uniform on (−1,1), and v2 = u2 is uniform on (0,1). Calculate r² = v1² + v2². If r² > 1, start over. Otherwise, the sine (S) and cosine (C) of a random angle (i.e., uniformly distributed between zero and 2π) are given by

S = 2v1v2/r²  and  C = (v1² − v2²)/r² . (39.5)

39.4.4. Gaussian distribution:

If u1 and u2 are uniform on (0,1), then

z1 = sin(2πu1) √(−2 ln u2)  and  z2 = cos(2πu1) √(−2 ln u2) (39.6)

are independent and Gaussian distributed with mean 0 and σ = 1.

There are many variants of this basic algorithm, which may be faster. For example, construct v1 = 2u1 − 1 and v2 = 2u2 − 1, which are uniform on (−1,1). Calculate r² = v1² + v2², and if r² > 1 start over. If r² < 1, it is uniform on (0,1). Then

z1 = v1 √(−2 ln r² / r²)  and  z2 = v2 √(−2 ln r² / r²) (39.7)

are independent numbers chosen from a normal distribution with mean 0 and variance 1. Then z′i = µ + σzi is distributed with mean µ and variance σ².
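A minimal sketch of the polar variant, Eq. (39.7), returning a pair of variates (the function name is ours):

```python
# Polar (Marsaglia) variant of the Gaussian generator, Eq. (39.7).
import math
import random

def gaussian_pair(mu=0.0, sigma=1.0, rng=random.random):
    while True:
        v1 = 2.0 * rng() - 1.0
        v2 = 2.0 * rng() - 1.0
        r2 = v1 * v1 + v2 * v2
        if 0.0 < r2 < 1.0:
            scale = math.sqrt(-2.0 * math.log(r2) / r2)
            return mu + sigma * v1 * scale, mu + sigma * v2 * scale
```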

For a multivariate Gaussian with an n×n covariance matrix V, one can start by generating n independent Gaussian variables, ηj, with mean 0 and variance 1 as above. Then the new set xi is obtained as xi = µi + ∑j Lij ηj, where µi is the mean of xi, and Lij are the components of L, the unique lower triangular matrix that fulfils V = LL^T. The matrix L can be easily computed by the following recursive relation (Cholesky's method):

Ljj = ( Vjj − ∑_{k=1}^{j−1} Ljk² )^{1/2} , (39.8a)

Lij = ( Vij − ∑_{k=1}^{j−1} Lik Ljk ) / Ljj , j = 1, ..., n ; i = j + 1, ..., n , (39.8b)

where Vij = ρij σi σj are the components of V. For n = 2 one has

L = | σ1                0          |
    | ρσ2      √(1 − ρ²) σ2        | , (39.9)

and therefore the correlated Gaussian variables are generated as x1 = µ1 + σ1 η1, x2 = µ2 + ρσ2 η1 + √(1 − ρ²) σ2 η2.
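A minimal sketch assuming numpy is available (the function name is ours; numpy's built-in Cholesky routine replaces the explicit recursion of Eq. (39.8)):

```python
# Correlated Gaussian variables via Cholesky decomposition, Eqs. (39.8)-(39.9).
import numpy as np

def multivariate_gaussian(mu, V, rng=np.random.default_rng()):
    L = np.linalg.cholesky(V)             # lower triangular, V = L @ L.T
    eta = rng.standard_normal(len(mu))    # independent N(0,1) variables
    return np.asarray(mu) + L @ eta

# Example for n = 2 with correlation rho, matching Eq. (39.9):
s1, s2, rho = 0.5, 1.5, 0.3
V = np.array([[s1**2, rho*s1*s2], [rho*s1*s2, s2**2]])
x = multivariate_gaussian([1.0, 2.0], V)
```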

39.4.5. χ²(n) distribution:

To generate a variable following the χ² distribution for n degrees of freedom, use the Gamma distribution with k = n/2 and λ = 1/2 using the method of Sec. 39.4.6.

39.4.6. Gamma distribution:

All of the following algorithms are given for λ = 1. For λ ≠ 1, divide the resulting random number x by λ.

• If k = 1 (the exponential distribution), accept x = −ln u. (See also Sec. 39.4.1.)

• If 0 < k < 1, initialize with v1 = (e + k)/e (with e = 2.71828... being the natural log base). Generate u1, u2. Define v2 = v1u1.
Case 1: v2 ≤ 1. Define x = v2^{1/k}. If u2 ≤ e^{−x}, accept x and stop, else restart by generating new u1, u2.
Case 2: v2 > 1. Define x = −ln([v1 − v2]/k). If u2 ≤ x^{k−1}, accept x and stop, else restart by generating new u1, u2.
Note that, for k < 1, the probability density has a pole at x = 0, so that return values of zero due to underflow must be accepted or otherwise dealt with.

• Otherwise, if k > 1, initialize with c = 3k − 0.75. Generate u1 and compute v1 = u1(1 − u1) and v2 = (u1 − 0.5)√(c/v1). If x = k + v2 − 1 ≤ 0, go back and generate new u1; otherwise generate u2 and compute v3 = 64 v1³ u2². If v3 ≤ 1 − 2v2²/x or if ln v3 ≤ 2{(k − 1) ln[x/(k − 1)] − v2}, accept x and stop; otherwise go back and generate new u1.
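The three cases above can be combined into a single routine, sketched below. The function name is ours, and the uniform generator is assumed to return values in (0,1) (exact zeros are not specially handled).

```python
# Gamma variates per the algorithms above, for lambda = 1 (result divided by
# lambda for the general case, per the text).
import math
import random

def gamma_variate(k, lam=1.0, rng=random.random):
    if k == 1.0:                                  # exponential special case
        x = -math.log(rng())
    elif k < 1.0:                                 # 0 < k < 1
        v1 = (math.e + k) / math.e
        while True:
            u1, u2 = rng(), rng()
            v2 = v1 * u1
            if v2 <= 1.0:                         # Case 1
                x = v2 ** (1.0 / k)
                if u2 <= math.exp(-x):
                    break
            else:                                 # Case 2
                x = -math.log((v1 - v2) / k)
                if u2 <= x ** (k - 1.0):
                    break
    else:                                         # k > 1
        c = 3.0 * k - 0.75
        while True:
            u1 = rng()
            v1 = u1 * (1.0 - u1)
            if v1 <= 0.0:
                continue
            v2 = (u1 - 0.5) * math.sqrt(c / v1)
            x = k + v2 - 1.0
            if x <= 0.0:
                continue
            u2 = rng()
            v3 = 64.0 * v1 ** 3 * u2 ** 2
            if v3 <= 1.0 - 2.0 * v2 ** 2 / x:
                break
            if v3 > 0.0 and math.log(v3) <= 2.0 * ((k - 1.0) * math.log(x / (k - 1.0)) - v2):
                break
    return x / lam

# Per Sec. 39.4.5, a chi^2(n) variate is then gamma_variate(n / 2, lam=0.5).
```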

39.4.7. Binomial distribution:

Begin with k = 0 and generate u uniform in [0, 1). Compute Pk = (1 − p)^n and store Pk into B. If u ≤ B, accept rk = k and stop. Otherwise, increment k by one; compute the next Pk as Pk · (p/(1 − p)) · (n − k)/(k + 1); add this to B. Again, if u ≤ B, accept rk = k and stop, otherwise iterate until a value is accepted. If p > 1/2, it will be more efficient to generate r from f(r; n, q), i.e., with p and q interchanged, and then set rk = n − r.
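A minimal sketch of this recipe, including the p > 1/2 interchange (the function name is ours):

```python
# Binomial variates by cumulative search over P(k).
import random

def binomial_variate(n, p, rng=random.random):
    if p > 0.5:                        # interchange p and q, then reflect
        return n - binomial_variate(n, 1.0 - p, rng)
    u = rng()
    k = 0
    pk = (1.0 - p) ** n                # P(0)
    b = pk                             # cumulative probability B
    while u > b:
        pk *= (p / (1.0 - p)) * (n - k) / (k + 1)  # P(k+1) from P(k)
        k += 1
        b += pk
    return k
```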

39.4.8. Poisson distribution:

Iterate until a successful choice is made: Begin with k = 1 and set A = 1 to start. Generate u. Replace A with uA; if now A < exp(−µ), where µ is the Poisson parameter, accept nk = k − 1 and stop. Otherwise increment k by 1, generate a new u and repeat, always starting with the value of A left from the previous try.
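A minimal sketch of this multiplication method, which is adequate for moderate µ (the function name is ours):

```python
# Poisson variates by the multiplication method described above.
import math
import random

def poisson_variate(mu, rng=random.random):
    limit = math.exp(-mu)
    k, a = 1, 1.0
    while True:
        a *= rng()
        if a < limit:
            return k - 1
        k += 1
```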

Note that the Poisson generator used in ROOT's TRandom classes before version 5.12 (including the derived classes TRandom1, TRandom2, TRandom3), as well as the routine RNPSSN from CERNLIB, use a Gaussian approximation when µ exceeds a given threshold. This may be satisfactory (and much faster) for some applications. To do this, generate z from a Gaussian with zero mean and unit standard deviation; then use x = max(0, [µ + z√µ + 0.5]), where [ ] signifies the greatest integer ≤ the expression. The routines from Numerical Recipes [13] and CLHEP's routine RandPoisson do not make this approximation (see, e.g., Ref. 11).

39.4.9. Student’s t distribution :

Generate u1 and u2 uniform in (0, 1); then t = sin(2πu1)[n(u2^{−2/n} − 1)]^{1/2} follows the Student's t distribution for n > 0 degrees of freedom (n not necessarily an integer).

Alternatively, generate x from a Gaussian with mean 0 and σ² = 1 according to the method of 39.4.4. Next generate y, an independent gamma random variate, according to 39.4.6 with λ = 1/2 and k = n/2. Then z = x/√(y/n) is distributed as a t with n degrees of freedom.

For the special case n = 1, the Breit-Wigner distribution, generate u1 and u2; set v1 = 2u1 − 1 and v2 = 2u2 − 1. If v1² + v2² ≤ 1, accept z = v1/v2 as a Breit-Wigner distribution with unit area, center at 0.0, and FWHM 2.0. Otherwise start over. For center M0 and FWHM Γ, use W = zΓ/2 + M0.
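Minimal sketches of the first recipe and of the n = 1 (Breit-Wigner) special case; the function names are ours, and u2 is drawn in (0,1] to avoid a zero argument.

```python
# Student's t and Breit-Wigner variates per the recipes above.
import math
import random

def student_t(n, rng=random.random):
    u1 = rng()
    u2 = 1.0 - rng()                  # uniform in (0, 1], avoids u2 = 0
    return math.sin(2.0 * math.pi * u1) * math.sqrt(n * (u2 ** (-2.0 / n) - 1.0))

def breit_wigner(m0=0.0, gamma=2.0, rng=random.random):
    while True:                       # unit-circle rejection step (n = 1 case)
        v1 = 2.0 * rng() - 1.0
        v2 = 2.0 * rng() - 1.0
        if v2 != 0.0 and v1 * v1 + v2 * v2 <= 1.0:
            return m0 + 0.5 * gamma * (v1 / v2)   # W = z*Gamma/2 + M0
```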


39.4.10. Beta distribution:

The choice of an appropriate algorithm for generation of beta-distributed random numbers depends on the values of the parameters α and β. For, e.g., α = 1, one can use the transformation method to find x = 1 − u^{1/β}, and similarly if β = 1 one has x = u^{1/α}. For more general cases see, e.g., Refs. [17,18] and references therein.
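A minimal sketch of the two special cases above (the function names are ours):

```python
# Beta variates for the special cases alpha = 1 or beta = 1.
import random

def beta_alpha1(beta_, rng=random.random):
    return 1.0 - rng() ** (1.0 / beta_)   # x = 1 - u^{1/beta}

def beta_beta1(alpha_, rng=random.random):
    return rng() ** (1.0 / alpha_)        # x = u^{1/alpha}
```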

39.5. Markov Chain Monte Carlo

In applications involving generation of random numbers following a multivariate distribution with a high number of dimensions, the transformation method may not be possible and the acceptance-rejection technique may have too low an efficiency to be practical. If it is not required to have independent random values, but only that they follow a certain distribution, then Markov Chain Monte Carlo (MCMC) methods can be used. In-depth treatments of MCMC can be found, e.g., in the texts by Robert and Casella [19], Liu [20], and the review by Neal [21].

MCMC is particularly useful in connection with Bayesian statistics, where a p.d.f. p(θ) for an n-dimensional vector of parameters θ = (θ1, . . . , θn) is obtained, and one needs the marginal distribution of a subset of the components. Here one samples θ from p(θ) and simply records the marginal distribution for the components of interest.

A simple and broadly applicable MCMC method is the Metropolis-Hastings algorithm, which allows one to generate multidimensional points θ distributed according to a target p.d.f. that is proportional to a given function p(θ). It is not necessary to have p(θ) normalized to unit area, which is useful in Bayesian statistics, as posterior probability densities are often determined only up to an unknown normalization constant.

To generate points that follow p(θ), one first needs a proposal p.d.f. q(θ; θ0), which can be (almost) any p.d.f. from which independent random values θ can be generated, and which contains as a parameter another point in the same space θ0. For example, a multivariate Gaussian centered about θ0 can be used. Beginning at an arbitrary starting point θ0, the Hastings algorithm iterates the following steps:

1. Generate a value θ using the proposal density q(θ; θ0);
2. Form the Hastings test ratio, α = min[1, p(θ)q(θ0; θ) / (p(θ0)q(θ; θ0))];
3. Generate a value u uniformly distributed in [0, 1];
4. If u ≤ α, take θ1 = θ. Otherwise, repeat the old point, i.e., θ1 = θ0;
5. Set θ0 = θ1 and return to step 1.

If one takes the proposal density to be symmetric in θ and θ0, then this is the Metropolis-Hastings algorithm, and the test ratio becomes α = min[1, p(θ)/p(θ0)]. That is, if the proposed θ is at a value of probability higher than θ0, the step is taken. If the proposed step is rejected, the old point is repeated.
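A minimal sketch of the symmetric-proposal case for a one-dimensional target known up to normalization; the function name, step size, and example target are ours, and the multivariate case is analogous.

```python
# Metropolis algorithm (symmetric Gaussian proposal) for an unnormalized p(theta).
import math
import random

def metropolis_chain(p, theta0, n_steps, step_size=1.0, rng=random):
    chain = [theta0]
    for _ in range(n_steps):
        theta_old = chain[-1]
        theta = rng.gauss(theta_old, step_size)     # symmetric proposal
        alpha = min(1.0, p(theta) / p(theta_old))   # test ratio
        if rng.random() <= alpha:
            chain.append(theta)                     # take the step
        else:
            chain.append(theta_old)                 # repeat the old point
    return chain

# Example: sample an unnormalized Gaussian target.
samples = metropolis_chain(lambda t: math.exp(-0.5 * t * t), 0.0, 10000)
```

In practice step_size would be tuned, e.g., to bring the acceptance fraction near the optimal value discussed below.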

Methods for assessing and optimizing the performance of the algorithm are discussed in, e.g., Refs. [19–21]. One can, for example, examine the autocorrelation as a function of the lag k, i.e., the correlation of a sampled point with that k steps removed. This should decrease as quickly as possible for increasing k.

Generally one chooses the proposal density so as to optimize some quality measure such as the autocorrelation. For certain problems it has been shown that one achieves optimal performance when the acceptance fraction, that is, the fraction of points with u ≤ α, is around 40%. This can be adjusted by varying the width of the proposal density. For example, one can use for the proposal p.d.f. a multivariate Gaussian with the same covariance matrix as that of the target p.d.f., but scaled by a constant.

References:
1. F. James, Comp. Phys. Comm. 60, 329 (1990).
2. P. L'Ecuyer, Proc. 1997 Winter Simulation Conference, IEEE Press, Dec. 1997, 127–134.
3. The CERN Program Library (CERNLIB); see cernlib.web.cern.ch/cernlib.
4. L. Lonnblad, Comp. Phys. Comm. 84, 307 (1994).
5. R. Brun and F. Rademakers, Nucl. Inst. Meth. A389, 81 (1997); see also root.cern.ch.
6. F. James, Comp. Phys. Comm. 79, 111 (1994), based on M. Luscher, Comp. Phys. Comm. 79, 100 (1994).
7. P. L'Ecuyer, Mathematics of Computation 65, 213 (1996) and 65, 225 (1999).
8. M. Matsumoto and T. Nishimura, ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, January 1998, 3–30.
9. Much of DIEHARD is described in: G. Marsaglia, A Current View of Random Number Generators, keynote address, Computer Science and Statistics: 16th Symposium on the Interface, Elsevier (1985).
10. P. L'Ecuyer and R. Simard, ACM Transactions on Mathematical Software 33, 4, Article 1, December 2007.
11. J. Heinrich, CDF Note CDF/MEMO/STATISTICS/PUBLIC/8032, 2006.
12. UNU.RAN is described at statmath.wu.ac.at/software/unuran; see also W. Hormann, J. Leydold, and G. Derflinger, Automatic Nonuniform Random Variate Generation, (Springer, New York, 2004).
13. W.H. Press et al., Numerical Recipes, 3rd ed., (Cambridge University Press, New York, 2007).
14. J.H. Ahrens and U. Dieter, Computing 12, 223 (1974).
15. R.Y. Rubinstein, Simulation and the Monte Carlo Method, (John Wiley and Sons, Inc., New York, 1981).
16. L. Devroye, Non-Uniform Random Variate Generation, (Springer-Verlag, New York, 1986); available online at cg.scs.carleton.ca/~luc/rnbookindex.html.
17. C. Walck, Handbook on Statistical Distributions for Experimentalists, University of Stockholm Internal Report SUF-PFY/96-01, available from www.physto.se/~walck.
18. J.E. Gentle, Random Number Generation and Monte Carlo Methods, 2nd ed., (Springer, New York, 2003).
19. C.P. Robert and G. Casella, Monte Carlo Statistical Methods, 2nd ed., (Springer, New York, 2004).
20. J.S. Liu, Monte Carlo Strategies in Scientific Computing, (Springer, New York, 2001).
21. R.M. Neal, Probabilistic Inference Using Markov Chain Monte Carlo Methods, Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto, available from www.cs.toronto.edu/~radford/res-mcmc.html.


40. MONTE CARLO EVENT GENERATORS

Revised September 2013 by P. Nason (INFN, Milan) and P.Z. Skands (CERN)

General-purpose Monte Carlo (GPMC) generators like HERWIG [1], HERWIG++ [2], PYTHIA 6 [3], PYTHIA 8 [4], and SHERPA [5] provide fully exclusive simulations of high-energy collisions. They play an essential role in QCD modeling (in particular for aspects beyond fixed-order perturbative QCD), in data analysis, where they are used together with detector simulation to provide a realistic estimate of the detector response to collision events, and in the planning of new experiments, where they are used to estimate signals and backgrounds in high-energy processes. They are built from several components that describe the physics starting from very short distance scales, up to the typical scale of hadron formation and decay. Since QCD is weakly interacting at short distances (below a femtometer), the components of the GPMC dealing with short-distance physics are based upon perturbation theory. At larger distances, all soft hadronic phenomena, like hadronization and the formation of the underlying event, cannot be computed from first principles, and one must rely upon QCD-inspired models.

The purpose of this review is to illustrate the main components of these generators. It is divided into four sections. The first one deals with short-distance, perturbative phenomena. The basic concepts leading to the simulations of the dominant QCD processes are illustrated here. In the second section, hadronization phenomena are treated. The two most popular hadronization models for the formation of primary hadrons, the string and cluster models, are illustrated. The basics of the implementation of primary-hadron decays into stable ones are also illustrated here. In the third section, models for soft hadron physics are discussed. These include models for the underlying event, and for minimum-bias interactions. Issues of Bose-Einstein and color-reconnection effects are also discussed here. The fourth section briefly introduces the problem of MC tuning.

We use natural units throughout, such that c = 1 and ℏ = 1, with energy, momenta and masses measured in GeV, and time and distances measured in GeV^{−1}.

40.1. Short-distance physics in GPMC generators

The short-distance components of a GPMC generator deal with the computation of the primary process at hand, with decays of short-lived particles, and with the generation of QCD and QED radiation, on time scales below 1/Λ, with Λ denoting a typical hadronic scale of a few hundred MeV, corresponding roughly to an inverse femtometer. In e+e− annihilation, for example, the short-distance physics describes the evolution of the system from the instant when the e+e− pair annihilates up to a time when the size of the produced system is just below a femtometer.

In the present discussion we take the momentum scale of the primary process to be Q ≫ Λ, so that the corresponding time and distance scale 1/Q is small. Soft- and collinear-safe inclusive observables, such as total decay widths or inclusive cross sections, can then be reliably computed in QCD perturbation theory (pQCD), with the perturbative expansion truncated at any fixed order n, and the remainder suppressed by αS(Q)^{n+1}.

Less inclusive observables, however, can receive large enhancements that destroy the convergence of the fixed-order expansion. This is due to the presence of collinear and infrared singularities in QCD. Thus, for example, a correction in which a parton from the primary interaction splits collinearly into two partons of comparable energy is of order αS(Q) ln(Q/Λ), where the logarithm arises from an integral over a singularity regulated by the hadronic scale Λ. Since αS(Q) ∝ 1/ln(Q/Λ), the corresponding cross section receives a correction of order unity. Two subsequent collinear splittings yield αS²(Q) ln²(Q/Λ), and so on. Thus, corrections of order unity arise at all orders in perturbation theory. The dominant region of phase space is the one where radiation is strongly ordered in a measure of hardness. This means that, from a typical final-state configuration, by clustering together final-state parton pairs with the smallest hardness recursively, we can reconstruct a branching tree that may be viewed as the splitting history of the event. This history necessarily has some dependence on how we define hardness. For example, we can define it as the energy of the incoming parton times the splitting angle, or as its virtuality, or as the transverse momentum of the splitting partons with respect to the incoming one. These definitions, however, are all equivalent in the collinear region. In fact, in the small-angle limit, the virtuality of a parton of energy E, splitting into two on-shell partons, is given by

p² = E² z(1 − z)(1 − cos θ) ≈ [z(1 − z)/2] E²θ² , (40.1)

where z and 1 − z are the energy fractions carried by the produced partons, and θ is their relative angle. The transverse momentum of the final partons relative to the direction of the incoming one is given by

pT² ≈ z²(1 − z)² E²θ² . (40.2)

Thus, significant differences between these measures only arise in regions with very small z or 1 − z values. In QCD, because of soft divergences, these regions are in fact important, and the choice of the appropriate ordering variable is very relevant (see Sec. 40.3).

The so-called KLN theorem [6,7] guarantees that large logarithmically divergent corrections, arising from final-state collinear splitting and from soft emissions, cancel against the virtual corrections in the total cross section, order by order in perturbation theory. Furthermore, the factorization theorem guarantees that initial-state collinear singularities can be factorized into the parton density functions (PDFs). Therefore, the cross section for the basic process remains accurate up to corrections of higher orders in αS(Q), provided it is interpreted as an inclusive cross section, rather than as a bare partonic cross section. Thus, for example, the leading order (LO) cross section for e+e− → qq̄ is a good LO estimate of the e+e− cross section for the production of a pair of quarks accompanied by an arbitrary number of collinear and soft gluons, but is not a good estimate of the cross section for the production of a qq̄ pair with no extra radiation.

Shower algorithms are used to compute the cross section for generic hard processes including all leading-logarithmic (LL) corrections. These algorithms begin with the generation of the kinematics of the basic process, performed with a probability proportional to its LO partonic cross section, which is interpreted physically as the inclusive cross section for the basic process, followed by an arbitrary sequence of shower splittings. A probability is then assigned to each splitting sequence. Thus, the initial LO cross section is partitioned into the cross sections for a multitude of final states of arbitrary multiplicity. The sum of all these partial cross sections equals that of the primary process. This property of the GPMCs reflects the KLN cancellation mentioned earlier, and it is often called "unitarity of the shower process", a name that reminds us that the KLN cancellation itself is a consequence of unitarity. The fact that a quantum mechanical process can be described in terms of composition of probabilities, rather than amplitudes, follows from the LL approximation. In fact, in the dominant, strongly ordered region, subsequent splittings are separated by increasingly large times and distances, and this suppresses interference effects.

We now illustrate the basic parton-shower algorithm, as first introduced in Ref. 8. The purpose of this illustration is to give a schematic representation of how shower algorithms work, to introduce some concepts that will be referred to in the following, and to show the relationship between shower algorithms and Feynman-diagram results. For simplicity, we consider the example of e+e− annihilation into qq̄ pairs. With each dominant (i.e., strongly ordered) final-state configuration one can associate an ordered tree diagram, by recursively clustering together final-state parton pairs with the smallest hardness, and ending up with the hard production vertex (i.e., the γ∗ → qq̄). The momenta of all intermediate lines of the tree diagram are then uniquely determined from the final-state momenta. Hardnesses in the graph are also strongly ordered. One assigns to each splitting vertex the hardness t, the energy fractions z and 1 − z of the two generated partons, and the azimuth φ of the splitting process with respect to the momentum of the incoming parton. For definiteness, we assume that z and φ are defined in the center-of-mass (CM) frame of the e+e− collision, although other definitions are possible that differ only beyond the LL approximation. The differential cross section for a given final state is given by the product of the differential cross section for the initial e+e− → qq̄ process, multiplied by a factor

∆i(t, t′) [αS(t)/2π] Pi,jk(z) (dt/t) dz (dφ/2π) (40.3)

for each intermediate line ending in a splitting vertex. We have denoted with t′ the maximal hardness that is allowed for the line, with t its hardness, and z and φ refer to the splitting process. ∆(t, t′) is the so-called Sudakov form factor

∆i(t, t′) = exp[ − ∫_t^{t′} (dq²/q²) (αS(q²)/2π) ∑_{jk} ∫ Pi,jk(z) dz (dφ/2π) ] . (40.4)

The suffixes i and jk represent the parton species of the incoming and final partons, respectively, and Pi,jk(z) are the Altarelli-Parisi [9] splitting kernels. Final-state lines that do not undergo any further splitting are associated with a factor

∆i(t0, t′) , (40.5)

where t0 is an infrared cutoff defined by the shower hadronization scale (at which the charges are screened by hadronization) or, for an unstable particle, its width (a source cannot emit radiation with a period exceeding its lifetime).

Notice that the definition of the Sudakov form factor is such that

∆i(t2, t1) + ∫_{t2}^{t1} (dt/t) ∫ dz ∑_{jk} ∆i(t, t1) [αS(t)/2π] Pi,jk(z) = 1 . (40.6)

This implies that the cross section for developing the shower up to a given stage does not depend on what happens next, since subsequent factors for further splitting or not splitting add up to one.

The shower cross section can then be formulated in a probabilistic way. The Sudakov form factor ∆i(t2, t1) is interpreted as the probability for a splitting not to occur, for a parton of type i, starting from a branching vertex at the scale t1, down to a scale t2. Notice that 0 < ∆i(t2, t1) ≤ 1, where the upper extreme is reached for t2 = t1, and the lower extreme is approached for t2 = t0. From Eq. (40.4), it seems that the Sudakov form factor should vanish if t2 = 0. However, because of the presence of the running coupling in the integrand, t2 cannot be taken smaller than some cutoff scale of the order of Λ, so that at its lower extreme the Sudakov form factor is small, but not zero. Event generation then proceeds as follows. One gets a uniform random number 0 ≤ r ≤ 1, and seeks a solution of the equation r = ∆i(t2, t1) as a function of t2. If r is too small and no solution exists, no splitting is generated, and the line is interpreted as a final parton. If a solution t2 exists, a branching is generated at the scale t2. Its z value and the final parton species jk are generated with a probability proportional to Pi,jk(z). The azimuth is generated uniformly. This procedure is started with each of the primary process partons, and is applied recursively to all generated partons. It may generate an arbitrary number of partons, and it stops when no final-state partons undergo further splitting.
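To make the scale-generation step concrete, the sketch below solves r = ∆(t2, t1) for a toy Sudakov form factor with fixed coupling, ∆(t2, t1) = (t2/t1)^a. This is purely an illustrative assumption (the function name and the constant a are ours); real shower algorithms use a running αS and z-dependent splitting kernels, often via a veto algorithm.

```python
# Toy branching-scale generation: solve r = Delta(t2, t1) for t2,
# assuming Delta(t2, t1) = (t2/t1)**a with a fixed effective constant a > 0.
import random

def next_branching_scale(t1, t0, a, rng=random.random):
    """Return the next branching scale t2, or None if t2 would fall below
    the infrared cutoff t0 (i.e., no further splitting is generated)."""
    r = rng()
    t2 = t1 * r ** (1.0 / a)      # inverts r = (t2/t1)**a
    return t2 if t2 > t0 else None
```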

The four-momenta of the final-state partons are reconstructed from the momenta of the initiating ones, and from the whole sequence of splitting variables, subject to overall momentum conservation. Different algorithms employ different strategies to treat recoil effects due to momentum conservation, which may be applied either locally for each parton or dipole splitting, or globally for the entire set of partons (a procedure called momentum reshuffling). This has a subleading effect with respect to the collinear approximation.

We emphasize that the shower cross section described above can be derived from perturbative QCD by keeping only the collinear-dominant real and virtual contributions to the cross section. In particular, up to terms that vanish after azimuthal averaging, the product of the cross section for the basic process, times the factors

[αS/2π] Pi,jk(z) (dt/t) dz (40.7)

at each branching vertex, gives the leading collinear contribution to the tree-level cross section for the same process. The dominant virtual corrections in the same approximation are provided by the running coupling at each vertex and by the Sudakov form factors in the intermediate lines.

40.1.1. Angular correlations:
In gluon splitting processes (g → qq̄, g → gg) in the collinear approximation, the distribution of the split pair is not uniform in azimuth, and the Altarelli-Parisi splitting functions are recovered only after azimuthal averaging. This dependence is due to the interference of positive and negative helicity states for the gluon that undergoes splitting. Spin correlations propagate through the splitting process, and determine acausal correlations of the EPR kind [10]. A method to partially account for these effects was introduced in Ref. 11, in which the azimuthal correlation between two successive splittings is computed by averaging over polarizations. This can then be applied at each branching step. Acausal correlations are argued to be small, and are discarded with this method, which is still used in the PYTHIA code [3]. A method that fully includes spin correlation effects was later proposed by Collins [12], and has been implemented in the Fortran HERWIG code [13].

40.1.2. Initial-state radiation:
Initial-state radiation (ISR) arises because incoming charged particles can radiate before entering the hard-scattering process. In doing so, they acquire a non-vanishing transverse momentum, and their virtuality becomes negative (spacelike). The dominant logarithmic region is the collinear one, where virtualities become larger and larger in absolute value with each emission, up to a limit given by the hardness of the basic process itself. A shower that starts by considering the highest virtualities first would thus have to work backward in time for ISR. A corresponding backwards-evolution algorithm was formulated by Sjostrand [14], and was basically adopted in all shower models.

The key point in backwards evolution is that the evolution probability depends on the number of partons that could have given rise to the one being evolved. This is reflected by introducing the ratio of the PDF after the branching to the PDF before the branching in the definition of the backward-evolution Sudakov form factor,

∆i^{ISR}(t, t′) = exp[ − ∫_{t′}^{t} (dt′′/t′′) (αS(t′′)/2π) ∫_x^1 (dz/z) ∑_{jk} Pj,ik(z) fj(t′′, x/z) / fi(t′′, x) ] . (40.8)

Notice that there are two uses of the PDFs: they are used to compute the cross section for the basic hard process, and they control ISR via backward evolution. Since the evolution is generated with leading-logarithmic accuracy, it is acceptable to use two different PDF sets for these two tasks, provided they agree at the LO level.

In the context of GPMC evolution, each ISR emission generates a finite amount of transverse momentum. Details on how the recoils generated by these transverse "kicks" are distributed among other partons in the event, in particular the ones involved in the hard process, constitute one of the main areas of difference between existing algorithms, see Ref. 15. An additional O(1 GeV) of "primordial kT" is typically added, to represent the sum of unresolved and/or non-perturbative motion below the shower cutoff scale.

40.1.3. Soft emissions and QCD coherence:
In massless field theories like QCD, there are two sources of large logarithms of infrared origin. One has to do with collinear singularities, which arise when two final-state particles become collinear, or when a final-state particle becomes collinear to an initial-state one. The other has to do with the emission of soft gluons at arbitrary angles. Because of that, it turns out that in QCD perturbation theory two powers of large logarithms can arise for each power of αS. The expansion in leading soft and collinear logarithms is often referred to as the double-logarithmic expansion.

Within the conventional parton-shower formalism, based on collinear factorization, it was shown in a series of publications (see Ref. 16 and references therein) that the double-logarithmic region can be correctly described by using the angle of the emissions as the ordering variable, rather than the virtuality, and that the argument of αS at the splitting vertex should be the relative parton transverse momentum after the splitting. Physically, the ordering in angle approximates the coherent interference arising from large-angle soft emission from a bunch of collinear partons. Without this effect, the particle multiplicity would grow too rapidly with energy, in conflict with e+e− data. For this reason, the angle is used as the evolution variable in both the HERWIG [16] and HERWIG++ [17] programs, and an angular veto is imposed on the virtuality-ordered evolution in PYTHIA 6 [18].

A radical alternative formulation of QCD cascades, first proposed in Ref. 19, focuses upon soft emission, rather than collinear emission, as the basic splitting mechanism. It then becomes natural to consider a branching process where it is a parton pair (i.e., a dipole), rather than a single parton, that emits a soft parton. Adding a suitable correction for non-soft, collinear partons, one can achieve in this framework the correct logarithmic structure for both soft and collinear emissions in the limit of a large number of colors Nc, without any explicit angular-ordering requirement. The ARIADNE [20] and VINCIA [21] programs are based on this approach. In SHERPA, the default shower [22] is also of a dipole type [23], while the p⊥-ordered showers in PYTHIA 6 and 8 represent a hybrid, combining collinear splitting kernels with dipole kinematics [24].

40.1.4. Massive quarks:
Quark masses act as a cutoff on collinear singularities. If the mass of a quark is below, or of the order of, Λ, its effect in the shower is small. For larger quark masses, as in c, b, or t production, it is the mass, rather than the typical hadronic scale, that cuts off collinear radiation. For a quark with energy E and mass mQ, the divergent behavior dθ/θ of the collinear splitting process is regulated for θ ≤ θ0 = mQ/E. We thus expect less collinear activity for heavy quarks than for light ones, which in turn is the reason why heavy quarks carry a larger fraction of the momentum acquired in the hard production process.

This feature can be implemented with different levels of sophistication. Using the fact that soft emission exhibits a zero at zero emission angle, older parton shower algorithms simply limited the shower emission to be not smaller than the angle θ0. More modern approaches are used in both PYTHIA, where mass effects are included using a kind of matrix-element correction method [25], and in HERWIG++ and SHERPA, where a generalization of the Altarelli-Parisi splitting kernel is used for massive quarks [26].

40.1.5. Color information:
Shower MC generators track large-Nc color information during the development of the shower. In the large-Nc limit, quarks or antiquarks are represented by a color line, i.e., a line with an arrow indicating the direction of color flow. Gluons are represented by a pair of color lines with opposite arrows. The rules for color propagation are given by the diagrammatic relations of Eq. (40.9) (not reproduced here).

During the shower development, partons are connected by color lines. We can have a quark directly connected by a color line to an antiquark, or via an arbitrary number of intermediate gluons, as shown in Fig. 40.1.

Figure 40.1: Color development of a shower in e+e− annihilation. Systems of color-connected partons are indicated by the dashed lines.

It is also possible for a set of gluons to be connected cyclically in color, as e.g. in the decay Υ → ggg.

The color information is used in angular-ordered showers, where the angle of color-connected partons determines the initial angle for the shower development, and in dipole showers, where dipoles are always color-connected partons. It is also used in hadronization models, where the initial strings or clusters used for hadronization are formed by systems of color-connected partons.

40.1.6. Electromagnetic corrections:
The physics of photon emission from light charged particles can also be treated with a shower MC algorithm. High-energy electrons and quarks, for example, are accompanied by bremsstrahlung photons. Also here, similarly to the QCD case, electromagnetic corrections are of order αem ln(Q/m), where m is the mass of the radiating particle, or even of order αem ln(Q/m) ln(Eγ/E) in the region where soft photon emission is important, so that, especially for the case of electrons, their inclusion in the simulation process is mandatory. This is done in most of the GPMCs (for a recent comparative study see [27]). The specialized generator PHOTOS [28] is sometimes used as an afterburner for an improved treatment of QED radiation in non-hadronic resonance decays.

In the case of photons emitted by leptons, the shower can be continued down to virtualities arbitrarily close to the lepton mass shell (unlike the case in QCD). In practice, photon radiation must be cut off below a certain energy, in order for the shower algorithm to terminate. Therefore, there is always a minimum energy for emitted photons that depends upon the implementation [27] (and so does the MC truth for a charged lepton). In the case of electrons, this energy is typically of the order of its mass. Electromagnetic radiation below this scale is not enhanced by collinear singularities, and is thus bound to be soft, so that the electron momentum is not affected by it.

For photons emitted from quarks, we have instead the obvious limitation that the photon wavelength cannot exceed the typical hadronic size. Longer-wavelength photons are in fact emitted by hadrons, rather than quarks. This last effect is in practice never modeled by existing shower MC implementations. Thus, electromagnetic radiation from quarks is cut off at a typical hadronic scale. Finally, hadron (and τ) decays involving charged particles can produce additional soft bremsstrahlung. This is implemented in a general way in HERWIG++ [29] and SHERPA [30].

40.1.7. Beyond-the-Standard-Model Physics:
The inclusion of processes for physics beyond the Standard Model (BSM) in event generators is to some extent only a matter of implementing the relevant hard processes and (chains of) decays, with the level of difficulty depending on the complexity of the model and the degree of automation [31,32]. Notable exceptions are long-lived colored particles [33], particles in exotic color representations, and particles showering under new gauge symmetries, with a growing set of implementations documented in the individual GPMC manuals. Further complications that may be relevant are finite-width effects (discussed in Sec. 40.1.8) and the assumed threshold behavior.

In addition to code-specific implementations [15], there are a few commonly adopted standards that are useful for transferring information and events between codes. Currently, the most important of these is the Les Houches Event File (LHEF) standard [34], normally used to transfer parton-level events from a hard-process generator to a shower generator. Another important standard is the Supersymmetry Les Houches Accord (SLHA) format [35], originally used to transfer information on supersymmetric particle spectra and couplings, but by now extended to apply also to more general BSM frameworks and incorporated within the LHEF standard [36].

40.1.8. Decay Chains and Particle Widths:
In most BSM processes and some SM ones, an important aspect of the event simulation is how decays of short-lived particles, such as top quarks, EW and Higgs bosons, and new BSM resonances, are handled. We here briefly summarize the spectrum of possibilities, but emphasize that there is no universal standard. Users are advised to check whether the treatment of a given code is adequate for the physics study at hand.

The appearance of an unstable resonance as a physical particle at some intermediate stage of the event generation implies that its production and decay processes are treated as being factorized. This is valid up to corrections of order Γ/m0, with Γ the width and m0 the pole mass. States whose widths are a substantial fraction of their mass should not be treated as "physical particles," but rather as intrinsically off-shell internal propagator lines.

For states treated as physical particles, two aspects are relevant: the mass distribution of the decaying particle itself and the distributions of its decay products. For the former, matrix-element generators often use a simple δ function at m0. The next level up, typically used in GPMCs, is to use a Breit-Wigner distribution (relativistic or non-relativistic), which formally resums higher-order virtual corrections to the mass distribution. Note, however, that this still only generates an improved picture for moderate fluctuations away from m0. Similarly to above, particles that are significantly off-shell (in units of Γ) should not be treated as resonant, but rather as internal off-shell propagator lines. In most GPMCs, further refinements are included, for instance by letting Γ be a function of m ("running widths") and by limiting the magnitude of the allowed fluctuations away from m0.

For the distributions of the decay products, the simplest treatmentis again to assign them their respective m0 values, with a uniformphase-space distribution. A more sophisticated treatment distributesthe decay products according to the differential decay matrix elements,capturing at least the internal dynamics and helicity structure of thedecay process, including EPR-like correlations. Further refinementsinclude polarizations of the external states [37] and assigning thedecay products their own Breit-Wigner distributions, the latter ofwhich opens the possibility to include also intrinsically off-shell decaychannels, like H → WW ∗.

During subsequent showering of the decay products, most parton-shower models will preserve their total invariant mass, so as not to skew the original resonance shape.

When computing partial widths and/or modifying decay tables, one should be aware of the danger of double-counting intermediate on-shell particles; see Sec. 40.2.3.

40.1.9. Matching with Matrix Elements :
Shower algorithms are based upon a combination of the collinear (small-angle) and soft (small-energy) approximations and are thus inaccurate for hard, large-angle emissions. They also lack next-to-leading order (NLO) corrections to the basic process.

Traditional GPMCs, like HERWIG and PYTHIA, have long included so-called Matrix Element Corrections (MEC), first formulated in Ref. 38 with later developments summarized in Ref. 15. They are available for processes involving two incoming and one outgoing or one incoming and two outgoing particles, like DIS, vector boson and Higgs production and decays, and top decays. The MEC corrects the emission of the hardest jet at large angles, so that it becomes exact at leading order.

In the past decade, considerable progress has been made in improving the parton-shower description of hard collisions, in two different directions: the so-called Matrix Elements and Parton Shower matching (ME+PS from now on), and the matching of NLO calculations and Parton Showers (NLO+PS).

The ME+PS method allows one to use tree-level matrix elements for hard, large-angle emissions. It was first formulated in the so-called CKKW paper [39], and several variants have appeared, including the CKKW-L, MLM, and pseudoshower methods; see Refs. 40, 15 for summaries. Truncated showers are required [41] in order to maintain color coherence when interfacing matrix-element calculations to angular-ordered parton showers using these methods. It is also important to ensure consistent αS choices between the real (ME-driven) and virtual (PS-driven) corrections [42].

In the ME+PS method one typically starts by generating exact matrix elements for the production of the basic process plus a certain number ≤ n of other partons. A minimum separation is imposed on the produced partons, requiring, for example, that the relative transverse momentum in any pair of partons is above a given cut Qcut. One then reweights these amplitudes in such a way that, in the strongly ordered region, the virtual effects that are included in the shower algorithm (i.e., running couplings and Sudakov form factors) are also accounted for. At this stage, before parton showers are added, the generated configurations are tree-level accurate at large angle, and at small angle they match the results of the shower algorithm, except that there are no emissions below the scale Qcut and no final states with more than n partons. These kinematic configurations are then fed into a GPMC, which must generate all splittings with relative transverse momentum below the scale Qcut, for initial events with fewer than n partons, or below the scale of the smallest pair transverse momentum, for events with n partons (a schematic of this veto logic is sketched below). The matching parameter Qcut must be chosen large enough for fixed-order perturbation theory to hold, but small enough that the shower is accurate for emissions below it. Notice that the accuracy achieved with MEC is equivalent to that of ME+PS with n = 1, where MEC has the advantage of not having a matching parameter Qcut.
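Schematically, the shower-side rule just described can be summarized as follows; the function is a hypothetical sketch, not the interface of any particular generator, and a complete implementation would also handle truncated showers and the αS reweighting mentioned above.

    def shower_start_scale(n_partons, n_max, q_cut, softest_pair_pt):
        # Starting scale for shower emissions in ME+PS merging:
        # events below the highest ME multiplicity shower only below Qcut;
        # highest-multiplicity events shower below their softest ME emission.
        if n_partons < n_max:
            return q_cut
        return softest_pair_pt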

The popularity of the ME+PS method is due to the fact that processes with many jets often appear as backgrounds to new-physics searches. These jets are typically required to be well separated and to have large transverse momenta. Such kinematic configurations, away from the small-angle region, are precisely those where GPMCs fail to be accurate, and it is thus mandatory to describe them using at least tree-level accurate matrix elements.

The NLO+PS methods extend the accuracy of the generation of the basic process to NLO in QCD. They must thus include the radiation of an extra parton with tree-level accuracy, since this radiation constitutes an NLO correction to the basic process. They must also include NLO virtual corrections. They can be viewed as an extension of the MEC method with the inclusion of NLO virtual corrections. They are, however, more general, since they are applicable to processes of arbitrary complexity. Two of these methods are now widely used: MC@NLO [43] and POWHEG [41,44], with several alternative methods now also being pursued; see Ref. 15 and references therein.

NLO+PS generators produce NLO-accurate distributions for inclusive quantities, and generate the hardest jet with tree-level accuracy even at large angle. It should be recalled, though, that in 2 → 1 processes like Z/W production, GPMCs including MEC and weighted by a constant K factor may perform nearly as well, and, if suitably tuned, may even yield a better description of data. It may thus be wise to consider tuning the NLO+PS generators for these processes as well.

ME+PS generators should be preferred over NLO+PS ones when one needs an accurate description of more than one hard, large-angle jet associated with the primary process. In order to get event samples that have the advantages of both methods, several attempts to combine NLO+PS calculations at different multiplicities, possibly merged with ME+PS calculations for even higher multiplicities, have appeared recently; see refs. [10]-[19] in Ref. 46.

Several ME+PS implementations use existing LO generators, like ALPGEN [47], MADGRAPH [48], and others summarized in Ref. 40, for the calculation of the matrix elements, and feed the partonic events to a GPMC like PYTHIA or HERWIG using the Les Houches Interface for User Processes (LHI/LHEF) [49,34]. SHERPA and HERWIG++ also include their own matrix-element generators.

Several NLO+PS processes are implemented in the MC@NLO program [43], together with the new AMC@NLO development [50], and in the POWHEG BOX framework [44]. HERWIG++ also includes its own POWHEG implementation, suitably adapted with the inclusion of vetoed and truncated showers, for several processes. SHERPA instead implements a variant of the MC@NLO method.

40.2. Hadronization Models

In the context of GPMCs, hadronization denotes the process by which a set of colored partons (after showering) is transformed into a set of color-singlet primary hadrons, which may then subsequently decay further (to secondary hadrons). This non-perturbative transition takes place at the hadronization scale Qhad, which by construction is identical to the infrared cutoff of the parton shower. In the absence of a first-principles solution to the relevant dynamics, GPMCs use QCD-inspired phenomenological models to describe this transition.

A key difference between MC hadronization models and the fragmentation-function (FF) formalism used to describe inclusive hadron spectra in perturbative QCD (see Chap. 9 of PDG book) is that the former is always defined at the hadronization scale, while the latter can be defined at an arbitrary perturbative scale Q. They can therefore only be compared directly if the perturbative evolution between Q and Qhad is taken into account. FFs are calculable in pQCD, given a non-perturbative initial condition obtained by fits to hadron spectra. In the MC context, one can prove that the correct QCD evolution of the FFs arises from the shower formalism, with the hadronization model providing an explicit parametrization of the non-perturbative component. It should be kept in mind, however, that the MC modeling of shower and hadronization includes much more information on the final state, since it is fully exclusive (i.e., it addresses all particles in the final state explicitly), while FFs only describe inclusive spectra. This exclusivity also enables MC models to make use of the color-flow information coming from the perturbative shower evolution (see Sec. 40.1.5) to determine between which partons the confining potentials should arise.

If one had an exact hadronization model, its dependence upon the hadronization scale Qhad would be compensated by the corresponding scale dependence of the shower algorithm, which stops generating branchings at the scale Qhad. However, due to their complicated and fully exclusive nature, it is generally not possible to enforce this compensation automatically in MC models of hadronization. One must therefore be aware that the model must be “retuned” by hand if changes are made to the perturbative evolution, in particular if the infrared cutoff is modified. Tuning is discussed briefly in Sec. 40.4.

An important result in “quenched” lattice QCD (see Chap. 18 of PDG book) is that the potential of the color-dipole field between a charge and an anticharge appears to grow linearly with the separation of the charges, at distances greater than about a femtometer. This is known as “linear confinement,” and it forms the starting point for the string model of hadronization, discussed below in Sec. 40.2.1. Alternatively, a property of perturbative QCD called “preconfinement” is the basis of the cluster model of hadronization, discussed in Sec. 40.2.2.

Finally, it should be emphasized that the so-called “parton level” that can be obtained by switching off hadronization in a GPMC is not a universal concept, since each model defines the hadronization scale differently (e.g., by a cutoff in p⊥, invariant mass, etc., with different tunes using different values for the cutoff). Comparisons to distributions at this level may therefore be used to provide an idea of the overall impact of hadronization corrections within a given model, but should be avoided in the context of physical observables.

40.2.1. The String Model :
Starting from early concepts [51], several hadronization models based on strings have been proposed [15]. Of these, the most widely used today is the so-called Lund model [52,53], implemented in PYTHIA [3,4]. We concentrate on that particular model here, though many of the overall concepts would be shared by any string-inspired method.

Consider a color-connected quark-antiquark pair with no intermediate gluons emerging from the parton shower (like the qq̄ pair in the center of Fig. 40.1), e.g., a red q and an antired q̄. As the charges move apart, linear confinement implies that a potential V(r) = κr is reached for large distances r. (At short distances, there is a Coulomb term ∝ 1/r as well, but this is neglected in the Lund string.) This potential describes a string with tension κ ∼ 1 GeV/fm ∼ 0.2 GeV². The physical picture is that of a color flux tube being stretched between the q and the q̄.

Figure 40.2: Illustration of string breaking by quark pair-creation in the string field (vertical axis: time).

As the string grows, the non-perturbative creation of quark-antiquark pairs can break the string, via the process (qq̄) → (qq̄′) + (q′q̄), illustrated in Fig. 40.2. More complicated color-connected quark-antiquark configurations involving intermediate gluons (like the qgggq̄ and qgq̄ systems on the left and right parts of Fig. 40.1) are treated by representing gluons as transverse “kinks.” Thus soft gluons effectively build up a transverse structure in the originally one-dimensional object, with infinitely soft ones smoothly absorbed into the string. For strings with finite-energy kinks, the space-time evolution is slightly more involved [53], but the main point is that there are no separate free parameters for gluon jets. Differences with respect to quark fragmentation arise simply because quarks are only connected to a single string piece, while gluons have one on either side, increasing their relative energy loss (per unit invariant time) by a factor of 2, similar to the ratio of color Casimirs CA/CF = 2.25.

Since the string breaks are causally disconnected (as can be realized from space-time diagrams [53]), they do not have to be considered in any specific time-ordered sequence. In the Lund model, the string breaks are generated starting with the leading (“outermost”) hadrons, containing the endpoint quarks, and iterating inwards towards the center of the string, alternating randomly between the left and right sides. One can thereby split off a single on-shell hadron in each step, making it straightforward to ensure that only states consistent with known hadron states are produced.

For each breakup vertex, quantum mechanical tunneling is assumed to control the masses and p⊥ kicks that can be produced, leading to a Gaussian suppression

\[ \mathrm{Prob}(m_q^2, p_{\perp q}^2) \;\propto\; \exp\!\left(\frac{-\pi m_q^2}{\kappa}\right) \exp\!\left(\frac{-\pi p_{\perp q}^2}{\kappa}\right), \qquad (40.10) \]

where m_q is the mass of the produced quark flavor and p⊥ is the non-perturbative transverse momentum imparted to it by the breakup process (the antiquark has the same mass and opposite p⊥), with a universal average value of ⟨p_{⊥q}²⟩ = κ/π ∼ (250 MeV)². The charm and bottom masses are sufficiently heavy that they are not produced at all in the soft fragmentation. The transverse direction is defined with respect to the string axis, so the p⊥ in a frame where the string is moving will be modified by a Lorentz boost. Note that the effective amount of “non-perturbative” p⊥, in a Monte Carlo model with a fixed shower cutoff Qhad, may be larger than the purely non-perturbative κ/π above, to account for effects of additional unresolved soft-gluon radiation below Qhad. In principle, the magnitude of this additional component should scale with the cutoff, but in practice it is up to the user to enforce this by retuning the relevant parameter when changing the hadronization scale.
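The tunneling probability of Eq. (40.10) translates directly into Gaussian p⊥ kicks and mass-dependent flavor weights, as in the sketch below; κ and the constituent-like mass are illustrative inputs, and (as discussed next) the strangeness suppression is in practice a tuned parameter rather than a computed one.

    import math, random

    KAPPA = 0.2  # string tension in GeV^2 (kappa ~ 1 GeV/fm ~ 0.2 GeV^2)

    def sample_break_pt(mean_pt2=KAPPA / math.pi):
        # Gaussian p_T kick of a string break, <p_T^2> = kappa/pi ~ (250 MeV)^2;
        # the quark and antiquark of the new pair receive opposite kicks.
        sigma = math.sqrt(mean_pt2 / 2.0)   # width per transverse component
        return random.gauss(0.0, sigma), random.gauss(0.0, sigma)

    def flavor_weight(m_q):
        # Relative tunneling weight exp(-pi m_q^2 / kappa); for charm or
        # bottom masses this is vanishingly small, as stated in the text.
        return math.exp(-math.pi * m_q**2 / KAPPA)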

Since quark masses are difficult to define for light quarks, the value of the strangeness suppression is determined from experimental observables, such as the K/π and K∗/ρ ratios. Note that the parton-shower evolution generates a small amount of strangeness as well, through perturbative g → ss̄ splittings.

Baryon production can also be incorporated, by allowing string breaks to produce pairs of diquarks, loosely bound states of two quarks in an overall 3̄ representation. Again, since diquark masses are difficult to define, the relative rate of diquark to quark production is extracted, e.g., from the p/π ratio. Since the perturbative shower splittings do not produce diquarks, the optimal value for this parameter is mildly correlated with the amount of g → qq̄ splittings produced by the shower. More advanced scenarios for baryon production have also been proposed; see Ref. 53. Within the PYTHIA framework, a fragmentation model including baryon string junctions [54] is also available.

The next step of the algorithm is the assignment of the produced quarks within hadron multiplets. Using a nonrelativistic classification of spin states, the fragmenting q may combine with the q̄′ from a newly created breakup to produce a meson — or baryon, if diquarks are involved — of a given spin S and angular momentum L. The lowest-lying pseudoscalar and vector meson multiplets, and spin-1/2 and -3/2 baryons, are assumed to dominate in a string framework¹, but individual rates are not predicted by the model. This is therefore the sector that contains the largest number of free parameters.

¹ The PYTHIA implementation includes the lightest pseudoscalar and vector mesons, with the four L = 1 multiplets (scalar, tensor, and two pseudovectors) available but disabled by default, largely because several states are poorly known and thus may result in a worse overall description when included. For baryons, the lightest spin-1/2 and -3/2 multiplets are included.

From spin counting, the ratio V/P of vectors to pseudoscalars is expected to be 3, but in practice this is only approximately true for B mesons. For lighter flavors, the difference in phase space caused by the V–P mass splittings implies a suppression of vector production. When extracting the corresponding parameters from data, it is advisable to begin with the heaviest states, since so-called feed-down from the decays of higher-lying hadron states complicates the extraction for lighter particles; see Sec. 40.2.3. For baryons, separate parameters control the relative rates of spin-1 diquarks vs. spin-0 ones and, likewise, have to be extracted from data.

With p⊥² and m² now fixed, the final step is to select the fraction, z, of the fragmenting endpoint quark's longitudinal momentum that is carried by the created hadron, an aspect for which the string model is highly predictive. The requirement that the fragmentation be independent of the sequence in which breakups are considered (causality) imposes a “left-right symmetry” on the possible form of the fragmentation function, f(z), with the solution

\[ f(z) \;\propto\; \frac{1}{z}\,(1-z)^a \exp\!\left(\frac{-b\,(m_h^2 + p_{\perp h}^2)}{z}\right), \qquad (40.11) \]

which is known as the Lund symmetric fragmentation function (normalized to unit integral). The dimensionless parameter a dampens the hard tail of the fragmentation function, towards z → 1, and may in principle be flavor-dependent, while b, with dimension GeV⁻², is a universal constant related to the string tension [53] which determines the behavior in the soft limit, z → 0. Note that the explicit mass dependence in f(z) implies a harder fragmentation function for heavier hadrons (in the rest frame of the string).
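A minimal accept-reject sampler for Eq. (40.11) is sketched below; the a, b, and m_T² values are representative magnitudes only, not a tune, and the gridded envelope is chosen for transparency rather than efficiency.

    import math, random

    def lund_f(z, a, b, mT2):
        # Unnormalized Lund symmetric fragmentation function, Eq. (40.11),
        # with mT2 = m_h^2 + pT_h^2 the hadron's squared transverse mass.
        return (1.0 - z)**a * math.exp(-b * mT2 / z) / z

    def sample_z(a=0.7, b=1.0, mT2=0.25, grid=1000):
        zs = [i / grid for i in range(1, grid)]
        f_max = max(lund_f(z, a, b, mT2) for z in zs)   # crude envelope
        while True:
            z = random.uniform(1e-9, 1.0)
            if random.random() * f_max <= lund_f(z, a, b, mT2):
                return z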

As a by-product, the probability distribution in invariant time τ of q′q̄ breakup vertices, or equivalently Γ = (κτ)², is also obtained, with dP/dΓ ∝ Γᵃ exp(−bΓ), implying an area law for the color flux, and the average breakup time lying along a hyperbola of constant invariant time τ₀ ∼ 10⁻²³ s [53].

For massive endpoints (e.g., c and b quarks, or hypothetical hadronizing new-physics particles), which do not move along straight lightcone sections, the exponential suppression with string area leads to modifications of the form f(z) → f(z)/z^{b m_Q²}, with m_Q the mass of the heavy quark [55]. Although different forms can also be used to describe inclusive heavy-meson spectra (see Sec. 20.9 of PDG book), such choices are not consistent with causality in the string framework and hence are theoretically disfavored in this context, one well-known example being the Peterson formula [56],

\[ f(z) \;\propto\; \frac{1}{z}\left(1 - \frac{1}{z} - \frac{\epsilon_Q}{1-z}\right)^{-2}, \qquad (40.12) \]

with ε_Q a free parameter expected to scale ∝ 1/m_Q².
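For comparison, the sketch below evaluates the Peterson form of Eq. (40.12) against the Bowler-modified Lund function f(z)/z^{b m_Q²} for a charm-like endpoint; the parameter values (ε_c, masses, a, b) are illustrative only.

    import math

    def peterson(z, eps_q):
        # Eq. (40.12), unnormalized.
        return 1.0 / (z * (1.0 - 1.0 / z - eps_q / (1.0 - z))**2)

    def lund_bowler(z, a, b, mT2, mQ2):
        # Lund function with the Bowler modification f(z) -> f(z)/z^(b mQ^2).
        return (1.0 - z)**a * math.exp(-b * mT2 / z) / z**(1.0 + b * mQ2)

    # Both shapes peak at high z, reflecting hard heavy-quark fragmentation.
    for z in (0.3, 0.5, 0.7, 0.9):
        print(z, peterson(z, eps_q=0.05), lund_bowler(z, 0.7, 1.0, 4.0, 1.6))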

40.2.2. The Cluster Model :
The cluster hadronization model is based on preconfinement, i.e., on the observation [57,58] that the color structure of a perturbative QCD shower evolution at any scale Q0 is such that color-singlet subsystems of partons (labeled “clusters”) occur with a universal invariant mass distribution that only depends on Q0 and on ΛQCD, not on the starting scale Q, for Q ≫ Q0 ≫ ΛQCD. Further, this mass distribution is power-suppressed at large masses.

Following early models based on this universality [8,59], the cluster model developed by Webber [60] has for many years been a hallmark of the HERWIG and HERWIG++ generators, with an alternative implementation [61] now available in the SHERPA generator. The key idea, in addition to preconfinement, is to force “by hand” all gluons to split into quark-antiquark pairs at the end of the parton shower. Compared with the string description, this effectively amounts to viewing gluons as “seeds” for string breaks, rather than as kinks in a continuous object. After the splittings, a new set of low-mass color-singlet clusters is obtained, formed only by quark-antiquark pairs. These can be decayed to on-shell hadrons in a simple manner.

The algorithm starts by generating the forced g → qq̄ breakups, and by assigning flavors and momenta to the produced quark pairs. For a typical shower cutoff corresponding to a gluon virtuality of Qhad ∼ 1 GeV, the p⊥ generated by the splittings can be neglected. The constituent light-quark masses, mu,d ∼ 300 MeV and ms ∼ 450 MeV, imply a suppression (typically even an absence) of strangeness production. In principle, the model also allows for diquarks to be produced at this stage, but due to the larger constituent masses this would only become relevant for shower cutoffs larger than 1 GeV.

If a cluster formed in this way has an invariant mass above some cutoff value, typically 3–4 GeV, it is forced to undergo sequential 1 → 2 cluster breakups, along an axis defined by the constituent partons of the original cluster, until all sub-cluster masses fall below the cutoff value. Due to the preservation of the original axis in these breakups, this treatment has some resemblance to the string-like picture.

Next, on the low-mass side of the spectrum, some clusters are allowed to decay directly to a single hadron, with nearby clusters absorbing any excess momentum. This improves the description of the high-z part of the fragmentation spectrum — where the hadron carries almost all the momentum of its parent jet — at the cost of introducing one additional parameter, controlling the probability for single-hadron cluster decay.

Having obtained a final distribution of small-mass clusters, now with a strict cutoff at 3–4 GeV and with the component destined to decay to single hadrons already removed, the remaining clusters are interpreted as a smoothed-out spectrum of excited mesons, each of which decays isotropically to two hadrons, with relative probabilities proportional to the available phase space for each possible two-hadron combination that is consistent with the cluster's internal flavors, including spin degeneracy. It is important that all the light members (containing only uds) of each hadron multiplet be included, as the absence of members can lead to unphysical isospin or SU(3) flavor violation. Typically, the lightest pseudoscalar, vector, scalar, even and odd charge-conjugation pseudovector, and tensor multiplets of light mesons are included. In addition, some excited vector multiplets of light mesons may be available. For baryons, usually only the lightest flavor-octet, -decuplet, and -singlet baryons are present, although both the HERWIG++ and SHERPA implementations now include some heavier baryon multiplets as well.
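A schematic version of this two-hadron decay weight (spin degeneracy times two-body phase space) is sketched below; the actual HERWIG/SHERPA weights include further flavor-dependent details.

    import math

    def two_body_p(M, m1, m2):
        # Momentum of the two decay products in the cluster rest frame;
        # zero if the channel is kinematically closed.
        lam = (M**2 - (m1 + m2)**2) * (M**2 - (m1 - m2)**2)
        return math.sqrt(lam) / (2.0 * M) if lam > 0.0 else 0.0

    def channel_weight(M, m1, m2, spins1, spins2):
        # Relative probability of one flavor-compatible two-hadron channel.
        return spins1 * spins2 * two_body_p(M, m1, m2)

    # e.g. a 1.5 GeV light cluster: pi-pi vs. rho-pi
    w_pipi  = channel_weight(1.5, 0.14, 0.14, 1, 1)
    w_rhopi = channel_weight(1.5, 0.77, 0.14, 3, 1)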

Contrary to the case in the string model, the mechanism of phase-space suppression employed here leads to a natural enhancement of the lighter pseudoscalars, and no parameters beyond the spectrum of hadron masses need to be introduced at this point. The phase space also limits the transverse momenta of the produced hadrons relative to the jet axis.

Note that, since the masses and decays of excited heavy-flavor hadrons in particular are not well known, there is some freedom in the model to adjust these, which in turn will affect their relative phase-space populations.

40.2.3. Hadron and τ Decays :
Of the so-called primary hadrons, originating directly from string breaks and/or cluster decays (see above), many are unstable and so decay further, until a set of particles is obtained that can be considered stable on time scales relevant to the given measurement². The decay modeling can therefore have a significant impact on final particle yields and spectra, especially for the lowest-lying hadronic states, which receive the largest relative contributions from decays (feed-down). Note that the interplay between primary production and feed-down implies that the hadronization parameters should be retuned if significant changes to the decay treatment are made.

² E.g., a typical hadron-collider definition of a “stable particle” is cτ ≥ 10 mm, which includes the weakly-decaying strange hadrons (K, Λ, Σ±, Ξ, Ω).


Particle summary tables, such as those given elsewhere in this Review, represent a condensed summary of the available experimental measurements and hence may be incomplete and/or exhibit inconsistencies within the experimental precision. In an MC decay package, on the other hand, all information must be quantified and consistent, with all branching ratios summing to unity. When adapting particle summary information for use in a decay package, a number of choices must therefore be made. The amount of ambiguity increases as more excited hadron multiplets are added to the simulation, about which less and less is known from experiment, with each GPMC making its own choices.

A related choice is how to distribute the decay products differentially in phase space, in particular which matrix elements to use. Historically, MC generators contained matrix elements only for selected (generator-specific) classes of hadron and τ decays, coupled with a Breit-Wigner smearing of the masses, truncated at the edges of the physical decay phase space (the treatment of decay thresholds can be important for certain modes [15]). A more sophisticated treatment can then be obtained by reweighting the generated events using the obtained particle four-momenta and/or by using specialized external packages such as EVTGEN [62] for hadron decays and TAUOLA [63] for τ decays.

More recently, HERWIG++ and SHERPA include helicity dependence in τ decays [64,65], with a more limited treatment available in PYTHIA 8 [4]. The HERWIG++ and SHERPA generators have also included significantly improved internal simulations of hadronic decays, which include spin correlations between those decays for which matrix elements are used. Photon-bremsstrahlung effects are discussed in Sec. 40.1.6.

HERWIG++ and PYTHIA include the probability for B mesons to oscillate into B̄ ones before decay. SHERPA and EVTGEN also include CP-violating effects and, for common decay modes of the neutral meson and its antiparticle, the interference between the direct decay and oscillation followed by decay.

We end on a note of warning on double counting. This may occur if a particle can decay via an intermediate on-shell resonance. An example is a1 → πππ, which may proceed via a1 → ρπ, ρ → ππ. If both of these decay channels of the a1 are included, each with its full partial width, a double counting of the on-shell a1 → ρπ contribution would result. Such cases are normally dealt with consistently in the default MC generator packages, so this warning is mostly for users who wish to edit decay tables on their own.

40.3. Models for Soft Hadron-Hadron Physics

40.3.1. Minimum-Bias and Diffraction :
The term “minimum bias” (MB) originates from the experimental requirement of a minimal number of tracks (or hits) in a given instrumented region. In order to make MC predictions for such observables, all possible contributions to the relevant phase-space region must be accounted for. There are essentially four types of physics processes, which together make up the total hadron-hadron (hh) cross section: 1) elastic scattering³: hh → hh; 2) single diffractive dissociation: hh → h + gap + X, with X denoting anything that is not the original beam particle, and “gap” denoting a rapidity region devoid of observed activity; 3) double diffractive dissociation: hh → X + gap + X; and 4) inelastic non-diffractive scattering: everything else. A fifth class may also be defined, called central diffraction (hh → h + gap + X + gap + h). Some differences exist between theoretical and experimental terminology [66]. In the experimental setting, diffraction is defined by an observable gap, of some minimal size in rapidity. In the MC context, each diffractive physics process typically produces a whole spectrum of gaps, with small ones suppressed but not excluded.

The inelastic non-diffractive part of the cross section is typically modeled either by smoothly regulating and extending the perturbative QCD scattering cross sections all the way to zero p⊥ [67] (PYTHIA 6, PYTHIA 8, and SHERPA), or by regulating the QCD cross sections with a sharp cutoff [68] (HERWIG+JIMMY) and adding a separate class of intrinsically soft scatterings below that scale [69] (HERWIG++). See also Sec. 40.3.2. In all cases, the three most important ingredients are: 1) the IR regularization of the perturbative scattering cross sections, including their PDF dependence; 2) the assumed matter distribution of the colliding hadrons, possibly including multi-parton correlations [54] and/or x dependence [70]; and 3) additional soft-QCD effects such as color reconnections and/or other collective effects, discussed in Sec. 40.3.3.

³ The QED elastic-scattering cross section diverges and is normally a non-default option in MC models.

Currently, there are essentially three methods for simulating diffraction in the main MC models: 1) in PYTHIA 6, one picks a diffractive mass according to parametrized cross sections ∝ dM²/M² [71]. This mass is represented as a string, which is fragmented as described in Sec. 40.2.1, though differences in the effective scale of the hadronization may necessitate a (re)tuning of the fragmentation parameters for diffraction; 2) in PYTHIA 8, the high-mass tail beyond M ∼ 10 GeV is augmented by a partonic description in terms of pomeron PDFs [72], allowing diffractive jet production including showers and underlying event [73]; 3) the PHOJET and DPMJET programs also include central diffraction and rely directly on a formulation in terms of pomerons (color-singlet multi-gluon states) [74–76]. Cut pomerons correspond to exchanges of soft gluons, while uncut ones give elastic and diffractive topologies as well as virtual corrections that help preserve unitarity. So-called “hard pomerons” provide a transition to the perturbative regime. Fragmentation is still handled using the Lund string model, so there is some overlap with the above models at the hadronization stage. In addition, a pomeron-based package exists for HERWIG [77], and an effort is underway to construct an MC implementation of the “KMR” model [78] within the SHERPA generator. Color reconnections (Sec. 40.3.3) may also play a role in creating rapidity gaps, and the underlying event (Sec. 40.3.2) in destroying them.
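The dM²/M² mass spectrum in method 1) can be sampled directly by inverse transform, as in the sketch below; the full PYTHIA 6 parametrization includes additional mass- and t-dependent factors.

    import math, random

    def sample_diffractive_mass(m_min, m_max):
        # Draw M from dP/dM^2 proportional to 1/M^2 on [m_min, m_max]:
        # M^2 = m_min^2 * (m_max^2 / m_min^2)^u with u uniform in [0, 1).
        u = random.random()
        return math.sqrt(m_min**2 * (m_max**2 / m_min**2)**u)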

40.3.2. Underlying Event and Jet Pedestals :
In the GPMC context, the term underlying event (UE) denotes any additional activity beyond the basic process and its associated ISR and FSR activity. The dominant contribution to this is believed to come from additional color exchanges between the beam particles, which can be represented either as multiple parton-parton interactions (MPI) or as so-called cut pomerons (Sec. 40.3.1). The experimentally observed fact that the UE is more active than MB events at the same CM energy is called the “jet pedestal” effect.

The most clearly identifiable consequence of MPI is arguably the possibility of observing several hard parton-parton interactions in one and the same hadron-hadron event. This produces two or more back-to-back jet pairs, with each pair having a small vector sum of p⊥. For comparison, jets from bremsstrahlung tend to be aligned with the direction of their parent initial- or final-state partons. The fraction of MPI that give rise to additional reconstructible jets is, however, quite small. Soft MPI that do not give rise to observable jets are much more plentiful, and can give significant corrections to the color flow and total scattered energy of the event. This affects the final-state activity in a more global way, increasing multiplicity and summed ET distributions, and contributing to the break-up of the beam remnants in the forward direction.

The first detailed Monte Carlo model for perturbative MPI was proposed in Ref. 67, and with some variation this still forms the basis for most modern implementations. Some useful additional references can be found in Ref. 15. The first crucial observation is that the t-channel propagators appearing in perturbative QCD 2 → 2 scattering almost go on shell at low p⊥, causing the differential cross sections to become very large, behaving roughly as

\[ d\sigma_{2\to2} \;\propto\; \frac{dt}{t^2} \;\sim\; \frac{dp_\perp^2}{p_\perp^4}\,. \qquad (40.13) \]

This cross section is an inclusive number. Thus, if a single hadron-hadron event contains two parton-parton interactions, it will “count” twice in σ2→2 but only once in σtot, and so on. In the limit that all the interactions are independent and equivalent, one would have

\[ \sigma_{2\to2}(p_{\perp\mathrm{min}}) = \langle n\rangle(p_{\perp\mathrm{min}})\,\sigma_{\mathrm{tot}}\,, \qquad (40.14) \]


with ⟨n⟩(p⊥min) giving the average of a Poisson distribution in the number of parton-parton interactions above p⊥min per hadron-hadron collision,

\[ P_n(p_{\perp\mathrm{min}}) = \frac{\bigl(\langle n\rangle(p_{\perp\mathrm{min}})\bigr)^n \exp\bigl(-\langle n\rangle(p_{\perp\mathrm{min}})\bigr)}{n!}\,. \qquad (40.15) \]

This simple argument expresses unitarity; instead of the total interaction cross section diverging as p⊥min → 0 (which would violate unitarity), we have restated the problem so that it is now the number of MPI per collision that diverges, with the total cross section remaining finite. At LHC energies, the 2 → 2 scattering cross section computed using the full LO QCD cross section folded with modern PDFs becomes larger than the total pp one for p⊥ values of order 4–5 GeV [79]. One therefore expects the average number of perturbative MPI to exceed unity at around that scale.
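In the naive independent-interaction limit, Eqs. (40.14) and (40.15) amount to the few lines below; the cross-section values are illustrative only, and real models add momentum conservation and impact-parameter dependence, as discussed next.

    import math, random

    def sample_n_mpi(sigma_2to2, sigma_tot):
        # Eq. (40.14): <n> = sigma_2to2 / sigma_tot; Eq. (40.15): Poisson.
        mean = sigma_2to2 / sigma_tot
        n, p, threshold = 0, 1.0, math.exp(-mean)   # Knuth's product method
        while True:
            p *= random.random()
            if p <= threshold:
                return n
            n += 1

    # e.g. sigma_2to2(pTmin) = 80 mb vs. sigma_tot = 100 mb gives <n> = 0.8
    counts = [sample_n_mpi(80.0, 100.0) for _ in range(10000)]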

Two important ingredients remain to fully regulate the remaining divergence. Firstly, the interactions cannot use up more momentum than is available in the parent hadron. This suppresses the large-n tail of the estimate above. In PYTHIA-based models, the MPI are ordered in p⊥, and the parton densities for each successive interaction are explicitly constructed so that the sum of x fractions can never be greater than unity. In the HERWIG models, the uncorrelated estimate of ⟨n⟩ above is instead used as an initial guess, but the generation of actual MPI is stopped once the energy-momentum conservation limit is reached.

The second ingredient invoked to suppress the number of interactions, at low p⊥ and x, is color screening: if the wavelength ∼ 1/p⊥ of an exchanged colored parton becomes larger than a typical color-anticolor separation distance, it will only see an average color charge that vanishes in the limit p⊥ → 0, hence leading to suppressed interactions. This provides an infrared cutoff for MPI similar to that provided by the hadronization scale for parton showers. A first estimate of the color-screening cutoff would be the proton size, p⊥min ≈ ℏ/rp ≈ 0.3 GeV ≈ ΛQCD, but empirically this appears to be far too low. In current models, one replaces the proton radius rp in the above formula by a “typical color screening distance,” i.e., an average size of a region within which the net compensation of a given color charge occurs. This number is not known from first principles [78] and is perceived simply as an effective cutoff parameter. The simplest choice is to introduce a step function Θ(p⊥ − p⊥min). Alternatively, one may note that the jet cross section is divergent like α_S²(p⊥²)/p⊥⁴, cf. Eq. (40.13), and that therefore a factor

\[ \frac{\alpha_S^2(p_{\perp 0}^2 + p_\perp^2)}{\alpha_S^2(p_\perp^2)}\; \frac{p_\perp^4}{(p_{\perp 0}^2 + p_\perp^2)^2} \qquad (40.16) \]

would smoothly regulate the divergences, now with p⊥0 as the free parameter. Regardless of whether it is imposed as a smooth (PYTHIA and SHERPA) or steep (HERWIG++) function, this is effectively the main “tuning” parameter in such models.
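The effect of the smooth regulator, Eq. (40.16), is easy to exhibit numerically, as in the sketch below; the one-loop coupling and the Λ and p⊥0 values are illustrative, not tuned.

    import math

    def alpha_s(q2, lam2=0.04, nf=5):
        # One-loop running coupling; lam2 = Lambda^2 in GeV^2 (illustrative).
        b0 = (33.0 - 2.0 * nf) / (12.0 * math.pi)
        return 1.0 / (b0 * math.log(q2 / lam2))

    def regulator(pT2, pT02):
        # Eq. (40.16) divided by the naive alpha_s^2(pT^2)/pT^4 behavior:
        # tends to 1 at high pT and switches off the divergence as pT -> 0.
        return (alpha_s(pT02 + pT2)**2 / alpha_s(pT2)**2) \
             * (pT2**2 / (pT02 + pT2)**2)

    for pT in (0.5, 1.0, 2.0, 5.0, 20.0):
        print(pT, regulator(pT**2, pT02=4.0))   # e.g. pT0 = 2 GeV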

Note that the numerical value obtained for the cross section depends upon the PDF set used, and therefore the optimal value to use for the cutoff will also depend on this choice. Note also that the cutoff does not have to be energy-independent. Higher energies imply that parton densities can be probed at smaller x values, where the number of partons rapidly increases. Partons then become more closely packed, and the color screening distance d decreases. The uncertainty on the energy and/or x scaling of the cutoff is a major concern when extrapolating between different collider energies [80].

We now turn to the origin of the observational fact that hard jets appear to sit on top of a higher “pedestal” of underlying activity than events with no hard jets. This is interpreted as a consequence of impact-parameter dependence: in peripheral collisions, only a small fraction of events contain any high-p⊥ activity, whereas central collisions are more likely to contain at least one hard scattering; a high-p⊥ triggered sample will therefore be biased towards small impact parameters, b. The ability of a model to describe the shape of the pedestal (e.g., to describe both MB and UE distributions simultaneously) therefore depends upon its modeling of the b-dependence, and correspondingly the impact-parameter shape constitutes another main tuning parameter.

For each impact parameter b, the number of interactions n(b) can still be assumed to be distributed according to Eq. (40.15), again modulo momentum conservation, but now with the mean value of the Poisson distribution depending on impact parameter, ⟨n(b)⟩. This causes the final n-distribution (integrated over b) to be wider than a Poissonian.
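A toy version of this over-dispersion can be written as follows, reusing the Poisson sampler from the earlier sketch; the Gaussian matter-overlap profile, the area-uniform b sampling, and the normalization n0 are all hypothetical choices for illustration.

    import math, random

    def mean_n(b, n0=4.0, width=1.0):
        # <n(b)>: central collisions (small b) average more MPI.
        return n0 * math.exp(-b**2 / (2.0 * width**2))

    def sample_event(b_max=3.0):
        b = b_max * math.sqrt(random.random())   # dP ~ b db (area-uniform)
        # ... then draw n from a Poisson with mean mean_n(b); integrating
        # over b yields an n-distribution wider than any single Poissonian.
        return b, mean_n(b)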

Finally, there are two perturbative modeling aspects which go beyond the introduction of MPI themselves: 1) parton showers off the MPI, and 2) perturbative parton-rescattering effects. Without showers, MPI models would generate very sharp peaks for back-to-back MPI jets, caused by unshowered partons passed directly to the hadronization model. However, with the exception of the oldest PYTHIA 6 model, all GPMC models do include such showers [15], and hence should exhibit more realistic (i.e., broader and more decorrelated) MPI jets. On the initial-state side, the main questions are whether and how correlated multi-parton densities are taken into account and, as discussed previously, how the showers are regulated at low p⊥ and/or low x. Although none of the MC models currently impose a rigorous correlated multi-parton evolution, all of them include some elementary aspects. The most significant for parton-level results is arguably momentum conservation, which is enforced explicitly in all the models. The so-called “interleaved” models [24] attempt to go a step further, generating an explicitly correlated multi-parton evolution in which flavor sum rules are imposed to conserve, e.g., the total numbers of valence and sea quarks [54].

Perturbative rescattering in the final state can occur if partons are allowed to undergo several distinct interactions, with showering activity possibly taking place in between. This has so far not been studied extensively, but a first exploratory model is available [81]. In the initial state, parton rescattering/recombination effects have so far not been included in any of the GPMC models.

40.3.3. Bose-Einstein and Color-Reconnection Effects :
In the context of e+e− collisions, Bose-Einstein (BE) correlations have mostly been discussed as a source of uncertainty on high-precision W mass determinations at LEP [82]. In hadron-hadron (and nucleus-nucleus) collisions, however, BE correlations are used extensively to study the space-time structure of hadronizing matter (“femtoscopy”).

In MC models of hadronization, each string break and/or particle/cluster decay is normally factorized from all other ones. This reduces the number of variables that must be considered simultaneously, but also makes the introduction of correlations among particles from different breaks/decays intrinsically difficult to address. In the context of GPMCs, a few semi-classical models are available within the PYTHIA 6 and 8 generators [83], in which the BE effect is mimicked by an attractive interaction between pairs of identical particles in the final state, with no higher correlations included. This “force” acts after the decays of very short-lived particles, like ρ, but before decays of longer-lived ones, like π0. The main differences between the variants of this model are the assumed shape of the correlation function and how overall momentum conservation is handled.

As discussed in Sec. 40.2, leading-color (“planar”) color flows are used to set up the hadronizing systems (clusters or strings) at the hadronization stage. If the systems do not overlap significantly in space and time, subleading-color ambiguities and/or non-perturbative reconnections are expected to be small. However, if the density of displaced color charges is sufficiently high that several systems can overlap significantly, full-color and/or reconnection effects should become progressively larger.

In the specific context of MPI, a crucial question is how color is neutralized between different MPI systems, including the remnants. The large rapidity differences involved imply large invariant masses (though normally low p⊥), and hence large amounts of (soft) particle production. Indeed, in the context of soft-inclusive physics, it is these “inter-system” strings/clusters that furnish the dominant particle-production mechanism, and hence their modeling is an essential part of the soft-physics description, affecting topics such as MB/UE multiplicity and p⊥ distributions, rapidity gaps, and precision mass measurements. A more comprehensive review of color-reconnection effects can be found in Ref. 15.


40.4. Parameters and Tuning

The accuracy of GPMC models depends both on the inclusiveness of the chosen observable and on the sophistication of the simulation. Improvement at the theoretical level is an important driver for the latter, but the achievable precision also depends crucially on the available constraints on the remaining free parameters. Using existing data to constrain these is referred to as generator tuning.

Although MC models may appear to have a bewildering array of adjustable parameters, most of them only control relatively small (exclusive) details of the event generation. The majority of the (inclusive) physics is determined by only a few, very important ones, such as the value of αS, in the perturbative domain, and the properties of the non-perturbative fragmentation functions, in the non-perturbative one. One may therefore take a factorized approach, first constraining the perturbative parameters and thereafter the non-perturbative ones, each ordered by a measure of their relative significance to the overall modeling.

At LO×LL, perturbation theory is doing well if it agrees with an IR-safe measurement within 10%. It would therefore not make much sense to tune a GPMC beyond roughly 5% (it might even be dangerous, due to overfitting). The advent of NLO Monte Carlos may reduce this number slightly, but only for quantities for which one expects NLO precision. For LO Monte Carlos, distributions should be normalized to unity, since the NLO normalization is not tunable. For quantities governed by non-perturbative physics, uncertainties are larger. For some quantities, e.g., ones for which the underlying modeling is known to be poor, an order-of-magnitude agreement or worse may have to be accepted.

In the context of LO×LL GPMC tuning, subleading aspects of coupling-constant and PDF choices are relevant. In particular, one should be aware that the choice of QCD Λ parameter Λ_MC = 1.569 Λ_MS-bar (for 5 active flavors) improves the predictions of coherent shower algorithms at the NLL level [84], and hence this scheme is typically considered the baseline for shower tuning. The question of LO vs. NLO PDFs is more involved [15], but it should be emphasized that the low-x gluon in particular is important for determining the level of the underlying event in MPI models (Sec. 40.3.2), and hence the MB/UE tuning (and energy scaling [80]) is linked to the choice of PDF in such models. Further issues and an example of a specific recipe that could be followed in a realistic set-up can be found in Ref. 85. A useful online resource can be found at the mcplots.cern.ch web site [86], based on the RIVET tool [87].

Recent years have seen the emergence of automated tools that attempt to reduce the amount of both computer and manpower required for tuning [88]. Automating the human expert input is more difficult. In the tools currently on the market, this is addressed by a combination of input solicited from the GPMC authors (e.g., which parameters and ranges to consider, which observables constitute a complete set, etc.) and a set of weights determining the relative priority given to each bin in each distribution. The field is still burgeoning, however, and future sophistications are to be expected. Nevertheless, the overall quality of the automated tunes appears to be at least competitive with the manual ones.

References:

1. G. Corcella et al., JHEP 0101, 010 (2001), hep-ph/0011363.
2. M. Bahr et al., Eur. Phys. J. C58, 639 (2008), arXiv:0803.0883.
3. T. Sjostrand, S. Mrenna, and P.Z. Skands, JHEP 05, 026 (2006), hep-ph/0603175.
4. T. Sjostrand, S. Mrenna, and P.Z. Skands, Comp. Phys. Comm. 178, 852 (2008), arXiv:0710.3820.
5. T. Gleisberg et al., JHEP 0402, 056 (2004), hep-ph/0311263.
6. T. Kinoshita, J. Math. Phys. 3, 650 (1962).
7. T. Lee and M. Nauenberg, Phys. Rev. 133, 1549 (1964).
8. G.C. Fox and S. Wolfram, Nucl. Phys. B168, 285 (1980).
9. G. Altarelli and G. Parisi, Nucl. Phys. B126, 298 (1977).
10. A. Einstein, B. Podolsky, and N. Rosen, Phys. Rev. 47, 777 (1935).
11. B.R. Webber, Phys. Lett. B193, 91 (1987).
12. J.C. Collins, Nucl. Phys. B304, 794 (1988).
13. I.G. Knowles, Comp. Phys. Comm. 58, 271 (1990).
14. T. Sjostrand, Phys. Lett. B157, 321 (1985).
15. A. Buckley et al., Phys. Reports 504, 145 (2011), arXiv:1101.2599.
16. G. Marchesini and B.R. Webber, Nucl. Phys. B310, 461 (1988).
17. S. Gieseke, P. Stephens, and B. Webber, JHEP 0312, 045 (2003), hep-ph/0310083.
18. M. Bengtsson and T. Sjostrand, Nucl. Phys. B289, 810 (1987).
19. G. Gustafson and U. Pettersson, Nucl. Phys. B306, 746 (1988).
20. L. Lonnblad, Comp. Phys. Comm. 71, 15 (1992).
21. W.T. Giele, D.A. Kosower, and P.Z. Skands, Phys. Rev. D78, 014026 (2008), arXiv:0707.3652.
22. S. Schumann and F. Krauss, JHEP 0803, 038 (2008), arXiv:0709.1027.
23. Z. Nagy and D.E. Soper, JHEP 0510, 024 (2005), hep-ph/0503053.
24. T. Sjostrand and P.Z. Skands, Eur. Phys. J. C39, 129 (2005), hep-ph/0408302.
25. E. Norrbin and T. Sjostrand, Nucl. Phys. B603, 297 (2001), hep-ph/0010012.
26. S. Catani et al., Nucl. Phys. B627, 189 (2002), hep-ph/0201036.
27. J. Cembranos et al., (2013), arXiv:1305.2124.
28. N. Davidson, T. Przedzinski, and Z. Was, (2010), arXiv:1011.0937.
29. K. Hamilton and P. Richardson, JHEP 0607, 010 (2006), hep-ph/0603034.
30. M. Schonherr and F. Krauss, JHEP 0812, 018 (2008), arXiv:0810.5071.
31. A. Semenov, Comp. Phys. Comm. 180, 431 (2009), arXiv:0805.0555.
32. N.D. Christensen and C. Duhr, Comp. Phys. Comm. 180, 1614 (2009), arXiv:0806.4194.
33. M. Fairbairn et al., Phys. Reports 438, 1 (2007), hep-ph/0611040.
34. J. Alwall et al., Comp. Phys. Comm. 176, 300 (2007), hep-ph/0609017.
35. P.Z. Skands et al., JHEP 0407, 036 (2004), hep-ph/0311123.
36. J. Alwall et al., (2007), arXiv:0712.3311.
37. P. Richardson, JHEP 0111, 029 (2001), hep-ph/0110108.
38. M. Bengtsson and T. Sjostrand, Phys. Lett. B185, 435 (1987).
39. S. Catani et al., JHEP 11, 063 (2001), hep-ph/0109231.
40. J. Alwall et al., Eur. Phys. J. C53, 473 (2008), arXiv:0706.2569.
41. P. Nason, JHEP 11, 040 (2004), hep-ph/0409146.
42. B. Cooper et al., Eur. Phys. J. C72, 2078 (2012), arXiv:1109.5295.
43. S. Frixione and B.R. Webber, JHEP 06, 029 (2002), hep-ph/0204244.
44. S. Alioli et al., JHEP 1006, 043 (2010), arXiv:1002.2581.
45. S. Alioli, K. Hamilton, and E. Re, (2011), arXiv:1108.0909.
46. L. Hartgring, E. Laenen, and P. Skands, (2013), arXiv:1303.4974.
47. M.L. Mangano et al., JHEP 0307, 001 (2003), hep-ph/0206293.
48. J. Alwall et al., JHEP 1106, 128 (2011), arXiv:1106.0522.
49. E. Boos et al., (2001), hep-ph/0109068.
50. V. Hirschi et al., JHEP 1105, 044 (2011), arXiv:1103.0621.
51. X. Artru and G. Mennessier, Nucl. Phys. B70, 93 (1974).
52. B. Andersson et al., Phys. Reports 97, 31 (1983).
53. B. Andersson, Camb. Monogr. Part. Phys. Nucl. Phys. Cosmol. 7 (1997).
54. T. Sjostrand and P.Z. Skands, JHEP 0403, 053 (2004), hep-ph/0402078.
55. M. Bowler, Z. Phys. C11, 169 (1981).
56. C. Peterson et al., Phys. Rev. D27, 105 (1983).
57. D. Amati and G. Veneziano, Phys. Lett. B83, 87 (1979).
58. A. Bassetto, M. Ciafaloni, and G. Marchesini, Phys. Lett. B83, 207 (1979).
59. R.D. Field and S. Wolfram, Nucl. Phys. B213, 65 (1983).
60. B.R. Webber, Nucl. Phys. B238, 492 (1984).
61. J.-C. Winter, F. Krauss, and G. Soff, Eur. Phys. J. C36, 381 (2004), hep-ph/0311085.
62. D. Lange, Nucl. Instrum. Methods A462, 152 (2001).
63. S. Jadach et al., Comp. Phys. Comm. 76, 361 (1993).
64. D. Grellscheid and P. Richardson, (2007), arXiv:0710.1951.
65. T. Gleisberg et al., JHEP 0902, 007 (2009), arXiv:0811.4622.
66. V. Khoze et al., Eur. Phys. J. C69, 85 (2010), arXiv:1005.4839.
67. T. Sjostrand and M. van Zijl, Phys. Rev. D36, 2019 (1987).
68. J.M. Butterworth, J.R. Forshaw, and M.H. Seymour, Z. Phys. C72, 637 (1996), hep-ph/9601371.
69. M. Bahr et al., (2009), arXiv:0905.4671.
70. R. Corke and T. Sjostrand, JHEP 1105, 009 (2011), arXiv:1101.5953.
71. G.A. Schuler and T. Sjostrand, Phys. Rev. D49, 2257 (1994).
72. G. Ingelman and P. Schlein, Phys. Lett. B152, 256 (1985).
73. S. Navin, (2010), arXiv:1005.3894.
74. P. Aurenche et al., Comp. Phys. Comm. 83, 107 (1994), hep-ph/9402351.
75. F.W. Bopp, R. Engel, and J. Ranft, (1998), hep-ph/9803437.
76. S. Roesler, R. Engel, and J. Ranft, p. 1033 (2000), hep-ph/0012252.
77. B.E. Cox and J.R. Forshaw, Comp. Phys. Comm. 144, 104 (2002), hep-ph/0010303.
78. M. Ryskin, A. Martin, and V. Khoze, Eur. Phys. J. C71, 1617 (2011), arXiv:1102.2844.
79. M. Bahr, J.M. Butterworth, and M.H. Seymour, JHEP 01, 065 (2009), arXiv:0806.2949.
80. H. Schulz and P.Z. Skands, Eur. Phys. J. C71, 1644 (2011), arXiv:1103.3649.
81. R. Corke and T. Sjostrand, JHEP 01, 035 (2010), arXiv:0911.1901.
82. LEP Electroweak Working Group, (2005), hep-ex/0511027.
83. L. Lonnblad and T. Sjostrand, Eur. Phys. J. C2, 165 (1998), hep-ph/9711460.
84. S. Catani, B.R. Webber, and G. Marchesini, Nucl. Phys. B349, 635 (1991).
85. P.Z. Skands, (2011), arXiv:1104.2863.
86. A. Karneyeu et al., (2013), arXiv:1306.3436.
87. A. Buckley et al., (2010), arXiv:1003.0694.
88. A. Buckley et al., Eur. Phys. J. C65, 331 (2010), arXiv:0907.2973.


41. MONTE CARLO NEUTRINO EVENT GENERATORS

Written September 2013 by H. Gallagher (Tufts U.) and Y. Hayato (Tokyo U.)

Monte Carlo neutrino generators are programs or libraries which simulate neutrino interactions with electrons, nucleons, and nuclei. In this capacity their usual task is to take an input neutrino and nucleus and produce a set of 4-vectors for particles emerging from the interaction, which are then input to full detector simulations. Since these generators have to simulate not only the initial interaction of neutrinos with target particles, but also re-interactions of the generated particles in the nucleus, they contain a wide range of elementary particle and nuclear physics. Viewed more broadly, they are the access point for neutrino experimentalists to the theory inputs needed for analysis. Examples include cross section libraries for event rate calculations, and parameter uncertainties and reweighting tools for systematic error evaluation.

Neutrino experiments typically operate in neutrino beams that are neither completely pure nor mono-energetic. Generators are a crucial component in the convolution of beam flux, neutrino interaction physics, and detector response that is necessary to make predictions about observable quantities. Similarly, they are used to relate reconstructed quantities back to true quantities. In these various capacities they are used from the detector design stage through the extraction of physics measurements from reconstructed observables. Monte Carlo neutrino generators play unique and important roles in the experimental study of neutrino interactions and oscillations.

There are several neutrino event generators available, such as ANIS [1], GENIE [2], GiBUU [3], NEGN [4], NEUT [5], NUANCE [6], the FLUKA routines NUNDIS/NUNRES [7], and NuWRO [8]. Historically, experiments would develop their own generators. This was often because they were focused on a particular measurement, energy range, or target, and wanted to ensure that the best physics was included for it. These ‘home-grown’ generators were often tuned primarily or exclusively to the neutrino data most similar to the data that the experiment would be collecting. A major advance in the field was the introduction of conference series devoted to the topic of neutrino interaction physics, NuINT and NuFACT in particular. Event generator comparisons have been a regular staple of the NuINT conference series from its inception, and a great deal of information on this topic can be found in the Proceedings of these meetings. These meetings have facilitated experiment-theory discussions, leading to the first generator developed by a theory group (NuWRO) [8], the extension of established nuclear interaction codes (FLUKA and GiBUU) to include neutrino-nuclear processes [3], [7], and the inclusion of theorists in existing generator development teams.

These activities have led to more careful scrutiny of the crucial nuclear theory inputs to these generators, which are evaluated in particular through comparisons to electron-scattering data. At this point in time all simulation codes face challenges in describing the full extent of the lepton scattering data, and the tension between incorporating the best available theory and obtaining the best agreement with the data plays out in a variety of ways within the field. For the field to make progress, inclusion of state-of-the-art theory needs to be coupled to global analyses that correctly incorporate correlations between measurements. Given the rapid pace of new data and the complexity of analyses, this is a significant challenge for the field in the coming years.

There are many neutrino experiments which use various sources of neutrinos (reactors, accelerators, the atmosphere, and astrophysical sources), thereby covering a range of energies from MeV to TeV. Much of the emphasis in the generators has been on the few-GeV region, as this is the relevant energy range for long-baseline neutrino oscillation experiments. These generators use the impulse approximation for most of the primary neutrino interactions and simulate the interactions of secondary particles in the nucleus in semi-classical ways, both in order to cover a variety of nuclei within a single model and for practical considerations, as these approaches are fast. However, there are several challenges facing these simulations, coming mainly from the complexity of the nuclear physics and from avoiding double counting when combining perturbative and non-perturbative models for the neutrino-nucleon scattering processes. While generators share many common ingredients, differences in implementation, parameter values, and approaches to avoid double counting can yield dramatically different predictions [9]. In the following sections, interaction models and their implementations, including the interactions of generated particles in the nuclei, are described.

To ensure their reliability, neutrino event generators are tuned and validated against a wide variety of data, including data from photon, charged-lepton, neutrino, and hadron probes. The results from these external data tuning exercises are important for experiments, as they quantify the uncertainty on model parameters, which experiments need in the evaluation of generator-related systematic errors. Electron scattering data play an important role in determining the vector contribution to the form factors and structure functions, as well as in evaluating specific aspects of the nuclear model. Hadron scattering data are used in validating the nuclear model, in particular the modeling of final state interactions. Tuning of neutrino-nucleon scattering and hadronization models relies heavily on the previous generation of high energy neutrino scattering and hydrogen and deuterium bubble chamber experiments, and more recent data from the K2K, MiniBooNE, NOMAD, SciBooNE, MINOS, T2K, ArgoNEUT, and MINERvA experiments either have been, or will be, used for this purpose.

41.1. Neutrino-Nucleon Scattering

Event generators typically begin with free-nucleon cross sections which are then embedded into a nuclear physics model. The most important processes are quasi-elastic (elastic for NC) scattering, resonance production, and non-resonant inelastic scattering, which make comparable contributions for few-GeV interactions. The neutrino cross sections in this energy range can be seen in Figures 49.1 through 49.4 of this Review.

41.1.1. Quasi-Elastic Scattering : The cross section for neutrino-nucleon charged current quasi-elastic scattering is described in terms of the leptonic and hadronic weak currents, where the dominant contributions to the hadronic current come from the vector and axial-vector form factors. There is also a pseudo-scalar term (the pseudo-scalar form factor) in the hadronic current, but this term is rather small for electron and muon neutrinos and is usually related to the axial form factor assuming the partially conserved axial current (PCAC) hypothesis. The vector form factors have been measured by recent precise electron scattering experiments and are known to deviate from the simple dipole form [10]. Therefore, most of the generators use parametrizations of these form factors taken directly from the data. For the axial form factor there is no such precise experiment, and most of the generators use a dipole form; a sketch of the dipole form follows this paragraph. Generally, the value of the axial form factor at q² = 0 is extracted from polarized nucleon beta decay experiments. However, the selection of the axial-vector mass parameter depends on each generator, with values typically around 1.00 GeV/c².
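As an illustration of the dipole form just described, a minimal Python sketch evaluating F_A(Q²) is given below. The normalization g_A = 1.267 (from beta decay) and the axial mass M_A = 1.00 GeV/c² are illustrative assumptions, not the values adopted by any particular generator.

# Sketch: dipole axial form factor, F_A(Q^2) = F_A(0) / (1 + Q^2/M_A^2)^2.
# GA = |F_A(0)| from polarized nucleon beta decay and MA are assumed values.
GA = 1.267   # assumed |F_A(0)|
MA = 1.00    # assumed axial mass in GeV/c^2 (generators differ)

def axial_form_factor(Q2):
    """Dipole parametrization; Q2 in GeV^2."""
    return GA / (1.0 + Q2 / MA**2) ** 2

print(axial_form_factor(0.0))   # -> 1.267 at Q^2 = 0
print(axial_form_factor(1.0))   # suppressed at Q^2 = 1 GeV^2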

41.1.2. Resonance Production : Most generators use the calculation of Rein and Sehgal to simulate neutrino-induced single pion production [13]. To obtain the cross section for a particular channel, they calculate the amplitude for the production of each resonance multiplied by the probability for the decay of that resonance into that particular channel. Implementation differences include the number of resonances included, whether the amplitudes are added coherently or incoherently, the invariant mass range over which the model is used, how non-resonant backgrounds are included, the inclusion of lepton mass terms, and the model parameter values (in particular the axial mass). In this model it is also possible to calculate the cross sections for single photon, kaon, and η production by changing the decay probabilities of the resonances; these channels are included in some of the programs. However, it is known that discrepancies exist between recent pion electro- and photoproduction data and simulations in the same framework, i.e. the vector part of this model. There have been several attempts to overcome this issue [12], and some of the generators have started using more appropriate form factors. The GiBUU and NuWRO generators do not use the Rein-Sehgal model, and instead rely directly on electro-production data for the vector contribution and fit bubble chamber data to determine the remaining parameters for the axial contribution [14–16].


41.1.3. Deep and Shallow Inelastic Scattering : For this process the fundamental target shifts from the nucleon to its quark constituents. The generators therefore use the standard expressions constructing the nucleon structure functions F2 and xF3 from parton distributions at high Q² (the DIS regime) to calculate the direction and momentum of the lepton. The first challenge is in extending this picture to the lower values of Q² and W that dominate the available phase space for few-GeV interactions (the so-called 'shallow inelastic scattering', or SIS, regime). The corrections proposed in [17] are widely used, while others [7] implement their own modifications to the parton distributions at low Q². Both DIS and SIS generate hadrons, but their production depends on each generator's implementation of a hadronization model, as described in the next section. There are various difficulties not only in the actual hadronization but also in its relation to single meson production. It is necessary to avoid double counting between the resonance and SIS/DIS models, and all generators differ in this regard. The scheme chosen can have a significant impact on the results of simulations at few-GeV neutrino energies.

41.2. Hadronization Models

For hadrons produced via baryonic resonances, the underlying model amplitudes and resonance branching fractions can be used to fully characterize the hadronic system. For non-resonant production, a hadronization model is required. Most generators use PYTHIA [18] for this purpose, although some with modified parameters. In addition, some implement their own models to handle invariant masses that are too low for PYTHIA, typically somewhere around 2.0 GeV/c². Such models rely heavily on measurements of neutrino hadro-production in high-resolution devices, such as bubble chambers and the CHORUS [19] and NOMAD [20] experiments, to construct empirical parametrizations that reproduce the key features of the data [21,22]. The basic ingredients are the empirical observations that average charged-particle multiplicities increase logarithmically with the invariant mass of the hadronic system, and that the distribution of charged-particle multiplicities about this average is described by a single function (an observation known as KNO scaling); a sketch of these ingredients follows this paragraph. Neutral particles are assumed to be produced with an average multiplicity that is 50% of the charged-particle multiplicity. Simple parametrizations to more accurately reproduce differences observed in the forward/backward hemispheres of hadronic systems are included in GENIE, NEUT, and NuWRO.
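The empirical ingredients named above can be stated compactly in code. In this sketch the logarithmic-multiplicity coefficients a and b, and the KNO scaling function f, are placeholders for illustration only; the tuned values and functional forms differ between generators.

import math

# <n_ch> = a + b*ln(W^2): average charged multiplicity grows logarithmically
# with the hadronic invariant mass squared W^2 (GeV^2). a, b are illustrative.
def mean_charged_multiplicity(W2, a=0.4, b=1.4):
    return a + b * math.log(W2)

# Neutral particles assumed produced at 50% of the charged multiplicity.
def mean_neutral_multiplicity(W2):
    return 0.5 * mean_charged_multiplicity(W2)

# KNO scaling: P(n) depends on n only through z = n/<n>, via one universal f(z).
def kno_probability(n, n_mean, f):
    return f(n / n_mean) / n_mean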

41.3. Nuclear Physics

The nuclear physics relevant to neutrino-nucleus scattering at few-GeV energies is complicated, involving Fermi motion, nuclear binding, Pauli blocking, in-medium modifications of form factors and hadronization, intranuclear rescattering of hadrons, and many-body scattering mechanisms including long- and short-range nucleon-nucleon correlations.

41.3.1. Scattering Mechanisms :

Most of the models used for neutrino-nuclear scattering kinematics were developed in the context of few-GeV inclusive electron scattering, by experiments going back nearly 50 years. A topic of considerable discussion within this community has been to what extent the impulse approximation, whereby the nucleus is envisioned as a collection of bound, moving, single nucleons, is appropriate. The question arose initially in the context of measurements of the quasi-elastic axial mass, with a number of recent experiments using nuclear targets measuring values that were significantly higher than those obtained by an earlier generation of bubble chamber experiments using hydrogen or deuterium [23]. This led to a revisitation of the role played by scattering from multi-particle/hole states in the nucleus. The contribution of these scattering processes is an extremely active area of theoretical research at present, with significant implications for generators and analyses [24]. The GiBUU, NuWRO, GENIE, and NEUT generators have all implemented, or are in the process of implementing, first models for these processes [25].

In order to obtain the cross section off nucleons in the nucleus, it is necessary to take into account in-medium effects. The basic models employed in event generators rely on impulse-approximation schemes, the simplest of which is the relativistic Fermi gas model; the most common implementations are the Smith-Moniz [26] and Bodek-Ritchie [27] models. A minimal sketch of the Fermi gas ingredients follows this paragraph. Within the electron scattering community, the analogous calculations have for decades relied on spectral functions, which incorporate information about nucleon momenta and binding energies in the impulse-approximation scheme. The NuWRO and GiBUU generators currently use spectral functions, they are incorporated into NEUT as an option, and several of the other generators are incorporating spectral-function models at this time. It is known from photo- and electro-nuclear scattering that the Delta width is affected by Pauli blocking and collisional broadening. These effects are included in some, but not all, generators.
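A minimal sketch of the simplest scheme named above, a relativistic Fermi gas with Pauli blocking. The Fermi momentum kF = 0.221 GeV/c is an illustrative carbon-like value, not the fit value used by any particular generator.

import random

KF = 0.221  # assumed Fermi momentum in GeV/c (illustrative)

def sample_nucleon_momentum():
    """Draw |p| of an initial-state nucleon uniformly from the Fermi sphere (~ p^2 dp)."""
    return KF * random.random() ** (1.0 / 3.0)

def pauli_allowed(p_out):
    """Pauli blocking: reject events whose outgoing nucleon lands inside the filled Fermi sea."""
    return p_out > KF

print(pauli_allowed(0.15))  # -> False: below kF, the event is blocked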

When scattering from a nucleus, coherent scattering of various kinds is possible. Most simulations incorporate, at least, neutral- and charged-current coherent single pion production. While the interaction rate for these processes is typically around a percent of the total yield, the unique kinematic features of these events can make them potential backgrounds for oscillation searches. Implemented in the Monte Carlos are PCAC-based methods, while microscopic models are currently being incorporated into several generators as well. Reference [9] clearly demonstrates a point mentioned earlier, where generators implementing the same model [28] are seen to produce very different predictions.

41.3.2. Hadron Production in Nuclei :

Neutrino pion production is one of the dominant interactions in the few-GeV region, and the interaction cross sections in the nucleus of the pions produced in those interactions are quite large. The interactions of pions in the nucleus therefore change the pion kinematics and can have large effects on the results of simulations at these energies. Most generators implement this physics through an intranuclear cascade simulation. In generators which utilize cascade models, a hadron which has been formed in the nucleus is moved step by step until it interacts with another nucleon or escapes from the nucleus. The probabilities of the various interactions in the nucleus are usually given in terms of mean free paths and are used to determine whether the hadron interacts; a one-step sketch follows this paragraph. If the hadron is found to interact, the appropriate interaction is selected and simulated. Usually, absorption, elastic, and inelastic scattering, including particle production, are simulated as secondary interactions. The method of determining the kinematics of the final-state particles depends heavily on the generator, but most use experimentally validated models to simulate hadron interactions in the nucleus. No two intranuclear cascade simulations implemented in neutrino event generators are the same. In all cases hadrons propagate from an interaction vertex chosen based on the density distribution of the target nucleus. In determining the position at which hadrons are generated in the nucleus, the concept of a formation length is sometimes employed. In this picture, the hadronization process is not instantaneous, and some time passes before the hadrons are generated [29]. The basis for formation times are measurements at relatively high energy and Q², and most generators that employ the concept do not apply it to resonance interactions, the exception being [29]. The intranuclear rescattering simulations are typically validated against hadron scattering data. In some simulations (e.g. NEUT) pion-less Delta decay is also considered, in which 20% of the events have no pion and only the lepton and the nucleon are generated.
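A one-step sketch of the cascade logic just described; the density profile rho(r) and the cross section sigma are illustrative stand-ins for the experimentally validated inputs that real generators use.

import math, random

# One cascade step: over a step dx (fm), the interaction probability is
# 1 - exp(-dx/lambda) with mean free path lambda = 1/(rho*sigma).
def cascade_step(r, direction, rho, sigma, dx=0.1):
    """rho(r) in fm^-3, sigma in fm^2 (1 mb = 0.1 fm^2). Returns (new position, interacted?)."""
    p_int = 1.0 - math.exp(-rho(r) * sigma * dx)
    r_new = tuple(x + dx * u for x, u in zip(r, direction))
    return r_new, random.random() < p_int

# e.g. a pion starting at the center of a uniform-density nucleus (0.17 fm^-3):
pos, hit = cascade_step((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), lambda r: 0.17, 3.0)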

The exception is GiBUU, a semi-classical transport model in coupled channels that describes the space-time evolution of a many-body system in the presence of potentials and a collision term [3]. This approach assures consistency between nuclear effects in the initial state, such as Fermi motion, Pauli blocking, hadron self-energies, and modified cross sections, and in the final state, such as particle reinteractions, since the two are derived from the same model. This model has previously been used to describe a wide variety of nuclear interaction data. Similarly, the hadronic simulation of the NUNDIS/NUNRES programs is handled by the well-established FLUKA hadronic simulation package [7].


References:

1. A. Gazizov and M.P. Kowalski, Comput. Phys. Commun. 172, 203 (2005), astro-ph/0406439.
2. C. Andreopoulos et al., Nucl. Instrum. Methods A614, 87 (2010), arXiv:0905.2517.
3. O. Buss et al., Phys. Reports 512, 1 (2012), arXiv:1106.1344.
4. D. Autiero, Nucl. Phys. Proc. Suppl. 139, 253 (2005).
5. Y. Hayato, Nucl. Phys. Proc. Suppl. 112, 171 (2002).
6. D. Casper, Nucl. Phys. Proc. Suppl. 112, 161 (2002), hep-ph/0208030.
7. G. Battistoni et al., Acta Phys. Polon. B40, 2491 (2009).
8. C. Juszczak, J.A. Nowak, and J.T. Sobczyk, Nucl. Phys. Proc. Suppl. 159, 211 (2006), hep-ph/0512365.
9. S. Boyd et al., AIP Conf. Proc. 1189, 60 (2009).
10. A. Bodek et al., Eur. Phys. J. C53, 349 (2008), arXiv:0708.1946.
11. R.P. Feynman, M. Kislinger, and F. Ravndal, Phys. Rev. D3, 2706 (1971).
12. K.M. Graczyk and J.T. Sobczyk, Phys. Rev. D77, 053001 (2008) [Erratum-ibid. D79, 079903 (2009)], arXiv:0707.3561.
13. D. Rein and L.M. Sehgal, Ann. Phys. 133, 79 (1981).
14. O. Lalakulich and E.A. Paschos, Phys. Rev. D71, 074003 (2005), hep-ph/0501109.
15. J. Nowak, Phys. Scripta T127, 70 (2006).
16. L. Alvarez-Ruso, S.K. Singh, and M.J. Vicente-Vacas, Phys. Rev. C57, 2693 (1998), nucl-th/9712058.
17. A. Bodek and U.K. Yang, J. Phys. G29, 1899 (2003), hep-ex/0210024.
18. T. Sjostrand, S. Mrenna, and P. Skands, JHEP 05, 026 (2006), hep-ph/0603175.
19. A. Kayis-Topaksu et al., Eur. Phys. J. C51, 775 (2007), arXiv:0707.1586.
20. J. Altegoer et al., Phys. Lett. B445, 439 (1999).
21. T. Yang et al., Eur. Phys. J. C63, 1 (2009), arXiv:0904.4043.
22. J. Nowak and J. Sobczyk, Acta Phys. Polon. B37, 2371 (2006), hep-ph/0608108.
23. H. Gallagher, G. Garvey, and G. Zeller, Ann. Rev. Nucl. and Part. Sci. 61, 355 (2011).
24. O. Lalakulich, U. Mosel, and K. Gallmeister, Phys. Rev. C86, 054606 (2012), arXiv:1208.3678.
25. T. Katori, arXiv:1304.6014.
26. R. Smith and E. Moniz, Nucl. Phys. B43, 605 (1972).
27. A. Bodek and J. Ritchie, Phys. Rev. D24, 1400 (1981).
28. D. Rein and L. Sehgal, Nucl. Phys. B223, 29 (1983).
29. T. Golan, C. Juszczak, and J. Sobczyk, Phys. Rev. C86, 015505 (2012), arXiv:1202.4197.


42. MONTE CARLO PARTICLE NUMBERING SCHEME

Revised September 2013 by J.-F. Arguin (LBNL), L. Garren (Fermilab), F. Krauss (Durham U.), C.-J. Lin (LBNL), S. Navas (U. Granada), P. Richardson (Durham U.), and T. Sjostrand (Lund U.).

The Monte Carlo particle numbering scheme presented here is intended to facilitate interfacing between event generators, detector simulators, and analysis packages used in particle physics. The numbering scheme was introduced in 1988 [1] and a revised version [2,3] was adopted in 1998 in order to allow systematic inclusion of quark model states which are as yet undiscovered and of hypothetical particles such as SUSY particles. The numbering scheme is used in several event generators, e.g. HERWIG, PYTHIA, and SHERPA, and interfaces, e.g. /HEPEVT/ and HepMC.

The general form is a 7–digit number:

±n nr nL nq1 nq2 nq3 nJ .

This encodes information about the particle's spin, flavor content, and internal quantum numbers. The details are as follows (a minimal decoding sketch in code follows the list of rules below):

1. Particles are given positive numbers, antiparticles negative numbers. The PDG convention for mesons is used, so that K+ and B+ are particles.
2. Quarks and leptons are numbered consecutively starting from 1 and 11 respectively; to do this they are first ordered by family and within families by weak isospin.
3. In composite quark systems (diquarks, mesons, and baryons) nq1−3 are quark numbers used to specify the quark content, while the rightmost digit nJ = 2J + 1 gives the system's spin (except for the K0S and K0L). The scheme does not cover particles of spin J > 4.
4. Diquarks have 4-digit numbers with nq1 ≥ nq2 and nq3 = 0.
5. The numbering of mesons is guided by the nonrelativistic (L–S decoupled) quark model, as listed in Tables 15.2 and 15.3.
   a. The numbers specifying the meson's quark content conform to the convention nq1 = 0 and nq2 ≥ nq3. The special case K0L is the sole exception to this rule.
   b. The quark numbers of flavorless, light (u, d, s) mesons are: 11 for the member of the isotriplet (π0, ρ0, . . .), 22 for the lighter isosinglet (η, ω, . . .), and 33 for the heavier isosinglet (η′, φ, . . .). Since isosinglet mesons are often large mixtures of uū + dd̄ and ss̄ states, 22 and 33 are assigned by mass and do not necessarily specify the dominant quark composition.
   c. The special numbers 310 and 130 are given to the K0S and K0L respectively.
   d. The fifth digit nL is reserved to distinguish mesons of the same total (J) but different spin (S) and orbital (L) angular momentum quantum numbers. For J > 0 the numbers are: (L, S) = (J − 1, 1) nL = 0, (J, 0) nL = 1, (J, 1) nL = 2, and (J + 1, 1) nL = 3. For the exceptional case J = 0 the numbers are (0, 0) nL = 0 and (1, 1) nL = 1 (i.e. nL = L). See Table 42.1.

Table 42.1: Meson numbering logic. Here qq stands for nq2 nq3.

        L = J−1, S = 1      L = J, S = 0        L = J, S = 1        L = J+1, S = 1
J     code   JPC   L       code   JPC   L      code   JPC   L      code   JPC   L
0     —      —     —       00qq1  0−+   0      —      —     —      10qq1  0++   1
1     00qq3  1−−   0       10qq3  1+−   1      20qq3  1++   1      30qq3  1−−   2
2     00qq5  2++   1       10qq5  2−+   2      20qq5  2−−   2      30qq5  2++   3
3     00qq7  3−−   2       10qq7  3+−   3      20qq7  3++   3      30qq7  3−−   4
4     00qq9  4++   3       10qq9  4−+   4      20qq9  4−−   4      30qq9  4++   5

e. If a set of physical mesons correspond to a (non-negligible) mixture of basis states, differing in their internal quantum numbers, then the lightest physical state gets the smallest basis state number. For example the K1(1270) is numbered 10313 (1 1P1 K1B) and the K1(1400) is numbered 20313 (1 3P1 K1A).
f. The sixth digit nr is used to label mesons radially excited above the ground state.
g. Numbers have been assigned for complete nr = 0 S- and P-wave multiplets, even where states remain to be identified.
h. In some instances assignments within the qq̄ meson model are only tentative; here best-guess assignments are made.
i. Many states appearing in the Meson Listings are not yet assigned within the qq̄ model. Here nq2−3 and nJ are assigned according to the state's likely flavors and spin; all such unassigned light isoscalar states are given the flavor code 22. Within these groups nL = 0, 1, 2, . . . is used to distinguish states of increasing mass. These states are flagged using n = 9. It is to be expected that these numbers will evolve as the nature of the states is elucidated. Codes are assigned to all mesons which are listed in the one-page table at the end of the Meson Summary Table as long as they have a preferred or established spin. Additional heavy meson states expected from heavy quark spectroscopy are also assigned codes.

6. The numbering of baryons is again guided by the nonrelativistic quark model, see Table 15.6. This numbering scheme is illustrated through a few examples in Table 42.2.
   a. The numbers specifying a baryon's quark content are such that in general nq1 ≥ nq2 ≥ nq3.
   b. Two states exist for J = 1/2 baryons containing 3 different types of quarks. In the lighter baryon (Λ, Ξ, Ω, . . .) the light quarks are in an antisymmetric (J = 0) state, while for the heavier baryon (Σ0, Ξ′, Ω′, . . .) they are in a symmetric (J = 1) state. In this situation nq2 and nq3 are reversed for the lighter state, so that the smaller number corresponds to the lighter baryon.
   c. For excited baryons a scheme is adopted where the nr label is used to denote the excitation bands in the harmonic oscillator model, see Sec. 15.4. Using the notation employed there, nr is given by the N-index of the DN band identifier.
   d. Further degeneracies of excited hadron multiplets with the same excitation number nr and spin J are lifted by labelling such multiplets with the nL index according to their mass, as given by its N- or ∆-equivalent.
   e. In such excited multiplets extra singlets may occur, the Λ(1520) being a prominent example. In such cases the ordering is reversed such that the heaviest quark label is pushed to the last position: nq3 > nq1 > nq2.
   f. For pentaquark states n = 9, nr nL nq1 nq2 gives the four quark numbers in order nr ≥ nL ≥ nq1 ≥ nq2, nq3 gives the antiquark number, and nJ = 2J + 1, with the assumption that J = 1/2 for the states currently reported.
7. The gluon, when considered as a gauge boson, has official number 21. In codes for glueballs, however, 9 is used to allow a notation in close analogy with that of hadrons.
8. The pomeron and odderon trajectories and a generic reggeon trajectory of states in QCD are assigned codes 990, 9990, and 110 respectively, where the final 0 indicates the indeterminate nature of the spin, and the other digits reflect the expected "valence" flavor content. We do not attempt a complete classification of all reggeon trajectories, since there is currently no need to distinguish a specific such trajectory from its lowest-lying member.

9. Two-digit numbers in the range 21–30 are provided for the Standard Model gauge bosons and Higgs.
10. Codes 81–100 are reserved for generator-specific pseudoparticles and concepts.
11. The search for physics beyond the Standard Model is an active area, so these codes are also standardized as far as possible.
   a. A standard fourth generation of fermions is included by analogy with the first three.
   b. The graviton and the boson content of a two-Higgs-doublet scenario and of additional SU(2)×U(1) groups are found in the range 31–40.
   c. "One-of-a-kind" exotic particles are assigned numbers in the range 41–80.

   d. Fundamental supersymmetric particles are identified by adding a nonzero n to the particle number. The superpartner of a boson or a left-handed fermion has n = 1 while the superpartner of a right-handed fermion has n = 2. When mixing occurs, such as between the winos and charged Higgsinos to give charginos, or between left and right sfermions, the lighter physical state is given the smaller basis state number.


Table 42.2: Some examples of octet (top) and decuplet (bottom) members of the numbering scheme for excited baryons. Here qqq stands for nq1 nq2 nq3. See the text for the definition of the notation. The numbers in parentheses correspond to the mass of the baryons. The states marked (?) are not experimentally confirmed.

Octet (qqq codes: N 211,221; Λ8 312; Σ 311,321,322; Ξ 331,332; Λ1 213)
JP     (D, L^P_N)    nr nL qqq nJ    N        Λ8        Σ         Ξ        Λ1
1/2+   (56, 0+_0)    00qqq2         (939)    (1116)    (1193)    (1318)   —
1/2+   (56, 0+_2)    20qqq2         (1440)   (1600)    (1660)    (1690)   —
1/2+   (70, 0+_2)    21qqq2         (1710)   (1810)    (1880)    (?)      (?)
1/2−   (70, 1−_1)    10qqq2         (1535)   (1670)    (1620)    (1750)   (1405)

Decuplet (qqq codes: ∆ 111,211,221,222; Σ 311,321,322; Ξ 331,332; Ω 333)
JP     (D, L^P_N)    nr nL qqq nJ    ∆        Σ         Ξ        Ω
3/2+   (56, 0+_0)    00qqq4         (1232)   (1385)    (1530)   (1672)
3/2+   (56, 0+_2)    20qqq4         (1600)   (1690)    (?)      (?)
1/2−   (70, 1−_1)    11qqq2         (1620)   (1750)    (?)      (?)
3/2−   (70, 1−_1)    12qqq4         (1700)   (?)       (?)      (?)


   e. Technicolor states have n = 3, with technifermions treated like ordinary fermions. States which are ordinary color singlets have nr = 0. Color octets have nr = 1. If a state has non-trivial quantum numbers under the topcolor groups SU(3)1 × SU(3)2, the quantum numbers are specified by tech,ij, where i and j are 1 or 2. nL is then 2i + j. The coloron, V8, is a heavy gluon color octet and thus is 3100021.
   f. Excited (composite) quarks and leptons are identified by setting n = 4 and nr = 0.
   g. Within several scenarios of new physics, it is possible to have colored particles sufficiently long-lived for color-singlet hadronic states to form around them. In the context of supersymmetric scenarios, these states are called R-hadrons, since they carry odd R-parity. R-hadron codes, defined here, should be viewed as templates for corresponding codes also in other scenarios, for any long-lived particle that is either an unflavored color octet or a flavored color triplet. The R-hadron code is obtained by combining the SUSY particle code with a code for the light degrees of freedom, with as many intermediate zeros removed from the former as required to make room for the latter at the end. (To exemplify, a sparticle n00000nq combined with quarks q1 and q2 obtains code n00nqnq1nq2nJ.) Specifically, the new-particle spin decouples in the limit of large masses, so that the final nJ digit is defined by the spin state of the light-quark system alone. An appropriate number of nq digits is used to define the ordinary-quark content. As usual, 9 rather than 21 is used to denote a gluon/gluino in composite states. The sign of the hadron agrees with that of the constituent new particle (a color triplet) where there is a distinct new antiparticle, and else is defined as for normal hadrons. Particle names are R with the flavor content as lower index.

   h. A black hole in models with extra dimensions has code 5000040. Kaluza-Klein excitations in models with extra dimensions have n = 5 or n = 6, to distinguish excitations of left- or right-handed fermions or, in case of mixing, the lighter or heavier state (cf. 11d). The nonzero nr digit gives the radial excitation number, in scenarios where the level spacing allows these to be distinguished. Should the model also contain supersymmetry, excited SUSY states would be denoted by nr > 0, with n = 1 or 2 as usual. Should some colored states be long-lived enough that hadrons would form around them, the coding strategy of 11g applies, with the initial two nnr digits preserved in the combined code.
   i. Magnetic monopoles and dyons are assumed to have one unit of Dirac monopole charge and a variable integer number nq1nq2nq3 of units of electric charge. Codes 411nq1nq2nq30 are then used when the magnetic and electric charge signs agree and 412nq1nq2nq30 when they disagree, with the overall sign of the particle set by the magnetic charge. For now no spin information is provided.
12. Occasionally program authors add their own states. To avoid confusion, these should be flagged by setting nnr = 99.
13. Concerning the non-99 numbers, it may be noted that only quarks, excited quarks, squarks, and diquarks have nq3 = 0; only diquarks, baryons (including pentaquarks), and the odderon have nq1 ≠ 0; and only mesons, the reggeon, and the pomeron have nq1 = 0 and nq2 ≠ 0. Concerning mesons (not antimesons), if nq1 is odd then it labels a quark and an antiquark if even.

14. Nuclear codes are given as 10-digit numbers ±10LZZZAAAI. For a (hyper)nucleus consisting of np protons, nn neutrons and nΛ Λ's, A = np + nn + nΛ gives the total baryon number, Z = np the total charge, and L = nΛ the total number of strange quarks. I gives the isomer level, with I = 0 corresponding to the ground state and I > 0 to excitations, see [4], where states denoted m, n, p, q translate to I = 1–4. As examples, the deuteron is 1000010020 and 235U is 1000922350. To avoid ambiguities, nuclear codes should not be applied to a single hadron, like p, n or Λ0, where quark-contents-based codes already exist.
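As promised above, a minimal decoding sketch: it splits a 7-digit code into its digits and unpacks the 10-digit nuclear codes. This is only an illustration of the rules as stated; the special cases (K0S/K0L, the n = 9 states, R-hadrons, etc.) need the extra rules in the text, and complete converters exist in the interfaces cited below.

def decode_pdg(code):
    """Split a 7-digit Monte Carlo particle code into (sign, n, nr, nL, nq1, nq2, nq3, nJ).

    Sketch of the scheme above; special cases (K0S/K0L, nuclei, n = 9 states, ...)
    require the additional rules in the text."""
    sign = 1 if code > 0 else -1
    d = abs(code)
    nJ, d = d % 10, d // 10
    nq3, d = d % 10, d // 10
    nq2, d = d % 10, d // 10
    nq1, d = d % 10, d // 10
    nL, d = d % 10, d // 10
    nr, n = d % 10, d // 10
    return sign, n, nr, nL, nq1, nq2, nq3, nJ

def decode_nucleus(code):
    """Nuclear codes +-10LZZZAAAI: return (L, Z, A, I)."""
    d = abs(code) % 10**8          # strip the leading '10'
    L, d = d // 10**7, d % 10**7
    Z, d = d // 10**4, d % 10**4
    A, I = d // 10, d % 10
    return L, Z, A, I

print(decode_pdg(211))             # pi+: (1, 0, 0, 0, 0, 2, 1, 1), i.e. u d-bar, 2J+1 = 1
print(decode_nucleus(1000010020))  # deuteron: (0, 1, 2, 0)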

This text and full lists of particle numbers, including excited baryons and particles from physics beyond the Standard Model, can be found online [5]. The StdHep Monte Carlo standardization project [6] maintains the list of PDG particle numbers, as well as numbering schemes from most event generators and software to convert between the different schemes.

References:

1. G.P. Yost et al., Particle Data Group, Phys. Lett. B204, 1 (1988).
2. I.G. Knowles et al., in "Physics at LEP2", CERN 96-01, v. 2, p. 103.
3. C. Caso et al., Particle Data Group, Eur. Phys. J. C3, 1 (1998).
4. G. Audi et al., Nucl. Phys. A729, 3 (2003). See also http://www.nndc.bnl.gov/amdc/web/nubase_en.html.
5. http://pdg.lbl.gov/current/mc-particle-id/.
6. L. Garren, StdHep, Monte Carlo Standardization at FNAL, Fermilab PM0091 and StdHep WWW site: http://cepa.fnal.gov/psm/stdhep/.


QUARKS
d 1, u 2, s 3, c 4, b 5, t 6, b′ 7, t′ 8

LEPTONS
e− 11, νe 12, µ− 13, νµ 14, τ− 15, ντ 16, τ′− 17, ντ′ 18

GAUGE AND HIGGS BOSONS
g (9) 21, γ 22, Z0 23, W+ 24, h0/H01 25,
Z′/Z02 32, Z′′/Z03 33, W′/W+2 34, H0/H02 35, A0/H03 36, H+ 37

SPECIAL PARTICLES
G (graviton) 39, R0 41, LQc 42, reggeon 110, pomeron 990, odderon 9990,
for MC internal use 81–100

DIQUARKS
(dd)1 1103, (ud)0 2101, (ud)1 2103, (uu)1 2203,
(sd)0 3101, (sd)1 3103, (su)0 3201, (su)1 3203, (ss)1 3303,
(cd)0 4101, (cd)1 4103, (cu)0 4201, (cu)1 4203, (cs)0 4301, (cs)1 4303, (cc)1 4403,
(bd)0 5101, (bd)1 5103, (bu)0 5201, (bu)1 5203, (bs)0 5301, (bs)1 5303,
(bc)0 5401, (bc)1 5403, (bb)1 5503

SUSY PARTICLES
d̃L 1000001, ũL 1000002, s̃L 1000003, c̃L 1000004, b̃1 1000005a, t̃1 1000006a,
ẽ−L 1000011, ν̃eL 1000012, µ̃−L 1000013, ν̃µL 1000014, τ̃−1 1000015a, ν̃τL 1000016,
d̃R 2000001, ũR 2000002, s̃R 2000003, c̃R 2000004, b̃2 2000005a, t̃2 2000006a,
ẽ−R 2000011, µ̃−R 2000013, τ̃−2 2000015a,
g̃ 1000021, χ̃01 1000022b, χ̃02 1000023b, χ̃+1 1000024b, χ̃03 1000025b,
χ̃04 1000035b, χ̃+2 1000037b, G̃ 1000039

LIGHT I = 1 MESONS
π0 111, π+ 211, a0(980)0 9000111, a0(980)+ 9000211,
π(1300)0 100111, π(1300)+ 100211, a0(1450)0 10111, a0(1450)+ 10211,
π(1800)0 9010111, π(1800)+ 9010211, ρ(770)0 113, ρ(770)+ 213,
b1(1235)0 10113, b1(1235)+ 10213, a1(1260)0 20113, a1(1260)+ 20213,
π1(1400)0 9000113, π1(1400)+ 9000213, ρ(1450)0 100113, ρ(1450)+ 100213,
π1(1600)0 9010113, π1(1600)+ 9010213, a1(1640)0 9020113, a1(1640)+ 9020213,
ρ(1700)0 30113, ρ(1700)+ 30213, ρ(1900)0 9030113, ρ(1900)+ 9030213,
ρ(2150)0 9040113, ρ(2150)+ 9040213, a2(1320)0 115, a2(1320)+ 215,
π2(1670)0 10115, π2(1670)+ 10215, a2(1700)0 9000115, a2(1700)+ 9000215,
π2(2100)0 9010115, π2(2100)+ 9010215, ρ3(1690)0 117, ρ3(1690)+ 217,
ρ3(1990)0 9000117, ρ3(1990)+ 9000217, ρ3(2250)0 9010117, ρ3(2250)+ 9010217,
a4(2040)0 119, a4(2040)+ 219

LIGHT I = 0 MESONS (uū, dd̄, and ss̄ admixtures)
η 221, η′(958) 331, f0(600) 9000221, f0(980) 9010221, η(1295) 100221,
f0(1370) 10221, η(1405) 9020221, η(1475) 100331, f0(1500) 9030221, f0(1710) 10331,
η(1760) 9040221, f0(2020) 9050221, f0(2100) 9060221, f0(2200) 9070221, η(2225) 9080221,
ω(782) 223, φ(1020) 333, h1(1170) 10223, f1(1285) 20223, h1(1380) 10333,
f1(1420) 20333, ω(1420) 100223, f1(1510) 9000223, h1(1595) 9010223, ω(1650) 30223,
φ(1680) 100333, f2(1270) 225, f2(1430) 9000225, f′2(1525) 335, f2(1565) 9010225,
f2(1640) 9020225, η2(1645) 10225, f2(1810) 9030225, η2(1870) 10335, f2(1910) 9040225,
f2(1950) 9050225, f2(2010) 9060225, f2(2150) 9070225, f2(2300) 9080225, f2(2340) 9090225,
ω3(1670) 227, φ3(1850) 337, f4(2050) 229, fJ(2220) 9000229, f4(2300) 9010229


STRANGE MESONS
K0L 130, K0S 310, K0 311, K+ 321,
K∗0(800)0 9000311, K∗0(800)+ 9000321, K∗0(1430)0 10311, K∗0(1430)+ 10321,
K(1460)0 100311, K(1460)+ 100321, K(1830)0 9010311, K(1830)+ 9010321,
K∗0(1950)0 9020311, K∗0(1950)+ 9020321, K∗(892)0 313, K∗(892)+ 323,
K1(1270)0 10313, K1(1270)+ 10323, K1(1400)0 20313, K1(1400)+ 20323,
K∗(1410)0 100313, K∗(1410)+ 100323, K1(1650)0 9000313, K1(1650)+ 9000323,
K∗(1680)0 30313, K∗(1680)+ 30323, K∗2(1430)0 315, K∗2(1430)+ 325,
K2(1580)0 9000315, K2(1580)+ 9000325, K2(1770)0 10315, K2(1770)+ 10325,
K2(1820)0 20315, K2(1820)+ 20325, K∗2(1980)0 9010315, K∗2(1980)+ 9010325,
K2(2250)0 9020315, K2(2250)+ 9020325, K∗3(1780)0 317, K∗3(1780)+ 327,
K3(2320)0 9010317, K3(2320)+ 9010327, K∗4(2045)0 319, K∗4(2045)+ 329,
K4(2500)0 9000319, K4(2500)+ 9000329

CHARMED MESONS
D+ 411, D0 421, D∗0(2400)+ 10411, D∗0(2400)0 10421,
D∗(2010)+ 413, D∗(2007)0 423, D1(2420)+ 10413, D1(2420)0 10423,
D1(H)+ 20413, D1(2430)0 20423, D∗2(2460)+ 415, D∗2(2460)0 425,
D+s 431, D∗s0(2317)+ 10431, D∗+s 433, Ds1(2536)+ 10433, Ds1(2460)+ 20433, D∗s2(2573)+ 435

BOTTOM MESONS
B0 511, B+ 521, B∗00 10511, B∗+0 10521, B∗0 513, B∗+ 523,
B1(L)0 10513, B1(L)+ 10523, B1(H)0 20513, B1(H)+ 20523, B∗02 515, B∗+2 525,
B0s 531, B∗0s0 10531, B∗0s 533, Bs1(L)0 10533, Bs1(H)0 20533, B∗0s2 535,
B+c 541, B∗+c0 10541, B∗+c 543, Bc1(L)+ 10543, Bc1(H)+ 20543, B∗+c2 545

cc̄ MESONS
ηc(1S) 441, χc0(1P) 10441, ηc(2S) 100441, J/ψ(1S) 443, hc(1P) 10443,
χc1(1P) 20443, ψ(2S) 100443, ψ(3770) 30443, ψ(4040) 9000443, ψ(4160) 9010443,
ψ(4415) 9020443, χc2(1P) 445, χc2(2P) 100445

bb̄ MESONS
ηb(1S) 551, χb0(1P) 10551, ηb(2S) 100551, χb0(2P) 110551, ηb(3S) 200551, χb0(3P) 210551,
Υ(1S) 553, hb(1P) 10553, χb1(1P) 20553, Υ1(1D) 30553, Υ(2S) 100553, hb(2P) 110553,
χb1(2P) 120553, Υ1(2D) 130553, Υ(3S) 200553, hb(3P) 210553, χb1(3P) 220553, Υ(4S) 300553,
Υ(10860) 9000553, Υ(11020) 9010553, χb2(1P) 555, ηb2(1D) 10555, Υ2(1D) 20555,
χb2(2P) 100555, ηb2(2D) 110555, Υ2(2D) 120555, χb2(3P) 200555, Υ3(1D) 557, Υ3(2D) 100557

LIGHT BARYONS
p 2212, n 2112, ∆++ 2224, ∆+ 2214, ∆0 2114, ∆− 1114

STRANGE BARYONS
Λ 3122, Σ+ 3222, Σ0 3212, Σ− 3112, Σ∗+ 3224c, Σ∗0 3214c, Σ∗− 3114c,
Ξ0 3322, Ξ− 3312, Ξ∗0 3324c, Ξ∗− 3314c, Ω− 3334

CHARMED BARYONS
Λ+c 4122, Σ++c 4222, Σ+c 4212, Σ0c 4112, Σ∗++c 4224, Σ∗+c 4214, Σ∗0c 4114,
Ξ+c 4232, Ξ0c 4132, Ξ′+c 4322, Ξ′0c 4312, Ξ∗+c 4324, Ξ∗0c 4314,
Ω0c 4332, Ω∗0c 4334, Ξ+cc 4412, Ξ++cc 4422, Ξ∗+cc 4414, Ξ∗++cc 4424,
Ω+cc 4432, Ω∗+cc 4434, Ω++ccc 4444

BOTTOM BARYONS
Λ0b 5122, Σ−b 5112, Σ0b 5212, Σ+b 5222, Σ∗−b 5114, Σ∗0b 5214, Σ∗+b 5224,
Ξ−b 5132, Ξ0b 5232, Ξ′−b 5312, Ξ′0b 5322, Ξ∗−b 5314, Ξ∗0b 5324,
Ω−b 5332, Ω∗−b 5334, Ξ0bc 5142, Ξ+bc 5242, Ξ′0bc 5412, Ξ′+bc 5422,
Ξ∗0bc 5414, Ξ∗+bc 5424, Ω0bc 5342, Ω′0bc 5432, Ω∗0bc 5434,
Ω+bcc 5442, Ω∗+bcc 5444, Ξ−bb 5512, Ξ0bb 5522, Ξ∗−bb 5514, Ξ∗0bb 5524,
Ω−bb 5532, Ω∗−bb 5534, Ω0bbc 5542, Ω∗0bbc 5544, Ω−bbb 5554

Footnotes to the Tables:
a) Particularly in the third generation, the left and right sfermion states may mix, as shown. The lighter mixed state is given the smaller number.
b) The physical χ̃ states are admixtures of the pure γ̃, Z̃0, W̃+, H̃01, H̃02, and H̃+ states.
c) Σ∗ and Ξ∗ are alternate names for Σ(1385) and Ξ(1530).


43. CLEBSCH-GORDAN COEFFICIENTS, SPHERICAL HARMONICS, AND d FUNCTIONS

Note: A square-root sign is to be understood over every coefficient, e.g., for $-8/15$ read $-\sqrt{8/15}$.

Spherical harmonics:
$Y_1^0 = \sqrt{\frac{3}{4\pi}}\,\cos\theta$
$Y_1^1 = -\sqrt{\frac{3}{8\pi}}\,\sin\theta\,e^{i\phi}$
$Y_2^0 = \sqrt{\frac{5}{4\pi}}\left(\frac{3}{2}\cos^2\theta-\frac{1}{2}\right)$
$Y_2^1 = -\sqrt{\frac{15}{8\pi}}\,\sin\theta\cos\theta\,e^{i\phi}$
$Y_2^2 = \frac{1}{4}\sqrt{\frac{15}{2\pi}}\,\sin^2\theta\,e^{2i\phi}$
$Y_\ell^{-m} = (-1)^m\,(Y_\ell^m)^*$

Symmetry relations:
$\langle j_1 j_2 m_1 m_2 | j_1 j_2 J M\rangle = (-1)^{J-j_1-j_2}\,\langle j_2 j_1 m_2 m_1 | j_2 j_1 J M\rangle$
$d^{\,\ell}_{m,0} = \sqrt{\frac{4\pi}{2\ell+1}}\,Y_\ell^m\,e^{-im\phi}$
$d^{\,j}_{m',m} = (-1)^{m-m'}\,d^{\,j}_{m,m'} = d^{\,j}_{-m,-m'}$

d functions:
$d^{1/2}_{1/2,1/2} = \cos\frac{\theta}{2}$,  $d^{1/2}_{1/2,-1/2} = -\sin\frac{\theta}{2}$
$d^{1}_{1,1} = \frac{1+\cos\theta}{2}$,  $d^{1}_{1,0} = -\frac{\sin\theta}{\sqrt{2}}$,  $d^{1}_{1,-1} = \frac{1-\cos\theta}{2}$,  $d^{1}_{0,0} = \cos\theta$
$d^{3/2}_{3/2,3/2} = \frac{1+\cos\theta}{2}\cos\frac{\theta}{2}$,  $d^{3/2}_{3/2,1/2} = -\sqrt{3}\,\frac{1+\cos\theta}{2}\sin\frac{\theta}{2}$
$d^{3/2}_{3/2,-1/2} = \sqrt{3}\,\frac{1-\cos\theta}{2}\cos\frac{\theta}{2}$,  $d^{3/2}_{3/2,-3/2} = -\frac{1-\cos\theta}{2}\sin\frac{\theta}{2}$
$d^{3/2}_{1/2,1/2} = \frac{3\cos\theta-1}{2}\cos\frac{\theta}{2}$,  $d^{3/2}_{1/2,-1/2} = -\frac{3\cos\theta+1}{2}\sin\frac{\theta}{2}$
$d^{2}_{2,2} = \left(\frac{1+\cos\theta}{2}\right)^2$,  $d^{2}_{2,1} = -\frac{1+\cos\theta}{2}\,\sin\theta$,  $d^{2}_{2,0} = \frac{\sqrt{6}}{4}\,\sin^2\theta$
$d^{2}_{2,-1} = -\frac{1-\cos\theta}{2}\,\sin\theta$,  $d^{2}_{2,-2} = \left(\frac{1-\cos\theta}{2}\right)^2$
$d^{2}_{1,1} = \frac{1+\cos\theta}{2}\,(2\cos\theta-1)$,  $d^{2}_{1,0} = -\sqrt{\frac{3}{2}}\,\sin\theta\cos\theta$,  $d^{2}_{1,-1} = \frac{1-\cos\theta}{2}\,(2\cos\theta+1)$
$d^{2}_{0,0} = \frac{3}{2}\cos^2\theta-\frac{1}{2}$

Figure 43.1: The sign convention is that of Wigner (Group Theory, Academic Press, New York, 1959), also used by Condon and Shortley (The Theory of Atomic Spectra, Cambridge Univ. Press, New York, 1953), Rose (Elementary Theory of Angular Momentum, Wiley, New York, 1957), and Cohen (Tables of the Clebsch-Gordan Coefficients, North American Rockwell Science Center, Thousand Oaks, Calif., 1974).
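The tabulated d functions are easy to check numerically. The sketch below hard-codes the spin-1 matrix implied by the table and the symmetry relations, then verifies d^j_{m′,m} = (−1)^(m−m′) d^j_{m,m′}; it is a verification aid, not part of the original tables.

import math

def d1(mp, m, theta):
    """Spin-1 Wigner d-matrix elements, from the table above plus its symmetry relations."""
    c, s = math.cos(theta), math.sin(theta)
    table = {
        (1, 1): (1 + c) / 2, (1, 0): -s / math.sqrt(2), (1, -1): (1 - c) / 2,
        (0, 1): s / math.sqrt(2), (0, 0): c, (0, -1): -s / math.sqrt(2),
        (-1, 1): (1 - c) / 2, (-1, 0): s / math.sqrt(2), (-1, -1): (1 + c) / 2,
    }
    return table[(mp, m)]

theta = 0.7
for mp in (1, 0, -1):
    for m in (1, 0, -1):
        assert abs(d1(mp, m, theta) - (-1) ** (m - mp) * d1(m, mp, theta)) < 1e-12
print("d^1 symmetry relation verified")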


44. SU(3) ISOSCALAR FACTORS AND REPRESENTATION MATRICES

Written by R.L. Kelly (LBNL).

The most commonly used SU(3) isoscalar factors, corresponding to the singlet, octet, and decuplet content of 8 ⊗ 8 and 10 ⊗ 8, are shown below. The notation uses particle names to identify the coefficients, so that the pattern of relative couplings may be seen at a glance. We illustrate the use of the coefficients below. See J.J. de Swart, Rev. Mod. Phys. 35, 916 (1963) for detailed explanations and phase conventions.

A √ is to be understood over every integer in the matrices; the exponent 1/2 on each matrix is a reminder of this. For example, the Ξ → ΩK element of the 10 → 10 ⊗ 8 matrix is $-\sqrt{6}/\sqrt{24} = -1/2$.

Intramultiplet relative decay strengths may be read directly from the matrices. For example, in decuplet → octet + octet decays, the ratio of the Ω∗ → ΞK̄ and ∆ → Nπ partial widths is, from the 10 → 8 ⊗ 8 matrix,

$\frac{\Gamma(\Omega^* \to \Xi \bar K)}{\Gamma(\Delta \to N\pi)} = \frac{12}{6} \times \text{(phase space factors)}\,. \quad (44.1)$

Including isospin Clebsch-Gordan coefficients, we obtain, e.g.,

$\frac{\Gamma(\Omega^{*-} \to \Xi^0 K^-)}{\Gamma(\Delta^+ \to p\,\pi^0)} = \frac{1/2}{2/3} \times \frac{12}{6} \times \text{p.s.f.} = \frac{3}{2} \times \text{p.s.f.} \quad (44.2)$

(A restatement of this arithmetic in code is sketched below.)

Partial widths for 8 → 8 ⊗ 8 involve a linear superposition of 8₁ (symmetric) and 8₂ (antisymmetric) couplings. For example,

$\Gamma(\Xi^* \to \Xi\pi) \sim \left(\sqrt{\tfrac{9}{20}}\,g_1 + \sqrt{\tfrac{3}{12}}\,g_2\right)^2. \quad (44.3)$

The relations between g₁ and g₂ (with de Swart's normalization) and the standard D and F couplings that appear in the interaction Lagrangian,

$\mathcal{L} = -\sqrt{2}\,D\,\mathrm{Tr}\!\left(\{\bar B, B\}\,M\right) + \sqrt{2}\,F\,\mathrm{Tr}\!\left([\bar B, B]\,M\right), \quad (44.4)$

where $[\bar B, B] \equiv \bar B B - B \bar B$ and $\{\bar B, B\} \equiv \bar B B + B \bar B$, are

$D = \frac{\sqrt{30}}{40}\,g_1\,, \qquad F = \frac{\sqrt{6}}{24}\,g_2\,. \quad (44.5)$

Thus, for example,

$\Gamma(\Xi^* \to \Xi\pi) \sim (F - D)^2 \sim (1 - 2\alpha)^2\,, \quad (44.6)$

where $\alpha \equiv F/(D + F)$. (This definition of α is de Swart's. The alternative D/(D + F), due to Gell-Mann, is also used.)

The generators of SU(3) transformations, λa (a = 1, . . . , 8), are 3 × 3 matrices that obey the following commutation and anticommutation relationships:

$[\lambda_a, \lambda_b] \equiv \lambda_a\lambda_b - \lambda_b\lambda_a = 2i f_{abc}\,\lambda_c\,, \quad (44.7)$

$\{\lambda_a, \lambda_b\} \equiv \lambda_a\lambda_b + \lambda_b\lambda_a = \tfrac{4}{3}\,\delta_{ab}\,I + 2 d_{abc}\,\lambda_c\,, \quad (44.8)$

where I is the 3 × 3 identity matrix, and δab is the Kronecker delta symbol. The fabc are odd under the permutation of any pair of indices, while the dabc are even. The nonzero values are listed in the table of fabc and dabc below, following the isoscalar factor matrices.

The SU(3) isoscalar factor matrices are (a √ over every integer, as noted above):

1 → 8 ⊗ 8:
(Λ₁) → (N K̄  Σπ  Λη  ΞK) = (1/√8) ( 2  3  −1  −2 )^{1/2}

8₁ → 8 ⊗ 8:
N → (Nπ   Nη   ΣK   ΛK)        = (1/√20) (  9   −1   −9   −1 )^{1/2}
Σ → (N K̄  Σπ   Λπ   Ση   ΞK)   = (1/√20) ( −6    0    4    4   −6 )^{1/2}
Λ → (N K̄  Σπ   Λη   ΞK)        = (1/√20) (  2  −12   −4   −2 )^{1/2}
Ξ → (Σ K̄  Λ K̄  Ξπ   Ξη)        = (1/√20) (  9   −1   −9   −1 )^{1/2}

8₂ → 8 ⊗ 8:
N → (Nπ   Nη   ΣK   ΛK)        = (1/√12) ( 3  3  3  −3 )^{1/2}
Σ → (N K̄  Σπ   Λπ   Ση   ΞK)   = (1/√12) ( 2  8  0   0  −2 )^{1/2}
Λ → (N K̄  Σπ   Λη   ΞK)        = (1/√12) ( 6  0  0   6 )^{1/2}
Ξ → (Σ K̄  Λ K̄  Ξπ   Ξη)        = (1/√12) ( 3  3  3  −3 )^{1/2}

10 → 8 ⊗ 8:
∆ → (Nπ   ΣK)                  = (1/√12) ( −6   6 )^{1/2}
Σ → (N K̄  Σπ   Λπ   Ση   ΞK)   = (1/√12) ( −2   2  −3   3   2 )^{1/2}
Ξ → (Σ K̄  Λ K̄  Ξπ   Ξη)        = (1/√12) (  3  −3   3   3 )^{1/2}
Ω → (Ξ K̄)                      = (1/√12) ( 12 )^{1/2}

8 → 10 ⊗ 8:
N → (∆π   ΣK)                  = (1/√15) ( −12   3 )^{1/2}
Σ → (∆ K̄  Σπ   Ση   ΞK)        = (1/√15) (   8  −2  −3   2 )^{1/2}
Λ → (Σπ   ΞK)                  = (1/√15) (  −9   6 )^{1/2}
Ξ → (Σ K̄  Ξπ   Ξη   ΩK)        = (1/√15) (   3  −3  −3   6 )^{1/2}

10 → 10 ⊗ 8:
∆ → (∆π   ∆η   ΣK)             = (1/√24) ( 15   3  −6 )^{1/2}
Σ → (∆ K̄  Σπ   Ση   ΞK)        = (1/√24) (  8   8   0  −8 )^{1/2}
Ξ → (Σ K̄  Ξπ   Ξη   ΩK)        = (1/√24) ( 12   3  −3  −6 )^{1/2}
Ω → (Ξ K̄  Ωη)                  = (1/√24) ( 12  −12 )^{1/2}

abc   fabc        abc   dabc        abc   dabc
123   1           118   1/√3        355   1/2
147   1/2         146   1/2         366   −1/2
156   −1/2        157   1/2         377   −1/2
246   1/2         228   1/√3        448   −1/(2√3)
257   1/2         247   −1/2        558   −1/(2√3)
345   1/2         256   1/2         668   −1/(2√3)
367   −1/2        338   1/√3        778   −1/(2√3)
458   √3/2        344   1/2         888   −1/√3
678   √3/2

The λa's are

$\lambda_1 = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$  $\lambda_2 = \begin{pmatrix} 0 & -i & 0 \\ i & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$  $\lambda_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 0 \end{pmatrix}$

$\lambda_4 = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}$  $\lambda_5 = \begin{pmatrix} 0 & 0 & -i \\ 0 & 0 & 0 \\ i & 0 & 0 \end{pmatrix}$  $\lambda_6 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$

$\lambda_7 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & -i \\ 0 & i & 0 \end{pmatrix}$  $\lambda_8 = \frac{1}{\sqrt{3}}\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -2 \end{pmatrix}$

Equation (44.7) defines the Lie algebra of SU(3). A general d-dimensional representation is given by a set of d × d matrices satisfying Eq. (44.7) with the fabc given above. Equation (44.8) is specific to the defining 3-dimensional representation. A numerical check of Eq. (44.7) is sketched below.
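A small numerical verification of Eq. (44.7) for one index triple, using the matrices above (numpy assumed available):

import numpy as np

l1 = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=complex)
l2 = np.array([[0, -1j, 0], [1j, 0, 0], [0, 0, 0]], dtype=complex)
l3 = np.array([[1, 0, 0], [0, -1, 0], [0, 0, 0]], dtype=complex)

comm = l1 @ l2 - l2 @ l1
assert np.allclose(comm, 2j * l3)   # f_123 = 1, so [lambda1, lambda2] = 2i lambda3
print("Eq. (44.7) verified for (a, b, c) = (1, 2, 3)")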


45. SU(n) MULTIPLETS AND YOUNG DIAGRAMS

Written by C.G. Wohl (LBNL).

This note tells (1) how SU(n) particle multiplets are identified or labeled, (2) how to find the number of particles in a multiplet from its label, (3) how to draw the Young diagram for a multiplet, and (4) how to use Young diagrams to determine the overall multiplet structure of a composite system, such as a 3-quark or a meson-baryon system.

In much of the literature, the word "representation" is used where we use "multiplet," and "tableau" is used where we use "diagram."

45.1. Multiplet labels

An SU(n) multiplet is uniquely identified by a string of (n−1) nonnegative integers: (α, β, γ, . . .). Any such set of integers specifies a multiplet. For an SU(2) multiplet such as an isospin multiplet, the single integer α is the number of steps from one end of the multiplet to the other (i.e., it is one fewer than the number of particles in the multiplet). In SU(3), the two integers α and β are the numbers of steps across the top and bottom levels of the multiplet diagram. Thus the labels for the SU(3) octet and decuplet are (1,1) and (3,0), respectively. For larger n, the interpretation of the integers in terms of the geometry of the multiplets, which exist in an (n−1)-dimensional space, is not so readily apparent.

The label for the SU(n) singlet is (0, 0, . . . , 0). In a flavor SU(n), the n quarks together form a (1, 0, . . . , 0) multiplet, and the n antiquarks belong to a (0, . . . , 0, 1) multiplet. These two multiplets are conjugate to one another, which means their labels are related by (α, β, . . .) ↔ (. . . , β, α).

45.2. Number of particles

The number of particles in a multiplet, N = N(α, β, . . .), is given as follows (note the pattern of the equations; a code sketch implementing the general pattern follows this subsection).

In SU(2), N = N(α) is

$N = \frac{(\alpha+1)}{1}\,. \quad (45.1)$

In SU(3), N = N(α, β) is

$N = \frac{(\alpha+1)}{1}\cdot\frac{(\beta+1)}{1}\cdot\frac{(\alpha+\beta+2)}{2}\,. \quad (45.2)$

In SU(4), N = N(α, β, γ) is

$N = \frac{(\alpha+1)}{1}\cdot\frac{(\beta+1)}{1}\cdot\frac{(\gamma+1)}{1}\cdot\frac{(\alpha+\beta+2)}{2}\cdot\frac{(\beta+\gamma+2)}{2}\cdot\frac{(\alpha+\beta+\gamma+3)}{3}\,. \quad (45.3)$

Note that in Eq. (45.3) there is no factor with (α + γ + 2): only a consecutive sequence of the label integers appears in any factor. One more example should make the pattern clear for any SU(n). In SU(5), N = N(α, β, γ, δ) is

$N = \frac{(\alpha+1)}{1}\cdot\frac{(\beta+1)}{1}\cdot\frac{(\gamma+1)}{1}\cdot\frac{(\delta+1)}{1}\cdot\frac{(\alpha+\beta+2)}{2}\cdot\frac{(\beta+\gamma+2)}{2}\cdot\frac{(\gamma+\delta+2)}{2}\cdot\frac{(\alpha+\beta+\gamma+3)}{3}\cdot\frac{(\beta+\gamma+\delta+3)}{3}\cdot\frac{(\alpha+\beta+\gamma+\delta+4)}{4}\,. \quad (45.4)$

From the symmetry of these equations, it is clear that multiplets that are conjugate to one another have the same number of particles, but so can other multiplets. For example, the SU(4) multiplets (3,0,0) and (1,1,0) each have 20 particles. Try the equations and see.
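The pattern of Eqs. (45.1)–(45.4), each factor built from a consecutive run of the labels, can be implemented once for any SU(n). A minimal sketch:

from fractions import Fraction

def multiplet_size(labels):
    """N(alpha, beta, ...) for an SU(n) multiplet, where labels has n-1 entries.

    Each factor is (sum of a consecutive run of labels + run length) / run length,
    exactly the pattern of Eqs. (45.1)-(45.4)."""
    k = len(labels)
    N = Fraction(1)
    for i in range(k):
        for j in range(i, k):
            length = j - i + 1
            N *= Fraction(sum(labels[i:j + 1]) + length, length)
    return int(N)

print(multiplet_size((1, 1)))     # SU(3) octet: 8
print(multiplet_size((3, 0)))     # SU(3) decuplet: 10
print(multiplet_size((3, 0, 0)))  # SU(4): 20
print(multiplet_size((1, 1, 0)))  # SU(4): 20, same as (3,0,0)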

45.3. Young diagrams

A Young diagram consists of an array of boxes (or some other symbol) arranged in one or more left-justified rows, with each row being at least as long as the row beneath. The correspondence between a diagram and a multiplet label is: the top row juts out α boxes to the right past the end of the second row, the second row juts out β boxes to the right past the end of the third row, etc. A diagram in SU(n) has at most n rows. There can be any number of "completed" columns of n boxes buttressing the left of a diagram; these don't affect the label. Thus in SU(3) the diagrams [a single box; a column of two boxes; a column of three boxes; two boxes over one; a single row of three boxes] represent the multiplets (1,0), (0,1), (0,0), (1,1), and (3,0). In any SU(n), the quark multiplet is represented by a single box, the antiquark multiplet by a column of (n−1) boxes, and a singlet by a completed column of n boxes.

45.4. Coupling multiplets together

The following recipe tells how to find the multiplets that occur in coupling two multiplets together. To couple together more than two multiplets, first couple two, then couple a third with each of the multiplets obtained from the first two, etc.

First a definition: A sequence of the letters a, b, c, . . . is admissible if at any point in the sequence at least as many a's have occurred as b's, at least as many b's have occurred as c's, etc. Thus abcd and aabcb are admissible sequences and abb and acb are not. Now the recipe:

(a) Draw the Young diagrams for the two multiplets, but in one of the diagrams replace the boxes in the first row with a's, the boxes in the second row with b's, etc. Thus, to couple two SU(3) octets (such as the π-meson octet and the baryon octet), we start with the plain octet diagram (two boxes over one) and the lettered octet diagram (a a over b). The unlettered diagram forms the upper left-hand corner of all the enlarged diagrams constructed below.

(b) Add the a's from the lettered diagram to the right-hand ends of the rows of the unlettered diagram to form all possible legitimate Young diagrams that have no more than one a per column. In general, there will be several distinct diagrams, and all the a's appear in each diagram. At this stage, for the coupling of the two SU(3) octets, there are four diagrams [not reproduced here].

(c) Use the b's to further enlarge the diagrams already obtained, subject to the same rules. Then throw away any diagram in which the full sequence of letters formed by reading right to left in the first row, then the second row, etc., is not admissible.

(d) Proceed as in (c) with the c’s (if any), etc.

The final result of the coupling of the two SU(3) octets is a set of six diagrams [not reproduced here; they correspond, in order, to the labels in the equation below]. Only the diagrams with admissible sequences of a's and b's and with fewer than four rows (since n = 3) have been kept. In terms of multiplet labels, the result may be written

(1, 1) ⊗ (1, 1) = (2, 2) ⊕ (3, 0) ⊕ (0, 3) ⊕ (1, 1) ⊕ (1, 1) ⊕ (0, 0) .

In terms of numbers of particles, it may be written

$8 \otimes 8 = 27 \oplus 10 \oplus \overline{10} \oplus 8 \oplus 8 \oplus 1\,.$

The product of the numbers on the left here is equal to the sum on the right, a useful check; a code version of the check is sketched below. (See also Sec. 15 on the Quark Model.)
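The dimension check in code, reusing multiplet_size() from the sketch in Sec. 45.2 above:

# Dimensions on the right of (1,1) (x) (1,1) must sum to 8 * 8 = 64.
rhs = [(2, 2), (3, 0), (0, 3), (1, 1), (1, 1), (0, 0)]
dims = [multiplet_size(l) for l in rhs]
print(dims)                                      # -> [27, 10, 10, 8, 8, 1]
assert sum(dims) == multiplet_size((1, 1)) ** 2  # 64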