A survey of the Dirichlet process and its role in nonparametrics

Jayaram Sethuraman
Department of Statistics
Florida State University
Tallahassee, FL 32306

sethu@stat.fsu.edu

May 9, 2013

Summary

• Elementary frequentist and Bayesian methods

• Bayesian nonparametrics

• Nonparametric priors in a natural way

• Nonparametric priors and exchangeable random variables; Polya urn sequences

• Nonparametric priors through other approaches

• Sethuraman construction of Dirichlet priors

• Some properties of Dirichlet priors

• Bayes hierarchical models

Frequentist methods

Data X_1, …, X_n are i.i.d. F_θ, where {F_θ : θ ∈ Θ} is a family of distributions.

For instance, the data X_1, …, X_n can be i.i.d. N(θ, σ^2), with θ ∈ Θ = (−∞, ∞) and known σ^2.

Then the frequentist estimate of θ is X̄_n = (1/n) ∑_{i=1}^n X_i.

Bayesian methods

However, a Bayesian would consider θ to be random (by seeking experts' opinions, etc.) and would put a distribution on Θ, say a N(µ, τ^2) distribution, and call it the prior distribution.

Then the posterior distribution of θ, the conditional distribution of θ given the data X_1, …, X_n, is

N( (n X̄_n/σ^2 + µ/τ^2) / (n/σ^2 + 1/τ^2), 1/(n/σ^2 + 1/τ^2) ).

The Bayes estimate of θ is the expectation of θ under its posterior distribution and is

(n X̄_n/σ^2 + µ/τ^2) / (n/σ^2 + 1/τ^2).
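For concreteness, here is a minimal numerical sketch of this conjugate normal-normal update; the function name and test values are illustrative, not from the talk.

```python
import numpy as np

def normal_posterior(x, sigma2, mu, tau2):
    """Posterior of theta for i.i.d. N(theta, sigma2) data with a
    N(mu, tau2) prior; returns (posterior mean, posterior variance)."""
    n = len(x)
    precision = n / sigma2 + 1.0 / tau2               # posterior precision
    mean = (n * np.mean(x) / sigma2 + mu / tau2) / precision
    return mean, 1.0 / precision

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)           # true theta = 2, sigma2 = 1
print(normal_posterior(x, sigma2=1.0, mu=0.0, tau2=10.0))
```

The returned mean is the Bayes estimate displayed above; as n grows it is pulled toward X̄_n.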

Beginning nonparametrics

Suppose that X_1, …, X_n are i.i.d. and take on only a finite number of values, say in X = {1, …, k}. The most general probability distribution for X_1 is p = (p_1, …, p_k), taking values in the unit simplex in R^k, i.e. p_j ≥ 0, j = 1, …, k, and ∑ p_j = 1.

Let N = (N_1, …, N_k), where N_j = #{X_i = j, i = 1, …, n} is the observed frequency of j.

The frequentist estimate of p_j is N_j/n and the estimate of p is (N_1, …, N_k)/n.

Bayes analysis

The class of all priors for p is just the class of distributions of the k − 1 dimensional vector (p_1, …, p_{k−1}).

A special distribution is the finite dimensional Dirichlet distribution D(a) with parameter a = (a_1, …, a_k), with a density proportional to

∏_{j=1}^k p_j^{a_j − 1},  a_j ≥ 0, j = 1, …, k,  A = ∑ a_j > 0.

Should really use ratios of independent Gammas to their sum to take care of cases when a_j = 0.
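The Gamma-ratio remark can be made concrete. A minimal sketch, assuming the convention that a Gamma(0) variable is identically 0, so coordinates with a_j = 0 receive weight 0 (which the density formulation cannot handle):

```python
import numpy as np

def dirichlet_via_gammas(a, rng):
    """Draw p ~ D(a) as ratios of independent Gamma(a_j) variables to
    their sum; coordinates with a_j = 0 are set to 0 by convention."""
    a = np.asarray(a, dtype=float)
    g = np.zeros_like(a)
    pos = a > 0
    g[pos] = rng.gamma(shape=a[pos])                  # scale-1 Gammas
    return g / g.sum()

rng = np.random.default_rng(0)
print(dirichlet_via_gammas([2.0, 0.0, 1.0], rng))     # middle coordinate is 0
```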

Bayes analysis

Then the posterior distribution of p given the data X_1, …, X_n has a density proportional to

∏_{j=1}^k p_j^{a_j + N_j − 1},

which is the finite dimensional Dirichlet distribution with parameter a + N.

Thus the Bayes estimate of p is (a + N)/(A + n).
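A sketch of this conjugate update in code, using 0-based labels {0, …, k−1} instead of {1, …, k} for convenience; the prior values are illustrative:

```python
import numpy as np

def dirichlet_posterior_mean(a, x, k):
    """Bayes estimate (a + N)/(A + n) of p under a D(a) prior, from
    data x taking values in {0, ..., k-1}."""
    a = np.asarray(a, dtype=float)
    counts = np.bincount(x, minlength=k)              # N = (N_1, ..., N_k)
    return (a + counts) / (a.sum() + len(x))

x = np.array([0, 2, 2, 1, 0, 2])                      # n = 6 observations
print(dirichlet_posterior_mean([1.0, 1.0, 1.0], x, k=3))
```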

The standard nonparametric problem

Let X = (X_1, …, X_n) be i.i.d. random variables on R^1 with a common distribution function (df) F (or with a common probability measure (pm) P).

Let F_n(x) = (1/n) ∑_{i=1}^n I(X_i ≤ x) and P_n(A) = (1/n) ∑_{i=1}^n I(X_i ∈ A) be the empirical distribution function (edf) and the empirical probability measure (epm) of X.

Then F_n (P_n) is the frequentist estimate of F (P).

In fact, F_n(x) → F(x) and P_n(A) → P(A)

for each x ∈ R^1 and each A in the Borel sigma field B in R^1.
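A two-line sketch of the edf; searchsorted with side="right" counts the X_i ≤ t:

```python
import numpy as np

def ecdf(x):
    """Return the edf F_n(t) = (1/n) #{i : X_i <= t} as a function."""
    xs = np.sort(np.asarray(x))
    return lambda t: np.searchsorted(xs, t, side="right") / len(xs)

rng = np.random.default_rng(0)
Fn = ecdf(rng.normal(size=1000))
print(Fn(0.0))     # close to the true value F(0) = 0.5
```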

Nonparametric priors in a natural way

The parameter F or P in this nonparametric problem is different from the finite parameter case.

The first and most natural attempt to introduce a distribution for P was to mimic the case of random variables taking a finite number of values.

Consider a finite partition A = (A_1, …, A_k) of R^1. Ferguson (1973) defined the distribution of (P(A_1), …, P(A_k)) to be a finite dimensional Dirichlet distribution D(αβ(A_1), …, αβ(A_k)), where α > 0 and β(·) is a pm on (R^1, B).
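A sketch of these finite dimensional distributions, with an illustrative choice of β (uniform on [0, 1]) and a three-set partition:

```python
import numpy as np

# Base measure beta = Uniform(0, 1); partition [0, .2], (.2, .7], (.7, 1].
alpha = 5.0
beta_mass = np.array([0.2, 0.5, 0.3])                 # (beta(A_1), beta(A_2), beta(A_3))

rng = np.random.default_rng(0)
draws = rng.dirichlet(alpha * beta_mass, size=4)      # draws of (P(A_1), P(A_2), P(A_3))
print(draws)                                          # each row sums to 1
```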

Nonparametric priors in a natural way

By a distribution for P we mean a distribution on P, the space of all probability measures on R^1. More precisely, we will also have to specify a σ-field S in P. We will take the smallest σ-field containing sets of the form {P : P(A) ≤ r} for all A ∈ B and 0 ≤ r ≤ 1 as our S.

It can be shown that Ferguson's choice is a probability distribution on (P, S); it is called the Dirichlet distribution with parameter αβ(·) and denoted by D(αβ(·)).

This is also called the Dirichlet process, since it is the distribution of F(x), x ∈ R^1.

Nonparametric priors in a natural way

Current usage calls α the scale factor and β(·) the base measure of the Dirichlet distribution D(αβ(·)).

Ferguson showed that the posterior distribution of P given the data X is the Dirichlet distribution D(αβ(·) + nP_n(·)),

and thus the Bayes estimate of P is (αβ(·) + nP_n(·))/(α + n).

This is analogous to the results in the finite sample space case.
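The Bayes estimate of F is thus a convex mixture of the base df and the edf. A minimal sketch, using the standard normal as an illustrative base measure:

```python
import math
import numpy as np

def dp_posterior_mean_cdf(x, alpha, base_cdf):
    """Bayes estimate of F under D(alpha*beta): (alpha*beta + n*F_n)/(alpha + n)."""
    xs = np.sort(np.asarray(x))
    n = len(xs)
    def F_hat(t):
        Fn = np.searchsorted(xs, t, side="right") / n     # edf F_n(t)
        return (alpha * base_cdf(t) + n * Fn) / (alpha + n)
    return F_hat

std_normal_cdf = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
rng = np.random.default_rng(0)
F_hat = dp_posterior_mean_cdf(rng.normal(size=30), alpha=5.0, base_cdf=std_normal_cdf)
print(F_hat(0.0))
```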

Nonparametric priors in a natural way

Under D(αβ), (P(A_1), …, P(A_k)) has a finite dimensional Dirichlet distribution.

Two assertions were made in the last slide:

• D(αβ) is a probability measure on (P, S).

• The posterior distribution given X is D(αβ + nP_n).

Another property of the Dirichlet prior that was disconcerting at first is that

• D(αβ) gives probability 1 to the subset of discrete probability measures on (R^1, B).

The proofs needed some effort and were sometimes mystifying.

Nonparametric priors and exchangeable random variables

The class of all nonparametric priors is the same as the class of all exchangeable sequences of random variables!

This followed from De Finetti's theorem (1931). See also Hewitt and Savage (1955), Kingman (1978).

Let X_1, X_2, … be an infinite exchangeable sequence of random variables (one whose joint distribution is invariant under permutations of finitely many coordinates) with a joint distribution Q. Then

1. The empirical distribution functions F_n(x) → F(x) with probability 1 for all x. In fact, sup_x |F_n(x) − F(x)| → 0 with probability 1. (Note that F(x) is a random distribution function.)

Nonparametric priors and exchangeable random variables

2. The empirical probability measures P_n → P weakly. This P is a random probability measure.

3. Given P, X_1, X_2, … are i.i.d. P.

4. Thus the distribution of P under Q, denoted by ν_Q, is a nonparametric prior.

5. The class of all nonparametric priors arises in this fashion.

6. The distribution of X_2, X_3, …, given X_1 is also exchangeable; it will be denoted by Q_{X_1}.

7. The limit P of the empirical probability measures of X_1, X_2, … is also the limit of the empirical probability measures of X_2, X_3, …. Thus the distribution of P given X_1 (the posterior distribution) is the distribution of P under Q_{X_1} and, by mere notation, is ν_{Q_{X_1}}.

Dirichlet prior based on a Polya urn sequence

The Polya urn sequence is an example of an infinite exchangeable sequence of random variables.

Let β be a pm on R^1 and let α > 0. Define the joint distribution Pol(α, β) of X_1, X_2, … through

X_1 ∼ β,  X_n | (X_1, …, X_{n−1}) ∼ (αβ + ∑_{i=1}^{n−1} δ_{X_i}) / (α + n − 1),  n = 2, 3, …

This defines Pol(α, β) as an exchangeable probability measure.
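The predictive rule translates directly into a sampler: with probability α/(α + n − 1) draw a fresh value from β, otherwise repeat a uniformly chosen earlier value. A sketch, with an illustrative N(0, 1) base measure:

```python
import numpy as np

def polya_urn(n, alpha, base_sampler, rng):
    """Sample X_1, ..., X_n from Pol(alpha, beta): X_1 ~ beta, and X_k is
    a fresh draw from beta with prob. alpha/(alpha + k - 1), else a
    uniformly chosen past value."""
    x = [base_sampler(rng)]
    for k in range(2, n + 1):
        if rng.random() < alpha / (alpha + k - 1):
            x.append(base_sampler(rng))               # new draw from beta
        else:
            x.append(x[rng.integers(len(x))])         # repeat a past value
    return np.array(x)

rng = np.random.default_rng(0)
x = polya_urn(20, alpha=2.0, base_sampler=lambda r: r.normal(), rng=rng)
print(len(np.unique(x)), "distinct values among", len(x))
```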

Dirichlet prior based on a Polya urn sequence

The nonparametric prior ν_{Pol(α,β)} is the same as the Dirichlet prior D(αβ)!

• That is, the distribution of (P(A_1), …, P(A_k)) for any partition (A_1, …, A_k), under Pol(α, β), is the finite dimensional Dirichlet D(αβ(A_1), …, αβ(A_k)). In particular, P(A) ∼ B(αβ(A), αβ(A^c)).

• The conditional distribution of (X_2, X_3, …) given X_1 is Pol(α + 1, (αβ + δ_{X_1})/(α + 1)). Thus the posterior distribution given X_1 is D(αβ + δ_{X_1}).

• Each P_n is a discrete rpm and the limit P is also a discrete rpm. For this case of a Polya urn sequence, we can show that P({X_1, …, X_n}) → 1 with probability 1 and thus P is a discrete rpm.

Dirichlet prior based on a Polya urn sequence

The conditional distribution of P({X_1}) given X_1 is

B(1 + αβ({X_1}), αβ(R^1 \ {X_1})).

This is tricky. Is P({X_1}) measurable to begin with?

For the moment assume that β is nonatomic.

The above conditional distribution does not depend on X_1; thus X_1 and P({X_1}) are independent and

P({X_1}) ∼ B(1, α).

Dirichlet prior based on a Polya urn sequence

Let Y_1, Y_2, … be the distinct values among X_1, X_2, …, listed in the order of their appearance.

Then Y_1 = X_1,

Y_1 and P({Y_1}) are independent,

and Y_1 ∼ β, P({Y_1}) ∼ B(1, α).

Dirichlet prior based on a Polya urn sequence

Consider the sequence X_2, X_3, … with all occurrences of X_1 removed. This reduced sequence is a Polya urn sequence Pol(α, β) and is independent of Y_1.

As before, Y_2 and P({Y_2})/(1 − P({Y_1})) are independent,

Y_2 ∼ β,  P({Y_2})/(1 − P({Y_1})) ∼ B(1, α).

Thus P({Y_1}), P({Y_2})/(1 − P({Y_1})), P({Y_3})/(1 − P({Y_1}) − P({Y_2})), … are i.i.d. B(1, α),

and all these are independent of Y_1, Y_2, Y_3, …, which are i.i.d. β.

Dirichlet prior based on a Polya urn sequence

Since P is discrete and just sits on the set {X_1, X_2, …}, which is {Y_1, Y_2, …},

it is also equal to ∑_{i=1}^∞ P({Y_i}) δ_{Y_i}; in other words, we have the Sethuraman construction of the Dirichlet prior (if β is nonatomic).

Blackwell and MacQueen (1973) do not obtain this result, but show just that the rpm P is discrete for all Polya urn sequences.

Dirichlet prior based on a Polya urn sequence

The Polya urn sequence also gives the predictive distribution under a Dirichlet prior by its very definition.

Let the distribution of X_1, X_2, … given P be i.i.d. P, where P has the Dirichlet prior D(αβ). Then X_1, X_2, … is the Polya urn sequence Pol(α, β).

Hence, the distribution of X_n given X_1, …, X_{n−1} is (αβ + ∑_{i=1}^{n−1} δ_{X_i})/(α + n − 1).

Dirichlet prior based on a Polya urn sequence

The awkwardness of the rpm P being discrete has turned into a gold mine for finding clusters in data.

Let Y_1, …, Y_{M_n} be the distinct values among X_1, …, X_n and let their multiplicities be k_1, …, k_{M_n}. From the predictive distribution given above we can show that the probability of M_n = m and (k_1, …, k_m) is

α^m Γ(α) ∏_{j=1}^m Γ(k_j) / Γ(α + n).

From this we can obtain the marginal distribution of M_n, the number of distinct values (or clusters), and also the conditional distributions of the multiplicities.
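A direct transcription of this formula, computed on the log scale for numerical stability; the example configuration is illustrative:

```python
import math

def cluster_config_prob(alpha, ks):
    """Probability of m = len(ks) clusters with multiplicities ks, via
    alpha^m * Gamma(alpha) * prod_j Gamma(k_j) / Gamma(alpha + n)."""
    n, m = sum(ks), len(ks)
    log_p = (m * math.log(alpha) + math.lgamma(alpha)
             + sum(math.lgamma(k) for k in ks) - math.lgamma(alpha + n))
    return math.exp(log_p)

print(cluster_config_prob(2.0, [3, 1, 1]))   # n = 5 observations in 3 clusters
```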

Dirichlet prior based on a Polya urn sequence

A curious property:

The number of distinct values (clusters) M_n goes to ∞ slower than n; in fact, M_n / log(n) → α.

Also, from the LLN, (1/M_n) ∑_{j=1}^{M_n} I(Y_j ≤ x) → β(x) for all x with probability 1.

Note that (1/n) ∑_{j=1}^n I(X_j ≤ x) → F(x) with probability 1, where F is a rdf and has distribution D(αβ).

Note that all this required that β be nonatomic.

Nonparametric priors through completely random measures

Kingman (1967) introduced the concept of completely random measures.

X(A), A ∈ B, is a completely random measure if X(·) is a random measure and if X(A_1), …, X(A_k) are independent whenever A_1, …, A_k are disjoint.

If X(R^1) < ∞ with probability 1, then P(A) = X(A)/X(R^1) will be a random probability measure.

Kingman also characterized the class of all completely random measures (subject to a σ-finite condition) and also showed how to generate them from Poisson processes and transition functions.

The Dirichlet prior is a special case of this.

Nonparametric priors and independent increment processes

A df is just an increasing function on the real line. Consider [0, 1] instead.

The class of processes X(t), t ∈ [0, 1], with nonnegative independent increments is well known from the theory of infinitely divisible laws.

When some simple cases are excluded, such a process has only a countable number of jumps, which are independent of their locations.

Nonparametric priors and independent increment processes

A special case is when X(t) ∼ Gamma(αβ([0, t])) for t ∈ [0, 1].

F(t) = X(t)/X(1) is a random distribution function (since X(1) < ∞ with probability 1). Ferguson (second definition) shows that its distribution is the nonparametric prior D(αβ(·)).

The properties of the Gamma process show that the rpm is discrete and also has the representation

P([0, t]) = F(t) = ∑_{i=1}^∞ p*_i δ_{Y_i}([0, t])

where p*_1 > p*_2 > ⋯ are the jumps of the Gamma process in decreasing order. The jumps are independent of the locations, and thus (p*_1, p*_2, …) is independent of Y_1, Y_2, …, which are i.i.d. β.

Sethuraman construction of Dirichlet priors (Sethuraman, 1994)

Let α > 0 and let β(·) be a pm. We do not assume that β is nonatomic. Let V_1, V_2, … be i.i.d. B(1, α) and let Y_1, Y_2, … be independent of V_1, V_2, … and i.i.d. β(·).

Let p_1 = V_1, p_2 = (1 − V_1)V_2, p_3 = (1 − V_1)(1 − V_2)V_3, …. Note that (p_1, p_2, …) is a random discrete distribution with i.i.d. discrete failure rates V_1, V_2, …: "stick breaking." In other contexts, like species sampling, this is called the GEM(α) distribution.

Then the convex mixture

P(·) = ∑_{i=1}^∞ p_i δ_{Y_i}(·)  [ = p_1 δ_{Y_1}(·) + (1 − p_1) ∑_{i=2}^∞ (p_i/(1 − p_1)) δ_{Y_i}(·) ]

is a random discrete probability measure and its distribution is the Dirichlet prior D(αβ).
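A sketch of the construction, truncated once the unassigned stick length falls below a tolerance; the base measure N(0, 1) is illustrative:

```python
import numpy as np

def stick_breaking(alpha, base_sampler, rng, tol=1e-8):
    """Sethuraman construction: weights p_i = V_i * prod_{j<i}(1 - V_j)
    with V_i i.i.d. B(1, alpha), atoms Y_i i.i.d. beta."""
    weights, atoms, remaining = [], [], 1.0
    while remaining > tol:
        v = rng.beta(1.0, alpha)
        weights.append(remaining * v)                 # p_i
        atoms.append(base_sampler(rng))               # Y_i
        remaining *= 1.0 - v                          # stick left to break
    return np.array(weights), np.array(atoms)

rng = np.random.default_rng(0)
p, y = stick_breaking(2.0, lambda r: r.normal(), rng)
print(len(p), p.sum())                                # weights sum to within tol of 1
```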

Sethuraman construction of Dirichlet priors

The random variables Y_1, Y_2, … can take values in any measurable space. The proof is demonstrated by rewriting the constructive definition as

P = p_1 δ_{Y_1} + (1 − p_1) P*

where all the random variables are independent, p_1 ∼ B(1, α), Y_1 ∼ β, and the two rpm's P, P* have the same distribution.

For the distribution of P we have the distributional identity

P =_d p_1 δ_{Y_1} + (1 − p_1) P.

We first show that D(αβ) is a solution to this distributional equation and that the solution is unique.

Sethuraman construction of Dirichlet priors

To summarize,

P(·) = ∑_{i=1}^∞ p_i δ_{Y_i}(·)

is a concrete representation of a random probability measure and its distribution is D(αβ(·)).

It was not assumed that β is a nonatomic probability measure, and the Y_i's can be general random variables.

Sethuraman construction of Dirichlet priors

The Sethuraman construction is similar to

P(·) = ∑_{i=1}^∞ p*_i δ_{Y_i}(·)

obtained when using a Gamma process to define the Dirichlet prior.

It can be shown that these (p*_1, p*_2, …) are the same, in distribution, as (p_1, p_2, …) arranged in decreasing order.

Definition: A one-time size biased sampling converts (p*_1, p*_2, …) to (p**_1, p**_2, …) as follows. Let J be an integer valued random variable with Prob(J = n | p*_1, p*_2, …) = p*_n. If J = 1, define (p**_1, p**_2, …) to be the same as (p*_1, p*_2, …). If J > 1, put p**_1 = p*_J and let p**_2, p**_3, … be p*_1, p*_2, … with p*_J removed.

Sethuraman construction of Dirichlet priors

As a converse result, the distribution of (p_1, p_2, …) is the limiting distribution after repeated size biased sampling of (p*_1, p*_2, …). McCloskey (1965), Pitman (1996), etc.

The distribution of (p_1, p_2, …) does not change after a single size biased sampling.

This property can be used to establish that in the nonparametric problem with one observation X_1, the posterior distribution of P given X_1 is D(αβ + δ_{X_1}). This is simpler than the proof given in Sethuraman (1994).

Sethuraman construction of Dirichlet priors

Ferguson showed that the support of D(αβ) is the collection of probability measures in P whose support is contained in the support of β.

If the support of β is R^1, then the support of D(αβ) is P.

This assures us that even though D(αβ) gives probability 1 to the class of discrete pm's, it gives positive probability to every neighborhood of every pm.

Dirichlet priors are not discrete

The Dirichlet prior D(αβ) is not a discrete pm; it just sits on the set of all discrete pm's.

The rpm P with distribution D(αβ) is a random discrete probability measure.

Absolute continuity

Consider D(α_i β_i), i = 1, 2. These two measures are either absolutely continuous with respect to each other or orthogonal.

Let β_{i1}, β_{i2} be the atomic and nonatomic parts of β_i, i = 1, 2.

D(α_i β_i), i = 1, 2, are orthogonal to each other if α_1 ≠ α_2, or if β_{11} ≠ β_{21}, or if the support of β_{12} ≠ the support of β_{22}. There is a necessary and sufficient condition for orthogonality when all of these are equal.

Absolute continuity

The curious result that we can consistently estimate the parameters of the Dirichlet prior from the sample happens because of the orthogonality of Dirichlet distributions when their parameters are different.

Another curious result: when β is nonatomic, the prior D(αβ), the posterior given X_1, the posterior given X_1, X_2, the posterior given X_1, X_2, X_3, etc. are all orthogonal to one another!

Some properties of Dirichlet priors

A simple problem is the estimation of the "true" mean, i.e. ∫x dP(x), from data X_1, X_2, …, X_n which are i.i.d. P.

In the Bayesian nonparametric problem, P has a prior distribution D(αβ) and, given P, the data X_1, …, X_n are i.i.d. P.

The Bayes estimate of ∫x dP(x) is its mean under the posterior distribution.

However, before estimating ∫x dP(x), one should first check that our prior and posterior give probability 1 to the set {P : ∫|x| dP(x) < ∞}.

Some properties of Dirichlet priors

Feigin and Tweedie (1989), and others later, gave necessary and sufficient conditions for this, namely

∫ log(max(1, |x|)) dβ(x) < ∞.

From our constructive definition,

∫|x| dP(x) = ∑_{i=1}^∞ p_i |Y_i|.

The Kolmogorov three series theorem gives a simple direct proof. Sethuraman (2010).

Some properties of Dirichlet priors

The Bayes estimate of the mean is the mean under the posterior distribution D(αβ + nP_n), and it is

(α ∫x dβ(x) + n X̄_n) / (α + n).

One should check first that ∫ log(max(1, |x|)) dβ(x) < ∞. Note that the Bayesian can estimate ∫x dP(x) in this case even though the base measure β may not have a mean. (What does the frequentist estimate X̄_n estimate in this case?)

Some properties of Dirichlet priors

The actual distribution of ∫x dP(x) under D(αβ) is a vexing problem. Regazzini, Lijoi and Prunster (2003) and Lijoi and Prunster (2009) have the best results.

When β is the Cauchy distribution, it is easy to see that

∫x dP(x) = ∑_{i=1}^∞ p_i Y_i

where Y_1, Y_2, … are i.i.d. Cauchy, and hence ∫x dP(x) is Cauchy.

One does not need the GEM property of (p_1, p_2, …) for this; it is enough for it to be independent of (Y_1, Y_2, …). Yamato (1984) was the first to prove this.
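A quick simulation check of the Cauchy case; a standard Cauchy has median 0 and quartiles ±1, and simulated draws of ∑ p_i Y_i should agree:

```python
import numpy as np

def dp_mean_cauchy(alpha, rng, tol=1e-10):
    """One draw of sum_i p_i * Y_i with GEM(alpha) weights and i.i.d.
    standard Cauchy atoms, truncated at remaining mass tol."""
    total, remaining = 0.0, 1.0
    while remaining > tol:
        v = rng.beta(1.0, alpha)
        total += remaining * v * rng.standard_cauchy()
        remaining *= 1.0 - v
    return total

rng = np.random.default_rng(0)
draws = np.array([dp_mean_cauchy(2.0, rng) for _ in range(5000)])
print(np.percentile(draws, [25, 50, 75]))             # close to (-1, 0, 1)
```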

Some properties of Dirichlet priors

Convergence of Dirichlet priors when their parameters converge is easy to establish using the constructive definition.

When α → ∞ and β is fixed, D(αβ) →_w δ_β.

When α → 0 and β is fixed, D(αβ) →_w δ_Y where Y ∼ β.

Sethuraman and Tiwari (1982).

Some properties of Dirichlet priors

The constructive definition

P(·) = ∑_{i=1}^∞ p_i δ_{Y_i}(·)

leads to the inequality

||P − ∑_{i=1}^M p_i δ_{Y_i}|| ≤ ∏_{i=1}^M (1 − p_i).

So one can allow for several kinds of random stopping to stay within chosen errors. One can also stop at nonrandom times and have probability bounds for errors. Muliere and Tardella (1998) have several results of this type.
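A sketch of random stopping driven by this bound: keep breaking the stick until ∏_{i≤M}(1 − p_i) drops below a chosen error ε:

```python
import numpy as np

def truncate_to_error(alpha, rng, eps=1e-6):
    """Stop at the first (random) M with prod_{i<=M}(1 - p_i) <= eps, so
    the truncation error is at most eps by the inequality above."""
    weights, remaining, bound = [], 1.0, 1.0
    while bound > eps:
        v = rng.beta(1.0, alpha)
        p = remaining * v
        weights.append(p)
        remaining *= 1.0 - v
        bound *= 1.0 - p                              # running prod (1 - p_i)
    return np.array(weights)

rng = np.random.default_rng(0)
print(len(truncate_to_error(5.0, rng)), "atoms suffice for eps = 1e-6")
```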

Some properties of Dirichlet priors

Let P ∼ D(αβ) and let f be a function of X. Then Pf^{−1}, the distribution of f under P, has the distribution D(αβf^{−1}), where βf^{−1} is the distribution of f under β.

Some properties of Dirichlet priors

Regression problems are studied when the data are of the form (Y_1, X_1), (Y_2, X_2), … and the model is that Y_i = f(X_i, ε_i). If the distribution of ε is P, then the conditional distribution of Y given X = x is the distribution of f(x, ·) under P and is denoted by M_x.

Let P ∼ D(αβ) and, given P, let ε_1, ε_2, … be i.i.d. P. The constructive definition gives

P = ∑_{i=1}^∞ p_i δ_{Z_i}

where Z_1, Z_2, … are i.i.d. β.

Some properties of Dirichlet priors

The df M_x is the distribution of f(x, ·) under P; it thus has a Dirichlet distribution and can be represented by

P_x = ∑_{i=1}^∞ p_i δ_{Z_{i,x}}

where Z_{1,x}, Z_{2,x}, … are i.i.d. from the distribution of f(x, ·) under β. MacEachern (1999) calls this the dependent Dirichlet prior (DDP).

Some properties of Dirichlet priors

Ishwaran and James (2001) allow the V_i's appearing in the definition of the p_i's to be independent B(a_i, b_i) random variables. Rodriguez (2011) allows V_i = Φ(N_i), where N_1, N_2, … are i.i.d. N(µ, σ^2), and calls it "probit stick breaking." Bayes analysis with such priors can be handled only by computational methods.

Bayes hierarchical models

A popular Bayes hierarchical model is usually stated as follows.

The data X_1, …, X_n are independent with distributions K(·, θ_1), …, K(·, θ_n), where K(·, θ) is a nice continuous df for each θ.

Next it is assumed that there is an rpm P and, given P, θ_1, …, θ_n are i.i.d. P.

Finally, it is assumed that P has the D(αβ) distribution.

This is also called the DP mixture model and it has lots of applications.

One wants the posterior distribution of P given the data X_1, …, X_n.
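For concreteness, a generative simulation of this hierarchical model, assuming an illustrative normal kernel K(·, θ) = N(θ, 0.25) and base measure β = N(0, 4):

```python
import numpy as np

def dp_mixture_sample(n, alpha, rng, tol=1e-8):
    """Generate X_1, ..., X_n from the DP mixture model: P ~ D(alpha*beta)
    via truncated stick breaking, theta_i | P i.i.d. P, X_i ~ K(., theta_i)."""
    weights, atoms, remaining = [], [], 1.0
    while remaining > tol:                            # draw the random measure P
        v = rng.beta(1.0, alpha)
        weights.append(remaining * v)
        atoms.append(rng.normal(0.0, 2.0))            # atoms i.i.d. beta = N(0, 4)
        remaining *= 1.0 - v
    w = np.array(weights) / np.sum(weights)           # renormalize the truncation
    theta = rng.choice(np.array(atoms), size=n, p=w)  # theta_i i.i.d. P
    return rng.normal(theta, 0.5)                     # X_i ~ N(theta_i, 0.25)

rng = np.random.default_rng(0)
print(dp_mixture_sample(5, alpha=1.0, rng=rng))
```

The data cluster around the atoms of P carrying large weights, which is what makes the model useful for clustering.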

Bayes hierarchical models

One should state the Bayes hierarchical model more carefully.

Let the rpm P have the D(αβ) distribution.

Conditional on P, let θ_1, …, θ_n be i.i.d. P.

Conditional on (P, θ_1, …, θ_n), let X_1, …, X_n be independent with distributions K(·, θ_1), …, K(·, θ_n), where K(·, θ) is a df for each θ.

Then (X_1, …, X_n, θ_1, …, θ_n, P) will have a well defined joint distribution, and all the necessary conditional distributions are also well defined. Ghosh and Ramamoorthi (2003) are careful to state it this way throughout their book.

Bayes hierarchical models

Several computational methods have been proposed in the literature. West, Muller, and Escobar (1994), Escobar and West (1998), and MacEachern (1998) have studied computational methods. They integrate out the rpm P and deal with the joint distribution of (X_1, …, X_n, θ_1, …, θ_n).

The hierarchical model can also be expressed as follows, suppressing the latent variables θ_1, …, θ_n: for any pm P, let K(·, P) = ∫K(·, θ) dP(θ). Then P ∼ D(αβ) and, given P, X_1, …, X_n are i.i.d. K(·, P).

Bayes hierarchical models

To perform an MCMC one can use the constructive definition and put

K(·, P) = ∑_{i=1}^∞ p_i K(·, Y_i)

where (p_1, p_2, …) is the usual GEM(α) and Y_1, Y_2, … are i.i.d. β. To do the MCMC here one has to truncate this infinite series appropriately. Doss (1994), Ishwaran and James (2001), and others.

More

• More Bayes hierarchical models: Neal (2000), Rodriguez, Dunson and Gelfand (2008)

• Clustering applications: Blei and Jordan (2006), Wang, Shan and Banerjee (2009)

• Two parameter Dirichlet process: Perman, Pitman and Yor (1992), Pitman and Yor (1997)

• Partition based priors, useful in reliability and repair models: Hollander and Sethuraman (2009)