Dependent Dirichlet processes and application to ecological data
Julyan Arbel
Joint work with Kerrie Mengersen & Judith Rousseau
CREST-INSEE, Université Paris-Dauphine
2 December 2012
ERCIM 2012, 5th International Conference on Computing & Statistics
Biology questionNonparametric model
Outline
1 Biology question
    Introduction
    Data
2 Nonparametric model
    Dirichlet process
    Dependent Dirichlet process
Julyan Arbel DDP and ecological data
Biology introduction
Series of measurements at different places around Casey Station, a permanent base in Antarctica.
At each site: pollution level, and abundance of microbes called OTUs.
Assess the impact of a pollutant on the soil composition / biodiversity.
Data
Data consist of measurements of microbe abundance:

Site  TPH    06251  00576  00429  06360  08793  06259  05164  00772
1     80         3    724     88      1      0      0      0    467
2     80         9   2364    252      0      0      2      0    616
3     80        12    443   1655     11      0      0      0    168
...   ...      ...    ...    ...    ...    ...    ...    ...    ...
13    2600    2262    339    229   1100    537    352      0      0
20    10000   1883     23     18    879    224    325      9      1
24    22000   1446      2     27    920   1808   1456      0      0

Sample of abundance of 8 microbes (columns) at 6 sites (rows).
Main covariate is a pollution level called TPH, denoted x.
Notations
Microbe species are denoted by $j = 1, \ldots$, ordered by decreasing total abundance.
At each site $x$, there are $N(x)$ microbes, denoted $Y_i(x)$, $i = 1, \ldots, N(x)$.
Data are a frequency matrix, where $\#(Y_n(x) = j)$ is the number of microbes at site $x$ that belong to species $j$:

Site  TPH       06251 ($j = 1$)              00576  ...  ($j$)           ...
1     $x = 80$  $\#(Y_n(x = 80) = 1) = 3$    ...
...   ...       ...
k     $x$       ...                          $\#(Y_n(x) = j)$            ...
Notations
A standard example of diversity is Shannon diversity, taken as the exponential of Shannon entropy:

$$D(x) = \exp\Big(\sum_j -p_j(x) \log p_j(x)\Big), \qquad \text{with } p_j(x) = \frac{\#(Y_n(x) = j)}{N(x)}.$$

Figure: Left: Shannon entropy in raw data. Right: Shannon diversity in raw data. (Horizontal axis in both panels: TPH, from 0 to 20000.)
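The diversity formula above can be sketched in a few lines of Python. This is only an illustration: the function name `shannon_diversity` is an assumption, and the example row uses the eight species shown in the data sample for site 1, so the proportions are relative to those eight columns rather than to the full $N(x)$.

```python
import math

def shannon_diversity(counts):
    """Return exp(Shannon entropy) of the empirical proportions of `counts`."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive entry")
    entropy = 0.0
    for c in counts:
        if c > 0:  # species with zero count contribute nothing (0 * log 0 := 0)
            p = c / total
            entropy -= p * math.log(p)
    return math.exp(entropy)

# Counts at site 1 (TPH = 80), restricted to the eight species of the data sample.
site1 = [3, 724, 88, 1, 0, 0, 0, 467]
print(shannon_diversity(site1))
```

With equal counts over k species the function returns k, which is why Shannon diversity is often read as an effective number of species.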
First model
Pavlovian conditioning associated with the word "species" leads to the Dirichlet process and/or related processes.
First, we run an independent model at each site with TPH $x$:

$$Y_i(x) \mid G \sim G, \qquad G(\cdot) = \sum_{j=1}^{\infty} p_j \delta_j(\cdot), \qquad (p_j)_j \sim \mathrm{GEM}(M).$$

The GEM(M) distribution is defined in [Pitman, 2002] (GEM stands for Griffiths, Engen and McCloskey) and represents the distribution of the weights in a Dirichlet process:

$$p_j = V_j \prod_{l<j} (1 - V_l), \qquad V_j \sim \mathrm{Beta}(1, M).$$
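As a rough illustration of the stick-breaking definition above, the following sketch draws the first few GEM(M) weights. The function name `sample_gem_weights` and the truncation level are assumptions made for the example; the talk itself does not specify an implementation.

```python
import random

def sample_gem_weights(M, truncation, rng=random):
    """Draw the first `truncation` stick-breaking weights p_j of a GEM(M) sequence.

    Returns the weights and the leftover stick length prod_j (1 - V_j).
    """
    weights = []
    remaining = 1.0  # part of the stick not yet allocated: prod_{l<j} (1 - V_l)
    for _ in range(truncation):
        v = rng.betavariate(1.0, M)  # V_j ~ Beta(1, M)
        weights.append(v * remaining)  # p_j = V_j * prod_{l<j} (1 - V_l)
        remaining *= 1.0 - v
    return weights, remaining

rng = random.Random(0)
p, leftover = sample_gem_weights(M=2.0, truncation=50, rng=rng)
print(sum(p) + leftover)  # equals 1 up to floating-point rounding
```

The leftover mass shrinks geometrically on average, which is what makes a finite truncation a reasonable approximation of the infinite sum.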
Posterior sampling
We use a blocked Gibbs sampler (truncated version of the infinite sum).
The prior on $p$ is induced by the Beta prior on $V$, $\pi^{\perp}(V_j) = \mathrm{Be}(1, M)$.
This prior is conjugate, with a Beta posterior:

$$\pi(V_j \mid Y) = \mathrm{Be}\big(V_j \mid 1 + \#(Y_n = j),\; M + \#(Y_n > j)\big).$$
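The conjugate update above amounts to simple bookkeeping on the species counts. The sketch below is one possible reading, assuming the counts are given as a list ordered by species index j and truncated at the last listed species; the function names are hypothetical.

```python
import random

def beta_posterior_params(counts, M):
    """Return [(a_j, b_j)] with a_j = 1 + #(Y_n = j) and b_j = M + #(Y_n > j)."""
    params = []
    greater = sum(counts)  # observations with label >= j, updated as j increases
    for n_j in counts:
        greater -= n_j  # now the number of observations with label > j
        params.append((1.0 + n_j, M + greater))
    return params

def sample_posterior_weights(counts, M, rng=random):
    """One joint draw of the truncated weights p_j from the Beta posteriors."""
    weights, remaining = [], 1.0
    for a, b in beta_posterior_params(counts, M):
        v = rng.betavariate(a, b)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

print(beta_posterior_params([3, 2, 1], M=1.0))
```

Because of the truncation, the sampled weights sum to slightly less than one; the missing mass belongs to species beyond the last one listed.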
Second model
But we want to run a single model across TPH $x$, which means a predictor-dependent model.
Early references to predictor-dependent DP models include Cifarelli and Regazzini [1978] and Muliere and Petrone [1993].
Increasing interest since MacEachern [1999, 2000, 2001].
Extensions with varying weights include, among others, order-based DDP [Griffin and Steel, 2006], local DP [Chung and Dunson, 2009], weighted mixtures of DP [Dunson and Park, 2008], and kernel stick-breaking processes [Dunson et al., 2007].
Second model
We are only interested in dependence in the weights. We worked out a dependent process prior with a simple dependence structure on the weights:

$$Y_i(x) \mid G(x) \sim G(x), \qquad G(x)(\cdot) = \sum_{j=1}^{\infty} p_j(x)\, \delta_j(\cdot), \qquad (p_j(x))_j \sim \mathrm{DGEM}(M),$$

$$p_j(x) = V_j(x) \prod_{l<j} (1 - V_l(x)), \qquad V_j(x) \sim \mathrm{Beta}(1, M),$$

where DGEM(M) stands for Dependent GEM distribution.
For each $j$, we want a process $(V_j(x))_x$ that is marginally Beta(1, M).
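Process on the beta breaks, $V_j(x)$
Construction from Trippa, Müller and Johnson [2011].

$$V(x_1) = \frac{\Gamma(x_1)}{\Gamma(x_1) + \Gamma^M(x_1)}$$

[Diagram: covariate points $x_1$, $x_2$, $x_3$ with regions labelled $\alpha_1$, $\alpha_2$, $\alpha_3$, $\alpha_{12}$, $\alpha_{23}$, $\alpha_{123}$.]

$$\Gamma(x_1) = \Gamma_1 + \Gamma_{12} + \Gamma_{123}, \qquad \Gamma^M(x_1) = \Gamma^M_1 + \Gamma^M_{12} + \Gamma^M_{123},$$

$$\Gamma_1 \sim \mathrm{Ga}(\alpha_1), \ldots, \Gamma_{123} \sim \mathrm{Ga}(\alpha_{123}), \qquad \Gamma^M_1 \sim \mathrm{Ga}(\alpha_1 M), \ldots, \Gamma^M_{123} \sim \mathrm{Ga}(\alpha_{123} M).$$

In the end:

$$p_j(x) = V_j(x) \prod_{l<j} (1 - V_l(x)) \sim \mathrm{DGEM}(M).$$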
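The Gamma-ratio construction above might be simulated along the following lines. This is a sketch under stated assumptions, not the authors' code: the region weights in `ALPHAS` are hypothetical values, chosen so that the regions covering any one site sum to 1, in which case $\Gamma(x) \sim \mathrm{Ga}(1)$ and $\Gamma^M(x) \sim \mathrm{Ga}(M)$ and the ratio is marginally Beta(1, M). Regions are keyed by the tuple of site indices they cover, and all Gamma variables use unit scale.

```python
import random

# Hypothetical region weights alpha_S, keyed by the sites that region S covers.
# For each site, the weights of the regions covering it sum to 1.
ALPHAS = {
    (1,): 0.5, (2,): 0.3, (3,): 0.5,
    (1, 2): 0.2, (2, 3): 0.2, (1, 2, 3): 0.3,
}

def sample_dependent_breaks(alphas, M, sites, rng=random):
    """Draw one vector (V(x))_x built from shared Gamma components."""
    # Gamma_S ~ Ga(alpha_S) and Gamma^M_S ~ Ga(alpha_S * M), all independent.
    gam = {S: rng.gammavariate(a, 1.0) for S, a in alphas.items()}
    gam_m = {S: rng.gammavariate(a * M, 1.0) for S, a in alphas.items()}
    breaks = {}
    for x in sites:
        num = sum(g for S, g in gam.items() if x in S)            # Gamma(x)
        den = num + sum(g for S, g in gam_m.items() if x in S)    # Gamma(x) + Gamma^M(x)
        breaks[x] = num / den
    return breaks

rng = random.Random(1)
print(sample_dependent_breaks(ALPHAS, M=2.0, sites=[1, 2, 3], rng=rng))
```

Sites that share a region share the corresponding Gamma variables, which is where the dependence between $V(x_1)$, $V(x_2)$ and $V(x_3)$ comes from.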
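Interesting features
This idea can be extended to higher-dimensional covariate spaces:
[Diagram: covariate points $x_1$, $x_2$, $x_3$, ... in the plane, with regions labelled $\alpha_1$, $\alpha_2$, $\alpha_3$, $\alpha_{12}$, $\alpha_{23}$, $\alpha_{123}$.]
Easy to simulate from: one only needs to simulate Gamma random variables.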
Posterior sampling
There is independence across $j$, so it suffices to be able to simulate from each posterior, where $V_j = (V_j(x))_x$:

$$\pi(V_j \mid Y) \propto \pi(V_j)\, L(Y \mid V_j) \propto \pi(V_j) \prod_x V_j(x)^{\#(Y_n(x) = j)} \big(1 - V_j(x)\big)^{\#(Y_n(x) > j)}.$$

This is quite an uncommon situation: we can sample from the prior $\pi(V_j)$, but we cannot evaluate it. It is the reverse of Approximate Bayesian computation (ABC), where the likelihood is intractable but can be sampled from.
A first solution is to use a Metropolis-Hastings algorithm:
Metropolis-Hastings Algorithm
1 Given a current value $V_j$, sample a new one $V_j^*$ independently from the prior $\pi(V_j)$.
2 Acceptance probability is

$$\rho = \min\left\{1, \frac{L(Y \mid V_j^*)}{L(Y \mid V_j)}\right\}.$$

But it is not a good idea to propose from the prior: the acceptance rate is low (around 1%).
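The two steps above can be sketched as a generic independence sampler. Everything in the example is illustrative: the counts are made up, and the prior sampler draws independent Beta(1, M) values at each site as a simple stand-in for the dependent prior of the talk, which would be simulated through its Gamma construction instead.

```python
import math
import random

def log_likelihood(v, n_equal, n_greater):
    """log L(Y | V_j): sum over sites of #(Y=j) log V_j(x) + #(Y>j) log(1 - V_j(x))."""
    return sum(a * math.log(vx) + b * math.log(1.0 - vx)
               for vx, a, b in zip(v, n_equal, n_greater))

def independence_mh(sample_prior, log_lik, n_iter, rng=random):
    """Metropolis-Hastings with proposals drawn independently from the prior."""
    current = sample_prior(rng)
    current_ll = log_lik(current)
    chain, accepted = [], 0
    for _ in range(n_iter):
        proposal = sample_prior(rng)
        proposal_ll = log_lik(proposal)
        # With a prior proposal the prior terms cancel: only the likelihood ratio remains.
        if math.log(1.0 - rng.random()) < proposal_ll - current_ll:
            current, current_ll = proposal, proposal_ll
            accepted += 1
        chain.append(current)
    return chain, accepted / n_iter

# Hypothetical counts for one species j at three sites.
n_equal, n_greater = [3, 9, 12], [20, 30, 25]
M = 1.0
prior = lambda rng: [rng.betavariate(1.0, M) for _ in n_equal]  # stand-in prior
chain, rate = independence_mh(
    prior, lambda v: log_likelihood(v, n_equal, n_greater), 5000, random.Random(0))
print(rate)  # acceptance rate; expected to be small when proposing from the prior
```

Note that the prior density is never evaluated, only sampled, which is exactly what the situation described above requires.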
A better solution is to use Importance Sampling:
Importance Sampling
1 Sample iid values $V_j$ from the prior $\pi(V_j)$.
2 Weight the sample by the importance weights defined by the likelihood, $w(V_j) = L(Y \mid V_j)$.

iid sample instead of a Markov chain
better precision by a Rao-Blackwellisation argument (weights instead of accept-reject)
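A minimal sketch of the importance sampling scheme above, using self-normalised weights. The toy setting is an assumption for illustration only: a single break V with a Beta(1, M) prior and made-up counts, for which the exact posterior mean is known (4/25 when M = 1, #(Y = j) = 3 and #(Y > j) = 20), so the estimate can be checked.

```python
import math
import random

def importance_sample(sample_prior, log_lik, n_draws, rng=random):
    """Draw iid from the prior; return the draws and their normalised weights."""
    draws = [sample_prior(rng) for _ in range(n_draws)]
    log_w = [log_lik(d) for d in draws]
    top = max(log_w)  # subtract the maximum before exponentiating, for stability
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return draws, [wi / total for wi in w]

def effective_sample_size(weights):
    """1 / sum of squared normalised weights: a rough measure of weight degeneracy."""
    return 1.0 / sum(w * w for w in weights)

# Toy example with hypothetical counts for a single break V.
M, n_equal, n_greater = 1.0, 3, 20
prior = lambda rng: rng.betavariate(1.0, M)
log_lik = lambda v: n_equal * math.log(v) + n_greater * math.log(1.0 - v)

draws, w = importance_sample(prior, log_lik, 5000, random.Random(0))
post_mean = sum(wi * d for wi, d in zip(w, draws))
print(post_mean, effective_sample_size(w))
```

Every prior draw contributes to the estimate through its weight, whereas the Metropolis-Hastings version discards most proposals.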
Results
Figure: Left: dependent DP prior: posterior mean of the Shannon diversity by TPH, with 95% centred credible intervals. Right: Shannon diversity in raw data. (Horizontal axis in both panels: TPH, from 0 to 20000.)
Conclusion
Such a model gives probabilistic answers to questions about diversity, since we get a posterior sample.
Using Gaussian processes transformed to Beta processes by the inverse CDF might speed up the posterior computations.
Extension to handle other covariates.