A Short Review on Bayesian Analysis
• Binomial, Multinomial, Normal, Beta, Dirichlet
• Posterior mean, MAP, credible interval, posterior distribution
• Gibbs sampling
Frequentist vs Bayesian
Statistical methods, such as hypothesis testing (p-values), maximum
likelihood estimates (MLE), and confidence intervals (CI), are known as
frequentist methods. In the frequentist framework,
• probabilities refer to long run frequencies and are objective quantities;
• parameters are fixed but unknown constants;
• statistical procedures should have well-defined long run frequency
properties (e.g. 95% CI).
There is another approach to inference known as Bayesian inference. In the
Bayesian framework,
• probabilities reflect (subjective) personal belief;
• unknown parameters are treated as random variables;
• we make inferences about a parameter θ by producing a probability distribution for θ.
Bayesian Analysis
Bayesian inference is carried out in the following way:
1. Choose a statistical model p(x | θ), i.e., the likelihood, the same as in the frequentist approach;
2. Choose a prior distribution π(θ);
3. Calculate the posterior distribution π(θ | x):
π(θ | x) = p(x | θ) π(θ) / ∫ p(x | θ) π(θ) dθ.
Alternatively, we can write the posterior as
π(θ | x) = p(x | θ) π(θ) / ∫ p(x | θ) π(θ) dθ ∝ p(x | θ) π(θ),
where we drop the scale factor [ ∫ p(x | θ) π(θ) dθ ]^{−1} since it is a constant not depending on θ.
For example, if π(θ | x) ∝ exp(−θ² + 2θ), then we know that θ | x ∼ N(1, 1/2), since −θ² + 2θ = −(θ − 1)² + 1, which matches the normal kernel exp{ −(θ − 1)² / (2 · (1/2)) } up to a constant.
Now any inference on the parameter θ can be obtained from the posterior distribution π(θ | x). For example, if one wants
• a point estimate of θ, we can report the mean E(θ | x), the median median(θ | x), or the mode argmax_θ π(θ | x) of the posterior distribution;
• an interval estimate of θ, we can report the 95% credible interval, which is a region with 0.95 posterior probability. So a 95% credible interval (1.2, 3.5) means that
P(θ ∈ (1.2, 3.5)) = 0.95,
where P corresponds to the posterior distribution of θ.
For a 95% confidence interval (1.2, 3.5), we CANNOT say that "(1.2, 3.5) covers the true θ with prob 95%."
• The mode estimate is often referred to as the MAP (maximum a posteriori) estimate, which is the solution of
max_θ log π(θ | x) = max_θ [ log p(x | θ) + log π(θ) ].
Note that MAP is also the solution of
min_θ [ −log p(x | θ) − log π(θ) ].
So MAP is related to the regularization approach, with the penalty term being (−log) of the prior. For example, a N(0, τ²) prior on θ contributes −log π(θ) = θ²/(2τ²) + const, i.e., an L2 (ridge) penalty.
• Your first objection to Bayesian inference may be the choice of prior. Where does one find priors?
• Priors, like the likelihood, are part of your assumptions: a prior is one's initial guess about the parameter; after observing the data, which carry information about the parameter, one updates the prior to the posterior. Priors matter (they shape the posterior) and do not matter (their influence fades as the data accumulate).
• Next I'll introduce some default prior choices. Of course the sensitivity of the prior choice, i.e., how different priors affect the final result, should always be examined in practice.
A Bernoulli Example
• Suppose X = (X1, . . . , Xn) denotes the outcomes of n coin tosses, where 1 means a head and 0 means a tail. They are iid samples from a Bernoulli distribution with parameter θ, where θ is the probability of getting a head.
• Without any information about the coin, we can put a uniform prior on θ, that is,
π(θ) = 1,   0 ≤ θ ≤ 1.
Next we calculate the posterior distribution of θ given X:
π(θ | X) ∝ ∏_{i=1}^{n} p(Xi | θ) ∝ θ^s (1 − θ)^{n−s},   s = Σ_i Xi,
which implies that θ | X ∼ Beta(s + 1, n + 1 − s). The corresponding posterior mean can be used as a point estimate for θ,
θ̂ = (s + 1)/(n + 2).   (1)
Post-mean = (s + 1)/(n + 2),   MLE = s/n.
The MLE is the observed frequency of heads among the n experiments; the Bayes estimator is the frequency of heads among (n + 2) experiments, in which there are two "prior" experiments, one a head and the other a tail.
Without the data, one just looks at the prior experiments and a reasonable guess for θ is 1/2. After observing the data, the final estimate (1) is some number between 1/2 and s/n, a compromise between the prior information and the MLE. Note that when n gets large, the prior gets "washed out".
The extra counts, one for a head and one for a tail, are often called pseudo-counts. Having pseudo-counts is appealing in cases where θ is likely to take extreme values close to 1 or 0. For example, to estimate θ for a rare event, it is likely that we observe Xi = 0 for all i = 1, . . . , n, but it may be dangerous to conclude θ̂ = 0.
Beta distributions are often used as a prior on θ; in fact, the uniform is a special case of the Beta. Suppose the prior on θ is Beta(α, β),
π(θ) = θ^{α−1} (1 − θ)^{β−1} / B(α, β);
then
θ | X ∼ Beta(s + α, n − s + β).   (2)
We call the Beta distributions the conjugate family for Bernoulli models, since both the prior and the posterior distributions belong to the same family.
The posterior mean of (2) is equal to (α + s)/(n + α + β). So Beta priors can be viewed as adding (α + β) prior experiments, in which we have α heads and β tails.
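A minimal numerical sketch of this conjugate update (NumPy and SciPy assumed; the simulated data and hyperparameter values are purely illustrative):

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

# simulated coin tosses with an assumed true theta = 0.3
theta_true, n = 0.3, 50
x = rng.binomial(1, theta_true, size=n)
s = x.sum()

# Beta(a, b) prior; a = b = 1 recovers the uniform prior
a, b = 1.0, 1.0

# conjugate update: theta | X ~ Beta(s + a, n - s + b)
post = beta(s + a, n - s + b)

print("MLE                  :", s / n)
print("posterior mean       :", (s + a) / (n + a + b))
print("95% credible interval:", post.ppf([0.025, 0.975]))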
A Multinomial Example
• Suppose you randomly draw a card from an ordinary deck of playing
cards, and then put it back in the deck. Repeat this exercise five times
(i.e., sampling with replacement).
• Let (N1, N2, N3, N4) denote the numbers of spades, hearts, diamonds, and clubs among the five cards. We say (N1, N2, N3, N4) follows a multinomial distribution, with probability mass function
P(N1 = n1, . . . , N4 = n4 | n, θ1, . . . , θ4) = n! / (n1! n2! n3! n4!) · θ1^{n1} · · · θ4^{n4},
where n1 + · · · + n4 = n = 5 and θ1 = · · · = θ4 = 1/4.
• In the Bernoulli example, we conduct n independent trials and each trial results in one of two possible outcomes, e.g., head or tail, with probabilities θ and (1 − θ), respectively.
• In the multinomial example, we conduct n independent trials and each trial results in one of k possible outcomes, e.g., k = 4 in the aforementioned card example, with probabilities θ1, . . . , θk, respectively.
• For multinomial distributions, the parameter of interest is θ = (θ1, . . . , θk), which lies in a simplex of R^k,
S = { θ = (θ1, . . . , θk) : Σ_i θi = 1, θi ≥ 0 }.
• A Dirichlet distribution on S, Dir(α1, . . . , αk), is an extension of the Beta distribution, with a density function given by
p(θ | α) = Γ(α1 + · · · + αk) / [ Γ(α1) · · · Γ(αk) ] · ∏_{i=1}^{k} θi^{αi−1}.
The Dirichlet distributions are conjugate priors for Multinomial models.
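The slide states the conjugacy but not the update itself; the standard result is that a Dir(α1, . . . , αk) prior combined with multinomial counts (n1, . . . , nk) gives the posterior Dir(α1 + n1, . . . , αk + nk). A small sketch (NumPy assumed; values illustrative):

import numpy as np

rng = np.random.default_rng(1)

# five draws with replacement from a fair deck: counts of the four suits
counts = rng.multinomial(5, [0.25] * 4)

# Dir(1, 1, 1, 1) prior; the conjugate posterior simply adds the observed counts
alpha = np.ones(4)
post_alpha = alpha + counts

print("posterior mean of (theta_1, ..., theta_4):", post_alpha / post_alpha.sum())

# posterior draws, e.g., for a credible interval on theta_1
draws = rng.dirichlet(post_alpha, size=10000)
print("95% credible interval for theta_1:", np.quantile(draws[:, 0], [0.025, 0.975]))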
Dirichlet Distributions
Examples of Dirichlet distributions over p = (p1, p2, p3), which can be plotted in 2D since p3 = 1 − p1 − p2. [Figure not reproduced in this transcript.]
A Normal Example
Assume X1, . . . , Xn ∼ N(θ, σ²). The parameters here are (θ, σ²) and we would like to get the posterior distribution π(θ, σ² | X). If using the Gibbs sampler, which will be introduced later, we only need to know the conditional posterior distributions π(θ | σ², X) and π(σ² | θ, X).
For the location parameter θ, the conjugate prior is normal. If
X̄ | θ, σ² ∼ N(θ, σ²/n),   θ ∼ N(µ0, τ0²),
then θ | σ², X ∼ N(µ, τ²), where
µ = w X̄ + (1 − w) µ0,   w = (n/σ²) / (n/σ² + 1/τ0²),   τ² = ( n/σ² + 1/τ0² )^{−1}.
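A quick numerical check of this precision-weighted update (NumPy assumed; the data and the prior values µ0 = 0, τ0² = 10 are illustrative):

import numpy as np

rng = np.random.default_rng(2)

# data, with sigma^2 treated as known for this conditional
sigma2, n = 4.0, 30
x = rng.normal(2.0, np.sqrt(sigma2), size=n)
xbar = x.mean()

mu0, tau0_sq = 0.0, 10.0            # N(mu0, tau0^2) prior on theta

# precisions: the data contribute n/sigma^2, the prior contributes 1/tau0^2
prec_data, prec_prior = n / sigma2, 1.0 / tau0_sq
w = prec_data / (prec_data + prec_prior)

mu_post = w * xbar + (1 - w) * mu0          # posterior mean
tau2_post = 1.0 / (prec_data + prec_prior)  # posterior variance

print("theta | sigma^2, X  ~  N(%.3f, %.3f)" % (mu_post, tau2_post))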
For the scale parameter σ², the conjugate prior is Inverse Gamma; that is, the prior on 1/σ² is Gamma. Suppose π(σ²) = InvGa(α, β); then
π(σ² | µ, X) ∝ (1/σ²)^{n/2} exp{ −Σ(Xi − µ)² / (2σ²) } · (1/σ²)^{α+1} exp{ −β/σ² },
so that
σ² | µ, X ∼ InvGa( n/2 + α, Σ(Xi − µ)²/2 + β ).
In practice, we often specify π(σ²) using Inv-χ² distributions, which are special cases of InvGa:
Inv-χ²(v0, s0²) = InvGa( v0/2, (v0/2) s0² ).
With the prior Inv-χ²(v0, s0²), the posterior distribution is also Inv-χ²(vn, sn²), where
vn = v0 + n,   vn sn² = v0 s0² + Σ(Xi − µ)²,
i.e., v0 pseudo-samples, each contributing s0² to the residual sum of squares.
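A minimal sketch of one draw from this conditional (NumPy assumed; data and hyperparameters illustrative). Since σ² | µ, X ∼ InvGa(shape, rate) is equivalent to 1/σ² ∼ Gamma(shape, rate), we can draw the precision from a Gamma and invert it:

import numpy as np

rng = np.random.default_rng(3)

x = rng.normal(1.0, 2.0, size=50)   # data
mu = x.mean()                       # plug in the current value of the mean
a, b = 2.0, 2.0                     # InvGa(alpha, beta) prior on sigma^2

shape = len(x) / 2 + a
rate = np.sum((x - mu) ** 2) / 2 + b

# sigma^2 ~ InvGa(shape, rate)  <=>  1/sigma^2 ~ Gamma(shape, rate)
sigma2_draw = 1.0 / rng.gamma(shape, 1.0 / rate)   # numpy's gamma uses scale = 1/rate
print("one draw of sigma^2 | mu, X:", sigma2_draw)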
Gibbs Sampling for Posterior Inference
Suppose the random variables X and Y have a joint probability density
function p(x, y).
Sometimes it is not easy to simulate directly from the joint distribution.
Instead, suppose it is possible to simulate from the individual conditional distributions p_{X|Y}(x | y) and p_{Y|X}(y | x).
Then a Gibbs sampler draws (X1, Y1), . . . , (XT, YT) as follows:
1. Initialization: let (X0, Y0) be some starting values; set n = 0.
2. Draw Xn+1 ∼ p_{X|Y}(x | Yn).
3. Draw Yn+1 ∼ p_{Y|X}(y | Xn+1).
4. Set n ← n + 1 and go to step 2; repeat.
Gibbs samplers are MCMC algorithms, and they produce samples from the desired distribution after a so-called burn-in period. So in practice, we always drop the samples from the initial steps (say, for example, 1000 or 5000 steps) and start saving samples after that.
Suppose we have multiple parameters θ = (θ1, . . . , θK). Then we can draw the posterior samples of θ using a multi-stage Gibbs sampler.
At each stage, we draw θi from the conditional distribution π(θi | θ[−i], Data), where θ[−i] denotes the (K − 1) parameters other than θi.
Why Gibbs samplers? In many cases the joint posterior π(θ | Data) is not of closed form, while all of these conditionals are.
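Putting the two conditionals from the Normal example together gives a two-block Gibbs sampler for (θ, σ²). A sketch (NumPy assumed; data, priors, and run length are illustrative), discarding an initial burn-in as described above:

import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(2.0, 1.5, size=100)
n, xbar = len(x), x.mean()

# priors: theta ~ N(mu0, tau0^2), sigma^2 ~ InvGa(a0, b0)
mu0, tau0_sq, a0, b0 = 0.0, 100.0, 2.0, 2.0

theta, sigma2 = 0.0, 1.0            # starting values
burn_in, n_iter = 1000, 6000
draws = []

for t in range(n_iter):
    # theta | sigma^2, X ~ Normal (precision-weighted mean and variance)
    prec = n / sigma2 + 1.0 / tau0_sq
    mean = (n / sigma2 * xbar + mu0 / tau0_sq) / prec
    theta = rng.normal(mean, np.sqrt(1.0 / prec))

    # sigma^2 | theta, X ~ InvGa(n/2 + a0, sum((x - theta)^2)/2 + b0)
    shape = n / 2 + a0
    rate = np.sum((x - theta) ** 2) / 2 + b0
    sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)

    if t >= burn_in:
        draws.append((theta, sigma2))

draws = np.array(draws)
print("posterior means of (theta, sigma^2):", draws.mean(axis=0))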
Revisit the Gaussian Mixture Model
• EM for MAP
• Collapsed Gibbs sampling
• Chinese restaurant process, nonparametric clustering
A Gaussian Mixture Model
Suppose the data x1, x2, . . . , xn are iid from
w N(µ1, σ1²) + (1 − w) N(µ2, σ2²).
For each xi, we introduce a latent variable Zi indicating which component xi is generated from, with
P(Zi = 1) = w,   P(Zi = 2) = 1 − w.
The parameters of interest are θ = (w, µ1, µ2, σ1², σ2²), and their prior distributions are specified as follows:
w ∼ Beta(1, 1),   µ1, µ2 ∼ N(0, τ²),   σ1², σ2² ∼ InvGa(α, β).
EM for MAP
The MAP estimate is defined to be
θ̂ = argmax_θ P(x | θ) π(θ) = argmax_θ [ Σ_z P(x, z | θ) ] π(θ).
We can use the EM algorithm. The complete-data log-posterior is
log p(x, Z | θ) π(θ)
= Σ_i 1(Zi = 1) [ log φ_{µ1, σ1²}(xi) + log w ]
+ Σ_i 1(Zi = 2) [ log φ_{µ2, σ2²}(xi) + log(1 − w) ]
+ log π(w) + log π(µ1) + log π(µ2) + log π(σ1²) + log π(σ2²).
Recall the EM algorithm for the MLE:
• At the E-step, we replace 1(Zi = 1) and 1(Zi = 2) by their expectations, i.e., the probability of Zi = 1 or 2 conditioning on the data x and the current estimate of the parameter θ0:
γi = P(Zi = 1 | xi, θ0) = w φ_{µ1, σ1²}(xi) / [ w φ_{µ1, σ1²}(xi) + (1 − w) φ_{µ2, σ2²}(xi) ];
• at the M-step, we update θ;
• and we iterate between the E and M steps until convergence.
For MAP, the E-step is the same; the M-step is slightly different (see the sketch below):
• Without the Beta prior, we would update w by γ+/n, where γ+ = Σ_i γi. But with the Beta prior on w, we need to add the pseudo-counts.
• Similarly, without the prior, we would update µ1 by
(1/γ+) Σ_i γi xi,
but with the prior, we would update µ1 by a weighted average of the value above and the prior mean for µ1.
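A rough sketch of these MAP-EM updates for the two-component model (NumPy assumed). The M-step formulas below are the standard MAP updates implied by the stated priors, written for a generic Beta(a, b) prior on w since Beta(1, 1) adds no pseudo-counts at the mode; all hyperparameter values are illustrative:

import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 50)])   # toy data
n = len(x)

# priors (illustrative): w ~ Beta(a, b), mu_k ~ N(0, tau2), sigma_k^2 ~ InvGa(alpha, beta)
a, b, tau2, alpha, beta = 2.0, 2.0, 100.0, 2.0, 2.0

w, mu1, mu2, s1, s2 = 0.5, -1.0, 1.0, 1.0, 1.0   # starting values

for _ in range(200):
    # E-step: responsibilities gamma_i = P(Z_i = 1 | x_i, current theta)
    p1 = w * normal_pdf(x, mu1, s1)
    p2 = (1 - w) * normal_pdf(x, mu2, s2)
    g = p1 / (p1 + p2)

    # M-step with the priors folded in:
    # Beta(a, b) prior adds (a - 1) and (b - 1) pseudo-counts to the two components
    w = (g.sum() + a - 1) / (n + a + b - 2)
    # N(0, tau2) prior shrinks each mean toward 0 (precision-weighted average)
    mu1 = (g * x).sum() / s1 / (g.sum() / s1 + 1 / tau2)
    mu2 = ((1 - g) * x).sum() / s2 / ((1 - g).sum() / s2 + 1 / tau2)
    # InvGa(alpha, beta) prior adds 2*beta to the SS and 2*alpha + 2 to the counts
    s1 = ((g * (x - mu1) ** 2).sum() + 2 * beta) / (g.sum() + 2 * alpha + 2)
    s2 = (((1 - g) * (x - mu2) ** 2).sum() + 2 * beta) / ((1 - g).sum() + 2 * alpha + 2)

print("MAP estimates (w, mu1, mu2, s1, s2):", w, mu1, mu2, s1, s2)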
Gibbs Sampling from the Posterior Distribution
π(θ, Z | x) ∝ ∏_{i=1}^{n} [ p(Zi | θ) p(xi | Zi, θ) ] × π(w) π(µ1) π(µ2) π(σ1²) π(σ2²).
In a Gibbs sampler, we iteratively sample each element of
(w, µ1, µ2, σ1², σ2², Z1, . . . , Zn)
conditioning on the other elements being fixed.
For example,
• How to sample Zi? Bernoulli:
P(Zi = 1 | x, Z[−i], others) ∝ P(Zi = 1 | θ) × P(xi | Zi = 1, θ) = w p(xi | µ1, σ1²).
• How to sample µ1? Normal:
π(µ1 | x, Z, others) ∝ [ ∏_{i: Zi=1} P(xi | µ1, σ1²) ] × π(µ1).
For a general Gaussian mixture model with K components, we put the prior on the mixing weights as
w = (w1, . . . , wK) ∼ Dir(α/K, . . . , α/K),
and, again, normal priors on the µk's and InvGa priors on the σk²'s.
The Gibbs sampler iterates the following steps (see the sketch below):
1. Draw Zi from a multinomial distribution, for i = 1, . . . , n;
2. Draw w from the Dirichlet;
3. Draw µ1, . . . , µK from normals;
4. Draw σ1², . . . , σK² from InvGa.
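A sketch of this four-step sampler for one-dimensional data (NumPy assumed; hyperparameters and run length are illustrative, and in practice one would save post-burn-in draws rather than only the final state):

import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-3, 1, 120), rng.normal(2, 1, 80)])   # toy data
n, K = len(x), 3

# priors (illustrative): w ~ Dir(alpha/K, ...), mu_k ~ N(0, tau2), sigma_k^2 ~ InvGa(a0, b0)
alpha, tau2, a0, b0 = 1.0, 100.0, 2.0, 2.0

w = np.full(K, 1.0 / K)
mu = rng.normal(0, 1, K)
s2 = np.ones(K)
z = rng.integers(0, K, n)

for t in range(500):
    # 1. Z_i | rest: prob_k proportional to w_k * N(x_i | mu_k, sigma_k^2)
    probs = w * normal_pdf(x[:, None], mu[None, :], s2[None, :])
    probs /= probs.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=p) for p in probs])

    # 2. w | Z ~ Dir(alpha/K + n_1, ..., alpha/K + n_K)
    counts = np.bincount(z, minlength=K)
    w = rng.dirichlet(alpha / K + counts)

    for k in range(K):
        xk = x[z == k]
        # 3. mu_k | rest ~ Normal (precision-weighted, prior mean 0)
        prec = len(xk) / s2[k] + 1 / tau2
        mu[k] = rng.normal(xk.sum() / s2[k] / prec, np.sqrt(1 / prec))
        # 4. sigma_k^2 | rest ~ InvGa(n_k/2 + a0, SS_k/2 + b0)
        shape = len(xk) / 2 + a0
        rate = np.sum((xk - mu[k]) ** 2) / 2 + b0
        s2[k] = 1.0 / rng.gamma(shape, 1.0 / rate)

print("mixing weights:", w)
print("means:", mu)
print("variances:", s2)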
Collapsed Gibbs Sampling
The mixing weights w are used in sampling the Zi's:
P(Zi = k | x, z[−i], θ) ∝ P(xi | Zi = k, θ) P(Zi = k | z[−i], θ),
where the second term is equal to P(Zi = k | z[−i], w) = P(Zi = k | w), i.e., when conditioning on w, the Zi's are independent.
Let’s eliminate w from the parameter list, i.e., integrate over w. So we
need to compute
P (Zi = k | z[�i]) =P (z1, . . . , zi�1, Zi = k, zi+1, . . . , zn)
P (z1, . . . , zi�1, zi+1, . . . , zn)
Note that Zi’s are exchangeable (its meaning will be made clearly in class).
So it su�ces to compute the sampling distribution for the last observation.
P(Zn = k | z1, . . . , z_{n−1}) = P(Zn = k, z1, . . . , z_{n−1}) / P(z1, . . . , z_{n−1})
= ∫ w1^{n1+α/K−1} · · · wk^{nk+1+α/K−1} · · · wK^{nK+α/K−1} dw / ∫ w1^{n1+α/K−1} · · · wk^{nk+α/K−1} · · · wK^{nK+α/K−1} dw
= [ Γ(n − 1 + α) / Γ(n − 1 + α + 1) ] · [ Γ(nk + α/K + 1) / Γ(nk + α/K) ]
= (nk + α/K) / (n − 1 + α),
where nk = #{ j : zj = k, j = 1, . . . , n − 1 }.
The collapsed Gibbs sampler iterates the following steps (see the sketch below):
1. Draw Zi from a Multinomial(γi1, . . . , γiK), for i = 1, . . . , n, where
γik ∝ ( nk^{(i)} + α/K ) / ( n − 1 + α ) × P(xi | µk, σk²),
with nk^{(i)} the number of observations, excluding xi, currently assigned to cluster k;
2. No need to sample w from the Dirichlet;
3. Draw µ1, . . . , µK from normals;
4. Draw σ1², . . . , σK² from InvGa.
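A sketch of the collapsed update for the Zi's (NumPy assumed; toy data and settings illustrative). The (n − 1 + α) denominator is dropped because it is constant in k and the probabilities are renormalized; the Normal/InvGa updates for (µk, σk²) are the same as in the uncollapsed sampler and are only indicated in a comment:

import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-3, 1, 120), rng.normal(2, 1, 80)])
n, K, alpha = len(x), 3, 1.0

mu = rng.normal(0, 1, K)            # component parameters; update them between
s2 = np.ones(K)                     # sweeps exactly as in the previous sampler
z = rng.integers(0, K, n)
counts = np.bincount(z, minlength=K)

for t in range(200):
    for i in range(n):
        counts[z[i]] -= 1                       # remove x_i from its current cluster
        # P(Z_i = k | z_[-i], theta) proportional to (n_k^(i) + alpha/K) * N(x_i | mu_k, s2_k)
        probs = (counts + alpha / K) * normal_pdf(x[i], mu, s2)
        probs /= probs.sum()
        z[i] = rng.choice(K, p=probs)
        counts[z[i]] += 1
    # ... then update (mu_k, s2_k) from their Normal/InvGa conditionals;
    # w itself is never sampled.

print("cluster sizes:", counts)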
Infinitely Many Clusters
What if K is a large value (> n)?
At the t-th iteration of the algorithm, there must then be some empty clusters. Let's re-label the clusters: 1 to K* are the non-empty clusters (of course, the value of K* changes from iteration to iteration), and the remaining ones are empty.
• Update {µk, σk²}_{k=1}^{K*} from the posterior (Normal/InvGa);
• Update {µk, σk²}_{k=K*+1}^{K} from the prior (Normal/InvGa).
• Update Zi from a Multinomial(γi1, . . . , γiK), where
γik ∝ ( nk^{(i)} + α/K ) / ( n − 1 + α ) × P(xi | µk, σk²).
Zi may start a new cluster, i.e., Zi > K*.
It doesn't matter which value Zi takes from (K* + 1, . . . , K), since we can always label this new cluster as the (K* + 1)-th cluster, and then immediately update (µ_{K*+1}, σ²_{K*+1}) based on the corresponding posterior (conditioning on xi).
What’t the chance that Zi starts a new cluster?
KX
k=K⇤+1
�ik =
↵/K
n� 1 + ↵
KX
k=K⇤+1
P (xi | µk,�2k)
=
↵
K�K⇤
K
n� 1 + ↵
"1
K �K
⇤
KX
k=K⇤+1
P (xi | µk,�2k)
#
! ↵
n� 1 + ↵
ZZP (xi | µ,�2
)⇡(µ,�
2)dµd�
2,
when K ! 1, where
ZZP (xi | µ,�2
)⇡(µ,�
2)dµd�
2= m(xi)
is the integrated (wrt to our prior ⇡) likelihood for a sample xi.
Clustering with the Chinese Restaurant Process (CRP)
A mixture model:
Xi | Zi = k ∼ P(· | µk, σk²),   i = 1, . . . , n.
Prior on Z1, . . . , Zn and (µk, σk²)_{k=1}^{K} (known as the Chinese restaurant process):
• Z1 = 1 and (µ1, σ1²) ∼ π(·);
• for i ≥ 1, suppose Z1, . . . , Zi form m clusters with sizes n1, . . . , nm; then
P(Z_{i+1} = k | Z1, . . . , Zi) = nk / (i + α),   k = 1, . . . , m;
P(Z_{i+1} = m + 1 | Z1, . . . , Zi) = α / (i + α),   with (µ_{m+1}, σ²_{m+1}) ∼ π(·).
Alternatively, you can describe the prior first and then the likelihood, which gives you a clear idea of how the data are generated (a small simulation sketch follows this description):
• Set Z1 = 1, generate (µ1, σ1²) ∼ π(·) and X1 ∼ P(· | µ1, σ1²).
• Loop over i = 1, . . . , n − 1: suppose the previous i samples form m clusters with cluster-specific parameters (µk, σk²)_{k=1}^{m}; then
P(Z_{i+1} = k | Z1, . . . , Zi) = nk / (i + α),   k = 1, . . . , m;
P(Z_{i+1} = m + 1 | Z1, . . . , Zi) = α / (i + α).
If Z_{i+1} = m + 1, generate (µ_{m+1}, σ²_{m+1}) ∼ π(·). Then generate Xi+1 ∼ P(· | µk, σk²) with k = Z_{i+1}.
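A small simulation of this generative process for the cluster labels (NumPy assumed; drawing the cluster-specific (µ, σ²) from π(·) is only indicated in a comment):

import numpy as np

rng = np.random.default_rng(8)
alpha, n = 1.0, 20

z = [1]                    # customer 1 sits at table 1
counts = {1: 1}            # table label -> number of customers

for i in range(1, n):
    tables = sorted(counts)
    # existing table k with prob n_k / (i + alpha); a new table with prob alpha / (i + alpha)
    probs = np.array([counts[k] for k in tables] + [alpha]) / (i + alpha)
    pick = rng.choice(len(probs), p=probs)
    if pick == len(tables):                    # start a new cluster
        new = max(tables) + 1                  # ... and draw (mu, sigma^2) ~ pi(.) here
        z.append(new)
        counts[new] = 1
    else:
        k = tables[pick]
        z.append(k)
        counts[k] += 1

print("cluster labels:", z)
print("cluster sizes :", counts)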
Advantages
• We do not need to specify K.
• K is treated as a random variable, and its (posterior) distribution is
learned from the data.
• Can model unseen data: for any new sample X*, there is always a positive chance that it can start a new cluster.
Exchangeability of Zi’s
In CRP, the labels Zi’s are generated sequentially, but in fact they are
exchangeable (up to a permutation of the cluster labels – labels should start
from 1, 2, ...)
P (11122) = P (12121) = P (12221)
=
↵
2(2!)(1!)
(1 + ↵)(2 + ↵) · · · (4 + ↵)
.
In general, suppose z1, . . . , zn form m clusters with size n1, . . . , nm, then
P (z1, . . . , zn) =↵
m(n1 � 1)! · · · (nm � 1)!Qn
i=2(i� 1 + ↵)
.
So the order of zi’s doesn’t matter; what matters is the partition of the n
samples implied by zi’s.
Posterior Sampling for Clustering with CRP
Same as the Gibbs sampler we have derived for K → ∞. At the t-th iteration, repeat the following:
• Suppose Z[−i] forms K clusters, labeled from 1 to K, of sizes nk^{(i)}. Sample Zi from a multinomial with
P(Zi = k) ∝ nk^{(i)} / (n − 1 + α) × P(xi | µk, σk²),   k = 1, . . . , K,
P(Zi = K + 1) ∝ α / (n − 1 + α) × m(xi),
where m(xi) = ∫∫ P(xi | µ, σ²) π(µ, σ²) dµ dσ².
• Update {µk, σk²} from the posterior (Normal/InvGa).
• The exchangeability of the Zi's plays an important role in the algorithm. Where do we use this property?
• The marginal likelihood m(·) is easy to compute if the prior π(µ, σ²) is conjugate; otherwise, we need to figure out a way to compute m(·).
• For other MCMC algorithms, check Neal (2000); for Variational Bayes, check Blei and Jordan (2004).
• The "ugly" side: the labeling (label-switching) issue.
Non-parametric Bayesian (NB) Models
• The finite mixture model:
Xi | Zi = k ∼ P_{θ*k},   P(Zi = k) = wk,
θ*k iid ∼ G0,   w ∼ Dir(α/K, . . . , α/K).
• Alternatively,
Xi | θi ∼ P_{θi},   θi | G ∼ G,
G(·) = Σ_{k=1}^{K} wk δ_{θ*k}(·),   w ∼ Dir(α/K, . . . , α/K),   θ*k ∼ G0.
The prior here restricts G to be a discrete distribution with K atoms. In the NB approach, we'll drop this restriction.
• A NB approach for clustering:
Xi | θi ∼ P_{θi},   θi | G ∼ G,   G ∼ DP(α, G0),
where DP(α, G0) denotes a Dirichlet process with a scale (precision) parameter α and a base measure G0.
Dirichlet Process (DP)
θi | G iid ∼ G,   G ∼ DP(α, G0).
• Define the DP as a distribution over distributions (Ferguson, 1973).
• Describe the DP as a stick-breaking process (Sethuraman, 1994).
• If we integrate over G (w.r.t. the DP), the resulting prior on (θ1, . . . , θn),
π(θ1, . . . , θn) = ∫ ∏_{i=1}^{n} G(θi) dP(G),   with P = DP(α, G0),
is the Chinese restaurant process (CRP).
Dirichlet Process
G ∼ DP(· | G0, α). OK, but what does it look like?
Samples from a DP are discrete with probability one:
G(·) = Σ_{k=1}^{∞} πk δ_{θk}(·),
where δ_{θk}(·) is a Dirac delta (point mass) at θk, and θk ∼ G0(·).
Note: E(G) = G0.
As α → ∞, G looks more like G0.
Dirichlet Processes: Stick-Breaking Representation
G ∼ DP(· | G0, α). Samples G from a DP can be represented as follows:
G(·) = Σ_{k=1}^{∞} πk δ_{θk}(·),
where θk ∼ G0(·), Σ_{k=1}^{∞} πk = 1,
πk = βk ∏_{j=1}^{k−1} (1 − βj),
and
βk ∼ Beta(· | 1, α).
[Figure: densities of the stick-breaking proportions βk ∼ Beta(1, α) for α = 1, 10, and 0.5.]
(Sethuraman, 1994)
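A sketch of this stick-breaking construction, truncated at a large finite K for illustration (NumPy assumed; the base measure G0 = N(0, 1) is a placeholder):

import numpy as np

rng = np.random.default_rng(9)
alpha, K_trunc = 2.0, 1000          # truncate the infinite sum for illustration

# beta_k ~ Beta(1, alpha); pi_k = beta_k * prod_{j < k} (1 - beta_j)
b = rng.beta(1.0, alpha, size=K_trunc)
remaining = np.concatenate([[1.0], np.cumprod(1 - b)[:-1]])
w = b * remaining

# atoms theta_k ~ G0; here G0 = N(0, 1) purely as an illustration
theta = rng.normal(0.0, 1.0, size=K_trunc)

print("weights sum to ~1:", w.sum())
print("atoms with the five largest weights:", theta[np.argsort(w)[-5:]])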
Chinese Restaurant Process
[Figure: customers seated at tables, each table sharing a parameter θk, illustrating the Chinese restaurant metaphor.]
Generating from a CRP: customer 1 enters the restaurant and sits at table 1; φ1 = θ1 where θ1 ∼ G0; K = 1, n = 1, n1 = 1.
For n = 2, . . . :
  customer n sits at table k with prob nk / (n − 1 + α) for k = 1, . . . , K, or at table K + 1 with prob α / (n − 1 + α) (a new table);
  if a new table was chosen, then draw θ_{K+1} ∼ G0 and set K ← K + 1;
  set φn to the θk of the table k that customer n sat at; set nk ← nk + 1.
End for.
The resulting conditional distribution over φn:
φn | φ1, . . . , φ_{n−1}, G0, α ∼ [ α / (n − 1 + α) ] G0(·) + Σ_{k=1}^{K} [ nk / (n − 1 + α) ] δ_{θk}(·).