Machine Learning and Nonparametric Bayes
Tamara Broderick, ITT Career Development Assistant Professor, Electrical Engineering & Computer Science, MIT
Peter Orbanz, Assistant Professor, Statistics, Columbia
Nonparametric Bayes
• Bayesian statistics that is not parametric
• Bayesian: $P(\text{parameters} \mid \text{data}) \propto P(\text{data} \mid \text{parameters})\, P(\text{parameters})$
• Not parametric (i.e. not finite parameter; unbounded/growing/infinite number of parameters)
[Figures: “Wikipedia phenomenon” [wikipedia.org]; further examples credited to [Time Mag], [Escobar, West 1995; Ghosal et al 1999], [Arjas, Gasbarra 1994], [Fox et al 2014], [Ewens 1972; Hartl, Clark 2003], [Saria et al 2010], [Lloyd et al 2012; Miller et al 2010], [Sudderth, Jordan 2009]]
Nonparametric Bayes
• A theoretical motivation: De Finetti’s Theorem
• A data sequence $X_1, X_2, \ldots$ is infinitely exchangeable if the distribution of any N data points doesn’t change when permuted:
  $p(X_1, \ldots, X_N) = p(X_{\sigma(1)}, \ldots, X_{\sigma(N)})$
• De Finetti’s Theorem (roughly): A sequence is infinitely exchangeable if and only if, for all N and some distribution P:
  $p(X_1, \ldots, X_N) = \int_\theta \prod_{n=1}^{N} p(X_n \mid \theta)\, P(d\theta)$
  [Hewitt, Savage 1955; Aldous 1983]
• Motivates:
  • Parameters and likelihoods
  • Priors
  • “Nonparametric Bayesian” priors
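A small numerical check of the mixture representation may help here. The sketch below assumes binary data with $p(X_n = 1 \mid \theta) = \theta$ and takes P to be a Beta(a1, a2) distribution; both choices, and the function names, are illustrative and not from the slides. It computes the joint probability one observation at a time using the standard Beta-Bernoulli predictive rule, and compares every ordering of the data against the closed form of the integral.

```python
import math
from itertools import permutations


def log_beta(a, b):
    """Log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def marginal_prob(xs, a1=1.0, a2=1.0):
    """Closed form of the integral: B(a1 + #ones, a2 + #zeros) / B(a1, a2)."""
    ones = sum(xs)
    zeros = len(xs) - ones
    return math.exp(log_beta(a1 + ones, a2 + zeros) - log_beta(a1, a2))


def sequential_prob(xs, a1=1.0, a2=1.0):
    """Same joint probability, built up as a product of one-step predictives."""
    ones = zeros = 0
    prob = 1.0
    for x in xs:
        total = a1 + a2 + ones + zeros
        if x == 1:
            prob *= (a1 + ones) / total   # p(x_n = 1 | earlier data)
            ones += 1
        else:
            prob *= (a2 + zeros) / total  # p(x_n = 0 | earlier data)
            zeros += 1
    return prob


xs = (1, 0, 0, 1, 1)
# Every ordering of the same data should receive the same probability.
gaps = [abs(sequential_prob(p) - marginal_prob(xs)) for p in permutations(xs)]
print(max(gaps))
```

The joint probability depends on the data only through the counts of ones and zeros, which is why permuting the sequence leaves it unchanged.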
Outline
• Dirichlet process
• Background for intuition
• Generative model
• What does a growing/infinite number of parameters really mean (in Nonparametric Bayes)?
• Chinese restaurant process
• Inference
• Venture further into the wild world of Nonparametric Bayesian statistics
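Generative model
$P(\text{parameters} \mid \text{data}) \propto P(\text{data} \mid \text{parameters})\, P(\text{parameters})$
• Finite Gaussian mixture model (K = 2 clusters)
  $\rho_1 \sim \mathrm{Beta}(a_1, a_2)$, $\rho_2 = 1 - \rho_1$
  $\mu_k \overset{iid}{\sim} \mathcal{N}(\mu_0, \Sigma_0)$
  $z_n \overset{iid}{\sim} \mathrm{Categorical}(\rho_1, \rho_2)$
  $x_n \overset{indep}{\sim} \mathcal{N}(\mu_{z_n}, \Sigma)$
  [Figure: cluster frequencies $\rho_1$, $\rho_2$]
• Don’t know $\mu_1, \mu_2$
• Don’t know $\rho_1, \rho_2$
• Inference goal: assignments of data points to clusters, cluster parameters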
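The generative model above can be read as a sampling recipe. The following is a minimal sketch in one dimension, assuming scalar standard deviations `sigma0` and `sigma` in place of the covariances $\Sigma_0$ and $\Sigma$; the function name, default values, and the 0/1 labelling of the two clusters are illustrative assumptions.

```python
import random


def sample_two_cluster_mixture(n, a1=1.0, a2=1.0, mu0=0.0,
                               sigma0=5.0, sigma=1.0, seed=0):
    """Draw (rho, mus, z, x) from the K = 2 Gaussian mixture, in 1-D."""
    rng = random.Random(seed)
    rho1 = rng.betavariate(a1, a2)                     # rho_1 ~ Beta(a_1, a_2)
    rho = (rho1, 1.0 - rho1)                           # rho_2 = 1 - rho_1
    mus = [rng.gauss(mu0, sigma0) for _ in range(2)]   # mu_k ~ N(mu_0, sigma0^2)
    # z_n ~ Categorical(rho_1, rho_2); label 0 is cluster 1, label 1 is cluster 2
    z = [0 if rng.random() < rho1 else 1 for _ in range(n)]
    x = [rng.gauss(mus[z_n], sigma) for z_n in z]      # x_n ~ N(mu_{z_n}, sigma^2)
    return rho, mus, z, x


rho, mus, z, x = sample_two_cluster_mixture(n=10)
print(rho, mus)
print(list(zip(z, x)))
```

Inference runs this recipe in reverse: only `x` is observed, and the goal is the posterior over `z`, `mus`, and `rho`.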
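Beta distribution review
$\mathrm{Beta}(\rho_1 \mid a_1, a_2) = \frac{\Gamma(a_1 + a_2)}{\Gamma(a_1)\Gamma(a_2)}\, \rho_1^{a_1 - 1} (1 - \rho_1)^{a_2 - 1}$, with $a_1, a_2 > 0$ and $\rho_1 \in (0, 1)$
• Gamma function $\Gamma$
  • integer m: $\Gamma(m + 1) = m!$
  • for x > 0: $\Gamma(x + 1) = x\,\Gamma(x)$
• What happens?
  • $a = a_1 = a_2 \to 0$
  • $a = a_1 = a_2 \to \infty$
  • $a_1 > a_2$
  [Figure: density against $\rho_1$] [demo]
• Beta is conjugate to Cat
  $\rho_1 \sim \mathrm{Beta}(a_1, a_2)$, $z \sim \mathrm{Cat}(\rho_1, \rho_2)$
  $p(\rho_1, z) \propto \rho_1^{1\{z=1\}} (1 - \rho_1)^{1\{z=2\}}\, \rho_1^{a_1 - 1} (1 - \rho_1)^{a_2 - 1}$
  $p(\rho_1 \mid z) \propto \rho_1^{a_1 + 1\{z=1\} - 1} (1 - \rho_1)^{a_2 + 1\{z=2\} - 1} \propto \mathrm{Beta}(\rho_1 \mid a_1 + 1\{z = 1\},\, a_2 + 1\{z = 2\})$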
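The “What happens?” cases can be explored numerically. This sketch evaluates the Beta density formula above at a few points; the function name `beta_pdf` and the particular parameter values are illustrative assumptions, not the demo used in the talk.

```python
import math


def beta_pdf(rho, a1, a2):
    """Beta(rho | a1, a2) density for rho in (0, 1), computed via log-Gamma."""
    log_norm = math.lgamma(a1 + a2) - math.lgamma(a1) - math.lgamma(a2)
    log_kernel = (a1 - 1.0) * math.log(rho) + (a2 - 1.0) * math.log(1.0 - rho)
    return math.exp(log_norm + log_kernel)


# Small symmetric a, uniform, large symmetric a, and a1 > a2.
for a1, a2 in [(0.1, 0.1), (1.0, 1.0), (50.0, 50.0), (8.0, 2.0)]:
    row = [round(beta_pdf(r, a1, a2), 3) for r in (0.05, 0.5, 0.95)]
    print((a1, a2), row)
```

For small symmetric a the density piles up near 0 and 1, for large symmetric a it concentrates around 1/2, and for a1 > a2 it shifts toward 1.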
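Generative model
$P(\text{parameters} \mid \text{data}) \propto P(\text{data} \mid \text{parameters})\, P(\text{parameters})$
• Finite Gaussian mixture model (K clusters)
  $\rho_{1:K} \sim \mathrm{Dirichlet}(a_{1:K})$
  $\mu_k \overset{iid}{\sim} \mathcal{N}(\mu_0, \Sigma_0)$
  $z_n \overset{iid}{\sim} \mathrm{Categorical}(\rho_{1:K})$
  $x_n \overset{indep}{\sim} \mathcal{N}(\mu_{z_n}, \Sigma)$
  [Figure: cluster frequencies $\rho_1$, $\rho_2$, $\rho_3$]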
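With K clusters the only new ingredient is the Dirichlet draw. The sketch below obtains it by normalizing independent Gamma(a_k, 1) variables, a standard construction that is not stated on the slides. It again works in one dimension with scalar standard deviations, and all names and default values are illustrative assumptions.

```python
import random


def sample_dirichlet(rng, a):
    """Draw rho ~ Dirichlet(a) by normalizing independent Gamma(a_k, 1) draws."""
    g = [rng.gammavariate(a_k, 1.0) for a_k in a]
    total = sum(g)
    return [g_k / total for g_k in g]


def sample_mixture(n, a, mu0=0.0, sigma0=5.0, sigma=1.0, seed=0):
    """Draw (rho, mus, z, x) from the K-cluster Gaussian mixture, in 1-D."""
    rng = random.Random(seed)
    k = len(a)
    rho = sample_dirichlet(rng, a)                     # rho_{1:K} ~ Dirichlet(a_{1:K})
    mus = [rng.gauss(mu0, sigma0) for _ in range(k)]   # mu_k ~ N(mu_0, sigma0^2)
    z = rng.choices(range(k), weights=rho, k=n)        # z_n ~ Categorical(rho_{1:K})
    x = [rng.gauss(mus[z_n], sigma) for z_n in z]      # x_n ~ N(mu_{z_n}, sigma^2)
    return rho, mus, z, x


rho, mus, z, x = sample_mixture(n=20, a=[1.0, 1.0, 1.0])
print(rho)
```

Calling it with `a=[a1, a2]` gives the K = 2 model, since a two-component Dirichlet is a Beta distribution on the first component.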
Dirichlet distribution review
$\mathrm{Dirichlet}(\rho_{1:K} \mid a_{1:K}) = \frac{\Gamma\left(\sum_{k=1}^{K} a_k\right)}{\prod_{k=1}^{K} \Gamma(a_k)} \prod_{k=1}^{K} \rho_k^{a_k - 1}$, with $a_k > 0$, $\rho_k \in (0, 1)$, $\sum_k \rho_k = 1$
• What happens?
  • $a = a_k = 1$
  • $a = a_k \to 0$
  • $a = a_k \to \infty$
  [Figure: density over ($\rho_1$, $\rho_2$) for a = (0.5, 0.5, 0.5), a = (5, 5, 5), a = (40, 10, 10)] [demo]
• Dirichlet is conjugate to Categorical
  $\rho_{1:K} \sim \mathrm{Dirichlet}(a_{1:K})$, $z \sim \mathrm{Cat}(\rho_{1:K})$
  $\rho_{1:K} \mid z \overset{d}{=} \mathrm{Dirichlet}(a'_{1:K})$, $a'_k = a_k + 1\{z = k\}$
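Applying the conjugate update once per observation simply adds the label counts to the parameters. A minimal sketch follows, assuming labels are coded 0, ..., K-1 and using hypothetical helper names; the posterior mean uses the standard fact that a Dirichlet(a) has mean a_k / sum(a).

```python
def dirichlet_posterior(a, zs):
    """Return a' with a'_k = a_k + #{n : z_n = k}, for labels in {0, ..., K-1}."""
    a_post = list(a)
    for z in zs:
        a_post[z] += 1.0   # one conjugate update per observed label
    return a_post


def dirichlet_mean(a):
    """Mean of Dirichlet(a): component k has expectation a_k / sum(a)."""
    total = sum(a)
    return [a_k / total for a_k in a]


a = [0.5, 0.5, 0.5]
zs = [0, 0, 2, 0, 1, 0]          # four draws of label 0, one each of 1 and 2
a_post = dirichlet_posterior(a, zs)
print(a_post)                    # [4.5, 1.5, 1.5]
print(dirichlet_mean(a_post))
```

The posterior mean works out to roughly (0.6, 0.2, 0.2), pulled from the symmetric prior toward the observed frequencies.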
What if K > N?
• e.g. species sampling, topic modeling, groups on a social network, etc.
  [Figure: component frequencies $\rho_1, \rho_2, \rho_3, \ldots, \rho_{1000}$]
• Components: number of latent groups
• Clusters: number of components represented in the data
• [demo 1, demo 2]
• Number of clusters for N data points is < K and random
• Number of clusters grows with N
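The distinction between components and clusters can be seen in a short simulation. This sketch assumes K = 1000 components with a symmetric Dirichlet parameter a = 1 (an arbitrary choice) and counts how many distinct components appear among the first N labels; the names are illustrative and this is not the demo from the talk.

```python
import random


def sample_dirichlet(rng, a):
    """Draw rho ~ Dirichlet(a) by normalizing independent Gamma(a_k, 1) draws."""
    g = [rng.gammavariate(a_k, 1.0) for a_k in a]
    total = sum(g)
    return [g_k / total for g_k in g]


def cluster_counts(k=1000, a=1.0, ns=(10, 100, 1000), seed=0):
    """Number of distinct components among the first N labels, for each N in ns."""
    rng = random.Random(seed)
    rho = sample_dirichlet(rng, [a] * k)               # K component frequencies
    z = rng.choices(range(k), weights=rho, k=max(ns))  # one label per data point
    return {n: len(set(z[:n])) for n in ns}


counts = cluster_counts()
print(counts)
```

Because the counts are taken over nested prefixes of one label sequence, the number of clusters can only stay the same or grow as N increases, and it never exceeds min(N, K).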
Choosing K = ∞• Here, difficult to choose finite K in advance (contrast with
small K): don’t know K, difficult to infer, streaming data • How to generate K = ∞ strictly positive frequencies that
sum to one? • Observation: ⇢1:K ⇠ Dirichlet(a1:K)
?? (⇢2,...,⇢K)1�⇢1
d= Dirichlet(a2, . . . , aK)) ⇢1
d= Beta(a1,
KX
k=1
ak � a1)
V1 ⇠ Beta(a1, a2 + a3 + a4) ⇢1 = V1
V2 ⇠ Beta(a2, a3 + a4) ⇢2 = (1� V1)V2
V3 ⇠ Beta(a3, a4)
• “Stick breaking”
9
Choosing K = ∞• Here, difficult to choose finite K in advance (contrast with
small K): don’t know K, difficult to infer, streaming data • How to generate K = ∞ strictly positive frequencies that
sum to one? • Observation: ⇢1:K ⇠ Dirichlet(a1:K)
?? (⇢2,...,⇢K)1�⇢1
d= Dirichlet(a2, . . . , aK)) ⇢1
d= Beta(a1,
KX
k=1
ak � a1)
V1 ⇠ Beta(a1, a2 + a3 + a4) ⇢1 = V1
V2 ⇠ Beta(a2, a3 + a4) ⇢2 = (1� V1)V2
V3 ⇠ Beta(a3, a4) ⇢3 = (1� V1)(1� V2)V3
• “Stick breaking”
9
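Choosing K = ∞
• Here, difficult to choose finite K in advance (contrast with small K): don’t know K, difficult to infer, streaming data
• How to generate K = ∞ strictly positive frequencies that sum to one?
• Observation: $\rho_{1:K} \sim \mathrm{Dirichlet}(a_{1:K})$
  $\Rightarrow \rho_1 \overset{d}{=} \mathrm{Beta}\left(a_1, \sum_{k=1}^{K} a_k - a_1\right) \perp\!\!\!\perp \frac{(\rho_2, \ldots, \rho_K)}{1 - \rho_1} \overset{d}{=} \mathrm{Dirichlet}(a_2, \ldots, a_K)$
• “Stick breaking”
  $V_1 \sim \mathrm{Beta}(a_1, a_2 + a_3 + a_4)$, $\rho_1 = V_1$
  $V_2 \sim \mathrm{Beta}(a_2, a_3 + a_4)$, $\rho_2 = (1 - V_1) V_2$
  $V_3 \sim \mathrm{Beta}(a_3, a_4)$, $\rho_3 = (1 - V_1)(1 - V_2) V_3$
  $\rho_4 = 1 - \sum_{k=1}^{3} \rho_k$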
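The finite stick-breaking recipe above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `dirichlet_stick_breaking` and the `rng` argument are hypothetical choices made here, not part of the slides. It extends the four-component example to any K ≥ 2 by drawing $V_k \sim \mathrm{Beta}(a_k, \sum_{j>k} a_j)$ and giving the last component whatever stick is left.

```python
import random


def dirichlet_stick_breaking(a, rng=None):
    """Draw rho ~ Dirichlet(a) by stick breaking.

    `a` is a list of K >= 2 positive floats. The name and signature are
    illustrative assumptions, not taken from the slides.
    """
    if rng is None:
        rng = random.Random()
    rho = []
    remaining = 1.0  # length of the stick not yet handed out
    for k in range(len(a) - 1):
        tail = sum(a[k + 1:])               # a_{k+1} + ... + a_K
        v_k = rng.betavariate(a[k], tail)   # V_k ~ Beta(a_k, tail)
        rho.append(remaining * v_k)         # rho_k = prod_{j<k}(1 - V_j) * V_k
        remaining *= 1.0 - v_k
    rho.append(remaining)                   # rho_K = 1 - sum of the others
    return rho
```

Calling `dirichlet_stick_breaking([1.0, 2.0, 3.0, 4.0])` returns four positive frequencies that sum to one, up to floating-point rounding.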
Choosing K = ∞
• Here, difficult to choose finite K in advance (contrast with small K): don’t know K, difficult to infer, streaming data
• How to generate K = ∞ strictly positive frequencies that sum to one?
• Stick-breaking with general parameters:
  $V_1 \sim \mathrm{Beta}(a_1, b_1)$, $\rho_1 = V_1$
  $V_2 \sim \mathrm{Beta}(a_2, b_2)$, $\rho_2 = (1 - V_1) V_2$
  …
  $V_k \sim \mathrm{Beta}(a_k, b_k)$, $\rho_k = \left[\prod_{j=1}^{k-1} (1 - V_j)\right] V_k$
• Dirichlet process stick-breaking: $a_k = 1$, $b_k = \alpha > 0$
  $V_k \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)$, $\rho_k = \left[\prod_{j=1}^{k-1} (1 - V_j)\right] V_k$
• Griffiths-Engen-McCloskey (GEM) distribution: $\rho = (\rho_1, \rho_2, \ldots) \sim \mathrm{GEM}(\alpha)$
[McCloskey 1965; Engen 1975; Patil and Taillie 1977; Ewens 1987; Sethuraman 1994; Ishwaran, James 2001]
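Since $\rho \sim \mathrm{GEM}(\alpha)$ has infinitely many components, any simulation has to stop somewhere. The sketch below is one possible approach, assuming a truncation rule that the slides do not specify: it keeps breaking sticks until the leftover mass drops below a tolerance. The names `gem_truncated`, `tol`, and `max_sticks` are hypothetical.

```python
import random


def gem_truncated(alpha, tol=1e-8, rng=None, max_sticks=100_000):
    """Approximate draw of rho ~ GEM(alpha), truncated once leftover mass <= tol.

    Returns (rho, remaining), where `remaining` is the total mass of all
    components that were never instantiated. Truncation is an assumption
    of this sketch, not part of the GEM definition.
    """
    if rng is None:
        rng = random.Random()
    rho = []
    remaining = 1.0
    while remaining > tol and len(rho) < max_sticks:
        v = rng.betavariate(1.0, alpha)   # V_k ~ Beta(1, alpha), i.i.d.
        rho.append(remaining * v)         # rho_k = prod_{j<k}(1 - V_j) * V_k
        remaining *= 1.0 - v
    return rho, remaining
```

Larger α puts less mass on each break, so more sticks tend to be needed before the leftover mass falls below `tol`.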
Distributions
• Beta → random distribution over $\{1, 2\}$
• Dirichlet → random distribution over $\{1, 2, \ldots, K\}$
• GEM / Dirichlet process stick-breaking → random distribution over $\{1, 2, \ldots\}$
• Dirichlet process → random distribution over $\Phi$:
• Infinity of parameters: components
• Growing number of parameters: clusters
[Figure: diagrams labeled 1 2; 1 2 3 4; 1 2 3 4 …]
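The distinction between components and clusters can be made concrete with a small simulation. The sketch below is an illustration under assumptions, not the authors' code: it draws indicators $z_1, \ldots, z_N$ from $\rho \sim \mathrm{GEM}(\alpha)$ and only instantiates a new stick when a draw lands in the not-yet-represented mass. The function name `draw_indicators` is hypothetical.

```python
import random


def draw_indicators(n, alpha, rng=None):
    """Draw z_1..z_n i.i.d. from rho ~ GEM(alpha), creating sticks lazily.

    Returns (z, rho): 1-based component labels and the finite list of
    components instantiated so far. Illustrative sketch only.
    """
    if rng is None:
        rng = random.Random()
    rho = []          # components instantiated so far
    remaining = 1.0   # mass of all components not yet instantiated
    z = []
    for _ in range(n):
        u = rng.random()  # uniform on [0, 1)
        # Break more sticks until the instantiated mass covers u.
        while 1.0 - remaining <= u:
            v = rng.betavariate(1.0, alpha)
            rho.append(remaining * v)
            remaining *= 1.0 - v
        # Invert the CDF over the instantiated components.
        cum = 0.0
        for k, r in enumerate(rho):
            cum += r
            if u < cum:
                z.append(k + 1)
                break
        else:
            z.append(len(rho))  # guard against floating-point rounding
    return z, rho
```

Here `len(set(z))` is the number of clusters, meaning components that actually appear in the data. It is finite for any N and can only grow as more indicators are drawn, while the underlying ρ always has infinitely many components.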
Exercises
• Prove the beta (Dirichlet) is conjugate to the categorical
• What is the posterior after N data points?
• Suppose $\rho_{1:K} \sim \mathrm{Dirichlet}(a_{1:K})$; prove that
  $\rho_1 \overset{d}{=} \mathrm{Beta}\left(a_1, \sum_{k=1}^{K} a_k - a_1\right) \perp\!\!\!\perp \frac{(\rho_2, \ldots, \rho_K)}{1 - \rho_1} \overset{d}{=} \mathrm{Dirichlet}(a_2, \ldots, a_K)$
• Code your own GEM simulator for ρ; why is this hard?
• Simulate drawing cluster indicators (z) from your ρ
• Compare the number of clusters as N changes in the GEM case with the growth in the K=1000 case
• How does the growth in N change when you change α?
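As a starting point for the first exercise, the conjugacy calculation can be sketched as follows. The notation is an assumption introduced here for illustration: $z_{1:N}$ are categorical draws given $\rho_{1:K}$, and $n_k$ counts the data points equal to $k$.

```latex
\begin{aligned}
p(\rho_{1:K} \mid z_{1:N})
  &\propto p(z_{1:N} \mid \rho_{1:K})\, p(\rho_{1:K}) \\
  &\propto \prod_{k=1}^{K} \rho_k^{n_k} \cdot \prod_{k=1}^{K} \rho_k^{a_k - 1}
   = \prod_{k=1}^{K} \rho_k^{a_k + n_k - 1},
  \qquad n_k = \sum_{n=1}^{N} \mathbf{1}\{z_n = k\} \\
\Rightarrow\quad \rho_{1:K} \mid z_{1:N}
  &\sim \mathrm{Dirichlet}(a_1 + n_1, \ldots, a_K + n_K)
\end{aligned}
```

The posterior has the same functional form as the prior, which is what conjugacy means; the beta case is the special case K = 2.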
References
A full reference list is provided at the end of the “Part II” slides.