Efficient Bounds for the Softmax Function: Applications to Inference in Hybrid Models
Guillaume Bouchard
Xerox Research Centre Europe
December 7, 2007
Deterministic Inference in Hybrid Graphical Models
- Discrete variables with continuous* parents: no sufficient statistic, no conjugate distribution, hence intractable inference; approximate deterministic inference is needed.
- Approaches: local sampling; deterministic approximations (Gaussian quadrature, delta method, Laplace approximation); maximizing a lower bound: the variational free energy.

[Figure: hybrid graphical model with nodes X0-X5 and Y1-Y3; legend: discrete vs. continuous variables, observed vs. hidden variables.]

*or a large number of discrete parents
Variational inference
- Focus: Bayesian multinomial logistic regression.
- Mean-field approximation: Q belongs to an approximation family; maximizing the resulting lower bound requires an upper bound on the softmax log-partition function.

[Figure: plate model with covariates X1i, X2i, parameters beta1, beta2, and labels Yi, repeated over data points i; legend as before.]
Bounding the log-partition function (1)
- Binary case: the classical bound of [Jaakkola and Jordan].
- We propose its multiclass extension.
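As a concrete sketch of both bounds (a reconstruction from the standard Jaakkola-Jordan form and from the multiclass construction sketched later in these slides; the helper names and the particular values of alpha and xi_k below are illustrative assumptions, not the talk's notation):

```python
import math

def lam(xi):
    # Jaakkola-Jordan variational coefficient lambda(xi) = tanh(xi/2) / (4 xi)
    return math.tanh(xi / 2.0) / (4.0 * xi)

def jj_upper(y, xi):
    # Quadratic upper bound on log(1 + e^y), tight at y = +/- xi
    return (y - xi) / 2.0 + lam(xi) * (y * y - xi * xi) + math.log1p(math.exp(xi))

def softmax_log_partition(x):
    # Numerically stable log sum_k e^{x_k}
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

def multiclass_upper(x, alpha, xis):
    # Multiclass extension: log sum_k e^{x_k} <= alpha + sum_k jj_upper(x_k - alpha, xi_k)
    return alpha + sum(jj_upper(xk - alpha, xi) for xk, xi in zip(x, xis))

x = [0.3, -1.2, 2.0]
lse = softmax_log_partition(x)
bound = multiclass_upper(x, alpha=0.5, xis=[1.0, 1.0, 1.0])
assert bound >= lse
```

In the binary case with alpha = 0 this reduces to the classical quadratic bound; alpha and the xi_k are free variational parameters to be optimized.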
Bounding the log-partition function (2)
[Figure: the bound on log sum_k e^{x_k} as a function of x, for K=2 (top) and K=10 (bottom); curves: worst curvature, optimal tight, optimal average (parameter = 2, 1, 0.1).]
Other upper bounds
- Concavity of the log [e.g. Blei et al.]
- Worst curvature [Bohning]
- Bound using hyperbolic cosines [Jebara]
- Local approximation [Gibbs] (not proved to be an upper bound)
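Two of the cited bounds can be sketched numerically. The forms below are the standard ones from the literature, written from memory and hedged accordingly (Jebara's hyperbolic-cosine bound and Gibbs's local approximation are omitted): the concavity-of-log bound uses log(u) <= u/a + log(a) - 1, and Bohning's bound replaces the softmax Hessian by the fixed matrix A = (1/2)(I - 11^T/K), which dominates it everywhere.

```python
import math

def lse(x):
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def log_concavity_upper(x, a):
    # log(u) <= u/a + log(a) - 1 applied to u = sum_k e^{x_k}, for any a > 0
    return sum(math.exp(v) for v in x) / a + math.log(a) - 1.0

def bohning_upper(x, psi):
    # Quadratic bound around psi with curvature A = (1/2)(I - 11^T/K)
    K = len(x)
    g = softmax(psi)
    d = [xv - pv for xv, pv in zip(x, psi)]
    lin = sum(gi * di for gi, di in zip(g, d))
    quad = 0.25 * (sum(di * di for di in d) - (sum(d) ** 2) / K)
    return lse(psi) + lin + quad

x = [0.5, -0.3, 1.1]
assert log_concavity_upper(x, a=3.0) >= lse(x)
assert bohning_upper(x, psi=[0.0, 0.0, 0.0]) >= lse(x)
```

Bohning's bound is tight at x = psi; the concavity bound is tight when a equals sum_k e^{x_k}.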
Proof
- Idea: expand the product of inverted sigmoids.
- The result is upper-bounded by K quadratic upper bounds.
- It is lower-bounded by a linear function (by log-convexity of f); proof: apply Jensen's inequality.
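The product-of-inverted-sigmoids step rests on the elementary inequality sum_k e^{y_k} <= prod_k (1 + e^{y_k}): the product expands to 1 + sum_k e^{y_k} plus nonnegative cross terms. With y_k = x_k - alpha this gives log sum_k e^{x_k} <= alpha + sum_k log(1 + e^{x_k - alpha}) for any alpha. A quick numerical check (an illustrative sketch, not the talk's code):

```python
import math, random

def lse(x):
    # Numerically stable log sum_k e^{x_k}
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

random.seed(0)
for _ in range(100):
    K = random.randint(2, 6)
    x = [random.uniform(-3, 3) for _ in range(K)]
    alpha = random.uniform(-2, 2)
    # lse(x) <= alpha + sum_k log(1 + e^{x_k - alpha}) for any alpha
    prod_bound = alpha + sum(math.log1p(math.exp(v - alpha)) for v in x)
    assert prod_bound >= lse(x) - 1e-12
```

Each log(1 + e^{x_k - alpha}) term is then upper-bounded by the quadratic Jaakkola-Jordan bound, yielding the K quadratic upper bounds mentioned above.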
Bounds on the Expectation
- Exponential bound
- Quadratic bound

[Figure: simulations comparing the bounds on the expectation.]
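Because the quadratic bound is quadratic in x, its expectation under a Gaussian is available in closed form from the first two moments. A sketch under the assumption of independent Gaussian components (function names and parameter choices are illustrative):

```python
import math, random

def lam(xi):
    return math.tanh(xi / 2.0) / (4.0 * xi)

def lse(x):
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

def expected_quadratic_bound(mu, var, alpha, xis):
    # E[quadratic bound] under independent x_k ~ N(mu_k, var_k): only
    # E[x_k] and E[(x_k - alpha)^2] = (mu_k - alpha)^2 + var_k are needed.
    total = alpha
    for m_k, v_k, xi in zip(mu, var, xis):
        total += ((m_k - alpha - xi) / 2.0
                  + lam(xi) * ((m_k - alpha) ** 2 + v_k - xi * xi)
                  + math.log1p(math.exp(xi)))
    return total

mu, var = [0.2, -0.5, 1.0], [0.5, 0.5, 0.5]
closed = expected_quadratic_bound(mu, var, alpha=0.3, xis=[1.0, 1.0, 1.0])

# Monte Carlo estimate of E[log sum_k e^{x_k}] for comparison
random.seed(1)
N = 20000
mc = sum(lse([random.gauss(m, math.sqrt(v)) for m, v in zip(mu, var)])
         for _ in range(N)) / N
assert closed >= mc  # the bound holds pointwise, hence in expectation
```

This closed form is what makes the quadratic bound attractive for variational updates, whereas the exponential bound's expectation must be handled differently.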
Bayesian multinomial logistic regression
- Exponential bound: cannot be maximized in closed form; requires gradient-based optimization or a fixed-point equation (unstable!).
- Quadratic bound: analytic update available.
Numerical experiments
- Iris dataset: 4 dimensions, 3 classes; prior: unit variance.
- Experiment: batch learning updates, compared to an MCMC estimate based on 100K samples.
- Error: Euclidean distance between the mean and variance parameters.
- Results: the "worst curvature" bound is both faster and more accurate.

[Figure: error (left) and variational free energy (right) vs. number of iterations, for the worst-curvature and sigmoid-product bounds.]
Conclusion
- Multinomial links in graphical models are feasible.
- Existing bounds work well; we can expect further improvements.
- Remark: better bounds are only needed in the Bayesian setting; for MAP estimation, even a loose bound converges.
- Future work: application to discriminative learning; mixture-based mean-field approximation.
Numerical experiments (backup)

[Figure: error vs. number of iterations for the worst-curvature and sigmoid-product bounds; same Iris setup as above.]
[Backup figures: the bound on log sum_k e^{x_k} as a function of x, for K=3 and K=100; curves: worst curvature, optimal tight, optimal average (parameter = 2, 1, 0.1).]