Learning Multiplicative Interactions
Many slides from Hinton
Two different meanings of "multiplicative"
• If we take two density models and multiply together their probability distributions at each point in data-space, we get a "product of experts".
– The product of two Gaussian experts is a Gaussian.
• If we take two variables and we multiply them together to provide input to a third variable, we get a "multiplicative interaction".
– The distribution of the product of two Gaussian-distributed variables is NOT Gaussian distributed. It is a heavy-tailed distribution. One Gaussian determines the standard deviation of the other Gaussian.
– Heavy-tailed distributions are the signatures of multiplicative interactions between latent variables.
Learning multiplicative interactions
• It is fairly easy to learn multiplicative interactions if all of the variables are observed.
– This is possible if we control the variables used to create a training set (e.g. pose, lighting, identity, …).
• It is also easy to learn energy-based models in which all but one of the terms in each multiplicative interaction are observed.
– Inference is still easy.
• If more than one of the terms in each multiplicative interaction is unobserved, the interactions between hidden variables make inference difficult.
– Alternating Gibbs sampling can be used if the latent variables form a bipartite graph.
Higher-order Boltzmann machines (Sejnowski, ~1986)
• The usual energy function is quadratic in the states:
$-E = \text{bias terms} + \sum_{i<j} s_i s_j w_{ij}$
• But we could use higher-order interactions:
$-E = \text{bias terms} + \sum_{i,j,h} s_i s_j s_h w_{ijh}$
• Hidden unit $h$ acts as a switch. When $h$ is on, it switches in the pairwise interaction between unit $i$ and unit $j$.
– Units $i$ and $j$ can also be viewed as switches that control the pairwise interactions between $j$ and $h$ or between $i$ and $h$.
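To make the contrast concrete, here is a minimal NumPy sketch (the toy sizes and variable names are our own, not from the slides) that evaluates both energy forms for one random binary state vector:

```python
# Minimal sketch: evaluate the quadratic and the third-order energy for one
# random binary state vector (toy sizes; biases omitted for brevity).
import numpy as np

rng = np.random.default_rng(0)
n = 8
s = rng.integers(0, 2, size=n).astype(float)   # binary states s_i

# Quadratic:  -E = sum_{i<j} s_i s_j w_ij
W2 = rng.standard_normal((n, n))
neg_E_quad = sum(s[i] * s[j] * W2[i, j]
                 for i in range(n) for j in range(i + 1, n))

# Third-order:  -E = sum_{i,j,h} s_i s_j s_h w_ijh
# Hidden unit h gates the (i, j) interaction: fixing s_h = 1 for one h leaves
# an ordinary pairwise term with weights W3[:, :, h].
W3 = rng.standard_normal((n, n, n))
neg_E_cubic = np.einsum('i,j,h,ijh->', s, s, s, W3)

print(neg_E_quad, neg_E_cubic)
```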
Using higher-order Boltzmann machines to model image transformations
(Memisevic and Hinton, 2007)
• A global transformation specifies which pixel goes to which other pixel.
• Conversely, each pair of similar intensity pixels, one in each image, votes for a particular global transformation.
[Figure: a layer of "image transformation" units relating image(t) to image(t+1).]
Using higher-order Boltzmann machines to model image transformations
• For binary images, a simple energy function that captures all possible correlations between the components of $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{h}$ is
$-E(\mathbf{y}, \mathbf{h}; \mathbf{x}) = \sum_{i,j,k} w_{ijk}\, x_i y_j h_k \qquad (1)$
• Using this energy function, we can now define the joint distribution over outputs and hidden variables by exponentiating and normalizing:
$p(\mathbf{y}, \mathbf{h} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\big(-E(\mathbf{y}, \mathbf{h}; \mathbf{x})\big), \qquad (2)$
where $Z(\mathbf{x}) = \sum_{\mathbf{y}, \mathbf{h}} \exp\!\big(-E(\mathbf{y}, \mathbf{h}; \mathbf{x})\big)$.
• From Eqs. 1 and 2, we get factorized conditionals such as $p(h_k = 1 \mid \mathbf{x}, \mathbf{y}) = \sigma\big(\sum_{i,j} w_{ijk} x_i y_j\big)$ and $p(y_j = 1 \mid \mathbf{x}, \mathbf{h}) = \sigma\big(\sum_{i,k} w_{ijk} x_i h_k\big)$, where $\sigma$ is the logistic function.
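As an illustration of Eqs. 1 and 2, here is a hedged NumPy sketch (the toy sizes and the names x, y, W are our assumptions) that samples the hidden units from their factorized conditional:

```python
# Sketch of Eqs. 1 and 2 in code: because -E is linear in each h_k given x and
# y, the hidden units are conditionally independent logistic units.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
nx, ny, nh = 16, 16, 10                         # toy sizes (our choice)
W = 0.01 * rng.standard_normal((nx, ny, nh))    # w_ijk
x = rng.integers(0, 2, size=nx).astype(float)   # image(t)
y = rng.integers(0, 2, size=ny).astype(float)   # image(t+1)

# p(h_k = 1 | x, y) = sigmoid( sum_ij w_ijk x_i y_j )   (biases omitted)
p_h = sigmoid(np.einsum('i,j,ijk->k', x, y, W))
h = (rng.random(nh) < p_h).astype(float)
```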
Making the reconstruction easier
• Condition on the first image so that only one visible group needs to be reconstructed.
– Given the hidden states and the previous image, the pixels in the second image are conditionally independent.
The main problem with 3-way interactions
• The energy function $-E = \sum_{i,j,h} s_i s_j s_h w_{ijh}$ needs one parameter per triple of units: there are far too many of them.
• We can reduce the number in several straightforward ways:
– Do dimensionality reduction on each group before the three-way interactions.
– Use spatial locality to limit the range of the three-way interactions.
• A much more interesting approach (which can be combined with the other two) is to factor the interactions so that they can be specified with fewer parameters.
– This leads to a novel type of learning module.
Factoring three-way interactions
• We use factors that correspond to 3-way outer products:
$w_{ijh} = \sum_f w_{if}\, w_{jf}\, w_{hf}$
• Unfactored: $-E = \sum_{i,j,h} s_i s_j s_h\, w_{ijh}$
• Factored: $-E = \sum_f \sum_{i,j,h} s_i s_j s_h\, w_{if} w_{jf} w_{hf}$
[Figure: each factor $f$ connects the three groups of units through the weights $w_{if}$, $w_{jf}$, and $w_{hf}$.]
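A small NumPy sketch (toy sizes are our own) can verify the factorization: the factored energy agrees with the unfactored one, but needs $3nF$ parameters instead of $n^3$ and never forms the full tensor:

```python
# Sketch: the factored energy equals the unfactored one, with far fewer
# parameters (3*n*F instead of n^3).
import numpy as np

rng = np.random.default_rng(0)
n, F = 12, 4
Wi = rng.standard_normal((n, F))   # w_if
Wj = rng.standard_normal((n, F))   # w_jf
Wh = rng.standard_normal((n, F))   # w_hf

si, sj, sh = (rng.integers(0, 2, size=n).astype(float) for _ in range(3))

# Unfactored: reassemble w_ijh = sum_f w_if w_jf w_hf, then contract.
W = np.einsum('if,jf,hf->ijh', Wi, Wj, Wh)
neg_E_unfactored = np.einsum('i,j,h,ijh->', si, sj, sh, W)

# Factored: project each group onto the factors, then sum per-factor products.
neg_E_factored = np.sum((si @ Wi) * (sj @ Wj) * (sh @ Wh))

assert np.allclose(neg_E_unfactored, neg_E_factored)
```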
Factored 3-Way Restricted Boltzmann Machines For Modeling Natural Images
(Ranzato, Krizhevsky and Hinton, 2010)
• A joint 3-way model that models the covariance structure of natural images.
• The visible units are two identical copies of the image.
• Define the energy function in terms of 3-way multiplicative interactions between two visible binary units, $v_i$ and $v_j$, and one hidden binary unit, $h_k$:
$-E(\mathbf{v}, \mathbf{h}) = \sum_{i,j,k} v_i v_j h_k\, w_{ijk}$
• Model the three-way weights as a sum of "factors", $f$, each of which is a three-way outer product:
$w_{ijk} = \sum_f B_{if}\, C_{jf}\, P_{kf}$
• The factors are connected twice to the same image through the matrices $B$ and $C$, so it is natural to tie their weights, further reducing the number of parameters: $B = C$.
A powerful module for deep learning
• So the energy function becomes:
$-E(\mathbf{v}, \mathbf{h}) = \sum_f \Big(\sum_i v_i C_{if}\Big)^2 \Big(\sum_k h_k P_{kf}\Big)$
• The parameters of the model can be learned by maximizing the log likelihood, whose gradient is the usual difference of data and model expectations:
$\frac{\partial \log p(\mathbf{v})}{\partial \theta} = \Big\langle \frac{\partial(-E)}{\partial \theta} \Big\rangle_{\text{data}} - \Big\langle \frac{\partial(-E)}{\partial \theta} \Big\rangle_{\text{model}}$
• The hidden units are conditionally independent given the states of the visible units, and their binary states can be sampled using:
$p(h_k = 1 \mid \mathbf{v}) = \sigma\Big(\sum_f P_{kf} \Big(\sum_i v_i C_{if}\Big)^2 + b_k\Big)$
• However, given the hidden states, the visible units are no longer independent.
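Here is a minimal sketch of that conditional (NumPy; the matrix names C and P and the sign convention follow the energy as reconstructed above, so treat it as an assumption rather than the paper's exact parameterization):

```python
# Sketch of p(h_k = 1 | v) for the factored 3-way RBM with tied weights.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_vis, n_fac, n_hid = 64, 32, 16                # toy sizes (our choice)
C = 0.1 * rng.standard_normal((n_vis, n_fac))   # visible-to-factor weights (B = C)
P = 0.1 * rng.standard_normal((n_hid, n_fac))   # hidden-to-factor weights
b = np.zeros(n_hid)                             # hidden biases

v = rng.integers(0, 2, size=n_vis).astype(float)

# -E is linear in each h_k, so:
# p(h_k = 1 | v) = sigmoid( sum_f P_kf (sum_i v_i C_if)^2 + b_k )
factor_input = (C.T @ v) ** 2
p_h = sigmoid(P @ factor_input + b)
h = (rng.random(n_hid) < p_h).astype(float)
```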
A powerful module for deep learning
Producing reconstructions using hybrid Monte Carlo
• Integrate out the hidden units and run the hybrid Monte Carlo algorithm (HMC) on the free energy:
$F(\mathbf{v}) = -\log \sum_{\mathbf{h}} \exp\!\big(-E(\mathbf{v}, \mathbf{h})\big)$
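Below is a hedged sketch of HMC on this free energy (F and its gradient follow our reconstruction of the factored energy above; the step size, trajectory length, and the non-positivity of P are our own choices, the last of which keeps F bounded below):

```python
# Sketch of HMC on the free energy of the factored model.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def free_energy_and_grad(v, C, P, b):
    u = C.T @ v                                   # factor outputs
    a = P @ (u ** 2) + b                          # input to each hidden unit
    F = -np.sum(np.logaddexp(0.0, a))             # -sum_k log(1 + e^{a_k})
    grad = -2.0 * C @ (u * (P.T @ sigmoid(a)))    # dF/dv by the chain rule
    return F, grad

def hmc_step(v, C, P, b, step=1e-2, n_leapfrog=20, rng=np.random.default_rng()):
    """One HMC step: leapfrog on H(v, m) = F(v) + |m|^2 / 2, then a
    Metropolis accept/reject that leaves exp(-F) invariant."""
    m = rng.standard_normal(v.shape)
    F0, g = free_energy_and_grad(v, C, P, b)
    H0 = F0 + 0.5 * (m @ m)
    v_new, m_new = v.copy(), m - 0.5 * step * g   # initial half step on m
    for _ in range(n_leapfrog):
        v_new = v_new + step * m_new
        F1, g = free_energy_and_grad(v_new, C, P, b)
        m_new = m_new - step * g
    m_new = m_new + 0.5 * step * g                # make the last step a half step
    H1 = F1 + 0.5 * (m_new @ m_new)
    return (v_new, F1) if rng.random() < np.exp(H0 - H1) else (v, F0)

# Example: a few steps from a random start (toy sizes; our choice).
rng = np.random.default_rng(0)
C = 0.1 * rng.standard_normal((64, 32))
P = -0.1 * np.abs(rng.standard_normal((16, 32)))  # non-positive P (assumption)
b = np.zeros(16)
v = rng.standard_normal(64)
for _ in range(10):
    v, F = hmc_step(v, C, P, b, rng=rng)
```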
Modeling the joint density of two images under a variety of transformations
(Hinton et al., 2011)
• Describes a generative model of the relationship between two images.
• The model is defined as a factored three-way Boltzmann machine, in which hidden variables collaborate to define the joint correlation matrix for image pairs.
• Given two real-valued images $\mathbf{x}$ and $\mathbf{y}$, define the matching score of triplets $(\mathbf{x}, \mathbf{y}, \mathbf{h})$:
$S(\mathbf{x}, \mathbf{y}, \mathbf{h}) = \sum_{i,j,k} w_{ijk}\, x_i y_j h_k$
• Add bias terms to the matching score to get the energy function:
$-E(\mathbf{x}, \mathbf{y}, \mathbf{h}) = S(\mathbf{x}, \mathbf{y}, \mathbf{h}) + \text{bias terms} \qquad (1)$
• Exponentiate and normalize the energy function:
$p(\mathbf{x}, \mathbf{y}, \mathbf{h}) = \frac{1}{Z} \exp\!\big(-E(\mathbf{x}, \mathbf{y}, \mathbf{h})\big) \qquad (2)$
Model
• Marginalize over $\mathbf{h}$ to get the distribution over an image pair $(\mathbf{x}, \mathbf{y})$:
$p(\mathbf{x}, \mathbf{y}) = \sum_{\mathbf{h}} p(\mathbf{x}, \mathbf{y}, \mathbf{h})$
• From this we can also get the conditional distributions $p(\mathbf{h} \mid \mathbf{x}, \mathbf{y})$, $p(\mathbf{x} \mid \mathbf{y}, \mathbf{h})$, and $p(\mathbf{y} \mid \mathbf{x}, \mathbf{h})$ (Eqs. 3, 4, and 5).
• This shows that among the three sets of variables, the conditional distribution of any one group, given the other two, is tractable to compute.
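As a toy illustration of that tractability, the sketch below runs alternating Gibbs sweeps over the three groups; it uses binary units with logistic conditionals, whereas for the paper's real-valued images the x- and y-conditionals would be Gaussian:

```python
# Toy sketch: one alternating Gibbs sweep over the three groups, using the
# tractable factorized conditionals.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_sweep(x, y, h, W, rng):
    """Resample each group given the other two; each conditional factorizes."""
    h = (rng.random(h.size) < sigmoid(np.einsum('i,j,ijk->k', x, y, W))).astype(float)
    y = (rng.random(y.size) < sigmoid(np.einsum('i,k,ijk->j', x, h, W))).astype(float)
    x = (rng.random(x.size) < sigmoid(np.einsum('j,k,ijk->i', y, h, W))).astype(float)
    return x, y, h

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((8, 8, 6))       # toy w_ijk (our choice)
x, y, h = (rng.integers(0, 2, size=s).astype(float) for s in (8, 8, 6))
for _ in range(50):
    x, y, h = gibbs_sweep(x, y, h, W, rng)
```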
Model
Three-way contrastive divergence
Prerequisites: dataset, learning rate
repeat
  for … from 1 to … do
    Compute …; set … for each …
    Perform the positive-phase update.
    Sample … from …; sample … from …
    if … then
      Sample … from … and set …; sample … from … and set …
    else
      Sample … from … and set …; sample … from … and set …
    end if
    Set … for each …
    Compute …; perform the negative-phase update.
    Re-normalize.
  end for
until the convergence criterion is met
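A hedged NumPy sketch of one three-way CD-1 update (conditioning on the first image, as in the earlier reconstruction slide; the names W, x, y, the omitted biases, and the exact update order are our assumptions, since the pseudocode above does not specify them):

```python
# Sketch of one three-way CD-1 update on the unfactored weights w_ijk,
# conditioning on the first image x so only y and h are reconstructed.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(W, x, y, lr, rng):
    # Positive phase: sample hiddens from the data pair (x, y).
    p_h = sigmoid(np.einsum('i,j,ijk->k', x, y, W))
    h = (rng.random(p_h.size) < p_h).astype(float)
    pos = np.einsum('i,j,k->ijk', x, y, p_h)

    # Negative phase: reconstruct y given (x, h), then recompute the hiddens.
    p_y = sigmoid(np.einsum('i,k,ijk->j', x, h, W))
    y_neg = (rng.random(p_y.size) < p_y).astype(float)
    p_h_neg = sigmoid(np.einsum('i,j,ijk->k', x, y_neg, W))
    neg = np.einsum('i,j,k->ijk', x, y_neg, p_h_neg)

    # Approximate likelihood gradient: <x y h>_data - <x y h>_reconstruction.
    W += lr * (pos - neg)
    return W

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((8, 8, 6))       # toy sizes (our choice)
for _ in range(100):
    x = rng.integers(0, 2, size=8).astype(float)
    y = rng.integers(0, 2, size=8).astype(float)  # stand-in training pair
    W = cd1_update(W, x, y, lr=0.05, rng=rng)
```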
Thank you