Deep Learning: A Statistical Perspective
Lecture 9: Variational Autoencoders (stat.snu.ac.kr/mcp/Lecture_9_VAE.pdf, 2019-10-30)


Unsupervised learning


In unsupervised learning, we try to learn about data without labels.

Unsupervised learning tasks involve estimation of the density of the input x, say p(x), or dimension reduction.

Dimension reduction tasks are sometimes called feature learning and include PCA, self-organizing maps, multidimensional scaling, and autoencoders.

Density estimation tasks include cluster analysis, anomaly detection, variational autoencoders, and generative adversarial networks.


Dimensionality and density estimation

To get a sense of the effect of dimensionality on estimating densities, consider kernel density estimators. Assume that X_i ∈ C ⊂ R^d where C is compact.

$$\hat p_h(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h^d}\, K\!\left(\frac{\|x - X_i\|}{h}\right)$$

Let β and L be positive numbers. Given a vector s = (s_1, ..., s_d), write |s| = s_1 + ... + s_d and

$$D^s = \frac{\partial^{\,s_1+\cdots+s_d}}{\partial x_1^{s_1}\cdots\partial x_d^{s_d}}.$$

Define the Hölder class

$$\Sigma(\beta, L) = \Big\{ g : |D^s g(x) - D^s g(y)| \le L\,\|x - y\|^{\beta - |s|} \ \text{for all } s \text{ with } |s| = \lfloor\beta\rfloor \text{ and all } x, y \Big\},$$

where ⌊β⌋ is the greatest integer strictly less than β. Then the MSE satisfies

$$\mathrm{MSE} \asymp h^{2\beta} + \frac{1}{n h^d}.$$

If h ≍ n^{-1/(2β+d)},

$$\sup_{p\in\Sigma(\beta,L)} E\int \big(\hat p_h(x) - p(x)\big)^2\, dx \le c\, n^{-2\beta/(2\beta+d)}.$$
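As a rough numerical companion (an illustration, not from the slides): the sketch below implements a Gaussian-kernel density estimator and prints the rate-optimal bandwidth and the minimax risk rate n^{-2β/(2β+d)} for a few dimensions, assuming β = 2 and a standard normal toy sample.

```python
import numpy as np

def kde(x, X, h):
    """Gaussian-kernel density estimate at query points x.

    x : (m, d) query points, X : (n, d) sample, h : bandwidth.
    Computes (1/n) sum_i h^{-d} K(||x - X_i|| / h) with a standard Gaussian K.
    """
    d = x.shape[1]
    d2 = ((x[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # ||x - X_i||^2
    K = np.exp(-d2 / (2.0 * h**2)) / (2.0 * np.pi * h**2) ** (d / 2.0)
    return K.mean(axis=1)

n, beta = 2000, 2.0                       # beta = 2 (assumed smoothness)
rng = np.random.default_rng(0)
for d in (1, 2, 5, 10):
    h = n ** (-1.0 / (2 * beta + d))      # rate-optimal bandwidth h ~ n^{-1/(2*beta+d)}
    rate = n ** (-2 * beta / (2 * beta + d))
    X = rng.standard_normal((n, d))       # toy sample from N(0, I)
    p0 = kde(np.zeros((1, d)), X, h)[0]
    print(f"d={d:2d}  h={h:.3f}  risk rate={rate:.5f}  p_hat(0)={p0:.4f}")
```

The printed risk rate degrades quickly with d, which is the point of the slide.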


High dimension and density estimation

The rate of convergence n^{-2β/(2β+d)} is slow when the dimension d is large.

Instead of estimating p precisely, we have to settle for finding an adequate approximation that identifies the regions where p puts large amounts of mass.

Consider

$$\bar p_h(x) = E\big(\hat p_h(x)\big) = \int \frac{1}{h^d}\, K\!\left(\frac{\|x - u\|}{h}\right) p(u)\, du.$$

Then, applying

$$P\big(\|\hat p_h - \bar p_h\|_\infty > \epsilon\big) \le C e^{-nc\epsilon^2},$$

we obtain

$$\|\hat p_h - \bar p_h\|_\infty = O_P\!\left(\sqrt{\frac{\log n}{n}}\right).$$

This rate of convergence does not depend on d, but how to choose h is not clear.


Unsupervised learning

In deep learning we may face cases where a distribution is concentrated near a lower-dimensional set. This causes problems for density estimation; the density may not even be well defined. Current approaches attempt to find a smoothed density that is well defined on a lower-dimensional representation.

The variational autoencoder (VAE) and the generative adversarial network (GAN) attempt to estimate the density through a lower-dimensional representation obtained via deep models.

We first focus on dimension reduction and go over autoencoders.


Autoencoders


• Goal: Given data x ∈ R^d, learn a representation z ∈ R^k, k < d, where z = f(x) approximates x well with minimal capacity.

If the dimension of z is the same as that of x, the model can simply reproduce its input. We want to maximally compress x without losing too much information.

The network may be viewed as consisting of two parts: an encoder function z = f(x) and a decoder that produces a reconstruction x̂ = g(z), with g(f(x)) ≈ x.


Shallow autoencoders

When f(·) is linear, the problem boils down to principal component analysis (PCA).

Let z = w1 x and x̂ = w2 z, with w1 ∈ R^{k×d} and w2 ∈ R^{d×k}, so that x̂ = w2 w1 x. Consider minimizing

$$E\|x - w_2 w_1 x\|_2^2$$

over w1 and w2.

The loss E‖x − w2 w1 x‖²₂ can be made smaller by increasing the dimension of z.

It can be shown that the columns of w2 are orthonormal (w2ᵀ w2 = I_{k×k}) and that w1 = w2ᵀ.

For any w1 and w2, the set R = {w2 w1 x : x ∈ R^d} lies in a k-dimensional linear subspace of R^d. Let V ∈ R^{d×k} be a matrix whose columns form an orthonormal basis of this subspace, so that each vector in R can be written as V y with y ∈ R^k. For every x ∈ R^d and y ∈ R^k we have

$$\|x - Vy\|_2^2 = \|x\|^2 + y^T V^T V y - 2 y^T V^T x = \|x\|^2 + \|y\|^2 - 2 y^T V^T x,$$

which is minimized when y = Vᵀ x. Therefore x̂ = V Vᵀ x is the minimizer of ‖x − x̂‖²₂.
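This equivalence with PCA can be checked numerically. The following sketch (an illustration, not part of the slides) compares the reconstruction error of the optimal linear autoencoder w2 = V, w1 = Vᵀ, with V the top-k principal directions, against a random rank-k subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 500, 10, 3

# toy data whose variance is mostly concentrated in a 3-dimensional subspace
A = rng.standard_normal((d, k))
X = rng.standard_normal((n, k)) @ A.T + 0.1 * rng.standard_normal((n, d))
X = X - X.mean(axis=0)                         # center the data, as in PCA

# top-k principal directions V from the SVD of the data matrix
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T                                   # (d, k), orthonormal columns

def recon_error(W1, W2):
    """Empirical mean of ||x - W2 W1 x||^2 over the sample."""
    Xhat = X @ W1.T @ W2.T
    return np.mean(np.sum((X - Xhat) ** 2, axis=1))

err_pca = recon_error(V.T, V)                  # w1 = V^T, w2 = V, so x_hat = V V^T x

Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # a random orthonormal rank-k basis
err_rand = recon_error(Q.T, Q)

print(f"PCA subspace error   : {err_pca:.4f}")
print(f"random subspace error: {err_rand:.4f}")    # larger than the PCA error
```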


Deep autoencoders

f(·) can be compositional, built from linear, nonlinear (e.g., ReLU), convolutional layers, etc.
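A minimal sketch of such a deep autoencoder in PyTorch (layer sizes are hypothetical, chosen only for illustration):

```python
import torch
import torch.nn as nn

class DeepAutoencoder(nn.Module):
    """Encoder f and decoder g composed of linear layers and ReLU nonlinearities."""

    def __init__(self, d=784, k=32):
        super().__init__()
        self.encoder = nn.Sequential(            # z = f(x)
            nn.Linear(d, 256), nn.ReLU(),
            nn.Linear(256, k),
        )
        self.decoder = nn.Sequential(            # x_hat = g(z)
            nn.Linear(k, 256), nn.ReLU(),
            nn.Linear(256, d),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DeepAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                          # a dummy minibatch
loss = nn.functional.mse_loss(model(x), x)       # reconstruction loss ||x - g(f(x))||^2
loss.backward()
opt.step()
```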


Variational Autoencoders (VAE)

VAEs have already shown promise in generating many kinds of complicated data, including handwritten digits, faces, house numbers, CIFAR images, physical models of scenes, segmentation, and predicting the future from static images.

An approximate Bayesian inference method called variational inference is used.

We sidetrack for a moment to study variational inference.


Approximate Inference: Variational Inference


Approximate inference: posterior approximation

In Bayesian inference, the posterior distribution

$$p(z|x) = \frac{p(x|z)\,p(z)}{\int p(x|v)\,p(v)\,dv}$$

is of interest.

For most models, the posterior is analytically intractable.

Variational inference attempts to approximate the posterior via q(z|x).

Minimize the Kullback-Leibler divergence KL(q(z|x)||p(z|x)) or KL(p(z|x)||q(z|x)), where

$$KL\big(q(z|x)\,\|\,p(z|x)\big) = \int q(z|x)\,\log\frac{q(z|x)}{p(z|x)}\,dz.$$


Two types of KL divergence

For KL(q(z|x)||p(z|x)) = ∫ q(z|x) log[q(z|x)/p(z|x)] dz: if p(z|x) = 0, we must have q(z|x) = 0. Minimizing KL(q||p) forces q to choose a single mode.

For KL(p(z|x)||q(z|x)) = ∫ p(z|x) log[p(z|x)/q(z|x)] dz: if p(z|x) > 0, we must have q(z|x) > 0. Minimizing KL(p||q) forces q to cover both modes.

source: Bishop (2006) Pattern recognition and machine learning. Springer
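A small numerical illustration of the two behaviours (not from the slides; the target mixture and the grid search are assumptions for illustration): fit a single Gaussian q to a two-mode target p by minimizing each divergence on a grid.

```python
import numpy as np

z = np.linspace(-8, 8, 4001)
dz = z[1] - z[0]

def normal(z, mu, sig):
    return np.exp(-0.5 * ((z - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

# two-mode target p(z): mixture of N(-3, 0.5^2) and N(+3, 0.5^2)
p = 0.5 * normal(z, -3, 0.5) + 0.5 * normal(z, 3, 0.5)

def kl(a, b):
    """Discretized KL(a || b) on the grid."""
    mask = a > 1e-12
    return np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-300))) * dz

best_qp, best_pq = None, None
for mu in np.linspace(-4, 4, 81):
    for sig in np.linspace(0.3, 4, 75):
        q = normal(z, mu, sig)
        if best_qp is None or kl(q, p) < best_qp[0]:
            best_qp = (kl(q, p), mu, sig)
        if best_pq is None or kl(p, q) < best_pq[0]:
            best_pq = (kl(p, q), mu, sig)

print("argmin KL(q||p): mu=%.2f sigma=%.2f (locks onto one mode)" % best_qp[1:])
print("argmin KL(p||q): mu=%.2f sigma=%.2f (covers both modes)" % best_pq[1:])
```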


Two types of approximate inference: variational inference vs. expectation propagation

Approximate inference minimizing KL(p(z|x)||q(z|x)) includes expectation propagation (EP).

KL(p(z|x)||q(z|x)) is harder to evaluate.

Approximate inference minimizing KL(q(z|x)||p(z|x)) is called variational inference (VI).

VI is more popular than EP due to computational feasibility.


Specifying variational distributions

Variational inference is originally formulated as an optimization problem in which the quantity being optimized is a functional. The solution is obtained by exploring all possible input functions to find the one that maximizes, or minimizes, the functional. We can consider minimizing the KL divergence over the function q itself; this approach will be discussed with the mean field method.

We first consider the case in which q is indexed by a finite-dimensional parameter.


Variational Inference: Marginal likelihood, KL divergence and ELBO

Variational inference minimizes KL(q(z|x)||p(z|x)).

For any distribution q(z|x; φ) indexed by φ, we have

$$\log p(x;\theta) = \int q(z|x;\phi)\,\log p(x;\theta)\,dz = \int q(z|x;\phi)\,\log\frac{p(x,z;\theta)}{p(z|x;\theta)}\,dz$$

$$= \int q(z|x;\phi)\,\log\left[\frac{p(x,z;\theta)}{p(z|x;\theta)}\cdot\frac{q(z|x;\phi)}{q(z|x;\phi)}\right]dz$$

$$= \int q(z|x;\phi)\left[\log\frac{q(z|x;\phi)}{p(z|x;\theta)} + \log\frac{p(x,z;\theta)}{q(z|x;\phi)}\right]dz$$

$$= \int q(z|x;\phi)\,\log\frac{q(z|x;\phi)}{p(z|x;\theta)}\,dz + \int q(z|x;\phi)\,\log\frac{p(x,z;\theta)}{q(z|x;\phi)}\,dz$$


Let

$$L(\theta,\phi) = \int q(z|x;\phi)\,\log\frac{p(x,z;\theta)}{q(z|x;\phi)}\,dz.$$

Note that the first term above is

$$\int q(z|x;\phi)\,\log\frac{q(z|x;\phi)}{p(z|x;\theta)}\,dz \equiv KL\big(q(z|x;\phi)\,\|\,p(z|x;\theta)\big).$$

Then

$$\log p(x;\theta) = KL\big(q(z|x;\phi)\,\|\,p(z|x;\theta)\big) + L(\theta,\phi).$$

By maximizing L(θ, φ), we minimize KL(q(z|x;φ)||p(z|x;θ)).


Variational Inference: ELBO

For any distribution q(z|x; φ) indexed by φ, we have

$$\log p(x;\theta) = \log\left(\int p(x,z;\theta)\,dz\right) = \log\left(\int\frac{p(x,z;\theta)}{q(z|x;\phi)}\,q(z|x;\phi)\,dz\right)$$

$$\ge \int \log\left(\frac{p(x,z;\theta)}{q(z|x;\phi)}\right) q(z|x;\phi)\,dz \equiv L(\theta,\phi).$$

Recall Jensen's inequality: E(f(x)) ≥ f(E(x)) if f is convex, equivalently E(f(x)) ≤ f(E(x)) if f is concave; the inequality above applies the concave case with f = log.

We call L(θ, φ) the evidence lower bound (ELBO).


Variational Inference: ELBO as a penalized expected likelihood

We can express

$$\mathrm{ELBO} = L(\theta,\phi) = \int q(z|x;\phi)\,\log\frac{p(x,z;\theta)}{q(z|x;\phi)}\,dz$$

$$= E_{q(z|x;\phi)}[\log p(x,z;\theta)] - E_{q(z|x;\phi)}[\log q(z|x;\phi)]$$

$$= E[\log p(x|z;\theta)] + E[\log p(z)] - E[\log q(z|x;\phi)]$$

$$= E[\log p(x|z;\theta)] - KL\big(q(z|x;\phi)\,\|\,p(z)\big).$$

Maximizing the ELBO is equivalent to maximizing E[log p(x|z; θ)] with penalty KL(q(z|x;φ)||p(z)).
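As a quick sanity check (an illustration with assumed toy densities, not from the slides), the two forms of the ELBO can be compared by Monte Carlo for a scalar model with p(z) = N(0, 1), p(x|z) = N(z, 1) and q(z|x) = N(m, s²):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mu, sig):
    return -0.5 * np.log(2 * np.pi * sig**2) - 0.5 * ((x - mu) / sig) ** 2

x, m, s = 1.3, 0.8, 0.7                       # observed x and variational parameters (assumed)
z = m + s * rng.standard_normal(200_000)      # z ~ q(z|x)

# Form 1: E_q[log p(x, z)] - E_q[log q(z|x)]
form1 = np.mean(log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0) - log_normal(z, m, s))

# Form 2: E_q[log p(x|z)] - KL(q(z|x) || p(z)), with the Gaussian KL in closed form
kl = np.log(1.0 / s) + (s**2 + m**2 - 1.0) / 2.0
form2 = np.mean(log_normal(x, z, 1.0)) - kl

print(form1, form2)                           # the two estimates agree up to Monte Carlo error
```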


Optimizing ELBO with respect to parameters

Optimizing the ELBO requires evaluating ∂L(θ,φ)/∂θ and ∂L(θ,φ)/∂φ.

First,

$$\frac{\partial}{\partial\theta}L(\theta,\phi) = \frac{\partial}{\partial\theta}E_{q(z|x;\phi)}[\log p(x,z;\theta)] = \int \frac{p'(x|z;\theta)}{p(x|z;\theta)}\,q(z|x;\phi)\,dz,$$

where p′(x|z; θ) denotes the derivative of p(x|z; θ) with respect to θ. This term is usually computed stochastically.
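Concretely, the stochastic evaluation draws samples z^{(m)} ~ q(z|x; φ) and averages the gradients (a standard Monte Carlo estimator, stated here for completeness):

$$\frac{\partial}{\partial\theta}L(\theta,\phi) \approx \frac{1}{M}\sum_{m=1}^{M}\frac{\partial}{\partial\theta}\log p\big(x, z^{(m)};\theta\big), \qquad z^{(m)} \sim q(z|x;\phi).$$

This estimator is unbiased because φ, not θ, governs the sampling distribution.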


Optimizing ELBO

Optimizing the ELBO requires evaluating ∂L(θ,φ)/∂θ and ∂L(θ,φ)/∂φ.

$$\frac{\partial}{\partial\phi}L(\theta,\phi) = \frac{\partial}{\partial\phi}E_{q(z|x;\phi)}[\log p(x,z;\theta)] - \frac{\partial}{\partial\phi}E_{q(z|x;\phi)}[\log q(z|x;\phi)]$$

$$= \int [\log p(x,z;\theta)]\,\frac{q'(z|x;\phi)}{q(z|x;\phi)}\,q(z|x;\phi)\,dz - \int \frac{q'(z|x;\phi)}{q(z|x;\phi)}\,q(z|x;\phi)\,dz - \int [\log q(z|x;\phi)]\,\frac{q'(z|x;\phi)}{q(z|x;\phi)}\,q(z|x;\phi)\,dz$$

Computation is complicated because the parameter φ appears in the distribution over which the expectation is taken. An EM-type iterative stochastic evaluation of each term is not guaranteed to converge.

One may attempt to reparameterize the ELBO so that the expectation is over a distribution free of parameters.


Reparameterization trick

In evaluating ∂/∂φ ∫ f(x) p(x; φ) dx, let x = g(φ, ε), where g is differentiable and p(ε) is free of φ. That is,

$$p\big(g(\phi,\epsilon);\phi\big)\left|\frac{\partial g(\phi,\epsilon)}{\partial\epsilon}\right| = p(\epsilon).$$

Then

$$\int f(x)\,p(x;\phi)\,dx = \int f\big(g(\phi,\epsilon)\big)\left|\frac{\partial g(\phi,\epsilon)}{\partial\epsilon}\right| p\big(g(\phi,\epsilon);\phi\big)\,d\epsilon = \int f\big(g(\phi,\epsilon)\big)\,p(\epsilon)\,d\epsilon.$$


For example, when p(x; φ) is N(μ, σ²) and x = g(φ, ε) = μ + εσ,

$$\int f(x)\,p(x;\phi)\,dx = \int f(\mu + \epsilon\sigma)\,p(\epsilon)\,d\epsilon,$$

where ε = (x − μ)/σ and p(ε) is the standard normal density.

Then

$$\frac{\partial}{\partial\phi}\int f(x)\,p(x;\phi)\,dx = \int f'\big(g(\phi,\epsilon)\big)\,\frac{\partial g(\phi,\epsilon)}{\partial\phi}\,p(\epsilon)\,d\epsilon.$$
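A quick numerical check (illustrative; the test function f(x) = x² and the parameter values are assumptions): estimate ∂/∂μ E[f(x)] and ∂/∂σ E[f(x)] for x ~ N(μ, σ²) via the reparameterization x = μ + εσ and compare with the exact values 2μ and 2σ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
eps = rng.standard_normal(1_000_000)          # eps ~ N(0, 1), free of (mu, sigma)

f = lambda x: x**2
fprime = lambda x: 2 * x

x = mu + sigma * eps                          # x = g(phi, eps)
# chain rule inside the expectation: d/dmu f(g) = f'(g) * 1,  d/dsigma f(g) = f'(g) * eps
grad_mu = np.mean(fprime(x) * 1.0)
grad_sigma = np.mean(fprime(x) * eps)

print(f"d/dmu    E f(x): MC {grad_mu:.4f}   exact {2 * mu:.4f}")
print(f"d/dsigma E f(x): MC {grad_sigma:.4f}   exact {2 * sigma:.4f}")
```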


Back to evaluating the ELBO, let

$$h(x,z,\theta,\phi) = \log p(x,z;\theta) - \log q(z|x;\phi).$$

We need to evaluate

$$L(\theta,\phi) = \int h(x,z,\theta,\phi)\,q(z|x;\phi)\,dz = \int h\big(x, g(\phi,\epsilon), \theta, \phi\big)\,p(\epsilon)\,d\epsilon$$

and

$$\frac{\partial}{\partial\phi}L(\theta,\phi) = \int \frac{\partial}{\partial\phi}\,h\big(x, g(\phi,\epsilon), \theta, \phi\big)\,p(\epsilon)\,d\epsilon.$$

Evaluation is much simpler since the expectation is taken over a distribution free of the parameter φ. This can be evaluated stochastically.


Variational Autoencoders


Variational Autoencoder: Example

Goal: Given data x, learn a representation z = f(x) that approximates x well with minimal capacity.

ELBO: L(θ, φ) = E_{q(z|x;φ)}[log p(x|z; θ) + log p(z) − log q(z|x; φ)]

Decoder: Assume p(x|z; θ) ~ N(μ_{x|z}, Σ_{x|z}), where

$$\mu_{x|z} = a_K\big(a_{K-1}(\cdots a_2(a_1(z;\theta_1);\theta_2)\cdots;\theta_{K-1});\theta_K\big),$$

$$\mathrm{diag}(\Sigma_{x|z}) = b\big(a_{K-1}(\cdots a_2(a_1(z;\theta_1);\theta_2)\cdots;\theta_{K-1});\theta_b\big).$$

Parameters for μ_{x|z} and diag(Σ_{x|z}) are shared up to the (K−1)th layer.

Assume p(z) ~ N(0, I).


Encoder: Assume q(z|x) ~ N(μ_{z|x}, Σ_{z|x}), where

$$\mu_{z|x} = f_K\big(f_{K-1}(\cdots f_2(f_1(x;\phi_1);\phi_2)\cdots;\phi_{K-1});\phi_K\big),$$

$$\mathrm{diag}(\Sigma_{z|x}) = g\big(f_{K-1}(\cdots f_2(f_1(x;\phi_1);\phi_2)\cdots;\phi_{K-1});\phi_g\big).$$

Parameters for μ_{z|x} and diag(Σ_{z|x}) are shared up to the (K−1)th layer.
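A minimal PyTorch sketch of such an encoder, with a trunk shared up to the (K−1)th layer and two heads for μ_{z|x} and diag(Σ_{z|x}); the layer sizes and the log-variance parameterization are assumptions for illustration, and the decoder would mirror this structure with the roles of x and z swapped.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """q(z|x) = N(mu_{z|x}, diag(Sigma_{z|x})) with a shared trunk f_1, ..., f_{K-1}."""

    def __init__(self, d=784, k=20, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(              # shared layers f_1, ..., f_{K-1}
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden, k)      # f_K: gives mu_{z|x}
        self.logvar_head = nn.Linear(hidden, k)  # g: gives log diag(Sigma_{z|x})

    def forward(self, x):
        h = self.trunk(x)
        return self.mu_head(h), self.logvar_head(h)

enc = GaussianEncoder()
mu, logvar = enc(torch.rand(8, 784))
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized sample of z
```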


ELBO:

$$L(\theta,\phi) = E_{q(z|x;\phi)}[\log p(x|z)] + E_{q(z|x;\phi)}[\log p(z) - \log q(z|x;\phi)].$$

For the second term, a closed form can be obtained due to the normality assumption on q(z|x; φ). Up to the factor 1/2 and additive constants,

$$E_{q(z|x;\phi)}[\log p(z) - \log q(z|x;\phi)]$$

$$\propto -E\Big(\sum_{i=1}^{n} z_i^T z_i \,\Big|\, x;\phi\Big) + \sum_{i=1}^{n}\log|\Sigma_{z_i|x_i}| + E\sum_{i=1}^{n}(z_i - \mu_{z_i|x_i})^T\,\Sigma_{z_i|x_i}^{-1}\,(z_i - \mu_{z_i|x_i})$$

$$= -\sum_{i=1}^{n}\mathrm{tr}(\Sigma_{z_i|x_i}) - \sum_{i=1}^{n}\mathrm{tr}\big(\mu_{z_i|x_i}\mu_{z_i|x_i}^T\big) + \sum_{i=1}^{n}\log|\Sigma_{z_i|x_i}| + n\,\mathrm{tr}(I)$$
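For reference, with q(z_i|x_i) = N(μ_i, diag(σ_i²)) and p(z) = N(0, I), the same computation is often written per observation in the compact standard form (equivalent to the trace/log-determinant expression above up to the factor 1/2 and additive constants):

$$-KL\big(q(z_i|x_i)\,\|\,p(z_i)\big) = \frac{1}{2}\sum_{j=1}^{k}\Big(1 + \log\sigma_{ij}^2 - \mu_{ij}^2 - \sigma_{ij}^2\Big).$$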


For the first term, use the reparameterization trick,

$$E_{q(z|x;\phi)}[\log p(x|z)] = E_{p(\epsilon)}\big[\log p\big(x\,|\,g(\epsilon, x;\phi)\big)\big],$$

and estimate it stochastically using

$$E_{q(z|x;\phi)}[\log p(x|z)] \approx \frac{1}{M}\sum_{m=1}^{M}\log p\big(x\,|\,g(\epsilon^{(m)}, x;\phi)\big).$$


Putting it together, maximize a stochastic version of the ELBO,

$$L(\theta,\phi) = \sum_{i=1}^{n}\frac{1}{M}\sum_{m=1}^{M}\log p_\theta\big(x_i\,|\,g(\epsilon_{mi}, x_i;\phi)\big) - \sum_{i=1}^{n}\mathrm{tr}(\Sigma_{z_i|x_i}) - \sum_{i=1}^{n}\mathrm{tr}\big(\mu_{z_i|x_i}\mu_{z_i|x_i}^T\big) + \sum_{i=1}^{n}\log|\Sigma_{z_i|x_i}|,$$

where $g(\epsilon_{mi}, x_i;\phi) = \mu_{z_i|x_i} + \Sigma_{z_i|x_i}^{1/2}\,\epsilon_{mi}$ and $\epsilon_{mi} \sim N(0, I)$.

To generate data, sample z from the prior and generate x from p(x|z; θ).
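A sketch of the corresponding minibatch objective in PyTorch, assuming the hypothetical GaussianEncoder above, a Bernoulli decoder for log p_θ(x|z) (a common simplification for binarized image data), M = 1 by default, and a diagonal Σ_{z|x}; this illustrates the stochastic ELBO rather than reproducing the slides' exact Gaussian-decoder setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

decoder = nn.Sequential(                         # maps z to Bernoulli logits for p(x|z)
    nn.Linear(20, 256), nn.ReLU(),
    nn.Linear(256, 784),
)

def negative_elbo(x, mu, logvar, M=1):
    """Stochastic negative ELBO with M Monte Carlo samples (M = 1 is typical)."""
    recon = 0.0
    for _ in range(M):
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps   # g(eps, x; phi)
        logits = decoder(z)
        # -log p(x|z) for a Bernoulli decoder, summed over pixels and the minibatch
        recon = recon + F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    recon = recon / M
    # KL(q(z|x) || N(0, I)) in closed form, summed over the minibatch
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# usage with the (assumed) encoder sketched earlier:
#   mu, logvar = enc(x); loss = negative_elbo(x, mu, logvar); loss.backward()
# to generate data, sample z ~ N(0, I) and push it through the decoder
```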


Variational Autoencoder: Visualization of Learned Manifold

source: Kingma and Welling, 2014


Mean Field Variational Inference


Mean Field VI

Variational inference is named after variational calculus, in which we optimize a functional over all possible input functions. Without assuming a parametric model for q and optimizing over its parameter, we can consider minimizing the KL divergence over the function q itself. The mean field method takes this approach.

Assume the variational distribution over the latent variables factorizes as q(z1, z2, ..., zm | x) = ∏_{j=1}^m q(zj). Denote q(zj) by qj.

This approximation may not contain the true posterior because the latent variables are usually dependent.


Treating q1, ..., q_{j−1}, q_{j+1}, ..., qm as fixed, the ELBO can be expressed as

$$L(\theta, q_j) = \int \prod_{i=1}^{m} q_i\,\Big\{\log p(x,z) - \sum_{i=1}^{m}\log q_i\Big\}\,dz$$

$$= \int q_j\Big\{\int \log p(x,z)\prod_{i\ne j} q_i\,dz_i\Big\}\,dz_j - \int q_j\log q_j\,dz_j + c.$$

Let

$$\int \log p(x,z)\prod_{i\ne j} q_i\,dz_i \equiv E_{i\ne j}\big(\log p(x,z)\big).$$


ELBO:

$$L(\theta, q_j) = \int \prod_{i=1}^{m} q_i\,\Big\{\log p(x,z) - \sum_{i=1}^{m}\log q_i\Big\}\,dz$$

$$= \int q_j\Big\{\int \log p(x,z)\prod_{i\ne j} q_i\,dz_i\Big\}\,dz_j - \int q_j\log q_j\,dz_j + c$$

$$= \int q_j\,E_{i\ne j}\big(\log p(x,z)\big)\,dz_j - \int q_j\log q_j\,dz_j + c$$

$$= \int q_j\log\tilde p(x,z_j)\,dz_j - \int q_j\log q_j\,dz_j + c,$$

where

$$\log\tilde p(x,z_j) = E_{i\ne j}\big(\log p(x,z)\big) + c.$$


ELBO:

$$L(\theta, q_j) = \int q_j\log\tilde p(x,z_j)\,dz_j - \int q_j\log q_j\,dz_j + c = -KL\big(q_j\,\|\,\tilde p(x,z_j)\big) + c$$

If we fix q_i for i ≠ j, the ELBO is the negative KL divergence between q_j and p̃(x, z_j), up to a constant.

The ELBO is maximized when q_j* = p̃(x, z_j), i.e.,

$$q_j^* = \frac{\exp\big[E_{i\ne j}\{\log p(x,z)\}\big]}{\int\exp\big[E_{i\ne j}\{\log p(x,z)\}\big]\,dz_j}.$$
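A classical toy illustration of these updates (following Bishop's bivariate Gaussian example; the particular numbers are assumptions): approximate a correlated bivariate Gaussian p(z1, z2) by q(z1)q(z2). For a Gaussian target the optimal factors q_j* are Gaussian, and the updates reduce to recursions for their means.

```python
import numpy as np

# target p(z) = N(m, Sigma) with strong correlation rho
m = np.array([1.0, -1.0])
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Lam = np.linalg.inv(Sigma)                        # precision matrix

# mean-field factors q_j are Gaussian with fixed variance 1 / Lam[j, j]
mu = np.zeros(2)                                  # initial factor means
for _ in range(20):
    # q1* ∝ exp(E_{q2}[log p(z)])  =>  update for the mean of z1
    mu[0] = m[0] - Lam[0, 1] / Lam[0, 0] * (mu[1] - m[1])
    # q2* ∝ exp(E_{q1}[log p(z)])  =>  update for the mean of z2
    mu[1] = m[1] - Lam[1, 0] / Lam[1, 1] * (mu[0] - m[0])

print("factor means    :", mu)                    # converge to the true mean m
print("factor variances:", 1 / Lam[0, 0], 1 / Lam[1, 1])
# the factor variances 1/Lam_jj are smaller than the true marginal variances,
# the usual underestimation of uncertainty by mean field approximations
```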


Expectation propagation


Another approximate inference method: minimize KL(p||q) (Minka, 2001a; Minka, 2001b).

Consider the problem of minimizing KL(p||q) with respect to q(z) when p(z) is a fixed distribution and q(z) is a member of the exponential family,

$$q(z) = h(z)\,g(\eta)\,\exp\big(\eta^T u(z)\big).$$

The KL divergence then has the form

$$KL(p\,\|\,q) = -\log g(\eta) - \eta^T E_{p(z)}\big(u(z)\big) + \mathrm{const}.$$


$$KL(p\,\|\,q) = -\log g(\eta) - \eta^T E_{p(z)}\big(u(z)\big) + \mathrm{const}$$

Setting the derivative with respect to η equal to zero,

$$-\frac{\partial}{\partial\eta}\log g(\eta) = E_{p(z)}\big(u(z)\big).$$

Note that

$$-\frac{\partial}{\partial\eta}\log g(\eta) = E_{q(z)}\big(u(z)\big).$$

Then

$$E_{q(z)}\big(u(z)\big) = E_{p(z)}\big(u(z)\big).$$

The optimum solution corresponds to matching the expected sufficient statistics. This is called moment matching.
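A quick numerical illustration (not from the slides): when q is Gaussian, u(z) = (z, z²), so the KL(p||q)-optimal q simply matches the mean and variance of p; here p is an assumed two-component mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

# target p(z): mixture 0.5*N(-2, 0.5^2) + 0.5*N(2, 1.0^2)
comp = rng.integers(0, 2, size=1_000_000)
z = np.where(comp == 0, rng.normal(-2.0, 0.5, comp.size), rng.normal(2.0, 1.0, comp.size))

# minimizing KL(p||q) over Gaussians matches E_p[u(z)] for u(z) = (z, z^2),
# so the optimal q is N(mean of p, variance of p)
mu_q, var_q = z.mean(), z.var()

print(f"moment-matched q: N({mu_q:.3f}, {var_q:.3f})")
# analytic check: mean = 0, variance = 0.5*(0.25 + 4) + 0.5*(1 + 4) = 4.625
```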
