Deep Learning: A Statistical Perspective
Lecture 9: Variational Autoencoders (stat.snu.ac.kr/mcp/Lecture_9_VAE.pdf, 2019-10-30)


Unsupervised learning


In unsupervised learning, we try to learn about data without labels.

Unsupervised learning tasks involve estimation of the density of the input x, say p(x), or dimension reduction.

Dimension reduction tasks are sometimes called feature learning and include PCA, self-organizing maps, multidimensional scaling, and autoencoders.

Density estimation tasks include cluster analysis, anomaly detection, variational autoencoders, and generative adversarial networks.


Dimensionality and density estimation

To get a sense of the effect of dimensionality on estimating densities, consider kernel density estimators. Assume that X_i ∈ C ⊂ R^d where C is compact.

$$\hat p_h(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h^d}\, K\!\left(\frac{\|x - X_i\|}{h}\right)$$

Let β and L be positive numbers. Given a vector s = (s_1, ..., s_d), write |s| = s_1 + ... + s_d and

$$D^s = \frac{\partial^{\,s_1+\cdots+s_d}}{\partial x_1^{s_1}\cdots\partial x_d^{s_d}}.$$

Define the Hölder class

$$\Sigma(\beta, L) = \Big\{ g : |D^s g(x) - D^s g(y)| \le L\,\|x - y\|^{\beta - |s|} \ \text{for all } s \text{ with } |s| = \lfloor\beta\rfloor \text{ and all } x, y \Big\},$$

where ⌊β⌋ is the greatest integer strictly less than β. Then the MSE satisfies

$$\mathrm{MSE} \asymp h^{2\beta} + \frac{1}{n h^d}.$$

If h ≍ n^{-1/(2β+d)},

$$\sup_{p\in\Sigma(\beta,L)} E\int \big(\hat p_h(x) - p(x)\big)^2\, dx \le c\, n^{-2\beta/(2\beta+d)}.$$
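As a rough numerical companion (an illustration, not from the slides): the sketch below implements a Gaussian-kernel density estimator and prints the rate-optimal bandwidth and the minimax risk rate n^{-2β/(2β+d)} for a few dimensions, assuming β = 2 and a standard normal toy sample.

```python
import numpy as np

def kde(x, X, h):
    """Gaussian-kernel density estimate at query points x.

    x : (m, d) query points, X : (n, d) sample, h : bandwidth.
    Computes (1/n) sum_i h^{-d} K(||x - X_i|| / h) with a standard Gaussian K.
    """
    d = x.shape[1]
    d2 = ((x[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # ||x - X_i||^2
    K = np.exp(-d2 / (2.0 * h**2)) / (2.0 * np.pi * h**2) ** (d / 2.0)
    return K.mean(axis=1)

n, beta = 2000, 2.0                       # beta = 2 (assumed smoothness)
rng = np.random.default_rng(0)
for d in (1, 2, 5, 10):
    h = n ** (-1.0 / (2 * beta + d))      # rate-optimal bandwidth h ~ n^{-1/(2*beta+d)}
    rate = n ** (-2 * beta / (2 * beta + d))
    X = rng.standard_normal((n, d))       # toy sample from N(0, I)
    p0 = kde(np.zeros((1, d)), X, h)[0]
    print(f"d={d:2d}  h={h:.3f}  risk rate={rate:.5f}  p_hat(0)={p0:.4f}")
```

The printed risk rate degrades quickly with d, which is the point of the slide.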


High dimension and density estimation

The rate of convergence n^{-2β/(2β+d)} is slow when the dimension d is large.

Instead of estimating p precisely, we have to settle for finding an adequate approximation that identifies the regions where p puts large amounts of mass.

Consider

$$\bar p_h(x) = E\big(\hat p_h(x)\big) = \int \frac{1}{h^d}\, K\!\left(\frac{\|x - u\|}{h}\right) p(u)\, du.$$

Then, applying

$$P\big(\|\hat p_h - \bar p_h\|_\infty > \epsilon\big) \le C e^{-nc\epsilon^2},$$

we obtain

$$\|\hat p_h - \bar p_h\|_\infty = O_P\!\left(\sqrt{\frac{\log n}{n}}\right).$$

This rate of convergence does not depend on d, but how to choose h is not clear.


Unsupervised learning

In deep learning we may face cases where a distribution is concentrated near a lower-dimensional set. This causes problems for density estimation; the density may not even be well defined. Current approaches attempt to find a smoothed density that is well defined on a lower-dimensional representation.

The variational autoencoder (VAE) and the generative adversarial network (GAN) attempt to estimate the density through a lower-dimensional representation obtained via deep models.

We first focus on dimension reduction and go over autoencoders.


Autoencoders


• Goal: Given data x ∈ R^d, learn a representation z ∈ R^k, k < d, where z = f(x) approximates x well with minimal capacity.

If the dimension of z is the same as that of x, the model can simply reproduce its input. We want to maximally compress x without losing too much information.

The network may be viewed as consisting of two parts: an encoder function z = f(x) and a decoder that produces a reconstruction x̂ = g(z), with g(f(x)) ≈ x.


Shallow autoencoders

When f(·) is linear, the problem boils down to principal component analysis (PCA).

Let z = w1 x and x̂ = w2 z, with w1 ∈ R^{k×d} and w2 ∈ R^{d×k}, so that x̂ = w2 w1 x. Consider minimizing

$$E\|x - w_2 w_1 x\|_2^2$$

over w1 and w2.

The loss E‖x − w2 w1 x‖²₂ can be made smaller by increasing the dimension of z.

It can be shown that the columns of w2 are orthonormal (w2ᵀ w2 = I_{k×k}) and that w1 = w2ᵀ.

For any w1 and w2, the set R = {w2 w1 x : x ∈ R^d} lies in a k-dimensional linear subspace of R^d. Let V ∈ R^{d×k} be a matrix whose columns form an orthonormal basis of this subspace, so that each vector in R can be written as V y with y ∈ R^k. For every x ∈ R^d and y ∈ R^k we have

$$\|x - Vy\|_2^2 = \|x\|^2 + y^T V^T V y - 2 y^T V^T x = \|x\|^2 + \|y\|^2 - 2 y^T V^T x,$$

which is minimized when y = Vᵀ x. Therefore x̂ = V Vᵀ x is the minimizer of ‖x − x̂‖²₂.
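This equivalence with PCA can be checked numerically. The following sketch (an illustration, not part of the slides) compares the reconstruction error of the optimal linear autoencoder w2 = V, w1 = Vᵀ, with V the top-k principal directions, against a random rank-k subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 500, 10, 3

# toy data whose variance is mostly concentrated in a 3-dimensional subspace
A = rng.standard_normal((d, k))
X = rng.standard_normal((n, k)) @ A.T + 0.1 * rng.standard_normal((n, d))
X = X - X.mean(axis=0)                         # center the data, as in PCA

# top-k principal directions V from the SVD of the data matrix
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T                                   # (d, k), orthonormal columns

def recon_error(W1, W2):
    """Empirical mean of ||x - W2 W1 x||^2 over the sample."""
    Xhat = X @ W1.T @ W2.T
    return np.mean(np.sum((X - Xhat) ** 2, axis=1))

err_pca = recon_error(V.T, V)                  # w1 = V^T, w2 = V, so x_hat = V V^T x

Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # a random orthonormal rank-k basis
err_rand = recon_error(Q.T, Q)

print(f"PCA subspace error   : {err_pca:.4f}")
print(f"random subspace error: {err_rand:.4f}")    # larger than the PCA error
```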


Deep autoencoders

f(·) can be compositional, built from linear, nonlinear (e.g., ReLU), convolutional layers, etc.
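A minimal sketch of such a deep autoencoder in PyTorch (layer sizes are hypothetical, chosen only for illustration):

```python
import torch
import torch.nn as nn

class DeepAutoencoder(nn.Module):
    """Encoder f and decoder g composed of linear layers and ReLU nonlinearities."""

    def __init__(self, d=784, k=32):
        super().__init__()
        self.encoder = nn.Sequential(            # z = f(x)
            nn.Linear(d, 256), nn.ReLU(),
            nn.Linear(256, k),
        )
        self.decoder = nn.Sequential(            # x_hat = g(z)
            nn.Linear(k, 256), nn.ReLU(),
            nn.Linear(256, d),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DeepAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                          # a dummy minibatch
loss = nn.functional.mse_loss(model(x), x)       # reconstruction loss ||x - g(f(x))||^2
loss.backward()
opt.step()
```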


Variational Autoencoders (VAE)

VAEs have already shown promise in generating many kinds of complicated data, including handwritten digits, faces, house numbers, CIFAR images, physical models of scenes, segmentation, and predicting the future from static images.

An approximate Bayesian inference method called variational inference is used.

We sidetrack for a moment to study variational inference.


Approximate Inference: Variational Inference


Approximate inference: posterior approximation

In Bayesian inference, the posterior distribution

$$p(z|x) = \frac{p(x|z)\,p(z)}{\int p(x|v)\,p(v)\,dv}$$

is of interest.

For most models, the posterior is analytically intractable.

Variational inference attempts to approximate the posterior via q(z|x).

Minimize the Kullback-Leibler divergence KL(q(z|x)||p(z|x)) or KL(p(z|x)||q(z|x)), where

$$KL\big(q(z|x)\,\|\,p(z|x)\big) = \int q(z|x)\,\log\frac{q(z|x)}{p(z|x)}\,dz.$$


Two types of KL divergence

For KL(q(z|x)||p(z|x)) = ∫ q(z|x) log[q(z|x)/p(z|x)] dz: if p(z|x) = 0, we must have q(z|x) = 0. Minimizing KL(q||p) forces q to choose a single mode.

For KL(p(z|x)||q(z|x)) = ∫ p(z|x) log[p(z|x)/q(z|x)] dz: if p(z|x) > 0, we must have q(z|x) > 0. Minimizing KL(p||q) forces q to cover both modes.

source: Bishop (2006) Pattern recognition and machine learning. Springer
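A small numerical illustration of the two behaviours (not from the slides; the target mixture and the grid search are assumptions for illustration): fit a single Gaussian q to a two-mode target p by minimizing each divergence on a grid.

```python
import numpy as np

z = np.linspace(-8, 8, 4001)
dz = z[1] - z[0]

def normal(z, mu, sig):
    return np.exp(-0.5 * ((z - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

# two-mode target p(z): mixture of N(-3, 0.5^2) and N(+3, 0.5^2)
p = 0.5 * normal(z, -3, 0.5) + 0.5 * normal(z, 3, 0.5)

def kl(a, b):
    """Discretized KL(a || b) on the grid."""
    mask = a > 1e-12
    return np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-300))) * dz

best_qp, best_pq = None, None
for mu in np.linspace(-4, 4, 81):
    for sig in np.linspace(0.3, 4, 75):
        q = normal(z, mu, sig)
        if best_qp is None or kl(q, p) < best_qp[0]:
            best_qp = (kl(q, p), mu, sig)
        if best_pq is None or kl(p, q) < best_pq[0]:
            best_pq = (kl(p, q), mu, sig)

print("argmin KL(q||p): mu=%.2f sigma=%.2f (locks onto one mode)" % best_qp[1:])
print("argmin KL(p||q): mu=%.2f sigma=%.2f (covers both modes)" % best_pq[1:])
```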


Two types of approximate inference: variational inference vs. expectation propagation

Approximate inference minimizing KL(p(z|x)||q(z|x)) includes expectation propagation (EP).

KL(p(z|x)||q(z|x)) is harder to evaluate.

Approximate inference minimizing KL(q(z|x)||p(z|x)) is called variational inference (VI).

VI is more popular than EP due to computational feasibility.


Specifying variational distributions

Variational inference is originally formulated as an optimization problem in which the quantity being optimized is a functional. The solution is obtained by exploring all possible input functions to find the one that maximizes, or minimizes, the functional. We can consider minimizing the KL divergence over the function q itself; this approach will be discussed with the mean field method.

We first consider the case in which q is indexed by a finite-dimensional parameter.


Variational Inference: Marginal likelihood, KL divergence and ELBO

Variational inference minimizes KL(q(z|x)||p(z|x)).

For any distribution q(z|x; φ) indexed by φ, we have

$$\log p(x;\theta) = \int q(z|x;\phi)\,\log p(x;\theta)\,dz = \int q(z|x;\phi)\,\log\frac{p(x,z;\theta)}{p(z|x;\theta)}\,dz$$

$$= \int q(z|x;\phi)\,\log\left[\frac{p(x,z;\theta)}{p(z|x;\theta)}\cdot\frac{q(z|x;\phi)}{q(z|x;\phi)}\right]dz$$

$$= \int q(z|x;\phi)\left[\log\frac{q(z|x;\phi)}{p(z|x;\theta)} + \log\frac{p(x,z;\theta)}{q(z|x;\phi)}\right]dz$$

$$= \int q(z|x;\phi)\,\log\frac{q(z|x;\phi)}{p(z|x;\theta)}\,dz + \int q(z|x;\phi)\,\log\frac{p(x,z;\theta)}{q(z|x;\phi)}\,dz$$


Let

$$L(\theta,\phi) = \int q(z|x;\phi)\,\log\frac{p(x,z;\theta)}{q(z|x;\phi)}\,dz.$$

Note that the first term above is

$$\int q(z|x;\phi)\,\log\frac{q(z|x;\phi)}{p(z|x;\theta)}\,dz \equiv KL\big(q(z|x;\phi)\,\|\,p(z|x;\theta)\big).$$

Then

$$\log p(x;\theta) = KL\big(q(z|x;\phi)\,\|\,p(z|x;\theta)\big) + L(\theta,\phi).$$

By maximizing L(θ, φ), we minimize KL(q(z|x;φ)||p(z|x;θ)).


Variational Inference: ELBO

For any distribution q(z|x; φ) indexed by φ, we have

$$\log p(x;\theta) = \log\left(\int p(x,z;\theta)\,dz\right) = \log\left(\int\frac{p(x,z;\theta)}{q(z|x;\phi)}\,q(z|x;\phi)\,dz\right)$$

$$\ge \int \log\left(\frac{p(x,z;\theta)}{q(z|x;\phi)}\right) q(z|x;\phi)\,dz \equiv L(\theta,\phi).$$

Recall Jensen's inequality: E(f(x)) ≥ f(E(x)) if f is convex, equivalently E(f(x)) ≤ f(E(x)) if f is concave; the inequality above applies the concave case with f = log.

We call L(θ, φ) the evidence lower bound (ELBO).


Variational Inference: ELBO as a penalized expected likelihood

We can express

$$\mathrm{ELBO} = L(\theta,\phi) = \int q(z|x;\phi)\,\log\frac{p(x,z;\theta)}{q(z|x;\phi)}\,dz$$

$$= E_{q(z|x;\phi)}[\log p(x,z;\theta)] - E_{q(z|x;\phi)}[\log q(z|x;\phi)]$$

$$= E[\log p(x|z;\theta)] + E[\log p(z)] - E[\log q(z|x;\phi)]$$

$$= E[\log p(x|z;\theta)] - KL\big(q(z|x;\phi)\,\|\,p(z)\big).$$

Maximizing the ELBO is equivalent to maximizing E[log p(x|z; θ)] with penalty KL(q(z|x;φ)||p(z)).
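As a quick sanity check (an illustration with assumed toy densities, not from the slides), the two forms of the ELBO can be compared by Monte Carlo for a scalar model with p(z) = N(0, 1), p(x|z) = N(z, 1) and q(z|x) = N(m, s²):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mu, sig):
    return -0.5 * np.log(2 * np.pi * sig**2) - 0.5 * ((x - mu) / sig) ** 2

x, m, s = 1.3, 0.8, 0.7                       # observed x and variational parameters (assumed)
z = m + s * rng.standard_normal(200_000)      # z ~ q(z|x)

# Form 1: E_q[log p(x, z)] - E_q[log q(z|x)]
form1 = np.mean(log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0) - log_normal(z, m, s))

# Form 2: E_q[log p(x|z)] - KL(q(z|x) || p(z)), with the Gaussian KL in closed form
kl = np.log(1.0 / s) + (s**2 + m**2 - 1.0) / 2.0
form2 = np.mean(log_normal(x, z, 1.0)) - kl

print(form1, form2)                           # the two estimates agree up to Monte Carlo error
```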


Optimizing ELBO with respect to parameters

Optimizing the ELBO requires evaluating ∂L(θ,φ)/∂θ and ∂L(θ,φ)/∂φ.

First,

$$\frac{\partial}{\partial\theta}L(\theta,\phi) = \frac{\partial}{\partial\theta}E_{q(z|x;\phi)}[\log p(x,z;\theta)] = \int \frac{p'(x|z;\theta)}{p(x|z;\theta)}\,q(z|x;\phi)\,dz,$$

where p′(x|z; θ) denotes the derivative of p(x|z; θ) with respect to θ. This term is usually computed stochastically.
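Concretely, the stochastic evaluation draws samples z^{(m)} ~ q(z|x; φ) and averages the gradients (a standard Monte Carlo estimator, stated here for completeness):

$$\frac{\partial}{\partial\theta}L(\theta,\phi) \approx \frac{1}{M}\sum_{m=1}^{M}\frac{\partial}{\partial\theta}\log p\big(x, z^{(m)};\theta\big), \qquad z^{(m)} \sim q(z|x;\phi).$$

This estimator is unbiased because φ, not θ, governs the sampling distribution.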


Optimizing ELBO

Optimizing the ELBO requires evaluating ∂L(θ,φ)/∂θ and ∂L(θ,φ)/∂φ.

$$\frac{\partial}{\partial\phi}L(\theta,\phi) = \frac{\partial}{\partial\phi}E_{q(z|x;\phi)}[\log p(x,z;\theta)] - \frac{\partial}{\partial\phi}E_{q(z|x;\phi)}[\log q(z|x;\phi)]$$

$$= \int [\log p(x,z;\theta)]\,\frac{q'(z|x;\phi)}{q(z|x;\phi)}\,q(z|x;\phi)\,dz - \int \frac{q'(z|x;\phi)}{q(z|x;\phi)}\,q(z|x;\phi)\,dz - \int [\log q(z|x;\phi)]\,\frac{q'(z|x;\phi)}{q(z|x;\phi)}\,q(z|x;\phi)\,dz$$

Computation is complicated because the parameter φ appears in the distribution over which the expectation is taken. An EM-type iterative stochastic evaluation of each term is not guaranteed to converge.

One may attempt to reparameterize the ELBO so that the expectation is over a distribution free of parameters.


Reparameterization trick

In evaluating ∂/∂φ ∫ f(x) p(x; φ) dx, let x = g(φ, ε), where g is differentiable and p(ε) is free of φ. That is,

$$p\big(g(\phi,\epsilon);\phi\big)\left|\frac{\partial g(\phi,\epsilon)}{\partial\epsilon}\right| = p(\epsilon).$$

Then

$$\int f(x)\,p(x;\phi)\,dx = \int f\big(g(\phi,\epsilon)\big)\left|\frac{\partial g(\phi,\epsilon)}{\partial\epsilon}\right| p\big(g(\phi,\epsilon);\phi\big)\,d\epsilon = \int f\big(g(\phi,\epsilon)\big)\,p(\epsilon)\,d\epsilon.$$


For example, when p(x; φ) is N(μ, σ²) and x = g(φ, ε) = μ + εσ,

$$\int f(x)\,p(x;\phi)\,dx = \int f(\mu + \epsilon\sigma)\,p(\epsilon)\,d\epsilon,$$

where ε = (x − μ)/σ and p(ε) is the standard normal density.

Then

$$\frac{\partial}{\partial\phi}\int f(x)\,p(x;\phi)\,dx = \int f'\big(g(\phi,\epsilon)\big)\,\frac{\partial g(\phi,\epsilon)}{\partial\phi}\,p(\epsilon)\,d\epsilon.$$
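A quick numerical check (illustrative; the test function f(x) = x² and the parameter values are assumptions): estimate ∂/∂μ E[f(x)] and ∂/∂σ E[f(x)] for x ~ N(μ, σ²) via the reparameterization x = μ + εσ and compare with the exact values 2μ and 2σ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
eps = rng.standard_normal(1_000_000)          # eps ~ N(0, 1), free of (mu, sigma)

f = lambda x: x**2
fprime = lambda x: 2 * x

x = mu + sigma * eps                          # x = g(phi, eps)
# chain rule inside the expectation: d/dmu f(g) = f'(g) * 1,  d/dsigma f(g) = f'(g) * eps
grad_mu = np.mean(fprime(x) * 1.0)
grad_sigma = np.mean(fprime(x) * eps)

print(f"d/dmu    E f(x): MC {grad_mu:.4f}   exact {2 * mu:.4f}")
print(f"d/dsigma E f(x): MC {grad_sigma:.4f}   exact {2 * sigma:.4f}")
```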


Back to evaluating the ELBO, let

$$h(x,z,\theta,\phi) = \log p(x,z;\theta) - \log q(z|x;\phi).$$

We need to evaluate

$$L(\theta,\phi) = \int h(x,z,\theta,\phi)\,q(z|x;\phi)\,dz = \int h\big(x, g(\phi,\epsilon), \theta, \phi\big)\,p(\epsilon)\,d\epsilon$$

and

$$\frac{\partial}{\partial\phi}L(\theta,\phi) = \int \frac{\partial}{\partial\phi}\,h\big(x, g(\phi,\epsilon), \theta, \phi\big)\,p(\epsilon)\,d\epsilon.$$

Evaluation is much simpler since the expectation is taken over a distribution free of the parameter φ. This can be evaluated stochastically.


Variational Autoencoders


Variational Autoencoder: Example

Goal: Given data x, learn a representation z = f(x) that approximates x well with minimal capacity.

ELBO: L(θ, φ) = E_{q(z|x;φ)}[log p(x|z; θ) + log p(z) − log q(z|x; φ)]

Decoder: Assume p(x|z; θ) ~ N(μ_{x|z}, Σ_{x|z}), where

$$\mu_{x|z} = a_K\big(a_{K-1}(\cdots a_2(a_1(z;\theta_1);\theta_2)\cdots;\theta_{K-1});\theta_K\big),$$

$$\mathrm{diag}(\Sigma_{x|z}) = b\big(a_{K-1}(\cdots a_2(a_1(z;\theta_1);\theta_2)\cdots;\theta_{K-1});\theta_b\big).$$

Parameters for μ_{x|z} and diag(Σ_{x|z}) are shared up to the (K−1)th layer.

Assume p(z) ~ N(0, I).


Encoder: Assume q(z|x) ~ N(μ_{z|x}, Σ_{z|x}), where

$$\mu_{z|x} = f_K\big(f_{K-1}(\cdots f_2(f_1(x;\phi_1);\phi_2)\cdots;\phi_{K-1});\phi_K\big),$$

$$\mathrm{diag}(\Sigma_{z|x}) = g\big(f_{K-1}(\cdots f_2(f_1(x;\phi_1);\phi_2)\cdots;\phi_{K-1});\phi_g\big).$$

Parameters for μ_{z|x} and diag(Σ_{z|x}) are shared up to the (K−1)th layer.
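A minimal PyTorch sketch of such an encoder, with a trunk shared up to the (K−1)th layer and two heads for μ_{z|x} and diag(Σ_{z|x}); the layer sizes and the log-variance parameterization are assumptions for illustration, and the decoder would mirror this structure with the roles of x and z swapped.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """q(z|x) = N(mu_{z|x}, diag(Sigma_{z|x})) with a shared trunk f_1, ..., f_{K-1}."""

    def __init__(self, d=784, k=20, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(              # shared layers f_1, ..., f_{K-1}
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden, k)      # f_K: gives mu_{z|x}
        self.logvar_head = nn.Linear(hidden, k)  # g: gives log diag(Sigma_{z|x})

    def forward(self, x):
        h = self.trunk(x)
        return self.mu_head(h), self.logvar_head(h)

enc = GaussianEncoder()
mu, logvar = enc(torch.rand(8, 784))
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized sample of z
```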


ELBO:

$$L(\theta,\phi) = E_{q(z|x;\phi)}[\log p(x|z)] + E_{q(z|x;\phi)}[\log p(z) - \log q(z|x;\phi)].$$

For the second term, a closed form can be obtained due to the normality assumption on q(z|x; φ). Up to the factor 1/2 and additive constants,

$$E_{q(z|x;\phi)}[\log p(z) - \log q(z|x;\phi)]$$

$$\propto -E\Big(\sum_{i=1}^{n} z_i^T z_i \,\Big|\, x;\phi\Big) + \sum_{i=1}^{n}\log|\Sigma_{z_i|x_i}| + E\sum_{i=1}^{n}(z_i - \mu_{z_i|x_i})^T\,\Sigma_{z_i|x_i}^{-1}\,(z_i - \mu_{z_i|x_i})$$

$$= -\sum_{i=1}^{n}\mathrm{tr}(\Sigma_{z_i|x_i}) - \sum_{i=1}^{n}\mathrm{tr}\big(\mu_{z_i|x_i}\mu_{z_i|x_i}^T\big) + \sum_{i=1}^{n}\log|\Sigma_{z_i|x_i}| + n\,\mathrm{tr}(I)$$
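For reference, with q(z_i|x_i) = N(μ_i, diag(σ_i²)) and p(z) = N(0, I), the same computation is often written per observation in the compact standard form (equivalent to the trace/log-determinant expression above up to the factor 1/2 and additive constants):

$$-KL\big(q(z_i|x_i)\,\|\,p(z_i)\big) = \frac{1}{2}\sum_{j=1}^{k}\Big(1 + \log\sigma_{ij}^2 - \mu_{ij}^2 - \sigma_{ij}^2\Big).$$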


For the first term, use the reparameterization trick,

$$E_{q(z|x;\phi)}[\log p(x|z)] = E_{p(\epsilon)}\big[\log p\big(x\,|\,g(\epsilon, x;\phi)\big)\big],$$

and estimate it stochastically using

$$E_{q(z|x;\phi)}[\log p(x|z)] \approx \frac{1}{M}\sum_{m=1}^{M}\log p\big(x\,|\,g(\epsilon^{(m)}, x;\phi)\big).$$


Putting it together, maximize a stochastic version of the ELBO,

$$L(\theta,\phi) = \sum_{i=1}^{n}\frac{1}{M}\sum_{m=1}^{M}\log p_\theta\big(x_i\,|\,g(\epsilon_{mi}, x_i;\phi)\big) - \sum_{i=1}^{n}\mathrm{tr}(\Sigma_{z_i|x_i}) - \sum_{i=1}^{n}\mathrm{tr}\big(\mu_{z_i|x_i}\mu_{z_i|x_i}^T\big) + \sum_{i=1}^{n}\log|\Sigma_{z_i|x_i}|,$$

where $g(\epsilon_{mi}, x_i;\phi) = \mu_{z_i|x_i} + \Sigma_{z_i|x_i}^{1/2}\,\epsilon_{mi}$ and $\epsilon_{mi} \sim N(0, I)$.

To generate data, sample z from the prior and generate x from p(x|z; θ).
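A sketch of the corresponding minibatch objective in PyTorch, assuming the hypothetical GaussianEncoder above, a Bernoulli decoder for log p_θ(x|z) (a common simplification for binarized image data), M = 1 by default, and a diagonal Σ_{z|x}; this illustrates the stochastic ELBO rather than reproducing the slides' exact Gaussian-decoder setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

decoder = nn.Sequential(                         # maps z to Bernoulli logits for p(x|z)
    nn.Linear(20, 256), nn.ReLU(),
    nn.Linear(256, 784),
)

def negative_elbo(x, mu, logvar, M=1):
    """Stochastic negative ELBO with M Monte Carlo samples (M = 1 is typical)."""
    recon = 0.0
    for _ in range(M):
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps   # g(eps, x; phi)
        logits = decoder(z)
        # -log p(x|z) for a Bernoulli decoder, summed over pixels and the minibatch
        recon = recon + F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    recon = recon / M
    # KL(q(z|x) || N(0, I)) in closed form, summed over the minibatch
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# usage with the (assumed) encoder sketched earlier:
#   mu, logvar = enc(x); loss = negative_elbo(x, mu, logvar); loss.backward()
# to generate data, sample z ~ N(0, I) and push it through the decoder
```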


Variational Autoencoder: Visualization of Learned Manifold

source: Kingma and Welling, 2014


Mean Field Variational Inference


Mean Field VI

Variational inference is named after variational calculus, in which we optimize a functional over all possible input functions. Without assuming a parametric model for q and optimizing over its parameter, we can consider minimizing the KL divergence over the function q itself. The mean field method takes this approach.

Assume the variational distribution over the latent variables factorizes as q(z1, z2, ..., zm | x) = ∏_{j=1}^m q(zj). Denote q(zj) by qj.

This approximation may not contain the true posterior because the latent variables are usually dependent.


Treating q1, ..., q_{j−1}, q_{j+1}, ..., qm as fixed, the ELBO can be expressed as

$$L(\theta, q_j) = \int \prod_{i=1}^{m} q_i\,\Big\{\log p(x,z) - \sum_{i=1}^{m}\log q_i\Big\}\,dz$$

$$= \int q_j\Big\{\int \log p(x,z)\prod_{i\ne j} q_i\,dz_i\Big\}\,dz_j - \int q_j\log q_j\,dz_j + c.$$

Let

$$\int \log p(x,z)\prod_{i\ne j} q_i\,dz_i \equiv E_{i\ne j}\big(\log p(x,z)\big).$$


ELBO:

$$L(\theta, q_j) = \int \prod_{i=1}^{m} q_i\,\Big\{\log p(x,z) - \sum_{i=1}^{m}\log q_i\Big\}\,dz$$

$$= \int q_j\Big\{\int \log p(x,z)\prod_{i\ne j} q_i\,dz_i\Big\}\,dz_j - \int q_j\log q_j\,dz_j + c$$

$$= \int q_j\,E_{i\ne j}\big(\log p(x,z)\big)\,dz_j - \int q_j\log q_j\,dz_j + c$$

$$= \int q_j\log\tilde p(x,z_j)\,dz_j - \int q_j\log q_j\,dz_j + c,$$

where

$$\log\tilde p(x,z_j) = E_{i\ne j}\big(\log p(x,z)\big) + c.$$


ELBO:

$$L(\theta, q_j) = \int q_j\log\tilde p(x,z_j)\,dz_j - \int q_j\log q_j\,dz_j + c = -KL\big(q_j\,\|\,\tilde p(x,z_j)\big) + c$$

If we fix q_i for i ≠ j, the ELBO is the negative KL divergence between q_j and p̃(x, z_j), up to a constant.

The ELBO is maximized when q_j* = p̃(x, z_j), i.e.,

$$q_j^* = \frac{\exp\big[E_{i\ne j}\{\log p(x,z)\}\big]}{\int\exp\big[E_{i\ne j}\{\log p(x,z)\}\big]\,dz_j}.$$
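A classical toy illustration of these updates (following Bishop's bivariate Gaussian example; the particular numbers are assumptions): approximate a correlated bivariate Gaussian p(z1, z2) by q(z1)q(z2). For a Gaussian target the optimal factors q_j* are Gaussian, and the updates reduce to recursions for their means.

```python
import numpy as np

# target p(z) = N(m, Sigma) with strong correlation rho
m = np.array([1.0, -1.0])
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Lam = np.linalg.inv(Sigma)                        # precision matrix

# mean-field factors q_j are Gaussian with fixed variance 1 / Lam[j, j]
mu = np.zeros(2)                                  # initial factor means
for _ in range(20):
    # q1* ∝ exp(E_{q2}[log p(z)])  =>  update for the mean of z1
    mu[0] = m[0] - Lam[0, 1] / Lam[0, 0] * (mu[1] - m[1])
    # q2* ∝ exp(E_{q1}[log p(z)])  =>  update for the mean of z2
    mu[1] = m[1] - Lam[1, 0] / Lam[1, 1] * (mu[0] - m[0])

print("factor means    :", mu)                    # converge to the true mean m
print("factor variances:", 1 / Lam[0, 0], 1 / Lam[1, 1])
# the factor variances 1/Lam_jj are smaller than the true marginal variances,
# the usual underestimation of uncertainty by mean field approximations
```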


Expectation propagation


Another approximate inference method: minimize KL(p||q) (Minka, 2001a; Minka, 2001b).

Consider the problem of minimizing KL(p||q) with respect to q(z) when p(z) is a fixed distribution and q(z) is a member of the exponential family,

$$q(z) = h(z)\,g(\eta)\,\exp\big(\eta^T u(z)\big).$$

The KL divergence then has the form

$$KL(p\,\|\,q) = -\log g(\eta) - \eta^T E_{p(z)}\big(u(z)\big) + \mathrm{const}.$$


$$KL(p\,\|\,q) = -\log g(\eta) - \eta^T E_{p(z)}\big(u(z)\big) + \mathrm{const}$$

Setting the derivative with respect to η equal to zero,

$$-\frac{\partial}{\partial\eta}\log g(\eta) = E_{p(z)}\big(u(z)\big).$$

Note that

$$-\frac{\partial}{\partial\eta}\log g(\eta) = E_{q(z)}\big(u(z)\big).$$

Then

$$E_{q(z)}\big(u(z)\big) = E_{p(z)}\big(u(z)\big).$$

The optimum solution corresponds to matching the expected sufficient statistics. This is called moment matching.
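A quick numerical illustration (not from the slides): when q is Gaussian, u(z) = (z, z²), so the KL(p||q)-optimal q simply matches the mean and variance of p; here p is an assumed two-component mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

# target p(z): mixture 0.5*N(-2, 0.5^2) + 0.5*N(2, 1.0^2)
comp = rng.integers(0, 2, size=1_000_000)
z = np.where(comp == 0, rng.normal(-2.0, 0.5, comp.size), rng.normal(2.0, 1.0, comp.size))

# minimizing KL(p||q) over Gaussians matches E_p[u(z)] for u(z) = (z, z^2),
# so the optimal q is N(mean of p, variance of p)
mu_q, var_q = z.mean(), z.var()

print(f"moment-matched q: N({mu_q:.3f}, {var_q:.3f})")
# analytic check: mean = 0, variance = 0.5*(0.25 + 4) + 0.5*(1 + 4) = 4.625
```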
