
Probabilistic modelling for NLP powered by deep learning

Wilker Aziz
University of Amsterdam

April 3, 2018


Outline

1 Deep generative models

2 Variational inference

3 Variational auto-encoder


Problems

Supervised problems

“learn a distribution over observed data”

sentences in natural language

images, ...

Unsupervised problems

“learn a distribution over observed and unobserved data”

sentences in natural language + parse trees

images + bounding boxes, ...


Supervised problems

We have data $x^{(1)}, \dots, x^{(N)}$, e.g.

sentences, images, ...

generated by some unknown procedure

which we assume can be captured by a probabilistic model

with known probability (mass/density) function, e.g.

$\underbrace{X \sim \mathrm{Cat}(\pi_1, \dots, \pi_K)}_{\text{e.g. nationality}}$ or $\underbrace{X \sim \mathcal{N}(\mu, \sigma^2)}_{\text{e.g. height}}$

and we estimate parameters that assign maximum likelihood to the observations.


Multiple problems, same language

(Conditional) density estimation:

Task          Side information (φ)       Observation (x)
Parsing       a sentence                 parse tree
Translation   a sentence in French       translation in English
Captioning    an image                   caption in English
Entailment    a text and hypothesis      entailment relation


Where does deep learning kick in?

Let φ be all the side information available, e.g. deterministic inputs/features.

Have neural networks predict the parameters of our probabilistic model

$X \mid \phi \sim \mathrm{Cat}(\pi_\theta(\phi))$ or $X \mid \phi \sim \mathcal{N}(\mu_\theta(\phi), \sigma_\theta(\phi)^2)$

and proceed to estimate the parameters θ of the NNs.
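As a concrete illustration, a minimal sketch of this idea, assuming PyTorch (the feature vector phi, network sizes and observations are made up): the NNs output distribution parameters, and the distribution object supplies the log-probability.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

phi = torch.randn(8)                      # toy side information (hypothetical 8-dim features)

# NN that outputs the logits of a Categorical over K classes: X | phi ~ Cat(pi_theta(phi))
K = 5
cat_net = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, K))
p_discrete = Categorical(logits=cat_net(phi))

# NN that outputs mean and log-scale of a Normal: X | phi ~ N(mu_theta(phi), sigma_theta(phi)^2)
norm_net = nn.Linear(8, 2)
mu, log_sigma = norm_net(phi)             # two heads: location and log of the scale
p_continuous = Normal(loc=mu, scale=log_sigma.exp())

# the loss used to estimate theta is the negative log-probability of the observations
x_discrete, x_continuous = torch.tensor(2), torch.tensor(1.3)
loss = -(p_discrete.log_prob(x_discrete) + p_continuous.log_prob(x_continuous))
```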


NN as efficient parametrisation

From the statistical point of view, NNs do not generate data:

they parametrise distributions that, by assumption, govern the data

a compact and efficient way to map from complex side information to parameter space

Prediction is done by a decision rule outside the statistical model

e.g. beam search


Maximum likelihood estimation

Let $p(x \mid \theta)$ be the probability of an observation $x$ and let θ refer to all of its parameters, e.g. the parameters of the NNs involved.

Given a dataset $x^{(1)}, \dots, x^{(N)}$ of i.i.d. observations, the likelihood function

$\mathcal{L}(\theta \mid x^{(1:N)}) = \log \prod_{s=1}^{N} p(x^{(s)} \mid \theta) = \sum_{s=1}^{N} \log p(x^{(s)} \mid \theta)$

quantifies the fitness of our model to the data.


MLE via gradient-based optimisation

If the log-likelihood is differentiable and tractable to evaluate, then backpropagation gives us the gradient

$\nabla_\theta \mathcal{L}(\theta \mid x^{(1:N)}) = \nabla_\theta \sum_{s=1}^{N} \log p(x^{(s)} \mid \theta) = \sum_{s=1}^{N} \nabla_\theta \log p(x^{(s)} \mid \theta)$

and we can update θ in the direction

$\gamma \nabla_\theta \mathcal{L}(\theta \mid x^{(1:N)})$

to attain a local optimum of the likelihood function.
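A small sketch of that loop, assuming PyTorch autograd and synthetic Gaussian data (data, initial values and the learning rate gamma are made up): the negative log-likelihood is the loss, backprop supplies the gradient, and each step moves θ by γ times that gradient.

```python
import torch
from torch.distributions import Normal

data = torch.randn(1000) * 2.0 + 5.0              # synthetic observations (true mu = 5, sigma = 2)
mu = torch.zeros(1, requires_grad=True)           # theta = (mu, log_sigma)
log_sigma = torch.zeros(1, requires_grad=True)
gamma = 0.05                                      # learning rate

for step in range(2000):
    nll = -Normal(mu, log_sigma.exp()).log_prob(data).mean()  # average negative log-likelihood
    nll.backward()                                            # backprop computes the gradient
    with torch.no_grad():                                     # descend the NLL == ascend the likelihood
        mu -= gamma * mu.grad
        log_sigma -= gamma * log_sigma.grad
    mu.grad.zero_()
    log_sigma.grad.zero_()

print(mu.item(), log_sigma.exp().item())          # estimates drift toward mu = 5, sigma = 2
```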


Stochastic optimisation

We can also use a gradient estimate

$\nabla_\theta \mathcal{L}(\theta \mid x^{(1:N)}) = \nabla_\theta\, \mathbb{E}_{S \sim \mathcal{U}(1..N)}\left[ N \log p(x^{(S)} \mid \theta) \right] = \mathbb{E}_{S \sim \mathcal{U}(1..N)}\left[ N \nabla_\theta \log p(x^{(S)} \mid \theta) \right]$  (an expected gradient)

$\overset{\text{MC}}{\approx} \frac{1}{M} \sum_{m=1}^{M} N \nabla_\theta \log p(x^{(s_m)} \mid \theta), \qquad S_m \sim \mathcal{U}(1..N)$

and take steps in the direction

$\gamma \frac{N}{M} \nabla_\theta \mathcal{L}(\theta \mid x^{(s_1:s_M)})$

where $x^{(s_1)}, \dots, x^{(s_M)}$ is a random mini-batch of size $M$.
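A sketch of the corresponding mini-batch update under the same toy Gaussian model (names and constants are again made up): rescaling the mini-batch log-likelihood by N/M keeps the gradient estimate unbiased, and a standard optimiser takes the step.

```python
import torch
from torch.distributions import Normal

N, M = 1000, 32
data = torch.randn(N) * 2.0 + 5.0                     # synthetic observations again
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([mu, log_sigma], lr=1e-4)       # gamma

for step in range(2000):
    idx = torch.randint(0, N, (M,))                   # S_1..S_M ~ U(1..N): a random mini-batch
    # (N/M) sum_m log p(x^(s_m) | theta) is an unbiased (noisy) estimate of L(theta | x^(1:N))
    log_lik_hat = (N / M) * Normal(mu, log_sigma.exp()).log_prob(data[idx]).sum()
    opt.zero_grad()
    (-log_lik_hat).backward()                         # gradient of the negative estimate
    opt.step()                                        # step of size gamma in that direction
```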


DL in NLP recipe

Maximum likelihood estimation

tells you which loss to optimise (i.e. the negative log-likelihood)

Automatic differentiation (backprop)

chain rule of derivatives: “give me a tractable forward pass and I will give you gradients”

Stochastic optimisation powered by backprop

general-purpose gradient-based optimisers


Tractability is central

Likelihood gives us a differentiable objective to optimise for

but we need to stick with tractable likelihood functions


When do we have intractable likelihood?

Unsupervised problems contain unobserved random variables

$p_\theta(x, z) = \overbrace{p(z)\, \underbrace{p_\theta(x \mid z)}_{\text{observation model}}}^{\text{latent variable model}}$

thus assessing the marginal likelihood requires marginalisation of the latent variables

$p_\theta(x) = \int p_\theta(x, z)\, dz = \int p(z)\, p_\theta(x \mid z)\, dz$


Examples of latent variable models

Discrete latent variable, continuous observation (too many forward passes):

$p_\theta(x) = \sum_{c=1}^{K} \mathrm{Cat}(c \mid \pi_1, \dots, \pi_K)\, \underbrace{\mathcal{N}(x \mid \mu_\theta(c), \sigma_\theta(c)^2)}_{\text{forward pass}}$

Continuous latent variable, discrete observation (infinitely many forward passes):

$p_\theta(x) = \int \mathcal{N}(z \mid 0, I)\, \underbrace{\mathrm{Cat}(x \mid \pi_\theta(z))}_{\text{forward pass}}\, dz$
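When K is small, the discrete case can still be marginalised exactly by brute force; a sketch with a toy mixture (the component parameters are made up):

```python
import torch
from torch.distributions import Categorical, Normal

mixing = Categorical(probs=torch.tensor([0.2, 0.5, 0.3]))        # p(z = c), K = 3 components
mu_theta = torch.tensor([-2.0, 0.0, 3.0])                        # one "forward pass" per component c
sigma_theta = torch.tensor([0.5, 1.0, 0.7])

def log_marginal(x):
    # log p(x) = log sum_c p(c) N(x | mu_theta(c), sigma_theta(c)^2)
    log_joint = mixing.logits + Normal(mu_theta, sigma_theta).log_prob(x)   # shape (K,)
    return torch.logsumexp(log_joint, dim=-1)

print(log_marginal(torch.tensor(0.3)))
```

In NLP the number of latent configurations is typically huge (or the latent variable is continuous), so this exact sum is exactly what becomes unaffordable.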


Deep generative models

Joint distribution with a deep observation model

$p_\theta(x, z) = \underbrace{p(z)}_{\text{prior}}\, \underbrace{p_\theta(x \mid z)}_{\text{likelihood}}$

where the mapping from the latent variable z to $p_\theta(x \mid z)$ is a NN with parameters θ.

Marginal likelihood (or evidence)

$p_\theta(x) = \int p_\theta(x, z)\, dz = \int p(z)\, p_\theta(x \mid z)\, dz$

intractable in general


Gradient

The exact gradient is intractable:

$\nabla_\theta \log p_\theta(x) = \nabla_\theta \log \int p_\theta(x, z)\, dz$  (the marginal)

$= \frac{1}{\int p_\theta(x, z)\, dz} \int \nabla_\theta p_\theta(x, z)\, dz$  (chain rule)

$= \frac{1}{p_\theta(x)} \int p_\theta(x, z)\, \nabla_\theta \log p_\theta(x, z)\, dz$  (log-identity for derivatives)

$= \int \underbrace{\frac{p_\theta(x, z)}{p_\theta(x)}}_{\text{posterior}}\, \nabla_\theta \log p_\theta(x, z)\, dz$

$= \int p_\theta(z \mid x)\, \nabla_\theta \log p_\theta(x, z)\, dz$

$= \mathbb{E}_{p_\theta(z \mid x)}\left[\nabla_\theta \log p_\theta(x, Z)\right]$  (an expected gradient)


Can we get an estimate?

$\nabla_\theta \log p_\theta(x) = \mathbb{E}_{p_\theta(z \mid x)}\left[\nabla_\theta \log p_\theta(x, Z)\right] \overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \log p_\theta(x, z^{(k)}), \qquad z^{(k)} \sim p_\theta(Z \mid x)$

The MC estimate of the gradient requires sampling from the posterior

$p_\theta(z \mid x) = \frac{p(z)\, p_\theta(x \mid z)}{p_\theta(x)}$

which is unavailable due to the intractability of the marginal.


Summary

We like probabilistic models because they make explicit modelling assumptions

We want complex observation models parameterised by NNs

But we cannot use backprop for parameter estimation

We need approximate inference techniques!


Outline

1 Deep generative models

2 Variational inference

3 Variational auto-encoder


The Basic Problem

The marginal likelihood

$p(x) = \int p(x, z)\, dz$

is generally intractable, which prevents us from computing quantities that depend on the posterior $p(z \mid x)$

e.g. gradients in MLE

e.g. the predictive distribution in Bayesian modelling


Strategy

Accept that $p(z \mid x)$ is not computable:

approximate it by an auxiliary distribution $q(z \mid x)$ that is computable

choose $q(z \mid x)$ as close as possible to $p(z \mid x)$ to obtain a faithful approximation


Evidence lower bound

$\log p(x) = \log \int p(x, z)\, dz$

$= \log \int q(z \mid x)\, \frac{p(x, z)}{q(z \mid x)}\, dz$

$= \log \mathbb{E}_{q(z \mid x)}\left[\frac{p(x, Z)}{q(Z \mid x)}\right]$

$\geq \mathbb{E}_{q(z \mid x)}\left[\log \frac{p(x, Z)}{q(Z \mid x)}\right]$  (the ELBO, by Jensen's inequality)

$= \mathbb{E}_{q(z \mid x)}\left[\log p(x, Z)\right] - \mathbb{E}_{q(z \mid x)}\left[\log q(Z \mid x)\right]$

$= \mathbb{E}_{q(z \mid x)}\left[\log p(x, Z)\right] + H\left(q(z \mid x)\right)$
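Because the expectation is now under q, the ELBO can be estimated by sampling from q alone; a sketch for a toy model (the densities p(z), p(x|z) and q(z|x) below are arbitrary illustrative choices):

```python
import torch
from torch.distributions import Normal

def elbo_estimate(x, p_z, p_x_given_z, q_z_given_x, num_samples=16):
    """MC estimate of E_q[log p(x,Z) - log q(Z|x)], a lower bound on log p(x)."""
    z = q_z_given_x.sample((num_samples,))                        # z^(1..S) ~ q(z|x)
    log_joint = p_z.log_prob(z) + p_x_given_z(z).log_prob(x)      # log p(x, z) = log p(z) + log p(x|z)
    return (log_joint - q_z_given_x.log_prob(z)).mean()

# toy example: p(z) = N(0,1), p(x|z) = N(z,1), q(z|x) = N(x/2, 0.8)  (hypothetical choices)
x = torch.tensor(1.5)
print(elbo_estimate(x,
                    Normal(0.0, 1.0),
                    lambda z: Normal(z, 1.0),
                    Normal(x / 2, 0.8)))
```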


An approximate posterior

$\log p(x) \geq \underbrace{\mathbb{E}_{q(z \mid x)}\left[\log \frac{p(x, Z)}{q(Z \mid x)}\right]}_{\text{ELBO}}$

$= \mathbb{E}_{q(z \mid x)}\left[\log \frac{p(Z \mid x)\, p(x)}{q(Z \mid x)}\right]$

$= \mathbb{E}_{q(z \mid x)}\left[\log \frac{p(Z \mid x)}{q(Z \mid x)}\right] + \underbrace{\log p(x)}_{\text{constant}}$

$= -\underbrace{\mathbb{E}_{q(z \mid x)}\left[\log \frac{q(Z \mid x)}{p(Z \mid x)}\right]}_{\mathrm{KL}(q(z \mid x)\,\|\,p(z \mid x))} + \log p(x)$

We have derived a lower bound on the log-evidence whose gap is exactly $\mathrm{KL}(q(z \mid x)\,\|\,p(z \mid x))$.


Variational Inference

Objective:

$\max_{q(z \mid x)}\; \mathbb{E}_{q(z \mid x)}\left[\log p(x, Z)\right] + H\left(q(z \mid x)\right)$

The ELBO is a lower bound on $\log p(x)$.

Blei et al. (2016)


Mean field assumption

Suppose we have N latent variables:

assume the posterior factorises into N independent terms

each with an independent set of parameters

$\underbrace{q(z_1, \dots, z_N) = \prod_{i=1}^{N} q_{\lambda_i}(z_i)}_{\text{mean field}}$


Amortised variational inference

Amortise the cost of inference using NNs

$q(z_1, \dots, z_N \mid x_1, \dots, x_N) = \prod_{i=1}^{N} q_\lambda(z_i \mid x_i)$

with a shared set of parameters, e.g.

$Z \mid x \sim \mathcal{N}(\underbrace{\mu_\lambda(x), \sigma_\lambda(x)^2}_{\text{inference network}})$
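A sketch of such an amortised inference network, assuming PyTorch (the class name, layer sizes and dimensions are illustrative): a single network with parameters λ maps any x to the mean and scale of q_λ(z|x), so new data points need no per-instance optimisation.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class InferenceNetwork(nn.Module):
    """q_lambda(z|x) = N(mu_lambda(x), sigma_lambda(x)^2), parameters shared across data points."""
    def __init__(self, x_dim, z_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, z_dim)
        self.log_sigma = nn.Linear(hidden, z_dim)

    def forward(self, x):
        h = self.body(x)
        return Normal(self.mu(h), self.log_sigma(h).exp())

q = InferenceNetwork(x_dim=10, z_dim=2)
x = torch.randn(5, 10)            # a batch of 5 toy observations
print(q(x).mean.shape)            # each x_i gets its own approximate posterior: (5, 2)
```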


Outline

1 Deep generative models

2 Variational inference

3 Variational auto-encoder


Variational auto-encoder

Generative model with a NN likelihood.

[Graphical model: latent z generates observation x with parameters θ, repeated over N data points; λ parametrises the inference model.]

complex (non-linear) observation model $p_\theta(x \mid z)$

complex (non-linear) mapping from data to latent variables $q_\lambda(z \mid x)$

Jointly optimise the generative model $p_\theta(x \mid z)$ and the inference model $q_\lambda(z \mid x)$ under the same objective (the ELBO).

Kingma and Welling (2013)

Objective

$\log p_\theta(x) \geq \overbrace{\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x, Z)\right] + H\left(q_\lambda(z \mid x)\right)}^{\text{ELBO}}$

$= \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid Z) + \log p(Z)\right] + H\left(q_\lambda(z \mid x)\right)$

$= \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid Z)\right] - \mathrm{KL}\left(q_\lambda(z \mid x)\,\|\,p(z)\right)$

Parameter estimation

$\arg\max_{\theta, \lambda}\; \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid Z)\right] - \mathrm{KL}\left(q_\lambda(z \mid x)\,\|\,p(z)\right)$

assume $\mathrm{KL}\left(q_\lambda(z \mid x)\,\|\,p(z)\right)$ is analytical: true for exponential families

approximate $\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right]$ by sampling: feasible because we design $q_\lambda(z \mid x)$ to be simple


Generative Network Gradient

$\frac{\partial}{\partial \theta}\left( \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \overbrace{\mathrm{KL}\left(q_\lambda(z \mid x)\,\|\,p(z)\right)}^{\text{constant wrt } \theta} \right)$

$= \mathbb{E}_{q_\lambda(z \mid x)}\left[\frac{\partial}{\partial \theta} \log p_\theta(x \mid z)\right]$  (an expected gradient)

$\overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \frac{\partial}{\partial \theta} \log p_\theta(x \mid z^{(k)}), \qquad z^{(k)} \sim q_\lambda(Z \mid x)$

Note: $q_\lambda(z \mid x)$ does not depend on θ.


Inference Network Gradient

$\frac{\partial}{\partial \lambda}\left( \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \overbrace{\mathrm{KL}\left(q_\lambda(z \mid x)\,\|\,p(z)\right)}^{\text{analytical}} \right)$

$= \frac{\partial}{\partial \lambda} \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \underbrace{\frac{\partial}{\partial \lambda} \mathrm{KL}\left(q_\lambda(z \mid x)\,\|\,p(z)\right)}_{\text{analytical computation}}$

The first term again requires approximation by sampling, but there is a problem.


Inference Network Gradient

$\frac{\partial}{\partial \lambda} \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] = \frac{\partial}{\partial \lambda} \int q_\lambda(z \mid x)\, \log p_\theta(x \mid z)\, dz = \underbrace{\int \frac{\partial}{\partial \lambda}\left(q_\lambda(z \mid x)\right) \log p_\theta(x \mid z)\, dz}_{\text{not an expectation}}$

The MC estimator is non-differentiable: we cannot sample first.

Differentiating the expression does not yield an expectation: we cannot approximate it via MC.


Score function estimator

We can again use the log-identity for derivatives:

$\frac{\partial}{\partial \lambda} \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] = \frac{\partial}{\partial \lambda} \int q_\lambda(z \mid x)\, \log p_\theta(x \mid z)\, dz$

$= \underbrace{\int \frac{\partial}{\partial \lambda}\left(q_\lambda(z \mid x)\right) \log p_\theta(x \mid z)\, dz}_{\text{not an expectation}}$

$= \int q_\lambda(z \mid x)\, \frac{\partial}{\partial \lambda}\left(\log q_\lambda(z \mid x)\right) \log p_\theta(x \mid z)\, dz$

$= \underbrace{\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\, \frac{\partial}{\partial \lambda} \log q_\lambda(z \mid x)\right]}_{\text{an expected gradient}}$


Score function estimator: high variance

We can now build an MC estimator

$\frac{\partial}{\partial \lambda} \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] = \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\, \frac{\partial}{\partial \lambda} \log q_\lambda(z \mid x)\right]$

$\overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \log p_\theta(x \mid z^{(k)})\, \frac{\partial}{\partial \lambda} \log q_\lambda(z^{(k)} \mid x), \qquad z^{(k)} \sim q_\lambda(Z \mid x)$

but

the magnitude of $\log p_\theta(x \mid z)$ varies widely

the model likelihood does not contribute to the direction of the gradient

there is too much variance for the estimate to be useful
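A sketch of this score-function (REINFORCE-style) estimator on a toy one-dimensional model (all distributions below are illustrative): log p_θ(x|z) enters only as a scalar weight on ∂/∂λ log q_λ(z|x), which is exactly why the variance is high.

```python
import torch
from torch.distributions import Normal

# toy setup: q_lambda(z|x) = N(lam_mu, exp(lam_ls)^2), "likelihood" log p(x|z) = log N(x; z, 1)
lam_mu = torch.zeros(1, requires_grad=True)
lam_ls = torch.zeros(1, requires_grad=True)
x, K = torch.tensor(2.0), 64

q = Normal(lam_mu, lam_ls.exp())
z = q.sample((K,))                                   # z^(1..K) ~ q_lambda(z|x); no gradient flows through z
log_p_x_given_z = Normal(z, 1.0).log_prob(x)         # log p_theta(x | z^(k)): one scalar weight per sample
# surrogate whose gradient is (1/K) sum_k log p(x|z^(k)) d/dlambda log q_lambda(z^(k)|x)
surrogate = (log_p_x_given_z.detach() * q.log_prob(z)).mean()
surrogate.backward()
print(lam_mu.grad, lam_ls.grad)                      # noisy estimates of the true gradient
```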


When variance is high we can

sample more: won’t scale

use variance reduction techniques (e.g. baselines and control variates): an excellent idea, but not just yet

stare at $\frac{\partial}{\partial \lambda} \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right]$ until we find a way to rewrite the expectation in terms of a density that does not depend on λ


Reparametrisation

Find a transformation $h: z \mapsto \epsilon$ that expresses z through a random variable ε such that $q(\epsilon)$ does not depend on λ:

$h(z, \lambda)$ needs to be invertible

$h(z, \lambda)$ needs to be differentiable

Invertibility implies

$h(z, \lambda) = \epsilon \qquad h^{-1}(\epsilon, \lambda) = z$

(Kingma and Welling, 2013; Rezende et al., 2014; Titsias and Lazaro-Gredilla, 2014)


Gaussian Transformation

If $Z \sim \mathcal{N}(\mu_\lambda(x), \sigma_\lambda(x)^2)$ then

$h(z, \lambda) = \frac{z - \mu_\lambda(x)}{\sigma_\lambda(x)} = \epsilon \sim \mathcal{N}(0, I)$

$h^{-1}(\epsilon, \lambda) = \mu_\lambda(x) + \sigma_\lambda(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$
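A sketch of this transformation pair, with made-up values of µ_λ(x) and σ_λ(x) standing in for the inference network outputs:

```python
import torch

def h(z, mu, sigma):          # h(z, lambda): standardise, the result is distributed as N(0, I)
    return (z - mu) / sigma

def h_inv(eps, mu, sigma):    # h^{-1}(eps, lambda): location-scale transform of the noise
    return mu + sigma * eps

mu, sigma = torch.tensor([0.7, -1.2]), torch.tensor([0.5, 2.0])   # stand-ins for mu_lambda(x), sigma_lambda(x)
eps = torch.randn(2)          # eps ~ N(0, I), independent of lambda
z = h_inv(eps, mu, sigma)     # z ~ N(mu, sigma^2), yet differentiable w.r.t. mu and sigma
assert torch.allclose(h(z, mu, sigma), eps)
```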

Inference Network – Reparametrised Gradient

$\frac{\partial}{\partial \lambda} \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] = \frac{\partial}{\partial \lambda} \int q_\lambda(z \mid x)\, \log p_\theta(x \mid z)\, dz$

$= \frac{\partial}{\partial \lambda} \int q(\epsilon)\, \log p_\theta(x \mid \overbrace{h^{-1}(\epsilon, \lambda)}^{=z})\, d\epsilon$

$= \int q(\epsilon)\, \frac{\partial}{\partial \lambda} \log p_\theta(x \mid h^{-1}(\epsilon, \lambda))\, d\epsilon$

$= \underbrace{\mathbb{E}_{q(\epsilon)}\left[\frac{\partial}{\partial \lambda} \log p_\theta(x \mid h^{-1}(\epsilon, \lambda))\right]}_{\text{an expected gradient}}$


Reparametrised gradient estimate

$\mathbb{E}_{q(\epsilon)}\left[\frac{\partial}{\partial \lambda} \log p_\theta(x \mid h^{-1}(\epsilon, \lambda))\right]$

$= \mathbb{E}_{q(\epsilon)}\left[\frac{\partial}{\partial z} \log p_\theta(x \mid \overbrace{h^{-1}(\epsilon, \lambda)}^{=z}) \times \frac{\partial}{\partial \lambda} h^{-1}(\epsilon, \lambda)\right]$  (chain rule)

$\overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \underbrace{\frac{\partial}{\partial z} \log p_\theta(x \mid h^{-1}(\epsilon^{(k)}, \lambda)) \times \frac{\partial}{\partial \lambda} h^{-1}(\epsilon^{(k)}, \lambda)}_{\text{backprop's job}}, \qquad \epsilon^{(k)} \sim q(\epsilon)$

Note that both models contribute gradients.


Gaussian KL

ELBO:

$\mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\lambda(z \mid x)\,\|\,p(z)\right)$

Analytical computation of $-\mathrm{KL}\left(q_\lambda(z \mid x)\,\|\,p(z)\right)$:

$\frac{1}{2} \sum_{i=1}^{d} \left(1 + \log \sigma_i^2 - \mu_i^2 - \sigma_i^2\right)$

Thus backprop will compute $-\frac{\partial}{\partial \lambda} \mathrm{KL}\left(q_\lambda(z \mid x)\,\|\,p(z)\right)$ for us.
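A sketch of that closed form for a diagonal Gaussian q_λ(z|x) and a standard Normal prior (the helper name is ours); the function returns the KL itself, i.e. the quantity subtracted from the reconstruction term:

```python
import torch

def kl_diag_gaussian_vs_std_normal(mu, log_sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) = -1/2 sum_i (1 + log sigma_i^2 - mu_i^2 - sigma_i^2)."""
    sigma2 = (2 * log_sigma).exp()
    return -0.5 * (1 + 2 * log_sigma - mu.pow(2) - sigma2).sum(dim=-1)

mu, log_sigma = torch.zeros(3), torch.zeros(3)      # q = N(0, I): the KL should be exactly 0
print(kl_diag_gaussian_vs_std_normal(mu, log_sigma))
```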


Computation Graph

[Computation graph: the inference model (parameters λ) maps x to µ and σ; noise ε ∼ N(0, 1) gives z = µ + σ·ε; the generative model (parameters θ) maps z back to a distribution over x; the KL term connects µ and σ to the prior.]
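Putting the pieces together, a minimal sketch of one training step of this computation graph, assuming PyTorch, toy continuous data and a unit-variance Gaussian observation model (all names and sizes are illustrative, not taken from the slides):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

x_dim, z_dim = 10, 2
enc = nn.Sequential(nn.Linear(x_dim, 32), nn.Tanh(), nn.Linear(32, 2 * z_dim))   # inference model (lambda)
dec = nn.Sequential(nn.Linear(z_dim, 32), nn.Tanh(), nn.Linear(32, x_dim))       # generative model (theta)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.randn(16, x_dim)                          # a toy mini-batch

mu, log_sigma = enc(x).chunk(2, dim=-1)             # q_lambda(z|x) = N(mu, sigma^2)
eps = torch.randn_like(mu)                          # eps ~ N(0, I)
z = mu + log_sigma.exp() * eps                      # reparametrisation: z = h^{-1}(eps, lambda)

log_px_z = Normal(dec(z), 1.0).log_prob(x).sum(-1)  # log p_theta(x|z), summed over dimensions
kl = -0.5 * (1 + 2 * log_sigma - mu.pow(2) - (2 * log_sigma).exp()).sum(-1)      # analytic Gaussian KL
loss = -(log_px_z - kl).mean()                      # negative ELBO

opt.zero_grad()
loss.backward()                                     # gradients for both theta and lambda
opt.step()
```

A single backward pass produces gradients for θ through the reconstruction term and for λ through the reparametrised sample and the analytic KL, which is exactly what the computation graph depicts.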


Example

[Graphical model: latent z and observed sentence $x_1^m$ with parameters θ, repeated over N sentences; λ parametrises the inference model.]

Generative model

$Z \sim \mathcal{N}(0, I)$

$X_i \mid z, x_{<i} \sim \mathrm{Cat}(f_\theta(z, x_{<i}))$

Inference model

$Z \mid x_1^m \sim \mathcal{N}(\mu_\lambda(x_1^m), \sigma_\lambda(x_1^m)^2)$

Bowman et al. (2016)
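A rough sketch of such an observation model, assuming a GRU decoder over a toy vocabulary (the architecture and sizes are illustrative, not the exact model of Bowman et al., 2016):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class SentenceDecoder(nn.Module):
    """p_theta(x_i | z, x_{<i}) = Cat(f_theta(z, x_{<i})): one Categorical per position."""
    def __init__(self, vocab, z_dim, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.init_h = nn.Linear(z_dim, hidden)          # z initialises the recurrent state
        self.gru = nn.GRU(emb, hidden, batch_first=True)
        self.logits = nn.Linear(hidden, vocab)

    def log_prob(self, z, x):                           # x: (batch, m) token ids
        h0 = torch.tanh(self.init_h(z)).unsqueeze(0)    # (1, batch, hidden)
        inputs = self.embed(x[:, :-1])                  # condition position i on the prefix x_{<i}
        out, _ = self.gru(inputs, h0)
        return Categorical(logits=self.logits(out)).log_prob(x[:, 1:]).sum(-1)

dec = SentenceDecoder(vocab=100, z_dim=2)
z, x = torch.randn(4, 2), torch.randint(0, 100, (4, 7))
print(dec.log_prob(z, x).shape)                         # (4,): log p_theta(x | z) per sentence
```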


VAEs – Summary

Advantages

Backprop training

Easy to implement

Posterior inference possible

One objective for both NNs

Drawbacks

Discrete latent variables are difficult

Optimisation may be difficult with several latent variables

Location-scale families only (but see Ruiz et al. (2016) and Kucukelbir et al. (2017))


Summary

Deep learning in NLP

task-driven feature extraction

models with more realistic assumptions

Probabilistic modelling

better (or at least more explicit) statistical assumptions

compact models

semi-supervised learning


Literature I

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. 2014. URL http://arxiv.org/abs/1409.0473.

David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. 01 2016. URL https://arxiv.org/abs/1601.00670.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/K16-1002.

Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. 2013. URL http://arxiv.org/abs/1312.6114.


Literature II

Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45, 2017. URL http://jmlr.org/papers/v18/16-107.html.

Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286, 2014. URL http://jmlr.org/proceedings/papers/v32/rezende14.pdf.

Francisco R. Ruiz, Michalis Titsias RC AUEB, and David Blei. The generalized reparameterization gradient. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, NIPS, pages 460–468. 2016. URL http://papers.nips.cc/paper/6328-the-generalized-reparameterization-gradient.pdf.


Literature III

Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P16-1009.

Michalis Titsias and Miguel Lazaro-Gredilla. Doubly stochastic variational bayes for non-conjugate inference. In Tony Jebara and Eric P. Xing, editors, ICML, pages 1971–1979, 2014. URL http://jmlr.org/proceedings/papers/v32/titsias14.pdf.