
Deep Generative Language Models (DGM4NLP)

Miguel Rios, University of Amsterdam

May 2, 2019


Outline

1 Language Modelling

2 Variational Auto-encoder for Sentences


Recap: Generative Models of Word Representation

Discriminative embedding models: word2vec

In the event of a chemical spill, most children know they should evacuate as advised by people in charge.

Place words in R^d so as to answer questions like

"Have I seen this word in this context?"

Fit a binary classifier on

positive examples

negative examples
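As a rough illustration of this recap (a minimal sketch, not the course's code), the snippet below scores a (word, context) pair with a dot product of embeddings and trains it to answer "seen" for observed pairs and "unseen" for randomly sampled negatives; the vocabulary size, dimensionality and training pairs are toy assumptions.

```python
import numpy as np

# Illustrative word2vec-style binary classifier: "have I seen this word in this context?"
rng = np.random.default_rng(0)
V, d = 1000, 50
W_word = rng.normal(scale=0.1, size=(V, d))   # word embeddings in R^d
W_ctx = rng.normal(scale=0.1, size=(V, d))    # context embeddings in R^d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(word, ctx, label, lr=0.1):
    """Logistic-regression step on one (word, context) pair; label 1 = observed, 0 = negative sample."""
    w, c = W_word[word].copy(), W_ctx[ctx].copy()
    grad = sigmoid(w @ c) - label             # d(logistic loss)/d(score)
    W_word[word] -= lr * grad * c
    W_ctx[ctx] -= lr * grad * w

sgd_step(word=3, ctx=17, label=1)                     # positive example: pair seen in the corpus
sgd_step(word=3, ctx=int(rng.integers(V)), label=0)   # negative example: randomly sampled context
```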

The model processes a sentence and outputs a word representation:

quickly evacuate the area / deje el lugar rapidamente

[Figure: graphical model for the sentence pair, with variables x, y, z, a, parameters θ, and plate sizes m, n, |B|.]

[Figure: inference and generative networks. Words x pass through a 128-d embedding layer and a 100-d BiRNN h, giving u (100-d) and s (100-d); a 100-d z is sampled, and the generative model predicts x via f(z) and y via g(z, a_j).]


Introduction

you know nothing, jon x

ground control to major x

the x


the quick brown x

the quick brown fox x

the quick brown fox jumps x

the quick brown fox jumps over x

the quick brown fox jumps over the x

the quick brown fox jumps over the lazy x

the quick brown fox jumps over the lazy dog


Definition

Language models give us the probability of a sentence;

At each time step, they assign a probability to the next word.

Applications

Very useful in different tasks:

Speech recognition;

Spelling correction;

Machine translation;

LMs are useful in almost any task that deals with generating language.

Language Models

N-gram based LMs;

Log-linear LMs;

Neural LMs.

N-gram LM

x is a sequence of words

x = x1, x2, x3, x4, x5 = you, know, nothing, jon, snow

To compute the probability of a sentence:

p(x) = p(x1, x2, . . . , xn)    (1)

We apply the chain rule:

p(x) = ∏_i p(xi | x1, . . . , xi−1)    (2)

We limit the history with a Markov order:

p(xi | x1, . . . , xi−1) ≈ p(xi | xi−4, xi−3, xi−2, xi−1)

[Jelinek and Mercer, 1980, Goodman, 2001]


We make a Markov assumption of conditional independence:

p(xi | x1, . . . , xi−1) ≈ p(xi | xi−1)    (4)

If we do not observe the bigram, p(xi | xi−1) is 0 and the probability of the sentence will be 0.

MLE:

pMLE(xi | xi−1) = count(xi−1, xi) / count(xi−1)    (5)

Laplace smoothing:

padd1(xi | xi−1) = (count(xi−1, xi) + 1) / (count(xi−1) + V)    (6)
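To make equations (5) and (6) concrete, here is a minimal sketch of a bigram LM with MLE and add-1 estimates; the toy corpus and names are illustrative assumptions, not from the slides.

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large training set.
corpus = [["you", "know", "nothing", "jon", "snow"],
          ["you", "know", "jon", "snow"]]

unigram, bigram, vocab = Counter(), Counter(), set()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    vocab.update(tokens)
    unigram.update(tokens[:-1])                  # history counts
    bigram.update(zip(tokens[:-1], tokens[1:]))  # bigram counts

V = len(vocab)

def p_mle(w, prev):
    # p_MLE(w | prev) = count(prev, w) / count(prev); zero for unseen bigrams.
    return bigram[(prev, w)] / unigram[prev] if unigram[prev] else 0.0

def p_add1(w, prev):
    # Laplace smoothing: (count(prev, w) + 1) / (count(prev) + V).
    return (bigram[(prev, w)] + 1) / (unigram[prev] + V)

print(p_mle("winter", "jon"))   # 0.0: an unseen bigram zeroes out the whole sentence probability
print(p_add1("winter", "jon"))  # small but non-zero probability
```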

Log-linear LM

p(y | x) = exp(w · φ(x, y)) / Σ_{y′ ∈ Vy} exp(w · φ(x, y′))    (7)

y is the next word and Vy is the vocabulary;

x is the history;

φ is a feature function that returns an n-dimensional vector;

w are the model parameters.

Example features:

n-gram features: xj−1 = the and xj = puppy;

gappy n-gram features: xj−2 = the and xj = puppy;

class features: xj belongs to class ABC;

gazetteer features: xj is a place name.
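A minimal sketch of equation (7) using a couple of the features above; the feature set, weights and vocabulary are illustrative assumptions, not the lecture's.

```python
import numpy as np

vocab = ["the", "puppy", "dog", "park", "runs"]

def phi(history, y):
    """Feature function φ(x, y): indicator features over the history x and candidate next word y."""
    return np.array([
        1.0 if history[-1] == "the" and y == "puppy" else 0.0,                       # n-gram feature
        1.0 if len(history) > 1 and history[-2] == "the" and y == "puppy" else 0.0,  # gappy n-gram feature
        1.0 if y == "park" else 0.0,                                                 # gazetteer-style feature
    ])

def p_next(history, w):
    """p(y | x) = exp(w · φ(x, y)) / Σ_{y'} exp(w · φ(x, y'))."""
    scores = np.array([w @ phi(history, y) for y in vocab])
    scores -= scores.max()            # for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()

w = np.array([2.0, 1.0, 0.5])         # model parameters; learned with SGD in practice
print(dict(zip(vocab, p_next(["we", "saw", "the"], w))))
```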

With features of words and histories we can share statistical weight

With n-grams, there is no sharing at all

We also get smoothing for free

We can add arbitrary features

We use Stochastic Gradient Descent (SGD)

Neural LM

n-gram language models have proven to be effective in various tasks

log-linear models allow us to share weights through features

maybe our history is still too limited, e.g. n−1 words

we need to find useful features

Feed-forward NLM

With NNs we can exploit distributed representations to allow for statistical weight sharing.

How does it work:

1 each word is mapped to an embedding: an m-dimensional feature vector;

2 a probability function over word sequences is expressed in terms of these vectors;

3 we jointly learn the feature vectors and the parameters of the probability function.

[Bengio et al., 2003]

Similar words are expected to have similar feature vectors: (dog, cat), (running, walking), (bedroom, room)

With this, probability mass is naturally transferred from (1) to (2):

(1) The cat is walking in the bedroom.

(2) The dog is running in the room.

Take-away message: the presence of only one sentence in the training data will increase the probability of a combinatorial number of neighbours in sentence space.

FF-LM

E_you, E_know, E_nothing ∈ R^100

x = [E_you; E_know; E_nothing] ∈ R^300

y = W3 tanh(W1 x + b1) + W2 x + b2    (8)
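A minimal PyTorch-style sketch of equation (8); the dimensions follow the slide, while the class and layer names are my own assumptions.

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Bengio-style feed-forward LM: concatenate the n-1 previous embeddings,
    score the next word with a tanh hidden layer plus a linear skip connection."""
    def __init__(self, vocab_size, emb_dim=100, context=3, hidden=300):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.W1 = nn.Linear(context * emb_dim, hidden)      # W1 x + b1
        self.W3 = nn.Linear(hidden, vocab_size)             # W3 tanh(.)
        self.W2 = nn.Linear(context * emb_dim, vocab_size)  # direct connection W2 x + b2

    def forward(self, history):                             # history: (batch, context) word ids
        x = self.emb(history).flatten(start_dim=1)          # [E_you; E_know; E_nothing]
        scores = self.W3(torch.tanh(self.W1(x))) + self.W2(x)
        return torch.log_softmax(scores, dim=-1)            # log p(next word | history)

model = FeedForwardLM(vocab_size=10000)
logp = model(torch.tensor([[4, 8, 15]]))                    # e.g. ids for "you know nothing"
print(logp.shape)                                           # torch.Size([1, 10000])
```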

The non-linear activation functions perform feature combinations that a linear model cannot do;

End-to-end training on next-word prediction.

We now have much better generalisation, but still a limited history/context.

Recurrent neural networks have unlimited history.

RNN NLM

RNN-LM

Start: predict the second word from the first.

[Figure: RNN-LM unrolled over the sentence.]

[Mikolov et al., 2010]

Outline

1 Language Modelling

2 Variational Auto-encoder for Sentences

Sen VAE

Model observations as draws from the marginal of a DGM.

An NN maps from a latent sentence embedding z ∈ R^dz to a distribution p(x | z, θ) over sentences:

p(x | θ) = ∫ p(z) p(x | z, θ) dz
         = ∫ N(z | 0, I) ∏_{i=1}^{|x|} Cat(xi | f(z, x<i; θ)) dz    (9)

[Bowman et al., 2015]

[Plate diagram: observed sentence x_1^m, latent z, generative parameters θ, variational parameters φ, N sentences.]

Generative model:

Z ∼ N(0, I)

Xi | z, x<i ∼ Cat(fθ(z, x<i))

Inference model:

Z ∼ N(μφ(x_1^m), σφ(x_1^m)^2)

Generation is one word at a time without Markov assumptions, but f(·) conditions on z in addition to the observed prefix.

The conditional p(x | z, θ) is the decoder.

p(x | θ) is the marginal likelihood.

We train the model to assign high (marginal) probability to observations, like an LM.

However, the model uses the latent space to exploit neighbourhood and smoothness, capturing regularities in data space. For example, it may group sentences according to lexical choices, syntactic complexity, lexical semantics, etc.

Approximate Inference

The model has a diagonal Gaussian distribution as variational posterior:

qφ(z | x) = N(z | μφ(x), diag(σφ(x)))

With reparametrisation:

z = hφ(ε, x) = μφ(x) + σφ(x) ⊙ ε, where ε ∼ N(0, I)

Analytical KL:

KL[qφ(z | x) ‖ p(z)] = 1/2 Σ_{d=1}^{Dz} (−log σφ,d(x)^2 − 1 + σφ,d(x)^2 + μφ,d(x)^2)

We jointly estimate the parameters of both the generative and inference models by maximising a lower bound on the log-likelihood (ELBO):

L(θ, φ | x) = E_{q(z|x,φ)}[log p(x | z, θ)] − KL(q(z | x, φ) ‖ p(z))    (10)
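A minimal sketch of how the reparametrised sample, the analytical KL and the single-sample ELBO of (10) fit together; the tensor shapes are assumptions, and log p(x | z, θ) would come from the decoder sketched further below.

```python
import torch

def reparametrise(mu, sigma):
    """z = mu_phi(x) + sigma_phi(x) * eps with eps ~ N(0, I)."""
    eps = torch.randn_like(sigma)
    return mu + sigma * eps

def gaussian_kl(mu, sigma):
    """Analytical KL[N(mu, diag(sigma^2)) || N(0, I)], summed over the latent dimensions."""
    return 0.5 * torch.sum(-torch.log(sigma**2) - 1.0 + sigma**2 + mu**2, dim=-1)

def elbo(log_px_given_z, mu, sigma):
    """Single-sample ELBO: E_q[log p(x|z,theta)] - KL(q(z|x,phi) || p(z))."""
    return log_px_given_z - gaussian_kl(mu, sigma)

# toy check with a 2-sentence batch and a 16-dimensional latent
mu, sigma = torch.zeros(2, 16), torch.ones(2, 16) * 0.5
z = reparametrise(mu, sigma)
print(elbo(log_px_given_z=torch.tensor([-50.0, -42.0]), mu=mu, sigma=sigma))
```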

Architecture

The Gaussian Sen VAE parametrises a categorical distribution over the vocabulary for each given prefix, conditioning on a latent embedding:

Z ∼ N(0, I), Xi | z, x<i ∼ Cat(f(z, x<i; θ))

f(z, x<i; θ) = softmax(si)

ei = emb(xi; θemb)

h0 = tanh(affine(z; θinit))

hi = GRU(hi−1, ei−1; θgru)

si = affine(hi; θout)    (11)
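A PyTorch-style sketch of the decoder in (11): z initialises the GRU state, so every step conditions on both z (through that state) and the observed prefix. The class name, layer names and sizes are my own assumptions.

```python
import torch
import torch.nn as nn

class SenVAEDecoder(nn.Module):
    """p(x | z, theta): GRU language model whose initial state is computed from z."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=256, latent_dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)           # e_i = emb(x_i; theta_emb)
        self.init = nn.Linear(latent_dim, hid_dim)             # h_0 = tanh(affine(z; theta_init))
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)  # h_i = GRU(h_{i-1}, e_{i-1}; theta_gru)
        self.out = nn.Linear(hid_dim, vocab_size)               # s_i = affine(h_i; theta_out)

    def forward(self, z, prev_words):
        # prev_words: (batch, len) ids of x_{<i}; returns log-probabilities of each next word.
        h0 = torch.tanh(self.init(z)).unsqueeze(0)              # (1, batch, hid_dim)
        e = self.emb(prev_words)                                # (batch, len, emb_dim)
        h, _ = self.gru(e, h0)
        return torch.log_softmax(self.out(h), dim=-1)           # log Cat(x_i | f(z, x_{<i}; theta))

decoder = SenVAEDecoder(vocab_size=10000)
z = torch.randn(2, 32)                                          # e.g. reparametrised samples from q(z|x)
print(decoder(z, torch.randint(0, 10000, (2, 7))).shape)        # torch.Size([2, 7, 10000])
```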

The Strong Decoder Problem

The VAE may ignore the latent variable, given the interaction between the prior and the posterior in the KL divergence.

This problem appears when we have strong decoders: conditional likelihoods p(x | z) parametrised by high-capacity models.

The model might achieve a high ELBO without using information from z.

An RNN LM is a strong decoder because it conditions on all previous context when generating a word.

What to do?

Weakening the decoder: the model then has to rely on the latent variable for the reconstruction of the observed data.

KL annealing (sketched below): weigh the KL term in the ELBO with a factor that is annealed from 0 to 1 over a fixed number of steps of size γ ∈ (0, 1).

Word dropout: by dropping a percentage of the input at random, the decoder has to rely on the latent variable to fill in the missing gaps.

Free bits (sketched below): allows encoding the first r nats of information for free, by replacing the KL term with max(r, KL(qφ(z | x) ‖ p(z))).
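A minimal sketch of how KL annealing and free bits modify the ELBO objective during training; the schedule and threshold values are illustrative assumptions.

```python
import torch

def annealed_elbo_loss(log_px_given_z, kl, step, anneal_steps=10000, free_nats=0.0):
    """Negative ELBO with (i) a KL weight annealed linearly from 0 to 1 and
    (ii) free bits: the KL term is not penalised below `free_nats`."""
    beta = min(1.0, step / anneal_steps)       # KL annealing factor
    kl_term = torch.clamp(kl, min=free_nats)   # free bits: max(r, KL)
    return -(log_px_given_z - beta * kl_term).mean()

# toy usage: reconstruction log-likelihood and KL for a batch of 2 sentences
loss = annealed_elbo_loss(log_px_given_z=torch.tensor([-50.0, -42.0]),
                          kl=torch.tensor([1.2, 0.3]),
                          step=2500, anneal_steps=10000, free_nats=0.5)
print(loss)
```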

Metrics

For VAE models the negative log-likelihood (NLL) is estimated, since we do not have access to the exact marginal likelihood.

We use an importance sampling (IS) estimate:

p(x | θ) = ∫ p(z, x | θ) dz
         = ∫ q(z | x) p(z, x | θ) / q(z | x) dz
         ≈ 1/S Σ_{s=1}^{S} p(z^(s), x | θ) / q(z^(s) | x),  where z^(s) ∼ q(z | x)    (12)

Perplexity (PPL): the exponent of the average per-word entropy, given N i.i.d. sequences:

PPL = exp(−Σ_{i=1}^{N} log p(xi) / Σ_{i=1}^{N} |xi|)    (13)

Perplexity is based on the importance-sampled NLL.
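A sketch of the importance-sampling estimate (12) and the perplexity (13), computed in log space for numerical stability; the interfaces of the inference network, decoder and prior (encoder returning a Normal, decoder.log_prob, prior.log_prob) are assumptions, not the lecture's code.

```python
import torch

def is_log_marginal(x, encoder, decoder, prior, S=50):
    """log p(x|theta) ~ log (1/S) sum_s p(z_s, x | theta) / q(z_s | x), z_s ~ q(z|x), via logsumexp."""
    q = encoder(x)                              # assumed: returns a torch.distributions.Normal q(z|x)
    log_w = []
    for _ in range(S):
        z = q.rsample()
        log_joint = decoder.log_prob(x, z) + prior.log_prob(z).sum(-1)  # log p(x|z,theta) + log p(z)
        log_w.append(log_joint - q.log_prob(z).sum(-1))                 # log importance weight
    return torch.logsumexp(torch.stack(log_w, dim=0), dim=0) - torch.log(torch.tensor(float(S)))

def perplexity(log_probs, lengths):
    """PPL = exp(-sum_i log p(x_i) / sum_i |x_i|) over N sequences."""
    return torch.exp(-log_probs.sum() / lengths.sum())
```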

Baseline

RNNLM [Dyer et al., 2016]

At each step, an RNNLM parameterises a categorical distribution over the vocabulary, i.e. Xi | x<i ∼ Cat(f(x<i; θ)), with

f(x<i; θ) = softmax(si)

ei = emb(xi; θemb)

hi = GRU(hi−1, ei−1; θgru)

si = affine(hi; θout)    (14)

Embedding layer (emb), one (or more) GRU cell(s) (h0 ∈ θ is a parameter of the model), and an affine layer to map from the dimensionality of the GRU to the vocabulary size.
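A PyTorch-style sketch of the baseline in (14): it is the decoder sketched above with the z-dependent initial state replaced by a learned parameter h0. Names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Baseline GRU LM: X_i | x_{<i} ~ Cat(softmax(affine(h_i)))."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.h0 = nn.Parameter(torch.zeros(1, 1, hid_dim))   # h_0 is a parameter of the model
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_words):                            # (batch, len) ids of x_{<i}
        h0 = self.h0.expand(-1, prev_words.size(0), -1).contiguous()
        h, _ = self.gru(self.emb(prev_words), h0)
        return torch.log_softmax(self.out(h), dim=-1)

model = RNNLM(vocab_size=10000)
print(model(torch.randint(0, 10000, (2, 7))).shape)           # torch.Size([2, 7, 10000])
```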

Data

Wall Street Journal part of the Penn Treebank corpus

Preprocessing and train-validation-test split as in [Dyer et al., 2016]:

WSJ sections 1-21 for training,

Section 23 as test corpus,

Section 24 as validation.

Results

Model       NLL↓           PPL↓
RNN-LM      118.7 ± 0.12   107.1 ± 0.46
VAE         118.4 ± 0.09   105.7 ± 0.36
Annealing   117.9 ± 0.08   103.7 ± 0.31
Free-bits   117.5 ± 0.18   101.9 ± 0.77

Samples

Decode greedily from a prior sample; any variability across samples is due to the generator's reliance on the latent sample.

If the VAE ignores z, greedy generation from a prior sample is essentially deterministic.

Homotopy: decode greedily from points lying between a posterior sample conditioned on the first sentence and a posterior sample conditioned on the last sentence:

zα = α z1 + (1 − α) z2,  α ∈ [0, 1]
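A sketch of the homotopy: encode two sentences, interpolate linearly in latent space, and greedily decode each intermediate point. The encoder/decoder interfaces are the assumed ones from the sketches above, and `greedy_decode` is a hypothetical helper.

```python
import torch

def homotopy(sent_a, sent_b, encoder, decoder, steps=5):
    """Greedy decoding from points on the line between two posterior samples: z_a = a*z1 + (1-a)*z2."""
    z1 = encoder(sent_a).rsample()    # posterior sample conditioned on the first sentence
    z2 = encoder(sent_b).rsample()    # posterior sample conditioned on the last sentence
    outputs = []
    for alpha in torch.linspace(1.0, 0.0, steps):
        z_alpha = alpha * z1 + (1 - alpha) * z2
        outputs.append(greedy_decode(decoder, z_alpha))   # hypothetical greedy decoding loop
    return outputs
```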

References

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003. URL http://dblp.uni-trier.de/db/journals/jmlr/jmlr3.html#BengioDVJ03.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1024. URL https://www.aclweb.org/anthology/N16-1024.

Joshua T. Goodman. A bit of progress in language modeling. Comput. Speech Lang., 15(4):403–434, October 2001. ISSN 0885-2308. doi: 10.1006/csla.2001.0174. URL http://dx.doi.org/10.1006/csla.2001.0174.

Fred Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Edzard S. Gelsema and Laveen N. Kanal, editors, Proceedings, Workshop on Pattern Recognition in Practice, pages 381–397. North Holland, Amsterdam, 1980.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Recurrent neural network based language model. In Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura, editors, INTERSPEECH, pages 1045–1048. ISCA, 2010. URL http://dblp.uni-trier.de/db/conf/interspeech/interspeech2010.html#MikolovKBCK10.