Conditional Language Modeling - Unbabel


Page 1: Conditional Language Modeling - Unbabel

Conditional Language Modeling

Chris Dyer, DeepMind / Carnegie Mellon University

August 30, 2017, MT Marathon 2017

Page 2: Conditional Language Modeling - Unbabel

A language model assigns probabilities to sequences of words, $w = (w_1, w_2, \ldots, w_\ell)$.

We saw that it is helpful to decompose this probability using the chain rule, as follows:

$$p(w) = p(w_1) \times p(w_2 \mid w_1) \times p(w_3 \mid w_1, w_2) \times \cdots \times p(w_\ell \mid w_1, \ldots, w_{\ell-1}) = \prod_{t=1}^{|w|} p(w_t \mid w_1, \ldots, w_{t-1})$$

This reduces the language modeling problem to modeling the probability of the next word, given the history of preceding words.

Review: Unconditional LMs
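To make the chain-rule decomposition concrete, here is a minimal sketch in plain Python; the toy p_next function is a hypothetical stand-in for any model of the next-word probability, not something from the lecture:

    import math

    def p_next(history, word):
        # Hypothetical next-word model: uniform over a toy vocabulary.
        # Any model returning p(word | history) could be plugged in here.
        vocab = ["<s>", "tom", "likes", "beer", "</s>"]
        return 1.0 / len(vocab)

    def sentence_logprob(words):
        """log p(w) = sum_t log p(w_t | w_1 .. w_{t-1}), starting from <s>."""
        history, total = ["<s>"], 0.0
        for w in words:
            total += math.log(p_next(tuple(history), w))
            history.append(w)
        return total

    print(sentence_logprob(["tom", "likes", "beer", "</s>"]))  # 4 * log(1/5)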

Page 3: Conditional Language Modeling - Unbabel

Unconditional LMs with RNNs

Pages 4-11: Unconditional LMs with RNNs

[Figure, built up over several slides: an RNN language model unrolled over "<s> tom likes beer </s>". From the initial state h0 and the embedding x1 of <s>, each step computes a hidden state h_t and a softmax distribution p_t over the vocabulary; the next word is sampled (~) and its embedding is fed in as the next input. The sequence probability accumulates as p(tom | <s>) × p(likes | <s>, tom) × p(beer | <s>, tom, likes) × p(</s> | <s>, tom, likes, beer).]

Pages 12-16: Unconditional LMs with RNNs, Training

[Figure, built up over several slides: the same unrolled RNN over "<s> tom likes beer </s>", now with the observed words fed in as inputs and a per-step cost attached to each softmax output (cost1 ... cost4), the log loss / cross entropy of the observed next word. The training objective F is the sum of these per-step costs.]

Page 17: Conditional Language Modeling - Unbabel

Unconditional LMs with RNNs

[Figure: the RNN LM architecture, annotated. The observed context words w1 ... w4 (random variables) are mapped to word-embedding vectors, fed through RNN hidden state vectors h1 ... h4 (from the initial state h0), and a final softmax, a vector of length |vocab|, gives p(W5 | w1, w2, w3, w4).]
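As a concrete illustration of this picture, here is a minimal PyTorch sketch of an RNN LM together with the per-step cross-entropy ("log loss") objective from the training slides; the class name, sizes, and random toy data are assumptions, not the lecture's:

    import torch
    import torch.nn as nn

    class RNNLM(nn.Module):
        def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)             # word embedding vectors
            self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)   # hidden states h_t
            self.proj = nn.Linear(hidden_dim, vocab_size)              # scores u_t, one per vocab item

        def forward(self, word_ids):
            h, _ = self.rnn(self.embed(word_ids))    # (batch, time, hidden)
            return self.proj(h)                      # logits; softmax of these is p(w_{t+1} | w_{<=t})

    vocab_size = 1000
    model = RNNLM(vocab_size)
    loss_fn = nn.CrossEntropyLoss()                  # per-step log loss / cross entropy
    inputs = torch.randint(0, vocab_size, (2, 6))    # <s> w1 ... w5   (random toy batch)
    targets = torch.randint(0, vocab_size, (2, 6))   # w1 ... w5 </s>
    loss = loss_fn(model(inputs).reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()                                  # the objective is the sum of the per-step costs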

Page 18: Conditional Language Modeling - Unbabel

Conditional LMs

A conditional language model assigns probabilities to sequences of words, $w = (w_1, w_2, \ldots, w_\ell)$, given some conditioning context, $x$.

As with unconditional models, it is again helpful to use the chain rule to decompose this probability:

$$p(w \mid x) = \prod_{t=1}^{\ell} p(w_t \mid x, w_1, w_2, \ldots, w_{t-1})$$

What is the probability of the next word, given the history of previously generated words and the conditioning context $x$?

Page 19: Conditional Language Modeling - Unbabel

Conditional LMs: examples of "input" x and "text output" w

  x (input)                            w (text output)
  An author                            A document written by that author
  A topic label                        An article about that topic
  {SPAM, NOT_SPAM}                     An email
  A sentence in Portuguese             Its English translation
  A sentence in English                Its Portuguese translation
  A sentence in English                Its Chinese translation
  An image                             A text description of the image
  A document                           Its summary
  A document                           Its translation
  Meteorological measurements          A weather report
  Acoustic signal                      Transcription of speech
  Conversational history + database    Dialogue system response
  A question + a document              Its answer
  A question + an image                Its answer

Page 21: Conditional Language Modeling - Unbabel

Data for training conditional LMs

To train conditional language models, we need paired samples, $\{(x_i, w_i)\}_{i=1}^{N}$.

Data availability varies. It is easy to think of tasks that could be solved by conditional language models but for which the data just doesn't exist.

Relatively large amounts of data are available for: translation, summarisation, caption generation, speech recognition.

Page 22: Conditional Language Modeling - Unbabel

Section overview

The rest of this section will look at "encoder-decoder" models that learn a function that maps x into a fixed-size vector and then use a language model to "decode" that vector into a sequence of words, w.

  x: Kunst kann nicht gelehrt werden…
  w: Artistry can't be taught…

Page 23: Conditional Language Modeling - Unbabel

Section overview

The same encoder-decoder idea, with an image-captioning example:

  x: an image
  w: A dog is playing on the beach.

Page 24: Conditional Language Modeling - Unbabel

Section overview

• Two questions

  • How do we encode x as a fixed-size vector, c?
    - Problem (or at least modality) specific
    - Think about assumptions

  • How do we condition on c in the decoding model?
    - Less problem specific
    - We will review solutions/architectures

Page 25: Conditional Language Modeling - Unbabel

Kalchbrenner and Blunsom 2013

Encoder (of the source sentence): $c = \mathrm{embed}(x)$, $s = Vc$

Recurrent decoder. Recall the unconditional RNN:

$$h_t = g(W[h_{t-1}; w_{t-1}] + b)$$

The conditional version adds the source encoding $s$ as a learnt bias (here $w_{t-1}$ denotes the embedding of the previous word, and the $W[h_{t-1}; \cdot]$ term is the recurrent connection):

$$h_t = g(W[h_{t-1}; w_{t-1}] + s + b)$$
$$u_t = Ph_t + b'$$
$$p(W_t \mid x, w_{<t}) = \mathrm{softmax}(u_t)$$

Page 26: Conditional Language Modeling - Unbabel

K&B 2013: Encoder

How should we define $c = \mathrm{embed}(x)$?

The simplest model possible: add up the embeddings of the source words $x_1, x_2, \ldots$,

$$c = \sum_i x_i$$

What do you think of this model?
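A minimal numpy sketch of both pieces, the additive encoder above and the conditioned decoder recurrence from the previous slide; the sizes, random weights, and the choice of tanh for g are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, vocab = 8, 16, 50                     # source emb dim, decoder hidden dim, target vocab (toy)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Encoder: c = sum_i x_i, then s = Vc
    X_src = rng.standard_normal((6, n))         # embeddings x_1 ... x_6 of the source words
    c = X_src.sum(axis=0)
    V = rng.standard_normal((d, n))
    s = V @ c

    # Decoder step: h_t = g(W[h_{t-1}; w_{t-1}] + s + b), u_t = P h_t + b'
    W = rng.standard_normal((d, d + n)); b = np.zeros(d)
    P = rng.standard_normal((vocab, d)); b2 = np.zeros(vocab)
    h_prev = np.zeros(d)
    w_prev = rng.standard_normal(n)             # embedding of the previous target word
    h_t = np.tanh(W @ np.concatenate([h_prev, w_prev]) + s + b)
    p_t = softmax(P @ h_t + b2)                 # p(W_t | x, w_<t)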

Page 27: Conditional Language Modeling - Unbabel

K&B 2013: CSM Encoder

How should we define $c = \mathrm{embed}(x)$?

Convolutional sentence model (CSM)

Page 28: Conditional Language Modeling - Unbabel

K&B 2013: RNN Decoder


Pages 29-34: K&B 2013: RNN Decoder

[Figure, built up over several slides: the decoder RNN unrolled over "<s> tom likes beer </s>", just like the unconditional RNN LM except that the source encoding s feeds into every hidden state. The output probability accumulates as p(tom | s, <s>) × p(likes | s, <s>, tom) × p(beer | s, <s>, tom, likes) × p(</s> | s, <s>, tom, likes, beer).]

Page 35: Conditional Language Modeling - Unbabel

Sutskever et al. (2014)

LSTM encoder:

$$(c_i, h_i) = \mathrm{LSTM}(x_i, c_{i-1}, h_{i-1})$$

where $(c_0, h_0)$ are parameters. The encoding is $(c_\ell, h_\ell)$, where $\ell = |x|$.

LSTM decoder, with $w_0 = \langle s \rangle$:

$$(c_{t+\ell}, h_{t+\ell}) = \mathrm{LSTM}(w_{t-1}, c_{t+\ell-1}, h_{t+\ell-1})$$
$$u_t = Ph_{t+\ell} + b$$
$$p(W_t \mid x, w_{<t}) = \mathrm{softmax}(u_t)$$
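A minimal sketch of this encoder-decoder in PyTorch, assuming teacher forcing and toy vocabulary sizes; the class name and hyperparameters are mine, not the paper's:

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, dim=128):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, dim)
            self.encoder = nn.LSTM(dim, dim, batch_first=True)
            self.decoder = nn.LSTM(dim, dim, batch_first=True)
            self.proj = nn.Linear(dim, tgt_vocab)            # u_t = P h_{t+l} + b

        def forward(self, src_ids, tgt_in_ids):
            _, state = self.encoder(self.src_emb(src_ids))   # the encoding: final (h, c) of the encoder
            out, _ = self.decoder(self.tgt_emb(tgt_in_ids), state)   # decoder continues from that state
            return self.proj(out)                            # softmax of these gives p(W_t | x, w_<t)

    model = Seq2Seq(src_vocab=100, tgt_vocab=120)
    logits = model(torch.randint(0, 100, (1, 5)),            # source word ids
                   torch.randint(0, 120, (1, 7)))            # target ids starting with <s> (teacher forcing)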

Pages 36-40: Sutskever et al. (2014)

[Figure, built up over several slides: the source sentence "Aller Anfang ist schwer" is read by the encoder LSTM up to STOP; its final state c initializes the decoder, which, starting from START, generates "Beginnings are difficult STOP" one word at a time, feeding each generated word back in.]

Page 41: Conditional Language Modeling - Unbabel

Sutskever et al. (2014)

• Good

• RNNs deal naturally with sequences of various lengths

• LSTMs in principle can propagate gradients a long distance

• Very simple architecture!

• Bad

• The hidden state has to remember a lot of information!

Pages 42-43: Sutskever et al. (2014): Tricks

[Figure: the same encoder-decoder unrolled over "Aller Anfang ist schwer" → "Beginnings are difficult".]

Read the input sequence "backwards": +4 BLEU

Page 44: Conditional Language Modeling - Unbabel

Sutskever et al. (2014): Tricks

Use an ensemble of J independently trained models.

Decoder:

$$(c^{(j)}_{t+\ell}, h^{(j)}_{t+\ell}) = \mathrm{LSTM}^{(j)}(w_{t-1}, c^{(j)}_{t+\ell-1}, h^{(j)}_{t+\ell-1})$$
$$u^{(j)}_t = Ph^{(j)}_t + b^{(j)}$$
$$u_t = \frac{1}{J}\sum_{j'=1}^{J} u^{(j')}_t$$
$$p(W_t \mid x, w_{<t}) = \mathrm{softmax}(u_t)$$

Ensemble of 2 models: +3 BLEU
Ensemble of 5 models: +4.5 BLEU
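A small sketch of the ensemble step exactly as written on the slide (average the pre-softmax u vectors of the J models, then softmax); the interface is an assumption:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def ensemble_step(per_model_logits):
        """per_model_logits: list of J vectors u_t^(j); returns p(W_t | x, w_<t)."""
        u_t = np.mean(per_model_logits, axis=0)     # u_t = (1/J) sum_j u_t^(j)
        return softmax(u_t)

    # Toy usage with J = 2 models over a 5-word vocabulary.
    p = ensemble_step([np.array([0.1, 2.0, -1.0, 0.3, 0.0]),
                       np.array([0.4, 1.5, -0.5, 0.2, 0.1])])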

Pages 45-47: A word about decoding

In general, we want to find the most probable (MAP) output given the input, i.e.

$$w^* = \arg\max_{w} p(w \mid x) = \arg\max_{w} \sum_{t=1}^{|w|} \log p(w_t \mid x, w_{<t})$$

This is, for general RNNs, a hard problem (the slide labels it "undecidable :("). We therefore approximate it with a greedy search:

$$w^*_1 = \arg\max_{w_1} p(w_1 \mid x)$$
$$w^*_2 = \arg\max_{w_2} p(w_2 \mid x, w^*_1)$$
$$\vdots$$
$$w^*_t = \arg\max_{w_t} p(w_t \mid x, w^*_{<t})$$
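A sketch of the greedy loop, assuming a hypothetical step(state, word_id) function that returns the next-word distribution and an updated decoder state:

    import numpy as np

    def greedy_decode(step, init_state, bos_id, eos_id, max_len=50):
        """Pick w*_t = argmax_w p(w | x, w*_<t) at every position."""
        state, word, output = init_state, bos_id, []
        for _ in range(max_len):
            probs, state = step(state, word)
            word = int(np.argmax(probs))
            if word == eos_id:
                break
            output.append(word)
        return output

    # Toy usage: a fake model that prefers word 3 twice, then the end symbol (id 4).
    def toy_step(state, word_id):
        probs = np.full(5, 0.1)
        probs[3 if state < 2 else 4] = 0.6
        return probs, state + 1

    print(greedy_decode(toy_step, init_state=0, bos_id=0, eos_id=4))   # [3, 3]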

Pages 48-55: A word about decoding

A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of the top b hypotheses.

E.g., for b = 2, with source x = "Bier trinke ich" (word-by-word gloss: beer drink I):

[Figure, built up over several slides: a beam-search tree over output positions w0 ... w3. Starting from <s> (logprob = 0), candidate first words such as "beer" (logprob = -1.82) and "I" (logprob = -2.11) are kept; at each later position the continuations of the surviving hypotheses (e.g. "drink", "I", "beer", "wine", "like", with log probabilities between -2.87 and -8.66) are scored and only the top b = 2 are retained. The best-scoring complete path spells out "I drink beer".]

Page 56: Conditional Language Modeling - Unbabel

Sutskever et al. (2014): Tricks

Use beam search: +1 BLEU

[Figure: the same beam-search tree for x = "Bier trinke ich" as on the preceding slides.]
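A sketch of beam search over the same assumed step(state, word_id) interface as the greedy sketch earlier; hypotheses are scored by summed log probability and only the top b survive at each position:

    import numpy as np

    def beam_search(step, init_state, bos_id, eos_id, beam_size=2, max_len=20):
        """Return the best-scoring hypothesis as (words, logprob, state, finished)."""
        beams = [([bos_id], 0.0, init_state, False)]
        for _ in range(max_len):
            candidates = []
            for words, logp, state, done in beams:
                if done:                                   # finished hypotheses are carried over as-is
                    candidates.append((words, logp, state, True))
                    continue
                probs, new_state = step(state, words[-1])
                for w in np.argsort(probs)[-beam_size:]:   # expand the locally top-b continuations
                    candidates.append((words + [int(w)],
                                       logp + float(np.log(probs[w])),
                                       new_state, int(w) == eos_id))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
            if all(done for *_, done in beams):
                break
        return max(beams, key=lambda c: c[1])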

Page 57: Conditional Language Modeling - Unbabel

A conceptual digression

Pages 58-69: A conceptual digression (continued)

[Figure, built up over several slides: the classical translation triangle, drawn for the example pair "In der Innenstadt explodierte eine Autobombe" ↔ "A car bomb exploded downtown". Direct transfer maps between the two sentences at the surface level; above that sit syntax, then semantics / "logical form" (e.g. detonate :arg0 bomb :arg1 car :loc downtown :time past on the English side and explodieren :arg0 Bombe :arg1 Auto :loc Innenstadt :tempus imperf on the German side), and at the top an interlingua / "meaning" representation such as report_event[ factivity=true explode(e, bomb, car) loc(e, downtown) ]. All of these intermediate representations are hidden. The final slide shows just the sentence pair with the question "Interlingua?".]

Pages 70-73: Conditioning with vectors

We are compressing a lot of information into a finite-sized vector.

"You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!" (Prof. Ray Mooney)

Gradients have a long way to travel. Even LSTMs forget!

What is to be done?

Page 74: Conditional Language Modeling - Unbabel

Outline of Final Section

• Machine translation with attention

• Image caption generation with attention


Page 76: Conditional Language Modeling - Unbabel

Solving the Vector Problem in Translation

• Represent a source sentence as a matrix

• Generate a target sentence from a matrix

• This will

• Solve the capacity problem

• Solve the gradient flow problem

Page 77: Conditional Language Modeling - Unbabel

Sentences as Matrices

• Problem with the fixed-size vector model

• Sentences are of different sizes but vectors are of the same size

• Solution: use matrices instead

• Fixed number of rows, but number of columns depends on the number of words

• Usually |f| = #cols

Pages 78-81: Sentences as Matrices

[Figure, built up over several slides: matrices for three sentences of different lengths, "Ich möchte ein Bier", "Mach's gut", and "Die Wahrheiten der Menschen sind die unwiderlegbaren Irrtümer", each with the same number of rows but with one column per word.]

Question: How do we build these matrices?

Page 82: Conditional Language Modeling - Unbabel

With Concatenation

• Each word type is represented by an n-dimensional vector

• Take all of the vectors for the sentence and concatenate them into a matrix

• Simplest possible model

• So simple, no one has bothered to publish how well/badly it works!

Pages 83-85:

[Figure, built up over several slides: the words of "Ich möchte ein Bier" are mapped to embedding vectors x1 ... x4; taking the columns $f_i = x_i$ and concatenating them gives $F \in \mathbb{R}^{n \times |f|}$.]

Page 86: Conditional Language Modeling - Unbabel

With Convolutional Nets

• Apply convolutional networks to transform the naive concatenated matrix to obtain a context-dependent matrix

• Explored in a recent ICLR submission by Gehring et al., 2016 (from FAIR)

• Closely related to the neural translation model proposed by Kalchbrenner and Blunsom, 2013

• Note: convnets usually have a “pooling” operation at the top level that results in a fixed-sized representation. For sentences, leave this out.

Pages 87-91:

[Figure, built up over several slides: convolutional filters (Filter 1, Filter 2, ...) are applied over the embedding columns x1 ... x4 of "Ich möchte ein Bier", producing a context-dependent matrix $F \in \mathbb{R}^{f(n) \times g(|f|)}$.]

Page 92: Conditional Language Modeling - Unbabel

With Bidirectional RNNs

• By far the most widely used matrix representation, due to Bahdanau et al. (2015)

• One column per word

• Each column (word) has two halves concatenated together:

• a “forward representation”, i.e., a word and its left context

• a “reverse representation”, i.e., a word and its right context

• Implementation: bidirectional RNNs (GRUs or LSTMs) to read f from left to right and right to left, concatenate representations

Pages 93-100:

[Figure, built up over several slides: a bidirectional RNN over "Ich möchte ein Bier". A forward RNN reads the embeddings x1 ... x4 left to right, producing states $\overrightarrow{h}_1, \ldots, \overrightarrow{h}_4$; a backward RNN reads right to left, producing $\overleftarrow{h}_1, \ldots, \overleftarrow{h}_4$. Each column of the source matrix concatenates the two halves, $f_i = [\overleftarrow{h}_i; \overrightarrow{h}_i]$, giving $F \in \mathbb{R}^{2n \times |f|}$.]
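A sketch of building F this way in PyTorch; nn.LSTM with bidirectional=True already concatenates the forward and backward state at each position, which matches f_i = [←h_i; →h_i]. The sizes and random word ids are illustrative:

    import torch
    import torch.nn as nn

    vocab_size, emb_dim, n = 100, 32, 64
    embed = nn.Embedding(vocab_size, emb_dim)
    birnn = nn.LSTM(emb_dim, n, batch_first=True, bidirectional=True)

    src_ids = torch.randint(0, vocab_size, (1, 4))   # e.g. "Ich möchte ein Bier" as 4 word ids
    outputs, _ = birnn(embed(src_ids))               # (1, |f|, 2n): forward and backward halves per word
    F = outputs.squeeze(0).transpose(0, 1)           # (2n, |f|): one column per source word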

Page 101: Conditional Language Modeling - Unbabel

Where are we in 2017?

• There are lots of ways to construct F

• And currently lots of work looking at alternatives!

• Some particularly exciting trends want to get rid of recurrences when processing the input.

• Convolutions are making a comeback (especially at Facebook)

• "Stacked self-attention" (a.k.a. "Attention Is All You Need", from Google Brain) is another strategy.

• Still many innovations possible, particularly targeting lower-resource scenarios and domain adaptation (two problems the big players aren’t as interested in)

Page 102: Conditional Language Modeling - Unbabel

Generation from Matrices

• We have a matrix F representing the input; now we need to generate from it

• Bahdanau et al. (2015) were the first to propose using attention for translating from matrix-encoded sentences

• High-level idea

• Generate the output sentence word by word using an RNN

• At each output position t, the RNN receives two inputs (in addition to any recurrent inputs)

• a fixed-size vector embedding of the previously generated output symbol et-1

• a fixed-size vector encoding a “view” of the input matrix

• How do we get a fixed-size vector from a matrix that changes over time?

• Bahdanau et al: do a weighted sum of the columns of F (i.e., words) based on how important they are at the current time step. (i.e., just a matrix-vector product Fat)

• The weighting of the input columns at each time-step (at) is called attention

Pages 103-108: Recall RNNs…

[Figure, built up over several slides: a decoder RNN generating output words ("I'd", then "like", ...), with each generated word fed back in as the next input.]

Pages 109-123:

[Figure, built up over many slides: attention-based decoding of "Ich möchte ein Bier" into "I'd like a beer STOP". The source sentence is encoded as a matrix F with one column per word. At each output step t the decoder computes an attention weighting a_t over the source columns, forms the context vector c_t = F a_t (c_1 = F a_1, c_2 = F a_2, ...), generates the next word, and feeds it back in. The attention history accumulates one row per output word: a_1^T, a_2^T, ..., a_5^T.]

Page 124: Conditional Language Modeling - Unbabel

Attention

• How do we know what to attend to at each time-step?

• That is, how do we compute a_t?

Pages 125-129: Computing Attention

• At each time step (one time step = one output word), we want to be able to "attend" to different words in the source sentence

• We need a weight for every column: this is an |f|-length vector $a_t$

• Here is a simplified version of Bahdanau et al.'s solution

  • Use an RNN to predict model output; call its hidden states $s_t$ ($s_t$ has a fixed dimensionality, call it m)

  • At time t compute the query ("expected input") embedding $r_t = Vs_{t-1}$ ($V$ is a learned parameter)

  • Take the dot product with every column in the source matrix to compute the attention energy $u_t = F^\top r_t$ (since F has |f| columns, $u_t$ has |f| rows; $u_t$ is called $e_t$ in the paper)

  • Exponentiate and normalize to 1: $a_t = \mathrm{softmax}(u_t)$ (called $\alpha_t$ in the paper)

  • Finally, the input source vector for time t is $c_t = Fa_t$
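The simplified recipe, line for line, as a numpy sketch; F, s_{t-1}, and V are random placeholders with illustrative sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    rows, m, src_len = 128, 64, 4               # rows of F (e.g. 2n), decoder state size, |f|

    F = rng.standard_normal((rows, src_len))    # source matrix, one column per source word
    s_prev = rng.standard_normal(m)             # decoder state s_{t-1}
    V = rng.standard_normal((rows, m))          # learned parameter

    r_t = V @ s_prev                            # query embedding r_t = V s_{t-1}
    u_t = F.T @ r_t                             # attention energies, one per source column
    a_t = np.exp(u_t - u_t.max()); a_t /= a_t.sum()   # a_t = softmax(u_t)
    c_t = F @ a_t                               # context c_t = F a_t, fed into the decoder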

Pages 130-132: Nonlinear Attention-Energy Model

• In the actual model, Bahdanau et al. replace the dot product between the columns of F and $r_t$ with an MLP:

  $u_t = F^\top r_t$  (simple model)
  $u_t = v^\top \tanh(WF + r_t)$  (Bahdanau et al.)

• Here, W and v are learned parameters of appropriate dimension and the + "broadcasts" $r_t$ over the |f| columns in WF

• This can learn more complex interactions

• It is unclear if the added complexity is necessary for good performance
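The MLP variant of the energies as a numpy sketch, continuing the shapes above; W, v, and the attention dimension are random, illustrative placeholders, and r_t is broadcast across the |f| columns of WF:

    import numpy as np

    rng = np.random.default_rng(1)
    rows, a_dim, src_len = 128, 32, 4

    F = rng.standard_normal((rows, src_len))
    r_t = rng.standard_normal(a_dim)            # r_t = V s_{t-1}, now of the attention dimension
    W = rng.standard_normal((a_dim, rows))      # learned
    v = rng.standard_normal(a_dim)              # learned

    u_t = v @ np.tanh(W @ F + r_t[:, None])     # u_t = v^T tanh(WF + r_t): one energy per column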

Pages 133-136: Putting it all together

  F = EncodeAsMatrix(f)                      (Part 1 of lecture)
  X = WF                                     (doesn't depend on output decisions, so precompute it)
  e_0 = <s>
  s_0 = w                                    (learned initial state; Bahdanau uses U applied to the backward encoder state ←h_1)
  t = 0
  while e_t ≠ </s>:
      t = t + 1
      r_t = V s_{t-1}
      u_t = v^T tanh(X + r_t)
      a_t = softmax(u_t)                     (compute attention)
      c_t = F a_t
      s_t = RNN(s_{t-1}, [e_{t-1}; c_t])     (e_{t-1} is a learned embedding of the symbol e_{t-1})
      y_t = softmax(P s_t + b)               (P and b are learned parameters)
      e_t | e_{<t} ~ Categorical(y_t)
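A runnable numpy sketch of this loop, with random stand-ins for the learned parameters, a tiny vocabulary, and a plain tanh cell standing in for the RNN; everything here is illustrative rather than the lecture's implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    rows, m, a_dim, vocab, emb_dim, src_len = 16, 12, 10, 8, 6, 4
    BOS, EOS = 0, 1

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    F = rng.standard_normal((rows, src_len))               # F = EncodeAsMatrix(f)
    E = rng.standard_normal((vocab, emb_dim))              # embeddings of the output symbols e_t
    V = rng.standard_normal((a_dim, m)); W = rng.standard_normal((a_dim, rows))
    v = rng.standard_normal(a_dim)
    Wr = rng.standard_normal((m, m + emb_dim + rows))      # weights of the stand-in RNN cell
    P = rng.standard_normal((vocab, m)); b = np.zeros(vocab)

    X = W @ F                                              # X = WF: precompute, independent of outputs
    s = rng.standard_normal(m)                             # s_0 (learned in the real model)
    e, words, t = BOS, [], 0
    while e != EOS and t < 20:
        t += 1
        r = V @ s                                          # r_t = V s_{t-1}
        u = v @ np.tanh(X + r[:, None])                    # u_t = v^T tanh(X + r_t)
        a = softmax(u)                                     # a_t  (compute attention)
        c = F @ a                                          # c_t = F a_t
        s = np.tanh(Wr @ np.concatenate([s, E[e], c]))     # s_t = RNN(s_{t-1}, [e_{t-1}; c_t])
        y = softmax(P @ s + b)                             # y_t = softmax(P s_t + b)
        e = int(rng.choice(vocab, p=y))                    # e_t | e_<t ~ Categorical(y_t)
        words.append(e)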

Page 137: Conditional Language Modeling - Unbabel

Attention in MT

Add attention to seq2seq translation: +11 BLEU

Page 138: Conditional Language Modeling - Unbabel
Page 139: Conditional Language Modeling - Unbabel

A word about gradients

Pages 140-142:

[Figure, repeated over several slides: the attention-based decoder for "Ich möchte ein Bier" → "I'd like a beer", with its attention history a_1^T ... a_5^T, used here to make the point about gradients: the attention connections give gradients a direct path from each output position back to the relevant source columns.]
Page 143: Conditional Language Modeling - Unbabel

Attention and Translation

• Cho's question: does a translator read and memorize the input sentence/document and then generate the output?

• Compressing the entire input sentence into a vector basically says “memorize the sentence”

• Common sense experience says translators refer back and forth to the input. (also backed up by eye-tracking studies)

• Should humans be a model for machines?

Page 144: Conditional Language Modeling - Unbabel

Summary

• Attention

  • provides the ability to establish information flow directly from distant positions

  • is closely related to "pooling" operations in convnets (and other architectures)

• The traditional attention model seems to only care about "content"

  • No obvious bias in favor of diagonals, short jumps, fertility, etc.

  • Some work has begun to add other "structural" biases (Luong et al., 2015; Cohn et al., 2016), but there are lots more opportunities

  • Factorization into keys and values (Miller et al., 2016; Ba et al., 2016; Gulcehre et al., 2016)

• Attention weights provide an interpretation you can look at

Page 145: Conditional Language Modeling - Unbabel

Questions?

Há perguntas? (Any questions?)

Page 146: Conditional Language Modeling - Unbabel

Obrigado! (Thank you!)