Multilingual Multimodal Language Processing Using Neural Networks
Mitesh M Khapra, IBM Research India
Sarath Chandar, Université de Montréal

Transcript of tutorial slides.

Page 1:

Multilingual Multimodal Language Processing Using

Neural Networks

Mitesh M Khapra

IBM Research India

Sarath Chandar

Université de Montréal

Page 2:

Is it "University the Montreal"?

Page 3:

Tutorial Outline

1. Introduction and motivation

2. Neural networks – basics

3. Multilingual multimodal representation learning

4. Multilingual multimodal generation

5. Summary and open problems

Page 4:

Tutorial Outline

1. Introduction and motivation

2. Neural networks – basics

3. Multilingual multimodal representation learning

4. Multilingual multimodal generation

5. Summary and open problems

Page 5:

What is multilingual multimodal NLP?

Designing language processing systems that can handle

• Multiple languages (English, French, German, Hindi, Spanish, …)

• Multiple modalities (image, speech, video,…)

Page 6:

Why multilingual multimodal NLP?


We live in an increasingly multilingual, multimodal world.

Yet these people worship me like a God. Who am I??

Video, Tamil Audio, English subtitles

English: brown horses eating tall grass beside a body of water

French: chevaux brun manger l'herbe haute à côté d'un corps de l'eau

Page 7:

Why multilingual multimodal NLP?

*This example is taken from Calixto et al., 2012.

Seal – selo (stamp) and foca (marine animal).

Seal pup should have been translated to filhote de foca (young seal), but it has been translated as selo.

Pup in the title has been wrongly translated to filhote de cachorro (young dog).

Page 8:

Why multilingual multimodal NLP?

*Article from forbes.com

Page 9:

Why multilingual multimodal NLP?

Page 10:

Why use Neural Networks?

• Backpropaganda of Neural Networks in the name of Deep Learning.

• Significant success in speech recognition, computer vision.

• Slowly conquering NLP?

Page 11:

Deep Representation Learning

[Flattened diagram comparing four paradigms:
• Rule-based systems: input → hand-designed program → output
• Classic machine learning: input → hand-designed features → mapping from features → output
• Shallow representation learning: input → features → mapping from features → output
• Deep learning: input → simple features → more abstract features → mapping from features → output]

Page 12:

Tutorial Outline

1. Introduction and motivation

2. Neural networks – basics

a) Neural network and backpropagation

b) Matching data with architectures

c) Auto-encoders

d) Distributed natural language processing

3. Multilingual multimodal representation learning

4. Multilingual multimodal generation

5. Summary and open problems

Page 13:

Artificial Neuron / Perceptron


Neuron pre-activation: a(x) = b + Wx

Neuron activation: h(x) = g(a(x))

• W – connection weights

• b – neuron bias

• g(.) – activation function
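As a concrete illustration (not from the slides), a single artificial neuron in numpy; w, b and g follow the notation above:

import numpy as np

# A single artificial neuron (illustrative sketch): pre-activation
# a(x) = w.x + b, activation h(x) = g(a(x)), with g = tanh here.
def neuron(x, w, b, g=np.tanh):
    a = np.dot(w, x) + b   # pre-activation
    return g(a)            # activation

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.4])
print(neuron(x, w, b=0.1))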

Page 14:

Activation functions

[Plots of four activation functions: identity, tanh, sigmoid, ReLU.] *Images from Hugo Larochelle's course.

Page 15:

Learning problem

Given training data {(x_i, y_i)}, find W and b that minimize a loss between the predictions f(x_i; W, b) and the targets y_i.

(W, b) are the parameters of the perceptron model.

Page 16:

Learning using gradient descent

• Compute gradient w.r.t the parameters.

• Make a small step in the direction of the negative gradient in the parameter space.

Page 17:

Gradient Descent

[Animation: surface plot of the cost J(θ0, θ1) over the parameters θ0 and θ1, with gradient descent stepping downhill.]

*This animation is taken from Andrew Ng's course.

Page 18:

Gradient Descent

[Continuation of the same gradient descent animation over J(θ0, θ1).]

*This animation is taken from Andrew Ng's course.

Page 19:

Stochastic Gradient Descent (SGD)

• Approximate the gradient using a mini-batch of examples instead of entire training set.

• Online SGD – when mini-batch size is 1.

• SGD is more commonly used than full-batch GD.
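A minimal mini-batch SGD loop as a sketch; grad_fn is a hypothetical function returning the gradient of the loss on a mini-batch:

import numpy as np

# Minimal mini-batch SGD loop (a sketch). grad_fn is a hypothetical function
# returning the gradient of the loss on a mini-batch; batch_size=1 gives the
# online SGD mentioned above; data is assumed to be a numpy array of examples.
def sgd(grad_fn, theta, data, lr=0.1, batch_size=32, epochs=10):
    n = len(data)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = data[order[start:start + batch_size]]
            theta = theta - lr * grad_fn(theta, batch)  # step along -gradient
    return theta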

Page 20:

Online SGD for Perceptron Learning

• Perceptron learning objective is a convex function.

• GD is guaranteed to converge to the global minimum, while SGD converges to the global minimum if the learning rate is slowly decreased to zero.

Page 21:

What can a perceptron do?

• Can solve linearly separable problems.

*Images from Hugo Larochelle's course.

Page 22:

Can’t solve non-linearly separable problems…

*Images from Hugo Larochelle's course.

Page 23:

Unless the input is transformed to a better feature space…

*Images from Hugo Larochelle's course.

Page 24:

Can the learning algorithm automatically learn these features?

Page 25:

Neural Networks


• You need some non-linearity f.

• Without f, this is still a perceptron!

• h – hidden layer.

Page 26:

Neural Networks with multiple outputs

Page 27:

Training Neural Networks

• The learning problem is still the same as the perceptron learning problem.

• Only the functional form of the output is more complicated.

• We can learn the parameters of the neural network using gradient descent.

• Can we make use of the sequential nature of neural networks to compute gradients more efficiently?

Page 28:

Backpropagation for Neural Networks

• Algorithm for efficient computation of gradients using chain rule of differentiation.

• Backpropagation is not a learning algorithm. We still use gradient descent.

• No more need to derive backprop manually! Theano/Torch/TensorFlow can do it for you!
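To make the chain rule concrete, here is a hand-derived backprop step for a one-hidden-layer network with tanh units and squared error (an illustrative sketch, not the tutorial's code):

import numpy as np

# A hand-derived backprop step for a one-hidden-layer network with tanh
# hidden units and squared error, just to make the chain rule concrete.
def backprop_step(x, y, W1, b1, W2, b2, lr=0.01):
    h = np.tanh(W1 @ x + b1)            # forward: hidden layer
    y_hat = W2 @ h + b2                 # forward: linear output
    d_out = y_hat - y                   # dL/dy_hat for L = 0.5*||y_hat - y||^2
    d_h = (W2.T @ d_out) * (1 - h ** 2) # chain rule through tanh
    W2 -= lr * np.outer(d_out, h); b2 -= lr * d_out
    W1 -= lr * np.outer(d_h, x);   b1 -= lr * d_h
    return 0.5 * np.sum(d_out ** 2)     # loss before the update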

Page 29:

Deep Neural Networks


• Can have multiple hidden layers.

• The more hidden layers, the more non-linear the final projection!

Page 30:

Language Modeling – An application

• N-gram language modeling: given n-1 words, predict the n-th word.

• Example

Objects are often grouped spatially.

A 4-gram model will consider 'are', 'often', 'grouped' to predict 'spatially'.

Traditional n-gram models use frequency statistics to compute p(spatially | grouped, often, are).
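A count-based estimate of that 4-gram probability might look like this (illustrative, unsmoothed):

from collections import Counter

# Illustrative count-based 4-gram estimate of p(w_n | w_{n-3}, w_{n-2}, w_{n-1}),
# as in traditional n-gram language models (unsmoothed, for illustration only).
def ngram_probs(tokens, n=4):
    counts, ctx_counts = Counter(), Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        counts[gram] += 1
        ctx_counts[gram[:-1]] += 1
    return lambda ctx, w: counts[tuple(ctx) + (w,)] / max(ctx_counts[tuple(ctx)], 1)

p = ngram_probs("objects are often grouped spatially".split())
print(p(("are", "often", "grouped"), "spatially"))  # 1.0 on this tiny corpus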

Page 31:

Neural Language Modeling (Bengio et al., 2001)


Word Embedding Matrix

• Feed-forward neural network.

• Word embedding matrix We is also learnt using backprop+GD.

Page 32:

Distributed Natural Language Processing

• Neural Networks learn distributed word representation instead of localized word representations.

• Advantages of distributed word representations:

• Comes with similarity as a by-product.

• Easy to generalize when compared to localized representations.

• Can be used to initialize word embedding in other algorithms.

Page 33:

Modeling sequence data

• Feedforward networks ignore the sequence information.

• We managed to include some sequence information in the neural language model by considering the previous n-1 words.

• Can we implicitly add this sequence information in the network architecture?

Page 34:

Recurrent Neural Networks


h0 can be initialized to a zero vector or learned as a parameter.
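One forward pass of a vanilla RNN, as an illustrative sketch; W, U and b stand for the input, recurrent and bias parameters:

import numpy as np

# One forward pass of a vanilla RNN: h_t = tanh(W x_t + U h_{t-1} + b).
# h0 can be a zero vector or a learned parameter, as noted above.
def rnn_forward(xs, W, U, b, h0=None):
    h = np.zeros(U.shape[0]) if h0 is None else h0
    states = []
    for x in xs:                        # process the sequence left to right
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return states                       # one hidden state per time step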

Page 35:

Recurrent Neural Networks

Page 36:

Backpropagation Through Time (BPTT)


To fit y1

Page 37:

Backpropagation Through Time (BPTT)


To fit y2

Page 38:

Backpropagation Through Time (BPTT)


To fit y3

Page 39:

Backpropagation Through Time (BPTT)


To fit y4

Computationally expensive for longer sequences!

Page 40:

Truncated Backpropagation Through Time (T-BPTT)


To fit y4: truncated after 2 steps.

Page 41:

RNN Language Model (Mikolov et al., 2010)

Language modeling as a sequential prediction problem.

I/P : Objects are often grouped spatially .

O/P: are often grouped spatially . <EOS>

Page 42:

RNN Language model


Objects are often grouped

are often grouped spatially

Page 43:

RNN Language model


Objects are often grouped

are often grouped spatially

Models p(grouped|often,are,objects)

Page 44:

Long Short Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997)

• LSTM is a variant of RNN that is good at modeling long-term dependencies.

• RNN uses multiplication to overwrite the hidden state while LSTM uses addition (better gradient flow!).
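A single LSTM step in the standard gated form, as an illustrative sketch (the gate parameter names are ours, not the slides'):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step (illustrative sketch). The cell state c is updated
# additively, which is what gives the better gradient flow mentioned above.
def lstm_step(x, h, c, Wf, Uf, bf, Wi, Ui, bi, Wo, Uo, bo, Wc, Uc, bc):
    f = sigmoid(Wf @ x + Uf @ h + bf)        # forget gate
    i = sigmoid(Wi @ x + Ui @ h + bi)        # input gate
    o = sigmoid(Wo @ x + Uo @ h + bo)        # output gate
    c_tilde = np.tanh(Wc @ x + Uc @ h + bc)  # candidate context
    c = f * c + i * c_tilde                  # additive context update
    h = o * np.tanh(c)                       # new hidden state
    return h, c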

Page 45:

Long Short Term Memory (LSTM)

*Pictures from Chris Olah

Page 46:

Long Short Term Memory (LSTM)

*Pictures from Chris Olah

Additive context

Page 47:

Long Short Term Memory (LSTM)

*Pictures from Chris Olah

Forget gate

Page 48:

Long Short Term Memory (LSTM)

*Pictures from Chris Olah

Input gate

Page 49:

Long Short Term Memory (LSTM)

*Pictures from Chris Olah

Additive context

Page 50:

Long Short Term Memory (LSTM)

*Pictures from Chris Olah

Output gate

Page 51:

Gated Recurrent Units (GRU) (Cho et al., 2014)

*Pictures from Chris Olah

Page 52:

Recursive Neural Networks (Pollack, 1990)

[Parse-tree diagram: the vectors for "The" and "cat" are composed into an NP vector; NP, PP and VP vectors are then recursively composed into an S vector, with shared weights at every composition step.]

Page 53:

Convolutional Networks

State of the art performance in object recognition, object detection, image segmentation…


• Standard network architecture for image representation.

Page 54:

Matching data with architecture

• Bag-of-words like data

• Sequence data

• Tree structured data

• Images

Page 55:

Autoencoders

• Consists of two modules: encoder and decoder.

• Output is equal to the input.
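A minimal sketch of such an autoencoder's forward pass and reconstruction error (parameter names are illustrative):

import numpy as np

# Minimal autoencoder forward pass (illustrative sketch): an encoder h(x)
# followed by a decoder that tries to reproduce the input.
def autoencode(x, W, b, W_dec, b_dec, f=np.tanh):
    h = f(W @ x + b)                  # encoder: compress the input
    x_hat = W_dec @ h + b_dec         # decoder: reconstruct the input
    return x_hat, 0.5 * np.sum((x_hat - x) ** 2)  # reconstruction error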


Page 56:

Autoencoders


This is for bag-of-words like data.

Page 57:

Recurrent Autoencoder

[Diagram: a recurrent encoder reads x1, x2, x3 into a single representation; a recurrent decoder then reconstructs x1, x2, x3.]

Page 58:

Recurrent Autoencoder (v2)

[Diagram: as above, but the decoder additionally receives the previous inputs (x1, x2) while reconstructing x1, x2, x3.]

Page 59:


You can also imagine Recursive Autoencoders, Convolutional Autoencoders…

Page 60:

For a quick introduction:


Page 61:

For a detailed introduction:


Page 62:

Tutorial Outline

1. Introduction and motivation

2. Neural networks – basics

3. Multilingual multimodal representation learning

4. Multilingual multimodal generation

5. Summary and research directions

Page 63:

Let's start by defining the goal of learning multilingual word representations.

Monolingual Word Representations (capture syntactic and semantic similarities between words):
[Two separate scatter plots. English: drink/drank, eat/ate, king/queen, prince/princess. French: boire/buvait, manger/mangé, roi/reine, prince/princesse.]

Multilingual Word Representations (capture syntactic and semantic similarities between words both within and across languages):
[A joint English–French scatter plot in which translation pairs such as (drink, boire), (drank, buvait), (eat, manger), (ate, mangé), (king, roi), (queen, reine), (prince, prince), (princess, princesse) lie close together.]

Page 64:

First, let's try to understand how we learn monolingual word representations.

Page 65:

Consider this task: predict the n-th word given the previous n-1 words.

Example: he sat on a chair

Training data: all n-word windows in your corpus.

Now, let's try to answer these two questions:
• How do you model this task?
• What is the connection between this task and learning word representations?

Page 66:

Training instance: he sat on a chair

[Diagram of the feed-forward model:
• A randomly initialized word embedding matrix W ∈ R^{|V|×d}, one row per vocabulary word (he, cat, sheep, duck, the, mat, on, chat, chair, sleep, sat, slept, a, you, …).
• The embeddings of the k context words (he, sat, on, a) are looked up and concatenated.
• A hidden layer W_h ∈ R^{k·d×h} maps the concatenation to a hidden representation.
• An output layer W_out ∈ R^{h×|V|} produces P(he | he, sat, on, a), P(cat | he, sat, on, a), P(sheep | he, sat, on, a), …, P(chair | he, sat, on, a) over all |V| words.]

Objective: minimize −log(P(chair | he, sat, on, a)), then backpropagate.

Page 67:

Training instance: he sat on a chair

[Same diagram. After backpropagation, the rows of the embedding matrix W ∈ R^{|V|×d} are updated, along with W_h ∈ R^{k·d×h} and W_out ∈ R^{h×|V|}.]

Objective: minimize −log(P(chair | he, sat, on, a)), then backpropagate.

Page 68:

Training instance: he sat on a chair

[Same diagram as before.]

In general: minimize Σ_{i=1}^{T} −log(P(w_i | w_{i−k}, …, w_{i−1}))

T = total number of words in the corpus
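Putting the pieces together, here is a sketch of one loss evaluation of this model (biases omitted; the names follow the slides' W_emb, W_h, W_out, but the code is our illustration):

import numpy as np

# Sketch of one loss evaluation of the feed-forward language model above.
def nlm_loss(ctx_ids, target_id, W_emb, W_h, W_out):
    x = W_emb[ctx_ids].reshape(-1)     # concatenate the k context embeddings
    h = np.tanh(x @ W_h)               # hidden layer, W_h in R^{k*d x h}
    logits = h @ W_out                 # scores over all |V| words
    logits -= logits.max()             # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target_id])   # -log P(target | context)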

Page 69:

How does this result in meaningful word representations?

Intuition: similar words appear in similar contexts.
  he sat on a chair
  he sits on a chair

To predict "chair" in both cases, the model should learn to make the representations of "sits" and "sat" similar.

Page 70:

Alternate formulation…

Page 71:

Positive: he sat on a chair    Negative: he sat on a oxygen

[Diagram: two copies of the network, sharing parameters, each score one 5-word window. The word vectors of a window are concatenated and passed through a hidden layer W_h ∈ R^{k·d×h} and a scalar output layer W_out ∈ R^{h×1}; the positive window gets score s, the corrupted window gets score s_c.]

minimize max(0, 1 − s + s_c)

Backpropagate and update the word representations. Advantage: this does not require the expensive |V|-sized output matrix multiplication.
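A sketch of the corresponding scoring and hinge loss (helper names are illustrative):

import numpy as np

# Score the true window and the corrupted window, then apply the hinge loss.
def ranking_loss(ctx_pos, ctx_neg, W_emb, W_h, w_out):
    def score(ids):
        h = np.tanh(W_emb[ids].reshape(-1) @ W_h)
        return h @ w_out               # scalar score, w_out in R^h
    s, s_c = score(ctx_pos), score(ctx_neg)
    return max(0.0, 1.0 - s + s_c)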

Page 72:

Coming back to learning (multi)bilingual representations

Page 73:

Multilingual Word Representations (capture syntactic and semantic similarities between words both within and across languages).

[Joint English–French scatter plot: translation pairs such as (drink, boire), (drank, buvait), (eat, manger), (ate, mangé), (king, roi), (queen, reine), (prince, prince), (princess, princesse) lie close together.]

Can also be extended to bigger units (sentences, documents, etc.)

Two paradigms:
• Offline Bilingual Alignment
• Joint training for Bilingual Alignment

Page 74:

Two paradigms:
• Offline Bilingual Alignment
• Joint training for Bilingual Alignment

Page 75:

Offline bilingual alignment:
• Stage 1: independently learn word representations for the two languages.
• Stage 2: now try to enforce similarity between the representations of similar words across the two languages.

[Diagram: an English embedding table (he, cat, sheep, duck, the, mat, on, chat, chair, sleep, sat, slept, a, you) next to a French one (il, mouton, toi, canard, la, tapis, sur, bavarder, chaise, dormir, assis, dormi, un, chat).]

Goal in Stage 2: transform the word representations such that the representations of (cat, chat), (sheep, mouton), (you, toi), etc. are close to each other.

How? Let's see …

Page 76:

After Stage 1: X = representations of English words, Y = representations of French words.

Use a bilingual dictionary to make X and Y parallel (i.e., corresponding rows in X and Y form a translation pair).

(Faruqui and Dyer, 2014)

Page 77:

After Stage 1: X = representations of English words, Y = representations of French words.

Goal: transform X and Y such that the transformed representations of (cat, chat), (you, toi), etc. are close to each other.

Search for projection vectors a and b such that, after projecting the original representations onto a and b, the projections are correlated. Use Canonical Correlation Analysis (CCA).

(Faruqui and Dyer, 2014)

Page 78:

Transform X by projecting it onto A: XA.

(Faruqui and Dyer, 2014)

Page 79:

Transform Y by projecting it onto B: YB.

(Faruqui and Dyer, 2014)

Page 80:

Consider (XA)^T(YB).

(Faruqui and Dyer, 2014)

Page 81:

(XA)^T(YB) = A^T X^T Y B. This term is simply the correlation between the transformations XA and YB. We need to maximize it.

(Faruqui and Dyer, 2014)

Page 82:

maximize trace(A^T X^T Y B)
s.t. A^T X^T X A = I and B^T Y^T Y B = I

(Faruqui and Dyer, 2014)
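For reference, a linear-algebra sketch of CCA under this objective (reg is a small ridge term we add for numerical stability; this is our illustration, not the paper's code):

import numpy as np

# Linear CCA sketch: find projections A, B maximizing trace(A^T X^T Y B)
# subject to the whitening constraints above.
def cca(X, Y, k, reg=1e-5):
    X = X - X.mean(0); Y = Y - Y.mean(0)
    def inv_sqrt(S):                   # S^(-1/2) via eigendecomposition
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    Cxx = X.T @ X + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y + reg * np.eye(Y.shape[1])
    T = inv_sqrt(Cxx) @ (X.T @ Y) @ inv_sqrt(Cyy)
    U, _, Vt = np.linalg.svd(T)
    return inv_sqrt(Cxx) @ U[:, :k], inv_sqrt(Cyy) @ Vt.T[:, :k]  # A, B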

Page 83:

Alternatively, one could use Deep CCA instead of CCA…

Page 84:

After Stage 1: X = representations of English words, Y = representations of French words. Again, use a bilingual dictionary to make X and Y parallel.

Extract deep features f(x) and g(y) from x and y (e.g., from the pair sheep / mouton); W_f and W_g are the parameters of the two networks.

Now find projection vectors a and b such that

maximize  a^T f(x)^T g(y) b / sqrt( a^T f(x)^T f(x) a · b^T g(y)^T g(y) b )   (same as CCA)

w.r.t. W_f, W_g, a, b. Backpropagate and update W_f, W_g, a, b.

(Lu et al., 2015)

Page 85:

Two paradigms:
• Offline Bilingual Alignment
• Joint training for Bilingual Alignment

Page 86:

Two paradigms:
• Offline Bilingual Alignment
• Joint training for Bilingual Alignment
  • Use only parallel data
  • Use monolingual as well as parallel data

Page 87:

Two paradigms:
• Offline Bilingual Alignment
• Joint training for Bilingual Alignment
  • Use only parallel data
  • Use monolingual as well as parallel data

Page 88:

Compose word representations to get a sentence representation using a Compositional Vector Model (CVM). Let s = sentence and w_i = representation of word i in the sentence. Two options considered:

ADD (simply add word vectors):  f(s) = Σ_{i=1}^{n} w_i

BI (gram):  f(s) = Σ_{i=1}^{n} tanh(w_{i−1} + w_i)

Training data: parallel sentences, e.g. "he sat on a chair" / "il était assis sur une chaise".

a = English sentence, b = parallel French sentence, n = random French sentence

E(a, b) = ||f(a) − g(b)||²

A degenerate solution is to make f(a) = g(b) = 0. To avoid this, use max-margin training:

minimize max(0, m + E(a, b) − E(a, n))

(Hermann & Blunsom, 2014)

Backpropagate and update the w_i's in both languages.
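A sketch of the ADD composition and this max-margin loss, assuming each sentence is given as an array of word vectors (names are illustrative):

import numpy as np

def add_compose(word_vecs):
    return np.sum(word_vecs, axis=0)          # f(s) = sum_i w_i

def cvm_loss(a_vecs, b_vecs, n_vecs, m=1.0):
    E = lambda u, v: np.sum((u - v) ** 2)     # E(a,b) = ||f(a) - g(b)||^2
    f_a, g_b, g_n = map(add_compose, (a_vecs, b_vecs, n_vecs))
    return max(0.0, m + E(f_a, g_b) - E(f_a, g_n))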

Page 89:

[Same training setup as on the previous slide.]

To reduce the distance between f(a) and g(b), the model will eventually learn to reduce the distance between (chair, chaise), (sit, assis), (he, il), etc.

(Hermann & Blunsom, 2014)

Backpropagate and update the w_i's in both languages.

Page 90:

The previous approach strictly requires parallel data…

Can we exploit monolingual data in two languages in addition to parallel data between them?

Page 91:

Two paradigms:
• Offline Bilingual Alignment
• Joint training for Bilingual Alignment
  • Use only parallel data
  • Use monolingual as well as parallel data

Page 92:

Given monolingual data, we already know how to learn word representations.

Recap, for English:

minimize Σ_{i=1}^{T_e} −log(P(w_i | w_{i−k}, …, w_{i−1}))   w.r.t. W_emb^e, W_h^e, W_out^e

T_e = total number of words in the English corpus
W_emb^e = word representations for English words
W_h^e, W_out^e = other parameters of the model

Similarly for French:

minimize Σ_{i=1}^{T_f} −log(P(w_i | w_{i−k}, …, w_{i−1}))   w.r.t. W_emb^f, W_h^f, W_out^f

T_f = total number of words in the French corpus
W_emb^f = word representations for French words
W_h^f, W_out^f = other parameters of the model

Page 93:

Simply putting the two languages together, we get…

Page 94:

minimize Σ_{j∈{e,f}} Σ_{i=1}^{T_j} −log(P(w_i | w_{i−k}, …, w_{i−1}))   w.r.t. θ_e, θ_f

θ_e = {W_emb^e, W_h^e, W_out^e},  θ_f = {W_emb^f, W_h^f, W_out^f}

Nothing great about this… this is the same as training θ_e and θ_f separately.

Things become interesting when, in addition, we have parallel data…

We can then modify the objective function…

Page 95:

minimize Σ_{j∈{e,f}} Σ_{i=1}^{T_j} −log(P(w_i | w_{i−k}, …, w_{i−1})) + λ · Ω(W_emb^e, W_emb^f)   w.r.t. θ_e, θ_f

(first term: monolingual similarity; second term: bilingual similarity)

θ_e = {W_emb^e, W_h^e, W_out^e},  θ_f = {W_emb^f, W_h^f, W_out^f}

Ω(W_emb^e, W_emb^f) = Σ_{w_i∈V_e} Σ_{w_j∈V_f} sim(w_i, w_j) · distance(W_emb_i^e, W_emb_j^f)

This weighted sum will be low only when similar words across languages are embedded close to each other.

Page 96:

Now, let's look at two specific instances of this formulation…

Page 97:

(Klementiev et al., 2012)

W_emb^e ∈ R^{|V_e|×d} and W_emb^f ∈ R^{|V_f|×d} are the English and French embedding matrices.

Each cell (i, j) of a matrix A stores sim(w_i, w_j), using word alignment information from a parallel corpus (e.g., sim(chair, chaise) ≈ 0.95).

English training instance: he sat on a chair. In addition to the usual English update, also update the French words in proportion to their similarity to {he, sat, on, a}.

More formally:

W_emb_i^f = W_emb_i^f + Σ_{w_j∈V_e} A_{i,j} · ∂ℒ(θ_e)/∂W_emb_j^e,   where ℒ(θ_e) = Σ_{i=1}^{T_e} −log(P(w_i | w_{i−k}, …, w_{i−1}))

Similar words across the two languages undergo similar updates and hence remain close to each other.
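A sketch of that extra cross-lingual update (A is the alignment-similarity matrix; the learning-rate convention here is ours):

import numpy as np

# Every French embedding moves in proportion to its alignment similarity
# A[i, j] with the English words that just received a gradient.
def crosslingual_update(W_emb_f, grad_W_emb_e, A, lr=0.1):
    # A: |V_f| x |V_e| similarity matrix from word alignments
    return W_emb_f - lr * (A @ grad_W_emb_e)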

Page 98:

(Gouws et al., 2015)

En positive: he sat on a chair    En negative: he sat on a oxygen
Fr positive: il était assis sur une chaise    Fr negative: il était assis sur une oxygène

minimize max(0, 1 − s^e + s_c^e) w.r.t. θ_e
minimize max(0, 1 − s^f + s_c^f) w.r.t. θ_f

Independently update θ_e and θ_f.

+ Parallel data:
En: he sat on a chair  [s^e = w_1^e, w_2^e, w_3^e, w_4^e, w_5^e]
Fr: il était assis sur une chaise  [s^f = w_1^f, …, w_6^f]

Now, also minimize

Ω(W_emb^e, W_emb^f) = || (1/m) Σ_{w_i∈s^e} W_emb_i^e − (1/n) Σ_{w_j∈s^f} W_emb_j^f ||²   w.r.t. W_emb^e, W_emb^f

Page 99:

In fact, looking back, we can analyze all the approaches under this framework:

optimize Σ_{j∈{e,f}} ℒ(θ_j) + λ · Ω(W_emb^e, W_emb^f)   w.r.t. θ_e, θ_f

θ_e = {W^e, W_h^e, W_out^e},  θ_f = {W^f, W_h^f, W_out^f}

(ℒ: monolingual similarity; Ω: bilingual similarity)

Page 100:

ℒ(θ_e) | ℒ(θ_f) | Ω(θ_e, θ_f) | Training

Page 101:

ℒ(θ_e) | ℒ(θ_f) | Ω(θ_e, θ_f) | Training

(Faruqui and Dyer, 2014)
ℒ(θ_e) = Σ_{i=1}^{T_e} −log(P(w_i | w_{i−k}, …, w_{i−1}))
ℒ(θ_f) = Σ_{i=1}^{T_f} −log(P(w_i | w_{i−k}, …, w_{i−1}))
Ω(θ_e, θ_f) = a^T (W_emb^e)^T W_emb^f b / sqrt( a^T (W_emb^e)^T W_emb^e a · b^T (W_emb^f)^T W_emb^f b )
Training: Ω(θ_e, θ_f) is optimized after optimizing ℒ(θ_i).

Page 102:

ℒ(θ_e) | ℒ(θ_f) | Ω(θ_e, θ_f) | Training

(Faruqui and Dyer, 2014): as on the previous slide.

(Lu et al., 2015)
ℒ(θ_e), ℒ(θ_f): same as above.
Ω(θ_e, θ_f) = a^T f(W_emb^e)^T g(W_emb^f) b / sqrt( a^T f(W_emb^e)^T f(W_emb^e) a · b^T g(W_emb^f)^T g(W_emb^f) b )
Training: Ω(θ_e, θ_f) is optimized after optimizing ℒ(θ_i).

Page 103:

ℒ(θ_e) | ℒ(θ_f) | Ω(θ_e, θ_f) | Training

(Faruqui and Dyer, 2014), (Lu et al., 2015): as on the previous slides.

(Hermann & Blunsom, 2014)
ℒ(θ_e) = 0,  ℒ(θ_f) = 0
Ω(θ_e, θ_f) = max(0, m + E(a, b) − E(a, n)),  where E(a, b) = ||f(a) − g(b)||²
Training: only Ω(θ_e, θ_f) is optimized.

Page 104:

ℒ(θ_e) | ℒ(θ_f) | Ω(θ_e, θ_f) | Training

(Faruqui and Dyer, 2014), (Lu et al., 2015), (Hermann & Blunsom, 2014): as on the previous slides.

(Klementiev et al., 2012)
ℒ(θ_e) = Σ_{i=1}^{T_e} −log(P(w_i | w_{i−k}, …, w_{i−1}))
ℒ(θ_f) = Σ_{i=1}^{T_f} −log(P(w_i | w_{i−k}, …, w_{i−1}))
Ω(θ_e, θ_f) = 0.5 · (W_emb^e)^T (A ⊗ I) W_emb^f
Training: Ω(θ_e, θ_f) and ℒ(θ_i) are optimized jointly.

Page 105:

ℒ(θ_e) | ℒ(θ_f) | Ω(θ_e, θ_f) | Training

(Faruqui and Dyer, 2014), (Lu et al., 2015), (Hermann & Blunsom, 2014), (Klementiev et al., 2012): as on the previous slides.

(Gouws et al., 2015)
ℒ(θ_e) = max(0, 1 − s^e + s_c^e),  ℒ(θ_f) = max(0, 1 − s^f + s_c^f)
Ω(θ_e, θ_f) = || (1/m) Σ_{w_i∈s^e} W_emb_i^e − (1/n) Σ_{w_j∈s^f} W_emb_j^f ||²
Training: Ω(θ_e, θ_f) and ℒ(θ_i) are optimized jointly.

Page 106:

Now let's take a look at an approach which is based on autoencoders…

Page 107:

Background: a neural network based single-view autoencoder.

encoder: h(X) = f(X) = f(WX + b)
decoder: X′ = g(h(X)) = g(W′ h(X) + b′)

minimize Σ_{i=1}^{N} (X_i − g(h(X_i)))²

Use backpropagation.

Page 108:

Correlational Neural Network: a multiview autoencoder.

encoders: h_x(X) = f_x(W_x X + b),  h_y(Y) = f_y(W_y Y + b)
decoders: X′ = g_x(W_x′ h_x(X) + b′),  Y′ = g_y(W_y′ h_y(Y) + b′)

Page 109:

Correlational Neural Network: a multiview autoencoder.

minimize Σ_{i=1}^{N} (g_x(f_x(X_i)) − X_i)²

Page 110:

Correlational Neural Network: a multiview autoencoder.

minimize Σ_{i=1}^{N} (g_x(f_x(X_i)) − X_i)² + Σ_{i=1}^{N} (g_y(f_y(Y_i)) − Y_i)²

Page 111:

Correlational Neural Network: a multiview autoencoder.

minimize Σ_{i=1}^{N} (g_x(f_x(X_i)) − X_i)² + Σ_{i=1}^{N} (g_y(f_y(Y_i)) − Y_i)² + Σ_{i=1}^{N} (g_x(f_y(Y_i)) − X_i)²

Page 112:

Correlational Neural Network: a multiview autoencoder.

minimize Σ_{i=1}^{N} (g_x(f_x(X_i)) − X_i)² + Σ_{i=1}^{N} (g_y(f_y(Y_i)) − Y_i)² + Σ_{i=1}^{N} (g_x(f_y(Y_i)) − X_i)² + Σ_{i=1}^{N} (g_y(f_x(X_i)) − Y_i)²

Page 113:


Page 114:

[Diagram: X is encoded by f_x(·) and Y by f_y(·) into a shared representation h(·); decoders g_x(·) and g_y(·) produce X′ and Y′.]

So far so good… But will the representations h(X) and h(Y) be correlated?

Turns out that there is no guarantee for this!

Page 115:


Page 116:


Page 117:

Correlational Neural Network: a multiview autoencoder.

minimize Σ_{i=1}^{N} (g_x(f_x(X_i)) − X_i)² + Σ_{i=1}^{N} (g_y(f_y(Y_i)) − Y_i)² + Σ_{i=1}^{N} (g_x(f_y(Y_i)) − X_i)² + Σ_{i=1}^{N} (g_y(f_x(X_i)) − Y_i)² − corr(h(X), h(Y))
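A sketch of this loss, assuming hypothetical encoder/decoder callables enc_x, enc_y, dec_x, dec_y:

import numpy as np

# Self- and cross-reconstruction errors for the two views plus a term that
# rewards correlation between the hidden representations (the -corr above).
def corrnet_loss(X, Y, enc_x, enc_y, dec_x, dec_y, lam=1.0):
    hx, hy = enc_x(X), enc_y(Y)            # N x k hidden representations
    rec = sum(np.sum((dec(h) - V) ** 2)    # four reconstruction terms
              for dec, h, V in [(dec_x, hx, X), (dec_y, hy, Y),
                                (dec_x, hy, X), (dec_y, hx, Y)])
    hx_c, hy_c = hx - hx.mean(0), hy - hy.mean(0)
    num = (hx_c * hy_c).sum(0)             # per-dimension covariance
    den = np.sqrt((hx_c ** 2).sum(0) * (hy_c ** 2).sum(0)) + 1e-8
    return rec - lam * np.sum(num / den)   # reconstruction - correlation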

Page 118:

Let's compare the performance of some of these approaches on the task of cross-language document classification.

[Diagram: labeled documents in L1 and unlabeled documents in L2 both pass through a Common Representation Learner; a classifier trained on the labeled L1 documents can then be applied to the L2 documents.]

Page 119:

Cross-language document classification accuracy:

Approach                    en→de    de→en
Klementiev et al., 2012     77.6     71.1
Hermann & Blunsom, 2013     83.7     71.4
Chandar et al., 2014        91.8     72.8
Gouws et al., 2015          86.5     75.0

Page 120:

We now look at multimodal representation learning

Page 121:

[Diagram, left: a Recursive Neural Network composes the words of "a man jumping on his bike" (x1 … x6, combined via g_θ) bottom-up into hidden states h1 … h6. Right: a Convolutional Neural Network encodes the image.]

h1 = g_θ(x1) = f(W_v x1)
h6 = g_θ(x6, h5) = f(W_v x6 + W_l h5)
h3 = g_θ(x3, h2, h4) = f(W_v x3 + W_l h2 + W_r h4)

W_l for the left child, W_r for the right child.

Page 122:

[Diagram: the Recursive Neural Network maps a sentence to v_i (via W_t) and the Convolutional Neural Network maps an image to y_j (via W_im); both land in a common space.]

Objective:

Σ_{(i,j)∈P} Σ_{c∈S\S(i)} max(0, Δ − v_i^T y_j + v_c^T y_j) + Σ_{(i,j)∈P} Σ_{c∈I\I(j)} max(0, Δ − v_i^T y_j + v_i^T y_c)

(i, j) = correct (sentence, image) pair;  (i, c) = incorrect pair.

Page 123:

Interested in more? Bridge Correlational Neural Networks @ NAACL 2016, Monday, June 13, 2016, 2:40 – 3:00 p.m.

Page 124:

A quick summary…

Page 125:

(Faruqui and Dyer, 2014)  (Lu et al., 2015)  (Hermann & Blunsom, 2014)
(Klementiev et al., 2012)  (Gouws et al., 2015)  (Chandar et al., 2015)
(Socher et al., 2013)

Page 126

Research Directions: Representation Learning

• Learn task-specific bilingual embeddings
• Learn from comparable corpora (instead of parallel corpora)
• Handle data imbalance: more data for the ℒ(θ_j) terms, less data for the λ · Ω(W_emb^e, W_emb^f) term
• Handle larger vocabularies

maximize  Σ_{j∈{e,f}} Σ_{i=1}^{T_j} ℒ(θ_j)  +  λ · Ω(W_emb^e, W_emb^f)  +  ℒ_task(α)
w.r.t. θ_e, θ_f, α

where θ_e = {W^e, W_h^e, W_out^e},  θ_f = {W^f, W_h^f, W_out^f},  α = task-specific parameters;
ℒ(θ_j) encourages monolingual similarity and Ω(·) encourages bilingual similarity.

Page 127

Tutorial Outline

1. Introduction and motivation

2. Neural networks – basics

3. Multilingual multimodal representation learning

4. Multilingual multimodal generation

a) Machine Translation

b) Image captioning

c) Visual Question Answering

d) Video captioning

e) Image generation from captions

5. Summary and open problems

127

Page 128

Natural Language Generation

• Natural Language Processing comprises:
  • Natural Language Understanding (NLU)
  • Natural Language Generation (NLG)
• Natural language generation is hard.
• Applications of NLG:
  • Machine Translation
  • Question answering
  • Captioning
  • Summarization
  • Dialogue systems
• Evaluating NLG systems is also hard! (More on this at the end.)

128

Page 129

Multilingual Multimodal NLG

Multilingual Multimodal NLG refers to conditional NLG where the generator could be conditioned on

• Multiple languages (machine translation, summarization, dialog systems)

• Multiple modalities like images, videos (visual QA, image captioning, video captioning)

129

Page 130

Tutorial Outline

1. Introduction and motivation

2. Neural networks – basics

3. Multilingual multimodal representation learning

4. Multilingual multimodal generation

a) Machine Translation

b) Image captioning

c) Visual Question Answering

d) Video captioning

e) Image generation from captions

5. Summary and research directions

130

Page 131

Machine Translation

En: Economic growth has slowed down in recent years.

Fr : La croissance économique a ralenti au cours des dernières années .

• Statistical Machine Translation (SMT) aims to design systems that can learn to translate between languages from training data.
• Given a source sentence e, SMT picks the target sentence f that maximizes p(f|e) ∝ p(e|f) · p(f), where
  • p(e|f) is the translation model
  • p(f) is the language model
• Traditional methods use a long pipeline (examples: Moses, Joshua).

131

Page 132

Neural Machine Translation

• Neural-network-based machine translation systems.
• Why neural MT?
  • Easy to train in an end-to-end fashion.
  • The whole system can be optimized for the actual task at hand.
  • No need to store gigantic phrase tables; small memory footprint.
  • Simple decoder, unlike the highly intricate decoders in standard MT.
• We will consider:
  • Single source – single target NMT
  • Multi source – single target NMT
  • Single source – multi target NMT
  • Multi source – multi target NMT

132

Page 133

Learning phrase representations using RNN Encoder-Decoder for SMT (Cho et al., 2014)

• RNN Encoder: encode a variable-length sequence into a fixed-length vector representation.

• RNN Decoder: decode a given fixed-length vector representation back into a variable-length sequence.

133

Economic growth has slowed down in recent years.

La croissance économique a ralenti au cours des dernières années .

GRU Encoder

GRU Decoder
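A minimal PyTorch sketch of the encoder-decoder idea (hypothetical sizes; real systems add padding masks, beam search, etc.):

import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """GRU encoder-decoder in the spirit of Cho et al. (2014)."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hid=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src, tgt):
        # encode the whole source sentence into one fixed-length vector c
        _, c = self.encoder(self.src_emb(src))
        # condition the decoder on c; training uses teacher forcing
        dec_out, _ = self.decoder(self.tgt_emb(tgt), c)
        return self.out(dec_out)   # per-step logits over the target vocab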

Page 134

RNN Encoder-Decoder

134

How to use the trained model?

1. Generate a target sequence given an input sequence.

2. Score a given pair of input/output sequences, p(y|x).

Page 135

2-D embedding of the learned phrase representation

135

Page 136

WMT’14 English/French SMT - rescoring

136

Baseline: Moses with default setting

Page 137

Sequence to Sequence Learning with Neural Networks (Sutskever et al., 2014)

137

What is different from Cho et al., 2014?

1. LSTM encoder/decoder instead of GRU encoder/decoder.

2. A deep 4-layer encoder/decoder instead of a shallow one.

Page 138

2-D embedding of the learned phrase representation

138

Page 139

The performance of the LSTM on WMT’14 English to French test set

139

Trick 1: Reverse the input sequence.

Trick 2: Ensemble several neural nets.

Page 140

Performance as a function of sentence length

140

Page 141

Neural MT by jointly learning to align and translate (Bahdanau et al., 2015)

• Issue with the encoder-decoder approach: can we compress all the necessary information in a sentence into a single fixed-length vector? NO.
• How to choose the length of the vector? It should be proportional to the length of the sentence.
• Bahdanau et al. proposed to use k fixed-length vectors for the encoder, where k is the length of the sentence.

141

Page 142

Attention based NMT

142

• To generate the t-th word, the model learns an attention mechanism over all the words in the input sentence (see the sketch below).
• Input words are represented using a bidirectional RNN.
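A small PyTorch sketch of Bahdanau-style additive attention; dimensions and names are illustrative assumptions.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Score each encoder state against the previous decoder state,
    softmax into weights, and return the context vector c_t."""
    def __init__(self, enc_dim, dec_dim, att_dim=128):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, att_dim)
        self.W_h = nn.Linear(enc_dim, att_dim)
        self.v = nn.Linear(att_dim, 1)

    def forward(self, s_prev, enc_states):
        # enc_states: (batch, src_len, enc_dim); s_prev: (batch, dec_dim)
        e = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) +
                              self.W_h(enc_states)))
        alpha = torch.softmax(e, dim=1)           # weights over source words
        return (alpha * enc_states).sum(dim=1)    # context vector c_t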

Page 143

Results on WMT’14 En-Fr Dataset

143

[Figure: BLEU scores; effect of increasing sentence length]

Page 144

Visualization of attention

144

Page 145

Effective Approaches to Attention-based NMT (Luong et al., 2015)

• Explores various architectural choices for attention-based NMT systems:
  • Global attention
  • Local attention
  • Input-feeding approach

145

Page 146

Global Attention

146

• Stacked LSTM encoder instead of a bidirectional single-layer LSTM encoder.
• The hidden state h_t of the final layer is used for context computation.
• All the source words are used for computing the context (similar to Bahdanau et al.).

Page 147

Local Attention

147

• The model first predicts a single aligned position p_t for the current target word.
• A window around p_t is used to compute the context c_t (sketched below).
• c_t is a weighted average of the source hidden states in the window.
• In between soft attention and hard attention (more on hard attention later).
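A possible PyTorch sketch of local-p attention under these definitions; the dot-product scorer, the choice σ = D/2, and all sizes are assumptions of the sketch.

import torch
import torch.nn as nn

class LocalAttention(nn.Module):
    """Predict an aligned source position p_t, then down-weight scores
    with a Gaussian centered at p_t (window half-width D)."""
    def __init__(self, dim, D=5):
        super().__init__()
        self.W_p = nn.Linear(dim, dim)
        self.v_p = nn.Linear(dim, 1)
        self.D = D

    def forward(self, h_t, enc_states):
        # enc_states: (batch, S, dim); h_t: (batch, dim)
        S = enc_states.size(1)
        p_t = S * torch.sigmoid(self.v_p(torch.tanh(self.W_p(h_t))))  # (batch, 1)
        scores = torch.bmm(enc_states, h_t.unsqueeze(2)).squeeze(2)   # dot scores
        pos = torch.arange(S, dtype=torch.float32, device=h_t.device)
        gauss = torch.exp(-((pos - p_t) ** 2) / (2 * (self.D / 2) ** 2))
        alpha = torch.softmax(scores, dim=1) * gauss    # favor the window
        return torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)  # c_t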

Page 148

Input-feeding Approach

148

• Attention vectors are fed as input to the next time step, to inform the model of past alignment decisions.
• Similar to the coverage set in standard MT.

Page 149

WMT’14 En-De results

149

Page 150

Alignment visualizations

150

[Figure panels: global attention, local-m attention, local-p attention, gold alignment]

Page 151

Alignment Error Rate on RWTH En-De alignment data

151

Page 152

Multi-source NMT

• Why multiple sources?
  • Two source strings can reduce ambiguity via triangulation.
  • Example: the English word "bank" may be easily translated into French in the presence of a second, German input containing the word "Flussufer" (river bank).
• Sources should be distant languages for better triangulation:
  ✓ English and German to French
  ✗ English and French to German

152

Page 153

Multi-source NMT (Zoph and Knight, 2016)

• Train a p(e|f, g) model directly on trilingual data.

• Use it to decode e given any (f, g) pair.

• How to combine information from f and g?

153

Page 154

Multi-source Encoder-Decoder Model

154

Page 155

Multi-source Attention model

• We can take the local-attention NMT model and concatenate the contexts computed from the multiple sources (a sketch follows).

155
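One simple way to realize this, sketched in PyTorch; the concatenate-and-project combiner below is an illustrative choice, not the paper's exact layer.

import torch
import torch.nn as nn

class MultiSourceCombiner(nn.Module):
    """Merge per-source context vectors by concatenation + projection."""
    def __init__(self, ctx_dim, n_sources=2):
        super().__init__()
        self.proj = nn.Linear(n_sources * ctx_dim, ctx_dim)

    def forward(self, contexts):
        # contexts: list of (batch, ctx_dim) vectors, one per source language
        return torch.tanh(self.proj(torch.cat(contexts, dim=-1)))

The combined context then feeds the single target decoder exactly as a one-source context would.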

Page 156

English-French-German NMT

156

Page 157

Multi-Target NMT (Dong et al., 2015)

157

A multi-task learning framework for multiple-target-language translation

Optimization for the end-to-multi-end model

Page 158

Multi-Target NMT (Dong et al., 2015)

158

Size of the training corpus for different language pairs

Multi-task neural translation vs. single models, given large-scale corpora for all language pairs

Multi-task neural translation vs. single models, with a small-scale training corpus for some language pairs (* means the language pair is sub-sampled)

Multi-task NMT vs. single models vs. Moses on the WMT 2013 test set

Faster and better convergence with multi-task learning in multiple-language translation

Page 159

Multi-Way, Multilingual NMT (Firat et al., 2016)

• Multiple sources and multiple targets
• Advantages?
  • Sharing knowledge across multiple languages.
  • One universal common space for multiple languages.
  • Advantageous for low-resource languages?
• Challenges
  • N-lingual data is difficult to get for N > 2.
  • The number of parameters grows fast with the number of languages.

159

Page 160

Simple Solutions

• Encoder-decoder model with multiple encoders and multiple decoders, shared across language pairs.
• Can we do the same with attention-based models?
  • Attention is language-pair specific.
  • With L languages, we need O(L²) attention mechanisms.
• Can we share the same attention module across multiple language pairs? YES

160

Page 161

Multi-Way Multilingual NMT (Firat et al., 2016)

• Model:
  • N encoders
  • N decoders
  • A shared attention mechanism
• Both encoders and decoders are shared across multiple language pairs.
• Each encoder can be of a different type (convolutional/RNN, different sizes).

161

Page 162

One step of multiway multilingual NMT

162

Page 163

Low Resource Translation

163

Page 164

Large Scale Translation

164

Page 165

Tutorial Outline

1. Introduction and motivation

2. Neural networks – basics

3. Multilingual multimodal representation learning

4. Multilingual multimodal generation

a) Machine Translation

b) Image captioning

c) Visual Question Answering

d) Video captioning

e) Image generation from captions

5. Summary and research directions

165

Page 166

Image Captioning

166

A woman is throwing a frisbee in a park.

Input

Output

Page 167

Image Captioning

167

Input

1. A building is surrounded by wide attractions, such as a horse statue and a statue of a giant hand and wrist, with a picnic table next to them.
2. A large hand statue outside of a country store.
3. A strange antique store with odd artwork outside near assorted tables and chairs.
4. A wooden shop with a large hand in the forecourt.
5. Country store has big hand with checkered-base statue and tables with benches on front yard.

Crowdsourced Captions

Page 168

Image Captioning

• Requires both visual understanding and language understanding.

• Hard problem.

• Several potential applications.

168

Page 169

Show and Tell: A Neural Image Caption Generator (Vinyals et al., 2015)

169

Page 170

Show and Tell

170
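A minimal PyTorch sketch of the Show-and-Tell recipe: a CNN encodes the image and its embedding is fed once as the first input to an LSTM caption decoder. The use of torchvision's ResNet-18 as the encoder and all sizes are simplifying assumptions (the paper used a different CNN and configuration).

import torch
import torch.nn as nn
import torchvision.models as models

class ShowAndTell(nn.Module):
    def __init__(self, vocab_size, emb=256, hid=512):
        super().__init__()
        cnn = models.resnet18()                         # stand-in encoder
        cnn.fc = nn.Linear(cnn.fc.in_features, emb)     # project to emb dim
        self.cnn = cnn
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, images, captions):
        img_feat = self.cnn(images).unsqueeze(1)        # (B, 1, emb)
        words = self.embed(captions)                    # (B, T, emb)
        seq = torch.cat([img_feat, words], dim=1)       # image as first "word"
        h, _ = self.lstm(seq)
        return self.out(h)                              # next-word logits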

Page 171

Generated captions

171

Page 172

BLEU-1 scores

172

Page 173

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (Xu et al., 2016)

173

Page 174

Attention over time

174

Soft attention

Hard attention

Page 175

Show, Attend and Tell

• Show: convolutional encoder. Instead of using the final fully connected representation, use a lower convolutional layer; this allows the decoder to selectively focus on certain parts of the image.
• Attend: soft attention or hard attention over the image (per time step).
• Tell: LSTM decoder conditioned on the context at the current time step, the word generated at the previous time step, and the previous hidden state.

175

Page 176

Deterministic Soft Attention

• Same as the attention mechanism in Bahdanau et al.’s NMT system.

176

Page 177

Stochastic Hard Attention

177

This is not differentiable!

Use REINFORCE
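A PyTorch sketch of the REINFORCE trick for one hard-attention step; the baseline handling and the variance-reduction terms used in the paper are omitted, and all names are assumptions of the sketch.

import torch

def hard_attention_step(scores, values):
    """Sample one location from the attention distribution and return its
    feature plus the log-probability needed for a REINFORCE update.
    scores: (B, L) attention logits; values: (B, L, D) region features."""
    probs = torch.softmax(scores, dim=-1)
    dist = torch.distributions.Categorical(probs)
    idx = dist.sample()                       # hard, non-differentiable choice
    context = values[torch.arange(values.size(0)), idx]
    return context, dist.log_prob(idx)

# Training idea: scale log_prob by (reward - baseline), e.g. the caption
# log-likelihood, so gradients reach the attention scorer:
#   loss = -log_p_caption - (reward - baseline).detach() * log_prob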

Page 178

Attending correct objects

178

Page 179

Examples of mistakes

179

Page 180

Results on 3 datasets

180

Page 181

Dense Captioning

181

Page 182

Dense Captioning (Johnson et al., 2015)

182

Page 183

Dense Captioning

183

Page 184

Towards interpretable image search systems

184

Page 185

Tutorial Outline

1. Introduction and motivation

2. Neural networks – basics

3. Multilingual multimodal representation learning

4. Multilingual multimodal generation

a) Machine Translation

b) Image captioning

c) Visual Question Answering

d) Video captioning

e) Image generation from captions

5. Summary and research directions

185

Page 186

Visual Question Answering

186

Page 187

Visual QA dataset (Agrawal et al., 2016)

187

Page 188

Image/Text Encoder with LSTM answer module (Agrawal et al., 2016)

188
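A rough PyTorch sketch of this kind of baseline: an LSTM encodes the question, image features are projected, the two are fused by pointwise multiplication, and a classifier picks from a fixed answer vocabulary. Sizes and the fusion choice are assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    def __init__(self, vocab_size, img_dim=4096, hid=1024, n_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, hid, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hid)
        self.classifier = nn.Sequential(
            nn.Linear(hid, hid), nn.Tanh(), nn.Linear(hid, n_answers))

    def forward(self, img_feats, question):
        _, (h, _) = self.lstm(self.embed(question))
        q = h[-1]                                  # final question state
        v = torch.tanh(self.img_proj(img_feats))   # projected image features
        return self.classifier(q * v)              # pointwise fusion -> answers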

Page 189

Results

189

Page 190

Memory Network based VQA (Xiong et al., 2016)

190

Page 191

Visual Input Module

191

Page 192

Episodic Memory Module

192

Page 193

Results on VQA dataset

193

Page 194

Tutorial Outline

1. Introduction and motivation

2. Neural networks – basics

3. Multilingual multimodal representation learning

4. Multilingual multimodal generation

a) Machine Translation

b) Image captioning

c) Visual Question Answering

d) Video captioning

e) Image generation from captions

5. Summary and research directions

194

Page 195

Video Captioning

195

Page 196

Video Captioning (Venugopalan et al., 2015)

196

Page 197

Video Captioning (Venugopalan et al., 2015)

197

Page 198

Video Captioning (Venugopalan et al., 2015)

198

Page 199

Also…

Describing Videos by Exploiting Temporal Structure (Yao et al., 2015)

Proposes:
• Exploiting local structure with a spatio-temporal convolutional network.
• Exploiting global structure with a temporal attention mechanism.

199

Page 200

Tutorial Outline

1. Introduction and motivation

2. Neural networks – basics

3. Multilingual multimodal representation learning

4. Multilingual multimodal generation

a) Machine Translation

b) Image captioning

c) Visual Question Answering

d) Video captioning

e) Image generation from captions

5. Summary and research directions

200

Page 201

Generating images from captions (Mansimov et al., 2016)

201

Page 202

Generating images from captions (Mansimov et al., 2016)

202

Page 203

Tutorial Outline

1. Introduction and motivation

2. Neural networks – basics

3. Multilingual multimodal representation learning

4. Multilingual multimodal generation

5. Summary and open problems

203

Page 204

Summary

• Significant progress in multilingual/multimodal language processing in the past few years.
• Reasons:
  • Better representation learners
  • Data
  • Compute power
• Finally, AI is able to have a direct impact on people's everyday lives!
• This is just the beginning!

204

Page 205

Research Directions: Representation Learning

• Learn task-specific bilingual embeddings
• Learn from comparable corpora (instead of parallel corpora)
• Handle data imbalance: more data for the ℒ(θ_j) terms, less data for the λ · Ω(W_emb^e, W_emb^f) term
• Handle larger vocabularies

maximize  Σ_{j∈{e,f}} Σ_{i=1}^{T_j} ℒ(θ_j)  +  λ · Ω(W_emb^e, W_emb^f)  +  ℒ_task(α)
w.r.t. θ_e, θ_f, α

where θ_e = {W^e, W_h^e, W_out^e},  θ_f = {W^f, W_h^f, W_out^f},  α = task-specific parameters;
ℒ(θ_j) encourages monolingual similarity and Ω(·) encourages bilingual similarity.

Page 206

Research Directions: Language Generation

• Handling large vocabularies during generation:
  • Hierarchical softmax (Morin & Bengio, 2005)
  • NCE (Mnih & Teh, 2012)
  • Hashing-based approaches (Shrivastava & Li, 2014)
  • Sampling-based approaches (Jean et al., 2015)
  • BlackOut (Ji et al., 2016)
• Character level instead of word level?
  • Character-level decoder (Chung et al., 2016)
  • Character-level encoder/decoder (Ling et al., 2015)
• Still an open problem.

206

Page 207

Research Directions: Language Generation

• Handling out-of-vocabulary words.
• Out-of-vocabulary words on the source side?
  • Pointing the Unknown Words (Gulcehre et al., 2016)
• Out-of-vocabulary words on the target side?

207

Page 208

Research Directions: Language Generation

• Better evaluation metrics?
• Perplexity and BLEU do not encourage diversity in generation.
• Human evaluation?
• Can we come up with better automated evaluation measures?
• Related work: How NOT To Evaluate Your Dialogue System (Liu et al., 2016)

208

Page 209

Research Directions: Language Generation

• Transfer learning for low-resource languages?

• Data-efficient architectures?

209

Page 210

Acknowledgements

Many images/tables are directly taken from the respective papers.

210

Page 211

211

Slides will be made available at:

http://sarathchandar.in/mmnlp-tutorial/

Bibliography of related papers will be maintained at:

http://github.com/apsarath/mmnlp-papers

Page 212

Questions?

212