Graph Convolutional Encoders for Syntax-aware NMT

Transcript of slides (42 pages): wilkeraziz.github.io/slides/emnlp2017.pdf

Page 1:

Graph Convolutional Encoders for Syntax-aware NMT

Joost Bastings

Institute for Logic, Language and Computation (ILLC), University of Amsterdam
http://joost.ninja

Page 2:

Collaborators

Ivan Titov, Wilker Aziz, Diego Marcheggiani, Khalil Sima’an

Pages 3-8:

Encoder-Decoder models: no syntax, no hierarchical structure

Sutskever et al. (2014), Cho et al. (2014)

Source: du vet ingenting jon snow
Target (produced word by word over the slides): you know nothing jon snow

Page 9:

Syntax can help!

Syntax:

● captures non-local dependencies between words
● captures relations between words

How to incorporate syntax into NMT?

Page 10:

Related Work

Multi-task learning / latent-graph:

● Eriguchi et al. (ACL, 2017). Learning to Parse and Translate Improves NMT.
● Hashimoto et al. (EMNLP, 2017). NMT with Source-Side Latent Graph Parsing.
● Nadejde et al. (WMT, 2017). Predicting Target Language CCG Supertags Improves NMT.

Tree-RNN:

● Eriguchi et al. (ACL, 2016). Tree-to-Sequence Attentional NMT.
● Chen et al. (ACL, 2017). Improved NMT with a Syntax-Aware Encoder and Decoder.
● Yang et al. (EMNLP, 2017). Towards Bidirectional Hierarchical Representations for Attention-based NMT.

Serialized trees:

● Aharoni & Goldberg (ACL, 2017). Towards String-to-Tree NMT.
● Li et al. (ACL, 2017). Modeling Source Syntax for NMT.

Sequence-to-Dependency:

● Wu et al. (ACL, 2017). Sequence-to-Dependency NMT.

Page 11:

Page 12:

Graph Convolutional Networks: Neural Message Passing

Page 13:

Many linguistic representations can be expressed as a graph

the bastard defended the wall

(Dependency graph: det(bastard, the), nsubj(defended, bastard), dobj(defended, wall), det(wall, the))

Pages 14-15:

Graph Convolutional Networks: message passing

Kipf & Welling (2017). Related ideas earlier, e.g., Scarselli et al. (2009).

(Figure: an undirected graph; the update of node v aggregates messages from its neighbours.)
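A minimal sketch of the update the figure depicts, in the style of Kipf & Welling (2017) (the notation is an assumption; it does not appear on the slide): node $v$'s new representation applies a nonlinearity $\rho$ (e.g., ReLU) to a sum of linearly transformed neighbour representations:

$$h_v^{(j+1)} = \rho\Big(\sum_{u \in \mathcal{N}(v)} W^{(j)}\, h_u^{(j)} + b^{(j)}\Big)$$

where $\mathcal{N}(v)$ is the neighbourhood of $v$, typically taken to include $v$ itself via a self-loop.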

Pages 16-17:

GCNs: multilayer convolution operation

Input X = H(0) → hidden layer H(1) → hidden layer H(2) → output Z = H(N)

X = H(0): initial feature representations of the nodes; Z = H(N): representations informed by each node's neighbourhood. Stacking N layers lets information travel N hops through the graph.

Adapted from Thomas Kipf. The computation is parallelizable and can be made quite efficient (e.g., Hamilton, Ying and Leskovec (2017)).
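Equivalently, in matrix form (a standard way to write this; again an assumption rather than slide content), each layer is a single dense operation over all nodes at once, which is why it parallelizes well:

$$H^{(j+1)} = \rho\big(\hat{A}\, H^{(j)}\, W^{(j)}\big), \qquad \hat{A} = \tilde{D}^{-1/2}(A + I)\,\tilde{D}^{-1/2}$$

where $A$ is the adjacency matrix, $I$ adds self-loops, and $\tilde{D}$ is the degree matrix of $A + I$ (the symmetric normalization used by Kipf & Welling, 2017).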

Pages 18-19:

Syntactic GCNs: directionality and labels

A weight matrix for each direction: Wout, Win, Wloop

A bias for each label, e.g. bnsubj

(Figure: node v exchanges messages along labelled dependency edges (nsubj, dobj, det), transformed by Wout or Win depending on edge direction, plus a self-loop transformed by Wloop.)
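Written out (following the accompanying paper, Bastings et al., 2017), the syntactic GCN update selects a weight matrix by edge direction and a bias by edge label:

$$h_v^{(j+1)} = \rho\Big(\sum_{u \in \mathcal{N}(v)} W^{(j)}_{\mathrm{dir}(u,v)}\, h_u^{(j)} + b^{(j)}_{\mathrm{lab}(u,v)}\Big)$$

with $\mathrm{dir}(u,v) \in \{\mathrm{in}, \mathrm{out}, \mathrm{loop}\}$ and $\mathrm{lab}(u,v)$ the dependency label of the edge (e.g., nsubj).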

Page 20:

Syntactic GCNs: edge-wise gating
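As a sketch of the gating mechanism described in the paper (the slide itself shows only the title): each message receives a scalar gate computed from the sending node and the edge, so the model can down-weight messages along unreliable, automatically parsed edges:

$$g^{(j)}_{u,v} = \sigma\big(h_u^{(j)} \cdot \hat{w}^{(j)}_{\mathrm{dir}(u,v)} + \hat{b}^{(j)}_{\mathrm{lab}(u,v)}\big)$$

$$h_v^{(j+1)} = \rho\Big(\sum_{u \in \mathcal{N}(v)} g^{(j)}_{u,v}\,\big(W^{(j)}_{\mathrm{dir}(u,v)}\, h_u^{(j)} + b^{(j)}_{\mathrm{lab}(u,v)}\big)\Big)$$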

Page 21:

Graph Convolutional Encoders

Pages 22-27:

Graph Convolutional Encoders

(Figure, built up over several slides: the source sentence "the bastard defended the wall", with dependency edges det, nsubj, det, and dobj, is first encoded by a sentence encoder (BiRNN, CNN, ...); one or two GCN layers are then stacked on top of the encoder states, passing messages along the labelled edges via Wout, Win, and Wloop.)
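A minimal NumPy sketch of one such syntactic GCN layer over encoder states. This is an illustration under assumptions (parameter layout, dictionary keys), not the authors' implementation; their released code is the TensorFlow-based Neural Monkey + GCN linked on the final slide.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def syntactic_gcn_layer(H, edges, params):
    """One syntactic GCN layer: direction-specific weights,
    label-specific biases, and edge-wise gating.

    H:      (n, d) array of encoder states, one row per source token.
    edges:  list of (head, dependent, label) dependency edges (token indices).
    params: assumed layout: (d, d) matrices "W_in", "W_out", "W_loop";
            per-label (d,) biases in params["b"] (including a "self" label
            for self-loops); (d,) gate vectors in params["w_gate"] keyed by
            direction; scalar gate biases in params["b_gate"] keyed by
            (direction, label).
    """
    n, d = H.shape
    W = {"in": params["W_in"], "out": params["W_out"], "loop": params["W_loop"]}
    # Every node also sends a message to itself (self-loop).
    messages = [(v, v, "self", "loop") for v in range(n)]
    for head, dep, label in edges:
        messages.append((head, dep, label, "out"))  # head -> dependent
        messages.append((dep, head, label, "in"))   # dependent -> head
    H_new = np.zeros_like(H)
    for u, v, label, direction in messages:
        msg = W[direction] @ H[u] + params["b"][label]
        # Scalar gate: lets the model ignore messages along noisy parsed edges.
        gate = sigmoid(H[u] @ params["w_gate"][direction]
                       + params["b_gate"][(direction, label)])
        H_new[v] += gate * msg
    return relu(H_new)
```

Stacking this layer twice corresponds to the "GCN layer 1" / "GCN layer 2" build-up in the figure: two layers let each token see syntactic neighbours two hops away.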

Pages 28-30:

GCNs for NMT

(Figure, built up over several slides: the source sentence "the bastard defended the wall" with its dependency edges is processed by the encoder plus GCN layer(s); a context vector feeds an RNN decoder, which produces the Swedish translation word by word: "bastarden försvarade väggen".)
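The context vector in the figure is standard attention (Bahdanau et al., 2015); assuming the GCN outputs are $z_1, \dots, z_n$ (the symbols are an assumption, not slide content), the decoder at step $t$ computes

$$c_t = \sum_{i=1}^{n} \alpha_{t,i}\, z_i, \qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{n} \exp(e_{t,k})}$$

where $e_{t,i}$ scores how relevant source position $i$ is to the current decoder state.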

Page 31:

Experiments

Pages 32-33:

Artificial reordering experiment

● Random sequences from 26 types
● Each token is linked to its predecessor
● Sequences are randomly permuted
● Each token is also linked to a random token with a "fake edge", using a different label set

Example: s n o w → n s w o

A BiRNN+GCN model is able to learn how to put the permuted sequences back into order. Real edges are distinguished from fake edges.

Page 34:

Three baselines

● Bag of words
● Convolutional
● Recurrent

Page 35:

Machine Translation experiments

We parse English source sentences using SyntaxNet.

BPE on the target side (8k merges, 16k for WMT'16).

Embeddings: 256 units; GRUs/CNNs: 512 units (800 for WMT'16).

                          Train    Validation (newstest2015)    Test (newstest2016)
English-German   NCv11    227 k    2169                         2999
                 WMT'16   4.5 M
English-Czech    NCv11    181 k    2656                         2999

Page 36:

Results English-German (BLEU)

Page 37:

Results English-Czech (BLEU)

Page 38:

Effect of GCN layers

              English-German      English-Czech
              BLEU1    BLEU4      BLEU1    BLEU4
BiRNN         44.2     14.1       37.8     8.9
+ GCN 1L      45.0     14.1       38.3     9.6
+ GCN 2L      46.3     14.8       39.6     9.9

Page 39:

Effect of sentence length

Page 40:

Flexibility of GCNs

In principle we can condition on any kind of graph-based linguistic structure:

● Semantic Role Labels
● Co-reference chains
● AMR semantic graphs

Page 41:

Conclusion

● A simple and effective approach to integrating syntax into NMT models
● Consistent BLEU improvements for two challenging language pairs
● GCNs are capable of encoding any kind of graph-based structure

Page 42:

Thank you!

Code: Neural Monkey + GCN (TensorFlow)
joost.ninja

Code: Semantic Role Labeling (Theano)
diegma.github.io

Ivan is hiring PhDs/Postdocs. Deadline tomorrow!

This work was supported by the European Research Council (ERC StG BroadSem 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518, NWO VICI 277-89-002).