An Imitation Learning Approach to Unsupervised Parsing

Transcript of An Imitation Learning Approach to Unsupervised Parsing

Page 1: An Imitation Learning Approach to Unsupervised Parsing

An Imitation Learning Approach to Unsupervised Parsing

Bowen Li,^e Lili Mou,^a Frank Keller^e

e: EdinburghNLP, University of Edinburgh

1 / 25

Page 2: An Imitation Learning Approach to Unsupervised Parsing

Why Unsupervised Parsing?

Engineering motivation:

▶ ∼6,000 languages in the world
▶ Treebanks for only ∼70 languages (many of them small)
▶ Syntactic annotation:
  ▶ slow and costly
  ▶ relying on expert linguists

We need a way of inducing syntactic knowledge:

▶ based on simple, crowd-sourceable sentence annotation
▶ e.g., natural language inference, sentiment

2 / 25

Page 3: An Imitation Learning Approach to Unsupervised Parsing

Why Unsupervised Parsing?

Cognitive motivation: how do children learn language?

▶ 18 months: start with two-word utterances
▶ By age 5: produce complex syntax (Brown’s stages):
  ▶ relative clauses, infinitivals, gerunds, wh-phrases, passives
▶ No explicit supervision is provided (children don’t see syntax trees)
▶ But they receive indirect feedback: is an utterance understood or not?

To model this, we need a way of inducing syntactic knowledge based on simple semantic labels at the sentence level

3 / 25

Page 4: An Imitation Learning Approach to Unsupervised Parsing

Unsupervised Parsing

Goal: learn linguistically meaningful syntax (tree structures) without treebank supervision

Approach:

▶ Get the training signal from a secondary task:
  ▶ language modeling
  ▶ semantically oriented tasks (e.g., natural language inference, sentiment)
▶ Try to induce meaningful “latent” tree structures

4 / 25

Page 5: An Imitation Learning Approach to Unsupervised Parsing

Hard Discrete Parsers

Examples:

▶ RL-SPINN [Yogatama et al., 2017], Soft-Gating [Maillard et al., 2017], Gumbel-Tree-LSTM [Choi et al., 2018]

Advantages:

▶ Models have grounded parsing actions

Disadvantages: not differentiable

▶ Reinforcement learning ⟹ doubly stochastic gradient descent, poor local optima, low self-agreement
▶ Dynamic programming marginalization ⟹ high time complexity

5 / 25

Page 6: An Imitation Learning Approach to Unsupervised Parsing

Soft Continuous Parsers

Very recent work:

▶ Parsing-reading-predict network [PRPN, Shen et al., 2018]
▶ Ordered Neurons [ON-LSTM, Shen et al., 2019]

Advantages:

▶ Relaxing discrete parsing by continuous notions (e.g., structured attention) ⟹ easy to train by differentiation

Disadvantages:

▶ Inducing syntax from the continuous relaxation is not learnable
▶ Parsing operations are stipulated externally by heuristics

6 / 25

Page 7: An Imitation Learning Approach to Unsupervised Parsing

Combine both Worlds by Imitation Learning

▶ Is it possible to combine both approaches?
▶ Yes! We can use imitation learning!
▶ Couple the soft continuous parser and the hard discrete parser at the intermediate output level (the parse tree)

7 / 25

Page 10: An Imitation Learning Approach to Unsupervised Parsing

Combine both Worlds by Imitation Learning

[Diagram: the soft parser is pretrained on its own task loss J_task1 and induces imperfect parsing labels; the hard parser is first trained on these labels by supervised learning under the parse loss J_parse, and then refined on its downstream task loss J_task2 (policy refinement).]

8 / 25

Page 13: An Imitation Learning Approach to Unsupervised Parsing

PRPN as the Soft Parser

Parsing-reading-predict network (PRPN) [Shen et al., 2018]

[Figure: a binary parse tree over w1 … w5 with node heights 0–3, together with the syntactic distances (2, 2, 3, 1) between adjacent words.]

▶ PRPN: an LSTM with structured attention, trained for language modeling [Shen et al., 2018]
▶ The language-modeling objective induces syntactic distances

11 / 25

Page 14: An Imitation Learning Approach to Unsupervised Parsing

PRPN as the Soft Parser

[Figure: the syntactic distances (2, 2, 3, 1) over w1 … w5 are converted into a tree structure by vanilla induction.]

12 / 25
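
To make the induction step concrete, here is a minimal sketch of the standard distance-to-tree procedure described by Shen et al. (2018): recursively split the sentence at the position of the largest syntactic distance. The function name and the tie-breaking choice (first maximum) are illustrative; PRPN’s released code may use slightly different conventions.

```python
def distance_to_tree(words, dists):
    """Recursively split at the largest syntactic distance.

    dists[i] is the distance between words[i] and words[i + 1],
    so len(dists) == len(words) - 1. Returns a nested-tuple binary tree.
    """
    if len(words) == 1:
        return words[0]
    split = max(range(len(dists)), key=lambda i: dists[i])  # first maximum wins ties
    left = distance_to_tree(words[:split + 1], dists[:split])
    right = distance_to_tree(words[split + 1:], dists[split + 1:])
    return (left, right)

# Distances from the slide's running example, one per adjacent word pair
print(distance_to_tree(["w1", "w2", "w3", "w4", "w5"], [2, 2, 3, 1]))
# -> (('w1', ('w2', 'w3')), ('w4', 'w5'))
```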

Page 16: An Imitation Learning Approach to Unsupervised Parsing

PRPN as the Soft Parser

[Figure: with heuristic-based induction, which adds a right-branching bias, the same syntactic distances (2, 2, 3, 1) over w1 … w5 yield a different tree structure.]

14 / 25

Page 18: An Imitation Learning Approach to Unsupervised Parsing

Gumbel-Tree-LSTM as the Hard Parser

[Figure: a Tree-LSTM composes w1 … w4 into a sentence representation; at each step the adjacent candidate pairs are scored (e.g., merge probabilities 0.2, 0.1, 0.7) and one pair is selected with Gumbel noise.]

▶ Tree-LSTM for sentence classification
▶ Tree structures are learned via the Straight-Through Gumbel-Softmax

16 / 25
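
Below is a minimal PyTorch sketch of the Straight-Through Gumbel-Softmax trick that Gumbel-Tree-LSTM uses to pick which adjacent pair to merge: the forward pass commits to a hard one-hot choice while gradients flow through the soft relaxation. This illustrates the trick itself, not the authors’ implementation; `merge_logits` stands in for the scores the model assigns to candidate pairs.

```python
import torch
import torch.nn.functional as F

def st_gumbel_softmax(logits, temperature=1.0):
    """Hard one-hot sample in the forward pass, soft gradients in the backward pass."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y_soft = F.softmax((logits + gumbel) / temperature, dim=-1)
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    # Straight-through estimator: forward value is y_hard, gradient is that of y_soft
    return y_hard - y_soft.detach() + y_soft

# Merge probabilities from the slide (0.2, 0.1, 0.7) over three candidate pairs
merge_logits = torch.log(torch.tensor([0.2, 0.1, 0.7]))
selection = st_gumbel_softmax(merge_logits)  # one-hot over the candidate pairs
```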

Page 19: An Imitation Learning Approach to Unsupervised Parsing

Our Approach

[Diagram: the hard parser composes w1 … w4 step by step; each merge decision is supervised with a one-hot target derived from the induced tree (e.g., [0, 1, 0], then [1, 0]), giving the parse loss J_parse, while the final sentence representation is trained with the task loss J_task.]

17 / 25

Page 20: An Imitation Learning Approach to Unsupervised Parsing

Our Approach

Two-stage training:

▶ Stage 1: step-by-step supervised learning

[Diagram: as on the previous slide, the hard parser imitates the induced tree’s merge decisions (one-hot targets [0, 1, 0], [1, 0]) under the parse loss J_parse, alongside the downstream task loss J_task2.]

18 / 25

Page 21: An Imitation Learning Approach to Unsupervised Parsing

Our Approach

Two-stage training:

▶ Stage 1: step-by-step supervised learning
▶ Stage 2: policy refinement on the NLI task

[Diagram: in Stage 2, the merge decisions over w1 … w4 are sampled with the Straight-Through Gumbel-Softmax, and the parser is trained on the task loss J_task2 alone.]

19 / 25
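
For concreteness, here is a schematic PyTorch sketch of the two training stages. Names such as `hard_parser`, `oracle_actions`, `force_actions`, and `lambda_parse` are hypothetical placeholders rather than the authors’ code; the point is only the structure: Stage 1 combines the downstream task loss with step-by-step cross-entropy against the merge sequence of the PRPN-induced tree, and Stage 2 drops the parse loss and trains on the task loss alone with ST-Gumbel sampling of merge decisions.

```python
import torch
import torch.nn.functional as F

def stage1_loss(hard_parser, sentence, label, oracle_actions, lambda_parse=1.0):
    # Stage 1: imitation. The parser follows the PRPN-induced tree; each of its
    # merge-decision distributions is supervised with the oracle merge index,
    # and the downstream task loss (e.g., NLI classification) is added on top.
    task_logits, merge_logits = hard_parser(sentence, force_actions=oracle_actions)
    j_task = F.cross_entropy(task_logits, label)
    j_parse = sum(F.cross_entropy(step.unsqueeze(0), torch.tensor([action]))
                  for step, action in zip(merge_logits, oracle_actions))
    return j_task + lambda_parse * j_parse

def stage2_loss(hard_parser, sentence, label):
    # Stage 2: policy refinement. Merge decisions are sampled with the
    # Straight-Through Gumbel-Softmax inside the parser; only the task loss is used.
    task_logits, _ = hard_parser(sentence, force_actions=None)
    return F.cross_entropy(task_logits, label)
```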

Page 22: An Imitation Learning Approach to Unsupervised Parsing

Experimental Results: Parsing Results on All-NLI

Model                          Mean F1   Self-agreement
Left-Branching                 18.9      -
Right-Branching                18.5      -
Balanced-Tree                  22.0      -
Gumbel-Tree-LSTM               21.9      56.8
PRPN                           51.6      65.0
Imitation (first stage only)   52.0      70.8
Imitation (two stages)         53.7      67.4

More settings and analysis in our paper

20 / 25
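
For reference, a small sketch of how the two columns can be computed: unlabeled F1 compares the constituent spans of a predicted tree against a reference parse, and self-agreement is the mean pairwise F1 between the parses produced by models trained with different random seeds. Conventions differ on whether trivial spans (single words, the full sentence) are counted; this is an illustrative sketch, not the paper’s exact evaluation script.

```python
from itertools import combinations

def spans(tree, start=0):
    """Constituent spans of a nested-tuple binary tree (single-word spans excluded)."""
    if isinstance(tree, str):
        return set(), 1
    result, length = set(), 0
    for child in tree:
        child_spans, child_len = spans(child, start + length)
        result |= child_spans
        length += child_len
    result.add((start, start + length))
    return result, length

def unlabeled_f1(pred, gold):
    p, _ = spans(pred)
    g, _ = spans(gold)
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(g)
    return 2 * prec * rec / (prec + rec)

def self_agreement(parses_per_seed):
    """Mean pairwise F1 between the trees different random seeds produce for one sentence."""
    pairs = list(combinations(parses_per_seed, 2))
    return sum(unlabeled_f1(a, b) for a, b in pairs) / len(pairs)
```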

Page 25: An Imitation Learning Approach to Unsupervised Parsing

Relationship to Previous Studies

Do latent tree learning models identify meaningful structure in sentences? [Williams et al., 2018]

▶ Our results: yes, but we need a “good” initialization

Tree-Based Neural Sentence Modeling [Shi et al., 2018]: parse trees and trivial trees yield roughly the same classification performance

▶ Our results: the same finding in terms of NLI accuracy

21 / 25

Page 27: An Imitation Learning Approach to Unsupervised Parsing

One last question

Why does NLI help unsupervised parsing?

[Figure: NLI loss over the space of trees, with three regions marked: trivial trees (e.g., left/right-branching), true parse trees, and other, non-interpretable trees.]

22 / 25

Page 28: An Imitation Learning Approach to Unsupervised Parsing

One last question

Why does NLI help unsupervised parsing?

[Figure: the same NLI-loss landscape over the tree space, now annotated with the parser’s position after the first stage of imitation learning and after the second-stage policy refinement.]

23 / 25

Page 29: An Imitation Learning Approach to Unsupervised Parsing

Conclusion

▶ Imitation learning for unsupervised parsing
▶ A flexible way of coupling heterogeneous models at the intermediate output level
▶ Other applications: semantic parsing [Mou et al., 2017], discourse parsing
▶ Shows the usefulness of semantic tasks for unsupervised parsing
▶ More research is needed on tasks, models, and combinations in this direction

24 / 25

Page 30: An Imitation Learning Approach to Unsupervised Parsing

Thank you!

Q&A

25 / 25