Dependency Parsing
COM6513 Natural Language Processing

Nikos Aletras
[email protected]
@nikaletras

Computer Science Department
Week 5, Spring 2021

In previous lectures...

Text Classification: Given an instance x (e.g. a document), predict a label $y \in Y$

Tasks: sentiment analysis, topic classification, etc.

Algorithm: Logistic Regression

In previous lectures...

Sequence labelling: Given a sequence of words $x = [x_1, \ldots, x_N]$, predict a sequence of labels $y \in Y^N$

Tasks: part of speech tagging, named entity recognition, etc.

Algorithms: Hidden Markov Models, Conditional Random Fields

In this lecture...

Model richer linguistic representations: graphs

Dependency parses: Graphs representing syntactic relations between words in a sentence

Two approaches:

Graph-based Dependency Parsing
Transition-based Dependency Parsing

Dependency Parsing: Applications

Relation extraction, e.g. identify entity pairs (AM, Arctic Monkeys), (Abbey Road, Beatles), (Different Class, Pulp) with the relation music album by

Question answering

Sentiment analysis

What is a Dependency Parse?

Dependency parse (or tree): Graph representing syntactic relations between words in a sentence

Nodes (or vertices): Words in a sentence

Edges (or arcs): Syntactic relations between words, e.g. dog is the subject (nsubj) of likes (list of standard dependency relations)

Dependency Parsing: Automatically identify the syntactic relations between words in a sentence
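A minimal sketch of this structure in Python. The sentence is a made-up illustration (only dog and likes come from the slide), the labels det and dobj are assumed for the other two arcs, and storing arcs as (head, dependent, label) triples of word positions is one possible encoding rather than a fixed convention.

```python
# Hypothetical example sentence; only "dog" and "likes" appear in the slide.
words = ["The", "dog", "likes", "cats"]

# Each arc is (head index, dependent index, label); nodes are word positions.
arcs = [
    (1, 0, "det"),    # dog -> The
    (2, 1, "nsubj"),  # likes -> dog: "dog" is the subject of "likes"
    (2, 3, "dobj"),   # likes -> cats
]


def head_of(dependent, arcs):
    """Return the (head, label) pairs attached to a word."""
    return [(head, label) for head, dep, label in arcs if dep == dependent]


# Print every arc in a readable head --label--> dependent form.
for head, dep, label in arcs:
    print(f"{words[head]} --{label}--> {words[dep]}")
```

Note that likes has no head in this encoding, which is the situation the root node discussed later is meant to handle.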

Graph constraints

Connected: every word can be reached from any other word ignoring edge directionality

Acyclic: can’t re-visit the same word on a directed path

Single-Head: every word can have only one head
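The three constraints can be checked mechanically. The sketch below assumes nodes are numbered 0 to n_nodes - 1 and arcs are (head, dependent, label) triples; the function names are illustrative, not from the lecture.

```python
from collections import defaultdict


def is_connected(n_nodes, arcs):
    """True if every node is reachable from node 0 when edge direction is ignored."""
    neighbours = defaultdict(set)
    for head, dep, _ in arcs:
        neighbours[head].add(dep)
        neighbours[dep].add(head)
    seen, frontier = {0}, [0]
    while frontier:
        node = frontier.pop()
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            frontier.append(nxt)
    return len(seen) == n_nodes


def is_single_headed(arcs):
    """True if no word appears as the dependent of more than one arc."""
    deps = [dep for _, dep, _ in arcs]
    return len(deps) == len(set(deps))


def is_acyclic(n_nodes, arcs):
    """True if no directed path revisits a node (depth-first search)."""
    children = defaultdict(list)
    for head, dep, _ in arcs:
        children[head].append(dep)
    state = [0] * n_nodes  # 0 = unvisited, 1 = on the current path, 2 = finished

    def visit(node):
        state[node] = 1
        for child in children[node]:
            if state[child] == 1 or (state[child] == 0 and not visit(child)):
                return False
        state[node] = 2
        return True

    return all(state[n] != 0 or visit(n) for n in range(n_nodes))
```

A graph that passes all three checks over N + 1 nodes (words plus root) has exactly N arcs, which is why dependency parses are trees.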

Well-formed Dependency Parse?

Connected? NO

Acyclic? YES

Single-headed? YES

Solution?

Well-formed Dependency Parse

Add a special root node with edges to any nodes without heads (main verb and punctuation).
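A small sketch of this fix, assuming arcs are (head, dependent, label) triples over 0-based word positions. The helper name add_root and the use of a single "root" label for every new edge are assumptions made for illustration.

```python
def add_root(n_words, arcs):
    """Shift word indices by one so that 0 is ROOT, then attach headless words to ROOT."""
    shifted = [(head + 1, dep + 1, label) for head, dep, label in arcs]
    has_head = {dep for _, dep, _ in shifted}
    rooted = [(0, word, "root") for word in range(1, n_words + 1) if word not in has_head]
    return shifted + rooted


# Hypothetical sentence "The dog likes cats ." where "likes" and "." have no head.
arcs = [(1, 0, "det"), (2, 1, "nsubj"), (2, 3, "dobj")]
print(add_root(5, arcs))
```

After this step every word has exactly one head and the graph is connected through node 0.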

Dependency Parsing: Problem setup

Training data is pairs of word sequences (sentences) and dependency trees:

$$D_{train} = \{(x^1, G_x^1), \ldots, (x^M, G_x^M)\}$$

$$x^m = [x_1, \ldots, x_N]$$

graph $G_x = (V_x, A_x)$

vertices $V_x = \{0, 1, \ldots, N\}$

edges $A_x = \{(i, j, k) \mid i, j \in V_x,\ k \in L\}$, where $L$ is the set of labels

We want to learn a model to predict the best graph:

$$\hat{G}_x = \arg\max_{G_x \in \mathcal{G}_x} \text{score}(G_x, x)$$

Learning a dependency parser

We want to learn a model to predict the best graph:

$$\hat{G}_x = \arg\max_{G_x \in \mathcal{G}_x} \text{score}(G_x, x)$$

where $\hat{G}_x$ is a well-formed dependency tree.

Can we learn it using what we know so far? Enumeration over all possible graphs will be expensive.

How about a classifier that predicts each edge? Maybe. But predicting an edge makes some other edges invalid due to the acyclic and single-head constraints.
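To see why independent edge predictions are not enough, the sketch below lets every word pick its best-scoring head on its own. The arc scores are invented for illustration and node 0 stands for the root. Each word ends up with a single head, yet the result contains a cycle.

```python
# Invented arc scores, keyed by (head, dependent); node 0 is ROOT.
scores = {
    (0, 1): 9, (0, 2): 10, (0, 3): 9,
    (1, 2): 20, (1, 3): 3,
    (2, 1): 30, (2, 3): 30,
    (3, 1): 11, (3, 2): 0,
}


def greedy_heads(scores, n_words):
    """Pick the best-scoring head for each word independently of the others."""
    heads = {}
    for dep in range(1, n_words + 1):
        candidates = [h for h, d in scores if d == dep]
        heads[dep] = max(candidates, key=lambda h: scores[(h, dep)])
    return heads


def has_cycle(heads):
    """Follow head pointers from each word; revisiting a word means a cycle."""
    for start in heads:
        seen, node = set(), start
        while node in heads:
            if node in seen:
                return True
            seen.add(node)
            node = heads[node]
    return False


heads = greedy_heads(scores, 3)
print(heads, has_cycle(heads))
```

Here words 1 and 2 choose each other as heads, so the output is not a tree even though every individual decision was locally optimal.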

Maximum Spanning Tree

Spanning Tree: In graph theory, a spanning tree T of an undirected graph G is a subgraph that includes all of the vertices of G, with the minimum possible number of edges.

Tree: In computer science, a tree is a widely used data structure (Abstract Data Type) that simulates a hierarchical tree structure, with a root value and subtrees of children with a parent node, represented as a set of linked nodes.

Maximum Spanning Tree

Score all edges, but keep only the maximum spanning tree using the Chu-Liu-Edmonds algorithm, a modification of Kruskal's algorithm for extracting maximum spanning trees from directed graphs.

Exact solution in $O(N^2)$ time using Chu-Liu-Edmonds.
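A compact recursive sketch of the Chu-Liu-Edmonds idea: let every word pick its best head, and if that creates a cycle, contract the cycle into one node, rescore the arcs that enter it, and recurse. It assumes integer node ids, no self-loops, and at least one incoming arc for every non-root node. It favours readability over speed, so it does not reach the $O(N^2)$ bound, and it does not force the root to have a single child. The scores are invented for illustration.

```python
def find_cycle(heads):
    """Return the nodes of one cycle in a {dependent: head} map, or None."""
    for start in heads:
        path, node = [], start
        while node in heads and node not in path:
            path.append(node)
            node = heads[node]
        if node in path:
            return path[path.index(node):]
    return None


def chu_liu_edmonds(scores, root=0):
    """scores: {(head, dependent): score}. Returns {dependent: head}."""
    nodes = {n for arc in scores for n in arc}
    # Step 1: every non-root node greedily picks its best-scoring head.
    best = {}
    for dep in nodes - {root}:
        candidates = [h for h, d in scores if d == dep]
        best[dep] = max(candidates, key=lambda h: scores[(h, dep)])
    cycle = find_cycle(best)
    if cycle is None:
        return best
    # Step 2: contract the cycle into a fresh node c and rescore arcs touching it.
    c = max(nodes) + 1
    new_scores, origin = {}, {}
    for (h, d), s in scores.items():
        if h in cycle and d in cycle:
            continue
        if d in cycle:    # entering arc: gain from replacing d's head in the cycle
            key, s = (h, c), s - scores[(best[d], d)]
        elif h in cycle:  # arc leaving the cycle
            key = (c, d)
        else:
            key = (h, d)
        if key not in new_scores or s > new_scores[key]:
            new_scores[key], origin[key] = s, (h, d)
    # Step 3: solve the smaller problem, then expand the contracted node.
    heads = {}
    for d, h in chu_liu_edmonds(new_scores, root).items():
        real_head, real_dep = origin[(h, d)]
        heads[real_dep] = real_head
    for node in cycle:    # cycle arcs survive, except the one that was broken
        heads.setdefault(node, best[node])
    return heads


# Invented scores over ROOT (0) and three words.
scores = {
    (0, 1): 9, (0, 2): 10, (0, 3): 9,
    (1, 2): 20, (1, 3): 3,
    (2, 1): 30, (2, 3): 30,
    (3, 1): 11, (3, 2): 0,
}
print(chu_liu_edmonds(scores))
```

On these scores the greedy choice creates a cycle between words 1 and 2; contraction resolves it by attaching word 2 to the root instead.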

Kruskal’s algorithm

Input: scored edges E

sort E by cost (opposite of score)
G = {}
while G not spanning do:
    pop the next edge e
    if e connects different trees:
        add e to G
Return G
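The pseudocode above can be sketched in Python as follows. The check for "connects different trees" is done with a union-find structure, which is an implementation choice made here rather than something the slide specifies. Edges are assumed to be undirected (score, u, v) triples over nodes 0 to n_nodes - 1.

```python
def kruskal_max_spanning_tree(n_nodes, scored_edges):
    """Return the edges of a maximum spanning tree of an undirected graph."""
    parent = list(range(n_nodes))

    def find(node):
        # Follow parent pointers to the representative of the node's tree.
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    tree = []
    # Sorting by descending score is the same as sorting by ascending cost.
    for score, u, v in sorted(scored_edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:              # the edge connects two different trees
            parent[root_u] = root_v
            tree.append((score, u, v))
            if len(tree) == n_nodes - 1:  # the tree now spans all nodes
                break
    return tree


edges = [(1, 0, 1), (5, 1, 2), (3, 0, 2), (4, 2, 3)]
print(kruskal_max_spanning_tree(4, edges))
```

Because this version ignores edge direction, it cannot be applied directly to dependency arcs, where each word needs exactly one incoming edge.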

Graph-based Dependency Parsing

Decompose the graph score into arc scores:

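$$
\begin{aligned}
\hat{G}_x &= \arg\max_{G_x \in \mathcal{G}_x} \text{score}(G_x, x) \\
&= \arg\max_{G_x \in \mathcal{G}_x} w \cdot \Phi(G_x, x) && \text{(linear model)} \\
&= \arg\max_{G_x \in \mathcal{G}_x} \sum_{(i, j, l) \in A_x} w \cdot \phi((i, j, l), x) && \text{(arc-factored)}
\end{aligned}
$$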

Can learn the weights with a Conditional Random Field!
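A minimal sketch of arc-factored scoring, assuming sparse binary features stored in dictionaries. The feature function phi and the weights below are hypothetical placeholders, not the lecture's feature set.

```python
def phi(arc, x):
    """Hypothetical sparse features of a single arc (head, dependent, label)."""
    head, dep, label = arc
    return {
        f"head={x[head]}": 1.0,
        f"head={x[head]}&dep={x[dep]}": 1.0,
        f"label={label}": 1.0,
    }


def arc_score(w, arc, x):
    """Dot product w . phi(arc, x) over the features that are active for this arc."""
    return sum(w.get(name, 0.0) * value for name, value in phi(arc, x).items())


def graph_score(w, arcs, x):
    """Arc-factored: the graph score is the sum of independent arc scores."""
    return sum(arc_score(w, arc, x) for arc in arcs)


x = ["ROOT", "news", "had", "effect"]
w = {"head=had": 1.5, "head=had&dep=effect": 2.0, "label=nsubj": 0.5}  # made-up weights
arcs = [(0, 2, "root"), (2, 1, "nsubj"), (2, 3, "dobj")]
print(graph_score(w, arcs, x))
```

Since each arc is scored without looking at the others, all arc scores can be computed once up front and handed to a maximum spanning tree algorithm.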

Feature Representation

What should φ((head, dependent, label), x) be?

unigram: head=had, head=VERB

bigram: head=had & dependent=effect

head=VERB & dependent=NOUN & between=ADJ

head=had & label=dobj & other-label=nsubj: NO!!! Breaks the arc-factored scoring, because other-label refers to a second arc
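The valid templates above can be sketched as a function that sees only one arc and the input sentence. The function name, the string encoding of features, and the extra head-and-label template are assumptions for illustration; the POS tags are supplied by hand.

```python
def arc_features(head, dep, label, words, tags):
    """Features that look only at one arc (head, dep, label) and the input sentence."""
    lo, hi = sorted((head, dep))
    feats = [
        f"head={words[head]}",                         # unigram on the word
        f"head={tags[head]}",                          # unigram on the POS tag
        f"head={words[head]}&dependent={words[dep]}",  # bigram
        f"head={words[head]}&label={label}",           # uses this arc's own label only
    ]
    # One feature per POS tag strictly between head and dependent.
    for between in tags[lo + 1:hi]:
        feats.append(f"head={tags[head]}&dependent={tags[dep]}&between={between}")
    return feats


words = ["ROOT", "Economic", "news", "had", "little", "effect"]
tags = ["ROOT", "ADJ", "NOUN", "VERB", "ADJ", "NOUN"]  # hand-assigned tags
print(arc_features(3, 5, "dobj", words, tags))
```

Nothing in the function can read the label of another arc, which is what keeps the model arc-factored.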

More Global models

Even though inference and learning are global, features are localised to arcs.

Can we have more global features? Yes we can! Consider subgraphs spanning a few edges. But inference becomes harder, requiring more complex dynamic programs and clever approximations.

Is it worth it? Syntactic parsing has many applications, thus better compromises between speed and accuracy are always welcome!

Transition-based Dependency Parsing

Graph-based dependency parsing restricts the features to perform joint inference efficiently.

Transition-based dependency parsing trades joint inference for feature flexibility.

No more argmax over graphs, just use a classifier with any features we want!

Joint vs incremental prediction

Joint: score (and enumerate) complete outputs (graphs)

Joint vs incremental prediction

Incremental: predict a sequence of actions (transitions) constructing the output

Transition system

A transition system defines the actions A that the classifier f can predict and their effect on the state, which tracks the prediction so far: $S_{t+1} = S_1(\alpha_1 \ldots \alpha_t)$

What should the actions (transitions) be for dependency parsing?

Transition system setup

Input: Vertices $V_x = \{0, 1, \ldots, N\}$ (words in sentence x)

State S = (Stack, Buf, A):

Arcs A (dependencies predicted so far)
Buffer Buf (words left to process)
Stack Stack (last-in, first-out memory)

Initial state: $S_0 = ([], [0, 1, \ldots, N], \{\})$
Final state: $S_{final} = (Stack, [], A)$
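One way to hold this state in code is a small dataclass. The class and helper names are illustrative; words are represented by their indices, with 0 as the root node.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    stack: list = field(default_factory=list)   # last-in, first-out memory
    buffer: list = field(default_factory=list)  # words left to process
    arcs: set = field(default_factory=set)      # (head, dependent, label) triples


def initial_state(n_words):
    """S0 = ([], [0, 1, ..., N], {}), where index 0 is the root node."""
    return State(stack=[], buffer=list(range(n_words + 1)), arcs=set())


def is_final(state):
    """A state is final when the buffer is empty; the stack may still hold words."""
    return not state.buffer


print(initial_state(3))
```

Using default_factory gives every State its own fresh containers, so states do not share a stack or arc set by accident.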

Transition system

Shift: (Stack, i|Buf, A) → (Stack|i, Buf, A): push the next word from the buffer (i) onto the stack

Reduce: (Stack|i, Buf, A) → (Stack, Buf, A): pop the word on top of the stack (i) if it has a head

Right-Arc(label): (Stack|i, j|Buf, A) → (Stack|i|j, Buf, A ∪ {(i, j, l)}): create edge (i, j, label) between the top of the stack (i) and the next word in the buffer (j), push j

Left-Arc(label): (Stack|i, j|Buf, A) → (Stack, j|Buf, A ∪ {(j, i, l)}): create edge (j, i, label) and pop i, if i has no head
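The four transitions can be sketched as pure functions over a (stack, buffer, arcs) triple. The function names and the use of assertions for the preconditions are choices made for this sketch; each function returns a new state instead of modifying its arguments.

```python
def has_head(word, arcs):
    """True if some arc already has this word as its dependent."""
    return any(dep == word for _, dep, _ in arcs)


def shift(stack, buf, arcs):
    # Move the next buffer word onto the stack.
    return stack + [buf[0]], buf[1:], arcs


def reduce_(stack, buf, arcs):
    assert has_head(stack[-1], arcs), "Reduce needs the stack top to have a head"
    return stack[:-1], buf, arcs


def right_arc(stack, buf, arcs, label):
    i, j = stack[-1], buf[0]
    # Edge from the stack top (i) to the next buffer word (j), then push j.
    return stack + [j], buf[1:], arcs | {(i, j, label)}


def left_arc(stack, buf, arcs, label):
    i, j = stack[-1], buf[0]
    assert not has_head(i, arcs), "Left-Arc needs the stack top to have no head"
    # Edge from the next buffer word (j) to the stack top (i), then pop i.
    return stack[:-1], buf, arcs | {(j, i, label)}
```

For example, starting from an empty stack and the buffer [0, 1, 2], two shifts followed by left_arc and right_arc attach word 1 to word 2 and word 2 to the root.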

Example

Stack = []
Buffer = [ROOT, Economic, news, had, little, effect, on, financial, markets, .]

Action? Shift

Example

Stack = [ROOT]
Buffer = [Economic, news, had, little, effect, on, financial, markets, .]

Action? Shift

Example

Stack = [ROOT, Economic]
Buffer = [news, had, little, effect, on, financial, markets, .]

Action? Left-Arc(amod)

Example

Stack = [ROOT]
Buffer = [news, had, little, effect, on, financial, markets, .]

Action? Shift

Example

Stack = [ROOT, news]
Buffer = [had, little, effect, on, financial, markets, .]

Action? Left-Arc(nsubj)

Example

Stack = [ROOT]
Buffer = [had, little, effect, on, financial, markets, .]

Action? Right-Arc(root)

Example

Stack = [ROOT, had]
Buffer = [little, effect, on, financial, markets, .]

Action? Shift

Example

Stack = [ROOT, had, little]
Buffer = [effect, on, financial, markets, .]

Action? Left-Arc(amod)

Example

Stack = [ROOT, had]
Buffer = [effect, on, financial, markets, .]

Action? Right-Arc(dobj)

Example

Stack = [ROOT, had, effect]
Buffer = [on, financial, markets, .]

Action? Let's fast-forward...

Example

Stack = [ROOT, had, .]
Buffer = []

Empty buffer. DONE!
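The first nine actions of this walkthrough, up to the state with effect on top of the stack, can be replayed with a small interpreter. The sketch below uses the words themselves as node ids (safe here because no word repeats) and encodes actions as (name, label) pairs; both are conventions chosen for illustration, and the preconditions are not checked.

```python
def step(action, stack, buf, arcs):
    """Apply one arc-eager action given as a (name, label) pair."""
    name, label = action
    if name == "shift":
        return stack + [buf[0]], buf[1:], arcs
    if name == "reduce":
        return stack[:-1], buf, arcs
    if name == "right_arc":
        return stack + [buf[0]], buf[1:], arcs | {(stack[-1], buf[0], label)}
    if name == "left_arc":
        return stack[:-1], buf, arcs | {(buf[0], stack[-1], label)}
    raise ValueError(f"unknown action: {name}")


sentence = ["ROOT", "Economic", "news", "had", "little", "effect",
            "on", "financial", "markets", "."]
actions = [
    ("shift", None), ("shift", None), ("left_arc", "amod"),
    ("shift", None), ("left_arc", "nsubj"), ("right_arc", "root"),
    ("shift", None), ("left_arc", "amod"), ("right_arc", "dobj"),
]

stack, buf, arcs = [], list(sentence), set()
for action in actions:
    stack, buf, arcs = step(action, stack, buf, arcs)

print(stack)  # ['ROOT', 'had', 'effect']
print(buf)    # ['on', 'financial', 'markets', '.']
print(sorted(arcs))
```

The remaining actions are skipped in the slides, so they are not reproduced here.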

Other transition systems?

This was the arc-eager system. Others:

arc-standard (3 actions)
easy-first (not left-to-right), etc.

All operate with actions combining:

moving words from the buffer to the stack and back (shift/un-shift)
popping words from the stack (reduce)
creating labeled arcs left and right

Intuition: Define actions that are easy to learn

Transition-based Dependency Parsing

Input: sentence $x$

state $S_1 = \text{initialize}(x)$; timestep $t = 1$

while $S_t$ not final do

    action $\alpha_t = \arg\max_{\alpha \in \mathcal{A}} f(\alpha, S_t)$

    $S_{t+1} = S_t(\alpha_t)$; $t = t + 1$

What is $f$? A multiclass classifier.

What do we need to learn it?

learning algorithm (e.g. logistic regression)

labelled training data

feature representation
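
The loop above can be sketched in Python for the arc-eager system. This is a minimal sketch under stated assumptions: the `score` argument stands in for the trained classifier $f$, the argmax is restricted to actions that are legal in the current state (a common practical choice, not stated on the slide), and all function names are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Arc = Tuple[int, int, str]                      # (head, dependent, label)
State = Tuple[List[int], List[int], Set[Arc]]   # (stack, buffer, arcs)
Action = Tuple[str, str]                        # (name, label); label "" if none

def initialize(n_words: int) -> State:
    # Position 0 is the artificial ROOT; words are numbered 1..n_words.
    return [0], list(range(1, n_words + 1)), set()

def has_head(arcs: Set[Arc], word: int) -> bool:
    return any(dep == word for _, dep, _ in arcs)

def legal_actions(state: State, labels: List[str]) -> List[Action]:
    # Only called on non-final states, so the buffer is non-empty;
    # ROOT is never popped, so the stack is non-empty too.
    stack, _, arcs = state
    actions: List[Action] = [("shift", "")]
    actions += [("right_arc", l) for l in labels]
    if stack[-1] != 0 and not has_head(arcs, stack[-1]):
        actions += [("left_arc", l) for l in labels]
    if has_head(arcs, stack[-1]):
        actions.append(("reduce", ""))
    return actions

def apply(state: State, action: Action) -> State:
    stack, buffer, arcs = state
    name, label = action
    if name == "shift":
        return stack + [buffer[0]], buffer[1:], arcs
    if name == "reduce":
        return stack[:-1], buffer, arcs
    if name == "left_arc":    # buffer front becomes head of stack top
        return stack[:-1], buffer, arcs | {(buffer[0], stack[-1], label)}
    if name == "right_arc":   # stack top becomes head of buffer front
        return (stack + [buffer[0]], buffer[1:],
                arcs | {(stack[-1], buffer[0], label)})
    raise ValueError(f"unknown action: {name}")

def greedy_parse(n_words: int,
                 score: Callable[[Action, State], float],
                 labels: List[str]) -> Set[Arc]:
    state = initialize(n_words)
    while state[1]:                       # final once the buffer is empty
        best = max(legal_actions(state, labels),
                   key=lambda a: score(a, state))
        state = apply(state, best)
    return state[2]

# Toy scorer (not a trained model): always prefers Right-Arc(dep).
def toy_score(action: Action, state: State) -> float:
    return 1.0 if action == ("right_arc", "dep") else 0.0

print(greedy_parse(3, toy_score, ["dep"]))
```

Each iteration either consumes a buffer word or pops the stack, so the loop terminates after a number of steps linear in the sentence length.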


What are the right actions?

We only have sentences labelled with graphs: $D_{train} = \{(x^1, G_x^1), \ldots, (x^M, G_x^M)\}$

Ask an oracle to tell us the actions constructing the graph!

In our case, a set of rules comparing the current state $S = (\text{Stack}, \text{Buffer}, \text{ArcsPredicted})$ with $G_x$, returning the correct action as label


Learning from an oracle

Given a labelled sentence and a transition system, an oracle returns states labelled with the correct actions.

$D_{train} = \{(x^1, G_x^1), \ldots, (x^M, G_x^M)\}$

$x^m = [x_1, \ldots, x_N]$

graph $G_x = (V_x, A_x)$

vertices $V_x = \{0, 1, \ldots, N\}$

edges $A_x = \{(i, j, k) \mid i, j \in V_x,\ k \in L\}$, where $L$ is the set of labels

states $S^m = [S_1, \ldots, S_T]$

actions $\alpha^m = [\alpha_1, \ldots, \alpha_T]$
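
One way to turn a gold graph into the state and action sequences above is a rule-based (static) oracle. The sketch below assumes the arc-eager system and a projective gold tree. The rule order follows the usual static arc-eager oracle, and the function names, the state encoding, and the example gold tree are assumptions, chosen to agree with the stack, buffer and labels in the lecture's feature example.

```python
def oracle_action(stack, buffer, heads, labels):
    """Compare the current state with the gold tree and return the action."""
    s0, b0 = stack[-1], buffer[0]
    if heads.get(s0) == b0:                  # gold arc b0 -> s0
        return ("left_arc", labels[s0])
    if heads[b0] == s0:                      # gold arc s0 -> b0
        return ("right_arc", labels[b0])
    # b0 is linked in the gold tree to something deeper in the stack,
    # so the stack top is finished and can be popped.
    if any(heads[b0] == k or heads.get(k) == b0 for k in stack[:-1]):
        return ("reduce", "")
    return ("shift", "")

def oracle_sequence(heads, labels):
    """Return [(state, action), ...] and the arcs built by those actions."""
    stack, buffer, arcs = [0], list(range(1, len(heads) + 1)), set()
    examples = []
    while buffer:
        action = oracle_action(stack, buffer, heads, labels)
        state = (tuple(stack), tuple(buffer), frozenset(arcs))
        examples.append((state, action))
        name, label = action
        if name == "shift":
            stack.append(buffer.pop(0))
        elif name == "reduce":
            stack.pop()
        elif name == "left_arc":
            arcs.add((buffer[0], stack.pop(), label))
        else:  # right_arc
            arcs.add((stack[-1], buffer[0], label))
            stack.append(buffer.pop(0))
    return examples, arcs

# Assumed gold tree for "Economic news had little effect on financial markets ."
# heads[d] = head position of word d (0 is ROOT); labels[d] = arc label.
heads = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3, 6: 5, 7: 8, 8: 6, 9: 3}
labels = {1: "amod", 2: "nsubj", 3: "root", 4: "amod", 5: "dobj",
          6: "prep", 7: "amod", 8: "pobj", 9: "punct"}

examples, arcs = oracle_sequence(heads, labels)
for (stack, buffer, _), action in examples:
    print(stack, buffer, action)
```

Each (state, action) pair becomes one training instance for the classifier: the state is the input and the action is the label.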


Feature Representation

Stack = [ROOT, had, effect]

Buffer = [on, financial, markets, .]

What features would help us predict the correct action Right-Arc(prep)?


Feature Representation

Words/PoS in stack and buffer: wordS1=effect, wordB1=on, wordS2=had, posS1=NOUN, etc.

Dependencies so far: depS1=dobj, depLeftChildS1=amod, depRightChildS1=NULL, etc.

Previous actions: $\alpha_{t-1}$ = Right-Arc(dobj), etc.
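
The feature templates above could be computed from a parser state roughly as follows. This is a minimal sketch: the function name, the argument layout, the PoS tags, and the first five words of the sentence are assumptions; only the feature names and their values for this state come from the slide.

```python
def extract_features(stack, buffer, arcs, words, pos, prev_action):
    """Map a state to a dict of string-valued features.

    stack/buffer hold word positions, arcs holds (head, dependent, label).
    """
    s1 = stack[-1] if stack else None            # top of the stack
    s2 = stack[-2] if len(stack) > 1 else None   # second stack item
    b1 = buffer[0] if buffer else None           # front of the buffer

    def tok(seq, i):
        return seq[i] if i is not None else "NULL"

    def dep_label(i):
        # Label of the arc whose dependent is word i, if already predicted.
        return next((l for _, d, l in arcs if d == i), "NULL")

    left = [d for h, d, _ in arcs if h == s1 and d < s1]
    right = [d for h, d, _ in arcs if h == s1 and d > s1]
    return {
        "wordS1": tok(words, s1),
        "wordS2": tok(words, s2),
        "wordB1": tok(words, b1),
        "posS1": tok(pos, s1),
        "depS1": dep_label(s1),
        "depLeftChildS1": dep_label(min(left)) if left else "NULL",
        "depRightChildS1": dep_label(max(right)) if right else "NULL",
        "prevAction": prev_action,
    }

# State from the slide: Stack = [ROOT, had, effect], Buffer = [on, ...].
words = ["ROOT", "Economic", "news", "had", "little", "effect",
         "on", "financial", "markets", "."]
pos = ["ROOT", "ADJ", "NOUN", "VERB", "ADJ", "NOUN",
       "ADP", "ADJ", "NOUN", "PUNCT"]
arcs = {(2, 1, "amod"), (3, 2, "nsubj"), (0, 3, "root"),
        (5, 4, "amod"), (3, 5, "dobj")}
features = extract_features([0, 3, 5], [6, 7, 8, 9], arcs,
                            words, pos, "Right-Arc(dobj)")
print(features)
```

Before training, such string-valued features are typically one-hot encoded so that each feature=value pair becomes one dimension of the input vector.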


Transition-based vs Graph-based parsing

Transition-based tends to be better on shorter sentences, graph-based on longer ones

Graph-based tends to be better on long-range dependencies

Graph-based lacks the rich structural features

Transition-based is greedy and suffers from early mistakes

Actually, can we ameliorate the greedy issue?

Use Beam Search!
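
To illustrate how beam search softens the greedy problem, here is a minimal sketch. It keeps the `beam_size` best partial action sequences by cumulative log-probability instead of only the single best one. The transition system is a deliberately tiny toy (two actions, two steps) with made-up probabilities, not a real parser, and all names are hypothetical.

```python
import math

def beam_search(initial, is_final, legal_actions, apply, log_prob, beam_size=2):
    """Return (score, final_state, actions) of the best hypothesis found."""
    beam = [(0.0, initial, [])]
    while not all(is_final(state) for _, state, _ in beam):
        candidates = []
        for total, state, history in beam:
            if is_final(state):              # finished hypotheses are kept
                candidates.append((total, state, history))
                continue
            for action in legal_actions(state):
                candidates.append((total + log_prob(action, state),
                                   apply(state, action),
                                   history + [action]))
        # Keep only the beam_size highest-scoring hypotheses.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beam[0]

# Toy system: a state is the tuple of actions taken so far, final after two.
# The locally best first action "A" leads to a worse overall sequence.
PROBS = {(): {"A": 0.6, "B": 0.4},
         ("A",): {"A": 0.55, "B": 0.45},
         ("B",): {"A": 0.9, "B": 0.1}}

def run(beam_size):
    return beam_search(
        initial=(),
        is_final=lambda s: len(s) == 2,
        legal_actions=lambda s: ["A", "B"],
        apply=lambda s, a: s + (a,),
        log_prob=lambda a, s: math.log(PROBS[s][a]),
        beam_size=beam_size,
    )

print(run(1)[2])  # beam size 1 is greedy search: ['A', 'A']
print(run(2)[2])  # beam size 2 recovers ['B', 'A']
```

With a beam of size 1 the search commits to the early choice "A" (sequence probability 0.33), while a beam of size 2 keeps "B" alive long enough to find the better sequence (probability 0.36). The cost is roughly `beam_size` times more classifier calls per step.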


Non-Projectivity

Arcs are crossing each other

long-range dependencies

free word order
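
Crossing arcs can be detected directly from the head of each word. The sketch below assumes a tree given as a mapping from each word position to its head position, with the artificial ROOT at position 0. The function name and both example trees are illustrative.

```python
def is_projective(heads):
    """True if no two arcs cross; heads[d] is the head position of word d."""
    spans = [(min(h, d), max(h, d)) for d, h in heads.items()]
    # Two arcs cross when exactly one endpoint of the second lies
    # strictly inside the span of the first.
    return not any(a < c < b < d for a, b in spans for c, d in spans)

# A projective tree (no crossing arcs).
projective = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3, 6: 5, 7: 8, 8: 6, 9: 3}
# A non-projective tree: the arc 3 -> 1 crosses the arc 4 -> 2.
crossing = {1: 3, 2: 4, 3: 0, 4: 3}

print(is_projective(projective), is_projective(crossing))
```

The pairwise check is quadratic in sentence length, which is fine for filtering a treebank before training a projective-only parser.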


Non-projective Transition-based parsing

The standard stack-based systems cannot do it.

But there are extensions:

swap actions: word reordering

k-planar parsing: use multiple stacks (usually 2)

Standard graph-based parsing handles non-projectivity.


Incremental Language Processing

Other problems solved with similar approaches (a.k.a. transition-based, greedy):

semantic parsing (converting a natural language utterance to a logical form)

coreference resolution

Whenever you have a problem with a very large space of outputs, this approach is worth considering.


Evaluation

Head-finding word-accuracy:

unlabelled: % of words with the right head

labelled: % of words with the right head and label

Sentence accuracy: % of sentences with correct graph
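
These metrics are straightforward to compute once each word has a gold and a predicted (head, label) pair. In the minimal sketch below the function name, the input layout, and the toy data are assumptions. Sentence accuracy is computed here on labelled graphs, and the unlabelled and labelled word accuracies are reported under their common names UAS and LAS.

```python
def attachment_scores(gold, pred):
    """gold/pred: lists of sentences; each sentence is a list of
    (head, label) pairs, one per word, in the same order."""
    total = unlabelled = labelled = exact = 0
    for g_sent, p_sent in zip(gold, pred):
        for (g_head, g_label), (p_head, p_label) in zip(g_sent, p_sent):
            total += 1
            unlabelled += g_head == p_head
            labelled += g_head == p_head and g_label == p_label
        exact += g_sent == p_sent
    return {"UAS": unlabelled / total,
            "LAS": labelled / total,
            "sentence": exact / len(gold)}

# Toy data: the second sentence has one wrong label and one wrong head.
gold = [[(2, "nsubj"), (0, "root")],
        [(2, "amod"), (0, "root"), (2, "punct")]]
pred = [[(2, "nsubj"), (0, "root")],
        [(2, "det"), (0, "root"), (1, "punct")]]

scores = attachment_scores(gold, pred)
print(scores)
```

On this toy data four of five heads are right, three of five head and label pairs are right, and one of two sentences is fully correct.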


Bibliography

Chapter 11 from Eisenstein

Nivre and McDonald’s tutorial slides

Nivre’s article on deterministic transition-based dependency parsing

Nivre and McDonald’s paper comparing their approaches


Coming up next week...

Feed-forward Neural Networks

Getting ready for Assignment 2!