
1

Dependency Parsing
COM6513 Natural Language Processing

Nikos Aletras
n.aletras@sheffield.ac.uk

@nikaletras

Computer Science Department

Week 5, Spring 2021

2

In previous lectures...

Text Classification: Given an instance x (e.g. a document), predict a label y ∈ Y

Tasks: sentiment analysis, topic classification, etc.

Algorithm: Logistic Regression

3

In previous lectures...

Sequence labelling: Given a sequence of words x = [x_1, ..., x_N], predict a sequence of labels y ∈ Y^N

Tasks: part of speech tagging, named entity recognition, etc.

Algorithms: Hidden Markov Models, Conditional Random Fields

4

In this lecture...

Model richer linguistic representations: graphs

Dependency parses: Graphs representing syntactic relations between words in a sentence

Two approaches:

Graph-based Dependency Parsing
Transition-based Dependency Parsing


5

Dependency Parsing: Applications

Relation extraction, e.g. identify entity pairs (AM, Arctic Monkeys), (Abbey Road, Beatles), (Different Class, Pulp) with the relation music album by

Question answering

Sentiment analysis

6

What is a Dependency Parse?

Dependency parse (or tree): Graph representing syntactic relations between words in a sentence

Nodes (or vertices): Words in a sentence

Edges (or arcs): Syntactic relations between words, e.g. dog is the subject (nsubj) of likes (see the list of standard dependency relations)

Dependency Parsing: Automatically identify the syntactic relations between words in a sentence


7

Graph constraints

Connected: every word can be reached from any other word, ignoring edge directionality

Acyclic: can’t re-visit the same word on a directed path

Single-Head: every word can have only one head
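As a concrete illustration, the three constraints can be checked directly on a set of (head, dependent) edges. A minimal Python sketch follows; the edge-set representation and function name are assumptions for illustration, not from the slides:

    def is_well_formed(n_words, edges):
        """edges: set of (head, dependent) pairs over word indices 0..n_words-1."""
        heads = {}
        for h, d in edges:
            if d in heads:                 # Single-Head: at most one head per word
                return False
            heads[d] = h
        # Connected: every word reachable from every other, ignoring edge direction
        adj = {i: set() for i in range(n_words)}
        for h, d in edges:
            adj[h].add(d)
            adj[d].add(h)
        seen, stack = {0}, [0]
        while stack:
            for v in adj[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        if len(seen) != n_words:
            return False
        # Acyclic: following head links upward from any word must never revisit a word
        for start in heads:
            node, visited = start, set()
            while node in heads:
                if node in visited:
                    return False
                visited.add(node)
                node = heads[node]
        return True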


8

Well-formed Dependency Parse?

Connected? NO

Acyclic? YES

Single-headed? YES

Solution?


9

Well-formed Dependency Parse

Add a special root node with edges to any nodes without heads (main verb and punctuation).

10

Dependency Parsing: Problem setup

Training data is pairs of word sequences (sentences) and dependency trees:

D_train = {(x^1, G_x^1), ..., (x^M, G_x^M)},  x^m = [x_1, ..., x_N]

graph G_x = (V_x, A_x)
vertices V_x = {0, 1, ..., N}
edges A_x = {(i, j, k) | i, j ∈ V_x, k ∈ L (labels)}

We want to learn a model to predict the best graph:

Ĝ_x = arg max_{G_x ∈ 𝒢_x} score(G_x, x)


11

Learning a dependency parser

We want to learn a model to predict the best graph:

Ĝ_x = arg max_{G_x ∈ 𝒢_x} score(G_x, x)

where Ĝ_x is a well-formed dependency tree.

Can we learn it using what we know so far? Enumeration over all possible graphs will be expensive.

How about a classifier that predicts each edge? Maybe. But predicting an edge makes some other edges invalid due to the acyclic and single-head constraints.


12

Maximum Spanning Tree

Spanning Tree: In graph theory, a spanning tree T of an undirected graph G is a subgraph that includes all of the vertices of G, with the minimum possible number of edges.

Tree: In computer science, a tree is a widely used data structure (Abstract Data Type) that simulates a hierarchical tree structure, with a root value and subtrees of children with a parent node, represented as a set of linked nodes.


13

Maximum Spanning Tree

Score all edges, but keep only the maximum spanning tree, using the Chu-Liu-Edmonds algorithm, an adaptation of maximum spanning tree algorithms such as Kruskal's to directed graphs.

Exact solution in O(N²) time using Chu-Liu-Edmonds.

14

Kruskal’s algorithm

Input: scored edges E

sort E by cost (opposite of score)
G = {}
while G not spanning do:
    pop the next edge e
    if e connects two different trees:
        add e to G
return G
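A minimal Python sketch of this procedure, using a union-find structure to test whether an edge connects two different trees (function and variable names are illustrative, not from the slides):

    def kruskal_max_spanning_tree(n_vertices, scored_edges):
        """scored_edges: list of (score, u, v). Returns the edge set of a maximum spanning tree."""
        parent = list(range(n_vertices))

        def find(u):                      # root of u's tree, with path compression
            while parent[u] != u:
                parent[u] = parent[parent[u]]
                u = parent[u]
            return u

        tree = set()
        # sort by cost = -score, i.e. highest-scoring edges first
        for score, u, v in sorted(scored_edges, key=lambda e: -e[0]):
            ru, rv = find(u), find(v)
            if ru != rv:                  # edge connects two different trees
                parent[ru] = rv
                tree.add((u, v))
        return tree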

15

Graph-based Dependency Parsing

Decompose the graph score into arc scores:

Ĝ_x = arg max_{G_x ∈ 𝒢_x} score(G_x, x)

    = arg max_{G_x ∈ 𝒢_x} w · Φ(G_x, x)   (linear model)

    = arg max_{G_x ∈ 𝒢_x} Σ_{(i, j, l) ∈ A_x} w · φ((i, j, l), x)   (arc-factored)

Can learn the weights with a Conditional Random Field!
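As a rough sketch of arc-factored scoring under the linear model above (assuming the feature function phi returns a dense vector of the same dimension as w; names are illustrative):

    import numpy as np

    def arc_score(w, phi, arc, sentence):
        """w · φ((i, j, l), x): score of a single arc under a linear model."""
        return float(np.dot(w, phi(arc, sentence)))

    def graph_score(w, phi, arcs, sentence):
        """Arc-factored graph score: the sum of the individual arc scores."""
        return sum(arc_score(w, phi, arc, sentence) for arc in arcs)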

16

Feature Representation

What should φ((head, dependent, label), x) be?

unigram: head=had, head=VERB

bigram: head=had & dependent=effect

head=VERB & dependent=NOUN & between=ADJ

head=had & label=dobj & other-label=nsubj? NO!!! Breaks the arc-factored scoring, since it refers to another arc's label.
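For illustration, an arc-factored feature function could produce sparse features like the ones above; a hedged sketch (the feature names and the sparse-dictionary representation are assumptions, not from the slides):

    def phi(arc, words, pos_tags):
        """Sparse unigram/bigram features for one arc (head index, dependent index, label)."""
        i, j, label = arc
        return {
            f"head={words[i]}": 1.0,                                   # unigram (word)
            f"head_pos={pos_tags[i]}": 1.0,                            # unigram (PoS)
            f"head={words[i]}&dep={words[j]}": 1.0,                    # bigram
            f"head_pos={pos_tags[i]}&dep_pos={pos_tags[j]}&label={label}": 1.0,
        }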


17

More Global models

Even though inference and learning are global, features are localised to arcs.

Can we have more global features? Yes we can! Consider subgraphs spanning a few edges. But inference becomes harder, requiring more complex dynamic programs and clever approximations.

Is it worth it? Syntactic parsing has many applications, thus better compromises between speed and accuracy are always welcome!


18

Transition-based Dependency Parsing

Graph-based dependency parsing restricts the features to perform joint inference efficiently.

Transition-based dependency parsing trades joint inference for feature flexibility.

No more argmax over graphs, just use a classifier with any features we want!


19

Joint vs incremental prediction

Joint: score (and enumerate) complete outputs (graphs)

20

Joint vs incremental prediction

Incremental: predict a sequence of actions (transitions) constructing the output

21

Transition system

The actions A the classifier f can predict, and their effect on the state which tracks the prediction: S_{t+1} = S_1(α_1 ... α_t)

What should the actions (transitions) be for dependency parsing?


22

Transition system setup

Input: vertices V_x = {0, 1, ..., N} (the words of sentence x)

State S = (Stack, Buf, A):

Arcs A (dependencies predicted so far)
Buffer Buf (words left to process)
Stack Stack (last-in, first-out memory)

Initial state: S_0 = ([], [0, 1, ..., N], {})
Final state: S_final = (Stack, [], A)

23

Transition system

Shift (Stack , i |Buf ,A)→ (Stack|i ,Buf ,A): push next word fromthe buffer (i) to stack

Reduce (Stack|i ,Buf ,A)→ (Stack,Buf ,A): pop word top of thestack (i) if it has a head

Right-Arc(label) (Stack|i , j |Buf ,A)→(Stack |i |j ,Buf ,A ∪ {(i , j , l)}): create edge (i , j , label) between topof the stack (i) and next in buffer (j), push j

Left-Arc(label) (Stack|i , j |Buf ,A)→ (Stack, j |Buf ,A ∪ {(j , i , l)}):create edge (j , i , label) and pop i , if i has no head
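A minimal Python sketch of these four transitions on a (stack, buffer, arcs) state, following the definitions above; the state representation (lists of word indices plus a set of (head, dependent, label) arcs) is an assumption for illustration:

    def shift(stack, buf, arcs):
        return stack + [buf[0]], buf[1:], arcs

    def reduce_(stack, buf, arcs):
        assert any(dep == stack[-1] for (_, dep, _) in arcs)   # top of stack must have a head
        return stack[:-1], buf, arcs

    def right_arc(stack, buf, arcs, label):
        i, j = stack[-1], buf[0]
        return stack + [j], buf[1:], arcs | {(i, j, label)}    # add i -> j, push j

    def left_arc(stack, buf, arcs, label):
        i, j = stack[-1], buf[0]
        assert not any(dep == i for (_, dep, _) in arcs)       # i must not already have a head
        return stack[:-1], buf, arcs | {(j, i, label)}         # add j -> i, pop i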


24

Example

Stack = []
Buffer = [ROOT, Economic, news, had, little, effect, on, financial, markets, .]

Action? Shift


25

Example

Stack = [ROOT]
Buffer = [Economic, news, had, little, effect, on, financial, markets, .]

Action? Shift


26

Example

Stack = [ROOT, Economic]
Buffer = [news, had, little, effect, on, financial, markets, .]

Action? Left-Arc(amod)


27

Example

Stack = [ROOT]
Buffer = [news, had, little, effect, on, financial, markets, .]

Action? Shift


28

Example

Stack = [ROOT, news]
Buffer = [had, little, effect, on, financial, markets, .]

Action? Left-Arc(nsubj)


29

Example

Stack = [ROOT]
Buffer = [had, little, effect, on, financial, markets, .]

Action? Right-Arc(root)


30

Example

Stack = [ROOT, had]
Buffer = [little, effect, on, financial, markets, .]

Action? Shift


31

Example

Stack = [ROOT, had, little]
Buffer = [effect, on, financial, markets, .]

Action? Left-Arc(amod)


32

Example

Stack = [ROOT, had]
Buffer = [effect, on, financial, markets, .]

Action? Right-Arc(dobj)


33

Example

Stack = [ROOT, had, effect]
Buffer = [on, financial, markets, .]

Action? let's fast-forward...


34

Example

Stack = [ROOT, had, .]
Buffer = []

Empty buffer. DONE!


35

Other transition systems?

This was the arc-eager system. Others:

arc-standard (3 actions)
easy-first (not left-to-right), etc.

All operate with actions combining:

moving words from the buffer to the stack and back (shift/un-shift)
popping words from the stack (reduce)
creating labeled arcs left and right

Intuition: Define actions that are easy to learn


36

Transition-based Dependency Parsing

Input: sentence x

state S_1 = initialize(x); timestep t = 1
while S_t not final do:
    action α_t = arg max_{α ∈ A} f(α, S_t)
    S_{t+1} = S_t(α_t); t = t + 1

What is f? A multiclass classifier.

What do we need to learn it?

learning algorithm (e.g. logistic regression)
labelled training data
feature representation
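A hedged Python sketch of this greedy loop, assuming a trained multiclass classifier and helper functions for initialisation, legal actions, feature extraction and applying an action (all of these names are illustrative, not from the slides):

    def greedy_parse(sentence, classifier, initialize, legal_actions, apply_action, extract_features):
        state = initialize(sentence)                     # S_1
        while not state.is_final():
            feats = extract_features(state, sentence)
            # pick the highest-scoring legal action: alpha_t = arg max_a f(a, S_t)
            action = max(legal_actions(state), key=lambda a: classifier.score(a, feats))
            state = apply_action(state, action)          # S_{t+1} = S_t(alpha_t)
        return state.arcs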


37

What are the right actions?

We only have sentences labelled with graphs:
D_train = {(x^1, G_x^1), ..., (x^M, G_x^M)}

Ask an oracle to tell us the actions constructing the graph!

In our case, a set of rules comparing the current state S = (Stack, Buffer, ArcsPredicted) with G_x and returning the correct action as the label
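One common way to write such rules for the arc-eager system is a static oracle; the sketch below is an assumption about what those rules could look like, not taken directly from the slides:

    def oracle_action(stack, buf, gold_arcs, predicted_arcs):
        """Static arc-eager oracle: compare the current state with the gold arcs."""
        if stack and buf:
            s, b = stack[-1], buf[0]
            gold = {(h, d): l for (h, d, l) in gold_arcs}
            if (b, s) in gold:
                return ("LEFT-ARC", gold[(b, s)])         # gold head of s is b
            if (s, b) in gold:
                return ("RIGHT-ARC", gold[(s, b)])        # gold head of b is s
            s_has_head = any(d == s for (_, d, _) in predicted_arcs)
            # s is finished if no gold arc links it to a word still in the buffer
            s_done = not any((h == s and d in buf) or (d == s and h in buf)
                             for (h, d, _) in gold_arcs)
            if s_has_head and s_done:
                return ("REDUCE", None)
        return ("SHIFT", None)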


38

Learning from an oracle

Given a labelled sentence and a transition system, an oracle returnsstates labelled with the correct actions.

D_train = {(x^1, G_x^1), ..., (x^M, G_x^M)},  x^m = [x_1, ..., x_N]

graph G_x = (V_x, A_x)
vertices V_x = {0, 1, ..., N}
edges A_x = {(i, j, k) | i, j ∈ V_x, k ∈ L (labels)}

states S^m = [S_1, ..., S_T]
actions α^m = [α_1, ..., α_T]

39

Feature Representation

Stack = [ROOT, had, effect]
Buffer = [on, financial, markets, .]

What features would help us predict the correct action Right-Arc(prep)?

40

Feature Representation

Words/PoS in stack and buffer: wordS1=effect, wordB1=on, wordS2=had, posS1=NOUN, etc.

Dependencies so far: depS1=dobj, depLeftChildS1=amod, depRightChildS1=NULL, etc.

Previous actions: α_{t−1} = Right-Arc(dobj), etc.
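A small sketch of extracting features like these from the current state (the feature names mirror the examples above; the state fields and function name are illustrative assumptions):

    def transition_features(stack, buf, arcs, words, pos, prev_action):
        feats = {}
        if stack:
            s1 = stack[-1]
            feats[f"wordS1={words[s1]}"] = 1.0
            feats[f"posS1={pos[s1]}"] = 1.0
            for (h, d, l) in arcs:            # dependency label of S1, if it has a head
                if d == s1:
                    feats[f"depS1={l}"] = 1.0
        if len(stack) > 1:
            feats[f"wordS2={words[stack[-2]]}"] = 1.0
        if buf:
            feats[f"wordB1={words[buf[0]]}"] = 1.0
        feats[f"prevAction={prev_action}"] = 1.0
        return feats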


41

Transition-based vs Graph-based parsing

Transition-based tends to be better on shorter sentences, graph-based on longer ones

Graph-based tends to be better on long-range dependencies

Graph-based lacks the rich structural features

Transition-based is greedy and suffers from early mistakes

Actually, can we ameliorate the greedy issue? Use Beam Search!
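As a rough sketch of how beam search relaxes the greedy choice, keep the k best partial action sequences by cumulative classifier score instead of committing to a single action per step (all names here are illustrative assumptions, following the greedy sketch earlier):

    def beam_parse(sentence, classifier, initialize, legal_actions, apply_action, k=8):
        beam = [(0.0, initialize(sentence))]             # (cumulative score, state)
        while not all(state.is_final() for _, state in beam):
            candidates = []
            for score, state in beam:
                if state.is_final():
                    candidates.append((score, state))
                    continue
                for action in legal_actions(state):
                    candidates.append((score + classifier.score(action, state),
                                       apply_action(state, action)))
            beam = sorted(candidates, key=lambda c: -c[0])[:k]   # keep the k best
        return max(beam, key=lambda c: c[0])[1].arcs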


42

Non-Projectivity

Arcs cross each other, typically due to:

long-range dependencies

free word order

43

Non-projective Transition-based parsing

The standard stack-based systems cannot do it.

But there are extensions:

swap actions: word reordering
k-planar parsing: use multiple stacks (usually 2)

Standard graph-based parsing handles non-projectivity.

44

Incremental Language Processing

Other problems solved with similar approaches (a.k.a. transition-based, greedy):

semantic parsing (converting a natural language utterance to a logical form)
coreference resolution

Whenever you have a problem with a very large space of outputs, this approach is worth considering

45

Evaluation

Head-finding word accuracy:
unlabelled (UAS): % of words with the right head
labelled (LAS): % of words with the right head and label

Sentence accuracy: % of sentences with the correct graph
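A small sketch of computing the word-accuracy scores, assuming predicted and gold parses are given as head and label lists indexed by word (this representation is an assumption for illustration):

    def attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels):
        """Unlabelled and labelled head-finding accuracy over a set of words."""
        n = len(gold_heads)
        correct_head = sum(g == p for g, p in zip(gold_heads, pred_heads))
        correct_both = sum(gh == ph and gl == pl
                           for gh, gl, ph, pl in zip(gold_heads, gold_labels,
                                                     pred_heads, pred_labels))
        return correct_head / n, correct_both / n      # (unlabelled, labelled)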

46

Bibliography

Chapter 11 from Eisenstein

Nivre and McDonald’s tutorial slides

Nivre’s article on deterministic transition-based dependencyparsing

Nivre and McDonald’s paper comparing their approaches

47

Coming up next week...

Feed-forward Neural Networks

Getting ready for Assignment 2!