Unsupervised Syntactic Category Induction using Multi-level Linguistic Features


Christos Christodoulopoulos
Pre-viva talk

2

What did we ever do for the linguists?

• Computational linguistic (or NLP) models
  – Mostly supervised (until recently)
  – Mostly on English (English ≈ WSJ sections 02-21)
  – Mostly on the “fat head” of Zipf’s law

3

Revolutionaries (without supervision)

• NLP is good at spotting patterns
  – Unsupervised learning
  – Machine Learning

• Not great at looking at the whole picture

4

The “whole” picture

Parts of Speech, Syntax, Alignments, Morphology

Traditional NLP Pipeline

5

The “whole” picture

[Diagram: four levels (Parts of Speech, Syntax, Alignments, Morphology), annotated with prior work:]
Geertzen & van Zaanen (2004)
Klein & Manning (2004)
Virpioja et al. (2007)
Snyder et al. (2009)
Naseem et al. (2009)
Sirts & Alumäe (2012)
Clark (2003)

6

My thesis

• Patterns that correspond to syntactic categories or parts of speech (PoS)
  – Motivated by linguistic theories
• Holistic view of NLP
  – Instead of the pipeline approach
  – Computationally efficient
• Cross-lingual analysis
  – Might provide linguistic insights

7

My thesis

[Figure: cross-lingual clusters for English and Greek, with cluster IDs from 0 to 35. IDs 10, 14 and 35 are set apart in one row; IDs 3, 8, 27, 29 and 32 in the other.]

English clusters:
– he, it, there, she, whosoever, soon, others, Satan, whoso, Elias
– man, day, time, city, place, thing, priest, woman, wicked, spirit
– do, make, give, take, know, bring, eat, see, hear, keep

Greek clusters (English glosses in parentheses):
– τουτο, τι, εαν, τις, ιδου, ομως, οτε, παντες, συ, αυτος (this, what, if, whoever, there, but, when, everyone, you, him)
– Κυριος, Θεος, βασιλευς, υιος, Ιησους, λαος, ανθρωπος, Μωυσης, λογος, ιερευς (Lord, God, king, son, Jesus, people, man, Moses, word, priest)
– αυτο, εκει, πλεον, ουχι, εκαστος, τουτον, ταυτην, ουδεν, ετη, πυρ (this, there, since, not, whosoever, him, her, none, fire)
– εισθαι, καμει, δωσει, φερει, λαβει, ελθει, ειπει, γνωρισει, γεινει, προσφερει (is, does, gives, brings, comes, takes, says, knows, becomes, offers)
– ηναι, δυναται, γεινη, πρεπει, ελθη, καμη, δωση, καμω, ιδη, λαβη (be, can, become, must, come, do, give, make, see, give)

8

What do we need?

• Theory of syntactic categories
  – What are we looking for?

• Clustering method
  – Review of existing methods
  – Multiple sources of information

• Alignment method based on PoS





10

Theory of syntactic categories

• Not everyone agrees on what they are
• Syntactic categories/PoS/Word classes?
• Most agree that they capture more than one level of language structure

– Plato, Aristotle: semantic (noun, verb); morphological (conjunction)
– Dionysius Thrax: 8 parts of speech; semantic, syntactic & morphological
– Lindley Murray: ‘School account’ of 9 parts of speech; semantic & syntactic
– Ray Jackendoff: feature-based (e.g. ±Subject); purely syntactic
– Susan Schmerling: formal semantics, <e,t>: nouns, adjectives, intr. verbs
– Notional (pragmatic) definition
– Paul Schachter: morphological, syntactic & distributional

Influence on my work:
• Not easy to focus on any particular theory
• Multiple sources of information are key





13

Evaluation

• How will we know whether we have found good clusters?

• Intrinsic
  – Test on existing PoS tagged data (gold-standard)

• Extrinsic
  – Use clusters as input to another task

• Both have issues when used with unsupervised methods

14

Intrinsic evaluation

• Clusters might not correspond to PoS
  – Cluster IDs instead of labels
  – Different sizes
• Gold-standard follows specific linguistic theories
• Gold-standard might not help downstream
  – Annotations are tuned to specific tasks

15

The (intrinsic) elephant in the room

• Cyclical problem
  – Trying to discover clusters that don’t (necessarily) correspond to gold-standard annotations
  – Evaluate them on gold-standard annotations
• (Compromising) Solution: Test on multiple languages, using multiple systems
  – There is going to be some overlap

16

Extrinsic evaluation

• “Passing the buck” to the next task
  – If unsupervised and intrinsically evaluated
• Performances might not be correlated
  – Intrinsic gains on task #1 do not correlate with gains on task #2 (Headden et al., 2008)
• Depends on the degree of integration
  – How much of task #2’s input is the output of #1

17

Evaluation

• Intrinsic evaluation metrics
  – Mapping
    • Many-to-one (m-1), one-to-one, cross-validation
    • Widely used
    • Sensitive to size of induced tagset
  – Information-theoretic
    • Variation of information (vi), V-measure (vm)
    • Less sensitive
    • Less intuitive (especially vi)
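To make the two metric families concrete, here is a minimal sketch of many-to-one accuracy, variation of information and V-measure over token-level gold tags and induced cluster IDs. The function names are illustrative and not from the talk; the formulas are the standard textbook definitions, with natural-log entropies.

```python
import math
from collections import Counter


def many_to_one(gold, induced):
    """Many-to-one accuracy: map each cluster to its most frequent gold tag."""
    pair_counts = Counter(zip(induced, gold))
    best = Counter()
    for (cluster, _tag), n in pair_counts.items():
        best[cluster] = max(best[cluster], n)
    return sum(best.values()) / len(gold)


def _entropy(counts, total):
    return -sum(n / total * math.log(n / total) for n in counts.values())


def vi_and_vmeasure(gold, induced):
    """Return (variation of information, V-measure) for two labelings."""
    n = len(gold)
    h_gold = _entropy(Counter(gold), n)
    h_ind = _entropy(Counter(induced), n)
    h_joint = _entropy(Counter(zip(gold, induced)), n)
    h_gold_given_ind = h_joint - h_ind   # H(gold | clusters)
    h_ind_given_gold = h_joint - h_gold  # H(clusters | gold)
    vi = h_gold_given_ind + h_ind_given_gold
    homogeneity = 1.0 if h_gold == 0 else 1.0 - h_gold_given_ind / h_gold
    completeness = 1.0 if h_ind == 0 else 1.0 - h_ind_given_gold / h_ind
    if homogeneity + completeness == 0:
        return vi, 0.0
    vm = 2 * homogeneity * completeness / (homogeneity + completeness)
    return vi, vm
```

The sensitivity of m-1 to tagset size shows up in the extreme case: giving every token its own cluster yields a perfect m-1 score, while V-measure falls below 1 because completeness is penalised.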





19

What do we need?




Multiple systems examined
Average performance on 8 languages (2010 review)
Average performance on 22 languages (2011 review)

tl;dr: Christodoulopoulos et al., 2010; 2011

Most successful properties:
• Use of morphology features
• One cluster per word type





21

Bayesian Multinomial Mixture Model (BMMM)

• Three key properties:
  – One tag per word
    • Helpful for unsupervised systems
  – Mixture Model (instead of HMM)
    • Easier to handle non-local features
  – Easy to add multiple features
    • e.g. morphology, alignments

SPOILER ALERT!!! I’ll be adding dependencies

22

BMMM: Basic model structure

[Plate diagram: nodes z, θ, α, f, φ and β; plates M, n_j and Z]

For each word type i, choose a class z_i (conditioned on θ)
For each word token j, choose a feature f_ij (conditioned on φ_i)
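The generative story on this slide can be sketched as a small sampler. This is an illustration under stated assumptions, not the thesis implementation: θ and φ are drawn from symmetric Dirichlet priors with parameters α and β, “conditioned on φ_i” is read as the feature distribution of the class chosen for the word type, and all function and parameter names are made up for the example.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet via normalised Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_bmmm(tokens_per_type, num_classes, num_features,
                alpha=1.0, beta=1.0, seed=0):
    """Sample one class per word type, then one feature per token of that type."""
    rng = random.Random(seed)
    theta = sample_dirichlet(rng, alpha, num_classes)   # class proportions
    phi = [sample_dirichlet(rng, beta, num_features)    # one feature dist. per class
           for _ in range(num_classes)]
    classes, features = [], []
    for n_tokens in tokens_per_type:
        z_i = rng.choices(range(num_classes), weights=theta)[0]
        classes.append(z_i)
        features.append(
            rng.choices(range(num_features), weights=phi[z_i], k=n_tokens))
    return classes, features
```

Because the class is drawn once per word type, every token of a type shares the same tag, which is the “one tag per word” property.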

23

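BMMM: Extended model

[Plate diagram: the basic structure (α, θ, z; plates M, n_j, Z) extended with several feature sets, each with its own φ and β:]
– Type-level features: m, with φ(m) and β(m)
– Token-level features: f(1) … f(T), with φ(1) … φ(T) and β(1) … β(T)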
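One way to see why extra feature sets are easy to add to a mixture model: with Dirichlet priors integrated out, each feature set contributes an independent Dirichlet-multinomial term to the score of placing a word type in a class. The sketch below uses that standard collapsed form; it is not taken from the slides, treats a type-level feature as a count of one, and uses hypothetical names throughout (`class_posterior`, `vocab_sizes`, and so on).

```python
import math
from math import lgamma


def log_dm_predictive(new_counts, class_counts, beta, vocab_size):
    """Log-probability of adding new_counts to a class that already holds
    class_counts, under a symmetric Dirichlet(beta)-multinomial."""
    n_old = sum(class_counts.values())
    n_new = sum(new_counts.values())
    score = (lgamma(n_old + vocab_size * beta)
             - lgamma(n_old + n_new + vocab_size * beta))
    for feature, count in new_counts.items():
        old = class_counts.get(feature, 0)
        score += lgamma(old + count + beta) - lgamma(old + beta)
    return score


def class_posterior(type_feats, class_feats, class_sizes, alpha, beta, vocab_sizes):
    """Posterior over classes for one word type, summing one term per feature set."""
    log_scores = []
    for z, size in enumerate(class_sizes):
        score = math.log(size + alpha)  # prior: word types already in class z
        for kind, counts in type_feats.items():
            score += log_dm_predictive(
                counts, class_feats[z].get(kind, {}), beta, vocab_sizes[kind])
        log_scores.append(score)
    top = max(log_scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]
```

Adding a new information source (alignments, dependencies) then amounts to adding one more key to `type_feats` and `class_feats`; the scoring loop itself does not change.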

24

Development results (averaged over 8 languages)

[Bar chart: m-1 score and vm score on separate y-axes; system labels as extracted: base +morph avg. Al. Best Al. +morph; * markers appear on the chart]

25

Final Results (using +morph system)

[Bar chart: vm score of hcd, clark and bmmm on four groups: average multext, average misc, average conll, wsj]





27

What do we need?



• Alignment method based on PoS
– Interdependence of linguistic structure

28

Putting the syntax in syntactic categories

• Induced dependencies as BMMM features
• DMV (Klein & Manning, 2004)
  – Basis for most dependency parsers
  – Uses parts of speech as terminal nodes
• Proxy for a joint model
  – Induce PoS that help DMV induce dependencies…
  – …that help induce better PoS…
  – …repeat…

For example: Cohen & Smith (2009); Headden et al. (2009); Gillenwater et al. (2010); Blunsom & Cohn (2011); Spitkovsky et al. (2010a,b, 2011a,b,c)
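The “induce PoS, induce dependencies, repeat” loop can be sketched as a plain alternation between two components. `induce_pos` and `induce_deps` are hypothetical callables standing in for BMMM and DMV; the toy stand-ins exist only to make the sketch runnable and do no real induction.

```python
def iterated_learning(corpus, induce_pos, induce_deps, iterations=5):
    """Alternate between a PoS inducer and a dependency inducer.

    induce_pos(corpus, deps) -> tags   (deps is None on the first pass)
    induce_deps(corpus, tags) -> deps
    """
    deps = None
    history = []
    for _ in range(iterations):
        tags = induce_pos(corpus, deps)   # PoS may use dependency features
        deps = induce_deps(corpus, tags)  # parser uses tags as terminals
        history.append((tags, deps))
    return history


def toy_pos(corpus, deps):
    # Placeholder tagger: tag each word by its last letter; ignores deps.
    return [[word[-1] for word in sent] for sent in corpus]


def toy_deps(corpus, tags):
    # Placeholder parser: each word's head is the previous word (-1 = root).
    return [[i - 1 for i in range(len(sent))] for sent in corpus]
```

Keeping the per-iteration history makes it possible to score both components after every pass.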

29

The Iterated Learning (IL) Model

[Diagram: BMMM clusters word types (counting, is, simply, word, this, can, …) using MORPHOLOGY features (ly, ing, ed, ingly, ity, …) and L+R CONTEXT features (is, word, context, the, a, …). Its output is a tagged corpus:]

This/32 is/12 a/32 tagged/1 sentence/28 ./0
This/32 is/12 another/32 tagged/1 sentence/28 ./0
…

[The tagged corpus is passed to DMV. The induced DEPENDENCIES are then given to BMMM as a further feature set, alongside MORPHOLOGY and L+R CONTEXT.]

Experiments on WSJ and CoNLL (≤10 words)
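As a rough illustration of the feature tables in the diagram, the sketch below counts left and right context words per word type, assigns a suffix feature, and reads the `word/cluster` output format. The suffix list is copied from the slide; matching suffixes by string ending is a stand-in chosen for the example, since the slides do not say how the morphology features are obtained.

```python
from collections import Counter, defaultdict

# Suffixes shown on the slide, longest first so "ingly" wins over "ly".
SUFFIXES = ("ingly", "ing", "ity", "ly", "ed")


def context_features(sentences):
    """Count left (L=) and right (R=) neighbours for every word type."""
    feats = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + list(sent) + ["</s>"]
        for i in range(1, len(padded) - 1):
            feats[padded[i]]["L=" + padded[i - 1]] += 1
            feats[padded[i]]["R=" + padded[i + 1]] += 1
    return feats


def suffix_feature(word):
    """Return the first listed suffix the word ends with, or None."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return suffix
    return None


def read_tagged(line):
    """Parse 'This/32 is/12 ...' into (word, cluster_id) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in line.split()]
```

Splitting on the last `/` keeps tokens that contain a slash intact.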

30

IL – Dependencies: WSJ10 Results

[Line chart: BMMM M-1 and DMV Undir over iterations 0–10]

31

IL – Dependencies: Average over 9 CoNLL languages (≤10 words)

[Line chart: BMMM M-1, DMV Undir and Gold M-1 over iterations 0–5; a *** marker appears on the chart]

32

IL – Dependencies: Shortcomings

• DMV is not state of the art
  – Best systems surpass it by more than 15% accuracy (for WSJ)
• Trained/tested only on ≤10-word sentences
  – Hard to compare the PoS inducer
  – Not realistic

Replace DMV component with a state-of-the-art system (TSG-DMV)

Use full-length-sentence corpora for training and testing

Results not shown here. tl;dr: Slightly better results on PoS, much worse on deps. Interesting for further discussion.

33

Using full-length sentences: Average over 9 CoNLL languages

[Line chart: BMMM M-1 and DMV Undir over iterations 0–5; *** and * markers appear on the chart]

34

Recap

• Both BMMM and DMV improve
  – Mostly in the first few iterations
• Using full-length sentences:
  – Increase in BMMM above system with gold deps
  – DMV close to performance with gold PoS (but lower than ≤10-word case)

35

What do we need?


• Clustering method
  – Review of existing methods
  – Multiple sources of information
  – Interdependence of linguistic structure






37

Further IL experiments

• Giza++ (Och & Ney, 2000)
  – Extension of the IBM 1-4 models (Brown et al. 1993)
  – Uses ‘word classes’ to condition alignment probabilities
    • Can be replaced with BMMM
• Hansards English-French corpus
  – Manually annotated alignments for 500 sentences
• MULTEXT-East corpus
  – The novel ‘1984’ in 8 languages (incl. English)

38

IL – Alignments: Hansards corpus

[Line chart: M-1 and AER on separate y-axes over iterations 0–20; series: 0.5k M-1, 1k M-1, 0.5k AER, 1k AER]
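For reference, alignment error rate (AER) in its usual formulation compares predicted links A against sure links S and possible links P (with S ⊆ P): AER = 1 − (|A∩S| + |A∩P|) / (|A| + |S|). The sketch below assumes links are (source index, target index) pairs and that the manual annotation distinguishes sure from possible links; the function name is illustrative.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|); lower is better."""
    predicted = set(predicted)
    sure = set(sure)
    # Every sure link also counts as possible.
    possible = sure | set(possible or ())
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        return 0.0
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator
```

If only one kind of gold link is available, passing it as `sure` and leaving `possible` empty reduces AER to one minus the balanced F-score of the links.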

39

IL – Alignments: MULTEXT-East corpus

[Line chart: m-1 and vm over iterations 0–5]

40

Recap

• Iterated learning between PoS and X
  – X = {Dependencies, Alignments, Morphology}
  – Effective proxy for joint inference
• PoS induction helped by all other levels
  – A test for theories of PoS
  – A joint model of NLP

41

A joint model of NLP

Parts of Speech

Syntax Alignments

Morphology

42

Induction chains

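[Diagram: tree of induction chains. Starting from BMMM, each step runs one other component (Deps, Morph or Aligns) and then BMMM again:]
BMMM → Deps → BMMM → {Deps | Morph | Aligns} → BMMM → …
BMMM → Morph → BMMM → {Deps | Morph | Aligns} → BMMM → …
BMMM → Aligns → BMMM → {Deps | Morph | Aligns} → BMMM → …

[Chains that repeat a single component:]
BMMM → Morph → BMMM → Morph → BMMM → …
BMMM → Deps → BMMM → Deps → BMMM → …
BMMM → Aligns → BMMM → Aligns → BMMM → …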
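The set of three-step chains can be enumerated mechanically as the orderings of the three components. In the sketch below, `induce_pos` and `extractors` are hypothetical callables, and accumulating features from one step to the next is an assumption of this example, not something the slides state.

```python
from itertools import permutations

SOURCES = ("aligns", "deps", "morph")


def induction_chains(sources=SOURCES):
    """All orderings of the feature sources, named like 'deps-morph-aligns'."""
    return ["-".join(order) for order in permutations(sources)]


def run_chain(corpus, order, induce_pos, extractors):
    """Run BMMM, then for each source in order: extract features, re-run BMMM.

    induce_pos(corpus, features) -> tags
    extractors[source](corpus, tags) -> features for that source
    """
    features = []
    tags = induce_pos(corpus, features)  # baseline run without extra sources
    for source in order:
        features.append(extractors[source](corpus, tags))
        tags = induce_pos(corpus, features)
    return tags
```

With three sources this gives six chains, each of which can be compared against the baseline run.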

43

Induction chains

[Bar chart: m-1 score and vm score on separate y-axes, for en and bg (series: en m-1, bg m-1, en vm, bg vm), comparing the baseline with six chains: aligns-deps-morph, aligns-morph-deps, deps-aligns-morph, deps-morph-aligns, morph-aligns-deps, morph-deps-aligns; markers (*, **, ***, ****) appear on several bars]

44

What do we need?




And one more thing:

• Massively parallel corpus
• Bible translations
  – Collected from online versions of the Bible
  – Cleaned-up and verse-aligned (CES level 1 XML)
  – 100 languages

45

Cross-lingual clusters

[Figure: cross-lingual clusters for English and Greek, with cluster IDs from 0 to 35. IDs 10, 14 and 35 are set apart in one row; IDs 3, 8, 27, 29 and 32 in the other.]

English clusters:
– he, it, there, she, whosoever, soon, others, Satan, whoso, Elias
– man, day, time, city, place, thing, priest, woman, wicked, spirit
– do, make, give, take, know, bring, eat, see, hear, keep

Greek clusters (English glosses in parentheses):
– τουτο, τι, εαν, τις, ιδου, ομως, οτε, παντες, συ, αυτος (this, what, if, whoever, there, but, when, everyone, you, him)
– Κυριος, Θεος, βασιλευς, υιος, Ιησους, λαος, ανθρωπος, Μωυσης, λογος, ιερευς (Lord, God, king, son, Jesus, people, man, Moses, word, priest)
– αυτο, εκει, πλεον, ουχι, εκαστος, τουτον, ταυτην, ουδεν, ετη, πυρ (this, there, since, not, whosoever, him, her, none, fire)
– εισθαι, καμει, δωσει, φερει, λαβει, ελθει, ειπει, γνωρισει, γεινει, προσφερει (is, does, gives, brings, comes, takes, says, knows, becomes, offers)
– ηναι, δυναται, γεινη, πρεπει, ελθη, καμη, δωση, καμω, ιδη, λαβη ([SBJV] be, can, become, must, come, do, give, make, see, give)

Subjunctive mood:
  ο δε Κυριος ας καμη το αρεστον
  the and Lord let (he) do the pleasing
  ‘and the Lord do [that which seemeth him] good’

3rd person present tense:
  θελει να καμει
  (she) wants to (she) make
  ‘she wants to make’

tl;dr: My thesis

• Unsupervised syntactic category induction
  – Theory of syntactic categories
  – Review of systems/evaluation metrics
• Iterated learning & Induction chains
  – Holistic view of NLP (no more pipelines!)
• Cross-lingual clusters
  – Tool for linguistic enquiry
  – Reveal similarities/differences across languages

46

Where can we go from here?

• Fully joint models
  – Preliminary attempts for PoS & Dependencies
• Evaluation methods
  – Non-gold-standard based (Smith, 2012)
• “Syntactically aware” categories
  – CCG type induction (Bisk & Hockenmaier, 2012)
• Linguistic analysis
  – Invite the Romantics back!

THE END

49

A fully joint model

• Maximise jointly the distributions over PoS and dependency trees
  – Run a full training step of DMV every time BMMM samples a new PoS sequence
  – Intractable
• Solution:
  – Train DMV on partial trees (up to a depth d)
• Comparable results with best IL models (also, still quite slow)

50

TSG-DMV
Blunsom & Cohn (2010)

• Tree Substitution Grammar
  – CFG subset of LTAG
  – Lexicalised
• Eisner’s (2000) split-head constructions
  – Allows for modelling longer-range dependencies
• Pitman-Yor process (Teh, 2006) over TSG trees

51

IL – TSG-DMV: Average over 9 CoNLL languages (≤10 words)

[Line chart: BMMM M-1, TSG-DMV Undir, Gold M-1 and Gold Undir over iterations 0–5]

52

Using full-length sentences – TSG-DMV: Average over 9 CoNLL languages

[Line chart: BMMM M-1, TSG-DMV Undir, Gold M-1 and Gold Undir over iterations 0–5]

53

What are the gold-standard PoS capturing?

What purpose does PoS annotation serve?

What ARE parts of speech?

Why do we need PoS?

Why am I doing this?

Why are we here?

What is the meaning of life, the universe and everything?

What is the meaning of 42?
