Baselines, Evaluations, and Analysis - Spence Green

Better Arabic Parsing: Baselines, Evaluations, and Analysis. Spence Green and Christopher D. Manning, Stanford University, August 27, 2010.

Transcript of Baselines, Evaluations, and Analysis - Spence Green

Page 1: Baselines, Evaluations, and Analysis - Spence Green

Better Arabic Parsing: Baselines, Evaluations, and Analysis

Spence Green and Christopher D. Manning

Stanford University

August 27, 2010

Page 2: Baselines, Evaluations, and Analysis - Spence Green


Common Multilingual Parsing Questions...

Is language X "harder" to parse than language Y?
  - Morphologically-rich X

Is treebank X "better/worse" than treebank Y?

Does feature Z "help more" for language X than Y?
  - Lexicalization
  - Morphological annotations
  - Markovization
  - etc.

Page 7: Baselines, Evaluations, and Analysis - Spence Green


“Underperformance” Relative to English

Evalb F1, all sentence lengths (Petrov, 2009):

  English              90.1
  Chinese              83.7
  Bulgarian            81.6
  German               80.1
  French               77.9
  Italian              75.6
  Arabic               75.8
  Arabic (this paper)  81.1

Page 8: Baselines, Evaluations, and Analysis - Spence Green


Why Arabic / Penn Arabic Treebank (ATB)?

Annotation style similar to PTB

Relatively little segmentation (cf. Chinese)

Richer morphology (cf. English)

More syntactic ambiguity (unvocalized)


Page 9: Baselines, Evaluations, and Analysis - Spence Green


ATB Details: Parts 1-3 (not including part 3, v3.2)

Newswire only
  - Agence France Presse, Al-Hayat, Al-Nahar

Corpus/experimental characteristics
  - 23k trees
  - 740k tokens
  - Shortened "Bies" POS tags
  - Split: 2005 JHU workshop

Page 10: Baselines, Evaluations, and Analysis - Spence Green


Arabic Preliminaries

Diglossia: "Arabic" -> MSA (Modern Standard Arabic)

Typology: VSO; VOS, SVO, and VO orders are also possible

Devocalization
  [Arabic example: a fully vocalized word vs. its unvocalized written form]

Segmentation: an analyst's choice!
  - The ATB uses clitic segmentation (see the sketch below)
  [Arabic example: a word split into its clitics]
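As a rough illustration of what clitic segmentation does, here is a toy Python sketch. It is not MADA or the ATB tooling; the input is assumed to be Buckwalter transliteration, and the clitic inventories are a small invented subset chosen only for this example.

    # Toy illustration of clitic segmentation (NOT the ATB/MADA analyzer).
    PROCLITICS = ["w", "f", "b", "l", "k"]          # illustrative conjunction/preposition clitics
    ENCLITICS = ["hmA", "hA", "hm", "km", "h"]      # illustrative pronominal suffixes, longest first

    def segment(token):
        """Split off at most one proclitic and one enclitic; '+' marks the cut."""
        pieces = []
        for p in PROCLITICS:
            if token.startswith(p) and len(token) > len(p) + 2:
                pieces.append(p + "+")
                token = token[len(p):]
                break
        suffix = None
        for e in ENCLITICS:
            if token.endswith(e) and len(token) > len(e) + 2:
                suffix = "+" + e
                token = token[:-len(e)]
                break
        pieces.append(token)
        if suffix:
            pieces.append(suffix)
        return pieces

    print(segment("wktAbhm"))    # ['w+', 'ktAb', '+hm']  ("and their book")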

Page 16: Baselines, Evaluations, and Analysis - Spence Green


Syntactic Ambiguity in Arabic

This talk:
  - Devocalization
  - Discourse-level coordination ambiguity

Many other types:
  - Adjectives / adjective phrases
  - Process nominals (maSdar)
  - Attachment in annexation constructs (Gabbard and Kulick, 2008)

Page 17: Baselines, Evaluations, and Analysis - Spence Green


Devocalization: Inna and her Sisters

  Particle           POS   Head of
  inna  "indeed"     VBP   VP
  anna  "that"       IN    SBAR
  in    "if"         IN    SBAR
  an    "to"         IN    SBAR

Page 18: Baselines, Evaluations, and Analysis - Spence Green


Devocalization: Inna and her Sisters

Devocalized, all four particles are written identically (ان):

  Written form             POS   Head of
  ان   inna  "indeed"      VBP   VP
  ان   anna  "that"        IN    SBAR
  ان   in    "if"          IN    SBAR
  ان   an    "to"          IN    SBAR
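To make the conflation concrete, here is a small Python sketch (an illustration added to this transcript, not part of the original slides): it strips diacritics and, in this toy, also normalizes hamza-on-alif to bare alif, which maps all four particles to the same string.

    import unicodedata

    def devocalize(word):
        # Drop diacritics (Unicode combining marks), then normalize hamza-on-alif
        # to bare alif, as informally written unvocalized text often does.
        stripped = "".join(ch for ch in word if not unicodedata.combining(ch))
        return stripped.replace("\u0623", "\u0627").replace("\u0625", "\u0627")

    for word in ["إِنَّ", "أَنَّ", "إِنْ", "أَنْ"]:
        print(word, "->", devocalize(word))
    # every form prints as the same two-letter string  ان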

Page 19: Baselines, Evaluations, and Analysis - Spence Green


Devocalization: Inna and her Sisters

Reference (ATB):
  (VP (VBD اضافت "she added")
      (S ... (VBP ان "indeed") (NP (NN صدام "Saddam")) ...))

Stanford parser output:
  (VP (VBD اضافت "she added")
      (SBAR ... (IN ان) (NP (NN صدام "Saddam")) ...))

With the vowels gone, ان is mis-tagged as IN (i.e., as one of inna's sisters), so the clause is parsed as an SBAR instead of an S.

Page 20: Baselines, Evaluations, and Analysis - Spence Green


Discourse-level Coordination Ambiguity

[Tree: nested clausal coordination, with the conjunction و "and" appearing both between full sentences and inside the clause]

  - S < S (an S immediately dominating an S) in 27.0% of dev set trees
  - NP < CC (an NP immediately dominating a CC) in 38.7% of dev set trees

Page 21: Baselines, Evaluations, and Analysis - Spence Green


Discourse-level Coordination Ambiguity

Leaf Ancestor metric reference chains (Berkeley); score ∈ [0, 1]:

  Score   # Gold   Reference chain
  0.696   34       S < S < VP < NP < PRP
  0.756   170      S < VP < NP < CC
  0.768   31       S < S < VP < S < VP < PP < IN
  0.796   86       S < S < VP < SBAR < IN
  0.804   52       S < S < NP < NN
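For readers unfamiliar with the metric, below is a minimal Python sketch of leaf-ancestor-style scoring (Sampson and Babarczy). It assumes nltk for tree handling and uses a plain Levenshtein similarity over ancestor-label chains, so it only approximates the published metric.

    from nltk import Tree

    def leaf_chains(tree):
        """For each leaf, the chain of ancestor labels from the root down to its POS tag."""
        chains = []
        for pos in tree.treepositions('leaves'):
            chains.append([tree[pos[:i]].label() for i in range(len(pos))])
        return chains

    def edit_distance(a, b):
        """Plain Levenshtein distance over label sequences."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        return d[-1][-1]

    def leaf_ancestor_score(gold, guess):
        """Average per-leaf chain similarity in [0, 1]; assumes identical yields."""
        pairs = zip(leaf_chains(gold), leaf_chains(guess))
        sims = [1 - edit_distance(g, h) / max(len(g), len(h)) for g, h in pairs]
        return sum(sims) / len(sims)

    gold = Tree.fromstring("(S (NP (NN Saddam)) (VP (VBD spoke)))")
    guess = Tree.fromstring("(S (NP (NN Saddam)) (NP (VBD spoke)))")
    print(round(leaf_ancestor_score(gold, guess), 3))   # 0.833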

Page 22: Baselines, Evaluations, and Analysis - Spence Green


Treebank Comparison

Compared ATB gross corpus statistics to:
  - Chinese: CTB6
  - English: WSJ sect. 2-23
  - German: Negra

The ATB isn't that unusual!

Page 24: Baselines, Evaluations, and Analysis - Spence Green


Corpus Features in Favor of the ATB

                                 ATB     CTB6    Negra   WSJ
  Non-terminal / terminal ratio  1.04    1.18    0.46    0.82
  OOV rate                       16.8%   22.2%   30.5%   13.2%
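As a reference point for how such gross statistics can be computed, here is a hedged Python sketch. It assumes nltk, one bracketed tree per line, and placeholder file names; the paper's exact counting conventions (e.g., whether pre-terminals count as non-terminals) may differ, so the numbers it produces are only indicative.

    from nltk import Tree

    def read_trees(path):
        """One bracketed tree per line (placeholder format)."""
        with open(path, encoding="utf-8") as f:
            return [Tree.fromstring(line) for line in f if line.strip()]

    def nt_terminal_ratio(trees):
        terminals = sum(len(t.leaves()) for t in trees)
        # Count phrasal nodes only; pre-terminal POS nodes are excluded here.
        phrasal = sum(1 for t in trees for st in t.subtrees() if st.height() > 2)
        return phrasal / terminals

    def oov_rate(train_trees, dev_trees):
        vocab = {w for t in train_trees for w in t.leaves()}
        dev_tokens = [w for t in dev_trees for w in t.leaves()]
        return sum(w not in vocab for w in dev_tokens) / len(dev_tokens)

    train = read_trees("train.mrg")      # placeholder file names
    dev = read_trees("dev.mrg")
    print("NT/terminal ratio: %.2f" % nt_terminal_ratio(train))
    print("OOV rate: %.1f%%" % (100 * oov_rate(train, dev)))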

Page 26: Baselines, Evaluations, and Analysis - Spence Green


Sentence Length Negatively Affects Parsing

  Average sentence length:
    ATB    31.5
    CTB6   27.7
    Negra  17.2
    WSJ    23.8

An evaluation limit of 40 words is not sufficient!

Page 28: Baselines, Evaluations, and Analysis - Spence Green


Developing a Manually Annotated Grammar

Klein and Manning (2003)-style state splits
  - Human-interpretable
  - Features can inform treebank revision

Example: marking the recursive NP of the idafa (construct-state) configuration

  Before:  (NP (NN) (NP (NN) (NP (DTNN))))
  After:   (NP-idafa (NN) (NP-idafa (NN) (NP (DTNN))))

Alternative: automatic splits (Berkeley parser)
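Below is a small Python sketch of what one such manual split could look like. The trigger used here (relabel an NP whose second child is another NP) and the English placeholder leaves are assumptions for illustration; the actual rule in the Stanford grammar may be conditioned differently.

    from nltk import Tree

    def mark_idafa(tree):
        # Illustrative state split: relabel an NP whose second child is an NP.
        for st in tree.subtrees():
            if (st.label() == "NP" and len(st) >= 2
                    and isinstance(st[1], Tree) and st[1].label().startswith("NP")):
                st.set_label("NP-idafa")
        return tree

    t = Tree.fromstring("(NP (NN book) (NP (NN director) (NP (DTNN the-company))))")
    print(mark_idafa(t))
    # (NP-idafa (NN book) (NP-idafa (NN director) (NP (DTNN the-company))))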

Page 31: Baselines, Evaluations, and Analysis - Spence Green


Feature: markContainsVerb

  (S-hasVerb (NP-SBJ (NN) (NP))
             (SBAR-hasVerb (IN)
                           (S-hasVerb (NP-TPC)
                                      (VP-hasVerb (VB ...)))))

  - +1.18 dev set improvement
  - 16.1% of dev set trees lack verbs
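A minimal Python sketch of this split, under the assumption that it simply propagates a -hasVerb mark to every phrasal node dominating a verbal POS tag (nltk assumed; the verb tag set below is illustrative, not the full Bies inventory):

    from nltk import Tree

    VERB_TAGS = {"VB", "VBD", "VBP", "VBN"}     # assumed verbal pre-terminal tags

    def mark_contains_verb(tree):
        def visit(node):
            if node.height() == 2:               # pre-terminal: report, don't relabel
                return node.label() in VERB_TAGS
            results = [visit(c) for c in node if isinstance(c, Tree)]
            if any(results):
                node.set_label(node.label() + "-hasVerb")
                return True
            return False
        visit(tree)
        return tree

    t = Tree.fromstring(
        "(S (NP-SBJ (NN x) (NP (NN y))) (SBAR (IN z) (S (NP-TPC (NN w)) (VP (VBD v)))))")
    print(mark_contains_verb(t))
    # S, SBAR, the inner S, and VP all receive the -hasVerb mark; NP-SBJ and NP-TPC do not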

Page 33: Baselines, Evaluations, and Analysis - Spence Green


Feature: splitCC

  S  ->  S  CC-S  S          (clausal coordination)
  NP ->  NN  CC-noun  NN     (nominal coordination)

  - +0.21 dev set improvement
  - POS equivalence classes for verb, noun, adjective
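A hedged Python sketch of this kind of split. The exact conditioning in the paper may differ; here the CC is annotated with a coarse class of the category it conjoins, and the equivalence classes are a rough stand-in.

    from nltk import Tree

    def coarse_class(label):
        if label.startswith("V"):
            return "verb"
        if label.startswith("DTJJ") or label.startswith("JJ"):
            return "adj"
        if label.startswith("N") or label.startswith("DTNN"):
            return "noun"
        return label               # phrasal conjuncts (S, NP, ...) keep their label

    def split_cc(tree):
        for st in tree.subtrees():
            kids = [k for k in st if isinstance(k, Tree)]
            for i, kid in enumerate(kids):
                if kid.label() == "CC" and i + 1 < len(kids):
                    kid.set_label("CC-" + coarse_class(kids[i + 1].label()))
        return tree

    t = Tree.fromstring("(NP (NN book) (CC and) (NN pen))")
    print(split_cc(t))     # (NP (NN book) (CC-noun and) (NN pen))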

Page 35: Baselines, Evaluations, and Analysis - Spence Green


Gold Experimental Setup

Evaluation on lengths ≤ 70

Pre-processing makes a huge difference
  - Maintained the non-terminal / terminal ratio vs. prior work

Models:
  - Berkeley
  - Bikel (with pre-tagged input)
  - Stanford (with the manual grammar)

Page 36: Baselines, Evaluations, and Analysis - Spence Green


Learning Curves

[Figure: Evalb F1 on the development set (roughly 75-85) as a function of the number of training trees (5,000 to 15,000+), with curves for the Berkeley, Stanford, and Bikel parsers]

Page 37: Baselines, Evaluations, and Analysis - Spence Green


Model Comparison

Evalb F1:

  Model           up to 70   All
  Berkeley        82.0       81.1
  Bikel           77.5       76.5
  Stanford        79.9       78.3
  Petrov (2009)              75.9

Page 38: Baselines, Evaluations, and Analysis - Spence Green


Raw Text Experimental Setup

Pipeline: MADA + Stanford parser

Lattice parsing: effective for Hebrew (Goldberg and Tsarfaty, 2008)

Evaluation on gold (segmented) lengths ≤ 70

Metric: Evalb without whitespace
  - Requires exact character yield (Tsarfaty, 2006)
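To illustrate the character-yield requirement, here is a small Python sketch (illustrative, not the evaluation code used in the paper): constituent spans are compared in whitespace-free character offsets, which is only well-defined if the gold and automatic segmentations yield exactly the same characters.

    def char_spans(tokens):
        """Map tokens to (start, end) character offsets with whitespace removed."""
        spans, start = [], 0
        for tok in tokens:
            spans.append((start, start + len(tok)))
            start += len(tok)
        return spans

    gold = ["w", "ktb", "hm"]       # gold clitic segmentation (toy, Buckwalter-style)
    auto = ["wktb", "hm"]           # automatic segmentation of the same surface string
    assert "".join(gold) == "".join(auto)   # the character yields must match exactly
    print(char_spans(gold))    # [(0, 1), (1, 4), (4, 6)]
    print(char_spans(auto))    # [(0, 4), (4, 6)]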

Page 42: Baselines, Evaluations, and Analysis - Spence Green


Lattice Parsing: "ABCD EFG"

[Figure: a segmentation lattice over the input. The arcs shown for the first token are A, BC, BCD, ABC, ABCD, and D, giving the paths ABCD, A+BCD, ABC+D, and A+BC+D; the second token has arcs EF, G, and EFG. The parser searches over all segmentation paths jointly.]
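The sketch below builds such a lattice for the toy input, assuming a hypothetical lexicon of allowed pieces, and enumerates its paths; it is only meant to show the data structure, not the lattice parser used in the paper.

    def lattice(token, lexicon):
        """Arcs (start, end, piece): every substring of `token` licensed by `lexicon`."""
        arcs = []
        for i in range(len(token)):
            for j in range(i + 1, len(token) + 1):
                if token[i:j] in lexicon:
                    arcs.append((i, j, token[i:j]))
        return arcs

    def paths(arcs, start, end):
        """Enumerate every segmentation (path) from `start` to `end`."""
        if start == end:
            yield []
            return
        for i, j, piece in arcs:
            if i == start:
                for rest in paths(arcs, j, end):
                    yield [piece] + rest

    LEXICON = {"A", "BC", "BCD", "ABC", "ABCD", "D", "EF", "EFG", "G"}   # assumed pieces
    for seg in paths(lattice("ABCD", LEXICON), 0, 4):
        print(" + ".join(seg))
    # A + BC + D,  A + BCD,  ABC + D,  ABCD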

Page 45: Baselines, Evaluations, and Analysis - Spence Green


Parsing and Segmentation Results

Evalb F1 (dev set):

  Gold         81.1
  Pipeline     79.2
  Lattice      74.3
  Lattice+LM   76.0

Page 46: Baselines, Evaluations, and Analysis - Spence Green


Parsing and Segmentation Results

Segmentation accuracy (dev set):

  Gold         100.0
  Pipeline      97.7
  Lattice       94.1
  Lattice+LM    96.3

Comparable segmentation without the effort!

Page 48: Baselines, Evaluations, and Analysis - Spence Green

Conclusion

Evalb F1, all sentence lengths:

  English              90.1
  Chinese              83.7
  Bulgarian            81.6
  German               80.1
  French               77.9
  Italian              75.6
  Arabic               75.8
  Arabic (this paper)  81.1

Page 49: Baselines, Evaluations, and Analysis - Spence Green


Thanks.

http://nlp.stanford.edu/projects/arabic.shtml

Page 50: Baselines, Evaluations, and Analysis - Spence Green


Mixture of Manual and Automatic Annotation

Evalb F1 (dev):

  Setting     up to 70   All
  Basic       82.4       80.9
  Annotated   82.2       80.8

Page 51: Baselines, Evaluations, and Analysis - Spence Green


Frequency Matched Strata

                  Arabic               English
  Freq. stratum   µ freq.   % nuclei   µ freq.   % nuclei
  2               2.00      46%        2.00      34%
  [3, 4]          3.37      26%        3.37      27%
  [5, 9]          6.43      16%        6.49      20%
  [10, 49]        18.6      10%        19.0      16%
  [50, 500]       110.3      2%        113        3%

Page 52: Baselines, Evaluations, and Analysis - Spence Green


Annotation Consistency Evaluation (Again!)

            Nuclei     Sample     Error %
            per tree   n-grams    Type     n-gram
  WSJ 2-21  0.565      750        16.0%    4.10%
  ATB       0.830      658        26.0%    4.10%

95% confidence intervals (type-level):
  - Arabic:  [17.4%, 34.6%]
  - English: [8.79%, 23.3%]
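For context, the sketch below shows a simplified variation-nuclei style consistency check in the spirit of Dickinson and Meurers (2003), assuming nltk; the paper's actual procedure (sampling nuclei and judging errors at the type and n-gram level) is more involved, so this is only a toy.

    from collections import defaultdict
    from nltk import Tree

    def constituent_spans(tree):
        """Yield (word n-gram, label) for every phrasal constituent."""
        for st in tree.subtrees():
            if st.height() > 2:                 # skip pre-terminals
                yield tuple(st.leaves()), st.label()

    def inconsistent_nuclei(trees):
        labels = defaultdict(set)
        for t in trees:
            for ngram, label in constituent_spans(t):
                labels[ngram].add(label)
        return {ngram: labs for ngram, labs in labels.items() if len(labs) > 1}

    trees = [Tree.fromstring("(S (NP (DT the) (NN report)) (VP (VBD said)))"),
             Tree.fromstring("(S (NP-TPC (DT the) (NN report)) (VP (VBD said)))")]
    print(inconsistent_nuclei(trees))
    # {('the', 'report'): {'NP', 'NP-TPC'}}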