Baselines, Evaluations, and Analysis - Spence Green

Better Arabic Parsing: Baselines, Evaluations, and Analysis. Spence Green and Christopher D. Manning, Stanford University, August 27, 2010.

Transcript of Baselines, Evaluations, and Analysis - Spence Green

Page 1: Baselines, Evaluations, and Analysis - Spence Green

Better Arabic Parsing: Baselines, Evaluations, and Analysis

Spence Green and Christopher D. Manning

Stanford University

August 27, 2010

Page 2: Baselines, Evaluations, and Analysis - Spence Green


Common Multilingual Parsing Questions...

Is language X "harder" to parse than language Y?
  - Morphologically-rich X

Is treebank X "better/worse" than treebank Y?

Does feature Z "help more" for language X than Y?
  - Lexicalization
  - Morphological annotations
  - Markovization
  - etc.

Page 7: Baselines, Evaluations, and Analysis - Spence Green


“Underperformance” Relative to English

Evalb F1, all sentence lengths (Petrov, 2009):

  English              90.1
  Chinese              83.7
  Bulgarian            81.6
  German               80.1
  French               77.9
  Italian              75.6
  Arabic               75.8
  Arabic (this paper)  81.1

Page 8: Baselines, Evaluations, and Analysis - Spence Green


Why Arabic / Penn Arabic Treebank (ATB)?

Annotation style similar to PTB

Relatively little segmentation (cf. Chinese)

Richer morphology (cf. English)

More syntactic ambiguity (unvocalized)


Page 9: Baselines, Evaluations, and Analysis - Spence Green


ATB Details: Parts 1-3 (not including part 3, v3.2)

Newswire only
  - Agence France Presse, Al-Hayat, Al-Nahar

Corpus/experimental characteristics
  - 23k trees
  - 740k tokens
  - Shortened "Bies" POS tags
  - Split: 2005 JHU workshop

Page 10: Baselines, Evaluations, and Analysis - Spence Green


Arabic Preliminaries

Diglossia: "Arabic" -> MSA (Modern Standard Arabic)

Typology: VSO; VOS, SVO, and VO orders are also possible

Devocalization
  [Arabic example: a fully vocalized word vs. its unvocalized written form]

Segmentation: an analyst's choice!
  - The ATB uses clitic segmentation (see the sketch below)
  [Arabic example: a word split into its clitics]
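As a rough illustration of what clitic segmentation does, here is a toy Python sketch. It is not MADA or the ATB tooling; the input is assumed to be Buckwalter transliteration, and the clitic inventories are a small invented subset chosen only for this example.

    # Toy illustration of clitic segmentation (NOT the ATB/MADA analyzer).
    PROCLITICS = ["w", "f", "b", "l", "k"]          # illustrative conjunction/preposition clitics
    ENCLITICS = ["hmA", "hA", "hm", "km", "h"]      # illustrative pronominal suffixes, longest first

    def segment(token):
        """Split off at most one proclitic and one enclitic; '+' marks the cut."""
        pieces = []
        for p in PROCLITICS:
            if token.startswith(p) and len(token) > len(p) + 2:
                pieces.append(p + "+")
                token = token[len(p):]
                break
        suffix = None
        for e in ENCLITICS:
            if token.endswith(e) and len(token) > len(e) + 2:
                suffix = "+" + e
                token = token[:-len(e)]
                break
        pieces.append(token)
        if suffix:
            pieces.append(suffix)
        return pieces

    print(segment("wktAbhm"))    # ['w+', 'ktAb', '+hm']  ("and their book")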

Page 16: Baselines, Evaluations, and Analysis - Spence Green


Syntactic Ambiguity in Arabic

This talk:
  - Devocalization
  - Discourse-level coordination ambiguity

Many other types:
  - Adjectives / adjective phrases
  - Process nominals (maSdar)
  - Attachment in annexation constructs (Gabbard and Kulick, 2008)

Page 17: Baselines, Evaluations, and Analysis - Spence Green


Devocalization: Inna and her Sisters

  Particle           POS   Head of
  inna  "indeed"     VBP   VP
  anna  "that"       IN    SBAR
  in    "if"         IN    SBAR
  an    "to"         IN    SBAR

Page 18: Baselines, Evaluations, and Analysis - Spence Green


Devocalization: Inna and her Sisters

Devocalized, all four particles are written identically (ان):

  Written form             POS   Head of
  ان   inna  "indeed"      VBP   VP
  ان   anna  "that"        IN    SBAR
  ان   in    "if"          IN    SBAR
  ان   an    "to"          IN    SBAR
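To make the conflation concrete, here is a small Python sketch (an illustration added to this transcript, not part of the original slides): it strips diacritics and, in this toy, also normalizes hamza-on-alif to bare alif, which maps all four particles to the same string.

    import unicodedata

    def devocalize(word):
        # Drop diacritics (Unicode combining marks), then normalize hamza-on-alif
        # to bare alif, as informally written unvocalized text often does.
        stripped = "".join(ch for ch in word if not unicodedata.combining(ch))
        return stripped.replace("\u0623", "\u0627").replace("\u0625", "\u0627")

    for word in ["إِنَّ", "أَنَّ", "إِنْ", "أَنْ"]:
        print(word, "->", devocalize(word))
    # every form prints as the same two-letter string  ان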

Page 19: Baselines, Evaluations, and Analysis - Spence Green


Devocalization: Inna and her Sisters

Reference (ATB):
  (VP (VBD اضافت "she added")
      (S ... (VBP ان "indeed") (NP (NN صدام "Saddam")) ...))

Stanford parser output:
  (VP (VBD اضافت "she added")
      (SBAR ... (IN ان) (NP (NN صدام "Saddam")) ...))

With the vowels gone, ان is mis-tagged as IN (i.e., as one of inna's sisters), so the clause is parsed as an SBAR instead of an S.

Page 20: Baselines, Evaluations, and Analysis - Spence Green


Discourse-level Coordination Ambiguity

[Tree: nested clausal coordination, with the conjunction و "and" appearing both between full sentences and inside the clause]

  - S < S (an S immediately dominating an S) in 27.0% of dev set trees
  - NP < CC (an NP immediately dominating a CC) in 38.7% of dev set trees

Page 21: Baselines, Evaluations, and Analysis - Spence Green


Discourse-level Coordination Ambiguity

Leaf Ancestor metric reference chains (Berkeley); score ∈ [0, 1]:

  Score   # Gold   Reference chain
  0.696   34       S < S < VP < NP < PRP
  0.756   170      S < VP < NP < CC
  0.768   31       S < S < VP < S < VP < PP < IN
  0.796   86       S < S < VP < SBAR < IN
  0.804   52       S < S < NP < NN
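For readers unfamiliar with the metric, below is a minimal Python sketch of leaf-ancestor-style scoring (Sampson and Babarczy). It assumes nltk for tree handling and uses a plain Levenshtein similarity over ancestor-label chains, so it only approximates the published metric.

    from nltk import Tree

    def leaf_chains(tree):
        """For each leaf, the chain of ancestor labels from the root down to its POS tag."""
        chains = []
        for pos in tree.treepositions('leaves'):
            chains.append([tree[pos[:i]].label() for i in range(len(pos))])
        return chains

    def edit_distance(a, b):
        """Plain Levenshtein distance over label sequences."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        return d[-1][-1]

    def leaf_ancestor_score(gold, guess):
        """Average per-leaf chain similarity in [0, 1]; assumes identical yields."""
        pairs = zip(leaf_chains(gold), leaf_chains(guess))
        sims = [1 - edit_distance(g, h) / max(len(g), len(h)) for g, h in pairs]
        return sum(sims) / len(sims)

    gold = Tree.fromstring("(S (NP (NN Saddam)) (VP (VBD spoke)))")
    guess = Tree.fromstring("(S (NP (NN Saddam)) (NP (VBD spoke)))")
    print(round(leaf_ancestor_score(gold, guess), 3))   # 0.833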

Page 22: Baselines, Evaluations, and Analysis - Spence Green


Treebank Comparison

Compared ATB gross corpus statistics to:
  - Chinese: CTB6
  - English: WSJ sect. 2-23
  - German: Negra

The ATB isn't that unusual!

Page 24: Baselines, Evaluations, and Analysis - Spence Green


Corpus Features in Favor of the ATB

                                 ATB     CTB6    Negra   WSJ
  Non-terminal / terminal ratio  1.04    1.18    0.46    0.82
  OOV rate                       16.8%   22.2%   30.5%   13.2%
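As a reference point for how such gross statistics can be computed, here is a hedged Python sketch. It assumes nltk, one bracketed tree per line, and placeholder file names; the paper's exact counting conventions (e.g., whether pre-terminals count as non-terminals) may differ, so the numbers it produces are only indicative.

    from nltk import Tree

    def read_trees(path):
        """One bracketed tree per line (placeholder format)."""
        with open(path, encoding="utf-8") as f:
            return [Tree.fromstring(line) for line in f if line.strip()]

    def nt_terminal_ratio(trees):
        terminals = sum(len(t.leaves()) for t in trees)
        # Count phrasal nodes only; pre-terminal POS nodes are excluded here.
        phrasal = sum(1 for t in trees for st in t.subtrees() if st.height() > 2)
        return phrasal / terminals

    def oov_rate(train_trees, dev_trees):
        vocab = {w for t in train_trees for w in t.leaves()}
        dev_tokens = [w for t in dev_trees for w in t.leaves()]
        return sum(w not in vocab for w in dev_tokens) / len(dev_tokens)

    train = read_trees("train.mrg")      # placeholder file names
    dev = read_trees("dev.mrg")
    print("NT/terminal ratio: %.2f" % nt_terminal_ratio(train))
    print("OOV rate: %.1f%%" % (100 * oov_rate(train, dev)))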

Page 26: Baselines, Evaluations, and Analysis - Spence Green


Sentence Length Negatively Affects Parsing

  Average sentence length:
    ATB    31.5
    CTB6   27.7
    Negra  17.2
    WSJ    23.8

An evaluation limit of 40 words is not sufficient!

Page 28: Baselines, Evaluations, and Analysis - Spence Green


Developing a Manually Annotated Grammar

Klein and Manning (2003)-style state splits
  - Human-interpretable
  - Features can inform treebank revision

Example: marking the recursive NP of the idafa (construct-state) configuration

  Before:  (NP (NN) (NP (NN) (NP (DTNN))))
  After:   (NP-idafa (NN) (NP-idafa (NN) (NP (DTNN))))

Alternative: automatic splits (Berkeley parser)
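Below is a small Python sketch of what one such manual split could look like. The trigger used here (relabel an NP whose second child is another NP) and the English placeholder leaves are assumptions for illustration; the actual rule in the Stanford grammar may be conditioned differently.

    from nltk import Tree

    def mark_idafa(tree):
        # Illustrative state split: relabel an NP whose second child is an NP.
        for st in tree.subtrees():
            if (st.label() == "NP" and len(st) >= 2
                    and isinstance(st[1], Tree) and st[1].label().startswith("NP")):
                st.set_label("NP-idafa")
        return tree

    t = Tree.fromstring("(NP (NN book) (NP (NN director) (NP (DTNN the-company))))")
    print(mark_idafa(t))
    # (NP-idafa (NN book) (NP-idafa (NN director) (NP (DTNN the-company))))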

Page 31: Baselines, Evaluations, and Analysis - Spence Green


Feature: markContainsVerb

  (S-hasVerb (NP-SBJ (NN) (NP))
             (SBAR-hasVerb (IN)
                           (S-hasVerb (NP-TPC)
                                      (VP-hasVerb (VB ...)))))

  - +1.18 dev set improvement
  - 16.1% of dev set trees lack verbs
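A minimal Python sketch of this split, under the assumption that it simply propagates a -hasVerb mark to every phrasal node dominating a verbal POS tag (nltk assumed; the verb tag set below is illustrative, not the full Bies inventory):

    from nltk import Tree

    VERB_TAGS = {"VB", "VBD", "VBP", "VBN"}     # assumed verbal pre-terminal tags

    def mark_contains_verb(tree):
        def visit(node):
            if node.height() == 2:               # pre-terminal: report, don't relabel
                return node.label() in VERB_TAGS
            results = [visit(c) for c in node if isinstance(c, Tree)]
            if any(results):
                node.set_label(node.label() + "-hasVerb")
                return True
            return False
        visit(tree)
        return tree

    t = Tree.fromstring(
        "(S (NP-SBJ (NN x) (NP (NN y))) (SBAR (IN z) (S (NP-TPC (NN w)) (VP (VBD v)))))")
    print(mark_contains_verb(t))
    # S, SBAR, the inner S, and VP all receive the -hasVerb mark; NP-SBJ and NP-TPC do not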

Page 33: Baselines, Evaluations, and Analysis - Spence Green


Feature: splitCC

  S  ->  S  CC-S  S          (clausal coordination)
  NP ->  NN  CC-noun  NN     (nominal coordination)

  - +0.21 dev set improvement
  - POS equivalence classes for verb, noun, adjective
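A hedged Python sketch of this kind of split. The exact conditioning in the paper may differ; here the CC is annotated with a coarse class of the category it conjoins, and the equivalence classes are a rough stand-in.

    from nltk import Tree

    def coarse_class(label):
        if label.startswith("V"):
            return "verb"
        if label.startswith("DTJJ") or label.startswith("JJ"):
            return "adj"
        if label.startswith("N") or label.startswith("DTNN"):
            return "noun"
        return label               # phrasal conjuncts (S, NP, ...) keep their label

    def split_cc(tree):
        for st in tree.subtrees():
            kids = [k for k in st if isinstance(k, Tree)]
            for i, kid in enumerate(kids):
                if kid.label() == "CC" and i + 1 < len(kids):
                    kid.set_label("CC-" + coarse_class(kids[i + 1].label()))
        return tree

    t = Tree.fromstring("(NP (NN book) (CC and) (NN pen))")
    print(split_cc(t))     # (NP (NN book) (CC-noun and) (NN pen))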

Page 35: Baselines, Evaluations, and Analysis - Spence Green


Gold Experimental Setup

Evaluation on lengths ≤ 70

Pre-processing makes a huge difference
  - Maintained the non-terminal / terminal ratio vs. prior work

Models:
  - Berkeley
  - Bikel (with pre-tagged input)
  - Stanford (with the manual grammar)

Page 36: Baselines, Evaluations, and Analysis - Spence Green


Learning Curves

[Figure: Evalb F1 on the development set (roughly 75-85) as a function of the number of training trees (5,000 to 15,000+), with curves for the Berkeley, Stanford, and Bikel parsers]

Page 37: Baselines, Evaluations, and Analysis - Spence Green


Model Comparison

Evalb F1:

  Model           up to 70   All
  Berkeley        82.0       81.1
  Bikel           77.5       76.5
  Stanford        79.9       78.3
  Petrov (2009)              75.9

Page 38: Baselines, Evaluations, and Analysis - Spence Green


Raw Text Experimental Setup

Pipeline: MADA + Stanford parser

Lattice parsing: effective for Hebrew (Goldberg and Tsarfaty, 2008)

Evaluation on gold (segmented) lengths ≤ 70

Metric: Evalb without whitespace
  - Requires exact character yield (Tsarfaty, 2006)
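To illustrate the character-yield requirement, here is a small Python sketch (illustrative, not the evaluation code used in the paper): constituent spans are compared in whitespace-free character offsets, which is only well-defined if the gold and automatic segmentations yield exactly the same characters.

    def char_spans(tokens):
        """Map tokens to (start, end) character offsets with whitespace removed."""
        spans, start = [], 0
        for tok in tokens:
            spans.append((start, start + len(tok)))
            start += len(tok)
        return spans

    gold = ["w", "ktb", "hm"]       # gold clitic segmentation (toy, Buckwalter-style)
    auto = ["wktb", "hm"]           # automatic segmentation of the same surface string
    assert "".join(gold) == "".join(auto)   # the character yields must match exactly
    print(char_spans(gold))    # [(0, 1), (1, 4), (4, 6)]
    print(char_spans(auto))    # [(0, 4), (4, 6)]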

Page 42: Baselines, Evaluations, and Analysis - Spence Green


Lattice Parsing: "ABCD EFG"

[Figure: a segmentation lattice over the input. The arcs shown for the first token are A, BC, BCD, ABC, ABCD, and D, giving the paths ABCD, A+BCD, ABC+D, and A+BC+D; the second token has arcs EF, G, and EFG. The parser searches over all segmentation paths jointly.]
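The sketch below builds such a lattice for the toy input, assuming a hypothetical lexicon of allowed pieces, and enumerates its paths; it is only meant to show the data structure, not the lattice parser used in the paper.

    def lattice(token, lexicon):
        """Arcs (start, end, piece): every substring of `token` licensed by `lexicon`."""
        arcs = []
        for i in range(len(token)):
            for j in range(i + 1, len(token) + 1):
                if token[i:j] in lexicon:
                    arcs.append((i, j, token[i:j]))
        return arcs

    def paths(arcs, start, end):
        """Enumerate every segmentation (path) from `start` to `end`."""
        if start == end:
            yield []
            return
        for i, j, piece in arcs:
            if i == start:
                for rest in paths(arcs, j, end):
                    yield [piece] + rest

    LEXICON = {"A", "BC", "BCD", "ABC", "ABCD", "D", "EF", "EFG", "G"}   # assumed pieces
    for seg in paths(lattice("ABCD", LEXICON), 0, 4):
        print(" + ".join(seg))
    # A + BC + D,  A + BCD,  ABC + D,  ABCD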

Page 45: Baselines, Evaluations, and Analysis - Spence Green


Parsing and Segmentation Results

Evalb F1 (dev set):

  Gold         81.1
  Pipeline     79.2
  Lattice      74.3
  Lattice+LM   76.0

Page 46: Baselines, Evaluations, and Analysis - Spence Green


Parsing and Segmentation Results

Segmentation accuracy (dev set):

  Gold         100.0
  Pipeline      97.7
  Lattice       94.1
  Lattice+LM    96.3

Comparable segmentation without the effort!

Page 48: Baselines, Evaluations, and Analysis - Spence Green

Conclusion

Evalb F1, all sentence lengths:

  English              90.1
  Chinese              83.7
  Bulgarian            81.6
  German               80.1
  French               77.9
  Italian              75.6
  Arabic               75.8
  Arabic (this paper)  81.1

Page 49: Baselines, Evaluations, and Analysis - Spence Green


Thanks.

http://nlp.stanford.edu/projects/arabic.shtml

Page 50: Baselines, Evaluations, and Analysis - Spence Green


Mixture of Manual and Automatic Annotation

Evalb F1 (dev):

  Setting     up to 70   All
  Basic       82.4       80.9
  Annotated   82.2       80.8

Page 51: Baselines, Evaluations, and Analysis - Spence Green


Frequency Matched Strata

                  Arabic               English
  Freq. stratum   µ freq.   % nuclei   µ freq.   % nuclei
  2               2.00      46%        2.00      34%
  [3, 4]          3.37      26%        3.37      27%
  [5, 9]          6.43      16%        6.49      20%
  [10, 49]        18.6      10%        19.0      16%
  [50, 500]       110.3      2%        113        3%

Page 52: Baselines, Evaluations, and Analysis - Spence Green


Annotation Consistency Evaluation (Again!)

            Nuclei     Sample     Error %
            per tree   n-grams    Type     n-gram
  WSJ 2-21  0.565      750        16.0%    4.10%
  ATB       0.830      658        26.0%    4.10%

95% confidence intervals (type-level):
  - Arabic:  [17.4%, 34.6%]
  - English: [8.79%, 23.3%]
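For context, the sketch below shows a simplified variation-nuclei style consistency check in the spirit of Dickinson and Meurers (2003), assuming nltk; the paper's actual procedure (sampling nuclei and judging errors at the type and n-gram level) is more involved, so this is only a toy.

    from collections import defaultdict
    from nltk import Tree

    def constituent_spans(tree):
        """Yield (word n-gram, label) for every phrasal constituent."""
        for st in tree.subtrees():
            if st.height() > 2:                 # skip pre-terminals
                yield tuple(st.leaves()), st.label()

    def inconsistent_nuclei(trees):
        labels = defaultdict(set)
        for t in trees:
            for ngram, label in constituent_spans(t):
                labels[ngram].add(label)
        return {ngram: labs for ngram, labs in labels.items() if len(labs) > 1}

    trees = [Tree.fromstring("(S (NP (DT the) (NN report)) (VP (VBD said)))"),
             Tree.fromstring("(S (NP-TPC (DT the) (NN report)) (VP (VBD said)))")]
    print(inconsistent_nuclei(trees))
    # {('the', 'report'): {'NP', 'NP-TPC'}}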