Breaking the Resource Bottleneck for Multilingual Parsing

Rebecca Hwa, Philip Resnik and Amy Weinberg, University of Maryland


Page 1: Breaking the Resource Bottleneck for Multilingual Parsing

Breaking the Resource Bottleneck for Multilingual Parsing

Rebecca Hwa, Philip Resnik and Amy Weinberg

University of Maryland

Page 2: Breaking the Resource Bottleneck for Multilingual Parsing

The Treebank Bottleneck

• High-quality parsers need training examples with hand-annotated syntactic information

• Annotation is labor intensive and time consuming

• There is no sizable treebank for most languages other than English

[[S [NP-SBJ Ford Motor Co. ] [VP acquired [NP [NP 5 % ] [PP of [NP [NP the shares] [PP-LOC in [NP Jaguar PLC]]]]]]] . ]

Page 3: Breaking the Resource Bottleneck for Multilingual Parsing

State of the Art Parsing

Language | Treebank | Size | Parser Performance
English | Penn Treebank | 1M words, 40k sentences | ~90%
Chinese | Chinese Treebank | 100K words, 4k sentences | ~75%
Others (e.g., Hindi, Arabic) | ? | ? | ?

Page 4: Breaking the Resource Bottleneck for Multilingual Parsing

Research Questions

• How can we induce a non-English language treebank quickly and automatically?
  – Bootstrap from available English resources
  – Project syntactic dependency relationships across bilingual sentences

• How good is the resulting treebank?
  – Can we use it to train a new parser?
  – How can we improve its quality?

Page 5: Breaking the Resource Bottleneck for Multilingual Parsing

Roadmap

• Overview of the framework
  – Direct projection algorithm
    • Problematic cases
  – Post projection transformation
    • Remaining challenges
  – Filtering

• Experiment
  – Direct evaluation of the projected trees
  – Evaluation of a Chinese parser trained on the induced treebank

• Future Work

Page 6: Breaking the Resource Bottleneck for Multilingual Parsing

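Overview of Our Framework

[Figure: pipeline diagram. A bilingual corpus (English, Chinese) feeds an English dependency parser and a word alignment model. Projection, Transformation and Filtering produce a projected Chinese dependency treebank, which is used to train a dependency parser. The trained parser takes unseen Chinese sentences and outputs dependency trees for unseen sentences.]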

Page 7: Breaking the Resource Bottleneck for Multilingual Parsing

Necessary Resources: 1. Bilingual Sentences

English: The Chinese side expressed satisfaction regarding this subject

Chinese: 中国 方面 对 此 表示 满意

Page 8: Breaking the Resource Bottleneck for Multilingual Parsing

Necessary Resources: 2. English (Dependency) Parser

English: The Chinese side expressed satisfaction regarding this subject

Chinese: 中国 方面 对 此 表示 满意

[Figure: dependency arcs over the English sentence, labeled subj, obj, adj, det, det, mod, mod]

Page 9: Breaking the Resource Bottleneck for Multilingual Parsing

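Necessary Resources: 3. Word Alignment

English: The Chinese side expressed satisfaction regarding this subject

Chinese: 中国 方面 对 此 表示 满意

[Figure: the English dependency arcs (subj, obj, adj, det, det, mod, mod) together with word-alignment links between the English and Chinese words]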

Page 10: Breaking the Resource Bottleneck for Multilingual Parsing

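Projected Chinese Dependency Tree

English: The Chinese side expressed satisfaction regarding this subject

Chinese: 中国 方面 对 此 表示 满意

[Figure: the English dependency arcs (subj, obj, adj, det, det, mod, mod) and the arcs projected onto the Chinese sentence (subj, obj, adj, mod, mod)]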

Page 11: Breaking the Resource Bottleneck for Multilingual Parsing

Direct Projection Algorithm

• If there is a syntactic relationship between two English words, then the same syntactic relationship also exists between their corresponding Chinese words
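The simplest, one-to-one reading of this rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `project_one_to_one`, the index-based data layout, and the toy parse and alignment are all assumptions made for the example, and links with an unaligned endpoint are simply skipped here rather than handled as in the problematic cases that follow.

```python
from typing import Dict, List, Tuple

# (head_index, dependent_index, label) over word positions in a sentence
Dependency = Tuple[int, int, str]


def project_one_to_one(
    english_deps: List[Dependency],
    alignment: Dict[int, int],
) -> List[Dependency]:
    """Copy each English dependency onto the aligned Chinese words.

    `alignment` maps an English word index to a Chinese word index.
    A link is projected only if both its head and its dependent are
    aligned; everything else is skipped in this sketch.
    """
    projected = []
    for head, dep, label in english_deps:
        if head in alignment and dep in alignment:
            projected.append((alignment[head], alignment[dep], label))
    return projected


# Hypothetical toy input (not taken from the slides).
english = ["The", "Chinese", "side", "expressed", "satisfaction"]
chinese = ["中国", "方面", "表示", "满意"]
deps = [(3, 2, "subj"), (3, 4, "obj"), (2, 1, "mod"), (2, 0, "det")]
alignment = {1: 0, 2: 1, 3: 2, 4: 3}  # "The" has no Chinese counterpart

projected = project_one_to_one(deps, alignment)
for head, dep, label in projected:
    print(f"{chinese[head]} -{label}-> {chinese[dep]}")
```

In this toy input the `det` link is dropped because "The" is unaligned, which is exactly the kind of case the following slides address.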

Page 12: Breaking the Resource Bottleneck for Multilingual Parsing

Problematic Case: Unaligned English

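English: regarding this subject

Chinese: 对 此

[Figure: English dependency arcs labeled det and mod; one English word has no aligned Chinese word]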

Page 13: Breaking the Resource Bottleneck for Multilingual Parsing

Problematic Case: Unaligned English

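English: regarding this subject

Chinese: 对 此 *e*

[Figure: English dependency arcs labeled det and mod; an empty node *e* is added on the Chinese side and receives the projected det and mod arcs]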

Page 14: Breaking the Resource Bottleneck for Multilingual Parsing

Problematic Case: many-to-1

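English: regarding this subject

Chinese: 对 此

[Figure: English dependency arcs labeled det and mod; more than one English word is aligned to a single Chinese word]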

Page 15: Breaking the Resource Bottleneck for Multilingual Parsing

Problematic Case: many-to-1

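English: regarding this subject

Chinese: 对 此

[Figure: English dependency arcs labeled det and mod; a single mod arc is projected onto the Chinese side]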

Page 16: Breaking the Resource Bottleneck for Multilingual Parsing

Problematic Case: Unaligned Chinese

English: The Chinese expressed

Chinese: 中国 方面 表示

[Figure: English dependency arcs labeled subj and det; two empty nodes *e* are shown]

Page 17: Breaking the Resource Bottleneck for Multilingual Parsing

Problematic Case: Unaligned Chinese

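English: The Chinese expressed

Chinese: 中国 方面 表示

[Figure: English dependency arcs labeled subj and det, two empty nodes *e*, and the projected subj and det arcs]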

Page 18: Breaking the Resource Bottleneck for Multilingual Parsing

Problematic Case: 1-to-many

English: The Chinese expressed

Chinese: 中国 方面 表示

[Figure: English dependency arcs labeled subj and det; an empty node *e* is shown]

Page 19: Breaking the Resource Bottleneck for Multilingual Parsing

Problematic Case: 1-to-many

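English: The Chinese expressed

Chinese: 中国 方面 表示

[Figure: English dependency arcs labeled subj and det with an empty node *e*; on the Chinese side a new node *M* is added with two arcs labeled mac, and the subj and det arcs are projected]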

Page 20: Breaking the Resource Bottleneck for Multilingual Parsing

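Output of the Direct Projection Algorithm

English: The Chinese expressed satisfaction regarding this subject

Chinese: 中国 方面 对 此 表示 满意

[Figure: English dependency arcs labeled subj, obj, det, det, mod, mod; the projected Chinese tree contains the added nodes *M* and *e* and arcs labeled subj, obj, mod, mod, det, mac, mac]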

Page 21: Breaking the Resource Bottleneck for Multilingual Parsing

Post Projection Transformation

• Handles One-to-Many mapping
  – Select head based on (projected) part-of-speech categories

• Handles some Unaligned-Chinese cases
  – Only addressing closed-class words
    • Function words (e.g., aspectual, measure words)
    • Easily enumerable lexical categories (e.g., $, RMB, yen)

• Removes empty nodes introduced by the Unaligned-English cases by promoting their head child
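The last transformation, removing an empty node by promoting its head child, might look roughly like the sketch below. The slides do not say how the head child is chosen, so this example assumes, purely for illustration, that the first listed child is the head child; the function name `remove_empty_nodes`, the dictionary representation and the toy tree are likewise hypothetical.

```python
from typing import Dict, Optional, Set


def remove_empty_nodes(
    heads: Dict[str, Optional[str]],
    empty: Set[str],
) -> Dict[str, Optional[str]]:
    """Delete empty nodes from a dependency tree.

    `heads` maps each node to its head (None for the root). For every
    empty node, one child is promoted into its place and any remaining
    children are re-attached to the promoted child.
    """
    heads = dict(heads)  # work on a copy
    for node in sorted(empty):
        children = [n for n, h in heads.items() if h == node]
        parent = heads.pop(node, None)
        if not children:
            continue
        promoted = children[0]  # ASSUMPTION: first child is the head child
        heads[promoted] = parent
        for sibling in children[1:]:
            heads[sibling] = promoted
    return heads


# Hypothetical fragment: an empty node *e* sits between 对 and 此.
tree = {"表示": None, "对": "表示", "*e*": "对", "此": "*e*"}
cleaned = remove_empty_nodes(tree, {"*e*"})
print(cleaned)
```

A real system would presumably pick the head child using the projected part-of-speech categories rather than position.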

Page 22: Breaking the Resource Bottleneck for Multilingual Parsing

Remaining Challenges

• Handling divergences

• Incorporating unaligned foreign words into the projected tree

• Removing crossing dependencies

[Figure: English words A B C D aligned to foreign words a b d c, illustrating crossing dependencies]

Page 23: Breaking the Resource Bottleneck for Multilingual Parsing

Filtering

• Projected treebank is noisy
  – Mistakes introduced by the projection algorithm
  – Mistakes introduced by component errors

• Use aggressive filtering techniques to remove the worst projected trees
  – Filter out a sentence pair if many English words were unaligned
  – Filter out a sentence pair if many Chinese words are aligned to the same English word
  – Filter out a sentence pair if many of the projected links caused crossing dependencies
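The three filters above can be sketched as a single predicate over a sentence pair. The slides give no thresholds, so the values of `max_unaligned`, `max_fanout` and `max_crossing` below are placeholders, and the function names and data layout are assumptions made for this illustration only.

```python
from collections import Counter
from typing import List, Tuple

Link = Tuple[int, int]  # (english_index, chinese_index)
Arc = Tuple[int, int]   # (head_index, dependent_index) over Chinese words


def count_crossing_arcs(arcs: List[Arc]) -> int:
    """Count arcs that cross at least one other arc."""
    spans = [tuple(sorted(arc)) for arc in arcs]
    crossing = 0
    for i, (a, b) in enumerate(spans):
        for j, (c, d) in enumerate(spans):
            if i != j and (a < c < b < d or c < a < d < b):
                crossing += 1
                break
    return crossing


def keep_sentence_pair(
    n_english: int,
    links: List[Link],
    projected_arcs: List[Arc],
    max_unaligned: float = 0.3,  # placeholder threshold
    max_fanout: int = 3,         # placeholder threshold
    max_crossing: float = 0.2,   # placeholder threshold
) -> bool:
    """Return False if the pair trips any of the three filters."""
    if n_english == 0:
        return False
    # Filter 1: too many unaligned English words.
    aligned_english = {e for e, _ in links}
    if 1 - len(aligned_english) / n_english > max_unaligned:
        return False
    # Filter 2: too many Chinese words aligned to one English word.
    fanout = Counter(e for e, _ in links)
    if fanout and max(fanout.values()) > max_fanout:
        return False
    # Filter 3: too many projected links that cross.
    if projected_arcs:
        ratio = count_crossing_arcs(projected_arcs) / len(projected_arcs)
        if ratio > max_crossing:
            return False
    return True


clean = keep_sentence_pair(3, [(0, 0), (1, 1), (2, 2)], [(1, 0), (1, 2)])
crossed = keep_sentence_pair(
    4, [(0, 0), (1, 1), (2, 2), (3, 3)], [(0, 2), (1, 3)]
)
print(clean, crossed)
```

Here the first toy pair is kept and the second is dropped because both of its projected arcs cross.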

Page 24: Breaking the Resource Bottleneck for Multilingual Parsing

Experiments

• Direct evaluation of the projection framework
  – Compare the (pre-filtered) projected trees against human annotated gold standard

• Evaluation of the projected treebank
  – Use the (post-filtered) treebank to train a Chinese parser
  – Test the parser on unseen sentences and compare the output to human annotated gold standard

Page 25: Breaking the Resource Bottleneck for Multilingual Parsing

Direct Evaluation

• Bilingual data: 88 Chinese Treebank sentences with their English translations

• Apply projection and transformation under idealized conditions
  – Given human-corrected English parse trees and hand-drawn word-alignments

• Apply projection and transformation under realistic conditions
  – English parse trees generated from Collins parser (trained on Penn Treebank)
  – Word-alignments generated from IBM MT Model (trained on ~56K Hong Kong News bilingual sentences)

Page 26: Breaking the Resource Bottleneck for Multilingual Parsing

Direct Evaluation Results

Condition | Accuracy*
Ideal | 67%
English parses from the Collins parser | 62%
Word-alignments from the IBM MT Model | 39%

*Accuracy = f-score based on unlabeled precision & recall
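As a reminder of how such an accuracy figure can be computed, here is a small sketch that scores a set of predicted dependency arcs against a gold set, ignoring labels. It assumes the balanced F-score (the harmonic mean of precision and recall); the slides do not state which weighting was used, and the function name and toy arcs are made up for the example.

```python
from typing import Set, Tuple

Arc = Tuple[int, int]  # (head_index, dependent_index); labels are ignored


def unlabeled_f_score(predicted: Set[Arc], gold: Set[Arc]) -> float:
    """Balanced F-score over unlabeled dependency arcs."""
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical arcs: 2 of 3 predicted arcs appear among 4 gold arcs.
predicted = {(2, 1), (2, 3), (1, 0)}
gold = {(2, 1), (2, 3), (3, 0), (1, 4)}
score = unlabeled_f_score(predicted, gold)
print(f"{score:.3f}")
```

With precision 2/3 and recall 1/2, the toy score works out to 4/7.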

Page 27: Breaking the Resource Bottleneck for Multilingual Parsing

Evaluating Trained Parser

• Bilingual data: 56K sentence pairs from the Hong Kong News parallel corpus

• Apply the DPA (using the Collins Parser and IBM MT Model) to create a projected Chinese treebank

• Filter out badly-aligned sentence pairs to reduce noise

• Train a Chinese parser with the (filtered) projected treebank

• Test the Chinese parser on unseen test set (88 Chinese Treebank sentences)

Page 28: Breaking the Resource Bottleneck for Multilingual Parsing

Parser Evaluation Results

Method | Training Corpus | Corpus Size | Parser Accuracy
Modify Prev (baseline) | - | - | 13.5
Modify Next (baseline) | - | - | 35.7
Stat. Parser | HKNews (Filtered) | 5284 | 42.3
Stat. Parser (upper bound) | Chinese Treebank | 3870 | 75.6
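The slides do not define the two baselines. One plausible reading, assumed here only for illustration, is that "Modify Prev" attaches every word to the word immediately before it and "Modify Next" attaches every word to the word immediately after it. Under that assumption the baseline trees need no training at all:

```python
from typing import Set, Tuple

Arc = Tuple[int, int]  # (head_index, dependent_index)


def modify_prev(n_words: int) -> Set[Arc]:
    """ASSUMED baseline: each word depends on the previous word."""
    return {(i - 1, i) for i in range(1, n_words)}


def modify_next(n_words: int) -> Set[Arc]:
    """ASSUMED baseline: each word depends on the next word."""
    return {(i + 1, i) for i in range(n_words - 1)}


prev_arcs = modify_prev(3)
next_arcs = modify_next(3)
print(sorted(prev_arcs), sorted(next_arcs))
```

Arc sets like these could then be scored against gold-standard trees in the same way as the parser output.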

Page 29: Breaking the Resource Bottleneck for Multilingual Parsing

Conclusion

• We have presented a framework for acquiring Chinese dependency treebanks by bootstrapping from existing linguistic resources

• Although the projected trees may have an accuracy rate of nearly 70% in principle, reducing noise caused by word-alignment errors is still a major challenge

• A parser trained on the induced treebank can outperform some baselines

Page 30: Breaking the Resource Bottleneck for Multilingual Parsing

Future Work

• Obtain larger parallel corpus

• Reduce error rates of the word-alignment models

• Develop more sophisticated techniques to filter out noise in the induced treebank

• Improve the projection algorithm to handle unaligned words and inconsistent trees

Page 31: Breaking the Resource Bottleneck for Multilingual Parsing

Reserve slides

Page 32: Breaking the Resource Bottleneck for Multilingual Parsing

DPA Case 1: One-to-One

[Figure: English words A and B aligned one-to-one with foreign words a and b]

Page 33: Breaking the Resource Bottleneck for Multilingual Parsing

DPA Case 2: Many-to-One

[Figure: English words A1, A2, A3, B, C and foreign words a, b, c; several English words align to one foreign word]

Page 34: Breaking the Resource Bottleneck for Multilingual Parsing

DPA Case 3: One-to-Many

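[Figure: English words A, B and foreign words a1, a2, a3, b, with an added node *a*; one English word aligns to several foreign words]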

Page 35: Breaking the Resource Bottleneck for Multilingual Parsing

DPA Case 4: Many-to-Many

[Figure: English words A1, A2, A3, B, C and foreign words a1, a2, b, c, with an added node *a*]

Page 36: Breaking the Resource Bottleneck for Multilingual Parsing

DPA Case 5: Unaligned English Word

[Figure: English words A, B, C and foreign words a, c; one English word has no aligned foreign word]

Page 37: Breaking the Resource Bottleneck for Multilingual Parsing

DPA Case 6: Unaligned Foreign Word

[Figure: English words A, C and foreign words a, b, c; one foreign word has no aligned English word]