The Language Resource Spectrum: A Perspective from Google

Ryan McDonald

Google NLU team, Google Linguistics team

LREC, May 2016

Language Resource Spectrum

search logs

unsupervised, weakly supervised, fully supervised

[Figure labels: amount of data, supervision]

which is better?

High quality annotations

Crowd-sourced

Software Engineer

Auto resources

Models

Active Learning

Pre-existing resources

α β γ δ

The Task Matters

✤ ML is really good at the head

POS tagging

QA

Pipelined / multi-Component Systems

Indexer, Text Extractor, Segmentation, Morphosyntax, Shallow Semantics, MT, QA, Search, Entity resolution, Relation extraction

End User Task

Upstream Task

Upstream: Morphosyntactic Tagging

Feature-based Classification

word=entendrez suffix3=rez word-1=n word+1=jamais cluster=124 cluster-1=53 cluster+1=210
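The feature string above can be read as the output of a few templates applied at one token position. The following is a minimal sketch under stated assumptions: `extract_features` and the `clusters` mapping are hypothetical names, the padding and unknown-cluster symbols are invented, and the surrounding sentence is illustrative. Only the feature names and cluster ids come from the slide.

```python
def extract_features(tokens, i, clusters):
    """Return slide-style feature strings for the token at position i."""
    def word(j):
        # Out-of-range positions get a padding symbol (an assumption).
        return tokens[j] if 0 <= j < len(tokens) else "<pad>"

    def cluster(j):
        # Words without a cluster id map to a hypothetical "unk" cluster.
        return clusters.get(word(j), "unk")

    return [
        f"word={word(i)}",
        f"suffix3={word(i)[-3:]}",
        f"word-1={word(i - 1)}",
        f"word+1={word(i + 1)}",
        f"cluster={cluster(i)}",
        f"cluster-1={cluster(i - 1)}",
        f"cluster+1={cluster(i + 1)}",
    ]


# Illustrative sentence; the tokenization is an assumption.
tokens = ["vous", "n", "entendrez", "jamais"]
clusters = {"n": 53, "entendrez": 124, "jamais": 210}
features = extract_features(tokens, 2, clusters)
print(" ".join(features))
```

Each string would typically become one binary indicator in a sparse feature vector for the classifier.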

Resource Trade-Off

Annotated data

Model

Dictionaries / Lexicons

Morphosyntactic Lexicons via Graph-Propagation

[Figure: graph over the words candies, candy, dishes, dish, cats, cat. Features shown: suff:ies:y, suff:es:, suff:s:, suff:s, pref, clust:10, clust:20, clust:35. Attribute: Number=Plur, with +1 / -1 labels.]

Reinforce (1): suff:s, clust:35, clust:20

Flip (-1): suff:ies:y, suff:es:, suff:s:

Neutral (0): pref, clust:10

Faruqui et al. ’16: Ising Mean Field Approximation
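To make the reinforce / flip / neutral idea concrete, here is a simplified sketch of label propagation with an Ising-style mean-field update, belief_i = tanh(sum_j w_ij * belief_j). It is not the authors' implementation. The feature weights are fixed to the slide's 1 / -1 / 0 values, and the edge list (which word pairs share which features) is an assumption, since the slide layout does not survive in this transcript.

```python
import math

# Feature weights from the slide: reinforce (1), flip (-1), neutral (0).
FEATURE_WEIGHT = {
    "suff:s": 1, "clust:35": 1, "clust:20": 1,
    "suff:ies:y": -1, "suff:es:": -1, "suff:s:": -1,
    "pref": 0, "clust:10": 0,
}

# Hypothetical edges: the pairing of words and features is assumed.
EDGES = [
    ("cats", "cat", ["suff:s:", "pref"]),
    ("dishes", "dish", ["suff:es:", "pref"]),
    ("candies", "candy", ["suff:ies:y", "pref"]),
    ("cats", "dishes", ["suff:s", "clust:20"]),
    ("dishes", "candies", ["suff:s", "clust:35"]),
    ("cat", "dish", ["clust:10"]),
]


def propagate(seeds, edges, iterations=20):
    """Spread seed labels over the graph with mean-field style updates."""
    neighbors = {}
    for a, b, feats in edges:
        weight = sum(FEATURE_WEIGHT[f] for f in feats)
        neighbors.setdefault(a, []).append((b, weight))
        neighbors.setdefault(b, []).append((a, weight))
    belief = {node: seeds.get(node, 0.0) for node in neighbors}
    for _ in range(iterations):
        for node in belief:
            if node in seeds:
                continue  # seed labels stay clamped
            total = sum(w * belief[other] for other, w in neighbors[node])
            belief[node] = math.tanh(total)
    return belief


# Seed (assumed): "cats" is known to carry Number=Plur, encoded as +1.
belief = propagate({"cats": 1.0}, EDGES)
for word in sorted(belief):
    sign = "+1" if belief[word] > 0 else "-1" if belief[word] < 0 else "0"
    print(word, sign)
```

With this toy graph the plural forms end up positive and the singular forms negative, because each singular/plural edge carries a flip feature.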


Universal Lexicons

✤ Seed with Universal Dependencies (Nivre et al. ’16)

John saw Mary: saw = VERB, Tense=Past

The saw broke: saw = NOUN, Number=Sing

saw: NOUN:Number=Sing VERB:Tense=Past
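Collecting seed lexicon entries from annotated tokens can be sketched as follows. This is a minimal illustration, assuming a hypothetical `build_lexicon` helper and a simple (form, POS, features) tuple format rather than a real treebank reader.

```python
from collections import defaultdict


def build_lexicon(annotated_tokens):
    """Map each word form to the set of POS:feature analyses seen for it."""
    lexicon = defaultdict(set)
    for form, pos, feats in annotated_tokens:
        lexicon[form.lower()].add(f"{pos}:{feats}")
    return lexicon


# The two annotated occurrences of "saw" from the slide.
seed = [
    ("saw", "VERB", "Tense=Past"),   # John saw Mary
    ("saw", "NOUN", "Number=Sing"),  # The saw broke
]
lexicon = build_lexicon(seed)
print("saw", " ".join(sorted(lexicon["saw"])))
```

Keeping a set per word form lets an ambiguous form such as "saw" carry every analysis observed in the seed data.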

Resource Trade-off

[Figure: bar chart. Categories: Cs, Fi, Hu. Axis range: 90 to 98. Legend: Baseline*, +Auto Lexicon, +Gold, +Gold +Auto, +100% data. Bar labels, one group added per build of the slide: 91, 91.7, 95.5; 93.2, 93.6, 96.8; 95.1, 95.2, 97; 95.8, 95.7, 97.4; 91.7, 92.8, 96.8.]

Part-Of-Speech Tagging: Queries

Search Logs

Click

POS Taggers From Clicks (Ganchev et al. 2012)

[Figure: bar chart. Categories: MS-251-NVX, MS-251, Long Tail. Axis range: 70 to 95. Legend: Baseline, +Click. Bar labels, one group added per build of the slide: 74.3, 81.9, 92.8; 77.5, 84.5, 93.5.]

Morphosyntax Conclusions

✤ Money on more supervised data not necessarily optimal

✤ Better alternative: lexical resources (auto, manual & both)

✤ Better alternative: correlate usage statistics (click logs)

End User: Machine Translation

Pipelined Machine Translation

John saw Mary (nsubj, dobj) → John Mary saw (nsubj, dobj)

Preprocess → Reorder → Translate → Postorder → Postprocess
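The reorder step in the example can be illustrated with a toy rule over dependency arcs. This is a hand-written sketch, not the learned reorderer used in the talk: `reorder_sov` is a hypothetical helper that moves a head word after its direct object, and it ignores the object's own dependents.

```python
def reorder_sov(tokens, deps):
    """Move each head after its direct object (a toy SVO -> SOV rule).

    tokens: list of words
    deps:   list of (head_index, label, child_index) arcs
    """
    order = list(range(len(tokens)))
    for head, label, child in deps:
        if label == "dobj" and order.index(head) < order.index(child):
            order.remove(head)
            order.insert(order.index(child) + 1, head)
    return [tokens[i] for i in order]


tokens = ["John", "saw", "Mary"]
deps = [(1, "nsubj", 0), (1, "dobj", 2)]
reordered = reorder_sov(tokens, deps)
print(" ".join(reordered))
```

The dependency labels stay attached to the same words; only the surface order changes before translation.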

Reordering Data VS. BL

[Figure: bar chart of BLEU Δ, axis range 0 to 5. Categories: En->Ar, En->Iw, En->Ja. Legend: Human, Auto. Bar labels: 3.7, 1.4, 1.2, 3.5, 1.5, 1.]

Syntax-based reordered (Lerner & Petrov ’13)

[Figure: bar chart of BLEU Δ, axis range 0 to 5. Category: En->Ja. Legend: Human, Auto, Rule-based. Bar labels: 4.8, 3.7, 3.5.]

Better Parsers

[Figure: MT quality over time, axis range 72 to 82. Marked steps: Structured Training, Greedy transition-based, SSL, Better features.]

1pt improvement is significant to humans

More Data

[Figure: EN->JA reordering score, axis range 75 to 78, at 1x, 2x and 10x data. Legend: Syntax Reordering.]

Katz-Brown et al.; Hall et al. 2011

Machine Translation

✤ Human vs. auto data: about the same

✤ Human models sometimes better than learned

✤ Better parsing models = better translation

✤ Better to spend on targeted resources — reordering

vs.

End User: Sentence Compression

Sentence Compression @Google

Former Los Angeles Lakers head coach Phil Jackson won eleven NBA championships. He won six titles with the Chicago Bulls and five titles with the Lakers.

Google News

Headline

First sentence

Filippova & Altun ’13

✤ Can extract millions of pairs

✤ Quality ~= expert annotations

✤ 81.4 -> 84.3 F1 (10% -> 100% data)

Need For High Quality Annotations?

Filippova et al. ’15: LSTM compression by deletion (1/0)

[Figure: accuracy by input representation. word: 30; word + parent: 31; word + parent + syn struct: 34.]
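Compression by deletion assigns each token a 1 (keep) or 0 (delete) label and emits the kept tokens in order. The sketch below shows only that final step. `compress` is a hypothetical helper, and the labels are hand-written for illustration; they are not the output of the LSTM model.

```python
def compress(tokens, keep):
    """Keep token i when keep[i] == 1, delete it when keep[i] == 0."""
    if len(tokens) != len(keep):
        raise ValueError("need one 1/0 label per token")
    return " ".join(tok for tok, k in zip(tokens, keep) if k == 1)


sentence = ("Former Los Angeles Lakers head coach Phil Jackson "
            "won eleven NBA championships .").split()
# Hypothetical labels, written by hand (not model predictions).
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
compressed = compress(sentence, labels)
print(compressed)
```

Because the output is always a subsequence of the input, the task reduces to one binary decision per token.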

The Resource Trade-off

High quality annotations, Crowd-sourced, Auto resources

Data + model

End User: QA & Knowledge Extraction

QA & Knowledge Extraction

as of 2014

Weakly Sup. Knowledge Extraction (West et al. 2014)

R=parents

Extract relation templates/queries (search logs, QA system): parent of __, __’s parent, __ father, mother of __, …

Q=Frank Zappa

Issue queries (QA system): parent of Frank Zappa, Frank Zappa’s parent, Frank Zappa father, mother of Frank Zappa, …

Score entities in result snippets & aggregate: Mothers of Invention, Ray Collins, Rose Marie Colimore, Francis Zappa, Gail Zappa, Rose Marie
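The issue-queries-then-aggregate loop can be sketched as below. This is an illustration under stated assumptions, not the system of West et al.: `fill_slot` is a hypothetical helper, `fake_qa_system` is a stand-in for the real QA system, and every score in it is made up. Only the templates and entity names come from the slide.

```python
from collections import Counter

# Query templates for R=parents, as listed on the slide.
TEMPLATES = ["parent of {}", "{}'s parent", "{} father", "mother of {}"]


def fill_slot(subject, templates, answer_query):
    """Issue one query per template and sum the per-entity scores."""
    totals = Counter()
    for template in templates:
        query = template.format(subject)
        for entity, score in answer_query(query):
            totals[entity] += score
    return totals.most_common()


def fake_qa_system(query):
    """Hypothetical stand-in for the QA system; all scores are invented."""
    canned = {
        "parent of Frank Zappa": [("Francis Zappa", 0.6), ("Gail Zappa", 0.2)],
        "Frank Zappa's parent": [("Rose Marie Colimore", 0.5), ("Francis Zappa", 0.4)],
        "Frank Zappa father": [("Francis Zappa", 0.7), ("Ray Collins", 0.1)],
        "mother of Frank Zappa": [("Rose Marie Colimore", 0.6)],
    }
    return canned.get(query, [])


ranked = fill_slot("Frank Zappa", TEMPLATES, fake_qa_system)
for entity, score in ranked:
    print(f"{entity}\t{score:.1f}")
```

Aggregating over several templates lets entities that are returned consistently outrank ones that appear for a single phrasing.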

Does it Work?

High quality annotations, Crowd-sourced, Auto resources?

Top-level Conclusion

Syntax

Semantics

Thanks
