Mimicking Word Embeddings using Subword RNNs
Yuval Pinter, Robert Guthrie, Jacob Eisenstein
@yuvalpi
Presented at EMNLP, September 2017, Copenhagen
The Word Embedding Pipeline
[Diagram: an Unlabeled corpus (Wikipedia, GigaWord, ...) is used to train an Embedding model, i.e. vectors (W2V, GloVe, Polyglot, FastText, ...); these are combined with a Supervised task corpus (Penn TreeBank, SemEval, OntoNotes, Univ. Dependencies, ...) to train a Task model (Tagging, Parsing, Sentiment, NER, ...).]
Assumed Pattern
[Venn diagram: Supervised task text ⊂ Unlabeled text ⊂ All possible text]

Actual Pattern
[Venn diagram: the Supervised task text extends beyond the Unlabeled text, so some task words are out-of-vocabulary (OOV)]
● No pre-trained vectors
● Affects supervised tasks
● Multiple treatments suggested
● Our method: a compositional subword OOV model
Sources of OOVs
● Names: "Chalabi has increasingly marginalized within Iraq, ..."
● Domain-specific jargon: "Important species (...) include shrimp, (...) and some varieties of flatfish."
● Foreign words: "This term was first used in German (Hochrenaissance), ..."
● Rare morphological derivations: "Without George Martin the Beatles would have been just another untalented band as Oasis."
● Nonce words: "What if Google morphed into GoogleOS?"
● Nonstandard orthography: "We'll have four bands, and Big D is cookin'. lots of fun and great prizes."
● Typos and other errors: "I dislike this urban society and I want to leave this whole enviroment."
● ...
Common OOV handling techniques
● None (random init)
● One UNK to rule them all
○ Average existing embeddings
○ Trained with embeddings (stochastic unking)
● Add subword model during WE training
○ Bhatia et al. (2016), Wieting et al. (2016)
○ What if we don't have access to the original corpus? (e.g. FastText)
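The two UNK treatments above can be sketched in a few lines. This is a toy NumPy illustration, not code from any of the cited systems; the embedding table, vocabulary, and replacement probability `p` are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pre-trained embedding table: 5 in-vocab words, dimension 4.
vocab = ["the", "cat", "is", "sitting", "down"]
E = rng.normal(size=(len(vocab), 4))

# Treatment 1: one UNK vector, taken as the average of existing embeddings.
unk_avg = E.mean(axis=0)

# Treatment 2: "stochastic unking" -- during supervised training,
# occasionally replace in-vocab tokens with UNK so its vector gets trained.
def unk_indices(token_ids, p=0.1):
    """Replace each token id by the UNK id with probability p."""
    UNK_ID = len(vocab)  # UNK appended after the known vocabulary
    mask = rng.random(len(token_ids)) < p
    return np.where(mask, UNK_ID, np.asarray(token_ids))

ids = np.array([0, 1, 2, 3])        # "the cat is sitting"
noisy_ids = unk_indices(ids, p=0.5)  # some ids now point at UNK
```

The averaged UNK needs no training signal at all, while stochastic unking lets the UNK row receive gradient updates alongside the rest of the table.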
Char2Tag
● Add a subword layer to the supervised task
○ Ling et al. (2015), Plank et al. (2016)
● OOVs benefit from the co-trained character model
● Requires a large supervised training set for efficient transfer to test-set OOVs
Enter MIMICK
● What data do we have, post-unlabeled corpus?
○ Vector dictionary
○ Orthography (the way words are spelled)
● Use the former as training objective, the latter as input
● Pre-trained vectors as target
○ No need to access the original unlabeled corpus
○ Many training examples
○ (No context)
● Subword units as inputs
○ Very extensible
○ (Character inventory changes?)
MIMICK Training
[Diagram: the characters of "make" (m, a, k, e) are looked up in character embeddings, run through a forward LSTM and a backward LSTM, and fed to a multilayered perceptron; the output is trained with an L2 loss against the pre-trained embedding of "make" (Polyglot/FastText/etc.).]
MIMICK Inference
[Diagram: the OOV word "blah" passes through the same character embeddings, forward and backward LSTMs, and multilayered perceptron to produce a mimicked embedding.]
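The training and inference passes above can be sketched end-to-end. The released implementation uses DyNet LSTMs; this toy NumPy version substitutes simple tanh RNNs for the forward/backward LSTMs and uses made-up dimensions and random parameters, so it shows only the data flow, not the real model.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM_CHAR, DIM_HID, DIM_WORD = 8, 16, 4  # toy sizes, not the paper's

chars = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
C = rng.normal(scale=0.1, size=(len(chars), DIM_CHAR))          # char embeddings
Wf = rng.normal(scale=0.1, size=(DIM_CHAR + DIM_HID, DIM_HID))  # forward RNN
Wb = rng.normal(scale=0.1, size=(DIM_CHAR + DIM_HID, DIM_HID))  # backward RNN
W1 = rng.normal(scale=0.1, size=(2 * DIM_HID, DIM_HID))         # MLP hidden
W2 = rng.normal(scale=0.1, size=(DIM_HID, DIM_WORD))            # MLP output

def rnn_last(x, W):
    """Final hidden state of a simple tanh RNN (stand-in for an LSTM)."""
    h = np.zeros(W.shape[1])
    for xt in x:
        h = np.tanh(np.concatenate([xt, h]) @ W)
    return h

def mimick(word):
    """Characters -> biRNN -> MLP -> mimicked word embedding."""
    x = C[[chars[c] for c in word]]
    h = np.concatenate([rnn_last(x, Wf), rnn_last(x[::-1], Wb)])
    return np.tanh(h @ W1) @ W2

def l2_loss(word, target):
    """Training objective: squared distance to the pre-trained vector."""
    d = mimick(word) - target
    return d @ d

target_make = rng.normal(size=DIM_WORD)  # pretend pre-trained vector for "make"
loss = l2_loss("make", target_make)      # minimized over the vector dictionary
vec_blah = mimick("blah")                # inference: an OOV gets a vector
                                         # from its spelling alone
```

Training iterates the loss over every (spelling, pre-trained vector) pair in the dictionary; at test time only the character-side parameters are needed.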
Observation – Nearest Neighbors
● English (OOV → nearest in-vocab words)
○ MCT → AWS, OTA, APT, PDM
○ pesky → euphoric, disagreeable, horrid, ghastly
○ lawnmower → tradesman, bookmaker, postman, hairdresser
● Hebrew
○ תפתור תתג: "(she/you-3p.sg.) will come true" → "(she/you-3p.sg.) will solve"
○ טריים גיאומטריים: "geometric (m.pl., nontrad. spelling)" → "geometric (m.pl.)"
○ ארך רדסון’ריצ: "Richardson" → "Eustrach"
● ✔ Surface form ✔ Syntactic properties ✘ Semantics
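Nearest in-vocab neighbors like those above are typically found by cosine similarity between the mimicked vector and every pre-trained row. A small sketch, with an invented toy vocabulary and random vectors standing in for real embeddings:

```python
import numpy as np

def nearest_in_vocab(query_vec, vocab, E, k=4):
    """Return the k in-vocabulary words whose embedding rows in E are
    closest to query_vec by cosine similarity."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = En @ q
    order = np.argsort(-sims)[:k]
    return [vocab[i] for i in order]

# Toy check: an identical vector should rank its own word first.
vocab = ["euphoric", "disagreeable", "horrid", "ghastly", "table"]
rng = np.random.default_rng(2)
E = rng.normal(size=(5, 4))
neighbors = nearest_in_vocab(E[2], vocab, E, k=2)
```

In practice the query would be the mimicked embedding of an OOV word and E the full pre-trained table.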
Intrinsic Evaluation – RareWords
● RareWords similarity task: morphologically-complex, mostly unseen words
● (OOV sources recap: names; domain-specific jargon; foreign words; rare(-ish) morphological derivations; nonce words; nonstandard orthography; typos and other errors; ...)
● Nearest: programmatic → transformational, mechanistic, transactional, contextual
● NN FUN!!!
Extrinsic Evaluation – POS + Attribute Tagging
● UD is annotated for POS and morphosyntactic attributes
○ Eng: his stated goals → Tense=Past|VerbForm=Part
○ Cze: osoby v pokročilém věku ("people of advanced age") → Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing
● POS model from Ling et al. (2015)
[Diagram: word embeddings for "the cat is sitting" feed a forward and a backward LSTM, which predict the tags DT, NN, VBZ, VBG.]
● Attributes: same as the POS layer, with a separate output per attribute (POS, Number, Tense), e.g. "is" → VBZ, sing, pres
● Negative effect on POS
● Attribute evaluation metric
○ Micro F1
Language Selection
● |UD ∩ Polyglot| = 44, we took 23
● Morphological structure
○ 12 fusional
○ 3 analytic
○ 1 isolating
○ 7 agglutinative
● Genealogical diversity
○ 13 Indo-European (7 different branches)
○ 10 from 8 non-IE branches
● MRLs (e.g. Slavic languages)
○ Much word-level data
○ Relatively free word order
[Language family tree illustration by Minna Sundberg]
Language Selection (contd.)
● Script type
○ 7 in non-alphabetic scripts
○ Ideographic (Chinese): ~12K characters
○ Hebrew, Arabic: no casing, no vowels, syntactic fusion
○ Vietnamese: tokens are non-compositional syllables
● Attribute-carrying tokens
○ Range from 0% (Vietnamese) to 92.4% (Hindi)
● OOV rate (UD against Polyglot vocabulary)
○ 16.9%-70.8% type-level (median 29.1%)
○ 2.2%-33.1% token-level (median 9.2%)
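The type- and token-level OOV rates quoted above follow two simple definitions, which this sketch makes explicit (the token list and vocabulary here are invented toys, not UD or Polyglot data):

```python
def oov_rates(tokens, vocab):
    """Type-level and token-level OOV rate of a token list
    against a vocabulary.

    Type-level:  fraction of distinct word forms not in vocab.
    Token-level: fraction of running tokens not in vocab.
    """
    vocab = set(vocab)
    types = set(tokens)
    oov_types = {t for t in types if t not in vocab}
    oov_tokens = [t for t in tokens if t not in vocab]
    return len(oov_types) / len(types), len(oov_tokens) / len(tokens)

tokens = "the cat sat on the flatfish".split()
type_rate, token_rate = oov_rates(tokens, {"the", "cat", "sat", "on"})
```

Type-level rates run higher than token-level ones because OOVs are, by construction, rare forms that each occur few times.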
Evaluated Systems
● NONE: Polyglot's default UNK embedding
● MIMICK
● CHAR2TAG: additional RNN layer
○ 3x training time
● BOTH: MIMICK + CHAR2TAG
[Diagram: a char-LSTM over each word of "the flatfish is sitting" feeds the tagger.]
Results - Full Data
[Bar charts comparing NONE, MIMICK, CHAR2TAG, and BOTH on POS tags (accuracy) and morphological attributes (micro F1)]

Results - 5,000 training tokens
[Bar charts comparing NONE, MIMICK, CHAR2TAG, and BOTH on POS tags (accuracy) and morphological attributes (micro F1)]

Results - Language Types (5,000 tokens)
[Bar charts comparing NONE, MIMICK, CHAR2TAG, and BOTH on Slavic languages POS and on agglutinative languages morphological attribute F1]

Results - Chinese
[Bar charts comparing NONE, MIMICK, CHAR2TAG, and BOTH on POS tags (accuracy) and morphological attributes (micro F1)]
A Word (Model) from our Sponsor
● Our extrinsic results are on tagging
● Please consider us for all your WE use cases!
○ Sentiment!
○ Parsing!
○ IE!
○ QA!
○ ...
● Code compatible with w2v, Polyglot, FastText
● Models for Polyglot also on GitHub
○ <1MB each, DyNet format
○ Learn all OOVs in advance and add to the param table, or
○ Load into memory and infer on-line
Code & models:
https://github.com/yuvalpinter/Mimick
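The two usage modes in the last bullets, precomputing OOV vectors into the parameter table versus inferring on-line, could look roughly like the following. This is a hedged sketch: `mimick` is a deterministic stand-in for a trained model, not the repo's actual API, and the vocabulary and dimensions are invented.

```python
import numpy as np

def mimick(word, dim=4):
    """Stand-in for a trained MIMICK model (the real code loads a
    DyNet model); deterministic within a run via a word-keyed seed."""
    rng = np.random.default_rng(abs(hash(word)) % (2**32))
    return rng.normal(size=dim)

# Mode 1: learn all OOVs in advance and extend the parameter table.
vocab = ["the", "cat"]
E = np.zeros((2, 4))                      # pretend pre-trained rows
oovs = ["flatfish", "GoogleOS"]
E = np.vstack([E] + [mimick(w) for w in oovs])
vocab += oovs                             # OOVs now have fixed rows in E

# Mode 2: keep the model in memory and infer vectors on-line.
def lookup(word):
    return E[vocab.index(word)] if word in vocab else mimick(word)
```

Mode 1 trades memory for a fixed vocabulary known up front; mode 2 handles arbitrary unseen words at the cost of a forward pass per lookup.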
Conclusions
● MIMICK: an OOV-extension embedding processing step for downstream tasks
● Compositional model complementing a distributional artifact
● Powerful technique for low-resource scenarios
● Especially good for:
○ Morphologically-rich languages
○ Large character vocabularies
● Sore spots and future work
○ Vietnamese: syllabic vocabulary
○ Hebrew and Arabic: nontrivial tokenization, no case
○ Try other subword levels (morphemes, phonemes, bytes)
○ Improve the morphosyntactic attribute tagging scheme
Questions?
Code & models:
https://github.com/yuvalpinter/Mimick