Bilingual Word Embeddings: Supervision, Modeling ... · !Bilingual or Multilingual? !ypTe 1 and...

Bilingual Word Embeddings: Supervision,Modeling, Evaluation, Application

Ivan Vuli¢

University of Cambridge

South England NLP Meetup; London; September 22, 2016

[email protected]

1 / 47

mailto:[email protected]

Applications: Embeddings are Everywhere(Monolingual)

Intrinsic (semantic) tasks:

Semantic similarity and synonym detection, word analogiesConcept categorization

Selectional preferences, language grounding

Extrinsic (downstream) tasks:

Named entity recognition, part-of-speech tagging, chunkingSemantic role labeling, dependency parsing

Sentiment analysis

Beyond NLP:

Information retrieval, Web search

Question answering, dialogue systems2 / 47

Applications: Embeddings are Everywhere(Bilingual)


Cross-lingual and multilingual semantic similarity

Bilingual lexicon learning, multi-modal representations


Cross-lingual SRL, POS tagging, NERCross-lingual dependency parsing and sentiment analysisCross-lingual annotation and model transfer

Statistical/Neural machine translation

Beyond NLP:

(Cross-Lingual) Information retrieval

(Cross-Lingual) Question answering3 / 47

Word Embeddings

Dense representations → real-valued low-dimensional vectors

Word embedding induction→ learn word-level features which generalise well across tasks andlanguagesWord embeddings capture interesting and universal regularities:

4 / 47

Word Embeddings

Dense representations → real-valued low-dimensional vectors

Word embedding induction→ learn word-level features which generalise well across tasks andlanguagesWord embeddings capture interesting and universal regularities:

5 / 47

Motivation

The NLP community has developed useful features for several tasks but �ndingfeatures that are...

1. task-invariant (POS tagging, SRL, NER, parsing, ...)

(monolingual word embeddings)

2. language-invariant (English, Dutch, Chinese, Spanish, ...)

(bilingual word embeddings → this talk)

...is non-trivial and time-consuming (20+ years of feature engineering...)

6 / 47

Motivation

The NLP community has developed useful features for several tasks but �ndingfeatures that are...

1. task-invariant (POS tagging, SRL, NER, parsing, ...)

(monolingual word embeddings)

2. language-invariant (English, Dutch, Chinese, Spanish, ...)

(bilingual word embeddings → this talk)

...is non-trivial and time-consuming (20+ years of feature engineering...)

Learn word-level features which generalise across tasks andlanguages

7 / 47

Learning Word Representations

Key idea

Distributional hypothesis → words with similar meanings are likelyto appear in similar contexts

[Harris, Word 1954]

shouts:

�Meaning as use!�

calmly states:

�You shall know a word by the company it keeps.�8 / 47

Word Embeddings

9 / 47

Word Embedding Models

Models di�er based on:

1. How they compute the context- bag-of-words, positional, dependency-based,...

2. How they map context to target wt = f(con)- linear, bilinear, non-linear,...

3. How they measure the loss between wt and f(con) and howthey are trained- negative sampling, sampled rank loss, squared-error,...

10 / 47

Word Embeddings

Similar contexts → �similar� embeddings?

For a �xed context, all distributionally similar words will/should getupdated towards a common point

11 / 47

Monolingual...

Skip-gram with negative sampling (SGNS)[Mikolov et al.; NIPS 2013]

12 / 47

Monolingual...


13 / 47

Still Monolingual...


Learning from the set D of (word, context) pairs observed in a corpus:

(w, v) = (w(t), w(t± i)); i = 1, ..., cs; cs = context window size

SG learns to predict the context of the pivot word

John saw a cute gray huhblub running in the �eld.

D = (huhblub, cute), (huhblub, gray), (huhblub, running), (huhblub, in)

vec(huhblub) = [−0.23, 0.44,−0.76, 0.33, 0.19, . . .]

14 / 47

Still Monolingual...

Representation of each word w ∈ V :

vec(w) = [f1, f2, . . . , fdim]

Word representations in the same shared semantic (or embedding) space!

Image courtesy of [Gouws et al., ICML 2015]15 / 47

Bilingual Word Embeddings (BWEs)

Representation of a word wS1 ∈ V S :

vec(wS1 ) = [f11 , f

12 , . . . , f

1dim]

Exactly the same representation for wT2 ∈ V T :

vec(wT2 ) = [f21 , f

22 , . . . , f

2dim]

Language-independent word representations in the same sharedsemantic (or embedding) space

16 / 47

Bilingual Word Embeddings

Monolingual vs. Bilingual

Q1 → How to align semantic spaces in two di�erent languages?

Q2 → Which bilingual signals are used for the alignment?

See also:

[Upadhyay et al.: Cross-Lingual Models of Word Embeddings: An Empirical

Comparison; ACL 2016]17 / 47

Bilingual Signals?

(a) Word alignments?

(b) Sentence alignments?

(c) Word translation pairs?

(d) Document alignments?

18 / 47

Bilingual Word Embeddings

Two desirable properties:

P1 → Leverage (large) monolingual training sets tied togetherthrough a bilingual signal

P2 → Use as inexpensive bilingual signal as possible in orderto learn a shared space in a scalable and widely applicable manneracross languages and domains

19 / 47

BWEs and Bilingual Signals

(Type 1) Jointly learn and align BWEs using parallel-only data

[Hermann and Blunsom, ACL 2014; Chandar et al., NIPS 2014]

(Type 2) Jointly learn and align BWEs using monolingual and parallel data

[Gouws et al., ICML 2015; Soyer et al., ICLR 2015, Shi et al., ACL 2015]

(Type 3) Learn BWEs from comparable document-aligned data[Vuli¢ and Moens, ACL 2015, JAIR 2016]

(Type 4) Align pretrained monolingual embedding spaces using seed lexicons[Mikolov et al., arXiv 2013; Lazaridou et al., ACL 2015]

20 / 47








21 / 47

Type 1: Parallel-Only

22 / 47

Type 1: Example 1

[Hermann and Blunsom, ACL 2014]

Ebi(a, b) = ||f(a)− g(b)||2Ehl(a, b, n) = [m+ Ebi(a, b)− Ebi(a, n)]+

23 / 47

Type 1: Example 2

[Luong et al., NAACL 2015]

→ Using hard word alignments

→ E�ectively: training 4 SGNS models, L1 → L1, L1 → L2, L2 → L1,

L2 → L2

24 / 47








25 / 47

Type 2: Joint Mono+CL

A (very) simple formulation:

γ1MonolingualS + γ2MonolingualT + δBilingual

26 / 47

Type 2 Example: BilBOWA

[Gouws et al., ICML 2015]

27 / 47

Type 2 Example: BilBOWA

[Gouws et al., ICML 2015]

28 / 47








29 / 47

Type 3 Example

[Vuli¢ and Moens, ACL 2015, JAIR 2016]

→ Length-Ratio Shu�e: Training a SGNS (or any other monolingualmodel!) on shu�ed �pseudo-bilingual� documents

30 / 47

Type 4 Example

Post-Hoc Mapping with Seed Lexicons

31 / 47

Type 4 Example

Post-Hoc Mapping with Seed Lexicons

Learn to transform the pre-trained source language embeddingsinto a space where the distance between a word and its translationpair is minimised

Bilingual signal → word translation pairs32 / 47

Type 4 Example

Monolingual WE model on monolingual corpora

Bilingual signal → N word translation pairs (xi, yi) , i = 1, . . . , N

Transformation between spaces → we assume linear mapping[Mikolov et al., arXiv 2013; Dinu et al., ICLR WS 2015]

minW∈RdS×dT

||XW −Y||2F + λ||W||2F

X → Source language vectors for words from a training set

Y → Target language vectors for words from a training set

W → �Translation� (or transformation) matrix

33 / 47

Other Models/Upgrades

→ A Hybrid Model: Type 3 + Type 4 → A type-hybrid procedure whichretain only highly reliable translation pairs obtained by a Type 3 model as aseed lexicon for Type 4 models.[Vuli¢ and Korhonen, ACL 2016]

→ Bilingual or Multilingual? → Type 1 and Type 2 models encoding morethan 2 languages in their objectives.[Coulmance et al., EMNLP 2015]

→ Beyond the word level? → Learning cross-lingual representations forphrases and sentences.[Soyer et al., ICLR 2015]

→ Sentence IDs and Document IDs as Bilingual Signals → Inverted indexingplus dimensionality reduction.[Sogaard et al., ACL 2015, Levy and Sogaard, arXiv 2016]

→ ...

34 / 47

Ranked Lists in Bilingual Spaces

Spanish-English (ES-EN) Italian-English (IT-EN) Dutch-English (NL-EN)

(1)reina

(2)reina

(3)reina

(1)madre

(2)madre

(3)madre

(1)schilder

(2)schilder

(3)schilder

(Spanish) (English) (Combined) (Italian) (English) (Combined) (Dutch) (English) (Combined)

rey queen(+) queen(+) padre mother(+) mother(+) kunstschilderpainter(+) painter(+)trono heir rey moglie father padre schilderij painting kunstschildermonarca throne trono sorella sister moglie kunstenaar portrait paintingheredero king heir �glia wife father olieverf artist schilderijmatrimonio royal throne �glio daughter sorella olieverfschilderijcanvas kunstenaarhijo reign monarca fratello son �glia schilderen impressionistportraitreino succession heredero casa friend �glio frans cubism olieverfreinado princess king amico childhood sister nederlands art olieverfschilderijregencia marriage matrimonio marito family fratello componist poet schilderenduque prince royal donna cousin wife beeldhouwerdrawing artist

−−−→reina−−−−−−→woman+−−→man ≈ −→rey

−−−→queen−−−−−→mujer +−−−−−→hombre ≈

−−→king

−−−→reina−−−−−→mujer +

−−−−−→hombre ≈ −→rey

35 / 47

Example: Cross-Lingual IR

Beyond the level of words: We learn word embeddings:vec(huhblub) = [−0.23, 0.44,−0.76, 0.33, 0.19, . . .]vec(�u�y) = [0.31, 0.02,−0.11,−0.28, 0.52, . . .]

→ How to build document and query embeddings?vec(huhblup is �u�y) = ??

Adapting the framework from compositional distributional

semantics: [Mitchell and Lapata, ACL 2008; Socher et al., EMNLP 2011;

Milajevs et al., EMNLP 2014] and many more...

A generic composition with a bag-of-words assumption:(d = {w1, w2, . . . , w|Nd|})

−→d = −→w1 ?

−→w2 ? . . . ?−−−→w|Nd|

? = compositional vector operator (addition, multiplication, tensor product,..)36 / 47


Document and Query Embeddings

A general framework → in this work the simple and e�ectiveadditive composition:[Mitchell and Lapata, ACL 2008]

−→d = −→w1 +

−→w2 + . . .+−−−→w|Nd|

The dim-dimensional document embedding in the same bilingualword embedding space:

−→d = [fd,1, . . . , fd,k, . . . , fd,dim]

37 / 47


Document and Query Embeddings

→ The same principles with queries

−→Q = −→q1 +−→q2 + . . .+−→qm

The dim-dimensional query embedding in the same bilingualword embedding space:

−→Q = [fQ,1, . . . , fQ,k, . . . , fQ,dim]

38 / 47

Example: Cross-Lingual IR I

Final (Simple) CLIR Model[Vuli¢ and Moens, SIGIR 2015]

1 Induce a bilingual embedding space using any BWE model.

2 Given is a target document collection DC = {d′1, . . . , d′N ′}.Compute dim-dimensional document embeddings

−→d′ for

each d′ ∈ DC using the dim-dimensional WEs obtained in theprevious step and a semantic composition model.

3 After the query Q = {q1, . . . , qm} is issued in language LS ,compute a dim-dimensional query embedding.

39 / 47

Example: Cross-Lingual IR II

4 For each d′ ∈ DC, compute the semantic similarity scoresim(d′, Q) which quanti�es each document's relevance to thequery Q:

sim(d′, Q) = SF (d′, Q) =

−→d′ ·−→Q

|−→d′ | · |

−→Q |

5 Rank all documents from DC according to their similarityscores from the previous step.

40 / 47

Example Experiment: BLL

Task → Bilingual lexicon learning (BLL)

Goal → to build a non-probabilistic bilingual lexicon of word translations

Test Sets → ground truth word translation pairs built for three language pairs:Spanish (ES)-, Dutch (NL)-, Italian (IT)-English (EN)

[Vuli¢ and Moens, NAACL 2013, EMNLP 2013](Similar relative performance on other BLL test sets)

Evaluation Metric → Top 1 accuracy (Acc1)

(Similar model rankings with Acc5 and Acc10)

41 / 47

Example Experiment: BLL

Example Experiment: BLL(5K seed lexicons)

Model ES-EN NL-EN IT-EN

BiCVM (Type 1) 0.532 0.583 0.569BilBOWA (Type 2) 0.632 0.636 0.647BWESG (Type 3) 0.676 0.626 0.643

BNC+GT (Type 4) 0.677 0.641 0.646

ORTHO 0.233 0.506 0.224BNC+HYB+ASYM 0.673 0.626 0.644BNC+HYB+SYM 0.681 0.658* 0.663*(3388; 2738; 3145)HFQ+HYB+ASYM 0.673 0.596 0.635HFQ+HYB+SYM 0.695* 0.657* 0.667*

42 / 47

Evaluation and Comparison?

1. Cross-lingual document classi�cation (CLDC)

- Problematic? Word-level embeddings → document-levelevaluation?- Limited? Only for English-German; more language pairs required

2. Bilingual lexicon extraction

- Limited? Often ad-hoc and manually designed datasets.- Problematic? Contradictory results.

3. Cross-lingual POS tagging

- Too coarse-grained? Can it capture subtle di�erences between themodels?

...We need to �nd a better way of evaluating bilingual wordembeddings.

43 / 47

Evaluation and Comparison?

We need better evaluation protocols for cross-lingual embeddings...

How can we evaluate:

- phrase-level cross-lingual representations?

- sentence-level cross-lingual representations?

- document-level cross-lingual representations?

44 / 47

Applications: Embeddings are Everywhere(Bilingual)


Cross-lingual and multilingual semantic similarity

Bilingual lexicon learning, multi-modal representations


Cross-lingual SRL, POS tagging, NERCross-lingual dependency parsing and sentiment analysisCross-lingual annotation and model transfer

Statistical/Neural machine translation

Beyond NLP:

(Cross-Lingual) Information retrieval

(Cross-Lingual) Question answering45 / 47

Language Technology Lab @Cam

46 / 47

Questions?

47 / 47

Bilingual Word Embeddings: Supervision, Modeling ... · !Bilingual or Multilingual? !ypTe 1 and...

Documents

Transcript of Bilingual Word Embeddings: Supervision, Modeling ... · !Bilingual or Multilingual? !ypTe 1 and...