Bilingual Word Embeddings: Supervision, Modeling, Evaluation, Application
Ivan Vulić
University of Cambridge
South England NLP Meetup; London; September 22, 2016
Applications: Embeddings are Everywhere (Monolingual)
Intrinsic (semantic) tasks:
Semantic similarity and synonym detection, word analogies
Concept categorization
Selectional preferences, language grounding
Extrinsic (downstream) tasks:
Named entity recognition, part-of-speech tagging, chunking
Semantic role labeling, dependency parsing
Sentiment analysis
Beyond NLP:
Information retrieval, Web search
Question answering, dialogue systems
Applications: Embeddings are Everywhere (Bilingual)
Intrinsic (semantic) tasks:
Cross-lingual and multilingual semantic similarity
Bilingual lexicon learning, multi-modal representations
Extrinsic (downstream) tasks:
Cross-lingual SRL, POS tagging, NER
Cross-lingual dependency parsing and sentiment analysis
Cross-lingual annotation and model transfer
Statistical/Neural machine translation
Beyond NLP:
(Cross-Lingual) Information retrieval
(Cross-Lingual) Question answering
Word Embeddings
Dense representations → real-valued low-dimensional vectors
Word embedding induction → learn word-level features which generalise well across tasks and languages
Word embeddings capture interesting and universal regularities
Motivation
The NLP community has developed useful features for several tasks, but finding features that are...
1. task-invariant (POS tagging, SRL, NER, parsing, ...)
(monolingual word embeddings)
2. language-invariant (English, Dutch, Chinese, Spanish, ...)
(bilingual word embeddings → this talk)
...is non-trivial and time-consuming (20+ years of feature engineering...)
Learn word-level features which generalise across tasks and languages
Learning Word Representations
Key idea
Distributional hypothesis → words with similar meanings are likely to appear in similar contexts
[Harris, Word 1954]
[Wittgenstein] shouts:
“Meaning as use!”
[Firth] calmly states:
“You shall know a word by the company it keeps.”
Word Embedding Models
Models differ based on:
1. How they compute the context
- bag-of-words, positional, dependency-based, ...
2. How they map context to target, $w_t = f(con)$
- linear, bilinear, non-linear, ...
3. How they measure the loss between $w_t$ and $f(con)$ and how they are trained
- negative sampling, sampled rank loss, squared-error, ...
Word Embeddings
Similar contexts → “similar” embeddings?
For a fixed context, all distributionally similar words will/should get updated towards a common point
Monolingual...
Skip-gram with negative sampling (SGNS) [Mikolov et al.; NIPS 2013]
Still Monolingual...
Skip-gram with negative sampling (SGNS) [Mikolov et al.; NIPS 2013]
Learning from the set D of (word, context) pairs observed in a corpus:
(w, v) = (w(t), w(t ± i)); i = 1, ..., cs; cs = context window size
SG learns to predict the context of the pivot word
John saw a cute gray huhblub running in the field.
D = {(huhblub, cute), (huhblub, gray), (huhblub, running), (huhblub, in)}
vec(huhblub) = [−0.23, 0.44, −0.76, 0.33, 0.19, ...]
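A minimal sketch of how the set D of (word, context) pairs could be collected from a tokenised sentence, assuming a symmetric window of size cs. The function name `context_pairs` is hypothetical, and real SGNS implementations add subsampling and dynamic windows that are omitted here.

```python
def context_pairs(tokens, cs):
    """Return (pivot, context) pairs for every token, using a symmetric window of size cs."""
    pairs = []
    for t, pivot in enumerate(tokens):
        for i in range(1, cs + 1):
            for j in (t - i, t + i):
                # Skip positions that fall outside the sentence.
                if 0 <= j < len(tokens):
                    pairs.append((pivot, tokens[j]))
    return pairs


sentence = "John saw a cute gray huhblub running in the field".split()

# Keep only the pairs whose pivot is the unknown word from the slide.
D = [p for p in context_pairs(sentence, cs=2) if p[0] == "huhblub"]
print(D)
```

With cs = 2 the pairs for the pivot huhblub are the four listed on the slide, in a different order.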
Still Monolingual...
Representation of each word $w \in V$:
$vec(w) = [f_1, f_2, \ldots, f_{dim}]$
Word representations in the same shared semantic (or embedding) space!
Image courtesy of [Gouws et al., ICML 2015]
Bilingual Word Embeddings (BWEs)
Representation of a word $w_1^S \in V^S$:
$vec(w_1^S) = [f_1^1, f_2^1, \ldots, f_{dim}^1]$
Exactly the same representation for $w_2^T \in V^T$:
$vec(w_2^T) = [f_1^2, f_2^2, \ldots, f_{dim}^2]$
Language-independent word representations in the same shared semantic (or embedding) space
Bilingual Word Embeddings
Monolingual vs. Bilingual
Q1 → How to align semantic spaces in two different languages?
Q2 → Which bilingual signals are used for the alignment?
See also:
[Upadhyay et al.: Cross-Lingual Models of Word Embeddings: An Empirical Comparison; ACL 2016]
Bilingual Signals?
(a) Word alignments?
(b) Sentence alignments?
(c) Word translation pairs?
(d) Document alignments?
Bilingual Word Embeddings
Two desirable properties:
P1 → Leverage (large) monolingual training sets tied together through a bilingual signal
P2 → Use as inexpensive a bilingual signal as possible in order to learn a shared space in a scalable and widely applicable manner across languages and domains
BWEs and Bilingual Signals
(Type 1) Jointly learn and align BWEs using parallel-only data
[Hermann and Blunsom, ACL 2014; Chandar et al., NIPS 2014]
(Type 2) Jointly learn and align BWEs using monolingual and parallel data
[Gouws et al., ICML 2015; Soyer et al., ICLR 2015; Shi et al., ACL 2015]
(Type 3) Learn BWEs from comparable document-aligned data
[Vulić and Moens, ACL 2015, JAIR 2016]
(Type 4) Align pretrained monolingual embedding spaces using seed lexicons
[Mikolov et al., arXiv 2013; Lazaridou et al., ACL 2015]
Type 1: Parallel-Only
Type 1: Example 1
[Hermann and Blunsom, ACL 2014]
$E_{bi}(a, b) = \|f(a) - g(b)\|^2$
$E_{hl}(a, b, n) = [m + E_{bi}(a, b) - E_{bi}(a, n)]_+$
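The two energies above can be sketched in a few lines. This is only an illustration under stated assumptions: the composition functions f and g are taken to be plain additive composition (an assumed choice, not stated on the slide), the helper names are hypothetical, and the toy vectors are made up.

```python
def add_compose(vectors):
    """Additive composition: sum the word vectors of a sentence (assumed choice for f and g)."""
    return [sum(dims) for dims in zip(*vectors)]


def e_bi(a_vec, b_vec):
    """Squared Euclidean distance between two composed sentence vectors."""
    return sum((x - y) ** 2 for x, y in zip(a_vec, b_vec))


def e_hl(a_vec, b_vec, n_vec, m=1.0):
    """Hinge loss: the aligned pair (a, b) should be closer than (a, n) by at least margin m."""
    return max(0.0, m + e_bi(a_vec, b_vec) - e_bi(a_vec, n_vec))


a = add_compose([[0.5, 0.0], [0.5, 1.0]])  # source sentence
b = add_compose([[1.0, 0.5], [0.0, 0.5]])  # aligned target sentence
n = add_compose([[3.0, 0.0], [0.0, 0.0]])  # randomly sampled noise sentence

print(e_hl(a, b, n))  # margin satisfied: no loss
print(e_hl(a, n, b))  # roles swapped: the margin is violated
```

The loss is zero once the aligned sentence is closer to a than the noise sentence by at least m, so only violating triples contribute to training.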
Type 1: Example 2
[Luong et al., NAACL 2015]
→ Using hard word alignments
→ Effectively: training 4 SGNS models: L1 → L1, L1 → L2, L2 → L1, L2 → L2
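One way to picture the four training directions is to generate the (pivot, context) pairs each of them would see. The sketch below assumes that an aligned word predicts the neighbours of its counterpart in the other language; the function names, the toy sentence pair and the alignment are hypothetical, and details of the published model (for example whether the aligned word itself counts as context) are left out.

```python
def window(tokens, center, cs):
    """Tokens within cs positions of index center, excluding the center itself."""
    lo, hi = max(0, center - cs), min(len(tokens), center + cs + 1)
    return [tokens[k] for k in range(lo, hi) if k != center]


def biskip_pairs(src, tgt, alignment, cs):
    """Group (pivot, context) pairs by direction, given hard word alignments (i, j)."""
    pairs = {"L1->L1": [], "L1->L2": [], "L2->L1": [], "L2->L2": []}
    for i, w in enumerate(src):
        pairs["L1->L1"] += [(w, c) for c in window(src, i, cs)]
    for j, w in enumerate(tgt):
        pairs["L2->L2"] += [(w, c) for c in window(tgt, j, cs)]
    for i, j in alignment:
        # A source word predicts the neighbours of its aligned target word, and vice versa.
        pairs["L1->L2"] += [(src[i], c) for c in window(tgt, j, cs)]
        pairs["L2->L1"] += [(tgt[j], c) for c in window(src, i, cs)]
    return pairs


src = ["la", "reina", "habla"]
tgt = ["the", "queen", "speaks"]
pairs = biskip_pairs(src, tgt, alignment=[(0, 0), (1, 1), (2, 2)], cs=1)
print(pairs["L1->L2"])
```

Each of the four lists can then be fed to an ordinary SGNS trainer with shared embedding tables.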
Type 2: Joint Mono+CL
A (very) simple formulation:
$\gamma_1 \, \text{Monolingual}_S + \gamma_2 \, \text{Monolingual}_T + \delta \, \text{Bilingual}$
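As a rough sketch of this formulation, the joint loss is a weighted sum of three terms. The monolingual losses are passed in as plain numbers here, and the bilingual term is assumed to be the squared distance between mean bag-of-words vectors of an aligned sentence pair, in the spirit of BilBOWA; all names and toy values are hypothetical.

```python
def mean_vector(sentence, emb):
    """Average the embeddings of the words in a sentence."""
    dim = len(next(iter(emb.values())))
    return [sum(emb[w][k] for w in sentence) / len(sentence) for k in range(dim)]


def bilingual_term(src_sent, tgt_sent, emb_s, emb_t):
    """Squared distance between mean bag-of-words vectors of an aligned sentence pair."""
    ms, mt = mean_vector(src_sent, emb_s), mean_vector(tgt_sent, emb_t)
    return sum((x - y) ** 2 for x, y in zip(ms, mt))


def joint_loss(mono_s, mono_t, bilingual, gamma1=1.0, gamma2=1.0, delta=1.0):
    """Weighted sum of the two monolingual losses and the bilingual regulariser."""
    return gamma1 * mono_s + gamma2 * mono_t + delta * bilingual


emb_s = {"reina": [1.0, 0.0], "madre": [0.0, 1.0]}
emb_t = {"queen": [1.0, 0.0], "mother": [0.0, 0.0]}

bil = bilingual_term(["reina", "madre"], ["queen", "mother"], emb_s, emb_t)
print(joint_loss(mono_s=2.0, mono_t=3.0, bilingual=bil, delta=4.0))
```

Raising delta pulls the two spaces together more strongly at the cost of the monolingual objectives.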
Type 2 Example: BilBOWA
[Gouws et al., ICML 2015]
Type 3 Example
[Vulić and Moens, ACL 2015, JAIR 2016]
→ Length-Ratio Shuffle: Training an SGNS (or any other monolingual model!) on shuffled “pseudo-bilingual” documents
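A possible reading of the length-ratio shuffle is sketched below. The exact merging scheme is an assumption: with ratio R = len(longer) // len(shorter), R words of the longer document are followed by one word of the shorter one, and leftover words are appended. The function name and the language suffixes on the toy words are hypothetical.

```python
def length_ratio_shuffle(src_doc, tgt_doc):
    """Interleave two aligned documents into one pseudo-bilingual document.

    Word order inside each document is preserved (assumed scheme, see lead-in).
    """
    if len(src_doc) >= len(tgt_doc):
        longer, shorter = src_doc, tgt_doc
    else:
        longer, shorter = tgt_doc, src_doc
    if not shorter:
        return list(longer)
    ratio = len(longer) // len(shorter)
    merged, pos = [], 0
    for word in shorter:
        merged.extend(longer[pos:pos + ratio])  # R words from the longer document
        merged.append(word)                     # then one word from the shorter one
        pos += ratio
    merged.extend(longer[pos:])                 # leftover words of the longer document
    return merged


es = ["la_es", "reina_es", "habla_es", "hoy_es", "aqui_es"]
en = ["queen_en", "speaks_en"]
print(length_ratio_shuffle(es, en))
```

The merged token stream can be passed to any monolingual trainer, since context windows now contain words from both languages.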
Type 4 Example
Post-Hoc Mapping with Seed Lexicons
Learn to transform the pre-trained source language embeddings into a space where the distance between a word and its translation pair is minimised
Bilingual signal → word translation pairs
Type 4 Example
Monolingual WE model on monolingual corpora
Bilingual signal → N word translation pairs $(x_i, y_i)$, $i = 1, \ldots, N$
Transformation between spaces → we assume a linear mapping
[Mikolov et al., arXiv 2013; Dinu et al., ICLR WS 2015]
$\min_{W \in \mathbb{R}^{d_S \times d_T}} \|XW - Y\|_F^2 + \lambda \|W\|_F^2$
X → Source language vectors for words from a training set
Y → Target language vectors for words from a training set
W → “Translation” (or transformation) matrix
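The objective above is a ridge regression, so W has the closed form $(X^\top X + \lambda I)^{-1} X^\top Y$. The sketch below solves it with NumPy on synthetic data; the function name `learn_mapping`, the dimensions and the random "ground-truth" mapping are assumptions for illustration only.

```python
import numpy as np


def learn_mapping(X, Y, lam=0.0):
    """Solve min_W ||XW - Y||_F^2 + lam * ||W||_F^2 in closed form."""
    d_s = X.shape[1]
    # Normal equations of ridge regression: (X^T X + lam * I) W = X^T Y
    return np.linalg.solve(X.T @ X + lam * np.eye(d_s), X.T @ Y)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))      # 50 seed-lexicon source vectors, d_S = 4
W_true = rng.normal(size=(4, 3))  # hypothetical ground-truth mapping, d_T = 3
Y = X @ W_true                    # target vectors generated without noise

W = learn_mapping(X, Y, lam=1e-8)
print(np.allclose(W, W_true, atol=1e-5))

# A new source vector is translated by mapping it and searching its neighbours in the target space.
mapped = X[0] @ W
```

In practice X and Y come from separately trained monolingual models, and larger values of lam trade fit on the seed lexicon for a smaller-norm W.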
Other Models/Upgrades
→ A Hybrid Model: Type 3 + Type 4 → A type-hybrid procedure which retains only highly reliable translation pairs obtained by a Type 3 model as a seed lexicon for Type 4 models.
[Vulić and Korhonen, ACL 2016]
→ Bilingual or Multilingual? → Type 1 and Type 2 models encoding more than 2 languages in their objectives.
[Coulmance et al., EMNLP 2015]
→ Beyond the word level? → Learning cross-lingual representations for phrases and sentences.
[Soyer et al., ICLR 2015]
→ Sentence IDs and Document IDs as Bilingual Signals → Inverted indexing plus dimensionality reduction.
[Søgaard et al., ACL 2015; Levy and Søgaard, arXiv 2016]
→ ...
Ranked Lists in Bilingual Spaces
Spanish-English (ES-EN), query word: reina
(1) Spanish    (2) English    (3) Combined
rey            queen(+)       queen(+)
trono          heir           rey
monarca        throne         trono
heredero       king           heir
matrimonio     royal          throne
hijo           reign          monarca
reino          succession     heredero
reinado        princess       king
regencia       marriage       matrimonio
duque          prince         royal

Italian-English (IT-EN), query word: madre
(1) Italian    (2) English    (3) Combined
padre          mother(+)      mother(+)
moglie         father         padre
sorella        sister         moglie
figlia         wife           father
figlio         daughter       sorella
fratello       son            figlia
casa           friend         figlio
amico          childhood      sister
marito         family         fratello
donna          cousin         wife

Dutch-English (NL-EN), query word: schilder
(1) Dutch             (2) English      (3) Combined
kunstschilder         painter(+)       painter(+)
schilderij            painting         kunstschilder
kunstenaar            portrait         painting
olieverf              artist           schilderij
olieverfschilderij    canvas           kunstenaar
schilderen            impressionist    portrait
frans                 cubism           olieverf
nederlands            art              olieverfschilderij
componist             poet             schilderen
beeldhouwer           drawing          artist

$\overrightarrow{reina} - \overrightarrow{woman} + \overrightarrow{man} \approx \overrightarrow{rey}$
$\overrightarrow{queen} - \overrightarrow{mujer} + \overrightarrow{hombre} \approx \overrightarrow{king}$
$\overrightarrow{reina} - \overrightarrow{mujer} + \overrightarrow{hombre} \approx \overrightarrow{rey}$
Example: Cross-Lingual IR
Beyond the level of words: We learn word embeddings:
vec(huhblub) = [−0.23, 0.44, −0.76, 0.33, 0.19, ...]
vec(fluffy) = [0.31, 0.02, −0.11, −0.28, 0.52, ...]
→ How to build document and query embeddings?
vec(huhblub is fluffy) = ??
Adapting the framework from compositional distributional semantics: [Mitchell and Lapata, ACL 2008; Socher et al., EMNLP 2011; Milajevs et al., EMNLP 2014] and many more...
A generic composition with a bag-of-words assumption ($d = \{w_1, w_2, \ldots, w_{|N_d|}\}$):
$\vec{d} = \vec{w_1} \star \vec{w_2} \star \ldots \star \vec{w_{|N_d|}}$
$\star$ = compositional vector operator (addition, multiplication, tensor product, ...)
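The generic composition can be sketched as a fold of a binary element-wise operator over the word vectors. Only addition and multiplication are shown; the two-dimensional vectors reuse the first two components from the slide, and the vector for "is" is a made-up value.

```python
from functools import reduce


def compose(word_vectors, op):
    """Fold a binary element-wise operator over the word vectors of a document."""
    return reduce(lambda u, v: [op(x, y) for x, y in zip(u, v)], word_vectors)


# Toy 2-d embeddings; the entry for "is" is hypothetical.
emb = {"huhblub": [-0.23, 0.44], "is": [0.10, 0.10], "fluffy": [0.31, 0.02]}
doc = [emb[w] for w in "huhblub is fluffy".split()]

added = compose(doc, lambda x, y: x + y)       # additive composition
multiplied = compose(doc, lambda x, y: x * y)  # element-wise multiplicative composition
print(added, multiplied)
```

Because both operators are commutative, the result does not depend on word order, which is exactly the bag-of-words assumption.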
Example: Cross-Lingual IR
Document and Query Embeddings
A general framework → in this work the simple and effective additive composition:
[Mitchell and Lapata, ACL 2008]
$\vec{d} = \vec{w_1} + \vec{w_2} + \ldots + \vec{w_{|N_d|}}$
The dim-dimensional document embedding in the same bilingual word embedding space:
$\vec{d} = [f_{d,1}, \ldots, f_{d,k}, \ldots, f_{d,dim}]$
→ The same principles with queries
$\vec{Q} = \vec{q_1} + \vec{q_2} + \ldots + \vec{q_m}$
The dim-dimensional query embedding in the same bilingual word embedding space:
$\vec{Q} = [f_{Q,1}, \ldots, f_{Q,k}, \ldots, f_{Q,dim}]$
Example: Cross-Lingual IR I
Final (Simple) CLIR Model [Vulić and Moens, SIGIR 2015]
1. Induce a bilingual embedding space using any BWE model.
2. Given a target document collection $DC = \{d'_1, \ldots, d'_{N'}\}$, compute dim-dimensional document embeddings $\vec{d'}$ for each $d' \in DC$ using the dim-dimensional WEs obtained in the previous step and a semantic composition model.
3. After the query $Q = \{q_1, \ldots, q_m\}$ is issued in language $L_S$, compute a dim-dimensional query embedding.
Example: Cross-Lingual IR II
4. For each $d' \in DC$, compute the semantic similarity score $sim(d', Q)$, which quantifies each document's relevance to the query $Q$:
$sim(d', Q) = SF(d', Q) = \frac{\vec{d'} \cdot \vec{Q}}{|\vec{d'}| \cdot |\vec{Q}|}$
5. Rank all documents from $DC$ according to their similarity scores from the previous step.
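Steps 2 to 5 can be sketched end to end on a toy shared space. This assumes additive composition and cosine similarity as on the slides; the embedding values, document IDs and function names are hypothetical, and step 1 (inducing the space) is taken as given.

```python
import math


def embed(words, emb):
    """Additive composition over the words found in the shared embedding space."""
    vecs = [emb[w] for w in words if w in emb]
    return [sum(dims) for dims in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is empty or all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank(query, collection, emb):
    """Return document IDs sorted by decreasing similarity to the query."""
    q = embed(query, emb)
    scores = {doc_id: cosine(embed(words, emb), q) for doc_id, words in collection.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy shared space: Spanish and English words live in the same 2-d space (made-up values).
emb = {"reina": [0.9, 0.1], "queen": [1.0, 0.0], "king": [0.8, 0.3],
       "pintor": [0.1, 0.9], "painter": [0.0, 1.0], "canvas": [0.2, 0.8]}
collection = {"d1": ["queen", "king"], "d2": ["painter", "canvas"]}

print(rank(["reina"], collection, emb))
print(rank(["pintor"], collection, emb))
```

The Spanish query never has to be translated: it is compared with English documents directly in the shared space.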
Example Experiment: BLL
Task → Bilingual lexicon learning (BLL)
Goal → to build a non-probabilistic bilingual lexicon of word translations
Test Sets → ground truth word translation pairs built for three language pairs: Spanish (ES)-, Dutch (NL)-, Italian (IT)-English (EN)
[Vulić and Moens, NAACL 2013, EMNLP 2013]
(Similar relative performance on other BLL test sets)
Evaluation Metric → Top 1 accuracy (Acc1)
(Similar model rankings with Acc5 and Acc10)
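A small sketch of how Acc1 could be computed: for each source word, retrieve the nearest target word by cosine similarity and check it against the gold translation. It assumes one gold translation per source word; the embeddings and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def nearest_target(src_word, emb_s, emb_t):
    """Target word with the highest cosine similarity to the source word."""
    return max(emb_t, key=lambda t: cosine(emb_s[src_word], emb_t[t]))


def acc1(gold, emb_s, emb_t):
    """Fraction of source words whose top-ranked target word is the gold translation."""
    hits = sum(nearest_target(s, emb_s, emb_t) == t for s, t in gold.items())
    return hits / len(gold)


# Toy shared space with made-up values; "pintor" is deliberately placed closer to "mother".
emb_s = {"reina": [0.9, 0.1, 0.0], "madre": [0.1, 0.9, 0.0], "pintor": [0.0, 0.8, 0.6]}
emb_t = {"queen": [1.0, 0.0, 0.0], "mother": [0.0, 1.0, 0.0], "painter": [0.0, 0.0, 1.0]}
gold = {"reina": "queen", "madre": "mother", "pintor": "painter"}

score = acc1(gold, emb_s, emb_t)
print(score)
```

Acc5 and Acc10 follow the same pattern, counting a hit when the gold translation appears among the top 5 or 10 candidates.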
Example Experiment: BLL (5K seed lexicons)
Model               ES-EN     NL-EN     IT-EN
BiCVM (Type 1)      0.532     0.583     0.569
BilBOWA (Type 2)    0.632     0.636     0.647
BWESG (Type 3)      0.676     0.626     0.643
BNC+GT (Type 4)     0.677     0.641     0.646
ORTHO               0.233     0.506     0.224
BNC+HYB+ASYM        0.673     0.626     0.644
BNC+HYB+SYM         0.681     0.658*    0.663*
(3388; 2738; 3145)
HFQ+HYB+ASYM        0.673     0.596     0.635
HFQ+HYB+SYM         0.695*    0.657*    0.667*
Evaluation and Comparison?
1. Cross-lingual document classification (CLDC)
- Problematic? Word-level embeddings → document-level evaluation?
- Limited? Only for English-German; more language pairs required
2. Bilingual lexicon extraction
- Limited? Often ad-hoc and manually designed datasets.
- Problematic? Contradictory results.
3. Cross-lingual POS tagging
- Too coarse-grained? Can it capture subtle differences between the models?
...We need to find a better way of evaluating bilingual word embeddings.
Evaluation and Comparison?
We need better evaluation protocols for cross-lingual embeddings...
How can we evaluate:
- phrase-level cross-lingual representations?
- sentence-level cross-lingual representations?
- document-level cross-lingual representations?
Language Technology Lab @Cam
Questions?