Natural Language Processing
Word vectors
Many slides borrowed from Richard Socher, Chris Manning, and Hugo Larochelle
Lecture plan
• Word representations
• Word vectors (embeddings)
• skip-gram algorithm
• Relation to matrix factorization
• Evaluation
Representing words
Representing words
Definition: meaning (Webster dictionary)
• the idea that is represented by a word, phrase, etc.
• the idea that a person wants to express by using words, signs, etc.
• the idea that is expressed in a work of writing, art, etc.
In linguistics:
signifier <—> signified (idea or thing) = denotation
Taxonomies
Taxonomies
“beverage”
Representing words with computers
A word is the set of meanings it has in a taxonomy (graph of meanings)
Hypernym: “is-a” relation
Hyponym: the opposite of “hypernym”
Drawbacks
• Expensive!
• Subjective (how to split different synsets?)
• Incomplete
• wicked, badass, nifty, crack, ace, wizard, genius, ninja
• Missing functionality:
• how do you compute word similarity?
• How to compose meanings?
Discrete representation
Words are atomic symbols (one-hot representation):

V = {hotel, motel, walk, wife, spouse}

$|V| \approx 100{,}000$
hotel [1 0 0 0 0]
motel [0 1 0 0 0]
walk [0 0 1 0 0]
wife [0 0 0 1 0]
spouse [0 0 0 0 1]
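A minimal numpy sketch of these one-hot vectors; it makes the upcoming drawback concrete: any two distinct one-hot vectors have dot product 0.

```python
import numpy as np

vocab = ["hotel", "motel", "walk", "wife", "spouse"]
# One-hot vector for each word: row i of the identity matrix
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Every pair of distinct words is orthogonal: no notion of similarity.
sim_motel = one_hot["hotel"] @ one_hot["motel"]
sim_spouse = one_hot["hotel"] @ one_hot["spouse"]
```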
Drawback
Barack Obama’s wife ≈ Barack Obama’s spouse
Barack Obama’s wife ≉ Barack Obama’s advisors

Seattle motels ≈ Seattle hotels
Seattle motels ≉ Seattle attractions
But all word vectors are orthogonal and equidistant
Goal: word vectors with a natural notion of similarity
$\langle \text{“hotel”}, \text{“motel”} \rangle > \langle \text{“hotel”}, \text{“spouse”} \rangle$
Distributional similarity
“You shall know a word by the company it keeps”
(Firth, 1957)
“… cashed a check at the bank across the street…” “… that bank holds the mortgage on my home…” “… said that the bank raised his forecast for…” “… employees of the bank have confessed to the charges”
Central idea: represent words by their context
Idea 1

word → context
wife → {met: 3, married: 4, children: 2, wedded: 1, …}
spouse → {met: 2, married: 5, children: 2, kids: 1, …}

Problem:
• married <—> wedded
• children <—> kids
Distributed representations
language = [0.278, −0.911, 0.792, −0.177, 0.109, −0.542, −0.0003]
• Represent words as low-dimensional vectors
• Represent similarity with vector similarity metrics
Word vectors
Motivation
• Word embeddings are widely used
• (other options exist: word-parts, character-level,…).
• The great innovation of 2018 - contextualized word embeddings.
Supervised learning
• Input: training set $\{(x_i, y_i)\}_{i=1}^{N},\ (x_i, y_i) \sim \mathcal{D}(\mathcal{X} \times \mathcal{Y})$
• Output (probabilistic model): $f : \mathcal{X} \to \mathcal{Y}$, $f(x) = \arg\max_y p(y \mid x)$
• Example: train a spam detector from spam and non-spam e-mails.
Intro to ML prerequisite
Word embeddings
“… that bank holds the mortgage on my home …”
1. Define a supervised learning task from raw text (no manual annotation!):
1. (x, y) = (bank, that)
2. (x, y) = (bank, holds)
3. (x, y) = (holds, bank)
4. (x, y) = (holds, the)
5. (x, y) = (the, holds)
6. (x, y) = (the, mortgage)
7. (x, y) = (mortgage, the)
8. (x, y) = (mortgage, on)
9. (x, y) = (on, mortgage)
10. (x, y) = (on, my)
11. (x, y) = (my, on)
12. (x, y) = (my, home)
…
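The pair extraction above can be sketched in a few lines of Python; `skipgram_pairs` is an illustrative name, and `window` is the context size m (here 1, matching the list):

```python
def skipgram_pairs(tokens, window=1):
    """Collect (center, context) pairs for every position and offset j != 0."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

tokens = "that bank holds the mortgage on my home".split()
pairs = skipgram_pairs(tokens)
```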
Word embeddings
2. Define model for output given input — p(“holds” | “bank”)

$p_\theta(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}$

• u: vector for the “outside” word, v: vector for the “center” word, V: number of words in the vocabulary, θ: all parameters
• Multi-class classification model (number of classes?)
• How many parameters are in the model?
Word embeddings
2. Define model for output given input — p(“holds” | “bank”)

$p_\theta(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}$

• u: vector for the “outside” word, v: vector for the “center” word, V: number of words in the vocabulary, θ: all parameters
• Multi-class classification model (number of classes?)
• How many parameters are in the model:

$|\theta| = 2 \cdot V \cdot d,\qquad u, v \in \mathbb{R}^{d}$
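The softmax model and its parameter count can be sketched in numpy (toy sizes for V and d, random vectors; `p_outside_given_center` and the matrix names are illustrative, not from the slides):

```python
import numpy as np

V, d = 5, 3                       # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))       # "outside" vectors u_w
C = rng.normal(size=(V, d))       # "center" vectors v_w

def p_outside_given_center(o, c):
    scores = U @ C[c]                      # u_w^T v_c for every w
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    probs /= probs.sum()
    return probs[o]

# A valid distribution over all V classes, for center word 0
probs = np.array([p_outside_given_center(o, 0) for o in range(V)])
n_params = U.size + C.size                 # |theta| = 2 * V * d
```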
Word embeddings
3. Define objective function for a corpus of length T:

$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} p_\theta(w_{t+j} \mid w_t)$

$J(\theta) = \log L(\theta) = \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p_\theta(w_{t+j} \mid w_t)$

Find parameters that maximize the objective

Intro to ML prerequisite
Class 1 Recap
• Word representations:
• Ontology-based
• Pros: polysemy, similarity metrics
• Cons: expensive, compositionality, granularity
• One-hot
• Pros: cheap, simple, scales, compositionality
• Cons: no similarity
• Embeddings:
• Cheap, simple, scales, compositionality, similarity
Today
• Word2vec
• Efficiency:
• Hierarchical softmax
• Skip-gram with negative sampling (assignment 1)
• Skip-gram as matrix factorization
• Evaluation (GloVe)
Word embeddings
“… that bank holds the mortgage on my home …”
1. Define a supervised learning task from raw text (no manual annotation!):
1. (x, y) = (bank, that)
2. (x, y) = (bank, holds)
3. (x, y) = (holds, bank)
4. (x, y) = (holds, the)
5. (x, y) = (the, holds)
6. (x, y) = (the, mortgage)
7. (x, y) = (mortgage, the)
8. (x, y) = (mortgage, on)
9. (x, y) = (on, mortgage)
10. (x, y) = (on, my)
11. (x, y) = (my, on)
12. (x, y) = (my, home)
…
Mikolov et al., 2013
Word embeddings
2. Define model for output given input — p(“holds” | “bank”)

$p_\theta(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}$

• u: vector for the “outside” word, v: vector for the “center” word, V: number of words in the vocabulary, θ: all parameters
• We don’t really need the distribution - only the representation!

$J(\theta) = \log L(\theta) = \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p_\theta(w_{t+j} \mid w_t)$
Word embeddings
• What probabilities would maximize the objective?
$L(\Theta) = \prod_{c,o} p(o \mid c)^{\#(c,o)}$

We can solve separately for each center word c:

$L_c(\Theta) = \prod_{o} p(o \mid c)^{\#(c,o)}$

Solve for:

$J_c(\Theta) = \sum_{i} \#(c, o_i) \log p(o_i \mid c)$
s.t. $\sum_{i} p(o_i \mid c) = 1,\quad p(o_i \mid c) \ge 0$

Use Lagrange multipliers:

$\mathcal{L}(\Theta, \lambda) = \sum_{i} \#(c, o_i) \log p(o_i \mid c) - \lambda\Big(\big(\sum_{i} p(o_i \mid c)\big) - 1\Big)$

$\nabla_{p(o_i \mid c)} \mathcal{L} = \frac{\#(c, o_i)}{p(o_i \mid c)} - \lambda = 0$

$p(o_i \mid c) = \frac{\#(c, o_i)}{\lambda}$

$\sum_{i} p(o_i \mid c) = \sum_{i} \frac{\#(c, o_i)}{\lambda} = 1 \;\Rightarrow\; \lambda = \sum_{i} \#(c, o_i)$

$p(o_i \mid c) = \frac{\#(c, o_i)}{\sum_{i} \#(c, o_i)}$
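A quick numeric sanity check of this closed form, with made-up counts for one center word: the count-ratio distribution scores at least as high under J_c as any competing distribution on the simplex.

```python
import numpy as np

counts = np.array([4.0, 3.0, 2.0, 1.0])   # made-up #(c, o_i) for one center word c
p_star = counts / counts.sum()            # the claimed maximizer

def J_c(p):
    # Constrained objective: sum_i #(c, o_i) * log p(o_i | c)
    return float(np.sum(counts * np.log(p)))

rng = np.random.default_rng(0)
for _ in range(100):
    q = rng.dirichlet(np.ones(len(counts)))   # random competing distribution
    assert J_c(p_star) >= J_c(q)
```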
Questions
• Intuitions:
• Why should similar words have similar vectors?
• Why do we have different parameters for the center word and the output word?
Gradient descent
3. How to find parameters that minimize the objective?
• Start at some point and move in the opposite direction of the gradient
Gradient descent
$f(x) = x^4 + 3x^3 + 2$

$f'(x) = 4x^3 + 9x^2$
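Running plain gradient descent on this 1-D example (starting point, step size, and iteration count are arbitrary choices); from x = -1 it converges to the local minimum at x = -9/4, where f'(x) = 0:

```python
def f(x):
    return x**4 + 3 * x**3 + 2

def f_prime(x):
    return 4 * x**3 + 9 * x**2

x, alpha = -1.0, 0.01            # arbitrary starting point and step size
for _ in range(1000):
    x -= alpha * f_prime(x)      # move in the opposite direction of the gradient
```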
Gradient descent
• We want to minimize:

$J(\theta) = -\sum_{t=1}^{T} \sum_{j} \log p_\theta(w_{t+j} \mid w_t)$

• Update rule:

$\theta_j^{\text{new}} = \theta_j^{\text{old}} - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$

$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla J(\theta)$

• α is a step size
• $\theta \in \mathbb{R}^{2Vd}$
Stochastic gradient descent
• For large corpora (billions of tokens) this update is very slow
• Sample a window t
• Update gradients based on that window
$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla J_t(\theta)$
Deriving the gradient
• Mostly applications of the chain rule
• Let’s derive the gradient of a center word for a single output word
• You will do this again in the assignment (and more)
$\log p_\theta(w_{t+j} \mid w_t)$
Gradient derivation

$\mathcal{L}(\Theta) = \log p(o \mid c) = \log \frac{\exp(u_o^\top v_c)}{\sum_i \exp(u_{o_i}^\top v_c)} = u_o^\top v_c - \log \sum_i \exp(u_{o_i}^\top v_c)$

$\nabla_{v_c} \mathcal{L} = u_o - \frac{1}{\sum_j \exp(u_{o_j}^\top v_c)} \cdot \sum_i \exp(u_{o_i}^\top v_c) \cdot u_{o_i}$

$= u_o - \sum_i \frac{\exp(u_{o_i}^\top v_c)}{\sum_j \exp(u_{o_j}^\top v_c)} \cdot u_{o_i}$

$= u_o - \sum_i p(o_i \mid c) \cdot u_{o_i} = u_o - \mathbb{E}_{o_i \sim p(o_i \mid c)}[u_{o_i}]$
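The identity ∇_{v_c} log p(o | c) = u_o − E[u] can be verified against central finite differences (random toy vectors; all variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4
U = rng.normal(size=(V, d))       # outside vectors u_{o_i}
v_c = rng.normal(size=d)          # center vector
o = 2                             # index of the observed outside word

def log_p(v):
    s = U @ v
    return s[o] - np.log(np.exp(s).sum())

# Analytic gradient: u_o - E_{i ~ p(i|c)}[u_i]
p = np.exp(U @ v_c)
p /= p.sum()
grad = U[o] - p @ U

# Central finite differences, one coordinate at a time
eps = 1e-6
num = np.array([(log_p(v_c + eps * e) - log_p(v_c - eps * e)) / (2 * eps)
                for e in np.eye(d)])
```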
Recap
• Goal: represent words with low-dimensional vectors
• Approach: Define a supervised learning problem from a corpus
• We defined the necessary components for skip-gram:
• Model (softmax over word labels for each word)
• Objective (minimize Negative Log Likelihood)
• Optimize with SGD
• We computed the gradient for some parameters by hand
Computational problem
• Computing the partition function is too expensive
• Solution 1: hierarchical softmax (Morin and Bengio, 2005) reduces computation time to log|V| by constructing a binary tree over the vocabulary
• Solution 2: Change the objective
• skip-gram with negative sampling (home assignment 1)
Hierarchical softmax
• p(“cat” | “dog”) = p(left at 1) × p(right at 2) × p(right at 5)
= (1 − p(right at 1)) × p(right at 2) × p(right at 5)

[Figure: binary tree over the vocabulary with internal nodes 1–7 and leaves he, she, and, cat, the, have, be, are; “dog” is the center word]

$p(\text{cat} \mid \text{dog}) = (1 - \sigma(o_1^\top c_{\text{dog}})) \times \sigma(o_2^\top c_{\text{dog}}) \times \sigma(o_5^\top c_{\text{dog}})$
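A sketch of the idea with random vectors: a complete binary tree over 8 leaves with heap-indexed inner nodes 1–7 (matching the slide's numbering), where each leaf probability is a product of log₂|V| sigmoids. The left/right sign convention and all names here are assumptions for illustration.

```python
import numpy as np
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 3
c_dog = rng.normal(size=d)                                # center vector for "dog"
node_vecs = {n: rng.normal(size=d) for n in range(1, 8)}  # inner-node vectors o_n

def leaf_prob(path):
    """Probability of the leaf reached by a left(0)/right(1) path from the root."""
    p, node = 1.0, 1
    for go_right in path:
        pr = sigmoid(node_vecs[node] @ c_dog)  # p(right at this node)
        p *= pr if go_right else (1.0 - pr)
        node = 2 * node + go_right             # heap indexing of children
    return p

# p("cat" | "dog"): left at 1, right at 2, right at 5 -> path (0, 1, 1)
p_cat = leaf_prob((0, 1, 1))

# Leaf probabilities sum to 1 without ever computing a partition function.
total = sum(leaf_prob(p) for p in product((0, 1), repeat=3))
```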
Hierarchical softmax
• How to construct the tree?
• Randomly (doesn’t work well but better than you’d think)
• Using external knowledge like WordNet
• Learn word representations somehow and then cluster
Skip-gram with Negative Sampling
(x, y) = ((bank, holds), 1)
(x, y) = ((bank, table), 0)
(x, y) = ((bank, eat), 0)
(x, y) = ((holds, bank), 1)
(x, y) = ((holds, quickly), 0)
(x, y) = ((holds, which), 0)
(x, y) = ((the, mortgage), 1)
(x, y) = ((the, eat), 0)
(x, y) = ((the, who), 0)
What information is lost?
Skip-gram with Negative Sampling
(x, y) = ((bank, holds), 1)
(x, y) = ((bank, table), 0)
(x, y) = ((bank, eat), 0)
(x, y) = ((holds, bank), 1)
(x, y) = ((holds, quickly), 0)
(x, y) = ((holds, which), 0)
(x, y) = ((the, mortgage), 1)
(x, y) = ((the, eat), 0)
(x, y) = ((the, who), 0)

What information is lost?
$\sum_{o \in V} p(y = 1 \mid o, c) = ?$
Skip-gram with Negative Sampling
• Model:

$p_\theta(y = 1 \mid c, o) = \frac{1}{1 + \exp(-u_o^\top v_c)} = \sigma(u_o^\top v_c)$

$p_\theta(y = 0 \mid c, o) = 1 - \sigma(u_o^\top v_c) = \sigma(-u_o^\top v_c)$

• Objective:

$\sum_{t,j} \Big[ \log(\sigma(u_{w_{t+j}}^\top v_{w_t})) + \sum_{k \sim p(w)} \log(\sigma(-u_{w^{(k)}}^\top v_{w_t})) \Big]$

• $p(w) = U(w)^{3/4} / T$
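The negative-sampling loss for a single (center, outside) pair can be sketched in numpy; the sizes, counts, and function name are made up for illustration, and negatives are drawn from the smoothed unigram distribution above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, d, k = 10, 4, 5
U = rng.normal(size=(V, d))               # "outside" vectors
C = rng.normal(size=(V, d))               # "center" vectors

counts = rng.integers(1, 100, size=V).astype(float)   # made-up unigram counts
p_neg = counts ** 0.75                    # smoothed unigram distribution U(w)^{3/4}
p_neg /= p_neg.sum()

def sgns_loss(center, outside):
    """Negative log-likelihood for one (center, outside) pair with k negatives."""
    negs = rng.choice(V, size=k, p=p_neg)
    pos = np.log(sigmoid(U[outside] @ C[center]))
    neg = np.log(sigmoid(-U[negs] @ C[center])).sum()
    return -(pos + neg)

loss = sgns_loss(center=0, outside=3)
```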
Summary
• We defined the three necessary components.
• Model (binary classification)
• Objective (maximum likelihood with negative sampling)
• Optimization method (SGD)
Many variants
• CBOW: predict the center word from its context
• Defining context:
• How big is the window?
• Is it sequential or based on syntactic information?
• Different model for every context position?
• Use stop words?
• …
Matrix factorization
Matrix factorization
• Consider the word-context co-occurrence matrix for a corpus:

“I like deep learning. I like NLP. I enjoy flying.”

|          | I | like | enjoy | deep | learning | NLP | flying | . |
|----------|---|------|-------|------|----------|-----|--------|---|
| I        | 0 | 2    | 1     | 0    | 0        | 0   | 0      | 0 |
| like     | 2 | 0    | 0     | 1    | 0        | 1   | 0      | 0 |
| enjoy    | 1 | 0    | 0     | 0    | 0        | 0   | 1      | 0 |
| deep     | 0 | 1    | 0     | 0    | 1        | 0   | 0      | 0 |
| learning | 0 | 0    | 0     | 1    | 0        | 0   | 0      | 1 |
| NLP      | 0 | 1    | 0     | 0    | 0        | 0   | 0      | 1 |
| flying   | 0 | 0    | 1     | 0    | 0        | 0   | 0      | 1 |
| .        | 0 | 0    | 0     | 0    | 1        | 1   | 1      | 0 |

Landauer and Dumais (1997)
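Building that matrix for the corpus above takes only a few lines (a sketch: symmetric windows of size 1, sentences kept separate so contexts do not cross sentence boundaries, “.” treated as a token):

```python
from collections import Counter

sentences = [
    "I like deep learning .".split(),
    "I like NLP .".split(),
    "I enjoy flying .".split(),
]
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
idx = {w: i for i, w in enumerate(vocab)}

counts = Counter()
for sent in sentences:
    for t in range(len(sent) - 1):      # symmetric window of size 1
        a, b = sent[t], sent[t + 1]
        counts[(a, b)] += 1
        counts[(b, a)] += 1

# A[i][j] = how often vocab[i] and vocab[j] co-occur within the window
A = [[counts[(w, v)] for v in vocab] for w in vocab]
```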
Matrix factorization
• Reconstruct the matrix from low-dimensional word-context representations.
• Minimizes:

$\sum_{i,j} (A_{ij} - A^{k}_{ij})^2 = \|A - A_k\|^2$
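By the Eckart-Young theorem, the rank-k minimizer of this Frobenius objective is given by truncated SVD; a numpy sketch with a random stand-in co-occurrence matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(2.0, size=(8, 8)).astype(float)   # stand-in co-occurrence matrix

k = 2
U, s, Vt = np.linalg.svd(A)
A_k = (U[:, :k] * s[:k]) @ Vt[:k]                 # best rank-k approximation

# Reconstruction error equals the energy in the discarded singular values
err = np.linalg.norm(A - A_k)
```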
Relation to skip-gram
• The output of skip-gram can be viewed as factorizing a word-context matrix:

$M = V U^\top,\qquad M \in \mathbb{R}^{|V| \times |V|},\quad V, U \in \mathbb{R}^{|V| \times d}$

• Which M is decomposed by skip-gram?

Levy and Goldberg, 2014
Relation to skip-gram

• Define counts over the corpus of (center, output) training pairs:

$$\#(c) = \sum_{o'} \#(c, o') \qquad \#(o) = \sum_{c'} \#(c', o) \qquad T = \sum_{(c,o)} \#(c, o)$$

• $\#(o)/T$ is the unigram probability of $o$, and $P_T$ is the unigram distribution:

$$P_T(w) = \frac{c(w)}{|D|} = \frac{c(w) \cdot m}{|D| \cdot m} = \frac{\#(o)}{T}$$
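As a sketch, these counts can be collected from the toy corpus on the title slide (the `cooccurrence_counts` helper and the window size are illustrative, not from the slides):

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count center/output pairs #(c, o) within a symmetric window."""
    pair_counts = Counter()
    for i, c in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pair_counts[(c, tokens[j])] += 1
    return pair_counts

tokens = "i like deep learning i like nlp i enjoy flying".split()
pairs = cooccurrence_counts(tokens, window=1)

# Marginals and total, matching the definitions above.
count_c = Counter()   # #(c) = sum over o' of #(c, o')
count_o = Counter()   # #(o) = sum over c' of #(c', o)
for (c, o), n in pairs.items():
    count_c[c] += n
    count_o[o] += n
T = sum(pairs.values())   # T = sum over (c, o) of #(c, o)
```

With window 1 this reproduces, e.g., the count 2 for the pair ("i", "like") shown in the co-occurrence table on the first slide.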
![Page 51: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/51.jpg)
Relation to skip-gram

• Re-write objective:

$$L(\theta) = \sum_{c,o} \#(c, o)\Big[\log(\sigma(u_o^\top v_c)) + k \cdot \mathbb{E}_{o' \sim P_T}\big[\log(\sigma(-u_{o'}^\top v_c))\big]\Big]$$

Distribute:

$$= \sum_{c,o} \#(c, o)\log(\sigma(u_o^\top v_c)) + \sum_{c,o} \#(c, o) \cdot k \cdot \mathbb{E}_{o' \sim P_T}\big[\log(\sigma(-u_{o'}^\top v_c))\big]$$

The expectation is constant for $o$:

$$= \sum_{c,o} \#(c, o)\log(\sigma(u_o^\top v_c)) + \sum_{c} \#(c) \cdot k \cdot \mathbb{E}_{o' \sim P_T}\big[\log(\sigma(-u_{o'}^\top v_c))\big]$$

Open the expectation:

$$= \sum_{c,o} \#(c, o)\log(\sigma(u_o^\top v_c)) + \sum_{c} \#(c) \cdot k \cdot \sum_{o'} \frac{\#(o')}{T}\log(\sigma(-u_{o'}^\top v_c))$$

Gather terms:

$$= \sum_{c,o} \Big[\#(c, o)\log(\sigma(u_o^\top v_c)) + \#(c) \cdot k \cdot \frac{\#(o)}{T}\log(\sigma(-u_o^\top v_c))\Big]$$

49
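The rewrite can be sanity-checked numerically on random made-up counts and vectors (all sizes and names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

V, d, k = 5, 4, 3                       # tiny vocab, embedding dim, negatives
U = rng.normal(size=(V, d))             # output vectors u_o
Vc = rng.normal(size=(V, d))            # center vectors v_c
counts = rng.integers(1, 10, size=(V, V)).astype(float)   # #(c, o)
T = counts.sum()
p_neg = counts.sum(axis=0) / T          # unigram P_T(o) = #(o) / T

S = Vc @ U.T                            # S[c, o] = u_o^T v_c

# Original form: sum_{c,o} #(c,o) [log s(u_o.v_c) + k E_{o'~P_T} log s(-u_{o'}.v_c)]
neg_exp = (np.log(sigmoid(-S)) * p_neg).sum(axis=1)   # expectation, per center c
L1 = (counts * (np.log(sigmoid(S)) + k * neg_exp[:, None])).sum()

# Rewritten form: sum #(c,o) log s(u_o.v_c) + sum_c #(c) k sum_{o'} (#(o')/T) log s(-u_{o'}.v_c)
L2 = (counts * np.log(sigmoid(S))).sum() + (counts.sum(axis=1) * k * neg_exp).sum()
```

The two forms agree to numerical precision, which is exactly what the distribute/gather steps claim.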
![Page 52: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/52.jpg)
Relation to skip-gram

• Let's assume the dot products are independent of one another. Let $x = u_o^\top v_c$:

$$\ell(x) = \#(c, o)\log(\sigma(x)) + \#(c) \cdot k \cdot \frac{\#(o)}{T}\log(\sigma(-x)) \qquad L(\theta) = \sum_{c,o} \ell(x)$$

• Setting the derivative to zero:

$$\frac{\partial \ell(x)}{\partial x} = \#(c, o)\,\sigma(-x) - \#(c) \cdot k \cdot \frac{\#(o)}{T}\,\sigma(x) = 0$$

$$x = \log\left(\frac{\#(c, o) \cdot T}{\#(c) \cdot \#(o)} \cdot \frac{1}{k}\right) = \log\left(\frac{p(c, o)}{p(c) \cdot p(o)}\right) - \log k = \mathrm{PMI}(c, o) - \log k$$

50
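The closed-form optimum can be verified numerically for a single pair; the counts below are made up for illustration:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Made-up counts for one (c, o) pair.
n_co, n_c, n_o, T, k = 12.0, 40.0, 30.0, 200.0, 5

def dl_dx(x):
    # dl/dx = #(c,o) * sigma(-x) - #(c) * k * (#(o)/T) * sigma(x)
    return n_co * sigmoid(-x) - n_c * k * (n_o / T) * sigmoid(x)

# Optimum from the derivation: x* = log( #(c,o)*T / (#(c)*#(o)) * 1/k )
x_star = np.log((n_co * T) / (n_c * n_o) / k)
```

At `x_star` the derivative vanishes, and `x_star` equals $\mathrm{PMI}(c,o) - \log k$ when the counts are turned into probabilities by dividing by $T$.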
![Page 53: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/53.jpg)
Relation to skip-gram

• Conclusion: skip-gram with negative sampling implicitly factorizes a "shifted" PMI matrix (shifted by $\log k$)

• Many NLP methods apply matrix decomposition to the PMI matrix to obtain dense vectors
51
![Page 54: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/54.jpg)
Evaluation
52
![Page 55: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/55.jpg)
Evaluation
• Intrinsic vs. extrinsic evaluation:
• Intrinsic: define some artificial task that tries to directly measure the quality of your learning algorithm (a bit of that in home assignment 1).
• Extrinsic: check whether your output is useful in a real NLP task
53
![Page 56: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/56.jpg)
Intrinsic evaluation

• Word analogies:

• Normalize all word vectors to unit norm

• man::woman <—> king::??

• a::b <—> c::d

$$d = \arg\max_i \frac{(x_b - x_a + x_c)^\top x_i}{\|x_b - x_a + x_c\|}$$

54
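A minimal sketch of this analogy test, assuming an embedding matrix `X` (one row per word) and a `vocab` word-to-row mapping (both names are hypothetical):

```python
import numpy as np

def analogy(X, vocab, a, b, c):
    """Return the word d maximizing cosine similarity with x_b - x_a + x_c."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalize all vectors
    q = X[vocab[b]] - X[vocab[a]] + X[vocab[c]]
    q = q / np.linalg.norm(q)
    scores = X @ q
    for w in (a, b, c):                # exclude the query words themselves
        scores[vocab[w]] = -np.inf
    words = list(vocab)
    return words[int(np.argmax(scores))]
```

On toy 2-d vectors where one axis encodes gender and the other royalty, the query man::woman <—> king::?? recovers "queen".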
![Page 57: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/57.jpg)
Visualization
55
![Page 58: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/58.jpg)
Visualization
56
![Page 59: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/59.jpg)
Visualization
57
![Page 60: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/60.jpg)
GloVe

Pennington et al., 2014

• An objective that attempts to create a semantic space with linear structure

• Probability ratios are more important than probabilities
![Page 61: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/61.jpg)
GloVe

Pennington et al., 2014

• Try to find word embeddings such that (roughly):

$$(v_{c_1} - v_{c_2})^\top u_o = \frac{P_{c_1 o}}{P_{c_2 o}}$$

where $P_{co}$ is the probability of an output word $o$ given a center word $c$

• As an example:

$$v_{\text{ice}} - v_{\text{steam}} \approx u_{\text{solid}} \qquad\qquad v_{\text{steam}} - v_{\text{ice}} \approx u_{\text{gas}}$$
![Page 62: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/62.jpg)
Word analogies evaluation
60
![Page 63: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/63.jpg)
Human correlation intrinsic evaluation
word 1 word 2 human judgement
tiger cat 7.35
book paper 7.46
computer internet 7.58
plane car 5.77
stock phone 1.62
stock CD 1.31
stock jaguar 0.92

61
![Page 64: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/64.jpg)
Human correlation intrinsic evaluation
• Compute the Spearman rank correlation between human similarity judgements and model similarity predictions (WordSim-353):
62
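A sketch of this evaluation with NumPy (the helper name and the model scores are illustrative; real WordSim-353 data would replace them, and ties are not handled):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks (no tie handling)."""
    def ranks(x):
        order = np.argsort(x)
        r = np.empty(len(x))
        r[order] = np.arange(len(x), dtype=float)
        return r
    ra, rb = ranks(np.asarray(a)), ranks(np.asarray(b))
    return float(np.corrcoef(ra, rb)[0, 1])

# Human judgements from the table above vs. hypothetical model cosine similarities.
human = [7.35, 7.46, 7.58, 5.77, 1.62, 1.31, 0.92]
model = [0.80, 0.75, 0.82, 0.60, 0.20, 0.25, 0.10]
rho = spearman(human, model)
```

Only the relative ordering matters: the model scores above swap two neighboring ranks relative to the human scores, giving a correlation just below 1.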
![Page 65: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/65.jpg)
Extrinsic evaluation

• Task: named entity recognition. Find mentions of persons, locations, and organizations in text.

• Using good word representations might be useful
63
![Page 66: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/66.jpg)
Extrinsic evaluation
64
![Page 67: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/67.jpg)
Summary

• Words are central to language

• Most NLP systems use some form of word representation

• Graph-based representations are difficult to manipulate and compose

• One-hot vectors are useful with enough data but lose all generalization information

• Word embeddings provide a compact way to encode word meaning and similarity (but what about inference relations?)

• Skip-gram with negative sampling is a popular approach for learning word embeddings by casting an unsupervised problem as a supervised one

• It is strongly related to classical matrix decomposition methods
65
![Page 68: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/68.jpg)
Current Research
• Contextualized word representations
• Sentence representations
![Page 69: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/69.jpg)
Assignment 1
• Implement skip-gram with negative sampling
• There is ample literature if you want to consider this for a project
67
![Page 70: Natural Language Processingjoberant/teaching/nlp_spring...“I like deep learning. I like NLP. I enjoy flying.” I Like enjoy deep learning NLP flying . I 2 1 like 2 1 1 enjoy 1](https://reader034.fdocuments.us/reader034/viewer/2022042606/5f733bf577d7ef192848c006/html5/thumbnails/70.jpg)
Gradient checks

$$\frac{\partial J(\theta)}{\partial \theta} = \lim_{\epsilon \to 0} \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$$

• This is the single-parameter case

• For parameter vectors, iterate over all parameters and compute the numerical gradient for each one

68
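The procedure above can be sketched as follows (the helper name is illustrative):

```python
import numpy as np

def numerical_grad(J, theta, eps=1e-5):
    """Central-difference gradient: perturb one parameter at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e.flat[i] = eps
        grad.flat[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

# Check against the analytic gradient of J(theta) = ||theta||^2, which is 2*theta.
theta = np.array([1.0, -2.0, 3.0])
J = lambda t: float(np.sum(t ** 2))
num = numerical_grad(J, theta)
```

In an actual skip-gram implementation, `J` would be the negative-sampling loss and `theta` the flattened embedding parameters; a mismatch between `num` and the analytic gradient usually points to a sign or indexing bug.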