An Introduction to Deep Learning
Marc’Aurelio Ranzato
Facebook AI Research
ranzato@fb.com
DeepLearn Summer School - Bilbao, 17 July 2017
Outline
• Part 0 [lecture 1]
• Motivation
• Training Fully Connected Nets with Backpropagation
• Part 1 [lecture 1 and lecture 2]
• Deep Learning for Vision: CNN
• Part 2 [lecture 2]
• Deep Learning for NLP: embeddings
• Part 3 [lecture 3]
• Modeling sequences
Representing Symbolic Data
• Lots of data is symbolic. For instance:
• Text
• Graphs
• Can DL be useful to represent such data?
• If we could represent symbolic data in a continuous space, we could easily measure relatedness.
• We could apply the powerful tools of linear algebra and DL to perform complex reasoning.
Representing Symbolic Data
• Challenges:
• Discrete nature, easy to count but not obvious how to represent.
• One cannot use standard backprop through discrete units.
• The number of entities to represent can be very large, albeit finite; e.g., words in English dictionary.
• Often this data has no regular grid structure, unlike an image. E.g.: text, social graph.
Case Study: Learning Word Representations
• As a case study, we will consider the problem of learning word representations from raw text (without any supervision).
• We will explore a few approaches to learn such representations.
• Practical applications:
• Text classification
• Ranking (e.g., Google search, Facebook feed ranking)
• Machine translation
• Chatbots
Latent Semantic Analysis
• Problem: Find similar documents in a corpus.
• Solution:
• construct the “term”/“document” matrix storing (normalized) occurrence counts
• SVD
Deerwester et al. “Indexing by Latent Semantic Analysis” JASIS 1990
• Term-document matrix: $x_{i,j}$ is the (normalized) number of times word i appears in document j.
• Example:
doc1: the cat is furry
doc2: dogs are furry

        doc1  doc2
are      0     1
cat      1     0
dogs     0     1
furry    1     1
is       1     0
the      1     0
• SVD factorizes the term-document matrix as $X \approx U \Sigma V^\top$, keeping only the D largest singular values.
• Each column of $V^\top$ is a representation of a document in the corpus. Each column is a D-dimensional vector. We can use it to compare & retrieve documents.
• Each row of $U$ is a vectorial representation of a word in the dictionary, a.k.a. an embedding.
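A minimal sketch of the LSA recipe on the two-document example above, assuming raw (unnormalized) counts and NumPy's `np.linalg.svd`. The variable names are illustrative and not from the lecture.

```python
import numpy as np

# Toy corpus from the example above (assumption: raw counts, no normalization).
vocab = ["are", "cat", "dogs", "furry", "is", "the"]
docs = [["the", "cat", "is", "furry"], ["dogs", "are", "furry"]]

# Term-document matrix X: rows are words, columns are documents.
X = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc:
        X[vocab.index(word), j] += 1

# Thin SVD: X = U @ diag(S) @ Vt, singular values in decreasing order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

D = 2  # embedding dimension (hyper-parameter); at most 2 here, since there are 2 documents
word_embeddings = U[:, :D]    # row i is the embedding of vocab[i]
doc_embeddings = Vt[:D, :]    # column j is the representation of docs[j]

# With D equal to the rank of X, the factorization reconstructs X exactly.
X_rec = word_embeddings @ np.diag(S[:D]) @ doc_embeddings
```

On a real corpus D would be much smaller than the number of documents, so the reconstruction is only approximate.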
Word Embeddings
• Convert words (symbols) into a D-dimensional vector, where D is a hyper-parameter.
• Once embedded, we can:
• Compare words.
• Apply our favorite machine learning method (DL) to represent sequences of words.
• At document retrieval time in LSA, the representation of a new document is a weighted sum of word embeddings (bag-of-words -> bag-of-embeddings): $U^\top x$, where $x$ is the bag-of-words vector of the new document.
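A small sketch of the bag-of-words to bag-of-embeddings step. The embedding matrix `U` below is filled with random numbers as a stand-in for learned LSA embeddings, and `bag_of_words` is a hypothetical helper.

```python
import numpy as np

vocab = ["are", "cat", "dogs", "furry", "is", "the"]

# Hypothetical embedding matrix U (|V| x D); random values stand in for learned ones.
rng = np.random.default_rng(0)
D = 3
U = rng.normal(size=(len(vocab), D))

def bag_of_words(tokens):
    """Count vector x with one entry per vocabulary word."""
    x = np.zeros(len(vocab))
    for t in tokens:
        x[vocab.index(t)] += 1
    return x

x = bag_of_words(["the", "cat", "is", "furry", "furry"])
doc_vec = U.T @ x  # D-dimensional representation of the new document

# The same quantity written as an explicit weighted sum of word embeddings.
explicit = sum(x[i] * U[i] for i in range(len(vocab)))
```

The word order is lost: any permutation of the tokens gives the same `doc_vec`.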
bi-gram
• A bi-gram is a model of the probability of a word given the preceding one:
$$p(w_k \mid w_{k-1}), \qquad w_k \in V$$
• The simplest approach consists of building a (normalized) matrix of counts:
$$c(w_k \mid w_{k-1}) = \begin{bmatrix} c_{1,1} & \dots & c_{1,|V|} \\ \dots & c_{i,j} & \dots \\ c_{|V|,1} & \dots & c_{|V|,|V|} \end{bmatrix}$$
where $c_{i,j}$ is the number of times word i is preceded by word j (rows index the current word, columns the preceding word).
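The count matrix can be sketched in a few lines. The toy text and the column-wise normalization are assumptions made for illustration; `counts[i, j]` follows the convention above (word i preceded by word j).

```python
import numpy as np

text = "the cat is furry the dog is furry the cat is small".split()
vocab = sorted(set(text))
index = {w: i for i, w in enumerate(vocab)}

# counts[i, j] = number of times word i is preceded by word j
counts = np.zeros((len(vocab), len(vocab)))
for prev, cur in zip(text[:-1], text[1:]):
    counts[index[cur], index[prev]] += 1

# Normalize each column so that column j holds p(w_k | w_{k-1} = word j).
# Columns of words that never precede anything are left at zero.
col_sums = counts.sum(axis=0, keepdims=True)
probs = np.divide(counts, col_sums, out=np.zeros_like(counts), where=col_sums > 0)

p_cat_given_the = probs[index["cat"], index["the"]]  # "the" is followed by "cat" 2 times out of 3
```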
Factorized bi-gram
• We can factorize (via SVD, for instance) the bigram to reduce the number of parameters and become more robust to noise (entries with low counts):
$$c(w_k \mid w_{k-1}) = \begin{bmatrix} c_{1,1} & \dots & c_{1,|V|} \\ \dots & c_{i,j} & \dots \\ c_{|V|,1} & \dots & c_{|V|,|V|} \end{bmatrix} = UV, \qquad U \in \mathbb{R}^{|V| \times D}, \quad V \in \mathbb{R}^{D \times |V|}$$
• Rows of U store “output” word embeddings, and columns of V store “input” word embeddings (rows of the count matrix index the output word, columns the input word).
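A sketch of the factorization via truncated SVD. The count matrix here is synthetic (low rank plus a little noise), and the sizes are made up; only the shapes of `U` and `V` follow the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, D = 50, 5  # hypothetical sizes

# Synthetic stand-in for a bi-gram count matrix: low-rank structure plus small noise.
C = rng.random((vocab_size, D)) @ rng.random((D, vocab_size))
C += 0.01 * rng.random((vocab_size, vocab_size))

# Truncated SVD: keep the D largest singular values.
P, S, Qt = np.linalg.svd(C, full_matrices=False)
U = P[:, :D] * S[:D]   # |V| x D, rows are "output" word embeddings
V = Qt[:D, :]          # D x |V|, columns are "input" word embeddings

rel_err = np.linalg.norm(C - U @ V) / np.linalg.norm(C)
params_full = vocab_size * vocab_size   # 2500 entries in the full matrix
params_fact = 2 * vocab_size * D        # 500 entries in U and V together
```

Folding the singular values into `U` is one arbitrary choice; they could equally be split between the two factors.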
Factorized bi-gram
• The same can be expressed as a two layer (linear) neural network:
$$p(w_k \mid w_{k-1}) = \mathrm{softmax}\big(U V\, \mathbf{1}(w_{k-1})\big), \qquad \mathbf{1}(w_{k-1}) = [0, \dots, 0, 1, 0, \dots, 0]^\top$$
where $\mathbf{1}(w_{k-1})$ is the 1-hot representation of the input word and $UV$ is the factorization of $c(w_k \mid w_{k-1})$.
• No need to multiply, V is just a look up table!
• NOTE: Since embeddings are free, there is no point adding non-linearities and more layers! Here, depth does not help!
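The look-up-table remark can be checked numerically. This is a sketch with random matrices of made-up size: multiplying `V` by a 1-hot vector gives the same result as reading one column of `V`.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, D = 8, 3  # hypothetical sizes
V = rng.normal(size=(D, vocab_size))   # input embeddings, one column per word
U = rng.normal(size=(vocab_size, D))   # output embeddings, one row per word

def softmax(z):
    z = z - z.max()   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

word_id = 5
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

via_matmul = V @ one_hot     # O(D * |V|) multiply
via_lookup = V[:, word_id]   # O(D) column look-up, same result

p_next = softmax(U @ via_lookup)  # distribution over the next word
```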
• bi-gram model could be useful for type-ahead applications (in practice, it’s much better to condition upon the past n>2 words).
• Factorized model yields word embeddings as a by-product.
Word Embeddings
• LSA learns word embeddings that take into account co-occurrences across documents.
• bi-gram instead learns word embeddings that only take into account the next word.
• It seems better to do something in between, using more context but just around the word of interest, yielding a method called word2vec.
Mikolov et al. “Efficient estimation of word representations” rejected by ICLR 2013
skip-gram
• Similar to factorized bi-gram model, but predict N preceding and N following words.
• Words that have the same context will get similar embeddings. E.g.: cat & kitty.
• Input projection is just a look-up table. The bulk of the computation is the prediction of words in context.
• Learning by cross-entropy minimization via SGD.
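A minimal sketch of the skip-gram idea, assuming a full (non-hierarchical) softmax and a toy four-word text. `skipgram_pairs`, `loss_and_grads` and `sgd_step` are hypothetical helpers, not the word2vec code; embeddings are stored as rows of `V_in` and `U_out` for convenient look-up.

```python
import numpy as np

def skipgram_pairs(tokens, window):
    """Return (center, context) pairs using up to `window` words on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "cat", "is", "furry"]
vocab = sorted(set(tokens))
index = {w: i for i, w in enumerate(vocab)}
pairs = skipgram_pairs(tokens, window=1)

rng = np.random.default_rng(0)
D = 8
V_in = 0.1 * rng.normal(size=(len(vocab), D))   # input look-up table, one row per word
U_out = 0.1 * rng.normal(size=(len(vocab), D))  # output embeddings, one row per word

def loss_and_grads(center, context):
    h = V_in[index[center]]                 # look-up, no multiplication
    logits = U_out @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss = -np.log(p[index[context]])       # cross-entropy for one pair
    dlogits = p.copy()
    dlogits[index[context]] -= 1.0          # gradient of the loss w.r.t. the logits
    return loss, np.outer(dlogits, h), U_out.T @ dlogits

def sgd_step(center, context, lr=0.1):
    loss, dU, dh = loss_and_grads(center, context)
    U_out[:] -= lr * dU                     # update all output embeddings
    V_in[index[center]] -= lr * dh          # update only the input row that was looked up
    return loss

loss_before = sgd_step("cat", "is")
loss_after, _, _ = loss_and_grads("cat", "is")
```

The update of `U_out` touches every row, which is the expensive part for a large vocabulary.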
Hierarchical Softmax
• When there are lots of classes to predict (e.g., words in a dictionary, |V| in the order of 100,000 or more), projection in the output space is computationally very expensive.
• Hierarchical softmax speeds up computation at the cost of a little decrease of accuracy:
$$p(w_k \mid h) = \sum_{n=1}^{N} p(w_k \mid h, c_n)\, p(c_n \mid h) = p(w_k \mid h, c_n)\, p(c_n \mid h)$$
where $h$ is the feature (word embedding in skip-gram) and $c_n$ is the n-th cluster. Drop the sum: each word belongs to 1 and only 1 cluster, so in the last expression $c_n$ is the cluster containing $w_k$.
Hierarchical Softmax
• Why is it cheaper to have two softmaxes instead of one? Because these are much smaller. If clusters all have the same size and contain $|V|/N$ words:
$$D \times |V| \gg D \times N + D \times \frac{|V|}{N} \approx D \times \frac{|V|}{N}$$
• In practice, clusters are formed by taking into account word frequency in order to minimize computation cost. The tree can also use a different branching structure (e.g., a binary tree).
Mikolov et al. “Strategies for training large-scale neural network language models” ASRU 2011
Morin et al. “Hierarchical probabilistic neural network language model” AISTATS 2005
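A sketch of a two-level softmax with equal-size clusters. The sizes, the random weights, and the rule that consecutive word ids share a cluster are all assumptions for illustration; in practice clusters are built from word frequencies, as noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, N, D = 1000, 10, 16      # hypothetical sizes; N equal-size clusters
words_per_cluster = vocab_size // N

W_cluster = 0.1 * rng.normal(size=(N, D))                  # scores the N clusters
W_word = 0.1 * rng.normal(size=(N, words_per_cluster, D))  # scores the words inside each cluster

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def word_prob(word_id, h):
    """p(w | h) = p(w | h, c) * p(c | h), with c the single cluster containing w."""
    c, pos = divmod(word_id, words_per_cluster)  # assumed assignment of words to clusters
    p_cluster = softmax(W_cluster @ h)           # N dot products
    p_word = softmax(W_word[c] @ h)              # |V|/N dot products
    return p_cluster[c] * p_word[pos]

h = rng.normal(size=D)
total = sum(word_prob(w, h) for w in range(vocab_size))  # still a valid distribution

flat_cost = D * vocab_size                      # multiply-adds for one flat softmax
hier_cost = D * N + D * words_per_cluster       # two small softmaxes
```

With these sizes the flat softmax needs 16,000 multiply-adds per prediction and the two-level version 1,760.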
Hierarchical Softmax
• Is hierarchical softmax “deep”? No, as we walk down the tree the representation is not changed.
[Figure: the same feature $h$ feeds both $p(c_n \mid h)$ and $p(w_k \mid h, c_n)$.]
word2vec
• code at: https://code.google.com/archive/p/word2vec/
• Next, some evaluation from Tomas Mikolov’s NIPS 2013 presentation at: https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc/edit
[Figures: evaluation slides reproduced from the presentation above; credit T. Mikolov.]
word2vec demo
Recap
• Embedding words (from a 1-hot to a distributed representation) lets you:
• understand similarity between words
• plug them within any parametric ML model
• Several ways to learn word embeddings. word2vec is still one of the most efficient ones.
• Note word2vec leverages large amounts of unlabeled data.
Representing Phrases
• How about representing short sequences of words?
• Could we simply average (pool) word embeddings?
$$e(w_k, w_{k+1}, \dots, w_{k+n-1}) = \frac{1}{n} \sum_{i=0}^{n-1} e(w_{k+i})$$
where $e(w_{k+i})$ is the word embedding of $w_{k+i}$.
• This is a surprisingly good baseline! E.g.: recommender systems.
credit to: A. Szlam https://learning.mpi-sws.org/mlss2016/slides/Arthur_Szlam_MLSS-2016.pdf
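The pooling formula amounts to a mean over rows of the embedding table. In this sketch the table `E` is random and `phrase_embedding` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "is", "furry", "dogs", "are"]
index = {w: i for i, w in enumerate(vocab)}
E = rng.normal(size=(len(vocab), 4))   # hypothetical embedding table, one row per word

def phrase_embedding(words):
    """Average (pool) the embeddings of the words in the phrase."""
    return E[[index[w] for w in words]].mean(axis=0)

a = phrase_embedding(["the", "cat", "is", "furry"])
b = phrase_embedding(["furry", "is", "the", "cat"])  # same words, different order
```

`a` and `b` are identical, which makes the orderless nature of the representation explicit.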
Bag-of-embeddings
• Well-known but counter-intuitive fact about $\mathbb{R}^d$: [concentration of measure] with high probability, the inner product of any two random vectors is approximately 0 relative to their norms (therefore their distance is approx. $\sqrt{d}$).
• If word embeddings were drawn i.i.d., what’s the value of $d$ s.t. we can recover $w_{k+i}$ by finding the nearest neighbor to $e$?
$$d > n \log\left(\frac{n |V|}{\epsilon}\right)$$
where $n$ is the number of words in the bag and $\epsilon$ is the probability of recovery failure.
• If |V| = 100,000, n = 10 and d > 100 -> perfect (orderless) recovery from a bag!
credit to: A. Szlam https://learning.mpi-sws.org/mlss2016/slides/Arthur_Szlam_MLSS-2016.pdf
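A small experiment illustrating recovery from a bag, under assumptions that are not in the slides: embeddings with i.i.d. entries of ±1/√d (so every embedding has unit norm), a sum rather than a mean, and a dimension chosen far more generously than the bound requires so that the demo is reliable.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, n = 2000, 2048, 10   # hypothetical sizes, d chosen generously

# i.i.d. random embeddings with entries +-1/sqrt(d), so every row has unit norm.
E = rng.choice([-1.0, 1.0], size=(vocab_size, d)) / np.sqrt(d)

bag = rng.choice(vocab_size, size=n, replace=False)  # ids of the words in the bag
e = E[bag].sum(axis=0)                               # bag-of-embeddings (the mean only rescales it)

# Inner product with every word: close to 1 for words in the bag, close to 0 otherwise.
# Since all rows have unit norm, ranking by inner product matches nearest-neighbor search.
scores = E @ e
recovered = np.argsort(scores)[-n:]   # the n best-matching words
```

The set of recovered ids matches the bag, but the order of the words is not recoverable.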
Recap
• Given word embeddings, bagging embeddings is often an effective way to represent short sequences of words.
• Theory of sparse recovery explains why.
• What other (better) ways are there?
• How can DL help here?
Language Modeling
• In language modeling, we want to predict a word given some context.
• bi-gram uses only the preceding word.
• More generally, we can use the last N words. E.g.: n-grams and neural net language model.
• Or even better, we can use some sort of running average of all the words seen thus far, as in recurrent neural networks.
• As a by-product, these methods produce a representation of a sequence of (fixed or variable length) words without any supervision.
Language Modeling
• The math:
$$p_\theta(w_1, w_2, \dots, w_M) = p_\theta(w_M \mid w_{M-1}, \dots, w_1)\, p_\theta(w_{M-1} \mid w_{M-2}, \dots, w_1) \cdots p_\theta(w_2 \mid w_1)\, p_\theta(w_1)$$
• With Markov assumption (used by n-grams):
$$p_\theta(w_1, w_2, \dots, w_M) = p_\theta(w_M \mid w_{M-1}, \dots, w_{M-n})\, p_\theta(w_{M-1} \mid w_{M-2}, \dots, w_{M-n-1}) \cdots p_\theta(w_2 \mid w_1)\, p_\theta(w_1)$$
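As a worked instance of the factorization with the shortest possible context (a bi-gram, i.e. a first-order Markov assumption), the sketch below scores a sentence with count-based estimates. The toy corpus is made up and there is no smoothing, so an unseen bi-gram would raise an error.

```python
import math
from collections import Counter

corpus = "the cat is furry the dog is furry".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus[:-1], corpus[1:]))
prev_counts = Counter(corpus[:-1])   # how often each word appears as a preceding word

def log_prob(sentence):
    """log p(w_1, ..., w_M) = log p(w_1) + sum_k log p(w_k | w_{k-1})."""
    lp = math.log(unigrams[sentence[0]] / len(corpus))
    for prev, cur in zip(sentence[:-1], sentence[1:]):
        # Count-based estimate; no smoothing, so unseen pairs are not handled.
        lp += math.log(bigrams[(prev, cur)] / prev_counts[prev])
    return lp

# p(the) = 2/8, p(cat | the) = 1/2, p(is | cat) = 1, p(furry | is) = 1
lp = log_prob(["the", "cat", "is", "furry"])
```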
Neural Network LM
Y. Bengio et al. “A neural probabilistic language model” JMLR 2003
• Natural extension of the factorized bi-gram model.
• Improved accuracy with more context. A bit better than n-gram (count based methods).
• If we are just interested in word embeddings, it is much more expensive than word2vec.
• It gives a representation to ordered sequences of n words.
Recurrent Neural Network
• In NN-LM, the hidden state is the concatenation of word embeddings.
• Key idea of RNNs: compute a (non-linear) running average instead, to increase the size of the context.
• Many variants…
Recurrent Neural Network
• Elman RNN:
$$h_k = \sigma(U^r h_{k-1} + U^i \mathbf{1}(w_k) + b^r)$$
$$p(w_{k+1} \mid h_k) = \mathrm{softmax}(U^o h_k + b^o)$$
• The recurrent term $U^r h_{k-1}$ is the only difference compared to the factorized bi-gram language model.
• The output softmax could be a hierarchical softmax.
Elman “Finding structure in time” Cognitive Science 1990
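A minimal sketch of one Elman step, assuming that the non-linearity σ is the logistic sigmoid and using random weights of made-up size. `rnn_step` is a hypothetical helper; the 1-hot product is implemented as a column look-up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, H = 10, 6   # hypothetical vocabulary and hidden sizes

U_i = 0.1 * rng.normal(size=(H, vocab_size))   # input weights
U_r = 0.1 * rng.normal(size=(H, H))            # recurrent weights
U_o = 0.1 * rng.normal(size=(vocab_size, H))   # output weights
b_r = np.zeros(H)
b_o = np.zeros(vocab_size)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(h_prev, word_id):
    # U_i @ one_hot(word_id) is just column word_id of U_i.
    h = sigmoid(U_r @ h_prev + U_i[:, word_id] + b_r)
    p_next = softmax(U_o @ h + b_o)   # distribution over the next word
    return h, p_next

h = np.zeros(H)
for w in [3, 1, 4]:   # arbitrary word ids
    h, p_next = rnn_step(h, w)
```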
RNN: Inference Time
[Figure: the Elman RNN unrolled over time. Inputs $w_1, \dots, w_6$, hidden states $h_0, \dots, h_6$, predicted words $w_2, \dots, w_7$; the recurrent weights $U^r$ and output weights $U^o$ are shared across time steps.]
• Inference in an RNN is like a regular forward pass in a deep neural network, with two differences:
• Weights are shared at every layer.
• Inputs are provided at every layer.
• Two possible applications:
• Scoring: compute the log-likelihood of an input sequence (sum the log-prob scores at every step).
• Generation: sample or take the max from the predicted distribution over words at each time step, and feed that prediction as input at the next time step.
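Both applications can be sketched on top of a single step function. The weights are random and untrained, σ is assumed to be the logistic sigmoid, and `step`, `score` and `generate` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, H = 10, 6   # hypothetical sizes
U_i = 0.1 * rng.normal(size=(H, vocab_size))
U_r = 0.1 * rng.normal(size=(H, H))
U_o = 0.1 * rng.normal(size=(vocab_size, H))
b_r, b_o = np.zeros(H), np.zeros(vocab_size)

def step(h_prev, word_id):
    h = 1.0 / (1.0 + np.exp(-(U_r @ h_prev + U_i[:, word_id] + b_r)))
    z = U_o @ h + b_o
    p = np.exp(z - z.max())
    return h, p / p.sum()

def score(word_ids):
    """Log-likelihood of w_2..w_M given w_1: sum of the per-step log-probabilities."""
    h, total = np.zeros(H), 0.0
    for cur, nxt in zip(word_ids[:-1], word_ids[1:]):
        h, p = step(h, cur)
        total += np.log(p[nxt])
    return total

def generate(first_word, length, sample=False):
    """Feed each prediction back as the next input; sample or take the max."""
    h, word, out = np.zeros(H), first_word, [first_word]
    for _ in range(length):
        h, p = step(h, word)
        word = int(rng.choice(vocab_size, p=p)) if sample else int(np.argmax(p))
        out.append(word)
    return out

log_lik = score([1, 2, 3, 4])
greedy = generate(0, 5)               # deterministic
sampled = generate(0, 5, sample=True) # stochastic
```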
![Page 56: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/56.jpg)
RNN: Training Time
• Truncated Back-Propagation Through Time:
  • Unfold the RNN for only N steps and do:
    • Forward
    • Backward
    • Weight update
  • Repeat the process on the following sequence of N words, but carry over the value of the last hidden state.
Werbos “Backpropagation through time: what does it do and how to do it” IEEE 1990
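The procedure above can be sketched for the Elman RNN with hand-written gradients in NumPy. This is an illustrative sketch, not the lecture's code: the function names (`tbptt_chunk`, `train`), the plain SGD update, the learning rate, the initialization scale, and the toy corpus are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tbptt_chunk(params, inputs, targets, h0):
    """Forward and backward over one chunk of N words; returns loss, grads, last h."""
    U_i, U_r, b_r, U_o, b_o = (params[n] for n in ("U_i", "U_r", "b_r", "U_o", "b_o"))
    hs, ps, loss = [h0], [], 0.0
    # Forward pass over the N unfolded steps.
    for w_in, w_out in zip(inputs, targets):
        h = sigmoid(U_r @ hs[-1] + U_i[:, w_in] + b_r)
        logits = U_o @ h + b_o
        p = np.exp(logits - logits.max())
        p /= p.sum()
        loss -= np.log(p[w_out])
        hs.append(h)
        ps.append(p)
    # Backward pass: gradients stop at the chunk boundary (the truncation).
    grads = {n: np.zeros_like(v) for n, v in params.items()}
    dh_next = np.zeros_like(h0)
    for k in reversed(range(len(inputs))):
        dlogits = ps[k].copy()
        dlogits[targets[k]] -= 1.0                 # softmax + cross-entropy gradient
        grads["U_o"] += np.outer(dlogits, hs[k + 1])
        grads["b_o"] += dlogits
        dh = U_o.T @ dlogits + dh_next
        da = dh * hs[k + 1] * (1.0 - hs[k + 1])    # back through the sigmoid
        grads["U_r"] += np.outer(da, hs[k])
        grads["U_i"][:, inputs[k]] += da
        grads["b_r"] += da
        dh_next = U_r.T @ da
    return loss, grads, hs[-1]

def train(params, corpus, n_steps, lr=0.05):
    """One pass over the corpus in consecutive chunks of n_steps words."""
    h = np.zeros(params["U_r"].shape[0])
    losses = []
    for start in range(0, len(corpus) - 1, n_steps):
        targets = corpus[start + 1:start + n_steps + 1]
        inputs = corpus[start:start + len(targets)]
        # The last hidden state is carried over as a value; no gradient flows into it.
        loss, grads, h = tbptt_chunk(params, inputs, targets, h)
        for n in params:
            params[n] -= lr * grads[n]             # weight update (plain SGD, assumed)
        losses.append(loss / len(targets))
    return losses

rng = np.random.default_rng(0)
V, H = 5, 16                                       # toy vocabulary and hidden sizes
params = {
    "U_i": 0.3 * rng.standard_normal((H, V)),
    "U_r": 0.3 * rng.standard_normal((H, H)),
    "b_r": np.zeros(H),
    "U_o": 0.3 * rng.standard_normal((V, H)),
    "b_o": np.zeros(V),
}
corpus = [0, 1, 2, 3, 4] * 40                      # toy repeating sequence of word ids
first_epoch = train(params, corpus, n_steps=6)
for _ in range(30):
    last_epoch = train(params, corpus, n_steps=6)
```

Carrying `h` across chunks lets the model use context older than N words at prediction time, even though the gradient only ever sees N steps.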
![Page 57: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/57.jpg)
RNN: Truncated BPTT
Forward Pass
[Figure: unrolled Elman RNN with input words w1 to w6, hidden states h0 to h6, predicted words w2 to w7, and shared weights U_r and U_o.]
Elman “Finding structure in time” Cognitive Science 1990
![Page 60: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/60.jpg)
RNN: Truncated BPTT
Backward Pass
[Figure: unrolled Elman RNN with input words w1 to w6, hidden states h0 to h6, predicted words w2 to w7, and shared weights U_r and U_o.]
![Page 63: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/63.jpg)
RNN: Truncated BPTT
Parameter Update
[Figure: unrolled Elman RNN with input words w1 to w6, hidden states h0 to h6, predicted words w2 to w7, and shared weights U_r and U_o.]
![Page 64: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/64.jpg)
RNN: Truncated BPTT
Forward Pass
[Figure: unrolled Elman RNN with input words w1 to w6, hidden states h0 to h6, predicted words w2 to w7, and shared weights U_r and U_o.]
![Page 67: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/67.jpg)
RNN: Truncated BPTT
Backward Pass
[Figure: unrolled Elman RNN with input words w1 to w6, hidden states h0 to h6, predicted words w2 to w7, and shared weights U_r and U_o.]
![Page 70: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/70.jpg)
RNN: Truncated BPTT
Parameter Update
[Figure: unrolled Elman RNN with input words w1 to w6, hidden states h0 to h6, predicted words w2 to w7, and shared weights U_r and U_o.]
![Page 71: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/71.jpg)
Recap
• RNNs are more powerful than fixed-context models because they capture a context of potentially “infinite” size.
• The hidden state of an RNN can be interpreted as a way to represent the history of what has been seen so far.
• RNNs can be useful to represent variable-length sentences.
• There are lots of RNN variants. The best-working ones have gating (units that multiply other units), e.g., LSTM and GRU.
![Page 72: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/72.jpg)
Gated Recurrent Unit RNN
Key idea: add gating units that enable hidden units to maintain (or reset) their state over time.
$r_k = \sigma(V_i \mathbf{1}(w_k) + V_r h_{k-1})$   (reset gates)
$z_k = \sigma(S_i \mathbf{1}(w_k) + S_r h_{k-1})$   (update gates)
Cho et al. “On the properties of NMT: encoder-decoder approaches” arXiv 2014
![Page 73: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/73.jpg)
Gated Recurrent Unit RNN
Key idea: add gating units that enable hidden units to maintain (or reset) their state over time.
$r_k = \sigma(V_i \mathbf{1}(w_k) + V_r h_{k-1})$   (reset gates)
$z_k = \sigma(S_i \mathbf{1}(w_k) + S_r h_{k-1})$   (update gates)
$\tilde{h}_k = \tanh(U_i \mathbf{1}(w_k) + U_r (r_k \odot h_{k-1}))$   (candidate hiddens)
$h_k = (1 - z_k) \odot h_{k-1} + z_k \odot \tilde{h}_k$   (new hiddens)
Cho et al. “On the properties of NMT: encoder-decoder approaches” arXiv 2014
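The four GRU equations map almost line by line onto NumPy. The sketch below is only an illustration: the function name `gru_step`, the parameter dictionary layout, the toy sizes, and the random initialization are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(p, h_prev, word):
    """One GRU step for the word with index `word`; p maps names to weight matrices."""
    # Multiplying by the one-hot vector 1(w_k) selects one column of each input matrix.
    r = sigmoid(p["V_i"][:, word] + p["V_r"] @ h_prev)              # reset gates
    z = sigmoid(p["S_i"][:, word] + p["S_r"] @ h_prev)              # update gates
    h_cand = np.tanh(p["U_i"][:, word] + p["U_r"] @ (r * h_prev))   # candidate hiddens
    return (1.0 - z) * h_prev + z * h_cand                          # new hiddens

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 4        # toy sizes chosen for illustration
p = {}
for name in ("V", "S", "U"):
    p[name + "_i"] = 0.1 * rng.standard_normal((hidden_size, vocab_size))
    p[name + "_r"] = 0.1 * rng.standard_normal((hidden_size, hidden_size))

h = np.zeros(hidden_size)
for w in [3, 1, 4]:                    # run over a short toy sequence of word ids
    h = gru_step(p, h, w)
```

When an update gate is close to 0, the corresponding hidden unit simply copies its previous value, which is how the unit maintains its state over time.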
![Page 74: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/74.jpg)
Comparison
Model        PennTreeBank perplexity
n-gram       141
neural net   141
Elman RNN    123
GRU          -
LSTM         82

$\text{perplexity} = 2^{H(p)}$
Interpretation: average number of words the model is uncertain among (ideal value is 1).
Mikolov et al. “Extensions of RNN LMs” ICASSP 2011
Grave et al. “Improving neural LMs with continuous cache” ICLR 2017
Hochreiter et al. “Long short term memory” Neural Computation 1997
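A small worked example makes the interpretation concrete. It assumes that H is measured in bits and estimated as the average negative log2-probability the model assigns to each word of a test sequence; the function name `perplexity` and the toy probabilities are made up for illustration.

```python
import numpy as np

def perplexity(word_probs):
    """2 ** H, with H the average negative log2-probability per word (assumed estimate)."""
    word_probs = np.asarray(word_probs, dtype=float)
    entropy_bits = -np.mean(np.log2(word_probs))
    return 2.0 ** entropy_bits

# A model that always spreads its mass evenly over 4 words has perplexity 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that always puts probability 1 on the correct word reaches the ideal value 1.
perfect = perplexity([1.0, 1.0, 1.0])
```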
![Page 75: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/75.jpg)
Recap
• There are several ways to represent sentences:
  • Bag of embeddings: strong baseline.
  • Neural net language model: assumes fixed context, good for predicting the next word.
  • RNN: longer context, particularly good for predicting the next word.
• Why predict just future words? How about predicting surrounding words in the context?
![Page 76: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/76.jpg)
Skip-Thought Vectors
Key idea: 1) encode a sentence with an RNN, 2) use the final hidden state to bias two other RNNs, one predicting the next sentence and one predicting the previous sentence.
Given: “Deep learning works well in applications. I want to learn it. I already know logistic regression.”
[Figure: GRU-RNN 1 encodes “I want to learn it.” into its final hidden state h; GRU-RNN 2 and GRU-RNN 3 use h to predict the surrounding sentences, “Deep learning works well in applications.” and “I already know logistic regression.”]
Kiros et al. “Skip-Thought vectors” arXiv 2015, https://github.com/ryankiros/skip-thoughts
![Page 78: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/78.jpg)
Skip-Thought Vectors
Key idea: 1) encode a sentence with an RNN, 2) use the final hidden state to bias two other RNNs, one predicting the next sentence and one predicting the previous sentence.
Given: “Deep learning works well in applications. I want to learn it. I already know logistic regression.”
[Figure: GRU-RNN 1 encodes “I want to learn it.” into its final hidden state h, the sentence representation; GRU-RNN 2 and GRU-RNN 3 use h to predict the surrounding sentences, “Deep learning works well in applications.” and “I already know logistic regression.”]
GRU-RNN 2 & 3 have slightly modified recurrent equations:
$z_k = \sigma(S_i \mathbf{1}(w_k) + S_r h_{k-1} + S_c h)$
$r_k = \sigma(V_i \mathbf{1}(w_k) + V_r h_{k-1} + V_c h)$
…
Kiros et al. “Skip-Thought vectors” arXiv 2015, https://github.com/ryankiros/skip-thoughts
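A decoder step with these modified gates could look like the NumPy sketch below. The gate lines follow the two equations on the slide. The candidate-state line is an assumption: the slide elides the remaining equations, and the sketch guesses that the candidate also receives a term `U_c @ h_sent`. The function name, parameter names, toy sizes, and the choice of equal encoder and decoder hidden sizes are likewise hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conditioned_gru_step(p, h_prev, word, h_sent):
    """One decoder step biased by the sentence representation h_sent."""
    z = sigmoid(p["S_i"][:, word] + p["S_r"] @ h_prev + p["S_c"] @ h_sent)  # update gates
    r = sigmoid(p["V_i"][:, word] + p["V_r"] @ h_prev + p["V_c"] @ h_sent)  # reset gates
    # ASSUMPTION: not shown on the slide; the candidate is also conditioned on h_sent.
    h_cand = np.tanh(p["U_i"][:, word] + p["U_r"] @ (r * h_prev) + p["U_c"] @ h_sent)
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 4        # toy sizes; encoder and decoder sizes assumed equal
p = {}
for name in ("S", "V", "U"):
    p[name + "_i"] = 0.1 * rng.standard_normal((hidden_size, vocab_size))
    p[name + "_r"] = 0.1 * rng.standard_normal((hidden_size, hidden_size))
    p[name + "_c"] = 0.1 * rng.standard_normal((hidden_size, hidden_size))

# Two different sentence representations steer the same decoder to different states.
h_sent_a = rng.standard_normal(hidden_size)
h_sent_b = rng.standard_normal(hidden_size)
out_a = conditioned_gru_step(p, np.zeros(hidden_size), 3, h_sent_a)
out_b = conditioned_gru_step(p, np.zeros(hidden_size), 3, h_sent_b)
```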
![Page 79: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/79.jpg)
Skip-Thought Vectors
It’s a generalization of word2vec to sentences, using RNNs to represent sentences.
Training:
• Loss = cross entropy of previous sentence + cross entropy of next sentence.
• It uses the BookCorpus dataset with sentences from 11,000 books.
Kiros et al. “Skip-Thought vectors” arXiv 2015, https://github.com/ryankiros/skip-thoughts
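The training loss can be written down in a few lines. In this sketch each decoder is assumed to have already produced one probability distribution over the vocabulary per word position; the function names and the toy inputs are hypothetical.

```python
import numpy as np

def cross_entropy(probs, targets):
    """Sum of -log p(target word) over the word positions of one sentence."""
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(np.log(probs[np.arange(len(targets)), targets])))

def skip_thought_loss(prev_probs, prev_words, next_probs, next_words):
    """Cross entropy of the previous sentence plus cross entropy of the next sentence."""
    return cross_entropy(prev_probs, prev_words) + cross_entropy(next_probs, next_words)

vocab_size = 4
prev_probs = np.full((2, vocab_size), 1.0 / vocab_size)   # 2-word previous sentence
next_probs = np.full((3, vocab_size), 1.0 / vocab_size)   # 3-word next sentence
loss = skip_thought_loss(prev_probs, [0, 3], next_probs, [1, 1, 2])
```

With uniform predictions over 4 words, each of the 5 target words contributes log 4 to the loss.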
![Page 80: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/80.jpg)
Skip-Thought Vectors
Kiros et al. “Skip-Thought vectors” arXiv 2015
![Page 81: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/81.jpg)
Skip-Thought Vectors
Example of generation:
Kiros et al. “Skip-Thought vectors” arXiv 2015
![Page 82: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/82.jpg)
Supervised Learning of Sentence Representations
If labeled data on related tasks is available, it is usually better to train in supervised mode. The resulting representations transfer well to other tasks.
Conneau et al. “Supervised learning of universal sentence representations” arXiv 2017
![Page 83: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/83.jpg)
Supervised Learning of Sentence Representations
Conneau et al. “Supervised learning of universal sentence representations” arXiv 2017
![Page 84: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/84.jpg)
Recap
• Predicting the surrounding context is a general principle. It can be used to learn word and sentence representations in an unsupervised manner.
• Learning from labeled datasets usually yields features that transfer better.
• The choice of sentence representation depends on sequence length, task, and computational and memory constraints.
![Page 85: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/85.jpg)
Questions?
![Page 86: An Introduction to Deep Learningranzato/files/ranzato_deeplearn17_lec2_nlp.pdf · An Introduction to Deep Learning Marc’Aurelio Ranzato Facebook AI Research ranzato@fb.com DeepLearn](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ec980396ace79356a38eaf2/html5/thumbnails/86.jpg)
Acknowledgements
I would like to thank Arthur Szlam for sharing his material about sparse recovery from bag-of-word embeddings.