Transcript of CS7015 (Deep Learning) : Lecture 10 · miteshk/CS7015/Slides · 2018-12-27
CS7015 (Deep Learning) : Lecture 10
Learning Vectorial Representations of Words

Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Acknowledgments
‘word2vec Parameter Learning Explained’ by Xin Rong
‘word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method’ by Yoav Goldberg and Omer Levy
Sebastian Ruder’s blogs on word embeddings (Blog1, Blog2, Blog3)
Module 10.1: One-hot representations of words
Model
[5.7, 1.2, 2.3, -10.2, 4.5, ..., 11.9, 20.1, -0.5, 40.7]
"This is by far AAMIR KHAN’s best one. Finest casting and terrific acting by all."

Let us start with a very simple motivation for why we are interested in vectorial representations of words.

Suppose we are given an input stream of words (sentence, document, etc.) and we are interested in learning some function of it (say, y = sentiment(words)).

Say, we employ a machine learning algorithm (some mathematical model) for learning such a function (y = f(x)).

We first need a way of converting the input stream (or each word in the stream) to a vector x (a mathematical quantity).
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

V = [human, machine, interface, for, computer, applications, user, opinion, of, system, response, time, management, engineering, improved]

machine: 0 1 0 ... 0 0 0

Given a corpus, consider the set V of all unique words across all input streams (i.e., all sentences or documents).

V is called the vocabulary of the corpus.

We need a representation for every word in V.

One very simple way of doing this is to use one-hot vectors of size |V|.

The representation of the i-th word will have a 1 in the i-th position and a 0 in the remaining |V| − 1 positions.
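The vocabulary and one-hot construction above can be sketched as follows, using the toy corpus from the slide:

```python
# Minimal sketch: build the vocabulary V from the toy corpus and return
# a one-hot vector of size |V| for any word in V.
corpus = [
    "human machine interface for computer applications",
    "user opinion of computer system response time",
    "user interface management system",
    "system engineering for improved response time",
]

# Vocabulary: unique words, in order of first appearance.
vocab = []
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)

def one_hot(word):
    """|V|-dim vector with a 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(len(vocab))          # 15 unique words in this toy corpus
print(one_hot("machine"))  # 1 at index 1, zeros everywhere else
```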
cat:   0 0 0 0 0 1 0
dog:   0 1 0 0 0 0 0
truck: 0 0 0 1 0 0 0

euclid_dist(cat, dog) = √2
euclid_dist(dog, truck) = √2
cosine_sim(cat, dog) = 0
cosine_sim(dog, truck) = 0

Problems:

V tends to be very large (for example, 50K for PTB, 13M for the Google 1T corpus).

These representations do not capture any notion of similarity.

Ideally, we would want the representations of cat and dog (both domestic animals) to be closer to each other than the representations of cat and truck.

However, with 1-hot representations, the Euclidean distance between any two words in the vocabulary is √2.

And the cosine similarity between any two words in the vocabulary is 0.
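The two claims above are easy to verify numerically; a quick sketch using the one-hot vectors from the slide:

```python
# For any two distinct one-hot vectors, the Euclidean distance is sqrt(2)
# (the vectors differ in exactly two positions) and the cosine similarity
# is 0 (the 1s never line up, so the dot product vanishes).
import math

cat   = [0, 0, 0, 0, 0, 1, 0]
dog   = [0, 1, 0, 0, 0, 0, 0]
truck = [0, 0, 0, 1, 0, 0, 0]

def euclid_dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

print(euclid_dist(cat, dog))   # sqrt(2) ≈ 1.4142
print(cosine_sim(dog, truck))  # 0.0
```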
Module 10.2: Distributed Representations of words
A bank is a financial institution that accepts deposits from the public and creates credit.

The idea is to use the accompanying words (financial, deposits, credit) to represent bank.

"You shall know a word by the company it keeps" - Firth, J. R. 1957:11

Distributional-similarity-based representations.

This leads us to the idea of a co-occurrence matrix.
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

Co-occurrence Matrix:

          human  machine  system  for  ...  user
human       0      1        0      1   ...   0
machine     1      0        0      1   ...   0
system      0      0        0      1   ...   2
for         1      1        1      0   ...   0
...
user        0      0        2      0   ...   0

A co-occurrence matrix is a terms × terms matrix which captures the number of times a term appears in the context of another term.

The context is defined as a window of k words around the terms.

Let us build a co-occurrence matrix for this toy corpus with k = 2.

This is also known as a word × context matrix.

You could choose the set of words and contexts to be the same or different.

Each row (column) of the co-occurrence matrix gives a vectorial representation of the corresponding word (context).
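The windowed counting described above can be sketched as follows; this is a minimal illustration with a symmetric window of k = 2, counting only within sentence boundaries (an assumption, since the slide does not spell out boundary handling):

```python
# Sketch: build (word, context) co-occurrence counts with window k = 2.
from collections import defaultdict

corpus = [
    "human machine interface for computer applications",
    "user opinion of computer system response time",
    "user interface management system",
    "system engineering for improved response time",
]
k = 2

counts = defaultdict(int)  # (word, context) -> co-occurrence count
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        # Context = up to k words on either side, within the sentence.
        for j in range(max(0, i - k), min(len(words), i + k + 1)):
            if j != i:
                counts[(w, words[j])] += 1

print(counts[("human", "machine")])  # adjacent in sentence 1
print(counts[("machine", "for")])    # within 2 words in sentence 1
```

Each row of the resulting matrix (all counts for a fixed word) is that word's distributional representation.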
          human  machine  system  for  ...  user
human       0      1        0      1   ...   0
machine     1      0        0      1   ...   0
system      0      0        0      1   ...   2
for         1      1        1      0   ...   0
...
user        0      0        2      0   ...   0

Some (fixable) problems:

Stop words (a, the, for, etc.) are very frequent → these counts will be very high.

Solution 1: Ignore very frequent words.

Solution 2: Use a threshold t (say, t = 100):

X_ij = min(count(w_i, c_j), t),

where w is a word and c is a context.
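Solution 2 amounts to clipping each entry of the matrix at t; a one-line sketch (the `counts` dict here is a made-up toy example, not from the slide):

```python
# Clip raw co-occurrence counts at threshold t so stop words cannot
# dominate the matrix: X_ij = min(count(w_i, c_j), t).
t = 100
counts = {("the", "cat"): 5000, ("cat", "dog"): 42}  # hypothetical counts
clipped = {pair: min(c, t) for pair, c in counts.items()}
print(clipped)  # the stop-word pair is capped at 100
```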
          human  machine  system   for   ...  user
human       0     2.944     0      2.25  ...   0
machine   2.944     0       0      2.25  ...   0
system      0       0       0      1.15  ...  1.84
for        2.25    2.25    1.15     0    ...   0
...
user        0       0      1.84     0    ...   0

Some (fixable) problems:

Solution 3: Instead of count(w, c), use PMI(w, c):

PMI(w, c) = log( p(c|w) / p(c) ) = log( (count(w, c) * N) / (count(w) * count(c)) )

where N is the total number of words.

If count(w, c) = 0, PMI(w, c) = −∞. Instead use

PMI_0(w, c) = PMI(w, c)  if count(w, c) > 0
            = 0          otherwise

or

PPMI(w, c) = PMI(w, c)  if PMI(w, c) > 0
           = 0          otherwise
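The PPMI transform can be sketched directly from the formula; the `cooc` dict below is a made-up toy example, and N here is taken as the total count mass of the table (a stand-in for the slide's "total number of words"):

```python
# Sketch: PPMI(w, c) = max(0, log(count(w,c) * N / (count(w) * count(c)))),
# with the count(w,c) = 0 case (PMI = -inf) clipped to 0.
import math

cooc = {  # hypothetical (word, context) co-occurrence counts
    ("human", "machine"): 1,
    ("human", "for"): 1,
    ("machine", "for"): 1,
    ("system", "user"): 2,
}
N = sum(cooc.values())  # total count mass of the table

def marginal(term, axis):
    """Sum of counts where `term` is the word (axis=0) or context (axis=1)."""
    return sum(c for pair, c in cooc.items() if pair[axis] == term)

def ppmi(w, c):
    if cooc.get((w, c), 0) == 0:
        return 0.0  # raw PMI would be -inf; PPMI clips it to 0
    pmi = math.log(cooc[(w, c)] * N / (marginal(w, 0) * marginal(c, 1)))
    return max(0.0, pmi)

print(round(ppmi("system", "user"), 3))  # log(2*5 / (2*2)) = log(2.5) ≈ 0.916
print(ppmi("human", "user"))             # 0.0 — the pair never co-occurs
```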
12/70

         human  machine  system   for   ...  user
human      0     2.944      0     2.25  ...    0
machine  2.944     0        0     2.25  ...    0
system     0       0        0     1.15  ...  1.84
for      2.25    2.25     1.15     0    ...    0
  .        .       .        .      .    ...    .
user       0       0      1.84     0    ...    0

Some (severe) problems

Very high dimensional (|V|)

Very sparse

Grows with the size of the vocabulary

Solution: Use dimensionality reduction (SVD)
13/70
Module 10.3: SVD for learning word representations
14/70
X (m×n) = [u1 ⋯ uk] (m×k) · diag(σ1, …, σk) (k×k) · [v1 ⋯ vk]^T (k×n)

Singular Value Decomposition gives a rank-k approximation of the original matrix:

X = X_PPMI (m×n) = U (m×k) Σ (k×k) V^T (k×n)

X_PPMI (simplifying notation to X) is the co-occurrence matrix with PPMI values

SVD gives the best rank-k approximation of the original data (X)

Discovers latent semantics in the corpus (let us examine this with the help of an example)
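The rank-k truncation is direct to sketch with NumPy (`np.linalg.svd` returns all singular values in decreasing order; keeping the top k gives the factors above; the random matrix stands in for the PPMI matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5))          # stand-in for the m x n PPMI matrix
k = 2

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k  = U[:, :k]                 # m x k
S_k  = np.diag(s[:k])           # k x k, diag(sigma_1, ..., sigma_k)
Vt_k = Vt[:k, :]                # k x n

X_hat = U_k @ S_k @ Vt_k        # best rank-k approximation of X
print(np.linalg.matrix_rank(X_hat))   # 2
```

`full_matrices=False` returns the "thin" factors, which is all the truncation needs.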
15/70
X (m×n) = [u1 ⋯ uk] (m×k) · diag(σ1, …, σk) (k×k) · [v1 ⋯ vk]^T (k×n)
        = σ1 u1 v1^T + σ2 u2 v2^T + ⋯ + σk uk vk^T

Notice that the product can be written as a sum of k rank-1 matrices

Each σi ui vi^T ∈ R^(m×n) because it is a product of an m × 1 vector with a 1 × n vector

If we truncate the sum at σ1 u1 v1^T then we get the best rank-1 approximation of X (By the SVD theorem! But what does this mean? We will see on the next slide)

If we truncate the sum at σ1 u1 v1^T + σ2 u2 v2^T then we get the best rank-2 approximation of X, and so on
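The rank-1 expansion can be checked numerically (a sketch; `np.outer(u, v)` forms the m × n rank-1 matrix u vᵀ):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((4, 3))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rebuild X as a sum of rank-1 matrices sigma_i * u_i * v_i^T.
X_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(X_sum, X))   # True: the full sum recovers X exactly
```

Truncating the Python `sum` after the first term gives the rank-1 approximation the slide describes.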
16/70
X (m×n) = [u1 ⋯ uk] (m×k) · diag(σ1, …, σk) (k×k) · [v1 ⋯ vk]^T (k×n)
        = σ1 u1 v1^T + σ2 u2 v2^T + ⋯ + σk uk vk^T

What do we mean by approximation here?

Notice that X has m × n entries

When we use the rank-1 approximation we are using only m + n + 1 entries to reconstruct it [u ∈ R^m, v ∈ R^n, σ ∈ R]

But the SVD theorem tells us that u1, v1 and σ1 store the most information in X (akin to the principal components of X)

Each subsequent term (σ2 u2 v2^T, σ3 u3 v3^T, …) stores less and less important information
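The storage argument can be made concrete: the rank-1 term needs only m + n + 1 numbers yet reconstructs a full m × n matrix (a sketch with an arbitrary random matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((50, 40))                  # m x n = 2000 entries
U, s, Vt = np.linalg.svd(X, full_matrices=False)

u1, sigma1, v1 = U[:, 0], s[0], Vt[0, :]  # the top singular triple
X1 = sigma1 * np.outer(u1, v1)            # rank-1 reconstruction, 50 x 40

stored = u1.size + v1.size + 1            # m + n + 1 = 91 numbers
print(stored, X1.shape)                   # 91 (50, 40)
```

2000 entries approximated from 91 stored values; adding each further triple costs another m + n + 1 numbers and improves the approximation.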
17/70
very light green:  0 0 0 1 | 1 0 1 1
light green:       0 0 1 0 | 1 0 1 1
dark green:        0 1 0 0 | 1 0 1 1
very dark green:   1 0 0 0 | 1 0 1 1

(first 4 bits: shade, last 4 bits: the color green)

As an analogy consider the case when we are using 8 bits to represent colors

The representations of very light, light, dark and very dark green would look different

But now what if we were asked to compress this into 4 bits? (akin to compressing m × n values into m + n + 1 values on the previous slide)

We will retain the most important 4 bits, and the previously (slightly) latent similarity between the colors now becomes very obvious

Something similar is guaranteed by SVD (retain the most important information and discover the latent similarities between words)
18/70
Co-occurrence matrix (X):

         human  machine  system   for   ...  user
human      0     2.944      0     2.25  ...    0
machine  2.944     0        0     2.25  ...    0
system     0       0        0     1.15  ...  1.84
for      2.25    2.25     1.15     0    ...    0
  .        .       .        .      .    ...    .
user       0       0      1.84     0    ...    0

Low-rank reconstruction of X:

         human  machine  system   for   ...  user
human    2.01    2.01     0.23    2.14  ...  0.43
machine  2.01    2.01     0.23    2.14  ...  0.43
system   0.23    0.23     1.17    0.96  ...  1.29
for      2.14    2.14     0.96    1.87  ... -0.13
  .        .       .        .      .    ...    .
user     0.43    0.43     1.29   -0.13  ...  1.71

Notice that after low-rank reconstruction with SVD, the latent co-occurrence between {system, machine} and {human, user} has become visible
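This effect is easy to reproduce on a toy matrix (a sketch: "system" below never co-occurs directly with "human" or "machine", but shares the "for" context with them, so the rank-2 reconstruction fills in a nonzero entry; the numbers are illustrative, not the slide's):

```python
import numpy as np

# Toy PPMI-style matrix over (human, machine, system, for).
X = np.array([
    [0.0, 2.9, 0.0, 2.2],   # human
    [2.9, 0.0, 0.0, 2.2],   # machine
    [0.0, 0.0, 0.0, 1.1],   # system
    [2.2, 2.2, 1.1, 0.0],   # for
])

U, s, Vt = np.linalg.svd(X)
k = 2
X_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
# The system-human entry, exactly 0 in X, is now nonzero in X_hat:
print(round(X_hat[2, 0], 3))
```

The shared context is enough for the top singular directions to link the two words.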
19/70
X =

         human  machine  system   for   ...  user
human      0     2.944      0     2.25  ...    0
machine  2.944     0        0     2.25  ...    0
system     0       0        0     1.15  ...  1.84
for      2.25    2.25     1.15     0    ...    0
  .        .       .        .      .    ...    .
user       0       0      1.84     0    ...    0

XX^T =

         human  machine  system    for   ...  user
human    32.5    23.9     7.78    20.25  ...  7.01
machine  23.9    32.5     7.78    20.25  ...  7.01
system   7.78    7.78      0      17.65  ... 21.84
for      20.25   20.25   17.65    36.3   ...  11.8
  .        .       .        .       .    ...    .
user     7.01    7.01    21.84    11.8   ...  28.3

cosine sim(human, user) = 0.21

Recall that earlier each row of the original matrix X served as the representation of a word

Then XX^T is a matrix whose ij-th entry is the dot product between the representation of word i (X[i, :]) and word j (X[j, :]). For example,

[1 2 3]   [1 2 1]   [.  .  22]
[2 1 0] × [2 1 3] = [.  .   .]
[1 3 5]   [3 0 5]   [.  .   .]
   X        X^T        XX^T

(the (1, 3) entry is X[1, :] · X[3, :] = 1·1 + 2·3 + 3·5 = 22)

The ij-th entry of XX^T thus (roughly) captures the cosine similarity between word_i and word_j
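The small 3 × 3 example can be checked directly (a sketch; cosine similarity is the dot product that XXᵀ stores, normalised by the row norms):

```python
import numpy as np

X = np.array([[1, 2, 3],
              [2, 1, 0],
              [1, 3, 5]])

G = X @ X.T                      # G[i, j] = X[i, :] . X[j, :]
print(G[0, 2])                   # 22, as in the slide

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine(X[0], X[2]), 3))   # 0.994
```

Hence "roughly": the dot product tracks cosine similarity up to the per-row norms.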
20/70
X =
            human   machine  system   for    ...   user
human        2.01    2.01     0.23    2.14   ...   0.43
machine      2.01    2.01     0.23    2.14   ...   0.43
system       0.23    0.23     1.17    0.96   ...   1.29
for          2.14    2.14     0.96    1.87   ...  -0.13
...
user         0.43    0.43     1.29   -0.13   ...   1.71

XX^T =
            human   machine  system   for    ...   user
human       25.4    25.4      7.6    21.9    ...   6.84
machine     25.4    25.4      7.6    21.9    ...   6.84
system       7.6     7.6     24.8    18.03   ...  20.6
for         21.9    21.9      0.96   24.6    ...  15.32
...
user         6.84    6.84    20.6    15.32   ...  17.11
cosine sim(human, user) = 0.33
Once we do an SVD, what is a good choice for the representation of word_i?
Obviously, taking the i-th row of the reconstructed matrix does not make sense because it is still high dimensional.
But we saw that the reconstructed matrix X = UΣV^T discovers latent semantics and its word representations are more meaningful.
Wishlist: We would want representations of words (i, j) to be of smaller dimensions but still have the same similarity (dot product) as the corresponding rows of X.
Notice that the dot product between the rows of the matrix Wword = UΣ is the same as the dot product between the rows of X:
$$
\begin{aligned}
XX^T &= (U\Sigma V^T)(U\Sigma V^T)^T \\
     &= (U\Sigma V^T)(V\Sigma^T U^T) \\
     &= U\Sigma\Sigma^T U^T \qquad (\because V^T V = I) \\
     &= (U\Sigma)(U\Sigma)^T = W_{word} W_{word}^T
\end{aligned}
$$
Conventionally,
Wword = UΣ ∈ R^{m×k}
is taken as the representation of the m words in the vocabulary, and
Wcontext = V
is taken as the representation of the context words.
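As a quick numerical check (a sketch using numpy's SVD on a small random matrix as a stand-in for the actual co-occurrence data), the rows of Wword = UΣ reproduce exactly the row dot products of X:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))  # stand-in for the word-context matrix

# Thin SVD: X = U diag(s) Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

W_word = U * s       # equals U @ np.diag(s); each word is a k-dim row
W_context = Vt.T     # columns of V as context representations

# Row dot products are preserved: X X^T = (U Σ)(U Σ)^T
print(np.allclose(X @ X.T, W_word @ W_word.T))
```

With a rank-k truncation (keeping only the top-k singular values), the equality becomes an approximation, which is what gives the lower-dimensional representations on the wishlist.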
Module 10.4: Continuous bag of words model
The methods that we have seen so far are called count based models because they use the co-occurrence counts of words.
We will now see methods which directly learn word representations (these are called (direct) prediction based models).
The story ahead ...
Continuous bag of words model
Skip gram model with negative sampling (the famous word2vec)
GloVe word embeddings
Evaluating word embeddings
Good old SVD does just fine!!
Sometime in the 21st century, Joseph Cooper, a widowed former engineer and former NASA pilot, runs a farm with his father-in-law Donald, son Tom, and daughter Murphy. It is a post-truth society (Cooper is reprimanded for telling Murphy that the Apollo missions did indeed happen) and a series of crop blights threatens humanity's survival. Murphy believes her bedroom is haunted by a poltergeist. When a pattern is created out of dust on the floor, Cooper realizes that gravity is behind its formation, not a "ghost". He interprets the pattern as a set of geographic coordinates formed into binary code. Cooper and Murphy follow the coordinates to a secret NASA facility, where they are met by Cooper's former professor, Dr. Brand.
Some sample 4-word windows from a corpus.
Consider this task: predict the n-th word given the previous n−1 words.
Example: he sat on a chair
Training data: all n-word windows in your corpus.
Training data for this task is easily available (take all n-word windows from the whole of Wikipedia).
For ease of illustration, we will first focus on the case when n = 2 (i.e., predict the second word based on the first word).
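The window extraction can be sketched in a few lines (using the slide's example sentence; for n = 2 each window is a (context word, target word) training pair):

```python
# Build all n-word windows over a tokenised corpus.
corpus = "he sat on a chair".split()

n = 2
windows = [tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1)]
print(windows)
# For n = 2, each tuple is a (context word, target word) training pair.
```

Running this over all of Wikipedia instead of one sentence yields the (very large) training set the slide refers to.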
We will now try to answer these two questions:
How do you model this task?
What is the connection between this task and learning word representations?
[Figure: a feedforward network. The input x ∈ R^{|V|} is the one-hot vector (0 1 0 ... 0 0 0) for the context word "sat"; Wcontext ∈ R^{k×|V|} maps it to the hidden layer h ∈ R^k; Wword ∈ R^{k×|V|} then produces the outputs P(he|sat), P(chair|sat), P(man|sat), P(on|sat), ...]
We will model this problem using a feedforward neural network.
Input: one-hot representation of the context word.
Output: there are |V| words (classes) possible and we want to predict a probability distribution over these |V| classes (a multi-class classification problem).
Parameters: Wcontext ∈ R^{k×|V|} and Wword ∈ R^{k×|V|} (we are assuming that the set of words and context words is the same, each of size |V|).
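A minimal forward pass of this network can be sketched as follows (numpy, with made-up sizes |V| = 5 and k = 3; the weights here are random placeholders, whereas the real model learns them by training):

```python
import numpy as np

V, k = 5, 3                     # assumed toy sizes: |V| = 5 words, k = 3 dims
rng = np.random.default_rng(0)
W_context = rng.standard_normal((k, V))   # Wcontext ∈ R^{k×|V|}
W_word = rng.standard_normal((k, V))      # Wword ∈ R^{k×|V|}

x = np.zeros(V)
x[1] = 1.0                      # one-hot vector for the context word "sat"

h = W_context @ x               # hidden representation, h ∈ R^k
scores = W_word.T @ h           # one score per word in the vocabulary
p = np.exp(scores) / np.exp(scores).sum()   # softmax over the |V| classes

print(p)                        # predicted distribution over the next word
```

The softmax output is the probability distribution P(word | context) that the multi-class classification objective is trained against.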
What is the product Wcontext x, given that x is a one-hot vector?
It is simply the i-th column of Wcontext:

$$
\begin{bmatrix} -1 & 0.5 & 2 \\ 3 & -1 & -2 \\ -2 & 1.7 & 3 \end{bmatrix}
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}
=
\begin{bmatrix} 0.5 \\ -1 \\ 1.7 \end{bmatrix}
$$
So when the i-th word is present, the i-th element in the one-hot vector is ON and the i-th column of Wcontext gets selected.
In other words, there is a one-to-one correspondence between the words and the columns of Wcontext.
More specifically, we can treat the i-th column of Wcontext as the representation of context i.
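The column-selection property is easy to verify numerically with the slide's 3×3 example:

```python
import numpy as np

W_context = np.array([[-1.0,  0.5,  2.0],
                      [ 3.0, -1.0, -2.0],
                      [-2.0,  1.7,  3.0]])

x = np.array([0.0, 1.0, 0.0])   # one-hot vector with the 2nd entry ON

# Multiplying by a one-hot vector just picks out the corresponding column.
print(W_context @ x)
```

In practice this is why embedding layers are implemented as a table lookup (indexing into the matrix) rather than an actual matrix-vector product.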
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
29/70
[Figure: feedforward network — one-hot input x ∈ R^{|V|} (0 1 0 ... 0 0 0) for the context word "sat", hidden layer h ∈ R^k computed via Wcontext ∈ R^{k×|V|}, and output probabilities P(he|sat), P(chair|sat), P(man|sat), P(on|sat), ... computed via Wword ∈ R^{k×|V|}]

$$P(on \mid sat) = \frac{e^{(W_{word}h)[i]}}{\sum_{j} e^{(W_{word}h)[j]}}$$

How do we obtain P(on|sat)? For this multi-class classification problem what is an appropriate output function? (softmax)

Therefore, P(on|sat) is proportional to the dot product between the j-th column of Wcontext and the i-th column of Wword

P(word = i|sat) thus depends on the i-th column of Wword

We thus treat the i-th column of Wword as the representation of word i

Hope you see an analogy with SVD! (there we had a different way of learning Wcontext and Wword, but we saw that the i-th column of Wword corresponded to the representation of the i-th word)

Now that we understand the interpretation of Wcontext and Wword, our aim is to learn these parameters
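The softmax output layer above can be sketched in a few lines of NumPy (the sizes and random values are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
k, V = 4, 6                        # toy embedding size and vocabulary size
W_context = rng.normal(size=(k, V))
W_word = rng.normal(size=(k, V))

c = 2                              # index of the context word (e.g. "sat")
h = W_context[:, c]                # = W_context @ one_hot(c)

# With W_word of shape k x |V|, the scores are W_word^T h,
# i.e. score[i] = (i-th column of W_word) . h for every word i
scores = W_word.T @ h
probs = np.exp(scores) / np.exp(scores).sum()   # softmax: P(word = i | context = c)
```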
30/70
[Figure: the same network, with the output node y = P(on|sat) highlighted; input x ∈ R^{|V|} for "sat", hidden layer h ∈ R^k, Wcontext, Wword ∈ R^{k×|V|}]

We denote the context word (sat) by the index c and the correct output word (on) by the index w

For this multiclass classification problem what is an appropriate output function (y = f(x))? (softmax)

What is an appropriate loss function? (cross-entropy)

$$\mathscr{L}(\theta) = -\log y_w = -\log P(w \mid c)$$

$$h = W_{context} \cdot x_c = u_c$$

$$y_w = \frac{\exp(u_c \cdot v_w)}{\sum_{w' \in V} \exp(u_c \cdot v_{w'})}$$

where u_c is the column of Wcontext corresponding to context c and v_w is the column of Wword corresponding to word w
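The cross-entropy loss for a single (c, w) pair can be sketched as follows; the function name and the log-sum-exp stabilization are assumptions for illustration, not the lecture's code:

```python
import numpy as np

def skipgram_loss(W_context, W_word, c, w):
    """Cross-entropy loss -log P(w|c) for one (context, word) pair."""
    u_c = W_context[:, c]                  # context representation u_c
    scores = W_word.T @ u_c                # u_c . v_w' for every word w'
    # log-sum-exp trick for numerical stability
    m = scores.max()
    log_Z = m + np.log(np.exp(scores - m).sum())
    return -(scores[w] - log_Z)            # = -log softmax(scores)[w]

rng = np.random.default_rng(1)
W_context = rng.normal(size=(5, 8))
W_word = rng.normal(size=(5, 8))
loss = skipgram_loss(W_context, W_word, c=2, w=4)
```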
31/70
[Figure: the same network diagram — one-hot input x ∈ R^{|V|} for "sat", h ∈ R^k, Wcontext, Wword ∈ R^{k×|V|}, outputs P(he|sat), P(chair|sat), P(man|sat), P(on|sat), ...]

How do we train this simple feedforward neural network? (backpropagation)

Let us consider one input-output pair (c, w) and see the update rule for v_w
32/70
[Figure: the same network diagram]

$$\nabla_{v_w} = \frac{\partial}{\partial v_w} \mathscr{L}(\theta)$$

$$\begin{aligned}
\mathscr{L}(\theta) &= -\log y_w \\
&= -\log \frac{\exp(u_c \cdot v_w)}{\sum_{w' \in V} \exp(u_c \cdot v_{w'})} \\
&= -\Big(u_c \cdot v_w - \log \sum_{w' \in V} \exp(u_c \cdot v_{w'})\Big)
\end{aligned}$$

$$\begin{aligned}
\nabla_{v_w} &= -\Big(u_c - \frac{\exp(u_c \cdot v_w)}{\sum_{w' \in V} \exp(u_c \cdot v_{w'})} \cdot u_c\Big) \\
&= -u_c(1 - y_w)
\end{aligned}$$

And the update rule would be

$$v_w = v_w - \eta \nabla_{v_w} = v_w + \eta u_c (1 - y_w)$$
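The closed-form gradient −u_c(1 − y_w) can be verified against the loss by finite differences; a toy sketch with random values (sizes and step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
k, V = 5, 7
W_context = rng.normal(size=(k, V))
W_word = rng.normal(size=(k, V))
c, w = 1, 3

u_c = W_context[:, c]
scores = W_word.T @ u_c
y = np.exp(scores) / np.exp(scores).sum()    # softmax over the vocabulary

grad_vw = -u_c * (1 - y[w])                  # closed-form gradient d L / d v_w

def loss(W):
    """-log y_w as a function of the word matrix."""
    s = W.T @ u_c
    return -np.log(np.exp(s)[w] / np.exp(s).sum())

# Central finite-difference check of d(-log y_w)/d(v_w)
eps = 1e-6
num = np.zeros(k)
for i in range(k):
    Wp, Wm = W_word.copy(), W_word.copy()
    Wp[i, w] += eps
    Wm[i, w] -= eps
    num[i] = (loss(Wp) - loss(Wm)) / (2 * eps)

assert np.allclose(grad_vw, num, atol=1e-4)

# The SGD step: v_w <- v_w - eta * grad = v_w + eta * u_c * (1 - y_w)
eta = 0.1
v_w_new = W_word[:, w] - eta * grad_vw
```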
33/70
[Figure: the same network diagram]

This update rule has a nice interpretation:

$$v_w = v_w + \eta u_c (1 - y_w)$$

If y_w → 1, then we are already predicting the right word and v_w will not be updated

If y_w → 0, then v_w gets updated by adding a fraction of u_c to it

This increases the cosine similarity between v_w and u_c (How? Refer to slide 38 of Lecture 2)

The training objective ensures that the cosine similarity between the word (v_w) and context word (u_c) is maximized
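That a single update moves v_w toward u_c can be seen numerically; a toy sketch with random vectors and an assumed learning rate and predicted probability:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(3)
u_c = rng.normal(size=5)
v_w = rng.normal(size=5)

eta, y_w = 0.5, 0.1                     # illustrative learning rate and y_w
before = cos(v_w, u_c)
v_w_new = v_w + eta * u_c * (1 - y_w)   # the update rule from the slide
after = cos(v_w_new, u_c)

assert after > before                   # v_w moved toward u_c
```

Adding a positive multiple of u_c to v_w strictly increases their cosine similarity whenever the two vectors are not already parallel, which is what the assertion checks.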
![Page 133: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/133.jpg)
33/70
0 1 0 ... 0 0 0
sat
. . . . . . . . . .
. . . . . . . . . . . .
P(he|sat)
P(chair|sat)
P(man|sat)
P(on|sat)
. . . . . . . . .
h ∈ Rk
Wword ∈ Rk×|V |
x ∈ R|V |
Wcontext ∈
Rk×|V |
This update rule has a nice interpret-ation
vw = vw + ηuc(1− yw)
If yw → 1 then we are already predict-ing the right word and vw will not beupdated
If yw → 0 then vw gets updated byadding a fraction of uc to it
This increases the cosine similaritybetween vw and uc (How? Refer toslide 38 of Lecture 2)
The training objective ensures thatthe cosine similarity between word(vw) and context word (uc) is max-imized
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
![Page 134: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/134.jpg)
33/70
0 1 0 ... 0 0 0
sat
. . . . . . . . . .
. . . . . . . . . . . .
P(he|sat)
P(chair|sat)
P(man|sat)
P(on|sat)
. . . . . . . . .
h ∈ Rk
Wword ∈ Rk×|V |
x ∈ R|V |
Wcontext ∈
Rk×|V |
This update rule has a nice interpret-ation
vw = vw + ηuc(1− yw)
If yw → 1 then we are already predict-ing the right word and vw will not beupdated
If yw → 0 then vw gets updated byadding a fraction of uc to it
This increases the cosine similaritybetween vw and uc (How? Refer toslide 38 of Lecture 2)
The training objective ensures thatthe cosine similarity between word(vw) and context word (uc) is max-imized
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
What happens to the representations of two words w and w′ which tend to appear in similar contexts (c)?

The training ensures that both v_w and v_{w′} have a high cosine similarity with u_c, and hence transitively (intuitively) ensures that v_w and v_{w′} have a high cosine similarity with each other.

This is only a (reasonable) intuition; we haven't come across a formal proof for this!
[Figure: CBOW with two context words: inputs 'he' and 'sat' as a stacked one-hot vector x ∈ R^{2|V|}, hidden layer h ∈ R^k, input-side parameters [W_context, W_context] ∈ R^{k×2|V|}, output-side parameters W_word, and softmax outputs P(he|sat,he), P(chair|sat,he), P(man|sat,he), P(on|sat,he), ...]
In practice, instead of a window size of 1 it is common to use a window size of d. So now,

h = Σ_{i=1}^{d−1} u_{c_i}

[W_context, W_context] just means that we are stacking 2 copies of the W_context matrix:

⎡ −1  0.5   2   −1  0.5   2 ⎤
⎢  3  −1   −2    3  −1   −2 ⎥ × [0 1 0 0 0 1]^T = [2.5  −3  4.7]^T
⎣ −2  1.7   3   −2  1.7   3 ⎦

where the two 1s in x pick out 'sat' (in the first copy) and 'he' (in the second copy). The resultant product is simply the sum of the columns corresponding to 'sat' and 'he'.
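This column-sum view can be verified in a few lines of numpy (the toy matrix is the one from the slide; the positions of 'sat' and 'he' in the one-hot vector are illustrative):

```python
import numpy as np

# Toy W_context (k = 3, |V| = 3), stacked twice as on the slide.
W = np.array([[-1.0,  0.5,  2.0],
              [ 3.0, -1.0, -2.0],
              [-2.0,  1.7,  3.0]])
W_stacked = np.hstack([W, W])       # shape (3, 6)

x = np.zeros(6)
x[1] = 1.0                          # 'sat' in the first copy
x[5] = 1.0                          # 'he' in the second copy

h = W_stacked @ x
assert np.allclose(h, W[:, 1] + W[:, 2])   # product == sum of the two columns
assert np.allclose(h, [2.5, -3.0, 4.7])
```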
Of course, in practice we will not do this expensive matrix multiplication. If 'he' is the i-th word in the vocabulary and 'sat' is the j-th word, then we will simply access the i-th and j-th columns of the matrix and add them.
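A sketch of this shortcut with hypothetical indices; the cheap column lookup and the full one-hot multiplication agree:

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 10, 4                        # toy vocabulary and embedding sizes
W_context = rng.normal(size=(k, V))

i, j = 3, 7                         # hypothetical indices of 'he' and 'sat'
h_lookup = W_context[:, i] + W_context[:, j]   # cheap O(k) column lookup

x = np.zeros(V)                     # the equivalent one-hot multiplication
x[i] = x[j] = 1.0
h_matmul = W_context @ x
assert np.allclose(h_lookup, h_matmul)
```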
Now what happens during backpropagation?

Recall that

h = Σ_{i=1}^{d−1} u_{c_i}

and

P(on|sat, he) = exp((W_word h)[k]) / Σ_j exp((W_word h)[j])

where 'k' is the index of the word 'on'.

The loss function depends on {W_word, u_{c_1}, u_{c_2}, ..., u_{c_{d−1}}} and all these parameters will get updated during backpropagation.

Try deriving the update rule for v_w now and see how it differs from the one we derived before.
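As a concrete reference point for that derivation, here is a hedged numpy sketch of the forward pass that produces this probability (toy sizes and hypothetical indices; W_word is stored as |V| × k purely for convenience):

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 8, 4                           # toy vocabulary and embedding sizes
W_context = rng.normal(size=(k, V))   # context representations u_c (columns)
W_word = rng.normal(size=(V, k))      # word representations v_w (rows)

ctx = [2, 5]                          # hypothetical indices of 'sat' and 'he'
h = W_context[:, ctx].sum(axis=1)     # h = sum of the context columns

scores = W_word @ h                   # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                  # softmax over the whole vocabulary

target = 1                            # hypothetical index of 'on'
loss = -np.log(probs[target])         # cross-entropy for the true word
assert np.isclose(probs.sum(), 1.0) and loss > 0
```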
Some problems:

Notice that the softmax function at the output is computationally very expensive:

ŷ_w = exp(u_c · v_w) / Σ_{w′∈V} exp(u_c · v_{w′})

The denominator requires a summation over all words in the vocabulary.

We will revisit this issue soon.
Module 10.5: Skip-gram model
The model that we just saw is called the continuous bag of words (CBOW) model (it predicts an output word given a bag of context words).

We will now see the skip-gram model (which predicts context words given an input word).
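The difference in prediction direction shows up directly in how training pairs are built; a small sketch over a toy sentence (window size 1, purely illustrative):

```python
# Generate (input, target) training pairs for CBOW and skip-gram
# from a toy corpus, with a symmetric window of 1 word on each side.
corpus = "he sat on a chair".split()
window = 1

cbow_pairs, skipgram_pairs = [], []
for i, word in enumerate(corpus):
    context = [corpus[j]
               for j in range(max(0, i - window), min(len(corpus), i + window + 1))
               if j != i]
    cbow_pairs.append((context, word))     # bag of contexts -> word (CBOW)
    for c in context:
        skipgram_pairs.append((word, c))   # word -> each context word (skip-gram)

assert (["he", "on"], "sat") in cbow_pairs
assert ("sat", "on") in skipgram_pairs
```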
[Figure: the skip-gram network: one-hot input x ∈ R^|V| for the word 'sat' from 'he sat on a chair', hidden layer h ∈ R^k, input-side parameters W_word ∈ R^{k×|V|}, output-side parameters W_context ∈ R^{k×|V|}, predicting the surrounding context words]
Notice that the role of context and word has changed now.

In the simple case when there is only one context word, we will arrive at the same update rule for u_c as we did for v_w earlier.

Notice that even when we have multiple context words, the loss function would just be a summation of many cross-entropy errors:

L(θ) = − Σ_{i=1}^{d−1} log ŷ_{w_i}

Typically, we predict context words on both sides of the given word.
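A hedged numpy sketch of this summed loss for one input word (toy sizes, hypothetical indices):

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 8, 4                            # toy vocabulary and embedding sizes
W_word = rng.normal(size=(k, V))       # input-side vectors v_w (columns)
W_context = rng.normal(size=(k, V))    # output-side vectors u_c (columns)

w = 2                                  # hypothetical index of the input word
contexts = [0, 5, 6]                   # hypothetical indices of its context words

h = W_word[:, w]                       # hidden layer is just v_w
scores = W_context.T @ h               # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# L(theta) = - sum_i log y_hat_{w_i}: one cross-entropy term per context word.
loss = -sum(np.log(probs[c]) for c in contexts)
assert loss > 0
```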
Some problems:

Same as with the bag-of-words model, the softmax function at the output is computationally expensive.

Solution 1: Use negative sampling

Solution 2: Use contrastive estimation

Solution 3: Use hierarchical softmax
D = [(sat, on), (sat, a), (sat, chair), (on, a), (on, chair), (a, chair), (on, sat), (a, sat), (chair, sat), (a, on), (chair, on), (chair, a)]

D′ = [(sat, oxygen), (sat, magic), (chair, sad), (chair, walking)]

Let D be the set of all correct (w, c) pairs in the corpus.

Let D′ be the set of all incorrect (w, r) pairs in the corpus.

D′ can be constructed by randomly sampling a context word r which has never appeared with w and creating a pair (w, r).

As before, let v_w be the representation of the word w and u_c be the representation of the context word c.
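One simple way to build D′ in code (toy pairs; the uniform sampling here is illustrative, and practical implementations typically sample negatives from a smoothed unigram distribution instead):

```python
import random

random.seed(0)
# Correct (word, context) pairs observed in a toy corpus.
D = [("sat", "on"), ("sat", "a"), ("sat", "chair"),
     ("on", "a"), ("on", "chair"), ("a", "chair")]
vocab = ["he", "sat", "on", "a", "chair", "oxygen", "magic", "sad"]

seen = {}
for w, c in D:
    seen.setdefault(w, set()).add(c)

# For each word, pair it with a random context it has never appeared with.
D_neg = [(w, random.choice([r for r in vocab if r != w and r not in seen[w]]))
         for w in seen]

assert all(pair not in D for pair in D_neg)
```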
[Figure: a unit computing P(z = 1|w, c) = σ(u_c · v_w) from the representations u_c and v_w]

For a given (w, c) ∈ D we are interested in maximizing

p(z = 1|w, c)

Let us model this probability by

p(z = 1|w, c) = σ(u_c^T v_w) = 1 / (1 + e^{−u_c^T v_w})

Considering all (w, c) ∈ D, we are interested in

maximize_θ ∏_{(w,c)∈D} p(z = 1|w, c)

where θ comprises the word representations (v_w) and context representations (u_c) for all words in our corpus.
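A minimal sketch of this model of p(z = 1|w, c), with random toy vectors standing in for learned representations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
v_w = rng.normal(size=4)   # word representation
u_c = rng.normal(size=4)   # context representation

p_pos = sigmoid(u_c @ v_w)        # p(z = 1 | w, c)
assert 0.0 < p_pos < 1.0

# The likelihood over a set of positive pairs factorizes,
# so its log is a sum of log-sigmoid terms (each negative).
pairs = [(rng.normal(size=4), rng.normal(size=4)) for _ in range(3)]
log_lik = sum(np.log(sigmoid(u @ v)) for v, u in pairs)
assert log_lik < 0.0
```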
45/70
[Figure: a network computing P(z = 0|w, r) = σ(−u_r · v_w)]

For (w, r) ∈ D′ we are interested in maximizing

p(z = 0|w, r)

Again we model this as

p(z = 0|w, r) = 1 − σ(u_r^T v_w) = 1 − 1/(1 + e^{−u_r^T v_w}) = 1/(1 + e^{u_r^T v_w}) = σ(−u_r^T v_w)

Considering all (w, r) ∈ D′, we are interested in

maximize_θ ∏_{(w,r) ∈ D′} p(z = 0|w, r)
[Figure: a network computing P(z = 0|w, r) = σ(−u_r · v_w)]

Combining the two we get:

maximize_θ ∏_{(w,c) ∈ D} p(z = 1|w, c) ∏_{(w,r) ∈ D′} p(z = 0|w, r)

= maximize_θ ∏_{(w,c) ∈ D} p(z = 1|w, c) ∏_{(w,r) ∈ D′} (1 − p(z = 1|w, r))

= maximize_θ ∑_{(w,c) ∈ D} log p(z = 1|w, c) + ∑_{(w,r) ∈ D′} log(1 − p(z = 1|w, r))

= maximize_θ ∑_{(w,c) ∈ D} log 1/(1 + e^{−u_c^T v_w}) + ∑_{(w,r) ∈ D′} log 1/(1 + e^{u_r^T v_w})

= maximize_θ ∑_{(w,c) ∈ D} log σ(u_c^T v_w) + ∑_{(w,r) ∈ D′} log σ(−u_r^T v_w)

where σ(x) = 1/(1 + e^{−x})
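The final objective above can be evaluated directly. A minimal Python sketch, not from the lecture; the toy dictionaries of 2-d vectors are made up:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_objective(v, u, positives, negatives):
    """Skip-gram negative-sampling objective:
    sum over D of log sigma(u_c . v_w)  plus
    sum over D' of log sigma(-u_r . v_w); higher is better."""
    obj = sum(math.log(sigmoid(dot(u[c], v[w]))) for w, c in positives)
    obj += sum(math.log(sigmoid(-dot(u[r], v[w]))) for w, r in negatives)
    return obj

v = {"sat": [0.3, 0.1], "on": [0.2, -0.4]}          # word vectors v_w
u = {"on": [0.5, 0.2], "oxygen": [-0.1, 0.6]}       # context vectors u_c
obj = sgns_objective(v, u, [("sat", "on")], [("sat", "oxygen")])
```

Gradient ascent on this quantity pushes u_c toward v_w for true pairs and away from it for sampled negatives.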
[Figure: a network computing P(z = 0|w, r) = σ(−u_r · v_w)]

In the original paper, Mikolov et al. sample k negative (w, r) pairs for every positive (w, c) pair

The size of D′ is thus k times the size of D

The random context word is drawn from a modified unigram distribution:

r ∼ p(r)^{3/4} = count(r)^{3/4}/N

N = total number of words in the corpus
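Sampling from the modified unigram distribution can be sketched as follows. The corpus counts here are fabricated for illustration; raising counts to the power 3/4 boosts rare words relative to frequent ones.

```python
import random
from collections import Counter

def negative_sampler(tokens, power=0.75, seed=0):
    """Return a function that draws words with probability
    proportional to count(r) ** 0.75 (the modified unigram
    distribution used for negative sampling)."""
    counts = Counter(tokens)
    words = list(counts)
    weights = [counts[w] ** power for w in words]
    rng = random.Random(seed)
    def sample():
        return rng.choices(words, weights=weights, k=1)[0]
    return sample

# toy corpus: "the" is frequent, "oxygen" is rare
tokens = ["the"] * 80 + ["chair"] * 15 + ["oxygen"] * 5
sample = negative_sampler(tokens)
draws = Counter(sample() for _ in range(10_000))
```

Under the raw unigram distribution "the" would be drawn 16 times as often as "oxygen"; under the 3/4 power the ratio drops to about 8, so rare words get picked as negatives more often.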
Module 10.6: Contrastive estimation
[Figure: the word2vec network for the context "he sat _ a chair": input x ∈ R^{|V|} (one-hot), W_context ∈ R^{k×|V|}, hidden h ∈ R^k, output W_word ∈ R^{k×|V|}]

Some problems

Same as bag of words

The softmax function at the output is computationally expensive

Solution 1: Use negative sampling

Solution 2: Use contrastive estimation

Solution 3: Use hierarchical softmax
Positive: He sat on a chair

Negative: He sat abracadabra a chair

[Figure: a two-layer network scoring each sentence: inputs v_c, v_w ("sat", "on" for the positive sentence; "sat", "abracadabra" for the negative one), W_h ∈ R^{2d×h}, W_out ∈ R^{h×1}, producing scores s and s_c]

We would like s to be greater than s_c

Okay, so let us try to maximize s − s_c

But we would like the difference to be at least m

So we can maximize s − (s_c + m)

What if s > s_c + m? (don't do anything)

In other words, minimize max(0, s_c + m − s)
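The margin objective above is a standard hinge loss. A tiny sketch (the scores are made-up numbers standing in for the network outputs s and s_c):

```python
def contrastive_loss(s, s_c, m=1.0):
    """Hinge loss max(0, s_c + m - s): zero once the positive
    score s beats the corrupted score s_c by at least margin m,
    so already-satisfied pairs contribute nothing."""
    return max(0.0, s_c + m - s)

# margin satisfied: s exceeds s_c by more than m, loss is zero
no_push = contrastive_loss(2.5, 0.5)
# margin violated: loss equals the remaining gap
push = contrastive_loss(1.0, 0.8)
```

Training minimizes this loss summed over (positive, corrupted) sentence pairs, which only updates the network when the margin is violated.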
Module 10.7: Hierarchical softmax
[Figure: the output softmax (e^{v_c^T u_w} / Σ_{|V|} e^{v_c^T u_w}) replaced by a binary tree whose internal nodes carry vectors u_1, u_2, ..., u_V; the path from the root to the leaf "on" is encoded as π(on)_1 = 1, π(on)_2 = 0, π(on)_3 = 0; input h = v_c]

Construct a binary tree such that there are |V| leaf nodes, each corresponding to one word in the vocabulary

There exists a unique path from the root node to each leaf node

Let l(w_1), l(w_2), ..., l(w_p) be the nodes on the path from the root to w

Let π(w) be a binary vector such that:

π(w)_k = 1 if the path branches left at node l(w_k)
       = 0 otherwise

Finally, each internal node is associated with a vector u_i

So the parameters of the module are W_context and u_1, u_2, ..., u_V (in effect, we have the same number of parameters as before)
[Figure: the binary tree from the previous slide, with the path π(on)_1 = 1, π(on)_2 = 0, π(on)_3 = 0 from the root to the leaf "on"; input h = v_c]

For a given pair (w, c) we are interested in the probability p(w|v_c)

We model this probability as

p(w|v_c) = ∏_k P(π(w)_k | v_c)

For example,

P(on|v_sat) = P(π(on)_1 = 1|v_sat) · P(π(on)_2 = 0|v_sat) · P(π(on)_3 = 0|v_sat)

In effect, we are saying that the probability of predicting a word is the same as that of predicting the correct unique path from the root node to that word
[Figure: the same binary tree; input h = v_c]

We model

P(π(on)_i = 1) = 1/(1 + e^{−v_c^T u_i})

P(π(on)_i = 0) = 1 − P(π(on)_i = 1) = 1/(1 + e^{v_c^T u_i})

The above model ensures that the representation of a context word v_c will have a high (low) similarity with the representation of the node u_i if the path branches to the left (right) at u_i

Again, transitively, the representations of contexts which appear with the same words will have high similarity
[Figure: the binary tree with hidden layer h = vc and the root-to-leaf path for "on" through internal nodes u1, u2, . . . , uV]

P(w|vc) = ∏_{k=1}^{|π(w)|} P(π(w)k | vc)

Note that p(w|vc) can now be computed using |π(w)| computations instead of the |V| required by softmax

How do we construct the binary tree?

It turns out that even a random arrangement of the words on leaf nodes does well in practice
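The path-probability computation above can be sketched in a few lines of NumPy. The tree layout, node vectors, and the left = 1 / right = 0 encoding below are hypothetical stand-ins for whatever tree an implementation actually builds:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def path_probability(v_c, path):
    """p(w | v_c) as a product of branch probabilities along w's root-to-leaf path.

    path: list of (u_i, branch) pairs, where u_i is an internal node's vector
    and branch is 1 if the path goes left at u_i, 0 if it goes right.
    """
    p = 1.0
    for u_i, branch in path:
        p_left = sigmoid(v_c @ u_i)               # P(pi(w)_i = 1 | v_c)
        p *= p_left if branch == 1 else (1.0 - p_left)
    return p

# toy example: a 3-node path, so only 3 dot products instead of |V| scores
rng = np.random.default_rng(0)
v_c = rng.standard_normal(5)
path = [(rng.standard_normal(5), b) for b in (1, 0, 0)]   # pi(on) = (1, 0, 0)
print(path_probability(v_c, path))
```

The cost is one sigmoid per node on the path, which is what makes this O(|π(w)|) rather than O(|V|).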
![Page 225: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/225.jpg)
57/70
Module 10.8: GloVe representations
![Page 226: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/226.jpg)
58/70
Count based methods (SVD) rely on global co-occurrence counts from the corpus for computing word representations

Predict based methods learn word representations using co-occurrence information

Why not combine the two (count and learn)?
![Page 229: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/229.jpg)
59/70
Corpus:

Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

X =

          human   machine  system  for     ...  user
human     2.01    2.01     0.23    2.14    ...  0.43
machine   2.01    2.01     0.23    2.14    ...  0.43
system    0.23    0.23     1.17    0.96    ...  1.29
for       2.14    2.14     0.96    1.87    ...  -0.13
...
user      0.43    0.43     1.29    -0.13   ...  1.71

P(j|i) = Xij / Σj Xij = Xij / Xi    (Xij = Xji)

Xij encodes important global information about the co-occurrence between i and j (global: because it is computed for the entire corpus)

Why not learn word vectors which are faithful to this information?

For example, enforce

vi^T vj = log P(j|i) = log Xij − log Xi

Similarly,

vj^T vi = log Xij − log Xj    (since Xij = Xji)

Essentially we are saying that we want word vectors vi and vj such that vi^T vj is faithful to the globally computed P(j|i)
![Page 234: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/234.jpg)
60/70
Adding the two equations we get

2 vi^T vj = 2 log Xij − log Xi − log Xj

vi^T vj = log Xij − (1/2) log Xi − (1/2) log Xj

Note that log Xi and log Xj depend only on the words i & j and we can think of them as word-specific biases which will be learned

vi^T vj = log Xij − bi − bj

vi^T vj + bi + bj = log Xij

We can then formulate this as the following optimization problem

min_{vi, vj, bi, bj} Σ_{i,j} (vi^T vj + bi + bj − log Xij)^2

where vi^T vj + bi + bj is the predicted value using model parameters and log Xij is the actual value computed from the given corpus
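This least-squares problem can be attacked directly with gradient descent. A minimal sketch follows; the toy count matrix, learning rate, and iteration budget are invented for illustration (the actual GloVe implementation trains with AdaGrad on sparse co-occurrence data):

```python
import numpy as np

def glove_loss(V, b, logX):
    """sum_ij (v_i^T v_j + b_i + b_j - log X_ij)^2"""
    err = V @ V.T + b[:, None] + b[None, :] - logX
    return np.sum(err ** 2)

# hypothetical symmetric co-occurrence counts for a 3-word vocabulary
X = np.array([[4.0, 2.0, 1.0],
              [2.0, 4.0, 2.0],
              [1.0, 2.0, 4.0]])
logX = np.log(X)

rng = np.random.default_rng(0)
V = 0.1 * rng.standard_normal((3, 2))    # word vectors (3 words, 2 dims)
b = np.zeros(3)                          # word-specific biases

lr, loss0 = 0.01, glove_loss(V, b, logX)
for _ in range(2000):
    err = V @ V.T + b[:, None] + b[None, :] - logX       # residual matrix
    V -= lr * 2 * (err + err.T) @ V                      # gradient w.r.t. vectors
    b -= lr * 2 * (err.sum(axis=1) + err.sum(axis=0))    # gradient w.r.t. biases

print(loss0, "->", glove_loss(V, b, logX))
```

Because X is symmetric, the gradient for vi collects error terms from both row i and column i, hence the (err + err.T) factor.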
![Page 237: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/237.jpg)
61/70
min_{vi, vj, bi, bj} Σ_{i,j} (vi^T vj + bi + bj − log Xij)^2

Drawback: this weighs all co-occurrences equally

Solution: add a weighting function

min_{vi, vj, bi, bj} Σ_{i,j} f(Xij)(vi^T vj + bi + bj − log Xij)^2

Wishlist: f(Xij) should be such that neither rare nor frequent words are over-weighted

f(x) = (x/xmax)^α if x < xmax, 1 otherwise

where α can be tuned for a given dataset
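The wishlist translates directly into code. The defaults below (xmax = 100, α = 3/4) are the values reported in the GloVe paper, but both are tunable:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): down-weights rare co-occurrences,
    and caps the weight of very frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```

Rare pairs (x near 0) get weight near 0, and every pair at or above x_max contributes with the same unit weight, so neither tail dominates the objective.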
![Page 241: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/241.jpg)
62/70
Module 10.9: Evaluating word representations
![Page 242: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/242.jpg)
63/70
How do we evaluate the learned word representations?
![Page 243: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/243.jpg)
64/70
Shuman(cat, dog) = 0.8

Smodel(cat, dog) = vcat^T vdog / (‖vcat‖ ‖vdog‖) = 0.7

Semantic Relatedness

Ask humans to judge the relatedness between a pair of words

Compute the cosine similarity between the corresponding word vectors learned by the model

Given a large number of such word pairs, compute the correlation between Smodel & Shuman, and compare different models

Model 1 is better than Model 2 if

correlation(Smodel1, Shuman) > correlation(Smodel2, Shuman)
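A sketch of this evaluation protocol; the word vectors and human scores below are invented toy data, and Pearson correlation is used for brevity (benchmarks such as WordSim-353 typically report Spearman):

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def relatedness_score(vecs, judged_pairs):
    """Correlation between model cosine similarities and human judgments."""
    s_model = [cosine(vecs[a], vecs[b]) for a, b, _ in judged_pairs]
    s_human = [s for _, _, s in judged_pairs]
    return np.corrcoef(s_model, s_human)[0, 1]

# toy data: (word1, word2, human relatedness judgment)
vecs = {"cat": np.array([1.0, 0.2, 0.0]),
        "dog": np.array([0.9, 0.3, 0.1]),
        "car": np.array([0.0, 0.1, 1.0])}
pairs = [("cat", "dog", 0.8), ("cat", "car", 0.1), ("dog", "car", 0.2)]
print(relatedness_score(vecs, pairs))
```

A model whose cosine scores rank pairs the same way humans do gets a correlation near 1.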
![Page 248: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/248.jpg)
65/70
Term : levied

Candidates : {imposed, believed, requested, correlated}

Synonym := argmax_{c∈C} cosine(vterm, vc)

Synonym Detection

Given: a term and four candidate synonyms

Pick the candidate which has the largest cosine similarity with the term

Compute the accuracy of different models and compare
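The argmax above is essentially one line of code. The vectors below are hypothetical, arranged so that "imposed" comes out closest to "levied":

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def pick_synonym(term, candidates, vecs):
    """Return the candidate whose vector is most cosine-similar to the term's."""
    return max(candidates, key=lambda c: cosine(vecs[term], vecs[c]))

# hypothetical embeddings for the worked example
vecs = {"levied":     np.array([0.9, 0.1, 0.0]),
        "imposed":    np.array([0.8, 0.2, 0.1]),
        "believed":   np.array([0.1, 0.9, 0.0]),
        "requested":  np.array([0.0, 0.5, 0.5]),
        "correlated": np.array([0.1, 0.1, 0.9])}
print(pick_synonym("levied", ["imposed", "believed", "requested", "correlated"], vecs))
```

Accuracy over a test set is then just the fraction of terms for which the picked candidate matches the gold synonym.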
![Page 252: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/252.jpg)
66/70
brother : sister :: grandson : ?

work : works :: speak : ?

Analogy

Semantic Analogy: Find the nearest neighbour of vsister − vbrother + vgrandson

Syntactic Analogy: Find the nearest neighbour of vworks − vwork + vspeak
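The nearest-neighbour search can be sketched as below. The 2-d vectors are a contrived toy vocabulary (one dimension loosely "generation", the other "gender") chosen so the arithmetic works out exactly; real embeddings only approximate this:

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, vecs):
    """a : b :: c : ?  — nearest neighbour of v_b - v_a + v_c, excluding a, b, c."""
    target = vecs[b] - vecs[a] + vecs[c]
    candidates = (w for w in vecs if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vecs[w]))

# contrived toy vectors
vecs = {"brother":       np.array([1.0, 0.0]),
        "sister":        np.array([1.0, 1.0]),
        "grandson":      np.array([2.0, 0.0]),
        "granddaughter": np.array([2.0, 1.0]),
        "speak":         np.array([5.0, 5.0])}
print(analogy("brother", "sister", "grandson", vecs))
```

Excluding the three query words is important in practice: the target vector is usually closest to one of them.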
![Page 255: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/255.jpg)
67/70
So which algorithm gives the best result?

Baroni et al. [2014] showed that predict models consistently outperform count models in all tasks.

Levy et al. [2015] do a much more thorough analysis (IMO) and show that good old SVD does better than prediction based models on similarity tasks but not on analogy tasks.
![Page 258: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/258.jpg)
68/70
Module 10.10: Relation between SVD & word2vec
![Page 259: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/259.jpg)
69/70
The story ahead ...
Continuous bag of words model
Skip gram model with negative sampling (the famous word2vec)
GloVe word embeddings
Evaluating word embeddings
Good old SVD does just fine!!
![Page 260: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:](https://reader033.fdocuments.us/reader033/viewer/2022050113/5f4acfa47293576e405c810a/html5/thumbnails/260.jpg)
70/70
[Figure: word2vec network — one-hot input x ∈ R^{|V|} for the context "he sat a chair", hidden layer h ∈ R^k, and parameter matrices W_context ∈ R^{k×|V|} and W_word ∈ R^{k×|V|}]
Recall that SVD does a matrix factorization of the co-occurrence matrix.

Levy et al. [2015] show that word2vec also implicitly does a matrix factorization.

What does this mean?

Recall that word2vec gives us W_context & W_word.

Turns out that we can also show that

M = W_context^T W_word

where

M_ij = PMI(w_i, c_j) − log(k)

k = number of negative samples

So essentially, word2vec factorizes a matrix M which is related to the PMI-based co-occurrence matrix (very similar to what SVD does).
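The connection can be made concrete by building the shifted PMI matrix from co-occurrence counts and factorizing it explicitly with SVD, instead of implicitly via SGNS training. The counts below are made up for illustration, and the clipping of zero co-occurrences at 0 follows the "positive" shifted-PMI variant used by Levy et al.

```python
import numpy as np

# Hypothetical word-context co-occurrence counts X (rows: words, columns: contexts).
X = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 2.0],
              [0.0, 2.0, 5.0]])
k = 5  # number of negative samples in skip-gram with negative sampling

total = X.sum()
p_w = X.sum(axis=1, keepdims=True) / total  # P(w), shape (|V|, 1)
p_c = X.sum(axis=0, keepdims=True) / total  # P(c), shape (1, |C|)
p_wc = X / total                            # P(w, c)

# Shifted PMI: M_ij = PMI(w_i, c_j) - log(k). Zero co-occurrences give
# PMI = -inf, which the positive variant clips at 0.
with np.errstate(divide="ignore"):
    M = np.maximum(np.log(p_wc / (p_w * p_c)) - np.log(k), 0.0)

# Explicit factorization of M via SVD, analogous to what SGNS does implicitly.
U, S, Vt = np.linalg.svd(M)
d = M.shape[0]                      # full rank here, so M is reconstructed exactly
W_word = U[:, :d] * np.sqrt(S[:d])               # word embeddings (rows)
W_context = np.sqrt(S[:d])[:, None] * Vt[:d, :]  # context embeddings (columns)

print(np.allclose(W_word @ W_context, M))  # → True
```

In practice one keeps only the top d singular values (d ≪ |V|), giving a low-rank approximation of M rather than an exact reconstruction.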