Tutorial word2vec toolkit - NTU Speech Processing...
Download & Compile
• word2vec: https://code.google.com/p/word2vec/
• Download
  1. Install subversion (svn)
     sudo apt-get install subversion
  2. Download word2vec
     svn checkout http://word2vec.googlecode.com/svn/trunk/
• Compile
  make
CBOW and Skip-gram
• CBOW stands for “continuous bag-of-words”
• Both are networks without a non-linear hidden layer; they only have a linear projection layer.
Reference: Efficient Estimation of Word Representations in Vector Space by Tomas Mikolov, et al.
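Represent words as vectors
• Example sentence (Chinese, roughly "Thank you, senior; wishing you success in your research")
  謝謝 學長 祝 學長 研究 順利
• Vocabulary
  [ 謝謝 (thanks), 學長 (senior schoolmate), 祝 (wish), 研究 (research), 順利 (smoothly) ]
• One-hot vector of 學長
  [ 0 1 0 0 0 ]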
Example of CBOW
• window = 1
  謝謝 學長 祝 學長 研究 順利
  (target: the first 學長; context: its neighbors 謝謝 and 祝)
  Input: [ 1 0 1 0 0 ]
  Target: [ 0 1 0 0 0 ]
• Projection Matrix × Input vector = vector(謝謝) + vector(祝)
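The one-hot encoding and the CBOW projection above can be sketched in a few lines of plain Python. This is only an illustration: the helper names (`one_hot`, `cbow_example`, `project`) and the small integer projection matrix are assumptions made up for the example, not part of the word2vec toolkit.

```python
# Example sentence from the slide, already split into words.
sentence = ["謝謝", "學長", "祝", "學長", "研究", "順利"]

# Vocabulary in order of first appearance: [謝謝, 學長, 祝, 研究, 順利].
vocab = list(dict.fromkeys(sentence))


def one_hot(word):
    """Return the one-hot vector of `word` over `vocab`."""
    return [1 if w == word else 0 for w in vocab]


def cbow_example(position, window=1):
    """Return (input, target) for the word at `position`.

    The input is the sum of the one-hot vectors of the context words
    within `window` positions; the target is the one-hot center word.
    """
    lo = max(0, position - window)
    hi = min(len(sentence), position + window + 1)
    input_vec = [0] * len(vocab)
    for i in range(lo, hi):
        if i == position:
            continue
        for k, bit in enumerate(one_hot(sentence[i])):
            input_vec[k] += bit
    return input_vec, one_hot(sentence[position])


# Hypothetical 2 x 5 projection matrix: column k is the vector of vocab[k].
projection = [
    [1, 2, 3, 4, 5],
    [10, 20, 30, 40, 50],
]


def project(matrix, vec):
    """Multiply `matrix` by the column vector `vec`."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def word_vector(word):
    """Read the column of `projection` that belongs to `word`."""
    k = vocab.index(word)
    return [row[k] for row in projection]


input_vec, target_vec = cbow_example(position=1, window=1)
hidden = project(projection, input_vec)
print(input_vec, target_vec, hidden)
```

Because the input is a sum of one-hot vectors, multiplying by the projection matrix simply adds up the columns for 謝謝 and 祝, which is the identity stated on the slide.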
Training
word2vec -train <training-data>
         -output <filename>
         -window <window-size>
         -cbow <0 (skip-gram), 1 (cbow)>
         -size <vector-size>
         -binary <0 (text), 1 (binary)>
         -iter <iteration-num>
Example:
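As a hedged illustration, the options listed above can be assembled into a command line from Python. The helper `build_word2vec_command`, the file names `corpus.txt` and `vectors.txt`, the executable path `./word2vec`, and all parameter values are assumptions chosen for this sketch; only the flag names come from the slide.

```python
def build_word2vec_command(train, output, window, cbow, size, binary, iters,
                           executable="./word2vec"):
    """Build the argument list for a word2vec training run.

    Only the flags shown on the slide are used. The executable path is an
    assumption (the binary produced by `make` in the source directory).
    """
    return [
        executable,
        "-train", train,
        "-output", output,
        "-window", str(window),
        "-cbow", str(cbow),      # 0 = skip-gram, 1 = CBOW
        "-size", str(size),      # dimensionality of the word vectors
        "-binary", str(binary),  # 0 = text output, 1 = binary output
        "-iter", str(iters),
    ]


# Hypothetical file names and values, for illustration only.
cmd = build_word2vec_command(
    train="corpus.txt", output="vectors.txt",
    window=5, cbow=1, size=100, binary=0, iters=5,
)
print(" ".join(cmd))

# To actually run the training (requires the compiled binary):
#   import subprocess
#   subprocess.run(cmd, check=True)
```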
Play with word vectors
• distance <output-vector>
  - find related words
• word-analogy <output-vector>
  - analogy task (A is to B as C is to ?)
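What these two tools compute can be sketched with cosine similarity over a handful of vectors. The three-dimensional vectors below are invented for the example (real vectors come from the trained output file), and `related` and `analogy` are hypothetical helper names, not toolkit functions.

```python
import math

# Toy vectors made up for illustration; real ones are read from the
# file written by word2vec.
vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.1, 0.1, -1.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, exclude, topn):
    """Rank all words not in `exclude` by cosine similarity to `query`."""
    scored = [(cosine(query, vec), word)
              for word, vec in vectors.items() if word not in exclude]
    scored.sort(reverse=True)
    return [word for _, word in scored[:topn]]


def related(word, topn=3):
    """Words closest to `word`, in the spirit of the `distance` tool."""
    return nearest(vectors[word], {word}, topn)


def analogy(a, b, c, topn=1):
    """Answer 'a is to b as c is to ?' using vector(b) - vector(a) + vector(c)."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    return nearest(query, {a, b, c}, topn)


print(related("king"))
print(analogy("man", "king", "woman"))
```

The input words are excluded from the candidates, since the query vector is usually closest to one of them.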
• Data: https://www.dropbox.com/s/tnp0wevr3u59ew8/data.tar.gz?dl=0
RESULTS
OTHER RESULTS
ANALOGY
ANALOGY
Advanced Stuff – Phrase Vector
• Phrases
  You may want to treat “New Zealand” as one word.
• If two words frequently occur together, we join them with an underscore and treat them as one word, e.g. New_Zealand
• How to evaluate?
  A score is computed for each pair of adjacent words; if the score > threshold, we add an underscore.
• word2phrase -train <word-doc> -output <phrase-doc> -threshold 100
Reference: Distributed Representations of Words and Phrases and their Compositionality by Tomas Mikolov, et al.
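A minimal sketch of the scoring idea, following the formula in the referenced paper, score(a, b) = (count(a b) − δ) / (count(a) × count(b)). The toy corpus, the discount `delta = 1`, the threshold `0.1`, and the helper names are assumptions for this example; the released word2phrase tool may scale its score differently, so these numbers are not comparable to `-threshold 100`.

```python
from collections import Counter

# Toy corpus, made up for illustration.
tokens = ("i visited New Zealand and New Zealand was great "
          "and i was happy").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))


def phrase_score(a, b, delta=1):
    """Paper-style bigram score: (count(a b) - delta) / (count(a) * count(b))."""
    return (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])


def merge_phrases(words, threshold=0.1, delta=1):
    """Join adjacent word pairs whose score exceeds `threshold` with '_'."""
    merged = []
    i = 0
    while i < len(words):
        if (i + 1 < len(words)
                and phrase_score(words[i], words[i + 1], delta) > threshold):
            merged.append(words[i] + "_" + words[i + 1])
            i += 2  # both words were consumed by the phrase
        else:
            merged.append(words[i])
            i += 1
    return merged


print(" ".join(merge_phrases(tokens)))
```

The discount δ keeps pairs that were seen only once from being merged, since their score is then zero.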
Advanced Stuff – Negative Sampling
• Objective
  word, context, random sample context
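The three ingredients named above (word, context, randomly sampled contexts) combine in the negative-sampling objective from the Mikolov et al. phrases paper: log σ(c · w) plus the sum over the sampled contexts n of log σ(−n · w). A minimal sketch follows; the two-dimensional vectors are invented for illustration and the function name is hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_objective(word_vec, context_vec, negative_vecs):
    """Objective for one (word, context) pair with sampled negatives.

    Training maximizes this value: the true context should score high
    against the word, and the randomly sampled contexts should score low.
    """
    positive = math.log(sigmoid(dot(context_vec, word_vec)))
    negative = sum(math.log(sigmoid(-dot(n, word_vec)))
                   for n in negative_vecs)
    return positive + negative


# Toy vectors, made up for illustration.
word = [1.0, 0.5]
true_context = [1.0, 0.5]
random_contexts = [[-1.0, 0.0], [0.0, -1.0]]

good = negative_sampling_objective(word, true_context, random_contexts)
# Swapping in a poorly matching context lowers the objective.
bad = negative_sampling_objective(word, [-1.0, 0.0],
                                  [[1.0, 0.5], [0.0, -1.0]])
print(good, bad)
```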