Deep Learning for Natural Language Processing
Dylan Drover, Borui Ye, Jie Peng
University of Waterloo
djdrover@uwaterloo.ca
July 8, 2015
Dylan Drover, Borui Ye, Jie Peng (University of Waterloo) NN for NLP July 8, 2015 1 / 59
Overview
1 Neural Networks: An “Intuitive” Look
   - The Basics
   - Deep Learning
   - RBM and Language Understanding
   - RNN and Statistical Machine Translation
2 Neural Probabilistic Language Model
   - Why Use Distributed Representation
   - Bengio’s Neural Network
3 Google’s Word2Vec
   - CBOW+Hierarchical Softmax
   - CBOW+Negative Sampling
Introduction to Neural Network Models
A Black Box with a Billion Dials
But we can do better...
The Artificial Neuron
Base unit for (most) neural networks is a simplified version of a biological neuron
Neuron has a set of inputs which have associated weights, and an activation function which then determines whether a neuron will “fire” or be activated
Together these can form very complex function modellers/approximators
The Artificial Neuron
Output y of neuron k is some function f of the weighted sum of its inputs:
$$y = f\left(\sum_{j=0}^{m} w_{kj} x_j\right)$$
Each neuron has a weight vector and each layer has a weight matrix W
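The neuron formula can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a logistic sigmoid for f (the slide leaves f unspecified); the names `neuron_output` and `layer_output` and the numbers are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, one common (assumed) choice for f."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(w, x, f=sigmoid):
    """y = f(sum_j w_j * x_j) for one neuron with weight vector w."""
    return f(np.dot(w, x))

def layer_output(W, x, f=sigmoid):
    """Row k of W is the weight vector of neuron k, so W @ x holds every weighted sum."""
    return f(W @ x)

x = np.array([1.0, 0.5, -1.0])        # x[0] = 1 can play the role of a bias input
W = np.array([[0.0, 2.0, 1.0],        # weight vector of neuron 0
              [0.5, -1.0, 0.5]])      # weight vector of neuron 1
y = layer_output(W, x)                # one output per neuron in the layer
```

Stacking the per-neuron weight vectors as rows of W is what turns a layer into a single matrix-vector product.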
Multilayer Perceptron
Performs multiple logistic regressions at once for arbitrary function approximation
You might consider the multilayer perceptron “deep”, but it isn’t. It suffers from the “vanishing gradient problem”
Backpropagation: Learning with Derivatives
The basic idea behind learning in a neural network is that a network will have:
Objective function: J(θ), E(v, h) or training vector
   - This applies for supervised and unsupervised learning
   - The network’s output is compared with the objective to obtain an error
Optimization algorithm: Stochastic Gradient Descent, Contrastive Divergence etc.
   - These algorithms direct learning within the “weight space”
But how do we adjust weights to optimize the objective?
Chain Rule
Use the chain rule to determine how each output affects the final desired output
Now have many gradients that we can use for optimization
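Backpropagation
Calculate network error (where t is the target, y is the network output):
$$E = \frac{1}{2}(t - y)^2$$
Apply the chain rule, where o_j is the output of neuron j and net_j is its weighted input:
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial \mathrm{net}_j}\,\frac{\partial \mathrm{net}_j}{\partial w_{ij}}$$
$$\frac{\partial \mathrm{net}_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\left(\sum_{k=1}^{n} w_{kj} x_k\right) = x_i$$
Update each weight against its gradient, with learning rate α:
$$\Delta w_{ij} = -\alpha \frac{\partial E}{\partial w_{ij}}$$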
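As a rough sketch of the chain rule and update above, the following applies them to a single output neuron. It assumes a sigmoid activation (so the middle factor is o(1 - o)); the input, target and learning rate are arbitrary illustration values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(w, x, t):
    """E = 0.5 * (t - o)^2 with o = sigmoid(w . x)."""
    return 0.5 * (t - sigmoid(np.dot(w, x))) ** 2

def gradient(w, x, t):
    """dE/dw as the product of the three chain-rule factors."""
    o = sigmoid(np.dot(w, x))
    dE_do = o - t                 # derivative of 0.5 * (t - o)^2 with respect to o
    do_dnet = o * (1.0 - o)       # derivative of the sigmoid with respect to net
    dnet_dw = x                   # d(net)/d(w_i) = x_i
    return dE_do * do_dnet * dnet_dw

alpha = 0.5                       # learning rate (assumed value)
w = np.array([0.1, -0.2, 0.3])
x = np.array([1.0, 0.5, -1.0])
t = 1.0

error_before = error(w, x, t)
for _ in range(100):
    w = w - alpha * gradient(w, x, t)   # delta_w = -alpha * dE/dw
error_after = error(w, x, t)
```

In a multi-layer network the same product is extended backwards through every layer, which is where the name backpropagation comes from.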
High Dimensional Optimization Problem: Weight Space
Training a neural network is an N-dimensional optimization problem
Each weight in a NN is a dimension and the goal is to find the minimum error (“height”) with gradient descent
There are many optimization algorithms for finding weights
   - Hessian-Free Optimization, Stochastic Gradient Descent, RMSProp, AdaGrad, Momentum, etc.
What is ”Deep Learning” and why should we care?
Deep Learning is just a re-branding of artificial neural networks
Multiple factors led to the new deeper NN architectures
   - Pre-training (unsupervised)
   - Faster optimization algorithms
   - Graphics Processing Units (GPU)
A myriad of techniques existed previously but things began to come together in the mid 2000s
Better Results Through Pre-Training
Train each layer of the network greedily with unsupervised learning, using the previous layer’s output as input
Combine the layers and fine-tune them with supervised training as in an MLP
Network starts from a better position in weight space
MNIST error rates [Erhan et al., JMLR 2010]
Promise for NLP
Improved results for Part of Speech tagging and Named Entity Recognition
Application of Deep Belief Networks for Natural Language Understanding [Sarikaya et al.]
Sarikaya et al. attempt to solve the problem of spoken language understanding (SLU) in the context of call-routing (call-centre data)
Data:
   - 27 000 transcribed utterances serve as unlabelled data
   - Sets of 1K, 2K, 3K, 4K, 5K, 6K, 7K, 8K, 9K, 10K are used as labelled data sets
Restricted Boltzmann Machines (RBM) were trained with unlabelled data and then stacked to form a Deep Belief Network
Restricted Boltzmann Machine
Composed of visible and hidden neurons
Unlabelled training data is presented as v
Hidden neurons (stochastic binary units) and weights attempt to approximate a joint probability distribution of data (Generative Model)
Learning Without Knowing
The energy of a joint configuration (v, h) of an RBM is defined as:
$$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j v_i w_{ij} h_j$$
Which defines the probability distribution:
$$p(v, h) = \frac{1}{Z} e^{-E(v, h)}$$
Which is marginalized over hidden vectors:
$$p(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}$$
Z acts as a normalizing factor:
$$Z = \sum_{v, h} e^{-E(v, h)}$$
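The energy, the marginal p(v) and the normalizer Z can be checked numerically on a tiny RBM by brute-force enumeration. This is only a sketch: the sizes (3 visible, 2 hidden units), the random parameters and the helper names are assumptions for illustration, and enumeration is feasible only for very small models.

```python
import itertools
import numpy as np

def energy(v, h, visible_bias, hidden_bias, W):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i w_ij h_j."""
    return -(visible_bias @ v) - (hidden_bias @ h) - (v @ W @ h)

def all_binary(n):
    """Every binary vector of length n."""
    return [np.array(bits, dtype=float) for bits in itertools.product([0, 1], repeat=n)]

def partition_function(visible_bias, hidden_bias, W):
    """Z = sum over all (v, h) of exp(-E(v, h))."""
    return sum(np.exp(-energy(v, h, visible_bias, hidden_bias, W))
               for v in all_binary(len(visible_bias))
               for h in all_binary(len(hidden_bias)))

def p_visible(v, visible_bias, hidden_bias, W):
    """p(v) = (1 / Z) * sum over h of exp(-E(v, h))."""
    Z = partition_function(visible_bias, hidden_bias, W)
    unnormalized = sum(np.exp(-energy(v, h, visible_bias, hidden_bias, W))
                       for h in all_binary(len(hidden_bias)))
    return unnormalized / Z

rng = np.random.default_rng(0)
visible_bias = rng.normal(size=3)
hidden_bias = rng.normal(size=2)
W = rng.normal(size=(3, 2))           # W[i, j] connects visible unit i and hidden unit j

# Summing p(v) over every visible vector should give 1.
total = sum(p_visible(v, visible_bias, hidden_bias, W) for v in all_binary(3))
```

The number of terms in Z grows exponentially with the number of units, which is why practical training avoids computing it directly.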
Learning without Knowing
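Training uses a “reconstruction” of the input vector (the math is easier), which is obtained by first activating the hidden units with:
$$p(h_j = 1 \mid v) = \sigma\left(b_j + \sum_i v_i w_{ij}\right)$$
Which is used in:
$$p(v_i = 1 \mid h) = \sigma\left(a_i + \sum_j h_j w_{ij}\right)$$
Here a_i is the bias of visible unit i and b_j the bias of hidden unit j, matching the energy E(v, h).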
Learning without Knowing
The hidden units compare their rough reconstruction to the actualinput.
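Based on the reconstruction, the weights are changed to increase log p(v):
$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$$
$$\Delta w_{ij} \propto \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$$
$$\Delta w_{ij} \propto \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}}$$
This is a rough approximation; however, it works well in practice.
*(Angle brackets are the expectation with respect to the subscript distribution)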
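One step of the reconstruction-based update (usually called CD-1) might look like the sketch below. It makes several simplifying assumptions that are not on the slides: a single training vector, probabilities rather than samples for the reconstruction, a fixed learning rate `lr`, and no bias updates. Biases are named descriptively to avoid any ambiguity about which symbol belongs to which layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v_data, W, visible_bias, hidden_bias, rng, lr=0.1):
    """One CD-1 weight update for a single binary training vector; returns the new W."""
    # Up pass: p(h_j = 1 | v) for the data, then a binary sample of the hidden units.
    p_h_data = sigmoid(hidden_bias + v_data @ W)
    h_sample = (rng.random(p_h_data.shape) < p_h_data).astype(float)
    # Down pass: reconstruction probabilities p(v_i = 1 | h).
    v_recon = sigmoid(visible_bias + W @ h_sample)
    # Up pass again, driven by the reconstruction.
    p_h_recon = sigmoid(hidden_bias + v_recon @ W)
    # delta_w_ij is proportional to <v_i h_j>_data - <v_i h_j>_recon.
    positive = np.outer(v_data, p_h_data)
    negative = np.outer(v_recon, p_h_recon)
    return W + lr * (positive - negative)

rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(4, 3))    # 4 visible units, 3 hidden units
visible_bias = np.zeros(4)
hidden_bias = np.zeros(3)
v = np.array([1.0, 0.0, 1.0, 1.0])
W_new = cd1_update(v, W, visible_bias, hidden_bias, rng)
```

Weights attached to a visible unit that is off in the data but partly on in the reconstruction are pushed down, which is the intuition behind comparing the reconstruction with the input.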
Visualization of Weights from a RBM
Each square is a set of weights for one neuron
Features emerge from unsupervised learning
Deep Belief Network
Use one RBM to train the next layer RBM
   - Each layer learns features
   - Each successive layer learns features of features
This continues until a supervised layer
Results in a Deep Belief Network
Results
Results show improvements from unsupervised learning in all aspects
Autoencoders
Creates compressed representation of data (Dimensionality reduction)
Central layer acts as a non-linear principal component analysis
Decoder weights are transposed encoder weight matrices: $W_{l_i}$ and $W_{l_i}^T$
Each layer trained greedily (similar to DBN layers)
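A single autoencoder layer with tied weights can be sketched as follows. The layer sizes, the sigmoid activation and the squared reconstruction error are assumptions chosen for illustration; the point is only that the decoder reuses the encoder matrix transposed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b_enc):
    """Map the input to a lower-dimensional code."""
    return sigmoid(W @ x + b_enc)

def decode(code, W, b_dec):
    """Tied weights: the decoder uses the transpose of the encoder matrix."""
    return sigmoid(W.T @ code + b_dec)

rng = np.random.default_rng(0)
n_in, n_code = 8, 3                        # compress 8 inputs into a 3-unit code
W = 0.1 * rng.normal(size=(n_code, n_in))
b_enc = np.zeros(n_code)
b_dec = np.zeros(n_in)

x = rng.random(n_in)
code = encode(x, W, b_enc)
x_hat = decode(code, W, b_dec)
reconstruction_error = 0.5 * np.sum((x - x_hat) ** 2)   # quantity training would minimize
```

Tying the weights halves the number of parameters per layer, and only the biases differ between the encoder and decoder.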
Recurrent Neural Networks
Input is treated as time series: $\dots, x_{t-1}, x_t, x_{t+1}, \dots$
Retain temporal context or short term memory
Trained with backpropagation through time (BPTT)
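The way a recurrent network carries context from one time step to the next can be sketched with a plain (Elman-style) forward pass. The tanh activation, the dimensions and the parameter names are assumptions for illustration; training with BPTT is not shown.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a simple RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])            # initial hidden state (no context yet)
    states = []
    for x_t in xs:
        # The new state mixes the current input with the previous state (short term memory).
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
W_xh = 0.5 * rng.normal(size=(4, 3))       # input-to-hidden weights
W_hh = 0.5 * rng.normal(size=(4, 4))       # hidden-to-hidden (recurrent) weights
b_h = np.zeros(4)
xs = rng.normal(size=(5, 3))               # a sequence of 5 input vectors

states = rnn_forward(xs, W_xh, W_hh, b_h)
```

Because the same `W_hh` is applied at every step, gradients flowing back through time are repeatedly multiplied by related factors, which is one way to see why long sequences are hard to train.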
Long Short Term Memory
Prevent the “vanishing gradient” problem
Can retain information for an arbitrary amount of time
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation [Cho et al.]
Combination of an RNN with an autoencoder
Encoder maps variable length source phrase $X = (x_1, x_2, \dots, x_T)$ (English sentence) into a fixed length internal representation vector c
The decoder then maps this back into another variable length sequence, the target sentence $Y = (y_1, y_2, \dots, y_{T'})$ (French sentence)
Analysis showed that the internal representation (in the 1000 hidden units) preserved syntactic and semantic information
Learns probability distribution:
$$p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T)$$
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
The decoder uses the compressed semantic meaning vector c as well as the previous word in its translation
$$h_{\langle t \rangle} = f\left(h_{\langle t-1 \rangle}, y_{t-1}, c\right)$$
Next element in translated sequence is conditioned on:
$$P(y_t \mid y_{t-1}, \dots, y_1, c) = g\left(h_{\langle t \rangle}, y_{t-1}, c\right)$$
Training of the network attempts to maximize the conditional log likelihood:
$$\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n \mid x_n)$$
where θ are the model parameters (weights) and each $(x_n, y_n)$ is an (input, output) pair.
Encoder - Decoder
Big Data
The training of the model used:
   - Bilingual corpora Europarl (61M words)
   - News commentary (5.5M words)
   - UN transcriptions (421M words)
   - 870M words from crawled corpora
However, to optimize results, only a subset of 348M words was used for training translation
The final BLEU score for the model was 34.54
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
This model is not restricted to just translation (which is why ANNs are so useful and exciting)
This model also created semantic relations between similar words and sentences from their continuous vector space representations (more on that later)
Results
Neural Probabilistic Language Model
Presenter: Borui Ye
Papers:
1 Efficient Estimation of Word Representations in Vector Space
2 A Neural Probabilistic Language Model
What is Word Vector
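One-hot representation: represents a word using a long vector. For example:
“microphone” is represented as : [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ...]
“phone” is represented as : [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...]
PROS: this method works well with maximum entropy, SVM and CRF algorithms.
CONS: word gap (the vectors of any two distinct words are orthogonal, so related words such as “microphone” and “phone” look no more similar than unrelated ones)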
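The word gap is easy to see numerically. In this small sketch the nine-word vocabulary is invented for illustration (ordered so that “microphone” and “phone” land at the positions used in the example above); the dot product of any two different one-hot vectors is zero.

```python
import numpy as np

# Hypothetical toy vocabulary; only the positions of "microphone" and "phone" matter here.
vocab = ["the", "cat", "dog", "microphone", "room", "bedroom", "walking", "running", "phone"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

microphone = one_hot("microphone")
phone = one_hot("phone")

similarity = microphone @ phone            # dot product between two different words
self_similarity = microphone @ microphone  # dot product of a word with itself
```

Every pair of distinct words gets the same similarity score, so nothing learned about one word transfers to a related one.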
What is Word Vector (Cont.)
How To Train Word Vector
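Language Model: In practice, we usually need to calculate the probability of a sentence:
$$\begin{aligned}
P(S) &= p(w_1, w_2, w_3, w_4, w_5, \dots, w_n) \\
&= p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \dots, w_{n-1})
\end{aligned}$$
Markov’s Assumption: Each word depends only on the last n − 1 words.
$$\begin{aligned}
P(S) &= p(w_1, w_2, w_3, w_4, w_5, \dots, w_n) \\
&= p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \dots, w_{n-1}) \\
&\approx p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1}) \quad \text{(bigram)}
\end{aligned}$$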
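The bigram approximation can be sketched with simple counts. This example assumes maximum-likelihood estimates, no smoothing, and a start token `<s>` so that p(w1) is modelled as p(w1 | `<s>`); the two-sentence corpus and the function names are made up for illustration.

```python
from collections import Counter

corpus = [
    ["the", "cat", "is", "walking", "in", "the", "bedroom"],
    ["a", "dog", "was", "running", "in", "a", "room"],
]

def train_bigram(sentences):
    """Count how often each word is used as a context, and each (previous, current) pair."""
    context_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence
        context_counts.update(tokens[:-1])
        pair_counts.update(zip(tokens[:-1], tokens[1:]))
    return context_counts, pair_counts

def sentence_probability(sentence, context_counts, pair_counts):
    """P(S) under the bigram assumption: product of p(w_i | w_{i-1})."""
    prob = 1.0
    tokens = ["<s>"] + sentence
    for prev, word in zip(tokens[:-1], tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0                      # unseen context: no estimate without smoothing
        prob *= pair_counts[(prev, word)] / context_counts[prev]
    return prob

context_counts, pair_counts = train_bigram(corpus)
p_seen = sentence_probability(
    ["the", "cat", "is", "walking", "in", "the", "bedroom"], context_counts, pair_counts)
p_unseen = sentence_probability(
    ["a", "cat", "was", "running", "in", "the", "room"], context_counts, pair_counts)
```

On this corpus the training sentence gets probability 1/16, while the second query gets exactly 0 because the pair ("a", "cat") never occurs, even though the sentence is perfectly plausible.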
How To Train Word Vector (Cont.)
Problems With N-gram Model:
It is not taking into account contexts farther than 1 or 2 words
Cannot capture the similarities among words.
Example:
The cat is walking in the bedroom
A dog was running in a room
Bengio’s Neural Network
Training Set: a sequence $w_1 \dots w_T$ of words $w_t \in V$, where the vocabulary V is a large but finite set.
Objective: learn a model $f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$
Constraint: $\sum_{i=1}^{|V|} f(i, w_{t-1}, \dots, w_{t-n+1}) = 1$, with $f > 0$
Bengio’s Neural Network (Cont.)
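Parameters
Definitions
C : a shared word vector matrix, $C \in \mathbb{R}^{|V| \times m}$
x : vector fed to the hidden layer (concatenation of the context word vectors), $x = (C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n+1}))$
y : vector of output layer, $y = b + Wx + U \tanh(d + Hx)$
$$P(w_t = i \mid \text{context}) = \hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$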
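The forward pass defined by these parameters can be sketched as below. The sizes (vocabulary of 10 words, m = 4, n = 3, 5 hidden units), the random initialization and the function name `nplm_forward` are assumptions for illustration; subtracting the maximum before exponentiating is a standard numerical safeguard that leaves the softmax unchanged.

```python
import numpy as np

def nplm_forward(context_ids, C, H, d, U, W, b):
    """Return the estimated P(w_t = i | context) for every word i in the vocabulary."""
    # x: concatenation of the context word vectors C(w_{t-1}), ..., C(w_{t-n+1}).
    x = np.concatenate([C[i] for i in context_ids])
    # y = b + Wx + U tanh(d + Hx): direct connections plus a tanh hidden layer.
    y = b + W @ x + U @ np.tanh(d + H @ x)
    # Softmax over the vocabulary.
    e = np.exp(y - y.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab_size, m, n, n_hidden = 10, 4, 3, 5
C = 0.1 * rng.normal(size=(vocab_size, m))           # one m-dimensional vector per word
H = 0.1 * rng.normal(size=(n_hidden, (n - 1) * m))
d = np.zeros(n_hidden)
U = 0.1 * rng.normal(size=(vocab_size, n_hidden))
W = 0.1 * rng.normal(size=(vocab_size, (n - 1) * m))
b = np.zeros(vocab_size)

probs = nplm_forward([2, 7], C, H, d, U, W, b)        # ids of w_{t-1} and w_{t-2}
```

Because C is shared across all context positions, a gradient step for one training context also moves the vectors used in every other context containing the same words.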
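Parameter Estimation
The goal is to find the parameters that maximize the penalized log-likelihood of the training corpus:
$$L = \frac{1}{T} \sum_t \log f(w_t, w_{t-1}, \dots, w_{t-n+1}; \theta) + R(\theta)$$
where $\theta = (b, d, W, U, H, C)$ and $R(\theta)$ is a regularization term
SGD:
$$\theta \leftarrow \theta + \varepsilon \frac{\partial \log \hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1})}{\partial \theta}$$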
Google’s Word2Vec
Project url : http://code.google.com/p/word2vec/
Feature : Additive Compositionality :
vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
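The analogy arithmetic amounts to a nearest-neighbour search around the combined vector. The sketch below uses hand-made 3-dimensional vectors, not vectors learned by word2vec, and the `analogy` helper is hypothetical; it only illustrates the mechanics of the lookup.

```python
import numpy as np

# Hand-made toy vectors (NOT learned by word2vec); the axes loosely mean
# "royalty", "gender" and "fruit".
vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 1.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """Word closest to vector(a) - vector(b) + vector(c), excluding the query words."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))

result = analogy("king", "man", "woman")
```

With real embeddings the match is only approximate, which is why the relation is written with ≈ and why the query words themselves are excluded from the search.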
Google’s Word2Vec (Cont.)
./distance
Two Models and Two Algorithms
Model                      Algorithms
Continuous Bag of Words    Hierarchical Softmax, Negative Sampling
Skip-gram                  Hierarchical Softmax, Negative Sampling
CBOW+Hierarchical Softmax
Predict the probability of a word given its context:
$$P(w \mid \mathrm{Context}(w))$$
Learning objective : maximize log-likelihood:
$$\zeta = \sum_{w \in C} \log p(w \mid \mathrm{Context}(w))$$
CBOW+Hierarchical Softmax (Cont.)
Input Layer: 2c word vectors in Context(w): $v(\mathrm{Context}(w)_1), v(\mathrm{Context}(w)_2), \dots, v(\mathrm{Context}(w)_{2c}) \in \mathbb{R}^m$
Projection Layer: Adding all the vectors in input layer:
$$x_w = \sum_{i=1}^{2c} v(\mathrm{Context}(w)_i) \in \mathbb{R}^m$$
Output Layer: A Huffman tree using words in vocabulary as leaves.
CBOW+Hierarchical Softmax (Cont.)
Notations
$p^w$ : Path from root to corresponding leaf w
$l^w$ : Number of nodes included in $p^w$
$p^w_1, p^w_2, \dots, p^w_{l^w}$ : the $l^w$ nodes of path $p^w$
$d^w_2, d^w_3, \dots, d^w_{l^w} \in \{0, 1\}$ : Huffman code of each node on path $p^w$; the root does not have a code
$\theta^w_1, \theta^w_2, \dots, \theta^w_{l^w - 1}$ : vector of each non-leaf node on path $p^w$
Huffman Tree
[Figure: Huffman tree whose leaves are the words “I”, “love”, “watching”, “Brazil”, “football”, “game”]
Learning Objective
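We assign every node a label:
$$\mathrm{Label}(p^w_i) = 1 - d^w_i, \quad i = 2, 3, \dots, l^w$$
So the probability of a node being classified as positive label is:
$$\sigma(x_w^T \theta) = \frac{1}{1 + e^{-x_w^T \theta}}$$
Then:
$$p(w \mid \mathrm{Context}(w)) = \prod_{j=2}^{l^w} p(d^w_j \mid x_w, \theta^w_{j-1})$$
where
$$p(d^w_j \mid x_w, \theta^w_{j-1}) = \begin{cases} \sigma(x_w^T \theta^w_{j-1}), & d^w_j = 0; \\ 1 - \sigma(x_w^T \theta^w_{j-1}), & d^w_j = 1; \end{cases}$$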
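The product over the path can be sketched on a toy tree. For simplicity this example assumes a balanced binary tree with four leaves rather than a real Huffman tree built from word frequencies, and all vectors are random; `path_probability` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def path_probability(x_w, codes, thetas):
    """p(w | Context(w)) as the product of one binary decision per inner node on the path.

    codes  : code bits d_2, ..., d_l of the leaf (the root has no code)
    thetas : vectors theta_1, ..., theta_{l-1} of the inner nodes on the path
    """
    prob = 1.0
    for d, theta in zip(codes, thetas):
        s = sigmoid(x_w @ theta)
        prob *= s if d == 0 else 1.0 - s    # sigma for code 0, 1 - sigma for code 1
    return prob

rng = np.random.default_rng(0)
m = 4
context_vectors = rng.normal(size=(2, m))   # 2c = 2 context word vectors
x_w = context_vectors.sum(axis=0)           # projection layer: sum of the context vectors

# Toy balanced tree: a root and two inner nodes, each with two leaf children.
theta_root, theta_n0, theta_n1 = rng.normal(size=(3, m))
leaves = {
    "I":        ([0, 0], [theta_root, theta_n0]),
    "love":     ([0, 1], [theta_root, theta_n0]),
    "football": ([1, 0], [theta_root, theta_n1]),
    "game":     ([1, 1], [theta_root, theta_n1]),
}
probs = {word: path_probability(x_w, codes, thetas)
         for word, (codes, thetas) in leaves.items()}
total = sum(probs.values())                 # the leaf probabilities form a distribution
```

Each word costs only as many sigmoid evaluations as its path is long, instead of one score per vocabulary word as in a full softmax.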
Learning Objective
Full learning objective is to maximize:
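$$\zeta = \sum_{w \in C} \log \prod_{j=2}^{l^w} \left\{ [\sigma(x_w^T \theta^w_{j-1})]^{1 - d^w_j} \, [1 - \sigma(x_w^T \theta^w_{j-1})]^{d^w_j} \right\}$$
$$= \sum_{w \in C} \sum_{j=2}^{l^w} \left\{ (1 - d^w_j) \log[\sigma(x_w^T \theta^w_{j-1})] + d^w_j \log[1 - \sigma(x_w^T \theta^w_{j-1})] \right\}$$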
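Since the objective is a double sum, gradient ascent can treat one term at a time. The derivation below is a sketch: the name $\ell(w, j)$ for a single summand is introduced here for illustration and is not in the slides. It only uses the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

```latex
\ell(w, j) = (1 - d^w_j)\,\log \sigma(x_w^T \theta^w_{j-1})
           + d^w_j \,\log\bigl[1 - \sigma(x_w^T \theta^w_{j-1})\bigr]

\frac{\partial \ell(w, j)}{\partial \theta^w_{j-1}}
  = (1 - d^w_j)\bigl[1 - \sigma(x_w^T \theta^w_{j-1})\bigr] x_w
    - d^w_j \,\sigma(x_w^T \theta^w_{j-1})\, x_w
  = \bigl[1 - d^w_j - \sigma(x_w^T \theta^w_{j-1})\bigr] x_w

\frac{\partial \ell(w, j)}{\partial x_w}
  = \bigl[1 - d^w_j - \sigma(x_w^T \theta^w_{j-1})\bigr] \theta^w_{j-1}
```

The two gradients are symmetric in $x_w$ and $\theta^w_{j-1}$, and both vanish when the predicted probability $\sigma(x_w^T \theta^w_{j-1})$ equals the label $1 - d^w_j$.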
CBOW+Negative Sampling
In a nutshell, the output layer has no Huffman tree; it uses a set of negative samples instead (given Context(w), word w is the positive sample, while the other words are negative). Negative samples are selected at random.
Assume that we have a negative sample set $\mathrm{NEG}(w) \neq \emptyset$. For every $\tilde{w} \in D$, we denote the label of $\tilde{w}$ as follows:
$$L^w(\tilde{w}) = \begin{cases} 1, & \tilde{w} = w; \\ 0, & \tilde{w} \neq w. \end{cases}$$
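The slide only says that negatives are chosen at random. As one possible sketch, the code below draws them in proportion to word count raised to the power 0.75, which is the weighting described in the word2vec paper. The function names, the word counts, and the choice to keep the negatives distinct are assumptions made for illustration.

```python
import random

def draw_negatives(target, counts, k, power=0.75, rng=random):
    """Draw up to k distinct words other than `target`, weighted by count**power."""
    candidates = [w for w in counts if w != target]
    weights = [counts[w] ** power for w in candidates]
    negatives = set()
    while len(negatives) < min(k, len(candidates)):
        negatives.add(rng.choices(candidates, weights=weights, k=1)[0])
    return sorted(negatives)

def label(target, word):
    """L^w(word): 1 for the positive word, 0 for every other word."""
    return 1 if word == target else 0

# Hypothetical corpus counts for the six example words.
counts = {"I": 5, "love": 3, "watching": 2, "Brazil": 1, "football": 4, "game": 2}
rng = random.Random(0)  # seeded so that repeated runs draw the same sample
neg = draw_negatives("football", counts, k=3, rng=rng)
```

Every word in `neg` gets label 0 and the target gets label 1, which is all the objective on the following slides needs.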
CBOW+Negative Sampling (Cont.)
Given (Context(w),w) our goal is to maximize:
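$$g(w) = \prod_{u \in \{w\} \cup \mathrm{NEG}(w)} p(u \mid \mathrm{Context}(w))$$
where
$$p(u \mid \mathrm{Context}(w)) = \begin{cases} \sigma(x_w^T \theta^u), & L^w(u) = 1; \\ 1 - \sigma(x_w^T \theta^u), & L^w(u) = 0. \end{cases}$$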
CBOW+Negative Sampling (Cont.)
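To increase the probability of the positive sample and decrease that of the negative ones:
$$g(w) = \prod_{u \in \{w\} \cup \mathrm{NEG}(w)} p(u \mid \mathrm{Context}(w)) = \sigma(x_w^T \theta^w) \prod_{u \in \mathrm{NEG}(w)} \left(1 - \sigma(x_w^T \theta^u)\right)$$
Then:
$$G = \prod_{w \in C} g(w)$$
where C is the corpus.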
Learning Objective
Full learning objective: maximize the following:
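$$\zeta = \log G = \log \prod_{w \in C} g(w) = \sum_{w \in C} \log g(w)$$
$$= \sum_{w \in C} \log \prod_{u \in \{w\} \cup \mathrm{NEG}(w)} \left\{ [\sigma(x_w^T \theta^u)]^{L^w(u)} \, [1 - \sigma(x_w^T \theta^u)]^{1 - L^w(u)} \right\}$$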
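For a single word, the term inside the outer sum can be evaluated directly. The sketch below is illustrative only: `log_g`, the two-dimensional vectors, and the choice of negatives are hypothetical, and the label exponents are written as multipliers of the logarithms.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def log_g(x_w, theta, target, negatives):
    """log g(w): sum over u in {w} U NEG(w) of L*log(s) + (1-L)*log(1-s)."""
    total = 0.0
    for u in [target] + list(negatives):
        s = sigmoid(dot(x_w, theta[u]))
        L = 1 if u == target else 0  # the label L^w(u)
        total += L * math.log(s) + (1 - L) * math.log(1.0 - s)
    return total

# Toy values (hypothetical): one context vector and three output vectors.
x_w = [0.2, -0.4]
theta = {"football": [1.0, -1.0], "I": [-0.5, 0.5], "game": [0.0, 0.0]}
value = log_g(x_w, theta, "football", ["I", "game"])
```

Each evaluation touches only the target and its negatives, so its cost depends on the number of negatives rather than on the vocabulary size.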
Difference Between CBOW and Skip-gram
Skip-gram is more accurate.
Skip-gram is slower to train, and more so with a larger context.
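One way to see the speed difference is to count training examples. The sketch below assumes the usual definitions, which this slide does not spell out: CBOW predicts the centre word once from its whole context, while skip-gram predicts each context word separately from the centre word. The function names and the `window` parameter are hypothetical.

```python
def cbow_examples(tokens, window):
    """One example per position: (context words, centre word)."""
    examples = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, centre))
    return examples

def skipgram_examples(tokens, window):
    """One example per (centre word, context word) pair."""
    examples = []
    for context, centre in cbow_examples(tokens, window):
        examples.extend((centre, c) for c in context)
    return examples

tokens = ["I", "love", "watching", "Brazil", "football", "game"]
cbow = cbow_examples(tokens, window=2)
skip = skipgram_examples(tokens, window=2)
```

With six tokens and a window of 2, CBOW yields 6 examples and skip-gram yields 18; widening the window increases only the skip-gram count.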
Why Use Negative Sampling & Hierarchical Softmax
[Figure: diagram over the example words "I", "love", "watching", "Brazil", "football", and "game".]
Thank you!