Representation in NLP
Pujari Rajkumar
Representation
Representations can be of two different types:
1. Representation by association
Eg: Narendra Modi → {Barack Obama, Vladimir Putin, Angela Merkel};
Sachin Tendulkar → {Saurav Ganguly, Brian Lara, Rahul Dravid}
2. Representation by features
Eg: Narendra Modi → {Indian politician, Age: 65};
Sachin Tendulkar → {Indian cricketer, Age: 43}
6/30/2016
Isomorphism of Representation
Any representation can be a valid representation if it can project desired
relationships between various entities into the represented space.
In the example provided, ‘Representation by association’ can be a valid
representation, as it captures a sense of similarity between individuals
through their circle of associations
For example, using such a representation, Obama and Modi would have a
higher similarity than Sachin and Modi
Similarly, ‘Representation by features’ captures a different sense of similarity
Choice of representation is often determined by the task at hand
Representation in NLP
Representing words, sentences, paragraphs and documents in vector
space is quite an active area of research in NLP
Representations that capture both syntagmatic and paradigmatic
information have been achieved with some success using Neural Networks,
Recurrent Neural Networks, LSTMs etc.
Such representations have helped improve state-of-the-art performance
on several core NLP problems such as POS tagging, Chunking, Named
Entity Recognition (NER), Semantic Role Labelling, Syntactic Parsing etc.
One-hot Representation
One very intuitive way of representation is to have one dimension for each
different word
Eg: If we have 5 words in the entire vocabulary
<apple, tastes, sweet, lemon, sour>
They may be represented as follows:
apple - <1, 0, 0, 0, 0> tastes - <0, 1, 0, 0, 0> sweet - <0, 0, 1, 0, 0>
lemon - <0, 0, 0, 1, 0> sour - <0, 0, 0, 0, 1>
But this representation only captures the fact that each is a different word
Cosine Similarity(apple, lemon) = Cosine Similarity(apple, sweet) = 0
But we know that ‘apple’ is more similar to ‘lemon’ than it is to ‘sweet’
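The orthogonality claim above can be checked directly. A minimal sketch, using the slide's 5-word vocabulary:

```python
import numpy as np

# Vocabulary from the slide
vocab = ["apple", "tastes", "sweet", "lemon", "sour"]

def one_hot(word):
    """Return the one-hot vector for a word: 1 at its vocabulary index, 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Any two distinct one-hot vectors are orthogonal, so their similarity is 0
print(cosine_similarity(one_hot("apple"), one_hot("lemon")))  # 0.0
print(cosine_similarity(one_hot("apple"), one_hot("sweet")))  # 0.0
```

The similarity is 0 for every distinct pair, regardless of meaning, which is exactly the limitation the slide points out.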
Neural Networks
Neural networks are a beautiful, biologically-inspired programming
paradigm which enables a computer to learn from observational
data [1].
[1] http://neuralnetworksanddeeplearning.com/
Perceptron
wi = weights
xi = inputs
b = bias of the perceptron
z = Σ(wi * xi) + b
σ = sigmoid function
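The perceptron equations above can be written out directly. A minimal sketch (the example weights and inputs below are arbitrary, chosen only for illustration):

```python
import math

def sigmoid(z):
    """σ(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_output(weights, inputs, bias):
    """z = Σ(wi * xi) + b, then squash with the sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# A zero weighted sum gives σ(0) = 0.5
print(perceptron_output([0.5, -0.5], [1.0, 1.0], 0.0))  # 0.5
```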
Typical Neural Network
Concept of ‘hidden’ layer
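The ‘hidden’ layer idea can be sketched as one forward pass: each hidden neuron is a perceptron, and its outputs feed the output layer. The layer sizes and random weights below are illustrative assumptions, not taken from the slide's figure:

```python
import numpy as np

# Illustrative sizes: 3 inputs, 4 hidden neurons, 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input  -> hidden
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    hidden = sigmoid(W1 @ x + b1)   # each hidden neuron computes σ(Σ wi*xi + b)
    return sigmoid(W2 @ hidden + b2)

print(forward(np.array([1.0, 0.0, -1.0])).shape)  # (2,)
```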
Cost Function
A function to quantify error in the output. Also called ‘loss’ function
Eg: Mean Squared Error Function
C(w, b) = (1/2n) Σₓ ‖y(x) − a‖²
w – weights, b – biases
y(x) – Expected output vector for input x
a – current activation vector for input x
n – number of input training data samples
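The mean squared error function above, as a short sketch over a list of training samples:

```python
import numpy as np

def mse_cost(expected, activations):
    """C(w, b) = (1/2n) Σ_x ||y(x) - a||^2

    expected, activations: lists of output vectors, one pair per training sample.
    """
    n = len(expected)
    return sum(np.sum((np.asarray(y) - np.asarray(a)) ** 2)
               for y, a in zip(expected, activations)) / (2 * n)

# Perfect predictions give zero cost
ys = [[0, 1], [1, 0]]
print(mse_cost(ys, ys))  # 0.0
```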
Stochastic Gradient Descent
Move in such a way that the cost reduces
v → v′ such that ΔC is negative
SGD: Mathematics
ΔC ≈ (∂C/∂w) Δw + (∂C/∂b) Δb
ΔC ≈ ∇C · Δvᵀ ; ∇C = (∂C/∂w, ∂C/∂b), Δv = (Δw, Δb)
If we choose Δv = −η∇C, then ΔC is definitely negative. η is a parameter (the learning rate)
Δv = (Δw, Δb) = −η (∂C/∂w, ∂C/∂b)
We need to compute ∂C/∂wˡⱼₖ and ∂C/∂bˡⱼ, where bˡⱼ is the bias of the jth node
in layer l, and wˡⱼₖ is the weight of the edge from the kth node in layer l−1 to
the jth node in layer l
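The update rule Δv = −η∇C can be sketched on a toy cost whose gradient is known in closed form. The quadratic cost below is an assumption for illustration; in a real network the gradient comes from back-propagation:

```python
# Toy cost C(v) = v1^2 + v2^2, with gradient ∇C = (2*v1, 2*v2)

def gradient(v):
    return [2 * x for x in v]

def sgd_step(v, eta):
    """Δv = -η ∇C : move against the gradient so that ΔC is negative."""
    return [x - eta * g for x, g in zip(v, gradient(v))]

v = [1.0, -2.0]
for _ in range(100):
    v = sgd_step(v, eta=0.1)
print(v)  # both components have shrunk toward the minimum at (0, 0)
```

Each step multiplies every component by (1 − 2η), so the cost strictly decreases for any η between 0 and 0.5 in this toy example.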
Back-Propagation Algorithm
Define δˡⱼ = ∂C/∂zˡⱼ
L denotes the output layer; a is the activation obtained for the current input
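For a single sigmoid output layer with the MSE cost, the δ quantities can be computed directly. A one-layer sketch (random weights are untrained stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 3)), np.zeros(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, -1.0])   # input activations
y = np.array([0.0, 1.0])         # expected output

z = W @ x + b
a = sigmoid(z)
delta = (a - y) * a * (1 - a)    # δ^L = ∂C/∂z^L, using σ'(z) = σ(z)(1 - σ(z))
dC_dW = np.outer(delta, x)       # ∂C/∂w_jk = δ_j * (input activation k)
dC_db = delta                    # ∂C/∂b_j  = δ_j
print(dC_dW.shape, dC_db.shape)  # (2, 3) (2,)
```

In a multi-layer network, back-propagation repeats this pattern layer by layer, passing δ backwards through the weights.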
Training Word Vectors
Bi-gram model training data:
(Apple, tastes) – (<1, 0, 0>, <0, 1, 0>)
(tastes, sweet) – (<0, 1, 0>, <0, 0, 1>)
Training Word Vectors (contd.)
We begin with one-hot representation and train the system using bi-gram
model
For given corpus, training data is: (Apple, tastes), (tastes, sweet)
For training data point (Apple, tastes), input is <1, 0, 0> and output
expected is <0, 1, 0>
The weights and biases in the network are initialized randomly and are
modified (learnt) using the back-propagation algorithm during the training phase
Word vector for ‘Apple’ would be (wⁱ₁₁, wⁱ₁₂) + (wᵒ₁₁, wᵒ₂₁), with + being a
composition of choice
Training Word Vectors (contd.)
The dimension of the word vector is the number of neurons in the ‘hidden’
layer, and it is a design choice
Number of neurons in the input layer = Number of neurons in the output
layer = Size of vocabulary
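The two slides above can be sketched as a tiny bi-gram trainer on the 3-word corpus (Apple, tastes), (tastes, sweet). This is a sketch, not the exact network from the slides: it uses a softmax output with the cross-entropy gradient, a common simplification, rather than the MSE cost defined earlier, and it takes the input-weight row alone as the word vector instead of composing input and output weights:

```python
import numpy as np

vocab = ["Apple", "tastes", "sweet"]
V, H = len(vocab), 2                 # input/output size = vocab size; hidden size = 2 (design choice)
pairs = [("Apple", "tastes"), ("tastes", "sweet")]

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, H))    # input -> hidden weights (random init)
W_out = rng.normal(scale=0.1, size=(H, V))   # hidden -> output weights (random init)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

eta = 0.5
for _ in range(2000):
    for word, ctx in pairs:
        i, j = vocab.index(word), vocab.index(ctx)
        h = W_in[i]                      # one-hot input just selects row i
        p = softmax(W_out.T @ h)         # predicted next-word distribution
        err = p.copy(); err[j] -= 1.0    # gradient of cross-entropy w.r.t. logits
        grad_h = W_out @ err             # back-propagate to the hidden layer
        W_out -= eta * np.outer(h, err)
        W_in[i] -= eta * grad_h

# The learnt row of W_in is the word vector for 'Apple'
print(W_in[vocab.index("Apple")])
```

After training, feeding ‘Apple’ makes ‘tastes’ the most probable next word, and the rows of W_in are the learnt word vectors.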
Representing sequences
How do we train RNN / LSTM / GRU to generate
representations of sentences / paragraphs / documents?
Recurrent Neural Network
RNN is exactly the same as a Feed-Forward Neural Network with an
added concept of ‘memory’
Recurrent Neural Network(contd.)
st = f(U * xt + W * st-1), f is usually a non-linear function like tanh() or ReLU()
yt = σ(V * st), σ denotes the sigmoid function
x, y ∈ ℝ³ˣ¹, U ∈ ℝ²ˣ³, V ∈ ℝ³ˣ², s ∈ ℝ²ˣ¹, W ∈ ℝ²ˣ²
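The recurrence above, with the stated shapes, can be sketched in a few lines (the weight matrices are random, untrained stand-ins):

```python
import numpy as np

# Shapes from the slide: x ∈ R^{3x1}, s ∈ R^{2x1}, U ∈ R^{2x3}, W ∈ R^{2x2}, V ∈ R^{3x2}
rng = np.random.default_rng(0)
U = rng.normal(size=(2, 3))
W = rng.normal(size=(2, 2))
V = rng.normal(size=(3, 2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, s_prev):
    """s_t = tanh(U x_t + W s_{t-1}); y_t = σ(V s_t)."""
    s_t = np.tanh(U @ x_t + W @ s_prev)
    y_t = sigmoid(V @ s_t)
    return s_t, y_t

s = np.zeros((2, 1))                                     # s_{-1} = 0
for x in [np.array([[1.0], [0.0], [0.0]]),
          np.array([[0.0], [1.0], [0.0]])]:
    s, y = rnn_step(x, s)                                # state carries 'memory' forward
print(s.shape, y.shape)  # (2, 1) (3, 1)
```

Note that U, V and W are shared across all time steps; only the state s changes.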
Recurrent Neural Network(contd.)
s0 = f(U * x0 + W * s-1) s1 = f(U * x1 + W * s0)
y0 = σ(V * s0) y1 = σ(V * s1)
Back-Propagation Through Time
RNN unfolding in time
Back-Propagation Through Time
Training RNNs
Very similar to training feed-forward neural networks
Bi-gram model, CBOW, skip-gram etc.
Use BPTT instead of the back-propagation algorithm
Truncate the number of steps till which the error is propagated during the
training phase
Generating Sequence Representations
For a trained RNN, set s-1 = 0
Pass the one-hot representation of each token of the sequence as input,
sequentially in time
sT, the state value of the RNN accrued after T steps (where T is the length of
the input sequence), is the representation of that sequence
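The procedure above can be sketched as follows; the weights here are random stand-ins for trained ones, and the sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, hidden_size = 5, 4
U = rng.normal(size=(hidden_size, vocab_size))
W = rng.normal(size=(hidden_size, hidden_size))

def sequence_representation(token_ids):
    """Feed one-hot tokens through the recurrence; return the final state s_T."""
    s = np.zeros(hidden_size)                  # s_{-1} = 0
    for t in token_ids:
        x = np.zeros(vocab_size)
        x[t] = 1.0                             # one-hot input for this token
        s = np.tanh(U @ x + W @ s)
    return s                                   # s_T is the sequence representation

rep = sequence_representation([0, 3, 2])
print(rep.shape)  # (4,)
```

The same idea extends to LSTMs and GRUs: run the sequence through the trained unit and take the final state (or hidden output) as the representation.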
Drawbacks in RNNs
We introduced ‘memory’ all right. But it just keeps growing
Humans also forget what is not needed
RNNs can’t forget
Eg: The man wearing a brown hat and a red scarf, walked out of the room
Q: Who walked out? A: The man
Neither the brown hat, nor the red scarf
The model needs to forget about ‘brown hat’ and ‘red scarf’ and needs to
remember ‘The man’
How can we simulate forgetting? => LSTM
Long Short Term Memory(LSTM)
Modification to the repeating module of RNN
LSTM: Repeating Module
The first step in our LSTM is to decide what information we’re going
to throw away from the cell state. This decision is made by a
sigmoid layer called the “forget gate layer.”
LSTM: Repeating Module
The next step is to decide what new information we’re going to
store in the cell state. This has two parts: a sigmoid layer and a
tanh() layer.
LSTM: Repeating Module
We multiply the old state by ft, forgetting the things we decided
to forget earlier. Then we add the new candidate values, scaled
by how much we decided to update each state value.
LSTM: Repeating Module
Finally, we need to decide what we’re going to output. This output
will be based on our cell state.
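The four steps above (forget gate, input gate plus candidate, cell-state update, output gate) can be sketched in one function. The weight matrices are random, untrained stand-ins, the sizes are illustrative, and [h, x] denotes concatenation of the previous hidden state with the current input, following the standard LSTM formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h = 3, 2
Wf, Wi, Wc, Wo = (rng.normal(size=(n_h, n_h + n_in)) for _ in range(4))
bf, bi, bc, bo = (np.zeros(n_h) for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)            # forget gate: what to throw away
    i_t = sigmoid(Wi @ z + bi)            # input gate: which values to update
    c_hat = np.tanh(Wc @ z + bc)          # candidate cell values
    c_t = f_t * c_prev + i_t * c_hat      # forget the old, add the scaled new
    o_t = sigmoid(Wo @ z + bo)            # output gate
    h_t = o_t * np.tanh(c_t)              # output based on the cell state
    return h_t, c_t

h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(np.array([1.0, 0.0, 0.0]), h, c)
print(h.shape, c.shape)  # (2,) (2,)
```

Because f_t can drive entries of the cell state toward zero, the unit can ‘forget’ (for example, drop ‘brown hat’ while keeping ‘The man’), which is exactly what plain RNNs lack.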
Gated Recurrent Unit(GRU)
A popular variation of LSTM
Questions?