Representation in NLP
Pujari Rajkumar
Representation
Representations can be of two different types:
1. Representation by association
Eg: Narendra Modi → {Barack Obama, Vladimir Putin, Angela Merkel};
Sachin Tendulkar → {Saurav Ganguly, Brian Lara, Rahul Dravid}
2. Representation by features
Eg: Narendra Modi → {Indian politician, Age: 65};
Sachin Tendulkar → {Indian cricketer, Age: 43}
6/30/2016
Isomorphism of Representation
Any representation can be a valid representation if it can project desired
relationships between various entities into the represented space.
In the example provided, ‘Representation by association’ can be a valid
representation, as it captures a sense of similarity between individuals
through their circle of associations
For example, using such a representation, Obama and Modi would have a
higher similarity than Sachin and Modi
Similarly, ‘Representation by features’ captures a different sense of similarity
Choice of representation is often determined by the task at hand
Representation in NLP
Representing words, sentences, paragraphs and documents in vector
space is quite an active area of research in NLP
Representations that capture both syntagmatic and paradigmatic
information have been achieved with some success using Neural Networks,
Recurrent Neural Networks, LSTMs etc.
Such representations have helped improve state-of-the-art performance
on several core NLP problems such as POS tagging, Chunking, Named
Entity Recognition (NER), Semantic Role Labelling, Syntactic Parsing etc.
One-hot Representation
One very intuitive way of representation is to have one dimension for each
different word
Eg: If we have 5 words in the entire vocabulary
<apple, tastes, sweet, lemon, sour>
They may be represented as follows:
apple - <1, 0, 0, 0, 0> tastes - <0, 1, 0, 0, 0> sweet - <0, 0, 1, 0, 0>
lemon - <0, 0, 0, 1, 0> sour - <0, 0, 0, 0, 1>
But this representation only captures the fact that each is a different word
Cosine Similarity(apple, lemon) = Cosine Similarity(apple, sweet) = 0
But we know that ‘apple’ is more similar to ‘lemon’ than it is to ‘sweet’
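The orthogonality claim above can be checked directly. A minimal sketch, using the slide's 5-word vocabulary:

```python
import numpy as np

# Vocabulary from the slide
vocab = ["apple", "tastes", "sweet", "lemon", "sour"]

def one_hot(word):
    """Return the one-hot vector for a word: 1 at its vocabulary index, 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Any two distinct one-hot vectors are orthogonal, so their similarity is 0
print(cosine_similarity(one_hot("apple"), one_hot("lemon")))  # 0.0
print(cosine_similarity(one_hot("apple"), one_hot("sweet")))  # 0.0
```

The similarity is 0 for every distinct pair, regardless of meaning, which is exactly the limitation the slide points out.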
Neural Networks
Neural networks are a beautiful, biologically-inspired programming
paradigm which enables a computer to learn from observational
data [1].
[1] http://neuralnetworksanddeeplearning.com/
Perceptron
wi = weights
xi = inputs
b = bias of the perceptron
z = Σ(wi * xi) + b
σ = sigmoid function
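The perceptron equations above can be written out directly. A minimal sketch (the example weights and inputs below are arbitrary, chosen only for illustration):

```python
import math

def sigmoid(z):
    """σ(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_output(weights, inputs, bias):
    """z = Σ(wi * xi) + b, then squash with the sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# A zero weighted sum gives σ(0) = 0.5
print(perceptron_output([0.5, -0.5], [1.0, 1.0], 0.0))  # 0.5
```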
Typical Neural Network
Concept of ‘hidden’ layer
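The ‘hidden’ layer idea can be sketched as one forward pass: each hidden neuron is a perceptron, and its outputs feed the output layer. The layer sizes and random weights below are illustrative assumptions, not taken from the slide's figure:

```python
import numpy as np

# Illustrative sizes: 3 inputs, 4 hidden neurons, 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input  -> hidden
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    hidden = sigmoid(W1 @ x + b1)   # each hidden neuron computes σ(Σ wi*xi + b)
    return sigmoid(W2 @ hidden + b2)

print(forward(np.array([1.0, 0.0, -1.0])).shape)  # (2,)
```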
Cost Function
A function to quantify error in the output. Also called ‘loss’ function
Eg: Mean Squared Error Function
C(w, b) = (1/2n) Σₓ ‖y(x) − a‖²
w – weights, b – biases
y(x) – Expected output vector for input x
a – current activation vector for input x
n – number of input training data samples
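The mean squared error function above, as a short sketch over a list of training samples:

```python
import numpy as np

def mse_cost(expected, activations):
    """C(w, b) = (1/2n) Σ_x ||y(x) - a||^2

    expected, activations: lists of output vectors, one pair per training sample.
    """
    n = len(expected)
    return sum(np.sum((np.asarray(y) - np.asarray(a)) ** 2)
               for y, a in zip(expected, activations)) / (2 * n)

# Perfect predictions give zero cost
ys = [[0, 1], [1, 0]]
print(mse_cost(ys, ys))  # 0.0
```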
Stochastic Gradient Descent
Move in such a way that the cost reduces
v → v′ such that ΔC is negative
SGD: Mathematics
ΔC ≈ (∂C/∂w) Δw + (∂C/∂b) Δb
ΔC ≈ ∇C · Δvᵀ ; ∇C = (∂C/∂w, ∂C/∂b), Δv = (Δw, Δb)
If we choose Δv = −η∇C, then ΔC is definitely negative. η is a parameter (the learning rate)
Δv = (Δw, Δb) = −η (∂C/∂w, ∂C/∂b)
We need to compute ∂C/∂wˡⱼₖ and ∂C/∂bˡⱼ, where bˡⱼ is the bias of the jth node
in layer l, and wˡⱼₖ is the weight of the edge from the kth node in layer l−1 to
the jth node in layer l
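The update rule Δv = −η∇C can be sketched on a toy cost whose gradient is known in closed form. The quadratic cost below is an assumption for illustration; in a real network the gradient comes from back-propagation:

```python
# Toy cost C(v) = v1^2 + v2^2, with gradient ∇C = (2*v1, 2*v2)

def gradient(v):
    return [2 * x for x in v]

def sgd_step(v, eta):
    """Δv = -η ∇C : move against the gradient so that ΔC is negative."""
    return [x - eta * g for x, g in zip(v, gradient(v))]

v = [1.0, -2.0]
for _ in range(100):
    v = sgd_step(v, eta=0.1)
print(v)  # both components have shrunk toward the minimum at (0, 0)
```

Each step multiplies every component by (1 − 2η), so the cost strictly decreases for any η between 0 and 0.5 in this toy example.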
Back-Propagation Algorithm
Define δˡⱼ = ∂C/∂zˡⱼ
L denotes the output layer; a is the activation obtained for the current input
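For a single sigmoid output layer with the MSE cost, the δ quantities can be computed directly. A one-layer sketch (random weights are untrained stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 3)), np.zeros(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, -1.0])   # input activations
y = np.array([0.0, 1.0])         # expected output

z = W @ x + b
a = sigmoid(z)
delta = (a - y) * a * (1 - a)    # δ^L = ∂C/∂z^L, using σ'(z) = σ(z)(1 - σ(z))
dC_dW = np.outer(delta, x)       # ∂C/∂w_jk = δ_j * (input activation k)
dC_db = delta                    # ∂C/∂b_j  = δ_j
print(dC_dW.shape, dC_db.shape)  # (2, 3) (2,)
```

In a multi-layer network, back-propagation repeats this pattern layer by layer, passing δ backwards through the weights.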
Training Word Vectors
Bi-gram model training data:
(Apple, tastes) – (<1, 0, 0>, <0, 1, 0>)
(tastes, sweet) – (<0, 1, 0>, <0, 0, 1>)
Training Word Vectors (contd.)
We begin with one-hot representation and train the system using bi-gram
model
For given corpus, training data is: (Apple, tastes), (tastes, sweet)
For training data point (Apple, tastes), input is <1, 0, 0> and output
expected is <0, 1, 0>
The weights and biases in the network are initialized randomly and are
modified (learnt) using the back-propagation algorithm during the training phase
Word vector for ‘Apple’ would be (wⁱ₁₁, wⁱ₁₂) + (wᵒ₁₁, wᵒ₂₁), with + being a
composition of choice
Training Word Vectors (contd.)
The dimension of the word vector is the number of neurons in the ‘hidden’
layer, and it is a design choice
Number of neurons in the input layer = Number of neurons in the output
layer = Size of vocabulary
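The two slides above can be sketched as a tiny bi-gram trainer on the 3-word corpus (Apple, tastes), (tastes, sweet). This is a sketch, not the exact network from the slides: it uses a softmax output with the cross-entropy gradient, a common simplification, rather than the MSE cost defined earlier, and it takes the input-weight row alone as the word vector instead of composing input and output weights:

```python
import numpy as np

vocab = ["Apple", "tastes", "sweet"]
V, H = len(vocab), 2                 # input/output size = vocab size; hidden size = 2 (design choice)
pairs = [("Apple", "tastes"), ("tastes", "sweet")]

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, H))    # input -> hidden weights (random init)
W_out = rng.normal(scale=0.1, size=(H, V))   # hidden -> output weights (random init)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

eta = 0.5
for _ in range(2000):
    for word, ctx in pairs:
        i, j = vocab.index(word), vocab.index(ctx)
        h = W_in[i]                      # one-hot input just selects row i
        p = softmax(W_out.T @ h)         # predicted next-word distribution
        err = p.copy(); err[j] -= 1.0    # gradient of cross-entropy w.r.t. logits
        grad_h = W_out @ err             # back-propagate to the hidden layer
        W_out -= eta * np.outer(h, err)
        W_in[i] -= eta * grad_h

# The learnt row of W_in is the word vector for 'Apple'
print(W_in[vocab.index("Apple")])
```

After training, feeding ‘Apple’ makes ‘tastes’ the most probable next word, and the rows of W_in are the learnt word vectors.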
Representing sequences
How do we train RNN / LSTM / GRU to generate
representations of sentences / paragraphs / documents?
Recurrent Neural Network
RNN is exactly the same as a Feed-Forward Neural Network with an
added concept of ‘memory’
Recurrent Neural Network(contd.)
st = f(U * xt + W * st-1), f is usually a non-linear function like tanh() or ReLU()
yt = σ(V * st), σ denotes the sigmoid function
x, y ∈ ℝ³ˣ¹, U ∈ ℝ²ˣ³, V ∈ ℝ³ˣ², s ∈ ℝ²ˣ¹, W ∈ ℝ²ˣ²
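The recurrence above, with the stated shapes, can be sketched in a few lines (the weight matrices are random, untrained stand-ins):

```python
import numpy as np

# Shapes from the slide: x ∈ R^{3x1}, s ∈ R^{2x1}, U ∈ R^{2x3}, W ∈ R^{2x2}, V ∈ R^{3x2}
rng = np.random.default_rng(0)
U = rng.normal(size=(2, 3))
W = rng.normal(size=(2, 2))
V = rng.normal(size=(3, 2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, s_prev):
    """s_t = tanh(U x_t + W s_{t-1}); y_t = σ(V s_t)."""
    s_t = np.tanh(U @ x_t + W @ s_prev)
    y_t = sigmoid(V @ s_t)
    return s_t, y_t

s = np.zeros((2, 1))                                     # s_{-1} = 0
for x in [np.array([[1.0], [0.0], [0.0]]),
          np.array([[0.0], [1.0], [0.0]])]:
    s, y = rnn_step(x, s)                                # state carries 'memory' forward
print(s.shape, y.shape)  # (2, 1) (3, 1)
```

Note that U, V and W are shared across all time steps; only the state s changes.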
Recurrent Neural Network(contd.)
s0 = f(U * x0 + W * s-1) s1 = f(U * x1 + W * s0)
y0 = σ(V * s0) y1 = σ(V * s1)
Back-Propagation Through Time
RNN unfolding in time
Back-Propagation Through Time
Training RNNs
Very similar to training feed-forward neural networks
Bi-gram model, CBOW, skip-gram etc.
Use BPTT instead of the back-propagation algorithm
Truncate the number of steps till which the error is propagated during the
training phase
Generating Sequence Representations
For a trained RNN, set s-1 = 0
Pass the one-hot representation of each token of the sequence as input,
sequentially in time
sT, the state value of the RNN accrued after T steps (where T is the length of
the input sequence), is the representation of that sequence
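The procedure above can be sketched as follows; the weights here are random stand-ins for trained ones, and the sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, hidden_size = 5, 4
U = rng.normal(size=(hidden_size, vocab_size))
W = rng.normal(size=(hidden_size, hidden_size))

def sequence_representation(token_ids):
    """Feed one-hot tokens through the recurrence; return the final state s_T."""
    s = np.zeros(hidden_size)                  # s_{-1} = 0
    for t in token_ids:
        x = np.zeros(vocab_size)
        x[t] = 1.0                             # one-hot input for this token
        s = np.tanh(U @ x + W @ s)
    return s                                   # s_T is the sequence representation

rep = sequence_representation([0, 3, 2])
print(rep.shape)  # (4,)
```

The same idea extends to LSTMs and GRUs: run the sequence through the trained unit and take the final state (or hidden output) as the representation.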
Drawbacks in RNNs
We introduced ‘memory’ all right. But it just keeps growing
Humans also forget what is not needed
RNNs can’t forget
Eg: The man wearing a brown hat and a red scarf, walked out of the room
Q: Who walked out? A: The man
Neither the brown hat, nor the red scarf
The model needs to forget about ‘brown hat’ and ‘red scarf’ and needs to
remember ‘The man’
How can we simulate forgetting? => LSTM
Long Short Term Memory(LSTM)
Modification to the repeating module of RNN
LSTM: Repeating Module
The first step in our LSTM is to decide what information we’re going
to throw away from the cell state. This decision is made by a
sigmoid layer called the “forget gate layer.”
LSTM: Repeating Module
The next step is to decide what new information we’re going to
store in the cell state. This has two parts: a sigmoid layer and a
tanh() layer.
LSTM: Repeating Module
We multiply the old state by ft, forgetting the things we decided
to forget earlier. Then we add the new candidate values, scaled
by how much we decided to update each state value.
LSTM: Repeating Module
Finally, we need to decide what we’re going to output. This output
will be based on our cell state.
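The four steps above (forget gate, input gate plus candidate, cell-state update, output gate) can be sketched in one function. The weight matrices are random, untrained stand-ins, the sizes are illustrative, and [h, x] denotes concatenation of the previous hidden state with the current input, following the standard LSTM formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h = 3, 2
Wf, Wi, Wc, Wo = (rng.normal(size=(n_h, n_h + n_in)) for _ in range(4))
bf, bi, bc, bo = (np.zeros(n_h) for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)            # forget gate: what to throw away
    i_t = sigmoid(Wi @ z + bi)            # input gate: which values to update
    c_hat = np.tanh(Wc @ z + bc)          # candidate cell values
    c_t = f_t * c_prev + i_t * c_hat      # forget the old, add the scaled new
    o_t = sigmoid(Wo @ z + bo)            # output gate
    h_t = o_t * np.tanh(c_t)              # output based on the cell state
    return h_t, c_t

h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(np.array([1.0, 0.0, 0.0]), h, c)
print(h.shape, c.shape)  # (2,) (2,)
```

Because f_t can drive entries of the cell state toward zero, the unit can ‘forget’ (for example, drop ‘brown hat’ while keeping ‘The man’), which is exactly what plain RNNs lack.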
Gated Recurrent Unit(GRU)
A popular variation of LSTM
Questions?