CS626: NLP, Speech and the Web
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
Lectures 30, 31, 32, 33: Recurrent NN, Language Modeling
13th October onwards, 2014
(Guiding paper: Application of Deep Belief Networks for Natural Language Understanding, IEEE Transactions on
Audio, Speech and Language Processing)
13 Oct, 2014. Pushpak Bhattacharyya: Recurrent NN
Harris’s distributional hypothesis
“We group A and B into a substitution set whenever A and B have the same (or partially same) environments X.” (Harris, 1981, p. 17)
“The basic concept of word”
The example sentence "The basic concept of word is hard to express" is represented as a binary word-word co-occurrence matrix, with one row and one column per word; entry (i, j) = 1 iff word j occurs in the context of word i:

            The  basic  concept  of  word  is  hard  to  express
The          0     1       1      1    1    0    0    0     0
basic        1     0       1      1    1    0    0    0     0
...
express      0     0       0      0    0    0    0    0     0
Backpropagation algorithm
A fully connected, pure feed-forward network (no connections jumping over layers): an input layer with n i/p neurons, one or more hidden layers, and an output layer with m o/p neurons. The weight w_ji connects neuron i in one layer to neuron j in the next layer.
General Backpropagation Rule
• General weight updating rule: Δw_ji = η δ_j o_i

• where
  δ_j = (t_j - o_j) o_j (1 - o_j)                        for the outermost layer
  δ_j = (Σ_{k ∈ next layer} δ_k w_kj) o_j (1 - o_j)      for hidden layers
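As an illustration, here is a minimal sketch of this rule on a tiny two-layer sigmoid network. This is an assumed example, not lecture code; the layer sizes, learning rate `eta`, and the training pair are arbitrary choices.

```python
import numpy as np

# Sketch of the general backpropagation rule:
#   delta_j = (t_j - o_j) o_j (1 - o_j)            at the output layer,
#   delta_j = (sum_k delta_k w_kj) o_j (1 - o_j)   at a hidden layer,
#   weight update: Delta w_ji = eta * delta_j * o_i.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, W1, W2, eta=0.5):
    """One gradient step on a 2-layer sigmoid network; returns sum-square error."""
    h = sigmoid(W1 @ x)                            # hidden activations o_i
    o = sigmoid(W2 @ h)                            # output activations o_j
    delta_out = (t - o) * o * (1 - o)              # outermost-layer delta
    delta_hid = (W2.T @ delta_out) * h * (1 - h)   # hidden-layer delta
    W2 += eta * np.outer(delta_out, h)             # Delta w_ji = eta delta_j o_i
    W1 += eta * np.outer(delta_hid, x)
    return 0.5 * np.sum((t - o) ** 2)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, t = np.array([1.0, 0.0]), np.array([1.0])
errors = [backprop_step(x, t, W1, W2) for _ in range(200)]
print(errors[0], errors[-1])   # the sum-square error shrinks over training
```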
Recurrent NN
Hopfield net
Inspired by associative memory, which means memory retrieval is not by address, but by part of the data.
Consists of N neurons, fully connected with symmetric weights: w_ij = w_ji.
No self connection, so the weight matrix is 0-diagonal and symmetric.
Each computing element (neuron) is a linear threshold element with threshold = 0.
Connection matrix of the network: an N × N matrix over neurons n1, n2, n3, …, nk, with entry (i, j) = w_ij. The matrix is 0-diagonal and symmetric.
Example
w12 = w21 = 5, w13 = w31 = 3, w23 = w32 = 2
At time t = 0: s1(t) = 1, s2(t) = -1, s3(t) = 1
This is an unstable state: the net input to neuron 1 is w12·s2 + w13·s3 = -5 + 3 = -2 < 0, so neuron 1 will flip. A stable pattern is called an attractor for the net.
Concept of Energy
Energy at state s is given by the equation:

E(s) = -[w12 x1 x2 + w13 x1 x3 + … + w1n x1 xn + w23 x2 x3 + … + w2n x2 xn + … + w(n-1)n x(n-1) xn]

i.e. E(s) = -Σ_{i<j} w_ij x_i x_j.
Relation between weight vector W and state vector X

For example, in Fig. 1, at time t = 0 the state of the neural network is s(0) = <1, -1, 1>, and the corresponding weight matrix W and transposed state vector X^T are:

W = | 0  5  3 |        X^T = |  1 |
    | 5  0  2 |               | -1 |
    | 3  2  0 |               |  1 |

Fig. 1
W·X^T gives the inputs to the neurons at the next time instant:

W·X^T = | 0  5  3 |   |  1 |   | -2 |
        | 5  0  2 | · | -1 | = |  7 |
        | 3  2  0 |   |  1 |   |  1 |

sgn(W·X^T) = <-1, 1, 1> ≠ <1, -1, 1>

This shows that the n/w will change state.
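The slide's computation can be checked with a few lines of NumPy; this is a sketch of the worked example above, not lecture code.

```python
import numpy as np

# The 0-diagonal symmetric weight matrix with w12 = 5, w13 = 3, w23 = 2,
# and the network state <1, -1, 1> at t = 0. W @ x gives each neuron's
# net input at the next time instant.
W = np.array([[0, 5, 3],
              [5, 0, 2],
              [3, 2, 0]])
x = np.array([1, -1, 1])

net = W @ x                 # net inputs: [-2, 7, 1]
next_state = np.sign(net)   # threshold at 0: [-1, 1, 1]
print(net, next_state)      # differs from x, so the net changes state
```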
Theorem
In the asynchronous mode of operation, the energy of the Hopfield net always decreases.

Proof:
E(t1) = -[w12 x1(t1) x2(t1) + w13 x1(t1) x3(t1) + … + w1n x1(t1) xn(t1)
        + w23 x2(t1) x3(t1) + … + w2n x2(t1) xn(t1)
        + … + w(n-1)n x(n-1)(t1) xn(t1)]
Proof (contd.): let neuron 1 change state by summing and comparing. We get the following equation for the energy at t2:
E(t2) = -[w12 x1(t2) x2(t2) + w13 x1(t2) x3(t2) + … + w1n x1(t2) xn(t2)
        + w23 x2(t2) x3(t2) + … + w2n x2(t2) xn(t2)
        + … + w(n-1)n x(n-1)(t2) xn(t2)]
Proof (contd.): note that only neuron 1 changes state, so xj(t1) = xj(t2) for j = 2, 3, 4, …, n. Every term not involving x1 cancels in the difference, and hence:

ΔE = E(t2) - E(t1)
   = [x1(t1) - x1(t2)] Σ_{j=2}^{n} w1j xj(t1)
Proof (continued)

ΔE = [x1(t1) - x1(t2)] Σ_{j=2}^{n} w1j xj(t1)
           (D)                   (S)

Observations:
When the state changes from -1 to 1, (S) has to be +ve (that is what drives the neuron to 1) and (D) = x1(t1) - x1(t2) = -2 is -ve; so ΔE becomes negative.
When the state changes from 1 to -1, (S) has to be -ve and (D) = +2 is +ve; so ΔE becomes negative.
Therefore, energy always decreases for any state change.
The Hopfield net has to “converge” in the asynchronous mode of operation
As the energy E goes on decreasing, it has to hit the bottom, since the weight and the state vector have finite values.
That is, the Hopfield Net has to converge to an energy minimum.
Hence the Hopfield Net reaches stability.
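The convergence claim can be illustrated numerically; a sketch (assumed illustration on the earlier 3-neuron example) that updates one neuron at a time and checks that the energy never increases:

```python
import numpy as np

# Asynchronous Hopfield dynamics: E(s) = -sum_{i<j} w_ij s_i s_j should be
# non-increasing, so the net settles into an energy minimum (an attractor).

def energy(W, s):
    return -0.5 * s @ W @ s   # equals -sum_{i<j} w_ij s_i s_j for 0-diagonal W

W = np.array([[0, 5, 3],
              [5, 0, 2],
              [3, 2, 0]])
s = np.array([1, -1, 1])

energies = [energy(W, s)]
for _ in range(10):           # asynchronous mode: one neuron per step
    for i in range(3):
        s[i] = 1 if W[i] @ s >= 0 else -1
        energies.append(energy(W, s))
print(energies)               # a non-increasing sequence that hits bottom

assert all(b <= a for a, b in zip(energies, energies[1:]))
```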
Training of Hopfield Net
The early training rule was proposed by Hopfield, inspired by the concept of electron spin, and by Hebb's rule of learning: if two neurons i and j have activations xi and xj respectively, then the weight wij between the two neurons is directly proportional to the product xi·xj, i.e.

wij ∝ xi xj
Training by Hopfield Rule
Train the Hopfield net for a specific memory behaviour: store memory elements. How to store patterns?
Hopfield Rule
To store a pattern <xn, xn-1, …, x3, x2, x1>, make

wij = (1 / (n - 1)) xi xj

• Storing a pattern is equivalent to making that pattern a stable state of the net.
Training of Hopfield Net
Establish that <xn, xn-1, …, x3, x2, x1> is a stable state of the net.
To show the stability of <xn, xn-1, …, x3, x2, x1>, impress it on the net at t = 0:
<x_n(0), x_{n-1}(0), …, x_3(0), x_2(0), x_1(0)>
Training of Hopfield Net
Consider neuron i at t = 1:

x_i(1) = sgn(net_i(0))
net_i(0) = Σ_{j=1, j≠i}^{n} w_ij x_j(0)
Establishing stability

x_i(1) = sgn(net_i(0))

net_i(0) = Σ_{j=1, j≠i}^{n} w_ij x_j(0)
         = Σ_{j=1, j≠i}^{n} (1 / (n-1)) x_i(0) x_j(0) · x_j(0)
         = (1 / (n-1)) Σ_{j=1, j≠i}^{n} x_i(0) [x_j(0)]^2
         = (1 / (n-1)) · (n-1) · x_i(0)
         = x_i(0)

Thus x_i(1) = sgn(x_i(0)) = x_i(0): the impressed pattern is a stable state of the net.
Example
We want <1, -1, 1> as stored memory
Calculate all the wij values
wAB = 1/(3-1) * 1 * -1 = -1/2
Similarly wBC = -1/2 and wCA = ½
Is <1, -1, 1> stable?
[Fig.: a triangle of neurons A, B, C with states <1, -1, 1>; initially unweighted, and after calculating the weight values wAB = -0.5, wBC = -0.5, wCA = 0.5]
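The storing rule and the stability check for this example can be sketched in NumPy (illustration, not lecture code):

```python
import numpy as np

# Hopfield storing rule w_ij = x_i x_j / (n - 1) for the pattern <1, -1, 1>,
# followed by a stability check: impressing the pattern at t = 0 should
# reproduce it at t = 1.
pattern = np.array([1, -1, 1])
n = len(pattern)

W = np.outer(pattern, pattern) / (n - 1)   # w_ij = x_i x_j / (n - 1)
np.fill_diagonal(W, 0)                     # no self connection

print(W)   # off-diagonal entries: wAB = wBC = -0.5, wCA = 0.5, as on the slide

recalled = np.sign(W @ pattern)
print(recalled)                            # [ 1 -1  1 ]: the pattern is stable
```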
Observations
How much deviation can the net tolerate?
What if more than one pattern is to be stored?
Storing k patterns Let the patterns be:
P1 : <xn, xn-1, …., x3, x2, x1>1
P2 : <xn, xn-1, …., x3, x2, x1>2
.
.
.
Pk : <xn, xn-1, …., x3, x2, x1>k
Generalized Hopfield Rule:

wij = (1 / (n - 1)) Σ_{p=1}^{k} xi|p xj|p

where xi|p denotes the ith component of the pth pattern.
Storing k patterns
Study the stability of <xn, xn-1, …, x3, x2, x1>: impress the vector at t = 0 and observe the network dynamics. Looking at neuron i at t = 1, we have:
Examining stability of the qth pattern

x_i(1)|q = sgn(net_i(0)|q), with

net_i(0)|q = Σ_{j=1, j≠i}^{n} w_ij x_j(0)|q
           = (1 / (n-1)) Σ_{j=1, j≠i}^{n} Σ_{p=1}^{k} x_i|p x_j|p x_j(0)|q
           = (1 / (n-1)) Σ_{j=1, j≠i}^{n} [ x_i|q x_j|q x_j(0)|q + Σ_{p=1, p≠q}^{k} x_i|p x_j|p x_j(0)|q ]
           = (1 / (n-1)) Σ_{j=1, j≠i}^{n} x_i(0)|q [x_j(0)|q]^2 + Q / (n-1)
           = x_i(0)|q + Q / (n-1)

where Q = Σ_{j=1, j≠i}^{n} Σ_{p=1, p≠q}^{k} x_i|p x_j|p x_j(0)|q is the crosstalk contributed by the other stored patterns.
Examining stability of the qth pattern (contd.)

Thus
x_i(1)|q = sgn[ x_i(0)|q + Q / (n-1) ]
         = sgn[ x_i(0)|q + (1 / (n-1)) Σ_{j=1, j≠i}^{n} Σ_{p=1, p≠q}^{k} x_i|p x_j|p x_j(0)|q ]

The crosstalk term Q / (n-1) is small when k << n, so the sign is decided by x_i(0)|q and the qth pattern remains stable.
Storing k patterns
The condition for patterns to be stable on a Hopfield net with n neurons is k << n. The storage capacity of the Hopfield net is therefore very small; hence it is not a practical memory element.
Boltzmann M/C
Boltzmann Machine
A Hopfield net with probabilistic neurons.
Energy expression: E = -Σi Σ{j>i} wij xi xj, where xi = activation of the ith neuron.
Used for optimization; the central concern is to ensure the global minimum.
Based on simulated annealing.
Comparative Remarks

| Feed forward n/w with BP | Hopfield net | Boltzmann m/c |
| Mapping device (i/p pattern -> o/p pattern), i.e. classification | Associative memory + optimization device | Constraint satisfaction (mapping + optimization device) |
| Minimizes total sum square error | Minimizes energy | Minimizes entropy (Kullback–Leibler divergence) |
Comparative Remarks (contd.)

| Feed forward n/w with BP | Hopfield net | Boltzmann m/c |
| Deterministic neurons | Deterministic neurons | Probabilistic neurons |
| Learns to associate i/p with o/p, i.e. equivalent to a function | Learns a pattern | Learns a probability distribution |
Comparative Remarks (contd.)

| Feed forward n/w with BP | Hopfield net | Boltzmann m/c |
| Can get stuck in a local minimum (greedy approach) | Local minimum possible | Can come out of a local minimum |
| Credit/blame assignment (consistent with Hebbian rule) | Activation product (consistent with Hebbian rule) | Probability and activation product (consistent with Hebbian rule) |
Theory of Boltzmann m/c
For the m/c, computation means the following: at any time instant, make the state of the kth neuron (sk) equal to 1 with probability

P(sk = 1) = 1 / (1 + exp(-ΔEk / T))

where ΔEk = change in energy of the m/c when the kth neuron changes state, and T = temperature, a parameter of the m/c.
Theory of Boltzmann m/c (contd.)

[Fig.: P(sk = 1) = 1 / (1 + exp(-ΔEk / T)) plotted against ΔEk: a sigmoid rising from 0 to 1; increasing T flattens the curve]
Theory of Boltzmann m/c (contd.)

ΔEk = Ek_final - Ek_initial = (s_initial_k - s_final_k) · Σ_{j≠k} wkj sj

We observe:
1. For ΔEk > 0, the higher the temperature, the lower is P(Sk=1) (it approaches 0.5 from above).
2. At T = infinity, P(Sk=1) = P(Sk=0) = 0.5: equal chance of being in state 0 or 1, i.e. completely random behaviour.
3. If T -> 0 (with ΔEk > 0), then P(Sk=1) -> 1.
4. The derivative is proportional to P(Sk=1)·(1 - P(Sk=1)).
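These temperature effects are easy to verify numerically; a small sketch (assumed values of ΔEk and T, not from the lecture):

```python
import math

# Stochastic update rule P(s_k = 1) = 1 / (1 + exp(-dE_k / T)).
# As T grows the probability flattens toward 0.5 (random behaviour);
# as T -> 0 with dE_k > 0 it saturates at 1.

def p_on(dE, T):
    return 1.0 / (1.0 + math.exp(-dE / T))

dE = 2.0                              # assumed energy change
for T in (0.1, 1.0, 10.0, 100.0):
    print(T, p_on(dE, T))             # near 1 at low T, near 0.5 at high T
```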
Consequence of the form of P(Sk=1)

P(Sα) ∝ exp(-E(Sα) / T)

where Sα is a state of the whole network, an N-bit vector such as <1, -1, 1, -1, …>, and P(Sα) is the probability of the state Sα. This probability distribution is called the Boltzmann Distribution. The local "sigmoid" probabilistic behaviour of each neuron leads to the global Boltzmann Distribution behaviour of the n/w.
[Fig.: P(Sα) ∝ exp(-E(Sα) / T) plotted against E: probability falls off exponentially with energy, more slowly at higher T]
Ratio of state probabilities
Normalizing,
P(Sα) = exp(-E(Sα) / T) / Σ_{β ∈ all states} exp(-E(Sβ) / T)
P(Sα) / P(Sβ) = exp(-(E(Sα) - E(Sβ)) / T)
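A brute-force sketch over all 2^N states makes both facts concrete; the weights below are assumed toy values (reusing the earlier 3-neuron example), not lecture code:

```python
import itertools
import math

# Boltzmann distribution: P(S) = exp(-E(S)/T) / sum_S' exp(-E(S')/T),
# with E(S) = -sum_{i<j} w_ij s_i s_j.
w = {(0, 1): 5.0, (0, 2): 3.0, (1, 2): 2.0}   # assumed toy weights
T = 1.0

def energy(s):
    return -sum(wij * s[i] * s[j] for (i, j), wij in w.items())

states = list(itertools.product([-1, 1], repeat=3))
Z = sum(math.exp(-energy(s) / T) for s in states)          # partition function
P = {s: math.exp(-energy(s) / T) / Z for s in states}

print(sum(P.values()))                                     # normalizes to 1
# The ratio of two state probabilities depends only on the energy gap:
s_a, s_b = (1, 1, 1), (1, -1, 1)
print(P[s_a] / P[s_b], math.exp(-(energy(s_a) - energy(s_b)) / T))
```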
Learning a probability distribution
Digression: estimation of a probability distribution Q by another distribution P.
D = deviation = Σ_{sample space} Q ln(Q/P)
D >= 0, which is a required property (just like sum square error >= 0).
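A two-line sketch of the deviation D (the Kullback-Leibler divergence), on assumed toy distributions:

```python
import math

# D = sum Q ln(Q/P): >= 0, and 0 only when P matches Q exactly.
def kl(Q, P):
    return sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)

Q = [0.5, 0.3, 0.2]
print(kl(Q, [0.5, 0.3, 0.2]))   # 0.0: perfect estimate
print(kl(Q, [0.4, 0.4, 0.2]))   # positive: P deviates from Q
```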
Recurrent n/w and optimization
Problem representation
What is common between Sentence Generation, Sorting and the Travelling Salesman Problem?

Sentence Generation: given a set of words, place them at appropriate positions in the sentence. Take an M × M matrix with words (wi) as rows and positions (pj) as columns; xij = 1 iff the ith word is in the jth position.

Sorting: given some numbers, place them at appropriate positions in the ordered list. Take an M × M matrix with numbers (ni) as rows and positions (pj) as columns; xij = 1 iff the ith number is in the jth position.

TSP: given the cities a traveller must visit, place the cities in the "tour" so that the total distance travelled is minimized. Take an M × M matrix with cities (ci) as rows and positions (pj) as columns; xij = 1 iff the ith city is in the jth position.
Hopfield Net for Optimization
An optimization problem maximizes or minimizes a quantity. The Hopfield net can be used for optimization: we consider the Hopfield net and the Traveling Salesman Problem, and the Hopfield net and the Job Scheduling Problem.
The essential idea of the correspondence
In optimization problems, we have to minimize a quantity.
The Hopfield net minimizes its energy. THIS IS THE CORRESPONDENCE.
Hopfield net and Traveling Salesman problem
We consider the problem for n = 4 cities. In the given figure, nodes represent cities and edges represent the paths between the cities, with associated distances.
[Fig.: four cities A, B, C, D with edges AB, BC, CD, DA and diagonals AC, BD, labelled with distances dAB, dBC, dCD, dDA, dAC, dBD]
Traveling Salesman Problem
Goal: starting from city A, visit cities j = 2 to n (n is the number of cities) exactly once each, come back to A, and minimize the total distance.
To solve this with a Hopfield net we need to decide the architecture: how many neurons? what are the weights?
Constraints decide the parameters
1. For n cities and n positions, establish city to position correspondence, i.e.
Number of neurons = n cities * n positions
2. Each position can take one and only one city
3. Each city can be in exactly one position
4. Total distance should be minimum
Architecture
An n × n matrix where rows denote cities and columns denote positions.
cell(i, α) = 1 if and only if the ith city is in the αth position.
Each cell is a neuron: n^2 neurons, O(n^4) connections.
Expressions corresponding to constraints

1. Each city in one and only one position, i.e. a row has a single 1:

E1 = (A/2) Σ_{i=1}^{n} Σ_{α=1}^{n} Σ_{β=1, β≠α}^{n} x_iα x_iβ

• The above equation partially ensures each row has a single 1 (it vanishes only when no row contains two 1s).
• x_iα is the output of the cell (i, α).
Expressions corresponding to constraints (contd.)

2. Each position has a single city, i.e. each column has at most a single 1:

E2 = (B/2) Σ_{α=1}^{n} Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} x_iα x_jα
Expressions corresponding to constraints (contd.)

3. All cities MUST be visited once and only once:

E3 = (C/2) [ (Σ_{i=1}^{n} Σ_{α=1}^{n} x_iα) - n ]^2
Expressions corresponding to constraints (contd.)
E1, E2, E3 together ensure that each row has exactly one 1 and each column has exactly one 1. Minimizing E1 + E2 + E3 thus ensures a Hamiltonian circuit on the city graph (finding one is an NP-complete problem).
Constraint of distance
4. The distance traversed should be minimum:

E4 = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{α=1}^{n} d_ij x_iα (x_j,α+1 + x_j,α-1)

where d_ij = distance between city i and city j (positions are cyclic, so α ± 1 is taken modulo n).
Expressions corresponding to constraints (contd.)
We equate the constraint energy to the network energy:
Eproblem = Enetwork    (*)
where Eproblem = E1 + E2 + E3 + E4, and Enetwork is the well-known energy expression for the Hopfield net. Find the weights from (*).
Finding weights for Hopfield Net applied to TSP
An alternate and more convenient Eproblem:
EP = E1 + E2
where E1 is the equation enforcing, for n cities, each city in one position and each position with one city, and E2 is the equation for distance.
Expressions for E1 and E2

E1 = (A/2) [ Σ_{i=1}^{n} (Σ_{α=1}^{n} x_iα - 1)^2 + Σ_{α=1}^{n} (Σ_{i=1}^{n} x_iα - 1)^2 ]

E2 = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{α=1}^{n} d_ij x_iα (x_j,α+1 + x_j,α-1)
Explanatory example

Fig. 1 shows a 3-city tour graph (cities 1, 2, 3); a tour can take place in two possible directions.

For the matrix below, x_iα = 1 if and only if the ith city is in position α:

        pos:  1    2    3
city 1:      x11  x12  x13
city 2:      x21  x22  x23
city 3:      x31  x32  x33
Kinds of weights

Row weights:
w11,12  w11,13  w12,13
w21,22  w21,23  w22,23
w31,32  w31,33  w32,33

Column weights:
w11,21  w11,31  w21,31
w12,22  w12,32  w22,32
w13,23  w13,33  w23,33
Cross weights:
w11,22  w11,23  w11,32  w11,33
w12,21  w12,23  w12,31  w12,33
w13,21  w13,22  w13,31  w13,32
w21,32  w21,33  w22,31  w22,33
w23,31  w23,32
Expressions

Eproblem = E1 + E2

E1 = (A/2) [ (x11 + x12 + x13 - 1)^2
           + (x21 + x22 + x23 - 1)^2
           + (x31 + x32 + x33 - 1)^2
           + (x11 + x21 + x31 - 1)^2
           + (x12 + x22 + x32 - 1)^2
           + (x13 + x23 + x33 - 1)^2 ]
Expressions (contd.)

E2 = (1/2) [ d12 x11 (x22 + x23) + d12 x12 (x21 + x23) + d12 x13 (x21 + x22)
           + d13 x11 (x32 + x33) + d13 x12 (x31 + x33) + d13 x13 (x31 + x32) + … ]
Enetwork

Enetwork = -[ w11,12 x11 x12 + w11,13 x11 x13 + w12,13 x12 x13
            + w11,21 x11 x21 + w11,22 x11 x22 + w11,23 x11 x23
            + w11,31 x11 x31 + w11,32 x11 x32 + w11,33 x11 x33 + … ]
Find row weight
To find w11,12 = -(coefficient of x11 x12 in Enetwork):
search for x11 x12 in Eproblem.
w11,12 = -A   …from E1; E2 cannot contribute.
Find column weight
To find w11,21 = -(coefficient of x11 x21 in Enetwork):
search for x11 x21 in Eproblem.
w11,21 = -A   …from E1; E2 cannot contribute.
Find cross weights
To find w11,22 = -(coefficient of x11 x22 in Enetwork):
search for x11 x22 in Eproblem; E1 cannot contribute.
The coefficient of x11 x22 in E2 is (d12 + d21) / 2.
Therefore, w11,22 = -(d12 + d21) / 2.
Find cross weights (contd.)
To find w11,33 = -(coefficient of x11 x33): search for x11 x33 in Eproblem.
w11,33 = -(d13 + d31) / 2
Summary
Row weights = -A
Column weights = -A
Cross weights between cell (i, α) and cell (j, β) = -((dij + dji) / 2), for β = α ± 1 (mod n)
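This summary can be turned into a weight-construction sketch; the helper `tsp_weights` and the toy distance matrix below are assumptions for illustration, not from the lecture:

```python
import numpy as np

# Build the Hopfield weight matrix for a 3-city TSP. Neuron (i, a) means
# "city i in tour position a". Pairs in the same row or column get weight -A;
# cross weights between adjacent tour positions get -(d_ij + d_ji) / 2.
def tsp_weights(d, A=1.0):
    n = len(d)
    W = np.zeros((n * n, n * n))
    idx = lambda i, a: i * n + a            # flatten (city, position) to one index
    for i in range(n):
        for a in range(n):
            for j in range(n):
                for b in range(n):
                    if (i, a) == (j, b):
                        continue            # no self connection (0-diagonal)
                    if i == j or a == b:    # same row / same column
                        W[idx(i, a), idx(j, b)] = -A
                    elif (b - a) % n in (1, n - 1):   # adjacent positions (cyclic)
                        W[idx(i, a), idx(j, b)] = -(d[i][j] + d[j][i]) / 2
    return W

d = [[0, 2, 9], [2, 0, 6], [9, 6, 0]]       # assumed toy distances
W = tsp_weights(d)
print(W.shape)                              # (9, 9): n^2 neurons
print(np.allclose(W, W.T))                  # symmetric, as a Hopfield net requires
```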
Restricted Boltzmann Machines (RBM)
Lecture 39, 6th Nov, 2014
Restricted Boltzmann Machine (Binary)

Two layers: VISIBLE units v1, v2, v3, …, vm and HIDDEN units h1, h2, h3, …, hn, fully connected across layers by bidirectional, symmetric weights:

W = | w11  w12  …  w1n |
    | w21  w22  …  w2n |
    |  .    .        .  |
    | wm1  wm2  …  wmn |

Bias 1 (visible units): B = <b1, b2, …, bm>
Bias 2 (hidden units):  C = <c1, c2, …, cn>
V = visible unit activations <v1, v2, …, vm>
H = hidden unit activations <h1, h2, …, hn>
Energy at a state
Energy E(H, V) at the state <H, V>:

E(H, V) = -V^T W H - B^T V - C^T H
        = -Σ_{i=1}^{m} Σ_{j=1}^{n} w_ij v_i h_j - Σ_{i=1}^{m} b_i v_i - Σ_{j=1}^{n} c_j h_j
Problem: Name Identification
Let a sentence be denoted by S = w0 w1 w2 … wn-1 wn.
Input: the sentence POS-tagged with Noun (= 1) or not Noun (= 0).

TajMahal is the most visited site in India
1 1 0 0 0 0 0 0   ('1' for an NE, 0 otherwise)

How do the neurons become 0 and 1?

P(h_j = 1 | V) = 1 / (1 + e^{-net_j}),  where net_j = Σ_{i=1}^{m} w_ji v_i + c_j

Similarly for the visible neurons, P(v_i = 1 | H) = 1 / (1 + e^{-net_i}), with net_i = Σ_{j=1}^{n} w_ij h_j + b_i.
Input: Delhi is in India    Output: 1 0 0 1

P(H, V) = e^{-E(H,V)} / Z,   Z = Σ_{H', V'} e^{-E(H', V')}

where E = energy and Z = partition function.

The probability of a state of the network is given by its energy; the probability of the state of a single neuron is given by the sigmoid. The weights and biases should be adjusted such that the desired <H, V> combination is stabilized.
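The two formulas above can be sketched together; the weights, biases, and the visible vector below are assumed toy values, not trained parameters:

```python
import numpy as np

# Binary RBM conditional P(h_j = 1 | V) = sigmoid(sum_i w_ji v_i + c_j),
# and the energy E(H, V) = -V^T W H - B^T V - C^T H.
rng = np.random.default_rng(0)
m, n = 4, 3                                  # visible, hidden unit counts
W = rng.normal(scale=0.1, size=(m, n))       # assumed random weights
b, c = np.zeros(m), np.zeros(n)              # visible / hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h):
    return -(v @ W @ h) - b @ v - c @ h

v = np.array([1.0, 0.0, 0.0, 1.0])           # e.g. the 1/0 tags of an input
p_h = sigmoid(v @ W + c)                     # P(h_j = 1 | V) for each hidden unit
h = (rng.random(n) < p_h).astype(float)      # sample the hidden layer
print(p_h, energy(v, h))
```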