CS626: NLP, Speech and the Web
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
Lectures 30, 31, 32, 33: Recurrent NN, Language Modeling
13th October onwards, 2014
(Guiding paper: Application of Deep Belief Networks for Natural Language Understanding, IEEE Transactions on
Audio, Speech and Language Processing)
13 Oct, 2014. Pushpak Bhattacharyya: Recurrent NN
Harris’s distributional hypothesis
“We group A and B into a substitution set whenever A and B have the same (or partially same) environments X.” (Harris, 1981, p. 17)
“The basic concept of word”
The example sentence "The basic concept of word is hard to express" is represented as a binary word-word co-occurrence matrix, with one row and one column per word; entry (i, j) = 1 iff word j occurs in the context of word i:

            The  basic  concept  of  word  is  hard  to  express
The          0     1       1      1    1    0    0    0     0
basic        1     0       1      1    1    0    0    0     0
...
express      0     0       0      0    0    0    0    0     0
Backpropagation algorithm
A fully connected, pure feed-forward network (no connections jumping over layers): an input layer with n i/p neurons, one or more hidden layers, and an output layer with m o/p neurons. The weight w_ji connects neuron i in one layer to neuron j in the next layer.
General Backpropagation Rule
• General weight updating rule: Δw_ji = η δ_j o_i

• where
  δ_j = (t_j - o_j) o_j (1 - o_j)                        for the outermost layer
  δ_j = (Σ_{k ∈ next layer} δ_k w_kj) o_j (1 - o_j)      for hidden layers
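As an illustration, here is a minimal sketch of this rule on a tiny two-layer sigmoid network. This is an assumed example, not lecture code; the layer sizes, learning rate `eta`, and the training pair are arbitrary choices.

```python
import numpy as np

# Sketch of the general backpropagation rule:
#   delta_j = (t_j - o_j) o_j (1 - o_j)            at the output layer,
#   delta_j = (sum_k delta_k w_kj) o_j (1 - o_j)   at a hidden layer,
#   weight update: Delta w_ji = eta * delta_j * o_i.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, W1, W2, eta=0.5):
    """One gradient step on a 2-layer sigmoid network; returns sum-square error."""
    h = sigmoid(W1 @ x)                            # hidden activations o_i
    o = sigmoid(W2 @ h)                            # output activations o_j
    delta_out = (t - o) * o * (1 - o)              # outermost-layer delta
    delta_hid = (W2.T @ delta_out) * h * (1 - h)   # hidden-layer delta
    W2 += eta * np.outer(delta_out, h)             # Delta w_ji = eta delta_j o_i
    W1 += eta * np.outer(delta_hid, x)
    return 0.5 * np.sum((t - o) ** 2)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, t = np.array([1.0, 0.0]), np.array([1.0])
errors = [backprop_step(x, t, W1, W2) for _ in range(200)]
print(errors[0], errors[-1])   # the sum-square error shrinks over training
```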
Recurrent NN
Hopfield net
Inspired by associative memory, which means memory retrieval is not by address, but by part of the data.
Consists of N neurons, fully connected with symmetric weights: w_ij = w_ji.
No self connection, so the weight matrix is 0-diagonal and symmetric.
Each computing element (neuron) is a linear threshold element with threshold = 0.
Connection matrix of the network: an N × N matrix over neurons n1, n2, n3, …, nk, with entry (i, j) = w_ij. The matrix is 0-diagonal and symmetric.
Example
w12 = w21 = 5, w13 = w31 = 3, w23 = w32 = 2
At time t = 0: s1(t) = 1, s2(t) = -1, s3(t) = 1
This is an unstable state: the net input to neuron 1 is w12·s2 + w13·s3 = -5 + 3 = -2 < 0, so neuron 1 will flip. A stable pattern is called an attractor for the net.
Concept of Energy
Energy at state s is given by the equation:

E(s) = -[w12 x1 x2 + w13 x1 x3 + … + w1n x1 xn + w23 x2 x3 + … + w2n x2 xn + … + w(n-1)n x(n-1) xn]

i.e. E(s) = -Σ_{i<j} w_ij x_i x_j.
Relation between weight vector W and state vector X

For example, in Fig. 1, at time t = 0 the state of the neural network is s(0) = <1, -1, 1>, and the corresponding weight matrix W and transposed state vector X^T are:

W = | 0  5  3 |        X^T = |  1 |
    | 5  0  2 |               | -1 |
    | 3  2  0 |               |  1 |

Fig. 1
W·X^T gives the inputs to the neurons at the next time instant:

W·X^T = | 0  5  3 |   |  1 |   | -2 |
        | 5  0  2 | · | -1 | = |  7 |
        | 3  2  0 |   |  1 |   |  1 |

sgn(W·X^T) = <-1, 1, 1> ≠ <1, -1, 1>

This shows that the n/w will change state.
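The slide's computation can be checked with a few lines of NumPy; this is a sketch of the worked example above, not lecture code.

```python
import numpy as np

# The 0-diagonal symmetric weight matrix with w12 = 5, w13 = 3, w23 = 2,
# and the network state <1, -1, 1> at t = 0. W @ x gives each neuron's
# net input at the next time instant.
W = np.array([[0, 5, 3],
              [5, 0, 2],
              [3, 2, 0]])
x = np.array([1, -1, 1])

net = W @ x                 # net inputs: [-2, 7, 1]
next_state = np.sign(net)   # threshold at 0: [-1, 1, 1]
print(net, next_state)      # differs from x, so the net changes state
```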
Theorem
In the asynchronous mode of operation, the energy of the Hopfield net always decreases.

Proof:
E(t1) = -[w12 x1(t1) x2(t1) + w13 x1(t1) x3(t1) + … + w1n x1(t1) xn(t1)
        + w23 x2(t1) x3(t1) + … + w2n x2(t1) xn(t1)
        + … + w(n-1)n x(n-1)(t1) xn(t1)]
Proof (contd.): let neuron 1 change state by summing and comparing. We get the following equation for the energy at t2:
E(t2) = -[w12 x1(t2) x2(t2) + w13 x1(t2) x3(t2) + … + w1n x1(t2) xn(t2)
        + w23 x2(t2) x3(t2) + … + w2n x2(t2) xn(t2)
        + … + w(n-1)n x(n-1)(t2) xn(t2)]
Proof (contd.): note that only neuron 1 changes state, so xj(t1) = xj(t2) for j = 2, 3, 4, …, n. Every term not involving x1 cancels in the difference, and hence:

ΔE = E(t2) - E(t1)
   = [x1(t1) - x1(t2)] Σ_{j=2}^{n} w1j xj(t1)
Proof (continued)

ΔE = [x1(t1) - x1(t2)] Σ_{j=2}^{n} w1j xj(t1)
           (D)                   (S)

Observations:
When the state changes from -1 to 1, (S) has to be +ve (that is what drives the neuron to 1) and (D) = x1(t1) - x1(t2) = -2 is -ve; so ΔE becomes negative.
When the state changes from 1 to -1, (S) has to be -ve and (D) = +2 is +ve; so ΔE becomes negative.
Therefore, energy always decreases for any state change.
The Hopfield net has to “converge” in the asynchronous mode of operation
As the energy E goes on decreasing, it has to hit the bottom, since the weight and the state vector have finite values.
That is, the Hopfield Net has to converge to an energy minimum.
Hence the Hopfield Net reaches stability.
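The convergence claim can be illustrated numerically; a sketch (assumed illustration on the earlier 3-neuron example) that updates one neuron at a time and checks that the energy never increases:

```python
import numpy as np

# Asynchronous Hopfield dynamics: E(s) = -sum_{i<j} w_ij s_i s_j should be
# non-increasing, so the net settles into an energy minimum (an attractor).

def energy(W, s):
    return -0.5 * s @ W @ s   # equals -sum_{i<j} w_ij s_i s_j for 0-diagonal W

W = np.array([[0, 5, 3],
              [5, 0, 2],
              [3, 2, 0]])
s = np.array([1, -1, 1])

energies = [energy(W, s)]
for _ in range(10):           # asynchronous mode: one neuron per step
    for i in range(3):
        s[i] = 1 if W[i] @ s >= 0 else -1
        energies.append(energy(W, s))
print(energies)               # a non-increasing sequence that hits bottom

assert all(b <= a for a, b in zip(energies, energies[1:]))
```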
Training of Hopfield Net
The early training rule was proposed by Hopfield, inspired by the concept of electron spin, and by Hebb's rule of learning: if two neurons i and j have activations xi and xj respectively, then the weight wij between the two neurons is directly proportional to the product xi·xj, i.e.

wij ∝ xi xj
Training by Hopfield Rule
Train the Hopfield net for a specific memory behaviour: store memory elements. How to store patterns?
Hopfield Rule
To store a pattern <xn, xn-1, …, x3, x2, x1>, make

wij = (1 / (n - 1)) xi xj

• Storing a pattern is equivalent to making that pattern a stable state of the net.
Training of Hopfield Net
Establish that <xn, xn-1, …, x3, x2, x1> is a stable state of the net.
To show the stability of <xn, xn-1, …, x3, x2, x1>, impress it on the net at t = 0:
<x_n(0), x_{n-1}(0), …, x_3(0), x_2(0), x_1(0)>
Training of Hopfield Net
Consider neuron i at t = 1:

x_i(1) = sgn(net_i(0))
net_i(0) = Σ_{j=1, j≠i}^{n} w_ij x_j(0)
Establishing stability

x_i(1) = sgn(net_i(0))

net_i(0) = Σ_{j=1, j≠i}^{n} w_ij x_j(0)
         = Σ_{j=1, j≠i}^{n} (1 / (n-1)) x_i(0) x_j(0) · x_j(0)
         = (1 / (n-1)) Σ_{j=1, j≠i}^{n} x_i(0) [x_j(0)]^2
         = (1 / (n-1)) · (n-1) · x_i(0)
         = x_i(0)

Thus x_i(1) = sgn(x_i(0)) = x_i(0): the impressed pattern is a stable state of the net.
Example
We want <1, -1, 1> as stored memory
Calculate all the wij values
wAB = 1/(3-1) * 1 * -1 = -1/2
Similarly wBC = -1/2 and wCA = ½
Is <1, -1, 1> stable?
[Fig.: a triangle of neurons A, B, C with states <1, -1, 1>; initially unweighted, and after calculating the weight values wAB = -0.5, wBC = -0.5, wCA = 0.5]
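The storing rule and the stability check for this example can be sketched in NumPy (illustration, not lecture code):

```python
import numpy as np

# Hopfield storing rule w_ij = x_i x_j / (n - 1) for the pattern <1, -1, 1>,
# followed by a stability check: impressing the pattern at t = 0 should
# reproduce it at t = 1.
pattern = np.array([1, -1, 1])
n = len(pattern)

W = np.outer(pattern, pattern) / (n - 1)   # w_ij = x_i x_j / (n - 1)
np.fill_diagonal(W, 0)                     # no self connection

print(W)   # off-diagonal entries: wAB = wBC = -0.5, wCA = 0.5, as on the slide

recalled = np.sign(W @ pattern)
print(recalled)                            # [ 1 -1  1 ]: the pattern is stable
```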
Observations
How much deviation can the net tolerate?
What if more than one pattern is to be stored?
Storing k patterns Let the patterns be:
P1 : <xn, xn-1, …., x3, x2, x1>1
P2 : <xn, xn-1, …., x3, x2, x1>2
.
.
.
Pk : <xn, xn-1, …., x3, x2, x1>k
Generalized Hopfield Rule:

wij = (1 / (n - 1)) Σ_{p=1}^{k} xi|p xj|p

where xi|p denotes the ith component of the pth pattern.
Storing k patterns
Study the stability of <xn, xn-1, …, x3, x2, x1>: impress the vector at t = 0 and observe the network dynamics. Looking at neuron i at t = 1, we have:
Examining stability of the qth pattern

x_i(1)|q = sgn(net_i(0)|q), with

net_i(0)|q = Σ_{j=1, j≠i}^{n} w_ij x_j(0)|q
           = (1 / (n-1)) Σ_{j=1, j≠i}^{n} Σ_{p=1}^{k} x_i|p x_j|p x_j(0)|q
           = (1 / (n-1)) Σ_{j=1, j≠i}^{n} [ x_i|q x_j|q x_j(0)|q + Σ_{p=1, p≠q}^{k} x_i|p x_j|p x_j(0)|q ]
           = (1 / (n-1)) Σ_{j=1, j≠i}^{n} x_i(0)|q [x_j(0)|q]^2 + Q / (n-1)
           = x_i(0)|q + Q / (n-1)

where Q = Σ_{j=1, j≠i}^{n} Σ_{p=1, p≠q}^{k} x_i|p x_j|p x_j(0)|q is the crosstalk contributed by the other stored patterns.
Examining stability of the qth pattern (contd.)

Thus
x_i(1)|q = sgn[ x_i(0)|q + Q / (n-1) ]
         = sgn[ x_i(0)|q + (1 / (n-1)) Σ_{j=1, j≠i}^{n} Σ_{p=1, p≠q}^{k} x_i|p x_j|p x_j(0)|q ]

The crosstalk term Q / (n-1) is small when k << n, so the sign is decided by x_i(0)|q and the qth pattern remains stable.
Storing k patterns
The condition for patterns to be stable on a Hopfield net with n neurons is k << n. The storage capacity of the Hopfield net is therefore very small; hence it is not a practical memory element.
Boltzmann M/C
Boltzmann Machine
A Hopfield net with probabilistic neurons.
Energy expression: E = -Σi Σ{j>i} wij xi xj, where xi = activation of the ith neuron.
Used for optimization; the central concern is to ensure the global minimum.
Based on simulated annealing.
Comparative Remarks

| Feed forward n/w with BP | Hopfield net | Boltzmann m/c |
| Mapping device (i/p pattern -> o/p pattern), i.e. classification | Associative memory + optimization device | Constraint satisfaction (mapping + optimization device) |
| Minimizes total sum square error | Minimizes energy | Minimizes entropy (Kullback–Leibler divergence) |
Comparative Remarks (contd.)

| Feed forward n/w with BP | Hopfield net | Boltzmann m/c |
| Deterministic neurons | Deterministic neurons | Probabilistic neurons |
| Learns to associate i/p with o/p, i.e. equivalent to a function | Learns a pattern | Learns a probability distribution |
Comparative Remarks (contd.)

| Feed forward n/w with BP | Hopfield net | Boltzmann m/c |
| Can get stuck in a local minimum (greedy approach) | Local minimum possible | Can come out of a local minimum |
| Credit/blame assignment (consistent with Hebbian rule) | Activation product (consistent with Hebbian rule) | Probability and activation product (consistent with Hebbian rule) |
Theory of Boltzmann m/c
For the m/c, computation means the following: at any time instant, make the state of the kth neuron (sk) equal to 1 with probability

P(sk = 1) = 1 / (1 + exp(-ΔEk / T))

where ΔEk = change in energy of the m/c when the kth neuron changes state, and T = temperature, a parameter of the m/c.
Theory of Boltzmann m/c (contd.)

[Fig.: P(sk = 1) = 1 / (1 + exp(-ΔEk / T)) plotted against ΔEk: a sigmoid rising from 0 to 1; increasing T flattens the curve]
Theory of Boltzmann m/c (contd.)

ΔEk = Ek_final - Ek_initial = (s_initial_k - s_final_k) · Σ_{j≠k} wkj sj

We observe:
1. For ΔEk > 0, the higher the temperature, the lower is P(Sk=1) (it approaches 0.5 from above).
2. At T = infinity, P(Sk=1) = P(Sk=0) = 0.5: equal chance of being in state 0 or 1, i.e. completely random behaviour.
3. If T -> 0 (with ΔEk > 0), then P(Sk=1) -> 1.
4. The derivative is proportional to P(Sk=1)·(1 - P(Sk=1)).
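These temperature effects are easy to verify numerically; a small sketch (assumed values of ΔEk and T, not from the lecture):

```python
import math

# Stochastic update rule P(s_k = 1) = 1 / (1 + exp(-dE_k / T)).
# As T grows the probability flattens toward 0.5 (random behaviour);
# as T -> 0 with dE_k > 0 it saturates at 1.

def p_on(dE, T):
    return 1.0 / (1.0 + math.exp(-dE / T))

dE = 2.0                              # assumed energy change
for T in (0.1, 1.0, 10.0, 100.0):
    print(T, p_on(dE, T))             # near 1 at low T, near 0.5 at high T
```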
Consequence of the form of P(Sk=1)

P(Sα) ∝ exp(-E(Sα) / T)

where Sα is a state of the whole network, an N-bit vector such as <1, -1, 1, -1, …>, and P(Sα) is the probability of the state Sα. This probability distribution is called the Boltzmann Distribution. The local "sigmoid" probabilistic behaviour of each neuron leads to the global Boltzmann Distribution behaviour of the n/w.
[Fig.: P(Sα) ∝ exp(-E(Sα) / T) plotted against E: probability falls off exponentially with energy, more slowly at higher T]
Ratio of state probabilities
Normalizing,
P(Sα) = exp(-E(Sα) / T) / Σ_{β ∈ all states} exp(-E(Sβ) / T)
P(Sα) / P(Sβ) = exp(-(E(Sα) - E(Sβ)) / T)
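A brute-force sketch over all 2^N states makes both facts concrete; the weights below are assumed toy values (reusing the earlier 3-neuron example), not lecture code:

```python
import itertools
import math

# Boltzmann distribution: P(S) = exp(-E(S)/T) / sum_S' exp(-E(S')/T),
# with E(S) = -sum_{i<j} w_ij s_i s_j.
w = {(0, 1): 5.0, (0, 2): 3.0, (1, 2): 2.0}   # assumed toy weights
T = 1.0

def energy(s):
    return -sum(wij * s[i] * s[j] for (i, j), wij in w.items())

states = list(itertools.product([-1, 1], repeat=3))
Z = sum(math.exp(-energy(s) / T) for s in states)          # partition function
P = {s: math.exp(-energy(s) / T) / Z for s in states}

print(sum(P.values()))                                     # normalizes to 1
# The ratio of two state probabilities depends only on the energy gap:
s_a, s_b = (1, 1, 1), (1, -1, 1)
print(P[s_a] / P[s_b], math.exp(-(energy(s_a) - energy(s_b)) / T))
```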
Learning a probability distribution
Digression: estimation of a probability distribution Q by another distribution P.
D = deviation = Σ_{sample space} Q ln(Q/P)
D >= 0, which is a required property (just like sum square error >= 0).
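A two-line sketch of the deviation D (the Kullback-Leibler divergence), on assumed toy distributions:

```python
import math

# D = sum Q ln(Q/P): >= 0, and 0 only when P matches Q exactly.
def kl(Q, P):
    return sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)

Q = [0.5, 0.3, 0.2]
print(kl(Q, [0.5, 0.3, 0.2]))   # 0.0: perfect estimate
print(kl(Q, [0.4, 0.4, 0.2]))   # positive: P deviates from Q
```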
Recurrent n/w and optimization
Problem representation
What is common between Sentence Generation, Sorting and the Travelling Salesman Problem?

Sentence Generation: given a set of words, place them at appropriate positions in the sentence. Take an M × M matrix with words (wi) as rows and positions (pj) as columns; xij = 1 iff the ith word is in the jth position.

Sorting: given some numbers, place them at appropriate positions in the ordered list. Take an M × M matrix with numbers (ni) as rows and positions (pj) as columns; xij = 1 iff the ith number is in the jth position.

TSP: given the cities a traveller must visit, place the cities in the "tour" so that the total distance travelled is minimized. Take an M × M matrix with cities (ci) as rows and positions (pj) as columns; xij = 1 iff the ith city is in the jth position.
Hopfield Net for Optimization
An optimization problem maximizes or minimizes a quantity. The Hopfield net can be used for optimization: we consider the Hopfield net and the Traveling Salesman Problem, and the Hopfield net and the Job Scheduling Problem.
The essential idea of the correspondence
In optimization problems, we have to minimize a quantity.
The Hopfield net minimizes its energy. THIS IS THE CORRESPONDENCE.
Hopfield net and Traveling Salesman problem
We consider the problem for n = 4 cities. In the given figure, nodes represent cities and edges represent the paths between the cities, with associated distances.
[Fig.: four cities A, B, C, D with edges AB, BC, CD, DA and diagonals AC, BD, labelled with distances dAB, dBC, dCD, dDA, dAC, dBD]
Traveling Salesman Problem
Goal: starting from city A, visit cities j = 2 to n (n is the number of cities) exactly once each, come back to A, and minimize the total distance.
To solve this with a Hopfield net we need to decide the architecture: how many neurons? what are the weights?
Constraints decide the parameters
1. For n cities and n positions, establish city to position correspondence, i.e.
Number of neurons = n cities * n positions
2. Each position can take one and only one city
3. Each city can be in exactly one position
4. Total distance should be minimum
Architecture
An n × n matrix where rows denote cities and columns denote positions.
cell(i, α) = 1 if and only if the ith city is in the αth position.
Each cell is a neuron: n^2 neurons, O(n^4) connections.
Expressions corresponding to constraints

1. Each city in one and only one position, i.e. a row has a single 1:

E1 = (A/2) Σ_{i=1}^{n} Σ_{α=1}^{n} Σ_{β=1, β≠α}^{n} x_iα x_iβ

• The above equation partially ensures each row has a single 1 (it vanishes only when no row contains two 1s).
• x_iα is the output of the cell (i, α).
Expressions corresponding to constraints (contd.)

2. Each position has a single city, i.e. each column has at most a single 1:

E2 = (B/2) Σ_{α=1}^{n} Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} x_iα x_jα
Expressions corresponding to constraints (contd.)

3. All cities MUST be visited once and only once:

E3 = (C/2) [ (Σ_{i=1}^{n} Σ_{α=1}^{n} x_iα) - n ]^2
Expressions corresponding to constraints (contd.)
E1, E2, E3 together ensure that each row has exactly one 1 and each column has exactly one 1. Minimizing E1 + E2 + E3 thus ensures a Hamiltonian circuit on the city graph (finding one is an NP-complete problem).
Constraint of distance
4. The distance traversed should be minimum:

E4 = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{α=1}^{n} d_ij x_iα (x_j,α+1 + x_j,α-1)

where d_ij = distance between city i and city j (positions are cyclic, so α ± 1 is taken modulo n).
Expressions corresponding to constraints (contd.)
We equate the constraint energy to the network energy:
Eproblem = Enetwork    (*)
where Eproblem = E1 + E2 + E3 + E4, and Enetwork is the well-known energy expression for the Hopfield net. Find the weights from (*).
Finding weights for Hopfield Net applied to TSP
An alternate and more convenient Eproblem:
EP = E1 + E2
where E1 is the equation enforcing, for n cities, each city in one position and each position with one city, and E2 is the equation for distance.
Expressions for E1 and E2

E1 = (A/2) [ Σ_{i=1}^{n} (Σ_{α=1}^{n} x_iα - 1)^2 + Σ_{α=1}^{n} (Σ_{i=1}^{n} x_iα - 1)^2 ]

E2 = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{α=1}^{n} d_ij x_iα (x_j,α+1 + x_j,α-1)
Explanatory example

Fig. 1 shows a 3-city tour graph (cities 1, 2, 3); a tour can take place in two possible directions.

For the matrix below, x_iα = 1 if and only if the ith city is in position α:

        pos:  1    2    3
city 1:      x11  x12  x13
city 2:      x21  x22  x23
city 3:      x31  x32  x33
Kinds of weights

Row weights:
w11,12  w11,13  w12,13
w21,22  w21,23  w22,23
w31,32  w31,33  w32,33

Column weights:
w11,21  w11,31  w21,31
w12,22  w12,32  w22,32
w13,23  w13,33  w23,33
Cross weights:
w11,22  w11,23  w11,32  w11,33
w12,21  w12,23  w12,31  w12,33
w13,21  w13,22  w13,31  w13,32
w21,32  w21,33  w22,31  w22,33
w23,31  w23,32
Expressions

Eproblem = E1 + E2

E1 = (A/2) [ (x11 + x12 + x13 - 1)^2
           + (x21 + x22 + x23 - 1)^2
           + (x31 + x32 + x33 - 1)^2
           + (x11 + x21 + x31 - 1)^2
           + (x12 + x22 + x32 - 1)^2
           + (x13 + x23 + x33 - 1)^2 ]
Expressions (contd.)

E2 = (1/2) [ d12 x11 (x22 + x23) + d12 x12 (x21 + x23) + d12 x13 (x21 + x22)
           + d13 x11 (x32 + x33) + d13 x12 (x31 + x33) + d13 x13 (x31 + x32) + … ]
Enetwork

Enetwork = -[ w11,12 x11 x12 + w11,13 x11 x13 + w12,13 x12 x13
            + w11,21 x11 x21 + w11,22 x11 x22 + w11,23 x11 x23
            + w11,31 x11 x31 + w11,32 x11 x32 + w11,33 x11 x33 + … ]
Find row weight
To find w11,12 = -(coefficient of x11 x12 in Enetwork):
search for x11 x12 in Eproblem.
w11,12 = -A   …from E1; E2 cannot contribute.
Find column weight
To find w11,21 = -(coefficient of x11 x21 in Enetwork):
search for x11 x21 in Eproblem.
w11,21 = -A   …from E1; E2 cannot contribute.
Find cross weights
To find w11,22 = -(coefficient of x11 x22 in Enetwork):
search for x11 x22 in Eproblem; E1 cannot contribute.
The coefficient of x11 x22 in E2 is (d12 + d21) / 2.
Therefore, w11,22 = -(d12 + d21) / 2.
Find cross weights (contd.)
To find w11,33 = -(coefficient of x11 x33): search for x11 x33 in Eproblem.
w11,33 = -(d13 + d31) / 2
Summary
Row weights = -A
Column weights = -A
Cross weights between cell (i, α) and cell (j, β) = -((dij + dji) / 2), for β = α ± 1 (mod n)
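This summary can be turned into a weight-construction sketch; the helper `tsp_weights` and the toy distance matrix below are assumptions for illustration, not from the lecture:

```python
import numpy as np

# Build the Hopfield weight matrix for a 3-city TSP. Neuron (i, a) means
# "city i in tour position a". Pairs in the same row or column get weight -A;
# cross weights between adjacent tour positions get -(d_ij + d_ji) / 2.
def tsp_weights(d, A=1.0):
    n = len(d)
    W = np.zeros((n * n, n * n))
    idx = lambda i, a: i * n + a            # flatten (city, position) to one index
    for i in range(n):
        for a in range(n):
            for j in range(n):
                for b in range(n):
                    if (i, a) == (j, b):
                        continue            # no self connection (0-diagonal)
                    if i == j or a == b:    # same row / same column
                        W[idx(i, a), idx(j, b)] = -A
                    elif (b - a) % n in (1, n - 1):   # adjacent positions (cyclic)
                        W[idx(i, a), idx(j, b)] = -(d[i][j] + d[j][i]) / 2
    return W

d = [[0, 2, 9], [2, 0, 6], [9, 6, 0]]       # assumed toy distances
W = tsp_weights(d)
print(W.shape)                              # (9, 9): n^2 neurons
print(np.allclose(W, W.T))                  # symmetric, as a Hopfield net requires
```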
Restricted Boltzmann Machines (RBM)
Lecture 39, 6th Nov, 2014
Restricted Boltzmann Machine (Binary)

Two layers: VISIBLE units v1, v2, v3, …, vm and HIDDEN units h1, h2, h3, …, hn, fully connected across layers by bidirectional, symmetric weights:

W = | w11  w12  …  w1n |
    | w21  w22  …  w2n |
    |  .    .        .  |
    | wm1  wm2  …  wmn |

Bias 1 (visible units): B = <b1, b2, …, bm>
Bias 2 (hidden units):  C = <c1, c2, …, cn>
V = visible unit activations <v1, v2, …, vm>
H = hidden unit activations <h1, h2, …, hn>
Energy at a state
Energy E(H, V) at the state <H, V>:

E(H, V) = -V^T W H - B^T V - C^T H
        = -Σ_{i=1}^{m} Σ_{j=1}^{n} w_ij v_i h_j - Σ_{i=1}^{m} b_i v_i - Σ_{j=1}^{n} c_j h_j
Problem: Name Identification
Let a sentence be denoted by S = w0 w1 w2 … wn-1 wn.
Input: the sentence POS-tagged with Noun (= 1) or not Noun (= 0).

TajMahal is the most visited site in India
1 1 0 0 0 0 0 0   ('1' for an NE, 0 otherwise)

How do the neurons become 0 and 1?

P(h_j = 1 | V) = 1 / (1 + e^{-net_j}),  where net_j = Σ_{i=1}^{m} w_ji v_i + c_j

Similarly for the visible neurons, P(v_i = 1 | H) = 1 / (1 + e^{-net_i}), with net_i = Σ_{j=1}^{n} w_ij h_j + b_i.
Input: Delhi is in India    Output: 1 0 0 1

P(H, V) = e^{-E(H,V)} / Z,   Z = Σ_{H', V'} e^{-E(H', V')}

where E = energy and Z = partition function.

The probability of a state of the network is given by its energy; the probability of the state of a single neuron is given by the sigmoid. The weights and biases should be adjusted such that the desired <H, V> combination is stabilized.
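The two formulas above can be sketched together; the weights, biases, and the visible vector below are assumed toy values, not trained parameters:

```python
import numpy as np

# Binary RBM conditional P(h_j = 1 | V) = sigmoid(sum_i w_ji v_i + c_j),
# and the energy E(H, V) = -V^T W H - B^T V - C^T H.
rng = np.random.default_rng(0)
m, n = 4, 3                                  # visible, hidden unit counts
W = rng.normal(scale=0.1, size=(m, n))       # assumed random weights
b, c = np.zeros(m), np.zeros(n)              # visible / hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h):
    return -(v @ W @ h) - b @ v - c @ h

v = np.array([1.0, 0.0, 0.0, 1.0])           # e.g. the 1/0 tags of an input
p_h = sigmoid(v @ W + c)                     # P(h_j = 1 | V) for each hidden unit
h = (rng.random(n) < p_h).astype(float)      # sample the hidden layer
print(p_h, energy(v, h))
```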