1
The Hidden Vector State Language Model
Vidura Seneviratne, Steve Young
Cambridge University Engineering Department
2
Reference
• Young, S. J., "The Hidden Vector State language model", Tech. Report CUED/F-INFENG/TR.467, Cambridge University Engineering Department, 2003.
• He, Y. and Young, S. J., "Hidden Vector State Model for hierarchical semantic parsing", in Proc. ICASSP, Hong Kong, 2003.
• Fine, S., Singer, Y., and Tishby, N., "The Hierarchical Hidden Markov Model: Analysis and applications", Machine Learning 32(1): 41–62, 1998.
3
Outline
• Introduction
• HVS Model
• Experiments
• Conclusion
4
Introduction
• Language model:
  P(W) = P(w_1^M) = \prod_{k=1}^{M} P(w_k | W_1^{k-1})
  \approx \prod_{i=1}^{M} P(w_i | \Phi(W_1^{i-1})) \approx \prod_{i=1}^{M} P(w_i | W_{i-n+1}^{i-1})
• Issues: data sparseness, inability to capture long-distance dependencies, and inability to model nested structural information
• Class-based language model – POS tag information
• Structured language model – syntactic information
5
Hierarchical Hidden Markov Model
• An HHMM is a structured multi-level stochastic process.
  – Each state is itself an HHMM.
  – Internal state: a hidden state that does not emit observable symbols directly.
  – Production state: a leaf state that emits observations.
• The states of a flat HMM correspond to the production states of an HHMM.
6
HHMM (cont.)
• Notation:
  – \Sigma : finite alphabet
  – \Sigma^* : the set of all possible strings over \Sigma
  – O = o_1 o_2 ... o_T : observation sequence
  – q_i^d : state of an HHMM, where i is the state index and d = 1, ..., D is the hierarchy index
• Parameters of an HHMM:
  \lambda = \{ A^{q^d}, \Pi^{q^d}, B^{q^D} \}, d = 1, ..., D-1
  (transition, initial, and output distributions)
7
HHMM (cont.)
• Transition probability (horizontal):
  A^{q^d} = \{ a_{ij}^{q^d} \}, where a_{ij}^{q^d} = P(q_j^{d+1} | q_i^{d+1})
• Initial probability (vertical):
  \Pi^{q^d} = \{ \pi^{q^d}(q_i^{d+1}) \} = \{ P(q_i^{d+1} | q^d) \}
• Observation probability:
  B^{q^D} = \{ b^{q^D}(k) \} = \{ P(\sigma_k | q^D) \}
8
HHMM (cont.)
• Current node is the root:
  – Choose a child according to the initial probability.
• Child is a production state:
  – Produce an observation.
  – Transit within the same level.
  – When it reaches an end-state, control returns to the parent of the end-state.
• Child is an internal state:
  – Choose a child.
  – Wait until control is back from the children.
  – Transit within the same level.
  – When it reaches an end-state, control returns to the parent of the end-state.
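The top-down activation described above can be sketched as a small recursive sampler. This is only an illustration over a hypothetical two-level toy model (internal states `root` and `A`, production states `a1` and `a2`, with invented probabilities), not the parameterization used in the referenced papers:

```python
import random

# Toy two-level HHMM (hypothetical states and probabilities, for illustration):
INIT = {"root": {"A": 1.0}, "A": {"a1": 0.6, "a2": 0.4}}            # vertical
TRANS = {"root": {"A": {"end": 1.0}},
         "A": {"a1": {"a2": 0.5, "end": 0.5}, "a2": {"end": 1.0}}}  # horizontal
EMIT = {"a1": {"x": 0.7, "y": 0.3}, "a2": {"y": 1.0}}               # production

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(state, out):
    """Activate children of `state` until an end-state returns control."""
    child = sample(INIT[state])              # choose child by initial prob.
    while child != "end":
        if child in EMIT:                    # production state: emit a symbol
            out.append(sample(EMIT[child]))
        else:                                # internal state: recurse down
            generate(child, out)
        child = sample(TRANS[state][child])  # transit within the same level
    return out                               # end-state: back to parent
```

Calling `generate("root", [])` yields a short symbol sequence such as `['x', 'y']`; control always returns to the parent once a level reaches its end-state.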
9
HHMM (cont.)
10
HHMM (cont.)
• Other application: modelling stock trends (IDEAL 2004)
11
Hidden Vector State Model
12
Hidden Vector State Model (cont.)
The semantic information relating to any single word can be stored as a vector of semantic tag names
13
Hidden Vector State Model (cont.)
• If state transitions were unconstrained, the model would be a fully general HHMM.
• Transitions between states can be factored into a stack shift with two stages: pop, then push.
• The stack size is limited, and the number of new concepts pushed is limited to one, which makes the model more efficient.
14
Hidden Vector State Model (cont.)
• The joint probability is defined as:
  P(W, C, N) = \prod_{t=1}^{T} P(n_t | W_1^{t-1}, C_1^{t-1}) P(c_t[1] | W_1^{t-1}, C_1^{t-1}, n_t) P(w_t | W_1^{t-1}, C_1^t)
  where
  – N : sequence of stack pop operations n_t
  – W : word sequence
  – C : concept vector sequence
15
Hidden Vector State Model (cont.)
• Approximations (assumptions):
  P(n_t | W_1^{t-1}, C_1^{t-1}) \approx P(n_t | c_{t-1})
  P(c_t[1] | W_1^{t-1}, C_1^{t-1}, n_t) \approx P(c_t[1] | c_t[2...D_t])
  P(w_t | W_1^{t-1}, C_1^t) \approx P(w_t | c_t)
• So,
  P(W, C, N) \approx \prod_{t=1}^{T} P(n_t | c_{t-1}) P(c_t[1] | c_t[2...D_t]) P(w_t | c_t)
16
Hidden Vector State Model (cont.)
• The generative process associated with this constrained version of the HVS model consists of three steps at each position t:
  1. Choose a value for the number of pops n_t.
  2. Select the preterminal concept tag c_t[1].
  3. Select a word w_t.
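These three steps can be sketched as one position of a toy generator. The tables `POP`, `PUSH`, and `EMIT` below, and the concept labels in them, are invented for illustration (`SS` standing in for a sentence-start concept); they are not trained distributions:

```python
import random

# Hypothetical toy tables for the three-step generative process:
POP = {("FLIGHT", "SS"): {0: 0.5, 1: 0.5}}                    # P(n_t | c_{t-1})
PUSH = {("SS",): {"FLIGHT": 1.0},
        ("FLIGHT", "SS"): {"TOLOC": 1.0}}                     # P(c_t[1] | c_t[2..D])
EMIT = {("FLIGHT", "SS"): {"flights": 1.0},
        ("TOLOC", "FLIGHT", "SS"): {"to": 0.8, "into": 0.2}}  # P(w_t | c_t)

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def step(stack):
    """One position t: pop n_t labels, push one concept, emit one word."""
    n = sample(POP.get(tuple(stack), {0: 1.0}))   # step 1: choose n_t
    stack = stack[n:]
    c1 = sample(PUSH[tuple(stack)])               # step 2: pick c_t[1]
    stack = [c1] + stack
    word = sample(EMIT[tuple(stack)])             # step 3: pick w_t
    return stack, word
```

Starting from `["SS"]`, the first call deterministically pushes `FLIGHT` and emits `"flights"`; later steps may pop before pushing the next concept.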
17
Hidden Vector State Model (cont.)
• It is reasonable to ask an application designer to provide examples of utterances that would yield each type of semantic schema.
• It is not reasonable to require utterances with manually transcribed parse trees.
• Instead, assume abstract semantic annotations and the availability of a set of domain-specific lexical classes.
18
Hidden Vector State Model (cont.)
Abstract semantic annotations:
• Show me flights arriving in X at T.
• List flights arriving around T in X.
• Which flight reaches X before T.
= FLIGHT(TOLOC(CITY(X), TIME_RELATIVE(TIME(T))))
Class set:
• CITY: Boston, New York, Denver, …
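One way such lexical classes might be applied in preprocessing is a simple member-to-class substitution. The table and function below are an illustrative assumption in the spirit of the CITY example above, not the authors' actual tooling:

```python
# Hypothetical lexical class table in the style of the slide's CITY example.
CLASSES = {
    "CITY": {"Boston", "New York", "Denver"},
    "TIME": {"3 pm", "noon"},
}

def abstract(utterance):
    """Replace known class members with their class token."""
    for cls, members in CLASSES.items():
        # Longest members first, so "New York" wins over any shorter overlap.
        for m in sorted(members, key=len, reverse=True):
            utterance = utterance.replace(m, cls)
    return utterance

print(abstract("List flights arriving around noon in Denver"))
# List flights arriving around TIME in CITY
```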
19
Experiments
Experimental setup:
• Training set: ATIS-2, ATIS-3
• Test sets: ATIS-3 NOV93, DEC94
• Baseline: FST (Finite Semantic Tagger)
• Smoothing: Good-Turing (GT) for the FST, Witten-Bell for the HVS model
Example: "Show me flights from Boston to New York"
  Goal: FLIGHT
  Slots: FROMLOC.CITY = Boston
         TOLOC.CITY = New York
20
Experiments
F-measure: F = 2PR / (P + R)
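The balanced F-measure above combines slot precision P and recall R directly:

```python
def f_measure(p, r):
    """Balanced F-measure: F = 2PR / (P + R); returns 0 when P = R = 0."""
    return 2 * p * r / (p + r) if p + r else 0.0

print(f_measure(0.90, 0.80))   # ≈ 0.847
```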
21
Experiments
Dashed line: goal detection accuracy; solid line: F-measure
22
Conclusion
• The key features of the HVS model:
  – Its ability to represent hierarchical information in a constrained way.
  – Its capability to be trained directly from target semantics without explicit word-level annotation.
23
HVS Language Model
• The basic HVS model is a regular HMM in which each state encodes the history in a fixed-dimension, stack-like structure.
• Each state consists of a stack where each element is a label chosen from a finite set of cardinality M+1: C = {c_1, …, c_M, c#}.
• A depth-D HVS model state can be characterized by a vector of dimension D, with the most recently pushed element at index 1 and the oldest at index D.
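The fixed-depth vector state, and the pop-then-push stack shift that moves between such states, can be sketched in a few lines. The depth D = 4, the `#` padding marker for empty slots, and the concept labels are hypothetical choices for illustration:

```python
D = 4  # hypothetical fixed stack depth; "#" marks empty slots

def transition(state, n, c):
    """Apply one restricted HVS transition to a depth-D vector state:
    pop exactly n class labels, then push exactly one new label c.
    state[0] is the most recently pushed label, state[D-1] the oldest."""
    assert 0 <= n < D
    popped = state[n:] + ["#"] * n     # pop n labels off the top
    return ([c] + popped)[:D]          # push one new label, keep depth D

s = ["TOLOC", "FLIGHT", "SS", "#"]
print(transition(s, 1, "FROMLOC"))     # ['FROMLOC', 'FLIGHT', 'SS', '#']
```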
24
HVS Language Model (cont.)
25
HVS Language Model (cont.)
• Each HVS model state transition is restricted:
  (i) exactly n_t class labels are popped off the stack;
  (ii) exactly one new class label c_t is pushed onto the stack.
• The number of elements to pop, n_t, and the choice of the new class label to push, c_t, are determined by:
26
HVS Language Model (cont.)
P(s_t | s_{t-1}) = P(n_t | s_{t-1}) P(c_t | s_t^n)
where
• s_t^n : intermediate state, i.e. s_{t-1} with n_t class labels popped off
• s_t = [c_t, s_{t-1}(1 ... D - n_t)]
• s(i ... j) : stack access function such that s(i ... j) denotes the sequence of class labels from i to j of a stack of depth D, and s \equiv s(1 ... D)
27
HVS Language Model (cont.)
• n_t is conditioned on all the class labels in the stack at t-1, but c_t is conditioned only on the class labels that remain on the stack after the pop operation:
  P(s_t | s_{t-1}) = P(n_t | s_{t-1}) P(c_t | s_t^n)
• The former distribution can encode embedding, whereas the latter focuses on modeling long-range dependencies.
28
HVS Language Model (cont.)
• Joint probability:
  P(W, S) = \prod_{t=1}^{T} P(n_t | W_1^{t-1}, S_1^{t-1}) P(c_t | W_1^{t-1}, S_1^{t-1}, n_t) P(w_t | W_1^{t-1}, S_1^t)
• Assumption:
  P(W, S) \approx \prod_{t=1}^{T} P(n_t | s_{t-1}) P(c_t | s_{t-1}(1 ... D - n_t)) P(w_t | s_t)
29
HVS Language Model (cont.)
• Training: EM algorithm
  – C, N: latent data; W: observed data
• E-step:
  log P(W, C, N | \lambda) = log [ P(W | \lambda) P(C, N | W, \lambda) ]
  \Rightarrow log P(W | \lambda) = log P(W, C, N | \lambda) - log P(C, N | W, \lambda)
  Taking the expectation over C, N with respect to P(C, N | W, \hat{\lambda}):
  log P(W | \lambda) = \sum_{C,N} P(C, N | W, \hat{\lambda}) log P(W, C, N | \lambda) - \sum_{C,N} P(C, N | W, \hat{\lambda}) log P(C, N | W, \lambda)
30
HVS Language Model (cont.)
• M-step:
  – Q (auxiliary) function:
    Q(\lambda, \hat{\lambda}) = \sum_{C,N} P(C, N | W, \hat{\lambda}) log P(W, C, N | \lambda)
                              = \sum_{C,N} [ P(W, C, N | \hat{\lambda}) / P(W | \hat{\lambda}) ] log P(W, C, N | \lambda)
  – Substituting
    P(W, C, N | \lambda) = \prod_{t=1}^{T} P(n_t | s_{t-1}) P(c_t | s_{t-1}(1 ... D - n_t)) P(w_t | s_t)
31
HVS Language Model (cont.)
• Each probability distribution is re-estimated separately:
  \hat{P}(n | s) = \sum_t P(n_t = n, s_{t-1} = s | W, \hat{\lambda}) / \sum_t P(s_{t-1} = s | W, \hat{\lambda})
  \hat{P}(c | s^n) = \sum_t P(s_t = [c, s^n] | W, \hat{\lambda}) / \sum_t P(s_t^n = s^n | W, \hat{\lambda})
  \hat{P}(w | s) = \sum_t P(s_t = s, w_t = w | W, \hat{\lambda}) / \sum_t P(s_t = s | W, \hat{\lambda})
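Each of these M-step updates has the same shape: normalize expected (fractional) counts collected in the E-step. A generic sketch, where the layout of the count dictionary is an assumption for illustration:

```python
from collections import defaultdict

def reestimate(expected_counts):
    """Normalize E-step expected counts into conditional probabilities:
    P(event | context) = E[count(context, event)] / E[count(context)].
    `expected_counts` maps (context, event) -> expected fractional count."""
    totals = defaultdict(float)
    for (ctx, _event), cnt in expected_counts.items():
        totals[ctx] += cnt
    return {(ctx, ev): cnt / totals[ctx]
            for (ctx, ev), cnt in expected_counts.items()}

print(reestimate({("s1", 0): 3.0, ("s1", 1): 1.0})[("s1", 0)])  # 0.75
```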
32
HVS Language Model (cont.)
• If fully populated, the state space S contains |S| = M^D states; for M = 100+ and D = 3 to 4, this is very large.
• Due to data sparseness, backoff is needed:
  P_{BO}(w_t | s_t(1 ... d)) = B(s_t(1 ... d)) P_{BO}(w_t | s_t(1 ... d - 1))
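The backoff recursion can be sketched as follows; the data layout (a table of discounted probabilities for seen events, per-context backoff weights, and a unigram floor) is assumed for illustration:

```python
# Hypothetical layout: `seen` holds discounted probabilities for events
# observed in training; `bo_weight` holds B(s(1..d)) per context. Contexts
# are tuples (s(1), ..., s(d)), so dropping the last entry shortens the
# stack view from s(1..d) to s(1..d-1).
def p_backoff(w, ctx, seen, bo_weight, unigram):
    """P_BO(w | s(1..d)): use the discounted estimate if (ctx, w) was seen,
    otherwise back off to the shorter context s(1..d-1)."""
    if not ctx:
        return unigram.get(w, 1e-9)          # floor for unseen words
    if (ctx, w) in seen:
        return seen[(ctx, w)]
    return bo_weight[ctx] * p_backoff(w, ctx[:-1], seen, bo_weight, unigram)
```

For instance, an event unseen at depth 2 and depth 1 ends up scored as the product of both backoff weights and the unigram estimate.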
33
HVS Language Model (cont.)
• Backoff weight:
  B(s_t(1 ... d)) = [ 1 - \sum_{w \in W_1(d)} d_r P(w | s_t(1 ... d)) ] / [ 1 - \sum_{w \in W_1(d)} P_{BO}(w | s_t(1 ... d - 1)) ]
  where W_1(d) is the set of words observed in context s_t(1 ... d)
• Modified version of absolute discounting:
  d_r = (r - a) / r, with a = n_1 / (n_1 + 2 n_2)
  where n_1 and n_2 are the numbers of events occurring exactly once and twice, respectively
P w s
34
Experiments
• Training set: ATIS-3, 276K words, 23K sentences
• Development set: ATIS-3 Nov93
• Test set: ATIS-3 Dec94, 10K words, 1K sentences
• OOV words were removed
• k = 850
35
Experiments (cont.)
36
Experiments (cont.)
37
Conclusion
• The HVS language model is able to make better use of context than standard class n-gram models.
• The HVS model is trainable using EM.
38
Class tree for implementation
39
Iteration number vs. perplexity