1
The Hidden Vector State Language Model
Vidura Seneviratne, Steve Young
Cambridge University Engineering Department
2
Reference
• Young, S. J., "The Hidden Vector State language model", Tech. Report CUED/F-INFENG/TR.467, Cambridge University Engineering Department, 2003.
• He, Y. and Young, S. J., "Hidden Vector State Model for hierarchical semantic parsing", in Proc. ICASSP, Hong Kong, 2003.
• Fine, S., Singer, Y., and Tishby, N., "The Hierarchical Hidden Markov Model: Analysis and applications", Machine Learning 32(1): 41–62, 1998.
3
Outline
• Introduction
• HVS Model
• Experiments
• Conclusion
4
Introduction
• Language model:
  P(W) = P(w_1^M) = \prod_{k=1}^{M} P(w_k | W_1^{k-1})
  \approx \prod_{i=1}^{M} P(w_i | \Phi(W_1^{i-1})) \approx \prod_{i=1}^{M} P(w_i | W_{i-n+1}^{i-1})
• Issues: data sparseness, inability to capture long-distance dependencies, and inability to model nested structural information
• Class-based language model – POS tag information
• Structured language model – syntactic information
5
Hierarchical Hidden Markov Model
• An HHMM is a structured multi-level stochastic process.
  – Each state is itself an HHMM.
  – Internal state: a hidden state that does not emit observable symbols directly.
  – Production state: a leaf state that emits observations.
• The states of a flat HMM correspond to the production states of an HHMM.
6
HHMM (cont.)
• Notation:
  – \Sigma : finite alphabet
  – \Sigma^* : the set of all possible strings over \Sigma
  – O = o_1 o_2 ... o_T : observation sequence
  – q_i^d : state of an HHMM, where i is the state index and d = 1, ..., D is the hierarchy index
• Parameters of an HHMM:
  \lambda = \{ A^{q^d}, \Pi^{q^d}, B^{q^D} \}, d = 1, ..., D-1
  (transition, initial, and output distributions)
7
HHMM (cont.)
• Transition probability (horizontal):
  A^{q^d} = \{ a_{ij}^{q^d} \}, where a_{ij}^{q^d} = P(q_j^{d+1} | q_i^{d+1})
• Initial probability (vertical):
  \Pi^{q^d} = \{ \pi^{q^d}(q_i^{d+1}) \} = \{ P(q_i^{d+1} | q^d) \}
• Observation probability:
  B^{q^D} = \{ b^{q^D}(k) \} = \{ P(\sigma_k | q^D) \}
8
HHMM (cont.)
• Current node is the root:
  – Choose a child according to the initial probability.
• Child is a production state:
  – Produce an observation.
  – Transit within the same level.
  – When it reaches an end-state, control returns to the parent of the end-state.
• Child is an internal state:
  – Choose a child.
  – Wait until control is back from the children.
  – Transit within the same level.
  – When it reaches an end-state, control returns to the parent of the end-state.
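The top-down activation described above can be sketched as a small recursive sampler. This is only an illustration over a hypothetical two-level toy model (internal states `root` and `A`, production states `a1` and `a2`, with invented probabilities), not the parameterization used in the referenced papers:

```python
import random

# Toy two-level HHMM (hypothetical states and probabilities, for illustration):
INIT = {"root": {"A": 1.0}, "A": {"a1": 0.6, "a2": 0.4}}            # vertical
TRANS = {"root": {"A": {"end": 1.0}},
         "A": {"a1": {"a2": 0.5, "end": 0.5}, "a2": {"end": 1.0}}}  # horizontal
EMIT = {"a1": {"x": 0.7, "y": 0.3}, "a2": {"y": 1.0}}               # production

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(state, out):
    """Activate children of `state` until an end-state returns control."""
    child = sample(INIT[state])              # choose child by initial prob.
    while child != "end":
        if child in EMIT:                    # production state: emit a symbol
            out.append(sample(EMIT[child]))
        else:                                # internal state: recurse down
            generate(child, out)
        child = sample(TRANS[state][child])  # transit within the same level
    return out                               # end-state: back to parent
```

Calling `generate("root", [])` yields a short symbol sequence such as `['x', 'y']`; control always returns to the parent once a level reaches its end-state.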
9
HHMM (cont.)
10
HHMM (cont.)
• Other application: modelling stock trends (IDEAL 2004)
11
Hidden Vector State Model
12
Hidden Vector State Model (cont.)
The semantic information relating to any single word can be stored as a vector of semantic tag names
13
Hidden Vector State Model (cont.)
• If state transitions were unconstrained, the model would be a fully general HHMM.
• Transitions between states can be factored into a stack shift with two stages: pop, then push.
• The stack size is limited, and the number of new concepts pushed is limited to one, which makes the model more efficient.
14
Hidden Vector State Model (cont.)
• The joint probability is defined as:
  P(W, C, N) = \prod_{t=1}^{T} P(n_t | W_1^{t-1}, C_1^{t-1}) P(c_t[1] | W_1^{t-1}, C_1^{t-1}, n_t) P(w_t | W_1^{t-1}, C_1^t)
  where
  – N : sequence of stack pop operations n_t
  – W : word sequence
  – C : concept vector sequence
15
Hidden Vector State Model (cont.)
• Approximations (assumptions):
  P(n_t | W_1^{t-1}, C_1^{t-1}) \approx P(n_t | c_{t-1})
  P(c_t[1] | W_1^{t-1}, C_1^{t-1}, n_t) \approx P(c_t[1] | c_t[2...D_t])
  P(w_t | W_1^{t-1}, C_1^t) \approx P(w_t | c_t)
• So,
  P(W, C, N) \approx \prod_{t=1}^{T} P(n_t | c_{t-1}) P(c_t[1] | c_t[2...D_t]) P(w_t | c_t)
16
Hidden Vector State Model (cont.)
• The generative process associated with this constrained version of the HVS model consists of three steps at each position t:
  1. Choose a value for the number of pops n_t.
  2. Select the preterminal concept tag c_t[1].
  3. Select a word w_t.
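These three steps can be sketched as one position of a toy generator. The tables `POP`, `PUSH`, and `EMIT` below, and the concept labels in them, are invented for illustration (`SS` standing in for a sentence-start concept); they are not trained distributions:

```python
import random

# Hypothetical toy tables for the three-step generative process:
POP = {("FLIGHT", "SS"): {0: 0.5, 1: 0.5}}                    # P(n_t | c_{t-1})
PUSH = {("SS",): {"FLIGHT": 1.0},
        ("FLIGHT", "SS"): {"TOLOC": 1.0}}                     # P(c_t[1] | c_t[2..D])
EMIT = {("FLIGHT", "SS"): {"flights": 1.0},
        ("TOLOC", "FLIGHT", "SS"): {"to": 0.8, "into": 0.2}}  # P(w_t | c_t)

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def step(stack):
    """One position t: pop n_t labels, push one concept, emit one word."""
    n = sample(POP.get(tuple(stack), {0: 1.0}))   # step 1: choose n_t
    stack = stack[n:]
    c1 = sample(PUSH[tuple(stack)])               # step 2: pick c_t[1]
    stack = [c1] + stack
    word = sample(EMIT[tuple(stack)])             # step 3: pick w_t
    return stack, word
```

Starting from `["SS"]`, the first call deterministically pushes `FLIGHT` and emits `"flights"`; later steps may pop before pushing the next concept.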
17
Hidden Vector State Model (cont.)
• It is reasonable to ask an application designer to provide examples of utterances that would yield each type of semantic schema.
• It is not reasonable to require utterances with manually transcribed parse trees.
• Instead, assume abstract semantic annotations and the availability of a set of domain-specific lexical classes.
18
Hidden Vector State Model (cont.)
Abstract semantic annotations:
• Show me flights arriving in X at T.
• List flights arriving around T in X.
• Which flight reaches X before T.
= FLIGHT(TOLOC(CITY(X), TIME_RELATIVE(TIME(T))))
Class set:
• CITY: Boston, New York, Denver, …
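One way such lexical classes might be applied in preprocessing is a simple member-to-class substitution. The table and function below are an illustrative assumption in the spirit of the CITY example above, not the authors' actual tooling:

```python
# Hypothetical lexical class table in the style of the slide's CITY example.
CLASSES = {
    "CITY": {"Boston", "New York", "Denver"},
    "TIME": {"3 pm", "noon"},
}

def abstract(utterance):
    """Replace known class members with their class token."""
    for cls, members in CLASSES.items():
        # Longest members first, so "New York" wins over any shorter overlap.
        for m in sorted(members, key=len, reverse=True):
            utterance = utterance.replace(m, cls)
    return utterance

print(abstract("List flights arriving around noon in Denver"))
# List flights arriving around TIME in CITY
```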
19
Experiments
Experimental setup:
• Training set: ATIS-2, ATIS-3
• Test sets: ATIS-3 NOV93, DEC94
• Baseline: FST (Finite Semantic Tagger)
• Smoothing: Good-Turing (GT) for the FST, Witten-Bell for the HVS model
Example: "Show me flights from Boston to New York"
  Goal: FLIGHT
  Slots: FROMLOC.CITY = Boston
         TOLOC.CITY = New York
20
Experiments
F-measure: F = 2PR / (P + R)
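The balanced F-measure above combines slot precision P and recall R directly:

```python
def f_measure(p, r):
    """Balanced F-measure: F = 2PR / (P + R); returns 0 when P = R = 0."""
    return 2 * p * r / (p + r) if p + r else 0.0

print(f_measure(0.90, 0.80))   # ≈ 0.847
```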
21
Experiments
Dashed line: goal detection accuracy; solid line: F-measure
22
Conclusion
• The key features of the HVS model:
  – Its ability to represent hierarchical information in a constrained way.
  – Its capability to be trained directly from target semantics without explicit word-level annotation.
23
HVS Language Model
• The basic HVS model is a regular HMM in which each state encodes the history in a fixed-dimension, stack-like structure.
• Each state consists of a stack where each element is a label chosen from a finite set of cardinality M+1: C = {c_1, …, c_M, c#}.
• A depth-D HVS model state can be characterized by a vector of dimension D, with the most recently pushed element at index 1 and the oldest at index D.
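The fixed-depth vector state, and the pop-then-push stack shift that moves between such states, can be sketched in a few lines. The depth D = 4, the `#` padding marker for empty slots, and the concept labels are hypothetical choices for illustration:

```python
D = 4  # hypothetical fixed stack depth; "#" marks empty slots

def transition(state, n, c):
    """Apply one restricted HVS transition to a depth-D vector state:
    pop exactly n class labels, then push exactly one new label c.
    state[0] is the most recently pushed label, state[D-1] the oldest."""
    assert 0 <= n < D
    popped = state[n:] + ["#"] * n     # pop n labels off the top
    return ([c] + popped)[:D]          # push one new label, keep depth D

s = ["TOLOC", "FLIGHT", "SS", "#"]
print(transition(s, 1, "FROMLOC"))     # ['FROMLOC', 'FLIGHT', 'SS', '#']
```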
24
HVS Language Model (cont.)
25
HVS Language Model (cont.)
• Each HVS model state transition is restricted:
  (i) exactly n_t class labels are popped off the stack;
  (ii) exactly one new class label c_t is pushed onto the stack.
• The number of elements to pop, n_t, and the choice of the new class label to push, c_t, are determined by:
26
HVS Language Model (cont.)
P(s_t | s_{t-1}) = P(n_t | s_{t-1}) P(c_t | s_t^n)
where
• s_t^n : intermediate state, i.e. s_{t-1} with n_t class labels popped off
• s_t = [c_t, s_{t-1}(1 ... D - n_t)]
• s(i ... j) : stack access function such that s(i ... j) denotes the sequence of class labels from i to j of a stack of depth D, and s \equiv s(1 ... D)
27
HVS Language Model (cont.)
• n_t is conditioned on all the class labels in the stack at t-1, but c_t is conditioned only on the class labels that remain on the stack after the pop operation:
  P(s_t | s_{t-1}) = P(n_t | s_{t-1}) P(c_t | s_t^n)
• The former distribution can encode embedding, whereas the latter focuses on modeling long-range dependencies.
28
HVS Language Model (cont.)
• Joint probability:
  P(W, S) = \prod_{t=1}^{T} P(n_t | W_1^{t-1}, S_1^{t-1}) P(c_t | W_1^{t-1}, S_1^{t-1}, n_t) P(w_t | W_1^{t-1}, S_1^t)
• Assumption:
  P(W, S) \approx \prod_{t=1}^{T} P(n_t | s_{t-1}) P(c_t | s_{t-1}(1 ... D - n_t)) P(w_t | s_t)
29
HVS Language Model (cont.)
• Training: EM algorithm
  – C, N: latent data; W: observed data
• E-step:
  log P(W, C, N | \lambda) = log [ P(W | \lambda) P(C, N | W, \lambda) ]
  \Rightarrow log P(W | \lambda) = log P(W, C, N | \lambda) - log P(C, N | W, \lambda)
  Taking the expectation over C, N with respect to P(C, N | W, \hat{\lambda}):
  log P(W | \lambda) = \sum_{C,N} P(C, N | W, \hat{\lambda}) log P(W, C, N | \lambda) - \sum_{C,N} P(C, N | W, \hat{\lambda}) log P(C, N | W, \lambda)
30
HVS Language Model (cont.)
• M-step:
  – Q (auxiliary) function:
    Q(\lambda, \hat{\lambda}) = \sum_{C,N} P(C, N | W, \hat{\lambda}) log P(W, C, N | \lambda)
                              = \sum_{C,N} [ P(W, C, N | \hat{\lambda}) / P(W | \hat{\lambda}) ] log P(W, C, N | \lambda)
  – Substituting
    P(W, C, N | \lambda) = \prod_{t=1}^{T} P(n_t | s_{t-1}) P(c_t | s_{t-1}(1 ... D - n_t)) P(w_t | s_t)
31
HVS Language Model (cont.)
• Each probability distribution is re-estimated separately:
  \hat{P}(n | s) = \sum_t P(n_t = n, s_{t-1} = s | W, \hat{\lambda}) / \sum_t P(s_{t-1} = s | W, \hat{\lambda})
  \hat{P}(c | s^n) = \sum_t P(s_t = [c, s^n] | W, \hat{\lambda}) / \sum_t P(s_t^n = s^n | W, \hat{\lambda})
  \hat{P}(w | s) = \sum_t P(s_t = s, w_t = w | W, \hat{\lambda}) / \sum_t P(s_t = s | W, \hat{\lambda})
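Each of these M-step updates has the same shape: normalize expected (fractional) counts collected in the E-step. A generic sketch, where the layout of the count dictionary is an assumption for illustration:

```python
from collections import defaultdict

def reestimate(expected_counts):
    """Normalize E-step expected counts into conditional probabilities:
    P(event | context) = E[count(context, event)] / E[count(context)].
    `expected_counts` maps (context, event) -> expected fractional count."""
    totals = defaultdict(float)
    for (ctx, _event), cnt in expected_counts.items():
        totals[ctx] += cnt
    return {(ctx, ev): cnt / totals[ctx]
            for (ctx, ev), cnt in expected_counts.items()}

print(reestimate({("s1", 0): 3.0, ("s1", 1): 1.0})[("s1", 0)])  # 0.75
```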
32
HVS Language Model (cont.)
• If fully populated, the state space S contains |S| = M^D states; for M = 100+ and D = 3 to 4, this is very large.
• Due to data sparseness, backoff is needed:
  P_{BO}(w_t | s_t(1 ... d)) = B(s_t(1 ... d)) P_{BO}(w_t | s_t(1 ... d - 1))
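The backoff recursion can be sketched as follows; the data layout (a table of discounted probabilities for seen events, per-context backoff weights, and a unigram floor) is assumed for illustration:

```python
# Hypothetical layout: `seen` holds discounted probabilities for events
# observed in training; `bo_weight` holds B(s(1..d)) per context. Contexts
# are tuples (s(1), ..., s(d)), so dropping the last entry shortens the
# stack view from s(1..d) to s(1..d-1).
def p_backoff(w, ctx, seen, bo_weight, unigram):
    """P_BO(w | s(1..d)): use the discounted estimate if (ctx, w) was seen,
    otherwise back off to the shorter context s(1..d-1)."""
    if not ctx:
        return unigram.get(w, 1e-9)          # floor for unseen words
    if (ctx, w) in seen:
        return seen[(ctx, w)]
    return bo_weight[ctx] * p_backoff(w, ctx[:-1], seen, bo_weight, unigram)
```

For instance, an event unseen at depth 2 and depth 1 ends up scored as the product of both backoff weights and the unigram estimate.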
33
HVS Language Model (cont.)
• Backoff weight:
  B(s_t(1 ... d)) = [ 1 - \sum_{w \in W_1(d)} d_r P(w | s_t(1 ... d)) ] / [ 1 - \sum_{w \in W_1(d)} P_{BO}(w | s_t(1 ... d - 1)) ]
  where W_1(d) is the set of words observed in context s_t(1 ... d)
• Modified version of absolute discounting:
  d_r = (r - a) / r, with a = n_1 / (n_1 + 2 n_2)
  where n_1 and n_2 are the numbers of events occurring exactly once and twice, respectively
P w s
34
Experiments
• Training set: ATIS-3, 276K words, 23K sentences
• Development set: ATIS-3 Nov93
• Test set: ATIS-3 Dec94, 10K words, 1K sentences
• OOV words were removed
• k = 850
35
Experiments (cont.)
36
Experiments (cont.)
37
Conclusion
• The HVS language model is able to make better use of context than standard class n-gram models.
• The HVS model is trainable using EM.
38
Class tree for implementation
39
Iteration number vs. perplexity