
Modeling Speech using POMDPs

• In this work we apply a new model, the POMDP, in place of the traditional HMM to acoustically model the speech signal.

• We use state-of-the-art techniques to build and decode our new model.

• We demonstrate improved recognition results on a small data set.


Description of a POMDP

• A Markov Decision Process (MDP) is a mathematical formalization of problems in which a decision maker, an agent, must decide which actions to choose to maximize its expected reward as it interacts with its environment.

• MDPs have been used to model an agent's behavior in:
  – planning problems
  – robot navigation problems

• In a fully observable MDP, an agent always knows precisely what state it is in.


• If an agent cannot determine its state, its world is said to be partially observable.

• In such a situation we use a generalization of MDPs, called a Partially Observable Markov Decision Process (POMDP).

• POMDP vs. HMM (sketched below):
  – Differs from an HMM:
    • multiple transitions between two states, representing actions
    • a reward added to each state
  – As with an HMM:
    • you do not know which state you are in
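A minimal structural sketch of that comparison (illustrative only; the state, action, and observation labels are hypothetical, not the authors' data structures): where an HMM has a single set of transition probabilities, a discrete POMDP keys its transitions by the action taken and attaches a reward to each state.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Illustrative only: the state, action, and observation labels are hypothetical.

@dataclass
class HMM:
    trans: Dict[Tuple[str, str], float]        # (state, next_state) -> probability
    emit: Dict[Tuple[str, str], float]         # (state, observation) -> probability

@dataclass
class POMDP:
    # Transitions are keyed by the action taken, so two states can be linked by
    # several transitions, one per action.
    trans: Dict[Tuple[str, str, str], float]   # (state, action, next_state) -> prob
    emit: Dict[Tuple[str, str], float]         # (state, observation) -> probability
    reward: Dict[str, float] = field(default_factory=dict)   # reward per state

# In both models the true state is hidden from the agent; the POMDP adds
# action-labelled transitions and a reward attached to each state.
```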


POMDP in Speech

• As with HMMs:
  – left-to-right topology with 3 to 5 states
  – states represent pronunciation stages:
    • the beginning, middle, and end of a phoneme
    • observed acoustic features are associated with each state
  – randomness in state transitions still accounts for time stretching in the phoneme:
    • short, long, or hurried pronunciations (sketched below)
  – randomness in the observations still accounts for the variability in pronunciations
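To make the left-to-right topology concrete, here is a sketch of a 3-state phoneme model with made-up transition probabilities; the self-loops are what let a state absorb a variable number of frames.

```python
import numpy as np

# A 3-state left-to-right phoneme model (Beg -> Mid -> End).
# The probabilities are made up purely for illustration.
states = ["Beg", "Mid", "End"]

A = np.array([
    [0.6, 0.4, 0.0],   # Beg: self-loop or advance to Mid
    [0.0, 0.7, 0.3],   # Mid: self-loop or advance to End
    [0.0, 0.0, 1.0],   # End: exits to the next phoneme in a full decoder
])

# Self-loops let a state absorb a variable number of frames, which is how the
# model accounts for short, long, or hurried pronunciations of the phoneme.
assert np.allclose(A.sum(axis=1), 1.0)
```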


• Differs from HMMs:
  – In theory:
    • model all possible context classes (an infinite number)
    • model all contexts of a particular context class
  – In practice:
    • model three context classes:
      – triphone, biphone, monophone
    • model all contexts of a particular context class:
      – use the actions of our model to represent context (sketched below)

[Diagram: three-state left-to-right phoneme model with states Beg., Mid., End]
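One way to picture "use actions to represent context" (a hypothetical encoding, not the authors' exact one): each action on the shared three-state topology is tagged with a (context class, context) pair, so the same Beg/Mid/End states carry triphone, biphone, and monophone variants side by side.

```python
# Hypothetical encoding: each action on the shared 3-state topology is a
# (context_class, context) pair, shown here for the phoneme "ow" in "tomato".
actions = [
    ("triphone",  "t-ow+m"),   # left and right context
    ("biphone",   "ow+m"),     # right context only
    ("monophone", "ow"),       # no context
]

# transitions[(state, action)] -> {next_state: probability}; values illustrative.
transitions = {
    ("Beg", ("triphone", "t-ow+m")):  {"Beg": 0.6, "Mid": 0.4},
    ("Beg", ("biphone", "ow+m")):     {"Beg": 0.5, "Mid": 0.5},
    ("Beg", ("monophone", "ow")):     {"Beg": 0.7, "Mid": 0.3},
    # ...the Mid and End states carry the same set of action-labelled transitions
}
```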


Training a POMDP

• We train each context class independently on the same training data
  – each context model is treated as an HMM and trained using standard EM

• We then collect all context models for each phoneme over the different context classes and combine them into a single, unified POMDP model (sketched below)
  – we label each action with both the context and the context class that the particular HMM model belongs to
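A rough sketch of that combination step, assuming a simple dictionary layout (the container names and structure are assumptions for illustration, not the authors' code): each context model is first trained as an ordinary HMM with EM, then folded into one POMDP per phoneme, with every action labelled by its (context class, context).

```python
# Sketch of the combination step; the container layout and names are assumptions.
# `trained_hmms` maps (context_class, context) -> a trained HMM exposing the
# transition and emission parameters estimated by standard EM.

def build_pomdp_for_phoneme(phoneme, trained_hmms):
    pomdp = {"phoneme": phoneme, "actions": {}}
    for (context_class, context), hmm in trained_hmms.items():
        action = (context_class, context)      # label the action with both
        pomdp["actions"][action] = {
            "trans": hmm.trans,                # reuse the HMM transition probabilities
            "emit": hmm.emit,                  # reuse the HMM emission probabilities
            "reward": 0.0,                     # class/model weight attached at entry
        }
    return pomdp
```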


Decoding a POMDP

• We look at three decoding strategies based on Viterbi:
  – Uniform Mixed Model (UMM) Viterbi
  – Weighted Mixed Model (WMM) Viterbi
  – Cross-Context Mixed Model (CMM) Viterbi


UMM Viterbi

• From the Viterbi point of view:
  – Add all context classes to the mix and allow Viterbi to choose the best path through the entire search space.
    • Relax the context rules by matching up all partial-context phonemes (see the matching sketch below):
      – wild-card all monophones to match up with all biphones and triphones sharing the same center phone
      – wild-card all biphones to match up with triphones whose other context they share
  – Add a class weight, Wc, to each context class c:
    • applied to each model as we enter it

• From the POMDP point of view:
  – The model constrains actions:
    • add the constraint that we leave a state with the same action with which we entered it
      – this ensures the model's context, as in an HMM


  – Relax the constraint by allowing different context classes to be chosen within the model:
    • this differs from an HMM
  – The class weight is the reward given at the start state for entering the model.

• Viterbi expansion of "tomato", which has two spellings:
  – "t-ow-m-ey-t-ow"
  – "t-ow-m-aa-t-ow"

[Figure: (a) standard Viterbi expansion over the triphones t-ow+m, ow-m+ey, ow-m+aa; (b) UMM Viterbi expansion adding wild-carded biphones (ow+m, m+ey, m+aa) and monophones (ow, m)]
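The wild-carding rule behind that expansion can be pictured as a small compatibility test. This is a sketch, not the authors' code, and it assumes the usual "left-center+right" phone-label notation: a monophone matches any biphone or triphone with the same center phone, and a biphone matches any triphone that shares its available context.

```python
def parse(label):
    """Split 'l-c+r', 'c+r', or 'c' into (left, center, right); None where absent."""
    left, rest = label.split("-", 1) if "-" in label else (None, label)
    center, right = rest.split("+", 1) if "+" in rest else (rest, None)
    return left, center, right

def partial_context_match(lower, higher):
    """True if the lower-order model may wild-card onto the higher-order one."""
    l1, c1, r1 = parse(lower)
    l2, c2, r2 = parse(higher)
    if c1 != c2:
        return False                            # must share the center phone
    # Any context the lower-order model does specify must agree with the other model.
    return all(x is None or y is None or x == y for x, y in ((l1, l2), (r1, r2)))

# "ow" matches "ow+m" and "t-ow+m"; "ow+m" matches "t-ow+m" but not "ow-m+ey".
assert partial_context_match("ow", "t-ow+m")
assert partial_context_match("ow+m", "t-ow+m")
assert not partial_context_match("ow+m", "ow-m+ey")
```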


WMM Viterbi

• Similar to UMM Viterbi, except now we weigh each context model of each context class individually, based on the frequency count of its occurrence in the training data:

  w_c^m = L_c + min(f_c^m / K_c, 1) * (W_c - L_c)

  • f_c^m – frequency count for model m of context class c
  • L_c – lower bound for context class c
  • W_c – upper bound for context class c
  • K_c – frequency-count cutoff threshold for context class c
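Restated as code, the per-model weight interpolates between the class lower bound L_c and upper bound W_c according to how often the model was seen in training; this is a direct transcription of the formula above, with made-up example counts.

```python
def wmm_model_weight(f_cm, L_c, W_c, K_c):
    """w_c^m = L_c + min(f_c^m / K_c, 1) * (W_c - L_c)."""
    return L_c + min(f_cm / K_c, 1.0) * (W_c - L_c)

# Made-up counts; the L/W/K settings mirror the triphone column of the results
# table (Lc=0, Wc=5, Kc=90) purely as an example.
print(wmm_model_weight(f_cm=45, L_c=0, W_c=5, K_c=90))    # 2.5  (rare model)
print(wmm_model_weight(f_cm=200, L_c=0, W_c=5, K_c=90))   # 5.0  (frequent model)
```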


CMM Viterbi

• Similar to WMM Viterbi, except that our POMDP model now relaxes the constraint on actions:
  – allows cross-model jumps
  – jumps are weighted by the model weight w_c^m
  – the constraint is relaxed to a sub-class of context models, as follows:
    • models can jump between a triphone and the associated biphone and monophone whose partial context they share (sketched below)
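A sketch of that relaxed constraint, with hypothetical names and the "ow" example from earlier: a jump out of one context model is permitted only into a model in the same partial-context sub-class, and each jump is scored with the target's WMM weight w_c^m.

```python
# Hypothetical sketch of which cross-model jumps CMM Viterbi permits. Models that
# share a partial context form one sub-class, e.g. for "ow" followed by "m":
sub_class = {
    "triphone":  "t-ow+m",
    "biphone":   "ow+m",
    "monophone": "ow",
}

def allowed_jumps(current_class):
    """Jumps are allowed between the triphone and its associated biphone/monophone."""
    return {c: m for c, m in sub_class.items() if c != current_class}

# Each jump into model m of class c would be scored with the WMM weight w_c^m.
print(allowed_jumps("triphone"))   # {'biphone': 'ow+m', 'monophone': 'ow'}
```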


• Various strategies for relaxing the cross-model jump constraints:
  – Maximum cross context
    • for each cross-context model jump, add the weight to the likelihood score and choose the jump that yields the highest score
  – Expanded cross context
    • choose all context model jumps at every state, adding the weight to the likelihood score of each jump
  – Restricted form of both Maximum and Expanded (sketched below)
    • add the constraint that once we choose a lower-order context class model, we cannot go back to a higher-order context class model, only stay within our own class or a lower one
      – the idea is to abandon higher-order models that perform poorly

[Diagram: a sub-class of context models sharing partial context: t-ow+m, ow+m, ow]
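The restricted variants can be expressed as a simple ordering rule. This is a sketch; the triphone > biphone > monophone ranking is the natural ordering implied by the slides, and the exact scope of the restriction is not spelled out here.

```python
# Context classes ranked from highest to lowest order (an assumed encoding).
ORDER = {"triphone": 2, "biphone": 1, "monophone": 0}

def jump_allowed_restricted(current_class, target_class):
    """Restricted Maximum/Expanded: never jump back up to a higher-order class."""
    return ORDER[target_class] <= ORDER[current_class]

# Once a path has backed off to the biphone it may stay there or drop to the
# monophone, but it may not return to the triphone: the idea is to abandon
# higher-order models that are scoring poorly.
assert jump_allowed_restricted("biphone", "monophone")
assert not jump_allowed_restricted("biphone", "triphone")
```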


Experiments

• Tested our model on the TIMIT data set:
  – TIMIT – read English sentences
    • 45 phonemes, ~8000-word dictionary
    • 3 hours of training data: 3869 utterances by 387 speakers
    • 6 minutes of decoding data: 110 utterances by 11 speakers
      – independent of the training data
    • trigram language model built from the training data and an outside source (OGI: Stories and NatCell)


• Found the best system configuration for each corpus:
  – created 16-mixture SCTM models for each HMM context class using the ISIP prototype system (v5.10)
  – ran the baseline for all 3 HMM models

Baseline results:

  Corpus   Model        WER (%)   Accuracy (%)
  TIMIT    triphone     52.4      53.9
           biphone      53.1      51.3
           monophone    69.3      33.5


Results

• Results for all three modified Viterbi algorithms are similar to those on the development set.
• The POMDP model shows robustness to different test sets:
  – not tuned to the data

  Corpus   Viterbi   Lc/Wc/Kc (triphone)   Lc/Wc/Kc (biphone)   Lc/Wc/Kc (monophone)   WER (%)   Accuracy (%)
  TIMIT    UMM       -/5/-                 -/100/-              -/10/-                 51.2      54.0
           WMM       0/5/90                25/100/90            0/10/90                50.5      54.9
           CMM       0/5/90                45/100/90            0/10/90                48.8      55.3


Future Work

• Apply the new model to a larger data set.
• Find a better method to generate the individual context model weights:
  – linear interpolation and backoff techniques used in language modeling
• Find a better method for adjusting the overall POMDP model context class weights for the various decoding strategies:
  – the current method of experimentation is inefficient
• For CMM Viterbi, look for better ways to constrain cross-model jumps outside of partial context classes:
  – use linguistic information, similar to the technique used for tying mixtures at the state level