Combining Speech Attributes for Speech Recognition
Jeremy Morris
November 9, 2006

Overview
► Problem Statement (Motivation)
► Conditional Random Fields
► Experiments & Results
► Future Work

Problem Statement
► Developed as part of the ASAT Project
  ► Automatic Speech Attribute Transcription: a project to build tools to extract and parse speech attributes from a speech signal
► Goal: Develop a system for bottom-up speech recognition using 'speech attributes'

Speech Attributes?
► Any information that could be useful for recognizing the spoken language
  ► Phonetic attributes
    ► Consonants have manner, place of articulation, voicing
    ► Vowels have height, frontness, roundness, tenseness
  ► Speaker attributes (gender, age, etc.)
  ► Any other useful attributes that could be used for speech recognition
► Examples
  ► /d/: manner: stop; place of artic: dental; voicing: voiced
  ► /t/: manner: stop; place of artic: dental; voicing: unvoiced
  ► /iy/: height: high; frontness: front; roundness: nonround; tenseness: tense
  ► /ae/: height: low; frontness: front; roundness: nonround; tenseness: tense

Feature Combination
► Our piece of this project is to find ways to combine speech attributes and use them to recognize language
  ► Other groups are working on finding features to extract and methods of extracting them
► Note that there is no guarantee that attributes will be independent of each other
  ► In fact, many attributes will be strongly correlated with or dependent on other attributes
    ► e.g. voicing for vowels

Evidence Combination
► Two basic ways to build hypotheses
  ► Top Down
    ► Generate a hypothesis
    ► See if the data fits the hypothesis
  ► Bottom Up
    ► Examine the data
    ► Search for a hypothesis that fits
[Figure: hypothesis ("hyp") and data diagrams for the top-down and bottom-up approaches]

Top Down
► Traditional Automatic Speech Recognition (ASR) systems use a top-down approach
  ► Hypothesis is the phone we are predicting
  ► Data is some encoding of the acoustic speech signal
  ► A likelihood of the signal given the phone label is learned from data
  ► A prior probability for the phone label is learned from the data
  ► These are combined through Bayes' Rule to give us the posterior probability P(label | data)
[Figure: phone label /iy/ and observation X, annotated with the prior P(/iy/) and the likelihood P(X|/iy/)]
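
The combination step on the Top Down slide can be sketched in a few lines. This is only an illustration of Bayes' Rule: the function name `posterior` and all of the numbers are made up for the example and do not come from any ASR system.

```python
def posterior(priors, likelihoods):
    """Combine P(label) and P(X | label) into P(label | X) via Bayes' Rule."""
    # Joint probability P(X, label) = P(label) * P(X | label)
    joint = {label: priors[label] * likelihoods[label] for label in priors}
    # Evidence P(X) is the sum of the joint over all candidate labels
    evidence = sum(joint.values())
    return {label: p / evidence for label, p in joint.items()}


# Hypothetical values for a single acoustic observation X
priors = {"/iy/": 0.6, "/ae/": 0.4}        # P(label)
likelihoods = {"/iy/": 0.2, "/ae/": 0.1}   # P(X | label)
post = posterior(priors, likelihoods)      # P(label | X)
```

With these made-up numbers the posterior favors /iy/, because both its prior and its likelihood are higher than those of /ae/.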

Bottom Up
► Bottom-up models have the same high-level goal: determine the label from the observation
  ► But instead of a likelihood, the posterior probability P(label | data) is learned directly from the data
► Neural Networks can be used to learn probabilities in this manner
[Figure: observation X and phone label /iy/, annotated with the posterior P(/iy/|X)]
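
As a rough illustration of how a network can output P(label | X) directly, the sketch below applies a softmax to a set of output activations. The activations are invented for the example, and the slides do not say which output function the networks actually use, so treat the softmax here as an assumption.

```python
import math


def softmax(activations):
    """Turn raw output activations into a distribution over labels."""
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(activations.values())
    exps = {label: math.exp(a - peak) for label, a in activations.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical output-layer activations for one observation X
activations = {"/iy/": 2.0, "/ae/": 0.5, "/k/": -1.0}
p_label_given_x = softmax(activations)
```

No prior or likelihood appears anywhere: the model maps the observation straight to a posterior over labels.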

Speech is a Sequence
► Speech is not a single, independent event
  ► It is a combination of multiple events over time
► A model to recognize spoken language should take into account dependencies across time
[Figure: label sequence /k/ /k/ /iy/ /iy/ /iy/]

Speech is a Sequence
► A top-down model can be extended into a time sequence as a Hidden Markov Model (HMM)
  ► Now our likelihood of the data is over the entire sequence instead of a single phone
[Figure: label sequence /k/ /k/ /iy/ /iy/ /iy/, each label paired with an observation X]
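
To make the sequence-level likelihood concrete, here is a minimal sketch of the joint probability of a label sequence and its observations under a first-order HMM. The function name `hmm_joint` and every probability below are hypothetical; a real system would learn them from data.

```python
def hmm_joint(labels, obs_likelihoods, initial, transition):
    """P(X, labels) = P(y_1) P(x_1|y_1) * prod_t P(y_t|y_{t-1}) P(x_t|y_t)."""
    prob = initial[labels[0]] * obs_likelihoods[0][labels[0]]
    for t in range(1, len(labels)):
        prob *= transition[(labels[t - 1], labels[t])]
        prob *= obs_likelihoods[t][labels[t]]
    return prob


# Hypothetical model parameters
initial = {"/k/": 0.5, "/iy/": 0.5}
transition = {
    ("/k/", "/k/"): 0.5, ("/k/", "/iy/"): 0.5,
    ("/iy/", "/iy/"): 1.0, ("/iy/", "/k/"): 0.0,
}
# obs_likelihoods[t][label] stands in for P(x_t | label) at frame t
obs_likelihoods = [
    {"/k/": 0.8, "/iy/": 0.1},
    {"/k/": 0.6, "/iy/": 0.2},
    {"/k/": 0.1, "/iy/": 0.9},
]
p = hmm_joint(["/k/", "/k/", "/iy/"], obs_likelihoods, initial, transition)
```

The transition terms are what tie neighboring frames together; without them the product would treat each frame as an independent event.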

Conditional Random Fields
► A form of discriminative modelling
  ► Has been used successfully in various domains such as part-of-speech tagging and other Natural Language Processing tasks
► Processes evidence bottom-up
  ► Combines multiple features of the data
  ► Builds the probability P(sequence | data)

Conditional Random Fields
► Conceptual Overview
  ► Each attribute of the data we are trying to model fits into a feature function that associates the attribute and a possible label
    ► A positive value if the attribute appears in the data
    ► A zero value if the attribute is not in the data
  ► Each feature function carries a weight that gives the strength of that feature function for the proposed label
    ► High positive weights indicate a good association between the feature and the proposed label
    ► High negative weights indicate a negative association between the feature and the proposed label
    ► Weights close to zero indicate the feature has little or no impact on the identity of the label

Conditional Random Fields
► CRFs have transition feature functions and state feature functions
  ► Transition functions capture associations with transitions from one label to another
  ► State functions help determine the identity of the state
[Figure: label sequence /k/ /k/ /iy/ /iy/ /iy/, each label paired with an observation X]

Conditional Random Fields
$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_t \left( \sum_i \lambda_i f_i(x, y_t) + \sum_j \mu_j g_j(x, y_t, y_{t-1}) \right) \right)$$
► State Feature Function f_i
  ► Association of an attribute with a phone label
  ► e.g. f(P(stop), /k/)
► State Feature Weight λ_i
  ► Indicates the strength of the association of this attribute with this label
► Transition Feature Function g_j
  ► Association of an attribute with a phone-to-phone transition
  ► e.g. g(attr, /iy/, /k/)
► Transition Feature Weight μ_j
  ► Indicates the strength of the association of this attribute with this transition
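
The formula on this slide can be evaluated directly for a toy problem. The sketch below is not the toolkit's implementation: it assumes a two-label inventory, uses only 0/1 bias transition features, and computes the normalizer Z(x) by brute-force enumeration of every label sequence, which is only feasible for very short inputs. All weights and attribute values are invented.

```python
import itertools
import math

LABELS = ["/k/", "/iy/"]  # hypothetical two-phone label inventory


def sequence_score(x, y, state_w, trans_w):
    """Weighted sum of state and transition feature functions over time."""
    score = 0.0
    for t, label in enumerate(y):
        # State features: attribute value times the (attribute, label) weight
        for attr, value in x[t].items():
            score += state_w.get((attr, label), 0.0) * value
        # Transition features: a 0/1 bias on the (previous, current) pair
        if t > 0:
            score += trans_w.get((y[t - 1], label), 0.0)
    return score


def crf_probability(x, y, state_w, trans_w):
    """P(y | x) = exp(score(x, y)) / Z(x), with Z(x) summed over all sequences."""
    z = sum(
        math.exp(sequence_score(x, list(cand), state_w, trans_w))
        for cand in itertools.product(LABELS, repeat=len(x))
    )
    return math.exp(sequence_score(x, y, state_w, trans_w)) / z


# Hypothetical attribute detector outputs for two frames
x = [{"stop": 0.9, "vowel": 0.1}, {"stop": 0.2, "vowel": 0.8}]
state_w = {("stop", "/k/"): 2.0, ("vowel", "/iy/"): 2.0,
           ("stop", "/iy/"): -1.0, ("vowel", "/k/"): -1.0}
trans_w = {("/k/", "/iy/"): 0.5, ("/iy/", "/k/"): -0.5}
p_best = crf_probability(x, ["/k/", "/iy/"], state_w, trans_w)
```

Practical CRF implementations compute Z(x) with a forward-style dynamic program rather than enumeration; the brute-force sum is used here only because it mirrors the formula term by term.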

Experiments
► Goal: Implement a Conditional Random Field model on speech attribute data
  ► Perform phone recognition
  ► Compare results to those obtained via a Tandem system
► Experimental Data
  ► TIMIT read speech corpus
  ► Moderate-sized corpus of clean, prompted speech, complete with phonetic-level transcriptions

Attribute Selection
► Attribute Detectors
  ► Built using ICSI QuickNet Neural Network software
► Two different types of attributes
  ► Phonological feature detectors
    ► Place, Manner, Voicing, Vowel Height, Backness, etc.
    ► Features are grouped into eight classes, with each class having a variable number of possible values based on the IPA phonetic chart
  ► Phone detectors
    ► Neural network outputs based on the phone labels, one output per label
  ► Classifiers were trained on 2960 utterances from the TIMIT training set
    ► Uses extracted 12th-order PLP coefficients (i.e. frequency coefficients) in a 9-frame window as inputs to the neural networks

Experimental Setup
► Code built on the Java CRF toolkit on SourceForge
  ► http://crf.sourceforge.net
  ► Performs training to maximize the log-likelihood of the training set with respect to the model
    ► Does this via gradient descent: find the place where the gradient of the log-likelihood function goes to zero
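
The idea of following the gradient of the log-likelihood can be shown on a deliberately simplified model. The sketch below trains a per-frame log-linear classifier (a CRF with state features only and no transitions) by stepping the weights along the gradient, which for this model is the observed feature value minus its expected value under the current posteriors. It is not the Java toolkit's optimizer; the data, learning rate, and step count are all assumptions chosen for the example.

```python
import math

LABELS = ["/t/", "/ae/"]  # hypothetical label inventory


def posteriors(weights, feats):
    """P(label | frame) under a log-linear model with state features only."""
    scores = {y: sum(weights[(a, y)] * v for a, v in feats.items()) for y in LABELS}
    peak = max(scores.values())
    exps = {y: math.exp(s - peak) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def log_likelihood(weights, data):
    return sum(math.log(posteriors(weights, feats)[gold]) for feats, gold in data)


def train(data, attrs, steps=200, lr=0.5):
    weights = {(a, y): 0.0 for a in attrs for y in LABELS}
    for _ in range(steps):
        grad = {k: 0.0 for k in weights}
        for feats, gold in data:
            post = posteriors(weights, feats)
            for a, v in feats.items():
                for y in LABELS:
                    # Observed feature value minus its expected value
                    grad[(a, y)] += v * ((1.0 if y == gold else 0.0) - post[y])
        for k in weights:
            weights[k] += lr * grad[k]  # step uphill on the log-likelihood
    return weights


# Two hypothetical training frames: (attribute detector outputs, gold label)
data = [({"stop": 0.9, "vowel": 0.1}, "/t/"),
        ({"stop": 0.1, "vowel": 0.9}, "/ae/")]
weights = train(data, ["stop", "vowel"])
```

Because this toy data is perfectly separable and there is no regularization, the gradient never reaches exactly zero; the loop simply stops after a fixed number of steps.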

Experimental Setup
► Outputs from the neural nets are themselves treated as feature functions for the observed sequence
  ► Each attribute/label combination gives us a value for one feature function
    ► We also use a bias feature for each label
  ► Currently, all combinations of features and labels are used as feature functions
    ► e.g. f(P(stop), /t/), f(P(stop), /ae/), etc.
  ► Phone class features are used in the same manner
    ► e.g. f(P(/t/), /t/), f(P(/t/), /ae/), etc.
  ► Transition features use only a 0/1 bias feature
    ► 1 if the transition occurs at that timeframe in the training set
    ► 0 if the transition does not occur at that timeframe in the training set
► For comparison, we use a baseline HMM-trained system that takes decorrelated features as inputs
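
One way to picture the feature functions described on this slide is as a mapping keyed by (attribute, label) pairs. The sketch below is an assumed representation, not the toolkit's data structures; the function names and the detector outputs are hypothetical.

```python
def active_state_features(nn_outputs, proposed_label):
    """State feature values f(attribute, label) that fire for one proposed label."""
    feats = {("bias", proposed_label): 1.0}  # bias feature for the label
    for attr, value in nn_outputs.items():
        # The neural net output itself is the value of the feature function
        feats[(attr, proposed_label)] = value
    return feats


def transition_feature(prev_label, label):
    """0/1 bias feature that fires for a specific label-to-label transition."""
    return {("trans", prev_label, label): 1.0}


def all_state_feature_names(attrs, labels):
    """Every attribute/label combination, plus one bias feature per label."""
    names = [("bias", y) for y in labels]
    names += [(a, y) for a in attrs for y in labels]
    return names


# Hypothetical detector outputs for one frame
frame = {"P(stop)": 0.9, "P(voiced)": 0.2}
feats_t = active_state_features(frame, "/t/")
names = all_state_feature_names(list(frame), ["/t/", "/ae/"])
```

Using all combinations means the number of state feature functions is (number of attributes + 1) times the number of labels, so the weights decide which pairings matter.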

Initial Results
Model                        Label Space   Phone Recognition Accuracy
HMM (phones)                 triphones     67.32%
CRF (phones)                 monophones    67.27%
HMM (features)               triphones     66.69%
CRF (features)               monophones    65.25%
HMM (phones/feas) (top 39)   triphones     67.96%
CRF (phones/feas)            monophones    68.00%

Experimental Setup
► Initial CRF experiments show results comparable to triphone HMM results with only monophone labelling
  ► No decorrelation of features needed
  ► No assumptions about feature independence
► Comparison to HMM crippled in one way:
  ► HMM training allowed for shifting of phone boundaries during training
  ► CRF training used fixed phone boundaries for all training
► Another experiment: train the CRF, realign training labels, then retrain on realigned labels
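
The slides do not say how the realignment was carried out. One plausible reading is a forced alignment: keep the phone sequence of each utterance fixed and let the trained model choose where the boundaries fall. The sketch below does this with a small dynamic program over per-frame label scores; the function name `realign`, the use of state scores only, and the numbers are all assumptions.

```python
def realign(frame_scores, phones):
    """Best monotonic assignment of frames to a fixed phone sequence."""
    T, N = len(frame_scores), len(phones)
    if T < N:
        raise ValueError("need at least one frame per phone")
    neg_inf = float("-inf")
    best = [[neg_inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    best[0][0] = frame_scores[0][phones[0]]
    for t in range(1, T):
        for n in range(N):
            stay = best[t - 1][n]                        # remain in the same phone
            move = best[t - 1][n - 1] if n > 0 else neg_inf  # advance to next phone
            if move > stay:
                best[t][n], back[t][n] = move, n - 1
            else:
                best[t][n], back[t][n] = stay, n
            if best[t][n] > neg_inf:
                best[t][n] += frame_scores[t][phones[n]]
    # Trace back from the last phone at the last frame
    n = N - 1
    path = [n]
    for t in range(T - 1, 0, -1):
        n = back[t][n]
        path.append(n)
    path.reverse()
    return [phones[i] for i in path]


# Hypothetical per-frame log scores from a trained model
scores = [
    {"/k/": -0.1, "/iy/": -2.0},
    {"/k/": -0.2, "/iy/": -1.5},
    {"/k/": -1.8, "/iy/": -0.3},
    {"/k/": -2.0, "/iy/": -0.1},
    {"/k/": -2.2, "/iy/": -0.1},
]
new_labels = realign(scores, ["/k/", "/iy/"])
```

The realigned frame labels would then replace the original boundaries for a second round of training.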

Realignment Results
Model                    Label Space   Phone Recognition Accuracy
HMM (phones)             triphones     67.32%
CRF (phones) base        monophones    67.27%
CRF (phones) realign     monophones    69.63%
HMM (features)           triphones     66.69%
CRF (features) base      monophones    65.25%
CRF (features) realign   monophones    67.52%

Experimental Setup
► CRFs can also make use of features on the transitions
  ► For the initial experiments, transition feature functions only used bias features (e.g. 1 or 0 based on label in the training corpus)
► What if the phone classifications were used as the state features, and the feature classes were used as transition features?
  ► Linguistic observation: feature spreading from phone to phone

Realignment Results
Model                     Label Space   Phone Recognition Accuracy
CRF (phones) base         monophones    67.27%
CRF (phones) realign      monophones    69.63%
CRF (features) base       monophones    65.25%
CRF (features) realign    monophones    67.52%
CRF (p+f) base            monophones    68.00%
CRF (p + trans f) base    monophones    69.49%
CRF (p + trans f) align   monophones    70.86%

Discussion & Future Work
► This seems to be a good model for the type of feature combination we want to perform
  ► Makes use of arbitrary, possibly correlated features
  ► Results on phone recognition task comparable or superior to the alternative sequence model (HMM)
► Future Work
  ► New features
    ► What kinds of features can we add to improve our transitions?
    ► We hope to get more from the other research groups
  ► New training methods
    ► Faster algorithms than the gradient descent method exist and need to be tested
  ► Word recognition
    ► We are thinking about how to model word recognition in this framework
  ► Larger corpora
    ► TIMIT is a comparatively small corpus; we are looking to move to something bigger