Using Conversational Word Bursts in Spoken Term Detection
Justin Chiu, Language Technologies Institute
Presented at the University of Cambridge, September 6, 2013
Introduction
• Spoken Term Detection (STD): detecting word or phrase targets in conversational speech
• Queries: text
• Targets: audio
• The BABEL program: rapid development of keyword search for novel languages, with limited resources
• This research: using (ideally universal) properties of conversation to enhance performance
Current Approach
1. ASR produces candidates (as a confusion network)
2. Search on the ASR result identifies targets
– Single-word query: posterior probability of the query word in the confusion network
– Multi-word query: product of the posterior probabilities of the component words in the confusion network
3. The detection result (YES/NO) is decided by comparing the search result probability to a threshold
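The scoring step above can be sketched as follows. This is an illustrative sketch, not the BABEL system's code: the slot representation (a dict from candidate word to posterior per confusion-network bin) and the name `score_query` are assumptions.

```python
# Illustrative sketch of step 2 (not the BABEL system's code): scoring a
# query against a confusion network represented as a list of slots, each
# a dict mapping candidate words to posterior probabilities.

def score_query(query_words, slots):
    """Best posterior product for the query over contiguous slot spans;
    a single-word query reduces to the word's best posterior."""
    best = 0.0
    n = len(query_words)
    for start in range(len(slots) - n + 1):
        prob = 1.0
        for word, slot in zip(query_words, slots[start:start + n]):
            prob *= slot.get(word, 0.0)  # posterior of word in this slot
        best = max(best, prob)
    return best

slots = [
    {"magkano": 0.6, "magka": 0.3},
    {"ba": 0.7, "na": 0.2},
]
print(score_query(["magkano"], slots))        # 0.6
print(score_query(["magkano", "ba"], slots))  # product 0.6 * 0.7
```

The YES/NO decision of step 3 then compares this score to the detection threshold.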
Challenge for Current Approach
• Sensitive to ASR performance
• High-performance ASR requires extensive training resources; limited resources yield low-accuracy systems
• In the BABEL LimitedLP condition, only 10 hours of training data are provided
Intuition
• Can we use other sources of information to enhance STD performance even with poor ASR results?
• We specifically focus on conversational structure, since we believe it is language-independent
Word Burst
• Word burst refers to a phenomenon in conversational speech in which particular content words tend to occur in close proximity to each other, as a byproduct of some topic under discussion
• Content word: a word that does not occur with high frequency
[Figure: occurrence pattern for the Tagalog word “magkano” across conversations (x-axis: 0–600 seconds); clustered occurrences are marked as Word Bursts, isolated occurrences as No Word Burst]
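The burst pattern illustrated above can be detected with a simple window test. The following is a minimal sketch; the function name and the 10-second default are illustrative, not from the talk.

```python
# Illustrative sketch: flag which occurrences of a word form a Word Burst,
# i.e. have another occurrence of the same word within a fixed time window
# in the same conversation.

def in_burst(times, window=10.0):
    """times: occurrence times (seconds) of one word in one conversation.
    Returns a parallel list of booleans: True if another occurrence of the
    same word lies within `window` seconds."""
    times = sorted(times)
    flags = []
    for i, t in enumerate(times):
        near = (i > 0 and t - times[i - 1] <= window) or \
               (i + 1 < len(times) and times[i + 1] - t <= window)
        flags.append(near)
    return flags

print(in_burst([30.0, 34.5, 120.0]))  # [True, True, False]
```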
Word Burst rescoring on the Confusion Network
• Give a bonus to the posterior probability of a content word that is in a Word Burst
• Penalize the posterior probability of a content word that is not in a Word Burst
• Goal: improve the quality of the confusion network for better search performance
Word Burst Rescoring Parameters
• Window size: the time region used to decide whether a Word Burst exists
• Stop word %: the top X% most frequent words in the training data, assumed not to be content words
• Penalty: applied to non-burst words, i.e. words that do not appear in a Word Burst
• All parameters are trained on language pack (development) data
Word Burst Rescoring Formula
q(x_i) = p(x_i) · b(x_i)          if ∃ x_j ∈ window(x_i), j ≠ i
q(x_i) = p(x_i) · penalty(L)      if no x_j (j ≠ i) falls inside window(x_i)

b(x_i) = Σ_{j≠i} [ d(x_i, x_j) / Σ_{k≠i} d(x_i, x_k) ] · p(x_j)

d(x_i, x_j) = 1 − dis(x_i, x_j) / windowsize,   j ≠ i

where p(x) is the posterior probability of candidate x over time and dis(x_i, x_j) is the time distance between the two occurrences.
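A minimal sketch of the rescoring rule, assuming b(x_i) is the distance-weighted average of neighbour posteriors (one possible reading of the slide's formula); all names here are illustrative, not from the BABEL system.

```python
# Sketch of Word Burst rescoring, assuming b(x_i) is a distance-weighted
# average of neighbour posteriors; p = posterior, proximity = d(x_i, x_j),
# window_size in seconds. Names and defaults are illustrative.

def proximity(t_i, t_j, window_size):
    # d(x_i, x_j) = 1 - dis(x_i, x_j) / windowsize
    return 1.0 - abs(t_i - t_j) / window_size

def rescore(candidates, window_size=10.0, penalty=0.1):
    """candidates: list of (time, posterior) for one content word in one
    conversation. Returns the rescored posteriors q(x_i)."""
    out = []
    for i, (t_i, p_i) in enumerate(candidates):
        neigh = [(t_j, p_j) for j, (t_j, p_j) in enumerate(candidates)
                 if j != i and abs(t_i - t_j) <= window_size]
        if neigh:
            num = sum(proximity(t_i, t_j, window_size) * p_j
                      for t_j, p_j in neigh)
            den = sum(proximity(t_i, t_j, window_size) for t_j, _ in neigh)
            out.append(p_i * (num / den))   # burst branch: apply bonus b(x_i)
        else:
            out.append(p_i * penalty)       # isolated branch: apply penalty
    return out

# Two nearby candidates reinforce each other; the isolated one is penalized.
print(rescore([(0.0, 0.5), (5.0, 0.8), (300.0, 0.6)]))
```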
Datasets
• 4 languages, 2 training-data sizes (80 hours FullLP, 10 hours LimitedLP)
• An additional 10 hours of data in each language for 5-fold cross-validation
Language    Setup      Lexicon Size
Cantonese   FullLP     18769
Cantonese   LimitedLP  5112
Pashto      FullLP     17904
Pashto      LimitedLP  6219
Tagalog     FullLP     21098
Tagalog     LimitedLP  5565
Turkish     FullLP     38849
Turkish     LimitedLP  10173
ATWV
• Actual Term Weighted Value
• In BABEL, the cost/value ratio C/V is 0.1 and the term prior Pr_term is 10⁻⁴
• Having a correct detection is much more valuable than reducing a false alarm

TWV(θ) = 1 − mean over terms { P_Miss(term, θ) + β · P_FA(term, θ) }

β = (C/V) · (1/Pr_term − 1)
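The TWV formula can be sketched as follows. The function name and argument layout are illustrative; the 0.1 cost/value ratio and 10⁻⁴ term prior are the BABEL settings quoted above.

```python
# Sketch of Term Weighted Value; cost_ratio (C/V) and p_term (Pr_term)
# default to the BABEL settings. Names are illustrative.

def twv(miss_rates, fa_rates, cost_ratio=0.1, p_term=1e-4):
    """miss_rates / fa_rates: per-term P_Miss and P_FA at one threshold."""
    beta = cost_ratio * (1.0 / p_term - 1.0)   # ~999.9 under BABEL settings
    losses = [pm + beta * pfa for pm, pfa in zip(miss_rates, fa_rates)]
    return 1.0 - sum(losses) / len(losses)

# A tiny per-term false alarm rate costs about as much as a 10% miss rate:
print(round(twv([0.3, 0.5], [0.0, 1e-4]), 3))  # 0.55
```

The large β is why a correct detection is worth far more than suppressing one extra false alarm.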
Comparison between 10/80 hours of training data
• In the condition with limited data, the introduction of additional information can compensate for the lack of training data
Language    Setup      Baseline  Rescore  Change
Cantonese   FullLP     0.322     0.320    -0.7%
Cantonese   LimitedLP  0.103     0.109    +6%
Pashto      FullLP     0.214     0.215    +0.5%
Pashto      LimitedLP  0.095     0.114    +19%
Tagalog     FullLP     0.358     0.358    -0.3%
Tagalog     LimitedLP  0.130     0.144    +11%
Turkish     FullLP     0.385     0.385    +0.1%
Turkish     LimitedLP  0.262     0.265    +1%
Tuned LimitedLP(10h) performance
• Rescoring significantly reduces the False Alarm rate of the detection result, indicating that penalizing isolated content words benefits STD performance

Language    Baseline  Rescore  ΔATWV  ΔFA
Cantonese   0.114     0.118    +4%    -21%
Pashto      0.073     0.094    +29%   -32%
Tagalog     0.143     0.159    +11%   -21%
Turkish     0.241     0.245    +2%    +4%
Best Parameters for LimitedLP(10h)
• Apart from Turkish, parameter values are similar across languages.
• Words do not repeat in the same form in Turkish, an agglutinative language.
Language    Window (sec)  Stop (%)  Penalty
Cantonese   12            4         0.1
Pashto      18            1         0.05
Tagalog     11            2         0.15
Turkish     9             8         1
Analysis/Discussion
1. What is the impact of Word Burst on WER?
2. How often does a Word Burst occur?
3. Issues in Turkish/Vietnamese?
4. How does Word Burst differ from cache-based/trigger-based language models?
WER
• Word Burst does not provide significant improvement in WER
• Rescoring instead helps a candidate's probability cross the detection threshold
How often do Word Bursts occur?
• Query/content word burst percentage (10-second window)
• More than 35% of content/query words occur as part of a Word Burst, indicating its generality
• Query and content word burst percentages are similar, except for Turkish

Language  Cantonese  Pashto  Tagalog  Turkish
Content   43.6       35.7    40.7     35.4
Query     41.2       35.3    36.8     27.7
Issues in Turkish/Vietnamese
• Turkish words re-occur in different surface forms, so we cannot count on a Word Burst of the same word form
• A tool for morphological normalization might make it possible to find Word Bursts of the same underlying word
• Vietnamese performance is in the same range as Cantonese
• Some of the evaluation queries are syllables, not words, for which a burst assumption does not seem reasonable
How Word Burst differs from Cache/Trigger Language Models
1. It considers not only the top-1 hypothesis but the entire confusion set
2. It examines both past and future context; cache/trigger-based language models focus on the past
3. It relies on time-based information instead of token-based information
Conclusion
• Our experiments indicate that the Word Burst is a property of conversational speech and a language-independent phenomenon
• Information from language or conversational structure (such as Word Burst) can be used to improve performance in the STD task, particularly when language resources are otherwise limited.
• It is potentially useful in other applications
Special Thanks
Alex Rudnicky Alan Black
Florian Metze Yajie Miao Yuchen Zhang
Questions?