Spoken Term Detection Using Multiple Speech Recognizers’ Outputs...
Spoken Term Detection Using Multiple
Speech Recognizers’ Outputs
at NTCIR-9 SpokenDoc STD subtask
Hiromitsu Nishizaki
Yuto Furuya
Satoshi Natori
Yoshihiro Sekiguchi
University of Yamanashi, Japan
NTCIR-9 Workshop: SpokenDoc
2011/12/8 NTCIR-9 SpokenDoc task
Outline
• Introduction
• Spoken Term Detection (STD) using multiple speech recognizers
  • Overview of our STD framework
  • Multiple speech recognizers
  • Phoneme Transition Network (PTN)-based indexing
  • Search engine and experimental result
• False detection control
  • Introducing the control parameters
  • Experimental result
• Conclusion
Introduction (background and our goal)
Much multimedia data is available
• the multimedia environment has improved
• the infrastructure has improved
More efficient utterance retrieval
• extraction of keywords or phrases
Term detection from LVCSR output
• the out-of-vocabulary problem
• recognition errors degrade detection performance
Summary of our research
Multiple speech recognizers
• Combination of “1 decoder × 2 AMs × 5 LMs”
• This improved speech recognition performance
Construction of index for STD and search engine
• Confusion-network-based indexing
• Term detection using a simple term search method
STD performance evaluated on the formal run
• The index built from multiple speech recognizers’ outputs achieved the highest STD performance
• Introducing false detection control parameters improved the STD performance further
STD task flow diagram
• Index build phase: speech data → recognition systems #1 ... #10 → conversion to sub-word sequences → construction of the network-based index
• Search phase: text terms → phoneme terms → term search engine (searching the network-based index) → STD result
Multiple speech recognizers
10 speech recognizers (1 decoder × 2 acoustic models × 5 language models)
LVCSR decoder
• Julius rev.4.1.3
5 types of language models
• Word-based trigram: WBC
• Hiragana-based trigram: WBH
• Syllable-based trigram: CB
• Bi-syllable-based trigram: BM
• No language model: Non
2 types of acoustic models
• Syllable-based HMM: Syl
• Triphone HMM: Tri
Each model was trained on the open data
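The ten recognition systems are the cross product of the two acoustic models and the five language models. A minimal sketch of that enumeration follows; the constant and function names are illustrative only, and the "LM/AM" label format is taken from the labels used later in the deck (for example "CB/Tri").

```python
from itertools import product

# Abbreviations as given on the slide.
ACOUSTIC_MODELS = ["Tri", "Syl"]                      # triphone HMM, syllable-based HMM
LANGUAGE_MODELS = ["WBC", "WBH", "CB", "BM", "Non"]   # five language-model settings


def recognizer_names():
    """Return the 'LM/AM' label of every recognizer configuration."""
    return [f"{lm}/{am}" for am, lm in product(ACOUSTIC_MODELS, LANGUAGE_MODELS)]


if __name__ == "__main__":
    for name in recognizer_names():
        print(name)
```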
Phoneme Transition Network (PTN)
• A phoneme-level confusion-network-based index for STD
• We call it a “PTN” (Phoneme Transition Network)
• A PTN is built from multiple speech recognizers’ outputs
[Figure: example PTN consisting of terminal nodes, nodes, and arcs labeled with phonemes; “@” marks a NULL arc]
Example of building PTN-based index
Speech utterance: “Cosine” ( /k o s a i N/ )
Outputs of the 10 recognition systems (all outputs are converted into phoneme sequences)

LM/AM     Output
WBC/Tri   k o s @ a @ @ i @
WBH/Tri   q o s u a @ a @ N
CB/Tri    k o s @ a m a i @
BM/Tri    k o s @ a @ @ @ N
Non/Tri   k o s @ a @ @ @ N
WBC/Syl   @ @ s @ a @ @ @ N
WBH/Syl   b o s @ a a a @ @
CB/Syl    @ @ s @ a b @ i @
BM/Syl    @ @ s @ a @ @ @ N
Non/Syl   @ @ s @ a @ @ @ N

[Figure: the PTN-based index built from the 10 systems’ outputs (one of them serving as the base output), with nodes, terminal nodes, and one arc per distinct phoneme between successive nodes]
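The construction in this example can be sketched as follows, assuming the ten phoneme sequences have already been aligned to the same length with "@" as the NULL symbol, as they appear in the table. The alignment step itself is not described on the slide and is not shown here. `ALIGNED_OUTPUTS` and `build_ptn` are illustrative names, not part of the authors' system.

```python
from collections import Counter

# Aligned recognizer outputs for the utterance "Cosine", copied from the slide.
ALIGNED_OUTPUTS = {
    "WBC/Tri": "k o s @ a @ @ i @",
    "WBH/Tri": "q o s u a @ a @ N",
    "CB/Tri":  "k o s @ a m a i @",
    "BM/Tri":  "k o s @ a @ @ @ N",
    "Non/Tri": "k o s @ a @ @ @ N",
    "WBC/Syl": "@ @ s @ a @ @ @ N",
    "WBH/Syl": "b o s @ a a a @ @",
    "CB/Syl":  "@ @ s @ a b @ i @",
    "BM/Syl":  "@ @ s @ a @ @ @ N",
    "Non/Syl": "@ @ s @ a @ @ @ N",
}


def build_ptn(aligned_outputs):
    """Merge aligned phoneme sequences into a list of arc sets.

    Each element is a Counter mapping an arc label (a phoneme or "@")
    to the number of recognizers that produced it at that position.
    """
    sequences = [text.split() for text in aligned_outputs.values()]
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must already be aligned to equal length")
    # zip(*sequences) walks the sequences position by position.
    return [Counter(position) for position in zip(*sequences)]


if __name__ == "__main__":
    for i, arcs in enumerate(build_ptn(ALIGNED_OUTPUTS)):
        print(i, dict(arcs))
```

Keeping the per-arc counts rather than a plain set of labels means the same structure can later supply the number of recognizers behind each arc and the number of arcs between two nodes.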
Search engine (no false detection control)
• Simple search engine
  • Dynamic Programming (DP)-based engine
  • Both endpoints are free
  • Edit distance is used to calculate the DP cost between an index and a query term
• We modified the simple DP framework to handle the PTN-based index
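A minimal sketch of the simple DP search on a plain phoneme sequence is given below: an edit-distance alignment in which the query may start and end anywhere in the index (both endpoints free). The unit cost of 1.0 for every substitution, insertion, and deletion is an assumption, since the slide only says that edit distance is used, and `best_match_cost` is an illustrative name.

```python
def best_match_cost(query, index):
    """Lowest edit distance between `query` and any span of `index`.

    `query` and `index` are lists of phoneme symbols. Starting and
    ending anywhere in the index is free.
    """
    # prev[j]: best cost of aligning the query prefix consumed so far
    # with an index span that ends just before position j.
    prev = [0.0] * (len(index) + 1)          # free start point
    for q in query:
        cur = [prev[0] + 1.0]                # query phoneme with no index phoneme
        for j, p in enumerate(index, start=1):
            match = prev[j - 1] + (0.0 if p == q else 1.0)
            skip_query = prev[j] + 1.0       # query phoneme missing from the index
            skip_index = cur[j - 1] + 1.0    # extra phoneme in the index
            cur.append(min(match, skip_query, skip_index))
        prev = cur
    return min(prev)                         # free end point


if __name__ == "__main__":
    index = "a b k o s a i N c".split()
    print(best_match_cost("k o s a i N".split(), index))
```

A detection would then be reported wherever this cost falls below a chosen threshold; sweeping that threshold trades recall against precision.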
Example of the modified DP framework for PTN-based index (baseline technique)
• NULL transition cost is set to 0.1
[Figure: the search term /k o s a i N/ aligned with the PTN-based index. A matching arc has transition cost 0, a NULL arc has transition cost 0.1, there are no insertion errors, and the total cost is 0.3]
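The example can be reproduced with a small sketch of DP matching against a PTN. The NULL transition cost of 0.1 and the zero cost for a matching arc come from the slide; the cost of 1.0 for substitutions, insertions, and deletions is an assumption, and `COSINE_PTN` and `ptn_match_cost` are illustrative names. The arc sets are those obtained by merging the ten aligned outputs of the "Cosine" example.

```python
NULL = "@"

# Arc labels between successive nodes for the "Cosine" example.
COSINE_PTN = [
    {"k", "q", "b", NULL},
    {"o", NULL},
    {"s"},
    {"u", NULL},
    {"a"},
    {"m", "a", "b", NULL},
    {"a", NULL},
    {"i", NULL},
    {"N", NULL},
]


def ptn_match_cost(query, ptn, null_cost=0.1, error_cost=1.0):
    """Lowest DP cost of matching `query` against any span of `ptn`."""
    prev = [0.0] * (len(ptn) + 1)            # free start point
    for q in query:
        cur = [prev[0] + error_cost]
        for j, arcs in enumerate(ptn, start=1):
            # Follow an arc carrying the query phoneme (cost 0), else substitute.
            match = prev[j - 1] + (0.0 if q in arcs else error_cost)
            # Query phoneme has no counterpart in the index.
            skip_query = prev[j] + error_cost
            # Pass this position without consuming a query phoneme:
            # cheap if a NULL arc exists, a full error otherwise.
            skip_index = cur[j - 1] + (null_cost if NULL in arcs else error_cost)
            cur.append(min(match, skip_query, skip_index))
        prev = cur
    return min(prev)                         # free end point


if __name__ == "__main__":
    cost = ptn_match_cost("k o s a i N".split(), COSINE_PTN)
    print(round(cost, 3))
```

Under these assumptions the query crosses three NULL arcs and otherwise follows matching arcs, which gives the cost of 0.3 shown on the slide.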
Experimental setup
Data for STD task
• CORE set of the STD task (about 40 hours, 144×10^3 sec.)
Query
• 50 queries for the CORE set
• Including 31 out-of-vocabulary (OOV) queries
Evaluation measure
• Recall-precision curve
• F-measure at the maximum point of the curve
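The evaluation measure can be illustrated with a short sketch that sweeps a cost threshold over scored detections, records a recall-precision point at each threshold, and takes the maximum F-measure over the curve. The function names and the toy detections are hypothetical, and the official scoring of the task may differ in details such as how a detection is judged correct.

```python
def recall_precision_curve(detections, num_relevant):
    """Return (recall, precision) points, one per distinct cost threshold.

    `detections` is a list of (cost, is_correct) pairs, where a lower
    cost means a more confident detection. `num_relevant` is the number
    of true occurrences of the query terms.
    """
    ranked = sorted(detections, key=lambda d: d[0])
    points, hits = [], 0
    for rank, (cost, is_correct) in enumerate(ranked, start=1):
        hits += int(is_correct)
        # Emit a point only after the last detection sharing this cost.
        if rank == len(ranked) or ranked[rank][0] != cost:
            points.append((hits / num_relevant, hits / rank))
    return points


def max_f_measure(points):
    """Largest harmonic mean of recall and precision on the curve."""
    return max(
        (2 * r * p / (r + p) if r + p > 0 else 0.0 for r, p in points),
        default=0.0,
    )


if __name__ == "__main__":
    toy = [(0.0, True), (0.1, True), (0.3, False), (0.5, True)]
    curve = recall_precision_curve(toy, num_relevant=4)
    print(curve, max_f_measure(curve))
```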
Indices for STD
• Two types of index

Index      # of hypotheses   Type of index    How to make
Baseline   1                 Phoneme-based    1-best output of “CB/Tri”
PTN        10                Phoneme-based    10 types of output

Baseline STD is performed by the simple DP on the transcription of “CB/Tri.”
STD results
• PTN-based index: maximum F-measure = 71.4%
• Baseline index: maximum F-measure = 55.6%
False detection control for further STD improvement
• Our approach generates many false detections because it uses:
  • multiple speech recognizers’ outputs
  • a network-based index
• Two types of control parameters make the search robust against false detections:
  • Voting: the number of recognizers outputting the same phoneme on the same arc
  • ArcWidth: the number of arcs between two successive nodes
False detection control parameters
• Voting: a phoneme output by more recognizers may have higher confidence
• ArcWidth: a smaller number of arcs may indicate that the recognized phonemes are more reliable
[Figure: PTN-based index for the search term, annotated with Voting values (5 3 7 2 9) and ArcWidth values (4 5 3 4 2 2)]
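The two parameters can be read directly from per-arc recognizer counts. The sketch below computes them for one position of a PTN. The slides do not state how the parameters enter the search, so the thresholding in `is_reliable` is purely a hypothetical illustration, as is the assumption that the NULL arc counts toward ArcWidth. All names are illustrative.

```python
from collections import Counter


def voting(arcs, phoneme):
    """Number of recognizers that output `phoneme` on this arc."""
    return arcs.get(phoneme, 0)


def arc_width(arcs):
    """Number of distinct arcs between the two successive nodes.

    Assumption: the NULL arc "@" is counted like any other arc.
    """
    return len(arcs)


def is_reliable(arcs, phoneme, min_votes, max_width):
    """Hypothetical filter: enough votes and not too many competing arcs."""
    return voting(arcs, phoneme) >= min_votes and arc_width(arcs) <= max_width


if __name__ == "__main__":
    # First position of the "Cosine" example: k x4, NULL x4, q x1, b x1.
    first = Counter({"k": 4, "@": 4, "q": 1, "b": 1})
    print(voting(first, "k"), arc_width(first))
    print(is_reliable(first, "k", min_votes=3, max_width=4))
```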
Experimental results (with false detection control)
• Without false detection control: maximum F-measure = 71.4%
• With false detection control: maximum F-measure = 72.5%
Conclusion
Summary
• Using multiple speech recognizers for STD
  • Multiple recognizers improve STD performance
  • Integrating multiple recognizers’ outputs into a PTN was very effective in improving the performance
Future work
• Improving the index
  • Reduction of unnecessary information
• Improving the search engine
  • Developing new control parameters in the STD engine
  • Customizing the engine depending on the input query
Thank you for your attention
Our poster will be presented at the poster session tomorrow