REDUCING WAITING TIME IN AUTOMATIC CAPTIONED RELAY
SERVICE USING SHORT PAUSE IN VOICE ACTIVITY DETECTION
BY
MR. KIETTIPHONG MANOVISUT
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF MASTER OF SCIENCE (COMPUTER SCIENCE)
DEPARTMENT OF COMPUTER SCIENCE
FACULTY OF SCIENCE AND TECHNOLOGY
THAMMASAT UNIVERSITY
ACADEMIC YEAR 2017
COPYRIGHT OF THAMMASAT UNIVERSITY
Ref. code: 25605809035149VUW
Thesis Title REDUCING WAITING TIME IN AUTOMATIC
CAPTIONED RELAY SERVICE USING SHORT
PAUSE IN VOICE ACTIVITY DETECTION
Author Mr. Kiettiphong Manovisut
Degree Master of Science (Computer Science)
Department/Faculty/University Computer Science
Faculty of Science and Technology
Thammasat University
Thesis Advisor Pokpong Songmuang, Ph.D.
Thesis Co-Advisor Nattanun Thatphithakkul, Ph.D.
Academic Year 2017
ABSTRACT
The automatic captioned relay service is crucial for people with hearing disabilities or who are hard of hearing to communicate with others in real life. This service uses Automatic Speech Recognition (ASR) to transcribe speech into captions. By reducing the waiting time in the automatic captioned relay service, the service can support more users. Moreover, the ASR results become more continuous and steady, which directly affects the user experience. In this thesis, we divide the proposed work into four steps. In the first step, we propose a method for improving voice activity detection (VAD) based on the dual-threshold method. The idea of this research is to reduce the waiting time for ASR results by using short pauses in speech as endpoints instead of using only silence. This step reduces the average waiting time. In the second and third steps, we propose methods to maintain the accuracy of the ASR results. Furthermore, we show that the fixed energy threshold used in both the proposed work and traditional VAD cannot work in a noisy environment; this problem directly degrades the detection of short pauses and silence. Finally, we propose a pause model classifier trained with an LSTM-RNN. This last step overcomes the weakness in short pause and silence detection. The experimental results show that the proposed work reduces the average waiting time for ASR results by up to 17.1% compared with traditional VAD.
Keywords: voice activity detection, captioned relay service, recurrent neural network,
dual-threshold method
ACKNOWLEDGEMENTS
First of all, this research would not have been possible without the support of my advisor, Dr. Pokpong Songmuang, and co-advisor, Dr. Nattanun Thatphithakkul. I would like to express my sincere gratitude to them for their invaluable advice and patient proofreading towards the completion of my research. Their guidance helped me through the hard times while doing this research.
Besides my advisor and co-advisor, I sincerely thank Dr. Ananlada Chotimongkol and Phongphan Phienphanich, Ph.D. student, for their valuable and constructive suggestions during the planning and development of this research. Their willingness to give their time so generously has been very much appreciated.
Furthermore, I would also like to thank the rest of my thesis committee, Asst. Prof. Nuttanont Hongwarittorrn and Asst. Prof. Rachada Kongkachandra, for their encouragement, insightful comments, and hard questions.
Finally, I most gratefully acknowledge my family, who have always been beside me and encouraged me to study in the master's program. I could reach my goal and accomplish it thanks to all their kindness.
Parts of this research are supported by a grant from the National Electronics and Computer Technology Center (NECTEC), Thailand.
Mr. Kiettiphong Manovisut
TABLE OF CONTENTS
Page
ABSTRACT (1)
ACKNOWLEDGEMENTS (3)
TABLE OF CONTENTS (4)
LIST OF TABLES (7)
LIST OF FIGURES (8)
LIST OF ABBREVIATIONS (10)
CHAPTER 1 INTRODUCTION 1
1.1. Objective of the Thesis 3
1.2. Structure of the Thesis 4
CHAPTER 2 LITERATURE REVIEW 5
2.1. The automatic captioned relay service 5
2.2. Voice activity detection (VAD) 6
2.3. The difference of VAD analysis 7
2.3.1 Time-domain analysis 7
2.3.1.1 The dual-threshold method 7
(1) Pre-processing 9
(2) Framing 9
(3) Windowing 9
(4) Feature extraction 10
(5) Decision Algorithm 13
2.3.2 Frequency-domain analysis 15
2.3.3 Pattern recognition 16
2.3.3.1 Recurrent Neural Network 16
(1) Long short-term memory networks 18
2.3.3.2 Feature extraction for neural network 19
(1) Mel Frequency Cepstral Coefficient – MFCC 19
CHAPTER 3 SHORT PAUSE BASED VOICE ACTIVITY DETECTION 22
3.1. Dataset 22
3.1.1 Speech and non-speech label 22
3.1.2 Short pause in speech 24
3.2. Research Question 24
3.2.1 How many short pauses can be found in a sentence? 24
3.2.2 Is it possible to use the short pause to reduce the waiting time and
maintains the accuracy? 26
3.3. Short Pause based VAD 27
3.3.1 Short Pause algorithm (Determine silence and short pause) 28
3.4. Short Pause based VAD with endpoint decision 30
3.4.1 Endpoint decision 31
3.5. Short Pause based VAD with padding silence 31
3.5.1 Padding Silence 32
3.6. Short Pause based VAD with LSTM-RNN 33
3.6.1 Pause model from LSTM-RNN 36
CHAPTER 4 EXPERIMENT 39
4.1. Experiments 39
4.1.1 The first step 39
4.1.2 The second step 39
4.1.3 The third step 39
4.1.4 The fourth step 40
4.2. Experiment settings 40
4.3. Experiment dataset 40
4.4. Measurement 41
4.4.1 Average waiting time 41
4.4.2 Word error rate (WER) 42
CHAPTER 5 EXPERIMENTAL RESULTS AND DISCUSSIONS 44
CHAPTER 6 CONCLUSION 58
REFERENCES 60
APPENDICES 64
APPENDIX A IMPLEMENTATION OF SHORT PAUSE BASED VAD 65
APPENDIX B IMPLEMENTATION OF SHORT PAUSE BASED VAD WITH LSTM-RNN 69
BIOGRAPHY 71
LIST OF TABLES
Tables Page
3.1. The sample labeling of a speech sentence 23
3.2. The number of the short pause and the silence found in 12 hours of the
continuous speech sentences 25
3.3. LSTM-RNN specification 38
4.1. Alignment and types of error for test phrase 43
5.1. The average waiting time and WER of ASR result by short pause based VAD
compared with traditional VAD 45
5.2. The average waiting time and WER of ASR result by short pause based VAD with endpoint decision compared with the previous step and traditional VAD 47
5.3. The average waiting time and WER of ASR result by short pause based VAD
with padding silence compared with the previous steps and traditional VAD 48
5.4. The average waiting time and WER of ASR result by short pause based VAD
with LSTM-RNN compared with all previous steps and traditional VAD 52
LIST OF FIGURES
Figures Page
1.1. The infrastructure of automatic captioned relay service. 1
2.1. The captioned phone. 5
2.2. The flow diagram of dual-threshold method. 8
2.3. The original speech signals 10
2.4. The speech signal after applying a window function 10
2.5. The original speech signals 11
2.6. Energy waveform after short-time energy extraction 11
2.7. Zero-crossing rate of speech signal in noisy environment 12
2.8. An illustration of roughly search and smoothing search on the dual-threshold
method 13
2.9. The original waveform, labeled sum, and its component frequencies 15
2.10. A recurrent neural network and the unfolding in time of the computation
involved in its forward computation. 17
2.11. The RNNs and variety of output 18
2.12. Block diagram of the MFCC algorithm 20
3.1. Short pauses and silence found in sentence. 26
3.2. The flow diagram of traditional VAD 27
3.3. The flow diagram of the first step on short pause based VAD 28
3.4. State transition diagram for determining short pause and silence 29
3.5. The flow diagram of short pause based VAD with endpoint decision. 31
3.6. The flow diagram of short pause based VAD with padding silence 32
3.7. Increasing the energy threshold, the unvoiced portion might not be included in the speech segment 33
3.8. The flow diagram of short pause based VAD with LSTM-RNN 35
3.9. The creation of pause model 36
3.10. The result of MFCCs 37
3.11. The LSTM-RNN structure of pause model 37
4.1. The description of the waiting time equation 41
5.1. An example of a sentence in which silence or a short pause cannot be found by short pause VAD and traditional VAD 49
5.2. The labeling of silence and short pauses in a noisy speech sentence 50
5.3. The speech segments in office noise that is detected by short pause based
VAD with LSTM-RNN 51
5.4. The average waiting time reduced compared with the minimum pause setting
in short pause based VAD 53
5.5. The comparison of WER and minimum pause setting in short pause based
VAD 54
5.6 The flow diagram of short pause based VAD in final step. 55
5.7. The frequency and waiting time of Short Pause based VAD with LSTM-RNN
compared with traditional VAD 56
LIST OF ABBREVIATIONS
Symbols/Abbreviations Terms
VAD Voice activity detection
ASR Automatic speech recognition
MFCC Mel-frequency cepstral coefficients
RNN Recurrent neural network
LSTM-RNN Long short-term memory recurrent neural network
STE Short-time energy
TE Short-time energy threshold
ZCR Zero-crossing rate
TZ Zero-crossing rate threshold
WER Word error rate
WPM Word per minute
CHAPTER 1
INTRODUCTION
An automatic captioned relay service allows people with hearing disabilities or who are hard of hearing to use a special mobile application that lets the user speak and simultaneously read captions of what others say. The service uses Automatic Speech Recognition (ASR) to transcribe speech into captions, which are then transmitted directly to the mobile application. The infrastructure of an automatic captioned relay service is shown in Figure 1.1.
Figure 1.1. The infrastructure of automatic captioned relay service.
To separate a continuous speech stream in real time, the automatic captioned relay service uses Voice Activity Detection (VAD) to split the audio into speech and non-speech segments. The speech segments are then transcribed into captions by the ASR. VAD is the first important step for ASR, discriminating a speech signal into speech or non-speech segments. It is applied in many speech-based applications to reduce bandwidth, word error rate (WER), and computation time, which improves the overall performance of ASR. VADs can be divided by feature extraction into three categories, as follows:
(1) Time-domain features, such as log-energy (Wang & Qu, 2014), short-time energy, and zero-crossing rate (Guo & Li, 2010; Pal & Phadikar, 2015).
(2) Frequency-domain features, such as energy spectral entropy (Pang, 2017). The frequency domain refers to the analysis of signals with respect to frequency rather than time.
(3) Pattern recognition, such as linear classifiers, Naive Bayes classifiers, and Neural Networks (NN). Many recent works (Ryant et al., 2013; Tashev & Mirsamadi, 2016) show that neural networks significantly improve the accuracy of VADs.
VADs in each category are characterized by their complexity, precision, and computation cost: more complex feature extraction requires more computation. The dual-threshold method is a low-complexity time-domain VAD and the one most widely used in real-time speech-based applications, because it achieves the lowest computation time and cost (Zhang & Junqin, 2015).
A high speaking rate is common in continuous speech. According to the National Center for Voice and Speech, the average rate of speech is approximately 150 words per minute (wpm) (NCVS, 2007). Additionally, a high speaking rate leaves less silence, where silence is a non-speech portion longer than 100 milliseconds (Moattar & Homayounpour, 2009). When the traditional VAD, i.e., the dual-threshold method, is applied to the automatic captioned relay service, it separates the continuous speech into long speech segments, which the ASR then takes a long time to transcribe. Since the traditional VAD commonly uses silence as the endpoint of speech, users wait a long time for ASR results.
As mentioned above, none of the previous works propose a method to reduce the waiting time for ASR results. Therefore, to reduce the delay of ASR results in the automatic captioned relay service, we propose a new VAD algorithm, called short pause based VAD, built on the traditional dual-threshold method. The idea of this research is to reduce the waiting time for ASR results by using short pauses in speech as endpoints. A short pause is a pause in speech that ranges from 20 to 99 milliseconds. To confirm this idea, we investigate the possibility of using short pauses to reduce the waiting time of the real-time captioning process. We note that short pauses are usually found within sentences, so it is possible to use them as endpoints to reduce the waiting time. Furthermore, the traditional dual-threshold method is among the easiest to implement, which makes it ideal for a proof of concept of using short pauses in VAD.
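To make the idea concrete, the pause categories above can be sketched as a small classifier. This is our own illustrative sketch (the function name and return labels are ours, not the thesis implementation); only the duration boundaries, 20-99 milliseconds for a short pause and 100 milliseconds or more for silence, come from the text.

```python
def classify_pause(duration_ms):
    """Classify a non-speech run by its duration in milliseconds.

    >= 100 ms -> "silence"     (the traditional VAD endpoint)
    20-99 ms  -> "short_pause" (usable as an earlier endpoint)
    < 20 ms   -> "none"        (too short to count as a pause)
    """
    if duration_ms >= 100:
        return "silence"
    if duration_ms >= 20:
        return "short_pause"
    return "none"
```

Using short pauses as endpoints means a caption can be emitted as soon as a 20-99 ms gap is found, instead of waiting for a full 100 ms silence.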
To work through the experiments on using short pauses in VAD and the research problems step by step, we divide the experiments into four steps. In the first step, we show that short pauses can indeed be used to reduce the waiting time for ASR results. However, using short pauses as endpoints rapidly increases the word error rate (WER). Next, an endpoint decision module and a padding silence module are applied to reduce the WER in the second and third steps, respectively. Experimental results show that the endpoint decision and padding silence modules reduce the WER significantly. Moreover, we show that the fixed energy threshold in both the short pause based VAD and the traditional VAD cannot work in an office noise environment; this problem directly degrades short pause and silence detection.
In 2016, J. Kim (Kim, Kim, Lee, Park, & Hahn, 2016) showed that a long short-term memory recurrent neural network (LSTM-RNN) can learn complex features and patterns in data, and their proposed work achieved state-of-the-art performance. Hence, in the fourth step we borrow the VAD structure from the third step and replace short-time energy and zero-crossing rate with a pause model. The pause model is a speech and non-speech classifier trained with an LSTM-RNN to overcome the weakness in short pause and silence detection. Finally, all steps are compared, in terms of average waiting time and WER, with the traditional VAD used in the automatic captioned relay service.
1.1. Objective of the Thesis
The aims of this thesis are:
- To reduce the waiting time of caption results by using short pauses as endpoints of speech.
- To propose a method for maintaining the efficiency and accuracy of captions affected by the use of short pauses.
- To propose a method for accurate short pause detection in a noisy environment.
1.2. Structure of the Thesis
The rest of this thesis is organized as follows. Chapter 2 presents a literature review of the different VAD algorithms and background for the research. Chapters 3 and 4 describe the proposed algorithms and their experiments; there, we discuss a problem in the first proposed VAD algorithm and possible solutions for overcoming it. Chapter 5 presents the experimental results of the proposed works and analyzes the trade-offs between the proposed algorithms and the traditional work. Lastly, the conclusion of the proposed works is presented in Chapter 6.
CHAPTER 2
LITERATURE REVIEW
In this chapter, we discuss the fundamentals of the subject, including the automatic captioned relay service and the voice activity detection problem. Finally, recurrent neural networks and feature extraction for voice activity detection are briefly presented.
2.1. The automatic captioned relay service
The automatic captioned relay service is crucial for people with hearing disabilities or who are hard of hearing to communicate with others in real life. The first captioned phone was invented in the 1960s by a scientist named Robert Weitbrecht (Jones, 2017).
Previously, users reached the captioned relay service through the captioned phone. The captioned phone has a built-in screen that displays the caption of the conversation in near real time during the call. When making a call, the captioned phone automatically connects to the captioned relay service, so people with hearing disabilities or who are hard of hearing can speak and simultaneously read captions of what the other party is saying.
Figure 2.1. The captioned phone (Jones, 2017).
Nowadays, instead of the captioned phone, which can only be used at home or at work, the arrival of the smartphone means there are mobile apps that let people with hearing disabilities or who are hard of hearing use the captioned relay service anywhere they need, making it convenient and accessible.
2.2. Voice activity detection (VAD)
Voice activity detection (VAD), also known as speech activity detection, speech detection, or endpoint detection, dates back to 1975 (Rabiner & Sambur, 1975). It is a speech processing technique that detects the presence or absence of human speech, generally used to discriminate a speech signal into speech and non-speech. VAD takes the speech signal from an input source and uses feature extraction to retrieve the important features from the audio. The feature values are then used by a decision algorithm to determine the start point and endpoint of speech.
The main uses of VAD are in speech coding, speech recognition (Li, Zheng, Tsai, & Zhou, 2002), and speech enhancement. In speech coding, VAD determines the parts of silence whose transmission can be switched off, reducing the amount of transmitted non-speech data in Voice over Internet Protocol (VoIP) applications and saving both computation and network bandwidth. In speech recognition, VAD finds the parts of the speech signal that should be fed to the recognition engine; since recognition is a computationally complex operation, ignoring non-speech parts improves the overall performance of ASR. In speech enhancement, VAD is used to reduce or remove noise in a speech signal: noise characteristics are estimated from the non-speech parts and then removed from the speech parts.
As mentioned above, VAD is an important technology for a variety of speech-based applications. Various VAD algorithms therefore provide different features and strike different compromises between latency, sensitivity, accuracy, and computational cost.
VAD is usually language independent and can be implemented with a variety of techniques. Each technique suits different speech applications in terms of accuracy and computation time, factors that often affect the user experience directly. In an automatic captioned relay service, the VAD with the lowest delay is the most appropriate choice for a real-time speech-based application. VADs can be divided into several categories, which are described in the next section.
2.3. The difference of VAD analysis
The traditional VADs are roughly divided into several categories as follows.
2.3.1 Time-domain analysis
Generally, a VAD that analyzes in the time domain performs an analysis of mathematical functions, physical signals, or time series with respect to time. In time-domain analysis, the signal is known for all real numbers in the case of continuous time, or at separate instants in the case of discrete time.
In the literature, several approaches (Guo & Li, 2010; Zhang & Junqin, 2015) show how time-domain VADs achieve high computation speed. Time-domain analysis is the simplest category, with high speed and low computation cost compared with the others. It is suitable for real-time applications and works well on clean speech or with a little background noise; however, it has limitations, such as poor robustness against background noise. The dual-threshold method is the variant most commonly used in real-world applications. It uses short-time energy and zero-crossing rate with simple feature thresholds to classify each signal frame as speech or non-speech, and it performs well in clean speech or light background noise conditions (Yali, Dongsheng, Shuo, & Xuefen, 2014).
2.3.1.1 The dual-threshold method
The dual-threshold method is the most popular time-domain analysis. It is also the easiest to implement, which makes it ideal for a proof of concept of using short pauses in VAD.
The dual-threshold method is based on two features: short-time energy and zero-crossing rate. To process the signal, we first make it stationary with a framing operation; after that, a window is applied, and then the STE is calculated.
The first judgment uses short-time energy (STE) as a feature, the summation of the squared signal within each frame. It has clear advantages in computational time and complexity and can be used in clean-speech environments. The second judgment uses the zero-crossing rate (ZCR), the rate of sign changes along the signal, from positive to negative or back. This is another basic acoustic feature that can be computed easily. ZCR is commonly used in both speech recognition and music information retrieval, being a key feature for classifying percussive sounds, and it is often used in conjunction with energy (or volume) for speech detection. In particular, ZCR is used for detecting the start and endpoint of unvoiced sounds: if the ZCR is high, the speech signal is unvoiced; if it is low, the signal is voiced.
Figure 2.2. The flow diagram of dual-threshold method.
Figure 2.2 shows the flow diagram of the dual-threshold method. When received from an input device, the speech signal
will be processed before it is used in feature extraction and the other processes. Each process is described below.
(1) Pre-processing
The speech signal is pre-processed to make it suitable for feature extraction. Pre-emphasis (Pal & Phadikar, 2015) is a very simple signal processing method that increases the amplitude of the high-frequency bands and decreases that of the lower bands; the expression is shown in equation 2.1.
y(n) = x(n) − a · x(n − 1) (2.1)
where a = 0.95 and x(n) is the n-th speech sample; y(n) is the sample after pre-emphasis. We note that the higher frequencies are more important for signal disambiguation than the lower frequencies, so applying pre-emphasis in energy-based VAD gives a slightly better result. This is what makes pre-emphasis popular.
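A minimal sketch of this filter in plain Python (our own illustrative code, not the thesis implementation; a plain list stands in for the audio buffer, and the first sample is passed through unchanged):

```python
def pre_emphasis(x, a=0.95):
    """Apply the pre-emphasis filter y(n) = x(n) - a * x(n - 1).

    Boosts high frequencies relative to low ones: a constant (DC)
    signal is almost cancelled, while rapid changes pass through.
    """
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]
```

For example, a flat input [1.0, 1.0, 1.0] is reduced to small residuals after the first sample, which is exactly the suppression of low-frequency content described above.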
(2) Framing
In nature, the speech signal is non-stationary: it normally changes quite rapidly over time. However, the signal can be assumed stationary over a short time range (10 ms-30 ms). Thus, framing the speech signal into small frames satisfies the assumption that each piece of the signal is stationary.
(3) Windowing
When we frame the speech signal into small frames directly, there is often a case where the end of one frame does not smoothly mesh with the start of the next; that is, the signal may be discontinuous between frames. A window function is therefore used to "taper" the edges of the frame to zero. This is the basic idea of windowing. The expression for the Hamming window (Podder, Khan, Zaman, & Haque Khan, 2014) is shown in equation 2.2.
w(n) = α − β · cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1 (2.2)
where α = 0.54 and β = 1 − α = 0.46. N represents the frame length, and w(n) is the value of the window at the n-th sample of the frame.
Figure 2.3. The original speech signals
Figure 2.4. The speech signal after applying a window function
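The framing and windowing steps above can be sketched together. This is an illustrative implementation under our own naming (not the thesis code); it assumes samples arrive as a plain Python list and uses the Hamming coefficients from equation 2.2:

```python
import math

def hamming(N, alpha=0.54):
    """Hamming window: w(n) = alpha - (1 - alpha) * cos(2*pi*n / (N - 1))."""
    beta = 1.0 - alpha
    return [alpha - beta * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def frame_signal(x, frame_len, hop):
    """Split samples into frames of frame_len samples, advancing by hop.

    With hop < frame_len the frames overlap, which compensates for the
    samples attenuated at the tapered frame edges.
    """
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
```

At a 16 kHz sampling rate, a 25 ms frame with a 10 ms hop would correspond to frame_len = 400 and hop = 160.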
(4) Feature extraction
Short-time energy
Short-time energy is the most common feature for speech and non-speech detection. It represents changes in the amplitude of the speech signal, and that amplitude can be used to separate the energy of voice from silence. However, the accuracy of short-time energy decreases rapidly in a noisy environment. Figure 2.5 shows the original speech signal; the energy of the signal after short-time energy extraction is illustrated in Figure 2.6.
Figure 2.5. The original speech signals
Figure 2.6. Energy waveform after short-time energy extraction
To obtain the short-time energy E of each frame, let the speech signal be y(n), let N represent the frame length, and let w(n) be the adopted Hamming window. X_i(n) is the i-th speech frame after windowing; the expression for the windowed speech frame is shown in equation 2.3.
X_i(n) = w(n) × y(n), 0 ≤ n ≤ N − 1 (2.3)
Finally, the short-time energy of the i-th frame is defined by equation 2.4.
E_i = Σ_{n=0}^{N−1} [X_i(n)]² (2.4)
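Equation 2.4 translates directly to code. This is our own illustrative sketch (helper names are ours); a frame and its window are plain lists of equal length:

```python
def short_time_energy(frame, window):
    """E_i = sum over n of (w(n) * y(n))**2 for one frame.

    Applies the window to each sample and sums the squared result,
    so louder frames give larger energy values.
    """
    return sum((w * s) ** 2 for w, s in zip(window, frame))
```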
Zero-crossing rate
The zero-crossing rate (ZCR) is another popular characteristic of a speech signal. It counts how often successive sampling points change sign; the expression for ZCR (Pal & Phadikar, 2015) is given in equation 2.5.
Z_i = (1/2) Σ_{n=1}^{N−1} |sign[X_i(n)] − sign[X_i(n − 1)]| (2.5)
where sign[x] = 1 if x ≥ 0 and −1 if x < 0.
Generally, ZCR is used as a secondary parameter to improve the accuracy of VAD when speech occurs in a noisy environment. The ZCR is higher for unvoiced or silence segments and lower for voiced segments. The ZCR extracted from the original speech signal is illustrated in Figure 2.7.
Figure 2.7. Zero-crossing rate of speech signal in noisy environment
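Equation 2.5 can likewise be sketched in a few lines. This is our illustrative code (not the thesis implementation); the sign convention, where x ≥ 0 counts as positive, follows the definition above:

```python
def zero_crossing_rate(frame):
    """Z = (1/2) * sum |sign(x(n)) - sign(x(n-1))| over the frame.

    Each sign flip between adjacent samples contributes |1 - (-1)| = 2,
    so the factor 1/2 makes the result equal the number of crossings.
    """
    sign = lambda v: 1 if v >= 0 else -1
    return 0.5 * sum(abs(sign(frame[n]) - sign(frame[n - 1]))
                     for n in range(1, len(frame)))
```

A rapidly alternating frame yields a high ZCR (typical of unvoiced sounds), while a monotone frame yields zero.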
(5) Decision Algorithm
Let TE be the threshold value for short-time energy, and TZ the threshold value for the zero-crossing rate.
Figure 2.8. An illustration of roughly search and smoothing search on the dual-
threshold method
The first detection process is called "roughly search". This process uses short-time energy as the feature, computed frame by frame. The decision algorithm determines the beginning and endpoint of speech from the energy value of each frame: if the frame energy exceeds the threshold TE for at least P consecutive frames, the first frame of that run is assigned as the beginning; otherwise, we keep searching for a beginning point. The endpoint is found in the same way, except that the frame energy must stay below the threshold TE for at least P frames, and the last frame of the run is assigned as the endpoint. These are shown as A and B, respectively, in Figure 2.8.
After the first process finishes, the second-level judgment, called "smoothing search", is performed. It uses the zero-crossing rate as the feature. We search backward from the beginning found in the first detection process until we reach the first frame whose ZCR is below the threshold TZ, and assign that frame as the new beginning. The same procedure is then repeated forward from the endpoint. The results of the second-level judgment are shown as (A) to (C) and (B) to (D) in Figure 2.8.
As described above, this is how the dual-threshold method works on a speech signal. Time-domain VADs such as the dual-threshold method are the easiest to build and offer high speed and low computation cost. Hence, they are used to achieve fast computation while maintaining accuracy in real-time VAD.
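The two-pass decision above can be sketched as follows. This is our own simplified reading of the algorithm, not the thesis code; te, tz, and p stand for the thresholds TE and TZ and the minimum run length P, and the inputs are per-frame energy and ZCR lists:

```python
def roughly_search(energies, te, p):
    """First pass (short-time energy): return (begin, end) frame indices.

    begin = start of the first run of >= p consecutive frames with energy > te.
    end   = last frame of the last such run. Returns None if no run exists.
    """
    begin = end = run_start = None
    for i, e in enumerate(energies + [te]):  # sentinel closes a trailing run
        if e > te:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= p:
                if begin is None:
                    begin = run_start
                end = i - 1
            run_start = None
    return None if begin is None else (begin, end)

def smoothing_search(zcrs, begin, end, tz):
    """Second pass (zero-crossing rate): extend the boundaries outward
    while the ZCR stays above tz, pulling low-energy unvoiced sounds
    back into the speech segment (A -> C and B -> D in Figure 2.8)."""
    while begin > 0 and zcrs[begin - 1] > tz:
        begin -= 1
    while end < len(zcrs) - 1 and zcrs[end + 1] > tz:
        end += 1
    return begin, end
```

In a real system, the thresholds would be tuned to the recording conditions; a fixed te is exactly what the later chapters show to fail in office noise.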
In 2010, Qiuyu Guo and Nan Li (Guo & Li, 2010) showed that different VAD algorithms perform differently depending on the aspect considered, such as precision, computational complexity, and robustness to noise. To improve VAD performance, they chose a dual-threshold method using short-time energy and zero-crossing rate to handle a noisy environment. Their experimental results show that the proposed work is robust in a noisy environment while maintaining a low computational cost.
Later, in 2015, Provat Kumar Pal and Santanu Phadikar (Pal & Phadikar, 2015) proposed a method to reduce the computation cost and improve the performance of the dual-threshold method. An additional pre-processing technique, a Wiener filter, is used to improve the speech signal before short-time energy and zero-crossing rate extraction. This research also provides a method called the adaptive threshold to enhance the VAD decision process for use in a real-world environment. Their experimental results show that the proposed work reduces the computation cost by up to 27.7% and improves ASR accuracy by up to 5.9%. However, there are some limitations in this work: the researchers assume that no speech is present in the first 100-millisecond interval, and they use a small dataset to measure VAD performance, which may produce inaccurate results.
These are the works of literature closest to the proposed work. All of them use the dual-threshold method to optimize computation cost and accuracy, but none of them reduce the waiting time for results. Therefore, we aim to reduce the waiting time for ASR results in the automatic captioned relay service, and the traditional VAD must be improved by using short pauses in speech.
2.3.2 Frequency-domain analysis
In electronics, control systems engineering, and statistics, the frequency domain refers to the analysis of mathematical functions or signals with respect to frequency rather than time. A time-domain graph shows how the signal changes over time, while a frequency-domain graph shows how much of the signal's energy lies within each frequency band over a range of frequencies. A frequency-domain representation can also include the phase shift applied to each sinusoid, which makes it possible to recombine the frequency components and recover the original time signal.
A function or signal can be converted between the time and frequency domains with a pair of mathematical operators called transforms. An example is the Fourier transform, which converts a time function into a sum of sine waves of different frequencies, each representing one frequency component. The spectrum of frequency components is the frequency-domain representation of the signal. The inverse Fourier transform converts the frequency-domain function back to a time function.
Figure 2.9. The original waveform, labeled sum, and its component frequencies
In the literature, many experiments (Jia & Xu, 2002; Misra, 2012; Moattar & Homayounpour, 2009; Shen, Hung, & Lee, 1998) have shown that time-domain analysis usually fails under low-SNR conditions and that its accuracy cannot be improved much further. Nowadays, to achieve higher accuracy in VAD algorithms, pattern-recognition approaches are widely used; they are described in the next section.
2.3.3 Pattern recognition
In the last few years, building VAD with pattern recognition has
become very popular because pattern recognition techniques such as the Gaussian
Mixture Model (GMM) (Misra, 2012) and the Neural Network (NN) (Kim et al., 2016;
Ryant et al., 2013; Tashev & Mirsamadi, 2016) have been widely used in speech
recognition. Thus, pattern recognition is applied to build VAD, and the results
are satisfactory in noisy environments and even more accurate on clean speech.
2.3.3.1 Recurrent Neural Network
A neural network is a mathematical or computational model for
processing information through connection-oriented (connectionist) computation.
The original concept of neural networks was derived from the study of the
bioelectric network in the brain, which consists of neurons and synapses forming
a collaborative network. These techniques commonly use feature extraction to
improve the speech and non-speech classifier; one example is Mel-frequency
cepstral coefficients (MFCC). Besides, many studies (Eyben, Weninger, Squartini,
& Schuller, 2013; Hughes & Mierle, 2013) show that a recurrent neural network
(RNN) improves noise robustness by using the context of previous speech frames.
Traditional neural networks usually assume that all inputs (and outputs)
are independent. A recurrent neural network (RNN), in contrast, has a memory
containing at least one feedback loop, ingesting its own outputs moment after
moment as inputs. This feature enables the network to do temporal processing
and learn sequences, such as sequence recognition or temporal prediction. These
techniques are often used for video, audio, natural language processing and
images.
Figure 2.10. A recurrent neural network and the unfolding in time of the computation
involved in its forward computation (Olah, 2015).
A recurrent neural network has two sources of input, the present and the
recent past, which combine to determine how it responds to new data.
That sequential information is preserved in the recurrent network's hidden
state, which manages to span many time steps as it cascades forward to affect
the processing of each new example. The hidden state update can be written as
follows:
h_t = φ(W x_t + U h_(t−1)) (2.6)
From equation 2.6 (Olah, 2015), the hidden state at time step t is h_t. It is a
function of the input at the same time step (x_t), modified by a weight matrix
(W), added to the hidden state of the previous time step (h_(t−1)) multiplied by
its own hidden-state-to-hidden-state matrix (U), also known as a transition
matrix. The weight matrices are filters that determine how much importance to
accord to both the present input and the past hidden state. The error from the
loss function is returned via back-propagation, which is used to adjust the
weights until the error is as low as possible. The sum of the weighted input and
hidden state is squashed by an activation function, either a
sigmoid function, tanh or rectified linear unit (ReLU). These are standard tools
for condensing very large or very small values into a logistic space as well as
making gradients workable for backpropagation.
Because this feedback loop occurs at every time step in the series, each
hidden state contains traces not only of the previous hidden state but also of
all those that preceded it for as long as memory can persist.
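The update in equation 2.6 can be sketched in a few lines of numpy; the sizes and random weights here are illustrative assumptions, not values from the thesis:

```python
import numpy as np

# A minimal sketch of the hidden state update h_t = phi(W x_t + U h_{t-1}),
# using tanh as the squashing function.
rng = np.random.default_rng(0)
input_size, hidden_size = 13, 8          # e.g. 13 MFCCs per frame (illustrative)
W = rng.uniform(-0.1, 0.1, (hidden_size, input_size))   # input weight matrix
U = rng.uniform(-0.1, 0.1, (hidden_size, hidden_size))  # transition matrix

def step(x_t, h_prev):
    """One recurrent step: combine the present input and the past hidden state."""
    return np.tanh(W @ x_t + U @ h_prev)

# Run a short sequence of frames; the hidden state carries memory forward.
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h = step(x_t, h)
print(h.shape)   # (8,)
```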
Figure 2.11. The RNNs and variety of output (Olah, 2015)
One of the appeals of RNNs is the idea that they might be able to connect
previous information to the present task, such as a language model trying to
predict the next word based on the previous ones.
(1) Long short-term memory networks
A long short-term memory network (LSTM) is a special kind of
RNN capable of learning long-term dependencies. LSTM was introduced by Hochreiter
and Schmidhuber (Hochreiter & Schmidhuber, 1997) and was refined and popularized
by many people in the following works. LSTMs work remarkably well on a large
variety of problems and are now widely used.
LSTM is explicitly designed to avoid the long-term dependency problem.
Remembering information for long periods of time is practically its default
behavior. All recurrent neural networks have the form of repeating modules. In
standard RNNs, this repeating module has a very simple structure, such as a
single tanh layer.
LSTM also has this chain-like structure, but the repeating module has a different
structure. Instead of a single neural network layer, there are four, interacting
in a very special way. LSTMs have the ability to remove or add information to the
cell state, carefully regulated by structures called gates.
LSTM has also become very popular in the field of natural language
processing (Palangi et al., 2016; Yao et al., 2014). Unlike previous models based
on HMMs and similar concepts, LSTM can learn to recognize context-sensitive
languages. LSTM improves machine translation, language modeling and multilingual
language processing. Moreover, LSTM combined with convolutional neural networks
(CNNs) also improves automatic image captioning and a plethora of other
applications.
2.3.3.2 Feature extraction for neural network
Feature extraction finds the parameters that represent the
characteristics of audio signals. Feature extraction is very important: it is
used to reduce the complexity of neural networks, the amount of data needed for
learning and the computation time of neural networks. Many studies have shown
that feature extraction can improve accuracy, reduce learning time by discarding
unnecessary information and improve the overall performance of neural networks
(Misra, 2012).
Nowadays, there are various ways to extract different features.
Generally, the most important and commonly used features are MFCC, LPC, PLP and
RASTA-PLP (Dave, 2013), which are popular in speech recognition systems.
(1) Mel Frequency Cepstral Coefficient – MFCC
The Mel Frequency Cepstral Coefficient (MFCC) is a feature that is
widely used in automatic speech and speaker recognition. MFCCs were introduced by
Davis and Mermelstein in the 1980's (Mermelstein, 1976) and have been
state-of-the-art ever since. The technique is based on experiments on human
perception of words. Prior to the introduction of MFCC, the Linear Prediction
Coefficient (LPC) and Linear Prediction Cepstral Coefficient (LPCC) were the main
feature types for ASR, especially with HMM classifiers.
To extract a feature vector containing all the information about the linguistic
message, MFCC mimics some parts of human speech production and speech
perception. MFCC mimics the logarithmic perception of loudness and pitch of the
human auditory system and tries to eliminate speaker-dependent characteristics by
excluding the fundamental frequency and its harmonics. To represent the dynamic
nature of speech, MFCC also includes the change of the feature vector over time
as part of the feature vector.
The standard implementation of computing the Mel-Frequency Cepstral
Coefficients is shown in Figure 2.12.
Figure 2.12. Block diagram of the MFCC algorithm
MFCC is commonly derived as follows:
- Take the Fourier transform of a signal.
- Map the powers of the spectrum obtained above onto the Mel scale,
using triangular overlapping windows.
- Take the logs of the powers at each of the Mel frequencies.
- Take the discrete cosine transform of the list of Mel log powers, as if it
were a signal.
- The MFCCs are the amplitudes of the resulting spectrum.
The input for the computation of the MFCCs is a speech signal in the time-domain
representation with a duration on the order of 10-30 milliseconds.
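The five derivation steps listed above can be sketched as follows. This is a simplified illustration, not the implementation used in the thesis; the filter count, coefficient count and windowing choices are common defaults assumed here:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=26, n_coeffs=13):
    """Simplified MFCC of one frame; parameter defaults are assumptions."""
    # 1. Fourier transform of the (windowed) signal, then the power spectrum.
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # 2. Triangular overlapping filters spaced evenly on the mel scale.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((len(frame) + 1) * mel_to_hz(mel_points) / fs).astype(int)
    energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        left = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        right = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
        energies[i] = spectrum[lo:hi] @ np.concatenate([left, right])
    # 3. Take the log of the power in each mel band.
    log_energies = np.log(energies + 1e-10)
    # 4.-5. DCT of the log mel powers; the amplitudes are the MFCCs.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), (n + 0.5)) / n_filters)
    return dct @ log_energies

# A 25 ms frame at 16 kHz yields a 13-dimensional feature vector.
fs = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(400) / fs)
print(mfcc(frame, fs).shape)   # (13,)
```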
[Figure 2.12 blocks: Speech → Fast Fourier transform (spectrum) → Mel scale filtering (Mel frequency spectrum) → Log → Discrete cosine transform (cepstrum coefficients) → Derivatives → MFCC]
In order to build a short pause based VAD that is more accurate in short
pause detection, we need to compare the accuracy, detection time and caption
efficiency of a VAD created with a neural network against one using time-domain
analysis. Past research on creating VAD using pattern recognition is detailed
below.
In 2012, Ananya Misra (Misra, 2012) presented research on speech/non-speech
segmentation for speech transcription. The researcher compared feature
extractions and classifiers to find the combination that makes speech/non-speech
segmentation most effective in noisy environments. The compared feature
extractions include the low short-time energy ratio, high zero-crossing rate
ratio, line spectral pairs (LSP), spectral flux, spectral centroid, spectral
roll-off, ratio of magnitudes in the speech band, top peaks, and ratio of
magnitudes under top peaks, each compared against the most commonly used
Mel-frequency cepstral coefficients (MFCCs). Moreover, the maximum entropy
classifier (Maxent) was compared with Gaussian mixture models (GMM).
The researcher used 95 hours of data from YouTube web videos to
achieve a variety of noise. The experimental results showed that Maxent performed
better than GMM and that good feature extraction strongly affects the accuracy of
speech recognition.
However, according to earlier research, although pattern recognition has been
popular for its efficiency and robustness to noise, this method needs to estimate
the model parameters of the speech and noise signals and requires a large amount
of training data.
SHORT PAUSE BASED VOICE ACTIVITY DETECTION
In this chapter, the datasets used to investigate the possibility of using the
short pause, along with an experiment, are described. Next, the creation of a
short pause based VAD built on the traditional VAD is presented, followed by a
problem of short pause and silence detection in the traditional VAD and the
proposed work. Finally, the creation of a pause model based on LSTM-RNN is
presented in the last section to solve the problem.
3.1. Dataset
In order to investigate the frequency of short pauses in speech sentences, the
process begins with a dataset. We use LOTUS (Kasuriya, Sornlertlamvanich,
Cotsomrong, Kanokphara, & Thatphithakkul, 2003), a large vocabulary continuous
speech recognition (LVCSR) corpus provided by the National Electronics and
Computer Technology Center (NECTEC), Thailand. The dataset consists of equal
numbers of male and female voices. The voices are recorded with two types of
microphone: high-quality close-talk and medium-quality unidirectional. In
addition, the dataset includes two types of environment, a silent environment and
an office noise environment, for testing in noisy conditions.
3.1.1 Speech and non-speech label
Speech and non-speech segments in the LOTUS dataset are labeled by
hand. Silence or "sil" is a pause that is over 100 milliseconds. Short pause or
"sp" is a pause that ranges from 20 to 99 milliseconds. The dataset contains
"sil", "sp" and other phonetics as labels. We use these labels to investigate the
frequency of short pauses and to measure accuracy while training a recurrent
neural network. An example of the labeling in the dataset is illustrated in
Table 2.1.
Table 2.1. The sample labeling of a speech sentence.

Start Time (millisecond)   End Time (millisecond)   Phonetic
0                          2990323                  sil
2990323                    3213509                  p
3213509                    3681851                  e
3681851                    4610073                  n^
4610073                    5620114                  w
5620114                    6850219                  a
6850219                    8145215                  j^
8145215                    9268110                  kh
9268110                    10532071                 @@
10532071                   11186622                 ng^
11186622                   12374408                 khw
12374408                   13635548                 aa
13635548                   15142145                 m^
15142145                   15785411                 j
15785411                   16916769                 u
16916769                   17768814                 ng^
17768814                   18691393                 j
18691393                   20079494                 aa
20079494                   20367271                 k^
20367271                   22043148                 sp
22043148                   22474814                 c
22474814                   24421540                 a
24421540                   27169528                 j^
27169528                   30174257                 sil
The discussion of using the dataset is given in the next topics.
3.1.2 Short pause in speech
In previous research, silence between words (from 100 milliseconds)
is commonly used as the endpoint of speech for better accuracy (Moattar &
Homayounpour, 2009; Wu, Kingsbury, Morgan, & Greenberg, 1998). However, using
silence as the endpoint in continuous speech may increase the waiting time
accordingly. In this work, we focus on using short pauses between words to reduce
the waiting time of the ASR result. A pause ranging from 20 to 99 milliseconds is
defined as a short pause.
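The definitions above can be sketched as a small labeling rule (an illustration; the function name is ours):

```python
# A sketch of the pause classification rule; durations are in milliseconds.
def classify_pause(duration_ms):
    if duration_ms >= 100:
        return "sil"          # silence: the endpoint used by traditional VAD
    if duration_ms >= 20:
        return "sp"           # short pause: candidate endpoint in this work
    return None               # too short to be treated as a pause

print(classify_pause(150))    # sil
print(classify_pause(60))     # sp
print(classify_pause(10))     # None
```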
3.2. Research Question
Here are the two main questions of this work.
- How many short pauses can be found in a sentence?
- Is it possible to use short pauses to reduce the waiting time and
maintain the accuracy?
3.2.1 How many short pauses can be found in a sentence?
To use the short pause as an endpoint to reduce the waiting time of the ASR
result, it is important to investigate the frequency of short pauses that occur
during conversation in continuous speech.
For this purpose, 12 hours of continuous speech sentences in the LOTUS dataset
are used in the preliminary investigation. This investigation counts the number
of short pauses and silences that occur in speech sentences. Because the speech
in the LOTUS dataset is relatively slow continuous speech (approximately 123
wpm), the numbers of silences and short pauses differ from those at a high speech
rate. Nevertheless, short pauses and silences in LOTUS are labeled by hand, which
is why we use LOTUS to investigate preliminary statistics of the short pause in
the sentence. Therefore, we wrote a program that determines the length of each
pause and assigns a label as follows: (1) a pause ranging from 20 to 99
milliseconds is defined as a short pause; (2) a pause between words over 100
milliseconds is automatically defined as silence. The number of short pauses and
silences in LOTUS is shown in Table 2.2.
Table 2.2. The number of the short pause and the silence found in 12 hours of the
continuous speech sentences

Types         Minimum pause (ms)   Frequency (s)   Sd.     Total    Endpoints (Silence + Short Pause)
Silence       100                  1.14            0.918   36,975   36,975
Short Pause   80                   0.98            0.8     6,125    43,100
              60                   0.82            0.7     13,957   50,932
              40                   0.71            0.61    22,107   59,082
              20                   0.64            0.55    28,944   65,919
Table 2.2 presents the frequency of short pauses in the 12 hours of
continuous speech sentences. We found a total of 28,944 short pause points (in
ranges from 20 to 99 milliseconds) and 36,975 silence points. In other words,
silence may be used as an endpoint every 1.14 seconds on average, compared with
every 0.64 seconds when using short pauses down to 20 milliseconds. We note that
short pauses are found in sentences no less often than silences.
Figure 2.13. Short pauses and silence found in sentence.
In Figure 2.13, dashed lines represent short pauses and solid lines represent
silences. As the investigation of short pause frequency shows, there are normally
some short pauses in a sentence.
3.2.2 Is it possible to use the short pause to reduce the waiting time
and maintain the accuracy?
Considering the frequency of short pauses that occur during speech,
the result of the investigation shows that short pauses can be found during
speech about as often as silences. Hence, it is possible to reduce the waiting
time of the traditional VAD by using the short pause as an endpoint instead of
using only silence. Therefore, we redesign and improve the traditional VAD
algorithm to detect short pauses and determine an endpoint as fast as possible.
The separated speech segments are then transmitted to ASR and transcribed in
near real-time.
The proposed work is called short pause based VAD, and its improvement is
separated into four steps. In the first step, we improve the traditional VAD by
applying the short pause algorithm, which uses the short pause as an endpoint
instead of using only silence. This step is built on the traditional
dual-threshold method VAD, which has low complexity and computation cost and is
simple to implement. In the second step, we apply the endpoint decision to the
short pause based VAD. This step aims to reduce WER by monitoring and minimizing
the number of short pauses that are used by the VAD.
In the third step, we add the padding silence module to the short pause based VAD
to minimize WER. In the fourth step, we borrow the VAD structure from the third
step and replace the two feature extractions in the traditional VAD with the
pause model. The pause model is a speech and non-speech classifier trained with a
long short-term memory recurrent neural network (LSTM-RNN). The fourth step is
more accurate than the first, second and third steps in short pause detection and
reduces the WER of the ASR result in a near real-time captioning process. Each
step is described in the next topics.
3.3. Short Pause based VAD
Figure 2.14. The flow diagram of traditional VAD (Guo & Li, 2010)
[Figure 2.14 blocks: Speech signals → Pre-processing & Framing → Divide into chunks → Short-time energy (roughly search) → Zero-crossing rate (smoothing search) → Decision algorithm → Begin, End → Speech segments]
Figure 2.15. The flow diagram of the first step on short pause based VAD
Generally, the decision algorithm in the traditional VAD is designed to detect
silence that is over 100 milliseconds, and the traditional VAD then uses the
silence as an endpoint immediately. The first step improves the traditional VAD
(shown in Figure 2.14) to detect short pauses between words instead of detecting
only silence. The flow diagram of the proposed work is illustrated in Figure
2.15. The decision algorithm in the traditional VAD is replaced with the short
pause algorithm (the dashed block in Figure 2.15), which has the ability to
detect short pauses in speech. The details of the short pause algorithm are
explained in the topic below.
3.3.1 Short Pause algorithm (Determine silence and short pause)
As described in the first question, short pauses usually occur during speech,
and we note that a short pause may be used as an endpoint, which can reduce the
waiting time of the ASR result. Therefore, to detect short pauses and silences in
speech simultaneously, the short pause algorithm in the proposed VAD consists of
two main processes:
- The first process determines silence from 100 milliseconds. This
process gives the most accurate ASR result.
- The second process validates pauses between words in a sentence. If a
pause is more than 20 milliseconds, it is defined as a short pause.
This process uses the short pause to find an appropriate endpoint to
reduce the waiting time.
To locate the short pause and silence, we use a four-state transition
diagram that is illustrated in Figure 2.16.
Figure 2.16. State transition diagram for determining short pause and silence
As shown in Figure 2.16, the four states are silence, maybe-speech, speech and
leaving-speech. We assume the silence state is the start state and any state can
be a final state. The transition conditions are written along the lines between
states, and the actions of each condition are in brackets. From the previous
discussion, E is the output value of the short-time energy feature and T_E is the
energy threshold. The output start points, silences and list of short pauses are
given as detected frame numbers. "count" is the number of speech frames detected,
and "pause" is the number of pause frames found in speech. "sp" is the minimum
pause length. Moreover, "speech" is the minimum speech threshold and "silence" is
the minimum pause that can be defined as a silence; both of these parameters are
set to 100 milliseconds.
To illustrate the two main processes in the determination of silence and short
pause, we focus on the leaving-speech state, where the short pause algorithm
detects a weak signal during speech, called a pause. If the pause length is
greater than the "silence" threshold, the short pause algorithm classifies the
detected pause as silence and moves to the silence state. Meanwhile, if the pause
length is lower than the "silence" threshold but greater than the "sp" threshold,
we define the detected pause as a short pause, which is used to find an
appropriate endpoint. Finally, the state moves back to the speech state.
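A simplified sketch of the four-state diagram follows. It consumes per-frame short-time energies (20 ms frames) and reports detected silences and short pauses as frame numbers; the exact transition conditions, including the zero-crossing rate and the "count"/"speech" checks, follow Figure 2.16, and the threshold and frame length here are illustrative assumptions:

```python
# Simplified four-state sketch: silence, maybe-speech, speech, leaving-speech.
# te is the illustrative energy threshold; silence_frames = 5 frames of 20 ms
# each (100 ms) and sp_frames = 1 frame (20 ms) mirror the thesis durations.
def detect_pauses(energies, te=50000, silence_frames=5, sp_frames=1):
    state = "silence"
    pause = 0
    silences, short_pauses = [], []
    for i, e in enumerate(energies):
        if state == "silence":
            if e > te:
                state = "maybe-speech"
        elif state == "maybe-speech":
            state = "speech" if e > te else "silence"
        elif state == "speech":
            if e <= te:
                state = "leaving-speech"
                pause = 1
        elif state == "leaving-speech":
            if e > te:                      # pause ended before "silence" length
                if pause >= sp_frames:
                    short_pauses.append(i)  # short pause: candidate endpoint
                state = "speech"
            else:
                pause += 1
                if pause >= silence_frames:
                    silences.append(i)      # silence: traditional endpoint
                    state = "silence"
    return silences, short_pauses

# Loud speech, a 2-frame (40 ms) dip, more speech, then a long silence.
energies = [60000] * 5 + [100] * 2 + [60000] * 5 + [100] * 6
sil, sp = detect_pauses(energies)
print(sil, sp)   # [16] [7]
```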
3.4. Short Pause based VAD with endpoint decision
In preliminary experiments, we separated speech sentences by hand using short
pauses as endpoints. We found that the short pause increases WER significantly;
moreover, using very short pause lengths increases WER rapidly in the real-time
captioning process. Therefore, we propose the endpoint decision, which monitors
and minimizes the use of short pauses to minimize captioning errors.
In the first step, the short pause algorithm examines pauses during speech.
In this step, we use the detected pause in the endpoint decision, which decides
whether to use it based on the delay time of the VAD. The delay time is the
actual processing time compared with the average chunk duration. The
improvement of short pause based VAD with endpoint decision is presented in Figure
2.17. The descriptions of the endpoint decision rule are described below.
Figure 2.17. The flow diagram of short pause based VAD with endpoint decision.
3.4.1 Endpoint decision
Let θ be the average chunk duration and ∆d the actual processing time. The
algorithm then chooses between using silence or a short pause as the endpoint
according to the following rules.
- If silence can be found within θ, the algorithm uses silence as the endpoint.
- If ∆d > θ but the VAD cannot find a silence portion, the algorithm
uses the latest short pause as the endpoint.
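The two rules can be sketched as follows (names and the calling convention are ours; a sketch, not the thesis implementation):

```python
# A sketch of the endpoint decision rule: theta is the average chunk duration
# and delta_d the actual processing time, both in seconds (illustrative units).
def choose_endpoint(silence_found, short_pauses, delta_d, theta):
    if silence_found:
        return "silence"           # silence gives the most accurate ASR result
    if delta_d > theta and short_pauses:
        return short_pauses[-1]    # fall back to the latest short pause
    return None                    # otherwise keep waiting

print(choose_endpoint(True, [7], 0.5, 1.1))       # silence
print(choose_endpoint(False, [7, 12], 1.4, 1.1))  # 12
print(choose_endpoint(False, [], 1.4, 1.1))       # None
```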
3.5. Short Pause based VAD with padding silence
Generally, speech recognition systems learn from a large dataset which
commonly contains an amount of silence before and after speech. Using the short
pause as an endpoint reduces the silence in the speech segment. The idea of
padding silence comes from Pheraniti's work (Pheraniti, 2008). The researcher shows an
experiment that allows silence before and after the speech segment. His
experimental results show that an appropriate length of silence significantly
affects accuracy in ASR.
In addition, the creation of an ASR engine largely relies on a language model
that helps predict the words in a sentence. The language model is similar to a
state machine looking for the final state of a speech sentence. The final state
is the silence given in the training process. Using the short pause as an
endpoint removes some silence and unvoiced speech, so the ASR engine cannot reach
the final state. Thus, adding a certain amount of silence after separating with
the short pause may allow ASR to repair the missing silence or recover the
unvoiced segment in speech. This technique might reduce WER.
3.5.1 Padding Silence
Figure 2.18. The flow diagram of short pause based VAD with padding silence
We add 100 milliseconds of generated silence before or after the short
pause to improve accuracy. The second step is improved by adding the padding
silence module, and we call the third step the short pause based VAD with padding
silence. Figure 2.18 is the flow diagram of the third step, and its performance
is shown in the experimental results.
3.6. Short Pause based VAD with LSTM-RNN
Considering the flow diagram of the traditional VAD and the previous steps, we
note that the energy threshold is very important for the traditional VAD and the
proposed algorithm. Short pauses and silences cannot be detected if the first
energy threshold fails. Wang & Qu (Wang & Qu, 2014) show that a fixed threshold
cannot work in a noisy environment. This problem directly affects the ability of
short pause detection and may lead to failure of the algorithm.
Figure 2.19. When the energy threshold is increased, the unvoiced portion might
not be included in the speech segment
However, raising the threshold is one option to address the fixed energy
threshold problem. When increasing the energy threshold, we must consider a
problem that arises as the fixed threshold becomes higher. Specifically, we need
to consider the search for the start point and endpoint of a speech segment.
First, the traditional VAD searches for a start point whose energy continuously
increases for at least 100 milliseconds (A). Then the zero-crossing rate scans
backward for the first frame with the lowest ZCR, and we define that frame as the
start point. This result carries some risk, because the first lowest-ZCR frame
found (B) is probably not the beginning of speech (C). Therefore, unvoiced speech
parts are ignored in the zero-crossing rate process. This problem directly
affects the accuracy of the ASR result, as illustrated in Figure 2.19.
Much research (Guo & Li, 2010; Zhang & Junqin, 2015) has shown that
short-time energy works well in a clean speech environment. In this proposed
work, however, short-time energy cannot capture a complex feature like the short
pause, because the characteristic of a short pause is similar to speech or noise
occurring in speech. Using short-time energy to search for a detail as small as
the short pause is difficult and inaccurate.
It is well known that a long short-term memory recurrent neural network
(LSTM-RNN) can understand long-range data sequences better than a vanilla neural
network. The LSTM-RNN has memory cells in its internal structure; a memory cell
can write, read and reset an incoming feature context. We note that a huge amount
of data is crucial for LSTM-RNN and all neural networks. In recent research,
Juntae Kim (Kim et al., 2016) shows that LSTM-RNN can capture complex features
like vowel sounds rather than whole speech. Vowel sounds are a complex feature
like a short pause in speech. This research inspires the idea of improving the
short pause algorithm in this proposed work. Therefore, we use LSTM-RNN to
improve the accuracy of the short pause algorithm.
To apply LSTM-RNN in the short pause based VAD, we borrow the VAD
structure from the third step (short pause based VAD with padding silence). Then
we apply a pause model trained with LSTM-RNN. The pause model finely detects
a pause in a noisy environment. Next, the detected pause is defined as a short
pause or silence depending on the length of the pause. This step improves the
accuracy of the short pause algorithm. The flow diagram of the short pause based
VAD with LSTM-RNN is illustrated in Figure 2.20.
Figure 2.20. The flow diagram of short pause based VAD with LSTM-RNN
Figure 2.20 shows the new diagram of short pause based VAD with LSTM-RNN. In all
previous steps, the traditional VAD features are short-time energy and
zero-crossing rate, which have weaknesses in short pause and silence detection.
We replace the traditional VAD features with the pause model, which is learned
from short pause, silence and speech. The creation of the pause model used in the
fourth step is described in the next topic.
3.6.1 Pause model from LSTM-RNN
We use 12 hours of speech sentences from LOTUS, in which short pauses
and silences are labeled by hand. The hand labels and speech audio in the LOTUS
dataset are used to create the pause model with LSTM-RNN. The dataset is
separated into 70% for training, with the remaining 20% and 10% used for testing
and evaluation, respectively.
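The 70/20/10 split can be sketched as follows (utterance IDs and the shuffling seed are illustrative):

```python
import random

# Shuffle utterance IDs, then slice into 70% train, 20% test, 10% evaluation.
utterances = [f"utt_{i:04d}" for i in range(1000)]
random.Random(0).shuffle(utterances)

n = len(utterances)
train = utterances[: int(0.7 * n)]
test = utterances[int(0.7 * n): int(0.9 * n)]
evaluation = utterances[int(0.9 * n):]
print(len(train), len(test), len(evaluation))   # 700 200 100
```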
The creation of the pause model is shown in Figure 2.21. First, we slice
the speech audio into 20-millisecond chunks.
Figure 2.21. The creation of pause model
Then the 13-MFCC feature extraction is used to break the complex sound
wave apart into frequency bands (from low to high). This method makes an audio
chunk easier for the LSTM-RNN to process, because trying to recognize speech
patterns from raw audio is difficult and the LSTM-RNN takes a long time to learn.
The output of the MFCCs is illustrated in Figure 2.22.
Figure 2.22. The result of MFCCs
Figure 2.22 presents the feature data from the 13 MFCCs. Each color represents
how much energy each MFCC coefficient index has in the speech frame that we feed
into the LSTM-RNN. We map each audio chunk to a label at its time during the
utterance. The labels contain three different tags: silence, short pause, and
speech.
Figure 2.23. The LSTM-RNN structure of pause model
The LSTM-RNN fits the model by approximating the gradient of the loss function
with respect to the neuron weights. The approximation looks at only a small
subset of the data, known as a "mini-batch". Each mini-batch contains 200 MFCC
feature frames. We feed 20 milliseconds of MFCC features to the LSTM-RNN one at
a time; this technique is called a "time-step", a concept of time instances
corresponding to a time series or sequence. Once each mini-batch is completed, we
immediately calculate the gradient of the loss function, and so on.
The output of the LSTM-RNN is a binary classification [0, 1] that represents the
probability of pause or speech for each MFCC frame. An output of zero (0) is a
pause, while an output of one (1) is speech. The LSTM-RNN structure is presented
in Figure 2.23 and the specification of the LSTM-RNN is shown in Table 2.3.
Table 2.3. LSTM-RNN specification

Hidden layers                 1
Memory cells                  13
Initial learning rate         0.001
Weight initialization range   [-0.1, 0.1]
Decision threshold            0.5
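The specification in Table 2.3 can be sketched as a forward pass in numpy. This is an illustration under our assumptions (random untrained weights, a standard LSTM cell formulation); training with back-propagation at the stated learning rate is omitted:

```python
import numpy as np

# One hidden layer of 13 LSTM cells over 13 MFCC inputs, a sigmoid output
# node, and a 0.5 decision threshold. Weights are randomly initialized in
# [-0.1, 0.1] as specified in Table 2.3.
rng = np.random.default_rng(0)
n_in, n_cells = 13, 13
# One stacked weight matrix for the input, forget, output and cell gates.
Wx = rng.uniform(-0.1, 0.1, (4 * n_cells, n_in))
Wh = rng.uniform(-0.1, 0.1, (4 * n_cells, n_cells))
b = np.zeros(4 * n_cells)
w_out = rng.uniform(-0.1, 0.1, n_cells)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    """One time-step: gates regulate what the memory cell writes and reads."""
    z = Wx @ x + Wh @ h + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # update the cell state
    h = sigmoid(o) * np.tanh(c)                    # expose the hidden state
    return h, c

# Classify each 20 ms MFCC frame of a short sequence as pause (0) or speech (1).
h = c = np.zeros(n_cells)
labels = []
for x in rng.normal(size=(10, n_in)):              # 10 frames of 13 MFCCs
    h, c = lstm_step(x, h, c)
    labels.append(int(sigmoid(w_out @ h) >= 0.5))  # decision threshold 0.5
print(len(labels))   # 10
```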
EXPERIMENT
In this proposed work, we propose a new VAD algorithm called short pause based
VAD. We divide the experiment into four steps in order to understand the use of
the short pause in VAD and the research problems step by step.
4.1. Experiments
4.1.1 The first step
In the first step, we apply the short pause algorithm to the
traditional VAD to build the short pause based VAD. We then find the minimum
pause length that suits the algorithm. The length of the short pause varies from
20 to 99 ms, in the ranges 20-39, 40-59, 60-79 and 80-99 milliseconds,
respectively (the ranges of the short pause arise from framing at 20
milliseconds). Finally, we include the traditional VAD setting, which uses only
silence at 100 milliseconds, in the short pause based VAD. This experiment
determines how the length of the short pause affects the waiting time and the
WER in ASR.
4.1.2 The second step
This step is called short pause based VAD with endpoint decision.
We apply the endpoint decision module to the short pause based VAD from the first
step. The endpoint decision is used to monitor and minimize the use of short
pauses and to minimize WER. This step mainly aims to maintain the accuracy of the
ASR result.
4.1.3 The third step
This experiment shows a further improvement of short pause based
VAD. We apply the padding silence module to short pause based VAD, which reduces
the WER caused by short pause. This step also aims to maintain the accuracy of the
ASR result in short pause based VAD.
4.1.4 The fourth step
Lastly, the experiment on short pause based VAD with LSTM-RNN is
the last step. We replace the feature extraction, consisting of short-time energy and
zero-crossing rate, with a pause model. The pause model is a speech and non-speech
classifier trained with an LSTM-RNN. We propose the pause model to overcome the
weakness of short pause detection in a noisy environment. Finally, we compare all
previous steps with traditional VAD on average waiting time and WER.
4.2. Experiment settings
To provide a standard setting across the experiments, we fix the thresholds
for short-time energy (T_E) and zero-crossing rate (T_Z) at T_E = 50000 and T_Z = 20,
respectively. These thresholds are determined by observation and literature review
(Guo & Li, 2010).
The experiment dataset and the measurements are described in the
following sections. All of the experiments are measured by the average waiting time
and the WER of the ASR result.
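The two thresholded features can be sketched as follows. This is a minimal sketch: the exact energy normalization used in the thesis is not shown in the text, so the mean-square form here is one common choice and an assumption on our part.

```python
import numpy as np

T_E = 50000  # short-time energy threshold from the experiment settings
T_Z = 20     # zero-crossing rate threshold from the experiment settings

def short_time_energy(frame):
    """Mean squared amplitude of one frame (normalization is assumed)."""
    x = np.asarray(frame, dtype=np.float64)
    return float(np.mean(x * x))

def zero_crossing_rate(frame):
    """Number of sign changes within one frame."""
    x = np.asarray(frame, dtype=np.float64)
    return int(np.sum(np.abs(np.diff(np.sign(x))) / 2))
```

In the dual-threshold scheme, frames with energy above T_E are treated as speech, while the zero-crossing rate refines the boundaries around low-energy unvoiced sounds.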
4.3. Experiment dataset
In the experiment, 20 minutes of the LOTUS-BN dataset (Chotimongkol,
Saykhum, Chootrakool, Thatphithakkul, & Wutiwiwatchai, 2009) is used to measure the
average waiting time of the ASR result. The LOTUS-BN dataset is a Thai television
broadcast news corpus, which is a good resource for investigating the waiting time of
the ASR result because its rate of speech is higher than natural speech (approximately
196 words per minute). This makes it ideal for analyzing the waiting time caused by
VAD. The audio signal has a 16 kHz sampling rate and is encoded in 16-bit Microsoft
PCM format. Please note that the sampling rate of the speech audio is reduced to
8 kHz in this experiment.
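The 16 kHz to 8 kHz rate reduction can be sketched as a decimation by a factor of two. This naive version (every other sample) is only an illustration; a production pipeline would low-pass filter first, e.g. with scipy.signal.decimate, to avoid aliasing, and the thesis does not state which method was used.

```python
import numpy as np

def downsample_16k_to_8k(signal_16k):
    """Naive decimation by 2: keep every other sample of a 16 kHz signal
    to obtain an 8 kHz signal. No anti-aliasing filter is applied here."""
    return np.asarray(signal_16k)[::2]
```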
4.4. Measurement
4.4.1 Average waiting time
We measure the average waiting time for the ASR results using the
detection time of VAD, the upload time, and the recognition time. The equation of
average waiting time is presented below.

WT_avg = ( Σ_{i=1}^{n} ( D_i + U_i + (C_i × RTF) ) ) / n        (4.1)

where D_i is the detection time of VAD, U_i is the upload time of the speech segment,
C_i is the duration of the speech segment, and RTF is the processing-time factor of
automatic speech recognition (also known as the real-time factor). Each variable in
the equation is explained in the illustration below.
Figure 4.1. The description of the waiting time equation
The RTF value in this experiment is fixed at 1 to control for external factors:
there are many speech recognition service providers, and each provider has a
different real-time factor.
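Equation 4.1 can be computed directly; the sketch below (the function name is ours) takes the per-segment times as tuples.

```python
def average_waiting_time(segments, rtf=1.0):
    """Average waiting time per equation 4.1. `segments` is a list of
    (D_i, U_i, C_i) tuples: VAD detection time, upload time, and speech
    segment duration, all in seconds. `rtf` is the ASR real-time factor
    (fixed at 1 in the experiments)."""
    n = len(segments)
    return sum(d + u + c * rtf for d, u, c in segments) / n
```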
4.4.2 Word error rate (WER)
To measure the effectiveness of each step of the proposed
algorithm, we study two factors: average waiting time and WER. For accuracy, we
apply the Word Error Rate (WER) to the ASR results. WER is the number of
substitution, deletion and insertion errors over the number of reference words, as in
equation 4.2.

WER = (Substitutions + Deletions + Insertions) / (Substitutions + Deletions + Correct)        (4.2)

Note that the denominator equals the number of words in the reference. WER is a
standard measure in speech recognition (Jelinek, 1998) and is used to check the
accuracy of an ASR engine. It works by calculating the edit distance between the
result from ASR (hypothesis) and the answer text (reference). In the alignment
process, three types of errors can be distinguished:
- substitutions (SUB)
- deletions (DEL)
- insertions (INS)
Table 4.1 shows an example of an alignment produced from example data.
The reference is marked as REF, and the output produced by the ASR as HYP. The
symbols SUB, INS, and DEL mark the types of error used in the WER calculation. This
example comparison yields a WER of 38.46%.
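The alignment and the WER of equation 4.2 can be computed with a word-level edit distance. The sketch below is a standard dynamic-programming implementation (the function name is ours); the example data behind Table 4.1 is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """WER per equation 4.2 via word-level edit distance: the minimum number
    of substitutions, deletions and insertions needed to turn the hypothesis
    into the reference, divided by the number of reference words
    (= substitutions + deletions + correct)."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion (DEL)
                          d[i][j - 1] + 1,       # insertion (INS)
                          d[i - 1][j - 1] + sub) # substitution (SUB) or match
    return d[len(ref)][len(hyp)] / len(ref)
```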
Table 4.1. Alignment and types of error for test phrase
EXPERIMENTAL RESULTS AND DISCUSSIONS

In this chapter, we describe the results of the proposed work step by step.
In the first step, the short pause algorithm is applied to traditional VAD; here we
demonstrate the possibility of using short pause to reduce the waiting time of the ASR
result. Afterward, endpoint decision and padding silence are added in the second and
third steps to reduce the WER caused by short pause. Next, we address the weakness
of the short pause algorithm in a noisy environment by applying the pause model, a
speech and non-speech classifier trained with an LSTM-RNN. Finally, the result of the
fourth step is compared with all previous steps and traditional VAD.
Our hypothesis is that when we use short pause instead of silence as an
endpoint of speech, there is a risk of dismissing unvoiced speech, such as the sounds
of the letters Ch, F, K, P, S, Sh, T, and Th, because these sounds are statistically similar
to background noise. Therefore, studying short pauses of different lengths (described
in section 4.1.1) is necessary to understand the trade-off between average waiting time
and WER. The result of the first step is shown below.
From Table 5.1, the experimental result shows that various lengths of short
pause affect WER differently (the range of each setting follows from framing at 20
milliseconds). The average chunk duration and waiting time are both related to the
minimum pause. Using short pause as an endpoint decreases the average waiting time
of the ASR result significantly, by up to 71.7% with a minimum pause of 20
milliseconds. Unfortunately, that setting also increases WER by 35.4 percentage points
compared with traditional VAD.
Table 5.1. The average waiting time and WER of ASR result by short pause based
VAD compared with traditional VAD

Algorithm                          | Minimum pause (ms) | Avg. chunk duration (s) | Avg. waiting time (s) | Reduced waiting time (%) | WER (%)
Traditional VAD                    | 100 | 2.14 | 4.70 | -    | 26.8
Short Pause based VAD              | 20  | 0.46 | 1.33 | 71.7 | 62.2
(the 1st step)                     | 40  | 0.62 | 1.65 | 64.9 | 49.8
                                   | 60  | 0.91 | 2.23 | 52.6 | 38.9
                                   | 80  | 1.44 | 3.28 | 30.2 | 31.0
                                   | 100 | 2.14 | 4.67 | 0.6  | 26.4
However, the setting nearest to an acceptable WER is a minimum pause of
80 milliseconds: it reduces the average waiting time by 30.2% while increasing WER by
4.2 percentage points compared with traditional VAD. Note that the endpoint decision
and padding silence modules are not included in this step. As shown above, using
short pause in VAD can reduce the average waiting time of the ASR result nicely, but
the experimental result from the first step also shows that short pauses clearly
increase WER. This is because they risk dismissing unvoiced speech or splitting the
middle of a word, as in "ab | solutely", where the vertical bar (|) marks a false-alarm
position whose energy is as low as a short pause or silence. As a result, ASR fails to
recognize such segments correctly in real-time captioning.
Moreover, the small chunks of speech are another factor that affects the
accuracy of ASR. In general, ASR uses nearby words to improve the probability
estimate of weak parts of the speech signal within a sentence; this technique
improves the overall performance of speech recognition. When a sentence is
separated using short pause as an endpoint, VAD divides the speech frequently and
the speech becomes small chunks, so ASR cannot use the nearby-word technique to
improve the probability of weak speech signals. This problem directly affects the
accuracy of speech recognition. Hence, we need to maintain both the efficiency and
the accuracy of the ASR result.
We therefore propose a method to minimize the number of short pauses
used as endpoints. Endpoint decision is applied in the short pause based VAD
algorithm (described in section 4.1.2). This method considers the delay caused by
searching for silence and minimizes the number of short pauses used as endpoints,
with the aim of minimizing the WER of ASR. The experimental result after adding
endpoint decision is presented in Table 5.2.
Table 5.2 presents the performance of the second step, which applies the
endpoint decision to short pause based VAD. Using short pause as an endpoint only
when needed reduces WER well. The experimental result shows that with endpoint
decision the waiting time is still reduced by at least 22.8% (using a minimum pause of
80 milliseconds), while WER increases by only 2.9 percentage points, the lowest
acceptable WER increase in this experiment compared with traditional VAD. Together,
the first and second steps show that using short pauses excessively degrades WER
rapidly, and that applying endpoint decision to minimize the number of short pauses
reduces WER significantly. Still, we need to maintain the accuracy of the ASR result as
much as possible.
Table 5.2. The average waiting time and WER of ASR result by short pause based
VAD with endpoint decision compared with the previous step and traditional VAD

Algorithm                          | Minimum pause (ms) | Avg. chunk duration (s) | Avg. waiting time (s) | Reduced waiting time (%) | WER (%)
Traditional VAD                    | 100 | 2.14 | 4.70 | -    | 26.7
Short Pause based VAD              | 20  | 0.46 | 1.33 | 71.7 | 62.2
(the 1st step)                     | 40  | 0.62 | 1.65 | 64.9 | 49.8
                                   | 60  | 0.91 | 2.23 | 52.6 | 38.9
                                   | 80  | 1.44 | 3.28 | 30.2 | 31.0
                                   | 100 | 2.14 | 4.67 | 0.6  | 26.4
Short Pause based VAD              | 20  | 0.65 | 1.72 | 63.5 | 55.0
with endpoint decision             | 40  | 0.84 | 2.08 | 55.9 | 45.6
(the 2nd step)                     | 60  | 1.19 | 2.79 | 40.7 | 35.7
                                   | 80  | 1.65 | 3.63 | 22.8 | 29.7
                                   | 100 | 2.14 | 4.67 | 0.7  | 26.7
Table 5.3. The average waiting time and WER of ASR result by short pause based
VAD with padding silence compared with the previous steps and traditional VAD

Algorithm                          | Minimum pause (ms) | Avg. chunk duration (s) | Avg. waiting time (s) | Reduced waiting time (%) | WER (%)
Traditional VAD                    | 100 | 2.14 | 4.70 | -    | 26.7
Short Pause based VAD              | 20  | 0.46 | 1.33 | 71.7 | 62.2
(the 1st step)                     | 40  | 0.62 | 1.65 | 64.9 | 49.8
                                   | 60  | 0.91 | 2.23 | 52.6 | 38.9
                                   | 80  | 1.44 | 3.28 | 30.2 | 31.0
                                   | 100 | 2.14 | 4.67 | 0.6  | 26.4
Short Pause based VAD              | 20  | 0.65 | 1.72 | 63.5 | 55.0
with endpoint decision             | 40  | 0.84 | 2.08 | 55.9 | 45.6
(the 2nd step)                     | 60  | 1.19 | 2.79 | 40.7 | 35.7
                                   | 80  | 1.65 | 3.63 | 22.8 | 29.7
                                   | 100 | 2.14 | 4.67 | 0.7  | 26.7
Short Pause based VAD              | 20  | 0.68 | 1.75 | 62.8 | 47.7
with padding silence               | 40  | 0.86 | 2.09 | 55.6 | 39.2
(the 3rd step)                     | 60  | 1.21 | 2.81 | 40.2 | 31.7
                                   | 80  | 1.68 | 3.71 | 21.2 | 27.6
                                   | 100 | 2.14 | 4.70 | 0.2  | 26.3
Afterward, in the third step (described in section 4.1.3), we additionally
add the padding silence module to short pause based VAD. This step mainly aims to
reduce the WER of the ASR result. The comparisons of WER and average waiting time
are presented in Table 5.3.
The result of the third step shows that the average waiting time and the
average chunk duration increase slightly, because adding 100 milliseconds of silence
lengthens the speech, which also affects the recognition time of ASR. Short pause
based VAD with padding silence reduces the average waiting time by 21.2%, less than
the first and second steps, while WER increases by only 0.9 percentage points
compared with traditional VAD. We note that by adding an appropriate length of
silence at the front or back of a short pause in this step, ASR can recover missing
unvoiced speech, improving the accuracy of the ASR result.
However, in the previous steps we found a limitation of both traditional
VAD and short pause based VAD in noisy environments. When we tested both VADs
on the office-noise recordings contained in the testing set, we found that short pause
based VAD delays the ASR result by up to 35 seconds, while traditional VAD delays it
by up to 43 seconds. The problem stems from the fixed threshold used by the short-
time energy feature: the feature calculates an energy value for each frame and
decides between speech and non-speech using that fixed threshold.
Figure 5.1. An example sentence in which neither short pause based VAD nor
traditional VAD can find silence or a short pause
Figure 5.2. The labeled silences and short pauses in a noisy speech sentence
As illustrated in Figure 5.1, the straight line represents the start point and
the dashed line represents the endpoint detected by short pause based VAD. The
result shows that neither the proposed work nor traditional VAD can find an endpoint
of speech, since the noise energy is higher than the fixed threshold value. Deciding
short pause and silence with a fixed threshold is therefore not precise. This problem
degrades the short pause detection ability of short pause based VAD, which directly
delays the ASR result to the user.
Moreover, we investigated the fixed-threshold problem against a manual
label. The sentence actually contains 2 segments of silence and 3 segments of short
pause that cannot be used as endpoints due to noise, as shown in Figure 5.2.
Therefore, we propose the fourth step (described in section 4.1.4). Short
pause based VAD with LSTM-RNN is created to overcome the weakness of short pause
and silence detection mentioned above. The feature extractions in short pause based
VAD are replaced with a pause model trained with an LSTM-RNN. The pause model is
trained on a noisy dataset to enhance the short pause algorithm and the overall
efficiency of short pause based VAD.
Figure 5.3. The speech segments in office noise detected by short pause based VAD
with LSTM-RNN
Figure 5.3 shows the performance of the pause model in detecting short
pauses in a sentence, with a minimum pause of 60 milliseconds. The straight lines
represent start points and the dashed lines represent endpoints detected by short
pause based VAD with LSTM-RNN. The figure shows that short pause based VAD with
LSTM-RNN can detect short pause even in an office noise environment (A). Hence, it
performs better than short-time energy with a fixed threshold, which is sensitive to
noisy environments.
The result of the fourth step is shown in Table 5.4. Short pause based VAD
with LSTM-RNN reduces the average waiting time by up to 17.1%. In addition, the
fourth step reduces the WER of the ASR result by up to 1.5% and 2.29% compared
with traditional VAD and the third step, respectively.
In the first step, short pauses of small lengths reduce the average waiting
time well. The fourth step can reach a minimum pause of 60 milliseconds, whereas all
previous steps can only use a short pause of 80 milliseconds. However, the average
waiting time in the fourth step is not reduced as much as it might be: when a pause
model is used instead of short-time energy, the model is able to detect the smallest
characteristics in a speech frame, such as unvoiced speech or speech in noise. The
detected speech segments therefore become longer and the silences shorter
compared with all previous steps and traditional VAD. This is why the fourth step
achieves a smaller reduction in average waiting time but a lower WER than the
previous steps.
Table 5.4. The average waiting time and WER of ASR result by short pause based
VAD with LSTM-RNN compared with all previous steps and traditional VAD

Algorithm                          | Minimum pause (ms) | Avg. chunk duration (s) | Avg. waiting time (s) | Reduced waiting time (%) | WER (%)
Traditional VAD                    | 100 | 2.14 | 4.70 | -     | 26.7
Short Pause based VAD              | 20  | 0.46 | 1.33 | 71.7  | 62.2
(the 1st step)                     | 40  | 0.62 | 1.65 | 64.9  | 49.8
                                   | 60  | 0.91 | 2.23 | 52.6  | 38.9
                                   | 80  | 1.44 | 3.28 | 30.2  | 31.0
                                   | 100 | 2.14 | 4.67 | 0.6   | 26.4
Short Pause based VAD              | 20  | 0.65 | 1.72 | 63.5  | 55.0
with endpoint decision             | 40  | 0.84 | 2.08 | 55.9  | 45.6
(the 2nd step)                     | 60  | 1.19 | 2.79 | 40.7  | 35.7
                                   | 80  | 1.65 | 3.63 | 22.8  | 29.7
                                   | 100 | 2.14 | 4.67 | 0.7   | 26.7
Short Pause based VAD              | 20  | 0.68 | 1.75 | 62.8  | 47.7
with padding silence               | 40  | 0.86 | 2.09 | 55.6  | 39.2
(the 3rd step)                     | 60  | 1.21 | 2.81 | 40.2  | 31.7
                                   | 80  | 1.68 | 3.71 | 21.2  | 27.6
                                   | 100 | 2.14 | 4.70 | 0.2   | 26.3
Short Pause based VAD              | 20  | 1.04 | 2.39 | 49.3  | 37.7
with LSTM-RNN                      | 40  | 1.32 | 2.96 | 37.2  | 31.8
(the 4th step)                     | 60  | 1.77 | 3.90 | 17.1  | 25.3
                                   | 80  | 2.47 | 5.26 | -11.8 | 22.6
                                   | 100 | 3.12 | 6.55 | -39.3 | 19.8
Figure 5.4. The reduction in average waiting time versus the minimum pause setting
in short pause based VAD
Figure 5.5. The comparison of WER and minimum pause setting in short pause based
VAD
Figures 5.4 and 5.5 show the average waiting time and WER for all of the
proposed steps. Dotted lines represent the first step (short pause based VAD); dashed
lines the second step (short pause based VAD with endpoint decision); dot-dashed
lines the third step (short pause based VAD with padding silence); and straight lines
the fourth step (short pause based VAD with LSTM-RNN).
We note that the first step achieves a good average waiting time, but its
WER degrades rapidly compared with the other steps. Although endpoint decision in
the second step increases the average waiting time, it reduces WER compared with
the first step. In the third step, padding silence slightly increases the average waiting
time but significantly reduces the WER caused by short pause. Using a minimum pause
of 80 milliseconds in the fourth step increases the average waiting time considerably,
because the average chunk duration is lengthened by unvoiced speech or speech in
noise while the silences shrink; however, this enables the VAD to detect short pause
and silence more accurately. The processes of all proposed steps are illustrated in
Figure 5.6.
Figure 5.6. The flow diagram of short pause based VAD in the final step
Figure 5.4 shows that using a minimum pause of 60 milliseconds in the
fourth step (Figure 5.6) reduces the average waiting time less than the other steps:
from 4.7 to 3.9 seconds, i.e., only 17.1%, compared with reductions of 30.2%, 22.8%
and 21.2% for the first, second and third steps, respectively. But the fourth step with
a minimum pause of 60 milliseconds reduces WER to 25.3%, better than traditional
VAD and all previous steps. Hence, the fourth step is effective at reducing the waiting
time while maintaining the WER of the ASR result, which makes it suitable for use in
an automatic captioned relay service.
Figure 5.7. The frequency and waiting time of Short Pause based VAD with LSTM-RNN
compared with traditional VAD
In addition, Figure 5.7 shows the distribution of waiting time for ASR
results. The straight line and the dashed line represent the frequency of waiting time
for short pause based VAD with LSTM-RNN (the fourth step) and traditional VAD,
respectively. We use a minimum pause of 60 milliseconds, which gives the best result
of the fourth step.
The graph shows that short pause based VAD with LSTM-RNN delivers an
ASR result every 3.9 seconds on average, and the spread of caption waiting times is
smaller than with traditional VAD. With the average waiting time reduced, the ASR
results are more continuous and regular, which directly improves the user experience.
CONCLUSION

A high speaking rate is common in continuous speech. The average rate of
speech is approximately 150 wpm (NCVS, 2007), and speech at a high rate contains
less silence, where silence is a non-speech portion longer than 100 milliseconds. When
the traditional VAD is applied to the automatic captioned relay service, it separates
continuous speech into long speech segments, which ASR then takes a long time to
transcribe: the experimental result shows an average waiting time of 4.7 seconds for
the ASR result, since traditional VAD commonly uses silence as the endpoint of
speech. This delay directly harms the user experience and prevents a real-time
captioning service.
In this work, we reduce the waiting time of the ASR result by using short
pause as an endpoint. We investigated the frequency of short pause in speech
sentences and found that short pauses usually occur every 0.64 seconds, while
silences occur every 1.14 seconds. Based on this, we divided the proposed work into
four experimental steps. In the first step, we show the improvement over traditional
VAD. The proposed algorithm, called short pause based VAD, detects short pause and
silence simultaneously and uses the detected short pause or silence as an endpoint
to reduce the waiting time. In this step, we aim to reduce the waiting time and to study
the trade-off between average waiting time and WER. The experimental result of the
first step shows that using short pause as an endpoint greatly reduces the average
waiting time, but using short pauses in the lower ranges increases WER rapidly.
In the second step, we maintain the accuracy of the ASR result from the
first step by applying the endpoint decision module to short pause based VAD. The
endpoint decision monitors and minimizes the use of short pause in order to minimize
WER. The experimental result shows that the endpoint decision still reduces the
average waiting time by at least 22.8%, while WER increases by only 2.9 percentage
points, the lowest acceptable WER increase in this proposed work compared with
traditional VAD. The result of the second step also shows that only the minimum
pause of 80 milliseconds can be used to reduce the average waiting time.
Afterward, we apply the padding silence module to short pause based VAD
in the third step, which also mainly aims to reduce the WER of the ASR result. The
result shows that the average waiting time and the average chunk duration increase
slightly. Short pause based VAD with padding silence reduces the average waiting time
by 21.2%, less than the first and second steps, while WER increases by only 0.9
percentage points compared with traditional VAD. We note that by adding an
appropriate length of silence at the front or back of a short pause, ASR can recover
missing unvoiced speech, improving accuracy and reducing the WER of the ASR result.
Finally, we explained the weakness of short pause and silence detection
in an office-noise environment. The feature extractions in short pause based VAD were
therefore replaced with the pause model trained with an LSTM-RNN. The result of the
fourth step shows that the pause model solves this weakness: it reduces the average
waiting time by at least 17.1%, while WER drops to 25.3%, the lowest WER in the
experiments. Short pause based VAD with LSTM-RNN delivers an ASR result in 3.9
seconds on average, compared to 4.7 seconds for traditional VAD. With the average
waiting time of the ASR result reduced, the captions are more continuous and regular,
which directly improves the user experience.
REFERENCES
Books
Jelinek, F. (1998). Statistical methods for speech recognition. Cambridge, MA: MIT
Press.
Articles
Chotimongkol, A., Saykhum, K., Chootrakool, P., Thatphithakkul, N., & Wutiwiwatchai,
C. (2009). LOTUS-BN: A Thai broadcast news corpus and its research
applications. 2009 Oriental COCOSDA International Conference on Speech
Database and Assessments, ICSDA 2009, 44–50.
https://doi.org/10.1109/ICSDA.2009.5278377
Dave, N. (2013). Feature Extraction Methods LPC , PLP and MFCC In Speech
Recognition. International Journal for Advance Research in Engineering and
Technology, 1(Vi), 1–5.
Eyben, F., Weninger, F., Squartini, S., & Schuller, B. (2013). Real-life voice activity
detection with LSTM Recurrent Neural Networks and an application to
Hollywood movies. In IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP) (pp. 483–487).
https://doi.org/10.1109/ICASSP.2013.6637694
Guo, Q., & Li, N. (2010). An Improved Dual-threshold Speech Endpoint Detection
Algorithm. In Computer and Automation Engineering (ICCAE) (pp. 123–126).
Singapore.
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural
Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Hughes, T., & Mierle, K. (2013). Recurrent Neural Networks for Voice Activity
Detection. Acoustics, Speech and Signal Processing (ICASSP), 7378–7382.
https://doi.org/10.1109/ICASSP.2013.6639096
Jia, C., & Xu, B. (2002). An improved entropy-based endpoint detection algorithm. In
International Symposium on Chinese Spoken Language Processing, 1(1), 1–4.
Kasuriya, S., Sornlertlamvanich, V., Cotsomrong, P., Kanokphara, S., & Thatphithakkul,
N. (2003). Thai speech corpus for Thai speech recognition. Proceedings of
Oriental COCOSDA, (January), 54–61.
Kim, J., Kim, J., Lee, S., Park, J., & Hahn, M. (2016). Vowel based Voice Activity
Detection with LSTM Recurrent Neural Network. Proceedings of the 8th
International Conference on Signal Processing Systems - ICSPS 2016, 134–137.
https://doi.org/10.1145/3015166.3015207
Li, Q., Zheng, J., Tsai, A., & Zhou, Q. (2002). Robust endpoint detection and energy
normalization for real-time speech and speaker recognition. IEEE Transactions
on Speech and Audio Processing, 10(3), 146–157.
https://doi.org/10.1109/TSA.2002.1001979
Mermelstein, P. (1976). Distance measures for speech recognition, psychological and
instrumental. Pattern Recognition and Artificial Intelligence.
Misra, A. (2012). Speech / Nonspeech Segmentation in Web Videos. Proceedings of
InterSpeech 2012.
Moattar, M., & Homayounpour, M. (2009). A simple but efficient real-time voice
activity detection algorithm. European Signal Processing Conference
(EUSIPCO), (Eusipco), 2549–2553. https://doi.org/10.1007/978-1-4419-1754-6
Pal, P. K., & Phadikar, S. (2015). Modified energy based method for word endpoints
detection of continuous speech signal in real world environment. In Research
in Computational Intelligence and Communication Networks (ICRCICN) (pp.
381–385). https://doi.org/10.1109/ICRCICN.2015.7434268
Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., … Ward, R. (2016). Deep
Sentence embedding using long short-term memory networks: Analysis and
application to information retrieval. IEEE/ACM Transactions on Audio Speech
and Language Processing, 24(4), 694–707.
https://doi.org/10.1109/TASLP.2016.2520371
Pang, J. (2017). Spectrum energy based voice activity detection. In Computing and
Communication Workshop and Conference (CCWC) (pp. 1–5).
Pheraniti, T. (2008). A Speech/Non-Speech Detection System for Automatic Speech
Recognition. In National Computer Science and Engineering Conference
(NCSEC) (pp. 755–759).
Podder, P., Khan, Zaman, T., & Haque Khan, M. (2014). Comparative Performance
Analysis of Hamming, Hanning and Blackman Window. International Journal
of Computer Applications, 96(18), 1–7.
Rabiner, L. R., & Sambur, M. R. (1975). An Algorithm for Determining the Endpoints of
Isolated Utterances. Bell System Technical Journal, 54(2), 297–315.
https://doi.org/10.1002/j.1538-7305.1975.tb02840.x
Ryant, N., Liberman, M. Y., Yuan, J., Ryant, N., Liberman, M. Y., & Yuan, J. (2013).
Speech Activity Detection on YouTube Using Deep Neural Networks.
Proceedings of Interspeech, 728–731.
Shen, J., Hung, J., & Lee, L. (1998). Robust Entropy-based Endpoint Detection for
Speech Recognition in Noisy Environments. In 5th International conference
ICSLP ’98 (p. 4).
Tashev, I., & Mirsamadi, S. (2016). DNN-based Causal Voice Activity Detector.
Information Theory and Applications Workshop.
Wang, X., & Qu, L. (2014). The Self-adaptive Voice Activity Detection Algorithm based
on time-frequency Parameters. In The Open Automation and Control Systems
Journal (pp. 1661–1668).
Wu, S. L., Kingsbury, B. E. D., Morgan, N., & Greenberg, S. (1998). Incorporating
information from syllable-length time scales into automatic speech
recognition. In International Conference on Acoustics, Speech and Signal
Processing (ICASSP) (Vol. 2, pp. 721–724).
https://doi.org/10.1109/ICASSP.1998.675366
Yali, C., Dongsheng, L., Shuo, J., & Xuefen, N. (2014). A Speech Endpoint Detection
Algorithm Based on Wavelet Transforms. In Control and Decision Conference
(CCDC) (pp. 3010–3012).
Yao, K., Peng, B., Zhang, Y., Yu, D., Zweig, G., & Shi, Y. (2014). Spoken language
understanding using long short-term memory neural networks. In Spoken
Language Technology Workshop (SLT) (pp. 189–194).
https://doi.org/10.1109/SLT.2014.7078572
Zhang, Z., & Junqin, H. (2015). An adaptive voice activity detection algorithm.
International Journal on Smart Sensing and Intelligent Systems, 8(4), 2175–
2194.
Electronic Media
NCVS. (2007). Voice Qualities. Retrieved November 1, 2017, from
http://www.ncvs.org/ncvs/tutorials/voiceprod/tutorial/quality.html
Olah, C. (2015). Understanding LSTM Networks. Retrieved September 23, 2017, from
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Jones, S. (2017). Captioned phones. Retrieved August 29, 2017, from
http://www.healthyhearing.com/help/assistive-listening-devices/captioned-
phones
APPENDICES
APPENDIX A
IMPLEMENTATION OF SHORT PAUSE BASED VAD
We implemented Short Pause based VAD using the Visual Studio Code editor
and Python 3. The following third-party Python libraries are required: NumPy,
TensorFlow, python_speech_features, SciPy, Keras and PycURL.
The short pause based VAD class below classifies speech/non-speech
segments. Its state diagram is presented in Figure 2.16.
class ShortPause_VAD(Base_VAD):
    def __init__(self, maxsp=5, **kwargs):
        Base_VAD.__init__(self, **kwargs)
        self.feature = FeatureExtractor()
        self.ste_threshold = 50000
        self.zcr_threshold = 20
        self.maxsp = maxsp
        self.margin = 0
        self.latest_sp = None
        self.latest_end = 0

    # Override
    def reset_variables(self):
        self.begin = None
        self.latest_sp = None
        self.state = 0
        self.count = 0
        self.sil = 0

    # Override
    def decision(self, i, X):
        Base_VAD.decision(self, i, X)
        en = self.feature.short_term_energy(X)
        if self.state == 0:
            if en > self.ste_threshold:
                self.state = 1
                self.count += 1
            else:
                # loop in silence
                self.reset_variables()
        elif self.state == 1:
            if en > self.ste_threshold:
                self.count += 1
                if self.count >= self.ensure_speech:
                    self.state = 2
                    position = i - self.count
                    self.begin = self.zcr_beginning_search(position, self.latest_end)
                    self.begin = self.begin - self.margin
            else:
                self.reset_variables()
        elif self.state == 2:
            if en > self.ste_threshold:
                self.count += 1
                self.sil = 0
            else:
                self.state = 3
                self.sil += 1
        elif self.state == 3:
            if en >= self.ste_threshold:
                if self.sil >= self.maxsp:
                    self.consider_sp((i, self.sil))
                self.state = 2
                self.count = self.count + self.sil + 1
                self.sil = 0
            else:
                self.sil += 1
                if (self.sil >= self.maxsil) and self.count >= self.ensure_speech:
                    position = i
                    position = self.zcr_ending_search(position)
                    position = position + self.margin
                    self.cut(self.begin, position)
                    self.latest_end = position
                    self.reset_variables()
                    return
        self.endpoint_decision()

    def cut_at_latest(self):
        position = self.latest_sp[0]
        self.padding_silence_end = True
        self.cut(self.begin, position)
        new_begin = self.latest_sp[0] - self.latest_sp[1]
        self.reset_variables()
        self.latest_sp = None
        self.begin = new_begin
        self.state = 2
        self.padding_silence_begin = True

    def consider_sp(self, sp_tuple):
        self.latest_sp = sp_tuple
Endpoint decision function (proposed in section 3.4.1) for monitoring and
minimizing the use of short pauses and minimize real-time captioning errors.
def endpoint_decision(self): if self.begin and self.state != 3: if self.latest_sp and self.avg_chunk and i - self.begin >= self.avg_chunk: self.cut_at_latest()
The function that generates 100 milliseconds of silence, used in the silence padding module.
    def get_100ms_silence(self):
        # generate 100 ms of silence
        target = int(self.fs * 0.1)
        shape = (target,)
        return numpy.zeros(shape)
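As a sketch of what this produces: at a sampling rate of 16 kHz (an assumption here; the class reads the actual rate from self.fs), 100 milliseconds corresponds to 1,600 zero-valued samples, which the padding step concatenates onto a signal chunk with numpy.hstack:

```python
import numpy

fs = 16000                        # assumed sampling rate; the class uses self.fs
target = int(fs * 0.1)            # 100 ms -> 1600 samples
silence = numpy.zeros((target,))

# Padding silence in front of a dummy speech chunk, as padding_silence does
chunk = numpy.ones(480, dtype=numpy.int16)
padded = numpy.array(numpy.hstack((silence, chunk)), dtype=numpy.int16)
```

The explicit int16 conversion keeps the padded signal in the same sample format as the original 16-bit audio stream.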
The function that pads 100 milliseconds of silence at the front or back of a short pause to improve recognition accuracy (proposed in Section 3.5.1).
    def padding_silence(self, signal):
        if self.plus:
            if self.padding_silence_begin:
                signal = numpy.array(
                    numpy.hstack((self.get_100ms_silence(), signal)),
                    dtype=numpy.int16)
            if self.padding_silence_end:
                signal = numpy.array(
                    numpy.hstack((signal, self.get_100ms_silence())),
                    dtype=numpy.int16)
            self.padding_silence_begin = False
            self.padding_silence_end = False
        return signal
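The listing above assumes a FeatureExtractor that provides the short-term energy per frame (and, for the boundary searches, a zero-crossing rate matching zcr_threshold). Its implementation is not reproduced in this appendix; the following is a minimal sketch using the standard textbook definitions, so the exact formulas and scaling may differ from the thesis implementation.

```python
import numpy

class FeatureExtractor:
    """Minimal sketch of the feature extractor assumed by ShortPause_VAD.

    Only the class name comes from the listing above; the formulas here
    are the conventional definitions, not necessarily those of the thesis.
    """

    def short_term_energy(self, frame):
        # Sum of squared sample amplitudes within one frame
        x = numpy.asarray(frame, dtype=numpy.float64)
        return float(numpy.sum(x * x))

    def zero_crossing_rate(self, frame):
        # Count of sign changes between consecutive samples
        x = numpy.asarray(frame, dtype=numpy.float64)
        return int(numpy.sum(numpy.abs(numpy.diff(numpy.sign(x))) > 0))
```

With 16-bit samples, a single moderately loud sample squared already exceeds the ste_threshold of 50000 used above, so the threshold effectively separates near-silent frames from voiced ones.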
APPENDIX B
IMPLEMENTATION OF SHORT PAUSE BASED VAD WITH LSTM-RNN
The short pause based VAD with LSTM-RNN class, which uses the pause model to classify silence and short pauses in each speech frame. The state diagram is presented in Figure 2.16.
class ShortPauseRNN_VAD(Base_VAD):
    def __init__(self, maxsp=5, **kwargs):
        Base_VAD.__init__(self, **kwargs)
        self.maxsp = maxsp
        self.latest_sp = None
        self.plus = True

    def set_model(self, model):
        self.model = model

    def predict(self, X):
        predict = self.model.predict([X])[0]
        speech = round(predict[1])
        return speech

    # Override
    def reset_variables(self):
        self.begin = None
        self.latest_sp = None
        self.state = 0
        self.count = 0
        self.sil = 0

    # Override
    def decision(self, i, X):
        Base_VAD.decision(self, i, X)
        if self.state == 0:
            if self.predict(X):
                self.state = 1
                self.count += 1
            else:
                self.state = 0
                self.count = 0
                self.begin = None
        elif self.state == 1:
            if self.predict(X):
                self.count += 1
                if self.count >= self.ensure_speech:
                    self.state = 2
                    self.begin = i - (self.count + self.margin)
            else:
                self.state = 0
                self.count = 0
                self.begin = None
        elif self.state == 2:
            if self.predict(X):
                self.count += 1
                self.sil = 0
            else:
                self.state = 3
                self.sil += 1
        elif self.state == 3:
            if self.predict(X):
                if self.sil >= self.maxsp:
                    self.consider_sp((i - self.sil, self.sil))
                self.state = 2
                self.count = self.count + self.sil + 1
                self.sil = 0
            else:
                self.sil += 1
                if (self.sil >= self.maxsil) and self.count >= self.ensure_speech:
                    position = (i - self.sil) + self.margin
                    self.cut(self.begin, position)
                    self.reset_variables()
                    return
        self.endpoint_decision(i)
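The class expects set_model to receive an object whose predict method returns, for each input frame, a probability pair indexed as [non-speech, speech], since the decision is taken as round(predict[1]). The stub below is a hypothetical stand-in for the LSTM-RNN pause model, shown only to illustrate that interface, not the model itself:

```python
import numpy

class ConstantPauseModel:
    """Hypothetical stand-in for the LSTM-RNN pause model.

    predict(batch) returns, per input frame, an array
    [p(non-speech), p(speech)], matching the call
    self.model.predict([X])[0] in ShortPauseRNN_VAD above.
    """

    def __init__(self, p_speech):
        self.p_speech = p_speech

    def predict(self, batch):
        return [numpy.array([1.0 - self.p_speech, self.p_speech])
                for _ in batch]

# The VAD reads the per-frame decision as round(predict[1]):
model = ConstantPauseModel(0.9)
speech = round(model.predict([numpy.zeros(160)])[0][1])
```

Any trained model object with this predict signature, such as a wrapped Keras LSTM, can be plugged in through set_model without changing the state machine.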
BIOGRAPHY
Name Mr. Kiettiphong Manovisut
Date of Birth May 31, 1991
Educational Attainment
Academic Year 2012: Computer Science, Faculty
of Informatics, Mahasarakham University (MSU),
Thailand
Work Position Software Engineer
Spinsoft Co., Ltd., Thailand
Publications
Manovisut, K., Thatphithakkul, N., & Songmuang, P. (2017). Reducing waiting time in
automatic captioned relay service using short pause in voice activity
detection. In 2017 9th International Conference on Knowledge and Smart
Technology (KST) (pp. 216-219). IEEE.