8/6/2019 Report VUV for Shifted Mar2011
1/28
UTD-REP-01 Page 1
Technical Report UTD-REP-01
Evaluation of Voiced/Unvoiced Detection Algorithms
for Frequency-Shifted Speech
Jaewook Lee and Philipos Loizou
March 2011
I. Introduction
Voiced/unvoiced classification is important for speech coding, recognition, and enhancement, and many methods have been developed to make it robust. In this
report, four feature extraction methods, the autocorrelation coefficient (AC), pre-emphasized energy
ratio (ER), zero crossing rate (ZCR), and low-to-full subband energy ratio (SR), are used for
voiced/unvoiced speech classification [3]. Otsu's method is used to select a threshold level from the
histogram of each feature. For the final decision, short-time energy (STE) with a fixed
threshold level is used for silence detection [5,6]. A semiautomatic tool for voiced/unvoiced
detection was developed to obtain reliable reference labels for testing.
Ten IEEE corpus sentences were selected, and their frequencies were shifted over the ranges 600 to 1500 Hz
and -600 to -1500 Hz, respectively, for the tests [7].
II. Algorithms for Voiced/Unvoiced Detection
A. Equations for the Four Voiced/Unvoiced Detection Features

Autocorrelation Coefficient (AC), for a frame x(m), m = 1, ..., M:

$$\mathrm{AC} = \frac{\sum_{m=1}^{M} x(m)\,x(m+1)}{\sum_{m=1}^{M} x(m+1)^{2}} \qquad (1)$$

Pre-Emphasized Energy Ratio (ER):

$$\mathrm{ER} = \frac{\sum_{m=1}^{M} \lvert x(m+1) - x(m)\rvert}{\sum_{m=1}^{M} \lvert x(m+1)\rvert} \qquad (2)$$

Zero Crossing Rate (ZCR):

$$\mathrm{ZCR} = \sum_{m=1}^{M} I\big(x(m)\,x(m+1) < 0\big) \qquad (3)$$

where $I(A)$ is the indicator function, which is 1 if the argument A is true and 0 otherwise.

Low-to-Full Subband Energy Ratio (SR):

$$\mathrm{SR} = \frac{\sum_{m=1}^{M} x_{L}(m)^{2}}{\sum_{m=1}^{M} x(m)^{2}} \qquad (4)$$

where $x_{L}(m)$ is the speech signal low-pass filtered at 3 kHz.
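As a concrete illustration, the frame-based features above can be sketched in Python. The function name and layout here are mine, not from the report; the sketch mirrors the MATLAB listings in the appendix (non-overlapping windows with a one-sample lookahead) and omits SR, which additionally needs a 3 kHz low-pass filter:

```python
import numpy as np

def frame_features(x, win):
    """Per-frame AC (Eq. 1), ER (Eq. 2), ZCR (Eq. 3), and STE (Eq. 6).

    Frames are non-overlapping windows of `win` samples; each feature
    compares the frame x(m) with its one-sample lookahead x(m+1).
    """
    x = np.asarray(x, dtype=float)
    n = (len(x) - 1) // win            # leave room for the lookahead sample
    ac = np.empty(n); er = np.empty(n); zcr = np.empty(n); ste = np.empty(n)
    for i in range(n):
        cur = x[i*win:(i+1)*win]       # x(m)
        nxt = x[i*win+1:(i+1)*win+1]   # x(m+1)
        ac[i] = np.sum(cur*nxt) / np.sum(nxt**2)
        er[i] = np.sum(np.abs(nxt - cur)) / np.sum(np.abs(nxt))
        zcr[i] = np.sum(cur*nxt < 0)   # number of sign changes
        ste[i] = np.sum(cur**2)
    return ac, er, zcr, ste
```

A voiced-like (periodic, low-frequency) frame has AC near 1, low ER, and few zero crossings; a noise-like (unvoiced) frame has low AC, high ER, and many zero crossings, which is what makes these features usable for classification.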
B. Equations for the Automatic Threshold Level Selection Algorithm

Otsu's method (OTSU):

The optimum global threshold level $k^{\ast}$ is the value of $k$ for which the between-class variance $\sigma_{B}^{2}(k)$ is maximum:

$$k^{\ast} = \arg\max_{1 \le k \le N} \sigma_{B}^{2}(k) \qquad (5\text{-}1)$$

where the between-class variance, for $k = 1, 2, \ldots, N$, is

$$\sigma_{B}^{2}(k) = \frac{\big(m_{G} P_{1}(k) - m(k)\big)^{2}}{P_{1}(k)\big(1 - P_{1}(k)\big)} \qquad (5\text{-}2)$$

where the global intensity mean is

$$m_{G} = \sum_{i=1}^{N} i\, p(i) \qquad (5\text{-}3)$$

the cumulative means $m(k)$, for $k = 1, 2, \ldots, N$, are

$$m(k) = \sum_{i=1}^{k} i\, p(i) \qquad (5\text{-}4)$$

the cumulative sums $P_{1}(k)$, for $k = 1, 2, \ldots, N$, are

$$P_{1}(k) = \sum_{i=1}^{k} p(i) \qquad (5\text{-}5)$$

and $p(i)$, $i = 1, 2, \ldots, N$, is the normalized histogram of the input signal.
Histograms of the Normalized Features and Threshold Levels Selected Using Otsu's Method:

Figure 1. Histograms of (a) normalized AC, (b) normalized ER, (c) normalized ZCR, and (d) normalized SR, with the threshold levels selected by Otsu's method (0.69, 0.46, 0.47, and 0.60, respectively). Axes: feature value (0 to 1) versus count.
Voiced/Unvoiced Detection Using the Four Methods with Automatically Selected Threshold Levels:

Figure 2. (a) Waveform of the sample sentence. (b) Normalized AC with its threshold level (0.68). (c) Normalized ER with its threshold level (0.46). (d) Normalized ZCR with its threshold level (0.47). (e) Normalized SR with its threshold level (0.60). (f) Voiced/unvoiced detection from AC and its threshold level. Time axis in seconds.
C. Decision Making
Short-Time Energy (STE):

$$\mathrm{STE} = \sum_{m=1}^{M} x(m)^{2} \qquad (6)$$
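The decision rule of this section — a thresholded feature gated by the STE silence detector — can be sketched as follows, assuming AC as the feature. The function name and the AND-combination are inferred from the report's figures and appendix listings, not taken verbatim from them:

```python
import numpy as np

def vuv_decision(ac, ste, thres_ac, thres_sil=0.08):
    """Final voiced/unvoiced decision: a frame is voiced (1) only
    if its normalized AC exceeds the Otsu-selected threshold AND
    its STE is above the fixed silence level (0.08 in the report).
    """
    ac = np.asarray(ac, dtype=float)
    ac = ac / np.max(ac)                   # normalize to [0, 1]
    voiced = ac > thres_ac                 # feature decision
    silent = np.asarray(ste) < thres_sil   # silence detection
    return (voiced & ~silent).astype(int)  # silence forces unvoiced
```

For example, `vuv_decision([0.9, 0.2, 0.8], [1.0, 1.0, 0.01], 0.5)` labels the third frame unvoiced despite its high AC, because its energy falls below the silence threshold.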
STE with Fixed Threshold Level for Silence Detection (0.08 for unshifted and upshifted speech):
Figure 3. (a) Waveform of sample sentence. (b) Short-time Energy with fixed threshold level (0.08). (c)
Silence detection using STE with its threshold level.
Final Decision for Voiced/Unvoiced Speech Detection:
Figure 4. (a) Waveform of sample sentence. (b) Voiced/unvoiced detection using AC. (c) Silence detection
using STE. (d) Final decision for voiced/unvoiced detection of sample sentence.
III. Materials and Experimental Methods
A. 10 IEEE Sample Sentences and Reference Voiced/Unvoiced Detection
Table 1. 10 IEEE Sample Sentences for Testing.
No.  Sentence  Sex  Length (s)  Fs (kHz)
1 The birch canoe slid on the smooth planks. M 2.8 25
2 He knew the skill of the great young actress. M 3.5 25
3 Her purse was full of useless trash. M 2.2 25
4 Read verse out loud for pleasure. M 2.1 25
5 Wipe the grease off his dirty face. M 2.2 25
6 He wrote down a long list of items. F 2.9 25
7 The drip of the rain made a pleasant sound. F 2.7 25
8 Smoke poured out of every crack. F 2.5 25
9 Hats are worn to tea and not to dinner. F 2.9 25
10 The clothes dried on a thin wooden rack. F 2.9 25
10 IEEE Sample Sentences and Their Reference Voiced/Unvoiced Detection:
Figure 5. Waveforms of the 10 IEEE sample sentences and their reference voiced/unvoiced detection, labeled manually using spectrograms.
B. Upshifted and Downshifted Speech
Spectrograms of Unshifted and Shifted Speech:
Figure 6. Spectrograms of the sample sentence: (a) unshifted speech, (b) speech upshifted by 800 Hz, (c) speech upshifted by 1200 Hz, (d) speech downshifted by 800 Hz, and (e) speech downshifted by 1200 Hz.
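The report does not state how the frequency shifting was performed; one standard way to shift an entire spectrum by a fixed offset, assumed here purely for illustration, is single-sideband modulation of the analytic signal:

```python
import numpy as np

def shift_frequency(x, fs, f0):
    """Shift the whole spectrum of x by f0 Hz (positive = upshift).

    Multiplies the analytic signal by a complex exponential and
    takes the real part, so every component at f moves to f + f0.
    """
    n = len(x)
    # analytic signal via the FFT (same construction scipy.signal.hilbert uses)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n//2] = 1.0
        h[1:n//2] = 2.0
    else:
        h[1:(n+1)//2] = 2.0
    analytic = np.fft.ifft(X * h)
    t = np.arange(n) / fs
    return np.real(analytic * np.exp(2j*np.pi*f0*t))
```

Unlike pitch scaling, this rigid translation destroys the harmonic structure of voiced speech (harmonics are no longer integer multiples of F0), which is why downshifting in particular degrades periodicity-based features such as AC.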
C. Tool for Reference Voiced/Unvoiced Detection
Initial voiced/unvoiced detection using ER:
Figure 7. Voiced/unvoiced detection using ER, shown with the spectrogram.
The voiced/unvoiced labels are then corrected manually against the spectrogram by clicking on incorrectly detected points:
Figure 8. Voiced/unvoiced detection after manual correction using the spectrogram.
D. Performance Measurement
Table 2. Definition of Symbols in the Error Calculation
HIT0: an unvoiced segment is correctly detected as unvoiced (unvoiced -> unvoiced).
FALSE0: an unvoiced segment is incorrectly detected as voiced (unvoiced -> voiced).
HIT1: a voiced segment is correctly detected as voiced (voiced -> voiced).
FALSE1: a voiced segment is incorrectly detected as unvoiced (voiced -> unvoiced).

Hit Rate:

$$\mathrm{Hit\ Rate} = \frac{\mathrm{HIT0}}{\mathrm{HIT0}+\mathrm{FALSE0}} \times 100\% \qquad (7\text{-}1)$$

False Alarm Rate:

$$\mathrm{False\ Alarm\ Rate} = \frac{\mathrm{FALSE1}}{\mathrm{HIT1}+\mathrm{FALSE1}} \times 100\% \qquad (7\text{-}2)$$

Error Rate:

$$\mathrm{Error\ Rate} = \frac{\mathrm{FALSE0}+\mathrm{FALSE1}}{N} \times 100\% \qquad (7\text{-}3)$$

where N is the total number of frames.
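Eqs. (7-1)-(7-3) translate directly into code; the counting convention below follows the error-counting loop in the appendix (HIT0/FALSE0 over unvoiced reference frames, HIT1/FALSE1 over voiced ones), with a hypothetical function name:

```python
def vuv_rates(ref, hyp):
    """Hit, false-alarm, and error rates (%) per Eqs. (7-1)-(7-3).

    ref and hyp are per-frame labels: 1 = voiced, 0 = unvoiced.
    """
    hit0   = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 0)
    false0 = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 1)
    hit1   = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 1)
    false1 = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 0)
    hit_rate   = 100 * hit0 / (hit0 + false0)        # (7-1)
    false_rate = 100 * false1 / (hit1 + false1)      # (7-2)
    error_rate = 100 * (false0 + false1) / len(ref)  # (7-3)
    return hit_rate, false_rate, error_rate
```

Note that the error rate pools both kinds of mistakes over all frames, so it can be low even when one class is detected poorly; the hit and false-alarm rates expose that asymmetry.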
IV. Experimental Results
A. Voiced/Unvoiced Detection for Unshifted Speech
Voiced/Unvoiced Detection for an Unshifted Male Sentence Using the Four Methods:
Figure 9. (a) Reference voiced/unvoiced detection for a male sentence ("Her purse was full of useless trash.").
Voiced/unvoiced detection using (b) AC, (c) ER, (d) ZCR, and (e) SR.
Voiced/Unvoiced Detection for an Unshifted Female Sentence Using the Four Methods:
Figure 10. (a) Reference voiced/unvoiced detection for a female sentence ("Hats are worn to tea and not to
dinner."). Voiced/unvoiced detection using (b) AC, (c) ER, (d) ZCR, and (e) SR.
Comparison of Detection Results with the Reference Detection:
Table 3. Hit, False Alarm, and Error Rates of Voiced/Unvoiced Detection for Unshifted Speech

Method   Hit Rate (%)   False Alarm Rate (%)   Error Rate (%)
AC       92.8222        1.7833                 4.2474
ER       93.1485        1.7833                 4.0984
ZCR      93.3116        1.7833                 4.0238
SR       92.8222        1.7833                 4.2474
Error Rate of Each Method:
Figure 11. Hit, false alarm and error rate for unshifted speech.
B. Voiced/Unvoiced Detection for Frequency Upshifted Sentences
Voiced/Unvoiced Detection for Frequency Upshifted Sentences Using AC:
Figure 12. Voiced/unvoiced detection for upshifted speech using AC, for frequency shifts from 600 to 1500 Hz.
Voiced/Unvoiced Detection for Frequency Upshifted Sentences Using ER:
Figure 13. Voiced/unvoiced detection for upshifted speech using ER in the frequency range from 600 to 1500
Hz.
Voiced/Unvoiced Detection for Frequency Upshifted Sentences Using ZCR:
Figure 14. Voiced/unvoiced detection for upshifted speech using ZCR in the frequency range from 600 to
1500 Hz.
Voiced/Unvoiced Detection for Frequency Upshifted Sentences Using SR:
Figure 15. Voiced/unvoiced detection for upshifted speech using SR in the frequency range from 600 to 1500
Hz.
Comparison of Detection Results with the Reference Detection:
Table 4. Hit, False Alarm, and Error Rates for Upshifted Speech (%).

Shift (Hz)        AC                       ER                       ZCR                      SR
             Hit    False  Error      Hit    False  Error      Hit    False  Error      Hit    False  Error
600          93.31  1.92   4.09       93.47  1.92   4.02       93.96  1.92   3.80       93.47  1.92   4.02
700          93.14  1.92   4.17       93.31  1.92   4.09       93.47  1.92   4.02       93.31  1.92   4.09
800          93.14  1.92   4.17       93.31  1.92   4.09       93.47  1.92   4.02       93.31  1.92   4.09
900          93.31  1.92   4.09       93.47  1.92   4.02       93.96  1.92   3.80       93.47  1.92   4.02
1000         93.14  1.92   4.17       93.31  1.92   4.09       93.63  1.92   3.94       93.31  1.92   4.09
1100         93.14  1.92   4.17       93.31  1.92   4.09       93.80  1.92   3.87       93.31  1.92   4.09
1200         93.47  1.92   4.02       93.47  1.92   4.02       93.96  1.92   3.80       93.47  1.92   4.02
1300         93.31  1.92   4.09       93.31  1.92   4.09       93.80  1.92   3.87       93.31  1.92   4.09
1400         93.47  1.92   4.02       93.47  1.92   4.02       93.63  1.92   3.94       93.47  1.92   4.02
1500         93.31  1.92   4.09       93.31  1.92   4.09       93.63  1.92   3.94       93.31  1.92   4.09
Error Rate for Each Upshifted Frequency Level:
Figure 16. Hit, false alarm and error rate for upshifted speech.
C. Voiced/Unvoiced Detection for Frequency Downshifted Sentences
Voiced/Unvoiced Detection for Frequency Downshifted Sentences Using AC:
Figure 17. Voiced/unvoiced detection for downshifted speech using AC, for frequency shifts from -600 to -1500 Hz.
Voiced/Unvoiced Detection for Frequency Downshifted Sentences Using ER:
Figure 18. Voiced/unvoiced detection for downshifted speech using ER in the frequency range from -600 to
-1500 Hz.
Voiced/Unvoiced Detection for Frequency Downshifted Sentences Using ZCR:
Figure 19. Voiced/unvoiced detection for downshifted speech using ZCR in the frequency range from -600 to
-1500 Hz.
Voiced/Unvoiced Detection for Frequency Downshifted Sentences Using SR:
Figure 20. Voiced/unvoiced detection for downshifted speech using SR in the frequency range from -600 to
-1500 Hz.
Comparison of Detection Results with the Reference Detection:
Table 5. Hit, False Alarm, and Error Rates for Downshifted Speech (%).

Shift (Hz)        AC                       ER                       ZCR                      SR
             Hit    False  Error      Hit    False  Error      Hit    False  Error      Hit    False  Error
-600         73.40  3.97   14.30      76.34  3.97   12.96      76.50  4.52   13.18      75.20  4.25   13.63
-700         77.16  5.62   13.48      80.26  5.76   12.14      79.44  5.76   12.51      78.79  5.62   12.74
-800         79.77  6.03   12.51      83.36  6.58   11.17      83.52  6.85   11.25      82.38  6.85   11.77
-900         79.77  7.13   13.11      83.03  7.27   11.69      83.03  7.95   12.07      81.72  7.13   12.22
-1000        78.62  8.36   14.30      82.05  9.05   13.11      82.21  8.64   12.81      80.42  8.77   13.71
-1100        78.95  10.19  15.12      82.87  10.69  13.63      82.70  10.97  13.85      80.58  10.56  14.60
-1200        78.79  10.21  15.27      83.52  11.11  13.56      81.72  10.56  14.08      81.23  10.83  14.45
-1300        74.38  11.24  17.80      78.95  12.34  16.31      78.46  12.20  16.46      76.83  11.93  17.06
-1400        79.77  12.75  16.16      84.01  14.67  15.27      83.68  14.95  15.57      82.05  13.58  15.57
-1500        79.77  13.71  16.69      83.36  15.22  15.87      83.84  16.46  16.31      81.89  14.54  16.16
Error Rate on Each Downshifted Frequency Level:
Figure 21. Hit, false alarm and error rate for downshifted speech.
V. Conclusions and Planned Activities
Four feature extraction algorithms, the autocorrelation coefficient (AC), pre-emphasized
energy ratio (ER), zero crossing rate (ZCR), and low-to-full subband energy ratio (SR), were used
for voiced/unvoiced classification, with their threshold levels selected automatically using
Otsu's method. Short-time energy with a fixed threshold level was used for silence detection
to make the final voiced/unvoiced decision. Ten IEEE corpus sentences were used for testing,
and their reference voiced/unvoiced labels were obtained manually using spectrograms. For
unshifted and upshifted speech, all four methods have error rates under 4.3% over the whole
frequency range from 600 Hz to 1500 Hz. For downshifted speech, all four methods have error
rates of 11% to 18% over the range from -600 Hz to -1500 Hz. Three activities to improve
voiced/unvoiced detection performance are planned:
Multiple Threshold Levels:
Two or more threshold levels will be detected: the lower level for the initial voiced/unvoiced
decision, and the upper level to confirm whether a segment is voiced or unvoiced. Each level will be
obtained by applying Otsu's method repeatedly [5,6].
Reliable Detection of Weak Voiced Speech:
An unvoiced segment detected near a voiced utterance is often actually weak voiced speech
lying between the voiced and unvoiced portions. STE will be used to detect voiced speech,
and AC, ER, ZCR, and SR will be used to detect unvoiced speech; the two will then be combined
to detect weak voiced speech [4].
New Approach to Decision Making:
Voiced, unvoiced, and silence segments will be labeled manually, and statistical
information such as the mean and variance of each class will be obtained from its histogram.
The final decision will then be made using this statistical information. This is a Bayesian
approach to voiced/unvoiced detection [4].
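The proposed Bayesian decision could look like the following sketch, where each class is modeled by a single Gaussian over one feature; the class set, the single-feature setup, and all names are illustrative assumptions, not the report's design:

```python
import math

def gaussian_vuv(feature, stats):
    """Assign a frame's feature value to the class with the highest
    Gaussian log-likelihood, using per-class (mean, variance) pairs
    estimated from manually labeled training frames.
    """
    def loglik(x, mu, var):
        # log of the Gaussian density N(x; mu, var)
        return -0.5 * math.log(2 * math.pi * var) - (x - mu)**2 / (2 * var)
    return max(stats, key=lambda c: loglik(feature, *stats[c]))
```

With class priors added, this becomes a full maximum-a-posteriori classifier; with equal priors, as here, it reduces to maximum likelihood.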
References
[1] J. G. Proakis, Digital Signal Processing, 4th ed., Pearson.
[2] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press.
[3] A. M. Kondoz, Digital Speech: Coding for Low Bit Rate Communication Systems, Wiley.
[4] L. R. Rabiner and R. W. Schafer, Theory and Applications of Digital Speech Processing, Prentice Hall.
[5] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Pearson.
[6] N. Otsu, "A Threshold Selection Method from Gray-Level Histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, 1979.
[7] IEEE Subcommittee, "IEEE Recommended Practice for Speech Quality Measurements," IEEE Trans. Audio and Electroacoustics, vol. AU-17, no. 3, pp. 225-246, 1969.
Matlab Code
Matlab Function for Autocorrelation Coefficient (AC) with Short-time Energy (STE):
function [ac,ste,n]=ac(data,win_size)
% Frame-based autocorrelation coefficient (Eq. 1) and short-time
% energy (Eq. 6) over non-overlapping windows of win_size samples.
n=floor((length(data)-1)/win_size);  % leave room for the one-sample lookahead
data_fit=data(1:n*win_size);
data_2=data(1:n*win_size+1);
for i=1:n
    data_win=data_fit(1+win_size*(i-1):win_size*i);    % x(m)
    data_post=data_2(2+win_size*(i-1):1+win_size*i);   % x(m+1)
    ac(i)=sum(data_post.*data_win)/sum(data_post.^2);  % Eq. (1)
    ste(i)=sum(data_win.^2);                           % Eq. (6)
end
end
Matlab Function for Pre-Emphasized Energy Ratio (ER):
function [er,ste,n]=er(data,win_size)
% Frame-based pre-emphasized energy ratio (Eq. 2) and short-time energy.
n=floor((length(data)-1)/win_size);  % leave room for the one-sample lookahead
data_fit=data(1:n*win_size);
data_2=data(1:n*win_size+1);
for i=1:n
    data_win=data_fit(1+win_size*(i-1):win_size*i);        % x(m)
    data_post=data_2(2+win_size*(i-1):1+win_size*i);       % x(m+1)
    er(i)=sum(abs(data_post-data_win))/sum(abs(data_post));% Eq. (2)
    ste(i)=sum(data_win.^2);
end
end
Matlab Function for Zero Crossing Rate (ZCR):
function [zcr,ste,n]=zcr(data,win_size)
% Frame-based zero crossing rate (Eq. 3) and short-time energy.
n=floor((length(data)-1)/win_size);
data_fit=data(1:n*win_size);
data_2=data(1:n*win_size+1);
for i=1:n
    data_win=data_fit(1+win_size*(i-1):win_size*i);
    data_post=data_2(2+win_size*(i-1):1+win_size*i);
    % tail of the original listing was lost in extraction;
    % reconstructed from Eq. (3): count sign changes
    [row,column]=find((data_win.*data_post)<0);
    zcr(i)=length(column);
    ste(i)=sum(data_win.^2);
end
end
Semiautomatic Tool for Voiced/Unvoiced Detection using Spectrogram:
filename='C:\Users\Owner\Desktop\P4\10sentences\sp03';
cd('C:\Users\Owner\Desktop\P4\10sentences');
[num,txt]=xlsread('10sentences.xlsx'); sentence=txt{3};
[data,fs]=wavread(filename); win=0.02*fs;   % 20-ms frames
[er,ste,n]=er(data,win); er=er/max(er);     % normalized ER
thres_sil=0.08;                             % fixed silence threshold
thres_er=graythresh(er);                    % Otsu threshold
vuv_er=ones(1,n);
for i=1:n
    % comparison operators were lost in extraction; reconstructed so that
    % silent frames and frames with high ER are labeled unvoiced (0):
    if ste(i)<thres_sil || er(i)>thres_er, vuv_er(i)=0; end
end
for i=1:n, vuv(1+win*(i-1):win*i)=vuv_er(i); end
%%
subplot(2,1,1); area(vuv,'edgecolor','c','facecolor','c'); hold on;
subplot(2,1,1); plot(data(1:win*n)+0.4); ylim([0,0.9]); hold off;
title(sentence,'fontsize',12);
subplot(2,1,2); specgram(data(1:win*n));
for j=1:100
    [x,y]=ginput(1);
    for i=0:n
        % tail of the original listing was lost in extraction; presumably
        % the clicked frame's label is toggled before the plot is redrawn:
        if (x>=i*win+1 && x<(i+1)*win)
            vuv(1+i*win:(i+1)*win)=1-vuv(1+i*win);
        end
    end
end
'C:\Users\Owner\Desktop\P4\10sentences\1500'};  % (earlier entries of the pathnames list were lost in extraction)
cd('C:\Users\Owner\Desktop\P4\10sentences');
file=dir('*.wav'); file_ref=dir('*.mat');
filenames={file.name}'; filenames_ref={file_ref.name}';
%%
for k=1:10                      % loop over shift levels
    cd(pathnames{k})
    for i=1:10                  % loop over the 10 sentences
        [data,fs]=wavread(filenames{i});
        win=0.02*fs;
        [ac,ste,n(i)]=ac(data,win);
        [er]=er(data,win);
        [zcr]=zcr(data,win);
        [sr]=sr(data,win,fs);
        ac=ac/max(ac); er=er/max(er); zcr=zcr/max(zcr); sr=sr/max(sr);
        thres_sil=0.08;
        thres_ac=graythresh(ac);
        thres_er=graythresh(er);
        thres_zcr=graythresh(zcr);
        thres_sr=graythresh(sr);
        for j=1:n(i)
            % comparison logic was lost in extraction; reconstructed so that
            % silent frames, and frames on the unvoiced side of each
            % feature's threshold, are labeled 0 (unvoiced):
            vuv_ac(i,j,k)=~(ste(j)<thres_sil || ac(j)<thres_ac);
            vuv_er(i,j,k)=~(ste(j)<thres_sil || er(j)>thres_er);
            vuv_zcr(i,j,k)=~(ste(j)<thres_sil || zcr(j)>thres_zcr);
            vuv_sr(i,j,k)=~(ste(j)<thres_sil || sr(j)<thres_sr);
        end
    end
end
% Error-rate computation (Eqs. 7-1 to 7-3) for each method (l) and
% shift level (k), pooled over all 10 sentences.
for l=1:4
    if l==1, vuv_1=vuv_ac; end
    if l==2, vuv_1=vuv_er; end
    if l==3, vuv_1=vuv_zcr; end
    if l==4, vuv_1=vuv_sr; end
    for k=1:10
        hit0=0; false0=0; hit1=0; false1=0;
        for i=1:10
            for j=1:n(i)
                if (vuv(i,j)==0 && vuv_1(i,j,k)==0), hit0=hit0+1; end
                if (vuv(i,j)==0 && vuv_1(i,j,k)==1), false0=false0+1; end
                if (vuv(i,j)==1 && vuv_1(i,j,k)==1), hit1=hit1+1; end
                if (vuv(i,j)==1 && vuv_1(i,j,k)==0), false1=false1+1; end
            end
        end
        hit(l,k)=hit0/(hit0+false0)*100;             % Eq. (7-1)
        false_alarm(l,k)=false1/(hit1+false1)*100;   % Eq. (7-2); renamed: 'false' shadows a builtin
        error_rate(l,k)=(false0+false1)/sum(n)*100;  % Eq. (7-3); renamed: 'error' is a builtin
    end
end