Doctor of Philosophy Electronics and Communication Engineering
Analytical Studies Relating To Bandwidth Extension of Speech Signal
for Next Generation Wireless Communication
A Thesis submitted to Gujarat Technological University
for the Award of
Doctor of Philosophy
in
Electronics and Communication Engineering
by
Rajnikant Natubhai Rathod
159997111013
under supervision of
Dr. Mehfuza S. Holia
GUJARAT TECHNOLOGICAL UNIVERSITY, AHMEDABAD
July-2021
I declare that the thesis entitled Analytical Studies Relating To Bandwidth
Extension of Speech Signal for Next Generation Wireless Communication
submitted by me for the degree of Doctor of Philosophy is the record of research
work carried out by me during the period from 2015 to 2021 under the supervision
of Dr. Mehfuza S. Holia and this has not formed the basis for the award of any
degree, diploma, associateship, fellowship, or other titles in this or any other
university or institution of higher learning.
I further declare that the material obtained from other sources has been duly
acknowledged in the thesis. I shall be solely responsible for any plagiarism
or other irregularities, if noticed in the thesis.
Signature of the Research Scholar: ......................................... Date: 02/07/2021
Name of Research Scholar: Rajnikant Natubhai Rathod
Place: LEC, Morbi
CERTIFICATE
I certify that the work incorporated in the thesis Analytical Studies Relating To
Bandwidth Extension of Speech Signal for Next Generation Wireless
Communication submitted by Rajnikant Natubhai Rathod was carried out by
the candidate under my supervision/guidance. To the best of my knowledge: (i) the
candidate has not submitted the same research work to any other institution for any
degree, diploma, associateship, fellowship or other similar titles; (ii) the thesis
submitted is a record of original research work done by the Research Scholar during
the period of study under my supervision; and (iii) the thesis represents independent
research work on the part of the Research Scholar.
Signature of Supervisor: .............................................. Date: 02/07/2021
Name of Supervisor: Dr. Mehfuza S. Holia
Place: BVM, V.V. Nagar
Course-work Completion Certificate
This is to certify that Mr. Rajnikant Natubhai Rathod (enrolment
no. 159997111013) is a PhD scholar enrolled in the PhD program in the branch of
Electronics and Communication Engineering of Gujarat Technological University,
Ahmedabad.
(Please tick the relevant option(s))
He/She has been exempted from the course-work (successfully completed
during M.Phil Course)
He/She has been exempted from Research Methodology Course only
(successfully completed during M.Phil Course)
He/She has successfully completed the PhD course work for the partial
requirement for the award of the PhD Degree. His/Her performance in the course work
is as follows:
Grade Obtained in Research Methodology (PH001): BB
Grade Obtained in Self Study Course (Core Subject) (PH002): BB
Signature:
(Dr. Mehfuza S. Holia)
Originality Report Certificate
It is certified that PhD Thesis titled Analytical Studies Relating To Bandwidth
Extension of Speech Signal for Next Generation Wireless Communication by
Rajnikant Natubhai Rathod has been examined by us. We undertake the
following:
The thesis contains significant new work/knowledge as compared to work already
published or under consideration for publication elsewhere. No
sentence, equation, diagram, table, paragraph or section has been copied
verbatim from previous work unless it is placed under quotation marks
and duly referenced.
The work presented is original and is the author's own work (i.e. there is
no plagiarism). No ideas, processes, results or words of others have
been presented as the author's own work.
There is no fabrication of data or results which have been compiled /
analyzed.
There is no falsification by manipulating research materials, equipment
or processes, or changing or omitting data or results such that the
research is not accurately represented in the research record.
The thesis has been checked using Urkund software (a copy of the
originality report is attached) and found to be within the limits of the GTU
Plagiarism Policy and instructions issued from time to time (i.e.
permitted similarity index <10%).
Signature of the Research Scholar: Date: 02/07/2021
Name of Research Scholar: Rajnikant Natubhai Rathod
Place: LEC, Morbi
Signature of Supervisor: Date: 02/07/2021
Name of Supervisor: Dr. Mehfuza S. Holia
Place: BVM, V.V. Nagar
Sources included in the report
URL: https://ijcsmc.com/docs/papers/October2013/V2I10201318.pdf
PhD THESIS Non-Exclusive License to
GUJARAT TECHNOLOGICAL UNIVERSITY
In consideration of being a PhD Research Scholar at GTU and in the interests
of the facilitation of research at GTU and elsewhere, I, Rajnikant Natubhai
Rathod having Enrollment No. 159997111013 hereby grant a non-exclusive,
royalty free and perpetual license to GTU on the following terms:
GTU is permitted to archive, reproduce and distribute my thesis, in whole
or in part, and/or my abstract, in whole or in part (referred to collectively
as the "Work"), for non-commercial purposes, in all
forms of media;
GTU is permitted to authorize, sub-lease, sub-contract or procure
any of the acts mentioned in paragraph (a);
GTU is authorized to submit the Work to any National/International
Library under the authority of this Non-Exclusive License;
The Universal Copyright Notice (©) shall appear on all copies
made under the authority of this license;
I undertake to submit my thesis, through my University, to any Library and Archives.
Any abstract submitted with the thesis will be considered to form part of the thesis.
I represent that my thesis is my original work, does not infringe any rights
of others, including privacy rights, and that I have the right to make the
grant conferred by this non-exclusive license.
If third party copyrighted material was included in my thesis for
which, under the terms of the Copyright Act, written permission
from the copyright owners is required, I have obtained such
permission from the copyright owners to do the acts mentioned in
paragraph (a) above for the full term of copyright protection.
I retain copyright ownership and moral rights in my thesis, and may deal
with the copyright in my thesis, in any way consistent with rights granted
by me to my University in this non-exclusive license.
I further promise to inform any person to whom I may hereafter assign or
license my copyright in my thesis of the rights granted by me to my
University in this non-exclusive license.
I am aware of and agree to accept the conditions and regulations of PhD
including all policy matters related to authorship and plagiarism.
Signature of the Research Scholar:
Name of Research Scholar: Rajnikant Natubhai Rathod
Date: 02/07/2021 Place: BVM, V.V. Nagar
Signature of Supervisor:
Name of Supervisor: Dr. Mehfuza S. Holia
Date: 02/07/2021 Place: BVM, V.V. Nagar
Seal: Dr. M.S.Holia
Assistant Professor
BVM, V.V.Nagar
Thesis Approval Form
The viva-voce of the PhD Thesis submitted by Shri
Rajnikant Natubhai Rathod (Enrollment No. 159997111013), entitled Analytical
Studies Relating To Bandwidth Extension of Speech Signal for Next Generation
Wireless Communication, was conducted on Friday (02-07-2021) at Gujarat
Technological University.
(Please tick any one of the following option)
The performance of the candidate was satisfactory. We recommend that
he/she should be awarded the PhD degree.
The panel recommends further modifications in the research work; the
viva-voce may be re-conducted by the same panel after 3 months from the
date of the first viva-voce, upon request of the Supervisor or of the
Independent Research Scholar.
(briefly specify the modifications suggested by the panel)
The performance of the candidate was unsatisfactory. We
recommend that he/she should not be awarded the PhD degree.
(The panel must give justifications for rejecting the research work)
----------------------------------------------------- ----------------------------------------------
Name and Signature of Supervisor with Seal 1) (External Examiner 1) Name and Signature
------------------------------------------------------- ----------------------------------------------
2) (External Examiner 2) Name and Signature 3) (External Examiner 3) Name and Signature
Dr Pancham Shukla
K.R.Parmar
ABSTRACT
Owing to the constraints on wireless transmission frequency band resources and the
complexity of the 3G mobile communications environment, the reconstructed speech (voice)
signal at the receiver side is often found muffled, barely audible and thin in quality and
intelligibility. In contrast, today's terminals and infrastructure operate at wider bandwidths,
for which speech (voice) quality and intelligibility are greatly improved. The complete move
from narrowband (300 Hz-3.4 kHz) to wideband (0.05-7 kHz), and from wideband (0.05-7 kHz) to
super-wideband (0.05-14 kHz) communications, will take considerable time. As a result,
wideband and super-wideband technology must interoperate with narrowband technology, and
in this case users will experience significant variations in speech (voice) quality and
intelligibility.
The strategy of recovering the original wideband signal from a band-limited
narrowband speech signal, without actually transmitting the wideband signal, is called
bandwidth extension. Bandwidth extension aims to recreate wideband speech (0-8 kHz)
from a narrowband speech signal (0-4 kHz). Significant improvements in the quality of
speech coders have been achieved by widening the coded frequency range from
narrowband to wideband. The same concept can be applied to attain super-wideband
from the wideband signal. Bandwidth extension based on sinusoidal transform coding,
linear prediction and non-linear devices gives better results than the spectral
folding/spectral translation approaches employed by various researchers and academicians.
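As an illustration only (not part of the thesis experiments), the following NumPy sketch shows the raw mechanism behind spectral folding: zero-insertion upsampling of a narrowband tone mirrors its spectrum into the new high band, which a folding-based B.W.E. system would then shape with a spectral envelope. All variable names here are illustrative.

```python
import numpy as np

# A 1 kHz tone sampled at 8 kHz stands in for a narrowband signal.
fs_nb = 8000
t = np.arange(fs_nb) / fs_nb
nb = np.sin(2 * np.pi * 1000 * t)

# Zero-insertion upsampling by 2: the narrowband spectrum (0-4 kHz)
# is mirrored ("folded") into the 4-8 kHz band of the 16 kHz signal.
wb = np.zeros(2 * len(nb))
wb[::2] = nb

# Locate the strongest component below and above 4 kHz.
spec = np.abs(np.fft.rfft(wb))
freqs = np.fft.rfftfreq(len(wb), d=1.0 / 16000)
peak_low = freqs[np.argmax(spec * (freqs < 4000))]    # the original 1 kHz tone
peak_high = freqs[np.argmax(spec * (freqs >= 4000))]  # its 7 kHz folded image
```

The folded image at 7 kHz carries no new information by itself; the quality of a folding-based extender depends entirely on how the high band is subsequently gain-shaped.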
With the present pace of technological advancement, different coding algorithms
have been developed for super-wideband speech to obtain the full benefit of the growth
in available telecommunications bandwidth, predominantly for the internet.
A study based on a sub-band filter has been carried out, in which the sub-band filter
splits the wideband speech into low-frequency and high-frequency parts and processes
each part separately, so that near-perfect original wideband speech (voice) quality and
intelligibility is recovered without a wideband coder at the transmitter side. The
obtained voice signal quality and intelligibility can be further processed via a source-filter
model based on linear prediction. For assessing speech (voice) quality and intelligibility,
frequency-domain spectrogram analysis has been carried out on the speech (voice) signal to
judge the speech (voice) signal compression. From the speech (voice) signal obtained at
the time-varying synthesis filter, it can be concluded that the source-filter model requires
fewer bits than transmission of the original speech (voice) signal, at the cost of some
degradation in speech quality. The significant changes in the original speech
(voice) quality can be noticed by observing the spectrogram of the speech (voice) signal.
Representation with fewer bits helps reduce the storage and bandwidth requirements.
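The linear-prediction analysis/synthesis idea behind the source-filter model can be sketched as below. This is a minimal, hypothetical NumPy illustration, not the thesis implementation: LPC coefficients are solved from the normal equations, the analysis filter A(z) yields the residual (excitation), and the synthesis filter 1/A(z) reconstructs the frame. In a codec, the bit saving comes from quantizing the few coefficients plus a parametric excitation instead of the waveform.

```python
import numpy as np

def lpc(x, order):
    """LPC coefficients via the autocorrelation (normal-equation) method."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    # Toeplitz autocorrelation matrix R and right-hand side r[1..order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    return np.concatenate(([1.0], -a))  # analysis filter A(z) = 1 - sum a_k z^-k

# A voiced-like test frame: a slowly decaying resonance plus a little noise.
rng = np.random.default_rng(0)
n = np.arange(512)
frame = np.sin(2 * np.pi * 0.07 * n) * np.exp(-n / 400.0)
frame += 0.01 * rng.standard_normal(512)

order = 10
A = lpc(frame, order)

# Analysis: e[n] = A(z) s[n] gives the residual (excitation) signal.
residual = np.convolve(frame, A)[:len(frame)]

# Synthesis: s[n] = e[n] - sum_k A[k] s[n-k] inverts A(z) exactly.
recon = np.zeros(len(frame))
for i in range(len(frame)):
    recon[i] = residual[i] - sum(
        A[k] * recon[i - k] for k in range(1, min(i, order) + 1)
    )
# recon matches frame, and the residual carries far less energy
# than the frame itself (the prediction gain).
```

The small residual energy is exactly what makes the residual cheap to quantize relative to the original waveform.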
In the proposed strategy based on the source-filter model, the fundamental
idea adopted for bandwidth extension is the separate extension of the spectral
envelope and the residual signal. Each part is processed independently through a different
speech (voice) enrichment procedure to obtain the high-band component, which is then
added to the re-sampled and delayed version of the signal to acquire the final extended
output. The obtained output is evaluated from the intelligibility (subjective) and quality
(objective) perspectives, and the results of both analyses are compared with the baseline
algorithm and next-generation super-wideband coder algorithms to show that the obtained
results are comparable with both. Algebraic evaluation to obtain the missing high-band
components from the original narrowband/wideband signal is not needed, since in this
approach the wideband/super-wideband signal is achieved by bandwidth extension from the
N.B./W.B. signal itself. Multimedia communication, DSVD, VoIP, voice messaging, speaker
recognition, internet telephony and tele-conferencing are some of the important application
areas of bandwidth extension. The research work in this thesis may help shed light on the
proposed S.F.M.-based algorithm, which can be an alternative to the next-generation super-
wideband E.V.S. coder employed for speech (voice) quality and intelligibility improvement.
S.W.B.-to-F.B. speech quality is the next targeted area of work for researchers interested
in speech coding and processing. The proposed work can also be
extended to music signals and to mixed (speech and music) signals.
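Under stated assumptions, the split-and-extend pipeline above can be caricatured in a few lines of NumPy. The function name, the moving-average stand-in for the interpolation low-pass filter, and the fixed high-band gain are all illustrative assumptions; the thesis estimates the envelope and residual per frame via linear prediction.

```python
import numpy as np

def extend_wb_to_swb(wb, high_band_gain=0.3):
    """Toy sketch of the split-and-extend idea for W.B. to S.W.B."""
    # 1. Zero-insertion resampling doubles the rate; the wideband
    #    spectrum is mirrored into the new high band as an image.
    up = np.zeros(2 * len(wb))
    up[::2] = wb
    # 2. A short moving average stands in for the interpolation
    #    low-pass filter, giving the re-sampled (and delayed) low band.
    low = np.convolve(up, np.ones(4) / 4.0, mode="same")
    # 3. The detail removed by the smoothing serves as a crude high-band
    #    component, scaled by a gain that a real system would derive
    #    from the extended spectral envelope.
    return low + high_band_gain * (up - low)

# 0.1 s of a 3 kHz wideband tone at 16 kHz, extended to a 32 kHz signal.
swb = extend_wb_to_swb(np.sin(2 * np.pi * 3000 * np.arange(1600) / 16000))
```

The point of the sketch is only the structure: re-sample, regenerate a high band, shape it, and add it back to the delayed low band.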
Acknowledgement and/or Dedication
First, I would like to convey my utmost gratitude to my mentor god, Lord Krishna, for
all his guidance and for giving me the courage and ability to complete this research work.
I am deeply gratified to have a supervisor like Dr. M. S. Holia, Assistant Professor,
Electronics Department, B.V.M. Engineering College, V.V. Nagar-Anand. His continuous
encouragement, suggestions, advice and guidance inspired me to work continuously and keep
improving. I would like to thank him for being my advisor and supporter in this research
work. His insightful remarks and encouragement have been of great value to my research
work as well as to my future career.
I am very thankful to the Doctoral Progress Committee (DPC): Dr. T. D. Pawar, Head,
Electronics Department, B.V.M. Engineering College, V.V. Nagar-Anand, and Dr. V. K.
Thakar, Principal, Hansaba College of Engineering & Technology, Siddhpur, for their
valuable suggestions and support during reviews. I am very grateful for their suggestions
and critical reviews, which enabled me to improve this research work.
I am very thankful to Dr. Indrajit Patel, Principal, BVM Engineering College, Dr. S. N.
Pandya, Principal, L.E. College, Prof. M. V. Makwana, H.O.D., Power Electronics
Department, and Prof. M. H. Ayalani, Associate Professor, Power Electronics, for their
valuable support and guidance throughout my research work.
I am very thankful to my parents and brothers for all their support during the tenure of
my research. Meera, my wife, always stood by my side in all the difficult times of my life
and provided the support and encouragement to complete this research work. I am very
grateful to have my loving son Jayveer, who lightened my mood during difficult times.
I would like to thank Prof. A. C. Lakum, Prof. V. J. Rupapara, Prof. P. D. Raval, Prof.
H. T. Loriya, Prof. H. M. Karkar, Prof. B. J. Makwana, Prof. A. P. Patel, Prof. J. B.
Bheda, Prof. A. R. Gauswami and Prof. S. N. Gohil for providing assistance whenever I
faced difficulties in my work. Last but not least, I thank all of those who have helped
in finishing this research work.
Rajnikant Natubhai Rathod
Research Scholar, Gujarat Technological University
Table of Contents
Abstract ................................................................................................................................. xi
Acknowledgement ............................................................................................................. xiii
CHAPTER-1 .......................................................................................................................... 1
Introduction ............................................................................................................................ 1
1.1 Fruition of Communication Systems ...................................................................... 1
1.1.1 Analog, Digital telephony.................................................................................1
1.1.2 Wireless Cellular Networks .............................................................................. 1
1.1.3 Speech Coding .................................................................................................. 3
1.1.4 Background on Digital Speech Transmission .................................................. 7
1.1.5 Background on Bandwidth Extension .............................................................. 7
1.1.6 Problem Background ..................................................................................... 10
1.1.7 Problem Specification .................................................................................... 11
1.1.8 Motivation and applications ........................................................................... 12
1.2 Contribution ........................................................................................................ 14
1.3 Organization of the Thesis .................................................................................. 14
CHAPTER-2 ........................................................................................................................ 16
Literature Review and Objective of Work ........................................................................... 16
2.1 Literature reviews of different papers .................................................................... 16
2.2 Review sheet for each paper................................................................................. 20
2.3 Summary of Literature Survey ............................................................................... 22
2.4 Definition of the Problem....................................................................................... 23
2.5 Objectives and Overview of the research...............................................................24
CHAPTER-3 ........................................................................................................................ 25
Fundamental of Speech Production Model, Types of Speech coder and its attributes ........ 25
3.1 Introduction ........................................................................................................... 25
3.2 Speech (Voice) sounds..................................................................................................26
3.3 Representation of Speech Signals ......................................................................... 28
3.4 Classification of Coder .......................................................................................... 28
3.4.1 Waveform Coding .......................................................................................... 29
3.4.2 Source Coding ................................................................................................ 30
3.4.3 Hybrid Coding ................................................................................................ 31
3.5 Speech (voice) coding ........................................................................................... 31
3.6 Issues related to digital speech coding .................................................................. 33
3.7 Speech codec attributes ......................................................................................... 34
3.7.1 Transmission bit rate ........................................................................................... 34
3.7.2 Speech Quality..................................................................................................... 34
3.7.3 Bandwidth............................................................................................................ 35
3.7.4 Communication Delay ......................................................................................... 35
3.7.5 Complexity .......................................................................................................... 35
3.8 Speech codec for next-generation wireless communications................................ 36
3.9 Speech (voice) quality, intelligibility ................................................................... 36
3.9.1 Listening-only tests ............................................................................................. 36
3.9.2 Conversational tests..............................................................................................37
3.9.3 Field tests ............................................................................................................. 38
3.9.4 Intelligibility tests ................................................................................................ 38
3.10 Objective quality evaluation ............................................................................... 38
3.11 Effect of B.W. on speech (voice) quality, intelligibility ..................................... 39
CHAPTER-4 ........................................................................................................................ 40
Development of Bandwidth Extension Model For W.B. To S.W.B. ................................. 40
4.1 Introduction ........................................................................................................... 40
4.2 Non-model based B.W.E. approaches................................................................... 40
4.3 B.W.E. approaches based on the source-filter model ........................................... 41
4.3.1 B.W.E. from N.B. to W.B. Speech Conversion: ................................................. 41
4.3.2 General Model for B.W.E.: ................................................................................. 41
4.3.3 Bandwidth Limitation in Compressed Speech/Audio: ........................................ 42
4.3.4 Baseline System Model for B.W.E...................................................................... 44
4.4 Detailed analysis of B.W.E. based on proposed S.F.M. ....................................... 47
CHAPTER 5 ........................................................................................................................ 59
Linear prediction analysis and synthesis ............................................................................. 59
5.1 Introduction ........................................................................................................... 59
5.2 Speech file Compression based on LPC ............................................................... 59
5.3 Source filter model for sound production ............................................................. 61
5.4 Linear predictive coding (LPC) ............................................................................ 63
5.5 Linear Prediction and Autoregressive Modeling .................................................. 71
5.6 Linear Prediction (L.P.) and Bandwidth extension (B.W.E.) ............................... 72
5.6.1. LPC analysis ....................................................................................................... 72
5.6.2 LPC estimation .................................................................................................... 72
5.6.3 LPC-synthesis ...................................................................................................... 73
5.7 Bandwidth Extension based on Sub band filter and Evaluation of speech signal
through Source Filter Model ............................................................................................ 75
5.8 Stability consideration (LP) .................................................................................. 80
CHAPTER 6 ........................................................................................................................ 85
Comparative Analysis of Proposed Model with Baseline & Next-Generation Speech Codec
............................................................................................................................................. 85
6.1 Subjective & Objective Measurement for W.B. to S.W.B.................................. 85
6.2 Spectrogram .......................................................................................................... 93
CHAPTER-7 ...................................................................................................................... 100
Conclusion, Major Contribution, and Future Scope .......................................................... 100
Conclusion ..................................................................................................................... 100
Major Contribution ........................................................................................................ 101
Future Scope .................................................................................................................. 101
References .......................................................................................................................... 102
List of Publications ............................................................................................................ 113
List of Abbreviations

Sr. No. | Abbreviation | Full Form
1 | 1G | 1st generation
2 | 2G | 2nd generation
3 | 3G | 3rd generation
4 | 4G | 4th generation
5 | GSM | global system for mobile communications
6 | 3GPP | 3rd generation partnership project
7 | NB | narrowband
8 | WB | wideband
9 | SWB | super wideband
10 | AMR | adaptive multi rate
11 | AMR-WB | adaptive multi-rate wideband
12 | BWE | bandwidth extension
13 | LTE | long-term evolution
14 | CCITT | international telegraph and telephone consultative committee
15 | AMPS | advanced mobile telephone system
16 | ASR | automatic speech recognition
17 | PCM | pulse code modulation
18 | ADPCM | adaptive differential pulse code modulation
19 | CDMA | code division multiple access
20 | CELP | code-excited linear prediction
21 | PSTN | public switched telephone network
22 | CCR | comparison category rating
23 | WCDMA | wideband code division multiple access
24 | CS-ACELP | conjugate-structure algebraic code-excited linear prediction
25 | EVS | enhanced voice services
26 | FFT | fast Fourier transform
27 | VT | vocal tract
28 | SFM | source filter model
29 | AMPS | advanced mobile telephone system
30 | LPC | linear predictive coding
31 | ITU | international telecommunication union
32 | ITU-T | telecommunication standardization sector of the international telecommunication union
33 | MFCC | mel frequency cepstral coefficients
34 | ISDN | integrated services digital network
35 | IMT-2000 | international mobile telecommunications-2000
36 | GMM | Gaussian mixture model
37 | HB | high band
38 | HD | high definition
39 | HF | high frequency
40 | LF | low frequency
41 | DCR | degradation category rating
42 | EFR | enhanced full rate
43 | BPF | band pass filter
44 | kbps | kilobits per second
45 | Mbps | megabits per second
46 | MOS | mean opinion score
47 | MDCT | modified discrete cosine transform
48 | BW | bandwidth
49 | OLA | overlap and add
50 | VoIP | voice over internet protocol
51 | LTE | long-term evolution
52 | UMTS | universal mobile telecommunication system
53 | VQ | vector quantization
54 | PESQ | perceptual evaluation of speech quality
55 | CMOS | comparison mean opinion score
56 | RPE-LTP | regular pulse excitation with long-term prediction
57 | ETSI | European telecommunications standards institute
58 | WiMAX 2 | worldwide interoperability for microwave access 2
59 | LTE-A | long-term evolution advanced
60 | VoLTE | voice over long term evolution
61 | FDMA | frequency division multiple access
62 | FB | full band
63 | DRT | diagnostic rhyme test
64 | MRT | modified rhyme test
65 | SRT | speech reception threshold
66 | ENC | encoder
67 | DEC | decoder
68 | ABS | analysis-by-synthesis
69 | HFBE | high frequency bandwidth extension
70 | SF | spectral folding
71 | ST | spectral translation
72 | NLD | non-linear distortion
73 | LSF | line spectral frequencies
74 | STFT | short-time Fourier transform
75 | PSD | power spectral density
76 | AR | autoregressive
77 | MOS-LQO | mean opinion score, listening quality objective
78 | TD-CDMA | time division code division multiple access
79 | SNR | signal-to-noise ratio
80 | EB | extension band
List of Symbols

Sr. No. | Symbol | Description
1 | Sn | narrowband input signal
2 | x | feature extracted from narrowband input signal
3 | ûn | estimated narrowband signal from analysis filter
4 | ûw | estimated wideband signal from synthesis filter
5 | ÂRw | estimated wideband envelope
6 | e[n] | error signal
7 | s[n] | AR signal
8 | ŝ[n] | predicted AR signal
9 | Xwb | input wideband signal
10 | Xswb | estimated super wideband output
List of Figures
Figure 1.1 A Snapshot of the mobile devices launched in various generations (1G to 4G) ........ 2
Figure 1.2 Average speech spectrum from a 10s long speech sample of a male speaker ......... 8
Figure 1.3 Bandwidth Extension at the Receiver Terminal................................................... 8
Figure 1.4 Spectra comparison of original, N.B. & W.B. speech signal ................................ 9
Figure 1.5 Step From N.B.-W.B.-S.W.B. telephony ............................................................ 9
Figure 1.6 Missing frequency components of W.B. and N.B. signal. ................................. 11
Figure 3.1 Speech production system model. ...................................................................... 26
Figure 3.2 Representation of speech signals ........................................................................ 29
Figure 3.3 Hierarchy of speech coders ................................................................................ 30
Figure 3.4 General Block Diagram of speech coding .......................................................... 32
Figure 3.5 Voiced and Unvoiced speech frame representation ............................................ 33
Figure 4.1 General Model for B.W.E. ................................................................................. 42
Figure 4.2 Bandwidth Limitation in compressed speech/audio ........................................... 43
Figure 4.3 High-Frequency Bandwidth Extension (H.F.B.E.) Algorithm ........................... 44
Figure 4.4 Input and Reconstruction Band .......................................................................... 46
Figure 4.5 Block Diagram of the Proposed approach for W.B. To S.W.B. ......................... 48
Figure 4.6 Pictorial representation for Framing. .................................................................. 49
Figure 4.7 Flow chart representation for Framing .............................................................. 49
Figure 4.8 All Pole spectral shaping synthesis filter ............................................................ 50
Figure 4.9 Linear Prediction (LP) Model of speech creation. ............................................. 51
Figure 4.10 LPC based B.W.E. ............................................................................................ 51
Figure 4.11 Proposed Flow for W.B. to S.W.B. .................................................................. 56
Figure 4.12 Baseline Algorithm Proposed Flow Chart ........................................................ 56
Figure 4.13 Proposed Algorithm Proposed Flow Chart....................................................... 57
Figure 4.14 Data Pre-Processing and Assessment for W.B. to S.W.B................................58
Figure 5.1 LPC co-efficient obtained from Input speech signal .......................................... 59
Figure 5.2 Block Diagram representation of speech file compression based on LPC Analysis
& Synthesis ................................................................................................................... 61
Figure 5.3 Source Filter Model for Sound Production ........................................................ 62
Figure 5.4 Classical Linear Prediction Coefficients (LPC) Model of Speech Production .. 63
Figure 5.5 Linear Prediction as System Identification ......................................................... 65
Figure 5.6 Prediction-Error Filter ........................................................................................ 65
Figure 5.7 Prediction Gain (PG) as a function of the Prediction Order (M) ....................... 67
Figure 5.8 A Plot of PG Vs Prediction Order (M) for the signal Frames ............................ 68
Figure 5.9 Plots of Prediction Error and Periodograms for the Voiced Frame .................... 69
Figure 5.10 Plots of Prediction Error and Periodograms for the Unvoiced Frame.............. 69
Figure 5.11 Plot of PSD for M=2, M=10,M=20 .................................................................. 70
Figure 5.12 LPC Analysis .................................................................................................... 72
Figure 5.13 LPC Estimation ................................................................................................ 72
Figure 5.14 LPC Synthesis .................................................................................................. 73
Figure 5.15 Input Signal before LPC Analyzer and Input Signal After LPC Synthesizer with
Error Signal................................................................................................................... 73
Figure 5.16 Complete Block Diagram of B.W.E. Output from Band-Limited N.B. Signal 74
Figure 5.17 Bandwidth Extension (B.W.E.) Based on Linear Prediction (LP) ................... 75
Figure 5.18 Bandwidth Extension Based on Sub Band Filter ............................................. 76
Figure 5.19 Performance Evaluation of speech signal Based on Source Filter Model ........ 77
Figure 5.20 Source Filter Model-based Analysis ................................................................ 77
Figure 5.21 Source Filter Model-based Synthesis ............................................................... 77
Figure 5.22 I/P Time-Domain Waveform of Wave File "Om Shri Ganeshay Namah" ...... 78
Figure 5.23 O/P Time-Domain Waveform of Wave File "Om Shri Ganeshay Namah" ..... 78
Figure 5.24 Input Frequency Domain Waveform of Wave file "Om Shri Ganeshay Namah"
...................................................................................................................................... 78
Figure 5.25 O/P Freq. Domain Waveform of Wave file "Om Shri Ganeshay Namah" ...... 79
Figure 5.26 Pre-Emphasized speech signal ......................................................................... 79
Figure 5.27 A Hamming Windowed speech signal ............................................................. 79
Figure 5.28 LPC Analyzer Output ....................................................................................... 80
Figure 5.29 LPC Synthesizer Output ................................................................................... 80
Figure 5.30 Auto Cor-Relation co-efficient, Reflection co-efficient & LPC co-efficient ... 81
Figure 5.31 Short-Term Prediction-Error filter connected in series to a Long-Term
Prediction-Error Filter .................................................................................................. 82
Figure 5.32 Block Diagram of the Synthesis filter .............................................................. 82
Figure 6.1 MOS-L.Q.O. For Proposed, H.F.B.E., L.P._Order=12,16,24 & E.V.S. Algorithm
for Bdl_Arctic_A0001.wav .......................................................................................... 88
Figure 6.2 P.E.S.Q. For Proposed, H.F.B.E., L.P._Order=12,16,24 & E.V.S. Algorithm For
Bdl_Arctic_A0001.wav ................................................................................................ 89
Figure 6.3 MOS-L.Q.O. For Proposed, HFBE, LP_Order=12,16,24 & EVS Algorithm For
Thirteen Different Speech Files .................................................................................... 90
Figure 6.4 P.E.S.Q. For Proposed, HFBE, LP_Order=12,16,24 & EVS Algorithm For
Thirteen Different Speech Files .................................................................................... 92
Figure 6.5 M.O.S.L.Q. For Proposed, HFBE, LP_Order=24 & EVS Algorithm For Various
Gain For Wave File Bdl_Arctic_A0001.wav ............................................................... 92
Figure 6.6 M.O.S.L.Q. For Proposed, HFBE, LP_Order=16 & EVS Algorithm for Various
Gain For Wave File Bdl_Arctic_A0001.wav ............................................................... 92
Figure 6.7 M.O.S.L.Q. For Proposed, HFBE, LP_Order=12 & EVS Algorithm for Various
Gain For Wave File Bdl_Arctic_A0001.wav ............................................................... 93
Figure 6.8 Spectrogram for Speech File Ma01_01.wav .................................................... 93
Figure 6.9 Spectrogram for Speech File Ma01_02.wav ..................................................... 94
Figure 6.10 Spectrogram for Speech File Ma01_03.wav ................................................... 94
Figure 6.11 Spectrogram for Speech File Ma01_04.wav ................................................... 95
Figure 6.12 Spectrogram for Speech File Ma01_05.wav ................................................... 95
Figure 6.13 Spectrogram for Speech File Fa01_01.wav..................................................... 96
Figure 6.14 Spectrogram for Speech File Fa01_03.wav..................................................... 96
Figure 6.15 Spectrogram for Speech File Fa01_04.wav..................................................... 97
Figure 6.16 Spectrogram for Speech File Fa01_05.wav..................................................... 97
Figure 6.17 Spectrogram for Speech File Arctic-A0002.wav ............................................ 98
Figure 6.18 Spectrogram for Speech File Arctic-A0003.wav ............................................ 98
Figure 6.19 Spectrogram for Speech File Arctic-A0004.wav ............................................ 99
Figure 6.20 Spectrogram for Speech File Arctic-A0005.wav ............................................. 99
List of Tables
Table 3.1 MOS Quality Rating ..................................................................................... 35
Table 3.2 Mean Opinion Score according to ITU-T P.800 ......................................... 37
Table 3.3 Degradation Means Opinion Score according to ITU-T P.800 ................... 37
Table 3.4 Comparison Means Opinion Score according To ITU-T P.800 .................. 37
Table 5.1 AR Synthesizer With ten LPC co-efficient .................................................. 67
Table 6.1 Listener Rating on a Scale of -3 to 3 for different wave files ..........................78
Table 6.2 MOS-L.Q.O. for Proposed, H.F.B.E., LP_Order=12,16,24 & E.V.S. algorithm
for bdl_arctic_a0001.wav ............................................................................................ 88
Table 6.3 P.E.S.Q. For Proposed, H.F.B.E., LP_Order=12,16,24 & E.V.S. algorithm for
bdl_arctic_a0001.wav ................................................................................................... 89
Table 6.4 MOS-L.Q.O. for Proposed, H.F.B.E., LP_Order=12,16,24 & E.V.S. algorithm
for Thirteen different speech files ................................................................................ 89
Table 6.5 P.E.S.Q. for Proposed, H.F.B.E., LP_Order=12,16,24 & E.V.S. algorithm for
Thirteen different speech files ...................................................................................... 90
Table 6.6 M.O.S.L.Q. for Proposed, H.F.B.E., LP_Order=24 & E.V.S. algorithm for
various Gain for wave file bdl_arctic_a0001.wav ........................................................ 91
Table 6.7 M.O.S.L.Q. For Proposed, H.F.B.E., LP_Order=16 & E.V.S. algorithm For
various Gain for wave file bdl_arctic_a0001.wav ........................................................ 91
Table 6.8 M.O.S.L.Q. For Proposed, H.F.B.E., LP_Order=12 & E.V.S. algorithm For
various Gain for wave file bdl_arctic_a0001.wav ........................................................ 91
Fruition of Communication Systems
1
CHAPTER-1
Introduction
1.1 Fruition of Communication Systems
Given the scarcity of radio transmission frequency band resources and the complexity
of the 3G mobile communications environment, the reproduced speech (voice) heard at the
receiver side often sounds muffled, barely audible and thin because of the constrained
transmission bandwidth of 300 Hz-3.4 kHz. In contrast, today's terminals and infrastructure
can operate at wider bandwidths, for which speech (voice) quality and intelligibility are
greatly improved. The complete move from narrowband (300 Hz-3.4 kHz) to wideband
(0.05-7 kHz), and from wideband to super-wideband (0.05-14 kHz) communications, will
require considerable time. As a result, wideband and super-wideband technology must
interoperate with narrowband technology, and in the meantime clients will experience
significant variations in speech (voice) quality and intelligibility. This chapter briefly
highlights the fruition of communication systems, followed by the contributions and the
organization of the thesis.
1.1.1 Analog and Digital Telephony
The band-limited N.B. speech (voice) signals were sent out over different frequency
channels with a frequency separation of 4 kHz. The first wireless speech (voice) transmission
took place in 1915 [1]. In 1937, Alec Reeves conceived time-division-multiplexed pulse
code modulation (PCM), which paved the way for the digitization of voice communication
[2]. Commercial utilization of PCM started in the 1950s [3]. Following the existing PSTNs,
PCM adopted the typical N.B. bandwidth for communication, so for a long period
subscribers were offered only N.B. communication services.
1.1.2 Wireless Cellular Networks
After Marconi's doing well attempt at wireless transmission, engineers, academicians,
scientists started to carry out the research and development ( R & D) activity for R.F. based
proficient communication [1]. The 1st
generation (1G) wireless mobile phone system was
developed in 1973, but not commercialized until year 1984. Wireless communications have
Introduction
2
remarkably gone ahead in last few years. Mobile handsets have also advanced together with
the generations (from 1G to 4G) with added functionalities.
Fig. 1.1 depicts a snapshot of the mobile devices launched in the various generations
(1G to 4G). The first-generation cellular systems, introduced in the 1980s, used analog
cellular and cordless telephone technology [2]. Second-generation (2G) systems, which
arrived in the early 1990s, used digital speech (voice) transmission. 2G services played a
vital role in voice transmission but were inefficient for data transmission [4]. Third-
generation (3G) systems entered the market around the year 2000, providing improved voice
and high-speed data services in comparison with 2G systems; they used packet switching for
data transmission and circuit switching for voice transmission. The Universal Mobile Tele-
communication System (UMTS), also known as wideband CDMA (WCDMA), time-division
synchronous CDMA (TD-SCDMA) and CDMA2000 are the most popular 3G technologies
[4]. The packet-switching technology employed in fourth-generation (4G) systems supports a
100 Mbps data rate and is used for voice as well as data services. WiMAX 2 and LTE-
Advanced are the two most popular 4G protocols [4]. Third- and fourth-generation wireless
communication (W.C.) systems have become popular in the world market owing to their
low-bit-rate coders. They were proposed to provide interactive multimedia communication,
including teleconferencing and Internet access, in combination with other benefits that
become practicable with low complexity, low cost, low bit rate and short processing delay.
One can therefore say that these are the major attributes that play a vital role in the design of
any particular coder.
Figure 1.1 A snapshot of the mobile devices launched in various generations [5]
1.1.3 Speech Coding
Speech coding is defined as the procedure of reducing the bit rate of a digital speech
(voice) signal representation for communication or storage while maintaining a speech
(voice) quality adequate for the application [6]. In general, it is the means of representing the
speech (voice) signal with the fewest number of bits while maintaining a sufficient quality of
the retrieved or synthesized speech at reasonable computational complexity. To achieve
high-quality speech at a low bit rate, a coding algorithm (a sophisticated method to reduce
the redundancies in the speech signal) is applied.
Although wide-bandwidth channels and networks are becoming more viable, speech
(voice) coding for bit-rate reduction has retained its importance owing to the need for low
bit rates in cellular and Internet communications over bandwidth-constrained channels and
in voice storage systems. Due to the increasing demand for digital speech communication,
speech coding technology has received growing interest from the research and
standardization communities. Speech coding standards have played, and continue to play, an
important role in the development and use of speech coders. There are a few speech coding
applications in which interoperability is not an issue; an example is a digital answering
machine or digital voice-mail system, in which the same system both encodes and decodes
the speech. For such applications, the speech coder of choice can simply be the best and
most cost-effective one available at the time the system is designed, without regard to
interoperability. For the vast majority of applications, however, interoperability is a major
issue, and all telecommunications applications belong to this class. For interoperability to be
achieved, standards must be defined as well as implemented. This encourages the research
community to investigate alternative speech (voice) coding techniques with the objective of
overcoming deficiencies, while the standardization community pursues standard speech
coding methods for various applications that will be widely accepted and implemented by
industry. A variety of standards bodies have been responsible for the definition of new
standards. Speech (voice) coding can be categorized according to the bandwidth served.
N.B. coding:
N.B. coding techniques compress speech (voice) signals in the range of 0.3-3.4 kHz [7].
Pulse-code modulation (PCM):
PCM is a waveform coding technique that carries out discrete-time amplitude
approximation of continuous signals in the time domain [8]. ITU-T Recommendations G.711
[9] and G.712 [10] standardized the transmission characteristics of PCM for speech (voice)
signals. Parametric coders (L.P.C.-based) encode a set of model parameters in place of the
time-domain waveform. To take advantage of both schemes, a hybrid method combining the
two coding approaches can be used: the coefficients of the synthesis filter are sent as side
information, while the L.P. residual error (the difference between the actual and predicted
samples) is quantized via CELP coding [11].
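As an illustration of this parametric idea, the sketch below (plain Python; the thesis itself works in MATLAB, and the frame, noise source and prediction order here are arbitrary choices for demonstration) estimates LP coefficients for one frame with the Levinson-Durbin recursion and forms the prediction residual that a hybrid coder such as CELP would quantize:

```python
import math

def autocorr(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LP normal equations; returns (coefficients, error power).

    Prediction convention: x_hat[n] = sum_k a[k] * x[n-k], k = 1..order.
    """
    a = [0.0] * (order + 1)
    e = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        k_m = acc / e                         # m-th reflection coefficient
        a_new = a[:]
        a_new[m] = k_m
        for k in range(1, m):
            a_new[k] = a[k] - k_m * a[m - k]  # order-update of coefficients
        a, e = a_new, e * (1.0 - k_m * k_m)
    return a[1:], e

def lcg_noise(n):
    """Tiny deterministic noise source, standing in for real excitation."""
    return (((1103515245 * n + 12345) % 2**31) / 2**31) - 0.5

# Toy 160-sample "voiced" frame: a sinusoid plus a little noise.
frame = [math.sin(0.3 * n) + 0.05 * lcg_noise(n) for n in range(160)]
lpc, err = levinson_durbin(autocorr(frame, 10), 10)
# Residual that a CELP-style coder would quantize instead of the waveform.
residual = [frame[n] - sum(lpc[k - 1] * frame[n - k] for k in range(1, 11))
            for n in range(10, len(frame))]
```

The residual carries far less energy than the frame itself, which is exactly why transmitting LP parameters plus a coarsely quantized residual is cheaper than transmitting the waveform.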
For economic and complexity reasons, bit-rate reduction below 64 kbps is a most
important requirement [11]. ITU-T Recommendation G.726 [12] standardized a PCM
extension known as ADPCM [12], which supports bit rates of 16, 24, 32 and 40 kbps. In
ADPCM, the prediction error signal is quantized rather than the signal itself [13].
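A much-simplified sketch of this differential principle follows (a fixed first-order predictor and a fixed uniform 4-bit quantizer; real ADPCM such as G.726 adapts both the predictor and the step size, so this is only an illustration of the idea, not of the standard):

```python
import math

STEP = 0.05            # fixed quantizer step (real ADPCM adapts this)
LEVELS = 2 ** (4 - 1)  # 4-bit quantizer: integer codes in [-8, 7]

def dpcm_encode(samples):
    """Quantize the error between each sample and a first-order prediction."""
    pred, codes = 0.0, []
    for s in samples:
        err = s - pred
        code = max(-LEVELS, min(LEVELS - 1, round(err / STEP)))
        codes.append(code)
        pred += code * STEP        # track the decoder's reconstruction
    return codes

def dpcm_decode(codes):
    pred, out = 0.0, []
    for c in codes:
        pred += c * STEP
        out.append(pred)
    return out

x = [0.5 * math.sin(0.05 * n) for n in range(200)]   # slowly varying input
y = dpcm_decode(dpcm_encode(x))                      # reconstruction
```

Because the encoder predicts from its own reconstruction, the per-sample error stays within half a quantizer step and does not accumulate.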
GSM full-rate (GSM-FR) and enhanced full-rate (EFR) codecs:
The GSM full-rate (GSM-FR) codec relies on LPC: it employs short-term LP analysis
for spectral envelope modeling and long-term LP analysis to obtain a residual error signal,
which is quantized using ADPCM. The codec operates at 13 kbps and was adopted in ETSI
Recommendation GSM 06.10 [14]. In 1992, GSM-FR was further improved with an efficient
V.Q. scheme for the residual error signal based on the ACELP algorithm [15-16]. The
improved codec was standardized as the GSM-EFR codec in ETSI Recommendation GSM
06.60 [17] in 1996 and operates at 12.2 kbps; it achieves speech (voice) quality comparable
with ADPCM at 32 kbps [11].
G.729:
Standardized in ITU-T Recommendation G.729 [18] in 1995, this codec operates at
8 kbps. It is based on conjugate-structure ACELP and has been widely employed in VoIP
infrastructures.
Adaptive multi-rate (AMR):
An extension of the EFR codec with eight bit rates ranging from 4.75 to 12.2 kbps,
standardized in ETSI Recommendation GSM 06.90 for 2G and 3G systems [19]. 3GPP
adopted AMR as the default speech (voice) codec for 3G systems (UMTS, CDMA2000), as
specified in 3GPP TS 26.090 [20]. AMR coding involves the transcoding of AMR-coded
speech (voice) signals to/from PCM format [21].
Wideband coding:
As W.B. transmission improves the quality of voice transmission, there is a rising
demand for W.B. communication services in fixed and mobile networks at lower bit rates.
The first W.B. speech (voice) codec, intended for ISDN teleconferencing, was standardized
in 1985 by the CCITT [11]. ITU-T Recommendation G.722 [22] specifies an audio W.B.
coding system in which the frequency band is split into lower and higher sub-bands, each
encoded using sub-band ADPCM. The G.722 standard is employed as a benchmark for the
assessment of other codecs [13]. In 1999, a low-complexity W.B. codec was introduced in
ITU-T Recommendation G.722.1 [23]; it attains comparable speech (voice) quality at
reduced bit rates of 24 kbps and 32 kbps.
Adaptive multi-rate wideband (AMRWB):
AMRWB encodes speech (voice) within a bandwidth of 0.05-7 kHz. The codec relies
on the ACELP technique and was first standardized in 2001 in 3GPP TS 26.190 [24] for 3G
systems. It utilizes B.W.E. for signal re-synthesis above 6.4 kHz, as specified in ITU-T
Recommendation G.722.2 [25]. It supports nine bit rates ranging from 6.6 to 23.85 kbps and
achieves higher speech quality at 8.85 kbps than AMR at 12.2 kbps [26]. Speech (voice)
transmission with AMRWB provides significantly better quality than N.B., since the
bandwidth expansion yields a much more natural sound. By May 2016, 164 mobile operators
(17 on GSM (2G), 130 on UMTS (3G) and 63 on LTE (4G) networks) had launched
commercial HD voice services in 88 countries [21].
G.729.1:
ITU-T Recommendation G.729.1 [27] specifies an extension to the G.729 codec
providing scalable N.B. and W.B. coding of speech (voice) and audio signals from 8 to
32 kbps [28].
S.W.B. or F.B. coding:
S.W.B. or F.B. speech (voice) communications transmit almost the complete human
speech (voice) spectrum, resulting in much more natural and intelligible speech (voice) in
comparison with N.B. or W.B. communications.
G.729.1 Annex E:
G.729.1 Annex E [29] extends the 32 kbps mode of the G.729.1 codec to an S.W.B.
mode providing bit rates in the range 36-64 kbps. An S.W.B. extension to the scalable W.B.
codec G.729.1 proposed in [30] achieves improved audio quality in comparison with the
existing S.W.B. extension G.722.1 Annex E.
Extended AMRWB (AMRWB+):
The AMRWB+ standard (3GPP TS 26.290 [31]) is an S.W.B. extension of the
AMRWB codec that operates over an extended frequency range of up to 16 kHz and at bit
rates of up to 32 kbps [14]. It is a hybrid codec that combines linear-predictive and transform
coding techniques depending on the signal type, e.g. speech (voice) or audio [32].
G.719:
A low-complexity coding algorithm for F.B. speech (voice) and audio signals is
described in ITU-T Recommendation G.719. The coding technique offers bit rates from 32
to 128 kbps [33].
HE-AAC:
The high-efficiency advanced audio codec (HE-AAC) makes use of the spectral band
replication (SBR) approach for efficient coding of audio signals [34-36].
Over the top (O.T.T.) conversational codec:
O.T.T. service providers (e.g. Skype) offer point-to-point services for voice over
Internet protocol. The use of the SILK codec allows conventional N.B. speech (voice)
services to migrate towards W.B. and S.W.B. communications via broadband IP services
[37]. OPUS is another high-quality codec, which performs hybrid coding: frequencies up to
8 kHz are coded using the SILK codec, while frequencies above 8 kHz are coded using
CELT [39]. It provides S.W.B. or F.B. transmission at and above 24 kbps [37].
Enhanced voice services (EVS) codec:
The 3rd Generation Partnership Project (3GPP) carried out a study on the EVS codec
in 2010 [38], and its standardization was completed in 2014. The EVS codec can encode
speech (voice) as well as other audio signals at S.W.B. (0.05-14 kHz) bandwidth at bit rates
as low as 9.6 kbps [28]. It operates at four different bandwidths: N.B. (0.02-4 kHz), W.B.
(0.05-7 kHz), S.W.B. (0.05-16 kHz) and F.B. (0.05-20 kHz) [37]. EVS supports twelve bit
rates ranging from 5.9 to 128 kbps, with S.W.B. and F.B. services starting at or above 9.6
and 16.4 kbps respectively [48]. Subjective listening tests show that EVS outperforms all
existing conversational voice and audio codecs across all bit rates and bandwidths [39].
Detailed technical descriptions of EVS can be found in [40-43]. Given the key features the
EVS codec provides for next-generation wireless communication systems, mobile operators
have started enabling their networks for EVS support [44].
1.1.4 Background on Digital Speech Transmission
In digital signal processing (DSP), signals are band-limited according to the sampling
frequency used. In analog telephony, speech bandwidth is limited to 300 Hz-3.4 kHz.
Because the human hearing system can infer the fundamental frequency from the harmonics
present in the signal, researchers and academicians have always targeted the high-frequency
content rather than the low; from a speech quality and intelligibility point of view, high-
frequency content is considerably more significant than low-frequency content.
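To make the band-limitation concrete, the following illustrative Python sketch (not part of the thesis experiments; filter length and tone frequencies are arbitrary) simulates telephone-style band limitation: a 16 kHz signal is low-passed at 3.4 kHz with a windowed-sinc FIR filter and decimated by two to an 8 kHz N.B. sampling rate, so a 6 kHz component is lost while a 1 kHz component survives.

```python
import math

def lowpass_fir(cutoff_hz, fs_hz, taps=101):
    """Hamming-windowed sinc low-pass FIR filter design."""
    fc, mid, h = cutoff_hz / fs_hz, (taps - 1) / 2, []
    for n in range(taps):
        t = n - mid
        sinc = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        h.append(sinc * (0.54 - 0.46 * math.cos(2 * math.pi * n / (taps - 1))))
    return h

def convolve(x, h):
    """Direct-form convolution, truncated to the length of x."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))
            for n in range(len(x))]

# A 16 kHz test signal: one tone inside the telephone band, one above it.
fs = 16000
x = [math.sin(2 * math.pi * 1000 * n / fs)      # 1 kHz: survives
     + math.sin(2 * math.pi * 6000 * n / fs)    # 6 kHz: removed by band limiting
     for n in range(2000)]
# Low-pass at 3.4 kHz, then decimate by 2 to the 8 kHz N.B. sampling rate.
nb = convolve(x, lowpass_fir(3400, fs))[::2]
```

Without the filter, the 6 kHz tone would alias down to 2 kHz after decimation; with it, the high-band energy is simply gone, which is exactly the information B.W.E. tries to restore.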
Fig. 1.2 depicts the average speech (voice) spectrum estimated from a 10 s long
speech (voice) sample of a male speaker [45]. From the diagram one can see that the amount
of information in N.B. is small compared to W.B. and S.W.B., and that W.B. carries more
information than N.B. but less than S.W.B. Investigations at various levels have found that
the bandwidth limitation of transmitted speech (voice) results in much inferior quality
compared with face-to-face conversation, owing to the reduction in speech (voice) quality
and intelligibility.
1.1.5 Background on Bandwidth Extension
In the recent scenario of advancement in next-generation wireless communication
systems, the missing high-band (H.B.) component information limits the ability of N.B. to
represent consonant sounds, making the speech (voice) signal seem muffled, barely audible
and thin.
Figure 1.2 Average speech spectrum estimated from a 10 s long speech (voice) sample of a
male speaker. The N.B. band (300-3400 Hz) is colored red, the W.B. band (50-7000 Hz)
blue and the super-wideband (S.W.B.) band (50-14000 Hz) yellow [45]
Fig. 1.3 illustrates bandwidth extension at the receiver terminal. As depicted in
Fig. 1.4, N.B. speech falls short of the W.B. spectrum. The differences between the three
spectra are quite obvious and result in speech (voice) quality degradation.
A general block diagram of N.B.-W.B.-S.W.B. telephony is illustrated in Fig. 1.5. As
represented in Fig. 1.5(a), B.W.E. takes place at the receiver side, while in Fig. 1.5(b) it
takes place in the network. In Fig. 1.5(c), the information required to recover the missing
H.B. components from the band-limited N.B. speech is sent as side information, or
embedded in the N.B. signal itself via digital watermarking. Fig. 1.5(d) depicts direct
W.B.-to-W.B. transmission from the far-end to the near-end terminal. As depicted in
Fig. 1.5(e), B.W.E. can also be performed at the receiver side to obtain an S.W.B. signal
from the W.B. signal.
Figure 1.3 Bandwidth extension at the receiver terminal [46]
Figure 1.4 Spectra comparison of original, N.B. & W.B. speech (voice) signal [46]
Figure 1.5 Step from N.B.-W.B.-S.W.B. telephony [46]
For the reconstruction of the H.B. components, B.W.E. schemes can be categorized
into two types, namely blind and non-blind.
Non-blind methods:
Non-blind B.W.E. schemes recover the missing H.F. components at the receiving end
from auxiliary side information related to the H.F. band, which is encoded into the data
stream together with the N.B. components [27]. The inclusion of such side information
incurs an additional burden of 1-5 kbps [47]. Examples are the enhanced AMR codec [48],
the HE-AAC codec [34-35] and the AMRWB codec [20].
Blind methods:
Blind B.W.E. methods estimate the missing H.B. components using only the
available N.B. components. Such B.W.E. solutions exploit the correlation between the N.B.
and H.B. components of speech (voice): the missing H.B. components are estimated with a
regression model learned from W.B. speech training data.
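The regression idea can be reduced to a toy example. Below, a hypothetical scalar N.B. feature is mapped to a made-up high-band log-energy target with ordinary least squares; every number is fabricated purely for illustration and stands in for statistics that a real blind B.W.E. system would learn from W.B. training corpora.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y ~ w*x + b, in closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return w, my - w * mx

# Made-up training pairs: a hypothetical scalar N.B. feature against a
# hypothetical high-band log energy measured from W.B. training speech.
nb_feature = [0.1, 0.3, 0.5, 0.7, 0.9]
hb_energy = [-4.1, -3.2, -2.4, -1.5, -0.6]
w, b = fit_linear(nb_feature, hb_energy)
hb_estimate = w * 0.6 + b   # blind H.B. estimate for an unseen N.B. frame
```

Practical systems replace this scalar map with richer regressors (codebook mapping, GMMs, neural networks) over full feature vectors, but the principle — predict the unseen band from the seen one — is the same.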
1.1.6 Problem Background
In the present-day state of affairs, speech (voice) quality in both wired and wireless
communication structures is, for various substantial reasons, generally degraded at the
receiver side. One fundamental and well-considered configuration is the N.B. arrangement
supporting a bandwidth of 300 Hz-3.4 kHz. The drawback of the N.B. communication
structure is that the speech (voice) signal sounds stifled and thin due to the absence of high-
band (H.B.) spectral components [49]. The limited frequency band trims down the quality
and clarity of the speech (voice) signal, since the missing high-frequency components play a
noteworthy part especially in consonant sounds [7].
Due to the missing high-band (H.B.) components, the limited ability of N.B. to
represent consonant sounds, as in words like ship, pin, run and work, results in a speech
signal that sounds stifled, muffled and thin [45]. To obtain better quality, clarity, naturalness,
pleasantness and brightness of the speech (voice) signal, N.B. must be upgraded to a W.B.
system. The use of a W.B. system makes it necessary to upgrade the transmission network
and terminal devices to W.B.; software and hardware upgrade, compatibility and feasibility
problems would need to be solved to abruptly replace the entire existing N.B. coding
system [46].
1.1.7 Problem Specification
Fig. 1.6 depicts the original W.B. and the down-sampled N.B. speech (voice) signal
together with their spectrograms. A closer look at Fig. 1.6 reveals that N.B. speech lacks
significant parts of the spectrum, and the difference between the original W.B. and the
down-sampled N.B. speech (voice) signal is clearly noticeable. Recreating the missing
frequency components from 3.4 kHz to 4.6 kHz is therefore a demanding task [50].
Figure 1.6 Missing frequency components of W.B. and N.B. signal [50]
Bandwidth extension techniques for speech (voice) are mostly categorized into two
classes. In the first, the absent frequency components are obtained from the available
narrowband voice components alone, without any transmission of information about the
missing frequencies [5]. The second relies on an information-hiding process. The former
strategy creates the W.B. speech (voice) signal with a source-filter model (S.F.M.), which
expresses the signal as an excitation signal shaped by LPC coefficients representing the
spectral envelope [51]. In the second method, the W.B. line spectral pair (LSP) bits can be
embedded as a watermark in the N.B. PCM samples; in simple words, B.W.E. techniques
based on steganography insert high-frequency information into the N.B. speech (voice) data
or bit-stream. A limitation of the steganography-based B.W.E. method is that matching
techniques must be supported at both ends of the transmission line [7,52]. The work
presented in this thesis addresses these issues by performing B.W.E. of the speech (voice)
signal based on a source-filter model (S.F.M.) and carrying out subjective and objective
measurements on the MATLAB platform.
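One classic, textbook way to regenerate a high-band excitation in S.F.M.-based B.W.E. is spectral folding: zero-insertion upsampling mirrors the N.B. spectrum into the upper band. The sketch below (plain Python; the thesis works in MATLAB, and this is a generic illustration rather than the thesis's specific method) demonstrates the effect on a single tone.

```python
import math

def spectral_fold(nb_excitation):
    """Upsample by 2 via zero insertion: the N.B. spectrum (0-4 kHz at an
    8 kHz rate) is mirrored into 4-8 kHz of the new 16 kHz rate, giving a
    crude high-band excitation."""
    out = []
    for s in nb_excitation:
        out.extend((s, 0.0))
    return out

# A single 1 kHz tone at 8 kHz: after folding, energy appears at both
# 1 kHz and its mirror image, 7 kHz, at the 16 kHz rate.
nb = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(400)]
wb = spectral_fold(nb)
```

In a complete S.F.M.-based system this folded excitation would then be shaped by an estimated H.B. spectral envelope and gain before being recombined with the upsampled N.B. signal.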
In the recent scenario of advancement in next-generation wireless technology, several
smart devices support high-quality speech (voice) communication services at S.W.B., but
speech (voice) quality is degraded when they are used with devices or networks lacking
S.W.B. support. So either B.W.E. or a suitable S.W.B. codec is a primary requirement for
next-generation wireless communication. Proposing a highly efficient algorithm with low
latency is a major focus area for researchers in next-generation wireless communication.
1.1.8 Motivation and applications
This section lists the applications of B.W.E. in different situations.
When network and/or mobile terminals do not support W.B. communication:
To improve upon the speech quality offered by traditional telephony infrastructure,
coding techniques have been developed to compress information at higher bandwidths. Calls
at a higher bandwidth are feasible only if the complete communication path supports
operation at that bandwidth; a lack of support anywhere along the path leads to a reduction in
bandwidth and thereby a reduction in speech quality.
In the recent scenario, a combination or interconnection of different networks and
mobile devices supports N.B., W.B. and S.W.B. communications [53]. While the
deployment of W.B. codecs and networks is in progress, it is slow, as it incurs costs for
network operators as well as end users. Additionally, a phone call may involve a landline
device that restricts the bandwidth to N.B. by default. So, even today, a significant portion of
calls operate in N.B. mode, and the migration to W.B. will take considerable time [54]. N.B.
and W.B. networks and devices (or terminals) will thus coexist for some years to come,
leading to mobile phone calls of different bandwidths at the receiving terminal [55].
When a W.B.-to-N.B. handover occurs during a phone call:
Due to the presence of heterogeneous communication networks [56], bandwidth
switching from W.B. to N.B. may take place during an ongoing phone call. This may happen
either because of handovers between two different networks or because of a drop in network
resources that causes a dynamic fall-back from W.B. to N.B. mode [57-58]. This may lead to
sudden changes in quality and hence an irritating client experience. According to [55], a
W.B.-to-N.B. handover leads to a perceived speech (voice) quality even below that of plain
N.B. A possible solution to this problem is to switch back to W.B. communication as soon as
the network resources are restored. However, subjective tests have indicated that switching
from N.B. to W.B. speech (voice) is perceived as an impairment unless the transition
happens early enough, at the beginning of the call [58]. According to [57], W.B.
transmission should continue for at least 30 seconds after switching for the improved speech
(voice) quality to be beneficial. So, instead of switching back from N.B. to W.B., B.W.E.
can be applied as soon as the call falls back to N.B. mode, thereby mitigating the need for a
true W.B. call. A comparison of the subjective quality of two switching schemes, namely
transitions between W.B. and N.B. (AMR-coded) speech (voice) and transitions between
W.B. and bandwidth-extended N.B. (AMR-coded) speech, is reported in 3GPP TR 26.976
[59].
When there is a B.W. mismatch between training and testing data:
For certain applications, e.g. speaker recognition, large amounts of N.B. data are
available for model training. To operate on N.B. data, the conventional approach is to
down-sample the W.B. data to N.B.; however, this discards useful spectral content of the
W.B. speech. B.W.E. can therefore be utilized to shrink the bandwidth mismatch between
speech data recorded at different sampling rates: N.B. speech data can be bandwidth-
extended and then used together with the available W.B. data. This is helpful in two respects.
First, the amount of W.B. training data can be augmented for the training of W.B. models.
Second, already trained W.B. models can still be used (with the application of B.W.E.) in the
case that the test data is N.B.; re-training of the models with N.B. data is no longer
necessitated. The use of B.W.E. therefore permits a single model to be trained while still
supporting different bandwidth modes. B.W.E. has been investigated for applications such as
speaker recognition and automatic speech recognition [60,61-63] and speaker identification
and verification [64-65] to improve the performance of W.B. models by increasing the
amount of W.B. training data through B.W.E.
When users have hearing impairments:
People with impaired hearing who use hearing aids frequently face difficulties during
telephone calls due to bandwidth limitations. Bandwidth extension can be used to improve
the intelligibility and quality of N.B. speech for such users [66]. The study reported in [67]
showed that users with hearing impairments can tolerate overestimated energies in the
bandwidth-extended speech (voice) sounds, thereby providing increased intelligibility.
Super-wide bandwidth extension
With the progress in S.W.B. and F.B. speech coding techniques, many smart devices
and networks now support high-quality speech communication services at super-wide
bandwidths. However, in today's heterogeneous networks, S.W.B. devices are frequently
used together with other devices and networks which support only N.B. or W.B.
communications. While these generally offer backward compatibility, users of S.W.B.
devices will then be restricted to N.B. or W.B. communication. S.W.B.E. has the objective
of narrowing the gap in quality between W.B. and S.W.B. communications.
1.2 Contribution
The steps followed in this research are:
To carry out a literature survey relating to bandwidth extension (B.W.E.) algorithms,
define the problem, and set the objective.
To study the speech production model, speech transmission, quality, and intelligibility.
To study the source-filter model (S.F.M.) based on linear prediction analysis.
To analyze the existing speech bandwidth extension methods for W.B. and S.W.B. with
their pros and cons.
To develop a bandwidth extension model for W.B. to S.W.B. conversion by taking a
general model for B.W.E. from N.B. to W.B. as a reference and utilizing the concept
of a baseline model based on high-frequency bandwidth extension (H.F.B.E.).
To perform MATLAB-based simulation for subjective as well as objective
measurement of the proposed model and compare the results with the baseline as well as
a next-generation speech codec.
1.3 Organization of the Thesis
Chapter 1 presented the evolution of communication systems, the problem, the scope, and an
outline of the contributions of the research.
Chapter 2 covers the literature survey relating to bandwidth extension (B.W.E.) algorithms,
the definition of the problem, and the setting of the objectives.
Chapter 3 discusses the speech production model, digital speech transmission, speech
quality and intelligibility, speech coding basics, the classification of coders, speech codec
attributes, etc.
Chapter 4 describes the development of the B.W.E. model for W.B. to S.W.B. conversion
through a detailed analysis of the N.B. to W.B. model and the reference algorithm.
Chapter 5 analyses linear prediction analysis and synthesis with stability criteria.
Chapter 6 discusses the MATLAB-based simulation for subjective as well as objective
measurement of the proposed model and compares the results with the baseline as well as a
next-generation speech codec.
Chapter 7 concludes with the overall results and highlights the scope for future work in this
area of research.
Chapter 8 presents a summary of the list of publications and references.
CHAPTER-2
Literature Review and Objective of Work
2.1 Literature reviews of different papers
Review of Paper 1
Title of Paper
Evaluation of Levinson-Durbin Recursion Method for Source-
Filter Model-based Bandwidth Extension Systems
Author
G. Gandhimathi, Dr. A. Hemalatha, Dr. S. P. K. Babu
Publication Year
2015
Publication
International Journal of Scientific & Engineering Research,
Periyar Maniammai University Thanjavur, India
Review:
Bandwidth extension (B.W.E.) techniques are employed to generate a W.B. signal
from a N.B. signal. Since most of the H.F. components, such as the fricative consonants, are
absent in the N.B. representation of the sound, it is a challenging task to create those missing
components in the W.B. equivalent signal. In this paper, the authors evaluate the
performance of two autoregressive (AR) modeling methods: the autocorrelation method (LPC)
and the Levinson-Durbin recursion method. These estimation methods lead to approximately
the same results (the same coefficients) for particular autoregressive parameters, but the small
differences in such estimates have a great impact on the quality of the reproduced sound.
The authors implement a source-filter model-based speech bandwidth extension system
with the above two AR modeling methods and validate their performance with suitable
metrics.
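The recursion evaluated in this paper can be illustrated with a short sketch. The following Python fragment (illustrative only; the simulations in this thesis use MATLAB, and the function name and interface are assumptions for demonstration) solves the autocorrelation normal equations for the AR (LPC) coefficients by the Levinson-Durbin recursion:

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations of the autocorrelation method
    for the AR (LPC) coefficients via the Levinson-Durbin recursion.

    r     : autocorrelation sequence r[0..order]
    order : prediction order p
    Returns (a, e): a[0..p-1] are the predictor coefficients a_1..a_p in
    x_hat[n] = sum_k a_k x[n-k], and e is the final prediction error power.
    """
    a = np.zeros(order + 1)
    e = r[0]
    for i in range(1, order + 1):
        # reflection coefficient k_i from the current prediction residual
        acc = r[i] - np.dot(a[1:i], r[i-1:0:-1])
        k = acc / e
        # order-update of the predictor coefficients
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i-1:0:-1]
        a, e = a_new, e * (1.0 - k * k)
    return a[1:], e
```

For an AR(1)-like autocorrelation sequence r = [1, 0.9, 0.81] the recursion returns a first coefficient of 0.9, a second coefficient of essentially zero, and a final error of 0.19, matching the analytic solution.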
Review of Paper 2
Title of Paper
Bandwidth Extension for Speech in the 3GPP EVS Codec
Author
Venkatraman Atti, Venkatesh Krishnan, Duminda Dewasurendra,
Venkata Chebiyyam
Publication Year
2015
Publication
IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP)
Review:
In this paper, the authors describe the time-domain bandwidth extension
(T.B.E.) framework employed to code W.B. and S.W.B. speech in the newly standardized 3GPP
enhanced voice services (E.V.S.) codec. In particular, the advanced modeling techniques used
to recreate the W.B. and S.W.B. frequencies with a small number of bits paved the way for EVS
to become the most advanced feature-rich conversational speech coder of its time. At 13.2
kbps, the S.W.B. coding of speech uses as low as 1.55 kbps for encoding the spectral content
from 6.4-14.4 kHz. Extensive MOS testing as per ITU-T P.800 has shown that the EVS
codec with time-domain bandwidth extension outperforms all other standard codec references
by significant margins, making it the ideal codec to be deployed in modern VoLTE networks
and other VoIP networks such as VoWiFi.
Review of Paper 3
Title of Paper
Simulation and overall comparative evaluation of performance
between different techniques for high band feature extraction based
on bandwidth extension
Author
Ninad S. Bhatt
Publication Year
2016
Publication
Springer, Int. Journal of Speech Technology
Review:
In this paper the author inspects, studies, and simulates the calculation of the high-band
(H.B.) component based on linear predictive coding (L.P.C.) and M.F.C.C. techniques. The
author compares B.W.E. using the LPC and MFCC techniques for different extensions of the
excitation. From the simulation results and the bar charts for mean opinion score (MOS) and
perceptual evaluation of speech quality (P.E.S.Q.) highlighted in the paper, it is clear that,
amongst the various excitation methods, sinusoidal transform coding (S.T.C.) [82] and
full-wave rectification (F.W.R.) as a non-linear distortion (N.L.D.) component produce good
results in comparison with all other techniques for various speech files, for both the LPC and
MFCC based methods [26].
Review of Paper 4
Title of Paper
Audio Bandwidth Extension
Author
Somesh Ganesh
Publication Year
2016
Publication
Georgia Institute of Technology Atlanta, Georgia
Review:
In this paper, the author shows that audio bandwidth extension is a technique used to
improve the perceived quality of band-limited audio generated using various audio codecs.
The author proposes a few methods for audio bandwidth extension and evaluates them by
performing listening tests. A comparison between half-wave rectification and full-wave
rectification, along with the use of sub-band filtering, is made. The conclusion from the
experiments conducted is that half-wave rectification performs better than full-wave
rectification as a non-linear device, and the use of sub-band filtering improves the perceived
quality of the acquired bandwidth-extended output signal.
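The reason half- and full-wave rectification work as non-linear devices for bandwidth extension is that they create energy at frequencies absent from the input. The small Python sketch below (an illustration only; the signal values and the helper name are chosen for the example, not taken from the reviewed papers) makes this visible for a single tone:

```python
import numpy as np

fs = 16000                           # sampling rate in Hz
t = np.arange(fs) / fs               # one second of signal
x = np.sin(2 * np.pi * 1000 * t)     # 1 kHz tone as a band-limited "excitation"

half = np.maximum(x, 0.0)            # half-wave rectification
full = np.abs(x)                     # full-wave rectification

def tone_level(sig, f):
    """Magnitude of the DFT bin at frequency f Hz (one-second signal)."""
    spec = np.abs(np.fft.rfft(sig)) / len(sig)
    return spec[int(f)]

# The input has no energy at 2 kHz; both rectified versions do.
# Full-wave rectification keeps only even harmonics (the fundamental
# vanishes), while half-wave rectification retains the fundamental too.
```

This asymmetry (half-wave keeps the original tone, full-wave discards it) is one reason the two devices behave differently in listening tests.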
Review of Paper 5
Title of Paper
Bandwidth Extension of Speech Signals using Quadrature Mirror Filter
Author
Janki Patel, Nikunj V. Tahilramani and N. S. Bhatt
Publication Year
2018
Publication
IEEE- International Conference on Computation of Power,
Energy, Information and Communication (ICCPEIC)
Review:
In ordinary telephone networks, the speech signal sounds stifled, muffled, and thin
due to the limited N.B. frequency range (300 Hz to 3.4 kHz). So it is a prime requirement to
apply a bandwidth extension algorithm to it. Here a QMF-based B.W.E. system is proposed
which analyzes and synthesizes the speech (voice) signal to acquire a W.B. speech (voice)
signal at the receiver side and so improve the quality and naturalness of the speech (voice).
By applying the B.W.E. algorithm, the missing higher frequency components of speech can
be added at the end terminal to produce W.B. speech. The objective and subjective evaluations
show that the quality of speech synthesized by the proposed method is better than N.B. speech.
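The QMF analysis-synthesis idea behind this paper can be sketched with the simplest (two-tap) QMF pair; the filter choice and function names below are illustrative, not those of the reviewed system. The highpass filter mirrors the lowpass prototype, and choosing the synthesis filters as G0 = H0 and G1 = -H1 cancels the aliasing introduced by the down- and up-sampling:

```python
import numpy as np

h0 = np.array([1.0, 1.0]) / np.sqrt(2.0)   # prototype lowpass
h1 = h0 * np.array([1.0, -1.0])            # H1(z) = H0(-z): highpass mirror

def qmf_analysis(x):
    """Split x into low/high sub-bands, each at half the sample rate."""
    low = np.convolve(x, h0)[::2]
    high = np.convolve(x, h1)[::2]
    return low, high

def qmf_synthesis(low, high):
    """Recombine the sub-bands; G0 = H0, G1 = -H1 cancels aliasing."""
    def upsample(s):
        u = np.zeros(2 * len(s))
        u[::2] = s
        return u
    # with this 2-tap (Haar-like) pair, reconstruction is exact
    # up to a one-sample delay
    return np.convolve(upsample(low), h0) - np.convolve(upsample(high), h1)
```

Longer QMF prototypes trade this exact reconstruction for better separation between the bands, which matters when the high band is to be replaced by synthesized content.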
Review of Paper 6
Title of Paper
Super Wideband Spectral Envelope Modeling for Speech Coding
Author
Guillaume Fuchs, Chamran Ashour
Publication Year
2019
Publication
INTERSPEECH 2019, September 15-19, 2019, Graz, Austria
Review:
N.B. speech, traditionally transmitted in communication, has been extended to W.B.
(up to 8 kHz) in digital communications by different standards [83-85]. This trend
continued with the more recent introduction of super-wideband (S.W.B., up to 16 kHz)
speech coders [86-87]. Nonetheless, existing S.W.B. coding solutions are all built on a dual-
band system, coding the lower band separately from the upper band according to a source-
filter model of speech production and purely perceptual considerations using a parametric
representation, respectively. The spectral envelope is generally modeled in speech coding by
Linear Predictive Coding (LPC) parameters characterizing the vocal tract and acting as an
autoregressive estimate of the speech. Although applying LPC to N.B. and W.B. speech
(voice) is well studied in the literature [88], very little consideration has been given in the
past to S.W.B. The widening of the audio B.W. has two major impacts on the design of the
LPC: the order of the prediction and the design of the high-frequency pre-emphasis filter
usually applied before the linear prediction analysis. As the number of samples increases with
a higher sampling frequency, the prediction order must be adjusted accordingly. The LPC
order of 10 usually adopted for N.B. was increased to 16 for W.B. and should probably be
even higher for S.W.B. However, the order has a significant impact on the bit-rate required
for transmitting the LPC coefficients.
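The high-frequency pre-emphasis mentioned above is conventionally a first-order FIR filter applied before LP analysis; the sketch below (illustrative Python, with a typical α = 0.9 that is an assumption rather than the value used in the paper) shows the filter and its exact inverse:

```python
import numpy as np

def preemphasis(x, alpha=0.9):
    """First-order pre-emphasis H(z) = 1 - alpha*z^-1, conventionally
    applied before linear prediction analysis."""
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - alpha * x[:-1]
    return y

def deemphasis(y, alpha=0.9):
    """Inverse IIR filter 1 / (1 - alpha*z^-1): undoes the pre-emphasis."""
    x = np.empty_like(y)
    acc = 0.0
    for n, v in enumerate(y):
        acc = v + alpha * acc
        x[n] = acc
    return x
```

The filter's gain is 1 - α at DC and 1 + α at the Nyquist frequency, so the persistent downward spectral tilt of voiced speech is flattened before the predictor is fitted, which is why the filter design becomes more delicate as the analyzed bandwidth grows.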
2.2 Review sheet for each paper
Table 2.1 Review summary of all research papers.

1. Evaluation of Levinson-Durbin Recursion Method for Source-Filter Model-based
Bandwidth Extension Systems
Review comments:
1. Source-filter model-based B.W.E. using LPC coefficients is a simple, reliable technique
to achieve B.W.E. from N.B. to W.B., and it can serve as a foundation for B.W.E. from W.B.
to S.W.B.
2. From the spectrogram comparison of the N.B. input and the extended W.B. output of
B.W.E., one can note that the proposed algorithm is able to represent consonant sounds,
mostly fricative consonants (e.g. /th/, /sh/, /ch/, etc.).

2. Bandwidth Extension for Speech in the 3GPP EVS Codec
Review comments:
1. At 13.2 kbps, the super-wideband coding of speech (voice) uses as low as 1.55 kbps for
encoding the spectral content from 6.4-14.4 kHz.
2. Subjective (MOS) testing as per ITU-T P.800 has confirmed that the EVS codec with
time-domain B.W.E. performs well in comparison with all other standard codec references,
making it the ideal codec to be deployed in modern VoLTE and VoIP networks.

3. Simulation and overall comparative evaluation of performance between different
techniques for high band feature extraction based on bandwidth extension
Review comments:
1. The paper presents an approach to inspect, study, and simulate the calculation of the
high-band (H.B.) component based on linear predictive coding (L.P.C.) and Mel frequency
cepstral coefficient techniques.
2. Amongst the various excitation methods, the results obtained for sinusoidal transform
coding (S.T.C.) and full-wave rectification (F.W.R.) as a non-linear distortion component
are good in comparison with all other techniques in terms of MOS score for various speech
files, for both the LPC and MFCC based methods.

4. Audio Bandwidth Extension
Review comments:
1. Audio bandwidth extension is a technique used to improve the perceived quality of
band-limited audio generated using various audio codecs.
2. From the listening tests conducted as per the ITU-T recommendation for various
bandwidth extension methods, one can note that half-wave rectification performs better
than full-wave rectification as a non-linear device. The use of sub-band filtering improves
the perceived quality of the bandwidth-extended signal.

5. Bandwidth Extension of Speech Signals using Quadrature Mirror Filter
Review comments:
1. The QMF-based B.W.E. system analyzes and synthesizes the speech (voice) signal to
acquire a W.B. speech (voice) signal at the receiver side and so improve the quality and
naturalness of the speech (voice).
2. The missing higher frequency components of speech (voice) can be added at the end
terminal to produce W.B. speech (voice).
3. The objective and subjective evaluations show that the quality of speech (voice)
synthesized by the proposed method is better than N.B. speech.

6. Super Wideband Spectral Envelope Modeling for Speech Coding
Review comments:
1. The author concentrates on spectral envelope modeling by Linear Predictive Coding
(LPC) parameters characterizing the vocal tract.
2. Widening the audio bandwidth affects the order of the prediction and the design of the
high-frequency pre-emphasis filter usually applied before the linear prediction analysis.
3. As the number of samples increases with a higher sampling frequency, the prediction
order must be adjusted accordingly. The LPC order of 10 usually adopted for N.B. was
increased to 16 for W.B. and should probably be even higher for S.W.B.
4. The order has a significant impact on the bit-rate required for transmitting the LPC
coefficients.
2.3 Summary of Literature Survey
Bandwidth extension (B.W.E.) algorithms aspire to estimate the missing higher
frequency components at 3.4-8 kHz for a W.B. signal or 6.8-16 kHz for an S.W.B. signal. In
this chapter, a survey of approaches to B.W.E. is presented. While B.W.E. approaches focus
on N.B. to W.B. extension, S.W.B.E. approaches for W.B. to S.W.B. extension are developed
to bridge the quality gap between W.B. and S.W.B. communication. While studying bandwidth
extension, several parameters and constraints need to be taken into consideration to propose a
highly efficient algorithm that introduces only negligible latency. During the literature review,
many research papers, websites, journals, and other articles on wireless communication
[1-5], [10], ETSI standards [14],[17],[19], ITU-T Recommendations [8],[9],[12],[18],[22-23],
[25],[27],[29],[33],[109],[137-138],[140],[145], and the 3GPP TS series [20],[38],[40],[41],[59]
for coding of speech, followed by B.W.E. [6],[34],[35],[54],[55],[62],[64],[65],[67],[77], were
referred to. A general overview of bandwidth extension has been presented in many
research papers [7],[13],[45-46],[50-52],[68],[81]. This step led us to define the research
problem. In [69] the author has given a solution to trim down the wideband speech coder bit-rate
by coding the parameters of wideband speech (voice) without a noteworthy increase over the
bit-rates of N.B. coders. [51-52],[70-71] have discussed various approaches (e.g. the linear
prediction algorithm using L.P.C. coefficients and the codebook mapping approach) and made
subjective measurement comparisons for various audio wave files.
Performance assessment of the speech (voice) signal based on the sub-band filter
followed by the source-filter model is discussed in [72]. Based on the review, it is found that
linear prediction is an effective tool for bandwidth extension. It plays a very significant role in
achieving data rate compression. After studying linear prediction analysis, synthesis, and
stability criteria in detail, it is found that, for next-generation wireless communication,
bandwidth extension from the signal itself (without sending side information) is useful for data
rate reduction, because no side information needs to be sent for the reproduction of the signal
at the receiver side. Codebook mapping [73-74], linear mapping [75], neural networks, etc. are
used to estimate the missing components. [76-78] identify the potential features of speech
(voice) and evaluate their performance for the B.W.E. application. In [79] the author considers
various approaches for bandwidth extension, namely with and without a speech (voice)
production model as well as B.W.E. with side information. The author has listed all the
techniques for estimation of the wideband spectral envelope. In [80] the author has designed
the source-filter model based on the vocal tract to retrieve the bandwidth-extended output. So
in this thesis, the basic discussion takes place between the first general model and the baseline
model. Based on these models, the proposed system model for B.W.E. with its flow diagram is
discussed. Simulation results achieved for the proposed system model in terms of mean
opinion score (M.O.S.), perceptual evaluation of speech quality (P.E.S.Q.), and spectrogram
are compared with the baseline algorithm and the next-generation super wideband coder
algorithm.
2.4 Definition of the Problem
The problem title is "Analytical Studies Relating To Bandwidth Extension of Speech
Signal For Next Generation Wireless Communication". The problem is the B.W.E. of the
speech (voice) signal from W.B. to S.W.B. to improve the naturalness, pleasantness, clarity,
and brightness of the speech (voice) signal. In next-generation wireless communication, many
smart devices support high-quality speech (voice) communication services at S.W.B., so
B.W.E. plays an important role in the system.
This can be done by:
Employing an S.F.M.-based approach, taking the general model for B.W.E. and the baseline
algorithm as references, to achieve B.W.E. from W.B. to S.W.B. In this approach, a
subjective listening test (MOS) as per the ITU-T recommendation for various speech (voice)
files is performed, followed by comparisons of the spectrogram of the estimated S.W.B. with
the baseline algorithm and the next-generation super wideband coder algorithm.
2.5 Objectives and Overview of the Research
Bandwidth extension (B.W.E.) is utilized to produce W.B. and S.W.B. quality voice
at the end devices without much modification of the existing environment. In other
words, the objective of bandwidth extension (B.W.E.) is to extend the bandwidth (B.W.) of
the reproduced sound by synthesizing and adding additional high-frequency components to
the received low-bandwidth speech signal.
So the objectives of bandwidth extension (B.W.E.) are listed as follows.
To increase the transparency and genuineness of the recovered speech, N.B.→W.B.
To further improve the transparency and genuineness of the speech (voice) signal,
W.B.→S.W.B.
To propose an algorithm based on the source-filter model (S.F.M.) by taking a baseline
algorithm as a reference and compare it with a next-generation super wideband coder in terms
of perceptual evaluation of speech quality (PESQ), mean opinion score (MOS), and
comparison mean opinion score (CMOS).
To compare the spectrograms of the extended output of the baseline coder, the proposed
coder, and the next-generation S.W.B. coder and show that the results obtained with the
proposed coder are comparable with the baseline as well as the next-generation super
wideband coder.
B.W.E. based on the source-filter model is a suitable choice because representing
20 ms (frame duration) × 8 kHz (sampling frequency) = 160 samples by just 12 (twelve)
parameters results in a data rate reduction, ultimately saving bandwidth and storage
requirements.
The digital filter and its slowly changing parameters are the key factors in achieving
compression of a speech signal.
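The data-rate argument above can be checked with simple arithmetic. The bit allocations below (16-bit PCM, 8 bits per parameter) are illustrative assumptions used only to make the comparison concrete; real LPC coders also transmit gain, pitch, and voicing information:

```python
# Data-rate arithmetic behind the source-filter representation.
frame_ms = 20
fs = 8000                                           # N.B. sampling rate, Hz
samples_per_frame = fs * frame_ms // 1000           # 160 samples per frame

pcm_bits_per_sample = 16                            # assumed raw PCM precision
pcm_bits_per_frame = samples_per_frame * pcm_bits_per_sample

lpc_params_per_frame = 12                           # predictor coefficients
bits_per_param = 8                                  # assumed quantization
lpc_bits_per_frame = lpc_params_per_frame * bits_per_param

frames_per_second = 1000 // frame_ms                # 50 frames/s
pcm_rate = pcm_bits_per_frame * frames_per_second   # raw PCM bit-rate
lpc_rate = lpc_bits_per_frame * frames_per_second   # parametric bit-rate
```

Even with this untuned 8-bit parameter quantization, the parametric frame needs 96 bits where raw 16-bit PCM needs 2560, roughly a 27:1 reduction in the required bit-rate.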
The research work presented here can be utilized to improve the speech (voice) quality
obtainable with the W.B. system and to preserve speech (voice) quality when S.W.B. devices
are used alongside W.B. services.
CHAPTER-3
Fundamental of Speech Production Model,
Types of Speech coder and its attributes
3.1 Introduction
Before studying B.W.E. for next-generation wireless communication in detail, it is
essential to know the fundamentals of the speech (voice) production process as well as the
data rate of a speech (voice) signal. A speech (voice) signal is defined as the acoustic wave
that is radiated from the human speech production system when air is expelled from the lungs
and the resulting flow of air is agitated by a constriction somewhere in the vocal tract.
Fig. 3.1 represents the speech (voice) production system model. A speech (voice) signal
is created by a speaker at his/her mouth or lips in the form of pressure waves. The organs
engaged in the speech (voice) creation mechanism are the lungs, the larynx, and the vocal
tract (V.T.). The airflow generated by the lungs is modulated by the vocal cords or vocal folds
of the larynx. The airflow passing through the glottis (a slit-like orifice between the two vocal
folds) is translated into either a quasi-periodic or a noisy airflow by the vibration of the vocal
folds. The resulting airflow source stimulates the V.T., which comprises the oral, nasal, and
pharynx cavities. The V.T. performs spectral shaping or coloring of the excitation source. The
subsequent variation of air pressure at the lips spreads out in the form of traveling waves
called speech (voice) [90]. Speech (voice) signals can therefore be seen as the output of a
filtering operation in which the V.T. system (or filter) is excited by the modulation of an
excitation source or airflow. This mechanism is typically known as the source-filter model
(S.F.M.) of speech (voice) production, which allows the modeling of speech (voice) signals as
a convolution of the impulse response of the V.T. filter with the excitation source [91].
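This source-filter view translates directly into a toy synthesis: a quasi-periodic impulse train (the source) driving an all-pole filter (the vocal tract). The Python sketch below is a deliberately minimal illustration; the single resonance and all numeric values are assumptions chosen for demonstration, not a model of any real vowel:

```python
import numpy as np

def synthesize_voiced(a, pitch_period, n_samples):
    """Toy source-filter synthesis: an impulse-train glottal source
    excites an all-pole vocal-tract filter 1/A(z), where
    A(z) = 1 - sum_k a[k-1] z^-k."""
    # quasi-periodic excitation: one impulse per pitch period
    e = np.zeros(n_samples)
    e[::pitch_period] = 1.0
    # direct-form all-pole filtering: s[n] = e[n] + sum_k a_k s[n-k]
    s = np.zeros(n_samples)
    p = len(a)
    for n in range(n_samples):
        acc = e[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += a[k - 1] * s[n - k]
        s[n] = acc
    return s

# one resonance ("formant") at about 500 Hz for fs = 8 kHz (illustrative):
fs = 8000
f0, bw = 500.0, 100.0
r = np.exp(-np.pi * bw / fs)
theta = 2 * np.pi * f0 / fs
a = np.array([2 * r * np.cos(theta), -r * r])       # poles at r*e^{+/- j*theta}
s = synthesize_voiced(a, pitch_period=80, n_samples=800)   # 100 Hz "pitch"
```

The output is a quasi-periodic waveform whose spectrum consists of harmonics of the 100 Hz pitch shaped by the 500 Hz resonance, exactly the convolution structure described above.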
This section presents a concise overview of the different speech (voice) sounds, their
spectral characteristics, and the B.W. limitations imposed by the telephone filter, with their
consequences for intelligibility and quality.
Figure 3.1 Speech (voice) production system model [90]
3.2 Speech (voice) sounds
Speech (voice) sounds are broadly divided into two categories: vowels and consonants
[147].
Vowels:
Vowels form the largest group of phonemes. The characteristics of vowels change based on
the position of the tongue (which mainly determines the V.T. shape) towards the front, center,
or back of the oral cavity [147].
Consonants:
Consonants form the second largest group of phonemes. They can be subcategorized into
nasals, plosives, fricatives, whispers, and affricates [147].
Nasals:
Nasals are closest to vowels; they are produced at the nostrils by the quasi-periodic airflow
passing only through the nasal cavity, while the oral cavity remains constricted. Depending
upon the place of the constriction formed by the tongue across the oral cavity, different nasals
are distinguished, e.g. /m/ as in "mo" and /n/ as in "no" [147].
Fricatives are of two types.
Unvoiced fricatives:
Unvoiced fricatives (e.g. /sh/ in "should") are characterized by a noise source generated
by turbulent airflow near the V.T. constriction. The noise source is spectrally shaped
depending upon the location and degree of the constriction formed by the tongue at the teeth,
at the lips, or along the oral cavity [147].
Voiced fricatives:
Voiced fricatives (e.g. /z/ as in "zebra") are generated by the concurrent generation of
noise at the constriction and vibration of the vocal folds. These sounds are formed by a
combination of a noisy and a periodic airflow [147].
Diphthongs:
Diphthongs are produced with vibrating vocal folds analogous to vowels; however, the
V.T. does not remain steady but varies smoothly between two vowel configurations. So
diphthongs are characterized by formant transitions as the V.T. articulation changes gradually
between the two vowel positions. Examples of diphthongs are /Y/ as in "hide", /W/ as in
"out", /O/ as in "boy", and /JU/ as in "new" [147].
Spectral characteristics of speech sounds:
Different speech (voice) sounds are distinguished by different temporal and spectral
characteristics. They can be distinguished from each other based upon their acoustic
properties, such as rapid transitions in spectral content, abrupt changes in amplitude, the
presence, lack, or combination of voicing, and the spectral shape attributed to the V.T.
configuration [92]. Based on spectral properties, speech (voice) sounds can be categorized
into voiced and unvoiced sounds [147].
Voiced sounds:
The voiced sounds are the result of excitation of the V.T. by a quasi-periodic
glottal airflow. They are characterized by a quasi-periodic time-domain waveform with large
variations in magnitude. The periodicity is measured in terms of the pitch period. The
magnitude spectra of voiced sounds thus exhibit a harmonic structure with peaks at integer
multiples of the fundamental frequency or pitch, especially in the L.F. region [7,91].
Unvoiced sounds:
Unvoiced sounds are characterized by time-domain waveforms with relatively lower
magnitude than voiced sounds but with rapid variations, as a result of the noise-like
nature of the excitation source. So, the spectrum of unvoiced speech is spread over the whole
audio spectrum [147].
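A crude but standard way to exploit these differences is a zero-crossing-rate test per frame: noise-like (unvoiced) frames change sign far more often than quasi-periodic (voiced) frames. The Python sketch below is a simplified illustration; the threshold and the synthetic test signals are assumptions, and practical classifiers also use energy, periodicity, and spectral tilt:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return np.mean(signs[1:] != signs[:-1])

def classify_frame(frame, zcr_threshold=0.25):
    """Crude voiced/unvoiced decision from the zero-crossing rate:
    white-noise-like frames cross zero on roughly half the samples,
    low-pitched periodic frames only a few times per frame."""
    return "unvoiced" if zero_crossing_rate(frame) > zcr_threshold else "voiced"

fs = 8000
t = np.arange(fs // 50) / fs                  # one 20 ms frame (160 samples)
voiced_like = np.sin(2 * np.pi * 150 * t)     # 150 Hz quasi-periodic tone
rng = np.random.default_rng(0)
unvoiced_like = rng.standard_normal(len(t))   # white, noise-like frame
```

A 150 Hz tone at 8 kHz crosses zero about 300 times per second (a rate of roughly 0.04 per sample), while white noise crosses on about half of all samples, so the two classes separate cleanly.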
3.3 Representation of Speech Signals
There are many possibilities for the discrete-time representation of speech (voice)
signals, as shown in fig. 3.2. These representations can be classified into two broad groups,
namely waveform representations and parametric representations.
Waveform representations are concerned with simply preserving the "wave shape" of
the analog speech (voice) signal through a sampling and quantization process. Parametric
representations are concerned with representing the speech (voice) signal as the output of a
model for speech (voice) production. The first step in acquiring a parametric
representation is often a digital waveform representation: the speech (voice) signal is sampled
and quantized and then further processed to obtain the parameters of the model for speech
(voice) production. The parameters of the model are classified as either excitation
parameters (related to the source of the speech (voice) sounds) or vocal tract response
parameters [146].
3.4 Classification of Coder
In the past, PCM at 64 kbps was used for landline communication. After that, ADPCM
at 32 kbps was used to double the capacity of the speech (voice) channel. LPC-based source
coders then became popular in the world market; they are good at estimating basic speech
parameters like pitch, formants, the V.T. area function, etc., and are used for representing
speech (voice) for low bit-rate transmission or storage. An LPC coder requires 80,000 bits for
100 s of speech, which is far less than PCM (6,400,000 bits) and ADPCM (3,200,000 bits).
The hierarchy of speech (voice) coders is shown in fig. 3.3. Speech coders differ
widely in their approaches to achieving signal compression. Based on how they accomplish
compression, speech (voice) coders are broadly classified into two categories: waveform
coders and vocoders. There also exist hybrid coders, made up of a combination of both,
which provide good quality speech (voice) at low bit rates. The speech (voice) quality
produced by a speech (voice) codec is a function of transmission bit-rate, complexity, delay,
and bandwidth. Hence, when considering a speech (voice) codec it is essential to consider all
these attributes [146].
3.4.1 Waveform Coding
Waveform coding techniques were the first to be investigated and are still widely
embodied in the relevant standards. In this class of coding, the objective is to minimize a
criterion that measures the dissimilarity between the original and reconstructed speech (voice)
signals, evaluated on a block-by-block basis. A block can be as small as one sample [146].
Waveform codecs attempt to reproduce the speech waveform itself, without using any
knowledge of how the signal to be coded was generated. Results from such codecs can be
improved if the predictor and quantizer are made adaptive so that they change to match the
characteristics of the speech (voice) being coded. This leads to Adaptive Differential PCM
(ADPCM) codecs.
Figure 3.2 Representation of speech (voice) signals (waveform representations and
parametric representations; the latter comprise excitation parameters and vocal tract
parameters)
Figure 3.3 Hierarchy of speech coders [146]
Waveform coders are most useful in applications that require the successful coding of
both speech and non-speech signals. In the public switched telephone network (PSTN), the
successful transmission of modem and fax signaling tones and switching signals is nearly as
important as the successful transmission of speech (voice). The most commonly used
waveform coding algorithms are uniform 16-bit PCM, companded 8-bit PCM, and ADPCM
[146].
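Companded 8-bit PCM compresses the amplitude scale before uniform quantization, as in ITU-T G.711. The sketch below implements the continuous μ-law characteristic in Python as a simplified illustration; G.711 itself uses a segmented 8-bit approximation rather than this floating-point round trip:

```python
import numpy as np

MU = 255.0  # mu-law constant used in G.711 (North America / Japan)

def mu_law_compress(x):
    """Map x in [-1, 1] through the mu-law characteristic, which gives
    small-amplitude (quiet) samples finer effective quantization steps."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_expand(y):
    """Exact inverse of the compressor."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def companded_8bit(x):
    """Compress, quantize uniformly to 8 bits, then expand."""
    y = mu_law_compress(x)
    q = np.round(y * 127.0) / 127.0
    return mu_law_expand(q)
```

Because the compressor expands small amplitudes before the uniform 8-bit quantizer, quiet passages get much finer effective steps than with plain uniform 8-bit PCM, which is the point of companding.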
3.4.2 Source Coding
Source coders operate using a model of how the source was generated and attempt to
extract, from the signal being coded, the parameters of the model, which are then transmitted
to the decoder. The V.T. is represented as a time-varying filter, excited either with a white
noise source, for unvoiced speech segments, or with a train of pulses separated by the pitch
period, for voiced speech (voice). Therefore, the information which must be sent to the
decoder is the filter specification, a voiced/unvoiced flag, the variance of the excitation signal,
and the pitch period for voiced speech. This is updated every 10-20 ms to follow the
non-stationary nature of speech (voice).
The model parameters can be determined by the encoder in several different ways,
using either time- or frequency-domain techniques. Also, the information can be coded for
transmission in various ways. Vocoders tend to operate at around 2.4 kbit/s or below and
produce speech (voice) which, although intelligible, is far from natural sounding. Increasing
the bit rate much beyond 2.4 kbit/s is not worthwhile because of the inbuilt limitation in the
coder's performance due to the simplified model of speech (voice) production used. The
main use of vocoders has been in military applications, where natural-sounding speech
(voice) is not as important as a very low bit rate that allows heavy protection and encryption
[146].
3.4.3 Hybrid Coding
Waveform coders are capable of providing good quality speech (voice) at bit rates
down to about 16 kbit/s but are of limited use at rates below this. Vocoders, on the other hand,
can provide intelligible speech (voice) at 2.4 kbit/s and below, but cannot provide natural-
sounding speech (voice) at any bit rate.
A hybrid coder tends to act like a waveform coder at high bit-rates and a parametric
coder at low bit-rates, with fair to good quality at medium bit-rates. This technique dominates
the medium bit-rate coders. The most successful and commonly used coders are time-domain
Analysis-by-Synthesis (ABS) codecs. Such coders use the same linear prediction filter model
of the vocal tract as found in LPC vocoders. However, instead of applying a simple two-
state, voiced/unvoiced model to find the necessary input to this filter, the excitation signal is
chosen by attempting to match the reconstructed speech waveform as closely as possible to
the original speech (voice) waveform [146].
3.5 Speech (voice) coding
Speech (voice) coding algorithms have the target of approximating the compact
digital representations of continuous speech (voice) signals for efficient storage, transmission
[94]. In the 1990s, the increased number of mobile phones and the demands of mobile
communications brought new challenges for digital speech transmission systems [7]. A general
block diagram of speech coding is shown in fig. 3.4. As represented in the diagram, the
digital transmission of speech (voice) involves an A/D converter to convert the continuous
signal into digital form. The function of the codec at the encoder side is to convert the
continuous signal into a compressed binary bit pattern, as indicated in fig. 3.4. This binary
bit pattern can be sent over digital wireless networks. At the recipient side, the binary bit
stream is translated back to a continuous signal using a decoder. Codecs are utilized in
telephones, cellular networks, televisions, set-top boxes, etc. [89].
Considering speech (voice) codec attributes like low complexity, limited delay,
lower cost, and limited radio network resources, it is necessary to operate the speech (voice)
codec at a lower bit rate [93,95]. Perception plays a fundamental role in the art of lossy speech
(voice) coding. Lossy speech (voice) coding can be summarized as the endeavor to reduce
the bit rate of the coded speech (voice) through an efficient, minimal representation of the
speech (voice) signal while maintaining an acceptable level of perceived quality of the
decoded speech (voice).
Fig. 3.5 (a) and 3.5 (b) show a section of the spectrogram for an unvoiced and a
voiced sound respectively. The range from 300-3400 Hz is called the narrowband. The
complete range is referred to as wideband. The "missing" range in narrowband compared to
the wideband is referred to as the extension band. Throughout the thesis these bands will
frequently be referred to by their abbreviations N.B., W.B., and E.B. respectively. As can be seen,
in both cases there is a considerable amount of energy in the extension band, which, if
recovered, would result in a significantly improved version of the speech (voice) [69].
Speech can be divided into two classes, voiced and unvoiced. The difference between
the two signals is the use of the vocal cords and V.T. (mouth and lips). When voiced sounds
are pronounced you use the vocal cords, vocal tract. Because of the vocal cords, it is possible
to find the fundamental frequency of the speech. In contrast to this, the vocal cords are not
[Figure 3.4 depicts the transmission chain: analog speech s(t) → A/D converter → discrete samples s(n) as a 64 kbps bit stream (10110011...) → speech encoder → 8 kbps bit stream (1001...) → transmission channel → speech decoder → 64 kbps bit stream → D/A converter → retrieved speech ŝ(t).]
Figure 3.4 General block diagram of speech coding
Speech (voice) coding
33
used when pronouncing unvoiced sounds. Because the vocal cords are not used, it is not
possible to find a fundamental frequency in unvoiced speech. In general, all vowels are
voiced sounds. Examples of unvoiced sounds are /sh/, /s/ and /p/. There are different ways to
detect whether speech is voiced or unvoiced. The fundamental frequency can be used to detect the
voiced and unvoiced parts of speech. Another way is to calculate the energy in the signal
(signal frame): there is more energy in a voiced sound than in an unvoiced sound [69].
Figure 3.5 Voiced (a) and Unvoiced (b) speech frame representation [69]
3.6 Issues related to digital speech coding
Digital coding of speech (voice) with bit rate reduction has thus emerged as an
important area of research. This research largely addresses the following problems:
Although it is very attractive to reduce the PCM bit rate as much as possible, it
becomes increasingly difficult to maintain acceptable speech (voice) quality as the bit
rate falls.
As the bit rate falls, acceptable speech (voice) quality can only be maintained by
employing very complex algorithms, which are difficult to implement in real-time
even with new fast processors with their associated high cost and power consumption, or
by incurring excessive delay, which may create echo control problems in the system.
To achieve low bit rates, parameters of speech production and/or perception model
are encoded and transmitted [96].
Speech quality as produced by a speech codec is a function of transmission bit rate,
complexity, delay, and B.W. Therefore, when considering a speech (voice) codec it is essential to
consider all these attributes. For a given application some attributes are predetermined, while
trade-offs can be made among others. For example, low bit rate speech codecs tend to have
more delay than higher-bit-rate codecs. They are generally more complex to
implement and often have lower speech quality than the higher-bit-rate codecs [146].
3.7 Speech codec attributes
Speech codecs are characterized by five general attributes:
3.7.1 Transmission bit rate
The transmission bit rate specifies how many bits are required to describe one second of speech. The bit
rate is a measure of how much the speech model has been exploited in the coder; the lower
the bit rate, the greater the reliance on the speech production model. Depending on the
system and design constraints, fixed-rate or variable-rate speech coders can be used [146].
3.7.2 Speech Quality
Speech quality is the most important among all the attributes. The
decoded speech should have a quality acceptable for the target application. Bit rate and quality
are intimately related: the lower the bit rate, the lower the quality. While the bit rate is
inherently a number, quality is difficult to quantify. The most widely used subjective measure of
quality is the Mean Opinion Score (MOS), which is the result of averaging opinion scores
for a set of untrained subjects (listeners). Each listener characterizes each set of utterances
with a score on a scale from 1 to 5 (1-bad, 2-poor, 3-fair, 4-good, 5-excellent). The MOS
quality rating is shown in Table 3.1. A MOS of 4.0 or higher defines good or toll quality,
where the reconstructed speech (voice) signal is generally indistinguishable from the original
signal. A MOS between 3.5 and 4.0 defines communication quality, which is sufficient for
telephone communications. The most widely used objective measure of quality is the signal-
to-noise ratio (SNR); SNR indicates how well the waveforms match but does not account for
how the decoded speech (voice) is perceived [146].
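The MOS averaging described above can be sketched as follows; this is a minimal illustration, and the listener scores below are hypothetical values, not data from the thesis.

```python
# A minimal sketch of the MOS computation described above: average the
# 1 (bad) to 5 (excellent) opinion scores of a listener panel. The scores
# below are illustrative values, not data from the thesis.

def mean_opinion_score(ratings):
    """Average a list of 1-5 ACR scores into a MOS value."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    return sum(ratings) / len(ratings)

scores = [4, 5, 4, 4, 3, 4, 5, 4, 4, 4]   # ten hypothetical listeners
print(mean_opinion_score(scores))          # 4.1 -> toll quality (>= 4.0)
```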
Table 3.1 MOS Quality Rating

Quality scale | Score | Listening effort scale                       | Impairment
Excellent     | 5     | No effort required                           | Imperceptible
Good          | 4     | No appreciable effort required               | Perceptible but not annoying
Fair          | 3     | Moderate effort required                     | Slightly annoying
Poor          | 2     | Considerable effort required                 | Annoying
Bad           | 1     | No meaning understood with reasonable effort | Very annoying
3.7.3 Bandwidth
The B.W. of the speech signal that needs to be encoded is also an issue. Typical
telephony requires 200-3400 Hz B.W. Wideband speech coding techniques (useful for audio
transmission, teleconferencing, and tele-teaching) necessitate 7-20 kHz bandwidth [146].
3.7.4. Communication Delay
Communication delay is the measure of the time taken for an input sample to be processed at the encoder,
transmitted, and decoded at the decoder. Delays over 150 ms can be unacceptable for highly
interactive conversations. Coder delay is the sum of different types of delay. The first is the
algorithmic delay, arising because speech coders usually operate on a block of samples
(a frame), which needs to be accumulated before processing can begin. The computational
delay is the time that the speech (voice) coder requires to process the frame. In
practice, the total delay of many speech coders is at least three frames [146].
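The delay budget above can be sketched numerically; modelling computation as two extra frame durations is an illustrative assumption that reproduces the "at least three frames" rule of thumb, not a figure from the thesis.

```python
# A minimal sketch of the delay budget described above: total coder delay is
# roughly the algorithmic delay (one frame of buffering plus any look-ahead)
# plus the computational delay. Two extra frame durations of computation is
# an illustrative assumption.

def coder_delay_ms(frame_ms, lookahead_ms=0.0, processing_frames=2.0):
    algorithmic = frame_ms + lookahead_ms      # frame buffering + look-ahead
    computational = processing_frames * frame_ms
    return algorithmic + computational

total = coder_delay_ms(frame_ms=20.0)   # 20 ms frames, no look-ahead
print(total, total < 150.0)             # 60.0 True -> below the 150 ms limit
```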
3.7.5 Complexity
Complexity is defined as the measure of computation and memory required to
implement the coder. The computational complexity and memory requirements of a speech
(voice) coder determine the cost and power consumption of the hardware on which it is
implemented. In most cases, real-time operation is required at least for the decoder. A
large complexity can result in high power consumption in the hardware [146].
3.8 Speech codec for next-generation wireless communications
Selection of a particular speech (voice) codec plays a vital role in the design of digital
next-generation wireless communications. It is a challenging task to compress speech (voice) so as to
maximize the number of users on the system. The other important parameters are encoding
delay, complexity of the encoder design, power requirements, compatibility with
existing standards, robustness of the coded speech (voice) to transmission errors, etc. [146].
3.9 Speech (voice) quality, intelligibility
In a telephone conversation, the acoustic signal is perceived by the near-end user's ear, causing
an auditory event in the brain that results in a representation of the voice (sound) [96]. Prior
knowledge of the communication system and the emotions of the listener influence the quality
judgment [13,97]. According to [98], speech (voice) quality encompasses attributes such as
naturalness, clarity, pleasantness, and brightness. Intelligibility is considered as another
dimension along which to measure speech (voice) quality [7]. Besides the traditional N.B. (300Hz-3.4kHz)
speech coding and transmission, W.B. (50Hz-7kHz) and S.W.B. (50Hz-14kHz) speech transmission have
been deployed in many telephone applications.
3.9.1 Listening-only tests
Listening-only tests can play a vital role in assessing the speech (voice) quality of a
communication system. The test is done by means of a pre-recorded, processed set of
speech (voice) samples (through headphones). The subjects are requested to assess the
quality using a predefined scale given by the experimenter. An advantage of listening-only
tests is that the evaluation is subjective, i.e. subjects listen to real speech samples and grade them
according to their personal opinion. A drawback of listening-only tests is that test design
factors and rating procedures affect the subject's observations. As shown in table 3.2,
subjects are asked to assess the speech (voice) quality using the 5-point mean opinion scale
(MOS) for the absolute category rating (ACR) test [99]. In the degradation category rating (DCR)
test, the subject is asked to assess a degraded signal w.r.t. a reference signal using the
degradation mean opinion score (DMOS), as shown in table 3.3.
Table 3.2 MOS as per recommendation ITU-T P.800 [99]
Score Quality of speech
5 Excellent
4 Good
3 Fair
2 Poor
1 Bad
Table 3.3 Degradation mean opinion score as per recommendation ITU-T P.800 [99]
Score Degradation is
5 Inaudible
4 Audible but not annoying
3 Slightly annoying
2 Annoying
1 Very annoying
In the comparison category rating (CCR) test, two signals are judged against each
other, as shown in table 3.4.
Table 3.4 Comparison mean opinion score according to ITU-T P.800 [99]
Score Quality of the second compared
to that of the first
3 Much better
2 Better
1 Slightly better
0 About the same
-1 Slightly worse
-2 Worse
-3 Much worse
3.9.2 Conversational tests
Compared with the previous tests, conversational tests are much closer to a real
conversation situation. The test can be utilized to assess either a specific source of
degradation, such as delay or echo, or the overall quality of a transmission system. In the test,
two subjects at a time are situated in separate soundproof rooms. The subjects can be expert,
experienced, or naive participants, depending on the rationale of the test. During the test, the
subjects have a dialogue on a given topic or task and give their opinion on the speech (voice)
quality [100].
3.9.3 Field tests
This test is utilized to assess the speech (voice) quality of transmission systems in the most
realistic environment. For example, the speech (voice) quality of a wireless phone might be
judged during real phone calls. The arrangement of such a test is expensive during system
development, so quality tests are typically arranged in the laboratory.
3.9.4 Intelligibility tests
This is a type of subjective test in which the percentage of correctly recognized speech (voice)
sounds at the receiver side of a system is computed. In rhyme tests, rhyming words are
utilized [106]. The diagnostic rhyme test (DRT) consists of 96 rhyming word pairs that differ
in their initial consonant [107]. The modified rhyme test (MRT) contains 50 lists of six
one-syllable words, differing in either the opening or the closing consonant [108]. To find a
presentation level for test speech (voice), the speech reception threshold (SRT) test can be
utilized. In this test a listener must understand the speech (voice) correctly a
specified percentage of the time, usually 50%.
3.10 Objective quality evaluation
Objective quality estimation techniques are computational tools built to assess the
quality of speech (voice) signals. Objective quality assessment methods are often classified
into two categories, namely parameter-based and signal-based models [101-102]. Parameter-based
models, such as the E-model [103], rate the overall quality of a whole transmission path. They
provide information on the whole network, which is helpful in network planning. Signal-based
methods play a very important role in the speech (voice) enhancement field. The perceptual
evaluation of speech (voice) quality (PESQ), standardized in ITU-T Recommendation P.862
[104], can be employed to assess N.B./W.B./S.W.B. quality. Both the degraded and a
reference signal are required for the evaluation of speech (voice) quality. An extension of the
PESQ is presented in recommendation P.862.2 [105].
3.11 Effect of B.W. on speech (voice) quality, intelligibility
The selection of cut-off frequencies and telephone filter characteristics specified in ITU-T
recommendation G.120 [109] relied on the B.W. limitations imposed by analog
transmission systems. The selections were motivated by outcomes obtained from subjective
listening tests [11]. To retain compatibility with existing analog PSTNs, initial progress in
digital telephony occurred with B.W. constraints of 0.3-3.4kHz. After the digitization of
speech (voice) transmission, speech (voice) coding techniques were developed to compress
speech (voice) signals to lower bit rates than PCM without degrading speech (voice) quality.
So the bandwidth available for speech (voice) transmission was dictated by the development of
cellular systems and speech (voice) coding techniques. However, the frequency content of
speech (voice) signals ranges from 60Hz to 20kHz.
The constraints upon the telephone B.W. thus result in a loss of distinctive spectral
properties of speech (voice) sounds. Some fricatives such as /f/ and /s/ differ in the location
of the lowest spectral peak, which occurs typically around 2.5kHz and 4kHz respectively for a male adult
speaker. Such distinctive properties are lost in typical telephone speech
(voice), which often causes difficulties for the listener in distinguishing between different
fricative sounds [15]. Some plosives (/t/ and /d/) are characterized by higher energy bursts
occurring around 3.9kHz. Others (/b/, /p/) also exhibit similar energy bursts, albeit with less
intensity. These distinct properties are lost in N.B. speech (voice), resulting in reduced
intelligibility and naturalness for plosives. Nasals are also affected. They are dominated by the
first formant F1, which typically occurs around 250Hz. This is also lost due to the lower
cut-off of the telephone band.
During a telephone call, H.F. sounds such as /f/ and /s/ (or /p/ and /t/) are occasionally
hard to distinguish. Improvements in intelligibility are thus essential to reduce the
listening effort and afford comfortable communication [6]. The perceived quality as well as
the intelligibility of speech (voice) signals improves with increases in acoustic B.W.,
particularly for unvoiced sounds, because they contain a substantial portion of their
spectral content beyond 3.4kHz. A changeover from N.B. to W.B. communication thus
leads to a rise in syllable and sentence intelligibility from 90% and 99.3% to 98% and
99.9% respectively [110-111]. The quality of communication is further improved at
S.W.B.
CHAPTER-4
Development of Bandwidth Extension Model
For W.B. To S.W.B.
4.1 Introduction
Before studying in detail the development of the bandwidth extension model for
W.B. to S.W.B., the general model for B.W.E. from N.B. to W.B., bandwidth limitation in
compressed speech/audio, and the High-Frequency Bandwidth Extension (H.F.B.E.) algorithm [120]
are discussed. The B.W.E. approaches can be categorized into two parts, namely non-model
based B.W.E. and source-filter model based B.W.E.
4.2 Non-model based B.W.E. approaches
The non-model based B.W.E. approaches do not use a priori knowledge about speech
(voice) production mechanism and employ simple operations to introduce missing frequency
components. The operations include spectral translation or shifting (fixed or adaptive),
generation of high-frequency components via non-linear operations on time-domain signals,
band pass filtering on white noise, etc. Some approaches also perform spectral shaping of the
generated frequency components using some gain control parameter or an empirically
determined filter. The first commercial use of such B.W.E. methods, by the British
Broadcasting Corporation (BBC) [112], was reported in 1972, where the acoustical
bandwidth of telephone speech in broadcast programs was improved. Notable examples are
[113-116]. Such methods, however, produce extended W.B. speech (voice) signals with
audible processing artifacts and distortions. This is because the energy of the generated
frequency components is either too weak or too strong compared to the N.B. component
[117]. Additionally, the quality of extended speech (voice) signals depends upon the
effective bandwidth of the original input signal [118].
4.3 B.W.E. approaches based on the source-filter model
The model-based B.W.E. algorithms use a priori knowledge about the characteristics of
speech (voice) signals and the human speech (voice) production mechanism. Since the beginning of
the 90s, most B.W.E. algorithms have exploited the classical source-filter model (S.F.M.)
[119] of speech (voice) production, where an N.B./W.B. speech signal is represented by an
excitation source and a vocal tract filter. The frequency content of these two components can be
extended through independent processing before a W.B./S.W.B. signal is resynthesized.
The extension of N.B./W.B. speech is thus divided into two tasks:
Estimation of the H.B. or W.B./S.W.B. spectral envelope from input N.B./W.B. features
via some form of estimation technique.
Generation of H.B. or W.B./S.W.B. excitation components via some form of time-
domain non-linear processing, spectral translation, or spectral shifting. The
H.B. component is usually parameterized with some form of linear prediction (LP)
coefficients, whereas the N.B./W.B. component is parameterized by a variety of static
and/or dynamic features.
4.3.1 B.W.E. from N.B. to W.B. Speech Conversion:
In digital signal processing, signals are band-limited according to the sampling
frequency used. N.B., W.B., and S.W.B. speech have sampling frequencies of 8 kHz, 16 kHz, and 24 kHz
respectively. Based on N.B. to W.B. speech conversion and the baseline (H.F.B.E.) algorithm,
the proposed method for W.B. to S.W.B. has been discussed. It is a blind method because it
estimates the missing H.F. components from the available L.F. components. The proposed flow chart
of bandwidth extension based on the baseline and proposed algorithms shows how a
W.B./S.W.B. speech signal is obtained at the receiver side without actually transmitting W.B./S.W.B.
4.3.2 General Model for B.W.E.:
The fundamental thought adopted here for the B.W.E. system is the separate
extension of the spectral envelope and the residual signal as depicted in fig.4.1. First, the
incoming telephone-band signal is analyzed through LPC analysis. Based on the telephone-band
spectral envelope and short-time features, the spectral envelope is extended. The
extended version is essential to define the shaping filter characteristic. Second, the extended
residual signal for this shaping filter is calculated from the telephone-band residual signal as
highlighted in fig.4.1. According to the linear speech production model the synthetic signal
is generated by driving the shaping filter with the extended residual signal. The resulting
power of the synthetic signal has to be matched to the telephone-band signal power such that
both signals can be added to build the desired W.B. signal [70].
Figure 4.1 General Model for B.W.E.[70]
4.3.3 Bandwidth Limitation in Compressed Speech/Audio:
Most perceptual speech/audio codecs band-limit the input signals to use the available
bits for the psychoacoustically relevant L.F. signals. The B.W. limitation becomes severe at
very low bit rates: e.g. 128 kbps mp3 encoded speech/audio is bandwidth limited to about 15
kHz, and 64 kbps mp3 to about 8 kHz. Fig. 4.2 shows an example of a 96 kbps mp3 encoded-
decoded signal spectrum, where the signal spectrum is cut off beyond 11 kHz. These missing H.F.
signals cause speech/audio signals to sound dull and at times muffled, so it becomes
essential to improve audio quality in such cases by artificially creating high frequencies from
low-frequency information [120].
Figure 4.2 Bandwidth Limitation in Compressed Speech/Audio [120]
Issues with the approaches:
Bandwidth limitation becomes severe at very low bit rates.
Proposed solution:
Encoder-side modifications solve the bandwidth limitation issues.
Internal issue:
An enormous amount of content is already available which suffers from the above-mentioned
quality issues.
Proposed solution:
Post-processing methods after decoding, e.g. spectral extrapolation.
Internal issue:
Computationally too expensive.
Final solution:
Employ a time-domain method for real-time implementation.
4.3.4 Baseline System Model for B.W.E.
As shown in fig. 4.3, the baseline system model accepts decoded data as input from any
lossy audio decoder and recreates high frequencies blindly, i.e. by using only the decoded
audio signal and nothing else. Since the bandwidth of the input signal is unknown, it is first
estimated by using a real-time bandwidth detection process. After detecting the highest
frequency present in the signal at any given time, the decoded signal is divided into sub-
bands up to half the detected highest frequency of the signal. Each sub-band signal is then
individually passed through non-linearity to generate harmonics. The generated harmonics
are gain scaled to achieve spectral envelope shaping and added back to the original
signal[120].
Figure 4.3 High-Frequency Bandwidth Extension (H.F.B.E.) Algorithm [120]
Component Description of the Baseline System Model for B.W.E.:
Bandwidth Detection:
Bandwidth detection involves analyzing the signal in real-time, e.g. every 20 msec, to
find the highest frequency present in the signal. A typical range of bandwidths of interest is
from 7 kHz to 16 kHz. A B.W. below 7 kHz might mean that the original audio
signal does not contain high frequencies [120].
A B.W. beyond 16 kHz means that there is a significant amount of high frequencies
present and reconstruction is not required. A possible solution for bandwidth detection might
be to perform an FFT on the decoded audio data. Using the frequency response, it would be
simple to detect the highest frequency with any significant energy. But an FFT would
increase the complexity of the reconstruction algorithm, and application to various platforms
would be difficult. Instead of using an FFT, we could have a bank of band pass IIR filters
followed by the energy calculation of filtered results. For 1 kHz accuracy in detection, we
need to have 8 band pass filters in our frequency range of interest[120].
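The filter-bank detector above can be sketched as follows. This is a minimal illustration, not the thesis implementation: Butterworth band-pass sections and the relative energy threshold are assumptions, since the text only specifies a bank of band-pass IIR filters over 7-16 kHz with 1 kHz resolution followed by an energy calculation.

```python
# A minimal sketch of the IIR filter-bank bandwidth detector described above.
# Butterworth sections and the energy threshold are assumptions.
import numpy as np
from scipy.signal import butter, sosfilt

def detect_bandwidth(x, fs, lo=7000, hi=16000, step=1000, rel_thresh=1e-4):
    """Return the upper edge of the highest 1 kHz band with significant energy."""
    total = np.sum(x ** 2) + 1e-12
    detected = lo
    for f in range(lo, hi, step):
        f2 = min(f + step, 0.499 * fs)   # stay below Nyquist
        sos = butter(4, [f, f2], btype="bandpass", fs=fs, output="sos")
        if np.sum(sosfilt(sos, x) ** 2) / total > rel_thresh:
            detected = f2
    return detected

# Illustrative check: a 9.5 kHz tone should push the estimate past 9 kHz.
fs = 32000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 9500 * t)
print(detect_bandwidth(x, fs) >= 9000)  # True
```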
Sub band filtering:
The B.W. detection step is followed by sub-band filtering, in which the input signal is
divided into multiple bands before the nonlinear processing step. It is well known that any
nonlinear operation produces harmonics of the input frequency along with intermodulation
noise. The more frequency content is present in the signal before the nonlinear process,
the more the intermodulation noise. Hence if we band-pass filter the input signal into multiple
bands and apply nonlinear processing individually on each sub-band, the intermodulation
noise is reduced. The higher the number of sub-bands, the smaller the
intermodulation noise. In our implementation, we divided the signal into 2 sub-bands.
Although a higher number of sub-bands would have been desirable, the complexity of
implementation would have been significantly increased. If B is the detected B.W., then the
1st band extends from 0.5xB to 0.75xB and the 2nd band from 0.75xB to B. The 1st and 2nd sub-
bands are called Band1 and Band2. Since the detected bandwidth is approximated to be
within a set of fixed values, the filter coefficients used for sub-band filtering are
precalculated and stored. 4th-order IIR filters were found to be sufficient for the
purpose[120].
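The two-band split above can be sketched as follows; using Butterworth sections is an assumption, as the text only specifies precalculated 4th-order IIR filters.

```python
# A minimal sketch of the two-band split: given a detected bandwidth B,
# Band1 spans 0.5xB-0.75xB and Band2 spans 0.75xB-B. Butterworth band-pass
# sections are an assumption on top of the 4th-order IIR filters in the text.
import numpy as np
from scipy.signal import butter, sosfilt

def split_subbands(x, fs, B):
    """Return [Band1, Band2] as defined above."""
    edges = [(0.50 * B, 0.75 * B), (0.75 * B, min(B, 0.499 * fs))]
    return [
        sosfilt(butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos"), x)
        for lo, hi in edges
    ]

# Illustrative check: with B = 8 kHz, a 4.5 kHz tone lands in Band1
# (4-6 kHz) and a 7 kHz tone in Band2 (6-8 kHz).
fs, B = 32000, 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 4500 * t) + np.sin(2 * np.pi * 7000 * t)
band1, band2 = split_subbands(x, fs, B)
```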
Non linear processing:
The sub-band filtered data is then passed through a non-linear process to generate
harmonics. Any non-linear process which generates a second harmonic can be used. Full-wave
rectification is a suitable non-linear process. Along with its very low complexity of
implementation, it also exhibits harmonic signal amplitude linearity [121]. The full-wave
rectification process is followed by band pass filtering to filter out frequencies outside the
range of interest generated due to intermodulation and aliasing. Band1 harmonics are post-
filtered from B to 1.5xB and 2nd sub-band harmonics from 1.5xB to 2.0xB. The result of the
post-filtering of Band1 data is called Band3 and that of Band2 is called Band4. Fig. 4.4
shows the bands Band1, Band2, Band3, and Band4. If B is detected to be greater than 9 kHz, the
maximum recreated frequency is limited to 18 kHz.
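The harmonic-generation property of full-wave rectification can be verified numerically; this minimal sketch omits the band-pass post-filtering step for brevity.

```python
# A minimal sketch of the full-wave rectification idea: |x| of a tone at
# frequency f contains energy at 2f (and higher even harmonics) but none
# at f itself, which is what the algorithm exploits to recreate high
# frequencies.
import numpy as np

fs = 32000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 5000 * t)           # band-limited input tone

rectified = np.abs(x)                      # full-wave rectification
spectrum = np.abs(np.fft.rfft(rectified))  # 1 Hz bins over a 1 s signal

# The second harmonic at 10 kHz dominates; the fundamental at 5 kHz is
# essentially absent in the rectified signal's spectrum.
print(spectrum[10000] > 10 * spectrum[5000])  # True
```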
Post processing:
The goal of the post-processing step is to modify the energy of the recreated bands
to have them match the original frequency content.

Figure 4.4 Input and Reconstruction bands [120]

A simple algorithm that attempts to
maintain the continuity of the spectral envelope by modifying energies of Band3 and Band4
has been developed. After the nonlinear processing and post-filtering step the energy in
bands Band1, Band2, Band3, and Band4 are evaluated on a short time basis. The energy of
the original signal's sub-band Band1 is called E1, that of Band2 is called E2 and those of
reconstructed Bands 3 and Band 4 are E3 and E4 respectively. Assume that E3Target and
E4Target are the desired energies of Band3 and Band4 respectively which would give a smooth
reconstructed spectral envelope. From our experiments, we found that the desired value of
E3Target is approximated by
E3Target = E2² / E1 .................. (1)

G3 = E3Target / E3 .................. (2)

E4Target = E3Target² / E2 .................. (3)

G4 = E4Target / E4 .................. (4)
The reconstructed sub-bands Band3 and Band4 are then gain modified using G3 and
G4 from equations (2) and (4) and added back to the original signal to get the B.W.E. output
signal [120].
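The gain computation of equations (1)-(4) can be sketched as follows, under the assumption (reconstructed from the surrounding discussion of a smooth spectral envelope) that band energies continue their geometric decay.

```python
# A minimal sketch of the post-processing gains, assuming the smooth-envelope
# rule that band energies continue their geometric decay:
# E3Target = E2**2/E1 and E4Target = E3Target**2/E2; each recreated band is
# scaled by the ratio of its target to its measured energy.

def postprocessing_gains(E1, E2, E3, E4):
    """Return (G3, G4); callers must guard against zero band energies."""
    E3_target = E2 * E2 / E1          # equation (1)
    G3 = E3_target / E3               # equation (2)
    E4_target = E3_target ** 2 / E2   # equation (3)
    G4 = E4_target / E4               # equation (4)
    return G3, G4

# Worked example: energies halving per band (E1=8, E2=4) give targets
# E3Target = 2 and E4Target = 1 for raw recreated energies E3 = E4 = 4.
G3, G4 = postprocessing_gains(8.0, 4.0, 4.0, 4.0)
print(G3, G4)  # 0.5 0.25
```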
4.4 Detailed analysis of B.W.E. based on proposed S.F.M.
Fig. 4.5 depicts the proposed S.F.M. based algorithm block diagram for W.B. to
S.W.B. speech conversion. The block diagram is mainly categorized into four parts, as
narrated below:
The first and most important step is to acquire the W.B. input signal from the pre-
processing stage; framing and windowing are then performed on the W.B. input signal.
In the second step, the output of the first step is processed by the LP algorithm to
divide the signal into two parts, namely spectral information and the residual error signal, so one
can say that the missing H.F. components are estimated from the accessible L.F. components.
In the third step, the original L.F. component is extracted from the input W.B. frame by
zero insertion.
In the final step, both the L.F. and H.F. components are added together to get the
estimated S.W.B. output.
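The zero-insertion operation of the third step can be sketched as follows; this is a minimal illustration of the operation itself, not the full L.F./H.F. processing chain.

```python
# A minimal sketch of the zero-insertion step: placing a zero after every
# sample doubles the sampling rate, producing the original baseband plus a
# spectral image; an LPF then keeps the L.F. part, while the H.F. path is
# shaped separately before the two are summed.
import numpy as np

def zero_insert(x):
    """Upsample by 2 via zero insertion: [x0, 0, x1, 0, ...]."""
    y = np.zeros(2 * len(x))
    y[::2] = x
    return y

print(zero_insert(np.array([1.0, 2.0, 3.0])))  # [1. 0. 2. 0. 3. 0.]
```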
Detailed Analysis of Proposed System Model:
Framing:
Partitioning of a speech signal into frames is the first basic component of our
proposed approach. On the whole, a speech (voice) signal is not
stationary, but it is typically stationary over windows of 20 milliseconds. Therefore the
signal is divided into frames of 20 milliseconds, which corresponds to n1 samples [122]:

n1 = ts × fs ..........................(5)

where ts is the frame duration and fs is the sampling frequency.
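Equation (5) and the frame split can be sketched as follows; the non-overlapping split is an illustrative simplification.

```python
# A minimal sketch of the framing step: equation (5) with a 20 ms frame at
# a 16 kHz sampling rate gives n1 = ts * fs = 320 samples per frame.
import numpy as np

def make_frames(signal, fs, frame_dur=0.020):
    n1 = int(frame_dur * fs)        # samples per frame, equation (5)
    n_frames = len(signal) // n1    # drop any trailing partial frame
    return signal[: n_frames * n1].reshape(n_frames, n1)

fs = 16000
frames = make_frames(np.arange(fs, dtype=float), fs)  # 1 s of dummy samples
print(frames.shape)  # (50, 320): fifty 20 ms frames of 320 samples
```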
Figure 4.5 Block diagram of the proposed approach for W.B. to S.W.B.
speech (voice) signal

[Blocks in fig. 4.5: the W.B. input from preprocessing (xwb) passes through framing, windowing, and LP analysis (Awb(z) = 1 + Σ ak z^-k); the H.F. path involves zero insertion, FFT, spectral shaping |H(e^jω)|, HPF, IFFT, and synchronization; the L.F. path uses zero insertion and an LPF; a multiplier and an adder combine the two paths into the estimated S.W.B. output XSWB.]
Fig. 4.6 depicts a pictorial representation of framing the signal into four frames of 20
ms [122]. Fig. 4.7 illustrates a flow chart for framing, where the signal is read through the
audioread function in MATLAB, divided into frames, the number of samples in a frame is
found, and the frames are plotted.
Figure 4.6 Pictorial representation for framing [122]
Windowing:
During frame blocking, there is a possibility that signal discontinuity may arise at the
beginning and end of each frame. To reduce the signal discontinuity at either end of each
block the next step employed is windowing. The window function exists only inside a
window and evaluates to zero outside some chosen interval. When multiplied with the
original speech frame, the window function taper down the beginning and the end to zero
and thereby minimize spectral distortion at both the end. The simplest rectangular window
Figure 4.7 Flow chart representation for framing

[Flow chart steps: read the given wave file using the audioread function; select a 25 ms frame duration and a sampling frequency of 16 kHz; plot the speech (voice) signal; find the number of samples in one frame and the total number of frames in the wave file; plot each frame by processing data of the specified frame duration.]
Development of Bandwidth Extension Model For W.B. To S.W.B.
50
function is given by

ω(n) = 1 for 0 ≤ n ≤ N−1, and 0 otherwise ......................(6)

A more commonly used, smoother function (the Hamming window) is defined as

ω(n) = 0.54 − 0.46 cos(2πn/(N−1)) for 0 ≤ n ≤ N−1, and 0 otherwise ...............(7)

where n = sample number in a frame, N = total number of samples, and the window length is typically 10-30 ms [122].
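Equation (7) can be sketched and checked against numpy's built-in window, which implements the same formula.

```python
# A minimal sketch of equation (7): the Hamming window tapers both frame
# edges toward zero to reduce spectral distortion from frame-boundary
# discontinuities. numpy's np.hamming implements the same formula.
import numpy as np

def hamming(N):
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

N = 320                 # one 20 ms frame at 16 kHz
w = hamming(N)
print(np.allclose(w, np.hamming(N)))  # True
# A frame is then tapered by elementwise multiplication: tapered = frame * w
```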
Linear Prediction:
Linear prediction (LP) is the heart of the bandwidth extension algorithm for N.B. to
W.B. and W.B. to super-wideband (S.W.B.). In linear prediction, the present sample can be
estimated as a linear combination of past samples. Fig.4.8 and Fig.4.9 represent the all-pole
spectral-shaping synthesis filter and the linear prediction (LP) model of speech (voice)
production, respectively. Naturally, three filters, namely the glottal pulse model G(z), the
vocal tract (V.T.) model V(z), and the radiation model R(z), are utilized to model speech
production. The glottal pulse model (G.P.M.) contours the pulse train before it is used as
input to V(z). The three models together can be represented via a single transfer function
H(z), i.e.
H(z) = G(z)V(z)R(z) ..................(8)
where H(z) is called the synthesis filter and is shown in Fig.4.8. Obviously the
synthesis filter can be represented via the inverse of the analysis filter, i.e.:
H(z) = 1/A(z) .................. (9)
In this way, we can parameterize a voice signal and it is a suitable and precise method [97].
Figure 4.8 All pole spectral shaping synthesis filter [97]
(Excitation E(z) → synthesis filter H(z) = 1/A(z) → synthetic speech S(z) = E(z)H(z) = E(z)/A(z))
Detailed analysis of B.W.E. based on proposed S.F.M.
51
Figure 4.9 Linear prediction (LP) model of speech(voice) creation[97]
Fig.4.10 represents the LPC-based standalone B.W.E. algorithm (without sideband
information). It is a blind approach in the sense that no additional side information needs to
be sent, so no additional data-rate burden occurs in this approach.
Bandwidth extension here can be divided into two tasks [123]:
Estimation of the W.B. spectral envelope
Extension of the excitation signal.
Estimation of the W.B. spectral envelope can be attained by a variety of techniques like
codebook mapping, adaptive codebook mapping, Gaussian mixture models (G.M.M.), etc.
Figure 4.10 LPC based B.W.E. [123] (the N.B. signal is interpolated and analysed by Â(z)
to separate the spectral envelope from the excitation; the excitation is extended, the
wideband envelope is estimated from extracted features and prior knowledge, and synthesis
through 1/Â(z) yields the estimated W.B. signal)
Codebook Mapping:
The codebook mapping techniques make use of two codebooks constructed in the
training phase utilizing vector quantization to find a representative set of N.B. and W.B.
speech frames. A W.B. codebook contains several wideband spectral envelopes that are
parameterized and stored as, e.g., LPC coefficients or line spectral frequencies (LSFs). An
N.B. codebook contains the parameters of the corresponding N.B. speech frames. When the
bandwidth extension system is used, the best matching entry for each N.B. input frame is
identified in the N.B. codebook and the spectral envelope of the extension band is generated
using the corresponding entry in the W.B. codebook. This technique was proposed in [124],
and presented with refinements in [125]. Separate codebooks were built for unvoiced and
voiced fricatives in [125], which boosts the quality of the spectral envelope. Also,
codebook mapping with interpolation was reported in [126] to boost the extension quality.
Furthermore, codebook mapping with memory was implemented by interpolating the
current envelope estimate with the envelope estimate of the previous frame [127].
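The codebook-mapping step can be sketched as follows. This is a minimal NumPy illustration with toy hand-made codebooks; a real system would train the paired codebooks by vector quantization on N.B./W.B. training frames:

```python
import numpy as np

# Toy codebooks (in practice trained by vector quantization on paired
# N.B./W.B. frames; these three-entry books are illustrative only).
nb_codebook = np.array([[0.9, -0.2], [0.1, 0.7], [-0.5, 0.3]])   # N.B. envelope params
wb_codebook = np.array([[0.9, -0.2, 0.4, 0.1],
                        [0.1, 0.7, -0.3, 0.2],
                        [-0.5, 0.3, 0.6, -0.1]])                 # paired W.B. params

def map_envelope(nb_params):
    """Pick the nearest N.B. codebook entry, return the paired W.B. entry."""
    d = np.sum((nb_codebook - nb_params) ** 2, axis=1)   # squared distances
    return wb_codebook[np.argmin(d)]

wb_est = map_envelope(np.array([0.8, -0.1]))   # nearest to the first entry
```

The key point is that the two codebooks are indexed in parallel: the search happens in the N.B. space, but the output comes from the W.B. book.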
Linear mapping:
Linear mapping is utilized in [128] to estimate the high-band spectral envelope. The
linear mapping between the input and output parameters (LPCs or LSFs) is denoted as
WX = Y ..............(10)
where the matrix W is acquired via an off-line training procedure with the least-squares
method that minimizes the model error (Y − WX) over a set of training data; the N.B.
parameters are indicated by a vector X = [x1, x2, . . . , xn] and the corresponding wideband
envelope to be estimated by another vector Y = [y1, y2, . . . , ym]. To better imitate the
non-linear correlation between the N.B. and H.B. envelopes, modifications of the basic
linear mapping method have been presented. The mapping can be realized by several
matrices instead of a single mapping matrix. In [129], the input data are clustered into four
clusters, and a separate mapping matrix is created for every cluster. The objective analysis
in [128] showed that the spectral distortion (SD) of codebook mapping and neural-network
mapping was larger than that of piecewise linear mapping, whereas the objective
comparison in [126] indicates that the performance of codebook mapping is better than
linear mapping.
Gaussian Mixture Model:
In linear mapping, only linear dependencies between the N.B. spectral envelope and the
H.B. envelope are exploited. Non-linear dependencies can be embraced in the statistical
model by employing the Gaussian mixture model (G.M.M.). A G.M.M. approximates a
probability density function as a sum of several multivariate Gaussian distributions, a
mixture of Gaussians. G.M.M.s are used in B.W.E. to model the joint probability
distribution of the parametric representations of the N.B. input and the extension band.
Given the input feature vector of a speech (voice) frame, the parameters representing the
extension-band envelope are determined using a minimum mean-squared error estimation
rule.
The G.M.M. is utilized directly in envelope extension to estimate W.B. LPC
coefficients or L.S.F.s from the corresponding N.B. parameters [130]. The performance of
the G.M.M.-based spectral envelope extension was then augmented by using Mel
frequency cepstral coefficients (M.F.C.C.) instead of LPC coefficients [131]. G.M.M.
mapping with memory further results in better performance in terms of perceptual
evaluation of speech quality (P.E.S.Q.) [132]. The benefit of G.M.M.-based envelope
extension methods is that they present a continuous mapping from N.B. to W.B. features,
in comparison with the discrete acoustic space resulting from vector quantization (V.Q.).
Better results were reported for G.M.M.-based schemes compared to codebook mapping in
[130] in terms of S.D., cepstral distance, and a paired subjective comparison.
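The G.M.M. conditional estimate can be illustrated in the single-component case, where it reduces to the conditional mean of one joint Gaussian (a sketch on synthetic data, not the thesis implementation; a full G.M.M. would weight such per-component conditional means by the component posteriors):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy joint data z = [x; y]: x = N.B. features (2-dim), y = H.B. envelope (2-dim).
x = rng.normal(size=(2, 500))
y = np.array([[1.0, 0.5], [-0.3, 0.8]]) @ x + 0.01 * rng.normal(size=(2, 500))
z = np.vstack([x, y])

# Fit one Gaussian to the joint vector; the MMSE estimate of y given x
# is the conditional mean:  y_hat = mu_y + C_yx C_xx^{-1} (x - mu_x)
mu = z.mean(axis=1)
C = np.cov(z)
mu_x, mu_y = mu[:2], mu[2:]
C_xx, C_yx = C[:2, :2], C[2:, :2]

def estimate_hb(x_frame):
    """Conditional-mean estimate of the H.B. envelope given N.B. features."""
    return mu_y + C_yx @ np.linalg.solve(C_xx, x_frame - mu_x)

y_hat = estimate_hb(x[:, 0])
```

The conditional mean is the continuous mapping the text refers to: unlike a codebook lookup, a small change in the N.B. features produces a small change in the estimated envelope.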
Extension of the excitation signal:
An imperative feature of any W.B./S.W.B. enhancement scheme is the generation of the
H.B./W.B. speech excitation. The residual extension process mainly aims to double the
sampling rate, from 8 to 16 kHz, as well as to keep the whole spectrum flat. The harmonics
contained in the N.B. residual should also be carried forward into the W.B. residual. The
following are a few of the popular techniques referred to in this research for bringing forth
the H.B./W.B. excitation signal from the given input N.B. excitation (N.B. residual signal):
Spectral folding (SF)
Spectral translation (ST)
Nonlinear distortion (NLD)
Spectral folding (SF):
Spectral folding is a time-domain approach for extension of the excitation and is one of
the most conventional and popular methods because of its simplicity and wide usage in
high-frequency excitation regeneration in B.W.E. Inserting a zero between each pair of
samples of the N.B. residual signal folds the baseband spectrum to higher frequencies.
Thus, the up-sampling extends the signal bandwidth from 4 to 8 kHz. Spectral folding
results in a mirrored image of the N.B. spectrum in the H.B. part of the spectrum, but the
major disadvantage of this method is that the harmonic structure leaves a spectral gap at
4 kHz [133,134].
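The zero-insertion folding can be sketched as follows (NumPy illustration; a 1 kHz tone at 8 kHz acquires a mirror image at 7 kHz after folding to 16 kHz):

```python
import numpy as np

def spectral_fold(nb_residual):
    """Zero-insertion upsampling: inserting a zero after each sample doubles
    the rate (8 -> 16 kHz) and mirrors the 0-4 kHz spectrum into 4-8 kHz."""
    wb = np.zeros(2 * len(nb_residual))
    wb[::2] = nb_residual
    return wb

# A 1 kHz tone sampled at 8 kHz produces images at 1 kHz and 7 kHz after folding.
fs = 8000
t = np.arange(fs) / fs
nb = np.sin(2 * np.pi * 1000 * t)
wb = spectral_fold(nb)
spec = np.abs(np.fft.rfft(wb))        # 1 s at 16 kHz -> bin k corresponds to k Hz
```

The mirrored image at fs − f (here 7 kHz) is exactly the folded copy the text describes; note there is no energy around 4 kHz, which is the spectral-gap drawback.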
Spectral translation:
Spectral translation creates a shifted spectral version of the N.B. residual at the high-
frequency band. This approach multiplies the input N.B. residual by (−1)^n (where n is the
index of each sample), which is equivalent to modulating the signal with a frequency equal
to the Nyquist frequency. This modulated signal is up-sampled and high-pass filtered, then
added to the up-sampled, low-pass filtered N.B. residual to yield the W.B. residual at the
output. The major shortfall of this approach is that it fails to preserve the original N.B.
residual information [134].
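The Nyquist-frequency modulation at the core of spectral translation can be sketched as follows (NumPy illustration; the up-sampling and filtering branches of the full scheme are omitted):

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
nb = np.sin(2 * np.pi * 1000 * t)      # N.B. residual stand-in

n = np.arange(len(nb))
shifted = nb * (-1.0) ** n             # modulate by fs/2: 1 kHz moves to 3 kHz

spec = np.abs(np.fft.rfft(shifted))    # 1 s at 8 kHz -> bin k corresponds to k Hz
```

Multiplication by (−1)^n shifts every component by fs/2 = 4 kHz, so the 1 kHz tone appears at 4 − 1 = 3 kHz; in the full scheme this translated branch is up-sampled and high-pass filtered before being added to the low band.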
Nonlinear distortion (NLD):
In the NLD approach, the N.B. residual is first up-sampled by a factor of two and
then passed through a nonlinear function as expressed in Eq. 11. The desired bandwidth
and harmonic structure can be obtained over the whole spectrum from the resulting
distorted signal. The signal spectrum is then flattened by a whitening filter so that the
excitation does not affect the overall spectral shape. Finally, the output meets the
requirement of the W.B. residual. A simple nonlinear function can be illustrated by:
y(t) = [ (1 − α) x(t) + (1 + α) |x(t)| ] / 2 .................(11)
where x(t) = input signal, y(t) = distorted output signal, and α is a parameter between 0
and 1. When α = 1, it becomes the absolute value function. Furthermore, after varying α
within a stipulated range on a trial-and-error basis in this work, it was observed that
α = 0.7 offers overall promising and better results [134].
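A sketch of the nonlinear function of Eq. 11 (NumPy illustration):

```python
import numpy as np

def nld(x, alpha=0.7):
    """Nonlinear distortion of Eq. (11): alpha = 1 reduces to the absolute
    value |x|, alpha = 0 to half-wave rectification max(x, 0)."""
    return ((1 - alpha) * x + (1 + alpha) * np.abs(x)) / 2

x = np.sin(2 * np.pi * np.arange(64) / 16)
y = nld(x)                 # distorted signal, rich in harmonics
```

The rectifier-like distortion is what regenerates harmonics across the whole band; α trades off between the half-wave (α = 0) and full-wave (α = 1) extremes, with α = 0.7 the value retained in this work.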
H.F. & L.F. component addition to get the estimated S.W.B. speech signal:
As discussed in Step 2 of the proposed system model for B.W.E., after framing,
windowing, and L.P. analysis the signal is divided into two parts, namely the spectral
envelope estimate and the residual error signal. Both parts are processed separately as
shown in fig.4.5 and shaped by a shaping filter and a high-pass filter to obtain the H.F.
component. The band-limited signal at the input side is resampled via zero insertion to
obtain the L.F. component. The two components are then added together via the
overlap-add method to obtain the estimated S.W.B. output.
Fig.4.11 depicts the proposed flow for W.B. to S.W.B., indicating subjective and
objective measurement. As shown in the flow diagram, first the simulation parameters are
defined and the required filters are loaded. After that, the AMRWB, EVS, and S.W.B.
speech files are read using the audioread function and passed through the H.F.B.E.
algorithm function as shown in Fig.4.12. In the H.F.B.E. baseline algorithm, the W.B.
signal is first up-sampled via zero insertion and low-pass filtering. Then, to extract the
highest octave from the up-sampled W.B. signal, the absolute-value function, bandwidth
detection, sub-band filtering, and N.L.D. devices are utilized. Post-processing with
filtering is then applied to capture the required parts of the spectrum and omit the
remainder. Finally, the result is added to a delayed version of the input to obtain the
required S.W.B. output. Fig.4.13 depicts the proposed flow chart for obtaining the
estimated S.W.B. signal. The basic difference between the baseline and the proposed
approach is that in the baseline only the AMRWB input, filter, and gain parameters are
processed, while the proposed approach processes three additional S.W.B. parameters:
the LPC order, the number of NFFT points, and the window length. In the proposed
method, framing of the input W.B. signal, separation of each framed signal into a spectral
envelope and residual error, and the finding of missing H.B. components are done in a
loop over the whole number of frames. After delay adjustment, the H.B. and N.B.
components are added to obtain the estimated S.W.B. signal. Spectrogram, objective, and
subjective measurements are performed on the estimated S.W.B. signal with respect to the
input wave files.
All experiments reported in this thesis were carried out using voice recordings
from the CMU database [135] and the TSP database [136] at different sampling rates fs.
Fig.4.14 demonstrates the data pre-processing and assessment for W.B. to S.W.B. The
TSP and CMU ARCTIC databases were down-sampled to S.W.B. signals so that both
databases have a common fs of 32 kHz. Down-sampling can be done employing the
ResampAudio tool
Figure 4.11 Proposed flow for W.B. to S.W.B. indicating subjective and objective
measurement (read speech files; band-limit the extended speech file to 15 kHz; remove a
few samples at the start and end to remove inconsistencies in the first and last two frames;
compute subjective and objective measures for the given speech files; plot and analyze the
various spectrograms)
Figure 4.12 Baseline algorithm flow chart (define input and output parameters; load filters
and define filter delays; up-sample the input W.B. signal; extract the highest octave from
the up-sampled W.B. signal and find its absolute value using the function abs; extract the
H.B. speech; synthesize the extended speech)
Figure 4.13 Proposed algorithm flow chart (define input and output parameters; define
filter delays; frame the input W.B. signal; for m = 0 to Nframes−1, carry out overlap-add
FFT processing: obtain the W.B. LP spectral envelope and residual/excitation and extend
the excitation via zero insertion; get the N.B. components from the up-sampled W.B.
signal; resynthesize the H.B. speech; adjust delays; combine the H.B. and N.B.
components)
contained in the AFsp package [137]. The voice level of all utterances in both databases
was maintained at −26 dBov [138] to produce Xswb. After that, next-generation voice
coder (EVS) encoding [139] is applied to produce Xevs, and Xswb is down-sampled to
16 kHz and
processed through a band-pass filter [140] as per recommendation P.341, so finally we
obtain the data Xwb, on which AMRWB coding [141] is applied to produce Xamr (Xwb in
fig.4.5 is replaced by Xamr). As shown in fig. 4.5, Xamr is the input to the baseline/proposed
algorithm, which yields the estimated S.W.B. output after processing. The proposed
B.W.E. algorithm is evaluated and compared against the AMRWB and next-generation
voice coder (EVS) processed voice signals and the baseline H.F.B.E. algorithm [120].
Figure 4.14 Data pre-processing and assessment for W.B. to S.W.B. (the TSP and CMU
ARCTIC databases are down-sampled to 32 kHz and level-aligned to produce XSWB; EVS
coding yields Xevs; down-sampling to 16 kHz with P.341 filtering followed by AMR-WB
coding yields Xamr, which is fed to the baseline/proposed algorithm to give the estimated
X̂SWB)
Introduction
59
CHAPTER 5
Linear prediction analysis and synthesis
5.1 Introduction
This chapter starts with a basic block diagram representation of speech file
compression based on LPC analysis and synthesis, followed by the source-filter model for
sound production, N.B. to W.B. speech conversion based on LPC, a comparison of linear
prediction and autoregressive modeling, mathematical analysis of the LPC model, and
stability considerations of LP.
5.2 Speech file Compression based on LPC
Fig. 5.1 represents the complete block diagram showing how the LPC coefficients
are obtained from the input speech (voice) signal. First, the input signal is divided into
frames of 20 ms duration, followed by windowing to remove discontinuities and taper
down the edges. After windowing, the speech (voice) signal is given to a linear predictor
to find the LPC coefficients. Thus, instead of transmitting the PCM samples, the
parameters of the model are sent to achieve compression.
Figure 5.1 LPC coefficients obtained from the input speech signal (some parts adapted
from [142])
Fig.5.2 depicts the detailed analysis of speech file compression based on LPC analysis &
synthesis. As shown in fig. 5.2, compression of the original speech file based on LPC can
be done in seven steps, as below:-
(1) Speech file-This is a speech file in .wav form. This may be any arbitrary speech file
given as an input to the next module.
(2) Divide into the frames-This module will divide each .wav file into 20-30 ms frames.
Each frame generated in this way will be given as input to the next module for its analysis.
The frames obtained in this way are given as input to the function.
(3) Find parameters- This module will collect some of the data required for regenerating
the original input signal (voiced/unvoiced, gain, pitch). High amplitudes of the voiced
sounds and high frequency of the unvoiced sounds will be used to determine whether a
sound is voiced or unvoiced. An algorithm called the Average Magnitude Difference
Function (AMDF) will be used for pitch period estimation. The gain of the filter will also be
determined.
(4) LPC technique-This module will generate the remaining data required to reconstruct the
original speech signal. This data is the coefficients of the synthesis filter. These are to be
calculated using the Autocorrelation method. In this method using the concept of minimum
prediction error, a matrix is obtained where the variables are the filter coefficients. The
solution of the matrix obtained in the Autocorrelation method will be found using the
efficient Levinson Durbin algorithm. This algorithm can be used because of the special
properties of the Autocorrelation matrix. It reduces the complexity of the required
multiplications.
(5) Regenerate frames-This module will consist of the synthesis filter. The excitation
signal will be passed through the synthesis filter to produce the synthesized speech signal.
(6) Reconstruct signals-In this module, the frames are put back together to reconstruct the
original signal.
(7) Compressed Speech file-This is the final output of our system. It will be compared with
input and analyzed accordingly.
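The Levinson-Durbin recursion of step (4) can be sketched as follows (an illustrative NumPy implementation verified on a first-order AR signal, not the thesis MATLAB code):

```python
import numpy as np

def levinson_durbin(r, M):
    """Levinson-Durbin recursion: solves the autocorrelation normal
    equations for the order-M prediction-error filter A(z) in O(M^2),
    exploiting the Toeplitz symmetry of the autocorrelation matrix."""
    a = np.array([1.0])                       # A(z) coefficients, a[0] = 1
    E = r[0]                                  # prediction-error energy
    for m in range(1, M + 1):
        acc = r[m] + np.dot(a[1:], r[m-1:0:-1])
        k = -acc / E                          # reflection coefficient
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]                   # order-update of the filter
        E *= (1.0 - k * k)
    return a, E

# Check on a first-order AR signal s[n] = 0.5*s[n-1] + x[n]:
rng = np.random.default_rng(0)
x = rng.normal(size=50000)
s = np.zeros_like(x)
for n in range(1, len(x)):
    s[n] = 0.5 * s[n - 1] + x[n]
r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(3)]) / len(s)
a, E = levinson_durbin(r, 2)      # estimates A(z) ~ 1 - 0.5 z^-1
```

The recursion never forms or inverts the Toeplitz matrix explicitly, which is why it replaces the general matrix solution mentioned in step (4).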
Figure 5.2 Block diagram representation of speech file compression based on LPC
analysis & synthesis
5.3 Source filter model for sound production
The source-filter model for sound production is shown in fig. 5.3. As represented in
fig.5.3 air from the lungs is taken as a source and the vocal tract as a filter. Voiced sounds
are usually vowels and often have high average energy levels and very distinct resonant or
formant frequencies. Voiced sounds are generated by air from the lungs being forced over
the vocal cords. Unvoiced sounds are usually consonants and generally have less energy and
higher frequency than voiced sounds. The production of unvoiced sounds involves air being
forced through the vocal tract in a turbulent flow. During this process the vocal cords do not
vibrate; instead, they stay open until the sound is produced. The model is based on the
idea of separating the source from the filter in the production of sound.
The V.T. and nasal tract can be modeled as tubes of non-uniform cross-sectional area.
As sound is generated and propagates down the tubes, the frequency spectrum is shaped by the
frequency selectivity of the tube. This effect is very similar to the resonance effects observed
with organ pipe or wind instruments. In the context of speech production, resonant
frequencies of the vocal tract tube are called formant frequencies or simply formants. The
formants depend upon the shape and dimension of the vocal tract; each shape is
characterized by a set of formant frequencies. Different sounds are formed by varying the
shape of the V.T. Thus the spectral properties of the speech signal vary with time as the
vocal tract shape varies.
(Human speech production: the lungs, vocal cords, and mouth-nose produce a
quasi-periodic excitation signal and speech; the voice-coder counterpart models this with
a voiced/unvoiced decision, fundamental frequency, filter coefficients, and an energy
value, grouped into an excitation stage and an articulation stage)
Figure 5.3 Source filter model for sound production [13]
When the tenth-order all-pole filter representing the vocal tract is excited by white
noise, the signal model corresponds to an autoregressive (AR) time-series representation.
The coefficients of the AR model can be determined using linear prediction techniques. The
application of linear prediction in speech processing, specifically in speech coding, is often
referred to as Linear Predictive Coding (LPC). The LPC parameterization is a central
component of many compression algorithms that are used in cellular telephony for
bandwidth compression and enhanced privacy. Bandwidth is conserved by reducing the data
rate required to represent the speech signal. This data-rate reduction is achieved by
parameterizing speech in terms of the AR or all-pole filter coefficients and a small set of
excitation parameters. The two excitation parameters for the synthesis configuration shown
in fig. 5.4 are the following: (a) the voicing decision (voiced/unvoiced) and (b) the pitch
period. In the simplest case, 160 samples (20 ms for 8 kHz sampling) of speech (voice) can
be represented with ten all-pole vocal tract parameters and two parameters that specify the
excitation signal. Therefore, in this case, 160 speech samples can be represented by only
twelve parameters, which results in a data compression ratio of more than 13 to 1 in terms
of the number of parameters to be encoded and transmitted. In newer standardized
algorithms, more elaborate forms of parameterization exploit further redundancy in the
signal, yielding better compression and much-improved speech (voice) quality [13,143].
Figure 5.4 The Classical Linear Prediction Coefficients (LPC) model of speech (voice)
production [13]
5.4 Linear predictive coding (LPC)
Linear predictive coding (LPC) is a widely used technique in audio signal processing,
especially in speech signal processing. It has found particular use in voice signal
compression, allowing for very high compression rates. Linear prediction (LP) forms an
integral part of almost all modern-day speech coding algorithms. The fundamental idea is
that a speech (voice) sample can be approximated as a linear combination of past samples.
Within a signal frame, the weights used to compute the linear combination are found by
minimizing the mean-squared prediction error; the resultant weights, or linear prediction
coefficients (LPCs), are used to represent the particular frame. LPC determines the
coefficients of a forward linear predictor by minimizing the prediction error in the least-
squares sense. It has applications in filter design and speech coding.
Linear prediction is an identification technique where parameters of a system are
found from the observation. The basic assumption is that speech can be modeled as an AR
signal, which in practice is appropriate. LP is a spectrum estimation method in the sense that
its analysis allows the computation of the AR parameters, which define the PSD of the signal
itself. By computing the LPCs of a signal frame, it is possible to generate another signal in
such a way that the spectral contents are close to the original one. LP can also be viewed as a
redundancy-removal procedure where information repeated in an event is eliminated.
After all, there is no need to transmit data that can be predicted. By removing the
redundancy in a signal, the number of bits required to carry the information is lowered,
therefore achieving the purpose of compression [7,13].
As shown in fig. 5.5, consider the block to be a predictor, which tries to predict the
current output as a linear combination of previous outputs (hence LPC). The predictor's
input is the prediction error. The parameter estimation process is repeated for each frame,
with the results representing information on the frame. Thus, instead of transmitting the
PCM samples, parameters of the model are sent. By carefully allocating bits for each
parameter to minimize distortion, an impressive compression ratio of up to 50-60 times
can be achieved. Here, linear prediction is described as a system identification problem,
where the parameters of an AR model are estimated from the signal itself. The situation is
illustrated in fig. 5.5. The white noise signal x[n] is filtered by the AR process synthesizer
to obtain s[n], the AR signal, with the AR parameters denoted by âi. A linear predictor is
used to predict s[n] based on the M past samples; this is done with
ŝ[n] = − Σ_{i=1}^{M} a_i s[n−i] ........................(12)
where the ai are the estimates of the AR parameters and are referred to as the linear
prediction coefficients (LPCs). The constant M is known as the prediction order. Therefore,
the prediction is based on a linear combination of the M past samples of the signal, and
hence the prediction is linear. The prediction error is equal to
e[n] = s[n] − ŝ[n] ...................................(13)
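Eqs. 12 and 13 can be illustrated with a short NumPy sketch: predicting an AR(1) signal with its true coefficient makes the prediction error coincide with the white-noise excitation:

```python
import numpy as np

def predict(s, a):
    """Eq. (12): s_hat[n] = -sum_{i=1}^{M} a_i s[n-i] (zero for n < M)."""
    M = len(a)
    s_hat = np.zeros(len(s))
    for n in range(M, len(s)):
        s_hat[n] = -np.dot(a, s[n - M:n][::-1])   # s[n-1], ..., s[n-M]
    return s_hat

# AR(1) signal with a_1 = -0.9, i.e. s[n] = 0.9 s[n-1] + x[n]
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
s = np.zeros_like(x)
for n in range(len(s)):
    s[n] = x[n] + (0.9 * s[n - 1] if n > 0 else 0.0)

e = s - predict(s, np.array([-0.9]))              # Eq. (13): prediction error
```

Because the predictor coefficient equals the AR parameter here, the error e[n] reproduces the excitation x[n], which is exactly the optimal situation discussed below.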
Fig.5.6 shows the signal flow graph implementation of the error equation and is
known as the prediction-error filter: it takes an AR signal as input to produce the prediction-
error signal at its output [13].
Error Minimization:
The system identification problem consists of the estimation of the AR parameters ^ai from
s[n], with the estimates being the LPCs. To perform the estimation, a criterion must be
established. In the present case, the mean-squared prediction error,
Figure 5.6 The prediction-error filter
J = E{e²[n]} = E{ ( s[n] + Σ_{i=1}^{M} a_i s[n−i] )² } ...............(14)
is minimized by selecting the appropriate LPCs. Note that the cost function J is precisely a
second-order function of the LPCs. Consequently, we may visualize the dependence of the
cost function J on the estimates a1, a2..., aM as a bowl-shaped (M+1)-dimensional surface
Figure 5.5 Linear prediction as system identification
Linear prediction analysis and synthesis
66
with M degrees of freedom. This surface is characterized by a unique minimum. The
optimal LPCs can be found by setting the partial derivatives of J with respect to the ai to
zero; that is,
∂J/∂a_k = 2 E{ ( s[n] + Σ_{i=1}^{M} a_i s[n−i] ) s[n−k] } = 0 ...........(15)
for k = 1, 2, . . . , M. At this point, it is maintained without proof that when the above
equation is satisfied, ai = âi; that is, the LPCs are equal to the AR parameters. Thus, when
the LPCs are found, the system used to generate the AR signal (the AR process
synthesizer) is uniquely identified [13].
Prediction Gain:
The prediction gain of a predictor is given by
PG = 10 log₁₀( σ_s² / σ_e² ) = 10 log₁₀( E{s²[n]} / E{e²[n]} ) .........(16)
It is the ratio between the variance of the input signal and the variance of the prediction
error, expressed in decibels (dB). Prediction gain is a measure of the predictor's
performance: a better predictor generates a lower prediction error, leading to a higher
gain [13].
Minimum Mean-Squared Prediction Error:
From Fig. 5.5 we can see that when ai = âi, e[n] = x[n]; that is, the prediction error is
the same as the white noise used to generate the AR signal s[n]. Indeed, this is the optimal
situation where the mean-squared error is minimized, with
J = E{e²[n]} = E{x²[n]} = σ_x² ...........(17)
or equivalently, the prediction gain is maximized. The optimal condition can be reached
when the order of the predictor is equal to or higher than the order of the AR process
synthesizer. In practice, M is usually unknown. A simple method to estimate M from a signal
source is by plotting the prediction gain as a function of the prediction order. In this way it is
possible to determine the prediction order for which the gain saturates; that means further
increasing the prediction order from a certain critical point will not provide additional gain.
The value of the predictor order at the mentioned critical point represents a good estimate of
the order of the AR signal under consideration. The cost function J in the error minimization
equation is characterized by a unique minimum. If the prediction order M is known, J is
minimized when ai = âi, leading to e[n] = x[n]; that is, the prediction error is equal to the
excitation signal of the AR process synthesizer. This is a reasonable result, since the best
the prediction-error filter can do is to 'whiten' the AR signal s[n]. Thus, the maximum
prediction gain is given by the ratio between the variance of s[n] and the variance of x[n]
in decibels. Taking into account the AR parameters used to generate the signal s[n], we
have
J_min = σ_x² = R_s[0] + Σ_{i=1}^{M} a_i R_s[i] ......................(18)
White noise is generated using a random number generator with uniform distribution
and unit variance. This signal is then filtered by an AR synthesizer with the coefficients
indicated in table 5.1.
Table 5.1 AR synthesizer with ten LPC co-efficient
a1 = 1.534 a2 = 1 a3 = 0.587 a4 = 0.347 a5 = 0.08
a6 = -0.061 a7 = -0.172 a8 = -0.156 a9 = -0.157 a10 = -0.141
The frame of the resultant AR signal is used for LP analysis, with a length of 240
samples. Nonrecursive autocorrelation estimation using a Hamming window is applied. LP
analysis is performed with prediction order ranging from 2 to 20; prediction error and
prediction gain are found for each case. Fig. 5.7 summarizes the results, where we can see
that the prediction gain (PG) grows initially from M= 2 and is maximized when M = 10.
Further increasing the prediction order will not provide additional gain; in fact, it can even
reduce it. This is an expected result since the AR model used to generate the signal has order
ten. We observe that for the unvoiced frame, PG increases abruptly when the prediction
order goes from 2 to 5. Increasing the prediction order provides additional gain increase but
at a milder pace. For M > 10, the prediction gain remains essentially constant, implying
that the correlation between widely separated samples is low [13].
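The prediction-gain experiment can be sketched as follows. For the sketch, a stable illustrative AR(10) is synthesized from chosen conjugate pole pairs rather than the exact Table 5.1 coefficients (an assumption made here so the toy recursion is guaranteed stable); the gain then saturates near the true order M = 10:

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a guaranteed-stable AR(10): five conjugate pole pairs of radius 0.9.
a_poly = np.array([1.0])
for theta in np.pi * np.array([0.1, 0.25, 0.4, 0.6, 0.8]):
    a_poly = np.convolve(a_poly, [1.0, -2 * 0.9 * np.cos(theta), 0.81])
a_true = a_poly[1:]                       # a_1 ... a_10 of A(z) = 1 + sum a_i z^-i

x = rng.normal(size=20000)                # unit-variance white noise
s = np.zeros(len(x) + 10)                 # ten zero samples as history
for n in range(10, len(s)):
    s[n] = x[n - 10] - np.dot(a_true, s[n - 10:n][::-1])
s = s[10:]

def prediction_gain(s, M):
    """Solve the order-M normal equations, return PG in dB (Eq. 16)."""
    N = len(s)
    r = np.array([np.dot(s[:N - k], s[k:]) for k in range(M + 1)]) / N
    R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
    a = np.linalg.solve(R, -r[1:])        # LPCs a_1 ... a_M
    e = s[M:] + sum(a[i - 1] * s[M - i:N - i] for i in range(1, M + 1))
    return 10 * np.log10(np.mean(s ** 2) / np.mean(e ** 2))

gains = {M: prediction_gain(s, M) for M in (2, 10, 20)}
```

As in Fig. 5.7, the gain grows up to the true model order and then flattens: raising M beyond 10 buys essentially nothing.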
Figure 5.7 Prediction gain (PG) as a function of the prediction order (M)
As shown in Fig. 5.8, for the voiced frame, prediction gain is low for M < 3, remains
almost constant for 4 < M < 49, and reaches a peak for M > 49. This phenomenon occurs
because, for the voiced frame under consideration, the pitch period is 49. For M < 49, the
number of LPCs is not enough to remove the correlation between samples one pitch period
apart. For M > 49, however, the linear predictor is capable of modeling the correlation
between samples one pitch period apart, leading to a substantial improvement in prediction
gain. Further note that the change in prediction gain is abrupt: between M = 48 and 49, for
instance, a jump of nearly 3 dB in prediction gain is observed [13].
Figure 5.8 A plot of PG as a function of the prediction order (M) for the signal frames [13]
The effectiveness of the predictor at different prediction orders can be studied further
by observing the level of 'whiteness' in the prediction-error sequence. The prediction-error
filter associated with a good predictor is capable of removing as much correlation as
possible from the signal samples, leading to a prediction-error sequence with a flat PSD.
Fig. 5.9 illustrates the prediction-error sequence of the unvoiced frame and the
corresponding periodograms for different prediction orders. Note that M = 4 is not enough
to 'whiten' the original signal frame: the periodogram of the prediction error does not
mimic the flat spectrum of a white noise signal. For M = 10, however, flatness is achieved
in the periodogram and, hence, the prediction error becomes roughly white.
Fig.5.10 shows the prediction-error sequences and the corresponding periodograms
of the voiced frame. We can see that for M = 3, a high level of periodicity is still present in
the prediction-error sequence and a harmonic structure is observed in the corresponding
periodogram. When M = 10, the amplitude of the prediction-error sequence
Figure 5.10 Plots of prediction error and periodograms for the voiced frame:
M = 3, M = 10, and M = 50 [13]
Figure 5.9 Plots of prediction error and periodograms for the unvoiced frame.
Top: M = 4; Bottom: M = 10 [13]
Linear prediction analysis and synthesis
70
becomes lower; however, the periodic components remain. As we can see, the periodogram
develops a flatter appearance, but the harmonic structure is still present. For M = 50,
periodicity in the time and frequency domains is reduced to a minimum. Hence, to
effectively 'whiten' the voiced frame, a minimum prediction order of 50 is required.
Fig. 5.11 compares the theoretical PSD (defined with the original AR parameters)
with the spectrum estimates found with the LPCs computed from the signal frame using M =
2, 10, and 20. For a low prediction order, the resulting spectrum is not capable of fitting the
original PSD. An excessively high order, on the other hand, leads to overfitting, where
undesirable errors are introduced. In the present case, a prediction order of 10 is optimal.
Note how the spectrum of the original signal is captured by the estimated LPCs. This is the
reason why LP analysis is known as a spectrum estimation technique, specifically a
parametric spectrum estimation method, since the process is done through a set of parameters
or coefficients [13].
Figure 5.11 Plot of PSD for M = 2, M = 10 and M = 20 [13]
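The parametric spectrum-estimation view described above can be sketched in a few lines. The snippet below is a minimal numpy sketch (the AR(2) test model and all names are illustrative, not taken from the thesis): it fits order-M LPCs to a synthetic AR signal by the autocorrelation method and forms the model PSD g²/|A(e^{jω})|².

```python
import numpy as np

def lp_spectrum(x, M, n_freq=512):
    """Fit order-M LPCs by the autocorrelation method and return the
    parametric PSD estimate g^2 / |A(e^jw)|^2 on n_freq points."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + M] / len(x)
    R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
    a = np.linalg.solve(R, r[1:M + 1])          # predictor coefficients
    g2 = r[0] - a @ r[1:M + 1]                  # prediction-error power
    w = np.linspace(0, np.pi, n_freq)
    A = 1 - sum(a[k] * np.exp(-1j * w * (k + 1)) for k in range(M))
    return a, g2, g2 / np.abs(A) ** 2

# Synthetic AR(2) test signal: x[n] = 1.0 x[n-1] - 0.5 x[n-2] + w[n]
rng = np.random.default_rng(0)
x = np.zeros(20000)
noise = rng.standard_normal(20000)
for n in range(2, len(x)):
    x[n] = 1.0 * x[n - 1] - 0.5 * x[n - 2] + noise[n]

a, g2, psd = lp_spectrum(x, 2)   # a should be close to [1.0, -0.5]
```

With the matched order M = 2 the estimated coefficients recover the true AR parameters, which is exactly the sense in which the LP model "captures" the original spectrum.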
5.5 Linear Prediction and Autoregressive Modeling
Linear Prediction:
- In the case of linear prediction, the intention is to determine an FIR filter that can optimally predict future samples of an autoregressive process based on a linear combination of past samples.
- LPC returns the coefficients of the entire whitening filter A(z); this filter takes the AR signal x as input and returns the prediction error as output. However, A(z) has the prediction filter embedded in it, in the form B(z) = 1 - A(z).
- Use the LPC function and an FIR filter simply to come up with the parameters used to create the autoregressive signal.
- Extract B(z) from A(z) as described above to use as the FIR linear predictor filter.
- Obtain an estimate of future values of the autoregressive signal based on linear combinations of past values.
- Compare actual and predicted signals: plot 200 samples of the original AR signal along with the signal estimate resulting from the linear predictor.
- Compare prediction errors: the prediction-error power (variance) is returned as the second output from LPC.

Autoregressive Modeling:
- In the case of autoregressive modeling, the intention is to determine an all-pole IIR filter that, when excited with white noise, produces a signal with the same statistics as the autoregressive process being modeled.
- Generate an AR signal using an all-pole filter with white Gaussian noise of variance p0 as input; to get variance p0, use sqrt(p0) as the 'gain' term in the noise generator.
- Use the white Gaussian noise signal and the all-pole filter to generate the AR signal.
- Find the AR model from the signal using the Yule-Walker method.
- Compare the AR model with the AR signal: overlay the power spectral density of the output of the model, computed using FREQZ, with the power spectral density estimate of x, computed using PERIODOGRAM.
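The workflow above can be sketched in Python with numpy stand-ins for MATLAB's filter/aryule/lpc (the AR(2) model, its coefficients and the excitation variance p0 below are illustrative choices, not values from the thesis):

```python
import numpy as np

# Step 1: generate an AR signal by exciting an all-pole filter 1/A(z),
# A(z) = 1 - 1.2 z^-1 + 0.81 z^-2, with white noise of variance p0
# (sqrt(p0) acts as the 'gain' term on the unit-variance noise generator).
p0 = 0.25
a_true = np.array([1.2, -0.81])
rng = np.random.default_rng(1)
noise = np.sqrt(p0) * rng.standard_normal(50000)
x = np.zeros_like(noise)
for n in range(len(x)):
    x[n] = noise[n]
    if n >= 1:
        x[n] += a_true[0] * x[n - 1]
    if n >= 2:
        x[n] += a_true[1] * x[n - 2]

# Step 2: estimate the AR model from x via the Yule-Walker normal equations
r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + 2] / len(x)
R = np.array([[r[0], r[1]], [r[1], r[0]]])
a_hat = np.linalg.solve(R, r[1:3])

# Step 3: FIR predictor B(z) embedded in A(z): xhat[n] = a1 x[n-1] + a2 x[n-2]
xhat = np.zeros_like(x)
xhat[2:] = a_hat[0] * x[1:-1] + a_hat[1] * x[:-2]
err_var = np.var(x[2:] - xhat[2:])   # prediction-error power, close to p0
```

The prediction-error variance approaching p0 is the numerical counterpart of the "second output from LPC" comparison described above.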
Linear prediction analysis and synthesis
72
5.6 Linear Prediction (L.P.) and Bandwidth extension (B.W.E.)
LPC analysis is used to construct the LPC coefficients for the inverse transfer
function of the vocal tract. The standard methods for LPC coefficient estimation assume
that the input signal is stationary. A quasi-stationary signal is obtained by framing the
input signal, often into frames of length 20 ms. A more stationary signal results in a better
LPC analysis, because the signal is better described by the LPC coefficients, which
minimizes the residual signal.
5.6.1. LPC analysis
Fig. 5.12 shows a block diagram of an LPC analysis, where S is the input signal,
g is the gain of the residual signal, and a is a vector containing the LPC coefficients up to a
specific order. The size of the vector depends on the order of the LPC analysis: a higher
order means more LPC coefficients and therefore a better estimate of the vocal tract.
Figure 5.12 LPC analysis
5.6.2 LPC estimation
LPC estimation calculates an error signal from the LPC coefficients obtained by LPC
analysis. This error signal, called the residual signal, is the part that could not be modeled by
the LPC analysis. It is calculated by filtering the original signal with the inverse transfer
function from LPC analysis. If the inverse transfer function from LPC analysis equals the
inverse of the vocal tract transfer function, then the residual signal obtained from LPC
estimation equals the excitation that drives the vocal tract, i.e., the impulses or noise from
human speech production. Fig. 5.13 shows a block diagram of LPC estimation, where S is
the input signal, g and a are calculated from LPC analysis, and e is the residual signal.
Figure 5.13 LPC estimation
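The two blocks can be sketched as follows (a minimal numpy sketch; the frame length, order and the small regularization term are illustrative assumptions, not the thesis's settings):

```python
import numpy as np

def lpc_analysis(frame, order):
    """LPC analysis block: return LP coefficient vector a and gain g
    for one frame (autocorrelation method)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    g = np.sqrt(max(r[0] - a @ r[1:order + 1], 0.0))
    return a, g

def lpc_estimation(frame, a):
    """LPC estimation block: residual e[n] = s[n] - sum_k a_k s[n-k],
    i.e. the frame filtered by the inverse filter A(z)."""
    e = frame.astype(float).copy()
    for k, ak in enumerate(a, start=1):
        e[k:] -= ak * frame[:-k]
    return e

# 20 ms frame (160 samples at 8 kHz) of a synthetic resonant signal
rng = np.random.default_rng(2)
s = np.zeros(160)
w = rng.standard_normal(160)
for n in range(160):
    s[n] = w[n] + (0.9 * s[n - 1] if n >= 1 else 0.0)

a, g = lpc_analysis(s, 10)
e = lpc_estimation(s, a)   # residual carries much less energy than the frame
```

The drop in residual energy relative to the frame energy is the practical sign that the inverse filter has removed the vocal-tract correlation.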
5.6.3 LPC-synthesis
LPC synthesis is used to reconstruct a signal from the residual signal and the transfer
function of the vocal tract. Because the vocal tract transfer function is estimated by LPC
analysis, it can be combined with the residual/error signal from LPC estimation to
reconstruct the original signal. Fig. 5.14 shows a block diagram of LPC synthesis, where e is
the error signal found from LPC estimation and g and a come from LPC analysis.
Reconstruction of the original signal s is done by filtering the error signal with the vocal
tract transfer function. Finally, the outputs are combined via the LPC synthesizer to get the
required W.B. signal.
Figure 5.14 LPC synthesis
The input signal before the LPC analyzer and the input signal after the LPC synthesizer,
together with the error signal, are depicted in fig. 5.15.
Figure 5.15 Input signal before LPC analyzer, input signal after LPC synthesizer with the
error signal.
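The analysis/synthesis round trip described above can be verified in code. A minimal numpy sketch (the coefficient values are illustrative): the all-pole synthesis filter 1/A(z) exactly undoes the inverse filter A(z), so the residual resynthesizes to the original signal.

```python
import numpy as np

def inverse_filter(s, a):
    """A(z): produce the residual e from signal s and LP coefficients a."""
    e = s.astype(float).copy()
    for k, ak in enumerate(a, start=1):
        e[k:] -= ak * s[:-k]
    return e

def lpc_synthesis(e, a):
    """1/A(z): rebuild s from the residual by the all-pole recursion
    s[n] = e[n] + sum_k a_k s[n-k]."""
    s = np.zeros(len(e))
    for n in range(len(e)):
        s[n] = e[n] + sum(a[k - 1] * s[n - k]
                          for k in range(1, len(a) + 1) if n >= k)
    return s

# Round trip: analysis filter then synthesis filter restores the signal
rng = np.random.default_rng(3)
sig = rng.standard_normal(200)
a = np.array([0.8, -0.3])
rec = lpc_synthesis(inverse_filter(sig, a), a)   # rec matches sig
```

This identity is what makes the residual a lossless intermediate representation; in a real coder the loss comes from quantizing e, g and a, not from the filtering itself.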
Fig. 5.16 represents the complete block diagram of the bandwidth-extended output
obtained from the band-limited N.B. signal. The N.B. signal is processed by the LPC
estimation block to get the gain and LP coefficients, while the LPC analyzer produces the
residual error signal based on two inputs: the N.B. signal and the LPC estimation output.
After that, the LPC estimation output is processed by envelope extension, and the LPC
analyzer output is processed by excitation extension, to get the W.B. gain and LP
coefficients and the W.B. residual error signal. A more simplified version of fig. 5.16 is
depicted in fig. 5.17. The basic idea in fig. 5.17 is to create a signal that contains the
frequencies that are missing from the original N.B. signal. This signal, having its energy
mainly in the frequency band 4-8 kHz, is then added to an interpolated version of the
original N.B. signal, which has most of its energy in the band 0-4 kHz. The extension causes
a delay in the signal, so the resampled N.B. signal must be delayed to synchronize the
signals. The procedure starts by windowing the incoming signal. The frames are decomposed
into source-signal and filter parts using LP analysis, and the parts are extended separately.
The frame waveform is rebuilt by filtering the extended source signal with the extended
filter. After scaling, the frames are joined together using overlap-add.
Figure 5.16 Complete block diagram of B.W.E. output from band-limited N.B. signal
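The per-frame decompose/extend/rebuild/overlap-add structure can be outlined in code. The numpy sketch below is only a skeleton under loud assumptions: the excitation extension is done by spectral folding (zero-insertion upsampling), and the envelope extension is omitted entirely; the thesis's actual extension rules are not reproduced here.

```python
import numpy as np

def extend_excitation(e_nb):
    """Excitation extension by spectral folding: zero-insertion upsampling
    mirrors the 0-4 kHz excitation spectrum into 4-8 kHz."""
    e_wb = np.zeros(2 * len(e_nb))
    e_wb[::2] = e_nb
    return e_wb

def bwe_frames(frames_nb, hop_wb):
    """Per-frame extend + overlap-add (envelope extension omitted: the
    narrowband spectral shape is reused unchanged, a deliberate
    simplification for this sketch)."""
    n_wb = 2 * len(frames_nb[0])
    out = np.zeros(hop_wb * (len(frames_nb) - 1) + n_wb)
    win = np.hanning(n_wb)
    for i, f in enumerate(frames_nb):
        out[i * hop_wb:i * hop_wb + n_wb] += win * extend_excitation(f)
    return out

frames = [np.ones(80) for _ in range(4)]   # toy 10 ms narrowband frames
wb = bwe_frames(frames, hop_wb=80)
```

Zero insertion doubles the sampling rate while creating spectral images above the original band; those images play the role of the regenerated 4-8 kHz content that is afterwards added to the delayed, interpolated N.B. signal.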
Figure 5.17 Bandwidth extension (B.W.E.) based on Linear Prediction (L.P.)
5.7 Bandwidth Extension based on Sub band filter and Evaluation of
speech signal through Source Filter Model
The bandwidth extension system based on the sub-band filter is shown in fig. 5.18.
The block diagram contains a high-band/low-band analysis sub-band coder, a down-sampler,
and a G.711 encoder at the transmitting terminal, and a G.711 decoder, an up-sampler, and a
high-band/low-band synthesis sub-band coder at the receiving terminal. At the transmitting
terminal, the original W.B. speech signal is fed into a sub-band filter [18]. The filter outputs
are then decimated by two. In this way, both HF and LF segments at an 8 kHz sampling
frequency are acquired. Next, the LF part is encoded by the narrowband encoder, which is a
G.711 encoder [19]. The HF part is transmitted unchanged from the transmitter to the
receiver through the transmission channel. At the user-end terminal, the N.B. speech (voice)
signal is decoded through the N.B. speech (voice) decoder, and the HF part is restored as it
is. Finally, the signal is synthesized by the high-band/low-band synthesis sub-band coder to
attain near-perfect reconstruction of the original signal at the output, which can be observed
on the spectrum analyzer. The listener's speech (voice) file is played on the audio device
writer.
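The split/decimate/reconstruct idea can be demonstrated with a minimal stand-in for the sub-band coder. The sketch below uses a Haar (sum/difference) filter pair rather than the actual filter bank of the thesis; it still shows the decimate-by-two split into low and high bands and the near-perfect reconstruction property.

```python
import numpy as np

def analysis(x):
    """Two-band Haar split: low band = scaled pair sums, high band =
    scaled pair differences, each decimated by two."""
    pairs = x[:len(x) // 2 * 2].reshape(-1, 2)
    low = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
    high = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)
    return low, high

def synthesis(low, high):
    """Inverse of analysis(): interleave the reconstructed sample pairs."""
    out = np.empty(2 * len(low))
    out[0::2] = (low + high) / np.sqrt(2)
    out[1::2] = (low - high) / np.sqrt(2)
    return out

rng = np.random.default_rng(5)
x = rng.standard_normal(512)
low, high = analysis(x)
y = synthesis(low, high)   # perfect reconstruction for this filter pair
```

In the system of fig. 5.18 the low band would additionally pass through the G.711 encode/decode stage, so the overall reconstruction is only near-perfect rather than exact.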
Performance evaluation of the speech (voice) signal based on the source-filter model
is shown in figs. 5.19, 5.20 and 5.21. As shown in fig. 5.19, the original speech (voice)
signal is passed through the LPC analysis block to separate the LPC coefficients and the
residual component, which is further processed by bit-stream quantization to make a
quantized signal that is processed by the LPC synthesizer to make a compressed version of
the original speech (voice) signal. As shown in fig. 5.20, the speech (voice) signal is broken
up into frames of size 20 ms (160 samples), with an overlap of 10 ms (80 samples). Each frame is
windowed using a Hamming window. Eleventh-order autocorrelation coefficients are
computed, and reflection coefficients are calculated from the autocorrelation coefficients
using the Levinson approach. The original speech (voice) signal is passed through an
analysis filter, which is an all-zero filter whose coefficients are the reflection coefficients
obtained above. The output of the filter is the residual signal. As shown in fig. 5.21, the LPC
synthesizer is a time-varying filter found in the receiver section of the system. It reconstructs
the original signal using the reflection coefficients and the residual signal. The speech file is
played through the audio player.
Figs. 5.22 & 5.23 show the input and output time-domain waveforms of the wave file
''OM SHRI GANESHAY NAMAH'', while figs. 5.24 & 5.25 show the input and output
frequency-domain waveforms of the same file. From figs. 5.22-5.25, it can be seen that the
signals look very different in the time domain, but in the frequency domain they look alike,
because nearly the same power distribution is observed at the same frequencies, which can
also be heard on the audio player. The result displayed at each stage using a spectrum
analyzer, time scope, or display takes some time, because in our simulation 1536 samples
need to be accumulated before a result can be displayed at each stage; once the required
number of samples is acquired, there is no issue with the results. One other noticeable
remark from viewing the various time/frequency-domain waveforms is that at each stage of
the source-filter model approach the sampling frequency changes, so variation in the signal
waveform is observed. The pre-emphasized speech signal, Hamming-windowed speech
signal, LPC analyzer output, LPC synthesizer output, autocorrelation coefficients, reflection
coefficients (RC), and LPC coefficients are represented in figs. 5.26 to 5.30. From the
obtained reflection coefficients in fig. 5.30, one can say that |RC| < 1 holds for the given
input speech file, which ensures stability of the linear prediction (L.P.), as discussed further
in the next section.
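The Levinson approach mentioned above can be sketched as follows (a textbook Levinson-Durbin recursion in numpy; the AR(1)-style autocorrelation sequence used to exercise it is illustrative). It returns both the LP coefficients and the reflection coefficients, so the |k| < 1 stability check of the next section can be applied directly.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the LP normal equations by the Levinson-Durbin recursion.
    r: autocorrelation values r[0..order]. Returns (a, k, E): LP
    coefficients a, reflection coefficients k, final error power E."""
    a = np.zeros(0)
    k = np.zeros(order)
    E = r[0]
    for m in range(1, order + 1):
        lam = (r[m] - a @ r[1:m][::-1]) / E   # m-th reflection coefficient
        k[m - 1] = lam
        a = np.concatenate([a - lam * a[::-1], [lam]])
        E *= (1.0 - lam ** 2)
    return a, k, E

# AR(1)-like autocorrelation r[m] = 0.9^m: only the first RC is nonzero
r = 0.9 ** np.arange(11)
a, k, E = levinson_durbin(r, 10)
# every |k_i| < 1 here, so the synthesis filter 1/A(z) is stable
```

For a valid (positive-definite) autocorrelation sequence the recursion always yields |k_i| < 1, which is the property exploited by the stability discussion that follows.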
Figure 5.18 Bandwidth extension based on Sub band filter
Figure 5.19 Performance evaluation of speech signal based on Source filter
Figure 5.20 Source Filter model-based Analysis
Figure 5.21 Source Filter model-based Synthesis
Figure 5.22 I/P time-domain waveform of wave file ''OM SHRI GANESHAY NAMAH''
Figure 5.23 O/P time-domain waveform of wave file "OM SHRI GANESHAY NAMAH"
Figure 5.24 Input frequency domain waveform of Wave file "OM SHRI GANESHAY
NAMAH"
Figure 5.25 O/P freq. domain waveform of Wave file "OM SHRI GANESHAY NAMAH"
Figure 5.26 Pre-Emphasized Speech signal
Figure 5.27 A hamming windowed Speech signal
Figure 5.28 LPC analyzer output
Figure 5.29 LPC synthesizer output
5.8 Stability consideration (LP)
The prediction-error filter with system function

A(z) = 1 - Σ_{k=1}^{M} a_k z^{-k} ..............(19)

where the a_k are the LPCs found by solving the normal equations, is a minimum-phase system
if and only if the associated RCs k_i satisfy the condition

|k_i| < 1; i = 1, 2, ..., M ..........(20)
The fact that A(z) represents a minimum phase system implies that the zeros of A(z) are
inside the unit circle of the z-plane. Thus, the poles of the inverse system 1/ A(z) are also
inside the unit circle. Hence, the inverse system is guaranteed to be stable if the RCs satisfy
the above condition. Since the inverse system is used to synthesize the output signal in an
LP-based speech coding algorithm, stability is mandatory, with all the poles located inside
the unit circle.
Figure 5.30 Autocorrelation coefficients, reflection coefficients & LPC coefficients
For real speech data, the prediction order must be high enough to include at least one
pitch period in order to adequately model the voiced signal under consideration. A linear
predictor with an order of ten is not capable of accurately modeling the periodicity of a
voiced signal having a pitch period of 50. The problem is evident when the prediction error
is examined: a lack of fit is indicated by the remaining periodic component. By increasing
the prediction order to include one pitch period, the periodicity in the prediction error largely
disappears, leading to a rise in prediction gain. A high prediction order leads to excessive
bit-rate and implementation cost, since more bits are required to represent the LPCs and
extra computation is needed during analysis. Thus, it is desirable to come up with a scheme
that is simple and yet able to model the signal with sufficient accuracy [13].
An increase in prediction gain is due mainly to the first 8 to 10 coefficients, plus the
coefficient at the pitch period, equal to 49 in that particular case. The LPCs at orders between
11 and 48 and at orders greater than 49 provide essentially no contribution toward improving
the prediction gain. This can be seen from the flat segments from 10 to 49, and beyond 50.
Therefore coefficients that are not contributing toward elevating the prediction gain can be
eliminated, leading to a more compact and efficient scheme. In long-term LP, a short-term
predictor is connected in cascade with a long-term predictor, as shown in fig. 5.31. The
short-term predictor has a relatively low prediction order M, in the range of 8 to 12. This
predictor eliminates the correlation between nearby samples and is therefore short-term in
the temporal sense. The long-term predictor targets the correlation between samples one
pitch period apart.
Figure 5.31 Short-term prediction-error filter connected in series to a long-term prediction-
error filter
The long-term prediction-error filter with input e_s[n] and output e[n] has system function

H(z) = 1 - b z^{-T} .......................(21)

Two parameters are required to specify the filter: the pitch period T and the long-term gain b
(also known as the long-term LPC or pitch gain).
Fig. 5.32 shows the block diagram of the synthesis filter, where unit-variance
white noise x[n] is generated, scaled by the gain g, and input to the synthesis filter to
generate the synthesized speech at the output. Since x[n] has unit variance, the scaled
excitation g·x[n] has variance equal to g².
Figure 5.32 Block diagram of the synthesis filter
g² = g_s ( R_s[0] - Σ_{i=1}^{M} a_i R_s[i] ) ........................(22)
Thus, the gain can be found by knowing the LPCs and the autocorrelation values of the
original signal. In the above equation, g_s is a scaling constant. A scaling constant is needed
because the autocorrelation values are normally estimated using a window that weakens the
signal's power. The value of g_s depends on the type of window selected and can be found
experimentally; typical values range from 1 to 2. Also, it is important to note that the
autocorrelation values in the equation must be the time-averaged ones, instead of merely the
sum of products [13].
The long-term predictor is responsible for generating a correlation between samples
that are one pitch period apart. The filter with system function
H_p(z) = 1 / (1 - b z^{-T}) ........................................ (23)

describing the effect of the long-term predictor in synthesis is known as the long-term
synthesis filter or pitch synthesis filter. On the other hand, the short-term predictor recreates
the correlation present between nearby samples, with a typical prediction order equal to ten.
The synthesis filter associated with the short-term predictor is also known as the formant
synthesis filter, since it generates the envelope of the spectrum in a way similar to the vocal
tract tube, with resonant frequencies known simply as formants. For the pitch synthesis filter
with system function (23), the system poles are found by solving

1 - b z^{-T} = 0 ................................ (24)

z^T = b ............................................ (25)

so there are T different solutions for z, and the system has T different poles.
For long-term LP analysis, the optimal long-term gain is

b = ( Σ_n e_s[n] e_s[n-T] ) / ( Σ_n e_s²[n-T] ) ............. (26)
These poles lie at the vertices of a regular polygon of T sides that is inscribed in a
circle of radius |b|^{1/T}. Thus, for the filter to be stable, the following condition must be
satisfied:

|b| < 1 ....................................................................... (27)
An unstable pitch synthesis filter arises when the absolute value of the numerator in
(26) is greater than the denominator, resulting in |b| > 1. This usually arises when a
transition from an unvoiced to a voiced segment takes place and is marked by a rapid surge
in signal energy. When processing a voiced frame that occurs just after an unvoiced frame,
the denominator quantity Σ e_s²[n-T] involves the sum of the squares of amplitudes in the
unvoiced segment, which is normally weak. On the other hand, the numerator quantity
Σ e_s[n] e_s[n-T] involves the sum of the products of the higher amplitudes from the voiced
frame and the lower amplitudes from the unvoiced frame. Under these circumstances, the
numerator can be larger in magnitude than the denominator, leading to |b| > 1. To ensure
stability, the long-term gain is often truncated so that its magnitude is always less than
one [13].
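The gain computation of eq. (26) together with the truncation rule can be sketched as follows (numpy; the clamp bound 0.99 and the growing-amplitude test signal are illustrative choices):

```python
import numpy as np

def long_term_gain(e_s, T):
    """b = sum_n e_s[n] e_s[n-T] / sum_n e_s[n-T]^2, truncated so that
    |b| < 1 keeps the pitch synthesis filter 1/(1 - b z^-T) stable."""
    num = float(np.dot(e_s[T:], e_s[:-T]))
    den = float(np.dot(e_s[:-T], e_s[:-T]))
    if den == 0.0:
        return 0.0
    return float(np.clip(num / den, -0.99, 0.99))

# A periodic residual whose amplitude grows 1.5x per pitch period mimics
# an energy surge: the raw ratio equals 1.5, so the clamp engages.
T = 40
growth = 1.5 ** (1.0 / T)
n = np.arange(400)
e = (growth ** n) * np.sin(2 * np.pi * n / T)
b = long_term_gain(e, T)   # clamped to 0.99
```

Because each sample satisfies e_s[n] = 1.5 e_s[n-T] here, the numerator is exactly 1.5 times the denominator, reproducing the |b| > 1 situation that the truncation guards against.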
CHAPTER 6
Comparative Analysis of Proposed Model with
Baseline & Next-Generation Speech Codec
6.1 Subjective & Objective Measurement for W.B. to S.W.B.
For quality measurement, which is a highly subjective evaluation, comparison-based
mean-opinion-score (CMOS) ratings were performed on various speech files [28,41]. In each
examination, bandwidth-extended signals are compared with X_EVS and X_H.F.B.E. Each
examination was carried out by 56 listeners, 28 male and 28 female. They were requested to
judge the relative quality of 13 randomly ordered pairs of speech (voice) signals X and Y.
Either X or Y can be processed with the proposed algorithm, the other by the baseline or
next-generation coder. The selected wave files were judged by each individual listener, who
gave each file a rating in the range of -3 to 3 (a 7-point scale), where -3 is much worse and 3
is much better; a zero rating means both are of the same quality, -2 and -1 mean worse and
slightly worse, while 1 and 2 mean slightly better and better. The age group selected for the
above measurement is between 21 and 50 years.
As shown in Table 6.1, each listener has given a rating on the scale of -3 to 3 for
each of the 13 wave files, and finally the average value over all listeners is calculated to
rate/judge the CMOS score of the proposed algorithm against the baseline/next-generation
coder algorithm. Here Table 6.1 is prepared for the proposed coder; in the same way, tables
can be prepared for the baseline/next-generation coder algorithms. The samples were played
using good-quality Logitech headphones. The speech files used for quality measurement are
available online. For intelligibility, as a highly objective measurement, P.E.S.Q. can be
characterized as
y = 0.999 + (4.999 - 0.999) / (1 + e^{-1.3669x + 3.8224}) .....................(28)

where x is the raw P.E.S.Q. output (ITU-T P.862.2). One can use the above expression to
obtain an equivalent M.O.S.L.Q. score (= y) from the P.E.S.Q. output, as discussed in the
subjective measurement for W.B. to S.W.B. The expression below can be used to get the
P.E.S.Q. output from the M.O.S.L.Q. score [41]:

x = ( 4.6607 - ln( (4.999 - y) / (y - 0.999) ) ) / 1.4945 .............................(29)
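Assuming the constants as printed, the two mappings translate directly into code (note that they are two different mappings: a P.862.1-style forward map and the P.862.2 wideband inverse; the function names are illustrative):

```python
import math

def mos_lqo_from_pesq(x):
    """Eq. (28): map raw P.E.S.Q. output x to a MOS-LQO score y."""
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.3669 * x + 3.8224))

def pesq_from_mos_lqo(y):
    """Eq. (29): map a MOS-LQO score y back to a raw score x
    (wideband constants 4.6607 and 1.4945)."""
    return (4.6607 - math.log((4.999 - y) / (y - 0.999))) / 1.4945
```

At the mid-scale value y = 2.999 the logarithm vanishes, so eq. (29) returns 4.6607/1.4945 ≈ 3.12; the forward map is monotonic and bounded between 0.999 and 4.999 by construction.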
Table 6.1 Listener rating in a scale of -3 to 3 for different wave files
Speech file: MA01_01, MA01_02, MA01_03, MA01_04, MA01_05, FA01_01, FA01_03, FA01_04, FA01_05, arctic_a0002, arctic_a0003, arctic_a0004, arctic_a0005
Listener ratings (scale -3 to 3), one row per listener, in the file order above:
L-1 (M): 2 2 2 2 2 2 2 2 1 2 1 1 3
L-2 (M): 2 3 3 2 2 0 1 1 0 1 2 2 2
L-3 (M): 2 1 1 2 1 2 2 1 1 2 1 2 3
L-4 (M): 1 2 2 1 1 1 1 1 1 1 0 1 1
L-5 (M): 2 2 1 2 2 2 2 2 1 1 2 2 2
L-6 (M): 1 3 2 1 3 1 1 1 1 1 0 1 2
L-7 (M): 0 1 2 2 2 2 1 2 0 2 2 2 1
L-8 (M): 2 1 2 2 1 1 1 1 1 1 1 1 2
L-9 (M): 2 3 1 2 2 0 0 2 2 1 2 2 2
L-10 (M): 1 2 0 1 2 2 1 1 1 2 2 1 2
L-11 (M): 2 2 2 2 3 1 1 0 0 2 1 0 2
L-12 (M): 2 2 1 2 2 1 0 1 1 2 2 2 1
L-13 (M): 1 1 2 1 1 1 1 2 2 1 1 2 2
L-14 (M): 1 3 2 2 3 0 0 1 1 2 2 2 2
L-15 (M): 1 2 3 1 2 0 2 2 1 0 2 1 3
L-16 (M): 1 3 2 2 3 2 1 0 1 1 1 1 2
L-17 (M): 2 1 3 1 1 2 2 2 0 1 1 0 2
L-18 (M): 1 2 1 0 2 1 1 1 1 2 1 2 3
L-19 (M): 2 3 2 2 1 2 2 1 1 2 2 1 2
L-20 (M): 2 2 2 2 2 1 1 1 2 0 1 2 1
L-21 (M): 2 3 1 2 3 0 2 1 1 2 2 1 2
L-22 (M): 2 2 2 1 2 2 0 1 2 1 2 2 1
L-23 (M): 1 1 2 2 1 1 2 0 1 1 1 1 2
L-24 (M): 2 0 2 2 0 0 1 1 1 0 2 1 2
L-25 (M): 1 2 2 1 2 1 2 1 1 2 1 2 1
L-26 (M): 1 1 1 2 1 1 2 0 2 1 2 2 1
L-27 (M): 1 3 2 1 3 2 1 1 1 2 1 2 2
L-28 (M): 2 2 2 2 2 1 2 0 2 1 2 2 1
L-1 (F): 1 3 3 1 3 1 1 0 2 0 0 2 2
L-2 (F): 2 2 2 1 2 2 1 2 1 2 2 1 3
L-3 (F): 2 3 2 0 3 1 0 2 2 2 1 2 1
L-4 (F): 2 1 3 2 1 1 1 1 1 2 2 1 2
L-5 (F): 2 3 2 1 2 0 1 2 2 2 2 2 1
L-6 (F): 1 2 1 2 2 1 0 1 1 1 1 2 2
L-7 (F): 2 1 2 1 1 0 1 0 1 2 2 2 2
L-8 (F): 2 2 1 2 2 2 2 1 2 1 0 2 1
L-9 (F): 1 2 2 1 2 2 1 1 1 1 2 1 1
L-10 (F): 1 3 2 1 3 1 1 0 2 2 1 2 0
L-11 (F): 2 2 1 2 2 1 1 1 1 1 1 1 2
L-12 (F): 2 2 2 2 1 2 2 1 2 2 2 2 1
L-13 (F): 2 3 2 2 3 1 1 0 1 2 1 1 2
L-14 (F): 2 1 1 2 1 2 2 1 2 1 0 2 2
L-15 (F): 1 3 1 2 3 1 0 2 0 2 1 2 1
L-16 (F): 1 2 2 1 2 2 2 1 2 2 2 2 2
L-17 (F): 1 3 1 2 2 2 2 2 1 1 1 1 1
L-18 (F): 2 3 0 1 3 0 1 1 2 2 0 2 0
L-19 (F): 2 2 2 2 2 1 2 2 2 2 1 2 2
L-20 (F): 0 3 1 2 3 2 1 1 1 1 2 1 1
L-21 (F): 2 2 2 2 2 0 0 2 2 1 1 2 2
L-22 (F): 1 3 1 2 1 2 1 0 0 1 1 2 1
L-23 (F): 1 2 2 1 2 1 2 2 2 2 0 2 2
L-24 (F): 0 3 3 2 3 0 1 1 2 2 1 2 3
L-25 (F): 2 1 1 1 1 2 0 2 1 1 1 1 1
L-26 (F): 1 3 2 2 3 1 1 2 2 2 1 2 2
L-27 (F): 2 2 1 1 2 0 2 1 1 1 1 1 1
L-28 (F): 1 3 2 2 1 1 1 2 0 0 0 2 2
Average: 1.48 2.13 1.74 1.56 1.96 1.14 1.18 1.15 1.23 1.38 1.24 1.57 1.70
From the simulation results obtained through MATLAB programming, one can
observe that in terms of objective measurement (P.E.S.Q.) the proposed LP-based method
achieves results comparable to the baseline algorithm and the next-generation super-wideband
E.V.S. coder, as indicated in Table 6.3 (fig. 6.2) & Table 6.5 (fig. 6.4). In terms of subjective
measurement (mean opinion score), the result of the proposed coder is likewise comparable
to the baseline algorithm and the next-generation super-wideband E.V.S. coder, as seen in
Table 6.2 (fig. 6.1) and Table 6.4 (fig. 6.3). A comparative analysis of all algorithms for
various wave files has been carried out, and from the results it can be concluded that the
proposed coder performs comparably to the baseline algorithm and the next-generation
super-wideband E.V.S. coder.
As indicated in Table 6.6 (fig. 6.5), Table 6.7 (fig. 6.6) and Table 6.8 (fig. 6.7), to
observe the effect of energy on speech (voice) quality, subjective measurement (CMOS) has
been carried out by varying the gain for different prediction orders. It is found that the
baseline H.F.B.E. algorithm depends on the gain variation, while a change in prediction
order does not affect its result. For the next-generation S.W.B. coder, variation of gain and
prediction order does not affect the obtained result, i.e., steady results are obtained for all
prediction orders and gain changes. For the proposed algorithm at unity gain, not much
variation is observed across prediction orders, so a prediction order of 16 (LP) with unity
gain is taken as the standard at various stages in the thesis, balancing accurate representation
against complexity. By observing the spectrograms of the input and output speech files for
the baseline, proposed, and next-generation coders, and by comparing all the performed
results, one can say that in the bandwidth-limited input signal (N.B./W.B.) the required
spectral components are absent due to missing H.B. information, and the results obtained by
the proposed coder are comparable to the baseline algorithm and the next-generation
super-wideband E.V.S. coder, as shown in figs. 6.8 to 6.20.
Table 6.2: C.M.O.S.-L.Q.O.for proposed, H.F.B.E., LP_order=12,16,24 & E.V.S. algorithm for
bdl_arctic_a0001.wav
File name | Proposed (CMOS-L.Q.O., LP-12) | Proposed (LP-16) | Proposed (LP-24) | HFBE | EVS
bdl_arctic_a0001.wav | 1.117 | 1.116 | 1.114 | 1.123 | 1.138
Figure 6.1 C.M.O.S.-L.Q.O. for proposed, H.F.B.E., L.P._order=12,16,24 & E.V.S. algorithm
for bdl_arctic_a0001.wav
Subjective & Objective Measurement for W.B. to S.W.B.
89
Table 6.3: P.E.S.Q. for proposed, H.F.B.E., LP_order=12,16,24 & E.V.S. algorithm for bdl_arctic_a0001.wav
File name | Proposed (P.E.S.Q., LP-12) | Proposed (LP-16) | Proposed (LP-24) | HFBE | EVS
bdl_arctic_a0001.wav | 0.238 | 0.236 | 0.236 | 0.28 | 0.368

Figure 6.2 P.E.S.Q. for proposed, H.F.B.E., LP_order=12,16,24 & E.V.S. algorithm for bdl_arctic_a0001.wav

Table 6.4: C.M.O.S.-L.Q.O. for proposed, HFBE, LP_order=12,16,24 & EVS algorithm for 13 different speech files (Complementary Mean Opinion Score)
Speech Files | Proposed (LP_12) | Proposed (LP_16) | Proposed (LP_24) | HFBE (Baseline) | EVS
MA01_01 | 1.469 | 1.477 | 1.473 | 1.816 | 4.644
MA01_02 | 2.116 | 2.128 | 2.123 | 2.573 | 4.644
MA01_03 | 1.726 | 1.739 | 1.729 | 2.189 | 4.644
MA01_04 | 1.560 | 1.562 | 1.558 | 1.992 | 4.644
MA01_05 | 1.926 | 1.959 | 1.953 | 2.388 | 4.644
FA01_01 | 1.143 | 1.145 | 1.142 | 1.223 | 4.644
FA01_03 | 1.186 | 1.189 | 1.182 | 1.287 | 4.644
FA01_04 | 1.145 | 1.15 | 1.142 | 1.241 | 4.644
FA01_05 | 1.234 | 1.239 | 1.231 | 1.365 | 4.644
arctic_a0002 | 1.369 | 1.379 | 1.37 | 1.606 | 4.644
arctic_a0003 | 1.24 | 1.242 | 1.233 | 1.37 | 4.644
arctic_a0004 | 1.552 | 1.568 | 1.562 | 1.834 | 4.644
arctic_a0005 | 1.696 | 1.7 | 1.69 | 1.964 | 4.644
Comparative Analysis of Proposed Model with Baseline & Next-Generation Speech Codec
90
Figure 6.3 C.M.O.S.-L.Q.O. for proposed, HFBE, LP_order=12,16,24 & EVS algorithm for 13 different speech files

Table 6.5: P.E.S.Q. for proposed, HFBE, LP_order=12,16,24 & EVS algorithm for 13 different speech files
Speech Files | Proposed (LP_12) | Proposed (LP_16) | Proposed (LP_24) | HFBE | EVS
MA01_01 | 1.335 | 1.340 | 1.336 | 1.8 | 1.094
MA01_02 | 2.114 | 2.125 | 2.121 | 2.48 | 1.094
MA01_03 | 1.712 | 1.721 | 1.718 | 2.17 | 1.094
MA01_04 | 1.509 | 1.528 | 1.524 | 1.991 | 1.094
MA01_05 | 1.913 | 1.925 | 1.918 | 2.34 | 1.094
FA01_01 | 1.126 | 1.128 | 1.123 | 0.73 | 1.094
FA01_03 | 0.165 | 0.17 | 0.164 | 0.92 | 1.094
FA01_04 | 1.138 | 1.142 | 1.139 | 0.78 | 1.094
FA01_05 | 0.77 | 0.78 | 0.772 | 1.114 | 1.094
arctic_a0002 | 1.128 | 1.137 | 1.131 | 1.54 | 1.094
arctic_a0003 | 0.78 | 0.79 | 0.782 | 1.138 | 1.094
arctic_a0004 | 1.46 | 1.48 | 1.470 | 1.7 | 1.094
arctic_a0005 | 1.66 | 1.66 | 1.63 | 1.958 | 1.094
Table 6.6: C.M.O.S.L.Q. for proposed, HFBE, LP_order=24 & EVS algorithm for various gains for wave file bdl_arctic_a0001.wav
LP_order_wb = 24; Gain = 0.50, 0.75 & 1
Algorithm | Gain=0.5 | Gain=0.75 | Gain=1
LPAS (Proposed, LP_order=24) | 1.126 | 1.119 | 1.114
HFBE (Baseline) | 1.132 | 1.127 | 1.123
EVS | 1.138 | 1.138 | 1.138

Table 6.7: C.M.O.S.L.Q. for proposed, HFBE, LP_order=16 & EVS algorithm for various gains for wave file bdl_arctic_a0001.wav
LP_order_wb = 16; Gain = 0.50, 0.75 & 1
Algorithm | Gain=0.5 | Gain=0.75 | Gain=1
LPAS (Proposed, LP_order=16) | 1.121 | 1.116 | 1.114
HFBE (Baseline) | 1.132 | 1.127 | 1.123
EVS | 1.138 | 1.138 | 1.138

Table 6.8: C.M.O.S.L.Q. for proposed, HFBE, LP_order=12 & EVS algorithm for various gains for wave file bdl_arctic_a0001.wav
LP_order_wb = 12; Gain = 0.50, 0.75 & 1
Algorithm | Gain=0.5 | Gain=0.75 | Gain=1
LPAS (Proposed, LP_order=12) | 1.128 | 1.123 | 1.115
HFBE (Baseline) | 1.132 | 1.127 | 1.123
EVS | 1.138 | 1.138 | 1.138
Figure 6.4 P.E.S.Q. for proposed, HFBE, LP_order=12,16,24 & EVS algorithm for 13
different speech files
Figure 6.5 C.M.O.S.L.Q. for proposed, HFBE, LP_order=24 & EVS algorithm for various
gain for wave file bdl_arctic_a0001.wav
Figure 6.6 C.M.O.S.L.Q. for proposed, HFBE, LP_order=16 & EVS algorithm for various
gain for wave file bdl_arctic_a0001.wav
Figure 6.7 C.M.O.S.L.Q. for proposed, HFBE, LP_order=12 & EVS algorithm for various
gain for wave file bdl_arctic_a0001.wav
6.2 Spectrogram
A sound spectrogram is a visual representation of sound. A spectrogram provides
complete, precise information because it is based on actual measurements of the changing
frequency content of a sound over time. The horizontal dimension of the spectrogram
corresponds to time, and the vertical dimension corresponds to frequency or pitch, with
higher sounds shown higher on the display. The relative intensity of the sound at any
particular time and frequency is indicated by the color of the spectrogram at that point.
Spectrograms are usually created in either of two ways: approximated as a filter bank
resulting from a series of band-pass filters, or calculated from the time signal using the
short-time Fourier transform (STFT). When applied to an audio signal, spectrograms are
sometimes called sonographs, voiceprints, or voicegrams.
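The STFT route can be sketched directly (numpy; the FFT size, hop and sampling rate below are illustrative defaults, not the settings used for figs. 6.8-6.20):

```python
import numpy as np

def spectrogram(x, n_fft=256, hop=128, fs=8000):
    """Magnitude STFT: rows are frequency bins, columns are time frames."""
    win = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    frames = np.stack([win * x[i:i + n_fft] for i in starts])
    S = np.abs(np.fft.rfft(frames, axis=1)).T
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    times = (np.arange(S.shape[1]) * hop + n_fft / 2) / fs
    return S, freqs, times

# A 1 kHz tone at fs = 8 kHz lights up the bin nearest 1000 Hz
fs = 8000
t = np.arange(fs) / fs
S, freqs, times = spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_hz = freqs[S[:, 0].argmax()]   # 1000.0
```

Plotting S (e.g. in dB) against times and freqs yields exactly the time/frequency/intensity display described above.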
Figure 6.8 Spectrogram for speech file MA01_01.wav
Figure 6.9 Spectrogram for speech file MA01_02.wav
Figure 6.10 Spectrogram for speech file MA01_03.wav
Figure 6.11 Spectrogram for speech file MA01_04.wav
Figure 6.12 Spectrogram for speech file MA01_05.wav
Figure 6.13 Spectrogram for speech file FA01_01.wav
Figure 6.14 Spectrogram for speech file FA01_03.wav
Figure 6.15 Spectrogram for speech file FA01_04.wav
Figure 6.16 Spectrogram for speech file FA01_05.wav
Figure 6.17 Spectrogram for speech file arctic-a0002.wav
Figure 6.18 Spectrogram for speech file arctic-a0003.wav
Figure 6.19 Spectrogram for speech file arctic-a0004.wav
Figure 6.20 Spectrogram for speech file arctic-a0005.wav
CHAPTER-7
Conclusion, Major Contribution, and Future Scope
Conclusion
In this research, an approach to B.W.E. based on the source-filter model (S.F.M.) with linear prediction (L.P.) for N.B.-to-W.B. and W.B.-to-S.W.B. extension is discussed, with measurements based on the Mean Opinion Score (subjective) and the Perceptual Evaluation of Speech Quality (P.E.S.Q., objective).
When only narrowband speech transmission is available, B.W.E. can be applied to the signal at the receiving end of the communication link. The B.W.E. method aims to improve speech quality and intelligibility by regenerating new frequency components above 3400 Hz. Since the extension is made without any transmitted side information, the algorithm is suitable for any telephone application that can reproduce wideband audio signals.
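To make the idea of receiver-side, blind high-band regeneration concrete, the sketch below uses classical spectral folding, one of the simplest such techniques (this is an illustrative stand-in, not the exact S.F.M.-based method of this thesis; the function name and gain are assumptions):

```python
import numpy as np
from scipy.signal import resample_poly, butter, lfilter

def blind_bwe_spectral_folding(x_nb, fs_nb=8000, hb_gain=0.3):
    """Blind bandwidth extension by spectral folding.

    Upsampling by zero insertion mirrors the 0-4 kHz band into
    4-8 kHz; a high-pass filter keeps only that folded high band,
    which is added to the properly interpolated narrowband core.
    """
    # Zero-insertion upsampling folds the spectrum about fs_nb/2
    folded = np.zeros(2 * len(x_nb))
    folded[::2] = x_nb

    # Keep only the mirrored high band (> 3.4 kHz at fs = 16 kHz;
    # fs_nb equals the Nyquist frequency of the upsampled signal)
    b, a = butter(4, 3400 / fs_nb, btype="high")
    high_band = lfilter(b, a, folded)

    # Interpolated narrowband core (anti-imaging filter included)
    core = resample_poly(x_nb, 2, 1)

    return core + hb_gain * high_band
```

Feeding in an 8 kHz tone-plus-speech signal yields a 16 kHz output whose new high-band content is a mirrored, attenuated copy of the low band; no side information from the transmitter is needed.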
By performing subjective and objective measurements for various speech files and comparing the spectrogram of the estimated S.W.B. signal with those of the baseline algorithm and the next-generation super-wideband (E.V.S.) coder, one can say that the results obtained are comparable to the E.V.S. coder, making the proposed approach an attractive alternative to conventional speech coders.
The subjective and objective measurements adopted here indicate that, relative to the baseline algorithm, the proposed method is highly efficient with low latency.
By observing the spectrograms of the input and output speech files for the baseline, proposed, and next-generation coders and comparing all the results, one can say that the required spectral components are absent from the bandwidth-limited input signal (N.B./W.B.) due to missing high-band (H.B.) information, and that the results obtained by the proposed coder are comparable with both the baseline algorithm and the next-generation super-wideband E.V.S. coder.
Being codec neutral, the proposed algorithm can be used to improve the speech quality offered by wideband networks and devices. It can also be used to preserve quality when super-wideband devices are used alongside wideband services. When used in combination with a wideband codec that operates on some form of linear prediction coefficients, the proposed approach avoids an additional resynthesis step to obtain super-wideband signals, thereby reducing complexity.
Major Contribution
The key contributions reported in this thesis involve the development of an especially efficient approach to S.W.B. B.W.E. based on linear prediction analysis-synthesis, which avoids the statistical estimation of missing higher-frequency components.
In addition to computational efficiency, the solution delivers speech of superior quality compared to wideband speech signals processed with the adaptive multi-rate wideband codec, and of quality comparable to that attainable by the super-wideband E.V.S. coder.
MATLAB-based simulations of the sub-band filter processed via the source-filter model were performed. LPC-based analysis and synthesis were also carried out, and the results obtained were compared in terms of input and output spectrograms.
LPC stability is examined by checking that the magnitudes of the reflection coefficients remain below one.
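The reflection-coefficient stability check mentioned above can be sketched as follows: the Levinson-Durbin recursion produces the reflection (PARCOR) coefficients alongside the predictor, and the LPC synthesis filter is stable exactly when every reflection coefficient has magnitude below one. This is a minimal NumPy illustration (the thesis work used MATLAB; the function name is illustrative):

```python
import numpy as np

def reflection_coefficients(x, order=10):
    """Levinson-Durbin recursion on the autocorrelation of x,
    returning the reflection (PARCOR) coefficients k_1..k_order."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                      # prediction error power
    k = np.zeros(order)
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        ki = -acc / err
        k[i - 1] = ki
        # Update predictor coefficients of order i
        a[1:i] = a[1:i] + ki * a[i - 1:0:-1]
        a[i] = ki
        err *= (1.0 - ki * ki)
    return k

# Stability check on a test frame: a 500 Hz tone with a small noise floor
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 500 * np.arange(800) / 8000)
x = x + 0.01 * rng.standard_normal(800)
k = reflection_coefficients(x, order=16)
print(bool(np.all(np.abs(k) < 1.0)))  # → True
```

With the autocorrelation method used here, |k_i| < 1 is guaranteed in exact arithmetic, so this check mainly guards against numerical issues in a practical implementation.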
Future Scope
In discussing directions for research, it is not viable to be exhaustive, and in predicting which directions may prove successful we do not necessarily expect to be accurate. Nevertheless, it may be useful to set down some broad research directions, with a range that covers the obvious as well as the speculative.
The research work in this thesis may help shed light on the proposed S.F.M.-based algorithm, which can be an alternative to the next-generation super-wideband E.V.S. coder employed for improving speech quality and intelligibility.
S.W.B.-to-F.B. extension is the next targeted area of work for researchers interested in speech coding and processing. One can propose a source-filter-model-based approach for obtaining the F.B. signal from the S.W.B. signal. The proposed work can also be extended to music and to mixed (speech and music) signals.
List of Publications
"Quantization & Speech Coding for Wireless Communication", R. N. Rathod, Dr. M. S. Holia, Kalpa Publications in Engineering, ICRISET 2017: International Conference on Research and Innovations in Science, Engineering & Technology, Selected Papers in Engineering, Volume 1, pp. 19-25, 2017.
"Performance Analysis of Speech Coder for Wireless Communication under Varying Channel Condition", R. N. Rathod, Dr. M. S. Holia, International Journal of Innovations & Advancement in Computer Science (IJIACS), ISSN 2347-8616, UGC Approved Journal, Impact factor 2.65, Volume 7, Issue 2, February 2018.
"Bandwidth Extension and Quality Evaluation of Speech Signal Based on QMF and Source Filter Model Using Simulink and MATLAB", R. N. Rathod, Dr. M. S. Holia, N. S. Bhatt, International Journal of Research and Analytical Reviews, ISSN 2348-1269, UGC Approved Journal, Impact factor 5.75, Volume 6, Issue 1, Jan-March 2019.
"Bandwidth Extension of Speech Signal Artificially & Transmission & Coding for Wireless Communication", R. N. Rathod, M. S. Holia, Journal of Communication Engineering & Systems, SJIF 6.144, Volume 9, Issue 2, 2019.
"Analytical Studies Relating to Bandwidth Extension from Wideband to Super Wideband for Next Generation Wireless Communication", R. N. Rathod, Dr. M. S. Holia, presented at the International Conference on Research and Innovations in Science, Engineering & Technology (ICRISET 2020), held on 4-5 September 2020, and published in the SCOPUS-indexed UGC CARE journal RT&A, Special Issue No. 1 (60), Volume 16, pp. 206-22, January 2021.