Acoustics Lab 2007


Description

This was my graduation project at the Faculty of Engineering, Alexandria University, in 2007. It was about the acoustical simulation of rooms and room reverberation. We used MATLAB extensively in this project.

Transcript of Acoustics Lab 2007

Page 1: Acoustics Lab 2007

Alexandria University

Faculty of Engineering

Electrical Engineering Department

Communications & Electronics Section

Under the supervision of

Dr. Noha O. Korany

B.Sc. Electrical Communications

Academic Year 2006 - 2007

Page 2: Acoustics Lab 2007

Preface We believe that good seeds even if they were planted in a fertile soil won’t grow

without care and attention.

We also believe that plants won’t grow overnight and that in order to enjoy what we

have planted, we have to be patient until they are completely grown. And when they are

completely grown and become beautiful, we have to protect them by watering them

continuously and never neglect them, lest they become covered by dust, which destroys

their shine and beauty and leads them to wither and die.

We are not agricultural engineers, nor is our project concerned with agriculture; but

simply we wanted to take a moment to express our deep gratitude, thanks and

appreciation to our dear Dr. Noha O. Korany who supported us a lot and helped us a

lot in achieving what we have achieved in this project. She was always encouraging us

to find out our own capabilities, she never obliged us to do something we did not like to

do; she has put us on the right track and has driven us to a real start of a professional

life in the near future.

The project for us was not just an ordinary college task that has to be submitted by a

deadline; but we were waiting for every weekly meeting with great enthusiasm to

discuss general actualities and events with an extremely open-minded, highly

committed and respectful person like our dear Dr. Noha O. Korany.

Now we guess that you have almost understood the analogy of plants we used at the

beginning.

We were the seeds in their way to grow, the project was the fertile soil and those seeds

won’t have grown properly without the care and attention of Dr. Noha O. Korany.

Dr. Noha,

Words cannot tell how much we are grateful to you

Project’s Participants

Page 3: Acoustics Lab 2007

Abstract

Selected topics on acoustics-communication:

Topic 1: Audiology

Deals with the hearing process

Students: Menatollah Mostafa Aly

Hanaà Mohamed El Borollosy

Manal Khalaf

Topic 2: Acoustical Simulation of Room

Deals with the modeling of sound propagation in rooms

Students: Héba Mohamed Noweir

Mona Abdel Kader Mohamed

Mona Zarif Shenouda

Topic 3: Noise Control Deals with environmental noise

Students: Hanaà Khamis Mohamed

Nermine Mohamed Ahmed

Topic 4: Speech Technology

Deals with speech analysis/Synthesis and speaker identification

Students: Ahmed Mohamed Hamido

Beshoy Kamel Ibrahim Ghaly

Page 4: Acoustics Lab 2007

Contents

Page

Topic 1

AUDIOLOGY 1

CHAPTER 1

HEARING PROCESS 2

1.1 Structure of the Ear………………………………………………… 2

1.2 How the Ear works………………………………………………… 4

1.3 Computational Models of Ear Functions…………………………... 5

CHAPTER 2

FUNDAMENTAL PROPERTIES OF HEARING 13

2.1 Thresholds…………………………………………………………. 13

2.2 Equal Loudness Level Contours…………………………………… 14

2.3 Critical Bandwidth………………………………………………… 14

2.4 Masking……………………………………………………………. 15

2.5 Beat and Combination Tones……………………………………… 15

CHAPTER 3

MODELS FOR HEARING AIDS 17

3.1 History of Development of Hearing Aids…………………………. 17

3.2 First Model (Digital Hearing Aids for Moderate Hearing Losses)... 18

3.2.1 Features of Real Time Binaural Hearing Aid……………………… 18

3.2.2 Speech Processing Algorithm……………………………………... 18

3.2.2.1 Interaural Time Delay and Timer.………………………………… 18

3.2.2.2 Frequency Shaping.………………………………………………... 19

3.2.2.3 Adaptive Noise Cancellation using LMS.…………………………. 20

3.2.2.4 Amplitude Compression…………………………………………… 25

3.3 Second Model (A Method of Treatment for Sensorineural

Hearing Impairment)………………………………………………. 26

3.3.1 The Conceptual Prosthetic System Architecture..…………………. 26

3.3.2 Human Temporal Bone Vibration…………………………………. 28

3.3.3 Design Guideline for Optimum Accelerometer…………………… 31

3.3.4 Conclusion…………………………………………………………. 31

Topic 2

ACOUSTICAL SIMULATION OF ROOM 32

CHAPTER 1

GEOMETRICAL ACOUSTICS 33

1.1 Introduction………………………………………………………... 33

1.2 Sound Behavior……………………………………………………. 35

1.3 Geometrical Room Acoustics……………………………………… 36

1.3.1 The Reflection of Sound Rays……………………………………... 36

1.3.2 Sound Reflections in Rooms………………………………………. 38

Page 5: Acoustics Lab 2007

1.3.3 Room Reverberation……………………………………………….. 39

1.4 Room Acoustical Parameters & Objective Measures……………... 40

1.4.1 Reverberation Time………………………………………………... 40

1.4.2 Early Decay Time………………………………………………….. 40

1.4.3 Clarity and Definition……………………………………………… 41

1.4.4 Lateral Fraction and Bass Ratio…………………………………… 41

1.4.5 Speech Transmission Index………………………………………... 41

CHAPTER 2

ARTIFICIAL REVERBERATION 42

2.1 Introduction………………………………………………………... 42

2.2 Shortcomings of Electronic Reverberators………………………… 42

2.3 Realizing Natural Sounding Artificial Reverberation……………... 43

2.3.1 Comb Filter………………………………………………………… 43

2.3.2 All-pass Filter……………………………………………………… 45

2.3.3 Combined Comb and All-pass Filters……………………………... 47

2.4 Ambiophonic Reverberation………………………………………. 49

CHAPTER 3

SPATIALIZATION 50

3.1 Introduction………………………………………………………... 50

3.2 Two-Dimensional Amplitude Panning…………………………….. 51

3.2.1 Trigonometric Formulation………………………………………... 52

3.2.2 Vector Base Formulation…………………………………………... 54

3.2.3 Two-Dimensional VBAP for More Than Two Loudspeakers…….. 55

3.2.4 Implementing 2D VBAP for More Than Two Loudspeakers……... 56

Topic 3

NOISE CONTROL 57

CHAPTER 1

SOUND ABSORPTION 58

1.1 Absorption Coefficient...…………………………………………... 58

1.2 Measurement of Absorption Coefficient of the different materials.. 58

1.2.1 Procedures…………………………………………………………. 59

1.2.2 Laboratory Measurements of Absorption Coefficient…………….. 59

1.3 Sound Absorption by Vibrating or Perforated Boundaries………... 62

CHAPTER 2

SOUND TRANSMISSION 65

2.1 Transmission Coefficient………………………………………….. 65

2.2 Transmission loss………………………………………………….. 65

2.3 Sound Transmission Class STC…………………………………… 65

2.3.1 Determination of STC……………………………………………... 66

2.3.2 Laboratory Measurements of STC………………………………… 67

2.4 Controlling Sound Transmission through Concrete Block Walls…. 69

2.4.1 Single-Leaf Concrete Block Walls………………………………… 69

2.4.2 Double-Leaf Concrete Block Walls……………………………….. 71

Page 6: Acoustics Lab 2007

2.5 Noise Reduction…………………………………………………… 71

2.5.1 Noise Reduction Determination Method…………………………... 71

2.5.2 The Noise Reduction Determinations of Some Absorbed Materials 72

2.6 The Performance of Some Absorbed Materials…………………… 73

Topic 4

SPEECH TECHNOLOGY 75

CHAPTER 1

SPEECH PRODUCTION 76

1.1 Introduction…………………………………………………........... 76

1.2 The human vocal apparatus………………………………………... 76

1.2.1 Breathing………………………………………………………....... 77

1.2.2 The larynx………………………………………………………….. 77

1.2.3 The vocal tract……………………………………………………... 79

1.3 Speech sounds……………………………………………………... 79

1.3.1 Phonemic representation…………………………………………... 79

1.3.2 Voiced, unvoiced and plosive sounds……………………………... 80

1.4 Acoustics of speech production……………………………………. 80

1.4.1 Formant frequencies……………………………………………….. 80

1.5 Perception………………………………………………………….. 82

1.5.1 Pitch and loudness…………………………………………………. 82

1.5.2 Loudness perception……………………………………………….. 82

CHAPTER 2

PROPERTIES OF SPEECH SIGNALS IN TIME DOMAIN 83

2.1 Introduction………………………………………………………... 83

2.2 Time-Dependent Processing of Speech……………………………. 83

2.3 Short-Time Average Zero-Crossing Rate………………………….. 84

2.4 Pitch period estimation…………………………………………….. 86

2.4.1 The Autocorrelation Method………………………………………. 86

2.4.2 Average magnitude difference function…………………………… 89

CHAPTER 3

SPEECH REPRESENTATION IN FREQUENCY DOMAIN 91

3.1 Introduction………………………………………………………... 91

3.2 Formant analysis of speech………………………………………... 91

3.3 Formant frequency extraction……………………………………... 91

3.3.1 Spectrum scanning and peak-picking method……………………... 92

3.3.2 Spectrum scanning………………………………………………… 92

3.3.3 Peak-Picking Method……………………………………………… 92

CHAPTER 4

SPEECH CODING 93

4.1 Introduction………………………………………………………... 93

4.2 Overview of speech coding………………………………………... 93

4.3 Classification of speech coding……………………………………. 94

4.4 Linear Predictive Coding (LPC)…………………………………… 97

Page 7: Acoustics Lab 2007

4.4.1 Basic Principles……………………………………………………. 97

4.4.2 The LPC filter……………………………………………………… 99

4.4.3 Problems in LPC model…………………………………………… 100

4.5 Basic Principles of Linear Predictive Analysis……………………. 101

4.5.1 The autocorrelation method………………………………………... 104

4.5.2 The covariance method……………………………………………. 106

CHAPTER 5

APPLICATIONS 108

5.1 Speech synthesis…………………………………………………… 108

5.1.1 Formant – frequency Extraction…………………………………… 108

5.1.2 LPC..……………………………..………………………………… 114

5.2 Speaker identification using LPC………………………………….. 119

5.3 Introduction to VOIP………………………………………………. 122

5.3.1 VoIP Standards…………………………………………………….. 122

5.3.2 System architecture……...………………………………………… 123

5.3.3 Coding technique in VOIP systems……………………………….. 124

5.3.4 Introduction to G.727……………………………………………… 125

5.3.5 Introduction to G.729 and G.723.1………………………………… 127

Page 8: Acoustics Lab 2007


Page 9: Acoustics Lab 2007


CHAPTER 1

HEARING PROCESS

Introduction: - The human ear can respond to frequencies from 20Hz up to 20 KHz.

- It is more than just a sensitive, broadband receiver.

- It acts as a frequency analyzer of impressive selectivity.

- It is one of the most delicate mechanical structures in the human body.

1.1 The structure of the ear:

Figure(1.1) The main structure of the Ear

It consists of 3 main parts: outer, middle and inner ear.

1-the outer ear: - The visible portion of the ear.

- It collects the sound and sends it to the ear drum via the ear canal.

It contains:

1. Pinna: - It serves as a horn, collecting sound into the auditory canal.

2. Auditory canal: - It is a straight tube about 0.8 cm in diameter and 2.8 cm long.

- It is closed by the ear drum.

3. Ear drum: - A small membrane that separates the outer ear from the middle ear

(considered the entrance to the middle ear).

- Flattened cone in shape.

- Quite flexible in the center and attached around its edge to the end of the canal.

Page 10: Acoustics Lab 2007


2- The middle ear: - It houses a chain of three bones (hammer, incus and stapes).

- It contains 3 ossicles (bones); the ear drum is connected to the 1st (hammer), which

communicates with the last (stapes) through the middle one (incus).

- These bones are set into motion by the movement of the ear drum.

- It is also connected to the throat via the "Eustachian tube".

- There is a collection of muscles and ligaments that control the lever ratio of the

system. For intense sounds, the muscles controlling the motion of the bones change their

tension to reduce the amplitude of motion of the stapes, thereby protecting the inner ear

from damage. (N.B.: it offers no protection from sudden impulsive sounds.)

3- The inner ear: - contains the tiny nerve endings for balance and hearing.

- Also contains a very unique fluid that becomes set in motion by the movement of the

oval window.

-the tiny nerve endings are then stimulated (activated) and each sends a message or an

impulse to the brain.

The inner ear has 3 main parts: The vestibule, semi-circular canals and the cochlea.

• The vestibule: - connects with the middle ear through 2 openings, the oval window and the round window

(both prevent the fluid from escaping the inner ear).

• The semi-circular canals: - provide a sense of balance.

• The tube of the cochlea: - is divided by partitions into the upper gallery, the duct and the lower gallery.

- The ends of the galleries are connected to the oval& the round window; the other ends

are connected to the apex of the cochlea.

The Duct: - Is filled with endolymph (potassium rich, related to intracellular fluid throughout the

body) & perilymph (sodium rich, is similar to the spinal fluid).

- It also contains membranes; one of them is called the "Basilar membrane". (At the top

of this membrane there is the organ of Corti, which contains 4 rows of hair cells.)

Page 11: Acoustics Lab 2007


1.2 How the Ear Works:

Figure (1.2) The Hearing process

1- When the ear is exposed to a pure tone, sound waves are collected by the outer ear

and funneled through the ear canal to the ear drum.

2- Sound waves cause the ear drum to vibrate.

3- The motion of the ear drum is transmitted and amplified by the 3 bones of the middle

ear to the oval window of the inner ear, creating a fluid disturbance that travels in the

upper gallery toward the apex, into the lower gallery, and then propagates along the lower

gallery to the round window, which acts as a pressure-release termination.

NOTES

The basilar membrane is driven into highly damped motion with a peak amplitude that

increases slowly with distance away from the stapes, reaches a maximum, and then

diminishes rapidly toward the apex.

Page 12: Acoustics Lab 2007


1.3 Computational Models for Ear Function

A computational model has been derived to describe basilar membrane displacement in

response to an arbitrary sound pressure at the ear drum.

In this simplified schematic of the peripheral ear:

Figure (1.3) Shematic Diagram of Peripheral Ear

Y_l(s)/P(s) = [X(s)/P(s)] · [Y_l(s)/X(s)] = G(s) · F_l(s)

Where:

P (t): is the sound pressure at the ear drum.

X (t): is the equivalent linear displacement of the stapes.

yl (t): is the linear displacement of the basilar membrane at a distance "l"

from the stapes.

The Desired Objective: Is an analytical approximation to the relations among these quantities.

It is convenient to obtain it in two steps:

The first is to approximate the middle ear transmission, that is, the relation between x(t) and p(t).

The second is to approximate the transmission from the stapes to the specified point l on the membrane.

Approximating functions are indicated as the frequency-domain Laplace transforms G(s) and F_l(s).

Conditions for Approximation:

The functions G(s) and Fl(s) must be fitted to available physiological data,

therefore:

If the ear is assumed to be mechanically passive and linear over

the frequency and amplitude ranges of interest, rational functions

of frequency can be used to approx. the physiological data.


Page 13: Acoustics Lab 2007


Because the model is an input-output analog, the response of one

point does not require explicit computation of the activity at the

other points. One therefore has the freedom to calculate the

displacement yl(t) for as many, or as few, values of l as are

desired.

1. Basilar Membrane Model: The physiological data upon which the form of F_l(s) is based are those of

"BEKESY".

If the curves are normalized with respect to the frequency of the maximum

response, one can find that:

They are approximately constant percentage bandwidth

responses as shown in figure 1.4(A).

Also, the phase data suggest a component which is

approximately a simple delay, and whose value is inversely

proportional to the frequency of peak response. That is, low

frequency points on the membrane (nearer to the apex) exhibit

more delay than high frequency (basal) points as shown in figure

1.4(B) where at low frequencies the phase is high i.e. big shift

while at high frequencies phase is low i.e. small shift .

One function which provides a reasonable fit to BEKESY results is:

F_l(s) = C1 · (2000πβ_L / (2000π + β_L))^0.8 · β_L³ (s + ε_l) / [ (s + α_L)² + β_L² ]² · e^(−3πs / 4β_L)

Where:

S = σ + jw is the complex frequency.

β_L = 2α_L is the radian frequency to which the point at distance l from the stapes responds maximally.

C1 is a real constant that gives the proper absolute value of displacement.

e^(−3πs / 4β_L) is a delay factor of 3π/4β_L seconds which brings the phase delay of the model into line with the phase measured on the human ear.

(2000πβ_L / (2000π + β_L))^0.8 is an amplitude factor which matches the variations in peak response with resonant frequency (β_L).

ε_l/β_L = 0.1 to 0.0 depending upon the desired fit to the response at low

frequencies.
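Since MATLAB was used extensively in this project, a minimal MATLAB sketch is given below that evaluates the magnitude of F_l(jw) at one place on the membrane, using the form of F_l(s) written above; the numerical values of β_L, ε_l and C1 are illustrative choices only, not values taken from the text.

% Sketch: magnitude of the basilar-membrane model F_l(jw) at one membrane place.
% The form of F_l(s) is the one written above; bL, epsL and C1 are illustrative.
bL   = 2*pi*2500;                         % beta_L, resonant radian frequency of the place
aL   = bL/2;                              % alpha_L = beta_L/2
epsL = 0.1*bL;                            % eps_l / beta_L chosen as 0.1
C1   = 1;                                 % overall displacement constant (arbitrary here)
w    = 2*pi*logspace(1, 4.3, 500);        % analysis frequencies, rad/s
s    = 1j*w;
ampF = (2000*pi*bL/(2000*pi + bL))^0.8;   % amplitude factor
FL   = C1*ampF * bL^3*(s + epsL) ./ ((s + aL).^2 + bL^2).^2 .* exp(-3*pi*s/(4*bL));
semilogx(w/bL, abs(FL)), grid on          % compare with Figure (1.4), panel (A)
xlabel('w / \beta_L'), ylabel('|F_l(jw)|')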

Page 14: Acoustics Lab 2007


Figure (1.4) Amplitude (A) and phase (B) response of the basilar membrane model F_L(s), plotted against normalized frequency w/β for β = 2π·50, 2π·2500, 2π·5000 and 2π·10000

The membrane response at any point is therefore approximated in terms of the

poles and zeros of the rational function F_L(s).

The resonant properties of the membrane are approximately constant-percentage-bandwidth.

The real and imaginary parts of the critical frequencies can therefore be related

by a constant factor, namely β_L = 2α_L; the imaginary part of the pole frequency (β_L)

completely describes the model and the characteristics of the membrane at a place l-

distance from the stapes.

The real-frequency response of the model is obtained by letting s = jw.

The inverse Laplace transform of F_l(s) (shown in figure 1.5) is the displacement

response of the membrane to an impulse of displacement by the stapes; the result of

the inverse transformation is found to be:

f_l(t) ≈ C1 πβ_L (2000πβ_L / (2000π + β_L))^0.8 · { [0.033 + 0.36 β_L(t − T)] e^(−β_L(t − T)) + e^(−β_L(t − T)/2) [ (0.575 − 0.320 β_L(t − T)) sin β_L(t − T) − 0.575 cos β_L(t − T) ] }

For t ≥ T and ε_l/β_L = 0.1, where the delay T = 3π/4β_L.

Page 15: Acoustics Lab 2007



From this relation we find that as the frequency increases, the delay of the

maximum response decreases, as shown in figure (1.5).

Figure (1.5) The basilar membrane response to an impulse of stapes displacement, for B = 2π·50, 2π·150, 2π·1500, 2π·5000, 2π·10000 and 2π·20000

We also note from the equation of F_l(s) that β_L depends on l, the distance from

the stapes at which the membrane responds maximally, according to the relation

(35 − l) = 7.5 log(β_L / (2π·20))

So as the frequency increases, l decreases. This is logical, because the higher the

frequency, the more quickly and the closer to the stapes the basilar membrane responds maximally (as shown

in figure 1.6).

Figure (1.6) The relation between β_L (in Hz) and l (in mm), the distance from the stapes at which the membrane responds

maximally
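A small MATLAB sketch of this place-frequency relation follows; a base-10 logarithm is assumed, and the example frequencies are illustrative.

% Sketch of the place-frequency relation quoted above, (35 - l) = 7.5 log(beta_L/(2*pi*20)),
% assuming a base-10 logarithm; l is the distance (mm) from the stapes of maximum response.
f  = [50 150 1500 5000 10000];            % example frequencies in Hz
bL = 2*pi*f;                              % beta_L in rad/s
l  = 35 - 7.5*log10(bL/(2*pi*20));        % distance from the stapes, mm
disp([f(:) l(:)])                         % higher frequency -> smaller l (closer to the stapes)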

2. Middle Ear Transmission:

To account for middle ear transmission, an analytical specification is necessary

of the stapes displacement produced by a given sound pressure at the ear drum.

Quantitative physioacoustical data on the operation of the human middle ear are

sparse.

All agree that the middle ear transmission is a low-pass function, as shown in

figure 1.7.

Page 16: Acoustics Lab 2007



An Approximating Function of 3rd Degree for Middle Ear Transmission:

G(s) = Co / [ (s + a) ((s + a)² + b²) ]

Where Co is a positive real constant.

One might consider Co = a(a² + b²) so that the low-frequency transmission of

G(s) is unity, when the pole frequencies of G(s) are related according to b =

2a = 2π(1500) rad/sec.

Figure (1.7) Functional approximation of the middle ear transmission, G(s) versus frequency in cycles per second

The inverse transform of G(s) is the displacement response of the stapes to

an impulse of pressure at the ear drum, given by:

g(t) = (Co/b) e^(−at) (1 − cos bt) = (Co/b) e^(−bt/2) (1 − cos bt)

For this middle ear function the response is seen to be heavily damped; both the

displacement of the stapes and its velocity are damped oscillations, as shown in the next two figures.
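The heavily damped behaviour is easy to see numerically; a minimal MATLAB sketch of g(t) follows, using b = 2a = 2π(1500) rad/s and Co = a(a² + b²) as above.

% Sketch: middle-ear impulse response g(t) = (Co/b)*exp(-b*t/2)*(1 - cos(b*t)),
% with b = 2a = 2*pi*1500 rad/s and Co = a*(a^2 + b^2) for unity low-frequency gain.
a  = pi*1500;  b = 2*a;
Co = a*(a^2 + b^2);
t  = linspace(0, 4e-3, 1000);             % a few milliseconds is enough: heavily damped
g  = (Co/b)*exp(-b*t/2).*(1 - cos(b*t));
plot(t*1e3, g), xlabel('time (ms)'), ylabel('g(t)')   % compare with Figure (1.8)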

Page 17: Acoustics Lab 2007


FIGURE (1.8) Displacement of the stapes, g(t), in response to an impulse of pressure on the ear drum

The time derivative of the stapes displacement is:

g'(t) = (Co/2) e^(−bt/2) (2 sin bt + cos bt − 1)

Figure (1.9) Velocity of the stapes in response to an impulse of pressure on the ear drum

3. Combined Response of Middle Ear and Basilar Membrane: The combined response of the models for the middle ear and basilar membrane is:

H_l(s) = G(s) F_l(s)

h_l(t) = g(t) * f_l(t)

The combined response of G(s) and FL (s) in the frequency domain is simply the sum of

the individual curves for amplitude (in dB) and phase (in radians).

When the inverse transform is calculated, the result has the form:

Page 18: Acoustics Lab 2007


h_l(τ) ≈ A e^(−bτ/2) + e^(−bτ/2) (B cos bτ + C sin bτ) + e^(−β_Lτ/2) [ (D + E β_Lτ) cos β_Lτ + (F + G β_Lτ) sin β_Lτ ] + … ;  for τ ≥ 0

Where: A, B, C, D, E, F, G, H are all real numbers which are functions of (β_L)

and (b); τ = (t − T); T = 3π/4β_L, η = β_L/b, β_L = 2α_L, b = 2a, ε_L = 0

The form of the impulse response is thus seen to depend upon the parameter (η=

βL /b)

Values of η <1.0 refer to apical (low frequency) membrane points whose

frequency of maximal response is less than the critical frequency of the middle ear.

For these points, the middle ear transmission is essentially constant with

frequency. (As shown in figure 1.10 (a,b,c))

On the other hand, values of η > 1.0 refer to basal (high frequency) points which

respond maximally at frequencies greater than the critical frequency of the middle ear.

For these points, the middle ear transmission is highly dependent upon

frequency and would be expected to influence strongly the membrane displacement (as

shown in figure 1.10 (d, e, f)).

Figure (1.10) The response of the ear, h(t), at different frequencies: (a) B = 2π·50, (b) B = 2π·150, (c) B = 2π·1200, (d) B = 2π·3000, (e) B = 2π·5000, (f) B = 2π·10000

Page 19: Acoustics Lab 2007


Note:

As already indicated by the impulse responses:

The response of apical points on the membrane is given essentially by

F_L(s), while for basal points the response is considerably influenced by

the middle-ear transmission G(s).

Concerning the latter point, it may be noted that at frequencies appreciably

less than its peak response frequency, the membrane function F_L(w)

behaves as a differentiator. Because the middle ear transmission begins to

diminish in amplitude at frequencies above about 1500 cps, the

membrane displacement in the basal region is roughly the time

derivative of the stapes displacement (as shown in figure (1.10)).

The waveform of the impulse response along the basal part of the

membrane is therefore approximately constant in shape (as shown in

figure (1.4)).

Along the apical part, however, the impulse response oscillates more

slowly (in time) as the apex is approached (as shown in figure 1.9).

Page 20: Acoustics Lab 2007


CHAPTER 2

FUNDAMENTAL PROPERTIES OF HEARING

The five main properties of hearing we are going to discuss are:

1) Thresholds

2) Equal loudness level

3) Critical bandwidth

4) Masking

5) Beats and combinational tones

2.1 Thresholds: The threshold of audibility is the minimum perceptible level L1 of a tone that can be

detected at each frequency over the entire range of the ear. The tone should have a duration

of 1 s. A representative threshold of audibility for a young undamaged ear is shown as

the lowest curve in figure (2.1).

Figure (2.1) Threshold of audibility & free field, equal loudness level contour

The frequency of maximum sensitivity is near 4 kHz. For high frequencies the threshold

also rises rapidly to a cutoff. It is in this higher frequency region that the greatest

variability is observed among different listeners, particularly if they are over 30 years of age.

The cut-off frequency for a young person may be as high as 20 kHz or even 25 kHz, but

people over 40 or 50 years of age with typical hearing can seldom hear frequencies near or

above 15 kHz. In the range below 1 kHz, the threshold is usually independent of the

age of the listener.

As the intensity of the incident acoustic wave is increased, the sound grows louder and

eventually produces a tickling sensation; this occurs at an intensity level of about

120 dB and is called the threshold of feeling, and the tickling sensation becomes one of pain at

about 140 dB.

Since the ear responds relatively slowly to loud sound by reducing the lever action of

the middle ear, the threshold of audibility shifts upwards under exposure, the amount of

shift depends on the intensity and the duration of the sound. After the sound is removed

Page 21: Acoustics Lab 2007


the threshold of hearing will begin to reduce and if the ear fully recovers its original

threshold it has experienced temporary threshold shift (TTS). The amount of time

required for a complete recovery increases with increasing intensity and duration of

sound. If the exposure is long enough or the intensity is high enough the recovery of the

ear is not complete. The threshold never returns to its original value and permanent

threshold shift (PTS) has occurred.

It is important to realize that the damage leading to PTS occurs in the inner ear, the hair

cells are damaged.

Also of importance are differential thresholds, one of which is the differential threshold

for intensity discrimination. If two tones of almost identical frequency are sounded together,

one tone much weaker than the other, the resultant signal is indistinguishable from a

single frequency whose amplitude fluctuates slightly and sinusoidally.

The amount of fluctuation that the ear can just barely detect, when converted into the

difference in intensity between the stronger and the weaker portions, determines the

differential threshold. As might be expected, values depend on frequency, number of

beats per second, and intensity level.

Generally the greatest sensitivity to intensity changes is found for about 3 beats per

second; sensitivity decreases at the frequency extremes, particularly at low frequencies, but

the effect diminishes with increasing sound levels.

For sound more than 40dB above threshold, the ear is sensitive to intensity level

fluctuations of less than 2dB at the freq. Extremes and less than about 1dB between 100

and 1000HZ.

Other differential thresholds involve the ability to discriminate between two sequential

signals of nearly the same frequency.

The frequency difference required to make the discrimination is termed the difference limen.

2.2 Equal loudness level contour: Experiments in which listeners gauge when two tones of different frequency, sounded

alternately, are equally loud provide contours as a function of frequency. As seen in

figure 2.1, high and low frequency tones require greater values of L1 to sound as loud as

those in the mid-frequency range. The curves resulting from such comparisons are

labeled by the L1 they have at 1 kHz. Each curve is an equal loudness level contour and

expresses the loudness level LN in phon, which is assigned to all tones whose L1 falls on

the contour. Thus, LN = L1 for a 1 kHz tone, regardless of its level; however, a 4 kHz tone

with L1 = 90 dB has a loudness level LN = 70 phons, as does a 4 kHz tone with L1 = 61

dB. The curves become straighter at higher loudness levels, and LN and L1 become more

similar at all frequencies.

2.3 Critical bandwidth: If a subject listens to a sample of noise with a tone present, the tone cannot be detected

until its L1 exceeds a value that depends on the amount of noise present.

It is found that the masking of a tone by a broadband noise is independent of the noise

bandwidth until the bandwidth becomes smaller than some critical value that depends on

the frequency of the tone.

In this task the ear appears to act like a collection of parallel filters, each with its own

bandwidth, and the detection of a tone requires that its level exceed the noise level in its

particular band by some detection threshold.

Page 22: Acoustics Lab 2007


In early experiments it was assumed that the signal must equal the noise for detection to

occur (DT = 0). On this basis, and assuming that the sensitivity of the ear is constant

across each bandwidth Wcr, it follows that Wcr = S/N1, where S is the signal power and N1

is the noise power per Hz. The bandwidths measured this way are now termed the critical

ratios.

Later experiments based on the perceived loudness of noise have yielded critical

bandwidths Wcb larger than the critical ratios. In some of these experiments, the

loudness of a band of noise is observed as a function of bandwidth while the overall

noise level is held constant. For noise bandwidths less than the critical bandwidth, the loudness

will be constant, but when the bandwidth exceeds the critical bandwidth, the loudness

will increase.

2.4 Masking: This is the increase of the threshold of audibility in the presence of noise. First consider the

masking of one pure tone by another. The subject is exposed to a single tone of fixed

frequency and L1, and then asked to detect another tone of different frequency and

level. Analysis yields the threshold shift, the increase in L1 of the masked tone above its

value for the threshold of audibility before it can be detected. Figure 2.2

gives representative results for masking frequencies of 400 and 2000 Hz; the frequency

range over which there is appreciable masking increases with the L1 of the masker, the

increase being greater for frequencies above that of the masker. This is to be expected

because the region of the basilar membrane excited into appreciable motion at

moderate values of L1 extends from the maximum further toward the stapes than the

apex.

Figure (2.2) Masking of one pure tone by another (The Abscissa is the frequency of the masked

tone)

2.5 Beats and combination tones: Let two tones of similar frequency F1 and F2 and of equal L1 be presented to one ear (or

both ears).

When the two frequencies are very close together, the ear perceives a tone of single

frequency Fc= (F1 +F2)/2 fluctuating in intensity at the beat frequency FB=abs (F1 -F2).

As the frequency interval between the two tones increases, the sensation of beating

changes to throbbing and then to roughness; the roughness gradually diminishes and the sound

Page 23: Acoustics Lab 2007


becomes smoother, finally resolving into two separate tones. For frequencies falling in

the midrange of hearing, the transition from beats to throbbing occurs at about 5 to 10

beats per second, and this turns into roughness at about 15 to 30 beats per second. These

transitions occur for higher beat frequencies as the frequencies of the primary tones are

increased.

Transition to separate tones occurs when the frequency interval has increased to about

the critical bandwidth.

None of this occurs if each tone is presented to a different ear. When each ear is

exposed to a separate tone, the combined sound does not exhibit intensity fluctuations;

this kind of beating is absent. This suggests that the beats arise because the two tones

generate overlapping regions of excitation on the basilar membrane, and it is not until

these regions become separated by the distance corresponding to the critical bandwidth

that they can be separately sensed by the ear. When the tones are presented one in each

ear, each basilar membrane is separately excited and these effects do not occur. If the

two tones are separated far enough and are of sufficient loudness, combination tones

can be detected. These combination tones are not present in the original signal but are

manufactured by the ear. There is a collection of possible combination tones whose

frequencies are various sums and differences of the original frequencies F1 & F2:

Fnm = abs(m F2 ± n F1)    n, m = 1, 2, 3...

Only a few of these frequencies will be sensed. One of the easiest to detect is the

difference frequency abs (F1 -F2).
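As a small illustration (with arbitrarily chosen tone frequencies), two equal-amplitude tones at F1 and F2 synthesised in MATLAB are heard as a tone at Fc = (F1 + F2)/2 whose envelope fluctuates at FB = |F1 − F2|:

% Two equal-level tones close in frequency: the sum beats at FB = |F1 - F2| per second.
fs = 8000;  t = 0:1/fs:1;                 % 1 second at an 8 kHz sampling rate
F1 = 440;  F2 = 444;                      % Fc = 442 Hz, FB = 4 beats per second
x  = cos(2*pi*F1*t) + cos(2*pi*F2*t);
% Equivalent form showing the slow envelope explicitly:
% x = 2*cos(pi*(F1 - F2)*t).*cos(pi*(F1 + F2)*t);
plot(t, x), xlabel('time (s)'), ylabel('amplitude')   % the envelope fluctuates 4 times per second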

Page 24: Acoustics Lab 2007


CHAPTER 3

MODELS OF HEARING AIDS

3.1 History of development of the hearing aids

Types of hearing impairment (weakness):

Basically, there are 3 types of hearing impairments: conductive, sensorineural and

mixed.

A-conductive: since the outer ear and the middle ear are involved in the conduction

of sound, a problem located in these areas is considered a conductive hearing

impairment. It may be corrected or partially corrected with surgery and/or medication.

Amplification or the use of hearing aids may also be an option.

B-sensorineural: a problem associated with the inner ear is considered a

sensorineural hearing impairment. Generally, this type of hearing impairment is the

result of damage or degeneration to the tiny nerve endings. It is usually not correctable

with surgery or medication. The use of amplification is typically the choice of

treatment.

C-mixed: If both of these types of hearing impairment occur at the same time, the

result is a mixed hearing impairment.

The technology of hearing aids:

Analog vs. Digital: All hearing aids, whether analog or digital, are designed to increase the loudness of

sounds reaching the ear drum so that the hearing-impaired person can better

understand speech. To accomplish this, three basic components are required:

1- A microphone to gather acoustic energy (sound waves in the air) and convert it to

electrical energy.

2- An amplifier to increase the strength of the electrical energy.

3- A receiver, which converts the electrical energy back into acoustic energy (sound

waves).

The main advantage of analog hearing aids is accurate sound reproduction

with low noise and distortion.

But they were large in size and needed high power. To overcome this, ASIC-based designs

were developed, but these had the disadvantage of increasing the cost

about five times. To overcome that, the programmable DSP approach was introduced; its advantages

were a reduction in cost and an improvement in sound quality, but its

disadvantage is that the digital hearing aid amplifies all the sound, even noise.

To overcome this, a real time binaural digital hearing aid platform

(TM3205000) is used.

Page 25: Acoustics Lab 2007


Now, we will discuss the details of the hearing aid for moderate hearing loss.

3.2 First Model Digital Hearing Aid for Conductive Impairment

3.2.1 Features of Real Time Binaural Hearing Aid:

1) Real time binaural hearing aid (using a fixed point signal processor).

[Note: there are 2 types of processors, fixed and floating point chips. Floating point chips are easier

to implement DSP algorithms on, since the quantization effects are negligible for most

applications; but fixed point chips are smaller in size and have lower power requirements, though

their algorithms have to be analyzed for the effect of quantization noise on their

performance.]

2) It samples 2 input microphone signals with a sampling rate of 32 kHz/channel (hearing aid

bandwidth = 10 kHz) and drives a stereo headphone output (needs a power source of 1.8 V).

3) It can be developed further by reducing the voltage of the power source to 1 V and reducing the MIPS

for the final implementation of the hearing aid.

3.2.2 Speech Processing Algorithm

The Hearing Loss is characterized by reduced sensitivity to sound that varies with the signal level and

frequency. So to overcome it, a system is needed to:

1) Separate high and low frequencies.

2) Improve speech comprehension.

3) Provide listening comfort.

Speech Processing Algorithm:

1) Frequency shaping.

2) Adaptive noise reduction.

3) Multi channel amplitude compression.

4) Interaural time delay.

5) Timer

The first 3 algorithms will be discussed later in details.

But first, a brief comment on the last two algorithms, to understand their

function.

3.2.2.1 The timer: Function: it is used to switch off the drive of the left and right ear pieces in a mutually

exclusive fashion, for a fraction of a second at a time.

Users: persons with severe hearing loss.

Advantage: overcoming the problem of fatigue due to high gain amplification without

affecting the hearing aid performance.

Page 26: Acoustics Lab 2007


The Interaural time delay:

Function: used to provide a delay to the signal going to one ear with respect to the signal going to the

second ear, on a frequency selective basis.

Uses: it is provided on the theory that if a person has differential hearing loss, then in

addition to compensating the gain of the signals going to the two ears, there must be provision

for compensating the interaural delay between the signals received by the 2 ears.

Now, we are going to discuss the main three points of our study.

Figure (3.1) Digital signal Processor

3.2.2.2 Frequency Shaping:

It is:

1) A binaural equalizer of 2 banks of band-pass filters (one bank per ear).

2) It provides frequency shaping from DC to 16 kHz.

3) The filters used: I) have linear phase; II) high isolation between bands; III) are FIR (finite

impulse response) filters.

[Typically there are 50 filters, and typically each has 50-200 taps to increase the

shaping precision.]

The therapist selects:

1) The number of band-pass filters.

2) The cut-off frequencies.

3) The critical frequencies.

4) The isolation between bands.

Once the filters are selected with all the previous characteristics, the therapist compensates the

spectral magnitude for the subject's hearing loss by adjusting the gain of each filter.
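A minimal MATLAB sketch of this idea is given below (not the actual implementation described here): a small bank of linear-phase FIR band-pass filters, each with its own gain, whose outputs are summed for one ear. The band edges, gains and filter length are illustrative, and fir1 from the Signal Processing Toolbox is assumed to be available.

% Sketch: frequency shaping with a bank of linear-phase FIR band-pass filters.
fs    = 32000;                                          % 32 kHz sampling rate per channel
edges = [250 500; 500 1000; 1000 2000; 2000 4000; 4000 8000];  % illustrative band edges, Hz
gains = [0 5 10 20 25];                                 % illustrative per-band gains, dB
x = randn(1, fs);                                       % stand-in input signal (1 s of noise)
y = zeros(size(x));
for k = 1:size(edges, 1)
    h = fir1(128, edges(k,:)/(fs/2), 'bandpass');       % 128th-order linear-phase FIR filter
    y = y + 10^(gains(k)/20) * filter(h, 1, x);         % apply the band gain and accumulate
end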

Page 27: Acoustics Lab 2007


3.2.2.3 The Noise Cancellation using LMS Algorithm:

This method is called a feedback system, and it depends on 2 operations that we

will describe now. According to the figure shown ("the LMS Algorithm figure"): initially

the input taps are zero, so there is nothing coming out of the transversal filter and

there is only the error e(1), which obviously equals d(1) [the desired signal

(the one with no noise)]. This error is used to adapt to the input signal u(n) by producing

tap weights corresponding to the input taps of u(n), then producing an estimate of the desired

signal, then generating the error by comparing the estimate with the actual value

of the desired signal (which we already know); the error is fed back again to the

adaptation for the successive samples. This is called the filtering process.

The adjusting of the tap weights by the generated error is called the adaptive process.

So the function of the transversal filter is the filtering process, and the adaptive control

of the tap weights is the adaptive process.

So this was the basic idea of the LMS. Now we will see how the LMS algorithm

really works.

First, we will use the equation of the least mean square to show how we update

the tap weights of the adaptive weight control mechanism, which is:

∇J(n) = −2p + 2R w(n)

The simplest choice of estimators for R and p is to use instantaneous estimates that are

based on the sample values of the tap inputs:

R̂(n) = u(n) u^H(n)

p̂(n) = u(n) d*(n)

And so the first equation becomes:

∇̂J(n) = −2 u(n) d*(n) + 2 u(n) u^H(n) ŵ(n)

This estimate ∇̂J(n) may be viewed as the gradient operator applied to the

instantaneous squared error.

So by substituting this estimate into the equation of steepest descent, we get a new

recursive relation for updating the tap-weight vector:

ŵ(n+1) = ŵ(n) + μ u(n) [d*(n) − u^H(n) ŵ(n)]

Page 28: Acoustics Lab 2007


Figure (3.2) The LMS Algorithm

So the summary of the figure is that we have:

1. Filter output (the estimate of the desired signal):

y(n) = w^H(n) u(n)        (2)

2. Estimation error:

e(n) = d(n) − y(n)        (3)

3. Tap weight adaptation:

w(n+1) = w(n) + μ u(n) [d*(n) − u^H(n) w(n)]        (4)

We notice that the error estimation is based on the current estimate of the tap weight

vector w(n), and that the second term in equation (4) represents the correction that is

applied to the current estimate of the tap weight.

That was just an introduction to how the LMS algorithm works in general and its

basic block diagram.
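A minimal MATLAB sketch of these three steps for real-valued signals follows (for real signals the conjugate and Hermitian transpose in equation (4) reduce to ordinary values and transposes); the function name and arguments are of course illustrative.

% Sketch: LMS adaptive filtering, equations (2)-(4) above, for real-valued signals.
% u: reference input, d: desired (primary) signal, M: number of taps, mu: step size.
function [y, e, w] = lms_sketch(u, d, M, mu)
    u = u(:);  d = d(:);
    N = length(u);
    w = zeros(M, 1);                     % tap-weight vector, initially zero
    y = zeros(N, 1);  e = zeros(N, 1);
    for n = M:N
        un   = u(n:-1:n-M+1);            % current tap-input vector u(n)
        y(n) = w.' * un;                 % filter output, equation (2)
        e(n) = d(n) - y(n);              % estimation error, equation (3)
        w    = w + mu * un * e(n);       % tap-weight adaptation, equation (4)
    end
end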

Adaptive Noise Cancellation applied to a sinusoidal interference: The traditional method of suppressing a sinusoidal interference corrupting an information-

bearing signal is to use a fixed notch filter tuned to the frequency of the interference.

The adaptive noise canceller using the LMS algorithm has 2 important characteristics:

1. The canceller behaves as an adaptive notch filter whose null point is determined by

the angular frequency ωo of the sinusoidal interference. Hence it is tunable, and the

tuning frequency moves with ωo.

2. The notch in the frequency response can be made very sharp precisely at the

frequency ωo of the sinusoidal interference (by choosing a small enough value for the

step size μ), and this will be proved later.

Now we will discuss the noise canceller in detail: it is a dual input adaptive noise

canceller.

As shown in figure (3.3), the primary input supplies information and a sinusoidal

interference; the reference input supplies the sinusoidal interference. For the adaptive filter,

we may use a transversal filter whose tap weights are adapted by means of the LMS

algorithm. The filter uses the reference input to provide an estimate of the sinusoidal

Page 29: Acoustics Lab 2007


interference contained in the primary signal. Thus, by subtracting the adaptive filter

output from the primary input, the effect of the sinusoidal interference will be diminished.

Figure (3.3) adaptive noise canceller

So as shown in figure (3.3), the adaptive noise canceller consists of:

1. The primary input:

d(n) = s(n) + A0 cos(ωo n + φ0)

Where s(n) is the information-bearing signal, A0 is the amplitude of the sinusoidal

interference, ωo is the normalized angular frequency, and φ0 is the phase.

2. Reference input:

u(n) = A cos(ωo n + φ)

In the LMS algorithm, the tap weight update is described by the following equations:

y(n) = Σ (i = 0 to M−1) ŵi(n) u(n − i)

e(n) = d(n) − y(n)

ŵ(n+1) = ŵ(n) + μ u(n) [d*(n) − u^H(n) ŵ(n)]

Where M is the total number of tap weights in the transversal filter.

With a sinusoidal excitation as the input of interest, we restructure the block

diagram of the adaptive noise canceller of figure (3.3). According to this new

representation, we may lump the sinusoidal input u(n), the transversal filter, and the

weight update equation of the LMS algorithm into a single open loop system defined

by the transfer function G(z), as shown in figure (3.4).

Page 30: Acoustics Lab 2007


Figure (3.4) Equivalent Model in Z-domain

Where:

G(z) = Y(z) / E(z)

and Y(z) and E(z) are the z-transforms of the filter output y(n) and the estimation error

e(n), respectively. Given E(z), our task is to find Y(z), and therefore G(z). To do so we

use the signal flow graph representation of figure (3.4). In this diagram, we have

singled out the ith tap weight for specific attention. The corresponding value of the tap

input is

u(n − i) = A cos[ωo(n − i) + φ] = (A/2) [ e^(j(ωo(n−i)+φ)) + e^(−j(ωo(n−i)+φ)) ]

Then taking the z-transform of u(n − i) e(n):

Z[u(n − i) e(n)] = (A/2) [ e^(j(φ − ωo i)) E(z e^(−jωo)) + e^(−j(φ − ωo i)) E(z e^(jωo)) ]

Then taking the z-transform of ŵi(n) by using equation no. 4:

z Ŵi(z) = Ŵi(z) + μ Z[u(n − i) e(n)]

Page 31: Acoustics Lab 2007


Ŵi(z) = (μA/2) · 1/(z − 1) · [ e^(j(φ − ωo i)) E(z e^(−jωo)) + e^(−j(φ − ωo i)) E(z e^(jωo)) ]

Then, since

y(n) = (A/2) Σ (i = 0 to M−1) ŵi(n) [ e^(j(ωo(n−i)+φ)) + e^(−j(ωo(n−i)+φ)) ]

we obtain Y(z) from these as

Y(z) = (A/2) Σ (i = 0 to M−1) [ e^(j(φ − ωo i)) Ŵi(z e^(−jωo)) + e^(−j(φ − ωo i)) Ŵi(z e^(jωo)) ]

We find that it consists of 2 components:

1) A time invariant component:

(μMA²/4) [ 1/(z e^(jωo) − 1) + 1/(z e^(−jωo) − 1) ]

2) A time varying component, proportional to

β1(ωo) = sin(Mωo) / (M sin ωo)

And since the value of M is large,

sin(Mωo) / (M sin ωo) ≈ 0

so the time varying component may be neglected. Thus Y(z) is

Y(z) = (μMA²/4) E(z) [ 1/(z e^(jωo) − 1) + 1/(z e^(−jωo) − 1) ]

Thus the open loop transfer function G(z) is:

G(z) = Y(z)/E(z) = (μMA²/4) [ 1/(z e^(jωo) − 1) + 1/(z e^(−jωo) − 1) ] = (μMA²/2) · (z cos ωo − 1) / (z² − 2z cos ωo + 1)

The adaptive filter has a null point determined by the angular frequency ωo of the

sinusoidal interference, as G(z) has a zero at z = 1/cos ωo; this proves the 1st

characteristic. Then from G(z) we get H(z) [the transfer function of the closed loop

feedback system]:

Page 32: Acoustics Lab 2007


H(z) = E(z)/D(z) = 1 / (1 + G(z))

Where E(z) is the z-transform of the system output e(n), and D(z) is the z-transform of

the system input d(n).

H(z) = (z² − 2z cos ωo + 1) / (z² − 2(1 − μMA²/4) z cos ωo + (1 − μMA²/2))

So this function is the transfer function of a second order digital notch filter with a notch

at the normalized angular frequency ωo. And finally we find that the poles of H(z) lie

inside the unit circle, which means that the adaptive filter is stable [as needed for real-

time practical use].

Also we find that the zeros of H(z) lie on the unit circle, which means that the adaptive

noise canceller has a notch of infinite depth at the frequency ωo. The sharpness of the

notch is determined by the closeness of the poles of H(z) to its zeros, and the 3-dB

bandwidth B of the notch is found to be

B ≈ μMA²/2

The smaller we therefore make μ, the smaller B is, and therefore the sharper the notch is,

and so finally we satisfy the 2nd characteristic too.
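A short MATLAB sketch that plots the magnitude of this H(z) (with illustrative values of ωo, μ, M and A) makes the notch visible:

% Sketch: frequency response of the notch transfer function H(z) derived above.
wo = 0.2*pi;  mu = 0.01;  M = 16;  A = 1;              % illustrative values
k  = mu*M*A^2;                                         % so the 3-dB bandwidth is about k/2
w  = linspace(0, pi, 2048);  z = exp(1j*w);
H  = (z.^2 - 2*z*cos(wo) + 1) ./ (z.^2 - 2*(1 - k/4)*z*cos(wo) + (1 - k/2));
plot(w/pi, 20*log10(abs(H)))                           % deep, narrow notch near w = wo
xlabel('\omega / \pi'), ylabel('|H(e^{j\omega})|  (dB)')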

3.2.2.4 Amplitude Compression: Speech amplitude compression is essentially the task of controlling the overall gain of a

speech amplification system. It essentially “maps” the dynamic range of the acoustic

environment to the restricted dynamic range of the hearing impaired listener.

Amplitude compression is achieved by applying a gain of less than one to a signal

whenever its power exceeds a predetermined threshold.

As long as the input power p_in to the compressor is less than the input threshold p_th, no

compression takes place and the output is equal to the input. When the input power

exceeds the threshold value p_th, a gain less than one is applied to the signal. Once

amplitude compression is being applied, if the attenuated input power exceeds a

specified saturation power p_sat, the output power is held at a constant level.
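A minimal MATLAB sketch of this input/output rule is given below; the frame-based structure, the compression ratio parameter and the threshold values are illustrative assumptions, not the implementation described in the text.

% Sketch: static amplitude-compression rule for one short frame of samples x.
% p_th: input power threshold, p_sat: saturation output power, ratio: compression ratio.
function y = compress_frame(x, p_th, p_sat, ratio)
    p_in = mean(x.^2);                        % short-time input power of this frame
    if p_in <= p_th
        g = 1;                                % below threshold: no compression
    else
        g = (p_th/p_in)^(1 - 1/ratio);        % power gain < 1 above the threshold
        if g*p_in > p_sat
            g = p_sat/p_in;                   % hold the output power constant at p_sat
        end
    end
    y = sqrt(g)*x;                            % amplitude gain is the square root of the power gain
end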

Now, we will discuss the implanted hearing aid; the hearing aid with the severe hearing

loss.


Page 33: Acoustics Lab 2007


3.3 Second Model A method of treatment for a sensorineural hearing impairment

Introduction:

- Cochlear implants, which are implanted through a surgical procedure, are taking

hearing technology to a new level.

- The best candidates for cochlear implants are individuals with profound hearing

loss to both ears who have not received much benefit from traditional hearing

aids and are of good general health.

- Children as young as 14 months have been successfully implanted.

3.3.1 The conceptual prosthetic system architecture is:

Figure (3.6) Architecture of proposed fully implanted cochlear prosthetic system

Processing: • The proposed middle ear sound sensor, based on accelerometer operating

principle, can be attached to the "umbo" to convert the umbo vibration to an

electrical signal representing the input acoustic information.

• This electrical signal can be further processed by the cochlear implant speech

processor, which is followed by a stimulator to drive cochlear electrodes.

• The speech processor, stimulator, power management and control unit,

rechargeable battery and radio frequency (RF) coil will be housed in a

biocompatible package located under the skin to form a wireless network with

external adaptive control and battery charging system.

• Wireless communication between the implant and external system is essential

for post-implant programming of the speech processor.

Page 34: Acoustics Lab 2007


Tuning process: - After implant it is necessary for the patient to go through a tuning procedure for

speech processor optimization so that the cochlear implant can function properly.

- In this tuning procedure, an audiologist will present different auditory stimuli

consisting of basic sounds or words to a patient.

- The acoustic information will be detected by the implanted accelerometer and

converted into an electrical signal.

- The speech processor will then process the signal and filter it into a group of outputs,

which represent the acoustic information in an array corresponding to individual

cochlear electrode bandwidth.

- Then an array of biphasic current pulses with proper amplitude and duty cycle will be

delivered to stimulate the electrodes located inside the cochlea along the auditory

nerve.

- This excitation activates neurotransmitters, which travel to the brain for sound

reception.

- The patient will then provide feedback in terms of speech reception quality to the

audiologist.

- To achieve the optimal performance for the active implant-human interface network,

the audiologist will adaptively tune the speech processor through the RF-coils-based

wireless link.

Re-charging battery: A- Through the same link, an intelligent power management network for extending the

battery longevity and ensuring patient safety can also be implemented between the

implanted rechargeable battery and external powering electronics.

B- An external coil loop worn in a headset can transmit RF power across the skin to the

receiving loop, and active monitoring and control of incident RF power can be realized.

C- Upon completion of battery charging, the communication and control unit can send

out a wireless command to turn off the external powering system.

Accelerometer's position misalignment: - The umbo vibrates with the largest vibration amplitude in response to auditory inputs.

- Measurements of ossicular bone vibration can be performed by using a Laser Doppler

Vibrometer (LDV).

- Due to the umbo's curved surface it is possible that the device may become

misaligned anterior or posterior to the long process of the hammer (malleus) during

attachment, which could potentially degrade sensitivity.

- Therefore, it is necessary to investigate and characterize the umbo vibration response

along different axes.

- the direction perpendicular to the tympanic membrane is defined as the "primary axis"

of the umbo, and the vector parallel to the tympanic membrane plane and perpendicular

to the long process of the malleus is defined as the "secondary axis" of the umbo.

Page 35: Acoustics Lab 2007


Figure (3.7) umbo primary and secondary axes

Accelerometer's design: To achieve the optimum sensor design, we have to investigate the human temporal bone

vibration characteristics.

3.3.2 Human temporal bone vibration characterization:

A- Human temporal bone preparation:

- Four temporal bones were used to study the vibration characteristics of the umbo.

-all temporal bones were individually inspected under a microscope to verify an intact

(uninjured before) tympanic membrane, ear canal, and ossicular bone structure.

-any bone with any evidence of structural damage due to the middle ear cavity exposure

was not used.

Temporal bones were then sequentially opened in two stages:

a- A simple mastoidectomy (surgical removal) with a facial recess approach.

b- After the initial opening of the middle ear cavity, the temporal bone was further

opened in a 2nd stage of drilling.

- In this stage, the facial recess was widened such that full access was gained into the

middle ear.

- The drilling proceeded until the tympanic membrane could be visualized.

- 2 pieces of <1 mm^2 > reflective material were placed as targets for the umbo

primary and secondary axes characterization.

Page 36: Acoustics Lab 2007


B- Temporal bone experimental setup and procedures:

Figure (3.8) Schematic of temporal bone setup

- A temporal bone under examination was placed in a weighted temporal bone

holder.

- An insert earphone driven by a waveform generator presented pure tones within

the audible spectrum to the tympanic membrane.

- A probe microphone was positioned approx. 4mm from the tympanic membrane

to monitor the input sound pressure level.

- An LDV exhibiting a velocity resolution of 5 µm/s over the frequency range

of DC to 50 kHz was used to measure the ossicles' vibrational characteristics.

- The laser was focused onto the reflective targets attached to the primary and

secondary axes of the umbo.

- Acceleration of the umbo along the primary and the secondary axes was

measured in the frequency range of 250 Hz to 10 kHz with input tones between

70 dB and 100 dB SPL in increments of 5 dB.

C- Measurements results with respect to primary and secondary axes:

- The acceleration frequency response along the primary axis is nearly identical

to that of the secondary axis.

- Although the frequency trends are very similar, a 20% increase in the

acceleration amplitude is measured on the umbo along the primary axis

compared to the secondary axis.

- While the difference between the two axes measurements is small, it occurs

with approx. equal magnitude in all bones at all sound levels.

Therefore, any potential misalignment in sensor placement will have a minimal impact

on the output signal amplitude because of the similar acceleration amplitude response,

and also negligible frequency distortion due to the similar frequency response.


Figure (3.9)

Figure (3.10)

- The vibration acceleration frequency response in the direction perpendicular to the

tympanic membrane increases with:

a- A slope of 40 dB per decade below 1 kHz.
b- A slope of about 20 dB per decade from 1 kHz to 4 kHz.

c- Above 4 kHz the acceleration signal remains relatively flat.

- Throughout the measurement frequency range the vibration acceleration

exhibits a linear function of the input sound pressure level SPL with a slope of

20 dB per decade.
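As a rough illustration of these measured trends, the following MATLAB lines sketch the shape of the umbo acceleration frequency response; the slopes are taken from the measurements above, while the reference level (0 dB at 1 kHz) is only an assumption for plotting.

% Sketch of the umbo acceleration frequency-response trend (relative level only)
f = logspace(log10(250), log10(10000), 200);                      % frequency axis, 250 Hz to 10 kHz
A = zeros(size(f));
A(f < 1000)             = 40*log10(f(f < 1000)/1000);             % +40 dB per decade below 1 kHz
A(f >= 1000 & f < 4000) = 20*log10(f(f >= 1000 & f < 4000)/1000); % +20 dB per decade from 1 to 4 kHz
A(f >= 4000)            = 20*log10(4);                            % relatively flat above 4 kHz
semilogx(f, A); xlabel('Frequency (Hz)'); ylabel('Relative acceleration (dB)');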


3.3.3 Design Guideline for optimum accelerometer:

Figure (3.11)

- The previous measurement results can serve as design guideline to help define the

specifications for the prototype accelerometer.

- Audiologists report that audible speech is primarily focused between 500 Hz and 8

kHz, and that the loudness of quiet conversation is approx. 55 dB SPL.

- Within the audible speech spectrum, 500 Hz has the lowest acceleration response, and

thus it is the most difficult for detection.

- Today's cochlear implants have multiple channels and electrodes to provide an

appropriate stimulus to the correct location within the cochlea; at 500 Hz, the electrode

channel bandwidth is on the order of 200 Hz.

Therefore:

1- To detect sounds at 55 dB SPL at 500 Hz, an accelerometer with a sensitivity of

50 µg/√Hz and a bandwidth of 10 kHz is needed.

2- The total device mass is another important design consideration:

- The mass of the umbo and long process of the malleus is about 20-25 mg.

- Adding a mass greater than 20 mg can potentially result in a significant damping effect

on the frequency response of the middle ear ossicular chain. Therefore, the total mass

of the packaged sensing system needs to be kept below 20 mg.

3.3.4 Conclusion: An accelerometer with reduced package mass (below 20 mg) and improved

performance, achieving a sensitivity of 50 µg/√Hz and a bandwidth of 10 kHz, would

be needed to satisfy the requirements for normal conversation detection.

Topic 2 – Acoustical Simulation of Room

CHAPTER 1

GEOMETRICAL ACOUSTICS

1.1 Introduction

What is Room Acoustics?

Room Acoustics describes how sound behaves in an enclosed space.

The sound behavior in a room depends significantly on the ratio of the frequency (or

the wavelength) of the sound to the size of the room. Therefore, the audible spectrum

can be divided into four regions (zones) illustrated in the following drawing (for

rectangular room):

1. The first zone is below the frequency that has a wavelength of twice the longest

length of the room. In this zone sound behaves very much like changes in static

air pressure.

2. Above that zone, up to a frequency of approximately 11,250·(RT60 / V)^(1/2),

wavelengths are comparable to the dimensions of the room, and so room

resonances dominate.

3. The third region which extends approximately 2 octaves is a transition to the

fourth zone.

4. In the fourth zone, sounds behave like rays of light bouncing around the room.

The first 3 zones constitute what we call “Physical Acoustics”, and the fourth zone,

which is the target of our project, is called “Geometrical Acoustics”.



What is Acoustical Simulation?

Acoustical Simulation is a technique that assists the acoustical consultants in the

evaluation of room acoustics or the performance of sound systems. An acoustical simulation

program can simulate the sound as it would be heard after the project is built. This is

also called auralization.

Human beings hear things by virtue of pressure waves impinging on their eardrums.

All of the information that we need to know about the sound (such as volume,

frequency content, direction, etc.) is contained in those pressure waves (or so we

believe). What an auralization system tries to do is to fool your brain into thinking that

you're listening to a sound source in an acoustical space (i.e., room) that you're not in.

How it does this is to take the original sound source and alter the frequency spectrum

according to both:

1. How the room affects the wave

2. And how your head/ears affect the wave.

Once this is done, you play back two signals (one for each ear) through, for example,

headphones and listen. Hopefully, you get the sense that what you are listening to is

what you would actually hear if you were really in the room with the source.

• We will discuss the simulation of Room Acoustics in the following 3 chapters: Chapter 1 Geometrical Room Acoustics, Chapter 2 Artificial Reverberation, and Chapter 3 Spatial Impression.


1.2 Sound Behavior

Consider a sound source situated within a bounded space. Sound waves will propagate

away from the source until they encounter one of the room's boundaries where, in

general, some of the energy will be absorbed, some transmitted and the rest reflected

back into the room.

Sound arriving at a particular receiving point within a room can be considered in two

distinct parts. The first part is the sound that travels directly from the sound source to

the receiving point itself. This is known as the direct sound field and is independent of

room shape and materials, but dependent upon the distance between source and

receiver.

After the arrival of the direct sound, reflections from room surfaces begin to arrive.

These form the indirect sound field which is independent of the source/receiver

distance but greatly dependent on room properties.

The Growth and Decay of Sound:

When a source begins generating sound within a room, the sound intensity measured at

a particular point will increase suddenly with the arrival of the direct sound and will

continue to increase in a series of small increments as indirect reflections begin to

contribute to the total sound level. Eventually equilibrium will be reached where the

sound energy absorbed by the room surfaces is equal to the energy being radiated by

the source. This is because the absorption of most building materials is proportional to

sound intensity: as the sound level increases, so too does the absorption.

If the sound source is abruptly switched off, the sound intensity at any point will not

suddenly disappear, but will fade away gradually as the indirect sound field begins to

die off and reflections get weaker. The rate of this decay is a function of room shape

and the amount/position of absorbent material. The decay in highly absorbent rooms

will not take very long at all, whilst in large reflective rooms, this can take quite a long

time.

(Figure 1.1) Reverberant decay of sound in a small absorbent enclosure.

This gradual decay of sound energy is known as reverberation and, as a result of this

proportional relationship between absorption and sound intensity, it is exponential as a

function of time. If the sound pressure level (in dB) of a decaying reverberant field is


graphed against time, one obtains a reverberation curve which is usually fairly straight,

although the exact form depends upon many factors including the frequency spectrum

of the sound and the shape of the room.

1.3 Geometrical Room Acoustics

In geometrical room acoustics, the concept of a wave is of minor importance; it is

replaced instead by the concept of a sound ray. The latter is an idealization just as much

as the plane wave.

Sound Ray:

As in geometrical optics, we mean by a sound ray a small portion of a spherical wave

with vanishing aperture, which originates from a certain point. It has a well-defined

direction of propagation and is subject to the same laws of propagation as a light ray,

apart from the different propagation velocity. Of these laws, only the law of reflection is

of importance in room acoustics. But the finite velocity of propagation must be

considered in all circumstances, since it is responsible for many important effects such

as reverberation, echoes and so on.

Diffraction phenomena are neglected in geometrical room acoustics, since propagation

in straight lines is its main postulate. Likewise, interference is not considered, i.e. if

several sound field components are superimposed, their mutual phase relations are not

taken into account; instead, simply their energy densities or their intensities are added.

This simplified procedure is permissible if the different components are 'incoherent'

with respect to each other.

1.3.1 The Reflection of Sound Rays

If a sound ray strikes a plane surface, it is usually reflected from it. This process takes

place according to the reflection law well-known in optics.

The Law of Reflection :

It states that the ray during reflection remains in the plane including the incident ray

and the normal to the surface, and that the angle between the incident ray and reflection

ray is halved by the normal to the wall.

Since the lateral extension of a sound ray is vanishingly small, the reflection law is

valid for any part of a plane no matter how small. Therefore it can be applied equally

well to the construction of the reflection of an extended ray bundle from a curved

surface by imagining each ray in turn to be reflected from the tangential plane which it

strikes.


The mirror source concept:

The reflection of a sound ray originating from a certain point can be illustrated by the

construction of a mirror source, provided that the reflecting surface is plane (see figure

1.2). At some distance from the reflecting plane there is a sound source A. We are

interested in the sound transmission to another point B. It takes place along the direct

path AB on the one hand (direct sound) and, on the other, by reflection from the wall.

To find the path of the reflected ray we make A` the mirror image of A, connect A` to

B and A to the point of intersection of A`B with the plane.

Once we have constructed the mirror source A` associated with a given original source

A, we can disregard the wall altogether, the effect of which is now replaced by that of

the mirror source.

Of course, we must assume that the mirror source emits exactly the same sound signal as the

original source and that its directional characteristics are symmetrical to those of A. If the

extension of the reflecting wall is finite, then we must restrict the directions of emission

of A` accordingly. Usually, not all the energy striking a wall is reflected from it; part of

the energy is absorbed by the wall (or it is transmitted to the other side which amounts

to the same thing as far as the reflected fraction is concerned).
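The mirror source construction can also be carried out numerically; the following MATLAB lines are a minimal sketch for a reflecting plane taken as z = 0, with the source and receiver positions chosen arbitrarily for illustration.

% Mirror-source construction for a reflecting plane at z = 0 (assumed positions)
A = [2 3 1.5];                         % sound source position (m)
B = [6 1 1.2];                         % receiving point position (m)
A_img = [A(1) A(2) -A(3)];             % mirror image A' of A in the plane z = 0
d_direct    = norm(B - A);             % length of the direct path A-B
d_reflected = norm(B - A_img);         % length of the reflected path (equal to A'-B)
t = A_img(3) / (A_img(3) - B(3));      % parameter where the line A'-B crosses z = 0
P = A_img + t*(B - A_img);             % point on the wall where the ray is reflected
delay_difference = (d_reflected - d_direct) / 343;   % extra delay of the reflection (s)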

The absorption coefficient of the wall α:

The fraction of sound energy (or intensity) which is not reflected is characterized by the

absorption coefficient α of the wall, which is defined as the ratio of the non-reflected to

the incident intensity. It depends, generally, on the angle of incidence and, of course, on

the frequencies which are contained in the incident sound. Thus, the reflected ray

generally has a different power spectrum and a lower total intensity than the incident

one.

(Figure 1.2) Construction of a mirror source (source A, its mirror image A', and receiving point B).


1.3.2 Sound Reflections in Rooms

Suppose we follow a sound ray originating from a sound source on its way through a

closed room. Then we find that it is reflected not once, but many times, from the walls,

the ceiling and perhaps also from the floor. This succession of reflections continues

until the ray arrives at a perfectly absorbent surface. But even if there is no perfectly

absorbent area in our enclosure, the energy carried by the ray will become vanishingly

small after some time, because during its free propagation in air as well as with each

reflection a certain part of it is lost by absorption. If the room is bounded by plane

surfaces, it may be advantageous to find the paths of the sound rays by constructing the

mirror sources.

Let us examine, in a room of arbitrary shape, the position of a sound source and a point of

observation. We assume the sound source to emit at a certain time a very short sound

pulse with equal intensity in all directions. This pulse will reach the observation point

(see figure 1.3) not only by the direct path, but also via numerous partly single, partly

multiple reflections, of which only a few are indicated in (figure 1.3). The total sound

field is thus composed of the 'direct sound' and of many 'reflections'.

In the following we use the term 'reflection' with a two-fold meaning: first to indicate

the process of reflecting sound from a wall and secondly as the name for a sound

component which has been reflected.

These reflections reach the observer from various directions, moreover their strengths

may be quite different and finally they are delayed with respect to the direct sound by

different times, corresponding to the total path length they have covered until they

reach the observation point.

(Figure 1.3) Direct sound and a few reflected components in a room

Thus, each reflection must be characterized by three quantities: its direction, its relative

strength and its relative time of arrival, i.e. its delay time.

The sum total of the reflections arriving at a certain point after emission of the original

sound pulse is the reverberation of the room, measured at or calculated for that point.



(Figure 1.4) Reflection diagram for certain positions of sound source and receiver in a rectangular room of 40 m x 25 m x 8 m. Abscissa is the delay time of a reflection, ordinate its level, both with respect to the direct sound arriving at t = 0.

1.3.3 Room Reverberation

The temporal Distribution of Reflections

Since an enumeration of the great number of reflections, of their strengths, their

directions and their delay times would not be very illustrative and would yield much

more information on the sound field than is meaningful because of our limited hearing

abilities, we shall again prefer statistical methods in what follows.

If we mark the arrival times of the various reflections by perpendicular dashes over a

horizontal time axis and choose the heights of the dashes proportional to the relative

strengths of reflections, i.e. to the coefficients An, we obtain what is frequently called a

'reflection diagram' or 'echogram'. It contains all significant information on the

temporal structure of the sound field at a certain room point. In (figure 1.4) the

reflection diagram of a rectangular room with dimensions 40m×25m×8m is plotted.

After the direct sound, arriving at t = 0, the first strong reflections occur sporadically at

first; later their temporal density increases rapidly; at the same time, however, the

reflections carry less and less energy.
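As an illustration of how such an echogram can be computed, the following MATLAB sketch builds the first-order part of a reflection diagram for the 40 m x 25 m x 8 m room of figure 1.4 using mirror sources; the source and receiver positions and the wall absorption coefficient are assumed values, not data taken from the figure.

% First-order echogram via mirror sources (assumed positions and absorption)
c     = 343;                     % speed of sound (m/s)
L     = [40 25 8];               % room dimensions (m)
src   = [10 8 2];  rec = [30 18 2];
alpha = 0.1;                     % assumed absorption coefficient of all walls
d0    = norm(rec - src);         % direct path length
img   = zeros(6,3);              % the six first-order mirror sources
for k = 1:3
    lo = src;  lo(k) = -src(k);           % image in the wall at coordinate 0
    hi = src;  hi(k) = 2*L(k) - src(k);   % image in the opposite wall
    img(2*k-1,:) = lo;  img(2*k,:) = hi;
end
d     = sqrt(sum((img - repmat(rec,6,1)).^2, 2));  % mirror-source distances
delay = (d - d0)/c * 1000;                         % delay relative to the direct sound (ms)
level = 20*log10(d0./d) + 10*log10(1 - alpha);     % level relative to the direct sound (dB)
stem(delay, level); xlabel('Delay (ms)'); ylabel('Level (dB)');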


As we shall see later in more detail, the role of the first isolated reflections with respect

to our subjective hearing impression is quite different from that of the very numerous

weak reflections arriving at later times, which merge into what we perceive

subjectively as reverberation. Thus, we can consider the reverberation of a room not

only as the common effect of free decaying vibrational modes, but also as the sum total

of all reflections, except the very first ones. The reverberation time, that is the time in

which the total energy falls to one millionth of its initial value, is thus:

T = 0.163 V / (4mV - S ln(1 - α))   Eq. (1.1)

Here m denotes the attenuation constant of the air. If we use the value of the sound

velocity in air and express the volume V in m³ and the wall area S in m², we obtain, by

rather simple geometric considerations, the most important formula of room acoustics,

which relates the reverberation time, the most characteristic figure with respect to the

acoustics of a room, to its geometrical data and to the absorption coefficient of its walls.

We have assumed more or less tacitly that the latter is the same for all wall portions and

that it does not depend on the angle at which a wall is struck by the sound rays.

1.4 Room Acoustical Parameters & Objective Measures

1.4.1 Reverberation Time

Sabine carried out a considerable amount of research in this area and arrived at an

empirical relationship between the volume of an auditorium, the amount of absorptive

material within it and a quantity which he called the Reverberation Time (RT).

As defined by Sabine, the RT is the time taken for a continuous sound within a room to

decay by 60 dB after being abruptly switched off and is given by:

RT = 0.161 V / A

Where:

V is the volume of the enclosure (m³) and A is the total absorption of the enclosure

(in metric sabins).

The term A is calculated as the sum of the surface area (in m2) times the absorption

coefficient (a) of each material used within the enclosure.
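As a small numerical illustration of this formula, the following MATLAB lines compute RT for the 40 m x 25 m x 8 m room used earlier in figure 1.4; the surface grouping and the absorption coefficients are assumed values chosen only for demonstration.

% Sabine reverberation time for an example room (assumed absorption data)
V = 40*25*8;                          % room volume (m^3)
S = [2*40*25, 2*40*8, 2*25*8];        % floor+ceiling, long walls, short walls (m^2)
a = [0.10,    0.05,   0.05];          % assumed absorption coefficients of each group
A  = sum(S .* a);                     % total absorption A (metric sabins)
RT = 0.161 * V / A;                   % reverberation time (s)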

1.4.2 Early Decay Time

The reverberation time, as mentioned above, refers to the time taken for the reverberant

component of an enclosure to fall by 60 dB after the source is abruptly switched off. In

an ideal enclosure this decay is exponential, resulting in a straight line when the sound

level (in dB) is graphed against time.


Research (Kuttruff, 1973), however, shows that this is not always the case. It has been

shown that it is the initial portion of the sound decay curve which is responsible for our

subjective impression of reverberation, as the later portion is usually masked by new

sounds. To account for this, the Early Decay Time (EDT) is

used. This is measured in the same way as the normal reverberation time but over only

the first 10 - 15 dB of decay, depending on the work being referenced.

1.4.3 Clarity & Definition

Clarity and Definition refer to the ease with which individual sounds can be

distinguished from within a general audible stream. This stream of sound may take

many forms: a conversation, a passage of music, a shouted warning, the whirring of

machinery, and so on.

In more detailed words we can define the Definition and Clarity as follows:

• Definition:

The definition is a measure of the clarity of speech, specifying how well a speaker can

be understood at a given listener position, for example.

• Clarity:

Clarity is a comparable measure for the clarity of music. It is used for large rooms like

chamber music halls and concert halls. It is not appropriate for an evaluation of music

reproduced with loudspeakers in a recording studio or listening room.

1.4.4 Lateral Fraction & Bass Ratio

Lateral fraction and Bass ratio are two interesting room acoustical parameters. The

lateral fraction is related to the spatial impression made by the music in a concert hall.

The bass ratio describes the warmth of the music.

1.4.5 Speech Transmission Index

The Speech Transmission Index, STI for short, is a measure of the intelligibility of speech whose

value varies from 0 (completely unintelligible) to 1 (perfect intelligibility). The

understanding of speech, the intelligibility, is directly dependent on the background

noise level, on the reverberation time, and on the size of the room. The STI is calculated

from acoustical measurements of speech and noise.

Another standard defines a method for computing a physical measure that is highly

correlated with the intelligibility of speech; this measure is called the Speech

Intelligibility Index, or SII.


CHAPTER 2

ARTIFICIAL REVERBERATION

2.1 Introduction

Reverberation:

Reverberation is the result of many reflections of sound in a room. For any sound,

there’s a direct path to reach the ear of the audience but it’s not the only path; sound

waves may take slightly longer paths by reflecting off walls and ceiling before arriving

at our ears.

This sound will arrive later than the direct sound, and weaker, because the walls absorb

part of the sound energy. It may also reflect again before arriving at our ears.

So these delays and attenuations in sound waves are called “Reverberation”.

Summary:

Reverberation occurs when copies of the audio signal reach the ear with different

delays and amplitudes. It depends on the room geometry and its occupants.

Reverberation is a natural phenomenon.

Artificial Reverberation:

Artificial Reverberation is added to sound signals requiring additional reverberation for

optimum listening enjoyment.

Our Aim:

To generate an artificial reverberation which is indistinguishable from the natural

reverberation of real rooms.

2.2 Shortcomings of Electronic Reverberators

The main defects of artificial electronic reverberators:

1) Coloration:

It is the change in timbre of many sounds. It occurs because the amplitude-frequency

responses of electronic reverberators are not flat; in fact they deviate from a flat

response so much that the coloration becomes audible, particularly if only little direct (unreverberated) sound is mixed with

the artificially reverberated signal.

2) Fluttering:

Fluttering of reverberated sound occurs because the echo density* is too low compared

to the echo density of real room, especially for short transients.

*Echo density: the number of echoes per second at the output of the reverberator for a

single pulse at the input.



How to avoid the above degradations in artificial reverberators?

The problem of coloration can be solved by making an artificial reverberator with a flat

amplitude-frequency response. This can be achieved by passing all frequency

components equally by means of an all-pass filter which will be described below.

Concerning the problem of low echo density, it was found that for a flutter-free

reverberation, approximately 1000 echoes per second are required. Unfortunately, echo

densities of 1000 per second are not easily achieved practically by a one-dimensional

delay device.

Many researchers have suggested multiple feedbacks to produce a higher echo density.

However, multiple feedbacks have severe stability problems. They also lead to non-flat

frequency responses and non-exponential decay characteristics.

We can simply treat the problem of low echo density by having a basic reverberating

unit that can be connected in series any number of times.

Now the question that comes to mind is why this simple remedy for the echo density

problem has not been used from the beginning.

In fact, the answer is quite simple: the existing reverberators have highly

irregular frequency responses.

Here we have to mention that if the basic reverberator unit has a flat frequency

response, the series connection of any number of them will have a flat response too.

Conclusion:

All-pass reverberators (Reverberators with flat frequency response) can remove the two

main defects in artificial reverberators which are: Coloration and Fluttering.

2.3 Realizing Natural Sounding Artificial Reverberation

2.3.1 Comb Filter

What is Comb filter?

In signal processing, a comb filter adds a delayed version of a signal to itself, causing

constructive and destructive interference. The frequency response of a comb filter

consists of a series of regularly-spaced spikes, giving the appearance of a comb.

Comb filters exist in two different forms, feed-forward and feedback; the names refer to

the direction in which signals are delayed before they are added to the input.

Structure of Comb filter (feedback form):

The input and the fed-back signal are summed and passed through a delay τ to form the output; the output, scaled by a gain g, is fed back to the summer, so a single input pulse produces an endless train of echoes spaced τ apart.


Impulse Response of Comb filter:

Frequency Response of Comb filter:

Analysis:

The impulse response of Comb Filter is given by the following equation:

h(t) = δ(t-τ) + g δ(t-2τ) + g² δ(t-3τ) + g³ δ(t-4τ) + . . .   Eq. (2.1)

Where:

δ (t- τ) is the time response of a simple echo produced by a delay line.

g is the gain that must be less than one to guarantee the stability of the filter.

By taking Fourier Transform of Eq. (2.1), we obtain the following spectrum:


H(ω) = e^(-jωτ) + g e^(-2jωτ) + g² e^(-3jωτ) + g³ e^(-4jωτ) + . . .   Eq. (2.2)

Eq. (2.2) can be rewritten as:

H(ω) = e^(-jωτ) / (1 - g e^(-jωτ))   Eq. (2.3)

The amplitude response will be:

|H(ω)| = 1 / √(1 + g² - 2g cos ωτ)   Eq. (2.4)

It has maxima at ω = 2πn/τ, n = 0, 1, 2, 3, … (for 0 < g < 1), with

|H|max = 1 / (1 - g)   Eq. (2.5)

|H|min = 1 / (1 + g)   Eq. (2.6)
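As a minimal illustration of Eqs. (2.1)-(2.4), the following MATLAB lines realize such a feedback comb filter; the sample rate, delay and gain are assumed values chosen only for demonstration.

% Feedback comb filter (assumed parameters)
fs  = 8000;  tau = 0.040;  g = 0.7;        % sample rate (Hz), delay (s), gain
D   = round(tau*fs);                       % delay in samples
b   = [zeros(1,D) 1];                      % numerator: x(n-D)
a   = [1 zeros(1,D-1) -g];                 % denominator: 1 - g z^(-D)
x   = [1 zeros(1, 5*D)];                   % unit impulse
h   = filter(b, a, x);                     % impulse response: echoes 1, g, g², ... spaced D samples apart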

Disadvantage of Comb Filter:

The amplitude response of the comb filter has periodic maxima and minima, as shown

by Eqs. (2.5) and (2.6). These peaks and valleys are responsible for the unwanted “colored”

quality which accompanies the reverberated sound.

2.3.2 All-pass Filter

What is All-pass filter?

It is a filter that passes all frequencies equally. In other words, the amplitude response

of an all-pass filter is 1 at each frequency, while the phase response can be arbitrary.

Structure of All-pass filter:

All-pass filters can be implemented in many different ways; however, we will focus on

the “Schroeder All-pass”, which is implemented by cascading a feedback comb filter with a

feed-forward comb filter.

(Schroeder all-pass structure: the input feeds both a direct path with gain -g and a feedback comb with delay τ and gain g; the comb output, scaled by 1 - g², is summed with the direct path to form the output.)


Impulse Response of All-pass filter:

Frequency Response of All-pass filter:

Analysis:

The impulse response of the above All-pass Filter is given by the following equation:

h(t) = -g δ(t) + (1 - g²)[δ(t-τ) + g δ(t-2τ) + . . .]   Eq. (2.7)

By taking the Fourier Transform of Eq. (2.7), we obtain the following spectrum:

H(ω) = -g + (1 - g²) e^(-jωτ) / (1 - g e^(-jωτ))   Eq. (2.8)

Eq. (2.8) can be rewritten as:

H(ω) = (e^(-jωτ) - g) / (1 - g e^(-jωτ))   Eq. (2.9)


Or, equivalently,

H(ω) = e^(-jωτ) (1 - g e^(jωτ)) / (1 - g e^(-jωτ))   Eq. (2.10)

The amplitude response will be:

|H(ω)| = 1   Eq. (2.11)

Advantage of All-pass Filter:

The all-pass filter obtained from two combs has made a marked improvement in the

quality of the reverberated sound, and we have successfully reached a perfectly

“colorless” quality.

Conclusion:

By implementing the all-pass reverberator discussed above, we possess a basic

reverberating unit that passes all frequencies with equal gain and thus avoids the

problem of sound coloration. Moreover, if we connect any desired number of such units

in series, we can increase the echo density.
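As a minimal sketch of such a basic unit, the following MATLAB lines realize the all-pass reverberator of Eq. (2.9) and connect two units in series; the sample rate, delay and gain are assumed demonstration values.

% Schroeder all-pass unit, Eq. (2.9), and a series connection (assumed parameters)
fs  = 8000;  tau = 0.005;  g = 0.7;          % sample rate (Hz), delay (s), gain
D   = round(tau*fs);
b   = [-g zeros(1,D-1) 1];                   % numerator:   -g + z^(-D)
a   = [ 1 zeros(1,D-1) -g];                  % denominator:  1 - g z^(-D)
x   = [1 zeros(1, 4000)];                    % unit impulse
y1  = filter(b, a, x);                       % one all-pass unit (flat amplitude response)
y2  = filter(b, a, y1);                      % two units in series: still flat, higher echo density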

2.3.3 Combined Comb and All-pass Filters

As we discussed in the previous sections, connecting several all-pass reverberator units

in series leads to a complete elimination of both coloration and fluttering, and we have

successfully reached our aim of creating an artificial reverberation that is

indistinguishable from that of real rooms.

Now, we can go deeper towards a more sophisticated artificial reverberator having

more subtle characteristics of natural reverberation like:

1. Mixing of direct and reverberated sounds:

By changing the position of the direct input sound added to the reverberated sound,

we can change the ratio of direct sound energy to reverberant sound energy without

producing non-exponential decays; this matters because non-exponential decays in real rooms

actually point to a lack of the spatial “diffusion” impression of sound.

2. Introduction of time gap between the direct sound and reverberation:

This can be done by means of the delay (τ). In real rooms, the time gap as well as

the ratio of direct-to-reverberated sound depends on the positions of sound source

and listener.

3. A dependence of the reverberation time on frequency:

In order to add more realism to the artificial reverberation, we can make the

reverberation time a function of frequency (i.e. for low frequencies, enlarge the



reverberation time). This can be done by adding a simple RC circuit in the feedback

loop of each all-pass reverberator.

The following block diagram describes an artificial reverberator that allows a

wide choice of specifications such as mixing ratios, delay of reverberated sound

and kind of decay.

A complete circuit diagram of an advanced artificial reverberator:

(Diagram: the input feeds four comb filters in parallel, with delays τ1-τ4 and gains g1-g4; their summed output passes through two all-pass units with delays τ5, τ6 and gains g5, g6, and the direct input is mixed into the output through a gain g7.)

Description:

Here, as we can see, we used a set of 4 comb filters connected in parallel. Although

comb filters have irregular frequency responses, the human ear cannot distinguish

between a flat response and an irregular response of a room that fluctuates by

approximately 10 dB. Studies have shown that such irregularities are unnoticed

when the density of peaks and valleys is high enough, and this is the case in

our 4 comb filters.

Several all-pass reverberators are connected in series with the comb filters to

increase the echo density.
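A minimal MATLAB sketch of this comb-plus-all-pass arrangement is given below; all delays and gains are assumed demonstration values, not design figures read from the diagram.

% Four parallel comb filters followed by two series all-pass units (assumed values)
fs = 8000;
x  = [1 zeros(1, fs)];                                % unit impulse, about 1 s long
combD = round([0.030 0.034 0.039 0.045]*fs);          % comb delays in samples
combG = [0.85 0.83 0.80 0.78];                        % comb gains
y = zeros(size(x));
for k = 1:4                                           % the four comb filters act in parallel
    y = y + filter([zeros(1,combD(k)) 1], [1 zeros(1,combD(k)-1) -combG(k)], x);
end
apD = round([0.0050 0.0017]*fs);  apG = [0.7 0.7];    % all-pass delays (samples) and gains
for k = 1:2                                           % the two all-pass units act in series
    y = filter([-apG(k) zeros(1,apD(k)-1) 1], [1 zeros(1,apD(k)-1) -apG(k)], y);
end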



2.4 Ambiophonic Reverberation

We can achieve a highly diffuse reverberation by adding one more modification to our

artificial reverberator; that is to make it “Ambiophonic”.

In order to create the spatially diffuse character of real reverberation, we need to

generate several different reverberated signals and to feed them into a number of

loudspeakers distributed around the listener.

(Schematic: the input passes through all-pass units A1 and A2 and then through comb filters C1-C4; the comb outputs and their inverted outputs feed a resistance matrix that forms up to 16 output combinations.)

Description:

• The schema illustrates an Ambiophonic installation where the order of comb filters

and all-pass filters has been inverted compared with the diagram of section (2.3.3).

• The outputs and the inverted outputs of the 4 comb filters are connected to a

resistance matrix which forms up to 16 different combinations of the comb filter

outputs. Each combination uses each comb filter output or its negative exactly once.

• The matrix outputs are supplied to power amplifiers and loudspeakers distributed

around the listening area.

More techniques allowing the creation of spatial impression are discussed in

details in next chapter.


CHAPTER 3

SPATIALIZATION

3.1 Introduction

The acoustical sound field around us is very complex. Direct sounds, reflections, and

refractions arrive at the listener's ears; the listener then analyzes the incoming sounds and

connects them mentally to sound sources. Spatial hearing is an important part of the

surrounding world.

The perception of the direction of the sound source relies heavily on the two main

localization cues: Interaural level difference (ILD) and Interaural time difference (ITD).

These frequency-dependent differences occur when the sound arrives at the listener's

ears after having traveled paths of different lengths or being shadowed differently by

the listener's head. In addition to ILD and ITD some other cues, such as spectral

coloring, are used by humans in sound source localization.

Bringing a virtual three-dimensional sound field to a listening situation is one goal of

the research in the audio reproduction field. The first recordings were monophonic;

they created point like sound fields. A big step was two-channel stereophonic

reproduction, with which the sound field was enlarged to a line between two

loudspeakers. Two-channel stereophony is still the most used reproduction method in

domestic and professional equipment.

Various attempts to enlarge the sound field have been proposed. Horizontal-only

(pantophonic) sound fields have been created with various numbers of loudspeakers

and with various systems of encoding and decoding and matrixing. In most systems the

loudspeakers are situated in a two-dimensional (horizontal) plane. Some attempts to

produce periphonic (full-sphere) sound fields with three-dimensional loudspeaker

placement exist, such as holophony or three dimensional Ambisonics.

Periphonic sound fields can be produced in two-channel loudspeaker or headphone

listening by filtering the sound material with digital models of the free-field transfer

functions between the listener's ear canal and the desired place of the sound source

[head-related transfer functions (HRTFs)]. The spectral information of the direction of

the sound source is thus added to the signal emanating from the loudspeaker. The

system, however, has quite strict boundary conditions, which limits its use.

In most systems the positions of the loudspeakers are fixed. In the Ambisonics systems,

the number and placement of the loudspeakers may be variable. However, the best

possible localization accuracy is achieved with an orthogonal loudspeaker placement; if

the number of loudspeakers is greater, the accuracy is not improved appreciably.

A natural improvement would be a virtual sound source positioning system that would

be independent of the loudspeaker arrangement and could produce virtual sound

sources with maximum accuracy using the current loudspeaker configuration.


The vector base amplitude panning (VBAP) is a new approach to the problem. The

approach enables the use of an unlimited number of loudspeakers in an arbitrary two-

or three-dimensional placement around the listener. The loudspeakers are required to be

nearly equidistant from the listener, and the listening room is assumed to be not

reverberant. Multiple moving or stationary sounds can be positioned in any direction in

the sound field spanned by the loudspeakers.

In VBAP the amplitude panning method is reformulated with vectors and vector bases.

The reformulation leads to simple equations for amplitude panning, and the use of

vectors makes the panning methods computationally efficient.

3.2 Two-Dimensional Amplitude Panning

Two-dimensional amplitude panning, also known as intensity panning, is the most

popular panning method. The applications range from small domestic stereophonic

amplifiers to professional mixers. The method is, however, an approximation of real-

source localization.

In the simple amplitude panning method two loudspeakers radiate coherent signals,

which may have different amplitudes. The listener perceives an illusion of a single

auditory event (virtual sound source, phantom sound source), which can be placed on a

two-dimensional sector defined by locations of the loudspeakers and the listener by

controlling the signal amplitudes of the loudspeakers. A typical loudspeaker

configuration is illustrated in figure 3.1.


(Figure 3.1) Two-channel stereophonic configuration.

Two loudspeakers are positioned symmetrically with respect to the median plane.

Amplitudes of the signals are controlled with gain factors g1 and g2 respectively. The

loudspeakers are typically positioned at φ0 = 30° angles.

The direction of the virtual source is dependent on the relation of the amplitudes of the

emanating signals. If the virtual source is moving and its loudness should be constant,

the gain factors that control the channel levels have to be normalized. The sound power

can be set to a constant value C, whereby the following approximation can be stated:

g1² + g2² = C.   Eq. (3.1)

The parameter C > 0 can be considered a volume control of the virtual source. The

perception of the distance of the virtual source depends within some limits on C: the

louder the sound, the closer it is located. To control the distance accurately, some

psycho-acoustical phenomena should be taken into account, and some other sound

elements should be added, such as reflections and reverberations.

When the distance of the virtual source is left unattended, the virtual source can be

placed on an arc between the loudspeakers, the radius of which is defined by the

distance between the listener and the loudspeakers. The arc is called the active arc.

In the ideal panning process only the direction where the virtual source should appear is

defined and the panning tool performs the gain factor calculation. In the next two

subsections some different ways of calculating the factors will be presented.

3.2.1 Trigonometric Formulation

The directional perception of a virtual sound source produced by amplitude panning

follows approximately the stereophonic law of sines originally proposed by Blumlein

and reformulated in phasor form by Bauer:

sin φ / sin φ0 = (g1 - g2) / (g1 + g2)   Eq. (3.2)

Where 0° < φ0 < 90°, -φ0 ≤ φ ≤ φ0, and g1, g2 ∈ [0, 1]. In Eq. (3.2) φ represents the

angle between the x axis and the direction of the virtual source; ± φ0 is the angle

between the x axis and the loudspeakers. This equation is valid if the listener's head is


pointing directly forward. If the listener turns his or her head following the virtual

source, the tangent law is more correct,

tan φ / tan φ0 = (g1 - g2) / (g1 + g2)   Eq. (3.3)

Where 0° < φ0 < 90°, -φ0 ≤ φ ≤ φ0, and g1, g2 ∈ [0, 1]. Eqs. (3.2) and (3.3) have been

calculated with the assumption that the incoming sound is different only in magnitude,

which is valid for frequencies below 500-600 Hz. When keeping the sound power level

constant, the gain factors can be solved using Eqs. (3.2) and (3.1) or using Eqs. (3.3)

and (3.1). The slight difference between Eqs. (3.2) and (3.3) means that the rotation of

the head causes small movements of the virtual sources. However, in subjective tests it

was shown that this effect is negligible.
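As a small worked example, the following MATLAB lines solve Eqs. (3.3) and (3.1) for the two gain factors; the loudspeaker angle, source angle and power constant are assumed values.

% Gain factors from the tangent law, Eq. (3.3), with the power constraint, Eq. (3.1)
phi0 = 30*pi/180;  phi = 10*pi/180;  C = 1;    % assumed angles and power constant
r  = tan(phi)/tan(phi0);                       % (g1 - g2)/(g1 + g2)
s  = sqrt(2*C/(1 + r^2));                      % g1 + g2 consistent with g1² + g2² = C
g1 = s*(1 + r)/2;                              % about 0.88 for these values
g2 = s*(1 - r)/2;                              % about 0.47 for these values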

Some kind of amplitude panning method is used in the Ambisonics encoding system. In

pantophonic Ambisonics the entire sound field is encoded to three channels using a

modified amplitude panning method. Two of the channels, X and Y, contain the

components of the sound on the x axis and the y axis, respectively. The third, W,

contains a monophonic mix of the sound material. The signal to be stored on the

channel is calculated by multiplying the input signal samples by the channel-specific

gain factor. The gain factors gx, gy, and gw are formulated as

.cosθ=xg Eq. (3.4)

Eq. (3.5)

.707.0=wg Eq. (3.6)

Where θ is the azimuth angle of the virtual sound source, and r is the distance of virtual

source as illustrated in figure (3.2).

(Figure 3.2) Coordinate system of two-dimensional Ambisonics system.

This method differs from the standard amplitude panning method in that the gain

factors gx and gy may have negative values. The negative values imply that the signal is

stored on the recorder in antiphase when compared with the monophonic mix in the W

channel. When the encoded sound field is decoded, the antiphase signals on a channel

are applied to the loudspeakers in a negative direction of the respective axis. The

decoding stage is performed with matrixing equations. In the equations some additions

or subtractions are performed between the signal samples on the W channel and on the

X and Y channels. Equations for various loudspeaker configurations can be formulated.



The absolute values of the gain factors used in two-dimensional Ambisonics satisfy the

tangent law [Eq. (3.3)], which the reader may verify, for example, for values of 0° < θ <

90°, by setting θ = φ0 + φ, φ0 = 45°, g2 = gx, and g1 = gy, and by substituting Eqs. (3.4)

and (3.5) into the relation (gy - gx)/(gy + gx).

3.2.2 Vector Base Formulation

In the two-dimensional VBAP method, the two-channel stereophonic loudspeaker

configuration is reformulated as a two-dimensional vector base. The base is defined by

unit-length vectors l1 = [l11 l12]^T and l2 = [l21 l22]^T, which are pointing toward

loudspeakers 1 and 2, respectively, as seen in figure (3.3). The superscript T denotes

the matrix transposition. The unit-length vector p = [p1 p2]^T, which points toward the

virtual source, can be treated as a linear combination of loudspeaker vectors,

p = g1 l1 + g2 l2.   Eq. (3.7)

(Figure 3.3) Stereophonic configuration formulated with vectors.

In Eq. (3.7) g1 and g2 are gain factors, which can be treated as nonnegative scalar

variables. We may write the equation in matrix form,

p^T = g L12   Eq. (3.8)

where g = [g1 g2] and L12 = [l1 l2]^T. This equation can be solved if L12^(-1) exists,

g = p^T L12^(-1) = [p1 p2] [l11 l12 ; l21 l22]^(-1).   Eq. (3.9)

The inverse matrix L12^(-1) satisfies L12 L12^(-1) = I, where I is the identity matrix. L12^(-1) exists

when φ0 ≠ 0° and φ0 ≠ 90°, both problem cases corresponding to quite uninteresting

stereophonic loudspeaker placements. For such cases the one-dimensional VBAP can

be formulated.


Gain factors g1 and g2 calculated using Eq. (3.9) satisfy the tangent law of Eq. (3.3).

When the loudspeaker base is orthogonal, φ0 = 45°, the gain factors are also equivalent

to those calculated for the Ambisonics encoding system, with the exception that the

gain factors in Ambisonics may have negative values. In such cases, however, the

absolute values of the factors are equal.

When φ0 ≠ 45°, the gain factors have to be normalized using the equation

g_scaled = √C g / √(g1² + g2²).   Eq. (3.10)

Now the gain factors g_scaled satisfy Eq. (3.1).
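The following MATLAB lines sketch this two-loudspeaker gain calculation; the loudspeaker and virtual source azimuths are assumed example values.

% 2-D VBAP gain factors, Eqs. (3.9) and (3.10) (assumed angles)
theta1 = 30*pi/180;  theta2 = -30*pi/180;      % loudspeaker azimuths
phi    = 10*pi/180;                            % desired virtual source azimuth
L12 = [cos(theta1) sin(theta1); cos(theta2) sin(theta2)];   % rows are l1 and l2
p   = [cos(phi) sin(phi)];                     % unit vector toward the virtual source
g   = p / L12;                                 % g = p^T * inv(L12), Eq. (3.9)
C   = 1;
g_scaled = sqrt(C) * g / sqrt(sum(g.^2));      % normalization, Eq. (3.10)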

3.2.3 Two-Dimensional VBAP for More Than Two

Loudspeakers

In many existing audio systems there are more than two loudspeakers in the horizontal

plane, such as in Dolby surround systems. Such systems can also be reformulated with

vector bases. A set of loudspeaker pairs is selected from the system, and the signal is

applied at any one time to only one pair. Thus the loudspeaker system consists of many

vector bases competing among themselves. Each loudspeaker may belong to two pairs.

In figure (3.4) a loudspeaker system in which the two dimensional VBAP can be

applied is illustrated. A system for virtual source positioning, which similarly uses only

two loudspeakers at any one time, has been implemented in an existing theater.

(Figure 3.4) Two-dimensional VBAP with five loudspeakers.

The virtual source can be produced by the loudspeaker pair on the active arc of which the

virtual source is located. Thus the sound field that can be produced with VBAP is a

union of the active arcs of the available loudspeaker bases. In two-dimensional cases


the best way to choose the loudspeaker bases is to let adjacent loudspeakers form

them. In the loudspeaker system illustrated in figure (3.4) the selected bases would

be L12, L23, L34, L45, and L51. The active arcs of the bases are thus nonoverlapping.

The use of the nonoverlapping active arcs provides continuously changing gain factors

when moving virtual sources are applied. When the sound moves from one pair to

another, the gain factor of the loudspeaker, which is not used after the change, becomes

gradually zero before the change-over point.

The fact that all other loudspeakers except the selected pair are idle may seem a waste

of resources. In this way, however, good localization accuracies can be achieved for the

principal sound, whereas the other loudspeakers may produce reflections and

reverberation as well as other elements.

3.2.4 Implementing Two-Dimensional VBAP for More Than

Two Loudspeakers

A digital panning tool that performs the panning process is now considered. Sufficient

hardware consists of a signal processor that can perform input and output with multiple

analog-to-digital (A/D) and digital-to-analog (D/A) converters and has enough

processing power for the computation needed. The tool also has to include a user

interface.

When the tool is initialized, the directions of the loudspeakers are measured relative to

the best listening position and loudspeaker pairs are formed from adjacent

loudspeakers. The inverse matrices Lnm^(-1) are calculated for each pair and stored in the memory of the

panning system.

During the run time the system performs the following steps in an infinite loop:

• New direction vectors p(1,….,n) are defined.

• The right pairs are selected.

• The new gain factors are calculated.

• The old gain factors are cross faded to new ones and the loudspeaker bases are

changed if necessary.

The pair can be selected by calculating unscaled gain factors with Eq. (3.9) using all

selected vector bases, and by selecting the base that does not produce any negative

factors. In practice it is recommended to choose the pair with the highest smallest

factor, because a lack of numerical accuracy during calculation may produce slightly

negative gain factors in some cases. The negative factor must be set to zero before

normalization.
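A minimal sketch of this pair-selection step in MATLAB is given below; the loudspeaker layout and the desired source direction are assumed example values.

% Selecting the loudspeaker pair and gains for 2-D VBAP (assumed layout)
spk = [30 110 180 250 330]*pi/180;        % loudspeaker azimuths (rad)
phi = 95*pi/180;                          % desired virtual source azimuth
p   = [cos(phi) sin(phi)];
best = -inf;  gBest = [0 0];
for k = 1:length(spk)                     % bases are formed from adjacent loudspeakers
    m = mod(k, length(spk)) + 1;
    L = [cos(spk(k)) sin(spk(k)); cos(spk(m)) sin(spk(m))];
    g = p / L;                            % unscaled gain factors, Eq. (3.9)
    if min(g) > best                      % keep the pair with the highest smallest factor
        best = min(g);  gBest = g;
    end
end
gBest(gBest < 0) = 0;                     % clamp slightly negative factors to zero
gBest = gBest / sqrt(sum(gBest.^2));      % normalize so that Eq. (3.1) holds with C = 1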

Topic 3 – Noise Control

CHAPTER 1

SOUND ABSORPTION

1.1 Absorption Coefficient α:

It is a measure of the relative amount of sound energy that will be absorbed when a sound

hits a surface: a value always ranging from 0 to 1 that gives the fraction of the incident

sound energy absorbed by that surface. Multiplied by the surface area in question, it

yields the absorption contributed by that surface.

Figure 1: Absorption coefficient in a room

1.2 Measurement of the absorption coefficient of the different

materials

To set up a standing wave (incident and reflected waves) inside a tube, we place a

speaker driven at a certain frequency and amplitude at one end of the tube and the

absorber material under test against the rigid termination at the other end.

A microphone that can be moved along the tube is used to probe the standing wave

and to detect the positions of its maximum and minimum values.

Figure 2: A standing-wave tube


1.2.1 Procedures:

1- Adjust the speaker to a certain frequency and amplitude, then put the material whose

absorption coefficient is to be measured at the end of the tube opposite the speaker.

2- Move the microphone inside the tube; using the oscilloscope we can detect the

maximum and minimum values Vmax & Vmin.

3- Change the frequency for each material (500, 1000, 2000 & 4000 Hz) and find Vmax

& Vmin each time.

σ = Vmax / Vmin

α = 4σ / (σ + 1)²

Where α is the absorption coefficient. Then calculate the average absorption coefficient.

4- Repeat the experiment for different materials.
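The evaluation of these readings is easily scripted; the following MATLAB lines show the calculation for one material, with the Vmax and Vmin readings being assumed example values rather than measured data.

% Absorption coefficient from standing-wave readings (assumed Vmax/Vmin values)
f    = [500 1000 2000 4000];            % test frequencies (Hz)
Vmax = [1.00 0.95 1.10 1.05];           % assumed oscilloscope maxima (V)
Vmin = [0.25 0.12 0.40 0.45];           % assumed oscilloscope minima (V)
sigma = Vmax ./ Vmin;                   % standing wave ratio at each frequency
alpha = 4*sigma ./ (sigma + 1).^2;      % absorption coefficient at each frequency
alpha_avg = mean(alpha);                % average absorption coefficient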

1.2.2 Laboratory Measurements of Absorption Coefficient

By applying the above steps to several absorbent materials, in order to find the best

performance for sound isolation, we obtained the following results:

Open cell foam

Frequency (Hz)    Absorption Coefficient α
500               0.395
1000              0.89
2000              0.34
4000              0.28
α average = 0.383


Carpet underlay

Frequency (Hz)    Absorption Coefficient α
500               0.135
1000              0.25
2000              0.96
4000              0.49
α average = 0.554

Mineral wool slab

Frequency (Hz)    Absorption Coefficient α
500               0.75
1000              0.785
2000              0.84
4000              0.94
α average = 0.813

Mineral wool slab (low density)

Frequency (Hz)    Absorption Coefficient α
500               0.284
1000              0.64
2000              0.78
4000              0.947
α average = 0.875


1.3 Sound Absorption by Vibrating or Perforated Boundaries

For the acoustics of a room it does not make any difference whether the apparent

absorption of a wall is physically brought about by dissipative processes, i.e. by conversion

of sound energy into heat, or by part of the energy penetrating through the wall into the

outer space. In this respect an open window is a very effective absorber, since it acts as

a sink for all the arriving sound energy.

A less trivial case is that of a wall or some part of a wall forced by a sound field into

vibration with substantial amplitude. Then a part of the wall's vibrational energy is re-

radiated into the outer space. This part is withdrawn from the incident sound energy,

viewed from the interior of the room. Thus the effect is the same as if it were really

absorbed; it can therefore also be described by an absorption coefficient. In practice this

sort of absorption occurs with doors, windows, light partition walls, suspended ceilings,

circus tents and similar walls.


Figure 3: Pressure acting on a layer with mass M (incident pressure p1, reflected pressure p2, transmitted pressure p3)

This process, which may be quite involved especially for oblique sound incidence, is

very important in all problems of sound insulation. From the viewpoint of room

acoustics, it is sufficient, however, to restrict discussions to the simplest case of a plane

sound wave impinging perpendicularly onto the wall, whose dynamic properties are

completely characterized by its mass inertia. Then we need not consider the

propagation of bending waves on the wall.

Let us denote the sound pressures of the incident and the reflected waves on the surface

of a wall by p1 and p2, and the sound pressure of the transmitted wave by p3. The total

pressure acting on the wall is then p1 + p2 - p3. It is balanced by the inertial force iωMυ,

where M denotes the mass per unit area of the wall and υ the velocity of the wall

vibrations. This velocity is equal to the particle velocity of the wave radiated from the

rear side, for which p3 = ρcυ holds. Therefore we have p1 + p2 - ρcυ = iωMυ, from which

we obtain:

Z = iωM + ρc

for the wall impedance. The corresponding absorption coefficient is

α = [1 + (ωM / 2ρc)²]^(-1) ≈ (2ρc / ωM)²

Where c: speed of sound in air

ρ: density of air

ω = 2πf

f: frequency of the incident sound

This simplification is permissible for the case frequently encountered in practice in

which the characteristic impedance of air is small compared with the mass reactance of

the wall. Thus the absorption becomes noticeable only at low frequencies.

At a frequency of 100 Hz the absorption coefficient of a glass pane with 4 mm

thickness is as low as approximately 0.02. For oblique or random incidence this value is

a bit higher due to the better matching between the air and the glass pane, but it is still

very low. Nevertheless, the increase in absorption with decreasing frequency has the

effect that rooms with many windows sometimes sound 'crisp', since the reverberation at

low frequencies is not as long as it would be in the same room without windows.
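The figure of about 0.02 can be checked with the approximation above; in the following MATLAB lines the glass density is an assumed typical value.

% Check of the glass-pane absorption coefficient at 100 Hz (assumed glass density)
rho_air = 1.2;   c = 343;           % density of air (kg/m^3), speed of sound (m/s)
rho_glass = 2500;  d = 0.004;       % assumed glass density (kg/m^3) and 4 mm thickness
M = rho_glass * d;                  % mass per unit area of the pane (kg/m^2)
f = 100;  w = 2*pi*f;
alpha = (2*rho_air*c / (w*M))^2;    % approximately 0.02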

The absorption caused by the vibrations of normal and single walls and ceiling is thus

very low. Matters are different for double and multiple walls provided that the partition

on the side of the room under consideration is mounted in such a way that vibrations are

not hindered and provided that it is not too heavy. Because of the interaction between

the leaves and the enclosed volume of air such a system behaves as a resonance system.


It is a fact of great practical interest that a rigid perforated plate or panel has essentially

the same properties as a mass-loaded wall or foil. Each hole in a plate may be

considered as a short tube or channel with length b; the mass of air contained in it,

divided by the cross section, is ρb.

Because of the contraction of the air stream passing through the hole, the air vibrates

with a greater velocity than that in the sound wave remote from the wall, and hence the

inertial forces of the air included in the hole are increased. The increase is given by the

ratio s2/s1, where s1 is the area of the hole and s2 is the plate area per hole. Hence the

equivalent mass of the perforated panel per unit area is M = ρb'/σ, with σ = s1/s2.

The geometrical tube length b has been replaced by the effective length b', which is

somewhat larger than b: b' = b + 2δb.

The correction term 2δb known as the end correction accounts for the fact that the

streamlines cannot contract or diverge abruptly but only gradually when entering or

leaving a hole. For circular apertures with radius (a) and with relatively large lateral

distances it is given by δb=0.8 a.
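For illustration, the following MATLAB lines evaluate the equivalent mass per unit area of such a panel; the plate thickness, hole radius and hole spacing are assumed example dimensions.

% Equivalent mass per unit area of a perforated panel (assumed dimensions)
rho = 1.2;                   % density of air (kg/m^3)
b   = 0.002;                 % plate thickness = hole length (m)
a   = 0.003;                 % hole radius (m)
s1  = pi*a^2;                % area of one hole (m^2)
s2  = 0.02*0.02;             % plate area per hole (m^2)
sigma = s1/s2;               % perforation ratio
b_eff = b + 2*0.8*a;         % effective length with the end correction 2*delta_b
M = rho*b_eff/sigma;         % equivalent mass per unit area (kg/m^2)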


CHAPTER 2

SOUND TRANSMISSION

2.1 Transmission Coefficient:

Definition: It is the unitless ratio of transmitted to incident sound energy, which ranges

from the ideal limits of 1 to 0. The limit of 1 is practically possible since a transmission

coefficient of 1 implies that all of the sound energy is transmitted through a partition.

This would be the case for an open window or door, where the sound energy has no

obstruction to its path. The other extreme of ‘0’ (implying no sound transmission),

however, is not a practical value since some sound will always travel through a partition.

2.2 Transmission loss:

We can define it as the amount of sound reduced by a partition between a sound source

and receiver. The less sound energy transmitted, the higher the transmission loss. In

other words, the greater the TL the better the wall is at reducing noise.

TL is the principal descriptor for sound insulation. It is a decibel level based on the logarithm of the reciprocal of (1 divided by) the transmission coefficient. Since the logarithm of 1 is 0, a transmission coefficient of 1 translates to a TL of 0; this concurs with the notion that an open air space in a wall allows the free passage of sound. The practical upper limit of TL is roughly 70 dB. TL is frequency dependent; typical partitions have TL values that increase with increasing frequency. When TL is represented by a single number, that number is called the STC.

τ = 1: all sound is transmitted, T.L = 0

τ = 0: no sound is transmitted, T.L = ∞

τ = 0.2: 20% of the incident sound is transmitted
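As a quick check of these limiting cases, the following minimal MATLAB sketch converts example transmission coefficients (assumed values) to transmission loss using T.L = 10 log(1/τ) dB.

% Minimal sketch: relation between transmission coefficient tau and
% transmission loss, TL = 10*log10(1/tau)
tau = [1 0.2 0.001];          % example transmission coefficients
TL  = 10*log10(1./tau)        % gives 0 dB, about 7 dB and 30 dB respectively
tau_back = 10.^(-TL/10);      % the inverse relation recovers tau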

2.3 Sound transmission class STC:

This is a single-number rating for TL that takes the entire frequency range into account. It can be used to compare the acoustical isolation of different barrier materials or partition constructions over the human speech frequency range of roughly 125 to 4000 Hz. The STC number is determined from Transmission Loss values. The standard test method requires minimum room volumes for the test to be correct at low frequencies.

T.L = 10 log (1/ τ) dB


Due to the definition of the rating, STC is only applicable to air-borne sound, and is a

poor guideline for construction to contain mechanical equipment noise, music,

transportation noise, or any other noise source which is more heavily weighted in the

low frequencies than speech. The standard thus only considers frequencies above 125

Hz. Composite partitions composed of elements such as doors and windows as well as

walls will tend to have an STC close to the value of the lowest STC of any component.

In laboratory tests, the STC rating of a given wall section can vary, for example from STC 47 to STC 51, between tests. Any cracks in the wall or holes for electrical or mechanical servicing will

further reduce the actual result; rigid connections between wall surfaces can also

seriously degrade the wall performance. The higher the target STC, the more critical

are the sealing and structural isolation requirements. The builder's best options for

getting a satisfactory STC result are to specify partitions with a laboratory rating of

STC 54 or better.

2.3.1 Determination of STC

To determine the STC value for a material, we should first get the transmission loss TL obtained by using the absorbing material. To determine it, we use two test rooms: a ''source'' room and a ''receiving'' room. The source room contains a full-range test loudspeaker, and with the sound-level meter (the measuring device) we can determine TL as follows.

The first step

The transmitted sound should cover the frequency band over which STC is a good guide for barrier materials, 125 Hz to 4000 Hz, and should be generated in 1/3-octave bands as shown in Table 2-1. Then, using the sound level meter, we measure the transmitted sound level in decibels in the source room.

Table 2-1 1/3-octave bands

Low cutoff frequency (Hz)   Central frequency (Hz)   Upper cutoff frequency (Hz)
112                         125                      141
178                         200                      224
224                         250                      282
355                         400                      447
447                         500                      562
708                         800                      891
891                         1000                     1122
1413                        1600                     1778
1778                        2000                     2239
2239                        2500                     2818
3548                        4000                     4467
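The band edges in Table 2-1 can be reproduced, to within a hertz or two of the rounded nominal values, from the center frequencies, since each 1/3-octave band extends a factor of 2^(1/6) below and above its center. A minimal MATLAB sketch:

% Minimal sketch: lower and upper cutoff frequencies of 1/3-octave bands
% from their (nominal) center frequencies
fc    = [125 200 250 400 500 800 1000 1600 2000 2500 4000];  % center frequencies, Hz
f_low = round(fc / 2^(1/6));     % lower cutoff of each band
f_up  = round(fc * 2^(1/6));     % upper cutoff of each band
disp([f_low' fc' f_up'])         % close to the values listed in Table 2-1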


The next step

Is to put the material on the separating wall between the two rooms; sound transmission between the rooms is then measured again in the receiving room. The sound level measured in the ''receiving'' room is subtracted from the sound level in the ''source'' room. The resulting difference is the transmission loss, or ''TL''.

Now we can use this value of TL to determine the sound transmission class (STC) value by a graphical or by a numerical procedure.

By graphical procedure: the TL is plotted on a graph of 1/3-octave band center frequency versus level (in

dB). Now this is where it can get confusing. To get the STC, the measured curve is

compared to a reference STC curve.

By numerical procedure: the numerical procedure for determining the STC is simpler than the graphical procedure. It consists of the following steps (a MATLAB sketch of these steps, applied to the data of Table 2-3, is given after the list).

Step 1: Record the TL at each one-third octave frequency in column 2.

Step 2: Write the adjustment factor against each frequency in column 3. These factors are based on the STC contour, taking the adjustment as zero at 500 Hz.

Step 3: Add the TL and adjustment factors in column 4. This is the adjusted TL (TLadj).

Step 4: Note the least TLadj value in column 4 and circle it.

Step 5: Add eight to the least TLadj. This is the first trial STC (STCtrial); write this value in column 5.

Step 6: Subtract STCtrial from TLadj, i.e., determine (TLadj − STCtrial). If this value is positive, enter zero in column 5; if negative, enter the actual value.
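A minimal MATLAB sketch of the numerical procedure above, applied to the four-band single-layer cork data of Table 2-3; it reproduces the trial STC of 19 and the column-5 values.

% Minimal sketch of the numerical STC procedure (four bands only)
TL     = [22 17.7 16.3 15];      % measured TL, dB (column 2, Table 2-2)
adjust = [0 -3 -4 -4];           % adjustment factors (column 3)
TLadj  = TL + adjust;            % adjusted TL (column 4)
STCtrial = min(TLadj) + 8;       % step 5: least TLadj plus eight -> 19
deficiency = TLadj - STCtrial;   % step 6: TLadj minus STCtrial
deficiency(deficiency > 0) = 0;  % positive values are entered as zero
disp(STCtrial); disp(deficiency)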

2.3.2 Laboratory measurements of STC:-

Here in our experiment we have two parts. Note that the two parts of this experiment should be carried out under the same conditions in order to obtain correct, comparable results.


Single layer of cork:

The absorbing material has a thickness of 2.5 cm (1 in).

Frequency (Hz)   Transmitted sound level (dB)   Received sound level (dB)   Transmission loss TL (dB)
500              85                             63                          22
1000             85.8                           68.1                        17.7
2000             86.6                           70.3                        16.3
4000             87                             72                          15

Table 2-2 The TL due to a single layer of cork

Frequency (Hz)   TL (dB)   Adjustment factor   Adjusted TL (TLadj)   Trial STC (STCtrial = 19)
500              22        0                   22                    0
1000             17.7      -3                  14.7                  -4.3
2000             16.3      -4                  12.3                  -6.7
4000             15        -4                  11                    -8

Table 2-3 The calculation of STC for a single layer of cork

Double layer of cork:

We use the absorbing material with a thickness of 5 cm (2 in).

Frequency (Hz)   Transmitted sound level (dB)   Received sound level (dB)   Transmission loss TL (dB)
500              85                             62.7                        22.3
1000             85.8                           63.3                        22.5
2000             86.7                           63.7                        23
4000             87                             63.9                        23.1

Table 2-4 The TL due to a double layer of cork


Frequency (Hz)   TL (dB)   Adjustment factor   Adjusted TL (TLadj)   Trial STC (STCtrial = 27)
500              22.3      0                   22.3                  -4.7
1000             22.5      -3                  19.5                  -7.5
2000             23        -4                  19                    -8
4000             23.1      -4                  19.1                  -7.9

Table 2-5 The calculation of STC for a double layer of cork

2.4 Controlling Sound Transmission through Concrete Block Walls

We discuss the various factors that affect sound transmission through different types of

concrete block walls, including single-leaf walls and double-leaf walls. Knowledge of

these factors will assist construction practitioners to design and build walls with high

levels of acoustic performance economically.

Concrete block walls are commonly used to separate dwelling units from each other

and to enclose mechanical equipment rooms in both residential and office buildings

because of their inherent mass and stiffness. Neither of these fundamental properties

(mass and stiffness) can be altered by users. However, there are additional factors that

need to be considered in building high quality walls.

2.4.1 Single-Leaf Concrete Block Walls

Mass per Unit Area

For single-leaf walls, the most important determinant of sound transmission class

(STC) is the mass per unit area: the higher it is the better. A single-leaf concrete block

wall is heavy enough to provide STC ratings of about 45 to 55 when the block surface

is sealed with paint or plaster.

Figure 2-1 shows measured STC ratings for single-leaf concrete block walls from a

number of published sources. The considerable scatter demonstrates that, while

important, block weight is not the only factor that determines the STC for this type of

wall. In the absence of measured data, the regression line in Figure 2-1 can be used to

estimate the STC from the block weight. Alternatively, Table 2-6 provides

representative STC values for 50% solid blocks that have been sealed on at least one

side. It shows the relatively modest effects of significant increases in wall thickness on

STC.

Adding materials such as sand or grout to the cores of the blocks simply increases the

weight; the increase in STC can be estimated from Figure 2-1.



Figure 2-1 also shows that using heavier block to get an STC rating of much more than

50 leads to impracticably heavy constructions.

Figure 2-1 Effect of block weight on STC for single- layer concrete block walls.

Table 2-6 STC ratings for 50% solid normal-weight and lightweight block walls sealed on at least one side

Wall thickness (mm)   Lightweight block STC   Normal-weight block STC
90                    43                      44
140                   44                      46
190                   46                      48
240                   47                      49
290                   49                      51

Effect of Porosity

When the concrete block is porous, sealing the surface with plaster or block sealer

significantly improves the sound insulation; the more porous the block, the greater the

improvement. Improvements of 5 to 10 STC points, or even more, are not uncommon

for some lightweight block walls after sealing. Conversely, normal-weight blocks

usually show little or no improvement after sealing. This improvement in STC in

lightweight blocks is related to the increased airflow resistivity of these blocks.



2.4.2 Double-Leaf Concrete Block Walls

In principle, double-leaf masonry walls can provide excellent sound insulation. They

appear to meet the prescription for an ideal double wall: two independent, heavy layers

separated by an air space. In practice, constructing two block walls that are not solidly

connected somewhere is very difficult. There is always some transmission of sound

energy along the wire ties, the floor, the ceiling and the walls abutting the periphery of

the double-leaf wall, and through other parts of the structure. This transmitted energy,

known as flanking transmission, seriously impairs the effectiveness of the sound

insulation. Flexible ties and physical breaks in the floor, ceiling and abutting walls are

needed to reduce it.

Even if such measures are considered in the design, mortar droppings or other debris

can bridge the gap between the layers and increase sound transmission. Such errors are

usually concealed and impossible to rectify after the wall has been built.

A double-leaf wall that was expected to attain an STC of more than 70 provided an

STC of only 60 because of mortar droppings that connected the two leaves of the wall.

In general, the greater the T.L, the better the performance of the block wall. It is possible to construct high-quality walls that can meet the most acoustically demanding situations; this can be achieved by changing the mass of the wall and the amount of sound-absorbing material.

2.5 Noise reduction

Noise reduction at work areas inside production facilities is essential not only to conserve employees' hearing, but also to help employees accomplish their work efficiently. In order to achieve noise reduction between two rooms, the wall (or floor) separating them must transmit only a small fraction of the sound energy that strikes it. This is achieved by using a good absorbing material on the separating wall, so that less sound energy is transmitted and the transmission loss is higher. In other words, the greater the TL, the better the wall is at reducing noise. The relation between the noise reduction of the wall and the transmission loss is shown below.

N.R = T.L + 10 log (A/S)

where A is the total sound absorption of the receiving room (m² sabins) and S is the area of the separating wall (m²).
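As an illustration of this relation, here is a minimal MATLAB sketch with assumed example values of TL, A and S (not measured values from our experiment).

% Minimal sketch (assumed values): noise reduction between two rooms,
% N.R = T.L + 10*log10(A/S)
TL = 30;                   % transmission loss of the separating wall, dB (assumed)
A  = 20;                   % total absorption of the receiving room, m^2 sabins (assumed)
S  = 10;                   % area of the separating wall, m^2 (assumed)
NR = TL + 10*log10(A/S)    % about 33 dB for these example values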

2.5.1 Determination of noise reduction:-

To determine the noise reduction for the different materials, we should get the difference between the sound level with and without using the material, following the steps below.


1. Generate noise using the noise source; in our experiment we use a DC motor as shown in Figure 2-2.

Figure 2-2

2. Prepare the equipment (internal microphone, external microphone, and sound level meter).

3. Put the absorbing material on the cover and fix it on the noise source as shown in Figure 2-3.

Figure 2-3

4. Then, with the internal microphone, the external microphone and the sound level meter, we obtain the internal sound level and the external sound level.

5. By subtracting the two sound levels we get the noise reduction of the material used.

2.5.2 The laboratory measurements of the noise reduction:-

By applying the above steps to determine the noise reduction of some absorbing materials, in order to find the one that best isolates the sound, we obtained the following results.

Open cell foam

The sound level without the absorbing material = 79.5 dBA.

The sound level with the absorbing material = 74.5 dBA.

So the noise reduction due to using open cell foam as the absorbing material is

N.R = 79.5 − 74.5 = 5 dBA.

Carpet underlay

The sound level without the absorbing material = 79.5 dBA.

The sound level with the absorbing material = 73.5 dBA.

So the noise reduction due to using carpet underlay as the absorbing material is

N.R = 79.5 − 73.5 = 6 dBA.


Mineral wool slab (low density)

The sound level without the absorbing material = 79.5 dBA.

The sound level with the absorbing material = 72.5 dBA.

So the noise reduction due to using the low-density mineral wool slab as the absorbing material is

N.R = 79.5 − 72.5 = 7 dBA.

Mineral wool slab (high density)

The sound level without the absorbing material = 79.5 dBA.

The sound level with the absorbing material = 72 dBA.

So the noise reduction due to using the high-density mineral wool slab as the absorbing material is

N.R = 79.5 − 72 = 7.5 dBA.

2.6 The performance of some absorbing materials:

Now we list the materials with their absorption coefficients and noise reduction values in Table 2-7, in order to identify the best-performing absorbing material and to see the relation between the absorption coefficient and the noise reduction value of the same material.

Table 2-7

Absorbing material                 Absorption coefficient   Noise reduction
Open cell foam                     0.383                    5 dBA
Carpet underlay                    0.554                    6 dBA
Mineral wool slab (low density)    0.813                    7 dBA
Mineral wool slab (high density)   0.875                    7.5 dBA


From the results shown we conclude that, among these absorbing materials, a material with a high absorption coefficient is a good insulation material because of its high noise reduction value; so, of the above four materials, the best one to use is the high-density mineral wool slab.

Topic 4

SPEECH TECHNOLOGY

CHAPTER 1

SPEECH PRODUCTION

1.1 Introduction

In this chapter, we take a look at the physiological and acoustic aspects of speech

production and of speech perception, which will help to prepare the ground for later

chapters on the electronic processing of speech signals.

The human apparatus concerned with speech production and perception is complex and

uses many important organs—the lungs, mouth, nose, ears and their controlling

muscles and the brain.

When we consider that most of these organs serve other purposes such as breathing or

eating it is remarkable that this apparatus has developed to enable us to make such a

wide variety of easily distinguishable speech utterances.

1.2 The human vocal apparatus

Speech sounds are produced when breath is exhaled from the lungs and causes either a

vibration of the vocal cords (for vowels) or turbulence at some point of constriction in

the vocal tract (for consonants). The sounds are affected by the shape of the vocal tract

which influences the harmonics produced. The way in which the vocal cords are

vibrated, the shape of the vocal tract or the site of constriction can all be varied in order

to produce the range of speech sounds with which we are familiar. Figure 1.1 shows the

main human articulatory apparatus.

Figure (1.1) The main human articulatory apparatus.


Figure (1.2) A schematic view of the larynx and vocal cord

1.2.1 Breathing

The use of exhaled breath is essential to the production of speech. In quiet breathing, of

which we are not normally aware, inhalation is achieved by increasing the volume of

the lungs by lowering the diaphragm and expanding the rib-cage. This reduces the air

pressure in the lungs which causes air from outside at higher pressure to enter the lungs.

Expiration is achieved by relaxing the muscles used in inspiration so that the volume of

the lungs is reduced due to the elastic recoil of the muscles, the reverse movement of

the rib-cage and gravity, thus increasing air pressure in the lungs and forcing air out.

The form of expiration achieved by relaxing the inspiratory muscles cannot be

controlled sufficiently to achieve speech or singing. For these activities, the inspiratory

muscles are used during exhalation to control lung pressure and prevent the lungs from

collapsing suddenly; when the volume is reduced below that obtained by elastic recoil,

expiratory muscles are used. Variations in speech intensity needed, for example, to

stress certain words are achieved by varying the pressure in the lungs; in this respect

speech differs from the production of a note sung at constant intensity.

1.2.2 The larynx

There are two main methods by which speech sounds are produced. In the first, called

voicing, the vocal cords located in the larynx are vibrated at a constant frequency by the

air pressure from the lungs. The second gives rise to unvoiced sounds produced by

turbulent how of air at a constriction at one of a number of possible sites in the vocal

tract. A schematic view of the larynx is shown in Fig. 1.2.


Figure (1.3) Frequency range of the human voice

The vocal cords are at rest when open. Their tension and elasticity can be varied; they

can be made thicker or thinner, shorter or longer and then can be either closed, open

wide or held in some position between.

When the vocal cords are held together for voicing they are pushed open for each

glottal pulse by the air pressure from the lungs; closing is due to the cords’ natural

elasticity and to a sudden drop in pressure between the cords (the Bernoulli principle).

Considered in vertical cross-section the cords do not open and close uniformly, but

open and close in a rippling movement from bottom to top as shown in Fig. 1.2.

The frequency of vibration is determined by the tension exerted by the muscles, the

mass and the length of the cords. Men have cords between 17 and 24mm in length;

those of women are between 13 and 17mm. The average fundamental or voicing

frequency (the frequency of the glottal pulses) for men is about 125 Hz, for women

about 200 Hz and for children more than 300 Hz.

When the vocal cords vibrate harmonics are produced at multiples of the fundamental

frequency; the amplitude of the harmonics decreases with increasing frequency. Figure 1.3 shows the frequency range of the human voice.


1.2.3 The vocal tract

For both voiced and unvoiced speech, the sound that is radiated from the speaker’s face is a modification of the original vibration, caused by the resonances of the vocal tract. The

oral tract is highly mobile and the position of the tongue, pharynx, palate, lips and jaw

will all affect the speech sounds made which we hear as radiation from the lips or

nostrils. The nasal tract is immobile, but can be coupled in to form part of the vocal

tract depending on the position of the velum. Combined voiced and unvoiced sounds

can also be produced, as in voiced consonants.

The major speech articulators are shown in Fig. 1.1. When the velum is closed the oral

and pharyngeal cavities combine to form the voice resonator. The tongue can move

both up and down and forward and back, thus altering the shape of the vocal tract; it

can also be used to constrict the tract for the production of consonants. By moving the

lips outward the length of the vocal tract can be increased. The nasal cavity is coupled

in when the velum is opened for sounds such as /m/, in ‘hum’; here the vocal tract is

closed at the lips and acts as a side branch resonator.

1.3 Speech sounds

The smallest element of speech sound which indicates a difference in meaning is called

a phoneme, and is written between slashes as, for example, /p/ in ‘pan’. About 40

phonemes are sufficient to discriminate between all the sounds made in British English;

other languages may use different phoneme sets.

1.3.1 Phonemic representation

The following table shows the International Phonetic Alphabet representation.


1.3.2 Voiced, unvoiced and plosive sounds

As we have seen, voiced sounds, for example the vowel sounds /a /,/e/ and /I/, are

generated by vibration of the vocal cords which are stretched across the top of the

trachea. The pressure of air flow from the lungs causes the vocal cords to vibrate. The

fundamental pitch of the voicing is determined by the air flow, but mainly by the

tension exerted on the cords.

Unvoiced sounds are produced by frication caused by turbulence of air at a constriction

in the vocal tract. The nature of the sound is determined by the site of the constriction

and the position of the articulators (e.g. the tongue or the lips). Examples of unvoiced

sounds are /f/, /s/ or /sh/. Mixed voiced and unvoiced sounds occur where frication and voicing are simultaneous. For example, if voicing is added to the /f/ sound it becomes /v/; if added to /sh/ it becomes /zh/ as in ‘azure’.

Silence occurs within speech, but in fluent speech it does not occur between words

where one might expect it. It most commonly occurs just before the stop in a plosive

sound. The duration of these silences is of the order of 30 to 50 ms.

1.4 Acoustics of speech production

The vibration of the vocal cords in voicing produces sound at a sequence of

frequencies, the natural harmonics, each of which is a multiple of the fundamental

frequency. Our ears will judge the pitch of the sound from the fundamental frequency.

The remaining frequencies have reducing amplitude. However, we never hear this

combination of frequencies because as the sound waves pass through the vocal tract it

resonates well at some frequencies and not so well at others and the strength of the

harmonics that we hear is the result of the change due to these resonances.

1.4.1 Formant frequencies

The resonances in the oral and nasal tract are not fixed, but change because of the

movement of the speech articulators described above. For any position the tract

responds to some of the basic and harmonic frequencies produced by the vocal cords

better than to others. For a particular position of the speech articulators, the lowest

resonance is called the first formant frequency (f1), the next the second formant frequency (f2), and so on.

The formant frequencies for each of the vowel sounds are quite distinct but for each

vowel sound generally have similar values regardless of who is speaking.

For example, for a fundamental frequency of 100 Hz, harmonics will be produced at

200, 300, 400, 500 Hz, etc. For the vowel /ee/ as in ‘he’ typical values for f1 and f2 are

300 and 2100Hz respectively. For the vowel /ar/ as in ‘hard’ the typical values of the

corresponding formant frequencies are about 700 and 900Hz.

The fundamental frequency will vary depending on the person speaking, mood and

emphasis, but it is the magnitude and relationship of the formant frequencies which

make each voiced sound easily recognizable.


Figure (1.4) source filter model of speech

A common approach to understanding speech production and the processing of speech

signals is to use a source filter model of the vocal tract. The model is usually

implemented in electronic form but has also been implemented mechanically. In the

electronic form an input signal is produced either by a pulse generator offering a

harmonic rich repetitive waveform or a broadband noise signal is generated digitally by

means if a pseudorandom binary sequence generator. The input signal is passed through

a filter which has the same characteristics as the vocal tract.

The parameters of the filter can clearly not be kept constant, but must be varied to

correspond to the modification of the vocal tract made by movement of the speech

articulators. The filter thus has time- variant parameters; in practice the rate of variation

is slow, with parameters being updated at intervals of between 5 and 25 ms.

The pitch of the voiced excitation is subject to control as is the amplitude of the output,

in order to provide a fairly close approximation to real speech. The pitch period may

vary from 20ms for a deep-voiced male to 2 ms for a high-pitched child or female.

In the case of the pulse waveform, its spectrum will consist of a regular pattern of lines which are spaced apart by the pitch frequency. For a noise waveform, considered as the

summation of a large number of randomly arriving impulses the distribution will

approximate to a continuous function. In both cases the energy distribution decreases

with an increase of frequency, but there are significant levels up to 15 to 20 kHz.

Frequency shaping is provided by the filter characteristic which is applied to the signal

in the frequency domain. Typically the filter characteristic will consist of a curve where

the various resonances of the vocal tract appear as peaks or poles of transmission. The

frequency at which these poles occur represents the formant frequencies and will

change for the various speech sounds.
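A minimal MATLAB sketch of such a source-filter model, exciting two assumed resonances near the /ee/ formant values quoted earlier (300 Hz and 2100 Hz) with a 100 Hz pulse train; the bandwidths and durations are illustrative assumptions, not values from the project.

% Minimal sketch of the source-filter model: a pitch-period impulse train
% filtered by two second-order resonators (assumed formant values)
Fs = 8000;                          % sampling rate, Hz
F0 = 100;                           % fundamental (pitch) frequency, Hz
t  = 0:1/Fs:0.5;
src = zeros(size(t));
src(1:round(Fs/F0):end) = 1;        % glottal pulse train (voiced excitation)
formants = [300 2100];              % assumed formant frequencies, Hz
bw       = [60 120];                % assumed formant bandwidths, Hz
y = src;
for k = 1:2                         % cascade of two resonators (the "filter")
    r = exp(-pi*bw(k)/Fs);          % pole radius set by the bandwidth
    a = [1 -2*r*cos(2*pi*formants(k)/Fs) r^2];
    y = filter(1, a, y);
end
soundsc(y, Fs);                     % listen to the synthetic vowel-like sound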


1.5 Perception

1.5.1 Pitch and loudness

The pitch at which speech is produced depends on many factors such as the frequency

of excitation of the vocal cords, the size of the voice box or larynx and the length of the

vocal cords. Pitch also varies within words to give more emphasis to certain syllables.

The loudness of speech will generally depend on the circumstances, such as the

emotions of the speaker. Variations in loudness are produced by the muscles of the

larynx which allow a greater flow of air, thus producing the ‘sore throat’ feeling when

the voice has to be raised for a period to overcome noise. Loudness is also affected by

the flow of air from the lungs, which is the principal means of control in singing.

1.5.2 Loudness perception

The sensitivity of the human ear is not the same for tones of all frequencies. It is most

sensitive to frequencies in the range 1000 to 4000 Hz. Low- and high-frequency sounds

require a higher intensity sound to be just audible, and our concept of ‘loudness’ also

varies with frequency.


CHAPTER 2

PROPERTIES OF SPEECH SIGNALS IN TIME DOMAIN

2.1 Introduction

We are now beginning to see how digital signal processing methods can be applied to

speech signals. Our goal in processing the speech signal is to obtain a more convenient

or more useful representation of the information carried by the speech signal.

In this chapter we shall be interested in a set of processing techniques that are

reasonably termed time-domain methods. By this we mean simply that the processing

methods involve the waveform of the speech signal directly.

Some examples of representations of the speech signal in terms of time- domain

measurements include average zero-crossing rate, energy, and the auto- correlation

function. Such representations are attractive because the required digital processing is

simple to implement, and, in spite of this simplicity, the resulting representations

provide a useful basis for estimating important features of the speech signal.

2.2Time-Dependent Processing of Speech

A sequence of samples (8000 samples/sec) representing a typical speech signal is

shown in Figure 2.1. It is evident from this figure that the properties of the speech

signal change with time. For example, the excitation changes between voiced and

unvoiced speech, there is significant variation in the peak amplitude of the signal, and

there is considerable variation of fundamental frequency within voiced regions.

The fact that these variations are so evident in a waveform plot suggests that simple

time-domain processing techniques should be capable of providing useful

representations of such signal features as intensity, excitation mode, pitch, and possibly

even vocal tract parameters such as formant frequencies.

The underlying assumption in most speech processing schemes is that the properties of

the speech signal change relatively slowly with time. This assumption leads to a variety

of ‘short-time’ processing methods in which short segments of the speech signal are

isolated and processed as if they were short segments from a sustained sound with fixed

properties. This is repeated (usually periodically) as often as desired. Often these short

segments, which are sometimes called analysis frame, overlap one another. Two of

these methods are short-time average zero-crossing rate and the short-time

autocorrelation function.


[Figure 2.1: a typical speech waveform with voiced and unvoiced regions labelled]

2.3 Short-Time Average Zero-Crossing Rate

In the context of discrete-time signals, a zero-crossing is said to occur if successive

samples have different algebraic signs. The rate at which zero crossings occur is a

simple measure of the frequency content of a signal. This is particularly true of

narrowband signals. For example, a sinusoidal signal of frequency F0 sampled at a rate

Fs has Fs/F0 samples per cycle of the sine wave. Each cycle has two zero crossings so

that the long-time average rate of zero-crossings is Z = 2F0 /Fs crossings/sample;

Thus, the average zero-crossing rate gives a reasonable way to estimate the frequency

of a sine wave.
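A minimal MATLAB sketch of this idea, estimating the frequency of an assumed 440 Hz test tone from its zero-crossing count using F0 = Z·Fs/2.

% Minimal sketch: frequency of a sine wave from its zero-crossing rate
Fs = 8000; F0 = 440;
n  = 0:Fs-1;                                     % one second of samples
x  = sin(2*pi*F0*n/Fs);
Z  = sum(abs(diff(sign(x))) > 0) / length(x);    % crossings per sample
F0_est = Z*Fs/2                                  % close to 440 Hz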

Speech signals are broadband signals and the interpretation of average zero-crossing

rate is therefore much less precise. However, rough estimates of spectral properties can

be obtained using a representation based on the short- time average zero-crossing rate.

Let us see how the short-time average zero-crossing rate applies to speech signals. The

model for speech production suggests that the energy of voiced speech is concentrated

below about 3 kHz because of the spectrum fall-off introduced by the glottal wave,

whereas for unvoiced speech, most of the energy is found at higher frequencies. Since

high frequencies imply high zero- crossing rates, and low frequencies imply low zero-

crossing rates, there is a strong correlation between zero-crossing rate and energy

distribution with frequency. A reasonable generalization is that if the zero-crossing rate

is high, the speech signal is unvoiced, while if the zero-crossing rate is low, the speech

signal is voiced.

Figure 2.11 shows a histogram of average zero-crossing rates (averaged over 10 msec)

for both voiced and unvoiced speech. Note that a Gaussian curve provides a reasonably

good fit to each distribution. The mean short-time average zero-crossing rate is 49 per

10 msec for unvoiced and 14 per 10 msec for voiced. Clearly the two distributions

overlap so that an unequivocal voiced/unvoiced decision is not possible based on short-

time average zero- crossing rate alone. Nevertheless, such a representation is quite

useful in making this distinction.

Figure (2.1) Distribution of zero-crossings for unvoiced and voiced speech.


An appropriate definition of the short-time average zero-crossing rate is:

Zn = Σ |sgn[x(m)] − sgn[x(m−1)]| w(n−m),  summed over m from −∞ to ∞     (2.1)

where

sgn[x(n)] = 1,   x(n) ≥ 0
sgn[x(n)] = −1,  x(n) < 0     (2.2)

and

w(n) = 1/(2N),  0 ≤ n ≤ N−1
w(n) = 0,       otherwise     (2.3)

This can be computed with simple MATLAB code. This measure can allow the discrimination between voiced and unvoiced regions of speech, or between speech and silence; unvoiced speech has, in general, a higher zero-crossing rate. The signals in the graphs are normalized.

% speechSignal, its time axis t, and the window parameters winLen and
% winOverlap are assumed to be defined; STAzerocross is the project's
% short-time average zero-crossing function
wRect = rectwin(winLen);                        % rectangular analysis window
ZCR = STAzerocross(speechSignal, wRect, winOverlap);
subplot(1,1,1);
plot(t, speechSignal/max(abs(speechSignal)));   % normalized speech waveform
title('speech: He took me by surprise');
hold on;
delay = (winLen - 1)/2;                         % compensate for the window delay
plot(t(delay+1:end-delay), ZCR/max(ZCR), 'r');  % normalized zero-crossing rate
xlabel('Time (sec)');
legend('Speech', 'Average Zero Crossing Rate');
hold off;


2.4 Pitch period estimation

One of the most important parameters in speech analysis, synthesis, and coding

application is fundamental frequency, or pitch of voiced speech.

Pitch frequency is directly related to the speaker and sets the unique characteristic of a

person.

Voicing is generated when the airflow from the lungs is periodically interrupted by

movement of the vocal cords. The time between successive vocal cords openings is

called the fundamental period, or pitch period.

For men, the possible pitch frequency range is usually found somewhere between 50

and 250 Hz, while for women the range usually falls between 120 and 500 Hz. In terms

of period, the range for male is 4 to 40 ms, while for female it is 2 to 8 ms.

The pitch period must be estimated at every frame. By comparing a frame with past samples, it is possible to identify the period with which the signal repeats itself, resulting in an estimate of the actual pitch period. Note that the estimation procedure makes sense only for voiced frames; meaningless results are obtained for unvoiced frames due to their random nature.

Design of a pitch period estimation algorithm is a complex undertaking due to lack of

perfect periodicity, interference with formants of the vocal tract, uncertainty of the

starting instance of a voiced segment, and other real-word elements such as noise and

echo. In practice, pitch period estimation is implemented as a trade-off between

computational complexity and performance. Many techniques have been proposed for

the estimation of pitch period and only two methods are included here.

2.4.1The Autocorrelation Method

Assume we want to perform the estimation on the signal S[n], with n being the time

index. We consider the frame that ends at time instant m, where the length of the frame

is equal to N. Then the autocorrelation value

R[l, m] = Σ S[n] S[n−l],  summed over n = m−N+1 to m

reflects the similarity between the frame S[n], n = m−N+1 to m, and the time-shifted version S[n−l], where l is a positive integer representing a time lag. The range of lag is selected so that it covers a wide range of pitch period values. For instance, for l = 20 to 147 (2.5 to 18.4 ms), the possible pitch frequency values range from 54.4 to 400 Hz at an 8 kHz sampling rate. This range of l is applicable for most speakers and can be encoded using 7 bits, since there are 2⁷ = 128 values of pitch period.


By calculating the autocorrelation values for the entire range of lag, it is possible to find

the value of lag associated with the highest autocorrelation; this lag represents the pitch period estimate, since, in theory, autocorrelation is maximized when the lag is equal

to the pitch period.

The method is summarized with the following pseudo code:

PITCH (m, N)

1. peak ← 0
2. for l ← 20 to 150
3.     autocorrelation ← 0
4.     for n ← m−N+1 to m
5.         autocorrelation ← autocorrelation + S[n] S[n−l]
6.     if autocorrelation > peak
7.         peak ← autocorrelation
8.         lag ← l
9. return lag

It is important to mention that the speech signal is often low-pass filtered before being used as input for pitch period estimation. Since the fundamental frequency associated with voicing is located in the low-frequency region (below roughly 500 Hz), low-pass filtering eliminates the interfering high-frequency components as well as out-of-band noise, leading to a more accurate estimate.

MATLAB Example

[s, Fs, bits] = wavread('sample6');   % read the speech file
autoc = xcorr(s, 'unbiased');         % unbiased autocorrelation
plot(autoc)
x = s(1000:1320);                     % a voiced portion of the waveform
plot(x)
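Complementing the example above, here is a minimal MATLAB sketch of the lag search described in the pseudo code, applied to the extracted voiced frame x (the file 'sample6' and the variables x and Fs from the example above are assumed to be available).

% Minimal sketch: pick the lag with the largest autocorrelation
N    = length(x);
lags = 20:150;                          % candidate pitch periods, in samples
R    = zeros(size(lags));
for i = 1:length(lags)
    l = lags(i);
    R(i) = sum(x(l+1:N) .* x(1:N-l));   % autocorrelation at lag l
end
[peak, idx] = max(R);
pitchPeriod = lags(idx);                % estimated pitch period in samples
pitchHz     = Fs/pitchPeriod            % estimated pitch frequency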


Figure 2.3 A voiced portion of a speech waveform used in pitch period estimation

Figure 2.4 Autocorrelation values obtained from the waveform of figure 2.3

Note: a drawback of the autocorrelation method is the need for multiplications, which are relatively expensive to implement. To overcome this problem, the magnitude difference function described in the next section can be used.


2.4.2 Average magnitude difference function

The average magnitude difference function is defined by

MDF[l, m] = Σ |S[n] − S[n−l]|,  summed over n = m−N+1 to m

For short segments of voiced speech it is reasonable to expect that S[n] − S[n−l] is small for l = 0, ±T, ±2T, … with T being the signal's period.

Thus, by computing the magnitude difference function for the lag range of interest, one

can estimate the period by locating the lag value associated with the minimum

magnitude difference.

Note that no products are needed for the implementation of the present method. The

following pseudo code summarizes the procedure

PITCH_MD (m, N)

1. min ← ∞
2. for l ← 20 to 150
3.     mdf ← 0
4.     for n ← m−N+1 to m
5.         mdf ← mdf + |S[n] − S[n−l]|
6.     if mdf < min
7.         min ← mdf
8.         lag ← l
9. return lag


Figure 2.5 Magnitude difference values obtained from the waveform of figure 2.3

MATLAB Example

[s, Fs, bits] = wavread('sample6');   % read the speech file
x = s(1000:1320);                     % voiced portion of the waveform
for k = 1:240                         % candidate lags
    amdf(k) = 0;
    for n = 1:240-k+1
        amdf(k) = amdf(k) + abs(x(n) - x(n+k-1));
    end
    amdf(k) = amdf(k)/(240-k+1);      % average magnitude difference at lag k-1
end
plot(amdf)


CHAPTER 3

SPEECH REPRESENTATION IN FREQUENCY DOMAIN

3.1 Introduction

As discussed in chapter 1 the vibration of the vocal cords in voicing produces sound at

a sequence of frequencies, the natural harmonics, each of which is a multiple of the

fundamental frequency. Our ears will judge the pitch of the sound from the

fundamental frequency. The smallest element of speech sound that indicates a difference in meaning is a phoneme. The formant frequencies for each phoneme are quite distinct, but for a given phoneme they usually have similar values regardless of who is speaking.

The fundamental frequency will vary depending on the person speaking, mood, and

emphasis, but it is the relationship of the formant frequencies which makes each voiced sound easily recognizable.

In this chapter we will concentrate on the formant analysis of the speech signal, and the

extraction of the formant frequencies of different speech sounds.

3.2 Formant analysis of speech

Formant analysis of speech can be considered a special case of speech analysis. The

objective is to determine the complex natural frequencies of the vocal mechanism as they change temporally. The changes are conditioned by the articulatory deformations of the vocal tract. One approach to such analysis is to consider how the modes are exhibited in the short-time spectrum of the signal. As an initial illustration, the temporal courses of the first three speech formants are traced using various methods for formant frequency extraction.

3.3 Formant frequency extraction

In its simplest visualization, the voiced excitation of a vocal resonance is analogous to

the excitation of a single-tuned circuit by brief periodic pulses. The output is a damped

sinusoid repeated at the pulse rate. The envelope of the amplitude spectrum has a

maximum at a frequency equal essentially to the imaginary part of the complex pole

frequency.

The formant frequency might be measured either by measuring the axis-crossing rate of

the time waveform, or by measuring the frequency of the peak in the spectral envelope.

The resonances of the vocal tract are multiple. The output time waveform is therefore a

superposition of damped sinusoids and the amplitude spectrum generally exhibits

multiple peaks. If the individual resonances can be suitably isolated by appropriate filtering, the axis-crossing measures, the spectral maxima and the moments might all be useful indications of formant frequency. There are several methods for formant frequency extraction; one of them is the spectrum scanning and peak-picking method.


3.3.1 Spectrum scanning and peak-picking method

One approach to real-time automatic formant tracking is simply the detection and

measurement of prominences in the short-time amplitude spectrum. At least two

methods of this type have been designed and implemented. One is based upon locating

points of zero slopes, and the other is the detection of local spectral maxima by

magnitude comparison.

3.3.2 Spectrum scanning

In this method a short-time amplitude spectrum is first produced by a set of bandpass

filters, rectifiers and integrators. The outputs of the filter channels are scanned rapidly

(on the order of 100 times per second) by a sample-and-hold circuit; this produces a

time function which is a step-wise representation of the short time spectrum at a

number (36 in this instance) of frequency values. For each scan, the time function is

differentiated and binary-scaled to produce pulses marking the maxima of the spectrum.

The marking pulses are directed into separate channels by a counter where they sample

a sweep voltage produced at the scanning rate. The sampled voltages are proportional

to the frequencies of the respective spectral maxima and are held during the remainder

of the scan. The resulting step wise voltages are subsequently smoothed by low-pass

filtering.

3.3.3 peak-picking method

The second method segments the short-time spectrum into frequency ranges that ideally contain a single formant. The frequency of the spectral maximum within each segment is then measured. In the simplest form the segment boundaries are fixed; however, additional control circuitry can automatically adjust the boundaries so that the frequency range of a given segment is contingent upon the frequency of the next lower formant. The normalizing circuit clamps the spectral segment either in terms of its peak value or its mean value. The maxima of each segment are selected at a rapid rate, for example 100 times per second, and a voltage proportional to the frequency of the selected channel is delivered to the output. The selections can be time-phased so that

the boundary adjustments of the spectral segments are made sequentially and are set

according to the measured position of the next lower formant.


CHAPTER 4

SPEECH CODING

4.1 Introduction

In general speech coding is a procedure to represent a digitized speech signal using as

few bits as possible, maintaining at the same time a reasonable level of speech quality.

A not so popular name having the same meaning is speech compression.

Speech coding has matured to the point where it now constitutes an important

application area of signal processing. Due to the increasing demand for speech

communication, speech coding technology has received augmenting levels of interest

from the research, standardization, and business communities. Advances in

microelectronics and the vast availability of low-cost programmable processors and dedicated chips have enabled rapid technology transfer from research to product development; this encourages the research community to investigate alternative

schemes for speech coding , with the objectives of overcoming deficiencies and

limitations. The standardization community pursues the establishment of standard

speech coding methods for various applications that will be widely accepted and

implemented by the industry. The business communities capitalize on the ever-

increasing demand and opportunities in the consumer, corporate, and network

environments for speech processing products.

Speech coding is performed using numerous steps or operations specified as an algorithm. An

algorithm is any well-defined computational procedure that takes some value, or set of

values, as input and produces some values, or set of values, as output. An algorithm is

thus a sequence of computational steps that transform the input into the output. Many

signal processing problems-including speech coding- can be formulated as a well-

specified computational problem; hence, a particular coding scheme can be defined as

an algorithm. In general, an algorithm is specified a task. With these instructions, a

computer or processor can execute them so as complete the coding task. The

instructions can also be translated to the structure of a digital circuit, carrying out the

computation directly at the hardware level

4.2 Overview of speech coding

Structure of speech coding system

Figure (4.1) shows the block diagram of speech coding system.


• Most speech coding systems were designed to support telecommunication in the band between 300 and 3400 Hz, so the bandwidth is Fm ≈ 4 kHz.

• According to the Nyquist theorem, to avoid aliasing Fs = 2·Fm, where Fs is the sampling frequency; hence Fs = 8 kHz.

• Fs = 8 kHz is commonly selected as the standard sampling frequency for speech signals. To convert the analog samples to a digital format, 8 bits/sample are used, giving a bit rate of 8 kHz × 8 bits/sample = 64 kbps.

• This rate is what the source encoder attempts to reduce.

• The channel encoder provides error protection to the bit stream before transmission over the communication channel.

• In some vocoders the source encoding and channel encoding are done in a single step.

4.3 Classification of speech coding

According to coding techniques

1. Waveform coders

An attempt is made to preserve the original shape of the signal waveform, and hence

the resultant codes can generally be applied to any signal source. These coders are

better suited for high bit-rate coding, since performance drops sharply with decreasing

bit-rate. In practice, these coders work best at a bit-rate of 32 kbps and higher.

Signal to noise ratio can be utilized to measure the quality of waveform coders. Some

examples of this class include various kinds of pulse code modulation and adaptive

differential PCM (ADPCM)

Waveform coding is applicable to traditional voice networks and voice over ATM.

Two processes are required to digitize an analog signal, as follows:

Sampling. This discretizes the signal in time.

Quantizing. This discretizes the signal in amplitude.

Figure (4.2) quantization levels
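A minimal MATLAB sketch of these two steps on an assumed test tone, using the 8 kHz sampling rate and 8 bits/sample figures given above; the tone itself is an illustrative assumption.

% Minimal sketch: sampling and uniform quantization (PCM-style)
Fs   = 8000;                        % sampling frequency, Hz
n    = 0:Fs-1;                      % one second of samples (sampling step)
x    = 0.8*sin(2*pi*200*n/Fs);      % analog signal modelled as a 200 Hz tone
bits = 8;
L    = 2^bits;                      % number of quantization levels
xq   = round(x*(L/2-1)) / (L/2-1);  % uniform quantization (amplitude step)
bitRate = Fs*bits                   % 64000 bits per second, as stated above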


2. Parametric coders

Within the framework of parametric coders, the speech signal is assumed to be

generated from a model, which is controlled by some parameters. During encoding,

parameters of the model are estimated from the input speech signal, with the parameters

transmitted as the encoded bit-stream. This type of coder makes no attempt to preserve

the original shape of the waveform and hence SNR is a useless quality measure.

Perceptual quality of the decoded speech is directly related to the accuracy and

sophistication of the underlying model. Due to this limitation, the coder is signal

specific, having poor performance for non speech signals. There are several proposed

models in the literature. The most successful, however, is based on linear prediction. In

this approach, the human speech production mechanism is summarized using a time-

varying filter, with the coefficients of the filter found using the linear prediction analysis procedure.

3. Hybrid coders

As its name implies, a hybrid coder combines the strength of a waveform coder with

that of a parametric coder. Like a parametric coder, it relies on a speech production

model; during encoding, parameters of the model are estimated. Additional parameters of

the model are optimized in such a way that decoded speech is as close as possible to the

original waveform, with the closeness often measured by a perceptually weighted error

signal. As in waveform coders, an attempt is made to match the original signal with the

decoded signal in the time domain.

This class dominates the medium bit-rate coders, with the code-excited linear

prediction (CELP) algorithm and its variants the most outstanding representatives.

From a technical perspective, the difference between a hybrid coder and a parametric

coder is that the former attempts to quantize or represent the excitation signal to the

speech production model, which is transmitted as part of the encoder bit-stream. The

latter, however, achieves low bit-rate by discarding all detail information of the

excitation signals; only coarse parameters are extracted.

A hybrid coder tends to behave like a waveform coder for high bit-rate, and like a

parametric coder at low bit-rate, with fair to good quality for medium bit-rate.

4. Single-mode and multimode coders

Single-mode coders are those that apply a specific, fixed encoding mechanism at all

times, leading to a constant bit-rate for the encoded bit-stream. Examples of such

coders are pulse code modulation (PCM) and regular-pulse-excited long-term

prediction (RPE-LTP)

Multimode coders were invented to take advantage of the dynamic nature of the speech

signal, and to adapt to the time-varying network conditions. In this configuration, one

of several distance coding modes is selected, with the selection done by source control,

when the switching obeys some external commands in response to network needs or

channel conditions.


According to bit rate

1. High: > 15 kbps
2. Medium: 5 to 15 kbps
3. Low: 2 to 5 kbps
4. Very low: < 2 kbps

LPC has an output rate of 2.4 kbps, a reduction of more than 26 times with respect to the 64 kbps input.

Classification of LPC:

• low bit rate
• parametric coder

Since LPC offers a good quality vs. bit rate trade-off, it is one of the most commonly used coding techniques, with applications such as:

FS1015 LPC (1984): secure communication in military applications
TIA IS54 VSELP (1989): TDMA cellular telephony
ETSI AMR ACELP (1999): UMTS (Universal Mobile Telecommunications System) in 3GPP (3rd Generation Partnership Project)
VoIP: Voice over IP

So our focus will be linear predictive coding (LPC).


4.4 Linear Predictive Coding (LPC)


• One of the most powerful speech analysis techniques

• One of the most useful methods for encoding good quality speech at a low bit

rate.

• It provides extremely accurate estimates of speech parameters

4.4.1 Basic Principles

A. Physical Model:

LPC starts with the assumption that

The speech signal is produced by a buzzer at the end of a tube.

The glottis (the space between the vocal cords) produces the buzz, which is

characterized by its intensity (loudness) and frequency (pitch).

The vocal tract (the throat and mouth) forms the tube, which is characterized by its

resonances (formants).

B. Mathematical Model:

LPC analyzes the speech signal by estimating the formants, removing their effects from

the speech signal, and estimating the intensity and frequency of the remaining buzz.

The process of removing the formants is called inverse filtering, and the remaining

signal is called the residue. The numbers which describe the formants and the residue

can be stored or transmitted somewhere else.

LPC synthesizes the speech signal by reversing the process: use the residue to create a

source signal, use the formants to create a filter (which represents the tube), and run the

source through the filter, resulting in speech. Because speech signals vary with time,

this process is done on short chunks of the speech signal, which are called frames.

Usually 30 to 50 frames per second give intelligible speech with good compression.


Figure (4.3) Block diagram of the simplified model for speech production. [Figure labels: Vocal Tract (LPC Filter); Air (Innovations); Vocal Cord Vibration (voiced); Vocal Cord Vibration Period (pitch period); Fricatives and Plosives (unvoiced); Air Volume (gain)]


4.4.2 The LPC filter

The LPC filter has the all-pole transfer function

H(z) = S(z)/U(z) = G / (1 − Σ a_k z^(−k)),  summed over k = 1 to p

This is equivalent to saying that the input-output relationship of the filter is given by the linear difference equation:

s(n) = Σ a_k s(n−k) + G u(n),  summed over k = 1 to p

The LPC model can be represented in vector form as:

A = [a_1, a_2, …, a_p, G, V/UV, pitch period]   (13 values for a predictor order p = 10)

A changes every 20 ms or so. At a sampling rate of 8000 samples/sec, 20 ms is equivalent to 160 samples.

• The digital speech signal is divided into frames of size 20 ms. There are 50 frames/second. Each frame of 160 speech samples S = [s(1), …, s(160)] is equivalent to the vector A.

Thus the 160 values of S are compactly represented by the 13 values of A.

• There's almost no perceptual difference in S if:

o For Voiced Sounds (V): the impulse train is shifted (insensitive

to phase change).

o For Unvoiced Sounds (UV): a different white noise sequence is

used.

LPC Synthesis: Given A, generate S

LPC Analysis: Given S, find the best A
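As a rough illustration of these two operations, the following MATLAB sketch (our own illustration, not the project's code; the test frame and the 120 Hz pitch are made-up stand-ins) performs per-frame analysis with lpc() and synthesis with filter():

% A minimal sketch of LPC analysis ("given S, find the best A") and
% synthesis ("given A, generate S") for a single 20 ms frame.
fs = 8000; N = 160; p = 10;                    % 20 ms frame at 8 kHz, 10th-order predictor
n  = (0:N-1)';
s  = filter(1, [1 -0.9], sin(2*pi*120*n/fs));  % stand-in for one voiced speech frame

a  = lpc(s .* hamming(N), p);                  % analysis: A(z) coefficients [1 -a1 ... -ap]
e  = filter(a, 1, s);                          % inverse filtering -> residual (buzz or hiss)
G  = sqrt(sum(e.^2)/N);                        % gain estimate

T     = round(fs/120);                         % assumed pitch period in samples (voiced case)
u     = zeros(N,1);  u(1:T:end) = 1;           % impulse-train excitation
s_hat = filter(G, a, u);                       % synthesized frame from {a, G, V/UV, pitch}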


4.4.3 Problems in LPC model

Problem: the tube isn't just a tube

It may seem surprising that the signal can be characterized by such a simple linear

predictor. It turns out that, in order for this to work, the tube must not have any side

branches.

(In mathematical terms, side branches introduce zeros, which require much more

complex equations.)

For ordinary vowels, the vocal tract is well represented by a single tube. However, for

nasal sounds, the nose cavity forms a side branch. Theoretically, therefore, nasal sounds

require a different and more complicated algorithm. In practice, this difference is partly

ignored and partly dealt with during the encoding of the residue .

Encoding the Source

If the predictor coefficients are accurate, and everything else works right, the speech

signal can be inverse filtered by the predictor, and the result will be the pure source

(buzz). For such a signal, it's fairly easy to extract the frequency and amplitude and

encode them.

However, some consonants are produced with turbulent airflow, resulting in a hissy

sound (fricatives and stop consonants). Fortunately, the predictor equation doesn't care

if the sound source is periodic (buzz) or chaotic (hiss).

This means that for each frame, the LPC encoder must decide if the sound source is

buzz or hiss; if buzz, estimate the frequency; in either case, estimate the intensity; and

encode the information so that the decoder can undo all these steps. This is how LPC-

10e, the algorithm described in federal standard 1015, works: it uses one number to

represent the frequency of the buzz, and the number 0 is understood to represent hiss.

LPC-10e provides intelligible speech transmission at 2400 bits per second.


Problem: the buzz isn't just buzz

Unfortunately, things are not so simple. One reason is that there are speech sounds

which are made with a combination of buzz and hiss sources (for example, the initial

consonants in "this zoo" and the middle consonant in "azure"). Speech sounds like this

will not be reproduced accurately by a simple LPC encoder.

Another problem is that, inevitably, any inaccuracy in the estimation of the formants

means that more speech information gets left in the residue. The aspects of nasal

sounds that don't match the LPC model (as discussed above), for example, will end up

in the residue.


There are other aspects of the speech sound that don't match the LPC model; side

branches introduced by the tongue positions of some consonants, and tracheal (lung)

resonances are some examples.

Therefore, the residue contains important information about how the speech should

sound, and LPC synthesis without this information will result in poor quality speech.

For the best quality results, we could just send the residue signal, and the LPC synthesis

would sound great. Unfortunately, the whole idea of this technique is to compress the

speech signal, and the residue signal takes just as many bits as the original speech

signal, so this would not provide any compression.

Encoding the Residue

Various attempts have been made to encode the residue signal in an efficient way,

providing better quality speech than LPC-10e without increasing the bit rate too much.

The most successful methods use a codebook, a table of typical residue signals, which

is set up by the system designers. In operation, the analyzer compares the residue to all

the entries in the codebook, chooses the entry which is the closest match, and just sends

the code for that entry. The synthesizer receives this code, retrieves the corresponding

residue from the codebook, and uses that to excite the formant filter. Schemes of this

kind are called Code Excited Linear Prediction (CELP).
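As a rough sketch of this codebook search (our own illustration with made-up sizes and stand-in data, not the project's code or any particular CELP standard), the analyzer below picks the stored residue closest to the actual residue and the synthesizer excites the formant filter with it:

% A minimal CELP-style sketch: choose the closest codebook residue, send only its index.
N = 160; K = 64; p = 10;
codebook = randn(N, K);                  % stand-in table of typical residue signals
a = [1, -0.9, 0.2, zeros(1, p-2)];       % stand-in formant (LPC) filter A(z)
residue = randn(N, 1);                   % stand-in residue from inverse filtering

err = sum((codebook - residue).^2, 1);   % squared distance to every codebook entry
[~, idx] = min(err);                     % transmit only this index (plus the LPC parameters)

s_hat = filter(1, a, codebook(:, idx));  % decoder: excite the formant filter with the entry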

4.5 Basic Principles of Linear Predictive Analysis

The particular form of this model that is appropriate for the discussion of linear

predictive analysis is depicted in Fig. 4.3. In this case, the composite spectrum effects

of radiation, vocal tract, and glottal excitation are represented by a time-varying digital

filter whose steady-state system function is of the form

H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}     (4.1)

This system is excited by an impulse train for voiced speech or a random noise

sequence for unvoiced speech. Thus, the parameters of this model are:

Voiced/unvoiced classification, pitch period for voiced speech, gain parameter G, and

the coefficients ak of the digital filter. These parameters, of course, all vary slowly

with time.

The pitch period and voiced/unvoiced classification can be estimated using one of the

many methods. This simplified all-pole model is a natural representation of non-nasal

voiced sounds, but for nasals and fricative sounds, the detailed acoustic theory calls for

both poles and zeros in the vocal tract transfer function. We shall see, however, that if

the order p is high enough, the all-pole model provides a good representation for almost

all the sounds of speech.


The major advantage of this model is that the gain parameter, G, and the filter

coefficients ak can be estimated in a very straightforward and computationally

efficient manner by the method of linear predictive analysis.

For the system of Fig. 4.3, the speech samples s(n) are related to the excitation u(n) by

the simple difference equation

s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\, u(n)     (4.2)

A linear predictor with prediction coefficients, ak is defined as a system whose output is

\tilde{s}(n) = \sum_{k=1}^{p} \alpha_k\, s(n-k)     (4.3)

Such systems are used to reduce the variance of the difference signal in differential quantization schemes. The system function of a pth-order linear predictor is the polynomial

P(z) = \sum_{k=1}^{p} \alpha_k\, z^{-k}     (4.4)

The prediction error, e(n), is defined as

e(n) = s(n) - \tilde{s}(n) = s(n) - \sum_{k=1}^{p} \alpha_k\, s(n-k)     (4.5)

From Eq. (4.5) it can be seen that the prediction error sequence is the output of a

system whose transfer function is

A(z) = 1 - \sum_{k=1}^{p} \alpha_k\, z^{-k}     (4.6)

It can be seen by comparing Eqs. (4.2) and (4.5) that if the speech signal obeys the model of Eq. (4.2) exactly, and if \alpha_k = a_k, then e(n) = G u(n).

Thus, the prediction error filter, A(z), will be an inverse filter for the system, H(z), of Eq. (4.1), i.e.,

H(z) = \frac{G}{A(z)}     (4.7)

The basic problem of linear prediction analysis is to determine a set of predictor

coefficients (ak) directly from the speech signal in such a manner as to obtain a good

estimate of the spectral properties of the speech signal through the use of Eq. (4.7).

Because of the time-varying nature of the speech signal the predictor coefficients must

be estimated from short segments of the speech signal. The basic approach is to find a

set of predictor coefficients that will minimize the mean-squared prediction error over a

short segment of the speech waveform. The resulting parameters are then assumed to be

the parameters of the system function, H(z), in the model for speech production.


That this approach will lead to useful results may not be immediately obvious, but it

can be justified in several ways. First, recall that if \alpha_k = a_k, then e(n) = G u(n). For

voiced speech this means that e(n) would consist of a train of impulses; i.e., e(n) would

be small most of the time. Thus, finding ak’s that minimize prediction error seems

consistent with this observation. A second motivation for this approach follows from

the fact that if a signal is generated by Eq. (4.2) with non-time-varying coefficients and

excited either by a single impulse or by a stationary white noise input, then it can be

shown that the predictor coefficients that result from minimizing the mean squared

prediction error (over all time) are identical to the coefficients of Eq. (4.2). A third very

pragmatic justification for using the minimum mean-squared prediction error as a basis

for estimating the model parameters is that this approach leads to a set of linear

equations that can be efficiently solved to obtain the predictor parameters. More

importantly the resulting parameters comprise a very useful and accurate representation

of the speech signal as we shall see in this chapter.

The short-time average prediction error is defined as

E_n = \sum_m e_n^2(m)     (4.8)

    = \sum_m \left( s_n(m) - \tilde{s}_n(m) \right)^2     (4.9)

    = \sum_m \left( s_n(m) - \sum_{k=1}^{p} \alpha_k\, s_n(m-k) \right)^2     (4.10)

where s_n(m) is a segment of speech that has been selected in the vicinity of sample n, i.e.,

s_n(m) = s(m+n)     (4.11)

The range of summation in Eqs. (4.8)-(4.10) is temporarily left unspecified, but since

we wish to develop a short-time analysis technique, the sum will always be over a finite

interval. Also note that to obtain an average we should divide by the length of the

speech segment. However, this constant is irrelevant to the set of linear equations that

we will obtain and therefore is omitted. We can find the values of \alpha_k that minimize E_n in Eq. (4.10) by setting \partial E_n / \partial \alpha_i = 0 for i = 1, 2, ..., p, thereby obtaining the equations

\sum_m s_n(m-i)\, s_n(m) = \sum_{k=1}^{p} \hat{\alpha}_k \sum_m s_n(m-i)\, s_n(m-k), \qquad 1 \le i \le p     (4.12)

where \hat{\alpha}_k are the values of \alpha_k that minimize E_n. (Since \hat{\alpha}_k is unique, we will drop the caret and use \alpha_k to denote the values that minimize E_n.) If we define

\phi_n(i,k) = \sum_m s_n(m-i)\, s_n(m-k)     (4.13)


then Eq. (4.12) can be written more compactly as

\sum_{k=1}^{p} \alpha_k\, \phi_n(i,k) = \phi_n(i,0), \qquad i = 1, 2, ..., p     (4.14)

This set of p equations in p unknowns can be solved in an efficient manner for the unknown predictor coefficients \alpha_k that minimize the average squared prediction error for the segment s_n(m). Using Eqs. (4.10) and (4.12), the minimum mean-squared prediction error can be shown to be

E_n = \sum_m s_n^2(m) - \sum_{k=1}^{p} \alpha_k \sum_m s_n(m)\, s_n(m-k)     (4.15)

And using Eq. (4.14) we can express E_n as

E_n = \phi_n(0,0) - \sum_{k=1}^{p} \alpha_k\, \phi_n(0,k)     (4.16)

Thus the total minimum error consists of a fixed component, and a component which

depends on the predictor coefficients.

To solve for the optimum predictor coefficients, we must first compute the quantities \phi_n(i,k) for 1 ≤ i ≤ p and 0 ≤ k ≤ p. Once this is done we only have to solve Eq. (4.14) to obtain the \alpha_k's. Thus, in principle, linear prediction analysis is very straightforward. However, the details of the computation of \phi_n(i,k) and the subsequent solution of the equations are somewhat intricate and further discussion is required.

So far we have not explicitly indicated the limits on the sums in Eqs. (4.8)-(4.10) and in

Eq. (4.12); however it should be emphasized that the limits on the sum in Eq. (4.12) are

identical to the limits assumed for the mean squared prediction error in Eqs. (4.8)-

(4.10). As we have stated, if we wish to develop a short-time analysis procedure, the

limits must be over a finite interval. There are two basic approaches to this question,

and we shall see below that two methods for linear predictive analysis emerge out of a

consideration of the limits of summation and the definition of the waveform segment

sn(m).

4.5.1 The autocorrelation method

One approach to determining the limits on the sums in Eqs. (4.8)-(4.10) and Eq. (4.12) is to assume that the waveform segment, s_n(m), is identically zero outside the interval 0 ≤ m ≤ N-1. This can be conveniently expressed as

s_n(m) = s(m+n)\, w(m)     (4.17)

where w(m) is a finite-length window (e.g. a Hamming window) that is identically zero outside the interval 0 ≤ m ≤ N-1.


The effect of this assumption on the limits of summation for the expressions for E_n can be seen by considering Eq. (4.5). Clearly, if s_n(m) is nonzero only for 0 ≤ m ≤ N-1, then the corresponding prediction error, e_n(m), for a pth-order predictor will be nonzero over the interval 0 ≤ m ≤ N-1+p. Thus, for this case E_n is properly expressed as

E_n = \sum_{m=0}^{N+p-1} e_n^2(m)     (4.18)

Alternatively, we could have simply indicated that the sum should be over all nonzero values by summing from -\infty to +\infty.

Returning to Eq. (4.5), it can be seen that the prediction error is likely to be large at the beginning of the interval (specifically 0 ≤ m ≤ p-1) because we are trying to predict the signal from samples that have arbitrarily been set to zero. Likewise the error can be large at the end of the interval (specifically N ≤ m ≤ N+p-1) because we are trying to predict zero from samples that are nonzero. For this reason, a window which tapers the segment s_n(m) to zero is generally used for w(m) in Eq. (4.17).

The limits on the expression for \phi_n(i,k) in Eq. (4.13) are identical to those of Eq. (4.18). However, because s_n(m) is identically zero outside the interval 0 ≤ m ≤ N-1, it is simple to show that

\phi_n(i,k) = \sum_{m=0}^{N+p-1} s_n(m-i)\, s_n(m-k), \qquad 1 \le i \le p,\ 0 \le k \le p     (4.19a)

which can be expressed as

\phi_n(i,k) = \sum_{m=0}^{N-1-(i-k)} s_n(m)\, s_n(m+i-k), \qquad 1 \le i \le p,\ 0 \le k \le p     (4.19b)

Furthermore it can be seen that in this case \phi_n(i,k) is identical to the short-time autocorrelation function evaluated for (i-k). That is,

\phi_n(i,k) = R_n(i-k)     (4.20)

where

R_n(k) = \sum_{m=0}^{N-1-k} s_n(m)\, s_n(m+k)     (4.21)

Since R_n(k) is an even function, it follows that

\phi_n(i,k) = R_n(|i-k|), \qquad i = 1, 2, ..., p;\ k = 0, 1, ..., p     (4.22)

Therefore Eq. (4.14) can be expressed as

\sum_{k=1}^{p} \alpha_k\, R_n(|i-k|) = R_n(i), \qquad 1 \le i \le p     (4.23)

Similarly, the minimum mean-squared prediction error of Eq. (4.16) takes the form

E_n = R_n(0) - \sum_{k=1}^{p} \alpha_k\, R_n(k)     (4.24)

The set of equations given by Eq. (4.23) can be expressed in matrix form as

\begin{bmatrix}
R_n(0)   & R_n(1)   & R_n(2)   & \cdots & R_n(p-1) \\
R_n(1)   & R_n(0)   & R_n(1)   & \cdots & R_n(p-2) \\
R_n(2)   & R_n(1)   & R_n(0)   & \cdots & R_n(p-3) \\
\vdots   & \vdots   & \vdots   &        & \vdots   \\
R_n(p-1) & R_n(p-2) & R_n(p-3) & \cdots & R_n(0)
\end{bmatrix}
\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \vdots \\ \alpha_p \end{bmatrix}
=
\begin{bmatrix} R_n(1) \\ R_n(2) \\ R_n(3) \\ \vdots \\ R_n(p) \end{bmatrix}     (4.25)

The p x p matrix of autocorrelation values is a Toeplitz matrix; i.e., it is symmetric and all the elements along a given diagonal are equal. This special property can be exploited (e.g. by the Levinson-Durbin recursion) to obtain an efficient algorithm for the solution of Eq. (4.23).
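The autocorrelation method is straightforward to sketch in MATLAB. The following fragment (our own illustration with a stand-in frame, not the project's code) computes R_n(k) as in Eq. (4.21) and solves the Toeplitz system of Eq. (4.25):

% A minimal sketch of the autocorrelation method of linear predictive analysis.
N = 160; p = 10;
sn = randn(N,1) .* hamming(N);           % stand-in for a windowed 20 ms segment s_n(m)

R = zeros(p+1,1);
for k = 0:p                              % short-time autocorrelation, Eq. (4.21)
    R(k+1) = sum(sn(1:N-k) .* sn(1+k:N));
end

alpha = toeplitz(R(1:p)) \ R(2:p+1);     % solve the Toeplitz system of Eq. (4.25)
En    = R(1) - alpha' * R(2:p+1);        % minimum prediction error, Eq. (4.24)

% lpc() solves the same normal equations (via the Levinson-Durbin recursion),
% returning A(z) = [1 -alpha_1 ... -alpha_p].
a = lpc(sn, p);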

4.5.2 The covariance method

The second basic approach to defining the speech segment s_n(m) and the limits on the sums is to fix the interval over which the mean-squared error is computed and then consider the effect on the computation of \phi_n(i,k). That is, if we define

E_n = \sum_{m=0}^{N-1} e_n^2(m)     (4.26)

then \phi_n(i,k) becomes

\phi_n(i,k) = \sum_{m=0}^{N-1} s_n(m-i)\, s_n(m-k), \qquad 1 \le i \le p,\ 0 \le k \le p     (4.27)

In this case, if we change the index of summation we can express \phi_n(i,k) as either

\phi_n(i,k) = \sum_{m=-i}^{N-1-i} s_n(m)\, s_n(m+i-k), \qquad 1 \le i \le p,\ 0 \le k \le p     (4.28a)

or

\phi_n(i,k) = \sum_{m=-k}^{N-1-k} s_n(m)\, s_n(m+k-i), \qquad 1 \le i \le p,\ 0 \le k \le p     (4.28b)

Although the equations look very similar to Eq. (4.19b), we see that the limits of summation are not the same. Equations (4.28) call for values of s_n(m) outside the interval 0 ≤ m ≤ N-1. Indeed, to evaluate \phi_n(i,k) for all of the required values of i and k requires that we use values of s_n(m) in the interval -p ≤ m ≤ N-1. If we are to be consistent with the limits on E_n in Eq. (4.26), then we have no choice but to supply the required values. In this case it does not make sense to taper the segment of speech to zero at the ends as in the autocorrelation method, since the necessary values are made available from outside the interval 0 ≤ m ≤ N-1. Clearly, this approach is very similar to what was called the modified autocorrelation function in Chapter 2; it leads to a function which is not a true autocorrelation function, but rather the cross-correlation between two very similar, but not identical, finite-length segments of the speech wave. Although the differences between Eq. (4.28) and Eq. (4.19b) appear to be minor computational details, the set of equations

\sum_{k=1}^{p} \alpha_k\, \phi_n(i,k) = \phi_n(i,0), \qquad i = 1, 2, ..., p     (4.29a)

have significantly different properties that strongly affect the method of solution and the properties of the resulting optimum predictor.

In matrix form these equations become

\begin{bmatrix}
\phi_n(1,1) & \phi_n(1,2) & \phi_n(1,3) & \cdots & \phi_n(1,p) \\
\phi_n(2,1) & \phi_n(2,2) & \phi_n(2,3) & \cdots & \phi_n(2,p) \\
\phi_n(3,1) & \phi_n(3,2) & \phi_n(3,3) & \cdots & \phi_n(3,p) \\
\vdots      &             &             &        & \vdots      \\
\phi_n(p,1) & \phi_n(p,2) & \phi_n(p,3) & \cdots & \phi_n(p,p)
\end{bmatrix}
\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \vdots \\ \alpha_p \end{bmatrix}
=
\begin{bmatrix} \phi_n(1,0) \\ \phi_n(2,0) \\ \phi_n(3,0) \\ \vdots \\ \phi_n(p,0) \end{bmatrix}     (4.29b)

In this case, since \phi_n(i,k) = \phi_n(k,i) (see Eq. (4.28)), the p x p matrix of correlation-like values is symmetric but not Toeplitz. Indeed, it can be seen that the diagonal elements are related by the equation

\phi_n(i+1, k+1) = \phi_n(i,k) + s_n(-i-1)\, s_n(-k-1) - s_n(N-1-i)\, s_n(N-1-k)     (4.30)

The method of analysis based upon this method of computation of \phi_n(i,k) has come to be known as the covariance method because the matrix of values \phi_n(i,k) has the properties of a covariance matrix.
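A corresponding sketch of the covariance method (our own illustration with stand-in data, not the project's code) makes the difference explicit: the frame needs p extra samples of history, no window is applied, and the resulting matrix is symmetric but not Toeplitz:

% A minimal sketch of the covariance method, Eqs. (4.27) and (4.29b).
N = 160; p = 10;
x  = randn(N+p, 1);                      % stand-in: p history samples followed by the frame
sn = @(m) x(m + p + 1);                  % s_n(m) for m = -p ... N-1 (MATLAB is 1-indexed)

phi = zeros(p+1, p+1);                   % phi(i+1,k+1) holds phi_n(i,k)
m = 0:N-1;
for i = 0:p
    for k = 0:p
        phi(i+1, k+1) = sum(sn(m-i) .* sn(m-k));   % Eq. (4.27)
    end
end

alpha = phi(2:end, 2:end) \ phi(2:end, 1);   % solve Eq. (4.29b)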


CHAPTER 5

APPLICATIONS

5.1 Speech synthesis

Speech analysis is that portion of voice processing that converts speech to digital forms suitable for storage on computer systems and transmission on digital (data or telecommunications) networks.

Speech synthesis is that portion of voice processing that reconverts speech data from a digital form to a form suitable for human usage. These functions are essentially the inverse of speech analysis.

Speech analysis processes are also called digital speech encoding (or coding), and

speech synthesis is also called speech decoding. The objective of any speech-coding

scheme is to produce a string of voice codes of minimum data rate, so that a synthesizer

can reconstruct an accurate facsimile of the original speech in an effective manner,

while optimizing the transmission (or storage) medium.

5.1.1 Formant – frequency Extraction

We use the spectrum scanning and peak-picking method for the extraction of the formant frequencies. We use voice signals of one phoneme and divide each signal into frames of 20 ms each. From the spectrum of each frame we extract the pitch and the first three formant frequencies characterizing this phoneme.
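A minimal MATLAB sketch of this spectrum-scanning / peak-picking step is given below (our own illustration, not the project's code; the frame is a stand-in and the pitch search range simply covers the 150-400 Hz span quoted later):

% Frame the phoneme, take the spectrum, pick the pitch and the strongest peaks.
fs = 8000; N = 160;                         % 20 ms frames at 8 kHz
frame = randn(N, 1);                        % stand-in for one frame of the phoneme
S = abs(fft(frame .* hamming(N), 1024));    % magnitude spectrum of the frame
f = (0:511) * fs/1024;                      % frequency axis (first half of the FFT)

band = (f >= 150 & f <= 400);               % fundamental-frequency search range
[~, i0] = max(S(1:512) .* band');           % crude pitch pick: strongest bin in the range
f0 = f(i0);

[pks, locs] = findpeaks(S(1:512));          % spectral peaks
[~, order]  = sort(pks, 'descend');
formants = sort(f(locs(order(1:3))));       % three strongest peaks taken as F1, F2, F3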

Figure (5.1) frames of 20ms


Figure (5.2) spectrum of frame

Figure (5.3) zoom to show fundamental frequency


• The range of the fundamental frequency (f0) is between 150 and 250 Hz for males and between 250 and 400 Hz for females.

Then we extract the part of the frame containing the highest energy content (the part around the three formant frequencies) and we neglect the rest of the frame; we call this part the "info". In this way we reduce the amount of signal that we send.

We try to reconstruct the voice signal by using the extracted part. At the receiver, we reconstruct the voice signal using the smallest part of the frame that gives a moderately good quality that allows for voice recognition.

We try to create a synthetic frame using the pitch and the first three formant frequencies. By using this synthetic frame, we create a synthesized voice signal.

Figure (5.4) the part of the frame containing the highest energy content


Figure (5.5) The info (previous figure) in the time domain

- First synthetic frame: reconstructed by convolving the info with deltas (impulses) at the pitch period

Figure (5.6) Deltas at pitch period


The synthesized signal is built but has poor quality.

Figure (5.7) First synthesis frame

- Second synthetic frame: reconstructed by convolving the info with triangles whose vertices are at the pitch period.

Figure (5.8) Triangles with vertices at the pitch period


The synthesized signal is built and has good quality.

Figure (5.9) Second synthesis frame

- Third synthetic frame: reconstructed by convolving the info with Hamming windows at the pitch period.

Figure (5.10) Hamming windows at pitch period


The synthesized signal is built and has good quality.

Figure (5.11) Third synthesis frame
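The three reconstructions above amount to convolving the info with different pulse shapes placed at the pitch period. A minimal MATLAB sketch (our own illustration with stand-in data, not the project's code) is:

% Rebuild a synthetic frame by convolving the "info" with a pulse train.
fs = 8000; N = 160;
info = randn(20, 1);                         % stand-in for the high-energy part of the frame
T = round(fs/200);                           % assumed pitch period (f0 = 200 Hz)

deltas = zeros(N,1); deltas(1:T:end) = 1;    % impulses at the pitch period
tri = conv(deltas, triang(T));               % triangles with vertices at the pitch period
ham = conv(deltas, hamming(T));              % Hamming windows at the pitch period

frame1 = conv(info, deltas);                 % first synthetic frame (poor quality)
frame2 = conv(info, tri(1:N));               % second synthetic frame (good quality)
frame3 = conv(info, ham(1:N));               % third synthetic frame (good quality)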

5.1.2 LPC

We use the LPC (linear predictive coding) technique to synthesize the voice signal. We use voice signals of one phoneme and divide each signal into frames of 20 ms each. We determine the parameters of each frame (here we take 20 parameters only) using the LPC filter in MATLAB. These parameters are what we send from the transmitter.

Figure (5.12) frame of 20 ms


• Note that when the number of parameters is increased, the quality increases too.

• u(n) is the glottal pulse (the innovation); we get the glottal pulse by inverse filtering the frame with the LPC filter of the equation below.

Figure (5.13) Glottal pulse of the frame of Fig. (5.12)

We can transmit this glottal pulse and reconstruct the signal at the receiver. At the receiver, we reconstruct the voice signal using the LPC parameters and the glottal pulse for voice recognition. However, we need to reduce the data which we send, so we use the next approach.

H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}
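In MATLAB this step can be sketched as follows (our own illustration with a stand-in frame, not the project's code): the lpc() call gives the 20 parameters and inverse filtering gives the glottal pulse, which reproduces the frame when passed back through the synthesis filter:

% 20 LPC parameters per 20 ms frame and the glottal pulse by inverse filtering.
fs = 8000; N = 160; p = 20;
frame = randn(N, 1);                     % stand-in for one 20 ms frame of the phoneme

a = lpc(frame, p);                       % the LPC parameters (plus the leading 1)
u = filter(a, 1, frame);                 % glottal pulse / innovation: frame filtered by A(z)

s_hat = filter(1, a, u);                 % sending u and a reproduces the frame
% Reduced-rate variants: keep only the highest-energy part of u, or regenerate a
% periodic excitation (triangles / Hamming windows at the pitch period) at the receiver.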


- First synthetic frame: reconstructed using the part of the glottal pulse containing the highest energy content.

Figure (5.14) Part of the glottal pulse containing the highest energy content

The synthesized signal has very good quality, as shown.

Figure (5.15) First synthesis frame

However, we need to transmit parameters only, so we generate a periodic excitation at the receiver and synthesize the signal.


- Second approach: generate triangles at the pitch period.

Figure (5.16) Triangles with vertices at the pitch period

The synthesized signal has medium quality, as shown.

Figure (5.17) Second synthesis frame


- Third approach: generate Hamming windows at the pitch period.

Figure (5.18) Hamming windows at the pitch period

The synthesized signal has medium quality, as shown.

Figure (5.19) Third synthesis frame


Figure (5.20) LPC parameters from two different speakers

5.2 Speaker identification using LPC

Speaker recognition is the task of recognizing people from their voices. Such systems

extract features from speech, model them and use them to identify the person from

his/her voice.

In this application we first inspect which output of the LPC filter analysis is more effective in speaker identification: the LPC parameters or the glottal pulse.

Part 1

a. We take two voice samples from two different speakers, both samples for the same phoneme, preferably a male sample and a female sample to emphasize the difference in perception.

b. We pass each sample through an LPC filter getting the LPC parameters, and the

glottal pulse for both speakers.

c. The next step is to swap the glottal pulse of the two speakers and reconstruct the

voice signal of each speaker using his LPC parameters and the glottal pulse of

the other speaker.

d. After the reconstruction of the new voice signals we will find that the glottal

pulse is the effective parameter in voice recognition.

e. The next figure shows that the LPC parameters of both speakers have very close values, which supports our conclusion.


f. While the comparison of the LPC parameters did not show a big difference between the two speakers, the comparison between the glottal pulses of the two speakers showed a much bigger difference, confirming our conclusion that the glottal pulse has a much greater weight in the identification of the speaker. This is shown in the next figure.

Figure (5.21) glottal pulses from different speakers


Part 2

The second portion of our application handles the identification part after

concluding that the glottal pulse is the effective parameter.

a. First we construct a code book containing voice samples from different

speakers but for the same phoneme.

b. Secondly, we take a voice sample from one of the speakers for the same phoneme used in the code book construction, and we consider this signal as our input signal for which the identification of the speaker needs to be made.

c. The identification process is done using the distortion measure

technique.

d. Distortion measure = \sum (S_n - S_i)^2

   where
   S_n is one of the samples saved in the code book, n = 1, 2, ..., N,
   N is the number of samples saved in the code book (number of speakers), and
   S_i is the input signal.

e. We find the code book signal S_n that gives the least distortion with respect to the input, and we identify the speaker as the speaker of that signal S_n saved in the code book.

f. We can also use an input signal from a speaker not found in the code book; in this case the program will calculate the distortion measure between this signal and the signals saved in the code book and choose the speaker from the code book with the least distortion measure.
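A minimal MATLAB sketch of this search (our own illustration with stand-in signals, not the project's code) is:

% Identify the speaker by the least-distortion code book entry.
L = 160; Nspk = 4;
codebook = randn(L, Nspk);               % stand-in: one stored sample per speaker (same phoneme)
Si = codebook(:, 3) + 0.05*randn(L, 1);  % stand-in input signal, actually from speaker 3

D = zeros(Nspk, 1);
for n = 1:Nspk
    D(n) = sum((codebook(:, n) - Si).^2);    % distortion measure = sum (S_n - S_i)^2
end
[~, speaker] = min(D);                   % identified speaker: least distortion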


5.3 Introduction to VOIP

VoIP (voice over IP - that is, voice delivered using the Internet Protocol) is a term used

in IP telephony for a set of facilities for managing the delivery of voice information

using the Internet Protocol (IP). In general, this means sending voice information in

digital form in discrete packets rather than in the traditional circuit-committed protocols

of the public switched telephone network (PSTN). A major advantage of VoIP and

Internet telephony is that it avoids the tolls charged by ordinary telephone service.

VoIP, now used somewhat generally, derives from the VoIP Forum, an effort by major

equipment providers, to promote the use of ITU-T H.323, the standard for sending

voice (audio) and video using IP on the public Internet and within an intranet. The

Forum also promotes the use of directory service standards so that users can locate

other users and the use of touch-tone signals for automatic call distribution and voice

mail.

In addition to IP, VoIP uses the Real-time Transport Protocol (RTP) to help ensure that packets get

delivered in a timely way. Using public networks, it is currently difficult to guarantee

Quality of Service (QoS). Better service is possible with private networks managed by

an enterprise or by an Internet telephony service provider (ITSP).

A technique used by at least one equipment manufacturer, Adir Technologies (formerly

Netspeak), to help ensure faster packet delivery is to use ping to contact all possible

network gateway computers that have access to the public network and choose the

fastest path before establishing a Transmission Control Protocol (TCP) sockets

connection with the other end.

Using VoIP, an enterprise positions a "VoIP device" at a gateway. The gateway

receives packetized voice transmissions from users within the company and then routes

them to other parts of its intranet (local area or wide area network) or, using a T-carrier

system or E-carrier interface, sends them over the public switched telephone network.

5.3.1 VoIP Standards

• ITU-T H.320 Standards for Video Conferencing,

• H.323 ITU Standards

• H.324 ITU Standards

• VPIM Technical Specification


5.3.2 System architecture

Figure (5.22) Overview of VOIP network


5.3.3 Coding technique in VOIP systems

Codecs are software drivers that are used to encode the speech in a compact enough form that it can be sent in real time across the Internet using the bandwidth available.

Codecs are not something that VOIP users normally need to worry about, as the VOIP

clients at each end of the connection negotiate between them which one to use.

VOIP software or hardware may give you the option to specify the codecs you prefer to

use. This allows you to make a choice between voice quality and network bandwidth

usage, which might be necessary if you want to allow multiple simultaneous calls to be

held using an ordinary broadband connection. Your selection is unlikely to make any

noticeable difference when talking to PSTN users, because the lowest bandwidth part

of the connection will always limit the quality achievable, but VOIP-to-VOIP calls

using a broadband Internet connection are capable of delivering much better quality

than the plain old telephone system.

A broadband connection is desirable to use VOIP, though it is certainly possible to use

it over a dial-up modem connection if a low-bandwidth, low-fidelity codec is chosen.

The table below lists some commonly used codecs.

Codec     Algorithm                                                      Bit rate (kbps)   Bandwidth (kbps)
G.711     Pulse Code Modulation (PCM)                                    64                87.2
G.722     Adaptive Differential Pulse Code Modulation (ADPCM)            48                66.8
G.726     Adaptive Differential Pulse Code Modulation (ADPCM)            32                55.2
G.728     Low-Delay Code Excited Linear Prediction (LD-CELP)             16                31.5
iLBC      Internet Low Bit-rate Codec (iLBC)                             15                27.7
G.727     Embedded ADPCM                                                 16                16
G.729a    Conjugate-Structure Algebraic-Code-Excited Linear
          Prediction (CS-ACELP)                                          8                 31.2
G.723.1   MP-MLQ                                                         6.4               21.9
G.723.1   ACELP                                                          5.3               20.8

The bit rate is an approximate indication of voice quality or fidelity; however, it is only approximate. Codecs that use pulse code modulation all give high fidelity, and you will

detect little or no difference between any of them. The G.728 codec will give much

better quality than the only nominally lower rate GSM codec, because the algorithm it

uses is much more sophisticated. However, the GSM codec uses less computing power,

and so will run on simpler devices.

The bandwidth gives an indication of how much of the capacity of your broadband

Internet connection will be consumed by each VOIP call. The bandwidth usage is not

directly proportional to the bit rate, and will depend on factors such as the protocol

used. Each chunk of voice data is contained within a UDP packet with headers and


other information. This adds a network overhead of some 15 - 25Kbit/s, more than

doubling the bandwidth used in some cases. However, most VOIP implementations use

silence detection, so that no data at all is transmitted when nothing is being said.
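As an illustrative calculation (our own rough numbers, not taken from the table): a G.729 stream at 8 kbps packetized every 20 ms carries 8000 x 0.02 / 8 = 20 bytes of voice per packet; adding roughly 40 bytes of IP, UDP, and RTP headers gives about 60 bytes every 20 ms, i.e. around 24 kbps on the wire before any link-layer framing, which is why an 8-kbps codec ends up consuming bandwidth of the order shown in the table above.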

Insufficient bandwidth can result in interruptions to the audio if VOIP uses the same

Internet connection as other users who may be downloading files or listening to music.

For this reason, it is desirable to enable the Quality of Service "QoS" option in the

TCP/IP Properties of any computer running a software VOIP client, and to use a router

with QoS support for your Internet connection. This will ensure that your VOIP traffic

will be guaranteed a slice of the available bandwidth so that call quality does not suffer

due to other heavy Internet usage.

5.3.4 Introduction to G.727

ITU-T Recommendation G.727 contains the specification of an embedded adaptive differential pulse code modulation (ADPCM) algorithm with 5, 4, 3, and 2 bits per sample (i.e., at rates of 40, 32, 24, and 16 kbps).

The following characteristics are recommended for the conversion of 64-kbps A-law or µ-law PCM channels to or from variable-rate embedded ADPCM channels. The recommendation defines the transcoding law when the source signal is a pulse code modulation signal at a rate of 64 kbps developed from voice-frequency analog signals as specified in ITU-T G.711. Figure 5.23 shows a simplified block diagram of the encoder and the decoder.


Figure (5.23) Simplified block diagram of the encoder and the decoder


Embedded ADPCM Algorithms

Embedded ADPCM algorithms are variable-bit-rate coding algorithms with the capability of bit-dropping outside the encoder and decoder blocks. They consist of a series of algorithms such that the decision levels of the lower-rate quantizers are subsets of those of the quantizer at the highest rate. This allows bit reductions at any point in the network without the need for coordination between the transmitter and the receiver. In contrast, the decision levels of the conventional ADPCM algorithms, such as those in Recommendation G.726, are not subsets of one another; therefore, the transmitter must inform the receiver of the coding rate and the encoding algorithm.

Embedded algorithms can accommodate the unpredictable and bursty characteristics of traffic patterns that require congestion relief. This might be the case in IP-like networks, or in ATM networks with early packet discard. Because congestion relief may occur after the encoding is performed, embedded coding is different from variable-rate coding, where the encoder and decoder must use the same number of bits in each sample. In both cases, however, the decoder must be told the number of bits to use in each sample.

Embedded algorithms produce code words that contain enhancement bits and core bits. The feed-forward (FF) path utilizes enhancement and core bits, while the feedback (FB) path uses core bits only. The inverse quantizer and the predictor of both the encoder and the decoder use the core bits. With this structure, enhancement bits can be discarded or dropped during network congestion. However, the number of core bits in the FB paths of both the encoder and decoder must remain the same to avoid mistracking.

The four embedded ADPCM rates are 40, 32, 24, and 16 kbps, where the decision levels for the 32-, 24-, and 16-kbps quantizers are subsets of those for the 40-kbps quantizer. Embedded ADPCM algorithms are referred to by (x, y) pairs, where x refers to the FF (enhancement and core) ADPCM bits and y refers to the FB (core) ADPCM bits. For example, if y is set to 2 bits, (3, 2) represents the 24-kbps embedded algorithm and (2, 2) the 16-kbps algorithm. The bit rate is never less than 16 kbps because the minimum number of core bits is 2. Simplified block diagrams of both the embedded ADPCM encoder and decoder are shown in Figure 5.23.

5.3.5 Introduction to G.729 and G.723.1

G.729 and G.723.1 can be differentiated in the manner in which the excitation signal is formed (e.g., ACELP) and in the partitioning of the excitation space (the algebraic codebook), although both assume that all pulses have the same amplitude and that the sign information will be transmitted. The two vocoders also show major differences in terms of delay.


Differentiations

G.729 has excitation frames of 5 ms and allows four pulses to be selected. The 40-sample frame is partitioned into four subsets. The table below (Figure 5.24) summarizes the technology and standards for low-bit-rate vocoding methods.

Parameter                 G.729    G.729A   G.723.1
Bit rate (kbps)           8        8        5.3-6.3
Frame size (ms)           10       10       30
Subframe size (ms)        5        5        7.5
Algorithmic delay (ms)    15       15       37.5
MIPS                      20       10       14-20
RAM (bytes)               5.2 k    4 k      4.4 k
Quality                   good     good     good

Figure 5.24 Parameters for new vocoders

The first three subsets have eight possible locations for pulses; the fourth has sixteen. One pulse must be chosen from each subset. This is a four-pulse ACELP excitation codebook method (see figure 5.2).

G.723.1 has excitation frames of 7.5 ms, and also uses a four-pulse ACELP excitation codebook for the 5.3-kbps mode. For the higher-rate mode, the multipulse maximum likelihood quantizer (MP-MLQ) is employed. Here the frame positions are grouped into even-numbered and odd-numbered subsets. A sequential multipulse search is used for a fixed number of pulses from the even subset (either five or six, depending on whether the frame itself is odd- or even-numbered); a similar search is repeated for the odd-numbered subset. Then the set resulting in the lowest total distortion is selected for the excitation (1).

At the decoder stage, the linear prediction coder (LPC) information and the adaptive and fixed codebook information are demultiplexed and then used to reconstruct the output signal. An adaptive postfilter is used. In the case of the G.723.1 vocoder, the long-term (LT) postfilter is applied to the excitation signal before it is passed through the LPC synthesis filter and the short-term (ST) postfilter.

G.723.1

G.723.1 specifies a coded representation that can be used for compressing the speech or other audio signal component of multimedia services at a very low bit rate. In the design of this coder, the principal application considered by the Study Group was very low bit-rate visual telephony as part of the overall H.324 family of standards.


This coder has two bit rates associated with it, 5.3 and 6.3 kbps. The higher bit rate gives greater quality. The lower bit rate gives good quality and provides system designers with additional flexibility. Both rates are a mandatory part of the encoder and decoder. It is possible to switch between the two rates at any 30-ms frame boundary. An option for variable-rate operation using discontinuous transmission and noise fill during non-speech intervals is also possible.

The G.723.1 coder was optimized to represent speech with a high quality at the stated rates, using a limited amount of complexity. Music and other audio signals are not represented as faithfully as speech, but can be compressed and decompressed using this coder.

The G.723.1 coder encodes speech or other audio signals in 30-ms frames. In addition, there is a look-ahead of 7.5 ms, resulting in a total algorithmic delay of 37.5 ms. Any additional delay in the implementation and operation of this coder is due to the following:

1- Actual time spent processing the data in the encoder and decoder.

2- Transmission time on the communication link

3- Additional buffering delay for the multiplexing protocol.

Encoder / Decoder

The G.723.1 coder is designed to operate with a digital signal obtained by first performing telephone-bandwidth filtering (Recommendation G.712) of the analog input, then sampling at 8000 Hz, and then converting to 16-bit linear PCM for the input to the encoder. The output of the decoder is converted back to analog by similar means. Other input/output characteristics, such as those specified by Recommendation G.711 for 64-kbps PCM data, should be converted to 16-bit linear PCM before encoding, or from 16-bit linear PCM to the appropriate format after decoding.

The coder is based on the principles of linear prediction analysis-by-synthesis coding and attempts to minimize a perceptually weighted error signal. The encoder operates on blocks (frames) of 240 samples each, which is equal to 30 ms at an 8-kHz sampling rate. Each block is first high-pass filtered to remove the DC component and then divided into four subframes of 60 samples each. For every subframe, a tenth-order linear prediction coder (LPC) filter is computed using the unprocessed input signal. The LPC filter for the last subframe is quantized using a predictive split vector quantizer (PSVQ). The quantized LPC coefficients are used to construct the short-term perceptual weighting filter, which is used to filter the entire frame and to obtain the perceptually weighted speech signal.

For every two subframes (120 samples), the open-loop pitch period is computed using the weighted speech signal. This pitch estimation is performed on blocks of 120 samples. The pitch period is searched in the range from 18 to 142 samples.

From this point, the speech is processed on a basis of 60 samples per subframe. Using the estimated pitch period computed previously, a harmonic noise shaping filter is constructed. The combination of the LPC synthesis filter, the formant perceptual weighting filter, and the harmonic noise shaping filter is used to create an impulse response. The impulse response is then used for further computations.

Using the open-loop pitch period estimate and the impulse response, a closed-loop pitch predictor is computed. A fifth-order pitch predictor is used. The pitch period is computed as a small differential value around the open-loop pitch estimate. The contribution of the pitch predictor is then subtracted from the initial target vector. Both the pitch period and the differential values are transmitted to the decoder.

Figure (5.25) The block diagram of the encoder

Finally, the nonperiodic component of the excitation is approximated. For the high bit rate, multipulse maximum likelihood quantization (MP-MLQ) excitation is used, and for the low bit rate, an algebraic code excitation is used.

The block diagram of the encoder, shown in figure 5.25, contains the following blocks:

• Framer
• High-pass filter
• LPC analysis
• Line spectral pair (LSP) quantizer
• LSP interpolation
• Formant perceptual weighting filter
• Pitch estimation
• Subframe processing
• Harmonic noise shaping
• Impulse response calculator
• Zero-input response and ringing subtraction
• Pitch predictor
• High-rate excitation (MP-MLQ)
• Excitation decoder
• Pitch information decoding


References

Topic 1: Audiology

[1] Kinsler, Lawrence E.; Sanders, James V.; Frey, Austin R., Fundamentals of Acoustics.

[2] Haykin, Simon, Adaptive Filters.

[3] (http://www.freehearingtest.com)

[4] (http://www.healthline.com)

[5] (http://www.youcanhear4less.com)

[6] Texas Instruments (http://www.ti.com/)

Topic 2: Acoustical Simulation of Room

[1] Kuttruff, H. (1991) Room Acoustics, Elsevier Science Publishers, Essex.

[2] (http://home.tir.com/~ms/roomacoustics/)

[3] (http://audiolab.uwaterloo.ca/~bradg/auralization.html)

[4] (http://www.acoustics.hut.fi/~riitta/.reverb/)

[5] (http://www.music.mcgill.ca/~gary/)

[6] (http://www.mcsquared.com/)

[7] (http://www.owlnet.rice.edu/~elec431/projects97/)

[8] J. Audio Eng. Soc., Vol. 45, No. 6, June 1997.

Topic 3: Noise Control

[1] Architectural Acoustics, McGraw-Hill Inc., New York, 1988, p. 18.

[2] U.S. Dept. of Commerce / National Bureau of Standards, Handbook 119, July 1976: Quieting: A Practical Guide to Noise Control, p. 61.

[3] Kuttruff, H. (1991) Room Acoustics, Elsevier Science Publishers, Essex.

[4] (http://www.STCratings.com)

Topic 4: Speech Technology

[1] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ.

[2] C. J. Weinstein, "A Linear Predictive Vocoder with Voice Excitation," Proc. EASCON.

[3] Daniel Minoli and Emma Minoli, Delivering Voice Over IP Networks, Wiley Computer Publishing, John Wiley & Sons, Inc.

[4] (http://www.data-compression.com/speech.html)

[5] (http://www.otolith.com/otolith/olt/lpc.html)