ENEE408G Capstone -- Multimedia Signal Processing (F'05) Digital Speech Processing and Coding...


ENEE408G Capstone -- Multimedia Signal Processing (F'05)

Digital Speech Processing and Coding

Fall’05 Instructor: Carol Espy-Wilson

Electrical & Computer Engineering

University of Maryland, College Park

http://www.ece.umd.edu/class/enee408g/
http://umd.blackbloard.com/

[email protected]

ENEE408G Spring 2004, Lecture-2

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [2]

Last Lecture

Course overview and logistics

Bring multimedia to digital world: sampling & quantization

Introduction to speech processing
– Different aspects of speech

Friday Lab Session
– Speech Processing, Coding, Recognition, & HCI

Today: speech processing, coding, synthesis

UMCP ENEE408G Slides (created by M. Wu & R. Liu © 2002)

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [3]

Speech Production

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [4]

Source-Filter View of Speech Production (Stevens 1999)

Source Spectrum

Vocal tract transfer function

Radiation Characteristics

Power spectrum of speech signal

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [5]

[Figure: spectrograms of the utterance “Sprouted grains and seeds are used in salads and dishes such as chop suey”. Full utterance: Time (sec) 0.0 to 4.0, frequency ticks 2000 to 8000. Excerpt: Time (sec) 0.1 to 0.5, frequency ticks 2000 to 8000, F2 marked, segments labeled fricative, stop consonant, glide, vowel, stop consonant, vowel.]

UMCP ENEE408G Slides (created by Carol Espy-Wilson © 2004)

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [6]

Phonetic Features (Chomsky & Halle, 1968)

There are three kinds of phonetic features
– Source features determine the kind of excitation signal
– Manner of articulation features determine how open or closed the vocal tract is
– Place of articulation features determine the location of the primary constriction

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [7]

Source feature “voiced”

+voiced: /z/
-voiced: /s/

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [8]

Source Feature voiced

[Figure: spectrogram of “Sprouted”, Time (sec) 0.1 to 0.5, frequency ticks 2000 to 8000. Vertical striations mark +voiced regions; turbulence marks -voiced regions.]

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [9]

Glottal Source (Klatt & Klatt 1990)

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [10]

Modal Voice

Creaky Voice

Breathy Voice

Voice Quality-APP Detector

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [11]

Manner feature “sonorant”

+sonorant: vowel. Primary source at glottis
-sonorant: /z/. Primary source above the glottis at alveolar ridge

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [12]

Manner Feature sonorant

[Figure: spectrogram of “Sprouted”, Time (sec) 0.1 to 0.5, frequency ticks 2000 to 8000. Low frequency energy marks +sonorant regions; high frequency energy marks -sonorant regions.]

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [13]

Place feature for stop consonants

/p/: +labial
/t/: +alveolar

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [14]

Place Feature Labial vs. Alveolar

[Figure: spectrum of labial /b/, dB vs. Frequency (Hz), showing a falling spectral prominence.]

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [15]

Place Feature Labial vs. Alveolar

[Figure: spectra, dB vs. Frequency (Hz). Labial /p/ shows a falling spectral prominence; alveolar /t/ shows a rising spectral prominence.]

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [16]

Source-Filter Theory

First “speaking machine” in 1930s NY World’s Fair
– 14 keys, 1 wristband, 1 pedal

Modeling speech production as a linear system
– Sound sources: either voiced or unvoiced
  Voiced sound: modeled by a generator of pulses
  Unvoiced sound: modeled by a white noise generator
– Articulation: modeled by a cascade of single-resonance (pole) digital filters

Figure 1 of SPM May’98 Speech Survey

UMCP ENEE408G Slides (created by M. Wu © 2003)
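The source-filter idea on this slide can be sketched in a few lines. This is only an illustrative sketch, not the model from the cited figure: the function names, the 8 kHz sampling rate, the 100 Hz pitch, and the three (frequency, bandwidth) resonance pairs are all assumed values chosen for the example.

```python
import math
import random

def resonator(x, freq_hz, bw_hz, fs):
    """Single-resonance (two-pole) filter: y[n] = x[n] + b1*y[n-1] + b2*y[n-2]."""
    r = math.exp(-math.pi * bw_hz / fs)            # pole radius set by the bandwidth
    b1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    b2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = xn + b1 * y1 + b2 * y2
        y.append(yn)
        y1, y2 = yn, y1
    return y

def synthesize(n_samples, fs=8000, voiced=True, pitch_hz=100,
               resonances=((700, 130), (1220, 70), (2600, 160)), seed=0):
    """Excite a cascade of resonators with a pulse train (voiced) or white noise (unvoiced)."""
    rng = random.Random(seed)
    if voiced:
        period = int(round(fs / pitch_hz))
        source = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    else:
        source = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    out = source
    for freq_hz, bw_hz in resonances:               # assumed, illustrative resonances
        out = resonator(out, freq_hz, bw_hz, fs)
    return out

vowel_like = synthesize(800, voiced=True)
noise_like = synthesize(800, voiced=False)
```

Switching `voiced` changes only the source; the filter cascade stays the same, which is the separation the linear model relies on.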

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [17]

Linear Separable Model for Speech Production

Vocal tract is modeled as a linear time-varying system
– Parameters of the linear system are slowly varying
– Excited by time-varying source (voiced or unvoiced)

Practical models
– Model each speech frame as Linear Time-Invariant
– Excited by either voiced or unvoiced source
– Allow overlaps in neighbouring frames

Figure 3.2 of Furui’s book

UMCP ENEE408G Slides (created by R. Liu & M. Wu © 2002)
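A minimal sketch of the frame-based view: the signal is cut into short overlapping frames, each of which would then be modeled as time-invariant. The 20 ms frame length, 10 ms hop, and the Hamming taper are assumptions made for this example; the slide does not specify them.

```python
import math

def split_into_frames(signal, frame_len, hop):
    """Return overlapping frames; hop < frame_len means neighbouring frames overlap."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len])
    return frames

def hamming(n):
    """Hamming window (an assumed choice of taper, not from the slides)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

fs = 8000                                            # assumed sampling rate
signal = [math.sin(2.0 * math.pi * 200 * n / fs) for n in range(800)]   # 100 ms test tone
frame_len = 160                                      # 20 ms
hop = 80                                             # 10 ms, so 50% overlap
frames = split_into_frames(signal, frame_len, hop)
window = hamming(frame_len)
windowed = [[w * s for w, s in zip(window, frame)] for frame in frames]
```

Each entry of `windowed` is the unit on which per-frame parameters (for example LPC coefficients) would be estimated.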

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [18]

Speech Coding

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [19]

Statistical Properties of Speech (Digital Speech Processing by Rabiner and Schafer)

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [20]

Statistical Properties of Speech (Digital Speech Processing by Rabiner and Schafer)

Lowpass filtered (0-3400 Hz)

Bandpass filtered (200-3400 Hz)

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [21]

Statistical Properties of Speech (Digital Speech Processing by Rabiner and Schafer)

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [22]

Digital Coding of Speech

[Figure: bit-rate axis in kbps with ticks at 200, 64, 16, 9.6, 7.2, 4.8 and 0.05; regions labeled broadcast quality, toll quality, commun. quality and synthetic quality; coder families labeled waveform coding and source coding.]

Information Capacity $I = B f_s$

Waveform coders: quantize speech samples directly at high bit rates.

Source coders (vocoders): use knowledge of speech production to parameterize the signal (model based)

Hybrid coders: partly waveform based and partly model based (2.4-16 kbps)

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [23]

PCM coding

How to encode a signal into bits?
– Sample and perform uniform quantization (2 parameters: $\Delta$, the equal quantization step size, and B, the # of bits)
  “Pulse Code Modulation” (PCM)
  8 bits per sample ~ good for speech
  16 bits ~ needed for high-quality music

Tradeoff between fidelity and file size

How to “squeeze” out redundancy?

[Diagram: Input signal I(x,y) → Sampler → Quantizer → Encoder → transmit; labeled “digitize/capture device”.]

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [25]

Discussion on Improving PCM (1)

2 parameters: step size $\Delta$, # of bits B

Peak-to-peak range is $2X_{\max}$, so $\Delta = \frac{2X_{\max}}{2^{B}}$

Assume $e[n] = \hat{x}[n] - x[n]$
– where e[n] is uncorrelated with x[n], and it is uniformly distributed: $p_e(e) = \frac{1}{\Delta}$ for $-\frac{\Delta}{2} \le e \le \frac{\Delta}{2}$, so that $\sigma_e^2 = \frac{\Delta^2}{12}$

$SNR = \frac{\sigma_x^2}{\sigma_e^2} = \frac{3 \cdot 2^{2B}}{\left(X_{\max}/\sigma_x\right)^2}$

$SNR(\mathrm{dB}) = 6B + 4.77 - 20\log_{10}\left[\frac{X_{\max}}{\sigma_x}\right]$
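The SNR formula above can be checked numerically. The sketch below assumes a mid-rise uniform quantizer and a uniformly distributed test signal (both are choices made for this example), and uses 6.02 dB per bit, the unrounded value of $20\log_{10}2$ that the slide writes as 6.

```python
import math
import random

def uniform_quantize(x, x_max, bits):
    """Mid-rise uniform quantizer with 2**bits levels over [-x_max, x_max]."""
    levels = 2 ** bits
    step = 2.0 * x_max / levels                      # Delta = 2*Xmax / 2^B
    idx = math.floor(x / step)
    idx = max(-(levels // 2), min(levels // 2 - 1, idx))   # clip to the valid range
    return (idx + 0.5) * step

def measured_snr_db(samples, x_max, bits):
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = sum((uniform_quantize(s, x_max, bits) - s) ** 2
                      for s in samples) / len(samples)
    return 10.0 * math.log10(signal_power / noise_power)

def predicted_snr_db(sigma_x, x_max, bits):
    return 6.02 * bits + 4.77 - 20.0 * math.log10(x_max / sigma_x)

rng = random.Random(1)
x_max = 1.0
samples = [rng.uniform(-x_max, x_max) for _ in range(20000)]   # assumed test signal
sigma_x = math.sqrt(sum(s * s for s in samples) / len(samples))

results = {}
for bits in (4, 8, 12):
    results[bits] = (measured_snr_db(samples, x_max, bits),
                     predicted_snr_db(sigma_x, x_max, bits))
    print(bits, results[bits])
```

Each extra bit should add roughly 6 dB in both columns; the agreement depends on the assumption that the quantization error is uniform and uncorrelated with the signal.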

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [26]

Uniform quantization (Digital Speech Processing by Rabiner and Schafer)

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [27]

Discussion on Improving PCM (1)

Uniform quantization may give an inconsistent range of relative errors
– E.g., an error of +/- 2 is 20% at amplitude 10 but 2% at amplitude 100

Non-uniform quantization
– Assign smaller quantization step size at small amplitude to maintain a consistent range of relative quantization errors over the entire dynamic range
– Can apply non-linear transform before uniform quantization via “companding” (compression-expansion)
  μ-law companding: international standard for 64kbps speech

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [28]

Discussion on Improving PCM (1)

$y[n] = \ln|x[n]|$

$x[n] = e^{y[n]}\,\mathrm{sign}(x[n])$, where $\mathrm{sign}(x[n]) = +1$ for $x[n] \ge 0$ and $-1$ for $x[n] < 0$

$\hat{y}[n] = \ln|x[n]| + \epsilon[n]$

$\hat{x}[n] = e^{\hat{y}[n]}\,\mathrm{sign}(x[n]) = |x[n]|\,\mathrm{sign}(x[n])\,e^{\epsilon[n]} = x[n]\,e^{\epsilon[n]}$

$\hat{x}[n] \approx x[n](1 + \epsilon[n]) = x[n] + x[n]\,\epsilon[n]$

$\hat{x}[n] = x[n] + e[n]$, with $e[n] = x[n]\,\epsilon[n]$

$SNR = \frac{\sigma_x^2}{\sigma_e^2} = \frac{\sigma_x^2}{\sigma_x^2\,\sigma_\epsilon^2} = \frac{1}{\sigma_\epsilon^2}$

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [29]

Discussion on Improving PCM (1) (Digital Speech Processing by Rabiner and Schafer)

But ln[0] is undefined (it tends to $-\infty$): not practical

μ-law compression:

$y[n] = X_{\max}\,\frac{\log\left[1 + \mu\frac{|x[n]|}{X_{\max}}\right]}{\log[1 + \mu]}\,\mathrm{sign}(x[n])$
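A small sketch of companding with the μ-law curve: compress, quantize uniformly, then expand. The value μ = 255, the 8-bit mid-rise quantizer, and the function names are assumptions for this example; the expander is simply the algebraic inverse of the compression formula.

```python
import math

MU = 255.0   # assumed value of mu for this example

def mu_compress(x, x_max=1.0, mu=MU):
    """y = Xmax * log(1 + mu*|x|/Xmax) / log(1 + mu) * sign(x)."""
    sign = 1.0 if x >= 0 else -1.0
    return x_max * math.log(1.0 + mu * abs(x) / x_max) / math.log(1.0 + mu) * sign

def mu_expand(y, x_max=1.0, mu=MU):
    """Inverse of mu_compress."""
    sign = 1.0 if y >= 0 else -1.0
    return x_max / mu * ((1.0 + mu) ** (abs(y) / x_max) - 1.0) * sign

def uniform_quantize(v, x_max, bits):
    """Mid-rise uniform quantizer over [-x_max, x_max]."""
    levels = 2 ** bits
    step = 2.0 * x_max / levels
    idx = max(-(levels // 2), min(levels // 2 - 1, math.floor(v / step)))
    return (idx + 0.5) * step

def companded_quantize(x, x_max=1.0, bits=8):
    return mu_expand(uniform_quantize(mu_compress(x, x_max), x_max, bits), x_max)

def relative_error(x, x_hat):
    return abs(x_hat - x) / abs(x)

for amp in (0.01, 0.1, 1.0):
    plain = relative_error(amp, uniform_quantize(amp, 1.0, 8))
    companded = relative_error(amp, companded_quantize(amp))
    print(amp, plain, companded)
```

The point to look for in the printout is that the companded relative error varies much less across amplitudes than the plain uniform one.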

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [30]

Discussion on Improving PCM (1): Log Companding (Digital Speech Processing by Rabiner and Schafer)

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [31]

Discussion on Improving PCM (2)

Quantized PCM values may not be equally likely
– Can we do better than encoding each value using the same # of bits?

Example
– P(“0”) = 0.5, P(“1”) = 0.25, P(“2”) = 0.125, P(“3”) = 0.125
– If we use the same # of bits for all values
  Need 2 bits to represent the four possibilities if treated equally
– If we use fewer bits for likely values such as “0” ~ Variable Length Codes (VLC)
  “0” => [0], “1” => [10], “2” => [110], “3” => [111]
  Use 1.75 bits on average ~ saves 0.25 bit per sample!

Bring probability into the picture
– Use the probability distribution to reduce the average # of bits per quantized sample
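The arithmetic in the example can be reproduced directly. The sketch below uses the probabilities and code words from the slide; the entropy computation and the encode/decode helpers are additions for illustration.

```python
import math

probs = {"0": 0.5, "1": 0.25, "2": 0.125, "3": 0.125}
code = {"0": "0", "1": "10", "2": "110", "3": "111"}

# Average code length: sum of P(symbol) * length(code word).
avg_len = sum(probs[s] * len(code[s]) for s in probs)

# Entropy of the source, the lower bound on the average length.
entropy = -sum(p * math.log2(p) for p in probs.values())

def encode(symbols):
    return "".join(code[s] for s in symbols)

def decode(bits):
    """Prefix-free code: emit a symbol as soon as the buffer matches a code word."""
    inverse = {word: sym for sym, word in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return out

message = list("0123100")
bitstream = encode(message)
print(avg_len, entropy, bitstream, decode(bitstream) == message)
```

For this particular distribution the average length equals the entropy, so no other symbol-by-symbol code can do better.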

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [40]

How to Encode Correlated Sequence?

Consider: high correlation between successive samples

Predictive coding
– Basic principle: remove redundancy between successive samples and only encode the residual between actual and predicted values
– Residue usually has much smaller dynamic range
  Allows fewer quantization levels for the same MSE => get compression
– Compression efficiency depends on intersample redundancy

First try

[Diagram, Encoder: u(n) minus the predictor output u’P(n) = u(n-1) gives e(n); the Quantizer outputs eQ(n).]

[Diagram, Decoder: eQ(n) plus the predictor output uP(n) = uQ(n-1) gives uQ(n).]

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [41]

Predictive Coding (cont’d)

Problem with 1st try
– Inputs to the predictor are different at encoder and decoder
  The decoder doesn’t know u(n)!
– Mismatch error could propagate to future reconstructed samples

Solution: Differential PCM (DPCM)
– Use quantized sequence uQ(n) for prediction at both encoder and decoder
– Prediction error e(n)
– Quantized prediction error eQ(n)
– Distortion d(n) = e(n) – eQ(n)

[Diagram, Encoder: u(n) minus the predictor output uP(n) = uQ(n-1) gives e(n); the Quantizer outputs eQ(n); eQ(n) plus uP(n) gives uQ(n), which feeds the Predictor.]

[Diagram, Decoder: eQ(n) plus the predictor output uP(n) = uQ(n-1) gives uQ(n).]

Think: what predictor to use?
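The difference between the first try and DPCM can be seen on a toy signal. This sketch assumes a rounding quantizer, a zero initial prediction, and a slow ramp as input; those choices, and the function names, are illustrative only.

```python
def quantize(e, step):
    """Uniform quantizer for the prediction error."""
    return step * round(e / step)

def open_loop_encode(u, step):
    """First try: the encoder predicts from the unquantized u(n-1)."""
    e_q, prev = [], 0.0
    for x in u:
        e_q.append(quantize(x - prev, step))
        prev = x                       # the decoder never has access to this value
    return e_q

def dpcm_encode(u, step):
    """DPCM: the encoder predicts from its own reconstruction uQ(n-1)."""
    e_q, u_q_prev = [], 0.0
    for x in u:
        eq = quantize(x - u_q_prev, step)
        e_q.append(eq)
        u_q_prev = u_q_prev + eq       # same update the decoder performs
    return e_q

def decode(e_q):
    """Decoder: uQ(n) = uP(n) + eQ(n) with uP(n) = uQ(n-1)."""
    out, prev = [], 0.0
    for eq in e_q:
        prev = prev + eq
        out.append(prev)
    return out

u = [0.3 * n for n in range(50)]       # assumed test input: a slow ramp
step = 1.0
open_loop_err = [abs(a - b) for a, b in zip(u, decode(open_loop_encode(u, step)))]
dpcm_err = [abs(a - b) for a, b in zip(u, decode(dpcm_encode(u, step)))]
print(max(open_loop_err), max(dpcm_err))
```

In the open-loop version every small difference rounds to zero and the decoder drifts away from the input, while in DPCM the reconstruction error stays within half a quantization step.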

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [43]

Linear Prediction Analysis of Speech

[Diagram, Analysis: s[n] feeds a chain of $z^{-1}$ delays with taps $a_1, a_2, \ldots, a_P$; the weighted sum is subtracted from s[n] to give e[n].]

[Diagram, Synthesis: e[n] is added to the weighted sum of past outputs (delays $z^{-1}$, taps $a_1, a_2, \ldots, a_P$) to give s[n].]

$\{a_i\}$ are called Linear Prediction Coefficients (LPC)

Error Minimization

$\min_{\{a_k\}} E = \sum_n (e[n])^2 = \sum_n (s[n] - \hat{s}[n])^2$

Normal equations

$S\mathbf{a} = \hat{\mathbf{s}}$

Can be solved using the famous Levinson Recursion, which leads to lattice formulation of the linear prediction solution
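One common way to carry out the Levinson recursion is the Levinson-Durbin form on the autocorrelation sequence, sketched below. The autocorrelation (rather than covariance) formulation, the function names, and the synthetic second-order test signal are assumptions for this example. The sign convention matches the slides: $\hat{s}[n] = \sum_k a_k s[n-k]$.

```python
import random

def autocorr(s, max_lag):
    """r[k] = sum_n s[n] * s[n-k] for k = 0..max_lag."""
    return [sum(s[n] * s[n - k] for n in range(k, len(s))) for k in range(max_lag + 1)]

def levinson(r, order):
    """Solve sum_k a_k r[|i-k|] = r[i], i = 1..order.

    Returns (a, refl, err): predictor coefficients a_1..a_P, the reflection
    coefficients used by the lattice form, and the final prediction error energy.
    """
    a = [0.0] * (order + 1)            # a[1..order]; a[0] is unused
    err = r[0]
    refl = []
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err
        refl.append(k_i)
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a[1:], refl, err

# Synthetic test: s[n] = 1.3 s[n-1] - 0.4 s[n-2] + noise (assumed coefficients).
rng = random.Random(0)
true_a = [1.3, -0.4]
s = [0.0, 0.0]
for _ in range(5000):
    s.append(true_a[0] * s[-1] + true_a[1] * s[-2] + rng.gauss(0.0, 1.0))
a_est, refl, err = levinson(autocorr(s, 2), 2)
print(a_est, refl)
```

The recursion needs on the order of P squared operations instead of the P cubed of a general linear solve, and the reflection coefficients it produces along the way are the parameters of the lattice structure.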

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [44]

Source-Filter View of Speech Production

e(t) → v(t) → r(t) → s(t)

E(ω) → V(ω) → R(ω) → S(ω)

s(t) = e(t)*v(t)*r(t)

S(ω) = E(ω)V(ω)R(ω)

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [45]

All-Pole Modeling of Speech

Auto-regressive (AR) model: all-pole filter

$H(z) = G(z)V(z)R(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$

– H(z) is the overall transfer function
– Glottal Flow G(z), Vocal Tract V(z), Radiation R(z), Gain

Synthesis process:

[Diagram: u[n] → H(z) = 1/A(z) → s[n]]

u[n]: the vocal tract input, s[n]: speech output

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [46]

All-Pole Model and Linear Prediction

$\frac{S(z)}{U(z)} = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$

$S(z) = \sum_{k=1}^{P} a_k S(z) z^{-k} + U(z)$

$s[n] = \sum_{k=1}^{P} a_k s[n-k] + u[n] = \hat{s}[n] + e[n]$

Here $\hat{s}[n] = \sum_{k=1}^{P} a_k s[n-k]$ is a linear prediction of order P for s[n]

[Diagram: s[n] feeds the predictor P(z); its output $\hat{s}[n]$ is subtracted from s[n] to give e[n].]

where $e[n] = s[n] - \hat{s}[n]$ is the prediction error sequence
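The two difference equations above are inverses of each other, which a short round trip makes concrete. In this sketch the function names and the second-order coefficients are assumptions; `a[k]` in the code holds $a_{k+1}$.

```python
def lp_synthesis(e, a):
    """All-pole filter 1/A(z): s[n] = sum_k a_k s[n-k] + e[n]."""
    s = []
    for n in range(len(e)):
        pred = sum(a[k] * s[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        s.append(pred + e[n])
    return s

def lp_analysis(s, a):
    """Inverse filter A(z): e[n] = s[n] - sum_k a_k s[n-k]."""
    e = []
    for n in range(len(s)):
        pred = sum(a[k] * s[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        e.append(s[n] - pred)
    return e

a = [1.3, -0.4]                        # assumed coefficients a_1, a_2
u = [1.0] + [0.0] * 19                 # unit impulse as the excitation
s = lp_synthesis(u, a)                 # impulse response of 1/A(z)
e = lp_analysis(s, a)                  # recovers the excitation
print(s[:4], e[:4])
```

Running analysis on the synthesized signal returns the original excitation, which is why a coder can transmit the coefficients plus a description of e[n] instead of s[n] itself.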

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [47]

Model-based Coding

Linear Prediction Coder (LPC)
– LPC Vocoder (voice coder)
  Divide speech into frames (several tens of milliseconds) and encode the LPC coefficients of each frame
  Additional parameters to facilitate synthesis: voiced/unvoiced flag, gain, pitch (for voiced)
– Line Spectrum Pair (LSP) Coding

Hybrid Coding: LPC Residual Coding
– Between LPC and waveform coding

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [48]

Line Spectrum Pair (LSP) Coding

Pros and Cons of LPC method
– Good performance at coding rate down to 2.4kbps
– Synthesized voice becomes unnatural when below 2.4kbps
– When the poles are near the unit circle, quantization of LPC coefficients may result in instability.

LSP parameters
– LSPs are frequencies extracted from polynomials constructed from LPC coefficients
– Frequency domain features (similar to formants) => produce less distortion due to quantization

[See details in Design Project on Speech]
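The instability risk mentioned above can be illustrated on a second-order all-pole filter. This sketch does not compute LSPs; it only shows the problem they address. The pole location (radius 0.99), the deliberately coarse quantization step of 0.15, and the function names are all hypothetical values chosen so the effect is visible.

```python
import cmath
import math

def pole_radii(a1, a2):
    """Magnitudes of the roots of z^2 - a1*z - a2, the poles of 1/(1 - a1 z^-1 - a2 z^-2)."""
    disc = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return sorted(abs((a1 + sgn * disc) / 2.0) for sgn in (1, -1))

def quantize(v, step):
    """Round a coefficient to the nearest multiple of step."""
    return step * round(v / step)

# Hypothetical resonance close to the unit circle: poles at 0.99 * exp(+/- j0.3).
r, theta = 0.99, 0.3
a1 = 2.0 * r * math.cos(theta)
a2 = -r * r

step = 0.15                            # assumed coarse quantization step
a1_q, a2_q = quantize(a1, step), quantize(a2, step)

original = pole_radii(a1, a2)
quantized = pole_radii(a1_q, a2_q)
print(original, quantized)
```

A filter is stable only if every pole radius is below 1; here the rounded coefficients push the pole pair outside the unit circle even though the original filter was stable.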

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [49]

Hybrid Coding

“Hybrid” – between LPC and waveform coding
– LPC Residual Coding: encode and slowly update LPC coefficients, and send the LPC residual (e.g. encoded using Vector Quantization)

Advantages:
– Free from quality degradation due to source modeling
– Low-frequency waveform is exactly reproduced
– Spectral information of the entire frequency range is preserved
– No need of pitch period estimation and voiced/unvoiced decision

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [52]

Code-Excited Linear Predictive Coding (CELP)

Multipulse-Excited Linear Predictive Coding (MPC)
– Does not distinguish voiced/unvoiced sound explicitly

Code-Excited Linear Predictive Coding (CELP)
– Replace the multi-pulses of MPC with vector-quantized sequences based on long-term prediction of periodicity and short-term prediction

Figure 6.32 of Furui’s book

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [53]

Speech Coding Methods

– Waveform coding; Hybrid coding; Analysis-synthesis coding

Table 6.1 of Furui’s book

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [54]

Speech Quality vs. Transmission Rate

Figure 6.2 of Furui’s book

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [55]

Comparison of Different Speech Coding Tech.

Table 6.2 of Furui’s book

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [56]

Put Together: A Digital Telephone System

– 8kHz and 8-bit per sample for telephone speech => 64kbps
– Anti-aliasing filter before sampling
– Non-uniform quantization (e.g., through μ-law or A-law companding ~ signal compression-expansion)
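The 64 kbps figure is just sampling rate times bits per sample. A tiny sketch of that arithmetic follows; the function name and the one-minute storage example are additions for illustration.

```python
def bit_rate_bps(sample_rate_hz, bits_per_sample, channels=1):
    """Raw PCM bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

telephone_bps = bit_rate_bps(8000, 8)            # 8 kHz, 8 bits per sample
bytes_per_minute = telephone_bps * 60 // 8       # storage for one minute of speech
print(telephone_bps, bytes_per_minute)
```

Companding does not change this rate; it changes how well those 8 bits per sample cover the dynamic range of speech.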

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [57]

Speech Synthesis

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [58]

Speech Synthesis

Speech synthesis: a process that artificially produces speech
– Articulatory synthesis, Formant synthesis, and LPC synthesis
– Issues other than synthesizer structure: text analysis, etc.

Figure 7.2 of Furui’s book

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [59]

Comparison of Synthesis Methods

Table 7.1 of Furui’s book

UMCP ENEE408G Slides (created by R. Liu © 2002)

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [60]

Text-to-Speech Conversion System

=> See more in Design Project and try it out

Figure 7.8 of Furui’s book

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [61]

Analysis/Synthesis

Naturally spoken utterance

Synthesized utterance

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [62]

Human Computer Interface/Interaction (HCI)

Multi-modal multimedia communications and interactions
– Info. & interface through speech/audio, image/video, graphics, etc.

Building blocks for speech based HCI
– Speech recognition and speaker identification
– Natural language understanding
– (Speech synthesis)
– Examples
  Voice command, dictation
  Question-and-Answer: for intelligent customer service, voice-based info. retrieval, call routing, ...

Enhance speech-based HCI with graphics: “talking head”

=> See more in Design Project and try it out

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [63]

Summary

Speech production and analysis
– Spectrogram; Pitch, Formant
– Linear prediction model

Speech coding
– Basic compression tools

Speech Synthesis

This week’s Lab session:
– Design project #1 on Speech

Next lecture: speech recognition

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [64]

Assignments

“The Past, Present, and Future of Speech Processing”

“Talk to the Machine”