Page 1

Low bit rate speech coding (Matalan bittinopeuden puheenkoodaus) – TUT

Jani Nurminen

20.2.2013

Page 2

Low bit rate speech coding – lecture outline

• Introduction
• Speech modeling in low bit rate speech coding
• (Waveform interpolation speech coding)
• Sinusoidal speech coding
• (Vector quantization)
• (Vector quantization & codebook training)
• (Memory reduction techniques (split VQ, multi-stage VQ))
• (Quantization of vectors having variable dimension)
• Exploiting correlation between speech frames
  • Predictive VQ
  • Dynamic codebook reordering
  • Matrix quantization
• Multi-mode quantization
• Segmental speech coding
• Perceptual irrelevancy removal
• Demo samples

Page 3

Introduction

• Many quite good speech coding technologies are available, for example:
  • GSM codecs
    • Full rate, 13 kbps: Linear Predictive Coding with Regular Pulse Excitation
    • Half rate, 5.6 kbps: Vector Sum Excited Linear Prediction (VSELP)
  • AMR (Adaptive Multi-Rate; 8 kHz; bit rates 4.75–12.2 kbps)
  • AMR-WB (16 kHz; bit rates 6.6–23.85 kbps)
  • AMR-WB+ (wider band, up to 48 kHz, mono/stereo; bit rates 6–48 kbps)
• However, there is still some work to do:
  • High speech quality at (very) low bit rates (also at 16 kHz sampling frequency)
  • Better quality / bit rate ratio at high bit rates
  • General audio (music, etc.)
  • 3D audio
• This lecture focuses on speech coding at (very) low bit rates
  • From below 1.0 kbps to about 3–4 kbps
• A good introductory lecture on speech coding at higher rates is given annually by Jari Hagqvist (NRC) on the course "Digital Mobile Communication Systems" (?)

Page 4

Introduction – special challenges at (very) low bit rates

• Bit rates from below 1.0 kbps to about 3–5 kbps
• Uncompressed 8-kHz speech signal:
  • 16 bits/sample × 8000 samples/second = 128 kbps
• Uncompressed 16-kHz speech signal: 256 kbps
• Compression ratio from about 25:1 to about 160:1
• Another way to look at it:
  • At 1.0 kbps, an average of only 0.125 bits is available per sample
• Speech has to remain intelligible and quality has to remain reasonably high
• Special techniques are needed to achieve the goal:
  • Exploit knowledge of speech production
  • Exploit redundancies/correlations
  • Exploit properties of human hearing
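The bit-budget arithmetic above is easy to sanity-check; the few lines below reproduce the slide's numbers (the 0.8-kbps endpoint is an assumption chosen to match the 160:1 figure):

```python
# Uncompressed PCM bit rates (16 bits/sample).
rate_8k = 16 * 8000            # 128 000 bps = 128 kbps for 8-kHz speech
rate_16k = 16 * 16000          # 256 000 bps = 256 kbps for 16-kHz speech

# Compression ratios for coded rates from ~5 kbps down to ~0.8 kbps (8-kHz input).
ratio_5k = rate_8k / 5000      # about 25:1
ratio_08k = rate_8k / 800      # 160:1 at the sub-1-kbps end

# Average bit budget per sample at 1.0 kbps (8-kHz input).
bits_per_sample = 1000 / 8000  # 0.125 bits/sample
```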

Page 5

Speech modeling in low bit rate speech coding

• Source/filter model:
  • "Vocal tract" (filter)
  • "Excitation" (residual)
• Linear prediction (LP) can be used for separating the two components

[Figure: speech signal x(n) and residual signal r(n); x-axis: time (ms), y-axis: normalized amplitude]
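The LP separation above can be sketched in a few lines of NumPy (autocorrelation method; the order, the lack of windowing, and the lack of bandwidth expansion are simplifications relative to a practical coder):

```python
import numpy as np

def lp_residual(x, order=10):
    """Autocorrelation-method LP analysis followed by inverse filtering.
    Returns the predictor coefficients a and the residual r(n)."""
    x = np.asarray(x, dtype=float)
    # Autocorrelation values r(0)..r(order)
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    # Normal equations: Toeplitz system R a = [r(1)..r(order)]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    # Inverse filtering with A(z) = 1 - sum_k a_k z^{-k}
    res = x.copy()
    for n in range(order, len(x)):
        res[n] = x[n] - a @ x[n - order:n][::-1]
    return a, res
```

For a signal that really is an AR process, the residual collapses to (almost) nothing, which is exactly why the excitation carries far fewer bits than the raw waveform.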

Page 6

Speech modeling in low bit rate speech coding

• The excitation is very hard to model well:
  • Plenty of detailed information
  • High entropy
  • CELP-type representation / waveform coding is not possible at very low bit rates
  • Simple vocoding (pulses + noise) does not provide high quality
• Several techniques have been developed for modeling the excitation at low bit rates
• In this lecture, we will focus on two quite successful approaches:
  • Waveform interpolation
  • Sinusoidal speech modeling

Page 7

Waveform interpolation speech coding - Encoding

• Based on interpolation of pitch-cycle waveforms (+ linear prediction)
• First step (after LPC) in residual encoding: pitch estimation and interpolation

[Figure: residual signal with detected pitch periods (42, 65, 73, 79 samples) and the corresponding pitch track; x-axis: time (samples), y-axes: normalized amplitude and pitch period (samples)]
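The pitch estimation step can be sketched as a normalized-autocorrelation search (the search range is an illustrative assumption for 8-kHz speech; real estimators also guard against pitch halving/doubling and track the pitch across frames for the interpolation step):

```python
import numpy as np

def estimate_pitch_period(frame, min_lag=20, max_lag=160):
    """Return the lag (in samples) maximizing the normalized autocorrelation."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - np.mean(frame)
    best_lag, best_score = min_lag, -np.inf
    for lag in range(min_lag, max_lag + 1):
        a, b = frame[:-lag], frame[lag:]
        # Normalized correlation between the frame and its lagged copy
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12
        score = np.dot(a, b) / denom
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```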

Page 8

Waveform interpolation speech coding - Encoding

• Encoder extracts pitch-cycle-length Characteristic Waveforms (CWs)

Page 9

Waveform interpolation speech coding - Encoding

• CWs are aligned and normalized

[Figure: CWs after alignment and power normalization, plotted at times t_i−4b … t_i+5b; axes: time and phase 0…2π (rad), amplitude after power normalization]
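The alignment and normalization steps can be sketched as follows (assuming consecutive CWs have already been interpolated to a common length; WI coders typically do the alignment in the Fourier-series domain with fractional shifts, whereas this sketch uses integer circular shifts):

```python
import numpy as np

def power_normalize(cw):
    """Scale a CW to unit RMS; the removed gain is sent as the 'power' parameter."""
    return cw / (np.sqrt(np.mean(cw ** 2)) + 1e-12)

def align_cw(prev_cw, cw):
    """Circularly shift cw to maximize its correlation with the previous CW."""
    scores = [np.dot(prev_cw, np.roll(cw, s)) for s in range(len(cw))]
    best = int(np.argmax(scores))
    return np.roll(cw, best), best
```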

Page 10

Waveform interpolation speech coding - Encoding

• CWs are decomposed into Slowly and Rapidly Evolving Waveforms (SEW & REW)
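The decomposition itself is a lowpass/highpass split along the evolution (frame) axis of the aligned CW surface; WI coders typically use a lowpass with a cutoff around 20 Hz on this axis, which the sketch below replaces with an illustrative moving average:

```python
import numpy as np

def sew_rew(cw_surface, L=5):
    """cw_surface: (num_frames, cw_len) matrix of aligned CWs.
    SEW = lowpass along the frame axis, REW = CW - SEW."""
    kernel = np.ones(L) / L
    # Filter each phase position across frames (the evolution axis)
    sew = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"),
                              axis=0, arr=cw_surface)
    rew = cw_surface - sew
    return sew, rew
```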

Page 11

Waveform interpolation speech coding - Encoding

• Block diagram of the encoder:

speech → [LP analysis and filtering] → residual (+ LSF)
residual → [pitch estimation] → pitch
residual + pitch → [waveform extraction] → CW
CW → [alignment and normalization] → aligned CW (+ power)
aligned CW → [waveform decomposition] → SEW & REW

Transmitted parameters: LSF, pitch, power, SEW, REW

Page 12

Waveform interpolation speech coding - Quantization

• Benefit of the SEW/REW decomposition:
  • SEW represents the periodic component:
    • can be quantized using a slow update rate
    • redundancy can be exploited
  • REW represents the noise-like component:
    • can be quantized very roughly
• Parameters to be quantized:
  • LP coefficients (LSFs)
  • Pitch
  • Gain/energy
  • SEW (magnitude spectrum (harmonics), possibly phases at higher bit rates)
  • REW (very coarse magnitude spectrum)

Page 13

Waveform interpolation speech coding – Decoding

• Decoding of parameters
• CW reconstruction: CW = SEW + REW
• Realignment and denormalization
• Phase track estimation (based on pitch)

[Figure: decoded pitch track (pitch period in samples) and the estimated phase track (phase in rad); x-axis: time (samples)]

Page 14

Waveform interpolation speech coding - Decoding

• Conversion from CW into residual using the phase track

Page 15

Waveform interpolation speech coding - Decoding

• Block diagram of the decoder:

SEW + REW → [CW reconstruction] → CW
CW (+ power) → [realignment and denormalization] → CW surface
pitch → [phase track estimation] → phase track
CW surface + phase track → [conversion to residual signal] → residual
residual (+ LSF) → [LP synthesis filtering] → speech

Page 16

Sinusoidal speech coding

• Very closely related to waveform interpolation
• The residual is represented using harmonic sinusoids
• Voicing estimation:
  • Threshold of voicing
  • Voicing levels for the individual sinusoids
• Both components (voiced and unvoiced) can be represented using sinusoids
  • Linearly evolving "voiced phases", random "unvoiced phases"
• Typical parameters: LSFs, pitch, gain, magnitudes, voicing (scalar or vector)

Page 17

Vector quantization

• A very efficient basic tool for speech coding
• Codebook: N code vectors
• Quantization:
  • Select the closest-matching vector from the codebook
    • using some distortion measure (e.g. squared error)
  • Store or transmit the index of the selected code vector
• Dequantization:
  • Table look-up (pick the code vector corresponding to the index)
• The most common distortion measures:
  • Squared error
  • Weighted squared error (perceptual weights)
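The encode/decode operations above amount to a few lines (the weighted squared error is shown with a per-dimension weight vector, a common simplification of a full weighting matrix):

```python
import numpy as np

def vq_encode(x, codebook, w=None):
    """Return the index of the code vector minimizing (weighted) squared error."""
    d = codebook - x                      # (N, dim) differences
    if w is None:
        err = np.sum(d ** 2, axis=1)
    else:
        err = np.sum(w * d ** 2, axis=1)  # perceptual weights per dimension
    return int(np.argmin(err))

def vq_decode(index, codebook):
    """Dequantization is a plain table look-up."""
    return codebook[index]
```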

Page 18

Vector quantization

• Basic structure of a vector quantizer

Page 19

Vector quantization

• Illustration of two-dimensional vector quantization:

Page 20

Vector quantization

• Training can be handled using the very simple Generalized Lloyd Algorithm:
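A compact version of that training loop (nearest-neighbor partition, then centroid update; the random initialization and the empty-cell handling are simplifications — practical training often uses splitting initialization instead):

```python
import numpy as np

def train_gla(data, N, iters=20, seed=0):
    """Train an N-vector codebook with the Generalized Lloyd Algorithm."""
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), N, replace=False)].astype(float)
    for _ in range(iters):
        # Partition: assign each training vector to its closest code vector
        d = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Centroid update; keep the old code vector for empty cells
        for k in range(N):
            members = data[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook
```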

Page 21

Vector quantization – memory & complexity reduction

• Sometimes high quality would require unrealistic codebook sizes:
  • Memory requirements
  • Search complexity (full search needed in general)
• For example, high-quality quantization of an LSF vector requires ~20 bits:
  • 2^20 = 1 048 576 code vectors
  • With 10-dimensional vectors and 2 bytes/element: 20 MB of memory
• The problem can be solved using structural constraints
• Two very popular approaches:
  • Split VQ
    • Very simple approach
    • Not the best solution
  • Multi-stage VQ
    • Better than split VQ in terms of accuracy (at a given bit rate)
    • High-quality training not as simple, but possible
    • The search algorithm affects the performance

Page 22

Vector quantization – memory & complexity reduction

• Split VQ
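Split VQ quantizes subvectors independently with separate codebooks; a sketch for a 10-dimensional LSF vector split into two halves (the split point and codebook contents are illustrative assumptions):

```python
import numpy as np

def split_vq_encode(x, cb_low, cb_high, split=5):
    """Quantize x[:split] and x[split:] independently."""
    i = int(np.argmin(np.sum((cb_low - x[:split]) ** 2, axis=1)))
    j = int(np.argmin(np.sum((cb_high - x[split:]) ** 2, axis=1)))
    return i, j

def split_vq_decode(i, j, cb_low, cb_high):
    """Concatenate the two partial code vectors."""
    return np.concatenate([cb_low[i], cb_high[j]])
```

The memory win is what motivates the structure: 20 bits split as 10 + 10 over two 5-dimensional halves needs 2 × 2^10 × 5 elements (~20 KB at 2 bytes/element) instead of 2^20 × 10 elements (~20 MB), at the cost of ignoring correlation across the split.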

Page 23

Vector quantization – memory & complexity reduction

• Multi-stage VQ
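A greedy (sequential-search) multi-stage VQ sketch: each stage quantizes the quantization error left by the previous one. A tree (M-best) search would instead keep several candidates per stage before committing, which is why the search algorithm affects performance:

```python
import numpy as np

def msvq_encode(x, codebooks):
    """Greedy MSVQ: quantize x with stage 1, its error with stage 2, etc."""
    indices, residual = [], np.asarray(x, dtype=float)
    for cb in codebooks:
        i = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        indices.append(i)
        residual = residual - cb[i]
    return indices

def msvq_decode(indices, codebooks):
    """Reconstruction is the sum of the selected stage vectors."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))
```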

Page 24

Vector quantization – Tree search in MSVQ

Page 25

Quantization of variable-dimension vectors

• One of the assumptions in vector quantization is that the dimension is fixed
• In very low bit rate speech coding, quantization of the excitation generally requires quantization of variable-dimension vectors
• The problem is usually solved using some dimension normalization technique:
  • Truncation / zero-padding
  • Frequency bins
  • Band-limited interpolation (e.g. splines)
  • Discrete Cosine Transform (DCT)
  • Polynomial approximation
• The DCT offers good performance; however:
  • Complexity is reasonably high
  • No simple approaches are available for codebook training with perceptual weights
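Of the techniques listed, DCT-based dimension conversion can be sketched as follows: analyze with an orthonormal DCT-II, truncate or zero-pad the coefficients, and synthesize at the target dimension (the sqrt(M/N) level scaling is one common convention, assumed here):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis of size n (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    C[0] *= np.sqrt(1.0 / n)
    C[1:] *= np.sqrt(2.0 / n)
    return C

def resample_dct(x, target_dim):
    """Convert a vector of any dimension to target_dim via the DCT domain."""
    n, m = len(x), target_dim
    X = dct_matrix(n) @ x
    Xm = np.zeros(m)
    keep = min(n, m)
    Xm[:keep] = X[:keep] * np.sqrt(m / n)   # preserve the signal level
    return dct_matrix(m).T @ Xm             # orthonormal DCT: inverse = transpose
```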

Page 26

Exploiting correlation between speech frames

• Vector quantization is the best possible technique for quantization of individual vectors
• However, it does not take into account the correlations between consecutive speech frames
• Efficiency can be further improved by exploiting these correlations/redundancies
  • At the expense of a bigger delay or increased sensitivity to bit errors
• Good approaches:
  • Predictive quantization
  • Dynamic codebook reordering
  • Matrix quantization

Page 27

Exploiting correlation between speech frames

• Predictive quantization:
  • Use the previous vector for forming a prediction of the current vector
  • Quantize only the prediction error
  • The predictor can be auto-regressive (AR), moving average (MA), or both
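A minimal AR(1) predictive VQ sketch (the scalar prediction coefficient is an illustrative assumption; practice uses a trained matrix or MA predictor). One instance runs at the encoder, a mirror instance at the decoder, and both track the same reconstructed state:

```python
import numpy as np

class PredictiveVQ:
    """AR(1) predictive VQ: only the prediction error is quantized."""

    def __init__(self, error_codebook, a=0.7):
        self.cb = np.asarray(error_codebook, dtype=float)
        self.a = a                                   # prediction coefficient
        self.prev = np.zeros(self.cb.shape[1])       # last reconstructed vector

    def encode(self, x):
        e = x - self.a * self.prev                   # prediction error
        i = int(np.argmin(np.sum((self.cb - e) ** 2, axis=1)))
        self.prev = self.a * self.prev + self.cb[i]  # track the decoder's state
        return i

    def decode(self, i):
        x_hat = self.a * self.prev + self.cb[i]
        self.prev = x_hat
        return x_hat
```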

Page 28

Exploiting correlation between speech frames

• Dynamic codebook reordering:
  • Reduces the entropy of the index data through reordering
  • After encoding (and storing/transmitting the index), the codebook is reordered based on the distance to the selected code vector
  • A similar reordering is done at the decoder after each decoding operation
  • Since the vectors from consecutive frames are likely to be quite similar, entropy is reduced
    • Fewer bits are needed for lossless compression of the indices (using e.g. Huffman coding)
  • No need for quantizer retraining
  • Works quite well for conventional VQs
  • Also suitable e.g. for multi-stage quantization if the whole quantizer structure is taken into account
  • Examples in the following slides
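The reordering rule can be sketched directly; because both ends apply the same rule after every frame, they stay synchronized, and similar consecutive vectors map to small indices for the entropy coder:

```python
import numpy as np

def dcr_reorder(codebook, chosen):
    """Sort code vectors by distance to the selected one (closest first)."""
    order = np.argsort(np.sum((codebook - chosen) ** 2, axis=1))
    return codebook[order]

def dcr_encode(codebook, x):
    """Quantize x, then return (index, reordered codebook)."""
    i = int(np.argmin(np.sum((codebook - x) ** 2, axis=1)))
    return i, dcr_reorder(codebook, codebook[i])

def dcr_decode(codebook, i):
    """Look up the code vector, then apply the identical reordering."""
    return codebook[i], dcr_reorder(codebook, codebook[i])
```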

Page 29

Exploiting correlation between speech frames

• Index probabilities at the last stage of a 4-stage quantizer

Page 30

Exploiting correlation between speech frames

• Bit rates achievable using a 4-stage MSVQ for LSFs (originally 2200 bps):

Page 31

Exploiting correlation between speech frames

• Matrix quantization:
  • Group consecutive vectors into matrices
  • Quantize one matrix at a time
  • Can be realized using conventional VQ (by concatenating the vectors)

Page 32

Multi-mode quantization

• Accuracy at a given bit rate can be further enhanced using multi-mode quantization
  • Two or more modes
  • Separate codebooks (and/or predictors) for the different modes
• Especially useful if the mode can be derived from other parameters without sending any additional information
• Using voicing-based modes provides good performance:
  • Voicing information is readily available
  • For example, LSFs are quite different in voiced and unvoiced regions

Page 33

Segmental speech coding

• Our current approach for very low bit rates
  • Used in our VLBR speech codec
• Basic ideas:
  • Divide the speech into segments with high intra-segment similarity
  • Handle quantization for one segment at a time
  • Exploit redundancies within each segment
  • Use multiple modes for different segment types; one possible approach:
    • Silence
    • Unvoiced speech
    • Voiced speech
    • Speech containing both voiced and unvoiced characteristics

Page 34

Segmental speech coding

• Motivation: segmental nature of speech

Page 35

Segmental speech coding

• Encoder structure

Page 36

Segmental speech coding

• Segmentation:

Page 37

Segmental speech coding

• Listening test results (MOS scale 1-5, old version of the codec)

Page 38

Perceptual irrelevancy removal

• Perceptual weighting can be used to enhance the quantization of the speech parameters
• Even better results can be obtained by removing the perceptually irrelevant parts of the signal as part of pre-processing
• A psychoacoustic model is used in determining perceptual importance

Page 39

Perceptual irrelevancy removal

• Block diagram:

Page 40

Perceptual irrelevancy removal

• Example of masking threshold determination

Page 41

The End

• Thank you!