![Page 1](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/1.jpg)
Characteristics of Speech
- Long-term (sentence level, several seconds)
  - Drastic/irregular changes
- Short-term (frame level, 20 ms or so)
  - Regular periodic changes for voiced sounds
  - Noise-like for unvoiced sounds
Hard to recognize without context information
![Page 2](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/2.jpg)
Spectrum in Frequency Domain
Three basic characteristics in a spectrum:
- Timbre: Spectrum after smoothing
- Pitch: Distance between harmonics
- Intensity: Magnitude of spectrum
[Figure: spectrum annotated with the first formant F1, the second formant F2, the pitch frequency, and the intensity]
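As a rough illustration of how pitch and intensity can be read from a spectrum, the sketch below builds a synthetic harmonic frame and inspects its magnitude spectrum with `fft`. The sample rate, fundamental frequency, harmonic amplitudes, threshold, and all variable names are assumptions chosen so that the harmonics fall exactly on FFT bins; real speech would need windowing and more careful peak picking.

```matlab
% Spectrum sketch on a synthetic harmonic frame (all values are illustrative assumptions).
fs = 8000;                 % Assumed sample rate in Hz
N  = 512;                  % Frame length in points
f0 = 250;                  % Assumed fundamental (pitch) frequency in Hz
n  = (0:N-1)';

% Four harmonics with decreasing amplitude stand in for a voiced frame
amp = [1 0.6 0.4 0.2];
frame = zeros(N, 1);
for k = 1:numel(amp)
    frame = frame + amp(k)*sin(2*pi*k*f0*n/fs);
end

% One-sided magnitude spectrum and its frequency axis
mag = abs(fft(frame));
mag = mag(1:N/2+1);
freqAxis = (0:N/2)'*fs/N;
magDb = 20*log10(mag + eps);         % Intensity: magnitude of the spectrum in dB

% Harmonic peaks: bins well above the numerical noise floor
peakFreq = freqAxis(mag > 0.1*max(mag));
harmonicSpacing = diff(peakFreq);    % Pitch: distance between harmonics

disp(peakFreq');
disp(harmonicSpacing');
```

Timbre would correspond to a smoothed envelope of `magDb`, which this sketch does not compute.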
![Page 3](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/3.jpg)
Timbre Demo: Real-time Spectrogram
Simulink model for real-time display of spectrogram:
- dspstfft_audio (before MATLAB R2011a)
- dspstfft_audioInput (R2012a or later)
[Figure panels: Spectrogram, Spectrum]
![Page 4](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/4.jpg)
Audio Feature Extraction & Recognition
- Frame blocking: Frame duration of 20 ms
- Feature extraction: Volume, pitch, MFCC, LPC, etc.
- Endpoint detection: Based on volume & ZCR
- Recognition: DTW, HMM
![Page 5](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/5.jpg)
Example: Audio Feature Extraction
- 256 points/frame
- 84 points overlap
- 11025/(256-84) ≈ 64 feature vectors per second
[Figure: waveform plot with a zoomed-in view marking one frame and the overlap between neighboring frames]
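The frame blocking arithmetic above can be sketched as follows. The test tone and all variable names are assumptions for illustration; only the frame size (256), the overlap (84), and the figure 11025, taken here to be the sample rate in Hz, come from the slide.

```matlab
% Frame blocking sketch (hypothetical variable names; synthetic input signal).
fs = 11025;                 % Sample rate in Hz (assumed meaning of 11025 in the slide)
frameSize = 256;            % Points per frame
overlap = 84;               % Points shared by neighboring frames
hop = frameSize - overlap;  % Frame advance: 172 points

y = sin(2*pi*220*(0:fs-1)'/fs);   % One second of a 220 Hz tone as test input

% Number of complete frames that fit in the signal
frameNum = floor((length(y) - overlap) / hop);

% Store each frame as one column of a frameSize-by-frameNum matrix
frameMat = zeros(frameSize, frameNum);
for i = 1:frameNum
    startIndex = (i-1)*hop + 1;
    frameMat(:, i) = y(startIndex : startIndex + frameSize - 1);
end

frameRate = fs / hop;       % Feature vectors per second, about 64
fprintf('%d frames, %.1f frames per second\n', frameNum, frameRate);
```

Each column of `frameMat` can then be passed to a feature extractor; the last 84 points of one column equal the first 84 points of the next.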
![Page 6](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/6.jpg)
Three Basic Acoustic Features
Three basic speech features:
- Volume/Energy/Intensity: Vibration amplitude
- Pitch: Fundamental frequency (equal to the reciprocal of the fundamental period)
- Timbre: The waveform within a fundamental period
These features are perceived subjectively by humans. However, we can use some mathematics to emulate human perception and capture these features.
![Page 7](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/7.jpg)
Acoustic Feature: Energy
Energy is the square sum of a frame, also known as intensity or volume.
Characteristics:
- Noise and fricatives usually have low energy.
- Energy is strongly influenced by the microphone setup.
- Taking 10 times the base-10 logarithm of the square sum gives energy in decibels (dB).
- Energy is commonly used in endpoint detection.
- In embedded system implementations, volume can be computed as the sum of absolute values of a frame to reduce computation.
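A minimal sketch of these volume measures for a single frame is shown below. The test frame and variable names are assumptions, and the `eps` term is an added guard against taking the logarithm of zero on a silent frame, not something the slide specifies.

```matlab
% Volume sketch for a single frame (synthetic input; names are illustrative assumptions).
fs = 11025;
n = (0:255)';
frame = 0.5*sin(2*pi*220*n/fs);     % 256-point test frame

energy   = sum(frame.^2);           % Square sum of the frame
energyDb = 10*log10(energy + eps);  % Decibels; eps guards against log10(0)
absSum   = sum(abs(frame));         % Cheaper alternative: sum of absolute values

% Scaling the amplitude by 10 should raise the decibel value by about 20 dB
louderDb = 10*log10(sum((10*frame).^2) + eps);
fprintf('energy = %.3f, %.2f dB, abs sum = %.3f, gain = %.2f dB\n', ...
        energy, energyDb, absSum, louderDb - energyDb);
```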
![Page 8](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/8.jpg)
Acoustic Feature: Zero Crossing Rate
Zero crossing rate (ZCR): The number of zero crossings in a frame.
Characteristics:
- Noise and unvoiced sounds have high ZCR.
- ZCR is commonly used in endpoint detection, especially in detecting the start and end of unvoiced sounds.
- To distinguish noise/silence from unvoiced sounds, we usually add a bias before computing ZCR.
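The effect of the bias can be sketched with three synthetic frames. The stand-in signals, the bias value, the direction of the shift, and the `zcr` helper are all assumptions made for illustration; the slide only says that a bias is added before counting.

```matlab
% ZCR sketch with a bias (signals, bias value, and names are illustrative assumptions).
fs = 11025;
n = (0:255)';

voiced   = 0.5*sin(2*pi*220*n/fs);    % Low-frequency tone standing in for a voiced frame
unvoiced = 0.1*sin(2*pi*4000*n/fs);   % High-frequency tone standing in for an unvoiced frame
silence  = 0.001*(-1).^n;             % Tiny alternating signal standing in for background noise

% Count sign changes between neighboring samples (zero is treated as positive)
zcr = @(x) sum(abs(diff(double(x >= 0))));

bias = 0.01;   % Assumed shift: above the noise floor, well below the speech amplitude

zcrPlain  = [zcr(voiced), zcr(unvoiced), zcr(silence)];
zcrBiased = [zcr(voiced - bias), zcr(unvoiced - bias), zcr(silence - bias)];

disp(zcrPlain);
disp(zcrBiased);
```

Without the bias, the low-level noise crosses zero at almost every sample and looks like an unvoiced sound. After the shift it no longer crosses zero, while the unvoiced stand-in keeps a high count.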
![Page 9](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/9.jpg)
![Page 10](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/10.jpg)
Pitch
Computation:
- Pitch frequency is the reciprocal of the fundamental period.
- Pitch in terms of semitone:
$$\text{semitone} = 69 + 12 \log_2\left(\frac{\text{freq}}{440}\right)$$
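The semitone conversion can be tried on a few frequencies as follows. The variable names and test frequencies are assumptions; the inverse mapping is included only as a consistency check.

```matlab
% Semitone conversion sketch (variable names and test frequencies are assumptions).
freq = [220 440 880];                      % Frequencies in Hz
semitone = 69 + 12*log2(freq/440);         % 440 Hz maps to semitone 69

% Inverse mapping, used here only to check the conversion
freqBack = 440 * 2.^((semitone - 69)/12);

% A fundamental period of 1/440 second gives a pitch of 440 Hz
period = 1/440;
pitchFromPeriod = 1/period;

disp(semitone);
```

Doubling the frequency raises the pitch by one octave, which is 12 semitones.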
![Page 11](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/11.jpg)
Basic Process of Sound Production and Reception
- Vibration of the sounding body → air waves → vibration of the eardrum → reception by the inner-ear nerves → recognition by the brain
- Sound production mechanisms:
  - Natural vibration frequency excited by striking (e.g., a tuning fork)
  - Resonance frequency excited by air friction (e.g., a flute)
![Page 12](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/12.jpg)
Human Speech Production
![Page 13](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/13.jpg)
The Vocal Tract
![Page 14](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/14.jpg)
Glottal Volume Velocity & Resulting Sound Pressure (Voiced)
![Page 15](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/15.jpg)
Speech Production
Glottal pulses → Vocal tract → Speech signal
[Figure: (a) Source Spectrum + (b) Filter Function = (c) Output Energy Spectrum]
![Page 16](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/16.jpg)
Acoustical Analysis (speech signal of “七”, Mandarin for “seven”)
![Page 17](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/17.jpg)
Speech Production Modeling
[Block diagram: an impulse train generator (driven by the pitch period) or a noise generator supplies the excitation u(n); the excitation is multiplied by the gain G and passed through a time-varying digital filter (controlled by the vocal tract parameters) to produce the speech signal s(n). Other labels in the figure: phonation, whispering, frication, compression, vibration.]
![Page 18](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/18.jpg)
Parametric Representation
[Block diagram: the excitation u(n) is multiplied by the gain G and passed through the filter block labeled A(z) to produce s(n)]
- G = gain of excitation
- u(n) = excitation source (quasi-periodic pulse train or random noise)
Model:
$$s(n) = G\,u(n) + \sum_{k=1}^{p} a_k\, s(n-k)$$
Z-transform:
$$S(z) = G\,U(z) + \sum_{k=1}^{p} a_k z^{-k} S(z)$$
Written in terms of A(z):
$$H(z) = \frac{S(z)}{G\,U(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}$$
![Page 19](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/19.jpg)
The Speech Model: A Summary
Parameters of the model:
- Voiced/unvoiced classification,
- Pitch period for voiced sounds,
- The gain parameter, and
- The coefficients of the digital filter, {a_k}.
$$s(n) = G\,u(n) + \sum_{k=1}^{p} a_k\, s(n-k)$$
$$s(n) \approx \sum_{k=1}^{p} a_k\, s(n-k)$$
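To make the roles of the four parameters concrete, the sketch below synthesizes a short voiced segment from the recursion s(n) = G u(n) + Σ a_k s(n-k) using the built-in `filter` function. The sample rate, pitch period, gain, filter order p = 2, and coefficient values are all hypothetical; real coefficients would come from LPC analysis of recorded speech.

```matlab
% Source-filter synthesis sketch (all parameter values are hypothetical).
fs = 8000;                       % Assumed sample rate in Hz
N = 400;                         % Number of samples to synthesize
pitchPeriod = 80;                % Pitch period in samples (100 Hz at fs = 8000)
G = 0.5;                         % Gain of excitation
a = [1.3 -0.8];                  % Assumed coefficients a_1, a_2 (poles inside the unit circle)

% Voiced excitation: impulse train with one pulse per pitch period.
% For an unvoiced sound, u could be replaced by random noise, e.g. randn(N, 1).
u = zeros(N, 1);
u(1:pitchPeriod:end) = 1;

% H(z) = 1/A(z) with A(z) = 1 - sum_k a_k z^(-k)
s = filter(1, [1 -a], G*u);

% Check the recursion s(n) = G*u(n) + a_1*s(n-1) + a_2*s(n-2) for n > 2
predicted = a(1)*s(2:end-1) + a(2)*s(1:end-2);
residual = s(3:end) - predicted;            % Should equal G*u(3:end)
fprintf('max recursion error: %g\n', max(abs(residual - G*u(3:end))));
```

Between pulses the excitation is zero, so each sample is exactly the weighted sum of the previous p samples, which is the prediction form in the second equation.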
![Page 20](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/20.jpg)
Glossary (English: Chinese)
- Cochlea: 耳蝸
- Phoneme: 音素、音位
- Phonics: 聲學;聲音基礎教學法(以聲音為基礎進而教拼字的教學法)
- Phonetics: 語音學
- Phonology: 音系學、語音體系
- Prosody: 韻律學;作詩法
- Syllable: 音節
- Tone: 音調
- Alveolar: 齒槽音
- Silence: 靜音
- Noise: 雜訊
- Glottis: 聲門
- Larynx: 喉頭
- Pharynx: 咽頭
- Pharyngeal: 咽部的,喉音的
- Velum: 軟顎
- Vocal cords: 聲帶
- Esophagus: 食管
- Diaphragm: 橫隔膜
- Trachea: 氣管
![Page 21](https://reader030.fdocuments.us/reader030/viewer/2022032805/56649ee65503460f94bf6731/html5/thumbnails/21.jpg)
Hints for Exercises
How to generate a sine wave signal:
Math formula: $y = a \sin(2\pi f t)$
MATLAB code:
duration=3;                  % Duration in seconds
f=440;                       % Frequency in Hz
fs=16000;                    % Sample rate in Hz
time=(0:duration*fs-1)/fs;   % Time vector in seconds
y=0.8*sin(2*pi*f*time);      % Sine wave with amplitude a=0.8
plot(time, y);
sound(y, fs);