A History and Overview of Machine Listening
Transcript of A History and Overview of Machine Listening
Machine Listening - Dan Ellis 2010-05-12 /28
1. A Machine Listener
2. Key Tools in Machine Listening
3. Outstanding Problems
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
[email protected] http://labrosa.ee.columbia.edu/
1. Machine Listening
“Listening puts us in the world” (Handel, 1989)
• Listening = getting useful information from sound
  - signal processing + abstraction
  - "useful" depends on what you're doing
• Machine Listening = devices that respond to particular sounds
Listening Machines
• 1922: magnet releases dog in response to 500 Hz energy
• 1984: two claps toggle power outlet on/off
Why is Machine Listening obscure?
• A poor second to Machine Vision:
  - vision leads to more immediate practical applications (robotics)?
  - "machine listening" has been subsumed by speech recognition?
  - images are more tangible?
Listening to Mixtures
• The world is cluttered & sound is transparent
  - mixtures are a certainty
• Useful information is structured by 'sources'
  - specific definition of a 'source': intentional independence
Listening vs. Separation
• Extracting information does not require reconstructing waveforms
  - e.g. Shazam fingerprinting [Wang '03]
• But... high-level description permits resynthesis
[Figure: spectrograms (freq / Hz vs. time / sec) of a query audio clip and its database match: "Match: 05-Full Circle at 0.032 sec"]
Listening Machine Parts
• Representation: what is perceived
• Organization: handling overlap and interference
• Recognition: object classification & description
• Models: stored information about objects
[Block diagram: Sound → Front-end representation → Scene organization / separation → Object recognition & description → Action, with a Memory / models store supporting the later stages]
Listening Machines
• What would count as a machine listener, ... and why would we want one?
  - replace people
  - surpass people
  - interactive devices
  - robot musicians
• ASR? yes
• Separation? maybe
Machine Listening Tasks
9
Task \ Domain   Environmental Sound     Speech    Music
Detect          "Sound Intelligence"    VAD       Speech/Music
Classify        Environment Awareness   Emotion   Music Recommendation
Describe        Automatic Narration     ASR       Music Transcription
2. Key Tools in Machine Listening
• History: an eclectic timeline:
  - what worked? what is useful?
[Timeline figure, 1970-2010:]
• "The Clapper"
• Early speech recognition
• Moorer '75: Music Transcription
• Baker '75: Hidden Markov Models
• Davis & Mermelstein '80: MFCCs
• Weintraub '86: Source separation
• McAulay/Quatieri '86: Sinusoid models
• Serra '89: SMS
• Bregman '90: Auditory Scene Analysis
• Brown & Cooke '94: CASA
• Goto '94: Music Scene Analysis
• Bell & Sejnowski '95: ICA
• Ellis '96: Prediction-driven CASA
• Muscle Fish '98
• Lee & Seung '99: NMF
• Tzanetakis & Cook '02: Music Genre
• Klapuri '03: Transcription
• Diarization
• Kristjansson et al. '06: Iroquois
• Ellis & Lee '06: Segmenting LifeLogs
• Weiss & Ellis '10: Eigenvoice models
Applications: Speech, Music, Environmental, Separation
Early Speech Recognition (Vintsyuk '68; Ney '84)
• DTW template matching
• Innovations: template models, time warping
[Figure: DTW alignment grid of input frames f_i (test, time/frames) vs. reference frames r_j for templates ONE-FIVE, with the lowest-cost path]

D(i,j) = d(i,j) + min{ D(i-1,j) + T10, D(i,j-1) + T01, D(i-1,j-1) + T11 }

  d(i,j): local match cost
  D(i,j): lowest cost to (i,j)
  T10, T01, T11: transition costs of the best predecessor
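The recurrence above can be sketched directly in numpy (an editor's illustration, not code from the talk; the function and parameter names are our own):

```python
import numpy as np

def dtw_cost(test, ref, t10=1.0, t01=1.0, t11=0.0):
    """Total DTW alignment cost between two feature sequences.

    test: (I, d) array of input frames; ref: (J, d) array of reference
    frames.  Implements D(i,j) = d(i,j) + min(D(i-1,j)+T10,
    D(i,j-1)+T01, D(i-1,j-1)+T11) with Euclidean local match cost.
    """
    I, J = len(test), len(ref)
    # Local match costs d(i,j)
    d = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=-1)
    D = np.full((I + 1, J + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j] + t10,
                                            D[i, j - 1] + t01,
                                            D[i - 1, j - 1] + t11)
    return D[I, J]
```

With the transition costs t10/t01 penalizing insertions and deletions, an identical template aligns along the diagonal at zero cost.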
MFCCs
• One feature to rule them all?
  - just the right amount of blurring
[Figure: a speech frame as waveform (time / s), linear-frequency spectrum (freq / Hz), Mel-warped spectrum (freq / Mel), and truncated cepstrum (quefrency)]
Sound → FFT X[k] → Mel scale freq. warp → log |X[k]| → IFFT → Truncate → MFCCs
(spectra → audspec → cepstra)
Davis & Mermelstein '80; Logan '00
MFCC resynthesis:
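The pipeline above can be written as a minimal numpy sketch (an editor's illustration, not the talk's code; we assume a conventional triangular mel filterbank, and use the common DCT-II form of the final cepstral step in place of the slide's IFFT):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_mels=20, n_ceps=13):
    """MFCCs for one windowed frame: FFT -> mel filterbank -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # Triangular filters spaced evenly on the mel scale
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2),
                                    n_mels + 2))
    fbank = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, mid, hi = mel_pts[i], mel_pts[i + 1], mel_pts[i + 2]
        up = (freqs - lo) / (mid - lo)
        down = (hi - freqs) / (hi - mid)
        fbank[i] = np.maximum(0.0, np.minimum(up, down))
    logmel = np.log(fbank @ spec + 1e-10)
    # DCT-II decorrelates; truncation keeps only the smooth spectral envelope
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_mels)
    return dct @ logmel
```

Truncating the cepstrum is the "blurring": fine spectral detail (pitch harmonics) is discarded while the envelope survives.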
Hidden Markov Models
• Recognition as inferring the parameters of a generative model
• Time warping + probability + training
Baker '75; Jelinek '76
[Figure: emission distributions p(x|q) for states q = A, B, C; an observation sequence x_n over time steps n; and the inferred state sequence S A A A A A A A A B B B B B B B B B C C C C B B B B B B C E]

Transition matrix p(q_{n+1} | q_n), states S, A, B, C, E (rows q_n, columns q_{n+1}):

  S:  0   1   0   0   0
  A:  0  .8  .1  .1   0
  B:  0  .1  .8  .1   0
  C:  0  .1  .1  .7  .1
  E:  0   0   0   0   1
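Inferring a state sequence like the one above is usually done with Viterbi decoding; a minimal log-domain sketch (an editor's illustration, not from the talk):

```python
import numpy as np

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely state path through an HMM.

    obs_loglik: (N, Q) log p(x_n | q); log_trans: (Q, Q) log p(q_{n+1} | q_n);
    log_init: (Q,) log initial state probabilities.
    """
    N, Q = obs_loglik.shape
    delta = log_init + obs_loglik[0]          # best score ending in each state
    back = np.zeros((N, Q), dtype=int)        # best predecessor pointers
    for n in range(1, N):
        scores = delta[:, None] + log_trans   # (from-state, to-state)
        back[n] = np.argmax(scores, axis=0)
        delta = scores[back[n], np.arange(Q)] + obs_loglik[n]
    path = [int(np.argmax(delta))]
    for n in range(N - 1, 0, -1):             # trace pointers backwards
        path.append(int(back[n, path[-1]]))
    return path[::-1]
```

This is exactly DTW's dynamic program with distances replaced by log probabilities, which is how HMMs subsume time warping.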
Model Decomposition
• HMMs applied to multiple sources
  - infer generative states for many models
  - combination of observations...
  - exponential complexity...
Varga & Moore '90; Gales & Young '95; Ghahramani & Jordan '97; Kristjansson et al. '06; Hershey et al. '10
[Figure: state lattices for model 1 and model 2 jointly explaining one stream of observations over time]
Segmentation & Classification
• Label audio at scale of ~ seconds
  - segment when signal changes
  - describe by statistics
  - classify with existing models (HMMs?)
• Many supervised applications
  - speaker ID
  - music genre
  - environment ID
[Figure: video soundtrack spectrogram (VTS_04_0001, freq / kHz vs. time / sec) → MFCC features over time → MFCC covariance matrix (MFCC dimension vs. MFCC dimension)]
Chen & Gopalakrishnan '98; Tzanetakis & Cook '02; Lee & Ellis '06
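The "describe by statistics" step can be sketched as summarizing each segment's feature frames by a mean and covariance, then scoring against per-class Gaussians (an editor's illustration; the Gaussian-classifier choice and all names are ours, not the cited systems'):

```python
import numpy as np

def gaussian_stats(frames):
    """Summarize a segment's feature frames by mean and covariance."""
    mu = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])
    return mu, cov

def segment_loglik(frames, mu, cov):
    """Total log-likelihood of frames under a full-covariance Gaussian."""
    d = frames.shape[1]
    diff = frames - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
    return float(np.sum(-0.5 * (quad + logdet + d * np.log(2 * np.pi))))

def classify_segment(frames, class_stats):
    """Pick the class whose Gaussian best explains the segment's frames."""
    return max(class_stats,
               key=lambda c: segment_loglik(frames, *class_stats[c]))
```

The same second-order statistics also drive change detection (e.g. BIC-style segmentation in Chen & Gopalakrishnan '98).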
Music Transcription
• Music audio has a very specific structure
  ... and an explicit abstract content
  - pitch + harmonics
  - rhythm
  - 'harmonized' voices
Moorer '75; Goto '94, '04; Klapuri '02, '06
[Figure from Goto '04]
Sinusoidal Models
• Stylize a spectrogram into discrete components
  - discrete pieces → objects
  - good for modification & resynthesis
[Figure: spectrogram (freq / kHz vs. time / s) and its stylization into discrete sinusoidal tracks, from Wang '95]
McAulay & Quatieri '84; Serra '89; Maher & Beauchamp '94
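A toy version of the analysis step - per-frame spectral peak picking plus additive resynthesis - might look like this (our sketch only; real sinusoidal modeling also links peaks into tracks across frames and matches phase):

```python
import numpy as np

def frame_peaks(frame, sr, n_peaks=5):
    """Pick the n strongest local spectral maxima of one frame."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # local maxima (skip the DC and Nyquist edges)
    ismax = (spec[1:-1] > spec[:-2]) & (spec[1:-1] > spec[2:])
    idx = np.where(ismax)[0] + 1
    idx = idx[np.argsort(spec[idx])[::-1][:n_peaks]]   # strongest first
    return freqs[idx], spec[idx] / (len(frame) / 4)    # rough Hann amp scaling

def resynth(peak_tracks, hop, sr):
    """Additive resynthesis: each frame contributes its peak sinusoids."""
    out = np.zeros(hop * len(peak_tracks))
    t = np.arange(hop) / sr
    for k, (freqs, amps) in enumerate(peak_tracks):
        seg = sum(a * np.cos(2 * np.pi * f * (t + k * hop / sr))
                  for f, a in zip(freqs, amps))
        out[k * hop:(k + 1) * hop] = seg
    return out
```

Because the representation is a short list of (frequency, amplitude) pairs per frame, modifications like pitch shifting or time stretching become simple parameter edits before resynthesis.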
Comp. Aud. Scene Analysis
• Computer implementations of principles from [Bregman 1990] etc.
• Time-frequency masking for resynthesis
Weintraub '85; Brown & Cooke '94; Ellis '96; Roweis '03; Hu & Wang '04
[Block diagram: input mixture → Front end (signal features / maps: onset, period, freq. mod) → Object formation (discrete objects) → Grouping rules → Source groups]
[Figure: mixture spectrogram and oracle-mask resynthesis (freq / kHz vs. time / s), using harmonicity and onset cues]
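The time-frequency masking idea can be demonstrated with an "oracle" (ideal binary) mask computed from the pre-mix sources (an editor's sketch with a minimal STFT/ISTFT, not code from any of the systems cited above):

```python
import numpy as np

def stft(x, n=512, hop=256):
    """Hann-windowed short-time Fourier transform, frames as rows."""
    w = np.hanning(n)
    frames = [w * x[i:i + n] for i in range(0, len(x) - n + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(X, n=512, hop=256):
    """Overlap-add inverse, normalized by the summed squared window."""
    w = np.hanning(n)
    out = np.zeros(hop * (len(X) - 1) + n)
    norm = np.zeros_like(out)
    for k, F in enumerate(X):
        out[k * hop:k * hop + n] += w * np.fft.irfft(F, n)
        norm[k * hop:k * hop + n] += w * w
    return out / np.maximum(norm, 1e-8)

# Oracle binary mask: keep mixture cells where the target dominates
sr = 8000
t = np.arange(2 * sr) / sr
s1 = np.sin(2 * np.pi * 440 * t)          # "target" source
s2 = 0.8 * np.sin(2 * np.pi * 1250 * t)   # interfering source
mix = s1 + s2
mask = (np.abs(stft(s1)) > np.abs(stft(s2))).astype(float)
est = istft(mask * stft(mix))             # masked resynthesis approximates s1
```

A real CASA system must estimate the mask from cues like harmonicity and common onset; the oracle mask only shows the ceiling of what masking-based resynthesis can achieve.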
Independent Component Analysis
• Can separate "blind" combinations by maximizing independence of outputs
[Figure: two sources s1, s2 mixed through gains a11, a12, a21, a22 into observations m1, m2; unmixing adjusts by -δMutInfo/δa]
[Figure: scatter plot of mix 1 vs. mix 2, and kurtosis vs. mixture rotation, peaking at the source directions]

kurt(y) = E[((y - µ) / σ)^4] - 3
Bell & Sejnowski '95; Smaragdis '98
kurtosis as a measure of independence?
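The "kurtosis as independence" idea can be demonstrated by whitening a two-channel mixture and searching for the rotation that maximizes total |kurtosis| (a toy editor's sketch, not a full ICA algorithm such as Bell & Sejnowski's infomax):

```python
import numpy as np

def kurt(y):
    """Excess kurtosis E[((y-mu)/sigma)^4] - 3; zero for a Gaussian."""
    z = (y - y.mean()) / y.std()
    return np.mean(z ** 4) - 3.0

def unmix_by_kurtosis(mixes, n_angles=360):
    """Whiten the 2-channel mixture, then pick the rotation that
    maximizes total |kurtosis| of the outputs."""
    # whiten: decorrelate and equalize variance
    x = mixes - mixes.mean(axis=1, keepdims=True)
    cov = x @ x.T / x.shape[1]
    vals, vecs = np.linalg.eigh(cov)
    white = np.diag(vals ** -0.5) @ vecs.T @ x
    best, best_score = white, -np.inf
    for th in np.linspace(0.0, np.pi / 2, n_angles):
        R = np.array([[np.cos(th), -np.sin(th)],
                      [np.sin(th),  np.cos(th)]])
        y = R @ white
        score = abs(kurt(y[0])) + abs(kurt(y[1]))
        if score > best_score:
            best, best_score = y, score
    return best
```

After whitening, only a rotation (up to sign and permutation) remains unknown, so a 1-D search over angle suffices in two dimensions; |kurtosis| peaks when each output is a single non-Gaussian source.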
Nonnegative Matrix Factorization
• Decomposition of spectrograms into templates + activation
  - fast & forgiving gradient descent algorithm
  - fits neatly with time-frequency masking
Lee & Seung '99; Smaragdis & Brown '03; Abdallah & Plumbley '04; Virtanen '07
X = W · H

[Figure from Smaragdis '04: spectrogram (frequency, DFT index vs. time, DFT slices) factored into spectral bases (columns of W) and activations (rows of H); sound examples from Virtanen '03]
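The factorization X = W · H is commonly fit with Lee & Seung's multiplicative updates; a minimal sketch for the Euclidean cost (an editor's illustration; much of the cited audio work uses the KL-divergence variant instead):

```python
import numpy as np

def nmf(X, r, n_iter=200, seed=0):
    """Lee & Seung multiplicative updates for X ~ W @ H (Euclidean cost).

    X: (F, T) nonnegative matrix (e.g. a magnitude spectrogram);
    r: number of templates.  Returns nonnegative W (F, r) and H (r, T).
    """
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, r)) + 0.1   # spectral templates
    H = rng.random((r, T)) + 0.1   # per-template activations
    for _ in range(n_iter):
        # multiplicative updates never change sign, so W, H stay nonnegative
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H
```

This is the "forgiving gradient descent" of the slide: each update is a rescaled gradient step whose multiplicative form automatically preserves nonnegativity, and each learned template W[:, k] with its activation row H[k] can serve as a soft time-frequency mask.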
Summary of Key Ideas
(Sound → Front-end → Organization → Recognition → Action)
• Front-end representation: Spectrogram, MFCCs, Sinusoids
• Scene organization / separation: Segmentation, CASA, ICA, NMF
• Object recognition & description: Direct match, HMMs, Decomposition, Adaptation, SVMs
• Memory / models: Templates, HMMs, Bases
3. Open Issues
• Where to focus to advance machine listening?
  - task & evaluation
  - separation
  - models & learning
  - ASA
  - computational theory
  - attention & search
  - spatial vs. source
Scene Analysis Tasks
• Real scenes are complex!
  - background noise
  - many objects
  - prominence
• Is this just scaling, or is it different?
• Evaluation?
[Figure: spectrograms (f/Hz vs. time/s, level in dB) of city-sound mixtures (Noise1; Noise2, Click1; City) with detected events and listener agreement: Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10); extracted elements Wefts 1-4, Weft5, Wefts 6,7, Weft8, Wefts 9-12]
How Important is Separation?
• Separation systems often evaluated by SNR
  - based on pre-mix components - is this relevant?
• Best machine listening systems have resynthesis
  - e.g. Iroquois speech recognition - "separate then recognize"
• Separated signals don't have to match originals to be useful
[Diagrams: (a) mix → identify target energy (source knowledge, speech models) → t-f masking + resynthesis → separation → ASR → words; (b) mix → identify target energy → find best words model (speech models) → words]
How Many Models?
• More specific models → better analysis
  - need dictionaries for "everything"??
• Model adaptation and hierarchy
  - speaker adapted models: base + parameters
  - extrapolation beyond normal
• Time scales of model acquisition
  - innate/evolutionary (hair-cell tuning)
  - developmental (mother tongue phones)
  - dynamic - the "Bolero" effect
[Figure: "Recognition of Scaled Vowels" - regions for /a/ /e/ /i/ /o/ /u/ over mean pitch / Hz vs. VT length ratio, with the domain of normal speakers marked; Smith, Patterson, Turner, Kawahara & Irino, JASA (2005); CNBH, Physiology Department, Cambridge University]
Smith, Patterson et al. '05
Auditory Scene Analysis?
• Codebook models learn harmonicity, onset
• Can also capture sequential structure
  - e.g. consonants follow vowels
  - use overlapping patches?
• But: computational factors
[Figure: VQ256 codebook trained on female speech - codewords as spectral slices, frequency (kHz) vs. level (dB)]
... to subsume rules/representations of CASA
Computational Theory
• Marr's (1982) perspective on perception
• What is the computational theory of machine listening?
  - independence? sources?
• Computational Theory: properties of the world that make the problem solvable
• Algorithm: specific calculations & operations
• Implementation: details of how it's done
Summary
• Machine Listening: getting useful information from sound
• Techniques for:
  - representation
  - separation / organization
  - recognition / description
  - memory / models
• Where to go?
  - separation?
  - computational theory?
References 1/2
S. Abdallah & M. Plumbley (2004) "Polyphonic transcription by non-negative sparse coding of power spectra," Proc. Int. Symp. Music Info. Retrieval 2004.
J. Baker (1975) "The DRAGON system - an overview," IEEE Trans. Acoust. Speech, Sig. Proc. 23, 24-29, 1975.
A. Bell & T. Sejnowski (1995) "An information maximization approach to blind separation and blind deconvolution," Neural Computation, 7, 1129-1159, 1995.
A. Bregman (1990) Auditory Scene Analysis, MIT Press.
G. Brown & M. Cooke (1994) "Computational auditory scene analysis," Comp. Speech & Lang. 8(4), 297-336, 1994.
S. Chen & P. Gopalakrishnan (1998) "Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion," Proc. DARPA Broadcast News Transcription and Understanding Workshop.
S. Davis & P. Mermelstein (1980) "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust. Speech Sig. Proc. 28, 357-366, 1980.
D. Ellis (1996) Prediction-driven computational auditory scene analysis, Ph.D. thesis, EECS dept., MIT.
D. Ellis & K. Lee (2006) "Accessing minimal impact personal audio archives," IEEE Multimedia 13(4), 30-38, Oct-Dec 2006.
M. Gales & S. Young (1995) "Robust speech recognition in additive and convolutional noise using parallel model combination," Comput. Speech Lang. 9, 289-307, 1995.
Z. Ghahramani & M. Jordan (1997) "Factorial hidden Markov models," Machine Learning, 29(2-3), 245-273, 1997.
M. Goto & Y. Muraoka (1994) "A beat tracking system for acoustic signals of music," Proc. ACM Intl. Conf. on Multimedia, 365-372, 1994.
M. Goto (2004) "A real-time music scene description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals," Speech Comm., 43(4), 311-329, 2004.
S. Handel (1989) Listening, MIT Press.
J. Hershey, S. Rennie, P. Olsen & T. Kristjansson (2010) "Super-human multi-talker speech recognition: A graphical modeling approach," Comp. Speech & Lang., 24, 45-66.
G. Hu & D. Wang (2004) "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Tr. Neural Networks, 15(5), Sep. 2004.
F. Jelinek (1976) "Continuous speech recognition by statistical methods," Proc. IEEE 64(4), 532-556.
A. Klapuri (2003) "Multiple fundamental frequency estimation based on harmonicity and spectral smoothness," IEEE Trans. Speech & Audio Proc., 11(6).
A. Klapuri (2006) "Multiple fundamental frequency estimation by summing harmonic amplitudes," Proc. Int. Symp. Music Info. Retr.
T. Kristjansson, J. Hershey, P. Olsen, S. Rennie & R. Gopinath (2006) "Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system," Proc. Interspeech, 775-1778, 2006.
D. Lee & S. Seung (1999) "Learning the parts of objects by non-negative matrix factorization," Nature 401, 788.
B. Logan (2000) "Mel frequency cepstral coefficients for music modeling," Proc. Int. Symp. Music Inf. Retrieval, Plymouth, September 2000.
R. Maher & J. Beauchamp (1994) "Fundamental frequency estimation of musical signals using a two-way mismatch procedure," J. Acoust. Soc. Am., 95(4), 2254-2263, 1994.
D. Marr (1982) Vision, MIT Press.
References 2/2
R. McAulay & T. Quatieri (1986) "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoust. Speech Sig. Proc. 34, 744-754, 1986.
A. Moorer (1975) On the segmentation and analysis of continuous musical sound by computer, Ph.D. thesis, CS dept., Stanford Univ.
H. Ney (1984) "The use of a one stage dynamic programming algorithm for connected word recognition," IEEE Trans. Acoust. Speech Sig. Proc. 32, 263-271, 1984.
S. Roweis (2003) "Factorial models and refiltering for speech separation and denoising," Proc. Eurospeech, 2003.
X. Serra (1989) A system for sound analysis/transformation/synthesis based on a deterministic plus stochastic decomposition, Ph.D. thesis, Music dept., Stanford Univ.
P. Smaragdis (1998) "Blind separation of convolved mixtures in the frequency domain," Intl. Wkshp. on Indep. & Artif. Neural Networks, Tenerife, Feb. 1998.
P. Smaragdis & J. Brown (2003) "Non-negative matrix factorization for polyphonic music transcription," Proc. IEEE WASPAA, 177-180, October 2003.
P. Smaragdis (2004) "Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs," Proc. ICA, LNCS 3195, 494-499.
D. Smith, R. Patterson, R. Turner, H. Kawahara & T. Irino (2005) "The processing and perception of size information in speech sounds," J. Acoust. Soc. Am. 117(1), 305-318.
G. Tzanetakis & P. Cook (2002) "Musical genre classification of audio signals," IEEE Tr. Speech & Audio Proc. 10(5).
A. Varga & R. Moore (1990) "Hidden Markov Model decomposition of speech and noise," IEEE ICASSP, 845-848, 1990.
T. Vintsyuk (1971) "Element-wise recognition of continuous speech composed of words from a specified dictionary," Kibernetika 7, 133-143, 1971.
T. Virtanen (2007) "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Tr. Audio, Speech, & Lang. Proc. 15(3), 1066-1074, Mar. 2007.
A. Wang (2006) "The Shazam music recognition service," Comm. ACM 49(8), 44-48, 2006.
M. Weintraub (1986) A theory and computational model of auditory monaural sound separation, Ph.D. thesis, EE dept., Stanford Univ.
R. Weiss & D. Ellis (2010) "Speech separation using speaker-adapted eigenvoice speech models," Comp. Speech & Lang., 24(1), 16-29, Jan 2010.