A History and Overview of Machine Listening
Transcript of A History and Overview of Machine Listening
Machine Listening - Dan Ellis 2010-05-12 /28
1. A Machine Listener
2. Key Tools in Machine Listening
3. Outstanding Problems
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
[email protected] http://labrosa.ee.columbia.edu/
1. Machine Listening
“Listening puts us in the world” (Handel, 1989)
• Listening = getting useful information from sound
  - signal processing + abstraction
  - "useful" depends on what you're doing
• Machine Listening = devices that respond to particular sounds
Listening Machines
• 1922: magnet releases dog in response to 500 Hz energy
• 1984: two claps toggle power outlet on/off
Why is Machine Listening obscure?
• A poor second to Machine Vision:
  - vision leads to more immediate practical applications (robotics)?
  - "machine listening" has been subsumed by speech recognition?
  - images are more tangible?
Listening to Mixtures
• The world is cluttered & sound is transparent
  - mixtures are a certainty
• Useful information is structured by 'sources'
  - specific definition of a 'source': intentional independence
Listening vs. Separation
• Extracting information does not require reconstructing waveforms
  - e.g. Shazam fingerprinting [Wang '03]
• But... high-level description permits resynthesis
[Figure: spectrograms (freq / Hz vs. time / sec) of a query audio clip and its database match: "Match: 05-Full Circle at 0.032 sec"]
Listening Machine Parts
• Representation: what is perceived
• Organization: handling overlap and interference
• Recognition: object classification & description
• Models: stored information about objects
[Block diagram: Sound → Front-end representation → Scene organization / separation → Object recognition & description → Action, with a Memory / models store supporting the later stages]
Listening Machines
• What would count as a machine listener, ... and why would we want one?
  - replace people
  - surpass people
  - interactive devices
  - robot musicians
• ASR? yes
• Separation? maybe
Machine Listening Tasks
9
Task \ Domain   Environmental Sound     Speech    Music
Detect          "Sound Intelligence"    VAD       Speech/Music
Classify        Environment Awareness   Emotion   Music Recommendation
Describe        Automatic Narration     ASR       Music Transcription
2. Key Tools in Machine Listening
• History: an eclectic timeline:
  - what worked? what is useful?
[Timeline figure, 1970-2010:]
• "The Clapper"
• Early speech recognition
• Moorer '75: Music Transcription
• Baker '75: Hidden Markov Models
• Davis & Mermelstein '80: MFCCs
• Weintraub '86: Source separation
• McAulay/Quatieri '86: Sinusoid models
• Serra '89: SMS
• Bregman '90: Auditory Scene Analysis
• Brown & Cooke '94: CASA
• Goto '94: Music Scene Analysis
• Bell & Sejnowski '95: ICA
• Ellis '96: Prediction-driven CASA
• Muscle Fish '98
• Lee & Seung '99: NMF
• Tzanetakis & Cook '02: Music Genre
• Klapuri '03: Transcription
• Diarization
• Kristjansson et al. '06: Iroquois
• Ellis & Lee '06: Segmenting LifeLogs
• Weiss & Ellis '10: Eigenvoice models
Applications: Speech, Music, Environmental, Separation
Early Speech Recognition (Vintsyuk '68; Ney '84)
• DTW template matching
• Innovations: template models, time warping
[Figure: DTW alignment grid of input frames f_i (test, time/frames) vs. reference frames r_j for templates ONE-FIVE, with the lowest-cost path]

D(i,j) = d(i,j) + min{ D(i-1,j) + T10, D(i,j-1) + T01, D(i-1,j-1) + T11 }

  d(i,j): local match cost
  D(i,j): lowest cost to (i,j)
  T10, T01, T11: transition costs of the best predecessor
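The recurrence above can be sketched directly in numpy (an editor's illustration, not code from the talk; the function and parameter names are our own):

```python
import numpy as np

def dtw_cost(test, ref, t10=1.0, t01=1.0, t11=0.0):
    """Total DTW alignment cost between two feature sequences.

    test: (I, d) array of input frames; ref: (J, d) array of reference
    frames.  Implements D(i,j) = d(i,j) + min(D(i-1,j)+T10,
    D(i,j-1)+T01, D(i-1,j-1)+T11) with Euclidean local match cost.
    """
    I, J = len(test), len(ref)
    # Local match costs d(i,j)
    d = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=-1)
    D = np.full((I + 1, J + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j] + t10,
                                            D[i, j - 1] + t01,
                                            D[i - 1, j - 1] + t11)
    return D[I, J]
```

With the transition costs t10/t01 penalizing insertions and deletions, an identical template aligns along the diagonal at zero cost.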
MFCCs
• One feature to rule them all?
  - just the right amount of blurring
[Figure: a speech frame as waveform (time / s), linear-frequency spectrum (freq / Hz), Mel-warped spectrum (freq / Mel), and truncated cepstrum (quefrency)]
Sound → FFT X[k] → Mel scale freq. warp → log |X[k]| → IFFT → Truncate → MFCCs
(spectra → audspec → cepstra)
Davis & Mermelstein '80; Logan '00
MFCC resynthesis:
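The pipeline above can be written as a minimal numpy sketch (an editor's illustration, not the talk's code; we assume a conventional triangular mel filterbank, and use the common DCT-II form of the final cepstral step in place of the slide's IFFT):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_mels=20, n_ceps=13):
    """MFCCs for one windowed frame: FFT -> mel filterbank -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # Triangular filters spaced evenly on the mel scale
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2),
                                    n_mels + 2))
    fbank = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, mid, hi = mel_pts[i], mel_pts[i + 1], mel_pts[i + 2]
        up = (freqs - lo) / (mid - lo)
        down = (hi - freqs) / (hi - mid)
        fbank[i] = np.maximum(0.0, np.minimum(up, down))
    logmel = np.log(fbank @ spec + 1e-10)
    # DCT-II decorrelates; truncation keeps only the smooth spectral envelope
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_mels)
    return dct @ logmel
```

Truncating the cepstrum is the "blurring": fine spectral detail (pitch harmonics) is discarded while the envelope survives.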
Hidden Markov Models
• Recognition as inferring the parameters of a generative model
• Time warping + probability + training
Baker '75; Jelinek '76
[Figure: emission distributions p(x|q) for states q = A, B, C; an observation sequence x_n over time steps n; and the inferred state sequence S A A A A A A A A B B B B B B B B B C C C C B B B B B B C E]

Transition matrix p(q_{n+1} | q_n), states S, A, B, C, E (rows q_n, columns q_{n+1}):

  S:  0   1   0   0   0
  A:  0  .8  .1  .1   0
  B:  0  .1  .8  .1   0
  C:  0  .1  .1  .7  .1
  E:  0   0   0   0   1
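Inferring a state sequence like the one above is usually done with Viterbi decoding; a minimal log-domain sketch (an editor's illustration, not from the talk):

```python
import numpy as np

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely state path through an HMM.

    obs_loglik: (N, Q) log p(x_n | q); log_trans: (Q, Q) log p(q_{n+1} | q_n);
    log_init: (Q,) log initial state probabilities.
    """
    N, Q = obs_loglik.shape
    delta = log_init + obs_loglik[0]          # best score ending in each state
    back = np.zeros((N, Q), dtype=int)        # best predecessor pointers
    for n in range(1, N):
        scores = delta[:, None] + log_trans   # (from-state, to-state)
        back[n] = np.argmax(scores, axis=0)
        delta = scores[back[n], np.arange(Q)] + obs_loglik[n]
    path = [int(np.argmax(delta))]
    for n in range(N - 1, 0, -1):             # trace pointers backwards
        path.append(int(back[n, path[-1]]))
    return path[::-1]
```

This is exactly DTW's dynamic program with distances replaced by log probabilities, which is how HMMs subsume time warping.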
Model Decomposition
• HMMs applied to multiple sources
  - infer generative states for many models
  - combination of observations...
  - exponential complexity...
Varga & Moore '90; Gales & Young '95; Ghahramani & Jordan '97; Kristjansson et al. '06; Hershey et al. '10
[Figure: state lattices for model 1 and model 2 jointly explaining one stream of observations over time]
Segmentation & Classification
• Label audio at scale of ~ seconds
  - segment when signal changes
  - describe by statistics
  - classify with existing models (HMMs?)
• Many supervised applications
  - speaker ID
  - music genre
  - environment ID
[Figure: video soundtrack spectrogram (VTS_04_0001, freq / kHz vs. time / sec) → MFCC features over time → MFCC covariance matrix (MFCC dimension vs. MFCC dimension)]
Chen & Gopalakrishnan '98; Tzanetakis & Cook '02; Lee & Ellis '06
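The "describe by statistics" step can be sketched as summarizing each segment's feature frames by a mean and covariance, then scoring against per-class Gaussians (an editor's illustration; the Gaussian-classifier choice and all names are ours, not the cited systems'):

```python
import numpy as np

def gaussian_stats(frames):
    """Summarize a segment's feature frames by mean and covariance."""
    mu = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])
    return mu, cov

def segment_loglik(frames, mu, cov):
    """Total log-likelihood of frames under a full-covariance Gaussian."""
    d = frames.shape[1]
    diff = frames - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
    return float(np.sum(-0.5 * (quad + logdet + d * np.log(2 * np.pi))))

def classify_segment(frames, class_stats):
    """Pick the class whose Gaussian best explains the segment's frames."""
    return max(class_stats,
               key=lambda c: segment_loglik(frames, *class_stats[c]))
```

The same second-order statistics also drive change detection (e.g. BIC-style segmentation in Chen & Gopalakrishnan '98).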
Music Transcription
• Music audio has a very specific structure
  ... and an explicit abstract content
  - pitch + harmonics
  - rhythm
  - 'harmonized' voices
Moorer '75; Goto '94, '04; Klapuri '02, '06
[Figure from Goto '04]
Sinusoidal Models
• Stylize a spectrogram into discrete components
  - discrete pieces → objects
  - good for modification & resynthesis
[Figure: spectrogram (freq / kHz vs. time / s) and its stylization into discrete sinusoidal tracks, from Wang '95]
McAulay & Quatieri '84; Serra '89; Maher & Beauchamp '94
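A toy version of the analysis step - per-frame spectral peak picking plus additive resynthesis - might look like this (our sketch only; real sinusoidal modeling also links peaks into tracks across frames and matches phase):

```python
import numpy as np

def frame_peaks(frame, sr, n_peaks=5):
    """Pick the n strongest local spectral maxima of one frame."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # local maxima (skip the DC and Nyquist edges)
    ismax = (spec[1:-1] > spec[:-2]) & (spec[1:-1] > spec[2:])
    idx = np.where(ismax)[0] + 1
    idx = idx[np.argsort(spec[idx])[::-1][:n_peaks]]   # strongest first
    return freqs[idx], spec[idx] / (len(frame) / 4)    # rough Hann amp scaling

def resynth(peak_tracks, hop, sr):
    """Additive resynthesis: each frame contributes its peak sinusoids."""
    out = np.zeros(hop * len(peak_tracks))
    t = np.arange(hop) / sr
    for k, (freqs, amps) in enumerate(peak_tracks):
        seg = sum(a * np.cos(2 * np.pi * f * (t + k * hop / sr))
                  for f, a in zip(freqs, amps))
        out[k * hop:(k + 1) * hop] = seg
    return out
```

Because the representation is a short list of (frequency, amplitude) pairs per frame, modifications like pitch shifting or time stretching become simple parameter edits before resynthesis.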
Comp. Aud. Scene Analysis
• Computer implementations of principles from [Bregman 1990] etc.
• Time-frequency masking for resynthesis
Weintraub '85; Brown & Cooke '94; Ellis '96; Roweis '03; Hu & Wang '04
[Block diagram: input mixture → Front end (signal features / maps: onset, period, freq. mod) → Object formation (discrete objects) → Grouping rules → Source groups]
[Figure: mixture spectrogram and oracle-mask resynthesis (freq / kHz vs. time / s), using harmonicity and onset cues]
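The time-frequency masking idea can be demonstrated with an "oracle" (ideal binary) mask computed from the pre-mix sources (an editor's sketch with a minimal STFT/ISTFT, not code from any of the systems cited above):

```python
import numpy as np

def stft(x, n=512, hop=256):
    """Hann-windowed short-time Fourier transform, frames as rows."""
    w = np.hanning(n)
    frames = [w * x[i:i + n] for i in range(0, len(x) - n + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(X, n=512, hop=256):
    """Overlap-add inverse, normalized by the summed squared window."""
    w = np.hanning(n)
    out = np.zeros(hop * (len(X) - 1) + n)
    norm = np.zeros_like(out)
    for k, F in enumerate(X):
        out[k * hop:k * hop + n] += w * np.fft.irfft(F, n)
        norm[k * hop:k * hop + n] += w * w
    return out / np.maximum(norm, 1e-8)

# Oracle binary mask: keep mixture cells where the target dominates
sr = 8000
t = np.arange(2 * sr) / sr
s1 = np.sin(2 * np.pi * 440 * t)          # "target" source
s2 = 0.8 * np.sin(2 * np.pi * 1250 * t)   # interfering source
mix = s1 + s2
mask = (np.abs(stft(s1)) > np.abs(stft(s2))).astype(float)
est = istft(mask * stft(mix))             # masked resynthesis approximates s1
```

A real CASA system must estimate the mask from cues like harmonicity and common onset; the oracle mask only shows the ceiling of what masking-based resynthesis can achieve.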
Independent Component Analysis
• Can separate "blind" combinations by maximizing independence of outputs
[Figure: two sources s1, s2 mixed through gains a11, a12, a21, a22 into observations m1, m2; unmixing adjusts by -δMutInfo/δa]
[Figure: scatter plot of mix 1 vs. mix 2, and kurtosis vs. mixture rotation, peaking at the source directions]

kurt(y) = E[((y - µ) / σ)^4] - 3
Bell & Sejnowski '95; Smaragdis '98
kurtosis as a measure of independence?
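The "kurtosis as independence" idea can be demonstrated by whitening a two-channel mixture and searching for the rotation that maximizes total |kurtosis| (a toy editor's sketch, not a full ICA algorithm such as Bell & Sejnowski's infomax):

```python
import numpy as np

def kurt(y):
    """Excess kurtosis E[((y-mu)/sigma)^4] - 3; zero for a Gaussian."""
    z = (y - y.mean()) / y.std()
    return np.mean(z ** 4) - 3.0

def unmix_by_kurtosis(mixes, n_angles=360):
    """Whiten the 2-channel mixture, then pick the rotation that
    maximizes total |kurtosis| of the outputs."""
    # whiten: decorrelate and equalize variance
    x = mixes - mixes.mean(axis=1, keepdims=True)
    cov = x @ x.T / x.shape[1]
    vals, vecs = np.linalg.eigh(cov)
    white = np.diag(vals ** -0.5) @ vecs.T @ x
    best, best_score = white, -np.inf
    for th in np.linspace(0.0, np.pi / 2, n_angles):
        R = np.array([[np.cos(th), -np.sin(th)],
                      [np.sin(th),  np.cos(th)]])
        y = R @ white
        score = abs(kurt(y[0])) + abs(kurt(y[1]))
        if score > best_score:
            best, best_score = y, score
    return best
```

After whitening, only a rotation (up to sign and permutation) remains unknown, so a 1-D search over angle suffices in two dimensions; |kurtosis| peaks when each output is a single non-Gaussian source.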
Nonnegative Matrix Factorization
• Decomposition of spectrograms into templates + activation
  - fast & forgiving gradient descent algorithm
  - fits neatly with time-frequency masking
Lee & Seung '99; Smaragdis & Brown '03; Abdallah & Plumbley '04; Virtanen '07
X = W · H

[Figure from Smaragdis '04: spectrogram (frequency, DFT index vs. time, DFT slices) factored into spectral bases (columns of W) and activations (rows of H); sound examples from Virtanen '03]
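The factorization X = W · H is commonly fit with Lee & Seung's multiplicative updates; a minimal sketch for the Euclidean cost (an editor's illustration; much of the cited audio work uses the KL-divergence variant instead):

```python
import numpy as np

def nmf(X, r, n_iter=200, seed=0):
    """Lee & Seung multiplicative updates for X ~ W @ H (Euclidean cost).

    X: (F, T) nonnegative matrix (e.g. a magnitude spectrogram);
    r: number of templates.  Returns nonnegative W (F, r) and H (r, T).
    """
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, r)) + 0.1   # spectral templates
    H = rng.random((r, T)) + 0.1   # per-template activations
    for _ in range(n_iter):
        # multiplicative updates never change sign, so W, H stay nonnegative
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H
```

This is the "forgiving gradient descent" of the slide: each update is a rescaled gradient step whose multiplicative form automatically preserves nonnegativity, and each learned template W[:, k] with its activation row H[k] can serve as a soft time-frequency mask.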
Summary of Key Ideas
(Sound → Front-end → Organization → Recognition → Action)
• Front-end representation: Spectrogram, MFCCs, Sinusoids
• Scene organization / separation: Segmentation, CASA, ICA, NMF
• Object recognition & description: Direct match, HMMs, Decomposition, Adaptation, SVMs
• Memory / models: Templates, HMMs, Bases
3. Open Issues
• Where to focus to advance machine listening?
  - task & evaluation
  - separation
  - models & learning
  - ASA
  - computational theory
  - attention & search
  - spatial vs. source
Scene Analysis Tasks
• Real scenes are complex!
  - background noise
  - many objects
  - prominence
• Is this just scaling, or is it different?
• Evaluation?
[Figure: spectrograms (f/Hz vs. time/s, level in dB) of city-sound mixtures (Noise1; Noise2, Click1; City) with detected events and listener agreement: Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10); extracted elements Wefts 1-4, Weft5, Wefts 6,7, Weft8, Wefts 9-12]
How Important is Separation?
• Separation systems often evaluated by SNR
  - based on pre-mix components - is this relevant?
• Best machine listening systems have resynthesis
  - e.g. Iroquois speech recognition - "separate then recognize"
• Separated signals don't have to match originals to be useful
[Diagrams: (a) mix → identify target energy (source knowledge, speech models) → t-f masking + resynthesis → separation → ASR → words; (b) mix → identify target energy → find best words model (speech models) → words]
How Many Models?
• More specific models → better analysis
  - need dictionaries for "everything"??
• Model adaptation and hierarchy
  - speaker adapted models: base + parameters
  - extrapolation beyond normal
• Time scales of model acquisition
  - innate/evolutionary (hair-cell tuning)
  - developmental (mother tongue phones)
  - dynamic - the "Bolero" effect
[Figure: "Recognition of Scaled Vowels" - regions for /a/ /e/ /i/ /o/ /u/ over mean pitch / Hz vs. VT length ratio, with the domain of normal speakers marked; Smith, Patterson, Turner, Kawahara & Irino, JASA (2005); CNBH, Physiology Department, Cambridge University]
Smith, Patterson et al. '05
Auditory Scene Analysis?
• Codebook models learn harmonicity, onset
• Can also capture sequential structure
  - e.g. consonants follow vowels
  - use overlapping patches?
• But: computational factors
[Figure: VQ256 codebook trained on female speech - codewords as spectral slices, frequency (kHz) vs. level (dB)]
... to subsume rules/representations of CASA
Computational Theory
• Marr's (1982) perspective on perception
• What is the computational theory of machine listening?
  - independence? sources?
• Computational Theory: properties of the world that make the problem solvable
• Algorithm: specific calculations & operations
• Implementation: details of how it's done
Summary
• Machine Listening: getting useful information from sound
• Techniques for:
  - representation
  - separation / organization
  - recognition / description
  - memory / models
• Where to go?
  - separation?
  - computational theory?
References 1/2
S. Abdallah & M. Plumbley (2004) "Polyphonic transcription by non-negative sparse coding of power spectra," Proc. Int. Symp. Music Info. Retrieval 2004.
J. Baker (1975) "The DRAGON system - an overview," IEEE Trans. Acoust. Speech, Sig. Proc. 23, 24-29, 1975.
A. Bell & T. Sejnowski (1995) "An information maximization approach to blind separation and blind deconvolution," Neural Computation, 7, 1129-1159, 1995.
A. Bregman (1990) Auditory Scene Analysis, MIT Press.
G. Brown & M. Cooke (1994) "Computational auditory scene analysis," Comp. Speech & Lang. 8(4), 297-336, 1994.
S. Chen & P. Gopalakrishnan (1998) "Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion," Proc. DARPA Broadcast News Transcription and Understanding Workshop.
S. Davis & P. Mermelstein (1980) "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust. Speech Sig. Proc. 28, 357-366, 1980.
D. Ellis (1996) Prediction-driven computational auditory scene analysis, Ph.D. thesis, EECS dept., MIT.
D. Ellis & K. Lee (2006) "Accessing minimal impact personal audio archives," IEEE Multimedia 13(4), 30-38, Oct-Dec 2006.
M. Gales & S. Young (1995) "Robust speech recognition in additive and convolutional noise using parallel model combination," Comput. Speech Lang. 9, 289-307, 1995.
Z. Ghahramani & M. Jordan (1997) "Factorial hidden Markov models," Machine Learning, 29(2-3), 245-273, 1997.
M. Goto & Y. Muraoka (1994) "A beat tracking system for acoustic signals of music," Proc. ACM Intl. Conf. on Multimedia, 365-372, 1994.
M. Goto (2004) "A real-time music scene description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals," Speech Comm., 43(4), 311-329, 2004.
S. Handel (1989) Listening, MIT Press.
J. Hershey, S. Rennie, P. Olsen & T. Kristjansson (2010) "Super-human multi-talker speech recognition: A graphical modeling approach," Comp. Speech & Lang., 24, 45-66.
G. Hu & D. Wang (2004) "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Tr. Neural Networks, 15(5), Sep. 2004.
F. Jelinek (1976) "Continuous speech recognition by statistical methods," Proc. IEEE 64(4), 532-556.
A. Klapuri (2003) "Multiple fundamental frequency estimation based on harmonicity and spectral smoothness," IEEE Trans. Speech & Audio Proc., 11(6).
A. Klapuri (2006) "Multiple fundamental frequency estimation by summing harmonic amplitudes," Proc. Int. Symp. Music Info. Retr.
T. Kristjansson, J. Hershey, P. Olsen, S. Rennie & R. Gopinath (2006) "Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system," Proc. Interspeech, 775-1778, 2006.
D. Lee & S. Seung (1999) "Learning the parts of objects by non-negative matrix factorization," Nature 401, 788.
B. Logan (2000) "Mel frequency cepstral coefficients for music modeling," Proc. Int. Symp. Music Inf. Retrieval, Plymouth, September 2000.
R. Maher & J. Beauchamp (1994) "Fundamental frequency estimation of musical signals using a two-way mismatch procedure," J. Acoust. Soc. Am., 95(4), 2254-2263, 1994.
D. Marr (1982) Vision, MIT Press.
References 2/2
R. McAulay & T. Quatieri (1986) "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoust. Speech Sig. Proc. 34, 744-754, 1986.
A. Moorer (1975) On the segmentation and analysis of continuous musical sound by computer, Ph.D. thesis, CS dept., Stanford Univ.
H. Ney (1984) "The use of a one stage dynamic programming algorithm for connected word recognition," IEEE Trans. Acoust. Speech Sig. Proc. 32, 263-271, 1984.
S. Roweis (2003) "Factorial models and refiltering for speech separation and denoising," Proc. Eurospeech, 2003.
X. Serra (1989) A system for sound analysis/transformation/synthesis based on a deterministic plus stochastic decomposition, Ph.D. thesis, Music dept., Stanford Univ.
P. Smaragdis (1998) "Blind separation of convolved mixtures in the frequency domain," Intl. Wkshp. on Indep. & Artif. Neural Networks, Tenerife, Feb. 1998.
P. Smaragdis & J. Brown (2003) "Non-negative matrix factorization for polyphonic music transcription," Proc. IEEE WASPAA, 177-180, October 2003.
P. Smaragdis (2004) "Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs," Proc. ICA, LNCS 3195, 494-499.
D. Smith, R. Patterson, R. Turner, H. Kawahara & T. Irino (2005) "The processing and perception of size information in speech sounds," J. Acoust. Soc. Am. 117(1), 305-318.
G. Tzanetakis & P. Cook (2002) "Musical genre classification of audio signals," IEEE Tr. Speech & Audio Proc. 10(5).
A. Varga & R. Moore (1990) "Hidden Markov Model decomposition of speech and noise," IEEE ICASSP, 845-848, 1990.
T. Vintsyuk (1971) "Element-wise recognition of continuous speech composed of words from a specified dictionary," Kibernetika 7, 133-143, 1971.
T. Virtanen (2007) "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Tr. Audio, Speech, & Lang. Proc. 15(3), 1066-1074, Mar. 2007.
A. Wang (2006) "The Shazam music recognition service," Comm. ACM 49(8), 44-48, 2006.
M. Weintraub (1986) A theory and computational model of auditory monaural sound separation, Ph.D. thesis, EE dept., Stanford Univ.
R. Weiss & D. Ellis (2010) "Speech separation using speaker-adapted eigenvoice speech models," Comp. Speech & Lang., 24(1), 16-29, Jan 2010.