grasp-2003-09 - Columbia Universitydpwe/talks/grasp-2003-09.pdf · Title: grasp-2003-09.fm Author:...

Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 1

Sound, Mixtures, and Learning:LabROSA overview

Sound Content Analysis

Recognizing sounds

Organizing mixtures

Accessing large datasets

Dan Ellis <[email protected]>

Laboratory for Recognition and Organization of Speech and Audio(Lab

ROSA

)Columbia University, New Yorkhttp://labrosa.ee.columbia.edu/

1

2

3

4

http://www.ee.columbia.edu/~dpwe/muscontent/



• Sound understanding: the key challenge

- what listeners do- understanding = abstraction

• Applications

- indexing/retrieval- robots- prostheses

1

0 2 4 6 8 10 12 time/s

frq/Hz

0

2000

1000

3000

4000

Voice (evil)

Stab

Rumble Strings

Choir

Voice (pleasant)

Analysis

level / dB-60

-40

-20

0


The problem with recognizing mixtures

“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”

(after Bregman’90)

• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events

- ... like listeners do

• Hearing is ecologically grounded

- reflects natural scene properties = constraints- subjective, not absolute


Approaches to handling sound mixtures

• Separate signals, then recognize

- e.g. CASA, ICA- nice, if you can do it

• Recognize combined signal

- ‘multicondition training’- combinatorics..

• Recognize with parallel models

- full joint-state space?- or: divide signal into fragments,

then use missing-data recognition


Outline


Recognizing sounds

- Speech recognition- Nonspeech

Organizing mixtures


1

2

3

4


Recognizing Sounds: Speech

• Standard speech recognition structure:

• How to handle additive noise?

- just train on noisy data: ‘multicondition training’

2

Featurecalculation

sound

Acousticclassifier

feature vectorsAcoustic model

parameters

HMMdecoder

Understanding/application...

phone probabilities

phone / word sequence

Word models

Language modelp("sat"|"the","cat")p("saw"|"the","cat")

s ah t

D A

T A


Novel speech signal representations

(with Marios Athineos)

• Common sound models use 10ms frames

- but: sub-10ms envelope is perceptible

• Use a parametric (LPC) model on spectrum

• Convert to features for ASR

- improvements esp. for stops

Time

Fre

quen

cy

Fireplce - orig

0 2 40

2k

4k

0

2k

4k

Time

Fre

quen

cy

TDLP resynth

0 2 40

2k

4k

0

2k

4k

Time

Fre

quen

cy

CTFLP resynth

0 2 40

2k

4k

0

2k

4k

Time

Fre

quen

cy

CTFLP resynth 4x TSM

0 2 40

2k

4k

0

2k

4k

200 400 600 800 1000 1200 1400 1600 1800 2000

1

1.5

2

2.5

3

3.5

4

4.5

5

2

0

2

4

waveform0-0.5 kHz0.5-1 kHz1-2 kHz2-4 kHz


Finding the Information in Speech

(with Patricia Scanlon)

• Mutual Information in time-frequency:

• Use to select classifier input features

5

10

15

-200 -100 0 100 200

5

10

15

-200 -100 0 100 200time / ms

MI / bits

freq

/ B

ark

0.005

0.01

0.015

0.02

0.025

0.02

0.04

0.06

0.08

0

0.01

0.02

0.03

0.04

0.020.040.060.080.10.120.14

All phones Stops

Vowels Speaker (vowels)

5

10

15

5

10

15

5

10

15

-50 0 50

5

10

15

-50 0 50 time / ms

frfe

q / B

ark

133ftrs

IRREG IRREG+CHKBD

95ftrs

57ftrs

19ftrs

85ftrs

47ftrs

28ftrs

9ftrs

time / ms

0 20 40 60 80 100 120 14045

50

55

60

65

70

Number of features

Fra

me

accu

racy

/ %

IRREGRECTRECT, all frames

IRREG+CHKBDRECT+CHKBD


Alarm sound detection

(Ellis 2001)

• Alarm sounds have particular structure

- people ‘know them when they hear them’- clear even at low SNRs

• Why investigate alarm sounds?

- they’re supposed to be easy- potential applications...

• Contrast two systems:

- standard, global features,

P

(

X

|

M

)

- sinusoidal model, fragments,

P

(

M

,

S

|

Y

)

time / s

hrn01 bfr02 buz01

level / dB

freq

/ kH

z

0 5 10 15 20 250

1

2

3

4

-40

-20

0

20s0n6a8+20


Alarms: Results

• Both systems commit many insertions at 0dB SNR, but in different circumstances:

20 25 30 35 40 45 50

0

6 7 8 9

time/sec0

freq

/ kH

z

1

2

3

4

0

freq

/ kH

z

1

2

3

4Restaurant+ alarms (snr 0 ns 6 al 8)

MLP classifier output

Sound object classifier output

NoiseNeural net system Sinusoid model system

Del Ins Tot Del Ins Tot

1 (amb) 7 / 25 2 36% 14 / 25 1 60%

2 (bab) 5 / 25 63 272% 15 / 25 2 68%

3 (spe) 2 / 25 68 280% 12 / 25 9 84%

4 (mus) 8 / 25 37 180% 9 / 25 135 576%

Overall 22 / 100 170 192% 50 / 100 147 197%


Outline


Recognizing sounds

Organizing mixtures

- Auditory Scene Analysis- Missing data recognition- Parallel model inference


1

2

3

4


Auditory Scene Analysis

(Bregman 1990)

• How do people analyze sound mixtures?

- break mixture into small

elements

(in time-freq)- elements are

grouped

in to sources using

cues

- sources have aggregate

attributes

• Grouping ‘rules’ (Darwin, Carlyon, ...):

- cues: common onset/offset/modulation, harmonicity, spatial location, ...

3

Frequencyanalysis

Groupingmechanism

Onsetmap

Harmonicitymap

Positionmap

Sourceproperties

(after Darwin, 1996)


Cues to simultaneous grouping

• Elements + attributes

• Common onset

- simultaneous energy has common source

• Periodicity

- energy in different bands with same cycle

• Other cues

- spatial (ITD/IID), familiarity, ...

• But: Context ...

time / s

freq

/ H

z

0 1 2 3 4 5 6 7 8 90

2000

4000

6000

8000


Computational Auditory Scene Analysis:The Representational Approach

(Cooke & Brown 1993)

• Direct implementation of psych. theory

- ‘bottom-up’ processing- uses common onset & periodicity cues

• Able to extract voiced speech:

inputmixture

signalfeatures

(maps)

discreteobjects

Front end Objectformation

Groupingrules

Sourcegroups

onset

period

frq.mod

time

freq

0.2 0.4 0.6 0.8 1.0 time/s

100

150200

300400

600

1000

15002000

3000

frq/Hzbrn1h.aif

0.2 0.4 0.6 0.8 1.0 time/s

100

150200

300400

600

1000

15002000

3000

frq/Hzbrn1h.fi.aif


Adding top-down constraints

Perception is not directbut a search for plausible hypotheses

• Data-driven (bottom-up)...

- objects irresistibly appear

vs. Prediction-driven (top-down)

- match observations with parameters of a world-model

- need world-model constraints...

inputmixture

signalfeatures

discreteobjects

Front end Objectformation

Groupingrules

Sourcegroups

inputmixture

signalfeatures

predictionerrors

hypotheses

predictedfeaturesFront end Compare

& reconcile

Hypothesismanagement

Predict& combinePeriodic

components

Noisecomponents


Prediction-Driven CASA

−70

−60

−50

−40

dB

200400

100020004000

f/Hz Noise1

200400

100020004000

f/Hz Noise2,Click1

200400

100020004000

f/Hz City

0 1 2 3 4 5 6 7 8 9

50100200400

1000

Horn1 (10/10)

Crash (10/10)

Horn2 (5/10)

Truck (7/10)

Horn3 (5/10)

Squeal (6/10)

Horn4 (8/10)Horn5 (10/10)

0 1 2 3 4 5 6 7 8 9time/s

200400

100020004000

f/Hz Wefts1−4

50100200400

1000

Weft5 Wefts6,7 Weft8 Wefts9−12


Segregation vs. Inference

• Source separation requires attribute separation- sources are characterized by attributes

(pitch, loudness, timbre + finer details)- need to identify & gather different attributes for

different sources ...

• Need representation that segregates attributes- spectral decomposition- periodicity decomposition

• Sometimes values can’t be separated- e.g. unvoiced speech- maybe infer factors from probabilistic model?

- or: just skip those values, infer from higher-level context

- do both: missing-data recognition

p O x y, ,( ) p x y, O( )Æ


Missing Data Recognition

• Speech models p(x|m) are multidimensional...- i.e. means, variances for every freq. channel- need values for all dimensions to get p(•)

• But: can evaluate over a subset of dimensions xk

• Hence, missing data recognition:

- hard part is finding the mask (segregation)

xk

xu

y

p(xk,xu)

p(xk|xu<y ) p(xk )

p xk m( ) p xk xu, m( ) xudÚ=

P(x1 | q)

P(x | q) =

· P(x2 | q) · P(x3 | q) · P(x4 | q) · P(x5 | q) · P(x6 | q)

Present data mask

time �

dim

ensi

on �


Comparing different segregations

• Standard classification chooses between models M to match source features X

• Mixtures Æ observed features Y, segregation S, all related by

- spectral features allow clean relationship

• Joint classification of model and segregation:

- probabilistic relation of models & segregation

M* P M X( )M

argmax P X M( ) P M( )P X( )--------------◊

Margmax = =

P X Y S,( )

freq

ObservationY(f )

Segregation S

SourceX(f )

P M S Y,( ) P M( ) P X M( )P X Y S,( )

P X( )-------------------------◊ XdÚ P S Y( )◊=


Multi-source decoding

• Search for more than one source

• Mutually-dependent data masks

• Use e.g. CASA features to propose masks- locally coherent regions

• Lots of issues in models, representations, matching, inference...

Y(t)

S1(t)q1(t)

S2(t)q2(t)


What a speech HMM contains

• Markov model structure: states + transitions

• A generative model- but not a good speech generator!

- only meant for inference of p(X|M)

K

A

S

0.80.18

0.9

0.1

0.02

0.9

0.1T E

0.8

0.2

M'1

O

K

A

S

0.80.05

0.9

0.1

0.15

0.9

0.1T E

0.8

0.2

O

Model M'2

0

1

2

3

4

freq

/ kH

z

10 20 30 40 50

10

20

30

40

50

State models (means) State Transition Probabilities

0 1 2 3 4 50

1

2

3

4

freq

/ kH

z

time / sec


“One microphone source separation”(Roweis 2000, Manuel Reyes)

• State sequences Æ t-f estimates Æ mask

- 1000 states/model (Æ 106 transition probs.)- simplify by modeling subbands (coupled HMM)?

freq

/ H

z

0

1000

2000

3000

4000

time / sectime / sec

0

1000

2000

3000

4000

0

1000

2000

3000

4000

0 1 2 3 4 5 60

1000

2000

3000

4000

0 1 2 3 4 5 6

Speaker 1O

rig

inal

voic

esS

tate

mea

ns

Res

ynth

esis

mas

ksM

ixtu

re

Speaker 2


Subband models(Reyes, Jojic)

• Reduce the number of states required- 4000 states ¥1 band Æ 30 states ¥19 bands

• Train coupled HMMs via variational approx

1,1S 2,1S 3,1S 4,1S 5,1S 6,1S

1,2S 2,2S 3,2S 4,2S 5,2S 6,2S

1,3S 2,3S 3,3S 4,3S 5,3S 6,3S

1,4S 2,4S 3,4S 4,4S 5,4S 6,4S

1,5S 2,5S 3,5S 4,5S 5,5S 6,5S

1,1S 2,1S 3,1S 4,1S 5,1S 6,1S

1,2S 2,2S 3,2S 4,2S 5,2S 6,2S

1,3S 2,3S 3,3S 4,3S 5,3S 6,3S

1,4S 2,4S 3,4S 4,4S 5,4S 6,4S

1,5S 2,5S 3,5S 4,5S 5,5S 6,5S


Tracking source normalization

• Standard HMM sound models try to normalize away energy variation- but: each source in mixture has different ‘gain’

• Instead, factor out scalar gain for each source

- solve with var. approx. to P(S,a)

Model 1

Data

Source 1

Normalization÷k1

Normalization÷k2

Model 2

Data

Source 2Composed

Data

Normalization ?

Model 1 Model 2

Normalizationmismatch

x1 x2 x3 x4 x5 x6

S1 S2 S3 S4S5 S6

a1 a2 a3 a4a5 a6


Outline


Recognizing sounds

Organizing mixtures

Accessing large datasets- Meeting Recordings- The Listening Machine - Music Information Retrieval

1

2

3

4


Accessing large datasets:The Meeting Recorder Project

(with ICSI, UW, IDIAP, SRI, Sheffield)

• Microphones in conventional meetings- for summarization / retrieval / behavior analysis- informal, overlapped speech

• Data collection (ICSI, UW, IDIAP, NIST):

- ~100 hours collected & transcribed

• NSF ‘Mapping Meetings’ project

4


Speaker Turn detection(Huan Wei Hee, Jerry Liu)

• Acoustic: Triangulate tabletop mic timing differences- use normalized peak value for confidence

• Behavioral: Look for patterns of speaker turns

-3 -2 -1 0 1 2 3-3

-2

-1

0

1

lag 1-2 / ms

lag

3-4

/ ms

mr-2000-11-02-1440: PZM xcorr lags

4

12

3

0 50 100 150 200 250 3000

50

100

150

200

250

300

time / s

100x

R s

kew

/sam

ps

Example cross coupling response, chan3 to chan0

3:

8:

10:9:

7:5:

2:1:

0 5 10 15 20 25 30 35 40 45 50 55 60

mr04: Hand-marked speaker turns vs. time + auto/manual boundaries

Pa

rtic

ipa

nt

time/min


The Listening Machine

• Smart PDA records everything

• Only useful if we have index, summaries- monitor for particular sounds- real-time description

• Scenarios

- personal listener Æ summary of your day- future prosthetic hearing device- autonomous robots

• Meeting data, ambulatory audio


Personal Audio

• LifeLog / MyLifeBits / Remembrance Agent:Easy to record everything you hear

• Then what?- prohibitively time consuming to

search- but .. applications if access easier

• Automatic content analysis / indexing...

50 100 150 200 2500

2

4

14:30 15:00 15:30 16:00 16:30 17:00 17:30 18:00 18:30

5

10

15

freq

/ H

zfr

eq /

Bar

k

time / min

clock time


Music Information Retrieval

• Transfer search concepts to music?- “musical Google”- finding something specific / vague / browsing- is anything more useful than human annotation?

• Most interesting area: finding new music- is there anything on mp3.com that I would like?- audio is only information source for new bands

• Basic idea:Project music into a space where neighbors are “similar”

• Also need models of personal preference- where in the space is the stuff I like- relative sensitivity to different dimensions

• Evaluation problems- requires large, shareable music corpus!


Music similarity from Anchor space

• A classifier trained for one artist (or genre) will respond partially to a similar artist

• Each artist evokes a particular pattern of responses over a set of classifiers

• We can treat these classifier outputs as a new feature space in which to estimate similarity

• “Anchor space” reflects subjective qualities?

n-dimensionalvector in "Anchor

Space"Anchor

Anchor

Anchor

AudioInput

(Class j)

p(a1|x)

p(a2|x)

p(an|x)

GMMModeling

Conversion to Anchorspace

n-dimensionalvector in "Anchor

Space"Anchor

Anchor

Anchor

AudioInput

(Class i)

p(a1|x)

p(a2|x)

p(an|x)

GMMModeling

Conversion to Anchorspace

SimilarityComputation

KL-d, EMD, etc.


Playola interface ( www.playola.org )

• Browser finds closest matches to single tracks or entire artists in anchor space

• Direct manipulation of anchor space axes

http://www.playola.org/


Evaluation

• Are recommendations good or bad?

• Subjective evaluation is the ground truth- .. but subjects aren’t familiar with the bands

being recommended- can take a long time to decide if a

recommendation is good

• Measure match to other similarity judgments- e.g. musicseer data:

Top rank agreement

0

10

20

30

40

50

60

70

80

cei cmb erd e3d opn kn2 rnd ANK

%

SrvKnw 4789x3.58

SrvAll 6178x8.93

GamKnw 7410x3.96GamAll 7421x8.92


Summary

• Sound- .. contains much, valuable information at many

levels- intelligent systems need to use this information

• Mixtures- .. are an unavoidable complication when using

sound- looking in the right time-frequency place to find

points of dominance

• Learning- need to acquire constraints from the

environment- recognition/classification as the real task


LabROSA Summary

DO

MA

INS

AP

PLI

CA

TIO

NS

ROSA

• Broadcast• Movies• Lectures

• Meetings• Personal recordings• Location monitoring

• Speech recognition• Speech characterization• Nonspeech recognition

• Object-based structure discovery & learning

• Scene analysis• Audio-visual integration• Music analysis

• Structuring• Search• Summarization• Awareness• Understanding


Extra Slides


Music Applications

• Music as a complex, information-rich sound

• Applications of separation & recognition:- note/chord detection & classification

- singing detection (Æ genre identification ...)

23 240

1

2

72 730

1

231 32 40 41 48 49

80 81 85 86 88 89

freq

/ kH

z

DYWMB: Alignments to MIDI note 57 mapped to Orig Audio

0 50 100 150 200

Boards of CanadaSugarplastic

Belle & SebastianMercury Rev

CorneliusRichard Davies

Dj ShadowMouse on Mars

The Flaming LipsAimee Mann

WilcoXTCBeck

Built to SpillJason Falkner

OvalArto Lindsay

Eric MatthewsThe MolesThe Roots

Michael Penntrue voice

time / sec

Track 117 - Aimee Mann (dynvox=Aimee, unseg=Aimee)


Artist Similarity

• Recognizing work from each artist is all very well...

• But: what is similarity between artists?

- pattern recognition systems give a number...

• Need subjective ground truth:Collected via web site

www.musicseer.com

• Results:- 1,000 users, 22,300 judgments collected over 6 months

backstreet_boys

whitney_

new_ron_carter

a

all_saints

annie_lennox

aqua

belinda_carlislers

celine_dionchristina_aguilera

eiffel_65

erasure

miroquai

janet_jacksonjessica_simpson lara_fabian

lauryn_hill

madonna

mariah_carey

nelly_furtado

pet_shop_boys

pr

roxette

sade

wain

sof

spice_girls

toni_braxtone

http://www.musicseer.org


Independent Component Analysis (ICA)(Bell & Sejnowski 1995 et seq.)

• Drive a parameterized separation algorithm to maximize independence of outputs

• Advantages:- mathematically rigorous, minimal assumptions- does not rely on prior information from models

• Disadvantages:- may converge to local optima...- separation, not recognition- does not exploit prior information from models

m1m2

s1s2

a11a21

a12a22

x

�� MutInfo�a

grasp-2003-09 - Columbia Universitydpwe/talks/grasp-2003-09.pdf · Title: grasp-2003-09.fm Author:...

Documents

Transcript of grasp-2003-09 - Columbia Universitydpwe/talks/grasp-2003-09.pdf · Title: grasp-2003-09.fm Author:...