
EE E6820: Speech & Audio Processing & Recognition

Lecture 10: Signal Separation

Dan Ellis <[email protected]>
Michael Mandel <[email protected]>

Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820

April 14, 2009

1 Sound mixture organization

2 Computational auditory scene analysis

3 Independent component analysis

4 Model-based separation


Outline

1 Sound mixture organization

2 Computational auditory scene analysis

3 Independent component analysis

4 Model-based separation


Sound Mixture Organization

[Figure: spectrogram analysis (0–4000 Hz, 0–12 s, level −60 to 0 dB) of a complex mixture, with labeled sources: voice (pleasant), voice (evil), stab, rumble, strings, choir]

Auditory Scene Analysis: describing a complex sound in terms of high-level sources / events

. . . like listeners do

Hearing is ecologically grounded
- reflects 'natural scene' properties
- subjective, not absolute


Sound, mixtures, and learning

[Figure: spectrograms (0–4 kHz, 0–1 s): Speech + Noise = Speech + Noise]

Sound
- carries useful information about the world
- complements vision

Mixtures

. . . are the rule, not the exception
- medium is 'transparent', sources are many
- must be handled!

Learning
- the 'speech recognition' lesson: let the data do the work
- like listeners


The problem with recognizing mixtures

“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman, 1990)

Received waveform is a mixture
- two sensors, N signals . . . underconstrained

Disentangling mixtures as the primary goal?
- perfect solution is not possible
- need experience-based constraints


Approaches to sound mixture recognition

Separate signals, then recognize
- e.g. Computational Auditory Scene Analysis (CASA), Independent Component Analysis (ICA)
- nice, if you can do it

Recognize combined signal
- 'multicondition training'
- combinatorics. . .

Recognize with parallel models
- full joint-state space?
- divide signal into fragments, then use missing-data recognition


What is the goal of sound mixture analysis?


Separate signals?
- output is unmixed waveforms
- underconstrained, very hard . . .
- too hard? not required?

Source classification?
- output is set of event-names
- listeners do more than this. . .

Something in-between?
Identify independent sources + characteristics
- standard task, results?


Segregation vs. Inference

Source separation requires attribute separation
- sources are characterized by attributes (pitch, loudness, timbre, and finer details)
- need to identify and gather different attributes for different sources. . .

Need representation that segregates attributes
- spectral decomposition
- periodicity decomposition

Sometimes values can't be separated, e.g. unvoiced speech
- maybe infer factors from probabilistic model? p(O, x, y) → p(x, y | O)
- or: just skip those values & infer from higher-level context


Outline

1 Sound mixture organization

2 Computational auditory scene analysis

3 Independent component analysis

4 Model-based separation


Auditory Scene Analysis (Bregman, 1990)

How do people analyze sound mixtures?
- break mixture into small elements (in time-freq)
- elements are grouped into sources using cues
- sources have aggregate attributes

Grouping 'rules' (Darwin and Carlyon, 1995)
- cues: common onset/offset/modulation, harmonicity, spatial location, . . .

[Diagram: sound → frequency analysis → elements → grouping mechanism (onset map, harmonicity map, spatial map) → sources, source properties]


Cues to simultaneous grouping

Elements + attributes

[Figure: time-frequency elements in a spectrogram, 0–8000 Hz over 0–9 s]

Common onset
- simultaneous energy has common source

Periodicity
- energy in different bands with same cycle

Other cues
- spatial (ITD/IID), familiarity, . . .


The effect of context

Context can create an ‘expectation’

i.e. a bias towards a particular interpretation

e.g. Bregman's “old-plus-new” principle:
- a change in a signal will be interpreted as an added source whenever possible

[Figure: old-plus-new stimulus, 0–2 kHz spectrograms over 0–1.2 s]

- a different division of the same energy, depending on what preceded it


Computational Auditory Scene Analysis (CASA)

[Diagram: sound mixture → object 1, 2, 3 percepts]

Goal: automatic sound organization
- systems to 'pick out' sounds in a mixture . . . like people do, e.g. voice against a noisy background
- to improve speech recognition

Approach
- psychoacoustics describes grouping 'rules' . . . just implement them?


CASA front-end processing

Correlogram: Loosely based on known/possible physiology

[Diagram: sound → cochlea filterbank → envelope follower → short-time autocorrelation over a delay line, one per frequency channel; a correlogram slice has axes freq × lag, evolving over time]

- linear filterbank cochlear approximation
- static nonlinearity
- zero-delay slice is like spectrogram
- periodicity from delay-and-multiply detectors
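As an illustrative sketch only (not from the slides), a minimal correlogram front end might look like this in Python, standing in for the cochlear filterbank with Butterworth bandpass filters and for the static nonlinearity with half-wave rectification; the channel center frequencies and bandwidths are arbitrary choices here:

    import numpy as np
    from scipy.signal import butter, lfilter

    def correlogram_slice(x, fs, centers, max_lag_ms=20):
        """Rough correlogram: filterbank -> rectify -> autocorrelate."""
        max_lag = int(fs * max_lag_ms / 1000)
        rows = []
        for fc in centers:
            # crude stand-in for a cochlear (e.g. gammatone) channel
            lo, hi = 0.8 * fc / (fs / 2), min(1.2 * fc / (fs / 2), 0.99)
            b, a = butter(2, [lo, hi], btype='band')
            band = np.maximum(lfilter(b, a, x), 0.0)   # static nonlinearity
            # short-time autocorrelation ("delay-and-multiply")
            ac = np.correlate(band, band, mode='full')[len(band) - 1:][:max_lag]
            rows.append(ac / (ac[0] + 1e-12))          # lag 0 ~ spectrogram energy
        return np.array(rows)                          # (channels, lags): one slice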


Bottom-Up Approach (Brown and Cooke, 1994)

Implement psychoacoustic theory:
[Diagram: input mixture → front end (signal feature maps: onset, period, freq. modulation) → object formation (discrete objects) → grouping rules → source groups]

- left-to-right processing
- uses common onset & periodicity cues

Able to extract voiced speech

[Figure: spectrograms (0–8 kHz, 0–1.4 s, level −40 to +20 dB) of the 'v3n7' original mix and the voiced speech extracted by Hu & Wang 02]


Problems with ‘bottom-up’ CASA


Circumscribing time-frequency elements
- need to have 'regions', but hard to find

Periodicity is the primary cue
- how to handle aperiodic energy?

Resynthesis via masked filtering
- cannot separate within a single t-f element

Bottom-up leaves no ambiguity or context
- how to model illusions?


Restoration in sound perception

Auditory ‘illusions’ = hearing what’s not there

The continuity illusion & Sinewave Speech (SWS)

[Figure: spectrograms of the continuity illusion (0–8000 Hz, 0–2.5 s) and sinewave speech (0–5000 Hz, 0.5–3.5 s)]

- duplex perception

What kind of model accounts for this?
- is it an important part of hearing?


Adding top-down constraints: Prediction-Driven CASA

Perception is not direct, but a search for plausible hypotheses

Data-driven (bottom-up). . .
[Diagram: input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups]
- objects irresistibly appear

vs. Prediction-driven (top-down)
[Diagram: input mixture → front end (signal features) → compare & reconcile (prediction errors) → hypothesis management (hypotheses) → predict & combine (periodic components, noise components) → predicted features, fed back to the comparison]
- match observations with a 'world-model'
- need world-model constraints. . .


Generic sound elements for PDCASA (Ellis, 1996)

Goal is a representational space that
- covers real-world perceptual sounds
- has a minimal parameterization (sparseness)
- keeps separate attributes in separate parameters

[Figure: three generic elements.
Noise object: spectral profile H[k] × magnitude contour A[n]: X_N[n,k] = H[k]·A[n].
Weft object: correlogram analysis yields period track p[n] and spectral envelope H_W[n,k]: X_W[n,k] = H_W[n,k]·P[n,k].
Click object: spectral profile H[k] with per-channel decay times T[k]: X_C[n,k] = H[k]·exp((n − n0)/T[k])]

Object hierarchies built on top. . .
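To make the factorizations concrete, here is a hedged numpy sketch of how the noise and click elements could be rendered as time-frequency envelopes; the decay sign in the click object is my assumption, chosen so the envelope dies away:

    import numpy as np

    def noise_object(H, A):
        # X_N[n, k] = H[k] * A[n]: fixed spectral profile x magnitude contour
        return np.outer(A, H)                      # shape (frames, channels)

    def click_object(H, T, n_frames, n0=0):
        # X_C[n, k] = H[k] * exp(-(n - n0) / T[k]): per-channel decay times
        # (minus sign assumed, so each channel decays away from onset n0)
        n = np.arange(n_frames)[:, None]
        return H[None, :] * np.exp(-(n - n0) / T[None, :])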


PDCASA for old-plus-new

Incremental analysis

[Figure: input signal with events at t1, t2, t3. Time t1: initial element created. Time t2: additional element required. Time t3: second element finished]


PDCASA for the continuity illusion

Subjects hear the tone as continuous

. . . if the noise is a plausible masker

[Figure: 'ptshort' stimulus, 1000–4000 Hz, 0–1.4 s]

Data-driven analysis gives just visible portions:

Prediction-driven can infer masking:


Prediction-Driven CASA

Explain a complex sound with basic elements

[Figure: PDCASA analysis of a 10 s city-street recording (level −70 to −40 dB) into generic elements: Noise1; Noise2, Click1; Wefts 1–4, Weft 5, Wefts 6–7, Weft 8, Wefts 9–12; aligned with perceived events: Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10)]


Aside: Ground Truth

What do people hear in sound mixtures?
- do interpretations match?
→ Listening tests to collect 'perceived events':


Aside: Evaluation

Evaluation is a big problem for CASA
- what is the goal, really?
- what is a good test domain?
- how do you measure performance?

SNR improvement
- tricky to derive from before/after signals: correspondence problem
- can do with fixed filtering mask
- differentiate removing signal from adding noise

Speech Recognition (ASR) improvement
- recognizers often sensitive to artifacts

'Real' task?
- mixture corpus with specific sound events. . .


Outline

1 Sound mixture organization

2 Computational auditory scene analysis

3 Independent component analysis

4 Model-based separation


Independent Component Analysis (ICA) (Bell and Sejnowski, 1995, etc.)

If mixing is like matrix multiplication, then separation is searching for the inverse matrix

[Diagram: sources s1, s2 pass through mixing matrix [a11 a12; a21 a22] to give observations x1, x2; an unmixing matrix [w11 w12; w21 w22] produces estimates ŝ1, ŝ2, adapted by the gradient −δMutInfo/δwij]

i.e. W ≈ A⁻¹

- with N different versions of the mixed signals (microphones), we can find N different input contributions (sources)
- how to rate quality of outputs? i.e. when do outputs look separate?
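A toy numpy check of the mixing model (my illustration, not the slides'): when A is known, W = A⁻¹ recovers the sources exactly; ICA's job is to estimate W from the mixtures alone:

    import numpy as np

    rng = np.random.default_rng(0)
    s = rng.laplace(size=(2, 10_000))   # two independent non-Gaussian sources
    A = np.array([[1.0, 0.6],
                  [0.4, 1.0]])          # hypothetical mixing matrix
    x = A @ s                           # observed mixtures x = A s
    s_hat = np.linalg.inv(A) @ x        # with A known, W = A^-1 separates
    print(np.allclose(s_hat, s))        # True; ICA must find W blindly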


Gaussianity, Kurtosis, & Independence

A signal can be characterized by its PDF p(x), i.e. as if successive time values are drawn from a random variable (RV)
- Gaussian PDF is 'least interesting'
- sums of independent RVs (PDFs convolved) tend to a Gaussian PDF (central limit theorem)

Measures of deviations from Gaussianity: the 4th moment is kurtosis (“bulging”)

kurt(y) = E[((y − µ)/σ)⁴] − 3

[Figure: example PDFs and kurtosis values: Gaussian, kurt = 0; exp(−|x|), kurt = 2.5; exp(−x⁴), kurt = −1]

- kurtosis of Gaussian is zero (this def.)
- 'heavy tails' → kurt > 0
- closer to uniform dist. → kurt < 0
- directly related to KL divergence from Gaussian PDF
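A quick empirical check of these values (a sketch of my own; note the Laplacian exp(−|x|) has theoretical excess kurtosis 3, in the same ballpark as the 2.5 quoted on the slide's plot):

    import numpy as np

    def excess_kurtosis(y):
        # kurt(y) = E[((y - mu)/sigma)^4] - 3  (zero for a Gaussian)
        y = np.asarray(y, dtype=float)
        z = (y - y.mean()) / y.std()
        return np.mean(z ** 4) - 3.0

    rng = np.random.default_rng(0)
    print(excess_kurtosis(rng.standard_normal(100_000)))  # ~ 0    (Gaussian)
    print(excess_kurtosis(rng.laplace(size=100_000)))     # ~ 3    (heavy tails)
    print(excess_kurtosis(rng.uniform(size=100_000)))     # ~ -1.2 (sub-Gaussian)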


Independence in Mixtures

Scatter plots & Kurtosis values

[Figure: scatter plots of s1 vs. s2 and x1 vs. x2, with marginal kurtosis values: p(s1) 27.90, p(s2) 53.85, p(x1) 18.50, p(x2) 27.90]


Finding Independent Components

Sums of independent RVs are more Gaussian
→ minimize Gaussianity to undo sums, i.e. search over w_ij terms in the inverse matrix

[Figure: mixture scatter plot (mix 1 vs. mix 2) showing source directions s1, s2, and kurtosis as a function of projection angle θ/π]

Solve by gradient descent or Newton-Raphson:

w⁺ = E[x g(wᵀx)] − E[g′(wᵀx)] w
w ← w⁺ / ‖w⁺‖

“Fast ICA” (Hyvarinen and Oja, 2000): http://www.cis.hut.fi/projects/ica/fastica/
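For concreteness, a minimal one-unit iteration along the lines of the update above (my sketch, not the FastICA package itself), assuming the mixtures have already been centered and whitened and using g = tanh as one common contrast choice:

    import numpy as np

    def fastica_one_unit(X, n_iter=200, tol=1e-8, seed=0):
        """Estimate one unmixing vector w from whitened data X (dims, samples)."""
        rng = np.random.default_rng(seed)
        w = rng.standard_normal(X.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            u = w @ X                               # projections w^T x
            g, g_prime = np.tanh(u), 1.0 - np.tanh(u) ** 2
            # w+ = E[x g(w^T x)] - E[g'(w^T x)] w, then renormalize
            w_new = (X * g).mean(axis=1) - g_prime.mean() * w
            w_new /= np.linalg.norm(w_new)
            if abs(abs(w_new @ w) - 1.0) < tol:     # converged (up to sign)
                return w_new
            w = w_new
        return w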


Limitations of ICA

Assumes instantaneous mixing
- real-world mixtures have delays & reflections
- STFT domain?

x1(t) = a11(t) ∗ s1(t) + a12(t) ∗ s2(t)

⇒ X1(ω) = A11(ω)S1(ω) + A12(ω)S2(ω)

- solve each ω subband separately, match up answers

Searching for best possible inverse matrix
- cannot find more than N outputs from N inputs
- but: “projection pursuit” ideas + time-frequency masking. . .

Cancellation inherently fragile
- s1 = w11·x1 + w12·x2 must cancel out s2
- sensitive to noise in x channels
- time-varying mixtures are a problem


Outline

1 Sound mixture organization

2 Computational auditory scene analysis

3 Independent component analysis

4 Model-based separation


Model-Based Separation: HMM decomposition

(e.g. Varga and Moore, 1990; Gales and Young, 1993)

Independent state sequences for 2+ component source models

[Diagram: independent state sequences for model 1 and model 2 evolving against shared observations over time]

New combined state space q′ = q1 × q2
- need pdfs for combinations p(X | q1, q2)
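Two small sketches of what the combined state space implies (my own illustration; the 'max' rule is just one crude way to approximate p(X | q1, q2) for log-spectral features, not the method of the cited papers):

    import numpy as np

    # joint transitions over q' = q1 x q2: for independent chains, the
    # product-space transition matrix is the Kronecker product
    def joint_transitions(A1, A2):
        return np.kron(A1, A2)        # shape (N1*N2, N1*N2)

    # crude combination pdf: assume the louder source dominates each channel
    def combined_loglik(x, mu1, var1, mu2, var2):
        mu = np.maximum(mu1, mu2)
        var = np.where(mu1 >= mu2, var1, var2)
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)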


One-channel Separation: Masked Filtering

Multichannel → ICA: inverse filter and cancel
[Diagram: target s + interference n → mix → processing subtracts the estimated interference to recover s]

One channel: find a time-frequency mask
[Diagram: mix s+n → STFT → spectrogram × mask → ISTFT → processed ŝ]

Cannot remove overlapping noise in t-f cells, but surprisingly effective (psychoacoustic masking?):

[Figure: noise mixtures and mask resyntheses at 0, −12, and −24 dB SNR (0–8000 Hz, 0–3 s)]
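A hedged sketch of the mask-and-resynthesize pipeline with scipy (the sample rate and frame length are placeholder choices); the 'oracle' mask below needs the clean source and noise STFTs, so in practice the mask itself is what must be estimated:

    import numpy as np
    from scipy.signal import stft, istft

    def masked_filter(mix, mask, fs=16000, nperseg=512):
        # STFT the mixture, scale each t-f cell by the mask, invert
        _, _, X = stft(mix, fs=fs, nperseg=nperseg)
        _, y = istft(mask * X, fs=fs, nperseg=nperseg)
        return y

    def oracle_binary_mask(S, N):
        # keep cells where the target outweighs the noise (needs ground truth)
        return (np.abs(S) > np.abs(N)).astype(float)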


“One microphone source separation”

(Roweis, 2001)

State sequences → t-f estimates → mask
[Figure: for Speaker 1 and Speaker 2 (0–4000 Hz, 0–6 s): original voices, state means, resynthesis masks, and the mixture]

- 1000 states/model (→ 10⁶ transition probs.)
- simplify by subbands (coupled HMM)?


Speech Fragment Recognition

(Barker et al., 2005)

Signal separation is too hard! Instead:
- segregate features into partially-observed sources
- then classify

Made possible by missing data recognition
- integrate over uncertainty in observations for true posterior distribution

Goal: relate clean speech models P(X |M) to speech-plus-noise mixture observations
. . . and make it tractable


Missing Data Recognition

Speech models p(x |m) are multidimensional. . . i.e. means, variances for every freq. channel
- need values for all dimensions to get p(·)
But: can evaluate over a subset of dimensions x_k:

p(x_k |m) = ∫ p(x_k, x_u |m) dx_u

[Figure: joint PDF p(x_k, x_u): marginalizing over the unobserved dimension gives p(x_k); conditioning on a bound gives p(x_k | x_u < y)]

Hence, missing data recognition:

P(x | q) = P(x1 | q) · P(x2 | q) · P(x3 | q) · P(x4 | q) · P(x5 | q) · P(x6 | q)

[Figure: a 'present data mask' over time × dimension selects which channels contribute to the product]

- hard part is finding the mask (segregation)
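For a diagonal-Gaussian state model, marginalizing the unused dimensions just drops their factors, so the present-data likelihood is a masked sum of per-channel log-likelihoods (a minimal sketch, assuming a boolean 'present' mask):

    import numpy as np
    from scipy.stats import norm

    def present_data_loglik(x, present, mu, var):
        # log p(x_k | m): sum over present channels only, since
        # integrating a diagonal Gaussian over x_u removes those factors
        k = np.asarray(present, dtype=bool)
        return norm.logpdf(x[k], mu[k], np.sqrt(var[k])).sum()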

Missing Data Results

Estimate static background noise level N(f )

Cells with energy close to background are considered “missing”
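One plausible way to build such a mask (my sketch; the percentile-based noise-floor estimate and the 3 dB threshold are assumptions, not the slides' recipe):

    import numpy as np

    def missing_data_mask(Y, thresh_db=3.0):
        """Y: magnitude spectrogram (freqs, frames). True = reliable cell."""
        power = np.abs(Y) ** 2
        # static background N(f): e.g. a low percentile per frequency channel
        N = np.percentile(power, 10, axis=1, keepdims=True)
        local_snr = 10.0 * np.log10(power / np.maximum(N, 1e-12))
        return local_snr > thresh_db   # cells near the floor are "missing"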

"1754" + noise

SNR mask

Missing DataClassifier "1754"

0 5 10 15 2020

40

60

80

100Factory Noise

Digit

reco

gnitio

n ac

cura

cy /

%SNR (dB)

a priorimissing dataMFCC+CMN

"

- must use spectral features!

But: nonstationary noise → spurious mask bits
- can we try removing parts of mask?


Comparing different segregations

Standard classification chooses between models M to match source features X

M* = argmax_M p(M |X ) = argmax_M p(X |M) p(M)

Mixtures: observed features Y, segregation S, all related by p(X |Y, S)

[Diagram: observation Y(f), segregation S, and source X(f) along frequency]

Joint classification of model and segregation:

p(M, S |Y ) = p(M) ∫ [ p(X |M) · p(X |Y, S) / p(X ) ] dX · p(S |Y )

- p(X ) is no longer constant


Calculating fragment matches

p(M, S |Y ) = p(M) ∫ [ p(X |M) · p(X |Y, S) / p(X ) ] dX · p(S |Y )

- p(X |M): the clean-signal feature model
- p(X |Y, S) / p(X ): is X 'visible' given segregation?

Integration collapses some bands. . .

p(S |Y ): segregation inferred from observation
- just assume uniform, find S for most likely M
- or: use extra information in Y to distinguish Ss. . .

Result:
- probabilistically-correct relation between clean-source models p(X |M) and the inferred, recognized source + segregation p(M, S |Y )


Using CASA features

p(S |Y ) links acoustic information to segregation
- is this segregation worth considering?
- how likely is it?

Opportunity for CASA-style information to contribute
- periodicity/harmonicity: these different frequency bands belong together
- onset/continuity: this time-frequency region must be whole

[Figure: frequency proximity, harmonicity, and common onset cues]


Fragment decoding

Limiting S to whole fragments makes the hypothesis search tractable (see the sketch after this slide):

[Diagram: fragments spanning time intervals T1–T6 generate a branching set of hypotheses w1–w8]

- choice of fragments reflects p(S |Y ) p(X |M), i.e. the best combination of segregation and match to speech models

Merging hypotheses limits space demands

. . . but erases specific history
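Even with whole fragments, the space is exponential: each fragment goes wholly to the target or the background, giving 2^N candidate segregations S. A toy enumeration of my own (real decoders prune hypotheses against p(S |Y ) p(X |M) as they go rather than listing all of them):

    from itertools import product

    def enumerate_segregations(fragments):
        # each fragment is assigned wholly to target (1) or background (0)
        for labels in product((0, 1), repeat=len(fragments)):
            target = [f for f, lab in zip(fragments, labels) if lab]
            background = [f for f, lab in zip(fragments, labels) if not lab]
            yield target, background

    # e.g. 6 fragments -> 2**6 = 64 hypotheses to score against the models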


Speech fragment decoder results

Simple p(S |Y ) model forces contiguous regions to stay together

- big efficiency gain when searching S space

"1754" + noise

SNR mask

Fragments

FragmentDecoder "1754"

Clean-models-based recognition rivals trained-in-noise recognition


Multi-source decoding

Search for more than one source

[Diagram: observations Y(t) jointly explained by masked sources S1(t) with states q1(t) and S2(t) with states q2(t)]

Mutually-dependent data masks
- disjoint subsets of cells for each source
- each model match p(Mx |Sx, Y ) is independent
- masks are mutually dependent: p(S1, S2 |Y )

Huge practical advantage over full search


Summary

Auditory Scene Analysis:
- hearing: partially understood, very successful

Independent Component Analysis:
- simple and powerful, some practical limits

Model-based separation:
- real-world constraints, implementation tricks

Parting thought

Mixture separation is the main obstacle in many applications, e.g. soundtrack recognition


References

Albert S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. The MIT Press, 1990. ISBN 0262521954.

C. J. Darwin and R. P. Carlyon. Auditory grouping. In B. C. J. Moore, editor, The Handbook of Perception and Cognition, Vol. 6, Hearing, pages 387–424. Academic Press, 1995.

G. J. Brown and M. P. Cooke. Computational auditory scene analysis. Computer Speech and Language, 8:297–336, 1994.

D. P. W. Ellis. Prediction-driven computational auditory scene analysis. PhD thesis, Department of Electrical Engineering and Computer Science, M.I.T., 1996.

Anthony J. Bell and Terrence J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995. ftp://ftp.cnl.salk.edu/pub/tony/bell.blind.ps.Z.

A. Hyvarinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4–5):411–430, 2000. http://www.cis.hut.fi/aapo/papers/IJCNN99_tutorialweb/.

A. Varga and R. Moore. Hidden Markov model decomposition of speech and noise. In Proceedings of the 1990 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 845–848, 1990.

M. J. F. Gales and S. J. Young. HMM recognition in noise using parallel model combination. In Proc. Eurospeech-93, volume 2, pages 837–840, 1993.

S. Roweis. One-microphone source separation. In NIPS, volume 11, pages 609–616. MIT Press, Cambridge, MA, 2001.

Jon Barker, Martin Cooke, and Daniel P. W. Ellis. Decoding speech in the presence of other sources. Speech Communication, 45(1):5–25, 2005. URL http://www.ee.columbia.edu/~dpwe/pubs/BarkCE05-sfd.pdf.
