E6820 SAPR - Dan Ellis L11 - Signal Separation 2006-04-13 - 1
EE E6820: Speech & Audio Processing & Recognition
Lecture 11: Signal Separation
1. Sound Mixture Organization
2. Computational Auditory Scene Analysis
3. Independent Component Analysis
4. Model-Based Separation
Dan Ellis <[email protected]>http://www.ee.columbia.edu/~dpwe/e6820/
Columbia University Dept. of Electrical EngineeringSpring 2006
Sound Mixture Organization
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources / events
- ... like listeners do
• Hearing is ecologically grounded
- reflects 'natural scene' properties
- subjective, not absolute
[Figure: spectrogram (0-4000 Hz, 0-12 s, level -60 to 0 dB) of a complex mixture, analyzed into perceived sources: voice (evil), stab, rumble, strings, choir, voice (pleasant)]
Sound, mixtures, and learning
• Sound
- carries useful information about the world
- complements vision
• Mixtures
- ... are the rule, not the exception
- medium is 'transparent', sources are many
- must be handled!
• Learning
- the 'speech recognition' lesson: let the data do the work
- like listeners
[Figure: spectrograms (0-4 kHz, 0-1 s): Speech + Noise = Speech + Noise]
The problem with recognizing mixtures
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”
(after Bregman’90)
• Received waveform is a mixture
- two sensors, N signals ... underconstrained
• Disentangling mixtures as the primary goal?
- a perfect solution is not possible
- need experience-based constraints
Approaches to sound mixture recognition
• Separate signals, then recognize
- e.g. Computational Auditory Scene Analysis (CASA), Independent Component Analysis (ICA)
- nice, if you can do it
• Recognize the combined signal
- 'multicondition training'
- combinatorics...
• Recognize with parallel models
- full joint-state space?
- divide the signal into fragments, then use missing-data recognition
What is the goal of CASA?
• Separate signals?
- output is unmixed waveforms
- underconstrained, very hard ...
- too hard? not required?
• Source classification?
- output is a set of event names
- listeners do more than this...
• Something in-between? Identify independent sources + characteristics
- standard task, results?
Segregation vs. Inference
• Source separation requires attribute separation
- sources are characterized by attributes (pitch, loudness, timbre + finer details)
- need to identify & gather different attributes for different sources ...
• Need representation that segregates attributes
- spectral decomposition
- periodicity decomposition
• Sometimes values can't be separated
- e.g. unvoiced speech
- maybe infer factors from a probabilistic model? p(O, x, y) → p(x, y | O)
- or: just skip those values, infer from higher-level context
Outline
1. Sound Mixture Organization
2. Computational Auditory Scene Analysis
- Human Auditory Scene Analysis
- Bottom-up and top-down models
- Evaluation
3. Independent Component Analysis
4. Model-Based Separation
Auditory Scene Analysis
(Bregman 1990)
• How do people analyze sound mixtures?
- break the mixture into small elements (in time-freq)
- elements are grouped into sources using cues
- sources have aggregate attributes
• Grouping 'rules' (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Figure: frequency analysis feeds onset, harmonicity, and position maps into a grouping mechanism that yields source properties (after Darwin, 1996)]
Cues to simultaneous grouping
• Elements + attributes
• Common onset
- simultaneous energy has a common source
• Periodicity
- energy in different bands with the same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...
[Figure: spectrogram (0-8000 Hz, 0-9 s) illustrating simultaneous grouping cues]
The effect of context
• Context can create an 'expectation': i.e. a bias towards a particular interpretation
• E.g. Bregman's 'old-plus-new' principle: a change in a signal will be interpreted as an added source whenever possible
- a different division of the same energy depending on what preceded it
[Figure: spectrograms (0-2 kHz, 0-1.2 s) showing the same energy divided differently depending on what preceded it]
Computational Auditory Scene Analysis (CASA)
• Goal: automatic sound organization; systems to 'pick out' sounds in a mixture
- ... like people do
• E.g. voice against a noisy background
- to improve speech recognition
• Approach:
- psychoacoustics describes grouping 'rules'
- ... just implement them?
[Figure: sound mixture analyzed into object 1, object 2, and object 3 percepts]
CASA front-end processing
• Correlogram: loosely based on known/possible physiology
- linear filterbank cochlear approximation
- static nonlinearity
- zero-delay slice is like spectrogram
- periodicity from delay-and-multiply detectors
[Figure: correlogram processing — sound → cochlea filterbank → envelope follower → short-time autocorrelation (delay line), giving a correlogram slice over frequency channels, lag, and time]
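The correlogram front end above can be sketched in a few lines. This is a toy illustration, not the course's implementation: the filterbank is a crude FFT-mask bandpass standing in for a gammatone/cochlear filterbank, and the band edges, frame position, and lag range are invented for the example.

```python
# Minimal correlogram sketch: per-band short-time autocorrelation of a
# half-wave-rectified band signal (stand-in for the cochlear nonlinearity).
import numpy as np

def correlogram_frame(x, sr, bands, frame, max_lag):
    """One correlogram slice: (n_bands, max_lag) autocorrelation values."""
    out = []
    for lo, hi in bands:
        # crude bandpass via FFT masking (stand-in for a cochlear filter)
        X = np.fft.rfft(x)
        f = np.fft.rfftfreq(len(x), 1.0 / sr)
        xb = np.fft.irfft(np.where((f >= lo) & (f < hi), X, 0), len(x))
        env = np.maximum(xb, 0)          # static nonlinearity (half-wave)
        seg = env[frame:frame + 2 * max_lag]
        out.append([np.dot(seg[:max_lag], seg[k:k + max_lag])
                    for k in range(max_lag)])
    return np.array(out)

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 2000 * t)
C = correlogram_frame(x, sr, bands=[(100, 400), (1500, 2500)],
                      frame=0, max_lag=160)
# the 200 Hz component shows as an autocorrelation peak at lag sr/200 = 40
print(C.shape)
```

The low band's autocorrelation peaks again at a lag of 40 samples (one 200 Hz period), which is the periodicity cue the grouping stage consumes.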
The Representational Approach
(Brown & Cooke 1993)
• Implement psychoacoustic theory
- 'bottom-up' processing
- uses common onset & periodicity cues
• Able to extract voiced speech:
[Figure: processing chain (front end → object formation → grouping rules → source groups) over onset, period, and frq.mod maps; input-mixture and extracted-speech spectrograms (brn1h.aif, brn1h.fi.aif; 100-3000 Hz, 0-1.0 s)]
Problems with 'bottom-up' CASA
• Circumscribing time-frequency elements
- need to have 'regions', but hard to find
• Periodicity is the primary cue
- how to handle aperiodic energy?
• Resynthesis via masked filtering
- cannot separate within a single t-f element
• Bottom-up leaves no ambiguity or context
- how to model illusions?
Restoration in sound perception
• Auditory 'illusions' = hearing what's not there
• The continuity illusion
• Sinewave Speech (SWS)
- duplex perception
• What kind of model accounts for this?
- is it an important part of hearing?
[Figure: spectrograms of the continuity-illusion stimulus (ptshort, 0-1.4 s) and sinewave speech]
Adding top-down constraints: Prediction-Driven CASA (PDCASA)
• Perception is not direct but a search for plausible hypotheses
• Data-driven (bottom-up)...
- objects irresistibly appear
... vs. prediction-driven (top-down)
- match observations against a 'world model'
- need world-model constraints...
[Figure: data-driven chain (front end → object formation → grouping rules → source groups) vs. prediction-driven loop (front end → compare & reconcile → hypothesis management → predict & combine periodic and noise components)]
Generic sound elements for PDCASA
• Goal is a representational space that
- covers real-world perceptual sounds
- has a minimal parameterization (sparseness)
- keeps separate attributes in separate parameters
• Object hierarchies built on top...
[Figure: generic elements — NoiseObject: XN[n,k] = H[k]·A[n] (spectral profile × magnitude contour); WeftObject: XW[n,k] = HW[n,k]·P[n,k] (spectral envelope × periodogram from correlogram analysis); ClickObject: XC[n,k] = H[k]·exp(-(n - n0)/T[k]) (magnitude profile × per-band decay times)]
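The separable structure of these elements is easy to sketch numerically. The dimensions and shapes below are invented for illustration, and only the NoiseObject and ClickObject forms are shown (a weft needs the full correlogram machinery):

```python
# PDCASA generic elements as separable time-frequency patches.
import numpy as np

n_frames, n_bands = 20, 8
H = np.hanning(n_bands)              # spectral profile H[k]
A = np.linspace(1.0, 0.2, n_frames)  # magnitude contour A[n]

# NoiseObject: stationary spectrum scaled by a slowly varying envelope
XN = A[:, None] * H[None, :]         # XN[n, k] = H[k] * A[n]

# ClickObject: spectral profile with per-band exponential decay after onset n0
n0 = 5
T = np.full(n_bands, 3.0)            # decay time constants T[k] (in frames)
n = np.arange(n_frames)[:, None]
XC = np.where(n >= n0,
              H[None, :] * np.exp(-(n - n0) / T[None, :]),
              0.0)

print(XN.shape, XC.shape)
```

Each element is controlled by a handful of one-dimensional parameter curves, which is exactly the sparseness and attribute-separation the slide asks for.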
PDCASA for old-plus-new
• Incremental analysis:
- time t1: initial element created
- time t2: additional element required
- time t3: second element finished
[Figure: input signal analyzed incrementally at times t1, t2, t3]
PDCASA for the continuity illusion
• Subjects hear the tone as continuous ... if the noise is a plausible masker
• Data-driven analysis gives just the visible portions:
• Prediction-driven analysis can infer the masking:
[Figure: spectrogram (ptshort, 0-1.4 s) of the tone-plus-noise continuity stimulus]
Prediction-Driven CASA
• Explain a complex sound with basic elements
[Figure: PDCASA analysis of a city-street recording ("City"): noise elements (Noise1; Noise2, Click1) and wefts (Wefts1-4, Weft5, Wefts6,7, Weft8, Wefts9-12) explaining events labeled Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10)]
Aside: Ground Truth
• What do people hear in sound mixtures?
- do interpretations match?
→ Listening tests to collect 'perceived events':
Aside: Evaluation
• Evaluation is a big problem for CASA
- what is the goal, really?
- what is a good test domain?
- how do you measure performance?
• SNR improvement
- tricky to derive from before/after signals: correspondence problem
- can do with a fixed filtering mask, but that rewards removing signal as well as noise
• Speech recognition (ASR) improvement
- recognizers typically very sensitive to artefacts
• 'Real' task?
- mixture corpus with specific sound events...
Outline
1. Sound Mixture Organization
2. Computational Auditory Scene Analysis
3. Independent Component Analysis
- Blind source separation
- Independence and kurtosis
- Limits of the approach
4. Model-Based Separation
Independent Component Analysis (ICA)
(Bell & Sejnowski 1995 etc.)
• If mixing is like matrix multiplication, then separation is searching for the inverse matrix
- i.e. W ≈ A^-1
- with N different versions of the mixed signals (microphones), we can find N different input contributions (sources)
- how to rate quality of outputs? i.e. when do outputs look separate?
[Figure: sources s1, s2 mixed by matrix [a11 a12; a21 a22] into observations x1, x2, then unmixed by [w11 w12; w21 w22] into estimates ŝ1, ŝ2; the wij are adapted by the gradient −δMutInfo/δwij]
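A minimal numerical check of the matrix view of separation. Here the mixing matrix A is known, so W = A^-1 exactly; ICA's whole job is to find W blindly from x alone:

```python
# Two-source instantaneous mixing and its exact inverse.
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))        # two independent (super-Gaussian) sources
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])             # instantaneous mixing matrix
x = A @ s                              # observed microphone signals

W = np.linalg.inv(A)                   # ideal unmixing matrix, W = A^-1
s_hat = W @ x                          # recovered sources
```

With the true inverse, recovery is exact up to floating-point error; blind ICA can only recover the sources up to permutation and scaling, since those are absorbed into A.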
Gaussianity, Kurtosis & Independence
• A signal can be characterized by its PDF p(x)
- i.e. as if successive time values are drawn from a random variable (RV)
- Gaussian PDF is 'least interesting'
- sums of independent RVs (PDFs convolved) tend to a Gaussian PDF (weak law of large numbers)
• Measures of deviations from Gaussianity: the 4th moment is kurtosis ('bulging'):
kurt(y) = E{[(y − µ)/σ]^4} − 3
- kurtosis of a Gaussian is zero (with this definition)
- 'heavy tails' → kurt > 0
- closer to uniform dist. → kurt < 0
• Directly related to KL divergence from the Gaussian PDF
[Figure: example PDFs — Gaussian (kurt = 0), exp(-|x|) (kurt = 2.5), exp(-x^4) (kurt = -1)]
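The kurtosis definition above is easy to verify on synthetic samples. This uses the plain moment estimator; the distributions are chosen for illustration:

```python
# Sample (excess) kurtosis: zero for Gaussian, positive for heavy tails,
# negative for flatter-than-Gaussian distributions.
import numpy as np

def kurt(y):
    z = (y - y.mean()) / y.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(1)
g = rng.normal(size=200_000)          # Gaussian: kurt ~ 0
l = rng.laplace(size=200_000)         # heavy tails: kurt > 0 (3 for Laplace)
u = rng.uniform(-1, 1, size=200_000)  # flatter than Gaussian: kurt < 0 (-1.2)

print(round(kurt(g), 2), round(kurt(l), 2), round(kurt(u), 2))
```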
Independence in Mixtures
• Scatter plots & kurtosis values
[Figure: scatter plots of s1 vs. s2 and x1 vs. x2, with marginal kurtosis values: p(s1) = 27.90, p(s2) = 53.85, p(x1) = 18.50, p(x2) = 27.90]
Finding Independent Components
• Sums of independent RVs are more Gaussian
→ minimize Gaussianity to undo the sums
- i.e. search over the wij terms of the inverse matrix
• Solve by gradient descent or Newton-Raphson: "FastICA", http://www.cis.hut.fi/projects/ica/fastica/
[Figure: mixture scatter (mix 1 vs. mix 2) and kurtosis of the projection as a function of rotation angle θ/π; kurtosis peaks at the angles that recover s1 and s2]
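The kurtosis-vs-θ picture can be reproduced directly: after whitening, the unmixing search reduces to a rotation, so scanning the projection angle for maximum kurtosis recovers a source direction. A sketch under those assumptions (FastICA replaces the brute-force scan with a Newton-style fixed point):

```python
# One-parameter ICA: whiten a 2-D mixture, then scan rotation angle theta
# for the maximum-kurtosis projection.
import numpy as np

rng = np.random.default_rng(2)
s = rng.laplace(size=(2, 50_000))                  # independent sources
A = np.array([[0.9, 0.4], [0.2, 0.8]])
x = A @ s

# whiten: decorrelate and equalize variance (ZCA)
d, E = np.linalg.eigh(np.cov(x))
z = (E / np.sqrt(d)) @ E.T @ x

def kurt(y):
    yn = (y - y.mean()) / y.std()
    return np.mean(yn ** 4) - 3.0

thetas = np.linspace(0, np.pi, 180, endpoint=False)
k = [kurt(np.cos(t) * z[0] + np.sin(t) * z[1]) for t in thetas]
best = thetas[int(np.argmax(k))]
y = np.cos(best) * z[0] + np.sin(best) * z[1]      # one recovered source

# the max-kurtosis projection should line up with one true source
c = max(abs(np.corrcoef(y, s[0])[0, 1]),
        abs(np.corrcoef(y, s[1])[0, 1]))
print(round(c, 2))
```

The second source falls out at the orthogonal angle; with more than two sources, the same idea is applied repeatedly with deflation ("projection pursuit").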
Limitations of ICA
• Assumes instantaneous mixing
- real-world mixtures have delays & reflections, i.e. convolution:
x1(t) = a11(t) ⊗ s1(t) + a12(t) ⊗ s2(t)
- STFT domain? ⇒ X1(ω) = A11(ω)·S1(ω) + A12(ω)·S2(ω)
solve ω subbands separately, match up answers
• Searching for best possible inverse matrix
- cannot find more than N outputs from N inputs
but: "projection pursuit" ideas + time-frequency masking...
• Cancellation inherently fragile
- ŝ1 = w11·x1 + w12·x2 relies on the weights exactly cancelling out s2
- sensitive to noise in x channels
- time-varying mixtures are a problem
Outline
1. Sound Mixture Organization
2. Computational Auditory Scene Analysis
3. Independent Component Analysis
4. Model-Based Separation
- Fitting models to mixtures
- Missing-data recognition
- Speech Fragment Decoding
Model-Based Separation: HMM decomposition
(e.g. Varga & Moore 1991, Gales & Young 1996)
• Independent state sequences for 2+ component source models
• New combined state space q' = {q1, q2}
- need pdfs for combinations: p(X | q1, q2)
[Figure: state lattices of model 1 and model 2 evaluated jointly against the observations over time]
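The combined state space can be illustrated with toy models. The elementwise "max" combination of log-spectral state means used below is a common approximation for the required combination pdfs, not necessarily the exact scheme of the cited papers; all dimensions are invented:

```python
# HMM decomposition sketch: enumerate the combined state space q' = (q1, q2)
# and score an observed mixture frame against every pair.
import numpy as np

means1 = np.array([[0.0, 8.0], [8.0, 0.0]])   # source-1 states (2 log-spec bands)
means2 = np.array([[4.0, 4.0]])               # source-2 single state

# one combined mean per state pair, via the log-max mixture approximation
combined = np.array([[np.maximum(m1, m2) for m2 in means2]
                     for m1 in means1]).reshape(-1, 2)

def loglik(x, mu, var=1.0):
    """Diagonal-Gaussian log-likelihood of frame x under each state mean."""
    return -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var),
                         axis=-1)

x = np.array([4.0, 8.0])                      # observed mixture frame
best = int(np.argmax(loglik(x, combined)))
print(best, combined[best])
```

The cost is the product of the component state counts, which is why the slide on "One microphone source separation" worries about 1000-state models.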
One-channel Separation: Masked Filtering
• Multichannel → ICA: inverse filter & cancel
• One channel: find a time-frequency mask
• Cannot remove overlapping noise within a TF cell, but surprisingly effective (psychoacoustic masking?):
[Figure: masked-filtering signal flow — mix s+n → STFT → mask → ISTFT → processed s; noisy-mix and mask-resynthesis spectrograms (0-8000 Hz) at 0, -12, and -24 dB SNR]
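A sketch of masked filtering with an oracle binary mask, computed from the known clean target so it is an upper bound; real systems must estimate the mask. The signals, mask rule, and STFT parameters are all assumptions for illustration:

```python
# Oracle time-frequency masking: keep only cells where the target dominates.
import numpy as np
from scipy.signal import stft, istft

sr = 8000
t = np.arange(2 * sr) / sr
target = np.sin(2 * np.pi * 440 * t)               # tonal target s
rng = np.random.default_rng(3)
noise = 0.5 * rng.normal(size=t.shape)             # broadband noise n
mix = target + noise

_, _, S = stft(target, fs=sr, nperseg=256)
_, _, M = stft(mix, fs=sr, nperseg=256)

mask = (np.abs(S) > np.abs(M - S)).astype(float)   # target louder than noise?
_, y = istft(mask * M, fs=sr, nperseg=256)         # masked resynthesis

def snr(ref, est):
    n = min(len(ref), len(est))
    e = est[:n] - ref[:n]
    return 10 * np.log10(np.sum(ref[:n] ** 2) / np.sum(e ** 2))

print(round(snr(target, mix), 1), round(snr(target, y), 1))
```

Noise inside the kept cells survives untouched, exactly the "cannot separate within a single t-f element" limitation, yet the overall SNR gain is large because most cells are noise-dominated and simply zeroed.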
'One microphone source separation'
(Roweis 2000, Manuel Reyes)
• State sequences → t-f estimates → mask
- 1000 states/model (→ 10^6 transition probs.)
- simplify by subbands (coupled HMM)?
[Figure: two-speaker example — original voices, state means, resynthesis masks, and the mixture, as spectrograms (0-4000 Hz, 0-6 s) for speaker 1 and speaker 2]
Speech Fragment Recognition(Jon Barker & Martin Cooke, Sheffield)
• Signal separation is too hard! Instead:
- segregate features into partially-observed sources
- then classify
• Made possible by missing-data recognition
- integrate over uncertainty in observations for the true posterior distribution
• Goal: relate clean speech models P(X|M) to speech-plus-noise mixture observations
- ... and make it tractable
Missing Data Recognition
• Speech models p(x|m) are multidimensional...
- i.e. means, variances for every freq. channel
- need values for all dimensions to get p(•)
• But: can evaluate over a subset of dimensions xk:
p(xk | m) = ∫ p(xk, xu | m) dxu
• Hence, missing-data recognition:
P(x | q) = P(x1 | q) · P(x2 | q) · P(x3 | q) · P(x4 | q) · P(x5 | q) · P(x6 | q), keeping only the factors flagged in the present-data mask
- hard part is finding the mask (segregation)
[Figure: joint density p(xk, xu) with marginal p(xk) and bounded marginal p(xk | xu < y); present-data mask over time and feature dimension]
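For diagonal-covariance Gaussian state models, the marginalization above just drops the factors for the unobserved dimensions. A sketch with invented means and one artificially corrupted dimension:

```python
# Missing-data likelihood: evaluate a diagonal Gaussian over the "present"
# dimensions only, which is exactly the marginal over the missing ones.
import numpy as np

def missing_data_loglik(x, present, mu, var):
    """Log p(x_k | m) over dimensions where present == True."""
    k = present
    return -0.5 * np.sum(np.log(2 * np.pi * var[k])
                         + (x[k] - mu[k]) ** 2 / var[k])

mu = np.array([1.0, 2.0, 3.0, 4.0])            # state mean per freq. channel
var = np.ones(4)
x = np.array([1.0, 2.0, 99.0, 4.0])            # channel 2 swamped by noise

full = missing_data_loglik(x, np.ones(4, bool), mu, var)
partial = missing_data_loglik(x, np.array([True, True, False, True]), mu, var)
print(round(full, 1), round(partial, 1))
```

Scoring all four channels is catastrophically wrong because of the corrupted cell; marking that cell missing restores a sensible match, which is why finding the mask is the hard part.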
Missing Data Results
• Estimate the static background noise level N(f)
• Cells with energy close to the background are considered 'missing'
- must use spectral features!
• But: nonstationary noise → spurious mask bits
- can we try removing parts of the mask?
"1754" + noise
SNR mask
Missing DataClassifier "1754"
0 5 10 15 2020
40
60
80
100Factory Noise
Dig
it re
cogn
ition
acc
urac
y / %
SNR (dB)
a priorimissing dataMFCC+CMN
∞
E6820 SAPR - Dan Ellis L11 - Signal Separation 2006-04-13 - 37
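The mask-estimation step can be sketched as thresholding against a per-band noise level. The synthetic magnitudes, the noise-only leading frames, and the 3 dB threshold are all assumptions for illustration:

```python
# SNR-mask sketch: cells are "present" where local energy exceeds the
# estimated stationary background by a margin.
import numpy as np

rng = np.random.default_rng(4)
n_frames, n_bands = 50, 16
spec = rng.rayleigh(1.0, size=(n_frames, n_bands))   # background magnitudes
spec[20:30, 4:8] += 10.0                             # a loud speech-like patch

N_f = spec[:10].mean(axis=0)                 # noise level N(f) from noise-only frames
mask = 20 * np.log10(spec / N_f) > 3.0       # present where ~3 dB above background

print(mask[20:30, 4:8].mean(), round(mask[:10].mean(), 2))
```

Random fluctuations in the background still trip the threshold in scattered cells, which is the "spurious mask bits" problem the slide notes for nonstationary noise.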
Comparing different segregations
• Standard classification chooses between models M to match source features X:
M* = argmax_M P(M | X) = argmax_M P(X | M) · P(M) / P(X)
• Mixtures: observed features Y, segregation S, all related by P(X | Y, S)
• Joint classification of model and segregation:
P(M, S | Y) = P(M) · ∫ [P(X | M) · P(X | Y, S) / P(X)] dX · P(S | Y)
- P(X) no longer constant
[Figure: observation Y(f) divided by a segregation S into source X(f)]
Calculating fragment matches
• P(X|M) - the clean-signal feature model
• P(X|Y,S)/P(X) - is X 'visible' given the segregation?
• Integration collapses some bands...
• P(S|Y) - segregation inferred from the observation
- just assume uniform, find S for the most likely M
- or: use extra information in Y to distinguish S's...
• Result: a probabilistically-correct relation between clean-source models P(X|M) and the inferred, recognized source + segregation:
P(M, S | Y) = P(M) · ∫ [P(X | M) · P(X | Y, S) / P(X)] dX · P(S | Y)
Using CASA features
• P(S|Y) links acoustic information to segregation
- is this segregation worth considering? how likely is it?
• Opportunity for CASA-style information to contribute
- periodicity/harmonicity: these different frequency bands belong together
- onset/continuity: this time-frequency region must be whole
[Figure: frequency proximity, common onset, and harmonicity cues]
Fragment decoding
• Limiting S to whole fragments makes the hypothesis search tractable:
- choice of fragments reflects P(S|Y) · P(X|M), i.e. the best combination of segregation and match to the speech models
• Merging hypotheses limits space demands
- ... but erases specific history
[Figure: fragments spanning times T1-T6, with word hypotheses w1-w8 merged and extended across fragment boundaries]
Speech fragment decoder results
• Simple P(S|Y) model forces contiguous regions to stay together
- big efficiency gain when searching the S space
• Clean-models-based recognition rivals trained-in-noise recognition
"1754" + noise
SNR mask
Fragments
FragmentDecoder "1754"
0 5 10 15 20 ∞20
40
60
80
100Factory Noise
Dig
it re
cogn
ition
acc
urac
y / %
SNR (dB)
a priorimultisourcemissing dataMFCC+CMN
E6820 SAPR - Dan Ellis L11 - Signal Separation 2006-04-13 - 42
Multi-source decoding
• Search for more than one source
• Mutually-dependent data masks
- disjoint subsets of cells for each source
- each model match P(Mx | Sx, Y) is independent
- masks are mutually dependent: P(S1, S2 | Y)
• Huge practical advantage over full search
[Figure: observations Y(t) divided between mask/state pairs (S1(t), q1(t)) and (S2(t), q2(t))]
Summary
• Auditory Scene Analysis: hearing is partially understood, very successful
• Independent Component Analysis: simple and powerful, some practical limits
• Model-based separation: real-world constraints, implementation tricks
• Mixture separation remains the main obstacle in many applications, e.g. soundtrack recognition
References
Aapo Hyvärinen, Erkki Oja (2000) "Independent Component Analysis: Algorithms and Applications", Neural Networks. http://www.ee.columbia.edu/~dpwe/e6820/papers/HyvO00-icatut.pdf
Martin Cooke, Dan Ellis (2001) "The auditory organization of speech and other sources in listeners and computational models", Speech Communication, vol. 35, no. 3-4, pp. 141-177. http://www.ee.columbia.edu/~dpwe/pubs/CookeE01-audorg.pdf
Jon Barker, Martin Cooke, Dan Ellis (2004) "Decoding speech in the presence of other sources", Speech Communication, to appear. http://www.ee.columbia.edu/~dpwe/papers/BarkCE04-sfd.pdf
Dan Ellis (1996) "Prediction-driven Computational Auditory Scene Analysis", Ph.D. dissertation, MIT EECS. http://www.ee.columbia.edu/~dpwe/pdcasa/