grasp-2003-09 - Columbia Universitydpwe/talks/grasp-2003-09.pdf · Title: grasp-2003-09.fm Author:...
Transcript of grasp-2003-09 - Columbia Universitydpwe/talks/grasp-2003-09.pdf · Title: grasp-2003-09.fm Author:...
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 1
Sound, Mixtures, and Learning:LabROSA overview
Sound Content Analysis
Recognizing sounds
Organizing mixtures
Accessing large datasets
Dan Ellis <[email protected]>
Laboratory for Recognition and Organization of Speech and Audio(Lab
ROSA
)Columbia University, New Yorkhttp://labrosa.ee.columbia.edu/
1
2
3
4
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 2
Sound Content Analysis
• Sound understanding: the key challenge
- what listeners do- understanding = abstraction
• Applications
- indexing/retrieval- robots- prostheses
1
0 2 4 6 8 10 12 time/s
frq/Hz
0
2000
1000
3000
4000
Voice (evil)
Stab
Rumble Strings
Choir
Voice (pleasant)
Analysis
level / dB-60
-40
-20
0
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 3
The problem with recognizing mixtures
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”
(after Bregman’90)
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do
• Hearing is ecologically grounded
- reflects natural scene properties = constraints- subjective, not absolute
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 4
Approaches to handling sound mixtures
• Separate signals, then recognize
- e.g. CASA, ICA- nice, if you can do it
• Recognize combined signal
- ‘multicondition training’- combinatorics..
• Recognize with parallel models
- full joint-state space?- or: divide signal into fragments,
then use missing-data recognition
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 5
Outline
Sound Content Analysis
Recognizing sounds
- Speech recognition- Nonspeech
Organizing mixtures
Accessing large datasets
1
2
3
4
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 6
Recognizing Sounds: Speech
• Standard speech recognition structure:
• How to handle additive noise?
- just train on noisy data: ‘multicondition training’
2
Featurecalculation
sound
Acousticclassifier
feature vectorsAcoustic model
parameters
HMMdecoder
Understanding/application...
phone probabilities
phone / word sequence
Word models
Language modelp("sat"|"the","cat")p("saw"|"the","cat")
s ah t
D A
T A
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 7
Novel speech signal representations
(with Marios Athineos)
• Common sound models use 10ms frames
- but: sub-10ms envelope is perceptible
• Use a parametric (LPC) model on spectrum
• Convert to features for ASR
- improvements esp. for stops
Time
Fre
quen
cy
Fireplce - orig
0 2 40
2k
4k
0
2k
4k
Time
Fre
quen
cy
TDLP resynth
0 2 40
2k
4k
0
2k
4k
Time
Fre
quen
cy
CTFLP resynth
0 2 40
2k
4k
0
2k
4k
Time
Fre
quen
cy
CTFLP resynth 4x TSM
0 2 40
2k
4k
0
2k
4k
200 400 600 800 1000 1200 1400 1600 1800 2000
1
1.5
2
2.5
3
3.5
4
4.5
5
2
0
2
4
waveform0-0.5 kHz0.5-1 kHz1-2 kHz2-4 kHz
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 8
Finding the Information in Speech
(with Patricia Scanlon)
• Mutual Information in time-frequency:
• Use to select classifier input features
5
10
15
-200 -100 0 100 200
5
10
15
-200 -100 0 100 200time / ms
MI / bits
freq
/ B
ark
0.005
0.01
0.015
0.02
0.025
0.02
0.04
0.06
0.08
0
0.01
0.02
0.03
0.04
0.020.040.060.080.10.120.14
All phones Stops
Vowels Speaker (vowels)
5
10
15
5
10
15
5
10
15
-50 0 50
5
10
15
-50 0 50 time / ms
frfe
q / B
ark
133ftrs
IRREG IRREG+CHKBD
95ftrs
57ftrs
19ftrs
85ftrs
47ftrs
28ftrs
9ftrs
time / ms
0 20 40 60 80 100 120 14045
50
55
60
65
70
Number of features
Fra
me
accu
racy
/ %
IRREGRECTRECT, all frames
IRREG+CHKBDRECT+CHKBD
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 9
Alarm sound detection
(Ellis 2001)
• Alarm sounds have particular structure
- people ‘know them when they hear them’- clear even at low SNRs
• Why investigate alarm sounds?
- they’re supposed to be easy- potential applications...
• Contrast two systems:
- standard, global features,
P
(
X
|
M
)
- sinusoidal model, fragments,
P
(
M
,
S
|
Y
)
time / s
hrn01 bfr02 buz01
level / dB
freq
/ kH
z
0 5 10 15 20 250
1
2
3
4
-40
-20
0
20s0n6a8+20
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 10
Alarms: Results
• Both systems commit many insertions at 0dB SNR, but in different circumstances:
20 25 30 35 40 45 50
0
6 7 8 9
time/sec0
freq
/ kH
z
1
2
3
4
0
freq
/ kH
z
1
2
3
4Restaurant+ alarms (snr 0 ns 6 al 8)
MLP classifier output
Sound object classifier output
NoiseNeural net system Sinusoid model system
Del Ins Tot Del Ins Tot
1 (amb) 7 / 25 2 36% 14 / 25 1 60%
2 (bab) 5 / 25 63 272% 15 / 25 2 68%
3 (spe) 2 / 25 68 280% 12 / 25 9 84%
4 (mus) 8 / 25 37 180% 9 / 25 135 576%
Overall 22 / 100 170 192% 50 / 100 147 197%
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 11
Outline
Sound Content Analysis
Recognizing sounds
Organizing mixtures
- Auditory Scene Analysis- Missing data recognition- Parallel model inference
Accessing large datasets
1
2
3
4
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 12
Auditory Scene Analysis
(Bregman 1990)
• How do people analyze sound mixtures?
- break mixture into small
elements
(in time-freq)- elements are
grouped
in to sources using
cues
- sources have aggregate
attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
3
Frequencyanalysis
Groupingmechanism
Onsetmap
Harmonicitymap
Positionmap
Sourceproperties
(after Darwin, 1996)
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 13
Cues to simultaneous grouping
• Elements + attributes
• Common onset
- simultaneous energy has common source
• Periodicity
- energy in different bands with same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...
• But: Context ...
time / s
freq
/ H
z
0 1 2 3 4 5 6 7 8 90
2000
4000
6000
8000
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 14
Computational Auditory Scene Analysis:The Representational Approach
(Cooke & Brown 1993)
• Direct implementation of psych. theory
- ‘bottom-up’ processing- uses common onset & periodicity cues
• Able to extract voiced speech:
inputmixture
signalfeatures
(maps)
discreteobjects
Front end Objectformation
Groupingrules
Sourcegroups
onset
period
frq.mod
time
freq
0.2 0.4 0.6 0.8 1.0 time/s
100
150200
300400
600
1000
15002000
3000
frq/Hzbrn1h.aif
0.2 0.4 0.6 0.8 1.0 time/s
100
150200
300400
600
1000
15002000
3000
frq/Hzbrn1h.fi.aif
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 15
Adding top-down constraints
Perception is not directbut a search for plausible hypotheses
• Data-driven (bottom-up)...
- objects irresistibly appear
vs. Prediction-driven (top-down)
- match observations with parameters of a world-model
- need world-model constraints...
inputmixture
signalfeatures
discreteobjects
Front end Objectformation
Groupingrules
Sourcegroups
inputmixture
signalfeatures
predictionerrors
hypotheses
predictedfeaturesFront end Compare
& reconcile
Hypothesismanagement
Predict& combinePeriodic
components
Noisecomponents
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 16
Prediction-Driven CASA
−70
−60
−50
−40
dB
200400
100020004000
f/Hz Noise1
200400
100020004000
f/Hz Noise2,Click1
200400
100020004000
f/Hz City
0 1 2 3 4 5 6 7 8 9
50100200400
1000
Horn1 (10/10)
Crash (10/10)
Horn2 (5/10)
Truck (7/10)
Horn3 (5/10)
Squeal (6/10)
Horn4 (8/10)Horn5 (10/10)
0 1 2 3 4 5 6 7 8 9time/s
200400
100020004000
f/Hz Wefts1−4
50100200400
1000
Weft5 Wefts6,7 Weft8 Wefts9−12
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 17
Segregation vs. Inference
• Source separation requires attribute separation- sources are characterized by attributes
(pitch, loudness, timbre + finer details)- need to identify & gather different attributes for
different sources ...
• Need representation that segregates attributes- spectral decomposition- periodicity decomposition
• Sometimes values can’t be separated- e.g. unvoiced speech- maybe infer factors from probabilistic model?
- or: just skip those values, infer from higher-level context
- do both: missing-data recognition
p O x y, ,( ) p x y, O( )Æ
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 18
Missing Data Recognition
• Speech models p(x|m) are multidimensional...- i.e. means, variances for every freq. channel- need values for all dimensions to get p(•)
• But: can evaluate over a subset of dimensions xk
• Hence, missing data recognition:
- hard part is finding the mask (segregation)
xk
xu
y
p(xk,xu)
p(xk|xu<y ) p(xk )
p xk m( ) p xk xu, m( ) xudÚ=
P(x1 | q)
P(x | q) =
· P(x2 | q) · P(x3 | q) · P(x4 | q) · P(x5 | q) · P(x6 | q)
Present data mask
time �
dim
ensi
on �
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 19
Comparing different segregations
• Standard classification chooses between models M to match source features X
• Mixtures Æ observed features Y, segregation S, all related by
- spectral features allow clean relationship
• Joint classification of model and segregation:
- probabilistic relation of models & segregation
M* P M X( )M
argmax P X M( ) P M( )P X( )--------------◊
Margmax = =
P X Y S,( )
freq
ObservationY(f )
Segregation S
SourceX(f )
P M S Y,( ) P M( ) P X M( )P X Y S,( )
P X( )-------------------------◊ XdÚ P S Y( )◊=
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 20
Multi-source decoding
• Search for more than one source
• Mutually-dependent data masks
• Use e.g. CASA features to propose masks- locally coherent regions
• Lots of issues in models, representations, matching, inference...
Y(t)
S1(t)q1(t)
S2(t)q2(t)
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 21
What a speech HMM contains
• Markov model structure: states + transitions
• A generative model- but not a good speech generator!
- only meant for inference of p(X|M)
K
A
S
0.80.18
0.9
0.1
0.02
0.9
0.1T E
0.8
0.2
M'1
O
K
A
S
0.80.05
0.9
0.1
0.15
0.9
0.1T E
0.8
0.2
O
Model M'2
0
1
2
3
4
freq
/ kH
z
10 20 30 40 50
10
20
30
40
50
State models (means) State Transition Probabilities
0 1 2 3 4 50
1
2
3
4
freq
/ kH
z
time / sec
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 22
“One microphone source separation”(Roweis 2000, Manuel Reyes)
• State sequences Æ t-f estimates Æ mask
- 1000 states/model (Æ 106 transition probs.)- simplify by modeling subbands (coupled HMM)?
freq
/ H
z
0
1000
2000
3000
4000
time / sectime / sec
0
1000
2000
3000
4000
0
1000
2000
3000
4000
0 1 2 3 4 5 60
1000
2000
3000
4000
0 1 2 3 4 5 6
Speaker 1O
rig
inal
voic
esS
tate
mea
ns
Res
ynth
esis
mas
ksM
ixtu
re
Speaker 2
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 23
Subband models(Reyes, Jojic)
• Reduce the number of states required- 4000 states ¥1 band Æ 30 states ¥19 bands
• Train coupled HMMs via variational approx
1,1S 2,1S 3,1S 4,1S 5,1S 6,1S
1,2S 2,2S 3,2S 4,2S 5,2S 6,2S
1,3S 2,3S 3,3S 4,3S 5,3S 6,3S
1,4S 2,4S 3,4S 4,4S 5,4S 6,4S
1,5S 2,5S 3,5S 4,5S 5,5S 6,5S
1,1S 2,1S 3,1S 4,1S 5,1S 6,1S
1,2S 2,2S 3,2S 4,2S 5,2S 6,2S
1,3S 2,3S 3,3S 4,3S 5,3S 6,3S
1,4S 2,4S 3,4S 4,4S 5,4S 6,4S
1,5S 2,5S 3,5S 4,5S 5,5S 6,5S
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 24
Tracking source normalization
• Standard HMM sound models try to normalize away energy variation- but: each source in mixture has different ‘gain’
• Instead, factor out scalar gain for each source
- solve with var. approx. to P(S,a)
Model 1
Data
Source 1
Normalization÷k1
Normalization÷k2
Model 2
Data
Source 2Composed
Data
Normalization ?
Model 1 Model 2
Normalizationmismatch
x1 x2 x3 x4 x5 x6
S1 S2 S3 S4S5 S6
a1 a2 a3 a4a5 a6
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 25
Outline
Sound Content Analysis
Recognizing sounds
Organizing mixtures
Accessing large datasets- Meeting Recordings- The Listening Machine - Music Information Retrieval
1
2
3
4
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 26
Accessing large datasets:The Meeting Recorder Project
(with ICSI, UW, IDIAP, SRI, Sheffield)
• Microphones in conventional meetings- for summarization / retrieval / behavior analysis- informal, overlapped speech
• Data collection (ICSI, UW, IDIAP, NIST):
- ~100 hours collected & transcribed
• NSF ‘Mapping Meetings’ project
4
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 27
Speaker Turn detection(Huan Wei Hee, Jerry Liu)
• Acoustic: Triangulate tabletop mic timing differences- use normalized peak value for confidence
• Behavioral: Look for patterns of speaker turns
-3 -2 -1 0 1 2 3-3
-2
-1
0
1
lag 1-2 / ms
lag
3-4
/ ms
mr-2000-11-02-1440: PZM xcorr lags
4
12
3
0 50 100 150 200 250 3000
50
100
150
200
250
300
time / s
100x
R s
kew
/sam
ps
Example cross coupling response, chan3 to chan0
3:
8:
10:9:
7:5:
2:1:
0 5 10 15 20 25 30 35 40 45 50 55 60
mr04: Hand-marked speaker turns vs. time + auto/manual boundaries
Pa
rtic
ipa
nt
time/min
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 28
The Listening Machine
• Smart PDA records everything
• Only useful if we have index, summaries- monitor for particular sounds- real-time description
• Scenarios
- personal listener Æ summary of your day- future prosthetic hearing device- autonomous robots
• Meeting data, ambulatory audio
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 29
Personal Audio
• LifeLog / MyLifeBits / Remembrance Agent:Easy to record everything you hear
• Then what?- prohibitively time consuming to
search- but .. applications if access easier
• Automatic content analysis / indexing...
50 100 150 200 2500
2
4
14:30 15:00 15:30 16:00 16:30 17:00 17:30 18:00 18:30
5
10
15
freq
/ H
zfr
eq /
Bar
k
time / min
clock time
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 30
Music Information Retrieval
• Transfer search concepts to music?- “musical Google”- finding something specific / vague / browsing- is anything more useful than human annotation?
• Most interesting area: finding new music- is there anything on mp3.com that I would like?- audio is only information source for new bands
• Basic idea:Project music into a space where neighbors are “similar”
• Also need models of personal preference- where in the space is the stuff I like- relative sensitivity to different dimensions
• Evaluation problems- requires large, shareable music corpus!
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 31
Music similarity from Anchor space
• A classifier trained for one artist (or genre) will respond partially to a similar artist
• Each artist evokes a particular pattern of responses over a set of classifiers
• We can treat these classifier outputs as a new feature space in which to estimate similarity
• “Anchor space” reflects subjective qualities?
n-dimensionalvector in "Anchor
Space"Anchor
Anchor
Anchor
AudioInput
(Class j)
p(a1|x)
p(a2|x)
p(an|x)
GMMModeling
Conversion to Anchorspace
n-dimensionalvector in "Anchor
Space"Anchor
Anchor
Anchor
AudioInput
(Class i)
p(a1|x)
p(a2|x)
p(an|x)
GMMModeling
Conversion to Anchorspace
SimilarityComputation
KL-d, EMD, etc.
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 32
Playola interface ( www.playola.org )
• Browser finds closest matches to single tracks or entire artists in anchor space
• Direct manipulation of anchor space axes
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 33
Evaluation
• Are recommendations good or bad?
• Subjective evaluation is the ground truth- .. but subjects aren’t familiar with the bands
being recommended- can take a long time to decide if a
recommendation is good
• Measure match to other similarity judgments- e.g. musicseer data:
Top rank agreement
0
10
20
30
40
50
60
70
80
cei cmb erd e3d opn kn2 rnd ANK
%
SrvKnw 4789x3.58
SrvAll 6178x8.93
GamKnw 7410x3.96GamAll 7421x8.92
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 34
Summary
• Sound- .. contains much, valuable information at many
levels- intelligent systems need to use this information
• Mixtures- .. are an unavoidable complication when using
sound- looking in the right time-frequency place to find
points of dominance
• Learning- need to acquire constraints from the
environment- recognition/classification as the real task
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 35
LabROSA Summary
DO
MA
INS
AP
PLI
CA
TIO
NS
ROSA
• Broadcast• Movies• Lectures
• Meetings• Personal recordings• Location monitoring
• Speech recognition• Speech characterization• Nonspeech recognition
• Object-based structure discovery & learning
• Scene analysis• Audio-visual integration• Music analysis
• Structuring• Search• Summarization• Awareness• Understanding
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 36
Extra Slides
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 37
Music Applications
• Music as a complex, information-rich sound
• Applications of separation & recognition:- note/chord detection & classification
- singing detection (Æ genre identification ...)
23 240
1
2
72 730
1
231 32 40 41 48 49
80 81 85 86 88 89
freq
/ kH
z
DYWMB: Alignments to MIDI note 57 mapped to Orig Audio
0 50 100 150 200
Boards of CanadaSugarplastic
Belle & SebastianMercury Rev
CorneliusRichard Davies
Dj ShadowMouse on Mars
The Flaming LipsAimee Mann
WilcoXTCBeck
Built to SpillJason Falkner
OvalArto Lindsay
Eric MatthewsThe MolesThe Roots
Michael Penntrue voice
time / sec
Track 117 - Aimee Mann (dynvox=Aimee, unseg=Aimee)
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 38
Artist Similarity
• Recognizing work from each artist is all very well...
• But: what is similarity between artists?
- pattern recognition systems give a number...
• Need subjective ground truth:Collected via web site
www.musicseer.com
• Results:- 1,000 users, 22,300 judgments collected over 6 months
backstreet_boys
whitney_
new_ron_carter
a
all_saints
annie_lennox
aqua
belinda_carlislers
celine_dionchristina_aguilera
eiffel_65
erasure
miroquai
janet_jacksonjessica_simpson lara_fabian
lauryn_hill
madonna
mariah_carey
nelly_furtado
pet_shop_boys
pr
roxette
sade
wain
sof
spice_girls
toni_braxtone
Dan Ellis Sound, Mixtures & Learning 2003-09-26 - 39
Independent Component Analysis (ICA)(Bell & Sejnowski 1995 et seq.)
• Drive a parameterized separation algorithm to maximize independence of outputs
• Advantages:- mathematically rigorous, minimal assumptions- does not rely on prior information from models
• Disadvantages:- may converge to local optima...- separation, not recognition- does not exploit prior information from models
m1m2
s1s2
a11a21
a12a22
x
�� MutInfo�a