Cross-Modal (Visual-Auditory) Denoising
Dana Segev
Yoav Y. Schechner
Michael Elad
Technion – Israel Institute of Technology
Digits sequence
Noisy digits sequence
Denoised by the state-of-the-art algorithm of Cohen & Berdugo
Segev, Schechner, Elad, Cross-Modal Denoising
Use one modality to denoise another?
• Use video to denoise a soundtrack?
Noise
• Very intense
• Non-stationary
• Unknown
• Unseen source
Single microphone
Input: video and very noisy audio
Algorithm: cross-modal, example-based
Output: denoised audio, for human and machine hearing
Training Example set
Input test set
~syllable (0.25 sec)
[Figure: Xylophone, Sound]
[Figure: Examples]
Cross-modal representation.
• Generating multimodal features.
• Cross-modal pattern recognition.
• Rendering a denoised signal.
• Learning feature statistics.
Input video → Video feature-space
Input audio → Audio feature-space
Input audio-video → Audio-video feature-space
Training audio-video → Audio-video examples in feature-space
Feature-space
Nearest Neighbor
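A minimal sketch of the nearest-neighbor step, assuming each segment is already represented as a feature vector and that distance is measured with the l2 norm listed in the experiment settings. The function name and the toy vectors are illustrative and do not come from the talk.

```python
import math

def nearest_neighbor(query, examples):
    """Return the index of the example closest to `query` in l2 distance."""
    best_index, best_dist = -1, math.inf
    for k, example in enumerate(examples):
        dist = math.dist(query, example)  # Euclidean (l2) distance
        if dist < best_dist:
            best_index, best_dist = k, dist
    return best_index

# Toy audio-video feature vectors for three training examples (hypothetical values).
examples = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0], [4.0, 4.0, 4.0]]
noisy_input = [0.9, 1.2, 0.1]
match = nearest_neighbor(noisy_input, examples)
```

Here `match` is the index of the training example whose clean audio would stand in for the noisy input segment.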
Noisy audio
Clean segment, clean segment, clean segment
Denoised
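The rendering idea can be sketched as follows, assuming equal-length, non-overlapping segments that are simply concatenated. Whether the original method overlaps or blends segments is not recoverable from the slides, so treat that choice, and all names here, as assumptions.

```python
def render_denoised(selected, clean_segments):
    """Concatenate the clean audio of the example chosen for each input segment."""
    output = []
    for k in selected:
        output.extend(clean_segments[k])
    return output

# Hypothetical clean audio samples of three training examples.
clean_segments = [[0.1, 0.2], [0.5, 0.4], [-0.3, -0.1]]
# Example index chosen for each of four noisy input segments.
selected = [2, 0, 0, 1]
denoised = render_denoised(selected, clean_segments)
```

The noisy samples never reach the output; only clean example audio does.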
[Figure: Input segments and Examples]
Bartender experiment
Cross-modal representation.
• Generating multimodal features.
• Cross-modal pattern recognition (NN).
• Rendering a denoised signal.
• Learning feature statistics.
Feature-space
Clusters: bi, fif, ty, two, ar
bi - fif - ty - two
For the k-th example segment:
[Figure: matrix of transitions from the current cluster to the next cluster, over the clusters bi, ty, fif, two, ar]
Syllable consecutive probability
The probability for transition between clusters
= Number of examples in training set
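One way to learn such transition statistics is to count how often one cluster follows another in the training sequence. The sketch below normalizes each row into a conditional probability, which is an assumption: the exact normalization on the slide did not survive extraction. The label sequence is a toy example.

```python
from collections import Counter, defaultdict

def transition_probabilities(labels):
    """Estimate P(next cluster | current cluster) from consecutive labels."""
    counts = defaultdict(Counter)
    for current, nxt in zip(labels, labels[1:]):
        counts[current][nxt] += 1
    probs = {}
    for current, row in counts.items():
        total = sum(row.values())
        probs[current] = {nxt: n / total for nxt, n in row.items()}
    return probs

# Cluster label of each consecutive training segment (toy sequence).
labels = ["bi", "fif", "ty", "two", "bi", "fif", "ty", "ar", "bi", "ty"]
P = transition_probabilities(labels)
```

In this toy sequence "fif" is always followed by "ty", so that transition gets probability 1.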
Hidden Markov Model
[Figure: syllables bi and fif linked by P and a time delay, with audio noise added]
A Cost function
• A Data term
• A Regularization term
Optimal vector of indices
[Figure: graph over Input segments and Examples]
• nodes
• edges
Complexity: Dynamic Programming
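A Viterbi-style sketch of how dynamic programming can find the vector of example indices that minimizes a data term plus a regularization term. The exact terms from the slides are not recoverable, so this assumes a generic form: `data_cost[t][k]` is the mismatch between input segment t and example k, and `transition_cost[j][k]` penalizes following example j with example k (for instance a negative log transition probability). All names and numbers are hypothetical.

```python
def best_index_vector(data_cost, transition_cost):
    """Minimize sum_t data_cost[t][m_t] + sum_t transition_cost[m_(t-1)][m_t]."""
    T, K = len(data_cost), len(data_cost[0])
    cost = [list(data_cost[0])]  # cost[t][k]: best cost of a path ending at example k
    back = []                    # back[t-1][k]: best predecessor of k at step t
    for t in range(1, T):
        row, ptr = [], []
        for k in range(K):
            prev = min(range(K), key=lambda j: cost[t - 1][j] + transition_cost[j][k])
            row.append(cost[t - 1][prev] + transition_cost[prev][k] + data_cost[t][k])
            ptr.append(prev)
        cost.append(row)
        back.append(ptr)
    # Trace back from the cheapest final node.
    k = min(range(K), key=lambda j: cost[-1][j])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return path[::-1], min(cost[-1])

# Toy problem: 3 input segments, 2 candidate examples.
data_cost = [[0.0, 1.0], [0.6, 0.5], [1.0, 0.0]]
transition_cost = [[0.0, 2.0], [2.0, 0.0]]  # switching examples is penalized
path, total = best_index_vector(data_cost, transition_cost)
```

Picking the cheapest example per segment independently would give indices 0, 1, 1; the regularization term makes a single consistent example cheaper overall. The search costs on the order of T·K² operations for T segments and K examples, instead of enumerating all K^T index vectors.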
Cross-modal representation.
• Generating multimodal features.
• Cross-modal pattern recognition.
• Rendering a denoised signal.
• Learning feature statistics.
Requirements
Audio Features
• Sensitivity to sound perception.
• Dimension reduction.
Visual Features
• Focusing on the motion of interest.
• Dimension reduction.
Speech Features
• Audio: MFCCs
• Visual: DCT coefficients
Music Features
• Audio: Spectrogram of each segment
• Visual: The spatial trajectory of a hitting rod
MFCCs – Mel-frequency Cepstral Coefficients
Audio signal → Signal spectrum → Mel-frequency filter bank → log(.) → DCT → MFCCs
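The MFCC chain (spectrum, mel-frequency filter bank, log, DCT) can be sketched for a single frame with the standard library only. The filter count, coefficient count, frame length and the common 2595·log10(1 + f/700) mel formula are assumptions made for illustration, and windowing is omitted for brevity; none of these settings are taken from the talk.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Power of the one-sided DFT of a frame (naive O(N^2) DFT for clarity)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * b * i / n)
                    for i, x in enumerate(frame))) ** 2
            for b in range(n // 2 + 1)]

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters with centers equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = sample_rate / n_fft
    bank = []
    for f in range(num_filters):
        lo, mid, hi = edges[f], edges[f + 1], edges[f + 2]
        row = []
        for b in range(n_fft // 2 + 1):
            hz = b * bin_hz
            if lo < hz <= mid:
                row.append((hz - lo) / (mid - lo))
            elif mid < hz < hi:
                row.append((hi - hz) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, num_filters=12, num_coeffs=6):
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Small constant avoids log(0) for an empty band.
    log_energy = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10) for row in bank]
    # DCT-II of the log filter-bank energies.
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / num_filters)
                for m, e in enumerate(log_energy))
            for c in range(num_coeffs)]

# One 32 ms frame of a 440 Hz tone at the 8000 Hz audio rate used for speech.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(256)]
coeffs = mfcc(frame, sample_rate)
```

Keeping only the first few DCT coefficients is what provides the dimension reduction listed among the audio feature requirements.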
Spectrogram of each segment
Xylophone signal → Spectrogram → Spectrogram accumulation
Visual features: speech
• The given movie
• Locking on the object of interest
• Extracting global motion by tracking
• Extracting features: DCT coefficients which highly represent motion between frames
Visual features: xylophone
• The given movie
• Locking on the object of interest
• Extracting global motion by tracking
• Extracting features: hitting rod spatial coordinates (X, Y, Z)
Speech
• A corpus of a limited number of words and syllables: digits and bar beverages.
• Video rate 25 fps, audio rate 8000 Hz.
• K-means clustering, 350 clusters.
• Distance measurement: l2 norm.
Xylophone
• A corpus of limited sounds.
• Video rate 25 fps, audio rate 16000 Hz.
• Distance measurement: l2 norm.
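For reference, the k-means clustering used for the speech corpus can be sketched with plain Lloyd iterations and l2 distance. The deterministic initialization (first k points), the iteration count and the toy 2-D data are assumptions for illustration; the slides only state that 350 clusters were used.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm with l2 distance; initial centers are the first k points."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Toy 2-D feature vectors forming two groups.
points = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2], [4.9, 5.0]]
centers, labels = kmeans(points, k=2)
```

Each resulting label plays the role of a syllable-like cluster in the transition statistics.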
Xylophone
• Training duration: 103 sec
• Testing duration: 100 sec
• Music from song by GNR: SNR = 0.9
• Xylophone melody: SNR = 1
Speech: Digits
• Training duration: 60 sec
• Testing duration: 240 sec
• Noisy, Denoised
• SNR = 0.07
Speech: Bartender
• Training duration: 48 sec
• Testing duration: 350 sec
• Music from song by Phil Collins; Male speech; White Gaussian
• SNR = 0.59; SNR = 0.3; SNR = 0.38
Input: video and very noisy audio
Algorithm
• Example-based
• Hidden Markov Model
Output: denoised audio, for human and machine hearing