Cross-Modal (Visual-Auditory) Denoising

Dana Segev

Yoav Y. Schechner

Michael Elad

Technion – Israel Institute of Technology

Audio demos: digits sequence; noisy digits sequence; noisy sequence denoised by the state-of-the-art algorithm of Cohen & Berdugo

Segev, Schechner, Elad, Cross-Modal Denoising

Use one modality to denoise another?

• Use video to denoise a soundtrack?

Noise: very intense, non-stationary, unknown, from an unseen source.

Single microphone

Input: video and very noisy audio

Algorithm: cross-modal, example-based

Output: denoised audio, for human and machine hearing


Training example set

Input test set


Segments: ~syllable (0.25 sec)

Xylophone sound

Examples


Cross-modal representation.


• Generating multimodal features.

• Cross-modal pattern recognition.

• Rendering a denoised signal.

• Learning feature statistics.


Input video → video feature-space

Input audio → audio feature-space

Input audio-video → audio-video feature-space

Training audio-video → audio-video examples in feature-space


Nearest neighbor in feature-space
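The nearest-neighbor search in feature-space can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes every segment has already been turned into a fixed-length audio-video feature vector and that distances use the l2 norm. The names `l2_distance`, `nearest_example`, `examples`, and `query` are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean (l2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_example(query, examples):
    """Return the index of the training example closest to `query`.

    `examples` is a list of audio-video feature vectors (assumed layout:
    audio features concatenated with video features).
    """
    return min(range(len(examples)), key=lambda k: l2_distance(query, examples[k]))


# Toy data: three training examples in a 4-D joint feature space.
examples = [
    [0.0, 0.0, 1.0, 1.0],
    [5.0, 5.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 2.0],
]
query = [0.9, 0.1, 1.0, 1.8]  # features of one noisy input segment
best = nearest_example(query, examples)  # index of the closest example
```

A brute-force scan like this costs one distance per example for every input segment, which is acceptable for a small corpus of the kind the slides describe.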



Noisy audio → clean segments → denoised
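Rendering the output from the chosen clean segments might look like the sketch below. The slides do not say how consecutive segments are stitched together, so the overlap-add with simple averaging used here is an assumption, and `render_denoised` and `hop` are hypothetical names.

```python
def render_denoised(clean_segments, hop):
    """Place equal-length clean segments `hop` samples apart and average overlaps.

    Assumption: the segment chosen for input position m starts at sample m * hop.
    """
    seg_len = len(clean_segments[0])
    total = hop * (len(clean_segments) - 1) + seg_len
    out = [0.0] * total
    weight = [0] * total
    for m, seg in enumerate(clean_segments):
        start = m * hop
        for n, sample in enumerate(seg):
            out[start + n] += sample
            weight[start + n] += 1
    # Samples not covered by any segment (hop > segment length) stay at zero.
    return [s / w if w else 0.0 for s, w in zip(out, weight)]


# Two 4-sample clean segments with 50% overlap.
denoised = render_denoised([[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]], hop=2)
```

Note that the noisy audio itself never enters the output in this sketch: it is only used to pick which clean segments to play back.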


Examples

Input


Bartender experiment



• Cross-modal pattern recognition (NN).

• Rendering a denoised signal.

• Learning feature statistics.


Syllable clusters in feature-space: bi, fif, ty, two, ar

Syllable sequence: bi - fif - ty - two

For the k-th example segment:


Figure: matrix of counts indexed by current cluster and next cluster (bi, ty, fif, two, ar); the cell values are not recoverable in order.

Syllable consecutive probability

The probability for transition

between clusters

= Number of examples in training set
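Learning the transition statistics can be illustrated by counting consecutive cluster pairs in the training sequence. This is a sketch under assumptions: the slide's exact normalization did not survive, so each row here is normalized to sum to one, which may differ from the authors' formula. The function name and the toy label sequence are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(labels):
    """Estimate P(next cluster | current cluster) from a sequence of cluster labels."""
    counts = defaultdict(Counter)
    for current, nxt in zip(labels, labels[1:]):
        counts[current][nxt] += 1  # one more observed current -> next pair
    probs = {}
    for current, row in counts.items():
        total = sum(row.values())
        # Assumed normalization: divide by the number of transitions leaving `current`.
        probs[current] = {nxt: c / total for nxt, c in row.items()}
    return probs


# Toy training sequence of syllable clusters (made up for illustration).
labels = ["bi", "fif", "ty", "two", "bi", "fif", "ty", "two", "bi", "ar"]
P = transition_probabilities(labels)
```

In this toy sequence "bi" is followed by "fif" twice and by "ar" once, so the estimate favors "fif" after "bi".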


Hidden Markov Model

Figure: a chain of syllable states (bi, fif, ...) linked through a time delay by the transition probability P; audio noise is added to the syllables to give the observed audio.



Input video


A cost function

• A regularization term

• A data term

• A data term

Optimal vector of indices


Graph over examples and input segments:

• nodes

• edges

Complexity: Dynamic Programming
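A dynamic-programming search for the optimal vector of indices can be sketched with the standard Viterbi recursion. The actual cost terms did not survive in the slides, so `data_cost` and `transition_cost` below are generic placeholder tables (hypothetical names and values), not the authors' definitions.

```python
def optimal_indices(data_cost, transition_cost):
    """Minimize sum of data costs plus pairwise transition costs (Viterbi).

    data_cost[m][k]: cost of assigning example k to input segment m.
    transition_cost[j][k]: cost of example k directly following example j.
    """
    n_segments = len(data_cost)
    n_examples = len(data_cost[0])
    best = list(data_cost[0])  # best[k]: cheapest path ending in example k
    back = []                  # back-pointers, one list per segment after the first
    for m in range(1, n_segments):
        new_best, pointers = [], []
        for k in range(n_examples):
            j = min(range(n_examples), key=lambda i: best[i] + transition_cost[i][k])
            new_best.append(best[j] + transition_cost[j][k] + data_cost[m][k])
            pointers.append(j)
        best = new_best
        back.append(pointers)
    # Trace the cheapest path backwards from the last segment.
    k = min(range(n_examples), key=lambda i: best[i])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path


data_cost = [[0.0, 2.0], [1.5, 1.0], [0.0, 3.0]]   # 3 input segments, 2 examples
transition_cost = [[0.0, 2.0], [2.0, 0.0]]         # switching examples is penalized
indices = optimal_indices(data_cost, transition_cost)
```

For M input segments and K candidate examples the recursion visits M·K nodes and M·K² edges, instead of enumerating all Kᴹ index vectors. In the toy data, picking the cheapest example per segment independently would give [0, 1, 0], but the transition penalty makes [0, 0, 0] the minimizer.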




Cross-modal pattern recognition.

Rendering a denoised signal.

Learning feature statistics.


Audio Features

• Requirements: sensitivity to sound perception; dimension reduction

• Speech features: MFCCs

• Music features: spectrogram of each segment

Visual Features

• Requirements: focusing on the motion of interest; dimension reduction

• Speech features: DCT coefficients

• Music features: the spatial trajectory of a hitting rod


MFCCs – Mel-frequency Cepstral Coefficients

Audio signal → signal spectrum → Mel-frequency filter bank → log(.) → DCT → MFCCs
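The MFCC chain (spectrum, Mel-frequency filter bank, log, DCT) can be sketched in plain Python for a single frame. The frame length, the 20 filters, and the 12 coefficients are assumptions, since the slides give no parameter values; windowing and pre-emphasis, common in practice, are left out, and all function names are hypothetical.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spec.append(abs(s) ** 2 / n)
    return spec


def mel_filter_bank(n_filters, n_bins, sample_rate):
    """Triangular filters whose centers are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [b * (sample_rate / 2.0) / (n_bins - 1) for b in range(n_bins)]
    bank = []
    for i in range(1, n_filters + 1):
        lo, mid, hi = edges[i - 1], edges[i], edges[i + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def dct2(values, n_out):
    """Unnormalized DCT-II, keeping the first `n_out` coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, v in enumerate(values)) for k in range(n_out)]


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=12):
    spec = power_spectrum(frame)
    bank = mel_filter_bank(n_filters, len(spec), sample_rate)
    energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
    log_energies = [math.log(e + 1e-10) for e in energies]  # small floor avoids log(0)
    return dct2(log_energies, n_coeffs)


# One 256-sample frame of a 440 Hz tone at the 8000 Hz speech rate from the slides.
frame = [math.sin(2 * math.pi * 440.0 * t / 8000.0) for t in range(256)]
coeffs = mfcc(frame, 8000)
```

Keeping only the first few DCT coefficients is what provides the dimension reduction listed among the audio feature requirements.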

Spectrogram of each segment

Xylophone signal → spectrogram → spectrogram accumulation


Speech: visual features

• The given movie

• Locking on the object of interest

• Extracting global motion by tracking

• Extracting features: DCT coefficients which highly represent motion between frames


Xylophone: visual features

• The given movie

• Locking on the object of interest

• Extracting features: hitting rod spatial coordinates (X, Y, Z)


Speech

• A corpus of a limited number of words and syllables: digits and bar beverages.

• Video rate 25 fps, audio rate 8000 Hz.

• K-means clustering, 350 clusters.

• Distance measurement: l2 norm.

Xylophone

• A corpus of a limited number of sounds.

• Video rate 25 fps, audio rate 16000 Hz.

• Distance measurement: l2 norm.
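The K-means clustering with an l2 distance mentioned for the speech setup can be illustrated with a small Lloyd-style sketch. It is not the authors' code: the slides use 350 clusters, while the toy data below uses 2, and the random initialization, iteration count, and names are assumptions.

```python
import math
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm with l2 distance; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # assumed: random initial centers

    def assign():
        return [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]

    for _ in range(n_iter):
        labels = assign()
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign()


# Two well-separated toy groups standing in for segment feature vectors.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, labels = kmeans(points, k=2)
```

Each resulting label plays the role of a cluster index, such as the syllable clusters used for the transition statistics.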

Xylophone

• Training duration: 103 sec

• Testing duration: 100 sec

Music from song by GNR: SNR = 0.9

Xylophone melody: SNR = 1


Speech: Digits

• Training duration: 60 sec

• Testing duration: 240 sec

Noisy / Denoised

SNR = 0.07
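To put an SNR value such as 0.07 in perspective, the sketch below converts a ratio to decibels. The slides do not define SNR, so treating it as a linear ratio of signal power to noise power is an assumption, and the function names are hypothetical.

```python
import math


def snr_linear(signal, noise):
    """Ratio of mean signal power to mean noise power (assumed SNR definition)."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return p_signal / p_noise


def snr_db(ratio):
    """Convert a linear power ratio to decibels."""
    return 10.0 * math.log10(ratio)


digits_snr_db = snr_db(0.07)  # about -11.5 dB under the power-ratio assumption
toy_ratio = snr_linear([1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 2.0, -2.0])
```

Under this reading, a ratio below 1 means the noise carries more power than the signal.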


Speech: Bartender

• Training duration: 48 sec

• Testing duration: 350 sec

Music from song by Phil Collins; male speech; white Gaussian

SNR = 0.59, SNR = 0.3, SNR = 0.38


Input: video and very noisy audio

Algorithm:

• Example-based

• Hidden Markov Model

Output: denoised audio, for human and machine hearing

