Extracting Noise-Robust Features from Audio Data
Chris Burges, John Platt,
Erin Renshaw, Soumya Jana*
Microsoft Research
*U. Illinois, Urbana/Champaign
Outline
• Goals
• De-equalization and Perceptual Thresholding
• Oriented PCA
• Distortion Discriminant Analysis
• Results and demo
Audio Feature Extraction Goals
• Map high dim space of audio samples to low dim feature space
• Must be robust to likely input distortions
• Must be informative (different audio segments map to distant points)
• Must be efficient (use small fraction of resources on typical PC)
Audio Fingerprinting as a Test Bed
What is audio fingerprinting?
• Train: Given an audio clip, extract one or more feature vectors
• Test: Compare feature vectors, computed from an audio stream, with a database of stored vectors, to identify the clip
• The clip itself is not changed, unlike watermarking
Why audio fingerprinting?
• Enable user to identify streaming audio (e.g. radio) in real time
• Track music play for royalty assignment
• Identify audio segments in e.g. television or radio: “Did my commercial air?”
• Search unlabeled data, e.g. on web: “Did someone violate my copyright?”
Goals, cont.
• Feature extraction: current audio features are often hand-crafted. Can we learn the features we need?
• Database must be able to handle 1000s of requests, and contain millions of clips
Feature Extraction
• Downsample to 11.025 kHz mono
• Collate samples
• Compute windowed FFT coefficients, take magnitude
• De-equalize and apply perceptual thresholding
• Compute robust projections
• Output 64 floats every half-frame
• (Repeat)
De-Equalization
Goal: remove slow variation in frequency space.
• MCLT → magnitude r and phase
• DCT(log r)
• Apply highpass mask (ramping from 0.0 to 1.0)
• exp(inverse DCT)
• Normalize power
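The de-equalization chain can be sketched roughly as below. This is not the original implementation: an ordinary DCT over the log magnitudes stands in for the MCLT-based processing, and the function name, `cutoff`, and the hard 0/1 mask (the talk's mask ramps from 0.0 to 1.0) are all illustrative.

```python
# Sketch of de-equalization for one frame of spectral magnitudes.
# Hypothetical helper; values of `cutoff` and the mask shape are assumptions.
import numpy as np
from scipy.fft import dct, idct

def deequalize(r, cutoff=6, eps=1e-10):
    """Flatten slow variation in the log spectrum of one frame.

    r      : 1-D array of spectral magnitudes (e.g. 2048 bins)
    cutoff : number of low-order DCT coefficients to suppress
    """
    log_r = np.log(r + eps)                 # log spectrum
    c = dct(log_r, norm='ortho')            # DCT of log magnitudes
    mask = np.ones_like(c)                  # highpass mask: kill slow variation
    mask[:cutoff] = 0.0
    flat = np.exp(idct(c * mask, norm='ortho'))  # back to flattened magnitudes
    return flat / np.linalg.norm(flat)           # normalize power
```

Suppressing the low-order DCT coefficients of the log spectrum removes slowly varying equalization while leaving fine spectral detail intact.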
De-Equalization
[Figure: spectrum before and after de-equalization]
De-equalize by flattening the log spectrum. But does this distort? We can check: the signal can be reconstructed by adding the phases back in.
Oriented PCA
• Want a projection such that the variance of the projected signal is maximized while the MSE of the projected noise is simultaneously minimized. [Diamantaras & Kung, '96]
• Maximize the generalized Rayleigh quotient
  q = (nᵀ C1 n) / (nᵀ C2 n)
where C1 is the signal covariance and C2 the noise correlation.
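Maximizing this quotient reduces to the generalized eigenproblem C1 n = λ C2 n. A minimal sketch, assuming pairs of clean and distorted feature vectors (function name, `reg`, and data shapes are illustrative):

```python
# OPCA sketch: directions maximizing signal variance over noise variance
# via the generalized eigenproblem C1 n = lambda C2 n.
import numpy as np
from scipy.linalg import eigh

def opca_directions(clean, distorted, k=2, reg=1e-6):
    """clean: (N, D) feature vectors; distorted: (M, N, D), M distortions
    of each clean vector. Returns top-k directions (columns) and their
    generalized eigenvalues (signal-to-noise ratios)."""
    X = clean - clean.mean(axis=0)
    C1 = X.T @ X / len(X)                        # signal covariance
    Z = (distorted - clean).reshape(-1, clean.shape[1])
    C2 = Z.T @ Z / len(Z)                        # noise correlation
    C2 += reg * np.eye(C2.shape[0])              # regularize for stability
    w, V = eigh(C1, C2)                          # ascending eigenvalues
    return V[:, ::-1][:, :k], w[::-1][:k]
```

`scipy.linalg.eigh(C1, C2)` solves the generalized symmetric problem directly when C2 is positive definite; the small ridge keeps it so.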
Distortion Discriminant Analysis
DDA is a convolutional neural network whose weights are trained using OPCA:
• Layer 1: 2048 spectral values per frame (372 ms window, step every 186 ms) → OPCA → 64 outputs
• Layer 2: stack of 32 layer-1 outputs (32 × 64 = 2048 floats, spanning 6.1 s, step every 186 ms) → OPCA → 64 outputs
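The forward pass of this layered structure can be sketched as two fixed linear projections. The projection matrices would come from OPCA training; random matrices stand in here, and the function name and variable names are illustrative.

```python
# Sketch of the two-layer DDA feed-forward pass, sizes as in the talk:
# 2048 spectral values per frame -> 64, then 32 stacked frames
# (32 * 64 = 2048 floats) -> 64 per output step.
import numpy as np

def dda_forward(frames, P1, P2, stack=32):
    """frames: (T, 2048) preprocessed spectral frames, one per 186 ms step.
    Returns one 64-dim trace per position of the second-layer window."""
    h = frames @ P1                             # layer 1: (T, 64)
    out = []
    for t in range(len(h) - stack + 1):
        window = h[t:t + stack].reshape(-1)     # 32 x 64 = 2048 floats
        out.append(window @ P2)                 # layer 2: 64 floats
    return np.array(out)

# Placeholder projections; OPCA would supply the real P1, P2.
rng = np.random.RandomState(0)
P1 = rng.randn(2048, 64) * 0.01
P2 = rng.randn(2048, 64) * 0.01
traces = dda_forward(rng.randn(40, 2048), P1, P2)
```

Because both layers are linear and strided, the whole network is a convolutional net with linear transfer functions, as the conclusions note.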
The Distortions
• Original
• 3:1 compression above 30 dB
• Light Grunge
• Distort Bass
• "Mackie Mid boost" filter
• Old Radio
• Phone
• RealAudio© Compander
DDA Training Data
• 50 audio segments of 20 s each, concatenated (16.7 min)
• Compute distortions
• Solve the generalized eigenvector problem (after preprocessing)
• Subtract mean projections of the training data, normalize noise to unit variance
• Repeat
• Use 10 20-second segments as a validation set to compute the scale factor for the final projections
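The mean-subtraction and noise-normalization step above can be sketched as follows. This is a guess at the bookkeeping, not the original code; the helper name and shapes are illustrative.

```python
# Sketch: after projecting with the OPCA directions, center on the training
# data's mean projection and scale so projected noise has unit variance.
import numpy as np

def normalize_projections(proj_clean, proj_noise):
    """proj_clean: (N, k) projected training data.
    proj_noise: (M, k) projected noise vectors (distorted minus clean).
    Returns the offset and per-dimension scale to apply as (z - mean) * scale."""
    mean = proj_clean.mean(axis=0)
    noise_var = ((proj_noise - proj_noise.mean(axis=0)) ** 2).mean(axis=0)
    scale = 1.0 / np.sqrt(noise_var)
    return mean, scale
```

Normalizing each output this way puts all dimensions on a common noise scale before the next layer's OPCA (and before thresholding distances at test time).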
OPCA vs. Bark Band Averaging
Compare with a 'standard' approach:
• 23 ms windows, same preprocessing
• Use 10 Bark bands from 510 Hz to 2.7 kHz
• Average over bands to get 10-dim features
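A sketch of this baseline, assuming the standard Zwicker Bark band edges between 510 Hz and 2.7 kHz and plain FFT magnitudes (the function name and sample-rate handling are illustrative):

```python
# Sketch of the Bark-band baseline: average spectral magnitudes within
# each of the 10 Bark bands from 510 Hz to 2.7 kHz (Zwicker band edges).
import numpy as np

BARK_EDGES_HZ = [510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700]

def bark_features(mags, sample_rate=11025):
    """mags: 1-D array of FFT magnitudes for one frame (bins 0..Nyquist).
    Returns a 10-dim feature vector of per-band averages."""
    freqs = np.arange(len(mags)) * (sample_rate / 2) / len(mags)
    feats = []
    for lo, hi in zip(BARK_EDGES_HZ[:-1], BARK_EDGES_HZ[1:]):
        band = mags[(freqs >= lo) & (freqs < hi)]
        feats.append(band.mean() if len(band) else 0.0)
    return np.array(feats)
```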
OPCA vs. Bark Averaging
[Figure: noise-to-signal variance ratio per feature dimension, OPCA vs. Bark, for the Light Grunge, Distort Bass, Phone, and Old Radio distortions]
OPCA vs. Bark Averaging, cont.
[Figure: noise-to-signal variance ratio per feature dimension, OPCA vs. Bark, for the 3:1 compress above 30 dB, Mackie Mid, and RealAudio Compander distortions]
DDA Eigenspectrum
[Figure: generalized eigenvalues (signal-to-noise ratios) vs. eigenvalue index, for the 1st and 2nd layers]
DDA for Alignment Robustness
Train the second layer with alignment distortion:
[Figure: min target/target and min target/non-target squared distances vs. alignment shift (ms), with and without shift training]
Test Results – Large Test Set
• 500 songs, 36 hours of music
• One stored fingerprint computed per song, about 10 s in, at a random start point
• Test alignment noise and false positive/negative rates, with and without distortions
• Define the distance between two clips as the minimum distance between the fingerprint of the first and all traces of the second
• Approx. 3.5 × 10^8 chances for a false positive
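The clip distance defined above can be sketched directly; the helper names and the threshold handling are illustrative, not from the talk.

```python
# Sketch of the distance used in the tests: minimum squared distance
# between a stored fingerprint and all traces extracted from the stream.
import numpy as np

def clip_distance(fingerprint, traces):
    """fingerprint: (64,) stored vector; traces: (T, 64) from the stream."""
    return ((traces - fingerprint) ** 2).sum(axis=1).min()

def identify(fingerprints, traces, threshold):
    """Return the index of the best-matching stored clip, or None if the
    best distance exceeds `threshold` (set from the positive/negative gap)."""
    dists = [clip_distance(fp, traces) for fp in fingerprints]
    best = int(np.argmin(dists))
    return best if dists[best] < threshold else None
```

A simple sequential scan like this is what the computational-cost slide later measures; the gap between positive and negative scores is what makes a single threshold work.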
Test Results: Positives
[Figure: normalized squared distance vs. fingerprints sorted by distance, for 372 ms and 186 ms steps]
Test Results: Negatives
Smallest 'negative' score: 0.14; largest 'positive' score: 0.026
[Figure: normalized squared distance for the smallest 100 of 249,500 target/non-target values]
Fit with parametric log-logistic model:
P(d < x) = 1 / (1 + exp(-(a log(x) + b)))
so that y = log(P / (1 - P)) is linear in log(x).
[Figure: y vs. log(x), data and linear fit]
Extrapolating gives 2.3 false positives per song for 10^6 fingerprints.
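One way to carry out such a fit, offered as a sketch rather than the talk's procedure: regress the logit of the empirical CDF of the negative scores on log(x) (the slope/intercept names `a`, `b` and the midpoint plotting positions are assumptions).

```python
# Sketch of the log-logistic fit: logit of the empirical CDF is linear in
# log(x), so slope and intercept come from ordinary least squares.
import numpy as np

def fit_log_logistic(distances):
    """Return (a, b) in log(P / (1 - P)) = a * log(x) + b."""
    x = np.sort(np.asarray(distances))
    n = len(x)
    P = (np.arange(1, n + 1) - 0.5) / n       # empirical CDF, midpoint rule
    y = np.log(P / (1 - P))                   # logit
    a, b = np.polyfit(np.log(x), y, 1)
    return a, b

def p_below(x, a, b):
    """Model probability that a negative score falls below x."""
    return 1.0 / (1.0 + np.exp(-(a * np.log(x) + b)))
```

Extrapolating `p_below` to the positive-score region then estimates the false-positive rate far beyond what the finite negative sample can show directly.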
Tests for Other Distortions
• Take first 10 test clips
• Distort each in 7 ways
• Further distort with 2% pitch-preserving time compression (distortion not in training set)
Test Results on Distorted Data
[Figure, left: normalized squared distance for all target/target values (sorted) and the time-compressed distortions (not sorted); right: smallest 100 of 39,920 target/non-target values]
Computational Cost
On a 750 MHz PIII, computing features on a stream and searching a database of 10,000 audio clips by simple sequential scan takes 12 MB of memory and 10% of the CPU.
Conclusions
• DDA features work well for audio fingerprinting. How about other audio tasks?
• A layered approach is necessary to keep DDA computation feasible.
• This is a convolutional neural net, with weights chosen by OPCA and linear transfer functions.
• Demo
http://research.microsoft.com/~cburges/tech_reports/dda.pdf