Informatik og Matematisk Modellering / Intelligent Signalbehandling
Machine Learning on Sound... how hard can it be?
Audio Information Seminar, Thursday, June 8, 2006
Kaare Brandt Petersen
Agenda
- Motivation
- The reason it might be hard:
  - From data to information
  - Features
- The good news:
  - Computer power and machine learning
  - Examples
- Conclusions
Motivation
What can we do with audio information?
- News archive: Find the grumpy voice in a TV broadcast from a busy street in the Middle East. Search in news archives.
- Music: 6 billion friends. Navigating the world landscape of music.
Data
Sound as perceived by humans and by computers

As perceived by computers (sample values):
-0.00076293945313, 0.00231933593750, -0.00714111328125, 0.00772094726563, 0.00076293945313, -0.00772094726563, -0.00900268554688, -0.00527954101563, -0.00076293945313, -0.00231933593750, -0.00714111328125, 0.00024414062500, 0.01312255859375, 0.00650024414063, -0.01052856445313, -0.01089477539063, -0.00305175781250, -0.01052856445313, -0.01089477539063, -0.00305175781250

As perceived by humans (12 Monkeys, movie from 1995):
[ Beeps ]
- "There's the television"
[ Music - violins ]
[ Steps ]
- "It's all right there"
- "All right there!"
- "Look. Listen. Kneel. Pray"
- "Commercials!"
[ Male voice - indoor ]
(Quoted lines are dialogue; bracketed lines are sound events.)
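The sample values on this slide are all integer multiples of 1/32768, which is consistent with signed 16-bit PCM audio scaled to the range [-1, 1). That scaling is an assumption about how the clip was loaded; the slide does not say. A minimal sketch of the conversion, with raw integers chosen here to reproduce the first three values:

```python
# Assumption: the clip was stored as signed 16-bit PCM and scaled by 1/32768.
FULL_SCALE = 32768  # 2**15

def pcm16_to_float(samples):
    """Map signed 16-bit integers to floats in [-1.0, 1.0)."""
    return [s / FULL_SCALE for s in samples]

# Hypothetical raw integers, picked to match the first three slide values.
raw = [-25, 76, -234]
floats = pcm16_to_float(raw)
print(floats)
```

This is all the computer "hears": a long list of such numbers, with no notion of beeps, violins or dialogue.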
Data
Is the data-to-information translation really necessary?
1) Query by signal processing [ humans learn how computers think ]
2) Query by information [ computers learn how humans think ]
3) Query by example [ various approaches ]
[Figure: example queries "happy jazz" and "ZCR < 198" sent to an archive]
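The three query styles can be contrasted on a toy archive. The sketch below is illustrative only: the clip names, ZCR values and tags are invented, and "query by example" is reduced to nearest ZCR value, which is just one of the "various approaches".

```python
# Hypothetical archive: clip names, ZCR values and tags are illustrative only.
ARCHIVE = [
    {"name": "clip_a", "zcr": 120, "tags": {"happy", "jazz"}},
    {"name": "clip_b", "zcr": 310, "tags": {"rock"}},
    {"name": "clip_c", "zcr": 150, "tags": {"melancholic", "jazz"}},
]

def query_by_signal(archive, max_zcr):
    """Query by signal processing: the user must think in feature values."""
    return [c["name"] for c in archive if c["zcr"] < max_zcr]

def query_by_information(archive, words):
    """Query by information: the user asks in words; the tags must already exist."""
    wanted = set(words.split())
    return [c["name"] for c in archive if wanted <= c["tags"]]

def query_by_example(archive, example_name):
    """Query by example: return the other clip closest in ZCR to the example."""
    ref = next(c for c in archive if c["name"] == example_name)
    others = [c for c in archive if c["name"] != example_name]
    return min(others, key=lambda c: abs(c["zcr"] - ref["zcr"]))["name"]

print(query_by_signal(ARCHIVE, 198))
print(query_by_information(ARCHIVE, "happy jazz"))
print(query_by_example(ARCHIVE, "clip_a"))
```

Only the second style needs the data-to-information translation, since somebody or something has to produce the tags first.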
Data
Going from 5 million real numbers to "Opera"
Bridging the gap: From data to information
Constructing sound features the right way
[Figure labels: Information, Meaning, Context]
Features
Many short-time features

- Zero crossing rate
- Spectral flatness
- Spectral bandwidth
- Spectral centroids
- Spectral rolloff
- Spectral flux
- Energy
- ...

- Mel Frequency Cepstral Coefficients (MFCC) [Foote97, Rabiner93]
- Real Cepstral Coefficients (RCC)
- Linear Prediction Coefficients (LPC)
- Wavelets
- Gamma-tone filterbanks
- Sone / Bark
- Chroma features
- ...

[Figure: 12 Monkeys sound clip, with panels Waveform, Spec, ZCR, Sp-Flatness, Sp-Bandwidth, Sp-Centroid, MFCC 1, MFCC 2-7, Chroma]
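Three of the simpler short-time features can be sketched in a few lines. This is a minimal illustration, not the implementation behind the figure: the frame size, hop, sample rate and test tone are assumptions, ZCR is taken as a raw count of sign changes per frame, and a naive DFT stands in for an FFT to keep the example dependency-free.

```python
import cmath
import math

def split_frames(signal, size, hop):
    """Split a signal into overlapping frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def zero_crossing_rate(frame):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency in Hz, using a naive DFT."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(abs(s))
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total

# Synthetic test tone: 1000 Hz sine sampled at 8000 Hz (illustrative values).
SR = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SR + 0.3) for t in range(512)]
features = [
    (zero_crossing_rate(f), energy(f), spectral_centroid(f, SR))
    for f in split_frames(tone, size=256, hop=128)
]
for zcr, en, centroid in features:
    print(zcr, round(en, 3), round(centroid, 1))
```

Each frame yields one small feature vector, so a clip becomes a sequence of such vectors rather than millions of raw samples.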
Features
Aggregating short-time features
Audio clip = data cloud
Distribution of values:
- Basic statistics [Wold96]
- Histograms and vector quantization [Foote97]
- Gaussian Mixture Models [Auc02]
- K-means clustering [Logan01]
- Anchors by Neural Networks [Beren03]
Temporal modelling:
- SVD of e.g. spectrogram [Gu04]
- AR-coefficients [Meng05]
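The simplest aggregation in the list, basic statistics, turns the data cloud of a clip into one fixed-length vector. A minimal sketch, assuming per-dimension mean and variance are the statistics of interest; the four-frame cloud is invented for illustration.

```python
from statistics import mean, pvariance

def aggregate_mean_var(feature_frames):
    """Collapse a list of per-frame feature vectors into one clip-level vector.

    Returns the per-dimension means followed by the per-dimension variances.
    """
    dims = list(zip(*feature_frames))  # transpose: one tuple per feature dimension
    means = [mean(d) for d in dims]
    variances = [pvariance(d) for d in dims]
    return means + variances

# Hypothetical cloud: four frames, each with two short-time features.
cloud = [[1.0, 10.0], [2.0, 10.0], [3.0, 14.0], [4.0, 14.0]]
clip_vector = aggregate_mean_var(cloud)
print(clip_vector)
```

Note that this summary ignores the order of the frames entirely, which is the motivation for the temporal modelling entries.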
Features
What we are trying to do: From data to information
Data: -0.00076293945313, 0.00231933593750, -0.00714111328125, 0.00772094726563, 0.00076293945313, -0.00772094726563, -0.00900268554688, -0.00527954101563, -0.00076293945313, -0.00231933593750, -0.00714111328125, 0.00024414062500, 0.01312255859375, 0.00650024414063, -0.01052856445313, -0.01089477539063
Low-level features: ZCR, Spectral, MFCC, Chroma, Sone/Bark, RCC, LPC, ...
High-level features: Basic stats, GMM, K-means, Anchors, AR coeff, SVD, HMM, ...
Information: "Rough", "Deep", "Sparky", "Broad", "Melancholic", "Majestic", "Jazz", "Rock", ...
Features
Music similarity example
- "Shape of my heart", Backstreet Boys, 2000
- "That's the way it is", Celine Dion, 2000
- "Cantaloop", Us3, 1993
"The limitations observed in this paper (...) suggests that the usual route to timbre similarity may not be the optimal one" [Auc04]
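For the music similarity example, a common route is to fit a statistical model to each clip's feature cloud and compare the models. [Auc02] uses Gaussian Mixture Models; the sketch below swaps in a single diagonal Gaussian per clip so that the divergence has a closed form. That substitution, the function names and the toy clips are all assumptions made for illustration.

```python
import math
from statistics import mean, pvariance

def fit_diag_gaussian(feature_frames, floor=1e-9):
    """Per-dimension mean and variance of a clip's feature frames."""
    dims = list(zip(*feature_frames))
    return [mean(d) for d in dims], [max(pvariance(d), floor) for d in dims]

def kl_diag(p, q):
    """KL divergence KL(p || q) between two diagonal Gaussians."""
    (m1, v1), (m2, v2) = p, q
    return 0.5 * sum(
        math.log(b / a) + (a + (x - y) ** 2) / b - 1.0
        for x, a, y, b in zip(m1, v1, m2, v2)
    )

def clip_distance(frames_a, frames_b):
    """Symmetrised KL divergence between the two clip models."""
    p, q = fit_diag_gaussian(frames_a), fit_diag_gaussian(frames_b)
    return kl_diag(p, q) + kl_diag(q, p)

# Hypothetical 2-D feature clouds for three clips.
clip_x = [[0.0, 0.0], [2.0, 2.0], [0.0, 2.0], [2.0, 0.0]]
clip_y = [[0.5, 0.0], [2.5, 2.0], [0.5, 2.0], [2.5, 0.0]]
clip_z = [[5.0, 5.0], [7.0, 7.0], [5.0, 7.0], [7.0, 5.0]]
print(clip_distance(clip_x, clip_y), clip_distance(clip_x, clip_z))
```

A small distance here only says the feature clouds overlap; whether listeners would call the songs similar is a separate question.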
The bad news
- Sound data is far from the information
- Not all features are useful
- It is not obvious what the information labels should be
The good news
- Computer power
- Signal processing: strong development in signal processing and machine learning in general
- Large amounts of data
- Increased interest in sound and music processing
Example: Genre estimation
Genre estimation by temporal integration
Peter Ahrendt, Anders Meng [Meng05]
Processing: Sound -> MFCC -> AR
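The last step of the chain, MFCC -> AR, summarises how a feature evolves over time by fitting autoregressive coefficients to its trajectory. The sketch below fits a univariate AR model to one feature dimension via the Yule-Walker equations and the Levinson-Durbin recursion. Treating each dimension separately, the model order and the synthetic trajectory are assumptions for illustration; [Meng05] should be consulted for the actual model.

```python
import random
from statistics import mean

def autocorrelation(x, max_lag):
    """Biased autocovariance estimates r[0..max_lag] of a mean-removed signal."""
    m = mean(x)
    c = [v - m for v in x]
    n = len(c)
    return [sum(c[t] * c[t + lag] for t in range(n - lag)) / n
            for lag in range(max_lag + 1)]

def ar_coefficients(x, order):
    """Fit x[t] = sum_k a[k] * x[t-1-k] + e[t] by Levinson-Durbin.

    Returns the coefficients a and the estimated noise variance.
    """
    r = autocorrelation(x, order)
    if r[0] == 0:  # constant signal: nothing to predict
        return [0.0] * order, 0.0
    a, err = [], r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for model order m.
        k_m = (r[m] - sum(a[k] * r[m - 1 - k] for k in range(m - 1))) / err
        a = [a[k] - k_m * a[m - 2 - k] for k in range(m - 1)] + [k_m]
        err *= 1.0 - k_m ** 2
    return a, err

# Synthetic trajectory from a known AR(2) process (coefficients are illustrative).
rng = random.Random(0)
true_a = [0.6, -0.3]
traj = [0.0, 0.0]
for _ in range(5000):
    traj.append(true_a[0] * traj[-1] + true_a[1] * traj[-2] + rng.gauss(0.0, 1.0))
est, noise_var = ar_coefficients(traj, order=2)
print([round(a, 2) for a in est], round(noise_var, 2))
```

The fitted coefficients, together with the mean and noise variance, then act as the clip-level feature vector passed to a classifier.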
Example: Genre estimation
Genre estimation by temporal integration + kernel methods
Jeronimo Arenas-Garcia, Tue Lehn-Schiøler, Kaare Brandt Petersen [ArGa06]
Processing: Sound -> MFCC -> AR -> KOPLS
Btw: A data harvesting tool coming up - ISMIR 2006
Example: Source separation
Spectrogram modelling with sparse NTF2D
Morten Mørup, Mikkel Schmidt [Mørup06]
W = time-frequency patterns
H = time, amplitude, pitch
[Figure: spectrograms, Frequency [Hz] versus Time [s], of the original (mixed) signal and the separated sources (Harp, Flute)]
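The idea of explaining a spectrogram as patterns W activated over time by H can be illustrated with plain non-negative matrix factorisation (NMF), a much simpler relative of sparse NTF2D. The sketch below uses the standard multiplicative updates for squared error; it omits the sparsity, the tensor structure and the 2-D deconvolution of [Mørup06], and the toy "spectrogram" is invented.

```python
import random

def matmul(a, b):
    """Matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))

def nmf(v, rank, iters=200, seed=0, eps=1e-12):
    """Factorise non-negative V (n x m) as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

# Toy "spectrogram": 4 frequency bins x 6 frames, two sources with disjoint bins.
V = [
    [1, 0, 1, 0, 1, 1],
    [1, 0, 1, 0, 1, 1],
    [0, 1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 1],
]
W, H = nmf(V, rank=2)
print(frob_error(V, W, H))
```

Each column of W plays the role of one source's frequency pattern and the matching row of H says when it is active, so multiplying a single column by its row gives an estimate of that source alone.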
Example: CNN
Translating a CNN news broadcast
Kasper Jørgensen, Lasse Mølgaard, Lars Kai Hansen [Jorg06]
- Music or speech? Sound -> MFCC, STE, SpF, ZCR -> mean/var
- Speaker change detection: Sound -> MFCC -> VQ
- Speech recognition: Sphinx 4 (Carnegie Mellon)
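The MFCC -> VQ step for speaker change detection can be sketched as follows: build a small codebook from one window of feature vectors and measure how badly it quantises the next window. This is a hedged illustration of the general vector-quantisation idea, not the procedure of [Jorg06]; the tiny k-means, the 2-D stand-ins for MFCC vectors and the window contents are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Tiny k-means; needs len(points) >= k. Initial centres are evenly spaced picks."""
    step = max(1, len(points) // k)
    centres = [list(points[i * step]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: sq_dist(p, centres[c]))
            groups[idx].append(p)
        for c, g in enumerate(groups):
            if g:  # keep the old centre if a cluster went empty
                centres[c] = [sum(col) / len(g) for col in zip(*g)]
    return centres

def vq_distortion(points, codebook):
    """Mean squared distance from each point to its nearest codeword."""
    return sum(min(sq_dist(p, c) for c in codebook) for p in points) / len(points)

# Hypothetical 2-D stand-ins for MFCC vectors in three analysis windows.
window_a = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]]
window_same = [[0.0, 0.1], [1.0, 0.9]]    # same "speaker" as window_a
window_other = [[5.0, 5.0], [5.1, 5.0]]   # a different "speaker"

codebook = kmeans(window_a, k=2)
d_same = vq_distortion(window_same, codebook)
d_other = vq_distortion(window_other, codebook)
print(d_same, d_other)
```

A change point would be flagged where the distortion jumps above some threshold; choosing that threshold is the hard part and is left open here.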
Conclusions
It is hard:
- Sound data is far from the information
- Good features are hard to find
but machine learning is catching up:
- Examples: Genre, Source separation, CNN-translation
References
[Wold96] Wold, E.; Blum, T.; Keislar, D. & Wheaton, J., "Content-based Classification, Search, and Retrieval of Audio", IEEE Multimedia, 1996, 3, 27-36
[Foote97] Foote, J., "Content-based retrieval of music and audio", Multimedia Storage and Archiving Systems II, Proc. of SPIE, 1997, 3229, 138-147
[Logan01] Logan and Salomon, "A music similarity function based on signal analysis", ICME 2001
[Beren03] Berenzweig, Ellis and Lawrence, "Anchorspace for classification and similarity measurement of music", ICME 2003
[Rabiner93] Rabiner, L. & Juang, B.H., "Fundamentals of Speech Recognition", Prentice-Hall, 1993
[Gu04] Gu, Lu, Cai and Zhang, "Dominant Feature vector based audio similarity measure", Proceedings of the Pacific Rim Conference on Multimedia, PCM, 2004
[Tza02] Tzanetakis and Cook, "Musical Genre Classification of Audio Signals", IEEE Transactions on Speech and Audio Processing, 2002, 10, 293-302
[Auc02] Aucouturier and Pachet, "Music Similarity Measures: What's the use?", ISMIR 2002
[Meng05] Anders Meng, Peter Ahrendt and Jan Larsen, "Improving Music Genre Classification by Short-Time Feature Integration", ICASSP, 2005
[Auc04] Aucouturier and Pachet, "Improving Timbre Similarity: How high is the sky?", JNRSAS, 2004
[Mørup06] "Sparse Non-negative Tensor Factor Double Deconvolution (SNTF2D) for multi channel time-frequency analysis", submitted to JMLR 2006
[ArGa06] "Reduced Kernel Orthonormal Partial Least Squares", submitted for NIPS 2006
[Jorg06] Kasper Jørgensen, Lasse Mølgaard and Lars Kai Hansen, "Unsupervised speaker change detection for broadcast news segmentation", EUSIPCO 2006