2015/12/71 Music Information Retrieval: Overview and Challenges J.-S. Roger Jang (張智星)...
Music Information Retrieval: Overview and Challenges
J.-S. Roger Jang (張智星)
Multimedia Information Retrieval (MIR) Lab
CSIE Dept, National Taiwan Univ.
http://mirlab.org/jang
-2-
Outline
Music Information Retrieval (MIR)
  Intro to MIR
  Intro to ISMIR & MIREX
Two classical paradigms of MIR
  QBSH (query by singing/humming)
  AFP (audio fingerprinting)
Conclusions
-3-
Introduction to QBSH
QBSH: Query by Singing/Humming
  Input: Singing or humming from microphone
  Output: A ranked list retrieved from the song database according to similarity to the query
Progression
  First paper: Around 1994
  Extensive studies since 2001
  State of the art: QBSH tasks at ISMIR/MIREX, since 2006
-4-
Two Steps in QBSH
Pitch Tracking
  To detect the period of a waveform
  Time domain
    ACF (Autocorrelation function)
    NSDF (Normalized squared difference function)
    AMDF (Average magnitude difference function)
  Frequency domain
    Harmonic product spectrum
    Cepstrum
Database comparison
  To find similarity between query and database songs
  Linear scaling
  Dynamic time warping
  Recursive alignment
  Hybrid methods
-5-
Frame Blocking for Pitch Tracking
Sample rate = 16 kHz
Frame size = 512 samples
Frame duration = 512/16000 = 0.032 s = 32 ms
Overlap = 192 samples
Hop size = frame size - overlap = 512 - 192 = 320 samples
Frame rate = 16000/320 = 50 frames/sec = Pitch rate
[Figure: waveform of the signal, with a zoomed-in view marking one frame and the overlap between adjacent frames]
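The frame-blocking arithmetic on this slide can be checked with a short Python sketch. The function names `frame_params` and `block_into_frames` are made up for illustration, and dropping the incomplete frame at the end of the signal is an assumption, not something the slide specifies.

```python
def frame_params(sample_rate=16000, frame_size=512, overlap=192):
    """Return (frame duration in seconds, hop size in samples, frames per second)."""
    hop_size = frame_size - overlap
    frame_duration = frame_size / sample_rate
    frame_rate = sample_rate / hop_size
    return frame_duration, hop_size, frame_rate


def block_into_frames(signal, frame_size=512, overlap=192):
    """Split a list of samples into overlapping frames.

    Assumption: a trailing chunk shorter than frame_size is dropped.
    """
    hop_size = frame_size - overlap
    last_start = len(signal) - frame_size
    return [signal[start:start + frame_size]
            for start in range(0, last_start + 1, hop_size)]


duration, hop, rate = frame_params()
frames = block_into_frames(list(range(16000)))  # one second of dummy samples
```

With the slide's settings, each new frame starts 320 samples after the previous one, so consecutive frames share 192 samples.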
-6-
ACF: Auto-correlation Function
[Figure: original frame s(t) and shifted frame s(t-τ); for τ = 30, acf(30) is the inner product of the overlapping part; the pitch period is marked]
To play it safe, the frame size needs to cover at least two fundamental periods!
$$\mathrm{acf}(\tau) = \sum_{t=\tau}^{n-1} s(t)\, s(t-\tau)$$
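A minimal Python sketch of ACF-based pitch detection for a single frame follows. It computes the inner product of the frame with its shifted copy and picks the lag with the largest value. The function names, the lag search range (taken from the 82 Hz to 1047 Hz pitch range quoted elsewhere in these slides), and the synthetic sine test frame are all assumptions for illustration; a practical tracker would add refinements that are not shown here.

```python
import math


def acf(frame, tau):
    """Inner product of the frame with a copy of itself shifted by tau samples."""
    return sum(frame[t] * frame[t - tau] for t in range(tau, len(frame)))


def acf_pitch(frame, sample_rate, min_freq=82.0, max_freq=1047.0):
    """Estimate pitch in Hz as sample_rate divided by the lag with the largest ACF.

    The maximum lag is capped at half the frame length, so the frame
    always covers at least two periods of any candidate pitch.
    """
    min_lag = int(sample_rate / max_freq)
    max_lag = min(int(sample_rate / min_freq), len(frame) // 2)
    best_lag = max(range(min_lag, max_lag + 1), key=lambda tau: acf(frame, tau))
    return sample_rate / best_lag


# Synthetic frame: a 200 Hz sine sampled at 16 kHz (period = 80 samples).
fs = 16000
frame = [math.sin(2 * math.pi * 200 * t / fs) for t in range(512)]
pitch = acf_pitch(frame, fs)
```

For the synthetic sine, the largest ACF value in the search range falls at a lag of 80 samples, which corresponds to 200 Hz.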
-7-
Frequency to Semitone Conversion
Semitone: A music scale based on A440
Reasonable pitch range: E2 - C6 (82 Hz - 1047 Hz)
$$\mathrm{semitone} = 69 + 12 \log_2\left(\frac{\mathrm{freq}}{440}\right)$$
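The conversion can be tried out with a few lines of Python. The function names are illustrative, and the inverse mapping is added only for convenience; it is not on the slide.

```python
import math


def freq_to_semitone(freq_hz):
    """Convert a frequency in Hz to a (fractional) semitone number; A440 maps to 69."""
    return 69 + 12 * math.log2(freq_hz / 440.0)


def semitone_to_freq(semitone):
    """Inverse mapping from semitone number back to Hz (not on the slide)."""
    return 440.0 * 2 ** ((semitone - 69) / 12)


a4 = freq_to_semitone(440.0)
e2 = round(freq_to_semitone(82.0))    # lower end of the pitch range
c6 = round(freq_to_semitone(1047.0))  # upper end of the pitch range
```

Rounded to the nearest integer, the two ends of the pitch range land on semitones 40 (E2) and 84 (C6).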
-8-
Demos
Pitch related demos
  Pitch tracking
  Pitch shift
-9-
Basic Comparison Method: Linear Scaling
Scale the query pitch linearly to match the candidates
[Figure: the original input pitch is stretched by 1.25 and 1.5 and compressed by 0.75 and 0.5; each version is compared with the target pitch in the database, and the best match is marked]
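A rough Python sketch of linear scaling is given below. It resamples the query pitch vector by each scaling factor, compares it with the start of the target, and keeps the factor with the smallest distance. Several details are assumptions rather than the lab's method: the function names, anchoring the match at the beginning of the target, the mean shift used to compensate for a different key, and the mean absolute difference used as distance.

```python
def resample(pitch, new_length):
    """Linearly interpolate a pitch vector (length >= 2) to new_length >= 2 points."""
    step = (len(pitch) - 1) / (new_length - 1)
    out = []
    for i in range(new_length):
        pos = i * step
        lo = min(int(pos), len(pitch) - 2)
        frac = pos - lo
        out.append(pitch[lo] * (1 - frac) + pitch[lo + 1] * frac)
    return out


def linear_scaling_match(query, target, factors=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Return (best distance, best factor) over the candidate scaling factors."""
    best_dist, best_factor = float("inf"), None
    for factor in factors:
        length = round(len(query) * factor)
        if length < 2 or length > len(target):
            continue  # scaled query would be degenerate or longer than the target
        scaled = resample(query, length)
        segment = target[:length]
        # Assumed key compensation: align the means of the two vectors.
        shift = (sum(segment) - sum(scaled)) / length
        dist = sum(abs(s + shift - t) for s, t in zip(scaled, segment)) / length
        if dist < best_dist:
            best_dist, best_factor = dist, factor
    return best_dist, best_factor


# Toy data: the target is a rising ramp in semitones; the query covers the
# first 30 target frames, sung 1.5 times too fast and 3 semitones too high.
target = [float(n) for n in range(40)]
query = [p + 3.0 for p in resample(target[:30], 20)]
dist, factor = linear_scaling_match(query, target)
```

In this toy case, stretching the query by 1.5 recovers the target segment, so that factor gives the smallest distance.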
-10-
Typical Result of Pitch Tracking
Pitch tracking via autocorrelation for 茉莉花 (Jasmine Flower) [audio]
-11-
Comparison of Pitch Vectors
Yellow line: Target pitch vector
-12-
QBSH Demos
QBSH demos by our lab
  Description
  QBSH on the web: MIRACLE
  QBSH on toys
Existing commercial QBSH systems
  www.midomi.com
  www.soundhound.com
-13-
Our QBSH System: Miracle
Single server with GPU
  NVIDIA 560 Ti, 384 cores (speedup factor = 10)
[Diagram labels: Clients (PC, PDA/Smartphone, Cellular); Master server; Single server; Request: pitch vector; Response: search result]
Database size: ~20,000
-14-
Improving QBSH
Many ways to improve QBSH
  Sorted error vector
  Various weights for rests
  Re-ranking for better accuracy
  Better memory arrangement in GPU
  …
-15-
Intro to Audio Fingerprinting (AFP)
Goal
  Identify a noisy version of a given audio clip
Also known as…
  “Query by exact example”: no “cover versions” are allowed
-16-
AFP Applications
Commercial applications of AFP
  Music identification & purchase
  Royalty assignment (over radio)
  TV shows or commercials ID (over TV)
  Copyright violation (over web)
Major commercial players
  Shazam, Soundhound, Intonow, Viggle…
-17-
Two Stages in AFP
Offline
  Feature extraction
  Hash table construction for songs in database
  Inverted indexing
Online
  Feature extraction
  Hash table search
  Ranked list of the retrieved songs/music
-18-
Robust Feature Extraction
Various kinds of features for AFP
  Invariance along time and frequency
  Landmark of a pair of local maxima
  Wavelets
  …
Extensive test required for choosing the best features
-19-
Representative Approaches to AFP
Philips
  J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system”, ISMIR 2002.
Shazam
  A. Wang, “An industrial-strength audio search algorithm”, ISMIR 2003.
Google
  S. Baluja and M. Covell, “Content fingerprinting using wavelets”, Euro. Conf. on Visual Media Production, 2006.
  V. Chandrasekhar, M. Sharifi, and D. A. Ross, “Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications”, ISMIR 2011.
-20-
Improvement on AFP
Re-ranking of AFP by learning to rank
Demo: http://mirlab.org/demo/audioFingerprinting
-21-
Shazam’s Method
Ideas
  Take advantage of music local structures
    Find salient peaks on spectrogram
    Pair peaks to form landmarks for comparison
  Efficient search by hash tables
    Use positions of landmarks as hash keys
    Use song ID and offset time as hash values
    Use time constraints to find matched landmarks
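The landmark and hash-table ideas on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not Shazam's implementation: peaks are given as (frame index, frequency bin) pairs, the hash key is the tuple (f1, f2, dt), the target-zone limits `max_dt` and `max_df` are arbitrary, and the time constraint is applied by counting landmarks that agree on one time offset. All names and the toy peak lists are hypothetical.

```python
from collections import Counter, defaultdict


def make_landmarks(peaks, max_dt=10, max_df=20):
    """Pair each peak (t, f) with later peaks inside an assumed target zone.

    Returns tuples (t1, f1, f2, dt).
    """
    peaks = sorted(peaks)
    landmarks = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # peaks are sorted by time, so later ones are farther away
            if dt > 0 and abs(f2 - f1) <= max_df:
                landmarks.append((t1, f1, f2, dt))
    return landmarks


def build_index(songs):
    """Offline stage: map hash key (f1, f2, dt) to a list of (song ID, offset time)."""
    index = defaultdict(list)
    for song_id, peaks in songs.items():
        for t1, f1, f2, dt in make_landmarks(peaks):
            index[(f1, f2, dt)].append((song_id, t1))
    return index


def query_index(index, query_peaks):
    """Online stage: rank songs by the number of landmarks sharing one time offset."""
    votes = Counter()
    for t1, f1, f2, dt in make_landmarks(query_peaks):
        for song_id, db_t1 in index.get((f1, f2, dt), []):
            votes[(song_id, db_t1 - t1)] += 1
    best = {}
    for (song_id, _offset), count in votes.items():
        best[song_id] = max(best.get(song_id, 0), count)
    return sorted(best.items(), key=lambda item: -item[1])


# Toy database of peak lists, and a query cut from song_a starting at frame 3.
songs = {
    "song_a": [(0, 10), (3, 25), (5, 18), (9, 30), (12, 22)],
    "song_b": [(0, 40), (2, 55), (6, 47), (8, 60), (13, 52)],
}
index = build_index(songs)
ranked = query_index(index, [(0, 25), (2, 18), (6, 30), (9, 22)])
```

Requiring matched landmarks to agree on a single offset filters out accidental hash collisions, since unrelated songs rarely produce many hits at the same time shift.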
-22-
How to Find Salient Peaks
We need to find peaks that are salient along both frequency and time axes
  Frequency axis: Gaussian local smoothing
  Time axis: Decaying threshold over time
-23-
How to Find Initial Threshold?
Goal
  To suppress neighboring peaks
Ideas
  Find the local max. of mag. spectra of initial 10 frames
  Superimpose a Gaussian on each local max.
  Find the max. of all Gaussians
[Figure: original signal, positive local maxima, and final output]
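One way to read the three steps on this slide is sketched below in Python. The assumptions are: the initial frames are first reduced to a bin-wise maximum spectrum, each Gaussian is scaled to the height of its peak, and the Gaussian width of 3 bins is arbitrary. The function names and the tiny two-frame example are hypothetical.

```python
import math


def local_maxima(values):
    """Indices of interior local maxima with positive value."""
    return [i for i in range(1, len(values) - 1)
            if values[i] > 0 and values[i - 1] < values[i] >= values[i + 1]]


def gaussian_envelope(values, width=3.0):
    """Max over Gaussians centred on each local maximum and scaled to its height."""
    envelope = [0.0] * len(values)
    for peak in local_maxima(values):
        for k in range(len(values)):
            bump = values[peak] * math.exp(-0.5 * ((k - peak) / width) ** 2)
            envelope[k] = max(envelope[k], bump)
    return envelope


def initial_threshold(first_frames, width=3.0):
    """Initial threshold from the bin-wise maximum of the first magnitude spectra."""
    max_spectrum = [max(column) for column in zip(*first_frames)]
    return gaussian_envelope(max_spectrum, width)


# Two toy frames of 8 frequency bins each (the slide uses the initial 10 frames).
frames = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0],
]
threshold = initial_threshold(frames)
```

In the toy example the Gaussian around the strong peak at bin 5 raises the threshold at bin 1 above the weaker peak there, which is how neighboring peaks get suppressed.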
-24-
How to Update the Threshold along Time?
Decay the threshold
Find local maxima larger than the threshold → salient peaks
Define the new threshold as the max of the old threshold and the Gaussians passing through the active local maxima
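The update rule can be sketched as a forward pass over the frames of a spectrogram. This is a minimal Python illustration; the decay factor of 0.9, the Gaussian width of 3 bins, the function names, and the toy spectrogram are assumptions.

```python
import math


def local_maxima(values):
    """Indices of interior local maxima with positive value."""
    return [i for i in range(1, len(values) - 1)
            if values[i] > 0 and values[i - 1] < values[i] >= values[i + 1]]


def find_salient_peaks(spectrogram, threshold, decay=0.9, width=3.0):
    """Forward pass: return salient peaks as (frame index, frequency bin) pairs."""
    threshold = list(threshold)
    peaks = []
    for t, frame in enumerate(spectrogram):
        # Step 1: decay the threshold.
        threshold = [decay * v for v in threshold]
        # Step 2: local maxima above the threshold are the salient peaks.
        active = [f for f in local_maxima(frame) if frame[f] > threshold[f]]
        peaks.extend((t, f) for f in active)
        # Step 3: raise the threshold to the Gaussians through the active maxima.
        for f in active:
            for k in range(len(frame)):
                bump = frame[f] * math.exp(-0.5 * ((k - f) / width) ** 2)
                threshold[k] = max(threshold[k], bump)
    return peaks


# Toy spectrogram: 3 frames, 8 frequency bins.
spectrogram = [
    [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0],  # strong peak at bin 3
    [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0],  # weaker repeat, masked
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 0.0],  # new strong peak at bin 6
]
peaks = find_salient_peaks(spectrogram, [0.0] * 8)
```

The weaker repeat in the second frame stays below the decayed threshold left by the first peak, so only the first and third frames contribute salient peaks.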
-25-
Time-decaying Thresholds
[Figure: time-decaying thresholds for the forward pass and the backward pass; each panel plots frequency index against frame index]
-26-
How to Pair Salient Peaks?
Target zone
-27-
Salient Peaks and Landmarks
Peak picking after forward smoothing
Matched landmarks (green)
(Source: Dan Ellis)
-28-
Landmarks for Hash Table Access
-29-
Optimization Strategies for AFP
Several ways to optimize AFP
  Strategy for query landmark extraction
  Confidence measure
  Incremental retrieval
  Better use of the hash table
  Re-ranking for better performance
-30-
Demos of Audio Fingerprinting
Commercial apps
  Shazam
  Soundhound
Our demo
  http://mirlab.org/demo/audioFingerprinting
-31-
QBSH vs. AFP
QBSH
  Goal: MIR
  Feature: Pitch
    Perceptible
    Small data size
  Method: LS (linear scaling)
  Database
    Harder to collect
    Small storage
  Bottleneck
    CPU/GPU-bound
AFP
  Goal: MIR
  Features: Landmarks
    Not perceptible
    Big data size
  Method: Matched LM (landmarks)
  Database
    Easier to collect
    Large storage
  Bottleneck
    I/O-bound
-32-
Conclusions
Successful applications in MIR
  QBSH
  AFP
Due to
  Faster, bigger memory
  Advances in GPU/CPU (Moore’s law)
  New machine learning methods
Challenges in MIR
  Audio melody extraction from polyphonic music
  Database collection for QBSH
  Cover song ID (which cannot be handled by AFP)
  Polyphonic music transcription
-33-
Thank you for your attention!
Questions & comments?