2015/12/71 Music Information Retrieval: Overview and Challenges J.-S. Roger Jang （張智星）...


Music Information Retrieval: Overview and Challenges

J.-S. Roger Jang （張智星）

Multimedia Information Retrieval (MIR) Lab

CSIE Dept, National Taiwan Univ.

http://mirlab.org/jang


-2-

Outline

Music Information Retrieval (MIR)
  - Intro to MIR
  - Intro to ISMIR & MIREX
Two classical paradigms of MIR
  - QBSH (query by singing/humming)
  - AFP (audio fingerprinting)
Conclusions

-3-

Introduction to QBSH

QBSH: Query by Singing/Humming
  - Input: Singing or humming from a microphone
  - Output: A ranked list retrieved from the song database according to similarity to the query
Progression
  - First paper: Around 1994
  - Extensive studies since 2001
  - State of the art: QBSH tasks at ISMIR/MIREX, since 2006

-4-

Two Steps in QBSH

Pitch tracking: To detect the period of a waveform
  - Time domain
    - ACF (Autocorrelation function)
    - NSDF (Normalized squared difference function)
    - AMDF (Average magnitude difference function)
  - Frequency domain
    - Harmonic product spectrum
    - Cepstrum
Database comparison: To find similarity between query and database songs
  - Linear scaling
  - Dynamic time warping
  - Recursive alignment
  - Hybrid methods

-5-

Frame Blocking for Pitch Tracking

Sample rate = 16 kHz
Frame size = 512 samples
Frame duration = 512/16000 = 0.032 s = 32 ms
Overlap = 192 samples
Hop size = frame size - overlap = 512 - 192 = 320 samples
Frame rate = 16000/320 = 50 frames/sec = Pitch rate
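The arithmetic above can be checked with a minimal Python sketch. The function name `frame_blocking` and the all-zero placeholder signal are illustrative assumptions, not part of the original slides; only the sample rate, frame size, and overlap come from the slide.

```python
def frame_blocking(signal, frame_size=512, overlap=192):
    """Split a sequence of samples into overlapping frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    hop_size = frame_size - overlap
    last_start = len(signal) - frame_size
    return [signal[start:start + frame_size]
            for start in range(0, last_start + 1, hop_size)]


sample_rate = 16000                        # Hz
frame_size, overlap = 512, 192             # samples
hop_size = frame_size - overlap            # 320 samples
frame_duration = frame_size / sample_rate  # 0.032 s = 32 ms
frame_rate = sample_rate / hop_size        # 50 frames/sec = pitch rate

one_second = [0.0] * sample_rate           # placeholder signal
frames = frame_blocking(one_second, frame_size, overlap)
print(hop_size, frame_duration, frame_rate, len(frames))
```

With these settings a one-second signal yields 49 full frames rather than 50, because the last partial frame is discarded in this sketch.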

[Figure: waveform (top) and a zoomed-in view (bottom) marking one frame and the overlap between adjacent frames]

-6-

ACF: Auto-correlation Function

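Original frame: s(t)
Shifted frame: s(t-τ)
τ = 30: acf(30) = inner product of the overlapping part
Pitch period: the shift τ at which acf(τ) reaches its highest peak other than at τ = 0
To be safe, the frame size needs to cover at least two fundamental periods!

$$\mathrm{acf}(\tau) = \sum_{t=\tau}^{n-1} s(t)\, s(t-\tau)$$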
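As an illustration of ACF-based pitch detection on a single frame, here is a minimal pure-Python sketch. The function names, the search range (taken from the 82 Hz to 1047 Hz range mentioned elsewhere in these slides), and the synthetic 200 Hz test tone are assumptions made for the example; a real tracker would add smoothing and voicing decisions.

```python
import math


def acf(frame, max_lag):
    """acf(tau) = sum over t = tau .. n-1 of s(t) * s(t - tau)."""
    n = len(frame)
    return [sum(frame[t] * frame[t - tau] for t in range(tau, n))
            for tau in range(max_lag + 1)]


def acf_pitch(frame, sample_rate, min_freq=82.0, max_freq=1047.0):
    """Estimate pitch in Hz as sample_rate / (lag of the highest ACF value)."""
    min_lag = int(sample_rate / max_freq)   # shortest period searched
    # Keep at least two periods inside the frame, as the slide advises.
    max_lag = min(int(sample_rate / min_freq), len(frame) // 2)
    values = acf(frame, max_lag)
    best_lag = max(range(min_lag, max_lag + 1), key=lambda tau: values[tau])
    return sample_rate / best_lag


sample_rate = 16000
frame = [math.sin(2 * math.pi * 200.0 * t / sample_rate) for t in range(512)]
pitch = acf_pitch(frame, sample_rate)
print(pitch)  # should come out close to 200 Hz for this test tone
```

Restricting the lag search to a plausible pitch range excludes the trivial maximum at τ = 0 and keeps the search short.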

-7-

Frequency to Semitone Conversion

Semitone: A musical scale based on A440 (440 Hz maps to semitone 69)

Reasonable pitch range: E2 - C6, i.e. 82 Hz - 1047 Hz (semitones 40 - 84)

$$\mathrm{semitone} = 69 + 12 \log_2\left(\frac{\mathrm{freq}}{440}\right)$$
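The conversion can be tried out with a short Python sketch. The function names are illustrative assumptions; the inverse function is added only for convenience and is not on the slide.

```python
import math


def freq_to_semitone(freq):
    """Convert a frequency in Hz to a semitone number (A440 maps to 69)."""
    return 69 + 12 * math.log2(freq / 440)


def semitone_to_freq(semitone):
    """Inverse mapping, from semitone number back to Hz."""
    return 440 * 2 ** ((semitone - 69) / 12)


print(round(freq_to_semitone(440.0)))   # 69 (A4)
print(round(freq_to_semitone(82.41)))   # 40 (E2)
print(round(freq_to_semitone(1046.5)))  # 84 (C6)
```

Working in semitones makes pitch differences independent of register: one octave is always 12 semitones, whatever the absolute frequency.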

-8-

Demos

Pitch-related demos
  - Pitch tracking
  - Pitch shift

-9-

Basic Comparison Method: Linear Scaling

Scale the query pitch linearly to match the candidates

[Figure: the original input pitch is stretched by 1.25 and 1.5 and compressed by 0.75 and 0.5; each scaled version is compared with the target pitch in the database, and the closest one is the best match]
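Linear scaling can be sketched in a few lines of Python. Everything below is an illustrative assumption rather than the lab's implementation: the function names, the mean-absolute-difference distance, the mean removal used to cancel a key difference, and the toy melody. Pitch vectors are assumed to be lists of semitone values with at least two points.

```python
def resample(pitch, new_length):
    """Linearly interpolate a pitch vector (length >= 2) to new_length points."""
    step = (len(pitch) - 1) / (new_length - 1)
    out = []
    for i in range(new_length):
        pos = i * step
        left = min(int(pos), len(pitch) - 2)
        frac = pos - left
        out.append((1 - frac) * pitch[left] + frac * pitch[left + 1])
    return out


def linear_scaling(query, target, factors=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Return (best distance, best factor) over the tried scaling factors."""
    best_dist, best_factor = float("inf"), None
    for factor in factors:
        length = round(len(query) * factor)
        if length < 2 or length > len(target):
            continue  # scaled query would not fit the target
        scaled = resample(query, length)
        segment = target[:length]
        # Assumed key transposition: remove each vector's mean first.
        q_mean = sum(scaled) / length
        t_mean = sum(segment) / length
        dist = sum(abs((q - q_mean) - (t - t_mean))
                   for q, t in zip(scaled, segment)) / length
        if dist < best_dist:
            best_dist, best_factor = dist, factor
    return best_dist, best_factor


# Toy target melody, and a query that is faster and 3 semitones higher.
target = [60] * 10 + [62] * 10 + [64] * 10 + [65] * 10
query = [p + 3 for p in resample(target, 32)]
dist, factor = linear_scaling(query, target)
print(dist, factor)
```

In this toy case the query has to be stretched by 1.25 to line up with the target, so that factor should give the smallest distance.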

-10-

Typical Result of Pitch Tracking

Pitch tracking via autocorrelation for 茉莉花 (Jasmine Flower) [audio]

-11-

Comparison of Pitch Vectors

Yellow line: Target pitch vector

-12-

QBSH Demos

QBSH demos by our lab
  - Description
  - QBSH on the web: MIRACLE
  - QBSH on toys
Existing commercial QBSH systems
  - www.midomi.com
  - www.soundhound.com

-13-

Our QBSH System: Miracle

Single server with GPU
  - NVIDIA 560 Ti, 384 cores (speedup factor = 10)

[Diagram: clients (PC, PDA/smartphone, cellular) and the master server exchange a request (pitch vector) and a response (search result); the master server connects to a single server]

Database size: ~20,000

-14-

Improving QBSH

Many ways to improve QBSH
  - Sorted error vector
  - Various weights for rests
  - Re-ranking for better accuracy
  - Better memory arrangement in GPU
  - …

-15-

Intro to Audio Fingerprinting (AFP)

Goal
  - Identify a noisy version of a given audio clip
Also known as…
  - “Query by exact example”: no “cover versions” are allowed

-16-

AFP Applications

Commercial applications of AFP
  - Music identification & purchase
  - Royalty assignment (over radio)
  - TV shows or commercials ID (over TV)
  - Copyright violation (over web)
Major commercial players
  - Shazam, Soundhound, Intonow, Viggle…

-17-

Two Stages in AFP

Offline
  - Feature extraction
  - Hash table construction for songs in database
  - Inverted indexing
Online
  - Feature extraction
  - Hash table search
  - Ranked list of the retrieved songs/music

-18-

Robust Feature Extraction

Various kinds of features for AFP
  - Invariance along time and frequency
  - Landmark of a pair of local maxima
  - Wavelets
  - …
Extensive tests are required for choosing the best features

-19-

Representative Approaches to AFP

Philips
  - J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system”, ISMIR 2002.
Shazam
  - A. Wang, “An industrial-strength audio search algorithm”, ISMIR 2003.
Google
  - S. Baluja and M. Covell, “Content fingerprinting using wavelets”, Euro. Conf. on Visual Media Production, 2006.
  - V. Chandrasekhar, M. Sharifi, and D. A. Ross, “Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications”, ISMIR 2011.

-20-

Improvement on AFP

Re-ranking of AFP by learning to rank

Demo: http://mirlab.org/demo/audioFingerprinting

-21-

Shazam’s Method

Ideas
  - Take advantage of local structures in music
    - Find salient peaks on the spectrogram
    - Pair peaks to form landmarks for comparison
  - Efficient search by hash tables
    - Use positions of landmarks as hash keys
    - Use song ID and offset time as hash values
    - Use time constraints to find matched landmarks
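The hash-table idea can be illustrated with a small Python sketch. The landmark layout `(t1, f1, f2, dt)`, the choice of `(f1, f2, dt)` as the hash key, the function names, and the toy data are all assumptions made for this example, loosely following the description above; they are not Shazam's actual data format.

```python
from collections import Counter, defaultdict


def build_index(song_landmarks):
    """Offline stage: map each hash key to a list of (song ID, offset time)."""
    index = defaultdict(list)
    for song_id, landmarks in song_landmarks.items():
        for t1, f1, f2, dt in landmarks:
            index[(f1, f2, dt)].append((song_id, t1))
    return index


def search(index, query_landmarks):
    """Online stage: rank songs by the number of time-consistent matches."""
    votes = Counter()
    for t1, f1, f2, dt in query_landmarks:
        for song_id, song_t1 in index.get((f1, f2, dt), []):
            # Time constraint: true matches share one song-to-query offset.
            votes[(song_id, song_t1 - t1)] += 1
    best = Counter()
    for (song_id, _offset), count in votes.items():
        best[song_id] = max(best[song_id], count)
    return best.most_common()


# Toy database: each landmark is (t1, f1, f2, dt).
songs = {
    "song_a": [(0, 10, 20, 3), (5, 30, 25, 2), (9, 12, 40, 4), (14, 10, 20, 3)],
    "song_b": [(1, 10, 20, 3), (6, 50, 60, 5), (11, 30, 25, 2)],
}
# Query: an excerpt of song_a that starts 5 frames into the song.
query = [(0, 30, 25, 2), (4, 12, 40, 4), (9, 10, 20, 3)]
ranking = search(build_index(songs), query)
print(ranking)  # [('song_a', 3), ('song_b', 1)]
```

Counting votes per (song, offset) pair instead of per song is what filters out accidental hash collisions: song_b shares two keys with the query, but at inconsistent offsets.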

-22-

How to Find Salient Peaks

We need to find peaks that are salient along both frequency and time axes
  - Frequency axis: Gaussian local smoothing
  - Time axis: Decaying threshold over time

-23-

How to Find Initial Threshold?

Goal
  - To suppress neighboring peaks
Ideas
  - Find the local maxima of the magnitude spectra of the initial 10 frames
  - Superimpose a Gaussian on each local maximum
  - Find the maximum of all Gaussians

[Figure legend: Original signal, Positive local maxima, Final output]

-24-

How to Update the Threshold along Time?

  - Decay the threshold
  - Find local maxima larger than the threshold: these are the salient peaks
  - Define the new threshold as the max of the old threshold and the Gaussians passing through the active local maxima
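The threshold initialization and update steps can be sketched as a forward pass in pure Python. The function names, the Gaussian width `sigma`, the decay factor, and the toy spectrogram are assumptions chosen for illustration; the backward pass is not included.

```python
import math


def local_maxima(values):
    """Indices whose value is strictly greater than both neighbours."""
    return [i for i in range(1, len(values) - 1)
            if values[i] > values[i - 1] and values[i] > values[i + 1]]


def spread(values, peaks, sigma):
    """Max over Gaussians centred on each peak, scaled to the peak height."""
    out = [0.0] * len(values)
    for p in peaks:
        for i in range(len(values)):
            g = values[p] * math.exp(-0.5 * ((i - p) / sigma) ** 2)
            out[i] = max(out[i], g)
    return out


def salient_peaks(spectrogram, sigma=2.0, decay=0.9, init_frames=10):
    """Forward pass: return (frame index, freq index) of salient peaks."""
    bins = len(spectrogram[0])
    head = spectrogram[:init_frames]
    # Initial threshold from the max magnitude spectrum of the first frames.
    envelope = [max(frame[i] for frame in head) for i in range(bins)]
    threshold = spread(envelope, local_maxima(envelope), sigma)
    peaks = []
    for t, frame in enumerate(spectrogram):
        threshold = [decay * v for v in threshold]            # 1. decay
        active = [i for i in local_maxima(frame)
                  if frame[i] > threshold[i]]                 # 2. pick peaks
        peaks.extend((t, i) for i in active)
        gaussians = spread(frame, active, sigma)
        threshold = [max(old, new)
                     for old, new in zip(threshold, gaussians)]  # 3. update
    return peaks


def make_frame(peaks_at, bins=16, floor=0.1):
    """Toy magnitude spectrum: a flat floor with a few raised bins."""
    frame = [floor] * bins
    for i, v in peaks_at.items():
        frame[i] = v
    return frame


spectrogram = [make_frame({4: 1.0}) for _ in range(10)]
spectrogram.append(make_frame({4: 1.0, 10: 5.0}))  # strong peak at bin 10
spectrogram.append(make_frame({4: 1.0, 12: 1.0}))  # weak neighbour at bin 12
peaks = salient_peaks(spectrogram)
print(peaks)
```

In the toy data the strong peak at frame 10, bin 10 raises the threshold around it, so the weaker peak at frame 11, bin 12 is suppressed.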

-25-

Time-decaying Thresholds

[Figure: two spectrogram panels, "Forward pass" and "Backward pass"; x-axis: Frame index, y-axis: Freq index]

-26-

How to Pair Salient Peaks?

Target zone
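One way to read the target-zone idea is that each salient peak is paired only with later peaks that fall inside a bounded time-frequency region. The following Python sketch illustrates that reading; the zone limits `max_dt` and `max_df`, the `fan_out` cap, the function name, and the landmark layout `(t1, f1, f2, dt)` are hypothetical choices for the example, not values from the slides.

```python
def pair_peaks(peaks, max_dt=63, max_df=31, fan_out=3):
    """Pair each peak with up to fan_out later peaks inside its target zone.

    peaks: list of (frame index, freq index).
    Returns landmarks as (t1, f1, f2, dt).
    """
    peaks = sorted(peaks)
    landmarks = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # peaks are time-sorted, so later ones are too far
            if dt > 0 and abs(f2 - f1) <= max_df:
                landmarks.append((t1, f1, f2, dt))
                paired += 1
                if paired == fan_out:
                    break
    return landmarks


peaks = [(0, 10), (3, 20), (5, 80), (100, 12)]
landmarks = pair_peaks(peaks)
print(landmarks)  # [(0, 10, 20, 3)]
```

Limiting both the zone size and the number of pairs per peak keeps the number of landmarks, and therefore the hash table, manageable.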

-27-

Salient Peaks and Landmarks

Peak picking after forward smoothing

Matched landmarks (green)

(Source: Dan Ellis)

-28-

Landmarks for Hash Table Access

-29-

Optimization Strategies for AFP

Several ways to optimize AFP
  - Strategy for query landmark extraction
  - Confidence measure
  - Incremental retrieval
  - Better use of the hash table
  - Re-ranking for better performance

-30-

Demos of Audio Fingerprinting

Commercial apps
  - Shazam
  - Soundhound
Our demo
  - http://mirlab.org/demo/audioFingerprinting

-31-

QBSH vs. AFP

QBSH
  - Goal: MIR
  - Feature: Pitch
    - Perceptible
    - Small data size
  - Method: LS (linear scaling)
  - Database
    - Harder to collect
    - Small storage
  - Bottleneck
    - CPU/GPU-bound
AFP
  - Goal: MIR
  - Features: Landmarks
    - Not perceptible
    - Big data size
  - Method: Matched LM (landmarks)
  - Database
    - Easier to collect
    - Large storage
  - Bottleneck
    - I/O-bound

-32-

Conclusions

Successful applications in MIR
  - QBSH
  - AFP
Due to
  - Faster and bigger memory
  - Advances in GPU/CPU (Moore’s law)
  - New machine learning methods
Challenges in MIR
  - Audio melody extraction from polyphonic music
  - Database collection for QBSH
  - Cover song ID (which cannot be handled by AFP)
  - Polyphonic music transcription

-33-

Thank you for your attention!

Questions & comments?
