Singer similarity / identification Francois Thibault MUMT 614B McGill University.

Singer similarity / identification

Francois Thibault

MUMT 614B

McGill University

Introduction

Relatively easy for humans to identify a singing voice in various contexts

Difficult to find time- and environment-invariant features for robust automatic identification

Growing demand for such systems as network databases keep expanding

Background (1)

Significant research exists in speaker identification, but those systems perform poorly with singing voice (inadequate training)

Singer identification research can draw much from automatic instrument recognition systems

Artist / singer identification is much harder than song identification (because it requires context-invariant features)

Background (2)

Often builds on speech / music discrimination systems

Acoustical features heavily used to create an N-dimensional Euclidean space: loudness, pitch, brightness, bandwidth, harmonicity

Often uses the same tools as style identification because each singer corresponds to a 'micro' style
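
As an illustration of how a frame of audio becomes a point in such a feature space, the sketch below computes three of the listed features. The definitions are assumptions, not taken from the slides: loudness as RMS level, brightness as the spectral centroid, and bandwidth as the magnitude-weighted spread around the centroid. A naive DFT keeps the example dependency-free; pitch and harmonicity are left out.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2, computed with a naive O(N^2) DFT."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectral_features(frame, sample_rate):
    """Return (loudness, brightness, bandwidth) for one frame.

    loudness   : RMS amplitude (assumed definition)
    brightness : spectral centroid in Hz (assumed definition)
    bandwidth  : magnitude-weighted spread around the centroid in Hz
    """
    n = len(frame)
    mags = magnitude_spectrum(frame)
    freqs = [k * sample_rate / n for k in range(len(mags))]
    loudness = math.sqrt(sum(x * x for x in frame) / n)
    total = sum(mags)
    if total == 0:
        return (loudness, 0.0, 0.0)  # silent frame: no spectral shape to measure
    brightness = sum(f * m for f, m in zip(freqs, mags)) / total
    bandwidth = math.sqrt(
        sum(m * (f - brightness) ** 2 for f, m in zip(freqs, mags)) / total
    )
    return (loudness, brightness, bandwidth)
```

Two frames can then be compared by the Euclidean distance between their feature tuples, usually after scaling each feature to a comparable range.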

Kim and Whitman overview

Segmentation of vocal regions prior to singer identification algorithm

Assumes singing regions display strong harmonic energy in voice frequency range

Band-pass filter (200-2000 Hz)

Inverse comb filter bank to detect harmonicity

Identification classifier uses features based on LPC
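
A minimal sketch of harmonicity detection with a bank of inverse comb filters follows. Each filter y[n] = x[n] - x[n - T] cancels a signal that repeats every T samples, so a large drop in output energy for some T suggests a harmonic frame. The scoring rule and function names are assumptions for illustration, and restricting the candidate periods to 200-2000 Hz stands in for the band-pass filter.

```python
def inverse_comb_energy_ratio(signal, period):
    """Output/input energy ratio of the inverse comb filter y[n] = x[n] - x[n - period]."""
    residual = [signal[n] - signal[n - period] for n in range(period, len(signal))]
    energy_in = sum(x * x for x in signal[period:])
    energy_out = sum(y * y for y in residual)
    # With no input energy there is nothing to cancel; report "no attenuation".
    return energy_out / energy_in if energy_in > 0 else 1.0


def harmonicity(signal, sample_rate, f_lo=200.0, f_hi=2000.0):
    """Return (score, f0_estimate_hz).

    One inverse comb filter is tried per candidate period between f_hi and f_lo.
    score is 1 minus the smallest energy ratio (clamped to [0, 1]), so values
    near 1 mean some filter cancelled almost all of the frame.
    """
    min_period = max(1, int(sample_rate / f_hi))
    max_period = int(sample_rate / f_lo)
    best_ratio, best_period = min(
        (inverse_comb_energy_ratio(signal, p), p)
        for p in range(min_period, max_period + 1)
    )
    return 1.0 - min(best_ratio, 1.0), sample_rate / best_period
```

A frame would be labelled vocal when the score exceeds some threshold; that threshold is not given in the slides and would need tuning.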

K & W feature extraction

Determine formant location and amplitude with a 12-pole linear predictor using the autocorrelation method

Augments low-frequency resolution without increasing model order by warping the frequency representation with a function approximating the Bark scale
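
The autocorrelation method can be sketched with the Levinson-Durbin recursion, as below. This is plain (unwarped) LPC; the Bark-scale frequency warping is not included, and the function names are illustrative.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of a (windowed) frame."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc(frame, order=12):
    """Return (a, err) from the autocorrelation method.

    a[0] == 1 and A(z) = sum_k a[k] z^-k is the prediction-error filter,
    i.e. x[n] is predicted as -sum_{k>=1} a[k] * x[n - k].
    err is the final prediction-error energy.
    """
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:
            break  # silent or perfectly predicted frame: stop early
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err
```

Formant locations and amplitudes would then be read from the peaks of 1/|A(z)| on the unit circle, or from the angles and radii of the roots of A(z).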

K & W classification

Uses a Gaussian mixture model (GMM) to capture the behavior of a class

Parameters of the Gaussians determined by Expectation Maximization (EM)

Run PCA prior to EM (normalizes the data variance, good for EM)

SVMs compute the optimal hyperplane that can linearly separate classes
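
The GMM approach can be sketched as one mixture per singer, fitted with EM, with an unknown excerpt assigned to the singer whose mixture gives the highest log-likelihood. The sketch below is a simplification under stated assumptions: features are one-dimensional, initialization is deterministic (means at evenly spaced quantiles), and the PCA step and the SVM alternative are omitted.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, n_components=2, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns a list of (weight, mean, var)."""
    n = len(data)
    ordered = sorted(data)
    # Assumed initialization: means at evenly spaced quantiles, shared variance.
    means = [ordered[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    centre = sum(data) / n
    spread = max(sum((x - centre) ** 2 for x in data) / n, min_var)
    variances = [spread] * n_components
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(n_components):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk,
                min_var,
            )
    return list(zip(weights, means, variances))


def log_likelihood(data, gmm):
    """Total log-likelihood of the samples under the mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v) for w, m, v in gmm) + 1e-300)
        for x in data
    )


def classify(frames, models):
    """models maps singer name -> fitted GMM; returns the most likely singer."""
    return max(models, key=lambda singer: log_likelihood(frames, models[singer]))
```

In practice the features are multi-dimensional LPC-derived vectors, so each component would carry a mean vector and covariance matrix instead of two scalars.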

K & W results

Testbed contained more than 200 songs by 17 solo singers

Half for training, half for testing

Vocal segmentation inaccurate (~55%)

Experiments with GMM and SVM on complete songs and on vocal parts only

Overall results well short of human performance

K & W Experimental results

Liu and Huang overview

Singer classification of MP3 files

First segment audio into phonemes

Calculate a feature vector for each phoneme and store it with the associated singer for the training set

These feature vectors are used as discriminators for classification of unknown MP3 music objects

L & H System Architecture

L & H segmentation features

Phoneme segmentation is derived from polyphase filter coefficients by obtaining a frame energy (FE) measurement

L & H phoneme database

Phonemes are separated by a minimum in FE
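
Splitting at frame-energy minima can be sketched as below. Taking FE as the sum of squared polyphase subband samples in a frame is an assumption, as is the plain local-minimum test; both function names are illustrative.

```python
def frame_energy(subband_samples):
    """Frame energy (FE), assumed here to be the sum of squared subband samples."""
    return sum(s * s for s in subband_samples)


def segment_at_energy_minima(energies):
    """Split a sequence of frame energies at its local minima.

    Returns (start, end) frame-index pairs with end exclusive; each pair is
    one candidate phoneme.
    """
    boundaries = [0]
    for i in range(1, len(energies) - 1):
        # A boundary is a frame lower than its left neighbour and not higher
        # than its right neighbour.
        if energies[i] < energies[i - 1] and energies[i] <= energies[i + 1]:
            boundaries.append(i)
    boundaries.append(len(energies))
    return [(boundaries[j], boundaries[j + 1]) for j in range(len(boundaries) - 1)]
```

A real signal would likely need smoothing or a minimum segment length so that small energy ripples do not create spurious phonemes.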

L & H phoneme features

The phoneme features are obtained directly from the MDCT coefficients

L & H classification (1)

Compares phoneme features with those in the phoneme database

Discriminating radius (Euclidean distance) determines the uniqueness of a phoneme

Number of neighbors by the same singer within the discriminating radius is called frequency (w)
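
The two quantities can be sketched as follows. The slides do not say how the discriminating radius is obtained; the sketch assumes it is the distance from a phoneme to the nearest phoneme sung by a different singer. The function names are illustrative.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def radius_and_frequency(index, features, singers):
    """Return (discriminating_radius, w) for the phoneme at `index`.

    radius : distance to the nearest phoneme by another singer (assumption)
    w      : number of other phonemes by the same singer strictly inside it
    """
    target, singer = features[index], singers[index]
    other_distances = [
        euclidean(target, f) for f, s in zip(features, singers) if s != singer
    ]
    radius = min(other_distances) if other_distances else float("inf")
    w = sum(
        1
        for j, (f, s) in enumerate(zip(features, singers))
        if j != index and s == singer and euclidean(target, f) < radius
    )
    return radius, w
```

Under this reading, a phoneme with a large radius and a high w is both distinctive and typical of its singer.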

L & H classification (2)

kNN classifier used to guess the artist of unknown MP3 songs

For efficiency, only uses the first N phonemes of the unknown MP3

Find the k closest neighbors in the database and allow them to vote if their distance is within a threshold

For each neighbor, give a weighted vote dependent on frequency and distance

where w is frequency and
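
A minimal sketch of the thresholded, weighted kNN vote is given below. The exact weighting formula is not recoverable from the slide, so the vote w / (1 + distance) is a placeholder assumption that only preserves the stated dependence on frequency and distance. The database layout and parameter names are also illustrative.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_song(query_phonemes, database, k=3, threshold=1.0, n_first=None):
    """Guess the singer of an unknown song.

    query_phonemes : list of feature vectors from the unknown song
    database       : list of (feature_vector, singer, w) entries
    n_first        : use only the first N query phonemes (None = all)
    Returns the singer with the largest total vote, or None if nobody voted.
    """
    votes = defaultdict(float)
    for query in query_phonemes[:n_first]:
        neighbors = sorted(
            ((euclidean(query, feat), singer, w) for feat, singer, w in database),
            key=lambda item: item[0],
        )[:k]
        for dist, singer, w in neighbors:
            if dist <= threshold:
                # Placeholder weighting (assumption): more frequent and
                # closer neighbors contribute larger votes.
                votes[singer] += w / (1.0 + dist)
    return max(votes, key=votes.get) if votes else None
```

For example, with database entries clustered around two singers, a song whose first phonemes fall mostly near one cluster is attributed to that singer, while a song with no neighbor inside the threshold yields no decision.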

L & H results

3 influencing factors

Number of neighbors (N)

Threshold for vote decision

Number of singers in database

Other works…

Minnowmatch: MIR engine including artist classification using NN and SVM (Whitman, Flake, Lawrence (NEC))

Quest for ground truth in musical artist similarity: determine an accurate measure of similarity given the subjective nature of artist classification (Ellis, Whitman, Berenzweig, Lawrence)
