
Transcript of HWJH.taslp12

Page 1: HWJH.taslp12

1482 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 5, JULY 2012

A Tandem Algorithm for Singing Pitch Extraction and Voice Separation From Music Accompaniment

Chao-Ling Hsu, DeLiang Wang, Fellow, IEEE, Jyh-Shing Roger Jang, and Ke Hu, Student Member, IEEE

Abstract—Singing pitch estimation and singing voice separation are challenging due to the presence of music accompaniments that are often nonstationary and harmonic. Inspired by computational auditory scene analysis (CASA), this paper investigates a tandem algorithm that estimates the singing pitch and separates the singing voice jointly and iteratively. Rough pitches are first estimated and then used to separate the target singer by considering harmonicity and temporal continuity. The separated singing voice and estimated pitches are used to improve each other iteratively. To enhance the performance of the tandem algorithm for dealing with musical recordings, we propose a trend estimation algorithm to detect the pitch ranges of a singing voice in each time frame. The detected trend substantially reduces the difficulty of singing pitch detection by removing a large number of wrong pitch candidates either produced by musical instruments or the overtones of the singing voice. Systematic evaluation shows that the tandem algorithm outperforms previous systems for pitch extraction and singing voice separation.

Index Terms—Computational auditory scene analysis (CASA), iterative procedure, pitch extraction, singing voice separation, tandem algorithm.

I. INTRODUCTION

WHILE separating target voice from a monaural mixture of different sound sources appears effortless for the human auditory system, it is very difficult for machines and has been extensively studied for decades. In particular, separating singing voice from music accompaniment remains a major challenge.

Singing voice separation is, in a sense, a special case of speech separation and has many similar applications. For example, automatic speech recognition corresponds to automatic lyrics recognition [24], automatic speaker identification to automatic singer identification [30], and automatic subtitle alignment, which aligns speech with subtitles, to automatic lyric alignment [28], which can be used in a karaoke system. These applications also encounter similar problems: they perform substantially worse in the presence of background noise or music accompaniment.

Manuscript received December 01, 2010; revised May 23, 2011; accepted December 17, 2011. Date of publication January 02, 2012; date of current version March 02, 2012. Part of the work was conducted while C.-L. Hsu was visiting The Ohio State University. This work was supported in part by the National Science Council, Taiwan, under Grant NSC 96-2628-E-007-141-MY3 and in part by the Air Force Office of Scientific Research (AFOSR) under Grant FA9550-08-1-0155. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Tomohiro Nakatani.

C.-L. Hsu is with Mediatek, Inc., Hsinchu 30078, Taiwan (e-mail: [email protected]).

D. L. Wang and K. Hu are with the Department of Computer Science and Engineering and Center for Cognitive Science, The Ohio State University, Columbus, OH 43210 USA (e-mail: [email protected]; [email protected]).

J.-S. R. Jang is with the Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASL.2011.2182510

Compared to speech separation, separation of singing voice could be simpler with less pitch variation. On the other hand, there are several major differences. For speech separation, or the cocktail party problem, the goal is to separate the target speech from various types of background noise, which can be broadband or narrowband, periodic or aperiodic. In addition, the background noise is independent of speech in most cases, so their spectral contents are uncorrelated. For singing voice separation, the goal is to separate singing voice from music accompaniments, which in most cases are broadband, periodic, and strongly correlated to the singing voice. Furthermore, the upper pitch boundary of singing can be as high as 1400 Hz for soprano singers [29], while the pitch range of normal speech is between 80 and 500 Hz. These differences make the separation of singing voice and music accompaniment potentially more challenging.

For monaural singing voice separation, existing methods can be generally classified into three categories depending on their underlying methodologies: spectrogram factorization (e.g., [19], [22]), model-based methods (e.g., [16], [19]), and pitch-based methods (e.g., [13], [23]). Spectrogram factorization methods utilize the redundancy of the singing voice and music accompaniment by decomposing the input signal into a pool of repetitive components. Each component is then assigned to a sound source. Model-based methods learn a set of spectra from accompaniment-only segments. Spectra of the vocal signal are then learned from the sound mixture by fixing the accompaniment spectra. Pitch-based methods use extracted vocal pitch contours as the cue to separate the harmonic structure of the singing voice.

These methods have their limitations. Spectrogram factorization methods encounter difficulties in assigning repetitive components or bases to the corresponding sound sources. The performance drops significantly when the number of musical instruments increases. Furthermore, it is difficult to separate singing voice from a short mixture since vocal signals typically have more diverse spectra than those of each instrument [23]. Model-based methods require a considerable amount of accompaniment-only segments so that they can model the characteristics of the background music.

Compared to spectrogram factorization and model-based methods, pitch-based methods potentially have fewer limitations. The only required cue is the pitch contours of the singing voice, which can be extracted from a very short mixture and do not need the accompaniment-only parts.


Page 2: HWJH.taslp12


In [8], a predominant-F0 estimation method is proposed to estimate fundamental frequency (F0) using maximum a posteriori probability estimation and pitch trajectories based on pitch continuity. Li and Wang [13] proposed a computational auditory scene analysis (CASA) system which is effective for singing voice separation. In their system, pitch is detected based on a hidden Markov model (HMM). In [6], a source/filter model based approach is used to model the singing voice. This model, together with a music accompaniment model, is used to perform a maximum-likelihood estimation of the mixture. Pitch contours are estimated via a Viterbi-type algorithm based on source model parameters and pitch continuity. Tachibana et al. performed pitch estimation based on singing voice enhanced by harmonic/percussive sound separation. They use matched filtering to model pitch likelihoods and a Gaussian mixture model (GMM) to model pitch transitions [21]. In a model-based method [7], sinusoidal models are used to model harmonic partials given detected pitch. The models are used to create smooth amplitude and phase trajectories over time, and sinusoids are generated and summed to produce an estimate of the vocal signal. However, as pointed out in [23], a drawback of this method is that it will overestimate partials in the case of music interference.

In the aforementioned methods, either separation of singing voice is based on detected pitch or pitch detection benefits from enhanced singing voice. As pointed out in [4], this interdependency between pitch estimation and pitch-based separation creates a "chicken and egg" problem. In order to escape from this dilemma in the context of speech separation, Hu and Wang [11] recently proposed a tandem algorithm which performs pitch estimation and voice separation jointly and iteratively. It is observed that the target pitch can be estimated from a few harmonics of the target signal. On the other hand, one can separate some target signals without perfect pitch estimation. Thus, their strategy is to obtain a rough estimate of the target pitch first and then separate the target speech by considering harmonicity and temporal continuity. The separated speech and the estimated target pitch are then used to improve upon each other iteratively. In [11], they show a consistent performance improvement for all types of intrusion except rock music, presumably because of the strong harmonicity of the music accompaniment. This indicates that separating speech from music is challenging for their tandem algorithm.

In this study, we investigate and extend the tandem algorithm to separate the voiced portions of singing voice from music accompaniment. To improve the performance of pitch-based voice separation, the most important issue is to improve pitch estimation. Most pitch detection methods operate in a plausible pitch range chosen heuristically. The pitch range is usually large to cover most of the possible pitches of singing voice, such as from 80 Hz to 500 Hz. However, as mentioned earlier, the pitch of singing can be as high as 1400 Hz. On the other hand, it is unlikely that pitch changes over such a wide range in a short period of time.

To address the above problems, we propose a trend estimation algorithm to bound the singing pitch contours in a series of time–frequency (T-F) blocks that have much narrower pitch ranges compared to the entire possible range.

Fig. 1. Schematic diagram of the proposed system.

The estimated trend substantially reduces the difficulty of singing pitch detection by eliminating a large number of wrong pitch candidates.

The rest of this paper is organized as follows. In Section II, we give an overview of the extended tandem algorithm. In Section III, we describe the proposed trend estimation in detail. Two key steps of the tandem algorithm, mask estimation and pitch determination, are then presented in Section IV. The iterative procedure of the tandem algorithm is explained in Section V. Section VI describes singing voice detection. The systematic evaluation of pitch estimation and singing voice separation is given in Section VII. Finally, we conclude this work in Section VIII.

II. SYSTEM OVERVIEW

The proposed system is illustrated in Fig. 1. A trend estimation algorithm first estimates the pitch ranges of the singing voice. The estimated trend is then incorporated into the tandem algorithm to acquire the initial estimate of the singing pitch. The singing voice is then separated according to the initially estimated pitch. The above two stages, i.e., pitch determination and voice separation, then iterate until convergence. A postprocessing stage is introduced to deal with the "sequential grouping" problem, i.e., deciding which pitch contours belong to the target [2], [27], an issue unaddressed in the original tandem algorithm [11]. Finally, singing voice detection is performed to discard the nonvocal parts of the separated singing voice. As will be clear later, these improvements contribute to significantly better performance compared to the version for speech separation.

Our tandem algorithm detects multiple pitch contours and separates the singer by estimating the ideal binary mask (IBM), which is a binary matrix constructed using premixed source signals. In the IBM, 1 indicates that the singing voice is stronger than the interference in the corresponding time-frequency unit and 0 otherwise. The IBM has been shown to be a reasonable goal of CASA [25] and is used as a measure of ceiling performance for speech separation in many studies.

III. TREND ESTIMATION

This section describes the proposed trend estimation algorithm. The purpose of trend estimation is to bound the singing pitch in a plausible range. As a result of placing a tight bound on the pitch range, we hope to reduce spurious pitches caused by music accompaniments or higher harmonics. First, the singing voice is enhanced by considering temporal and spectral smoothness. As the fundamental frequency of a singing voice tends to be smooth across time, we bound the vocal F0s in a series of time–frequency blocks. These T-F blocks give rough pitch ranges along time which are much narrower than the possible pitch range.

Page 3: HWJH.taslp12



A. Vocal Component Enhancement

It can be observed in a spectrogram that the frequencies of sounds generated by harmonic musical instruments (e.g., violin) are very smooth along the temporal direction, while those from percussive instruments (e.g., drums) are smooth along the spectral direction. Ono et al. [15] handled harmonic/percussive source separation (HPSS) by using this characteristic to separate the mixture into harmonic components and percussive components. Specifically, HPSS is devised as an optimization problem that minimizes the following objective function:

$$J(H,P)=\frac{1}{2\sigma_H^{2}}\sum_{h,i}\big(H_{h-1,i}-H_{h,i}\big)^{2}+\frac{1}{2\sigma_P^{2}}\sum_{h,i}\big(P_{h,i-1}-P_{h,i}\big)^{2} \qquad (1)$$

under the constraint that $H_{h,i}+P_{h,i}=W_{h,i}$, where $H_{h,i}$ and $P_{h,i}$ are the spectrogram components of the harmonic and percussive sounds to be estimated, and $W_{h,i}$ is the spectrogram of the input mixture compressed by the exponent $2\gamma$, with $h$ and $i$ indexing the time frame and frequency bin, respectively. $\gamma$ is an exponential constant (around 0.6) chosen to imitate the auditory system.

Tachibana et al. [21] extended the original method to a multistage HPSS version to enhance the singing voice by tuning the smoothness along the time and frequency axes. The idea comes from the observation that the partials of singing voice are not as smooth as those of harmonic instruments along the temporal direction but much smoother than those of percussive instruments. They first apply a larger window (e.g., 256 ms) to compute the short-time Fourier transform (STFT) so that the spectrum has high spectral resolution and low temporal resolution. This biased resolution makes the partials of harmonic instruments smooth in the temporal direction since window lengths are long and bandwidths are narrow. Thus, more energy of harmonic instruments is assigned to the harmonic component. In other words, most of the energy of the singing voice and percussive sounds is assigned to the percussive component. Afterward, HPSS is applied to the percussive component again, but with a shorter window (e.g., 30 ms), to separate the singing voice from percussive sounds.

In this study, we only apply the first stage of the multistage HPSS, which attenuates the energy of harmonic instruments. The reason is that the sounds of percussive instruments are aperiodic and do not create much difficulty in estimating the target pitch. Fig. 2(a) and (b) shows the spectrograms of an input mixture before and after applying HPSS, respectively, where the solid lines represent the pitch contours of the singing voice. As can be seen, the energy of the sounds produced by harmonic instruments is attenuated significantly.
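For concreteness, the following sketch (Python/NumPy; not from the paper) evaluates the smoothness objective of (1) and minimizes it under the constraint H + P = W with a simple projected-gradient scheme. The step size, the sigma values, and the solver itself are assumptions for illustration; the method of [15] instead uses complementary-diffusion update rules.

```python
import numpy as np

def smoothness(X, axis, sigma):
    # Sum of squared first differences along `axis`, scaled as in (1).
    return np.sum(np.diff(X, axis=axis) ** 2) / (2.0 * sigma ** 2)

def hpss_objective(H, P, sigma_h=1.0, sigma_p=1.0):
    # J(H, P): harmonic part penalized for temporal roughness,
    # percussive part for spectral roughness (arrays are frames x bins).
    return smoothness(H, 0, sigma_h) + smoothness(P, 1, sigma_p)

def second_diff(X, axis):
    # Gradient of the squared-difference penalty: 2*X[n] - X[n-1] - X[n+1],
    # with boundary samples replicated (zero-flux boundaries).
    n = X.shape[axis]
    fwd = np.concatenate([np.take(X, np.arange(1, n), axis),
                          np.take(X, [n - 1], axis)], axis)
    bwd = np.concatenate([np.take(X, [0], axis),
                          np.take(X, np.arange(0, n - 1), axis)], axis)
    return 2 * X - fwd - bwd

def hpss(W, n_iter=200, step=0.1, sigma_h=1.0, sigma_p=1.0):
    # Minimize (1) with P = W - H, keeping 0 <= H <= W so that P >= 0.
    H = 0.5 * W
    for _ in range(n_iter):
        P = W - H
        grad = second_diff(H, 0) / sigma_h ** 2 - second_diff(P, 1) / sigma_p ** 2
        H = np.clip(H - step * grad, 0.0, W)
    return H, W - H

# Toy usage on a random stand-in for the compressed mixture spectrogram W.
rng = np.random.default_rng(0)
W = rng.random((200, 257))
H, P = hpss(W)
print(hpss_objective(H, P) <= hpss_objective(0.5 * W, 0.5 * W))  # True: objective decreased
```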

B. Pitch Range Estimation

The goal of this stage is to find a sequence of relatively tight pitch ranges in which the F0s of the singing voice are present. The main idea is to remove unreliable peaks that do not originate from periodic sounds and then the higher harmonics of the singing voice.

Fig. 2. Example of trend estimation. (a) The spectrogram of a song mixture. (b) The spectrogram after applying HPSS. Dashed lines show the extended boundaries of the estimated trend and solid lines show the ground truth pitch. (c) Results of MR-FFT. (d) Result after deleting harmonic peaks. (e) Magnitude-downsampled diagram. Color indicates the energy of a T-F block. The solid line indicates the optimal path found by DP.

The remaining peaks approximate fundamentals, and we estimate the F0 range by bounding the peaks in a sequence of T-F blocks.

First, we apply the multi-resolution fast Fourier transform (MR-FFT) proposed by Dressler [5]. The MR-FFT method analyzes sounds at different time-frequency resolutions by using different window lengths. Based on local sinusoidality criteria, it deletes unreliable peaks which do not originate from periodic sounds by considering local characteristics of the phase spectrum, or more precisely, the instantaneous frequencies of neighboring frequency bins.

Page 4: HWJH.taslp12


A detailed discussion of phase and instantaneous frequency can be found in [1].

Because some peaks in the same time frame may correspond to the same sinusoidal component, we first check the instantaneous frequencies of the peaks in each time frame. If their instantaneous frequencies are close enough (less than 0.2 semitone apart), the one with the largest magnitude is selected. Harmonics are then deleted based on the observation that a vocal F0 can only be the lowest-frequency peak within a frame. Fig. 2(c) shows the result of MR-FFT and the result of deleting harmonics is shown in Fig. 2(d).

The remaining peaks are used to estimate the pitch range of the singing voice. We first downsample the magnitudes of the peaks by summing the largest peak values of the frames within a T-F block, which is a rectangular area whose vertical side represents a frequency range and horizontal side a time duration. The entire plane is divided into a fixed set of T-F blocks with 50% overlap in time and in frequency. Then, given a multi-resolution spectrogram $Y(h,i)$ produced by MR-FFT, the downsampled magnitude $S(u,v)$ of the T-F block structure is defined as

$$S(u,v)=\sum_{h\in H_u}\max_{i\in I_v}Y(h,i),\quad H_u=\{uJ_h,\ldots,uJ_h+N_h-1\}\ \text{and}\ I_v=\{vJ_i,\ldots,vJ_i+N_i-1\} \qquad (2)$$

where $u$ and $v$ indicate the time and frequency indices of a T-F block, respectively, and $H_u$ and $I_v$ indicate the ranges of that T-F block in time and frequency, respectively. $N_h$ and $N_i$ are the numbers of time frames and frequency bins of a T-F block, respectively, and $J_h$ and $J_i$ are the shifts in frames and frequency bins between T-F blocks, respectively.

Note that, although $N_i$ is fixed for all T-F blocks, the bandwidths are different for T-F blocks with different frequency indices. This is because the frequency bins in MR-FFT are spaced by 0.25 semitone. In other words, a T-F block with a smaller frequency index has a narrower bandwidth, consistent with human audition.
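As a minimal sketch of the block downsampling in (2), assume a peak spectrogram Y of shape (frames, bins) in which non-peak bins are zero; the block and hop sizes used in the example are illustrative placeholders, not the paper's exact settings.

```python
import numpy as np

def block_magnitudes(Y, n_h, n_i, j_h, j_i):
    """S(u, v) of (2): within each T-F block, take the largest peak value in
    each frame and sum over the frames of the block. Blocks overlap by 50%
    when j_h = n_h // 2 and j_i = n_i // 2."""
    n_frames, n_bins = Y.shape
    n_u = max(1, (n_frames - n_h) // j_h + 1)
    n_v = max(1, (n_bins - n_i) // j_i + 1)
    S = np.zeros((n_u, n_v))
    for u in range(n_u):
        for v in range(n_v):
            block = Y[u * j_h : u * j_h + n_h, v * j_i : v * j_i + n_i]
            S[u, v] = block.max(axis=1).sum()  # largest peak per frame, summed
    return S

# Example: 8-semitone blocks (32 bins at 0.25 semitone/bin) with 50% overlap;
# the frame counts assume a 20-ms hop, which is an illustrative value.
Y = np.abs(np.random.randn(600, 480))
S = block_magnitudes(Y, n_h=75, n_i=32, j_h=37, j_i=16)
```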

Finally, we find an optimal path consisting of a sequence of T-F blocks that contain the largest magnitudes by using dynamic programming (DP). The problem is defined as finding an optimal path $p^{*}=(p_1,\ldots,p_U)$, where $p_u$ is the frequency index of the block chosen at time index $u$ and $U$ is the number of block time indices, that maximizes the following score:

$$p^{*}=\arg\max_{p_1,\ldots,p_U}\left[\sum_{u=1}^{U}S(u,p_u)-w\sum_{u=2}^{U}\left|p_u-p_{u-1}\right|\right] \qquad (3)$$

where the first term is the sum of the strengths of the T-F blocks along the path, and the second term controls the smoothness of the path through a penalty coefficient $w$. The larger $w$ is, the smoother the computed path. In this study, $w$ is set to half of the average strength of the largest $S(u,v)$ for each time index $u$ of the T-F blocks:

$$w=\frac{1}{2U}\sum_{u=1}^{U}\max_{v}S(u,v) \qquad (4)$$

and the Viterbi algorithm [18] is used to find the optimal path. Fig. 2(e) shows an example of a magnitude-downsampled diagram, where color represents the energy strength of a T-F block and the solid line indicates the optimal path found by DP.
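The path search can be sketched as the following dynamic program; the absolute-jump penalty follows the reconstruction of (3) given above and should be read as an assumed form of the smoothness term, and w defaults to the value in (4).

```python
import numpy as np

def optimal_trend_path(S, w=None):
    """Maximize the score of (3) over paths of frequency-block indices,
    given the block strengths S (time blocks x frequency blocks)."""
    n_u, n_v = S.shape
    if w is None:
        w = 0.5 * S.max(axis=1).mean()  # half the average per-time maximum, as in (4)
    score = np.zeros((n_u, n_v))
    back = np.zeros((n_u, n_v), dtype=int)
    score[0] = S[0]
    jump = np.abs(np.arange(n_v)[:, None] - np.arange(n_v)[None, :])  # |v - v'|
    for u in range(1, n_u):
        cand = score[u - 1][None, :] - w * jump   # cand[v, v'] = best score ending at v via v'
        back[u] = cand.argmax(axis=1)
        score[u] = S[u] + cand.max(axis=1)
    path = np.zeros(n_u, dtype=int)               # backtrack the optimal path
    path[-1] = int(score[-1].argmax())
    for u in range(n_u - 2, -1, -1):
        path[u] = back[u + 1, path[u + 1]]
    return path, w
```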

While the optimal path is a sequence of T-F blocks, the neighbors of these blocks also provide useful information that reveals the energy distribution of vocal partials, since the blocks are half overlapped. Specifically, given a T-F block $B(u,v)$ in the optimal path with frequency boundaries $F_l(u)$ and $F_h(u)$, we extend the lower and upper boundaries as follows:

$$\big[F_l'(u),F_h'(u)\big]=\begin{cases}\big[F_l(u),\,F_h(u)+\Delta\big]&\text{if }S(u,v+1)\gg S(u,v-1)\\ \big[F_l(u)-\Delta,\,F_h(u)\big]&\text{if }S(u,v-1)\gg S(u,v+1)\\ \big[F_l(u)-\Delta/2,\,F_h(u)+\Delta/2\big]&\text{otherwise}\end{cases} \qquad (5)$$

where $\Delta$ is the amount of extension and is set to 4 semitones in this study. The upper boundary $F_h(u)$ of the selected T-F block is extended if its upper neighbor $S(u,v+1)$ has much larger energy than its lower neighbor $S(u,v-1)$. In this case, most of the energy of a vocal partial is likely concentrated at higher frequencies, and we therefore extend the upper boundary to cover possible dynamics of vocal partials. The same procedure is also applied to extend a lower boundary. The estimated trend is then formed by these extended pitch ranges. This makes trend estimation capable of locating dominant partials in the first step and then extending its boundaries to tolerate the possible pitch changes of the singing voice. Fig. 2(b) shows the extended boundaries of the estimated trend (dashed lines). As can be seen, the estimated trend bounds the singing F0s successfully.

IV. MASK ESTIMATION AND PITCH DETERMINATION

Two key steps of our tandem algorithm are: 1) IBM estimation given target pitch and 2) pitch determination given a binary mask. Since the results of one stage are the inputs to the other, the two stages are used to improve each other iteratively. This section describes these two stages in detail.

A. Front-End Processing

We first perform time–frequency decomposition using settings similar to those in [11]. The input song mixture is decomposed into 128 channels using the gammatone filterbank [17] whose center frequencies are quasi-logarithmically spaced from 50 Hz to 8 kHz. The signals in different frequency channels are then split into 40-ms frames with 20-ms overlap. Let $u_{c,m}$ denote a T-F unit at channel $c$ and frame $m$, and $x(c,t)$ the filtered signal at channel $c$ and time $t$. The corresponding normalized correlogram $A(c,m,\tau)$ at $u_{c,m}$ is computed by the following autocorrelation function (ACF):

$$A(c,m,\tau)=\frac{\sum_{n}x(c,mT_m+nT_s)\,x(c,mT_m+nT_s+\tau T_s)}{\sqrt{\sum_{n}x^{2}(c,mT_m+nT_s)\,\sum_{n}x^{2}(c,mT_m+nT_s+\tau T_s)}} \qquad (6)$$

Page 5: HWJH.taslp12


where $\tau$ is the time delay, $T_m$ is the frame shift, and $T_s$ is the sampling time. The above summation is over 40 ms, the length of a time frame. The peaks of the ACF indicate the periodicity of the filter response, and the corresponding delays indicate the periods.

Two adjacent channels triggered by the same harmonic have highly correlated responses [3], [26], and we compute the cross-channel correlation between $u_{c,m}$ and $u_{c+1,m}$ by

$$C(c,m)=\frac{\sum_{\tau}\big(A(c,m,\tau)-\bar{A}(c,m)\big)\big(A(c+1,m,\tau)-\bar{A}(c+1,m)\big)}{\sqrt{\sum_{\tau}\big(A(c,m,\tau)-\bar{A}(c,m)\big)^{2}\,\sum_{\tau}\big(A(c+1,m,\tau)-\bar{A}(c+1,m)\big)^{2}}} \qquad (7)$$

where $\bar{A}(c,m)$ denotes the average of $A(c,m,\tau)$ over $\tau$. At high frequencies, a filter responds to multiple harmonics and the filtered signal is amplitude-modulated [10]. We thus calculate an envelope ACF $A_E(c,m,\tau)$ using the envelopes of the filtered signals. The corresponding cross-channel correlation $C_E(c,m)$ is calculated similarly to (7). Here, we extract envelopes by halfwave rectification and bandpass filtering [11].
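A sketch of (6) and (7) for one channel and one frame is given below; the segment-based normalization follows the formula above, and it assumes the filter response extends at least max_lag samples past the frame end.

```python
import numpy as np

def normalized_acf(x_c, start, frame_len, max_lag):
    """A(c, m, tau) of (6): correlate the 40-ms frame starting at sample
    `start` with the same-length segment delayed by tau samples, normalized
    by the energies of both segments."""
    seg0 = x_c[start : start + frame_len]
    acf = np.zeros(max_lag + 1)
    for tau in range(max_lag + 1):
        seg1 = x_c[start + tau : start + tau + frame_len]
        denom = np.sqrt(np.sum(seg0 ** 2) * np.sum(seg1 ** 2)) + 1e-12
        acf[tau] = np.sum(seg0 * seg1) / denom
    return acf

def cross_channel_corr(acf_c, acf_c1):
    """C(c, m) of (7): correlation coefficient between the normalized
    correlograms of two adjacent channels at the same frame."""
    a, b = acf_c - acf_c.mean(), acf_c1 - acf_c1.mean()
    return float(np.sum(a * b) / (np.sqrt(np.sum(a * a) * np.sum(b * b)) + 1e-12))
```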

B. IBM Estimation Given Target Pitch

To estimate the IBM given the pitch of the target (singing voice), we first obtain a statistical model for generating the likelihoods of target-dominant and interference-dominant T-F units by a supervised learning scheme. Following [11], six features are extracted from each T-F unit, resulting in the feature vector shown in (8) at the bottom of the page, where the first three correspond to the filter response and the last three to the envelope response. The function int(·) returns the nearest integer. Previous work [10], [13] shows that $A(c,m,\tau(m))$ is a good indicator of the similarity between the estimated pitch period $\tau(m)$ and the response period of $u_{c,m}$. $f(c,m)$ is the average instantaneous frequency of the filter response within $u_{c,m}$. If the filter response has a period close to $\tau(m)$, $f(c,m)\tau(m)$ is close to an integer that indicates the harmonic number of the pitched sound. The second feature is the difference between $f(c,m)\tau(m)$ and its nearest integer, which measures how likely $u_{c,m}$ corresponds to a harmonic of the estimated pitch. The last three features are obtained similarly to the first three by replacing filter responses with response envelopes.

Let $H_1$ be the hypothesis that a T-F unit is target dominant and $H_0$ otherwise. $u_{c,m}$ is labeled as target if

$$P\big(H_1\mid\mathbf{x}(c,m)\big)>P\big(H_0\mid\mathbf{x}(c,m)\big) \qquad (9)$$

Since $P(H_1\mid\mathbf{x}(c,m))$ and $P(H_0\mid\mathbf{x}(c,m))$ sum to 1, we can perform the classification by

$$P\big(H_1\mid\mathbf{x}(c,m)\big)>0.5 \qquad (10)$$

For classification, we obtain $P(H_1\mid\mathbf{x}(c,m))$ by training, for each filter channel, a multilayer perceptron (MLP) [20] with one hidden layer of 5 units to estimate the posterior probability, the same configuration as in [11]. During MLP training, ground truth pitch is used.

If more than one pitch candidate is detected at a frame, an additional probability comparison at the target pitch period and the interference pitch period is made within $u_{c,m}$. Furthermore, the neighborhood of $u_{c,m}$ is considered for unit labeling. See [11] for details.
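The labeling rule can be sketched as follows. The six features are assembled from the quantities defined above, and any trained estimator of P(H1 | x) (e.g., a small per-channel MLP) can stand in for the classifiers of [11]; `posterior` below is a placeholder for its output.

```python
import numpy as np

def unit_features(A, A_E, f_hz, f_E_hz, tau_s, fs):
    """Feature vector of (8) for one T-F unit. A, A_E: correlograms (indexed
    by lag in samples) of the filter response and its envelope; f_hz, f_E_hz:
    average instantaneous frequencies in Hz; tau_s: estimated pitch period in
    seconds; fs: sampling rate in Hz."""
    lag = int(round(tau_s * fs))
    def triple(corr, f):
        h = f * tau_s                      # continuous harmonic number
        return [corr[lag], h - round(h), round(h)]
    return np.array(triple(A, f_hz) + triple(A_E, f_E_hz), dtype=float)

def label_unit(posterior):
    """(9)-(10): a unit is labeled target dominant iff P(H1 | x) > 0.5."""
    return posterior > 0.5
```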

C. Pitch Estimation Given Binary Mask

A common way to estimate target pitch is to sum the autocorrelations across all channels and then identify the most dominant peak in the summary correlogram [4]. This can be improved by calculating the summary correlogram only from target-dominant T-F units according to the given binary mask $M(c,m)$, where a value of 1 indicates that $u_{c,m}$ is dominated by the target and 0 otherwise. Also, as shown in [11], replacing the ACF with the estimated posterior probability of Section IV-B improves pitch estimation, so we estimate the pitch period by

$$\tau(m)=\arg\max_{\tau}\sum_{c}M(c,m)\,P\!\left(H_1\mid\mathbf{x}_{\tau}(c,m)\right) \qquad (11)$$

where $\mathbf{x}_{\tau}(c,m)$ denotes the feature vector of (8) computed with candidate pitch period $\tau$.

We have found that only 0.07% (about 0.5% for speech in [11]) of consecutive frames in singing voice have more than 20% relative pitch change in our training set. Hence, temporal continuity is used to check the reliability of the estimated pitch. If the pitch changes across three consecutive frames are less than 20%, the estimated pitch periods of these three frames are considered reliable. Otherwise, an unreliable pitch is reestimated by limiting the plausible pitch range to within 20% of a neighboring reliable pitch.
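The continuity check can be sketched as below; the re-estimation of an unreliable frame is represented only by the restricted search range around a neighboring reliable pitch, with the actual re-estimation of (11) left abstract.

```python
import numpy as np

def relative_change(p_prev, p_curr):
    return abs(p_curr - p_prev) / max(p_prev, 1e-9)

def check_continuity(periods, tol=0.2):
    """Mark a pitch period reliable if it lies in a run of three consecutive
    frames whose frame-to-frame relative changes are all below `tol`."""
    reliable = np.zeros(len(periods), dtype=bool)
    for m in range(1, len(periods) - 1):
        if (relative_change(periods[m - 1], periods[m]) < tol and
                relative_change(periods[m], periods[m + 1]) < tol):
            reliable[m - 1 : m + 2] = True
    return reliable

def constrained_range(p_reliable, tol=0.2):
    """Plausible period range used when re-estimating an unreliable frame
    next to a reliable one."""
    return (1 - tol) * p_reliable, (1 + tol) * p_reliable
```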

V. TANDEM ALGORITHM

Our tandem algorithm has the same general steps as in [11] but differs in the following ways: 1) the results of the trend estimation are provided to the tandem algorithm; 2) all statistical models used for estimating the IBM are retrained using music data instead of speech data; 3) input mixtures are preprocessed by HPSS; 4) since the tandem algorithm usually gives more than one pitch candidate for each frame, we add a sequential grouping step as postprocessing, which selects one pitch value as the target. In the following subsections, we describe the tandem algorithm by highlighting the above aspects.

A. Initial Estimation

The iterative procedure starts with pitch estimation. Instead of using the original mixture as in [11], we use the output from HPSS as the input signal.

$$\mathbf{x}(c,m)=\Big[A(c,m,\tau(m)),\ f(c,m)\tau(m)-\operatorname{int}\!\big(f(c,m)\tau(m)\big),\ \operatorname{int}\!\big(f(c,m)\tau(m)\big),\ A_E(c,m,\tau(m)),\ f_E(c,m)\tau(m)-\operatorname{int}\!\big(f_E(c,m)\tau(m)\big),\ \operatorname{int}\!\big(f_E(c,m)\tau(m)\big)\Big] \qquad (8)$$

Page 6: HWJH.taslp12


We first treat all T-F units with high cross-channel correlations as dominated by a single source. The estimated pitch period is then selected as the one supported by the most active (value 1) T-F units. A T-F unit is considered to support a pitch period $\tau$ if the corresponding $A(c,m,\tau)$ is higher than 0.75; this threshold is relaxed, if needed, to ensure that there is at least one active unit in each frame. The T-F units that do not meet the threshold are then used to estimate the second pitch period, if such units exist. Note that the possible pitch periods are now confined to the pitch range of the estimated trend. With the estimated pitch period $\tau$, the corresponding mask is reestimated by labeling a unit as target if (10) is satisfied.
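A sketch of this selection is shown below: for each candidate period allowed by the trend, count the active units whose correlogram value at that lag exceeds 0.75 and keep the best-supported period. The per-frame data layout is an assumption for illustration.

```python
import numpy as np

def initial_pitch(A_frame, active, lag_range, support_thresh=0.75):
    """A_frame: (channels x lags) normalized correlogram of one frame.
    active: boolean mask of units with high cross-channel correlation.
    lag_range: candidate pitch periods (in samples) allowed by the trend."""
    support = np.array([(A_frame[active, tau] > support_thresh).sum()
                        for tau in lag_range])
    best = int(lag_range[int(np.argmax(support))])
    supporters = A_frame[:, best] > support_thresh
    return best, active & supporters       # period and its supporting units
```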

After the above estimation, individual pitch periods are combined into pitch contours based on the temporal continuity of both the pitch periods and the corresponding masks. As a result of this step, we obtain multiple pitch contours and their associated T-F masks.

B. Iterative Estimation

The key idea of this step is to expand the pitch contours according to temporal continuity. Let a pitch contour contain a sequence of pitch points in a continuous set of frames, and let $M(c,m)$ be the associated mask at frame $m$. We first expand the mask by letting $M(c,m_1-1)=M(c,m_1)$ and $M(c,m_2+1)=M(c,m_2)$, where $m_1$ and $m_2$ are the first and last frames of the pitch contour, respectively. A new pitch point is then estimated from this new mask [11]. If the newly estimated pitch point passes the continuity criterion [11], it is considered a reliable pitch. Otherwise, it is discarded.

Since the pitch periods in the contour are reestimated, we update the mask for each pitch contour as described in Section IV-B. The above two steps thus iterate until the estimation of pitch and mask converges, i.e., the estimated pitch contours and the corresponding masks no longer change. This typically happens within 20 iterations.

We note here that the estimated pitches are not limited to the estimated trend during the iterative estimation. Thus, it is possible to detect pitch points outside the trend if they have high temporal continuity. However, most estimated pitch contours are still within the trend because of the temporal continuity criterion.

Fig. 3 shows the pitch estimation result of the tandem algorithm, where the input mixture is the same as that in Fig. 2. Fig. 3(a) shows the result without HPSS and trend estimation, Fig. 3(b) shows the result when only HPSS is applied, and Fig. 3(c) shows the result with both HPSS and trend estimation. Each color represents a pitch contour. Note that some pitch points around 3.8 seconds lie outside the trend and are correctly detected because they have high temporal continuity with the previous pitch points. The raw-pitch accuracies of Fig. 3(a)–(c) are 58%, 75%, and 90%, respectively. In this example, HPSS and trend estimation improve singing pitch extraction significantly.

C. Postprocessing

Since the pitch estimation algorithm outputs up to two pitch values for each frame, some pitch contours may overlap in time.

Fig. 3. Pitch estimation result of the tandem algorithm. (a) The result without HPSS and trend estimation. (b) The result with only HPSS. (c) The result with both HPSS and trend estimation. Different pitch contours are indicated by distinct colors.

This creates a sequential grouping problem in which each pitch contour has to be assigned either to the target voice or to the background noise. Sequential grouping is a difficult problem and existing methods are not yet mature [27]. Fortunately, the proposed trend estimation algorithm is able to remove most of the time-overlapped pitch contours because it is unlikely that more than one pitch candidate is dominant in a narrow pitch range [see Fig. 3(c)]. This makes the sequential grouping problem much easier to address.

We group pitch contours as follows. First, we assign pitch points to the target pitch track for the frames that have only one pitch candidate. For the frames that contain more than one pitch candidate, we pick the one supported by the most channels as the target pitch. Here, a channel is considered as supporting a pitch candidate if (10) is met. After that, we fill the gaps where no reliable pitch is estimated by using the most dominant pitch period from the initial estimation, and thus generate a continuous pitch track. The mask corresponding to the target pitch track forms the estimated IBM.

Page 7: HWJH.taslp12


VI. SINGING VOICE DETECTION

This stage employs a continuous HMM to decode an input mixture into vocal and nonvocal sections. We use the signals after applying HPSS, which attenuates the energy of the music accompaniment, instead of the original mixture. Given the feature vectors $X=(\mathbf{x}_1,\ldots,\mathbf{x}_T)$ of an input mixture, the problem is to find the most likely sequence of vocal/nonvocal states $S^{*}=(s_1,\ldots,s_T)$:

$$S^{*}=\arg\max_{s_1,\ldots,s_T}\prod_{t=1}^{T}a_{s_{t-1}s_t}\,b_{s_t}(\mathbf{x}_t) \qquad (12)$$

where $b_{s_t}(\mathbf{x}_t)$ is the output probability density function (pdf) of binary state $s_t$ (vocal or nonvocal), and $a_{s_{t-1}s_t}$ is the transition probability from state $s_{t-1}$ to $s_t$. The state output pdfs and features will be specified below in the evaluation section. As shown in [12], HMMs generally outperform static-classifier-based methods, such as GMMs or perceptrons, for speech detection.
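The decoding of (12) amounts to a two-state Viterbi pass; in the sketch below the GMM output pdfs are abstracted as precomputed per-frame log-likelihoods, and the initial state distribution is an added assumption.

```python
import numpy as np

def viterbi_vocal(loglik, log_trans, log_prior):
    """loglik: (T x 2) log b_s(x_t) for states {0: nonvocal, 1: vocal}.
    log_trans: (2 x 2) log a_{s,s'}. log_prior: (2,) initial log probabilities.
    Returns the most likely vocal/nonvocal state sequence of (12)."""
    T, S = loglik.shape
    delta = np.zeros((T, S))
    psi = np.zeros((T, S), dtype=int)
    delta[0] = log_prior + loglik[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + log_trans   # cand[s_prev, s]
        psi[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) + loglik[t]
    states = np.zeros(T, dtype=int)
    states[-1] = int(delta[-1].argmax())
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states
```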

VII. EVALUATION

In this section, we evaluate our proposed algorithm in four parts. The first part evaluates the performance of singing voice detection. The performance of trend estimation is evaluated in the second part. The third and fourth parts evaluate pitch estimation and singing voice separation, respectively.

We use MIR-1K, a publicly available dataset introduced in our previous work [9], to evaluate our proposed system. Although there are several publicly available datasets that are commonly used in speech separation, as far as we know, MIR-1K is the only one designed for singing voice separation. It contains 1000 song clips recorded at a 16-kHz sampling rate with 16-bit resolution. The duration of each clip ranges from 4 to 13 seconds, and the total length of the dataset is 133 minutes. These clips were extracted from 110 karaoke songs which contain a mixed track and a music accompaniment track. These songs were selected from 5000 Chinese pop songs and sung by 8 females and 11 males. Most of the singers are amateurs with no professional training. The music accompaniment and the singing voice were recorded in the left and right channels, respectively. The ground truth of the target pitch is estimated using the clean singing voice with manual adjustment. All songs are mixed at −5, 0, and 5 dB for evaluation.

A. Evaluation of Singing Voice Detection

1) Dataset Description: We use all 1000 clips of MIR-1K for training and evaluating the HMM for singing voice detection. The dataset is divided into two subsets of similar sizes (487 versus 513 clips, recorded by disjoint subjects) for two-fold cross validation. Singing voices and music accompaniments are mixed at 0-dB SNR to generate the training samples. Here the singing voice is the signal and the music accompaniment the noise.

2) Acoustic Features: 39-dimensional MFCCs (12 cepstral coefficients plus log energy, together with their first and second derivatives) are extracted from each frame. The MFCCs are computed from the STFT with a half-overlapped 40-ms Hamming window.

Fig. 4. Performance of singing voice detection. The precision, recall, and overall accuracy at −5, 0, and 5 dB SNR are shown in (a), (b), and (c), respectively.

Cepstral mean subtraction (CMS) is used to reduce channel effects.

3) Performance Measure: The performance of singing voice detection is represented by precision, recall, and overall accuracy. The recall for vocal frame detection is the percentage of frames correctly classified as vocal over all vocal frames; the precision is the percentage of frames correctly classified as vocal over the frames classified as vocal. The overall accuracy is the percentage of all frames correctly classified.
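For binary frame labels (1 = vocal), these three measures reduce to the following sketch:

```python
import numpy as np

def detection_scores(pred, truth):
    """Precision, recall, and overall accuracy of vocal-frame detection."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    tp = np.sum(pred & truth)
    precision = tp / max(np.sum(pred), 1)   # correct vocal / predicted vocal
    recall = tp / max(np.sum(truth), 1)     # correct vocal / actual vocal
    accuracy = np.mean(pred == truth)       # all correctly classified frames
    return precision, recall, accuracy
```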

4) Experimental Settings and Results: Two 32-component GMMs are trained for vocal frames and nonvocal frames, respectively. Both GMMs have diagonal covariance matrices. Parameters of the GMMs are initialized via the K-means clustering algorithm [14] and are iteratively adjusted via an expectation–maximization algorithm with 20 iterations. Each of the GMMs is considered as a state in a fully connected HMM, where the transition probabilities are obtained through frame counts of the labeled dataset. For a given new mixture, the Viterbi algorithm is used to decode the mixture into vocal and nonvocal segments.

Fig. 4 shows the performance of singing voice detection. We evaluate the algorithm for the signals mixed at −5, 0, and 5 dB. The gray bars and white bars show the average performance without and with HPSS as preprocessing, respectively. For each mean, the error bars show the 95% confidence intervals. As can be seen, the results with HPSS are uniformly better, especially at lower SNR levels. The results show that singing voice detection benefits significantly from HPSS, which attenuates the energy of the music accompaniment. We also tried other features such as perceptual linear prediction, but found no performance gain.

B. Evaluation of Trend Estimation

1) Dataset Description: All 1000 song clips in MIR-1K are used for evaluating the performance of trend estimation.

2) Performance Measure: The estimated trend is treated as correct if the ground truth pitch is located within the lower and upper bounds of the trend.

Page 8: HWJH.taslp12


Fig. 5. Results of trend estimation and pitch extraction for different algorithms. (a) The results for voiced parts of the recordings. (b) The results for whole recordings with vocal detection. (c) The results for whole recordings without vocal detection. An asterisk (*) after a method indicates the ceiling performance.

The correct rate is calculated over vocal frames.

3) Experimental Settings and Results: The duration and pitch range of each T-F block are set to 1.5 seconds and 8 semitones, respectively, with 50% overlap in both time and frequency. The bandwidth of the trend is equal to one octave (12 semitones) after boundary extension.

Fig. 5 shows the performance of trend estimation and singing pitch detection at different SNR levels. Means together with 95% confidence intervals are shown. As we can see in Fig. 5(a), our trend estimation performs very well. The accuracy of trend estimation rises slightly when the SNR increases, which is expected since the energy of vocal sounds is more prominent at higher SNRs. Nevertheless, trend estimation achieves at least 90% accuracy on average and is robust across SNRs.

C. Evaluation of Singing Pitch Extraction

1) Dataset Description: The dataset is divided into two subsets in the same way as described in Section VII-A for two-fold cross validation. Singing voices and music accompaniments are mixed at 0-dB SNR to generate the samples for training the MLP mentioned in Section IV-B.

2) Performance Measure: The estimated pitch is treated as correct if its difference from the ground truth pitch is less than 5% in Hz. In addition, we calculate the correct detection rate for voiced parts and for whole recordings with and without singing voice detection.
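The voiced-frame criterion corresponds to the following sketch (frequencies in Hz; zero marks an unvoiced ground-truth frame):

```python
import numpy as np

def raw_pitch_accuracy(est_f0, ref_f0, tol=0.05):
    """Fraction of voiced reference frames whose estimate is within 5% in Hz."""
    est, ref = np.asarray(est_f0, float), np.asarray(ref_f0, float)
    voiced = ref > 0
    correct = np.abs(est[voiced] - ref[voiced]) < tol * ref[voiced]
    return float(correct.mean()) if voiced.any() else 0.0
```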

3) Experimental Results: The results of singing pitch extraction are shown in Fig. 5, where the voiced-only results in Fig. 5(a) show the performance of different pitch extraction algorithms when voiced singing is present (according to the ground truth). Fig. 5(b) and (c) gives the overall results with and without singing voice detection, where an overall result is the percentage of all frames (voiced or unvoiced) with correct pitch detection; in other words, both voiced/unvoiced classification errors and pitch estimation errors (for correctly classified voiced frames) are counted in the overall results.

In each case, the performance of the tandem algorithm before and after postprocessing is shown. In addition, we show the performance of the original tandem algorithm [11] with and without applying HPSS; note that the plausible pitch range in [11] has been widened to 80–800 Hz to account for singing voice. The results of the singing pitch detection algorithm proposed by Li and Wang [13] are also given. An asterisk after a method indicates the ceiling performance of the algorithm, where we treat a result as correct if one of the two estimated pitch candidates is correct. The algorithms without an asterisk only produce one pitch value for a frame. Note that in the figure we slightly shift some curves to avoid overlapping.

As can be seen in Fig. 5(a), the results of the proposed algorithm with sequential grouping as postprocessing are very close to the ceiling performance before postprocessing. This confirms that our postprocessing step deals with the sequential grouping problem successfully, with the help of trend estimation which eliminates most of the overlapped pitch contours.

Compared with previous methods, our proposed algorithm performs substantially better. It is worth noting that applying HPSS to the original tandem algorithm improves its ceiling performance significantly, which shows that vocal enhancement using HPSS is clearly helpful for singing pitch extraction. However, our proposed algorithm still outperforms it in this case, which shows the effectiveness of our trend estimation. Comparing Fig. 5(b) and (c), the performance is significantly improved for all algorithms when singing voice detection is applied, which shows the importance of singing voice detection.

We also submitted a preliminary version of the pitch extraction algorithm to the Music Information Retrieval Evaluation eXchange (MIREX) 2010 audio melody extraction competition. Each submission was tested on six datasets which are hidden from the participants. Our algorithm achieved the best average raw-pitch accuracy for vocal songs compared to the other 16 submissions from the last two years. Details can be found at http://nema.lis.illinois.edu/nema_out/mirex2010/results/ame/. These results indicate the effectiveness of our pitch detection on corpora in addition to MIR-1K.

Page 9: HWJH.taslp12


Fig. 6. Results of singing voice separation. (a) SNR gains for voiced parts of the recordings. (b) SNR gains for whole recordings with vocal detection.

Since we have improved both trend estimation and pitch extraction, we expect that better performance will be achieved by the proposed algorithm.

D. Evaluation of Singing Voice Separation

1) Performance Measure: To compare the waveforms directly, we measure the SNR of the separated singing voice in decibels [10]:

$$\mathrm{SNR}=10\log_{10}\frac{\sum_{t}s^{2}(t)}{\sum_{t}\big(s(t)-\hat{s}(t)\big)^{2}} \qquad (13)$$

where $s(t)$ is the target signal, which is generated by applying an all-one mask to the clean singing voice, and $\hat{s}(t)$ is the separated singing voice.
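Equation (13) corresponds directly to:

```python
import numpy as np

def snr_db(target, estimate):
    """SNR of (13): `target` is the all-one-mask resynthesis of the clean
    voice, `estimate` is the separated voice; both are time-aligned waveforms."""
    target, estimate = np.asarray(target, float), np.asarray(estimate, float)
    noise = target - estimate
    return 10.0 * np.log10(np.sum(target ** 2) / (np.sum(noise ** 2) + 1e-12))
```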

2) Experimental Results: Fig. 6 shows the SNR gains of the separated singing voice for different methods. Fig. 6(a) and (b) shows the results for voiced parts and for whole recordings with singing voice detection, respectively. The IBM results show the ceiling performance of a binary-mask-based algorithm. The results with ideal pitch show the performance when the ground truth pitch is given to the tandem algorithm. The performance of the proposed algorithm is shown by the circle line. The results of a previous CASA-based singing voice separation method [13] and a model-based method proposed by Ozerov et al. [16] are also shown for comparison. The latter results were generated by the authors of [16].

As can be seen, the proposed algorithm outperforms the previous systems significantly. Even with the provided target pitch, the Li–Wang system does not perform as well as the proposed algorithm except for 5 dB SNR. This is because classification-based mask estimation performs much better than the rule-based one used in their system. By providing ideal pitch to both the proposed and the Li–Wang systems, we can see that the separation results of the current system are significantly better and much closer to the IBM results.

VIII. CONCLUSION

We have proposed an extended tandem algorithm that estimates target pitch and separates singing voice from music accompaniment iteratively. Coupled with the proposed trend estimation, the tandem algorithm is improved significantly for both pitch estimation and voice separation in musical recordings. Systematic evaluation shows that the proposed algorithm performs significantly better than a previous CASA system and a model-based system. Together with our previous system for unvoiced singing voice separation [9], we have a CASA system that separates both voiced and unvoiced portions of singing voice from music accompaniment.

In this study, we use the MIR-1K corpus, which contains synthetically created song mixtures so that the ground truth is available for evaluation purposes [9]. Whether the observed performance on MIR-1K extends to recorded mixtures with professional singers needs to be investigated in future research. Another interesting issue is the tradeoff between the pitch range of a trend and the utility of trend estimation in pitch estimation. A broader range increases the chance of enclosing the true pitch point, leading to more accurate trend estimation; on the other hand, it is less useful as a constraint for pitch estimation. In this paper, trend estimation is performed prior to pitch estimation. Future work needs to analyze this tradeoff to see if there is an optimal trend range for pitch detection.

REFERENCES

[1] T. Abe and M. Honda, "Sinusoidal model based on instantaneous frequency attractors," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1292–1300, Jul. 2006.

[2] A. Bregman, Auditory Scene Analysis. Cambridge, MA: MIT Press, 1990.

[3] G. J. Brown and M. Cooke, "Computational auditory scene analysis," Comput. Speech Lang., vol. 8, pp. 297–336, 1994.

[4] A. de Cheveigné, "Multiple F0 estimation," in Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, D. L. Wang and G. J. Brown, Eds. Hoboken, NJ: Wiley and IEEE Press, 2006, pp. 45–79.

[5] K. Dressler, "Sinusoidal extraction using an efficient implementation of a multi-resolution FFT," in Proc. Int. Conf. Digital Audio Effects, 2006, pp. 247–252.

[6] J.-L. Durrieu, G. Richard, and B. David, "Singer melody extraction in polyphonic signals using source separation methods," in Proc. IEEE ICASSP, 2008, pp. 169–172.

[7] H. Fujihara, M. Goto, J. Ogata, K. Komatani, T. Ogata, and H. G. Okuno, "Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals," in Proc. IEEE Int. Symp. Multimedia, San Diego, CA, 2006, pp. 257–264.

[8] M. Goto, "A real-time music-scene-description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals," Speech Commun., vol. 43, no. 4, pp. 311–329, 2004.

[9] C.-L. Hsu and J.-S. R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 2, pp. 310–319, Feb. 2010.

[10] G. Hu and D. L. Wang, "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Trans. Neural Netw., vol. 15, no. 5, pp. 1135–1150, Sep. 2004.

[11] G. Hu and D. L. Wang, "A tandem algorithm for pitch estimation and voiced speech segregation," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 8, pp. 2067–2079, Nov. 2010.

[12] J. Li and C.-H. Lee, "On designing and evaluating speech event detectors," in Proc. Interspeech, Lisbon, Portugal, Sep. 2005.

[13] Y. Li and D. L. Wang, "Separation of singing voice from music accompaniment for monaural recordings," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1475–1487, May 2007.

Page 10: HWJH.taslp12


[14] S. P. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inf. Theory, vol. IT-28, no. 2, pp. 129–137, Mar. 1982.

[15] N. Ono, K. Miyamoto, J. Le Roux, H. Kameoka, and S. Sagayama, "Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram," in Proc. EUSIPCO, 2008.

[16] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval, "Adaptation of Bayesian models for single channel source separation and its application to voice/music separation in popular songs," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp. 1564–1578, Jul. 2007.

[17] R. D. Patterson, J. Holdsworth, I. Nimmo-Smith, and P. Rice, "An efficient auditory filterbank based on the gammatone function," MRC Appl. Psychol. Unit, 1988, Rep. 2341.

[18] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.

[19] B. Raj, P. Smaragdis, M. V. Shashanka, and R. Singh, "Separating a foreground singer from background music," in Proc. Int. Symp. Frontiers Res. Speech Music (FRSM), Mysore, India, 2007.

[20] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA: MIT Press, 1986, pp. 318–362.

[21] H. Tachibana, T. Ono, N. Ono, and S. Sagayama, "Melody line estimation in homophonic music audio signals based on temporal-variability of melody source," in Proc. IEEE ICASSP, 2010, pp. 425–428.

[22] S. Vembu and S. Baumann, "Separation of vocals from polyphonic audio recordings," in Proc. ISMIR, 2005, pp. 337–344.

[23] T. Virtanen, A. Mesaros, and M. Ryynänen, "Combining pitch-based inference and non-negative spectrogram factorization in separating vocals from polyphonic music," in Proc. SAPA, Brisbane, Australia, 2008.

[24] C. K. Wang, R. Y. Lyu, and Y. C. Chiang, "An automatic singing transcription system with multilingual singing lyric recognizer and robust melody tracker," in Proc. Eurospeech, Geneva, Switzerland, 2003.

[25] D. L. Wang, "On ideal binary mask as the computational goal of auditory scene analysis," in Speech Separation by Humans and Machines, P. Divenyi, Ed. Norwell, MA: Kluwer, 2005, pp. 181–197.

[26] D. L. Wang and G. J. Brown, "Separation of speech from interfering sounds based on oscillatory correlation," IEEE Trans. Neural Netw., vol. 10, no. 3, pp. 684–697, May 1999.

[27] D. L. Wang and G. J. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Hoboken, NJ: Wiley and IEEE Press, 2006.

[28] Y. Wang, M.-Y. Kan, T. L. Nwe, A. Shenoy, and J. Yin, "LyricAlly: Automatic synchronization of acoustic musical signals and textual lyrics," in Proc. 12th Annu. ACM Int. Conf. Multimedia, New York, 2004, pp. 212–219.

[29] J. Sundberg, "The perception of singing," in The Psychology of Music, D. Deutsch, Ed., 2nd ed. New York: Academic, 1999, pp. 171–214.

[30] T. Zhang, "System and method for automatic singer identification," in Proc. IEEE ICME, 2003, pp. 33–36.

Chao-Ling Hsu received the B.S. degree in computer science and information engineering from Chung-Hua University, Hsinchu, Taiwan, in 2003, and the Ph.D. degree in computer science from National Tsing Hua University, Hsinchu, Taiwan, in 2011.

In 2010, he was a Visiting Scholar in the Department of Computer Science and Engineering, The Ohio State University (OSU), Columbus. He is currently a Senior Engineer at Mediatek, Inc., Hsinchu, Taiwan, developing audio algorithms and related products. His research interests include melody recognition, music signal processing, computational auditory scene analysis, and speech enhancement.

DeLiang Wang (F'04), photograph and biography not available at the time of publication.

Jyh-Shing Roger Jang received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1984, and the Ph.D. degree from the Electrical Engineering and Computer Science Department, University of California at Berkeley, in 1992.

He was with The MathWorks, Inc., from 1993 to 1995, and coauthored the Fuzzy Logic Toolbox. Since 1995, he has been with the Department of Computer Science, National Tsing Hua University, Taiwan. He has published several books, including Neuro-Fuzzy and Soft Computing (Prentice Hall, 1997), MATLAB Programming (2004, in Chinese), and JavaScript Programming and Applications (2007, in Chinese). He has also maintained two online tutorials, on "Audio Signal Processing and Recognition" and "Data Clustering and Pattern Recognition." His research interests include speech recognition/assessment, music analysis and retrieval, face detection/recognition, pattern recognition, neural networks, and fuzzy logic.

Ke Hu (S'09) received the B.E. and M.E. degrees in automation from the University of Science and Technology of China, Hefei, in 2003 and 2006, respectively, and the M.S. degree in computer science and engineering from The Ohio State University, Columbus, in 2010, where he is currently pursuing the Ph.D. degree.

His research interests include monaural speech separation, statistical modeling, and machine learning.