
Int. J. Appl. Math. Comput. Sci., 2016, Vol. 26, No. 1, 203–213DOI: 10.1515/amcs-2016-0014

NOTE ONSET DETECTION IN MUSICAL SIGNALS VIANEURAL–NETWORK–BASED MULTI–ODF FUSION

BARTŁOMIEJ STASIAK a,∗, JEDRZEJ MONKO a, ADAM NIEWIADOMSKI a

aInstitute of Information TechnologyŁódz University of Technology, ul. Wólczanska 215, 90-924 Łódz, Poland

e-mail: [email protected]

The problem of note onset detection in musical signals is considered. The proposed solution is based on known approaches in which an onset detection function is defined on the basis of spectral characteristics of audio data. In our approach, several onset detection functions are used simultaneously to form an input vector for a multi-layer non-linear perceptron, which learns to detect onsets in the training data. This is in contrast to standard methods based on thresholding the onset detection functions with a moving average or a moving median. Our approach is also different from most of the current machine-learning-based solutions in that we explicitly use the onset detection functions as an intermediate representation, which may therefore be easily replaced with a different one, e.g., to match the characteristics of a particular audio data source. The results obtained for a database containing annotated onsets for 17 different instruments and ensembles are compared with state-of-the-art solutions.

Keywords: note onset detection, onset detection function, multi-layer perceptron, multi-ODF fusion, NN-based fusion.

1. Introduction

Segmentation may be deemed one of the most important skills attributed to intelligence, either human or artificial. Assigning significance to some spatially or temporally correlated groups of data in an image or a sound file is an elementary step of analysis, providing grounds for feature extraction, description and, eventually, comprehension.

A fundamental stage in the audio segmentation process is onset detection or, especially in music information retrieval (MIR), note onset detection. It is used as a starting point in numerous practical applications, including rhythm and tempo analysis (Laroche, 2003; Peeters, 2005), query-by-humming (QbH) music search engines (Huang et al., 2008; Typke et al., 2007), support systems in music education (Zhang and Wang, 2009; Yin et al., 2005) and parametric audio coding (Bartkowiak and Januszkiewicz, 2012).

Note onsets are tightly related to attack transients in musical signals. This is due to the fact that the sound produced by a musical instrument is a non-stationary signal in a short period of time after excitation occurs. Detection of transients, in particular attack transients,

∗Corresponding author

is applicable in musical signal processing due to the fact that during transient states the magnitude and the phase of a signal tend to change rapidly. However, the precise definition of the onset time, making it possible to unambiguously locate it on the time axis, is not a straightforward task (Bello et al., 2005; Lerch, 2012). Various definitions, including perceptual onset time (POT), perceptual attack time (PAT), acoustic onset time (AOT) and note onset time (NOT), have been proposed (Repp, 1996; Lerch, 2012) in order to highlight differences between the time when the onset is perceivable by a human listener, when it is measurable by audio monitoring devices, and simply when the note-on command is triggered by a MIDI synthesizer.

Analysis of polyphonic music is especially difficult, due to natural limitations of performers' precision in playing several notes simultaneously. Moreover, the onset-specific type of change in the temporal and spectral characteristics of a sound varies significantly for different instruments and types of articulation. For example, pitched non-percussive (PNP) sounds, such as those produced by bowed instruments, are generally considered more difficult to analyze than pitched/non-pitched percussive ones (PP/NPP), as intensity-related features may not be



sufficient for successful segmentation (Zhang and Wang, 2009; Collins, 2005). Finally, it should be remembered that musical instruments are basically modelled by complex systems of linear and non-linear ordinary as well as partial differential equations with time-varying parameters (Rabenstein and Petrausch, 2008), and the complete description of the physical phenomena occurring in onset-related transient states is definitely nontrivial. All these factors make the task of building a universal onset detector a real challenge, which justifies searching for new, machine-learning-based methods able to deal with the uncertainties inherent in the formulation of the problem.

The classical approach to the onset detection task consists of constructing an onset detection function (ODF), also known as a novelty function, and picking the peaks of the ODF, which indicate the occurrence of something new in the signal (Bello et al., 2005; Bello and Sandler, 2003; Duxbury et al., 2003; Laroche, 2003). In this work we propose a novel solution, combining the ODF-based approach with machine learning. One of the key advantages of our method (NN-based multi-ODF fusion) is the simultaneous application of many ODFs, which allows covering a broad range of onset-relevant information.

The remainder of the paper is organized as follows. The next section presents methods and algorithms proposed in the literature, with an emphasis on neural-network-based solutions (Section 2.3). In this context, our approach is presented (Section 3), along with the description of the audio dataset selected for testing, details of the data preparation procedure and testing schemes. The obtained results are displayed and discussed (Section 4), and some conclusions and future perspectives are formulated in the last section.

2. Previous work on onset detection

2.1. Methods based on onset detection functions. Most of the ODF construction methods found in the literature utilize information about the magnitude and/or phase of STFT (short-time Fourier transform) frequency bins in consecutive frames for finding spectrum changes indicating note onset occurrences. For example, one of the simplest approaches, known as the spectral flux, is based on a sum of half-wave rectified differences between the k-th magnitude spectrum bins of two consecutive STFT frames (Bello et al., 2005):

SF(n) = Σk hwr(|Xk(n)| − |Xk(n−1)|) ,   (1)

where Xk(n) is the k-th complex frequency bin of the n-th frame and

hwr(x) = (x + |x|) / 2

is the half-wave rectifier function, so that only positive changes in the magnitude are taken into account.
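As an illustration, the spectral flux of Eqn. (1) can be sketched in a few lines of NumPy. This is a minimal version; the function names and the array layout (frames × bins) are our own choices, not taken from the paper:

```python
import numpy as np

def half_wave_rectify(x):
    # hwr(x) = (x + |x|) / 2: passes positive values, zeroes out negative ones
    return (x + np.abs(x)) / 2.0

def spectral_flux(mag):
    """Spectral flux ODF, Eqn. (1).

    mag: 2-D array of STFT magnitude spectra, shape (frames, bins).
    Returns one ODF value per frame; the first frame has no predecessor,
    so its flux is set to 0.
    """
    diff = np.diff(mag, axis=0)               # |X_k(n)| - |X_k(n-1)|
    sf = half_wave_rectify(diff).sum(axis=1)  # sum over frequency bins k
    return np.concatenate(([0.0], sf))
```

Only bins whose magnitude grows between frames contribute, which is exactly the role of the half-wave rectifier.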

Most of the methods which involve information on phase changes rely on the differences between the predicted and actual phases of each frequency bin. This can be defined as

dϕk(n) = princarg[ϕk(n) − 2ϕk(n−1) + ϕk(n−2)] ,

where ϕk(n) is the phase of the k-th frequency bin of the n-th STFT frame and the princarg operator maps the argument to the [−π, π] range. In the case of musical signals, an ODF depending solely on phase information may be sensitive to changes in all spectrum bins regardless of their magnitude. Therefore, it is worth combining the information on the magnitude and the phase, e.g., by weighting the phase deviation coefficients dϕk(n) by magnitude changes:

WPD(n) = Σk |(|Xk(n)| − |Xk(n−1)|) · dϕk(n)| .   (2)
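A sketch of the weighted phase deviation of Eqn. (2), again with our own function names and a frames × bins complex STFT array as input:

```python
import numpy as np

def princarg(phi):
    # Map phase values to the principal range [-pi, pi]
    return np.mod(phi + np.pi, 2.0 * np.pi) - np.pi

def weighted_phase_deviation(X):
    """Weighted phase deviation ODF, Eqn. (2).

    X: complex STFT, shape (frames, bins). The first two frames have
    no phase history, so their ODF values are set to 0.
    """
    mag, phi = np.abs(X), np.angle(X)
    wpd = np.zeros(X.shape[0])
    for n in range(2, X.shape[0]):
        # Phase deviation dphi_k(n) from the two previous frames
        dphi = princarg(phi[n] - 2.0 * phi[n - 1] + phi[n - 2])
        wpd[n] = np.abs((mag[n] - mag[n - 1]) * dphi).sum()
    return wpd
```

For a perfectly stationary sinusoid (constant magnitude, linearly advancing phase) both factors vanish, so the ODF stays at zero, as intended.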

A more sophisticated method based on the phase spectrum was proposed by Bello and Sandler (2003), who used phase deviation coefficients to build a bin histogram of phase deviations for every STFT frame. Then the result may be calculated with some statistical characteristics (e.g., kurtosis) of such a distribution:

PHK(n) = Kurt(h(dϕ(n))) , (3)

where h is the bin histogram of the phase deviations.

There also exist methods which define the detection function on the basis of the complex spectrum, thereby taking into account both the amplitude and the phase. Referring to Duxbury et al. (2003), the detection function may be formulated as follows:

CD(n) = Σk |Ŝk(n) − Sk(n)| ,   (4)

where |Ŝk(n) − Sk(n)| is the magnitude of the complex difference between the expected (predicted) and the actual k-th frequency bin of the n-th STFT frame, with the prediction given by

Ŝk(n) = |Sk(n−1)| e^{j(2ϕk(n−1) − ϕk(n−2))} .
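The complex-domain ODF of Eqn. (4) can be sketched as follows (our naming; the prediction Ŝk(n) keeps the previous magnitude and extrapolates the phase linearly from the two previous frames):

```python
import numpy as np

def complex_domain(S):
    """Complex-domain ODF, Eqn. (4).

    S: complex STFT, shape (frames, bins). Each bin is compared with its
    prediction from the two previous frames (constant magnitude,
    linearly extrapolated phase); the first two frames are set to 0.
    """
    mag, phi = np.abs(S), np.angle(S)
    cd = np.zeros(S.shape[0])
    for n in range(2, S.shape[0]):
        # Predicted bin: previous magnitude, linearly extrapolated phase
        S_hat = mag[n - 1] * np.exp(1j * (2.0 * phi[n - 1] - phi[n - 2]))
        cd[n] = np.abs(S_hat - S[n]).sum()
    return cd
```

A steady sinusoid is predicted exactly and yields zero, while any sudden magnitude or phase change produces a peak.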

The resulting detection functions are processed with adaptive thresholding and peak-picking algorithms. A moving average or a moving median is usually preferred over a fixed threshold as it can follow the dynamics of a sound (Duxbury et al., 2003; Böck et al., 2012). Additionally, some methods for controlling the salience of a peak are often applied (Dixon, 2006). Nevertheless, unequivocal determination of the onsets is far from trivial, and both false positives (FP: onsets reported in places where no onset actually appears in the recording) and false negatives (FN: actual onsets that have not been reported)



are inherent to practically all the approaches proposed so far.

Having denoted the correctly located onsets by TP (true positives), the assessment of the quality of onset detection may be expressed in terms of precision, defined as the ratio TP/(TP+FP), and recall, defined as TP/(TP+FN). Note that too low a threshold value leads to reporting most of the peaks, including the irrelevant ones, and thus it results in excellent recall (low FN) but poor precision (high FP). The opposite outcome (low recall and high precision) is expected for too high a threshold value, which overshoots many relevant peaks. The harmonic mean of precision and recall, known as the F-measure, is therefore often reported as a "balanced" result of the onset detection procedure (Dixon, 2006; Böck et al., 2012).
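The three quality measures defined above can be computed directly from the TP/FP/FN counts (a small helper of our own, with zero-division guards):

```python
def f_measure(tp, fp, fn):
    """Precision, recall and their harmonic mean (the F-measure)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f
```

Being a harmonic mean, the F-measure is dominated by the weaker of the two components, which is what makes it a "balanced" summary of the precision/recall trade-off.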

2.2. Multi-ODF fusion. A separate research direction, especially related to our approach, is the fusion of several onset detection functions. This is accomplished either at the feature level, by a set of pre-defined rules or a linear combination of ODFs (Tian et al., 2014), or in the form of score-level fusion, in which the decisions are taken on the basis of the already computed onsets (Quintela et al., 2009; Tian et al., 2014). However, despite the apparent similarities to our solution (cf. Section 3), the differences are indeed very significant. Tian et al. (2014) deliberately refrain from using machine learning, while Quintela et al. (2009), although they apply, e.g., KNN- and SVM-based classifiers, operate on a completely different representation of the input data in the form of lists of pre-computed onset candidates and their locations in time. In this context, our neural-network-based approach relying on unprocessed raw ODF values presents an alternative point of view on the onset detection problem.

2.3. Methods based on machine learning. The popularity of machine learning applications for onset detection is growing rapidly, with some excellent results reported in recent research. Neural networks are the tool of choice (Lacoste and Eck, 2007; Böck et al., 2012), although other data-driven techniques have also been used (Davy and Godsill, 2002). The input data usually consist of a time-frequency representation of the sound signal, mapped non-linearly in the frequency domain according to a perceptual model. Böck et al. (2012) used a bank of triangular filters positioned at critical bands of the Bark scale to filter the STFT magnitude spectra, computed with three different window lengths in parallel. In this way, the redundancy resulting from the unnecessarily high frequency resolution of the STFT in the upper frequency range may be avoided. Hertz-to-mel scale mapping (Eyben et al., 2010) and the constant-Q transform (Lacoste and Eck, 2007) have also been applied for similar reasons. Nevertheless, the problem of dimensionality reduction of the data used as the neural network input is not fully resolved yet, forcing system designers to apply special preprocessing methods, including, e.g., random sampling of the input window along the time and frequency axes (Lacoste and Eck, 2007).

The structure of the neural networks used in the onset detection problem has often been subjected to extensive research, and some non-standard approaches have also been proposed. For instance, a multi-net approach proposed by Lacoste and Eck (2007) is based on merging the results obtained from several networks, each trained with a different set of hyper-parameters, by means of an additional "output" neural network followed by a peak-picking procedure. Apart from the standard questions regarding the number of hidden layers and hidden neurons, several different NN types, including the recurrent neural network (RNN), the feed-forward convolutional neural network (CNN) and the LSTM (long short-term memory) neural network, have been considered (Böck et al., 2012; Eyben et al., 2010; Schlüter and Böck, 2014).

3. Our solution: NN-based multi-ODF fusion

In our approach a neural network is applied in a different way to solve the onset detection problem. Instead of taking a pre-processed spectrogram as the raw input, we compute several onset detection functions and feed their values to the input of the neural network. The network summarizes the information from all the ODFs and generates its own onset probability estimation on this basis (Fig. 1).

This approach follows the standard division of a pattern recognition system into feature extraction and classification blocks. The main role of the feature extraction block is to compute the onset detection functions, which employ much more problem-specific a priori knowledge compared with approaches in which this knowledge has to be learnt directly from spectral data. In this way the construction of the classifier itself may be simplified and the input space dimensionality may be kept reasonably low. This is especially important as multi-dimensional data need more training examples, which, in the case of the onset detection problem, implies a laborious process of manual annotation of audio files (Daudet et al., 2004).

3.1. Dataset. The dataset collected by Pierre Leveau (Daudet et al., 2004) was chosen to test the effectiveness of our solution. The collection contains 17 audio files representing a variety of music styles and instruments. It was annotated by three expert listeners for a total of over 670 onsets, reported in the corresponding ground-truth files. It should be noted that



Fig. 1. Processing steps of our onset detection system. Top: audio data acquisition and spectrogram computation; bottom: construction of several onset detection functions, neural network training and thresholding of the output.

all the three annotations had to be consistent for any given onset to be included in the database. For this reason, some of the onsets are missing from the ground-truth files if their timing differed between the annotators by more than a predefined value (100 ms).

3.2. Data preparation. A basic tool used in our solution is a multi-layer perceptron (MLP) with one hidden layer and a non-linear (unipolar sigmoid) activation function. As has been stated before, the input of the neural network is based on the data obtained from several onset detection functions aligned in time and sampled uniformly within a sliding window. In fact, no explicit sampling is necessary, as the ODFs are already extracted from the audio signal on a per-frame basis. In our approach the original audio files, recorded at fs = 44.1 kHz, were cut into frames of size N = 2048 with a half-frame overlap, which resulted in computing the ODF samples every K ms, where

K = (1/fs) · (N/2) ≈ 23.22 .   (5)
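The value of K in Eqn. (5) follows directly from the frame length and the half-frame overlap; a quick numerical check (with the hop expressed in samples):

```python
fs = 44100      # sampling rate [Hz]
N = 2048        # STFT frame length [samples]
hop = N // 2    # half-frame overlap -> hop of 1024 samples

# ODF sampling period of Eqn. (5), expressed in milliseconds
K_ms = 1000.0 * hop / fs
print(round(K_ms, 2))  # 23.22
```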

Four onset detection functions, defined by the formulas (1)–(4), were included in the study (Fig. 2, left column), although any other type and number of ODFs may be applied as well. A sliding window with a fixed number of ns = 5 samples for each of the four ODFs is used. A single input vector is therefore composed of 20 samples plus four additional values computed as arithmetic means of nm = 21 ODF samples in the neighborhood of the sliding window. In this way we incorporate the concept of a moving mean into our solution, so that the classifier may also benefit from the

long-term information related to the average level of the signal in a given interval. Finally, the neural network has 24 inputs, and the input vector for a given location of the sliding window, denoted by n, takes the form of a concatenation of four vectors (cf. Fig. 2, right column):

x(n) = [vSF(n), vWPD(n), vPHK(n), vCD(n)] , (6)

defined as

vSF(n) = [SF(n−2), SF(n−1), SF(n), SF(n+1), SF(n+2), S̄F(n)],

vWPD(n) = [WPD(n−2), WPD(n−1), WPD(n), WPD(n+1), WPD(n+2), W̄PD(n)],

vPHK(n) = [PHK(n−2), PHK(n−1), PHK(n), PHK(n+1), PHK(n+2), P̄HK(n)],

vCD(n) = [CD(n−2), CD(n−1), CD(n), CD(n+1), CD(n+2), C̄D(n)],   (7)

where the last element of the vector vSF(n) is computed as the arithmetic mean

S̄F(n) = (1/nm) Σ_{k = n−⌊nm/2⌋}^{n+⌊nm/2⌋} SF(k) ,   (8)

and similarly for the remaining three vectors.
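The assembly of the 24-element input vector of Eqns. (6)–(8) can be sketched as follows. This is a minimal version with our own naming, assuming the four ODFs are already computed and that n lies far enough from the signal boundaries:

```python
import numpy as np

def input_vector(odfs, n, ns=5, nm=21):
    """Build the 24-element MLP input x(n) of Eqns. (6)-(8).

    odfs: list of four 1-D ODF arrays (SF, WPD, PHK, CD), aligned in time.
    For each ODF: ns consecutive samples centred at frame n, followed by
    the arithmetic mean of nm samples around n.
    """
    half_w, half_m = ns // 2, nm // 2
    parts = []
    for odf in odfs:
        window = odf[n - half_w : n + half_w + 1]       # 5 samples around n
        mean = odf[n - half_m : n + half_m + 1].mean()  # moving mean of 21
        parts.append(np.concatenate((window, [mean])))
    return np.concatenate(parts)                        # 4 * (5 + 1) = 24
</```

Each of the four six-element parts corresponds to one of the vectors in Eqn. (7).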



Fig. 2. Left column: four ODFs of a sax solo recording (G. Gershwin, Summertime, from Porgy and Bess, the beginning); their selected fragment, marked with black rectangles in the left column, is shown magnified in the right column. The vertical lines mark the ground-truth onsets.



Fig. 3. Target values (the circles) for a fragment of an audio file with onsets appearing in the middle of the 20th and the 30th frame. The continuous line represents the ideal model (Gaussian curve) of the onsets.

At the output of the network, we expect a single value indicating the probability that an onset appears in the center of the sliding window at a given position, i.e., within the n-th frame, assuming the input vector in the form defined by Eqn. (6). However, a binary response (onset present/not present) may lead to misclassification if the onset in the ground-truth data appears one frame before or after the actual ODF peak, which may easily happen due to unavoidable imprecision of the music annotation process. We therefore decided to define a soft condition for the onset presence, in which the target output value of the network is modeled as a Gaussian curve centered at the n-th frame. After some simplifications and rounding, the consecutive target values for an onset appearing in the n-th frame are set to (Fig. 3)

t(n) = 0.75,

t(n± 1) = 0.55,

t(n± 2) = 0.35, (9)

t(n± 3) = 0.25,

t(n± 4) = 0.25,

...

We decided to limit the range of output values to [0.25, 0.75] instead of [0, 1] because the unipolar sigmoid used as the neural activation function is unable to reach the endpoints of the latter interval, which might impede the learning process (Bishop, 1995).
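The target construction of Eqn. (9) can be sketched as below (the helper name is ours; offsets beyond ±2 coincide with the 0.25 floor, matching the rounded values listed above):

```python
import numpy as np

def target_curve(n_frames, onset_frames):
    """Soft (rounded-Gaussian) targets of Eqn. (9): 0.75 at an annotated
    onset frame, decaying towards the 0.25 floor a few frames away."""
    t = np.full(n_frames, 0.25)
    profile = {0: 0.75, 1: 0.55, 2: 0.35}  # offsets beyond 2 stay at 0.25
    for onset in onset_frames:
        for off, val in profile.items():
            for m in (onset - off, onset + off):
                if 0 <= m < n_frames:
                    t[m] = max(t[m], val)
    return t
```

The max() keeps the higher target when the neighborhoods of two close onsets overlap.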

The output of the neural network is treated as another onset detection function, for which the peak-picking and thresholding procedures must be applied (cf. Section 2.1). The obvious advantage with respect to the original ODFs used to construct the input data is that its range does not depend on the energy of the input signal, and hence a fixed threshold

T ∈ (0.25, 0.75) (10)

may be used instead of the moving average or median. We also do not have to consider the characteristics of each individual ODF, such as whether the onsets are indicated by local maxima or minima.
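The exact peak-picking procedure is not detailed here; one simple variant, consistent with the fixed threshold of Eqn. (10), is to report every local maximum of the network output that exceeds T:

```python
def pick_onsets(y, T=0.4):
    """Report frame n as an onset if the network output y(n) is a local
    maximum exceeding the fixed threshold T (cf. Eqn. (10))."""
    onsets = []
    for n in range(1, len(y) - 1):
        if y[n] > T and y[n] >= y[n - 1] and y[n] > y[n + 1]:
            onsets.append(n)
    return onsets
```

This is a sketch of the general idea rather than the authors' exact procedure; more elaborate peak-salience criteria (Dixon, 2006) could be plugged in at this point.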

3.3. Train/test procedures. Two training/testing schemes were applied:

1. In the first scheme, one instrument was removed from the dataset and all the remaining 16 were used to train the network (1-vs-all scheme). After the training had finished, the removed instrument was used to test the network, and only the results for this instrument ("unknown" in the learning phase) were reported. This procedure was repeated 17 times, so that each instrument was treated as the "unknown" one exactly once.

2. In the second scheme, a single audio file was used both for training and for testing in a 10-fold cross-validation procedure (c-v scheme). In this case all the remaining instruments were deliberately ignored and only a part (1/10) of the recording of the chosen one was treated as the "unknown" test data. The training was repeated 10 times, so that each 1/10 of the file was treated as the "unknown" one exactly once. The arithmetic means over all ten folds were reported, and the whole procedure was repeated for the remaining audio files.

The first scheme allows obtaining a universal onset detector, trained on a variety of sound sources. The task is more difficult here, as no data from the tested recording are used in the training stage, and the obtained results for some instruments may be suboptimal if their characteristics differ much from the general "population". One particular problem encountered during the tests was that the results obtained for the clarinet and the saxophone were significantly delayed with respect to the annotated onsets in the ground-truth files (Fig. 4). This observation corresponds to the delay of the signal energy increase with respect to the beginning of the embouchure, which is specific to some woodwind instruments.



Fig. 4. Relative differences in milliseconds between the annotations in the ground-truth files and the results obtained in the first testing scheme (1-vs-all). For each instrument the optimal shift of all the onset positions, yielding the best F-measure, is reported.

Fig. 5. Results: the first testing scheme (one instrument vs. all the others). The numbers in brackets represent the total number of annotated onsets for each instrument.

4. Results and discussion

The results obtained in the first testing scheme (1-vs-all), after pre-shifting the ground-truth onsets accordingly, are presented in Fig. 5. For each tested instrument, the training was repeated 10 times, and the average value is reported. These results were obtained after 100 training epochs in the off-line mode (Bishop, 1995). In fact, due to the large number of input vectors (ca. 10000), resulting from application of the sliding window (Fig. 2) to all the recordings from the training set, as few as 25 epochs were enough to reach the overall F-measure value of ca. 78% (Table 1). Several networks with various numbers of hidden neurons between 15 and 80 were tested, but the variation in the outcomes was relatively low. The presented results were obtained for 30 hidden units.

The results obtained in the second scheme (c-v) are shown in Fig. 6. Here the data used for testing (1/10 of the recording of a given instrument) were also disjoint from the training data (the remaining 9/10). The train/test cycle for a single instrument was repeated 10 times until every fragment of the recording had been used exactly once for testing. The value reported is the arithmetic mean over all ten folds. The number of training vectors is naturally much lower here compared with the first testing scheme: from 233 (distguit1 file) to 1169 (clarinet1) per fold, and therefore the number of epochs must be appropriately greater. A fixed number of 5000 epochs was set for all the instruments, although the learning dynamics and the speed of convergence varied greatly in many cases. This



Fig. 6. Results: the second testing scheme (cross-validation: 1/10 of a recording vs. the remaining 9/10). The numbers in brackets represent the total number of annotated onsets for each instrument. Note that the number of onsets in the classic3 dataset (four) appeared too small to successfully train the classifier in this case.

variability was observed during the tests in the obtained sequences of values of the training error E, defined as

E = √( (1/M) Σ_{n=1}^{M} |y(n) − t(n)|² ) ,   (11)

where M is the number of available audio frames, t(n) is the expected target value (Eqn. (9)) and y(n) is the actual value obtained at the output of the neural network for the n-th frame.
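Eqn. (11) is a plain root-mean-square error over all frames; in NumPy (our naming):

```python
import numpy as np

def training_error(y, t):
    """RMS training error E of Eqn. (11) over M frames."""
    y, t = np.asarray(y, dtype=float), np.asarray(t, dtype=float)
    return np.sqrt(np.mean(np.abs(y - t) ** 2))
```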

For example, the error value E after 5000 epochs for the piano and cello recordings reached the levels of 0.038 and 0.15, respectively, indicating the relative complexity of detection of the pitched, non-percussive cello onsets. Applying an individual approach to each instrument and using more flexible stopping criteria to control the generalization error (e.g., with a validation set) would supposedly lead to a further improvement of the results.

The fact that the output range of the network does not depend on the energy of the signal is an advantage, allowing the usage of a single, fixed value T (Eqn. (10)) to threshold the output of the network. However, this value still has to be appropriately set in order to achieve the desired precision and recall levels. Maximization of the F-measure required a threshold value of 0.4 in the first testing scheme and 0.43 in the second one (Fig. 7). A comparison of the obtained values of precision, recall and the F-measure for both testing schemes is presented in Table 1.

4.1. Discussion. The general expectation of the superiority of the second testing scheme (c-v) is obviously met, as can be seen in Table 1. Training the network on the same type of data as those used for testing

Fig. 7. Relation of the F-measure (ordinate) to the MLP output threshold (abscissa) for the second testing scheme (c-v).

leads to a more specialized and more effective onset detector. This is very well visible in the case of bowed instruments (cello and violin), which are of a very specific (PNP) onset type. Their F-measure was 0.354 and 0.643 in the first testing scheme and 0.721 and 0.882 in the second one. Low values of recall can also be observed in Fig. 5 for these two instruments, which may suggest that

Table 1. Overall results of the onset detection tests.

                            Scheme 1      Scheme 2
                            (1-vs-all)    (c-v)

Correctly detected onsets   486           569
Precision                   0.856         0.821
Recall                      0.724         0.863
F-measure                   0.785         0.842



Table 2. Results for ODF subgroups, sorted by the F-measure ('X' means that the corresponding ODF was included in the input set).

SF  WPD  PHK  CD    F-measure

X   X               0.490
X   X               0.657
X   X    X          0.658
X   X               0.687
X   X    X          0.729
X   X               0.732
X   X               0.734
X   X               0.735
X   X    X          0.774
X   X    X          0.781

the threshold of 0.4 used in the first scheme is too high in these cases. The PNP onsets are sufficiently different from most of the other onsets (used for training) to make the network "hesitate" and generate lower output values.

Comparing the results in Figs. 5 and 6, we can observe that for some instruments (e.g., synthbass) the onset detector trained on the other recordings performs better. The explanation may be that these instruments have relatively few onsets in the annotated ground-truth files, so the network simply does not have sufficient data to build a proper model of an onset when no other recordings are used (the second scheme). This is best seen in the case of the classic3 file, containing only four annotated onsets, which results in zero values of both recall and precision. This file is also specific because it contains a fragment of orchestral music with very soft, slow onsets, presenting substantial problems to human annotators. Due to the imprecise timing, a significant number of the actual onsets were not included in the ground-truth data (cf. Section 3.1), leading to extremely low precision values also in the first testing scheme (Fig. 5). This may, however, be regarded more as a demonstration of the fundamental ambiguities underlying the formulation of the onset detection problem in general.

The relative influence of the individual onset detection functions was evaluated in a separate group of tests in which the input vectors were reduced to contain the values of only two or three ODFs. The first testing scheme (one instrument vs. all the others) was applied for these tests. For each subgroup the threshold yielding the highest F-measure value was sought (the threshold values fell within the range 0.36–0.42). The results presented in Table 2 indicate that the complex-domain spectrum (Eqn. (4)) contains the most useful onset-related information. Its removal leads to a rather rapid drop in the obtained results, which is generally not observed to such an extent in the case of the other ODFs.

The information carried by the individual ODFs overlaps in a non-trivial way, which may be observed, e.g., on the basis of the PHK onset detection function (Eqn. (3)): the set WPD+PHK+CD performs definitely better than WPD+CD, while SF+WPD+PHK is actually worse than SF+WPD, indicating (perhaps) the need to increase the number of hidden neurons in the network structure. The problem complexity also partially stems from the heterogeneity of the sound sources in our database. Replacing one onset detection function with another may help some instrument types or a music genre, while degrading the results for others. The example shown in Fig. 8 reveals that changing PHK to WPD in the 3-ODF configuration, although it generally yields inferior results, leads to some enhancement for piano recordings (best seen for the piano+vocal recording in the classic2 set). A repository containing more detailed results is available (Stasiak, 2015) and may be used for further comparison and analysis.
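The subgroup tests amount to repeating the training and threshold tuning for every 2- and 3-element subset of the four ODFs. A hypothetical harness is sketched below; the train_and_score callable is a stand-in for the whole training/evaluation pipeline described above, not code from the paper.

```python
from itertools import combinations

ODFS = ("SF", "WPD", "PHK", "CD")

def ablation(train_and_score):
    """Evaluate every 2- and 3-ODF input configuration.

    train_and_score: callable mapping a tuple of ODF names to the best
    F-measure obtained after training the MLP on those inputs and tuning
    the output threshold.
    Returns the configurations sorted by F-measure, as in Table 2.
    """
    subsets = [c for k in (2, 3) for c in combinations(ODFS, k)]
    scores = {subset: train_and_score(subset) for subset in subsets}
    return sorted(scores.items(), key=lambda kv: kv[1])
```

With four ODFs this gives 6 two-element and 4 three-element subsets, i.e., the ten rows of Table 2.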

The obtained results are comparable to state-of-the-art solutions (cf. the F-measure results reported in MIREX (2013): eleven algorithms, median: 0.8025, max: 0.8727). The results reported in the literature for this particular dataset (Daudet et al., 2004) are also similar, including, e.g., the recall value of ca. 80% for the EER point in the work of Alonso et al. (2005) or the values of the F-measure for several instruments (violin: 87%, trumpet: 89%, piano: 98%) obtained on the extended Leveau database by Lee and Kuo (2006).

5. Conclusions

In this work, a solution to the problem of onset detection in musical signals, based on a feed-forward multi-layer non-linear neural network, was presented. Unlike many other approaches based on direct analysis of the spectrum, an intermediate representation built upon classical onset detection functions was applied. In this way the input data are already presented in a domain-relevant form, which simplifies the construction of the neural network and the training process.

The obtained results are comparable to state-of-the-art solutions. It should be noted that several improvements may easily be introduced into the proposed method, including, e.g., the application of more perceptually motivated onset detection functions (and a different number of ODFs, if necessary), controlling the generalization error, modification of the stop criteria for the training process and, eventually, preliminary automatic classification of the recordings with respect to the type of instrument. This last operation may allow us to use several networks trained for different onset types, similarly as in the presented second testing scheme.

Concluding, a decided advantage of the presented solution is its relative simplicity and extensibility. In future work, more training data will be used to obtain more reliable models, leading to further improvement of the results.

Fig. 8. Comparison of the F-measure values obtained for two different ODF subgroups.

References

Alonso, M., Richard, G. and David, B. (2005). Extracting note onsets from musical recordings, Proceedings of the IEEE International Conference on Multimedia and Expo 2005, Amsterdam, The Netherlands, pp. 1–4.

Bartkowiak, M. and Januszkiewicz, Ł. (2012). Hybrid sinusoidal modeling of music with near transparent audio quality, Proceedings of the Joint AES/IEEE Conference NTAV-SPA, Łódź, Poland, pp. 91–96.

Bello, J., Daudet, L., Abdallah, S., Duxbury, C., Davies, M. and Sandler, M. (2005). A tutorial on onset detection in music signals, IEEE Transactions on Speech and Audio Processing 13(5): 1035–1047.

Bello, J.P. and Sandler, M. (2003). Phase-based note onset detection for music signals, Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing ICASSP, Hong Kong, Vol. 5, pp. 441–444.

Bishop, C.M. (1995). Neural Networks for Pattern Recognition, Oxford University Press, New York, NY.

Böck, S., Arzt, A., Krebs, F. and Schedl, M. (2012). Online real-time onset detection with recurrent neural networks, Proceedings of the 15th International Conference on Digital Audio Effects (DAFx 2012), York, UK, pp. 1–4.

Collins, N. (2005). A comparison of sound onset detection algorithms with emphasis on psychoacoustically motivated detection functions, Proceedings of the AES 118th International Convention, Barcelona, Spain, pp. 28–31.

Daudet, L., Richard, G. and Leveau, P. (2004). Methodology and tools for the evaluation of automatic onset detection algorithms in music, 5th International Conference on Music Information Retrieval, ISMIR 2004, Barcelona, Spain, pp. 72–75.

Davy, M. and Godsill, S.J. (2002). Detection of abrupt spectral changes using support vector machines: An application to audio signal segmentation, IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2002, Orlando, FL, USA, pp. 1313–1316.

Dixon, S. (2006). Onset detection revisited, Proceedings of the International Conference on Digital Audio Effects (DAFx-06), Montreal, Quebec, Canada, pp. 133–137.

Duxbury, C., Bello, J., Davies, M. and Sandler, M. (2003). Complex domain onset detection for musical signals, Proceedings of the 6th International Conference on Digital Audio Effects (DAFx-03), London, UK, pp. 1–4.

Eyben, F., Böck, S., Schuller, B. and Graves, A. (2010). Universal onset detection with bidirectional long short-term memory neural networks, 11th International Society for Music Information Retrieval Conference (ISMIR 2010), Utrecht, The Netherlands, pp. 589–594.

Huang, S., Wang, L., Hu, S., Jiang, H. and Xu, B. (2008). Query by humming via multiscale transportation distance in random query occurrence context, IEEE International Conference on Multimedia and Expo, ICME 2008, Hannover, Germany, pp. 1225–1228.

Lacoste, A. and Eck, D. (2007). A supervised classification algorithm for note onset detection, EURASIP Journal on Advances in Signal Processing 2007: 153–153.

Laroche, J. (2003). Efficient tempo and beat tracking in audio recordings, Journal of the Audio Engineering Society 51(4): 226–233.

Lee, W.-C. and Kuo, C.-C. (2006). Musical onset detection based on adaptive linear prediction, IEEE International Conference on Multimedia and Expo, ICME 2006, Toronto, Ontario, Canada, pp. 957–960.

Lerch, A. (2012). An Introduction to Audio Content Analysis: Applications in Signal Processing and Music Informatics, Wiley/IEEE Press, Hoboken, NJ.

MIREX (2013). Audio onset detection results in Music Information Retrieval Evaluation eXchange MIREX, 2013, http://nema.lis.illinois.edu/nema_out/mirex2013/results/aod/summary.html.

Peeters, G. (2005). Time variable tempo detection and beat marking, Proceedings of the International Computer Music Conference, ICMC 2005, Barcelona, Spain, pp. 1–4.

Quintela, N.D., Giménez, A.P. and Guijarro, S.T. (2009). A comparison of score-level fusion rules for onset detection in music signals, Proceedings of the 10th International Society for Music Information Retrieval Conference, ISMIR 2009, Kobe, Japan, pp. 117–121.

Rabenstein, R. and Petrausch, S. (2008). Block-based physical modeling with applications in musical acoustics, International Journal of Applied Mathematics and Computer Science 18(3): 295–305, DOI: 10.2478/v10006-008-0027-6.

Repp, B.H. (1996). Patterns of note onset asynchronies in expressive piano performance, Journal of the Acoustical Society of America 100(6): 3917–3932.

Schlüter, J. and Böck, S. (2014). Improved musical onset detection with convolutional neural networks, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014), Florence, Italy, pp. 6979–6983.

Stasiak, B. (2015). Results repository, http://ics.p.lodz.pl/~basta/NN_MULTI_ODF_FUSION/Stasiak_OnsetDB.zip.

Tian, M., Fazekas, G., Black, D.A.A. and Sandler, M. (2014). Design and evaluation of onset detectors using different fusion policies, 15th International Society for Music Information Retrieval (ISMIR) Conference, ISMIR 2014, Taipei, Taiwan, pp. 631–636.

Typke, R., Wiering, F. and Veltkamp, R.C. (2007). Transportation distances and human perception of melodic similarity, Musicae Scientiae 11(1): 153–181.

Yin, J., Wang, Y. and Hsu, D. (2005). Digital violin tutor: An integrated system for beginning violin learners, in H. Zhang et al. (Eds.), ACM Multimedia, ACM, New York, NY, pp. 976–985.

Zhang, B. and Wang, Y. (2009). Automatic music transcription using audio-visual fusion for violin practice in home environment, Technical Report TRA7/09, National University of Singapore, Singapore.

Bartłomiej Stasiak received the M.Sc. degree in music from the Music Academy of Łódź in 2001, the M.Sc. degree in computer science from the Łódź University of Technology in 2004 and the Ph.D. degree in computer science from the Gdańsk University of Technology in 2010. He is an assistant professor at the Institute of Information Technology at the Łódź University of Technology. His research interests include artificial intelligence applications in image recognition, sound signal processing and music information retrieval.

Jędrzej Mońko received the M.Sc. degree from the Łódź University of Technology, Faculty of Technical Physics, Information Technology and Applied Mathematics (2012), where he is currently a Ph.D. student at the Institute of Information Technology. His research interests include computer graphics, multimedia and music information retrieval.

Adam Niewiadomski received the M.Sc. degree from the Łódź University of Technology, Poland, in 1998, and the Ph.D. and D.Sc. degrees from the Polish Academy of Sciences, Warsaw, in 2001 and 2009, respectively, all in computer science. He is currently an associate professor with the Institute of Information Technology, Łódź University of Technology. He is the author or a coauthor of more than 70 technical papers. His research interests include methods of computational intelligence, fuzzy representations of information, automated theorem proving, extensions of fuzzy sets, and e-learning.

Received: 25 June 2014
Revised: 13 May 2015