Synthesis of the Singing Voice by Performance Sampling and Spectral Models

[Toward synthesis engines that sound as natural and expressive as a real singer]

Jordi Bonada and Xavier Serra

Among the many existing approaches to the synthesis of musical sounds, the ones that have had the biggest success are without any doubt the sampling-based ones, which sequentially concatenate samples from a corpus database [1]. Strictly speaking, we could say that sampling is not a synthesis technique, but from a practical perspective it is convenient to treat it as such. From what we explain in this article it should become clear that, from a technology point of view, it is also adequate to include sampling as a type of sound synthesis model.

The success of sampling relies on the simplicity of the approach (it just samples existing sounds), but most importantly it succeeds in capturing the naturalness of the sounds, since the sounds are real sounds. However, sound synthesis is far from being a solved problem and sampling is far from being an ideal approach. The lack of flexibility and expressivity are two of the main problems, and there are still many issues to be worked on if we want to reach the level of quality that a professional musician expects to have in a musical instrument.

Sampling-based techniques have been used to reproduce practically all types of sounds and basically have been used to model the sound space of all musical instruments. They have been particularly successful for instruments that have discrete excitation controls, such as percussion or keyboard instruments.


For these instruments it is feasible to reach an acceptable level of quality by using large sample databases, thus by sampling a sufficient portion of the sound space produced by a given instrument. This is much more difficult for the case of continuously excited instruments, such as bowed strings, wind instruments, or the singing voice, and therefore recent sampling-based systems consider a trade-off between performance modeling and sample reproduction (e.g., [2]). For these instruments there are numerous control parameters and many ways to attack, articulate, or play each note. The control parameters are constantly changing and the sonic space covered by a performer could be considered to be much larger than for the discretely excited instruments.

The synthesis approaches based on physical models have the advantage of having the right parameterization for being controlled like a real instrument and thus have great flexibility and the potential to play expressively. One of the main open problems relates to the control of these models, in particular how to generate the physical actions that excite the instrument. In sampling these actions are embedded in the recorded sounds.

We have worked on the synthesis of the singing voice for many years now, mostly together with Yamaha Corp., and some of our results have been incorporated into the Vocaloid (http://www.vocaloid.com) software synthesizer. Our goal has always been to develop synthesis engines that could sound as natural and expressive as a real singer (or choir [3]) and whose inputs could be just the score and the lyrics of the song. This is a very difficult goal and there is still a lot of work to be done, but we believe that our proposed approach can reach that goal. In this article we will overview the key aspects of the technologies developed so far and identify the open issues that still need to be tackled. The core of the technologies is based on spectral processing, and over the years we have added performance actions and physical constraints in order to convert the basic sampling approach to a more flexible and expressive technology while maintaining its inherent naturalness. In the first part of the article we introduce the concept of synthesis based on performance sampling and the specific spectral models that we have developed and used for the singing voice. In the second part we discuss the different components of the synthesizer, and we conclude by identifying the open issues of this research work.

PERFORMANCE-BASED SAMPLING SYNTHESIS

Sampling has always been considered a way to capture and reproduce the sound of an instrument, but in fact it should be better considered a way to model the sonic space produced by a performer with an instrument. This is not just a fine distinction; it is a significant conceptual shift of the goal to be achieved.

We want to model the sonic space of a performer/instrument combination. This does not mean that the synthesizer shouldn't be controlled by a performer; it just means that we want to be flexible in the choice of input controllers and be able to use high-level controls, such as a traditional music score, or to include lower-level controls if they are available, thus taking advantage of a smearing of the traditional separation between performer and instrument.

Figure 1 shows a visual representation of a given sonic space to be modeled. Space A represents the sounds that a given instrument can produce by any means. Space B is the subset of space A that a given performer can produce by playing that instrument. The trajectories shown in space B represent the actual recordings that have been sampled. The reality is that this sonic space is an infinite multidimensional one, but we hope to be able to get away with approximating it by a finite space. The trajectories represent paths in this multidimensional space. The size of these spaces may vary depending on the perceptually relevant degrees of freedom in a given instrument/performer combination; thus, we could say that a performed drum can be represented by a smaller space than a performed violin. This is a very different concept than the traditional timbre space; here the space is defined both by the sound itself and by the control exerted on the instrument by the performer. Thus, the sound space of an accomplished performer would be bigger than the space of a not so skilled one.

From a given sampled sonic space and the appropriate input controls, the synthesis engine should be able to generate any trajectory within the space, thus producing any sound contained in it. The brute-force approach is to do an extensive sampling of the space and perform simple interpolations to move around it. In the case of the singing voice, the space is so huge and complex that this approach is far from being able to cover a large enough portion of the space. Thus, the singing voice is a clear example that the basic sampling approach is not adequate and that a parameterization of the sounds is required. We have to understand the relevant dimensions of the space and we need to find a sound parameterization with which we can move around these dimensions by interpolating or transforming the existing samples.

Figure 2 shows a block diagram of our proposed synthesizer. The input of the system is a generalization of the traditional score, a Performance Score, which can include any symbolic information that might be required for controlling the synthesizer. The Performer Model converts the input controls into lower-level performance actions, the Performance Trajectory Generator creates the parameter trajectories that express the appropriate paths to move within the sonic space, and the Sound Rendering module is the actual synthesis engine that produces the output sound by concatenating a sequence of transformed samples which approximate the performance trajectory. The Performance Database includes not only the performance samples but also models and measurements that relate to the performance space and that give relevant information to help in the process of going from the high-level score representation to the output sound.

[FIG1] Instrument sonic space. Space A: possible sounds produced by the instrument; space B: sounds produced by the performer playing the instrument; trajectories: recorded audio samples.
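To make the data flow of Figure 2 concrete, the following minimal Python sketch mocks the four modules. All class and attribute names (PerformanceScore, PerformerModel, and so on) are our own illustrative assumptions, not part of any released system, and the method bodies are left empty.

```python
# Minimal sketch of the pipeline in Figure 2. All names are ours, chosen for
# illustration; the real modules are far more elaborate.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PerformanceScore:            # high-level input: notes plus lyrics and annotations
    notes: List[Dict]              # e.g., {"pitch": "G2", "dur": 1.0, "lyric": "me", "dyn": "p"}

@dataclass
class PerformanceActions:          # mid-level output of the Performer Model
    phoneme_timing: List[Dict] = field(default_factory=list)
    pitch_targets: List[float] = field(default_factory=list)     # Hz
    loudness_targets: List[float] = field(default_factory=list)  # 0..1

class PerformerModel:
    def interpret(self, score: PerformanceScore) -> PerformanceActions:
        # Map notes and lyrics to phoneme timing, pitch/loudness targets, vibrato events.
        ...

class PerformanceTrajectoryGenerator:
    def generate(self, actions: PerformanceActions, db) -> Dict[str, list]:
        # Produce continuous pitch/loudness curves plus a phonetic unit sequence.
        ...

class SoundRenderer:
    def render(self, trajectory: Dict[str, list], db) -> List[float]:
        # Select, transform, and concatenate database samples along the trajectory.
        ...

def synthesize(score: PerformanceScore, db) -> List[float]:
    actions = PerformerModel().interpret(score)
    trajectory = PerformanceTrajectoryGenerator().generate(actions, db)
    return SoundRenderer().render(trajectory, db)
```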

In the next sections we will present in more detail each of these components for the specific case of a singing voice, but considering that most of the ideas and technologies could be applied to any performer/instrument combination.

MODELING THE SINGING VOICE

As stated by Sundberg [4], we consider the singing voice as the sounds produced by the voice organ and arranged in adequate musical sounding sequences. The voice organ encloses the different structures that we mobilize when we produce voice: the breathing system, the vocal folds, and the vocal and nasal tracts. More specifically, as shown in Figure 3, voice sounds originate from an airstream from the lungs that is processed by the vocal folds and then modified by the pharynx, the mouth, and nose cavities. The sound produced by the vocal folds is called the voice source. When the vocal folds vibrate, the airflow is chopped into a set of pulses producing voiced sounds (i.e., harmonic sounds). Otherwise, different parts of the voice organ can work as oscillators to create unvoiced sounds. For example, in whispering, the vocal folds are too tense to vibrate, but they form a narrow passage that makes the airstream become turbulent and generate noise.

The vocal tract acts as a resonator and acoustically shapes the voice source, especially enhancing certain frequencies called formants (i.e., resonances). (There also exist antiformants or antiresonances, i.e., frequency regions in which the amplitudes of the voice source are attenuated. These are especially present in nasal sounds because the nasal cavities absorb energy from the sound wave.) The five lowest formants are the most significant ones for vowel quality and voice color. It is important to note that the vocal tract cannot be considered a linear-phase-response filter. Instead, each formant decreases the phase around its center frequency, as can be seen in Figure 4. This property is perceptually relevant, especially for middle- and low-pitch utterances.
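As a simple numerical illustration of this formant behavior (our own example, not taken from the article), the sketch below builds a second-order resonator at an assumed formant frequency of 500 Hz with an 80 Hz bandwidth and prints its magnitude and phase response: the magnitude peaks at the formant, and the phase drops by roughly pi radians across it.

```python
# Sketch: a single formant modeled as a 2nd-order resonator (all values are
# illustrative assumptions, e.g., F = 500 Hz, bandwidth = 80 Hz, fs = 16 kHz).
import cmath, math

fs, F, bw = 16000.0, 500.0, 80.0
r = math.exp(-math.pi * bw / fs)            # pole radius from the bandwidth
theta = 2 * math.pi * F / fs                # pole angle from the formant frequency
a1, a2 = -2 * r * math.cos(theta), r * r    # H(z) = 1 / (1 + a1 z^-1 + a2 z^-2)

for f in (100, 300, 500, 700, 1000, 2000):
    z = cmath.exp(1j * 2 * math.pi * f / fs)
    H = 1.0 / (1 + a1 / z + a2 / (z * z))
    print(f"{f:5d} Hz  |H| = {20 * math.log10(abs(H)):6.1f} dB   phase = {cmath.phase(H):+.2f} rad")
```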

In a broad sense, and according to whether the focus is put on the system or on its output, synthesis models can be classified into two main groups: spectral models and physical models. Spectral models are mainly based on perceptual mechanisms of the listener, while physical models focus on modeling the production mechanisms of the sound source. Either of these two models is suitable depending on the specific requirements of the application, or they might be combined to take advantage of both approaches. Historical and in-depth overviews of singing voice synthesis models are found in [5]-[7].

[FIG2] Block diagram of the proposed synthesizer: the high-level Performance Score is interpreted by the Performer Model, whose mid-level actions drive the Performance Trajectory Generator, and the low-level Sound Rendering module produces the sound; all modules draw on the Performance DB.

[FIG3] The voice organ: the lungs push a transglottal airflow through the trachea and vocal folds, producing the voice source spectrum, which is then shaped by the vocal tract (formants, velum) into the radiated spectrum.

The main benefit of using physical models is that the parameters used in the model are closely related to the ones a singer uses to control his/her own vocal system. As such, some knowledge of the real-world mechanism must be brought into the design. The model itself can provide intuitive parameters if it is constructed so that it sufficiently matches the physical system. However, such a system usually has a large number of parameters, and the mapping of those quite intuitive controls of the production mechanism to the final output of the model, and so to the listener's perceived quality, is not a trivial task. The controls would be related to the movements of the vocal apparatus elements such as jaw opening, tongue shape, sub-glottal air pressure, tensions of the vocal folds, etc. The first digital physical model of the voice was based on simulating the vocal tract as a series of one-dimensional tubes [8], afterwards extended by means of digital waveguide synthesis [9] in the SPASM singing voice system [10]. Physical models are evolving fast and becoming more and more sophisticated; two-dimensional (2D) models are common nowadays, providing increased control and realism [11], [12]; and the physical configuration of the different organs during voice production is being estimated in great detail by combining different approaches. For example, 2D vocal tract shapes can be estimated from functional magnetic resonance imaging (fMRI) [13], X-ray computed tomography (CT) [14], or even audio recordings and EGG signals by means of genetic algorithms [15].

Alternatively, spectral models are related to some aspects of the human perceptual mechanism. Changes in the parameters of a spectral model can be more easily mapped to a change of sensation in the listener. Yet the parameter spaces yielded by these systems are not necessarily the most natural ones for manipulation. Typical controls would be pitch, glottal pulse spectral shape, formant frequencies, formant bandwidths, etc. Of particular relevance is the sinusoidal-based system in [16].

A typical example of combining both approaches is formant synthesizers (e.g., [17], the CHANT system [18]), considered to be pseudo-physical models because even though they are mainly spectral models they make use of the source/filter decomposition, which considers voice to be the result of a glottal excitation waveform (i.e., the voice source) filtered by a linear filter (i.e., the vocal tract). The voice model we present here would be part of this group.

EPR (EXCITATION PLUS RESONANCES) VOICE MODEL

The EpR voice model [19] is based on an extension of the well-known source/filter approach [20]. It models the magnitude spectral envelope defined by the harmonic spectral peaks of the singer's spectrum. It consists of three filters in cascade (Figure 5). The first filter models the voice source frequency response with an exponential curve plus one resonance. The second filter models the vocal tract with a vector of resonances that emulate the voice formants. The last filter stores the amplitude differences between the two previous filters and the harmonic envelope. Hence, EpR can perfectly reproduce all the nuances of the harmonic envelope.

[FIG4] Vocal tract transfer function: magnitude (dB) and phase responses over 0-3,000 Hz, showing the formant peaks and the phase decrease around each formant frequency.

[FIG5] EpR voice model: an exponential source curve, Source CurvedB(f) = GaindB + SlopeDepthdB (e^(Slope f) - 1), plus one source resonance (voice source filter), a set of vocal tract resonances (vocal tract filter), and a residual envelope.

Each of the parameters of this model can be controlled independently. However, whenever a formant frequency is modified, the residual filter envelope is scaled, taking as anchor points the formant frequencies, thus preserving the local spectral amplitude shape around each formant.
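To make the EpR parameterization tangible, here is a small sketch (our own illustration, with invented parameter values) that evaluates the source curve of Figure 5 and adds a few simple resonance bumps in dB to approximate a harmonic magnitude envelope. The resonance shape used here is a generic symmetric bump, not the exact EpR resonance, and the residual envelope correction is omitted.

```python
# Sketch of an EpR-style magnitude envelope in dB (illustrative values only).
import math

def source_curve_db(f, gain_db=-22.0, slope=-1e-4, slope_depth_db=40.0):
    # Source CurvedB(f) = GaindB + SlopeDepthdB * (exp(Slope * f) - 1)
    return gain_db + slope_depth_db * (math.exp(slope * f) - 1.0)

def resonance_db(f, center, amp_db, bw):
    # Simple symmetric resonance bump (not the exact EpR resonance shape).
    return amp_db / (1.0 + ((f - center) / (bw / 2.0)) ** 2)

FORMANTS = [(500.0, 18.0, 90.0), (1500.0, 12.0, 120.0), (2500.0, 9.0, 160.0)]  # (Hz, dB, Hz)

def epr_envelope_db(f):
    env = source_curve_db(f)
    for center, amp_db, bw in FORMANTS:
        env += resonance_db(f, center, amp_db, bw)
    return env  # the real EpR adds a residual envelope on top to match the harmonics exactly

for f in range(0, 4001, 500):
    print(f"{f:5d} Hz -> {epr_envelope_db(float(f)):6.1f} dB")
```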

AUDIO PROCESSING

Since our system is a sample-based synthesizer in which samples of a singer database are transformed and concatenated along time to compose the resulting audio, high-quality voice transformation techniques are a crucial issue. Thus, we need audio-processing techniques especially adapted to the particular characteristics of the singing voice, using EpR to model voice timbres and preserving the relationship between formants and phase.

SPECTRAL MODELING SYNTHESIS

We initially used spectral modeling synthesis (SMS) [21] as the basic transformation technique [19]. SMS had the advantage of decomposing the voice into harmonics and a residual, respectively modeled as sinusoids and filtered white noise. Both components were independently transformed, so the system yielded great flexibility. But although the results were quite encouraging in voiced sustained parts, in transitory parts and consonants, especially in voiced fricatives, the harmonic and residual components were not perceived as one, and moreover transients were significantly smeared. When modifying harmonic frequencies, e.g., in a transposition transformation, harmonic phases were computed according to the phase rotation of ideal sinusoids at the estimated frequencies. Hence, the strong phase synchronization (or phase coherence) between the various harmonics resulting from the formant and phase relationship was not preserved, producing phasiness artifacts (a lack of presence, a slight reverberant quality, as if recorded in a small room), audible especially in low-pitch utterances.
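As a toy illustration of why this per-harmonic phase rotation breaks phase coherence (our own example, not code from the system), consider advancing each harmonic's phase independently at its estimated, slightly inexact frequency: harmonics that were aligned at a voice pulse onset quickly drift apart from a harmonic phase pattern.

```python
# Toy illustration: rotating each harmonic's phase independently at its
# estimated (slightly inexact) frequency makes the harmonics drift out of the
# phase alignment they had at the voice pulse onsets. All values are invented.
import math, random

random.seed(0)
fs, hop, f0, n_harm = 44100, 256, 110.0, 5
# Estimated harmonic frequencies with small measurement errors (about +-0.5%).
freqs = [(k + 1) * f0 * (1 + random.uniform(-0.005, 0.005)) for k in range(n_harm)]
phases = [0.0] * n_harm            # phases aligned at an analysis voice pulse onset

for frame in range(1, 9):
    for k in range(n_harm):
        phases[k] += 2 * math.pi * freqs[k] * hop / fs
    # Deviation from a perfectly harmonic phase pattern (relative to harmonic 1).
    dev = [((phases[k] - (k + 1) * phases[0] + math.pi) % (2 * math.pi)) - math.pi
           for k in range(n_harm)]
    print("frame", frame, [f"{d:+.2f}" for d in dev])
```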

PHASE-LOCKED VOCODER

Intending to improve our results, we moved to a spectral technique based on the phase-locked vocoder [22], where the spectrum is segmented into regions, each of which contains a harmonic spectral peak and its surroundings. In fact, in this technique each region is actually represented and controlled by the harmonic spectral peak, so that most transformations basically deal with harmonics and compute how their parameters (amplitude, frequency, and phase) are modified, in a similar way to sinusoidal models. These modifications are applied uniformly to all the spectral bins within a region, therefore preserving the shape of the spectrum around each harmonic. Besides, a mapping is defined between input and output harmonics in order to avoid shifting the aspirated noise components in frequency and introducing artifacts when low-level peaks are amplified. We improved the harmonic phase continuation method by assuming a perfectly harmonic frequency distribution [23]. The result was that when the pitch was modified (i.e., transposition), the unwrapped phase envelope was actually scaled according to the transposition factor. The sound quality was improved in terms of phasiness but not sufficiently, and the relation between formants and phase was not really preserved.
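The sketch below (our own simplification, with a made-up 16-bin frame) shows the core idea of such region-based processing: every FFT bin in a region inherits the amplitude scaling, phase increment, and frequency shift computed for its harmonic peak, so the local spectral shape around the harmonic is preserved.

```python
# Sketch of phase-locked-vocoder-style region processing (illustrative only):
# every bin in a region gets the same correction as its harmonic peak.
import cmath

def transform_regions(spectrum, regions, gains, dphis, bin_shifts):
    """spectrum: complex FFT bins; regions: (start, end) bin ranges, one per
    harmonic peak; gains/dphis/bin_shifts: per-region amplitude factor, phase
    increment (rad), and integer frequency shift in bins."""
    out = [0j] * len(spectrum)
    for (start, end), gain, dphi, shift in zip(regions, gains, dphis, bin_shifts):
        rot = gain * cmath.exp(1j * dphi)          # one correction for the whole region
        for b in range(start, end):
            nb = b + shift                         # the region is moved as a block
            if 0 <= nb < len(out):
                out[nb] += spectrum[b] * rot
    return out

spec = [0j] * 16
spec[3], spec[7] = 1.0 + 0j, 0.5 + 0.5j            # two "harmonic peaks"
new = transform_regions(spec, regions=[(1, 5), (5, 10)],
                        gains=[1.0, 0.8], dphis=[0.3, 0.6], bin_shifts=[-1, -2])
print([round(abs(x), 2) for x in new])
```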

Several algorithms have been proposed regarding harmonic phase coherence for both phase-vocoder and sinusoidal modeling (e.g., [24] and [25]), most based on the idea of defining pitch-synchronous input and output onset times and reproducing at the output onset times the phase relationship existing in the original signal at the input onset times. However, the results are not good enough because the onset times are not synchronized to the voice pulse onsets but assigned to an arbitrary position within the pulse period. This causes unexpected phase alignments at the voice pulse onsets, which do not reproduce the formant-to-phase relations, adding an unnatural "roughness" characteristic to the timbre (Figure 6). Different algorithms detect voice pulse onsets relying on the minimal-phase characteristics of the voice source (i.e., the glottal signal) (e.g., [26] and [27]). Following this same idea, we proposed in [28] a method to estimate the voice pulse onsets out of the harmonic phases, based on the property that when the analysis window is properly centered the unwrapped phase envelope is nearly flat, with shifts under each formant, thus being close to a maximally flat phase alignment (MFPA) condition. By means of this technique, both phasiness and roughness can be greatly reduced to be almost inaudible.

[FIG6] Spectra obtained when the window is centered at a voice pulse onset (a) and between two pulse onsets (b). In (b) the harmonics are mapped to perform a one-octave-down transposition. (c) shows the spectrum of the transformed signal with the window centered at the voice pulse onset: the "doubled" phase alignment adds an undesired "roughness" characteristic to the voice signal, and instead of one voice pulse per period we see two, with strong amplitude modulation.
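One possible reading of this MFPA criterion, sketched below with a synthetic test signal, is to slide the analysis point over one pitch period and pick the position whose unwrapped harmonic phase envelope is flattest. The signal, window length, and flatness measure are our own assumptions; the exact procedure in [28] differs in its details.

```python
# Sketch: estimate the voice pulse onset as the window center that yields the
# flattest unwrapped harmonic phase envelope (an MFPA-style criterion).
# The synthetic test signal and all constants are assumptions for illustration.
import cmath, math

fs, f0, n_harm = 8000, 200, 8
period = int(fs / f0)
pulse_onset = 17                              # ground truth, in samples

def signal(n):
    # Harmonics in cosine phase at the pulse onsets (a crude "voice pulse" train).
    return sum(math.cos(2 * math.pi * (k + 1) * f0 * (n - pulse_onset) / fs)
               for k in range(n_harm))

def phase_flatness(center, win=64):
    # DFT at the harmonic frequencies with a window centered at `center`.
    phases = []
    for k in range(1, n_harm + 1):
        acc = sum(signal(center + m) * cmath.exp(-2j * math.pi * k * f0 * m / fs)
                  for m in range(-win, win + 1))
        phases.append(cmath.phase(acc))
    # Unwrap across harmonics; a nearly constant envelope indicates that the
    # window is centered on a voice pulse onset.
    unwrapped = [phases[0]]
    for p in phases[1:]:
        d = p - unwrapped[-1]
        unwrapped.append(unwrapped[-1] + (d + math.pi) % (2 * math.pi) - math.pi)
    return max(unwrapped) - min(unwrapped)

best = min(range(period), key=phase_flatness)
print("estimated pulse onset:", best, "(true:", pulse_onset, ")")
```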

VOICE PULSE MODELING

Both SMS and phase-vocoder techniques are based on modifying the frequency-domain characteristics of the voice samples. Several transformations can be done with ease and with good results; for instance, transposition and timbre modification (here timbre modification is understood as modification of the envelope defined by the harmonic spectral peaks). However, certain transformations are rather difficult to achieve, especially the ones related to irregularities in the voice pulse sequence, which imply adding and controlling subharmonics [29]. Irregularities in the voice pulse sequence are inherent to rough or creaky voices and appear frequently in singing, sometimes even as an expressive resource such as growl.

Dealing with this issue, we proposed in [28] an audio-processing technique especially adapted to the voice, based on modeling the radiated voice pulses in the frequency domain. We call it voice pulse modeling (VPM). We have seen before that voiced utterances can be approximated as a sequence of voice pulses linearly filtered by the vocal tract. Hence, the output can be considered as the result of overlapping a sequence of filtered voice pulses. It can be shown that the spectrum of a single filtered voice pulse can be approximated as the spectrum obtained by interpolation of the harmonic peaks (in polar coordinates, i.e., magnitude and phase) when the analysis window is centered on a voice pulse onset (Figure 7 and Figure 8). The same applies if the timbre is modeled by the EpR model discussed above. Since the phase of the harmonic peaks is interpolated, special care must be taken regarding phase unwrapping in order to avoid discontinuities when different integer numbers of 2π periods are added to a certain harmonic in consecutive frames (Figure 9). Finally, once we have a single filtered voice pulse we can reconstruct the voice utterance by overlapping several of these filtered voice pulses arranged as in the original sequence. This way, introducing irregularities to the synthesized voice pulse sequence becomes straightforward.
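The sketch below illustrates the polar interpolation step with made-up harmonic peaks: magnitudes and already-unwrapped phases measured at the harmonic bins are interpolated linearly to fill in every bin of a single-pulse spectrum. The actual VPM implementation is of course more refined.

```python
# Sketch: build a single-pulse spectrum by polar interpolation of harmonic peaks.
# The peak data below are invented; in VPM they come from an analysis frame
# whose window is centered on a voice pulse onset.
import cmath

n_bins = 64
# (bin index, magnitude, phase in radians) of harmonic peaks, phases already unwrapped.
peaks = [(4, 1.00, 0.0), (8, 0.70, -0.4), (12, 0.45, -0.9), (16, 0.20, -1.1)]

def interp_pulse_spectrum(peaks, n_bins):
    spec = [0j] * n_bins
    for (b0, m0, p0), (b1, m1, p1) in zip(peaks[:-1], peaks[1:]):
        for b in range(b0, b1 + 1):
            t = (b - b0) / (b1 - b0)
            mag = (1 - t) * m0 + t * m1          # interpolate the magnitude ...
            ph = (1 - t) * p0 + t * p1           # ... and the unwrapped phase separately
            spec[b] = mag * cmath.exp(1j * ph)
    return spec

pulse_spec = interp_pulse_spectrum(peaks, n_bins)
print([round(abs(x), 2) for x in pulse_spec[:20]])
# An IFFT of this spectrum (mirrored for the negative frequencies) would give one
# filtered voice pulse, which can then be overlap-added at the desired onsets.
```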

Clearly we have disregarded all the data contained in the spectrum bins between harmonic frequencies, which contain the noisy or breathy characteristics of the voice. As in SMS, we can obtain a residual by subtracting the VPM resynthesis from the input signal, therefore ensuring a perfect reconstruction if no transformations are applied. It is well known that the aspirated noise produced during voiced utterances has a time structure that is perceptually important and is said to be correlated to the glottal voice source phase [30]. Hence, in order to preserve this time structure, the residual is synthesized using a PSOLA method synchronized to the synthesis voice pulse onsets. Finally, for processing transient-like sounds we adopted the method in [31], which by integrating the spectral phase is able to robustly detect transients and discriminate which spectral peaks contribute to them, therefore allowing transient components to be translated to new time instants.

[FIG7] Voice pulse modeling. The spectrum of a single filtered voice pulse can be approximated as the spectrum obtained by polar interpolation of harmonic peaks when the analysis window is centered on a voice pulse onset.

[FIG8] Synthesis of single pulses from the analysis of recorded voiced utterances.

VPM is similar to older techniques such as the fonction d'onde formantique (FOF) [32] and the voice synthesis system VOSIM [33], where the voice is modeled as a sequence of pulses whose timbre is roughly represented by a set of ideal resonances. However, in VPM the timbre is represented by all the harmonics, allowing subtle details and nuances of both the amplitude and phase spectra to be captured. In terms of timbre representation we could obtain similar results with spectral smoothing, by applying restrictions to the pole and zero estimation in autoregressive (AR), autoregressive moving average (ARMA), or Prony's models. However, with VPM we have the advantage of being able to smoothly interpolate different voice pulses, avoiding problems due to phase unwrapping, and of decomposing the voice into three components (harmonics, noise, and transients) that can be independently modified. A block diagram of the VPM analysis and synthesis processes is shown in Figure 10.

PERFORMER MODEL

From the input score, the Performer Model is in charge of generating lower-level actions; thus, it is responsible for incorporating performance-specific knowledge to the system. Some scores might include a fair number of performance indications, and in these cases the Performer Model would have an easier task. However, we are far from completely understanding and being able to simulate the music performance process, and this is therefore still one of the most open-ended problems in music sound synthesis. The issues involved are very diverse, going from music theory to cognition and motor control problems, and current approaches to performance modeling only give partial answers. The most successful practical approaches have been based on developing performance rules by an analysis-by-synthesis approach [34], [35], and more recently machine learning techniques have been used for generating these performance rules automatically [36]. Another useful method is based on using music notation software, which allows interactive playing and adjustment of the synthetic performance [37].

For the particular case of the singing voice system presented here, the main characteristics that are determined by the Performer Model are the tempo; the deviation of note duration from the standard value; the vibrato occurrences and characteristics; the loudness of each note; and how notes are attacked, connected, and released in terms of musical articulation (e.g., legato, staccato, etc.). It is beyond the scope of this research to address higher levels of performance control, such as the different levels of phrasing and their influence on all the expressive nuances.
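As a trivial illustration of the kind of rule-based decisions such a Performer Model can take, the snippet below applies two invented rules (they are not the rules of [34]-[36] or of our system): long notes receive a vibrato after roughly a third of their duration, and legato notes are slightly lengthened into the next note.

```python
# Toy performance rules (invented for illustration only).
def performer_model(notes, min_vibrato_dur=0.8, legato_overlap=0.05):
    actions = []
    for note in notes:
        act = {"pitch": note["pitch"], "dur": note["dur"],
               "loudness": {"p": 0.3, "mf": 0.6, "f": 0.9}.get(note.get("dyn", "mf"), 0.6)}
        if note["dur"] >= min_vibrato_dur:          # long note -> add a vibrato event
            act["vibrato"] = {"start": 0.35 * note["dur"], "rate_hz": 5.5, "depth_cents": 80}
        if note.get("articulation") == "legato":    # legato -> overlap into the next note
            act["dur"] += legato_overlap
        actions.append(act)
    return actions

score = [{"pitch": "Ab2", "dur": 1.2, "dyn": "f", "articulation": "legato"},
         {"pitch": "G2", "dur": 2.0, "dyn": "p"}]
print(performer_model(score))
```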

[FIG9] Phase unwrapping problem: different numbers of 2π periods are added to harmonic number t in consecutive frames.

[FIG10] VPM analysis and synthesis block diagram: FFT, peak detection, and pitch detection; voice pulse onset estimation and maximally flat phase alignment; VPM resynthesis and separation of the residual and transients; followed by pulse, residual, and transient transformations, PSOLA, pulse interpolation, and overlap-add.

PERFORMANCE DATABASE

In this section we explain how to define the sampling grid of the instrument's sonic space and the steps required to build a database from it. First of all, we must identify the dimensions of the sonic space we want to play with and the transformations we can do with the samples. For each transformation we also need to study the trade-off between sound quality and transformation range, preferably in different contexts (e.g., we can think of a case where transpositions sound acceptable for the interval [−3,+4] semitones at low pitches, but the interval becomes [−2,+2] at high pitches). Besides, we should also consider the sample concatenation techniques to be used in synthesis and the consequent trade-off between sample distance and natural-sounding transitions. Once we have set the sampling grid to be a good compromise for the previous trade-offs, we have to come up with detailed scripts for recording each sample, trying to minimize the score length and thus maximize the density of database samples in the performance being recorded. These scripts should be as detailed as possible in order to avoid ambiguities and ensure that the performance will contain the target samples.

In the case of the singing voice we decided to divide the sonic space into three subspaces (A, B, and C), each with different dimensions (Figure 11). Subspace A contains the actual samples that are transformed and concatenated at synthesis. Subspaces B and C contain samples that, once properly modeled, specify how samples from A should be transformed to show a variety of specific expressions.

In subspace A, the phonetic axis has a discrete scale whose units are combinations of two allophones (i.e., di-allophones) plus steady states of voiced allophones. Using combinations of two or more allophones is in fact a common practice in concatenative speech synthesis (e.g., [38]). However, not all di-allophone combinations must be sampled, but only a subset that statistically covers the most frequent combinations. For example, the Spanish language has 29 possible allophones [39] and, after analyzing the phonetic transcription of several books containing more than 300,000 words, we found that 521 combinations out of the 841 theoretically possible ones were enough to cover more than 99.9% of all occurrences. The pitch axis must be adapted to the specific range of the singer. In our case, the sound quality seems to be acceptable for transpositions up to ±6 semitones, and the sampling grid is set accordingly, often with three pitch values being enough to cover the singer's range (excluding falsetto). The loudness axis is set to the interval [0,1], assuming 0 to be the softest singing and 1 the loudest singing. In our experiments, we decided to record very soft, soft, normal, and loud qualitative labels for steady states and just soft and normal for articulations. Finally, for tempo we recorded normal and fast speeds, valued as 90 and 120 beats per minute (BPM), respectively (in this case BPM corresponds to the number of syllables per minute, considering that a syllable is sung for each note). Summarizing, for each di-allophone we sampled three pitches and two loudness and two tempo contexts, summing 12 different locations in the sonic space.
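The coverage analysis described above can be reproduced in a few lines. The sketch below (with a toy transcription standing in for the 300,000-word corpus) counts di-allophone frequencies and reports how many distinct units are needed to reach a target coverage.

```python
# Sketch: find how many distinct di-allophones are needed to cover a target
# fraction of all occurrences in a phonetic transcription (toy data only).
from collections import Counter

def diallophone_coverage(phonemes, target=0.999):
    pairs = Counter(zip(phonemes[:-1], phonemes[1:]))
    total = sum(pairs.values())
    covered, selected = 0, []
    for pair, count in pairs.most_common():   # most frequent di-allophones first
        selected.append(pair)
        covered += count
        if covered / total >= target:
            break
    return selected, covered / total

transcription = "Sil-e-n-T-i-G-T-a-G-k-o-m-o-l-a-s-o-s-T-i-l-a-Sil".split("-") * 40
units, cov = diallophone_coverage(transcription, target=0.99)
print(f"{len(units)} di-allophones cover {cov:.1%} of occurrences")
```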

[FIG11] Subspaces of the singing voice sonic space: subspace A spans phonetic articulation (e.g., Silence-m, p-V, dZ-Q), pitch, loudness, and tempo; subspace B holds vibrato types (e.g., regular, operistic, jazzy); subspace C holds musical articulations (e.g., soft and sharp attacks, legato and portamento transitions).

Subspace B contains different types of vibratos. We do not intend to be exhaustive but want to achieve a coarse representation of how the singer performs vibratos. Afterwards, at synthesis, samples in B are used as parameterized templates for generating new vibratos with scaled depth, rate, and tremolo characteristics. More precisely, each template stores voice model control envelopes obtained from analysis, each of which can later be used in synthesis to control voice model transformations of samples in A. For vibratos, as shown in Figure 12, the control envelopes we use are the EpR parameters (gain, slope, slope depth) and the pitch variations relative to their slowly varying mean, plus a set of marks pointing to the beginning of each vibrato cycle, used as anchor points for transformations (for instance, time-scaling can be achieved by repeating and interpolating vibrato cycles) and for estimations (e.g., the vibrato rate would be the inverse of the duration of one cycle). Additionally, vibrato samples are segmented into attack, sustain, and release sections. During synthesis, these templates are applied to flat (i.e., without vibrato) samples from subspace A, and the EpR voice model ensures that the harmonics will follow the timbre envelope defined by the formants while varying their frequency (Figure 13). Depending on the singing technique adopted by the singer, it is possible that the formants vary in synchrony with the vibrato phase. Hence, in the specific case where the template was obtained from a sample corresponding to the same phoneme being synthesized, we can also use the control envelopes related to the formant locations and shapes to generate those subtle timbre variations.

The third subspace, C, represents musical articulations. We model it with three basic types: note attacks, transitions, and releases. For each of them we set a variety of labels trying to cover the most basic resources. Note that we are considering here the musical note level and not higher levels such as musical phrasing or style, which are aspects to be ruled by the Performer Model discussed above. As with vibratos, samples here become parameterized templates applicable to any synthesis context [23].

Several issues have to be considered related to the recording procedure. Singers usually get bored or tired if the scripts are repetitive and mechanical. Therefore, it is recommended to sing on top of stimulating musical backgrounds. This, in addition, ensures an accurate control of tempo and tuning and increases the feeling of singing. Also, using actual meaningful sentences further increases the feeling of singing. We did this for the Spanish language, using a script that computed, out of several Spanish books, the minimum set of sentences required to cover the di-allophone subset. An excerpt of the Spanish recording script is shown in Figure 14. It took about three or four hours per singer to follow the whole script. Another aspect to consider is that voice quality often degrades after singing for a long time, and pauses are therefore necessary now and then. Finally, another concern is that high pitches or high loudness levels are difficult to hold continuously and can exhaust singers.

[FIG12] Vibrato template with several control envelopes (gain, slope, slope depth, and pitch) estimated from a sample. The pitch curve is segmented into vibrato cycles; the depth is computed as the difference between the local maximum and minimum of each cycle, while the mean pitch is computed as their average.

[FIG13] Along a vibrato, the harmonics follow the EpR spectral envelope while their frequency varies. For instance, when the fifth harmonic (H5) oscillates, it follows the shape of the second formant (F2) peak.

[FIG14] Excerpt from the Spanish recording script ("en zigzag como las oscilaciones de la temperatura"). The first line shows the sentence to be sung, the second line is the corresponding Speech Assessment Methods Phonetic Alphabet (SAMPA) phonetic transcription, and the third line is the list of phonetic articulations to be added to the database.

The creation of the singer database is not an easy task. Huge numbers of sound files have to be segmented, labeled, and analyzed, especially when sampling subspace A. That is why we put special effort into automating the whole process and reducing time-consuming manual tasks [40] (Figure 15). Initially, the recorded audio files are cut into sentences. For each of them we add a text file including the phonetic transcription and the tempo, pitch, and loudness values set in the recording scripts. Next we perform an automatic phoneme segmentation, adapting a speech recognizer tool to work as an aligner between the audio and the transcription. Then we do an automatic sample segmentation, fixing the precise boundaries of the di-allophone units around each phoneme onset, following several rules that depend on the allophone families. For example, in the case of unvoiced fricatives such as the "s" in the English language, we end the ∗-s articulation just at the "s" onset, and the beginning of the s-∗ articulation is set at the beginning of the "s" utterance. We do it like this because in our synthesis unvoiced timbres are not smoothly concatenated; thus, it is better to synthesize the whole "s" from a unique sample.

In our recording scripts we set the tempo, pitch, and loudness to be constant along each sentence. One reason is that we want to capture the loudness and pitch variations inherent to phonetic articulations (Figure 16) and make them independent of the ones related to the musical performance. Another reason is that we want to constrain the singer's voice quality and expression in order to ensure a maximal consistency between the different samples in the database, which helps hide the fact that the system is concatenating samples and increases the sensation of a continuous flow. Another significant aspect is that we detect gaps, stops, and timbre-stable segments (Figure 17). The purpose is to use this information to better fit samples at synthesis, e.g., avoiding the use of those segments in the case of fast singing instead of time-compressing samples. Details can be found in [40].

The last step is to manually tune the resulting database and correct erroneous segmentations or analysis estimations. Tools are desirable to automatically locate the most significant errors and help the supervisor perform this tuning without having to check every single sample.

[FIG15] Singer database creation process: each recorded sentence is cut and annotated (e.g., tempo 90, note A4, normal loudness), passed through automatic phoneme segmentation and automatic di-allophone segmentation, and finally analyzed (EpR, VPM, residual/transients) to populate the database.

[FIG16] Loudness and pitch variations inherent to phonetic articulations: a valley in both waveform amplitude and pitch can be observed along the a-g-a transition.

PERFORMANCE TRAJECTORY GENERATION

The Performance Trajectory Generator module converts performance actions set by the Performer Model into adequate parameter trajectories within the instrument's sonic space. We characterize singing voice performance trajectories in terms of coordinates in subspace A: phonetic unit sequences plus pitch and loudness envelopes. We assume that the tempo axis is already embedded in the phonetic unit track as both timing information and actual sample selection. An essential aspect when computing the phonetic timing is to ensure the precise alignment of certain phones (mainly vowels) with the notes [41], [4].

We saw in the previous section how the singing voice sonic space was split into three subspaces, the first one (A) including the actual samples to be synthesized and the other two (B and C) representing ways of transforming those samples to obtain specific musical articulations and vibratos. Hence, the Performance Trajectory Generator actually applies models from subspaces B and C to the coarse Performance Score coordinates in A, in order to obtain detailed trajectory functions within subspace A.

A representative example is shown in Figure 18, starting with a high-level performance score (top left) of two notes with the lyrics "fly me." The former note is an Ab2 played forte and attacked softly, and the latter a G2 played piano, ending with a long release and exhibiting a wet vibrato. The note transition is smooth (i.e., legato). From this input an internal score (top right) is built that describes a sequence of audio samples, a set of musical articulation templates, vibrato control parameters, and loudness and pitch values. Lowering another level, the musical articulation templates are applied and the resulting performance trajectory (bottom right) is obtained. In this same figure we can observe another possible input performance score (bottom left), in this case a recorded performance by a real singer, obviously a low-level score. From this we could directly generate a performance trajectory with a similar expression by performing phoneme segmentation and computing pitch and loudness curves.

The optimal sample sequence is computed so that the overall transformation required at synthesis is minimized. In other words, we choose the samples that locally best match the performance trajectory. With this aim, we compute the matching distance as a cost function derived from several transformation costs: the temporal compression or expansion applied to the samples to fit the given phonetic timing, the pitch and loudness transformations needed to reach the target pitch and loudness curves, and the concatenation transformations required to connect consecutive samples.
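This kind of cost-driven sample selection can be written as a standard dynamic-programming (Viterbi-style) search. The sketch below is a generic version with placeholder cost functions and sample features; the actual weights and distances used in the system are not those shown here.

```python
# Sketch: Viterbi-style selection of one database sample per phonetic unit,
# minimizing a transformation cost (fit to target pitch, loudness, duration)
# plus a concatenation cost between consecutive samples. Costs are placeholders.
import math

def transformation_cost(sample, target):
    return (abs(sample["pitch"] - target["pitch"])                 # pitch shift (semitones)
            + 2.0 * abs(sample["loud"] - target["loud"])           # loudness change
            + 3.0 * abs(math.log(target["dur"] / sample["dur"])))  # time stretch

def concatenation_cost(prev, cur):
    return abs(prev["pitch"] - cur["pitch"]) + abs(prev["loud"] - cur["loud"])

def select_samples(candidates_per_unit, targets):
    # best[j] = (accumulated cost, chosen sample path) ending in candidate j.
    best = [(transformation_cost(c, targets[0]), [c]) for c in candidates_per_unit[0]]
    for i in range(1, len(targets)):
        new_best = []
        for c in candidates_per_unit[i]:
            cost, path = min(((b_cost + concatenation_cost(b_path[-1], c), b_path)
                              for b_cost, b_path in best), key=lambda t: t[0])
            new_best.append((cost + transformation_cost(c, targets[i]), path + [c]))
        best = new_best
    return min(best, key=lambda t: t[0])

targets = [{"pitch": 56, "loud": 0.8, "dur": 0.5}, {"pitch": 55, "loud": 0.4, "dur": 1.2}]
cands = [[{"pitch": 55, "loud": 0.7, "dur": 0.4}, {"pitch": 60, "loud": 0.9, "dur": 0.6}],
         [{"pitch": 55, "loud": 0.5, "dur": 1.0}, {"pitch": 50, "loud": 0.4, "dur": 1.5}]]
print(select_samples(cands, targets))
```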

[FIG17] Gaps, stops, and timbre-stable segment detection (stable frames marked along an a-k-a utterance).

[FIG18] From performance score to performance trajectory: (a) high-level performance score, (b) internal score, (c) performance trajectory, and (d) low-level performance score.

SOUND RENDERING

The Sound Rendering engine is the actual synthesizer that produces the output sound. Its input consists of the performance trajectory within the instrument's sonic space. The rendering process works by transforming and concatenating a sequence of database samples. We could think of many possible transformations. However, we are mostly interested in those directly related to the sonic space axes, since they allow us to freely manipulate samples within the sonic space and therefore match the target trajectory with ease (Figure 19).

However, the feasible transformations are determined by the spectral models we use and their parameterization. In our case, the spectral models have been specially designed for tackling the singing voice (as seen in the Modeling the Singing Voice section) and allow transformations such as transposition, loudness change, and time-scaling, all clearly linked to the axes of sonic subspace A. Several resources related to musical articulation are already embedded in the performance trajectory itself; thus, no specific transformations with this purpose are needed in the rendering module. Still, other transformations not linked to our specific sonic space axes and particular to the singing voice are desired, such as the ones related to voice quality and voice phonation, which might be especially important for achieving expressive and natural-sounding rendered performances. For example, we could think of transformations for controlling the breathiness or roughness qualities of the synthetic voice. In particular, a roughness transformation would be very useful in certain musical styles, such as blues, to produce growling utterances. We explore several methods for producing these kinds of alterations using our voice models in [29] and [42].

If we restricted our view to the sonic space we deal with, we would reach the conclusion that transformed samples do connect perfectly. However, this is not true, because the actual sonic space of the singing voice is much richer and more complex than our approximation. Thus, transformed samples almost never connect perfectly. Many voice features are not described precisely by the coordinates in subspace A, and others, such as voice phonation modes, are simply ignored. For example, the phonetic axis describes the timbre envelope as phoneme labels, that is, rather coarsely. Hence, when connecting samples with the same phonetic description, the formants will not match precisely. Another reason for imperfect connections is that the transposition factor applied to each sample is computed as the difference between the pitch specified in the recording scripts and the target pitch, with the aim of preserving the inner pitch variations inherent to phonetic articulations (see the Performance Database section above), and therefore the pitch rarely matches at sample joints.

In order to smoothly connect samples, we compute the differences found at the joint points for several voice features and transform the surrounding sample sections accordingly by specific correction amounts. These correction values are obtained by spreading out the differences around the connection points, as shown in Figure 20. We apply this method to our voice models, specifically to the amplitude and phase envelopes of the VPM spectrum, to the pitch and loudness trajectories, and to the EpR parameters, including the controls of each formant [43]. The results are of high quality. However, further work is needed to tackle phonation modes and avoid audible discontinuities such as those found in breathy to nonbreathy connections.
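A minimal version of this smoothing, with invented feature values and a linear spreading window, could look as follows; the real system applies it per feature (VPM amplitude and phase envelopes, pitch, loudness, EpR parameters) rather than to a single scalar track.

```python
# Sketch: spread the discontinuity of one voice feature at a sample joint over
# the frames surrounding the connection point (linear weighting, invented data).
def smooth_joint(feature_a, feature_b, spread=4):
    """feature_a / feature_b: per-frame values of one feature for two consecutive
    samples; the last frame of A meets the first frame of B."""
    a, b = list(feature_a), list(feature_b)
    diff = b[0] - a[-1]                        # discontinuity at the joint
    for i in range(1, spread + 1):
        w = (spread - i + 1) / spread          # weight decays away from the joint
        if i <= len(a):
            a[-i] += 0.5 * diff * w            # pull the end of A up ...
        if i <= len(b):
            b[i - 1] -= 0.5 * diff * w         # ... and the start of B down
    return a, b

pitch_a = [100.0, 100.5, 101.0, 101.2, 101.5]
pitch_b = [104.0, 103.8, 103.5, 103.2, 103.0]
print(smooth_joint(pitch_a, pitch_b))
```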

FUTURE CHALLENGES

In this article we have introduced the concept of synthesis based on performance sampling. We have explained that although sampling has been considered a way to capture and reproduce the sound of an instrument, it should be better considered a way to model the sonic space produced by a performer with an instrument. With this aim we presented our singing voice synthesizer, pointing out the main issues and complexities emerging along its design. Although the current system is able to generate convincing results in certain situations, there is still much room for improvement, especially in the areas of expression, spectral modeling, and sonic space design. However, we believe we are not so far from the day when computer singing will be barely distinguishable from human performances.

[FIG19] Matching a performance trajectory by transforming samples. On the left (a), the target trajectory is shown in gray and the samples in black, with the two selected samples drawn with a wider stroke. On the right (b), these samples are transformed so that they approximate the target trajectory.

[FIG20] Sample concatenation smoothing: the discontinuity of a generic voice feature at the joint between sample A and sample B is spread as a correction over the frames around the connection point.

AUTHORS

Jordi Bonada ([email protected]) studied telecommunication engineering at the Technical University of Catalonia (UPC) in Barcelona, Spain. In 1996 he joined the Music Technology Group of the Pompeu Fabra University (UPF) as a researcher and developer. He has been leading various research projects in the fields of spectral signal processing, especially in audio time-scaling and singing voice synthesis and modeling. He is currently working toward the Ph.D. degree at UPF on singing voice synthesis.

Xavier Serra ([email protected]) is an associate professor at the Department of Technology of the Pompeu Fabra University (UPF) in Barcelona, Spain. He received the Ph.D. degree from Stanford University in 1989 with a dissertation focused on spectral modeling of musical signals. Currently he is the director of the Music Technology Group of the UPF, a group dedicated to audio content analysis, description, and synthesis.

REFERENCES
[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Mag., vol. 24, no. 2, pp. 92–104, Mar. 2007.
[2] E. Lindemann, "Music synthesis with reconstructive phrase modeling," IEEE Signal Processing Mag., vol. 24, no. 2, pp. 80–91, Mar. 2007.
[3] J. Bonada, A. Loscos, and M. Blaauw, "Unisong: A choir singing synthesizer," presented at the 121st AES Conv., convention paper 6933, San Francisco, CA, Oct. 2006.
[4] J. Sundberg, The Science of the Singing Voice. DeKalb, IL: Northern Illinois Univ. Press, 1987.
[5] M. Kob, "Singing voice modeling as we know it today," Acta Acust. united with Acustica, vol. 90, no. 4, pp. 649–661, 2004.
[6] X. Rodet, "Synthesis and processing of the singing voice," in Proc. 1st IEEE Benelux Workshop on Model Based Processing and Coding of Audio, Leuven, Belgium, Nov. 2002.
[7] P.R. Cook, "Singing voice synthesis: History, current work, and future directions," Computer Music J., vol. 20, no. 3, pp. 38–46, Fall 1996.
[8] J. Kelly and C. Lochbaum, "Speech synthesis," in Proc. 4th Int. Congr. Acoustics, Copenhagen, Denmark, 1962, pp. 1–4.
[9] J.O. Smith, "Physical modeling using digital waveguides," Computer Music J., vol. 16, no. 4, pp. 74–87, 1992.
[10] P. Cook, "SPASM: A real-time vocal tract physical model editor/controller and singer; the Companion Software Synthesis System," Computer Music J., vol. 17, no. 1, pp. 30–44, 1992.
[11] J. Mullen, D.M. Howard, and D.T. Murphy, "Acoustical simulations of the human vocal tract using the 1D and 2D digital waveguide software model," in Proc. 7th Int. Conf. Digital Audio Effects, Naples, Italy, Oct. 2004, pp. 311–314.
[12] J. Mullen, D.M. Howard, and D.T. Murphy, "Waveguide physical modeling of vocal tract acoustics: Flexible formant bandwidth control from increased model dimensionality," IEEE Trans. Audio, Speech, Language Processing, vol. 14, no. 3, pp. 964–971, May 2006.
[13] B.H. Story, I.R. Titze, and E.A. Hoffman, "Vocal tract area functions from magnetic resonance imaging," J. Acoust. Soc. Amer., vol. 104, no. 1, pp. 471–487, 1996.
[14] B.H. Story, "Using imaging and modeling techniques to understand the relation between vocal tract shape to acoustic characteristics," in Proc. Stockholm Music Acoustics Conf. (SMAC-03), 2003, pp. 435–438.
[15] C. Cooper, D. Murphy, D. Howard, and A. Tyrrell, "Singing synthesis with an evolved physical model," IEEE Trans. Audio, Speech, Language Processing, vol. 14, no. 4, pp. 1454–1461, July 2006.
[16] M. Macon, L. Jensen-Link, J. Oliverio, M. Clements, and E. George, "A singing voice synthesis system based on sinusoidal modeling," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, Munich, Germany, 1997, vol. 1, pp. 435–438.
[17] D.H. Klatt, "Review of text-to-speech conversion for English," J. Acoust. Soc. Amer., vol. 82, no. 3, pp. 737–793, 1987.
[18] G. Bennett and X. Rodet, "Synthesis of the singing voice," in Current Directions in Computer Music Research, M.V. Mathews and J.R. Pierce, Eds. Cambridge, MA: MIT Press, 1989, pp. 19–44.
[19] J. Bonada, A. Loscos, P. Cano, X. Serra, and H. Kenmochi, "Spectral approach to the modeling of the singing voice," presented at the 111th AES Conv., convention paper 5452, New York, Sept. 2001.
[20] D.G. Childers, "Measuring and modeling vocal source-tract interaction," IEEE Trans. Biomed. Eng., vol. 41, no. 7, pp. 663–671, July 1994.
[21] X. Serra, "A system for sound analysis-transformation-synthesis based on a deterministic plus stochastic decomposition," Ph.D. dissertation, CCRMA, Dept. Music, Stanford Univ., Stanford, CA, 1989.
[22] J. Laroche and M. Dolson, "New phase-vocoder techniques for real-time pitch-shifting, chorusing, harmonizing, and other exotic audio effects," J. Audio Eng. Soc., vol. 47, no. 11, pp. 928–936, Nov. 1999.
[23] J. Bonada, A. Loscos, O. Mayor, and H. Kenmochi, "Sample-based singing voice synthesizer using spectral models and source-filter decomposition," in Proc. 3rd Int. Workshop Models and Analysis of Vocal Emissions for Biomedical Applications, Firenze, Italy, 2003, pp. 175–178.
[24] J. Laroche, "Frequency-domain techniques for high-quality voice modification," in Proc. 6th Int. Conf. Digital Audio Effects, London, UK, Sept. 2003.
[25] R. DiFederico, "Waveform preserving time stretching and pitch shifting for sinusoidal models of sound," in Proc. 1st Int. Conf. Digital Audio Effects, Barcelona, Spain, Nov. 1998, pp. 44–48.
[26] R. Smits and B. Yegnanarayana, "Determination of instants of significant excitation in speech using group delay function," IEEE Trans. Speech Audio Processing, vol. 3, no. 5, pp. 325–333, Sept. 1995.
[27] B. Yegnanarayana and R. Veldhuis, "Extraction of vocal-tract system characteristics from speech signal," IEEE Trans. Speech Audio Processing, vol. 6, no. 4, pp. 313–327, July 1998.
[28] J. Bonada, "High quality voice transformations based on modeling radiated voice pulses in frequency domain," in Proc. 7th Int. Conf. Digital Audio Effects, Naples, Italy, Oct. 2004, pp. 291–295.
[29] A. Loscos and J. Bonada, "Emulating rough and growl voice in spectral domain," in Proc. 7th Int. Conf. Digital Audio Effects, Naples, Italy, Oct. 2004, pp. 49–52.
[30] M. Kob, "Physical modeling of the singing voice," Ph.D. dissertation, Institute of Technical Acoustics, Aachen Univ., Aachen, Germany, 2002.
[31] A. Röbel, "A new approach to transient processing in the phase vocoder," in Proc. 6th Int. Conf. Digital Audio Effects, London, UK, Sept. 2003, pp. 344–349.
[32] X. Rodet, Y. Potard, and J.B.B. Barrière, "The CHANT project: From the synthesis of the singing voice to synthesis in general," Computer Music J., vol. 8, no. 3, pp. 15–31, 1984.
[33] W. Kaegi and S. Tempelaars, "VOSIM—A new sound synthesis system," J. Audio Eng. Soc., vol. 26, no. 6, pp. 418–425, 1978.
[34] R. Bresin and A. Friberg, "Emotional coloring of computer-controlled music performances," Computer Music J., vol. 24, no. 4, pp. 44–63, 2000.
[35] J. Sundberg, A. Askenfelt, and L. Frydén, "Musical performance: A synthesis-by-rule approach," Computer Music J., vol. 7, no. 1, pp. 37–43, 1983.
[36] G. Widmer and W. Goebl, "Computational models of expressive music performance: The state of the art," J. New Music Res., vol. 33, no. 3, pp. 203–216, 2004.
[37] M. Laurson, V. Norilo, and M. Kuuskankare, "PWGLSynth: A visual synthesis language for virtual instrument design and control," Computer Music J., vol. 29, no. 3, pp. 29–41, 2005.
[38] A. Black, "Perfect synthesis for all of the people all of the time," in Proc. IEEE TTS Workshop 2002, Santa Monica, CA, Sept. 2002, pp. 167–170.
[39] J. Llisterri and J.B. Mariño, "Spanish adaptation of SAMPA and automatic phonetic transcription," ESPRIT Project 6819 (SAM-A: Speech Technology Assessment in Multilingual Applications), 1993.
[40] J. Bonada, A. Loscos, and M. Blaauw, "Improvements to a sample-concatenation based singing voice synthesizer," presented at the 121st AES Conv., convention paper 6900, San Francisco, CA, Oct. 2006.
[41] J. Ross and J. Sundberg, "Syllable and tone boundaries in singing," in Proc. 4th Pan European Voice Conf., Stockholm, Sweden, Aug. 2001.
[42] L. Fabig and J. Janer, "Transforming singing voice expression: The sweetness effect," in Proc. 7th Int. Conf. Digital Audio Effects, Naples, Italy, 2004, pp. 70–75.
[43] J. Bonada, A. Loscos, and H. Kenmochi, "Sample-based singing voice synthesizer by spectral concatenation," in Proc. Stockholm Music Acoustics Conf., Stockholm, Sweden, 2003, pp. 439–442.