Download - Exploring the perceived harshness of cello sounds by morphing … · 2017. 4. 11. · Cello bowing requires a very ﬁne control of the musicians’ gestures to ensure the quality

Exploring the perceived harshness of cello sounds by morphingand synthesis techniques

Jocelyn Roz�e,a) Mitsuko Aramaki, Richard Kronland-Martinet, and Sølvi YstadAix Marseille Univ, CNRS, PRISM (Perception, Representation, Image, Sound, Music), 31 Chemin J. Aiguier,13402 Marseille Cedex 20, France

(Received 7 June 2016; revised 23 February 2017; accepted 26 February 2017; published online 24March 2017)

Cello bowing requires a very fine control of the musicians’ gestures to ensure the quality of the per-

ceived sound. When the interaction between the bow hair and the string is optimal, the sound is per-

ceived as broad and round. On the other hand, when the gestural control becomes more

approximate, the sound quality deteriorates and often becomes harsh, shrill, and quavering. In this

study, such a timbre degradation, often described by French cellists as harshness (d�echarnement), is

investigated from both signal and perceptual perspectives. Harsh sounds were obtained from experi-

enced cellists subjected to a postural constraint. A signal approach based on Gabor masks enabled

us to capture the main dissimilarities between round and harsh sounds. Two complementary methods

perceptually validated these signal features: First, a predictive regression model of the perceived

harshness was built from sound continua obtained by a morphing technique. Next, the signal struc-

tures identified by the model were validated within a perceptual timbre space, obtained by multidi-

mensional scaling analysis on pairs of synthesized stimuli controlled in harshness. The results

revealed that the perceived harshness was due to a combination between a more chaotic harmonic

behavior, a formantic emergence, and a weaker attack slope. VC 2017 Acoustical Society of America.

[http://dx.doi.org/10.1121/1.4978522]

[TRM] Pages: 2121–2136

I. INTRODUCTION

Musical expressivity has been the subject of a large

number of studies for various instruments relating or not the

musicians’ body movements to the produced sounds. In the

case of cello playing, the particular musician–instrument

interaction, in close embrace type, emphasizes the implica-

tion of the corporeal aspects in the expressivity. We investi-

gate this topic in the present study by focusing on a special

acoustic feature of cello sounds, linked to a degradation of

timbre quality, and frequently referred to as harshness

(d�echarnement in French) by cellist teachers. A harsh sound

(son d�echarn�e) is recognized among bowed string players as

a sound in which timbre is impoverished, inducing a sensa-

tion of shrill, whistling, and loss in sound consistency.

Harsh notes are very frequent among young cellist stu-

dents and often arise from an inappropriate bowing gesture

(Guettler, 2002) and/or a bad postural coordination. Indeed,

many studies have demonstrated that, in addition to instru-

mental gestures directly responsible for the sound produc-

tion, musical interpretation is strongly influenced by

accompanying (or ancillary) gestures, i.e., postural move-

ments that are rarely learnt in music schools. Their musical

significance was explored for the clarinet (Wanderley et al.,2005; Desmet et al., 2012), the piano (Thompson and Luck,

2012), the harp (Chadefaux et al., 2013), and the violin (Van

Zijl and Luck, 2013). For example, Wanderley et al. (2005)

demonstrated that clarinetists might transcribe the phrasing

and rhythmical elements of a musical score through arm

flapping or waist and knee bending. In the same line of

thinking, we were interested in exploring embodied music

cognition of cellists, and how perceived musical features

might result from their gestures and postural synergies.

Leman (2008) introduced this notion of corporeal encoding

as an interconnection between two learning mechanisms: a

high-level reenacting mechanism, called action-perception

loop, able to build a complex sequence of gestures (right

bowing-arm positioning, left hand fingerings, etc.) in order

to reach a local musical target, and a low-level mechanism,

called sensorimotor loop, which continuously adjusts the

musician’s motor program (bow force, bow velocity, pos-

tural variables, etc.) to maintain the control of the sound pro-

duction. In Roz�e et al. (2016), we designed an experiment

with these former studies in mind, by assessing the effects of

different kinds of postural constraints on the cellist’s musical

expressivity along musical phrases. To ensure that expres-

sive degradations were a consequence of postural constraints

and not of technical weaknesses, the musicians that took part

in the experiment were all professional or much experi-

mented players. This experiment turned out to be particu-

larly revealing, since in some difficult musical passages, the

postural constraint affected the cellist’s bowing gesture in a

manner which might produce timbre degradations resulting

in sounds perceived as harsh. The aim of the present study is

to investigate this perceptual harshness in terms of signal

properties.

Previous research related to the perception of harsh

acoustic features for bowed-string instruments can be found

in Stepanek and Otcenasek (2005), Stepanek (2006), and

more recently in Fritz et al. (2012). These perceptual studiesa)Electronic mail: [email protected]

J. Acoust. Soc. Am. 141 (3), March 2017 VC 2017 Acoustical Society of America 21210001-4966/2017/141(3)/2121/16/$30.00

http://dx.doi.org/10.1121/1.4978522

mailto:[email protected]

http://crossmark.crossref.org/dialog/?doi=10.1121/1.4978522&domain=pdf&date_stamp=2017-03-01

aimed at correlating acoustic properties of violin sounds to a

list of verbal attributes frequently used to characterize musi-

cal timbre. Multidimensional scaling (MDS) analysis was

performed on the dissimilarity scores between pairs of stim-

uli in order to construct a perceptual space of verbal attrib-

utes, likely to explain the acoustic correlates along each one

of its dimensions. For example, in the article by Fritz et al.(2012), the main dimension of the space assessing the

“overall sound quality” corresponds to a spectral balance

between adjectives of desirable sounds (warm, rich, mellow)

and undesirable ones (metallic, cold, harsh). To correlate the

amount of harshness perceived along this dimension with

physical properties, the authors created synthetic violin

sounds by modifying the energy levels in five-octave spec-

tral bands, and performed dissimilarity tests on pairs of these

sounds. They demonstrated that the undesirable perceptual

effect of harshness corresponds to localized and excessive

high-frequency energy, combining most of the time with

poor low-frequency content. Thus, the harshness phenome-

non primarily seems to be linked to a spectral energy transfer

toward upper partials and indirectly to the notion of sound

brightness (brillance). Roz�e et al. (2016) characterized this

spectral transfer more finely by designing a relative bright-

ness descriptor related to the three frequency bands of the tri-

stimulus criterion (Pollard and Jansson, 1982). Numerous

studies (Grey and Gordon, 1978; McAdams, 1999; Caclin

et al., 2005) highlighted a strong correlation between bright-

ness and spectral centroid, i.e., the center of gravity of the

spectral energy distribution. According to Barthet et al.(2008) and Barthet et al. (2010), the brightness might also

play an important role in the tension perceived within clari-

net interpretation between a muffled tone judged as soft and

a bright, more aggressive one. Analyses of cello sounds by

Chudy et al. (2013) confirmed that boosted higher spectral

harmonics would cause a tone to be more “tensed.” As a

consequence, sounds produced in our expressive experimen-

tal context and perceived as harsh might also comply with an

increasingly aggressive and tense musical expression.

The aim of the present study is to better explain the

harshness phenomenon from an acoustical point of view,

based on certain assumptions linked to the way a postural

constraint alters the bowing gesture, and hereby induces this

sound degradation. The works of Demoucron (2008) or

Schoonderwaldt (2009) provide some cues from investiga-

tions on the relationship between the brightness and the

physical parameters of the bow–string interaction in the

case of the violin. They computed the brightness determined

by the spectral centroid values for different combinations

of physical control parameters (bow force, bow velocity,

bow-bridge distance), and projected them into Schelleng

diagrams (Schelleng, 1973). Such diagrams constitute a bi-

dimensional space defined by the bow force and the bow-

bridge distance, and display the variations of acoustical

parameters for a given bow velocity. Within Schelleng dia-

grams, spectral centroids of violin sounds increase with the

bow force and decrease with the bow velocity or the bow-

bridge distance. Guettler et al. (2003), Guettler (2003), and

Schoonderwaldt et al. (2003) also showed that lowering the

bow velocity or increasing the bow force resulted in higher

frequency partials. Hereby, we may assume that harsh

sounds, which are brighter than normal ones, result from

potentially stronger or slower bowing gestures, or smaller

bow-bridge distances (“sul ponticello”). Such sounds, there-

fore, might locate around the upper frontier of the Schelleng

diagram within the “Raucous” area, above which the normal

Helmholtz motion no longer exists.

In this paper, we are going to explore in depth the signal

characteristics composing the notion of harshness for cello

sounds, and validate it from a perceptual point of view by

adapted listening tests. We expect that this phenomenon

matches a complex multidimensional alteration affecting

spectral, but maybe also spectro-temporal dimensions. Thus,

we will first observe this sound quality degradation from a

macroscopical point of view before inferring on the spectral

and temporal descriptors of timbre that best characterize the

harshness. Then, we will use two complementary methods to

perceptually validate these descriptors. In the first method,

the aim is to build a perceptual model of harshness com-

posed of the most relevant signal descriptors. A morphing

technique based on Gabor tools is presented to build con-

tinua between round and harsh sounds. Then the amount of

perceived harshness is predicted by multilinear regression on

all the descriptors to identify the most relevant descriptors in

theory. In the second method, we assess the perceived sensa-

tion of harshness by controlling the envelopes of two rele-

vant descriptors of the predictive model. The process uses

cross-synthesis techniques and is perceptually validated by

the construction of a perceptual timbre space issued from

MDS analysis.

II. QUANTIFYING THE SOUND QUALITYDEGRADATION

In this part, we attempt to quantify the acoustic degrada-

tion of cello sounds, perceived as a harshness phenomenon,

through a macroscopical comparative analysis and finer

comparisons of suitably chosen acoustic descriptors. The

presented sound corpus was used in Roz�e et al. (2016) for

a preliminary and quantitative exploration of the harsh

degradation, based on a spectro-temporal duality between

two descriptors. In the present paper, our aim is to refine this

preliminary approach by proposing a full panel of signal

descriptors that might account for the harshness phenome-

non. An original approach based on Gabor masks is pro-

posed in order to assess the descriptors that best characterize

the timbre transformation between normal and harsh sounds.

A. Sound corpus

The sound corpus was created from data collected in the

experiment conducted by Roz�e et al. (2016). In the protocol

of this experiment, we asked seven cellists to play a score in

the most expressive way, while being subjected to four kinds

of postural conditions. For the scope of this paper, we only

deal with data collected in two opposite postural situations: a

normal one, where the cellists could play naturally as in a

performance context, and a physically fully constrained one,

in which they had to play with the torso attached to the back

of a chair by a safety race harness, and a neck collar limiting

2122 J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al.

their head movements. All the participants were professional

or very experienced musicians to ensure that any loss of

expressivity was due to the physical immobilization and no

to potential weaknesses in their playing technique. They

were asked to perform the experimental score with the same

instrument at tempo ¼ 45 bpm, in two articulation variants:

detached (d�etach�e) and legato. By using the same cello, we

ensured that the physical characteristics of the instrument

would not alter subsequent timbre analyses.

For each postural condition, the musicians’ impressions

were collected through a questionnaire assessing their feel-

ings, specific difficulties that they may have encountered, as

well as the perceived influence of the postural condition on

their movements and expressive sound features. Most of them

reported a degradation of their usual sound palette within the

fully constrained postural situation, and especially the impres-

sion of producing “tighter, more tense sounds” (“sons plus�etriqu�es, tendus” in French), “lacking of depth and natural

resonances.” Some of them also evoked more “harsh, shrill,

and whistling notes” (“notes plus d�echarn�ees, criardes etsifflantes” in French), which were particularly common within

a specific musical passage of the score (Fig. 1), containing

more gestural and rhythmic difficulties. A thorough post-

examination of the sound recordings revealed that in spite of

the postural constraint, the cellists managed quite well to pro-

duce a proper Helmholtz motion, except in the specific musi-

cal passage described previously. According to the authors’

subjective judgments, several notes located in the first bar of

the passage were randomly affected by the harsh timbre deg-

radations. The first note of the bar, a dotted sixteenth of pitch

E3 (fundamental frequency 329.63 Hz) was most clearly and

regularly perceived as harsh. As a consequence, we chose this

note to create the sound corpus of the study and to perform

harsh timbre comparisons. It should be mentioned that this

note corresponds to a rhythmic difficulty requiring an excel-

lent coordination between the right and left arms, since it

occurs prior to a pushed bow stroke (pouss�e d’archet) com-

bined with a fast-shift (d�emanch�e rapide) of the left arm. We

may here suppose that the constraint altered this coordination

between the two instrumental gestures, which resulted in a

strong decrease in timbre quality for this note.

The sound corpus was obtained by extracting 20 of these

E3 notes from the recordings of all the cellists, whatever the

articulations d�etach�e or legato. The extraction process for

the E3 notes relied on a pitch-tracking algorithm adapted

from the MIR toolbox (Lartillot and Toiviainen, 2007).

More precisely, these 20 notes were composed of 10 pairs of

E3 notes, respectively, selected within the normal and con-

strained postural contexts for a given cellist and articulation.

A pair of notes associates one of the round/beautiful sounds

produced in the normal situation to one of the harsh/

degraded sounds produced in the constrained situation (for

the same cellist and articulation).

The selection of round/harsh pairs of notes was per-

formed by the authors through an informal listening test,

conducted on the audio recordings obtained after the experi-

ment. The authors used high quality loudspeakers and

headphones to detect significant harshness dissimilarities

between sounds in the normal and the posturally constrained

situation. Their musical expertise and knowledge of cello

practice enabled them to select the sounds that best reflected

the differences between the two postural situations. The rele-

vance of the authors’ judgment was further confirmed by

the results of the first perceptual experiment presented in

this paper [see Fig. 6(a), Sec. III E]. Sometimes, in the con-

strained situation, some cellists played the E3 without being

able to prevent the bow from slipping on the D open string,

which resulted in a sound that contained a second tone in the

onset phase. We systematically removed this artifact as far

as possible by means of a very selective notch filter centered

on the D2 frequency.

The average length of the notes composing this corpus

was 0.31 s, which is shorter than the theoretical duration of

dotted sixteenths (0.5 s at tempo ¼ 45 bpm). The standard

deviation of their durations was 0.06 s. This suggests that the

cellists consistently played the E3 notes more like sixteenths

(theoretical duration 0.33 s at ¼ 45 bpm) than like dotted

sixteenths. Timbre discrimination of such short stimuli might

not be obvious. However, their durations remained higher

than the minimum duration required to recognize the timbre

of a sound (around 100 ms in Schaeffer, 1966). The authors

thus judged the task feasible, especially since the partici-

pants were cellists who are trained to constantly pay atten-

tion to the quality of their sounds.

B. Gabor tools

In the present study, Gabor transforms were used to

reveal differences between pairs of round/harsh sounds com-

posing the corpus. Gabor transforms, or short-time Fourier

transforms (STFTs) with Gaussian analysis windows, are

useful tools to get a macroscopic view of sound signal prop-

erties. They make it possible to represent the timbre of a

sound as an image in the time-frequency space, defined on a

suitably discretized time-frequency lattice. If a,b> 0 are

time and frequency sampling constants, and L denotes the

length of the analyzed signal, the time-frequency lattice of

its Gabor transform is characterized by M, the number of

FIG. 1. Bar of the score selected to investigate the harshness. The analyzed

note (E3) has been circled for each articulation variant: (a) Detached(D�etach�e) playing mode, (b) Legato playing mode.

J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al. 2123

frequency channels, and N, the number of time steps, such

that L¼Mb¼Na (LTFAT toolbox of Søndergaard et al.,2012). For a signal x 2 C

L, the transform can be written as

coefficients cm;n 2 CM�N

of the signal expansion in a family

of Gabor atoms gm;n ¼ e2ipmb l�nað Þg l� nað Þ,

cm;n ¼ hx; gm;ni ¼XL�1

l¼1

x lð Þe�2ipmb l�nað Þg l� nað Þ; (1)

with n 2 0;N � 1½ �;m 2 0;M � 1½ �. l is a discrete time vari-

able along the length L of the signal, and a, b> 0 are the

time and frequency sampling constants, respectively.

Differences between Gabor transforms for round and harsh

sounds can further be observed by computing their time-

frequency transfer function. This function is called a Gabor

mask m, and can be evaluated using a minimization process

based on the least-square criterion (Depalle et al., 2006)

W mð Þ ¼ kcm;n x2ð Þ �Mmcm;n x1ð Þk2 þ kkm� 1k2; (2)

where k 2 Rþ is the Lagrange parameter introduced to con-

trol the norm of m.

For the source and target signals, respectively, denoted

x1; x2 2 CL, the mask m is estimated by considering that

there is no difference between the signals when m¼ 1,

m ¼kcm;n x1ð Þkkcm;n x2ð Þk þ k

kcm;n x1ð Þk2 þ k¼ C1C2 þ k

C21 þ k

; (3)

where C1 and C2 are the modulus of the Gabor transforms of

the signals x1 and x2, respectively. As explained by Olivero

et al. (2010) and Sciabica et al. (2012), k plays the role of a

regularization parameter, which enables us to reveal more or

less detailed differences between signals.

C. Highlighting overall features of harshness

In this section we explain how to compute and tune a

Gabor mask in order to capture the salient dissimilarities of tim-

bre between the round and harsh versions of the note E3. One

sound pair was chosen among the ten pairs of the corpus (Roz�e,

2017). First of all, we matched the two signals of the pair in the

time-frequency domain. In the frequency space, this operation

was intrinsically achieved because the pitches to compare were

identical. In the temporal space, however, we needed to syn-

chronize the signal durations with an accurate time-scaling pro-

cess. This implied changing the signal durations without

affecting their pitch-contour and the time-evolution of the for-

mantic structure. The procedure was achieved using AdobeAudition software (San Jos�e, CA) by reducing the duration of

the longest signal so that it fitted with the shortest.

To capture global spectral tendencies within the Gabor

mask, a sliding window size of M¼ 2048 bins was chosen.

This corresponds to a lattice frequency resolution of �20 Hz.

From Eq. (3), we controlled the Gabor mask estimation

with different values of the regularization parameter k to get

the roughest (k� 1) or finest (k! 0) differences between

the signals. The time-frequency representation of the mask is

displayed in Fig. 2 for k¼ 1, and can be interpreted as an

overview of the harshnessing process of a round sound. In

Fig. 2, we can notice that black and white parts of the image

mainly coincide with the frequency bands situated at multi-

ple integers of the fundamental frequency of the note E3,

which is logical since the analyzed sounds are instrumental

and harmonic by nature. More interestingly, we observe as a

whole that the harsh sound (relatively to the round one)

presents strong energy reinforcements (white bands) in the

middle-upper spectrum area, and strong energy decreases

(black bands) in the lower spectrum area. This corresponds

for the harsh sound to a transfer of spectral energy toward

higher frequencies and the emergence of formants.

A detailed observation of the Gabor mask guides us

toward three salient acoustic features that might explain

the harshness phenomenon: an emergence of formantic

areas, an energy loss in the attack part, and fluctuations in

the temporal deployment of harmonic amplitudes. Regarding

the formantic feature, the repartition of black and white

bands clearly indicates a transfer of the spectral energy from

lower to higher partials, with a noticeable reinforcement

of the fifth component, the G]5 (330� 5¼ 1650 Hz). This

may indicate that the harsh E3 note would globally sound

brighter and in sympathetic resonance with its major third.

It is important to note that the emergence of formants might

be related to the physics resulting from an incorrect bowing

gesture (Demoucron, 2008; Guettler, 2003), suggesting a

slower bow speed for harsh sounds. Regarding the attack

feature, we can notice consistent black portions located at

the beginning of each harmonic band. This might correspond

to a global delay in the birth of harmonics composing the

harsh sound, globally implying a less marked attack than in

the round sound. Finally, regarding the harmonic features,

we observe amplitude fluctuations appearing in the temporal

deployment of each partial, which suggests some differences

FIG. 2. Gabor mask computed between the round and harsh signals of a cor-

pus pair for a regularization parameter k¼ 1. Their macroscopic differences

appear as deviations from m¼ 1 (corresponding to no transformation).

White frequency bands correspond to spectral energy reinforcements (mask

deviations greater than one) of the harsh sound compared to the round

sound, whereas black bands correspond to a decrease in spectral energy

(mask deviations lower than one).


in the relative life cycle of partials between round and harsh

sounds.

D. From Gabor masks to harshness descriptors

The three features revealed by the Gabor mask analysis

suggest that the harshness is multidimensional by nature.

Further investigations of perceived and observed timbre

changes should be carried out by associating these observa-

tions with existing spectral, temporal, and spectro-temporal

timbre descriptors.

1. Spectral descriptors

From a spectral viewpoint, the main phenomenon

revealed by the Gabor mask sheds light on an energy increase

in the high frequencies of the harsh signal, and more precisely

an energy transfer from low to high frequencies with a for-

mantic reinforcement around 1650 Hz. This transfer may cor-

respond to an increase in the barycenter of the spectral energy

distribution, commonly characterized as the spectral centroid

(Peeters, 2004; Schoonderwaldt, 2009). Many studies refer to

this descriptor as perceptually associated with the brightness

(Grey and Gordon, 1978; McAdams, 1999; Caclin et al.,2005; Merer et al., 2007) and it has been validated as a reli-

able indicator of the timbre of bowed-string instruments like

the violin (Stepanek and Otcenasek, 2005; Stepanek, 2006;

Fritz et al., 2012). From our observations of the differences

between round and harsh sound signals, an increase in harsh-

ness might contribute to an increase in brightness.

Considering the harmonic nature of cello sounds, we

decided to compute the harmonic spectral descriptors. For

this purpose, a detection of the harmonic components and

the extraction of their instantaneous amplitudes were per-

formed beforehand. Knowing the fundamental frequency

of the note analyzed (E3), this process could be achieved

through decomposition in subbands centered on the signal

harmonics (Barthet et al., 2007). We could thus compute the

harmonic spectral centroid (HSC) of the sound, which corre-

sponds to the amplitude-weighted mean of its harmonic

peaks in the spectrum

HSC ¼ f0

XK

k¼1

kAk

XK

k¼1

Ak

(4)

where k and Ak are, respectively, the bin index and the ampli-

tude of the kth harmonic peak of the fundamental frequency

f0, and K is the total number of harmonics considered.

To characterize the spectral energy transfer occurring

within the harshness phenomenon, we computed the har-

monic tristimulus (Pollard and Jansson, 1982), which

describes the spectral energy distribution in three frequency

bands as the energy ratio between each band and the total

number of harmonics. The first band contains the fundamen-

tal frequency, the second one the medium partials (two,

three, four), and the last one the higher partials (five and

more). Three coordinates were hereby obtained, correspond-

ing to spectral centroid computations for each band

TR1 ¼A1XK

k¼1

Ak

; TR2 ¼

X4

k¼2

Ak

XK

k¼1

Ak

; TR3 ¼

XK

k¼5

Ak

XK

k¼1

Ak

; (5)

where Ak is the amplitude of the kth harmonic peak and K is

the total number of harmonics considered. From this classi-

cal definition of the tristimulus, we designed a new more

compact ratio focusing on the spectral transfer related to the

harshness phenomenon

TRIratio ¼ TR3

TR1 þ TR2

: (6)

The values of this spectral transfer descriptor are expected to

decrease for a round cello note and increase for a harsh one.

We characterized the emergence of formants revealed

by the Gabor mask using MFCCs (mel-frequency cepstral

coefficients; Davis and Mermelstein, 1980). This descriptor

stands for a simplified view of the spectral envelope with

just a few coefficients, and is therefore well suited to reveal

particular spectral envelope variations such as appearances

or shifts in formantic areas. It also takes into account the fre-

quency selectivity of our auditory system by using frequency

bands on a mel scale that are more selective for lower fre-

quencies than for higher ones (Kim et al., 2006; Bogert

et al., 1963). The MFCCs are linked to the notion of ceps-

trum, which has been intensively used to characterize the

formants of the voice because of its acoustic behavior of

source/filter type. A cello may also be modeled in this way,

if we consider the bow friction on a string as the excitation

source and the instrument-body as the filter. A harsh sound

would result from an incorrect excitation of the string by the

bow, which translates through the cello sound box into the

shift of a formant (an energy transfer) from low- to high-

frequency areas. MFCC coefficients ci are computed through

a DCT (discrete cosine transform) applied to the logarithmic

spectral envelope as follows:

ci ¼XNf

k¼1

log Ekð Þcos i k� 1

2

� �pNf

" #; 1 � i � Nc; (7)

where Ek is the spectral energy measured in the critical band

of the kth mel filter, Nf is the total number of mel filters, and

Nc�Nf is the number of MFCC coefficients to compute.

Standard recommendations were followed for MFCC

calculations (Kim et al., 2006): a mel-filter bank built with a

number of filters Nf¼ 24, and a bandwidth of Bm¼ 2700 mel

(limiting the frequency range to 8000 Hz). With a typical

number of MFCC coefficients Nc¼Nf /2¼ 12, we obtained

the smooth spectral envelopes of a round and a harsh signal,

presented in Fig. 3(a). The formantic structures of these two

envelopes look quite different, but can clearly be explained

by an examination of the first two MFCC coefficients.

Indeed, the MFCC coefficients C1 and C2 [see Fig. 3(b)] are


the founding parameters that describe the global spectral

shape. C1 induces a low-pass behavior of the spectrum, while

C2 induces a bandpass effect. Note that C0 simply represents

the mean log-energy output of the filterbank, which is not of

interest in the present case when comparing spectral shapes.

The MFCC analysis thus suggests that the shape of the spec-

tral envelope evolves from a low-pass behavior for the round

sound (C1>C2), to a bandpass behavior for the harsh one

(C1<C2). This is coherent with the spectral transfer of har-

monic energies accounted for by the tristimulus criterion

[Eq. (5)].

In the same way as the TRIratio, we designed an MFCC

ratio suitable for describing the harsh formantic behavior

MFCCratio ¼ kC2kkC1k

: (8)

Given the behavior of the MFCC coefficients, the values of

this formantic descriptor should decrease for a round cello

note and increase for a harsh one.

2. Temporal descriptors

From a purely temporal viewpoint, the Gabor masks

shed light on a lack of energy at the beginning of the harsh

signal. Perceptual studies of Grey (1977) and McAdams

et al. (1995) have shown that the rise time of the energy of

the signal during the attack phase plays a prominent role in

instrument classification. From these statements, we deduced

that a suitable descriptor of the attack phase might help to

determine the timbre quality and thus contribute to the per-

ception of the harshness. A common temporal descriptor is

the log attack time (LAT) (Kim et al., 2006):

LAT ¼ log10 Tmax � Tthresholdð Þ; (9)

where Tmax is the time needed for the signal envelope to

reach its maximal value, and Tthreshold is the time needed for

the envelope to exceed 2% of its maximal value.

Furthermore, to propose a descriptor that is less

dependent on the energy differences between signals, we

FIG. 3. (a) Formantic envelopes of the round and harsh signals (the same sound pair as in Fig. 2). The vertical lines correspond to the center frequencies of 24

critical bands in the mel-filter bank, from which spectral envelope energies are computed. (b) The 12 first mean MFCC coefficients associated to the previous

round and harsh formantic envelopes. Absolute values of these coefficients are displayed to highlight the behavior difference between signals.


computed an attack slope (ATS) or temporal increase

(Peeters, 2004), defined as the temporal average slope of the

energy during the attack phase

ATS ¼ AP

AT; (10)

where AP is the peak value of the signal envelope reached at

time Tmax, and AT¼ Tmax�Tthreshold is the attack time.

According to our observations, the values of this attack

descriptor should increase for a round cello note and

decrease for a harsh one.

3. Spectro-temporal descriptor

The third feature of the Gabor mask suggested the pres-

ence of energy fluctuations or asynchrony in the temporal

deployment of each harmonic amplitude. The instantaneous

amplitudes of harmonic components were thus extracted for

each round-harsh sound pair. In Fig. 4, we present this sub-

band decomposition for the same pair used in Gabor mask

analysis. Interestingly, it can be observed that the nature of

harmonic interactions seems quite different between the two

types of sounds. Indeed, the temporal harmonic structure

deploys itself regularly in the round signal, whereas it is much

more chaotic and disordered in the harsh one, with noticeable

reinforcements in energy of the fifth component around

120 ms and 220 ms after onset. This spectro-temporal phe-

nomenon might be efficiently captured by the HSV (harmonic

spectral variation), a descriptor related to the spectral flux

which focuses on harmonic components, and that has already

been used by Chudy et al. (2013) to assess timbre changes in

cello performances. The HSV is a descriptor related to the

time-varying spectral content, describing the spectral varia-

tion of harmonic amplitudes between adjacent frames. At the

frame level n, the HSV is defined as the complement to one

of the normalized correlation between the amplitudes of har-

monic peaks from two adjacent frames (Kim et al., 2006)

HSV ¼ 1

N

XN

n¼2

1�

XK

k¼1

Ak;n–1Ak;nð ÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiXK

k¼1

A2k;n–1

s ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiXK

k¼1

A2k;n

s8>>>>><>>>>>:

9>>>>>=>>>>>;; (11)

where Ak,n denotes the amplitude of the kth harmonic peak

in the nth frame. K and N are, respectively, the total number

of harmonics and frames considered. The values of this spec-

tral fluctuation descriptor should decrease for a round cello

note and increase for a harsh one.

E. Validating harshness descriptors

In order to assess the relevance of the six previously pre-

sented descriptors to characterize the harshness at the signal

level, we performed statistical tests on the sound corpus,

composed of round and harsh notes divided in two groups of

ten samples. Mean values and standard deviation were then

computed on the two groups for each descriptor. The results,

presented in Fig. 5 and Table I, indicate that all the descrip-

tors except the LAT are promising candidates for explaining

the harshness. The relevance of each signal descriptor was

evaluated by performing a simple paired two-tailed t-test,

based on the null hypothesis that the means of the two

groups are the same. We can also observe that, except for

the LAT and ATS, the descriptor values are much more

spread for harsh sounds than for round sounds. Regarding

the spectral descriptors (HSC, TRIratio, and MFCCratio),

this phenomenon suggests that harsh sounds would sound

more or less bright, according to spectral energy increases in

the upper or middle parts of the spectrum, respectively. By

contrast, round sounds may be perceived as “deep” and

“mellow,” because their spectral energies are better localized

on low-frequency components. In the same way, the spectro-

temporal descriptor values (HSV) that reveal a stronger

spectral spread within harsh sounds, suggests different levels

of asynchrony in the temporal evolution of transients. By

contrast, round sounds may be characterized overall by a

more systematic synchronicity between the harmonic transi-

ents. In the two next parts, we will try to assess the percep-

tion of these signal tendencies through two complementary

analysis methods.

III. BUILDING A PERCEPTUAL MODELOF HARSHNESS

From the validation of harshness descriptors, we con-

cluded on five acoustic descriptors that potentially could dis-

criminate round and harsh sounds from a signal point of

view. In this part, we propose an experimental protocol that

aims at identifying the most prominent signal descriptors

reflecting the variation of perceived harshness. This is based

on a perceptual evaluation of intermediate sounds between

round and harsh signals of a pair.

FIG. 4. Instantaneous amplitudes of

the first five harmonics for the (a)

round and (b) harsh sounds (sound pair

of Fig. 2).


A. Participants

Fifteen experienced cellists took part in the experiment.

They were all volunteer teachers or students from musical

schools or the conservatory of Marseille. All of them had

extensive experience in cello playing, from 7 to 40 years of

practice (the youngest had teacher recommendations). None

of them had self-reported hearing problems.

B. Continua of stimuli

An intuitive way to assess an amount of perceived

harshness consists in comparing a continuum of gradually

harsher sounds with a reference sound judged to be of good

quality (round; Aramaki et al., 2009; Aramaki et al., 2011).

The ten sound pairs of the corpus were thus used as a base to

build continua of stimuli for this experiment. For each sound

pair, we carried out a morphing process to create a contin-

uum of four sounds gradually deteriorating into harshness.

This process first required a synchronization of the durations

of the round/harsh signals for each pair. It was achieved with

the Adobe Audition software, by using the same time align-

ment technique as for Gabor mask construction.

The morphing process between pairs of round and harsh

sounds was possible thanks to the properties of the Gabor

multipliers (Olivero et al., 2010). Indeed, from Eq. (2), we

define Mm, the Gabor multiplier of mask m, which morphs

into a target signal x2 by a pointwise multiplication of its

mask with the Gabor transform coefficients of the source

signal x1,

x2 ’Mmx1 ¼Xm;n

mcm;n x1ð Þhm;n; (12)

where cm,n(x1) is the Gabor transform of the source signal x1,

and hm,n the dual synthesis window of gm,n [Eq. (1)].

To create the four sounds composing the continuum of a

given pair, we provided the multiplier equation [Eq. (12)]

with the round sound as the source signal x1, and the harsh

one as the target signal x2. Intermediate sounds of these

two signals were generated by computing several Gabor

masks with gradually decreasing values of the regularization

parameter k [Eq. (3)]. The steps selected for k have been

chosen by the authors through an informal listening test to

produce the most continuous evolution in harshness percep-

tion. For each sound pair, two Gabor masks corresponding to

two gradual intermediary k values were computed, and the

two associated sounds resynthesized with the multiplier for-

mula. The four sounds composing each continuum were

finally equalized in loudness with the Loudness toolbox

(Genesis, 2009). The whole procedure performed on each of

the ten sound pairs resulted in a sound corpus composed of

ten continua of four stimuli (cf. Roz�e, 2017, for examples of

continua).

FIG. 5. (Color online) Comparison of

spectral, temporal, and spectro-temporal

features between the two groups of

round and harsh sounds. The central

marks are the medians, the edges of the

boxes are the 25th and 75th percentiles.

TABLE I. Results of paired t-tests on the six defined acoustical descriptors.

The relevance of discriminating the round and harsh groups of ten sounds

for each descriptor is given by the p-value p: *p< 0.05, **p< 0.01,***p< 0.001.

Descriptors HSC TRIratio MFCCratio LAT ATS HSV

t(9) ¼ 5.69*** 4.1** 4.13** 1.21 �6.17*** 5.16***


Note that the morphing process that transforms the

round sound into a harsh one is a monotonously decreasing

mathematical transformation, which not necessarily reflects

a linear perception of harshness. For this reason, in some

continua, it was slightly more difficult to ensure the per-

ceived continuity between the second synthetic stimulus and

the harsh target sound x2. The subject’s scores observed in

the following MUSHRA (multi stimulus test with hidden ref-

erence and anchor)-based listening test [Fig. 6(a)] confirmed

this perceptual tendency.

C. MUSHRA-based procedure

The listening test of this experiment was designed with

the MATLAB software, and installed on a laptop with

Sennheiser HD-650 headphones (Wedemark, Germany). We

performed it in the laboratory or in quiet places depending on

the cellists’ availabilities. Its conception was inspired from

the MUSHRA comparison method (ITU-R Recommendation

BS.1534-1, 2003), commonly used for audio codec quality

assessment. The MUSHRA design relies on multiple compari-

sons of items with respect to a given reference. This reference

is hidden among the set of items in order to potentially elimi-

nate subjects who cannot recognize it.

We designed an interface based on these principles

(cf. Roz�e, 2017), but suited to our perceptual comparisons

of harshness. Each subject was randomly presented with

ten successive trials corresponding to the ten continua of

four stimuli to compare. Within each continuum, the presen-

tation order of the four sounds was randomized. The round

sound was chosen as the hidden reference of the MUSHRA

test. Subjects were asked to rate the relative amount of per-

ceived harshness between this reference and each of the four

sounds of the continuum according to a continuous quality

scale (CQS). This CQS consists of identical graphical scales

divided into five equal intervals gradually composed from

the bottom to the top scale of the adjectives “barely more

harsh” (a peine plus in French), “a little more harsh” (un peuplus), “more harsh” (plus), “considerably more harsh” (bienplus), and “much more harsh” (beaucoup plus). Subjects

FIG. 6. Observed vs predicted scores of the least-square multiple linear regression model (a) by continuum (observed with the subjects’ standard deviations in

grey solid line, predicted in black dotted line) and (b) for all the continua (R2¼ 0.724, n¼ 40). (c) Normal probability plot of residuals. (d) Raw dispersion of

residuals.


could rate 0 for a sound on this scale if they did not perceive

a difference in harshness with respect to the reference, or

even judge it identical. They could listen to the sounds as

many times as desired, but it was not possible for them to

return to previous continua once their choice was validated.

D. Data analysis

To investigate the relationships between the multidi-

mensional set of acoustic features and the perceptual scores

of the subjects, we carried out a multiple linear regression

analysis. Two regression equations containing the most sig-

nificant signal predictors were inferred. We here describe the

different data preparation steps for the regression process.

1. Score checking

Subjects’ ratings were most often zero for the sound cor-

responding to the hidden reference, although a small amount

of harshness was sometimes attributed for this reference. We

checked that the score of the round reference was always the

lowest within each continuum. Consequently, we did not

reject any continuum or subject.

2. Dataset building

For the 40 sounds composing the experimental plane,

we computed the 5 acoustical descriptors from Sec. II E. In

addition, the distribution of the four perceptual scores within

each continuum was centered and normalized (mean 0, stan-

dard deviation 1) to compare the ten continua independently

of their respective references. These perceptual scores were

averaged on all the subjects for each sound.

3. Multiple linear regression

We performed a multiple linear regression analysis

on the dataset in order to find a relevant model of signal

descriptors which might explain the perceptual harshness

scores (40 observations). This regression relies on typical

least squares fitting, a process consisting in searching the fit-

ting coefficients bp, which minimize the mean square differ-

ence �2 between the model bY (or prediction vector Xpbp) and

the response vector Y,

Y ¼ bY þ � ¼ Xpbp þ �; (13)

where p¼ 5 independent variables Xp of signal descriptors

computed on the 40 sounds. To certify the robustness of

this basic model, we checked the distribution of its residual

vector �, and validated the assumptions of linearity and

homoscedasticity.

As a complementary approach, we also performed a

coarser, stepwise regression to identify the most prominent

descriptors of the basic predictive model. This stepwise

regression has the advantage of automatically detecting mul-

ticollinearities among the predictors. It is here used as an

additional approach to keep the less redundant descriptors

that are likely to explain the perceived harshness scores.

E. Results

The results of the least-square multiple linear regression

method are presented in Table II. They suggest that three out

of the five regression coefficients are statistically different

from zero (significance level 5%): The spectro-temporal

descriptor HSV is predominant in the model (p< 0.001),

the temporal descriptor ATS also seems to be relevant

(p< 0.01), and the MFCCratio appears as the most relevant

among the purely spectral descriptors (p< 0.05).

Figure 6 describes the fitting between the predictive

model and the observed scores. In Fig. 6(a), we notice that

the subjects actually perceived a global increase in amount

of harshness within most sound continua. This perception

was not fully linear, since they often rated the second syn-

thetic stimuli (s3) as harsher than the real harsh sound (s4).

This can be explained as a consequence of the morphing pro-

cess that does not directly reflect human perception. In the

same way, most of the predictions within each continuum

follow an increasing tendency, which reveals that it was

possible to find a linear combination of acoustic descriptors

that fitted quite well the observed scores of harshness.

This tendency is confirmed in Fig. 6(b) with a rather signifi-

cant determination coefficient (R2¼ 0.724 and adjusted R2

¼ 0.683). Similarly, the statistical distribution of the resid-

uals [Fig. 6(c)] also tends to confirm that linearity and homo-

scedasticity assumptions are verified. Indeed, the linearity

hypothesis appears in the normal probability plot of resid-

uals, and homoscedasticity in the regular scattering of resid-

uals on both sides of the axis y¼ 0. A Lilliefors test

confirmed that the residual vector � exhibits a normal distri-

bution (p> 0.5).

From the process of least-square regression, we thus

obtain a basic predictive model of perceived harshness. This

can be written as an equation predicting the cellists’ scores

from the three best fitting descriptors

Ybasic ¼ 0:29MFCCratio� 0:36ATSþ 0:49HSV: (14)

The stepwise regression approach provides a comple-

mentary model of lesser significance (R2¼ 0.684 and

adjusted R2¼ 0.667), retaining only two prominent descrip-

tors: the HSV (p< 0.001) and the ATS (p< 0.001),

Ystepwise ¼ 0:59HSV� 0:38ATS (15)

F. Discussion

The global equation [Eq. (14)] suggests that the sensation

of perceived harshness is multidimensional and characterized

by a complex transformation of timbre involving spectral,

temporal, and spectro-temporal aspects. Interestingly, we can

TABLE II. Results of the least-square multiple linear regression. The p-

value p of each descriptor gives the relevance to add its coefficient bp to the

model: *p< 0.05, **p< 0.01, ***p< 0.001.

Descriptors HSC TRIratio MFCC ratio ATS HSV

Coefficients bp 0.45 �0.58 0.29* �0.36** 0.49***


notice the negative coefficient of ATS, while positive coeffi-

cients are obtained for HSV and MFCCratio. This observation

reveals an anti-correlation between temporal and spectral fea-

tures, already observed in Sec. II D: An increase in perceived

harshness induces a formantic emergence and more harmonic

asynchrony, associated to a decrease in the temporal ATS.

This is consistent with the Helmholtz motion, which builds

more slowly or is more unstable. Furthermore, the stepwise

regression eliminates the MFCCratio as a harshness predictor,

which suggests a certain collinearity between the MFCCratio

and the HSV descriptors. This dependence is confirmed by

the correlation coefficient between the two descriptors (r2

¼ 0.65***), which can be explained from signal consider-

ations, since the emergence of a formant corresponds to a sud-

den growth of upper partials within the harmonic’s life cycle

[see the evolution of the fifth component in Fig. 4(b) for an

example]. This implies more crossings between the temporal

evolutions of harmonic amplitudes, which should result in a

stronger harmonic disorder as an indirect consequence.

Gabor multipliers turned out to be a quite efficient tool

to create continua with increasingly harsher sounds. The

degradations of timbre produced by this technique led to a

model of perceived harshness. However, this approach

presents some limitations since it proposes a statistical con-

struction relying on the variations of a single feature (the

regularization parameter k) which has no significance in

terms of acoustic descriptors. This is not sufficient to evalu-

ate the individual perceptual relevance of each descriptor.

Section IV aims at digging deeper into these aspects.

IV. TOWARD A CONTROL OF PERCEIVEDHARSHNESS

In this section, we are going to assess to which extent

each descriptor brought to light by the predictive model con-

tribute to the perceived harshness. This procedure is essential

as a perceptual validation of each descriptor. For this pur-

pose, a synthesis process is used to manipulate/control the

temporal and spectral signal envelopes. This control will fur-

ther be perceptually validated by a paired comparison test,

resulting in the elaboration of a perceptual harshness space.

A. Choice of control descriptors

The basic predictive model [Eq. (14)] of perceived

harshness suggests that both spectral and temporal features

are involved in this phenomenon. Among the three main

descriptors (HSV, ATS, and MFCCratio) characterizing the

harshness, the HSV was found to be predominant. However,

this descriptor is not suitable for resynthesizing and control-

ling the harshness, since it encompasses too many details in

the signal. Instead, we might take advantage of the spectro-

temporal duality of the other descriptors unveiled in the pre-

vious experiment to attempt to control the harshness through

only a few parameters: the coefficients composing the

MFCCratio, which can be considered as a shaping parameter

of the spectral sound envelope (and which is correlated with

the HSV), and the ATS, defining the attack slope, likely to

be considered as a shaping parameter of the temporal attack

envelope.

The validity of this assumption can be tested by carrying

out a partial bivariate least-square linear regression on the

purely spectral MFCCratio and purely temporal ATS predic-

tor. This leads to a less relevant predictive model than Eq.

(14) (R2¼ 0.595 and adjusted R2¼ 0.573), but with satisfy-

ing p-value predictors: MFCCratio (p< 0.001) and ATS

(p< 0.001). We therefore decided to use these descriptors to

control the perceived harshness.

B. Participants

The harshness, produced by this synthesis method, was

assessed through a listening test based on dissimilarity rat-

ings of sound pairs. The participants who took part in this

listening test were the same 15 experienced cellists as in

Sec. III A. Each of the two tests lasted for around 15–20 min

for each subject. The tests were separated by a short pause.

The total duration of these two listening tests was thus

around 45 min per subject.

C. Stimuli

1. Cross-synthesis

A round/harsh representative sound pair was chosen by

informal listening tests from the ten paired samples compos-

ing the sound corpus. For each signal of this pair, we used

the technique of subband decomposition of the harmonics to

design a synthetic calibrated sound composed of 25 harmon-

ics. This number of harmonics was necessary to obtain a per-

ceptually convincing cello sound.

From this root synthetic pair (a round and a harsh sound)

called S1/S8, we created six intermediary synthetic stimuli

using a cross-synthesis technique performed in the spectral

and temporal domains. This technique requires the parame-

terization of spectral and temporal envelopes by means of a

basis function expansion: env ¼PK

k¼1 Ck/k, where Ck are

the coefficients of the expansion, and /k the set of basis

functions to be combined linearly. This parameterization

turns out to be a very flexible way to control and adjust the

shape of the envelopes since it enables us to design enve-

lopes, presenting intermediate spectral or temporal features

of round and harsh sounds.

To apply intermediate envelopes to the source sound, a

whitening process consisting in dividing the source signal

by its own envelope (env1) was first effectuated. The result-

ing (whitened) signal was then multiplied by the intermedi-

ate envelope (env2). This cross-synthesis technique thus

consisted in filtering the source sound by the transfer func-

tion env2/env1. To avoid divergence effects due to potential

divisions by zero of the source envelope env1, we turned

the division of the transfer function into a difference of

logarithmic envelopes (Z€olzer, 2008): log jenv2j=jenv1jð Þ¼ log jenv2j � log jenv1j. By applying the exponential oper-

ator to this logarithmic difference, the cross-synthesis pro-

cess of envelopes could be achieved accordingly.

a. Spectral cross-synthesis. In the spectral domain, we

parameterized a formantic envelope as an expansion of the

mel-cepstrum coefficients Ck, with the discrete cosines


functions /k as a basis. Indeed, according to Eq. (7), we can

rebuild a logarithmic envelope Espec of the spectrum signal

magnitude jX fð Þj from a vector of MFCC coefficients by

means of an inverse DCT (Poli and Prandoni, 1997),

Espec mel fð Þð Þ ¼XNc

k¼1

Ck cos 2pkmel

Bm

� �; 0 � mel � Bm

’ log jX fð Þj; (16)

where Nc is the subset of coefficients chosen for resynthesiz-

ing the spectrum envelope. The bandwidth of the Nf¼ 24

mel-filter bank already used for computations of MFCCs is

Bm¼ 2700 mel (Sec. II D). Espec represents a smoothed ver-

sion of the real spectral envelope, whose precision depends

on the number of coefficients used for the reconstruction (we

chose Nc¼Nf /2¼ 12). From the MFCC vector modeling the

formantic envelope, we then obtained intermediate spectral

envelopes between sounds S1 and S8 by tuning the coeffi-

cients C1 and C2 composing the MFCCratio.

b. Temporal cross-synthesis. In the time domain, we

parameterized a temporal envelope as an expansion of

B-spline basis functions /k weighted by coefficients C0k. It

approximated the temporal signal e(t) by an envelope Etemp

composed of the sum of a small set of scaled splines

(Ramsay, 2006)

Etemp tð Þ ¼XmþL�1

k¼1

C0kBk t; sð Þ ’ e tð Þ; (17)

where Bk(t,s) is the value at t of the B-spline basis function

defined by an order m and a knot sequence s of length L – 1.

The number of splines was defined by the order plus the

number of interior knots mþL – 1. A basis of five piecewise

cubic splines (order m¼ 3) was chosen to parameterize the

temporal envelopes with quite a good fit. From there, we

could design some intermediate temporal envelopes between

sounds S1 and S8, by specifically tuning the coefficients C03and C04 of their five-spline vectors. Indeed, this manipulation

made it possible to move the peak attack amplitude forward

or backward, which intrinsically corresponded to controlling

the ATS.

2. Eight synthetic sounds

Figure 7 gives a synopsis of this stimuli design struc-

tured into two clusters, a round one and a harsh one. Each

cluster is composed of the initial root sound (S1 or S8,

respectively), and three synthetic sounds designed by cross-

synthesis from the initial stimuli (cf. Roz�e, 2017).

In the spectral domain, we created two sounds in each

cluster. For the round cluster created from the round sound

S1, the formantic envelope of the sound S3 was designed by

decreasing C1 and increasing C2. Symmetrically, for the

harsh cluster created from the harsh sound S8, the formantic

envelope of the sound S6 was designed by increasing C1 and

decreasing C2. The final values of the MFCC coefficients C1

and C2 hereby defined the intermediate spectral envelopes of

S1 and S8. The upper part of Fig. 7 presents the shapes of

the formantic envelopes for these four sounds:

cS3 ¼ cS1 �ERH

spec

ERspec

; cS6 ¼ cS8 �EHR

spec

EHspec

;

where bSi represents the spectrum of the sound Si.

In the temporal domain, we also created two sounds in

each cluster. For the cluster round created from the round

sound S1, the sound S2 with a lower ATS was designed

by decreasing C03 and increasing C04. Symmetrically, for the

cluster harsh created from the harsh sound S8, the sound S7,

with a higher ATS, was designed by increasing C03 and

FIG. 7. Design by cross-synthesis of the eight synthetic stimuli. (Top) The

two original spectral envelopes of S1 (ERspec) and S8 (EH

spec) with their respec-

tive intermediaries (ERHspec and EHR

spec). (Middle) Cluster of the four synthetic

sounds built from the round sound S1 (mid-left) and from the harsh sound

S8 (mid-right). (Bottom) The two original temporal envelopes of S1 (ERtemp)

and S8 (EHtemp) with their respective intermediaries (ERH

temp and EHRtemp).


decreasing C04. The final values of the spline coefficients C03and C04 hereby defined the intermediate temporal envelopes

of S1 and S8. The bottom part of Fig. 7 presents the shapes

of the temporal envelopes for these four sounds:

S2 ¼ S1�ERH

temp

ERtemp

; S7 ¼ S8�EHR

temp

EHtemp

:

The two last synthetic stimuli were created by hybrid

crossings of the spectral and temporal envelopes in each

cluster. For the cluster round, the sound S4 was designed by

applying the temporal envelope of S2 to the sound S3.

Symmetrically, for the cluster harsh, the sound S5 was

designed by applying the temporal envelope of S7 to the

sound S6:

S4 ¼ S3�ERH

temp

Etemp S3ð Þ ; S5 ¼ S6�EHR

temp

Etemp S6ð Þ;

where Etemp(Si) represents the temporal envelope of the

sound Si.

Finally, we equalized the loudness of the eight synthetic

stimuli by the means of the Loudness toolbox (Genesis,

2009) to ensure timbre comparisons with the same sound

level.

D. Paired-comparison procedure

The listening test of this experiment was performed sub-

sequent to the previous one (Sec. III C) with the same equip-

ment and in the same conditions. We designed an interface

(cf. Roz�e, 2017) to evaluate the amount of perceived harsh-

ness between pairs of the eight synthetic stimuli. The combi-

nations of pairs were presented in both ways (direct AB and

reverse BA), and in a global random way for each subject to

avoid any effect of presentation order. Since identical pairs

were not presented, a total of 56 pairs (8� 7) was evaluated.

For each sound pair presented, the subject was asked to

process two tasks: first, choose the sound of the pair A/B per-

ceived as the most harsh. It was possible to check a third

item entitled “none” if the two sounds were considered equal

in terms of harshness. Then, assess the difference of per-

ceived harshness of the checked sound relatively to the other

one, according to a CQS, corresponding to a graphic scale

divided into three equal intervals, i.e., “a little bit more

harsh,” “more harsh,” and “much more harsh.” The subjects

were informed that they could rate zero on this scale, if they

judged the two sounds of a pair identical in terms of harsh-

ness, even if they were structurally different. In this case,

they had to check the “none” item. Conversely, the harshness

rating was only available when a non-zero rating was chosen

on the scale. They could listen to the sounds of each pair as

many times as they desired, but it was not possible for them

to return to previous pairs once their choice was validated.

E. Data analysis

To explore the repartition of the subjects’ ratings, we

performed a metric MDS analysis. We provided the MDS

with a dissimilarity matrix composed of the harshness pro-

portion rated by the subjects between pairs of synthetic stim-

uli. This dissimilarity matrix was designed as a full non-

symmetric 8� 8 table of 56 harshness ratings, including the

2 ways of presenting the stimuli (direct AB in the upper-

triangular and reverse BA in the lower-triangular). Since

only symmetrical matrices can be used in MDS analyses, we

averaged the dissimilarity ratings of each symmetric element

and stored this average in an upper-triangular matrix of 28

(56/2) distances. This process was repeated for each partici-

pant, and finally we obtained a global dissimilarity matrix by

averaging the individual matrices of all subjects. The quality

of the obtained MDS configuration was evaluated by taking

into account two criteria: a scree-elbow criterion of the

Kruskal stress evolution (for the choice of the right number

of perceptual dimensions in the new space), and an analysis

of the Shepard diagram (for the preservation of the distances

in the new space relative to initial dissimilarities). Finally,

each dimension of the MDS space was correlated with the

salient signal descriptors of harshness previously obtained.

F. Results and discussion

The goodness-of-fit criterion, corresponding to a mini-

mization of the Kruskal stress, indicated that three dimen-

sions were sufficient to get an optimal MDS configuration

(stress � 0.093). Figure 8 presents the tridimensional space

of the eight synthetic stimuli obtained by the MDS, with its

two main bi-dimensional projections. This system depicts a

perceptual harshness space. Table III reports the correlations

between each MDS dimension and the five acoustic descrip-

tors highlighted in Sec. II E.

On the first spatial dimension, a very strong correlation

can be observed with the HSVs descriptor, and to a lesser

extent with the temporal ATS. The high HSV score suggests

that the perceived harshness essentially increases with a loss

of synchronism between the harmonics of the signal. The

ATS score, weaker and negatively correlated to the first

dimension, reveals a concurrent loss of ATS in the temporal

signal envelope. The predominance of HSV in these results

is a bit surprising, given that it was not directly used to cre-

ate the stimuli. However, these variations in harmonic cha-

otic behavior actually emerged as an indirect consequence of

synthesis transformations on the formantic envelopes.

Interestingly, the first dimension of Fig. 8(a) splits the

three-dimensional (3D) space into two groups matching the

clustering design of the synthetic stimuli: At the left, we

obtain the round cluster (S1, S2, S3, S4) and at the right the

harsh one (S8, S7, S6, S5). Besides, the pairs of stimuli that

only differ by their ATS (S1/S2, S3/S4, S5/S6, and S7/S8)

rate quite similarly on this axis. Therefore, in the round

cluster, the intermediary pair of stimuli S3/S4 was perceived

as harsher than the S1/S2 pair because of a stronger har-

monic asynchrony coupled with a smallest temporal ATS.

Symmetrically, in the harsh cluster, the intermediary pair of

stimuli S5/S6 was perceived as rounder than the S7/S8 pair

because of less harmonic asynchrony coupled with a greatest

temporal ATS. These results validate the physical relevance

of the HSV and ATS descriptors in the predictive harshness


model [Eq. (14)]. It is also consistent with the stepwise com-

plementary approach of the previous experiment [Eq. (15)],

and confirms the central role of the simultaneous opposite

variations of HSV/ATS in harshness perception.

Referring back to Table III, the second spatial dimen-

sion correlates quite well with all the purely spectral descrip-

tors HSC, TRIratio, and MFCCratio. The strong correlation

with the MFCCratio validates the physical coherence of this

descriptor in the predictive model of harshness [Eq. (14)].

Besides, the HSC and TRIratio descriptors arise as corollar-

ies from modifications of the spectral shape, since the emer-

gence of a formant necessarily implies a transfer of spectral

energy. In Fig. 8(b), the projection of the MDS configuration

on the first two dimensions shows that the stimuli do not fol-

low the cluster design any longer. Certain sounds synthe-

sized from intermediate formantic envelopes were actually

rated outside the bounds of the two root sounds (S1 and S8).

On the second dimension, both stimuli S3/S4, formantically

harsher than S1, seem hereafter to be perceived with more

degraded quality than the root harsh sound S8. And symmet-

rically, both stimuli S5/S6, formantically less harsh than S8,

now seem to be perceived with better timbral quality than

the round root sound S1. This result turns out to comply with

the evolution of the MFCCratio computed from intermediate

values of the first two Mel cepstrum coefficients (C1 and

C2). Therefore, the second dimension should be interpreted

with respect to the spectral energy distribution, and tends to

characterize the harshness phenomenon in terms of per-

ceived brightness.

Finally, Table III does not reveal at first sight any acous-

tic correlates of the third spatial dimension. However, the

projection of the MDS configuration on the dimensions one

and three of the perceptual space [Fig. 8(c)] reveals a sur-

prising organization of the stimuli. Indeed, they are distrib-

uted on either side of the third axis, according to pairs only

differing by the ATS of their temporal envelope. With the

exception of the pair S5/S6 perceived a little bit confusedly,

the sounds composed of a harsh temporal envelope are nega-

tively rated (S2, S4, S8), whereas those composed of a round

temporal envelope are positively rated (S1, S3, S7). This

third dimension thus reflects that subjects only detected

changes in temporal ATS, without significant consequences

on their perception of harshness.

V. CONCLUSION

In this paper, we explored a deterioration of the timbre

quality for cello sounds, often characterized as sound

FIG. 8. (Color online) (a) Perceptual space resulting from the metric MDS achieved in three dimensions (Kruskal stress¼ 0.093). The spatial configuration is

projected on the two main pairs of dimensions: (b) Dim1 vs Dim2, (c) Dim1 vs Dim3.

TABLE III. Correlations between the salient signal descriptors of harshness

and the coordinates of the eight stimuli along the three dimensions of the

MDS space, p-values: *p< 0.05, **p< 0.01, ***p< 0.001. Significant values

appear in boldface.

Acoustic correlates Dim 1 Dim 2 Dim 3

HSC 0.52 0.84** 0

TRIratio 0.58 0.79* �0.01

MFCCratio 0.04 0.84** 0.1

ATS �0.79* �0.37 0.32

HSV 0.98*** 0.02 �0.02


harshness among professional cellists. By using analysis-

synthesis techniques and suitable listening tests, this percep-

tually undesirable feature was modeled into a combination

of some relevant signal descriptors. First, a signal-based

approach using Gabor masks (Sec. II) revealed that a harsh

sound may essentially be characterized by three facets: an

energy transfer or formantic shift toward higher order parti-

als of its spectrum, a weakened ATS of its temporal enve-

lope, and more fluctuations in the harmonics’ life cycle.

Then, two complementary experiments enabled us to percep-

tually assess the effect of these signal features. On the one

hand, a least-square linear regression (Sec. III) performed on

the subjective ratings of continuously harsher sounds,

allowed us to build a predictive model of the harshness. This

model shed light on the spectro-temporal duality potentially

responsible for the harsh perception. On the other hand, a

MDS was carried out (Sec. IV), in order to perceptually vali-

date this signal duality, and better assess the harshness con-

tribution of each identified descriptor. To this aim, we

attempted to create rounder and harsher synthetic stimuli

simply by controlling their spectral and temporal envelope

shapes. The MDS, performed on the perceived dissimilarities

of pairs of these synthetic stimuli, were conclusive. As a

matter of fact, the method gave rise to a perceptual harshness

space, which two main dimensions were strongly correlated

to the acoustic descriptors identified by the predictive model.

The perceived harshness was thus partly due to the combina-

tion of a more chaotic harmonic behavior and a weaker

ATS, which corresponds to a more unstable Helmholtz

motion, building up more slowly. The shift in formant and

energy transfer toward the upper partials of the spectrum

appeared as corollaries of the loss in synchronicity between

signal transients.

It should be noted that the perceptual space created by

investigating the harshness phenomenon presents some inter-

esting analogies with the recognized timbre spaces in the lit-

terature (Grey, 1977; McAdams et al., 1995; Barthet et al.,2010). As a matter of fact, we can roughly consider by anal-

ogy with the terms of McAdams et al. (1995), our first

dimension as the “spectral fluctuations,” characterizing the

variations of synchronicity in the transients of harmonics.

Our second dimension may match with the “spectral energy

distribution,” and the third one may be interpreted in terms

of “temporal patterns” (even though this dimension is not

itself directly linked to harshness perception). We expect

similar trends for notes that differ from the one investigated

in this analysis, at least with respect to the signal descriptors.

Nevertheless, further investigations are needed to verify the

robustness of the perceived harshness phenomenon on other

parts of the score.

The results of this study naturally encourage the pursuit

of additional investigations of the gestures responsible for

this sound degradation. In particular, it seems important to

assess the coherency with physics-based synthesis models

(Demoucron, 2008), regarding the relationships between the

acoustical features composing the harshness and the control

gestures of the musician. Particularly, the transfer of spectral

energy responsible for the increase in brightness is likely to

coincide with a stronger bow force or a slower bow velocity

(Guettler et al., 2003; Guettler, 2003). Similarly, the weak

ATS is believed to match with an improper Helmholtz

motion, caused by a lack of firmness in the onset of the note

produced by the string–bow interaction. In addition to the

changes directly related to the physical parameters of the

bow, it could be very interesting to investigate how the pos-

tural constraint impacted the gestural coordination and con-

sequently the timbre of the sound. For example, we may ask

whether the cellists compensated their postural constraint by

consistently adjusting the joint movements of their right

arm, and if these specific adjustments resulted in the harsh

sounds that we collected for this study. If this is the case, it

might indicate that postural movements of the chest and the

head play an important role in motor coordination patterns

of the right arm’s joints, and indirectly in the maintenance of

a good (round) timbre quality.

A potential application of these results consists in

designing new technological tools that enable real-time mea-

sures of harshness. This technology-enhanced learning

would be of great interest for young cello students that are

not always aware of how their gestures can be optimized in

order to improve their sound quality. Inappropriate gestures

might hereby be corrected by the feedback of such devices.

ACKNOWLEDGMENTS

This work is partly supported by the French National

Research Agency and is part of the “Sonimove” project

(Grant No. ANR-14-CE24-0018). We would like to thank

the reviewers for their constructive comments related to the

manuscript.

Aramaki, M., Besson, M., Kronland-Martinet, R., and Ystad, S. (2011).

“Controlling the perceived material in an impact sound synthesizer,” IEEE

Trans. Audio, Speech, Lang. Process. 19(2), 301–314.

Aramaki, M., Gondre, C., Kronland-Martinet, R., Voinier, T., and Ystad, S.

(2009). “Thinking the sounds: An intuitive control of an impact sound syn-

thesizer,” in International Conference on Auditory Display (ICAD’09), pp.

119–124.

Barthet, M., Guillemain, P., Kronland-Martinet, R., and Ystad, S. (2010).

“From clarinet control to timbre perception,” Acta Acust. Acust. 96(4),

678–689.

Barthet, M., Kronland-Martinet, R., and Ystad, S. (2008). “Improving musi-

cal expressiveness by time-varying brightness shaping,” in ComputerMusic Modeling and Retrieval. Sense of Sounds (Springer-Verlag Berlin-

Heidelberg), pp. 313–336.

Barthet, M., Kronland-Martinet, R., Ystad, S., and Depalle, P. (2007). “The

effect of timbre in clarinet interpretation,” in International ComputerMusic Conference (ICMC) (Copenhagen, Denmark), pp. 59–62.

Bogert, B. P., Healy, M. J., and Tukey, J. W. (1963). “The quefrency alany-

sis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-

cepstrum and saphe cracking,” in Proceedings of the Symposium on TimeSeries Analysis, Vol. 15, pp. 209–243.

Caclin, A., McAdams, S., Smith, B. K., and Winsberg, S. (2005). “Acoustic

correlates of timbre space dimensions: A confirmatory study using syn-

thetic tones,” J. Acoust. Soc. Am. 118(1), 471–482.

Chadefaux, D., Le Carrou, J.-L., Wanderley, M. M., Fabre, B., and Daudet,

L. (2013). “Gestural strategies in the harp performance,” Acta Acust.

Acust. 99(6), 986–996.

Chudy, M., P�erez Carrillo, A., and Dixon, S. (2013). “On the relation

between gesture, tone production and perception in classical cello

performance,” in Proceedings of Meetings on Acoustics ICA2013, Vol. 19,

No. 1, p. 035017.

Davis, S., and Mermelstein, P. (1980). “Experiments in syllable-based rec-

ognition of continuous speech,” IEEE Trans. Acoust., Speech Signal

Process. 28, 357–366.


http://dx.doi.org/10.1109/TASL.2010.2047755

http://dx.doi.org/10.1109/TASL.2010.2047755

http://dx.doi.org/10.3813/AAA.918322

http://dx.doi.org/10.1121/1.1929229



http://dx.doi.org/10.1109/TASSP.1980.1163420

http://dx.doi.org/10.1109/TASSP.1980.1163420

Demoucron, M. (2008). “On the control of virtual violins—Physical model-

ling and control of bowed string instruments,” Ph.D. thesis, Universit�ePierre et Marie Curie-Paris VI; Royal Institute of Technology, Stockholm.

Depalle, P., Kronland-Martinet, R., and Torr�esani, B. (2006). “Time-fre-

quency multipliers for sound synthesis,” in Proc. the Wavelet XIIConference (SPIE Annual Symposium, San Diego, CA), pp. 221–224.

Desmet, F., Nijs, L., Demey, M., Lesaffre, M., Martens, J.-P., and Leman,

M. (2012). “Assessing a clarinet player’s performer gestures in relation to

locally intended musical targets,” J. New Music Res. 41(1), 31–48.

Fritz, C., Blackwell, A. F., Cross, I., Woodhouse, J., and Moore, B. C.

(2012). “Exploring violin sound quality: Investigating English timbre

descriptors and correlating resynthesized acoustical modifications with

perceptual properties,” J. Acoust. Soc. Am. 131(1), 783–794.

Genesis (2009). “Loudness toolbox,” available at http://genesis-acoustics.-

com/en/loudness_online-32.html (Last viewed January 3, 2017).

Grey, J. M. (1977). “Multidimensional perceptual scaling of musical

timbres,” J. Acoust. Soc. Am. 61(5), 1270–1277.

Grey, J. M., and Gordon, J. W. (1978). “Perceptual effects of spectral modi-

fications on musical timbres,” J. Acoust. Soc. Am. 63(5), 1493–1500.

Guettler, K. (2002). “On the creation of the Helmholtz motion in bowed

strings,” Acta Acust. Acust. 88(6), 970–985.

Guettler, K. (2003). “A closer look at the string player’s bowing gestures,”

CASJ (Series II) 4(7), 12–16.

Guettler, K., Schoonderwaldt, E., and Askenfelt, A. (2003). “Bow speed or

bowing position—Which one influence spectrum the most?,” in

Proceedings of the Stockholm Music Acoustic Conference (SMAC).ITU-R Recommendation, B. S. 1534-1 (2003). “Method for the subjective

assessment of intermediate quality levels of coding systems (MUSHRA),”

Technical Report, International Telecommunication Union, Geneva,

Switzerland.

Kim, H.-G., Moreau, N., and Sikora, T. (2006). MPEG-7 Audio andBeyond: Audio Content Indexing and Retrieval (John Wiley & Sons,

Hoboken, NJ).

Lartillot, O., and Toiviainen, P. (2007). “A MATLAB toolbox for musical fea-

ture extraction from audio,” in International Conference on Digital AudioEffects, pp. 237–244.

Leman, M. (2008). Embodied Music Cognition and Mediation Technology(MIT Press, Cambridge, MA).

McAdams, S. (1999). “Perspectives on the contribution of timbre to musical

structure,” Comput. Music J. 23(3), 85–102.

McAdams, S., Winsberg, S., Donnadieu, S., De Soete, G., and Krimphoff, J.

(1995). “Perceptual scaling of synthesized musical timbres: Common

dimensions, specificities, and latent subject classes,” Psychol. Res. 58(3),

177–192.

Merer, A., Ystad, S., Kronland-Martinet, R., and Aramaki, M. (2007).

“Semiotics of sounds evoking motions: Categorization and acoustic

features,” in International Symposium on Computer Music Modeling andRetrieval (Springer, Berlin Heidelberg), pp. 139–158.

Olivero, A., Torr�esani, B., and Kronland-Martinet, R. (2010). “A new

method for Gabor multipliers estimation: Application to sound morphing,”

in European Signal Processing Conference (EUSIPCO 2010), Aalborg,

Denmark, pp. 507–511.

Peeters, G. (2004). “A large set of audio features for sound description (sim-

ilarity and classification) in the CUIDADO project,” Technical Report,

IRCAM.

Poli, G. D., and Prandoni, P. (1997). “Sonological models for timbre charac-

terization,” J. New Music Res. 26(2), 170–197.

Pollard, H. F., and Jansson, E. V. (1982). “A tristimulus method for the

specification of musical timbre,” Acta Acust. Acust. 51(3), 162–171.

Ramsay, J. O. (2006). Functional Data Analysis (John Wiley & Sons, Inc.,

Hoboken, NJ).

Roz�e, J. (2017). “Exploring the perceived harshness of cello sounds by

morphing and synthesis techniques,” available at http://www.lma.cnrs-

mrs.fr/~kronland/Cello_Harshness (Last viewed January 3, 2017).

Roz�e, J., Aramaki, M., Kronland-Martinet, R., Voinier, T., Bourdin, C.,

Chadefaux, D., Dufrenne, M., and Ystad, S. (2016). “Assessing the influ-

ence of constraints on cellists’ postural displacements and musical

expressivity,” in Music, Mind and Embodiment—Post-Proceedings ofCMMR (Springer, New York), Vol. 9617 of LNCS, pp. 22–41.

Schaeffer, P. (1966). Trait�e des Objets Musicaux (Treaty of MusicalObjects) (Editions du Seuil, Paris).

Schelleng, J. C. (1973). “The bowed string and the player,” J. Acoust. Soc.

Am. 53(1), 26–41.

Schoonderwaldt, E. (2009). “The violinist’s sound palette: Spectral centroid,

pitch flattening and anomalous low frequencies,” Acta Acust. Acust.

95(5), 901–914.

Schoonderwaldt, E., Guettler, K., and Askenfelt, A. (2003). “Effect of the

width of the bow hair on the violin string spectrum,” in Proceedings of theStockholm Music Acoustics Conference (SMAC).

Sciabica, J.-F., Olivero, A., Roussarie, V., Ystad, S., and Kronland-

Martinet, R. (2012). “Dissimilarity test modelling by time-frequency rep-

resentation applied to engine sound,” in Audio Engineering SocietyConference: 45th International Conference: Applications of Time-Frequency Processing in Audio.

Søndergaard, P. L., Torr�esani, B., and Balazs, P. (2012). “The linear time

frequency analysis toolbox,” Int. J. Wavelets, Multiresol. Inform. Process.

10(04), 1250032.

Stepanek, J. (2006). “Musical sound timbre: Verbal description and

dimensions,” in Proceedings of the 9th International Conference onDigital Audio Effects (DAFx-06), pp. 121–126.

Stepanek, J., and Otcenasek, Z. (2005). “Acoustical correlates of the main

features of violin timbre perception,” in Proceedings of the Conference onInterdisciplinary Musicology, pp. 1–9.

Thompson, M. R., and Luck, G. (2012). “Exploring relationships between

pianists’ body movements, their expressive intentions, and structural ele-

ments of the music,” Musicae Sci. 16(1), 19–40.

Van Zijl, A. G., and Luck, G. (2013). “Moved through music: The effect of

experienced emotions on performers’ movement characteristics,” Psychol.

Music 41(2), 175–197.

Wanderley, M. M., Vines, B. W., Middleton, N., McKay, C., and Hatch, W.

(2005). “The musical significance of clarinetists’ ancillary gestures: An

exploration of the field,” J. New Music Res. 34(1), 97–113.

Z€olzer, U. (2008). Digital Audio Signal Processing (John Wiley & Sons,

Hoboken, NJ).


http://dx.doi.org/10.1080/09298215.2011.649769

http://dx.doi.org/10.1121/1.3651790

http://genesis-acoustics.com/en/loudness_online-32.html

http://genesis-acoustics.com/en/loudness_online-32.html

http://dx.doi.org/10.1121/1.381428

http://dx.doi.org/10.1121/1.381843

http://dx.doi.org/10.1162/014892699559797

http://dx.doi.org/10.1007/BF00419633

http://dx.doi.org/10.1080/09298219708570724

http://www.lma.cnrs-mrs.fr/~kronland/Cello_Harshness

http://www.lma.cnrs-mrs.fr/~kronland/Cello_Harshness

http://dx.doi.org/10.1121/1.1913322

http://dx.doi.org/10.1121/1.1913322


http://dx.doi.org/10.1177/1029864911423457

http://dx.doi.org/10.1177/0305735612458334

http://dx.doi.org/10.1177/0305735612458334

http://dx.doi.org/10.1080/09298210500124208